## Declarative vs Imperative Programming

Declarative code declares "what" it wants (think: demands made by a child).

Imperative code describes "how" to do something (think: step-by-step instructions to bake a cake).

All systems require, at some level, imperative code. The computer needs to be told exactly how to do things.

There are many advantages, however, to writing the core instructions of a piece of software in a declarative fashion. In the context of data analysis, declarative code lets us abstract away, and therefore swap out, the way the results are actually computed.

For example, this allows the computational engine to be different depending on the size of data and size of the computer (or number of computers!). Let's take a look at the `map` operation. In basic Python, we could write the map operation in a very imperative fashion:

```python
out = []
for x in data:
    out.append(fn(x))
```

When we write the `for` loop, Python (and any reader!) doesn't know what we are planning to do inside the for block. It's a generic for loop, so the program must do what it says: loop through the data one at a time, starting with the first piece and moving to the last, sequentially.

Now let's consider the `map` in Pandas: 

```python
df.map(fn)
```

Here we have declared that we want each element replaced by the result of applying a given function, `fn`, to it. We don't care how it's done! The transformation could be implemented sequentially, one element at a time, but you can imagine that the computational engine could also choose to implement it in parallel, performing multiple operations at once. Similarly, it could be implemented in a distributed fashion, on multiple machines at once!

We'll learn more about parallelism and distributed computing later. The takeaway here is that by using higher-level abstractions, brought to us by pandas' declarative interface, we don't have to worry about the computational process, which can be optimized for us.
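
As a small sketch of the same idea, the snippet below applies a function to a small Series both ways. The sample values and the names `data`, `fn`, `looped` and `mapped` are assumptions made up for this illustration; `fn` simply squares its input.

```python
import pandas as pd

# Hypothetical sample data and function, for illustration only.
data = pd.Series([1, 2, 3, 4])

def fn(x):
    return x * x

# Imperative: we spell out the loop ourselves, one element at a time.
looped = []
for x in data:
    looped.append(fn(x))

# Declarative: we state the transformation and leave the "how" to pandas.
mapped = data.map(fn)

print(looped)
print(mapped.tolist())
```

Both versions produce the same values. The difference is that only the second one leaves the execution strategy up to the library.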

## The split-apply-combine pattern with DataFrames

Often we want to do a special operation with our data: 

1. Group rows together by some categorical column value (`.groupby`)
2. Apply a function to each group that either **maps** rows or **reduces** them into a single row or value (`.apply`)
3. Combine the results from each group into either a Series or a DataFrame.

This is referred to as the **split-apply-combine** pattern and is very powerful! This is another "declarative" abstraction that Pandas gives us, allowing us to declare what we want ("data grouped by x, then transformed by function y"), without having to write out the process of how to do the grouping and transforming, step by step.
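
To make the contrast concrete, here is a minimal sketch that computes per-group totals twice: once by hand and once with `.groupby`. The `sales` table and its column names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [10, 20, 30, 40],
})

# Imperative version: split, apply and combine by hand.
totals = {}
for region in sales["region"].unique():
    group = sales[sales["region"] == region]   # split
    totals[region] = group["amount"].sum()     # apply
manual = pd.Series(totals).sort_index()        # combine

# Declarative version: say what we want and let pandas do the steps.
grouped = sales.groupby("region")["amount"].sum()

print(manual)
print(grouped)
```

The two results hold the same totals per region; the declarative version just says less about how to get there.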

Let's explore it step-by-step:

In [34]:
import pandas as pd

scoreboard = [{ 'player': 'kewld00d1', 'best_score': 100, 'last_score': 100, 'country': 'gr', 'level': 'n00b'},
              { 'player': 'saphyre', 'best_score': 250, 'last_score': 120, 'country': 'gr', 'level': 'n00b'},
              { 'player': 'chckn0rris', 'best_score': 300, 'last_score': 200, 'country': 'gr', 'level': 'expert'},
              { 'player': 'pumpkin', 'best_score': 550, 'last_score': 20, 'country': 'gr', 'level': 'expert'},
              { 'player': 'tr0llhuntah', 'best_score': 200, 'last_score': 150, 'country': 'no', 'level': 'n00b'},
              { 'player': 'nynja', 'best_score': 100, 'last_score': 100, 'country': 'no', 'level': 'expert'},
              { 'player': 'angel90210', 'best_score': 400, 'last_score': 200, 'country': 'no', 'level': 'expert'},
              { 'player': '111111', 'best_score': 50, 'last_score': 50, 'country': 'no', 'level': 'n00b'}]

df = pd.DataFrame(scoreboard)

In [35]:
# The main parameter of the .groupby method is the column(s)
# by which you wish to make "groups" of rows. It returns a new
# data type: the DataFrameGroupBy class.

df.groupby('country')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9f9ba19850>

In [37]:
# A DataFrameGroupBy object is an iterable of 2-tuples.
# The first element of the tuple is the "category" (the values
# of the groupby columns) and the second is a mini-dataframe
# (the rows of the original dataframe for which the columns have the
# values indicated by the first element of the tuple!)

# Let's see it: 

for group, mini_df in df.groupby('country'):
    print('group:', group)
    print('df:\n', mini_df)
    print('\n')

group: gr
df:
        player  best_score  last_score country   level
0   kewld00d1         100         100      gr    n00b
1     saphyre         250         120      gr    n00b
2  chckn0rris         300         200      gr  expert
3     pumpkin         550          20      gr  expert


group: no
df:
         player  best_score  last_score country   level
4  tr0llhuntah         200         150      no    n00b
5        nynja         100         100      no  expert
6   angel90210         400         200      no  expert
7       111111          50          50      no    n00b




In [38]:
# We group by because we want to apply some operation to 
# each group separately. For example, if we wanted the 
# mean of the best_score in each group: 


for group, mini_df in df.groupby('country'):
    print(group + ':', mini_df.best_score.mean())

gr: 300.0
no: 187.5


In [39]:
# We use `.apply` to return the means as a Series object:

def get_mean(df):
    return df.best_score.mean()

df.groupby('country').apply(get_mean)

country
gr    300.0
no    187.5
dtype: float64

In [None]:
# Let's try to filter the elements of the dataframe
# So that we only have the best player (with the highest
# best score) in each country.
# .apply lets us do that too!

# Steps to perform such an operation: 

# 1. Create a function that operates on a dataframe
#    and returns a row with the "best player"

def best_player(df):
    return df

# 2. Test your "best_player" function on the whole
#    dataframe:

best_player(df)

# 3. Now group by the country and "apply" the function.
#    Because your function works (on the whole df), you
#    know it will work on the mini dataframes inside the
#    GroupBy object

# df.groupby('country').apply(best_player)

# 4. Take a look at the index. `.apply` returns a new index
#    that consists of the values used in the groupby clause.
#    This can be very helpful when it returns a series, but
#    if it returns a dataframe where we already have that
#    information in a column ('country'), it doesn't help us.
#    We can use the `.reset_index()` method to return to a 
#    numeric index. We pass the `drop=True` keyword in order
#    to throw away the information in the current index.

# df.groupby('country').apply(best_player).reset_index(drop=True)

In [None]:
# Challenge:

# Get the second best player by country!
# HINT: use sort_values and iloc!

## Built-in Aggregations in groupby

Many aggregation functions that exist on Series and DataFrames (mean, max, min, etc.) can be called directly via the groupby object:

* `.groupby(col).max()`
* `.groupby(col).mean()`

In [None]:
# Applying aggregations directly 
# to DataFrameGroupBy object:

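# numeric_only=True skips the text columns (player, level),
# which cannot be averaged.
df.groupby('country').mean(numeric_only=True)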

In [None]:
# Challenge:
# Get the mean best_score, by country AND level.

def level_mean(df):
    # Hint: you will need to group by "level"
    # in this function, then get the mean best_score
    pass

df.groupby('country').apply(level_mean)

## Multiple Groupby!

The nested groupby that we just performed is quite a common use case, so there's an even easier way to do it in Pandas.

We can group by more than one column!

`.groupby([col_a, col_b])`

In [None]:
# Multiple groupby!
# For example the task we accomplished above 
# could also be written simply as:

df.groupby(['country', 'level']).best_score.mean()

In [None]:
# Take a look at the structure of the multiple groupby!

list(df.groupby(['country', 'level']))

## Combining DataFrames

There are two primary tools for combining separate DataFrames. Note that in each case, we must think **very** carefully about what we are doing and what missing values might be created. We will look at this closely with examples.

* **Concatenate:** Place one dataframe below or beside the other, assuming the columns or rows line up via the index/columns.
* **Merge:** Combine dataframes using values in one or more columns to align rows.

## Concatenation with `pd.concat`

`pd.concat` uses the columns (when concatenating vertically) or the index (when concatenating horizontally) to line up the dataframes and put them together. 

What will happen if an index (column) value exists in only one of the dataframes? We can control that (with the `join` keyword parameter), but by default the result will simply have missing values in the columns (rows) that originated in the dataframe where the index (column) value didn't exist.
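
The effect of the `join` keyword can be sketched with two small frames that share only one column. The frames `left` and `right` and their contents are hypothetical, chosen only to show the difference between the default `join="outer"` and `join="inner"`.

```python
import pandas as pd

# Hypothetical frames that share column "B" but not "A" or "C".
left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# Default (join="outer"): keep every column, fill the gaps with NaN.
outer = pd.concat([left, right], axis=0)

# join="inner": keep only the columns present in both frames.
inner = pd.concat([left, right], axis=0, join="inner")

print(outer)
print(inner)
```

With the outer join, columns "A" and "C" contain missing values for the rows that came from the other frame; with the inner join, only "B" survives.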

In [None]:
# Concatenation is mostly used when the rows or columns are shared. 

# For example, you might have data with the same columns and want 
# to concatenate them on axis 0:
# But note: what happened to the index? 
# We might want to reset it. 

df1 = pd.DataFrame({"A": pd.Series([1,2,3]), "B": pd.Series([4,5,6])})
df2 = pd.DataFrame({"A": pd.Series([7]), "B": pd.Series([10])})
pd.concat([df1,df2], axis=0)
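
One way to handle the repeated index labels, sketched here with frames equivalent to the ones above (renamed `first` and `second` so the snippet runs on its own), is to pass `ignore_index=True` to `pd.concat`. Calling `.reset_index(drop=True)` on the result should have the same effect.

```python
import pandas as pd

first = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
second = pd.DataFrame({"A": [7], "B": [10]})

# Plain concatenation keeps each frame's own labels, so 0 appears twice.
stacked = pd.concat([first, second], axis=0)
print(stacked.index.tolist())

# ignore_index=True discards the old labels and numbers the rows afresh.
renumbered = pd.concat([first, second], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```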

In [None]:
# Similarly, you might have data with the same rows and different columns:

df1 = pd.DataFrame({"A": pd.Series([1,2,3]), "B": pd.Series([4,5,6])})
df2 = pd.DataFrame({"B": pd.Series([7,8,9]), "C": pd.Series([10,11,12])})
pd.concat([df1,df2], axis=1)

In [None]:
# But note what happens if the rows do not align, and you concatenate on axis 1:

df1 = pd.DataFrame({"A": pd.Series([1,2,3]), "B": pd.Series([4,5,6])})
df2 = pd.DataFrame({"B": pd.Series([7]), "C": pd.Series([10])})
pd.concat([df1,df2], axis=1)

## Merge

Merge is commonly used when two dataframes must be connected but do not share an index or columns in the way the dataframes we concatenated did.

With merge we will connect two DataFrames on some common piece of information, e.g. a common column. The structure of the command is:

```python
pd.merge(leftdf, rightdf, how='inner', on=None, **keywds)
```

* `on` defines the column(s) which should be used to line up the data. You can also define `left_on` and `right_on` separately if the columns are named differently.
* `how` has four commonly used options:

    1. 'inner': intersection of keys
    2. 'outer': union of keys
    3. 'left': use keys from left only
    4. 'right': use keys from right only
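
The four options listed above can be compared side by side with a short sketch. The tables `left` and `right`, their column names and their values are hypothetical: keys 2 and 3 appear in both tables, key 1 only on the left and key 4 only on the right.

```python
import pandas as pd

# Hypothetical tables, for illustration only.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Collect the keys that survive each kind of merge.
keys_by_how = {}
for how in ["inner", "outer", "left", "right"]:
    merged = pd.merge(left, right, how=how, on="key")
    keys_by_how[how] = sorted(merged["key"].tolist())

for how, keys in keys_by_how.items():
    print(how, keys)
```

Keys that exist on only one side still produce a row in the 'outer', 'left' or 'right' merges, with missing values in the columns that come from the other table.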

In [None]:
# Play:

# Try the different merge techniques and see how the dataframes
# combine! HINT: remember you can look at `?pd.merge` if you 
# are confused about how to use it.

df1 = pd.DataFrame({"A": pd.Series([1,2,3]), "B": pd.Series([4,5,6])})
df2 = pd.DataFrame({"A": pd.Series([4]), "C": pd.Series([7])})



## Working with non-rectangular data

We mentioned in the beginning that Pandas is a library for working with rectangular data.

What if your data is not rectangular? What does non-rectangular data look like? Very often our data might come in dictionaries. Imagine data about a "tweet". It might look like this:

```python
{
    "screenname": "wunderkid",
    "id_str": "928374987",
    "text": "Woah, pandas is so much fun #worldrocked #jawdrop #win",
    "hashtags": ["worldrocked", "jawdrop", "win"]
}
```

How would you fit this into a rectangular data format? Do the "hashtags" cause a problem?

In [None]:
raw_tweets = [{ "screenname": "wunderkid",
          "id_str": "928374987",
          "text": "Woah, pandas is so much fun #worldrocked #jawdrop #ml",
          "hashtags": ["worldrocked", "jawdrop", "ml"]},
          {"screenname": "pumpkin",
           "id_str": "98214039",
           "text": "I stay up all night dreaming of linear models #datascience #ml",
           "hashtags": ["datascience", "ml"]}]

tweets = pd.DataFrame(raw_tweets)

# What is the "hashtags" column made of? 
tweets

## Merge keeps the data flat

The correct way to use data such as this in pandas, data with nested lists, is to copy each tweet to multiple rows, one row for each hashtag.

We can use "merge" to do this for us automatically if we put the data into two separate dataframes, one for the hashtags and one for the rest of the tweets. This is called "normalized form" and is often how you will find data if you get it from a SQL database:
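
As a rough sketch of what normalized form looks like, and of how a merge flattens it, consider a deliberately different example: a table of orders and a table of items. All names and values here (`orders`, `items`, `order_id`, and so on) are hypothetical.

```python
import pandas as pd

# Hypothetical normalized data: one table of orders, one of items.
orders = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "customer": ["ada", "grace"],
})
items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "item": ["pen", "ink", "paper"],
})

# Merging on the shared key repeats each order once per item,
# giving a flat table with one row per (order, item) pair.
flat = pd.merge(orders, items, on="order_id", how="inner")
print(flat)
```

The order "o1" has two items, so its customer appears on two rows of the merged table.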

In [None]:
# We can use the "columns" keyword to select only some columns when
# we create the dataframe: 

tweets = pd.DataFrame(raw_tweets, columns = ["screenname", "id_str", "text"])
tweets

In [None]:
# Challenge:
# 
# Create a DataFrame object called "hashtags".
# It should have two columns: "id" and "hashtag".
# "id" should be the id of the tweet (its "id_str")
# that contained the hashtag, "hashtag" should be
# the hashtag that was tweeted.



In [None]:
# Challenge:

# Try to merge the two DataFrame objects!