# <center> Dataframe Groupby
## <center> Split - Apply - Combine
    


> By “group by” we are referring to a process involving one or more of the following steps:
>
> - Splitting the data into groups/categories based on some criteria.
> - Applying a function to each group independently.
> - Combining the results into a data structure.

    
*Source: https://pandas.pydata.org/docs/user_guide/groupby.html*

![split_apply_combine](resources/split_apply_combine.png)

It's very similar to the SQL `GROUP BY`:


```sql
    SELECT col_1, sum(col_2), count(col_3), max(col_4)
    FROM table
    GROUP BY col_1
    HAVING count(col_3) > x
```
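
As a rough pandas counterpart of the query above, the sketch below runs the same steps on a small made-up table. The DataFrame `table`, its values, and the threshold `x = 1` are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd

# Hypothetical toy table standing in for `table` in the SQL query
table = pd.DataFrame({
    "col_1": ["a", "a", "b", "b", "b", "c"],
    "col_2": [1, 2, 3, 4, 5, 6],
    "col_3": [10, 20, 30, 40, 50, 60],
    "col_4": [7, 8, 9, 1, 2, 3],
})
x = 1  # assumed threshold for the HAVING clause

# GROUP BY col_1, with one aggregation per selected column
summary = table.groupby("col_1").agg({"col_2": "sum", "col_3": "count", "col_4": "max"})

# HAVING count(col_3) > x: keep only the groups whose count exceeds x
result = summary[summary["col_3"] > x]
print(result)
```

Here the `HAVING` clause becomes an ordinary boolean selection on the aggregated result.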

# Part 1: Split

## One method: pandas.DataFrame.groupby()

This method lets us group by row (multiple columns together) or by column (multiple rows together).

We will focus on grouping **by column**

The syntax is:
    
```python
    grouped = df.groupby("column_name")
```

> - df is a **pandas.DataFrame**
> - we use the method `groupby`
> - we choose one column and group by its values
> - grouped has a special type: a pandas GroupBy object (**DataFrameGroupBy**)


https://pandas.pydata.org/docs/reference/groupby.html

**Example**

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("data/users.csv", sep=';')
df = df.head(100)
df.head()

In [None]:
grouped = df.groupby('channelgrouping')

In [None]:
grouped

## Get information about a GroupBy object

### GroupBy describe()

Returns a table of statistics about each group

In [None]:
grouped.describe()

### GroupBy groups

Returns a dict:

```
{group name: [row index labels]}
```

- Useful to know the number of groups
- Lets us iterate over the groups

In [None]:
for name, group in grouped:
    print(name)    
    print(type(group))

In [None]:
for name, group in grouped.groups.items():
    print('------')
    print(name)
    print(group)
    print('\n')

In [None]:
number_of_groups = len(grouped.groups)
number_of_groups
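
To see what `groups` holds without the CSV file, here is a minimal sketch on a made-up DataFrame (the `toy` frame and its values are assumptions for illustration). It also uses `ngroups` and `size()`, two other ways to count the groups and the rows in each group.

```python
import pandas as pd

# Hypothetical data; the notebook itself reads data/users.csv
toy = pd.DataFrame({
    "channelgrouping": ["direct", "organic", "direct", "referral", "organic"],
    "pageviews": [3, 1, 5, 2, 4],
})
toy_grouped = toy.groupby("channelgrouping")

# groups maps each group name to the index labels of its rows
row_labels = {name: list(labels) for name, labels in toy_grouped.groups.items()}
print(row_labels)

# Number of groups, computed two ways
print(len(toy_grouped.groups), toy_grouped.ngroups)

# Number of rows in each group, as a Series indexed by group name
print(toy_grouped.size())
```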

### GroupBy get_group()

Returns the dataframe of a specific group

In [None]:
grouped.get_group("direct")

### GroupBy: select some columns


> grouped[["list", "of", "column", "names"]]

In [None]:
grouped_page_views = grouped[["date", "fullvisitorid", "pageviews"]]
grouped_page_views.get_group("direct").head()

# Part 2: Apply

Apply can have several meanings:

### Aggregation
#### Compute a summary statistic (or statistics) for each group.

- *Compute group sums or means.*
- *Compute group sizes / counts.*

### Transformation
#### Perform some group-specific computations and return a like-indexed object.

- *Filling NAs within groups with a value derived from each group.*
- *Standardize data (zscore) within a group.*


### Filtration
#### Discard some groups, according to a group-wise computation that evaluates True or False.

- *Discard data that belongs to groups with only a few members.*
- *Filter out data based on the group sum or mean.*
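
The three meanings above can be sketched side by side on a small made-up DataFrame. The `toy` frame, its values, and the "at least 2 members" rule are assumptions chosen only to keep the output easy to read.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "channelgrouping": ["direct", "direct", "direct", "organic", "organic", "referral"],
    "pageviews": [2, 4, 6, 1, 3, 10],
})
toy_grouped = toy.groupby("channelgrouping")

# Aggregation: one value per group
means = toy_grouped["pageviews"].mean()

# Transformation: one value per original row (here, the deviation from the group mean)
centered = toy["pageviews"] - toy_grouped["pageviews"].transform("mean")

# Filtration: keep only the rows of groups with at least 2 members
kept = toy_grouped.filter(lambda group: len(group) >= 2)

print(means, centered, kept, sep="\n\n")
```

The aggregation has one row per group, the transformation has one row per original row, and the filtration drops the single-row `referral` group.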

## 2.1 Aggregation

The result of the aggregation will have the group names as the **new index**
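
A small sketch of what "new index" means in practice, using a made-up DataFrame (`toy` is an assumption, not the notebook's data). When the group names are needed as a regular column instead, `reset_index()` and `as_index=False` are two common options.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "channelgrouping": ["direct", "organic", "direct", "organic"],
    "pageviews": [3, 1, 5, 4],
})

# Default behaviour: the group names become the index of the result
by_index = toy.groupby("channelgrouping")["pageviews"].sum()
print(by_index.index.tolist())

# Option 1: move the index back into a column afterwards
as_column = by_index.reset_index()

# Option 2: ask groupby not to use the group names as the index
as_column_too = toy.groupby("channelgrouping", as_index=False)["pageviews"].sum()

print(as_column)
```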

### Basic Aggregations

In [None]:
light_df = df[["channelgrouping", "date", "fullvisitorid", "pageviews"]]

grouped = light_df.groupby("channelgrouping")


In [None]:
light_df

The method describe() returns a dataframe with the basic descriptive statistics for each group and each column:

- count
- mean
- std
- min
- quartiles
- max

In [None]:
grouped.describe()

We can also call each method one by one.

In [None]:
grouped.sum()

In [None]:
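# Restrict to numeric columns: recent pandas versions raise an error when averaging string columns
grouped.mean(numeric_only=True)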

In [None]:
grouped.min()

In [None]:
grouped.max()

### Custom Aggregations

> Using aggregate() or agg() methods

agg() takes as argument:

- a function
- a list of functions
- a dict of functions


Syntax:

```python
    grouped.aggregate(function)

    # OR

    grouped.aggregate([function1, function2, function3])
    
    # OR
    
    grouped.aggregate({column_1: function1, column_2: function2, column_3: function3})
```


[Official doc](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html)

### Using functions from the numpy package, we can replicate the previous behaviour

- np.sum
- np.mean
- np.std
...

In [None]:
import numpy as np

In [None]:
grouped.agg(np.sum)

In [None]:
grouped.aggregate(np.sum)

### We can apply several functions to the Groupby Object

In [None]:
grouped.agg(['sum', 'min', 'max'])

⚠️ In that example, it doesn't make sense to apply all functions to all columns

### We can apply a different function per column

In [None]:
grouped.agg({'fullvisitorid': 'count', 'pageviews': 'sum'})

### We can apply several different functions per column

In [None]:
grouped.agg({'fullvisitorid': ['count'], 'pageviews': ['sum', 'min', 'max']})
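
Passing lists of functions produces two-level column names such as `('pageviews', 'sum')`. One way to get flat, readable names is pandas' named aggregation syntax, sketched below on made-up data (the `toy` frame and the output column names are illustrative assumptions).

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "channelgrouping": ["direct", "organic", "direct", "organic"],
    "fullvisitorid": ["u1", "u2", "u3", "u4"],
    "pageviews": [3, 1, 5, 4],
})
toy_grouped = toy.groupby("channelgrouping")

# Dict of lists: the result has two-level (MultiIndex) columns
nested = toy_grouped.agg({"fullvisitorid": ["count"], "pageviews": ["sum", "max"]})
print(nested.columns.tolist())

# Named aggregation: output_name=(input_column, function)
flat = toy_grouped.agg(
    visitors=("fullvisitorid", "count"),
    total_pageviews=("pageviews", "sum"),
    max_pageviews=("pageviews", "max"),
)
print(flat)
```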

### We can apply custom functions

The function must follow some rules:
    
- Take a pd.Series as argument
- Not modify the pd.Series it's applied to

**Example:**
    
We want to assign each group to a category based on its average page views

In [None]:
def categorize_avg_page_views(page_views_series):
    """
    Return the category in which the average number of page views falls.

    Args:
        page_views_series: a pd.Series containing the number of page views

    Returns:
        A string label for the interval that contains the average
    """
    # Compute the average once instead of once per branch
    avg_page_views = np.mean(page_views_series)
    if avg_page_views < 2:
        return "[0, 2]"
    elif avg_page_views < 4:
        return "[2, 4]"
    elif avg_page_views < 6:
        return "[4, 6]"
    else:
        return "[6, ++["

In [None]:
grouped.aggregate(categorize_avg_page_views)

 ⚠️ We get a warning because the function is also applied to the column `date` of type string, which cannot be compared to an integer

**Exercise 1**

Try to apply the function to the `date` column only
What error do we get?



In [None]:
## Type your answer here
# {"date": [categorize_avg_page_views]}


grouped.aggregate({"pageviews": [categorize_avg_page_views, np.sum]})

**Exercise 2**

Our function applies only to the `pageviews` column.

1. Write the command to run the aggregation `categorize_avg_page_views`
2. Add the aggregation `mean` to the same column

## Type your answer here

## 2.2 Transformation 

The transform method returns an object that has the **same size** as the one being grouped. 

The transform function must:

- Return a result that is the same size as the group chunk
- Operate column-by-column on the group chunk
- Not modify the group chunk in place

**Example**

We want to count the number of elements in each group and attach the result to every row

In [None]:
transformed = grouped.date.transform('count')

In [None]:
print(type(transformed))
print(transformed)

**Output of a transform operation**

- The result is a pandas.Series

- It has the same number of rows as the initial dataframe

- As a result, we can use it as a new column of df


In [None]:
df["count_per_channel_grouping"] = grouped.date.transform('count')

In [None]:
df.head()
# We have a new column count_per_channel_grouping
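
The two transformations listed at the start of Part 2, filling missing values and standardizing within a group, can be sketched the same way. The data below is made up for illustration, and the z-score uses the sample standard deviation (the pandas default for `std`), which is one choice among several.

```python
import pandas as pd

# Hypothetical data with one missing value in the "direct" group
toy = pd.DataFrame({
    "channelgrouping": ["direct", "direct", "direct", "organic", "organic"],
    "pageviews": [2.0, None, 6.0, 1.0, 3.0],
})

# Fill each missing value with the mean of its own group
group_mean = toy.groupby("channelgrouping")["pageviews"].transform("mean")
toy["pageviews_filled"] = toy["pageviews"].fillna(group_mean)

# Standardize within each group: (value - group mean) / group std
filled_grouped = toy.groupby("channelgrouping")["pageviews_filled"]
toy["pageviews_zscore"] = filled_grouped.transform(lambda s: (s - s.mean()) / s.std())

print(toy)
```

Both results are aligned with the original rows, so they can be stored directly as new columns.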

## 2.3 Filtration 

The filter method returns a **subset** of the original dataframe

The argument of filter must be a **function**:

- applied to the group as a whole
- that returns True or False



**Example**

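We want to filter out the groups that have fewer than 10 non-null `pageviews` values

In [None]:
def filter_less_than_10(group):
    # Series.count() ignores missing values, unlike len(group), which counts every row
    # Return True for the groups to keep
    return group["pageviews"].count() >= 10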

In [None]:
filtered_df = grouped.filter(filter_less_than_10)
filtered_df

**Compare the number of rows between the initial df and the filtered df**

In [None]:
# Type the answer here
len(df) - len(filtered_df)
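
`filter` can also be expressed as a boolean mask built with `transform`, which can help when the per-group condition is a simple built-in statistic. The sketch below uses made-up data and a threshold of 2 values, both assumptions for illustration, and compares the rows selected by the two approaches.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "channelgrouping": ["direct", "organic", "direct", "referral", "organic", "direct"],
    "pageviews": [3, 1, 5, 2, 4, 7],
})
toy_grouped = toy.groupby("channelgrouping")

# Approach 1: filter with a function that returns True or False per group
with_filter = toy_grouped.filter(lambda group: group["pageviews"].count() >= 2)

# Approach 2: compute the group count for every row, then mask the rows
counts = toy_grouped["pageviews"].transform("count")
with_mask = toy[counts >= 2]

# Compare the row labels selected by each approach
print(with_filter.index.tolist())
print(with_mask.index.tolist())
```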

**Exercise**

Filter the `channelgrouping` groups to keep only the ones that have on average more than 2 pageviews

In [None]:
# Type the answer here


----

# <center>  RECAP

    


> We can split a dataframe into groups using the `groupby` method

**It creates a GroupBy object (`DataFrameGroupBy`)**


> We can aggregate, transform or filter this GroupBy object

- **aggregate** produces a dataframe whose number of rows is the number of groups

- **transform** produces an object of the same size as the original
    
- **filter** produces a subset of the initial dataframe

![split_apply_combine](resources/split_apply_combine.png)