# <span style="color:#130654; font-family: Helvetica; font-size: 200%; font-weight:700"> Pandas | <span style="font-size: 50%; font-weight:300">Aggregation & Grouping</span></span>

Data aggregation and grouping allow us to summarize data for display or analysis.

<br>

To use pandas in Python, import it first:

In [1]:
# import pandas
import pandas as pd

# import other libraries here
import numpy as np

<br>

### <span style="color:#130654">Create DataFrame</span>

Creating a dataset using dictionary:

In [2]:
data = {'weight' : [42, 49, 56, 48, 46, 68 , 70, 58, 76, 45],
          'height' : [157, 177, 171, 144, 152, 168, 136, 176, 174, 177],
          'gender' : ['male', 'male', 'male', 'female', 'male', 'female', 'male', 'female', 'female', 'male']
}

In [3]:
df = pd.DataFrame(data)
df.head()

|      | weight | height | gender |
| :--- | :----- | :----- | :----- |
| 0    | 42     | 157    | male   |
| 1    | 49     | 177    | male   |
| 2    | 56     | 171    | male   |
| 3    | 48     | 144    | female |
| 4    | 46     | 152    | male   |


<br>

### <span style="color:#130654">groupby()</span>

A `groupby()` operation involves some combination of:
1. Splitting the data into groups based on some criteria.
2. Applying a function to each group independently.
3. Combining the results into a data structure.

In many situations, we split the data into subsets and apply some function to each subset. In the apply step, we can perform the following operations:
- Aggregation − computing a summary statistic for each group
- Transformation − performing some group-specific operation that returns one value per original row
- Filtration − discarding the groups that fail some condition
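The three operations can be compared side by side on a small example. This is a minimal sketch that assumes a hypothetical four-row frame named `toy`; it is not part of the dataset used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical toy data, mirroring two columns used in this notebook
toy = pd.DataFrame({
    'gender': ['male', 'female', 'male', 'female'],
    'height': [170, 160, 180, 150],
})
grouped_toy = toy.groupby('gender')['height']

# Aggregation: one summary value per group
agg_result = grouped_toy.mean()

# Transformation: one value per original row
# (here, each height minus the mean height of its group)
centered = toy['height'] - grouped_toy.transform('mean')

# Filtration: keep only the rows of groups that satisfy a condition
tall_groups = toy.groupby('gender').filter(lambda g: g['height'].mean() > 160)

print(agg_result)
print(centered)
print(tall_groups)
```

Aggregation shrinks the result to one row per group, transformation keeps the original row count, and filtration returns a subset of the original rows.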

*Syntax:*
```python
Series.groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)
```

|      Name      | Description                                                  | Type                                        | Required |
| :------------: | :----------------------------------------------------------- | :------------------------------------------ | :------- |
|     **by**     | Used to determine the groups for the groupby. <br /><br />1. If by is a function, it’s called on each value of the object’s index. <br />2. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups. <br />3. If an ndarray is passed, the values are used as-is to determine the groups.<br />4.  A label or list of labels may be passed to group by the columns in self. | mapping, function, label, or list of labels | Required unless `level` is given |
|    **axis**    | Split along rows (0) or columns (1).                         | {0 or ‘index’, 1 or ‘columns’}              | Optional |
|   **level**    | If the axis is a MultiIndex (hierarchical), group by a particular level or levels. | int, level name, or sequence of such        | Optional |
|  **as_index**  | For aggregated output, return object with group labels as the index. | bool                                        | Optional |
|    **sort**    | Sort group keys.                                             | bool                                        | Optional |
| **group_keys** | When calling apply, add group keys to index to identify pieces. | bool                                        | Optional |
|  **squeeze**   | Reduce the dimensionality of the return type if possible, otherwise return a consistent type. Deprecated in pandas 1.1 and removed in pandas 2.0. | bool                                        | Optional |
|  **observed**  | This only applies if any of the groupers are Categoricals.   | bool                                        | Optional |
|  `**kwargs`    | Only accepts keyword argument ‘mutated’ and is passed to groupby. |                                             | Optional |

*Return:*
> DataFrameGroupBy or SeriesGroupBy
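To make two of the parameters above concrete, the following sketch shows the effect of `as_index` and `sort`. The frame `people` is a hypothetical three-row example, not the notebook's dataset.

```python
import pandas as pd

# Hypothetical data for illustration only
people = pd.DataFrame({
    'gender': ['male', 'female', 'male'],
    'weight': [42, 48, 56],
})

# Default: the group labels become the index of the result
indexed = people.groupby('gender').sum()

# as_index=False keeps the group labels as an ordinary column
flat = people.groupby('gender', as_index=False).sum()

# sort=False keeps groups in order of first appearance instead of sorting the keys
unsorted = people.groupby('gender', sort=False).sum()

print(indexed)
print(flat)
print(unsorted)
```

`as_index=False` is often convenient when the result is going to be merged or exported, since the group key stays available as a regular column.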

*Example:*

In [42]:
df.groupby('gender')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017CCF376D60>

`groupby()` has returned a DataFrameGroupBy object. The data has been split into separate groups, but nothing has been computed yet.

A GroupBy object produces a result only once we apply an operation to it.

In [43]:
df.groupby('gender').sum()

| gender | weight | height |
| :----- | :----- | :----- |
| female | 250    | 662    |
| male   | 308    | 970    |


We did not tell GroupBy which column to aggregate, so it applied the function to every numeric column and returned the output.

A GroupBy object supports column indexing just like a DataFrame.

In [44]:
df.groupby('gender')['height'].sum()

gender
female    662
male      970
Name: height, dtype: int64

<br>

**The Split-Apply-Combine Strategy**

GroupBy employs the Split-Apply-Combine strategy coined by Hadley Wickham in his paper in 2011. Using this strategy, a data analyst can break down a big problem into manageable parts, perform operations on individual parts and combine them back together to answer a specific question.

Step 1: Splitting the data into separate groups:

In [45]:
df_female = df['gender'] == 'female'
df[df_female]

|      | weight | height | gender |
| :--- | :----- | :----- | :----- |
| 3    | 48     | 144    | female |
| 5    | 68     | 168    | female |
| 7    | 58     | 176    | female |
| 8    | 76     | 174    | female |


In [46]:
df_male = df['gender'] == 'male'
df[df_male]

|      | weight | height | gender |
| :--- | :----- | :----- | :----- |
| 0    | 42     | 157    | male   |
| 1    | 49     | 177    | male   |
| 2    | 56     | 171    | male   |
| 4    | 46     | 152    | male   |
| 6    | 70     | 136    | male   |
| 9    | 45     | 177    | male   |


Step 2: Applying the operation that we need to perform

In [47]:
fh_avg = df[df_female]['height'].mean()
mh_avg = df[df_male]['height'].mean()

print(fh_avg, mh_avg)

165.5 161.66666666666666


Step 3: Combining the result to output a DataFrame

In [48]:
df_h_op = pd.DataFrame({'gender': ['male', 'female'], 'avg_height': [mh_avg, fh_avg]})
df_h_op

|      | gender | avg_height |
| :--- | :----- | :--------- |
| 0    | male   | 161.666667 |
| 1    | female | 165.5      |


Note: All three steps can be achieved in a single line of code. Selecting the relevant columns with `loc` before grouping keeps the result a DataFrame. Indexing a single column on the `groupby()` object with single brackets returns a Series instead, which would then need to be converted into a DataFrame.

In [57]:
df.loc[:, ['gender', 'height']].groupby('gender').mean()

| gender | height     |
| :----- | :--------- |
| female | 165.5      |
| male   | 161.666667 |
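As a sketch of the Series versus DataFrame distinction, the snippet below compares single-bracket and double-bracket column selection on a GroupBy object, and shows `reset_index()` as one way to turn a Series result into a DataFrame. The frame `df_demo` and the column name `avg_height` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical four-row subset with the same column names as the notebook
df_demo = pd.DataFrame({
    'height': [157, 177, 144, 168],
    'gender': ['male', 'male', 'female', 'female'],
})

# Single brackets select one column and give a Series
as_series = df_demo.groupby('gender')['height'].mean()

# Double brackets (a list of columns) give a DataFrame
as_frame = df_demo.groupby('gender')[['height']].mean()

# A Series result can be converted with reset_index, naming the value column
flat_frame = df_demo.groupby('gender')['height'].mean().reset_index(name='avg_height')

print(type(as_series).__name__, type(as_frame).__name__)
print(flat_frame)
```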


<br>

Iterate over group object:

In [60]:
obj = df.groupby('gender')

We can display the row indices in each group through the `groups` attribute of the GroupBy object:

In [61]:
obj.groups

{'female': Int64Index([3, 5, 7, 8], dtype='int64'),
 'male': Int64Index([0, 1, 2, 4, 6, 9], dtype='int64')}

In [63]:
# each iteration yields the group label and that group's rows as a DataFrame
for name, group in obj:
    print(name, 'contains', group.shape[0], 'rows')

female contains 4 rows
male contains 6 rows
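Because iterating yields `(label, sub-frame)` pairs, the loop can also feed a dictionary or any other container. The sketch below, on a hypothetical frame `df_demo`, collects the row counts and checks them against the built-in `size()` method, which computes the same thing without an explicit loop.

```python
import pandas as pd

# Hypothetical data for illustration only
df_demo = pd.DataFrame({
    'weight': [42, 49, 48, 68, 58],
    'gender': ['male', 'male', 'female', 'female', 'female'],
})

# Collect the number of rows per group by iterating over the GroupBy object
sizes = {name: len(group) for name, group in df_demo.groupby('gender')}

# The same counts, computed by pandas directly
size_series = df_demo.groupby('gender').size()

print(sizes)
print(size_series)
```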


<br>

`get_group()` returns the rows of one specific group as a DataFrame.

In [65]:
obj.get_group('female')

|      | weight | height | gender |
| :--- | :----- | :----- | :----- |
| 3    | 48     | 144    | female |
| 5    | 68     | 168    | female |
| 7    | 58     | 176    | female |
| 8    | 76     | 174    | female |


<br>

### <span style="color:#130654">aggregate()</span>

The `aggregate()` function is used to apply one or more operations over the specified axis.

<span style="color:green">Note: `agg()` is an alias for `aggregate()`.</span>

*Syntax:*
```python
Series.aggregate(self, func, axis=0, *args, **kwargs)
```

|    Name    | Description                                                  | Type                        | Required |
| :--------: | :----------------------------------------------------------- | :-------------------------- | :------- |
|  **func**  | Function to use for aggregating the data. <br /><br />Accepted combinations are:<br />1. function<br />2. string function name<br />3. list of functions and/or function names, e.g. `[np.sum, 'mean']`<br />4. dict of axis labels -> functions, function names or list of such. | function, str, list or dict | Required |
|  **axis**  | Parameter needed for compatibility with DataFrame.           | {0 or ‘index’}              | Optional |
|  `*args`   | Positional arguments to pass to func.                        |                             | Optional |
| `**kwargs` | Keyword arguments to pass to func.                           |                             | Optional |

*Return:*
> scalar, Series or DataFrame

The return can be:
- scalar : when Series.agg is called with single function
- Series : when DataFrame.agg is called with a single function
- DataFrame : when DataFrame.agg is called with several functions
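The three return types listed above can be checked directly. This is a small sketch on a hypothetical frame named `frame`, using string function names.

```python
import pandas as pd

# Hypothetical numeric data for illustration only
frame = pd.DataFrame({'weight': [42, 49, 56], 'height': [157, 177, 171]})

# Series.agg with a single function returns a scalar
scalar_result = frame['weight'].agg('sum')

# DataFrame.agg with a single function returns a Series (one value per column)
series_result = frame.agg('sum')

# DataFrame.agg with several functions returns a DataFrame (one row per function)
frame_result = frame.agg(['sum', 'min'])

print(scalar_result)
print(series_result)
print(frame_result)
```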

*Example 1:* Applying to a GroupBy object

In [95]:
df.groupby('gender').agg([np.sum, np.mean])

| gender | weight (sum) | weight (mean) | height (sum) | height (mean) |
| :----- | :----------- | :------------ | :----------- | :------------ |
| female | 250          | 62.5          | 662          | 165.5         |
| male   | 308          | 51.333333     | 970          | 161.666667    |


In [91]:
df['dob'] = ["1995", "1992", "1986", "1998", "1987", "1893", "1992", "1987", "1994", "1995"]

In [99]:
df.groupby(['gender', 'dob'], as_index=False).agg({'height': np.mean, 'weight': np.sum})

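|      | gender | dob  | height | weight |
| :--- | :----- | :--- | :----- | :----- |
| 0    | female | 1893 | 168.0  | 68     |
| 1    | female | 1987 | 176.0  | 58     |
| 2    | female | 1994 | 174.0  | 76     |
| 3    | female | 1998 | 144.0  | 48     |
| 4    | male   | 1986 | 171.0  | 56     |
| 5    | male   | 1987 | 152.0  | 46     |
| 6    | male   | 1992 | 156.5  | 119    |
| 7    | male   | 1995 | 167.0  | 87     |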


Name the aggregated columns explicitly to make the output easier to read. Each keyword is the new column name, and its value is a `(column, function)` pair:

In [102]:
df.groupby(['gender', 'dob']).agg(avg_height = ('height', np.mean), total_weight = ('weight', np.sum))

| gender | dob  | avg_height | total_weight |
| :----- | :--- | :--------- | :----------- |
| female | 1893 | 168.0      | 68           |
| female | 1987 | 176.0      | 58           |
| female | 1994 | 174.0      | 76           |
| female | 1998 | 144.0      | 48           |
| male   | 1986 | 171.0      | 56           |
| male   | 1987 | 152.0      | 46           |
| male   | 1992 | 156.5      | 119          |
| male   | 1995 | 167.0      | 87           |
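The same kind of renamed output can also be written with `pd.NamedAgg` and string function names, which some readers find more explicit than bare tuples. The sketch below assumes a hypothetical frame `df_demo` and adds `reset_index()` so that the group key comes back as an ordinary column.

```python
import pandas as pd

# Hypothetical data for illustration only
df_demo = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'female'],
    'height': [157, 177, 144, 168],
    'weight': [42, 49, 48, 68],
})

summary = (
    df_demo.groupby('gender')
    .agg(
        # keyword = output column name; NamedAgg names the source column and function
        avg_height=pd.NamedAgg(column='height', aggfunc='mean'),
        total_weight=pd.NamedAgg(column='weight', aggfunc='sum'),
    )
    .reset_index()  # turn the 'gender' index back into a column
)

print(summary)
```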


<br>

*Example 2:* Applying directly to a DataFrame

When aggregating directly on a DataFrame, first move the categorical column into the index. Otherwise the operations are also applied to its string values, as the first output here shows.

In [82]:
df.agg(['sum', 'min', 'average'])

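|         | weight | height | gender                                           |
| :------ | :----- | :----- | :----------------------------------------------- |
| sum     | 558.0  | 1632.0 | malemalemalefemalemalefemalemalefemalefemalemale |
| min     | 42.0   | 136.0  | female                                           |
| average | 55.8   | 163.2  | NaN                                              |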


In [83]:
df_df = df.set_index('gender')

Aggregate these functions over the rows.

In [85]:
df_df.agg(['sum', 'min', 'average'])

|         | weight | height |
| :------ | :----- | :----- |
| sum     | 558.0  | 1632.0 |
| min     | 42.0   | 136.0  |
| average | 55.8   | 163.2  |


Different aggregations per column.

In [86]:
df_df.agg({'height' : ['sum', 'min'], 
        'weight' : ['mean', 'max']})

|      | height | weight |
| :--- | :----- | :----- |
| max  | NaN    | 76.0   |
| mean | NaN    | 55.8   |
| min  | 136.0  | NaN    |
| sum  | 1632.0 | NaN    |


Aggregate over the columns.

In [87]:
df_df.agg(['sum', 'min'],axis=1)

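|        | sum | min |
| :----- | :-- | :-- |
| male   | 199 | 42  |
| male   | 226 | 49  |
| male   | 227 | 56  |
| female | 192 | 48  |
| male   | 198 | 46  |
| female | 236 | 68  |
| male   | 206 | 70  |
| female | 234 | 58  |
| female | 250 | 76  |
| male   | 222 | 45  |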


In [69]:
obj.agg(['sum', 'mean'])

| gender | weight (sum) | weight (mean) | height (sum) | height (mean) |
| :----- | :----------- | :------------ | :----------- | :------------ |
| female | 250          | 62.5          | 662          | 165.5         |
| male   | 308          | 51.333333     | 970          | 161.666667    |


<br>

### <span style="color:#130654">transform()</span>

Transformation performs a computation on each group and returns a result aligned with the rows of the original DataFrame. This is done using the `transform()` function.

`transform()` calls a function on the object and produces a Series of transformed values with the same length as the input.

Transform comes in handy during feature extraction. As the name suggests, we extract new features from existing ones. Let’s understand the importance of the transform function with the help of an example.

*Syntax:*
```python
Series.transform(self, func, axis=0, *args, **kwargs)
```

|    Name    | Description                                                  | Type                        | Required |
| :--------: | :----------------------------------------------------------- | :-------------------------- | :------- |
|  **func**  | Function to use for transforming the data.<br /><br />Accepted combinations are:<br />1. function<br />2. string function name<br />3. list of functions and/or function names, e.g. `[np.exp, 'sqrt']`<br />4. dict of axis labels -> functions, function names or list of such. | function, str, list or dict | Required |
|  **axis**  | Parameter needed for compatibility with DataFrame.           | {0 or ‘index’}              | Optional |
|  `*args`   | Positional arguments to pass to func.                        |                             | Optional |
| `**kwargs` | Keyword arguments to pass to func.                           |                             | Optional |

*Return:*
> Series

In [138]:
df.groupby('gender')['height'].transform('mean')

0    161.666667
1    161.666667
2    161.666667
3    165.500000
4    161.666667
5    165.500000
6    161.666667
7    165.500000
8    165.500000
9    161.666667
Name: height, dtype: float64
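Because the transformed result has one value per original row, it can be assigned straight back as a new feature column. The sketch below assumes a hypothetical frame `df_demo` and a new column name `height_vs_group`; it stores each person's height relative to the mean height of their gender group.

```python
import pandas as pd

# Hypothetical data for illustration only
df_demo = pd.DataFrame({
    'height': [157, 177, 144, 168],
    'gender': ['male', 'male', 'female', 'female'],
})

# Group mean, broadcast back to every row of the group
group_mean = df_demo.groupby('gender')['height'].transform('mean')

# New feature: deviation of each height from its group mean
df_demo['height_vs_group'] = df_demo['height'] - group_mean

print(df_demo)
```

Doing the same thing with `agg()` would need an extra merge step, since an aggregated result has only one row per group.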

<br>

### <span style="color:#130654">filter()</span>

Filtration allows us to discard whole groups based on a computed condition and return only the rows of the groups that pass. We can do this using the `filter()` function in pandas.

*Syntax:*
```python
DataFrameGroupBy.filter(self, func, dropna=True, *args, **kwargs)
```

|    Name    | Description                                                  | Type     | Required |
| :--------: | :----------------------------------------------------------- | :------- | :------- |
|  **func**  | Function to apply to each subframe. Should return True or False. | function | Required |
| **dropna** | Drop groups that do not pass the filter. True by default; <br>if False, groups that evaluate False are filled with NaNs. | bool     | Optional |
|  `*args`   | Positional arguments to pass to func.                        |          | Optional |
| `**kwargs` | Keyword arguments to pass to func.                           |          | Optional |

*Return:*
> DataFrame

In [7]:
df.head()

|      | weight | height | gender |
| :--- | :----- | :----- | :----- |
| 0    | 42     | 157    | male   |
| 1    | 49     | 177    | male   |
| 2    | 56     | 171    | male   |
| 3    | 48     | 144    | female |
| 4    | 46     | 152    | male   |


<br>

In [8]:
grouped = df.groupby('gender')

In [22]:
grouped.filter(lambda x : x['weight'].mean() > 54)

|      | weight | height | gender |
| :--- | :----- | :----- | :----- |
| 3    | 48     | 144    | female |
| 5    | 68     | 168    | female |
| 7    | 58     | 176    | female |
| 8    | 76     | 174    | female |
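The filter condition does not have to be a statistic of a column; it can be anything computed from the group, such as its size. The sketch below assumes a hypothetical frame `df_demo` and also illustrates the `dropna` parameter described in the table: with `dropna=False`, the rows of failing groups are kept but filled with NaN.

```python
import pandas as pd

# Hypothetical data for illustration only
df_demo = pd.DataFrame({
    'weight': [42, 49, 48, 68, 58],
    'gender': ['male', 'male', 'female', 'female', 'female'],
})

# Keep only the rows of groups that have at least 3 members
big_groups = df_demo.groupby('gender').filter(lambda g: len(g) >= 3)

# dropna=False keeps the original number of rows; failing groups become NaN
masked = df_demo.groupby('gender').filter(lambda g: len(g) >= 3, dropna=False)

print(big_groups)
print(masked)
```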
