# Data Aggregation with Groupby

<center><img src="../images/stock/pexels-pixabay-210182.jpg"></center>

This lesson will guide you through the process of data aggregation using the `groupby()` method in Python's Pandas library. We'll use the "mpg" dataset from the Seaborn library for our examples.

## Getting Started - Import Libraries

First, we need to import the necessary libraries: Pandas for the data work, plus Seaborn and Matplotlib for the dataset and the plots.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

In [None]:
## Begin Example
!pip install seaborn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
## End Example

## Getting Started - Load Dataset

Now we load the dataset. The Seaborn library provides several built-in datasets that can be loaded by name with `sns.load_dataset()`.

For this lesson, we will focus on the `mpg` dataset.

```python
mpg_df = sns.load_dataset("mpg")
```

In [None]:
## Begin Example
mpg_df = sns.load_dataset("mpg")
mpg_df.info()
## End Example

## Split-Apply-Combine: The Concept Behind Groupby

<center><img src="../images/stock/pexels-arnie-chou-304906-1877271.jpg"></center>

The `groupby()` method is based on the split-apply-combine strategy:

* __Split:__ The data is divided into groups based on one or more columns.

* __Apply:__ You apply a function (e.g., mean, sum, count) to each group independently.

* __Combine:__ The results from each group are combined into a new data structure.
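
To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the result with a single `groupby()` call. The `toy_df` table and its values are invented for illustration and are not part of the mpg dataset.

```python
import pandas as pd

# Made-up toy data (not the real mpg dataset), kept tiny so the steps are easy to follow
toy_df = pd.DataFrame({
    "cylinders": [4, 4, 6, 6, 8],
    "mpg": [30.0, 34.0, 20.0, 22.0, 14.0],
})

# Split: one sub-DataFrame per distinct value of 'cylinders'
pieces = {
    key: toy_df[toy_df["cylinders"] == key]
    for key in sorted(toy_df["cylinders"].unique())
}

# Apply: compute the mean mpg of each piece independently
means = {key: piece["mpg"].mean() for key, piece in pieces.items()}

# Combine: collect the per-group results into one Series
manual_result = pd.Series(means, name="mpg")

# The same computation in a single groupby call
groupby_result = toy_df.groupby("cylinders")["mpg"].mean()

print(manual_result)
print(groupby_result)
```

Both printouts should list the same per-group means, which is the point: `groupby()` bundles the split, apply, and combine steps into one expression.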

## Understanding Groupby

Let's break down the `groupby()` method step by step.

* What is a Groupby Object?

    * When you apply the `groupby()` method to a DataFrame, it doesn't immediately perform calculations. 
    * Instead, it creates a DataFrameGroupBy object. 
    * This object contains information about how the data has been split into groups, but the calculations are deferred until you specify an aggregation function.
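
The short sketch below inspects a groupby object before any aggregation is applied. It uses an invented `toy_df` rather than the mpg data, so the group keys and row labels are illustrative only.

```python
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    "cylinders": [4, 4, 6, 8, 8],
    "mpg": [30.0, 34.0, 21.0, 14.0, 16.0],
})

grouped = toy_df.groupby("cylinders")

# No aggregation has run yet; we only hold a description of the groups
print(type(grouped).__name__)

# Number of distinct groups
print(grouped.ngroups)

# Mapping from each group key to the row labels that belong to it
print(grouped.groups)

# Pull out the rows of one group as an ordinary DataFrame
print(grouped.get_group(8))
```

Attributes such as `ngroups` and `groups` are handy for checking that the split looks right before running an aggregation.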

__Syntax__

The basic syntax for `groupby()` is:

```python
df.groupby(by=column_name(s))
```

* `df`: The Pandas DataFrame you want to group.

* `by`: The column name (or a list of column names) that you want to group the data by.
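
As a quick illustration of the two forms of `by`, the sketch below counts rows per group with `size()`, first for a single column name and then for a list of names. The `toy_df` values are made up for this example.

```python
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan", "europe"],
    "cylinders": [8, 8, 4, 4, 4],
    "mpg": [14.0, 16.0, 32.0, 30.0, 28.0],
})

# One grouping column: pass its name as a string
by_origin = toy_df.groupby(by="origin").size()

# Several grouping columns: pass a list of names
by_origin_cyl = toy_df.groupby(by=["origin", "cylinders"]).size()

print(by_origin)
print(by_origin_cyl)
```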

## Example: Grouping by Cylinders

Let's group the `mpg_df` by the `cylinders` column:

In [None]:
## Begin Example
cylinders_grouped = mpg_df.groupby("cylinders")

cylinders_grouped
## End Example

The output will show you a `DataFrameGroupBy` object, indicating that the data has been grouped, but no calculations have been performed yet.

## Applying Aggregation Functions

<center><img src="../images/stock/pexels-padrinan-3785930.jpg"></center>

Now, let's apply some aggregation functions to the grouped data.


### Mean

Calculate the average value of each numeric column for each group.

In [None]:
## Begin Example
numeric_mpg_df = mpg_df.select_dtypes(include=['number'])
cylinders_mean = numeric_mpg_df.groupby('cylinders').mean()

plt.figure(figsize=(8, 6))
sns.barplot(x=cylinders_mean.index, y=cylinders_mean['mpg'])
plt.title('Average MPG by Number of Cylinders')
plt.xlabel('Number of Cylinders')
plt.ylabel('Average MPG')
plt.show()
## End Example

### Sum

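Calculate the sum of each numeric column for each group.

In [None]:
## Begin Example
# numeric_only=True skips text columns such as 'origin' and 'name'
cylinders_sum = cylinders_grouped.sum(numeric_only=True)
cylinders_sum
## End Example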
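
The mpg data mixes numeric and text columns, so it helps to see how an aggregation can be restricted to the numeric ones. The sketch below uses an invented `toy_df` with one text column and passes `numeric_only=True` to `sum()`.

```python
import pandas as pd

# Made-up toy data with one text column, for illustration only
toy_df = pd.DataFrame({
    "cylinders": [4, 4, 8],
    "mpg": [30.0, 34.0, 14.0],
    "name": ["car a", "car b", "car c"],
})

# numeric_only=True keeps numeric columns and leaves 'name' out of the result
numeric_sum = toy_df.groupby("cylinders").sum(numeric_only=True)

print(numeric_sum)
```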

## Selecting Columns

<center><img src="../images/stock/pexels-pixabay-159298.jpg"></center>

You can select specific columns before or after applying the `groupby()` method.

### Selecting Before Grouping

This can be more efficient if you only need to aggregate a subset of columns.

In [None]:
## Begin Example

cylinders_mpg_hp_mean = mpg_df[['mpg', 'horsepower', 'cylinders']].groupby('cylinders').mean()
print(cylinders_mpg_hp_mean)

plt.figure(figsize=(8, 6))
sns.heatmap(cylinders_mpg_hp_mean, annot=True, cmap='viridis')
plt.title('Mean MPG and Horsepower by Cylinders')
plt.xlabel('Variables')
plt.ylabel('Number of Cylinders')
plt.show()
## End Example

### Selecting After Grouping

You can select columns from the resulting aggregated DataFrame.

In [None]:
## Begin Example
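# Group by 'cylinders', calculate the mean of all numeric columns, and then select the 'mpg' column
# (numeric_only=True is needed because 'origin' and 'name' are text columns)
cylinders_mean_mpg = mpg_df.groupby('cylinders').mean(numeric_only=True)['mpg']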
print(cylinders_mean_mpg)

# Visualize
plt.figure(figsize=(8, 6))
plt.plot(cylinders_mean_mpg.index, cylinders_mean_mpg.values, marker='o')
plt.title('Average MPG by Number of Cylinders')
plt.xlabel('Number of Cylinders')
plt.ylabel('Average MPG')
plt.grid(True)
plt.show()
## End Example
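
A third option, sketched below on invented data, is to select the column on the groupby object itself, between grouping and aggregating. Under the assumption that only one column is needed, this avoids computing means for columns that are then thrown away.

```python
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    "cylinders": [4, 4, 6, 6],
    "mpg": [30.0, 34.0, 20.0, 22.0],
    "horsepower": [70.0, 65.0, 105.0, 110.0],
})

# Select 'mpg' on the groupby object, then aggregate only that column
mpg_selected_first = toy_df.groupby("cylinders")["mpg"].mean()

# Aggregate every column, then select 'mpg' from the result
mpg_selected_last = toy_df.groupby("cylinders").mean()["mpg"]

print(mpg_selected_first)
print(mpg_selected_last)
```

Both approaches should give the same Series here; they differ only in how much work is done along the way.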

## Multiple Aggregations

You can apply multiple aggregation functions at once using the `agg()` method.

### Using a List

Apply different functions to the same column(s).

In [None]:
## Begin Example
# Group by 'cylinders' and calculate the mean and sum of 'mpg'
cylinders_mpg_agg = mpg_df.groupby('cylinders')['mpg'].agg(['mean', 'sum'])
cylinders_mpg_agg
## End Example

### Using a Dictionary

Apply different functions to different columns.

In [None]:
## Begin Example
# Group by 'cylinders' and calculate the mean of 'mpg' and the sum of 'horsepower'
cylinders_agg_dict = mpg_df.groupby('cylinders').agg({'mpg': 'mean', 'horsepower': 'sum'})
print(cylinders_agg_dict)

# Create a plot (DataFrame.plot opens its own figure, so the size is set there)
cylinders_agg_dict.plot(kind='bar', figsize=(10, 6))
plt.title('Aggregation of MPG and Horsepower by Cylinders')
plt.xlabel('Number of Cylinders')
plt.ylabel('Value')
plt.legend(['Mean MPG', 'Sum Horsepower'])
plt.show()
## End Example
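
Pandas also supports a keyword form of `agg()`, often called named aggregation, which lets you choose the output column names. The sketch below is not from the lesson itself; `toy_df`, `mean_mpg`, and `total_hp` are names invented for this example.

```python
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    "cylinders": [4, 4, 6, 6],
    "mpg": [30.0, 34.0, 20.0, 22.0],
    "horsepower": [70.0, 65.0, 105.0, 110.0],
})

# Each keyword is an output column name; each value is (input column, function)
named_agg = toy_df.groupby("cylinders").agg(
    mean_mpg=("mpg", "mean"),
    total_hp=("horsepower", "sum"),
)

print(named_agg)
```

Descriptive column names like these can make a later plot legend or table easier to read than the original column names alone.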

## Other Useful Aggregation Functions

Here are some other commonly used aggregation functions:

* __`count()`__: Number of non-null values in each group.

* __`min()`__: Minimum value in each group.

* __`max()`__: Maximum value in each group.

* __`any()`__: Returns True if any value in the group is True.

* __`all()`__: Returns True if all values in the group are True.

* __`median()`__: Median value of each group.

* __`std()`__: Standard deviation of each group.
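
Here is a small sketch of several of these functions on invented data. The `toy_df` table includes one missing value so that `count()` visibly differs from the number of rows, and `high_power` is a hypothetical boolean column added only to demonstrate `any()` and `all()`.

```python
import pandas as pd

# Made-up toy data with one missing horsepower value, for illustration only
toy_df = pd.DataFrame({
    "cylinders": [4, 4, 4, 8, 8],
    "horsepower": [70.0, None, 90.0, 150.0, 170.0],
})

# Hypothetical boolean flag: True where horsepower exceeds 100
toy_df["high_power"] = toy_df["horsepower"] > 100

# Several summary statistics for 'horsepower' in one call
summary = toy_df.groupby("cylinders")["horsepower"].agg(
    ["count", "min", "max", "median", "std"]
)

# any(): does at least one car in the group have high power?
any_high = toy_df.groupby("cylinders")["high_power"].any()

# all(): do all cars in the group have high power?
all_high = toy_df.groupby("cylinders")["high_power"].all()

print(summary)
print(any_high)
print(all_high)
```

The 4-cylinder group has three rows but a count of 2, because `count()` skips the missing value.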

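## Using Multiple Aggregation Functions

Any of these function names can be passed to `agg()` in a list to compute several statistics in one call.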


<center><img src="../images/stock/pexels-vividcafe-681335.jpg"></center>



In [None]:
## Begin Example
# Group by 'origin' and calculate multiple aggregations for 'mpg'
origin_mpg_agg = mpg_df.groupby('origin')['mpg'].agg(['mean', 'median', 'min', 'max', 'count'])
print(origin_mpg_agg)

# Visualize (DataFrame.plot opens its own figure, so the size is set there)
origin_mpg_agg.plot(kind='bar', figsize=(12, 6))
plt.title('MPG Aggregation by Origin')
plt.xlabel('Origin')
plt.ylabel('Value')
plt.xticks(rotation=45)
plt.show()
## End Example

## Grouping by Multiple Columns

You can group by more than one column by passing a list of column names. The result then has a hierarchical index (a `MultiIndex`) with one level per grouping column.

In [None]:
## Begin Example

# Group by 'origin' and 'cylinders' and calculate the mean 'mpg'
origin_cylinders_mpg_mean = mpg_df.groupby(['origin', 'cylinders'])['mpg'].mean()
print(origin_cylinders_mpg_mean)

# Reset the index so 'origin' and 'cylinders' become regular columns for plotting
origin_cylinders_mpg_mean = mpg_df.groupby(['origin', 'cylinders'])['mpg'].mean().reset_index()

# Create a plot (catplot is figure-level and creates its own figure; height * aspect gives the width)
sns.catplot(x='origin', y='mpg', hue='cylinders', data=origin_cylinders_mpg_mean, kind='bar', height=6, aspect=10/6)
plt.title('Average MPG by Origin and Cylinders')
plt.xlabel('Origin')
plt.ylabel('Average MPG')
plt.show()
## End Example
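
To show what the hierarchical index looks like in practice, the sketch below groups an invented `toy_df` by two columns, looks up values by key, and pivots one index level into columns with `unstack()`. The data values are made up for illustration.

```python
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan"],
    "cylinders": [8, 8, 4, 4, 4],
    "mpg": [14.0, 16.0, 26.0, 32.0, 30.0],
})

mean_mpg = toy_df.groupby(["origin", "cylinders"])["mpg"].mean()

# The index has one level per grouping column
print(mean_mpg.index.nlevels)

# Look up a single group with a tuple of keys
print(mean_mpg.loc[("usa", 8)])

# Look up all groups under one outer key
print(mean_mpg.loc["usa"])

# Move the 'cylinders' level into columns; missing combinations become NaN
wide = mean_mpg.unstack("cylinders")
print(wide)
```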

## Resetting the Index

When you group by multiple columns, the result has a hierarchical index built from the grouping keys. To turn those keys back into regular columns, use `reset_index()`.

In [None]:
## Begin Example
# Group by 'origin' and 'cylinders' and calculate the mean 'mpg', then reset the index
origin_cylinders_mpg_mean_reset = mpg_df.groupby(['origin', 'cylinders'])['mpg'].mean().reset_index()
print(origin_cylinders_mpg_mean_reset)

## End Example
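
An alternative worth knowing is the `as_index=False` argument of `groupby()`, which keeps the grouping keys as regular columns from the start. The sketch below compares it with `reset_index()` on an invented `toy_df`.

```python
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan"],
    "cylinders": [8, 8, 4, 4, 4],
    "mpg": [14.0, 16.0, 26.0, 32.0, 30.0],
})

# Option 1: aggregate, then move the index levels back into columns
flat_reset = toy_df.groupby(["origin", "cylinders"])["mpg"].mean().reset_index()

# Option 2: ask groupby not to build an index from the keys at all
flat_as_index = toy_df.groupby(["origin", "cylinders"], as_index=False)["mpg"].mean()

print(flat_reset)
print(flat_as_index)
```

Both options should produce a flat DataFrame with `origin`, `cylinders`, and `mpg` columns, so the choice is largely a matter of style.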