# Data Aggregation with Groupby

<center><img src="../images/stock/pexels-pixabay-210182.jpg"></center>

This lesson will guide you through the process of data aggregation using the `groupby()` method in Python's Pandas library. We'll use the "mpg" dataset from the Seaborn library for our examples.

## Getting Started - Import Libraries

First, we need to import the necessary libraries (Pandas, Seaborn, and Matplotlib).

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

In [1]:
## Import Libraries
!pip install seaborn
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns




## Getting Started - Load Dataset

Now we load the dataset. The Seaborn library has several built-in datasets that we can easily refer to. 

For this lesson, we will focus on the `mpg` dataset.


The `mpg` dataset provides technical specifications of cars in the context of fuel consumption in miles per gallon.

__Columns Description:__
* `mpg`: Fuel efficiency in miles per gallon.
* `cylinders`: Number of engine cylinders 
* `displacement`: Engine displacement in cubic inches
* `horsepower`: Engine horsepower 
* `weight`: Vehicle weight in pounds
* `acceleration`: Time to accelerate from 0 to 60 mph in seconds 
* `model_year`: Year the car model was released
* `origin`: Country or region of origin (USA, Europe, Japan)
* `name`: Car model name 

To load the dataset, use the following command:

```python
df = sns.load_dataset("mpg")
```

In [None]:
## Load and Preview Dataset
mpg_df = sns.load_dataset("mpg")
mpg_df.head(10)


## Getting Started - Inspect the Dataset
Now let's take a quick look at the data.

In [None]:
## View Key Information on the Dataset
mpg_df.info()
## End Example

## Getting Started - Some Light Cleaning

Based on the `.info()` output, there are null values. Let's take a closer look using `.isnull()`.


In [None]:
## View Rows Containing Null Values
null_mask = mpg_df.isnull().any(axis=1)
null_rows = mpg_df[null_mask]

null_rows

There are only a few rows containing null values. For this lesson we will simply drop those rows.

In [None]:
## Drop Rows Containing Null Values
mpg_df_dropna = mpg_df.dropna(axis=0)
mpg_df_dropna.info()
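As a minimal sketch of the null check and drop steps above, the example below counts nulls per column and then drops the affected rows. It uses a small hand-made DataFrame (`toy_df` and its values are assumptions for illustration, not rows from the `mpg` dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing horsepower value
toy_df = pd.DataFrame({
    "mpg": [18.0, 25.0, 30.0],
    "horsepower": [130.0, np.nan, 70.0],
})

# Count the null values in each column
null_counts = toy_df.isnull().sum()
print(null_counts)

# Drop every row that contains at least one null value
toy_dropped = toy_df.dropna(axis=0)
print(len(toy_df), len(toy_dropped))
```

Counting nulls per column first makes it easy to judge how much data a blanket `dropna()` would discard.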

## Getting Started - Minor Transformations

Next, we'll modify the `name` column.

Each entry for `name` contains the make and model of the car.
Let's extract the make and place it into a new column named `make`.

In [None]:
# Create a new column 'make' by extracting the first word from 'name'
mpg_df_dropna.loc[:, "make"] = mpg_df_dropna["name"].str.split(" ", n=1, expand=True)[0]
mpg_df_dropna.head(5)

Now let's remove the make from the `name` column.

In [None]:
# Remove the make from the "name" column
mpg_df_dropna.loc[:, "name"] = mpg_df_dropna["name"].str.split(" ", n=1, expand=True)[1]
mpg_df_dropna.head(5)
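To see what `str.split(" ", n=1, expand=True)` returns on its own, here is a minimal sketch using two hand-written names (assumed sample values, not read from the loaded dataset).

```python
import pandas as pd

# Hypothetical sample names in "make model" form
names = pd.Series(["chevrolet chevelle malibu", "ford torino"])

# n=1 splits only at the first space; expand=True returns a DataFrame
# whose column 0 holds the make and column 1 holds the rest of the name
parts = names.str.split(" ", n=1, expand=True)

print(parts[0].tolist())
print(parts[1].tolist())
```

Limiting the split to one occurrence keeps multi-word model names such as "chevelle malibu" intact.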

And finally, rename `name` to `model`.

In [None]:
# Rename the "name" column to "model"
mpg_df_cleaned = mpg_df_dropna.rename(columns={"name": "model"})

mpg_df_cleaned.info()

If we take a look at the unique makes, we'll see that our work is not done.

In [None]:
mpg_df_cleaned["make"].unique()

There are a few typos and abbreviations in the unique `make` values.

Let's replace the typos and abbreviations with the full make name using the `.replace()` method.

Syntax:
```python
df[column_name].replace(string_to_replace, replacement_string)
```

Syntax Using a Dictionary:
```python
replacement_dictionary = {"vokswagen": "volkswagen",
                          "vw": "volkswagen",
                          "toyouta": "toyota",
                          "maxda": "mazda",
                          "hi": "ih"}
df[column_name] = df[column_name].replace(replacement_dictionary)
```
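As a minimal sketch of the dictionary form, the example below applies `.replace()` to a small hand-made Series rather than the real `make` column. The sample values and the name `makes` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample of make values, including typos and an abbreviation
makes = pd.Series(["toyota", "toyouta", "vw", "maxda", "ford"])

replacement_dictionary = {"toyouta": "toyota",
                          "vw": "volkswagen",
                          "maxda": "mazda"}

# With a dictionary, Series.replace matches whole values, not substrings
cleaned_makes = makes.replace(replacement_dictionary)
print(cleaned_makes.tolist())
```

Values that do not appear as dictionary keys (here `"toyota"` and `"ford"`) pass through unchanged.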

## Visualizations

Let's take a look at some initial visuals.

### Bar Plot - Country of Origin

In [None]:
plt.figure(figsize=(10,5))
ax = sns.countplot(x = 'origin', data = mpg_df_cleaned, color = '#4287f5')
ax.bar_label(ax.containers[0], label_type='edge')
plt.title("Country of Origin Distribution", fontsize = 20)
plt.xlabel("Country", fontsize = 15)
plt.ylabel("Count", fontsize = 15)
plt.show()

The majority of cars in this dataset are from the USA.

### Line Plot - Model Year and Miles Per Gallon

In [None]:
## Begin Visual
plt.figure(figsize=(10,5), dpi=100)
plt.title("model year against mpg\n", fontsize = 18)
plt.xlabel("model year\n", fontsize = 12)
plt.ylabel("mpg\n", fontsize = 12)
sns.lineplot(x = 'model_year', 
             y = 'mpg', 
             data = mpg_df_cleaned)
## End Visual

Based on the line plot, fuel efficiency began to increase in the mid-1970s.

## Split-Apply-Combine: The Concept Behind Groupby

<center><img src="../images/stock/pexels-arnie-chou-304906-1877271.jpg"></center>

The `groupby()` method is based on the split-apply-combine strategy:

* __Split:__ The data is divided into groups based on one or more columns.

* __Apply:__ You apply a function (e.g., mean, sum, count) to each group independently.

* __Combine:__ The results from each group are combined into a new data structure.
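The three steps above can be sketched by hand and compared with a single `groupby()` call. This is only an illustration on a hand-made DataFrame (`toy_df` and its values are assumptions, not the real `mpg` data), and it is not how Pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical toy data, not the real mpg dataset
toy_df = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan", "europe"],
    "mpg": [18.0, 22.0, 30.0, 34.0, 26.0],
})

# Split: one sub-DataFrame per unique origin
groups = {origin: toy_df[toy_df["origin"] == origin]
          for origin in toy_df["origin"].unique()}

# Apply: compute the mean mpg within each group
means = {origin: group["mpg"].mean() for origin, group in groups.items()}

# Combine: collect the per-group results into a single Series
manual_result = pd.Series(means, name="mpg").sort_index()

# groupby() performs the same three steps in one call
groupby_result = toy_df.groupby("origin")["mpg"].mean()

print(manual_result.to_dict() == groupby_result.to_dict())
```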

## Understanding Groupby

Let's break down the `groupby()` method step by step.

* What is a Groupby Object?

    * When you apply the `groupby()` method to a DataFrame, it doesn't immediately perform calculations. 
    * Instead, it creates a DataFrameGroupBy object. 
    * This object contains information about how the data has been split into groups, but the calculations are deferred until you specify an aggregation function.

__Syntax__

The basic syntax for `groupby()` is:

```python
df.groupby(column_name)
```

* `df`: The Pandas DataFrame you want to group.

* `column_name`: The column name (or a list of column names) that you want to group the data by.

## Example: Grouping by Origin

Let's group the `mpg_df` by the `origin` column:

In [None]:
## Begin Example
origin_grouped = mpg_df.groupby("origin")

origin_grouped

The output will show you a `DataFrameGroupBy` object, indicating that the data has been grouped, but no calculations have been performed yet.
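Although no aggregation has run yet, the grouped object can still be inspected. The sketch below uses a small hand-made DataFrame (`toy_df` and `toy_grouped` are hypothetical names, and the values are made up) to show a few ways of looking at the groups.

```python
import pandas as pd

# Hypothetical toy data, not the real mpg dataset
toy_df = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "europe"],
    "mpg": [18.0, 22.0, 30.0, 26.0],
})

toy_grouped = toy_df.groupby("origin")

# Number of groups that the split produced
print(toy_grouped.ngroups)

# Number of rows in each group
group_sizes = toy_grouped.size()
print(group_sizes)

# Retrieve the rows belonging to a single group
usa_rows = toy_grouped.get_group("usa")
print(usa_rows)
```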

## Applying Aggregation Functions

<center><img src="../images/stock/pexels-padrinan-3785930.jpg"></center>

Now, let's apply some aggregation functions to the grouped data.


### Mean

Based on this dataset, which country of origin produced the most/least fuel efficient vehicles?

Let's calculate the average fuel efficiency of these cars by country of origin.

In [1]:
## Begin Calculation
origin_mpg_mean = origin_grouped["mpg"].mean()
origin_mpg_mean


Based on the dataset, Japanese vehicles had the greatest fuel efficiency on average, with European vehicles in second place, followed by vehicles from the USA.

### Sum

Based on the dataset, which country of origin produced the highest total horsepower?

Let's calculate the total horsepower value for each origin.

In [None]:
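## Begin Calculation
# Total horsepower for each origin, reusing the grouped object created earlier
origin_horsepower_sum = origin_grouped["horsepower"].sum()
origin_horsepower_sum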

## Selecting Columns

<center><img src="../images/stock/pexels-pixabay-159298.jpg"></center>

You can select specific columns before or after applying the `groupby()` method.
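A minimal sketch of both orderings, using a hand-made DataFrame (`toy_df` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, not the real mpg dataset
toy_df = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan"],
    "mpg": [18.0, 22.0, 30.0, 34.0],
    "weight": [3500, 3300, 2100, 2000],
})

# Select the column after grouping: returns a Series of mean mpg per origin
after = toy_df.groupby("origin")["mpg"].mean()

# Select columns before grouping: keep the grouping column plus the target,
# which returns a DataFrame with a single 'mpg' column
before = toy_df[["origin", "mpg"]].groupby("origin").mean()

print(after)
print(before)
```

Both forms compute the same means. Selecting before grouping must keep the grouping column in the selection, otherwise `groupby()` has nothing to group on.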

## Other Useful Aggregation Functions

Here are some other commonly used aggregation functions:

* __`count()`__: Number of non-null values in each group.

* __`min()`__: Minimum value in each group.

* __`max()`__: Maximum value in each group.

* __`any()`__: Returns True if any value in the group is True.

* __`all()`__: Returns True if all values in the group are True.

* __`median()`__: Median value of each group.

* __`std()`__: Standard deviation of each group.
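Several of these functions can be computed in one pass by handing a list of function names to `.agg()`. The sketch below assumes a small hand-made DataFrame (`toy_df` and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data, not the real mpg dataset
toy_df = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan"],
    "mpg": [18.0, 22.0, 20.0, 30.0, 34.0],
})

# One row per origin, one column per aggregation function
toy_summary = toy_df.groupby("origin")["mpg"].agg(
    ["count", "min", "max", "median", "std"]
)
print(toy_summary)
```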

## Grouping by Multiple Columns

<center><img src="../images/stock/pexels-vividcafe-681335.jpg"></center>

You can group by more than one column. This creates a hierarchical index in the resulting DataFrame.

In [None]:
## Begin Example

# Group by 'origin' and 'cylinders' and calculate the mean 'mpg'
origin_cylinders_mpg_mean = mpg_df.groupby(['origin', 'cylinders'])['mpg'].mean()
print(origin_cylinders_mpg_mean)

# Repeat the aggregation, then reset the index so the grouping columns become regular columns
origin_cylinders_mpg_mean = mpg_df.groupby(['origin', 'cylinders'])['mpg'].mean().reset_index()

# Create a plot (catplot builds its own figure, so size it with height and aspect instead of plt.figure)
sns.catplot(x='origin', y='mpg', hue='cylinders', data=origin_cylinders_mpg_mean, kind='bar', height=6, aspect=10/6)
plt.title('Average MPG by Origin and Cylinders')
plt.xlabel('Origin')
plt.ylabel('Average MPG')
plt.show()
## End Example

In the first result printed above, both `origin` and `cylinders` are index levels.

Together they form a hierarchical index, also called a multilevel index.

A multilevel index makes it possible to work with data that has an arbitrary number of dimensions within the 2D structure of a DataFrame by using multiple columns to uniquely identify each row.

The result can be viewed as a 3D dataset with three axes (`origin`, `cylinders`, and the aggregated `mpg` values).
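A short sketch of working with a multilevel index follows. It uses a hand-made DataFrame (`toy_df`, `multi`, and the values are assumptions for illustration) and shows label-based lookups plus `unstack()`, which pivots the inner level into columns.

```python
import pandas as pd

# Hypothetical toy data, not the real mpg dataset
toy_df = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan"],
    "cylinders": [4, 8, 8, 4, 4],
    "mpg": [28.0, 14.0, 16.0, 30.0, 34.0],
})

multi = toy_df.groupby(["origin", "cylinders"])["mpg"].mean()

# Look up a single value with a tuple of index labels
usa_8 = multi.loc[("usa", 8)]
print(usa_8)

# Select every row for one value of the outer level
japan_rows = multi.loc["japan"]
print(japan_rows)

# Pivot the inner level into columns to get a 2D table;
# combinations that never occur in the data become NaN
wide = multi.unstack()
print(wide)
```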

## Resetting the Index

When you group by multiple columns, the resulting DataFrame has a hierarchical index. To make the grouping columns regular columns, use `reset_index()`.

In [None]:
## Begin Example
# Group by 'origin' and 'cylinders' and calculate the mean 'mpg', then reset the index
origin_cylinders_mpg_mean_reset = mpg_df.groupby(['origin', 'cylinders'])['mpg'].mean().reset_index()
print(origin_cylinders_mpg_mean_reset)

## End Example
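As an alternative sketch, passing `as_index=False` to `groupby()` keeps the grouping columns as regular columns from the start, so no `reset_index()` call is needed. The example uses a hand-made DataFrame (`toy_df` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data, not the real mpg dataset
toy_df = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan"],
    "cylinders": [4, 8, 8, 4, 4],
    "mpg": [28.0, 14.0, 16.0, 30.0, 34.0],
})

# Approach from the lesson: aggregate, then reset the index
reset = toy_df.groupby(["origin", "cylinders"])["mpg"].mean().reset_index()

# Alternative: ask groupby() not to build an index from the grouping columns
flat = toy_df.groupby(["origin", "cylinders"], as_index=False)["mpg"].mean()

print(flat)
print(list(flat.columns))
```

Both approaches produce a flat table with `origin`, `cylinders`, and `mpg` columns; which one reads better is largely a matter of style.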