
In [None]:
import pandas as pd
import numpy as np

# Downloading and Loading Datasets
Download the required CSV files and load each one into a DataFrame.

In [None]:
# eCommerce Dataset
!wget https://nkb-backend-otg-media-static.s3.ap-south-1.amazonaws.com/otg_prod/media/Tech_4.0/AI_ML/Datasets/shopping_data.csv

shopping_df = pd.read_csv('shopping_data.csv')

In [None]:
# Covid Dataset
!wget https://nkb-backend-otg-media-static.s3.ap-south-1.amazonaws.com/otg_prod/media/Tech_4.0/AI_ML/Datasets/italy-covid-daywise.csv

covid_df = pd.read_csv('italy-covid-daywise.csv')

In [None]:
# Stackoverflow Survey Dataset
!wget https://nkb-backend-otg-media-static.s3.ap-south-1.amazonaws.com/otg_prod/media/Tech_4.0/AI_ML/Datasets/survey_results_public.csv

survey_df = pd.read_csv('survey_results_public.csv')

In [None]:
# Film Dataset
!wget https://nkb-backend-otg-media-static.s3.ap-south-1.amazonaws.com/otg_prod/media/Tech_4.0/AI_ML/Datasets/film.csv

films_df = pd.read_csv('film.csv')

# Grouping

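### pd.DataFrame.groupby
* `pd.DataFrame.groupby(by=None, axis=0, sort=True, dropna=True)` (selected parameters)
  *   `by` determines the groups; it can be a column label, a list of labels, a mapping, or a function.
  *   `axis` specifies whether to split along rows (`0`, the default) or columns (`1`).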
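As a minimal sketch of how `by` determines the groups, the cell below groups a small hypothetical frame (`toy_df`, not one of the datasets loaded above) by one column and records the size of each group. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy_df = pd.DataFrame({
    "Product": ["iPhone", "Cable", "iPhone", "Cable", "Monitor"],
    "Price": [700.0, 12.0, 700.0, 11.0, 150.0],
})

# `by` names the column whose distinct values define the groups.
toy_grp = toy_df.groupby(by="Product", sort=True)

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs.
group_sizes = {}
for key, group in toy_grp:
    group_sizes[key] = len(group)

print(group_sizes)
```

Each `group` in the loop is an ordinary `DataFrame` holding only the rows for that key.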



In [None]:
shopping_df

In [None]:
# Group by a single column label so that the group keys are plain strings
product_grp = shopping_df.groupby('Product')
product_grp

In [None]:
product_grp.groups

In [None]:
product_grp.get_group('Macbook Pro Laptop')

In [None]:
type(product_grp.get_group('Macbook Pro Laptop'))

In [None]:
films_df

In [None]:
films_group = films_df.groupby(['Subject', 'Year'])

In [None]:
films_group.groups

**By default `dropna` is True, so rows whose group key is `NaN` are excluded and no group is created for `NaN`.**

In [None]:
values = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
numeric_df = pd.DataFrame(values, columns=["a", "b", "c"])
numeric_df

In [None]:
grp_b = numeric_df.groupby(by=["b"])
grp_b.groups
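For contrast, here is a small sketch of what passing `dropna=False` changes. It rebuilds the same values under a separate name (`demo_df`, chosen here only to avoid overwriting `numeric_df`) and compares group sizes under both settings.

```python
import pandas as pd

values = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
demo_df = pd.DataFrame(values, columns=["a", "b", "c"])

# Default (dropna=True): the row whose key in "b" is NaN belongs to no group.
sizes_default = demo_df.groupby("b").size()

# dropna=False: NaN becomes a group key of its own.
sizes_with_nan = demo_df.groupby("b", dropna=False).size()

print(sizes_default)
print(sizes_with_nan)
```

With the default, the group sizes add up to three of the four rows; with `dropna=False`, all four rows are accounted for.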

# Aggregations

* Aggregation operations are **always performed over an axis**, either the index (default) or the column axis.
* This behavior is **different from NumPy aggregation functions** (`mean`, `median`, `prod`, `sum`, `std`, `var`), where the default is to compute the aggregation of the flattened array, e.g., `numpy.mean(arr_2d)` as opposed to `numpy.mean(arr_2d, axis=0)`.
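The difference can be seen on a small array. This is an illustrative sketch; `arr_2d` and the column labels `x` and `y` are made-up names.

```python
import numpy as np
import pandas as pd

arr_2d = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

# NumPy default: aggregate over the flattened array, giving one number.
flat_mean = np.mean(arr_2d)

# NumPy with an explicit axis: one mean per column.
col_means = np.mean(arr_2d, axis=0)

# pandas default: aggregate along the index, giving one mean per column.
frame_means = pd.DataFrame(arr_2d, columns=["x", "y"]).mean()

print(flat_mean)
print(col_means)
print(frame_means)
```

Here the pandas result matches `np.mean(arr_2d, axis=0)`, not the flattened NumPy default.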

In [None]:
numeric_df

In [None]:
numeric_df.mean()

In [None]:
print('mean of prices:', shopping_df['Price Each'].mean())
print('median of prices:', shopping_df['Price Each'].median())
print('maximum of prices:', shopping_df['Price Each'].max())
print('number of non-null product entries:', shopping_df['Product'].count())

### `pd.Series.value_counts`
Counts how many times each distinct value occurs, sorted from most to least frequent by default.

In [None]:
shopping_df['Product'].unique()

In [None]:
print('count of each distinct product:\n', shopping_df['Product'].value_counts())

**If `normalize=True` then the object returned will contain the relative frequencies of the unique values.**

In [None]:
shopping_df['Product'].value_counts(normalize=True)

### Applying aggregations on groups

In [None]:
shopping_df

In [None]:
iphone_filter = (shopping_df['Product'] == "iPhone")
shopping_df[iphone_filter]["Price Each"].median()

In [None]:
shopping_df.groupby('Product')["Price Each"].median()

In [None]:
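# numeric_only=True skips text columns, which recent pandas versions refuse to take a median of
shopping_df.groupby('Product').median(numeric_only=True)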

Effectively, *shopping_df* is processed in three steps:
1. It is split into groups based on *Product*.
2. The `median` function is applied to each group.
3. The results from each group are combined into a `DataFrame`.
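The three steps can be spelled out by hand on a small hypothetical frame. This is only a sketch of the idea, not how pandas implements it internally. Because a single column is selected here, the combined result is a `Series` rather than a `DataFrame`.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy_df = pd.DataFrame({
    "Product": ["A", "B", "A", "B", "A"],
    "Price": [10.0, 4.0, 30.0, 6.0, 20.0],
})

# Steps 1 and 2: split by "Product", then apply median to each group.
medians = {}
for product, group in toy_df.groupby("Product"):
    medians[product] = group["Price"].median()

# Step 3: combine the per-group results into one object.
manual = pd.Series(medians, name="Price")

# The one-line equivalent.
direct = toy_df.groupby("Product")["Price"].median()

print(manual)
print(direct)
```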




In [None]:
films_df

In [None]:
films_df.groupby('Director')['Popularity'].mean()

In [None]:
films_df.groupby('Director')['Popularity'].mean().loc['Alda, Alan']

### Using aggregations with filters

In [None]:
shopping_df

In [None]:
filt = (shopping_df['Product'] == "Flatscreen TV")
shopping_df.loc[filt, 'Purchase Address'].str.contains('CA')

Here, `sum` gives the number of `True` values, because each `True` counts as 1 and each `False` as 0.

In [None]:
shopping_df.loc[filt, 'Purchase Address'].str.contains('CA').sum()

### pd.DataFrame.aggregate
* `pd.DataFrame.aggregate(func=None, axis=0, *args, **kwargs)`
  *   `func` is the function (or list of functions) to use for aggregating the data.
  *   `axis` specifies whether to apply the function to each column (`0`, the default) or to each row (`1`).
  *   `*args` are the positional arguments to pass to `func`.
  *   `**kwargs` are the keyword arguments to pass to `func`.
  *   *agg* is an alias for *aggregate*.
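A short sketch of these parameters on a hypothetical frame (`toy_df` is a made-up name). The dict form in the third call is not covered above; it is included on the assumption that per-column functions are useful to see alongside the `axis` argument.

```python
import pandas as pd

toy_df = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8]})

# axis=0 (default): the function is applied to each column.
col_sums = toy_df.agg("sum")

# axis=1: the function is applied to each row.
row_sums = toy_df.agg("sum", axis=1)

# A dict maps each column to its own aggregation function.
per_column = toy_df.agg({"A": "min", "B": "max"})

# `aggregate` is the same method under its full name.
same_result = toy_df.aggregate("sum").equals(col_sums)

print(col_sums)
print(row_sums)
print(per_column)
```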



In [None]:
df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]],
                  columns=['A', 'B', 'C'])
df

In [None]:
df.mean()

In [None]:
df.agg(['mean'])

In [None]:
df.agg(['sum', 'min', 'mean'])

In [None]:
year_grp = films_df.groupby('Year')
year_grp['Popularity'].agg(['median', 'mean'])

In [None]:
shopping_df

In [None]:
product_grp = shopping_df.groupby('Product')
product_grp['Price Each'].agg(['median', 'mean']).loc['Flatscreen TV']

In [None]:
filt = (shopping_df['Product'] == "Flatscreen TV")
shopping_df.loc[filt, 'Purchase Address'].str.contains('CA').sum()

The following code raises an `AttributeError` because `product_grp['Purchase Address']` is not a `Series` object.  
It is a `SeriesGroupBy` object, which has no `.str` accessor.

In [None]:
product_grp = shopping_df.groupby('Product')
product_grp['Purchase Address'].str.contains('CA').sum()

In [None]:
type(product_grp['Purchase Address'])

**Use the `.apply` method on `SeriesGroupBy` objects**

In [None]:
product_grp['Purchase Address'].apply(lambda x: x.str.contains('CA').sum())
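The same pattern can be tried on a small hypothetical frame (the addresses below are invented). As one possible alternative, the sketch also builds the boolean mask on the full column first and then groups that mask, which avoids the lambda.

```python
import pandas as pd

# Hypothetical toy data; the addresses are made up for illustration.
toy_df = pd.DataFrame({
    "Product": ["TV", "TV", "Phone", "Phone", "Phone"],
    "Purchase Address": [
        "1 Main St, Los Angeles, CA 90001",
        "2 Oak St, Boston, MA 02215",
        "3 Pine St, San Francisco, CA 94016",
        "4 Elm St, Austin, TX 73301",
        "5 Lake St, Los Angeles, CA 90001",
    ],
})

# .apply hands each group's Series to the function, where .str is available.
via_apply = toy_df.groupby("Product")["Purchase Address"].apply(
    lambda addresses: addresses.str.contains("CA").sum()
)

# Alternative: compute the mask once, then group the mask by product.
in_ca = toy_df["Purchase Address"].str.contains("CA")
via_mask = in_ca.groupby(toy_df["Product"]).sum()

print(via_apply)
print(via_mask)
```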

### pd.DataFrame.cumsum

* `pd.DataFrame.cumsum(axis=None, skipna=True)`

  * Returns the cumulative sum of a `Series` or `DataFrame`.

#### Series

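* By default (`skipna=True`), `NaN` values are skipped when accumulating: the running total continues past them, but their positions remain `NaN` in the result.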

In [None]:
series = pd.Series([3, np.nan, 4, -6, 0])
series

In [None]:
series.cumsum()
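To see what `skipna` controls, the sketch below compares the default with `skipna=False` on the same values, stored under a separate name (`demo_series`) for illustration.

```python
import numpy as np
import pandas as pd

demo_series = pd.Series([3, np.nan, 4, -6, 0])

# skipna=True (default): the NaN is skipped and the running total continues.
skipped = demo_series.cumsum()

# skipna=False: once a NaN is reached, every later entry is NaN as well.
propagated = demo_series.cumsum(skipna=False)

print(skipped)
print(propagated)
```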

#### Dataframe
* By default, it moves down the rows and computes the running sum within each column.
* This corresponds to `axis=0` (or `axis='index'`).

In [None]:
df = pd.DataFrame({
    "A":[1.0, -3.0, 2.0],
    "B" : [1.0, np.nan, 0.0],
    "C" : [3.0, -2.0, -1.1]
})

df

In [None]:
df.cumsum()

**To move across the columns and compute the running sum within each row, use `axis=1`.**

In [None]:
df.cumsum(axis=1)

# Try It Yourself


For the following questions, use the **Film** dataset.
1.   Find the average number of movies released each year with the `Subject` comedy.
2.   Get the median `length` of the films released in `1990`.
3.   Find the number of movies each unique set of actors (present in the `Actor` column) acted in.

For the following questions, use the **Covid** dataset.
4.  Find the cumulative sum of the `total cases` reported.
5.  Find the total number of deaths reported.  
