# Unit 5 - Groupby
---

1. [Simple groupby](#section1)
2. [Working with dates](#section2)
3. [Groupby on two or more attributes](#section3)
4. [Aggregation with a user defined function](#section4)
5. [Multiple aggregations](#section5)
6. [Tidy your output](#section6)



##### One of the most useful functions

[groupby documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)

#### Split into groups by some criteria + do something with each group separately

In [1]:
import pandas as pd
import numpy as np

In [2]:
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv'
vacc_df = pd.read_csv(url)

In [None]:
vacc_df

## 1. Simple groupby

We split the data into groups\
Nothing is computed yet, since we didn't indicate what to do with each group; we only get a `DataFrameGroupBy` object\
But: no error. The split is valid :-)

In [None]:
grouped = vacc_df.groupby('location')
grouped
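
To see what the split produced without aggregating anything, you can inspect the groupby object directly. The sketch below uses a small made-up table (`toy`, standing in for `vacc_df`) so it runs without downloading the data; the values are invented for illustration only.

```python
import pandas as pd

# Made-up toy data standing in for vacc_df (values are invented)
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B", "B"],
    "daily_vaccinations": [10, 20, 5, 15, 25],
})

toy_grouped = toy.groupby("location")

n_groups = toy_grouped.ngroups        # number of distinct locations
group_sizes = toy_grouped.size()      # number of rows in each group
group_b = toy_grouped.get_group("B")  # the rows that belong to location "B"

print(n_groups)
print(group_sizes)
print(group_b)
```

The same three calls should work on `grouped` from the cell above, with a real location name in place of `"B"`.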

Now let's perform a split and then apply an aggregation function

The `median` of `daily_vaccinations` according to `location`:

In [None]:
med_df = vacc_df.groupby('location')[['daily_vaccinations']].median()
med_df

In [6]:
#med_df[["location"]]

Note that this format means `location` is now the index

This means `med_df[["location"]]` won't work anymore (it raises a `KeyError`)

##### If you plan to continue using this data and need the index as an attribute:

##### add `reset_index()` and then assign

In [None]:
med_df = med_df.reset_index()
med_df
#med_df[["location"]]
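
As a possible alternative to calling `reset_index()` afterwards, `groupby` accepts `as_index=False`, which keeps the grouping column as a regular column. A minimal sketch on made-up toy data (the `toy` table and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up toy data standing in for vacc_df (values are invented)
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B", "B"],
    "daily_vaccinations": [10, 20, 5, 15, 25],
})

# Default behaviour: the grouping column becomes the index
indexed = toy.groupby("location")[["daily_vaccinations"]].median()

# as_index=False: the grouping column stays a regular column
flat = toy.groupby("location", as_index=False)[["daily_vaccinations"]].median()

print(indexed)
print(flat)
```

Both tables hold the same medians; they differ only in where `location` lives.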

-----
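##### So now we are ready to answer the question:
##### How do we fill missing values for `total_vaccinations` according to the mean of each country?

We now understand lines like this one. Note that this cell forward-fills the missing values within each country; filling with the country mean uses the same split-by-location idea with a different function:

In [None]:
# forward-fill within each location (older pandas versions spelled this .fillna(method='ffill'))
x = vacc_df.groupby(['location'])[['total_vaccinations']].ffill()

Advanced comment: \
`.mean()` is a built-in **aggregation** function\
`.fillna()` and `.ffill()` are built-in **transformation** functions\
groupby allows you to aggregate, transform, or filter the data.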
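
The three kinds of operation can be sketched side by side. The example below is an illustration on made-up toy data, not the notebook's own solution; it also shows one way to fill missing values with each group's mean, using `transform("mean")` to get a per-row group mean.

```python
import numpy as np
import pandas as pd

# Made-up toy data with missing values (values are invented)
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "total_vaccinations": [100.0, np.nan, 300.0, 10.0, np.nan],
})
by_loc = toy.groupby("location")["total_vaccinations"]

# Aggregation: one value per group
agg_result = by_loc.mean()

# Transformation: one value per original row (the mean of that row's group)
group_means = by_loc.transform("mean")

# Fill each missing value with the mean of its own location
filled = toy["total_vaccinations"].fillna(group_means)

# Filter: keep only the rows of groups that satisfy a condition
big_only = toy.groupby("location").filter(lambda g: len(g) >= 3)

print(agg_result)
print(filled)
print(big_only)
```

The aggregation shrinks the data to one row per location, the transformation keeps the original length, and the filter drops whole groups.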


### <span style="color:blue"> Exercise:</span>
> What is the average (mean) of the `daily_vaccinations` in each location?
>
> If we do not reset the index, how can we access the `index`?


## 2. Working with dates

How do we extract the month? Currently `date` is stored as an `object` (plain strings):

In [None]:
vacc_df[['date']].info()

First, change the `date` into a `datetime` object and extract the month

In [None]:
vacc_df['date'] = pd.to_datetime(vacc_df['date'])
vacc_df[['date']].dtypes

In [None]:
vacc_df['month'] = pd.DatetimeIndex(vacc_df['date']).month
vacc_df[['location','month','date','daily_vaccinations']].head(3)

You can use any combination of the format codes [listed here](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior), for example `%y` for a two-digit year and `%m` for a zero-padded month:

In [None]:
vacc_df['year-month'] = pd.DatetimeIndex(vacc_df['date']).strftime('%y-%m')
vacc_df[["year-month",'date']]
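
Once a column has a datetime dtype, the `.dt` accessor offers what appears to be an equivalent route to the `pd.DatetimeIndex(...)` calls above. A short sketch on made-up dates (the `toy` table is an assumption for illustration):

```python
import pandas as pd

# Made-up dates stored as strings, as in the raw csv
toy = pd.DataFrame({"date": ["2021-01-15", "2021-02-03", "2022-02-20"]})
toy["date"] = pd.to_datetime(toy["date"])

# Same information as pd.DatetimeIndex(toy["date"]).month
toy["month"] = toy["date"].dt.month

# Same formatting as pd.DatetimeIndex(toy["date"]).strftime("%y-%m")
toy["year-month"] = toy["date"].dt.strftime("%y-%m")

print(toy)
```

Which spelling to use is mostly a matter of taste; `.dt` only works on a datetime column, so the `pd.to_datetime` step is still needed first.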

### <span style="color:blue"> Exercise:</span>
> Extract the `year` and add it as a new column called `year` in `vacc_df`
>
> Extract the name of the day and add it as a new column called `weekday` in `vacc_df`
>
> Run the sanity check: `vacc_df[["date","year","weekday"]]` 

In [37]:
# sanity check
#vacc_df[["date","year","weekday"]]

## 3. Groupby on two or more attributes

Now, group by `location`, `month`, and `year` (the `year` column is the one created in the exercise above):

In [None]:
vacc_df.groupby(['location','month','year'])[['daily_vaccinations', 'total_vaccinations']].mean().reset_index()
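
To see what grouping on several columns does to the index, here is a small sketch on made-up toy data (two grouping columns instead of three, values invented). Without `reset_index()` the result carries one index level per grouping column.

```python
import pandas as pd

# Made-up toy data (values are invented)
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B"],
    "month": [1, 1, 2, 1],
    "daily_vaccinations": [10.0, 30.0, 50.0, 7.0],
})

# One result row per (location, month) combination that appears in the data
multi = toy.groupby(["location", "month"])[["daily_vaccinations"]].mean()

# A row of a multi-index is addressed with a tuple of its levels
value_a1 = multi.loc[("A", 1), "daily_vaccinations"]

# reset_index() turns both index levels back into regular columns
flat = multi.reset_index()

print(multi)
print(value_a1)
print(flat)
```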

### <span style="color:blue"> Exercise:</span>
> 
> What will happen if we switch the order of the grouping columns, for example `['month', 'location']`?

## 4. Aggregation with a user defined function

Aggregate each group with a lambda function, here the log of the group mean (or 0 when the mean is 0):

In [None]:
vacc_df.groupby(['location', 'month'])[['daily_vaccinations', 'total_vaccinations']].\
agg(lambda x: np.log(x.mean()) if x.mean()!=0 else 0).reset_index()
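
When a lambda grows beyond one short expression, a named function can be passed to `agg` in the same way. The sketch below rewrites the log-of-mean logic as a hypothetical helper called `log_mean` and runs it on made-up toy data.

```python
import numpy as np
import pandas as pd

def log_mean(values):
    """Return the log of the group's mean, or 0 when the mean is 0."""
    m = values.mean()
    return np.log(m) if m != 0 else 0

# Made-up toy data (values are invented)
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "daily_vaccinations": [50.0, 150.0, 0.0, 0.0],
})

# agg calls log_mean once per group and per selected column
result = toy.groupby("location")[["daily_vaccinations"]].agg(log_mean)
print(result)
```

Computing the mean once and storing it in `m` also avoids calling `x.mean()` twice, as the lambda version does.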

### <span style="color:blue"> Exercise:</span>
>
> Create your own lambda function that returns 1/x.sum()

## 5. Multiple aggregations

In [None]:
vacc_group = vacc_df.groupby('location').\
agg({'daily_people_vaccinated': ['first', 'last' , 'mean', 'median', 'max'],\
     'total_vaccinations':['max', lambda x: x.max()/1000000]     
    })
vacc_group = vacc_group.reset_index()
vacc_group
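
As a possible alternative to the dictionary form, pandas also supports named aggregation, where each output column is given a name and a `(column, function)` pair. The result then has flat column names from the start. A sketch on made-up toy data (the output names such as `total_max_millions` are chosen here for illustration):

```python
import pandas as pd

# Made-up toy data (values are invented)
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "daily_people_vaccinated": [10.0, 30.0, 5.0, 15.0],
    "total_vaccinations": [1_000_000.0, 3_000_000.0, 500_000.0, 2_000_000.0],
})

# output_name=(input_column, aggregation function)
summary = toy.groupby("location").agg(
    daily_first=("daily_people_vaccinated", "first"),
    daily_mean=("daily_people_vaccinated", "mean"),
    total_max=("total_vaccinations", "max"),
    total_max_millions=("total_vaccinations", lambda x: x.max() / 1_000_000),
).reset_index()

print(summary)
```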

## 6. Tidy your output



If you want to access the data and not deal with a multi-index, flatten the columns by dropping a level and renaming them:

In [None]:
vacc_group.columns

Each column label currently has a multi-index, that is, several levels (two levels in our case).
We use [droplevel](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.droplevel.html) to remove one of the levels.\
`droplevel(level, axis=0)`\
`level` - the position of the level to drop. The topmost (for columns) or leftmost (for rows) level is 0.\
`axis` - 0 removes a level from the row index, 1 removes a level from the column index.\
In our case, the column labels have two levels, so `axis = 1`.

In [None]:
vacc_group = vacc_group.droplevel(0, axis=1) 
#vacc_group.columns = vacc_group.columns.droplevel(0)  #this is from older version of pandas
vacc_group

Rename the columns

In [None]:
vacc_group.columns = ['location','daily_first','daily_last','daily_mean','daily_median','daily_max','total_max','total_max2']
vacc_group
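
Typing the new column names by hand works for a small table. A sketch of two other options on made-up toy data: dropping the top level, or joining the two levels of each column label into one string. The `"_"` separator is an arbitrary choice made here.

```python
import pandas as pd

# Made-up toy data (values are invented)
toy = pd.DataFrame({
    "location": ["A", "A", "B"],
    "daily": [1.0, 3.0, 5.0],
})

# Two aggregations on one column give two-level column labels
grouped = toy.groupby("location").agg({"daily": ["mean", "max"]})

# Option 1: drop the top level, keeping only the function names
dropped = grouped.droplevel(0, axis=1)

# Option 2: join both levels, so the source column stays visible
joined = grouped.copy()
joined.columns = ["_".join(col) for col in grouped.columns]

print(list(grouped.columns))
print(list(dropped.columns))
print(list(joined.columns))
```

Joining the levels avoids duplicate names when the same function (such as `max`) is applied to more than one column.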

`unstack` takes the innermost level of the row index and turns its values into columns

In [None]:
vacc_df['year'] = pd.DatetimeIndex(vacc_df['date']).year

In [None]:
yr_mn_grp = vacc_df.groupby(['month','year'])[['daily_vaccinations']].mean().unstack()
yr_mn_grp 
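
The effect of `unstack` is easier to see on a table small enough to read at a glance. A sketch on made-up toy data (values invented): the long table has one row per (month, year) pair, and the wide table has one row per month and one column per year.

```python
import pandas as pd

# Made-up toy data (values are invented)
toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "year": [2021, 2022, 2021, 2022],
    "daily_vaccinations": [10.0, 20.0, 30.0, 40.0],
})

# Long form: a two-level row index (month, year)
stacked = toy.groupby(["month", "year"])[["daily_vaccinations"]].mean()

# Wide form: the inner level (year) moves into the column labels
wide = stacked.unstack()

print(stacked)
print(wide)
```

The columns of `wide` now have two levels (`daily_vaccinations` on top, the year below).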

Tidy up the table so it can be further used:

In [None]:
#yr_mn_grp.columns = yr_mn_grp.columns.droplevel(0) #older version
yr_mn_grp = yr_mn_grp.droplevel(0, axis=1) 
yr_mn_grp = yr_mn_grp.reset_index()
yr_mn_grp = yr_mn_grp.rename_axis(None, axis=1)
yr_mn_grp

In [None]:
daily_grp = vacc_df.groupby(['year-month','location'])[['daily_vaccinations']].mean().unstack()
daily_grp = daily_grp.transpose()
daily_grp

### <span style="color:blue"> Exercise:</span>
>
> Remove the multi-index from `daily_grp`

---
>A summary:
>
>* `groupby()` - group according to the columns specified
>
>* `reset_index()` - moves the current index into a regular column and replaces it with a default numerical index
>
>* `pd.to_datetime(df['date'])` - changes the attribute type to datetime
>
>* `pd.DatetimeIndex(df['date']).month` - extracts the month from the datetime attribute
>
>* `apply` - applies a function along an axis of the dataframe: `axis=0` (the default) passes each column to the function, `axis=1` passes each row [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply)
>
>* `lambda` - small anonymous function
>
>* `agg` - apply multiple functions at once, one for each specified column [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html)
>
>* `unstack` - unstack the inner-most index onto a column
>
>* `droplevel(0, axis = 1)` - drops the highest (first) level in the column index of a multi-index dataframe
>
>* `transpose` - switch between columns and rows
---
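
The `axis` argument of `apply` is easy to mix up, so here is a quick check on a made-up two-column table (values invented):

```python
import pandas as pd

# Made-up toy data (values are invented)
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function receives each column in turn
per_column = toy.apply(lambda s: s.sum(), axis=0)

# axis=1: the function receives each row in turn
per_row = toy.apply(lambda s: s.sum(), axis=1)

print(per_column)
print(per_row)
```

`per_column` has one entry per column (`a` and `b`), while `per_row` has one entry per row.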

#### This was a lot of information.

#### Keep your balance. Practice. You will make it.

<div>
<img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/balance.jpg" width="500"/>
</div>

Photo by <a href="https://unsplash.com/@martinsanchez?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Martin Sanchez</a> on <a href="https://unsplash.com/s/photos/perfect-balance?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  