# Unit 3 - Missing values and Data statistics
---

1. [Find rows with missing values](#section1)
2. [Remove missing values using dropna()](#section2)
3. [Fill missing values using fillna()](#section3)
4. [Fill missing values using interpolate()](#section4)
5. [A note on slicing - copy()](#section5)
6. [GroupBy()](#section6)





In [144]:
import pandas as pd
import numpy as np

In [145]:
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv'
vacc_df = pd.read_csv(url)
vacc_df.shape

(87913, 16)

---
>## Reminder:
>##### Treating missing values is optional. Sometimes we just leave the dataframe with missing values!!
---


<a id='section1'></a>
### 1. Find rows with missing values

`null` / `na` - no value

`NaN` - **N**ot **a** **N**umber - the value is missing. This value will be ignored in calculations such as `.mean()`

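`isnull()` is available both as a dataframe method (`df.isnull()`) and as a top-level pandas function (`pd.isnull(df)`). It returns a boolean mask, so adding `.sum()` counts the missing values in each column.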

In [146]:
vacc_df.head()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,,0.0,0.0,,,,,
1,Afghanistan,AFG,2021-02-23,,,,,,1367.0,,,,,34.0,1367.0,0.003
2,Afghanistan,AFG,2021-02-24,,,,,,1367.0,,,,,34.0,1367.0,0.003
3,Afghanistan,AFG,2021-02-25,,,,,,1367.0,,,,,34.0,1367.0,0.003
4,Afghanistan,AFG,2021-02-26,,,,,,1367.0,,,,,34.0,1367.0,0.003


In [147]:
vacc_df.isnull().sum()

location                                   0
iso_code                                   0
date                                       0
total_vaccinations                     40599
people_vaccinated                      42833
people_fully_vaccinated                45558
total_boosters                         69282
daily_vaccinations_raw                 48528
daily_vaccinations                       299
total_vaccinations_per_hundred         40599
people_vaccinated_per_hundred          42833
people_fully_vaccinated_per_hundred    45558
total_boosters_per_hundred             69282
daily_vaccinations_per_million           299
daily_people_vaccinated                 1641
daily_people_vaccinated_per_hundred     1641
dtype: int64

##### call it through pandas:

In [148]:
pd.isnull(vacc_df).sum()

location                                   0
iso_code                                   0
date                                       0
total_vaccinations                     40599
people_vaccinated                      42833
people_fully_vaccinated                45558
total_boosters                         69282
daily_vaccinations_raw                 48528
daily_vaccinations                       299
total_vaccinations_per_hundred         40599
people_vaccinated_per_hundred          42833
people_fully_vaccinated_per_hundred    45558
total_boosters_per_hundred             69282
daily_vaccinations_per_million           299
daily_people_vaccinated                 1641
daily_people_vaccinated_per_hundred     1641
dtype: int64

##### View specific columns:

In [149]:
vacc_df[['daily_vaccinations', 'total_vaccinations']].notnull().sum()

daily_vaccinations    87614
total_vaccinations    47314
dtype: int64

In [150]:
vacc_df[['daily_vaccinations']].isnull().sum()

daily_vaccinations    299
dtype: int64

##### Using numpy: `isnan` is a numpy function (it works on numeric columns only)

In [151]:
np.isnan(vacc_df[['daily_vaccinations']]).sum()

daily_vaccinations    299
dtype: int64
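
The counts above say how many values are missing in each column, but not which rows they sit in. A minimal sketch on a made-up toy dataframe (the name `toy` and its values are assumptions for illustration, not part of the vaccination data) shows one way to select the rows themselves, by combining `isnull()` with `any(axis=1)`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "daily": [10.0, np.nan, 30.0, 40.0],
    "total": [10.0, 20.0, np.nan, 80.0],
})

# Boolean mask: True for rows that have at least one missing value
has_missing = toy.isnull().any(axis=1)

# Rows with at least one missing value in any column
rows_with_missing = toy[has_missing]

# Rows where one specific column is missing
rows_missing_total = toy[toy["total"].isnull()]

print(rows_with_missing)
print(rows_missing_total)
```

The same masks should work on `vacc_df`, for example `vacc_df[vacc_df['daily_vaccinations'].isnull()]`.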

<a id='section2'></a>
### 2. Remove missing values using dropna() 

##### Look at Zimbabwe for example. Zimbabwe contains missing values:

In [152]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe']
zimbabwe.head(10)

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
87525,Zimbabwe,ZWE,2021-02-18,39.0,39.0,,,,,0.0,0.0,,,,,
87526,Zimbabwe,ZWE,2021-02-19,,,,,,425.0,,,,,28.0,425.0,0.003
87527,Zimbabwe,ZWE,2021-02-20,,,,,,425.0,,,,,28.0,425.0,0.003
87528,Zimbabwe,ZWE,2021-02-21,1314.0,1314.0,,,,425.0,0.01,0.01,,,28.0,425.0,0.003
87529,Zimbabwe,ZWE,2021-02-22,,,,,,660.0,,,,,44.0,660.0,0.004
87530,Zimbabwe,ZWE,2021-02-23,4041.0,4041.0,,,,800.0,0.03,0.03,,,53.0,800.0,0.005
87531,Zimbabwe,ZWE,2021-02-24,6115.0,6115.0,,,2074.0,1013.0,0.04,0.04,,,67.0,1013.0,0.007
87532,Zimbabwe,ZWE,2021-02-25,11264.0,11264.0,,,5149.0,1604.0,0.07,0.07,,,106.0,1604.0,0.011
87533,Zimbabwe,ZWE,2021-02-26,12836.0,12836.0,,,1572.0,1767.0,0.09,0.09,,,117.0,1767.0,0.012
87534,Zimbabwe,ZWE,2021-02-27,15962.0,15962.0,,,3126.0,2153.0,0.11,0.11,,,143.0,2153.0,0.014


In [153]:
zimbabwe[['total_vaccinations']].isnull().sum()

total_vaccinations    25
dtype: int64

In [154]:
zimbabwe['total_vaccinations'].notnull().sum()

363

##### We can see the difference when counting the number of non-missing values per column:

In [155]:
zimbabwe.count()

location                               388
iso_code                               388
date                                   388
total_vaccinations                     363
people_vaccinated                      363
people_fully_vaccinated                334
total_boosters                          75
daily_vaccinations_raw                 338
daily_vaccinations                     387
total_vaccinations_per_hundred         363
people_vaccinated_per_hundred          363
people_fully_vaccinated_per_hundred    334
total_boosters_per_hundred              75
daily_vaccinations_per_million         387
daily_people_vaccinated                387
daily_people_vaccinated_per_hundred    387
dtype: int64

##### Remove all rows that contain one or more missing values: 

In [156]:
zimbabwe.dropna()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
87837,Zimbabwe,ZWE,2021-12-27,7222381.0,4105296.0,3113142.0,3943.0,7741.0,12481.0,47.86,27.20,20.63,0.03,827.0,6425.0,0.043
87838,Zimbabwe,ZWE,2021-12-28,7226334.0,4107151.0,3115190.0,3993.0,3953.0,10627.0,47.88,27.21,20.64,0.03,704.0,5407.0,0.036
87839,Zimbabwe,ZWE,2021-12-29,7238939.0,4112517.0,3121776.0,4646.0,12605.0,10190.0,47.96,27.25,20.68,0.03,675.0,4796.0,0.032
87842,Zimbabwe,ZWE,2022-01-01,7276239.0,4130228.0,3140338.0,5673.0,11448.0,9599.0,48.21,27.37,20.81,0.04,636.0,4308.0,0.029
87843,Zimbabwe,ZWE,2022-01-02,7288786.0,4133140.0,3144021.0,11625.0,12547.0,10592.0,48.30,27.39,20.83,0.08,702.0,4265.0,0.028
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87908,Zimbabwe,ZWE,2022-03-08,7949875.0,4381982.0,3414446.0,153447.0,6550.0,6931.0,52.68,29.03,22.62,1.02,459.0,2304.0,0.015
87909,Zimbabwe,ZWE,2022-03-09,7963423.0,4386789.0,3417692.0,158942.0,13548.0,7526.0,52.77,29.07,22.65,1.05,499.0,2580.0,0.017
87910,Zimbabwe,ZWE,2022-03-10,7972666.0,4390103.0,3421237.0,161326.0,9243.0,7365.0,52.83,29.09,22.67,1.07,488.0,2454.0,0.016
87911,Zimbabwe,ZWE,2022-03-11,7985886.0,4394455.0,3425864.0,165567.0,13220.0,7895.0,52.91,29.12,22.70,1.10,523.0,2794.0,0.019


Note: `dropna()`, like most other functions in the pandas API, returns a new DataFrame
(a copy of the original with the changes applied), so you need to assign the result back if you want to keep the changes. The original is still unchanged:

In [157]:
zimbabwe.head()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
87525,Zimbabwe,ZWE,2021-02-18,39.0,39.0,,,,,0.0,0.0,,,,,
87526,Zimbabwe,ZWE,2021-02-19,,,,,,425.0,,,,,28.0,425.0,0.003
87527,Zimbabwe,ZWE,2021-02-20,,,,,,425.0,,,,,28.0,425.0,0.003
87528,Zimbabwe,ZWE,2021-02-21,1314.0,1314.0,,,,425.0,0.01,0.01,,,28.0,425.0,0.003
87529,Zimbabwe,ZWE,2021-02-22,,,,,,660.0,,,,,44.0,660.0,0.004


Assign it back:

In [158]:
zimbabwe = zimbabwe.dropna()
zimbabwe

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
87837,Zimbabwe,ZWE,2021-12-27,7222381.0,4105296.0,3113142.0,3943.0,7741.0,12481.0,47.86,27.20,20.63,0.03,827.0,6425.0,0.043
87838,Zimbabwe,ZWE,2021-12-28,7226334.0,4107151.0,3115190.0,3993.0,3953.0,10627.0,47.88,27.21,20.64,0.03,704.0,5407.0,0.036
87839,Zimbabwe,ZWE,2021-12-29,7238939.0,4112517.0,3121776.0,4646.0,12605.0,10190.0,47.96,27.25,20.68,0.03,675.0,4796.0,0.032
87842,Zimbabwe,ZWE,2022-01-01,7276239.0,4130228.0,3140338.0,5673.0,11448.0,9599.0,48.21,27.37,20.81,0.04,636.0,4308.0,0.029
87843,Zimbabwe,ZWE,2022-01-02,7288786.0,4133140.0,3144021.0,11625.0,12547.0,10592.0,48.30,27.39,20.83,0.08,702.0,4265.0,0.028
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87908,Zimbabwe,ZWE,2022-03-08,7949875.0,4381982.0,3414446.0,153447.0,6550.0,6931.0,52.68,29.03,22.62,1.02,459.0,2304.0,0.015
87909,Zimbabwe,ZWE,2022-03-09,7963423.0,4386789.0,3417692.0,158942.0,13548.0,7526.0,52.77,29.07,22.65,1.05,499.0,2580.0,0.017
87910,Zimbabwe,ZWE,2022-03-10,7972666.0,4390103.0,3421237.0,161326.0,9243.0,7365.0,52.83,29.09,22.67,1.07,488.0,2454.0,0.016
87911,Zimbabwe,ZWE,2022-03-11,7985886.0,4394455.0,3425864.0,165567.0,13220.0,7895.0,52.91,29.12,22.70,1.10,523.0,2794.0,0.019


In [159]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe']

##### Remove rows that have a missing value in a specific column - using `subset`

In [160]:
zimbabwe.dropna(subset = ['total_vaccinations'])

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
87525,Zimbabwe,ZWE,2021-02-18,39.0,39.0,,,,,0.00,0.00,,,,,
87528,Zimbabwe,ZWE,2021-02-21,1314.0,1314.0,,,,425.0,0.01,0.01,,,28.0,425.0,0.003
87530,Zimbabwe,ZWE,2021-02-23,4041.0,4041.0,,,,800.0,0.03,0.03,,,53.0,800.0,0.005
87531,Zimbabwe,ZWE,2021-02-24,6115.0,6115.0,,,2074.0,1013.0,0.04,0.04,,,67.0,1013.0,0.007
87532,Zimbabwe,ZWE,2021-02-25,11264.0,11264.0,,,5149.0,1604.0,0.07,0.07,,,106.0,1604.0,0.011
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87908,Zimbabwe,ZWE,2022-03-08,7949875.0,4381982.0,3414446.0,153447.0,6550.0,6931.0,52.68,29.03,22.62,1.02,459.0,2304.0,0.015
87909,Zimbabwe,ZWE,2022-03-09,7963423.0,4386789.0,3417692.0,158942.0,13548.0,7526.0,52.77,29.07,22.65,1.05,499.0,2580.0,0.017
87910,Zimbabwe,ZWE,2022-03-10,7972666.0,4390103.0,3421237.0,161326.0,9243.0,7365.0,52.83,29.09,22.67,1.07,488.0,2454.0,0.016
87911,Zimbabwe,ZWE,2022-03-11,7985886.0,4394455.0,3425864.0,165567.0,13220.0,7895.0,52.91,29.12,22.70,1.10,523.0,2794.0,0.019


For more than one column (a row is dropped if any of the listed columns is missing):

In [161]:
zimbabwe.dropna(subset = ['total_vaccinations', 'daily_vaccinations_per_million']).head()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
87528,Zimbabwe,ZWE,2021-02-21,1314.0,1314.0,,,,425.0,0.01,0.01,,,28.0,425.0,0.003
87530,Zimbabwe,ZWE,2021-02-23,4041.0,4041.0,,,,800.0,0.03,0.03,,,53.0,800.0,0.005
87531,Zimbabwe,ZWE,2021-02-24,6115.0,6115.0,,,2074.0,1013.0,0.04,0.04,,,67.0,1013.0,0.007
87532,Zimbabwe,ZWE,2021-02-25,11264.0,11264.0,,,5149.0,1604.0,0.07,0.07,,,106.0,1604.0,0.011
87533,Zimbabwe,ZWE,2021-02-26,12836.0,12836.0,,,1572.0,1767.0,0.09,0.09,,,117.0,1767.0,0.012


---
>A summary of the functions so far:
>
>* `.isnull()` - returns `True` for every missing value (add `.sum()` to count them per column)
>* `.notnull()` - returns `True` for every value that is not missing
>* `.dropna()` - removes rows with missing values according to parameters:
>    * `.dropna()` (default) - drops a row if at least one of its columns has NaN
>    * `.dropna(subset = ['column_name'])` - drops rows that have missing values in the listed columns
>    * `.dropna(how='all')` - drops a row only if all of its columns have NaNs
>    * `.dropna(thresh = k)` - keeps only rows with at least k non-null values (k=3 means the row must contain at least 3 non-null values)
>    * `.dropna(axis=1)` - drops columns instead of rows
>

See documentation [here.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

---
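
To make the `dropna()` parameters concrete, here is a small sketch on a made-up dataframe (the name `toy` and its values are assumptions for illustration only). Each call returns a new dataframe and leaves `toy` untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 2 is entirely missing, row 4 has a single value
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0, np.nan],
    "c": [1.0, 2.0, np.nan, np.nan, 5.0],
})

default_drop = toy.dropna()              # keeps only rows with no NaN at all
subset_drop = toy.dropna(subset=["b"])   # only looks at column "b"
all_drop = toy.dropna(how="all")         # drops rows where every value is NaN
thresh_drop = toy.dropna(thresh=2)       # keeps rows with at least 2 non-null values
column_drop = toy.dropna(axis=1)         # drops every column that contains a NaN

for name, result in [("default", default_drop), ("subset", subset_drop),
                     ("how='all'", all_drop), ("thresh=2", thresh_drop)]:
    print(name, list(result.index))
print("axis=1 columns:", list(column_drop.columns))
```

In this toy example every column contains at least one NaN, so `axis=1` leaves no columns at all. That is worth keeping in mind before dropping columns on real data.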


<a id='section3'></a>
### 3. Fill missing values using fillna()

Use `.fillna()` to fill missing dataframe values with:
* Whatever value you choose
* Mean, median, mode

This is called *imputation*

Replace all NaNs with 0s

In [162]:
vacc_df.fillna(0, inplace = False )
vacc_df

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,,0.00,0.00,,,,,
1,Afghanistan,AFG,2021-02-23,,,,,,1367.0,,,,,34.0,1367.0,0.003
2,Afghanistan,AFG,2021-02-24,,,,,,1367.0,,,,,34.0,1367.0,0.003
3,Afghanistan,AFG,2021-02-25,,,,,,1367.0,,,,,34.0,1367.0,0.003
4,Afghanistan,AFG,2021-02-26,,,,,,1367.0,,,,,34.0,1367.0,0.003
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87908,Zimbabwe,ZWE,2022-03-08,7949875.0,4381982.0,3414446.0,153447.0,6550.0,6931.0,52.68,29.03,22.62,1.02,459.0,2304.0,0.015
87909,Zimbabwe,ZWE,2022-03-09,7963423.0,4386789.0,3417692.0,158942.0,13548.0,7526.0,52.77,29.07,22.65,1.05,499.0,2580.0,0.017
87910,Zimbabwe,ZWE,2022-03-10,7972666.0,4390103.0,3421237.0,161326.0,9243.0,7365.0,52.83,29.09,22.67,1.07,488.0,2454.0,0.016
87911,Zimbabwe,ZWE,2022-03-11,7985886.0,4394455.0,3425864.0,165567.0,13220.0,7895.0,52.91,29.12,22.70,1.10,523.0,2794.0,0.019


>`inplace = False` is the default. This doesn't change the vacc_df dataframe. 
>
>To change it you need:
>
>`vacc_df.fillna(0 , inplace = True)`
>
>or to assign:
>
>`vacc_df = vacc_df.fillna(0)`
>
>But we won't do that! This is where some **business understanding** comes in: it's not a good idea to fill a column like `total_vaccinations` with 0s. 
>
>See what happens:

In [163]:
vacc_df.fillna(0).head(10)

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
0,Afghanistan,AFG,2021-02-22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Afghanistan,AFG,2021-02-23,0.0,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,0.0,34.0,1367.0,0.003
2,Afghanistan,AFG,2021-02-24,0.0,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,0.0,34.0,1367.0,0.003
3,Afghanistan,AFG,2021-02-25,0.0,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,0.0,34.0,1367.0,0.003
4,Afghanistan,AFG,2021-02-26,0.0,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,0.0,34.0,1367.0,0.003
5,Afghanistan,AFG,2021-02-27,0.0,0.0,0.0,0.0,0.0,1367.0,0.0,0.0,0.0,0.0,34.0,1367.0,0.003
6,Afghanistan,AFG,2021-02-28,8200.0,8200.0,0.0,0.0,0.0,1367.0,0.02,0.02,0.0,0.0,34.0,1367.0,0.003
7,Afghanistan,AFG,2021-03-01,0.0,0.0,0.0,0.0,0.0,1580.0,0.0,0.0,0.0,0.0,40.0,1580.0,0.004
8,Afghanistan,AFG,2021-03-02,0.0,0.0,0.0,0.0,0.0,1794.0,0.0,0.0,0.0,0.0,45.0,1794.0,0.005
9,Afghanistan,AFG,2021-03-03,0.0,0.0,0.0,0.0,0.0,2008.0,0.0,0.0,0.0,0.0,50.0,2008.0,0.005


So we'll use 0s only for the `daily_vaccinations` column, and perhaps for some other columns (which?)

In [164]:
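# Assign the result back: an inplace fill on a selected column may not update vacc_df in recent pandas versions
vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(0)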

Check some of the data to see that it worked:

In [165]:
vacc_df.iloc[0:3,[0,2,8]]

Unnamed: 0,location,date,daily_vaccinations
0,Afghanistan,2021-02-22,0.0
1,Afghanistan,2021-02-23,1367.0
2,Afghanistan,2021-02-24,1367.0


Other options - using central measures:

In [166]:
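# These are alternatives to filling with 0: run one of them instead of the fill above.
# Once the NaNs have been replaced by 0 there is nothing left to fill.

# Using median
vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].median())

# Using mean
#vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].mean())

# Using mode - mode() returns a Series (there can be ties), so take its first value
#vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].mode()[0])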
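
The three central measures can give quite different fill values. A minimal sketch on a made-up Series (the values are assumptions chosen so that the mean, median and mode differ) illustrates this, and also shows that `mode()` returns a Series, so one value has to be picked from it:

```python
import numpy as np
import pandas as pd

# Hypothetical values with one large outlier and two missing entries
s = pd.Series([1.0, 2.0, 2.0, np.nan, 10.0, np.nan])

filled_mean = s.fillna(s.mean())       # mean of the non-missing values (3.75)
filled_median = s.fillna(s.median())   # median of the non-missing values (2.0)
filled_mode = s.fillna(s.mode()[0])    # mode() returns a Series; take its first entry (2.0)

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
```

The outlier pulls the mean well above the median here, which is one reason the median is often preferred for skewed data.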


What about `total_vaccinations`? - there are some `NaN`s there as well:

In [167]:
vacc_df.iloc[52:62,[0,2,3]]

Unnamed: 0,location,date,total_vaccinations
52,Afghanistan,2021-04-15,
53,Afghanistan,2021-04-16,
54,Afghanistan,2021-04-17,
55,Afghanistan,2021-04-18,
56,Afghanistan,2021-04-19,
57,Afghanistan,2021-04-20,
58,Afghanistan,2021-04-21,
59,Afghanistan,2021-04-22,240000.0
60,Afghanistan,2021-04-23,
61,Afghanistan,2021-04-24,


For `total_vaccinations` we'll use `ffill`, which fills each missing value with the last non-missing value that occurs before it.

Yes, `bfill` exists as well. It does what you think it does :-)
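
A small sketch of the two directions on a made-up Series (the values are assumptions for illustration). Note what happens at the edges: a forward fill cannot fill a missing value at the start, and a backward fill cannot fill one at the end:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with missing values at the start, middle and end
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()    # carry the previous known value forward
backward = s.bfill()   # pull the next known value backward

print(forward.tolist())   # the first entry stays NaN
print(backward.tolist())  # the last entry stays NaN
```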

In [168]:
vacc_df[['date','total_vaccinations']].ffill()[52:62]
#vacc_df['total_vaccinations'][52:62]

Unnamed: 0,date,total_vaccinations
52,2021-04-15,120000.0
53,2021-04-16,120000.0
54,2021-04-17,120000.0
55,2021-04-18,120000.0
56,2021-04-19,120000.0
57,2021-04-20,120000.0
58,2021-04-21,120000.0
59,2021-04-22,240000.0
60,2021-04-23,240000.0
61,2021-04-24,240000.0


Check the original dataframe again - what happened?

In [169]:
vacc_df.iloc[52:62,[0,2,3]]

Unnamed: 0,location,date,total_vaccinations
52,Afghanistan,2021-04-15,
53,Afghanistan,2021-04-16,
54,Afghanistan,2021-04-17,
55,Afghanistan,2021-04-18,
56,Afghanistan,2021-04-19,
57,Afghanistan,2021-04-20,
58,Afghanistan,2021-04-21,
59,Afghanistan,2021-04-22,240000.0
60,Afghanistan,2021-04-23,
61,Afghanistan,2021-04-24,


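Nothing changed in `vacc_df`, because the result of the fill was never assigned back.

There is also a bigger problem: a plain forward fill ignores country boundaries. If the first values of a country are NaN, they are filled with the last value of the country listed before it.

Business understanding: this isn't good enough! We need to fill within each country separately!!

Use `groupby()` and `apply` (this is more advanced and we will return to it shortly).

We will create a new column here, `total_vacc_no_missing`, so we can compare it with `total_vaccinations`.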


In [170]:
vacc_df['total_vacc_no_missing'] = vacc_df.groupby('location', group_keys=False)[['total_vaccinations']].apply(lambda x: x.ffill())
vacc_df.iloc[375:385,[0,2,3,16]]

Unnamed: 0,location,date,total_vaccinations,total_vacc_no_missing
375,Afghanistan,2022-03-04,,5535254.0
376,Afghanistan,2022-03-05,,5535254.0
377,Afghanistan,2022-03-06,5597130.0,5597130.0
378,Africa,2021-01-09,0.0,0.0
379,Africa,2021-01-10,0.0,0.0
380,Africa,2021-01-11,0.0,0.0
381,Africa,2021-01-12,0.0,0.0
382,Africa,2021-01-13,2000.0,2000.0
383,Africa,2021-01-14,2000.0,2000.0
384,Africa,2021-01-15,4000.0,4000.0
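
The difference between a plain forward fill and a per-country forward fill is easiest to see at the boundary between two groups. The sketch below uses a made-up dataframe (the names `toy`, `location` and `total` are assumptions for illustration) and the built-in `ffill()` of a grouped column, which is one possible alternative to `apply` with a lambda:

```python
import numpy as np
import pandas as pd

# Hypothetical data: group B starts with a missing value
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B"],
    "total": [10.0, np.nan, 30.0, np.nan, 5.0, np.nan],
})

# Plain forward fill: the last value of A leaks into the first row of B
plain = toy["total"].ffill()

# Forward fill within each location: B's first row stays missing
per_group = toy.groupby("location")["total"].ffill()

print(plain.tolist())
print(per_group.tolist())
```

Leaving the first row of B as NaN is arguably more honest than borrowing a value from another country. Whether to fill it some other way is again a business decision.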


<a id='section4'></a>
### 4. Fill missing values using interpolate()

In [171]:
vacc_df['total_vacc_interpolate'] = vacc_df['total_vaccinations'].interpolate(method ='linear') 
vacc_df.iloc[44:62,[0,2,3,16, 17]]

Unnamed: 0,location,date,total_vaccinations,total_vacc_no_missing,total_vacc_interpolate
44,Afghanistan,2021-04-07,120000.0,120000.0,120000.0
45,Afghanistan,2021-04-08,,120000.0,128000.0
46,Afghanistan,2021-04-09,,120000.0,136000.0
47,Afghanistan,2021-04-10,,120000.0,144000.0
48,Afghanistan,2021-04-11,,120000.0,152000.0
49,Afghanistan,2021-04-12,,120000.0,160000.0
50,Afghanistan,2021-04-13,,120000.0,168000.0
51,Afghanistan,2021-04-14,,120000.0,176000.0
52,Afghanistan,2021-04-15,,120000.0,184000.0
53,Afghanistan,2021-04-16,,120000.0,192000.0
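
Linear interpolation spreads the gap between two known values evenly over the missing rows, which is why the column above rises in equal steps. A minimal sketch on made-up data (the names and values are assumptions for illustration) shows the arithmetic, and also that interpolating the whole column at once crosses group boundaries, just like a plain forward fill:

```python
import numpy as np
import pandas as pd

# Hypothetical data: group B starts with a missing value
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B"],
    "total": [0.0, np.nan, 20.0, np.nan, 200.0, 300.0],
})

# Whole column: row 3 is interpolated between A's last value and B's first known value
plain = toy["total"].interpolate(method="linear")

# Within each location: B has no earlier value, so its first row stays missing
per_group = toy.groupby("location")["total"].transform(
    lambda x: x.interpolate(method="linear")
)

print(plain.tolist())
print(per_group.tolist())
```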


---
>A summary of the functions so far:
>
>* `.fillna()` - fill missing values according to parameters:
>    * `.fillna(k)` - with value k, returning a new dataframe
>    * `.fillna(k, inplace = True)` - with value k, in the existing dataframe
>    * `.fillna(method='ffill')` - fill with the last non-missing value that occurs before it (in newer pandas versions: `.ffill()`)
>    * `.fillna(method='bfill')` - fill with the first non-missing value that occurs after it (in newer pandas versions: `.bfill()`)
>* `interpolate` - fill using some interpolation technique
>
>See documentation:
>
>* [Missing data handling documentation](https://pandas-docs.github.io/pandas-docs-travis/reference/frame.html#missing-data-handling)
---

---
<a id='section5'></a>

### 5. A note on slicing - copy()

Slicing is taking only part of a dataframe. For example - the slice we named zimbabwe:

In [172]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe']

When we change data in a slice, pandas cannot always tell whether we are changing the ORIGINAL dataframe or only a temporary copy of it. This will cause a warning to appear:

In [173]:
zimbabwe.fillna(0, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  zimbabwe.fillna(0, inplace=True)


The warning may disappear if you rerun the command, but this is still bad practice. The best way to avoid it is to create a `copy` of the slice:

In [174]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe'].copy()
zimbabwe.fillna(0, inplace=True)

This works fine, no warnings. Note - this won't change the original dataframe (which might be a good thing, if you didn't plan to change it, or a bad thing, if you did)
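
A quick way to convince yourself that the copy is independent is to fill it and then look at the original again. The sketch below uses a made-up dataframe (the names `toy` and `subset_b` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in group B
toy = pd.DataFrame({
    "location": ["A", "B", "B"],
    "total": [1.0, np.nan, 3.0],
})

# Take an explicit copy of the slice, then change only the copy
subset_b = toy.loc[toy["location"] == "B"].copy()
subset_b["total"] = subset_b["total"].fillna(0)

print(subset_b["total"].tolist())          # the copy is filled
print(int(toy["total"].isnull().sum()))    # the original still has its NaN
```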

What about later changes in the original dataframe? They will not show up in the copy.

If you do want your copy to follow the original, use a shallow copy. Here is a small example:

In [175]:
small_example = pd.Series([1, 2], index=["a", "b"])
small_example

a    1
b    2
dtype: int64

A deep copy is the default:

In [176]:
my_deep_copy = small_example.copy()
my_deep_copy

a    1
b    2
dtype: int64

In [177]:
my_shallow_copy = small_example.copy(deep=False)
my_shallow_copy

a    1
b    2
dtype: int64

Make a change to the original Series - where will it appear?

In [178]:
small_example.iloc[0] = -100
small_example

a   -100
b      2
dtype: int64

In [179]:
my_deep_copy

a    1
b    2
dtype: int64

In [180]:
my_shallow_copy

a   -100
b      2
dtype: int64

---
>A summary:
>
>* `.copy()` - creates a copy of the slice of the dataframe
>
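>* `.copy(deep=False)` - creates a shallow copy: updates to the original dataframe will show in the copy (this sharing no longer applies when pandas runs with Copy-on-Write enabled)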
---

---
<a id='section6'></a>
### 6. GroupBy()



#### How do we fill missing values for `total_vaccinations` according to the mean of each country?

#### How do we fill missing values for `daily_vaccinations` according to the mean of each country each month?

##### For this, we need to use groupby

[groupby documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)

#### Group according to something + select some columns + do something on the result

The `mean` of `daily_vaccinations` according to `location`:


In [181]:
vacc_df.groupby('location')[['daily_vaccinations']].mean()

Unnamed: 0_level_0,daily_vaccinations
location,Unnamed: 1_level_1
Afghanistan,1.474786e+04
Africa,1.012646e+06
Albania,6.407555e+03
Algeria,3.476839e+04
Andorra,3.683818e+02
...,...
Wallis and Futuna,3.595918e+01
World,2.354100e+07
Yemen,2.572517e+03
Zambia,9.266205e+03


Note that this format means `location` is now the index, not a column.

Try running the commands below:

In [182]:
df_by_loc = vacc_df.groupby('location')[['daily_vaccinations']].mean()
#df_by_loc[['location']]   #this will result in an error
#df_by_loc[['daily_vaccinations']]   #this is OK

##### If you plan to continue using this data and need the index as a regular column, there are two possible solutions:

* set `as_index=False` in `groupby()`
* add `reset_index()` after the aggregation

In [183]:
#vacc_df.groupby('location', as_index = False)[['daily_vaccinations']].mean()
vacc_df.groupby('location')[['daily_vaccinations']].mean().reset_index()

Unnamed: 0,location,daily_vaccinations
0,Afghanistan,1.474786e+04
1,Africa,1.012646e+06
2,Albania,6.407555e+03
3,Algeria,3.476839e+04
4,Andorra,3.683818e+02
...,...,...
230,Wallis and Futuna,3.595918e+01
231,World,2.354100e+07
232,Yemen,2.572517e+03
233,Zambia,9.266205e+03
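
The two options can be compared side by side on a made-up dataframe (the names `toy` and `daily` are assumptions for illustration). In this sketch both give a dataframe in which `location` is an ordinary column again:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "location": ["A", "A", "B"],
    "daily": [1.0, 3.0, 10.0],
})

# Default: the grouping key becomes the index
indexed = toy.groupby("location")[["daily"]].mean()

# Option 1: ask groupby to keep the key as a column
flat_a = toy.groupby("location", as_index=False)[["daily"]].mean()

# Option 2: move the index back into a column afterwards
flat_b = indexed.reset_index()

print(list(indexed.columns))  # 'location' is not among the columns
print(list(flat_a.columns))
print(list(flat_b.columns))
```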


##### Grouping by two or more columns is possible

For example: create a dataframe with the mean daily vaccinations per country per month

First, change the `date` into a `datetime` object and extract the month

In [184]:
vacc_df['date'] = pd.to_datetime(vacc_df['date'])

In [185]:
vacc_df['month'] = pd.DatetimeIndex(vacc_df['date']).month
vacc_df[['month','date']] 

Unnamed: 0,month,date
0,2,2021-02-22
1,2,2021-02-23
2,2,2021-02-24
3,2,2021-02-25
4,2,2021-02-26
...,...,...
87908,3,2022-03-08
87909,3,2022-03-09
87910,3,2022-03-10
87911,3,2022-03-11
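
An equivalent way to reach the date parts is the `.dt` accessor of a datetime column. The sketch below uses made-up dates (an assumption for illustration) and also hints at a caveat: the month number alone does not distinguish February 2021 from February 2022, so when the data spans more than one year a year-month period may be a better grouping key:

```python
import pandas as pd

# Hypothetical dates spanning two years
dates = pd.Series(["2021-02-22", "2022-02-10", "2022-03-08"])
parsed = pd.to_datetime(dates)

months = parsed.dt.month               # 2, 2, 3
years = parsed.dt.year                 # 2021, 2022, 2022
year_month = parsed.dt.to_period("M")  # 2021-02, 2022-02, 2022-03

print(months.tolist())
print(years.tolist())
print(year_month.astype(str).tolist())
```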


Now, group by both `location` and `month`:

In [186]:
vacc_df.groupby(['location','month'])[['daily_vaccinations', 'total_vaccinations']].mean().reset_index()

Unnamed: 0,location,month,daily_vaccinations,total_vaccinations
0,Afghanistan,1,12893.870968,5.010983e+06
1,Afghanistan,2,11835.600000,3.806624e+06
2,Afghanistan,3,4260.135135,2.825565e+06
3,Afghanistan,4,7320.200000,1.800000e+05
4,Afghanistan,5,9220.580645,5.682665e+05
...,...,...,...,...
2708,Zimbabwe,8,58068.290323,3.425626e+06
2709,Zimbabwe,9,38264.500000,4.862657e+06
2710,Zimbabwe,10,19820.258065,5.667995e+06
2711,Zimbabwe,11,22653.966667,6.248622e+06


What will happen if we switch the order of the grouping columns?

Try running the following:

In [187]:
vacc_df.groupby(['month', 'location'])[['daily_vaccinations', 'total_vaccinations']].mean().reset_index()

Unnamed: 0,month,location,daily_vaccinations,total_vaccinations
0,1,Afghanistan,1.289387e+04,5.010983e+06
1,1,Africa,9.582386e+05,1.905173e+08
2,1,Albania,3.826566e+03,8.517575e+05
3,1,Algeria,1.709424e+04,4.324858e+06
4,1,Andorra,5.823158e+02,9.206333e+04
...,...,...,...,...
2708,12,Wallis and Futuna,6.806452e+00,
2709,12,World,1.878500e+07,4.387123e+09
2710,12,Yemen,1.926000e+03,
2711,12,Zambia,1.889281e+04,1.362782e+06


The same result as the `location`, `month` grouping above, but using a lambda function:

In [188]:
vacc_df.groupby(['location', 'month'])[['daily_vaccinations', 'total_vaccinations']].apply(lambda x: x.mean()).reset_index()

Unnamed: 0,location,month,daily_vaccinations,total_vaccinations
0,Afghanistan,1,12893.870968,5.010983e+06
1,Afghanistan,2,11835.600000,3.806624e+06
2,Afghanistan,3,4260.135135,2.825565e+06
3,Afghanistan,4,7320.200000,1.800000e+05
4,Afghanistan,5,9220.580645,5.682665e+05
...,...,...,...,...
2708,Zimbabwe,8,58068.290323,3.425626e+06
2709,Zimbabwe,9,38264.500000,4.862657e+06
2710,Zimbabwe,10,19820.258065,5.667995e+06
2711,Zimbabwe,11,22653.966667,6.248622e+06


-----
##### So now we are ready to answer the questions:
##### How do we fill missing values for `total_vaccinations` according to the mean of each country?

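We can now read the per-country forward fill we used in section 3. Note that it fills with the previous value within each country, not with the country mean; a mean-based fill follows the same group-then-fill pattern:

In [189]:
vacc_df['total_vacc_no_missing'] = vacc_df.groupby('location', group_keys=False)[['total_vaccinations']].apply(lambda x: x.ffill())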
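
For the question as asked, filling with the mean of each country, one possible approach is `transform('mean')`, which returns the group mean aligned to every original row so it can be passed straight to `fillna()`. This is a sketch on a made-up dataframe (the names `toy`, `total` and `total_filled` are assumptions for illustration), not necessarily the approach the course intends:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per group
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "total": [10.0, np.nan, 30.0, np.nan, 100.0],
})

# Mean of each location, repeated on every row of that location
group_mean = toy.groupby("location")["total"].transform("mean")

# Fill each missing value with the mean of its own group
toy["total_filled"] = toy["total"].fillna(group_mean)

print(group_mean.tolist())
print(toy["total_filled"].tolist())
```

For the second question, the mean per country per month, the same pattern should work with `groupby(['location', 'month'])`. Whether a mean is a sensible fill for a cumulative column such as `total_vaccinations` is a separate business question.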

---
You can also aggregate different columns using different functions:

In [191]:
vacc_df.groupby('location').agg({'daily_vaccinations': ['first', 'last' , 'mean', 'median', 'max'], 'total_vaccinations':['max']})

Unnamed: 0_level_0,daily_vaccinations,daily_vaccinations,daily_vaccinations,daily_vaccinations,daily_vaccinations,total_vaccinations
Unnamed: 0_level_1,first,last,mean,median,max,max
location,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Afghanistan,0.0,8839.0,1.474786e+04,12999.0,70761.0,5.597130e+06
Africa,0.0,1780854.0,1.012646e+06,917931.5,2831824.0,4.231429e+08
Albania,0.0,2363.0,6.407555e+03,6651.0,17565.0,2.729969e+06
Algeria,0.0,0.0,3.476839e+04,22369.0,256927.0,1.363168e+07
Andorra,0.0,65.0,3.683818e+02,218.0,1762.0,1.424200e+05
...,...,...,...,...,...,...
Wallis and Futuna,0.0,14.0,3.595918e+01,14.0,343.0,1.284900e+04
World,0.0,17736880.0,2.354100e+07,26537616.0,43535350.0,1.097306e+10
Yemen,0.0,719.0,2.572517e+03,1926.0,10240.0,7.847920e+05
Zambia,0.0,19841.0,9.266205e+03,5751.0,36338.0,3.131843e+06


### (More) Advanced: create your own function

In [192]:
vacc_df.groupby('location')[['people_vaccinated_per_hundred']].apply(lambda x: x.max() - x.min()).reset_index()

Unnamed: 0,location,people_vaccinated_per_hundred
0,Afghanistan,12.43
1,Africa,19.29
2,Albania,44.36
3,Algeria,16.71
4,Andorra,73.98
...,...,...
230,Wallis and Futuna,46.55
231,World,63.49
232,Yemen,1.99
233,Zambia,13.27
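
A lambda is convenient for one-liners, but the same idea can be written as a named function, which is easier to reuse and to test. The sketch below uses a made-up dataframe and a hypothetical helper called `value_range` (both are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "pct": [1.0, 5.0, 3.0, 10.0, 12.0],
})

def value_range(series):
    """Return the difference between the largest and smallest value."""
    return series.max() - series.min()

# Apply the custom function to the selected column of each group
ranges = toy.groupby("location")["pct"].agg(value_range)

print(ranges)
```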


### (More) Advanced: multiple functions using agg

In [193]:
vacc_df.columns

Index(['location', 'iso_code', 'date', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated', 'total_boosters',
       'daily_vaccinations_raw', 'daily_vaccinations',
       'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
       'people_fully_vaccinated_per_hundred', 'total_boosters_per_hundred',
       'daily_vaccinations_per_million', 'daily_people_vaccinated',
       'daily_people_vaccinated_per_hundred', 'total_vacc_no_missing',
       'total_vacc_interpolate', 'month'],
      dtype='object')

In [194]:
vacc_group = vacc_df.groupby('location').agg({'daily_people_vaccinated': ['first', 'last' , 'mean', 'median', 'max'], 'total_vaccinations':['max']}).reset_index()
vacc_group

Unnamed: 0_level_0,location,daily_people_vaccinated,daily_people_vaccinated,daily_people_vaccinated,daily_people_vaccinated,daily_people_vaccinated,total_vaccinations
Unnamed: 0_level_1,Unnamed: 1_level_1,first,last,mean,median,max,max
0,Afghanistan,1367.0,6527.0,1.309611e+04,8000.0,60035.0,5.597130e+06
1,Africa,0.0,934013.0,6.251409e+05,649046.5,1920423.0,4.231429e+08
2,Albania,64.0,413.0,3.003566e+03,2525.5,6816.0,2.729969e+06
3,Algeria,30.0,4372.0,1.924690e+04,18386.0,105248.0,1.363168e+07
4,Andorra,66.0,1.0,1.495208e+02,50.0,854.0,1.424200e+05
...,...,...,...,...,...,...,...
230,Wallis and Futuna,272.0,7.0,1.744444e+01,7.0,272.0,1.284900e+04
231,World,0.0,3519778.0,9.562990e+06,9007705.5,21396353.0,1.097306e+10
232,Yemen,4276.0,599.0,2.050944e+03,1662.0,10240.0,7.847920e+05
233,Zambia,106.0,9976.0,7.707669e+03,7624.0,15561.0,3.131843e+06


If you want to access the data without dealing with a multi-level column index, flatten the columns by dropping a level and renaming them:

In [195]:
vacc_group.columns = vacc_group.columns.droplevel(0)

vacc_group.columns = ['location','daily_first','daily_last','daily_mean','daily_median','daily_max','total_max']

vacc_group

Unnamed: 0,location,daily_first,daily_last,daily_mean,daily_median,daily_max,total_max
0,Afghanistan,1367.0,6527.0,1.309611e+04,8000.0,60035.0,5.597130e+06
1,Africa,0.0,934013.0,6.251409e+05,649046.5,1920423.0,4.231429e+08
2,Albania,64.0,413.0,3.003566e+03,2525.5,6816.0,2.729969e+06
3,Algeria,30.0,4372.0,1.924690e+04,18386.0,105248.0,1.363168e+07
4,Andorra,66.0,1.0,1.495208e+02,50.0,854.0,1.424200e+05
...,...,...,...,...,...,...,...
230,Wallis and Futuna,272.0,7.0,1.744444e+01,7.0,272.0,1.284900e+04
231,World,0.0,3519778.0,9.562990e+06,9007705.5,21396353.0,1.097306e+10
232,Yemen,4276.0,599.0,2.050944e+03,1662.0,10240.0,7.847920e+05
233,Zambia,106.0,9976.0,7.707669e+03,7624.0,15561.0,3.131843e+06
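
An alternative that avoids the multi-level columns in the first place is named aggregation, where each output column is given a name together with its source column and function. This is a sketch on a made-up dataframe (the names `toy`, `daily` and `total` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "daily": [1.0, 3.0, 10.0, 20.0],
    "total": [1.0, 4.0, 10.0, 30.0],
})

# output_name=(source_column, function) gives flat column names directly
summary = toy.groupby("location").agg(
    daily_first=("daily", "first"),
    daily_mean=("daily", "mean"),
    total_max=("total", "max"),
).reset_index()

print(summary)
```

With this form there is no need to drop a level or to keep a hand-written list of column names in the same order as the aggregations.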


---
>A summary:
>
>* `.groupby()` - group according to the columns specified
>
>* `.reset_index()` or  set `as_index=False` - adds the current index as a column, adds a new numerical index
>
>* `pd.to_datetime(df['date'])` - changes the attribute type to datetime
>
>* `pd.DatetimeIndex(df['date']).month` - extracts the month from the datetime attribute
>
>* `apply` - applies a function along an axis of the dataframe: to each column (`axis=0`, the default) or to each row (`axis=1`). After `groupby()`, it applies the function to each group [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply)
>
>* `lambda` - small anonymous function
>
>* `agg` - apply multiple functions at once, one for each specified column [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html)
---

#### This was a lot of information.

#### Keep your balance. Practice. You will make it.

<div>
<img src="images/balance.jpg" width="500"/>
</div>

Photo by <a href="https://unsplash.com/@martinsanchez?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Martin Sanchez</a> on <a href="https://unsplash.com/s/photos/perfect-balance?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  