# Unit 3 - Missing values and Data statistics
---

1. [Find rows with missing values](#section1)
2. [Remove missing values using dropna()](#section2)
3. [Fill missing values using fillna()](#section3)
4. [Fill missing values using interpolate()](#section4)
5. [Replace values](#section5)
6. [A note on slicing - copy()](#section6)
7. [GroupBy()](#section7)





In [3]:
import pandas as pd
import numpy as np

In [4]:
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv'
vacc_df = pd.read_csv(url)
vacc_df.shape

(114892, 16)

---
>## Reminder:
>##### Treating missing values is optional. Sometimes we just leave the dataframe with missing values!!
---


<a id='section1'></a>
### 1. Find rows with missing values

`null` / `na` - no value

`NaN` - **N**ot **a** **N**umber - the value is missing. This value will be ignored in calculations such as `.mean()`

`isnull()` is a pandas function, so either use it on a dataframe or call it through pd

In [None]:
vacc_df.head()

In [None]:
vacc_df.isnull().sum()

##### call it through pandas:

In [None]:
pd.isnull(vacc_df).sum()

##### View specific columns:

In [None]:
vacc_df[['daily_vaccinations', 'total_vaccinations']].notnull().sum()

In [None]:
vacc_df[['daily_vaccinations']].isnull().sum()

##### Using numpy: `isnan` is a numpy function

In [None]:
np.isnan(vacc_df[['daily_vaccinations']]).sum()
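
The cells above count missing values per column. To look at the rows themselves, one option is a boolean mask built with `.any(axis=1)`. The sketch below uses a small made-up dataframe (`toy` and its values are assumptions, not the real vaccinations data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real vaccinations file
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "daily_vaccinations": [10.0, np.nan, 5.0, 7.0],
    "total_vaccinations": [10.0, 20.0, np.nan, 12.0],
})

# Boolean mask: True for rows that have at least one missing value
has_missing = toy.isnull().any(axis=1)

# Keep only the rows flagged by the mask
rows_with_missing = toy[has_missing]
print(rows_with_missing)
```

The same pattern should work on `vacc_df`, for example `vacc_df[vacc_df.isnull().any(axis=1)]`.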

<a id='section2'></a>
### 2. Remove missing values using dropna() 

##### Look at Zimbabwe for example. Zimbabwe contains missing values:

In [None]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe']
zimbabwe.head(10)

In [None]:
zimbabwe[['total_vaccinations']].isnull().sum()

In [None]:
zimbabwe['total_vaccinations'].notnull().sum()

##### We can see the difference when counting the number of non-missing values per column:

In [None]:
zimbabwe.count()

##### Remove all rows that contain one or more missing values: 

In [None]:
zimbabwe.dropna()

Note: `dropna()`, like most other functions in the pandas API, returns a new DataFrame 
(a copy of the original with the changes applied), so you must assign the result back if you want to keep the changes:

In [None]:
zimbabwe.head()

assign it back:

In [None]:
zimbabwe = zimbabwe.dropna()
zimbabwe

Re-create the slice so we have the NaNs again

In [159]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe']

##### Remove rows with missing values in a specific column using `subset`

In [None]:
zimbabwe.dropna(subset = ['total_vaccinations'])

For more columns:

In [None]:
zimbabwe.dropna(subset = ['total_vaccinations', 'daily_vaccinations_per_million']).head()

### We can now answer a question such as - which countries have the longest running vaccination programs?

Remember: `date` in this data is never null

Create a dataset that doesn't have null values for `people_vaccinated`


In [None]:
vacc_df_full = vacc_df.dropna(subset = ['people_vaccinated'])

Use `value_counts` to count how many times each `location` shows up:

In [None]:
vacc_df_full[["location"]].value_counts().head(10)

Remove places that are not countries

In [None]:
# Regions and income groups that are not countries
not_countries = ["Europe", "High income", "World", "European Union",
                 "North America", "Upper middle income", "Asia", "South America"]
# Keep only the rows whose location is not in the list
vacc_df_full = vacc_df_full[~vacc_df_full.location.isin(not_countries)]

Now we can finally see the countries with the longest running vaccination program

In [None]:
vacc_df_full[["location"]].value_counts().head(10)

Note: we already used `value_counts` in unit 2. What we added here is an extra step: we first removed the missing values and then used `value_counts`.

The counts below still include rows with missing values, for countries such as Latvia and Russia:

In [None]:
vacc_df[["location"]].value_counts().head(10)

---
>A summary of the functions so far:
>
>* `.isnull()` - returns `True` for every missing value (combine with `.sum()` to count them)
>* `.notnull()` - returns `True` for every non-missing value
>* `.dropna()` - remove rows with missing values according to parameters:
    * `.dropna()` (default) - drops a row if at least one of its columns has NaN
    * `.dropna(subset = ['column_name'])` - drops rows that have missing values in the listed columns
    * `.dropna(how='all')` - drops a row only if all of its columns have NaN
    * `.dropna(thresh = k)` - keeps only rows with at least k non-null values (k=3 means a row must contain at least 3 non-null values)
    * `.dropna(axis=1)` - drops columns instead of rows
>

See documentation [here.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)
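
The `how`, `thresh` and `axis` options in the summary were not run earlier in this unit. A minimal sketch on a made-up dataframe (the column names `a`, `b`, `c` are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a different number of NaNs in each row
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

drop_any = toy.dropna()               # keeps only row 3 (no NaN at all)
drop_all = toy.dropna(how="all")      # drops only row 2 (every value is NaN)
at_least_two = toy.dropna(thresh=2)   # keeps rows 0 and 3 (2 or more non-null values)
drop_cols = toy.dropna(axis=1)        # every column has a NaN, so no columns remain

print(drop_any.index.tolist(), drop_all.index.tolist(),
      at_least_two.index.tolist(), drop_cols.shape)
```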

---


<a id='section3'></a>
### 3. Fill missing values using fillna()

Use `.fillna()` to fill missing dataframe values with:
* Whatever value you choose
* Mean, median, mode

This is called *imputation*

Replace all NaNs with 0s

In [None]:
vacc_df.fillna(0, inplace = False )
vacc_df

>`inplace = False` is the default. This doesn't change the vacc_df dataframe. 
>
>To change it you need:
>
>`vacc_df.fillna(0 , inplace = True)`
>
>or to assign:
>
>`vacc_df = vacc_df.fillna(0)`
>
>But we won't do that! This is where some **business understanding** comes in: it's not a good idea to fill a column like `total_vaccinations` with 0s. 
>
>See what happens:

In [None]:
vacc_df.fillna(0).head(10)

So we'll use 0s only for the `daily_vaccinations` column, and perhaps for some other columns (which?)

In [None]:
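# Assign the result back: fillna(inplace=True) on a selected column may not update vacc_df in recent pandas versions
vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(0)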

Other options - using central measures:

In [6]:
# Using median
vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].median())

# Using mean
#vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].mean())

# Using mode (mode() returns a Series, so take its first value)
#vacc_df['daily_vaccinations'] = vacc_df['daily_vaccinations'].fillna(vacc_df['daily_vaccinations'].mode()[0])
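
To see how the three central measures differ, here is a small sketch on a made-up Series (the values are assumptions chosen so that the mean is pulled up by one large value). Note that `mode()` returns a Series, because ties are possible, so the sketch takes its first element:

```python
import numpy as np
import pandas as pd

# Hypothetical values: one large outlier and two missing entries
s = pd.Series([1.0, 2.0, 2.0, np.nan, np.nan, 100.0])

filled_median = s.fillna(s.median())        # median of 1, 2, 2, 100 is 2.0
filled_mean = s.fillna(s.mean())            # mean is 26.25, pulled up by the outlier
filled_mode = s.fillna(s.mode().iloc[0])    # most frequent value is 2.0

print(filled_median.iloc[3], filled_mean.iloc[3], filled_mode.iloc[3])
```

The median and mode are less sensitive to extreme values than the mean, which is one reason to compare them before choosing a fill value.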


What about `total_vaccinations`? - there are some `NaN`s there as well:

In [None]:
vacc_df.iloc[52:62,[0,2,3]]

For `total_vaccinations` we'll use `ffill`, which fills each missing value with the last non-missing value that occurs before it.

Yes, `bfill` exists as well. It does what you think it does :-)

In [None]:
# .ffill() does the same as the older fillna(method='ffill'), which is deprecated in recent pandas versions
vacc_df[['date','total_vaccinations']].ffill()[52:62]
#vacc_df['total_vaccinations'][52:62]

check it again - what happened?

In [None]:
vacc_df.iloc[52:62,[0,2,3]]

The first or last value for some country might be NaN, and a plain fill then borrows a value from the neighbouring country in the file.

Business understanding: this isn't good enough! We need to group by country!!

In [None]:
# transform returns a result aligned with the original rows, so it can be assigned as a new column
vacc_df['total_vacc_no_missing'] = vacc_df.groupby('location')['total_vaccinations'].transform(lambda x: x.ffill())
vacc_df.iloc[375:385,[0,2,3,16]]
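
The difference between a plain forward fill and a forward fill per country is easier to see on a tiny example. The sketch below uses a made-up dataframe (locations `A` and `B` are assumptions) in which country `B` starts with a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: country B starts with a missing value
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B", "B"],
    "total_vaccinations": [100.0, 150.0, np.nan, 20.0, np.nan],
})

# Plain forward fill: B's first row wrongly inherits 150 from country A
plain = toy["total_vaccinations"].ffill()

# Forward fill within each country: B's first row stays NaN
per_country = toy.groupby("location")["total_vaccinations"].ffill()

print(pd.DataFrame({"plain": plain, "per_country": per_country}))
```

Leaving the first value of `B` as NaN is arguably the more honest result, since no earlier value exists for that country.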

##### Note: `fillna(inplace = True)` does not work when `.loc` is used. 

<a id='section4'></a>
### 4. Fill missing values using interpolate()

In [None]:
vacc_df['total_vacc_interpolate'] = vacc_df['total_vaccinations'].interpolate(method ='linear') 
vacc_df.iloc[44:62,[0,2,3,16, 17]]
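
As a rough illustration of what `method='linear'` does, the sketch below interpolates a short made-up Series (the numbers are assumptions). Linear interpolation spaces the missing values evenly between the two known values around them:

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative counts with three missing days in the middle
s = pd.Series([120000.0, np.nan, np.nan, np.nan, 152000.0])

# Each gap is filled in equal steps between 120000 and 152000
linear = s.interpolate(method="linear")
print(linear.tolist())
```

Like a plain `ffill`, interpolating the whole column ignores country boundaries, so grouping by `location` first is probably the safer choice for this dataset.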

<a id='section5'></a>
### 5. Replace values

Sometimes we need to replace existing values rather than fill missing ones.

[replace documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html)

> you try!
>
> read the documentation and try to figure out how it's done

In [11]:
vacc_df['total_vacc_interpolate'] = vacc_df['total_vaccinations'].interpolate(method ='linear') 
vacc_df.iloc[44:62,[0,2,3,16, 17]]

| | location | date | total_vaccinations | total_vacc_no_missing | total_vacc_interpolate |
|---|---|---|---|---|---|
| 44 | Afghanistan | 2021-04-07 | 120000.0 | 120000.0 | 120000.0 |
| 45 | Afghanistan | 2021-04-08 | NaN | 120000.0 | 128000.0 |
| 46 | Afghanistan | 2021-04-09 | NaN | 120000.0 | 136000.0 |
| 47 | Afghanistan | 2021-04-10 | NaN | 120000.0 | 144000.0 |
| 48 | Afghanistan | 2021-04-11 | NaN | 120000.0 | 152000.0 |
| 49 | Afghanistan | 2021-04-12 | NaN | 120000.0 | 160000.0 |
| 50 | Afghanistan | 2021-04-13 | NaN | 120000.0 | 168000.0 |
| 51 | Afghanistan | 2021-04-14 | NaN | 120000.0 | 176000.0 |
| 52 | Afghanistan | 2021-04-15 | NaN | 120000.0 | 184000.0 |
| 53 | Afghanistan | 2021-04-16 | NaN | 120000.0 | 192000.0 |


---
>A summary of the functions so far:
>
>* `.fillna()` - fill missing values according to parameters:
    * `.fillna(k)`  - with value k, returns a new dataframe
    * `.fillna(k, inplace = True)` - with value k, into the existing dataframe
    * `.fillna(method='ffill')` (or `.ffill()`) - fill with the last non-missing value that occurs before it 
    * `.fillna(method='bfill')` (or `.bfill()`) - fill with the first non-missing value that occurs after it  
> * `interpolate` - fill using some interpolation technique
> * `replace(x,y)` - replace x with y
>
>See documentation:
>
>* [Missing data handling documentation](https://pandas-docs.github.io/pandas-docs-travis/reference/frame.html#missing-data-handling)
---
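
If you want to check your own attempt at `replace`, here is one possible sketch on a made-up dataframe (the values `"World"`, `"Global"` and `-1` are assumptions for illustration). It shows a single value replacement and a per-column mapping:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data where -1 was used as a placeholder for "unknown"
toy = pd.DataFrame({
    "location": ["World", "Asia", "Israel"],
    "value": [-1, 5, -1],
})

# Replace one value with another everywhere in the dataframe
renamed = toy.replace("World", "Global")

# Replace only inside a specific column: {column: {old: new}}
mapped = toy.replace({"value": {-1: np.nan}})

print(renamed)
print(mapped)
```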

---
<a id='section6'></a>

### 6. A note on slicing

Slicing is taking only part of a dataframe. For example - the slice we named zimbabwe:

In [27]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe']

When we change data in a slice, pandas cannot tell whether we meant to change only the slice or also the ORIGINAL dataframe, so it shows a warning:

In [26]:
zimbabwe.fillna(0, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  zimbabwe.fillna(0, inplace=True)


It is only a warning, but ignoring it is bad practice. The best way to avoid it is to create a `copy` of the dataframe:

In [None]:
zimbabwe = vacc_df.loc[vacc_df.location == 'Zimbabwe'].copy()
zimbabwe.fillna(0, inplace=True)

This works fine, no warnings. Note - this won't change the original dataframe (which might be a good thing, if you didn't plan to change it, or a bad thing, if you did)
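
A minimal sketch of this behaviour on a made-up dataframe (the `toy` data is an assumption): after `.copy()`, filling the slice leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two different locations
toy = pd.DataFrame({
    "location": ["Zimbabwe", "Zimbabwe", "Other"],
    "total_vaccinations": [np.nan, 5.0, np.nan],
})

# An explicit copy of the slice is an independent dataframe
zim = toy.loc[toy.location == "Zimbabwe"].copy()
zim.fillna(0, inplace=True)

print(zim)   # the NaN in the copy is now 0
print(toy)   # the original still has both of its NaNs
```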

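What about changes in the original dataframe? They will not change the copy.

If you do want your copy to reflect changes in the original, use a shallow copy (this depends on the pandas version: with Copy-on-Write enabled, a shallow copy no longer follows the original):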

In [None]:
small_example = pd.Series([1, 2], index=["a", "b"])
small_example

deep copy is the default:

In [None]:
my_deep_copy = small_example.copy()
my_deep_copy

In [None]:
my_shallow_copy = small_example.copy(deep=False)
my_shallow_copy

Make a change to the original Series - where will it appear?

In [None]:
small_example["a"] = -100
small_example

In [None]:
my_deep_copy

In [None]:
my_shallow_copy

---
>A summary:
>
>* `.copy()` - creates a copy of the slice of the dataframe
>
>* `.copy(deep=False)` - updates to the original dataframe will show in the copy
---

---
<a id='section7'></a>
### 7. GroupBy()



#### How do we fill missing values for `total_vaccinations` according to the mean of each country?

#### How do we fill missing values for `daily_vaccinations` according to the mean of each country each month?

##### For this, we need to use groupby

[groupby documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)

#### Group according to something + select some columns + do something on the result

The `mean` of `daily_vaccinations` according to `location`:


In [None]:
vacc_df.groupby('location')[['daily_vaccinations']].mean()

Note that this format means `location` is now the index

Try running the below commands:

In [None]:
df_by_loc = vacc_df.groupby('location')[['daily_vaccinations']].mean()
#df_by_loc[['location']]   #this will result in an error
#df_by_loc[['daily_vaccinations']]   #this is OK

##### If you plan to continue using this data and need the index as an attribute:

##### Two possible solutions: 

##### set `as_index=False` 

##### add `reset_index(inplace = True)`

In [None]:
#vacc_df.groupby('location', as_index = False)[['daily_vaccinations']].mean()
df_by_loc.reset_index(inplace = True)
df_by_loc.head()
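
The two solutions can be compared side by side on a small made-up dataframe (the `toy` data is an assumption). Both should end up with `location` as a regular column:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "location": ["A", "A", "B"],
    "daily_vaccinations": [10.0, 20.0, 6.0],
})

# Default: the grouping column becomes the index
indexed = toy.groupby("location")[["daily_vaccinations"]].mean()

# Option 1: ask groupby to keep the grouping column as a column
flat_a = toy.groupby("location", as_index=False)[["daily_vaccinations"]].mean()

# Option 2: move the index back into a column afterwards
flat_b = indexed.reset_index()

print(indexed.columns.tolist(), flat_a.columns.tolist(), flat_b.columns.tolist())
```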

##### Groupby two or more columns is possible

For example: create a dataframe with the mean daily vaccinations per country per month

First, change the `date` into a `datetime` object and extract the month

In [None]:
vacc_df[['date']]

In [None]:
vacc_df['date'] = pd.to_datetime(vacc_df['date'])
vacc_df[['date']]

In [None]:
vacc_df['month'] = pd.DatetimeIndex(vacc_df['date']).month
vacc_df[['month','date']] 
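
An alternative way to extract the month is the `.dt` accessor, sketched below on a few made-up dates. Note that a month number alone does not distinguish between years, so for data that spans more than one year a year-month period may be a better grouping key (whether that matters here depends on the question being asked):

```python
import pandas as pd

# Hypothetical dates spanning two different years
dates = pd.to_datetime(pd.Series(["2021-01-15", "2021-02-01", "2022-01-31"]))

months_a = pd.DatetimeIndex(dates).month   # same approach as in the cell above
months_b = dates.dt.month                  # equivalent, using the .dt accessor

# Year and month together, so January 2021 and January 2022 stay separate
year_month = dates.dt.to_period("M")

print(list(months_a), months_b.tolist(), year_month.astype(str).tolist())
```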

Now, groupby both `location` and `month`

In [None]:
vacc_df.groupby(['location','month'])[['daily_vaccinations', 'total_vaccinations']].mean().reset_index()

What will happen if we switch the order of the grouping columns?

Try running the following:

In [None]:
vacc_df.groupby(['month', 'location'])[['daily_vaccinations', 'total_vaccinations']].mean().reset_index()

Still the same, but using a lambda function:

In [None]:
vacc_df.groupby(['location', 'month'])[['daily_vaccinations', 'total_vaccinations']].apply(lambda x: x.mean()).reset_index()

-----
##### So now we are ready to answer the questions:
##### How do we fill missing values for `total_vaccinations` according to the mean of each country?

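We now understand this line from section 3, which fills the missing values within each country separately:

In [None]:
# transform returns a result aligned with the original rows, so it can be assigned as a new column
vacc_df['total_vacc_no_missing'] = vacc_df.groupby('location')['total_vaccinations'].transform(lambda x: x.ffill())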

In [None]:
vacc_df[['total_vacc_no_missing','total_vaccinations']]
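
Filling with the mean of each country follows the same group-then-fill pattern. One possible sketch, on a made-up dataframe (the `toy` data and the `daily_filled` column name are assumptions), uses `transform('mean')` to build a column of group means and then passes it to `fillna`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per country
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "daily_vaccinations": [10.0, np.nan, 30.0, np.nan, 4.0],
})

# Mean of each country, repeated on every row of that country
group_mean = toy.groupby("location")["daily_vaccinations"].transform("mean")

# fillna with a Series fills each NaN with the value at the same index
toy["daily_filled"] = toy["daily_vaccinations"].fillna(group_mean)
print(toy)
```

To fill per country and per month, grouping by `['location', 'month']` instead of `'location'` should be enough.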

### (More) Advanced: create your own function

In [None]:
vacc_df.groupby('location')[['people_vaccinated_per_hundred']].apply(lambda x: x.max() - x.min()).reset_index()

### (More) Advanced: multiple functions using agg

In [None]:
vacc_group = vacc_df.groupby('location').agg({'daily_people_vaccinated': ['first', 'last' , 'mean', 'median', 'max'], 'total_vaccinations':['max']})
vacc_group.reset_index(inplace = True)
vacc_group

If you want to access the data without dealing with a multi-index, flatten the columns by dropping a level and then renaming them:

In [None]:
vacc_group.columns = vacc_group.columns.droplevel(0)
vacc_group

Change the column names:

In [None]:
vacc_group.columns = ['location','daily_first','daily_last','daily_mean','daily_median','daily_max','total_max']

vacc_group
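
Another way to avoid the multi-index altogether is named aggregation, where each output column is given a name up front. A minimal sketch on a made-up dataframe (the `toy` values are assumptions; the column names mirror the ones used above):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "location": ["A", "A", "B"],
    "daily_people_vaccinated": [1.0, 3.0, 5.0],
    "total_vaccinations": [10.0, 20.0, 7.0],
})

# new_column_name=(source_column, function)
summary = toy.groupby("location").agg(
    daily_first=("daily_people_vaccinated", "first"),
    daily_mean=("daily_people_vaccinated", "mean"),
    total_max=("total_vaccinations", "max"),
).reset_index()

print(summary)
```

The result already has flat, descriptive column names, so no `droplevel` or manual renaming is needed.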

---
>A summary:
>
>* `.groupby()` - group according to the columns specified
>
>* `.reset_index()` or  set `as_index=False` - adds the current index as a column, adds a new numerical index
>
>* `pd.to_datetime(df['date'])` - changes the attribute type to datetime
>
>* `pd.DatetimeIndex(df['date']).month` - extracts the month from the datetime attribute
>
>* `apply` - applies a function along an axis of the dataframe: to each column with `axis=0` (the default), or to each row with `axis=1` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply)
>
>* `lambda` - small anonymous function
>
>* `agg` - apply multiple functions at once, one for each specified column [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html)
---
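
The `axis` argument of `apply` is easy to mix up, so here is a small sketch on a made-up dataframe (the columns `x` and `y` are assumptions). With the default `axis=0` the function receives one column at a time; with `axis=1` it receives one row at a time:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# axis=0 (default): the lambda receives each column as a Series
per_column = toy.apply(lambda col: col.max() - col.min())

# axis=1: the lambda receives each row as a Series
per_row = toy.apply(lambda row: row["x"] + row["y"], axis=1)

print(per_column.to_dict())   # one result per column
print(per_row.tolist())       # one result per row
```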

#### This was a lot of information.

#### Keep your balance. Practice. You will make it.

<div>
<img src="images/balance.jpg" width="500"/>
</div>

Photo by <a href="https://unsplash.com/@martinsanchez?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Martin Sanchez</a> on <a href="https://unsplash.com/s/photos/perfect-balance?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  