<h1><font color = 'red' size = '10'>
<b>
Imputation or Handling Strategies for Missing Values
</b>
</font>
</h1>

<h1>

<ul>
<font color = 'red brown' size = '6'>
<b>

<li>
There are multiple ways of dealing with missing values in a column.
</li><br>

<li>
The common ways in which this can be done are listed here:

<ul>
<font color = 'red brown' size = '5'>
<li>
The simplest way is to delete the rows that have missing values; however, this also discards the valid values those rows hold in other columns.
</li><br>

<li>
Replace the missing values with a new value that does not otherwise occur in the
column, so that the affected rows remain identifiable.
</li><br>

<li>
Use an appropriate central value from the column (mean, median, or mode) to
replace the missing values.
</li><br>

<li>
Use a model (such as K-nearest neighbors or a Gaussian mixture model) to estimate
a suitable value with which to replace the missing values.
</li>
</font>
</ul>

</li><br>


</b>
</font>
</ul>
</h1>
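The four strategies listed above can be sketched on a small example. This is a minimal illustration, not part of the exercises: the `toy` DataFrame, its column names, the sentinel value `-1`, and the choice of `n_neighbors=2` are all assumptions made for the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data; this is not the earthquakes dataset.
toy = pd.DataFrame({
    "depth": [10.0, np.nan, 30.0, 40.0],
    "magnitude": [5.0, 6.0, np.nan, 7.0],
    "year": [2000.0, 2001.0, 2002.0, 2003.0],
})

# 1. Delete every row that has at least one missing value.
dropped = toy.dropna()

# 2. Replace missing values with a distinct sentinel (-1 is an assumed choice).
flagged = toy.fillna(-1)

# 3. Replace missing values with a central value (here, each column's median).
median_filled = toy.fillna(toy.median())

# 4. Let a model estimate the value from the 2 most similar rows.
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(dropped)
print(median_filled)
```

Deleting rows keeps only two of the four toy rows, which shows how quickly the first strategy can shrink a dataset.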

<h1><font color = 'red' size = '10'>
<b>
Exercise 1: Performing Imputation Using Pandas
</b>
</font>
</h1>

<b>Let's find the time-based (continuous) features that have at least one null value (month, day, hour, minute, and second) and replace their missing values with zeros. Where no value was recorded, we assume that the event took place at the beginning of the time period. Keep in mind that months and days are counted from 1, so a 0 in those two columns marks a value that was not recorded rather than a real calendar value.</b>

1. Read the earthquakes data into a pandas DataFrame.

In [None]:
import pandas as pd
import numpy as np
import missingno as msno
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# load the earthquakes dataset
data = pd.read_csv("/content/drive/MyDrive/Machine Learning Lectures/datasets/earthquake_data.csv")

In [None]:
# show first five data samples
data.head()

Unnamed: 0,id,flag_tsunami,year,month,day,hour,minute,second,focal_depth,eq_primary,...,longitude,region_code,injuries,injuries_description,damage_millions_dollars,damage_description,total_injuries,total_injuries_description,total_damage_millions_dollars,total_damage_description
0,338.0,No,1048.0,,,,,,,,...,,120,,,,,,,,
1,771.0,Tsu,1580.0,4.0,6.0,,,,33.0,6.2,...,1.309,120,,,,2.0,,,,
2,7889.0,Tsu,1757.0,7.0,15.0,,,,,,...,-6.32,120,,,,,,,,
3,6697.0,Tsu,1500.0,,,,,,,,...,,150,,,,,,,,
4,6013.0,Tsu,1668.0,4.0,13.0,,,,,,...,-71.05,150,,,,,,,,
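Before imputing, it can help to count how many values are missing in each column. The sketch below does this on a small stand-in frame; the `toy` DataFrame and its values are assumptions for illustration, and the same calls could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a few earthquake columns.
toy = pd.DataFrame({
    "year": [1048.0, 1580.0, 1757.0, 1500.0],
    "month": [np.nan, 4.0, 7.0, np.nan],
    "hour": [np.nan, np.nan, np.nan, np.nan],
})

# Number of missing entries in each column.
null_counts = toy.isna().sum()

# Share of missing entries in each column (between 0.0 and 1.0).
null_share = toy.isna().mean()

# Names of the columns that have at least one missing value.
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

print(null_counts)
print(cols_with_nulls)
```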


2. Create a list containing the names of the columns whose values we want to impute:

In [None]:
time_features = ['month', 'day', 'hour', 'minute', 'second']

3. Impute the null values using **.fillna()**. We replace the missing values in these columns with **0** by passing **0** as the argument to the pandas **.fillna()** method:

In [None]:
data[time_features] = data[time_features].fillna(0)
data[time_features]

Unnamed: 0,month,day,hour,minute,second
0,0.0,0.0,0.0,0.0,0.0
1,4.0,6.0,0.0,0.0,0.0
2,7.0,15.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0
4,4.0,13.0,0.0,0.0,0.0
...,...,...,...,...,...
6067,8.0,8.0,8.0,34.0,24.9
6068,12.0,22.0,1.0,2.0,2.4
6069,2.0,25.0,17.0,44.0,43.0
6070,7.0,9.0,5.0,19.0,7.3


4. Use the **.info()** method to view the non-null counts for the imputed columns. Every column should now report 6072 non-null entries:

In [None]:
data[time_features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6072 entries, 0 to 6071
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   month   6072 non-null   float64
 1   day     6072 non-null   float64
 2   hour    6072 non-null   float64
 3   minute  6072 non-null   float64
 4   second  6072 non-null   float64
dtypes: float64(5)
memory usage: 237.3 KB
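As a variation on the steps above, **.fillna()** also accepts a dict that maps column names to fill values, and **.isna().sum()** gives the remaining null counts directly. The following is a minimal sketch on made-up data; the `toy` frame and the `fill_values` dict are assumptions, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two time columns plus one column left untouched.
toy = pd.DataFrame({
    "month": [np.nan, 4.0],
    "day": [np.nan, 6.0],
    "focal_depth": [np.nan, 33.0],
})

# Fill values per column; columns not named in the dict are not changed.
fill_values = {"month": 0, "day": 0}
filled = toy.fillna(fill_values)

# Count the nulls that remain in each column.
remaining_nulls = filled.isna().sum()
print(remaining_nulls)
```

Only the columns named in the dict are filled, so `focal_depth` keeps its missing value in this sketch.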


<h1><font color = 'red' size = '10'>
<b>
Exercise 2: Performing Imputation Using Scikit-Learn
</b>
</font>
</h1>

<b>In this exercise, you will replace the null values in the description-related categorical features using scikit-learn's **SimpleImputer** class.</b>

1. Create a list containing the names of the columns whose values we want to impute:

In [None]:
description_features = ['injuries_description', 'damage_description', 'total_injuries_description', 'total_damage_description']

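2. Create an object of the **SimpleImputer** class. We create an **imp** object and initialize it with parameters that describe how to impute the data: **missing_values = np.nan** marks which entries count as missing, and **strategy = 'mean'** replaces each of them with the mean of its column. The description columns hold numeric category codes, so the mean generally falls between two codes (for example, 1.975537); **strategy = 'most_frequent'** is the option that keeps every imputed value a valid category.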

In [None]:
imp = SimpleImputer(missing_values = np.nan, strategy = 'mean')

3. Perform the imputation. We use **imp.fit_transform()** to perform the imputation. It takes the DataFrame with null values as input and returns the imputed values as a NumPy array, which we assign back to the same columns:

In [None]:
data[description_features] = imp.fit_transform(data[description_features])
data[description_features]

Unnamed: 0,injuries_description,damage_description,total_injuries_description,total_damage_description
0,1.975537,2.263693,1.973471,2.193139
1,1.975537,2.000000,1.973471,2.193139
2,1.975537,2.263693,1.973471,2.193139
3,1.975537,2.263693,1.973471,2.193139
4,1.975537,2.263693,1.973471,2.193139
...,...,...,...,...
6067,1.000000,4.000000,1.000000,4.000000
6068,1.975537,4.000000,1.973471,4.000000
6069,3.000000,4.000000,3.000000,4.000000
6070,2.000000,4.000000,2.000000,4.000000


4. Use the **.info()** method to view the non-null counts for the imputed columns. Every column should now report 6072 non-null entries:

In [None]:
data[description_features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6072 entries, 0 to 6071
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   injuries_description        6072 non-null   float64
 1   damage_description          6072 non-null   float64
 2   total_injuries_description  6072 non-null   float64
 3   total_damage_description    6072 non-null   float64
dtypes: float64(4)
memory usage: 189.9 KB
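For categorical codes, imputing with the most frequent value keeps every entry a valid category. The sketch below shows this alternative under stated assumptions: the `toy` frame and its values are made up, and `imp_mode` is a hypothetical name; it is not the imputer used in the exercise above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with numerically coded categories (values are made up).
toy = pd.DataFrame({
    "damage_description": [2.0, np.nan, 4.0, 4.0],
    "injuries_description": [1.0, 1.0, np.nan, 3.0],
})

# 'most_frequent' replaces each missing value with the column's mode.
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Wrap the result in a DataFrame so that column names and index are kept.
imputed = pd.DataFrame(
    imp_mode.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)
print(imputed)
```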
