##### Documented by: Noura Medhat

# How to handle missing numerical values in your dataset?

## Before going deeper, we need to know what "missing values in a dataset" means, so here is an example of missing data.

#### Suppose we have the following dataset

In [18]:
import pandas as pd
data = pd.read_csv('/home/noura/Documents/hp.csv')
data

Unnamed: 0,area,price
0,2600,550000.0
1,3000,565000.0
2,3200,
3,3600,680000.0
4,4000,725000.0


#### In the 3rd row (index 2), the price is missing. pandas labels such a value "NaN", which stands for "Not a Number". That is what we mean by a missing value in a dataset.
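Before fixing missing values, it helps to find them. The sketch below rebuilds the example table by hand (an assumption, so that it runs without `hp.csv`) and counts the missing cells with `isna()`.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the example table, so no CSV file is needed.
data = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "price": [550000.0, 565000.0, np.nan, 680000.0, 725000.0],
})

# isna() marks each missing cell with True; sum() counts them per column.
missing_per_column = data.isna().sum()
print(missing_per_column)

# Keep only the rows that contain at least one missing value.
rows_with_missing = data[data.isna().any(axis=1)]
print(rows_with_missing)
```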

## Why do we need to handle missing values?
### Missing values can reduce the precision of an analysis, and many ML algorithms raise an error if the dataset contains them.
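As one illustration of an algorithm failing, the sketch below tries to fit scikit-learn's `LinearRegression` on the example prices. Choosing this estimator is an assumption made for the demonstration; other estimators may behave differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same numbers as the example table: area is the feature, price the target.
X = np.array([[2600], [3000], [3200], [3600], [4000]], dtype=float)
y = np.array([550000.0, 565000.0, np.nan, 680000.0, 725000.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates its inputs and rejects NaN in the target.
    fit_failed = True
    print("fit() refused the data:", err)
```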




## Now, how can we handle missing values?

## Handling numerical missing values
### 1st method: Deleting the missing values. Easy peasy lemon squeezy, right? But this is not recommended, because it throws away data you already have :)
#### This can be done by deleting either the entire row or the entire column that contains the missing value.

In [3]:
#deleting the entire row
train1 = data.dropna(axis=0)
train1

Unnamed: 0,area,price
0,2600,550000.0
1,3000,565000.0
3,3600,680000.0
4,4000,725000.0


In [4]:
#deleting the entire column
train2 = data.dropna(axis=1)
train2

Unnamed: 0,area
0,2600
1,3000
2,3200
3,3600
4,4000


#### NOTE: "dropna" is a pandas DataFrame method that removes the rows (axis=0) or the columns (axis=1) that contain missing values.
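`dropna` also accepts parameters that narrow down what gets removed. A minimal sketch, using a hand-built copy of the example table (an assumption, so it runs without the CSV file):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "price": [550000.0, 565000.0, np.nan, 680000.0, 725000.0],
})

# subset: drop a row only when "price" itself is missing.
rows_with_price = data.dropna(subset=["price"])

# how="all": drop a column only when every value in it is missing.
# No column is completely empty here, so nothing is removed.
no_empty_columns = data.dropna(axis=1, how="all")

print(rows_with_price)
print(no_empty_columns)
```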


### 2nd method: Replacing the missing value with the mean (the most commonly used method)

In [5]:
import numpy as np
#calculating the mean
train3 = data.copy()
mean_price = np.floor(train3.price.mean())
mean_price

630000.0

In [6]:
#assigning the value of the mean to the missing value
train3.price = train3.price.fillna(mean_price)
train3

Unnamed: 0,area,price
0,2600,550000.0
1,3000,565000.0
2,3200,630000.0
3,3600,680000.0
4,4000,725000.0


#### NOTE: "fillna" is a pandas method that replaces each missing value with the value passed to it as a parameter.
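`fillna` can also take a dictionary that maps column names to fill values, and it returns a new object instead of changing the original. A small sketch on a hand-built copy of the example table (an assumption, so it runs on its own):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "price": [550000.0, 565000.0, np.nan, 680000.0, 725000.0],
})

# mean() skips NaN by default, so this averages the four known prices.
fill_values = {"price": data["price"].mean()}

# fillna returns a filled copy; `data` itself still holds the NaN.
filled = data.fillna(fill_values)
print(filled)
```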

### 3rd method: Replacing the missing value with the median

In [7]:
#calculating the median
train4 = data.copy()
median_price = np.floor(train4.price.median())
median_price

622500.0

In [8]:
#assigning the value of the median to the missing value
train4.price = train4.price.fillna(median_price)
train4

Unnamed: 0,area,price
0,2600,550000.0
1,3000,565000.0
2,3200,622500.0
3,3600,680000.0
4,4000,725000.0
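
To see where 622500.0 comes from: with an even number of known values, the median is the average of the two middle ones. A short worked check, with the four known prices typed in by hand:

```python
import numpy as np

# The four known prices from the example table.
prices = np.array([550000.0, 565000.0, 680000.0, 725000.0])

# Sort, then average the two middle values (positions 1 and 2 of 0..3).
sorted_prices = np.sort(prices)
manual_median = (sorted_prices[1] + sorted_prices[2]) / 2

print(manual_median)      # 622500.0
print(np.median(prices))  # same value from NumPy
```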



### 4th method: Replacing the missing value with the mode

In [9]:
#calculating the mode
train5 = data.copy()
mode_price = np.floor(train5.price.mode()[0])
mode_price

550000.0

In [10]:
#assigning the value of the mode to the missing value
train5.price = train5.price.fillna(mode_price)
train5

Unnamed: 0,area,price
0,2600,550000.0
1,3000,565000.0
2,3200,550000.0
3,3600,680000.0
4,4000,725000.0


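#### Why mode()[0]? Because mode() returns a Series (several values can tie for most frequent), so [0] selects the first of them. Here every price occurs exactly once, so all four prices tie and [0] simply picks the smallest. More details: https://community.dataquest.io/t/why-mode-0-not-just-mode/5057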
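
A quick way to see this is to print what `mode()` returns. The second Series below is a made-up example with a genuine tie, added only for illustration.

```python
import numpy as np
import pandas as pd

prices = pd.Series([550000.0, 565000.0, np.nan, 680000.0, 725000.0])

# Every known price appears once, so mode() returns all four, sorted.
modes = prices.mode()
first_mode = modes[0]
print(modes)
print(first_mode)  # 550000.0

# Hypothetical data where 2.0 and 3.0 tie for most frequent.
tied = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0]).mode()
print(tied)
```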



### 5th method: Replacing the missing value with the previous value

In [11]:
#Forward-fill (to fill with previous value)
train6 = data.ffill(axis = 0)
train6

Unnamed: 0,area,price
0,2600,550000.0
1,3000,565000.0
2,3200,565000.0
3,3600,680000.0
4,4000,725000.0


#### NOTE: "ffill" stands for "forward-fill". This pandas method fills each missing value with the last valid value that comes before it.
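
One consequence worth knowing: forward-fill has nothing to copy when the very first value is missing. A small sketch on a made-up Series (hypothetical values, not the CSV data):

```python
import numpy as np
import pandas as pd

# Hypothetical prices where the first entry is missing too.
s = pd.Series([np.nan, 565000.0, np.nan, 680000.0])

filled = s.ffill()
print(filled)
# Position 2 takes the value from position 1,
# but position 0 stays NaN because nothing precedes it.
```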


### 6th method: Replacing the missing value with the next value

In [12]:
#Backward-fill (to fill with next value)
train7 = data.bfill(axis = 0)
train7

Unnamed: 0,area,price
0,2600,550000.0
1,3000,565000.0
2,3200,680000.0
3,3600,680000.0
4,4000,725000.0


#### NOTE: "bfill" stands for "backward-fill". This pandas method fills each missing value with the next valid value that comes after it.
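
Backward-fill cannot fill a missing value at the very end, since nothing follows it. One possible workaround, sketched on made-up values, is to chain a forward-fill after the backward-fill:

```python
import numpy as np
import pandas as pd

# Hypothetical prices with gaps at the start, middle, and end.
s = pd.Series([np.nan, 565000.0, np.nan, 680000.0, np.nan])

# bfill alone leaves the trailing NaN in place.
back = s.bfill()

# Chaining ffill afterwards copies the last valid value into the tail.
both = s.bfill().ffill()

print(back)
print(both)
```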


### 7th method: Using Interpolation
#### a) Linear Interpolation
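##### Linear interpolation is a form of interpolation, that is, the generation of new values based on an existing set of values. It estimates a missing value by drawing a straight line between the known values on either side of it. By default, pandas treats the rows as equally spaced, so a single gap is filled with the average of its two neighbours.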

In [13]:
train8 = data.interpolate()
train8

Unnamed: 0,area,price
0,2600,550000.0
1,3000,565000.0
2,3200,622500.0
3,3600,680000.0
4,4000,725000.0
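
The default ignores the `area` column when filling `price`. If the straight line should instead follow `area`, one option is to make `area` the index and use `method="index"`. A sketch on a hand-built copy of the table (an assumption, so it runs without the CSV file):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "price": [550000.0, 565000.0, np.nan, 680000.0, 725000.0],
})

# Default: rows are treated as equally spaced.
# (565000 + 680000) / 2 = 622500
by_position = data["price"].interpolate()

# method="index": the line is drawn against the area values.
# 565000 + (3200 - 3000) / (3600 - 3000) * (680000 - 565000) ≈ 603333.33
by_area = data.set_index("area")["price"].interpolate(method="index")

print(by_position[2])
print(by_area.loc[3200])
```

Which of the two is more sensible depends on whether the row order or the `area` value better describes how the price changes.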


#### b) Polynomial Interpolation
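##### Polynomial interpolation fills missing values by fitting a polynomial through the available data points and evaluating it at the missing positions. In pandas, the "order" argument sets the degree of the polynomial, and this method requires SciPy to be installed.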

In [14]:
train9 = data.interpolate(method="polynomial", order=2)
train9

Unnamed: 0,area,price
0,2600,550000.0
1,3000,565000.0
2,3200,617500.0
3,3600,680000.0
4,4000,725000.0
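
To illustrate the idea of a polynomial passing through the known points, the sketch below uses NumPy's `polyfit` on the row positions. This is a stand-in for illustration, not the routine pandas calls internally; with four known points, a degree-3 polynomial passes through all of them exactly.

```python
import numpy as np

# Row positions and prices of the four known points (row 2 is missing).
x_known = np.array([0.0, 1.0, 3.0, 4.0])
y_known = np.array([550000.0, 565000.0, 680000.0, 725000.0])

# Four points determine a unique polynomial of degree 3.
coeffs = np.polyfit(x_known, y_known, deg=3)

# Evaluate that polynomial at the missing row position.
estimate = np.polyval(coeffs, 2.0)
print(round(estimate, 2))  # about 617500.0 for this data
```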


#### c) Interpolation through Padding
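##### Interpolation through padding means filling each missing value with the value present above it in the dataset, which gives the same result as "ffill". The "limit" argument caps how many consecutive missing values are filled. Recent pandas versions may warn that method="pad" is deprecated in favour of "ffill".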

In [15]:
train10 = data.interpolate(method="pad", limit=2)
train10

Unnamed: 0,area,price
0,2600,550000.0
1,3000,565000.0
2,3200,565000.0
3,3600,680000.0
4,4000,725000.0
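
The example table has only one gap, so the effect of `limit` is not visible there. The sketch below uses a made-up Series with three consecutive gaps, and uses `ffill(limit=...)` in place of `method="pad"` (a substitution made here for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical prices with three missing values in a row.
s = pd.Series([565000.0, np.nan, np.nan, np.nan, 725000.0])

# limit=2: only the first two gaps after a valid value are filled.
capped = s.ffill(limit=2)
print(capped)
# Positions 1 and 2 become 565000.0; position 3 stays NaN.
```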



### 8th method: Using KNNImputer
##### KNNImputer: https://machinelearningmastery.com/knn-imputation-for-missing-values-in-machine-learning/

In [23]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=1, weights="uniform")
imputer.fit_transform(data)

array([[  2600., 550000.],
       [  3000., 565000.],
       [  3200., 565000.],
       [  3600., 680000.],
       [  4000., 725000.]])
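
`fit_transform` returns a NumPy array, so the column names are lost. The sketch below wraps the result back into a DataFrame and uses `n_neighbors=2` to show how the neighbours are averaged. The table is rebuilt by hand, and the choice of 2 neighbours is only an assumption for the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

data = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "price": [550000.0, 565000.0, np.nan, 680000.0, 725000.0],
})

# With 2 neighbours and uniform weights, the missing price becomes the
# plain average of the prices of the two rows closest in area
# (areas 3000 and 3600): (565000 + 680000) / 2 = 622500.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputed = pd.DataFrame(
    imputer.fit_transform(data),
    columns=data.columns,
    index=data.index,
)
print(imputed)
```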

