
# Missing Values: Feature Engineering
Handling missing values with various methods

## What are the different types of Missing Data?


### Missing Completely at Random (MCAR)
 
A variable is missing completely at random (MCAR) if the probability of being missing is the same for all the observations.

When data is MCAR, there is absolutely no relationship between the data missing and any other values, observed or missing, within the dataset. 

In other words, the missing data points are a random subset of the data. Nothing systematic makes some values more likely to be missing than others.

In [None]:
# Example from the Titanic dataset (df is assumed to be already loaded):
# rows where Embarked is missing
df[df['Embarked'].isnull()]

### Missing Not at Random (MNAR)
Here the missingness is systematic: the probability that a value is missing is related to other values in the dataset, observed or missing, and typically to the missing value itself.

In [None]:
import numpy as np

# Flag rows where Cabin is missing (1 = missing, 0 = present)
df['cabin_null'] = np.where(df['Cabin'].isnull(), 1, 0)

# Fraction of rows with a missing Cabin value
df['cabin_null'].mean()

In [None]:
# Compare the missing-Cabin rate between survivors and non-survivors
df.groupby(['Survived'])['cabin_null'].mean()

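### Missing at Random (MAR)

Data are missing at random (MAR) when the probability of a value being missing depends on other observed variables, but not on the missing value itself.

In [None]:
# Example: missingness depends on an observed variable (gender)
# Men tend to hide their salary
# Women tend to hide their age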
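
A minimal sketch of what MAR looks like in data, using a synthetic table rather than the Titanic dataset. The column names (`gender`, `salary`), the missing rates (40% for men, 10% for women), and the variable names `demo` and `missing_rate` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic data (hypothetical, for illustration only)
demo = pd.DataFrame({
    "gender": rng.choice(["male", "female"], size=n),
    "salary": rng.normal(50_000, 10_000, size=n),
})

# MAR: the chance of a missing salary depends on the observed gender column,
# not on the salary value itself (assumed rates: 40% for men, 10% for women)
p_missing = np.where(demo["gender"] == "male", 0.4, 0.1)
demo.loc[rng.random(n) < p_missing, "salary"] = np.nan

# Missing rate of salary within each gender group
missing_rate = demo["salary"].isnull().groupby(demo["gender"]).mean()
print(missing_rate)
```

A clear gap between the group-wise missing rates is a hint that the missingness is related to the grouping variable.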

## Techniques for handling missing values

1. Mean/median/mode imputation
2. Random sample imputation
3. Capturing NaN values with a new feature
4. End of distribution imputation
5. Arbitrary value imputation
6. Frequent category imputation

#### Mean/Median/Mode imputation
When should we apply it?

Mean/median imputation assumes that the data are missing completely at random (MCAR). We replace the NaN values with the mean, the median, or the mode (the most frequent value) of the variable.

In [None]:
def impute_nan(df, variable, median):
    # df: DataFrame, variable: column name,
    # median: the replacement value (pass the mean or mode instead if preferred)
    df[variable + "_median"] = df[variable].fillna(median)
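
As a hedged illustration of how this imputation affects the spread of a variable, the sketch below applies median imputation to a small hand-made column. The helper name `impute_median` and the toy values are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def impute_median(df, variable):
    """Fill NaN in `variable` with its median, in a new '<variable>_median' column."""
    median = df[variable].median()  # computed on the observed values only
    df[variable + "_median"] = df[variable].fillna(median)
    return median

# Toy data (hypothetical): 5 observed ages and 3 missing ones
toy = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, np.nan]})

median_age = impute_median(toy, "Age")
std_before = toy["Age"].std()         # NaN values are skipped
std_after = toy["Age_median"].std()   # the imputed column has no NaN

print(median_age, std_before, std_after)
```

The standard deviation shrinks after imputation because every filled value sits exactly at the median. In a train/test workflow, the median would normally be computed on the training set and reused on the test set.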

**Advantages**

Easy to implement, and the median is robust to outliers

Fast way to obtain a complete dataset


**Disadvantages**

Distorts the original variance

Distorts correlations with the other variables

#### Random Sample Imputation
Aim: Random sample imputation takes random observations from the non-missing values of a variable and uses them to replace the NaN values.

When should it be used? It assumes that the data are missing completely at random (MCAR).

In [None]:
def impute_nan(df, variable):
    # Start from a copy of the original column
    df[variable + "_random"] = df[variable]
    # Draw one observed value per missing entry
    # (requires at least as many observed values as missing ones)
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
    # pandas aligns on the index, so give the sample the index of the missing rows
    random_sample.index = df[df[variable].isnull()].index
    df.loc[df[variable].isnull(), variable + '_random'] = random_sample
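
A small self-contained sketch of the same idea on toy data. The helper name `impute_random` and the sample values are assumptions. Unlike the function above, this variant samples with `replace=True`, so it also works when a column has more missing values than observed ones; that choice is an assumption of the sketch, not the notebook's method.

```python
import numpy as np
import pandas as pd

def impute_random(df, variable, random_state=0):
    """Fill NaN in `variable` with values drawn at random from its observed values."""
    filled = df[variable].copy()
    missing = filled.isnull()
    n_missing = int(missing.sum())
    # Sampling with replacement (assumption): avoids an error when NaN outnumber observed values
    sample = df[variable].dropna().sample(n_missing, replace=True, random_state=random_state)
    # Align the sampled values with the rows that are missing
    sample.index = filled[missing].index
    filled.loc[missing] = sample
    df[variable + "_random"] = filled

# Toy data (hypothetical)
toy = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan, 54.0, 2.0]})
impute_random(toy, "Age")
print(toy)
```

Because the filled values are drawn from the observed distribution, the spread of the imputed column stays closer to the original than with mean or median imputation.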

**Advantages**

Easy To implement

There is less distortion in variance

**Disadvantage**

Random sampling is not appropriate in every situation, for example when the data are not missing completely at random

#### Capturing NAN values with a new feature
This works well when the data are not missing completely at random, because the new feature lets the model use the missingness itself as a signal.



In [None]:
import numpy as np
df['Age_NAN']=np.where(df['Age'].isnull(),1,0)
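
The indicator is often paired with a simple imputation of the original column, so that no information about where the gaps were is lost. A minimal sketch on toy data, assuming median imputation as the companion step (the toy values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan, 54.0]})

# 1 where Age was missing, 0 otherwise; create this BEFORE filling the column
toy["Age_NAN"] = toy["Age"].isnull().astype(int)

# Then fill the original column (median chosen here as an assumption)
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

print(toy)
```

The order matters: once the column has been filled, the positions of the original NaN values can no longer be recovered.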

**Advantages**

Easy to implement

Captures the importance of missing values

**Disadvantages**

Creates additional features, which can contribute to the curse of dimensionality when many variables have missing values

#### End of Distribution imputation
End of distribution (end of tail) imputation is equivalent to arbitrary value imputation, except that the replacement value is selected automatically from the tail of the variable's distribution. If the variable is approximately normally distributed, we can use the mean plus or minus 3 times the standard deviation.

In [None]:
df.Age.hist(bins=50)

In [None]:
# Use a box plot to check for outliers
import seaborn as sns
sns.boxplot(x='Age', data=df)

In [None]:
# Value at the right end of the distribution: mean + 3 standard deviations
extreme = df.Age.mean() + 3 * df.Age.std()

In [None]:
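def impute_nan(df, variable, median, extreme):
    # New column: NaN replaced with the end-of-distribution value
    df[variable + "_end_distribution"] = df[variable].fillna(extreme)
    # Original column: NaN replaced with the median, for comparison
    # (assign back instead of using inplace=True on a column selection)
    df[variable] = df[variable].fillna(median)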

In [None]:
impute_nan(df,'Age',df.Age.median(),extreme)
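
To see the effect without the Titanic data, here is a hedged sketch on a synthetic, roughly normal `Age` column. The distribution parameters, the number of missing values, and the name `toy` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic ages (hypothetical): 200 draws from a normal distribution, 40 set to NaN
ages = rng.normal(30, 10, size=200)
ages[rng.choice(200, size=40, replace=False)] = np.nan
toy = pd.DataFrame({"Age": ages})

# mean() and std() skip NaN, so the cutoff is based on observed values only
extreme = toy["Age"].mean() + 3 * toy["Age"].std()

# Every missing value is placed at the far right tail
toy["Age_end_distribution"] = toy["Age"].fillna(extreme)

print(extreme)
print(toy["Age_end_distribution"].describe())
```

All 40 filled rows share the same value at the tail, which is what makes the missingness visible to a model and also what distorts the distribution.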

**Advantages**

- Easy to implement
- Fast way of obtaining complete datasets
- Can be integrated into production (during model deployment)
- Captures the importance of “missingness” if there is one

**Disadvantages**

- Distortion of the original variable distribution
- Distortion of the original variance
- Distortion of the covariance with the remaining variables of the dataset
- This technique may mask true outliers in the distribution

#### Arbitrary Value Imputation

In [None]:
def impute_nan(df, variable):
    # Replace NaN with fixed, arbitrarily chosen values (here 0 and 100)
    df[variable + '_zero'] = df[variable].fillna(0)
    df[variable + '_hundred'] = df[variable].fillna(100)

**Advantages**

Easy to implement

Captures the importance of missingness if there is one

**Disadvantages**

Distorts the original distribution of the variable

If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution

Hard to decide which value to use

#### Frequent Category Imputation

**How to Handle Categorical Missing Values**

In [None]:
# These two expressions give the same category counts:
# df['BsmtQual'].value_counts()
# df.groupby(['BsmtQual'])['BsmtQual'].count().sort_values(ascending=False)

In [None]:
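def impute_nan(df, variable):
    # mode() returns a Series; take the first (most frequent) category
    most_frequent_category = df[variable].mode()[0]
    # Assign back instead of using inplace=True on a column selection
    df[variable] = df[variable].fillna(most_frequent_category)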

In [None]:
for feature in ['BsmtQual','FireplaceQu','GarageType']:
    impute_nan(df,feature)
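
One side effect worth checking is how much the share of the most frequent category grows after imputation. The sketch below uses a tiny hand-made column; the helper name `impute_mode` and the toy values are assumptions and do not come from the housing dataset used above.

```python
import pandas as pd

def impute_mode(df, variable):
    """Fill NaN in `variable` with its most frequent category and return that category."""
    most_frequent = df[variable].mode()[0]
    df[variable] = df[variable].fillna(most_frequent)
    return most_frequent

# Toy data (hypothetical): 4 observed categories and 2 missing values
toy = pd.DataFrame({"GarageType": ["Attchd", "Detchd", None, "Attchd", None, "BuiltIn"]})

# Category shares among observed values, before and after imputation
share_before = toy["GarageType"].value_counts(normalize=True)
mode_value = impute_mode(toy, "GarageType")
share_after = toy["GarageType"].value_counts(normalize=True)

print(mode_value)
print(share_before[mode_value], share_after[mode_value])
```

The more values are missing, the larger this jump becomes, so comparing the shares before and after is a quick sanity check.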

**Advantages**

Easy to implement

Fast to implement

**Disadvantages**

Since we fill with the most frequent label, that label may become over-represented if there are many NaN values

It distorts the relationship between the most frequent label and the other variables

#### Adding a new category to capture NaN

In [None]:
# If many values are missing, or no single category clearly dominates,
# replace NaN with a new "Missing" category instead of the mode
def impute_nan(df, variable):
    df[variable + "newvar"] = np.where(df[variable].isnull(), "Missing", df[variable])
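
A short usage sketch on toy data. It uses `fillna("Missing")`, which gives the same result as the `np.where` version above for a column of strings; the column values and the `_newvar` suffix are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({"FireplaceQu": ["Gd", None, "TA", None]})

# Treat missingness as its own category, in a new column
toy["FireplaceQu_newvar"] = toy["FireplaceQu"].fillna("Missing")

print(toy["FireplaceQu_newvar"].value_counts())
```

Keeping the result in a new column leaves the original untouched, which makes it easy to compare this approach with frequent category imputation.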