## Handling Missing Data
### Ways To Handle Missing Values
-  Deleting Rows
-  Replacing With Mean/Median/Mode



In [6]:
import pandas as pd
import numpy as np


In [29]:
class preprocessing(object):
    """Helper methods for exploring missing-value strategies."""

    def load_data(self):
        # Read the Titanic training set from a local path
        data = pd.read_csv('D:/Datasets/titanic/train.csv')
        return data

    def missingValue_count(self, data):
        # Count the missing values in each column
        misVal = data.isnull().sum()
        print("missing val=", misVal)

    ## delete rows
    def delete_rows(self, data):
        # dropna() returns a new DataFrame, so the caller's data is unchanged
        data1 = data.dropna()
        print("missing value after removing missing val rows")
        print(data1.isnull().sum())

    ## fill value with mean, median, mode
    def fill_missingVal(self, data):
        fill_data = data.copy()
        age_mean = fill_data.Age.mean()
        print("mean value=", age_mean)
        # fillna() replaces every NaN in Age with the column mean
        # (np.NaN was removed in NumPy 2.0, so it is avoided here)
        print(fill_data['Age'].fillna(age_mean).head(25))
        
        

In [30]:

## load data

obj=preprocessing()
data=obj.load_data()
data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## missing value counts

In [24]:
obj.missingValue_count(data)

missing val= PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


### Deleting Rows
-  This method is commonly used to handle null values.
We either delete a row if it has a null value for a particular feature, or delete a column if more than 70-75% of its values are missing.
This method is advisable only when the data set has enough samples.
Make sure that deleting the data does not introduce bias.
Removing data also loses information, which can degrade the model's predictions.
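
The deletion rule described above can be sketched on a small toy frame. This is only an illustration: the values are made up (not taken from the Titanic file), and the `COL_THRESHOLD` name and its 0.7 cutoff are assumptions based on the 70-75% rule of thumb.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values, standing in for a real data set
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
})

COL_THRESHOLD = 0.7  # assumed cutoff for dropping a whole column

# Fraction of missing values in each column
missing_frac = df.isnull().mean()

# Drop columns whose missing fraction exceeds the cutoff
cols_to_drop = missing_frac[missing_frac > COL_THRESHOLD].index
reduced = df.drop(columns=cols_to_drop)

# Drop the remaining rows that still contain a missing value
cleaned = reduced.dropna()

# Share of the original rows lost to deletion
rows_lost = 1 - len(cleaned) / len(df)

print(missing_frac)
print("dropped columns:", list(cols_to_drop))
print("fraction of rows lost:", rows_lost)
```

Checking `rows_lost` before committing to deletion is one way to judge whether enough samples remain.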

In [17]:
obj.delete_rows(data)

missing value after removing missing val rows
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64


### Pros:
-  When enough complete rows remain, removing data with missing values can produce a robust model, since it is trained only on observed values
-  Deleting a row or column that carries little information costs little, since it has low weight in the model

### Cons:
-  Loss of information and data
-  Works poorly if a large share of the dataset (say 30%) has missing values

### Replacing With Mean/Median/Mode
-  This strategy can be applied to a feature that has numeric data, such as the age of a person or the ticket fare. We calculate the mean, median or mode of the feature and use it to replace the missing values.


-  This is an approximation, and it can distort the feature's distribution; mean imputation, for example, shrinks its variance. In exchange, it avoids the loss of data, which often yields better results than removing rows and columns.
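
As a minimal sketch of the three fill options, the snippet below imputes a small, made-up age series (hypothetical values, not the Titanic data) with its mean, median and mode. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([22.0, 38.0, np.nan, 35.0, 35.0, np.nan, 54.0], name="Age")

# mean() and median() skip NaN values by default
mean_fill = ages.fillna(ages.mean())
median_fill = ages.fillna(ages.median())

# mode() returns a Series because ties are possible; take the first value
mode_fill = ages.fillna(ages.mode().iloc[0])

print(pd.DataFrame({"mean": mean_fill, "median": median_fill, "mode": mode_fill}))
```

The median is usually the safer choice for a skewed feature, because a few extreme values pull the mean but not the median.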

In [18]:
data.Age.isnull().sum()

177

In [31]:
obj.fill_missingVal(data)

mean value= 29.69911764705882
0     22.000000
1     38.000000
2     26.000000
3     35.000000
4     35.000000
5     29.699118
6     54.000000
7      2.000000
8     27.000000
9     14.000000
10     4.000000
11    58.000000
12    20.000000
13    39.000000
14    14.000000
15    55.000000
16     2.000000
17    29.699118
18    31.000000
19    29.699118
20    35.000000
21    34.000000
22    15.000000
23    28.000000
24     8.000000
Name: Age, dtype: float64


### Pros:
-  This is a better approach when the data set is small
-  It prevents the data loss caused by removing rows and columns

### Cons:
-  Imputing approximations can bias the data and distort its variance
-  Often works worse than multiple-imputation methods
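
One way to see how mean imputation affects the spread of a feature is to compare its variance before and after filling. The sketch below uses a small, made-up series (hypothetical values); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([22.0, 38.0, np.nan, 35.0, 35.0, np.nan, 54.0])

# Fill every missing value with the mean of the observed values
imputed = ages.fillna(ages.mean())

# var() skips NaN by default, so this is the variance of the observed values
var_observed = ages.var()
var_imputed = imputed.var()

print("mean before/after:", ages.mean(), imputed.mean())
print("variance before/after:", var_observed, var_imputed)
```

The mean is preserved, but every filled value sits exactly at the mean, so the variance of the imputed series comes out smaller than that of the observed values.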