## Feature Engineering


### Handling Missing Values

Missing values occur in a dataset when some information is not recorded for a variable.

There are three mechanisms by which data can go missing.



#### (i) Missing Completely at Random (MCAR)

If the data is MCAR, the missing values are randomly distributed throughout the dataset, and there is no systematic reason why they are missing.


#### (ii) Missing at Random (MAR)

If the data is MAR, the probability that a value is missing is systematically related to the observed data, but not to the missing values themselves.


#### (iii) Missing Not at Random (MNAR)

If the data is MNAR, the missingness is not random: it depends on the missing values themselves or on unobserved factors associated with them.
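
To make the three mechanisms concrete, here is a minimal simulation sketch. The toy `age`/`income` data, the 20% / 40% / 5% drop rates and all variable names are assumptions chosen for illustration; they are not part of the Titanic dataset used below.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical toy data: income rises with age.
toy = pd.DataFrame({"age": rng.integers(18, 80, size=n)})
toy["income"] = 1_000 * toy["age"] + rng.normal(0, 5_000, size=n)

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of a missing income depends only on the observed 'age'.
mar_mask = rng.random(n) < np.where(toy["age"] > 50, 0.4, 0.05)

# MNAR: the chance of a missing income depends on the income value itself.
high_income = toy["income"] > toy["income"].median()
mnar_mask = rng.random(n) < np.where(high_income, 0.4, 0.05)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = toy.loc[~mask, "income"].mean()
    print(f"{name}: share missing={mask.mean():.2f}, "
          f"mean of observed income={observed_mean:.0f}")
```

Under MCAR the mean of the observed incomes stays close to the true mean, whereas under MAR and MNAR in this toy setup it is pulled downwards because high values are dropped more often.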

### Loading and Exploring the Dataset

In [1274]:
### import all the necessary libraries

import seaborn as sns


### load the dataset

df = sns.load_dataset('titanic')

print(df)

     survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
0           0       3    male  22.0      1      0   7.2500        S   Third   
1           1       1  female  38.0      1      0  71.2833        C   First   
2           1       3  female  26.0      0      0   7.9250        S   Third   
3           1       1  female  35.0      1      0  53.1000        S   First   
4           0       3    male  35.0      0      0   8.0500        S   Third   
..        ...     ...     ...   ...    ...    ...      ...      ...     ...   
886         0       2    male  27.0      0      0  13.0000        S  Second   
887         1       1  female  19.0      0      0  30.0000        S   First   
888         0       3  female   NaN      1      2  23.4500        S   Third   
889         1       1    male  26.0      0      0  30.0000        C   First   
890         0       3    male  32.0      0      0   7.7500        Q   Third   

       who  adult_male deck  embark_town alive  alo

In [1275]:
### load the first five rows of the dataset

print(df.head())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  


In [1276]:
### load the last five rows of the dataset

print(df.tail())

     survived  pclass     sex   age  sibsp  parch   fare embarked   class  \
886         0       2    male  27.0      0      0  13.00        S  Second   
887         1       1  female  19.0      0      0  30.00        S   First   
888         0       3  female   NaN      1      2  23.45        S   Third   
889         1       1    male  26.0      0      0  30.00        C   First   
890         0       3    male  32.0      0      0   7.75        Q   Third   

       who  adult_male deck  embark_town alive  alone  
886    man        True  NaN  Southampton    no   True  
887  woman       False    B  Southampton   yes   True  
888  woman       False  NaN  Southampton    no  False  
889    man        True    C    Cherbourg   yes   True  
890    man        True  NaN   Queenstown    no   True  


In [1277]:
## display the information about the dataset

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
None


In [1278]:
### describe the summary statistics of the dataset (Numerical Columns)

df.describe()

       survived    pclass        age     sibsp     parch       fare
count     891.0     891.0      714.0     891.0     891.0      891.0
mean   0.383838  2.308642  29.699118  0.523008  0.381594  32.204208
std    0.486592  0.836071  14.526497  1.102743  0.806057  49.693429
min         0.0       1.0       0.42       0.0       0.0        0.0
25%         0.0       2.0     20.125       0.0       0.0     7.9104
50%         0.0       3.0       28.0       0.0       0.0    14.4542
75%         1.0       3.0       38.0       1.0       0.0       31.0
max         1.0       3.0       80.0       8.0       6.0   512.3292


In [1279]:
### describe the summary statistics of the dataset (Categorical Columns)

df.describe(include='object')

         sex  embarked  who  embark_town  alive
count    891       889  891          889    891
unique     2         3    3            3      2
top     male         S  man  Southampton     no
freq     577       644  537          644    549


In [1280]:
### describe the shape of the dataset

print(df.shape)

(891, 15)


In [1281]:
## To check for the missing values in the dataset

print(df.isnull().sum())

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64


### OBSERVATIONS:

1. The 'age' column has 177 NULL values and the 'deck' column has 688. The 'embarked' and 'embark_town' columns have 2 NULL values each; all other columns are complete.
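
Raw counts are easier to judge as a share of the rows. With 891 rows, 177 missing ages is roughly 19.9% and 688 missing decks is roughly 77.2%. The sketch below shows one way to compute such shares; the small `sample` frame is a hypothetical stand-in so the snippet runs without downloading the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a dataset with gaps; not the Titanic data.
sample = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [None, "C", None, None],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives a boolean frame; the mean of each boolean column
# is the share of missing values in that column.
missing_share = sample.isnull().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))
```

The same expression, `df.isnull().mean()`, should work unchanged on the Titanic DataFrame `df` loaded above.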

## Imputing Missing Values


### 1. Mean Value Imputation

In [1282]:
df['age'].isnull().sum()

np.int64(177)

### OBSERVATIONS:

1. The 'age' column contains a total of 177 NULL values.

In [1283]:
### Using mean value imputation, fill the NULL values of the 'age' column and store the result in a new column

df['agenewmean'] = df['age'].fillna(df['age'].mean())

In [1284]:
df['agenewmean'].isnull().sum()

np.int64(0)

### OBSERVATIONS:

1. Here the fillna() function replaces all the NULL values with the mean of the 'age' column, and the result is stored in a new column named 'agenewmean'.

In [1285]:
df[['agenewmean', 'age']]

     agenewmean   age
0     22.000000  22.0
1     38.000000  38.0
2     26.000000  26.0
3     35.000000  35.0
4     35.000000  35.0
..          ...   ...
886   27.000000  27.0
887   19.000000  19.0
888   29.699118   NaN
889   26.000000  26.0


### OBSERVATIONS:

1. Here we can see that the NaN values have been replaced by the mean of the 'age' column (for example, row 888).

In [1286]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
agenewmean       0
dtype: int64

In [1287]:
df[['age','agenewmean']]

      age  agenewmean
0    22.0   22.000000
1    38.0   38.000000
2    26.0   26.000000
3    35.0   35.000000
4    35.0   35.000000
..    ...         ...
886  27.0   27.000000
887  19.0   19.000000
888   NaN   29.699118
889  26.0   26.000000


### OBSERVATIONS:

1. The NULL values of the 'age' column (for example, row 888) appear as the mean age, 29.699118, in the 'agenewmean' column; the original 'age' column is left unchanged.
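
One side effect worth knowing: mean imputation keeps the column mean the same but reduces its spread, because every filled value sits exactly at the centre. The sketch below illustrates this on a hypothetical six-value series (not Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two gaps; not taken from the Titanic data.
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan, 50.0])

# Same pattern as above: fill the gaps with the mean of the observed values.
filled = ages.fillna(ages.mean())

print("mean before:", ages.mean(), "after:", filled.mean())
print("std  before:", round(ages.std(), 3), "after:", round(filled.std(), 3))
```

Comparing `df['age'].std()` with `df['agenewmean'].std()` on the Titanic data should show the same kind of shrinkage.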

### 2. Median Value Imputation

In [1288]:
### Median Value Imputation

df['agenewmedian'] = df['age'].fillna(df['age'].median())

In [1289]:
df['agenewmedian'].isnull().sum()

np.int64(0)

### OBSERVATIONS:

1. Here we have used fillna() function and replaced all the NULL Values with the median value of the 'age' column.

In [1290]:
df.isnull().sum()

survived          0
pclass            0
sex               0
age             177
sibsp             0
parch             0
fare              0
embarked          2
class             0
who               0
adult_male        0
deck            688
embark_town       2
alive             0
alone             0
agenewmean        0
agenewmedian      0
dtype: int64

In [1291]:
df[['age','agenewmedian']]

      age  agenewmedian
0    22.0          22.0
1    38.0          38.0
2    26.0          26.0
3    35.0          35.0
4    35.0          35.0
..    ...           ...
886  27.0          27.0
887  19.0          19.0
888   NaN          28.0
889  26.0          26.0


In [1292]:
### Check the missing-value counts again after adding 'agenewmean' and 'agenewmedian'

df.isnull().sum()

survived          0
pclass            0
sex               0
age             177
sibsp             0
parch             0
fare              0
embarked          2
class             0
who               0
adult_male        0
deck            688
embark_town       2
alive             0
alone             0
agenewmean        0
agenewmedian      0
dtype: int64

### OBSERVATIONS:

1. We can see that the NULL Value of the 'age' column has been replaced with the median value of the age in the agenewmedian column.
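
For 'age' the mean (29.699118) and the median (28.0) are close, so the two fills differ little. The median becomes the safer choice when a column has extreme values. The sketch below shows the difference on a hypothetical series with one outlier; the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical fares with one extreme value and one gap.
fares = pd.Series([7.0, 8.0, 9.0, 10.0, 500.0, np.nan])

mean_fill = fares.fillna(fares.mean())      # pulled up by the outlier
median_fill = fares.fillna(fares.median())  # stays with the typical values

print("mean  :", fares.mean(), "-> imputed value", mean_fill.iloc[-1])
print("median:", fares.median(), "-> imputed value", median_fill.iloc[-1])
```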

In [1293]:
df[['age','agenewmean','agenewmedian']]

      age  agenewmean  agenewmedian
0    22.0   22.000000          22.0
1    38.0   38.000000          38.0
2    26.0   26.000000          26.0
3    35.0   35.000000          35.0
4    35.0   35.000000          35.0
..    ...         ...           ...
886  27.0   27.000000          27.0
887  19.0   19.000000          19.0
888   NaN   29.699118          28.0
889  26.0   26.000000          26.0


### 3. Mode Imputation Technique (Categorical Values)

In [1294]:
df['embarked'].isnull().sum()

np.int64(2)

In [1295]:
### Inspect the rows where the 'embarked' column is NULL before performing mode imputation

df[df['embarked'].isnull()]

     survived  pclass     sex   age  sibsp  parch  fare  embarked  class  \
61          1       1  female  38.0      0      0  80.0       NaN  First
829         1       1  female  62.0      0      0  80.0       NaN  First

       who  adult_male  deck  embark_town  alive  alone  agenewmean  agenewmedian
61   woman       False     B          NaN    yes   True        38.0          38.0
829  woman       False     B          NaN    yes   True        62.0          62.0


In [1296]:
### Replace all the NULL Values of the embarked column with the mode value of embarked column


mode_embarked = df['embarked'].mode()[0]

print(mode_embarked)

S


In [1297]:
df['embarkedmode'] = df['embarked'].fillna(mode_embarked)

In [1298]:
df[df['embarked'].isnull()]

     survived  pclass     sex   age  sibsp  parch  fare  embarked  class  \
61          1       1  female  38.0      0      0  80.0       NaN  First
829         1       1  female  62.0      0      0  80.0       NaN  First

       who  adult_male  deck  embark_town  alive  alone  agenewmean  agenewmedian  embarkedmode
61   woman       False     B          NaN    yes   True        38.0          38.0             S
829  woman       False     B          NaN    yes   True        62.0          62.0             S
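
A note on `mode()[0]`: `Series.mode()` returns a Series rather than a single value, because several values can tie for most frequent. Indexing with `[0]` keeps the first of the sorted modes. The sketch below shows this on a hypothetical series with a deliberate tie; in the Titanic data 'S' is the only mode, so no tie arises there.

```python
import numpy as np
import pandas as pd

# Hypothetical port codes with a tie between 'C' and 'S' and one gap.
ports = pd.Series(["S", "C", "S", "C", "Q", np.nan])

modes = ports.mode()   # all most-frequent values, sorted; NaN is ignored
print(modes.tolist())

chosen = modes[0]      # [0] keeps the first of the tied values
filled = ports.fillna(chosen)
print(filled.tolist())
```

When a tie does occur, which value to use is a modelling decision; taking the first one is only a convention.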


In [1299]:
df[df['embarkedmode'].isnull()]

Empty DataFrame
Columns: [survived, pclass, sex, age, sibsp, parch, fare, embarked, class, who, adult_male, deck, embark_town, alive, alone, agenewmean, agenewmedian, embarkedmode]
Index: []


In [1300]:
df.isnull().sum()

survived          0
pclass            0
sex               0
age             177
sibsp             0
parch             0
fare              0
embarked          2
class             0
who               0
adult_male        0
deck            688
embark_town       2
alive             0
alone             0
agenewmean        0
agenewmedian      0
embarkedmode      0
dtype: int64

In [1301]:
df[['age','agenewmean','agenewmedian','embarked','embarkedmode']]

      age  agenewmean  agenewmedian  embarked  embarkedmode
0    22.0   22.000000          22.0         S             S
1    38.0   38.000000          38.0         C             C
2    26.0   26.000000          26.0         S             S
3    35.0   35.000000          35.0         S             S
4    35.0   35.000000          35.0         S             S
..    ...         ...           ...       ...           ...
886  27.0   27.000000          27.0         S             S
887  19.0   19.000000          19.0         S             S
888   NaN   29.699118          28.0         S             S
889  26.0   26.000000          26.0         C             C


### OBSERVATIONS:

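1. After mean, median and mode imputation, the new columns 'agenewmean', 'agenewmedian' and 'embarkedmode' contain no NULL values: each missing entry has been filled with the mean, median or mode of its source column. The original 'age' and 'embarked' columns are unchanged, and 'deck' and 'embark_town' were not imputed.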
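
The three techniques can be wrapped in one small helper. The function below is a hypothetical sketch, not part of pandas or seaborn: `impute_simple` and the `sample` frame are names invented here, and the choice of median for numeric columns and mode for everything else is an assumption rather than a rule.

```python
import numpy as np
import pandas as pd


def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: median for numeric columns, mode for the rest."""
    result = frame.copy()  # leave the caller's DataFrame untouched
    for column in result.columns:
        if not result[column].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(result[column]):
            fill_value = result[column].median()
        else:
            modes = result[column].mode()
            if modes.empty:
                continue  # column is entirely missing; leave it alone
            fill_value = modes[0]
        result[column] = result[column].fillna(fill_value)
    return result


# Hypothetical sample data; not the Titanic data.
sample = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "embarked": ["S", "C", None, "S"],
})
print(impute_simple(sample))
```

Unlike the cells above, this returns a filled copy instead of adding new columns, so the original values are only preserved in the input frame.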