# Feature Engineering
Feature engineering uses domain knowledge to transform raw data into features, or to create new ones, so that our models can perform better.

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [2]:
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [4]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

From the summaries above, we can see that four columns have missing values: age (177), deck (688), embarked (2) and embark_town (2). We deal with these first because most of our algorithms cannot handle null values.
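
The share of missing values per column is often easier to compare than raw counts. A minimal sketch, assuming a small made-up frame (the `toy` data is hypothetical and only stands in for the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing entries
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the counts above, age is missing in 177 of 891 rows (about 20%) and deck in 688 of 891 rows (about 77%).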

# Handling Missing Values
We will use imputation to deal with missing values. Common choices for the fill value are:
* mean
* median
* mode
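
As a rough illustration of how the three strategies differ, the sketch below fills one gap in a made-up series that also contains a large outlier. The `ages` values are hypothetical and chosen only to make the difference visible.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry and one large outlier (80)
ages = pd.Series([20.0, 22.0, 22.0, 24.0, 80.0, np.nan])

# Each statistic is computed from the observed (non-null) values only
filled_mean = ages.fillna(ages.mean())
filled_median = ages.fillna(ages.median())
# mode() returns a Series because ties are possible; take the first entry
filled_mode = ages.fillna(ages.mode().iloc[0])

# The last element is the one that was imputed
print(filled_mean.iloc[-1], filled_median.iloc[-1], filled_mode.iloc[-1])
```

In this toy series the outlier pulls the mean well above the typical age, while the median and the mode stay at the most common value.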

In [5]:
age_mean = df.age.mean()
age_median = df.age.median()
age_mode = df.age.mode()

print(age_mean, age_median, age_mode)

29.69911764705882 28.0 0    24.0
Name: age, dtype: float64


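Above are the mean, median and mode of age, respectively. The mode is printed as a pandas Series because mode() can return more than one value. We will fill the missing values with the median, which is less sensitive to extreme ages than the mean.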

In [6]:
df.age = df.age.fillna(age_median)

In [7]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

As we can see, the missing age values have been dealt with. Next, embarked and embark_town are each missing 2 values. Since that is a very small share of the 891 records, we can drop those rows.

In [8]:
df = df.dropna(subset=["embarked", "embark_town"])
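
The subset argument limits the check to the listed columns, so rows that are missing only deck are kept. A minimal sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 lacks both port columns, row 2 lacks only "deck"
toy = pd.DataFrame({
    "embarked": ["S", np.nan, "C"],
    "embark_town": ["Southampton", np.nan, "Cherbourg"],
    "deck": ["C", "B", np.nan],
})

# Only NaN in the listed columns triggers a drop; NaN in "deck" is ignored
kept = toy.dropna(subset=["embarked", "embark_town"])
print(len(toy) - len(kept), "row(s) dropped")
```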

In [9]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
deck           688
embark_town      0
alive            0
alone            0
dtype: int64

Now only deck is left to deal with. Let's look at its values.

In [10]:
deck_values = df.deck.unique()
list(deck_values)

[nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F']

Deck is categorical, so we convert its letters to numbers that our algorithm can use. We will use a mapping function for this, with 0 standing in for a missing deck.

In [11]:
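def deck_mapper(x):
    # Map deck letters A-G to 1-7; a missing deck becomes 0
    deck_dict = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}
    if pd.isna(x):  # NaN is truthy, so a plain "if x" would not catch it
        return 0
    return deck_dict[x.upper()]

# deck is categorical; cast to object so the mapper is also called on NaN entries
df.deck = df.deck.astype(object).apply(deck_mapper)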
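
The same encoding can also be sketched without a helper function, using Series.map with a dictionary. This is an alternative, not the approach used in the cell above, and the `decks` series is hypothetical toy data:

```python
import pandas as pd

deck_dict = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}

# Hypothetical categorical column with one missing entry, like deck
decks = pd.Series(pd.Categorical(["C", None, "E", "A"]))

# Cast to object, look each letter up in the dict (NaN stays NaN),
# then use 0 for missing decks and store the result as integers
encoded = decks.astype(object).map(deck_dict).fillna(0).astype(int)
print(encoded.tolist())
```

With a dictionary lookup, any value that is not a key also becomes NaN and therefore ends up as 0.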

In [12]:
df.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64
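
The zero counts above can also be checked programmatically, which is handy when the cleaning steps are rerun. A minimal sketch, where `columns_with_missing` is a hypothetical helper and both frames are toy data:

```python
import numpy as np
import pandas as pd

def columns_with_missing(frame):
    """Return the names of columns that still contain at least one NaN."""
    counts = frame.isnull().sum()
    return list(counts[counts > 0].index)

# Hypothetical toy frames, one clean and one with a gap
clean = pd.DataFrame({"age": [22.0, 38.0], "deck": [0, 3]})
dirty = pd.DataFrame({"age": [22.0, np.nan], "deck": [0, 3]})

print(columns_with_missing(clean))
print(columns_with_missing(dirty))
```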

As we can see, there are no missing values left. Let's move on to the next step.