## Feature Engineering

Load the Titanic data set:

In [None]:
import numpy as np
import pandas as pd
from seaborn import load_dataset

df=load_dataset('titanic')

df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
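
The non-null counts above show that `deck` is populated in only 203 of the 891 rows. One way to make such gaps explicit is to compute the fraction of missing values per column. The sketch below does this on a small made-up frame (`toy` and its values are hypothetical stand-ins, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the real data set
toy = pd.DataFrame({
    'age':  [22.0, np.nan, 26.0, 35.0],
    'deck': [np.nan, 'C', np.nan, np.nan],
    'fare': [7.25, 71.28, 7.92, 53.10],
})

# isna() gives a boolean mask; its column-wise mean is the missing fraction
missing_frac = toy.isna().mean().sort_values(ascending=False)
print(missing_frac)
```

Columns near the top of this ranking are candidates for dropping, while columns with only a few gaps are usually better handled by dropping rows or imputing.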


Let's drop the `deck` column because it contains too many missing values.

In [None]:
df.drop(labels=['deck'], axis=1, inplace=True)
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive',
       'alone'],
      dtype='object')

Now, let's drop all rows containing missing values.

In [None]:
df.dropna(inplace=True)
df.isna().sum().sum()

0

### Boolean values

Note that some of the features are stored as booleans or strings, but most machine learning algorithms can only work with numerical data. It is therefore important to convert the information in all non-numerical columns into a numerical form. This is an important part of *pre-processing*, that is, casting the data into a form ready to be fed into a machine learning algorithm. For example, the `adult_male` column is boolean. We can convert it to an integer type as follows:

In [None]:
df['adult_male_int']=df['adult_male'].astype(np.uint8)
df[['adult_male', 'adult_male_int']].head()

Unnamed: 0,adult_male,adult_male_int
0,True,1
1,False,0
2,False,0
3,False,0
4,True,1
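
When a frame has several boolean columns, they can be found and converted in one step instead of one by one. The following is a minimal sketch on a hypothetical frame (`toy` is made up for illustration); it uses `select_dtypes` to locate the boolean columns and `astype` with a dictionary to cast only those columns:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two boolean columns and one numerical column
toy = pd.DataFrame({
    'adult_male': [True, False, False],
    'alone':      [False, False, True],
    'fare':       [7.25, 71.28, 7.92],
})

# Names of all boolean columns
bool_cols = toy.select_dtypes(include='bool').columns

# Cast only those columns; every other column keeps its dtype
toy = toy.astype({col: np.uint8 for col in bool_cols})
print(toy.dtypes)
```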


### Encoding labels

The `alive` column contains information about the survival status of a given passenger.

In [None]:
df['alive'].unique()

array(['no', 'yes'], dtype=object)

Note that this information is given in textual form, as `'yes'` or `'no'`; the type of this column is `object`, as can be seen from the output of the `info()` method.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   survived        712 non-null    int64   
 1   pclass          712 non-null    int64   
 2   sex             712 non-null    object  
 3   age             712 non-null    float64 
 4   sibsp           712 non-null    int64   
 5   parch           712 non-null    int64   
 6   fare            712 non-null    float64 
 7   embarked        712 non-null    object  
 8   class           712 non-null    category
 9   who             712 non-null    object  
 10  adult_male      712 non-null    bool    
 11  embark_town     712 non-null    object  
 12  alive           712 non-null    object  
 13  alone           712 non-null    bool    
 14  adult_male_int  712 non-null    uint8   
dtypes: bool(2), category(1), float64(2), int64(4), object(5), uint8(1)
memory usage: 69.6+ KB


The `alive` column is a great example of *categorical* data; this kind of data can only take on a limited, and usually fixed, number of possible values. Categorical data are ubiquitous in machine learning, and it is very often necessary to convert them to a numerical form. There are several ways to do this. For example, when there are just two categories, as in the `alive` column, we can use a standard NumPy boolean comparison:

In [None]:
df['alive_int']=(df['alive'].values=='yes').astype(np.uint8)
df[['survived', 'alive', 'alive_int']].head()

Unnamed: 0,survived,alive,alive_int
0,0,no,0
1,1,yes,1
2,1,yes,1
3,1,yes,1
4,0,no,0
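
Another option, sketched here on a hypothetical series rather than the real column, is to spell out the mapping with `Series.map`. This makes the assignment of integers to categories explicit, and any value missing from the dictionary becomes `NaN`, which can help reveal unexpected categories:

```python
import pandas as pd

# Hypothetical stand-in for the `alive` column
alive = pd.Series(['no', 'yes', 'yes', 'no'])

# Explicit category -> integer mapping
alive_int = alive.map({'no': 0, 'yes': 1})
print(alive_int.tolist())
```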


Alternatively, we can use the scikit-learn `LabelEncoder()` class to achieve a similar result.

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
from sklearn.preprocessing import LabelEncoder

le=LabelEncoder()
df['alive_le']=le.fit_transform(df['alive'])
le.classes_

array(['no', 'yes'], dtype=object)

In [None]:
df[['survived', 'alive', 'alive_int', 'alive_le']].head()

Unnamed: 0,survived,alive,alive_int,alive_le
0,0,no,0,0
1,1,yes,1,1
2,1,yes,1,1
3,1,yes,1,1
4,0,no,0,0
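
A fitted `LabelEncoder` also remembers the mapping, so integer codes (for example, model predictions) can be converted back to the original strings with `inverse_transform`. A minimal sketch on a hypothetical list of labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the `alive` column
labels = ['no', 'yes', 'yes', 'no']

le = LabelEncoder()
codes = le.fit_transform(labels)       # classes are sorted, so 'no' -> 0 and 'yes' -> 1
decoded = le.inverse_transform(codes)  # map the integer codes back to strings

print(codes, decoded)
```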


### Ordinal encoding of features

The `alive` column plays the role of the target for this data set because its content is directly related to our labels. In the example above, we applied `LabelEncoder()` to a single column, which happened to be our column of labels. But what if you need to perform a similar encoding for multiple columns, say, ones containing information about several features? One option in this case is to use the `OrdinalEncoder()` class, which can encode multiple columns at once.

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
from sklearn.preprocessing import OrdinalEncoder

cat_feats=['sex', 'embarked', 'class', 'who', 'embark_town']
cat_feats_int=[feat+'_int' for feat in cat_feats]

oe=OrdinalEncoder(dtype=np.uint8)
df[cat_feats_int]=oe.fit_transform(df[cat_feats])

oe.categories_

[array(['female', 'male'], dtype=object),
 array(['C', 'Q', 'S'], dtype=object),
 array(['First', 'Second', 'Third'], dtype=object),
 array(['child', 'man', 'woman'], dtype=object),
 array(['Cherbourg', 'Queenstown', 'Southampton'], dtype=object)]

In [None]:
df[cat_feats+cat_feats_int].head()

Unnamed: 0,sex,embarked,class,who,embark_town,sex_int,embarked_int,class_int,who_int,embark_town_int
0,male,S,Third,man,Southampton,1,2,2,1,2
1,female,C,First,woman,Cherbourg,0,0,0,2,0
2,female,S,Third,woman,Southampton,0,2,2,2,2
3,female,S,First,woman,Southampton,0,2,0,2,2
4,male,S,Third,man,Southampton,1,2,2,1,2
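
By default, `OrdinalEncoder()` sorts the categories of each column, which for strings means alphabetical order. For `class`, alphabetical order happens to coincide with First, Second, Third, but that will not hold in general. The sketch below uses a hypothetical `size` column to show how the `categories` parameter can impose an explicit order:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column whose natural order differs from alphabetical order
toy = pd.DataFrame({'size': ['medium', 'small', 'large', 'small']})

# Default: sorted categories, i.e. large=0, medium=1, small=2
default_codes = OrdinalEncoder().fit_transform(toy)

# Explicit order (one list per encoded column): small=0, medium=1, large=2
oe = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = oe.fit_transform(toy)

print(default_codes.ravel(), ordered_codes.ravel())
```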


### One-hot encoding of features

The output of `OrdinalEncoder()` is a set of columns in which every category is assigned a numerical value (typically an integer). This works well for columns that contain only two categories, such as the `sex` column. For columns with more than two categories, however, it may or may not be appropriate. The main problem is that, once the categories are converted to numbers, machine learning algorithms typically treat those numbers algebraically, so the order of the numerical values encoding the different categories becomes extremely important.

Sometimes this is what we want. For example, there is a clear sense of order in the `class` column of the Titanic data set: the first class is better than the second, and the second class is better than the third, in many respects that are relevant for the classification model we are trying to build. To name just two, the first class passengers had easier and faster access to the lifeboats, and the first class area was not as crowded as those for the second and third classes. On the other hand, there is no clear sense of mathematical order for the categories in the `who` or `embark_town` columns: for those columns, the integers representing the different categories are just labels whose relative sizes carry no meaning. For this reason, it might be counterproductive to use ordinal encoding for those columns.

For categories that do not represent any ordinal relationship, we typically use so-called *one-hot encoding*, in which every category is represented by a new binary column. Let's take a look at an example illustrating how one-hot encoding works. There are a couple of ways to do it. Let's start with the Pandas library.

In [None]:
# https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

one_hot=pd.get_dummies(df[['who', 'embark_town']])
one_hot.head()

Unnamed: 0,who_child,who_man,who_woman,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,0,1,0,0,0,1
1,0,0,1,1,0,0
2,0,0,1,0,0,1
3,0,0,1,0,0,1
4,0,1,0,0,0,1


Alternatively, we can use the `OneHotEncoder()` class of the scikit-learn library.

In [None]:
from sklearn.preprocessing import OneHotEncoder

oh=OneHotEncoder(dtype=np.uint8)

df_oh=oh.fit_transform(df[['who', 'embark_town']])
type(df_oh)

scipy.sparse.csr.csr_matrix

In [None]:
oh.categories_

[array(['child', 'man', 'woman'], dtype=object),
 array(['Cherbourg', 'Queenstown', 'Southampton'], dtype=object)]

In [None]:
oh_names_who=['who_'+cat for cat in oh.categories_[0]]
oh_names_embark_town=['embark_town_'+cat for cat in oh.categories_[1]]

oh_names=oh_names_who+oh_names_embark_town
oh_names

['who_child',
 'who_man',
 'who_woman',
 'embark_town_Cherbourg',
 'embark_town_Queenstown',
 'embark_town_Southampton']

In [None]:
pd.DataFrame(df_oh.toarray(), columns=oh_names).head()

Unnamed: 0,who_child,who_man,who_woman,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,0,1,0,0,0,1
1,0,0,1,1,0,0
2,0,0,1,0,0,1
3,0,0,1,0,0,1
4,0,1,0,0,0,1


### Pre-processing categorical and numerical features

It often becomes necessary to pre-process the numerical and categorical features separately. The scikit-learn library makes this possible with the `ColumnTransformer()` class.

In [None]:
# reset the data set
df=load_dataset('titanic')

num_feats=['sibsp', 'age', 'fare']
cat_feats=['sex', 'class', 'embark_town']
all_feats=num_feats+cat_feats+['survived']

df1 = df[all_feats]
df1.head()

Unnamed: 0,sibsp,age,fare,sex,class,embark_town,survived
0,1,22.0,7.25,male,Third,Southampton,0
1,1,38.0,71.2833,female,First,Cherbourg,1
2,0,26.0,7.925,female,Third,Southampton,1
3,1,35.0,53.1,female,First,Southampton,1
4,0,35.0,8.05,male,Third,Southampton,0


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

num_pipe=Pipeline([('imp', SimpleImputer(strategy='mean')),
                   ('scale', StandardScaler())
                   ])

cat_pipe=Pipeline([('imp', SimpleImputer(strategy='most_frequent')), 
                   ('oh', OneHotEncoder(dtype=np.uint8))
                   ])

col_transf=ColumnTransformer([
                              ('num', num_pipe, [0, 1, 2]), 
                              ('cat', cat_pipe, [3, 4, 5])
                              ])

full_pipe=Pipeline([
                   ('transf', col_transf),
                   ('model', LogisticRegression())
                   ])

X=df1.iloc[:, :-1].values
y=df1.iloc[:, -1].values

# The following is just for demonstration purposes,
# so we will not make a train-test split
full_pipe.fit(X, y)
y_pred=full_pipe.predict(X)

acc=accuracy_score(y, y_pred)

print(f"The accuracy score is {acc:.3f}.")

The accuracy score is 0.800.
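
The accuracy above is measured on the same rows the model was trained on, so it is likely to be optimistic. The sketch below shows the same kind of pipeline with a train-test split, and with columns selected by name instead of by position. It runs on synthetic data: the frame `toy`, its columns, and the noisy target are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (not the Titanic set)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'age':  rng.normal(30, 10, n),
    'fare': rng.exponential(30, n),
    'sex':  rng.choice(['female', 'male'], n),
})
toy.loc[::10, 'age'] = np.nan                 # inject some missing values

# Made-up target: depends on `sex`, with roughly 10% of labels flipped
noise = rng.random(n) < 0.1
y = ((toy['sex'] == 'female').to_numpy() ^ noise).astype(int)

num_pipe = Pipeline([('imp', SimpleImputer(strategy='mean')),
                     ('scale', StandardScaler())])
cat_pipe = Pipeline([('imp', SimpleImputer(strategy='most_frequent')),
                     ('oh', OneHotEncoder(handle_unknown='ignore'))])

# Columns are referenced by name because the input is a DataFrame
col_transf = ColumnTransformer([('num', num_pipe, ['age', 'fare']),
                                ('cat', cat_pipe, ['sex'])])
full_pipe = Pipeline([('transf', col_transf),
                      ('model', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    toy, y, test_size=0.25, random_state=0)

full_pipe.fit(X_train, y_train)               # imputers and scaler are fit on training rows only
test_acc = full_pipe.score(X_test, y_test)    # accuracy on held-out rows
print(f"Held-out accuracy: {test_acc:.3f}")
```

Because the imputation and scaling steps live inside the pipeline, their statistics are learned from the training rows only, which avoids leaking information from the test set.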


### Text features

Very often the features are given as some kind of text (for example, the content of "spam" and "ham" emails). In this case, we need to convert the text to a numerical form before feeding it to our machine learning model. One popular approach is to use scikit-learn's `TfidfVectorizer()` class.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

sample=['applied machine learning is cool!', 
        'we need to applied a lot of force to move this machine',
        'learning how to cool down is very important!']

vec = TfidfVectorizer()

X = vec.fit_transform(sample)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

Unnamed: 0,applied,cool,down,force,how,important,is,learning,lot,machine,move,need,of,this,to,very,we
0,0.447214,0.447214,0.0,0.0,0.0,0.0,0.447214,0.447214,0.0,0.447214,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.235035,0.0,0.0,0.309043,0.0,0.0,0.0,0.0,0.309043,0.235035,0.309043,0.309043,0.309043,0.309043,0.47007,0.0,0.309043
2,0.0,0.302674,0.39798,0.0,0.39798,0.39798,0.302674,0.302674,0.0,0.0,0.0,0.0,0.0,0.0,0.302674,0.39798,0.0
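
To see where these numbers come from, the first row can be recomputed by hand. The sketch below assumes the `TfidfVectorizer()` defaults: raw term counts, a smoothed inverse document frequency `idf = ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It uses a three-sentence corpus like the sample above; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['applied machine learning is cool!',
        'we need to applied a lot of force to move this machine',
        'learning how to cool down is very important!']

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
vocab = vec.vocabulary_                  # token -> column index

# Tokenize exactly as the vectorizer does (lowercase, tokens of 2+ characters)
analyze = vec.build_analyzer()
tokenized = [analyze(doc) for doc in docs]

n_docs = len(docs)
manual = np.zeros(len(vocab))
for tok in set(tokenized[0]):
    tf = tokenized[0].count(tok)                         # term count in document 0
    doc_freq = sum(tok in toks for toks in tokenized)    # documents containing the token
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1      # smoothed idf
    manual[vocab[tok]] = tf * idf
manual /= np.linalg.norm(manual)                         # L2-normalize the row

print(np.allclose(manual, X[0]))
```

In the first sentence, each of the five tokens occurs once and appears in two of the three documents, so all five get the same weight before normalization. After L2 normalization each entry equals 1/sqrt(5), which is the 0.447214 seen in the first row of the table.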
