# Dimensionality Expansion
## One-Hot Encoding
In this lesson we will practice converting categorical data into numerical data so that it can be used by the scikit-learn package. Many other machine learning packages share this limitation: they can only ingest numerical data. One-hot encoding adds one new column for each category in the original column, hence "dimensionality expansion".
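As a rough illustration of the "expansion", the number of columns that one-hot encoding produces can be predicted by counting the distinct categories in each column. The sketch below uses a small made-up frame (the values are hypothetical and not read from the Titanic file) and `pd.get_dummies`, which is also used later in this lesson.

```python
import pandas as pd

# Toy data with hypothetical values (not the Titanic file)
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q"],
    "Salutation": ["Mr.", "Mrs.", "Miss.", "Mr."],
})

# One new column is expected per distinct category in each encoded column
expected_new_columns = int(toy.nunique().sum())

# One-hot encode both columns
encoded = pd.get_dummies(toy)

print("distinct categories:", expected_new_columns)
print("encoded columns:    ", encoded.shape[1])
print(list(encoded.columns))
```

Two categorical columns with three categories each become six indicator columns, so a column with many rare categories (such as a salutation) can widen the table considerably.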

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

# Workshop Functions
import sys
sys.path.append('..')
from WKDSS320_functions import * 

In [2]:
# Read in the data
df = pd.read_csv("titanic_train_cleaned.csv")
# drop columns we won't need for this exercise
df.drop(columns=['Name','Age','SibSp','Parch','Ticket','Fare'], inplace=True)
df.head()

   PassengerId  Survived  Pclass     Sex Embarked Salutation
0            1         0       3    male        S        Mr.
1            2         1       1  female        C       Mrs.
2            3         1       3  female        S      Miss.
3            4         1       1  female        S       Mrs.
4            5         0       3    male        S        Mr.


### Need for additional data processing
The scikit-learn machine learning package can't take **categorical** data as input. Because the 'Sex' column has only two values, we can simply replace 'male' with 0 and 'female' with 1.

In [3]:
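# convert the categorical variable 'Sex' to numerical 0 and 1 using a mapping
mapping = {'male':0, 'female':1}
# assign the whole column so it takes the new integer dtype
# (in recent pandas versions, df.loc[:,'Sex'] = ... may keep the old object dtype)
df['Sex'] = df['Sex'].map(mapping)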
df.head(2)

   PassengerId  Survived  Pclass  Sex Embarked Salutation
0            1         0       3    0        S        Mr.
1            2         1       1    1        C       Mrs.
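One thing to watch with `map`: any value that is not a key in the mapping becomes `NaN` without raising an error. A minimal sketch of a sanity check, using a made-up series with a deliberately mismatched value ('Male' is a hypothetical typo, not something observed in the Titanic data):

```python
import pandas as pd

# Toy series; "Male" is a deliberate mismatch to show the failure mode
sex = pd.Series(["male", "female", "Male", "female"])
mapping = {"male": 0, "female": 1}

mapped = sex.map(mapping)

# Values missing from the mapping come back as NaN
unmapped = sex[mapped.isna()].unique()
print("unmapped values:", list(unmapped))
print("rows affected:  ", int(mapped.isna().sum()))
```

If the list of unmapped values is empty, every row was converted and the column can safely be treated as numeric.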


For 'Embarked' there are 3 possible values: S, C, and Q. Assigning them the values 0, 1, 2 would imply an ordering (and distances) between the ports that does not exist, so let's instead use one-hot encoding to create 3 new columns, one for each value. In the 'S' column, the value will be 1 if the original 'Embarked' column has 'S' as the value for that passenger, and 0 otherwise. The C and Q columns work the same way.
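To make the rule concrete, the indicator columns can be built by hand with one comparison per category. This is only a sketch on made-up values (the series below is hypothetical); it compares the hand-built result with what `pd.get_dummies` returns for the same input.

```python
import pandas as pd

# Toy values standing in for the 'Embarked' column
embarked = pd.Series(["S", "C", "S", "Q"], name="Embarked")

# Build each indicator column by hand: 1 where the value matches, 0 otherwise
manual = pd.DataFrame({
    f"Embarked_{port}": (embarked == port).astype(int)
    for port in sorted(embarked.unique())
})

# The same encoding produced by pandas
auto = pd.get_dummies(embarked, prefix="Embarked", dtype=int)

print(manual)
print("same columns:", list(manual.columns) == list(auto.columns))
print("same values: ", bool((manual.to_numpy() == auto.to_numpy()).all()))
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.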

Let's also do the same with the Salutation.   

In [4]:
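# dtype=int keeps the indicator columns as 0/1 (pandas 2.0 and later default to True/False)
dfTemp = pd.get_dummies(df.loc[:,['Embarked','Salutation']], dtype=int)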
df = pd.concat([df,dfTemp], axis=1)
df.head()

   PassengerId  Survived  Pclass  Sex Embarked Salutation  Embarked_C  \
0            1         0       3    0        S        Mr.           0   
1            2         1       1    1        C       Mrs.           1   
2            3         1       3    1        S      Miss.           0   
3            4         1       1    1        S       Mrs.           0   
4            5         0       3    0        S        Mr.           0   

   Embarked_Q  Embarked_S  Salutation_Capt.  ...  Salutation_Major.  \
0           0           1                 0  ...                  0   
1           0           0                 0  ...                  0   
2           0           1                 0  ...                  0   
3           0           1                 0  ...                  0   
4           0           1                 0  ...                  0   

   Salutation_Master.  Salutation_Miss.  Salutation_Mlle.  Salutation_Mme.  \
0                   0                 0                 

Now remove the two original categorical columns to make the data fully numeric. It is then ready to be processed by a machine learning algorithm in the scikit-learn package.

In [5]:
df.drop(columns=['Salutation','Embarked'],inplace=True)
df.head()

   PassengerId  Survived  Pclass  Sex  Embarked_C  Embarked_Q  Embarked_S  \
0            1         0       3    0           0           0           1   
1            2         1       1    1           1           0           0   
2            3         1       3    1           0           0           1   
3            4         1       1    1           0           0           1   
4            5         0       3    0           0           0           1   

   Salutation_Capt.  Salutation_Col.  Salutation_Countess.  ...  \
0                 0                0                     0  ...   
1                 0                0                     0  ...   
2                 0                0                     0  ...   
3                 0                0                     0  ...   
4                 0                0                     0  ...   

   Salutation_Major.  Salutation_Master.  Salutation_Miss.  Salutation_Mlle.  \
0                  0                   0              
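Before handing a frame to scikit-learn, it can be worth confirming that no text columns remain. The helper below, `non_numeric_columns`, is a hypothetical name introduced for this sketch (it is not part of the workshop functions), and the frame is made-up toy data rather than the Titanic file.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def non_numeric_columns(frame):
    """Return the names of columns whose dtype is not numeric."""
    return [col for col in frame.columns if not is_numeric_dtype(frame[col])]

# Toy frame with one remaining categorical column
toy = pd.DataFrame({"Pclass": [3, 1], "Sex": [0, 1], "Embarked": ["S", "C"]})
before = non_numeric_columns(toy)
print("before encoding:", before)

# One-hot encode 'Embarked', append the indicators, then drop the original column
dummies = pd.get_dummies(toy[["Embarked"]], dtype=int)
toy = pd.concat([toy, dummies], axis=1).drop(columns=["Embarked"])
after = non_numeric_columns(toy)
print("after encoding: ", after)
```

An empty list after encoding means every column has a numeric dtype.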