# The Dataset with Categorical Columns or Variables

- The previous dataset (Chapter4.csv) was easy to use for correlation analysis because all of its variables (columns) are numeric
- When a dataset contains categorical columns, correlation analysis needs special care

# Data description (Titanic)
<img src="images\titanic_datadescription.gif">

In [2]:
# import packages
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
from pandas import plotting
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# read data
data = pd.read_csv('data/titanic_train.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# check data types and missing values ... many missing values in Age and Cabin
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [5]:
# remove unimportant columns ... Ticket #s mean nothing ... too many missing values in Cabin
df = data.drop(['Name','PassengerId', 'Ticket', 'Cabin'], axis = 1)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [6]:
# remove the rows with missing values ... Age is missing in 177 rows and Embarked in 2, so dropna() removes 179 rows (891 -> 712)
sum(df['Age'].isnull())
df = df.dropna()

In [7]:
# find out missing values again
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 8 columns):
Survived    712 non-null int64
Pclass      712 non-null int64
Sex         712 non-null object
Age         712 non-null float64
SibSp       712 non-null int64
Parch       712 non-null int64
Fare        712 non-null float64
Embarked    712 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 50.1+ KB


* Now there are no missing values in the dataset
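Before dropping rows, it can help to count how many values are missing per column and how many rows `dropna()` would remove. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and gaps in two columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Missing cells per column
missing_per_column = toy.isnull().sum()

# Rows with at least one missing cell (these are the rows dropna() removes)
rows_with_gap = toy.isnull().any(axis=1).sum()

cleaned = toy.dropna()

print(missing_per_column)
print("rows removed:", len(toy) - len(cleaned))
```

Because `dropna()` removes a row when any column is missing, the number of removed rows can be larger than the missing count of a single column.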

In [8]:
# info
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 8 columns):
Survived    712 non-null int64
Pclass      712 non-null int64
Sex         712 non-null object
Age         712 non-null float64
SibSp       712 non-null int64
Parch       712 non-null int64
Fare        712 non-null float64
Embarked    712 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 50.1+ KB


* Sex and Embarked are string columns in the dataset (using ETL, they need to be converted to numeric columns for correlation analysis)

In [9]:
# simple statistics for Sex
df['Sex'].describe()

count      712
unique       2
top       male
freq       453
Name: Sex, dtype: object

In [10]:
# unique values in Sex column ... groupby
df['Sex'].unique()
df.groupby('Sex').count()

Unnamed: 0_level_0,Survived,Pclass,Age,SibSp,Parch,Fare,Embarked
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
female,259,259,259,259,259,259,259
male,453,453,453,453,453,453,453


In [11]:
df['Pclass'].unique()

array([3, 1, 2])

In [12]:
# correlation analysis (numeric columns only; Sex and Embarked are still strings)
df.corr(numeric_only=True)

#Pclass is a proxy for socio-economic status (SES) 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
#sibsp           Number of Siblings/Spouses Aboard
#parch           Number of Parents/Children Aboard

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare
Survived,1.0,-0.356462,-0.082446,-0.015523,0.095265,0.2661
Pclass,-0.356462,1.0,-0.365902,0.065187,0.023666,-0.552893
Age,-0.082446,-0.365902,1.0,-0.307351,-0.187896,0.093143
SibSp,-0.015523,0.065187,-0.307351,1.0,0.383338,0.13986
Parch,0.095265,0.023666,-0.187896,0.383338,1.0,0.206624
Fare,0.2661,-0.552893,0.093143,0.13986,0.206624,1.0


In [13]:
df[['Survived', 'Fare']].corr()

Unnamed: 0,Survived,Fare
Survived,1.0,0.2661
Fare,0.2661,1.0


The above correlation analysis does not include two columns (Sex and Embarked) because they are categorical columns (not numbers). We will take care of this later. 
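One way to see which columns are left out is to split the frame by dtype before computing the correlation. This is a small sketch on a made-up toy frame; the column names mirror the Titanic data, but the values are illustrative only:

```python
import pandas as pd

# Toy frame with made-up values: two numeric and two string columns
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.93, 8.05],
    "Embarked": ["S", "C", "S", "Q"],
})

# Columns that a Pearson correlation cannot use directly
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Correlation matrix restricted to the numeric columns
numeric_corr = toy.select_dtypes(include="number").corr()

print(non_numeric)
print(numeric_corr)
```

Selecting by dtype first makes the omission explicit instead of relying on how a particular pandas version treats non-numeric columns in `corr()`.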

In [14]:
#http://stanford.edu/~mwaskom/software/seaborn/tutorial/quantitative_linear_models.html#plotting-many-pairwise-relationships-with-corrplot
# correlation plot using seaborn
plt.figure(figsize =(8, 8))
# sns.corrplot was removed from seaborn, so draw a heatmap of the correlation matrix instead
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', vmin=-1, vmax=1)

Now let's try to include the categorical variables (sex, embarked) in the correlation analysis

In [15]:
# find out unique values in Embarked ... groupby
df['Embarked'].unique()
#embarked        Port of Embarkation                 (C = Cherbourg; Q = Queenstown; S = Southampton)

array(['S', 'C', 'Q'], dtype=object)

In [16]:
#replace (C to 1, Q to 2, S to 3)
port = {"C": "1", "Q":"2", "S":"3"}
df = df.replace({"Embarked":port})
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,3
1,1,1,female,38.0,1,0,71.2833,1
2,1,3,female,26.0,0,0,7.925,3
3,1,1,female,35.0,1,0,53.1,3
4,0,3,male,35.0,0,0,8.05,3


In [17]:
#replace (male to 1, female to 0)
mf = {"male": "1", "female":"0"}
df = df.replace({"Sex":mf})
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,3
1,1,1,0,38.0,1,0,71.2833,1
2,1,3,0,26.0,0,0,7.925,3
3,1,1,0,35.0,1,0,53.1,3
4,0,3,1,35.0,0,0,8.05,3


In [18]:
# Sex and Embarked are still object columns because the replacement values were strings
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 8 columns):
Survived    712 non-null int64
Pclass      712 non-null int64
Sex         712 non-null object
Age         712 non-null float64
SibSp       712 non-null int64
Parch       712 non-null int64
Fare        712 non-null float64
Embarked    712 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 50.1+ KB


In [19]:
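# so let's convert the object columns to numeric
# (convert_objects was removed from pandas; pd.to_numeric is the replacement)

df[['Sex', 'Embarked']] = df[['Sex', 'Embarked']].apply(pd.to_numeric)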
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 8 columns):
Survived    712 non-null int64
Pclass      712 non-null int64
Sex         712 non-null int64
Age         712 non-null float64
SibSp       712 non-null int64
Parch       712 non-null int64
Fare        712 non-null float64
Embarked    712 non-null int64
dtypes: float64(2), int64(6)
memory usage: 50.1 KB
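The separate conversion step is only needed because the replacement values were strings ("1", "2", "3"). As an alternative sketch, mapping the categories straight to integers produces integer columns in one step. The toy frame and the names `sex_codes` and `port_codes` below are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy frame with made-up rows
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Same coding as in the notebook, but with integer values
sex_codes = {"male": 1, "female": 0}
port_codes = {"C": 1, "Q": 2, "S": 3}

toy["Sex"] = toy["Sex"].map(sex_codes)
toy["Embarked"] = toy["Embarked"].map(port_codes)

print(toy.dtypes)
```

Note that `Series.map` turns any category missing from the dictionary into NaN, which would make the column float rather than integer, so this works cleanly only after missing values have been handled.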




In [20]:
# correlation analysis again
df.corr()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
Survived,1.0,-0.356462,-0.536762,-0.082446,-0.015523,0.095265,0.2661,-0.181979
Pclass,-0.356462,1.0,0.150826,-0.365902,0.065187,0.023666,-0.552893,0.244145
Sex,-0.536762,0.150826,1.0,0.099037,-0.106296,-0.249543,-0.182457,0.109639
Age,-0.082446,-0.365902,0.099037,1.0,-0.307351,-0.187896,0.093143,-0.032565
SibSp,-0.015523,0.065187,-0.106296,-0.307351,1.0,0.383338,0.13986,0.033064
Parch,0.095265,0.023666,-0.249543,-0.187896,0.383338,1.0,0.206624,0.011803
Fare,0.2661,-0.552893,-0.182457,0.093143,0.13986,0.206624,1.0,-0.28351
Embarked,-0.181979,0.244145,0.109639,-0.032565,0.033064,0.011803,-0.28351,1.0


In [21]:
#http://stanford.edu/~mwaskom/software/seaborn/tutorial/quantitative_linear_models.html#plotting-many-pairwise-relationships-with-corrplot
# correlation plot again


The above output looks good, but in general it is difficult to interpret the correlation of a categorical variable. For example, what do -0.5368 (Sex) and -0.1820 (Embarked) mean?

* Sex is negatively related to Survived. Because male is coded 1 and female 0, the lower value (female) goes with higher survival. This reading works because Sex has only two values.
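For a two-valued variable, the choice of coding only changes the sign of the correlation, and the sign matches the difference in survival rates between the two groups. A minimal sketch with made-up values (not the Titanic rows) illustrates this:

```python
import numpy as np

# Made-up outcome and sex indicator for eight passengers
survived = np.array([0, 1, 1, 1, 0, 0, 1, 0])
male = np.array([1, 0, 0, 1, 1, 1, 0, 0])   # male = 1, female = 0
female = 1 - male                            # reversed coding

r_male = np.corrcoef(male, survived)[0, 1]
r_female = np.corrcoef(female, survived)[0, 1]

# Survival rate in each group
rate_male = survived[male == 1].mean()
rate_female = survived[male == 0].mean()

print(r_male, r_female)
print(rate_male, rate_female)
```

Here `r_female` is simply `-r_male`, so either coding leads to the same conclusion about which group survived more often.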

However, what about Embarked? This column contains more than two values. 

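* Given the coding C = Cherbourg = 1; Q = Queenstown = 2; S = Southampton = 3, the negative sign suggests that passengers from Cherbourg (the lowest code) survived more often than passengers from Southampton (the highest code).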

But what about the passengers from Queenstown? 

* Overall, it is difficult to interpret the correlation in a meaningful way. When a categorical variable has more than two unique values, the numeric codes impose an arbitrary order and spacing, so the coefficient depends on how the categories were numbered.
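The dependence on the numbering can be checked directly. The sketch below uses made-up data (twelve hypothetical passengers, four per port) and three different numberings of the same ports; the dictionary labels are only for illustration:

```python
import pandas as pd

# Made-up data: four passengers per port
port = pd.Series(list("CCCCQQQQSSSS"))
survived = pd.Series([1, 1, 1, 0,  0, 0, 1, 0,  1, 0, 1, 0])

# Three arbitrary ways to number the same three categories
codings = {
    "C=1,Q=2,S=3": {"C": 1, "Q": 2, "S": 3},
    "C=1,S=2,Q=3": {"C": 1, "S": 2, "Q": 3},
    "Q=1,S=2,C=3": {"Q": 1, "S": 2, "C": 3},
}

# Pearson correlation between the coded port and survival, per coding
results = {
    label: port.map(codes).corr(survived)
    for label, codes in codings.items()
}

for label, r in results.items():
    print(label, round(r, 3))
```

The data are identical in all three cases, yet the coefficient changes in size and even in sign, which is why a single correlation on integer codes is hard to interpret for a variable with more than two categories.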

So we do more ETL (data wrangling) using **dummy variables**
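A dummy variable is a 0/1 column that marks membership in one category, so a variable with three categories becomes three such columns. The following sketch applies `pd.get_dummies` to a made-up Series; `dtype=int` is passed so the result is 0/1 integers regardless of pandas version:

```python
import pandas as pd

# Made-up port values for four passengers
port = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# One 0/1 column per category
dummies = pd.get_dummies(port, prefix="Embarked", dtype=int)

# Optional: drop the first category, which is implied by the others
reduced = pd.get_dummies(port, prefix="Embarked", drop_first=True, dtype=int)

print(dummies)
print(reduced)
```

Each row of `dummies` contains exactly one 1. Because of that, one column is always redundant, which is what `drop_first=True` removes; for correlation tables it is often easier to keep all columns so every category can be read off directly.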

In [22]:
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

# create dummy variables (columns) for Sex
Sex_dummies = pd.get_dummies(df['Sex'], prefix='Sex')
df = df.join(Sex_dummies)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Sex_0,Sex_1
0,0,3,1,22.0,1,0,7.25,3,0,1
1,1,1,0,38.0,1,0,71.2833,1,1,0
2,1,3,0,26.0,0,0,7.925,3,1,0
3,1,1,0,35.0,1,0,53.1,3,1,0
4,0,3,1,35.0,0,0,8.05,3,0,1


In [46]:
# create dummy variables (columns) for Embarked
Embarked_dummies = pd.get_dummies(df['Embarked'], prefix='Embarked')
df = df.join(Embarked_dummies)
df.head()


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Sex_1,Sex_2,Embarked_1,Embarked_2,Embarked_3
0,0,3,1,22.0,1,0,7.25,3,1,0,0,0,1
1,1,1,2,38.0,1,0,71.2833,1,0,1,1,0,0
2,1,3,2,26.0,0,0,7.925,3,0,1,0,0,1
3,1,1,2,35.0,1,0,53.1,3,0,1,0,0,1
4,0,3,1,35.0,0,0,8.05,3,1,0,0,0,1


In [47]:
#now we can drop two columns: Sex and Embarked
df = df.drop(['Sex', 'Embarked'], axis = 1)
df.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_1,Sex_2,Embarked_1,Embarked_2,Embarked_3
0,0,3,22.0,1,0,7.25,1,0,0,0,1
1,1,1,38.0,1,0,71.2833,0,1,1,0,0
2,1,3,26.0,0,0,7.925,0,1,0,0,1
3,1,1,35.0,1,0,53.1,0,1,0,0,1
4,0,3,35.0,0,0,8.05,1,0,0,0,1


In [48]:
# correlation analysis again
df.corr()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_1,Sex_2,Embarked_1,Embarked_2,Embarked_3
Survived,1.0,-0.356462,-0.082446,-0.015523,0.095265,0.2661,-0.536762,0.536762,0.195673,-0.048966,-0.159015
Pclass,-0.356462,1.0,-0.365902,0.065187,0.023666,-0.552893,0.150826,-0.150826,-0.279194,0.131989,0.197831
Age,-0.082446,-0.365902,1.0,-0.307351,-0.187896,0.093143,0.099037,-0.099037,0.038268,-0.021693,-0.025431
SibSp,-0.015523,0.065187,-0.307351,1.0,0.383338,0.13986,-0.106296,0.106296,-0.046227,0.051331,0.018968
Parch,0.095265,0.023666,-0.187896,0.383338,1.0,0.206624,-0.249543,0.249543,-0.009523,-0.009417,0.013259
Fare,0.2661,-0.552893,0.093143,0.13986,0.206624,1.0,-0.182457,0.182457,0.301337,-0.062346,-0.250994
Sex_1,-0.536762,0.150826,0.099037,-0.106296,-0.249543,-0.182457,1.0,-1.0,-0.103611,-0.027256,0.109078
Sex_2,0.536762,-0.150826,-0.099037,0.106296,0.249543,0.182457,-1.0,1.0,0.103611,0.027256,-0.109078
Embarked_1,0.195673,-0.279194,0.038268,-0.046227,-0.009523,0.301337,-0.103611,0.103611,1.0,-0.095623,-0.884986
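Embarked_2,-0.048966,0.131989,-0.021693,0.051331,-0.009417,-0.062346,-0.027256,0.027256,-0.095623,1.0,-0.378859
Embarked_3,-0.159015,0.197831,-0.025431,0.018968,0.013259,-0.250994,0.109078,-0.109078,-0.884986,-0.378859,1.0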


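### Now the picture is clearer
- Female passengers (Sex_2 in the table above) survived at a much higher rate (r = 0.54)
- Passengers who embarked at C = Cherbourg (Embarked_1) survived at a somewhat higher rate (r = 0.20)
- Embarkation at Q = Queenstown (Embarked_2) shows almost no relationship with survival (r = -0.05)
- Passengers who embarked at S = Southampton (Embarked_3) survived at a somewhat lower rate (r = -0.16)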
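The sign of each dummy correlation says whether that group's survival rate lies above or below the overall rate, so a group-wise mean is a useful cross-check. The sketch below uses made-up data (four hypothetical passengers per port), not the Titanic rows:

```python
import pandas as pd

# Made-up data: four passengers per port
toy = pd.DataFrame({
    "Embarked": list("CCCCQQQQSSSS"),
    "Survived": [1, 1, 1, 0,  0, 0, 1, 0,  1, 0, 1, 0],
})

# Survival rate per port
rates = toy.groupby("Embarked")["Survived"].mean()

# Correlation of each port dummy with survival
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)
dummy_corr = dummies.corrwith(toy["Survived"])

print(rates)
print(dummy_corr)
```

In this toy data the port with a rate above the overall mean gets a positive coefficient, the one below gets a negative coefficient, and the one equal to the overall mean gets a coefficient of about zero. Reporting the group rates next to the correlations makes the bullet points above easier to verify.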

In [None]:
# correlation plot
