In [102]:
# Import pandas and load titanic dataset
import pandas

In [103]:
raw_data = pandas.read_csv('./titanic.csv', index_col='PassengerId')

# Step 1: Study Dataset

### Review features of dataset
In this cell we check the total number of rows to get a sense of the dataset's size. We then list the features (columns) and, for each one, print its first 5 rows, which also shows the column's dtype.

In [104]:
length = len(raw_data)
columns = raw_data.columns

print("Num of rows:: " + str(length))
print(columns)

# Print the first 5 values of each column; the footer of each output shows the column's dtype
for column in columns:
    print(raw_data[column].head())

Num of rows:: 891
Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')
PassengerId
1    0
2    1
3    1
4    1
5    0
Name: Survived, dtype: int64
PassengerId
1    3
2    1
3    3
4    1
5    3
Name: Pclass, dtype: int64
PassengerId
1                              Braund, Mr. Owen Harris
2    Cumings, Mrs. John Bradley (Florence Briggs Th...
3                               Heikkinen, Miss. Laina
4         Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                             Allen, Mr. William Henry
Name: Name, dtype: object
PassengerId
1      male
2    female
3    female
4    female
5      male
Name: Sex, dtype: object
PassengerId
1    22.0
2    38.0
3    26.0
4    35.0
5    35.0
Name: Age, dtype: float64
PassengerId
1    1
2    1
3    0
4    1
5    0
Name: SibSp, dtype: int64
PassengerId
1    0
2    0
3    0
4    0
5    0
Name: Parch, dtype: int64
PassengerId
1           A/5 21171
2            PC 

# Step 2: Data Cleansing

### Handling missing data
Through the next few cells we are going to clean up our data, starting with missing values.

In [105]:
# List out the null or na values in each column
raw_data.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64
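
Raw counts are easier to judge as a share of all rows. The sketch below uses a small made-up frame (the `toy` name and its values are illustrative only, not taken from the dataset) to show how `isna().mean()` gives the fraction of missing values per column. On the real data, the counts above work out to roughly 77% missing for `Cabin` and roughly 20% for `Age`.

```python
import pandas

# Toy frame standing in for the Titanic data (values are made up for illustration)
toy = pandas.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() gives a boolean frame; the mean of booleans is the fraction that are True
missing_fraction = toy.isna().mean()
print(missing_fraction)
```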

You can see that `Cabin` is missing 687 of its 891 values (about 77%). With so little data available, it is reasonable to treat the `Cabin` feature as unhelpful and drop it. We will save the result to a new dataframe.

In [106]:
clean_data = raw_data.drop('Cabin', axis=1) # We are setting axis to "1" to represent dropping the column whereas "0" would mean "row"

The next column to review is `Age`, which has 177 missing values (about 20% of the rows). That is far fewer than `Cabin`, and age is a feature that is likely to be important for us, so we do not want to drop it. In situations like this a common approach is to take the median of the known values in the column and use it to fill in the missing ones.

In [107]:
median_age = clean_data["Age"].median()
clean_data["Age"] = clean_data["Age"].fillna(median_age)
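
As a small illustration of what median imputation does, here is a sketch on a made-up series (the `ages` values are hypothetical). `median()` skips missing entries by default, so the fill value comes only from the known ages.

```python
import pandas

# Made-up ages with two missing entries
ages = pandas.Series([22.0, 38.0, None, 35.0, None])

median_age = ages.median()         # computed from 22, 38, 35 only
filled = ages.fillna(median_age)   # missing entries take the median

print(median_age)
print(filled.tolist())
```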

The third column with missing data is `Embarked`, which has only 2 missing values. The column is made up of strings rather than numbers, so a median does not apply. One approach is to group the missing values under a single placeholder value, "U" for unknown.

In [108]:
clean_data["Embarked"] = clean_data["Embarked"].fillna('U')

In [109]:
# Save the updated dataframe to a new CSV so we do not lose progress
# Note: index=False means the PassengerId index is not written to the file
clean_data.to_csv('./clean_titanic_data.csv', index=False)
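
One side effect of saving without the index is worth seeing on its own: the index column is not written, so it is absent when the file is read back. The sketch below uses a made-up two-row frame and an in-memory buffer instead of a file (both are assumptions for illustration).

```python
import io
import pandas

# Tiny frame indexed by PassengerId, mimicking the layout of the real data
toy = pandas.DataFrame(
    {"Survived": [0, 1]},
    index=pandas.Index([1, 2], name="PassengerId"),
)

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # the PassengerId index is not written
buffer.seek(0)
reloaded = pandas.read_csv(buffer)

print(list(reloaded.columns))     # only the data columns remain
print(list(reloaded.index))       # a fresh 0-based index replaces PassengerId
```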

# Step 3: Feature Engineering

### 3.1: Turn categorical data into numerical data with One-hot encoding
In order for our model to perform mathematical operations on our data, we have to make sure all of the data is numerical.

**One-hot encoding:** Look at a specific feature and determine the number of classes within it. Then create one new column (feature) for each class. In the case of `Sex` the classes are "male" and "female", so you would add 2 new columns to the dataframe: `Sex_male` and `Sex_female`.
- Keep in mind that if a feature has hundreds of classes, one-hot encoding will create hundreds of new columns, and those columns will be filled almost entirely with 0's, with a single 1 per row.
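
A quick way to see how many columns one-hot encoding would add is to count the distinct values per feature first. The sketch below uses a made-up three-row frame (the `toy` name is hypothetical) and then encodes the low-cardinality `Sex` column.

```python
import pandas

toy = pandas.DataFrame({
    "Sex": ["male", "female", "female"],
    "Ticket": ["A/5 21171", "PC 17599", "113803"],
})

# Distinct classes per column = number of columns one-hot encoding would create
print(toy.nunique())

# Sex has only two classes, so it expands into two indicator columns
sex_columns = pandas.get_dummies(toy["Sex"], prefix="Sex")
print(sex_columns)
```

Each row has exactly one indicator set, which is where the name "one-hot" comes from.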

In [110]:
preprocessed_data = pandas.read_csv('./clean_titanic_data.csv')

gender_columns = pandas.get_dummies(preprocessed_data['Sex'], prefix='Sex')
embarked_columns = pandas.get_dummies(
    preprocessed_data['Embarked'], prefix='Embarked')

preprocessed_data = pandas.concat([preprocessed_data, gender_columns], axis=1)
preprocessed_data = pandas.concat([preprocessed_data, embarked_columns], axis=1)

preprocessed_data = preprocessed_data.drop(['Sex', 'Embarked'], axis=1)

In [111]:
preprocessed_data_columns = preprocessed_data.columns
print(preprocessed_data_columns)

Index(['Survived', 'Pclass', 'Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S',
       'Embarked_U'],
      dtype='object')


### One-hot encoding for numerical classes
The feature `Pclass` takes three values: 1, 2, and 3, representing first class, second class, and third class. To decide whether we want to apply one-hot encoding, we should see how this feature relates to the outcome we are predicting. The following code computes the survival rate for each class.

In [112]:
class_survived = preprocessed_data[['Pclass', 'Survived']]

first_class = class_survived[class_survived['Pclass'] == 1]
second_class = class_survived[class_survived['Pclass'] == 2]
third_class = class_survived[class_survived['Pclass'] == 3]

print("In first class", sum(first_class['Survived'])/len(first_class)*100, "% of passengers survived")
print("In second class", sum(second_class['Survived'])/len(second_class)*100, "% of passengers survived")
print("In third class", sum(third_class['Survived'])/len(third_class)*100, "% of passengers survived")

In first class 62.96296296296296 % of passengers survived
In second class 47.28260869565217 % of passengers survived
In third class 24.236252545824847 % of passengers survived
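
The same per-class survival rate can also be computed in one step with `groupby`. This is an alternative sketch, not the approach used in the cell above, and it runs on made-up data (the `toy` frame is hypothetical) so the numbers differ from the real output.

```python
import pandas

# Made-up passengers: 2 in first class, 2 in second, 4 in third
toy = pandas.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of 1s, so this is the survival rate in percent
survival_rate = toy.groupby("Pclass")["Survived"].mean() * 100
print(survival_rate)
```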


We can see that the survival rate drops steadily from first class to third class, so in this case it would make sense to leave `Pclass` as a single numerical feature, where a higher number means a lower chance of survival. For the project's sake we will still apply one-hot encoding, and we can revisit this choice in the future to see how the model reacts.

In [113]:
categorized_pclass_columns = pandas.get_dummies(preprocessed_data['Pclass'], prefix='Pclass')
preprocessed_data = pandas.concat([preprocessed_data, categorized_pclass_columns], axis=1)
preprocessed_data = preprocessed_data.drop(['Pclass'], axis=1)

preprocessed_data_columns = preprocessed_data.columns
print(preprocessed_data_columns)

Index(['Survived', 'Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S',
       'Embarked_U', 'Pclass_1', 'Pclass_2', 'Pclass_3'],
      dtype='object')


### Binning: Turning numerical data into categorical data
An example of turning numerical data into categorical data is `Binning`, the process of splitting numbers into several buckets (ranges).

Age is a good example. In the context of this Titanic model we want to answer the question: how does age affect a passenger's chance of survival? A linear model can only capture a single trend, for example that the older you are, the less likely you are to survive (or the reverse). But what if this relationship is not as straightforward?

Binning pairs naturally with one-hot encoding: once each age is assigned to a bin, one-hot encoding creates a new column for each bin.
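
To make the bin boundaries concrete, here is a sketch of `pandas.cut` on a few made-up ages. By default the intervals are closed on the right, so an age of exactly 10 lands in `(0, 10]`, and a value equal to the lowest edge (0 here) falls outside every bin and becomes missing.

```python
import pandas

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]

# Made-up ages chosen to sit on and around the bin edges
ages = pandas.Series([0.42, 10.0, 10.5, 28.0, 80.0])
binned = pandas.cut(ages, bins)
print(binned)

# The left edge of the first bin is excluded, so an age of 0 gets no bin
edge_case = pandas.cut(pandas.Series([0.0]), bins)
print(edge_case.isna().tolist())
```

If ages of exactly 0 could occur, passing `include_lowest=True` to `pandas.cut` would keep them in the first bin.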

In [114]:
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
categorized_age = pandas.cut(preprocessed_data['Age'], bins)
print(categorized_age)
preprocessed_data['Categorized_age'] = categorized_age
preprocessed_data = preprocessed_data.drop(['Age'], axis=1)

# `Categorized_age` now holds an age range (interval) for each passenger instead of a number
preprocessed_data_columns = preprocessed_data.columns
print(preprocessed_data_columns)

0      (20, 30]
1      (30, 40]
2      (20, 30]
3      (30, 40]
4      (30, 40]
         ...   
886    (20, 30]
887    (10, 20]
888    (20, 30]
889    (20, 30]
890    (30, 40]
Name: Age, Length: 891, dtype: category
Categories (8, interval[int64, right]): [(0, 10] < (10, 20] < (20, 30] < (30, 40] < (40, 50] < (50, 60] < (60, 70] < (70, 80]]
Index(['Survived', 'Name', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Sex_female',
       'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Embarked_U',
       'Pclass_1', 'Pclass_2', 'Pclass_3', 'Categorized_age'],
      dtype='object')


In [115]:
categorized_aged_columns = pandas.get_dummies(preprocessed_data['Categorized_age'], prefix='Categorized_age')
preprocessed_data = pandas.concat([preprocessed_data, categorized_aged_columns], axis=1)
preprocessed_data = preprocessed_data.drop(['Categorized_age'], axis=1)

In [116]:
preprocessed_data.head()

Unnamed: 0,Survived,Name,SibSp,Parch,Ticket,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,...,Pclass_2,Pclass_3,"Categorized_age_(0, 10]","Categorized_age_(10, 20]","Categorized_age_(20, 30]","Categorized_age_(30, 40]","Categorized_age_(40, 50]","Categorized_age_(50, 60]","Categorized_age_(60, 70]","Categorized_age_(70, 80]"
0,0,"Braund, Mr. Owen Harris",1,0,A/5 21171,7.25,False,True,False,False,...,False,True,False,False,True,False,False,False,False,False
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,0,PC 17599,71.2833,True,False,True,False,...,False,False,False,False,False,True,False,False,False,False
2,1,"Heikkinen, Miss. Laina",0,0,STON/O2. 3101282,7.925,True,False,False,False,...,False,True,False,False,True,False,False,False,False,False
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,0,113803,53.1,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4,0,"Allen, Mr. William Henry",0,0,373450,8.05,False,True,False,False,...,False,True,False,False,False,True,False,False,False,False


## Feature Selection
Determine which columns are unnecessary for our model. One of them is `Name`, which is unique to each passenger and, as used here, gives the model no insight into the passenger's chance of survival. We will drop `Name` and `Ticket`. `PassengerId` is already gone, because the cleaned CSV was saved without the index.

In [117]:
preprocessed_data = preprocessed_data.drop(['Name', 'Ticket'], axis=1)
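
Since the goal is for every remaining column to be numerical, one possible sanity check is to list any columns that are neither numeric nor boolean. This is a sketch on a made-up frame (the `toy` name and the check itself are illustrative additions, not part of the notebook).

```python
import pandas

toy = pandas.DataFrame({
    "Survived": [0, 1],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Fare": [7.25, 7.925],
    "Sex_male": [True, False],
})

# Columns that are neither numeric nor boolean still need encoding or dropping
non_numeric = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(non_numeric)
```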

In [118]:
preprocessed_data.head()

Unnamed: 0,Survived,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Embarked_U,...,Pclass_2,Pclass_3,"Categorized_age_(0, 10]","Categorized_age_(10, 20]","Categorized_age_(20, 30]","Categorized_age_(30, 40]","Categorized_age_(40, 50]","Categorized_age_(50, 60]","Categorized_age_(60, 70]","Categorized_age_(70, 80]"
0,0,1,0,7.25,False,True,False,False,True,False,...,False,True,False,False,True,False,False,False,False,False
1,1,1,0,71.2833,True,False,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
2,1,0,0,7.925,True,False,False,False,True,False,...,False,True,False,False,True,False,False,False,False,False
3,1,1,0,53.1,True,False,False,False,True,False,...,False,False,False,False,False,True,False,False,False,False
4,0,0,0,8.05,False,True,False,False,True,False,...,False,True,False,False,False,True,False,False,False,False


In [119]:
preprocessed_data.to_csv('preprocessed_titanic_data.csv', index=None)