### Author: Kubam Ivo Mbi
### Date: 8/11/2020
### Purpose: Titanic Kaggle Competition

## Data set preprocessing

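Based on the exploratory analysis, the dataset was preprocessed as follows. The columns "PassengerId", "Name", "Ticket" and "Cabin" were dropped; "SibSp" and "Parch" were kept. The categorical variables "Pclass", "Sex" and "Embarked" were encoded as dummy variables. The Age variable has 177 missing values. A simple option is to impute the mean age of all passengers (29.7), but here the missing ages were predicted with a linear regression fitted on the rows where Age is known (R² of about 0.25 on those rows). Because Age and Fare are on different scales, both were then rescaled with scikit-learn's Normalizer. Note that Normalizer rescales each row, here each (Age, Fare) pair, to unit L2 norm; it does not standardise each column separately and does not by itself remove the effect of outliers. Two files are saved: "train_clean" (ages imputed, Age and Fare rescaled) and "train_clean1" (ages imputed, no rescaling).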

In [63]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn import preprocessing

In [64]:
# Import train data set
train_data = pd.read_csv("train.csv")

In [65]:
# dropping columns
train_df = train_data.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
train_df.head()

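| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |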


In [66]:
# Encoding categorical variables
train_df1 = pd.get_dummies(train_df,columns=["Pclass","Sex","Embarked"])
train_df1.head()

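| | Survived | Age | SibSp | Parch | Fare | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |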
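
The encoding step can be tried on a tiny frame. This is a minimal sketch, not the notebook's own code: the `toy` frame is made up for illustration, and `dtype=int` and `drop_first=True` are optional arguments that the cell above does not use. `dtype=int` is shown because recent pandas releases return boolean dummies by default, whereas the output above shows 0/1 integers. `drop_first=True` is shown because a full set of dummies is perfectly collinear with an intercept, which can matter for linear models.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns (illustrative values only)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# One dummy column per category, stored as 0/1 integers
full = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)

# Same encoding, but the first category of each variable is dropped
reduced = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Whether to drop one level per variable is a modelling choice; the notebook keeps all levels.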


In [67]:
train_df1.isnull().sum()

Survived        0
Age           177
SibSp           0
Parch           0
Fare            0
Pclass_1        0
Pclass_2        0
Pclass_3        0
Sex_female      0
Sex_male        0
Embarked_C      0
Embarked_Q      0
Embarked_S      0
dtype: int64

In [68]:
# Keep only the rows with a known Age; they are used to fit the imputation model
train_df2 = train_df1.dropna()
# Alternative (not used): impute the mean value of 29.7 for all missing ages
#train_df2["Age"] = train_df2["Age"].fillna(29.7)

In [69]:
# Handling missing values in the Age column using linear regression
from sklearn.linear_model import LinearRegression

# Features: every column after Survived (0) and Age (1)
X = train_df2.iloc[:,2:]
# Target: the known ages
y = train_df2["Age"]

reg = LinearRegression().fit(X,y)
print(reg.score(X,y))  # R² on the rows used for fitting

0.25035389535528974
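
The value printed above is the coefficient of determination R², computed on the same rows the model was fitted on, so it is an in-sample figure. As a hedged illustration of what `score` reports, the sketch below recomputes R² by hand on synthetic data (the arrays `X_demo` and `y_demo` are made up and have nothing to do with the Titanic columns). A model that always predicts the mean has R² = 0, so a value near 0.25 means the regression explains roughly a quarter of the variance in Age beyond plain mean imputation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=50)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = ((y_demo - y_hat) ** 2).sum()
ss_tot = ((y_demo - y_demo.mean()) ** 2).sum()
r2_manual = 1.0 - ss_res / ss_tot

# Baseline: always predicting the mean gives R² = 0
baseline = np.full_like(y_demo, y_demo.mean())
r2_baseline = 1.0 - ((y_demo - baseline) ** 2).sum() / ss_tot

print(r2_manual, model.score(X_demo, y_demo), r2_baseline)
```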


In [70]:
x1 = pd.isnull(train_df["Age"])  # Boolean mask: True where Age is missing
x2 = train_df1[x1].index  # Index labels of the rows with a missing Age


In [71]:
# Passing the labels in x2 to .iloc works only because train_df1 still has its default 0..n-1 index
x_test = train_df1.iloc[x2,2:]  # Features of the rows where Age is missing
y_pred = reg.predict(x_test)  # Predict the missing ages
train_df1.iloc[x2,1] = y_pred  # Column 1 is Age: replace the nulls with the predicted values
train_df1.isnull().sum()

Survived      0
Age           0
SibSp         0
Parch         0
Fare          0
Pclass_1      0
Pclass_2      0
Pclass_3      0
Sex_female    0
Sex_male      0
Embarked_C    0
Embarked_Q    0
Embarked_S    0
dtype: int64
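
The same imputation can be written with a boolean mask and `.loc`, which does not depend on the frame having a default integer index. This is a sketch under stated assumptions rather than the notebook's method: the `toy` frame, its values and the `feature_cols` list are hypothetical and only stand in for the real columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy frame; two ages are missing (illustrative values only)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 51.8625],
})
feature_cols = ["SibSp", "Fare"]  # assumed predictors for this sketch

missing = toy["Age"].isna()  # True where Age must be imputed

# Fit on the rows with a known Age
model = LinearRegression().fit(toy.loc[~missing, feature_cols], toy.loc[~missing, "Age"])

# Predict and write back only into the rows that were missing
toy.loc[missing, "Age"] = model.predict(toy.loc[missing, feature_cols])

print(toy["Age"].isna().sum())
```

Selecting by mask and column name keeps the code valid even if rows are filtered or columns reordered earlier in the pipeline.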

In [72]:
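# Rescaling Age and Fare
# Note: Normalizer scales each row (an Age, Fare pair) to unit L2 norm, not each column
train_df3 = train_df1
x = train_df3[["Age","Fare"]].values
normaliser = preprocessing.Normalizer()
x_normalized = normaliser.fit_transform(x)
df = pd.DataFrame(x_normalized, index=train_df3.index)  # keep the index aligned for the concat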


In [73]:
train_clean = pd.concat([train_df3, df], axis=1)


In [74]:
train_clean = train_clean.drop(columns=["Age","Fare"])

In [75]:
train_clean = train_clean.rename(columns={0:"Age",1:"Fare"})

In [76]:
train_clean.head()

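| | Survived | SibSp | Parch | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Age | Fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0.949757 | 0.312988 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0.470417 | 0.882444 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0.956551 | 0.291564 |
| 3 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0.550338 | 0.834942 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0.974555 | 0.224148 |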
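
The Age and Fare values above come from `Normalizer`, which rescales each row to unit length: for the first passenger, 22 / sqrt(22² + 7.25²) ≈ 0.9498. The sketch below reproduces this on the five (Age, Fare) pairs shown earlier and contrasts it with `StandardScaler`, which rescales each column to zero mean and unit variance. `StandardScaler` is shown only as a possible alternative when per-feature scaling is the goal; it is not what the notebook uses.

```python
import numpy as np
from sklearn import preprocessing

# (Age, Fare) for the first five passengers shown in the head() output
x_demo = np.array([
    [22.0, 7.25],
    [38.0, 71.2833],
    [26.0, 7.925],
    [35.0, 53.1],
    [35.0, 8.05],
])

# Row-wise: each (Age, Fare) pair is divided by its own L2 norm
row_scaled = preprocessing.Normalizer().fit_transform(x_demo)

# Column-wise: each feature is centred and divided by its standard deviation
col_scaled = preprocessing.StandardScaler().fit_transform(x_demo)

print(row_scaled[0])                        # first row after Normalizer
print(np.linalg.norm(row_scaled, axis=1))   # every row has norm 1
print(col_scaled.mean(axis=0), col_scaled.std(axis=0))
```

With row-wise normalisation the two columns are no longer independent (Age² + Fare² = 1 in every row), which is worth keeping in mind when interpreting a model trained on "train_clean".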


In [77]:
train_clean.to_csv("train_clean", index=False)  # Missing ages imputed; Age and Fare rescaled with Normalizer


In [78]:
train_df1.to_csv("train_clean1", index=False)  # Missing ages imputed; no rescaling