## Titanic Survival Prediction Using the Logistic Regression Model
This notebook walks through training a logistic regression model to predict which passengers survived the Titanic shipwreck of 1912. The resulting predictions are prepared for submission to the Titanic competition on Kaggle.

The notebook shows:

- how to perform some EDA and pre-process the data for the model.
- how to train a logistic regression model on the training dataset.
- how to use the trained model to make predictions on the test dataset.

## Step 1: Import the Dependencies for the Model
Note that `train_test_split` is not imported from scikit-learn because the data was already split into training and test files before the task began. If your data comes without a train-test split, import `train_test_split` from `sklearn.model_selection` to perform the split. pandas and NumPy are imported for data manipulation and wrangling, while `LogisticRegression` and the evaluation metrics are imported from scikit-learn.

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
#from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
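
For data that does not arrive pre-split, the split mentioned above could look roughly like the sketch below. The toy DataFrame, the 80/20 ratio, and the use of `stratify` are assumptions for illustration, not part of this notebook's pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for an unsplit dataset
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 35, 54, 2, 27, 14, 4],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86, 21.07, 11.13, 30.07, 16.70],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

X_all = df[["Age", "Fare"]]
y_all = df["Survived"]

# Hold out 20% of the rows; stratify keeps the class balance similar in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)
print(X_tr.shape, X_val.shape)
```

A held-out validation part like `X_val` gives an accuracy estimate on rows the model has not seen during fitting.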

## Step 2: Load Data, Perform Exploratory Analysis and Pre-Processing on Data
- Identify the data types present and make appropriate conversions, if necessary.
- Drop features that have excessive missing values; for features with few missing values, impute values for the NaNs.
- Create dummies for the categorical variables in the dataset (drop the original categorical columns).
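
One way to decide which features have "excessive" missing values is to look at the share of NaNs per column. The sketch below uses a hypothetical toy frame, and the 50% cut-off is an assumed threshold rather than a rule taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names mirror the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Fraction of missing values per column
missing_share = toy.isna().mean()

# Assumed cut-off: columns with more than half their values missing are drop candidates
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(to_drop)
```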

In [3]:
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

In [4]:
#For Train Data: info() reports the dtypes together with the non-null counts
df_train.info()

#For Test Data
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass  

In [5]:
# Map coded values to readable labels (the same mappings apply to both datasets)
cla = {1: "First Class", 2: "Second Class", 3: "Third Class"}
emb = {"Q": "Queenstown", "S": "Southampton", "C": "Cherbourg"}

#For Train
df_train['Pclass'] = df_train['Pclass'].map(cla)
df_train['Embarked'] = df_train['Embarked'].map(emb)

#For Test
df_test['Pclass'] = df_test['Pclass'].map(cla)
df_test['Embarked'] = df_test['Embarked'].map(emb)

In [6]:
drop_cols = ["Name", "Cabin","Ticket"]
dummy_cols = ["Sex", "Pclass","Embarked"]

#For Train
dummies_tr = []
for col in dummy_cols:
    dummies_tr.append(pd.get_dummies(df_train[col]))

df_dummy_tr = pd.concat(dummies_tr, axis = 1)
df_train = pd.concat((df_train, df_dummy_tr), axis =1)

df_train.drop(drop_cols, axis = 1, inplace = True)
df_train.drop(dummy_cols, axis=1, inplace=True)

In [7]:
drop_cols = ["Name", "Cabin","Ticket"]
dummy_cols = ["Sex", "Pclass","Embarked"]

#For Test
dummies_te = []
for col in dummy_cols:
    dummies_te.append(pd.get_dummies(df_test[col]))

df_dummy_te = pd.concat(dummies_te, axis = 1)
df_test = pd.concat((df_test, df_dummy_te), axis =1)

df_test.drop(drop_cols, axis = 1, inplace = True)
df_test.drop(dummy_cols, axis=1, inplace=True)
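
Because the dummies are built separately for the train and test sets, a category that is absent from one of them would produce mismatched columns. A minimal sketch of one way to guard against that, using hypothetical toy frames and `reindex`, is:

```python
import pandas as pd

# Hypothetical toy frames: the test frame has no "Queenstown" passengers
toy_train = pd.DataFrame({"Embarked": ["Southampton", "Cherbourg", "Queenstown"]})
toy_test = pd.DataFrame({"Embarked": ["Southampton", "Cherbourg"]})

train_d = pd.get_dummies(toy_train["Embarked"])
test_d = pd.get_dummies(toy_test["Embarked"])

# Reindex the test dummies to the training columns;
# categories missing from the test data become all-zero columns
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))
```

With matching columns in the same order, the same feature list can be selected from both frames without a `KeyError`.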

In [8]:
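#For Train
# Fill missing ages by linear interpolation between neighbouring rows,
# then truncate to whole years
df_train["Age"] = df_train["Age"].interpolate()
df_train["Age"] = df_train["Age"].astype('int64')

#For Test
# Fare is interpolated as well; Age is handled as for the training data
df_test["Fare"] = df_test["Fare"].interpolate()
df_test["Age"] = df_test["Age"].interpolate()
df_test["Age"] = df_test["Age"].astype('int64')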
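
Linear interpolation depends on row order, and with its default settings it leaves a NaN in the first row unfilled, which would make the `astype('int64')` call fail. As a hedged alternative (not the method used in this notebook), missing ages can be filled with the median computed on the training data. The toy frames below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; note the test frame starts with a missing value
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, 54.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 47.0, np.nan]})

# Median computed on the training data only, then reused for the test data
age_median = toy_train["Age"].median()
toy_train["Age"] = toy_train["Age"].fillna(age_median)
toy_test["Age"] = toy_test["Age"].fillna(age_median)

print(age_median)
print(toy_test["Age"].tolist())
```

Reusing the training median for the test set keeps information from the test data out of the pre-processing step.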

## Split the Data into Features and Target Variables

In [9]:
# Feature columns shared by the train and test sets
features = ['Age', 'SibSp', 'Parch', 'Fare', 'female', 'male',
            'First Class', 'Second Class', 'Third Class',
            'Cherbourg', 'Queenstown', 'Southampton']

#Train Data Feature and Target
X = df_train[features]
y = df_train["Survived"]

# Test Data Features
X_test = df_test[features]

## Instantiate Model and Fit on the Training Dataset

In [10]:
#Train Model on Train Dataset
model = LogisticRegression(solver = "liblinear", max_iter = 1000, random_state = 42)
model.fit(X,y)
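
After fitting, the coefficients can be inspected to see which features push the predicted odds of survival up or down. The sketch below fits the same kind of model on synthetic data; the data, the names `toy_X`, `toy_y`, and `toy_model`, and the link between `female` and the target are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic features, purely for illustration
toy_X = pd.DataFrame({
    "female": rng.integers(0, 2, size=200),
    "Fare": rng.uniform(5, 100, size=200),
})
# Synthetic target loosely tied to "female"
toy_y = ((toy_X["female"] + rng.normal(0, 0.5, size=200)) > 0.5).astype(int)

toy_model = LogisticRegression(solver="liblinear", max_iter=1000, random_state=42)
toy_model.fit(toy_X, toy_y)

# exp(coefficient) is the multiplicative change in the odds
# for a one-unit increase of the feature
odds_ratios = pd.Series(np.exp(toy_model.coef_[0]), index=toy_X.columns)
print(odds_ratios)
```

An odds ratio above 1 means the feature raises the predicted odds of the positive class, and a value below 1 means it lowers them.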

## Predict Survival of Test Passengers from the Test Features

In [11]:
predict = model.predict(X_test)

In [14]:
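# Accuracy on the training data (the test set has no labels),
# so this is an optimistic estimate of performance on unseen passengers
acc = accuracy_score(y, model.predict(X))
print(acc)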

0.8002244668911336
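
The accuracy above is computed on the same rows the model was trained on. A less optimistic estimate can be obtained with cross-validation, and a confusion matrix shows where the errors fall. The sketch below runs on synthetic data from `make_classification`, with 5 folds as an assumed choice; in the notebook, `X` and `y` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic stand-in for the training features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

clf = LogisticRegression(solver="liblinear", max_iter=1000, random_state=42)

# Accuracy measured on 5 held-out folds
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean())

# Out-of-fold predictions; rows are true classes, columns are predicted classes
oof_pred = cross_val_predict(clf, X_demo, y_demo, cv=5)
cm = confusion_matrix(y_demo, oof_pred)
print(cm)
```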


## Print the Results and Export to CSV File

In [12]:
output = pd.DataFrame({'PassengerId': df_test.PassengerId, 'Survived': predict})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


In [13]:
output["Survived"].value_counts()

Survived
0    265
1    153
Name: count, dtype: int64