In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Some considerations

* Our dataset is already split into train and test sets - we might want at some point to do some kind of cross-validation (and so re-merge those 2 datasets) to increase the accuracy of our model.  
* We were provided with a simple heuristic: the result of a model that assigns 'survived' to all the women. This is a starting benchmark to understand the additional impact of our work.  
* Every modification we make to the train dataset should also be made to the test dataset.

Let's now look at the data.

# Setting up some base variables

In [None]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
train_data.head()

In [None]:
train_data.describe()

From this, we see that some fields won't be very useful.
* PassengerId seems to be a unique ID assigned to each passenger, so it cannot be used as a feature in our model
* Survived is the target our model will predict
* Pclass seems to be a good predictor of survival (knowing the story of the Titanic)
* Name could be interesting, but we might need to clean the field and extract something like the title
* Sex => very interesting (knowing the story)
* SibSp => # of siblings or spouses on board. It could be interesting to check whether people survived more often when they were part of a family
* Parch => # of parents or children on board; same idea as above
* Ticket seems similar to PassengerId, but we should look into it in more detail
* Fare => interesting field
* Cabin - we should look at the null values and what they mean (no cabin?)
* Embarked is the port of embarkation - I am not sure for now how that could affect the model, but there might be reasons why it would, so for now let's keep it and keep studying the data


# Studying the NA

The first step is to understand where the NA values are and which strategy we could follow to fill them

In [None]:
train_data.isna().sum()

We see that:
* Cabin is mostly empty. For this reason we recommend removing this field, as it would be very difficult to recreate it (we could maybe infer it from Fare/Embarked, but then the question is whether we need this feature at all if the other features already carry enough information)
* Age is a bit trickier. We need to look into this feature more before deciding on the best approach; possibly something can be done using other features. We could use the average (what we did for the first submission), but if - for instance - all the rows missing Age belong to children, then we would completely change the output of the model.
* Embarked is missing in only 2 rows. We could assign them the mode, in case we discover this is an important feature.
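
As a rough sketch of the two checks discussed above, the snippet below looks at how the missing Age values are spread across classes and fills Embarked with its mode. The `toy` DataFrame and its values are made up for illustration; on the real data the same calls would presumably be applied to `train_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data (values are made up for illustration)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Age":      [38.0, np.nan, 27.0, np.nan, 4.0, np.nan],
    "Embarked": ["C", "S", "S", None, "Q", "S"],
})

# Share of missing Age per class: if it differs a lot between groups,
# a single global average may distort the model
missing_age_share = toy["Age"].isna().groupby(toy["Pclass"]).mean()
print(missing_age_share)

# Fill the few missing Embarked values with the most frequent port
embarked_mode = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)
print(toy["Embarked"].tolist())
```

If the share of missing ages turns out to be similar in every group, the global average is probably a reasonable first choice.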

For now, we'd like to run a quick model first to have a general benchmark against all our future models. In order to run a quick model, we'll use a few simple variables:
* Pclass
* Sex (needs to be encoded since it is not numeric)
* Age (after filling the NA)
* SibSp
* Parch
* Fare

# Creating a function to encode dummy variables

In [None]:
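def dummy_encoded(df, columns):
    """Return a copy of df where each column in `columns` is replaced by dummy columns."""
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    for col in columns:
        # Cast to str so numeric categories (e.g. Pclass) are encoded as well
        encoding = pd.get_dummies(df[[col]].astype(str))
        df = pd.concat([df.drop(columns=[col]), encoding], axis=1)
    return df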

# Encoding Dummy Variables

At this stage we are going to encode the dummy variables for the features we want to keep. We will also create a variable column_to_keep that will contain the new columns we want our first model to use

In [None]:
df_training = dummy_encoded(train_data,['Sex','Pclass'])
column_to_keep = ['Age','SibSp','Parch','Pclass_1','Pclass_2','Sex_male','Fare']
df_training_final = df_training[column_to_keep]
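
To make the choice of columns a bit more concrete, here is a small sketch of what the encoding produces. The `toy` DataFrame is made up for illustration. It shows that, within one group of dummies, exactly one column is set per row, which is presumably why one column per group (here `Pclass_3` and `Sex_female`) can be left out of `column_to_keep` without losing information.

```python
import pandas as pd

# Toy stand-in for the Sex and Pclass columns (values are made up)
toy = pd.DataFrame({"Sex": ["male", "female", "female"], "Pclass": [3, 1, 2]})

# Cast to str so the numeric Pclass is treated as a category
encoded = pd.get_dummies(toy.astype(str))
print(sorted(encoded.columns))

# Within each group exactly one dummy is set per row, so one column per
# group is redundant (Pclass_3 is implied when Pclass_1 and Pclass_2 are 0)
pclass_cols = ["Pclass_1", "Pclass_2", "Pclass_3"]
print(encoded[pclass_cols].sum(axis=1).tolist())
```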

# Filling the NA

In [None]:
df_training_final.isna().sum()

For Age, we have a substantial number of NA. To keep it simple, we'll replace them with the average of the column (Note => that's not necessarily the best approach; in a subsequent submission we should double-check whether this is the best strategy)

In [None]:
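# numeric_only=True skips text columns (Name, Ticket, ...), which recent pandas versions refuse to average
train_data.mean(numeric_only=True)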

As we can see from the above, the average age is ~29.6. We are going to round it to 30 and we are going to replace the NA by 30

In [None]:
training_set = df_training_final.fillna(30)
training_set = pd.concat([training_set,train_data[['Survived']]],axis=1)
training_set = training_set.astype(int)
training_set.head()
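
As a small variation on the cell above, the fill value can be derived from the data instead of being typed by hand. This is only a sketch on a made-up `toy` column; `age_fill` is a hypothetical name that does not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Age column (values are made up for illustration)
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 30.0, np.nan]})

# Derive the fill value from the data instead of typing it by hand
age_fill = round(toy["Age"].mean())
filled = toy["Age"].fillna(age_fill)
print(age_fill, filled.tolist())
```

Computing the value this way keeps the fill in sync with the data if rows are added or removed later.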

# Building a first model to get a benchmark (+first result)

To start with, we're going to create a 'naive' model to understand how precise/accurate a model can be without any refinement. Then we shall improve this model.
This is a classification problem, so we'll use a simple classifier (logistic regression) to build the model

In [None]:
reg = LogisticRegression()
reg.fit(training_set[column_to_keep], training_set["Survived"])

In [None]:
tn, fp, fn, tp = confusion_matrix(
    training_set["Survived"],
    reg.predict(training_set[column_to_keep])).ravel()

In [None]:
accuracy = (tp+tn)/(tp+tn+fp+fn)
precision = tp / (tp+fp)
print([accuracy,precision])
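
To compare these numbers with the 'all women survived' heuristic mentioned in the considerations, the same accuracy can be computed for that rule. The snippet below is a minimal sketch on a made-up `toy` DataFrame; on the real data the `Sex_male` and `Survived` columns of `training_set` would presumably play the same role.

```python
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration)
toy = pd.DataFrame({
    "Sex_male": [1, 0, 0, 1, 1, 0],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Heuristic: predict survival for every woman and death for every man
baseline_pred = (toy["Sex_male"] == 0).astype(int)
baseline_accuracy = (baseline_pred == toy["Survived"]).mean()
print(baseline_accuracy)
```

A model is only adding value if its accuracy ends up above this baseline.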

# Cleaning Data for 1st submission

We are going to use this first model to make one submission and get a first result on the problem we are trying to solve. But we can't directly run it on our test data, mostly because some values seem to be missing

## Encoding the data + keeping the same features that we have in our model

In [None]:
df_test = dummy_encoded(test_data,['Sex','Pclass'])
test_set = df_test[column_to_keep]
test_set.head()

## Filling NA

In [None]:
test_set.isna().sum()

* For age, we will follow the same strategy that we followed previously (replace by 30)
* For Fare, we are assuming that Fare is very related to the class you are in. For this reason we'll look at the average fare for the class of this passenger, and we'll use that here

In [None]:
test_set[test_set['Fare'].isna()]

This person was in class 3. Let's look at the average Fare for our passengers in Pclass 3

In [None]:
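# numeric_only=True skips text columns, which recent pandas versions refuse to average
train_data.groupby('Pclass').mean(numeric_only=True)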

In order to fill in the missing Fare, we are going to use the average Fare for Pclass = 3

In [None]:
# Fill both columns in one call; reassigning avoids writing into a slice of df_test
test_set = test_set.fillna({'Fare': 13.7, 'Age': 30})
test_set.isna().sum()
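
The class-based fill above can also be done without hard-coding the value. The sketch below uses made-up `toy_train` and `toy_test` DataFrames and assumes the raw `Pclass` column is still available (in this notebook that would mean working before the dummy encoding).

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test sets (values are made up)
toy_train = pd.DataFrame({"Pclass": [1, 1, 3, 3], "Fare": [80.0, 60.0, 8.0, 12.0]})
toy_test = pd.DataFrame({"Pclass": [3, 1], "Fare": [np.nan, 71.0]})

# Average fare per class, learned from the training data only
fare_by_class = toy_train.groupby("Pclass")["Fare"].mean()

# Fill each missing fare with the average of that passenger's class
toy_test["Fare"] = toy_test["Fare"].fillna(toy_test["Pclass"].map(fare_by_class))
print(toy_test)
```

This would also cover the case where several passengers from different classes are missing a fare.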

## Preparing the data for the submission

In [None]:
result = pd.concat([
    test_data[['PassengerId']],
    pd.DataFrame(reg.predict(test_set),columns=['Survived'])],
    axis=1)

In [None]:
result.to_csv('titanic_submission5.csv', index = False, header=True)

Here we end up with a score of **0.75837** - let's see how we can improve that
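
Since the accuracy computed earlier was measured on the same rows the model was trained on, it may be optimistic compared with the leaderboard score. One way to get a local estimate on unseen rows is k-fold cross-validation. The snippet below is a sketch on synthetic data (`X` and `y` are made up); with the real data, `training_set[column_to_keep]` and `training_set["Survived"]` would presumably take their place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the features and the Survived column
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# 5-fold cross-validation: each fold is scored on rows the model did not see
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print(scores.mean())
```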

# Deep diving into the first model

From this past result there are a few strategies we could use:
* rethinking the different features
* trying different models

Let's start with rethinking the different features

## Rethinking the different features

### Understanding the coef

In [None]:
coef_from_reg = pd.DataFrame(
    data = reg.coef_,
    columns = column_to_keep)
coef_from_reg.head()

From this, it seems like the 'Fare' feature has the smallest impact on the model. We could consider removing it.<br> 
The same seems true of Age/Parch => maybe we could transform these features. <br>
In particular, we are thinking of coding a true/false flag for "is a kid", based on the well-known adage 'Women and children first', as well as a flag for 'is part of a family'.<br>
For kids => let's define a kid as anyone aged 16 or under

In [None]:
training_set['Is a Child'] = training_set['Age'] <= 16 
training_set.head()
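
The 'is part of a family' flag mentioned above is not built in this notebook. One possible definition, given here only as an assumption, is to flag anyone travelling with at least one sibling, spouse, parent or child. The `toy` DataFrame and the `Is Family` column name are hypothetical.

```python
import pandas as pd

# Toy stand-in for the SibSp and Parch columns (values are made up)
toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# Hypothetical flag: travelling with at least one sibling, spouse, parent or child
toy["Is Family"] = ((toy["SibSp"] + toy["Parch"]) > 0).astype(int)
print(toy["Is Family"].tolist())
```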

In [None]:
training_set2 = training_set.drop(["Age","Fare"],axis=1)
training_set2 = training_set2.astype(int)
training_set2_wo_survived = training_set2.drop(["Survived"],axis=1)
training_set2_wo_survived.head()

### Training a second model

In [None]:
reg2 = LogisticRegression()
reg2.fit(training_set2_wo_survived, training_set2["Survived"])
tn2, fp2, fn2, tp2 = confusion_matrix(training_set2["Survived"], reg2.predict(training_set2_wo_survived)).ravel()
accuracy2 = (tp2+tn2)/(tp2+tn2+fp2+fn2)
precision2 = tp2 / (tp2+fp2)
print([accuracy2,precision2])

We have a slightly higher accuracy and precision here - let's look at the coefficients

In [None]:
coef_from_reg2 = pd.DataFrame(
    data = reg2.coef_,
    columns = training_set2_wo_survived.columns)
coef_from_reg2.head()
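
One caveat when reading coefficients like these: their size depends on the scale of each feature, so a feature with large values (such as Fare) can get a small coefficient without being unimportant. A possible way to put them on a comparable footing is to standardize the features before fitting. The snippet below is a sketch on synthetic data; the column names mirror the notebook, but the values and the relationship with `y` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two features on very different scales
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Fare": rng.uniform(0, 500, size=300),
    "Sex_male": rng.integers(0, 2, size=300),
})
# Made-up target: survival is less likely when Sex_male is 1
y = (rng.uniform(size=300) < np.where(X["Sex_male"] == 1, 0.2, 0.7)).astype(int)

# Scale every feature to mean 0 / std 1 before fitting, so that coefficient
# magnitudes are on a comparable footing
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

scaled_coef = pd.Series(
    model.named_steps["logisticregression"].coef_[0], index=X.columns)
print(scaled_coef)
```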