# Exploratory Data Analysis of the Titanic Dataset


## Imports

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [1]:
import pandas as pd
from pathlib import Path
import sys
import os
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

## Data Cleaning and Handling Missing Values

In [41]:
data = pd.read_csv('train.csv')
fig = go.Figure(data=[go.Table(header=dict(values=data.columns),
                 cells=dict(values=[data[i] for i in data.columns]))
                     ])
fig.show()

We can see right away that three columns have missing values, and the `Cabin` column in particular seems to contain many NaN.
The dataset contains almost 900 passengers:
  - 177 (less than a quarter) have their age missing;
  - 687 (more than 3/4) have their cabin unknown;
  - 2 have their port of embarkation missing.

In [42]:
data.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
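The raw counts are easier to compare once they are expressed as shares of the dataset. A minimal sketch, assuming the same kind of DataFrame as `data`; `missing_summary` is a hypothetical helper and the `toy` frame contains made-up values, not the Titanic data:

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    # The mean of a boolean mask is the fraction of True values
    percent = (df.isna().mean() * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that have gaps, most affected first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for train.csv (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, np.nan, np.nan, "C85"],
    "Fare": [7.25, 71.28, 8.05, 53.1],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(data)` on the real dataset should give the same information as `data.isna().sum()`, with a percentage column added.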

The missing ages could be replaced with the mean of the column, and the missing embarkation values could be replaced by the most frequent port of embarkation in the dataset. As for the cabin, the majority of the values are missing, so we can drop the column for the moment. Another solution could be explored in the future.

In [43]:
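# Replace missing ages with the (truncated) mean age.
# Assigning the result back avoids the chained `inplace=True` pattern,
# which recent pandas versions warn about.
mean_age = int(data['Age'].mean())
data['Age'] = data['Age'].fillna(mean_age)

# Replace missing embarkation ports with the most frequent one
emb = data['Embarked'].value_counts().idxmax()
data['Embarked'] = data['Embarked'].fillna(emb)

# Too many cabins are missing to impute them, so drop the column
del data['Cabin']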

data.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

Conclusion: we have replaced the missing ages with the average age of the ship's passengers, and the missing ports of embarkation have been assigned to the majority class to have the least possible impact on the distribution. Lastly, the `Cabin` column has been removed in view of its very large number of missing values. These treatments may be reconsidered in the light of subsequent analyses and modelling.

## Data Visualisation and Exploration


Let's look at the survival rate. The following plot shows that a majority of the passengers died: only 38% of them survived.

In [6]:
fig = px.pie(data, names='Survived', title='Passenger Survival')
fig.show()
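The overall survival share can also be read off numerically rather than from the pie chart. A minimal sketch, assuming a 0/1 `Survived` column as in `train.csv`; the `toy` values are made up and do not reproduce the 38% figure:

```python
import pandas as pd

# Toy stand-in for data['Survived'] (values are made up)
toy = pd.Series([0, 0, 0, 1, 1], name="Survived")

# normalize=True returns shares instead of raw counts
shares = toy.value_counts(normalize=True)
print(shares)
```

On the real dataset, the same call would be `data['Survived'].value_counts(normalize=True)`.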

### Onboarding Spot

First, we see that most of the passengers boarded in Southampton.

In [9]:
fig = go.Figure(data=[go.Pie(labels=data['Embarked'], pull=[.1, .15, .15, 0])])
fig.show()

Then, in the second plot, we can see that passengers from Southampton are more likely to die than the others, and passengers from Cherbourg are more likely to survive than the others. So there is a correlation between the port of embarkation and the survival rate.

In [10]:
fig = make_subplots(rows=1, cols=3, specs=[[{"type": "pie"}, {"type": "pie"}, {"type": "pie"}]])
fig.add_trace(
            go.Pie(labels=data.loc[data['Embarked'] == 'C']['Survived'], pull = [.1, .1],
                   title = 'Embarked C vs. Survived'), row=1, col=1)

fig.add_trace(
            go.Pie(labels=data.loc[data['Embarked'] == 'S']['Survived'], pull = [.07, .07],
                   title = 'Embarked S vs. Survived'),row=1, col=2)

fig.add_trace(
            go.Pie(labels=data.loc[data['Embarked'] == 'Q']['Survived'], pull = [.1, .1],
                   title = 'Embarked Q vs. Survived'), row=1, col=3)


fig.update_layout(height=500, width=800, title_text="Survival by Port of Embarkation")
fig.show()
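Comparing three pie charts by eye is imprecise, so it can help to put the survival rate per port in a small table. A minimal sketch, assuming the `Embarked` and `Survived` column names from `train.csv`; the `toy` rows are made up for illustration:

```python
import pandas as pd

# Toy data (made up), using the same column names as train.csv
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 1],
})

# Survived is coded 0/1, so its mean within a group is that group's
# survival rate; the group size shows how much data backs each rate.
rate_by_port = toy.groupby("Embarked")["Survived"].agg(rate="mean", passengers="size")
print(rate_by_port)
```

Keeping the group size next to the rate is a deliberate choice: a rate computed on a small group is much less reliable than one computed on a large group.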

### Passenger's Age and Class

Let's see if interesting information can be found in the passenger's age and class.

First, the age distribution below shows that most of the passengers are under forty years old, so quite young compared to the entire age range.

In [11]:
fig = px.histogram(data, x='Age', nbins=50, histnorm='probability density')
fig.show()

Second, we look at the distribution of the passengers across 1st, 2nd and 3rd class against their age. We can see that older passengers are more likely to be in 1st class, and younger ones tend to be in 2nd and 3rd. So there is a correlation between those two features.

In [13]:
fig = px.box(data, x='Pclass', y="Age", points="all")
fig.show()

Now we want to look at the survival rate against the passenger class. We see that the death rate of 3rd class passengers is about 25 percentage points higher than that of the 2nd class and 38 points higher than that of the 1st. Moreover, the 2nd class death rate is about 15 points higher than the 1st. So we have a strong correlation between passenger class and survival rate.

By extension, since age and class are related, we can expect a correlation between passenger age and survival, although this remains to be checked directly.

In [14]:
fig = make_subplots(rows=1, cols=3, specs=[[{"type": "pie"}, {"type": "pie"}, {"type": "pie"}]])
fig.add_trace(
            go.Pie(labels=data.loc[data['Pclass'] == 1]['Survived'], pull = [.1, .1],
                   title = 'Pclass 1 vs. Survived'), row=1, col=1)

fig.add_trace(
            go.Pie(labels=data.loc[data['Pclass'] == 2]['Survived'], pull = [.07, .07],
                   title = 'Pclass 2 vs. Survived'),row=1, col=2)

fig.add_trace(
            go.Pie(labels=data.loc[data['Pclass'] == 3]['Survived'], pull = [.1, .1],
                   title = 'Pclass 3 vs. Survived'), row=1, col=3)
fig.show()
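When comparing rates between classes, a difference in percentage points and a relative ratio say different things, so it is worth computing both. A minimal sketch, assuming the `Pclass` and `Survived` column names from `train.csv`; the `toy` rows are made up and do not reproduce the figures read from the plots:

```python
import pandas as pd

# Toy data (made up): four passengers per class
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
})

# Death rate per class, in percent (1 - survival rate)
death_rate = (1 - toy.groupby("Pclass")["Survived"].mean()) * 100

# Absolute gap, in percentage points
gap_points = death_rate.loc[3] - death_rate.loc[1]
# Relative gap: how many times higher the 3rd class death rate is
relative = death_rate.loc[3] / death_rate.loc[1]

print(death_rate)
print(f"3rd vs 1st: +{gap_points:.0f} points, x{relative:.1f}")
```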


### Passenger's sex
First, we see that men represent the majority of the population.

In [19]:
fig = px.pie(data, names='Sex', title='Passenger sex')
fig.show()

Second, we can see that women were about 4 times more likely to survive than men. Once more, this represents a strong correlation between the feature and the target variable.

In [18]:
fig = make_subplots(rows=1, cols=2, specs=[[{"type": "pie"}, {"type": "pie"}]])
fig.add_trace(
            go.Pie(labels=data.loc[data['Sex'] == 'male']['Survived'], pull = [.1, .1],
                   title = 'Male vs. Survived'), row=1, col=1)

fig.add_trace(
            go.Pie(labels=data.loc[data['Sex'] == 'female']['Survived'], pull = [.07, .07],
                   title = 'Female vs. Survived'),row=1, col=2)
fig.show()

### Family on board

Finally, we look at the correlation between siblings, spouses, parents, children and survival. The larger the family on board, the more likely a passenger was to die on the Titanic.

In [21]:
fig = px.density_contour(data, x="SibSp", y="Parch", color='Survived',
                        height=400, width=800)
fig.update_traces(contours_coloring="fill", contours_showlabels = True)
fig.show()
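The contour plot treats `SibSp` and `Parch` separately; one way to check the family-size claim more directly is to add the two columns together. A minimal sketch, assuming the column names from `train.csv`; `FamilySize` is a hypothetical helper column that does not exist in the dataset, and the `toy` rows are made up:

```python
import pandas as pd

# Toy data (made up), using the same column names as train.csv
toy = pd.DataFrame({
    "SibSp":    [0, 1, 0, 3, 4, 0, 1, 5],
    "Parch":    [0, 0, 0, 2, 2, 0, 1, 2],
    "Survived": [1, 1, 0, 0, 0, 1, 1, 0],
})

# Hypothetical derived feature: number of relatives on board
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

# Survival rate for each family size
rate_by_size = toy.groupby("FamilySize")["Survived"].mean()
print(rate_by_size)
```

On the real data, large family sizes are rare, so the rates at the upper end would rest on few passengers and should be read with caution.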

### Onboarding Spot vs Age and Class

We are now going to look for correlations between the meaningful features we just described.

We see that boarding at Southampton (S) and travelling in 3rd class are correlated. Remember that passengers from Southampton are more likely to die than the others, and so are passengers from 3rd class. This correlation is important because the two features seem to overlap in explaining the target variable.

In [20]:
fig = px.density_heatmap(data, x="Embarked", y="Pclass",
                        height=500, width=500)
fig.show()
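The heatmap shows raw counts, which are dominated by the largest port. A row-normalised contingency table makes the class mix of each port comparable. A minimal sketch, assuming the `Embarked` and `Pclass` column names from `train.csv`; the `toy` rows are made up:

```python
import pandas as pd

# Toy data (made up), using the same column names as train.csv
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Pclass":   [3, 3, 3, 1, 1, 1, 3, 3],
})

# Raw counts: the same information as the density heatmap
counts = pd.crosstab(toy["Embarked"], toy["Pclass"])

# Share of each class within each port (every row sums to 1)
shares = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")

print(counts)
print(shares)
```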


### Conclusions
We've just seen that several independent variables in the dataset are correlated with the dependent variable we're trying to predict, so we do have potentially explanatory variables. What's more, we've observed that some of the explanatory variables appear to be correlated, a point we'll need to take into account in our modelling, particularly for the regularizations we may need to implement.


## Statistical Analysis

## Outlier detection and treatment

## Conclusion