# Explainability AI - Project

[Names of the group members]

Dataset : Titanic (https://www.kaggle.com/datasets/brendan45774/test-file)

We decided to use the Titanic dataset for our project. The dataset contains information about the passengers of the Titanic, such as their age, the class of their ticket, the fare they paid, etc. 

The goal is to predict whether a passenger survived. We will train a model on this dataset and then use it to predict the survival of a given passenger.

In [159]:
# imports
import pandas as pd
import seaborn as sns


In [160]:
# Load the dataset
titanic = sns.load_dataset('titanic')
titanic.head(5).style.set_caption('First 5 rows of the dataset')


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


- fare is the ticket price (in British pounds)
- embarked is the port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- deck is the deck of the passenger's cabin
- parch is the number of parents/children aboard
- sibsp is the number of siblings/spouses aboard
- pclass is the class of the ticket (1 = first, 2 = second, 3 = third)
- who is the type of passenger (man / woman / child)

In [161]:
titanic.describe().style.format("{0:.2f}").set_caption("Dataset numerical features description")


Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.38,2.31,29.7,0.52,0.38,32.2
std,0.49,0.84,14.53,1.1,0.81,49.69
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.12,0.0,0.0,7.91
50%,0.0,3.0,28.0,0.0,0.0,14.45
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.33


In [162]:
def data_info(titanic):
    data = {
        'Column Name': titanic.columns,
        'Data Type': [str(titanic[col].dtype) for col in titanic.columns],
        'Non-Null Count': titanic.count().values
    }

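    info_df = pd.DataFrame(data)

    # set_caption returns a new Styler and does not modify info_df,
    # so return the Styler itself to keep the caption
    return info_df.style.set_caption('Dataset information')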

data_info(titanic)


Unnamed: 0,Column Name,Data Type,Non-Null Count
0,survived,int64,891
1,pclass,int64,891
2,sex,object,891
3,age,float64,714
4,sibsp,int64,891
5,parch,int64,891
6,fare,float64,891
7,embarked,object,889
8,class,category,891
9,who,object,891


We can remove the following columns:
- class (because pclass already encodes the same information)
- sex (because the column "who" is more precise)
- adult_male (same reason)
- deck (because many of its values are missing, as the preview of the first rows shows)
- embark_town (because it duplicates embarked)
- alive (because it duplicates survived)
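
The redundancy claims above can be checked before dropping anything. The following is a minimal sketch, assuming that "duplicate" means a one-to-one recoding between two columns. The helper `is_recoding` is a hypothetical name introduced for illustration, and the sample is limited to the five rows shown in the preview, so it only illustrates the check rather than proving it for the full dataset.

```python
import pandas as pd

# First five rows of the dataset, copied from the preview above
sample = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "class": ["Third", "First", "Third", "First", "Third"],
    "embarked": ["S", "C", "S", "S", "S"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton",
                    "Southampton", "Southampton"],
    "alive": ["no", "yes", "yes", "yes", "no"],
})


def is_recoding(df, a, b):
    """Return True when columns a and b map one-to-one onto each other."""
    # Keep each distinct (a, b) combination once, ignoring missing values
    pairs = df[[a, b]].dropna().drop_duplicates()
    # One-to-one means no value of a or b appears in two different pairs
    return bool(pairs[a].is_unique and pairs[b].is_unique)


for a, b in [("pclass", "class"), ("embarked", "embark_town"), ("survived", "alive")]:
    print(f"{a} / {b}: {is_recoding(sample, a, b)}")
```

Running the same helper on the full `titanic` DataFrame would confirm, or refute, each pair before the `drop` call.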

In [163]:
titanic = titanic.drop(['class', 'sex', 'adult_male', 'deck', 'embark_town', 'alive'], axis=1)


Some passengers (15) were given their ticket for free:

In [164]:
titanic.sort_values(by='fare', ascending=True).head(17)


Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embarked,who,alone
815,0,1,,0,0,0.0,S,man,True
822,0,1,38.0,0,0,0.0,S,man,True
481,0,2,,0,0,0.0,S,man,True
466,0,2,,0,0,0.0,S,man,True
263,0,1,40.0,0,0,0.0,S,man,True
633,0,1,,0,0,0.0,S,man,True
271,1,3,25.0,0,0,0.0,S,man,True
413,0,2,,0,0,0.0,S,man,True
597,0,3,49.0,0,0,0.0,S,man,True
732,0,2,,0,0,0.0,S,man,True


We chose to remove them from the dataset because they are not representative of the population and could bias the model.

There are only 15 of them, so we still have enough relevant data.


In [165]:
titanic_free = titanic[titanic['fare'] == 0]
titanic = titanic.drop(titanic_free.index, axis=0)
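
The same filtering can be sketched on a toy DataFrame to count the affected rows and confirm that only paying passengers remain. The values below are hypothetical and chosen for illustration only. The boolean mask is an equivalent alternative to dropping by index, not a replacement for the cell above.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "fare": [0.0, 7.25, 0.0, 71.2833],
    "survived": [0, 0, 1, 1],
})

# Boolean mask marking the zero-fare rows
free_mask = toy["fare"] == 0
print("zero-fare rows:", int(free_mask.sum()))

# Keeping the complement of the mask is equivalent to dropping those indices
kept = toy[~free_mask]
print("remaining rows:", len(kept))
```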


Let's check what we have now:

In [166]:
data_info(titanic)


Unnamed: 0,Column Name,Data Type,Non-Null Count
0,survived,int64,876
1,pclass,int64,876
2,age,float64,707
3,sibsp,int64,876
4,parch,int64,876
5,fare,float64,876
6,embarked,object,874
7,who,object,876
8,alone,bool,876


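We can now fill the missing values for age and embarked. We will use a KNNImputer for age and the most frequent value for embarked.

The KNNImputer replaces each missing age with the mean age of the 5 nearest neighbors. The neighbors are computed based on the other numeric columns in the DataFrame, so those columns must be passed to the imputer together with age.

In [167]:
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer

# KNN needs other features to measure distances between passengers;
# given 'age' alone, the imputer can only fall back to the column mean
knn_columns = ['age', 'pclass', 'sibsp', 'parch', 'fare']
imputer_age = KNNImputer(n_neighbors=5)
# 'age' is the first column of the transformed array
titanic['age'] = imputer_age.fit_transform(titanic[knn_columns])[:, 0]

# Fill the missing ports of embarkation with the most frequent port
imputer_embarked = SimpleImputer(strategy='most_frequent')
titanic['embarked'] = imputer_embarked.fit_transform(titanic[['embarked']])[:, 0]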
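
To make the idea behind KNN imputation concrete, here is a simplified pure-Python sketch. It is not scikit-learn's implementation: the function `knn_impute` is a hypothetical helper, it assumes that the feature columns have no missing values, and the toy passengers are made-up values. It only illustrates how a missing age is replaced by the mean age of the nearest rows.

```python
import math


def knn_impute(rows, target, features, k=5):
    """Fill missing `target` values with the mean target of the k nearest rows.

    Simplified sketch: distances use only `features`, assumed to be complete.
    """
    # Rows with a known target value can serve as neighbors
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        # Euclidean distance to every donor, measured on the feature columns
        nearest = sorted(
            donors,
            key=lambda d: math.dist(
                [row[f] for f in features], [d[f] for f in features]
            ),
        )[:k]
        mean = sum(d[target] for d in nearest) / len(nearest)
        filled.append({**row, target: mean})
    return filled


# Toy passengers (hypothetical values, not taken from the dataset)
passengers = [
    {"pclass": 1, "fare": 70.0, "age": 40.0},
    {"pclass": 1, "fare": 75.0, "age": 50.0},
    {"pclass": 3, "fare": 8.0, "age": 20.0},
    {"pclass": 3, "fare": 7.0, "age": 22.0},
    {"pclass": 1, "fare": 72.0, "age": None},
]

imputed = knn_impute(passengers, target="age", features=["pclass", "fare"], k=2)
print(imputed[-1]["age"])  # 45.0, the mean age of the two first-class neighbors
```

Because the distance here is computed on raw values, a wide-ranging column such as fare dominates the choice of neighbors. Scaling the features first is a common way to balance their influence.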


In [168]:
data_info(titanic)


Unnamed: 0,Column Name,Data Type,Non-Null Count
0,survived,int64,876
1,pclass,int64,876
2,age,float64,876
3,sibsp,int64,876
4,parch,int64,876
5,fare,float64,876
6,embarked,object,876
7,who,object,876
8,alone,bool,876
