# Discovering the Titanic 🚢🚢

The Titanic is a well-known ship, but did you know that it is also one of the most popular datasets in data science? Here's the link to the dataset:

<a href="https://www.kaggle.com/c/titanic/"> Titanic </a>

Machine learning is largely about statistical prediction and understanding data. The objective of this exercise is to predict whether a passenger survived the sinking of the Titanic, based on the information available about that passenger. The code that trains the model, makes predictions and evaluates performance is already written. Your task is to complete the upstream part: preparing the dataset before the model is trained (preprocessing).

1. Download the dataset _titanic.csv_.
2. Try to understand what's in this dataset.
    1. You will find an explanation of every column at this link: <a href="https://www.kaggle.com/c/titanic/data"> Titanic Data </a>

3. Place the file _titanic.csv_ in the same folder as this notebook and read it.

4. Explore the dataset and determine which columns are useful for prediction and what preprocessing you will do.

Number of rows : 891

Display of dataset: 


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S



Basic statistics:


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,



Percentage of missing values: 


PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64
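
The exploration outputs above could be produced along the following lines. This is a minimal sketch, not the notebook's actual code: it inlines the five sample rows displayed earlier so that it runs without the file, and the variable names `SAMPLE_CSV`, `dataset` and `missing_pct` are assumptions. With the real data you would call `pd.read_csv("titanic.csv")` instead, and the figures would match the 891-row outputs above rather than this sample.

```python
import io

import pandas as pd

# The five sample rows displayed above, inlined so the sketch runs without the file.
# With the real data: dataset = pd.read_csv("titanic.csv")
SAMPLE_CSV = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
'''

dataset = pd.read_csv(io.StringIO(SAMPLE_CSV))

print("Number of rows :", dataset.shape[0])
print(dataset.head())

# include="all" also summarizes text columns (count, unique, top, freq)
print(dataset.describe(include="all"))

# Share of missing values per column, in percent
missing_pct = 100 * dataset.isnull().sum() / dataset.shape[0]
print(missing_pct)
```

Columns with a large share of missing values, such as Cabin in the full dataset, are natural candidates for removal in the next step.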

## Preprocessing - pandas part 🐼🐼 
5. Use the pandas library to discard columns you won't use for prediction.

In this dataset, some categorical variables have too many distinct values (modalities), so consider dropping them. As a rule of thumb, for a dataset with fewer than 1000 rows, we tend to reject categorical variables that have more than 15-20 possible values. Check the number of unique values in each column to decide which ones to keep.

Dropping useless columns...
...Done.
   Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked
0         0       3    male  22.0      1      0   7.2500        S
1         1       1  female  38.0      1      0  71.2833        C
2         1       3  female  26.0      0      0   7.9250        S
3         1       1  female  35.0      1      0  53.1000        S
4         0       3    male  35.0      0      0   8.0500        S
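
One way to apply the unique-values rule of thumb is sketched below, again on the five sample rows rather than the full file. The list `useless_cols` is an assumption inferred from the columns that remain in the output above; the other names are illustrative.

```python
import pandas as pd

# Five sample rows from the dataset display (the real file has 891 rows)
dataset = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Distinct values per non-numeric column: high counts flag columns to drop
print(dataset.select_dtypes(exclude="number").nunique())

# Assumed selection: an identifier, two near-unique text columns,
# and Cabin, which is mostly missing in the full dataset
useless_cols = ["PassengerId", "Name", "Ticket", "Cabin"]

print("Dropping useless columns...")
dataset = dataset.drop(useless_cols, axis=1)
print("...Done.")
print(dataset.head())
```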


6. Separate the target variable (Y) from the explanatory variables (X)

Separating labels from features...
...Done.
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

   Pclass     Sex   Age  SibSp  Parch     Fare Embarked
0       3    male  22.0      1      0   7.2500        S
1       1  female  38.0      1      0  71.2833        C
2       3  female  26.0      0      0   7.9250        S
3       1  female  35.0      1      0  53.1000        S
4       3    male  35.0      0      0   8.0500        S
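
The separation step can be sketched as follows, using the five rows printed above. The name `target_name` is an assumption; `Y` and `X` follow the notation of the question.

```python
import pandas as pd

# The five rows shown after dropping the unused columns
dataset = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
})

target_name = "Survived"

print("Separating labels from features...")
Y = dataset.loc[:, target_name]        # target: a Series
X = dataset.drop(target_name, axis=1)  # features: every other column
print("...Done.")
print(Y.head())
print()
print(X.head())
```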



## Preprocessing - scikit-learn part 🔬🔬
7. Split your data into a train set and a test set; the test set should contain 15% of the available data.

Dividing into train and test sets...
...Done.
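
A possible way to perform this split with scikit-learn's `train_test_split` is sketched below on the five sample rows. The value `random_state=0` is an arbitrary assumption made for reproducibility; the notebook's actual seed is not shown.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Features and target for the five sample rows shown earlier
X = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
})
Y = pd.Series([0, 1, 1, 1, 0], name="Survived")

print("Dividing into train and test sets...")
# test_size=0.15 keeps 15% of the rows for testing;
# random_state is an arbitrary seed (assumption)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.15, random_state=0
)
print("...Done.")
print(X_train.shape, X_test.shape)
```

On the full 891-row dataset, `test_size=0.15` should give 134 test rows and 757 training rows, because scikit-learn rounds the test count up.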



8. Create the preprocessing pipeline for numeric columns

9. Create the preprocessing pipeline for categorical columns

10. Use the preprocessing pipelines of questions 8 and 9 to transform X_train and X_test

Reminder: call `fit_transform()` on X_train and only `transform()` on X_test, so that the test set is transformed with the parameters (imputation values, means, scales, categories) learned from the training set.
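
Questions 8 to 10 could be approached as in the sketch below. The choices made here are assumptions: mean imputation plus standardization for numeric columns, and most-frequent imputation plus one-hot encoding with `drop="first"` for categorical columns. They appear consistent with the transformed array printed in this section (eight columns, and a scaled Age close to zero for the row whose Age is missing), but the notebook's actual pipeline is not shown. The training rows are the five displayed in this section, and the two test rows are borrowed from the first dataset display purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Five training rows as displayed in this section
X_train = pd.DataFrame({
    "Pclass": [1, 3, 3, 3, 2],
    "Sex": ["male", "male", "male", "female", "male"],
    "Age": [64.0, 21.0, np.nan, 40.0, 44.0],
    "SibSp": [0, 0, 1, 1, 1],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [26.0, 8.05, 7.75, 9.475, 26.0],
    "Embarked": ["S", "S", "Q", "S", "S"],
}, index=[545, 37, 214, 40, 236])

# Two rows from the first display, used as a stand-in test set
X_test = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [22.0, 35.0],
    "SibSp": [1, 1],
    "Parch": [0, 0],
    "Fare": [7.25, 53.1],
    "Embarked": ["S", "S"],
}, index=[0, 3])

numeric_features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
categorical_features = ["Sex", "Embarked"]

# Question 8: fill missing numbers with the column mean, then standardize
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Question 9: fill missing categories with the most frequent one, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(drop="first")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

# Question 10: learn the parameters on the train set only, then reuse them on the test set
X_train_t = preprocessor.fit_transform(X_train)
X_test_t = preprocessor.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

In this small sample Embarked takes only two values, so the result has seven columns; on the full dataset, where Embarked has three values, the same sketch would produce eight.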

Performing preprocessings on train set...
     Pclass     Sex   Age  SibSp  Parch    Fare Embarked
545       1    male  64.0      0      0  26.000        S
37        3    male  21.0      0      0   8.050        S
214       3    male   NaN      1      0   7.750        Q
40        3  female  40.0      1      0   9.475        S
236       2    male  44.0      1      0  26.000        S
...Done.
[[-1.60067161e+00  2.61131471e+00 -4.63468368e-01 -4.65997851e-01
  -1.09604554e-01  1.00000000e+00  0.00000000e+00  1.00000000e+00]
 [ 8.10688409e-01 -6.78358906e-01 -4.63468368e-01 -4.65997851e-01
  -4.71133941e-01  1.00000000e+00  0.00000000e+00  1.00000000e+00]
 [ 8.10688409e-01 -2.71796941e-16  4.31545801e-01 -4.65997851e-01
  -4.77176214e-01  1.00000000e+00  1.00000000e+00  0.00000000e+00]
 [ 8.10688409e-01  7.75217807e-01  4.31545801e-01 -4.65997851e-01
  -4.42433140e-01  0.00000000e+00  0.00000000e+00  1.00000000e+00]
 [-3.94991602e-01  1.08123396e+00  4.31545801e-01 -4.65997851e-01
  -1.0960

### Training model

In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
# Train model
model = LogisticRegression()

print("Training model...")
model.fit(X_train, Y_train)  # Always fit on the training set only
print("...Done.")

Training model...
...Done.


### Predictions

In [16]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = model.predict(X_train)
print("...Done.")
print(Y_train_pred[0:5])
print()

Predictions on training set...
...Done.
[0 0 0 0 0]



In [17]:
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = model.predict(X_test)
print("...Done.")
print(Y_test_pred[0:5])
print()

Predictions on test set...
...Done.
[0 0 0 1 1]



### Performance evaluation

In [18]:
from sklearn.metrics import accuracy_score

In [19]:
# Print scores
print("Accuracy on training set : ", accuracy_score(Y_train, Y_train_pred))
print("Accuracy on test set : ", accuracy_score(Y_test, Y_test_pred))

Accuracy on training set :  0.8018494055482166
Accuracy on test set :  0.7910447761194029


If you get a score close to 0.79 on the test set, it means that you carried out all the preprocessing steps with a sound methodology! :-)