## Introduction

This project was completed at Codecademy as part of the Data Scientist Career Path. More information is available [here](https://www.codecademy.com/learn/paths/data-science).

## About Project

The RMS Titanic set sail on its maiden voyage in 1912, crossing the Atlantic from Southampton, England to New York City. The ship never completed the voyage: it struck an iceberg and sank to the bottom of the Atlantic Ocean, killing 1,502 of the 2,224 passengers and crew on board.

In this project I will create a **Logistic Regression model** that predicts which passengers survived the sinking of the Titanic, based on features like age and class.

The data used to train the model is provided by Kaggle.

## About Data

The file `train.csv` contains data on 891 passengers who were on board the Titanic when it sank.

#### Task 1 - import Python libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#### Task 2 - load the data

In [2]:
passangers = pd.read_csv('train.csv')

In [3]:
passangers.shape

(891, 12)

In [4]:
passangers.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


#### Task 3

Given the saying, “women and children first,” `Sex` and `Age` seem like good features for predicting survival. Let’s map the text values in the `Sex` column to numerical values. Update `Sex` so that every `female` value is replaced with 1 and every `male` value is replaced with 0.

In [5]:
passangers['Sex'] = passangers['Sex'].map({'male': 0, 'female': 1})

In [6]:
passangers.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
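One thing worth checking after a dictionary-based `map` is whether any value failed to match a key, because `Series.map` silently turns unmatched values into `NaN`. The sketch below illustrates this on a hypothetical toy frame (`toy` and `SexCode` are names made up for the example, not part of the project data).

```python
import pandas as pd

# Hypothetical toy frame standing in for train.csv (assumption for illustration)
toy = pd.DataFrame({"Sex": ["male", "female", "female", None]})

# Series.map returns NaN for any value that is not a key of the dictionary
toy["SexCode"] = toy["Sex"].map({"male": 0, "female": 1})

# Count the rows that matched neither 'male' nor 'female'
unmapped = int(toy["SexCode"].isna().sum())

print(toy)
print("unmapped rows:", unmapped)
```

If the unmapped count is zero on the real data, the column is fully numeric and safe to pass to the model.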


#### Task 4

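Let’s take a look at `Age`. The column contains multiple missing values, or NaNs. Note that `value_counts()`, used in the cell below, leaves NaNs out of its counts by default, so they do not appear in the output. Fill all the empty `Age` values in `passangers` with the mean age.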

In [7]:
print(passangers['Age'].value_counts())

24.00    30
22.00    27
18.00    26
19.00    25
30.00    25
         ..
55.50     1
70.50     1
66.00     1
23.50     1
0.42      1
Name: Age, Length: 88, dtype: int64


In [8]:
passangers['Age'] = passangers['Age'].fillna(value = passangers['Age'].mean())
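A more direct way to see the missing values is to count them with `isna().sum()` before and after filling. The following is a minimal sketch on a hypothetical five-row `Age` column (the frame `toy` and its values are assumptions, not the Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical Age column with two missing entries (assumption for illustration)
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 30.0]})

missing_before = int(toy["Age"].isna().sum())

# Series.mean skips NaN by default, so the mean uses the three known ages
mean_age = toy["Age"].mean()
toy["Age"] = toy["Age"].fillna(mean_age)

missing_after = int(toy["Age"].isna().sum())

print("missing before:", missing_before)
print("missing after:", missing_after)
print(toy["Age"].tolist())
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is a reasonable trade-off for a small exercise like this one.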

#### Task 5

Given the strict class system onboard the Titanic, let’s use the `Pclass` column, or the passenger class, as another feature. Create a new column named `FirstClass` that stores 1 for all passengers in first class and 0 for all other passengers.

In [9]:
passangers['FirstClass'] = passangers['Pclass'].apply(lambda x: 1 if x == 1 else 0)

#### Task 6

Create a new column named `SecondClass` that stores 1 for all passengers in second class and 0 for all other passengers.

In [10]:
passangers['SecondClass'] = passangers['Pclass'].apply(lambda x: 1 if x == 2 else 0)

In [11]:
passangers.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | FirstClass | SecondClass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 0 | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 0 | 0 |
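The same indicator columns can also be built without `apply`, by comparing the whole column at once and casting the boolean result to integers. This is only an alternative sketch on a hypothetical `Pclass` column (the frame `toy` is an assumption); it produces the same 0/1 values as the lambda version.

```python
import pandas as pd

# Hypothetical passenger classes (assumption for illustration)
toy = pd.DataFrame({"Pclass": [3, 1, 2, 1, 3]})

# A vectorized comparison yields booleans; astype(int) converts them to 0/1
toy["FirstClass"] = (toy["Pclass"] == 1).astype(int)
toy["SecondClass"] = (toy["Pclass"] == 2).astype(int)

print(toy)
```

Third class needs no column of its own: a passenger with 0 in both indicators is in third class, which acts as the baseline.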


#### Task 7

Now that we have cleaned our data, let’s select the columns we want to build our model on.

Select columns `Sex`, `Age`, `FirstClass`, and `SecondClass` and store them in a variable named `features`.

Select column `Survived` and store it in a variable named `survival`.

In [12]:
features = passangers[['Sex', 'Age', 'FirstClass', 'SecondClass']]
survival = passangers['Survived']

#### Task 8

Split the data into training and test sets using sklearn’s `train_test_split()` method. We’ll use the training set to train the model and the test set to evaluate the model.

In [13]:
x_train, x_test, y_train, y_test = train_test_split(features, survival, train_size = 0.8, test_size = 0.2)
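Because `train_test_split()` shuffles the rows randomly, each run of the call above produces a different split and therefore slightly different scores. A minimal sketch of a reproducible split follows, assuming synthetic arrays `X` and `y`; the `random_state` value and the use of `stratify` are illustrative choices, not something the project specifies.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 100 rows, 2 features, 60/40 class balance
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("test rows:", len(X_te))
print("positives in test set:", int(y_te.sum()))
print("same split on rerun:", np.array_equal(X_te, X_te2))
```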

#### Task 9

Since sklearn’s Logistic Regression implementation applies regularization by default, we need to scale our feature data.

Create a `StandardScaler` object, call `.fit_transform()` on the training features, and call `.transform()` on the test features.
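One detail that is easy to miss in this step: `fit_transform()` and `transform()` return new arrays and leave their input unchanged, so the results have to be assigned to be used later. The sketch below shows this on small hypothetical arrays (`train`, `test`, and the `_scaled` names are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature arrays (assumption for illustration)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data
train_scaled = scaler.fit_transform(train)

# transform reuses the training statistics, so no test information leaks in
test_scaled = scaler.transform(test)

print("scaled training means:", train_scaled.mean(axis=0))
print("scaled training stds:", train_scaled.std(axis=0))
print("original array unchanged:", train[0])
```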

In [14]:
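scaler = StandardScaler()
# Note: the scaled arrays returned below are displayed but not assigned,
# so x_train and x_test remain unscaled in the cells that follow.
scaler.fit_transform(x_train)
scaler.transform(x_test)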

array([[-7.40152746e-01, -3.10489966e-01, -5.66537509e-01,
         1.91143788e+00],
       [-7.40152746e-01, -2.43891447e-02, -5.66537509e-01,
        -5.23166361e-01],
       [ 1.35107247e+00, -6.19861929e-01,  1.76510819e+00,
        -5.23166361e-01],
       [-7.40152746e-01,  9.26997885e-01, -5.66537509e-01,
        -5.23166361e-01],
       [ 1.35107247e+00,  1.70042779e+00,  1.76510819e+00,
        -5.23166361e-01],
       [-7.40152746e-01, -2.25721304e+00, -5.66537509e-01,
         1.91143788e+00],
       [-7.40152746e-01, -1.11800367e-03, -5.66537509e-01,
        -5.23166361e-01],
       [-7.40152746e-01, -2.43891447e-02, -5.66537509e-01,
        -5.23166361e-01],
       [-7.40152746e-01, -6.97204920e-01, -5.66537509e-01,
        -5.23166361e-01],
       [-7.40152746e-01, -2.43891447e-02, -5.66537509e-01,
        -5.23166361e-01],
       [ 1.35107247e+00, -1.62532081e+00, -5.66537509e-01,
        -5.23166361e-01],
       [-7.40152746e-01,  1.00434088e+00, -5.66537509e-01,
        ...

#### Task 10

Create a LogisticRegression model with sklearn and .fit() it on the training data.

In [15]:
model = LogisticRegression()
model.fit(x_train, y_train)

LogisticRegression()
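One way to make sure the model always sees scaled features is to bundle the scaler and the classifier into a single sklearn pipeline. This is a sketch of that alternative, not the approach taken in the cells of this project; the data is synthetic and the names `pipe`, `X`, and `y` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.0, 30.0], scale=[1.0, 15.0], size=(200, 2))
y = (X[:, 0] + (X[:, 1] - 30.0) / 15.0 > 0).astype(int)

# The pipeline fits the scaler first, then fits the model on the scaled output
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

# score() and predict() apply the same scaling automatically
train_accuracy = pipe.score(X, y)
print("training accuracy:", train_accuracy)
```

With a pipeline, new samples can be passed to `pipe.predict()` in their original units, since the stored scaler is applied before the model.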

#### Task 11

`score()` the model on the test data and print the test score.

In [16]:
print('score() the model on test data:')
print(model.score(x_test, y_test))

score() the model on test data:
0.7932960893854749
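For a classifier, `score()` returns accuracy: the fraction of rows whose predicted label matches the true label. Comparing the training and test accuracy side by side is a quick check for overfitting. The sketch below does this on synthetic data from `make_classification` (all names and parameter values are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (assumption for illustration)
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)

# score() is the mean of (prediction == label) over the given rows
train_acc = clf.score(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)
manual_test_acc = (clf.predict(X_te) == y_te).mean()

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

A training score far above the test score would suggest the model has memorized the training set rather than learned a general pattern.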


#### Task 12

Print the feature coefficients determined by the model. Which feature is most important in predicting survival on the sinking of the Titanic?

In [17]:
print(list(zip(['Sex', 'Age', 'FirstClass', 'SecondClass'],model.coef_[0])))

[('Sex', 2.4560735256011776), ('Age', -0.03175818716691897), ('FirstClass', 2.1780424091829573), ('SecondClass', 1.1247334441022996)]


The most important feature is **Sex**, which has the largest coefficient.
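Logistic regression coefficients are changes in log-odds, so exponentiating them gives odds ratios, which are often easier to read. The sketch below reuses the four coefficient values printed above; the variable names are assumptions. Keep in mind that coefficient sizes are only directly comparable when the features share a scale, and `Age` in years spans a much wider range than the 0/1 columns.

```python
import numpy as np

names = ["Sex", "Age", "FirstClass", "SecondClass"]

# Coefficient values copied from the printed model output above
coefs = np.array([
    2.4560735256011776,
    -0.03175818716691897,
    2.1780424091829573,
    1.1247334441022996,
])

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed
odds_ratios = np.exp(coefs)

for name, coef, ratio in zip(names, coefs, odds_ratios):
    print(f"{name}: coefficient={coef:.3f}, odds ratio={ratio:.2f}")

# Feature with the largest absolute coefficient
largest = names[int(np.argmax(np.abs(coefs)))]
print("largest coefficient:", largest)
```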

#### Task 13

Let’s use our model to make predictions on the survival of a few fateful passengers. The cell below stores information for 3rd class passenger Jack and 1st class passenger Rose in NumPy arrays, along with a third sample, `Me`. Each array lists `Sex`, `Age`, `FirstClass`, and `SecondClass`, in that order.

In [18]:
Jack = np.array([0.0,20.0,0.0,0.0])
Rose = np.array([1.0,17.0,1.0,0.0])
Me = np.array([1.0,30.0,0.0,1.0])

In [19]:
sample_passangers = np.array([Jack, Rose, Me])

#### Task 14

A Logistic Regression model trained on scaled feature data expects scaled inputs, so the feature data we are making predictions on must be scaled in the same way.

Using the `StandardScaler` object created earlier, apply its `.transform()` method to `sample_passangers` and save the result to `sample_passangers`. Note that the cells below call `predict()` on `sample_passangers` directly, without this transform step.

In [20]:
print(model.predict(sample_passangers))

[0 1 1]


In [21]:
print(model.predict_proba(sample_passangers))

[[0.8764594  0.1235406 ]
 [0.05896313 0.94103687]
 [0.21350863 0.78649137]]
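Each row of the `predict_proba()` output holds the probability of class 0 followed by the probability of class 1, and the two sum to 1. For a binary model with default settings, the class 1 probability is the sigmoid of the linear score, p = 1 / (1 + exp(-(w·x + b))). The sketch below checks this relationship on synthetic data; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (assumption for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Linear score: weighted sum of the features plus the intercept
z = X @ clf.coef_[0] + clf.intercept_[0]

# The sigmoid maps the score to a probability of class 1
p_class1 = 1.0 / (1.0 + np.exp(-z))

proba = clf.predict_proba(X)

# predict() picks class 1 when its probability exceeds 0.5
manual_labels = (p_class1 > 0.5).astype(int)

print("max difference:", np.abs(proba[:, 1] - p_class1).max())
print("labels match:", np.array_equal(manual_labels, clf.predict(X)))
```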
