# Introduction
This case study is about predicting which passengers survived the [sinking of the Titanic](https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic).
We explore the dataset, clean it, and compare prediction models on it.

# Data description
In this section, we load and explore the data.

In [2]:
# Import libraries
import numpy as np # linear algebra
import pandas as pd # data processing

We have 2 datasets:
- "train.csv": contains information about some passengers (multiple columns) and whether they survived (one column). You may download this dataset <a href="{{ site.baseurl }}/dev/titanic/data/train.csv">here</a> in CSV format.
- "test.csv": contains information about some passengers (multiple columns) but without the survival information. You may download this dataset <a href="{{ site.baseurl }}/dev/titanic/data/test.csv">here</a> in CSV format.

In what follows, we mainly use the first dataset.

In [7]:
# This creates a pandas dataframe and assigns it to the titanic variable
titanic = pd.read_csv("data/train.csv")

# Print the first five rows of the dataframe
print(titanic.head(5))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


Here are the different columns:
- <b>PassengerId</b>: Id of the passenger
- <b>Survived</b>: Survival (0 = No, 1 = Yes)
- <b>Pclass</b>: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- <b>Name</b>: Name of the passenger
- <b>Sex</b>: Sex
- <b>Age</b>: Age in years
- <b>SibSp</b>: Number of siblings / spouses aboard the Titanic
- <b>Parch</b>: Number of parents / children aboard the Titanic
- <b>Ticket</b>: Ticket number
- <b>Fare</b>: Passenger fare
- <b>Cabin</b>: Cabin number
- <b>Embarked</b>: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


# Hypothesis
Let's think about the variables that might affect the outcome of survival:
- Were women and children more likely to survive? If yes, "Age" and "Sex" would be good predictors.
- Knowing that first class cabins were closer to the deck of the ship, were first class passengers more likely to survive? If yes, the passenger class "Pclass" might affect the outcome. "Fare" is tied to passenger class and would probably be strongly correlated with survival too.
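
These hypotheses can be checked directly by comparing survival rates per group. The sketch below uses a small made-up dataframe (`toy` and its values are assumptions for illustration, not the real data); the same `groupby` calls should apply to the `titanic` dataframe.

```python
import pandas as pd

# Toy stand-in for the training data; the values are made up for illustration.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Sex": ["male", "female", "female", "female", "male", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 3, 2, 2, 1],
})

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

A large gap between groups suggests the column is worth keeping as a predictor.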

Family size (the number of siblings/spouses and parents/children aboard) will probably be correlated with survival one way or the other: there would be either more people to help you, or more people to think about trying to save.

There may also be links between survival and columns like Ticket, Name, and Embarked (because people who boarded at certain ports may have had cabins closer to or farther from the top of the ship).

We call this step acquiring domain knowledge, and it's fairly important to most machine learning tasks. We're looking to engineer the features so that we maximize the information we have about what we're trying to predict.
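
As an illustration of feature engineering, the family-size idea could be turned into a column as sketched below. The column name `FamilySize` and the toy values are assumptions; the rest of this case study does not use this feature.

```python
import pandas as pd

# Toy rows with made-up values, standing in for the real dataframe.
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Hypothetical feature: number of relatives aboard, not counting the passenger.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

print(toy)
```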

# Data cleaning
Let us have a look at the dataset.

In [14]:
# Summary on the dataframe
print(len(titanic))
print(titanic.describe())

891
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


Every numerical column has a count of 891, except the "Age" column, which has a count of 714.
This indicates that "Age" has 177 missing values (null, NA, or not a number).

Because we don't want to drop the rows with missing values, we clean the data by filling each missing age with the median of the column.

In [15]:
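# Keep an untouched copy; a plain assignment would only add a second name for the same dataframe
titanic2 = titanic.copy()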
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
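
To see what the median fill does, here is a minimal sketch on a made-up series (the values are assumptions, not taken from the dataset). It counts the missing entries, fills them, and checks that none remain.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries.
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan], name="Age")

n_missing = ages.isnull().sum()    # number of missing values before the fill
median_age = ages.median()         # NaN entries are skipped by default
filled = ages.fillna(median_age)   # returns a new series, the original is unchanged

print(n_missing, median_age)
print(filled.isnull().sum())
```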

The "Sex" column is non-numeric, but it could be very informative, so we convert it to numbers.
First, we confirm that this column has no missing values. Then we make the conversion.

In [19]:
# What are the values for this column?
print(titanic["Sex"].unique())

['male' 'female']


In [20]:
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
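
An alternative way to write the same conversion is `Series.map` with a dictionary, sketched below on made-up values. One difference worth noting: `map` returns a new integer-typed series, whereas assigning through `.loc` into a string column may leave the column with the `object` dtype.

```python
import pandas as pd

# Made-up values standing in for the "Sex" column.
sex = pd.Series(["male", "female", "female", "male"])

# Same encoding as above: male -> 0, female -> 1.
sex_codes = sex.map({"male": 0, "female": 1})

print(sex_codes.tolist())
print(sex_codes.dtype)
```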

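We do the same with the "Embarked" column: we inspect its values, fill the missing ones with "S", and convert the ports to integer codes. Note that the output shown below was captured after the conversion had already been run, which is why it lists the integer codes rather than the port letters.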

In [167]:
# What are the values for this column?
print(titanic["Embarked"].unique())

[0 1 2]


In [27]:
titanic["Embarked"] = titanic["Embarked"].fillna('S')

In [36]:
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
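
The integer codes 0, 1, 2 imply an ordering between ports that does not really exist, which can matter for distance-based models. One-hot encoding is a common alternative; the sketch below shows it with `pd.get_dummies` on made-up values. This is a swapped-in technique for comparison, not what the cells above do.

```python
import pandas as pd

# Made-up values standing in for the "Embarked" column, with one missing entry.
embarked = pd.Series(["S", "C", None, "Q"])

# Fill the missing port with "S", as in the cell above.
filled = embarked.fillna("S")

# One indicator column per port instead of a single integer code.
dummies = pd.get_dummies(filled, prefix="Embarked")

print(dummies)
```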


# Model application


In [164]:
# KFold generates the train/test row indices for cross-validation
from sklearn.model_selection import KFold

# cross_val_score runs the cross-validation and returns one score per fold
from sklearn.model_selection import cross_val_score

# Import the model classes
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


In [146]:
# The columns that can be used in the prediction
predictorsAll = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

In [168]:
# Library that helps create all combinations of a list
import itertools

# Create all non-empty combinations of predictors
myList = predictorsAll
predictorCombinations = []
for index in range(1, len(myList)+1):
    for subset in itertools.combinations(myList, index):
        predictorCombinations.append(list(subset))

#print(predictorCombinations)
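
With 7 predictors, enumerating every non-empty subset gives 2^7 - 1 = 127 combinations, so each model is cross-validated 127 times. The sketch below rebuilds the list in a compact form (the names `predictors` and `subsets` are local to this example) and checks the count.

```python
import itertools

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# All subsets of size 1 up to len(predictors), in the same order as the loop above.
subsets = [
    list(combo)
    for size in range(1, len(predictors) + 1)
    for combo in itertools.combinations(predictors, size)
]

print(len(subsets))
```

The count doubles with every added predictor, so this exhaustive search only stays practical for a small number of columns.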

In [126]:
# Return the indices that would sort the list in ascending order
def sort_list(myList):
    return sorted(range(len(myList)), key=lambda i:myList[i])
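
A short usage sketch for this helper, with made-up scores: the returned indices can be used to reorder any list that is parallel to the scores.

```python
def sort_list(myList):
    # Indices of myList, ordered by the value they point to (ascending).
    return sorted(range(len(myList)), key=lambda i: myList[i])

# Made-up accuracies for three hypothetical predictor sets.
names = ["A", "B", "C"]
scores = [0.62, 0.53, 0.79]

order = sort_list(scores)
ranked_names = [names[i] for i in order]
ranked_scores = [scores[i] for i in order]

print(order, ranked_names, ranked_scores)
```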

In [55]:
# not used
def titanic_LinearRegression(predictors):
    # Initialize our algorithm class
    alg = LinearRegression()
    # Generate cross-validation folds for the titanic data set
    # kf.split returns the row indices corresponding to train and test
    # Without shuffling, the folds are identical on every run, so random_state is not needed
    # (newer scikit-learn releases reject random_state when shuffle is False)
    kf = KFold(n_splits=3)
    predictions = []

    for train, test in kf.split(titanic):
        # The predictors we're using to train the algorithm  
        # Note how we only take the rows in the train folds
        train_predictors = (titanic[predictors].iloc[train,:])
        # The target we're using to train the algorithm
        train_target = titanic["Survived"].iloc[train]
        # Training the algorithm using the predictors and target
        alg.fit(train_predictors, train_target)
        # We can now make predictions on the test fold
        test_predictions = alg.predict(titanic[predictors].iloc[test,:])
        predictions.append(test_predictions)

    # The predictions are in three separate NumPy arrays
    # Concatenate them into a single array along axis 0 (the only axis)
    # The folds are contiguous, so the result lines up row by row with titanic["Survived"]
    predictions = np.concatenate(predictions, axis=0)

    # Map predictions to outcomes (the only possible outcomes are 1 and 0)
    predictions[predictions > .5] = 1
    predictions[predictions <=.5] = 0
    # Accuracy is the share of predictions that match the true outcome
    accuracy = float(np.mean(predictions == titanic["Survived"].values))

    return accuracy
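
Linear regression outputs real numbers rather than classes, which is why the function thresholds its predictions at 0.5. A minimal sketch of that step on made-up values (note that a prediction of exactly 0.5 maps to 0, as in the function above):

```python
import numpy as np

# Made-up regression outputs; they can fall outside the [0, 1] range.
raw = np.array([-0.1, 0.2, 0.5, 0.51, 1.3])

# Values above 0.5 become 1 (survived), everything else becomes 0.
labels = (raw > 0.5).astype(int)

print(labels)
```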

In [56]:
# not used
def titanic_LogisticRegression(predictors):
    # Initialize our algorithm
    alg = LogisticRegression(random_state=1)
    # Compute the accuracy score for all the cross-validation folds; this is much simpler than what we did before
    scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
    # Take the mean of the scores (because we have one for each fold)
    return scores.mean()

In [165]:
def titanic_model_kf(predictors, nbKF, model, paramDict):
    # List of algorithms
    algs = []
    
    # Generate cross-validation folds for the titanic data set
    # kf.split returns the row indices corresponding to train and test
    # Without shuffling, the folds are contiguous and identical on every run, so random_state is not needed
    # (newer scikit-learn releases reject random_state when shuffle is False)
    kf = KFold(n_splits=nbKF)

    # List of predictions
    predictions = []

    for train, test in kf.split(titanic):
        # The predictors we're using to train the algorithm  
        # Note how we only take the rows in the train folds
        train_predictors = (titanic[predictors].iloc[train,:])
        # The target we're using to train the algorithm
        train_target = titanic["Survived"].iloc[train]
        
        # Initialize the algorithm selected by the model name
        if model == "LinearRegression":
            alg = LinearRegression()
        elif model == "LogisticRegression":
            alg = LogisticRegression()
        elif model == "KNeighborsClassifier":
            alg = KNeighborsClassifier(paramDict['n_neighbors'])
        else:
            raise ValueError("Unknown model: " + str(model))
        # Training the algorithm using the predictors and target
        alg.fit(train_predictors, train_target)
        algs.append(alg)
        
        # We can now make predictions on the test fold
        test_predictions = alg.predict(titanic[predictors].iloc[test,:])
        predictions.append(test_predictions)


    # Each fold produced one NumPy array of predictions
    # Concatenate them into a single array along axis 0 (the only axis)
    # The folds are contiguous, so the result lines up row by row with titanic["Survived"]
    predictions = np.concatenate(predictions, axis=0)

    # Map predictions to outcomes (the only possible outcomes are 1 and 0)
    predictions[predictions > .5] = 1
    predictions[predictions <=.5] = 0
    # Accuracy is the share of predictions that match the true outcome
    accuracy = float(np.mean(predictions == titanic["Survived"].values))

    # Return the fitted algorithms (one per fold) and the accuracy
    return [algs, accuracy]
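
For the classifiers, much of this loop can be delegated to `cross_val_score`. The sketch below is an assumption-laden illustration on synthetic data (`X` and `y` are randomly generated, not the Titanic columns). When all folds have the same size, the mean of the per-fold accuracies equals the pooled accuracy computed above.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature data with a simple linear class boundary.
rng = np.random.RandomState(0)
X = rng.normal(size=(90, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Three contiguous folds, no shuffling, as in titanic_model_kf.
cv = KFold(n_splits=3)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y,
                         cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()

print(scores, mean_accuracy)
```

The hand-written loop remains useful when the fitted model of each fold has to be kept, which `cross_val_score` does not return.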

In [185]:
accuracyList1 = []
for combination in predictorCombinations:
    # Swap in one of the commented lines to evaluate another model
    #accuracyList1.append(titanic_model_kf(combination, 3, "LinearRegression", {})[1])
    #accuracyList1.append(titanic_model_kf(combination, 3, "LogisticRegression", {})[1])
    accuracyList1.append(titanic_model_kf(combination, 3, "KNeighborsClassifier", {'n_neighbors':5})[1])

# Print the combinations from the least to the most accurate
for elementIndex in sort_list(accuracyList1):
    print(predictorCombinations[elementIndex], ": ", accuracyList1[elementIndex])

['Parch', 'Embarked'] :  0.5252525252525253
['Parch'] :  0.5297418630751964
['Pclass'] :  0.5432098765432098
['SibSp', 'Parch'] :  0.547699214365881
['SibSp', 'Parch', 'Embarked'] :  0.5488215488215489
['Age', 'Embarked'] :  0.569023569023569
['Age', 'Parch', 'Embarked'] :  0.5701459034792368
['Age'] :  0.5735129068462402
['Age', 'Parch'] :  0.5937149270482603
['Age', 'SibSp', 'Embarked'] :  0.5993265993265994
['Age', 'SibSp', 'Parch', 'Embarked'] :  0.6015712682379349
['SibSp', 'Embarked'] :  0.6071829405162739
['Pclass', 'SibSp', 'Embarked'] :  0.6083052749719416
['Sex', 'Embarked'] :  0.611672278338945
['Age', 'SibSp'] :  0.611672278338945
['SibSp'] :  0.622895622895623
['Age', 'SibSp', 'Parch'] :  0.6262626262626263
['Embarked'] :  0.6363636363636364
['Age', 'Fare'] :  0.6363636363636364
['Pclass', 'Age', 'Fare'] :  0.6374859708193041
['Age', 'Parch', 'Fare'] :  0.6374859708193041
['Pclass', 'Parch', 'Embarked'] :  0.6386083052749719
['Pclass', 'SibSp'] :  0.6408529741863075
['Pcla

# KNN

In [None]:
# Draft cell, not run yet: alg, predictors, and titanic_test are not defined earlier in this notebook
# Make predictions using the test set
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the data set
submission = pd.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })
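
Once such a dataframe exists, it can be serialized to CSV. The sketch below uses made-up ids and predictions (`toy_test` and `toy_predictions` are assumptions) and keeps the CSV in a string; passing a file path to `to_csv` would write it to disk instead.

```python
import pandas as pd

# Made-up passenger ids and predictions, standing in for the real test set.
toy_test = pd.DataFrame({"PassengerId": [892, 893, 894]})
toy_predictions = [0, 1, 0]

submission = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": toy_predictions,
})

# index=False keeps the row index out of the file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```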


In [161]:
print(pd.__version__)

0.19.2


In [163]:
import sklearn as skl
print(skl.__version__)

0.18.1
