# HW3 Class Competition

# Who survived the sinking of the Titanic?

The goal of this HW is to predict who survived the Titanic sinking in 1912.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("Titanic_0.csv")

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [4]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

## Data set description

<ul>
<li><b>Survived</b>: binary attribute that indicates whether the passenger survived. This is the dependent variable that we will attempt to explain
<li><b>Pclass</b>: Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
<li><b>Age</b>: Passenger age
<li><b>SibSp</b>: The number of the passenger's siblings/spouses aboard the Titanic
<li><b>Parch</b>: The number of the passenger's parents/children aboard the Titanic
<li><b>Fare</b>: The ticket fare
<li><b>Male</b>: binary attribute that indicates the gender (1=Male, 0=Female)
<li><b>Embarked_C</b>: binary attribute that indicates whether the passenger embarked in Cherbourg
<li><b>Embarked_Q</b>: binary attribute that indicates whether the passenger embarked in Queenstown
<li><b>Embarked_S</b>: binary attribute that indicates whether the passenger embarked in Southampton
</ul>
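
The raw file shown by `df.columns` contains `Sex` and `Embarked` rather than `Male` and `Embarked_C`/`Embarked_Q`/`Embarked_S`, so those binary attributes have to be derived. Below is a minimal sketch of one way to do that, assuming the raw column names above. The three-row `sample` frame is made up for illustration and is not the homework data.

```python
import pandas as pd

# Hypothetical three-row sample using the raw column names from df.columns
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Male: 1 for male passengers, 0 for female passengers
sample["Male"] = (sample["Sex"] == "male").astype(int)

# Embarked_C / Embarked_Q / Embarked_S: one indicator column per port
ports = pd.get_dummies(sample["Embarked"], prefix="Embarked").astype(int)
encoded = pd.concat([sample.drop(columns=["Sex", "Embarked"]), ports], axis=1)

print(encoded)
```

The cleaning function further down takes a different route: it keeps `Male` as a boolean and drops `Embarked` altogether.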

## Instructions

Clean the data set if necessary.

Use everything you know to find a machine learning model that achieves the highest possible AUC score. Two test sets have been reserved, TestA.csv and TestB.csv, and your homework will be evaluated on both. 70% of the grade is based on the AUC score on TestA.csv, and 30% is based on the ranking of your AUC score on TestB.csv among the groups. Specifically, your grade on TestA.csv equals the final AUC score multiplied by 70, and your grade on TestB.csv equals 2.5 * (number of groups - your ranking). You must submit the same model for both sets, with a clear explanation of your code. You must include the code that evaluates your model on TestA.csv and TestB.csv. Failure to do so will result in a 20% loss of grade (10% for each test).
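
As a worked example of the grading rule above, here is a small sketch. It assumes the two parts are simply added and leaves out the penalty for missing evaluation code. The function name `homework_grade` and the numbers (12 groups, rank 3, AUC 0.90) are made up for illustration.

```python
def homework_grade(auc_test_a, n_groups, ranking_test_b):
    """Grade implied by the stated rule, assuming the two parts are summed."""
    grade_a = auc_test_a * 70                    # TestA.csv part: AUC times 70
    grade_b = 2.5 * (n_groups - ranking_test_b)  # TestB.csv part: 2.5 per group ranked below you
    return grade_a + grade_b

# Hypothetical case: AUC of 0.90 on TestA.csv, ranked 3rd out of 12 groups
print(homework_grade(0.90, n_groups=12, ranking_test_b=3))
```

In this hypothetical case the TestA.csv part contributes about 63 points and the TestB.csv part contributes 22.5.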

TestB.csv is private, which means you will never see it. The ranking will be revealed only after the deadline. TestA.csv is semi-private: at most once per day, I will check your model's performance on TestA.csv using your code, let you know the AUC score, and post it on the discussion board. I will save your notebook file in the same folder as the data files. If your code does not work on my computer, you lose that day's opportunity.

In [5]:
# Cleaning dataset function (modifies the dataframe in place)
def cleaning(dataframe):
    # Bucketing age into an ordinal "youth" score for better classification
    def young(age):
        if age <= 6.5:
            return 1
        elif age <= 20:
            return 0.75
        elif age <= 40:
            return 0.50
        elif age <= 60:
            return 0.25
        else:
            # Older than 60; missing ages also land here, but those rows are dropped below
            return 0

    # Applying the filter to a new column
    dataframe['Young'] = dataframe['Age'].apply(young)

    # Flagging "small" fares (<= 10.5) to differentiate more within the column
    def small_fare(fare):
        if fare <= 10.5:
            return 1
        else:
            return 0

    # Creating the Small_Fare column
    dataframe['Small_Fare'] = dataframe['Fare'].apply(small_fare)

    # Combining Parch and SibSp into the number of relatives on board
    dataframe['Family'] = dataframe['Parch'] + dataframe['SibSp']

    # Boolean Male column (True for male passengers)
    dataframe['Male'] = dataframe['Sex'] == 'male'

    # Dropping columns that are not used as features
    dataframe.drop(['Name', 'Ticket', 'PassengerId', 'Cabin', 'Sex', 'Embarked', 'Fare', 'Parch', 'SibSp'], axis=1, inplace=True)

    # Dropping rows with missing values (e.g. passengers with a missing Age)
    dataframe.dropna(inplace=True)

In [6]:
# Applying cleaning function to titanic_0 dataset
cleaning(df)
df.head()

Unnamed: 0,Survived,Pclass,Age,Young,Small_Fare,Family,Male
0,0,3,22.0,0.5,1,1,True
1,1,1,38.0,0.5,0,1,False
2,1,3,26.0,0.5,1,0,False
3,1,1,35.0,0.5,0,1,False
5,0,1,54.0,0.25,0,0,True


In [7]:
# Setting X and Y
X = df.drop('Survived',axis=1)
Y = df.Survived

In [8]:
# Splitting data into training and test
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0)

# Training on training set
from sklearn.ensemble import RandomForestClassifier

cl = RandomForestClassifier(random_state=0)
cl.fit(X_train,Y_train)

RandomForestClassifier(random_state=0)

In [9]:
# Calculating titanic_0 AUC score
import sklearn.metrics  # import the metrics submodule explicitly instead of relying on a side effect

Y_proba = cl.predict_proba(X_test)[:,1]
sklearn.metrics.roc_auc_score(Y_test,Y_proba)

0.8690023920078795
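
The AUC reported above can be read as the probability that a randomly chosen survivor receives a higher predicted probability than a randomly chosen non-survivor, with ties counting as half. The sketch below computes it directly from that definition. `pairwise_auc` is a hypothetical helper written for illustration, not part of scikit-learn, and the labels and scores are made up.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted survival probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

Because only the ordering of the scores matters, the metric needs `predict_proba` output rather than hard 0/1 predictions.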

## Test A

In [10]:
# Importing testA.csv
TestA = pd.read_csv("testA.csv")
cleaning(TestA)

In [11]:
# Creating new X and Y values
# No feature scaling is applied: the random forest was trained on unscaled features.
# If a scaler were ever added, it would have to be fit on X_train only and then
# applied to X_test and X_new, never fit on the test data.
X_new = TestA.drop('Survived',axis=1)
Y_new = TestA.Survived
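
On the question of where a scaler should be fit: the usual practice is to learn the scaling statistics from the training data only and reuse them on every other set. The sketch below illustrates this with made-up arrays (`X_train_demo`, `X_new_demo`) standing in for the real features. It is not part of the submitted model, which uses unscaled features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices standing in for X_train and X_new
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new_demo = np.array([[2.0, 40.0]])

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train_demo)

# Reuse the same statistics for the training set and for any new set
X_train_scaled = scaler.transform(X_train_demo)
X_new_scaled = scaler.transform(X_new_demo)

print(X_new_scaled)
```

If scaling were used, the classifier would also have to be trained on `X_train_scaled`, so that training and evaluation see features on the same scale.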

In [12]:
# AUC score
Y_proba = cl.predict_proba(X_new)[:,1]
AUC_score = sklearn.metrics.roc_auc_score(Y_new,Y_proba)
print('Test A AUC Score: {}'.format(AUC_score))

Test A AUC Score: 0.9544248517855556


## Test B