# Kaggle Competition: Titanic Analysis

More information about the data can be seen at the link below:
http://www.kaggle.com/c/titanic-gettingStarted/data.

This analysis follows Udacity's course - Intro to Data Science.

In following exercises, we will perform some rudimentary practices similar to those of an actual data scientist.
    
Part of a data scientist's job is to use her or his intuition and insight to write algorithms and heuristics. A data scientist also creates mathematical models to make predictions based on some attributes from the data that they are examining.
    
Write your prediction back into the "predictions" dictionary. The key of the dictionary should be the Passenger's id (which can be accessed via passenger["PassengerId"]) and the associated value should be 1 if the passenger survived or 0 otherwise. 

You can also look at the Titantic data that you will be working with at the link below:
https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/titanic_data.csv


In [9]:
import numpy
import pandas
import statsmodels.api as sm
import csv as csv

filepath = '../csv/train.csv'

In [2]:
# Helper Functions

def create_title_col(ddf):
    ddf['Title'] = ddf['Name'].str.extract('([A-Z]\w{0,}\.)', expand=True)
    ddf.loc[ddf["Title"] == "Mlle.", "Title"] = 'Miss.'
    ddf.loc[ddf["Title"] == "Ms.", "Title"] = 'Miss.'
    ddf.loc[ddf["Title"] == "Mme.", "Title"] = 'Mrs.'
    ddf.loc[ddf["Title"] == "Dona.", "Title"] = 'Rare.'
    ddf.loc[ddf["Title"] == "Lady.", "Title"] = 'Rare.'
    ddf.loc[ddf["Title"] == "Countess.", "Title"] = 'Rare.'
    ddf.loc[ddf["Title"] == "Capt.", "Title"] = 'Rare.'
    ddf.loc[ddf["Title"] == "Col.", "Title"] = 'Rare.'
    ddf.loc[ddf["Title"] == "Don.", "Title"] = 'Rare.'
    ddf.loc[ddf["Title"] == "Major.", "Title"] = 'Rare.'
    ddf.loc[ddf["Title"] == "Rev.", "Title"] = 'Rare.'
    ddf.loc[ddf["Title"] == "Sir.", "Title"] = 'Rare.'
    ddf.loc[ddf["Title"] == "Jonkheer.", "Title"] = 'Rare.'
    ddf.loc[ddf["Title"] == "Dr.", "Title"] = 'Rare.'
    ddf.loc[ddf["Title"] == "Master.", "Title"] = 'Rare.'
    ddf.loc[ddf['Title'] == 'Mr.', 'TitleNum' ] = 1
    ddf.loc[ddf['Title'] == 'Miss.', 'TitleNum'] = 2
    ddf.loc[ddf['Title'] == 'Mrs.', 'TitleNum' ] = 3
    ddf.loc[ddf['Title'] == 'Rare.', 'TitleNum' ] = 4
    return ddf

def create_familysize_col(ddf):
    ddf['FamilySize'] = ddf['SibSp'] + ddf['Parch'] + 1
    
    ddf['FsizeD'] = 1
    ddf.loc[ (ddf['FamilySize'] > 1) & (ddf['FamilySize'] < 5), 'FsizeD'] = 2
    ddf.loc[ ddf['FamilySize'] >= 5, 'FsizeD'] = 3
    
    return ddf

def create_ageclass_col(ddf):
    ddf['Age*Class'] = ddf.Age * ddf.Pclass
    
    #bins = [0, 20, 40, 57, 85]
    #group_names = ['a', 'b', 'c', 'd']
    #df['Age*ClassD'] = pd.cut(df['Age*Class'], bins, labels=group_names)
    
    return df

In [3]:
#output file to be submitted to Kaggle
import csv as csv

test_filepath = '../csv/test.csv'
tdf = pandas.read_csv(test_filepath)
test_passengerID_list = tdf['PassengerId']

def output_csv(filename, pred_dict):
    prediction_file = open(filename, "wb")
    prediction_file_object = csv.writer(prediction_file)
    prediction_file_object.writerow(["PassengerId", "Survived"])
    for i in range(0, len(test_passengerID_list)):
        prediction_file_object.writerow( [ int(test_passengerID_list[i]), int(pred_dict[test_passengerID_list[i]]) ] )
    prediction_file.close()

In [4]:
def simple_heuristic(file_path):
    '''    
    Here's a simple heuristic to start off:
       1) If the passenger is female, your heuristic should assume that the
       passenger survived.
       2) If the passenger is male, you heuristic should
       assume that the passenger did not survive.
    
    You can access the gender of a passenger via passenger['Sex'].
    If the passenger is male, passenger['Sex'] will return a string "male".
    If the passenger is female, passenger['Sex'] will return a string "female".
    
    You prediction should be 78% accurate or higher.
    
    '''
    predictions = {}
    df = pandas.read_csv(file_path)
    for passenger_index, passenger in df.iterrows():
        passenger_id = passenger['PassengerId']
      
        if passenger['Sex'] == 'female':
            predictions[passenger_id] = 1
        else:
            predictions[passenger_id] = 0
        
    return predictions

# 78.12%

In [5]:
def complex_heuristic(file_path):
    ''' 
    Here's the algorithm, predict the passenger survived if:
    1) If the passenger is female or
    2) if his/her socioeconomic status is high AND if the passenger is under 18
    
    Otherwise, your algorithm should predict that the passenger perished in the disaster.
    
    Or more specifically in terms of coding:
    female or (high status and under 18)
    
    You can access the gender of a passenger via passenger['Sex'].
    If the passenger is male, passenger['Sex'] will return a string "male".
    If the passenger is female, passenger['Sex'] will return a string "female".
    
    You can access the socioeconomic status of a passenger via passenger['Pclass']:
    High socioeconomic status -- passenger['Pclass'] is 1
    Medium socioeconomic status -- passenger['Pclass'] is 2
    Low socioeconomic status -- passenger['Pclass'] is 3

    You can access the age of a passenger via passenger['Age'].
    
    You prediction should be 79% accurate or higher.
    '''

    predictions = {}
    df = pandas.read_csv(file_path)
    for passenger_index, passenger in df.iterrows():
        passenger_id = passenger['PassengerId']
  
        if passenger['Sex'] == 'female':
            predictions[passenger_id] = 1
        elif passenger['Pclass'] == 1 and passenger['Age'] < 18:
            predictions[passenger_id] = 1
        else:
            predictions[passenger_id] = 0
            
    return predictions

# 79.12%

In [6]:

def custom_heuristic(file_path):
    '''
    For this exercise, you need to write a custom heuristic that will take
    in some combination of the passenger's attributes and predict if the passenger
    survived the Titanic diaster.

    Can your custom heuristic beat 80% accuracy?
    
    The available attributes are:
    Pclass          Passenger Class
                    (1 = 1st; 2 = 2nd; 3 = 3rd)
    Name            Name
    Sex             Sex
    Age             Age
    SibSp           Number of Siblings/Spouses Aboard
    Parch           Number of Parents/Children Aboard
    Ticket          Ticket Number
    Fare            Passenger Fare
    Cabin           Cabin
    Embarked        Port of Embarkation
                    (C = Cherbourg; Q = Queenstown; S = Southampton)
                    
    SPECIAL NOTES:
    Pclass is a proxy for socioeconomic status (SES)
    1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

    Age is in years; fractional if age less than one
    If the age is estimated, it is in the form xx.5

    With respect to the family relation variables (i.e. SibSp and Parch)
    some relations were ignored. The following are the definitions used
    for SibSp and Parch.

    Sibling:  brother, sister, stepbrother, or stepsister of passenger aboard Titanic
    Spouse:   husband or wife of passenger aboard Titanic (mistresses and fiancees ignored)
    Parent:   mother or father of passenger aboard Titanic
    Child:    son, daughter, stepson, or stepdaughter of passenger aboard Titanic
    '''    
    predictions = {}
    df = pandas.read_csv(file_path)
    df = create_title_col(df)
    df = create_familysize_col(df)
    
    for passenger_index, passenger in df.iterrows():
        passenger_id = passenger['PassengerId']
        if passenger['Sex'] == 'female' and passenger['FsizeD'] < 3:
            predictions[passenger_id] = 1
        elif passenger['Pclass'] == 1 and passenger['Age'] < 18:
            predictions[passenger_id] = 1
        elif passenger['Title'] == 4:
            predictions[passenger_id] = 1
        else:
            predictions[passenger_id] = 0
            
    return predictions


# Udacity Heuristic, Kaggle Score
# 0.8081, 0.77512

In [7]:
# use custom_heuristic prediction result
prediction_dict = custom_heuristic(test_filepath)

output_csv('output/python01.csv', prediction_dict)



In [13]:
# Get Udacity's train_data 'Age'

train_df = pandas.read_csv(filepath)
age_utrain = train_df['Age']
print age_utrain

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5       NaN
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17      NaN
18     31.0
19      NaN
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26      NaN
27     19.0
28      NaN
29      NaN
       ... 
861    21.0
862    48.0
863     NaN
864    24.0
865    42.0
866    27.0
867    31.0
868     NaN
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878     NaN
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, dtype: float64
