# Analysing the Titanic dataset for survival
I've chosen the classic investigation: finding out which variables are likely to predict survival for passengers on the Titanic.

**Question: Which attributes were shared by the passengers who were likely to survive the disaster?**

## 1) Taking a look at the data

In [13]:
import pandas as pd

# reading the csv data into a pandas dataframe
titanic_data_all = pd.read_csv('titanic_data.csv')

print titanic_data_all

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
5              6         0       3   
6              7         0       1   
7              8         0       3   
8              9         1       3   
9             10         1       2   
10            11         1       3   
11            12         1       1   
12            13         0       3   
13            14         0       3   
14            15         0       3   
15            16         1       2   
16            17         0       3   
17            18         1       2   
18            19         0       3   
19            20         1       3   
20            21         0       2   
21            22         1       2   
22            23         1       3   
23            24         1       1   
24            25         0       3   
25          

## 2) Reducing complexity and cleaning the data

After looking at the printout of the DataFrame containing the available information about the passengers, I decided to remove four columns that I thought would be unlikely to hold predictive value with regard to survival. Those were:

- PassengerId
- Name
- Ticket
- Embarked

### Removing some data

In [14]:
# removing the columns that are unlikely to have any predictive value
titanic_data = titanic_data_all.drop(['PassengerId', 'Name', 'Ticket', 'Embarked'], axis=1)

In [15]:
print titanic_data

     Survived  Pclass     Sex  Age  SibSp  Parch      Fare        Cabin
0           0       3    male   22      1      0    7.2500          NaN
1           1       1  female   38      1      0   71.2833          C85
2           1       3  female   26      0      0    7.9250          NaN
3           1       1  female   35      1      0   53.1000         C123
4           0       3    male   35      0      0    8.0500          NaN
5           0       3    male  NaN      0      0    8.4583          NaN
6           0       1    male   54      0      0   51.8625          E46
7           0       3    male    2      3      1   21.0750          NaN
8           1       3  female   27      0      2   11.1333          NaN
9           1       2  female   14      1      0   30.0708          NaN
10          1       3  female    4      1      1   16.7000           G6
11          1       1  female   58      0      0   26.5500         C103
12          0       3    male   20      0      0    8.0500      

### Cabin

"Cabin" might be an interesting piece of information when assessing why people survived the sinking of the ship. However, I can see in the limited printout that for many passengers there seems to be no cabin data. So I went to check.

In [16]:
# assessing how many of the passengers have an entry for "Cabin"
# because looking at the print of the DF they seem to be scarce

no_value = 0
cabin_known = 0
random_float_to_compare_type = 1.2

for passenger_cabin in titanic_data["Cabin"]:
    if type(passenger_cabin) == type(random_float_to_compare_type):
        no_value += 1
    else:
        cabin_known += 1
        
print cabin_known, "passengers with cabin data known"
print no_value, "entries have no data for 'Cabin'"

204 passengers with cabin data known
687 entries have no data for 'Cabin'
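
The same count can be obtained without a loop or a type comparison. The following is a minimal sketch on a made-up five-entry Series (the values are an assumption for illustration, not the real column); it relies on `isnull()` and `notnull()`, whose `True` values count as 1 when summed.

```python
import numpy as np
import pandas as pd

# hypothetical mini-sample standing in for titanic_data["Cabin"]
cabins = pd.Series([np.nan, "C85", np.nan, "C123", np.nan])

# isnull() marks missing entries with True; summing counts the True values
no_value = cabins.isnull().sum()
cabin_known = cabins.notnull().sum()

print("{} passengers with cabin data known".format(cabin_known))
print("{} entries have no data for 'Cabin'".format(no_value))
```

Applied to the real column, this should give the same two numbers as the loop above.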


I will keep the "Cabin" column around, for it might be interesting. As of now I am not sure whether "NaN" means that they did not stay in a cabin, or whether this data is simply not recorded.

I will assume that the data was recorded properly, and that those passengers who do not have a cabin specified did in fact not stay in a cabin.

I am wondering how this relates to Pclass, which is described as a proxy for socio-economic status.

In [17]:
random_float_to_compare_type = 1.2

class_cabin_dict = {
                    'cabin_class_1' : 0,
                    'cabin_class_2' : 0,
                    'cabin_class_3' : 0,
                    'nan_class_1' : 0,
                    'nan_class_2' : 0,
                    'nan_class_3' : 0
                    }

#print "Cabin", "|", "Pclass"
for n in range(len(titanic_data["Cabin"])):
    cabin = titanic_data["Cabin"][n]
    p_class = titanic_data["Pclass"][n]
    #print cabin, "|", p_class
    if not type(cabin) == type(random_float_to_compare_type):
        #print cabin, "|", p_class
        if p_class == 1:
            class_cabin_dict['cabin_class_1'] += 1
        if p_class == 2:            
            class_cabin_dict['cabin_class_2'] += 1
        if p_class == 3:
            class_cabin_dict['cabin_class_3'] += 1
    else:    
        if p_class == 1:
            class_cabin_dict['nan_class_1'] += 1
        if p_class == 2:            
            class_cabin_dict['nan_class_2'] += 1
        if p_class == 3:
            class_cabin_dict['nan_class_3'] += 1
            
print class_cabin_dict

{'cabin_class_2': 16, 'cabin_class_3': 12, 'cabin_class_1': 176, 'nan_class_1': 40, 'nan_class_2': 168, 'nan_class_3': 479}
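
A cross-tabulation gives the same six counts in one step. This is only a sketch on a made-up six-row sample (the rows and the name `cabin_known` are assumptions for illustration); `pd.crosstab` counts how often each combination of the two inputs occurs.

```python
import numpy as np
import pandas as pd

# hypothetical mini-sample with the two relevant columns
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Cabin": ["C85", np.nan, np.nan, "G6", np.nan, np.nan],
})

# True where a cabin entry exists, False where it is missing
has_cabin = sample["Cabin"].notnull().rename("cabin_known")

# rows: cabin known yes/no, columns: passenger class, cells: number of passengers
counts = pd.crosstab(has_cabin, sample["Pclass"])
print(counts)
```

The resulting table has the same layout as the array that the stacked bar graph needs: one row per bar, one column per class.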


Pclass and Cabin seem to be correlated. I think it's time for a graph! :)

I could try the visualization program `Parallel Sets`, or simply use stacked bars: one bar for all passengers with a cabin and one for all passengers without, each stacked by class with color coding!

I decided to use code for stacked bar graphs that I found online and adapted it to my needs.

In [18]:
# courtesy of https://github.com/minillinim/stackedBarGraph GPLv3 - Thanks :)
# numpy and matplotlib have to be imported before they are used below
import numpy as np
import matplotlib.pyplot as plt
import stackedBarGraph as sbg

# creating an instance of the stackedBarGraph class
class_cabin = sbg.StackedBarGrapher()
print class_cabin

# convert the data on whether a cabin was booked in relation to the Pclass into a numpy array
# (one row per bar: with cabin / without cabin, one column per class)
cabin_yes_no = np.array([[class_cabin_dict['cabin_class_1'], class_cabin_dict['cabin_class_2'], class_cabin_dict['cabin_class_3']],
                        [class_cabin_dict['nan_class_1'], class_cabin_dict['nan_class_2'], class_cabin_dict['nan_class_3']]])

# convert np array values to float to work with stackedBarGraph
cabin_yes_no = cabin_yes_no.astype(float)

# setting the values
cabin_yes_no_labels = ["cabin","w/o cabin"]
cabin_yes_no_colors = ['#2166ac', '#fee090', '#fdbb84']

fig = plt.figure()
ax1 = fig.add_subplot(311)
class_cabin.stackedBarPlot(ax1,
                    cabin_yes_no,
                    cabin_yes_no_colors,
                    edgeCols=['#000000']*len(cabin_yes_no[0]),
                    xLabels=cabin_yes_no_labels,
                    scale=True
                    )
plt.title("passengers stacked per class")
plt.legend([1, 2, 3], loc='lower left')

<stackedBarGraph.StackedBarGrapher instance at 0x10ca9e908>

From the graph it is apparent (although not surprising) that most passengers for whom cabin information is recorded belonged to Pclass 1 (176 of 204).
Even though there is no certainty about this, it might be reasonable to assume that the cabin information was well recorded: it seems plausible that most people in Pclass 1 would book a cabin, while most people in Pclass 3 would not, in order to save on the fare.

If, therefore:
- cabin information was diligently recorded and
- the assumption that Pclass can be seen as a proxy for socioeconomic status is true,

then I think it is valid to say that cabin yes/no can also be seen as a proxy for socioeconomic status.

In [19]:
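# reducing the cabin information to boolean values
# careful: isnull() returns True where the cabin entry is MISSING,
# so Cabin == True means "no cabin recorded" and Cabin == False means "cabin recorded"
titanic_data["Cabin"] = titanic_data["Cabin"].isnull()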

### SibSp and Parch = Relatives
I will merge the two columns indicating whether the passenger had relatives on board into a single column with the values:

- 1 = has relatives on board
- 0 = has no relatives on board

In [20]:
# creating a list containing the 1 or 0 values regarding whether relatives of the passenger were on board
# it's a different way of calculating what I calculated above for Cabin. I did the latter one before. Seems to me
# that boolean values make more sense, however I am keeping this version to see a different approach.
#
# Open question: are there any advantages to making such information "pseudo-booleans" (such as 1/0)
# instead of True/False? Glad for any hints about this!
# Note to self, I found this: http://pandas.pydata.org/pandas-docs/stable/gotchas.html
relatives = []
for p in range(len(titanic_data["SibSp"])):
    sibs = titanic_data["SibSp"][p]
    p_ch = titanic_data["Parch"][p]
    if sibs > 0 or p_ch > 0:
        relatives.append(1)
    else:
        relatives.append(0)

# creating a Series object
relatives_Series = pd.Series(relatives)

# adding the new Series to the dataframe
titanic_data["Relatives"] = relatives_Series
# removing the obsolete columns
titanic_data = titanic_data.drop(["SibSp", "Parch"], axis=1)
print titanic_data

     Survived  Pclass     Sex  Age      Fare  Cabin  Relatives
0           0       3    male   22    7.2500   True          1
1           1       1  female   38   71.2833  False          1
2           1       3  female   26    7.9250   True          0
3           1       1  female   35   53.1000  False          1
4           0       3    male   35    8.0500   True          0
5           0       3    male  NaN    8.4583   True          0
6           0       1    male   54   51.8625  False          0
7           0       3    male    2   21.0750   True          1
8           1       3  female   27   11.1333   True          1
9           1       2  female   14   30.0708   True          1
10          1       3  female    4   16.7000  False          1
11          1       1  female   58   26.5500  False          0
12          0       3    male   20    8.0500   True          0
13          0       3    male   39   31.2750   True          1
14          0       3  female   14    7.8542   True    
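
The "Relatives" column can also be built without a loop. A minimal sketch on a made-up four-row sample (the values are assumptions for illustration): the two comparisons each give a boolean Series, `|` combines them element by element, and `astype(int)` turns True/False into 1/0.

```python
import pandas as pd

# hypothetical mini-sample with the two columns that get merged
sample = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# True if the passenger had siblings/spouses OR parents/children on board
has_relatives = (sample["SibSp"] > 0) | (sample["Parch"] > 0)

# 1/0 encoding as used in this notebook; leaving out astype(int) keeps True/False
sample["Relatives"] = has_relatives.astype(int)
print(sample)
```

For counting and averaging, the two encodings behave the same, because pandas treats True as 1 and False as 0 in sums and means.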

### Fare

In [21]:
# converting the Fare data to whole numbers (astype cuts off the decimals) to make it easier to keep an overview
titanic_data["Fare"] = titanic_data["Fare"].astype('int64')

print titanic_data
titanic_data.dtypes

     Survived  Pclass     Sex  Age  Fare  Cabin  Relatives
0           0       3    male   22     7   True          1
1           1       1  female   38    71  False          1
2           1       3  female   26     7   True          0
3           1       1  female   35    53  False          1
4           0       3    male   35     8   True          0
5           0       3    male  NaN     8   True          0
6           0       1    male   54    51  False          0
7           0       3    male    2    21   True          1
8           1       3  female   27    11   True          1
9           1       2  female   14    30   True          1
10          1       3  female    4    16  False          1
11          1       1  female   58    26  False          0
12          0       3    male   20     8   True          0
13          0       3    male   39    31   True          1
14          0       3  female   14     7   True          0
15          1       2  female   55    16   True         

Survived       int64
Pclass         int64
Sex           object
Age          float64
Fare           int64
Cabin           bool
Relatives      int64
dtype: object

In [86]:
# computing statistical values for the Fare column, to better understand the data.
# I re-used the framework of the function I wrote in class, and adapted it to be useful here
def calculate_stats(list_of_values):
    '''
    takes as input a list of values,
    returns a pandas Series object with the mean, standard deviation,
    median, minimum and maximum of all values in the list
    '''
    import numpy as np
    import pandas as pd

    # Calculate common statistical values
    mean = np.mean(list_of_values) # Mean
    standard_deviation = np.std(list_of_values) # Standard deviation
    minimum = np.min(list_of_values) # Minimum
    maximum = np.max(list_of_values) # Maximum
    median = np.median(list_of_values) # Median
    
    # Create a Series object containing the statistical information
    column_header = ["mean", "standard_deviation", "median", "minimum", "maximum"]
    values = [mean, standard_deviation, median, minimum, maximum]
    stat_values = pd.Series(values, column_header)
    return stat_values

print calculate_stats(titanic_data["Fare"])

mean                   31.785634
standard_deviation     49.675830
median                 14.000000
minimum                 0.000000
maximum               512.000000
dtype: float64


In [23]:
# the minimum is 0, so I am wondering whether there is missing information for Fare, or whether it is recorded
# that some passenger(s) actually paid 0$.
amount_missing_fares = 0
amount_zero_fares = 0

# checking whether there are NaNs - and if so how many
for tr in titanic_data["Fare"].isnull():
    if tr:
        amount_missing_fares += 1
# checking whether there are 0 - and if so how many
for num in titanic_data["Fare"]:
    if num == 0:
        amount_zero_fares += 1
        
print "NaN in Fare:", amount_missing_fares
print "0 in Fare:", amount_zero_fares

# TODO: understand why the following calculates the amount of NaN's (meaning: why 2x .sum())
# http://stackoverflow.com/questions/29530232/python-pandas-check-if-any-value-is-nan-in-dataframe
# print titanic_data["Fare"].isnull().sum().sum()

NaN in Fare: 0
0 in Fare: 15
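
Regarding the TODO about the double `.sum()`: as far as I understand it, the second `.sum()` is only needed for a whole DataFrame. The sketch below uses a made-up three-row sample (an assumption for illustration) to show the difference between a single column and the full table.

```python
import numpy as np
import pandas as pd

# hypothetical mini-sample with one missing value per column
sample = pd.DataFrame({"Age": [22.0, np.nan, 26.0], "Fare": [7.25, 0.0, np.nan]})

# one column (a Series): a single sum() already gives one number
nan_in_fare = sample["Fare"].isnull().sum()

# whole DataFrame: the first sum() gives one count per column ...
nan_per_column = sample.isnull().sum()
# ... and the second sum() adds those counts up to a grand total
nan_total = sample.isnull().sum().sum()

# the zero fares can be counted the same way, by summing a boolean comparison
zero_fares = (sample["Fare"] == 0).sum()

print(nan_per_column)
print("total NaN: {}, zero fares: {}".format(nan_total, zero_fares))
```

So for `titanic_data["Fare"]` a single `.sum()` should be enough.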


In [24]:
# it surprised me to see that there are no entries for Fare missing, yet still the minimum shows to be 0.
# need to look at this a bit more:
fare = titanic_data["Fare"]
#print fare

# using a function to show the whole dataframe regarding the Fare
# found this here: http://stackoverflow.com/questions/19124601/is-there-a-way-to-pretty-print-the-entire-pandas-series-dataframe
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')

# went to print the full column of the df
#print_full(fare)
# found my first 0 here:
print fare[179]
# then I went to adapt the code above to count the existing 0 Fare entries

0


In [25]:
# so here are the relevant rows!
titanic_data["Fare"].nsmallest(amount_zero_fares)

179    0
263    0
271    0
277    0
302    0
413    0
466    0
481    0
597    0
633    0
674    0
732    0
806    0
815    0
822    0
Name: Fare, dtype: int64

In [26]:
# Let's look at them in more detail

# getting a list of the rows with Fare = 0
all_zero_fares = list(titanic_data["Fare"].nsmallest(amount_zero_fares).keys())

# and making a subset of the dataframe with this list
titanic_data.iloc[all_zero_fares]

     Survived  Pclass   Sex   Age  Fare  Cabin  Relatives
179         0       3  male  36.0     0   True          0
263         0       1  male  40.0     0  False          0
271         1       3  male  25.0     0   True          0
277         0       2  male   NaN     0   True          0
302         0       3  male  19.0     0   True          0
413         0       2  male   NaN     0   True          0
466         0       2  male   NaN     0   True          0
481         0       2  male   NaN     0   True          0
597         0       3  male  49.0     0   True          0
633         0       1  male   NaN     0   True          0
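
A subset like this can also be selected directly with a boolean mask. A minimal sketch on a made-up four-row sample (values and index labels are assumptions for illustration): the comparison marks the matching rows, and indexing with it keeps only those rows together with their original labels.

```python
import pandas as pd

# hypothetical mini-sample; the index mimics original row labels
sample = pd.DataFrame(
    {"Survived": [0, 1, 0, 0], "Fare": [0, 71, 0, 8]},
    index=[179, 1, 263, 4],
)

# True for every passenger whose fare is 0
paid_nothing = sample["Fare"] == 0

# keeps only the rows where the mask is True
zero_fare_rows = sample[paid_nothing]
print(zero_fare_rows)
```

This avoids the detour over `nsmallest()`. It also avoids mixing row labels with row positions: `keys()` returns labels while `.iloc` expects positions, and the two only coincide as long as the DataFrame still has its default index.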


#### Looking at the free-riders
I took this example to practice making subsets of a dataframe.

What I saw in this small list of the 15 passengers who paid 0 for their fare:

- all were male
- none had relatives on board
- most have no cabin recorded (`Cabin` is `True` when the cabin entry is missing)
- and nearly all of them died

(The other values for Age and Pclass seem to be more diverse)

Therefore I would say that skipping the Fare is not a good way to survive the catastrophe... :)

_**Disclaimer**: I understand that this is not a useful conclusion, especially because the dataset is very very small_

Okay. After this little practice, I think that it is:

## 3) Time for looking for relationships in the data
Now I will check how the cleaned variables relate to survival, and whether I can discover patterns that could lead to good predictions of the passengers' survival.

In [27]:
# since my basic distinction is gonna be whether someone survived or not, I'll split the data according to this
survivor_rows = []
non_survivor_rows = []

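# NOTE: a DataFrame has no title row among its rows, so starting at position 1
# leaves the first passenger (position 0) out of both subsets
title_row = 1
row_num = 0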
for row in range(len(titanic_data) - title_row):
    row_num += 1
    if titanic_data.iloc[row_num]["Survived"] == 1:
        survivor_rows.append(row_num)
    else:
        non_survivor_rows.append(row_num)

titanic_survivors = titanic_data.iloc[survivor_rows]
titanic_non_survivors = titanic_data.iloc[non_survivor_rows]

#print titanic_survivors
#print titanic_non_survivors
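
The split can also be written with a boolean mask, which covers every row including position 0. A minimal sketch on a made-up five-row sample (the values are assumptions for illustration):

```python
import pandas as pd

# hypothetical mini-sample; the first row (position 0) is a non-survivor
sample = pd.DataFrame({"Survived": [0, 1, 1, 0, 0], "Age": [22, 38, 26, 35, 54]})

# True for survivors, False for everyone else
survived_mask = sample["Survived"] == 1

survivors = sample[survived_mask]
non_survivors = sample[~survived_mask]   # ~ inverts the mask

print("{} survivors, {} non-survivors".format(len(survivors), len(non_survivors)))
```

Because each row is either in the mask or in its inverse, the two subsets always add up to the full table.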

### Age in comparison
Having my two datasets, split on the outcome (dependent) variable `Survived`, I'll start to compute statistical values for them.

In [28]:
print "SURVIVORS"
print calculate_stats(titanic_survivors["Age"]), "\n"
print "NON-SURVIVORS"
print calculate_stats(titanic_non_survivors["Age"])

SURVIVORS
mean                  28.343690
standard_deviation    14.925152
median                31.000000
minimum                0.420000
maximum               80.000000
dtype: float64 

NON-SURVIVORS
mean                  30.646572
standard_deviation    14.165888
median                34.000000
minimum                1.000000
maximum               74.000000
dtype: float64


This doesn't look too interesting. There is a slight hint that survivors might be younger overall, but the difference is small. So I'll try something else.
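
One caveat about the Age statistics: the column contains missing values, and `np.median` does not skip them on its own, so the medians printed above may not be the medians of the known ages. The sketch below shows three ways to ignore the missing values on a made-up six-entry sample (the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# hypothetical mini-sample with two missing ages
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 54.0])

# option 1: drop the missing values first, then use numpy as before
median_known = np.median(ages.dropna())

# option 2: the pandas method skips NaN by default
median_pandas = ages.median()

# option 3: numpy's NaN-aware variant
median_nan_aware = np.nanmedian(ages.values)

print(median_known, median_pandas, median_nan_aware)
```

All three agree on the median of the four known ages.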

In [29]:
len(titanic_survivors["Age"])
titanic_survivors.iloc[0]
titanic_survivors

    Survived  Pclass     Sex    Age  Fare  Cabin  Relatives
1          1       1  female  38.00    71  False          1
2          1       3  female  26.00     7   True          0
3          1       1  female  35.00    53  False          1
8          1       3  female  27.00    11   True          1
9          1       2  female  14.00    30   True          1
10         1       3  female   4.00    16  False          1
11         1       1  female  58.00    26  False          0
15         1       2  female  55.00    16   True          0
17         1       2    male    NaN    13   True          0
19         1       3  female    NaN     7   True          0


In [30]:
# defining what it means to be a child (by law)
child_age = 18

children_survived = 0
children_died = 0

# summing up the amounts of children in each dataset
for p in range(len(titanic_survivors["Age"])):
    #print p
    if titanic_survivors.iloc[p]["Age"] <= child_age:
        children_survived += 1
for p in range(len(titanic_non_survivors["Age"])):
    if titanic_non_survivors.iloc[p]["Age"] <= child_age:
        children_died += 1

print "surviving children:", children_survived
print "non-surviving children:", children_died

surviving children: 70
non-surviving children: 69


Okay, at first sight this doesn't seem too spectacular a result. It seems that as a child on the Titanic, you had a near-perfect 1 in 2 chance of surviving the disaster. Live or die, at the flip of a coin.

However, since I know that there is more to this, I am not going to discard this data point yet. (I am happy that I already know about the relevance of Age, because otherwise I might have put this attempt aside. Knowing that it has predictive value, I took the extra moment to think about it and realized that a 50% chance can still be much better than the survival rates of many other groups!)

So let's look at this for the adults.

In [31]:
def amount_of_passengers(data_table, column_header, operator, value):
    '''
    takes as input a pandas dataframe, the name of the column that is being queried,
    an operator ["smaller", "smaller_equals", "bigger", "bigger_equals"] and the value of the query parameter.
    returns the number of passengers in the dataframe that fits the defined query
    '''
    num_passengers = 0
    if operator == "smaller":
        for p in range(len(data_table[column_header])):
            #print p
            if data_table.iloc[p][column_header] < value:
                num_passengers += 1
        return num_passengers
    if operator == "smaller_equals":
        for p in range(len(data_table[column_header])):
            #print p
            if data_table.iloc[p][column_header] <= value:
                num_passengers += 1
        return num_passengers
    if operator == "bigger":
        for p in range(len(data_table[column_header])):
            #print p
            if data_table.iloc[p][column_header] > value:
                num_passengers += 1
        return num_passengers
    if operator == "bigger_equals":
        for p in range(len(data_table[column_header])):
            #print p
            if data_table.iloc[p][column_header] >= value:
                num_passengers += 1
        return num_passengers
    else:
        print 'Error. Please enter a valid operator from the list:'
        print '["smaller", "smaller_equals", "bigger", "bigger_equals"]'

amount_of_passengers(titanic_survivors, "Age", "smaller_equals", 18)

70
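
The four nearly identical branches can be collapsed by looking the comparison up in a dictionary. This is a sketch only: `count_passengers` and `COMPARISONS` are hypothetical names, and the sample data is made up. It uses the standard-library `operator` module and counts the matches by summing a boolean Series instead of looping over rows.

```python
import operator

import numpy as np
import pandas as pd

# maps the operator names used in this notebook to comparison functions
COMPARISONS = {
    "smaller": operator.lt,
    "smaller_equals": operator.le,
    "bigger": operator.gt,
    "bigger_equals": operator.ge,
}

def count_passengers(data_table, column_header, op_name, value):
    '''hypothetical vectorized variant of amount_of_passengers()'''
    if op_name not in COMPARISONS:
        raise ValueError("unknown operator %r, expected one of %s"
                         % (op_name, sorted(COMPARISONS)))
    # comparing a whole column at once gives a boolean Series
    mask = COMPARISONS[op_name](data_table[column_header], value)
    return int(mask.sum())

# hypothetical mini-sample; the missing age matches neither comparison
sample = pd.DataFrame({"Age": [4.0, 18.0, 35.0, np.nan]})
children = count_passengers(sample, "Age", "smaller_equals", 18)
adults = count_passengers(sample, "Age", "bigger", 18)
print("children: {}, adults: {}".format(children, adults))
```

As in the loop version, passengers with a missing age are counted in neither group, because comparisons with NaN are always False.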

In [32]:
adults_survived = amount_of_passengers(titanic_survivors, "Age", "bigger", 18)
adults_died = amount_of_passengers(titanic_non_survivors, "Age", "bigger", 18)

print "adults survived:", adults_survived
print "adults died: ", adults_died

adults survived: 220
adults died:  354


In [33]:
# So how are the ratios? (survivors divided by non-survivors, i.e. the odds of survival)
children_ratio = children_survived / float(children_died)
adult_ratio = adults_survived / float(adults_died)

ratios = pd.Series([children_ratio, adult_ratio], ["Survival of children", "Survival of adults"])
print children_ratio, "vs.", adult_ratio, "\n"
print ratios

1.01449275362 vs. 0.621468926554 

Survival of children    1.014493
Survival of adults      0.621469
dtype: float64


In [34]:
def survival_by_metric(column_header, value):
    print "METRICS FOR %s, RATIOS FOR SPLIT AT %i:\n" %(column_header, value)
    print calculate_stats(titanic_data[column_header]), "\n"
    survived_below = amount_of_passengers(titanic_survivors, column_header, "smaller_equals", value)
    died_below = amount_of_passengers(titanic_non_survivors, column_header, "smaller_equals", value)
    survived_above = amount_of_passengers(titanic_survivors, column_header, "bigger", value)
    died_above = amount_of_passengers(titanic_non_survivors, column_header, "bigger", value)
    below_ratio = survived_below / float(died_below)
    above_ratio = survived_above / float(died_above)
    ratios = pd.Series([below_ratio, above_ratio], ["Survival of those BELOW entered value", "Survival of those ABOVE entered value"])
    return ratios

print survival_by_metric("Age", 18)

METRICS FOR Age, RATIOS FOR SPLIT AT 18:

mean                  29.699118
standard_deviation    14.516321
median                32.000000
minimum                0.420000
maximum               80.000000
dtype: float64 

Survival of those BELOW entered value    1.014493
Survival of those ABOVE entered value    0.621469
dtype: float64


So the odds of survival (survivors per non-survivor) were indeed much higher for children (about 1.01) than for adults (about 0.62)! Worth keeping this metric around.
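
The ratios above are odds (survivors divided by non-survivors), not survival rates. To read them as a share of the group, the survivors have to be divided by all passengers in the group. A small sketch using the counts printed above; `survival_rate` is a hypothetical helper name.

```python
def survival_rate(survived, died):
    '''share of a group that survived, from the two counts'''
    return survived / float(survived + died)

# counts taken from the printouts above
children_rate = survival_rate(70, 69)
adults_rate = survival_rate(220, 354)

print("children: {:.3f}".format(children_rate))
print("adults:   {:.3f}".format(adults_rate))
```

In these terms, about 50% of the children with a known age survived, compared to about 38% of the adults.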

### Fare - do the spendatious live longer?


In [85]:
# first I need to know what is a high fare
print calculate_stats(titanic_data["Fare"])

mean                   31.785634
standard_deviation     49.675830
median                 14.000000
minimum                 0.000000
maximum               512.000000
dtype: float64


In [88]:
# the maximum seems to be quite an outlier, so I did not trust the average as a useful value.
# I adapted the calculate_stats() function to include the median
# I'll choose up to and including the median as a low fare, and what is above the median as a high fare
import numpy as np

median_fare = np.median(titanic_data["Fare"])
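# passengers who paid more than the median fare, counted separately for survivors and non-survivors
rich_survivors = amount_of_passengers(titanic_survivors, "Fare", "bigger", median_fare)
rich_non_survivors = amount_of_passengers(titanic_non_survivors, "Fare", "bigger", median_fare)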


In [89]:
survival_by_metric("Fare", median_fare)

METRICS FOR Fare, RATIOS FOR SPLIT AT 14:

mean                   31.785634
standard_deviation     49.675830
median                 14.000000
minimum                 0.000000
maximum               512.000000
dtype: float64 



Survival of those BELOW entered value    0.333333
Survival of those ABOVE entered value    1.106796
dtype: float64

This looks promising! Seems one can buy one's way out of disaster...

### Three pseudo-boolean values

In [45]:
# Cabin, Relatives, Sex -> categorical values (pseudo-booleans)
# I'll write a function to handle these

def amount_of_passengers_bool(column_header, one_value):
    survivors_1 = 0
    survivors_2 = 0
    non_survivors_1 = 0
    non_survivors_2 = 0
    for p in range(len(titanic_survivors)):
        if titanic_survivors.iloc[p][column_header] == one_value:
            survivors_1 += 1
        else:
            survivors_2 += 1
    for p in range(len(titanic_non_survivors)):
        if titanic_non_survivors.iloc[p][column_header] == one_value:
            non_survivors_1 += 1
        else:
            non_survivors_2 += 1
    amount_list = [survivors_1, survivors_2, non_survivors_1, non_survivors_2]
    return amount_list
            
def survival_by_bool_metric(column_header, one_value):
    '''
    takes as input the header of a column and one of the two boolean values
    (this is, so that the function also accounts for pseudo-boolean constructs)
    returns a pandas Series object containing the two ratios
    '''
    amount_list = amount_of_passengers_bool(column_header, one_value)
    survived_provided_value = amount_list[0]
    died_provided_value = amount_list[2]
    survived_other_value = amount_list[1]
    died_other_value = amount_list[3]
    provided_value_ratio = survived_provided_value / float(died_provided_value)
    other_value_ratio = survived_other_value / float(died_other_value)
    ratios = pd.Series([provided_value_ratio, other_value_ratio], ["Survival of %s (entered value)" %(one_value), "Survival of second value (not entered)"])
    return ratios
print "RATIOS OF SURVIVAL ACCORDING TO Sex (female/male):\n"
print survival_by_bool_metric("Sex", "female"), "\n"
print "RATIOS OF SURVIVAL ACCORDING TO Cabin (True/False):\n"
print survival_by_bool_metric("Cabin", True), "\n"
print "RATIOS OF SURVIVAL ACCORDING TO Relatives (1/0):\n"
print survival_by_bool_metric("Relatives", 1), "\n"

RATIOS OF SURVIVAL ACCORDING TO Sex (female/male):

Survival of female (entered value)        2.876543
Survival of second value (not entered)    0.233405
dtype: float64 

RATIOS OF SURVIVAL ACCORDING TO Cabin (True/False):

Survival of True (entered value)          0.429167
Survival of second value (not entered)    2.000000
dtype: float64 

RATIOS OF SURVIVAL ACCORDING TO Relatives (1/0):

Survival of 1 (entered value)             1.028736
Survival of second value (not entered)    0.435829
dtype: float64 



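The results indicate that it was more likely for a passenger to survive if the passenger:

- was **female**
- had a **cabin recorded** (`Cabin` is `False`, because the column was built with `isnull()` and is therefore `True` when the cabin entry is missing)
- and **had relatives** on board

The results for the "Cabin" metric surprised me at first, because they seemed to contradict what I calculated before. They do not: `Cabin = True` means that no cabin is recorded, so the higher survival odds belong to the passengers with a recorded cabin, who were mostly in Pclass 1.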

### Social Class and Survival
The final metric to investigate is Pclass, a set of ordinal values [1, 2, 3] that are meant to be an indicator of socio-economic status.

Since it's a slightly different type of data, I'll have to write a new function for it.

In [39]:
# actually, since I'm only trying to get the difference between Pclass 1 and [Pclass 2 AND Pclass 3],
# it's again a pseudo-boolean, so I can use my function that I just wrote above.
print "RATIOS OF SURVIVAL ACCORDING TO Pclass:\n"
print survival_by_bool_metric("Pclass", 1)

RATIOS OF SURVIVAL ACCORDING TO Pclass:



Survival of 1 (entered value)             1.700000
Survival of second value (not entered)    0.440171
dtype: float64

_a bit lazy, I agree. :)_
    
But we can see that passengers in Pclass 1 had much higher odds of survival (1.70) than those in Pclass 2 and 3 combined (0.44).

## 4) Discussion

The following metrics are those that are likely to be useful variables to predict survival of the passengers:

- Pclass = 1
- Sex = female
- Fare > 14
- Age <= 18
- Cabin = False (a cabin is recorded, since `Cabin` is `True` for missing entries)
- Relatives = 1

meaning that a passenger who fulfills some of these criteria is more likely to have survived.
I would expect that the more criteria are fulfilled, the higher the chance of survival, although I did not test combinations of criteria here.

Some, however, are more important than others. Let's talk more about this.

In [83]:
# For this I'll create a dataframe holding all the ratios of these (hopefully) predictive values
# as compared to their pseudo-boolean opposites
for_sex = survival_by_bool_metric("Sex", "female")
for_cabin = survival_by_bool_metric("Cabin", False)
for_relatives = survival_by_bool_metric("Relatives", 1)


ratios_df = pd.DataFrame(index=["var_indicates_survival", "var_means_probably_not"])

def add_ratio_to_df(column_name, survival_predictor_value):
    # the pseudo-boolean columns can all be handled by survival_by_bool_metric()
    if column_name in ("Sex", "Cabin", "Relatives", "Pclass"):
        ratios = survival_by_bool_metric(column_name, survival_predictor_value)
        ratios_df.loc[:, column_name] = pd.Series([ratios.iloc[0], ratios.iloc[1]], index = ratios_df.index)
    # the numeric columns are not added to the dataframe yet
    elif column_name in ("Fare", "Age"):
        print "fareage"
        
add_ratio_to_df("Cabin", False)
add_ratio_to_df("Sex", "female")
add_ratio_to_df("Relatives", 1)
add_ratio_to_df("Pclass", 1)
print ratios_df

                           Cabin       Sex  Relatives    Pclass
var_indicates_survival  2.000000  2.876543   1.028736  1.700000
var_means_probably_not  0.429167  0.233405   0.435829  0.440171


In [71]:
df = pd.DataFrame(index=["var_indicates_survival", "var_means_probably_not"])
df.loc[:, "a"] = pd.Series([1,2], index = df.index)
df

                        a
var_indicates_survival  1
var_means_probably_not  2


In [73]:
new = pd.Series([1,2], index = ["a", "b"])

In [75]:
new[0]

1