1. Importing the Training Set
------------------------
------------

I begin by importing two useful packages: NumPy (for maths and arrays) and csv (for reading and writing csv files).

To use something from either package, I call ```csv.[function]``` or ```np.[function]```.

In [37]:
import csv
import numpy as np

Next I check that my working directory points to the location of the training file, and then I open the file in Python.

In [38]:
pwd

u'\\\\spf801\\staffusers$\\jsc\\Desktop\\Kaggle\\Titanic\\2. Python'

In [39]:
csv_file_object = csv.reader(open('train.csv', 'rb')) 

I use the ```next()``` command to read the first line, which holds the headings, so that it is skipped in the loop that follows. I create an empty list called ```data```, populate it with each line of the ```.csv``` file in turn, and finally convert it to an array. Each item is a string by default.

In [40]:
header = csv_file_object.next() 
                                 
data=[]                          
for row in csv_file_object:      
    data.append(row)            
data = np.array(data)
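
The same read, skip and collect pattern can be tried without the Kaggle file. This is a minimal sketch in Python 3 syntax (the notebook cells themselves are Python 2, hence ```next(reader)``` rather than ```reader.next()```). The in-memory ```sample_csv``` text is an assumption: a trimmed-down stand-in for ```train.csv``` with only four of its columns.

```python
import csv
import io

import numpy as np

# Assumed stand-in for train.csv: a header line plus two rows, four columns only.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex\n"
    "1,0,3,male\n"
    "2,1,1,female\n"
)

reader = csv.reader(sample_csv)
header = next(reader)                      # consume the header line

data = np.array([row for row in reader])   # every entry is still a string

print(header)
print(data.shape, data.dtype)
```

The list comprehension does the same job as the explicit ```append``` loop; either way the array holds strings until a column is converted.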

In [41]:
print data

[['1' '0' '3' ..., '7.25' '' 'S']
 ['2' '1' '1' ..., '71.2833' 'C85' 'C']
 ['3' '1' '3' ..., '7.925' '' 'S']
 ..., 
 ['889' '0' '3' ..., '23.45' '' 'S']
 ['890' '1' '1' ..., '30' 'C148' 'C']
 ['891' '0' '3' ..., '7.75' '' 'Q']]


I can now pick out parts of the data starting with the headings:

In [42]:
print header

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']


Then the first row of the data:

In [43]:
print data[0]

['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171'
 '7.25' '' 'S']


Then the last row of the data:

In [44]:
print data[-1]

['891' '0' '3' 'Dooley, Mr. Patrick' 'male' '32' '0' '0' '370376' '7.75' ''
 'Q']


And individual entries, such as the 1st row, 4th column (remember that indexing starts from zero):

In [45]:
print data[0,3]

Braund, Mr. Owen Harris


I can also pull out the full gender column, which has index 4 because Python starts indices from 0, not 1:

In [46]:
data[0::,4] 

array(['male', 'female', 'female', 'female', 'male', 'male', 'male',
       'male', 'female', 'female', 'female', 'female', 'male', 'male',
       'female', 'female', 'male', 'male', 'female', 'female', 'male',
       'male', 'female', 'male', 'female', 'female', 'male', 'male',
       'female', 'male', 'male', 'female', 'female', 'male', 'male',
       'male', 'male', 'male', 'female', 'female', 'female', 'female',
       'male', 'female', 'female', 'male', 'male', 'female', 'male',
       'female', 'male', 'male', 'female', 'female', 'male', 'male',
       'female', 'male', 'female', 'male', 'male', 'female', 'male',
       'male', 'male', 'male', 'female', 'male', 'female', 'male', 'male',
       'female', 'male', 'male', 'male', 'male', 'male', 'male', 'male',
       'female', 'male', 'male', 'female', 'male', 'female', 'female',
       'male', 'male', 'female', 'male', 'male', 'male', 'male', 'male',
       'male', 'male', 'male', 'male', 'female', 'male', 'female', 'male',
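
The indexing forms used so far can be compared side by side on a small array. This is a sketch in Python 3 syntax; ```toy``` is an assumed four-column cut-down of the training array, so the gender column has index 3 here rather than 4.

```python
import numpy as np

# Assumed toy array. Columns: PassengerId, Survived, Pclass, Sex
toy = np.array([
    ["1", "0", "3", "male"],
    ["2", "1", "1", "female"],
    ["3", "1", "3", "female"],
])

first_row = toy[0]        # the whole first row
last_row = toy[-1]        # negative indices count from the end
one_entry = toy[0, 3]     # 1st row, 4th column
sex_column = toy[:, 3]    # every row of the 4th column

# The slice 0:: used in the notebook selects the same rows as a bare colon.
same = (toy[0::, 3] == sex_column).all()

print(first_row, last_row, one_entry, sex_column, same)
```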
      

The ```.csv``` reader works with strings by default, so I now need to convert to floats in order to do calculations:

In [47]:
data[0::,2].astype(np.float)

array([ 3.,  1.,  3.,  1.,  3.,  3.,  1.,  3.,  3.,  2.,  3.,  1.,  3.,
        3.,  3.,  2.,  3.,  2.,  3.,  3.,  2.,  2.,  3.,  1.,  3.,  3.,
        3.,  1.,  3.,  3.,  1.,  1.,  3.,  2.,  1.,  1.,  3.,  3.,  3.,
        3.,  3.,  2.,  3.,  2.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,
        1.,  2.,  1.,  1.,  2.,  3.,  2.,  3.,  3.,  1.,  1.,  3.,  1.,
        3.,  2.,  3.,  3.,  3.,  2.,  3.,  2.,  3.,  3.,  3.,  3.,  3.,
        2.,  3.,  3.,  3.,  3.,  1.,  2.,  3.,  3.,  3.,  1.,  3.,  3.,
        3.,  1.,  3.,  3.,  3.,  1.,  1.,  2.,  2.,  3.,  3.,  1.,  3.,
        3.,  3.,  3.,  3.,  3.,  3.,  1.,  3.,  3.,  3.,  3.,  3.,  3.,
        2.,  1.,  3.,  2.,  3.,  2.,  2.,  1.,  3.,  3.,  3.,  3.,  3.,
        3.,  3.,  3.,  2.,  2.,  2.,  1.,  1.,  3.,  1.,  3.,  3.,  3.,
        3.,  2.,  2.,  3.,  3.,  2.,  2.,  2.,  1.,  3.,  3.,  3.,  1.,
        3.,  3.,  3.,  3.,  3.,  2.,  3.,  3.,  3.,  3.,  1.,  3.,  1.,
        3.,  1.,  3.,  3.,  3.,  1.,  3.,  3.,  1.,  2.,  3.,  3

I can use the ```size()``` function to count elements and ```sum()``` to add them up (useful when the entries are 0/1 indicators).

First, I look at the second column; the size function counts the total number of passengers.

In [48]:
number_passengers = np.size(data[0::,1].astype(np.float)) 

Next, I use the sum function to count the survivors: since survivors are given the value 1 and everyone else 0, the sum counts only those who survived.

In [49]:
number_survived = np.sum(data[0::,1].astype(np.float)) 

Finally, I calculate the proportion of survivors:

In [50]:
proportion_survivors = number_survived / number_passengers
print proportion_survivors

0.383838383838
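
The size and sum trick is easy to check by hand on a few values. This is a sketch in Python 3 syntax under two assumptions: the four-entry ```survived``` array is a toy example, and the builtin ```float``` is used for the conversion because ```np.float``` has been removed from recent NumPy releases.

```python
import numpy as np

# Toy Survived column, as strings (the way csv.reader yields them).
survived = np.array(["0", "1", "1", "0"])

as_float = survived.astype(float)        # convert before doing arithmetic
number_passengers = np.size(as_float)    # counts every entry
number_survived = np.sum(as_float)       # adds 1 per survivor, 0 otherwise

proportion = number_survived / number_passengers
print(proportion)
```

Because the column only holds 0s and 1s, ```as_float.mean()``` gives the same proportion in one step.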


2. Gender Analysis
------------------------
------------

I now concentrate on the rows for females, that is, those where the entry in the gender column equals "female", and on the rows for men ("not female"). I can apply these filters directly to the data.

In [51]:
women_only_stats = data[0::,4] == "female"
men_only_stats = data[0::,4] != "female"                          

We use these two new variables as a "mask" on our original training data, so we can select only the women and only the men on board, then calculate the proportion of each group who survived. We play the same trick on the Survived column as before, comparing the total (size of the set) with the sum (which adds 1 for each survivor).

In [52]:
women_onboard = data[women_only_stats,1].astype(np.float)     
men_onboard = data[men_only_stats,1].astype(np.float)

proportion_women_survived = np.sum(women_onboard) / np.size(women_onboard)  
proportion_men_survived = np.sum(men_onboard) / np.size(men_onboard) 

print 'Proportion of women who survived is %s' % proportion_women_survived
print 'Proportion of men who survived is %s' % proportion_men_survived

Proportion of women who survived is 0.742038216561
Proportion of men who survived is 0.188908145581
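
What the mask actually contains is easier to see on a handful of rows. This is a sketch in Python 3 syntax; the two-column ```toy``` array is an assumption made for illustration, not the full training data.

```python
import numpy as np

# Assumed toy array. Columns: Survived, Sex
toy = np.array([
    ["0", "male"],
    ["1", "female"],
    ["1", "female"],
    ["0", "male"],
])

is_female = toy[:, 1] == "female"     # boolean array, one entry per row

women_survived = toy[is_female, 0].astype(float)    # rows where the mask is True
men_survived = toy[~is_female, 0].astype(float)     # ~ flips the mask

print(is_female)
print(women_survived.mean(), men_survived.mean())
```

Here ```~is_female``` plays the role of the separate ```!= "female"``` comparison in the notebook.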


Women were much more likely to survive, so the first model I build, as a prediction for which of the remaining passengers will survive, is simply that women survive and men die. Of course, there will be other, possibly more persuasive factors (or combinations of factors), which we look into later.

We now begin work on creating a prediction file. We open the test data set and create a .csv file which will hold our predictions.


In [53]:
test_file = open('test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()

In [54]:
prediction_file = open("ModelBasedonGenderAlone.csv", "wb")
prediction_file_object = csv.writer(prediction_file)

We now want to read in the test file row by row, check whether the passenger is female or male, and write our survival prediction (women assigned a 1; men assigned 0) to the new file.

In [55]:
prediction_file_object.writerow(["PassengerId", "Survived"])
for row in test_file_object:       
    if row[3] == 'female':                                        
        prediction_file_object.writerow([row[0],'1'])   
    else:                                 
        prediction_file_object.writerow([row[0],'0'])    
test_file.close()
prediction_file.close()
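
The writing step can be tried without touching the disk. This is a sketch in Python 3 syntax under stated assumptions: ```test_rows``` holds made-up passengers laid out like the first four columns of ```test.csv``` (PassengerId, Pclass, Name, Sex), and an ```io.StringIO``` buffer stands in for the output file.

```python
import csv
import io

# Made-up rows for illustration. Columns: PassengerId, Pclass, Name, Sex
test_rows = [
    ["1001", "3", "Passenger A", "male"],
    ["1002", "1", "Passenger B", "female"],
]

out = io.StringIO()                              # stands in for the output file
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["PassengerId", "Survived"])

for row in test_rows:
    prediction = "1" if row[3] == "female" else "0"   # women survive, men do not
    writer.writerow([row[0], prediction])

print(out.getvalue())
```

When writing to a real file in Python 3, the csv module expects text mode, for example ```open(path, "w", newline="")```, instead of the ```"wb"``` mode used in the Python 2 cells.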

One of the immediate advantages of using Python over Excel is that we can quickly run all of the steps again in the future, for a different training set.

3. Class, Gender, and Ticket Price Analysis
------------------------
------------

I begin by checking that the data is still available.

In [56]:
print data[0]

['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171'
 '7.25' '' 'S']


Next I will analyse according to gender (male/female), boarding class (1st/2nd/3rd), and ticket price (which I'll group into 4 bins). I want to group ticket prices as 0-10, 10-20, 20-30 and 30+. Because I will create bins of equal width, I begin by making all the top prices fall into the 30-40 bin, setting every fare of 40 or more to 39 (the exact value does not matter, only the bin it lands in). This may be described as adding a ceiling to our ticket price data.

In [57]:
fare_ceiling = 40
data[ data[0::,9].astype(np.float) >= fare_ceiling, 9 ] = fare_ceiling - 1.0 #set to be 39 in top bin

In [58]:
for price in data[0::, 9]:
    if price.astype(np.float) >= fare_ceiling:
        print("ceiling not implemented", price)

Nothing was printed, so the ceiling has worked.
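
The ceiling and its check can also be written without a loop. This is a sketch in Python 3 syntax; the ```fares``` values are assumptions chosen to sit on both sides of the ceiling, and the column is converted to floats first so that the comparison and the assignment both work on numbers.

```python
import numpy as np

fare_ceiling = 40

# Assumed example fares, as strings like the csv column.
fares = np.array(["7.25", "71.2833", "40", "39.5"]).astype(float)

capped = fares.copy()
capped[capped >= fare_ceiling] = fare_ceiling - 1.0   # 40 and above become 39

# Vectorised version of the check: True if any fare still reaches the ceiling.
ceiling_missed = (capped >= fare_ceiling).any()

print(capped, ceiling_missed)
```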

In [59]:
fare_bracket_size = 10
number_of_price_brackets = fare_ceiling // fare_bracket_size  # integer division: the 4 bins described above

There were 3 classes on board (1st, 2nd and 3rd), but even without this knowledge I can calculate the number from the data directly, by taking the length of the array of UNIQUE values in the column of index 2.

In [60]:
number_of_classes = len(np.unique(data[0::,2])) 
print number_of_classes

3


I now build a survival table (filled with zeros for now), ready to populate.

In [61]:
number_of_genders = 2
survival_table = np.zeros((number_of_genders, number_of_classes, number_of_price_brackets))

I now loop through each combination of variables and find all the passengers that match it. We loop through each class and each price bin, and use compound logic to select the fares between $10j$ and $10(j+1)$.

In [62]:
for i in xrange(number_of_classes):      
  for j in xrange(number_of_price_brackets):   

    women_only_stats = data[                          
                       (data[0::,4] == "female")    
                       &(data[0::,2].astype(np.float) 
                             == i+1)                      
                       &(data[0:,9].astype(np.float)   
                            >= j*fare_bracket_size)               
                       &(data[0:,9].astype(np.float)  
                            < (j+1)*fare_bracket_size)    
                          , 1]                         


    men_only_stats = data[                                       
                         (data[0::,4] != "female")    
                       &(data[0::,2].astype(np.float) 
                             == i+1)                                       
                       &(data[0:,9].astype(np.float)   
                            >= j*fare_bracket_size)                 
                       &(data[0:,9].astype(np.float)  
                            < (j+1)*fare_bracket_size)    
                          , 1] 
    survival_table[0,i,j] = np.mean(women_only_stats.astype(np.float)) 
    survival_table[1,i,j] = np.mean(men_only_stats.astype(np.float))

It is hidden by the conditions being tested, but ```data[condition, 1]``` picks out the Survived column for the rows that meet the condition. The loop starts with i=0 and j=0, so the first pass returns the Survived values for all the 1st-class (```i+1```) females who paid less than 10 (```(j+1)*fare_bracket_size```), and similarly for all the 1st-class males who paid less than 10. Before returning to the top of the loop, we calculate the proportion of survivors for this particular combination of criteria and record it in our survival table.
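
A single pass of that loop can be reproduced on a few rows to see how the compound condition is put together. This is a sketch in Python 3 syntax; the four-column ```toy``` array and the chosen values of ```i``` and ```j``` are assumptions for illustration.

```python
import numpy as np

# Assumed toy array. Columns: Survived, Pclass, Sex, Fare
toy = np.array([
    ["0", "3", "male", "7.25"],
    ["1", "1", "female", "71.2833"],
    ["1", "3", "female", "7.925"],
    ["0", "3", "male", "7.75"],
])

fare_bracket_size = 10
i, j = 2, 0                  # i + 1 = 3rd class, fares from 0 up to 10

pclass = toy[:, 1].astype(float)
fare = toy[:, 3].astype(float)

# Each comparison sits in its own parentheses because & binds more
# tightly than ==, >= and <.
mask = ((toy[:, 2] == "female")
        & (pclass == i + 1)
        & (fare >= j * fare_bracket_size)
        & (fare < (j + 1) * fare_bracket_size))

selected = toy[mask, 0].astype(float)    # Survived values of the matching rows
print(mask, selected.mean())
```

Converting the class and fare columns once, outside the condition, avoids repeating ```astype``` in every comparison.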

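The loop raises a warning because some categories are empty, and the mean of an empty selection is ```nan```. Since ```nan``` is the only value that is not equal to itself, we can set those entries to 0 using the following statement.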

In [63]:
survival_table[ survival_table != survival_table ] = 0.
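
The effect of the self-comparison can be checked in isolation. This is a sketch in Python 3 syntax; the three-entry ```table``` is an assumed stand-in for the survival table, and the warning filter is only there to keep the output quiet.

```python
import warnings

import numpy as np

empty = np.array([], dtype=float)        # a category with no passengers

with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)   # silence the empty-mean warning
    empty_mean = np.mean(empty)          # evaluates to nan

table = np.array([empty_mean, 0.75, 0.25])
table[table != table] = 0.0              # only nan entries differ from themselves

print(table)
```

```np.isnan(table)``` selects the same entries and states the intent more directly.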

In [64]:
print survival_table

[[[ 0.          0.          0.83333333  0.97727273]
  [ 0.          0.91428571  0.9         1.        ]
  [ 0.59375     0.58139535  0.33333333  0.125     ]]

 [[ 0.          0.          0.4         0.38372093]
  [ 0.          0.15873016  0.16        0.21428571]
  [ 0.11153846  0.23684211  0.125       0.24      ]]]


Each of these numbers is the proportion of survivors among the passengers who meet that combination of criteria.

For example, 0.91428571 means that 91.4% of the female passengers with Pclass = 2 in the fare bin 10-19 survived. For our second iteration of a model, let's again assume that any proportion greater than or equal to 0.5 should result in our predicting survival, and anything less than 0.5 should not. We can update our survival table with:

In [65]:
survival_table[ survival_table < 0.5 ] = 0
survival_table[ survival_table >= 0.5 ] = 1 

In [66]:
print survival_table

[[[ 0.  0.  1.  1.]
  [ 0.  1.  1.  1.]
  [ 1.  1.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]]
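
The two-step overwrite can also be expressed as one comparison that leaves the proportions untouched. This is a sketch in Python 3 syntax; ```proportions``` is an assumed small array that borrows a few values from the table above and adds 0.5 to show what happens exactly on the boundary.

```python
import numpy as np

# Assumed example proportions; 0.5 is included to test the boundary case.
proportions = np.array([
    [0.0, 0.91428571, 0.5],
    [0.11153846, 0.23684211, 0.4],
])

# True where the proportion is at least 0.5, then converted to 1/0 integers.
predictions = (proportions >= 0.5).astype(int)

print(predictions)
```

Keeping ```proportions``` and ```predictions``` as separate arrays means the original survival rates are still available after thresholding.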


When we go through each row of the test file we can find which criteria fit each new passenger and assign them a 1 or 0 according to our survival table. As previously, let's open up the test file to read (and skip the header row), and also a new file to write to, called 'ModelBasedonGenderClass.csv':

In [68]:
test_file = open('test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()
predictions_file = open("ModelBasedonGenderClass.csv", "wb")
predictions_file_object = csv.writer(predictions_file)
predictions_file_object.writerow(["PassengerId", "Survived"])

As with the previous model, we can take the first passenger, look at his/her gender, class, and price of ticket, and assign a Survived label. However, the fares in the test.csv file are not binned. We should loop through each bin and see if the price of the ticket falls in that bin. If so, we can break the loop (so we don't go through all the bins) and assign that bin.

A way to test for existence of a fare is to try to make it into a float, since, in the case of empty data, the script cannot make it a float. 

If there is no fare entry we'll assume a fare bin that depends simply on the passenger class. For example, a third-class passenger is put in the first bin ($0-9), a second-class passenger in the second bin ($10-19), and so on. This is why we assign bin_fare to equal ```3-Pclass```. Although there are four bins, they must be numbered from 0 to 3 because we will be using them as indices of our survival table. This little loop determines the index of the bin to look up in the survival table.

Now that we have determined the binned ticket price (bin_fare), we can see if the passenger is female (row[3]), find their Pclass (row[1]), and then grab the relevant element in survival_table. We need to convert this from the float in the survival_table into an integer (int) that we write in our prediction file for Kaggle.

In [69]:
for row in test_file_object:                  # Loop through each passenger
                                              # in the test set
  for j in xrange(number_of_price_brackets):  # For each passenger, loop
                                              # through each price bin
    try:                                      # Some passengers have no
                                              # Fare data, so try to make
      row[8] = float(row[8])                  # it a float
    except ValueError:                        # If that fails: no data, so
      bin_fare = 3 - int(row[1])              # bin the fare according to Pclass
      break                                   # Break from the loop
    if row[8] >= fare_ceiling:                # If there is data, see if it
                                              # reaches the fare ceiling
                                              # we set earlier
      bin_fare = number_of_price_brackets - 1 # If so, set to highest bin
      break                                   # And then break loop
    if row[8] >= j * fare_bracket_size \
       and row[8] < \
       (j+1) * fare_bracket_size:             # If the fare falls inside
                                              # bin j
      bin_fare = j                            # then assign that index
      break
  pclass_index = int(row[1]) - 1              # Table indices must be integers
  if row[3] == 'female':
    predictions_file_object.writerow([row[0], "%d" % int(survival_table[0, pclass_index, bin_fare])])
  else:
    predictions_file_object.writerow([row[0], "%d" % int(survival_table[1, pclass_index, bin_fare])])
# Close out the files
test_file.close()
predictions_file.close()
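
The bin lookup for a single passenger can be pulled out into a small function, which makes the three cases easy to test one by one. This is a sketch in Python 3 syntax; ```fare_bin``` is a hypothetical helper that does not appear in the notebook, and it swaps the inner loop over bins for integer division by the bin width.

```python
def fare_bin(fare_text, pclass, fare_ceiling=40, fare_bracket_size=10):
    """Return the fare index into the survival table (hypothetical helper)."""
    number_of_price_brackets = fare_ceiling // fare_bracket_size
    try:
        fare = float(fare_text)
    except ValueError:
        return 3 - pclass                       # no fare recorded: use Pclass
    if fare >= fare_ceiling:
        return number_of_price_brackets - 1     # everything above goes in the top bin
    return int(fare // fare_bracket_size)       # bin j holds fares in [10j, 10(j+1))


print(fare_bin("7.25", 3), fare_bin("71.2833", 1), fare_bin("", 2))
```

The three calls cover a low fare, a fare above the ceiling, and a missing fare for a second-class passenger.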



We have now inserted a 1 or 0 prediction for each passenger, according to gender, class, and fare paid. We can now submit the file ModelBasedonGenderClass.csv.

Just like in Excel, here we built predictions that take several features into account. But type ```print survival_table``` again: what do you notice about the predictions for men? Surely some of the men survived, but our model can only predict 0. This suggests one source of error that's reflected in our leaderboard score, and it may already be prompting new ideas for improving your next model.

Yet in contrast to Excel, we have now created a script that can easily be altered to add more variables. For example, we could include Age, where the passenger Embarked, or even their Name. All these variables may have complications of their own, so you will need to think of ways to make them useful. In this tutorial, in order to fill in any missing fare values, we assumed that the passenger class maps simply to a fare bin. Using Python we developed an extensible model without too much effort.

We are almost ready to apply machine learning to this data using Python. However, before we jump in, it would be advantageous to take a brief detour to learn tools that make some of the work here easier.

Next, I will explore Python's pandas package.