<a href="https://colab.research.google.com/github/minerva-mcgonagraph/titanic/blob/master/main2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you went through my Chicken and Doggo notebook, you'll recall that the data (the images) were nice and neat. There weren't that many of them, there were about as many images of fried chicken as there were of labradoodles, and they were all approximately the same size. If you didn't go through my Chicken and Doggo notebook, then you probably think I'm mad. Just go through my Chicken and Doggo notebook. Join the madness. It's also cute.

But unfortunately, data in the real world isn't so nice and cute. Problems often involve millions of rows of data. So instead of a proverbial sandbox, it's a more literal one: how do you turn all those grains of sand into a sparkling castle? That's what we're going to explore here.

This notebook will use the Python library pandas to do some feature engineering. Pandas is great if the data fits in memory. Since there are only a few hundred rows of data here, that's what we'll use.
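
If you ever want to check how much memory a DataFrame takes up, here is a minimal sketch. The tiny frame below is made up purely for illustration; on the real data you would call the same method on the loaded DataFrame.

```python
import pandas as pd

# Toy frame standing in for a real data set (made-up values, illustration only).
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})

# deep=True also counts the Python strings stored in text columns,
# which the default shallow estimate leaves out.
bytes_used = int(df.memory_usage(deep=True).sum())
print(f"{bytes_used / 1024:.1f} KiB in memory")
```

If that number is a small fraction of your available RAM, pandas is a comfortable choice.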

This data set is from the ongoing Kaggle Titanic competition.



The next few code cells get the data.

In [0]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"minervamcgonagraph","key":"<redacted>"}'}

In [0]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!ls ~/.kaggle

kaggle.json


In [0]:
!ls -l ~/.kaggle
!cat ~/.kaggle/kaggle.json

total 4
-rw------- 1 root root 74 Oct  7 20:02 kaggle.json
{"username":"minervamcgonagraph","key":"<redacted>"}

In [0]:
!pip install -q kaggle
!pip install -q kaggle-cli

  Building wheel for kaggle-cli (setup.py) ... done
  Building wheel for pyperclip (setup.py) ... done


In [0]:
!kaggle datasets list

ref                                                       title                                               size  lastUpdated          downloadCount  
--------------------------------------------------------  -------------------------------------------------  -----  -------------------  -------------  
bradklassen/pga-tour-20102018-data                        PGA Tour Golf Data                                  98MB  2019-10-02 14:55:56           6764  
martj42/international-football-results-from-1872-to-2017  International football results from 1872 to 2019   525KB  2019-10-02 16:51:16          21868  
dgomonov/new-york-city-airbnb-open-data                   New York City Airbnb Open Data                       2MB  2019-08-12 16:24:45          23408  
lakshyaag/india-trade-data                                India - Trade Data                                   1MB  2019-08-16 16:13:58          11102  
therohk/ireland-historical-news                           The Irish Times - Waxy-W

In [0]:
!kaggle competitions download -c titanic

Downloading train.csv to /content
  0% 0.00/59.8k [00:00<?, ?B/s]
100% 59.8k/59.8k [00:00<00:00, 24.4MB/s]
Downloading test.csv to /content
  0% 0.00/28.0k [00:00<?, ?B/s]
100% 28.0k/28.0k [00:00<00:00, 28.2MB/s]
Downloading gender_submission.csv to /content
  0% 0.00/3.18k [00:00<?, ?B/s]
100% 3.18k/3.18k [00:00<00:00, 1.33MB/s]


In [0]:
import tensorflow as tf
import numpy as np
import pandas as pd
import math

Now we'll do some initial setup. We'll set all numbers to display with two decimal places and limit table displays to 15 rows.

In [0]:
#display preferences
pd.options.display.float_format = '{:.2f}'.format
pd.options.display.max_rows = 15

#put the data into a readable format
titanic_data = pd.read_csv('./train.csv')

#randomize the data
#note: the data has already been split so no need to worry about leaks
titanic_data = titanic_data.reindex(np.random.permutation(titanic_data.index))
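
As an aside, pandas also has a built-in way to shuffle rows. This is a sketch of an alternative, not what the cell above does: the toy frame is made up, and the seed value 42 is an arbitrary choice of mine so that the shuffle is repeatable.

```python
import pandas as pd

# Toy frame standing in for titanic_data (made-up values, illustration only).
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 0, 1],
})

# sample(frac=1) returns every row exactly once, in random order.
# random_state fixes the seed, so rerunning gives the same order.
shuffled = toy.sample(frac=1, random_state=42)
shuffled_again = toy.sample(frac=1, random_state=42)

print(shuffled.index.tolist())
```

A fixed seed is handy when you want to rerun the notebook and compare results against the same row order.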

Now we'll take a first look at the data to see what we're working with. The training and test data combined account for a total of 1,309 passengers. Note that Titanic had a total of 1,317 passengers and a grand total of about 2,224 people on board (passengers and crew). That's not important for this project, but it could affect the accuracy of the model in a real-world situation, since the model does not account for any crew members. Know your data!
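
To put those numbers side by side, here is a quick bit of arithmetic. The figures are the ones quoted in this notebook; the 891 training rows are what the next cell prints.

```python
train_rows = 891       # rows in train.csv (printed by the next cell)
combined_rows = 1309   # training + test passengers, as quoted above
passengers = 1317      # total passengers, as quoted above
on_board = 2224        # approximate passengers + crew, as quoted above

# Whatever is not in the training set must be in the test set.
test_rows = combined_rows - train_rows

# Share of passengers, and of everyone on board, covered by the data.
passenger_coverage = combined_rows / passengers
on_board_coverage = combined_rows / on_board

print(test_rows)
print(f"{passenger_coverage:.1%} of passengers")
print(f"{on_board_coverage:.1%} of everyone on board")
```

So the data covers nearly all passengers but only a bit more than half of the people on the ship.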

In [0]:
#print the number of training examples
print("There are", len(titanic_data), "examples in the training set.")

#look at a few rows to get an idea of what the data is like
print(titanic_data[4:9])

#show the header row which will become the feature names. Also show the type - numeric or categorical, since we will need to deal with these separately.
titanic_data.dtypes

#next: create two new dataframes, one with the numeric data and one with the categorical data

There are 891 examples in the training set.
     PassengerId  Survived  Pclass  ...   Fare Cabin  Embarked
319          320         1       1  ... 134.50   E34         C
192          193         1       3  ...   7.85   NaN         S
825          826         0       3  ...   6.95   NaN         Q
800          801         0       2  ...  13.00   NaN         S
651          652         1       2  ...  23.00   NaN         S

[5 rows x 12 columns]


PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
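
One way to make the numeric/categorical split mentioned in the cell above is `select_dtypes`. This is only a sketch on a made-up frame with a few of the same column names; the variable names `numeric_data` and `categorical_data` are my own.

```python
import pandas as pd

# Made-up rows using a few of the train.csv column names (illustration only).
toy = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Age": [38.0, None, 26.0],
    "Sex": ["female", "male", "female"],
    "Embarked": ["C", "S", "Q"],
})

# "number" matches every integer and float dtype.
numeric_data = toy.select_dtypes(include="number")

# Everything that is not a number is treated as categorical here.
categorical_data = toy.select_dtypes(exclude="number")

print(list(numeric_data.columns))
print(list(categorical_data.columns))
```

Keep in mind that a column can be stored as a number and still be categorical in meaning. Pclass is an example: it is an int64, but 1, 2, and 3 are class labels rather than quantities.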

From https://www.kaggle.com/c/titanic/data :

Survived: 0 = No, 1 = Yes

Pclass: ticket class; 1, 2, or 3

Sex: sex

Age: age in years; fractional if less than 1. If the age is estimated, it is in the form xx.5

SibSp: number of siblings and spouses also aboard; siblings include brothers, sisters, stepbrothers, and stepsisters; spouses include husbands and wives (fiancés and mistresses were ignored)

Parch: number of parents and children also aboard; parents include mothers and fathers; children include sons, daughters, stepsons, and stepdaughters. Nannies are not included: some children travelled only with a nanny, so this value is 0 for them.

Ticket: ticket number

Fare: passenger fare

Cabin: cabin number

Embarked: port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton
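
The Age rules above can be turned into two flags. This is a rough sketch on made-up ages, and it assumes that any age of 1 or more ending in .5 is an estimate, which is my reading of the description rather than something the data confirms.

```python
import pandas as pd

# Made-up ages, including a missing value (illustration only).
ages = pd.Series([22.0, 28.5, 0.42, None, 40.5])

# Under 1 year old: the fraction is a real age, not an estimate.
is_infant = ages < 1

# 1 or older with a fractional part of exactly .5: treated as estimated.
# Comparisons with a missing age evaluate to False, so it is never flagged.
is_estimated = (ages >= 1) & (ages % 1 == 0.5)

print(is_infant.tolist())
print(is_estimated.tolist())
```

A flag like this could later become a feature of its own, or a reason to treat those ages with a bit more suspicion.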

Some questions:

1. How many survived?

2. How many first, second, and third class tickets were there?

3. How many men vs women?

4. What was the age distribution?

5. Do SibSp and Parch mean that each family got one ticket, or does each ticket list that passenger's familial relationships - i.e., is the sum of SibSp and Parch higher than the total number of passengers? How many travelled as a family and how many alone? Was someone more likely to survive if they travelled with family?

6. What's the range of ticket numbers? Are they sequential or random (i.e., are there gaps)?

7. What's the range in ticket prices? Did all tickets of each class cost the same? Does fare correlate more strongly with survival than class does? Since fare and class are presumably strongly correlated, should we ignore one of them as a feature? (TensorFlow can do this automatically.)

8. Can you tell what level each passenger was on based on the cabin number? Does that correlate with survival?

9. How many embarked from each location?
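
Here is a rough sketch of how a few of these questions could be approached in pandas. The rows are made up to mimic the train.csv layout, so the numbers mean nothing; the `FamilySize`, `Company`, and `Deck` columns are names I invented, and treating the first letter of the cabin number as the deck is an assumption.

```python
import pandas as pd

# Made-up rows in the same column layout as train.csv (illustration only).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "SibSp":    [1, 1, 0, 0, 0, 0],
    "Parch":    [0, 0, 0, 0, 2, 0],
    "Fare":     [7.25, 71.28, 7.93, 8.05, 23.00, 52.00],
    "Cabin":    [None, "C85", None, None, None, "E46"],
    "Embarked": ["S", "C", "S", "S", "S", "S"],
})

# Questions 1, 2, 3, and 9: simple counts per category.
for column in ["Survived", "Pclass", "Sex", "Embarked"]:
    print(toy[column].value_counts().sort_index(), "\n")

# Question 5: relatives aboard, and survival rate alone vs. with family.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]
toy["Company"] = toy["FamilySize"].gt(0).map({True: "with family", False: "alone"})
survival_by_company = toy.groupby("Company")["Survived"].mean()
print(survival_by_company, "\n")

# Question 7: cheapest and most expensive fare within each class.
fare_by_class = toy.groupby("Pclass")["Fare"].agg(["min", "max"])
print(fare_by_class, "\n")

# Question 8: ASSUMPTION - the first letter of the cabin number is the deck.
# Passengers without a cabin number keep a missing value.
toy["Deck"] = toy["Cabin"].str[0]
print(toy["Deck"].value_counts())
```

On the real data, the same calls would run on `titanic_data` instead of `toy`.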