# Exploring Titanic Survivors

In this notebook, we will explore a dataset about Titanic survivors. You will see which features of the dataset matter, how to make seemingly useless features useful, and how these features combine for the prediction task. Work with the people around you to make this task both easy and fun!

In [1]:
import numpy as np
import pandas as pd
from datascience import *
import matplotlib.pyplot as plt

%matplotlib inline

#### The Data
The data is split into a training dataset and a test dataset, both downloaded from Kaggle.

link: https://www.kaggle.com/c/titanic/data

#### About the Training Data:
```PassengerId``` is the unique identifier for each passenger.

```Survived``` is an indicator for whether the passenger survived or not.

```Pclass``` is the passenger class, that is, which ticket type the passenger had.

```Name``` is the name of the passenger.

```Sex``` indicates the sex of the passenger.

```Age``` is the age of the passenger in years.

```SibSp``` refers to the number of siblings or spouses onboard.

```Parch``` refers to the number of parents or children onboard.

```Ticket``` identifies the ticket number.

```Fare``` identifies the payment for the ticket made by that passenger.

```Cabin``` records the cabin number for that passenger.

```Embarked``` records where the passenger embarked from. ```C = Cherbourg```, ```Q = Queenstown```, ```S = Southampton```

#### About the Test Data:
The test data has the same columns, except for the ```Survived``` column, because that is what is being predicted using the training dataset.

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample = pd.read_csv('gender_submission.csv')
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [4]:
print('Instances in the train dataset: ' + str(len(train)))
print('Instances in the test dataset: ' + str(len(test)))

Instances in the train dataset: 891
Instances in the test dataset: 418


Notice that this dataset is pretty small. It is big enough to create a model, but is still smaller than what you might have used in other prediction or classification projects before.

### The Model:
In this notebook, you will create a model that predicts whether each passenger in the test dataset survived the sinking of the Titanic. Use your favorite classification/prediction model. I will list a few below, with links to documentation on how to implement them once you have imported the libraries they need.

If you want to read all about the Titanic, check out this link: https://www.history.com/news/why-did-the-titanic-sink. There is more on the internet about the ship and its sinking, so feel free to search for it! There is also a movie called Titanic!

### The Data Science Lifecycle:
Before you jump right into creating the prediction model and testing your predictions, know your data. Follow the **Data Science lifecycle**: learn where your data came from, how it was collected, and what errors there may be in tabulating it; get to know your data by reading some of the instances and plotting graphs and visuals to understand key patterns and trends; run some clustering methods to see how rows in the dataset are similar; and talk to the people next to you and collaborate on what you think is relevant within the dataset for the problem given to you.

Do all your work in the cells below and feel free to add and delete as you need:

Refer to online resources and libraries! Refer to things you learned from previous classes or are learning right now! Talk to the people around you.

In [None]:
# any null values? might want to fill them with relevant values?

In [None]:
# clean data types (more about this in the description below)
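
As one possible starting point for the two cells above, here is a minimal sketch of counting and filling missing values. The `toy` frame and its values are made up for illustration (they are not the real `train` data), and filling with the median or the most common value is only one choice among many; `HasCabin` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the training data (illustrative values, not the real file).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, "C123"],
})

# Count the missing values in each column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# One possible choice: median for numeric columns, most common value for categorical ones.
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

# Cabin is missing in many rows, so here we only keep whether it was recorded (0/1).
filled["HasCabin"] = toy["Cabin"].notnull().astype(int)
print(filled)
```

To try this on the real data, replace `toy` with `train` and check `missing_counts` first to see which columns actually need attention.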

### Predicting and Classifying:
You will need to do feature engineering, such as one-hot encoding and normalization, on your dataset to get the data into the format you need to implement a model.

Many of these will require that you use the scikit-learn library, so be sure to ```import sklearn```
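
To make one-hot encoding and normalization concrete, here is a small sketch using only pandas. The rows are a subset of the columns from the `train.head()` output shown earlier; `Fare_scaled` is a hypothetical column name, and min-max scaling is just one of several normalization choices.

```python
import pandas as pd

# A few columns from the first five training rows shown earlier.
rows = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# One-hot encoding: each category becomes its own 0/1 column.
encoded = pd.get_dummies(rows, columns=["Sex", "Embarked"], dtype=int)

# Min-max normalization: rescale Fare so it lies in the range [0, 1].
fare = encoded["Fare"]
encoded["Fare_scaled"] = (fare - fare.min()) / (fare.max() - fare.min())
print(encoded)
```

Note that only the categories present in the data get a column, so with these five rows there is no `Embarked_Q` column. When you encode `train` and `test` separately, check that both end up with the same columns.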

Here's a list of models you can use with relevant links:
1. Simple Linear Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
2. Decision Tree Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
3. Multiple Linear Regression
4. Clustering: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
5. Nearest Neighbors: Data8
6. Neural Nets: https://scikit-learn.org/stable/modules/neural_networks_supervised.html

Search for things like "*model name* python" to find documentation on implementing the model. **Make sure you import the relevant libraries so you can use the models themselves**. This is where you should work with your partners and the people around you to build models of the different types.

Use techniques like cross-validation to estimate your model's performance. **Try to minimize the number of submissions.**
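
Here is a rough sketch of what cross-validation can look like with scikit-learn. The feature matrix `X` and labels `y` are synthetic stand-ins (not the Titanic data), and the decision tree, `max_depth=3`, and `cv=5` are arbitrary choices for illustration; swap in your own cleaned features and whichever model you picked.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a cleaned, numeric feature matrix and 0/1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, repeat.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

Because every fold is held out once, the mean score gives an estimate of accuracy on unseen data without spending a Kaggle submission.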

### Things to Think About:
- To explore the context of the time period in which the Titanic set sail, notice how fare prices are distributed between male and female passengers. Can you group people by last name and by how many children/spouses they had on board to understand how much a family was able to spend on the trip?
- How do the embarkation location and the cabin compare? Do people from one embarkation location tend to have a certain set of cabins or a certain passenger class?
- Feature engineer ```SibSp``` and ```Parch``` to get an idea of how families are distributed among the passengers. Are there any single passengers, and what is their survival rate?
- Look at the types or units of the features. Why would they be given in that form? What must you do to the data to make it relevant for the task of the project?
- Does it even make sense to predict whether someone survived given the data? Maybe the cabin and passenger class matter, since certain rooms may have been evacuated in time for survival. But how can other random events be quantified or accounted for? Maybe feature engineering can make them relevant somehow?
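
A few of the questions above can be approached with simple grouping, sketched here on the five rows from `train.head()` shown earlier (names are copied as displayed, including the truncated one). `LastName` is a hypothetical column name, and with so few rows the numbers only illustrate the mechanics, not real patterns.

```python
import pandas as pd

# The first five training rows, as displayed earlier (subset of columns).
rows = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Average fare paid by male and female passengers.
fare_by_sex = rows.groupby("Sex")["Fare"].mean()

# How passenger classes are spread across embarkation ports.
class_by_port = pd.crosstab(rows["Embarked"], rows["Pclass"])

# Names look like "Last, Title. First ...", so the last name is the part before the comma.
rows["LastName"] = rows["Name"].str.split(",").str[0]

print(fare_by_sex)
print(class_by_port)
print(rows["LastName"].tolist())
```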

Take a look at how the accuracy increases as more and more work is put into the model. The data comes from Kaggle submissions in 2018.
<img src="first.png">

In [10]:
# use SibSp and Parch to see whether a passenger is alone or not
# define a function to do this, which returns whether a person is alone or not, either
# as a 0-1 indicator, or as a string
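
One possible way to write such a function is sketched below, using the `SibSp` and `Parch` values from the first five training rows shown earlier. The names `is_alone` and `IsAlone` are hypothetical; returning a string instead of 0/1 would work just as well.

```python
import pandas as pd

# SibSp and Parch for the first five training rows shown earlier.
rows = pd.DataFrame({
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
})


def is_alone(sibsp, parch):
    """Return 1 if the passenger has no siblings, spouse, parents, or children on board, else 0."""
    return int(sibsp + parch == 0)


# Apply the function row by row to build a 0-1 indicator column.
rows["IsAlone"] = [is_alone(s, p) for s, p in zip(rows["SibSp"], rows["Parch"])]
print(rows)
```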

In [None]:
# can the 'Mr.', 'Miss.',  'Master' tell you something? 
# create counts or weights for the name feature?
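
As a hedged sketch of how titles could be pulled out of the names, the snippet below assumes every name follows the "Last, Title. First" pattern seen in the rows shown earlier. The regular expression is one possible choice and may need adjusting for unusual names in the full dataset.

```python
import pandas as pd

# Names copied from the training rows shown earlier.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
])

# Capture the text between the comma and the first period, e.g. "Mr" or "Miss".
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Count how often each title appears.
title_counts = titles.value_counts()
print(title_counts)
```

The counts (or a one-hot encoding of the titles) can then be used as a feature in place of the raw name.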

In [None]:
# Try random features
# age * class
# average fare per person
# weight of the fare by family for each member of the family
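
The ideas listed above might be sketched as follows, again on the first five training rows shown earlier. The column names `AgeTimesClass`, `FamilySize`, and `FarePerPerson` are hypothetical, and dividing the fare by family size assumes the recorded fare covers the whole family, which may not be true.

```python
import pandas as pd

# Numeric columns from the first five training rows shown earlier.
rows = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Interaction feature: age multiplied by passenger class.
rows["AgeTimesClass"] = rows["Age"] * rows["Pclass"]

# Family size on board, counting the passenger themselves.
rows["FamilySize"] = rows["SibSp"] + rows["Parch"] + 1

# Average fare per person (assumes the fare was paid for the whole family).
rows["FarePerPerson"] = rows["Fare"] / rows["FamilySize"]
print(rows)
```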

### Submission and Accuracy
The indicator for whether a person from the test data survived is not in any of the datasets. To get the accuracy of your model, you will have to submit to Kaggle. Follow the steps below to create a submission file in the required format.

A sample submission file is shown below.

In [5]:
sample.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


You will need the ```PassengerId``` column from the ```test``` dataset to create a new DataFrame containing the ```PassengerId``` and your predictions. You will then convert your DataFrame into a CSV file and submit that to Kaggle. Below is the code for converting your DataFrame into a CSV file and saving it to your DataHub, where you can download it. Feel free to skip the implementation below and code one up to your preference.

In [None]:
# Converts a DataFrame to a .csv file in your DataHub.
# Make sure to change DataFrame to your final DataFrame and 'file_name.csv' to your submission file name.
DataFrame.to_csv('file_name.csv', index=False, header=True)
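
For reference, here is a minimal sketch of building the submission DataFrame itself. The IDs are the first five from `test.head()`, and `predictions` is a hypothetical placeholder (it simply reuses the values from the sample submission); in practice it would be your model's output for every row of `test`.

```python
import pandas as pd

# PassengerId values from the first five test rows shown earlier.
test_ids = pd.Series([892, 893, 894, 895, 896], name="PassengerId")

# Placeholder predictions (0 = did not survive, 1 = survived); use your model's output here.
predictions = [0, 1, 0, 0, 1]

# Kaggle expects exactly two columns: PassengerId and Survived.
submission = pd.DataFrame({"PassengerId": test_ids, "Survived": predictions})

# With no file path, to_csv returns the CSV text, which is handy for a quick check.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file name such as `'file_name.csv'` as the first argument writes the same content to disk instead of returning it.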

## Submit to Kaggle:
link: https://www.kaggle.com/c/titanic