# Titanic Dataset


For this second dataset, I will focus on learning the methodology and workflow for working through a data science problem, following the Kaggle tutorial [here](https://www.kaggle.com/renil93/titanic/titanic-data-science-solutions/notebook).

### Note:
While I follow the tutorial, I will also research and use methods beyond those it provides, to explore machine learning further.


## Workflow stages

Kaggle competition solutions follow a workflow of seven stages:
1. Question or problem definition.
2. Acquire training and testing data.
3. Wrangle, prepare, and cleanse the data.
4. Analyze, identify patterns, and explore the data.
5. Model, predict, and solve the problem.
6. Visualize, report, and present the problem solving steps and final solution.
7. Supply or submit the results.

## Question and problem definition:

Since this problem was defined by Kaggle [here](https://www.kaggle.com/c/titanic), I will go ahead and paste the problem definition here:

>Knowing from a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine based on a given test dataset not containing the survival information, if these passengers in the test dataset survived or not.


Before moving on, the tutorial suggests that we *develop some early understanding about the domain of our problem*. I will do this by looking at some highlights from the Kaggle competition description [here](https://www.kaggle.com/c/titanic):

* On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. That translates to a survival rate of about 32%.

* One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.

* Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
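
The 32% figure follows directly from the two counts quoted above. A minimal sketch of the arithmetic (the variable names are my own):

```python
# Figures quoted in the Kaggle competition description.
total_on_board = 2224
deaths = 1502

survivors = total_on_board - deaths
# float() keeps the division correct on Python 2 as well as Python 3.
survival_rate = survivors / float(total_on_board)

print("Survivors: {}".format(survivors))
print("Survival rate: {:.1%}".format(survival_rate))
```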

## Workflow goals

The solutions workflow for data science solves for seven major goals:

#### Note:
The quotes below are taken directly from the tutorial.

* **Classifying**:
>We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.

* **Correlating**:
>One can approach the problem based on available features within the training dataset.
>* Which features within the dataset contribute significantly to our solution goal? 
>* Statistically speaking is there a correlation among a feature and solution goal?
>* As the feature values change does the solution state change as well, and vice versa?
>This can be tested both for numerical and categorical features in the given dataset. We may also want to determine correlation among features other than survival for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features.

* **Converting**:
>For modeling stage, one needs to prepare the data. Depending on the choice of model algorithm one may require all features to be converted to numerical equivalent values. So for instance converting text categorical values to numeric values.

* **Completing**:
>Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.

* **Correcting**:
>We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it is not contributing to the analysis or may significantly skew the results.

* **Creating**:
>Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, completeness goals.

* **Charting**:
>How to select the right visualization plots and charts depending on nature of the data and the solution goals. See [this](https://www.tableau.com/learn/whitepapers/which-chart-or-graph-is-right-for-you#ERAcoH5sEG5CFlek.99) for more info. 
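
To make the converting and creating goals more concrete, here is a small sketch on made-up rows. The column names match the Titanic data, but the values are illustrative, and `SexCode` and `FamilySize` are hypothetical feature names I chose for the example.

```python
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 2],
})

# Converting: map the text categories to numeric codes.
toy["SexCode"] = toy["Sex"].map({"female": 1, "male": 0})

# Creating: derive a new feature from two existing ones.
# FamilySize is a hypothetical name; the +1 counts the passenger themselves.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

print(toy)
```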



## Best practices 
* Performing feature correlation analysis early in the project.
* Using multiple plots instead of overlays for readability.
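
As a rough idea of what an early correlation check could look like, the sketch below computes the Pearson correlation of each numeric column with the target. The rows are made up for illustration, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Illustrative rows only; not taken from train.csv as a whole.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Fare":     [71.3, 53.1, 13.0, 7.25, 7.9, 8.05],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Pearson correlation of every numeric feature with the target column.
corr_with_target = toy.corr()["Survived"].drop("Survived")

print(corr_with_target.sort_values())
```

Pearson correlation only captures linear relationships between numeric columns, so categorical features such as Sex need a different treatment (for example, a group-by on the target).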


In [15]:
#data analysis
import numpy as np
import pandas as pd
import scipy

#visualization
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas versions
%matplotlib inline

#machine learning
import sklearn
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
##machine learning algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

## Import the data

The dataset comes as a training CSV and a test CSV. I will put both DataFrames into a list so that certain operations can be run on both datasets at once.

In [16]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
combine = [train, test]
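
A quick sketch of how such a list can be used, with two tiny stand-in frames instead of the real CSV files (`train_toy`, `test_toy`, and the `IsFemale` column are names I made up for the example):

```python
import pandas as pd

# Small stand-ins for train.csv and test.csv (illustrative values only).
train_toy = pd.DataFrame({"Sex": ["male", "female"], "Survived": [0, 1]})
test_toy = pd.DataFrame({"Sex": ["female", "male"]})
combine_toy = [train_toy, test_toy]

# The list holds references to the same DataFrame objects, so adding a column
# inside the loop changes train_toy and test_toy themselves.
for dataset in combine_toy:
    dataset["IsFemale"] = (dataset["Sex"] == "female").astype(int)

print(train_toy)
print(test_toy)
```

One thing to watch for: rebinding the loop variable (for example `dataset = dataset.drop(...)`) creates a new object and leaves the original frames untouched, so operations that return a new DataFrame need to be assigned back explicitly.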

## Analyze by describing the data


Let's start by asking a question...

**Which features are available in the dataset?**

First we will identify the feature names by printing them. The names are further explained on [kaggle here](https://www.kaggle.com/c/titanic/data).

In [17]:
print(train.columns.values)

['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']


**Which features are categorical?**

>These values classify the samples into sets of similar samples. Within categorical features are the values nominal, ordinal, ratio, or interval based? Among other things this helps us select the appropriate plots for visualization.
>* Categorical: Survived, Sex, and Embarked.
>* Ordinal: Pclass.

**Which features are numerical?**
>These values change from sample to sample. Within numerical features are the values discrete, continuous, or timeseries based? Among other things this helps us select the appropriate plots for visualization.
>* Continuous: Age, Fare.
>* Discrete: SibSp, Parch.
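
A first pass at this split can be done from the column dtypes. The sketch below uses `select_dtypes` on a two-row toy frame; it is only a starting point, since the dtype does not tell the whole story.

```python
import pandas as pd

# Two illustrative rows with a mix of numeric and text columns.
toy = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Embarked": ["S", "C"],
})

# Columns stored as numbers versus everything else (here: text).
numeric_cols = toy.select_dtypes(include=["number"]).columns.tolist()
text_cols = toy.select_dtypes(exclude=["number"]).columns.tolist()

print("Numeric:", numeric_cols)
print("Text:", text_cols)
```

Note that Pclass shows up as numeric here even though it is an ordinal category, so the storage type still has to be checked against what the feature actually means.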


In [18]:
#preview the data
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


**Which features are mixed data types?**
>Numerical, alphanumeric data within same feature. These are candidates for correcting goal.
>* Ticket is a mix of numeric and alphanumeric data types. Cabin is alphanumeric.
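
One way to see the mix in Ticket is to check which values consist of digits only. The sketch below does this for the five ticket values shown in the preview above; it is just an illustration of the check, not part of the tutorial.

```python
import pandas as pd

# Ticket values copied from the five-row preview of the training data.
tickets = pd.Series(
    ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    name="Ticket",
)

# True where the ticket is purely numeric, False where it has a text prefix.
is_numeric = tickets.str.isdigit()

print(is_numeric.value_counts())
```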

**Which features may contain errors or typos?**
>This is harder to review for a large dataset, however reviewing a few samples from a smaller dataset may just tell us outright, which features may require correcting
>* Name feature may contain errors or typos as there are several ways used to describe a name including titles, round brackets, and quotes used for alternative or short names.
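
Even though names are written in several ways, the samples shown here all follow a "Surname, Title. Given names" pattern. As a hedged sketch, the title could be pulled out with a regular expression; the pattern below is my own assumption and may not cover every name in the dataset.

```python
import pandas as pd

# Names copied from the previews of the training data.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Assumed pattern: a space, then letters, then a period (e.g. " Mr.").
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

print(titles.tolist())
```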

In [19]:
train.tail()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


**Which features contain blank, null or empty values?**

>These will require correcting.
>* Cabin > Age > Embarked features contain a number of null values in that order for the training dataset.
>* Cabin > Age are incomplete in case of test dataset.
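
The ordering of missing values can also be read off directly by counting nulls per column. A minimal sketch on a toy frame (the values are made up, so the counts do not match the real data):

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps in three of the four columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing values per column, largest first.
missing = toy.isnull().sum().sort_values(ascending=False)

print(missing)
```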

**What are the data types for various features?**

>Helping us during converting goal.
>* Seven features are integer or floats. Six in case of test dataset.
>* Five features are strings (object).


In [20]:
train.info()
print('_'*40)
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null

**What is the distribution of numerical feature values across the samples?**
>This helps us determine, among other early insights, how representative is the training dataset of the actual problem domain.
>* Total samples are 891 or 40% of the actual number of passengers on board the Titanic (2,224).
>* Survived is a categorical feature with 0 or 1 values.
>* Around 38% samples survived representative of the actual survival rate at 32%.
>* Most passengers (> 75%) did not travel with parents or children.
>* Nearly 30% of the passengers had siblings and/or spouse aboard.
>* Fares varied significantly with few passengers (< 1%) paying as high as 512 dollars.
>* Few elderly passengers ( < 1%) within age range 65-80.


In [21]:
train.describe()
# Review survived rate using `percentiles=[.61, .62]` knowing our problem description mentions 38% survival rate.
# Review Parch distribution using `percentiles=[.75, .8]`
# SibSp distribution `[.68, .69]`
# Age and Fare `[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]`

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
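
The comments in the cell above suggest passing custom `percentiles` to `describe()`. Here is a small sketch of that idea on a made-up Survived column with a 62/38 split, chosen only so the result is easy to follow:

```python
import pandas as pd

# 62 non-survivors and 38 survivors: an illustrative split, not the real data.
toy = pd.DataFrame({"Survived": [0] * 62 + [1] * 38})

# Ask describe() for the percentiles around the expected survival boundary.
stats = toy.describe(percentiles=[.61, .62])

print(stats)
```

For a 0/1 column the mean is the share of ones, and the percentile at which the value flips from 0 to 1 marks the share of zeros.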


**What is the distribution of categorical features?**

The table below shows this.

In [30]:
train.describe(include=['O'])
# To select categorical (string) columns, use the object dtype.
# See also the select_dtypes documentation, e.g. df.describe(include=['O'])

| | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Graham, Mr. George Edward | male | CA. 2343 | C23 C25 C27 | S |
| freq | 1 | 577 | 7 | 4 | 644 |


In [45]:
# Male ratio = freq 577 / count 891
print("{} = {}".format("Male ratio", float(577) / float(891)))
# Duplicate ticket ratio = (count 891 - unique 681) / count 891
print("{} = {}".format("Duplicate ticket ratio", float(891 - 681) / float(891)))

Male ratio = 0.64758698092
Duplicate ticket ratio = 0.23569023569


Here we see that:
* Names are unique across the dataset (count = unique = 891).
* Sex takes two values, with male the most common (top = male, freq = 577, about 65%).
* Cabin values have several duplicates across samples, which suggests that some passengers shared cabins.
* Embarked takes three possible values (unique = 3), and most passengers used port S (top = S).
* Ticket has a high ratio (about 24%) of duplicate values (unique = 681).
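
Rather than typing the counts in by hand, the same two ratios can be derived from the columns themselves. A sketch on four made-up rows, using the same definition of the duplicate ratio as above, (count - unique) / count:

```python
import pandas as pd

# Four illustrative rows; two passengers share a ticket.
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "male"],
    "Ticket": ["CA. 2343", "CA. 2343", "113803", "373450"],
})

# Share of each category; pick out the male share.
male_ratio = toy["Sex"].value_counts(normalize=True)["male"]

# (count - unique) / count, written as 1 - unique / count.
duplicate_ticket_ratio = 1 - toy["Ticket"].nunique() / float(len(toy))

print("Male ratio = {}".format(male_ratio))
print("Duplicate ticket ratio = {}".format(duplicate_ticket_ratio))
```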

## Assumptions based on data analysis

**Correlating**

Here we want to think about how well each feature correlates with survival. We want to do this early and match these quick correlations with modelled correlations later in the project.
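
For a categorical feature, one quick way to get such an early correlation is the mean of Survived per category, which is simply the survival rate of each group. A minimal sketch on made-up rows:

```python
import pandas as pd

# Illustrative rows only; the rates below say nothing about the real data.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 target per group = survival rate of that group.
survival_by_sex = (
    toy.groupby("Sex")["Survived"].mean().sort_values(ascending=False)
)

print(survival_by_sex)
```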

**Completing**
1. We want to complete the Age feature as we believe this feature is correlated to survival.
2. We also want to complete the Embarked feature, as it may also correlate to survival.
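
As a first idea for completing these two features, the sketch below fills Age with the median and Embarked with the most frequent value. These are simple choices I am assuming for illustration; the tutorial may use a more refined method.

```python
import numpy as np
import pandas as pd

# Illustrative rows with one gap in each column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric feature: fill with the median of the known values.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical feature: fill with the most frequent value (the mode).
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```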

**Correcting**
1. Tick
2. 
3. 
4. 