In [None]:
# plotting
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl

# data visualization
import seaborn as sns
from helper_functions import plot_setup
sns.set_style('white')
plot_setup()

# data analysis
import pandas as pd

# data mining & ML
from sklearn import preprocessing

import warnings
warnings.filterwarnings('ignore')

# Getting Started with the Data

We will be working with the [Titanic Dataset](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html).

### Some Background before you Start

"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others." -- Kaggle

In this exploration, we will analyze which factors made individuals more or less likely to survive the Titanic. We will build machine learning models that estimate the likelihood of survival for an individual based on various features of that person. Then we will use these models to predict the likelihood of survival for people the models have not seen.

### Loading and Pre-processing

Before we get started with the analysis, we begin with the first part of any machine learning exploration: loading the data and cleaning it up for analysis.

Typical datasets are messy.
* Data can be, and often is, missing.
* Sometimes data is invalid.
* You might have much more data available than you need.
* You may also need to change your data types, so they are compatible with your algorithms. 

These are just a few of the ways in which data sets in the wild can be imperfect.

Such data sets require preprocessing to get them into a format that a library like scikit-learn can use.
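
As a rough illustration of these problems, the sketch below builds a tiny made-up table and shows how pandas reports missing entries and column types. The name `toy` and all of its values are invented for this example and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# A tiny, made-up table with typical problems (hypothetical values).
toy = pd.DataFrame({
    'age': ['22', '38,5', None],      # numbers stored as text, one missing
    'fare': [7.25, np.nan, 8.05],     # one missing value
    'name': ['A', 'B', 'C'],          # a column we may not need
})

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Inspect the column types. Text columns are typically reported as
# `object` (newer pandas versions may report a dedicated string type).
print(toy.dtypes)
```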

#### Loading the data

Let's first load our data set and take a look at some of the data to get an idea of what type of preprocessing we might need.

In [None]:
data_url = 'titanic.csv'
titanic = pd.read_csv(data_url, sep = ';')

In [None]:
titanic.head()

Let's take a look at the columns in our data set.

The `survived` column is our label set (i.e. what we are trying to predict). The rest of the columns are our features.

In [None]:
titanic.columns.tolist()

What the columns mean:

**survived** - Survival  
0 = No, 1 = Yes

**pclass** - Ticket class (a proxy for socio-economic status)  
1st = Upper, 2nd = Middle, 3rd = Lower

**gender** - Gender  
female, male

**age** - Age in years  
Fractional if less than 1. If the age is estimated, it is in the form of xx.5

**sibsp** - # of siblings/spouses aboard the Titanic  
Sibling = brother, sister, stepbrother, stepsister  
Spouse = husband, wife (only official, legal wives and husbands considered)

**parch** - # of parents/children aboard the Titanic  
Parent = mother, father  
Child = daughter, son, stepdaughter, stepson  
Some children travelled only with a nanny, therefore parch=0 for them.

**ticket** - Ticket number

**fare** - Passenger fare

**cabin** - Cabin number

**embarked** - Port of Embarkation  
C = Cherbourg, Q = Queenstown, S = Southampton

**boat** - Lifeboat (if survived)

**body** - Body number (if did not survive and body was recovered)

**home.dest** - Home/destination

Let's take another look at a small sample of the data.

In [None]:
titanic.head()

#### Cleaning the Data 

What are some good places to start with cleaning up this data?

1) In the `age` and `fare` columns, we see that commas are used rather than periods. We need to replace all commas in those columns with periods so we can work with them properly.

In [None]:
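# Assign the result back instead of modifying the column in place,
# which is not reliable on a column selection in recent pandas versions
titanic['age'] = titanic['age'].replace(',', '.', regex = True)
titanic['fare'] = titanic['fare'].replace(',', '.', regex = True)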

titanic.head()
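
An alternative, sketched here on a small in-memory CSV rather than the real file, is to tell `pd.read_csv` about the decimal comma up front through its `decimal` parameter. The sample text and the names `sample_csv` and `sample` are made up for illustration.

```python
import io
import pandas as pd

# Made-up sample in the same style as the file: ';' separators, ',' decimals.
sample_csv = "name;age;fare\nA;29;211,3375\nB;0,9167;151,55\n"

# `decimal=','` makes pandas parse '0,9167' as the number 0.9167.
sample = pd.read_csv(io.StringIO(sample_csv), sep=';', decimal=',')
print(sample.dtypes)
```

With this option the affected columns should arrive as numeric types directly, so the replace-and-convert steps would not be needed for them.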

2) When the dataset was first loaded, some of the numeric columns, like `age`, were stored and loaded as `object` types. We need to convert these into numeric types for scikit-learn to use. Note that Python strings (`str`) show up as `object` in pandas.

Let's see what types are in our data set and make sure they match our expectations.

In [None]:
for col in titanic.columns.values:
    print(col, titanic[col].dtype)

Which of the column types don't match the expected types?

We expect `age` and `fare` to be numeric, but they are currently `object` types. Everything else looks as expected.

Let's convert those columns.

In [None]:
titanic[['age', 'fare']] = titanic[['age', 'fare']].apply(pd.to_numeric)

for col in titanic.columns.values:
    print(col, titanic[col].dtype)
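
By default, `pd.to_numeric` raises an error when it meets an entry it cannot parse. The following is a minimal sketch, on a made-up Series named `raw`, of how `errors='coerce'` turns such entries into missing values instead. Whether that is appropriate depends on the data at hand.

```python
import pandas as pd

# Hypothetical values: two parseable numbers and one invalid entry.
raw = pd.Series(['22.0', '38.5', 'unknown'])

# Unparseable entries become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric)
```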

**Adding Features**

We don't have to just remove and reformat columns during pre-processing, though!

We can also create new features that we think may be useful.

The total family size of an individual could be a useful feature for us. Let's create a new feature, `family_members`, out of `sibsp` (the number of siblings and spouses) and `parch` (the number of parents and children).

We'll also create the `family_status` feature which tells us if an individual traveled alone or with a family.

In [None]:
titanic['family_members'] = titanic['sibsp'] + titanic['parch']

titanic['family_status'] = 'alone'
# Use .loc to avoid chained assignment, which may silently fail to update the dataframe
titanic.loc[titanic['family_members'] != 0, 'family_status'] = 'with family'

titanic.head()
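
The same two features can also be derived with `numpy.where`, which picks one of two values per row based on a condition. This is only a sketch on a made-up DataFrame named `toy`, using the same column names as above.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same column names as the Titanic data.
toy = pd.DataFrame({'sibsp': [0, 1, 0], 'parch': [0, 2, 1]})

# Family size is the sum of both relative counts.
toy['family_members'] = toy['sibsp'] + toy['parch']

# 'alone' where the family size is zero, 'with family' otherwise.
toy['family_status'] = np.where(toy['family_members'] == 0, 'alone', 'with family')
print(toy)
```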

OK. Pre-processing done for now. Our data is ready to work well with scikit-learn and we're in a good place to move to the next step of data exploration: plotting.

# Exploring the Data

The first step of machine learning is understanding the data that you are working with. This helps you get a sense of which features might be the most important, which algorithms make the most sense for your data, etc.

What do you think might distinguish the people who survived the Titanic from the ones who didn't?

Let's plot the data and see what initial insights we can get.

Let's start with **age**. How does age affect people's likelihood of survival?

In [None]:
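# distplot is deprecated in recent seaborn versions; histplot with kde = True is a close replacement
sns.histplot(titanic['age'][titanic['survived'] == 1].dropna(), kde = True, stat = 'density', label = 'survived')
ax = sns.histplot(titanic['age'][titanic['survived'] == 0].dropna(), kde = True, stat = 'density', label = 'did not survive')
ax.legend()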

What about the **price of their ticket**?

In [None]:
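# distplot is deprecated in recent seaborn versions; histplot with kde = True is a close replacement
sns.histplot(titanic['fare'][titanic['survived'] == 1].dropna(), kde = True, stat = 'density', label = 'survived')
ax = sns.histplot(titanic['fare'][titanic['survived'] == 0].dropna(), kde = True, stat = 'density', label = 'did not survive')
ax.set(xlim = (-20, 100))
ax.legend()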

What else could influence whether they survived? Maybe their **gender**?

In [None]:
# Display counts of survivors for each gender category
sns.countplot(data = titanic, x = 'gender', hue = 'survived')

# Display percent of survivors for each gender category
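# (factorplot was renamed to catplot; kind = 'point' reproduces its default plot)
sns.catplot(data = titanic, x = 'gender', y = 'survived', kind = 'point')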

We can now look at the overall percentage of people who survived.

In [None]:
titanic['survived'].mean()

As we have just seen, survival rates are very gender-dependent. These are the survival rates for men and women separately.

In [None]:
titanic.groupby('gender')['survived'].mean()

Would their **passenger class** have an effect?

In [None]:
# Display counts of survivors for each passenger class
sns.countplot(data = titanic, x = 'pclass', hue = 'survived')

# Display percent of survivors for each passenger class
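# (factorplot was renamed to catplot; kind = 'point' reproduces its default plot)
sns.catplot(data = titanic, x = 'pclass', y = 'survived', kind = 'point')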

These are now survival rates for each passenger class.

In [None]:
titanic.groupby('pclass')['survived'].mean()

We see some strong indicators in `gender` and `pclass`.

We can examine the averages of other features split on gender or pclass and see if there are any differences which stand out.

_Do you notice any other features which have very different means based on gender or pclass?_

In [None]:
# numeric_only = True skips text columns, which cannot be averaged
titanic.groupby('gender').mean(numeric_only = True)

In [None]:
# numeric_only = True skips text columns, which cannot be averaged
titanic.groupby('pclass').mean(numeric_only = True)

Now let's look at how the combination of gender and passenger class influences the survival rates.

In [None]:
sns.catplot(data = titanic.sort_values(by = 'pclass'), x = 'pclass', y = 'survived', hue = 'gender', kind = 'point')
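
The numbers behind such a plot can also be tabulated. Here is a minimal sketch using `pivot_table` on a made-up DataFrame named `toy` (the rows are invented and only reuse the column names of the Titanic data). Each cell is the mean of `survived`, i.e. the survival rate for that combination.

```python
import pandas as pd

# Made-up passengers, not real Titanic records.
toy = pd.DataFrame({
    'gender': ['female', 'female', 'male', 'male', 'male', 'female'],
    'pclass': [1, 3, 1, 3, 3, 3],
    'survived': [1, 0, 1, 0, 0, 1],
})

# Rows: passenger class, columns: gender, cells: mean of `survived`.
rates = toy.pivot_table(index='pclass', columns='gender',
                        values='survived', aggfunc='mean')
print(rates)
```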

We divided people into those traveling **alone or with family**. How did that affect whether they ended up surviving?

In [None]:
sns.catplot(data = titanic.sort_values(by = 'family_members'), x = 'family_status', y = 'survived', hue = 'pclass', kind = 'point')

In [None]:
sns.catplot(data = titanic.sort_values(by = 'family_members'), x = 'family_status', y = 'survived', hue = 'gender', kind = 'point')

On Your Own: Explore and plot at least one other feature, or combination of features, that you think may be an indicator of survival rate.

In [None]:
### Code Here

titanic.columns.values

# Preparing the Data

#### Removing the Columns

After the initial exploration, are there any columns you think we can remove?

Let's look at two options:

* Columns with mostly unique values - If most values are unique, we won't be able to discover patterns and have enough information to generalize.  
* Columns with lots of missing values - If most values are missing, we won't have enough data to get predictive power.
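
Both criteria can be expressed as a fraction per column. The sketch below computes them on a made-up DataFrame named `toy`; the values are invented, and any threshold for "mostly unique" or "mostly missing" is a judgment call that this example does not settle.

```python
import pandas as pd

# Made-up table: `ticket` is fully unique, `cabin` is mostly missing.
toy = pd.DataFrame({
    'ticket': ['A1', 'B2', 'C3', 'D4'],
    'cabin': [None, None, None, 'C85'],
    'pclass': [1, 3, 3, 1],
})

# Share of distinct (non-missing) values per column.
unique_ratio = toy.nunique() / len(toy)

# Share of missing values per column.
missing_ratio = toy.isna().mean()

summary = pd.DataFrame({'unique_ratio': unique_ratio,
                        'missing_ratio': missing_ratio})
print(summary)
```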

1) Do we have any columns with almost all unique values?

In [None]:
sorted([(col, titanic[col].unique().size) for col in titanic.columns.values ], key=lambda tup: tup[1], reverse=True)

The majority of `name` and `ticket` fields are unique. We'll drop those columns as they are unlikely to provide us much useful information.

In [None]:
titanic.drop(['name', 'ticket'], axis = 1, inplace = True)

2) Do we have all values available for all passengers?

In [None]:
column_counts = titanic.count()
column_counts.sort_values(inplace=True)

column_counts

Some of the columns have significantly fewer values than others. Since most machine learning models don't deal with missing values well, we'll remove these columns.

In [None]:
titanic.drop(['cabin', 'boat', 'body', 'home.dest'], axis = 1, inplace = True)

`age` is now the remaining column with the most missing values. Because we saw that age has some influence on survival rates, we don't want to drop the column. Instead, we'll remove every row that still contains a missing value.

In [None]:
titanic.dropna(inplace = True)

Instead of 1309 rows, we now have 1043. We lost about 20% of the data, which isn't ideal but is acceptable here.
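
The share of lost rows follows directly from the two row counts quoted above. A quick check, with variable names chosen only for this example:

```python
# Row counts quoted in the text above.
rows_before = 1309
rows_after = 1043

# Fraction of rows removed by dropping missing values.
fraction_lost = (rows_before - rows_after) / rows_before
print(f"{fraction_lost:.1%}")  # prints 20.3%
```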

In [None]:
titanic.count()

Let's look at what kind of data we have left in the dataframe. 

In [None]:
titanic.head()

#### Different Data Types & Encoding

There are three different types of data: numerical, categorical, and ordinal.

*Numerical data* are typically measurements and are also referred to as quantitative data.

*Categorical data* represents characteristics. The categories can be given numerical values (e.g. 0 for Female and 1 for Male), but these values have no mathematical meaning. For example, they can't be meaningfully added together.

*Ordinal data* represents a mix of categorical and numerical data. The data falls into categories, but the numbers have mathematical meaning and they can be placed in an order. Star ratings (e.g. 0 - 5) are an example of ordinal data.
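
To make the idea of ordered categories concrete, here is a small sketch using an ordered pandas `Categorical`. The labels `low`, `medium`, and `high` and the name `ratings` are made up for illustration and do not appear in the Titanic data.

```python
import pandas as pd

# Made-up ratings with an explicit order: low < medium < high.
ratings = pd.Series(pd.Categorical(
    ['low', 'high', 'medium', 'low'],
    categories=['low', 'medium', 'high'],
    ordered=True,
))

# Because the categories are ordered, min and max are meaningful.
print(ratings.min(), ratings.max())

# Each category also has an integer code that follows the order.
print(ratings.cat.codes.tolist())
```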

In our data set we have numerical and categorical data.

`age` and `fare` are examples of numerical data from our data set.

`pclass` is an example of categorical data, even though it uses the numerical values 1, 2, and 3.

`gender` is also categorical. Right now it is a string, but it refers to a category (female or male).

`embarked` and `family_status` are categorical variables as well.

Machine learning models can work with numerical data with little change. However, in order for these models to work with categorical data properly, we need to encode them as numerical values. 

We will use the [`LabelEncoder`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) from `sklearn`.

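This encoder converts string categorical data into numerical values that sklearn can use by mapping each distinct category to an integer (0, 1, 2, ...). A different method, which you don't need to worry about for now, is [One-Hot encoding](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). It creates a separate 0/1 column for each category and so avoids implying an order between categories.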

In [None]:
label_encoder = preprocessing.LabelEncoder()

titanic['gender'] = label_encoder.fit_transform(titanic['gender'])
titanic['embarked'] = label_encoder.fit_transform(titanic['embarked'])
titanic['family_status'] = label_encoder.fit_transform(titanic['family_status'])
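
To see what the encoder does, the sketch below applies it to a short made-up Series of port codes and, for comparison, one-hot encodes the same values with `pd.get_dummies`. The names `ports`, `codes`, and `one_hot` are chosen for this example only.

```python
import pandas as pd
from sklearn import preprocessing

# Made-up port codes using the same letters as the `embarked` column.
ports = pd.Series(['S', 'C', 'Q', 'S'], name='embarked')

# LabelEncoder assigns integers to the sorted classes: C -> 0, Q -> 1, S -> 2.
encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(ports)
print(list(encoder.classes_), codes.tolist())

# One-hot encoding creates one indicator column per category instead.
one_hot = pd.get_dummies(ports, prefix='embarked')
print(one_hot)
```

The integer codes are compact, but they suggest an order (C < Q < S) that has no real meaning. The indicator columns avoid that at the cost of extra columns.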

In [None]:
titanic.head()

Look how clean our data set is. It's ready for use in our models. Let's save this cleaned data set for later use.

In [None]:
titanic.to_csv('titanic_processed.csv', index = False)

Remember, cleaning data is a critical part of machine learning. It can take a while, but it's incredibly important. Real data is messy, and it takes a few systematic passes through your data to get it into a good state for working with models and algorithms.