# Introduction to open source Machine Learning utilising the Titanic dataset

This notebook provides an introduction to machine learning. It uses the Titanic dataset from the Kaggle website: https://www.kaggle.com/c/titanic.

Let's start by having a look at the data. For that, we will use Pandas, an open source data analysis package for Python, which provides two main data structures: Series and DataFrame.

#### Series:  
Data structure for one-dimensional data (e.g. time series, ...)

#### DataFrame: 
Two-dimensional, tabular data structure (e.g. time series with multiple parameters, data formerly contained in Excel sheets, ...).
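
As a minimal sketch of the two structures, the following builds a Series and a DataFrame by hand. The variable names `ages` and `toy` and the values are made up for illustration and are not part of the Titanic files.

```python
import pandas as pd

# A Series: one-dimensional, labelled data (toy values for illustration)
ages = pd.Series([22.0, 38.0, 26.0], name="Age")

# A DataFrame: a table whose columns are Series sharing one index
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})

print(type(toy["Age"]))  # selecting a single column gives back a Series
print(toy.shape)         # (number of rows, number of columns)
```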

You can find the Pandas documentation here: http://pandas.pydata.org .

To import data into Pandas, we first have to load the Pandas module into Python.


In [2]:
import pandas as pd

Then, we can read the data using a standardised Pandas function. Here, we assign the data in the `train.csv` file to the Python variable `df`. If you have other file formats you can find all the functions needed in the [pandas doc](https://pandas.pydata.org/docs/user_guide/io.html).

In [3]:
df = pd.read_csv('train.csv')

In [4]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
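
If `train.csv` is not at hand, the behaviour of `pd.read_csv` can still be tried out on an in-memory string. This is only a sketch: `csv_text` and `sample` are hypothetical names, and the three rows are a shortened excerpt of the columns shown above.

```python
import io
import pandas as pd

# Three rows in CSV form; an in-memory stand-in for a file such as train.csv
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22.0
2,1,1,female,38.0
3,1,3,female,26.0
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # column types inferred while parsing
```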


## Let us start with a small quiz

What kind of learning problem do we have here:
- [ ] Supervised Learning
- [ ] Unsupervised Learning
- [ ] Reinforcement Learning

What is our target here:
- [ ] Regression
- [ ] Classification
- [ ] Clustering

To get an overview of the data, we can use several Pandas attributes and methods:
- `.columns` property shows the different columns in the DataFrame
- `.head(n)` returns the first n rows of the DataFrame
- `.describe()` returns general summary statistics for the numerical columns in the DataFrame
- `df[col]` returns the data in a certain column of the DataFrame as a Series
- `df[col].unique()` returns the unique values of a column of the DataFrame
- `df[col].value_counts()` returns the number of occurrences of each unique value in a column as a Series
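
The functions listed above can be tried on a small hand-made table first. The DataFrame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

print(toy.columns)            # column labels
print(toy.head(2))            # first two rows
pclass = toy["Pclass"]        # a single column as a Series
print(pclass.unique())        # distinct values, in order of appearance
print(pclass.value_counts())  # how often each distinct value occurs
```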

In [5]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


#### We can see that we have 12 data columns.

We can already divide these columns into two groups: numerical data and string data. NOTE: Strings have to be converted to numerical values before they can be used in a machine learning application. The reason is that machine learning algorithms are built on mathematical operations such as multiplications, additions and non-linear functions, and these only work with numbers.
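
To illustrate what such a conversion can look like, here is a minimal sketch on toy data. The column name `Sex_num`, the 0/1 mapping and the use of one-hot encoding are assumptions chosen for illustration, not necessarily the encoding used later in this course.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Two categories: map each string to 0 or 1 (the assignment is an arbitrary choice)
toy["Sex_num"] = toy["Sex"].map({"male": 0, "female": 1})

# More than two categories: one indicator column per category (one-hot encoding)
embarked_onehot = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(toy)
print(embarked_onehot)
```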

Let's start by having a closer look at the numerical data.

#### The numerical data:
The most important column is the `Survived` column. It provides the targets for training the algorithm.

Another important one is the unique `PassengerId`, which serves throughout the whole process as a unique identifier for the data belonging to one person. The `Name` column could also be used for this, but two passengers can share the same name, while ids are never repeated.

#### The other columns are:
- `Pclass`, giving the class in which the passenger travelled
- `SibSp`, giving the number of siblings / spouses aboard
- `Parch`, giving the number of parents / children aboard
- `Age`, giving the age of the passenger
- `Fare`, giving the price the passenger paid for his/her ticket

To investigate the numerical data better we can call `df.describe()`

In [6]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Note that only the numerical columns appear, as these statistics only make sense for them.

#### Let's go through the different rows:
- `count` gives the number of valid values (excluding missing `NaN` values)
- `mean` provides the mean of all the values in the column.
- `std` provides the standard deviation of the values in the column.
- `min` provides the minimum value in the column.
- `25%` provides the lower quartile value of the column.
- `50%` provides the median value of the column.
- `75%` provides the upper quartile value of the column.
- `max` provides the maximum value of the column.
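
Each of these rows can also be computed on its own, which shows how missing values are handled. The following is a sketch on a toy Series; the name `ages` and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan], name="Age")

print(ages.count())         # number of non-missing values; NaN is not counted
print(ages.mean())          # mean over the non-missing values
print(ages.std())           # sample standard deviation (ddof=1)
print(ages.quantile(0.25))  # lower quartile
print(ages.median())        # same as ages.quantile(0.5)
print(ages.describe())      # all of the above in one call
```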

#### We can already draw some conclusions from these summary statistics:
- It was more probable to die than to survive on the Titanic, since the mean of the `Survived` column is only 0.38.
- The `count` value is the same for all columns except the `Age` column, where some values are missing (714 of 891 are present).
- Many people travelled without relatives: at least half of the passengers had no siblings or spouses aboard (`25%` and `50%` values of `SibSp` are 0, `75%` value 1, maximum 8), and at least three quarters had no parents or children aboard (`25%`, `50%` and `75%` values of `Parch` are 0, maximum 6).
- The fares rise steeply towards the more expensive tickets: the median fare is about 14.45 and the `75%` value is 31.0, while the maximum is 512.33.
- At least half of the tickets were third class tickets, since the `50%` value of `Pclass` is 3.
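
The first two conclusions rest on two small facts: the mean of a 0/1 column is the share of ones, and `count` is the number of rows minus the missing values. A sketch on toy data (the DataFrame `toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
})

# Mean of a 0/1 column = share of rows that are 1
survival_rate = toy["Survived"].mean()
same_rate = (toy["Survived"] == 1).sum() / len(toy)

# Missing values per column; count() equals len(toy) minus this number
missing = toy.isna().sum()

print(survival_rate, same_rate)
print(missing)
```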




## Pandas Indexing
Before we continue, let us get a bit more comfortable with the way we can select data in pandas.

To select data in pandas we will use `.loc` and `__getitem__`. There are even more options which can be found in the [indexing and selecting data section of the pandas documentation](https://pandas.pydata.org/docs/user_guide/indexing.html).

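In pandas the `__getitem__` function (the square-bracket operator, `df[...]`) returns the column with the given label, so if we do `df["Survived"]` we will retrieve the column indicating whether a person survived or not. Given a boolean Series instead of a column label, the same operator filters rows.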

`.loc` on the other hand allows us to select rows, or rows and columns at the same time. E.g. `df.loc[df["Survived"] == 1]` would return all data related to survivors, while `df.loc[df["Survived"] == 1, "Age"]` would return the age of the survivors. Let's actually try this:
- Select the data related to everyone older than 30
- Return the passenger class (`"Pclass"`) for everyone paying more than 50 pounds (`"Fare"`) for the journey

In [7]:
df[df["Age"] > 30]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
873,874,0,3,"Vander Cruyssen, Mr. Victor",male,47.0,0,0,345765,9.0000,,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
881,882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q


In [8]:
df.loc[df["Fare"] > 50, "Pclass"]

1      1
3      1
6      1
27     1
31     1
      ..
856    1
863    3
867    1
871    1
879    1
Name: Pclass, Length: 160, dtype: int64
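
The two selections above follow one pattern: build a boolean mask, then pass it to `.loc`, optionally together with column labels. The sketch below repeats this on toy data and adds a combined condition; the DataFrame `toy`, its values and the variable names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 51.86],
})

mask = toy["Fare"] > 50                    # boolean Series, one entry per row

rows = toy.loc[mask]                       # all columns of the matching rows
pclass = toy.loc[mask, "Pclass"]           # one column -> Series
subset = toy.loc[mask, ["Pclass", "Age"]]  # several columns -> DataFrame

# Conditions are combined with & (and) or | (or); each needs its own parentheses
older_survivors = toy.loc[(toy["Age"] > 30) & (toy["Survived"] == 1)]

print(pclass)
print(older_survivors)
```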

Let us investigate these data more closely using [visualization](./02-ExplorationNumeric.ipynb).