# Introduction to open source Machine Learning utilising the Titanic dataset

This notebook series provides an introduction to machine learning. For that it utilises the Titanic dataset from the Kaggle website https://www.kaggle.com/c/titanic.

Let's start by having a look at the data. For that, we will utilise Pandas, an open source data analysis package for Python, which is built around two main data structures: Series and DataFrame.

#### Series:  
Data structure for one-dimensional data (e.g. time series, ...)

#### DataFrame: 
Two-dimensional, tabular data structure (e.g. time series with multiple parameters, data formerly contained in Excel sheets, ...).

You can find the Pandas documentation here: http://pandas.pydata.org.

To import data into Pandas, we first have to load the Pandas module into Python.

In [None]:
import pandas as pd
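
To make the two data structures introduced above more concrete, here is a minimal sketch. The values are made up for illustration and the variable names `ages` and `toy` are hypothetical.

```python
import pandas as pd

# A Series holds one-dimensional, labelled data (made-up ages).
ages = pd.Series([22.0, 38.0, 26.0], name="Age")

# A DataFrame holds two-dimensional, tabular data; each column is a Series.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

print(ages.ndim, toy.ndim)   # number of dimensions of each structure
print(type(toy["Age"]))      # selecting one column gives a Series
```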

Then, we can read the data using the standard pandas function `read_csv`. If your data comes in another file format, you can find the matching reader functions in the [pandas doc](https://pandas.pydata.org/docs/user_guide/io.html).

In [None]:
df = pd.read_csv('train.csv')

In [None]:
df
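
If `train.csv` is not at hand, the behaviour of `read_csv` can be tried out on a small in-memory text. This is only a sketch: the rows below are made up and are not taken from the Titanic data, and the name `toy` is hypothetical.

```python
import io
import pandas as pd

# Made-up rows for illustration only; they are not copied from train.csv.
csv_text = """PassengerId,Survived,Name,Age
1,0,Passenger A,30
2,1,Passenger B,
3,1,Passenger C,41.5
"""

# read_csv accepts a path or a file-like object, such as this text buffer.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)    # (rows, columns)
print(toy.dtypes)   # the column types inferred by pandas
print(toy)          # the empty Age field is read as NaN
```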

## Let us start with a small quiz

What kind of learning problem do we have here:
- [ ] Supervised Learning
- [ ] Unsupervised Learning
- [ ] Reinforcement Learning

What is our target here:
- [ ] Regression
- [ ] Classification
- [ ] Clustering

## Data Overview

#### We can see that we have 12 data columns.

We can already divide these columns into two groups: numerical data and string data. NOTE: Strings have to be converted to numerical values before they can be used in a machine learning application. The reason is that machine learning algorithms are all based on mathematical operations such as multiplications, additions and non-linear functions, and these only work with numbers.
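
As a rough sketch of this split and of one possible string-to-number conversion, the example below uses made-up rows. The `Sex` column name and the chosen integer codes are assumptions for illustration; other encodings are possible.

```python
import pandas as pd

# Made-up rows; the `Sex` column name is an assumption about train.csv.
toy = pd.DataFrame({
    "Age": [30.0, 25.0, 41.5],
    "Fare": [8.0, 60.0, 15.5],
    "Sex": ["male", "female", "female"],
})

# Separate the numerical columns from all remaining (string) columns.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
string_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, string_cols)

# Replace each string by an integer code so arithmetic becomes possible.
toy["Sex_code"] = toy["Sex"].map({"male": 0, "female": 1})
print(toy[["Sex", "Sex_code"]])
```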

Let's start by having a closer look at the numerical data.

#### The numerical data:
The most important column is the `Survived` column. It provides the targets for training the algorithm.

Another important one is the `PassengerId`, which serves throughout the whole process as a unique identifier for the data belonging to one person. The `Name` column could in principle serve the same purpose, but two passengers can share a name, while ids are never duplicated.
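
Whether a column is suitable as an identifier can be checked directly. The sketch below uses made-up rows in which two passengers deliberately share a name.

```python
import pandas as pd

# Made-up rows: two different passengers share the same name.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Passenger A", "Passenger B", "Passenger A"],
})

print(toy["PassengerId"].is_unique)   # every id occurs exactly once
print(toy["Name"].is_unique)          # the repeated name makes this False

# Show all rows whose name occurs more than once.
shared = toy[toy["Name"].duplicated(keep=False)]
print(shared)
```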

#### The other columns are:
- `Pclass`, giving the class in which the passenger travelled
- `SibSp`, giving the number of siblings / spouses aboard
- `Parch`, giving the number of parents / children aboard
- `Age`, giving the age of the passenger
- `Fare`, giving the price the passenger paid for his/her ticket

To get an overview of the data, we can utilise several of the functions and properties Pandas offers:
- `.columns` property shows the different columns in the DataFrame
- `.head(n)` returns the first n rows of the DataFrame
- `.describe()` returns summary statistics for the numerical columns in the DataFrame
- `df[col]` returns the data in the column `col` of the DataFrame as a Series
- `df[col].unique()` returns the unique values of the column of a DataFrame
- `df[col].value_counts()` returns the number of occurrences of each unique value in the column as a Series
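
The functions listed above can be tried on a small DataFrame. The sketch below uses made-up rows that only reuse three column names from the dataset.

```python
import pandas as pd

# Made-up rows that reuse three column names from the dataset.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 2, 3],
    "Age": [30.0, 25.0, None, 41.5, 19.0],
})

print(toy.columns.tolist())           # the column names
print(toy.head(2))                    # the first two rows
print(toy["Pclass"].unique())         # distinct values in order of appearance
print(toy["Pclass"].value_counts())   # how often each value occurs
```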

In [None]:
df.describe()

Note that only the numerical columns appear, as these statistics only make sense for numbers.

#### Let's go through the different rows:
- `count` gives the number of valid values (missing `NaN` values are excluded)
- `mean` provides the mean of all the values in the column.
- `std` provides the standard deviation from the mean in the column.
- `min` provides the minimum value in the column.
- `25%` provides the lower quartile value of the column.
- `50%` provides the median value of the column.
- `75%` provides the upper quartile value of the column.
- `max` provides the maximum value of the column.
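
Each of these rows corresponds to a method that can also be called on its own. The sketch below checks this on a made-up Series with one missing entry.

```python
import pandas as pd

# Made-up values with one missing entry.
values = pd.Series([1.0, 2.0, 3.0, 4.0, None])
summary = values.describe()

print(summary)
print(values.count())          # count: non-missing values only
print(values.mean())           # mean
print(values.std())            # std: sample standard deviation
print(values.quantile(0.25))   # 25%: lower quartile
print(values.median())         # 50%: median
print(values.quantile(0.75))   # 75%: upper quartile
```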

#### We can already draw some conclusions from these summary statistics:
- It was more probable to die than to survive on the Titanic, since the mean of the `Survived` column is only 0.38.
- The `count` value is the same for all columns except the `Age` column, where some values are missing.
- Most people travelled alone, as the `25%` and `50%` values for both `SibSp` and `Parch` are 0. `SibSp` has a `75%` value of 1 and a maximum of 8, while `Parch` has a `75%` value of 0 and a maximum of 6.
- The fares rise steeply towards the expensive end: the maximum fare lies far above the `75%` value.
- Most of the tickets were third-class tickets.
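
Two of these readings can be reproduced with single calls. The sketch below uses made-up rows: for a column that only contains 0 and 1, the mean is the share of ones, and `isna().sum()` counts the missing values that lower `count`.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Survived": [0, 1, 0, 0, 1],
    "Age": [30.0, None, 41.5, 19.0, None],
})

# For a 0/1 column, the mean equals the share of ones.
survival_rate = toy["Survived"].mean()
print(survival_rate)

# Missing values per column explain a lower `count` in describe().
missing = toy.isna().sum()
print(missing)
```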




## Pandas Indexing
Before we continue, let us get a bit more comfortable with the way we can select data in pandas.

To select data in pandas we will use `.loc` and `__getitem__`. There are even more options which can be found in the [indexing and selecting data section of the pandas documentation](https://pandas.pydata.org/docs/user_guide/indexing.html).

In pandas, `__getitem__` is the method behind the square-bracket syntax. Called with a column name it returns that column, so `df["Survived"]` retrieves the column indicating whether a person survived or not.

`.loc`, on the other hand, allows us to select rows, or rows and columns at the same time. E.g. `df.loc[df["Survived"] == 1]` would return all data related to survivors, while `df.loc[df["Survived"] == 1, "Age"]` would return the age of the survivors. Let's actually try this:
- Select the data related to everyone older than 30
- Return the passenger class (`"Pclass"`) for everyone paying more than 50 pounds (`"Fare"`) for the journey

In [None]:
df[df["Age"] > 30]

In [None]:
df.loc[df["Fare"] > 50, "Pclass"]
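
Both selections above rely on a boolean mask, that is, a Series holding one `True` or `False` per row. The sketch below makes the mask visible on made-up rows. Square brackets with a mask select rows only, while the combined row-and-column selection goes through `.loc`.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Pclass": [3, 1, 2, 1],
    "Age": [22.0, 38.0, 54.0, 35.0],
    "Fare": [7.5, 71.0, 51.0, 53.0],
})

mask = toy["Fare"] > 50            # boolean Series, one entry per row
print(mask.tolist())

rows = toy[mask]                   # rows only: all columns are kept
pclass = toy.loc[mask, "Pclass"]   # rows and one column: a Series
print(rows)
print(pclass)
```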

Let us investigate these data closer using [visualization](./02-ExplorationNumeric.ipynb).