# Titanic dataset

This assignment is based on the introductory Kaggle problem [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic). The goal is to train a classifier that predicts which passengers survived the disaster.

We will start by reading in the, by now standard, Titanic dataset. It contains information about the passengers of the Titanic, including sex, age, name and passenger class, as well as whether the passenger survived or died in the disaster. You can find more details about this data set [here](http://campus.lakeforest.edu/frank/FILES/MLFfiles/Bio150/Titanic/TitanicMETA.pdf).

The data is in "comma-separated values" (CSV) format, and to read it we will use the [pandas](https://pandas.pydata.org) library. Pandas provides tools for manipulating data frames and series and is widely used in data science projects.

Please note that this is NOT a pandas manual. For a detailed explanation of the concepts and functions used here, you should consult the [documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
data = pd.read_csv("titanic3.csv")

`data` is a pandas  [_DataFrame_](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) object. 

In [None]:
type(data)

We can check what attributes are stored in the DataFrame by listing the column names:

In [None]:
data.columns

or get a quick preview using the ```head``` method:

In [None]:
data.head(2)

For a description of these features, please see the [link](http://campus.lakeforest.edu/frank/FILES/MLFfiles/Bio150/Titanic/TitanicMETA.pdf) mentioned above.

Another useful method is ```info```:

In [None]:
data.info()

As we can see, not all attributes are known (non-null) for every passenger. This is a frequent situation in real datasets.
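
A quick way to see how many values are missing in each column is to combine `isna` with `sum`. The sketch below uses a small made-up frame (the name `toy` and all its values are assumptions for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "pclass": [1, 2, 3, 3],
    "sex": ["female", "male", "male", "female"],
    "age": [29.0, np.nan, 22.0, np.nan],
    "survived": [1, 0, 0, 1],
})

# isna() marks missing entries with True; summing counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Applied to the real `data` frame, the same expression would list the number of missing entries for every attribute at once.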

## Problem 1

__a) Implement a Bayes classifier for predicting passenger survival  using sex and pclass  features.__

#### Preliminaries

We will start by extracting from the frame  only the information we need:

In [None]:
data_selected = data[['pclass', 'sex', 'survived']]

In [None]:
data_selected.info()

In [None]:
data_selected.head(5)

First we need to group passengers according to sex, class and survival status. This can be achieved using  the [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function:

In [None]:
grouped = data_selected.groupby(['survived','sex','pclass'])

We can count the number of passengers in each group using the ```size``` method:

In [None]:
counts = grouped.size()

The ```counts``` object contains all the information that we need to construct the classifier:

In [None]:
counts

`counts` is a pandas [_Series_](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) object indexed by a [_MultiIndex_](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-hierarchical).

In [None]:
counts.index

We can treat a multi-indexed series as a multi-dimensional table, with each level of the index corresponding to one dimension. You can index `counts` to obtain the value of a specific entry:

In [None]:
counts[1,'female',2]

The index is hierarchical: if we do not provide all indices, a subset of elements is returned, e.g.

In [None]:
counts[1,'female']

lists the number of surviving women in each class. Similarly

In [None]:
counts[1]

lists the number of survivors for each sex and class.

It is, however, better to use the `loc` indexer. With it we can also use the _slicing_ notation. For example

In [None]:
counts.loc[0, :,3]

lists non-survivors in third class regardless of sex.

Both `[]` and `loc[]` can  also take a _tuple_ as an argument: 

In [None]:
counts.loc[(0, 'female',3)]

but slice notation is not permitted inside a tuple. You can get the same effect by providing an explicit _slice_ object

In [None]:
counts.loc[(0, slice(None),3)]
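
As an alternative to writing `slice(None)` by hand, pandas provides the `pd.IndexSlice` helper, which lets us keep the colon notation. The sketch below compares both forms on a made-up series (`toy` and `toy_counts` are hypothetical stand-ins for `data_selected` and `counts`):

```python
import pandas as pd

# Hypothetical toy passengers; only the structure mirrors the real data.
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 1, 0, 1],
    "sex": ["male", "male", "female", "female", "female", "male", "female", "male"],
    "pclass": [3, 3, 3, 1, 3, 1, 1, 3],
})
toy_counts = toy.groupby(["survived", "sex", "pclass"]).size()

idx = pd.IndexSlice

# Two ways to select non-survivors in third class, regardless of sex.
with_slice_object = toy_counts.loc[(0, slice(None), 3)]
with_index_slice = toy_counts.loc[idx[0, :, 3]]

print(with_index_slice)
```

Both selections rely on the index being sorted, which is the case for the result of `groupby(...).size()`.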

As expected, the `sum` method returns the sum of all the entries of the series

In [None]:
n_passengers = counts.sum()
n_survivors = counts[1].sum()
n_dead = counts[0].sum()

print(n_passengers, n_survivors, n_dead)
print(n_survivors+n_dead==n_passengers)

### Classifier

To implement the classifier we need to calculate the conditional probability of survival given sex and class:

$$P(survived|sex, pclass)$$

Here $survived$ is the label, which can take two values: 0 for the dead and 1 for survivors. It suffices to calculate only the survival probability because of the relation

$$P(survived=1|sex, pclass)+P(survived=0|sex, pclass)=1$$

We can use the Bayes theorem but it will be actually quicker to calculate it directly from the definition:

$$P(survived|sex, pclass)=\frac{P(survived,sex, pclass)}{P(sex, pclass)}
\approx \frac{\#(survived,sex, pclass)}{\#(sex,pclass)}$$

where by $\#$ I have denoted the number of passengers with the given attributes. For example, the probability of survival for a woman traveling in second class is:

$$\frac{\text{number of women in second class that survived}}{\text{number of women in second class}}$$

which we can calculate as

In [None]:
counts[(1,'female',2)]/(counts[(1,'female',2)]+counts[(0,'female',2)])

This operation has to be repeated for every sex and class combination, but we do not have to do it index by index. Pandas overloads arithmetic operations so that they work on all indices at once, e.g.

In [None]:
by_sex_class = counts.loc[0]+counts.loc[1]

creates a series with the number of passengers of each sex and class

In [None]:
by_sex_class

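The same effect can be achieved by grouping on index levels with `groupby(level=...)` and summing within each group. The `level` argument lists the levels which are __not__ summed over. In other words, those are the levels left after summation. (Older pandas versions accepted the `level` argument directly in the series `sum` function, but this was removed in pandas 2.0.) To sum over the `survived` level we use

In [None]:
by_sex_class = counts.groupby(level=['sex','pclass']).sum()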

Using `counts` and `by_sex_class` you can calculate the required conditional probabilities.

In [None]:
p_surv_cond_sex_pclass = (counts/by_sex_class)
p_surv_cond_sex_pclass = p_surv_cond_sex_pclass.reorder_levels(['survived','sex','pclass']).sort_index()

In [None]:
p_surv_cond_sex_pclass

In the above expression we have used a very useful feature of pandas series. When performing an arithmetic operation  the elements of the series are _joined_ based on the common index levels.  

Let's  look at it in more detail:

`counts` has three index levels

In [None]:
counts.index.names

and `by_sex_class` has two

In [None]:
by_sex_class.index.names

Levels 'sex' and 'pclass' are common to both indexes so the expression

```p_surv_cond_sex_pclass = (counts/by_sex_class)```

will have a three level index with  levels 'survived', 'sex' and 'pclass'  and is equivalent to:

In [None]:
p_surv_cond_sex_pclass = pd.Series(0.0, index=counts.index)  # float dtype, since we store probabilities
for survived, sex, pclass in counts.index: 
    p = counts.loc[survived, sex, pclass]/by_sex_class.loc[sex, pclass]
    p_surv_cond_sex_pclass.loc[(survived, sex, pclass)] = p

Unfortunately, this join operation also reorders the levels of the multi-index, so we have to restore the order using the `reorder_levels` and `sort_index` methods.

```p_surv_cond_sex_pclass = (counts/by_sex_class).reorder_levels(['survived','sex','pclass']).sort_index()```

As a sanity check, the probabilities for each sex and class should add up to one when summed over `survived`:

In [None]:
p_surv_cond_sex_pclass.groupby(level=['sex', 'pclass']).sum()
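
Because `survived` is coded as 0 and 1, the survival probability for each group can also be cross-checked as the group mean of that column. The following is a minimal sketch on made-up data (`toy`, `toy_counts` and `toy_totals` are hypothetical names, and the values are not the Titanic numbers):

```python
import pandas as pd

# Hypothetical toy passengers (values are made up).
toy = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0, 1, 0, 1],
    "sex": ["female", "female", "female", "male", "male", "male", "male", "female"],
    "pclass": [1, 1, 1, 3, 3, 3, 3, 3],
})

# Route 1: ratio of counts, as in the text.
toy_counts = toy.groupby(["survived", "sex", "pclass"]).size()
toy_totals = toy_counts.groupby(level=["sex", "pclass"]).sum()
# Note: a group with no survivors at all would be missing from toy_counts.loc[1]
# and would come out as NaN here; every toy group has at least one survivor.
p_from_counts = toy_counts.loc[1] / toy_totals

# Route 2: the mean of a 0/1 column is the fraction of ones.
p_from_mean = toy.groupby(["sex", "pclass"])["survived"].mean()

print(p_from_mean)
```

Both routes should agree up to floating point rounding.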

#### b) Calculate TPR and FPR on the whole set. Draw the ROC curve and calculate AUC score

The TPR (true positive rate) is the fraction of survivors that were classified as survivors, and the FPR (false positive rate) is the fraction of dead persons that were classified as survivors. We classify a person as a survivor when the probability of survival is greater than or equal to one half.

For ROC and AUC use the functions from the scikit-learn library.
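
The definitions of TPR and FPR can be illustrated with plain NumPy. This is only a sketch on made-up labels and probabilities (`y_true` and `p_survived` are hypothetical arrays, not results from the Titanic data):

```python
import numpy as np

# Made-up true labels and predicted survival probabilities.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
p_survived = np.array([0.9, 0.6, 0.4, 0.7, 0.2, 0.1, 0.5, 0.8])

# Classify as survivor when the probability is at least one half.
y_pred = p_survived >= 0.5

true_positives = np.sum(y_pred & (y_true == 1))
false_positives = np.sum(y_pred & (y_true == 0))

tpr = true_positives / np.sum(y_true == 1)   # fraction of survivors found
fpr = false_positives / np.sum(y_true == 0)  # fraction of dead flagged as survivors

print(tpr, fpr)
```

The same two arrays are what `sklearn.metrics.roc_curve` and `sklearn.metrics.roc_auc_score` expect as their `y_true` and `y_score` arguments.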

#### c) Are those features conditionally independent? 

To answer this question we need to compare the conditional probability distribution

$$P(sex,pclass|survived)$$

with

$$P(sex|survived)\times P(pclass|survived)$$

Please note that $survived$ is actually a label for the survival status: 1 for survived and 0 for dead. 

By definition

$$P(sex,pclass|survived)= \frac{P(sex,pclass,survived)}{P(survived)}$$

which can be calculated based on the `counts` object. 

We can also look at the relative differences between the two distributions.
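
One possible way to organise this comparison is sketched below. It uses made-up counts (`toy_counts` is a hypothetical stand-in for `counts`, and the helper `normalize_within_survived` is my own naming), so the numbers say nothing about the real data:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[0, 1], ["female", "male"], [1, 2, 3]],
    names=["survived", "sex", "pclass"],
)
# Made-up counts, for illustration only (not the Titanic numbers).
toy_counts = pd.Series([1, 2, 7, 10, 20, 60, 30, 20, 20, 10, 5, 15], index=index)

def normalize_within_survived(n):
    """Divide each entry by the total of its `survived` group."""
    return n / n.groupby(level="survived").transform("sum")

# Joint distribution P(sex, pclass | survived).
p_joint = normalize_within_survived(toy_counts)

# Marginals P(sex | survived) and P(pclass | survived).
p_sex = normalize_within_survived(toy_counts.groupby(level=["survived", "sex"]).sum())
p_pclass = normalize_within_survived(toy_counts.groupby(level=["survived", "pclass"]).sum())

# Product of marginals, built entry by entry on the same index as p_joint.
p_product = pd.Series(
    [p_sex.loc[(s, sex)] * p_pclass.loc[(s, pc)] for s, sex, pc in toy_counts.index],
    index=toy_counts.index,
)

relative_difference = (p_joint - p_product) / p_product
print(relative_difference)
```

If the features were conditionally independent, all relative differences would be close to zero.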

__d) Implement a naive Bayes classifier using the same features and compare it with a).__

Please calculate the FPR and TPR as well as AUC and draw the ROC curve. 

We  have already calculated the probability 

$$P_{NB}(sex,pclass|survived) = P(sex|survived)\times P(pclass|survived)$$

From which we can calculate 

$$P_{NB}(survived|sex,pclass)= \frac{P_{NB}(sex,pclass|survived)P(survived)}{P_{NB}(sex,pclass)}$$

where the denominator is also calculated from the factorised probabilities

$$P_{NB}(sex,pclass)= P_{NB}(sex,pclass|survived=1)P(survived=1)+P_{NB}(sex,pclass|survived=0)P(survived=0)$$

That is very important because the result must be a probability and add up to one

$$P_{NB}(survived=1|sex,pclass)+P_{NB}(survived=0|sex,pclass)=1$$

for each sex and passenger class. 
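
The normalisation step can be sketched as follows. Again the counts are made up (`toy_counts` is a hypothetical stand-in for `counts`, and `normalize_within_survived` is a helper name of my own), so this only illustrates the formulas above:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[0, 1], ["female", "male"], [1, 2, 3]],
    names=["survived", "sex", "pclass"],
)
# Made-up counts, for illustration only (not the Titanic numbers).
toy_counts = pd.Series([1, 2, 7, 10, 20, 60, 30, 20, 20, 10, 5, 15], index=index)

def normalize_within_survived(n):
    """Divide each entry by the total of its `survived` group."""
    return n / n.groupby(level="survived").transform("sum")

p_sex = normalize_within_survived(toy_counts.groupby(level=["survived", "sex"]).sum())
p_pclass = normalize_within_survived(toy_counts.groupby(level=["survived", "pclass"]).sum())
p_survived = toy_counts.groupby(level="survived").sum() / toy_counts.sum()

# Numerator of Bayes' theorem: P_NB(sex, pclass | survived) * P(survived).
numerator = pd.Series(
    [p_sex.loc[(s, sex)] * p_pclass.loc[(s, pc)] * p_survived.loc[s]
     for s, sex, pc in toy_counts.index],
    index=toy_counts.index,
)

# Denominator P_NB(sex, pclass): the numerator summed over both survival states.
denominator = numerator.groupby(level=["sex", "pclass"]).transform("sum")

p_nb = numerator / denominator
print(p_nb.loc[1])  # P_NB(survived=1 | sex, pclass)
```

Computing the denominator from the factorised numerator, rather than from the raw counts, is what guarantees that the two posterior probabilities add up to one.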

## Problem 2

##### Add age as a feature and implement a naive Bayes classifier. 

Compute the FPR, TPR and AUC as well as draw the ROC curve. 

#### Hint: 
Consider treating age as a categorical variable.

We start by constructing a new dataframe with age added:

In [None]:
data_with_age = data_selected.copy()

In [None]:
data_with_age['age'] = data['age']

###  Missing values ! :( 

Unfortunately there is a problem. Not all passengers have their age recorded, as we can check by inspecting the dataframe:

In [None]:
data_with_age.info()

We can see that there are only 1046 non-null entries for 'age'. We can also count the missing entries directly using the `isna` method

In [None]:
data['age'].isna().sum()

This problem is unfortunately very common in data science and machine learning. I will present here several methods of dealing with it.

The first solution is to ignore the missing data, _i.e._ delete all rows with missing entries. This can be easily achieved using the `pandas.DataFrame.dropna` method:

In [None]:
data_with_age_cleaned = data_with_age.dropna()

In [None]:
data_with_age_cleaned.info()

The second method is to fill in the missing values. A good choice would be the average age (or the median):

In [None]:
data_with_age_cleaned.age.mean()

In [None]:
data_with_age_filled = data_with_age.fillna({'age': data_with_age.age.median()})  # fill only the 'age' column

Conveniently, statistical functions such as `mean` and `median` simply disregard the missing data.

In [None]:
data_with_age_filled.info()

In [None]:
data_with_age_filled.age.mean()

A more "sophisticated" method would be to fill the missing values based on group statistics, e.g. the average age of persons with the same sex and pclass. That can be achieved using the `groupby` and `apply` methods. We could do it in one line, but I will rather do it step by step.

We start by grouping  data by sex and pclass. We do not group by survived, those are the labels and after training we would not have access to them. 

In [None]:
grouped = data_with_age.groupby(['sex', 'pclass'])

Now we would like to take each group, calculate the mean age, and use it to fill the missing ('na') values in the group. This can be achieved using the `apply` method

In [None]:
data_with_age_group_filled = grouped.apply(
    lambda g: g.fillna({'age': g.age.mean()})
)

The `apply` method takes a function as its argument. This function expects a DataFrame and returns a DataFrame. The groups are passed one by one to this function and the results are assembled back into the resulting dataframe.

In [None]:
data_with_age_group_filled.info()

In [None]:
data_with_age_group_filled.age.mean()
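
The exact shape of the frame returned by `groupby(...).apply` (its index and whether the grouping columns are kept) has changed between pandas versions. A variant that avoids this, sketched here on made-up data (`toy` and its values are hypothetical), uses `transform` to compute the group mean for every row and then fills the gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up).
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [30.0, np.nan, 40.0, 20.0, 22.0, np.nan],
})

# transform("mean") returns a series aligned with the original rows,
# holding the mean age of the (sex, pclass) group each row belongs to.
group_mean_age = toy.groupby(["sex", "pclass"])["age"].transform("mean")

# Fill only the missing ages; the row order and the index stay untouched.
toy_filled = toy.assign(age=toy["age"].fillna(group_mean_age))

print(toy_filled)
```

This keeps the original index, so the filled column can be compared row by row with the unfilled one.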

The last method would be to treat the missing data as a separate age category.

Now we will divide age into categories. Just for fun I will define a function that returns a categorizing function

In [None]:
def make_age_categorizer(limits, lbls):
    def categorizer(age):
        if not pd.isna(age):
            for i,l in enumerate(limits):
                if age<=l:
                    return lbls[i]
            return lbls[-1]    
        else:
            return 'unc'

    return categorizer
    

Somewhat arbitrarily I will classify everyone aged 12 or younger as children, those older than 12 and up to 60 as adults, and everyone older as seniors

In [None]:
ctg = make_age_categorizer([12,60],['child','adult','senior'])
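
For comparison, the same binning can be sketched with the built-in `pd.cut` function. The ages below are made up, and the label `'unc'` for missing ages follows the categorizer above:

```python
import numpy as np
import pandas as pd

# Made-up ages, including the boundary values and a missing entry.
ages = pd.Series([5.0, 12.0, 30.0, 60.0, 71.0, np.nan])

# Bins are closed on the right by default: (-inf, 12], (12, 60], (60, inf).
categories = pd.cut(
    ages,
    bins=[-np.inf, 12, 60, np.inf],
    labels=["child", "adult", "senior"],
)

# Missing ages stay NaN after pd.cut; give them their own category.
categories = categories.cat.add_categories("unc").fillna("unc")

print(categories.tolist())
```

The result is a categorical series, which can be used in `groupby` in the same way as the string column produced by the categorizer function.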

We can use the `ctg` function to add a new column containing age categories to the dataframe

In [None]:
data_with_age['age_category']=data_with_age.age.apply(ctg)

and group it by all categories

In [None]:
counts_with_age = data_with_age.groupby( ['survived', 'sex', 'pclass', 'age_category']).size()

In [None]:
counts_with_age

From now on we can proceed as before.