# Introduction to Machine Learning in Python
The purpose of this notebook is to introduce you to some basic concepts and techniques for machine learning in Python.

**This notebook will cover:**
1. Setting up Jupyter Notebook and importing libraries
2. The dataset used in this notebook
3. Reading files into a pandas DataFrame
4. Looking closer at the data using pandas built-in functionality and Python basics
5. Visualizing data using matplotlib and seaborn
6. Wrangling data
7. Bonus: Intro to scikit-learn

## Prerequisites for using Jupyter Notebook
I recommend installing [Anaconda](https://www.anaconda.com/download). This enables you to run a Jupyter Notebook on your local computer in the browser and manage Python libraries. There are alternative solutions, but this gets you up and running very quickly.

* Once installed, navigate to `Home` and click on the `Launch` button for the `JupyterNotebook` application. This will open a new tab in your browser.
* Navigate to this file in your folder structure and double-click it. This should open a new tab in your browser.
* Now you are good to go.


## Importing necessary libraries, and reading data
Before diving into the details we start by importing the necessary Python libraries for this training. Click the respective library names for further information.

> | Library | Description |
> | --- | --- |
> | [random](https://docs.python.org/3/library/random.html) | for random number generation |
> | [pandas](https://pandas.pydata.org/) | for data manipulation and analysis (dataframes) |
> | [numpy](https://numpy.org/) | for numerical operations |
> | [matplotlib](https://matplotlib.org/) | visualization with Python |
> | [scikit-learn](https://scikit-learn.org/stable/) | machine learning in Python |
> | [seaborn](https://seaborn.pydata.org/) | statistical data visualization |

This line of code installs all necessary Python libraries from the `requirements.txt` file.

In [None]:
!pip install -r ../requirements.txt

In [7]:
# Each library is imported as "short_name" to enable shorter reference names later, e.g., random is imported as rd
import random as rd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# The Dataset used in this notebook

In this notebook we will use the [Titanic - Machine Learning from Disaster dataset](https://www.kaggle.com/c/titanic/data) available on Kaggle. The files are available in the *02_Dataset* folder. The dataset is split into two groups:
1. The training dataset, train.csv, which should be used to build your ML models.
2. The test dataset, test.csv, which should be used to evaluate your ML models.

**Both datasets contain information about passengers on the Titanic.** In addition to passenger traits such as name, the training dataset contains information about whether the passenger survived. For an in-depth walkthrough of the dataset, please visit the link above. In short, the datasets contain the following attributes for each passenger:

> | Variable | Definition | Key |
> | --- | --- | --- |
> | survival | Survival | 0 = No, 1 = Yes |
> | pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
> | sex | Sex | |
> | Age | Age in years | |
> | sibsp | # of siblings / spouses aboard the Titanic | |
> | parch | # of parents / children aboard the Titanic | |
> | ticket | Ticket number | |
> | fare | Passenger fare | |
> | cabin | Cabin number | |
> | embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |



# Loading csv-files into pandas dataframes

We will read our two CSV files into two pandas dataframes. This is the first step in getting to know the data.

Create two dataframe objects named *df_test* and *df_train* by calling the *read_csv* function of the pandas library. Use the PassengerId as the index.
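
As a minimal sketch of what `index_col` does, the example below reads a small made-up CSV from an in-memory string instead of a file. The content of `csv_text` and the name `df_example` are assumptions for illustration only and are not part of the Titanic files.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = """PassengerId,Survived,Pclass
1,0,3
2,1,1
3,1,3
"""

# index_col=0 turns the first column (PassengerId) into the row index
df_example = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_example.index.name)     # name of the column used as index
print(list(df_example.columns))  # the remaining data columns
```

Because PassengerId becomes the index, it no longer appears among the regular columns.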

In [8]:
#?pd.read_csv

In [9]:
dataset_filepath = "../02_Dataset/"

In [10]:
df_test = pd.read_csv(f"{dataset_filepath}test.csv",index_col=0)
df_train = pd.read_csv(f"{dataset_filepath}train.csv", index_col=0)

In [None]:
df_train

In [None]:
# Make sure it looks good
print(f'{len(df_test.columns)} columns and {len(df_test)} rows in the test dataset.')
print(f'{len(df_train.columns)} columns and {len(df_train)} rows in the train dataset.')

# Investigating the data
Investigating the data, understanding it, and thinking about what we could do with it is the first step in the feature selection process.

## Printing

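Printing is very easy in Python using the function *print*. Below we create a string named *message* (avoid naming a variable *str*, since that would shadow Python's built-in string type). Add code to print it.

In [None]:
message = "Machine learning rocks!"
print(message)

However, you don't always need to use print in a Jupyter notebook, since the value of the last expression in a cell is displayed automatically.

In [None]:
message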

Python offers f-strings which allow us to combine a regular string with expressions and variables.
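
A short sketch of what can go inside the curly braces: variables, expressions, and optional format specifiers after a colon. The values of `n_items` and `ratio` are made up for illustration.

```python
n_items = 8    # made-up example value
ratio = 0.25   # made-up example value

# {ratio:.1%} formats the number as a percentage with one decimal,
# and {n_items // 2} evaluates an expression inside the string
summary = f"{n_items} items, {ratio:.1%} selected, {n_items // 2} per half"
print(summary)
```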

In [None]:
# Inside {} we can put expressions and variables, and typing f'' or f"" results in an f-string.
f_str = f'{len(df_test.columns)} columns and {len(df_test)} rows in the test dataset.'
print(f_str)
print(f'{len(df_train.columns)} columns and {len(df_train)} rows in the train dataset.')

## Understanding the data in our dataframes
We will now try to select data from the dataframes in different ways in order to understand it better. Dataframes in pandas come with various methods, such as `map()`, which enables us to apply a function to every element in a df. Refer to the [pandas documentation](https://pandas.pydata.org/) for a full list of the available methods.

In [None]:
df_train # Maybe a bit too long

> 🤔 **Problem 1: Using the head and tail pandas methods**
>
> Use the `DataFrame.head()` and `DataFrame.tail()` methods (e.g. `df_train.head()`) to print some data without getting a massive list. Alter the expression to print 3 or 10 rows. If you want to know more about a pandas method, simply run `?pd.DataFrame.method_name`, e.g., `?pd.DataFrame.head`.

In [None]:
# Select only the first 3
df_train

In [None]:
# Select last 10
df_train

> 🤔 **Problem 2: Using the info and description pandas methods**
>
> Use the `DataFrame.info()` and `DataFrame.describe()` methods (e.g. `df_train.info()`) to further understand the data. Which columns does the `describe()` method show?

In [None]:
# Information about the dataset types
df_train

In [None]:
# Which categories do we get?
df_train.describe()

Add arguments to `.describe()` to show the non-numeric (text) columns as well.
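
The difference between the default call and a call that includes all column types can be sketched on a tiny made-up dataframe. The name `df_example` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up data with one numeric and one text column
df_example = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})

numeric_only = df_example.describe()             # numeric columns only
everything = df_example.describe(include="all")  # adds unique/top/freq rows for text columns

print(list(numeric_only.columns))
print(list(everything.columns))
print(everything.loc["top", "Sex"])  # most frequent value in the text column
```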

In [None]:
?pd.DataFrame.describe

In [None]:
# Describe all columns, including non-numeric ones
df_train.describe(include='all') 

## Selecting parts of the dataframe

It is very easy to select only parts of the dataframe in pandas.

`df_train[['Pclass', 'Survived']]` will return a new dataframe object with only two columns. We can use all the methods we have learned on this new dataframe as well. 
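
A small sketch of the difference between double and single brackets, using a made-up dataframe (`df_example` is a hypothetical name): a list of column names returns a dataframe, while a single column name returns a series.

```python
import pandas as pd

df_example = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Survived": [0, 1, 1],
    "Age": [22, 38, 26],
})

as_frame = df_example[["Pclass", "Survived"]]  # list of names -> DataFrame
as_series = df_example["Pclass"]               # single name -> Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```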

In [None]:
# Create a new dataframe with the columns 'Pclass' and 'Survived'
df_train[['Pclass', 'Survived']]

> 🤔 **Problem 3: Using the head method on selected parts of the data**
>
> Use the `head()` method to only show the first 10 entries in the new df above!

> 🤔 **Problem 4: Show some basic statistics of the selected data**
>
> Show some basic statistics of the first ten lines in the df!
>
> *Hint* use the `describe()`method

> 🤔 **Problem 4 (continued): Show statistics for the `Pclass` column only**
>
> Extend the code above to select only the `Pclass` column from the dataframe returned above, i.e., copy your code from the previous problem and add code to only display statistical data for `Pclass`.

> 🤔 **Problem 5: Print the std for Pclass**
>
> Print the std in the df above, i.e. the standard deviation for the first 10 entries in `Pclass`!
>
> *Hint:* you select rows with `loc[]`

In [None]:
?pd.DataFrame.loc

It is even possible to select columns in the dataframe as attributes and get a pandas Series in return.

In [None]:
df_train.Name

Slicing series and dataframes can be done by adding *[a:b]* after the selected df/series.
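
When the index consists of integers, as with PassengerId, it can help to be explicit about whether a slice refers to positions or to index labels. The sketch below uses a made-up series `s` and the `iloc` (position) and `loc` (label) accessors.

```python
import pandas as pd

# Made-up series whose index starts at 1, similar to PassengerId
s = pd.Series([3, 1, 3, 1, 3], index=[1, 2, 3, 4, 5], name="Pclass")

by_position = s.iloc[2:4]  # positions 2 and 3, end is exclusive
by_label = s.loc[2:4]      # labels 2, 3 and 4, end is inclusive
last_two = s.iloc[-2:]     # the last two elements

print(list(by_position.index))
print(list(by_label.index))
print(list(last_two.index))
```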

In [None]:
print("From the first element: 'df_train.Pclass[:4]'\n")
print(df_train.Pclass[:4])
print("_"*60+"\n") # Prints 60 underscores and adds a line break
print("With start and finish: 'df_train.Pclass[5:8]'\n")
print(df_train.Pclass[5:8])
print("_"*60+"\n")
print("To the last element: 'df_train.Pclass[887:]'\n")
print(df_train.Pclass[887:])

In [None]:
print("The last 3 elements 'df_train.Pclass[-3:]'\n")
print(df_train.Pclass[-3:])

## Looking for correlations

Just like in Excel, we can pivot data to find interesting cuts and correlations. Notice below how each step of the expression returns a new df, to which we can in turn apply all df methods and attributes.
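
To make the chaining visible, the sketch below performs the same kind of select, group, and sort steps one at a time on a made-up dataframe. The names `selected`, `grouped`, and `ranked` are hypothetical and only used for illustration.

```python
import pandas as pd

# Made-up data, not the Titanic dataset
df_example = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
})

selected = df_example[["Pclass", "Survived"]]                  # df with selected columns
grouped = selected.groupby("Pclass").mean()                    # mean of Survived per Pclass
ranked = grouped.sort_values(by="Survived", ascending=False)   # highest mean first

print(ranked)
```

Each intermediate result is itself a dataframe, which is why the steps can also be written as one chained expression.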

In [None]:
df_train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=True).mean().sort_values(by='Survived', ascending=False)
# Df with selected columns           df grouped by Pclass showing means          df sorted by survived

> 🤔 **Problem 6: Examine correlations**
> 
> 1. Examine the correlation between categories `Sex` and `Survived`
> 
> 2. Examine the correlation between `SibSp` and `Survived`
> 
> 3. Examine the correlation between `Parch` and `Survived`

# Visualizing the Data

Visualizing data could be its own training, so this section only introduces the basics of [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/).

## You can use plot functions directly on the data structures from pandas
Just as pandas provides a readable printed representation of its data structures, it also has its own plotting functions and methods, all relying on [matplotlib](https://matplotlib.org/). We will not cover plotting in depth here, only show how visualization is a natural part of the workflow in understanding the data.

In [None]:
# Histogram of SibSp
%matplotlib inline
plt.hist(df_train['SibSp'])


## Plotting with seaborn

**Difference between matplotlib and seaborn:** "In summary, both Seaborn and Matplotlib have their strengths and weaknesses depending on your specific needs. Seaborn is great for quickly creating visually appealing plots with minimal code, while Matplotlib offers more customization options and fine-grained control over every aspect of a plot." - [New Horizons](https://www.newhorizons.com/resources/blog/how-to-choose-between-seaborn-vs-matplotlib#:~:text=In%20summary%2C%20both%20Seaborn%20and,every%20aspect%20of%20a%20plot.)

In [None]:
g = sns.FacetGrid(df_train, col='Survived', sharey=True) # Creates one axes per value of "Survived"; since it can be 0 or 1, we get two axes
g.map(plt.hist, 'Age', bins=20) # .map() applies a plotting function to each subset; here we plot a histogram of "Age" for each label in "Survived", and bins sets the number of intervals the ages are divided into

In [None]:
g = sns.FacetGrid(df_train, col='Survived', row='Pclass', height=2.2, aspect=1.6) # One axes for each possible Survived/Pclass combination
g.map(plt.hist, 'Age', alpha=1, bins=20)
g.add_legend()

In [None]:
df_train.columns

In [None]:
g = sns.FacetGrid(df_train, row='Embarked', height=2.2, aspect=1.6)
g.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep', order = [1,2,3], hue_order = ['male','female'])
g.add_legend()

In [None]:
df_train.Fare

In [None]:
g = sns.FacetGrid(df_train, col='Survived',
                  row='Embarked', height=2.2, aspect=1.6)
g.map(sns.barplot, 'Sex', 'Fare', errorbar=None, order = ['male','female'])

> 🤔 **Problem 7: What conclusions can be drawn from our analysis?**
> 
> What conclusions can be drawn from our analysis so far? Look through everything you have done and think about which features seem to be of importance, and how they affect survival

# Wrangling data
In this section we will wrangle and manipulate the data to enable analysis. We will learn to drop and change features to enable efficient and accurate modelling

## Dropping features

The first step is to drop features we do not need or want. We will have to drop them from both dataframes (train and test). This can be done using the `drop()` method of a pandas dataframe. To make sure we do not make any mistakes, let's compare the shape before and after.

To ensure we apply the same principle to both datasets, put both dataframe objects in a list called `combine` using the `[ ]` operator.

In [619]:
combine = [df_test, df_train]

In [None]:
# Checking shapes
for df in combine:
    print(df.shape)


> 🤔 **Problem 8: Drop features**
> 
> We will drop the features `Ticket` and `Cabin`. We pass a list of columns to drop to the function. 


In [None]:
?pd.DataFrame.drop

In [None]:
for df in combine:
    df.drop(....)

When we check the shape we will see that nothing has happened to it

In [None]:
for df in combine:
    print(df.shape)

Nothing has happened to the shape since we created two new dataframes but never stored them anywhere. The `drop()` method returns a new dataframe. What we need to do in order to change the dataframe is to assign the result to a variable or use the argument *inplace=True*.

* Be careful with `inplace=True`: the list `combine` holds references to the original dataframes, so when I ran the code with `inplace=True` the columns were also dropped from the original `df_train`.
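
The three behaviours described above (result discarded, result reassigned, and `inplace=True`) can be sketched on small made-up dataframes. The column name `Notes` and the variables `df_a`, `df_b`, `frames`, and `frames_b` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

df_a = pd.DataFrame({"Notes": ["x", "y"], "Fare": [7.25, 71.28]})
frames = [df_a]  # the list stores a reference to df_a, not a copy

# 1) Result discarded: drop() returns a new dataframe, nothing is changed
frames[0].drop(columns=["Notes"])
print(frames[0].shape)

# 2) Result reassigned: the list entry now points to the new dataframe,
#    while df_a itself keeps both columns
frames[0] = frames[0].drop(columns=["Notes"])
print(frames[0].shape, df_a.shape)

# 3) inplace=True: the object in the list is modified directly,
#    so the original variable df_b changes as well
df_b = pd.DataFrame({"Notes": ["x", "y"], "Fare": [7.25, 71.28]})
frames_b = [df_b]
frames_b[0].drop(columns=["Notes"], inplace=True)
print(df_b.shape)
```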

In [624]:
combine[0] = combine[0].drop(....) 
# Here we are calling drop on the first element in the list combine, which contains the test data

In [625]:
# This code calls drop on the second element in the list combine, which contains the train data
# However, setting inplace to True also dropped the columns in the original df_train when I ran the code, so be careful with this one
# combine[1].drop(....)

In [626]:
combine[1] = combine[1].drop(....) 
# Here we are calling drop on the second element in the list combine, which contains the train data

In [None]:
for df in combine:
    print(df.shape)

## Creating new features

We are pretty sure the name feature contains some interesting data, but it is hard to process as is. Let's try to extract the titles from it. Here we are using a regular expression, a subject which is worth a session of its own. In short, the regex here selects the first word that is preceded by a space and followed by a dot.

With `expand=False` and a single capture group, the function returns a pandas Series. This is stored in our dataframe with the column header 'Title'.
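
The effect of the regular expression and of the `expand` flag can be checked on a few names written in the dataset's "Surname, Title. Given names" format. This is a minimal sketch; the variable names `names`, `titles`, and `titles_frame` are assumptions for illustration.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley",
])

# A raw string (r"...") avoids Python treating "\." as a string escape
pattern = r" ([A-Za-z]+)\."

titles = names.str.extract(pattern, expand=False)       # Series
titles_frame = names.str.extract(pattern, expand=True)  # DataFrame with one column

print(titles.tolist())
print(type(titles).__name__, type(titles_frame).__name__)
```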

In [628]:
for df in combine:
    df['Title'] = df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

In [None]:
combine[0].info()

As always, check if it worked

In [None]:
combine[1].head(15)

> 🤔 **Problem 9: Create crosstabs**
>
> Use a `crosstab` to get an understanding for the different titles. 

In [None]:
?pd.crosstab

In [None]:
# Create a crosstab from the df_test data using Title and Sex


In [None]:
# Create a crosstab from the df_train data using Title and Sex


We replace the less frequent titles with 'Rare'. For both datasets, replace the below titles with the title in the comment. 

In [634]:
# Replace with "Rare"
less_frequent_titles =['Lady', 'Countess','Capt', 'Col',
                       'Don', 'Dr', 'Major', 'Rev', 'Sir', 
                       'Jonkheer', 'Dona']

# Replace with Miss
miss_replace = ['Mlle', 'Ms']

# Replace with Mrs
mrs_replace = 'Mme'

In [None]:
?pd.DataFrame.replace

> 🤔 **Problem 10: Perform the replacements**
>
> Complete the codeblock below to perform the replacements above and get the survival mean for the different titles.

In [None]:
for df in combine:
    df['Title'] = df['Title'].replace(....)
    df['Title'] = df['Title'].replace(....)
    df['Title'] = df['Title'].replace(....)
    
combine[1][['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

## Converting feature type

Many models cannot handle categorical data given as text. We can encode the categories as integers. However, be careful that the model you are using does not treat these codes as numerical quantities.

In [637]:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

If you are not careful with which model you are using, many models would consider the title "Mrs" as being three times as much title as "Mr" which is of course nonsense. This is still categorical data.
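
One common way to avoid implying an order between categories is one-hot encoding, which this notebook does not use. The sketch below shows it as an alternative on made-up titles, using `pd.get_dummies`; the variable names are hypothetical.

```python
import pandas as pd

# Made-up titles, not taken from the dataset
titles = pd.DataFrame({"Title": ["Mr", "Mrs", "Miss", "Mr"]})

# Each category becomes its own 0/1 column, so no category is "larger" than another
one_hot = pd.get_dummies(titles, columns=["Title"], dtype=int)

print(one_hot)
```

The trade-off is that the number of columns grows with the number of categories.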

In [None]:
combine[0]

Let's drop the name

In [639]:
for df in combine:
    df.drop(['Name'], axis=1, inplace=True)

> 🤔 **Problem 11: Convert the sex feature into a numerical feature**
>
> Let's convert the text values of the sex feature into numerical codes
> 
> *Hint:* use the `map()` method.

In [640]:
gender_map = {'female':1,'male':0}
for df in combine:
    df['Sex'] = df['Sex']  # TODO: apply gender_map to this column


In [None]:
combine[0].head()

## Completing features

If we examine NaN values (missing values) for each feature using the code block below, we note that a lot of values are missing for age. Consequently, we can either:

* Drop all examples in the data with NaN values for any feature
* Make an educated guess to complete the "Age" feature in our dataset

In [None]:
for index, df in enumerate(combine):
    for feature in df.columns:
        number_of_rows_with_na = len(df[df[feature].isna()])
        if number_of_rows_with_na != 0:
            print(f'{number_of_rows_with_na} NaN rows out of {len(df)} for feature "{feature}" in {"train" if index == 1 else "test"}.')

Let's see how Age correlates with Sex and Pclass to make an educated guess on the missing Age values.

In [None]:
grid = sns.FacetGrid(combine[0], row='Pclass', col='Sex', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()

Let's make one guess for each combination of Pclass and Sex
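
A more compact way to express the same idea, filling each missing age with the median of its Sex/Pclass group, can be sketched with `groupby().transform()`. This is an alternative to the explicit loops used in this notebook, shown on made-up data with hypothetical variable names.

```python
import numpy as np
import pandas as pd

# Made-up data with two Sex/Pclass groups and one missing age in each
df_example = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# transform("median") returns, for every row, the median age of that row's group
group_median = df_example.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Only the missing values are replaced by the group median
df_example["Age"] = df_example["Age"].fillna(group_median)

print(df_example["Age"].tolist())
```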

In [None]:
guess_ages = np.zeros((2,3))
guess_ages

In [645]:
# Loop over the test and train dataframes
for df in combine:
    # Loop over all possible Sex and Pclass combination
    for i in range(0, 2): # Sex
        for j in range(0, 3): # Pclass
            # Save all ages for the given Sex/Pclass combination (drop N/A)
            guess_df = df[(df['Sex'] == i) & (df['Pclass'] == j+1)]['Age'].dropna()

            # Our guess is that the age of all samples with a missing age for a given Sex/Pclass combination is the median age
            guess_ages[i,j] = guess_df.median() 

    # Loop over all possible Sex and Pclass combinations        
    for i in range(0, 2):
        for j in range(0, 3):
            # Fill in the guessed age where Age is missing for this Sex/Pclass combination
            df.loc[ (df.Age.isnull()) & (df.Sex == i) & (df.Pclass == j+1),'Age'] = guess_ages[i,j]

    df['Age'] = df['Age'].astype(int)

Now let's check if NaN rows remain for "Age"

In [None]:
for index, df in enumerate(combine):
    for feature in df.columns:
        number_of_rows_with_na = len(df[df[feature].isna()])
        if number_of_rows_with_na != 0:
            print(f'{number_of_rows_with_na} NaN rows out of {len(df)} for feature "{feature}" in {"train" if index == 1 else "test"}.')

Since only a few rows with a missing value remain in train and test, we drop these rows.

In [647]:
combine[0] = combine[0].dropna().copy()
combine[1] = combine[1].dropna().copy()

# Bonus: Support Vector Machines in scikit-learn

Import necessary libraries

In [648]:
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.decomposition import PCA


## Ensure all features are numerical

In [None]:
combine[1].info()

In [650]:
embarked_map = {'C':0,'Q':1, 'S':2}
for df in combine:
    df['Embarked']= df['Embarked'].map(embarked_map).astype(int)

In [None]:
combine[1].info()

## Divide the df into X and y

In [652]:
X_train_unscaled = combine[1].loc[:, combine[1].columns != 'Survived']
y_train = combine[1]['Survived']
X_test_unscaled = combine[0]

## Scale the features
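
As a minimal sketch of why the scaler is fitted on the training data only and then reused for the test data, the example below uses small made-up arrays. The names `train`, `test`, and `scaler` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training data
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(train_scaled.mean(axis=0))  # approximately 0 for each feature
print(train_scaled.std(axis=0))   # approximately 1 for each feature
print(test_scaled)
```

The test data is expressed relative to the training distribution, so its scaled values do not need to have mean 0.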

In [653]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train_unscaled)
X_test = sc.transform(X_test_unscaled)

## Fit an SVM

In [None]:
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)

## Predict test data survival

Unfortunately, the dataset does not contain any information about the true survival labels in the test data, so we cannot validate how "good" our SVM is.
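
One common workaround, not used in this notebook, is to hold out part of the labelled training data as a validation set. The sketch below illustrates the idea on synthetic data from `make_classification`; the data, the split size, and all variable names are assumptions, and scaling is omitted for brevity.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labelled training set
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Keep 25% of the labelled data aside; stratify keeps the class ratio similar in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = svm.SVC(kernel="linear").fit(X_fit, y_fit)
val_accuracy = model.score(X_val, y_val)  # accuracy on data the model has not seen

print(f"Validation accuracy: {val_accuracy:.2f}")
```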

In [None]:
clf.predict(X_test)

## Examine feature importance

If we use a linear kernel, we can examine the coefficients to understand which features have the most influence on the decision boundary.
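
To see how the coefficients reflect influence, the sketch below fits a linear SVM on synthetic data where, by construction, only the first feature determines the label. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)  # the label depends on feature 0 only

model = svm.SVC(kernel="linear").fit(X, y)

# coef_ has one row per decision boundary; its absolute values indicate
# how strongly each feature affects the boundary
importance = np.abs(model.coef_[0])
print(importance)
```

The comparison between coefficients is only meaningful when the features are on a similar scale, which is one reason for scaling them first.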

In [None]:
X_train_unscaled.columns.to_list()

In [None]:
feature_importance = abs(clf.coef_[0])
plt.bar(X_train_unscaled.columns.to_list(), feature_importance)
plt.ylabel("Feature Importance")
plt.title("Feature Importance for survival prediction")
plt.show()

## Confusion matrix for training data

Ideally you want to look at the confusion matrix for the test data as well but since the dataset is lacking the true survival labels we can only examine the train data.

In [None]:
print(f'{sum(clf.predict(X_train) == y_train) / len(y_train) * 100:.2f}% of the training examples are predicted correctly using our SVM.')

However, if we examine the confusion matrix for the training data, we note that the dataset is very unbalanced.

* Only a small number of true labels are "Survival=1".

* This means that if our model were to predict "Survival=0" in all cases, the precision would be 84.38% (470/557), suggesting that this very simple SVM is not very good at predicting survival.

* As a measure to cope with this we could specify class weights, but I tried that very quickly and did not get better results.

* The adjustments we can make are more or less endless, and we could also choose a different model to predict survival.
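
The effect of imbalance on a naive baseline can be sketched with made-up labels: a "model" that always predicts the majority class reaches a high accuracy while finding none of the minority class. The label counts below are invented for illustration and are not the notebook's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 8 of class 0 and 2 of class 1
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.zeros_like(y_true)  # always predict the majority class 0

baseline_accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
recall_class_1 = recall_score(y_true, y_pred)       # share of true class 1 that was found

print(baseline_accuracy, recall_class_1)
```

Comparing a trained model against such a baseline, and looking at recall as well as accuracy, gives a fairer picture than accuracy alone.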

In [None]:
y_pred = clf.predict(X_train)
ConfusionMatrixDisplay.from_predictions(y_train, y_pred)
plt.title("Confusion Matrix for Survival Prediction")
plt.show()

## Principal Component Analysis (PCA)
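
As a rough sketch of what the explained variance ratio expresses, the example below applies PCA to synthetic data in which three columns are noisy copies of one underlying signal, so the first component should capture almost all of the variance. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))

# Three columns that are scaled copies of the same signal plus small noise
X = np.hstack([base, 2 * base, -base]) + 0.1 * rng.normal(size=(100, 3))

pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_  # share of total variance per component

print(ratios)
print(pca.components_.shape)  # one row per component, one column per feature
```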

In [None]:
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_train)

sns.scatterplot(x=X_reduced[:, 0], y=X_reduced[:, 1], hue=y_train)
plt.title("PCA Projection of Survival Data")
plt.show()

In [None]:
# Explained variance
print("Explained Variance of each component:", pca.explained_variance_, '\n')

# Explained variance ratio
print("Explained Variance Ratio of each component:", pca.explained_variance_ratio_, '\n')

# PCA Components (Eigenvectors)
print("PCA Components (Eigenvectors):")
print(pca.components_, '\n')

# Singular values
print("Singular values of each component:", pca.singular_values_, '\n')