<a href="https://colab.research.google.com/github/michalis0/DataMining_and_MachineLearning/blob/master/week6/Classification_Exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Mining and Machine Learning - Week 6
# Classification - Exercises

This is an exercise based on a sample from the Titanic dataset.

In [2]:
# Import required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

%matplotlib inline

### Load Data

In [3]:
data = pd.read_csv("https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/week6/data/Sample-Data-Titanic-Survival.csv")
data.head()

Unnamed: 0,Class,Age,Sex,SurvivalStatus
0,1st,"Quantity[29., ""Years""]",female,survived
1,1st,"Quantity[0.9167, ""Years""]",male,survived
2,1st,"Quantity[2., ""Years""]",female,died
3,1st,"Quantity[30., ""Years""]",male,died
4,1st,"Quantity[25., ""Years""]",female,died


In [4]:
# Clean data
data["Age"] = data["Age"].map(lambda x: float(x.strip('Quantity[').split(",")[0].replace('Missing["Not Available"]', "-1.")))
data = data.replace(-1.0, np.nan)
data.head()

Unnamed: 0,Class,Age,Sex,SurvivalStatus
0,1st,29.0,female,survived
1,1st,0.9167,male,survived
2,1st,2.0,female,died
3,1st,30.0,male,died
4,1st,25.0,female,died
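
The cleaning cell above relies on `str.strip`, which removes a set of characters rather than a prefix. As a hedged alternative, the sketch below extracts the number with a regular expression. The helper name `parse_age` and the three sample strings are assumptions made for illustration; they are not part of the original notebook.

```python
import re

import numpy as np
import pandas as pd


def parse_age(raw):
    """Hypothetical helper: return the number inside 'Quantity[...]', or NaN."""
    match = re.search(r"Quantity\[\s*(\d+\.?\d*)", raw)
    return float(match.group(1)) if match else np.nan


# Sample strings in the same format as the raw `Age` column
toy = pd.Series([
    'Quantity[29., "Years"]',
    'Quantity[0.9167, "Years"]',
    'Missing["Not Available"]',
])

ages = toy.map(parse_age)
print(ages)
```

With this approach missing ages become `NaN` directly, so no sentinel value such as `-1.0` has to be replaced afterwards.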


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Class           1309 non-null   object 
 1   Age             1046 non-null   float64
 2   Sex             1309 non-null   object 
 3   SurvivalStatus  1309 non-null   object 
dtypes: float64(1), object(3)
memory usage: 41.0+ KB


In [6]:
data = data.dropna().reset_index(drop=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1046 entries, 0 to 1045
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Class           1046 non-null   object 
 1   Age             1046 non-null   float64
 2   Sex             1046 non-null   object 
 3   SurvivalStatus  1046 non-null   object 
dtypes: float64(1), object(3)
memory usage: 32.8+ KB


## In what follows, try to answer the questions. The results will be provided during the week. Complete the code (marked `# [YOUR CODE HERE]` or `...`) to arrive at the same results.

### 1. Create a new DataFrame where you encode the different categorical features as follows:
* Use one-hot encoding for `Class`
* Use label encoding for `Sex` and `SurvivalStatus`
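
As a reminder of how the two encodings differ, here is a minimal sketch on a toy DataFrame. The values are made up, and `pd.get_dummies` is only one possible way to one-hot encode (the `OneHotEncoder` imported above is another); treat it as an illustration rather than the expected solution.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical values) reusing two column names from the dataset
toy = pd.DataFrame({
    "Class": ["1st", "3rd", "2nd", "1st"],
    "Sex": ["female", "male", "male", "female"],
})

# One-hot encoding: one 0/1 indicator column per category of `Class`
one_hot = pd.get_dummies(toy["Class"], prefix="Class", dtype=int)

# Label encoding: categories are sorted, then numbered from 0
encoder = LabelEncoder()
le_sex = pd.Series(encoder.fit_transform(toy["Sex"]), name="le_sex")

# Put the original and encoded columns side by side
encoded = pd.concat([toy, one_hot, le_sex], axis=1)
print(encoded)
print(encoder.classes_)  # position in this array = integer code
```

`encoder.classes_` shows which integer was assigned to which category, which is useful later when building feature rows by hand.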

In [None]:
# One-hot encoding
# [YOUR CODE HERE]

In [None]:
# Label-encode `Sex` and store the result in a feature called `le_sex`
# [YOUR CODE HERE]

In [None]:
# Label-encode `SurvivalStatus` and store the result in a feature called `le_survival`
# [YOUR CODE HERE]

In [None]:
# Concatenate all your DataFrames
data = pd.concat([data, ..., ..., ...], axis=1)
data.head()

In [None]:
data.info()

### 2. Logistic Regression: part 1

#### 2.1 Use logistic regression to predict the `SurvivalStatus` based on `Age` and `Sex`. Display the confusion matrix and the other accuracy measures seen in class.
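
To recall how the accuracy measures relate to the confusion matrix, the sketch below computes them on a small set of made-up binary labels (`y_true_toy` and `y_hat_toy` are hypothetical, not taken from the Titanic data).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and predictions (1 = positive class)
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat_toy = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels 0/1, ravel() returns the cells in this order
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_hat_toy).ravel()

# The same measures computed by hand from the four cells
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(manual_precision, precision_score(y_true_toy, y_hat_toy))
print(manual_recall, recall_score(y_true_toy, y_hat_toy))
print(manual_f1, f1_score(y_true_toy, y_hat_toy))
print(manual_accuracy, accuracy_score(y_true_toy, y_hat_toy))
```

By default the scikit-learn precision, recall and F1 functions treat label `1` as the positive class, so the label encoding chosen for `SurvivalStatus` determines which outcome these scores describe.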

In [None]:
X = # [YOUR CODE HERE]
y = # [YOUR CODE HERE]

In [None]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### 2.2 What is the base rate in this case?
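
The base rate is the accuracy obtained by always predicting the most frequent class. A minimal sketch on made-up labels (the `labels` series is hypothetical) could look like this:

```python
import pandas as pd

# Hypothetical binary labels: six 0s and four 1s
labels = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# Share of each class, then keep the share of the majority class
base_rate = labels.value_counts(normalize=True).max()
print(f"Base rate: {base_rate:.2f}")
```

A classifier is only useful if its test accuracy clearly exceeds this value.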

In [None]:
# Base rate
# [YOUR CODE HERE]

In [None]:
# Logistic regression with 5-fold cross-validation
LR_cv = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=100)

In [None]:
# Fit the model on the training set
LR_cv.fit(..., ...)

In [None]:
# Train accuracy
LR_cv.score(..., ...)

In [None]:
# Test accuracy 
LR_cv.score(..., ...)

In [None]:
# Accuracy measures
y_pred = LR_cv.predict(...)

def evaluate(true, pred):
    precision = precision_score(true, pred)
    recall = recall_score(true, pred)
    f1 = f1_score(true, pred)
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")

evaluate(..., ...)

#### 2.3 What is the prediction for a man aged 50? What is the probability of each class?
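
For questions of this kind, a fitted scikit-learn classifier offers `predict` (the most likely class) and `predict_proba` (one probability per class, ordered as in `classes_`). The sketch below uses a plain `LogisticRegression` on made-up data. The toy values and the assumption that `le_sex` is 1 for male are illustrative only, so check your own encoding before building the row.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: age and label-encoded sex (assumed 1 = male)
toy_X = pd.DataFrame({
    "Age": [22, 35, 58, 41, 19, 63, 30, 47],
    "le_sex": [0, 1, 1, 0, 0, 1, 0, 1],
})
toy_y = np.array([1, 0, 0, 1, 1, 0, 1, 0])

model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(toy_X, toy_y)

# A single new observation needs the same columns, in the same order
new_passenger = pd.DataFrame({"Age": [50], "le_sex": [1]})

label = model.predict(new_passenger)        # predicted class
proba = model.predict_proba(new_passenger)  # shape (1, n_classes)

print("Classes:", model.classes_)
print("Prediction:", label)
print("Probabilities:", proba)
```

The predicted class is the one with the highest probability, and the probabilities in each row sum to 1.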

In [None]:
# Prediction
# [YOUR CODE HERE]

In [None]:
# Probabilities
# [YOUR CODE HERE]

#### 2.4 What is the prediction for a woman aged 30? What is the probability of each class?


In [None]:
# Prediction
# [YOUR CODE HERE]

In [None]:
# Probabilities
# [YOUR CODE HERE]

### 3. Logistic Regression: part 2

#### 3.1 Use logistic regression to predict the `SurvivalStatus` based on all other variables (test size = 0.2). Display the confusion matrix and the other accuracy measures seen in class.

In [None]:
X = # [YOUR CODE HERE]
y = # [YOUR CODE HERE]

In [None]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit model
LR_cv = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=100)
LR_cv.fit(..., ...)

# Accuracy measures
y_pred = LR_cv.predict(...)

def evaluate(true, pred):
    precision = precision_score(true, pred)
    recall = recall_score(true, pred)
    f1 = f1_score(true, pred)
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")

evaluate(..., ...)

#### 3.2 What is the prediction for a man aged 50 of the 2nd class? What is the probability of each class?

In [None]:
# Prediction
# [YOUR CODE HERE]

In [None]:
# Probabilities
# [YOUR CODE HERE]

#### 3.3 What is the prediction for a woman aged 30 of the 1st class? What is the probability of each class?

In [None]:
# Prediction
# [YOUR CODE HERE]

In [None]:
# Probabilities
# [YOUR CODE HERE]