## Practice Questions

This notebook contains practice questions for the assessed coursework on the 19th of March. I will upload some model answers in a later lecture. If you can answer all of these questions, then you shouldn't have too much trouble with the coursework.

Please use the ‘litho_log’ data available in the data folder of this repository to complete these exercises.

### Exercise 1 (Approx. 15 mins)

You have been given some data that contains a large number of observations of downhole logs and the name of the lithologies associated with the log response.
 - 'DEPTH_WMSF': the depth of the measurement below seafloor 
 - 'HCGR': Total gamma ray counts 
 - 'HFK': Potassium counts 
 - 'HTHO': Thorium counts 
 - 'HURA': Uranium counts 
 - 'IDPH': Deep Phasor Dual Induction–Spherically Focused Resistivity 
 - 'IMPH': Medium Phasor Dual Induction–Spherically Focused Resistivity 
 - 'SFLU': Shallow Phasor Dual Induction–Spherically Focused Resistivity 
 - 'lithology': our target value, a string representing the name of the lithology
 
Using a Markdown cell, describe the steps that you would take to clean this data and prepare it for machine learning analysis.

### Write your answer here (in this Markdown cell)

Your answer here.

### Answer

I would do the following:

 - Remove duplicate data
 - Split data into features and target variable
     - Because our target variable uses strings, it should be encoded into numbers after splitting.
 - Create a train-test split
 - Inspect the data for unusual values
 - Drop/reassign unusual values depending on what they mean
 - Create a pipeline (or function) to train an Imputer and a Scaler to remove null values and to scale the data
 
Bonus points if you indicate which steps can be combined into functions. Also note that the order in which you clean and prepare the data is important.

### Exercise 2 (25 minutes)

Load the data set and drop any duplicates you find.

Then answer the following questions:

 - What is the distribution of the lithologies in this dataset?
 - What is the average depth of the interbedded clay and mud?
 - Among the samples found at or below 400m (below seafloor), what are the characteristics of the samples with the five highest Uranium counts?

In [None]:
import pandas as pd
import numpy as np

# Load data
data = pd.read_csv("Data/litho_log_data.csv")

# Drop duplicates
data.drop_duplicates(inplace = True)

# Check there are no duplicates remaining
print(data.duplicated().sum())

In [None]:
# Part 1: Use value counts to see the distribution
print(data['lithology'].value_counts())

# Part 2: Subselect Interbedded clays and muds, then find the mean of the DEPTH column
display(data[data['lithology'] == 'Interbedded clay and mud']['DEPTH_WMSF'].mean())

# Part 3: Subselect samples at or below 400m, sort by descending HURA, return the top 5 rows
data[data['DEPTH_WMSF'] >= 400].sort_values(by = 'HURA', ascending = False).head(5)
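
Raw counts can be hard to compare between classes, so it may help to look at proportions as well. The sketch below uses a small made-up table (the `toy` frame and its lithology names are assumptions for illustration, not values from the real file) to show `value_counts(normalize=True)` and a grouped mean.

```python
import pandas as pd

# Hypothetical toy data standing in for the real litho_log file
toy = pd.DataFrame({
    "DEPTH_WMSF": [100.0, 200.0, 300.0, 400.0],
    "lithology": ["Sand", "Sand", "Mud", "Sand"],
})

# Proportion of samples in each lithology class (the values sum to 1)
proportions = toy["lithology"].value_counts(normalize=True)
print(proportions)

# Mean depth of every lithology in a single call
mean_depth = toy.groupby("lithology")["DEPTH_WMSF"].mean()
print(mean_depth)
```

The grouped mean gives the average depth of every lithology at once, which is a quick way to sanity-check the single value computed in Part 2.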

### Exercise 3.1 (10 minutes)

Using the steps you outlined in Exercise 1, split this dataset into a training set and a testing set (with reasonable names). 

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Separate the data into features (X) and the target variable (y)
X = data.drop(columns = 'lithology')
y = data['lithology']

# Use a label encoder to convert the strings in the lithology column to integers
encoder = LabelEncoder()
y = encoder.fit_transform(y)

# Remember - you will be marked based on the names chosen - please use the conventions of Python and ML
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, random_state = 42)

display(X_train, y_train)
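
Once the target has been encoded, it is easy to lose track of which integer stands for which lithology. The following is a minimal sketch, using made-up labels (the names `labels` and `toy_encoder` are hypothetical), of how a fitted `LabelEncoder` can map the integers back to the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical lithology labels, for illustration only
labels = ["Sand", "Mud", "Sand", "Clay"]

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(labels)

# classes_ holds the original strings in sorted order; the integer code
# of each class is its position in this array
print(toy_encoder.classes_)
print(codes)

# inverse_transform converts integer codes back into the original strings
recovered = toy_encoder.inverse_transform(codes)
print(recovered)
```

The same call on the notebook's `encoder` would turn model predictions back into lithology names.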

### Exercise 3.2 (20 minutes)

Examine the training set. Are there any missing or unusual values in any of the columns? What are these values and in which columns can they be found? Use a Markdown cell to describe your findings.

In [None]:
X_train.isna().sum()

In [None]:
# Create histograms to explore the data - there's clearly something unusual about IDPH, IMPH, and SFLU
X_train.hist(figsize = (12,8));

In [None]:
# Summary statistics
X_train[['IDPH', 'IMPH', 'SFLU']].describe()

We see that there are quite a few null values in the columns of the dataset. These will need to be handled prior to analysis. Since these are continuous variables, a good strategy would be to replace each null value with the mean of its column, calculated from the training set.

Looking at the histograms of the columns in the data set, we see that IDPH, IMPH, and SFLU have very unusual distributions. This warrants further investigation using summary statistics.

The summary statistics show that the maximum value of the IDPH and IMPH columns is 1950, while the maximum value of the SFLU column is 9700. However, you can see that the 75th percentile of those columns is only around 1–2. Consequently, these maximum values are likely to be placeholders for either missing data or invalid measurements. I would strongly consider discussing these values with the providers of this data in order to find out what exactly they mean.
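
One way to support this suspicion is to count how often each column hits its own maximum: a genuine extreme measurement would rarely repeat exactly, whereas a placeholder code usually appears many times. The sketch below uses a small made-up frame (the values in `toy` are assumptions, not taken from the real data).

```python
import pandas as pd

# Hypothetical values that mimic the pattern described above
toy = pd.DataFrame({
    "IDPH": [1.2, 1950.0, 0.9, 1950.0, 1.5],
    "SFLU": [1.1, 1.3, 9700.0, 1.0, 1.4],
})

# For each column, count the values that are exactly equal to that column's maximum
at_max = (toy == toy.max()).sum()
print(at_max)
```

Running the same comparison on `X_train[['IDPH', 'IMPH', 'SFLU']]` would show how many rows carry the suspicious values.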

### Exercise 3.3 (10 minutes)

Replace any unusual values with `np.nan`. 

In [None]:
# Replace the offending values using a lambda function - any other function that does the same thing will be 
# accepted as long as the procedure is explained in sufficient detail.
X_train[['IDPH', 'IMPH']] = X_train[['IDPH', 'IMPH']].apply(lambda x: np.where(x == 1950, np.nan, x))
X_train[['SFLU']] = X_train[['SFLU']].apply(lambda x: np.where(x == 9700, np.nan, x))

# REMEMBER that you need to do this for the X_test dataset too!
X_test[['IDPH', 'IMPH']] = X_test[['IDPH', 'IMPH']].apply(lambda x: np.where(x == 1950, np.nan, x))
X_test[['SFLU']] = X_test[['SFLU']].apply(lambda x: np.where(x == 9700, np.nan, x))
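
Because the same replacement has to be applied to both the training and the testing set, it could also be wrapped in a function. This is one possible sketch, not the required answer: the `replace_sentinels` helper and the `SENTINELS` mapping are hypothetical names, and the example runs on a made-up frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Placeholder codes identified in Exercise 3.2 (column name -> suspicious value)
SENTINELS = {"IDPH": 1950, "IMPH": 1950, "SFLU": 9700}

def replace_sentinels(df, sentinels=SENTINELS):
    """Return a copy of df with each column's placeholder code set to NaN."""
    cleaned = df.copy()
    for column, code in sentinels.items():
        cleaned[column] = cleaned[column].replace(code, np.nan)
    return cleaned

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "IDPH": [1.2, 1950.0],
    "IMPH": [1950.0, 0.8],
    "SFLU": [9700.0, 1.1],
})
toy_clean = replace_sentinels(toy)
print(toy_clean)
```

With this helper, the two sets could be cleaned with `X_train = replace_sentinels(X_train)` and `X_test = replace_sentinels(X_test)`. Working on a copy leaves the input frame unchanged.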

### Exercise 4 (30 mins)

Create a pipeline with an `Imputer`, a `Scaler`, and a `DecisionTreeClassifier`. Set the `random_state` of the `DecisionTreeClassifier` to 42.

Create and run a RandomizedSearchCV on three hyperparameters of your choice using `accuracy` as the metric of choice (use `n_iter = 20`). Explain what varying each of your selected hyperparameters will do to your model.

Print out the accuracy and parameters of your best model.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Create decision tree pipeline. Note that we set random_state = 42
dt_pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('model', DecisionTreeClassifier(random_state = 42))
])

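param_grid = {
    # Number of features considered at each split: fewer features adds randomness and can reduce overfitting
    'model__max_features': list(range(2, 9)),
    # Maximum depth of the tree: deeper trees capture more complex patterns but overfit more easily
    'model__max_depth': list(range(4, 11)),
    # Minimum number of samples a node needs before it can be split: larger values give simpler trees
    'model__min_samples_split': [i*10 for i in range(3, 7)]
}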

dt_search = RandomizedSearchCV(
    dt_pipe, 
    param_distributions=param_grid,
    n_iter=20,
    n_jobs=-1,
    verbose=5,
    random_state=42,
    scoring = 'accuracy'
)

dt_search.fit(X_train, y_train)

In [None]:
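# Hyperparameters of the best model found by the search
print(dt_search.best_params_)
# Mean cross-validated accuracy of the best model
print(dt_search.best_score_)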
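
Note that `best_score_` is the mean cross-validated score on the training data, so a natural follow-up is to score the refitted best model on the held-out test set. The sketch below shows the idea on synthetic data from `make_classification`. The names `toy_search`, `X_tr`, `X_te`, `y_tr` and `y_te` are assumptions for illustration, and the search is deliberately much smaller than the one in the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and target
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=42
)

toy_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions={"max_depth": [2, 3, 4, 5]},
    n_iter=3,
    scoring="accuracy",
    random_state=42,
)
toy_search.fit(X_tr, y_tr)

# Best hyperparameters and their mean cross-validated accuracy
print(toy_search.best_params_)
print(toy_search.best_score_)

# Accuracy of the refitted best model on data it has never seen
test_accuracy = toy_search.score(X_te, y_te)
print(test_accuracy)
```

In the notebook itself, the equivalent call would be `dt_search.score(X_test, y_test)`, made after the unusual values in `X_test` have been replaced.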

### Exercise 5 (10 mins)

Explain why accuracy may not be the best metric for assessing the performance of a classifier model.

Describe three other classification metrics and the scenarios in which they would be useful.

### Answer

**Note**: All of the answers here would be significantly improved by the presentation of an appropriate confusion matrix. You can see examples of such matrices in the lecture notes in Lecture 5, Notebook 2. This answer would also be improved by giving examples of scenarios where each metric would be useful.

Accuracy may not be an ideal measure when the dataset is imbalanced, as it tends to significantly overestimate the performance of a model. 

For example, let's say that you have a binary target variable that is either "Yes" or "No". Let's say that your dataset had 99 "Yes" values and 1 "No" value. If you created a "model" that simply predicted "Yes" for each sample, regardless of the values in the features, your "model" would have an accuracy of 99% - but this would be a terrible model.
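
The arithmetic in this example can be checked with a few lines of plain Python. This is just an illustrative sketch of the scenario described above; the variable names are arbitrary.

```python
# 99 "Yes" samples and 1 "No" sample, as in the example above
y_true = ["Yes"] * 99 + ["No"]

# A "model" that predicts "Yes" for every sample
y_pred = ["Yes"] * len(y_true)

# Accuracy: the fraction of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)

# Fraction of the true "No" samples that the model actually found
no_found = sum(t == "No" and p == "No" for t, p in zip(y_true, y_pred))
no_fraction_found = no_found / y_true.count("No")
print(no_fraction_found)
```

The accuracy comes out at 0.99 even though the model never identifies the single "No" sample.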

Alternative classification metrics are as follows:

#### 1. Recall

Recall is calculated using the following formula:
   
$$recall = \frac{TP}{TP + FN}$$

This metric measures your model's ability to detect occurrences of the positive class. In other words, it's very useful in scenarios where you want to ensure that you identify as many "Yes" values as possible.

#### 2. Precision

Precision is calculated using the following formula:

$$precision = \frac{TP}{TP + FP}$$

This metric measures how many of your model's positive predictions are actually correct. In other words, if your model says that a sample is "Yes", this metric tells you how confident you should be that the model has really detected a True Positive rather than a False Positive.

This is useful in scenarios where you want to be sure that a positive result really is positive. For example, if you are hunting for gold, and your model tells you that a lump of rock contains a lot of gold in it, then you want to be very certain that it contains gold before spending all your money processing the rock. 

#### 3. F1-Score

The F1-score combines precision and recall into a single score by taking their harmonic mean.

$$F_1 = 2 \times \frac{precision \times recall}{precision + recall}$$

This metric is good if you want a model that balances precision and recall, because the score is only high when both are high. The primary downside is that this value is difficult to explain in layman's terms. Other metrics may also be more suited for specific use cases.
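
To make the three formulas concrete, here is a minimal sketch that computes them from confusion-matrix counts. The `classification_scores` helper and the counts passed to it are hypothetical, and the function assumes that none of the denominators is zero.

```python
def classification_scores(tp, fp, fn):
    """Compute precision, recall and F1 from counts of TP, FP and FN.

    Assumes tp + fp > 0, tp + fn > 0 and precision + recall > 0.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives
precision, recall, f1 = classification_scores(tp=8, fp=2, fn=8)
print(precision, recall, f1)
```

With these counts the model is right most of the time when it predicts the positive class, yet it misses half of the true positives, and the F1-score falls between the two values.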