## Practice Questions

This notebook contains practice questions for the assessed coursework on the 19th of March. I will upload some model answers in a later lecture. If you can answer all of these questions, you shouldn't have too much trouble with the coursework.

Please use the `litho_log_data.csv` file, available in the `Data` folder of this repository, to complete these exercises.

### Exercise 1 (Approx. 15 mins)

You have been given a dataset containing a large number of downhole log observations, together with the name of the lithology associated with each log response. The columns are:
 - 'DEPTH_WMSF': the depth of the measurement below seafloor 
 - 'HCGR': Total gamma ray counts 
 - 'HFK': Potassium counts 
 - 'HTHO': Thorium counts 
 - 'HURA': Uranium counts 
 - 'IDPH': Deep Phasor Dual Induction–Spherically Focused Resistivity 
 - 'IMPH': Medium Phasor Dual Induction–Spherically Focused Resistivity 
 - 'SFLU': Shallow Phasor Dual Induction–Spherically Focused Resistivity 
 - 'lithology': our target value, a string representing the name of the lithology
 
Using a Markdown cell, describe the steps that you would take to clean this data and prepare it for machine learning analysis.

### Write your answer here (in this Markdown cell)

# ANSWER
1. Open and read the dataset using the `pd.read_csv()` function.
2. Drop duplicate rows using the `drop_duplicates()` method.
3. Split the data into features and target variable.
    - `X` contains the features, `y` contains the target variable.
4. Do a train/test split using the `train_test_split()` function and a reasonable train size.
5. Search for outliers and deal with them using domain knowledge.
    - Inspect only the training set, so that you don't know what your test set looks like, then apply the same changes to both `X_train` and `X_test`.
6. Impute any missing values.
    - Use `SimpleImputer()` to fill in missing values.
7. Scale the data.
    - Use `StandardScaler()` to scale the data.
8. Steps 6 and 7 can be combined in a pipeline.
9. Encode the target variable.
    - Use `LabelEncoder()` to encode the target variable.
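
The steps above can be sketched end to end on a small made-up table. The column names `gr` and `res`, the toy values, the median imputation strategy and the 75/25 split are all illustrative assumptions, not properties of the real litho_log data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy stand-in for the log data (hypothetical columns and values)
toy = pd.DataFrame({
    "gr": [10.0, 12.0, np.nan, 30.0, 31.0, 10.0, 29.0, 11.0, 32.0],
    "res": [1.0, 1.2, 1.1, 2.5, np.nan, 1.0, 2.4, 0.9, 2.6],
    "lithology": ["clay", "clay", "clay", "sand", "sand",
                  "clay", "sand", "clay", "sand"],
})

toy = toy.drop_duplicates()           # step 2: the sixth row repeats the first
X = toy.drop(columns="lithology")     # step 3: features
y = toy["lithology"]                  # step 3: target

# Step 4: split before examining the values in detail
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Steps 6 to 8: imputer and scaler in one pipeline, fitted on the training set only
prep = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
X_train_prepared = prep.fit_transform(X_train)
X_test_prepared = prep.transform(X_test)   # reuses the training-set statistics

# Step 9: encode the string labels as integers
encoder = LabelEncoder()
y_train_encoded = encoder.fit_transform(y_train)
y_test_encoded = encoder.transform(y_test)
```

Fitting the imputer and scaler on the training set and only calling `transform` on the test set keeps information about the test set out of the preprocessing.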

### Exercise 2 (25 minutes)

Load the data set and drop any duplicates you find.

Then answer the following questions:

 - What is the distribution of the lithologies in this dataset?
 - What is the average depth of the interbedded clay and mud?
 - Among the samples found at or below 400m (below seafloor), what are the characteristics of the samples with the five highest Uranium counts?

In [None]:
#Import the necessary libraries
import pandas as pd
import numpy as np

data = pd.read_csv('Data/litho_log_data.csv') # load and read

data.drop_duplicates(inplace=True) # drop duplicates

data['lithology'].value_counts() # count the number of unique values in the lithology column and see their distribution


In [None]:
data[data['lithology'] == 'Interbedded clay and mud']['DEPTH_WMSF'].mean() # average depth of interbedded clay and mud

In [None]:
deep = data[data['DEPTH_WMSF'] >= 400] # samples at or below 400 m

# now sort this data by uranium counts, highest first
deep_sorted = deep.sort_values(by='HURA', ascending=False)

# now show the characteristics of the top 5
deep_sorted.head(5)
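
As a hedged alternative, the three questions can be tried on a toy frame whose values are made up for illustration. `value_counts(normalize=True)` returns proportions rather than counts, and `nlargest` is a shorthand for sorting in descending order and taking the head.

```python
import pandas as pd

# Toy frame with made-up values (only the column names match the real data)
toy = pd.DataFrame({
    "DEPTH_WMSF": [100.0, 200.0, 400.0, 450.0, 500.0, 550.0, 600.0, 650.0],
    "HURA": [5.0, 9.0, 1.0, 4.0, 2.0, 8.0, 3.0, 7.0],
    "lithology": ["Sand", "Interbedded clay and mud", "Sand", "Sand",
                  "Interbedded clay and mud", "Sand", "Sand", "Sand"],
})

# Distribution of lithologies as fractions of all rows
shares = toy["lithology"].value_counts(normalize=True)

# Average depth of one lithology, selecting rows and column in a single .loc call
is_clay_mud = toy["lithology"] == "Interbedded clay and mud"
mean_depth = toy.loc[is_clay_mud, "DEPTH_WMSF"].mean()

# Five highest uranium counts among samples at or below 400 m
deep = toy[toy["DEPTH_WMSF"] >= 400]
top5 = deep.nlargest(5, "HURA")
```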

### Exercise 3.1 (10 minutes)

Using the steps you outlined in Exercise 1, split this dataset into a training set and a testing set (with reasonable names). 

In [None]:
data = pd.read_csv('Data/litho_log_data.csv') # load and read

data.drop_duplicates(inplace=True) # drop duplicates

# y target variable = 'lithology'
# X features = all columns except 'lithology'

X = data.drop(columns = 'lithology')
y = data['lithology']

# encode
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(y) # fit the encoder and transform the dataset of the target variable



from sklearn.model_selection import train_test_split
# Split the data into training and test sets (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Random state for reproducibility
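
If the lithology counts from Exercise 2 turn out to be imbalanced, a stratified split may be worth considering, since it keeps the class proportions similar in both sets. The sketch below uses the `stratify` parameter of `train_test_split` on synthetic labels; the 90/10 class balance is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y asks for the same class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_share = y_train.mean()   # fraction of class 1 in the training set
test_share = y_test.mean()     # fraction of class 1 in the test set
```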


### Exercise 3.2 (20 minutes)

Examine the training set. Are there any missing or unusual values in any of the columns? What are these values and in which columns can they be found? Use a Markdown cell to describe your findings.

In [None]:
# Now examine the training set using .describe()
X_train.describe()
# The summary table alone was not very useful, so a graphical approach is used as well
X_train.hist(bins=50, figsize=(20,15))
# Outliers are seen in IDPH, IMPH, and SFLU; they also show up in the output of .describe()
# Next, use a box-and-whisker plot to look at the outliers

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(data=X_train[['IDPH', 'IMPH', 'SFLU']])
# this shows the whole range of data

In [None]:
# Now use ylim to only see between 0 and 5
sns.boxplot(data=X_train[['IDPH', 'IMPH', 'SFLU']])
plt.ylim(0, 5)

The resistivity columns `IDPH`, `IMPH` and `SFLU` contain values far outside the range of the rest of the data. Ask the data provider what these values mean and whether they are intentional or not supposed to be there. If they are not supposed to be there, you should remove them.
Some outliers may not be errors but real data. Use your domain knowledge to decide what to do with them.
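
One way to back up such findings with numbers is to count how often a suspected placeholder value occurs in each column, alongside the genuinely missing entries. The sketch below uses a made-up frame; the value 1950 is the one replaced in Exercise 3.3, and everything else is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up resistivity values containing a suspicious repeated number
toy = pd.DataFrame({
    "IDPH": [1.2, 1950.0, 0.9, 1.1],
    "IMPH": [1.3, 1.0, 1950.0, 1950.0],
    "SFLU": [1.1, 1.2, 1.0, np.nan],
})

suspect = 1950
placeholder_counts = (toy == suspect).sum()   # occurrences of the suspect value per column
missing_counts = toy.isna().sum()             # genuine NaNs per column
```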

### Exercise 3.3 (10 minutes)

Replace any unusual values with `np.nan`. 

In [None]:
# now replace any values we deemed outliers using a lambda function
X_train[['IDPH', 'IMPH', 'SFLU']] = X_train[['IDPH', 'IMPH', 'SFLU']].apply(lambda x: np.where(x == 1950, np.nan, x))

# also do this for the test set
X_test[['IDPH', 'IMPH', 'SFLU']] = X_test[['IDPH', 'IMPH', 'SFLU']].apply(lambda x: np.where(x == 1950, np.nan, x))

### Exercise 4 (30 mins)

Create a pipeline with an `Imputer`, a `Scaler`, and a `DecisionTreeClassifier`. Set the `random_state` of the `DecisionTreeClassifier` to 42.

Create and run a RandomizedSearchCV on three hyperparameters of your choice using `accuracy` as the metric of choice (use `n_iter = 20`). Explain what varying each of your selected hyperparameters will do to your model.

Print out the accuracy and parameters of your best model.

In [None]:
# Imports
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Create a pipeline, using random state 42
dt_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

# Run a randomised search CV
# Hyperparameters to test; each key takes the form <pipeline step name>__<parameter name>
param_grid = {
    'classifier__max_features': [i for i in range(2,9)], # features considered at each split (8 feature columns are listed in Exercise 1)
    'classifier__max_depth': [i for i in range(4,11)],
    'classifier__min_samples_split': [i for i in range(2,21)] # an integer value must be at least 2
}

# Alternatively, sample from a distribution,
# but use reasonable values for the parameters
param_grid = {
    'classifier__max_features': randint(2,9), # integers from 2 to 8; cannot exceed the number of features
    'classifier__max_depth': [i for i in range(4,11)], # max depth indicates the maximum depth of the tree
    'classifier__min_samples_split': [i for i in range(2,21)]
}

dt_search = RandomizedSearchCV(
    dt_pipeline,
    param_distributions=param_grid,
    n_iter=20, # number of parameter combinations to sample (the exercise asks for 20)
    n_jobs=1,
    verbose = 5, # how much information is printed while the search is running
    random_state=42, # for reproducibility
    scoring='accuracy'
)

dt_search.fit(X_train, y_train)


In [None]:
# Print accuracy and parameters of best model
print(dt_search.best_score_) # mean cross-validated accuracy of the best model
# Now show the best parameters
print(dt_search.best_params_)
# OR show the whole fitted pipeline
print(dt_search.best_estimator_)
# In addition, cv_results_ can be used to see the results of the whole search
pd.DataFrame(dt_search.cv_results_)
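
The exercise also asks what varying each hyperparameter does. In brief: `max_depth` caps the number of splits between the root and any leaf, `min_samples_split` sets how many samples a node must hold before it may be split, and `max_features` limits how many features are considered at each split. The sketch below illustrates only the first of these, on synthetic data from `make_classification` (an assumption, not the litho_log data), by comparing shallow trees with an unconstrained one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with 8 features (illustrative only)
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for depth in [1, 3, None]:   # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = {
        "depth": tree.get_depth(),
        "train": tree.score(X_train, y_train),
        "test": tree.score(X_test, y_test),
    }
    print(depth, results[depth])
```

A large gap between training and test accuracy for the unconstrained tree is the usual sign of overfitting, which is what limiting `max_depth` or raising `min_samples_split` is meant to reduce.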


### Exercise 5 (10 mins)

Explain why accuracy may not be the best metric for assessing the performance of a classifier model.

Describe three other classification metrics and the scenarios in which they would be useful.

In [None]:
# Imbalanced dataset: if one class dominates, always predicting it already gives a high accuracy

Conceptual question
- Recall the metrics

# Metrics
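
A minimal sketch of the imbalance problem, using made-up labels: a "model" that always predicts the majority class scores 95% accuracy while finding none of the positives. Precision, recall and F1 from `sklearn.metrics` are shown as three alternative metrics; the label counts are assumptions chosen to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up ground truth: 95 negatives followed by 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_majority = np.zeros(100, dtype=int)
acc_majority = accuracy_score(y_true, y_majority)   # 95 of 100 correct
rec_majority = recall_score(y_true, y_majority)     # 0 of 5 positives found

# A model that finds 4 of the 5 positives but raises 6 false alarms
y_model = y_true.copy()
y_model[:6] = 1    # 6 false positives
y_model[95] = 0    # 1 missed positive

acc_model = accuracy_score(y_true, y_model)   # 93 of 100 correct
prec = precision_score(y_true, y_model)       # 4 true positives / 10 predicted positives
rec = recall_score(y_true, y_model)           # 4 true positives / 5 actual positives
f1 = f1_score(y_true, y_model)                # harmonic mean of precision and recall
```

Precision matters most when false alarms are costly, recall when missing a positive is costly, and F1 when a single number balancing the two is needed. For a multi-class target such as lithology, these scorers need an `average` argument (for example `average='macro'`).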