# Supplementary Materials Part 5: Logistic Regression

We will tackle the third ML model covered in IT1244: Logistic Regression. It is a new model for classification problems, but this lesson will cover a LOT more than just the model.

I will be introducing a lot more stuff here, so do take time to digest!

In [3]:
# Pre-load some stuff: NumPy, pandas and the cleaned water dataset
import numpy as np
import pandas as pd 

water = pd.read_csv("data/water_cleaned.csv", index_col ="Unnamed: 0")

# Let's split the data into X and y 
X = water.drop("Potability", axis = 1)
y = water["Potability"]


## Part 1: Initialisation of models (and a lil more)

By this part, you should have learnt how to manipulate data in Pandas (with the other supplementary materials) or with NumPy. 

As a recap, the workflow of models usually goes like this:
1. Find the model in sklearn - it usually lives in its own submodule
2. Do a train-test split on the data (80/20? 70/30? up to you)
3. Fit the model on the training data
4. Predict the results using the test data
5. Compare the predictions against the y values of the test data

How do we apply this for Logistic Regression?

In [10]:
# LogisticRegression is taken from sklearn.linear_model
# train_test_split is taken from sklearn.model_selection

from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# We will then split the data into different datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Save the model into a variable 
logreg = LogisticRegression()

# Fit the data onto training data
logreg.fit(X_train, y_train)

# Predict results using test data
predictions = logreg.predict(X_test)

# Let's get the accuracy score to see how it does! 
accuracy = accuracy_score(y_test, predictions)
accuracy

0.6276703967446592

Ok, this is like 7.9% better than what we did for KNN, but this raises the question...

## Part 1.5: Is there anything we can do better? 

We're always looking to do better, right? Would you trust someone who makes the right decision only about 63% of the time? Surely there are things that we can do to improve the accuracy of our model.

This is where I introduce to you the Standard Scaler - it is a scaler that scales your numerical features using standardization (as covered in lectures). 

!!! Important: you __need__ to scale your data after you have split your dataset, fitting the scaler on the training set only and then applying it to the test set. This prevents data leakage and makes sure that your test score reflects how the model handles unseen data!
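
To make the leakage point concrete, here is a minimal sketch on a tiny made-up array (the numbers and the `*_toy` names are hypothetical, not from the water dataset). It shows that `StandardScaler` learns the mean and standard deviation from the training rows only, and then reuses those same statistics on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, 4 training rows and 2 test rows
X_train_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test_toy = np.array([[5.0], [6.0]])

toy_scaler = StandardScaler()
toy_scaler.fit(X_train_toy)  # learns mean and std from the training rows only

# Standardization by hand: z = (x - mean) / std, using the TRAINING statistics
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual_test = (X_test_toy - train_mean) / train_std

# Both lines print the same scaled test values
print(toy_scaler.transform(X_test_toy).ravel())
print(manual_test.ravel())
```

The scaled test values are not centred around 0, and that is expected: the test rows are treated as unseen data, so they never get a say in what the mean and standard deviation are.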



In [12]:
# Let's see where we're supposed to fit our StandardScaler

# First, let's import StandardScaler
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# We fit our scaler onto the data here as data preprocessing. 
# Fit and transform the scaler on the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the testing data
X_test_scaled = scaler.transform(X_test)

# Initialize logreg 
logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)

# Predict results using test data
predictions = logreg.predict(X_test_scaled)

# Let's get the accuracy score to see how it does! 
accuracy = accuracy_score(y_test, predictions)
accuracy


0.6286876907426246

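Uh, okay. There's an increase of about 0.1 percentage points. Honestly, it's not much, but it is still good practice to scale your features: logistic regression's regularization and its solver both behave better when the features are on a similar scale.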

The elephant in the room, however, is that the model is still performing terribly, which makes you wonder...

## Part 2: Wait, does this even matter?

In lesson 3, we mentioned that accuracy is the simplest metric to implement, but it is not the best metric to use. You might be wondering why!

### A hypothetical situation...
Let's imagine we have a dataset of 100 datapoints to train a model to predict whether or not a patient has cancer. Out of those 100 datapoints, 99 are labelled "No cancer" while 1 is labelled "cancer".

What if I told you that the model has a 99% accuracy - would you think that it's good? But what if I told you that it consistently predicts "no cancer" well but it cannot predict "cancer" well? 

This is why accuracy isn't the best metric to use - we have to consider a LOT of things here! It is a lot better to look at other metrics as well. This is where a classification report comes in!
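
The hypothetical situation above can be reproduced in a few lines. This is only a sketch: the labels are made up to match the 99-to-1 split, and the "model" is simply a list that always says "no cancer".

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 99 "no cancer" (0) and 1 "cancer" (1)
y_true_toy = [0] * 99 + [1]
# A "model" that always predicts "no cancer"
y_pred_toy = [0] * 100

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
toy_recall = recall_score(y_true_toy, y_pred_toy)  # recall for the "cancer" class

print(toy_accuracy)  # 0.99
print(toy_recall)    # 0.0
```

The accuracy looks great, yet the one patient who actually has cancer is missed, which is exactly what the recall of 0 tells us.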

### Classification report
Let's look at what a classification report does!

In [15]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.63      1.00      0.77       617
           1       1.00      0.00      0.01       366

    accuracy                           0.63       983
   macro avg       0.81      0.50      0.39       983
weighted avg       0.77      0.63      0.49       983



Ok, there are a whole ton of metrics here:

1. Precision
2. Recall
3. F1-score
4. Support

Let's talk about the definition of these metrics:

1. Precision - the number of true positives / all predicted positives (true positives + false positives). This measures how often your model is right when it predicts the positive class.
2. Recall - the number of true positives / all actual positives (true positives + false negatives). This measures how many of the actual positive cases your model manages to find.
3. F1-score - the harmonic mean of precision and recall, defined as 2 * Precision * Recall / (Precision + Recall). If you want a balance between precision and recall, this is the metric you're using.
4. Support - the number of actual occurrences of the class in the testing dataset.
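
If you'd like to see these definitions at work, here is a small sketch that computes them by hand from a confusion matrix. The ten labels below are made up for illustration and have nothing to do with the water dataset.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for 10 samples (1 = positive class)
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # of everything predicted positive, how much was right?
recall = tp / (tp + fn)     # of everything actually positive, how much did we find?
f1 = 2 * precision * recall / (precision + recall)
support = tp + fn           # number of actual positives

print(precision, recall, f1, support)

# The sklearn helpers give the same numbers for the positive class
print(precision_score(y_true_toy, y_pred_toy),
      recall_score(y_true_toy, y_pred_toy),
      f1_score(y_true_toy, y_pred_toy))
```

Here there are 2 true positives, 1 false positive and 2 false negatives, so precision is 2/3, recall is 1/2 and the F1-score is 4/7.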

### What should I be focusing on then?
It really depends on your use case. 

If you want your model to produce fewer false positives (cases predicted as positive that are really negative), you're looking to increase precision. 
For example, if you're working on a fraud detection model and your emphasis is on maintaining a seamless experience for legitimate customers while still catching the most obvious fraudulent cases, you are more tolerant of false negatives and you'd like to cut down false positives as much as possible. 


If you want your model to produce fewer false negatives (cases predicted as negative that are really positive), you're looking to increase recall. For example, if you're training a model that predicts which customers will stop subscribing to a subscription-based service, it's better to have a false positive (flagging a customer as an "unsubscriber" even though they would have stayed) than to miss someone who is about to leave. You'd like to cut down false negatives as much as possible. 
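
One common way to lean towards precision or recall, which this lesson does not otherwise cover, is to change the probability threshold used to call something "positive" (`predict` uses 0.5 for binary logistic regression). The sketch below is only an illustration under assumptions: it uses synthetic data from `make_classification` rather than the water dataset, and the thresholds 0.3, 0.5 and 0.7 are arbitrary picks.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the water dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)

# Probability of the positive class (column 1) for each test row
proba_positive = model.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict positive only when the probability reaches the threshold
    preds = (proba_positive >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, preds, zero_division=0),
        recall_score(y_te, preds, zero_division=0),
    )
    print(threshold, results[threshold])
```

Lowering the threshold can only add predicted positives, so recall never goes down when you do that. Precision usually moves the other way, although that is a tendency rather than a guarantee.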

## Part 3: Hyperparameter Tuning

For most models, there will be hyperparameters that you can tune. This is a lot more applicable to the optional lesson that you can access, as there are a lot more parameters there! 

Let me introduce you to two that you will often try:
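1. Regularization strength (C) - in scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, not a learning rate. A smaller `C` means stronger regularization, which keeps the coefficients small and can lead to underfitting; a larger `C` means weaker regularization, which lets the model fit the training data more closely and can lead to overfitting.
2. Max iterations (max_iter) - this is the maximum number of iterations the solver is allowed to run while fitting. Increase it if the solver stops before converging (scikit-learn raises a ConvergenceWarning when that happens); it is not a tool for controlling overfitting.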
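
To get a feel for what `C` does in scikit-learn's `LogisticRegression` (where it is the inverse of the regularization strength), here is a rough sketch on synthetic data. The dataset, the three `C` values and `max_iter=1000` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the water dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

coef_norms = {}
for C in (0.001, 1, 100):
    # A generous max_iter so the solver is not cut off before it converges
    model = LogisticRegression(C=C, max_iter=1000).fit(X_toy, y_toy)
    # Overall size of the learnt coefficients
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    # n_iter_ reports how many iterations the solver actually used
    print(C, coef_norms[C], model.n_iter_)
```

With a tiny `C` the coefficients are squeezed towards zero (strong regularization), while a large `C` lets them grow. `n_iter_` is a handy check that the solver finished well within `max_iter`.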

In [17]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# We're not tuning max_iter for this one, but you can still add it to the grid! 
param_grid = {
    # "solver": ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
}

logreg_cv = GridSearchCV(logreg, param_grid, n_jobs = 1)
logreg_cv.fit(X_train, y_train)
best_params = logreg_cv.best_params_
best_score = logreg_cv.best_score_

# EXERCISE: Repeat the same for RandomizedSearchCV

print(f"Best parameters: {best_params}, best score: {best_score}")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Best parameters: {'C': 0.001}, best score: 0.6022680785074825
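
The convergence warning in the output above appears because the search was fit on the unscaled `X_train`. One possible way to handle this, sketched below, is to put the scaler and the model into a `Pipeline` so that scaling is re-fit on the training folds inside every cross-validation split. This is an illustration under assumptions, not the lesson's own solution: it uses synthetic data, and the step names `scaler` and `logreg`, `max_iter=1000` and `cv=5` are my own picks.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the water dataset)
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
pipe_grid = {"logreg__C": [0.001, 0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, pipe_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)        # mean cross-validated accuracy of the best C
print(search.score(X_te, y_te))  # accuracy on the held-out test set
```

The same pattern should carry over to the water data by swapping in `X_train` and `y_train`, though the scores you get there will of course differ from this toy example.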


Even though we have tuned hyperparameters, the model still doesn't seem to be doing well... 

It could be that the data wasn't great in the first place, or that we might need a more flexible model (at the cost of some interpretability). Read Part 5.5 for more details!

## Part 4: Further reading

Here is some further reading on things I have covered in this lesson:

1. [Here's something on data leakage.](https://www.kaggle.com/code/alexisbcook/data-leakage)
2. [More into the GridSearchCV vs RandomizedSearchCV difference!](https://datascience.stackexchange.com/questions/63129/gridsearchcv-vs-randomsearchcv-and-how-it-works)
3. [Looking to understand why we need to trade accuracy for interpretability/flexibility?](https://www.baeldung.com/cs/ml-flexible-and-inflexible-models)

For the next part, we will talk about additional models that you can use - but I will only give a superficial explanation of what the models are. Don't use these models until you have a better understanding of them. 