# Part A: Logistic Regression

## Learning task 1

1. Build a classification model `(LR1)` using Logistic Regression.
2. What happens to the testing accuracy when you vary the decision probability threshold from 0.5 to 0.3, 0.4, 0.6, and 0.7?
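
The threshold question can be explored independently of any particular model: once a classifier outputs probabilities, the threshold only decides where the cut between class 0 and class 1 falls. The sketch below is a minimal illustration using made-up probabilities and labels (the arrays `probs` and `y_true` and the helper `accuracy_at_threshold` are hypothetical and not part of the `LogReg` class used in this notebook).

```python
import numpy as np

def accuracy_at_threshold(probs, y_true, threshold):
    """Label a sample 1 when its probability is at least `threshold`."""
    y_pred = (probs >= threshold).astype(int)
    return float(np.mean(y_pred == y_true))

# Hypothetical predicted probabilities and true labels, for illustration only.
probs = np.array([0.10, 0.35, 0.45, 0.55, 0.65, 0.90])
y_true = np.array([0, 0, 1, 1, 0, 1])

# Sweep the thresholds named in the task and report accuracy for each.
for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    acc = accuracy_at_threshold(probs, y_true, threshold)
    print(f"threshold={threshold:.1f} accuracy={acc:.2f}")
```

Lowering the threshold turns more samples into positive predictions and raising it does the opposite, so accuracy need not change monotonically with the threshold.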

In [1]:
import numpy as np
import pandas as pd
import sys
sys.path.append("..")
from preprocessor import Preprocessor
from Models.LogisticRegression import LogReg
import warnings 
import matplotlib.pyplot as plt
warnings.filterwarnings(action="ignore")

In [2]:
# Variables (hyperparameters) as defined in the question

thresholds = [0.5]
learning_rates = [1e-2]
descents = ["batch", "mini-batch", "stochastic"]

# Creating a parameter grid

grid = {}
for threshold in thresholds:
    for learning_rate in learning_rates:
        for descent in descents:
            # Only the keys are used later; the value is a placeholder
            grid[(threshold, learning_rate, descent)] = ()

# creating a dataframe to store the results

results = pd.DataFrame(columns=["threshold", "learning_rate", "descent", "mean_accuracy", "std_accuracy"])

In [3]:
dataset = pd.read_csv("../dataset.csv")
dataset.drop(columns = ["id"], inplace=True)

In [4]:
preprocessor = Preprocessor(dataset, "diagnosis")
splits = preprocessor.preprocess(drop_na=False, standardize=False, labels=[0, 1], n_splits=10)

In [5]:
def create_plot(train_losses, test_losses):
    plot = plt.figure()
    plot.suptitle("Losses")
    plt.plot(train_losses, label="Train")
    plt.plot(test_losses, label="Test")
    plt.legend()
    plt.show()
    

In [6]:
train_plots = []
test_plots = []
for key in grid.keys():
    threshold, learning_rate, descent = key
    accuracies: list[float] = []
    for split in splits:
        train, test = split
        X_train, y_train = train.drop(columns=["diagnosis"]).to_numpy(), train["diagnosis"].to_numpy()
        X_test, y_test = test.drop(columns=["diagnosis"]).to_numpy(), test["diagnosis"].to_numpy()
        logreg = LogReg(threshold=threshold)
        train_losses, test_losses = logreg.fit(X_train, y_train, X_test, y_test, lr = learning_rate, descent=descent, epochs=1000)
        # create_plot(train_losses, test_losses)
        # Report the final-epoch train and test losses for this fold
        print(train_losses[-1], test_losses[-1])
        # score() returns confusion-matrix counts; accuracy = correct / total
        tp, tn, fp, fn = logreg.score(X_test, y_test)
        accuracies.append((tp + tn) / (tp + tn + fp + fn))
    result_dict = {
        "threshold": threshold,
        "learning_rate": learning_rate,
        "descent": descent,
        "mean_accuracy": np.round((np.mean(accuracies)*100), 2),
        "std_accuracy": np.round((np.std(accuracies)*100), 2)
    }
    result =  pd.DataFrame(result_dict, index=[0])
    results = pd.concat([results, result], ignore_index=True)

3.5376678815859197 1.4134147325706599
19.643469380020683 21.28149158272074
9.682032924100241 6.87024981534509
3.4891018784248784 1.5305711520948453
18.01010199129945 19.01379390651757
4.199707720765796 2.6165739693114154
4.328746919940616 2.61657396931167
11.265273978214399 9.070789760279554
3.2380230371198273 1.9342809733203918
2.979085856930346 2.0932590263417894
4.5618324165070705 3.314327027794459
13.38873381380295 14.597071521605274
4.013878183976716 3.0428058949428065
3.6307641029782847 2.613719289259248
4.22012640286942 2.759046358507349
3.946883194070019 2.616573969311415
4.917543187142495 2.7910512893367407
3.041908924141293 1.8922554115252446
3.1335114137552766 1.8815240557207404
3.105842348659122 2.35274358995986
2.9836748152020633 1.9188209108283703
3.444567998134084 2.0932591754491314
4.189339454908303 2.7910122339323222
2.979085834601461 1.046629587724566
3.309269491929015 1.046629587724565
3.351471563926643 1.3969527017134646
3.213438817750062 1.3955061169660876
3.165278

In [7]:
results

| | threshold | learning_rate | descent | mean_accuracy | std_accuracy |
|---|---|---|---|---|---|
| 0 | 0.5 | 0.01 | batch | 80.05 | 20.43 |
| 1 | 0.5 | 0.01 | mini-batch | 88.74 | 10.97 |
| 2 | 0.5 | 0.01 | stochastic | 94.9 | 1.62 |
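
The three `descent` settings compared above differ in how many samples contribute to each gradient update. The following is a minimal sketch of that idea on synthetic data. It is not the notebook's `LogReg` implementation: the function `fit_logistic`, its `batch_size` and `seed` parameters, and the generated data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map raw scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, descent="batch", lr=0.1, epochs=200, batch_size=16, seed=0):
    """Fit logistic-regression weights with the chosen descent variant."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    # The variants differ only in how many samples feed each update.
    step = {"batch": n_samples, "mini-batch": batch_size, "stochastic": 1}[descent]
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle every epoch
        for start in range(0, n_samples, step):
            idx = order[start:start + step]
            error = sigmoid(X[idx] @ w + b) - y[idx]
            # Gradient of the mean log-loss over the current chunk
            w -= lr * X[idx].T @ error / len(idx)
            b -= lr * error.mean()
    return w, b

# Synthetic, linearly separable data (hypothetical, for illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

for descent in ["batch", "mini-batch", "stochastic"]:
    w, b = fit_logistic(X, y, descent=descent)
    acc = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
    print(descent, round(float(acc), 3))
```

Under a fixed epoch budget, the stochastic variant performs one update per sample while the batch variant performs one update per epoch. That difference in update count may partly explain the accuracy gap in the table, but this is a hypothesis the notebook does not verify.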


## Learning task 2

1. Perform:
    1. Feature engineering task 1: imputing missing values
    2. Feature engineering task 2: normalization/standardization

2. Build a classification model `(LR2)` using Logistic Regression.

3. What happens to the testing accuracy when you vary the decision probability threshold from 0.5 to 0.3, 0.4, 0.6, and 0.7?
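
The two feature engineering tasks can be sketched with plain pandas. The example below assumes mean imputation and z-score standardization, with all statistics computed on the training fold only so that no information leaks from the test fold. The helper `impute_and_standardize` and the toy column names are hypothetical; this is not the behavior of the notebook's `Preprocessor`, whose internals are not shown here.

```python
import numpy as np
import pandas as pd

def impute_and_standardize(train, test, target):
    """Mean-impute, then z-score the features using training statistics only."""
    features = [c for c in train.columns if c != target]
    # Column means of the training fold (NaNs are skipped by default)
    means = train[features].mean()
    train_f = train[features].fillna(means)
    test_f = test[features].fillna(means)
    # Standardize with the training mean and standard deviation
    mu = train_f.mean()
    sigma = train_f.std(ddof=0).replace(0, 1.0)  # guard against constant columns
    train_out = (train_f - mu) / sigma
    test_out = (test_f - mu) / sigma
    # Re-attach the untouched target column
    train_out[target] = train[target]
    test_out[target] = test[target]
    return train_out, test_out

# Toy data with missing values (hypothetical, for illustration only).
train = pd.DataFrame({
    "radius": [1.0, np.nan, 3.0],
    "texture": [10.0, 20.0, 30.0],
    "diagnosis": [0, 1, 1],
})
test = pd.DataFrame({
    "radius": [np.nan, 2.0],
    "texture": [20.0, 40.0],
    "diagnosis": [0, 1],
})

train_std, test_std = impute_and_standardize(train, test, "diagnosis")
print(train_std)
print(test_std)
```

Median imputation or min-max normalization would fit the same structure; only the statistics computed from the training fold would change.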