<a href="https://colab.research.google.com/github/michalis0/DataScience_and_MachineLearning/blob/master/Assignements/Part%204/Assignment_4_2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### DSML investigation

You are part of the Suisse Impossible Mission Force, or SIMF for short. You need to uncover a rogue agent that is trying to steal sensitive information.

Your mission, should you choose to accept it, is to find that agent before they steal any classified information. Good luck!

# Assignment part four

#### Identifying the suspects' credit score
We received information that the rogue agent has a *good* credit score.

Our spies at SIMF have managed to collect financial information relating to our suspects as well as a training dataset.

Create a Neural Network over the training dataset `df` to identify which of the suspects have a *good* `Credit_Mix`.


## Getting to know our data

* `Age`: a user's age

* `Occupation`: a user's employment field

* `Annual_Income`: a user's annual income

* `Monthly_Inh_Salary`: the calculated salary received by a given user on a monthly basis

* `Num_Bank_Accounts`: the number of bank accounts possessed by a given user

* `Num_Credit_Cards`: the number of credit cards a given user possesses

* `Interest_Rate`: The interest rate on those cards (if multiple then it's the average)

* `Num_of_Loans`: The number of loans of each user

* `Delay_from_due_date`: payment tardiness of user

* `Num_of_Delayed_Payment`: the count of delayed payments

* `Changed_Credit_Limit`: the change in the user's credit limit

* `Num_Credit_Inquiries`: the number of credit inquiries made for the user

* `Credit_Mix`: The user's credit score

* `Outsting_Debt`: Outstanding debt

* `Credit_Utilization_Ratio`: the percentage of borrowed money over borrowing allowance

* `Payment_of_Min_Amount`: does the user usually pay the minimal amount (categorical)

* `Total_EMI_per_month`: Monthly repayments to be made

* `Amount_invested_monthly`: The amount put in an investment fund by the user on a monthly basis

* `Payment_Behaviour`: the user's payment behavior (categorical)

* `Monthly_Balance`: The user's end of the month balance

* `AutoLoan`: If the user has an active loan for their vehicle

* `Credit-BuilderLoan`: If the user has a loan to increase their credit score

* `DebtConsolidationLoan`, `HomeEquityLoan`, `MortgageLoan`, `NotSpecified`, `PaydayLoan`, `PersonalLoan`, `StudentLoan`: different types of loans (categorical features)


In [None]:
# Import required packages

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.preprocessing import MinMaxScaler

%matplotlib inline

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/michalis0/DataScience_and_MachineLearning/master/Assignements/Part%204/data/train_classification.csv", index_col='Unnamed: 0').dropna()
suspects = pd.read_csv("https://raw.githubusercontent.com/michalis0/DataScience_and_MachineLearning/master/Assignements/Part%204/data/suspects.csv", index_col='Unnamed: 0').dropna()
suspects.rename(columns={"Payment_Behaviour": "le_Payment_Behaviour", "Payment_of_Min_Amount": "le_Payment_of_Min_Amount"}, inplace=True)

In [None]:
display(df.head())
print(df.shape)
display(suspects.head())
print(suspects.shape)

In [None]:
df["Credit_Mix"].unique()

# 1. Preparing the data
## 1.1 Data cleaning
Consider the dataset loaded into the DataFrame `df` which is *train_classification.csv*. We aim to preprocess this data for model training.

Begin by encoding the categorical variables:
- Apply One-Hot Encoding to `Occupation`
- Apply Label Encoding to `Payment_of_Min_Amount` and `Payment_Behaviour`

*Note: To clearly distinguish your newly encoded columns, especially for label encoding, consider renaming them with a prefix. Please use `le_Payment_of_Min_Amount` and `le_Payment_Behaviour` for the new label-encoded columns.*
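The two encodings can be sketched on a tiny made-up table. This is only an illustration: the frame `toy` and its values are hypothetical, and `pd.get_dummies` is used here as a compact alternative to scikit-learn's `OneHotEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for `df`; the values are made up for illustration only.
toy = pd.DataFrame({
    "Occupation": ["Lawyer", "Doctor", "Lawyer"],
    "Payment_of_Min_Amount": ["Yes", "No", "Yes"],
    "Age": [30, 45, 27],
})

# One-Hot Encoding: one 0/1 column per distinct occupation.
occupation_ohe = pd.get_dummies(toy["Occupation"], dtype=int)

# Label Encoding: each distinct string becomes an integer
# (classes are sorted, so "No" -> 0 and "Yes" -> 1 here).
le = LabelEncoder()
toy["le_Payment_of_Min_Amount"] = le.fit_transform(toy["Payment_of_Min_Amount"])

# Join the encoded columns and drop the original categorical columns.
toy_encoded = pd.concat(
    [toy.drop(columns=["Occupation", "Payment_of_Min_Amount"]), occupation_ohe],
    axis=1,
)
print(toy_encoded)
```

One-hot encoding avoids implying an order between occupations, while label encoding keeps a single integer column per feature.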

In [None]:
# Your code here

After encoding, integrate the newly encoded columns back into a new DataFrame named `df_encoded`, and remove the original columns `Occupation`, `Payment_of_Min_Amount`, and `Payment_Behaviour` to avoid redundancy.

In [None]:
# Your code here

Finally, display the first few rows of `df_encoded` to verify that the encodings are correctly implemented. This prepared DataFrame will be used for subsequent model training.

In [None]:
# Your code here

## 1.2 Dataset splitting and rescaling

To effectively train and validate our model, it is crucial to properly prepare and partition the data. Follow these steps to preprocess and split the DataFrame `df_encoded` into training and test subsets, ensuring that our model can generalize well to new data:

- Set `X` as all columns except `Credit_Mix` and `y` as the dependent feature `Credit_Mix`.
- Manually apply Label Encoding to `y` using the `.map` function with the following encoding: Good=2, Standard=1, Bad=0.
- Use a `random_state` of 42 to split `X` and the encoded `y` into training (80%) and test sets (20%).
- Normalize `X` using `MinMaxScaler()`.
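The steps above can be sketched on a small made-up frame. The data in `toy` is hypothetical, and fitting the scaler on the training rows only (then reusing it for the test rows) is an assumption about how the scaling is meant to be done, not something the instructions state.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for `df_encoded`; values are made up.
toy = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38, 29, 60, 44, 35, 41],
    "Num_of_Loans": [1, 3, 0, 2, 4, 1, 0, 2, 3, 1],
    "Credit_Mix": ["Good", "Bad", "Standard", "Good", "Bad",
                   "Standard", "Good", "Standard", "Bad", "Good"],
})

# Features are every column except the target.
X = toy.drop(columns=["Credit_Mix"])
# Manual label encoding with a dictionary passed to .map
y = toy["Credit_Mix"].map({"Good": 2, "Standard": 1, "Bad": 0})

# 80% training, 20% test, reproducible through random_state.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: fit the scaler on training rows, then apply it to test rows.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```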

In [None]:
# Your code here

### 1.2.1 Final touches
Convert the features to torch tensors of type `torch.float` and the labels (dependent variables) to torch tensors of type `torch.long`.
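As a minimal sketch of this conversion, with made-up arrays standing in for the scaled features and the encoded labels (the names `features` and `labels` are hypothetical):

```python
import numpy as np
import pandas as pd
import torch

features = np.array([[0.0, 0.5], [1.0, 0.25]])  # e.g. the output of a scaler
labels = pd.Series([2, 0])                       # e.g. the encoded Credit_Mix

# Features as 32-bit floats, labels as 64-bit integers (class indices).
X_tensor = torch.tensor(features, dtype=torch.float)
y_tensor = torch.tensor(labels.values, dtype=torch.long)

print(X_tensor.dtype, y_tensor.dtype)
```

`nn.CrossEntropyLoss` expects class-index targets of type `torch.long`, which is why the labels are not converted to floats.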

In [None]:
# Your code here

# 2. Model preparation

## 2.1 Define a Neural network model and instantiate it.
Define your neural network model as a class in PyTorch, extending `nn.Module`. In the `__init__` method, initialize the linear layers using `nn.Linear()` with the specified input and output sizes (one from the input to the hidden layer, one from the hidden layer to the output), and set up `nn.ReLU()` for the activation. Implement the `forward` method to describe how data passes through these layers during the network's forward computation.

Set the following parameters:
* `hidden layer` : 1
* `activation function` : ReLU
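One possible shape for such a class is sketched below. The argument name `H` and the attribute names are hypothetical, and the sizes used to instantiate it are toy values rather than the ones required by the assignment.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = nn.Linear(D_in, H)    # input -> hidden layer
        self.relu = nn.ReLU()                # activation after the hidden layer
        self.linear2 = nn.Linear(H, D_out)   # hidden layer -> output logits

    def forward(self, x):
        # Returns raw logits; no softmax, since CrossEntropyLoss applies it internally.
        return self.linear2(self.relu(self.linear1(x)))

# Toy sizes for illustration only.
model = Net(D_in=4, H=8, D_out=3)
logits = model(torch.rand(5, 4))
print(logits.shape)  # torch.Size([5, 3])
```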

In [None]:
# Your code here

Set `D_in` to the number of features in `X_train` and `D_out` to the number of distinct classes in `y_train`, then print these dimensions to verify their values.

In [None]:
# Your code here


Initialize the `Net` model with the specified input size `D_in`, 150 hidden units, and output size `D_out`.

In [None]:
# Your code here

Let's now calculate how many parameters we have in the model.
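A common way to count trainable parameters is to sum `numel()` over `model.parameters()`. The sketch below uses a deliberately tiny model (`toy_model`, a hypothetical name) so the count can be checked by hand: a linear layer with $n$ inputs and $m$ outputs holds $n \cdot m$ weights plus $m$ biases.

```python
import torch.nn as nn

# Toy model: 4 inputs -> 5 hidden units -> 3 outputs.
toy_model = nn.Sequential(nn.Linear(4, 5), nn.ReLU(), nn.Linear(5, 3))

# Sum the number of elements of every trainable tensor.
n_params = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
print(n_params)  # 43 = (4*5 + 5) + (5*3 + 3)
```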

In [None]:
# Your code here

**Q1. How many parameters does your model have?**

*Note: Enter an integer (e.g. 355)*

## 2.2 Finding the best model

Determine the optimal hyper-parameters for your model from the options listed below:

* `criterion` : [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)
* `optimizer` : [Stochastic Gradient Descent (SGD)](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
* `Epochs`: Test with **150**, **250**, **500**, and **1000** epochs.
* `Learning Rate`: Experiment with learning rates of **0.00005**, **0.001**, **1**, and **10**.

**Evaluation**: Assess your model's performance by measuring its accuracy on the test set.

*Note: Run the code `torch.manual_seed(42)` to ensure consistency across all experiments.*

In [None]:
torch.manual_seed(42)   # Set the seed for reproducibility

### 2.2.1 Automatically Tuning Hyperparameters

In this section, you will automate the process of testing different hyperparameter combinations using a loop, as demonstrated in the lab.

Begin by defining the ranges for the hyperparameters you want to explore:
- **Epochs**: Test with **150**, **250**, **500**, and **1000** epochs.
- **Learning Rate**: Experiment with learning rates of **0.00005**, **0.001**, **1**, and **10**.


In [None]:
# Your code here

Next, create a loop that iterates over the list of learning rates, with an inner loop iterating over the list of number of epochs. Inside the inner loop, initialize (define) the model, and also define the optimizer and the loss function (criterion). Then train the model as demonstrated in the lab, and after the training evaluate its performance to obtain the test accuracy for each model. Ensure that you also display the test loss, as you will need it to answer the upcoming questions.

Here’s a reminder of the criterion and optimizer you should use:
- **Criterion**: [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)
- **Optimizer**: [Stochastic Gradient Descent (SGD)](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)

**Hint**: Make sure to re-initialize the model and define the criterion and optimizer at every iteration (inside the inner loop); otherwise the model retains its previously trained weights and produces biased results.

(Optional) You can enhance the code by storing the best model along with its `best_accuracy` and `best_params`. This way, you'll have an automatic evaluation at the end to identify the best-performing model.
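The nested loop described above could look roughly like the sketch below. Everything in it is illustrative: the data is random, the learning rates and epoch counts are small placeholder values rather than the ones listed for the assignment, and each epoch is assumed to be one full-batch update, which may differ from the training loop shown in the lab.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# Synthetic stand-in data: 4 features, 3 classes (not the assignment data).
X_train, y_train = torch.rand(60, 4), torch.randint(0, 3, (60,))
X_test, y_test = torch.rand(20, 4), torch.randint(0, 3, (20,))

learning_rates = [0.001, 0.1]  # placeholder values
epoch_options = [10, 20]       # placeholder values

results = []
best_accuracy, best_params = -1.0, None

for lr in learning_rates:
    for n_epochs in epoch_options:
        # Fresh model, criterion, and optimizer for every combination.
        model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(model.parameters(), lr=lr)

        model.train()
        for _ in range(n_epochs):
            optimizer.zero_grad()                       # reset gradients
            loss = criterion(model(X_train), y_train)   # forward pass
            loss.backward()                             # backpropagation
            optimizer.step()                            # weight update

        model.eval()
        with torch.no_grad():
            test_logits = model(X_test)
            test_loss = criterion(test_logits, y_test).item()
            accuracy = (test_logits.argmax(dim=1) == y_test).float().mean().item()

        results.append({"lr": lr, "epochs": n_epochs,
                        "test_loss": test_loss, "accuracy": accuracy})
        if accuracy > best_accuracy:
            best_accuracy, best_params = accuracy, (lr, n_epochs)

for row in results:
    print(row)
print("best:", best_accuracy, best_params)
```

Collecting the results in a list makes it easy to compare test loss and accuracy across combinations afterwards.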



In [None]:
# Your code here

### 2.2.2 Questions

**Q2. What is the test accuracy when we train the model with a learning rate of 0.001 for 150 epochs? Round your answer to 2 decimal places, e.g., 0.25.**


**Q3. When using 1000 epochs, which learning rate results in the highest test accuracy?**

*Note: Select among the following answers*


**Q4. Is BCELoss a suitable alternative to CrossEntropyLoss for our dataset?**

*Hint: Consider the unique values in the Credit_Mix output variable when answering.*

# 3. Predict on the Suspects Dataset

Now it's time to use your trained model to make predictions on the suspects dataset!

Please retrain on the full dataset.

Use the following parameters for the model:
- **Hidden layer**: 1 hidden layer with 150 neurons
- **Output layer**: 3 neurons for classification
- **Optimizer**: [Stochastic Gradient Descent (SGD)](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
- **Criterion**: [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)
- **Iterations**: 1000
- **Learning rate**: 1.0

Ensure consistency by setting the manual seed with `torch.manual_seed(42)` before training.

In [None]:
# Your code here


Now, train your model on the training dataset just as you did in section 2.2.1, but without looping over all the hyperparameters. Ensure that you use the number of epochs specified above.

In [None]:
# Your code here

Before making predictions, confirm that the feature column names in the `suspects` dataset match those expected by the model, particularly ensuring the X features correspond accurately.
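One way to run this check is to compare the two sets of column names and then select the columns in the training order. The frames and column lists below are hypothetical stand-ins; whether `userID` is a regular column or the index in the real `suspects` data is an assumption here.

```python
import pandas as pd

# Hypothetical stand-ins for the training features and the suspects table.
train_features = pd.DataFrame(columns=["Age", "Num_of_Loans", "le_Payment_Behaviour"])
suspects_toy = pd.DataFrame(
    {"le_Payment_Behaviour": [1], "Age": [40], "Num_of_Loans": [2], "userID": [7]}
)

# Columns the model expects but the suspects table lacks, and vice versa.
missing = set(train_features.columns) - set(suspects_toy.columns)
extra = set(suspects_toy.columns) - set(train_features.columns)
print("missing:", missing, "extra:", extra)

# Select and reorder so the columns line up with the training features.
X_suspects = suspects_toy[list(train_features.columns)]
print(list(X_suspects.columns))
```

Column order matters because the network only sees positions, not names, once the data is converted to a tensor.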

In [None]:
# Your code here

Then scale your dataset and convert it into a torch tensor of dtype float, similar to the preprocessing done for the training set in section 1.2.
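A minimal sketch of this step, assuming the scaler fitted on the training features is reused with `transform` (rather than refitted on the new rows) so that both datasets share the same scale. The arrays `train_values` and `new_values` are made up.

```python
import numpy as np
import torch
from sklearn.preprocessing import MinMaxScaler

train_values = np.array([[20.0, 0.0], [60.0, 4.0]])  # stand-in for training features
new_values = np.array([[40.0, 1.0]])                  # stand-in for suspect features

# Fit on the training values, then only transform the new rows.
scaler = MinMaxScaler().fit(train_values)
new_tensor = torch.tensor(scaler.transform(new_values), dtype=torch.float)
print(new_tensor)
```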

In [None]:
# Your code here

Make predictions using the trained model and assign the predicted credit score to each user. Ensure that the following encoding is used for the predicted categories:
* `0` corresponds to bad credit score,
* `1` corresponds to standard credit score,
* `2` corresponds to good credit score.

Use the predictions to add a new column `credit_score` in the `suspects` dataframe that maps the predicted numerical values to the respective credit score categories.
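The mapping from model outputs to category names can be sketched as follows. The logits are made up and stand in for the output of the trained model, and the names `label_names` and `suspects_toy` are hypothetical.

```python
import pandas as pd
import torch

# Made-up logits standing in for the model output on three suspects.
logits = torch.tensor([[2.0, 0.1, -1.0],
                       [0.0, 0.5, 3.0],
                       [0.2, 1.5, 0.3]])

# The predicted class is the index of the largest logit in each row.
predicted = logits.argmax(dim=1)

label_names = {0: "Bad", 1: "Standard", 2: "Good"}
suspects_toy = pd.DataFrame({"userID": [11, 12, 13]})

# Assigning a plain list avoids index-alignment surprises with pandas.
suspects_toy["credit_score"] = [label_names[i] for i in predicted.tolist()]
print(suspects_toy)
```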

In [None]:
# Your code here



As mentioned earlier, we believe the suspect has a "good" credit score. Review the predictions made by the model, then display how many suspects were categorized under each credit score, and extract the `userID`s of those with a "good" credit score.
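Counting and filtering by category can be sketched with `value_counts` and a boolean mask. The table below is made up, `userID` is assumed to be a regular column, and `target_label` is a hypothetical variable to set to the category of interest.

```python
import pandas as pd

# Made-up predictions for four suspects.
suspects_toy = pd.DataFrame({
    "userID": [11, 12, 13, 14],
    "credit_score": ["Bad", "Good", "Standard", "Good"],
})

# Number of suspects in each predicted category.
counts = suspects_toy["credit_score"].value_counts()
print(counts)

# IDs of the suspects predicted to belong to one category.
target_label = "Good"  # set to the category of interest
matching_ids = suspects_toy.loc[
    suspects_toy["credit_score"] == target_label, "userID"
].tolist()
print(matching_ids)  # [12, 14]
```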

In [None]:
# Your code here

**Q5. Which of the following suspects have a "good" credit mix according to your model's predictions?**


## Your investigation is progressing effectively, and the list of suspects is narrowing down.

**Don't forget to answer the quiz and submit your code on Moodle before the deadline.**