# **<h3 align="center">Machine Learning 2024-25</h3>**
## **<h3 align="center">To Grant or Not to Grant: Deciding on Compensation Benefits</h3>**


**Group 38 members:**<br>
Ana Marta Azinheira  - 20240496@novaims.unl.pt - 20240496<br>
Braulio Damba - 20240007@novaims.unl.pt - 20240007<br>
Henry Tirla  - 20221016@novaims.unl.pt - 20221016<br>
Marco Galão  - r20201545@novaims.unl.pt - r20201545<br>
Rodrigo Sardinha - 20211627@novaims.unl.pt - 20211627<br>

<a id = "toc"></a>

# Table of Contents

* [1. Import the Libraries](#import_libraries)
* [2. Import the Dataset](#import_dataset)
* [3. Description of the Dataset’s Structure](#dataset_structure)
* [4. Exploring the Dataset](#exploration)
    * [4.1. Constant Features](#constant_features)
    * [4.2. Duplicates](#duplicates)
    * [4.3. Missing Values](#missing_values)
    * [4.4. Data Types](#data_types)
    * [4.5. Inconsistencies](#inconsistencies)
    * [4.6. Outliers](#outliers)
* [5. Creating New Features](#chapter4)
    * [5.1.](#sub_section_4_2_1)
    * [5.2.](#sub_section_4_2_2)
    * [5.3.](#sub_section_4_2_3)
* [6. Correlation Matrix](#section_5_2)

# 0. Identify Business needs

First of all, we need to clearly identify the business needs.

<img src="01_images/phase01.png" alt="Drawing" style="width: 500px;"/>

We already saw this in the exercise of the previous class.

# 1. Import the needed libraries

The first step is always to import the libraries that we are going to use.
- `pandas` is a library used for data manipulation and analysis.
- In the end, we are going to apply a Decision Tree Classifier, so we need to import `DecisionTreeClassifier` from `sklearn.tree`.
- Since we are going to create a predictive model, we need to split our data into at least two datasets: the train dataset (used to build the model) and the validation dataset (used to evaluate the performance of our model). For that, we need to import the function `train_test_split` from `sklearn.model_selection`.
- Finally, we want to assess the quality of our model. This time we are going to import `confusion_matrix` from `sklearn.metrics`.


__`Step 1`__ Import the following libraries/functions: 
    - pandas as pd 
    - DecisionTreeClassifier from sklearn.tree
    - train_test_split from sklearn.model_selection
    - confusion_matrix from sklearn.metrics

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# 2. Import data

<img src="01_images/phase02.png" alt="Drawing" style="width: 500px;"/>

The second step is to import our data. To do that, we can use the pandas library.

__`Step 2`__ Import the sheet `ClassifiedData` from the excel file `Exercise1.xlsx` and store it in the object `drugs_truth`

In [16]:
drugs_truth = pd.read_excel('Exercise1.xlsx', sheet_name = 'ClassifiedData', index_col = 'ID')

type(drugs_truth)

pandas.core.frame.DataFrame

__`Step 3`__ Import the sheet `Data2Classify` from the excel file `Exercise1.xlsx` and store it in the object `drugs_2classify`

In [17]:
drugs_2classify = pd.read_excel('Exercise1.xlsx', sheet_name = 'Data2Classify', index_col = 'ID')

# 3. Explore the data

It is time to explore and understand the data we have.

<img src="01_images/phase03.png" alt="Drawing" style="width: 500px;"/>

__`Step 4`__ Check the first five rows of the dataset `drugs_truth` using the method `.head()`

In [6]:
drugs_truth.head()

ID,BD1,BD2,BD3,BD4,BD5,BD6,BD7,BD8,BD9,DrugPlant
1001,789,68,16,90782,0,0,29,66,1402,0
1002,623,78,20,113023,0,0,31,6,1537,0
1003,583,24,18,28344,1,0,4,69,44,0
1004,893,59,19,93571,0,1,21,10,888,0
1006,792,32,20,22386,1,1,5,65,56,0


__`Step 5`__ Using the method `.info()`, check the data types of the variables of `drugs_truth` and if there are any missing values.

In [7]:
drugs_truth.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8000 entries, 1001 to 10999
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   BD1        8000 non-null   int64
 1   BD2        8000 non-null   int64
 2   BD3        8000 non-null   int64
 3   BD4        8000 non-null   int64
 4   BD5        8000 non-null   int64
 5   BD6        8000 non-null   int64
 6   BD7        8000 non-null   int64
 7   BD8        8000 non-null   int64
 8   BD9        8000 non-null   int64
 9   DrugPlant  8000 non-null   int64
dtypes: int64(10)
memory usage: 687.5 KB
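
Besides `.info()`, missing values can also be counted per column directly. The sketch below uses a small made-up frame (only the column names mirror the dataset, the values are invented for illustration) rather than `drugs_truth` itself:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: values are invented, only the column names mirror the dataset
toy = pd.DataFrame({
    "BD1": [789, 623, np.nan, 893],
    "BD2": [68, 78, 24, np.nan],
    "DrugPlant": [0, 0, 0, 1],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows with at least one missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())
print(rows_with_missing)
```

Applied to `drugs_truth`, the same call (`drugs_truth.isna().sum()`) should return 0 for every column, in line with the 8000 non-null counts reported by `.info()`.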


__`Step 6`__ Get the main descriptive statistics for all the variables in `drugs_truth` using the method `.describe()`

In [8]:
drugs_truth.describe()

,BD1,BD2,BD3,BD4,BD5,BD6,BD7,BD8,BD9,DrugPlant
count,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0
mean,898.6535,48.02725,16.736,69974.7135,0.4175,0.47125,14.658,62.129375,623.507375,0.067125
std,202.201258,17.236775,1.871161,27540.800759,0.493178,0.499204,11.937173,68.38289,645.552196,0.250254
min,550.0,18.0,12.0,10000.0,0.0,0.0,1.0,0.0,6.0,0.0
25%,723.0,33.0,15.0,47841.5,0.0,0.0,4.0,26.0,63.0,0.0
50%,894.0,48.0,17.0,70176.0,0.0,0.0,12.0,53.0,385.5,0.0
75%,1075.25,63.0,18.0,92076.25,1.0,1.0,24.0,79.0,1076.0,0.0
max,1250.0,78.0,20.0,139730.0,1.0,1.0,56.0,549.0,3052.0,1.0


__NOTE:__ In this dataset we don't have categorical variables. However, if we want to check the descriptive statistics for categorical data, we just need to use the method `.describe(include=['O'])`.
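
As a minimal sketch of what this note describes, the example below builds a toy frame with one object column. The column name `region` and all values are hypothetical and are not part of the exercise data:

```python
import pandas as pd

# Hypothetical toy data with one categorical (object) column
toy = pd.DataFrame({
    "region": pd.Series(["north", "south", "north", "east"], dtype=object),
    "BD2": [68, 78, 24, 59],
})

# Descriptive statistics restricted to object columns:
# count, number of unique values, most frequent value (top) and its frequency (freq)
cat_summary = toy.describe(include=["O"])
print(cat_summary)
```

Numeric columns such as `BD2` are left out of this summary, so it complements the default `.describe()` rather than replacing it.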

__`Step 7`__ What is the mean value of `BD4` when the target `DrugPlant` is equal to 0? And when it is equal to 1?

In [11]:
drugs_truth.groupby('DrugPlant')['BD4'].mean()

DrugPlant
0     67573.953370
1    103339.467412
Name: BD4, dtype: float64

__`Step 8`__ How many observations do we have where DrugPlant is equal to 0? And to 1?

In [12]:
drugs_truth['DrugPlant'].value_counts()

DrugPlant
0    7463
1     537
Name: count, dtype: int64
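
The raw counts can also be read as proportions, which makes the class imbalance easier to see. The sketch below rebuilds a target column from the counts shown above (it does not load the Excel file):

```python
import pandas as pd

# Rebuild a target column with the class counts reported above (7463 zeros, 537 ones)
target = pd.Series([0] * 7463 + [1] * 537, name="DrugPlant")

# normalize=True returns the share of each class instead of the raw count
class_share = target.value_counts(normalize=True)
print(class_share)
```

The share of 1s obtained this way is the same number as the mean of `DrugPlant` in the `.describe()` output, since the mean of a 0/1 column is the proportion of 1s.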

__`Step 9`__ How many observations do we have where `BD3` is equal to 15?

In [13]:
drugs_truth[drugs_truth['BD3'] == 15]

ID,BD1,BD2,BD3,BD4,BD5,BD6,BD7,BD8,BD9,DrugPlant
1013,657,60,15,95586,0,0,29,54,1407,1
1020,589,26,15,40944,1,0,4,67,52,0
1026,824,56,15,90390,0,0,22,52,959,0
1034,657,32,15,54327,1,0,4,21,47,0
1049,927,35,15,45434,1,1,4,14,90,0
...,...,...,...,...,...,...,...,...,...,...
10980,835,65,15,89275,0,0,23,5,1022,1
10981,771,74,15,120600,0,0,38,16,1978,0
10984,616,47,15,74348,0,1,6,97,153,0
10995,1025,39,15,58121,1,1,4,6,61,0
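
Filtering displays the matching rows, but the question asks for a count. Two possible ways to get the number itself are sketched below on a toy frame (values invented, only the column name `BD3` comes from the dataset):

```python
import pandas as pd

# Hypothetical toy data; only the column name BD3 comes from the dataset
toy = pd.DataFrame({"BD3": [15, 20, 15, 18, 15, 16]})

# A boolean mask sums to the number of True entries
n_by_mask = int((toy["BD3"] == 15).sum())

# Equivalent: number of rows in the filtered frame
n_by_filter = len(toy[toy["BD3"] == 15])

print(n_by_mask, n_by_filter)
```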


__`Step 10`__ Look for correlations between the different features with the method `.corr(method = 'spearman')`

In [None]:
# compute the Spearman correlation matrix of the features
# it is useful to assess multicollinearity (data redundancy) among the features
# the target is dropped here; keep `DrugPlant` in the frame to also see which features are more correlated with the target
drugs_truth.drop(columns = 'DrugPlant').corr(method = 'spearman')
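
To rank features by how strongly they are correlated with the target, the target column has to stay in the frame. The sketch below shows one way to do that on synthetic data; the names `f1`, `f2` and `target` are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: f1 drives the target, f2 is pure noise
f1 = rng.normal(size=500)
f2 = rng.normal(size=500)
toy = pd.DataFrame({"f1": f1, "f2": f2, "target": (f1 > 1.0).astype(int)})

# Spearman correlation of every feature with the target, sorted by absolute strength
corr = toy.corr(method="spearman")
target_corr = corr["target"].drop("target").abs().sort_values(ascending=False)
print(target_corr)
```

On the exercise data, the equivalent would be `drugs_truth.corr(method='spearman')['DrugPlant']`.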

# 4. Modify the data

<img src="01_images/phase04.png" alt="Drawing" style="width: 500px;"/>

After exploring and understanding the data, we would normally fix possible problems such as missing values or outliers, and we could create new variables with higher predictive power. <br>
For now, we are going to skip this. <br>However, to create a predictive model we need to identify the independent variables and the dependent one (the target), and we need to split our data into at least two different datasets: the train and the validation.

__`Step 11`__ Create a new dataset named `X` that will include all the independent variables.

In [12]:
X = drugs_truth.iloc[:,:-1]

In [18]:
# or
# X = drugs_truth.drop(columns = 'DrugPlant')

In [19]:
X

ID,BD1,BD2,BD3,BD4,BD5,BD6,BD7,BD8,BD9
1001,789,68,16,90782,0,0,29,66,1402
1002,623,78,20,113023,0,0,31,6,1537
1003,583,24,18,28344,1,0,4,69,44
1004,893,59,19,93571,0,1,21,10,888
1006,792,32,20,22386,1,1,5,65,56
...,...,...,...,...,...,...,...,...,...
10995,1025,39,15,58121,1,1,4,6,61
10996,967,28,17,54292,0,0,23,72,1011
10997,637,76,15,125962,0,1,33,75,1668
10998,586,69,19,99628,0,0,30,98,1469


__`Step 12`__ Create a new dataset named `y` that will include the dependent variable (the last column, `DrugPlant`)

In [20]:
y = drugs_truth.iloc[:,-1]

In [22]:
# or
# y = drugs_truth['DrugPlant']

In [23]:
y

ID
1001     0
1002     0
1003     0
1004     0
1006     0
        ..
10995    0
10996    1
10997    0
10998    0
10999    0
Name: DrugPlant, Length: 8000, dtype: int64
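
Steps 11 and 12 each offer two ways of selecting the columns: by position (`.iloc`) or by name. The sketch below checks on a toy frame (values invented) that both give the same result when the target is the last column:

```python
import pandas as pd

# Hypothetical toy data with the target as the last column
toy = pd.DataFrame(
    {"BD1": [789, 623, 583], "BD2": [68, 78, 24], "DrugPlant": [0, 0, 1]},
    index=pd.Index([1001, 1002, 1003], name="ID"),
)

# Selection by position: relies on the target being the last column
X_by_position = toy.iloc[:, :-1]
y_by_position = toy.iloc[:, -1]

# Selection by name: keeps working if the column order changes
X_by_name = toy.drop(columns="DrugPlant")
y_by_name = toy["DrugPlant"]

print(X_by_position.equals(X_by_name), y_by_position.equals(y_by_name))
```

Selecting by name is usually the safer choice, because it does not depend on where the target sits in the frame.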

__`Step 13`__ Using `train_test_split()`, split the data into train and validation, where the training dataset should contain 70% of the observations. (We are going to talk more about this in a future class.)

In [21]:
X_train, X_validation, y_train, y_validation = train_test_split(X,y,
                                                               train_size = 0.7, 
                                                               shuffle = True, 
                                                               stratify = y) # to have the same percentage of 1s in the train and validation data
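
To see what `stratify` does, the sketch below splits a synthetic imbalanced dataset and compares the share of 1s in both parts. The data and the names `X_toy`, `y_toy` are hypothetical, and `random_state` is an addition that is not used in the cell above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical imbalanced toy data: 1000 rows, 10% positives
X_toy = pd.DataFrame({"f1": rng.normal(size=1000), "f2": rng.normal(size=1000)})
y_toy = pd.Series([1] * 100 + [0] * 900, name="target")

# random_state is an addition here: it makes the split reproducible
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, train_size=0.7, shuffle=True, stratify=y_toy, random_state=42
)

# With stratify, both parts keep roughly the same 10% share of positives
print(len(X_tr), len(X_val), y_tr.mean(), y_val.mean())
```

Without a fixed `random_state`, every run of the split produces different train and validation sets, so the scores reported later can change slightly between runs.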

# 5. Modelling - Create a predictive model

It is time to create a model. In this step, we are going to use a simple algorithm called a decision tree.

<img src="01_images/phase05.png" alt="Drawing" style="width: 500px;"/>

__`Step 14`__ Create an instance of a DecisionTreeClassifier named `dt` with the default parameters and fit the instance to the training data (again, we are going to talk more about this later).

In [24]:
dt = DecisionTreeClassifier().fit(X_train, y_train)

__`Step 15`__ Using the model just created in the previous step, predict the values of the target in the train dataset using the method `.predict()`. Assign those values to the object `predictions_train`

In [25]:
predictions_train = dt.predict(X_train)

__`Step 16`__ Similarly to what you have done in the previous step, predict the target values for the validation dataset and assign those values to the object `predictions_val`

In [26]:
predictions_val = dt.predict(X_validation)

# 6. Assess

We now have both the ground truth and the predicted values, so we can start evaluating the performance of our model on the train and the validation datasets.

<img src="01_images/phase06.png" alt="Drawing" style="width: 500px;"/>

__`Step 17`__ Using the method `.score()`, check the mean accuracy of the model `dt` in the train dataset.

In [27]:
dt.score(X_train, y_train)

1.0

__`Step 18`__ Similarly to what you have done in step 17, check the mean accuracy now for the validation dataset.

In [28]:
dt.score(X_validation, y_validation)

0.9033333333333333

Are we dealing with a case of __overfitting__? <br>
Yes: the accuracy is perfect on the train dataset (1.0) but drops to about 0.90 on the validation dataset, and decision trees are known to be prone to overfitting. <br>
Luckily, there are strategies to avoid this problem. <br>
In the next classes, we are going to better understand what overfitting is and how to avoid it in the different algorithms.
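
As a preview, the sketch below illustrates one possible strategy, limiting the depth of the tree with `max_depth`, on synthetic noisy data. The data, the depth of 3 and the variable names are assumptions made for illustration only, not a recommendation for the exercise dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical noisy toy data: the label depends on the first feature,
# but 15% of the labels are flipped at random
X_toy = rng.normal(size=(2000, 5))
y_clean = (X_toy[:, 0] > 0).astype(int)
flip = rng.random(2000) < 0.15
y_toy = np.where(flip, 1 - y_clean, y_clean)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=0
)

# Unrestricted tree: keeps splitting until it memorises the training set
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Restricted tree: max_depth caps the number of successive splits
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

full_train, full_val = full_tree.score(X_tr, y_tr), full_tree.score(X_val, y_val)
shallow_train, shallow_val = shallow_tree.score(X_tr, y_tr), shallow_tree.score(X_val, y_val)
print(f"full:    train={full_train:.3f}  validation={full_val:.3f}")
print(f"shallow: train={shallow_train:.3f}  validation={shallow_val:.3f}")
```

The quantity to look at is the gap between train and validation accuracy: a large gap for the unrestricted tree is the signature of overfitting, and restricting the tree is expected to narrow it.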

It is time to check the confusion matrix of the model for the training and the validation dataset. <br> <br>
__`Step 19`__ Check the confusion matrix for the training dataset, passing as parameters the ground truth (y_train) and the predicted values (predictions_train)<br>
[[TN, FP],<br>
[FN, TP]]

In [31]:
confusion_matrix(y_train, predictions_train)

array([[5224,    0],
       [   0,  376]])
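
To make the `[[TN, FP], [FN, TP]]` layout concrete, the sketch below uses a handful of hypothetical labels chosen so that each cell has a different count, and unpacks the four cells with `.ravel()`:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen so that each cell of the matrix has a different count
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# For a binary target, ravel() flattens the matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = (int(v) for v in cm.ravel())
print(tn, fp, fn, tp)
```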

__`Step 20`__ Do the same for the validation dataset.

In [32]:
confusion_matrix(y_validation, predictions_val)

array([[2113,  126],
       [ 106,   55]])

__Can we conclude something from the results above?__ <br>It seems that our model is not very good at predicting the 1s in the target: in the validation dataset, only 55 of the 161 observations that are truly 1 are predicted as 1. <br>__Why?__ <br>Because we are dealing with an unbalanced dataset, where only about 7% of the observations are 1s (more about this in the future). 

We are also going to learn different metrics that allow us to better understand the performance of our model on unbalanced datasets. The mean accuracy is not a good metric to evaluate those cases.
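
As a first look at such metrics, the sketch below computes accuracy, precision, recall and F1 by hand from the validation confusion matrix shown above. These are the standard definitions, and the choice of these particular metrics is only an illustration:

```python
# Counts copied from the validation confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 2113, 126, 106, 55

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # share of predicted 1s that are really 1s
recall = tp / (tp + fn)     # share of real 1s that the model finds
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The accuracy is high mostly because the 0s dominate the dataset, while precision and recall for the 1s are much lower.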

# 7. Deploy

In the end, we want to classify the unclassified data. If we are satisfied with our model, we can now predict the target for the new dataset.

__`Step 21`__ Check the dataset that we want to classify, imported as `drugs_2classify`

In [33]:
# test data
drugs_2classify

ID,BD1,BD2,BD3,BD4,BD5,BD6,BD7,BD8,BD9
1005,1062,59,18,91852,0,1,25,26,1138
1014,619,35,18,57997,1,1,5,28,81
1015,1133,32,20,50289,1,0,1,231,20
1017,624,22,16,20043,0,0,2,37,16
1018,940,76,15,112765,0,0,39,72,2039
...,...,...,...,...,...,...,...,...,...
10975,794,72,19,93012,0,0,23,38,1028
10985,839,67,19,100928,0,0,25,6,1152
10986,1155,23,17,50329,0,0,7,80,188
10990,1036,73,18,104990,0,1,32,75,1607


__`Step 22`__ Using the `.predict()` method and the model `dt`, predict the target on the new dataset and assign those values to a column named `DrugPlant`

In [23]:
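# select the feature columns explicitly so the cell can be re-run after `DrugPlant` has been added
drugs_2classify['DrugPlant'] = dt.predict(drugs_2classify[X.columns])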

__`Step 23`__ Check the new dataset.

In [None]:
drugs_2classify

Now we have predicted the target for our new dataset! Next, if we wish to save the predictions, we can export them to a csv file.

In [25]:
#export test data predictions
drugs_2classify['DrugPlant'].to_csv('Exercise1_predictions.csv')
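
To check what ends up in the exported file, the sketch below writes a few hypothetical predictions to a temporary folder and reads them back. The values and the file name `predictions.csv` are invented for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical predictions indexed by ID (values invented for illustration)
predictions = pd.Series(
    [0, 1, 0], index=pd.Index([1005, 1014, 1015], name="ID"), name="DrugPlant"
)

# Write to a temporary folder so the sketch leaves no file behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "predictions.csv")
    predictions.to_csv(path)  # the index (ID) is written as the first column
    reloaded = pd.read_csv(path, index_col="ID")

print(reloaded)
```

The file contains two columns, `ID` and `DrugPlant`, so each prediction can be traced back to its observation.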