### `Homework3: Classification`

### Dataset

In this homework, we will use the Bank Marketing dataset. Download it from [here](https://archive.ics.uci.edu/static/public/222/bank+marketing.zip).

Or you can do it with `wget`:

```
wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
```

We need the `bank/bank-full.csv` file from the downloaded zip file.
The target for the classification task is the `y` variable: whether the client has subscribed to a term deposit.

### Features

For the rest of the homework, you'll need to use only these columns:

* `age`,
* `job`,
* `marital`,
* `education`,
* `balance`,
* `housing`,
* `contact`,
* `day`,
* `month`,
* `duration`,
* `campaign`,
* `pdays`,
* `previous`,
* `poutcome`,
* `y`

### Data preparation

* Select only the features from above.
* Check whether the features contain missing values.
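
The two preparation steps above can be sketched on a small stand-in table. This is only an illustration: the rows are made up, and `toy`, `columns_to_keep`, and `subset` are hypothetical names, not part of the homework.

```python
import pandas as pd

# Toy stand-in for the bank data; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "unknown", None],
    "default": ["no", "no", "no"],  # a column outside the selected feature list
    "y": ["no", "no", "yes"],
})

# Keep only the columns of interest (a shortened, hypothetical list).
columns_to_keep = ["age", "job", "y"]
subset = toy[columns_to_keep].copy()

# Count real missing values (NaN/None) per column.
missing_per_column = subset.isnull().sum()
print(missing_per_column)
```

Note that `isnull()` only counts real missing values; a placeholder string such as `unknown` is treated as an ordinary category.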

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### **Getting the data**

In [43]:
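data = 'https://archive.ics.uci.edu/static/public/222/bank+marketing.zip'
# Download the archive under its own name (saving it as bank-full.csv would leave a zip file with a .csv name)
!wget $data -O ./bank+marketing.zip
# Unpack the outer archive, then the nested bank.zip that holds bank-full.csv
!unzip -o ./bank+marketing.zip -d .
!unzip -o ./bank.zip -d .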

--2024-10-15 15:35:07--  https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: './bank+marketing.zip'


### Question 1

What is the most frequent observation (mode) for the column `education`?

- `unknown`
- `primary`
- `secondary`
- `tertiary`

In [2]:
df = pd.read_csv("bank-full.csv", sep=";")
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [3]:
df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'y'],
      dtype='object')

In [4]:
# Check for missing values
missing_values = df.isnull().sum()

In [5]:
# Calculate the mode for the 'education' column
education_mode = df['education'].mode()[0]

In [6]:
# Display the most frequent observation (mode)
print("The most frequent observation (mode) for the column 'education' is:", education_mode)

The most frequent observation (mode) for the column 'education' is: secondary


### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `age` and `balance`
- `day` and `campaign`
- `day` and `pdays`
- `pdays` and `previous`


### Target encoding

* Now we want to encode the `y` variable.
* Let's replace the values `yes`/`no` with `1`/`0`.
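
A minimal sketch of this encoding on a made-up column (the `toy` frame is hypothetical): comparing against `'yes'` and casting to `int` gives `1` for `yes` and `0` for everything else.

```python
import pandas as pd

# Made-up target values, for illustration only.
toy = pd.DataFrame({"y": ["no", "yes", "no", "yes"]})

# True/False from the comparison becomes 1/0 after the cast.
toy["y"] = (toy["y"] == "yes").astype(int)
print(toy["y"].tolist())
```

An explicit mapping such as `toy["y"].map({"yes": 1, "no": 0})` is an alternative; it yields `NaN` for unexpected labels instead of silently turning them into `0`.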

### Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.
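
One way to get a 60%/20%/20% split with the target kept apart from the features is sketched below on a made-up frame of 100 rows. The names `toy`, `df_full_train`, `X_train`, and so on are assumptions for illustration. The second call uses `test_size=0.25` because 25% of the remaining 80% equals 20% of the original data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with one feature and a binary target.
toy = pd.DataFrame({"age": range(100), "y": [i % 2 for i in range(100)]})

# First hold out 20% for the test set.
df_full_train, df_test = train_test_split(toy, test_size=0.2, random_state=42)
# Then split the remaining 80% into 60% train and 20% validation.
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

# Separate the target from the features in every part.
y_train = df_train["y"].to_numpy()
y_val = df_val["y"].to_numpy()
y_test = df_test["y"].to_numpy()

X_train = df_train.drop(columns="y")
X_val = df_val.drop(columns="y")
X_test = df_test.drop(columns="y")

print(len(X_train), len(X_val), len(X_test))
```

Splitting 40% off first and halving it also gives a 60/20/20 split, but the rows assigned to each part differ from this two-step variant even with the same seed.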




In [7]:
# Select the specified features
features = [
    'age', 'job', 'marital', 'education', 'balance', 'housing', 
    'contact', 'day', 'month', 'duration', 'campaign', 
    'pdays', 'previous', 'poutcome', 'y'
]
data_selected = df[features].copy()  # copy so the encoding below does not write into a slice of df

In [8]:
# Encode the target variable 'y': 'yes' -> 1, 'no' -> 0
data_selected['y'] = data_selected['y'].map({'yes': 1, 'no': 0})


In [9]:
# Select only numeric features for correlation
numeric_features = data_selected.select_dtypes(include=[np.number])

In [10]:
# Create a correlation matrix for numerical features
correlation_matrix = numeric_features.corr()

In [11]:
# Display the correlation matrix
print("Correlation Matrix:")
print(correlation_matrix)

Correlation Matrix:
               age   balance       day  duration  campaign     pdays  \
age       1.000000  0.097783 -0.009120 -0.004648  0.004760 -0.023758   
balance   0.097783  1.000000  0.004503  0.021560 -0.014578  0.003435   
day      -0.009120  0.004503  1.000000 -0.030206  0.162490 -0.093044   
duration -0.004648  0.021560 -0.030206  1.000000 -0.084570 -0.001565   
campaign  0.004760 -0.014578  0.162490 -0.084570  1.000000 -0.088628   
pdays    -0.023758  0.003435 -0.093044 -0.001565 -0.088628  1.000000   
previous  0.001288  0.016674 -0.051710  0.001203 -0.032855  0.454820   
y         0.025155  0.052838 -0.028348  0.394521 -0.073172  0.103621   

          previous         y  
age       0.001288  0.025155  
balance   0.016674  0.052838  
day      -0.051710 -0.028348  
duration  0.001203  0.394521  
campaign -0.032855 -0.073172  
pdays     0.454820  0.103621  
previous  1.000000  0.093236  
y         0.093236  1.000000  


In [12]:
# Find the two features with the biggest correlation (excluding the target variable 'y')
feature_corr = correlation_matrix.drop(index='y', columns='y')
# Keep only the upper triangle so each pair appears once and the diagonal is ignored
correlation_upper = feature_corr.where(np.triu(np.ones(feature_corr.shape), k=1).astype(bool))
max_corr_value = correlation_upper.max().max()
max_corr_features = correlation_upper.stack().idxmax()

print("\nThe two features with the biggest correlation are:", max_corr_features)


The two features with the biggest correlation are: ('pdays', 'previous')
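
If "biggest correlation" is read as strongest in absolute value, a strong negative correlation should count as well. The sketch below illustrates that reading on made-up columns `a`, `b`, `c` (hypothetical data, not the bank dataset), where the plain maximum and the absolute maximum pick different pairs.

```python
import numpy as np
import pandas as pd

# Made-up data: b rises with a, c falls exactly as a rises.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [10, 8, 6, 4, 2],
})

corr = toy.corr()

# Upper triangle without the diagonal: every pair once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

largest_signed = pairs.idxmax()          # largest positive correlation
strongest_abs = pairs.abs().idxmax()     # strongest correlation regardless of sign

print("Largest signed:", largest_signed)
print("Strongest absolute:", strongest_abs)
```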


In [13]:
from sklearn.model_selection import train_test_split

In [14]:
# Split the data into train, validation, and test sets (60%/20%/20%)
train, temp = train_test_split(data_selected, test_size=0.4, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)

In [15]:
# Display the sizes of the datasets
print("\nTrain set size:", train.shape)
print("Validation set size:", val.shape)
print("Test set size:", test.shape)


Train set size: (27126, 15)
Validation set size: (9042, 15)
Test set size: (9043, 15)


In [16]:
# Check whether the target column 'y' is still present in each split (it is, so it must be left out when building the feature matrices)
print("\nTrain set target (y) in features:", 'y' in train.columns)
print("Validation set target (y) in features:", 'y' in val.columns)
print("Test set target (y) in features:", 'y' in test.columns)


Train set target (y) in features: True
Validation set target (y) in features: True
Test set target (y) in features: True


### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?
  
- `contact`
- `education`
- `housing`
- `poutcome`

In [17]:
from sklearn.feature_selection import mutual_info_classif

In [18]:
# Categorical features used for the mutual information scores ('month' is also categorical but is not included in this list)
categorical_features = ['job', 'marital', 'education', 'contact', 'housing', 'poutcome']

In [19]:
# Calculate the mutual information score between 'y' and other categorical variables
X_train = train[categorical_features]
y_train = train['y']

In [20]:
# Convert categorical variables to numeric using one-hot encoding
X_train_encoded = pd.get_dummies(X_train, drop_first=True)

In [21]:
# Calculate mutual information scores
mutual_info_scores = mutual_info_classif(X_train_encoded, y_train, discrete_features=True)

In [22]:
# Create a DataFrame for better visualization of scores
mutual_info_df = pd.DataFrame({'Feature': X_train_encoded.columns, 'Mutual Information Score': mutual_info_scores})

In [23]:
# Round the scores to 2 decimals
mutual_info_df['Mutual Information Score'] = mutual_info_df['Mutual Information Score'].round(2)

In [24]:
# Display the mutual information scores
print("Mutual Information Scores:")
print(mutual_info_df)

Mutual Information Scores:
                Feature  Mutual Information Score
0       job_blue-collar                      0.00
1      job_entrepreneur                      0.00
2         job_housemaid                      0.00
3        job_management                      0.00
4           job_retired                      0.00
5     job_self-employed                      0.00
6          job_services                      0.00
7           job_student                      0.00
8        job_technician                      0.00
9        job_unemployed                      0.00
10          job_unknown                      0.00
11      marital_married                      0.00
12       marital_single                      0.00
13  education_secondary                      0.00
14   education_tertiary                      0.00
15    education_unknown                      0.00
16    contact_telephone                      0.00
17      contact_unknown                      0.01
18          housing_yes

In [25]:
# Find the variable with the biggest mutual information score
max_score_feature = mutual_info_df.loc[mutual_info_df['Mutual Information Score'].idxmax()]
max_score_feature 

Feature                     poutcome_success
Mutual Information Score                0.03
Name: 20, dtype: object
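
The question asks for one score per categorical variable rather than per one-hot column. A minimal sketch of that variant uses `mutual_info_score` from `sklearn.metrics`, which accepts label columns directly. The data below is made up, and `toy`, `categorical`, and `scores` are hypothetical names.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Made-up data: poutcome determines y exactly, housing only weakly relates to it.
toy = pd.DataFrame({
    "poutcome": ["success", "failure", "success", "failure", "success", "failure"],
    "housing": ["yes", "yes", "no", "no", "yes", "no"],
    "y": [1, 0, 1, 0, 1, 0],
})

categorical = ["poutcome", "housing"]

# One score per original variable, rounded to 2 decimals.
scores = {
    col: round(float(mutual_info_score(toy[col], toy["y"])), 2)
    for col in categorical
}
best_variable = max(scores, key=scores.get)

print(scores)
print("Highest mutual information:", best_variable)
```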

### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.6
- 0.7
- 0.8
- 0.9


In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [27]:
# Create one-hot encoded features for the training dataset
X_train_encoded = pd.get_dummies(train[categorical_features], drop_first=True)

In [28]:
# Fit the logistic regression model
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train_encoded, train['y'])

In [29]:
# Create one-hot encoded features for the validation dataset
X_val_encoded = pd.get_dummies(val[categorical_features], drop_first=True)

In [30]:
# Ensure the validation set has the same features as the training set
X_val_encoded = X_val_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

In [31]:
# Predict on the validation dataset
y_val_pred = model.predict(X_val_encoded)

In [32]:
# Calculate the accuracy
accuracy = accuracy_score(val['y'], y_val_pred)

In [33]:
# Round the accuracy to 2 decimal digits
accuracy_rounded = round(accuracy, 2)
accuracy_rounded

0.9
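
The question asks for a model on all features, with the categorical ones one-hot encoded next to the numerical ones. One possible way to build such a matrix is `DictVectorizer`, which encodes string values and passes numbers through unchanged. The sketch below runs on made-up rows; `train_toy`, `val_toy`, and `clf` are hypothetical names.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up rows with one numerical and one categorical feature.
train_toy = pd.DataFrame({
    "duration": [50, 60, 400, 500, 70, 450],
    "contact": ["cellular", "unknown", "cellular", "cellular", "unknown", "cellular"],
})
y_train_toy = [0, 0, 1, 1, 0, 1]

val_toy = pd.DataFrame({"duration": [55, 480], "contact": ["unknown", "cellular"]})
y_val_toy = [0, 1]

# Fit the encoder on the training rows only, then reuse it for validation,
# so both matrices share the same columns in the same order.
dv = DictVectorizer(sparse=False)
X_train_toy = dv.fit_transform(train_toy.to_dict(orient="records"))
X_val_toy = dv.transform(val_toy.to_dict(orient="records"))

clf = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
clf.fit(X_train_toy, y_train_toy)

acc = accuracy_score(y_val_toy, clf.predict(X_val_toy))
print(dv.feature_names_)
print(round(acc, 2))
```

Because the encoder is fitted once and reused, no manual `reindex` step is needed to align the validation columns with the training columns.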

### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

- `age`
- `balance`
- `marital`
- `previous`

> **Note**: The difference doesn't have to be positive.

In [34]:
# Baseline accuracy of the Q4 model; reindexing to X_train_encoded.columns keeps only the one-hot categorical columns the model was trained on
y_val_pred = model.predict(pd.get_dummies(val.drop(columns=['y']), drop_first=True).reindex(columns=X_train_encoded.columns, fill_value=0))
original_accuracy = accuracy_score(val['y'], y_val_pred)
original_accuracy

0.8961512939615129

In [35]:
# Initialize a dictionary to store accuracy differences for each feature
accuracy_differences = {}

In [36]:
# Loop through each one-hot encoded column (not each original feature) and exclude it one by one
for feature in X_train_encoded.columns:
    # Create a new feature set excluding the current feature
    X_train_excluded = X_train_encoded.drop(columns=[feature])
    
    # Train the model again without the excluded feature
    model.fit(X_train_excluded, y_train)
    
    # Create one-hot encoded features for the validation dataset without the excluded feature
    X_val_excluded = pd.get_dummies(val.drop(columns=['y']), drop_first=True).reindex(columns=X_train_encoded.columns, fill_value=0).drop(columns=[feature])
    
    # Predict on the validation dataset
    y_val_pred_excluded = model.predict(X_val_excluded)
    
    # Calculate accuracy without the excluded feature
    accuracy_excluded = accuracy_score(val['y'], y_val_pred_excluded)
    
    # Calculate the difference between original accuracy and accuracy without the feature
    accuracy_difference = original_accuracy - accuracy_excluded
    accuracy_differences[feature] = accuracy_difference

In [37]:
# Find the feature with the smallest difference
least_useful_feature = min(accuracy_differences, key=accuracy_differences.get)
smallest_difference = accuracy_differences[least_useful_feature]

print(f"The feature with the smallest difference is: {least_useful_feature} with a difference of {smallest_difference:.4f}")

The feature with the smallest difference is: job_student with a difference of -0.0001
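
The answer options in this question are original features such as `age` or `marital`, so an alternative is to drop a whole feature before encoding instead of a single one-hot column. The sketch below shows that variant on made-up data. The helper `validation_accuracy` and all the toy names are hypothetical, and the resulting numbers say nothing about the bank dataset.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def validation_accuracy(train_df, y_train, val_df, y_val, columns):
    """Train on the given columns only and return the validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(train_df[columns].to_dict(orient="records"))
    X_val = dv.transform(val_df[columns].to_dict(orient="records"))
    # A fresh model per call, so no state leaks between runs.
    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    return accuracy_score(y_val, model.predict(X_val))


# Made-up data, for illustration only.
train_toy = pd.DataFrame({
    "duration": [50, 60, 400, 500, 70, 450],
    "contact": ["cellular", "unknown", "cellular", "cellular", "unknown", "cellular"],
    "marital": ["married", "single", "married", "single", "single", "married"],
})
y_train_toy = [0, 0, 1, 1, 0, 1]
val_toy = pd.DataFrame({
    "duration": [55, 480],
    "contact": ["unknown", "cellular"],
    "marital": ["single", "married"],
})
y_val_toy = [0, 1]

all_columns = ["duration", "contact", "marital"]
baseline = validation_accuracy(train_toy, y_train_toy, val_toy, y_val_toy, all_columns)

# Drop one original feature at a time and record baseline minus the new accuracy.
differences = {}
for col in all_columns:
    remaining = [c for c in all_columns if c != col]
    without_col = validation_accuracy(train_toy, y_train_toy, val_toy, y_val_toy, remaining)
    differences[col] = baseline - without_col

least_useful = min(differences, key=differences.get)
print(differences)
print("Smallest difference:", least_useful)
```

Whether "smallest" means the smallest signed value, as here, or the smallest absolute value is a design choice; swapping in `key=lambda c: abs(differences[c])` gives the second reading.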


### Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` values leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.

In [38]:
# Initialize a dictionary to store the accuracy for each value of C
accuracy_results = {}

In [39]:
# Values of C to try
C_values = [0.01, 0.1, 1, 10, 100]

In [40]:
# Loop through each value of C, train the model, and calculate accuracy
for C in C_values:
    # Fit the logistic regression model with the current C value
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model.fit(X_train_encoded, train['y'])
    
    # Predict on the validation dataset
    y_val_pred = model.predict(X_val_encoded)
    
    # Calculate accuracy
    accuracy = accuracy_score(val['y'], y_val_pred)
    
    # Round the accuracy to 3 decimal digits and store it
    accuracy_results[C] = round(accuracy, 3)

In [41]:
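# Find the value of C with the best (highest) accuracy.
# max() returns the first key with the maximum value, so ties go to the smallest C
# because C_values is listed in ascending order.
best_C = max(accuracy_results, key=accuracy_results.get)
best_accuracy = accuracy_results[best_C]

best_C, best_accuracy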
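
The tie-breaking rule from the question's note (prefer the smallest `C` among equally accurate models) can also be written out explicitly, so it does not depend on the order in which the values were tried. The accuracies below are made up for illustration and are not results from this homework; `accuracy_by_c` is a hypothetical name.

```python
# Made-up validation accuracies, for illustration only.
accuracy_by_c = {0.01: 0.898, 0.1: 0.901, 1: 0.901, 10: 0.900, 100: 0.900}

# Highest accuracy reached by any C.
best_accuracy = max(accuracy_by_c.values())

# Among all C values that reach it, take the smallest one.
best_c = min(c for c, acc in accuracy_by_c.items() if acc == best_accuracy)

print(best_c, best_accuracy)
```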