# Homework

> Note: sometimes your answer doesn't match one of the options exactly. 
> That's fine. 
> Select the option that's closest to your solution.

## Dataset

In this homework, we will use the Bank Marketing dataset. Download it from [here](https://archive.ics.uci.edu/static/public/222/bank+marketing.zip).

Or you can do it with `wget`:

```bash
wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
```

We need the `bank/bank-full.csv` file from the downloaded zip archive.
The target for this classification task is the `y` variable: whether the client subscribed to a term deposit.
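
A minimal sketch of pulling the CSV out of the archive with the standard library only. It assumes the file may sit inside a nested archive, so the helper searches by file name instead of relying on a fixed path. The name `find_member` is hypothetical, and the demo builds a tiny in-memory archive rather than touching the real download.

```python
import io
import zipfile


def find_member(zip_file, filename):
    """Return the bytes of the first entry named `filename`.

    Looks at the top level first, then descends into nested .zip entries.
    Returns None when nothing matches.
    """
    with zipfile.ZipFile(zip_file) as zf:
        names = zf.namelist()
        for name in names:
            # Compare only the last path component, so "bank/bank-full.csv" matches too
            if name.rsplit("/", 1)[-1] == filename:
                return zf.read(name)
        for name in names:
            if name.lower().endswith(".zip"):
                found = find_member(io.BytesIO(zf.read(name)), filename)
                if found is not None:
                    return found
    return None


# Demo on a made-up archive: a CSV stored inside an inner zip
inner = io.BytesIO()
with zipfile.ZipFile(inner, "w") as zf:
    zf.writestr("bank-full.csv", "age;y\n30;no\n")

outer = io.BytesIO()
with zipfile.ZipFile(outer, "w") as zf:
    zf.writestr("bank.zip", inner.getvalue())

payload = find_member(outer, "bank-full.csv")
print(payload.decode())
```

With the real download, `find_member("bank+marketing.zip", "bank-full.csv")` should return the raw bytes, which can then be passed to `pd.read_csv(io.BytesIO(data), sep=";")`.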

## Features

For the rest of the homework, you'll need to use only these columns:

* `age`,
* `job`,
* `marital`,
* `education`,
* `balance`,
* `housing`,
* `contact`,
* `day`,
* `month`,
* `duration`,
* `campaign`,
* `pdays`,
* `previous`,
* `poutcome`,
* `y`

## Data preparation

* Select only the features from above.
* Check whether the features contain missing values.

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mutual_info_score, accuracy_score

In [2]:
df = pd.read_csv("bank-full.csv", sep=";")
features = ["age", "job", "marital", "education", "balance", "housing", "contact",
            "day", "month", "duration", "campaign", "pdays", "previous", "poutcome"]
target = "y"
df = df[features + [target]]

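# Split features by dtype: numeric columns vs. everything else (treated as categorical).
# is_numeric_dtype avoids depending on string columns being stored as "object".
numeric = [f for f in features if pd.api.types.is_numeric_dtype(df[f])]
categorical = [f for f in features if f not in numeric]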

In [3]:
df.isnull().sum()

age          0
job          0
marital      0
education    0
balance      0
housing      0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

## Question 1

What is the most frequent observation (mode) for the column `education`?

In [4]:
print(f"The most frequent observation (mode) for the column education is {df['education'].mode().iloc[0]}")

The most frequent observation (mode) for the column education is secondary


## Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

In [5]:
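corr = df.corr(numeric_only=True)
# Blank out the diagonal (self-correlation is always 1) without writing into corr.values,
# which can be a read-only view in newer pandas versions
corr = corr.mask(np.eye(len(corr), dtype=bool))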

corr_unstacked = corr.unstack()
max_corr_pair = corr_unstacked.idxmax()
print(f"The two features that have the biggest correlation are {max_corr_pair[0]} and {max_corr_pair[1]}")

The two features that have the biggest correlation are pdays and previous


## Target encoding

* Now we want to encode the `y` variable.
* Let's replace the values `yes`/`no` with `1`/`0`.

In [6]:
# Compare against "yes" directly; this avoids the downcasting warning that replace() can emit
df["y"] = (df["y"] == "yes").astype(int)


## Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
    * Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.

In [7]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)
y_train = df_train[target]
y_val = df_val[target]
y_test = df_test[target]
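
The second call uses `test_size=0.25` rather than `0.2` because it operates on the 80% left after the test set is removed. The arithmetic can be checked with exact fractions; the variable names below are only illustrative.

```python
from fractions import Fraction

test_share = Fraction(1, 5)  # 20% of all rows go to the test set
val_share = Fraction(1, 5)   # 20% of all rows should go to validation

remaining = 1 - test_share            # share left after the first split: 4/5
second_split = val_share / remaining  # fraction of the remainder to hold out
train_share = remaining * (1 - second_split)

print(second_split)  # 1/4
print(train_share)   # 3/5
```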

## Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?

In [8]:
mutual_scores = []
for c in categorical:
    mutual_score = mutual_info_score(y_train, df_train[c])
    mutual_scores.append(mutual_score)

print(f"{categorical[np.argmax(mutual_scores)]} has the biggest mutual information score")

poutcome has the biggest mutual information score
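
For intuition about what the score measures, the sketch below computes mutual information directly from counts, using I(X; Y) = Σ p(x, y) · log(p(x, y) / (p(x) · p(y))). It uses the natural logarithm, the same convention as `mutual_info_score`. The function name `mutual_information` and the toy inputs are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) combination
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log(c * n / (count_x[x] * count_y[y]))
    return mi


# Independent labels share no information; identical labels share all of it
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
print(independent, identical)
```

A score near zero means that knowing the category tells you almost nothing about `y`; larger values mean the category narrows down `y` more.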


## Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

In [9]:
dv = DictVectorizer(sparse=False)
train_dict = df_train[features].to_dict(orient="records")
val_dict = df_val[features].to_dict(orient="records")
test_dict = df_test[features].to_dict(orient="records")
X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)
X_test = dv.transform(test_dict)

In [10]:
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)
acc = accuracy_score(y_val, model.predict(X_val))
print(f"The accuracy of the model on the validation dataset is {round(acc, 2)}")

The accuracy of the model on the validation dataset is 0.9
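
The one-hot encoding step can be pictured with a simplified, pure-Python imitation of what `DictVectorizer` does with record dictionaries: string values become `feature=value` indicator columns, numeric values pass through unchanged, and categories never seen during fitting are ignored at transform time. The helpers `fit_feature_names` and `transform` are hypothetical names for this sketch, not the library's implementation.

```python
def fit_feature_names(records):
    """Collect the sorted column names implied by the training records."""
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(names)


def transform(records, names):
    """Turn records into dense rows aligned with `names`."""
    index = {name: i for i, name in enumerate(names)}
    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for key, value in rec.items():
            if isinstance(value, str):
                col = index.get(f"{key}={value}")
                if col is not None:  # unseen categories are skipped
                    row[col] = 1.0
            else:
                col = index.get(key)
                if col is not None:
                    row[col] = float(value)
        rows.append(row)
    return rows


# Made-up records with one numeric and one categorical feature
train_records = [{"age": 30, "job": "admin"}, {"age": 45, "job": "technician"}]
val_records = [{"age": 50, "job": "student"}]  # "student" was not seen in training

names = fit_feature_names(train_records)
train_rows = transform(train_records, names)
val_rows = transform(val_records, names)
print(names)
print(train_rows)
print(val_rows)
```

Fitting only on the training records and reusing the same column layout for validation and test data mirrors the `fit_transform` / `transform` split in the cell above.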


## Question 5

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

> **Note**: The difference doesn't have to be positive.

In [11]:
original_acc = acc
differences = []

for f in features:
    reduced_features = features.copy()
    reduced_features.remove(f)
    reduced_train_dict = df_train[reduced_features].to_dict(orient="records")
    reduced_val_dict = df_val[reduced_features].to_dict(orient="records")
    dv = DictVectorizer(sparse=False)
    X_reduced_train = dv.fit_transform(reduced_train_dict)
    X_reduced_val = dv.transform(reduced_val_dict)
    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
    model.fit(X_reduced_train, y_train)
    current_acc = accuracy_score(y_val, model.predict(X_reduced_val))
    difference = original_acc - current_acc
    differences.append(difference)

print(f"{features[np.argmin(differences)]} has the smallest difference")

age has the smallest difference
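
The elimination loop can be separated from model training to make the sign of the difference easier to reason about. In the sketch below, `elimination_differences` is a hypothetical helper and `toy_score` is a stand-in for "train a model on these features and return validation accuracy"; its numbers are invented. A negative difference means the model scored better without the feature.

```python
def elimination_differences(features, score_fn):
    """Map each feature to baseline_score - score_without_that_feature."""
    baseline = score_fn(features)
    diffs = {}
    for f in features:
        reduced = [g for g in features if g != f]
        diffs[f] = baseline - score_fn(reduced)
    return diffs


# Invented per-feature contributions, used only to exercise the loop
contribution = {"duration": 0.05, "age": 0.0, "noise": -0.01}


def toy_score(subset):
    """Stand-in for model training: base accuracy plus each feature's contribution."""
    return 0.85 + sum(contribution[f] for f in subset)


diffs = elimination_differences(list(contribution), toy_score)
smallest_signed = min(diffs, key=diffs.get)
smallest_absolute = min(diffs, key=lambda f: abs(diffs[f]))
print(smallest_signed, smallest_absolute)
```

The two selections can disagree: the signed minimum picks the feature whose removal helps most, while the absolute minimum picks the feature whose removal changes accuracy least. The cell above uses the signed minimum via `np.argmin`.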


## Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

> **Note**: If there are multiple options, select the smallest `C`.

In [12]:
c_values = [0.01, 0.1, 1, 10, 100]
accs = []
for C in c_values:
    model = LogisticRegression(solver="liblinear", C=C, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    accs.append(round(acc, 3))

# np.argmax returns the first maximum, so ties resolve to the smallest C
print(f"C={c_values[np.argmax(accs)]} leads to the best accuracy on the validation set")

C=0.1 leads to the best accuracy on the validation set
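
In Scikit-Learn, `C` is the inverse of the regularization strength, so smaller values regularize more. The tie-breaking rule from the note ("select the smallest `C`") can also be written explicitly instead of relying on list order. The helper `pick_best_c` is hypothetical, and the accuracies below are invented to show a tie after rounding; they are not results from the bank dataset.

```python
def pick_best_c(accuracy_by_c, digits=3):
    """Return the C with the highest rounded accuracy; ties go to the smallest C."""
    return min(
        accuracy_by_c,
        key=lambda c: (-round(accuracy_by_c[c], digits), c),
    )


# Invented validation accuracies: three values tie at 0.901 after rounding
made_up = {0.01: 0.898, 0.1: 0.901, 1: 0.9009, 10: 0.9012, 100: 0.9}

best = pick_best_c(made_up)
print(best)
```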


## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2024/homework/hw03
* If your answer doesn't match the options exactly, select the closest one.