## Module 3 - Homework



### Dataset

In this homework, we will use the Bank Marketing dataset. Download it from [here](https://archive.ics.uci.edu/static/public/222/bank+marketing.zip).

Or you can do it with `wget`:

```bash
wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
```

We need the `bank/bank-full.csv` file from the downloaded zip archive.  
In this dataset, the target for the classification task is the `y` variable: whether the client has subscribed to a term deposit.


In [1]:
!wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip

--2024-10-13 22:04:03--  https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘bank+marketing.zip.1’

bank+marketing.zip.     [    <=>             ] 999.85K  1.25MB/s    in 0.8s    

2024-10-13 22:04:05 (1.25 MB/s) - ‘bank+marketing.zip.1’ saved [1023843]
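The CSV still has to be pulled out of the downloaded archive. A minimal sketch of one way to do that with the standard library follows. It assumes the file may sit inside a nested `.zip` member, which is an assumption about the archive layout, not something shown above. `read_member` and `_make_zip` are hypothetical helper names, and the tiny in-memory archive is a stand-in so the example runs without a download.

```python
import io
import zipfile


def read_member(zip_bytes, suffix):
    """Return the bytes of the first member whose name ends with `suffix`.

    Nested .zip members are searched recursively; returns None if no match.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(suffix):
                return archive.read(name)
            if name.endswith('.zip'):
                found = read_member(archive.read(name), suffix)
                if found is not None:
                    return found
    return None


def _make_zip(files):
    """Build an in-memory zip from a {name: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


# Stand-in for the real download (assumed layout: a zip inside a zip).
inner = _make_zip({'bank-full.csv': b'age;job;y\n58;management;no\n'})
outer = _make_zip({'bank.zip': inner})

csv_bytes = read_member(outer, 'bank-full.csv')
print(csv_bytes.decode().splitlines()[0])
```

With the real file, the same helper could be fed `open('bank+marketing.zip', 'rb').read()`, and the returned bytes written to `bank-full.csv` next to the notebook.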



### Features

For the rest of the homework, you'll need to use only these columns:

* `age`,
* `job`,
* `marital`,
* `education`,
* `balance`,
* `housing`,
* `contact`,
* `day`,
* `month`,
* `duration`,
* `campaign`,
* `pdays`,
* `previous`,
* `poutcome`,
* `y`

In [2]:
hw_columns = ['age', 'job', 'marital', 'education', 'balance', 'housing', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'y']

In [3]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv('./bank-full.csv', sep=";")

In [5]:
df.head()

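| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |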


In [6]:
len(hw_columns), len(df.columns)

(15, 17)

In [7]:
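# .copy() so that the later assignment to df.y does not write into a slice of the original frame
df = df[hw_columns].copy()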

### Question 1

What is the most frequent observation (mode) for the column `education`?

- `unknown`
- `primary`
- `secondary`
- `tertiary`

In [8]:
df.education.mode()

0    secondary
Name: education, dtype: object

In [9]:
df.education.value_counts()

education
secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: count, dtype: int64

#### Answer 1:

The most frequent observation in the column **education** is **secondary**
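As a quick sanity check on the counts printed above, the following sketch recomputes the mode and each level's share in plain Python. The numbers are copied from the `value_counts()` output; nothing else is assumed.

```python
# Counts copied from the value_counts() output above.
counts = {'secondary': 23202, 'tertiary': 13301, 'primary': 6851, 'unknown': 1857}

total = sum(counts.values())
mode = max(counts, key=counts.get)  # level with the highest count

print(total, mode)
for level, n in counts.items():
    print(level, round(n / total, 3))
```

The total equals the number of rows in the dataset, and `secondary` covers slightly more than half of them.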

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `age` and `balance`
- `day` and `campaign`
- `day` and `pdays`
- `pdays` and `previous`

In [10]:
numerical = list(df.dtypes[df.dtypes == 'int64'].index)
numerical

['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']

In [11]:
df[numerical].corr()

| | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| age | 1.0 | 0.097783 | -0.00912 | -0.004648 | 0.00476 | -0.023758 | 0.001288 |
| balance | 0.097783 | 1.0 | 0.004503 | 0.02156 | -0.014578 | 0.003435 | 0.016674 |
| day | -0.00912 | 0.004503 | 1.0 | -0.030206 | 0.16249 | -0.093044 | -0.05171 |
| duration | -0.004648 | 0.02156 | -0.030206 | 1.0 | -0.08457 | -0.001565 | 0.001203 |
| campaign | 0.00476 | -0.014578 | 0.16249 | -0.08457 | 1.0 | -0.088628 | -0.032855 |
| pdays | -0.023758 | 0.003435 | -0.093044 | -0.001565 | -0.088628 | 1.0 | 0.45482 |
| previous | 0.001288 | 0.016674 | -0.05171 | 0.001203 | -0.032855 | 0.45482 | 1.0 |


#### Answer 2:

The two features with the biggest correlation are **pdays** and **previous**
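For reference, the Pearson coefficient that `corr()` reports by default can be sketched in plain Python. The `pearson` helper below is illustrative only; the candidate values are copied from the matrix above, and the strongest pair is picked by absolute value so that a strong negative correlation would count too.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Coefficients for the four answer options, copied from the matrix above.
candidates = {
    ('age', 'balance'): 0.097783,
    ('day', 'campaign'): 0.16249,
    ('day', 'pdays'): -0.093044,
    ('pdays', 'previous'): 0.45482,
}

strongest = max(candidates, key=lambda pair: abs(candidates[pair]))
print(strongest)
```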

### Target encoding

* Now we want to encode the `y` variable.
* Let's replace the values `yes`/`no` with `1`/`0`.

In [12]:
df.y.isnull().sum()

0

In [13]:
df.y = (df.y == 'yes').astype('int')

In [14]:
df.y.head()

0    0
1    0
2    0
3    0
4    0
Name: y, dtype: int64

### Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [17]:
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [18]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [19]:
y_train = df_train.y.values
y_val = df_val.y.values
y_test = df_test.y.values

In [20]:
del df_train['y']
del df_val['y']
del df_test['y']

In [21]:
len(df), len(df_train), len(df_val), len(df_test)

(45211, 27126, 9042, 9043)
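The second split uses `test_size=0.25` because 25% of the remaining 80% is 20% of the full dataset. The sketch below reproduces the sizes printed above, assuming the test part is rounded up and the train part receives the remainder. That rounding rule is my understanding of `train_test_split`, not something the notebook states, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test part is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_full_train, n_test = split_sizes(45211, 0.2)     # first split: 80% / 20%
n_train, n_val = split_sizes(n_full_train, 0.25)   # second split: 60% / 20% overall

print(n_train, n_val, n_test)  # 27126 9042 9043
```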

### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?
  
- `contact`
- `education`
- `housing`
- `poutcome`

In [22]:
categorical = ['contact', 'education', 'housing', 'poutcome']

In [23]:
df_train[categorical].head()

| | contact | education | housing | poutcome |
|---|---|---|---|---|
| 0 | cellular | tertiary | yes | unknown |
| 1 | cellular | secondary | yes | unknown |
| 2 | cellular | secondary | yes | unknown |
| 3 | cellular | primary | no | unknown |
| 4 | cellular | tertiary | no | unknown |


In [24]:
from sklearn.metrics import mutual_info_score

In [25]:
type(y_train)

numpy.ndarray

In [26]:
def mutual_info_y_score(series):
    return mutual_info_score(y_train, series)

In [27]:
df_train[categorical].apply(mutual_info_y_score).sort_values(ascending=False)

poutcome     0.029533
contact      0.013356
housing      0.010343
education    0.002697
dtype: float64

#### Answer 3:

**poutcome** is the variable with the biggest mutual information score
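To make the score less of a black box, here is a plain-Python sketch of mutual information between two discrete variables, using the natural logarithm as `mutual_info_score` does. The `mutual_info` helper is illustrative and is not the scikit-learn implementation. The rounding step applies `round(score, 2)` to the values printed above.

```python
import math
from collections import Counter


def mutual_info(xs, ys):
    """Sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    score = 0.0
    for (x, y), c in joint.items():
        # The same ratio as in the formula, written with raw counts.
        score += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return score


# Scores copied from the output above, rounded as the question asks.
scores = {'poutcome': 0.029533, 'contact': 0.013356,
          'housing': 0.010343, 'education': 0.002697}
rounded = {name: round(score, 2) for name, score in scores.items()}
print(rounded)
```

After rounding, `contact` and `housing` tie at 0.01, but `poutcome` still stands alone at 0.03.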

### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.6
- 0.7
- 0.8
- 0.9

In [28]:
from sklearn.feature_extraction import DictVectorizer

In [29]:
df_train.dtypes

age           int64
job          object
marital      object
education    object
balance       int64
housing      object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
dtype: object

In [30]:
# Build the masks from df_train itself; df still contains the target column y
cat_variables = list(df_train.dtypes[df_train.dtypes == 'object'].index)
num_variables = list(df_train.dtypes[df_train.dtypes == 'int64'].index)
cat_variables, num_variables

(['job', 'marital', 'education', 'housing', 'contact', 'month', 'poutcome'],
 ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous'])

In [31]:
dv = DictVectorizer(sparse=False)

In [32]:
dicts_train = df_train[cat_variables + num_variables].to_dict(orient='records')
dicts_val = df_val[cat_variables+ num_variables].to_dict(orient='records')

In [33]:
df_train[cat_variables + num_variables].shape, df_train.shape

((27126, 14), (27126, 14))

In [34]:
dv.fit(dicts_train)

In [35]:
X_train = dv.transform(dicts_train)

In [36]:
X_val = dv.transform(dicts_val)
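The one-hot encoding that `DictVectorizer` performs in the cells above can be sketched in plain Python: string values become `column=value` indicator features, numbers pass through unchanged, and categories not seen during fitting are ignored at transform time. The helpers and the toy records below are hypothetical and only mimic that behaviour.

```python
def feature_name(key, value):
    """Strings map to a 'key=value' indicator; numbers keep the bare key."""
    return f'{key}={value}' if isinstance(value, str) else key


def fit_feature_names(records):
    """Collect every feature name seen in the training records, sorted."""
    return sorted({feature_name(k, v) for r in records for k, v in r.items()})


def transform(records, feature_names):
    """Turn each record into a dense row aligned with feature_names."""
    index = {name: i for i, name in enumerate(feature_names)}
    rows = []
    for record in records:
        row = [0.0] * len(feature_names)
        for key, value in record.items():
            name = feature_name(key, value)
            if name in index:  # categories unseen during fit are skipped
                row[index[name]] = 1.0 if isinstance(value, str) else float(value)
        rows.append(row)
    return rows


# Toy records, not taken from the dataset.
train = [{'age': 58, 'housing': 'yes'}, {'age': 33, 'housing': 'no'}]
val = [{'age': 44, 'housing': 'unknown'}]

names = fit_feature_names(train)
print(names)                    # ['age', 'housing=no', 'housing=yes']
print(transform(train, names))  # [[58.0, 0.0, 1.0], [33.0, 1.0, 0.0]]
print(transform(val, names))    # [[44.0, 0.0, 0.0]]
```

Fitting on the training records only, as the notebook does, keeps the validation rows in the same column layout as the training rows.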

In [37]:
from sklearn.linear_model import LogisticRegression

In [38]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)

In [39]:
model.fit(X_train, y_train)

In [40]:
y_pred = model.predict(X_val)

In [41]:
(y_val == y_pred.astype(int)).mean()

0.9011280690112807

In [42]:
y_pred = model.predict_proba(X_val)

In [43]:
y_decision = y_pred[:,1] > 0.5

In [44]:
(y_val == y_decision.astype(int)).mean().round(2)

0.9

In [45]:
global_accuracy = (y_val == y_decision.astype(int)).mean()
global_accuracy

0.9011280690112807

#### Answer 4:

The accuracy I got was **0.9**
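Both routes above (`predict`, and `predict_proba` with a 0.5 cut-off) end in the same metric: the share of validation rows where the hard decision matches the label. A small sketch with made-up labels and probabilities, followed by the rounding applied to the value reported above:

```python
def decide(probabilities, threshold=0.5):
    """Turn positive-class probabilities into 0/1 decisions."""
    return [int(p > threshold) for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and probabilities, not taken from the model above.
y_true = [0, 0, 1, 1, 0]
proba = [0.1, 0.6, 0.8, 0.4, 0.2]

decisions = decide(proba)
print(decisions, accuracy(y_true, decisions))  # [0, 1, 1, 0, 0] 0.6

# Rounding the validation accuracy reported above to 2 decimals.
print(round(0.9011280690112807, 2))  # 0.9
```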

### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

- `age`
- `balance`
- `marital`
- `previous`

> **Note**: The difference doesn't have to be positive.

In [46]:
testing_features = ['age', 'balance', 'marital', 'previous']

In [63]:
tmp_dicts_train = df_train[testing_features].to_dict(orient='records')
tmp_dicts_val = df_val[testing_features].to_dict(orient='records')

dv_new = DictVectorizer(sparse=False)

X_train = dv_new.fit_transform(tmp_dicts_train)
X_val = dv_new.transform(tmp_dicts_val)

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_val)

global_accuracy = (y_val == y_pred.astype(int)).mean()

global_accuracy

0.880336208803362

In [65]:
for f in testing_features:    
    tmpdf_t = df_train[testing_features].copy()
    del tmpdf_t[f]

    tmpdf_v = df_val[testing_features].copy()
    del tmpdf_v[f]

    tmp_dicts_train = tmpdf_t.to_dict(orient='records')
    tmp_dicts_val = tmpdf_v.to_dict(orient='records')

    dv_new = DictVectorizer(sparse=False)

    X_train = dv_new.fit_transform(tmp_dicts_train)
    X_val = dv_new.transform(tmp_dicts_val)

    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_val)

    accuracy = (y_val == y_pred.astype(int)).mean()

    print(f, (global_accuracy - accuracy))

age -0.0001105950011059953
balance 0.0
marital 0.00011059500110588427
previous -0.0013271400132713884


#### Answer 5:

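The least useful feature is **balance**: removing it changes the validation accuracy by 0.0, the smallest difference in absolute value. Note that the baseline model here was trained on the four candidate features only, not on the full feature set from Q4.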
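The elimination loop can be written as a small generic helper that takes any scoring function. The sketch below uses a made-up scorer (`toy_score`, with hypothetical weights) in place of training a model, then applies both readings of "smallest difference" to the numbers printed above.

```python
def elimination_differences(features, score_fn):
    """Baseline score minus the score obtained without each feature."""
    baseline = score_fn(features)
    diffs = {}
    for feature in features:
        remaining = [f for f in features if f != feature]
        diffs[feature] = baseline - score_fn(remaining)
    return diffs


# Toy scorer standing in for "train a model and measure accuracy".
weights = {'a': 0.10, 'b': 0.0, 'c': 0.02}  # hypothetical contributions


def toy_score(features):
    return 0.5 + sum(weights[f] for f in features)


toy_diffs = elimination_differences(['a', 'b', 'c'], toy_score)
print(toy_diffs)

# Differences copied from the output above.
reported = {
    'age': -0.0001105950011059953,
    'balance': 0.0,
    'marital': 0.00011059500110588427,
    'previous': -0.0013271400132713884,
}
smallest_abs = min(reported, key=lambda f: abs(reported[f]))
smallest_signed = min(reported, key=reported.get)
print(smallest_abs, smallest_signed)  # balance previous
```

The two readings disagree: by absolute value the answer is `balance`, while the most negative signed difference belongs to `previous`. The answer above follows the absolute-value reading.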

### Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.


In [66]:
Cs = [0.01, 0.1, 1, 10, 100]

In [70]:
# The feature matrices do not depend on C, so build them once outside the loop
dicts_train = df_train.to_dict(orient='records')
dicts_val = df_val.to_dict(orient='records')

dv = DictVectorizer(sparse=False)

X_train = dv.fit_transform(dicts_train)
X_val = dv.transform(dicts_val)

for c in Cs:
    model = LogisticRegression(solver='liblinear', C=c, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_val)

    accuracy = (y_val == y_pred.astype(int)).mean()

    print(c, accuracy.round(3))

0.01 0.898
0.1 0.901
1 0.901
10 0.901
100 0.901


#### Answer 6:

The best accuracy (0.901) is shared by several values of `C`; the smallest of them is **C = 0.1**.
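The tie-breaking rule from the note ("select the smallest `C`") can be applied mechanically to the rounded accuracies printed above. A minimal sketch, with the results copied from that output:

```python
# Rounded validation accuracies copied from the output above, keyed by C.
results = {0.01: 0.898, 0.1: 0.901, 1: 0.901, 10: 0.901, 100: 0.901}

best_accuracy = max(results.values())
# Several C values share the best accuracy, so keep the smallest one.
best_c = min(c for c, acc in results.items() if acc == best_accuracy)

print(best_c, best_accuracy)  # 0.1 0.901
```

In scikit-learn, `C` is the inverse of the regularization strength, so the smallest tied `C` is also the most strongly regularized of the tied models.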