# Homework

In [43]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [44]:
df_full = pd.read_csv("bank-full.csv", sep=";")
features = ['age', 'job', 'marital', 'education', 'balance', 'housing', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome']
df = df_full[features + ['y']].copy()
df['y'] = (df['y'] == 'yes').astype(int)

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
y_full_train = df_full_train.y
y_test = df_test.y
y_train = df_train.y
y_val = df_val.y


# Question 1: ROC AUC feature importance

ROC AUC could also be used to evaluate feature importance of numerical variables.

Let's do that.

* For each numerical variable, use it as the score (aka prediction) and compute the AUC with the `y` variable as ground truth.
* Use the training dataset for that.
* If your AUC is < 0.5, invert this variable by putting "-" in front (e.g. `-df_train['engine_hp']`).

AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating the variable; the negative correlation then becomes positive.
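The effect of negation can be checked on a tiny example. The sketch below uses a hand-rolled pairwise AUC (the share of positive/negative pairs in which the positive gets the higher score, with ties counted as half) rather than `roc_auc_score`. The function name `pairwise_auc` and the toy data are made up for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): larger values of x tend to go with y == 0
y = [1, 1, 0, 1, 0, 0]
x = [1, 2, 3, 4, 5, 6]

auc = pairwise_auc(y, x)
auc_negated = pairwise_auc(y, [-v for v in x])
print(auc, auc_negated)
```

Negating the score flips every pairwise comparison, so the two values sum to 1: a feature with AUC below 0.5 carries as much ranking information as one the same distance above 0.5.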

Which numerical variable (among the following 4) has the highest AUC?

* balance
* day
* duration
* previous


In [45]:
# Numerical feature columns only; the target `y` is excluded
numerical = [c for c in df.dtypes[df.dtypes != 'object'].index if c != 'y']

for column in numerical:
    score = roc_auc_score(y_train, df_train[column])
    # An AUC below 0.5 means the feature ranks the target in reverse, so negate it
    if score < 0.5:
        score = roc_auc_score(y_train, -df_train[column])
    print(f"{column}:", score)


age: 0.512185717527344
balance: 0.5888313805382317
day: 0.525957882383908
duration: 0.8147002759670778
campaign: 0.5714543015682159
pdays: 0.5901276247352144
previous: 0.5985653242764153


> A/ `duration`

# Question 2: Training the model

Apply one-hot-encoding using DictVectorizer and train the logistic regression with these parameters:

`LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)`

What's the AUC of this model on the validation dataset? (round to 3 digits)

* 0.69
* 0.79
* 0.89
* 0.99
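As a rough picture of what one-hot encoding does, the sketch below is a simplified pure-Python stand-in for `DictVectorizer`, not its actual implementation. The helper names `fit_columns` and `encode` and the toy records are hypothetical. The column layout is learned once from the training records and then reused for the validation records, so both matrices share the same columns.

```python
def fit_columns(records):
    """Learn a sorted list of output columns from the training records."""
    columns = set()
    for rec in records:
        for key, value in rec.items():
            # String values get one column per category; numbers keep a single column
            columns.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(columns)


def encode(records, columns):
    """Encode records with previously learned columns; unseen categories are ignored."""
    index = {name: i for i, name in enumerate(columns)}
    rows = []
    for rec in records:
        row = [0.0] * len(columns)
        for key, value in rec.items():
            name = f"{key}={value}" if isinstance(value, str) else key
            if name in index:
                row[index[name]] = 1.0 if isinstance(value, str) else float(value)
        rows.append(row)
    return rows


# Toy records (hypothetical)
train = [{"job": "admin", "age": 30}, {"job": "student", "age": 22}]
val = [{"job": "retired", "age": 65}]  # "retired" never appears in train

columns = fit_columns(train)
train_matrix = encode(train, columns)
val_matrix = encode(val, columns)
print(columns)
print(train_matrix)
print(val_matrix)
```

Learning the columns again from the validation records would produce a different layout whenever the two sets contain different categories, and the model's coefficients would no longer line up with the inputs.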

In [63]:
# Train
dicts = df_train[features].to_dict(orient='records')
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(dicts)
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
model.fit(X_train, y_train)

# Predict (reuse the vectorizer fitted on the training data; do not refit on validation)
dicts = df_val[features].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = model.predict_proba(X_val)[:, 1]

auc = round(roc_auc_score(y_val, y_pred), 3)
auc




np.float64(0.9)

> A/ `0.89` (the closest option to the computed 0.9)