# Homework 2 - Classification

In this exercise you will use scikit-learn, a popular machine learning package in Python, to train and tune a classifier. A particularly useful feature is that all classifiers (and linear models) are called through the same API, so it is easy to compare different models (see the sklearn-intro notebook for examples). In this exercise we will use one classification technique, logistic regression, that is representative of the methods and challenges you will encounter with any classification method.


## Dataset
We will be using a bank marketing dataset. 
The dataset is associated with direct marketing campaigns of a banking institution. Your job is to find the best strategies for improving the next marketing campaign: how can the bank make its future campaigns more effective? To answer this, we analyze the last marketing campaign the bank performed and identify patterns that can inform future strategies.

You have to predict whether a customer subscribes to a term deposit, using the following attributes: 

1 - age (numeric)<br>
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')<br>
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)<br>
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')<br>
5 - default: has credit in default? (categorical: 'no','yes','unknown')<br>
6 - balance: balance amount (numeric)<br>
7 - housing: has housing loan? (categorical: 'no','yes','unknown')<br>
8 - loan: has personal loan? (categorical: 'no','yes','unknown')<br>
9 - contact: contact communication type (categorical: 'cellular','telephone')<br>
10 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')<br>
11 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')<br>
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)<br>
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)<br>
14 - previous: number of contacts performed before this campaign and for this client (numeric)<br>
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')<br>

features_ex2.xlsx contains the features. It has 4521 records. The first 3165 observations form the training dataset, the next 678 observations form the cross validation dataset, and the final 678 observations form the test dataset.

label_ex2.xlsx contains the label: "yes" or "no". The first 3165 labels belong to the training dataset and the next 678 to the cross validation dataset. Labels for the test dataset are not provided because in a real-world scenario you will not know the true values for your test set. 

In [8]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [9]:
X = pd.read_excel("features_ex2.xlsx")
X.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,campaign,pdays,previous,poutcome
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,1,-1,0,unknown
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,1,339,4,failure
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,1,330,1,failure
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,4,-1,0,unknown
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,1,-1,0,unknown


In [10]:
y = pd.read_excel("label_ex2.xlsx")
y.head()

Unnamed: 0,y
0,no
1,no
2,no
3,no
4,no


In [11]:
categories = ['job','marital','education','default','housing','loan','contact','month','poutcome']
categorical = pd.get_dummies(X[categories])
continuous = X.drop(columns=categories)
X = pd.concat([continuous,categorical],axis=1)

In [12]:
#splitting data into train, cv and test set (70:15:15 ratio)
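# .copy() makes each split an independent frame, so later in-place column assignments do not act on a slice of X
X_train = X.iloc[0:3165,:].copy()
y_train = y.iloc[0:3165,:].copy()
X_cv = X.iloc[3165:3843,:].copy()
y_cv = y.iloc[3165:3843,:].copy()
X_test = X.iloc[3843:4521,:].copy()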

In [13]:
print("X_train "+ str(X_train.shape))
print("y_train "+ str(y_train.shape))
print("X_cv "+ str(X_cv.shape))
print("y_cv "+ str(y_cv.shape))
print("X_test "+ str(X_test.shape))

X_train (3165, 50)
y_train (3165, 1)
X_cv (678, 50)
y_cv (678, 1)
X_test (678, 50)


## Standardization

As discussed in the previous exercise, standardization is important when features with different scales are involved. 
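
A minimal sketch of the idea on made-up numbers (the array values below are illustrative and do not come from the dataset): the scaler learns each column's mean and standard deviation from the training rows only, and those same statistics are then reused to scale any later data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales (values are made up).
train = np.array([[25.0, 1000.0], [35.0, 3000.0], [45.0, 5000.0]])
new = np.array([[35.0, 3000.0], [55.0, 7000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from train, then scale train
new_scaled = scaler.transform(new)          # reuse the train statistics, no refitting

print(scaler.mean_)                # per-column means learned from the training rows
print(train_scaled.mean(axis=0))   # approximately 0 for each column
print(train_scaled.std(axis=0))    # approximately 1 for each column
print(new_scaled)                  # new rows expressed in training-set units
```

Calling `fit_transform` again on the later data would compute fresh statistics from that data, so the two sets would no longer be on the same scale.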

Q. Use StandardScaler from sklearn.preprocessing to standardize the continuous features. 


In [14]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

continuous_variables = ['age', 'balance', 'day', 'campaign', 'pdays', 'previous']

# Use the above list to replace the continuous columns in X_train with scaled columns. Use fit_transform method.
X_train[continuous_variables] = scaler.fit_transform(X_train[continuous_variables]) ### WRITE CODE HERE

In [16]:
# Similarly use the above list to replace the continuous columns in X_cv and X_test with scaled columns. Use transform method.
### WRITE CODE HERE
# transform() reuses the mean and standard deviation learned from X_train; the scaler is not refit here
X_cv[continuous_variables] = scaler.transform(X_cv[continuous_variables])
X_test[continuous_variables] = scaler.transform(X_test[continuous_variables])


## Classification

As previously mentioned, the scikit-learn classification API makes it easy to train a classifier. 
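
As a rough illustration of that API, the sketch below trains a classifier on a tiny made-up dataset (the arrays `X_toy` and `y_toy` and the name `clf` are placeholders, not part of the assignment) and shows the three calls used most often: `fit`, `predict`, and `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, clearly separable data (made up for illustration).
X_toy = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y_toy = np.array(["no", "no", "no", "yes", "yes", "yes"])

clf = LogisticRegression()   # default parameters
clf.fit(X_toy, y_toy)        # learn from features and labels

X_new = np.array([[-3.0], [3.0]])
labels = clf.predict(X_new)        # hard class labels
probs = clf.predict_proba(X_new)   # one probability column per class

print(clf.classes_)   # column order of predict_proba
print(labels)
print(probs)
```

The columns of `predict_proba` follow the order of `clf.classes_`, which matters when picking the probability of the positive class.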


Q. Use LogisticRegression from sklearn.linear_model to make a logistic regression classifier.

In [17]:
from sklearn.linear_model import LogisticRegression

In [None]:
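# First, initialize the classifier with default parameters
# (named `le` because the ROC cell later in this notebook refers to the classifier by that name)
le = LogisticRegression()
# then fit the classifier on training data and labels

### WRITE CODE HERE
# ravel() flattens the single-column label frame into the 1-D array that fit() expects
le.fit(X_train, y_train.values.ravel())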


In [None]:
# predict the output for cross validation dataset

### WRITE CODE HERE

Implement precision(), recall(), accuracy() in classification_utils.py, and use them below.
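
One possible shape for these functions is sketched below. The signatures, the `positive="yes"` default, and the choice to return 0.0 when a ratio is undefined are all assumptions made for illustration; the assignment file may expect something different.

```python
def _counts(y_true, y_pred, positive="yes"):
    """Return (tp, fp, fn) with respect to the given positive label."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif t == positive:
            fn += 1          # predicted negative, actually positive
    return tp, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive="yes"):
    """tp / (tp + fp); returns 0.0 if nothing was predicted positive."""
    tp, fp, _ = _counts(y_true, y_pred, positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred, positive="yes"):
    """tp / (tp + fn); returns 0.0 if there are no true positives in the data."""
    tp, _, fn = _counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Imbalanced toy labels: a model that always answers "no".
y_true_toy = ["no"] * 8 + ["yes"] * 2
y_pred_toy = ["no"] * 10
print(accuracy(y_true_toy, y_pred_toy))   # 0.8
print(precision(y_true_toy, y_pred_toy))  # 0.0
print(recall(y_true_toy, y_pred_toy))     # 0.0
```

The toy labels show why the three numbers can disagree: a model that never predicts "yes" is still right 80% of the time here.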

In [None]:
from classification_utils import accuracy, precision, recall

# Using the predictions to calculate accuracy, precision, recall

### WRITE CODE HERE

Q. Accuracy<br>
Ans - 

Q. Precision<br>
Ans - 

Q. Recall<br>
Ans -

Q. Which metric (accuracy, precision, recall) is more appropriate and in what cases? Will there be scenarios where it is better to use precision than accuracy? Explain. <br>
Ans - 

Q. Which metric is suitable in this case? <br>
Ans - 

### ROC curve

Q. Use roc_curve from sklearn.metrics and use matplotlib.pyplot to plot the ROC curve. Use the cv set to make predictions.
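
Before applying it to the real data, it may help to see the two calls on a four-point toy example (the labels and scores below are made up). Because the labels are strings, `pos_label` has to say which one counts as positive.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Made-up true labels and predicted probabilities of the "yes" class.
y_true_toy = np.array(["no", "no", "yes", "yes"])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps the decision threshold and returns one (fpr, tpr) point per threshold.
fpr_toy, tpr_toy, thresholds_toy = roc_curve(y_true_toy, scores_toy, pos_label="yes")
auc_toy = auc(fpr_toy, tpr_toy)   # area under the curve via the trapezoidal rule

print(fpr_toy)
print(tpr_toy)
print(auc_toy)   # 0.75
```

An area of 0.75 here means that a randomly chosen "yes" example receives a higher score than a randomly chosen "no" example in three of the four possible pairs.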

In [449]:
from sklearn.metrics import roc_curve, auc
# calculate the fpr and tpr for all thresholds of the classification

### WRITE CODE HERE
# Test labels are not provided, so the curve is computed on the cv set.
probs = le.predict_proba(X_cv)
preds = probs[:, 1]  # second column = probability of le.classes_[1], i.e. "yes"
# The labels are strings, so pos_label must name the positive class explicitly.
fpr, tpr, threshold = roc_curve(y_cv.values.ravel(), preds, pos_label="yes")
roc_auc = auc(fpr, tpr)


import matplotlib.pyplot as plt
# Plot the ROC curve by giving appropriate names for title and axes. 

### WRITE CODE HERE
plt.title("Receiver Operating Characteristic")
plt.plot(fpr, tpr, 'b', label="AUC=%0.2f" % roc_auc)
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1], 'r--')  # diagonal = performance of random guessing
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel("True positive rate")
plt.xlabel("False positive rate")
plt.show()

Q. What is the AUC obtained?<br>
Ans -

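## Hyperparameters

"Model tuning" refers to adjusting a model's hyperparameters so that it fits the data better. This is separate from "fitting" or "training" the model: hyperparameters are chosen before training, while the fitting/training procedure is governed by the amount and quality of your training data and by the fitting algorithm, which is specific to each classifier (e.g. logistic regression or random forest). For logistic regression in scikit-learn, two such hyperparameters are 'C' (the inverse of the regularization strength, so smaller values mean stronger regularization) and penalty (the type of regularization, 'l1' or 'l2').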





Build a model with hyperparameter 'C' set to 0.1 and penalty set to 'l1'. Make predictions on the cross validation set and compute accuracy, precision, and recall. 
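
As a hedged sketch of how these two hyperparameters are passed, the example below uses a synthetic dataset from `make_classification` rather than the bank data. One detail worth knowing: scikit-learn's default solver (lbfgs) does not support the 'l1' penalty, so this sketch assumes `solver="liblinear"`, which handles both 'l1' and 'l2'. Recent scikit-learn releases may also emit a deprecation warning for the `penalty` argument.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 3 of which carry signal.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=3, n_redundant=0, random_state=0
)

# Same C, different penalty type; liblinear supports both.
l1_model = LogisticRegression(C=0.1, penalty="l1", solver="liblinear").fit(X_toy, y_toy)
l2_model = LogisticRegression(C=0.1, penalty="l2", solver="liblinear").fit(X_toy, y_toy)

# Count coefficients that are exactly zero under each penalty.
n_zero_l1 = int((l1_model.coef_ == 0).sum())
n_zero_l2 = int((l2_model.coef_ == 0).sum())
print("zero coefficients with l1:", n_zero_l1)
print("zero coefficients with l2:", n_zero_l2)
```

The 'l1' penalty tends to push uninformative coefficients to exactly zero, while 'l2' shrinks them without eliminating them, which is one reason the two settings can rank differently on the cross validation set.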


In [490]:
### WRITE CODE HERE


Build a model with hyperparameter 'C' set to 0.5 and penalty set to 'l1'. Make predictions on the cross validation set and compute accuracy, precision, and recall. 


In [491]:
### WRITE CODE HERE


Build a model with hyperparameter 'C' set to 0.1 and penalty set to 'l2'. Make predictions on the cross validation set and compute accuracy, precision, and recall. 


In [492]:
### WRITE CODE HERE


Build a model with hyperparameter 'C' set to 0.5 and penalty set to 'l2'. Make predictions on the cross validation set and compute accuracy, precision, and recall. 

In [493]:
### WRITE CODE HERE


Q. Which of the above models is better? <br>
Ans- 

# Test set

You have worked with the training and cv datasets so far, but the test data does not include labels. Choose the best hyperparameter values found in the previous section and build a model. Use this model to make predictions on the test set. You will submit a csv file containing your predictions, named predictions.csv.
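
A minimal sketch of the saving step, using a short stand-in list in place of the real model output. Writing the file without the row index (`index=False`) is an assumption; check the submission instructions if an index column is expected.

```python
import pandas as pd

# Stand-in for the array returned by final_model.predict(X_test).
predicted_toy = ["no", "yes", "no"]

# One column with the heading "y".
submission = pd.DataFrame({"y": predicted_toy})
submission.to_csv("predictions.csv", index=False)

# Read the file back to confirm its layout.
check = pd.read_csv("predictions.csv")
print(check.columns.tolist())
print(len(check))
```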


In [None]:
##########################################
### Construct your final logistic regression using the best hyperparameters obtained above(C and penalty) ###
final_model = ### WRITE CODE HERE
final_model.fit(X_train, y_train)
predicted = final_model.predict(X_test)


### save into csv with column heading as "y"

In [None]:
#end 