# Applying Classification Methods to a Marketing Dataset

## Dataset

### Overview
I used the bank marketing dataset published by Moro et al.
You can find the dataset [here](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing),
and the full paper [here](http://repositorium.sdum.uminho.pt/handle/1822/14838).
The data come from direct marketing campaigns of a Portuguese banking institution.

The goal is to predict whether a client will subscribe to a term deposit (binary classification).

### Attributes
This dataset has **17** attributes (16 input variables and the target), grouped as follows.
#### client information
* age: client's age (numeric)
* job: type of job (categorical: admin, unemployed, management, ...)
* marital: marital status (categorical: married, single, ...)
* education: education level (categorical: secondary, primary, ...)
* default: has credit in default? (binary: yes, no)
* balance: average yearly balance (numeric)
* housing: has a housing loan? (binary: yes, no)
* loan: has a personal loan? (binary: yes, no)

#### last contact of current campaign
* contact: contact communication type (categorical: telephone, cellular, unknown)
* day: day of the month of the last contact (numeric)
* month: month of the last contact (categorical: jan, feb, ...)
* duration: duration of the last contact, in seconds (numeric)

#### others
* campaign: number of contacts with this client during this campaign (numeric)
* pdays: number of days since the client was last contacted in a **previous** campaign (numeric; -1 means the client was not previously contacted)
* previous: number of contacts with this client before this campaign (numeric)
* poutcome: outcome of the previous campaign (categorical: unknown, other, success, failure)

#### target variable
* y: has the client subscribed to a term deposit? (binary: yes, no)

In [1]:
import pandas as pd

# dataset source: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
df = pd.read_csv("data/bank/bank-full.csv", sep=";")
df.head()

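| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |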


In [2]:
# Check the distribution of target value.
print("Yes: {:}".format(len(df.loc[df["y"] == "yes"])))
print("No: {:}".format(len(df.loc[df["y"] == "no"])))

Yes: 5289
No: 39922
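
The counts above already fix a useful reference point: a classifier that always answers 'no' would be right for every negative example. A minimal sketch of that arithmetic, using only the two counts printed above (the variable names are mine, not from the notebook):

```python
# Class counts printed by the cell above
n_yes = 5289
n_no = 39922
n_total = n_yes + n_no

# Fraction of clients who subscribed (positive class rate)
positive_rate = n_yes / n_total

# Accuracy of a trivial classifier that always predicts "no"
majority_baseline_accuracy = n_no / n_total

print("Total: {:}".format(n_total))
print("Positive rate: {:.4f}".format(positive_rate))
print("Always-'no' accuracy: {:.4f}".format(majority_baseline_accuracy))
```

Roughly 12% of the clients are positives, so any accuracy figure should be read against a baseline of roughly 88%.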


### Preprocessing
Before training the models, I applied the following preprocessing steps.
* Convert binary variables (yes/no) to 1/0
* Drop the 'day' and 'month' columns (they are hard to handle with a linear regression model)
* Convert categorical variables to one-hot vectors (e.g. job => job_admin, job_unemployed, ...)
* Standardize the numeric variables 'age', 'balance' and 'duration'

In [3]:
# yes/no => 1/0 (every other value is returned unchanged)
def convert_yn(x):
    if x == "yes":
        return 1
    if x == "no":
        return 0
    return x

# NOTE: on pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap
df = df.applymap(convert_yn)

# drop 'day' and 'month' column
# I think they are not suitable for Linear Regression model
df = df.drop(columns=["day", "month"])

# categorical variables to one-hot vectors
# dtype=int keeps the dummy columns as 0/1 (newer pandas versions default to bool)
df = pd.get_dummies(df, columns=["job", "marital", "education", "contact", "poutcome"], dtype=int)

# standardization
from sklearn.preprocessing import scale

for col in ["age", "balance", "duration"]:
    df[col] = scale(df[col])

df.head()

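| | age | default | balance | housing | loan | duration | campaign | pdays | previous | y | ... | education_secondary | education_tertiary | education_unknown | contact_cellular | contact_telephone | contact_unknown | poutcome_failure | poutcome_other | poutcome_success | poutcome_unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.606965 | 0 | 0.256419 | 1 | 0 | 0.011016 | 1 | -1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 0.288529 | 0 | -0.437895 | 1 | 0 | -0.416127 | 1 | -1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | -0.747384 | 0 | -0.446762 | 1 | 1 | -0.707361 | 1 | -1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 0.571051 | 0 | 0.047205 | 1 | 0 | -0.645231 | 1 | -1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | -0.747384 | 0 | -0.447091 | 0 | 0 | -0.23362 | 1 | -1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |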
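
The standardization step can also be written out by hand. The sketch below assumes that `scale` subtracts the column mean and divides by the population standard deviation (`ddof=0`), which is its default behaviour; the toy `ages` array and the `standardize` helper are made up for illustration.

```python
import numpy as np

# Hypothetical toy column (not the real 'age' column)
ages = np.array([58.0, 44.0, 33.0, 47.0, 33.0])

def standardize(x):
    # z = (x - mean) / std, using the population standard deviation (ddof=0)
    return (x - x.mean()) / x.std(ddof=0)

z = standardize(ages)

# After standardization the column has mean 0 and standard deviation 1
print(z.mean(), z.std())
```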


## Method
To perform classification, I used two methods based on linear regression.
* Basic linear regression
* Linear regression with L2 regularization (Ridge)

### Linear Regression
This method is quite simple.
First, we define the target value $t^{(i)}$ as below:
$$
t^{(i)} = 
    \begin{cases}
        +1 \quad (yes) \\
        -1 \quad (no) \\
    \end{cases}
$$

Then, we can define the linear regression model's output $y^{(i)}$ for the i-th sample as below:
$$
y^{(i)} = \boldsymbol{w}^T \boldsymbol{x}^{(i)} + w_0
$$

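If the output $y^{(i)} \gt 0$, the i-th sample $\boldsymbol{x}^{(i)}$ is classified as 'yes', otherwise as 'no'.
To handle all $n$ training samples at once, we stack them into a design matrix $\boldsymbol{X}$ (with a leading column of ones, so that $w_0$ is absorbed into $\boldsymbol{w}$) and write the model as: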
$$
\boldsymbol{y} = \boldsymbol{X} \boldsymbol{w}
$$
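
As a small illustration of the matrix form, the sketch below builds a design matrix for three made-up samples by prepending a column of ones, so that the first entry of $\boldsymbol{w}$ plays the role of $w_0$. All numbers and names are hypothetical.

```python
import numpy as np

# Three hypothetical samples with two features each
X_raw = np.array([[0.5, 1.0],
                  [-1.0, 2.0],
                  [2.0, -0.5]])

# Prepend a column of ones so the bias w_0 becomes the first weight
X = np.insert(X_raw, 0, 1, axis=1)

# Hypothetical weights: w_0 = 0.1, then one weight per feature
w = np.array([0.1, 0.4, -0.2])

# Model outputs for all samples at once: y = X w
y = X @ w

# Classify by the sign of the output
labels = np.where(y > 0, "yes", "no")
print(y, labels)
```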

The loss function $L(\boldsymbol{w})$ is the sum of squared errors between the targets $\boldsymbol{t} = (t^{(1)}, \dots, t^{(n)})^T$ and the model outputs:
$$
\begin{align}
L(\boldsymbol{w}) &= \sum_{i=1}^{n} (t^{(i)} - \boldsymbol{w}^T \boldsymbol{x}^{(i)})^2 \nonumber \\
                  &= \|\boldsymbol{t} - \boldsymbol{X} \boldsymbol{w} \|^2_2 \nonumber \\
                  &= (\boldsymbol{t} - \boldsymbol{X} \boldsymbol{w})^T (\boldsymbol{t} - \boldsymbol{X} \boldsymbol{w}) \nonumber
\end{align}
$$

This function is convex, so we can minimize it by setting its gradient with respect to $\boldsymbol{w}$ to zero. The optimal $\boldsymbol{w}^{\ast}$ is:
$$
\begin{align}
\boldsymbol{w}^{\ast} &= (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{t} \nonumber \\
                      &= \boldsymbol{X}^{\Phi} \boldsymbol{t} \nonumber
\end{align}
$$

Note: $\boldsymbol{X}^{\Phi}$ is the pseudo-inverse of $\boldsymbol{X}$.
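
For completeness, the closed form follows from one differentiation step. Writing $\boldsymbol{t}$ for the vector of targets $t^{(i)}$ that the loss compares against $\boldsymbol{X} \boldsymbol{w}$, and assuming $\boldsymbol{X}^T \boldsymbol{X}$ is invertible:
$$
\begin{align}
\nabla_{\boldsymbol{w}} L(\boldsymbol{w}) &= -2 \boldsymbol{X}^T (\boldsymbol{t} - \boldsymbol{X} \boldsymbol{w}) = \boldsymbol{0} \nonumber \\
\boldsymbol{X}^T \boldsymbol{X} \boldsymbol{w} &= \boldsymbol{X}^T \boldsymbol{t} \nonumber \\
\boldsymbol{w} &= (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{t} \nonumber
\end{align}
$$

When $\boldsymbol{X}^T \boldsymbol{X}$ is not invertible (for example, when a full set of one-hot columns sums to the column of ones), the pseudo-inverse still returns a least-squares solution, namely the one with the smallest norm.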
This method is implemented in the cell below.

In [4]:
# Basic Linear Regression
import numpy as np
from scipy import linalg

class MyLinearRegression:
    def fit(self, X, y):
        # concat [1, 1, 1, ..., 1] to first col
        # NOTE: phi_0(x) = 1
        X = np.insert(X, 0, 1, axis=1)
        
        # train weight vector 'w'
        self.w = linalg.pinv(X) @ y
        
        return self
        
    def predict(self, X):
        X = np.insert(X, 0, 1, axis=1)
        
        # predict 'y'
        return X @ self.w
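
As a sanity check on the pseudo-inverse formula, the sketch below compares it with NumPy's least-squares solver on synthetic data. The data, seed and variable names are made up for illustration; it is not part of the original experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 samples, 3 features, targets in {-1, +1}
X_raw = rng.normal(size=(50, 3))
t = np.where(X_raw[:, 0] + 0.5 * X_raw[:, 1] > 0, 1.0, -1.0)

# Design matrix with a leading column of ones for the bias
X = np.insert(X_raw, 0, 1, axis=1)

# Closed form via the pseudo-inverse
w_pinv = np.linalg.pinv(X) @ t

# Reference solution from NumPy's least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

print(np.allclose(w_pinv, w_lstsq))
```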

### Linear Regression with L2 regularization (Ridge)
The basic idea is the same as in basic linear regression.
However, in this method a penalty on the squared L2 norm of the weights is added to the loss function (with $\boldsymbol{t}$ denoting the vector of targets $t^{(i)}$):
$$
\begin{align}
L(\boldsymbol{w}) &= \|\boldsymbol{t} - \boldsymbol{X} \boldsymbol{w} \|^2_2 + \lambda \| \boldsymbol{w} \|^2_2 \nonumber
\end{align}
$$

This loss function is also convex, so we can easily calculate the optimal $\boldsymbol{w}^{\ast}$ as:
$$
\boldsymbol{w}^{\ast} = (\boldsymbol{X}^T \boldsymbol{X} + \lambda \boldsymbol{I})^{-1} \boldsymbol{X}^T \boldsymbol{t}
$$

This method is implemented in the cell below.

In [5]:
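# Linear Regression with L2 regularization (Ridge)
class MyRidge:
    def __init__(self, alpha):
        # alpha plays the role of lambda in the formula above
        self.alpha = alpha

    def fit(self, X, y):
        # prepend a column of ones (bias term), as in MyLinearRegression
        X = np.insert(X, 0, 1, axis=1)
        n, m = X.shape

        # w = (X^T X + alpha * I)^{-1} X^T y
        # NOTE: the identity covers all m columns, so the bias weight is penalized too
        T = linalg.inv(X.T @ X + self.alpha * np.eye(m))
        self.w = T @ X.T @ y

        return self

    def predict(self, X):
        X = np.insert(X, 0, 1, axis=1)
        return X @ self.w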
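
Because `np.eye(m)` spans every column of the augmented matrix, the class above shrinks the bias weight together with the feature weights. A common variant leaves the bias unpenalized by zeroing the first diagonal entry of the penalty matrix. The sketch below shows that variant on synthetic data; it is not the method used in this notebook, and `ridge_fit` and its `penalize_bias` flag are hypothetical names.

```python
import numpy as np

def ridge_fit(X_raw, t, alpha, penalize_bias=True):
    # Design matrix with a leading column of ones for the bias
    X = np.insert(X_raw, 0, 1, axis=1)
    penalty = np.eye(X.shape[1])
    if not penalize_bias:
        # Leave the bias weight (first column) out of the penalty
        penalty[0, 0] = 0.0
    # Solve (X^T X + alpha * P) w = X^T t without forming an explicit inverse
    return np.linalg.solve(X.T @ X + alpha * penalty, X.T @ t)

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(40, 3))
# Synthetic targets with a large constant offset, so the bias matters
t = 5.0 + X_raw @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

w_all = ridge_fit(X_raw, t, alpha=10.0, penalize_bias=True)
w_free = ridge_fit(X_raw, t, alpha=10.0, penalize_bias=False)

print("bias (penalized):  ", w_all[0])
print("bias (unpenalized):", w_free[0])
```

With standardized features and targets in {-1, +1} the difference is usually small, but it grows when the targets have a large mean.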

## Evaluation
I evaluated these methods with the following metrics:
* accuracy
* precision
* recall
* f1_measure

In addition, I inspected the weights of each model to see which features are important.
I implemented helper functions to calculate the metrics.

In [6]:
# Helper functions for evaluating binary classification
def my_confusion_matrix(y_true, y_pred):
    # rows: true label, columns: predicted label (0 = negative, 1 = positive)
    assert len(y_true) == len(y_pred)
    cm = np.zeros((2, 2))
    
    for t, p in zip(y_true, y_pred):
        cm[int(t)][int(p)] += 1
    
    return cm

def accuracy(confusion_matrix):
    tn, fp, fn, tp = confusion_matrix.flatten()
    return (tp + tn) / np.sum(confusion_matrix.flatten())

def precision(confusion_matrix):
    tn, fp, fn, tp = confusion_matrix.flatten()
    return tp / (fp + tp)
    
def recall(confusion_matrix):
    tn, fp, fn, tp = confusion_matrix.flatten()
    return tp / (tp + fn)
    
def f1_measure(confusion_matrix):
    tn, fp, fn, tp = confusion_matrix.flatten()
    return 2 * tp / (2 * tp + fp + fn)

def result(y_true, y_pred):
    cm = my_confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.flatten()
    print("True Negative: {:}".format(tn))
    print("False Positive: {:}".format(fp))
    print("False Negative: {:}".format(fn))
    print("True Positive: {:}".format(tp))
    
    print("Accuracy: {:}".format(accuracy(cm)))
    print("Precision: {:}".format(precision(cm)))
    print("Recall: {:}".format(recall(cm)))
    print("F1 Measure: {:}".format(f1_measure(cm)))

In [17]:
# Binary Classification by Linear Regression
from sklearn.model_selection import train_test_split

# prepare dataset
X = df.drop("y", axis=1).values
y = df["y"].apply(lambda x : 1 if x == 1 else -1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Linear Regression
lr = MyLinearRegression()
lr.fit(X_train, y_train)

# print result
# Accuracy may be high, but F1 measure is low
# This is caused by the imbalance of the target value's distribution
result(y_test > 0, lr.predict(X_test) > 0)

True Negative: 9829.0
False Positive: 209.0
False Negative: 898.0
True Positive: 367.0
Accuracy: 0.9020613996284172
Precision: 0.6371527777777778
Recall: 0.2901185770750988
F1 Measure: 0.39869636067354697
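
The printed metrics can be reproduced directly from the four confusion-matrix counts above, which also makes the gap between accuracy and recall easy to see. The sketch only re-applies the formulas from the helper functions to the printed numbers.

```python
# Confusion-matrix counts printed above for the linear regression model
tn, fp, fn, tp = 9829, 209, 898, 367
n_test = tn + fp + fn + tp

accuracy = (tp + tn) / n_test
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

# Number of actual positives in this test split
n_positive = tp + fn

print("Accuracy: {:.4f}".format(accuracy))
print("Precision: {:.4f}".format(precision))
print("Recall: {:.4f}".format(recall))
print("F1 Measure: {:.4f}".format(f1))
print("Positives found: {:} of {:}".format(tp, n_positive))
```

Only 367 of the 1265 actual positives are found, which is what the recall of about 0.29 expresses, while the 9829 correctly rejected negatives dominate the accuracy.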


In [19]:
# Check the weight of model
# Data columns
columns = df.drop("y", axis=1).columns

# Drop dummy weight
weights = lr.w[1:]

# Sort parameters by absolute weight
p = {c: w for c, w in zip(columns, weights)}
sorted(p.items(), key=lambda x: -abs(x[1]))

[('poutcome_success', 0.6158274360625065),
 ('poutcome_unknown', -0.3287547220489375),
 ('poutcome_failure', -0.2679854759092196),
 ('duration', 0.2512550126160996),
 ('poutcome_other', -0.20845932751808804),
 ('job_student', 0.16975677156039867),
 ('contact_unknown', -0.13700180158731082),
 ('housing', -0.11290531038183566),
 ('marital_married', -0.08804049012480837),
 ('job_retired', 0.07288804585339101),
 ('job_housemaid', -0.06861894141730332),
 ('job_unknown', -0.06729692889879584),
 ('loan', -0.06652175016655387),
 ('education_primary', -0.0649591898442443),
 ('marital_divorced', -0.06462868238230633),
 ('job_self-employed', -0.061901342117060304),
 ('education_secondary', -0.0583411528720032),
 ('job_entrepreneur', -0.057652770167016934),
 ('job_blue-collar', -0.052179035130341306),
 ('education_unknown', -0.04928884868736161),
 ('job_services', -0.041559474237584056),
 ('job_technician', -0.03722516590237627),
 ('marital_single', -0.03670291690662517),
 ('job_management', -0.03

In [10]:
# Binary Classification by Ridge
from sklearn.model_selection import train_test_split

# prepare dataset
X = df.drop("y", axis=1).values
y = df["y"].apply(lambda x : 1 if x == 1 else -1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# tune alpha with 5-fold cross-validation on the training set (average F1)
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
max_f1 = 0
max_alpha = -1
for alpha in np.arange(0.1, 10, 0.1):
    sum_f1, cnt = 0, 0
    for train_index, test_index in kf.split(X_train, y_train):
        rd = MyRidge(alpha=alpha)
        rd.fit(X_train[train_index], y_train[train_index])
        
        cm = my_confusion_matrix(y_train[test_index] > 0, rd.predict(X_train[test_index]) > 0)
        sum_f1 += f1_measure(cm)
        cnt += 1
    
    avg_f1 = sum_f1 / cnt
    
    if avg_f1 > max_f1:
        max_f1 = avg_f1
        max_alpha = alpha
        
print("max_alpha: {:}".format(max_alpha))
rd = MyRidge(alpha=max_alpha)
rd.fit(X_train, y_train)

result(y_test > 0, rd.predict(X_test) > 0)

max_alpha: 3.7
True Negative: 9784.0
False Positive: 200.0
False Negative: 926.0
True Positive: 393.0
Accuracy: 0.9003804299743431
Precision: 0.6627318718381113
Recall: 0.2979529946929492
F1 Measure: 0.4110878661087866


In [20]:
# Check the weight of model
# Data columns
columns = df.drop("y", axis=1).columns

# Drop dummy weight
weights = rd.w[1:]

p = {c: w for c, w in zip(columns, weights)}
sorted(p.items(), key=lambda x: -abs(x[1]))

[('poutcome_success', 0.6228879992694635),
 ('poutcome_unknown', -0.3291813335040853),
 ('poutcome_failure', -0.2700236848888709),
 ('duration', 0.24571032035074128),
 ('poutcome_other', -0.2102677427616877),
 ('job_student', 0.15405507953897293),
 ('contact_unknown', -0.13432747474342605),
 ('housing', -0.11421447391798759),
 ('job_retired', 0.09523436648962919),
 ('marital_married', -0.09027604978269242),
 ('education_primary', -0.07153666533912495),
 ('job_housemaid', -0.07005125707214198),
 ('loan', -0.06715489512380089),
 ('marital_divorced', -0.06638007302269841),
 ('job_entrepreneur', -0.06549087943112036),
 ('education_secondary', -0.061310527201694075),
 ('job_unknown', -0.052889647154542946),
 ('job_self-employed', -0.05036845820777207),
 ('job_blue-collar', -0.049784802918973685),
 ('job_services', -0.04409358988498605),
 ('job_technician', -0.040206868495787376),
 ('education_unknown', -0.037952651670127145),
 ('job_management', -0.03434153783716415),
 ('job_unemployed', -0

## Analysis
### Indicators
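As the cells above show, accuracy is high (about 0.90 for both models), but the F1 measure is low (about 0.40 and 0.41).
This is caused by the imbalance of the target distribution: only 5289 of the 45211 clients (about 12%) subscribed, so a model can reach high accuracy while missing most of the positive cases (recall is about 0.29).
To address this, an appropriate sampling method should be applied before training the model.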
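
One simple option is random undersampling of the majority class. The sketch below illustrates that idea on synthetic data; it was not run in this notebook, and `undersample` is a hypothetical helper that assumes labels in {-1, +1} as used above.

```python
import numpy as np

def undersample(X, y, rng):
    # Indices of the positive (+1) and negative (-1) samples
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == -1)
    # Shrink the larger class to the size of the smaller one
    n_keep = min(len(pos_idx), len(neg_idx))
    pos_keep = rng.choice(pos_idx, size=n_keep, replace=False)
    neg_keep = rng.choice(neg_idx, size=n_keep, replace=False)
    # Shuffle so the two classes are interleaved
    keep = rng.permutation(np.concatenate([pos_keep, neg_keep]))
    return X[keep], y[keep]

rng = np.random.default_rng(0)
# Synthetic imbalanced labels: roughly 12% positives, similar to this dataset
y_toy = np.where(rng.random(1000) < 0.12, 1, -1)
X_toy = rng.normal(size=(1000, 4))

X_bal, y_bal = undersample(X_toy, y_toy, rng)
print((y_bal == 1).sum(), (y_bal == -1).sum())
```

In practice this would be applied to `X_train` and `y_train` only, leaving the test split untouched so that the metrics still reflect the original class distribution.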

### Weights
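The three most influential features (those with the largest absolute weights) are all derived from 'poutcome'.
Since the 'poutcome' column indicates the outcome of the previous marketing campaign, these weights make sense: a client for whom the previous campaign succeeded (poutcome_success, weight about +0.62) is more likely to subscribe again.
I think this fact is interesting.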