# Logistic Regression Model Creation for Sales Target Classification

This notebook is a simple implementation of logistic regression that predicts the likelihood of product acceptance from customer and call data. No feature engineering was done to improve model fit, and no resampling was applied to address class imbalance.

### Import Packages

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt 
from sklearn.linear_model import LogisticRegression

### Load Data

In [2]:
data = pd.read_csv('bank_marketing_data.csv', delimiter=';', header=0)
data = data.dropna()
print(data.shape)
print(list(data.columns))

(4521, 17)
['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'target']


### Data Metrics

In [3]:
data.head()

,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,target
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


In [4]:
data['target'].value_counts()

no     4000
yes     521
Name: target, dtype: int64

In [23]:
data.groupby('target').mean()

target,age,balance,day,duration,campaign,pdays,previous
no,40.998,1403.21175,15.94875,226.3475,2.86225,36.006,0.47125
yes,42.491363,1571.955854,15.658349,552.742802,2.266795,68.639155,1.090211


### Convert Categories to Numeric

In [5]:
cat_vars = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
# One-hot encode each categorical column and append the dummy columns to the frame.
for var in cat_vars:
    cat_list = pd.get_dummies(data[var], prefix=var)
    data = pd.concat([data, cat_list], axis=1)

data_vars = data.columns.values.tolist()
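
The loop above can also be written as a single `pd.get_dummies` call using its `columns` argument, which encodes the listed columns and drops the originals in one step. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset) to show the resulting column layout.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the bank marketing data.
toy = pd.DataFrame({
    'age': [30, 33, 35],
    'job': ['services', 'admin.', 'services'],
    'marital': ['married', 'single', 'married'],
})

# Encode the listed columns; the original 'job' and 'marital' columns are dropped.
encoded = pd.get_dummies(toy, columns=['job', 'marital'])
print(list(encoded.columns))
```

With this form, the separate step that filters out the original categorical columns is no longer needed.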

### Remove Non-Numeric Categories and Convert y to Numeric

In [7]:
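to_keep = [i for i in data_vars if i not in cat_vars]
# Copy so the assignment below modifies data_final rather than a view of data.
data_final = data[to_keep].copy()
# Encode the target as 1 for 'yes' and 0 for 'no'.
data_final['target'] = data_final['target'].eq('yes').astype(int)
data_final.columns.values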

array(['age', 'balance', 'day', 'duration', 'campaign', 'pdays',
       'previous', 'target', 'job_admin.', 'job_blue-collar',
       'job_entrepreneur', 'job_housemaid', 'job_management',
       'job_retired', 'job_self-employed', 'job_services', 'job_student',
       'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'education_primary', 'education_secondary', 'education_tertiary',
       'education_unknown', 'default_no', 'default_yes', 'housing_no',
       'housing_yes', 'loan_no', 'loan_yes', 'contact_cellular',
       'contact_telephone', 'contact_unknown', 'month_apr', 'month_aug',
       'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'poutcome_failure', 'poutcome_other', 'poutcome_success',
       'poutcome_unknown'], dtype=object)

### Define X and y Matrices

In [8]:
# X holds every column except the target; y is the target as a 1-D array.
X = data_final.loc[:, data_final.columns != 'target']
y = data_final['target'].values
print(X.head())
print(y)

   age  balance  day  duration  campaign  pdays  previous  job_admin.  \
0   30     1787   19        79         1     -1         0           0   
1   33     4789   11       220         1    339         4           0   
2   35     1350   16       185         1    330         1           0   
3   30     1476    3       199         4     -1         0           0   
4   59        0    5       226         1     -1         0           0   

   job_blue-collar  job_entrepreneur        ...         month_jun  month_mar  \
0                0                 0        ...                 0          0   
1                0                 0        ...                 0          0   
2                0                 0        ...                 0          0   
3                0                 0        ...                 1          0   
4                1                 0        ...                 0          0   

   month_may  month_nov  month_oct  month_sep  poutcome_failure  \
0          0 

### Split Training and Test Set

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
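
Because only about 12% of the records are positive, a purely random split can leave the train and test sets with slightly different class ratios. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the ratio the same in both splits. A minimal sketch on made-up labels (`X_toy` and `y_toy` are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 10 of them positive (illustrative only).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy keeps the 10% positive rate in both the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)
print(y_tr.sum(), y_te.sum())
```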

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

### Train Logistic Regression Model

In [11]:
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

### Generate Predictions and Test-set Accuracy

In [12]:
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.90


### Precision and Recall on Test Data

In [14]:
from sklearn.metrics import confusion_matrix
cmat = confusion_matrix(y_test, y_pred)

In [15]:
#print(cmat)
print('TN - True Negative {}'.format(cmat[0,0]))
print('FP - False Positive {}'.format(cmat[0,1]))
print('FN - False Negative {}'.format(cmat[1,0]))
print('TP - True Positive {}'.format(cmat[1,1]))
print('Accuracy Rate: {}'.format(np.divide(np.sum([cmat[0,0],cmat[1,1]]),np.sum(cmat))))
print('Misclassification Rate: {}'.format(np.divide(np.sum([cmat[0,1],cmat[1,0]]),np.sum(cmat))))

TN - True Negative 1157
FP - False Positive 33
FN - False Negative 109
TP - True Positive 58
Accuracy Rate: 0.8953574060427414
Misclassification Rate: 0.10464259395725865
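
Accuracy is easier to judge against a trivial baseline. The short calculation below, using only the counts printed above, compares the model with a classifier that always predicts the negative class; the variable names are illustrative.

```python
# Confusion-matrix counts printed above.
tn, fp, fn, tp = 1157, 33, 109, 58
total = tn + fp + fn + tp

# Accuracy of the fitted model: correct predictions over all predictions.
model_accuracy = (tn + tp) / total

# Accuracy of always predicting "no": every actual negative is correct.
baseline_accuracy = (tn + fp) / total

print(round(model_accuracy, 3), round(baseline_accuracy, 3))
```

The model improves on the always-negative baseline by roughly two percentage points, which is why precision and recall are more informative here than accuracy alone.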


In [68]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.91      0.97      0.94      1190
           1       0.64      0.35      0.46       167

   micro avg       0.90      0.90      0.90      1357
   macro avg       0.78      0.66      0.70      1357
weighted avg       0.88      0.90      0.88      1357



$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$

$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$
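
As a check, the formulas can be applied directly to the confusion-matrix counts reported earlier. This is a small sketch with hypothetical helper names (`precision`, `recall`), not part of the original notebook.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the model finds."""
    return tp / (tp + fn)


# Counts for the positive class from the confusion matrix above.
tp, fp, fn = 58, 33, 109
print(round(precision(tp, fp), 2), round(recall(tp, fn), 2))
```

Both values agree with the precision and recall shown for class 1 in the classification report.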

### Conclusion

The original dataset had 4521 records, 4000 negative and 521 positive.

The test data contained 1357 records, 1190 negative and 167 positive.
A precision of 64% on the test data means that roughly two out of three customers the model predicts will accept the offer actually do. Recall is low at only 35%, so the model misses about 65% of the customers who would accept.

The model could be improved through feature engineering and by applying SMOTE sampling to balance the dataset.
Further, an investigation into bias and variance could be conducted.
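
SMOTE itself is provided by the separate imbalanced-learn package and synthesizes new minority samples by interpolation. As a simpler stand-in, the sketch below balances the classes by random oversampling, that is, by duplicating randomly chosen minority rows. The function name `random_oversample` and the toy arrays are hypothetical; in practice, resampling would be applied to the training split only so that the test set keeps its original distribution.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate random minority-class rows until both classes are the same size.

    Assumes a binary label array; this is plain random oversampling, not SMOTE.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    # Draw, with replacement, the indices of minority rows to duplicate.
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]


# Toy training data with an 8:2 class imbalance (illustrative only).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))
```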