# Building a Logistic Regression Model

First, do the necessary imports.

In [25]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# make predictions
from pandas import read_csv
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVC

Second, read the .csv files and examine the columns, particularly the target column, and check whether each column is categorical or numerical. Some exploratory data analysis belongs at this stage.

In [26]:
#Read the csvs
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
train.head()

Unnamed: 0,ID,USERID,EVENTID,TIMESTAMP,LABEL
0,0,0,0,0,0
1,1,0,1,6,0
2,2,0,2,41,0
3,3,0,1,49,0
4,4,0,2,51,0


In [27]:
test.head()

Unnamed: 0,ID,USERID,EVENTID,TIMESTAMP,LABEL
0,391749,6751,8,2497257,
1,391750,5059,26,2497260,
2,391752,2816,49,2497261,
3,391752,2816,49,2497261,
4,391753,6751,7,2497261,


In [28]:
count_open = len(train[train['LABEL']==0])
count_closed = len(train[train['LABEL']==1])
print("Labels with 0:" + str(count_open))
print("Labels with 1:"+str(count_closed))

Labels with 0:387850
Labels with 1:3899


In [29]:
#Find the percentage of 0 in the label column of the training dataset
count_open = len(train[train['LABEL']==0])
count_closed = len(train[train['LABEL']==1])
pct_of_open = count_open/(count_open +count_closed)
print("percentage of open is", pct_of_open*100)

percentage of open is 99.00471985888923


The value counts of the 'LABEL' column in train.csv show that roughly 99% of all entries are open (value '0'). The data is therefore heavily imbalanced, which is not ideal because most machine learning algorithms work better when the classes contain roughly equal numbers of samples.
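The same class shares can be read off in one call with `value_counts(normalize=True)`. The snippet below is a minimal sketch on a toy frame (the `toy` data is hypothetical and only mimics the 99:1 ratio of train.csv).

```python
import pandas as pd

# Toy stand-in for train.csv with a 99:1 class ratio (hypothetical data)
toy = pd.DataFrame({"LABEL": [0] * 99 + [1] * 1})

# Absolute counts and relative shares per class
counts = toy["LABEL"].value_counts()
shares = toy["LABEL"].value_counts(normalize=True)

print(counts)
print((shares * 100).round(2))
```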

The next step is to clean the data, in particular to check for missing values; the training set has none. In test.csv every 'LABEL' value is 'NaN', because these are the values the model has to fill in. All the columns of the dataframe are also numerical.

In [30]:
#Find number of columns in train that have missing values, which is none of them
missing_columns=train.isnull().sum()
print(missing_columns[missing_columns>0])

Series([], dtype: int64)


In [31]:
missing_columns=test.isnull().sum()
print(missing_columns[missing_columns>0])

LABEL    20000
dtype: int64


In [32]:
#All the columns of the training set are numerical
train.dtypes

ID           int64
USERID       int64
EVENTID      int64
TIMESTAMP    int64
LABEL        int64
dtype: object

To balance the classes, the minority class is up-sampled (sampled with replacement) until it matches the size of the majority class.

In [33]:
# Separate majority and minority classes
df_majority = train[train.LABEL==0]
df_minority = train[train.LABEL==1]
 
# Upsample minority
df_minority_upsampled = resample(df_minority,
                                 replace=True,                # sample with replacement
                                 n_samples=len(df_majority),  # 387850, the size of the majority class
                                 random_state=120)            # fixed seed for reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
 
# Display new class counts
df_upsampled.LABEL.value_counts()

0    387850
1    387850
Name: LABEL, dtype: int64
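One caveat with up-sampling before the train/validation split: copies of the same minority row can land in both sets, which tends to make validation scores optimistic. The sketch below shows one possible alternative, splitting first and up-sampling only the training part. It runs on synthetic data; the frame `toy` and its values are assumptions for illustration, not the real train.csv.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for train.csv with exactly 1% positives (hypothetical values)
rng = np.random.default_rng(0)
n = 10_000
labels = np.zeros(n, dtype=int)
labels[rng.choice(n, size=100, replace=False)] = 1
toy = pd.DataFrame({
    "USERID": rng.integers(0, 500, n),
    "EVENTID": rng.integers(0, 50, n),
    "TIMESTAMP": np.sort(rng.integers(0, 2_500_000, n)),
    "LABEL": labels,
})

# 1. Hold out a validation set first, keeping the original class ratio
train_part, valid_part = train_test_split(
    toy, test_size=0.2, stratify=toy["LABEL"], random_state=0
)

# 2. Up-sample the minority class inside the training part only
majority = train_part[train_part["LABEL"] == 0]
minority = train_part[train_part["LABEL"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=120)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["LABEL"].value_counts())  # balanced training classes
print(valid_part["LABEL"].value_counts())      # validation keeps the real imbalance
```

With this ordering the validation set keeps the original class ratio, so accuracy alone becomes a weak measure there and precision and recall matter more.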

For the training data, separate the target from the features: 'LABEL' becomes the target y and is dropped from the feature matrix X. The empty 'LABEL' column is dropped from the test set as well.

Note that 'ID' is an identifier rather than a feature. The code below nevertheless leaves it in X, so it enters the model as a feature.

In [34]:
#Drop the 'Label' column in the training set, X, and leave only the 'Label' column for the target, y
y = df_upsampled.LABEL
X=df_upsampled.drop(['LABEL'], axis=1)
test.drop(['LABEL'], axis=1, inplace=True)

Once the features are selected, implement the logistic regression model. Logistic regression is chosen because the task is binary classification. Other machine learning algorithms that could have been chosen are k-nearest neighbours, decision trees, and support vector machines (SVM).

The next step is to split the dataset into a training set and a validation set. X_train and y_train hold the training features and target; the validation features and target are named X_test and y_test in the code below. Holding out a validation set makes it possible to evaluate the model on data it was not fitted on and to detect overfitting. 80% of the dataset goes to the training set and 20% to the validation set.

The model is then defined and fitted on the training data.

In [35]:
#Split and fit the X,y data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [36]:
logreg =  LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression()

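At this stage, analyze the features and compute some statistics to evaluate them. All the p-values of the features are < 0.05, so every feature is statistically significant. With 775,700 observations, however, even very small effects reach significance, and the pseudo R-squared of 0.035 indicates that the features explain little of the variation in the label.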

In [37]:
#Get stats on the features
import statsmodels.api as sm
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.669142
         Iterations 4
                          Results: Logit
Model:              Logit            Pseudo R-squared: 0.035       
Dependent Variable: LABEL            AIC:              1038114.6070
Date:               2021-09-27 23:12 BIC:              1038160.8531
No. Observations:   775700           Log-Likelihood:   -5.1905e+05 
Df Model:           3                LL-Null:          -5.3767e+05 
Df Residuals:       775696           LLR p-value:      0.0000      
Converged:          1.0000           Scale:            1.0000      
No. Iterations:     4.0000                                         
--------------------------------------------------------------------
                Coef.   Std.Err.     z      P>|z|    [0.025   0.975]
--------------------------------------------------------------------
ID              0.0000    0.0000   55.7741  0.0000   0.0000   0.0000
USERID          0.0002    0.0000  156

In [38]:
#Use the validation set to measure the accuracy of the classifier
y_pred = logreg.predict(X_test)
score_valid=mean_absolute_error(y_test,y_pred)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.58


The test predictions are then generated and written to a .csv file.

In [39]:
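#Use the model to predict on the test .csv file
y_pred = logreg.predict(test)
#Take the IDs from the file itself: they are not consecutive, so index+391749 would mislabel rows
submission=pd.DataFrame({'ID':test['ID'].values,'LABEL':y_pred})
submission.to_csv('result.csv',index=False)
#Attach the predictions row by row; merging on 'ID' would multiply rows that share an ID
df3 = test.assign(LABEL=y_pred)
df3.set_index('ID', inplace = True)
df3.to_csv('test.csv')  #note: this overwrites the original test.csv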

In [40]:
#Count the number of 0 and 1 in the result csv file
submission=pd.read_csv('result.csv')
submission['LABEL'].value_counts()

0    12558
1     7442
Name: LABEL, dtype: int64

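This is not very plausible: only roughly 1% of the values in the training 'LABEL' column are '1', whereas the model predicts '1' for 7,442 of the 20,000 test rows (about 37%). Next, a confusion matrix is created to evaluate the performance of the classification model [1]. In scikit-learn's output the rows are the actual classes and the columns are the predicted classes, each ordered 0 then 1. The matrix contains 6290+3737=10,027 agreements and 6268+3705=9,973 disagreements.

Note, however, that y_pred at this point holds the predictions for the unlabelled test.csv, while y_test holds validation labels, so the two do not refer to the same rows. The resulting accuracy of 0.50 reflects chance agreement rather than model performance; the validation accuracy of 0.58 reported earlier is the meaningful figure.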

In [41]:
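#Create the confusion matrix
#Note: y_pred currently holds the predictions for test.csv (cell 39), not for the validation set
y_test=y_test[:20000]
cm = confusion_matrix(y_test, y_pred)  #separate name, so the sklearn function is not overwritten
print(cm)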

[[6290 3705]
 [6268 3737]]


In [42]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.50      0.63      0.56      9995
           1       0.50      0.37      0.43     10005

    accuracy                           0.50     20000
   macro avg       0.50      0.50      0.49     20000
weighted avg       0.50      0.50      0.49     20000
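A minimal sketch of how the confusion matrix and report could be computed from validation predictions that are kept under their own name, so that labels and predictions refer to the same rows. The data is synthetic (`make_classification`), and the names `X_toy`, `y_toy`, `y_val_pred` and `cm` are assumptions for illustration, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for X and y (hypothetical, not the real dataset)
X_toy, y_toy = make_classification(n_samples=2000, n_features=4, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Validation predictions get their own name and are never overwritten
y_val_pred = model.predict(X_val)

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_val, y_val_pred)
print(cm)
print(classification_report(y_val, y_val_pred))
```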



The precision is true_positive/(true_positive+false_positive) and indicates the ability of the model not to label a negative sample as positive. Another important metric is recall, which is true_positive/(true_positive+false_negative) and measures the ability of the classifier to find all the positive samples. Both should be considered when evaluating the overall model.
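As a worked check, the two formulas can be applied by hand to the confusion matrix printed above. The sketch assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and reproduces the class-1 row of the classification report.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
cm = np.array([[6290, 3705],
               [6268, 3737]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of everything predicted 1, the share that is truly 1
recall = tp / (tp + fn)     # of everything truly 1, the share that was found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.50 recall=0.37 f1=0.43
```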

Given these weak results, another machine learning algorithm, such as KNN or a Random Forest classifier, should be tried instead.

# Question 2

If I had more information about each event and customer, I would use it to group customer types and event types for further feature analysis and feature engineering. Some event types, such as a customer interacting with a website, might correlate more strongly with a customer closing their account. I could also create new features from the new information, for example features that capture whether particular combinations of user and event are especially informative.

If I had more time, I would also consider algorithms other than logistic regression and KNN, such as decision trees and support vector machines (SVM), which are also effective for binary classification. I would also consider other ways to handle the class imbalance, such as down-sampling the majority class and using tree-based algorithms such as the Random Forest classifier [2].
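The two imbalance options mentioned above could look roughly like this. It is a sketch on synthetic data: the frame `toy`, its column values, and the choice of `class_weight="balanced"` for the forest are assumptions for illustration, not something that was run on the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

# Synthetic stand-in for train.csv with exactly 1% positives (hypothetical values)
rng = np.random.default_rng(0)
n = 5000
labels = np.zeros(n, dtype=int)
labels[rng.choice(n, size=50, replace=False)] = 1
toy = pd.DataFrame({
    "USERID": rng.integers(0, 500, n),
    "EVENTID": rng.integers(0, 50, n),
    "LABEL": labels,
})

# Option A: down-sample the majority class to the size of the minority class
majority = toy[toy["LABEL"] == 0]
minority = toy[toy["LABEL"] == 1]
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=120)
toy_down = pd.concat([majority_down, minority])
print(toy_down["LABEL"].value_counts())

# Option B: keep every row and let the forest reweight the classes instead
forest = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                random_state=0)
forest.fit(toy[["USERID", "EVENTID"]], toy["LABEL"])
```

Down-sampling discards most majority rows, which keeps training fast but throws information away; class weighting keeps all rows at the cost of a larger fit.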

Another way to improve my analysis is to tune the hyperparameters of the machine learning algorithm, for example the k-value for k-nearest neighbours [3]. Other methods include bagging, which fits models on many small bootstrap samples and averages their predictions, and boosting, which adjusts the weight of each observation based on the results of the previous classification, thereby strengthening the predictive model [4]. Bagging mainly reduces the variance of the model, while boosting mainly reduces its bias.
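A rough sketch of these three ideas with scikit-learn, again on synthetic data. The candidate k-values, the estimator counts, and the choice of `AdaBoostClassifier` as the boosting example are illustrative assumptions rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data (hypothetical, not the real dataset)
X_toy, y_toy = make_classification(n_samples=600, n_features=4, random_state=0)

# Tune k for k-nearest neighbours with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": [3, 5, 11, 21]}, cv=5)
search.fit(X_toy, y_toy)
print("best k:", search.best_params_["n_neighbors"])

# Bagging: decision trees fitted on bootstrap samples, combined by voting
bagging = BaggingClassifier(n_estimators=25, random_state=0)
# Boosting: models fitted in sequence, reweighting misclassified observations
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

results = {}
for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X_toy, y_toy, cv=5)
    results[name] = scores.mean()
    print(name, round(results[name], 3))
```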

# Citations

[1] S. Li, “Building a logistic regression in Python, step by step,” Medium, 27-Feb-2019. [Online]. Available: https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8. [Accessed: 27-Sep-2021]. 

[2] “How to handle imbalanced classes in machine learning,” EliteDataScience, 23-May-2020. [Online]. Available: https://elitedatascience.com/imbalanced-classes. [Accessed: 26-Sep-2021]. 

[3] S. Ray, “How to increase accuracy of machine learning model,” Analytics Vidhya, 26-Jun-2020. [Online]. Available: https://www.analyticsvidhya.com/blog/2015/12/improve-machine-learning-results/. [Accessed: 26-Sep-2021]. 

[4] T. Srivastava, “Ensemble learning: Ensemble learning techniques,” Analytics Vidhya, 26-Jun-2020. [Online]. Available: https://www.analyticsvidhya.com/blog/2015/08/introduction-ensemble-learning/. [Accessed: 26-Sep-2021]. 