## Logistic Regression 


In this notebook we will train a [Logistic Regression model](https://en.wikipedia.org/wiki/Logistic_regression) to distinguish between legitimate and fraudulent transactions. 

Logistic Regression is a classic statistical technique used for binary classification. Here the binary variable we are predicting is 'legitimate' or 'not legitimate' (i.e. fraudulent).
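As a rough illustration of what such a model computes (a sketch, not the model trained in this notebook): logistic regression turns a weighted sum of the features into a probability using the logistic (sigmoid) function, and predicts the positive class when that probability reaches a threshold. The weights, bias, feature values, and the 0.5 threshold below are made-up values for illustration only.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical parameters, chosen only for illustration.
weights = [0.8, -1.5]
bias = -0.2

# Hypothetical feature vector for a single transaction.
p = predict_proba([2.0, 0.5], weights, bias)
label = "fraud" if p >= 0.5 else "legitimate"
print(f"{p:.3f} -> {label}")
```

Training a logistic regression model amounts to choosing the weights and bias so that these probabilities match the labels in the training data as closely as possible.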

We begin by loading in our generated data.

In [2]:
import numpy as np
import pandas as pd
df = pd.read_parquet("fraud-cleaned-sample.parquet")

We need to split our data set into training and testing sets:

In [3]:
from sklearn import model_selection
train, test = model_selection.train_test_split(df, random_state=43)
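Fraud is rare in this data set, so a purely random split can leave the test set with a slightly different fraud rate from the training set. One option, which this notebook does not use, is to pass `stratify` to `train_test_split` so that both splits keep the same class proportions. The sketch below assumes synthetic labels (1,000 rows, 2% fraud) rather than the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data: 1,000 rows, 2% labelled "fraud".
labels = np.array(["fraud"] * 20 + ["legitimate"] * 980)
rows = np.arange(len(labels))

# The default test_size is 0.25; stratify keeps class proportions in both splits.
train_rows, test_rows, train_labels, test_labels = train_test_split(
    rows, labels, random_state=43, stratify=labels
)

train_rate = (train_labels == "fraud").mean()
test_rate = (test_labels == "fraud").mean()
print(len(train_rows), len(test_rows), train_rate, test_rate)
```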

In [4]:
test.head()

| Unnamed: 0 | timestamp | label | user_id | amount | merchant_id | trans_type | foreign | interarrival |
|---|---|---|---|---|---|---|---|---|
| 8872909 | 1616524579 | legitimate | 1776 | 11.6 | 7244 | swipe | False | 5304.0 |
| 15366435 | 1641349585 | legitimate | 3080 | 19.110001 | 11425 | contactless | False | 46935.0 |
| 37641047 | 1588775305 | legitimate | 7517 | 28.08 | 10825 | swipe | False | 7771.0 |
| 30004196 | 1602422079 | legitimate | 5998 | 3.29 | 457 | online | False | 49994.0 |
| 2634506 | 1602456788 | legitimate | 525 | 9.91 | 4987 | swipe | False | 7653.0 |


We also load in the feature engineering pipeline stage which we developed in [notebook 2](02-feature-engineering.ipynb). The model takes the feature vectors as input, rather than the raw data.

In [1]:
import cloudpickle as cp
# Load the feature pipeline saved in notebook 2; `with` closes the file afterwards.
with open('feat_pipeline.pkl', 'rb') as f:
    feature_pipeline = cp.load(f)



In [None]:
feature_pipeline

In [None]:
feature_pipeline[1]

#### Dealing with Imbalanced Classes

When the classes in a training data set are unequally represented, we say that we are dealing with 'imbalanced classes'. In our data set fewer than 2% of the samples are fraudulent, and the remaining 98% are legitimate, so our classes are heavily imbalanced.

This causes problems for a few reasons:
1. A model which classifies all transactions as 'legitimate' would be correct 98% of the time. This high accuracy can trick you into thinking that your model is working well, despite it just returning 'legitimate' for every sample it sees. 
2. Even if your model tries to learn patterns in the data, it may struggle to learn from the 'fraudulent' data since there simply isn't enough of it.
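The first problem can be made concrete with a small sketch. The labels below are synthetic, chosen to mirror the roughly 98% / 2% split described above, and the "model" is a hypothetical one that always answers 'legitimate'.

```python
# Synthetic labels mirroring the roughly 98% / 2% split described above.
y_true = ["legitimate"] * 98 + ["fraud"] * 2

# A "model" that ignores its input and always answers 'legitimate'.
y_pred = ["legitimate"] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the fraud class: the share of true frauds that were caught.
fraud_total = sum(t == "fraud" for t in y_true)
fraud_caught = sum(t == "fraud" and p == "fraud" for t, p in zip(y_true, y_pred))
fraud_recall = fraud_caught / fraud_total

print(f"accuracy={accuracy:.2f}, fraud recall={fraud_recall:.2f}")
```

The accuracy looks excellent while the fraud recall is zero, which is why accuracy alone is a poor guide on imbalanced data.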


There are a few approaches we could take to tackle the problem, and today we will use two of them: 
1. We will use metrics which are more informative than simply counting how often the model makes a correct prediction. 
2. We will weight each sample in inverse proportion to the frequency of its label within the data set. These weights are passed to the logistic regression model during training, so that misclassifying a sample from the rare class is penalised more heavily than misclassifying a sample from the common class.


In this next cell we compute these weights for each of the data labels. 

In [5]:
import warnings
warnings.filterwarnings('ignore')

In [6]:
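# Fraction of training samples labelled "fraud".
fraud_frequency = train[train["label"] == "fraud"]["timestamp"].count() / train["timestamp"].count()
# Weight each class by the frequency of the other class, so the rare class counts for more.
train.loc[train["label"] == "legitimate", "weights"] = fraud_frequency
train.loc[train["label"] == "fraud", "weights"] = (1 - fraud_frequency)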
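A quick sanity check on this weighting scheme, using made-up class counts rather than the notebook's data: weighting each class by the frequency of the other class makes both classes contribute the same total weight, and the ratio between the two weights is the same as for inverse-frequency weights.

```python
import math

# Hypothetical class counts, chosen only for illustration.
n_fraud = 20
n_legit = 980
n_total = n_fraud + n_legit

fraud_frequency = n_fraud / n_total

# Same scheme as the cell above: each class is weighted by the other's frequency.
legit_weight = fraud_frequency
fraud_weight = 1 - fraud_frequency

# Total weight contributed by each class during training.
total_legit = n_legit * legit_weight
total_fraud = n_fraud * fraud_weight

# Inverse-frequency weights have the same fraud-to-legitimate ratio.
inverse_ratio = (n_total / n_fraud) / (n_total / n_legit)

print(total_legit, total_fraud, fraud_weight / legit_weight, inverse_ratio)
```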


We're now ready to train our Logistic Regression model. The model is trained on the feature vectors (generated using our `feature_pipeline` from the previous notebook), and we pass the per-sample weights computed above to `fit` through its `sample_weight` argument.

In [7]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=500)

svecs = feature_pipeline.fit_transform(train)
lr.fit(svecs, train["label"], sample_weight=train["weights"])

LogisticRegression(max_iter=500)
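To see what `sample_weight` changes, here is a minimal sketch on synthetic one-feature data (not the notebook's feature vectors). It fits one model without weights and one with the weighting scheme used above, then compares recall on the fraud class. The data generator, seed, and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# Synthetic one-feature data: 980 legitimate samples and 20 fraud samples
# whose feature values are shifted upwards.
X = np.concatenate([rng.normal(0.0, 1.0, 980), rng.normal(2.0, 1.0, 20)]).reshape(-1, 1)
y = np.array(["legitimate"] * 980 + ["fraud"] * 20)

# Same weighting scheme as above: each class is weighted by the other's frequency.
fraud_frequency = (y == "fraud").mean()
weights = np.where(y == "fraud", 1 - fraud_frequency, fraud_frequency)

plain = LogisticRegression(max_iter=500).fit(X, y)
weighted = LogisticRegression(max_iter=500).fit(X, y, sample_weight=weights)

# Recall on the fraud class, measured on the training data for brevity.
recall_plain = recall_score(y, plain.predict(X), pos_label="fraud")
recall_weighted = recall_score(y, weighted.predict(X), pos_label="fraud")
print(recall_plain, recall_weighted)
```

On data like this, weighting typically moves the decision boundary towards the majority class, so more fraud is caught at the cost of more false alarms.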

In [None]:
# Predict the label of the first test row, selected as a one-row DataFrame.
model_pipeline.predict(test.iloc[[0]])

In [20]:
row = test.head(1)

In [22]:
row

| Unnamed: 0 | timestamp | label | user_id | amount | merchant_id | trans_type | foreign | interarrival |
|---|---|---|---|---|---|---|---|---|
| 8872909 | 1616524579 | legitimate | 1776 | 11.6 | 7244 | swipe | False | 5304.0 |


We need to validate our model to check how well it performs on data it wasn't trained on. We use the model we just trained to make predictions for the data in our test set, and compare those predictions to the truth. 



In [None]:
model_pipeline.predict(row)

In [18]:
preprocessor = feature_pipeline.steps[0][1]

In [19]:
from sklearn.pipeline import Pipeline

model_pipeline = Pipeline(
    steps=[('preprocessing', preprocessor),
           ('logistic_regression', lr)]
    )
model_pipeline

Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('interarrival_scaler',
                                                  Pipeline(steps=[('median_imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('interarrival_scaler',
                                                                   RobustScaler())]),
                                                  ['interarrival']),
                                                 ('amount_scaler',
                                                  RobustScaler(), ['amount']),
                                                 ('onehot',
                                                  OneHotEncoder(categories=[['online',
                                                                             'contactless',
                                                                            

In [11]:
help(feature_pipeline)

Help on Pipeline in module sklearn.pipeline object:

class Pipeline(sklearn.utils.metaestimators._BaseComposition)
 |  Pipeline(steps, *, memory=None, verbose=False)
 |  
 |  Pipeline of transforms with a final estimator.
 |  
 |  Sequentially apply a list of transforms and a final estimator.
 |  Intermediate steps of the pipeline must be 'transforms', that is, they
 |  must implement fit and transform methods.
 |  The final estimator only needs to implement fit.
 |  The transformers in the pipeline can be cached using ``memory`` argument.
 |  
 |  The purpose of the pipeline is to assemble several steps that can be
 |  cross-validated together while setting different parameters.
 |  For this, it enables setting parameters of the various steps using their
 |  names and the parameter name separated by a '__', as in the example below.
 |  A step's estimator may be replaced entirely by setting the parameter
 |  with its name to another estimator, or a transformer removed by setting
 |  it

In [None]:
from sklearn.metrics import classification_report

# Reuse the pipeline fitted on the training data; do not refit it on the test set.
predictions = lr.predict(feature_pipeline.transform(test))
print(classification_report(test.label.values, predictions))


The report shows that the model is performing okay, but is much better at identifying legitimate transactions than fraudulent ones. 

We can visualise the accuracy of classifications in a binary confusion matrix.

In [None]:
from mlworkflows import plot
df, chart = plot.binary_confusion_matrix(test["label"], predictions)
chart

Viewing the raw counts, as well as the proportions of correctly and incorrectly classified items, emphasises that the model often misclassifies 'fraudulent' transactions as 'legitimate'. 

In [None]:
df
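If the `mlworkflows` helper is not available, similar raw counts and proportions can be obtained from scikit-learn's `confusion_matrix`. The sketch below uses toy labels and predictions invented for illustration, not this notebook's results.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions, invented for illustration.
y_true = ["legitimate"] * 6 + ["fraud"] * 4
y_pred = ["legitimate"] * 5 + ["fraud"] + ["legitimate"] * 3 + ["fraud"]

labels = ["legitimate", "fraud"]

# Rows are true labels, columns are predicted labels, in the order given by `labels`.
counts = confusion_matrix(y_true, y_pred, labels=labels)

# normalize="true" divides each row by its total, giving per-class proportions.
proportions = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

print(counts)
print(proportions)
```

In the second row of `proportions`, the first entry is the share of fraudulent transactions that were misclassified as legitimate.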

We want to save the model so that we can use it outside of this notebook. 

In [None]:
with open("lr.pkl", "wb") as f:
    cp.dump(lr, f)
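Before relying on a saved model elsewhere, it can be worth checking that it survives a round trip. The sketch below is an illustration under assumptions: it uses the standard-library `pickle` module rather than `cloudpickle`, a tiny toy model in place of `lr`, and a temporary file path. A pickled scikit-learn model should generally be loaded with the same scikit-learn version that saved it.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Tiny toy model, standing in for the notebook's `lr`.
X = [[0.0], [1.0], [2.0], [3.0]]
y = ["legitimate", "legitimate", "fraud", "fraud"]
model = LogisticRegression(max_iter=500).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lr.pkl")  # hypothetical location

    # Save the fitted model; the context manager closes the file for us.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a separate process or service would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same = list(restored.predict(X)) == list(model.predict(X))
print(same)
```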


In [9]:
import joblib

joblib.dump(model_pipeline, open('model4.joblib', 'wb'))

PicklingError: Can't pickle <function amap at 0x7fb0fb9ff5e0>: it's not found as __main__.amap

In [None]:
!pip install scikit-learn==0.24.2 joblib==1.1.0

In [10]:
!pip list


In [None]:
features = feature_pipeline.transform(test)

In [None]:
features.shape

In [None]:
features.getrow(0).shape

In [None]:
help(df)