### Fitting Logistic Regression

In this first notebook, you will fit a logistic regression model to a dataset in order to predict whether a transaction is fraudulent.

To get started, let's import the libraries and take a quick look at the dataset.

In [61]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt


df = pd.read_csv('./fraud_dataset.csv')
df.head()

,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,False
1,61629,22.932765,weekend,False
2,53707,32.694992,weekday,False
3,47812,32.784252,weekend,False
4,43455,17.756828,weekend,False


In [62]:
df.shape

(8793, 4)

In [84]:
# `fraud` is boolean, so its mean is the overall fraud rate
df['fraud'].mean()

0.012168770612987604

In [85]:
df.query('fraud==True')['duration'].mean()

4.6242473706156568

In [86]:
# Share of transactions that fall on a weekday
(df['day'] == 'weekday').mean()

0.34527465029000343

In [87]:
df.query('fraud==False')['duration'].mean()

30.013583132522555
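
The two `query` calls above can also be collapsed into a single `groupby`. The following is a minimal sketch on a small made-up frame; the name `toy` and its values are illustrative assumptions, not rows from `fraud_dataset.csv`.

```python
import pandas as pd

# Toy data for illustration only; not taken from fraud_dataset.csv.
toy = pd.DataFrame({
    'duration': [30.0, 28.0, 5.0, 32.0, 4.0],
    'fraud': [False, False, True, False, True],
})

# Mean duration per class in one pass, instead of one query per class.
mean_duration = toy.groupby('fraud')['duration'].mean()
print(mean_duration)
```

On the real data, the same call on `df` should reproduce the two class means shown above.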

In [63]:
df1 = pd.pivot_table(data=df, columns='fraud', index='day', values='transaction_id', aggfunc='count').reset_index()

In [79]:
total = [i+j for i, j in zip(df1[False], df1[True])]
not_fraud = [i/j*100 for i,j in zip(df1[False], total)]
fraud = [i/j*100 for i,j in zip(df1[True], total)]

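# Overall fraud count divided by each day type's transaction total
# (row 0: weekday, row 1: weekend). Note that `.sum()` pools the fraud
# counts across both day types, so this is not a within-day fraud rate.
df1[True].sum()/(df1[True]+df1[False])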

0    0.035244
1    0.018586
dtype: float64
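
The within-day fraud rate, meaning the fraud count for a day type divided by that day type's transaction count, can be computed straight from the raw columns because the mean of a boolean column is the share of `True` values. A minimal sketch on made-up data (the frame `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy data for illustration only; not taken from fraud_dataset.csv.
toy = pd.DataFrame({
    'day': ['weekday'] * 4 + ['weekend'] * 4,
    'fraud': [True, True, False, False, True, False, False, False],
})

# The mean of a boolean column within each group is that group's fraud rate.
fraud_rate = toy.groupby('day')['fraud'].mean()
print(fraud_rate)
```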

`1.` As you can see, two columns need to be converted to dummy variables.  Replace each of them with its dummy version, using 1 for `weekday` and `True`, and 0 otherwise.  Use the first quiz to answer a few questions about the dataset.

In [65]:
df['intercept'] = 1
df['weekday'] = pd.get_dummies(df.day)['weekday']
df['fraud_flag'] = pd.get_dummies(df.fraud)[True]
df.head()

,transaction_id,duration,day,fraud,intercept,weekday,fraud_flag
0,28891,21.3026,weekend,False,1,0,0
1,61629,22.932765,weekend,False,1,0,0
2,53707,32.694992,weekday,False,1,1,0
3,47812,32.784252,weekend,False,1,0,0
4,43455,17.756828,weekend,False,1,0,0
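
One caveat: in pandas 2.0 and later, `pd.get_dummies` returns boolean columns by default, so the flags above may display as `False`/`True` rather than 0/1. The sketch below shows two ways to get integer flags regardless of version; the frame `toy` is a made-up stand-in for `df`.

```python
import pandas as pd

# Toy data for illustration only; not taken from fraud_dataset.csv.
toy = pd.DataFrame({
    'day': ['weekend', 'weekday', 'weekend'],
    'fraud': [False, True, False],
})

# Option 1: compare and cast, which avoids get_dummies entirely.
toy['weekday'] = (toy['day'] == 'weekday').astype(int)
toy['fraud_flag'] = toy['fraud'].astype(int)

# Option 2: ask get_dummies for integer output explicitly.
weekday_alt = pd.get_dummies(toy['day'], dtype=int)['weekday']
```

Either way the model receives numeric 0/1 columns.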


`2.` Now that you have dummy variables, fit a logistic regression model to predict whether a transaction is fraud using both day and duration.  Don't forget an intercept!  Use the second quiz below to make sure you fit the model correctly. Also remember to use the `.summary2()` method to get your summary results.

In [67]:
mod = sm.Logit(df['fraud_flag'], df[['intercept','weekday','duration']])
res = mod.fit()

Optimization terminated successfully.
         Current function value: inf
         Iterations 16


  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))
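
The `inf` function value and the two code fragments above are consistent with floating-point trouble in the likelihood: when `np.exp` overflows, the sigmoid rounds to exactly 0 and its logarithm becomes `-inf`. As an illustration only, the sketch below compares that naive form with an algebraically equivalent form based on `np.logaddexp` that stays finite. The function name `stable_loglike` and the inputs are assumptions made for this example, not statsmodels code.

```python
import numpy as np

def stable_loglike(y, eta):
    """Bernoulli log-likelihood, sum(y*eta - log(1 + exp(eta))), computed stably."""
    y = np.asarray(y, dtype=float)
    eta = np.asarray(eta, dtype=float)
    # logaddexp(0, eta) evaluates log(1 + exp(eta)) without overflow.
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Hypothetical linear predictors; the first is extreme and badly wrong for y = 1.
eta = np.array([-800.0, 0.0])
y = np.array([1, 1])

# Naive form, mirroring log(sigmoid(q * eta)) with q = 2y - 1.
with np.errstate(over='ignore', divide='ignore'):
    q = 2 * y - 1
    naive = np.sum(np.log(1 / (1 + np.exp(-q * eta))))

stable = stable_loglike(y, eta)
print(naive, stable)
```

Here the naive sum is `-inf` while the stable form returns a large but finite negative number, which is why an `inf` objective does not by itself mean the data are unusable.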


In [70]:
res.summary2()

Model:,Logit,No. Iterations:,16.0
Dependent Variable:,fraud_flag,Pseudo R-squared:,
Date:,2022-09-10 08:17,AIC:,inf
No. Observations:,8793,BIC:,inf
Df Model:,2,Log-Likelihood:,-inf
Df Residuals:,8790,LL-Null:,-inf
Converged:,1.0000,Scale:,1.0

,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.9438,5.0783,0.0000,6.0613,13.6806
weekday,2.5465,0.9043,2.8160,0.0049,0.7741,4.3188
duration,-1.4637,0.2905,-5.0389,0.0000,-2.0331,-0.8944
