### Fitting Logistic Regression
#### Logistic Regression

In this first notebook, you will fit a logistic regression model to predict whether a transaction is fraudulent.

To get started, let's import the libraries and take a quick look at the dataset.

In [37]:
import numpy as np
import pandas as pd
import statsmodels.api as sm


df = pd.read_csv('./fraud_dataset.csv')
df.head()

| | transaction_id | duration | day | fraud |
|---|---|---|---|---|
| 0 | 28891 | 21.3026 | weekend | False |
| 1 | 61629 | 22.932765 | weekend | False |
| 2 | 53707 | 32.694992 | weekday | False |
| 3 | 47812 | 32.784252 | weekend | False |
| 4 | 43455 | 17.756828 | weekend | False |


`1.` As you can see, two columns need to be converted to dummy variables. Replace each of these columns with its dummy version, using 1 for `weekday` and `True`, and 0 otherwise. Use the first quiz to answer a few questions about the dataset.

In [39]:
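# Encode `day` as 1 for weekday and 0 for weekend (dtype=int avoids boolean dummies)
df['weekday'] = pd.get_dummies(df['day'], dtype=int)['weekday']
# get_dummies orders the columns False, True, which map to not_fraud, fraud
df[['not_fraud','fraud']] = pd.get_dummies(df['fraud'], dtype=int)
df = df.drop('not_fraud', axis=1)
df.head()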

| | transaction_id | duration | day | fraud | weekday |
|---|---|---|---|---|---|
| 0 | 28891 | 21.3026 | weekend | 0 | 0 |
| 1 | 61629 | 22.932765 | weekend | 0 | 0 |
| 2 | 53707 | 32.694992 | weekday | 0 | 1 |
| 3 | 47812 | 32.784252 | weekend | 0 | 0 |
| 4 | 43455 | 17.756828 | weekend | 0 | 0 |
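
As a minimal sketch of the same 1/0 coding, the conversion can also be written as a comparison followed by a cast. The `toy` DataFrame below is invented for illustration and is not part of `fraud_dataset.csv`; this is an alternative to `get_dummies`, not the notebook's own approach.

```python
import pandas as pd

# Toy data invented for illustration; not taken from fraud_dataset.csv.
toy = pd.DataFrame({
    "day": ["weekend", "weekday", "weekend"],
    "fraud": [False, True, False],
})

# Comparing against the category of interest yields booleans;
# astype(int) turns them into the 1/0 dummy coding.
toy["weekday"] = (toy["day"] == "weekday").astype(int)
toy["fraud"] = toy["fraud"].astype(int)

print(toy)
```

This form makes it explicit which category is coded as 1, which can be easier to read when a column has only two levels.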


In [40]:
# The proportion of fraudulent transactions.
df.fraud.mean()

0.012168770612987604

In [48]:
# The average duration for fraudulent transactions.
df[df['fraud'] == 1].duration.mean()

4.6242473706156568

In [50]:
# The proportion of weekday transactions.
df.weekday.mean()

0.34527465029000343

In [51]:
# The average duration for non-fraudulent transactions.
df[df['fraud']==0].duration.mean()

30.013583132522555
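
The cells above rely on the fact that the mean of a 0/1 column is the proportion of ones. A small sketch on invented toy data (the `toy` frame and its values are assumptions, not the real dataset) shows this, along with `groupby` as one way to get both duration averages in a single call:

```python
import pandas as pd

# Toy data invented for illustration; not taken from fraud_dataset.csv.
toy = pd.DataFrame({
    "fraud": [0, 0, 0, 1],
    "duration": [30.0, 32.0, 28.0, 4.0],
})

# The mean of a 0/1 column is the share of rows equal to 1.
fraud_share = toy["fraud"].mean()

# groupby returns the mean duration for each fraud value in one call.
mean_duration = toy.groupby("fraud")["duration"].mean()

print(fraud_share)
print(mean_duration)
```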

`2.` Now that you have dummy variables, fit a logistic regression model to predict whether a transaction is fraud using both day and duration. Don't forget an intercept! Use the second quiz below to check that you fit the model correctly. Also remember to use the `.summary2()` method to get your summary results.
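
To illustrate why the intercept has to be supplied as an explicit column of ones, here is a rough NumPy sketch of fitting a logistic regression with Newton-Raphson updates. The function name `fit_logistic`, the synthetic data, and the "true" coefficients are all assumptions made for this example; it is not necessarily how statsmodels fits the model internally.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson for logistic regression.

    X must already contain a column of ones if an intercept is wanted.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))           # predicted probabilities
        gradient = X.T @ (y - p)                      # score vector
        hessian = X.T @ (X * (p * (1 - p))[:, None])  # observed information
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

# Synthetic data with known coefficients (intercept -1, slope 2).
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])  # the column of ones is the intercept
true_beta = np.array([-1.0, 2.0])
p_true = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = (rng.uniform(size=n) < p_true).astype(float)

beta_hat = fit_logistic(X, y)
print(beta_hat)
```

Without the column of ones, the model would be forced through a log-odds of zero when every predictor is zero, which is rarely what you want.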

In [56]:
# sm.Logit does not add an intercept on its own, so include a column of ones
df['intercept'] = 1
logis_model = sm.Logit(df['fraud'], df[['intercept', 'weekday', 'duration']])
results = logis_model.fit()
results.summary2()

Optimization terminated successfully.
         Current function value: inf
         Iterations 16


  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))
  return 1 - self.llf/self.llnull


| | | | |
|---|---|---|---|
| Model: | Logit | No. Iterations: | 16.0 |
| Dependent Variable: | fraud | Pseudo R-squared: | |
| Date: | 2021-12-20 09:08 | AIC: | inf |
| No. Observations: | 8793 | BIC: | inf |
| Df Model: | 2 | Log-Likelihood: | -inf |
| Df Residuals: | 8790 | LL-Null: | -inf |
| Converged: | 1.0000 | Scale: | 1.0 |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 9.8709 | 1.9438 | 5.0783 | 0.0000 | 6.0613 | 13.6806 |
| weekday | 2.5465 | 0.9043 | 2.8160 | 0.0049 | 0.7741 | 4.3188 |
| duration | -1.4637 | 0.2905 | -5.0389 | 0.0000 | -2.0331 | -0.8944 |
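
Written out, the coefficient table corresponds to the following fitted model on the log-odds scale. The symbol $p$ (the probability that a transaction is fraud) is notation introduced here for illustration, not something the notebook defines:

$$
\log\left(\frac{p}{1-p}\right) = 9.8709 + 2.5465 \cdot \text{weekday} - 1.4637 \cdot \text{duration}
$$

Exponentiating both sides expresses the same model in terms of odds, which is why each coefficient acts multiplicatively once exponentiated:

$$
\frac{p}{1-p} = e^{9.8709} \cdot e^{2.5465 \cdot \text{weekday}} \cdot e^{-1.4637 \cdot \text{duration}}
$$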


In [60]:
# Exponentiate the duration and weekday coefficients.
np.exp(-1.4637), np.exp(2.5465)

(0.23137858821179411, 12.762357271496972)

Exponentiating the coefficients for duration (-1.4637) and weekday (2.5465) gives approximately 0.231 and 12.762, respectively. Because a logistic regression coefficient is a change in the log-odds, these values are odds ratios, interpreted as follows:

    For each 1 unit increase in duration, the odds of fraud are multiplied by 0.23, holding all other variables (weekday) constant.
    The odds of fraud are 12.76 times as high on weekdays as on weekends, holding all other variables (duration) constant.
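
The "holding all other variables constant" part can be checked numerically. The sketch below plugs the coefficients from the summary table into the logistic model and compares the odds at two inputs that differ in one variable only. The helper names `fraud_odds` and `fraud_probability`, and the starting duration of 5, are choices made for this example.

```python
import math

# Coefficients copied from the summary table.
INTERCEPT, B_WEEKDAY, B_DURATION = 9.8709, 2.5465, -1.4637

def fraud_odds(weekday, duration):
    """Odds of fraud, p / (1 - p), implied by the fitted log-odds model."""
    return math.exp(INTERCEPT + B_WEEKDAY * weekday + B_DURATION * duration)

def fraud_probability(weekday, duration):
    """Convert the odds to a probability."""
    odds = fraud_odds(weekday, duration)
    return odds / (1 + odds)

# Ratio of odds for a 1 unit increase in duration, weekday held fixed.
# The starting duration (5 here) is arbitrary; the ratio does not depend on it.
duration_ratio = fraud_odds(0, 6) / fraud_odds(0, 5)

# Ratio of odds for weekday vs. weekend at the same duration.
weekday_ratio = fraud_odds(1, 5) / fraud_odds(0, 5)

print(round(duration_ratio, 3), round(weekday_ratio, 3))  # 0.231 12.762
```

Note that the ratio applies to the odds, not to the probability: the change in `fraud_probability` for the same one-unit step depends on where you start.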

In [61]:
# Reciprocal of the exponentiated duration coefficient.
1/np.exp(-1.4637)

4.3219210892783329

**Note:** When a multiplicative change is **less than 1**, like 0.231 for duration, it is often useful to calculate the **reciprocal**. This changes the direction from a unit increase to a unit decrease. So, the result for duration could also be interpreted as:

    For every 1 unit decrease in duration, the odds of fraud are multiplied by 4.32, holding all other variables constant.
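
A short sketch of why the reciprocal works: taking 1 over the exponentiated coefficient is the same as exponentiating the coefficient with its sign flipped. The helper `odds_multiplier` is a hypothetical name introduced for this example; it also covers changes of more than one unit.

```python
import math

B_DURATION = -1.4637  # duration coefficient from the summary table

def odds_multiplier(coef, delta):
    """Factor by which the odds change when the predictor moves by `delta` units."""
    return math.exp(coef * delta)

per_unit_increase = odds_multiplier(B_DURATION, 1)
per_unit_decrease = odds_multiplier(B_DURATION, -1)

# Taking the reciprocal is the same as flipping the sign of the change.
assert math.isclose(1 / per_unit_increase, per_unit_decrease)

print(round(per_unit_decrease, 2))  # 4.32
```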