## > Logistic Regression

This section fits a logistic regression model to a dataset in order to predict whether a transaction is fraudulent.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('C:/Users/Minkun/Desktop/classes_1/NanoDeg/1.Data_AN/L4/data/fraud_dataset.csv')
df.head()

Unnamed: 0,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,False
1,61629,22.932765,weekend,False
2,53707,32.694992,weekday,False
3,47812,32.784252,weekend,False
4,43455,17.756828,weekend,False


`1.` As you can see, there are two columns that need to be changed to dummy variables. Replace each of the current columns with its dummy version. Use 1 for `weekday` and `True`, and 0 otherwise.

In [2]:
# pd.get_dummies() returns one indicator column per category (two columns here).
df['weekday'] = pd.get_dummies(df['day'])['weekday']
df[['not_fraud','fraud']] = pd.get_dummies(df['fraud'])
df.head(2)

df = df.drop('not_fraud', axis=1)
df.head(2)

Unnamed: 0,transaction_id,duration,day,fraud,weekday
0,28891,21.3026,weekend,0.0,0.0
1,61629,22.932765,weekend,0.0,0.0
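
A minimal sketch of the same encoding on a made-up three-row frame. The `toy` data is hypothetical and not taken from `fraud_dataset.csv`; it assumes `day` holds the strings `weekday`/`weekend` and `fraud` holds booleans, as the preview above suggests. Comparing and casting gives the 0/1 columns directly, without a temporary `not_fraud` column.

```python
import pandas as pd

# Hypothetical stand-in for the fraud data (assumption: same dtypes as the preview).
toy = pd.DataFrame({
    "day": ["weekend", "weekday", "weekend"],
    "fraud": [False, True, False],
})

# 1 for weekday, 0 for weekend.
toy["weekday"] = (toy["day"] == "weekday").astype(int)

# 1 for True (fraud), 0 for False.
toy["fraud"] = toy["fraud"].astype(int)

# Same result via get_dummies: pick the 'weekday' indicator column.
weekday_from_dummies = pd.get_dummies(toy["day"])["weekday"].astype(int)

print(toy)
```

Either route produces the same `weekday` column; the comparison form just skips the extra indicator columns.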


In [3]:
# Proportion of fraudulent transactions, proportion of weekday transactions,
# and mean duration by fraud status.
print(df['fraud'].mean())
print(df['weekday'].mean())
print(df.groupby('fraud').mean()['duration'])
# Fraudulent transactions tend to have a much shorter duration.

0.012168770612987604
0.3452746502900034
fraud
0.0    30.013583
1.0     4.624247
Name: duration, dtype: float64


`2.` Now that you have dummy variables, fit a logistic regression model to predict if a transaction is fraud using both day and duration.  Don't forget an intercept! 

In [5]:
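# Workaround in case fitting or summarizing the model raises:
#   AttributeError: module 'scipy.stats' has no attribute 'chisqprob'
# Check the installed versions, then restore the missing function.
#sm.show_versions()

#from scipy import stats
#stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)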

In [6]:
df['intercept'] = 1

# instead of 'OLS', we use 'Logit'
log_model = sm.Logit(df['fraud'], df[['intercept', 'weekday', 'duration']])
#coeff = log_model.fit().params; coeff
result = log_model.fit()
result.summary()
# Interpreting coefficients: exponentiate them first. In logistic regression,
# exp(coef) is the multiplicative change in the odds for a one-unit increase.

Optimization terminated successfully.
         Current function value: 0.002411
         Iterations 16


Dep. Variable:,fraud,No. Observations:,8793.0
Model:,Logit,Df Residuals:,8790.0
Method:,MLE,Df Model:,2.0
Date:,"Tue, 02 Jan 2018",Pseudo R-squ.:,0.9633
Time:,19:02:46,Log-Likelihood:,-21.2
converged:,True,LL-Null:,-578.1
,,LLR p-value:,1.39e-242

,coef,std err,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.944,5.078,0.000,6.061,13.681
weekday,2.5465,0.904,2.816,0.005,0.774,4.319
duration,-1.4637,0.290,-5.039,0.000,-2.033,-0.894


In [7]:
np.exp(2.5465), 100/1276.2357 #, 1/_

(12.762357271496972, 0.07835543230768423)

In [8]:
np.exp(-1.4637), 100/23.14

(0.23137858821179411, 4.32152117545376)

## On weekdays, the odds of fraud are 12.76 (e^2.5465) times the odds on weekends, holding 'duration' constant.

## For each one-minute decrease in transaction duration, the odds of fraud are multiplied by 4.32 (1/e^-1.4637), holding 'weekday' constant.
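
The arithmetic behind the two statements above, as a small sketch. The coefficients are copied from the summary table; the variable names are illustrative only.

```python
import numpy as np

# Coefficients taken from the Logit summary table above.
coef_weekday = 2.5465
coef_duration = -1.4637

# exp(coef) is the odds ratio for a one-unit increase in the predictor.
odds_ratio_weekday = np.exp(coef_weekday)
odds_ratio_duration = np.exp(coef_duration)

# For a one-unit decrease, take the reciprocal (equivalently exp(-coef)).
odds_ratio_per_unit_less = 1 / odds_ratio_duration

print(round(odds_ratio_weekday, 2))        # 12.76
print(round(odds_ratio_duration, 4))       # 0.2314
print(round(odds_ratio_per_unit_less, 2))  # 4.32
```

Note that these are ratios of odds, not of probabilities, so "12.76 times" describes how the odds scale rather than how the probability of fraud scales.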

In [9]:
df = pd.read_csv("C:/Users/Minkun/Desktop/classes_1/NanoDeg/1.Data_AN/L4/data/admissions.csv")
df.head()
# prestige is the prestige of an applicant's alma mater (the school attended before applying),
# with 1 being the highest prestige and 4 the lowest.

Unnamed: 0,admit,gre,gpa,prestige
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


In [10]:
# Create the dummy variables needed to treat prestige as a categorical variable
# rather than a quantitative one.

df[['prest_1', 'prest_2', 'prest_3', 'prest_4']] = pd.get_dummies(df['prestige'])
df.head()

Unnamed: 0,admit,gre,gpa,prestige,prest_1,prest_2,prest_3,prest_4
0,0,380,3.61,3,0.0,0.0,1.0,0.0
1,1,660,3.67,3,0.0,0.0,1.0,0.0
2,1,800,4.0,1,1.0,0.0,0.0,0.0
3,1,640,3.19,4,0.0,0.0,0.0,1.0
4,0,520,2.93,4,0.0,0.0,0.0,1.0


In [11]:
df['prestige'].astype(str).value_counts()

2    148
3    121
4     67
1     61
Name: prestige, dtype: int64

In [None]:
# Now, fit a logistic regression model to predict if an individual is admitted using gre, gpa, and prestige with a baseline of 
#the prestige value of 1.

df['intercept'] = 1

logit_mod = sm.Logit(df['admit'], df[['intercept','gre', 'gpa', 'prest_2', 'prest_3', 'prest_4']])
results = logit_mod.fit()
results.summary()

In [None]:
np.exp(results.params)

In [None]:
# Invert the odds ratios (reciprocal of the previous cell's output).
1/_

In [None]:
df.groupby('prestige').mean()['admit']

## > Model Diagnostics

In [12]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split
np.random.seed(42)

In [None]:
import sklearn
print(sklearn.__version__) # upgraded from 0.17 to 0.19 (sklearn.model_selection needs 0.18+)
# conda update scikit-learn 
### conda install -c anaconda scikit-learn ###

`1.` Change prestige to dummy variable columns that are added to `df`. Then divide your data into training and test data. Create your test set as 20% of the data, and use a random state of 0. Your response should be the `admit` column. [Here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) are the docs, which you can also find with a quick Google search if you get stuck.

In [None]:
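# One indicator column per prestige level.
df[['prest_1', 'prest_2', 'prest_3', 'prest_4']] = pd.get_dummies(df['prestige'])
# Use gre, gpa, and three of the dummies (prest_1 is the baseline).
X = df[['gre', 'gpa', 'prest_2', 'prest_3', 'prest_4']]
y = df['admit']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)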
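
To see what `test_size` and `random_state` do, here is a small sketch on made-up arrays. `X_demo` and `y_demo` are hypothetical and unrelated to the admissions data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, alternating 0/1 labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# 20% of 10 rows goes to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` is what makes results comparable between runs; a different seed shuffles the rows differently.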

`2.` Now use [sklearn's Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to fit a logistic model using `gre`, `gpa`, and 3 of your `prestige` dummy variables.  For now, fit the logistic regression model without changing any of the hyperparameters.  

The usual steps are:
* Instantiate
* Fit (on train)
* Predict (on test)
* Score (compare predict to test)

As a first score, obtain the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).  Then answer the first question below about how well your model performed on the test data.

In [None]:
log_mod = LogisticRegression()
log_mod.fit(X_train, y_train)
preds = log_mod.predict(X_test)
confusion_matrix(y_test, preds) 

`3.` Now, try out a few additional metrics: [precision](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html), [recall](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html), and [accuracy](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) are all popular metrics, which you saw with Sebastian. You could compute these directly from the confusion matrix, but you can also use the built-in functions in sklearn.

Another very popular set of metrics are [ROC curves and AUC](http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py).  These actually use the probability from the logistic regression models, and not just the label.  [This](http://blog.yhat.com/posts/roc-curves.html) is also a great resource for understanding ROC curves and AUC.

Try out these metrics to answer the second quiz question below. I have also provided the ROC plot below. Ideally, the curve shoots all the way up to the upper-left corner. Again, these are discussed in more detail in the Machine Learning Udacity program.

In [None]:
precision_score(y_test, preds) 

In [None]:
recall_score(y_test, preds)

In [None]:
accuracy_score(y_test, preds)
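
The three scores can also be computed by hand from the confusion matrix, as mentioned earlier. A sketch on made-up labels (the `y_true` and `y_pred` arrays are hypothetical, not model output from this notebook):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true labels and predictions.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, sklearn lays the matrix out as [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                  # of predicted positives, how many are right
recall = tp / (tp + fn)                     # of actual positives, how many are found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right

print(precision, recall, accuracy)  # 0.75 0.75 0.8

# The built-in functions give the same numbers.
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      accuracy_score(y_true, y_pred))
```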

In [None]:
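from ggplot import *
from sklearn.metrics import roc_curve, auc
%matplotlib inline

# Use the predicted probability of the positive class, not the hard labels.
probs = log_mod.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)

# Separate names keep `preds` and `df` from being overwritten.
roc_df = pd.DataFrame(dict(fpr=fpr, tpr=tpr))
ggplot(roc_df, aes(x='fpr', y='tpr')) +\
    geom_line() +\
    geom_abline(linetype='dashed')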
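
The AUC that summarizes the ROC curve can be computed from the same `fpr`/`tpr` arrays. A sketch on four made-up scores (hypothetical values, not this model's probabilities):

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# ROC points, then the area under them by the trapezoidal rule.
fpr, tpr, thresholds = roc_curve(y_true, scores)
area = auc(fpr, tpr)

# roc_auc_score goes straight from labels and scores to the same number.
area_direct = roc_auc_score(y_true, scores)

print(area, area_direct)  # 0.75 0.75
```

An AUC of 0.5 corresponds to the dashed diagonal (no better than random ranking), and 1.0 to a curve that reaches the upper-left corner.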