Lecture notes

- Sample size: n
- Predictor dim: p
- Number of classes: k


1. Logistic regression
- logistic function p(X)
- log-odds (logit) is linear in X
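- likelihood function: estimate the coefficients by maximizing it. The number of coefficients is the predictor dim plus one (the intercept)
- z-statistics: each coefficient estimate divided by its standard error, used to test whether the coefficient is zero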
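
The points above can be illustrated with a minimal sketch for a single predictor. The coefficients `beta0`, `beta1` and the toy data are hypothetical values chosen only for illustration; they are not estimated from any data set in these notes.

```python
import numpy as np

def logistic(x, beta0, beta1):
    """Logistic function p(x) = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

def logit(prob):
    """Log-odds: log(p / (1 - p))."""
    return np.log(prob / (1.0 - prob))

def log_likelihood(beta0, beta1, x, y):
    """Bernoulli log-likelihood of 0/1 labels y given predictor x."""
    prob = logistic(x, beta0, beta1)
    return np.sum(y * np.log(prob) + (1 - y) * np.log(1.0 - prob))

# Hypothetical coefficients and data, chosen only for illustration
beta0, beta1 = -1.0, 0.5
x = np.linspace(-3.0, 3.0, 7)
prob = logistic(x, beta0, beta1)

# The log-odds recover the linear predictor beta0 + beta1 * x
print(np.allclose(logit(prob), beta0 + beta1 * x))

# Labels that increase with x are more likely under a positive slope
# than under the sign-flipped slope
y = (x > 0).astype(int)
print(log_likelihood(beta0, beta1, x, y) > log_likelihood(beta0, -beta1, x, y))
```

Maximum likelihood fitting searches for the coefficients that make `log_likelihood` as large as possible; the comparison above only checks two candidate slopes.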

2. Multiple logistic regression

3. Discriminant analysis
- Bayes' theorem
- posterior prob. p_k(x) depends on pi_k (prior prob., easy to estimate from Y) and f_k(x) (density function of X in class k, hard to estimate from X, but can be assumed to have a simple form)
- when p=1
    - assume f_k(x,mu_k,sigma) are Gaussian with the same variance
    - maximizing p_k(x) over k <=> maximizing over k a discriminant function that is linear in x and still depends on mu_k, sigma and pi_k
    - need to estimate pi_k, mu_k and sigma from X and Y
    - set pairs of discriminant functions equal to each other to determine the Bayes decision boundary
    - once the class k is chosen for a given x, we can compute p_k(x) to get its probability
- when p>1
    - assume f_k(x,mu_k,Sigma) are Gaussian with the same covariance
    - algorithm same as p=1 case
- Forms of discriminant analysis
    - Linear: f_k(x) are Gaussian having the same covariance
    - Quadratic: f_k(x) are Gaussian, each class having its own covariance
    - Naive Bayes: the predictors are independent within each class (for Gaussian f_k the covariance matrix is diagonal); useful for large p; useful for mixed feature vectors
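
The p=1 case above can be sketched in a few lines of NumPy. The data are synthetic (an assumption made purely for illustration), the pooled variance uses the usual n - k denominator, and the names `discriminants`, `posterior` and `boundary` are hypothetical helpers, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (illustration only): two classes with a shared standard deviation of 1
x = np.concatenate([rng.normal(-1.0, 1.0, 200), rng.normal(2.0, 1.0, 100)])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(100, dtype=int)])

classes = np.unique(y)
n, k = len(y), len(classes)

# Estimates of the priors, the class means and the pooled variance
pi_hat = np.array([np.mean(y == c) for c in classes])
mu_hat = np.array([x[y == c].mean() for c in classes])
sigma2_hat = sum(((x[y == c] - mu_hat[i]) ** 2).sum()
                 for i, c in enumerate(classes)) / (n - k)

def discriminants(x_new):
    """delta_k(x) = x*mu_k/sigma^2 - mu_k^2/(2*sigma^2) + log(pi_k), one column per class."""
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))[:, None]
    return x_new * mu_hat / sigma2_hat - mu_hat ** 2 / (2 * sigma2_hat) + np.log(pi_hat)

def posterior(x_new):
    """p_k(x) is proportional to exp(delta_k(x)); normalize across classes."""
    d = discriminants(x_new)
    w = np.exp(d - d.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

# Two-class decision boundary: the x at which delta_0(x) = delta_1(x)
boundary = (mu_hat[0] + mu_hat[1]) / 2 \
    - sigma2_hat * np.log(pi_hat[0] / pi_hat[1]) / (mu_hat[0] - mu_hat[1])

labels = discriminants(x).argmax(axis=1)   # predicted class for each observation
print(boundary, np.mean(labels == y))
```

Because the class with the larger prior gets a larger log(pi_k) term, the boundary shifts away from the midpoint of the two means toward the mean of the less frequent class.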

4. Evaluate threshold value
- Confusion matrix
- True/False positive/negative
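- Two types of error: false positive rate and false negative rate (the latter is one minus the true positive rate)
- The total error is a weighted average of the false positive rate and the false negative rate. The weights are the prior probabilities of negative and positive responses, respectively.
- Visual method: ROC curve (true positive rate against false positive rate over all thresholds), summarized by the AUC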
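
As a rough sketch of these quantities, the block below computes the error rates at one threshold, checks the weighted-average identity, and traces an approximate ROC curve on a threshold grid. The labels and scores are synthetic (an assumption for illustration only), and `rates` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic labels and scores (illustration only): positives tend to score higher
y_true = rng.integers(0, 2, size=500)
scores = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=500), 0.0, 1.0)

def rates(y_true, scores, threshold):
    """Return (FPR, FNR, total error) when predicting positive if score > threshold."""
    y_pred = (scores > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return fp / (fp + tn), fn / (fn + tp), (fp + fn) / len(y_true)

fpr, fnr, total = rates(y_true, scores, 0.5)
prior_pos = y_true.mean()
# Total error = P(negative) * FPR + P(positive) * FNR
weighted = (1 - prior_pos) * fpr + prior_pos * fnr
print(np.isclose(total, weighted))

# ROC curve on a grid: sweep the threshold from high to low and record (FPR, TPR)
thresholds = np.concatenate((np.linspace(1.0, 0.0, 101), [-np.inf]))
points = np.array([rates(y_true, scores, t)[:2] for t in thresholds])
roc_fpr, roc_tpr = points[:, 0], 1.0 - points[:, 1]

# AUC by the trapezoidal rule (approximate, since the thresholds form a grid)
auc = np.sum(np.diff(roc_fpr) * (roc_tpr[1:] + roc_tpr[:-1]) / 2.0)
print(auc)
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1): more true positives are caught at the cost of more false positives.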

In [98]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets,preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix
import math
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices 

In [3]:
data = pd.read_csv('data/Smarket.csv',header=0)

In [4]:
data.head()

Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
0,2001,0.381,-0.192,-2.624,-1.055,5.01,1.1913,0.959,Up
1,2001,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up
2,2001,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down
3,2001,-0.623,1.032,0.959,0.381,-0.192,1.276,0.614,Up
4,2001,0.614,-0.623,1.032,0.959,0.381,1.2057,0.213,Up


In [57]:
# logistic regression - statsmodels
# build the training data with patsy: dmatrices adds the intercept column to X
# and dummy-codes Direction into one column per class
y,X = dmatrices('Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume',data,return_type='dataframe')
# the first column is Direction[Down], so the model estimates P(Direction = Down)
Y = y.iloc[:,0]
lm1 = sm.Logit(Y,X).fit()
print(lm1.summary())
# determine class with threshold = 0.5
threshold = 0.5
pre_label = (lm1.predict()>threshold).astype(int)
# rows: true class, columns: predicted class (1 = Down)
confusion_matrix(Y,pre_label)

Optimization terminated successfully.
         Current function value: 0.691034
         Iterations 4
                           Logit Regression Results                           
Dep. Variable:        Direction[Down]   No. Observations:                 1250
Model:                          Logit   Df Residuals:                     1243
Method:                           MLE   Df Model:                            6
Date:                Mon, 05 Feb 2018   Pseudo R-squ.:                0.002074
Time:                        11:37:05   Log-Likelihood:                -863.79
converged:                       True   LL-Null:                       -865.59
                                        LLR p-value:                    0.7319
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.1260      0.241      0.523      0.601        -0.346     0.598
Lag1           0.0731      0.

array([[507, 141],
       [457, 145]])

In [101]:
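# logistic regression - sklearn
# factorize codes the classes in order of appearance: Up = 0, Down = 1
Y = data.Direction.factorize()[0]
X = data.iloc[:,1:7]
lm2 = LogisticRegression()
# fit on the 1-d label array Y (not the patsy DataFrame y)
lm2.fit(X,Y)
# predictions are passed first, so rows are predicted classes and columns are true classes
print(confusion_matrix(lm2.predict(X),Y))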

[[513 459]
 [135 143]]


In [100]:
# LDA - sklearn
Y = data.Direction.factorize()[0]
X = data.iloc[:,1:7]
ldam = LDA()
ldam.fit(X,Y)
# predictions are passed first, so rows are predicted classes and columns are true classes
print(confusion_matrix(ldam.predict(X),Y))

[[507 457]
 [141 145]]


In [99]:
# QDA - sklearn
Y = data.Direction.factorize()[0]
X = data.iloc[:,1:7]
qdam = QDA()
qdam.fit(X,Y)
# predictions are passed first, so rows are predicted classes and columns are true classes
print(confusion_matrix(qdam.predict(X),Y))

[[512 421]
 [136 181]]


In [108]:
# KNN - sklearn
Y = data.Direction.factorize()[0]
X = data.iloc[:,1:7]
knnm = KNN(n_neighbors=3)
knnm.fit(X,Y)
# predictions are passed first, so rows are predicted classes and columns are true classes
print(confusion_matrix(knnm.predict(X),Y))

[[501 157]
 [147 445]]
