# Lab Three: Extending Logistic Regression
 

#### Everett Cienkus, Blake Miller, Colin Weil

### 1. Preparation and Overview

#### 1.1 Business Case

Explain the task and what business-case or use-case it is designed to solve (or designed to investigate). Detail exactly what the classification task is and what parties would be interested in the results. For example, would the model be deployed or used mostly for offline analysis? 

#### 1.2 Preparation of Data

In [56]:
import pandas as pd
import numpy as np

# Define and prepare your class variables.
# df = pd.read_csv('wine_dataset/winequalityN.csv')
# df = df[df['type']=='white']
# df = df.drop(columns = ['type'])
# df = df.dropna()
# X = df.drop(columns = ['quality'])
# y = df['quality']
# Use proper variable representations (int, float, one-hot, etc.).
# Use pre-processing methods (as needed) for dimensionality reduction, 
# scaling, etc. Remove variables that are not needed/useful for the analysis. 
# Describe the final dataset that is used for classification/regression
# display(X.info())
# display(y.info())
# (include a description of any newly formed variables you created).
# MAKE SURE TO NORMALIZE VALUES

In [57]:
df = pd.read_csv('wine_dataset/star_classification.csv')
df = df.dropna()

X = df.drop(columns = ['obj_ID','run_ID','rerun_ID','field_ID','spec_obj_ID', 'MJD', 'class' ])
y = df['class']

display(X.info())
display(y.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   alpha     100000 non-null  float64
 1   delta     100000 non-null  float64
 2   u         100000 non-null  float64
 3   g         100000 non-null  float64
 4   r         100000 non-null  float64
 5   i         100000 non-null  float64
 6   z         100000 non-null  float64
 7   cam_col   100000 non-null  int64  
 8   redshift  100000 non-null  float64
 9   plate     100000 non-null  int64  
 10  fiber_ID  100000 non-null  int64  
dtypes: float64(8), int64(3)
memory usage: 8.4 MB


None

<class 'pandas.core.series.Series'>
RangeIndex: 100000 entries, 0 to 99999
Series name: class
Non-Null Count   Dtype 
--------------   ----- 
100000 non-null  object
dtypes: object(1)
memory usage: 781.4+ KB


None
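The normalization reminder in the first cell of this section could be handled with scikit-learn's `StandardScaler`. The sketch below is only an illustration on synthetic data: the arrays `X_train_demo` and `X_test_demo` are hypothetical stand-ins, not the star classification features, and it assumes the scaler is fit after the train/test split so that test rows do not influence the learned mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for two features on very different scales
# (hypothetical values, not the star_classification data).
X_train_demo = rng.normal(loc=[0.5, 5000.0], scale=[0.1, 2000.0], size=(800, 2))
X_test_demo = rng.normal(loc=[0.5, 5000.0], scale=[0.1, 2000.0], size=(200, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test_demo)        # reuse the training statistics

print("train mean per column:", X_train_scaled.mean(axis=0).round(6))
print("train std per column: ", X_train_scaled.std(axis=0).round(6))
```

Gradient-based solvers tend to behave better when every column has a comparable scale, which is why the scaled arrays, rather than the raw ones, would be passed to `fit`.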

#### 1.3 Division of Training and Testing Data

In [58]:
# Divide your data into training and testing data using an 80% training 
# and 20% testing split. Use the cross validation modules that are part 
# of scikit-learn.
from sklearn.model_selection import train_test_split

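# This call holds out 10% of the rows for testing (a 90/10 split),
# not the 80/20 split described in the comment above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, train_size=0.9)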

unique_ytrain, counts_ytrain = np.unique(y_train, return_counts=True)
print(np.asarray((unique_ytrain, counts_ytrain)).T)

[['GALAXY' 53451]
 ['QSO' 17072]
 ['STAR' 19477]]


Argue "for" or "against" splitting your data using an 80/20 split. That is, why is the 80/20 split appropriate (or not) for your dataset?  
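One consideration for this argument is whether the class proportions survive the split. The sketch below shows an 80/20 split with the `stratify` option of `train_test_split` on synthetic labels whose balance roughly matches the training counts printed above (about 59% GALAXY, 19% QSO, 22% STAR). The arrays `features` and `labels` are hypothetical stand-ins, and `random_state=0` is an arbitrary choice made only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic labels with roughly the class balance printed above (not the real data).
labels = np.array(['GALAXY'] * 594 + ['QSO'] * 190 + ['STAR'] * 216)
features = rng.normal(size=(len(labels), 3))

# stratify=labels keeps each class's share (nearly) identical in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)

for name, part in (('train', y_tr), ('test', y_te)):
    classes, counts = np.unique(part, return_counts=True)
    shares = {str(c): round(int(n) / len(part), 3) for c, n in zip(classes, counts)}
    print(name, len(part), shares)
```

With 100,000 rows, even the smallest class keeps thousands of test examples under an 80/20 split, which is one point in favor of that split for a dataset of this size.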

### 2. Modeling

#### 2.1 One-Versus-All Logistic Regression Classifier

In [60]:
from scipy.special import expit
from sklearn.metrics import accuracy_score

class BinaryLogisticRegression:
    def __init__(self, eta, iterations=20, C=0.001):
        self.eta = eta
        self.iters = iterations
        self.C = C
        # internally we will store the weights as self.w_ to keep with sklearn conventions

    def __str__(self):
        if(hasattr(self,'w_')):
            return 'Binary Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
        else:
            return 'Untrained Binary Logistic Regression Object'

    # convenience, private:
    @staticmethod
    def _add_bias(X):
        return np.hstack((np.ones((X.shape[0],1)),X)) # add bias term

    @staticmethod
    def _sigmoid(theta):
        # increase stability, redefine sigmoid operation
        return expit(theta) #1/(1+np.exp(-theta))

    # vectorized gradient calculation with regularization using L2 Norm
    def _get_gradient(self,X,y):
        ydiff = y-self.predict_proba(X,add_bias=False).ravel() # get y difference
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through

        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += -2 * self.w_[1:] * self.C

        return gradient

    # public:
    def predict_proba(self,X,add_bias=True):
        # add bias term if requested
        Xb = self._add_bias(X) if add_bias else X
        return self._sigmoid(Xb @ self.w_) # return the probability y=1

    def predict(self,X):
        return (self.predict_proba(X)>0.5) #return the actual prediction


    def fit(self, X, y):
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape

        self.w_ = np.zeros((num_features,1)) # init weight vector to zeros

        # for as many as the max iterations
        for _ in range(self.iters):
            gradient = self._get_gradient(Xb,y)
            self.w_ += gradient*self.eta # multiply by learning rate
            # add because we are maximizing

# for this, we won't perform our own BFGS implementation
# (it takes a fair amount of code and understanding, which we haven't set up yet)
# luckily for us, scipy has its own BFGS implementation:
from scipy.optimize import fmin_bfgs # maybe the most common bfgs algorithm in the world
from numpy import ma
class BFGSBinaryLogisticRegression(BinaryLogisticRegression):

    @staticmethod
    def objective_function(w,X,y,C):
        g = expit(X @ w)
        # invert this because scipy minimizes, but we derived all formulas for maximizing
        return -np.sum(ma.log(g[y==1]))-np.sum(ma.log(1-g[y==0])) + C*sum(w**2)
        #-np.sum(y*np.log(g)+(1-y)*np.log(1-g))

    @staticmethod
    def objective_gradient(w,X,y,C):
        g = expit(X @ w)
        ydiff = y-g # get y difference
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0)
        gradient = gradient.reshape(w.shape)
        gradient[1:] += -2 * w[1:] * C
        return -gradient

    # just overwrite fit function
    def fit(self, X, y):
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape

        self.w_ = fmin_bfgs(self.objective_function, # what to optimize
                            np.zeros((num_features,1)), # starting point
                            fprime=self.objective_gradient, # gradient function
                            args=(Xb,y,self.C), # extra args for gradient and objective function
                            gtol=1e-03, # stopping criteria for gradient, |v_k|
                            maxiter=self.iters, # stopping criteria iterations
                            disp=False)

        self.w_ = self.w_.reshape((num_features,1))

class StochasticLogisticRegression(BinaryLogisticRegression):
    # stochastic gradient calculation
    def _get_gradient(self,X,y):
        idx = int(np.random.rand()*len(y)) # grab random instance
        ydiff = y[idx]-self.predict_proba(X[idx],add_bias=False) # get y difference (now scalar)
        gradient = X[idx] * ydiff[:,np.newaxis] # make ydiff a column vector and multiply through

        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += -2 * self.w_[1:] * self.C

        return gradient

class MultiClassLogisticRegression:
    def __init__(self, eta, iterations=20,
                 C=0.0001,
                 solver=BFGSBinaryLogisticRegression):
        self.eta = eta
        self.iters = iterations
        self.C = C
        self.solver = solver
        self.classifiers_ = []
        # internally we will store the weights as self.w_ to keep with sklearn conventions

    def __str__(self):
        if(hasattr(self,'w_')):
            return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
        else:
            return 'Untrained MultiClass Logistic Regression Object'

    def fit(self,X,y):
        num_samples, num_features = X.shape
        self.unique_ = np.sort(np.unique(y)) # get each unique class value
        num_unique_classes = len(self.unique_)
        self.classifiers_ = []
        for i,yval in enumerate(self.unique_): # for each unique value
            y_binary = np.array(y==yval).astype(int) # create a binary problem

            # train the binary classifier for this class

            hblr = self.solver(eta=self.eta,iterations=self.iters,C=self.C)
            hblr.fit(X,y_binary)

            # add the trained classifier to the list
            self.classifiers_.append(hblr)

        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T

    def predict_proba(self,X):
        probs = []
        for hblr in self.classifiers_:
            probs.append(hblr.predict_proba(X).reshape((len(X),1))) # get probability for each classifier

        return np.hstack(probs) # make into single matrix

    def predict(self,X):
        return self.unique_[np.argmax(self.predict_proba(X),axis=1)] # take argmax along row
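BFGS line searches assume that `fprime` is the exact gradient of the objective being minimized. In `BFGSBinaryLogisticRegression` above, the objective sums the log-likelihood and penalizes every weight, while the gradient averages over samples and skips the bias, so the two may not correspond. The sketch below shows one consistent pairing (mean negative log-likelihood with an L2 penalty that skips the bias) and verifies it with `scipy.optimize.check_grad`. The names `mean_nll` and `mean_nll_grad` and the synthetic data are assumptions for illustration, not part of the classes above.

```python
import numpy as np
from scipy.optimize import check_grad
from scipy.special import expit

def mean_nll(w, X, y, C):
    """Mean negative log-likelihood plus an L2 penalty that skips the bias w[0]."""
    z = X @ w
    # log(1 + exp(z)) - y*z is the per-sample negative log-likelihood for y in {0, 1}
    return np.mean(np.logaddexp(0.0, z) - y * z) + C * np.sum(w[1:] ** 2)

def mean_nll_grad(w, X, y, C):
    """Gradient of mean_nll with respect to w."""
    grad = X.T @ (expit(X @ w) - y) / len(y)
    grad[1:] += 2.0 * C * w[1:]  # the penalty contributes nothing to the bias term
    return grad

rng = np.random.default_rng(0)
X_demo = np.hstack((np.ones((200, 1)), rng.normal(size=(200, 3))))  # bias column first
y_demo = (rng.random(200) < 0.4).astype(float)
w_demo = rng.normal(size=4)

# check_grad returns the norm of (analytic gradient - finite-difference gradient).
err = check_grad(mean_nll, mean_nll_grad, w_demo, X_demo, y_demo, 0.01)
print(f"finite-difference gradient error: {err:.2e}")
```

A small error here suggests the pair is safe to hand to a quasi-Newton solver. Running the same check on an objective and gradient that disagree would report a large error.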

In [61]:
%%time
lr = MultiClassLogisticRegression(eta=1,
                                  iterations=100,
                                  C=0.0001,
                                  solver=BFGSBinaryLogisticRegression
                                  )
#np.hstack((np.ones((X.shape[0],1)),X))
lr.fit(X_train,y_train)
print(lr)
yhat = lr.predict(X_test)
print('Accuracy of: ',accuracy_score(y_test,yhat))
unique_yhat, counts_yhat = np.unique(yhat, return_counts=True)
unique_y, counts_y = np.unique(y, return_counts=True)
print(np.asarray((unique_yhat, counts_yhat)).T)
print(np.asarray((unique_y, counts_y)).T)

  stp, phi1, derphi1, task = minpack2.dcsrch(alpha1, phi1, derphi1,
  [A, B] = np.dot(d1, np.asarray([fb - fa - C * db,


MultiClass Logistic Regression Object with coefficients:
[[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [-1.71172984e-01  1.22886002e-03  1.64710496e-02 -6.67097780e-01
  -2.27072991e-01 -5.81251155e-01  3.74333200e-01  9.85211854e-01
  -7.64574247e-02  5.46352702e+00 -1.48504612e-04  1.21865832e-04]
 [-2.89539203e-01 -1.81710895e-03 -5.64459212e-03  1.74860556e-01
  -1.19560977e+00  1.62897839e+00 -4.57083679e-01 -8.35857340e-02
   3.02108822e-02 -1.86466721e+01  2.69626374e-04 -4.23657050e-04]]
Accuracy of:  0.9428
[['GALAXY' 5935]
 ['QSO' 1767]
 ['STAR' 2298]]
[['GALAXY' 59445]
 ['QSO' 18961]
 ['STAR' 21594]]
CPU times: total: 18.6 s
Wall time: 4.97 s


In [62]:
%%time
from sklearn.linear_model import LogisticRegression as SKLogisticRegression

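lr_sk = SKLogisticRegression(solver='liblinear') # liblinear solver, all other params default

# Note: this cell fits and scores on the full dataset (X, y),
# not on the X_train/X_test split used for the custom classifier above.
lr_sk.fit(X,y)
print(np.hstack((lr_sk.intercept_[:,np.newaxis],lr_sk.coef_)))
yhat = lr_sk.predict(X)
print('Accuracy of: ',accuracy_score(y,yhat))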

[[ 3.21599104e-02 -3.72100655e-05 -4.25783521e-03 -7.33694232e-03
   1.69276550e+00 -6.21858188e-01 -9.58891517e-01 -1.68039674e-01
  -4.07350411e-02 -3.68118092e-01 -2.32968086e-05  8.24709265e-05]
 [-1.23869658e-01  8.24455973e-04  9.95819805e-03 -8.07164976e-01
  -1.57781158e-01 -3.53185880e-01  2.56993470e-01  9.65000234e-01
  -1.92503566e-03  3.25739162e+00  6.23099008e-06  1.39144504e-04]
 [-1.69820843e+00 -1.45384851e-03 -4.89775800e-03  3.18138505e-02
  -7.15182778e-01  1.00090829e+00 -6.14679990e-01  4.52087754e-01
   1.08065927e-02 -1.40532020e+01  2.00636049e-04 -2.27145182e-04]]
Accuracy of:  0.88628
CPU times: total: 7.03 s
Wall time: 6.55 s


#### 2.2 Training Classifier for Good Generalization Performance

Is your method of selecting parameters justified? That is, do you think there is any "data snooping" involved with this method of selecting parameters?
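One common way to limit data snooping is to choose the regularization strength by cross-validation on the training partition only and to touch the test partition once at the end. The sketch below illustrates that idea with scikit-learn's `LogisticRegression` on synthetic binary data. The candidate grid, fold count, and data are assumptions for illustration. Note that scikit-learn's `C` is the inverse of the regularization strength, unlike the `C` in the classes from section 2.1, which multiplies the penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
# Synthetic binary problem (hypothetical, not the star classification data).
X_demo = rng.normal(size=(600, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# Score each candidate using folds drawn from the training partition only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidate_C = [0.01, 0.1, 1.0, 10.0]
cv_scores = {}
for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=1000)
    cv_scores[C] = cross_val_score(model, X_tr, y_tr, cv=cv).mean()

best_C = max(cv_scores, key=cv_scores.get)

# The test partition is used exactly once, after the parameter is fixed.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)
print("mean CV accuracy per C:", {C: round(float(s), 3) for C, s in cv_scores.items()})
print("chosen C:", best_C, "| held-out accuracy:", round(float(test_accuracy), 3))
```

If parameters were instead tuned by repeatedly checking accuracy on the test set, the reported test accuracy would be an optimistic estimate of generalization.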

#### 2.3 Comparing Best Performing Procedure to Scikit-Learn

In [63]:
# Visualize the performance differences in terms of training time and classification performance.
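A small harness can collect the fit time and held-out accuracy that such a visualization would display. The sketch below is an assumption-laden illustration: `time_and_score` is a hypothetical helper, the data is synthetic, and two scikit-learn solvers stand in for the models being compared. Any object exposing `fit` and `predict`, such as the `MultiClassLogisticRegression` class from section 2.1, could be added to the same dictionary.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def time_and_score(model, X_tr, y_tr, X_te, y_te):
    """Return (fit seconds, held-out accuracy) for any object with fit/predict."""
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    return elapsed, accuracy_score(y_te, model.predict(X_te))

rng = np.random.default_rng(0)
# Synthetic binary problem (hypothetical, not the star classification data).
X_demo = rng.normal(size=(2000, 5))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

candidates = {
    'sklearn liblinear': LogisticRegression(solver='liblinear'),
    'sklearn lbfgs': LogisticRegression(solver='lbfgs', max_iter=1000),
}
# Every model is trained and scored on the same partitions so the numbers are comparable.
results = {name: time_and_score(m, X_tr, y_tr, X_te, y_te) for name, m in candidates.items()}
for name, (seconds, acc) in results.items():
    print(f"{name:>18}: fit {seconds:.4f} s, test accuracy {acc:.3f}")
```

The `results` dictionary can then be passed to a plotting call such as matplotlib's `plt.bar`, with one chart for time and one for accuracy.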

Discuss the results. 

### 3. Deployment

Which implementation of logistic regression would you advise be used in a deployed machine learning model, your implementation or scikit-learn (or other third party)? Why?

### 4. BFGS

In [64]:
# Implementation of BFGS

Compare your performance accuracy and runtime to the BFGS implementation in SciPy (that we used in lecture). 
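As a starting point for this comparison, the sketch below shows a minimal BFGS loop with a backtracking (Armijo) line search and checks it against SciPy's `fmin_bfgs` on a synthetic binary logistic regression problem. It is a simplified illustration, not the lecture implementation: `bfgs_minimize`, `objective`, and `gradient` are hypothetical names, the data is synthetic, and SciPy's own routine uses a more careful Wolfe line search.

```python
import time
import numpy as np
from scipy.optimize import fmin_bfgs
from scipy.special import expit

def objective(w, X, y, C):
    """Mean negative log-likelihood with an L2 penalty that skips the bias w[0]."""
    z = X @ w
    return np.mean(np.logaddexp(0.0, z) - y * z) + C * np.sum(w[1:] ** 2)

def gradient(w, X, y, C):
    g = X.T @ (expit(X @ w) - y) / len(y)
    g[1:] += 2.0 * C * w[1:]
    return g

def bfgs_minimize(f, grad, w0, args=(), max_iter=200, gtol=1e-6):
    w = np.asarray(w0, dtype=float).copy()
    identity = np.eye(w.size)
    H = identity.copy()                  # running estimate of the inverse Hessian
    g = grad(w, *args)
    for _ in range(max_iter):
        if np.linalg.norm(g) < gtol:
            break
        p = -H @ g                       # quasi-Newton search direction
        # Backtracking line search enforcing the Armijo sufficient-decrease condition.
        step, f_w, slope = 1.0, f(w, *args), g @ p
        while f(w + step * p, *args) > f_w + 1e-4 * step * slope and step > 1e-10:
            step *= 0.5
        s = step * p
        g_new = grad(w + s, *args)
        yk = g_new - g
        sy = s @ yk
        if sy > 1e-12:                   # skip the update when the curvature condition fails
            rho = 1.0 / sy
            H = ((identity - rho * np.outer(s, yk)) @ H @ (identity - rho * np.outer(yk, s))
                 + rho * np.outer(s, s))
        w, g = w + s, g_new
    return w

rng = np.random.default_rng(0)
X_demo = np.hstack((np.ones((500, 1)), rng.normal(size=(500, 3))))  # bias column first
w_true = np.array([-0.5, 1.0, -2.0, 0.5])
y_demo = (rng.random(500) < expit(X_demo @ w_true)).astype(float)
args = (X_demo, y_demo, 0.001)

start = time.perf_counter()
w_mine = bfgs_minimize(objective, gradient, np.zeros(4), args=args)
t_mine = time.perf_counter() - start

start = time.perf_counter()
w_scipy = fmin_bfgs(objective, np.zeros(4), fprime=gradient, args=args, disp=False)
t_scipy = time.perf_counter() - start

print(f"custom BFGS: f = {objective(w_mine, *args):.6f}, {t_mine:.4f} s")
print(f"scipy  BFGS: f = {objective(w_scipy, *args):.6f}, {t_scipy:.4f} s")
```

Both routines should reach nearly the same weights on a well-conditioned problem like this one. Differences in runtime and iteration count on real data would mostly reflect the line search and stopping criteria.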
