# Logistic Regression:

## Introduction:
Logistic regression is a linear classifier that belongs to the family of discriminative machine learning models. It learns $P(y|x)$ directly from data and predicts with a linear threshold unit, i.e.
$$h(x) =  \begin{cases} 
      1 & w_{1}x_{1}+w_{2}x_{2}+\dots+w_{d}x_{d}\geqslant 0 \\
      0 & \text{otherwise}
   \end{cases}$$
   
Logistic regression learns a weight for each feature in the data set and uses the sigmoid function to turn the linear score into a probability. The linear function $wx$ ranges over $(-\infty,+\infty)$; the sigmoid maps it into $(0,1)$.
$$\mathrm{sigmoid}(wx) = \frac{1}{1+ e^{-wx}}$$
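
As a quick numerical illustration of this squashing behaviour, here is a minimal sketch. The input scores are arbitrary values chosen only for illustration. It also shows that thresholding the probability at 0.5 gives the same labels as thresholding the linear score at 0.

```python
import numpy as np

def sigmoid(a):
    # Logistic function: maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

# Arbitrary linear scores w.x, chosen only for illustration
scores = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probs = sigmoid(scores)

# Thresholding the probability at 0.5 matches thresholding the score at 0
labels_from_probs = (probs >= 0.5).astype(int)
labels_from_scores = (scores >= 0).astype(int)
```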

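The objective for logistic regression is the log-likelihood of the training labels, which is maximized (equivalently, its negative is the cost to minimize):
$$J(w) = \sum_{i=1}^{N} \left[ y_{i}\log p(x_{i},w) + (1-y_{i})\log(1-p(x_{i},w)) \right]$$

where $p(x_{i},w) = \mathrm{sigmoid}(wx_{i})$. The gradient is given by:
$$\frac{\partial J(w)}{\partial w_{j}} = \sum_{i=1}^{N}(y_{i}-p(x_{i},w))x_{ij}$$

and the weights are updated by gradient ascent:
$$w_{t+1} = w_{t} + \alpha \, \nabla J(w_{t})$$

where $\alpha$ is the learning rate.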
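
A finite-difference check is one way to confirm that the gradient expression agrees with the objective. The sketch below assumes the objective is the Bernoulli log-likelihood $\sum_i [y_i \log p_i + (1-y_i)\log(1-p_i)]$ with $p_i = \mathrm{sigmoid}(wx_i)$. The data is random and synthetic, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, X, y):
    # Bernoulli log-likelihood of labels y under probabilities sigmoid(X.w)
    p = sigmoid(X.dot(w))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    # Analytic gradient: sum_i (y_i - p_i) * x_ij for each feature j
    return X.T.dot(y - sigmoid(X.dot(w)))

# Synthetic data, used only to exercise the formulas
rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = (rng.rand(20) > 0.5).astype(float)
w = rng.randn(3) * 0.1

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (log_likelihood(w + step, X, y)
                  - log_likelihood(w - step, X, y)) / (2 * eps)

analytic = gradient(w, X, y)
max_abs_diff = np.max(np.abs(numeric - analytic))

# One small ascent step should increase the log-likelihood
w_next = w + 0.01 * analytic
```

If the two gradients agree, `max_abs_diff` is close to zero, and the log-likelihood at `w_next` is higher than at `w`.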

# Python Implementation:
**Let us first implement a few helper functions that we will need before implementing logistic regression.**

In [1]:
# Import libraries
from __future__ import division
import numpy as np
import time
# ShuffleSplit lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import StandardScaler

In [2]:
# Utility functions
def paddingData(data):
    '''
    :param data: Data to be padded
    :return: Padded data with value 1 in the first column (bias term)
    '''
    return np.c_[np.ones(data.shape[0]), data]

def setWeights(numFeat):
    '''
    :param numFeat: Total number of features in data
    :return: column vector of ones with one entry per feature
    '''
    return np.ones(numFeat).reshape((numFeat, 1))

def splitData(test_size, cv, numpoints):
    '''
    Draw random train/test splits (random permutations, not bootstrap samples).
    :param test_size: size of the test data required (value between 0 and 1).
    :param cv: Number of re-shuffling.
    :param numpoints: Total number of data points.
    :return: list of (train indices, test indices) pairs, one per re-shuffle.
    '''
    ss = ShuffleSplit(n_splits=cv, test_size=test_size, random_state=32)
    # split() only needs the number of samples, so a placeholder array is enough
    return list(ss.split(np.zeros((numpoints, 1))))

def calAccuracy(pred, ytest):
    '''
    :param pred: vector containing all the predicted classes
    :param ytest: vector containing all the true classes
    :return: accuracy of classification
    '''
    # Flatten both inputs so (n,) and (n, 1) arrays compare element-wise
    return np.mean(np.ravel(pred) == np.ravel(ytest))

def sigmoid(a):
    '''
    :param a: vector (w.x)
    :return: sigmoid transfer of the value
    '''
    return 1/(1+np.exp(-a))
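
For very negative inputs, `np.exp(-a)` overflows and NumPy emits a warning, although the result still rounds to 0. One common workaround evaluates the function piecewise so the exponent is never positive. This is an optional sketch rather than part of the implementation above, and `stable_sigmoid` is a hypothetical name.

```python
import numpy as np

def stable_sigmoid(a):
    # Piecewise evaluation: the argument passed to exp() is always <= 0
    a = np.asarray(a, dtype=float)
    out = np.empty_like(a)
    pos = a >= 0
    # For a >= 0 use 1 / (1 + exp(-a))
    out[pos] = 1.0 / (1.0 + np.exp(-a[pos]))
    # For a < 0 use the equivalent form exp(a) / (1 + exp(a))
    exp_a = np.exp(a[~pos])
    out[~pos] = exp_a / (1.0 + exp_a)
    return out

# Extreme values that would overflow in the naive formula
values = stable_sigmoid(np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0]))
```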

# Logistic Regression:

In [3]:
# Implements logistic regression
class logisticregression():

    def __init__(self, tol=0.0001, alpha=0.01):
        '''
        :param tol: tolerance on the norm of the weight update (default 0.0001)
        :param alpha: learning rate with the default value of 0.01
        '''
        self.weights = None
        self.tolerance = tol
        self.alpha = alpha

    def fit(self, Xtrain, ytrain):
        '''
        :param Xtrain: training data, one row per data point
        :param ytrain: column vector of 0/1 labels with shape (n, 1)
        '''
        start = time.time()
        # Pad the input with a column of ones for the bias term
        Xtrain = paddingData(Xtrain)
        self.weights = setWeights(Xtrain.shape[1])
        # Number of passes over the data
        run = 0
        while True:
            run += 1
            # Linear score under the current weights
            predict = np.dot(Xtrain, self.weights)
            # Probability of each data point belonging to class 1 (binary case)
            prob = sigmoid(predict)
            # Difference between the labels and the predicted probabilities
            error = ytrain - prob
            # Gradient of the log-likelihood, averaged over the data points
            gradient = np.dot(error.T, Xtrain) / ytrain.shape[0]
            temp = self.weights + self.alpha * gradient.T
            # Stop once the weights change by less than the tolerance
            step = np.linalg.norm(np.subtract(self.weights, temp))
            self.weights = temp
            if step < self.tolerance:
                break
        end = time.time()
        print('Time taken to fit data: {}'.format(end - start))
        print('Number of passes over data: {}'.format(run))

    def predict(self, Xtest):
        # Pad the test data
        Xtest = paddingData(Xtest)
        # Convert the linear score into a probability using the learned weights
        pred = sigmoid(Xtest.dot(self.weights))
        # Threshold at 0.5 to obtain class labels
        pred[pred > 0.5] = 1
        pred[pred <= 0.5] = 0
        return pred
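
The loop in `fit` runs until the weight update drops below the tolerance, so the number of passes has no upper bound. Below is a minimal sketch of the same update rule with an iteration cap. The names `fit_logistic` and `max_iter` are hypothetical, and the toy data is synthetic (two overlapping Gaussian clusters).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, alpha=0.01, tol=1e-4, max_iter=100000):
    # Pad with a column of ones for the bias term, start from all-ones weights
    X = np.c_[np.ones(X.shape[0]), X]
    w = np.ones((X.shape[1], 1))
    for n_iter in range(1, max_iter + 1):
        error = y - sigmoid(X.dot(w))
        # Averaged gradient of the log-likelihood
        gradient = X.T.dot(error) / y.shape[0]
        w_new = w + alpha * gradient
        step = np.linalg.norm(w_new - w)
        w = w_new
        # Stop early once the update is smaller than the tolerance
        if step < tol:
            break
    return w, n_iter

# Synthetic two-class data: clusters centred at (-1, -1) and (1, 1)
rng = np.random.RandomState(0)
X_toy = np.r_[rng.randn(50, 2) - 1.0, rng.randn(50, 2) + 1.0]
y_toy = np.r_[np.zeros((50, 1)), np.ones((50, 1))]

w_toy, passes = fit_logistic(X_toy, y_toy)
prob_toy = sigmoid(np.c_[np.ones(100), X_toy].dot(w_toy))
pred_toy = (prob_toy > 0.5).astype(float)
accuracy_toy = np.mean(pred_toy == y_toy)
```

With the cap in place, the loop returns after at most `max_iter` passes even if the tolerance is never reached.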

# Load the data set

In [4]:
from sklearn.datasets import load_breast_cancer
# Dataset summary:
#   Classes: 2
#   Samples per class: 212 (M), 357 (B)
#   Samples total: 569
#   Dimensionality: 30
#   Features: real, positive
breastcancer = load_breast_cancer()
numSamples, numFeat = breastcancer.data.shape
ss = splitData(test_size=0.25,cv=1, numpoints=numSamples)
for train_index, test_index in ss:
    Xtrain = breastcancer.data[train_index, :]
    ytrain = breastcancer.target[train_index].reshape((train_index.shape[0], 1))
    Xtest = breastcancer.data[test_index, :]
    ytest = breastcancer.target[test_index].reshape((test_index.shape[0], 1))

# Standardize to zero mean and unit variance; the scaler is fit on the training split only
scaler = StandardScaler()
Xtrain = scaler.fit_transform(Xtrain)
Xtest = scaler.transform(Xtest)
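
To make the scaling step concrete, here is a minimal NumPy sketch of what standardization does, assuming the default `StandardScaler` settings (centre each feature, then divide by its population standard deviation). The arrays are synthetic and the variable names are hypothetical. The statistics come from the training split only and are then reused for the test split.

```python
import numpy as np

# Synthetic features on very different scales
rng = np.random.RandomState(0)
scales = np.array([1.0, 10.0, 100.0])
X_train = rng.rand(8, 3) * scales
X_test = rng.rand(4, 3) * scales

# Statistics are estimated on the training split only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)

# The same mean and std are applied to both splits
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

The scaled training features have zero mean and unit variance by construction. The test features generally do not, because they are transformed with the training statistics.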

# Running our algorithm:
As a sanity check, we also compare the performance of our logistic regression implementation with sklearn's `LogisticRegression`.

In [5]:
clf = logisticregression()
clf.fit(Xtrain,ytrain)
pred = clf.predict(Xtest)
print('Accuracy: {}'.format(calAccuracy(pred, ytest) * 100))

Time taken to fit data: 0.741101980209
Number of passes over data: 10706
Accuracy: 99.3006993007


In [6]:
from sklearn.linear_model import LogisticRegression
lgr = LogisticRegression()
# sklearn expects a 1-D label vector, so flatten the (n, 1) column
lgr.fit(Xtrain, ytrain.ravel())
p = lgr.predict(Xtest)
print('Accuracy: {}'.format(calAccuracy(p, ytest) * 100))

Accuracy: 99.3006993007


# Conclusion:
Our implementation gives the same accuracy as sklearn on this split. Accuracy is a reasonable measure here because the class distribution is not heavily skewed (212 malignant vs. 357 benign samples). The same idea can be extended to multi-class classification using the one-vs-all method.
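
The one-vs-all idea can be sketched as follows: train one binary classifier per class (that class against all the others) and predict the class whose classifier reports the highest probability. This is a minimal illustration on synthetic data, not the notebook's implementation. The function names are hypothetical, and the binary trainer uses a fixed number of gradient steps for brevity.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_binary(X, y, alpha=0.1, n_iter=2000):
    # Fixed number of gradient-ascent steps on the averaged log-likelihood
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w += alpha * X.T.dot(y - sigmoid(X.dot(w))) / X.shape[0]
    return w

def fit_one_vs_all(X, y):
    # One weight vector per class, trained on "this class vs. the rest"
    X = np.c_[np.ones(X.shape[0]), X]
    classes = np.unique(y)
    W = np.array([fit_binary(X, (y == c).astype(float)) for c in classes])
    return classes, W

def predict_one_vs_all(X, classes, W):
    # Pick the class whose binary classifier gives the highest probability
    X = np.c_[np.ones(X.shape[0]), X]
    scores = sigmoid(X.dot(W.T))
    return classes[np.argmax(scores, axis=1)]

# Synthetic three-class data: Gaussian clusters at the corners of a triangle
rng = np.random.RandomState(0)
centers = np.array([[0.0, 4.0], [-4.0, -2.0], [4.0, -2.0]])
X_multi = np.vstack([c + rng.randn(30, 2) for c in centers])
y_multi = np.repeat([0, 1, 2], 30)

classes, W = fit_one_vs_all(X_multi, y_multi)
pred_multi = predict_one_vs_all(X_multi, classes, W)
accuracy_multi = np.mean(pred_multi == y_multi)
```

The clusters are placed so that each class can be separated from the other two by a line, which is the setting where one-vs-all with linear classifiers works well.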

# References:
1. [Wiki](https://en.wikipedia.org/wiki/Logistic_regression)
2. [Penn State: STAT 504](https://onlinecourses.science.psu.edu/stat504/node/149)
3. My class notes and slides.