# Homework 1

Feel free to use this notebook as your report for the whole 4th task of the homework. In other words, you can plot curves, print out results right here, no need to put it in the pdf (though you can if you want!). Just make sure that you **discuss**/**describe** your results somewhere.

Please keep in mind that
***all the numbers/results given in your report here should be reproducible by us***

## Part 1 : Logistic regression with SciKit Learn

In the first part of this practical section, you have to implement a *logistic regression classifier* using the *scikit-learn* library.

## Data

Please use the `make_blobs` function, as we did in the lab, to produce some random datasets. Consider giving us the `random_seed` that you use when reporting results in the report.

In [1]:
# make_blobs is a function in scikit-learn, not a standalone package
!pip install scikit-learn


In [1]:
### MAKE A DATASET HERE ###
# make_blobs is exposed directly by sklearn.datasets
from sklearn.datasets import make_blobs

In [2]:
%matplotlib widget

import numpy
import matplotlib.pyplot as plot
from tqdm import tqdm

#random_seed = numpy.random.randint(0,100)
random_seed = 65

x, y = make_blobs(n_samples=2000, centers=2, n_features=2,
                  random_state=random_seed)

plot.figure()
plot.scatter(x[:,0], x[:,1], c=y)
plot.show()

x_train, y_train = x[:1500], y[:1500]
x_test, y_test = x[1500:], y[1500:]

# sanity check
assert len(x_train) == len(y_train) == 1500
assert len(x_test) == len(y_test) == 500
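
The slice-based split above works because `make_blobs` shuffles the samples by default (`shuffle=True`). As a hedged alternative, the sketch below uses `train_test_split` with stratification and a fixed `random_state`. The `_alt` variable names are illustrative and are not used elsewhere in this notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

random_seed = 65  # same seed as the cell above

x_alt, y_alt = make_blobs(n_samples=2000, centers=2, n_features=2,
                          random_state=random_seed)

# Hold out 500 of the 2000 samples. stratify keeps the class ratio the same
# in both parts, and random_state makes the split reproducible.
x_train_alt, x_test_alt, y_train_alt, y_test_alt = train_test_split(
    x_alt, y_alt, test_size=500, stratify=y_alt, random_state=random_seed)

print(x_train_alt.shape, x_test_alt.shape)
```

Either split is reproducible as long as the seed is reported; the stratified version only adds a guarantee about class balance.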


## Model

Describe the model you used here, use latex math to write down equations.

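Model: $p_{+} = p(y = 1 \mid x) = \frac{1}{1 + e^{-(w \cdot x + b)}}$, where $w \in \mathbb{R}^{d}$ and $b \in \mathbb{R}$

Distance function (cross-entropy loss): $-(y \log(p_{+}) + (1 - y) \log(1 - p_{+}))$

Learning rule: $w \leftarrow w - \eta (\hat{y} - y) x$ and $b \leftarrow b - \eta (\hat{y} - y)$, where $\hat{y} = p_{+}$ and $\eta$ is the learning rate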
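
The learning rule above can be sketched in plain NumPy. This is a minimal illustration, not what scikit-learn runs internally: it assumes batch gradient descent with the update averaged over all samples, and the toy data, the function names `sigmoid` and `fit_logistic`, and the values of `eta` and `n_epochs` are all made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, eta=0.1, n_epochs=200):
    """Batch gradient descent on the mean cross-entropy loss."""
    n, d = x.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_epochs):
        y_hat = sigmoid(x @ w + b)        # p(y = 1 | x) for every sample
        w -= eta * (y_hat - y) @ x / n    # averaged form of the rule for w
        b -= eta * np.mean(y_hat - y)     # same rule applied to the bias
    return w, b

# Toy data (assumed for illustration): two well-separated 2-D clusters.
rng = np.random.default_rng(0)
x0 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(100, 2))
x1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))
x_toy = np.vstack([x0, x1])
y_toy = np.concatenate([np.zeros(100), np.ones(100)])

w_toy, b_toy = fit_logistic(x_toy, y_toy)
pred_toy = (sigmoid(x_toy @ w_toy + b_toy) >= 0.5).astype(float)
accuracy_toy = np.mean(pred_toy == y_toy)
print('Toy training accuracy: {:.3f}'.format(accuracy_toy))
```

Note that `LogisticRegression` in scikit-learn minimizes this loss with an added L2 penalty by default, so its weights will generally differ from those of an unregularized sketch like this one.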

Hint: the model definition in scikit learn is very similar to what we did in the lab with perceptron.

In [3]:
### MAKE A MODEL HERE ###
from sklearn.linear_model import LogisticRegression
myLogReg = LogisticRegression()

## Learning

Perform the training of your model on your training dataset and report the accuracy/error here.

In [4]:
### DO THE TRAINING HERE ###
myLogReg.fit(x_train,y_train)
train_accuracy = myLogReg.score(x_train, y_train)
print('Training accuracy rate is {}'.format(train_accuracy))



Training accuracy rate is 0.9426666666666667


## Analysis

Do the analysis of your model here:

Compute the error rate on the test set and discuss the result. Can it be trained better? If not, why?

You can plot the class boundary produced by your model to make your arguments stronger.

In [5]:
### DO THE ANALYSIS HERE ##
test_accuracy = myLogReg.score(x_test, y_test)
print('Test accuracy rate is {}'.format(test_accuracy))

Test accuracy rate is 0.932
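
The task asks for the error rate, while `score` returns accuracy. The two are related by error = 1 - accuracy, so the reported test accuracy of 0.932 corresponds to a test error rate of 0.068. The sketch below recomputes this and adds a confusion matrix; it rebuilds the data and model so that it runs on its own, and the `_chk` names are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Recreate the dataset and split used in this part (seed 65).
x_chk, y_chk = make_blobs(n_samples=2000, centers=2, n_features=2,
                          random_state=65)
model_chk = LogisticRegression().fit(x_chk[:1500], y_chk[:1500])

test_acc_chk = model_chk.score(x_chk[1500:], y_chk[1500:])
test_err_chk = 1.0 - test_acc_chk   # error rate = 1 - accuracy

# Rows are true classes, columns are predicted classes.
cm_chk = confusion_matrix(y_chk[1500:], model_chk.predict(x_chk[1500:]))

print('Test error rate is {:.3f}'.format(test_err_chk))
print(cm_chk)
```

The off-diagonal entries of the confusion matrix count the misclassified test points, so their sum divided by 500 equals the error rate.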


In [6]:
# get w and b from the model
sk_w = numpy.concatenate((myLogReg.coef_.reshape(-1),myLogReg.intercept_.reshape(-1)))
print('Sklearn: {:.3f} x_1 + {:.3f} x_2 + {:.3f} = 0'.format(*sk_w))

Sklearn: 0.557 x_1 + 1.419 x_2 + 13.914 = 0


In [7]:
w = sk_w[:2]  # weight vector
b = sk_w[2]   # bias (scalar intercept)
# visualize data 
def vis_data(x, y = None, c='r'):
    if y is None: 
        y = [None] * len(x)
    for x_, y_ in zip(x, y):
        if y_ is None:
            plot.plot(x_[0], x_[1], 'o', markerfacecolor='none', markeredgecolor=c)
        else:
            plot.plot(x_[0], x_[1], c+'o' if y_ == 0 else c+'+')
    plot.grid(True)  # pass a real boolean; the string 'on' is deprecated
    
def vis_hyperplane(w, b, typ='k--'):

    lim0 = plot.gca().get_xlim()
    lim1 = plot.gca().get_ylim()
    m0, m1 = lim0[0], lim0[1]

    intercept0 = -(w[0] * m0 + b)/w[1]
    intercept1 = -(w[0] * m1 + b)/w[1]
    
    plt1, = plot.plot([m0, m1], [intercept0, intercept1], typ)

    plot.gca().set_xlim(lim0)
    plot.gca().set_ylim(lim1)
        
    return plt1

plot.figure(figsize=(10,10))

vis_data(x, y, c='r')

plt1 = vis_hyperplane(w, b, 'k--')
plot.legend([plt1], [
        'Final: ${:.2} x_1 + {:.2} x_2 + {:.2} = 0$'.format(*list(w)+[b])],
           loc='best')

plot.show()



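The training accuracy (0.943) and the test accuracy (0.932) are close, so the model does not appear to overfit. We may be able to improve the model slightly with a larger data set, although if the two blobs overlap, no linear boundary can classify every point correctly.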
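
One way to check whether more data would help is a learning curve: if validation accuracy has already flattened as the training set grows, extra samples are unlikely to change much. The sketch below is an assumption-laden illustration using `learning_curve` with 5-fold cross-validation on the same seed; the `_lc` names and the chosen training sizes are not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

x_lc, y_lc = make_blobs(n_samples=2000, centers=2, n_features=2,
                        random_state=65)

# Fit the model on 10%, 32.5%, ..., 100% of each training fold and
# score it on the held-out fold (5-fold cross-validation).
sizes_lc, train_scores_lc, val_scores_lc = learning_curve(
    LogisticRegression(), x_lc, y_lc,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes_lc,
                     train_scores_lc.mean(axis=1),
                     val_scores_lc.mean(axis=1)):
    print('n = {:4d}  train accuracy = {:.3f}  validation accuracy = {:.3f}'
          .format(int(n), tr, va))
```

If the two accuracy columns are already close at the smaller sizes, the remaining error is more likely due to class overlap than to a lack of data.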

## Part 2 : SVM classifier with SciKit Learn

In the second part of this practical section, you have to implement an *SVM classifier* using the *scikit-learn* library.

## Data

Please use the `make_blobs` function, as we did in the lab, to produce some random datasets. Consider giving us the `random_seed` that you use when reporting results in the report.

In [9]:
### MAKE A DATASET HERE ###
# make_blobs is exposed directly by sklearn.datasets
from sklearn.datasets import make_blobs
%matplotlib widget

import numpy
import matplotlib.pyplot as plot
from tqdm import tqdm

#random_seed = numpy.random.randint(0,100)
random_seed = 1234

x, y = make_blobs(n_samples=2000, centers=2, n_features=2,
                  random_state=random_seed)
plot.figure()
plot.scatter(x[:,0], x[:,1], c=y)
plot.show()

x_train, y_train = x[:1500], y[:1500]
x_test, y_test = x[1500:], y[1500:]

# sanity check
assert len(x_train) == len(y_train) == 1500
assert len(x_test) == len(y_test) == 500



## Model

Describe the model you used here, use latex math to write down equations.

\begin{align}
\min_{w, b, \zeta} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i \\
\textrm{subject to} \quad & y_i (w^T \phi(x_i) + b) \geq 1 - \zeta_i, \\
& \zeta_i \geq 0, \quad i = 1, \ldots, n
\end{align}

Here the labels are $y_i \in \{-1, +1\}$, and with the linear kernel used below $\phi(x) = x$.
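
To connect the optimization problem above to a fitted model, the sketch below recovers the slack variables as $\zeta_i = \max(0, 1 - y_i (w^T x_i + b))$ and evaluates the primal objective. It assumes a linear kernel (so $\phi(x) = x$), the default `C=1.0`, and labels mapped from {0, 1} to {-1, +1}; the `_s` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

x_s, y_s = make_blobs(n_samples=2000, centers=2, n_features=2,
                      random_state=1234)
y_pm = 2 * y_s - 1                      # map labels {0, 1} to {-1, +1}

C_s = 1.0                               # scikit-learn's default value of C
svc_s = SVC(kernel='linear', C=C_s).fit(x_s, y_s)
w_s = svc_s.coef_.ravel()               # weight vector w
b_s = svc_s.intercept_[0]               # bias b

margins_s = y_pm * (x_s @ w_s + b_s)    # y_i (w^T x_i + b)
slack_s = np.maximum(0.0, 1.0 - margins_s)   # zeta_i, zero outside the margin
objective_s = 0.5 * w_s @ w_s + C_s * slack_s.sum()

print('Points with non-zero slack: {}'.format(int((slack_s > 0).sum())))
print('Primal objective value: {:.3f}'.format(objective_s))
```

Points with zero slack lie on or outside the margin; points with slack greater than 1 are misclassified.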


In [10]:
### MAKE A MODEL HERE ###
from sklearn import svm
# gamma only affects the 'rbf', 'poly' and 'sigmoid' kernels, so it is omitted here
clf = svm.SVC(kernel='linear')

## Learning

Perform the training of your model on your training dataset and report the accuracy/error here.

In [11]:
### DO THE TRAINING HERE ###
clf.fit(x_train, y_train)
# score() returns accuracy, not error
train_accuracy2 = clf.score(x_train, y_train)
print('Training accuracy rate {}'.format(train_accuracy2))

Training accuracy rate 0.9993333333333333


## Analysis

Do the analysis of your model here:

Compute the error rate on the test set and discuss the result. Can it be trained better? If not, why?

You can plot the class boundary produced by your model to make your arguments stronger.

In [12]:
### DO THE ANALYSIS HERE ###
test_accuracy2 = clf.score(x_test, y_test)
print('Test accuracy rate {}'.format(test_accuracy2))

Test accuracy rate 1.0


In [13]:
SVM_w = numpy.concatenate((clf.coef_.reshape(-1),clf.intercept_.reshape(-1)))
print('Sklearn: {:.3f} x_1 + {:.3f} x_2 + {:.3f} = 0'.format(*SVM_w))

Sklearn: 1.458 x_1 + 1.297 x_2 + 0.395 = 0
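
The printed coefficients also give the geometry of the boundary: solving $w_1 x_1 + w_2 x_2 + b = 0$ for $x_2$ yields the slope $-w_1 / w_2$ and intercept $-b / w_2$, and the margin between the hyperplanes $w^T x + b = \pm 1$ has width $2 / \lVert w \rVert$. The sketch below works this out from the rounded values printed above, so the results are approximate; the `_printed` names are illustrative.

```python
import numpy as np

# Coefficients copied from the printed output above (rounded to 3 decimals).
w_printed = np.array([1.458, 1.297])
b_printed = 0.395

# Boundary written as x_2 = slope * x_1 + intercept.
slope_printed = -w_printed[0] / w_printed[1]
intercept_printed = -b_printed / w_printed[1]

# Distance between the two margin hyperplanes w.x + b = +1 and w.x + b = -1.
margin_width = 2.0 / np.linalg.norm(w_printed)

print('x_2 = {:.3f} x_1 + {:.3f}'.format(slope_printed, intercept_printed))
print('Margin width: {:.3f}'.format(margin_width))
```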


In [14]:
w = SVM_w[:2]  # weight vector
b = SVM_w[2]   # bias (scalar intercept)
# visualize data 
def vis_data(x, y = None, c='r'):
    if y is None: 
        y = [None] * len(x)
    for x_, y_ in zip(x, y):
        if y_ is None:
            plot.plot(x_[0], x_[1], 'o', markerfacecolor='none', markeredgecolor=c)
        else:
            plot.plot(x_[0], x_[1], c+'o' if y_ == 0 else c+'+')
    plot.grid(True)  # pass a real boolean; the string 'on' is deprecated
    

plot.figure(figsize=(10,10))

vis_data(x, y, c='r')

plt1 = vis_hyperplane(w, b, 'k--')
plot.legend([plt1], [
        'Final: ${:.2} x_1 + {:.2} x_2 + {:.2} = 0$'.format(*list(w)+[b])],
           loc='best')

plot.show()



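The test accuracy is already 1.0 and the training accuracy is 0.999, so there is little left to improve on this data set. In general, we can improve the model by tuning hyperparameters such as `C`. I found some resources here, although they address logistic regression and I don't think the other points apply to the SVM model:
https://stackoverflow.com/questions/38077190/how-to-increase-the-model-accuracy-of-logistic-regression-in-scikit-python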
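
As a hedged sketch of what parameter tuning could look like for this SVM, the example below runs a small cross-validated grid search over `C` on the training split. The grid values, the 5-fold setting, and the `_g` names are assumptions chosen for illustration, not settings taken from the linked resource.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Recreate the dataset and split used in this part (seed 1234).
x_g, y_g = make_blobs(n_samples=2000, centers=2, n_features=2,
                      random_state=1234)
x_g_train, y_g_train = x_g[:1500], y_g[:1500]
x_g_test, y_g_test = x_g[1500:], y_g[1500:]

# Candidate values for the regularization strength C (illustrative).
param_grid_g = {'C': [0.01, 0.1, 1, 10]}

# 5-fold cross-validation on the training split only; the test split
# is kept aside for the final score.
search_g = GridSearchCV(SVC(kernel='linear'), param_grid_g, cv=5)
search_g.fit(x_g_train, y_g_train)

print('Best C: {}'.format(search_g.best_params_['C']))
print('Cross-validated accuracy: {:.4f}'.format(search_g.best_score_))
print('Test accuracy: {:.4f}'.format(search_g.score(x_g_test, y_g_test)))
```

Tuning on cross-validation folds rather than on the test set keeps the reported test accuracy an honest estimate.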
