# Learning Objectives

* Gain basic experience with machine learning libraries such as scikit-learn (```sklearn```)
* Know the difference between classification and regression
* Perform basic classifications in Python


## Supervised Learning: Classification

* Introduce the idea of **classification**
* Compare classification and regression models
* Explain what types of problems can be solved using classification

### Regression Vs. Classification

So far, we've studied **regression** problems that allow us to make predictions of the form $y = x\cdot \theta$

* That is, we've assumed **real-valued** (or numerical) outputs

How can we predict **binary** or **categorical** variables?

<img src="Datasets/Classification.jpg" alt="Drawing" style="width: 400px;"/>

### Why classification?

Will I **purchase** this product? (yes or no)

<table><tr>
<td> <img src="Datasets/Movie.jpg" alt="Drawing" style="width: 400px;"/> </td>
</tr></table>

Will I **click on** this ad? (yes or no)

<table><tr>
<td> <img src="Datasets/Ring.jpg" alt="Drawing" style="width: 400px;"/> </td>
</tr></table>


What animal appears in this image? (mandarin duck)

<table><tr>
<td> <img src="Datasets/Duck.jpg" alt="Drawing" style="width: 400px;"/> </td>
</tr></table>


## Concept: Linear Classification

We'll attempt to build **classifiers** that make decisions according to rules of the form

<table><tr>
<td> <img src="Datasets/Linear_Classification.jpg" alt="Drawing" style="width: 250px;"/> </td>
</tr></table>

This is called **linear classification**, since we're still making decisions according to a linear function, $X_i \cdot \theta$.
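
As a minimal sketch of such a rule, assuming the figure shows the usual thresholding of the linear score $X_i \cdot \theta$ at zero. The function name `linear_classify` and the toy numbers are illustrative and not taken from the course material.

```python
def linear_classify(x, theta):
    """Return True if the linear score x . theta is positive, else False."""
    score = sum(xj * tj for xj, tj in zip(x, theta))
    return score > 0

# Toy example: a constant offset feature (1) followed by two features.
theta = [-1.0, 2.0, 0.5]
print(linear_classify([1, 0.2, 0.4], theta))  # score = -1 + 0.4 + 0.2 = -0.4
print(linear_classify([1, 1.0, 0.4], theta))  # score = -1 + 2.0 + 0.2 = 1.2
```

The decision depends on the features only through the single number $X_i \cdot \theta$, which is what makes the classifier linear.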

### Logistic Regression

Logistic regression is a modification of regression algorithms to handle classification problems.

**Question**: How do we convert a real-valued expression ($X_i \cdot \theta\in\mathbb{R}$) into a probability ($p_{\theta}(y_i | X_i)\in [0, 1]$)?

**Answer**

<table><tr>
<td> <img src="Datasets/Sigmoid.jpg" alt="Drawing" style="width: 500px;"/> </td>
</tr></table>
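
A minimal sketch of the sigmoid, assuming the figure shows the standard logistic function $\sigma(t) = 1 / (1 + e^{-t})$. The helper name `sigmoid` is ours, and the branch on the sign of `t` is just one common way to avoid overflow in `math.exp` for inputs of large magnitude.

```python
import math

def sigmoid(t):
    """Map a real-valued score t to a value between 0 and 1."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    # For negative t, use the algebraically equivalent form e^t / (1 + e^t).
    z = math.exp(t)
    return z / (1.0 + z)

for t in (-5, 0, 5):
    print(t, sigmoid(t))
```

Large positive scores map to values near 1, large negative scores to values near 0, and a score of exactly 0 maps to 0.5.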

#### Training of Logistic Regression

$X_i \cdot \theta$ should be maximized when $y_i$ is positive and minimized when $y_i$ is negative.

Or equivalently, we have:

<table><tr>
<td> <img src="Datasets/Training_func.jpg" alt="Drawing" style="width: 600px;"/> </td>
</tr></table>

**How to optimize?**

<table><tr>
<td> <img src="Datasets/Optimize_func.jpg" alt="Drawing" style="width: 600px;"/> </td>
</tr></table>

* Take logarithm
* Compute gradient
* Solve using gradient **ascent**

<table><tr>
<td> <img src="Datasets/Derivative.jpg" alt="Drawing" style="width: 500px;"/> </td>
</tr></table>
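
The three steps above can be sketched in plain Python. This is a minimal illustration, assuming the standard logistic log-likelihood $\sum_i \log p_{\theta}(y_i | X_i)$ with $p_{\theta}(y_i = 1 | X_i) = \sigma(X_i \cdot \theta)$, which the figures above are presumed to show. The toy data, the step size, and all function names are ours; ```sklearn``` relies on more sophisticated solvers in practice.

```python
import math

def sigmoid(t):
    # Numerically safe logistic function
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def log_likelihood(theta, X, y):
    """Sum over samples of log p_theta(y_i | X_i)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(a * b for a, b in zip(xi, theta)))
        total += math.log(p) if yi else math.log(1.0 - p)
    return total

def gradient(theta, X, y):
    """Gradient of the log-likelihood: sum_i (y_i - sigmoid(X_i . theta)) * X_i."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        p = sigmoid(sum(a * b for a, b in zip(xi, theta)))
        for j, xij in enumerate(xi):
            grad[j] += (float(yi) - p) * xij
    return grad

# Toy, non-separable data: offset feature (1) plus one real feature.
X = [[1, -2.0], [1, -1.0], [1, 0.5], [1, -0.5], [1, 1.0], [1, 2.0]]
y = [False, False, False, True, True, True]

theta = [0.0, 0.0]
step = 0.1  # illustrative learning rate
ll_start = log_likelihood(theta, X, y)
for _ in range(500):
    g = gradient(theta, X, y)
    # Ascent: move along the gradient, since we are maximizing.
    theta = [t + step * gj for t, gj in zip(theta, g)]
ll_end = log_likelihood(theta, X, y)
print(theta, ll_start, ll_end)
```

The update adds the gradient rather than subtracting it because the log-likelihood is being maximized; minimizing the negative log-likelihood with gradient descent would give the same iterates.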

### Example - Polish companies bankruptcy data


* The dataset is about bankruptcy prediction of Polish companies. The bankrupt companies were analyzed in the period 2000-2012, while the still operating companies were evaluated from 2007 to 2013.

<img src="Datasets/bankrupt_data.jpg" alt="Drawing" style="width: 700px;"/>

### Code: Reading the data

* The data rows are in CSV format, but the file (an ARFF file) begins with a header that we need to skip.

In [1]:
import zipfile
z = zipfile.ZipFile("Datasets/data.zip")
f = z.open("5year.arff", 'r')

**NOTE** 

Header ends and the "real" data begins after we see the "@data" tag.

In [2]:
while b'@data' not in f.readline():
    pass

**Next** we read the CSV data. We skip rows with missing entries; convert all fields to floats; and convert the label to a bool.

In [6]:
dataset = []
for line in f:
    if b'?' in line:
        continue
    line = line.split(b',')
    values = [1] + [float(x) for x in line]  # leading 1 is a constant (offset) feature
    values[-1] = values[-1] > 0  # convert to bool
    dataset.append(values)

dataset[0]

[1,
 0.078518,
 0.20546,
 0.10393,
 2.7939,
 77.784,
 0.36515,
 0.093388,
 3.8672,
 1.2322,
 0.79454,
 0.093388,
 1.6119,
 0.25844,
 0.093388,
 735.12,
 0.49652,
 4.8672,
 0.093388,
 0.23659,
 32.076,
 0.99207,
 0.075428,
 0.19892,
 0.43626,
 0.79454,
 0.42414,
 2.3545,
 0.12401,
 5.0933,
 0.51863,
 0.23659,
 66.013,
 5.5292,
 0.36712,
 0.075428,
 0.41595,
 0.86215,
 0.94206,
 0.19109,
 0.045408,
 0.080363,
 0.19109,
 147.25,
 115.17,
 2.2635,
 2.1951,
 39.524,
 0.066803,
 0.16924,
 0.78786,
 0.057938,
 0.18086,
 0.948,
 1.124,
 12885.0,
 0.18842,
 0.098822,
 0.81158,
 0.18566,
 11.379,
 3.1692,
 53.575,
 6.8129,
 0.47096,
 False]
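
To try the same parsing steps without the data file, here is a self-contained sketch that runs them on a small in-memory example. The miniature ARFF content and the function name `parse_arff_rows` are made up for illustration; only the parsing logic mirrors the cells above.

```python
import io

# Made-up miniature ARFF file (two attributes plus a class label), for illustration only.
raw = b"""@relation toy
@attribute Attr1 numeric
@attribute Attr2 numeric
@attribute class {0,1}
@data
0.5,1.25,0
?,2.0,1
-0.1,3.0,1
"""

def parse_arff_rows(f):
    """Skip the header, drop rows with missing values, return [1, features..., bool label]."""
    for header_line in f:
        if b'@data' in header_line:
            break
    rows = []
    for line in f:
        if b'?' in line or not line.strip():
            continue  # missing entry or blank line
        fields = line.strip().split(b',')
        values = [1] + [float(x) for x in fields]
        values[-1] = values[-1] > 0  # convert label to bool
        rows.append(values)
    return rows

rows = parse_arff_rows(io.BytesIO(raw))
print(rows)
```

The row containing `?` is dropped, so two of the three data rows survive.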

In [7]:
len(dataset)

3028

* Number of **positive** samples

In [8]:
sum(x[-1] for x in dataset)

102

* Next we extract our features (X) and labels (y), much as we would do for a regression problem

In [9]:
X = [values[:-1] for values in dataset]

In [10]:
y = [values[-1] for values in dataset]   # True/False labels

### Concept: The ```sklearn``` library

The ```sklearn``` library contains a number of different regression and classification models.

* ```linear_model.LinearRegression()``` - linear regression
* ```linear_model.LogisticRegression()``` - logistic regression
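
Both model classes follow the same `fit` / `predict` pattern. The sketch below uses a tiny made-up dataset (not the bankruptcy data) purely to show the shape of the calls.

```python
from sklearn import linear_model

# Toy data: one feature, made up for illustration.
X = [[0.0], [1.0], [2.0], [3.0]]
y_real = [0.1, 1.1, 1.9, 3.2]          # real-valued targets -> regression
y_bool = [False, False, True, True]    # binary labels -> classification

reg = linear_model.LinearRegression().fit(X, y_real)
clf = linear_model.LogisticRegression().fit(X, y_bool)

print(reg.predict([[1.5]]))          # a real number
print(clf.predict([[0.0], [3.0]]))   # class labels
print(clf.predict_proba([[1.5]]))    # one probability per class, ordered as clf.classes_
```

The regression model returns real values, while the classifier returns labels from the set it saw during training and can also report class probabilities.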


### Code: Fitting the logistic regression model

* First we import the library and create an instance of the model, before fitting it to data

In [11]:
from sklearn import linear_model

In [18]:
model = linear_model.LogisticRegression(solver='liblinear')

model.fit(X, y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

* **Note** that ```fit``` doesn't compute any predictions; it just updates the class instance to store the fitted model (the text printed above is only a description of that instance)

### Code: Making predictions

* Make predictions from the data

In [19]:
predictions = model.predict(X)
predictions

array([False, False, False, ..., False, False, False])

* Check whether they match the labels

In [21]:
correctPredictions = predictions == y
correctPredictions

array([ True,  True,  True, ..., False, False, False])

* And compute the accuracy (the fraction of correct predictions)

In [22]:
sum(correctPredictions) / len(correctPredictions)

0.9666446499339498
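
Because only 102 of the 3028 samples are positive, it helps to compare this number against a trivial baseline. The sketch below uses only the two counts printed earlier; the variable names are ours. It computes the accuracy of a classifier that always predicts `False`.

```python
n_samples = 3028    # len(dataset), from the output above
n_positive = 102    # number of positive (True) labels, from the output above

# A classifier that always predicts False is right on every negative sample.
baseline_accuracy = (n_samples - n_positive) / n_samples
print(baseline_accuracy)
```

This baseline comes out at roughly 0.9663, so the accuracy of about 0.9666 measured above is only marginally better than always predicting the majority class.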

## Training Vs. Testing

We achieved fairly high accuracy using a simple classifier "off the shelf"

* But note that we're evaluating our classifier on the same data that was used to train it
* How can we be sure that our classifier will work well on **unseen data**?
* This is something we'll cover in the next course, when we look at **training, testing, and validation**

If we **evaluate** a system on the same data used to **train** the system, we may overestimate its performance. Really, we want to know how well a method is likely to work on **unseen data**.

To estimate how well a system is likely to perform on new data, we can split our dataset into two components:

* A **training set** to train the machine learning model
* A **test set** used to estimate the performance on new data


### Code: Training and testing

First we read the dataset, much as we did above (this time from an extracted copy of the file, opened in text mode):

In [23]:
f = open("Datasets/data/5year.arff", 'r')

In [24]:
while '@data' not in f.readline():
    pass

In [25]:
dataset = []

for line in f:
    if '?' in line:
        continue
    line = line.split(',')
    values = [1] + [float(x) for x in line]
    values[-1] = values[-1] > 0  # convert to bool
    dataset.append(values)
    
dataset[0]

[1,
 0.088238,
 0.55472,
 0.01134,
 1.0205,
 -66.52,
 0.34204,
 0.10949,
 0.57752,
 1.0881,
 0.32036,
 0.10949,
 0.1976,
 0.096885,
 0.10949,
 1475.2,
 0.24742,
 1.8027,
 0.10949,
 0.077287,
 50.199,
 1.1574,
 0.13523,
 0.062287,
 0.41949,
 0.32036,
 0.20912,
 1.0387,
 0.026093,
 6.1267,
 0.37788,
 0.077287,
 155.33,
 2.3498,
 0.24377,
 0.13523,
 1.4493,
 571.37,
 0.32101,
 0.095457,
 0.12879,
 0.11189,
 0.095457,
 127.3,
 77.096,
 0.45289,
 0.66883,
 54.621,
 0.10746,
 0.075859,
 1.0193,
 0.55407,
 0.42557,
 0.73717,
 0.73866,
 15182.0,
 0.080955,
 0.27543,
 0.91905,
 0.002024,
 7.2711,
 4.7343,
 142.76,
 2.5568,
 3.2597,
 False]

The first thing we do differently is to **shuffle** the data:

* We do this because we want the training and test sets to be **random samples** of the data - if we didn't use random samples, different subsets of the data could have distinct characteristics that could cause the model to under- (or over-) perform on one of them.

In [30]:
import random

random.shuffle(dataset)
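
To see why the shuffle matters, consider a hypothetical dataset whose rows happen to be stored sorted by label. The labels and the fixed seed below are made up for illustration and are not taken from the bankruptcy data.

```python
import random

# Hypothetical labels stored in sorted order: 90 negatives followed by 10 positives.
labels = [False] * 90 + [True] * 10
half = len(labels) // 2

# Without shuffling, the first half contains no positives at all.
print(sum(labels[:half]), sum(labels[half:]))   # 0 10

# After shuffling, each positive is equally likely to land in either half.
shuffled = labels[:]
random.Random(0).shuffle(shuffled)   # fixed seed only to make the sketch repeatable
print(sum(shuffled[:half]), sum(shuffled[half:]))
```

A model trained on the unshuffled first half would never see a positive example, which is the kind of mismatch between subsets that random sampling is meant to avoid.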

In [31]:
X = [values[:-1] for values in dataset]
y = [values[-1] for values in dataset]

Next we **split** the data into a **train** and a **test** portion.

In [32]:
N = len(X)

X_train = X[:N//2]   # double-slash for "floor" division (rounds down to nearest whole number)
X_test = X[N//2:]
y_train = y[:N//2]
y_test = y[N//2:]

In [33]:
len(X), len(X_train), len(X_test)

(3031, 1515, 1516)
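
The shuffle-and-split steps can also be done in one call with ```sklearn```'s `train_test_split`. This is an alternative sketch rather than the approach used in this notebook; the toy data and `random_state=0` are assumptions chosen only to make it runnable and repeatable.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature vectors and labels.
X = [[1, float(i)] for i in range(10)]
y = [i >= 5 for i in range(10)]

# test_size=0.5 mirrors the half/half split; shuffling is on by default.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

print(len(X_train), len(X_test))
```

The function keeps each feature vector paired with its label, so there is no need to shuffle a combined `dataset` list first.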

Now we train our model as before, but we use **only the training data and labels**.

In [34]:
from sklearn import linear_model

model = linear_model.LogisticRegression(solver='liblinear')

model.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

Finally we can compute the accuracy of the model, but this time we do so separately for the training and test portions.

In [35]:
predictionTrain = model.predict(X_train)
predictionTest = model.predict(X_test)

In [36]:
correctPredictionTrain = predictionTrain == y_train
correctPredictionTest = predictionTest == y_test

In [37]:
sum(correctPredictionTrain) / len(correctPredictionTrain)   # Training Accuracy

0.9696369636963696

In [38]:
sum(correctPredictionTest) / len(correctPredictionTest)    # Test Accuracy

0.9571240105540897

The latter quantity estimates **how well the model is likely to perform on new, unseen data**
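
Since the same accuracy computation is applied to both portions, it can be wrapped in a small helper. The function name `accuracy` and the example values are ours, shown only as a sketch.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the corresponding label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# Toy usage with made-up values:
print(accuracy([True, False, False, True], [True, False, True, True]))  # 3 of 4 correct -> 0.75
```

For ```sklearn``` classifiers, `model.score(X_test, y_test)` also returns the mean accuracy on the given data.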

## Summary of concepts

* Simply training on a dataset doesn't give us a sense of how a model will **generalize to new data**
* This generalization ability can be estimated using a test set
* Training and test sets should be **non-overlapping, random** splits of our data