# Class 5 - Logistic regression

## Logistic regression: the model

#### From regression to classification

In the previous two classes we have seen how to solve regression problems, in which we had to predict a continuous output, i.e. our setting was $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \mathbb{R}$. Today we will see how to solve problems of binary **classification**, in which we are interested in correctly predicting if a given input in $\mathcal{X} = \mathbb{R}^d$ belongs to one of two classes (i.e. $\mathcal{Y} = \{0,1\}$). 

Today we will present **logistic regression**, which is one of the possible models that can be used to solve a binary classification problem and is conceptually close to linear regression. Of course we cannot just use linear regression for binary classification, because a linear regression model outputs response values in $\mathbb{R}$, while we want our response to be in $\{0,1\}$.

First of all, let's pass from affine models to linear models by adding one more feature, constantly equal to one, to our dataset ($x \mapsto (1, x)$). From now on $d$ will denote the dimension of our extended feature space.

A possible solution to our range problem is to compose the output of a linear regression with a function with a binary range, that is we could use the following hypothesis class:

$$\{x\mapsto \text{sign}\left(\langle w,x\rangle \right) \;\vert\; w\in\mathbb{R}^d\}$$

and claim that if the sign of our linear regression prediction is positive, then we will label the point $1$, while if the sign is negative, we will label it $0$. 

This hypothesis class is well known and particularly useful for so-called linearly separable problems (we will see more on this in a following class).
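The two steps described above (feature augmentation and sign-based labeling) can be sketched as follows. This is only an illustration: the helper names `add_constant_feature` and `sign_classifier` and the weights `w_demo` are hypothetical, and assigning label 0 when the inner product is exactly zero is an arbitrary choice made here.

```python
import numpy as np

def add_constant_feature(X):
    """Map each row x to (1, x), so that an affine model becomes a linear one."""
    return np.hstack((np.ones((X.shape[0], 1)), X))

def sign_classifier(w, X):
    """Label 1 where <w, x> > 0, label 0 otherwise (ties go to 0 by assumption)."""
    return (X @ w > 0).astype(int)

X_raw = np.array([[2.0], [-3.0]])
X_ext = add_constant_feature(X_raw)   # rows (1, 2) and (1, -3)
w_demo = np.array([0.5, 1.0])         # hypothetical weights, not learned
labels = sign_classifier(w_demo, X_ext)
print(labels)
```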

Logistic regression, instead, is based on the following choice of hypothesis class:

$$\{x\mapsto f_{w}(x) = \sigma\left(\langle w,x\rangle \right) \;\vert\; w \in \mathbb{R}^d\},$$

where $\sigma$ is the **sigmoid** function, defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6,6,1000)

plt.plot(x, 1/(1+np.exp(-x)), 'b-')
plt.show()
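For inputs with large absolute value, evaluating `np.exp(-x)` directly can overflow. A possible workaround, sketched here under the assumption that the input is a NumPy array (the name `stable_sigmoid` is hypothetical), is to use the equivalent form $e^{x}/(1+e^{x})$ for negative inputs:

```python
import numpy as np

def stable_sigmoid(t):
    """Sigmoid evaluated without overflowing np.exp for large |t|."""
    t = np.asarray(t, dtype=float)
    out = np.empty_like(t)
    pos = t >= 0
    # For t >= 0 we have exp(-t) <= 1, so the textbook formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-t[pos]))
    # For t < 0 we rewrite sigma(t) = exp(t) / (1 + exp(t)), with exp(t) < 1.
    exp_t = np.exp(t[~pos])
    out[~pos] = exp_t / (1.0 + exp_t)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```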

#### The maximum likelihood method

The output of our model will therefore not be binary, but it will lie in $(0,1)$, and we can interpret it as the conditional probability that the correct labeling is $Y=1$, given that we observe $X=x$:

$$y = f_{w}(x) = \sigma\left(\langle w,x\rangle \right) = \mathbb{P}(Y = 1 | X = x).$$

Now, we want to choose our parameters ($w \in \mathbb{R}^{d}$), in such a way that this conditional probability is as close as possible to the real one, by using the information provided by our training dataset.

In the case of linear regression, we found the parameters by minimizing a given loss function (squared error or absolute value loss) on our training dataset. This worked well analytically and/or computationally because that loss function was convex in the parameters. In the case of logistic regression, however, composing these losses with the non-linear sigmoid function $\sigma$ destroys this convexity (the squared error, for instance, is no longer convex in $w$), so we need another way to choose the parameters.

We can fix our parameters by imposing that **the training dataset that we observe must have the maximum probability of occurring**, in other words we want to maximize the following function:



$$ \prod_{i=1}^n \mathbb{P}(Y=y_i, X=x_i) = \prod_{i=1}^n \mathbb{P}(Y=y_i | X=x_i) \mathbb{P}(X = x_i) \propto \prod_{i=1}^n \left(f_{w}(x_i)\right)^{y_i} \left(1 - f_{w}(x_i)\right)^{1 - y_i} =: L(w) $$

where $L(w)$ is known as the likelihood function and is proportional to the probability of observing our training dataset, under our working assumption that $f_{w}(x_i)$ must be the conditional probability.

Of course maximizing $L(w)$ is equivalent to maximizing its logarithm, so that our parameters, $w$, can be found by solving:

$$ \hat{w} = \underset{w \in \mathbb{R}^d}{\text{argmax}} \: \log L(w) = \underset{w \in \mathbb{R}^d}{\text{argmax}} \: \sum_{i=1}^n \left[ y_i \log\left(f_{w}(x_i)\right) + (1 - y_i) \log\left( 1 - f_{w}(x_i)\right) \right] $$

This method is known in statistics as Maximum Likelihood Estimation (MLE), and estimators obtained via this method are known as maximum likelihood estimators.
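To get a feeling for the objective, here is a small sketch that evaluates $\log L(w)$ on a made-up toy dataset for two candidate parameter vectors. The function name `log_likelihood_vec` and the data are assumptions for illustration only; the first column of `X_toy` is the constant-one feature.

```python
import numpy as np

def log_likelihood_vec(w, X, y):
    """Sum over i of y_i log f_w(x_i) + (1 - y_i) log(1 - f_w(x_i))."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # f_w(x_i) for every row of X
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy data (made up): the label is 1 when the second feature is positive
X_toy = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y_toy = np.array([1.0, 0.0, 1.0])

ll_zero = log_likelihood_vec(np.zeros(2), X_toy, y_toy)           # every probability is 0.5
ll_good = log_likelihood_vec(np.array([0.0, 2.0]), X_toy, y_toy)  # agrees with the labels
print(ll_zero, ll_good)
```

The second parameter vector assigns higher probability to the observed labels, so its log-likelihood is larger (closer to zero).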




It is important to realize that the MLE method is entirely compatible with our Empirical Risk Minimization theoretical framework. In fact, maximizing the logarithm of the likelihood function, $\log L(w)$, is equivalent to minimizing the empirical risk associated with the log-loss function, defined as follows:

$$ \tilde{L}(y, y') =   - \left[ y \log\left( y' \right) + (1 - y) \log\left( 1 - y'\right) \right], \quad y\in \{0,1\}, y' \in (0,1) $$
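To spell out the equivalence, we can plug $y' = f_{w}(x_i)$ into the log-loss and sum over the training set:

$$ \sum_{i=1}^n \tilde{L}\left(y_i, f_{w}(x_i)\right) = - \sum_{i=1}^n \left[ y_i \log\left(f_{w}(x_i)\right) + (1 - y_i) \log\left( 1 - f_{w}(x_i)\right) \right] = - \log L(w), $$

so that

$$ \hat{w} = \underset{w \in \mathbb{R}^d}{\text{argmax}} \: \log L(w) = \underset{w \in \mathbb{R}^d}{\text{argmin}} \: \frac{1}{n} \sum_{i=1}^n \tilde{L}\left(y_i, f_{w}(x_i)\right), $$

since dividing by the positive constant $n$ does not change the minimizer.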

## Logistic Regression: the algorithm

#### Gradient descent implementation

The minimization can be performed via gradient descent. The gradient is easily computed as:

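$$ \frac{\partial}{\partial w_k} \left( - \log L(w) \right) = - \sum_{i=1}^n x_i^{(k)} (y_i - f_{w}(x_i)) =: - \sum_{i=1}^n x_i^{(k)} z_i(w)  \quad \Rightarrow \quad \nabla \left( - \log L(w) \right) = - X^T z(w), $$

where $X$ is the $n \times d$ matrix whose $i$-th row is $x_i$ and $z(w) = (z_1(w), \dots, z_n(w))^T$.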
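A quick way to gain confidence in a gradient formula like this one is to compare it with central finite differences. The sketch below does so on random data; all names (`neg_log_likelihood`, `analytic_gradient`, `X_chk`, ...) and the step `eps` are assumptions made for this check.

```python
import numpy as np

def neg_log_likelihood(w, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradient(w, X, y):
    z = y - 1.0 / (1.0 + np.exp(-(X @ w)))   # z_i(w) = y_i - f_w(x_i)
    return -X.T @ z

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(20, 3))
y_chk = rng.integers(0, 2, size=20).astype(float)
w_chk = rng.normal(size=3)

eps = 1e-6
numeric = np.zeros(3)
for k in range(3):
    e = np.zeros(3)
    e[k] = eps
    # central difference approximation of the k-th partial derivative
    numeric[k] = (neg_log_likelihood(w_chk + e, X_chk, y_chk)
                  - neg_log_likelihood(w_chk - e, X_chk, y_chk)) / (2 * eps)

max_gap = np.max(np.abs(numeric - analytic_gradient(w_chk, X_chk, y_chk)))
print(max_gap)
```

A small value of `max_gap` indicates that the analytic expression and the numerical approximation agree.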

Therefore the iterative update rule for our gradient descent will look like:

$$ w^{\text{new}} = w^{\text{old}} - \alpha \left( - X^T z(w^{\text{old}})\right),$$

where $\alpha > 0$ is the step size, and the gradient is taken with a negative sign because the optimization problem at hand is a minimization.
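The update rule can be sketched in vectorized form as follows. This is a simplified illustration, not the implementation used later in this class: it assumes a constant step size `alpha` and a fixed number of iterations `n_iter`, and the function name and toy data are made up.

```python
import numpy as np

def nll(w, X, y):
    """Negative log-likelihood, used here only to monitor progress."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient_descent(X, y, alpha=0.1, n_iter=500):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = y - 1.0 / (1.0 + np.exp(-(X @ w)))   # z(w) = y - f_w(X)
        w = w - alpha * (-X.T @ z)               # w_new = w_old - alpha * gradient
    return w

# toy data (made up, not linearly separable); first column is the constant feature
X_gd = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, -0.5],
                 [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_gd = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0])

w_gd = gradient_descent(X_gd, y_gd)
print(nll(np.zeros(2), X_gd, y_gd), nll(w_gd, X_gd, y_gd))
```

With a sufficiently small step size, the negative log-likelihood decreases with respect to the starting point $w = 0$.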

#### Code implementation

In [None]:
def sigmoid(w, x):
    return 1 / (1 + np.exp(-w.dot(x)))

def log_likelihood(w, x, y):
    n = x.shape[0]
    result = np.zeros(n)
    for i in range(n):
        result[i] = - (1 - y[i,0]) * w.dot(x[i,:]) - np.log(1 + np.exp(-w.dot(x[i,:]))) 
        # the previous line is equivalent to:
        # result[i] = y[i,0] * np.log(sigmoid(w, x[i,:])) + (1 - y[i,0]) * np.log(1 - sigmoid(w, x[i,:]))
    return sum(result)

In [None]:
def logistic_regression_train(x, y, alpha=None):
    # alpha: constant step size; if None, a step size is recomputed at every iteration
    n, d = x.shape
    w = np.zeros(d)
    cost_old = 0
    cost_new = log_likelihood(w, x, y)
    i = 0
    # iterate until the log-likelihood changes by less than 1e-4
    while np.abs(cost_new - cost_old) > 10 ** (-4):
        print(i)
        print(cost_new)
        # residuals z_i = y_i - f_w(x_i), as a column vector
        z = y - np.array([sigmoid(w, row) for row in x]).reshape((n,1))
        # gradient of the negative log-likelihood: -X^T z
        gradient = - np.transpose(x).dot(z).reshape((d,))
        if alpha is None:
            # step size |g|^2 / |X g|^2
            xg = x.dot(gradient)
            step = gradient.dot(gradient) / xg.dot(xg)
        else:
            step = alpha
        w = w - step * gradient
        cost_new, cost_old = log_likelihood(w, x, y) , cost_new
        i = i + 1
    print("Iterations: {}".format(i))
    return w


def logistic_regression_predict_label(w, x):
    n = x.shape[0]
    predictions = np.zeros((n, 1))
    for i in range(n):
        prob = sigmoid(w, x[i,:])
        if prob >= 0.5:
            predictions[i,0] = 1
        else:
            predictions[i,0] = 0
    return predictions

def logistic_regression_predict_prob(w, x):
    n = x.shape[0]
    predictions = np.zeros((n, 1))
    for i in range(n):
        predictions[i,0] = sigmoid(w, x[i,:])
    return predictions
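The two prediction functions above loop over the rows one at a time. An equivalent vectorized sketch is given below; the names `predict_prob_vec` and `predict_label_vec` and the `threshold` parameter are assumptions added for illustration.

```python
import numpy as np

def predict_prob_vec(w, X):
    """Column vector of probabilities sigma(<w, x_i>), one per row of X."""
    return (1.0 / (1.0 + np.exp(-(X @ w)))).reshape(-1, 1)

def predict_label_vec(w, X, threshold=0.5):
    """Label 1 where the predicted probability is at least `threshold`."""
    return (predict_prob_vec(w, X) >= threshold).astype(float)

w_pred = np.array([0.0, 1.0])                              # hypothetical weights
X_pred = np.array([[1.0, 2.0], [1.0, -2.0], [1.0, 0.0]])
probs = predict_prob_vec(w_pred, X_pred)
labels_pred = predict_label_vec(w_pred, X_pred)
print(np.hstack((probs, labels_pred)))
```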

## Logistic regression: practical implementation

Now we are ready to implement logistic regression on a binary classification problem. The only thing we need is a problem.

#### Dataset acquisition: how to read text

We are going to work on the **Titanic dataset**. This is an online dataset ([see here](https://www.kaggle.com/c/titanic/data) for an online repository) that contains data about the passengers of the Titanic, together with whether they survived or not. The goal is to predict, as well as possible, the fate of each passenger by using the information provided.

The dataset comes as a csv file (comma-separated values file), which looks as follows:

    PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked  
    1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S  
    2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C  
    3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S  
    ...

If you save the file in the same folder as your Jupyter notebook or Python code, your program can access it via several possible Python functions. 

We will use the most basic (and reliable) method, which is via the `csv.reader` function of the `csv` module.

In [None]:
import csv


with open('train.csv', 'r') as f:
    data = csv.reader(f)

    row = next(data)
    features_names = np.array(row)

    x = []
    y = []

    for row in data:
        x.append(row)
        y.append(row[1])

    x = np.array(x)
    y = np.array(y)

print(x.shape)
print(y.shape)
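One reason to use `csv.reader` rather than splitting each line on commas is that the "Name" field itself contains commas inside quotes. The sketch below illustrates this on the first rows shown above, read from an in-memory string instead of a file (the variable names are assumptions for this illustration):

```python
import csv
import io

sample = (
    'PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n'
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C\n'
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)          # first row: column names
rows = list(reader)            # remaining rows: one list of strings per passenger

print(len(header), [len(r) for r in rows])
print(rows[0][3])              # the quoted name is kept as a single field
print(repr(rows[0][10]))       # a missing cabin is read as an empty string
```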

We can get an idea of the dataset by printing the header (which contains the names of the columns) and the first line:

In [None]:
print(features_names)
print(x[0,:])

#### Feature selection

The features are as follows:

0. '**PassengerId**': a progressive numbering of the passengers (integer)
1. '**Survived**': survival status (integer: 0 if dead, 1 if survived)
2. '**Pclass**': passenger class (integer: 1, 2, or 3 if 1st, 2nd, or 3rd class respectively)
3. '**Name**': name of the passenger (string)
4. '**Sex**': gender of the passenger (string: 'male' or 'female')
5. '**Age**': age of the passenger (float)
6. '**SibSp**': number of siblings/spouses on board (integer)
7. '**Parch**': number of parents/children on board (integer)
8. '**Ticket**': string specifying the ticket code (alphanumeric string)
9. '**Fare**': cost of the ticket (float)
10. '**Cabin**': personal cabin number  (alphanumeric string)
11. '**Embarked**': port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

We notice that:

+ the 'Survived' column is the column of our labels, it's not a feature!
+ some features can be meaningful, but cannot be readily translated into real numbers,
+ not all features appear to be useful.

We decide to limit ourselves to the following features:

In [None]:
x = x[:, [2, 4, 5, 6, 7, 9]]
features_names = features_names[[2, 4, 5, 6, 7, 9]]
print(features_names)

#### Feature representation

Our (restricted) dataset now looks like this:

In [None]:
print(x)

We must turn the categorical gender feature into a numerical one (0 for 'male' and 1 for 'female').

In [None]:
x[:,1] = (x[:,1] == 'female').astype(float)  # np.float was removed from recent NumPy versions
print(x)

Next, we try to convert all the data to type float, so that we can work with real numbers.

In [None]:
# this raises a ValueError as long as some entries are empty strings
x = x.astype(float)
y = y.astype(float)

#### Data imputation

The conversion to float fails because some features have no value (represented in text usually as '', 'NaN', or 'NA'). We can search for the culprit in the following way:

In [None]:
for i in range(len(features_names)):
    if any(x[:,i] == ''):
        print("Feature", i, "has", sum(x[:,i] == ''), "NaN value(s)")

We can deal with missing data in one of the following ways:
+ Remove the datapoint (not good if the dataset is small)
+ Use a machine learning algorithm that accepts missing features (which is not the case for logistic regression)
+ **Data imputation** techniques, i.e. substitute the NaN with a plausible value: 
    + the mean over the non-NaN values of the same feature, 
    + the mid-point of the range of the non-NaN values of the same feature, 
    + the prediction of a regression model trained on the remaining features to predict the missing feature.
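The first two imputation strategies can be sketched on a small made-up array as follows, assuming the missing entries have already been encoded as `np.nan` (the variable names are hypothetical):

```python
import numpy as np

ages_toy = np.array([22.0, np.nan, 38.0, 26.0, np.nan])   # made-up values
missing = np.isnan(ages_toy)

# mean imputation: average of the observed values
mean_fill = np.where(missing, np.nanmean(ages_toy), ages_toy)

# mid-point imputation: centre of the observed range
midpoint = (np.nanmin(ages_toy) + np.nanmax(ages_toy)) / 2
mid_fill = np.where(missing, midpoint, ages_toy)

print(mean_fill)
print(mid_fill)
```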

We choose the simplest solution and substitute the missing value with the mean value of the feature:

In [None]:
ages = x[:,2]
mean_age = np.mean(x[ages != '',2].astype(float))
x[ages == '', 2] = mean_age

print(x)

In [None]:
x = x.astype(float)
y = y.astype(float).reshape((x.shape[0],1))

# add the constant-one feature as first column
x = np.hstack((np.ones((x.shape[0],1)), x))

print(x)

#### Training

As usual we shuffle the dataset and partition it into a training dataset and a testing dataset.

In [None]:
def shuffle(x, y):
    z = np.hstack((x, y))
    np.random.shuffle(z)
    return np.hsplit(z, [x.shape[1]])

x, y = shuffle(x, y)

def splitting(x, y, test_size=0.2):
    n = x.shape[0]
    train_size = int(n * (1 - test_size))
    return x[:train_size, ], x[train_size:, ], y[:train_size, ], y[train_size:, ]

x_train, x_test, y_train, y_test = splitting(x, y)
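An alternative sketch of the same shuffle-and-split step uses a random permutation of the row indices, which keeps features and labels aligned without stacking them, and a seed to make the split reproducible. The function name `shuffle_split`, the `seed` parameter and the demo arrays are assumptions for illustration.

```python
import numpy as np

def shuffle_split(X, y, test_size=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])            # shuffled row indices
    n_train = int(X.shape[0] * (1 - test_size))
    train, test = idx[:n_train], idx[n_train:]
    return X[train], X[test], y[train], y[test]

# demo data: row i of X_demo is (2i, 2i + 1) and its label is i
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float).reshape(10, 1)

X_tr, X_te, y_tr, y_te = shuffle_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```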

We train our logistic regression:

In [None]:
logistic_coeff = logistic_regression_train(x_train, y_train)

We define two loss functions:

+ the 0-1 loss (fraction of wrong labels)
+ the log-loss (as defined above)

In [None]:
def zero_one_loss(y_true, y_pred):
    n = y_true.shape[0]
    return (1/n) * np.sum(y_true != y_pred)

In [None]:
def log_loss(y_true, y_pred):
    return np.mean(- (y_true*(np.log(y_pred)) + (1 - y_true)*(np.log(1 - y_pred))))    
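If a predicted probability is numerically equal to 0 or 1, the logarithm in the log-loss produces `-inf` or `nan`. A common precaution, sketched here with a hypothetical function name and an arbitrary tolerance `eps`, is to clip the probabilities away from the endpoints before taking logarithms:

```python
import numpy as np

def safe_log_loss(y_true, y_pred, eps=1e-12):
    # keep probabilities inside [eps, 1 - eps] so that both logarithms are finite
    p = np.clip(y_pred, eps, 1 - eps)
    return np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y_t = np.array([[1.0], [0.0]])
y_p = np.array([[1.0], [0.0]])   # perfectly confident (and correct) predictions
loss = safe_log_loss(y_t, y_p)
print(loss)
```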

We evaluate our model on both the training and the test datasets:

In [None]:
train_predicted_labels = logistic_regression_predict_label(logistic_coeff, x_train)
train_predicted_prob = logistic_regression_predict_prob(logistic_coeff, x_train)
test_predicted_labels = logistic_regression_predict_label(logistic_coeff, x_test)
test_predicted_prob = logistic_regression_predict_prob(logistic_coeff, x_test)

print('Train 0-1 loss:', zero_one_loss(y_train, train_predicted_labels))
print('Train log-loss:', log_loss(y_train, train_predicted_prob))
print('-----------------------------------------------------------------------')
print('Test 0-1 loss:', zero_one_loss(y_test, test_predicted_labels))
print('Test log-loss:', log_loss(y_test, test_predicted_prob))

In [None]:
print(np.hstack((y_test, test_predicted_labels)))

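We can compare this error with the one obtained by the logistic regression implementation of the scikit-learn package (which applies L2 regularization by default, so the results need not coincide exactly):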

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(solver='liblinear').fit(x_train[:,1:], y_train.reshape(y_train.shape[0],))


train_predicted_labels = clf.predict(x_train[:,1:]).reshape(y_train.shape)
train_predicted_prob = (clf.predict_proba(x_train[:,1:])[:,1]).reshape(y_train.shape)
test_predicted_labels = clf.predict(x_test[:,1:]).reshape(y_test.shape)
test_predicted_prob = (clf.predict_proba(x_test[:,1:])[:,1]).reshape(y_test.shape)

print('Train 0-1 loss:', zero_one_loss(y_train, train_predicted_labels))
print('Train log-loss:', log_loss(y_train, train_predicted_prob))
print('-----------------------------------------------------------------------')
print('Test 0-1 loss:', zero_one_loss(y_test, test_predicted_labels))
print('Test log-loss:', log_loss(y_test, test_predicted_prob))

## Practice yourself!

Play around with the code yourself! Possible ideas that might lead you to interesting observations are:

1. Learn from our results! Which features of our dataset are statistically significant for survival prediction?
2. Try the algorithms on the dataset for breast cancer diagnosis from scikit-learn (see [here](https://scikit-learn.org/stable/datasets/index.html)).
3. Try selecting more or fewer features. How is the model performance affected?
4. Try using different data imputation techniques to substitute the missing "Age" values.
5. Implement a logistic regression on the Titanic dataset, but this time choose a feature other than "Survived" as label.
6. Implement an algorithm similar to what we have seen for logistic regression, but on the following hypothesis class:
$$\{x\mapsto \text{sign}\left(\langle w,x\rangle \right) \;\vert\; w\in\mathbb{R}^d\}.$$