# Exercise 3 - Binary Classification with Logistic Regression (30 Points)

This exercise is meant to familiarize you with the complete pipeline of solving a machine learning problem. You
need to obtain and pre-process the data, develop, implement, and train a machine learning model, and evaluate it
by splitting the data into a training set and a test set.

First, we will derive and implement all the functions we need and put them into a single class.

In a second part, we will use this class to build a spam filter.

If a problem persists, do not hesitate to contact the course instructors at
- christoph.staudt@uni-jena.de

### Submission

- Submission deadline: 12.05.2021 23:59
- Submission on [moodle page](https://moodle.uni-jena.de/course/view.php?id=28746)

 

### Again, if you have trouble with any of the steps, please reach out, since in most cases you will not be able to move on to the next steps otherwise


 

## Data Preparation

In the model of *logistic regression*, we have $m$ samples $x_i\in\mathbb{R}^n$ with labels $y_i\in\{-1,1\}$.
In this exercise, we will use the equivalent formulation with $y_i\in\{0,1\}$.
We use the example dataset `data.npy`, which contains two-dimensional features (first two columns) and a binary label (third column).

### Task 1 (1 Point)
Load and split the dataset into samples and labels. Then plot the data with a scatterplot and use different colors for different labels.

In [None]:
# TODO: Load and split dataset

# TODO: plot data


The function $\sigma$ is called the logistic *sigmoid function*:

$
\sigma(a) = \cfrac{1}{1+\exp(-a)}\ .
$

###  Task 2 (1 Point)
Implement a vectorized logistic sigmoid function, i.e., one that takes a vector of x-coordinates `X` and returns a vector of the corresponding y-values. Use it to plot the function between -10 and 10.

In [None]:
def sigmoid(X):
    # TODO: implement sigmoid function
    pass

# TODO: Plot function from -10 to 10

The goal in logistic regression is to find the parameter vector $\theta\in\mathbb{R}^n$ such that

\begin{align}
p(y_i=1|x_i,\theta)=\sigma(x_i^T\theta) \quad &
p(y_i=0|x_i,\theta)=1-p(y_i=1|x_i,\theta)
\end{align}

fits our data and can be used to predict the label on unseen data (binary classification).


With an estimated $\theta$, a new sample $x\in\mathbb{R}^n$ is classified according to:

$
\hat{y} = \begin{cases}
1\text{, if \ }p(y=1|x,\theta)\geq 0.5\\
0\text{, else}
\end{cases}.
$

Since $\sigma(0) = 1/(1+\exp(0)) = 1/2$ and $\sigma$ is strictly increasing, this is equivalent to
$\hat{y} = \begin{cases}
1\text{,\ if \ } x^T\theta \geq 0\\
0\text{,\ else}
\end{cases}$
 as noted in the lecture.
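
The equivalence of the two decision rules can be checked numerically. The following is a minimal sketch, not a reference solution: the helper name `sigmoid_sketch` and the score values are made up for illustration.

```python
import numpy as np

def sigmoid_sketch(a):
    """Logistic sigmoid 1 / (1 + exp(-a)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical scores x^T theta for five samples (illustrative values only).
scores = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])

# Rule 1: threshold the probability at 0.5.
by_probability = (sigmoid_sketch(scores) >= 0.5).astype(int)
# Rule 2: threshold the score at 0.
by_score = (scores >= 0).astype(int)

print(by_probability)
print(by_score)
```

Both rules yield the same labels, so a prediction never needs to evaluate the sigmoid itself.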

### Task 3 (1 Point)
Prepare `X` so that the classification function for an estimated $\theta$ is [*affine*](https://math.stackexchange.com/questions/275310/what-is-the-difference-between-linear-and-affine-function). Add this affine component as the **first column**.
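
One common way to obtain an affine function is to prepend a constant column of ones, so that the first entry of $\theta$ acts as an offset. The sketch below assumes this approach; `X_raw`, `X_aug` and `theta_demo` are hypothetical names with made-up values.

```python
import numpy as np

# Hypothetical stand-in for the two feature columns of the dataset.
X_raw = np.array([[0.5, 1.5],
                  [2.0, -1.0],
                  [-0.3, 0.7]])

# Prepend a column of ones so that theta[0] acts as the offset.
X_aug = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Arbitrary parameter values, chosen only for illustration.
theta_demo = np.array([0.1, 2.0, -1.0])

# X_aug @ theta equals theta[0] + X_raw @ theta[1:], i.e. an affine function of X_raw.
print(np.allclose(X_aug @ theta_demo, theta_demo[0] + X_raw @ theta_demo[1:]))
```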

In [None]:
# TODO: Prepare X

### Task 4 (1 Point)

Implement a `predict` function based on the above definition of probabilities.
The function should take $m$ input features $X\in\mathbb{R}^{m\times n}$ and a vector $\theta$ as input and output predictions $\hat{Y}\in\{0,1\}^m$.

Test your function with a randomly chosen $\theta$.

In [None]:
def predict(X,theta):
    # TODO: calculate and return predictions
    pass

# TODO: test function

## Learning $\theta$

For a given $\theta$, we can calculate $p(y|x,\theta)$ and use this probability for classification.
To evaluate how well a learned $\theta$ can be used to classify our data, we define a *loss function*.
Here we want to use [binary cross entropy](https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a), given as:
$
L(\theta) = -\cfrac{1}{m}\sum_{i=1}^m \Big[ y_i\log(p(y_i=1|x_i,\theta))+(1-y_i)\log(1-p(y_i=1|x_i,\theta)) \Big]
$
Often it is convenient to have multiple metrics at hand. In classification problems, the *accuracy* of a
prediction is defined as the fraction of correctly classified samples. In the case of logistic regression, this corresponds to 

$
Acc(\theta) = \cfrac{1}{m}\sum_{i=1}^m \Big[ y_i \hat{y}_i + (1-y_i)(1-\hat{y}_i) \Big]
$
where $\hat{y}_i$ is the prediction for $x_i$.

As our model becomes better, we expect the accuracy to increase and the loss to decrease.  
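
To make the two metrics concrete, here is a small worked example on four samples. The labels and probabilities are invented for illustration and do not come from `data.npy`.

```python
import numpy as np

y = np.array([1, 0, 1, 0])          # true labels (made up)
p = np.array([0.9, 0.2, 0.4, 0.6])  # hypothetical values of p(y_i=1 | x_i, theta)

# Binary cross entropy: mean negative log-probability assigned to the true label.
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Accuracy: threshold at 0.5, then count agreements with the true labels.
y_hat = (p >= 0.5).astype(int)
accuracy = np.mean(y * y_hat + (1 - y) * (1 - y_hat))

print(bce, accuracy)
```

The first two samples are classified correctly and the last two are not, so the accuracy is 0.5. The loss is dominated by the two misclassified samples.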

### Task 5 (2 Points)
Implement the binary cross entropy and the accuracy for logistic regression. 
The loss takes the features $X$, the true labels $Y$ and the parameter vector $\theta$ as input, whereas the accuracy only needs $Y$ and the predicted labels $\hat{Y}$.

Again, test your functions with a randomly chosen $\theta$.

In [None]:
def loss(X, Y, theta):
    # TODO: implement binary cross entropy
    pass

def acc(Y, Y_hat):
    # TODO: implement accuracy
    pass

# TODO: test function

Given the loss function $L(\theta)$, we want to minimize this function with respect to the parameters $\theta$, that is we are looking for

\begin{align}
    \text{argmin}_\theta L(\theta)
\end{align}

However, since this is a highly nonlinear optimization problem, we use an iterative approach that starts with an initial estimate for $\theta$ and moves closer to the solution at each iteration step. 
The simplest approach is to take the gradient
$\nabla L(\theta)$ of $L(\theta)$ with respect to $\theta$ and step in the direction of the negative gradient. 
This method is called gradient descent.

### Task 6 (3 Points)

Calculate $\nabla L(\theta) = \cfrac{\partial L}{\partial \theta}$ and implement this function.
The resulting function takes features $X$, labels $Y$ and $\theta$ as input and outputs a gradient $\nabla L(\theta)\in\mathbb{R}^n$.

Again, test your function with a randomly chosen $\theta$.
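
One possible way to test an analytic gradient is to compare it with a finite-difference approximation. The helper `numerical_gradient` below is a hypothetical utility, not part of the exercise template, and is demonstrated on a toy function whose gradient is known.

```python
import numpy as np

def numerical_gradient(f, theta, h=1e-6):
    """Central-difference approximation of the gradient of f at theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = h
        # Perturb only coordinate j in both directions.
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad

# Toy check on f(theta) = theta^T theta, whose gradient is 2 * theta.
theta_test = np.array([0.5, -1.0, 2.0])
approx = numerical_gradient(lambda t: t @ t, theta_test)
print(np.allclose(approx, 2 * theta_test, atol=1e-5))
```

Assuming `loss` and `gradient` are implemented as described in the tasks, the same helper could be called as `numerical_gradient(lambda t: loss(X, Y, t), theta)` and compared with `gradient(X, Y, theta)`.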




In [None]:
def gradient(X,Y,theta):
    # TODO: Implement gradient
    pass

# TODO: test function

### Task 7 (3 Points)
With the gradient function, implement the *gradient descent* algorithm:

 1. (randomly) choose initial $\hat{\theta}$
 2. update $\hat{\theta} \leftarrow \hat{\theta} -\eta\nabla L(\hat{\theta})$
 3. repeat step 2 until a maximum number of iterations $\lambda$ (parameter `max_it`) is reached or the loss changes by less than $\varepsilon$ (parameter `eps`) between two consecutive iterations.
 
The hyperparameter $\eta$ is also called *learning rate* (parameter `lr`).

The function should take the features $X$, the labels $Y$ and values for $\eta,\lambda$ and $\varepsilon$ as input and output $\hat{\theta}$.
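
The update rule and both stopping criteria can be sketched on a simple quadratic instead of the logistic loss. The function `gradient_descent_sketch`, the target vector and the parameter values below are assumptions chosen for illustration; the signature differs from the `fit` function requested in this task.

```python
import numpy as np

def gradient_descent_sketch(f, grad_f, theta0, lr=1e-1, max_it=1000, eps=1e-10):
    """Generic gradient descent; returns the final theta and the number of steps taken."""
    theta = np.asarray(theta0, dtype=float)
    prev = f(theta)
    n_steps = 0
    for n_steps in range(1, max_it + 1):
        theta = theta - lr * grad_f(theta)  # step against the gradient
        cur = f(theta)
        if abs(prev - cur) < eps:           # loss barely changed: stop early
            break
        prev = cur
    return theta, n_steps

# Toy objective f(theta) = ||theta - target||^2 with gradient 2 * (theta - target).
target = np.array([1.0, -2.0])
theta_min, n_steps = gradient_descent_sketch(
    lambda t: np.sum((t - target) ** 2),
    lambda t: 2 * (t - target),
    theta0=np.zeros(2),
)
print(theta_min, n_steps)
```

On this toy problem the loop stops through the `eps` criterion well before `max_it` is reached. A learning rate that is too large would make the iterates diverge instead.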

Test your function.

In [None]:
def fit(X, Y, lr=1e-2, max_it=1000, eps=1e-4):
    # TODO: Implement gradient descent algorithm
    pass

# TODO: test function

### Task 8 (4 Points)

Now we have all functionalities and want to bring them together in a single class.

- Use the previously defined functions to implement the `LogReg` class. 
- Make use of the fact that you can store parameters as attributes. 
- Additionally, track the losses and accuracies that occur during the iterations of gradient descent. 
- Test your class (on the prepared data from above) and plot the accuracies and losses over the iterations.

In [None]:
class LogReg():
    # TODO: fill in functions
    
    def __init__(self):
        pass
        
    def sigmoid(self, X):
        pass

    def predict(self, X):
        pass
    
    def loss(self, X, Y):
        pass

    def acc(self, Y, Y_hat):
        pass
    
    def gradient(self, X, Y):
        pass
    
    def fit(self, X, Y, lr=1e-2, max_it=1000, eps=1e-4):
        # TODO: track losses and accuracies
        pass
                
# TODO: test class + plot losses/accuracies

### Task 9 (2 Points)

So far, we used the whole dataset for fitting the `LogReg` class.

- Use [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split the dataset into a training set (75%) and a test set (25%).
- Fit the logistic regression model on the training set and calculate the final accuracies on the training and test sets. 
- Experiment with the hyperparameters of `fit` to get a good result.
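
A 75/25 split with scikit-learn's `train_test_split` might look like the sketch below. The synthetic arrays `X_demo` and `Y_demo` are stand-ins for the real dataset, and the fixed `random_state` is an assumption made only so that the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples with two features and a binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
Y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# test_size=0.25 puts 25% of the samples into the test set.
X_train, X_test, Y_train, Y_test = train_test_split(
    X_demo, Y_demo, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```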

In [None]:
# TODO: Split data into train and test data

# TODO: apply logistic regression

# TODO: determine train and test accuracy

## Visualization

Next, we want to visualize our classifier. To do this, we visualize the *decision boundary* defined by $\hat{\theta}$.

The decision boundary is defined as 
$
\{x\in\mathbb{R}^n: p(y=1|x)=0.5\}
$
or as in the lecture:
$\{x\in \{1\} \times \mathbb{R}^n: x^T\hat{\theta}=0\}$
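
For two-dimensional samples with a leading constant 1, the boundary condition reads $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$, which can be solved for $x_2$ whenever $\theta_2 \neq 0$. The sketch below uses made-up parameter values in `theta_demo` and only computes points on the line, without plotting them.

```python
import numpy as np

# Hypothetical parameters [offset, weight for x1, weight for x2]; assumes theta_demo[2] != 0.
theta_demo = np.array([-1.0, 2.0, 0.5])

# Pick some x1 values and solve theta0 + theta1 * x1 + theta2 * x2 = 0 for x2.
x1 = np.linspace(-2.0, 2.0, 5)
x2 = -(theta_demo[0] + theta_demo[1] * x1) / theta_demo[2]

# Every point (1, x1, x2) on the line has score x^T theta = 0.
points = np.column_stack([np.ones_like(x1), x1, x2])
print(np.allclose(points @ theta_demo, 0.0))
```

The pairs `(x1, x2)` could then be passed to a line plot on top of the scatter plot of the data.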


### Task 10 (2 Points)

Implement a function `plot_dec_boundary` that visualizes the data and the regression line for two-dimensional samples $X$ and an estimated $\hat{\theta}$.

Test this function with the $\hat{\theta}$ estimated in Task 8.

In [None]:
def plot_dec_boundary(X,Y, theta):    
    # TODO: plot data and decision boundary
    pass
    
# TODO: test function

### Task 11 (2 Points)

Use the [implementation from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to train a logistic regressor.

Visualize the regression line that you obtain with scikit-learn.

In [None]:
# TODO: estimate theta with scikit-learn

# TODO: plot regression line with data

## Spam Filter

We want to use logistic regression to perform spam filtering on the [*UCI SMS Spam Collection*](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/) dataset. (The site's certificate has expired, but accessing the page should still be okay. In any case, a description and the dataset can be found in `smsspamcollection`, so you don't need to open the page.) The goal is to classify an SMS, based on its text, into the categories "spam" or "ham".

### Task 12 (3 Points)

The dataset is saved as a text file at `SMSSpamCollection.txt`. Find a way to load the dataset and transform the features `X` (SMS) and the labels `Y` (spam/ham) into numerical representations.

Hint:

For transforming SMS messages into features, check out the bag-of-words representation from [scikit-learn](https://scikit-learn.org/stable/modules/feature_extraction.html).
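
As a rough illustration of the bag-of-words idea, the sketch below applies scikit-learn's `CountVectorizer` to three invented messages. The messages, the labels and the choice of encoding spam as 1 are assumptions; loading the actual text file is not shown.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Invented example messages and labels (not taken from the dataset).
messages = [
    "free prize call now",
    "are we meeting for lunch",
    "call me when you are free",
]
labels = ["spam", "ham", "ham"]

# Each column counts how often one vocabulary word occurs in a message.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(messages)  # sparse matrix of word counts

# Encode the labels numerically; spam -> 1 is an arbitrary choice.
Y_labels = np.array([1 if label == "spam" else 0 for label in labels])

print(X_counts.shape)
print(sorted(vectorizer.vocabulary_))
```

The result is a sparse matrix. If an implementation expects a dense NumPy array, `X_counts.toarray()` converts it, at the cost of more memory.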

In [None]:
# TODO: load and preprocess dataset

### Task 13 (2 Points)

Split the dataset into a training set (75%) and a test set (25%) and use your implementation of logistic regression to learn $\theta$ for this dataset. Try to get your accuracy as high as possible.

In [None]:
# TODO: use own logistic regression on dataset

# TODO: determine train and test accuracy

### Task 14 (3 Points)
Visualizing our classifier is not that easy anymore, as our features are in a high-dimensional space. 
Nevertheless, the values of $\hat{\theta}$ can tell us which words are indicators for the decision between spam and ham.

Use $\hat{\theta}$ and your word encoding to output the top 10 most likely words for ham and spam.
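
Ranking words by their weight can be done with `np.argsort`. The sketch below uses a tiny invented vocabulary and invented weights, and assumes that spam is encoded as 1, so that large positive weights point towards spam and large negative weights towards ham. It also assumes that any offset entry has already been removed from the weight vector.

```python
import numpy as np

# Invented vocabulary and weights, aligned by index (no offset entry).
vocab = np.array(["call", "free", "lunch", "meeting", "prize"])
theta_words = np.array([0.4, 1.2, -0.9, -1.5, 2.0])

order = np.argsort(theta_words)   # indices sorted from smallest to largest weight
k = 2
top_ham = vocab[order[:k]]        # most negative weights
top_spam = vocab[order[::-1][:k]] # most positive weights

print("ham:", top_ham)
print("spam:", top_spam)
```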

In [None]:
# TODO: use theta to print top 10 words for spam and ham