**D3APL: Aplicações em Ciência de Dados** <br/>
IFSP Campinas

Prof. Dr. Samuel Martins (Samuka) <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

# Implementing a Binary Linear Classifier

## Logistic Regression with GD

PS: use the lecture slides to support your development.

## 1. Set up

#### Imports

In [None]:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

#### Creating fake data

In [None]:
# fake data for testing
X, y = make_blobs(n_samples=1000, n_features=2, centers=2, cluster_std=1.5, random_state=42)

print(X.shape)
print(y.shape)
print(f'Labels: {np.unique(y)}')

In [None]:
# splitting into train and test


In [None]:
# feature scaling
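
A minimal sketch of the two preprocessing steps above. It assumes the same 80/20 split and seed used later in this notebook, a stratified split, and standardization with `StandardScaler`; the notebook does not state which scaler is intended, so treat these choices as assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# same fake data as above
X, y = make_blobs(n_samples=1000, n_features=2, centers=2, cluster_std=1.5, random_state=42)

# splitting into train and test (assumption: 80/20 split, stratified by label)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# feature scaling (assumption: standardization)
# the scaler is fit on the training set only, then applied to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training samples only keeps test-set statistics out of the training process.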


In [None]:
import seaborn as sns

sns.scatterplot(x=X_train[:, 0], y=X_train[:, 1], hue=y_train)
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Scatter plot: Training Samples')

In [None]:
import seaborn as sns

sns.scatterplot(x=X_test[:, 0], y=X_test[:, 1], hue=y_test, marker='s')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Scatter plot: Testing Samples')

#### Saving the fake preprocessed data

In [None]:
np.save('./datasets/fake_train.npy', X_train)
np.save('./datasets/fake_train_labels.npy', y_train)

np.save('./datasets/fake_test.npy', X_test)
np.save('./datasets/fake_test_labels.npy', y_test)

## 2. Implementation
### Plan of Attack
- Class
- Sklearn TemplateClassifier
    - https://scikit-learn.org/stable/developers/develop.html
    - https://github.com/scikit-learn-contrib/project-template/blob/a06bc1a701fbb320848e4d5295e4477b596078df/skltemplate/_template.py#L74
    - https://scikit-learn.org/stable/modules/classes.html#base-classes
- Constructor
    - learning_rate, n_epochs
- \_\_str\_\_
- fit
- sigmoid
- log_loss
- gradient
- coef_, intercept_ (@property)
- predict_proba
- predict
- visualize decision boundary
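
The numerical pieces of this plan (sigmoid, log loss, gradient, and the update loop inside `fit`) can be sketched as standalone functions before moving them into the class. This is only a sketch under assumptions: the loss is the mean binary cross-entropy, probabilities are clipped with a hypothetical `eps`, parameters start at zero, and the toy data is made up for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y, p_hat, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))


def gradient(X, y, p_hat):
    """Gradients of the mean log loss w.r.t. the weights w and the bias b."""
    error = p_hat - y
    grad_w = X.T @ error / X.shape[0]
    grad_b = np.mean(error)
    return grad_w, grad_b


# tiny, hypothetical dataset: the label is 1 when the single feature is positive
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_toy = np.array([0, 0, 1, 1])

w, b = np.zeros(1), 0.0
learning_rate = 0.1
loss_before = log_loss(y_toy, sigmoid(X_toy @ w + b))

# gradient descent: one parameter update per epoch, using all samples
for _ in range(100):
    p_hat = sigmoid(X_toy @ w + b)
    grad_w, grad_b = gradient(X_toy, y_toy, p_hat)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

loss_after = log_loss(y_toy, sigmoid(X_toy @ w + b))
print(f'loss before: {loss_before:.4f}, after: {loss_after:.4f}')
```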

In [None]:
from typing import Tuple

import numpy as np
from numpy import ndarray

from sklearn.base import BaseEstimator, ClassifierMixin


class LogisticRegression(ClassifierMixin, BaseEstimator):
    """Our Logistic Regression implemented from scratch."""
    
    def __init__(self, learning_rate : float = 0.001,
                 n_epochs : int = 1000, random_state : int = 42):
        """
        Parameters
        ----------
        learning_rate : float, default=0.001
            Learning rate.
        n_epochs : int, default=1000
            Number of training epochs (used as the stopping criterion).
        random_state : int, default=42
            Seed used for generating random numbers.
        """
        pass
    
    
    # a special method used to represent a class object as a string, called with print() or str()
    def __str__(self):
        pass
    


    def fit(self, X: ndarray, y: ndarray, verbose: int = 0):
        '''Train a Logistic Regression classifier.

        Parameters
        ----------
        X: ndarray of shape (n_samples, n_features)
            Training data.
        y: ndarray of shape (n_samples,).
            Target (true) labels.
        verbose: int, default=0
            Verbose flag. Print training information every `verbose` iterations.
            
        Returns
        -------
        self : object
            Returns self.
        '''
        pass
    
        ### CHECK INPUT ARRAY DIMENSIONS
        

        ### SETTING SEED

        ### PARAMETER INITIALIZATION
        # sample the initial values from the “standard normal” distribution.

        # LEARNING ITERATIONS


        ### ASSIGN THE TRAINED PARAMETERS TO THE PRIVATE ATTRIBUTES
        
    
    
    def predict_proba(self, X: ndarray) -> ndarray:
        '''Estimate the probability for the positive class of input samples.

        Parameters
        ----------
        X: ndarray of shape (n_samples, n_features)
            Input samples.
            
        Returns
        -------
        ndarray of shape (n_samples,)
            The estimated probabilities for the positive class of input samples.
        '''
        pass

    
    def predict(self, X: ndarray) -> ndarray:
        '''Predict the labels for input samples.
        
        Thresholding at probability >= 0.5.

        Parameters
        ----------
        X: ndarray of shape (n_samples, n_features)
            Input samples.
            
        Returns
        -------
        ndarray of shape (n_samples,)
            Predicted labels of input samples.
        '''
        pass
    
    
    

#### **Testing constructor and `__str__`**
- evaluate the default hyperparameters
- try different valid values for them
- try invalid values for them
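
A possible shape for these checks, sketched on a hypothetical stand-in class (`TinyEstimator`) rather than on the real classifier. It assumes the sklearn convention that `__init__` only stores the hyperparameters and that validation happens later; the `_validate` helper and its error messages are made up for illustration.

```python
from sklearn.base import BaseEstimator


class TinyEstimator(BaseEstimator):
    """Hypothetical stand-in used only to illustrate the constructor pattern."""

    def __init__(self, learning_rate=0.001, n_epochs=1000):
        # __init__ only stores the hyperparameters, unchanged
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs

    def __str__(self):
        return (f'TinyEstimator(learning_rate={self.learning_rate}, '
                f'n_epochs={self.n_epochs})')

    def _validate(self):
        # hypothetical helper, meant to be called at the start of fit()
        if self.learning_rate <= 0:
            raise ValueError('learning_rate must be positive.')
        if not isinstance(self.n_epochs, int) or self.n_epochs <= 0:
            raise ValueError('n_epochs must be a positive integer.')


est = TinyEstimator()
print(est)               # uses __str__
print(est.get_params())  # provided by BaseEstimator

# an invalid value is accepted by the constructor and rejected by the check
bad = TinyEstimator(learning_rate=-1.0)
try:
    bad._validate()
    rejected = False
except ValueError as err:
    rejected = True
    print(f'Invalid hyperparameter: {err}')
```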

In [None]:
clf = LogisticRegression()

print('Printing object by print()')
print(clf)

In [None]:
print('Displaying object')
clf

#### **Testing sigmoid**
PS: to test it in isolation, convert the method into a public static method: decorate it with `@staticmethod`, remove the `__` prefix, and drop the `self` parameter.
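
One way to run this check, sketched with a standalone `sigmoid` function (an assumption; in the notebook it would be the static method of the class) compared against `scipy.special.expit`:

```python
import numpy as np
from scipy.special import expit


def sigmoid(z):
    """Standalone version of the logistic function, for testing only."""
    return 1.0 / (1.0 + np.exp(-z))


z = 0.458

p_scipy = expit(z)   # reference implementation from scipy
p_ours = sigmoid(z)  # our implementation

print(p_scipy, p_ours)
print(np.isclose(p_scipy, p_ours))
```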

In [None]:
z = 0.458

In [None]:
# scipy


In [None]:
# our sigmoid function


#### **Testing log loss**
PS: to test it in isolation, convert the method into a public static method: decorate it with `@staticmethod`, remove the `__` prefix, and drop the `self` parameter.
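
A sketch of this comparison, assuming a standalone `log_loss` function that computes the mean binary cross-entropy and clips probabilities with a hypothetical `eps` so that predictions of exactly 0 or 1 do not produce `log(0)`:

```python
import numpy as np
from sklearn.metrics import log_loss as sk_log_loss


def log_loss(y, p_hat, eps=1e-15):
    """Mean binary cross-entropy with clipped probabilities."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))


y_debug = np.array([0, 0, 1, 1])
p_hat_debug = np.array([0, 0.25, 0.5, 1])

loss_sklearn = sk_log_loss(y_debug, p_hat_debug)  # reference value
loss_ours = log_loss(y_debug, p_hat_debug)        # our implementation

print(loss_sklearn, loss_ours)
print(np.isclose(loss_sklearn, loss_ours))
```

Only the second and third samples contribute to the loss here, since the first and last are predicted with full confidence and are correct.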

In [None]:
y_debug = np.array([0, 0, 1, 1])
p_hat_debug = np.array([0, 0.25, 0.5, 1])

In [None]:
# sklearn


In [None]:
# our log loss


#### **Testing input conditions in `fit()`**

In [None]:
clf = LogisticRegression()

In [None]:
# invalid ndim for X


In [None]:
# different n_samples for X and y
clf.fit(X_debug, y_debug)
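
The kind of checks that `fit()` could run on its inputs, sketched as a hypothetical helper (`check_fit_inputs`); the function name, the error messages, and the debug arrays are assumptions. sklearn also ships `sklearn.utils.check_X_y`, which performs similar validation.

```python
import numpy as np


def check_fit_inputs(X, y):
    """Hypothetical helper: validate the shapes of X and y before training."""
    X, y = np.asarray(X), np.asarray(y)
    if X.ndim != 2:
        raise ValueError(f'X must be 2D (n_samples, n_features), got ndim={X.ndim}.')
    if y.ndim != 1:
        raise ValueError(f'y must be 1D (n_samples,), got ndim={y.ndim}.')
    if X.shape[0] != y.shape[0]:
        raise ValueError(f'X has {X.shape[0]} samples, but y has {y.shape[0]}.')
    return X, y


y_debug = np.array([0, 0, 1, 1])

messages = []
# first case: invalid ndim for X; second case: different n_samples for X and y
for X_debug in (np.arange(4.0), np.ones((3, 2))):
    try:
        check_fit_inputs(X_debug, y_debug)
    except ValueError as err:
        messages.append(str(err))
        print(f'ValueError: {err}')
```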

#### **Testing `fit()`**
PS: use `pdb.set_trace()` inside the main loop of `fit()` for debugging.

#### **Visualizing the Decision Boundary**

$w_1x_1 + w_2x_2 + b = 0$

$x_2 = -(b + w_1x_1)/w_2$

In [None]:
b = clf.intercept_
w1, w2 = clf.coef_

In [None]:
x1_decision_line = np.array([X_train[:,0].min(), X_train[:,0].max()])
x2_decision_line = -(b + (w1 * x1_decision_line)) / w2

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x=X_train[:, 0], y=X_train[:, 1], hue=y_train)
sns.lineplot(x=x1_decision_line, y=x2_decision_line, color='lightseagreen')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Decision Boundary on Training Samples')

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x=X_test[:, 0], y=X_test[:, 1], hue=y_test, marker='s')
sns.lineplot(x=x1_decision_line, y=x2_decision_line, color='lightseagreen')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Decision Boundary on Testing Samples')

### Prediction

## Using our model inside the sklearn environment
Because our linear classifier follows the sklearn estimator conventions, it can be combined with the rest of the sklearn ecosystem, such as hyperparameter search and pipelines.
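
As an illustration, the sketch below plugs a compact, hypothetical stand-in (`TinyLogReg`) into a `Pipeline` and tunes it with `GridSearchCV`. The stand-in is not the class developed above: it assumes labels in {0, 1}, zero-initialized parameters, and plain attributes instead of properties, and the parameter grid is arbitrary.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_blobs
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class TinyLogReg(ClassifierMixin, BaseEstimator):
    """Compact, hypothetical stand-in for the classifier of this notebook."""

    def __init__(self, learning_rate=0.1, n_epochs=200):
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs

    def fit(self, X, y):
        self.classes_ = np.unique(y)  # attribute expected by sklearn tools
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(self.n_epochs):
            error = _sigmoid(X @ w + b) - y
            w -= self.learning_rate * (X.T @ error) / X.shape[0]
            b -= self.learning_rate * error.mean()
        self.coef_, self.intercept_ = w, b
        return self

    def predict_proba(self, X):
        # probability of the positive class only, as in the notebook's docstring
        return _sigmoid(X @ self.coef_ + self.intercept_)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)


X, y = make_blobs(n_samples=1000, n_features=2, centers=2, cluster_std=1.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# scaling and classification chained into a single estimator
pipe = Pipeline([('scaler', StandardScaler()), ('clf', TinyLogReg())])

# hyperparameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {'clf__learning_rate': [0.01, 0.1], 'clf__n_epochs': [100, 500]}
search = GridSearchCV(pipe, param_grid, scoring='f1', cv=3)
search.fit(X_train, y_train)

test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_)
print(f'F1 on the test set: {test_f1:.3f}')
```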

### Fine-tuning our model

### Using our model within a Pipeline

In [None]:
# splitting into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# automatic pipeline


In [None]:
# manual pipeline


# Exercise
Evaluate our classifier with the **Breast Cancer dataset**, [available in sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer).

Suggestion for the experiments:
- Fix a given seed (random_state) for reproducibility
- Use 80% of the data for training, and 20% for testing - stratified sample
- Compared methods:
    - Our implementation with default parameters
    - Our implementation after fine-tuning the `learning_rate` and `n_epochs`
    - [LogisticRegression from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (which is not implemented with Gradient Descent)
    - [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.fit) with default parameters
- Compute (at least) the F1-scores
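
A possible starting point for the experiment, covering only the data preparation and the two sklearn baselines; our own classifier still has to be added to the comparison. The seed value and the `StandardScaler` step are assumptions, since the suggestions above do not mention scaling.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression as SkLogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 80% for training, 20% for testing, stratified, with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

baselines = {
    'sklearn LogisticRegression': SkLogisticRegression(),
    'SGDClassifier': SGDClassifier(random_state=42),  # seed fixed for reproducibility
}

scores = {}
for name, model in baselines.items():
    # assumption: features are standardized before each baseline
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    scores[name] = f1_score(y_test, pipe.predict(X_test))
    print(f'{name}: F1 = {scores[name]:.3f}')
```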

**PS:** Optimizing only our method and comparing it against default (non-optimized) baselines is not a fair comparison. For a fairer one, the main hyperparameters of the baselines should be optimized as well. But this is a simple exercise, so don't worry about that.