# Understanding Gradient Descent with Python

---

In our first notebook, we walked through the basics of gradient descent and applied them to linear regression. In this notebook, we will be doing the same thing, but for a new task.

### By the end of this lesson you will:

- Understand the fundamentals of logistic regression
- Be able to implement your own model class that works just like scikit-learn's!

## A new dataset, a new task

Now that we've seen one task, regression, it's time to see if we can apply our understanding of Gradient Descent to another common machine learning task: classification.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')


from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

## The dataset

This new dataset is another classic machine learning dataset from the UCI ML datasets, used for classifying benign and malignant [cancer from tissue samples](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)).

The goal is to see if we can accurately estimate the likelihood of having cancer, given measurements of tissue.

In [None]:
print(load_breast_cancer().DESCR)

In [None]:
def get_classification_dataset_train_test():
    
    dataset_object = load_breast_cancer()
    X_features_df = pd.DataFrame(data=dataset_object['data'], 
                                        columns=dataset_object['feature_names'])
    y_labels_df = pd.DataFrame(data=dataset_object['target'], 
                               columns=['target_labels'])
    X_train, X_test, y_train, y_test = train_test_split(X_features_df, y_labels_df, shuffle=True, test_size=0.2)
    return X_train, X_test, y_train, y_test

In [None]:
X_train, X_test, y_train, y_test = get_classification_dataset_train_test()

### Let's take a look at the outcome variable against one of our input features.

---
**_Note_**: I have flipped the labels. In the dataset as loaded from scikit-learn, $0 = \text{Malignant}$ and $1 = \text{Benign}$, but for convenience of plotting I plot `1 - target_labels`, so that in the plots $1 = \text{Malignant}$ and $0 = \text{Benign}$.

---

In [None]:
fig = plt.figure(figsize=(12, 8))
ax = plt.axes()
plt.scatter(X_train['mean radius'], 1 - y_train['target_labels'], color='darkorange')
ax.set(xlabel="X Input", ylabel="Cancer, 'Malignant' = 1",
       title="'Benign' or 'Malignant' tissue given some input");

## Does our simple linear regression model work?

If we wanted to predict the outcome, 0 or 1, could we use our simple linear regression model?

In [None]:
fig = plt.figure(figsize=(12, 8))
ax = plt.axes()
plt.scatter(X_train['mean radius'], 1 - y_train['target_labels'], color='darkorange')
plt.plot([10, 20], [-0.1, 1.1], color='black')
ax.set(xlabel="X Input", ylabel="Cancer, 'Malignant' = 1",
       title="'Benign' or 'Malignant' tissue given some input");

## Looks like we need a new model

We need a new model that maps inputs to, in this case, a binary output bounded by 0 and 1. We can no longer use _just_ a straight line, i.e. our trusty $Y_{pred} = \sum_j X_j W_j$. Ideally, the new model takes that _weighted sum_ and then *squashes* it so that $0 \le Y_{pred} \le 1$.
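
To make the problem concrete, here is a minimal sketch that fits a straight line to 0/1 labels and checks where its predictions land. The toy data (`x_toy`, `y_toy`) is made up for illustration and is not the breast cancer dataset; `np.polyfit` stands in for the linear regression from the first notebook.

```python
import numpy as np

# Hypothetical toy data: one input feature and a binary 0/1 outcome.
x_toy = np.array([8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0])
y_toy = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Ordinary least-squares straight line: y_pred = slope * x + intercept.
slope, intercept = np.polyfit(x_toy, y_toy, deg=1)

# Predict at inputs slightly beyond the observed range.
x_new = np.array([5.0, 15.0, 25.0])
y_line = slope * x_new + intercept

print(y_line)
print("below 0:", (y_line < 0).any(), "| above 1:", (y_line > 1).any())
```

The line happily predicts values below 0 and above 1, which cannot be read as probabilities.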


## Logistic regression, and the sigmoid, to the rescue! 

If we take our weighted sum, $\sum_j X_j W_j = dot(X, W)$, and pass it through another function called the **_sigmoid_**, it will weight our inputs just like a linear regression, but map the result to a value between 0 and 1.

$$ \frac{1}{1 + e^{-dot(X,W)} } $$

This helps map our inputs to something that acts like a **_probability_**! Now, we have a model that will take inputs, weight them, and then return us a likelihood of the outcome.
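
As a small illustration of the formula above, the sketch below computes a weighted sum with `np.dot` and passes it through the sigmoid. The inputs `X_demo`, the weights `W_demo`, and the helper name `sigmoid_sketch` are all made up for this example.

```python
import numpy as np

def sigmoid_sketch(z):
    # Squash any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs: 3 examples with 2 features each, and made-up weights.
X_demo = np.array([[1.0, 2.0],
                   [0.0, 0.0],
                   [-3.0, 1.0]])
W_demo = np.array([0.5, -0.25])

weighted_sum = np.dot(X_demo, W_demo)         # same weighted sum as linear regression
probabilities = sigmoid_sketch(weighted_sum)  # each value now lies between 0 and 1

print(weighted_sum)
print(probabilities)
```

A weighted sum of exactly 0 maps to a probability of 0.5; large negative sums approach 0 and large positive sums approach 1.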

In [None]:
def sigmoid(X):
    return 1 / (1. + np.exp(-X))

In [None]:
fig = plt.figure(figsize=(12, 8))
ax = plt.axes()
X = np.linspace(8, 23, 50)
y = [sigmoid(x - 14) for x in X]
plt.plot(X, y, color='black')
ax.set(xlabel="X Input", ylabel="Probability of Cancer, 'Malignant' = 1",
       title="Mapped probability of 'Malignant' tissue given some input");

In [None]:
fig = plt.figure(figsize=(12, 8))
ax = plt.axes()
plt.scatter(X_train['mean radius'], 1 - y_train['target_labels'], color='darkorange')
X = np.linspace(8, 23, 50)
y = [sigmoid(x - 14) for x in X]
plt.plot(X, y, color='black')
ax.set(xlabel="X Input", ylabel="Probability of Cancer, 'Malignant' = 1",
       title="Mapped probability of 'Malignant' tissue given some input");

## Once again, finding the optimal weight

We're left with a similar problem: given the data we have, what is the optimal weight that we should choose? 

However, given this new model, we need a new definition or criterion for measuring "wrong-ness" or error. Without going too deep into the details, there is another mathematical tool for measuring how close predicted probabilities are to their associated outcomes, called **_cross entropy_**. It's often called the **_log loss_** (sounds an awful lot like our "_Logistic_ regression"?!)

$$ Log Loss := -\frac{1}{N} \sum_i^N \sum_j^M(Y_{ij}*\log{(P_{ij})}) $$ 

For $N$ examples of $M$ classes.

For the binary case, it's simply:

$$ Binary Log Loss := -\frac{1}{N} \sum_i^N(Y_{i}*\log{(P_{i})} + (1 - Y_{i})*\log{(1 - P_{i})}) $$ 

And, as you can see from above, our $P_{i} = Sigmoid(X,W)$ because our sigmoid is our handy tool for producing probabilities from weighted sums of inputs!
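
The two formulas above can be sketched directly in NumPy. This is an illustrative sketch, not a replacement for `sklearn.metrics.log_loss`: the function names, the demo arrays, and the `eps` clipping (an assumption added to keep `log` away from 0) are not part of the original notebook.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    # Clip probabilities away from 0 and 1 so log() never receives 0.
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def multiclass_log_loss(Y_onehot, P, eps=1e-15):
    # Y_onehot and P both have shape (N examples, M classes).
    P = np.clip(P, eps, 1 - eps)
    return -np.mean(np.sum(Y_onehot * np.log(P), axis=1))

# Hypothetical labels and predicted probabilities of class 1.
y_demo = np.array([1, 0, 1, 0])
p_demo = np.array([0.9, 0.2, 0.6, 0.4])

# The same problem written with M = 2 classes: columns are [class 0, class 1].
Y_two = np.column_stack([1 - y_demo, y_demo])
P_two = np.column_stack([1 - p_demo, p_demo])

print(binary_log_loss(y_demo, p_demo))
print(multiclass_log_loss(Y_two, P_two))
```

Both calls print the same number, which shows that the binary formula is just the general one with $M = 2$.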

## Let's take another look at our data

In [None]:
X_train.head()

In [None]:
y_train.head()

## The goal for the rest of the session

Just as we did for linear regression, let's implement our own model called `MyLogisticRegression` as a Python `class` that can `fit()`, learn the optimal weights, and be used to predict the output probability.


Take a look at the `models.py` file and you'll see that you have a class to implement. Once you've properly implemented this, you should have a working model! 
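
If you want a rough idea of the shape such a class could take, here is a minimal sketch using plain (full-batch) gradient descent on the binary log loss. It is a hypothetical stand-in, not the class in `models.py`: the name `SketchLogisticRegression`, its constructor arguments, and the bias-column trick are assumptions. Only the attribute `W` and the method name `predict_probability` mirror how this notebook uses the model.

```python
import numpy as np

class SketchLogisticRegression:
    """Hypothetical stand-in; not the class from models.py."""

    def __init__(self, learning_rate=0.1, max_iterations=1000):
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.W = None

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    @staticmethod
    def _add_bias(X):
        # Prepend a column of ones so that W[0] acts as the intercept.
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def fit(self, X, y):
        Xb = self._add_bias(np.asarray(X, dtype=float))
        y = np.asarray(y, dtype=float).ravel()
        self.W = np.zeros(Xb.shape[1])
        for _ in range(self.max_iterations):
            p = self._sigmoid(Xb @ self.W)
            # Gradient of the binary log loss with respect to W.
            gradient = Xb.T @ (p - y) / len(y)
            self.W -= self.learning_rate * gradient
        return self

    def predict_probability(self, X):
        Xb = self._add_bias(np.asarray(X, dtype=float))
        return self._sigmoid(Xb @ self.W)


# Tiny made-up dataset: the label is 1 when the single feature is positive.
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = SketchLogisticRegression(learning_rate=0.5, max_iterations=2000).fit(X_toy, y_toy)
print(model.W)
print(model.predict_probability(X_toy).round(3))
```

The sketch leaves out mini-batches, epochs, and stopping criteria, all of which the real exercise asks you to think about.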

In [None]:
X_train.columns

In [None]:
# Choose your input features here!
feature_cols = ['mean radius', 'mean texture', 'mean compactness']

In [None]:
from models import MyLogisticRegression

In [None]:
test_model = MyLogisticRegression(learning_rate=0.01, max_iterations=100000, optimizer='simple_gd', epochs=2500, batch_size=32)

In [None]:
test_model.fit(X_train[feature_cols].to_numpy(), y_train.to_numpy())

In [None]:
test_model.W

## Once again, let's compare to the _pros_

Take a look at how your new model class compares to the implementations in scikit-learn!

In [None]:
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import log_loss

In [None]:
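# scikit-learn >= 1.2 spells "no regularization" as penalty=None (older releases used penalty='none')
sklearn_model = LogisticRegression(penalty=None, verbose=2, solver='saga', fit_intercept=True, max_iter=100000)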

In [None]:
sklearn_model.fit(X_train[feature_cols].to_numpy(), y_train.to_numpy().ravel())

In [None]:
print(sklearn_model.intercept_, sklearn_model.coef_)

Again, here is another example of a similar model that implements a different flavor of gradient descent.

In [None]:
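# Recent scikit-learn releases use loss='log_loss' and penalty=None (older releases used loss='log' and penalty='none')
sklearn_sgd = SGDClassifier(loss='log_loss', penalty=None, alpha=0.0, max_iter=100000, verbose=2, learning_rate='constant', eta0=0.01)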

In [None]:
sklearn_sgd.fit(X_train[feature_cols].to_numpy(), y_train.to_numpy().ravel())

In [None]:
print(sklearn_sgd.intercept_, sklearn_sgd.coef_)

## After looking at those coefficients, look at the losses

In [None]:
log_loss(y_test.to_numpy().ravel(), test_model.predict_probability(X_test[feature_cols].to_numpy()))

In [None]:
log_loss(y_test.to_numpy().ravel(), sklearn_model.predict_proba(X_test[feature_cols].to_numpy()))

In [None]:
log_loss(y_test.to_numpy().ravel(), sklearn_sgd.predict_proba(X_test[feature_cols].to_numpy()))

## Closing notes

Now you have your own machine learning model that implements logistic regression with Gradient or Stochastic Gradient Descent!

There are many areas of the subject that we **did not** discuss that are critical to deploying gradient descent. Some further reading would be:

- Advanced Stochastic Gradient Descent
- Momentum and variants on Stochastic Gradient Descent
- The importance of _scaled_ inputs for Gradient Descent (without scaling, the learned weights can end up very large or very small depending on the range of each feature)
- Early stopping techniques
- Initialization techniques
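
As a taste of the scaling point in the list above, here is a minimal sketch of standardizing features (zero mean, unit standard deviation per column) with NumPy. The matrix `X_raw` is made up for illustration, and standardization is only one of several possible scaling schemes.

```python
import numpy as np

# Hypothetical feature matrix whose two columns live on very different scales.
X_raw = np.array([[14.2, 0.08],
                  [20.1, 0.15],
                  [11.5, 0.05],
                  [17.8, 0.21]])

# Compute the scaling statistics per column.
col_mean = X_raw.mean(axis=0)
col_std = X_raw.std(axis=0)

# After scaling, every column has mean 0 and standard deviation 1.
X_scaled = (X_raw - col_mean) / col_std

print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

In practice you would compute `col_mean` and `col_std` on the training split only and reuse them to transform the test split.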

There's so much more to discover! 