# Training - in depth

Optimization, or **training**, is the process of making your model better with respect to your training data.

**Simple models** with few parameters can often be trained very quickly, either because their optimal weights can be computed directly or because, with so few parameters, brute-force optimization finishes quickly. More **complex models**, such as neural networks, have far too many weights to solve for directly, so we rely on brute-force optimization, which, when coupled with a lot of training data, can sometimes take hours or days.
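
To illustrate what "directly compute" means, here is a minimal sketch of the closed-form least squares solution for a one-predictor linear model. The data and variable names are made up for illustration and are not part of the tornado data used later.

```python
# Closed-form least squares for a one-predictor linear model y = a*x + b.
# The data below are made up purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]  # generated from y = 2x + 1 with no noise

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# slope = (sum of co-deviations of x and y) / (sum of squared deviations of x)
numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
denominator = sum((x - x_mean) ** 2 for x in xs)
slope = numerator / denominator

# The fitted line passes through the point of means
intercept = y_mean - slope * x_mean

print(slope, intercept)
```

No iteration is needed here: the weights come straight out of a formula, which is why models like this train so quickly.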

In order to train a model we must specify both the **loss function** and the (brute-force) **optimization method**.

## Loss functions

#### A familiar one
As stated above, training is about making the model better. Well, what dictates what is "better"? The loss function does!

A very familiar loss function is the one used in least squares regression:

$$\sum_{i=1}^{n} \Big(y_i - \hat{y_i}\Big)^2$$

Where:<br>
$y_i$ = the observed $y$<br>
$\hat{y_i}$ = the predicted $y$<br><br><br>
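
As a rough sketch, the least squares loss can be computed in a few lines of Python. The function name and the numbers are hypothetical, chosen only to show that worse predictions give a larger loss. Squaring each difference keeps positive and negative errors from cancelling each other out.

```python
def sum_of_squared_errors(observed, predicted):
    # Add up the squared difference between each observed and predicted value
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))

observed = [1.0, 2.0, 3.0]
good_predictions = [1.0, 2.0, 4.0]   # only the last prediction is off, by 1
bad_predictions = [3.0, 0.0, 6.0]    # every prediction is off by 2 or 3

good_loss = sum_of_squared_errors(observed, good_predictions)
bad_loss = sum_of_squared_errors(observed, bad_predictions)

print(good_loss, bad_loss)
```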

#### How this can lead to changes in the model
A loss function, like the one above, usually specifies how bad your model currently is; you can then change your model weights to change the "badness" value that the loss function provides. This works because your prediction is simply a function of the model weights:

$$\hat{y_i} = \hat{a}x_i+\hat{b}$$

Where:<br>
$\hat{a}$ = your current estimate for the slope<br>
$\hat{b}$ = your current estimate for the intercept
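
The sketch below shows the same idea in code: the data stay fixed, and only the weights $\hat{a}$ and $\hat{b}$ change, yet the loss changes with them. The data and function names are assumptions made for illustration.

```python
def predict(a, b, x):
    # The prediction is a function of the weights a and b
    return a * x + b

def squared_error_loss(a, b, xs, ys):
    # Sum of squared differences between observed and predicted values
    return sum((y - predict(a, b, x)) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # made-up data generated from y = 2x + 1

loss_far = squared_error_loss(0.0, 0.0, xs, ys)    # weights far from the truth
loss_near = squared_error_loss(1.5, 1.0, xs, ys)   # weights closer to the truth
loss_exact = squared_error_loss(2.0, 1.0, xs, ys)  # weights that generated the data

print(loss_far, loss_near, loss_exact)
```

Training amounts to searching for the weights that push this number as low as possible.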

More generally, a loss function is a type of objective function: a function whose value you want to optimize. Other fields of AI, such as robotics, more commonly use reward functions that reward a robot for performing the desired task.<br><br><br>

#### For logistic regression
In logistic regression, the loss function is:

$$\sum_{i=1}^{n} -\big(y_i\log\big(\hat{y_i}\big) + \big(1-y_i\big)\log\big(1-\hat{y_i}\big)\big)$$

Where your model predictions ($\hat{y_i}$) now equal:
$$\frac{1}{1 + e^{-(\hat{a}x_i + \hat{b})}}$$

Remember that this is just mapping your linear regression output onto the S-curve bounded between $0$ and $1$. Visualized here: http://www.wolframalpha.com/input/?i=plot+(1%2F(1%2Be%5E(-y)))+from+y+%3D+-10+to+y+%3D+10
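
A small sketch of that S-curve, using the same $1/(1+e^{-y})$ form as the plot linked above. The function name `sigmoid` is just an illustrative choice.

```python
import math

def sigmoid(z):
    # Maps any real number onto the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 lands exactly in the middle
for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(z, sigmoid(z))
```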

To make **sense of this loss function**, imagine the situation where you predict $\hat{y} = 0$ but $y = 1$. You then get:<br><br>
$$-1\big(1 \cdot \log\big(0\big) + \big(1 - 1\big)\log\big(1-0\big)\big)$$
Which simplifies to:<br>
$$-1\big(-\infty + 0\big)$$

Which is just $\infty$.<br>

Infinity is more a concept than a number, so it would be hard to optimize this, but, thankfully, the model should never return $\hat{y} = 0$ (because that would require $\hat{a}x_i+\hat{b} = -\infty$). This does, however, show that if your model is badly wrong, the loss will be very large: while $\hat{y}$ won't equal zero, it can definitely approach it.<br>

It can similarly be shown that if you predict $\hat{y} = 1$ when $y = 0$, the loss will also equal $\infty$ (the log operation on the right side of the equation will return $-\infty$ while the left side will return $0$).<br>
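
To see this blow-up numerically, here is a minimal sketch of the loss for a single observation. The function name is hypothetical, and the predicted values are picked by hand to approach, but never reach, zero.

```python
import math

def binary_cross_entropy(y, y_hat):
    # Loss contribution of one observation with label y (0 or 1)
    # and predicted probability y_hat (strictly between 0 and 1)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# The true label is 1, so predictions closer to 0 are increasingly wrong
for y_hat in [0.9, 0.5, 0.1, 0.001, 1e-9]:
    print(y_hat, binary_cross_entropy(1, y_hat))
```

The loss keeps growing as the prediction approaches the wrong extreme, which is the finite counterpart of the $\infty$ result above.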

It may be worth emphasizing that loss functions are not tied to a specific model and can be applied across many models. This is because they only measure how bad your predictions are, whether the outcome is continuous (as in least squares) or discrete (as in logistic regression).

## Optimizing the loss function

Here is a visualization of what the loss function (height) plotted against two model weights looks like for a simple model such as linear regression (left) and a complex model such as a neural network (right): https://image.slidesharecdn.com/mlconf2015-sf-anandkumar-151114002155-lva1-app6892/95/animashree-anandkumar-electrical-engineering-and-cs-dept-uc-irvine-at-mlconf-sf-111315-10-638.jpg?cb=1447460604

Since we often can't directly compute the optimal solution for complex-model loss functions, we usually use gradient-based methods. A gradient is simply an n-dimensional derivative: it points in the n-dimensional direction of greatest increase in the loss, so stepping in the opposite direction gives the greatest decrease.

So, while we may not know what the optimal solution is, we have a good idea of what direction we need to go in order to get closer to it. This process is carried out over many iterations (steps) with a specified learning rate (step size). Some optimization methods help determine this learning rate for you, which gets rid of one hyperparameter.
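
The iterative process can be sketched with plain gradient descent on a least squares loss. This is an illustration under assumptions, not the optimizer used in the code later on: the data are made up, and the learning rate and number of steps were picked by hand for this toy problem.

```python
# Plain gradient descent on the least squares loss for y = a*x + b
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # made-up data generated from y = 2x + 1

a, b = 0.0, 0.0        # starting weights
learning_rate = 0.02   # step size, hand-picked for this toy data

loss_history = []
for step in range(2000):
    # How far off each prediction currently is
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    loss_history.append(sum(r ** 2 for r in residuals))

    # Partial derivatives of the summed squared error with respect to a and b
    grad_a = sum(-2 * r * x for r, x in zip(residuals, xs))
    grad_b = sum(-2 * r for r in residuals)

    # Step against the gradient, scaled by the learning rate
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

print(a, b, loss_history[0], loss_history[-1])
```

A learning rate that is too large makes the weights overshoot and diverge on this problem, while one that is too small needs many more steps, which is why it is treated as a hyperparameter.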

We are able to determine these gradients/derivatives because at the end of the day, even the most complex models are based on simpler subequations of adding and multiplying; it is therefore a matter of calculating simpler derivatives using the chain rule from calculus.
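
As a hedged illustration of the chain rule at work, the sketch below breaks a one-observation logistic regression into its simple steps, multiplies their derivatives together, and checks the result against a numerical estimate. All names and values are made up for the example.

```python
import math

def forward(a, b, x, y):
    z = a * x + b                        # linear step
    y_hat = 1.0 / (1.0 + math.exp(-z))   # sigmoid step
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return z, y_hat, loss

a, b, x, y = 0.5, -0.25, 2.0, 1.0
z, y_hat, loss = forward(a, b, x, y)

# Derivative of each simple step
dloss_dyhat = -(y / y_hat - (1 - y) / (1 - y_hat))
dyhat_dz = y_hat * (1 - y_hat)
dz_da = x

# Chain rule: multiply the pieces to get d(loss)/d(a)
chain_rule_grad = dloss_dyhat * dyhat_dz * dz_da

# Numerical check using a small central difference
eps = 1e-6
numeric_grad = (forward(a + eps, b, x, y)[2] - forward(a - eps, b, x, y)[2]) / (2 * eps)

print(chain_rule_grad, numeric_grad)
```

Libraries such as PyTorch automate this bookkeeping, which is what `loss.backward()` does in the training loops below.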

Here is a video showing a gradient-based optimization: https://www.youtube.com/watch?v=kJgx2RcJKZY

# The code

In [None]:
import pandas as pd
import torch
import torch.nn as nn
from torch.autograd import Variable
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
# Import the training data
tor_df = pd.read_csv("tor_train_set.csv")


# Get the outcomes
tornado_outcome = tor_df.iloc[:, [2]]

# Convert the pandas column to a ndarray and then into a FloatTensor
train_outcome_Variable = Variable(torch.from_numpy(tornado_outcome.values).float())


# Get the predictors
tornado_predictors = tor_df.iloc[:, 3:]

# Convert the training set predictors to a ndarray and then into a FloatTensor
train_predictors_Variable = Variable(torch.from_numpy(tornado_predictors.values).float())

In [None]:
# Import the test set data
test_df = pd.read_csv("tor_test_set.csv")


# Get the outcomes
test_outcome = test_df.iloc[:, [2]]

# Convert the pandas column to a ndarray and then into a FloatTensor
test_outcome_Variable = Variable(torch.from_numpy(test_outcome.values).float())


# Get the test set predictors
test_predictors = test_df.iloc[:, 3:]

# Convert the test set predictors to a ndarray and then into a FloatTensor
test_predictors_Variable = Variable(torch.from_numpy(test_predictors.values).float())

In [None]:
def convert_prop_dam_to_binary(property_damage_values):
    
    # This function will convert continuous property damage values to binary values defining whether
        # or not a tornado caused any damage
    # property_damage_values = a PyTorch Tensor containing property damage values
    # Returns a PyTorch Tensor of binary values (0 = no damage, 1 = damage)
    
    
    # Get the Tensor as a ndarray (copied so the input Tensor is not modified in place)
    prop_dam_array = property_damage_values.data.numpy().copy()
    
    # The smallest value is treated as "no damage"; compute it once, before
        # any values are overwritten, so the comparison cannot drift
    no_damage_value = prop_dam_array.min()
    
    # Vectorized conversion to binary: 0 where the value equals the minimum, 1 otherwise
    binary_array = (prop_dam_array != no_damage_value).astype('float32')
     
    # Convert ndarray to Tensor
    prop_dam_Tensor = Variable(torch.from_numpy(binary_array))
    
    # Return Tensor
    return prop_dam_Tensor

In [None]:
# Convert the training data
train_Y_binary = convert_prop_dam_to_binary(train_outcome_Variable)

# And the test data
test_Y_binary = convert_prop_dam_to_binary(test_outcome_Variable)

In [None]:
train_predictors_Variable.size()

In [None]:
torch.manual_seed(123)

class LogisticRegression(torch.nn.Module):
    
    def __init__(self):
        super(LogisticRegression, self).__init__()
        self.logistic_layer = nn.Sequential(nn.Linear(51, 1),
                                          nn.Sigmoid())
        
        
    def forward(self, x):
        logistic_output = self.logistic_layer(x)
        return(logistic_output)


# Make it
classifier = LogisticRegression()

# Optimizing options
loss_function = nn.BCELoss()
optimizer = torch.optim.Adam(classifier.parameters())

In [None]:
loss_list  = []
test_loss_list = []

for i in range(1000):
    optimizer.zero_grad()
    
    predictions = classifier(train_predictors_Variable)
    test_predictions = classifier(test_predictors_Variable)
    
    loss = loss_function(predictions, train_Y_binary)
    test_loss = loss_function(test_predictions, test_Y_binary)
    
    loss_list.append(loss.item())
    test_loss_list.append(test_loss.item())
    loss.backward()
    optimizer.step()

In [None]:
plt.plot(loss_list, label = 'train')
plt.plot(test_loss_list, label = 'test')
plt.legend();

In [None]:
plain_prediction_list  = []

test_predictions = classifier(test_predictors_Variable)

for i in range(len(test_predictions)):
    plain_prediction = test_predictions[i].data.numpy()[0]
    if plain_prediction < 0.5:
        plain_prediction_list.append(0)
    else:
        plain_prediction_list.append(1)

In [None]:
test_Y_binary_list = test_Y_binary.data.numpy().tolist()

In [None]:
# scikit-learn expects the true labels first, then the predictions
accuracy_score(test_Y_binary_list, plain_prediction_list)

In [None]:
# True labels first, so rows are true classes and columns are predicted classes
confusion_matrix(test_Y_binary_list, plain_prediction_list)
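
To make the two metrics above concrete, here is a small sketch that counts the four confusion-matrix cells by hand. The labels are made up for illustration and are not the tornado data; the variable names are likewise hypothetical.

```python
# Made-up labels purely for illustration (not the tornado data)
true_labels      = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(true_labels, predicted_labels))

# Each (true, predicted) pair falls into exactly one of four cells
true_positives  = sum(1 for t, p in pairs if t == 1 and p == 1)
true_negatives  = sum(1 for t, p in pairs if t == 0 and p == 0)
false_positives = sum(1 for t, p in pairs if t == 0 and p == 1)
false_negatives = sum(1 for t, p in pairs if t == 1 and p == 0)

# Accuracy is the share of pairs where the prediction matches the truth
accuracy = (true_positives + true_negatives) / len(pairs)

print(true_positives, true_negatives, false_positives, false_negatives, accuracy)
```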

<br><br><br>

In [None]:
torch.manual_seed(123)

class NeuralNetwork(torch.nn.Module):
    
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.hidden_layer = nn.Sequential(nn.Linear(51, 26),
                                          nn.ReLU())
        self.output_layer = nn.Sequential(nn.Linear(26, 1),
                                          nn.Sigmoid())
        
        
        
    def forward(self, x):
        hidden_output = self.hidden_layer(x)
        final_output = self.output_layer(hidden_output)
        return(final_output)


# Make it
classifier = NeuralNetwork()

# Optimizing options
loss_function = nn.BCELoss()
optimizer = torch.optim.Adam(classifier.parameters())

In [None]:
loss_list = []
test_loss_list = []

for i in range(1000):
    optimizer.zero_grad()
    
    predictions = classifier(train_predictors_Variable)
    test_predictions = classifier(test_predictors_Variable)
    
    loss = loss_function(predictions, train_Y_binary)
    test_loss = loss_function(test_predictions, test_Y_binary)
    
    loss_list.append(loss.item())
    test_loss_list.append(test_loss.item())
    loss.backward()
    optimizer.step()

In [None]:
plt.plot(loss_list, label = 'train')
plt.plot(test_loss_list, label = 'test')
plt.legend();

In [None]:
plain_prediction_list = []

test_predictions = classifier(test_predictors_Variable)

for i in range(len(test_predictions)):
    plain_prediction = test_predictions[i].data.numpy()[0]
    if plain_prediction < 0.5:
        plain_prediction_list.append(0)
    else:
        plain_prediction_list.append(1)

In [None]:
# scikit-learn expects the true labels first, then the predictions
accuracy_score(test_Y_binary_list, plain_prediction_list)

In [None]:
# True labels first, so rows are true classes and columns are predicted classes
confusion_matrix(test_Y_binary_list, plain_prediction_list)