# Week 18 In Class Problems

1) Look up the Adam optimization functions in PyTorch https://pytorch.org/docs/stable/optim.html . 
How does it work? Try at least one other optimization function with the diabetes dataset shown in class. How does the model perform with the new optimizer? Did it perform better or worse than Adam? Why do you think that is?

Adam is a gradient descent optimization algorithm. All optimization algorithms seek to find the input for a target function that minimizes (or maximizes) output. Some optimization algorithms require that their objective (loss) functions are differentiable at a given point (e.g., a proposed solution). Within this set, some algorithms are based on using the first derivative; for multivariate functions, the "derivitive" is the gradient. "First order" algorithms use the gradient to find the direction to move in the search space, with the goal of getting as close to the optimum as possible. For each step, the derivative is computed, and, if the minimum is sought, a step (a parameter, the learning rate) is taken downhill, in the direction OPPOSITE to the gradient. This is the gradient descent. Stochastic gradient descents (SGDs) operate similarly, except that the gradient is not explicitly computed. Instead, it is obtained from prediction error in the training data set. 

The chief difference between Adam and earlier SGDs is that the learning rate is adaptive, not fixed. My understanding is that Adam adapts the learning rate based on the first and second moments of the gradient (mean and variance). The consequence of this is that it performs well in situations that earlier SGD algorithms had some difficulty with, in particular, problems where the underlying values change over time (non-stationary) or sparse gradients  (a very weak "signal", i.e. a lot of small numbers that don't have enough information to update the weights of the network). Furthermore, Adam is computationally efficient, requires little memory, and is good for problems with lots of data and parameters.

I used four optimiation algorithms: Adam, its two "ancestral" algorithsms, Adagrad and RMSprop, and Rprop (resilient backpropagation). Here are the performance results. The loss score is for the final epoch (491). 
Adam:  loss score = 0.0046, accuracy score = 0.69 
UAdagrad: loss score = 0.3084, accuracy score = 0.70. 
RMSprop: loss score = 0.0555, accuracy score = 0.73. 
Rprop algorithm: loss score = 0.0043, accuracy score = 0.71.

Rprop had the smallest loss score and the second smallest accuracy score; RMSprop had the highest accuracy score. Adam had the second lowest loss scorr and the lowest accuracy score. To be honest, I am not sure why the performances are what they are, but Adam outperformed its parent algorithms in terms of minimizing the loss score. This is probably due to the fact that it combines the adaptive properties of both. I can't speak to the accuracy. They seem close enough to one another, but I don't know how to compare them. 

In [1]:
import pandas as pd
import torch

diabetes_df = pd.read_csv("../week_13/diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1).values
y = diabetes_df['Outcome'].values

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

# #Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

In [3]:
import torch.nn as nn
import torch.nn.functional as F #this has activation functions

# Creating tensors
X_train = torch.FloatTensor(X_train)
X_test = torch.FloatTensor(X_test)

y_train = torch.LongTensor(y_train)
y_test = torch.LongTensor(y_test)

print(X_train)

tensor([[-0.8514, -0.9801, -0.4048,  ..., -0.6077,  0.3108, -0.7922],
        [ 0.3566,  0.1614,  0.4654,  ..., -0.3021, -0.1164,  0.5610],
        [-0.5494, -0.5045, -0.6223,  ...,  0.3726, -0.7649, -0.7076],
        ...,
        [-0.8514, -0.7582,  0.0303,  ...,  0.7800, -0.7861, -0.2847],
        [ 1.8665, -0.3142,  0.0303,  ..., -0.5695, -1.0194,  0.5610],
        [ 0.0546,  0.7322, -0.6223,  ..., -0.3149, -0.5770,  0.3073]])


In [4]:
class ANN_Model(nn.Module):
    def __init__(self, input_features=8, hidden1=20, hidden2=20, out_features =2):
        super().__init__()
        self.layer_1_connection = nn.Linear(input_features, hidden1)
        self.layer_2_connection = nn.Linear(hidden1, hidden2)
        self.out = nn.Linear(hidden2, out_features)
    
    def forward(self, x):
        #apply activation functions
        x = F.relu(self.layer_1_connection(x))
        x = F.relu(self.layer_2_connection(x))
        x = self.out(x)
        return x

In [5]:
torch.manual_seed(42)

#instantiate the model
model = ANN_Model()

In [6]:
# loss function
loss_function = nn.CrossEntropyLoss()

#optimizer
optimizer = torch.optim.Rprop(model.parameters(), lr = 0.01)

In [7]:
#run model through multiple epochs/iterations
final_loss = []
n_epochs = 500
for epoch in range(n_epochs):
    y_pred = model.forward(X_train)
    loss = loss_function(y_pred, y_train)
    final_loss.append(loss)
    
    if epoch % 10 == 1:
        print(f'Epoch number: {epoch} with loss: {loss.item()}')
    
    optimizer.zero_grad() #zero the gradient before running backwards propagation
    loss.backward() #for backward propagation 
    optimizer.step() #performs one optimization step each epoch
    

Epoch number: 1 with loss: 0.6474142670631409
Epoch number: 11 with loss: 0.45198854804039
Epoch number: 21 with loss: 0.3870345950126648
Epoch number: 31 with loss: 0.3395462930202484
Epoch number: 41 with loss: 0.30658769607543945
Epoch number: 51 with loss: 0.2752005457878113
Epoch number: 61 with loss: 0.25215259194374084
Epoch number: 71 with loss: 0.22651627659797668
Epoch number: 81 with loss: 0.20444045960903168
Epoch number: 91 with loss: 0.18579360842704773
Epoch number: 101 with loss: 0.17048318684101105
Epoch number: 111 with loss: 0.1541174203157425
Epoch number: 121 with loss: 0.13827119767665863
Epoch number: 131 with loss: 0.12396775931119919
Epoch number: 141 with loss: 0.11416137963533401
Epoch number: 151 with loss: 0.10690709203481674
Epoch number: 161 with loss: 0.10016005486249924
Epoch number: 171 with loss: 0.09251974523067474
Epoch number: 181 with loss: 0.08521822094917297
Epoch number: 191 with loss: 0.07992039620876312
Epoch number: 201 with loss: 0.07561808

In [8]:
#predictions
y_pred = []

with torch.no_grad():
    for i, data in enumerate(X_test):
        prediction = model(data)
        y_pred.append(prediction.argmax().item())



In [9]:
from sklearn.metrics import accuracy_score
a_score = accuracy_score(y_test, y_pred)
print(a_score)

0.7142857142857143


In [10]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.77      0.78       100
           1       0.59      0.61      0.60        54

    accuracy                           0.71       154
   macro avg       0.69      0.69      0.69       154
weighted avg       0.72      0.71      0.72       154



2) Write a function that lists and counts the number of divisors for an input value.

Example 1:
Input: 5
Output: “There are 2 divisors: 1 and 5”

Example 2:
Input: 40
Output: “There are 8 divisors: 1, 2, 4, 5, 8, 10, 20, and 40”

2) Create a function that accepts an array of names and returns a string formatted as a list of names separated by commas EXCEPT for the last two names, which are separated by an ampersand (and sign - &)