<a href="https://colab.research.google.com/github/nalika/NLP/blob/master/Lab_1_2020.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Colab**

**Environment setup**
> First, make sure your environment uses a GPU:
1.   Click Runtime in the top menu
2.   Select Change runtime type
3.   Open the Hardware accelerator dropdown
4.   Choose GPU, then click Save

> Then, click connect in the top right and wait until you see a checkmark
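
> To confirm that the runtime really has a GPU attached, one option is to call the `nvidia-smi` tool that ships with NVIDIA drivers. The sketch below assumes that tool is available on a GPU runtime; the helper name `gpu_visible` is made up for illustration.

```python
import subprocess

def gpu_visible():
    """Return True if `nvidia-smi` runs successfully, False otherwise."""
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # The tool is missing or did not respond, so assume no usable GPU
        return False
    return result.returncode == 0

print("GPU visible:", gpu_visible())
```

> If this prints False, repeat the runtime steps above before running the pytorch cells, since they call `.cuda()`.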





**Drive mounting and data reading**

> On the left-hand side, you should see the table of contents (if you don't see it, click on the arrow to expand the menu):
1.   Click on files
2.   Mount drive
3.   Then, run the code using the play button to the left of the cell and follow the instructions






In [0]:
from google.colab import drive
drive.mount('/content/drive')

> 4.   Click refresh in the files tab
5.   Open the drive folder then my drive, and you will see the contents of your Google Drive
6. To reference files from here in your code, right click on the file and use copy path
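
> Before passing a copied path to your code, it can help to check that it points where you expect. This is a minimal sketch using only the standard library; `check_copied_path` is a hypothetical helper and the path shown is a placeholder, not a file this lab provides.

```python
from pathlib import Path

def check_copied_path(path_text):
    """Report whether a path copied from the files tab exists."""
    path = Path(path_text)
    if path.is_file():
        return "found file (" + str(path.stat().st_size) + " bytes)"
    if path.is_dir():
        return "found folder"
    return "not found"

# Placeholder path; paste the one you copied with "copy path"
print(check_copied_path('/content/drive/My Drive/housing.csv'))
```

> A result of "not found" usually means the drive is not mounted yet or the path was typed rather than copied.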

# **Linear Regression with Sklearn libraries, Numpy, and Python coding**

> First, we need to import the libraries we'll be using

In [0]:
# Get the linear_model module from sklearn
from sklearn import linear_model

# Get the train_test_split function from sklearn for splitting our dataset
# into the portions we train and test the model with
from sklearn.model_selection import train_test_split

# Get the mean_squared_error and r2_score functions from sklearn to measure our
# model's performance
from sklearn.metrics import mean_squared_error, r2_score

# Import the pandas library to read our dataset
import pandas as pd

# Import the numpy library to work with and manipulate the data
import numpy as np



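> Next, we will read the following dataset:
* https://github.com/ageron/handson-ml/blob/master/datasets/housing/housing.csv
*   Try searching on Google for "github ageron housing.csv" to download this file
*   Note that the code in this lab selects the columns "CRIM" through "LSTAT" and "MEDV", which are the column names of the Boston housing dataset. The file linked above is the California housing dataset and uses different column names (for example "median_house_value"), so make sure the file you load contains the columns you select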



In [0]:
# Reads a comma-separated value (CSV) file
# Replace the path below with the path to your .csv file
dataset = pd.read_csv('/content/drive/My Drive/Lakehead University/2020-2021' +
                      '/Work/Natural Language Processing/Labs/housing.csv')

# View the first five rows of the dataset
print("Here are the first five rows of the dataset:")
dataset.head()

> Next, we will split the dataset into the input (X) and output (Y) datasets
*   The Y dataset is what we want to predict
*   The X dataset is what we will use to predict Y

In [0]:
# We will predict the "MEDV" column
Y = dataset['MEDV']

# The remainder of the columns will be used to predict Y
# Select from the "CRIM" column to the "LSTAT" column
X = dataset.loc[:,'CRIM':'LSTAT']

> Once the dataset has been read, we need to split it into training and testing portions
*   The linear model will learn from the training set
*   The performance of the trained model will be evaluated by predicting the outputs of the testing set

In [0]:
# Splits the dataset so 80% is used for training and 20% for testing
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

# View the number of entries (rows) in the splits
# len() counts rows, whereas .size would count rows * columns
print("There are " + str(len(x_train)) + " training entries and " + 
      str(len(x_test)) + " testing entries!")
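
> To make the idea of an 80/20 split concrete, here is a minimal sketch in plain Python: shuffle the row indices, then cut them into two parts. It is an illustration only, not how sklearn implements `train_test_split`; the helper `simple_split` and the toy rows are assumptions.

```python
import random

def simple_split(rows, test_size=0.2, seed=None):
    """Shuffle row indices, then cut them into training and testing parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(100))  # stand-in for 100 dataset rows
train_rows, test_rows = simple_split(rows, test_size=0.2, seed=0)
print(len(train_rows), "training rows and", len(test_rows), "testing rows")
```

> Fixing the seed makes the shuffle repeatable. `train_test_split` offers a `random_state` argument for the same purpose.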

> Now that the dataset has been split, we need to train the linear model

In [0]:
# Initialize the model
model = linear_model.LinearRegression()

# Train the model
model.fit(x_train, y_train)

> Next, we will evaluate the performance of the model
*   The R^2 score measures the proportion of the variance in Y that the model explains. The closer it is to 1, the better the model is performing.




In [0]:
# Get the predictions of the model on the testing dataset
predictions = model.predict(x_test)

# Output the performance of the model
mse = mean_squared_error(y_test, predictions)
print("The model's mean squared error is: " + str(mse))

r2 = r2_score(y_test, predictions)
print("The model's R^2 score is: " + str(r2))
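
> Both metrics are short formulas, so it can be instructive to compute them by hand. The sketch below uses the standard definitions (mean of squared errors, and one minus the ratio of residual to total sum of squares) on made-up numbers; the function names are hypothetical.

```python
def mse_by_hand(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r2_by_hand(actual, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]      # made-up target values
predicted = [2.5, 5.5, 7.0, 8.0]   # made-up model outputs
print("MSE:", mse_by_hand(actual, predicted))
print("R^2:", r2_by_hand(actual, predicted))
```

> A model that always predicts the mean of Y scores an R^2 of 0 under this formula, which is why values near 1 indicate a better fit.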

# **Linear Regression with Pytorch on Colab**

> First, let's import the needed libraries



In [0]:
# Import the pytorch library
import torch

# Import the Linear class from pytorch for our model
from torch.nn import Linear

# Import the SGD (stochastic gradient descent) class from pytorch for
# our optimizer
from torch.optim import SGD

# Import the L1Loss (mean absolute error loss) class from pytorch for
# our performance measure
from torch.nn import L1Loss

> Next, we need to convert the dataset we defined previously to numpy arrays to use with pytorch



In [0]:
# Convert the training data
x_train = x_train.to_numpy()
y_train = y_train.to_numpy()

# Convert the testing data
x_test = x_test.to_numpy()
y_test = y_test.to_numpy()

> Next, we will define our model



In [0]:
# We need to tell it how many inputs (X columns) and 
# outputs (Y columns) there are
model = Linear(X.shape[1], 1)

# Set the model to use the GPU for processing
model.cuda()

> The basic definition of linear regression is *Y = Xβ + ε*, where *β* holds the coefficients that the model is trying to optimize to better predict the *Y* value and *ε* is the error that the model cannot explain
*   sklearn's LinearRegression does not need an iterative optimizer: it solves for *β* directly using ordinary least squares
*   With pytorch, we need to define an optimization technique for our model as well as the learning rate for it. Here we use *stochastic gradient descent* (SGD)

> The learning rate controls how large each parameter update is when optimizing. A high learning rate changes the parameters more on each step than a low learning rate. If it is too high, training can become unstable; if it is too low, the model learns slowly.
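
> The effect of the learning rate is easy to see on a toy problem. The sketch below runs plain gradient descent on the made-up function f(w) = (w - 3)^2, whose minimum is at w = 3. It is not the housing model, and `descend` is a hypothetical helper.

```python
def descend(learning_rate, steps=50, start=0.0):
    """Minimize f(w) = (w - 3)^2 with plain gradient descent."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)            # derivative of (w - 3)^2
        w = w - learning_rate * gradient  # step against the gradient
    return w

for lr in (0.001, 0.1, 1.1):
    print("learning rate", lr, "-> w =", descend(lr))
```

> With the smallest rate, w has barely moved toward 3 after 50 steps; with 0.1 it lands very close to 3; with 1.1 each step overshoots by more than the last and w moves away from 3.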





In [0]:
# Define a learning rate
learning_rate = 0.0001

# Define the optimizer with our model's parameters
optimizer = SGD(model.parameters(), lr=learning_rate)

> Now we need to use the optimizer to fine-tune the parameters based on the model's output
*   sklearn's LinearRegression solves directly for the parameters that minimize the total sum of squared errors between the predicted outputs and expected outputs, so there is no training length to choose
*   With pytorch, we decide how long to train the model for using our selected optimizer. We will measure how it's doing with "L1Loss", the mean absolute error between the predicted outputs and expected outputs.
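
> For a single input column, minimizing the sum of squared errors has a well-known closed-form answer, which shows why no training loop is needed in that case. The sketch below is a simplification under that one-feature assumption (sklearn handles many columns with a matrix solver); `fit_line` and the data points are made up.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # points lying exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
```

> The answer comes out in one pass over the data, whereas the pytorch approach below approaches it gradually over many epochs.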

In [0]:
# Define the number of epochs to train for
epochs = 100

# Define the performance measure
performance = L1Loss()

# Convert the training set into torch tensors for our model using the GPU
# as floats. The reshape turns the targets into a column of shape (N, 1) so
# they match the shape of the predictions; pytorch warns otherwise.
inputs = torch.from_numpy(x_train).cuda().float()
outputs = torch.from_numpy(y_train.reshape(y_train.shape[0], 1)).cuda().float()

# Start the training loop
for epoch in range(epochs):
  # Clear the gradients from the previous step so they don't accumulate
  optimizer.zero_grad()

  # Get the model's predictions for the training dataset
  predictions = model(inputs)

  # See how well the model performed
  loss = performance(predictions, outputs)

  # Compute the gradients for our optimizer
  loss.backward()

  # Use the optimizer to update the model's parameters based on the gradients
  optimizer.step()

  print("Epoch " + str(epoch + 1) + ": " + str(loss.item()))
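
> To see roughly what the three optimizer calls in the loop are doing, the sketch below repeats the same cycle by hand for a one-input linear model with a mean absolute error loss. It is an illustration on made-up data, not pytorch's implementation; the gradients are written out manually instead of being computed by `backward()`.

```python
def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # toy targets following y = 2x
w, b = 0.0, 0.0             # the model's two parameters
learning_rate = 0.01
losses = []

for epoch in range(100):
    # Forward pass: prediction errors and their mean absolute value
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    losses.append(sum(abs(e) for e in errors) / len(errors))

    # Gradients of the mean absolute error, computed fresh every epoch
    # (this plays the role of zero_grad() followed by backward())
    grad_w = sum(sign(e) * x for e, x in zip(errors, xs)) / len(errors)
    grad_b = sum(sign(e) for e in errors) / len(errors)

    # Update step: move each parameter against its gradient (like step())
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print("first loss:", losses[0], "last loss:", losses[-1])
```

> Because the gradients are recomputed from scratch each epoch, nothing carries over between steps, which is the behaviour `optimizer.zero_grad()` ensures in the pytorch loop.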

> Lastly, we will test the model to see how it performs on the testing dataset




In [0]:
# Convert the testing set into torch variables for our model using the GPU
# as floats
inputs = torch.from_numpy(x_test).cuda().float()
outputs = torch.from_numpy(y_test.reshape(y_test.shape[0], 1)).cuda().float()

# Get the predictions of the model on the testing dataset
# (gradients are not needed when we are only evaluating)
with torch.no_grad():
  predictions = model(inputs)

# Output the performance of the model
loss = performance(predictions, outputs)
print("The model's L1 loss is: " + str(loss.item()))