#### Copyright 2018 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Linear Regression With scikit-learn

We have learned about linear regression in theory. Now let's put our newly found skills into practice. In this Colab we will create multiple linear regression models using the scikit-learn toolkit.

## Overview

### Learning Objectives

* Create a closed-form linear regression model
* Create a linear regression model using an optimizer

### Prerequisites

* Introduction to Colab
* Intermediate Python
* Introduction to Pandas
* Visualizations
* Introduction to scikit-learn

### Estimated Duration

60 minutes

### Grading Criteria

Each exercise is worth 3 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Attempted exercise, but code does not run |
| 2      | Attempted exercise, code runs, but produces incorrect answer |
| 3      | Exercise completed successfully |

There are 9 graded exercises in this Colab, so there are 27 points available. The grading scale will be 27 points.

# Creating a dataset

In this Colab we will examine different methods of applying a linear regression to a dataset. If the dataset is small enough, the regression can be calculated with a simple equation. If the dataset is large, there are methods to calculate the regression using sampling and/or batching.

But before we get started, let's create some sample data to perform our regression on. The code below creates 1000 data points.

In [0]:
import numpy as np

np.random.seed(312)

# The size of the dataset that we'll be using to perform our linear regression.
DATA_SET_SIZE = 1000

# The maximum value of the x coordinate. The values of x will fall in the
# half-open range [0, X_MAX).
X_MAX = 5

# The Y-intercept is one of the "secret" values that we'll be trying to predict
# via linear regression.
INTERCEPT = 4

# The slope is another value that we'll be trying to predict using linear
# regression.
SLOPE = 3

# Generate the x-coordinates for our dataset.
coffee = X_MAX * np.random.rand(DATA_SET_SIZE, 1)

# Generate the y-coordinates for our dataset using the linear equation
# y = mx + b.
energy = SLOPE * coffee + INTERCEPT

Let's take a look at the data set that was just generated.

In [0]:
import matplotlib
import matplotlib.pyplot as plt

plt.plot(coffee, energy, 'b.')
plt.show()

The data does indeed have an x range from zero to our max x value. Notice that the y-intercept and slope match our seeded values.

This data looks nothing like we'd see in the real world though. It would be trivial to fit a line to the data as is. Let's add a little randomness to the data to make it more realistic.

In [0]:
energy = energy + 2 * np.random.randn(DATA_SET_SIZE, 1)

plt.plot(coffee, energy, 'b.')
plt.show()

That's much better! There is still a linear trend to the data, but there is also much more noise.
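To see how much noise that factor of 2 adds, the sketch below draws the same kind of noise on its own and measures it. It uses `np.random.default_rng` with an arbitrary seed, which is an assumption for this illustration and not what the cell above uses.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

# Standard normal samples have mean 0 and standard deviation 1, so
# multiplying by 2 gives noise with a standard deviation of about 2.
noise = 2 * rng.standard_normal((1000, 1))

noise_mean = noise.mean()
noise_std = noise.std()
print("mean: {:.3f}, std: {:.3f}".format(noise_mean, noise_std))
```

A standard deviation near 2 means most points land within about 4 units above or below the underlying line.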

# The Normal Equation

If the dataset being processed is small enough, then the Normal Equation can be used to calculate the slope and y-intercept of the regression line in-memory. The normal equation can easily be written in NumPy as seen below.
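For reference, the Normal Equation in matrix form is shown below. The symbols follow a common convention that is assumed here: $X$ is the input matrix with a leading column of ones, $y$ is the column of target values, and $\hat{\theta}$ holds the estimated intercept and slope.

> $\hat{\theta} = (X^T X)^{-1} X^T y$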

In [0]:
# coffee is an Nx1 matrix containing our x-values. The first step in
# calculating the Normal Equation is to create an Nx2 matrix where each "row"
# has the value 1 and the x value.
coffee_ = np.c_[np.ones((DATA_SET_SIZE, 1)), coffee]

norm = np.linalg.inv(coffee_.T.dot(coffee_)).dot(coffee_.T).dot(energy)

calculated_intercept = norm[0][0]
calculated_slope = norm[1][0]

print("Calculated slope {} vs actual {}".format(calculated_slope, SLOPE))
print("Calculated intercept {} vs actual {}".format(calculated_intercept, INTERCEPT))

Not bad! The random noise we added to the data prevented us from exactly predicting the slope and intercept, but our calculations were pretty close.
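As a side note, explicitly inverting a matrix is usually not required to evaluate the Normal Equation. The sketch below regenerates similar data (with an arbitrary seed, so the numbers will differ from the cell above) and compares the explicit inverse with `np.linalg.solve`, which solves the same linear system without forming the inverse. The variable names are chosen for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
x = 5 * rng.random((1000, 1))
y = 3 * x + 4 + 2 * rng.standard_normal((1000, 1))

# Add the column of ones so the first parameter acts as the intercept.
X = np.c_[np.ones((len(x), 1)), x]

# Explicit inverse, as in the cell above.
theta_inv = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

# Solve (X^T X) theta = X^T y directly, without forming the inverse.
theta_solve = np.linalg.solve(X.T.dot(X), X.T.dot(y))

print(theta_inv.ravel(), theta_solve.ravel())
```

Both approaches should agree to within floating-point error on a small, well-conditioned problem like this one.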

We can now use these values to make predictions.

In [0]:
# Create a (5, 1) matrix containing values to make predictions on.
coffee_ = np.array([[0.34], [1.65], [2.45], [3.78], [4.56]])

# Convert the matrix to a (5, 2) matrix with 1s in the first column
# in order to perform a dot-product against the calculated slope and
# intercept.
coffee_predict = np.c_[np.ones((5, 1)), coffee_]

# Make the predictions
energy_predict = coffee_predict.dot(norm)

# Plot the original data as blue dots.
plt.plot(coffee, energy, 'b.')

# Plot the predictions as red dots.
plt.plot(coffee_, energy_predict, 'r.')
plt.show()

If we want to plot the calculated line we can ask for the prediction at the min and max values for x and plot the line.

In [0]:
coffee_ = np.array([[0.0], [X_MAX]])
coffee_predict = np.c_[np.ones((2,1)), coffee_]
energy_predict = coffee_predict.dot(norm)

plt.plot(coffee, energy, 'b.')
plt.plot(coffee_, energy_predict, 'r-')
plt.show()

# Challenge: Optimized Normal Equation

It turns out that the math operations used to calculate the Normal Equation are quite expensive. We'll explore other methods of performing a linear regression soon, but there is also a purely mathematical optimization. This version of the equation uses the pseudoinverse of the input matrix, $X^+$, to estimate the parameters $\hat{\theta}$ (the intercept and slope).

> $\hat{\theta} = X^+y$

Find the NumPy function that calculates the pseudoinverse of a matrix, and then use that function to write a more efficient method for finding the slope and intercept for a linear regression in memory.

In [0]:
coffee_ = np.c_[np.ones((DATA_SET_SIZE, 1)), coffee]

norm2 = [[0], [0]] # Update this line to perform an in-memory calculation of a
                   # linear regression using the optimized equation in the place
                   # of the [[0], [0]] matrix.

calculated_intercept2 = norm2[0][0] 
calculated_slope2 = norm2[1][0]

EPSILON = 0.00001
if abs(calculated_slope - calculated_slope2) < EPSILON and abs(calculated_intercept - calculated_intercept2) < EPSILON:
  print("You win!")
else:
  print("Try again :(")

### Answer Key

**Solution**

In [0]:
from numpy.linalg import pinv

coffee_ = np.c_[np.ones((DATA_SET_SIZE, 1)), coffee]

norm2 = pinv(coffee_).dot(energy)

calculated_intercept2 = norm2[0][0] 
calculated_slope2 = norm2[1][0]

**Validation**

In [0]:
EPSILON = 0.00001
if abs(calculated_slope - calculated_slope2) >= EPSILON:
  raise Exception(f'calculated slope of {calculated_slope2} is too far from {calculated_slope}')
if abs(calculated_intercept - calculated_intercept2) >= EPSILON:
  raise Exception(f'calculated intercept of {calculated_intercept2} is too far from {calculated_intercept}')
print("LGTM")

# Linear Regression in scikit-learn

Using NumPy to calculate the slope and intercept for a linear regression isn't difficult, but it can be error-prone. Luckily, you won't have to perform these calculations directly in most cases. scikit-learn's `LinearRegression` implementation performs the same kind of optimized in-memory calculation that we just worked through above.

In [0]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(coffee, energy)
lin_reg.coef_, lin_reg.intercept_

Notice that the slope (scikit-learn calls this `coef_`) and the intercept (`intercept_`) match the values that we calculated manually above.
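The fitted values come back as NumPy arrays, and their shape depends on the shape of the targets passed to `fit`. The sketch below uses noise-free toy data (an assumption made so the expected slope and intercept are obvious) to show how plain floats can be pulled out of a model that was fit on 2-D targets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
x = 5 * rng.random((100, 1))
y_2d = 3 * x + 4       # shape (100, 1), like `energy` in this Colab
y_1d = y_2d.ravel()    # shape (100,)

model_2d = LinearRegression().fit(x, y_2d)
model_1d = LinearRegression().fit(x, y_1d)

# With 2-D targets coef_ is a 2-D array; with 1-D targets it is 1-D.
print(model_2d.coef_.shape, model_1d.coef_.shape)

# Extract plain Python floats from the model fit on 2-D targets.
slope = float(model_2d.coef_[0][0])
intercept = float(model_2d.intercept_[0])
print(slope, intercept)
```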

We can use this slope and intercept to predict y values given x values.

In [0]:
coffee_ = np.array([[0.34], [1.65], [2.45], [3.78], [4.56]])

energy_predict = lin_reg.predict(coffee_)
energy_predict

Plotting these values, we see that they match the predictions from our manual process.

In [0]:
plt.plot(coffee, energy, 'b.')
plt.plot(coffee_, energy_predict, 'r.')
plt.show()

And we can calculate two points and use those to draw the regression line.

In [0]:
coffee_ = np.array([[0.0], [5.0]])
energy_predict = lin_reg.predict(coffee_)

plt.plot(coffee, energy, 'b.')
plt.plot(coffee_, energy_predict, 'r-')
plt.show()

# Stochastic Gradient Descent

It is not always practical to compute the linear regression using the entire training dataset at once. For those cases, stochastic gradient descent can be used instead. In scikit-learn this is as simple as using `SGDRegressor`.

In [0]:
from sklearn.linear_model import SGDRegressor

# Create a new Stochastic Gradient Descent regressor
sgd_reg = SGDRegressor()

# Fit the model
sgd_reg.fit(coffee, energy.ravel())

# Display the slope and intercept
sgd_reg.coef_, sgd_reg.intercept_

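You might notice that the slope and intercept aren't nearly as accurate as what we were getting from the closed-form calculation. This is because `SGDRegressor` does not solve for the best-fit line directly. It nudges the slope and intercept a little after each training sample and stops after a limited number of passes over the data, so the result is only an approximation.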
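For intuition about what a stochastic gradient descent regressor does internally, here is a minimal sketch of the idea in plain NumPy. It is not scikit-learn's implementation: the learning rate, the number of passes, and the variable names are hypothetical choices for this illustration, and details such as regularization and learning-rate schedules are left out.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
x = 5 * rng.random(1000)
y = 3 * x + 4 + 2 * rng.standard_normal(1000)

slope, intercept = 0.0, 0.0
learning_rate = 0.002   # hypothetical value picked for this sketch
n_epochs = 50           # hypothetical number of passes over the data

for _ in range(n_epochs):
    # Visit the samples in a new random order on every pass.
    for i in rng.permutation(len(x)):
        error = (slope * x[i] + intercept) - y[i]
        # Gradient of 0.5 * error**2 with respect to each parameter.
        slope -= learning_rate * error * x[i]
        intercept -= learning_rate * error

print("slope: {:.2f}, intercept: {:.2f}".format(slope, intercept))
```

Because every update is based on a single noisy sample, the parameters keep jittering around the best-fit values instead of settling on them exactly.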

Let's take a look at the regression line calculated by the full linear regressor and the SGD one.

In [0]:
coffee_ = np.array([[0.0], [5.0]])

lin_predict = lin_reg.predict(coffee_)
sgd_predict = sgd_reg.predict(coffee_)

plt.plot(coffee, energy, 'b.')
plt.plot(coffee_, lin_predict, 'r-')
plt.plot(coffee_, sgd_predict, 'g-')
plt.show()
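The two lines can also be compared numerically. The sketch below refits both kinds of model on freshly generated data (so the numbers will not match the cells above) and reports the mean squared error of each on the training data. The seed and `random_state=0` are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
x = 5 * rng.random((1000, 1))
y = (3 * x + 4 + 2 * rng.standard_normal((1000, 1))).ravel()

lin_model = LinearRegression().fit(x, y)
sgd_model = SGDRegressor(random_state=0).fit(x, y)

# Lower is better. Least squares minimizes this quantity on the training
# data, so the SGD model cannot beat it here.
lin_mse = mean_squared_error(y, lin_model.predict(x))
sgd_mse = mean_squared_error(y, sgd_model.predict(x))
print("LinearRegression MSE: {:.4f}".format(lin_mse))
print("SGDRegressor MSE:     {:.4f}".format(sgd_mse))
```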

# Challenge: Regressor Parameters

The SGDRegressor has many parameters that can be tuned. Out-of-the-box, our regressor didn't do that well. Let's see if we can tune some of the parameters of the regressor to get its predicted values for the slope and intercept closer to those we predicted using the entire data set.

Check out the [SGDRegressor documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html) and look over the parameters available. Pay special attention to parameters related to learning rate and iterations over the data. See if you can tweak the parameters to get within epsilon of our calculated values below. Think about why your changes worked.

In [0]:
np.random.seed(1)

sgd_reg = SGDRegressor() # Update the parameters to SGDRegressor

# Fit the model
sgd_reg.fit(coffee, energy.ravel())

EPSILON = 0.05
if abs(calculated_slope - sgd_reg.coef_) < EPSILON and abs(calculated_intercept - sgd_reg.intercept_) < EPSILON:
  print("You win!")
else:
  print("Try again :(")

### Answer Key

**Solution**

In [0]:
np.random.seed(1)

sgd_reg = SGDRegressor(max_iter=50, tol=None, eta0=0.01)

# Fit the model
sgd_reg.fit(coffee, energy.ravel())

**Validation**

In [0]:
EPSILON = 0.05
if abs(calculated_slope - sgd_reg.coef_) >= EPSILON:
  raise Exception(f'Slope {sgd_reg.coef_} is too far from expected {calculated_slope}')
if abs(calculated_intercept - sgd_reg.intercept_) >= EPSILON:
  raise Exception(f'Intercept {sgd_reg.intercept_} is too far from expected {calculated_intercept}')
print("LGTM")

# Challenge: Mini Batching

So far in this Colab, every method that we have used has relied on the entire dataset being in memory at one time. For "big data" problems, this won't always be possible.

In the code below we have broken our data set up into small batches. In practice these batches might be loaded into memory one at a time. Each batch is fed to the SGDRegressor using the `partial_fit` method. This allows us to train the model in chunks.

Your challenge is to mini-batch train the model within the desired epsilon of our calculated slope and intercept. You'll find hints in the [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html) documentation by searching for `partial_fit`. Pay attention to parameters that don't apply to `partial_fit`.
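To illustrate the idea of batches arriving one at a time, the sketch below wraps the slicing in a generator. The helper `iter_batches` is hypothetical and simply stands in for whatever would load each batch from disk or a database in practice. It makes a single pass over freshly generated data, so it is not expected to reach the target epsilon on its own.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor


def iter_batches(x, y, batch_size):
    """Yield consecutive (x, y) slices; a stand-in for loading from disk."""
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]


rng = np.random.default_rng(0)  # arbitrary seed for this sketch
x = 5 * rng.random((1000, 1))
y = (3 * x + 4 + 2 * rng.standard_normal((1000, 1))).ravel()

model = SGDRegressor(random_state=0)
n_batches = 0
for x_batch, y_batch in iter_batches(x, y, batch_size=50):
    # Each call updates the existing coefficients instead of starting over.
    model.partial_fit(x_batch, y_batch)
    n_batches += 1

print(n_batches, model.coef_, model.intercept_)
```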

In [0]:
np.random.seed(1)

BATCH_SIZE = 50

sgd_reg = SGDRegressor()

for i in range(0, DATA_SET_SIZE, BATCH_SIZE):
  sgd_reg.partial_fit(coffee[i:i+BATCH_SIZE], energy[i:i+BATCH_SIZE].ravel())

print("Intercept: {}, Coef: {}".format(sgd_reg.intercept_, sgd_reg.coef_))

EPSILON = 0.05
if abs(calculated_slope - sgd_reg.coef_) < EPSILON and abs(calculated_intercept - sgd_reg.intercept_) < EPSILON:
  print("You win!")
else:
  print("Try again :(")

### Answer Key

**Solution**

In [0]:
np.random.seed(1)

BATCH_SIZE = 50

sgd_reg = SGDRegressor()

for _ in range(15):
  for i in range(0, DATA_SET_SIZE, BATCH_SIZE):
    sgd_reg.partial_fit(coffee[i:i+BATCH_SIZE], energy[i:i+BATCH_SIZE].ravel())

print("Intercept: {}, Coef: {}".format(sgd_reg.intercept_, sgd_reg.coef_))

**Validation**

In [0]:
EPSILON = 0.05
if abs(calculated_slope - sgd_reg.coef_) >= EPSILON:
  raise Exception(f'Slope {sgd_reg.coef_} is too far from expected {calculated_slope}')
if abs(calculated_intercept - sgd_reg.intercept_) >= EPSILON:
  raise Exception(f'Intercept {sgd_reg.intercept_} is too far from expected {calculated_intercept}')
print("LGTM")

# Exercises

For these exercises we will download a CSV of life expectancies from [GapMinder](https://www.gapminder.org/data/) and create a linear regression for life expectancy in the United States.

## Exercise 1: Obtain the data

Download a CSV of life expectancy data from [GapMinder](https://www.gapminder.org/data/), upload it to this Colab, and read the data into memory using Pandas.

Examine the data using describe.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
import pandas as pd

data = pd.read_csv('life_expectancy_years.csv')
data.describe()

**Validation**

In [0]:
import datetime
now = datetime.datetime.now()

if len(data.columns) != now.year - 1800 + 1:
  raise Exception("Unexpected column count")
if len(data) != 187:
  raise Exception("Unexpected row count")
if data.columns[0] != 'country' or data.columns[1] != '1800':
  raise Exception("Unexpected column names")
print("LGTM")

## Exercise 2: Look at the data

Examine the data using head and/or tail.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
data.head()
data.tail()

**Validation**

In [0]:
# N/A

## Exercise 3: Prep Life Expectancy

Extract the life expectancy values for the United States into a NumPy array. 

To do this you'll need to find the row that contains data for the United States. In that row you'll find the name 'United States' in the first column and floating point numbers in the subsequent columns. The goal of this step is to create a NumPy array containing those numbers.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
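# Select the United States row by name rather than by a hard-coded position.
life_expectancy = data[data['country'] == 'United States'].iloc[0, 1:].values.astype(float)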

**Validation**

In [0]:
import datetime
now = datetime.datetime.now()
wanted_row_count = now.year - 1800

if len(life_expectancy) != wanted_row_count:
  raise Exception(f'got {len(life_expectancy)} rows, wanted {wanted_row_count}')
if not all(map(lambda x: x == 39.4, life_expectancy[0:20])):
  raise Exception("Unexpected values in life_expectancy")
print('LGTM')

## Exercise 4: Create year data

We now need to create an array of year data from the min to the max year in the dataset. There are a couple of ways that this can be done.

The column names (except column 0) are the years. You can extract those years into an array similarly to what you did for life expectancy. Note that the column names are strings, so you'll want to convert those names into integers.

If no years are missing from the data set you can also just use a range function to generate numbers between the min and max years.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
import numpy as np

years = np.array(range(int(data.columns[1:].min()), int(data.columns[1:].max()) + 1))

**Validation**

In [0]:
import datetime
now = datetime.datetime.now()
wanted_row_count = now.year - 1800

if len(years) != wanted_row_count:
  raise Exception(f'got {len(years)} rows, wanted {wanted_row_count}')
if years[0] != 1800 or years[100] != 1900:
  raise Exception("Unexpected values in years")
print('LGTM')

## Exercise 5: Plot the data

Create a scatterplot of the data.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
import matplotlib.pyplot as plt

plt.plot(years, life_expectancy, 'b.')
plt.show()

**Validation**

In [0]:
# N/A

## Exercise 6: Divide the data

Split off 20% of the data as a test set and keep the rest for training data.

To do this, it will be useful to create a `DataFrame` and store the years and life expectancy arrays created above as columns in that data frame.

You can then randomize and split the data frame yourself or use scikit-learn's built-in train/test data splitter.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
df = pd.DataFrame()
df["Years"] = years
df["Life Expectancy"] = life_expectancy
df = df.iloc[np.random.permutation(len(df))]
test_set_size = int(len(df) * 0.2)

test_data = df.iloc[:test_set_size]
train_data = df.iloc[test_set_size:]

Alternative solution using sklearn

In [0]:
from sklearn.model_selection import train_test_split

# test_size=0.2 holds out 20% of the rows; the default would hold out 25%.
X_train, X_test, y_train, y_test = train_test_split(df['Years'], df['Life Expectancy'], test_size=0.2)

**Validation**

In [0]:
import datetime
now = datetime.datetime.now()
wanted_row_count = now.year - 1800

if df.columns[0] != 'Years' or df.columns[1] != 'Life Expectancy':
  raise Exception('Unexpected columns')

if len(df) != wanted_row_count:
  raise Exception('Unexpected row count')

if len(test_data) != int(wanted_row_count * .2):
  raise Exception('Unexpected test data row count')

if len(train_data) != wanted_row_count - int(wanted_row_count * .2):
  raise Exception('Unexpected train data row count')

print('LGTM')

## Exercise 7: Create and train a model

Use LinearRegression to create a model.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
from sklearn.linear_model import LinearRegression

x = train_data[["Years"]]
y = train_data["Life Expectancy"]

lin_reg = LinearRegression()
lin_reg.fit(x, y)

**Validation**

In [0]:
from sklearn.linear_model import LinearRegression

x = train_data[["Years"]]
y = train_data["Life Expectancy"]

validator_lin_reg = LinearRegression()
validator_lin_reg.fit(x, y)

if validator_lin_reg.coef_ != lin_reg.coef_:
  raise Exception('Unexpected coef_')
if validator_lin_reg.intercept_ != lin_reg.intercept_:
  raise Exception('Unexpected intercept_')

print('LGTM')

## Exercise 8: Test your model

Use the test data you retained to make predictions of life expectancy based on year. Compare the predictions to the actual data. Use scikit-learn to calculate root mean squared error.
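Root mean squared error is the square root of the mean squared error, which puts the result back in the same units as the target. The sketch below works through it on two made-up values so the arithmetic is easy to follow by hand. The arrays are illustrative only and are not taken from the life expectancy data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0])
y_pred = np.array([2.0, 7.0])

# The errors are -1 and 2, so the squared errors are 1 and 4
# and their mean is 2.5.
mse = mean_squared_error(y_true, y_pred)

# Taking the square root gives the RMSE.
rmse = np.sqrt(mse)
print(mse, rmse)
```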

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
from sklearn.metrics import mean_squared_error
import math

predictions = lin_reg.predict(test_data[["Years"]])
rmse = math.sqrt(mean_squared_error(test_data[["Life Expectancy"]], predictions))

**Validation**

In [0]:
from sklearn.metrics import mean_squared_error
import math

predictions = lin_reg.predict(test_data[["Years"]])
validator_rmse = math.sqrt(mean_squared_error(test_data[["Life Expectancy"]], predictions))

if rmse != validator_rmse:
  raise Exception('Unexpected RMSE')

print('LGTM')

## Exercise 9: Plot your regression

Create a scatter plot of the full set of life expectancy data. Draw your regression line over the scatterplot.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
import matplotlib
import matplotlib.pyplot as plt

x_line = [[df["Years"].min()], [df["Years"].max()]]
y_line = lin_reg.predict(x_line)

plt.plot(df["Years"], df["Life Expectancy"], 'b.')
plt.plot(x_line, y_line, 'r-')
plt.show()

**Validation**

In [0]:
# N/A

## Exercise 10: Challenge (Ungraded)

The [CalCOFI](https://www.kaggle.com/sohier/calcofi/version/2) dataset contains decades of oceanic data. In this exercise we will use this data to attempt to predict water temperature based on salinity. The exercise is divided into multiple steps, each with a code block after it for your solution.

**Acquire the data**

The [CalCOFI](https://www.kaggle.com/sohier/calcofi/version/2) data consists of two files, one containing data about "Casts" and the other about "Bottles". Look at the data files and try to get an understanding of what a cast is and what a bottle is.

Find the file that contains temperature and salinity information, download that file, and then upload it to Colab. You'll want to use the zipped version of the file so that the upload doesn't take too long.

Once the file is uploaded, write some Python code to unzip the file.

**Load the data using Pandas**

Now that you have an unzipped version of the file you can load the data into memory using Pandas. Write code to read the file into memory and describe the data table that was created.

**Drop rows with missing data**

Looking at the counts for temperature and salinity you can see that there are some rows with missing data. Remove the rows with missing temperature or salinity data from the data frame. After you are done, describe the data to make sure that every temperature and salinity row has data.
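As a hint, one possible approach is Pandas' `dropna` with the `subset` argument, sketched below on a tiny made-up table. The column names here are hypothetical and do not match the column names in the CalCOFI file.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "temperature": [10.5, np.nan, 12.1, 9.8],
    "salinity": [33.4, 33.5, np.nan, 33.9],
    "depth": [0, 10, 20, np.nan],
})

# Drop a row only if one of the listed columns is missing. The last row
# is kept even though its depth is missing.
cleaned = toy.dropna(subset=["temperature", "salinity"])
print(cleaned)
```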

**Plot the data**

Create a scatterplot of salinity and temperature.

**Shuffle the data**

In this exercise we will split the data into a training and test set. Since the data is ordered we will shuffle the data frame before splitting it. Write code to shuffle the data frame and look at the data (using head, tail, or some other means) to make sure that it is shuffled.
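There is more than one way to shuffle a data frame. The sketch below shows one option on a made-up table, using `DataFrame.sample` with `frac=1`. The `random_state` value is an arbitrary assumption that only makes the shuffle repeatable.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})

# frac=1 samples every row exactly once, which amounts to a shuffle.
shuffled = toy.sample(frac=1, random_state=0)
print(shuffled.head())
```

The original index labels travel with the rows, which makes it easy to confirm by eye that the order has changed.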

**Split the data frame**

For this exercise we'll split the data frame so that 20% of the data is held out for testing and the remaining data is used for training. Write code below to split the data into two data frames, one for testing and one for training.

**Create a linear regression model**

Use scikit-learn's LinearRegression to fit a linear regression model to your training data.

**Test your model**

Use your test data to make predictions and then find the mean squared error of those predictions vs. the actual measured temperatures for the test data.

scikit-learn has support for calculating mean squared error.

**Plot your regression line**

Create another plot that contains the scatterplot of the salinity and temperatures. Draw the prediction line over the scatterplot.

**Dig deeper**

The model we built wasn't very good; however, we only used one feature. Are there other features or combinations of features that are more predictive of temperature?

Measurements were recorded at different depths. Is salinity a good predictor of temperature at any depth range?

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.utils import shuffle

import matplotlib.pyplot as plt
import pandas as pd
import zipfile

with zipfile.ZipFile("./bottle.csv.zip", 'r') as zip_ref:
  zip_ref.extractall("./")

data = pd.read_csv("./bottle.csv")
data.describe()

data = data[~(data["T_degC"].isna() | data["Salnty"].isna())]
data.describe()

plt.plot(data["Salnty"], data["T_degC"], 'b.')
plt.show()

data = shuffle(data)
data.head()

test_set_size = int(data["Salnty"].count() * .2)

test_data = data[:test_set_size]
train_data = data[test_set_size:]

print("{} test data points; {} train data points".format(test_data["Salnty"].count(), train_data["Salnty"].count()))

x = train_data[["Salnty"]]
y = train_data["T_degC"]

lin_reg = LinearRegression()
lin_reg.fit(x, y)

predictions = lin_reg.predict(test_data[["Salnty"]])
mean_squared_error(test_data[["T_degC"]], predictions)

x_line = [[data["Salnty"].min()], [data["Salnty"].max()]]
y_line = lin_reg.predict(x_line)

plt.plot(data["Salnty"], data["T_degC"], 'b.')
plt.plot(x_line, y_line, 'r-')
plt.show()


**Validation**

In [0]:
# N/A