<a href="https://colab.research.google.com/github/mithamokelvinm/Foundations_Of_Data_Science_For_Machine_Learning/blob/main/Build_classical_machine_learning_models_with_supervised_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Build classical machine learning models with supervised learning 

### Introduction


This module explores a process called supervised learning, in which machine learning models learn from examples.

By understanding supervised learning, we'll start a deeper dive into the individual components of the learning process, and exactly how this process can improve a model. Through examples, we'll also explore how setting up this learning process correctly is critical to achieving a high-performance model.

Throughout this module, we'll use the following scenario to explain the process of supervised learning. This scenario provides an example for how you might meet these concepts while you're programming.

Your family has managed Washington State's longest-running elk farm for several generations, but the health of your herd has slowly worsened for decades. It's well known that your farm's breed of elk should not be fed grain when nightly temperatures average above freezing (32°F or 0°C). For that reason, you've always followed your grandfather's farming calendar and switched from grain feed after January 31.

You've recently read about climate change affecting farming practices. Could this explain the poorer health of elk in recent years? With some historical weather data at your side, you seek to determine whether local temperatures have changed from your grandfather's day, and whether your farming calendar needs to be updated.

**Prerequisites**

You should have a basic familiarity with inputs, outputs, and models.

**Learning objectives**

In this module, you will:

- Define supervised and unsupervised learning.
- Explore how cost functions affect the learning process.
- Discover how models are optimized by gradient descent.
- Experiment with learning rates, and see how they can affect training.

### Define supervised learning

The process of training a model can be either supervised or unsupervised. Our goal here is to contrast these approaches and then take a deeper dive into the learning process, with a focus on supervised learning. It's worth remembering throughout this discussion that the only difference between supervised and unsupervised learning is how the objective function works.

**What is unsupervised learning?**

In unsupervised learning, we train a model to solve a problem without us knowing the correct answer. In fact, unsupervised learning is typically used for problems where there isn't one correct answer, but instead, better and worse solutions.

Imagine that we want our machine learning model to draw realistic pictures of avalanche rescue dogs. There isn't one "correct" drawing to draw. As long as the image looks somewhat like a dog, we'll be satisfied. But if the produced image is of a cat, that's a worse solution.

Recall that training requires several components:

Diagram of the model and objective function parts of the machine learning lifecycle.

In unsupervised learning, the objective function makes its judgment purely on the model's estimate. That means the objective function often needs to be relatively sophisticated. For example, the objective function might need to contain a "dog detector" to assess if images that the model draws look realistic. The only data that we need for unsupervised learning is about features that we provide to the model.

**What is supervised learning?**

Supervised learning can be thought of as learning by example. In supervised learning, we assess the model's performance by comparing its estimates to the correct answer. Although we can have very simple objective functions, we need both:

- Features that are provided as inputs to the model
- Labels, which are the correct answers that we want the model to be able to produce

Diagram of the model and objective function parts of the machine learning lifecycle, with labels.

For example, consider our desire to predict what the temperature will be on January 31 of a given year. For this prediction, we'll need data with two components:

**Feature:** Date

**Label:** Daily temperature (for example, from historical records)

In the scenario, we provide the date feature to the model. The model predicts the temperature, and we compare this result to the dataset's "correct" temperature. The objective function can then calculate how well the model worked, and we can make adjustments to the model.

**Labels are only for learning**

It's important to remember that no matter how models are trained, they only process features. During supervised learning, the objective function is the only component that relies on access to labels. After training, we don't need labels to use our model.
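
As a minimal sketch of this point, the snippet below uses made-up numbers and hypothetical names (`predict`, `cost`, and the parameter values are assumptions for illustration, not part of the exercise that follows). Notice that only the cost calculation ever touches the labels.

```python
import numpy as np

# Hypothetical, made-up data: the feature is the year, the label is a temperature (°F)
years = np.array([1950.0, 1980.0, 2010.0])      # features
recorded_temps = np.array([30.0, 33.0, 36.0])   # labels

def predict(year, slope=0.1, intercept=-165.0):
    """The model sees only the feature (year), never the label."""
    return slope * year + intercept

def cost(estimates, labels):
    """The objective function is the only place where labels are used."""
    return np.sum((estimates - labels) ** 2)

estimates = predict(years)
training_cost = cost(estimates, recorded_temps)

# After training, we can estimate a year for which we have no label at all
estimate_2030 = predict(2030.0)
print(f"Training cost: {training_cost:.3f}, estimate for 2030: {estimate_2030:.1f}")
```

Once the parameters are fixed, `predict` keeps working for any year, with or without a recorded temperature to compare against.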

### Implement supervised learning


**Exercise: Supervised learning**

Recall our farming scenario, in which we want to look at how January temperatures have changed over time. Now we'll build a model that achieves this by using supervised learning.

With many libraries, we can build a model in only a few lines of code. Here, we'll break down the process into steps so that we can explore how things work.

**Four components**

Recall that there are four key components to supervised learning: the data, the model, the cost function, and the optimizer. Let's inspect these one at a time.

**1. The data**

We'll use publicly available weather data for Seattle. Let's load that and restrict it to January temperatures.

In [7]:
import pandas
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/m0b_optimizer.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/seattleWeather_1948-2017.csv

# Load a file that contains weather data for Seattle
data = pandas.read_csv('seattleWeather_1948-2017.csv', parse_dates=['date'])

# Keep only January temperatures
data = data[[d.month == 1 for d in data.date]].copy()


# Print the first and last few rows
# Remember that with Jupyter notebooks, the last line of 
# code is automatically printed
data


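| Unnamed: 0 | date | amount_of_precipitation | max_temperature | min_temperature | rain |
| --- | --- | --- | --- | --- | --- |
| 0 | 1948-01-01 | 0.47 | 51 | 42 | True |
| 1 | 1948-01-02 | 0.59 | 45 | 36 | True |
| 2 | 1948-01-03 | 0.42 | 45 | 35 | True |
| 3 | 1948-01-04 | 0.31 | 45 | 34 | True |
| 4 | 1948-01-05 | 0.17 | 45 | 32 | True |
| ... | ... | ... | ... | ... | ... |
| 25229 | 2017-01-27 | 0.00 | 54 | 37 | False |
| 25230 | 2017-01-28 | 0.00 | 52 | 37 | False |
| 25231 | 2017-01-29 | 0.03 | 48 | 37 | True |
| 25232 | 2017-01-30 | 0.02 | 45 | 40 | True |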


We have data from 1948 to 2017, split across 2,170 rows.

We'll analyze the relationship between date and daily minimum temperatures. Let's take a quick look at our data as a graph.

In [8]:
import graphing # Custom graphing code. See our GitHub repository for details

# Let's take a quick look at our data
graphing.scatter_2D(data, label_x="date", label_y="min_temperature", title="January Temperatures (°F)")

Machine learning usually works best when the X and Y axes have roughly the same range of values. We'll cover why in later learning material. For now, let's just scale our data slightly.

In [9]:
import numpy as np

# This block of code scales and offsets the data slightly, which helps the training process
# You don't need to understand this code. We'll cover these concepts in later learning material

# Offset date into number of years since 1982
data["years_since_1982"] = [(d.year + d.timetuple().tm_yday / 365.25) - 1982 for d in data.date]

# Scale and offset temperature so that it has a smaller range of values
data["normalised_temperature"] = (data["min_temperature"] - np.mean(data["min_temperature"])) / np.std(data["min_temperature"])

# Graph
graphing.scatter_2D(data, label_x="years_since_1982", label_y="normalised_temperature", title="January Temperatures (Normalised)")
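
If you're curious what the temperature scaling does, here's a small sketch on made-up numbers (the `temperatures` array is an assumption, not the Seattle data). Subtracting the mean and dividing by the standard deviation leaves values centered on zero with a spread of one.

```python
import numpy as np

# Made-up minimum temperatures in °F (an assumption, not the Seattle data)
temperatures = np.array([28.0, 31.0, 35.0, 40.0, 46.0])

# Subtract the mean and divide by the standard deviation
normalised = (temperatures - np.mean(temperatures)) / np.std(temperatures)

print("Scaled values:", np.round(normalised, 3))
print("Mean after scaling is close to 0:", abs(np.mean(normalised)) < 1e-9)
print("Standard deviation after scaling is close to 1:", abs(np.std(normalised) - 1) < 1e-9)
```

The shape of the data is unchanged; only the units on the axis differ, which is why the scaled graph looks the same as the original.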

**2. The model**

We'll select a simple linear regression model. This model uses a line to make estimates. You might have come across trendlines like these before when making graphs.

In [11]:
class MyModel:

    def __init__(self):
        '''
        Creates a new MyModel
        '''
        # A straight line is described by two parameters:
        # The slope is the steepness of the line
        self.slope = 0
        # The intercept moves the line up or down
        self.intercept = 0

    def predict(self, date):
        '''
        Estimates the temperature from the date
        '''
        return date * self.slope + self.intercept

# Create our model ready to be trained
model = MyModel()

print("Model made!")

Model made!


We wouldn't normally use a model before it has been trained, but for the sake of learning, let's take a quick look at it.

In [12]:
print(f"Model parameters before training: {model.intercept}, {model.slope}")

# Look at how well the model does before training
print("Model visualised before training:")
graphing.scatter_2D(data, "years_since_1982", "normalised_temperature", trendline=model.predict) 

Model parameters before training: 0, 0
Model visualised before training:


You can see that before training, our model (the red line) isn't useful at all. It always simply predicts zero.

**3. The cost (objective) function**

Our next step is to create a cost function (objective function).

These functions in supervised learning compare the model's estimate to the correct answer. In our case, our label is temperature, so our cost function will compare the estimated temperature to temperatures seen in the historical records.

In [13]:
def cost_function(actual_temperatures, estimated_temperatures):
    '''
    Calculates the difference between actual and estimated temperatures
    Returns the difference, and also returns the squared difference (the cost)

    actual_temperatures: One or more temperatures recorded in the past
    estimated_temperatures: Corresponding temperature(s) estimated by the model
    '''

    # Calculate the difference between actual temperatures and those
    # estimated by the model
    difference = estimated_temperatures - actual_temperatures

    # Convert to a single number that tells us how well the model did
    # (smaller numbers are better)
    cost = sum(difference ** 2)

    return difference, cost

**4. The optimizer**

The role of the optimizer is to guess new parameter values for the model.

We haven't covered optimizers in detail yet, so to make things simple, we'll use a prewritten optimizer. You don't need to understand how this works, but if you're curious, you can find it in our GitHub repository.

In [14]:
from m0b_optimizer import MyOptimizer

# Create an optimizer
optimizer = MyOptimizer()

**The training loop**

Let's put these components together so that they train the model.

First, let's make a function that performs one iteration of training. Read each step carefully in the following code. If you want, add some print() statements inside the function to help you see the training in action.

In [15]:
def train_one_iteration(model_inputs, true_temperatures, last_cost:float):
    '''
    Runs a single iteration of training.


    model_inputs: One or more dates to provide to the model
    true_temperatures: Corresponding temperatures known to occur on those dates
    last_cost: The cost from the previous iteration, used to decide when to stop

    Returns:
        A Boolean, as to whether training should continue
        The cost calculated (small numbers are better)
    '''

    # === USE THE MODEL ===
    # Estimate temperatures for all data that we have
    estimated_temperatures = model.predict(model_inputs)

    # === OBJECTIVE FUNCTION ===
    # Calculate how well the model is working
    # Smaller numbers are better 
    difference, cost = cost_function(true_temperatures, estimated_temperatures)

    # Decide whether to keep training
    # We'll stop if the training is no longer improving the model effectively
    if cost >= last_cost:
        # Stop training
        return False, cost
    else:
        # === OPTIMIZER ===
        # Calculate updates to parameters
        intercept_update, slope_update = optimizer.get_parameter_updates(model_inputs, cost, difference)

        # Change the model parameters
        model.slope += slope_update
        model.intercept += intercept_update

        return True, cost

print("Training method ready")

Training method ready


Let's run a few iterations manually, so that you can watch how training works.

Run the following code several times, and note how the model changes.

In [16]:
import math

print(f"Model parameters before training:\t\t{model.intercept:.8f},\t{model.slope:.8f}")

continue_loop, cost = train_one_iteration(model_inputs = data["years_since_1982"],
                                          true_temperatures = data["normalised_temperature"],
                                          last_cost = math.inf)

print(f"Model parameters after 1 iteration of training:\t{model.intercept:.8f},\t{model.slope:.8f}")


Model parameters before training:		0.00000000,	0.00000000
Model parameters after 1 iteration of training:	0.00000000,	0.01006832


It will take thousands of iterations to train the model well, so let's wrap it in a loop.

In [17]:
# Start the loop
print("Training beginning...")
last_cost = math.inf
i = 0
continue_loop = True
while continue_loop:

    # Run one iteration of training
    # This will tell us whether to stop training, and also what
    # the cost was for this iteration
    continue_loop, last_cost = train_one_iteration(model_inputs = data["years_since_1982"],
                                                    true_temperatures = data["normalised_temperature"],
                                                    last_cost = last_cost)
   
    # Print the status
    if i % 400 == 0:
        print("Iteration:", i)

    i += 1

    
print("Training complete!")
print(f"Model parameters after training:\t{model.intercept:.8f},\t{model.slope:.8f}")
graphing.scatter_2D(data, "years_since_1982", "normalised_temperature", trendline=model.predict)    

Training beginning...
Iteration: 0
Iteration: 400
Iteration: 800
Iteration: 1200
Iteration: 1600
Iteration: 2000
Iteration: 2400
Iteration: 2800
Iteration: 3200
Iteration: 3600
Iteration: 4000
Training complete!
Model parameters after training:	-0.00648846,	0.01193327


Notice that the model is now trained and gives more sensible predictions about January temperatures.

Interestingly, the model shows temperatures going up over time. Perhaps we need to stop feeding grain to our elk earlier in the year!

**Summary**

In this exercise, we split up supervised learning into its individual stages to see what's going on in code when we use third-party libraries. The important point to take away is how these pieces fit together. Note that most parts of this process require data.
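
To see the four pieces fit together without any helper files, here's a minimal, self-contained sketch. It uses synthetic data, and its optimizer (`get_parameter_updates`) is a hypothetical plain gradient-descent step that we've assumed for illustration; it isn't necessarily how `MyOptimizer` works.

```python
import numpy as np

# 1. The data: synthetic points around a known line (an assumption for illustration)
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)                       # feature
y = 0.5 * x + 1.0 + rng.normal(0, 0.1, x.size)   # label

# 2. The model: a straight line with two parameters
params = {"slope": 0.0, "intercept": 0.0}

def predict(inputs):
    return inputs * params["slope"] + params["intercept"]

# 3. The cost function: sum of squared differences
def cost_function(actual, estimated):
    difference = estimated - actual
    return difference, np.sum(difference ** 2)

# 4. The optimizer: a hypothetical plain gradient-descent step,
#    not the MyOptimizer class used in the exercise
def get_parameter_updates(inputs, difference, learning_rate=0.001):
    slope_gradient = np.sum(2 * difference * inputs)
    intercept_gradient = np.sum(2 * difference)
    return -learning_rate * intercept_gradient, -learning_rate * slope_gradient

# The training loop: stop when the cost no longer improves
last_cost = np.inf
for iteration in range(10_000):
    difference, cost = cost_function(y, predict(x))
    if cost >= last_cost:
        break
    last_cost = cost
    intercept_update, slope_update = get_parameter_updates(x, difference)
    params["intercept"] += intercept_update
    params["slope"] += slope_update

print(f"slope={params['slope']:.3f}, intercept={params['intercept']:.3f}")
```

The fitted slope and intercept should land close to the 0.5 and 1.0 used to generate the data, with small differences caused by the added noise.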

### Minimize model errors with cost functions

The learning process repeatedly alters a model until it can make high-quality estimates. To determine how well a model is performing, the learning process uses mathematics in the form of a cost function (also known as an objective function). To understand what a cost function is, let's break it down a little.

**Error, cost, and loss**

In supervised learning, error, cost, and loss all refer to how large a model's mistakes are when it predicts one or more labels.

These three terms are used somewhat loosely in machine learning, which can cause some confusion. For the sake of simplicity, we'll use them interchangeably here. Cost is calculated through mathematics; it isn't a qualitative judgment. For example, if a model predicts that a daily temperature will be 40°F, but the actual value is 35°F, we might say it has an error of 5°F.

**Minimizing cost is our goal**

Because cost indicates how badly a model works, our goal is to have zero cost. In other words, we want to train the model to make no mistakes at all. This idea is often impossible, though, so instead we set a slightly more nebulous goal of training the model to have the lowest cost possible.

Because of this goal, how we calculate cost dictates what the model will try to learn. In the preceding example, we defined cost as the error in estimating temperature.

**What is a cost function?**

In supervised learning, a cost function is a small piece of code that calculates cost from a model's prediction and the expected label—the correct answer. For example, in our previous exercise we calculated cost by calculating the prediction errors, squaring them, and summing them.
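
Written as a formula, that calculation looks like the following. The notation is our own, introduced for illustration: $\hat{y}_i$ is the model's estimate for datapoint $i$, $y_i$ is its label, and $n$ is the number of datapoints.

```latex
\text{cost} = \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2
```

For example, if the model estimates 40°F and 33°F for two days that were both actually 35°F, the cost is $(40 - 35)^2 + (33 - 35)^2 = 25 + 4 = 29$.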

After the cost function has calculated cost, we know whether the model is performing well or not. If it's performing well, we might choose to stop training. If not, we can pass cost information to the optimizer, which uses this information to select new parameters for the model.

Diagram of the machine learning lifecycle with labels, but without features.

During training, different cost functions can change how long training takes, or how well it works. For example, if the cost function always states that errors are small, the optimizer will make only small changes to the model. As another example, if the cost function returns very large values when certain mistakes are made, the optimizer will make changes to the model so that it doesn't make these kinds of mistakes.

There isn't a one-size-fits-all cost function. Which one is best depends on what we're trying to achieve. We often need to experiment with cost functions to get a result we're happy with. In the next exercise, we'll do this experiment.
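
As a rough illustration of how the choice of cost function encodes what we care about, the sketch below compares the sum of squared differences with a hypothetical cost, `underestimate_heavy_cost`, that we've invented for this example. It counts underestimates ten times more heavily than overestimates.

```python
import numpy as np

def sum_of_squared_differences(estimate, actual):
    return np.sum((estimate - actual) ** 2)

def underestimate_heavy_cost(estimate, actual, penalty=10.0):
    """Hypothetical cost: squared error, but underestimates count `penalty` times more."""
    difference = estimate - actual
    weights = np.where(difference < 0, penalty, 1.0)
    return np.sum(weights * difference ** 2)

actual = np.array([35.0, 35.0])
too_high = np.array([37.0, 37.0])   # overestimates by 2°F
too_low = np.array([33.0, 33.0])    # underestimates by 2°F

print("Squared differences:", sum_of_squared_differences(too_high, actual),
      sum_of_squared_differences(too_low, actual))
print("Underestimate-heavy:", underestimate_heavy_cost(too_high, actual),
      underestimate_heavy_cost(too_low, actual))
```

The first cost treats both sets of estimates as equally bad. The second reports a much larger cost for the underestimates, so an optimizer that minimizes it would be pushed toward models that err on the high side.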



### Optimize a model by using cost functions

**Exercise: Supervised learning by using different cost functions**

In this exercise, we'll have a deeper look at how cost functions can change:

- How well models appear to have fit data.
- The kinds of relationships a model represents.

**Loading the data**

Let's start by loading the data. To make this exercise simpler, we'll use only a few datapoints this time.

In [18]:
import pandas
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/microsoft_custom_linear_regressor.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/seattleWeather_1948-2017.csv
from datetime import datetime

# Load a file that contains our weather data
dataset = pandas.read_csv('seattleWeather_1948-2017.csv', parse_dates=['date'])

# Convert the dates into numbers so we can use them in our models
# We make a year column that can contain fractions. For example,
# 1948.5 is halfway through the year 1948
dataset["year"] = [(d.year + d.timetuple().tm_yday / 365.25) for d in dataset.date]


# For the sake of this exercise, let's look at February 1 for the following years:
desired_dates = [
    datetime(1950,2,1),
    datetime(1960,2,1),
    datetime(1970,2,1),
    datetime(1980,2,1),
    datetime(1990,2,1),
    datetime(2000,2,1),
    datetime(2010,2,1),
    datetime(2017,2,1),
]

dataset = dataset[dataset.date.isin(desired_dates)].copy()

# Print the dataset
dataset



| Unnamed: 0 | date | amount_of_precipitation | max_temperature | min_temperature | rain | year |
| --- | --- | --- | --- | --- | --- | --- |
| 762 | 1950-02-01 | 0.0 | 27 | 1 | False | 1950.087611 |
| 4414 | 1960-02-01 | 0.15 | 52 | 44 | True | 1960.087611 |
| 8067 | 1970-02-01 | 0.0 | 50 | 42 | False | 1970.087611 |
| 11719 | 1980-02-01 | 0.37 | 54 | 36 | True | 1980.087611 |
| 15372 | 1990-02-01 | 0.08 | 45 | 37 | True | 1990.087611 |
| 19024 | 2000-02-01 | 1.34 | 49 | 41 | True | 2000.087611 |
| 22677 | 2010-02-01 | 0.08 | 49 | 40 | True | 2010.087611 |
| 25234 | 2017-02-01 | 0.0 | 43 | 29 | False | 2017.087611 |


**Comparing two cost functions**

Let's compare two common cost functions: the sum of squared differences (SSD) and the sum of absolute differences (SAD). They both calculate the difference between each predicted value and the expected value. The distinction is simply:

- SSD squares that difference and sums the result.
- SAD converts differences into absolute differences and then sums them.

To see these cost functions in action, we need to first implement them:

In [19]:
import numpy

def sum_of_square_differences(estimate, actual):
    # Note that with NumPy, to square each value we use **
    return numpy.sum((estimate - actual)**2)

def sum_of_absolute_differences(estimate, actual):
    return numpy.sum(numpy.abs(estimate - actual))

They're very similar. How do they behave? Let's test with some fake model estimates.

Let's say that the correct answers are 1 and 3, but the model estimates 2 and 2:

In [20]:
actual_label = numpy.array([1, 3])
model_estimate = numpy.array([2, 2])

print("SSD:", sum_of_square_differences(model_estimate, actual_label))
print("SAD:", sum_of_absolute_differences(model_estimate, actual_label))

SSD: 2
SAD: 2


We have an error of 1 for each estimate, and both methods have returned the same error.

What happens if we distribute these errors differently? Let's pretend that we estimated the first value perfectly but were off by 2 for the second value:

In [21]:
actual_label = numpy.array([1, 3])
model_estimate = numpy.array([1, 1])

print("SSD:", sum_of_square_differences(model_estimate, actual_label))
print("SAD:", sum_of_absolute_differences(model_estimate, actual_label))

SSD: 4
SAD: 2


SAD has calculated the same cost as before, because the total absolute error is still the same (1 + 1 = 0 + 2). According to SAD, the first and second set of estimates were equally good.

By contrast, SSD has given a higher (worse) cost for the second set of estimates ($1^2 + 1^2 < 0^2 + 2^2$). When we use SSD, we encourage models to be both accurate and consistent in their accuracy.

**Differences in action**

Let's compare how our two cost functions affect model fitting.

First, fit a model by using the SSD cost function:

In [22]:
from microsoft_custom_linear_regressor import MicrosoftCustomLinearRegressor
import graphing

# Create and fit the model
# We use a custom object that we've hidden from this notebook, because
# you don't need to understand its details. This fits a linear model
# by using a provided cost function

# Fit a model by using sum of square differences
model = MicrosoftCustomLinearRegressor().fit(X = dataset.year, 
                                             y = dataset.min_temperature, 
                                             cost_function = sum_of_square_differences)

# Graph the model
graphing.scatter_2D(dataset, 
                    label_x="year", 
                    label_y="min_temperature", 
                    trendline=model.predict)

Our SSD method normally does well, but here it did a poor job. The line is far from the recorded values for many of the years. Why? Notice that the datapoint at the lower left doesn't seem to follow the trend of the other datapoints. 1950 was a very cold winter in Seattle, and this datapoint is strongly influencing our final model (the blue line). What happens if we change the cost function?

**Sum of absolute differences**

Let's repeat what we've just done, but using SAD.

In [23]:
# Fit a model by using sum of absolute differences
model = MicrosoftCustomLinearRegressor().fit(X = dataset.year, 
                                             y = dataset.min_temperature, 
                                             cost_function = sum_of_absolute_differences)

# Graph the model
graphing.scatter_2D(dataset, 
                    label_x="year", 
                    label_y="min_temperature", 
                    trendline=model.predict)

It's clear that this line passes through the majority of points much better than before, at the expense of almost ignoring the measurement taken in 1950.

In our farming scenario, we're interested in how average temperatures are changing over time. We don't have much interest in 1950 specifically, so for us, this is a better result. In other situations, of course, we might consider this result worse.
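
One way to build intuition for this difference is to strip the model down to a single constant estimate. That's a simplification we're assuming here; it's not what the regressor above does. For a constant, the SSD is minimized by the mean of the labels and the SAD is minimized by their median. The sketch below checks this by brute force on the eight February 1 minimum temperatures from the table above.

```python
import numpy as np

# Minimum temperatures (°F) from the eight February 1 datapoints
temps = np.array([1, 44, 42, 36, 37, 41, 40, 29], dtype=float)

# Try many candidate constant estimates and keep the cheapest under each cost
candidates = np.arange(0, 50.25, 0.25)
ssd = np.array([np.sum((c - temps) ** 2) for c in candidates])
sad = np.array([np.sum(np.abs(c - temps)) for c in candidates])

best_ssd = candidates[np.argmin(ssd)]
best_sad = candidates[np.argmin(sad)]

print("Best constant under SSD:", best_ssd, "| mean of labels:", np.mean(temps))
print("Best constant under SAD:", best_sad, "| median of labels:", np.median(temps))
```

The 1°F reading from 1950 drags the SSD answer down to the mean of 33.75°F, while any value between 37°F and 40°F is equally good under SAD. The same effect is at work when a sloped line is fitted.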

**Summary**

In this exercise, you learned about how changing the cost function that's used during fitting can result in different final results.

You also learned that this behavior happens because each cost function defines what the "best" fit means. From a data analyst's point of view, though, there can be drawbacks no matter which cost function is chosen.

### Optimize models by using gradient descent

We've seen how cost functions evaluate how well models perform by using data. The optimizer is the final piece of the puzzle.

The role of the optimizer is to alter the model in a way that improves its performance. It does this by inspecting the model outputs and cost and suggesting new parameters for the model.

For example, in our farming scenario, our linear model has two parameters: the line's intercept and the line's slope. If the intercept of the line is wrong, the model will underestimate or overestimate temperatures on average. If the slope is set wrong, the model won't do a good job of demonstrating how temperatures have changed since the 1950s. The optimizer changes these two parameters so that they do an optimal job of modeling temperatures over time.

Diagram that shows the optimizer part of the machine learning lifecycle.

**Gradient descent**

The most common optimization algorithm today is gradient descent. Several variants of this algorithm exist, but they all use the same core concepts.

Gradient descent uses calculus to estimate how changing each parameter will change the cost. For example, increasing a parameter might be predicted to reduce the cost.

Gradient descent is named as such because it calculates the gradient (slope) of the relationship between each model parameter and the cost. The parameters are then altered to move down this slope.

This algorithm is simple and powerful, yet it isn't guaranteed to find the optimal model parameters that minimize the cost. The two main sources of error are local minima and instability.
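
Here's a minimal sketch of the idea for a single parameter. The cost curve is a hypothetical one that we've chosen so that its minimum sits at a parameter value of 5, and the learning rate and starting point are likewise assumptions.

```python
def cost(parameter):
    # Hypothetical cost curve with its minimum at parameter = 5
    return (parameter - 5) ** 2

def gradient(parameter):
    # Derivative of the cost with respect to the parameter
    return 2 * (parameter - 5)

parameter = 0.0
learning_rate = 0.1

for step in range(100):
    # Move the parameter a small step down the slope
    parameter -= learning_rate * gradient(parameter)

print(f"parameter={parameter:.4f}, cost={cost(parameter):.6f}")
```

Each step moves the parameter in the direction that lowers the cost, and the steps shrink as the slope flattens out near the minimum.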

**Local minima**

Our previous example appeared to do a good job, assuming that the cost keeps increasing when the parameter is smaller than 0 or greater than 10:

Plot of cost versus model parameter, with a minimum for cost when the model parameter is 5.

It wouldn't have been such a good result if parameters smaller than 0 or larger than 10 had resulted in lower costs, as in this image:

Plot of cost versus model parameter, with a local minimum for cost when the model parameter is 5 but a lower cost when the model parameter is at negative 6.

In the preceding graph, a parameter value of -7 would have been a better solution than 5 because it has a lower cost. Gradient descent doesn't know the full relationship between each parameter and the cost (represented by the dotted line) in advance. So it's prone to finding local minima: parameter values that aren't the best solution but where the gradient is zero.
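
To make this concrete, here's a small sketch with a hypothetical cost curve that has two dips. The curve, the starting points, and the learning rate are all assumptions chosen for illustration; they don't reproduce the graphs described above.

```python
def cost(p):
    # Hypothetical cost curve with two dips: a shallow one near p = 1
    # and a deeper one near p = -1
    return (p ** 2 - 1) ** 2 + 0.3 * p

def gradient(p):
    # Derivative of the cost with respect to p
    return 4 * p * (p ** 2 - 1) + 0.3

def descend(start, learning_rate=0.01, steps=2000):
    p = start
    for _ in range(steps):
        p -= learning_rate * gradient(p)
    return p

from_right = descend(start=2.0)
from_left = descend(start=-2.0)

print(f"Started at  2.0: p={from_right:.3f}, cost={cost(from_right):.3f}")
print(f"Started at -2.0: p={from_left:.3f}, cost={cost(from_left):.3f}")
```

Both runs stop where the gradient is zero, but the run that starts on the right settles in the shallower dip and ends with the higher cost. Gradient descent has no way of knowing that a better solution exists on the other side of the hill.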

**Instability**

A related issue is that gradient descent sometimes shows instability. This instability usually occurs when the step size or learning rate—the amount that each parameter is adjusted by each iteration—is too large. The parameters are then adjusted too far on each step, and the model actually gets worse with each iteration:

Plot of cost versus model parameter, which shows cost moving in large steps with minimal decrease in cost.

Having a slower learning rate can solve this problem but might also introduce issues. First, slower learning rates can mean training takes a long time, because more steps are required. Second, taking smaller steps makes it more likely that training settles on a local minimum:

Plot of cost versus model parameter, showing small movements in cost.

By contrast, a faster learning rate can make it easier to avoid hitting local minima, because larger steps can skip over local maxima:

Plot of cost versus model parameter, with regular movements in cost until a minimum is reached.

As we'll see in the next exercise, for each problem, there's an optimal step size. Finding this optimum is something that often requires experimentation.
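
As a quick preview, the sketch below runs the same number of gradient descent steps with three learning rates. It uses a hypothetical cost of `(parameter - 5) ** 2`; the curve and the three rates are assumptions picked to show each behavior.

```python
def gradient(parameter):
    # Gradient of the hypothetical cost (parameter - 5) ** 2
    return 2 * (parameter - 5)

def run(learning_rate, steps=50, start=0.0):
    parameter = start
    for _ in range(steps):
        parameter -= learning_rate * gradient(parameter)
    return parameter

too_small = run(learning_rate=0.001)   # barely moves in 50 steps
reasonable = run(learning_rate=0.1)    # ends close to 5
too_large = run(learning_rate=1.1)     # overshoots further on every step

print(f"learning_rate=0.001 -> {too_small:.3f}")
print(f"learning_rate=0.1   -> {reasonable:.3f}")
print(f"learning_rate=1.1   -> {too_large:.3f}")
```

With the smallest rate, the parameter is still far from 5 after 50 steps. With the largest, each step jumps past the minimum and lands further away than before, so the estimate gets worse instead of better.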



### Implement gradient descent

**Exercise: Gradient descent**

Previously, we identified trends in winter temperatures by fitting a linear regression model to weather data. Here, we'll repeat this process by focusing on the optimizer. Specifically, we'll work with batch gradient descent and explore how changing the learning rate can alter its behavior.

The model we'll be working with will be the same linear regression model that we've used in other units. The principles we learn, however, also apply to much more complex models.

**Loading data and preparing our model**

Let's load up our weather data from Seattle, keep only the first half of each year (January through June), and make slight adjustments so that the dates are mathematically interpretable.



In [24]:
from datetime import datetime
import pandas
!pip install statsmodels
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/seattleWeather_1948-2017.csv
import graphing # Custom graphing code. See our GitHub repository

# Load a file that contains weather data for Seattle
data = pandas.read_csv('seattleWeather_1948-2017.csv', parse_dates=['date'])

# Remove all dates from July 1 onward, because we have to plant onions before summer begins
data = data[[d.month < 7 for d in data.date]].copy()


# Convert the dates into numbers so we can use them in our models
# We make a year column that can contain fractions. For example,
# 1948.5 is halfway through the year 1948
data["year"] = [(d.year + d.timetuple().tm_yday / 365.25) for d in data.date]

# Let's take a quick look at our data
print("Visual Check:")
graphing.scatter_2D(data, 
                    label_x="year", 
                    label_y="min_temperature",
                    title="Temperatures over time (°F)")



**Fitting a model automatically**

Let's fit a line to this data by using an existing library.

In [26]:
import statsmodels.formula.api as smf

# Perform linear regression to fit a line to our data
# NB OLS uses the sum or mean of squared differences as a cost function,
# which we're familiar with from our last exercise 
model = smf.ols(formula = "min_temperature ~ year", data = data).fit()

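# Print the model
# The formula API labels each parameter, so we look them up by name
# rather than by position
intercept = model.params["Intercept"]
slope = model.params["year"]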

print(f"The model is: y = {slope:0.3f} * X + {intercept:0.3f}")

The model is: y = 0.063 * X + -83.073


Ooh, some math! Don't let that bother you. It's quite common for labels and features to be referred to as Y and X, respectively. Here:

Y is temperature (°F).

X is year.

-83 is a model parameter that acts as the line offset.

0.063 is a model parameter that defines the line slope (in °F per year).

So this little equation states that the model estimates temperature by multiplying the year by 0.063 and then subtracting 83.
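To make the equation concrete, here is a minimal sketch that plugs a few years into it. It assumes the rounded parameters printed above (0.063 and -83.073); the names `rounded_slope`, `rounded_intercept`, and `estimate_temperature` are introduced only for this illustration.

```python
# Rounded parameters copied from the printed model above (assumed values)
rounded_slope = 0.063        # °F per year
rounded_intercept = -83.073  # °F

def estimate_temperature(year):
    """Estimate temperature (°F) for a given year by using the fitted line."""
    return rounded_slope * year + rounded_intercept

# Compare the estimate for an early, a middle, and a late year
for year in (1950, 2000, 2017):
    print(f"{year}: {estimate_temperature(year):0.1f} °F")
```

Because the slope is positive, later years produce higher estimates, which is the warming trend the line describes.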

How did the library calculate these values? Let's go through the process.

**Model selection**

The first step is always selecting a model. Let's reuse the model that we used in previous exercises.

In [27]:
class MyModel:

    def __init__(self):
        '''
        Creates a new MyModel
        '''
        # Straight lines are described by two parameters:
        # The slope sets how steep the line is
        self.slope = 0
        # The intercept moves the line up or down
        self.intercept = 0

    def predict(self, date):
        '''
        Estimates the temperature from the date
        '''
        return date * self.slope + self.intercept

    def get_summary(self):
        '''
        Returns a string that summarises the model
        '''
        return f"y = {self.slope} * x + {self.intercept}"

print("Model class ready")

Model class ready


**Fitting our model with gradient descent**

The library fit the line with ordinary least squares (OLS), which is the standard way to fit a line. OLS uses the mean (or sum) of squared differences as a cost function. (Recall our experimentation with the sum of squared differences in the last exercise.) Let's replicate the preceding line fitting, and break down each step so we can watch it in action.

Recall that for each iteration, our training conducts three steps:

Estimation of Y (temperature) from X (year).

Calculation of the cost function and its slope.

Adjustment of our model according to this slope.

Let's implement this now to watch it in action. Note that to keep things simple, we'll focus on estimating one parameter (line slope) for now.

**Visualizing the error function**

First, let's look at the error function for this data. Normally we don't know this in advance, but for learning purposes, let's calculate it now for different potential models.



In [28]:
import numpy as np

x = data.year
temperature_true = data.min_temperature

# We'll use a prebuilt method to show a 3D plot
# This requires a range of x values, a range of y values,
# and a way to calculate z
# Here, we set:
#   x to a range of potential model intercepts
#   y to a range of potential model slopes
#   z as the cost for that combination of model parameters   

# Choose a range of intercepts and slopes values
intercepts = np.linspace(-100,-70,10)
slopes = np.linspace(0.060,0.07,10)


# Set a cost function. This will be the mean of squared differences
def cost_function(temperature_estimate):
    """
    Calculates cost for a given temperature estimate
    Our cost function is the mean of squared differences (a.k.a. mean squared error)
    """
    # Note that with NumPy to square each value, we use **
    return np.mean((temperature_true - temperature_estimate) ** 2)

def predict_and_calc_cost(intercept, slope):
    '''
    Uses the model to make a prediction, then calculates the cost 
    '''

    # Predict temperature by using these model parameters
    temperature_estimate = x * slope + intercept

    # Calculate cost
    return cost_function(temperature_estimate)

# Call the graphing method. This will use our cost function,
# which is above. If you want to view this code in detail,
# then see this project's GitHub repository
graphing.surface(x_values=intercepts, 
                y_values=slopes, 
                calc_z=predict_and_calc_cost, 
                title="Cost for Different Model Parameters",
                axis_title_x="Model intercept",
                axis_title_y="Model slope",
                axis_title_z="Cost")

The preceding graph is interactive. Try clicking and dragging the mouse to rotate it.

Notice how cost changes with both intercept and line slope. This is because our model has a slope and an intercept, which both will affect how well the line fits the data. A consequence is that the gradient of the cost function must also be described by two numbers: one for intercept and one for line slope.

Our lowest point on the graph is the location of the best line equation for our data: a slope of 0.063 and an intercept of -83. Let's try to train a model to find this point.
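The idea that the gradient has one number per parameter can be checked numerically. The sketch below is not part of the exercise: it uses hypothetical toy data (a noisy line, not the Seattle data) and estimates each gradient component by nudging one parameter at a time, a technique known as central differences. The names `toy_x`, `toy_y`, `toy_cost`, and `numerical_gradient` are made up for this illustration.

```python
import numpy as np

# Hypothetical toy data (not the Seattle data set): a noisy line y = 2x + 1
rng = np.random.default_rng(1)
toy_x = np.linspace(0.0, 10.0, 50)
toy_y = 2.0 * toy_x + 1.0 + rng.normal(0.0, 0.5, size=toy_x.size)

def toy_cost(intercept, slope):
    """Mean squared error of the line y = slope * x + intercept on the toy data."""
    return np.mean((toy_y - (toy_x * slope + intercept)) ** 2)

def numerical_gradient(intercept, slope, step=1e-6):
    """Estimate both gradient components by using central differences."""
    # Nudge only the intercept to see how the cost responds
    grad_intercept = (toy_cost(intercept + step, slope)
                      - toy_cost(intercept - step, slope)) / (2 * step)
    # Nudge only the slope to see how the cost responds
    grad_slope = (toy_cost(intercept, slope + step)
                  - toy_cost(intercept, slope - step)) / (2 * step)
    return grad_intercept, grad_slope

print("Gradient at intercept=0, slope=0:", numerical_gradient(0.0, 0.0))
```

Starting from a flat line at zero, both components come out negative, meaning that increasing either parameter would lower the cost. At the best-fitting line, both components are close to zero.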

**Implementing gradient descent**

To implement gradient descent, we need a method that can calculate the gradient of the preceding curve.



In [29]:
def calculate_gradient(temperature_estimate):
    """
    This calculates the gradient for a linear regression 
    by using the Mean Squared Error cost function
    """

    # The partial derivatives of MSE are as follows
    # You don't need to be able to do this just yet, but
    # it's important to note that these give you the two gradients
    # that we need to train our model
    error = temperature_estimate - temperature_true
    grad_intercept = np.mean(error) * 2
    grad_slope = (x * error).mean() * 2

    return grad_intercept, grad_slope

print("Function is ready!")

Function is ready!
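For readers curious about where the two expressions in `calculate_gradient` come from, here is a short derivation. The symbols $m$ (slope), $b$ (intercept), and $n$ (number of data points) are notation introduced for this sketch, not names from the exercise code.

$$
C(m, b) = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2, \qquad \hat{y}_i = m x_i + b
$$

$$
\frac{\partial C}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right), \qquad
\frac{\partial C}{\partial m} = \frac{2}{n}\sum_{i=1}^{n} x_i\left(\hat{y}_i - y_i\right)
$$

Here $\hat{y}_i - y_i$ corresponds to `error` in the code, so the first partial derivative is `grad_intercept` and the second is `grad_slope`.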


Now all we need is a starting guess, and a loop that will update this guess with each iteration.

In [30]:
def gradient_descent(learning_rate, number_of_iterations):
    """
    Performs gradient descent for a one-variable function. 

    learning_rate: Larger numbers follow the gradient more aggressively
    number_of_iterations: The maximum number of iterations to perform
    """

    # Our starting guess is y = 0 * x - 83
    # We're going to start with the correct intercept so that 
    # only the line's slope is estimated. This is just to keep
    # things simple for this exercise
    model = MyModel()
    model.intercept = -83
    model.slope = 0

    for i in range(number_of_iterations):
        # Calculate the predicted values
        predicted_temperature = model.predict(x)

        # === OPTIMIZER ===
        # Calculate the gradient
        _, grad_slope = calculate_gradient(predicted_temperature)
        # Update the estimation of the line
        model.slope -= learning_rate * grad_slope

        # Print the current estimation and cost every 100 iterations
        if i % 100 == 0:
            estimate = model.predict(x)
            cost = cost_function(estimate)
            print("Next estimate:", model.get_summary(), f"Cost: {cost}")

    # Print the final model
    print("Final estimate:", model.get_summary())

# Run gradient descent
gradient_descent(learning_rate=1E-9, number_of_iterations=1000)

Next estimate: y = 0.0004946403321335815 * x + -83 Cost: 15374.064817888926
Next estimate: y = 0.034564263954523125 * x + -83 Cost: 3218.050332426434
Next estimate: y = 0.050035120236006536 * x + -83 Cost: 711.4491469584532
Next estimate: y = 0.057060363506525755 * x + -83 Cost: 194.58159053167708
Next estimate: y = 0.060250493523378544 * x + -83 Cost: 88.00218235322349
Next estimate: y = 0.06169911660055105 * x + -83 Cost: 66.02523660294695
Next estimate: y = 0.06235692954504888 * x + -83 Cost: 61.4935343467107
Next estimate: y = 0.0626556393176375 * x + -83 Cost: 60.559085785362484
Next estimate: y = 0.06279128202425543 * x + -83 Cost: 60.36640010911254
Next estimate: y = 0.06285287674109104 * x + -83 Cost: 60.32666783130979
Final estimate: y = 0.06288066221361607 * x + -83


Our model found the correct answer, but it took a number of steps. Looking at the printout, we can see how it progressively stepped toward the correct solution.

Now, what happens if we increase the learning rate? This means taking larger steps.

In [31]:
gradient_descent(learning_rate=1E-8, number_of_iterations=200)

Next estimate: y = 0.004946403321335815 * x + -83 Cost: 13267.277888290606
Next estimate: y = 0.06288803098785394 * x + -83 Cost: 60.317363492453254
Final estimate: y = 0.0629041077135948 * x + -83


Our model appears to have found the solution faster. If we increase the rate even more, however, things don't go so well:

In [32]:
gradient_descent(learning_rate=5E-7, number_of_iterations=500)

Next estimate: y = 0.24732016606679072 * x + -83 Cost: 133774.64171440934
Next estimate: y = 9.500952345613634e+45 * x + -83 Cost: 3.549071667291563e+98
Next estimate: y = 4.8948068107652476e+92 * x + -83 Cost: 9.420015144175701e+191
Next estimate: y = 2.52176127646564e+139 * x + -83 Cost: 2.500278766819551e+285
Next estimate: y = 1.2991891572708264e+186 * x + -83 Cost: inf
Final estimate: y = -2.2830799448010082e+232 * x + -83


Notice how the cost is getting worse each time.

This is because the steps that the model was taking were too large. Although it would step toward the correct solution, it would step too far and actually get worse with each attempt.

For each model, there's an ideal learning rate. Finding it usually requires experimentation.
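For the special case used in this exercise (a straight line, mean squared error, and only the slope being trained), the point at which training becomes unstable can be worked out directly. Because the gradient is linear in the slope, each update multiplies the remaining distance to the best slope by `1 - 2 * learning_rate * mean(x²)`. Training converges only while the size of that multiplier stays below 1. The sketch below illustrates this under stated assumptions: `years` is a hypothetical stand-in for the exercise's year column, and `error_multiplier` is a helper invented for this example.

```python
import numpy as np

# Hypothetical stand-in for the exercise's year column (1948 to mid-2017)
years = np.linspace(1948.0, 2017.5, 500)

def error_multiplier(learning_rate, x_values):
    """
    Factor by which the slope's distance from the best slope is multiplied
    on every update, for an MSE cost with a fixed intercept
    """
    return 1.0 - 2.0 * learning_rate * np.mean(x_values ** 2)

# Updates shrink the error only while |multiplier| < 1,
# which means learning_rate < 1 / mean(x^2)
largest_stable_rate = 1.0 / np.mean(years ** 2)
print(f"Largest stable learning rate: {largest_stable_rate:.2e}")

# The three learning rates tried in this exercise
for rate in (1e-9, 1e-8, 5e-7):
    multiplier = error_multiplier(rate, years)
    behaviour = "converges" if abs(multiplier) < 1 else "diverges"
    print(f"learning_rate={rate:.0e}: multiplier={multiplier:+.4f} ({behaviour})")
```

This kind of shortcut relies on the cost being a simple quadratic. For more complex models no such formula is generally available, which is why experimentation remains the usual approach.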

**Fitting multiple variables simultaneously**

We've just fit one variable here to keep things simple. Expanding this to fit multiple variables requires only a few small code changes:

We need to update more than one variable in the gradient descent loop.

We need to do some preprocessing of the data, which we alluded to in an earlier exercise. We'll cover how and why in later learning material.
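As a rough sketch of what those two changes might look like, the example below fits both the slope and the intercept. It rests on several assumptions: the data is synthetic (`toy_years` and `toy_temperature` are made up, not the Seattle data), and the preprocessing shown is standardization (subtracting the mean and dividing by the standard deviation), which is one common choice and not necessarily the one covered in later material.

```python
import numpy as np

# Synthetic data for illustration only: a noisy line resembling the fitted model
rng = np.random.default_rng(0)
toy_years = np.linspace(1948.0, 2017.5, 500)
toy_temperature = 0.063 * toy_years - 83.0 + rng.normal(0.0, 1.0, size=toy_years.size)

# Assumed preprocessing: standardize the feature so that it has
# a mean of 0 and a standard deviation of 1
year_mean, year_std = toy_years.mean(), toy_years.std()
scaled_years = (toy_years - year_mean) / year_std

def fit_line(x, y, learning_rate=0.1, number_of_iterations=500):
    """Batch gradient descent on MSE that updates slope and intercept together."""
    slope, intercept = 0.0, 0.0
    for _ in range(number_of_iterations):
        error = (x * slope + intercept) - y
        # One gradient component per parameter
        grad_intercept = 2 * np.mean(error)
        grad_slope = 2 * np.mean(x * error)
        # Update both parameters in the same iteration
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return slope, intercept

scaled_slope, scaled_intercept = fit_line(scaled_years, toy_temperature)

# Convert the parameters back to the original units (°F per year, °F)
fitted_slope = scaled_slope / year_std
fitted_intercept = scaled_intercept - fitted_slope * year_mean
print(f"y = {fitted_slope:0.3f} * x + {fitted_intercept:0.3f}")
```

With the feature standardized, a single moderate learning rate works for both parameters. Without it, the slope and intercept gradients differ in scale by several orders of magnitude, which makes a shared learning rate hard to choose.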

**Summary**

Well done! In this unit, we:

Watched gradient descent in action.

Saw how changing the learning rate can improve a model's training speed.

Learned that changing the learning rate can also result in unstable models.

You might have noticed that the line between where the cost function stopped and the optimizer began became a little blurred here. Don't let that bother you. This happens commonly: although the two are conceptually separate, their mathematics can become intertwined.

### Knowledge check

**Check your knowledge**

**1. What is the difference between supervised learning and unsupervised learning?**

Supervised learning requires human supervision, whereas unsupervised learning doesn't.

Supervised learning always uses an optimizer, but unsupervised learning never does.

Supervised learning trains a model by comparing estimations to correct answers. The cost function for unsupervised learning doesn't need correct answers.

**2. What is the role of the cost function in supervised learning?**

To maximize the cost so that the objective is reached.

To calculate the cost by comparing estimations to correct answers.

To update model parameters.

**3. How does gradient descent know how to update parameters?**

It compares costs for several combinations of parameters, and then it selects the best option.

It uses an internal understanding of the relationship between features and labels to make intelligent choices.

It uses calculus to estimate the slope of the cost function.

**4. Why are many cost functions available?**

A unique cost function is required for each processed currency or banking system.

Cost functions help models process data, and many model types are available.

Different cost functions can arrive at different answers, and what's best depends on the goal.

**5. Why is learning rate important?**

It speeds up or slows down training.

If the learning rate is too large or too small, it can prevent a model from being trained optimally.

Both options are correct.

### Summary

Well done for getting through all of that! Let's recap what we covered:

Supervised learning is a kind of learning by example. A model makes predictions, the predictions are compared to expected labels, and the model is then updated to produce better results.
A cost function is a mathematical way to describe what we want a model to learn. Cost functions calculate large numbers when a model isn't making good predictions, and small numbers when it's performing well.
Gradient descent is an optimization algorithm. It's a way of calculating how to improve a model, given a cost function and some data.
Step size (learning rate) changes how quickly and how well gradient descent performs.

**Module complete:**