<a href="https://colab.research.google.com/github/mithamokelvinm/Foundations_Of_Data_Science_For_Machine_Learning/blob/main/Introduction_To_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction To Machine Learning

---






### Introduction


Machine-learning models are computer algorithms that use data to make estimations (educated guesses) or decisions. Machine-learning models differ from traditional algorithms in how they are designed. When normal computer software needs to be improved, people edit it. By contrast, a machine-learning algorithm uses data to get better at a specific task.

For example, spam filters use machine learning. Twenty years ago, spam filters did not have many examples to learn from and were not good at identifying what is and isn’t spam. As more spam has arrived and been labeled as junk by human users, the machine-learning algorithms have gained more experience and become better at their job.

Boots that fit - 

Throughout this module, we’ll be using an example scenario to explain key machine-learning concepts.

In this scenario, you own a shop that sells harnesses for avalanche-rescue dogs, and you’ve recently expanded to also sell doggy boots. Customers all seem to pick the correct harness sizes, but are constantly ordering doggy boots that are the wrong size. You know most customers buy harnesses and boots in the same transaction, which gives you an idea: perhaps you could approximate which doggy boots are the correct size, depending on the harness chosen. Then, you could warn customers if the boots they have selected are unlikely to be the correct size before they make the purchase.

During this module, we’ll create a machine-learning model that does exactly this. Along the way, we’ll use this scenario to introduce you to some basic machine-learning concepts and demonstrate how to use them in a practical setting.


Learning objectives:

In this module, you'll:

- Explore how machine learning differs from traditional software.
- Create and test a machine-learning model.
- Load a model and use it with new data.

### What Are Machine Learning Models


The model is the core component of machine learning, and ultimately what we are trying to build. A model might estimate how old a person is from a photo, predict what you might like to see on social media, or decide where a robotic arm should move. In our scenario, we want to build a model that can estimate the best boot size for a dog based on their harness size.

Models can be built in many ways. For example, a traditional model that simulates how an airplane flies is built by people, using knowledge of physics and engineering. Machine-learning models are special; rather than being edited by people so that they work well, machine learning models are shaped by data. They learn from experience.

How to think about models -

You can think of a model as a function that accepts data as an input and produces an output. More specifically, a model uses input data to estimate something else. For example, in our scenario, we want to build a model that is given a harness size and estimates boot size:

Diagram showing a model without parameters.

Note that harness size and dog boot size are data; they are not part of the model. Harness size is our input, dog boot size is the output.

Models are often simple code

Models are often not meaningfully different from simple functions you're already familiar with. Like other code, they contain logic and parameters. For example, the logic might be “multiply the harness size by parameter_1”:

A diagram showing a model with a single unspecified parameter.

If parameter_1 here were 2.5, our model would multiply harness size by 2.5 and return the result:

Diagram showing a model with 2.5 as the only parameter.
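
As a minimal sketch of this idea, the single-parameter model can be written as an ordinary Python function. The function name `estimate_boot_size` is hypothetical, and 2.5 is only the illustrative value from the diagram, not a trained parameter.

```python
# A one-parameter model: the logic is "multiply the input by parameter_1"
def estimate_boot_size(harness_size, parameter_1=2.5):
    """Return the model's estimate for a given harness size."""
    return harness_size * parameter_1

# The harness size is data passed in; it is not part of the model itself
print(estimate_boot_size(10))
```

Changing `parameter_1` changes the estimates without changing the logic of the function.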

Select a model

There are many model types, some simple and some complex.

Like all code, simpler models are often the most reliable and easy to understand, while complex models can potentially perform impressive feats. Which kind of model you should choose depends on your goal. For example, medical scientists often work with models that are relatively simple, because they are reliable and intuitive. By contrast, AI-based robots typically rely on very complex models.

The first step in machine learning is selecting the kind of model that you'd like to use. This means we're choosing a model based on its internal logic. For example, we might select a two-parameter model to estimate dog boot size from harness size:

Diagram showing a model with two unspecified parameters.

Notice how we selected a model based on how it works logically, but not based on its parameter values. In fact, at this point the parameters have not yet been set to any particular value.

Parameters are discovered during training

The human designer doesn't select parameter values. Instead, parameter values are set to an initial guess, then adjusted during an automated learning process called training.

Given our selection of a two-parameter model (above), we'll now provide random guesses for our parameters:

Diagram showing a model with 0.2 and 1.2 as the parameters.

These random parameters will mean the model isn’t good at estimating boot size, so we'll perform training. During training, these parameters are automatically changed to two new values that give better results:

Diagram showing a model with 1.5 and 4 as the parameters.

Exactly how this process works is something we'll progressively explain throughout your learning journey.
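
To make the before-and-after difference concrete, here is a small sketch. It assumes the two-parameter model's logic is a straight line (multiply by parameter_1, then add parameter_2); the diagrams don't state the logic, so treat that as an assumption. The parameter values are the illustrative ones from the diagrams, and the function name is hypothetical.

```python
# Assumed logic for a two-parameter model: a straight line
def two_parameter_model(harness_size, parameter_1, parameter_2):
    return harness_size * parameter_1 + parameter_2

harness_size = 10

# Same model, same input, different parameter values
estimate_before_training = two_parameter_model(harness_size, 0.2, 1.2)
estimate_after_training = two_parameter_model(harness_size, 1.5, 4)

print("Before training:", estimate_before_training)
print("After training: ", estimate_after_training)
```

Only the parameter values differ between the two calls; the kind of model stays the same.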

### Train_and_Run_a_Machine_Learning_Model

Here, we'll train a model to guess a comfortable boot size for a dog, based on the size of the harness that fits them.

In [None]:
import pandas
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/doggy-boot-harness.csv
!pip install statsmodels


# Make a dictionary of data for boot sizes
# and harness size in cm
data = {
    'boot_size' : [ 39, 38, 37, 39, 38, 35, 37, 36, 35, 40, 
                    40, 36, 38, 39, 42, 42, 36, 36, 35, 41, 
                    42, 38, 37, 35, 40, 36, 35, 39, 41, 37, 
                    35, 41, 39, 41, 42, 42, 36, 37, 37, 39,
                    42, 35, 36, 41, 41, 41, 39, 39, 35, 39
 ],
    'harness_size': [ 58, 58, 52, 58, 57, 52, 55, 53, 49, 54,
                59, 56, 53, 58, 57, 58, 56, 51, 50, 59,
                59, 59, 55, 50, 55, 52, 53, 54, 61, 56,
                55, 60, 57, 56, 61, 58, 53, 57, 57, 55,
                60, 51, 52, 56, 55, 57, 58, 57, 51, 59
                ]
}

# Convert it into a table using pandas
dataset = pandas.DataFrame(data)

# Print the data
# In normal python we would write
# print(dataset)
# but in Jupyter notebooks, if we simply write the name
# of the variable, it is printed nicely
dataset


   boot_size  harness_size
0         39            58
1         38            58
2         37            52
3         39            58
4         38            57
5         35            52
6         37            55
7         36            53
8         35            49
9         40            54


As you can see, we have the sizes of boots and harnesses for 50 avalanche dogs.

We want to use harness size to estimate boot size. This means harness_size is our input. We want a model that will process the input and make its own estimations of the boot size (output).

Select a model
The first thing we must do is select a model. We're just getting started, so we'll start with a very simple model called ordinary least squares (OLS). With a single input, this model is just a straight line (sometimes called a trendline).

Let's use an existing library to create our model, but we won't train it yet.

In [None]:
# Load a library to do the hard work for us
import statsmodels.formula.api as smf

# First, we define our formula using a special syntax
# This says that boot_size is explained by harness_size
formula = "boot_size ~ harness_size"

# Create the model, but don't train it yet
model = smf.ols(formula = formula, data = dataset)

# Note that we have created our model but it does not 
# have internal parameters set yet
if not hasattr(model, 'params'):
    print("Model selected but it does not have parameters set. We need to train it!")

Model selected but it does not have parameters set. We need to train it!


Train our model
With a single input, an OLS model has two parameters (a slope and an offset, also called an intercept), but these haven't been set in our model yet. We need to train (fit) our model to find these values so that the model can reliably estimate dogs' boot size based on their harness size.

The following code fits our model to data you've now seen:

In [None]:
# Load our custom graphing helper (used in the next cell)
import graphing

# Train (fit) the model so that it creates a line that
# fits our data. This method does the hard work for
# us. We will look at how this method works in a later unit.
fitted_model = model.fit()

# Print information about our model now it has been fit.
# Parameters are looked up by name, which avoids relying on their order
print("The following model parameters have been found:\n" +
        f"Line slope: {fitted_model.params['harness_size']}\n"+
        f"Line Intercept: {fitted_model.params['Intercept']}")

The following model parameters have been found:
Line slope: 0.585925416738271
Line Intercept: 5.719109812682589
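
For intuition, the sketch below fits a straight line to the same 50 data points using the standard closed-form least-squares formula (slope = covariance / variance). This is not necessarily how statsmodels computes the fit internally, and the function name `fit_line` is hypothetical, but the result should closely match the parameters printed above.

```python
# Boot sizes and harness sizes (cm), copied from the data dictionary
# at the start of this exercise
boot_sizes = [39, 38, 37, 39, 38, 35, 37, 36, 35, 40,
              40, 36, 38, 39, 42, 42, 36, 36, 35, 41,
              42, 38, 37, 35, 40, 36, 35, 39, 41, 37,
              35, 41, 39, 41, 42, 42, 36, 37, 37, 39,
              42, 35, 36, 41, 41, 41, 39, 39, 35, 39]
harness_sizes = [58, 58, 52, 58, 57, 52, 55, 53, 49, 54,
                 59, 56, 53, 58, 57, 58, 56, 51, 50, 59,
                 59, 59, 55, 50, 55, 52, 53, 54, 61, 56,
                 55, 60, 57, 56, 61, 58, 53, 57, 57, 55,
                 60, 51, 52, 56, 55, 57, 58, 57, 51, 59]

def fit_line(x_values, y_values):
    """Return the least-squares (slope, intercept) of a straight line."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    # Slope is the covariance of x and y divided by the variance of x
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in zip(x_values, y_values))
    variance = sum((x - mean_x) ** 2 for x in x_values)
    slope = covariance / variance
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(harness_sizes, boot_sizes)
print(f"Line slope: {slope}")
print(f"Line intercept: {intercept}")
```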


Notice how training the model set its parameters. We could interpret these directly, but it's simpler to see it as a graph:

In [None]:
import graphing

# Show a graph of the result
# Don't worry about how this works for now
graphing.scatter_2D(dataset,    label_x="harness_size", 
                                label_y="boot_size",
                                trendline=lambda x: fitted_model.params["harness_size"] * x + fitted_model.params["Intercept"]
                                )

The graph above shows our original data as circles with a red line through them. The red line shows our model.

We can look at this line to understand our model. For example, we can see that as harness size increases, so will the estimated boot size.

Use the model
Now that we've finished training, we can use our model to predict a dog's boot size from their harness size.

For example, by looking at the red line, we can see that a harness size of 52.5 (x axis) corresponds to a boot size of about 36.5 (y axis).

We don't have to do this by eye though. We can use the model in our program to predict any boot size we like. Run the following code to see how we can use our model now that it's trained:

In [None]:
# harness_size states the size of the harness we are interested in
harness_size = { 'harness_size' : [52.5] }

# Use the model to predict what size of boots the dog will fit
approximate_boot_size = fitted_model.predict(harness_size)

# Print the result
print("Estimated approximate_boot_size:")
print(approximate_boot_size[0])

Estimated approximate_boot_size:
36.48019419144182


If you'd like, change the value of 52.5 in harness_size to a new value and run the block above to see the model in action.
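
To see what the prediction amounts to, the sketch below repeats the calculation by hand: multiply the harness size by the slope, then add the intercept. The two values are copied from the training output earlier in this exercise; the function name `predict_boot_size` is hypothetical.

```python
# Parameters copied from the training output above
slope = 0.585925416738271
intercept = 5.719109812682589

def predict_boot_size(harness_size):
    """Estimate boot size with the straight-line model."""
    return slope * harness_size + intercept

# Should agree with the model's own prediction of about 36.48
print(predict_boot_size(52.5))
```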

Summary
Well done! You've trained your first model. We've demonstrated some topics here without detailed explanation in order to just get your feet wet. In later units, we'll explain many of these topics in more detail.

### What are inputs and outputs?

The goal of training is to improve a model so that it can make high-quality estimations or predictions. Once trained, you can use a model in the real world like normal software.

Models don’t train themselves. They're trained using data plus two pieces of code, the objective function and the optimizer. Let’s explore how these components work together to train a model to work well.

Diagram showing an untrained model with two parameters, and a trained model with 0.7 and 0.4 as the parameters.

The objective

The objective is what we want the model to be able to do. For example, the objective of our scenario is to be able to estimate a dog’s boot size based on their harness size.

So that a computer can understand our objective, we need to provide our goal as a code snippet called an objective function (also known as a cost function). Objective functions judge whether the model is doing a good job (estimating boot size well) or a bad job (estimating boot size badly). We'll cover objective functions in more depth in later learning material.
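
As a rough illustration, an objective function can be as small as the sketch below. It uses mean squared error, which is one common choice and an assumption here, since this unit doesn't say which objective function is used. The example estimates are made up for illustration.

```python
def mean_squared_error(estimates, correct_values):
    """Average squared difference between estimates and correct answers.

    Lower is better; 0 means every estimate was exactly right.
    """
    squared_errors = [(estimate - correct) ** 2
                      for estimate, correct in zip(estimates, correct_values)]
    return sum(squared_errors) / len(squared_errors)

correct_boot_sizes = [39, 38, 37]

# Hypothetical estimates from a model doing well and a model doing badly
good_estimates = [39, 38, 36]
bad_estimates = [30, 45, 50]

print("Cost for good estimates:", mean_squared_error(good_estimates, correct_boot_sizes))
print("Cost for bad estimates: ", mean_squared_error(bad_estimates, correct_boot_sizes))
```

The function needs the correct boot sizes as well as the model's estimates, which is why training data includes the answers.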

The data

Data refers to the information that we provide to the model (also known as inputs). In our scenario, this is harness size.

Data also refers to information that the objective function might need. For example, if our objective function reports whether the model guessed the boot size correctly, it will need to know the correct boot size! This is why in our previous exercise, we provided both harness sizes and the correct answers to the training code.

We'll practice working with data in the next exercise.

The optimizer

During training, the model makes a prediction, and the objective function calculates how well it performed. The optimizer is code that then changes the model’s parameters so the model will do a better job next time.

How an optimizer does this is complex, and something we'll cover in later material. Don’t be intimidated, though; we don’t normally write our own optimizers, we use open-source frameworks where the hard work has been done for us.

It's important to keep in mind that the objective, data, and optimizer are simply a means to train the model. They are not needed once training is complete. It's also important to remember that training only changes the parameter values inside of a model; it doesn't change what kind of model is used.
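
The sketch below shows how the three pieces might fit together in a training loop. It is a simplified illustration under stated assumptions, not how any particular framework works: the model is assumed to be a straight line, the objective function is mean squared error, and the optimizer is plain gradient descent with a hand-picked learning rate. All function names are hypothetical, and the data are the first five rows of the dog dataset.

```python
# Data: harness sizes (inputs) and the correct boot sizes
harness_sizes = [58, 58, 52, 58, 57]
boot_sizes = [39, 38, 37, 39, 38]

def model(harness_size, slope, intercept):
    """The model: a straight line with two parameters."""
    return slope * harness_size + intercept

def cost(slope, intercept):
    """The objective function: mean squared error over the data."""
    errors = [model(x, slope, intercept) - y
              for x, y in zip(harness_sizes, boot_sizes)]
    return sum(e ** 2 for e in errors) / len(errors)

def optimizer_step(slope, intercept, learning_rate=0.0001):
    """The optimizer: nudge both parameters in the direction that lowers the cost."""
    n = len(harness_sizes)
    errors = [model(x, slope, intercept) - y
              for x, y in zip(harness_sizes, boot_sizes)]
    slope_gradient = 2 * sum(e * x for e, x in zip(errors, harness_sizes)) / n
    intercept_gradient = 2 * sum(errors) / n
    return (slope - learning_rate * slope_gradient,
            intercept - learning_rate * intercept_gradient)

# Start from an initial guess, then repeat: predict, score, adjust
slope, intercept = 0.2, 1.2
initial_cost = cost(slope, intercept)
for _ in range(100):
    slope, intercept = optimizer_step(slope, intercept)
final_cost = cost(slope, intercept)

print(f"Cost before training: {initial_cost:.2f}")
print(f"Cost after training:  {final_cost:.2f}")
```

After the loop, only `slope` and `intercept` are needed to use the model; the data, `cost`, and `optimizer_step` can be discarded.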



### Visualize inputs and outputs

Exercise: Datasets in Python
In the previous exercise, we loaded some data and fit a model to it. Several aspects of this were simplified, particularly that the data was hard-coded into our Python script, and we didn't spend any time really looking at the data itself.

Here, we'll load data from a file, filter it, and graph it. Doing so is a very important first step in order to build proper models, or to understand their limitations.

As before, there's no need to edit any code in the examples in this unit. Try to read it, understand it, then press the Run button to run it. As always, it's vitally important that these code blocks are run in the correct order, and nothing is missed.

Load data with Pandas
There is a large variety of libraries that help you work with data. In Python, one of the most common is pandas. We used pandas briefly in the previous exercise. Pandas can open data saved as text files and store it in an organized table called a DataFrame.

Let's open some text data that's stored on disk. Our data is saved in a file called doggy-boot-harness.csv.

In [7]:
import pandas
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/doggy-boot-harness.csv

# Read the text file containing data using pandas
dataset = pandas.read_csv('doggy-boot-harness.csv')

# Print the data
# Because there is a lot of data, use head() to only print the first few rows
dataset.head()


   boot_size  harness_size     sex  age_years
0         39            58    male       12.0
1         38            58    male        9.6
2         37            52  female        8.6
3         39            58    male       10.2
4         38            57    male        7.8



As you can see, this dataset contains information about dogs, including their doggy boot size, harness size, sex, and age in years.

Data is stored as columns and rows, similar to a table you might see in Excel.


**Filter data by Columns**

Data is easy to filter by columns. We can select a column either as an attribute, like dataset.my_column_name, or by name in square brackets, like dataset["my_column_name"].

We can use this either to extract data or to delete data.

Let's take a look at the harness sizes, and delete the sex and age_years columns.

In [8]:
# Look at the harness sizes
print("Harness sizes")
print(dataset.harness_size)

# Remove the sex and age-in-years columns.
del dataset["sex"]
del dataset["age_years"]

# Print the column names
print("\nAvailable columns after deleting sex and age information:")
print(dataset.columns.values)

Harness sizes
0     58
1     58
2     52
3     58
4     57
5     52
6     55
7     53
8     49
9     54
10    59
11    56
12    53
13    58
14    57
15    58
16    56
17    51
18    50
19    59
20    59
21    59
22    55
23    50
24    55
25    52
26    53
27    54
28    61
29    56
30    55
31    60
32    57
33    56
34    61
35    58
36    53
37    57
38    57
39    55
40    60
41    51
42    52
43    56
44    55
45    57
46    58
47    57
48    51
49    59
Name: harness_size, dtype: int64

Available columns after deleting sex and age information:
['boot_size' 'harness_size']
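
Note that `del` changes the DataFrame in place. As an alternative sketch, pandas also provides `DataFrame.drop`, which returns a new table and leaves the original untouched. The small `example` table below is a hypothetical stand-in built from the first three rows of the dataset.

```python
import pandas

# A small stand-in table with the same columns as the dog dataset
example = pandas.DataFrame({
    "boot_size": [39, 38, 37],
    "harness_size": [58, 58, 52],
    "sex": ["male", "male", "female"],
    "age_years": [12.0, 9.6, 8.6],
})

# drop() returns a new DataFrame without the named columns
trimmed = example.drop(columns=["sex", "age_years"])

print("Columns in the new table:     ", list(trimmed.columns))
print("Columns in the original table:", list(example.columns))
```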


**Filter data by Rows**

We can get data from the top of the table by using the head() function, or from the bottom of the table by using the tail() function.

Both functions make a shallow copy of a section of our dataframe. Here, we're sending these copies to the print() function. The head and tail views can also be used for other purposes, such as for use in analyses or graphs.

In [9]:
# Print the data at the top of the table
print("TOP OF TABLE")
print(dataset.head())

# print the data at the bottom of the table
print("\nBOTTOM OF TABLE")
print(dataset.tail())

TOP OF TABLE
   boot_size  harness_size
0         39            58
1         38            58
2         37            52
3         39            58
4         38            57

BOTTOM OF TABLE
    boot_size  harness_size
45         41            57
46         39            58
47         39            57
48         35            51
49         39            59


We can also filter logically. For example, we can look at data for dogs who have a harness smaller than a size 55.

This works by calculating a True or False value for each row, then keeping only those rows where the value is True.

In [10]:
# Print how many rows of data we have
print(f"We have {len(dataset)} rows of data")

# Determine whether each avalanche dog's harness size is < 55
# This creates a True or False value for each row where True means 
# they are smaller than 55
is_small = dataset.harness_size < 55
print("\nWhether the dog's harness was smaller than size 55:")
print(is_small)

# Now apply this 'mask' to our data to keep the smaller dogs
data_from_small_dogs = dataset[is_small]
print("\nData for dogs with harness smaller than size 55:")
print(data_from_small_dogs)

# Print the number of small dogs
print(f"\nNumber of dogs with harness size less than 55: {len(data_from_small_dogs)}")

We have 50 rows of data

Whether the dog's harness was smaller than size 55:
0     False
1     False
2      True
3     False
4     False
5      True
6     False
7      True
8      True
9      True
10    False
11    False
12     True
13    False
14    False
15    False
16    False
17     True
18     True
19    False
20    False
21    False
22    False
23     True
24    False
25     True
26     True
27     True
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36     True
37    False
38    False
39    False
40    False
41     True
42     True
43    False
44    False
45    False
46    False
47    False
48     True
49    False
Name: harness_size, dtype: bool

Data for dogs with harness smaller than size 55:
    boot_size  harness_size
2          37            52
5          35            52
7          36            53
8          35            49
9          40            54
12         38            53
17         36            51
18         35    

This looks like a lot of code, but we can compress the important parts into a single line.

Let's do something similar: restrict our data to only those with boot sizes smaller than 40.

In [11]:
# Make a copy of the dataset that only contains dogs with 
# a boot size below size 40
# The call to copy() is optional but can help avoid unexpected
# behaviour in more complex scenarios
data_smaller_paws = dataset[dataset.boot_size < 40].copy()


# Print information about this
print(f"We now have {len(data_smaller_paws)} rows in our dataset. The last few rows are:")
data_smaller_paws.tail()

We now have 34 rows in our dataset. The last few rows are:


    boot_size  harness_size
42         36            52
46         39            58
47         39            57
48         35            51
49         39            59
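
Conditions can also be combined. The sketch below keeps only rows that satisfy two conditions at once, using pandas' element-wise `&` operator; each condition needs its own parentheses. The small `example` table is a hypothetical stand-in using a few rows from the dataset.

```python
import pandas

# A small stand-in table using a few rows from the dog dataset
example = pandas.DataFrame({
    "boot_size": [39, 38, 37, 35, 40],
    "harness_size": [58, 58, 52, 52, 54],
})

# Keep dogs with a boot size below 40 AND a harness size below 55
small_dogs = example[(example.boot_size < 40) & (example.harness_size < 55)].copy()

print(small_dogs)
```

Use `|` in place of `&` to keep rows that satisfy either condition.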


**Graph Data**

Graphing data is often the easiest way to understand it.

In these exercises, we usually make our graphs using code in a custom file we've created, called graphing.py, which you can look at on our GitHub page.

Here, however, we'll practice making a graph without this custom code.

Let's make a simple graph of harness size versus boot size for our avalanche dogs with smaller feet.

In [12]:
# Load and prepare plotly to create our graphs
import plotly.express
import graphing # this is a custom file you can find in our code on github

# Show a graph of harness size by boot size:
plotly.express.scatter(data_smaller_paws, x="harness_size", y="boot_size")

**Create New Columns**

The preceding graph shows the relationship we want to investigate for our store, but some customers might want harness-size lists in inches, not centimeters. How can we view these harness sizes in imperial units?

To do this, we will need to create a new column called harness_size_imperial and put that on the X axis instead.

Creating new columns uses very similar syntax to what we've seen before.

In [13]:
# Convert harness sizes from metric to imperial units 
# and save the result to a new column
data_smaller_paws['harness_size_imperial'] = data_smaller_paws.harness_size / 2.54

# Show a graph of harness size in imperial units
plotly.express.scatter(data_smaller_paws, x="harness_size_imperial", y="boot_size")

We've now graphed our new column of data (harness_size_imperial) against boot size for dogs with small paws.

**Summary**

We've introduced working with data in Python, including:

1. Opening data from a file into a DataFrame (table)
2. Inspecting the top and bottom of the dataframe
3. Adding and removing columns of data
4. Removing rows of data based on criteria
5. Graphing data to understand trends

Learning to work with dataframes can feel tedious or dry, but keep going, because these basic skills are critical to unlocking exciting machine-learning techniques that we'll cover in later modules.

### How to use a model

Let’s review how these parts fit together to train a model.

**Training versus using a model**

It's important to make a distinction between training and using a model.

Using a model means providing inputs and receiving an estimation or prediction. We do this both when we're training our model and when we or our customers use it in the real world. Using a model normally takes less than a few seconds.

Diagram showing a machine learning model with data going into the model, which then moves to an estimate.

By contrast, training a model is the process of improving how well a model works. Training requires that we use the model, as well as the objective function and optimizer, in a special loop. This can take minutes or days to complete. Usually, we only train a model once. Once it's trained, we can use it as many times as we like without making further changes.

Diagram showing the final training figure showing the machine learning model lifecycle.

For example, in our avalanche-rescue dog store scenario, we want to train a model using a public dataset, which will change the model so that it can predict a dog’s boot size based on its harness size. Once our model is trained, we'll use the model as part of our online store to make sure customers are buying doggy boots that will fit their dogs.

**Data for use, data for training**

Recall that a dataset is a collection of information about objects or things. For example, a dataset might contain information about dogs:


When we use our model, we only need the column(s) of data that the model accepts as input. These columns are called features. In our scenario, if our model accepts harness size and estimates boot size, then our feature is harness size.

During training, the objective function usually needs to know both the model’s output and what the correct answer was. The correct answers are called labels. In our scenario, if our model predicts boot size, boot size is our label.

Taken together, this means that to use a model, we only ever need features, while during training we usually need both features and labels. During training in our scenario, we need both our harness-size feature and our boot-size label. When we use our model in our website, we only need to know the harness-size feature; our model will then estimate the boot size for us to use.
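
In code, the split between features and labels might look like the sketch below. The variable names are hypothetical, and the rows are the first five from the dog dataset.

```python
import pandas

# Training data contains both the feature and the label
training_data = pandas.DataFrame({
    "harness_size": [58, 58, 52, 58, 57],
    "boot_size": [39, 38, 37, 39, 38],
})

# Feature: the column the model accepts as input
features = training_data[["harness_size"]]

# Label: the correct answer that the objective function compares against
labels = training_data["boot_size"]

# When using a trained model, only the feature is available
new_customer = pandas.DataFrame({"harness_size": [52.5]})

print("Feature columns:", list(features.columns))
print("Label column:", labels.name)
print("Columns available at use time:", list(new_customer.columns))
```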

**I've finished training. What now?**

Once a model has finished training, you can save it to a file by itself. We no longer need the original data, the objective function, or the optimizer. When we want to use the model, we can load it from disk, provide it with new data, and get back a prediction.

In our next exercise, we'll practice saving a model, loading it from disk, and using it like we would in the real world. To complete our online store scenario, we'll also practice using the model's outputs to warn our customers if they seem to be buying the wrong-sized doggy boots.



### Use machine learning models

**Exercise: Using a Trained Model on New Data**

In Unit 3, we created a basic model that let us find the relationship between a dog's harness size and their boot size. We showed how this model could then be used to make a prediction about a new, previously unseen dog.

It's common to build, train, then use a model while we are just learning about machine learning; but in the real world, we don't want to train the model every time we want to make a prediction.

**Consider our avalanche-dog equipment store scenario:**

- We want to train the model just once, then load that model onto the server that runs our online store.
- Although the model is trained on a dataset we downloaded from the internet, we actually want to use it to estimate the boot size of our customers' dogs, who are not in this dataset!

**How can we do this?**

**Here, we'll:**

1. Create a basic model
2. Save it to disk
3. Load it from disk
4. Use it to make predictions about a dog who was not in the training dataset


**Load the dataset**

Let's begin by opening the dataset from file.



In [14]:
import pandas
!pip install statsmodels
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/doggy-boot-harness.csv

# Load a file containing dogs' boot and harness sizes
data = pandas.read_csv('doggy-boot-harness.csv')

# Print the first few rows
data.head()


   boot_size  harness_size     sex  age_years
0         39            58    male       12.0
1         38            58    male        9.6
2         37            52  female        8.6
3         39            58    male       10.2
4         38            57    male        7.8


**Create and train a model**

As we've done before, we'll create a simple linear regression model and train it on our dataset.

In [15]:
import statsmodels.formula.api as smf

# Fit a simple model that finds a linear relationship
# between boot size and harness size, which we can use later
# to predict a dog's boot size, given their harness size
model = smf.ols(formula = "boot_size ~ harness_size", data = data).fit()

print("Model trained!")

Model trained!


**Save and load a model**

Our model is ready to use, but we don't need it yet. Let's save it to disk.

In [16]:
import joblib

model_filename = './avalanche_dog_boot_model.pkl'
joblib.dump(model, model_filename)

print("Model saved!")

Model saved!


Loading our model is just as easy:

In [17]:
model_loaded = joblib.load(model_filename)

print("We have loaded a model with the following parameters:")
print(model_loaded.params)

We have loaded a model with the following parameters:
Intercept       5.719110
harness_size    0.585925
dtype: float64


**Put it together**

On our website, we'll want to take the harness size of our customer's dog, then calculate their dog's boot size using the model that we've already trained.

Let's put everything here together to make a function that loads the model from disk, then uses it to predict the boot size of our customer's dog.

In [18]:
# Let's write a function that loads and uses our model
def load_model_and_predict(harness_size):
    '''
    This function loads a pretrained model. It uses the model
    with the customer's dog's harness size to predict the size of
    boots that will fit that dog.

    harness_size: The dog harness size, in cm 
    '''

    # Load the model from file and print basic information about it
    loaded_model = joblib.load(model_filename)

    print("We've loaded a model with the following parameters:")
    print(loaded_model.params)

    # Prepare data for the model
    inputs = {"harness_size":[harness_size]} 

    # Use the model to make a prediction
    predicted_boot_size = loaded_model.predict(inputs)[0]

    return predicted_boot_size

# Practice using our model
predicted_boot_size = load_model_and_predict(45)

print("Predicted dog boot size:", predicted_boot_size)

We've loaded a model with the following parameters:
Intercept       5.719110
harness_size    0.585925
dtype: float64
Predicted dog boot size: 32.08575356590478
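
The prediction can be checked by hand, because a linear model with one feature is just `intercept + slope * harness_size`. The sketch below hard-codes the two parameters from the printed output, which are rounded to six decimals. `predict_boot_size` is a hypothetical helper for illustration.

```python
# Parameters copied from the printed output above (rounded to six decimals)
INTERCEPT = 5.719110
SLOPE = 0.585925  # the harness_size coefficient

def predict_boot_size(harness_size):
    """Apply the fitted line: boot_size = intercept + slope * harness_size."""
    return INTERCEPT + SLOPE * harness_size

manual_prediction = predict_boot_size(45)
print("Manual prediction:", manual_prediction)
```

The result should agree with the model's prediction to about four decimal places. The small remaining difference comes from using the rounded parameters.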


**Real world use**

We've done it: we can predict an avalanche dog's boot size based on the size of their harness. Our last step is to use this to warn people when they might be buying doggy boots of the wrong size.

As an example, we'll make a function that accepts the harness size and the selected boot size, and returns a message for the customer. We would integrate this function into our online store.

In [20]:
def check_size_of_boots(selected_harness_size, selected_boot_size):
    '''
    Calculates whether the customer has chosen a pair of doggy boots that 
    are a sensible size. This works by estimating the dog's actual boot 
    size from their harness size.

    This returns a message for the customer that should be shown before
    they complete their payment 

    selected_harness_size: The size of the harness the customer wants to buy
    selected_boot_size: The size of the doggy boots the customer wants to buy
    '''

    # Estimate the customer's dog's boot size
    estimated_boot_size = load_model_and_predict(selected_harness_size)

    # Round to the nearest whole number because we don't sell partial sizes
    estimated_boot_size = int(round(estimated_boot_size))

    # Check if the boot size selected is appropriate
    if selected_boot_size == estimated_boot_size:
        # The selected boots are probably OK
        return "Great choice! We think these boots will fit your avalanche dog well."

    if selected_boot_size < estimated_boot_size:
        # Selected boots might be too small 
        return "The boots you have selected might be TOO SMALL for a dog as "\
               f"big as yours. We recommend a doggy boots size of {estimated_boot_size}."

    if selected_boot_size > estimated_boot_size:
        # Selected boots might be too big 
        return "The boots you have selected might be TOO BIG for a dog as "\
               f"small as yours. We recommend a doggy boots size of {estimated_boot_size}."
    

# Practice using our new warning system
check_size_of_boots(selected_harness_size=55, selected_boot_size=39)

We've loaded a model with the following parameters:
Intercept       5.719110
harness_size    0.585925
dtype: float64


'The boots you have selected might be TOO BIG for a dog as small as yours. We recommend a doggy boots size of 38.'

Change selected_harness_size and selected_boot_size in the preceding example and re-run the cell to see this in action.
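
To see all three outcomes side by side, the comparison logic can be isolated from the model file. The sketch below is a simplified stand-in: `estimate_boot_size` replaces `load_model_and_predict` with the printed coefficients, and `compare_sizes` is a hypothetical helper that returns a short label instead of the full customer message.

```python
def estimate_boot_size(harness_size):
    # Stand-in for load_model_and_predict, using the printed coefficients
    return 5.719110 + 0.585925 * harness_size

def compare_sizes(selected_harness_size, selected_boot_size):
    """Return 'ok', 'too small', or 'too big' for the selected boots."""
    # Round to the nearest whole size, as the original function does
    estimated = int(round(estimate_boot_size(selected_harness_size)))
    if selected_boot_size == estimated:
        return "ok"
    if selected_boot_size < estimated:
        return "too small"
    return "too big"

# A harness size of 55 gives an estimate that rounds to 38
for boot_size in (37, 38, 39):
    print(boot_size, "->", compare_sizes(55, boot_size))
```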

**Summary**

Well done! We've put together a system that can predict if customers are buying doggy boots that may not fit their avalanche dog, based solely on the size of harness they're purchasing.

In this exercise, we practiced:

- Creating basic models
- Training, then saving them to disk
- Loading them from disk
- Making predictions with them using new data sets

### Knowledge check

**Check your knowledge**

**What makes machine-learning algorithms different from traditional algorithms?**

Machine-learning algorithms are always more complicated to build than traditional algorithms.

Machine-learning algorithms must be trained every time they're used.

Machine-learning algorithms are shaped by data directly as part of development. Traditional algorithms are based almost entirely on theory or on opinions of the person writing the code.

**When do we want to perform training?**

Whenever we want to use a model

Only when we want to improve the model

Every time we load a model from file

**What is the relationship between a model, an objective, and training data?**

The training data is used to make changes to the model. These changes help the model get better at achieving the objective.

The training data is used to make changes to the objective. These changes help the objective be more like the model.

The model is used to make changes to the training data. These changes help the training data get better at achieving the objective.

### Summary


We covered some significant new jargon in this module. Let’s recap what we've learned:

The goal of machine learning is to find patterns in data and use these patterns to make estimates.

Machine learning differs from normal software development in that we use special code, rather than our own intuition, to improve how well the software works.

**The learning process conceptually uses four components:**

Data about the topic we're interested in.

A model, which makes estimates.

An objective the model is trying to achieve.

An optimizer, which is the extra code that changes the model depending on its performance.

Data can be thought of as features and labels. Features correspond to potential model inputs, while labels correspond to model outputs, or desired model outputs.

Pandas and Plotly are powerful tools to explore datasets in Python.

Once we have a trained model, we can save it to disk for later use.
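
The four components from the recap (data, model, objective, optimizer) can be sketched in a few lines of plain Python. This is an illustration only: the data is made up, and gradient descent is our choice of optimizer for the sketch, not necessarily what `statsmodels` does internally.

```python
# Data: features (inputs) and labels (desired outputs).
# Made-up values that lie exactly on the line label = 2 * feature + 1.
features = [1.0, 2.0, 3.0, 4.0]
labels = [3.0, 5.0, 7.0, 9.0]

# Model: a straight line with two adjustable parameters
def model(feature, slope, intercept):
    return slope * feature + intercept

# Objective: mean squared error between estimates and labels (lower is better)
def objective(slope, intercept):
    errors = [model(x, slope, intercept) - y for x, y in zip(features, labels)]
    return sum(e * e for e in errors) / len(errors)

# Optimizer: gradient descent, nudging the parameters to reduce the objective
def train(steps=5000, learning_rate=0.05):
    slope, intercept = 0.0, 0.0
    n = len(features)
    for _ in range(steps):
        errors = [model(x, slope, intercept) - y for x, y in zip(features, labels)]
        # Gradients of the mean squared error with respect to each parameter
        grad_slope = 2 * sum(e * x for e, x in zip(errors, features)) / n
        grad_intercept = 2 * sum(errors) / n
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

slope, intercept = train()
print("Slope:", round(slope, 3), "Intercept:", round(intercept, 3))
print("Final error:", objective(slope, intercept))
```

Training starts from a line that fits poorly (slope and intercept both zero) and repeatedly adjusts the parameters in the direction that lowers the error, which is the sense in which the data, rather than the programmer, shapes the model.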

**Module complete:**