# Exercise: Fitting a Polynomial Curve

In this exercise, we will look at a different type of regression called _polynomial regression_.
In contrast to _linear regression_, which models relationships as straight lines, _polynomial regression_ models relationships as curves.

Recall in our previous exercise how the relationship between `core_temperature` and `protein_content_of_last_meal` could not be properly explained using a straight line. In this exercise, we will use _polynomial regression_ to fit a curve to the data instead.

## Data visualisation

Let's start this exercise by loading in and having a look at our data.

In [1]:
!wget -P data https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/doggy-illness.csv
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py

In [2]:
import pandas

#Import the data from the .csv file
dataset = pandas.read_csv('data/doggy-illness.csv', delimiter="\t")

#Let's have a look at the data
dataset

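| | male | attended_training | age | body_fat_percentage | core_temperature | ate_at_tonys_steakhouse | needed_intensive_care | protein_content_of_last_meal |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 6.9 | 38 | 38.423169 | 0 | 0 | 7.66 |
| 1 | 0 | 1 | 5.4 | 32 | 39.015998 | 0 | 0 | 13.36 |
| 2 | 1 | 1 | 5.4 | 12 | 39.148341 | 0 | 0 | 12.90 |
| 3 | 1 | 0 | 4.8 | 23 | 39.060049 | 0 | 0 | 13.45 |
| 4 | 1 | 0 | 4.8 | 15 | 38.655439 | 0 | 0 | 10.53 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 93 | 0 | 0 | 4.5 | 38 | 37.939942 | 0 | 0 | 7.35 |
| 94 | 1 | 0 | 1.8 | 11 | 38.790426 | 1 | 1 | 12.18 |
| 95 | 0 | 0 | 6.6 | 20 | 39.489962 | 0 | 0 | 15.84 |
| 96 | 0 | 0 | 6.9 | 32 | 38.575742 | 1 | 1 | 9.79 |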


## Simple Linear Regression

Let's quickly jog our memory by performing the same _simple linear regression_ as we did in the previous exercise, using the `core_temperature` and `protein_content_of_last_meal` columns of the dataset.


In [3]:
import statsmodels.formula.api as smf
import graphing # custom graphing code. See our GitHub repo for details

# Perform linear regression. This method takes care of
# the entire fitting procedure for us.
simple_formula = "core_temperature ~ protein_content_of_last_meal"
simple_model = smf.ols(formula = simple_formula, data = dataset).fit()

# Show a graph of the result
graphing.scatter_2D(dataset, label_x="protein_content_of_last_meal", 
                             label_y="core_temperature",
                             # .iloc selects parameters by position: 0 is the intercept, 1 is the slope
                             trendline=lambda x: simple_model.params.iloc[1] * x + simple_model.params.iloc[0])


Notice how the relationship between the two variables is not truly linear. Looking at the plot, the points clearly fall more heavily on one side of the line, especially for the higher `core_temperature` and `protein_content_of_last_meal` values.
A straight line might not be the best way to describe this relationship.
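One way to see this kind of mismatch numerically is to look at the residuals (observed value minus fitted value). The sketch below does not use the dog dataset: it builds a small, made-up curved dataset, fits a straight line with the closed-form least-squares formulas, and prints the residuals at the low end, middle and high end. All names and values here are assumptions for illustration only.

```python
# Synthetic, made-up data: a gently curved (concave) relationship
xs = [i * 0.5 for i in range(21)]          # 0.0, 0.5, ..., 10.0
ys = [x - 0.05 * x**2 for x in xs]

# Closed-form least-squares fit of a straight line: y = intercept + slope * x
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Residual = observed value minus the value predicted by the line
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print("low end: ", round(residuals[0], 3))
print("middle:  ", round(residuals[n // 2], 3))
print("high end:", round(residuals[-1], 3))
```

If a straight line described the data well, the residuals would scatter around zero with no pattern. Here they share one sign at both ends and the opposite sign in the middle, which is the signature of a curve being approximated by a line.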

Let's have a quick look at the model's R<sup>2</sup> score:

In [4]:
print("R-squared:", simple_model.rsquared)

R-squared: 0.9155158150005704


That is quite a reasonable R<sup>2</sup> score, but let's see if we can get an even better one!
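As a reminder of what this score measures, here is a minimal sketch of the usual definition of R<sup>2</sup> for a model with an intercept: one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the numbers are made up for illustration and are not taken from the dataset.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total

# Made-up values purely for illustration
actual = [38.0, 38.5, 39.0, 39.5]
perfect = list(actual)                                   # predictions match exactly
mean_only = [sum(actual) / len(actual)] * len(actual)    # always predict the mean

print("perfect predictions:  ", r_squared(actual, perfect))
print("mean-only predictions:", r_squared(actual, mean_only))
```

Perfect predictions give a score of 1, while always predicting the mean gives a score of 0, so a higher score means the model explains more of the variation in the label.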

## Simple Polynomial Regression

Let's fit a _simple polynomial regression_ this time. Similarly to a _simple linear regression_, a _simple polynomial regression_ models the relationship between a label and a single feature. Unlike a _simple linear regression_, a _simple polynomial regression_ can explain relationships that are not simply straight lines. 

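In our example, we are going to use a three-parameter polynomial: an intercept, a term in `protein_content_of_last_meal`, and a term in `protein_content_of_last_meal` squared.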
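It may help to see that this kind of model is still fitted with ordinary least squares: the squared feature is simply treated as an extra column. The sketch below assumes NumPy is available and uses made-up data generated from a known quadratic, so the recovered coefficients can be compared against the ones we chose. It is an illustration of the idea, not the code `statsmodels` runs internally.

```python
import numpy as np

# Made-up data generated from a known quadratic, so we can check the fit
x = np.linspace(0.0, 10.0, 25)
y = 2.0 + 1.5 * x - 0.1 * x**2

# One column per parameter: intercept, x and x squared
design = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the transformed columns
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, linear_term, squared_term = coefficients

print("intercept:   ", intercept)
print("linear term: ", linear_term)
print("squared term:", squared_term)
```

The model is a curve in `x`, but it is still linear in its three parameters, which is why the same fitting procedure works.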

In [5]:
# Perform polynomial regression. This method takes care of
# the entire fitting procedure for us.
polynomial_formula = "core_temperature ~ protein_content_of_last_meal + I(protein_content_of_last_meal**2)"
polynomial_model = smf.ols(formula = polynomial_formula, data = dataset).fit()

# Show a graph of the result
graphing.scatter_2D(dataset, label_x="protein_content_of_last_meal", 
                             label_y="core_temperature",
                             # Our trendline is the equation for the polynomial
                             # .iloc selects parameters by position: 0 intercept, 1 linear term, 2 squared term
                             trendline=lambda x: polynomial_model.params.iloc[2] * x**2 + polynomial_model.params.iloc[1] * x + polynomial_model.params.iloc[0])


That looks a lot better already. Let's confirm by having a quick look at the R<sup>2</sup> score:

In [6]:
print("R-squared:", polynomial_model.rsquared)

R-squared: 0.9514426069911689


That's a better R<sup>2</sup> score than the one obtained from the previous model - great! We can now confidently tell our vet to prioritize dogs who ate a high-protein diet the night before.

## Extrapolating

Let's see what happens if we extrapolate our data. We would like to see if dogs that ate meals even higher in protein are expected to get even sicker.

Let's start with the _linear regression_. We can set what range we would like to extrapolate our data over by using the `x_range` argument in the plotting function. Let's extrapolate over the range `[0,100]`:


In [7]:
# Show an extrapolated graph of the linear model
graphing.scatter_2D(dataset, label_x="protein_content_of_last_meal", 
                             label_y="core_temperature",
                             # We extrapolate over the following range
                             x_range = [0,100],
                             trendline=lambda x: simple_model.params.iloc[1] * x + simple_model.params.iloc[0])


Next, we extrapolate the _polynomial regression_ over the same range:

In [8]:
# Show an extrapolated graph of the polynomial model
graphing.scatter_2D(dataset, label_x="protein_content_of_last_meal", 
                             label_y="core_temperature",
                             # We extrapolate over the following range
                             x_range = [0,100],
                             trendline=lambda x: polynomial_model.params.iloc[2] * x**2 + polynomial_model.params.iloc[1] * x + polynomial_model.params.iloc[0])


These two graphs predict two very different things!

The extrapolated _polynomial regression_ expects `core_temperature` to go down, while the extrapolated _linear regression_ expects `core_temperature` to go up.
A quick look at the graphs obtained in the previous exercise confirms that we should expect `core_temperature` to rise as `protein_content_of_last_meal` increases, not fall.

In general, extrapolating from a _polynomial regression_ is not recommended unless you have an a priori reason to do so. This is only very rarely the case, so it is best to err on the side of caution and avoid extrapolating from _polynomial regressions_.
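The reason a fitted quadratic can reverse direction outside the data is that every parabola has a turning point, located at minus the linear coefficient divided by twice the squared coefficient. The sketch below uses hypothetical coefficients chosen only for illustration (they are not the values fitted above) and the helper names are likewise made up.

```python
def quadratic(x, intercept, linear_term, squared_term):
    """Evaluate intercept + linear_term * x + squared_term * x**2."""
    return intercept + linear_term * x + squared_term * x**2

def turning_point(linear_term, squared_term):
    """x position where a quadratic stops rising and starts falling (or vice versa)."""
    return -linear_term / (2 * squared_term)

# Hypothetical coefficients for illustration only, NOT the fitted model's values
intercept, linear_term, squared_term = 37.0, 0.3, -0.01

peak_x = turning_point(linear_term, squared_term)
print("turning point at x =", peak_x)

# Compare a prediction below the turning point with one far beyond it
for x in (10, 100):
    print("x =", x, "->", quadratic(x, intercept, linear_term, squared_term))
```

With a negative squared coefficient, predictions rise up to the turning point and then fall without limit, so values far beyond the observed range can become physically meaningless even when the fit inside that range is good.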

## Summary

We covered the following concepts in this exercise:

- Built _simple linear regression_ and _simple polynomial regression_ models.
- Compared the performance of both models by plotting them and looking at their R<sup>2</sup> values.
- Extrapolated the models over a wider range of values.