# A First Jupyter Notebook
Last Updated: 2024-08-06 <jonathan.senning@gordon.edu>

In this example we are provided data relating the number of Calories burned per hour by people with different weights while riding a bicycle at 12 mph.  We want to use this data to develop a model that predicts the number of Calories per hour a person with a given weight will burn.

## Set up the environment

In [None]:
# Make figures appear in the notebook, not in separate windows
%matplotlib inline

# Import Python packages with frequently used nicknames
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Read in and examine the data

We're going to read the data and store it in a variable named `df`.  (There is nothing special about the name `df`; here it just means 'dataframe'.)

In [None]:
# Create a Pandas dataframe from a CSV file

# Define base URL used for class datasets in GitHub
baseURL = 'https://raw.githubusercontent.com/gordon-cs/cps330/main'
dataset_filename = baseURL + '/datasets/bicycling.csv'
df = pd.read_csv(dataset_filename)

In [None]:
# Get basic information about the data in the dataframe
df.info()

**Note:** From the output above we see that there are three columns, each with 20 data elements, and that the data records are indexed from 0 to 19.  The columns have labels 'Subject', 'Weight', and 'Cal/hr'.  We're also told that the type of data in each of the three columns is 'int64', which means each integer is stored in 64 bits (8 bytes).
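The same facts reported by `df.info()` can also be queried one at a time.  The sketch below uses a small stand-in dataframe named `demo_df` (a hypothetical name, with made-up values rather than the study data) so that it runs on its own; the column labels are the only thing taken from the real dataset.

```python
import pandas as pd

# Stand-in dataframe with the same column labels as the study data.
# The values are made up for illustration; they are not the real dataset.
demo_df = pd.DataFrame({
    'Subject': [1, 2, 3],
    'Weight': [130, 155, 190],
    'Cal/hr': [480, 560, 680],
}, dtype='int64')

print(demo_df.shape)          # (number of rows, number of columns)
print(list(demo_df.columns))  # column labels
print(demo_df.dtypes)         # data type of each column
```

Applied to the notebook's own `df`, the same three attributes should agree with what `df.info()` printed.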

In [None]:
# Examine the first five data records in each column
df.head()

**Note:** The numbers in the far left column are printed automatically and represent the indices that Pandas uses for the data.  The numbers in the 'Subject' column represent ID numbers of people in the study.

Just as we used `df.head()`, the command `df.tail()` can be used to inspect the last few data records.

In [None]:
df.tail()

## Plot data

Next we'll plot the data.  Because we will do this more than once it's helpful to create a Python function that we can use multiple times.

In [None]:
def plot_data(plt, weight, cal_per_hour, marker='bo'):
    """Create scatter plot of study data"""
    plt.plot(np.array(weight), np.array(cal_per_hour), marker)
    plt.xlim((min(weight)-20, max(weight)+20))
    plt.ylim((min(cal_per_hour)-20,max(cal_per_hour)+20))
    plt.xlabel('Weight (lbs)')
    plt.ylabel('Calories burned per hour')
    plt.title('Calories per hour burned while riding a bicycle at 12 mph')

# Use the function we just created to plot the data
plot_data(plt, df['Weight'], df['Cal/hr'])
plt.show()

## Prepare the data

The first thing we should do (actually, we **should** have done it before graphing our data) is separate the data into a *training set* and a *test set*.  While this is often done randomly, here we select specific records so that the split matches the example shown in the class presentation.

In [None]:
# Create a test dataframe by extracting the test data
test_df = df.iloc[[4,6,11,16,17]]

In [None]:
# Create a training dataframe by dropping test data
train_df = df.drop([4,6,11,16,17])

**Note:** As we noted before, in real problems we don't look at the test data even though we do so here.
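For comparison, a random split might look like the following sketch.  It is not what this notebook does: the dataframe `demo_df`, the 25% test fraction, and the seed value are all assumptions chosen for illustration, and the values are made up rather than taken from the study.

```python
import pandas as pd

# Stand-in dataframe of 20 records; the values are made up for illustration
demo_df = pd.DataFrame({
    'Subject': range(1, 21),
    'Weight': range(110, 210, 5),
    'Cal/hr': range(400, 800, 20),
})

# Randomly choose 25% of the rows for testing; a fixed seed makes it repeatable
demo_test_df = demo_df.sample(frac=0.25, random_state=0)

# Every row not chosen for testing becomes part of the training set
demo_train_df = demo_df.drop(demo_test_df.index)

print(len(demo_train_df), len(demo_test_df))
```

Dropping by `demo_test_df.index` mirrors the `df.drop([...])` call used above, so the two sets cannot share a record.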

In [None]:
test_df.head()

In [None]:
# Create 1-dimensional arrays from training data
weight = train_df['Weight']
cal_per_hour = train_df['Cal/hr']
plot_data(plt, weight, cal_per_hour)
plt.show()

## Model Development

The scatter plot of the training data suggests that there is a linear relationship between a person's weight and the number of Calories per hour that they burn while riding a bicycle.  This indicates that a *linear model* would do well for this problem.

In [None]:
def simple_linear_regression(x, y):
    """Compute intercept and slope using linear regression"""
    n = len(x)
    x_sum = sum(x)
    y_sum = sum(y)
    xy_sum = sum(x*y)
    x2_sum = sum(x*x)

    # See, for example, https://www.geeksforgeeks.org/linear-regression-formula/
    a = (y_sum*x2_sum-x_sum*xy_sum)/(n*x2_sum-x_sum*x_sum)
    b = (n*xy_sum-x_sum*y_sum)/(n*x2_sum-x_sum*x_sum)
    return (a, b)

Create a function to plot a line through a domain that contains all the weight values using a provided intercept and slope.

In [None]:
def plot_fit(plt, a, b, weight):
    """Draw line with y-intercept a and slope b in weight domain"""
    u = np.array([min(weight)-20, max(weight)+20]) # compute independent values
    v = a + b * u                                  # compute dependent values
    plt.plot(u, v, 'r-', linewidth=1)
    plt.legend( ('Data', f'$y=a+bx$ with $a={a:.3f}$, $b={b:.3f}$'), loc='lower right')

While we know it will not give a very good result, we use the average Calories per hour as the predicted value for riders of all weights.  This will look like a horizontal line on the graph.

In [None]:
# While clearly not optimal, we start by plotting a horizontal line
plot_data(plt, weight, cal_per_hour)
plot_fit(plt, np.average(cal_per_hour), 0, weight)
plt.show()

Now let's use linear regression to determine the *parameters* of a linear model to fit the data.  In this case the parameters are the intercept and slope of the line that *best fits* the data in a least-squares sense.

In [None]:
# Perform linear regression on training data
# to compute the intercept a and slope b
a, b = simple_linear_regression(weight, cal_per_hour)
plot_data(plt, weight, cal_per_hour)
plot_fit(plt, a, b, weight)
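
One way to gain confidence in the closed-form formulas is to compare them with NumPy's own least-squares fit.  The sketch below repeats the formulas from `simple_linear_regression` in a helper named `closed_form_fit` (a hypothetical name) and checks them against `np.polyfit` on a few made-up points; the data values are assumptions for illustration, not the study data.

```python
import numpy as np

def closed_form_fit(x, y):
    """Intercept a and slope b from the same formulas used above"""
    n = len(x)
    x_sum, y_sum = np.sum(x), np.sum(y)
    xy_sum, x2_sum = np.sum(x * y), np.sum(x * x)
    denom = n * x2_sum - x_sum * x_sum
    a = (y_sum * x2_sum - x_sum * xy_sum) / denom
    b = (n * xy_sum - x_sum * y_sum) / denom
    return a, b

# Made-up points that lie near a line; not the study data
x = np.array([120.0, 140.0, 160.0, 180.0, 200.0])
y = np.array([455.0, 520.0, 598.0, 661.0, 730.0])

a_check, b_check = closed_form_fit(x, y)

# np.polyfit returns coefficients from highest degree to lowest,
# so for degree 1 the order is [slope, intercept]
b_np, a_np = np.polyfit(x, y, 1)

# Both comparisons should report agreement to within rounding error
print(np.isclose(a_check, a_np), np.isclose(b_check, b_np))
```

Note the order of the values returned by `np.polyfit`: the slope comes first, which is the reverse of the `(a, b)` order used in this notebook.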

In the Machine Learning context, we often use the term *cost* to mean the measure of the *error* or *difference* between the actual data and the estimates from the linear model.  A cost of 0 means that the linear model exactly predicts the data values.

There are various ways to compute cost.  Here we compute the *sum of squared residuals*, the sum of all the squares of the distances between each actual data value and the value predicted by the model, and divide it by twice the number of data values.
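
Written as a formula, with $x_i$ the weights, $y_i$ the measured Calories per hour, and $n$ the number of data values (symbols introduced here for illustration), this description corresponds to:

$$J(a,b) = \frac{1}{2n}\sum_{i=1}^{n}\left(a + b\,x_i - y_i\right)^2$$

Each term $a + b\,x_i - y_i$ is one residual, so $J(a,b)=0$ only when the line passes through every data point.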

In [None]:
def cost(a, b, x, y):
    """Compute the cost function J(a,b)"""
    n = len(x)
    return sum((a + b * x - y)**2)/(2*n)

In [None]:
J = cost(a, b, weight, cal_per_hour)
print(f'The model cost using the training data is {J:6.2f}')

## Check performance using test data

Once a model has been trained, it is important to validate it.  Here we will use our test data -- which was not used during training.

In [None]:
plot_data(plt, test_df['Weight'], test_df['Cal/hr'])
plot_fit(plt, a, b, test_df['Weight'])

In [None]:
J = cost(a, b, test_df['Weight'], test_df['Cal/hr'])
print(f'Cost using the test data is {J:6.2f}')

Not surprisingly, the cost using the test data is larger than the cost we found when using the training data.  Ideally we want both these cost values to be as small as possible and reasonably close together.  If the cost computed from the test data is significantly larger than the training cost, then the model is probably not doing a very good job of predicting data.
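
Once the parameters are judged acceptable, using the model comes down to evaluating the line at a new weight.  The sketch below assumes a helper named `predict` (hypothetical, not part of this notebook) and uses placeholder parameter values; in practice `a` and `b` would be the values returned by `simple_linear_regression`.

```python
import numpy as np

def predict(a, b, weight):
    """Estimate Calories per hour from weight using the line y = a + b*x"""
    return a + b * np.asarray(weight, dtype=float)

# Placeholder parameters for illustration only; in the notebook these
# would be the a and b returned by simple_linear_regression
a_demo, b_demo = 40.0, 3.5

print(predict(a_demo, b_demo, 150))         # a single weight
print(predict(a_demo, b_demo, [130, 170]))  # several weights at once
```

Predictions are most trustworthy for weights inside the range covered by the training data; the line says little about riders far outside it.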