# A First Jupyter Notebook
Last Updated: 2024-08-06 <jonathan.senning@gordon.edu>

This notebook reproduces the work for the example shown during Friday's class presentation.

In this example we are given data on the number of Calories burned per hour by people of different weights while riding a bicycle at 12 mph.  We want to use this data to develop a model that predicts the number of Calories per hour a person of a given weight will burn.

## Prep dataset
We first need to make sure our dataset is available.  We try to clone the class repository; if it already exists, we make sure it is up-to-date instead.

To do this, we use a "shell escape" -- the exclamation point at the start of the line causes the entire command to be executed by the host operating system's command interpreter rather than as Python code.  In this case, the `git clone` command is used to make a copy of the class repository.  This only needs to be done one time, so any subsequent attempt to clone the repository will result in a 'cps330 already exists' error.  If this is the case, then we use `git fetch` and `git pull` to make sure the repository is up-to-date.

In [None]:
!if ! git clone https://github.com/gordon-cs/cps330.git; then cd cps330 && git fetch && git pull; fi

## Set up the environment

In [None]:
# Cause Matplotlib figures to appear in the notebook rather than
#   in separate windows
%matplotlib inline

# Import Python packages with frequently used nicknames
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Read in and examine the data

In [None]:
# Create a Pandas dataframe from a CSV file
df = pd.read_csv('cps330/datasets/bicycling.csv')

In [None]:
# Get basic information about the data in the dataframe
df.info()

**Note:** From the output above we see that there are three columns of 20 data elements.  The columns have labels 'Subject', 'Weight', and 'Cal/hr', and the data records are indexed from 0 to 19.  We're also told that the type of data in each of the three columns is 'int64', which means an integer stored in 64 bits (8 bytes).
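
The same facts can also be read programmatically. The following is a minimal sketch using a small made-up dataframe (`demo_df` and its values are hypothetical, not the study data) that shows where the column types and the record count come from.

```python
import pandas as pd

# Hypothetical three-record dataframe; the values are made up for illustration
demo_df = pd.DataFrame({
    'Subject': [1, 2, 3],
    'Weight': [130, 155, 180],
    'Cal/hr': [354, 422, 490],
}, dtype='int64')

# dtypes lists the type of each column; shape gives (rows, columns)
demo_dtypes = demo_df.dtypes
demo_shape = demo_df.shape
print(demo_dtypes)
print(demo_shape)
```

Here `demo_shape` plays the role of the "three columns of 20 data elements" summary, just for a much smaller table.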

In [None]:
# Examine the first five data records.
df.head()

**Note:** The numbers in the far left column are printed automatically and represent the indices that Pandas uses for the data.  The numbers in the 'Subject' column represent ID numbers of people in the study.

The command `df.tail()` can be used to inspect the last few data records.

## Plot data

Next we'll plot the data.  Because we will do this more than once it's helpful to create a Python function that we can use multiple times.

In [None]:
def plot_data(plt, weight, cal_per_hour, marker='bo'):
    """Create scatter plot of study data"""
    plt.plot(np.array(weight), np.array(cal_per_hour), marker)
    plt.xlim((min(weight)-20, max(weight)+20))
    plt.ylim((min(cal_per_hour)-20,max(cal_per_hour)+20))
    plt.xlabel('Weight (lbs)')
    plt.ylabel('Calories burned per hour')
    plt.title('Calories per hour burned while riding a bicycle at 12 mph')

# Use the function we just created to plot the data
plot_data(plt, df['Weight'], df['Cal/hr'])
plt.show()

## Prepare the data

The first thing we should do (and probably should have done before graphing our data) is separate the data into a *training set* and a *test set*.  While the records for each set are often chosen randomly, here we pick specific records so that the split matches the example shown in the class presentation.

In [None]:
# Create a test dataframe by extracting the test data
test_df = df.iloc[[4,6,11,16,17]]

In [None]:
# Create a training dataframe by dropping test data
train_df = df.drop([4,6,11,16,17])

**Note:** In real problems we don't look at the test data, even though we do here.
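
For comparison, a random split might look like the sketch below. It is not what this notebook does; the dataframe `demo_df`, its values, the 20% test fraction, and the seed are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the study dataframe (values are made up)
demo_df = pd.DataFrame({
    'Subject': range(1, 11),
    'Weight': [110, 120, 130, 140, 150, 160, 170, 180, 190, 200],
    'Cal/hr': [300, 330, 355, 380, 410, 435, 465, 490, 520, 545],
})

# Randomly select 20% of the records for the test set; random_state fixes
# the seed so the same split is produced each time the cell is run
demo_test_df = demo_df.sample(frac=0.2, random_state=0)

# Every record that was not selected becomes part of the training set
demo_train_df = demo_df.drop(demo_test_df.index)
print(len(demo_train_df), len(demo_test_df))
```

Dropping by `demo_test_df.index` mirrors the `df.drop([...])` call above, with the index labels chosen by `sample` rather than by hand.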

In [None]:
test_df.head()

In [None]:
# Extract the weight and Calories-per-hour columns from the training data.
# These are Pandas Series; NumPy arrays would also work:
#weight = np.array(train_df['Weight'])
#cal_per_hour = np.array(train_df['Cal/hr'])
weight = train_df['Weight']
cal_per_hour = train_df['Cal/hr']
plot_data(plt, weight, cal_per_hour)
plt.show()

## Model Development

The scatter plot of the training data suggests that there is a linear relationship between a person's weight and the number of Calories per hour that they burn while riding a bicycle, so a *linear model* might be appropriate for this problem.

In [None]:
# Create a linear function
def f(x, a=-10, b=2.8):
    """Evaluate a+bx"""
    return a + b * x

def do_regression(x, y):
    """Compute intercept and slope using linear regression"""
    n = len(x)
    x_sum = sum(x)
    y_sum = sum(y)
    xy_sum = x.dot(y) # or could be sum(x*y)
    x2_sum = x.dot(x) # or could be sum(x*x)

    a = (y_sum*x2_sum-x_sum*xy_sum)/(n*x2_sum-x_sum*x_sum)
    b = (n*xy_sum-x_sum*y_sum)/(n*x2_sum-x_sum*x_sum)
    return (a, b)
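
One way to gain confidence in these formulas is to try them on points that lie exactly on a known line and to compare the result with NumPy's own least-squares fit. The sketch below restates the formulas under the hypothetical name `fit_line` so that it runs on its own; the data points are made up.

```python
import numpy as np

def fit_line(x, y):
    """Closed-form intercept and slope (same formulas as do_regression)"""
    n = len(x)
    x_sum, y_sum = x.sum(), y.sum()
    xy_sum, x2_sum = x.dot(y), x.dot(x)
    denom = n * x2_sum - x_sum * x_sum
    a = (y_sum * x2_sum - x_sum * xy_sum) / denom
    b = (n * xy_sum - x_sum * y_sum) / denom
    return (a, b)

# Made-up points that lie exactly on y = 5 + 3x
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = 5 + 3 * x_demo

a_demo, b_demo = fit_line(x_demo, y_demo)

# np.polyfit returns coefficients from highest degree to lowest,
# so a degree-1 fit gives (slope, intercept)
b_np, a_np = np.polyfit(x_demo, y_demo, 1)
print(a_demo, b_demo)
print(a_np, b_np)
```

Both approaches should recover an intercept of 5 and a slope of 3, up to floating-point rounding.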

Create a function to plot a line through a domain that contains all the weight values using a provided intercept and slope.

In [None]:
def plot_fit(plt, a, b, weight):
    """Draw line with y-intercept a and slope b in weight domain"""
    u = np.array([min(weight)-20, max(weight)+20]) # compute independent values
    v = a + b * u                                  # compute dependent values
    plt.plot(u, v, 'r-', linewidth=1)
    plt.legend( ('Data', f'$y=a+bx$ with $a={a:.3f}$, $b={b:.3f}$'), loc='lower right')

In [None]:
# While clearly not optimal, we start by plotting a horizontal line
plot_data(plt, weight, cal_per_hour)
plot_fit(plt, np.average(cal_per_hour), 0, weight)
plt.show()

In [None]:
# Perform linear regression to compute the intercept a and slope b
a, b = do_regression(weight, cal_per_hour)
plot_data(plt, weight, cal_per_hour)
plot_fit(plt, a, b, weight)
plt.show()

In [None]:
def cost(a, b, x, y):
    """Compute the cost function J(a,b)"""
    n = len(x)
    return sum((a + b * x - y)**2)/(2*n)
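
The cost computed here is half the mean of the squared residuals, $J(a,b) = \frac{1}{2n}\sum_{i=1}^{n}(a + b x_i - y_i)^2$. As a small worked example, the sketch below evaluates it on three made-up points; `cost_demo` is a hypothetical restatement of `cost` so that the cell runs on its own.

```python
import numpy as np

def cost_demo(a, b, x, y):
    """Half the mean squared residual, as in cost() above"""
    return np.sum((a + b * x - y) ** 2) / (2 * len(x))

# Made-up points that lie exactly on y = 2x
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([2.0, 4.0, 6.0])

# With a=0, b=2 every residual is zero, so the cost is zero
J_exact = cost_demo(0, 2, x_demo, y_demo)

# With a=0, b=1 the residuals are -1, -2, -3, so J = (1+4+9)/(2*3) = 14/6
J_off = cost_demo(0, 1, x_demo, y_demo)
print(J_exact, J_off)
```

A smaller value of the cost means the line passes closer to the data, which is why the regression line should score lower than the horizontal line plotted earlier.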

In [None]:
J = cost(a,b,weight,cal_per_hour)
print(f'The cost using the training data is {J:6.2f}')

## Check performance using test data

In [None]:
# Fit the model using the training data only, then plot the fitted
# line against the test data
a, b = do_regression(weight, cal_per_hour)
plot_data(plt, test_df['Weight'], test_df['Cal/hr'])
plot_fit(plt, a, b, test_df['Weight'])
plt.show()

In [None]:
J = cost(a, b, test_df['Weight'], test_df['Cal/hr'])
print(f'The cost using the test data is {J:6.2f}')