In [None]:
import numpy as np
from datascience import *
import matplotlib.pyplot as plt

%matplotlib inline

# Class 18 Warm Up -- Fitting a Line to Data (Linear Regression)
For this exercise, we will once again be working with ratings of Pixar movies.
The goals are:
* Learn how to fit a line to data
* Use the fitted line to make predictions

## We begin with the same steps as last time...

## Load the data

In [None]:
url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-03-11/public_response.csv'
ratings = Table.read_table(url)
ratings.show(3)

A couple of the rows have NaN (not a number) markers indicating missing data. The tables in the datascience module we use do not have any built-in methods for removing these, but Pandas dataframes do. So we will transform the table to a dataframe, drop the rows with NaNs, then transform the dataframe back to a table.

Remember this trick. It may come in handy for the final project.

In [None]:
# Convert to a pandas DataFrame, drop every row that contains a NaN, then convert back to a Table
df = ratings.to_df().dropna()
ratings = Table.from_df(df)
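
To see what `dropna()` does on its own, here is a minimal sketch on a tiny made-up dataframe. The column names and values are invented for illustration and are not from the Pixar data; the sketch assumes pandas is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: film "B" is missing one of its scores
toy = pd.DataFrame({
    "film": ["A", "B", "C"],
    "score_1": [90.0, np.nan, 75.0],
    "score_2": [88.0, 70.0, 80.0],
})

# dropna() returns a new dataframe without any row that contains a NaN
cleaned = toy.dropna()

print(len(toy), "rows before,", len(cleaned), "rows after")
print(cleaned)
```

Note that `dropna()` removes the whole row, so film "B" loses its valid `score_2` value as well.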

Now create a new table with just the numeric ratings.

In [None]:
rmc = ratings.select("rotten_tomatoes", "metacritic", "critics_choice")
rmc.show(3)

## Scatter Plot
Let's look at how well two of the rating services, Metacritic and Critics Choice, track each other.

In [None]:
rmc.scatter("metacritic", "critics_choice")

## This time we will fit a line to data

Define the functions we will need:

In [None]:
def standard_units(xyz):
    '''Returns data in standard units'''
    return (xyz - np.mean(xyz)) / np.std(xyz)

def correlation(t, label_x, label_y):
    '''Calculates the correlation coefficient'''
    return np.mean(standard_units(t.column(label_x)) * standard_units(t.column(label_y)))

def slope(t, label_x, label_y):
    '''Finds the slope of the best-fit line in the original units.'''
    r = correlation(t, label_x, label_y)
    return r * np.std(t.column(label_y)) / np.std(t.column(label_x))

def intercept(t, label_x, label_y):
    '''Finds the intercept of the best-fit line in the original units.'''
    return np.mean(t.column(label_y)) - slope(t, label_x, label_y) * np.mean(t.column(label_x))
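
As a sanity check on these formulas, the sketch below applies the same calculations to plain NumPy arrays and compares the result with NumPy's own least-squares fit, `np.polyfit`. The `*_arrays` function names and the toy data are assumptions made for this illustration; they are not part of the datascience module or the Pixar data.

```python
import numpy as np

def standard_units(values):
    '''Returns data in standard units'''
    return (values - np.mean(values)) / np.std(values)

def correlation_arrays(x, y):
    '''Correlation coefficient of two arrays (hypothetical array-based variant)'''
    return np.mean(standard_units(x) * standard_units(y))

def slope_arrays(x, y):
    '''Slope of the best-fit line in the original units'''
    return correlation_arrays(x, y) * np.std(y) / np.std(x)

def intercept_arrays(x, y):
    '''Intercept of the best-fit line in the original units'''
    return np.mean(y) - slope_arrays(x, y) * np.mean(x)

# Made-up data that lies roughly along a line
toy_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

m_toy = slope_arrays(toy_x, toy_y)
b_toy = intercept_arrays(toy_x, toy_y)

# np.polyfit with degree 1 returns (slope, intercept) of the least-squares line
m_ref, b_ref = np.polyfit(toy_x, toy_y, 1)

print("from the formulas:", m_toy, b_toy)
print("from np.polyfit:  ", m_ref, b_ref)
```

The two pairs should agree up to floating-point rounding, because the correlation-based slope and intercept describe the same least-squares line that `np.polyfit` computes.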

## Student Challenge 1
What are the slope and the intercept for the best-fit line for the scatter plot above?

In [None]:
m = ...
b = ...

In [None]:
# x & y values
xi = rmc.column("metacritic")
yi = rmc.column("critics_choice")

# linspace includes both endpoints, so the line spans the full range of the data
x = np.linspace(min(xi), max(xi), 100)
y = m * x + b

plt.plot(xi, yi, '*')
plt.plot(x, y)
plt.xlabel("Metacritic Rating")
plt.ylabel("Critics Choice Rating");

## Student Challenge 2
Based on your fitted line, if a new Pixar movie came out with a Metacritic rating of 75, what Critics Choice rating would you predict?
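
A prediction from a fitted line is just the line evaluated at the new x value. Here is a minimal sketch of the mechanics, using a made-up slope and intercept rather than the values for the Pixar data; `predict`, `m_example`, and `b_example` are hypothetical names.

```python
def predict(m, b, x_new):
    '''Evaluates the line y = m * x + b at x_new'''
    return m * x_new + b

# Made-up line, NOT the fit for the ratings data
m_example = 0.5
b_example = 10.0

print(predict(m_example, b_example, 40))  # 0.5 * 40 + 10.0 = 30.0
```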

## Student Challenge 3
What is the r-squared value of the fit? In your opinion, is this a strong, moderate, or weak correlation?
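
For a straight-line fit, r-squared is the square of the correlation coefficient, and it equals the fraction of the variation in y that the line accounts for. The sketch below checks this on made-up data; the arrays and variable names are assumptions for illustration only.

```python
import numpy as np

# Made-up data, not the Pixar ratings
toy_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_y = np.array([2.0, 4.5, 5.0, 8.5, 9.0])

# Square of the correlation coefficient
r = np.corrcoef(toy_x, toy_y)[0, 1]
r_squared = r ** 2

# Fraction of the variation in y explained by the least-squares line
m_toy, b_toy = np.polyfit(toy_x, toy_y, 1)
residuals = toy_y - (m_toy * toy_x + b_toy)
total_variation = np.sum((toy_y - np.mean(toy_y)) ** 2)
explained = 1 - np.sum(residuals ** 2) / total_variation

print(r_squared, explained)
```

The two printed numbers should match up to rounding. Values near 1 mean the points sit close to the line; values near 0 mean the line explains little of the variation.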