**What is R-squared?**

Imagine you're trying to predict your friend's mood based on how many hours of sleep they got last night. R-squared is a number that tells you how much of their mood can be explained by just that one factor: sleep.

Think of it like a percentage:

**100% R-squared (or 1.0):** This would mean that 100% of your friend's mood is perfectly explained by how much they slept. If they get 8 hours, they're always happy. If they get 4, they're always grumpy. It's a perfect relationship.

**0% R-squared (or 0.0):** This would mean that sleep explains none of the ups and downs in their mood. Knowing how long they slept gives you no better guess than simply assuming their average mood.

**70% R-squared (or 0.7):** This means that 70% of the ups and downs in your friend's mood can be explained by their sleep. The other 30% is due to other things—maybe they had a bad day at school, or they're hungry.

**In short, R-squared is a simple way to measure how well a model fits the data. The closer the number is to 1, the better your model is at explaining what's happening.**
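
The percentages above come from a short formula: R-squared is 1 minus the unexplained variation divided by the total variation. Here is a minimal sketch of that calculation. The function name r_squared_by_hand and the mood numbers are made up for illustration and are not part of any library.

```python
import numpy as np

def r_squared_by_hand(actual, predicted):
    """Return 1 - (unexplained variation / total variation)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Unexplained variation: squared gaps between the real values and the guesses
    ss_residual = np.sum((actual - predicted) ** 2)
    # Total variation: squared gaps between the real values and their plain average
    ss_total = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_residual / ss_total

# Hypothetical mood scores (1 to 10) and a model's guesses, invented for this sketch
mood = [3, 5, 6, 8, 9]
guessed_mood = [4, 5, 6, 7, 9]

by_hand = r_squared_by_hand(mood, guessed_mood)
print(f"R-squared by hand: {by_hand:.2f}")
```

If the guesses matched the real values exactly, the unexplained variation would be 0 and the result would be 1. If the guesses were no better than always guessing the average, the result would be 0.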


In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Sample data: We'll make up some numbers for daily temp and ice cream sales.
# Notice how sales generally go up with temperature.
data = {'temperature': [20, 25, 22, 28, 30, 26, 31, 29, 24, 27],
        'ice_cream_sales': [100, 150, 120, 180, 200, 160, 210, 190, 140, 170]}
df = pd.DataFrame(data)

# Our variables:
# X (the thing we're using to predict) is the temperature
# y (the thing we're trying to predict) is the ice cream sales
X = df[['temperature']]
y = df['ice_cream_sales']

# Create a simple "predictor" (a linear regression model)
predictor = LinearRegression()

# Tell the predictor to "learn" from our data
predictor.fit(X, y)

# Ask the predictor to make its guesses for sales based on our temperatures
predicted_sales = predictor.predict(X)

# Now we calculate the R-squared. This tells us how good the predictor's guesses were.
r_squared = r2_score(y, predicted_sales)

print(f"R-squared value: {r_squared:.2f}")

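# With this made-up data the output is 1.00, because every point lies exactly
# on the line sales = 10 * temperature - 100.
# This means that 100% of the changes in ice cream sales are explained by the daily temperature.
# Real-world data is noisier, so a perfect score like this is rare.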
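
It can help to look at the straight line the predictor actually learned. The sketch below reuses the same made-up numbers as the example above and prints the slope and intercept. It also shows that LinearRegression has a score method that returns R-squared directly, so calling r2_score separately is optional. The variable names slope, intercept, r2_from_score and r2_from_metric are chosen here for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same made-up temperature and sales numbers as the example above
df = pd.DataFrame({
    'temperature': [20, 25, 22, 28, 30, 26, 31, 29, 24, 27],
    'ice_cream_sales': [100, 150, 120, 180, 200, 160, 210, 190, 140, 170],
})
X = df[['temperature']]
y = df['ice_cream_sales']

predictor = LinearRegression().fit(X, y)

# The fitted straight line: sales = slope * temperature + intercept
slope = predictor.coef_[0]
intercept = predictor.intercept_
print(f"Fitted line: sales = {slope:.2f} * temperature + {intercept:.2f}")

# score() computes R-squared on whatever data you pass in
r2_from_score = predictor.score(X, y)
r2_from_metric = r2_score(y, predictor.predict(X))
print(f"score(): {r2_from_score:.2f}   r2_score(): {r2_from_metric:.2f}")
```

Both ways of computing R-squared give the same number, so use whichever reads better in your own code.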

**Assignment 1: Temperature vs. Sales**
**Goal:** Understand how a variable's relevance affects R-squared.

Start with the provided Python example. Run the code and observe the R-squared value. What does this number tell you about the relationship between temperature and ice cream sales?

**Change the data:** In the data dictionary, replace the temperature list with random numbers that have no clear relationship to the ice_cream_sales. For example, use numbers like [45, 10, 32, 6, 50, 21, 15, 40, 5, 29].

**Run the code again:** Re-run the script with the new data. What is the R-squared value now?

**Explain:** Why did the R-squared value drop so much? What does the new, lower number mean in the context of our ice cream sales model?

In [None]:
# Your code here

**Assignment 2: Predicting Test Scores**
**Goal:** See how a model's complexity relates to its R-squared value.

**Create a dataset:** Make a pandas DataFrame with three columns: hours_studied, hours_slept, and test_score.

**Add data:** Populate the dataset with at least 10 rows of numbers. Try to make both hours_studied and hours_slept have a positive relationship with test_score.

**Model 1:** Build a linear regression model that uses only hours_studied to predict test_score. Calculate and print the R-squared value.

**Model 2:** Build a new linear regression model that uses both hours_studied and hours_slept to predict test_score. Calculate and print the R-squared value for this new model.

**Compare:** Is the R-squared value higher for the second model? Why do you think adding more relevant information (hours_slept) improved the model's ability to explain the data?
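
One caveat to keep in mind for the comparison: when R-squared is measured on the same data the model was fitted to, adding a predictor cannot make it go down, even if the new predictor is useless. The sketch below illustrates this with invented numbers. The random_noise column and the formula used to generate test_score are assumptions made only for this demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical data, invented for this sketch
hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
test_score = 50 + 4 * hours_studied + rng.normal(0, 3, size=10)
# A column of pure noise that has nothing to do with the score
random_noise = rng.normal(0, 1, size=10)

X_one = hours_studied.reshape(-1, 1)
X_two = np.column_stack([hours_studied, random_noise])

# Fit each model and measure R-squared on the same data it was trained on
r2_one = LinearRegression().fit(X_one, test_score).score(X_one, test_score)
r2_two = LinearRegression().fit(X_two, test_score).score(X_two, test_score)

print(f"One predictor:          {r2_one:.4f}")
print(f"Plus a noise predictor: {r2_two:.4f}")
```

So a higher R-squared for Model 2 is expected either way. The more interesting question is how much it rises, and whether the increase is large enough to suggest that hours_slept carries real information.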

In [None]:
# Your code here

**Assignment 3: Misleading R-squared**
**Goal:** Understand that a high R-squared doesn't always mean a good model.

**Create a curved dataset:** In a new Python script, create a DataFrame with x and y columns. Set x to the whole numbers from 0 to 10 (for example with np.arange(0, 11)). Set y using the formula y = x**2. This creates a curve that bends upward more and more steeply. Avoid a range that is symmetric around zero, such as -5 to 5: the full U shape has no overall upward or downward trend, so the best straight line is flat and R-squared comes out as 0.

**Fit a straight line:** Use LinearRegression to fit a straight line to this curved data.

**Calculate and print:** Print the R-squared value. It will be surprisingly high (about 0.93), even though a straight line cannot follow the curve.

**Plot the data:** Use a plotting library like Matplotlib to create a scatter plot of your x and y points. Then, plot the straight line from your LinearRegression model on the same graph.

**Analyze:** Looking at the plot, does the straight line accurately represent the data? Why is R-squared high even though the line systematically misses the curve? This shows that R-squared is a useful metric, but it should be used alongside other checks, like a visual inspection of the data and the model's predictions.
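
Besides plotting, one of those other checks can be done with numbers alone: look at the residuals, meaning the actual values minus the predicted values. The sketch below is one possible way to do that, and it assumes x is the whole numbers 0 to 10 with y = x**2. For a good fit the residuals bounce randomly around zero. Here they follow a clear pattern instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption for this sketch: x is the whole numbers 0 to 10 and y = x**2
x = np.arange(0, 11)
y = x ** 2
X = x.reshape(-1, 1)

line = LinearRegression().fit(X, y)
residuals = y - line.predict(X)  # actual minus predicted
r2_line = line.score(X, y)

print(f"R-squared: {r2_line:.2f}")

# Print each residual so the pattern is visible without a plot
for x_value, residual in zip(x, residuals):
    print(f"x = {int(x_value):2d}   residual = {residual:6.1f}")
```

The residuals start positive, turn negative in the middle, and end positive again. That kind of systematic pattern is a sign that a straight line is the wrong shape for the data, whatever the R-squared value says.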

In [None]:
# Your code here