# CSS 300 | Spring 2025 | Module 8 | Project Model Development
<hr style="border: 5px solid #005A96;" />

##### *The very first thing you should do: rename this file by replacing "YOURNAME" with your actual first name. For this assignment, you should submit all the files that are required for your Jupyter notebook to run properly. For example, if you are using `pandas.read_csv()` to create your `DataFrame`, then you should include the (for example) `.csv` file in your submission.*

#### Bottom line: make sure that Lucas can run all cells in your Jupyter notebook without errors occurring!

If you find yourself uncertain about how to do something, you should (in order):

- Have a look at the [`pandas` API reference](https://pandas.pydata.org/docs/reference/index.html)
- Consider also the [`Matplotlib`](https://matplotlib.org/stable/api/index.html) and [`Seaborn`](https://seaborn.pydata.org/api.html) API references
- Ask Lucas or a Learning Support Specialist for help

*You are reminded that the use of generative AI in CSS 300, in any shape or form, is considered academic dishonesty and will result in a grade of zero (and possibly worse!).*

<hr style="border: 5px solid #005A96;" />

### Imports go here (feel free to import more libraries if needed). Don't forget to import your dataset as a `pandas` `DataFrame` too (of course, you **must** use the data you have chosen for your final project in this assignment).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# You should import your dataset below (perhaps with pd.read_csv() or some such function) as a pandas DataFrame
df = pd.read_csv("btc_change.csv")
df.head(10)

# Model Development

One of the requirements for your final project is to **attempt** to build a simple linear regression model from your chosen dataset. That is, pick two quantitative variables in your data (one to be the *independent* variable $x$, and one to be the *dependent* variable $y$). The problems in *this* assignment are designed to get you to build this model, and to evaluate it.

Suggestion: you may wish to refer to Lab #7 and/or the lecture slides from Weeks 7 and 8 for inspiration!

#### 1) In the cell directly below, store your independent variable data $x_1,\dots,x_n$ as a `NumPy` `array` called `x`, and store your dependent variable data $y_1,\dots,y_n$ as a `NumPy` `array` called `y`.
##### (It's worth noting that both variables must be of the same size — you must make a sensible choice about how to rectify this if this is not already the case! Your choice must be at least somewhat justifiable!)
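One common way to end up with two arrays of the same size is to keep only the rows where both variables were observed. The sketch below illustrates this on toy data; the `DataFrame` `toy` and the column names `"a"` and `"b"` are hypothetical placeholders, not part of any real dataset, and dropping rows is only one of several justifiable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a" and "b" are placeholder column names
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

# Keep only the rows where BOTH variables are present
paired = toy.dropna(subset=["a", "b"])

x_toy = paired["a"].to_numpy()
y_toy = paired["b"].to_numpy()

# Both arrays now have the same number of entries
print(x_toy.size, y_toy.size)
```

Dropping incomplete rows keeps each $x_i$ paired with its own $y_i$, which simply truncating the longer array would not guarantee.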

In [None]:
# Note that, throughout this assignment, the use of ellipses ("...") indicates a place where you should write your code.
# You should, of course, *remove* the ellipses before doing so.
# Store your variables as NumPy arrays below!
x = np.array(df['Months_Passed'])
y = np.array(df['Change'])

# Don't remove the following print statement - make sure your arrays are the same size
print(x.size, y.size) 

*(If applicable, your explanation of how you resized your data goes here. If you did not need to resize either of your variables, you can leave this blank.)*

--------
#### 2) In the cell directly below, compute some summary statistics: $\overline{x}$, $\overline{y}$, $s_x$, $s_y$, and $r$.

In [None]:
x_bar = np.mean(x)
y_bar = np.mean(y)
std_x = np.std(x)
std_y = np.std(y)
r = np.corrcoef(x, y)[0, 1]

# Don't remove the following print statement
print(x_bar, y_bar, std_x, std_y, r)

------
For the purpose of this assignment, you'll use the $L_2$ loss function, and so the function you'll want to minimize is the **Mean Squared Error (MSE)** function:
$$
R(\theta) = \frac{1}{n} \sum_{i=1}^n \left(y_i - \left(\theta_0 + \theta_1 x_i\right)\right)^2.
$$
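As a minimal sketch of the formula above, the MSE can be written as a function of $\theta_0$ and $\theta_1$ and evaluated on toy data. The function name `mse` and the toy arrays are assumptions made for illustration only.

```python
import numpy as np

def mse(theta0, theta1, x_vals, y_vals):
    """Mean squared error of the line theta0 + theta1 * x on the given data."""
    residuals = y_vals - (theta0 + theta1 * x_vals)
    return np.mean(residuals ** 2)

# Toy data lying exactly on the line y = 1 + 2x
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x_toy

exact_fit = mse(1.0, 2.0, x_toy, y_toy)   # the true line, so every residual is 0
shifted_fit = mse(0.0, 2.0, x_toy, y_toy) # intercept off by 1, so every residual is 1

print(exact_fit, shifted_fit)
```

Moving either parameter away from its best value can only increase the MSE, which is why minimizing $R(\theta)$ picks out a single line.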

You'll recall from class that the *optimal* values of $\theta_0$ and $\theta_1$ (which minimize the MSE function) are $\hat{\theta_0}$ and $\hat{\theta_1}$ respectively, given by
$$
\begin{aligned}
\hat{\theta_1} &= r \cdot \frac{s_y}{s_x},\\
\hat{\theta_0} &= \overline{y} - \hat{\theta_1}\cdot \overline{x}.
\end{aligned}
$$
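One way to sanity-check these formulas is to compare them against a library least-squares fit. The sketch below does this on made-up toy data using `np.polyfit`, which is not the method required by this assignment; it is shown only as a cross-check, and all `_toy` names are hypothetical.

```python
import numpy as np

# Made-up toy data, roughly linear
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form estimates from the summary statistics
r_toy = np.corrcoef(x_toy, y_toy)[0, 1]
slope = r_toy * np.std(y_toy) / np.std(x_toy)
intercept = np.mean(y_toy) - slope * np.mean(x_toy)

# np.polyfit with deg=1 returns the least-squares [slope, intercept]
fit_slope, fit_intercept = np.polyfit(x_toy, y_toy, deg=1)

print(np.isclose(slope, fit_slope), np.isclose(intercept, fit_intercept))
```

Note that `np.std` defaults to the population standard deviation (`ddof=0`). Because only the ratio $s_y / s_x$ appears in the slope, using `ddof=1` for both would give the same result.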

#### 3) In the cell directly below, compute $\hat{\theta_0}$ and $\hat{\theta_1}$ using the statistics you calculated in Problem 2.

In [7]:
theta1_hat = r * (std_y / std_x)
theta0_hat = y_bar - theta1_hat * x_bar

----------
#### 4) Now you have everything you need to plot the **linear regression line** with your data! Most of the code for this has been written for you below, but you should adjust the chart title and axis titles.

In [None]:
# Most code's been written for you, but make sure you amend the titles accordingly!

# Plot the linear regression line
line_x_values = np.linspace(start=x.min(), stop=x.max(), num=500) # 500 evenly spaced x-values
line_y_hat = theta0_hat + theta1_hat * line_x_values # Corresponding y-values
plt.plot(line_x_values, line_y_hat, color='red', label=f"ŷ = {round(theta1_hat, 3)}x + {round(theta0_hat, 3)}")

# Plot your actual data as a Scatterplot
plt.scatter(x=x, y=y, color='blue')

# Make sure you specify good titles/labels. Think carefully about units!
plt.xlabel("Months Passed")
plt.ylabel("Percentage Change in Bitcoin Price")
plt.title("Bitcoin Price Change Over Time")

# Finishing touches
plt.grid(True)
plt.legend()
# plt.savefig("regression.png")

plt.show();

-------------
#### 5) Plot the **residual plot** by running the code cell below. Again, most of the code's been written for you, but you should amend titles/labels accordingly.

In [None]:
# Most code's been written for you, but make sure you amend the titles accordingly!

# Plot the line ŷ = y
plt.hlines(y=0, xmin=x.min(), xmax=x.max(), colors='blue', linestyles='dashed', label='ŷ = y')

# Plot the actual residuals as a Scatterplot
y_hat = theta0_hat + theta1_hat * x # A NumPy array of all the \hat{y_i} values
plt.scatter(x=x, y=y - y_hat, color='red')

# Make sure you specify good titles/labels. Think carefully about units!
plt.xlabel("Months Passed")
plt.ylabel("Residual (Actual - Predicted Percentage Change)")
plt.title("Residual Plot: Bitcoin Price Change vs. Months Passed")

# Finishing touches
plt.grid(True)
plt.legend()
# plt.savefig("residuals.png")

plt.show();

----------------
#### 6) Now, evaluate how well your model did and write a summary below.

As a hint, you should consider:
- The coefficient of determination $r^2$ (you'll need to square the correlation coefficient $r$ from Problem 2);
- Your plots (especially your residual plot) from Problems 4 and 5;
- The *Root Mean Squared Error* (RMSE), which you'll need to compute;
- Other summary statistics you found in Problem 2.

*Refer to what we learnt in class in Week 8 if you get stuck!*

You're expected to write a paragraph or two detailing how well your model performed. It is perfectly fine if your model was awful (linear regression can't predict everything!), *but*, whether it's awful or not, you should provide an explanation that you put a lot of effort into!
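As a rough illustration of how $r^2$ and the RMSE relate, the sketch below computes both on made-up toy data. For a least-squares line, the RMSE equals $s_y\sqrt{1 - r^2}$ (with the population standard deviation), which gives a quick consistency check. All `_toy` names are hypothetical, and `np.polyfit` is used here only to obtain a fitted line.

```python
import numpy as np

# Made-up toy data, roughly linear
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a least-squares line and compute its residuals
slope_toy, intercept_toy = np.polyfit(x_toy, y_toy, deg=1)
residuals_toy = y_toy - (intercept_toy + slope_toy * x_toy)

# RMSE: square root of the mean of the squared residuals
rmse_toy = np.sqrt(np.mean(residuals_toy ** 2))

# Coefficient of determination from the correlation coefficient
r_toy = np.corrcoef(x_toy, y_toy)[0, 1]
r_squared_toy = r_toy ** 2

# Consistency check: RMSE should match s_y * sqrt(1 - r^2)
expected_rmse = np.std(y_toy) * np.sqrt(1 - r_squared_toy)
print(np.isclose(rmse_toy, expected_rmse))
```

The RMSE is in the same units as $y$, so it is most informative when compared with the spread of the $y$ values themselves.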

In [None]:
r_squared = r ** 2

n = len(x)
total = 0

for i in range(n):
    y_i = y[i]
    y_pred = theta0_hat + theta1_hat * x[i]  # use the fitted coefficients rather than rounded, hard-coded values
    total = total + (y_pred - y_i) ** 2

mse = total / n
rmse = np.sqrt(mse)

print(r_squared, rmse)

According to the calculations, the coefficient of determination is $r^2 \approx 0.02$. This means the regression line explains only about 2% of the variance in the percentage change; the rest of the variance is not accounted for by the number of months passed. The line is therefore a poor predictor.

Furthermore, the root mean squared error is about 49, which is very high in this context. Excluding the two outliers at around 170 and 440, the remaining values lie between 10 and 80, a spread of about 70. A typical prediction error of 49 covers most of that spread, which also suggests the model performed poorly.

<hr style="border: 5px solid #005A96;" />

# Woohoo! You're all done.