# Checkpoint 1

Reminder: 

- You are being evaluated for completion and effort in this checkpoint. 
- Avoid manual labor and hard-coding as much as possible; everything we've taught you so far is meant to simplify and automate your process.
- Please do not remove any comment that starts with: "# @@@". 

We will be working with the same `states_edu.csv` that you should already be familiar with from the tutorial.

We investigated Grade 8 reading scores in the tutorial. For this checkpoint, you are asked to investigate a different test. Here's an overview:

* Choose a specific response variable to focus on
>Grade 4 Math, Grade 4 Reading, Grade 8 Math
* Pick or create features to use
>Will all the features be useful in predicting test score? Are some more important than others? Should you standardize, bin, or scale the data?
* Explore the data as it relates to that test
>Create at least 2 visualizations (graphs), each with a caption describing the graph and what it tells us about the data
* Create training and testing data
>Do you want to train on all the data? Only data from the last 10 years? Only Michigan data?
* Train an ML model to predict the outcome
>Define what you want to predict, and pick a model in sklearn to use (see sklearn <a href="https://scikit-learn.org/stable/modules/linear_model.html">regressors</a>).
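
As a rough illustration of the feature questions in the list above, the sketch below derives a per-student spending feature and then standardizes it. The rows are made up, the column names `ENROLL` and `TOTAL_EXPENDITURE` are assumed to match `states_edu.csv`, and `EXPENDITURE_PER_STUDENT` is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Made-up rows for illustration only; the column names are assumed to
# mirror those in states_edu.csv.
toy = pd.DataFrame({
    "ENROLL": [1000, 2000, 4000],
    "TOTAL_EXPENDITURE": [10000, 30000, 40000],
})

# Derived feature: spending per enrolled student (hypothetical column name).
toy["EXPENDITURE_PER_STUDENT"] = toy["TOTAL_EXPENDITURE"] / toy["ENROLL"]

# Standardize: subtract the mean and divide by the population standard deviation.
col = toy["EXPENDITURE_PER_STUDENT"]
toy["EXPENDITURE_PER_STUDENT_Z"] = (col - col.mean()) / col.std(ddof=0)

print(toy)
```

Whether a derived or standardized feature actually helps depends on the model; plain linear regression on a single feature gives the same predictions either way, so this is mainly useful once several features on different scales are combined.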


Include comments throughout your code! Every cleanup and preprocessing task should be documented.


<h2> Data Cleanup </h2>

Import `numpy`, `pandas`, and `matplotlib`.

(Feel free to import other libraries!)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

Load the `states_edu.csv` dataset and take a look at the first few rows.

In [2]:
data = pd.read_csv("states_edu.csv")
data.head()

Rename columns and deal with missing data. Drop rows where `AVG_MATH_8_SCORE` is missing.

In [3]:
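# Rename the response column to a more readable name
data = data.rename(columns={"AVG_MATH_8_SCORE": "Math_Grade8_Score"})
# Drop rows with no Grade 8 Math score, since the response variable cannot be missing
data = data.dropna(subset=["Math_Grade8_Score"])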
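
Before dropping rows, it can help to see how much data is missing in each column. The following is a minimal sketch on a made-up frame (the values are invented and `toy` is a hypothetical name); it counts missing values per column and shows that `dropna(subset=...)` only looks at the listed columns.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration; not taken from states_edu.csv.
toy = pd.DataFrame({
    "STATE": ["A", "A", "B", "B"],
    "AVG_MATH_8_SCORE": [280.0, np.nan, 275.0, np.nan],
    "TOTAL_EXPENDITURE": [100.0, 120.0, np.nan, 90.0],
})

# Count missing values in every column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Drop only the rows where the response variable is missing.
cleaned = toy.dropna(subset=["AVG_MATH_8_SCORE"])
print(f"Rows before: {len(toy)}, after: {len(cleaned)}")
```

Note that `cleaned` can still contain missing values in other columns, which is why a later step may need its own `dropna` call for the chosen feature.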

<h2>Exploratory Data Analysis (EDA) </h2>

Chosen test: **Grade 8 Math**

How many years of data are logged in our dataset? 

In [4]:
years_logged = data["YEAR"].nunique()
print(f"Years logged in dataset: {years_logged}")

Compare Michigan to Ohio. Which state has the higher average across all years?

In [5]:
mi_avg = data[data["STATE"] == "MICHIGAN"]["Math_Grade8_Score"].mean()
oh_avg = data[data["STATE"] == "OHIO"]["Math_Grade8_Score"].mean()
print(f"Michigan Average: {mi_avg:.2f}, Ohio Average: {oh_avg:.2f}")
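
The same comparison can be written with `groupby`, which scales to more than two states without repeating the filter. This is a sketch on invented scores, so the result says nothing about the real Michigan and Ohio averages; `toy`, `state_means`, and `higher_state` are hypothetical names.

```python
import pandas as pd

# Invented scores for illustration only.
toy = pd.DataFrame({
    "STATE": ["MICHIGAN", "MICHIGAN", "OHIO", "OHIO", "TEXAS"],
    "Math_Grade8_Score": [270.0, 280.0, 280.0, 290.0, 285.0],
})

# Keep only the states of interest, then average the score per state.
subset = toy[toy["STATE"].isin(["MICHIGAN", "OHIO"])]
state_means = subset.groupby("STATE")["Math_Grade8_Score"].mean()

# idxmax returns the index label (here, the state) with the largest mean.
higher_state = state_means.idxmax()
print(state_means)
print(f"Higher average in this toy data: {higher_state}")
```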

Find the average for Grade 8 Math across all states in 2019.

In [6]:
avg_2019 = data[data["YEAR"] == 2019]["Math_Grade8_Score"].mean()
print(f"Average Grade 8 Math score in 2019: {avg_2019:.2f}")

For each state, find the maximum value for Grade 8 Math.

In [7]:
max_scores = data.groupby("STATE")["Math_Grade8_Score"].max()
max_scores.head()
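
If it is also useful to know when each state reached its maximum, one option is to look up the full row for each maximum with `idxmax`. The sketch below uses invented data and hypothetical names (`toy`, `best_rows`); it assumes the frame has a unique index, which is the default after `read_csv`.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "STATE": ["A", "A", "B", "B"],
    "YEAR": [2015, 2019, 2015, 2019],
    "Math_Grade8_Score": [280.0, 284.0, 290.0, 288.0],
})

# For each state, idxmax gives the row label of its highest score;
# .loc then retrieves those rows, including the year.
best_rows = toy.loc[toy.groupby("STATE")["Math_Grade8_Score"].idxmax()]
print(best_rows)
```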

<h2>Visualization</h2>

Investigate trends in Grade 8 Math over the years for Michigan and Ohio.

In [8]:
plt.figure(figsize=(10, 6))
for state in ["MICHIGAN", "OHIO"]:
    # Sort by year so the line is drawn in chronological order
    state_data = data[data["STATE"] == state].sort_values("YEAR")
    plt.plot(state_data["YEAR"], state_data["Math_Grade8_Score"], label=state)
plt.xlabel("Year")
plt.ylabel("Grade 8 Math Score")
plt.title("Grade 8 Math Scores in Michigan and Ohio Over Time")
plt.legend()
plt.show()

<h2> Model Training and Prediction </h2>

Train a model to predict Grade 8 Math scores from total state expenditure (`TOTAL_EXPENDITURE`). Note that this column is a state-wide total, not a per-student figure.

In [9]:
# Drop rows missing the feature, since LinearRegression cannot handle NaN values
data = data.dropna(subset=["TOTAL_EXPENDITURE"])
X = data[["TOTAL_EXPENDITURE"]]
y = data["Math_Grade8_Score"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}, R^2: {r2:.2f}")
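
MSE on its own is hard to interpret, so one common check is to compare the model against a baseline that always predicts the training mean, and to report RMSE, which is in the same units as the score. The sketch below uses synthetic data generated for this example only (the relationship between `X` and `y` is invented) together with scikit-learn's `DummyRegressor`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a linear trend plus noise, invented for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 250 + 3 * X[:, 0] + rng.normal(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

# RMSE is the square root of MSE, so it is in score units.
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
model_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Baseline RMSE: {baseline_rmse:.2f}, Model RMSE: {model_rmse:.2f}")
```

If the model's RMSE is not clearly below the baseline's on the real data, the chosen feature is probably not carrying much information about the score.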

Visualize predictions versus actual values.

In [10]:
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Scores")
plt.ylabel("Predicted Scores")
plt.title("Actual vs Predicted Grade 8 Math Scores")
plt.show()