# Checkpoint 1

Reminder: 

- You are being evaluated for completion and effort in this checkpoint. 
- Avoid manual labor and hard-coding as much as possible; everything we've taught you so far is meant to simplify and automate your process.

We will be working with the same `states_edu.csv` that you should already be familiar with from the tutorial.

We investigated Grade 8 reading score in the tutorial. For this checkpoint, you are asked to investigate another test. Here's an overview:

* Choose a specific response variable to focus on
>Grade 4 Math, Grade 4 Reading, Grade 8 Math
* Pick or create features to use
>Will all the features be useful in predicting test score? Are some more important than others? Should you standardize, bin, or scale the data?
* Explore the data as it relates to that test
>Create at least 2 visualizations (graphs), each with a caption describing the graph and what it tells us about the data
* Create training and testing data
>Do you want to train on all the data? Only data from the last 10 years? Only Michigan data?
* Train a ML model to predict outcome 
>Define what you want to predict, and pick a model in sklearn to use (see sklearn <a href="https://scikit-learn.org/stable/modules/linear_model.html">regressors</a>).
* Summarize your findings
>Write a 1 paragraph summary of what you did and make a recommendation about if and how student performance can be predicted

Include comments throughout your code! Every cleanup and preprocessing task should be documented.

Of course, if you're finding this assignment interesting (and we really hope you do!), you are welcome to do more than the requirements! For example, you may want to see if expenditure affects 4th graders more than 8th graders. Maybe you want to look into the extended version of this dataset and see how factors like sex and race are involved. You can include all your work in this notebook when you turn it in -- just always make sure you explain what you did and interpret your results. Good luck!

<h2> Data Cleanup </h2>

Import `numpy`, `pandas`, and `matplotlib`.

(Feel free to import other libraries!)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Load in the "states_edu.csv" dataset and take a look at the head of the data

In [None]:
states_edu = pd.read_csv('states_edu.csv')
states_edu.head()  # preview the first few rows

You should always familiarize yourself with what each column in the dataframe represents. Read about the states_edu dataset here: https://www.kaggle.com/noriuk/us-education-datasets-unification-project

Use this space to rename columns, deal with missing data, etc. _(optional)_

<h2>Exploratory Data Analysis (EDA) </h2>

Choose one of Grade 4 Reading, Grade 4 Math, or Grade 8 Math to focus on: *Grade 4 Math*

How many years of data are logged in our dataset? 

In [None]:
33 years of data
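A count like this can be computed rather than read off by hand. The sketch below uses a tiny made-up frame in place of `states_edu`, and it assumes the year is stored in a column named `YEAR` (as in the Kaggle schema); adjust the name if your frame differs.

```python
import pandas as pd

# Toy stand-in for states_edu (made-up rows, not real data).
# The YEAR column name is an assumption based on the Kaggle dataset.
toy = pd.DataFrame({
    "STATE": ["MICHIGAN", "OHIO", "MICHIGAN", "OHIO", "MICHIGAN"],
    "YEAR": [1992, 1992, 1993, 1993, 2019],
})

# Each year appears once per state, so count distinct years rather than rows
n_years = toy["YEAR"].nunique()
print(n_years)
```

On the real data, the same call on `states_edu["YEAR"]` would give the number of logged years.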

Let's compare Michigan to Ohio. Which state has the higher average across all years in the test you chose?

In [None]:
Ohio's average (approximately 239.45) is higher than Michigan's (approximately 234.36).
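A minimal sketch of how such a comparison might be computed, using made-up rows. The column names `STATE` and `AVG_MATH_4_SCORE` and the upper-case state spellings are assumptions based on the Kaggle schema.

```python
import pandas as pd

# Toy stand-in for states_edu (made-up scores, not real data)
toy = pd.DataFrame({
    "STATE": ["MICHIGAN", "MICHIGAN", "OHIO", "OHIO", "TEXAS"],
    "YEAR": [2017, 2019, 2017, 2019, 2019],
    "AVG_MATH_4_SCORE": [236.0, 236.0, 241.0, 241.0, 244.0],
})

# Keep only the two states of interest, then average the score per state.
# mean() skips missing scores by default.
two_states = toy[toy["STATE"].isin(["MICHIGAN", "OHIO"])]
state_means = two_states.groupby("STATE")["AVG_MATH_4_SCORE"].mean()

# idxmax() returns the index label (the state) with the largest mean
higher_state = state_means.idxmax()
print(state_means)
print(higher_state)
```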

Find the average for your chosen test across all states in 2019

In [None]:
The average 4th-grade math score across all states for the year 2019 is approximately 239.94. 
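One way to get a single-year average is to filter on the year and then take the mean. This is a sketch on made-up rows; `YEAR` and `AVG_MATH_4_SCORE` are assumed column names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for states_edu (made-up scores, not real data)
toy = pd.DataFrame({
    "STATE": ["MICHIGAN", "OHIO", "TEXAS", "MICHIGAN"],
    "YEAR": [2019, 2019, 2019, 2017],
    "AVG_MATH_4_SCORE": [236.0, 242.0, np.nan, 230.0],
})

# Boolean mask selects the 2019 rows; mean() ignores the missing score
avg_2019 = toy.loc[toy["YEAR"] == 2019, "AVG_MATH_4_SCORE"].mean()
print(avg_2019)
```

Because `mean()` skips missing values, states without a reported score in that year do not affect the result.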

For each state, find a maximum value for your chosen test score

Refer to the `Grouping and Aggregating` section in Tutorial 0 if you are stuck.

In [None]:
Massachusetts and Minnesota have the highest recorded scores among states with a maximum of 253.
Other states with notably high maximum scores include Indiana, New Hampshire, and New Jersey, each with a maximum score of 249.
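The per-state maxima can be produced with a group-by, as in the `Grouping and Aggregating` section. The sketch below uses made-up rows and assumed column names (`STATE`, `AVG_MATH_4_SCORE`).

```python
import numpy as np
import pandas as pd

# Toy stand-in for states_edu (made-up scores, not real data)
toy = pd.DataFrame({
    "STATE": ["MICHIGAN", "MICHIGAN", "OHIO", "OHIO",
              "MASSACHUSETTS", "MASSACHUSETTS"],
    "AVG_MATH_4_SCORE": [236.0, 238.0, 241.0, np.nan, 253.0, 247.0],
})

# One maximum per state, sorted so the highest-scoring states come first
max_by_state = (
    toy.groupby("STATE")["AVG_MATH_4_SCORE"]
    .max()
    .sort_values(ascending=False)
)
print(max_by_state)
```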

<h2> Feature Engineering </h2>

After exploring the data, you can choose to modify features that you would use to predict the performance of the students on your chosen response variable. 

You can also create your own features. For example, if you suspect that a state's expenditure per student affects overall academic performance, you could create an `expenditure_per_student` feature.

Use this space to modify or create features.

In [None]:
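states_edu['EXPENDITURE_PER_STUDENT'] = states_edu['TOTAL_EXPENDITURE'] / states_edu['GRADES_ALL_G']
states_edu['REVENUE_PER_STUDENT'] = states_edu['TOTAL_REVENUE'] / states_edu['GRADES_ALL_G']  # plotted later; TOTAL_REVENUE is the Kaggle column name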

Feature engineering justification: **<BRIEFLY DESCRIBE WHY YOU MADE THE CHANGES THAT YOU DID\>**

<h2>Visualization</h2>

Investigate the relationship between your chosen response variable and at least two predictors using visualizations. Write down your observations.

**Visualization 1**

In [None]:
Before creating the per-student expenditure feature, the TOTAL_EXPENDITURE column had 440 missing values and the GRADES_ALL_G column had 83. The new feature (total expenditure divided by the total number of students) still has 440 missing values, reflecting the gaps already present in TOTAL_EXPENDITURE.
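Missing-value counts like these can be checked with `isna().sum()`. The sketch below uses a made-up four-row frame to show that a ratio is missing wherever either input is missing; the counts quoted above would be consistent with the rows lacking GRADES_ALL_G also lacking TOTAL_EXPENDITURE, though that is an inference, not something verified here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for states_edu (made-up values, not real data)
toy = pd.DataFrame({
    "TOTAL_EXPENDITURE": [1000.0, np.nan, 3000.0, 4000.0],
    "GRADES_ALL_G": [100.0, 200.0, np.nan, 400.0],
})

# NaN propagates through division, so the ratio is missing
# whenever the numerator or the denominator is missing
toy["EXPENDITURE_PER_STUDENT"] = toy["TOTAL_EXPENDITURE"] / toy["GRADES_ALL_G"]

# Number of missing values in each column
missing = toy.isna().sum()
print(missing)
```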

**EXPENDITURE_PER_STUDENT**
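A scatter plot is one possible way to look at a predictor against the response. This is only a sketch on made-up numbers; the column names are assumptions, and the `Agg` backend line is there so the snippet runs outside a notebook (it can be dropped in Jupyter).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for states_edu (made-up values, not real data)
toy = pd.DataFrame({
    "EXPENDITURE_PER_STUDENT": [8.5, 9.7, 11.2, 12.9, 14.1],
    "AVG_MATH_4_SCORE": [233.0, 236.0, 238.0, 241.0, 240.0],
})

# One point per state-year: predictor on x, response on y
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(toy["EXPENDITURE_PER_STUDENT"], toy["AVG_MATH_4_SCORE"], alpha=0.6)
ax.set_xlabel("Expenditure per student")
ax.set_ylabel("Grade 4 math score")
ax.set_title("Grade 4 math score vs. expenditure per student (toy data)")
```

The same pattern with `REVENUE_PER_STUDENT` on the x-axis would cover a second predictor.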

**Visualization 2**

In [None]:
These features describe the financial resources available per student (revenue per student) and spent per student (expenditure per student). Because funding can affect the quality of education and student performance, they may be valuable for predictive modeling or further analysis.

**REVENUE_PER_STUDENT**

<h2> Data Creation </h2>

_Use this space to create train/test data_

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
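# Predict Grade 4 math (AVG_MATH_4_SCORE in the Kaggle data) from per-student funding; drop rows with missing values
data = states_edu[['EXPENDITURE_PER_STUDENT', 'REVENUE_PER_STUDENT', 'AVG_MATH_4_SCORE']].dropna()
X = data[['EXPENDITURE_PER_STUDENT', 'REVENUE_PER_STUDENT']]
y = data['AVG_MATH_4_SCORE']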

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.3, random_state=42)
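In `train_test_split`, a float `test_size` is read as a fraction of the rows, while an integer is read as an absolute number of rows. The sketch below shows the difference on 20 made-up rows; the column names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 20 made-up rows standing in for the cleaned features and response
X = pd.DataFrame({"EXPENDITURE_PER_STUDENT": np.linspace(8.0, 15.0, 20)})
y = pd.Series(np.linspace(230.0, 245.0, 20), name="AVG_MATH_4_SCORE")

# Float: hold out 30% of the rows (6 of 20) for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Integer: hold out exactly 3 rows, however large the dataset is
_, X_test_int, _, _ = train_test_split(
    X, y, test_size=3, random_state=42)

print(len(X_train), len(X_test), len(X_test_int))
```

With only a handful of test rows, evaluation metrics become very noisy, which is why a fraction is usually the safer choice.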

<h2> Prediction </h2>

ML Models [Resource](https://medium.com/@vijaya.beeravalli/comparison-of-machine-learning-classification-models-for-credit-card-default-data-c3cf805c9a5a)

In [None]:
# import your sklearn class here
from sklearn.linear_model import LinearRegression

In [None]:
# create your model here
model = LinearRegression()

# Train the model with the training data
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

## Evaluation

Choose some metrics to evaluate the performance of your model; some of them are mentioned in the tutorial.

In [None]:
Mean Squared Error (MSE): 19.69
Mean Absolute Error (MAE): 2.66
R-squared (R²): 0.10
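Metrics like these are available in `sklearn.metrics`. The sketch below computes them on four made-up true/predicted pairs, so its numbers are unrelated to the results above; in the notebook the inputs would be `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true scores and predictions (stand-ins for y_test and y_pred)
y_true = np.array([240.0, 236.0, 244.0, 238.0])
y_hat = np.array([239.0, 237.0, 242.0, 238.0])

mse = mean_squared_error(y_true, y_hat)   # mean of squared errors
mae = mean_absolute_error(y_true, y_hat)  # mean of absolute errors
r2 = r2_score(y_true, y_hat)              # 1 - SS_res / SS_tot

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R-squared (R²): {r2:.2f}")
```

MAE is in the same units as the score, while MSE is in squared units, so taking its square root gives an error size that is easier to compare with the scores themselves.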

We have copied over the graphs that visualize the model's performance on the training and testing set. 

Change `col_name` and modify the call to `plt.ylabel()` to isolate how a single predictor affects the model.

In [None]:
col_name = 'EXPENDITURE_PER_STUDENT'
f = plt.figure(figsize=(12,6))
plt.scatter(X_train[col_name], y_train, color = "red")
plt.scatter(X_train[col_name], model.predict(X_train), color = "green")

plt.legend(['True Training','Predicted Training'])
plt.xlabel(col_name)
plt.ylabel('Grade 4 Math Score')
plt.title("Model Behavior On Training Set")

In [None]:
col_name = 'REVENUE_PER_STUDENT'

f = plt.figure(figsize=(12,6))
plt.scatter(X_test[col_name], y_test, color = "blue")
plt.scatter(X_test[col_name], model.predict(X_test), color = "black")

plt.legend(['True testing','Predicted testing'])
plt.xlabel(col_name)
plt.ylabel('Grade 4 Math Score')
plt.title("Model Behavior on Testing Set")

<h2> Summary </h2>

Our evaluation of the model's performance yielded a Mean Squared Error (MSE) of approximately 19.69, a Mean Absolute Error (MAE) of about 2.66, and an R-squared (R²) score of 0.10. The model's predictions are off by about 2.66 points on average, but an R² of 0.10 means it explains only about 10% of the variance in the math scores, so its explanatory power is limited. This suggests that 4th-grade math scores are influenced by many factors beyond per-student financial expenditure, and that incorporating additional predictors could improve the model's accuracy and explanatory power.