# Final Project

## Part 1: Data Collection 

Childhood is a developmental period in which students learn many skills, both academic and practical, that help them later in life. Two of the most important are basic math and reading literacy: the majority of the math one deals with in adulthood is taught by middle school (get reference), and reading comprehension underlies much of adult life, including understanding forms, learning new information, and searching for housing. It is therefore important that all children at this stage have equitable access to the tools and education they need to build these long-lasting skills.

However, not all students are given equal opportunities. The US education system has long been known to have varying standards of education (GET REFERENCE), with differences in quality beginning as early as pre-kindergarten, but little documentation exists to confirm how large that variation is. If these differences in education quality exist, it is imperative that they be resolved at an institutional level.

The focus of our project is therefore to determine whether education inequality is reflected in differences in national math and reading examination scores, to identify factors such as race, gender, or state that may play a significant role in those differences (if they exist), and to use that analysis to predict how education inequality will look in future years if the current education system is maintained.

Our null hypothesis will be that race, gender, and state do not have any relationship or impact on math or reading literacy in children in developmental stages. Our alternative hypothesis will be that race, gender, and state have some relationship or impact on math or reading literacy in children in developmental stages.

## Part 2: Data Management/Representation

First we import the libraries we need to load and work with the dataset: pandas, numpy, and matplotlib.pyplot. Pandas provides the DataFrame object, which is an easy way to store tabular data. Numpy is used for its math functionality, and matplotlib.pyplot is used to plot graphs showing relationships between variables in our data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Now we load the data. It is stored in the file "states_all_extended.csv", which we read into a DataFrame using the pandas "read_csv" function. We store the result in a variable called "school_data".

In [2]:
school_data = pd.read_csv("states_all_extended.csv")

school_data.head()

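| Unnamed: 0 | PRIMARY_KEY | STATE | YEAR | ENROLL | TOTAL_REVENUE | FEDERAL_REVENUE | STATE_REVENUE | LOCAL_REVENUE | TOTAL_EXPENDITURE | INSTRUCTION_EXPENDITURE | ... | G08_HI_A_READING | G08_HI_A_MATHEMATICS | G08_AS_A_READING | G08_AS_A_MATHEMATICS | G08_AM_A_READING | G08_AM_A_MATHEMATICS | G08_HP_A_READING | G08_HP_A_MATHEMATICS | G08_TR_A_READING | G08_TR_A_MATHEMATICS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1992_ALABAMA | ALABAMA | 1992 | NaN | 2678885.0 | 304177.0 | 1659028.0 | 715680.0 | 2653798.0 | 1481703.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1992_ALASKA | ALASKA | 1992 | NaN | 1049591.0 | 106780.0 | 720711.0 | 222100.0 | 972488.0 | 498362.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1992_ARIZONA | ARIZONA | 1992 | NaN | 3258079.0 | 297888.0 | 1369815.0 | 1590376.0 | 3401580.0 | 1435908.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1992_ARKANSAS | ARKANSAS | 1992 | NaN | 1711959.0 | 178571.0 | 958785.0 | 574603.0 | 1743022.0 | 964323.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1992_CALIFORNIA | CALIFORNIA | 1992 | NaN | 26260025.0 | 2072470.0 | 16546514.0 | 7641041.0 | 27138832.0 | 14358922.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |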


Looking at the data, we can see that there are a few columns we will not need. For example, PRIMARY_KEY only combines the YEAR and STATE values, so it adds nothing when testing our hypothesis and we can get rid of it. We can use the DataFrame method drop and specify the columns we want to remove.

In [3]:
school_data = school_data.drop(columns=['PRIMARY_KEY'])

We also need to deal with missing data. The built-in method dropna drops every row that contains a NaN value, but the cell below takes a different approach: it keeps only the rows from 2009 onward and reports how many rows were removed. As the output shows, the remaining rows can still contain NaN values.

In [4]:
prev_rows = len(school_data.index)
school_data = school_data[school_data['YEAR'] >= 2009]
curr_rows = len(school_data.index)

print(str(prev_rows - curr_rows) + " rows were dropped.")

school_data.head()

1193 rows were dropped.


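| Unnamed: 0 | STATE | YEAR | ENROLL | TOTAL_REVENUE | FEDERAL_REVENUE | STATE_REVENUE | LOCAL_REVENUE | TOTAL_EXPENDITURE | INSTRUCTION_EXPENDITURE | SUPPORT_SERVICES_EXPENDITURE | ... | G08_HI_A_READING | G08_HI_A_MATHEMATICS | G08_AS_A_READING | G08_AS_A_MATHEMATICS | G08_AM_A_READING | G08_AM_A_MATHEMATICS | G08_HP_A_READING | G08_HP_A_MATHEMATICS | G08_TR_A_READING | G08_TR_A_MATHEMATICS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 867 | ALABAMA | 2009 | 745668.0 | 7186390.0 | 728795.0 | 4161103.0 | 2296492.0 | 7815467.0 | 3836398.0 | 2331552.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 868 | ALASKA | 2009 | 130236.0 | 2158970.0 | 312667.0 | 1357747.0 | 488556.0 | 2396412.0 | 1129756.0 | 832783.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 869 | ARIZONA | 2009 | 981303.0 | 8802515.0 | 1044140.0 | 3806064.0 | 3952311.0 | 9580393.0 | 4296503.0 | 2983729.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 870 | ARKANSAS | 2009 | 474423.0 | 4753142.0 | 534510.0 | 3530487.0 | 688145.0 | 5017352.0 | 2417974.0 | 1492691.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 871 | CALIFORNIA | 2009 | 6234155.0 | 73958896.0 | 9745250.0 | 40084244.0 | 24129402.0 | 74766086.0 | 35617964.0 | 21693675.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1710 | VIRGINIA | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 247.0 | 278.0 | 286.0 | 315.0 | NaN | NaN | NaN | NaN | 269.0 | 293.0 |
| 1711 | WASHINGTON | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 248.0 | 267.0 | 285.0 | 315.0 | 237.0 | 259.0 | NaN | NaN | 263.0 | 292.0 |
| 1712 | WEST_VIRGINIA | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 249.0 | NaN |
| 1713 | WISCONSIN | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 251.0 | 273.0 | 277.0 | 294.0 | 253.0 | 267.0 | NaN | NaN | 268.0 | 276.0 |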
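
To make the difference between a year filter and dropna concrete, here is a minimal sketch on a small made-up DataFrame. The values are hypothetical and are not taken from "states_all_extended.csv"; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from states_all_extended.csv.
toy = pd.DataFrame({
    "STATE": ["A", "B", "A", "B"],
    "YEAR": [2007, 2007, 2009, 2009],
    "ENROLL": [100.0, np.nan, 110.0, 95.0],
    "G04_A_A_READING": [210.0, 215.0, np.nan, 218.0],
})

# Year filter: rows before 2009 are removed, but NaN values may remain.
recent = toy[toy["YEAR"] >= 2009]

# dropna with default arguments: every row containing at least one NaN is removed.
complete = toy.dropna()

# Number of missing values per column, useful before choosing a strategy.
missing_per_column = toy.isna().sum()

print(len(recent), "rows remain after the year filter")
print(len(complete), "rows remain after dropna")
print(missing_per_column)
```

On a wide table where most rows have at least one empty column, dropna with default arguments can remove far more rows than a year filter, so counting missing values per column first helps when deciding which approach fits.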


Since the column names are a little tricky to figure out, we outline how to read them here.

G## - This signifies which grade the value refers to; for example, G04 refers to grade 4.

G##\_A\_A - This is the number of all students in that grade, across all races and genders.

G##\_x\_g - This is read as the number of students of race _x_ and gender _g_ in grade ##; for example, G06_AS_M is the number of Asian male students in grade 6.

G##\_x\_g\_test - This is the average _test_ score of students of race _x_ and gender _g_ in grade ##; for example, G08_AS_A_MATHEMATICS is the average math score of all Asian students in grade 8.

An A in place of a race or gender code signifies all races or all genders.

The race codes are AM - American Indian or Alaska Native, AS - Asian, HI - Hispanic/Latino, BL - Black, WH - White, HP - Hawaiian Native/Pacific Islander, and TR - two or more races.
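
The naming scheme above can be sketched as a small decoder. This is an illustrative helper, not part of the original notebook: the function name `decode_column`, the regular expression, and the gender code F are assumptions (only A and M appear in the text).

```python
import re

# Race codes as listed in the text; "A" stands for all races.
RACE_CODES = {
    "A": "all races",
    "AM": "American Indian or Alaska Native",
    "AS": "Asian",
    "HI": "Hispanic/Latino",
    "BL": "Black",
    "WH": "White",
    "HP": "Hawaiian Native/Pacific Islander",
    "TR": "two or more races",
}

# "A" and "M" appear in the text; "F" is an assumption.
GENDER_CODES = {"A": "all genders", "M": "male", "F": "female"}

# G<grade>_<race>_<gender>, optionally followed by _<test>.
COLUMN_PATTERN = re.compile(r"^G(\d{2})_([A-Z]{1,2})_([A-Z])(?:_([A-Z]+))?$")


def decode_column(name):
    """Return a dict describing a grade column, or None if the name does not fit the scheme."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None
    grade, race, gender, test = match.groups()
    if race not in RACE_CODES or gender not in GENDER_CODES:
        return None
    return {
        "grade": int(grade),
        "race": RACE_CODES[race],
        "gender": GENDER_CODES[gender],
        # Columns without a test suffix hold student counts.
        "measure": test.lower() if test else "enrollment",
    }


print(decode_column("G08_AS_A_MATHEMATICS"))
print(decode_column("G06_AS_M"))
print(decode_column("STATE"))
```

Columns that do not follow the scheme, such as STATE or TOTAL_REVENUE, return None, which makes the helper usable as a filter over `school_data.columns`.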

## Part 3: Exploratory Data Analysis

### Test Score Growth per State Prediction

One of the predictive models we are creating predicts the change in average Grade 4 test scores for each state based on previous years' data. First we keep only the state, the year, and the average math and reading scores, and drop all other columns.

In [6]:
state_avg = school_data[['STATE', 'YEAR', 'G04_A_A_READING', 'G04_A_A_MATHEMATICS']]

state_avg.head()

| Unnamed: 0 | STATE | YEAR | G04_A_A_READING | G04_A_A_MATHEMATICS |
|---|---|---|---|---|
| 867 | ALABAMA | 2009 | 216.0 | 228.0 |
| 868 | ALASKA | 2009 | 211.0 | 237.0 |
| 869 | ARIZONA | 2009 | 210.0 | 230.0 |
| 870 | ARKANSAS | 2009 | 216.0 | 238.0 |
| 871 | CALIFORNIA | 2009 | 210.0 | 232.0 |
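
One possible way to turn a table of this shape into a per-state prediction is a straight-line fit of score against year. The notebook does not say which model it uses, so the linear trend, the helper names `fit_state_trend` and `predict_score`, the `BASE_YEAR` constant, and the toy scores below are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

BASE_YEAR = 2009  # years are measured from here to keep the fit well conditioned

# Hypothetical scores for two made-up states; these are not real results.
toy_avg = pd.DataFrame({
    "STATE": ["X"] * 4 + ["Y"] * 4,
    "YEAR": [2009, 2011, 2013, 2015] * 2,
    "G04_A_A_MATHEMATICS": [230.0, 232.0, 234.0, 236.0,
                            240.0, 239.0, 238.0, 237.0],
})


def fit_state_trend(group, score_column):
    """Fit score = slope * (year - BASE_YEAR) + intercept for one state's rows."""
    rows = group.dropna(subset=[score_column])
    if len(rows) < 2:
        return None  # a line needs at least two points
    years = rows["YEAR"].to_numpy() - BASE_YEAR
    scores = rows[score_column].to_numpy()
    slope, intercept = np.polyfit(years, scores, deg=1)
    return slope, intercept


def predict_score(trend, year):
    """Extrapolate the fitted line to the given year."""
    slope, intercept = trend
    return slope * (year - BASE_YEAR) + intercept


# Fit one line per state.
trends = {}
for state, group in toy_avg.groupby("STATE"):
    trend = fit_state_trend(group, "G04_A_A_MATHEMATICS")
    if trend is not None:
        trends[state] = trend

for state, (slope, intercept) in trends.items():
    predicted = predict_score((slope, intercept), 2017)
    print(state, "change per year:", round(slope, 2), "predicted 2017 score:", round(predicted, 1))
```

The slope is the estimated change in score per year for that state. A straight line is the simplest choice and extrapolates poorly over long horizons, so it is best treated as a baseline to compare other models against.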


## Hypothesis testing

## Communication of Insights Attained