# Final Project

## Part 1: Data Collection 

The early youth of a child is a developmental time where students are learning how to perform many tasks and learn skills, both book smart and street smart, that can help them in life. One of those skills that begins to develop in a young age is literacy in basic math and reading, as the majority of math that one deals with in adulthood is taught in middle school (get reference), and reading comprehension is key to understanding the majority of events that happen in an adults life - understanding forms, learning new information, searching for housing, etc. Therefore, it is important that all children in this developmental stage have equitable opportunities deserving of them that in such a key growth period, they all have the tools and education necessary to learn such important and long lasting skills such as math and reading comprehension.

However, not all students are given such equally fitted opportunities. The US education system has long been known to have varying standards of education (GET REFERENCE), where differences in education quality begin as early as pre-kindergarten, but not a lot of documentation has been procured to confirm on any large variation in education quality. It is imperative that if these differences in education quality exist, then they be resolved on an institutional level. 

So, our focus of project is to confirm if education inequality is reflected by national math and reading examination differences and recognize factors such as race or gender or state that may play significant roles in such (if they exist), and use such analysis to predict how future years education inequality will be if the current education system/institution is maintained. 

Our null hypothesis will be that race, gender, and state do not have any relationship or impact on math or reading literacy in children in developmental stages. Our alternative hypothesis will be that race, gender, and state have some relationship or impact on math or reading literacy in children in developmental stages.

## Part 2: Data Management/Representation

First we have to import the necessary libraries that we need to load the dataset. We are using pandas, numpy, and matplotlib.pyplot. Pandas is used for the DataFrame object since that is an easy way to store tabular data. Numpy is used for its math functionality and mathplotlib.pyplot is used to plot graphs demonstrating relationships between variables in our data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Now we have to load the data. The data is stored in the "states_all_extended.csv" file and so we have to load it into a DataFrame. This can be done using pandas "read_csv" method. We will store this data in a variable called "school_data".

In [2]:
school_data = pd.read_csv("states_all_extended.csv")

# display first few rows
school_data.head()

Unnamed: 0,PRIMARY_KEY,STATE,YEAR,ENROLL,TOTAL_REVENUE,FEDERAL_REVENUE,STATE_REVENUE,LOCAL_REVENUE,TOTAL_EXPENDITURE,INSTRUCTION_EXPENDITURE,...,G08_HI_A_READING,G08_HI_A_MATHEMATICS,G08_AS_A_READING,G08_AS_A_MATHEMATICS,G08_AM_A_READING,G08_AM_A_MATHEMATICS,G08_HP_A_READING,G08_HP_A_MATHEMATICS,G08_TR_A_READING,G08_TR_A_MATHEMATICS
0,1992_ALABAMA,ALABAMA,1992,,2678885.0,304177.0,1659028.0,715680.0,2653798.0,1481703.0,...,,,,,,,,,,
1,1992_ALASKA,ALASKA,1992,,1049591.0,106780.0,720711.0,222100.0,972488.0,498362.0,...,,,,,,,,,,
2,1992_ARIZONA,ARIZONA,1992,,3258079.0,297888.0,1369815.0,1590376.0,3401580.0,1435908.0,...,,,,,,,,,,
3,1992_ARKANSAS,ARKANSAS,1992,,1711959.0,178571.0,958785.0,574603.0,1743022.0,964323.0,...,,,,,,,,,,
4,1992_CALIFORNIA,CALIFORNIA,1992,,26260025.0,2072470.0,16546514.0,7641041.0,27138832.0,14358922.0,...,,,,,,,,,,


Looking at the data, we can see that there are a few columns we will not need. For example PRIMARY_KEY isn't a data point we need to consider when testing our hypothesis so we can get rid of it. We can use the DataFrame method drop and specify the columns we want to drop.

In [3]:
school_data = school_data.drop(columns=['PRIMARY_KEY'])

We want to use data that is from 2009 and beyond because data before this did not record demographics.

In [4]:
# get previous number of rows
prev_rows = len(school_data.index)
school_data = school_data[school_data['YEAR'] >= 2009]
# get current number of rows
curr_rows = len(school_data.index)

print(str(prev_rows - curr_rows) + " rows were dropped.")

school_data.head()

1193 rows were dropped.


Unnamed: 0,STATE,YEAR,ENROLL,TOTAL_REVENUE,FEDERAL_REVENUE,STATE_REVENUE,LOCAL_REVENUE,TOTAL_EXPENDITURE,INSTRUCTION_EXPENDITURE,SUPPORT_SERVICES_EXPENDITURE,...,G08_HI_A_READING,G08_HI_A_MATHEMATICS,G08_AS_A_READING,G08_AS_A_MATHEMATICS,G08_AM_A_READING,G08_AM_A_MATHEMATICS,G08_HP_A_READING,G08_HP_A_MATHEMATICS,G08_TR_A_READING,G08_TR_A_MATHEMATICS
867,ALABAMA,2009,745668.0,7186390.0,728795.0,4161103.0,2296492.0,7815467.0,3836398.0,2331552.0,...,,,,,,,,,,
868,ALASKA,2009,130236.0,2158970.0,312667.0,1357747.0,488556.0,2396412.0,1129756.0,832783.0,...,,,,,,,,,,
869,ARIZONA,2009,981303.0,8802515.0,1044140.0,3806064.0,3952311.0,9580393.0,4296503.0,2983729.0,...,,,,,,,,,,
870,ARKANSAS,2009,474423.0,4753142.0,534510.0,3530487.0,688145.0,5017352.0,2417974.0,1492691.0,...,,,,,,,,,,
871,CALIFORNIA,2009,6234155.0,73958896.0,9745250.0,40084244.0,24129402.0,74766086.0,35617964.0,21693675.0,...,,,,,,,,,,


Since the columns names are a little tricky to figure out, we are going to outline how to read them here. 

G## - This signifies which grade this value is talking about; for example G04 is referring to grade 4.

G##\_A\_A - This refers to all the students in that grade from all races.

G##\_x\_g - This is read as the number of students of race _x_ and gender _g_ in grade ##; for example G06_AS_M is all asian male students in grade 6.

G##\_x\_g\_test - This is average _test_ score of race _x_ and gender _g_ in grade ##; for example G06_AS_A_MATH is the average math score of all asian students in grade 6.

A in place of a gender or race signifies all genders or all races.

The different race codes are AM - American Indian or Alaska Native, AS - Asian, HI - Hispanic/Latino, BL - Black, WH - White, HP - Hawaiian Native/Pacific Islander and TR - two or more races.

## Part 3: Exploratory Data Analysis

### Test Score Growth per State Prediction

One of the predictive models we are creating is predicting the change in average test scores in Grade 4 based on previous years data for each state. First we are going to remove all the columns except for state, and the average test scores for math and reading.

In [5]:
# get columns needed
state_avg = school_data[['STATE', 'YEAR', 'G04_A_A_READING', 'G04_A_A_MATHEMATICS']]

state_avg.head()

Unnamed: 0,STATE,YEAR,G04_A_A_READING,G04_A_A_MATHEMATICS
867,ALABAMA,2009,216.0,228.0
868,ALASKA,2009,211.0,237.0
869,ARIZONA,2009,210.0,230.0
870,ARKANSAS,2009,216.0,238.0
871,CALIFORNIA,2009,210.0,232.0


To create a metric for how the test scores have improved, we are subtracting the 2009 average score from each data point to define how much the average math and reading test scores have changed since 2009. We are storing this metric in a new column, "READING_GROWTH" and "MATH_GROWTH".

In [6]:
# set reading growth to NaN first
state_avg['READING_GROWTH'] = np.NaN

# method to process each row and return the reading average in 2009
def process_reading(row):
    state = row['STATE']
    new = state_avg.loc[state_avg['STATE'] == state]
    new = new.loc[new['YEAR'] == 2009]
    return new['G04_A_A_READING']

# in each row update the reading growth value with the difference between this value and the value in 2009
for i, row in state_avg.iterrows():
    state_avg.at[i, 'READING_GROWTH'] = row['G04_A_A_READING'] - process_reading(row)
    
state_avg['MATHEMATICS_GROWTH'] = np.NaN

# similar function as reading, but for mathematics
def process_reading(row):
    state = row['STATE']
    new = state_avg.loc[state_avg['STATE'] == state]
    new = new.loc[new['YEAR'] == 2009]
    return new['G04_A_A_MATHEMATICS']
 
for i, row in state_avg.iterrows():
    state_avg.at[i, 'MATHEMATICS_GROWTH'] = row['G04_A_A_MATHEMATICS'] - process_reading(row)
    
state_avg.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  state_avg['READING_GROWTH'] = np.NaN
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  state_avg['MATHEMATICS_GROWTH'] = np.NaN


Unnamed: 0,STATE,YEAR,G04_A_A_READING,G04_A_A_MATHEMATICS,READING_GROWTH,MATHEMATICS_GROWTH
867,ALABAMA,2009,216.0,228.0,0.0,0.0
868,ALASKA,2009,211.0,237.0,0.0,0.0
869,ARIZONA,2009,210.0,230.0,0.0,0.0
870,ARKANSAS,2009,216.0,238.0,0.0,0.0
871,CALIFORNIA,2009,210.0,232.0,0.0,0.0


Since each state counts as a unique independent variable, we can use the pandas method get_dummies to create a dataframe where each state is represented by either 1 or 0, 1 if the data value is in that state and 0 if the data value is not in that state. Then we will drop the Alabama column because if all the other states are 0 we can assume that the data value must be in Alabama. 

In [7]:
# get dummies
state_avg = pd.get_dummies(state_avg, columns=['STATE'])
# drop alabama and reading and mathematics averages since we no longer need them
state_avg = state_avg.drop(columns=['STATE_ALABAMA', 'G04_A_A_READING', 'G04_A_A_MATHEMATICS'])

state_avg.head()

Unnamed: 0,YEAR,READING_GROWTH,MATHEMATICS_GROWTH,STATE_ALASKA,STATE_ARIZONA,STATE_ARKANSAS,STATE_CALIFORNIA,STATE_COLORADO,STATE_CONNECTICUT,STATE_DELAWARE,...,STATE_SOUTH_DAKOTA,STATE_TENNESSEE,STATE_TEXAS,STATE_UTAH,STATE_VERMONT,STATE_VIRGINIA,STATE_WASHINGTON,STATE_WEST_VIRGINIA,STATE_WISCONSIN,STATE_WYOMING
867,2009,0.0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
868,2009,0.0,0.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
869,2009,0.0,0.0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
870,2009,0.0,0.0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
871,2009,0.0,0.0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we have to split apart the dataset into a train and test dataset so that we can use the test dataset to determine how accurate our predictor is. We are going to predict the years 2017, 2018, and 2019 so we will use those rows as our test data and the rest as our train data.

In [8]:
# drop the NaN rows
train_data = state_avg[state_avg['YEAR'] < 2017].dropna()
test_data = state_avg[state_avg['YEAR'] >= 2017].dropna()

We can use Linear SVM from sklearn to make a regression model. We will make one for reading and one for mathematics. First we have to separate our independent and dependent variables.

In [9]:
from sklearn.linear_model import LinearRegression

X_reading = []
y_reading = []
X_math = []
y_math = []

# iterate through each row and add the year and state to the X variables and the growths to the y variables
for i, row in train_data.iterrows():
    add = row[3:].tolist()
    add.insert(0, row['YEAR'])
    X_reading.append(add)
    y_reading.append(row['READING_GROWTH'])
    X_math.append(add)
    y_math.append(row['MATHEMATICS_GROWTH'])

We are going to use the Linear Regression model to fit the X and y variables and create a prediction model. Then we are adding the predicts to a separate column in the test_data DataFrame so we can easily compare the values.

In [10]:
# create reading regression and math regression
reading_regr = LinearRegression().fit(X_reading, y_reading)
mathematics_regr = LinearRegression().fit(X_math, y_math)

X_test_reading = []
X_test_math = []

# accumulate X values for reading and math
for i, row in test_data.iterrows():
    add = row[3:].tolist()
    add.insert(0, row['YEAR'])
    X_test_reading.append(add)
    X_test_math.append(add)
    
# predict based of X values
test_data['PREDICT_READING'] = reading_regr.predict(X_test_reading)
test_data['PREDICT_MATH'] = reading_regr.predict(X_test_reading)

test_data.head()

Unnamed: 0,YEAR,READING_GROWTH,MATHEMATICS_GROWTH,STATE_ALASKA,STATE_ARIZONA,STATE_ARKANSAS,STATE_CALIFORNIA,STATE_COLORADO,STATE_CONNECTICUT,STATE_DELAWARE,...,STATE_TEXAS,STATE_UTAH,STATE_VERMONT,STATE_VIRGINIA,STATE_WASHINGTON,STATE_WEST_VIRGINIA,STATE_WISCONSIN,STATE_WYOMING,PREDICT_READING,PREDICT_MATH
1281,2017,0.0,4.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,3.712264,3.712264
1288,2017,-4.0,-7.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.962264,0.962264
1295,2017,5.0,4.0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4.212264,4.212264
1302,2017,0.0,-4.0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,3.212264,3.212264
1309,2017,5.0,0.0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,3.462264,3.462264


To analyze if being in a particular state does affect growth of reading and math test scores, we can use statsmodel to see the p value of each coefficient we are passing into the model. 

In [11]:
import statsmodels.api as sm

# create statsmodel for reading data
p_reading = sm.OLS(train_data['READING_GROWTH'].tolist(), sm.add_constant(X_reading)).fit()
p_reading.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.511
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,3.113
Date:,"Mon, 16 May 2022",Prob (F-statistic):,2.28e-08
Time:,01:55:08,Log-Likelihood:,-394.27
No. Observations:,212,AIC:,896.5
Df Residuals:,158,BIC:,1078.0
Df Model:,53,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-687.0151,111.242,-6.176,0.000,-906.729,-467.301
x1,0.3425,0.055,6.194,0.000,0.233,0.452
x2,-2.7500,1.273,-2.161,0.032,-5.264,-0.236
x3,0.5000,1.273,0.393,0.695,-2.014,3.014
x4,-0.5000,1.273,-0.393,0.695,-3.014,2.014
x5,-0.2500,1.273,-0.196,0.845,-2.764,2.264
x6,-3.0000,1.273,-2.357,0.020,-5.514,-0.486
x7,-2.2500,1.273,-1.768,0.079,-4.764,0.264
x8,-2.7500,1.273,-2.161,0.032,-5.264,-0.236

0,1,2,3
Omnibus:,6.181,Durbin-Watson:,2.013
Prob(Omnibus):,0.045,Jarque-Bera (JB):,9.162
Skew:,0.103,Prob(JB):,0.0102
Kurtosis:,3.997,Cond. No.,1810000.0


In [12]:
# create statsmodel for math data
p_math = sm.OLS(train_data['MATHEMATICS_GROWTH'].tolist(), sm.add_constant(X_math)).fit()
p_math.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.511
Model:,OLS,Adj. R-squared:,0.346
Method:,Least Squares,F-statistic:,3.11
Date:,"Mon, 16 May 2022",Prob (F-statistic):,2.35e-08
Time:,01:55:08,Log-Likelihood:,-420.79
No. Observations:,212,AIC:,949.6
Df Residuals:,158,BIC:,1131.0
Df Model:,53,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-407.2425,126.068,-3.230,0.002,-656.238,-158.247
x1,0.2038,0.063,3.252,0.001,0.080,0.328
x2,-3.5000,1.442,-2.426,0.016,-6.349,-0.651
x3,3.0000,1.442,2.080,0.039,0.151,5.849
x4,-3.0000,1.442,-2.080,0.039,-5.849,-0.151
x5,-1.7500,1.442,-1.213,0.227,-4.599,1.099
x6,-1.7500,1.442,-1.213,0.227,-4.599,1.099
x7,-5.2500,1.442,-3.640,0.000,-8.099,-2.401
x8,-1.5000,1.442,-1.040,0.300,-4.349,1.349

0,1,2,3
Omnibus:,6.335,Durbin-Watson:,1.739
Prob(Omnibus):,0.042,Jarque-Bera (JB):,6.673
Skew:,-0.294,Prob(JB):,0.0356
Kurtosis:,3.639,Cond. No.,1810000.0


Looking at the statsmodel summary, we can see analyze which coefficients had a notable effect on the predicted value. We can list which ones by looking at the P>|t| columns and if they have a value greater than 0.05 then we can consider it to be significant. 

In [13]:
# accumulate constant names
const_names = train_data.columns[3:].tolist()
const_names.insert(0, 'YEAR')

reading_significant = []

# iterate through p values and if greater than 0.05 add const name to reading_significant
for i in range(len(p_reading.pvalues) - 1):
    if p_reading.pvalues[i + 1] > 0.05:
        reading_significant.append(const_names[i])
        
print("Significant reading test constants: " + str(reading_significant))
print()

math_significant = []

# iterate through p values and if greater than 0.05 add const name to math_significant
for i in range(len(p_math.pvalues) - 1):
    if p_math.pvalues[i + 1] > 0.05:
        math_significant.append(const_names[i])
        
print("Significant math test constants: " + str(math_significant))

Significant reading test constants: ['STATE_ARIZONA', 'STATE_ARKANSAS', 'STATE_CALIFORNIA', 'STATE_CONNECTICUT', 'STATE_DISTRICT_OF_COLUMBIA', 'STATE_DODEA', 'STATE_FLORIDA', 'STATE_GEORGIA', 'STATE_HAWAII', 'STATE_IDAHO', 'STATE_ILLINOIS', 'STATE_INDIANA', 'STATE_IOWA', 'STATE_KENTUCKY', 'STATE_LOUISIANA', 'STATE_MAINE', 'STATE_MARYLAND', 'STATE_MASSACHUSETTS', 'STATE_MICHIGAN', 'STATE_MINNESOTA', 'STATE_MISSISSIPPI', 'STATE_MONTANA', 'STATE_NATIONAL', 'STATE_NEBRASKA', 'STATE_NEVADA', 'STATE_NEW_HAMPSHIRE', 'STATE_NEW_JERSEY', 'STATE_NORTH_CAROLINA', 'STATE_OHIO', 'STATE_OKLAHOMA', 'STATE_OREGON', 'STATE_PENNSYLVANIA', 'STATE_RHODE_ISLAND', 'STATE_SOUTH_CAROLINA', 'STATE_TENNESSEE', 'STATE_UTAH', 'STATE_VERMONT', 'STATE_VIRGINIA', 'STATE_WASHINGTON', 'STATE_WEST_VIRGINIA', 'STATE_WISCONSIN', 'STATE_WYOMING']

Significant math test constants: ['STATE_CALIFORNIA', 'STATE_COLORADO', 'STATE_DELAWARE', 'STATE_DODEA', 'STATE_GEORGIA', 'STATE_HAWAII', 'STATE_ILLINOIS', 'STATE_INDIANA', 'STA

From this list we can tell that a notable amount of states affect the growth of each test score.

## Hypothesis testing

## Communication of Insights Attained