This is a follow-up analysis to the machine learning code. The LSTM model showed that demographic information and the state achievement test reading score from one grade can predict the reading score in the following grade with about 77 percent accuracy (depending on the train-test split). However, that model does not tell us which variables account for a significant share of the variance in reading scores. This multiple regression should answer that question.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv('DataLinkageML.csv')
df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()

,StudentNumber,DPS_HomeLg,CYI_Lat,CYI_Deg,Disability,GT_C,FRL_C,Sect504_C,SPED_C,t_grade,time_t,time_t1
0,405587,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,4,414,220
1,405587,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,5,220,260
2,405587,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,6,260,503
3,405587,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,7,503,537
4,405587,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,8,537,559


In [3]:
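# Predictors: student ID, demographic indicators, grade, and the current-year score
X = df[['StudentNumber', 'DPS_HomeLg','CYI_Lat','CYI_Deg','Disability',
        'GT_C','FRL_C','Sect504_C','SPED_C','t_grade', 'time_t']]
X = sm.add_constant(X)  # statsmodels does not add an intercept automatically
y = df['time_t1']       # outcome: reading score in the following grade

model = sm.OLS(y, X).fit()
predictions = model.predict(X)  # in-sample fitted values
model.summary()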



0,1,2,3
Dep. Variable:,time_t1,R-squared:,0.728
Model:,OLS,Adj. R-squared:,0.724
Method:,Least Squares,F-statistic:,199.9
Date:,"Thu, 11 Jun 2020",Prob (F-statistic):,1.77e-223
Time:,12:30:06,Log-Likelihood:,-4440.9
No. Observations:,833,AIC:,8906.0
Df Residuals:,821,BIC:,8962.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,153.4342,21.171,7.247,0.000,111.878,194.991
StudentNumber,6.713e-05,2.34e-05,2.874,0.004,2.13e-05,0.000
DPS_HomeLg,7.1489,4.260,1.678,0.094,-1.213,15.511
CYI_Lat,3.4364,4.614,0.745,0.457,-5.620,12.493
CYI_Deg,-0.4462,4.450,-0.100,0.920,-9.181,8.289
Disability,-12.7100,7.003,-1.815,0.070,-26.456,1.036
GT_C,36.8580,8.463,4.355,0.000,20.247,53.469
FRL_C,-17.4988,5.447,-3.213,0.001,-28.190,-6.807
Sect504_C,16.0265,12.270,1.306,0.192,-8.058,40.111

0,1,2,3
Omnibus:,340.798,Durbin-Watson:,2.21
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2435.16
Skew:,-1.689,Prob(JB):,0.0
Kurtosis:,10.665,Cond. No.,7450000.0


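These results show that the predictor variables account for 72.8% of the variance in reading scores, which is in line with the results from the LSTM model. However, the analysis also triggered a warning about possible multicollinearity: the condition number is very large (7.45e+06). Predictors on very different scales (StudentNumber is in the hundreds of thousands) can inflate the condition number on their own, so let's standardize the predictor variables.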

In [4]:
# Same predictors as the first model
X = df[['StudentNumber', 'DPS_HomeLg','CYI_Lat','CYI_Deg','Disability',
        'GT_C','FRL_C','Sect504_C','SPED_C','t_grade', 'time_t']]
y = df['time_t1']

# Standardize each predictor to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)
# Keep X's index so the rows stay aligned with y
scaled_X = pd.DataFrame(scaled_X, columns=X.columns, index=X.index)
scaled_X = sm.add_constant(scaled_X)

model = sm.OLS(y, scaled_X).fit()
prediction = model.predict(scaled_X)
model.summary()



0,1,2,3
Dep. Variable:,time_t1,R-squared:,0.728
Model:,OLS,Adj. R-squared:,0.724
Method:,Least Squares,F-statistic:,199.9
Date:,"Thu, 11 Jun 2020",Prob (F-statistic):,1.77e-223
Time:,12:30:23,Log-Likelihood:,-4440.9
No. Observations:,833,AIC:,8906.0
Df Residuals:,821,BIC:,8962.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,562.4394,1.745,322.240,0.000,559.013,565.865
StudentNumber,5.3536,1.863,2.874,0.004,1.698,9.010
DPS_HomeLg,3.2727,1.950,1.678,0.094,-0.555,7.101
CYI_Lat,1.5215,2.043,0.745,0.457,-2.488,5.531
CYI_Deg,-0.2082,2.077,-0.100,0.920,-4.285,3.868
Disability,-3.4566,1.905,-1.815,0.070,-7.195,0.282
GT_C,8.7938,2.019,4.355,0.000,4.831,12.757
FRL_C,-6.7706,2.107,-3.213,0.001,-10.907,-2.634
Sect504_C,3.4176,2.617,1.306,0.192,-1.718,8.554

0,1,2,3
Omnibus:,340.798,Durbin-Watson:,2.21
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2435.16
Skew:,-1.689,Prob(JB):,0.0
Kurtosis:,10.665,Cond. No.,3.29


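Standardizing left the model fit unchanged: R-squared, the F-statistic, and every predictor's t and p value are identical to the unscaled model. Only the coefficients changed, because they are now expressed per standard deviation of each predictor, and the condition number dropped from 7.45e+06 to 3.29.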
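To see why scaling leaves the fit statistics alone, here is a minimal sketch on synthetic data (not the study data; the three columns are made-up stand-ins for an ID-like number, a binary flag, and a prior score). It uses plain NumPy least squares rather than statsmodels, and assumes the same population standard deviation that StandardScaler uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins (NOT the study data): an ID-like column on a huge scale,
# a binary flag, and a prior score.
student_id = rng.integers(100_000, 900_000, size=n).astype(float)
flag = rng.integers(0, 2, size=n).astype(float)
prior = rng.normal(500, 60, size=n)
X = np.column_stack([student_id, flag, prior])
y = 150 + 5e-5 * student_id + 30 * flag + 0.7 * prior + rng.normal(0, 40, size=n)


def fit_ols(features, target):
    """Least-squares fit with an intercept; returns coefficients, R^2, condition number."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return beta, r2, np.linalg.cond(design)


sd = X.std(axis=0)                 # population SD (ddof=0), as StandardScaler uses
Z = (X - X.mean(axis=0)) / sd      # standardized predictors

beta_raw, r2_raw, cond_raw = fit_ols(X, y)
beta_std, r2_std, cond_std = fit_ols(Z, y)

print(f"R^2 raw={r2_raw:.6f}  standardized={r2_std:.6f}")
print(f"condition number raw={cond_raw:.3g}  standardized={cond_std:.3g}")
print("slope ratio (standardized / raw):", beta_std[1:] / beta_raw[1:])
print("predictor SDs:                   ", sd)
```

The two R-squared values should agree, and each standardized slope should equal the raw slope multiplied by that predictor's standard deviation. Standardizing is a change of units, so it shrinks the condition number without removing any correlation among the predictors.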

StudentNumber was a significant predictor, t = 2.874, p < .01, demonstrating individual differences. Participation in the Gifted/Talented program (t = 4.355, p < .001) and the Free/Reduced Lunch program (t = -3.213, p = .001) were both significant predictors of reading scores. The coefficients show that students in the gifted/talented program scored higher than students who were not, while students in the free/reduced lunch program scored lower than students who were not. The child's grade also explained a significant portion of the variance in reading scores, t = 5.673, p < .001. Finally, the reading score from one year significantly predicted the reading score the following year (autocorrelation), t = 26.806, p < .001.

I worry that the autocorrelation effects are driving this entire analysis. Will the demographic variables alone explain a significant portion of the variance in reading scores without the previous year's score?

In [5]:
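# Drop time_t (the previous year's score); y is reused from the cell above
X = df[['StudentNumber', 'DPS_HomeLg','CYI_Lat','CYI_Deg','Disability',
        'GT_C','FRL_C','Sect504_C','SPED_C','t_grade']]

# Standardize each predictor to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)
# Keep X's index so the rows stay aligned with y
scaled_X = pd.DataFrame(scaled_X, columns=X.columns, index=X.index)
scaled_X = sm.add_constant(scaled_X)

model = sm.OLS(y, scaled_X).fit()
prediction = model.predict(scaled_X)
model.summary()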



0,1,2,3
Dep. Variable:,time_t1,R-squared:,0.49
Model:,OLS,Adj. R-squared:,0.484
Method:,Least Squares,F-statistic:,79.03
Date:,"Thu, 11 Jun 2020",Prob (F-statistic):,4.02e-113
Time:,12:30:39,Log-Likelihood:,-4702.7
No. Observations:,833,AIC:,9427.0
Df Residuals:,822,BIC:,9479.0
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,562.4394,2.389,235.460,0.000,557.751,567.128
StudentNumber,9.2440,2.541,3.638,0.000,4.256,14.232
DPS_HomeLg,9.2928,2.651,3.505,0.000,4.089,14.497
CYI_Lat,5.9433,2.787,2.133,0.033,0.474,11.413
CYI_Deg,-3.0813,2.838,-1.086,0.278,-8.653,2.490
Disability,-5.6139,2.604,-2.156,0.031,-10.725,-0.502
GT_C,25.2255,2.633,9.581,0.000,20.058,30.393
FRL_C,-21.7908,2.780,-7.837,0.000,-27.248,-16.333
Sect504_C,12.3591,3.552,3.480,0.001,5.388,19.331

0,1,2,3
Omnibus:,261.508,Durbin-Watson:,0.957
Prob(Omnibus):,0.0,Jarque-Bera (JB):,910.075
Skew:,-1.485,Prob(JB):,2.4e-198
Kurtosis:,7.171,Cond. No.,3.09


Even without the previous year's reading score, the model with just demographic predictors accounted for 49.0% of the variance in reading scores. Without that autocorrelation effect, almost all of the predictor variables accounted for a significant portion of the variance in reading scores.
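As a rough check on how much the previous year's score adds, the two R-squared values can be compared with a partial F-test. This is a sketch that uses only the rounded numbers printed in the two summaries (R-squared of 0.728 and 0.490, and 821 residual degrees of freedom for the full model), so the result is approximate.

```python
import math

# Values copied from the two statsmodels summaries (rounded to three decimals there).
r2_full = 0.728        # demographics + time_t
r2_reduced = 0.490     # demographics only
df_resid_full = 821    # residual degrees of freedom of the full model
n_dropped = 1          # only time_t was removed

# Share of variance uniquely attributable to the previous year's score.
delta_r2 = r2_full - r2_reduced

# Partial F-statistic for the dropped predictor.
f_change = (delta_r2 / n_dropped) / ((1 - r2_full) / df_resid_full)

print(f"R^2 change: {delta_r2:.3f}")
print(f"F change:   {f_change:.1f}")
print(f"sqrt(F):    {math.sqrt(f_change):.2f}")
```

When a single predictor is dropped, the square root of the partial F equals the absolute t value of that predictor in the full model. The result here lands close to the t = 26.806 reported for the previous year's score, so the two summaries are consistent with each other.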

First, there was a significant effect of StudentNumber, t = 3.638, p < .001, which suggests individual differences between students.

There was a significant effect of home language, with an advantage for children from English-speaking homes, t = 3.505, p < .001. 

Children with unilateral hearing loss actually scored lower than children with bilateral loss, t = 2.133, p < .05. 

Children with hearing loss AND other disabilities scored lower than children with just hearing loss, t = -2.156, p < .05.

Scores increased for participation in the gifted/talented program, t = 9.581, p < .001.

Scores decreased for participation in the free/reduced lunch program, t = -7.837, p < .001.

Scores increased with participation in the Section 504 program, t = 3.480, p = .001.

Scores decreased with participation in special education, t = -4.183, p < .001.

Finally, reading scores increased as the grade level increased, t = 16.427, p < .001.

All of these findings are expected, with two exceptions. First, children with unilateral hearing loss were coded as 0, while children with bilateral loss were coded as 1. The coefficient for this variable was positive, indicating that reading scores were higher for children with bilateral loss. This is unexpected.

Second, the analysis found no effect for degree of hearing loss, t = -1.086, ns. We would expect children with a higher degree of loss to score lower than children with a lesser degree of loss, but the analysis did not support this. I suspect that degree of hearing loss was correlated with other variables, in that it predicted other disabilities or special education. Because of its shared variance with those other predictors, there was insufficient unique variance for it to reach significance.
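One way to test the shared-variance suspicion is to compute a variance inflation factor (VIF) for each predictor. The sketch below uses synthetic data, not the study data. The names `degree`, `sped`, and `frl` are hypothetical stand-ins, and the strength of the relationship between them is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 800

# Synthetic stand-ins (NOT the study data): "degree" drives "sped" strongly,
# while "frl" is generated independently of both.
degree = rng.normal(size=n)
sped = 0.9 * degree + rng.normal(scale=0.4, size=n)
frl = rng.normal(size=n)
X = np.column_stack([degree, sped, frl])
names = ["degree", "sped", "frl"]


def vif(features, j):
    """VIF for column j: 1 / (1 - R^2) from regressing it on the other columns."""
    target = features[:, j]
    others = np.delete(features, j, axis=1)
    design = np.column_stack([np.ones(len(others)), others])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)


vifs = {name: vif(X, j) for j, name in enumerate(names)}
for name, value in vifs.items():
    print(f"{name:>6}: VIF = {value:.2f}")
```

A VIF near 1 means a predictor shares little variance with the others. Larger values mean its standard error is inflated, which can keep a real effect from reaching significance. On the actual data, the same check could be run over the predictor columns of `df`; statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for this purpose.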