This is a follow-up analysis to the machine learning code. The LSTM model showed that demographic information and the state achievement test reading score from one grade can be used to predict the reading score in the following grade with about 77 percent accuracy (depending on the train-test split). However, the LSTM does not tell us which variables account for a significant share of the variance in reading scores. This multiple regression should answer that question.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv('DataLinkageML.csv')
df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()

Unnamed: 0,StudentNumber,DPS_HomeLg,CYI_Lat,CYI_Deg,Disability,GT_C,FRL_C,Sect504_C,SPED_C,t_grade,time_t,time_t1
0,405587,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,4,414,220
1,405587,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,5,220,260
2,405587,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,6,260,503
3,405587,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,7,503,537
4,405587,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,8,537,559


In [3]:
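# Predictors: demographics, grade at time t, and the reading score at time t
X = df[['DPS_HomeLg','CYI_Lat','CYI_Deg','Disability',
        'GT_C','FRL_C','Sect504_C','SPED_C','t_grade', 'time_t']]
X = sm.add_constant(X)  # add the intercept column
y = df['time_t1']       # outcome: reading score in the following grade

# Ordinary least squares fit on the unscaled predictors
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()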



Dep. Variable:,time_t1,R-squared:,0.725
Model:,OLS,Adj. R-squared:,0.722
Method:,Least Squares,F-statistic:,217.1
Date:,"Sun, 16 Aug 2020",Prob (F-statistic):,7e-223
Time:,16:32:23,Log-Likelihood:,-4445.0
No. Observations:,833,AIC:,8912.0
Df Residuals:,822,BIC:,8964.0
Df Model:,10,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,195.6614,15.312,12.778,0.000,165.607,225.716
DPS_HomeLg,5.9731,4.259,1.402,0.161,-2.387,14.333
CYI_Lat,3.9150,4.631,0.845,0.398,-5.175,13.005
CYI_Deg,-2.3366,4.421,-0.529,0.597,-11.014,6.341
Disability,-16.1169,6.932,-2.325,0.020,-29.724,-2.510
GT_C,36.3039,8.498,4.272,0.000,19.624,52.984
FRL_C,-17.2802,5.470,-3.159,0.002,-28.018,-6.543
Sect504_C,15.2228,12.321,1.236,0.217,-8.962,39.407
SPED_C,-12.7261,7.973,-1.596,0.111,-28.376,2.924

Omnibus:,326.482,Durbin-Watson:,2.199
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2265.487
Skew:,-1.614,Prob(JB):,0.0
Kurtosis:,10.406,Cond. No.,5450.0


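These results show that the predictor variables account for 72.5% of the variance in reading scores (R-squared = 0.725), which is in line with the results from the LSTM model. However, the analysis also triggered a warning that the condition number is large (5450), which might indicate strong multicollinearity or other numerical problems. Predictors measured on very different scales can produce this warning on their own, so let's standardize the predictor variables and see whether it goes away.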

In [4]:
# Same predictors as before; the constant is added after scaling so it stays a column of ones
X = df[['DPS_HomeLg','CYI_Lat','CYI_Deg','Disability',
        'GT_C','FRL_C','Sect504_C','SPED_C','t_grade', 'time_t']]
y = df['time_t1']

# Standardize each predictor to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)
scaled_X = pd.DataFrame(scaled_X, columns = X.columns)
scaled_X = sm.add_constant(scaled_X)

model = sm.OLS(y,scaled_X).fit()
prediction = model.predict(scaled_X)
model.summary()



Dep. Variable:,time_t1,R-squared:,0.725
Model:,OLS,Adj. R-squared:,0.722
Method:,Least Squares,F-statistic:,217.1
Date:,"Sun, 16 Aug 2020",Prob (F-statistic):,7e-223
Time:,16:33:27,Log-Likelihood:,-4445.0
No. Observations:,833,AIC:,8912.0
Df Residuals:,822,BIC:,8964.0
Df Model:,10,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,562.4394,1.753,320.826,0.000,558.998,565.880
DPS_HomeLg,2.7344,1.950,1.402,0.161,-1.093,6.562
CYI_Lat,1.7334,2.051,0.845,0.398,-2.291,5.758
CYI_Deg,-1.0904,2.063,-0.529,0.597,-5.140,2.959
Disability,-4.3832,1.885,-2.325,0.020,-8.084,-0.683
GT_C,8.6616,2.027,4.272,0.000,4.682,12.641
FRL_C,-6.6860,2.117,-3.159,0.002,-10.841,-2.532
Sect504_C,3.2462,2.627,1.236,0.217,-1.911,8.404
SPED_C,-4.3431,2.721,-1.596,0.111,-9.684,0.998

Omnibus:,326.482,Durbin-Watson:,2.199
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2265.487
Skew:,-1.614,Prob(JB):,0.0
Kurtosis:,10.406,Cond. No.,3.29


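Scaling did not change the fit: R-squared, the F-statistic, and every t and p value are identical to the unscaled model. Only the coefficients and their standard errors changed, because they are now expressed per standard deviation of each predictor, and the condition number dropped from 5450 to 3.29.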

Having other disabilities in addition to deafness was associated with significantly lower reading scores (t = -2.325, p < .05). Participation in the Gifted/Talented program (t = 4.272, p < .001) and in the Free/Reduced Lunch program (t = -3.159, p < .01) were both significant predictors of reading scores. The signs of the coefficients show that students in the gifted/talented program scored higher than students who were not, whereas students in the free/reduced lunch program scored lower than students who were not. The child's grade also explained a significant portion of the variance in reading scores, t = 5.092, p < .001. Finally, the reading score in one year significantly predicted the reading score in the following year (autocorrelation), t = 26.993, p < .001.

I worry that the autocorrelation effects are driving this entire analysis. Will the demographic variables alone explain a significant portion of the variance in reading scores without the previous year's score?

In [5]:
# Demographic predictors and grade only; the previous year's score (time_t) is dropped
X = df[['DPS_HomeLg','CYI_Lat','CYI_Deg','Disability',
        'GT_C','FRL_C','Sect504_C','SPED_C','t_grade']]

scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)
scaled_X = pd.DataFrame(scaled_X, columns = X.columns)
scaled_X = sm.add_constant(scaled_X)

# y (time_t1) is reused from the previous cell
model = sm.OLS(y,scaled_X).fit()
prediction = model.predict(scaled_X)
model.summary()



Dep. Variable:,time_t1,R-squared:,0.482
Model:,OLS,Adj. R-squared:,0.476
Method:,Least Squares,F-statistic:,85.08
Date:,"Sun, 16 Aug 2020",Prob (F-statistic):,2.81e-111
Time:,16:34:16,Log-Likelihood:,-4709.4
No. Observations:,833,AIC:,9439.0
Df Residuals:,823,BIC:,9486.0
Df Model:,9,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,562.4394,2.406,233.730,0.000,557.716,567.163
DPS_HomeLg,8.4451,2.661,3.174,0.002,3.223,13.667
CYI_Lat,6.3756,2.805,2.273,0.023,0.870,11.881
CYI_Deg,-4.6557,2.826,-1.647,0.100,-10.203,0.891
Disability,-7.2548,2.584,-2.808,0.005,-12.326,-2.183
GT_C,25.2343,2.652,9.514,0.000,20.028,30.440
FRL_C,-21.8619,2.801,-7.805,0.000,-27.360,-16.364
Sect504_C,12.1912,3.578,3.407,0.001,5.169,19.214
SPED_C,-15.4567,3.692,-4.187,0.000,-22.704,-8.210

Omnibus:,253.122,Durbin-Watson:,0.943
Prob(Omnibus):,0.0,Jarque-Bera (JB):,858.254
Skew:,-1.445,Prob(JB):,4.2899999999999997e-187
Kurtosis:,7.047,Cond. No.,3.09


Even without the previous year's reading score, the model with just demographic predictors accounted for 48.2% of the variance in reading scores, F(9, 823) = 85.08, p < .001. Without that autocorrelation effect, almost all predictor variables accounted for a significant portion of the variance in reading scores.

There was a significant effect of home language, with an advantage for children from English-speaking homes, t = 3.174, p < .01. 

Children with unilateral hearing loss actually scored lower than children with bilateral loss, t = 2.273, p < .05. 

Children with hearing loss AND other disabilities scored lower than children with just hearing loss, t = -2.808, p < .01.

Scores increased for participation in the gifted/talented program, t = 9.514, p < .001.

Scores decreased for participation in the free/reduced lunch program, t = -7.805, p < .001.

Scores increased with participation in the Section 504 program, t = 3.407, p = .001.

Scores decreased with participation in special education, t = -4.187, p < .001.

Finally, reading scores increased as the grade level increased, t = 15.908, p < .001.

All of these findings are expected, with two exceptions. First, children with unilateral hearing loss were coded as 0 and children with bilateral loss as 1. The coefficient for this variable was positive, indicating that reading scores were higher for children with bilateral loss, which is unexpected.

Second, the analysis found no effect for degree of hearing loss, t = -1.647, ns. We would expect children with a higher degree of loss to score lower than children with a lesser degree of loss, but this claim was not substantiated by the analysis. If we look at the results from the PCA, we see that degree of hearing loss was correlated with laterality, as evidenced by two different PCs. So I suspect that those two variables share variance, which would make it harder for the regression to separate their individual effects.