# Finding cheating schools with linear regression

Let's use **test scores** to see if we can find schools that are cheating on standardized tests!

**Our question:** if we find the connection between 2003 and 2004 test scores, can we find schools that have unusually high or low scores?

* [Cheating may be pervasive; TAKS surges raise questions about hundreds of schools](http://clipfile.org/?p=754) Joshua Benton and Holly K. Hacker, Dallas Morning News
* [Tech writeup on investigate.ai](https://investigate.ai/dmn-texas-school-cheating/)

In [28]:
import pandas as pd
import numpy as np

## Read in our data

We have two datasets: 2003 scores and 2004 scores. We'll need to combine them based on the school `CAMPUS` code.

In [39]:
df_2003 = pd.read_csv("2003-grade-3.csv")
df_2003.head()

|  | CAMPUS | CNAME | reading_score_2003 |
| --- | --- | --- | --- |
| 0 | 1902103 | CAYUGA EL | 2330.0 |
| 1 | 1903101 | ELKHART EL | 2285.0 |
| 2 | 1904102 | FRANKSTON EL | 2299.0 |
| 3 | 1906102 | NECHES EL | 2236.0 |
| 4 | 1907110 | STORY EL | 2202.0 |


In [40]:
df_2004 = pd.read_csv("2004-grade-4.csv")
df_2004.head()

|  | CAMPUS | reading_score_2004 |
| --- | --- | --- |
| 0 | 1902103 | 2392.0 |
| 1 | 1903101 | 2263.0 |
| 2 | 1904102 | 2242.0 |
| 3 | 1906102 | 2218.0 |
| 4 | 1907110 | 2200.0 |


In [41]:
df = df_2003.merge(df_2004, on='CAMPUS')
df.head()

|  | CAMPUS | CNAME | reading_score_2003 | reading_score_2004 |
| --- | --- | --- | --- | --- |
| 0 | 1902103 | CAYUGA EL | 2330.0 | 2392.0 |
| 1 | 1903101 | ELKHART EL | 2285.0 | 2263.0 |
| 2 | 1904102 | FRANKSTON EL | 2299.0 | 2242.0 |
| 3 | 1906102 | NECHES EL | 2236.0 | 2218.0 |
| 4 | 1907110 | STORY EL | 2202.0 | 2200.0 |
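
By default `merge` does an inner join, so a campus that appears in only one of the two files quietly disappears. A minimal sketch of how you might check for that, using tiny made-up frames (the campus codes `1` through `4` are hypothetical stand-ins, not real schools):

```python
import pandas as pd

# Tiny made-up frames standing in for the real CSVs
scores_2003 = pd.DataFrame({
    "CAMPUS": [1, 2, 3],
    "reading_score_2003": [2330.0, 2285.0, 2299.0],
})
scores_2004 = pd.DataFrame({
    "CAMPUS": [2, 3, 4],
    "reading_score_2004": [2263.0, 2242.0, 2218.0],
})

# The default merge is an inner join: only campuses present in BOTH years survive
inner = scores_2003.merge(scores_2004, on="CAMPUS")

# An outer join with indicator=True adds a _merge column saying where each row came from
outer = scores_2003.merge(scores_2004, on="CAMPUS", how="outer", indicator=True)
merge_counts = outer["_merge"].value_counts()

print(len(inner), "campuses in both years")
print(merge_counts)
```

Rows marked `left_only` or `right_only` are campuses that only have one year of scores, which may be worth a look before you trust the merged data.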


## Our friend: correlation!

Any time you ask yourself, "is this thing correlated with this other thing?" you are *probably* more interested in a regression than in an actual correlation.

But we'll look at it anyway!

In [42]:
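# Correlate only the numeric columns; recent pandas versions raise an error
# if a text column like CNAME is included
df.select_dtypes(include="number").corr()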

|  | CAMPUS | reading_score_2003 | reading_score_2004 |
| --- | --- | --- | --- |
| CAMPUS | 1.0 | 0.040762 | 0.022746 |
| reading_score_2003 | 0.040762 | 1.0 | 0.804288 |
| reading_score_2004 | 0.022746 | 0.804288 | 1.0 |


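A correlation of about `0.8` between the 2003 and 2004 scores is pretty strong: schools that scored well one year mostly scored well the next.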
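
The correlation also tells you what to expect from the regression: with a single predictor, the R-squared of a straight-line fit is the correlation squared. Here is a small sketch of that relationship on made-up numbers (the data is randomly generated for illustration, not the real scores), followed by the same arithmetic on the `0.804288` from the table above:

```python
import numpy as np

# Made-up scores (NOT the real data): x is one year, y is the next
rng = np.random.default_rng(0)
x = rng.normal(2200, 80, size=200)
y = 600 + 0.7 * x + rng.normal(0, 40, size=200)

# Pearson correlation between the two years
r = np.corrcoef(x, y)[0, 1]

# R-squared of a straight-line fit of y on x
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)
r_squared = 1 - residuals.var() / y.var()

print(r ** 2, r_squared)  # the two numbers agree

# The same arithmetic on the correlation from the table above
notebook_r = 0.804288
print(round(notebook_r ** 2, 3))
```

Squaring `0.804288` gives roughly `0.647`, which is the R-squared the regression summary reports further down.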

## Training our model

When we train our model, we're going to show it 2003 scores and 2004 scores, and ask "please figure out how these are related to each other."

In [46]:
# First we need to get rid of missing data.
# This will remove any rows that have any missing data.

print("Before dropping missing data", df.shape)
df = df.dropna()
print("After dropping missing data", df.shape)

Before dropping missing data (3501, 4)
After dropping missing data (3501, 4)
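
Nothing was dropped here, so the data was already complete. If you want to see what `dropna` does when values really are missing, here is a minimal sketch on a made-up three-school frame (the campus codes are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up example: one school is missing its 2004 score
example = pd.DataFrame({
    "CAMPUS": [1, 2, 3],
    "reading_score_2003": [2330.0, 2285.0, 2299.0],
    "reading_score_2004": [2392.0, np.nan, 2242.0],
})

# Default behavior: drop every ROW that has any missing value
rows_dropped = example.dropna()

# axis="columns" drops every COLUMN that has any missing value instead
cols_dropped = example.dropna(axis="columns")

# subset= only checks the columns the model actually needs
needed = ["reading_score_2003", "reading_score_2004"]
model_ready = example.dropna(subset=needed)

print(rows_dropped.shape, cols_dropped.shape, model_ready.shape)
```

Using `subset=` can be the safer habit, since a missing value in a column the model never uses would otherwise throw away a perfectly good school.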


To build our model, we're using the [statsmodels](https://www.statsmodels.org/stable/index.html) library. It allows you to write formulas similar to the R language (so you can ask R people if you know any and need a hand!).

In [47]:
import statsmodels.formula.api as smf

# Determine the impact of 2003's scores on 2004's scores
model = smf.ols("reading_score_2004 ~ reading_score_2003", data=df)

results = model.fit()
results.summary()

|  |  |  |  |
| --- | --- | --- | --- |
| Dep. Variable: | reading_score_2004 | R-squared: | 0.647 |
| Model: | OLS | Adj. R-squared: | 0.647 |
| Method: | Least Squares | F-statistic: | 6410.0 |
| Date: | Fri, 29 Jul 2022 | Prob (F-statistic): | 0.0 |
| Time: | 20:07:09 | Log-Likelihood: | -17961.0 |
| No. Observations: | 3501 | AIC: | 35930.0 |
| Df Residuals: | 3499 | BIC: | 35940.0 |
| Df Model: | 1 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 632.6929 | 19.936 | 31.736 | 0.000 | 593.606 | 671.780 |
| reading_score_2003 | 0.7094 | 0.009 | 80.061 | 0.000 | 0.692 | 0.727 |

|  |  |  |  |
| --- | --- | --- | --- |
| Omnibus: | 263.054 | Durbin-Watson: | 1.897 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1312.9 |
| Skew: | 0.147 | Prob(JB): | 8.08e-286 |
| Kurtosis: | 5.986 | Cond. No. | 64900.0 |


I'm going to make a risky risky move and say: **I don't care about any of the math-y, stats-y numbers**. I'm just doing an investigation here! If it's a dead end it's a dead end, same as if someone on the street said something that didn't turn out to be true.

## Predict what the 2004 scores *should* be

We'll save it to a new column.

In [49]:
df['predicted_2004'] = results.predict()
df.head()

|  | CAMPUS | CNAME | reading_score_2003 | reading_score_2004 | predicted_2004 |
| --- | --- | --- | --- | --- | --- |
| 0 | 1902103 | CAYUGA EL | 2330.0 | 2392.0 | 2285.616126 |
| 1 | 1903101 | ELKHART EL | 2285.0 | 2263.0 | 2253.692715 |
| 2 | 1904102 | FRANKSTON EL | 2299.0 | 2242.0 | 2263.624443 |
| 3 | 1906102 | NECHES EL | 2236.0 | 2218.0 | 2218.931668 |
| 4 | 1907110 | STORY EL | 2202.0 | 2200.0 | 2194.811757 |


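Which schools missed the prediction by the most? Because `score_diff` is the prediction minus the actual score, the largest positive values belong to schools that scored far *below* what the model expected. Schools that beat the prediction end up with negative values.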

In [50]:
# Simple numeric difference: predicted minus actual,
# so a positive value means the school scored below the prediction
df['score_diff'] = df.predicted_2004 - df.reading_score_2004

df.sort_values(by='score_diff', ascending=False).head(10)

|  | CAMPUS | CNAME | reading_score_2003 | reading_score_2004 | predicted_2004 | score_diff |
| --- | --- | --- | --- | --- | --- | --- |
| 2024 | 105802041 | SAN MARCOS PREP | 2245.0 | 2025.0 | 2225.31635 | 200.31635 |
| 2370 | 139908101 | ROXTON EL | 2388.0 | 2130.0 | 2326.761855 | 196.761855 |
| 1708 | 101912140 | DOGAN EL | 2150.0 | 1972.0 | 2157.922483 | 185.922483 |
| 104 | 15819001 | SHEKINAH RADIAN | 2149.0 | 1976.0 | 2157.213074 | 181.213074 |
| 1543 | 101840101 | TWO DIMENSIONS | 2275.0 | 2076.0 | 2246.598624 | 170.598624 |
| 1702 | 101912134 | CRAWFORD EL | 2240.0 | 2056.0 | 2221.769304 | 165.769304 |
| 1127 | 66005101 | RAMIREZ EL | 2195.0 | 2033.0 | 2189.845893 | 156.845893 |
| 1451 | 90905101 | GRANDVIEW-HOPKI | 2346.0 | 2149.0 | 2296.966672 | 147.966672 |
| 2155 | 108912110 | JOSE DE ESCANDO | 2360.0 | 2166.0 | 2306.8984 | 140.8984 |
| 694 | 57817101 | FOCUS LEARNING | 2079.0 | 1973.0 | 2107.554435 | 134.554435 |


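Sorry to do this to you, but let's get **more stats-y** and look at standard deviation. The last calculation showed us the raw difference in score; this one tells you (statistically) how strange a score is. The higher it is, the more of an outlier it is. Because we take the absolute value, it flags surprises in both directions, so check the sign of `score_diff` to see whether a school landed above or below its prediction.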

In [51]:
# Divide each residual (actual minus predicted) by the standard deviation
# of the residuals, then take the absolute value
df['error_std_dev'] = (results.resid / np.sqrt(results.mse_resid)).abs()

df.sort_values(by='error_std_dev', ascending=False).head(10)

|  | CAMPUS | CNAME | reading_score_2003 | reading_score_2004 | predicted_2004 | score_diff | error_std_dev |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 746 | 57905115 | HARRELL BUDD EL | 2140.0 | 2470.0 | 2150.828391 | -319.171609 | 7.801048 |
| 2261 | 123803101 | TEKOA ACADEMY O | 2021.0 | 2313.0 | 2066.408705 | -246.591295 | 6.027073 |
| 96 | 15803101 | HIGGS CARTER KI | 2097.0 | 2349.0 | 2120.323799 | -228.676201 | 5.5892 |
| 1737 | 101912172 | HENDERSON N EL | 2093.0 | 2324.0 | 2117.486162 | -206.513838 | 5.047518 |
| 2024 | 105802041 | SAN MARCOS PREP | 2245.0 | 2025.0 | 2225.31635 | 200.31635 | 4.896042 |
| 2370 | 139908101 | ROXTON EL | 2388.0 | 2130.0 | 2326.761855 | 196.761855 | 4.809164 |
| 2708 | 180901101 | MIMI FARLEY ELE | 2294.0 | 2448.0 | 2260.077397 | -187.922603 | 4.593119 |
| 1708 | 101912140 | DOGAN EL | 2150.0 | 1972.0 | 2157.922483 | 185.922483 | 4.544233 |
| 104 | 15819001 | SHEKINAH RADIAN | 2149.0 | 1976.0 | 2157.213074 | 181.213074 | 4.429128 |
| 697 | 57825001 | PINNACLE SCHOOL | 2068.0 | 2274.0 | 2099.750934 | -174.249066 | 4.258917 |
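
If you're curious what that one-liner is doing, here is a sketch of the same idea in plain NumPy on made-up data (randomly generated, not the real scores). It assumes `mse_resid` means the sum of squared residuals divided by the residual degrees of freedom, which is how statsmodels defines it. The last few lines go back to the real table: dividing a `score_diff` by its `error_std_dev` should give the same residual standard deviation for every row.

```python
import numpy as np

# Made-up data (NOT the real scores)
rng = np.random.default_rng(1)
x = rng.normal(2200, 80, size=500)
y = 600 + 0.7 * x + rng.normal(0, 40, size=500)

# Fit a straight line and compute residuals (actual minus predicted)
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# Residual mean squared error: sum of squared residuals divided by the
# residual degrees of freedom (n minus the two fitted parameters)
mse_resid = (resid ** 2).sum() / (len(x) - 2)
resid_std = np.sqrt(mse_resid)

# How many residual standard deviations each point sits from the line
error_std_dev = np.abs(resid / resid_std)
print(resid_std, error_std_dev.max())

# Sanity check on three (score_diff, error_std_dev) pairs from the table above
table_rows = [(-319.171609, 7.801048), (-246.591295, 6.027073), (200.31635, 4.896042)]
implied = [abs(diff) / z for diff, z in table_rows]
print(implied)
```

All three rows imply a residual standard deviation of roughly 40.9 points, so an `error_std_dev` of 3 corresponds to a school landing a bit over 120 points away from its prediction.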


Let's [open the story](http://clipfile.org/?p=754) and search for "Harrell Budd Elementary." How is the research described?

> At Budd, the questions involve the fourth grade, where results in both reading and math were questionable. **In the third grade, Budd’s students finished in the bottom 4 percent of the state in reading.** Not unusual, considering nearly 95 percent of its students are poor and more than 40 percent have limited English skills.
>
> **But Budd’s fourth-graders were worldbeaters. In reading, they had the second-highest scores in the state**, beating schools in Highland Park, Plano and every other high-wealth district. The only school to finish ahead of them was a Houston magnet school for gifted children. Budd’s fourth-graders fared almost as well in math, ranking in the top 2 percent of Texas.

Here's the one place where "standard deviation" shows up:

> More than 200 schools had large, unexplained score gaps between grades or between tests. In statisticians’ lingo, these schools had at least one average scale score that was **more than three standard deviations** away from what would be predicted based on their scores in other grades or on other tests.
> 
> In some cases, there may be legitimate explanations for such gaps. School attendance boundaries could have changed dramatically. Or a new public housing development might have radically changed the composition of a school’s student body.
> 
> But researchers said that large differences between tests are generally signs of something amiss.
> 
> **"If you see big swings in those numbers, I think we should raise our eyebrows and say this is very, very unusual," Dr. Haladyna said.**
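
To mimic the story's "more than three standard deviations" rule, you could keep the sign of the difference and flag schools on either side. This is only a sketch: `RESID_STD` is an approximation obtained by dividing a `score_diff` by its `error_std_dev` in the table above (for example, 319.17 / 7.80), and `classify` is a helper name made up for this example, not something from the story or from statsmodels.

```python
# Approximate residual standard deviation, derived from the tables above
RESID_STD = 40.91
# Cutoff quoted in the story
THRESHOLD = 3.0

# (school, actual 2004 score, predicted 2004 score) taken from the tables above
schools = [
    ("HARRELL BUDD EL", 2470.0, 2150.828391),
    ("SAN MARCOS PREP", 2025.0, 2225.31635),
    ("CAYUGA EL", 2392.0, 2285.616126),
    ("STORY EL", 2200.0, 2194.811757),
]


def classify(actual, predicted):
    """Return the signed number of standard deviations and a label."""
    z = (actual - predicted) / RESID_STD
    if z > THRESHOLD:
        label = "far above prediction"
    elif z < -THRESHOLD:
        label = "far below prediction"
    else:
        label = "within 3 standard deviations"
    return z, label


flags = {name: classify(actual, predicted) for name, actual, predicted in schools}
for name, (z, label) in flags.items():
    print(f"{name:<16} {z:+.2f}  {label}")
```

Keeping the sign matters for this investigation: a school far *above* its prediction is the kind of surge the story is about, while a school far below it is a different kind of surprise.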
