## Week 3-1 - Linear Regression - homework

This assignment is inspired by a 2006 [story in the Pioneer Press](https://www.twincities.com/2010/07/09/schools-that-work-despite-appearances-schools-doing-better-than-expected-have-traits-in-common/) which asked: what makes a school succeed despite disadvantages? The reporters performed a regression of standardized test scores against the number of low-income students at each school. (We actually only have the number of students who are eligible for free meals at school, but this is a widely used proxy for the economic status of the students.)

The 2006 story explains the purpose of doing the regression:

"Schools with large numbers of students from low-income families — or who move often, are learning English or have other special needs — almost always fare worse on standardized tests, most educators agree. The Pioneer Press analyzed three years of test scores from all 731 Minnesota elementary schools to predict how well each school should do when its percentage of low-income students is taken into account — effectively leveling the playing field between the haves and have-nots."

The reporters then visited those schools that out-performed their predicted scores, trying to determine what made them succeed despite the odds.

This file was adapted from a notebook created by Chase Davis and Richard Dunks for the 2015 version of this course, gratefully used with permission.


In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
%matplotlib inline

### 1. Load the data

Load the file and take a look. Each row is one school. The fields all have abbreviated names, and you can look at the [data dictionary](http://www.cde.ca.gov/ta/ac/ap/reclayout12b.asp) to see what's what.

These are standardized test score results and many other variables for schools in California, from 2012. We are going to look at a variable called the Academic Performance Index (API), which is basically standardized test scores. This way of measuring school performance was discontinued in 2017, but we're going to use this data anyway, because we can closely reproduce the original Pioneer Press analysis.


In [8]:
# Load `apib12tx.csv` and take a look at the raw data


Index(['CHARTER', 'SNAME', 'DNAME', 'CNAME', 'API12B', 'ST_RANK', 'PCT_AA',
       'PCT_AI', 'PCT_AS', 'PCT_FI', 'PCT_HI', 'PCT_PI', 'PCT_WH', 'PCT_MR',
       'MEALS', 'P_GATE', 'P_MIGED', 'P_EL', 'P_RFEP', 'P_DI', 'ACS_K3',
       'ACS_46', 'ACS_CORE', 'PCT_RESP', 'NOT_HSG', 'HSG', 'SOME_COL',
       'COL_GRAD', 'GRAD_SCH', 'AVG_ED'],
      dtype='object')
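
A minimal sketch of the loading step. The three-row CSV below is made up for illustration and only stands in for `apib12tx.csv`; with the real file you would pass the filename to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Made-up three-row sample standing in for apib12tx.csv (values are illustrative only)
sample_csv = io.StringIO(
    "SNAME,API12B,MEALS,AVG_ED\n"
    "School A,910,12,4.1\n"
    "School B,780,65,2.6\n"
    "School C,705,93,1.9\n"
)

# With the real file this would be pd.read_csv("apib12tx.csv")
df = pd.read_csv(sample_csv)

print(df.shape)    # (number of rows, number of columns)
print(df.columns)  # the abbreviated field names
print(df.head())   # the first few rows
```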

### 2. Looking at one variable at a time

To start with, let's look at some histograms of single variables, to get an idea what the data (and the students) look like.

In [1]:
# Make a histogram of the API12B column, which is the 2012 Base API (standardized test scores)


In [2]:
# Make a histogram of the MEALS column, which is the percentage of students enrolled in free/reduced-price lunch programs, which is often used as a proxy for poverty.
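
One way to draw a histogram, sketched on made-up scores rather than the real `API12B` column. The bin count of 5 is an arbitrary choice for this tiny sample; the real data has enough schools to support many more bins.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up scores for illustration; the real values would come from df["API12B"]
scores = pd.Series([650, 700, 720, 760, 780, 800, 810, 850, 900, 940], name="API12B")

fig, ax = plt.subplots()
# hist returns the count in each bin and the bin edges
counts, bin_edges, _ = ax.hist(scores, bins=5)
ax.set_xlabel("API12B")
ax.set_ylabel("Number of schools")
```

The same pattern works for `MEALS` or any other numeric column.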


You may also find it interesting to look at variables like `PCT_WH`, which is the percentage of white students, and `AVG_ED`, which is the average education level of the parents of the students at that school.

### 3. Looking at two variables at a time

Looking at histograms of one variable at a time can't tell us about the relationship between variables, so let's do some scatter plots to get a qualitative sense of the relationships.

In [3]:
# Make a scatter plot of test scores (`API12B`) vs the percentage of students with subsidized lunches (`MEALS`)
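
A sketch of a two-variable scatter plot. The five schools here are invented, so only the plotting pattern carries over to the real dataframe.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented data; the real columns come from the loaded dataframe
df = pd.DataFrame({
    "MEALS": [10, 35, 60, 80, 95],
    "API12B": [900, 840, 790, 750, 700],
})

fig, ax = plt.subplots()
# alpha < 1 makes overlapping points easier to see when there are many schools
points = ax.scatter(df["MEALS"], df["API12B"], alpha=0.5)
ax.set_xlabel("MEALS (% of students in free/reduced-price lunch)")
ax.set_ylabel("API12B")
```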


As expected, test scores decrease with an increasing fraction of students in poverty. The parents' education is also strongly correlated with test scores.

In [4]:
# Make a scatter plot of test scores vs parents' education (`AVG_ED`)
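
If you want a number to go with the picture, the Pearson correlation coefficient is one option. This is a sketch on invented values, not a result from the real data.

```python
import pandas as pd

# Invented values for illustration only
df = pd.DataFrame({
    "AVG_ED": [1.8, 2.3, 2.9, 3.4, 4.2],
    "API12B": [700, 745, 800, 850, 915],
})

# Series.corr computes the Pearson correlation by default;
# values near +1 or -1 mean a strong linear relationship
r = df["AVG_ED"].corr(df["API12B"])
print(r)
```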


### 4. Linear regression

Let's draw trend lines through these scatter plots. Or, more precisely, we're going to use single-variable linear regression to build a model of the relationship between two variables: test scores, which is the "dependent" variable, and one "independent" variable.

Let's start with test scores vs. our poverty proxy.

In [49]:
# To start with, let's have the MEALS variable on the X axis and the API12B variable on the Y axis 
# This is some drudgery to convert dataframes into the NumPy arrays that sklearn needs
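
The conversion might look like the sketch below, shown on an invented four-row dataframe. The key detail is that sklearn expects the features as a 2-D array (one row per school, one column per variable) and the target as a 1-D array.

```python
import pandas as pd

# Invented rows for illustration
df = pd.DataFrame({
    "MEALS": [10, 40, 70, 95],
    "API12B": [900, 820, 760, 700],
})

# Double brackets select a one-column dataframe, so X is 2-D: (n_schools, 1)
X = df[["MEALS"]].to_numpy()
# Single brackets select a series, so y is 1-D: (n_schools,)
y = df["API12B"].to_numpy()

print(X.shape, y.shape)
```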


In [5]:
# Make a LinearRegression object and fit our data to it 
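
A sketch of the fitting step on invented numbers. `fit` estimates the slope and intercept by ordinary least squares, and `score` reports R², the fraction of variance in `y` that the line explains.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: X is 2-D (n_schools, 1), y is 1-D (n_schools,)
X = np.array([[5.0], [30.0], [55.0], [80.0], [95.0]])
y = np.array([905.0, 850.0, 790.0, 760.0, 705.0])

model = LinearRegression()
model.fit(X, y)

# R^2 on the data used for fitting
r_squared = model.score(X, y)
print(r_squared)
```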


Now let's see what we've got. Make the scatter plot of scores vs. meals, and add the regression line.

In [6]:
# scatter plot of test scores vs percentage of students eligible for subsidized meals
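
One way to overlay the fitted line, sketched on invented data: because the model is a straight line, predicting at just the two ends of the x range is enough to draw it.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Invented data for illustration
X = np.array([[5.0], [30.0], [55.0], [80.0], [95.0]])
y = np.array([905.0, 850.0, 790.0, 760.0, 705.0])
model = LinearRegression().fit(X, y)

fig, ax = plt.subplots()
ax.scatter(X[:, 0], y, alpha=0.5)

# Two endpoints define the line across the full 0-100% range
x_line = np.array([[0.0], [100.0]])
(line,) = ax.plot(x_line[:, 0], model.predict(x_line), color="red")

ax.set_xlabel("MEALS")
ax.set_ylabel("API12B")
```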


In [7]:
# What is the slope of this line? That is, how many test score points do we lose 
# for every percentage point increase in students receiving subsidized meals?


In [8]:
# And what is the intercept? That is, what does the model predict for MEALS=0?
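
A sketch of where sklearn keeps the fitted parameters. The data below lies exactly on the made-up line `y = 900 - 2x`, so the recovered slope and intercept are easy to check by eye; the real data will give different numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data lying exactly on y = 900 - 2x
X = np.array([[0.0], [25.0], [50.0], [75.0], [100.0]])
y = 900.0 - 2.0 * X[:, 0]

model = LinearRegression().fit(X, y)

# coef_ holds one slope per input column; we have a single column
slope = model.coef_[0]
# intercept_ is the prediction when every input is zero
intercept = model.intercept_

print(slope, intercept)
```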


We now have a model that predicts test performance based on the number of students receiving subsidized meals. This is a pretty naive model, but it's a start. And we can already learn things from it: there are some schools that seem to be doing much better than would be expected given the number of impoverished students attending. Essentially, by subtracting off the regression line we are "taking poverty out of the equation."
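
"Subtracting off the regression line" means computing residuals: the actual score minus the predicted score. Here is a sketch on five invented schools; the `RESIDUAL` column name is our own choice, not a field in the dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented schools; School E scores well despite a high MEALS value
df = pd.DataFrame({
    "SNAME": ["School A", "School B", "School C", "School D", "School E"],
    "MEALS": [10, 30, 50, 70, 90],
    "API12B": [900, 850, 800, 750, 800],
})

X = df[["MEALS"]].to_numpy()
y = df["API12B"].to_numpy()
model = LinearRegression().fit(X, y)

# Positive residual: the school did better than the line predicts
df["RESIDUAL"] = y - model.predict(X)

# Schools that beat their prediction by the most come first
ranked = df.sort_values("RESIDUAL", ascending=False)
print(ranked[["SNAME", "MEALS", "API12B", "RESIDUAL"]])
```

For a least-squares line with an intercept, the residuals always sum to zero, so roughly speaking some schools land above the line and others below it.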

Let's take a look at some of these schools. 

In [9]:
# Print out all schools with MEALS greater than 80 and API12B > 900
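
Filtering on two conditions can be sketched as below, using invented schools. Each condition needs its own parentheses, and pandas uses `&` rather than `and` to combine them row by row.

```python
import pandas as pd

# Invented schools for illustration
df = pd.DataFrame({
    "SNAME": ["School A", "School B", "School C", "School D"],
    "MEALS": [85, 92, 40, 88],
    "API12B": [930, 760, 950, 905],
})

# Keep rows where both conditions hold
high_poverty_high_score = df[(df["MEALS"] > 80) & (df["API12B"] > 900)]
print(high_poverty_high_score)
```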


Let's look specifically at Solano Avenue Elementary, which has an `API12B` of 922 and 80 percent of its students in the free/reduced lunch program.

How well would we expect this school to do, based on its `MEALS` of 80?

In [10]:
# Use the linear model to predict the score for this school
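
A sketch of predicting for a single value. The model here is fitted to the made-up line `y = 900 - 2x`, not to the real data, so the number it produces is not the answer for this school; only the call pattern carries over.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Model fitted to invented data on the line y = 900 - 2x (not the real fit)
X = np.array([[0.0], [25.0], [50.0], [75.0], [100.0]])
y = 900.0 - 2.0 * X[:, 0]
model = LinearRegression().fit(X, y)

# predict expects a 2-D array even for one school: one row, one column
predicted = model.predict(np.array([[80.0]]))[0]

# How far the actual score of 922 sits above this toy prediction
gap = 922 - predicted
print(predicted, gap)
```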


With an index of 922, the school is clearly outperforming what our simplified model expects. What is different about this school? Use your favorite search engine to look for relevant articles and see if you can figure out anything that might matter (remembering that our data is from 2012).

(your answer here)