In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
%matplotlib inline

## Setting things up

Let's load the data and give it a quick look.

In [None]:
df = pd.read_csv('data/apib12tx.csv')

In [None]:
df.describe()

## Checking out correlations

Let's start looking at how variables in our dataset relate to each other so we know what to expect when we start modeling.

In [None]:
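# correlate numeric columns only; text columns such as school names can't be correlated
df.select_dtypes(include='number').corr()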

The percentage of students enrolled in free/reduced-price lunch programs is often used as a proxy for poverty.

In [None]:
df.plot(kind="scatter", x="MEALS", y="API12B")

Conversely, the education level of a student's parents is often a good predictor of how well a student will do in school.

In [None]:
df.plot(kind="scatter", x="AVG_ED", y="API12B")
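To put a number on what the two scatter plots suggest, one option is to pull out just the correlations with the API score. The sketch below uses a tiny made-up table (the values are invented for illustration and are not from the dataset); only the column names match the ones used in this notebook.

```python
import pandas as pd

# Made-up numbers for illustration only; not taken from the API dataset.
toy = pd.DataFrame({
    "MEALS": [10, 30, 50, 70, 90],
    "AVG_ED": [4.2, 3.6, 3.0, 2.4, 1.9],
    "API12B": [920, 870, 810, 770, 720],
})

# Pearson correlation of each predictor with the API score.
corr_with_api = toy[["MEALS", "AVG_ED"]].corrwith(toy["API12B"])
print(corr_with_api)
```

A value near -1 means the two columns move in opposite directions, and a value near +1 means they rise together. Running the same `corrwith` call on `df` should show which way each relationship points in the real data.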

## Running the regression

As we did last week, we'll use scikit-learn to run basic single-variable regressions. Let's start by looking at how California's Academic Performance Index (API) relates to the percentage of students at each school enrolled in free/reduced-price lunch programs.

In [None]:
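# pull the target (API12B) and the predictor (MEALS) into a NumPy array
data = np.asarray(df[['API12B', 'MEALS']])
# x is sliced so it stays 2-D, which scikit-learn expects; y can be 1-D
x, y = data[:, 1:], data[:, 0]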
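The slice `data[:, 1:]` (rather than `data[:, 1]`) matters because scikit-learn estimators expect the features as a 2-D array with one row per sample and one column per feature. A small sketch with made-up rows shows the difference in shape:

```python
import numpy as np

# Made-up rows in the same column order as above: [API12B, MEALS].
data = np.array([
    [900.0, 10.0],
    [820.0, 45.0],
    [760.0, 70.0],
    [700.0, 95.0],
])

x_2d = data[:, 1:]  # slicing keeps the column axis
x_1d = data[:, 1]   # a plain integer index drops it
y = data[:, 0]      # the target is fine as a 1-D array

print(x_2d.shape, x_1d.shape, y.shape)
```

Passing the 1-D version as the feature array to `fit` would raise an error, so the slice is the simplest way to keep a single predictor in the right shape.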

In [None]:
lr = LinearRegression() 
lr.fit(x, y)

In [None]:
# slope of the fitted line: change in predicted API per percentage point of MEALS
lr.coef_

In [None]:
# R^2: the share of the variation in API that the model explains
lr.score(x, y)
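To make the slope and the score less of a black box, here is a minimal sketch that rebuilds both by hand. The data is made up for illustration (it is not from the API file), and `toy_lr` is just a stand-in name for a model fitted the same way as `lr`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values for illustration only; not taken from the API dataset.
meals = np.array([[10.0], [30.0], [50.0], [70.0], [90.0]])
api = np.array([905.0, 880.0, 800.0, 790.0, 715.0])

toy_lr = LinearRegression().fit(meals, api)

# The fitted line is: predicted = intercept + slope * MEALS
slope = toy_lr.coef_[0]
intercept = toy_lr.intercept_
predicted = intercept + slope * meals[:, 0]

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((api - predicted) ** 2)
ss_tot = np.sum((api - api.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(slope, intercept, r_squared)
```

The hand-computed `r_squared` should match `toy_lr.score(meals, api)`, which is a handy check that `score` is reporting R² rather than, say, an error measure.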

In [None]:
# plot the linear regression line on the scatter plot
plt.scatter(x, y, color='blue')
plt.plot(x, lr.predict(x), color='red', linewidth=1)

In our naive universe, where we're only paying attention to two variables (academic performance and free/reduced-price lunch enrollment), we can clearly see that some schools sit well above the regression line. In other words, they score higher than the model would predict from their poverty level alone.

A handful, in particular, seem to be dramatically overperforming. Let's look at them:

In [None]:
df[(df['MEALS'] >= 80) & (df['API12B'] >= 900)]
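The cutoffs in the filter above are picked by eye. Another way to surface overperformers, sketched here under the assumption that "overperforming" means scoring above the fitted line, is to rank schools by their residual (actual score minus predicted score). The schools, the `name` column, and all numbers below are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical schools with made-up numbers; not taken from the API dataset.
toy = pd.DataFrame({
    "name": ["School A", "School B", "School C", "School D", "School E"],
    "MEALS": [15.0, 40.0, 60.0, 85.0, 90.0],
    "API12B": [890.0, 830.0, 790.0, 910.0, 700.0],
})

toy_lr = LinearRegression().fit(toy[["MEALS"]], toy["API12B"])

# Residual = actual - predicted; positive means the school beats the model.
toy["predicted"] = toy_lr.predict(toy[["MEALS"]])
toy["residual"] = toy["API12B"] - toy["predicted"]

ranked = toy.sort_values("residual", ascending=False)
print(ranked[["name", "MEALS", "API12B", "residual"]])
```

Sorting by residual puts the school that most exceeds its prediction at the top, whatever its poverty level, so it also catches overperformers that fall just outside a fixed threshold.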

Let's look specifically at Solano Avenue Elementary, which has an API of 922 and 80 percent of its students in the free/reduced-price lunch program. If we use the regression above to predict how well Solano would do, it looks like this:

In [None]:
# predict expects a 2-D array: one row per school, one column per feature
lr.predict([[80]])

With an actual index of 922, the school is clearly outperforming what our simplified model expects.