# Nature and Education Outcomes
## Introduction

Last spring, in a psychology class called "Mind", a professor lectured on environmental neuroscience. She described the effects a person's environment can have on behavior and health, and one of the most striking examples she presented was a study of public housing in Chicago. In their research, [Taylor, Kuo and Sullivan (2001)](http://lhhl.illinois.edu/all.scientific.articles.htm) studied the views from windows in the Robert Taylor Homes public housing development and classified each as a good or poor view of nature. They found that kids with more nature in view had better memory, attention, and self-discipline.

While looking through open data sources to see what might be interesting to play with, I came across the NYC [2015 Tree Census Data](https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Tree-Data/uvpi-gqnh). This dataset lists more than 683,000 trees along with their precise locations. Coupled with

All of my data comes from NYC Open Data: I used [2012 SAT results](https://data.cityofnewyork.us/Education/2012-SAT-Results/f9bf-2cp4), [School Locations](https://data.cityofnewyork.us/Education/2017-2018-School-Locations/p6h4-mpyy), [2012 Demographics and Accountability](https://data.cityofnewyork.us/Education/2006-2012-School-Demographics-and-Accountability-S/ihfw-zy9j), and [2015 Tree Census Data](https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Tree-Data/uvpi-gqnh).

Most of the preprocessing has already been done in `data_cleaning.py`.


## Finish Preprocessing Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
schools = pd.read_csv('nyc_schools.csv', usecols=['DBN', 'SCHOOL NAME', 'Num of SAT Test Takers', 'SAT Critical Reading Avg. Score', 'SAT Math Avg. Score', 'SAT Writing Avg. Score', 'X_COORDINATE', 'Y_COORDINATE', 'NTA', 'NTA_NAME', 'grade9', 'grade10', 'grade11', 'grade12', 'ell_percent', 'sped_percent', 'asian_per', 'white_per', 'black_per', 'hispanic_per', 'male_per', 'total', 'enrollment'])
trees = pd.read_csv('nyc_trees.csv', usecols=['tree_id', 'status', 'x_sp', 'y_sp', 'nta', 'nta_name'])

In [3]:
# Rename columns
schools = schools.rename(columns={"SAT Critical Reading Avg. Score": "crit_reading", "SAT Math Avg. Score": "math", "SAT Writing Avg. Score": "writing"})

In [4]:
# Create column containing Borough school is in from NTA
# The first two characters of the NTA code are the borough abbreviation (e.g. 'MN')
boro = schools['NTA'].str[:2]

In [5]:
# Create One Hot Encoding for each Borough
schools['boro'] = boro
boro_dummy = pd.get_dummies(schools['boro'], prefix='boro')

In [6]:
# Join Schools data with Borough data
schools = schools.join(boro_dummy)
schools.head()

Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,crit_reading,math,writing,grade9,grade10,grade11,grade12,...,NTA,NTA_NAME,total,enrollment,boro,boro_BK,boro_BX,boro_MN,boro_QN,boro_SI
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363,98.0,79,80.0,50.0,...,MN28,Lower East Side ...,1122,307.0,MN,0,0,1,0,0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366,109.0,97,93.0,95.0,...,MN28,Lower East Side ...,1172,394.0,MN,0,0,1,0,0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370,101.0,93,77.0,86.0,...,MN22,East Village ...,1149,357.0,MN,0,0,1,0,0
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359,131.0,49,44.0,,...,MN27,Chinatown ...,1174,,MN,0,0,1,0,0
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384,143.0,100,51.0,73.0,...,MN27,Chinatown ...,1207,367.0,MN,0,0,1,0,0
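
The borough columns above can be reproduced on a small toy frame. This is only a sketch: the `toy` frame is hypothetical, the first three NTA codes are copied from the rows shown above, and `"BK09"` is an illustrative Brooklyn-style code added so that more than one borough appears.

```python
import pandas as pd

# Toy frame: three NTA codes from the rows above plus one illustrative Brooklyn code.
toy = pd.DataFrame({"NTA": ["MN28", "MN22", "MN27", "BK09"]})

# The first two characters of an NTA code are the borough abbreviation.
toy["boro"] = toy["NTA"].str[:2]

# One indicator column per borough present in the data, cast to 0/1 integers.
dummies = pd.get_dummies(toy["boro"], prefix="boro").astype(int)
toy = toy.join(dummies)

print(toy)
```

Only boroughs that actually occur in the data get a column, which is why the toy frame ends up with `boro_BK` and `boro_MN` and nothing else.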


In [7]:
trees.describe()

Unnamed: 0,tree_id,x_sp,y_sp
count,683788.0,683788.0,683788.0
mean,365205.011085,1005280.0,194798.424625
std,208122.092902,34285.05,32902.061114
min,3.0,913349.3,120973.7922
25%,186582.75,989657.8,169515.1537
50%,366214.5,1008386.0,194560.2525
75%,546170.25,1029991.0,217019.57195
max,722694.0,1067248.0,271894.0921


In [8]:
trees['status'].unique()

array(['Alive', 'Stump', 'Dead'], dtype=object)

In [9]:
# Create separate df for each type of tree
alive = trees[trees['status']=='Alive']
stump = trees[trees['status']=='Stump']
dead = trees[trees['status']=='Dead']

In [None]:
# DATA VISUALIZATION 1: All of the trees in NYC + High Schools
plot = alive.plot(kind='scatter', x='x_sp', y='y_sp', color='green', label='Alive Trees', alpha = .3, figsize=(25, 20))
stump.plot(kind='scatter', x='x_sp', y='y_sp', color='olive', label='Stump Trees', alpha = .3, ax=plot)
dead.plot(kind='scatter', x='x_sp', y='y_sp', color='darkslategrey', label='Dead Trees', alpha = .3, ax=plot)
schools.plot.scatter(x='X_COORDINATE', y='Y_COORDINATE', c='hotpink', label='Schools', ax=plot)


In [None]:
# DATA VISUALIZATION 2: SAT Scores by High School
plot = alive.plot(kind='scatter', x='x_sp', y='y_sp', color='lightgrey', figsize=(25, 20))
schools.plot.scatter(x='X_COORDINATE', y='Y_COORDINATE', c='total', colormap='plasma', ax=plot)

In [None]:
top_schools = schools.sort_values(by=['total'], ascending=False)
top_schools[['SCHOOL NAME', 'crit_reading', 'math', 'writing', 'total']].head(10)

From here we can see that a handful of schools excel; these are top NYC schools known for high academic achievement.
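
Judging from the rows printed earlier, the `total` column used for this ranking appears to be the sum of the three section averages. The sketch below checks that on the five schools from the first `schools.head()` output and then sorts them the same way; the `sample` frame is a hypothetical stand-in for `schools`.

```python
import pandas as pd

# Section averages for the five schools shown in the first schools.head() output.
sample = pd.DataFrame({
    "SCHOOL NAME": [
        "HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES",
        "UNIVERSITY NEIGHBORHOOD HIGH SCHOOL",
        "EAST SIDE COMMUNITY SCHOOL",
        "FORSYTH SATELLITE ACADEMY",
        "MARTA VALLE HIGH SCHOOL",
    ],
    "crit_reading": [355, 383, 377, 414, 390],
    "math": [404, 423, 402, 401, 433],
    "writing": [363, 366, 370, 359, 384],
})

# Sum the three sections row by row to get a combined score.
sample["total"] = sample[["crit_reading", "math", "writing"]].sum(axis=1)

# Highest combined score first, as in the top_schools cell.
ranked = sample.sort_values(by="total", ascending=False)
print(ranked[["SCHOOL NAME", "total"]])
```

The computed sums match the `total` values shown for these rows (1122, 1172, 1149, 1174, 1207).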

In [11]:
# Bounding box of +/-100 coordinate units around each school (a 200 x 200 square)
schools['x_min'] = schools['X_COORDINATE'] - 100
schools['x_max'] = schools['X_COORDINATE'] + 100
schools['y_min'] = schools['Y_COORDINATE'] - 100
schools['y_max'] = schools['Y_COORDINATE'] + 100
schools = schools.reset_index(drop=True)

In [12]:
all_trees = []
alive_trees = []
stump_trees = []
dead_trees = []

for i in range(len(schools)):
    # Trees whose coordinates fall inside the box around school i (bounds inclusive)
    in_box = trees[trees['x_sp'].between(schools['x_min'][i], schools['x_max'][i])
                   & trees['y_sp'].between(schools['y_min'][i], schools['y_max'][i])]
    status = in_box['status']

    # Count all trees in the box, then break the count down by status
    all_trees.append(len(in_box))
    alive_trees.append((status == 'Alive').sum())
    stump_trees.append((status == 'Stump').sum())
    dead_trees.append((status == 'Dead').sum())

In [13]:
schools['trees'] = all_trees
schools['alive_trees'] = alive_trees
schools['stump_trees'] = stump_trees
schools['dead_trees'] = dead_trees

In [14]:
schools.head()

Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,crit_reading,math,writing,grade9,grade10,grade11,grade12,...,boro_QN,boro_SI,x_min,x_max,y_min,y_max,trees,alive_trees,stump_trees,dead_trees
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363,98.0,79,80.0,50.0,...,0,0,988017.0,988217.0,199074.0,199274.0,0,0,0,0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366,109.0,97,93.0,95.0,...,0,0,988550.0,988750.0,198676.0,198876.0,0,0,0,0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370,101.0,93,77.0,86.0,...,0,0,989008.0,989208.0,204827.0,205027.0,1,1,0,0
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359,131.0,49,44.0,,...,0,0,986792.0,986992.0,202372.0,202572.0,5,5,0,0
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384,143.0,100,51.0,73.0,...,0,0,988020.0,988220.0,201507.0,201707.0,3,3,0,0
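
The per-school loop can also be expressed with NumPy broadcasting, which compares every school against every tree in one step. This is a minimal sketch on made-up coordinates, not the code used to produce the table above; `school_xy`, `tree_xy`, and `half_width` are hypothetical names.

```python
import numpy as np

# Made-up coordinates in the same units as x_sp / y_sp (illustrative only).
school_xy = np.array([[1000.0, 1000.0],
                      [5000.0, 5000.0]])
tree_xy = np.array([[950.0, 1050.0],    # inside the box around school 0
                    [1100.0, 900.0],    # exactly on the corner of that box
                    [1101.0, 1000.0],   # just outside in x
                    [5000.0, 5000.0]])  # on top of school 1
half_width = 100.0  # same +/-100 offset used for x_min, x_max, y_min, y_max

# dx[i, j] and dy[i, j] are the absolute offsets between school i and tree j.
dx = np.abs(school_xy[:, None, 0] - tree_xy[None, :, 0])
dy = np.abs(school_xy[:, None, 1] - tree_xy[None, :, 1])

# A tree is counted when both offsets are within the half width (bounds inclusive).
in_box = (dx <= half_width) & (dy <= half_width)
counts = in_box.sum(axis=1)
print(counts)
```

With the full data the intermediate arrays have one entry per school-tree pair, so processing the schools in batches may be needed to keep memory use reasonable.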


## Regression

In [16]:
import statsmodels.formula.api as sm

In [17]:
regression = schools[['total', 'enrollment', 'asian_per', 'black_per', 'hispanic_per' , 'white_per', 'male_per' , 'boro_BK' , 'boro_BX' ,'boro_MN' , 'boro_QN' , 'boro_SI', 'trees', 'alive_trees', 'stump_trees', 'dead_trees']].copy()
regression.corr(method ='pearson') 

Unnamed: 0,total,enrollment,asian_per,black_per,hispanic_per,white_per,male_per,boro_BK,boro_BX,boro_MN,boro_QN,boro_SI,trees,alive_trees,stump_trees,dead_trees
total,1.0,0.346954,0.537415,-0.307973,-0.383232,0.64554,-0.097572,-0.173739,-0.196997,0.165583,0.164439,0.151901,0.017204,0.000283,0.009012,0.148439
enrollment,0.346954,1.0,0.428462,-0.253719,-0.171825,0.324562,0.091823,-0.022123,-0.122894,-0.126421,0.224354,0.196504,-0.140296,-0.143368,0.025176,-0.027223
asian_per,0.537415,0.428462,1.0,-0.432616,-0.342372,0.316771,0.057084,-0.152446,-0.246772,0.059317,0.391692,-0.008823,-0.029301,-0.03051,0.002833,0.000842
black_per,-0.307973,-0.253719,-0.432616,1.0,-0.557146,-0.419522,-0.018206,0.507009,-0.138717,-0.236019,-0.130244,-0.118798,-0.156085,-0.152759,-0.018195,-0.063316
hispanic_per,-0.383232,-0.171825,-0.342372,-0.557146,1.0,-0.327164,0.020801,-0.38527,0.428372,0.183652,-0.161081,-0.134691,0.177677,0.186969,-0.000257,-0.031375
white_per,0.64554,0.324562,0.316771,-0.419522,-0.327164,1.0,-0.060656,-0.097092,-0.217475,0.039751,0.096558,0.474583,0.013078,-0.007282,0.033379,0.16432
male_per,-0.097572,0.091823,0.057084,-0.018206,0.020801,-0.060656,1.0,0.109177,0.02464,-0.164523,0.020924,0.029779,-0.063584,-0.062061,0.015705,-0.039715
boro_BK,-0.173739,-0.022123,-0.152446,0.507009,-0.38527,-0.097092,0.109177,1.0,-0.352228,-0.401916,-0.2989,-0.115387,-0.111828,-0.103633,-0.009445,-0.098287
boro_BX,-0.196997,-0.122894,-0.246772,-0.138717,0.428372,-0.217475,0.02464,-0.352228,1.0,-0.329463,-0.245017,-0.094587,-0.042955,-0.041783,0.056701,-0.052907
boro_MN,0.165583,-0.126421,0.059317,-0.236019,0.183652,0.039751,-0.164523,-0.401916,-0.329463,1.0,-0.279581,-0.10793,0.241025,0.235908,-0.052528,0.141014
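
Each entry in the matrix above is a pairwise Pearson coefficient. For example, the `total`/`trees` entry of about 0.017 suggests almost no linear association between the two before any other variable is taken into account. The sketch below spells out the calculation on made-up numbers and compares it with `np.corrcoef`; `pearson_r`, `x`, and `y` are hypothetical names.

```python
import numpy as np

# Made-up values standing in for two columns such as 'trees' and 'total'.
x = np.array([0.0, 1.0, 5.0, 3.0, 2.0, 8.0])
y = np.array([1122.0, 1172.0, 1149.0, 1174.0, 1207.0, 1190.0])

def pearson_r(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

r = pearson_r(x, y)
print(r, np.corrcoef(x, y)[0, 1])
```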


In [18]:
model = sm.ols(formula="total ~ enrollment + asian_per + black_per + hispanic_per + male_per + boro_BK + boro_BX + boro_MN + boro_QN + trees", data=regression).fit()

In [19]:
model.summary()

Dep. Variable:,total,R-squared:,0.618
Model:,OLS,Adj. R-squared:,0.606
Method:,Least Squares,F-statistic:,55.54
Date:,"Sun, 10 Nov 2019",Prob (F-statistic):,9.15e-66
Time:,07:23:20,Log-Likelihood:,-2184.0
No. Observations:,355,AIC:,4390.0
Df Residuals:,344,BIC:,4433.0
Df Model:,10,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1709.0290,53.514,31.936,0.000,1603.773,1814.285
enrollment,0.0181,0.008,2.227,0.027,0.002,0.034
asian_per,-2.7143,0.832,-3.263,0.001,-4.350,-1.078
black_per,-6.4554,0.551,-11.724,0.000,-7.538,-5.372
hispanic_per,-8.1170,0.568,-14.302,0.000,-9.233,-7.001
male_per,-0.6452,0.487,-1.325,0.186,-1.603,0.313
boro_BK,104.3174,44.399,2.350,0.019,16.989,191.646
boro_BX,179.4603,46.092,3.893,0.000,88.802,270.119
boro_MN,197.7497,44.851,4.409,0.000,109.533,285.966

Omnibus:,30.422,Durbin-Watson:,1.767
Prob(Omnibus):,0.0,Jarque-Bera (JB):,75.888
Skew:,0.401,Prob(JB):,3.32e-17
Kurtosis:,5.118,Cond. No.,17100.0
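
The formula includes four of the five borough indicators and leaves out `boro_SI`, presumably so that Staten Island serves as the reference category. The sketch below shows, under that assumption, why all five cannot be used alongside an intercept: the indicators sum to one in every row, so the design matrix loses a rank. The matrices are made up for illustration.

```python
import numpy as np

# Five made-up schools, one per borough, in the order BK, BX, MN, QN, SI.
dummies = np.eye(5)
intercept = np.ones((5, 1))

# Intercept plus all five indicators: the indicator columns add up to the intercept.
full = np.hstack([intercept, dummies])

# Intercept plus four indicators (SI dropped), as in the formula above.
reduced = np.hstack([intercept, dummies[:, :4]])

print(np.linalg.matrix_rank(full), "of", full.shape[1], "columns independent")
print(np.linalg.matrix_rank(reduced), "of", reduced.shape[1], "columns independent")
```

Read this way, a coefficient such as the one on `boro_MN` is a difference relative to the omitted borough, holding the other regressors fixed.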


From here, we see that there seems to be a positive relationship between total SAT score and the number of trees around the school. However, it isn't statistically significant.
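
As a rough guide to reading significance from the summary, the t statistic is the coefficient divided by its standard error, and the 95% interval is the coefficient plus or minus a critical value times the standard error. The sketch below applies this to two rows that are visible in the coefficient table above. The critical value of about 1.9669 for 344 residual degrees of freedom is an approximation I am assuming here, and `summarize` is a hypothetical helper.

```python
# Approximate two-sided 95% critical value for 344 residual degrees of freedom (assumed).
T_CRIT = 1.9669

# (coef, std err) pairs copied from the coefficient table above.
rows = {
    "Intercept": (1709.0290, 53.514),
    "male_per": (-0.6452, 0.487),
}

def summarize(coef, std_err, t_crit=T_CRIT):
    """Return the t statistic, the 95% interval, and whether |t| exceeds the critical value."""
    t_stat = coef / std_err
    interval = (coef - t_crit * std_err, coef + t_crit * std_err)
    significant = abs(t_stat) > t_crit
    return t_stat, interval, significant

for name, (coef, std_err) in rows.items():
    t_stat, (low, high), significant = summarize(coef, std_err)
    print(f"{name}: t={t_stat:.3f}, 95% CI=({low:.3f}, {high:.3f}), significant={significant}")
```

A coefficient whose interval spans zero, as with `male_per`, cannot be distinguished from no effect at the 5% level.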

That being said, this project has been an interesting way for me to think about how interconnected the world is, and about how alternative data can be used to come up with potential solutions.