###  This analysis digs a little deeper into the school test data.  It narrows the scope by examining only the 3rd grade test results.  I will examine the data to see whether there are geographical differences in schools' math test results.  I will also test the hypothesis that elementary schools near universities have higher math test success.  The reasoning is that parents with easy access to a university might be better educated, which may in turn result in their children doing better in school.

### Import dependencies

In [3]:
# Dependencies and Setup
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import requests
import os

#Adding datetime dependency to be able to get the current date each time code is run
import datetime

# Import API key
# from api_keys import api_key

# Incorporated citipy to determine city based on latitude and longitude
from citipy import citipy

#Import pprint
from pprint import pprint

# Google developer API key
from config import gkey

# import gmaps
import gmaps

# Access maps with unique API key
gmaps.configure(api_key=gkey)

# Import statistical functions
from scipy.stats import sem, ttest_ind

### Below I bring in the cleaned-up STAAR test data by school that Steve created.  I will focus on just Grade 3 to narrow the scope.  The file includes the math test results as well as each school's address.  The test data includes the number of students that were below grade level, approached grade level, met grade level and mastered grade level.  I will use these counts to define the percentage of students in each school that passed the math test.


In [7]:
#Import cleaned STAAR data for Grade 3.  

grade3all_df = pd.read_csv('../cleandata/AllStudentsGrade3.csv', header=0)

# I create a new field which combines street, city, state, and zipcode, which will be used to pass to Google
# to retrieve latitude and longitude

grade3all_df['Full Address'] = grade3all_df['School Site Street Address'] + ' ' + grade3all_df['School Site City'] + ',TX ' + grade3all_df['School Site Zip']

# Examining the data, I found that there were 33 schools that did not have address data, because they were not included
# in the original school level data file.  It is not clear why these were not included in the file.  In examining the location
# of these schools, they appear to be spread across the state, so there is no obvious relationship between them.  Given the small
# number of schools missing the address (25 out of over 4500 schools) and no obvious way of retrieving this data, I have removed
# these schools from the further analysis.  There were also 33 schools that did not have test score data.  Some of
# these were the same schools that did not have address data, while others were not.  Given the small number of schools
# and the inability to retrieve this data, these were removed. There were also some schools that were missing other data,
# such as district name (DNAME), but had the data needed for the analysis, so they were left in the analysis.

# Check for columns with NaN

grade3all_df[grade3all_df.isna().any(axis=1)].count()

# Drop rows that have NaN for just the columns of interest

grade3all_df=grade3all_df.dropna(subset=['Meet Grade Level','Master Grade Level', 'Full Address'])
grade3all_df.count()



CAMPUS                        4509
DNAME                         4509
CNAME                         4509
GRADE                         4509
District Type                 4509
School Site Street Address    4509
School Site City              4509
School Site Zip               4509
Below Grade Level             4509
Approach Grade Level          4509
Meet Grade Level              4509
Master Grade Level            4509
Full Address                  4509
dtype: int64

###  This next section passes the school address  to the Google Geocode API to retrieve the latitude and longitude of each school.  The API call takes a long time to run for all schools and incurs cost.  So, I used a small subset to develop my later analysis.  Once I had the analysis finalized, I ran the API call for all schools and output it to a csv file that can be read back in for further analysis.  I then reset the code so it would only look at a small subset and commented out the code that output the data. This way the API call can be tested, without taking a lot of time.

In [8]:

# Create columns to hold Latitude and Longitude
grade3all_df['Latitude']=''
grade3all_df['Longitude']=''

# Create a subset for testing. Note that I commented out the line below when I ran the whole
# list of schools. I output the results to a csv file to use in the analysis below.
# .copy() makes the subset an independent dataframe, so the .loc assignments below do not raise SettingWithCopyWarning
grade3all_df = grade3all_df[:5].copy()

# After the full run, I commented out the csv export at the end of this cell so the file is not overwritten.
# For further analysis, I pull in the csv file that was created.

# Iterate over the rows of the dataframe, pass each address to the API and retrieve latitude and longitude
for index, row in grade3all_df.iterrows():
    address = grade3all_df.loc[index, 'Full Address']
    school = grade3all_df.loc[index, 'CAMPUS']
    # Build the endpoint URL, request the data and convert to JSON
    target_url = ('https://maps.googleapis.com/maps/api/geocode/json?address={0}&key={1}').format(address, gkey)
    
    response = requests.get(target_url)
 
    response_json = response.json()

    # Retrieve the lat/long and add to the dataframe
    
    try:
        grade3all_df.loc[index,'Latitude'] = response_json["results"][0]["geometry"]["location"]["lat"]
        grade3all_df.loc[index,'Longitude'] = response_json["results"][0]["geometry"]["location"]["lng"]
    except IndexError:
        print(f"Problem with school {school} and address {address}")

# Output the data with latitude and longitude, so I won't have to run this time consuming API request all the time 
# Commented this out after running so not to overwrite csv file
# grade3all_df.to_csv('../cleandata/grade3_geodata.csv')

Problem with school 15917102 and address 19190 HWY 281 S #3 SAN ANTONIO,TX 78221-9648
Problem with school 33902101 and address 106 W 9TH ST PANHANDLE,TX 79068-1030
Problem with school 46901109 and address 2620 KLEIN WAY NEW BRAUNFELS,TX 78130
Problem with school 97903101 and address 9TH AND CHERRY ST HICO,TX 76457-0218
Problem with school 101912100 and address 10550 RICHMOND AVE HOUSTON TX #140 HOUSTON,TX 77042-5112
Problem with school 126908101 and address #20 BULLDOG DR VENUS,TX 76084-0364
Problem with school 161903105 and address #1 WICKSON RD WACO,TX 76712-7552
Problem with school 161919042 and address #1 EAGLE DR EDDY,TX 76524
Problem with school 227816005 and address 2124 E ST ELMO RD #A AUSTIN,TX 78744
Problem with school 249908001 and address #1 GREYHOUND LN SLIDELL,TX 76267-0069


### The addresses below were identified using exception handling as not working with the API call.

- School 15917102: 19190 HWY 281 S #3 SAN ANTONIO,TX 78221-9648
- School 33902101: 106 W 9TH ST PANHANDLE,TX 79068-1030
- School 46901109: 2620 KLEIN WAY NEW BRAUNFELS,TX 78130
- School 97903101: 9TH AND CHERRY ST HICO,TX 76457-0218
- School 101912100: 10550 RICHMOND AVE HOUSTON TX #140 HOUSTON,TX 77042-5112
- School 126908101: #20 BULLDOG DR VENUS,TX 76084-0364
- School 161903105: #1 WICKSON RD WACO,TX 76712-7552
- School 161919042: #1 EAGLE DR EDDY,TX 76524
- School 227816005: 2124 E ST ELMO RD #A AUSTIN,TX 78744
- School 249908001: #1 GREYHOUND LN SLIDELL,TX 76267-0069
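
Seven of the ten failed addresses contain a `#` character. One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that the geocoding cell formats the raw address straight into the URL. An unencoded `#` starts the URL fragment, so everything after it (including `&key=...`) would not be part of the query. The sketch below uses only the standard library to compare the raw and percent-encoded query strings. `build_geocode_url` is a hypothetical helper and `DEMO_KEY` is a placeholder, not a real key.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(address, key):
    """Hypothetical helper: percent-encode the query parameters."""
    return f"{GEOCODE_URL}?{urlencode({'address': address, 'key': key})}"

address = "#20 BULLDOG DR VENUS,TX 76084-0364"

# Raw formatting, as in the geocoding cell: '#' begins the fragment,
# so the query part of the URL ends up with no usable parameters.
raw_url = f"{GEOCODE_URL}?address={address}&key=DEMO_KEY"
raw_query = parse_qs(urlsplit(raw_url).query)

# Encoded: '#' becomes %23 and both parameters survive.
safe_url = build_geocode_url(address, "DEMO_KEY")
safe_query = parse_qs(urlsplit(safe_url).query)

print(raw_query)
print(safe_query)
```

Passing `params={'address': address, 'key': gkey}` to `requests.get` applies the same encoding. The three failed addresses without a `#` would need a separate look.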

### Here I bring in the csv file that was saved in the cell above, so I won't have to repeat the API call

In [9]:
grade3all_df = pd.read_csv('../cleandata/grade3_geodata.csv')
grade3all_df.head()

Unnamed: 0.1,Unnamed: 0,CAMPUS,DNAME,CNAME,GRADE,District Type,School Site Street Address,School Site City,School Site Zip,Below Grade Level,Approach Grade Level,Meet Grade Level,Master Grade Level,Full Address,Latitude,Longitude
0,0,1902103,CAYUGA ISD,CAYUGA ELEM.,3,INDEPENDENT,17750 N US HWY 287,TENNESSEE COLONY,75861,6.0,40.0,38.0,27.0,"17750 N US HWY 287 TENNESSEE COLONY,TX 75861",31.922964,-95.923871
1,1,1903102,ELKHART ISD,ELKHART INTERME,3,INDEPENDENT,301 E PARKER ST,ELKHART,75839-9701,26.0,68.0,37.0,12.0,"301 E PARKER ST ELKHART,TX 75839-9701",31.628102,-95.578983
2,2,1904102,FRANKSTON ISD,FRANKSTON ELEM.,3,INDEPENDENT,100 PERRY ST,FRANKSTON,75763-0428,16.0,39.0,26.0,10.0,"100 PERRY ST FRANKSTON,TX 75763-0428",32.062148,-95.504274
3,3,1906102,NECHES ISD,NECHES ELEM.,3,INDEPENDENT,3055 FM 2574,PALESTINE,75803,5.0,20.0,7.0,2.0,"3055 FM 2574 PALESTINE,TX 75803",31.870604,-95.49208
4,4,1907107,PALESTINE ISD,SOUTHSIDE ELEM.,3,INDEPENDENT,201 GILLESPIE RD,PALESTINE,75801-7627,46.0,174.0,93.0,51.0,"201 GILLESPIE RD PALESTINE,TX 75801-7627",31.740921,-95.624853


### Below I define the % Pass as the number of students that met or mastered grade level divided by the total number of students

In [None]:
# Define the % passing score as the total that meet or master grade level divided by total students

grade3all_df['Total Students'] = grade3all_df['Below Grade Level'] + grade3all_df['Approach Grade Level'] + grade3all_df['Meet Grade Level'] + grade3all_df['Master Grade Level'] 
grade3all_df["% Pass"] = 100*(grade3all_df['Meet Grade Level'] + grade3all_df['Master Grade Level'])/grade3all_df['Total Students']
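
As a quick check of the definition above, the sketch below works through the first row of the geocoded file (CAYUGA ELEM.: 6 below, 40 approach, 38 meet, 27 master). `percent_pass` is a hypothetical helper written for this illustration. The guard for a school with zero students is an added assumption that mirrors the NaN pandas would produce for 0/0.

```python
def percent_pass(below, approach, meet, master):
    """Hypothetical helper: % of students that met or mastered grade level."""
    total = below + approach + meet + master
    if total == 0:
        # No students tested: return NaN rather than dividing by zero
        return float("nan")
    return 100 * (meet + master) / total

# Counts taken from the first row shown in the head() output above
cayuga = percent_pass(below=6, approach=40, meet=38, master=27)
print(round(cayuga, 1))  # 58.6
```

Here 65 of 111 students met or mastered grade level, which gives about 58.6%.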



### In this next section, I will identify the top 25 schools and bottom 25 schools based on the % Pass variable
### These are then plotted on a gmap with the top schools colored green and the bottom colored red.

In [None]:
# Sort by descending % Pass and then create new dataframe with just the top 25 schools
grade3all_df=grade3all_df.sort_values(by='% Pass', ascending=False)
grade3top_df=grade3all_df[:25]

# Now grab the lowest 25 which will be at the bottom of the sorted list
grade3bottom_df=grade3all_df[(len(grade3all_df)-25):]


# Here I plot the top schools in terms of passing percent on a map

fig=gmaps.figure()

# Combine list of lats and lngs into list of tuples to create coordinates to be passed to the figure

topcoordinates = tuple(zip(grade3top_df['Latitude'], grade3top_df['Longitude']))
botcoordinates = tuple(zip(grade3bottom_df['Latitude'], grade3bottom_df['Longitude']))

# Create symbol layers using our coordinates of the top and bottom schools
# The top schools will be colored green and the bottom schools will be colored red
symbols1 = gmaps.symbol_layer(topcoordinates, fill_color='green', stroke_color='green') 
symbols2 = gmaps.symbol_layer(botcoordinates, fill_color='red', stroke_color='red') 

# Add the layers to the map
fig.add_layer(symbols1)
fig.add_layer(symbols2)

# display the figure with the newly added layers
fig



### In the next cell, I define four quadrants of Texas (i.e., NW, NE, SW, SE)
### These are defined by comparing the latitude/longitude of each school to the geographical center of Texas


In [None]:
#define the latitude and longitude of the geographical center of Texas

centerlat=31.3915
centerlng=-99.1707

# Create column in dataframe to hold the quadrant

grade3all_df['Quadrant']=''

# Iterate over the rows in the dataframe and use nested if statements to assign each school
# to a quadrant relative to the geographical center of Texas

for index, row in grade3all_df.iterrows():
    if grade3all_df.loc[index,'Latitude'] > centerlat:
        if grade3all_df.loc[index,'Longitude'] > centerlng:
            grade3all_df.loc[index,'Quadrant']='NE'
        else:
            grade3all_df.loc[index,'Quadrant']='NW'
    else:
        if grade3all_df.loc[index,'Longitude'] > centerlng:
            grade3all_df.loc[index,'Quadrant']='SE'
        else:
            grade3all_df.loc[index,'Quadrant']='SW'
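
The row-by-row loop above can also be written as a vectorized assignment. The sketch below is one possible way to do that with `np.select`, not the method used in this notebook. `assign_quadrant` is a hypothetical helper, and all sample coordinates except the first row are illustrative values rather than schools from the data. One behavioral difference is intentional: in the loop, a school with missing coordinates fails both comparisons and lands in 'SW', while this sketch leaves its quadrant blank.

```python
import numpy as np
import pandas as pd

CENTER_LAT, CENTER_LNG = 31.3915, -99.1707  # same center as the cell above

def assign_quadrant(lat, lng):
    """Hypothetical helper: label each point NE/NW/SE/SW relative to the center."""
    missing = lat.isna() | lng.isna()
    north = lat > CENTER_LAT
    east = lng > CENTER_LNG
    # np.select uses the first condition that is True for each row
    conditions = [missing, north & east, north & ~east, ~north & east, ~north & ~east]
    choices = ["", "NE", "NW", "SE", "SW"]
    return pd.Series(np.select(conditions, choices, default=""), index=lat.index)

sample = pd.DataFrame({
    "Latitude": [31.922964, 31.76, 29.42, 29.56, np.nan],
    "Longitude": [-95.923871, -106.49, -98.49, -104.37, np.nan],
})
sample["Quadrant"] = assign_quadrant(sample["Latitude"], sample["Longitude"])
print(sample)
```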
                
               
    

### In the next cell, I calculate the average % Pass by quadrant

In [None]:
# Group by quadrant and calculate average 
quadrant_grp_df=grade3all_df.groupby(['Quadrant'])

avgpass = quadrant_grp_df['% Pass'].mean()

# Convert series to dataframe and reset index so quadrant can be used
avgpass_df=avgpass.to_frame()
avgpass_df=avgpass_df.reset_index()


### The next cell creates a bar chart of the average % Pass by quadrant

In [None]:
# Define series to be plotted on bar chart

quadrant=avgpass_df['Quadrant']
percpass=avgpass_df['% Pass']

# Create bar chart of % Passing by Region

fig, ax=plt.subplots()

bar=plt.bar(x=quadrant, height=percpass)

# Include Title and axis labeling
plt.title('Average Percentage of Students Passing by Region')
plt.xlabel('Region of Texas')
plt.ylabel('% Passing')
ax.set_xticks(['NE', 'NW', 'SE', 'SW'])
ax.set_xticklabels(['Northeast', 'Northwest','Southeast','Southwest'])
plt.ylim(0,100)


### Now I create a boxplot of the distribution of the % Pass by quadrant

In [None]:

# Define series of % Pass for each quadrant
northeast=grade3all_df['% Pass'][grade3all_df['Quadrant']=='NE']
northwest=grade3all_df['% Pass'][grade3all_df['Quadrant']=='NW']
southeast=grade3all_df['% Pass'][grade3all_df['Quadrant']=='SE']
southwest=grade3all_df['% Pass'][grade3all_df['Quadrant']=='SW']

len_ne=len(northeast)
len_nw=len(northwest)
len_se=len(southeast)
len_sw=len(southwest)

# Create the boxplots for each region
fig, ax=plt.subplots()

plt.boxplot(northeast, positions=[1], patch_artist=True)
plt.boxplot(northwest, positions=[2], patch_artist=True)
plt.boxplot(southeast, positions=[3], patch_artist=True)
plt.boxplot(southwest, positions=[4], patch_artist=True)

# Add some labeling and title
plt.title("Distribution of % of Students Passing in Each Region")
ax.set_xticklabels(['Northeast', 'Northwest','Southeast','Southwest'])
plt.ylim(0,100)
plt.xlabel('Region of Texas')
plt.ylabel('% Passing')
ax.text(0.6,80,f"Schools = {len_ne}")
ax.text(1.6,80,f"Schools = {len_nw}")
ax.text(2.6,80,f"Schools = {len_se}")
ax.text(3.6,80,f"Schools = {len_sw}")


### In the next cell, I use the latitude and longitude of each school to run a Google Places search that identifies whether there is a university within 5 kilometers of the school.  The hypothesis is that schools near universities may have higher scores, since the children's parents might be more educated.

In [None]:
#add columns to hold the results of a search for nearby universities

grade3all_df['university'] = 0

# define base url for the Google Places nearby search
base_url = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"

# Iterate over rows in dataframe
for index, row in grade3all_df.iterrows():

    # define parameters for request
    
    # Define location string that holds coordinates of school to be passed to Google Maps
    location = str(grade3all_df.loc[index, 'Latitude']) + ',' + str(grade3all_df.loc[index,'Longitude'])
    target_radius2=5000
    type2='university'
   
    params1 = {
    "location": location,
    "type": type2,
    "radius" : target_radius2,
    "key": gkey,
    }
        
    response = requests.get(base_url, params=params1)

    response_json=response.json()
    
    # Check the status of the response.  If OK, it means that it found a university in the area
    # and the university column is changed to 1
    
    univstatus=response_json['status']
    if univstatus == "OK":
        grade3all_df.loc[index, 'university'] = 1    
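
The cell above treats every status other than "OK" as "no university nearby". The Places API returns "ZERO_RESULTS" when a search succeeds without matches, and other statuses such as "OVER_QUERY_LIMIT" or "REQUEST_DENIED" when the request itself fails. The sketch below shows one way to keep those cases apart, so a failed request is not counted as a school without a university. `university_flag` is a hypothetical helper, and returning `None` for failures is an assumption about how they should be handled, not part of the original analysis.

```python
def university_flag(response_json):
    """Hypothetical helper: 1 = university found, 0 = none found, None = request failed."""
    status = response_json.get("status")
    if status == "OK":
        return 1
    if status == "ZERO_RESULTS":
        return 0
    # Any other status (quota, denied request, invalid request, ...) is
    # left undetermined so it can be retried or dropped later
    return None

print(university_flag({"status": "OK", "results": [{"name": "Example University"}]}))
print(university_flag({"status": "ZERO_RESULTS", "results": []}))
print(university_flag({"status": "OVER_QUERY_LIMIT", "results": []}))
```

Rows flagged `None` could then be excluded before the comparison rather than being added to the "not near a university" group.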
        
    




### In the next cell, I create a boxplot of the distribution of % Pass for schools that are not near a university and schools that are near a university

In [None]:
# Define series of % Pass for schools that are not near and near a university
universityno=grade3all_df['% Pass'][grade3all_df['university']==0]
universityyes=grade3all_df['% Pass'][grade3all_df['university']== 1]

# Define number of schools that are or are not near a university 
lenno=len(universityno)
lenyes=len(universityyes)

# Calculate the t-statistic and p-value
(t_stat, p) = ttest_ind(universityno, universityyes, equal_var=False)

print(f"The t-statistic is {t_stat} and the p-value is {p}")

# Create the boxplots for each group
fig, ax=plt.subplots()

plt.boxplot(universityno, positions=[1], patch_artist=True)
plt.boxplot(universityyes, positions=[2], patch_artist=True)

# Add some labeling and a title
plt.title("% of Students Passing Based on Proximity to University")
ax.set_xticklabels(['No', 'Yes'])
plt.ylim(0,100)
plt.xlabel('University within 5 Kilometers')
plt.ylabel('% Passing')
ax.text(0.6,80,f"Number of Schools = {lenno}")
ax.text(1.6,80,f"Number of Schools = {lenyes}")
ax.text(2.05, 5, f"p-value = {round(p,2)}")
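
Calling `ttest_ind` with `equal_var=False` performs Welch's t-test, which does not assume the two groups have equal variance. As a rough illustration of what that statistic measures, the sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom with the standard library only. `welch_t` is a hypothetical helper, and the two lists are made-up values, not results from this analysis.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Hypothetical helper: Welch's t statistic and approximate degrees of freedom."""
    # Squared standard error of each group mean (sample variance / n)
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up % Pass values, for illustration only
not_near = [50, 60, 70]
near = [55, 65, 75, 85]
t_stat, df = welch_t(not_near, near)
print(t_stat, df)
```

With the argument order used in the cell above (schools not near a university first), a negative t statistic means the schools near a university have the higher average % Pass.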

