## NYC schools and SAT scores

This project uses NYC public school data to compare schools with regard to demographics and test scores.

**Datasets included:**
* SAT scores by school - SAT scores for each high school in New York City
* School attendance - Attendance information for each school in New York City
* Class size - Information on class size for each school
* AP test results - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)
* Graduation outcomes - The percentage of students who graduated, and other outcome information
* Demographics - Demographic information for each school
* School survey - Surveys of parents, teachers, and students at each school

**Points to consider:**
* NYC has 5 boroughs (regions)
* DBN: District Borough Number, the code that identifies each school

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [3]:
#List of CSV files to read
data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"]

In [4]:
data = {} #Dictionary with key = CSV name (without extension), value = dataframe returned by pd.read_csv()
csv_names = [] #List with CSV file names (without extension)
for i in data_files:
    word = i.split(".")[0]
    csv_names.append(word)
    data[word] = pd.read_csv(i)
print("CSV files:\n",csv_names)
print("\ndata dictionary, keys:\n", data.keys())

CSV files:
 ['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'sat_results']

data dictionary, keys:
 dict_keys(['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'sat_results'])


The data dictionary now contains all CSV files, with the file name as key and the dataframe as value.

In [5]:
sat = data["sat_results"]
print("First 5 rows of SAT dataframe:\n")
sat.head()

First 5 rows of SAT dataframe:



Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Critical Reading Avg. Score,SAT Math Avg. Score,SAT Writing Avg. Score
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384


In [6]:
#for key,value in data.items():
#   print("First 5 rows of", key, ":\n", value)

* Each dataset contains a DBN column, or the information needed to build one, so all datasets can be combined into a single dataset through DBN.
* Some schools appear in multiple rows, which produces duplicate DBN values.
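A quick way to confirm whether a dataset repeats schools is to count rows per DBN. The sketch below is a minimal illustration on a made-up dataframe (the `toy` name and its values are assumptions, not taken from the real files):

```python
import pandas as pd

# Toy stand-in for a dataset with several rows per school (values are made up)
toy = pd.DataFrame({
    "DBN": ["01M015", "01M015", "01M019", "02M047"],
    "GRADE": ["0K", "01", "0K", "09-12"],
})

# Number of rows per DBN; any count above 1 means the school is repeated
rows_per_dbn = toy["DBN"].value_counts()
repeated = rows_per_dbn[rows_per_dbn > 1]

# True only when every DBN appears exactly once
dbn_is_unique = toy["DBN"].is_unique

print(repeated)
print("DBN unique:", dbn_is_unique)
```

The same two checks could be run on each dataframe in the data dictionary to see which ones need to be condensed before combining.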

## Survey Data
* Both survey files are tab-delimited text (.txt) files encoded in "Windows-1252", so the delimiter and the encoding must be passed to pd.read_csv()
* Both survey files will be stacked into a single dataframe with pd.concat()
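To illustrate why the encoding matters, the sketch below reads a small tab-delimited sample from memory instead of from disk. The sample content and the `schoolname` column are hypothetical; only the `delimiter` and `encoding` arguments mirror what the notebook uses.

```python
import io
import pandas as pd

# Hypothetical two-row sample; "é" is stored as the single byte 0xE9 in Windows-1252
raw = (
    "dbn\tschoolname\tN_s\n"
    "01M015\tCafé School\t90\n"
    "01M019\tAsher Levy\t161\n"
).encode("windows-1252")

# Same arguments as the notebook: tab delimiter and Windows-1252 encoding
sample = pd.read_csv(io.BytesIO(raw), delimiter="\t", encoding="windows-1252")
print(sample)
```

With the matching encoding, the accented character is decoded correctly rather than raising a decoding error or producing garbled text.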

In [7]:
#Reading survey_all text file
all_survey = pd.read_csv("survey_all.txt", delimiter="\t", encoding="windows-1252")

In [8]:
#Reading survey_d75 text file
d75_survey = pd.read_csv("survey_d75.txt", delimiter="\t", encoding="windows-1252")

In [9]:
#Combining both survey dataframes
#axis=0 stacks the rows: all_survey rows first, then d75_survey rows
survey = pd.concat([all_survey, d75_survey], axis = 0)
survey.head()

Unnamed: 0,N_p,N_s,N_t,aca_p_11,aca_s_11,aca_t_11,aca_tot_11,bn,com_p_11,com_s_11,...,t_q8c_1,t_q8c_2,t_q8c_3,t_q8c_4,t_q9,t_q9_1,t_q9_2,t_q9_3,t_q9_4,t_q9_5
0,90.0,,22.0,7.8,,7.9,7.9,M015,7.6,,...,29.0,67.0,5.0,0.0,,5.0,14.0,52.0,24.0,5.0
1,161.0,,34.0,7.8,,9.1,8.4,M019,7.6,,...,74.0,21.0,6.0,0.0,,3.0,6.0,3.0,78.0,9.0
2,367.0,,42.0,8.6,,7.5,8.0,M020,8.3,,...,33.0,35.0,20.0,13.0,,3.0,5.0,16.0,70.0,5.0
3,151.0,145.0,29.0,8.5,7.4,7.8,7.9,M034,8.2,5.9,...,21.0,45.0,28.0,7.0,,0.0,18.0,32.0,39.0,11.0
4,90.0,,23.0,7.9,,8.1,8.0,M063,7.9,,...,59.0,36.0,5.0,0.0,,10.0,5.0,10.0,60.0,15.0


* The survey dataframe has a "dbn" column, which needs a "DBN" counterpart to match the "DBN" columns of the other dataframes.
* It also has 2773 columns, most of which are unnecessary for the analysis.
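As an alternative to copying the column, the name can be changed in place with `rename` before selecting the columns of interest. This is only a sketch on a made-up dataframe (`toy_survey` and its `extra` column are assumptions):

```python
import pandas as pd

# Toy survey with a lowercase "dbn" column and one column we do not need
toy_survey = pd.DataFrame({
    "dbn": ["01M015", "01M019"],
    "rr_s": [None, 95.0],
    "extra": [1, 2],
})

# rename returns a new dataframe; the original "dbn" column is replaced, not duplicated
renamed = toy_survey.rename(columns={"dbn": "DBN"})

# Keep only the columns needed for the analysis
subset = renamed[["DBN", "rr_s"]]
print(subset)
```

Copying the column, as the notebook does, gives the same result once the relevant columns are selected, because the original "dbn" column is dropped at that point anyway.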

Relevant columns:

In [10]:
#list of important columns in survey
relevant_cols = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]

In [11]:
#New column "DBN" with "dbn" info
survey["DBN"] = survey["dbn"]

In [12]:
#New Survey Dataframe contains only relevant_cols columns
survey = survey.loc[:,relevant_cols]
#survey1 = survey.loc[:,relevant_cols]
#survey2 = survey[relevant_cols]
print("Survey shape:\n", survey.shape)

Survey shape:
 (1702, 23)


The survey dataframe now has 1702 rows and 23 columns (the relevant columns only).

## CSV dataframes

In [13]:
#List with data keys / CSV dataset names
datasets = list(data.keys())
print("These are the dataset names:\n", datasets)

These are the dataset names:
 ['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'sat_results']


In [14]:
#hs_directory DF
data["hs_directory"].head(1)

Unnamed: 0,dbn,school_name,borough,building_code,phone_number,fax_number,grade_span_min,grade_span_max,expgrade_span_min,expgrade_span_max,...,priority08,priority09,priority10,Location 1,Community Board,Council District,Census Tract,BIN,BBL,NTA
0,17K548,Brooklyn School for Music & Theatre,Brooklyn,K440,718-230-6250,718-230-6262,9.0,12,,,...,,,,"883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...",9.0,35.0,213.0,3029686.0,3011870000.0,Crown Heights South ...


* hs_directory has "dbn" column (needs to change to "DBN")

In [15]:
#class_size DF
data["class_size"].head(1)

Unnamed: 0,CSD,BOROUGH,SCHOOL CODE,SCHOOL NAME,GRADE,PROGRAM TYPE,CORE SUBJECT (MS CORE and 9-12 ONLY),CORE COURSE (MS CORE and 9-12 ONLY),SERVICE CATEGORY(K-9* ONLY),NUMBER OF STUDENTS / SEATS FILLED,NUMBER OF SECTIONS,AVERAGE CLASS SIZE,SIZE OF SMALLEST CLASS,SIZE OF LARGEST CLASS,DATA SOURCE,SCHOOLWIDE PUPIL-TEACHER RATIO
0,1,M,M015,P.S. 015 Roberto Clemente,0K,GEN ED,-,-,-,19.0,1.0,19.0,19.0,19.0,ATS,


* class_size doesn't have "DBN" column at all

In [16]:
data["sat_results"].head(2)

Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Critical Reading Avg. Score,SAT Math Avg. Score,SAT Writing Avg. Score
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366


* Comparing with sat_results, a DBN can be built by combining the "CSD" and "SCHOOL CODE" columns of class_size
* e.g. CSD (1) + SCHOOL CODE (M015) gives 01M015
* The DBN values in sat_results start with a zero-padded district number, so DBN = zero-padded "CSD" + "SCHOOL CODE"
* The "CSD" values therefore need to be padded to two digits first.
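The padding and concatenation described above can be sketched with pandas string methods. The dataframe below is a made-up example (`toy_class_size` and its values are assumptions); only the column names follow class_size.

```python
import pandas as pd

# Toy rows with one-digit and two-digit district numbers
toy_class_size = pd.DataFrame({
    "CSD": [1, 4, 21],
    "SCHOOL CODE": ["M015", "M012", "K090"],
})

# Convert to string, then left-pad with zeros to a width of 2
padded = toy_class_size["CSD"].astype(str).str.zfill(2)

# DBN = zero-padded district + school code
toy_class_size["DBN"] = padded + toy_class_size["SCHOOL CODE"]
print(toy_class_size["DBN"].tolist())
```

Values that already have two digits are left unchanged by `zfill(2)`, so no length check is needed.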

## Modifying DBN columns

**hs_directory**

In [17]:
#Creating new "DBN" column with "dbn" data on hs_directory
data["hs_directory"]["DBN"]=data["hs_directory"]["dbn"]

**class_size**

In [18]:
#pad_csd converts a CSD value to a string and left-pads it with zeros to 2 characters.
#Values that already have 2 characters are returned unchanged.
def pad_csd(csd):
    return str(csd).zfill(2) #zfill adds a leading zero only when the string is shorter than 2 characters

In [19]:
#Creating new "padded_csd" column on class_size dataset
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)

In [20]:
data["class_size"]["padded_csd"].unique()

array(['01', '04', '02', '21', '27', '05', '06', '14', '17', '20', '03',
       '32', '07', '08', '09', '10', '11', '12', '13', '15', '16', '19',
       '18', '22', '23', '24', '25', '26', '28', '29', '30', '31'],
      dtype=object)

Now, the new column has values consisting of two characters

In [21]:
#Concatenating class_size columns ("padded_csd" and "SCHOOL CODE")
#New "DBN" column contains the concatenated values
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
print("First few lines of DBN column:")
data["class_size"]["DBN"].head()

First few lines of DBN column:


0    01M015
1    01M015
2    01M015
3    01M015
4    01M015
Name: DBN, dtype: object

## SAT scores

The SAT scores are split across separate columns, one per section (math, critical reading and writing), so we need to add them into a single column to compare them more easily against demographics and other factors.
* The columns must first be converted from strings to a numeric data type
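The conversion and the sum can be sketched on a small made-up dataframe. The short column names and the "s" placeholder for a missing score are assumptions for illustration only; `errors="coerce"` turns any value that cannot be parsed into NaN.

```python
import pandas as pd

# Toy scores stored as strings; the last row uses a non-numeric placeholder
toy_sat = pd.DataFrame({
    "math": ["404", "423", "s"],
    "reading": ["355", "383", "s"],
    "writing": ["363", "366", "s"],
})

score_cols = ["math", "reading", "writing"]

# Convert every score column; unparseable values become NaN
numeric = toy_sat[score_cols].apply(pd.to_numeric, errors="coerce")

# Row-wise total; skipna=False keeps the total as NaN when any section is missing
total = numeric.sum(axis=1, skipna=False)
print(total)
```

Keeping NaN in the total, rather than treating missing sections as zero, avoids schools appearing to have an artificially low combined score.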

In [22]:
#Converting SAT columns to numeric type
print("SAT column types before converting:")
print(data["sat_results"]["SAT Math Avg. Score"].dtype)
print(data["sat_results"]["SAT Critical Reading Avg. Score"].dtype)
print(data["sat_results"]["SAT Writing Avg. Score"].dtype)
data["sat_results"]["SAT Math Avg. Score"] = pd.to_numeric(data["sat_results"]["SAT Math Avg. Score"], errors = "coerce")
data["sat_results"]["SAT Critical Reading Avg. Score"] = pd.to_numeric(data["sat_results"]["SAT Critical Reading Avg. Score"], errors = "coerce")
data["sat_results"]["SAT Writing Avg. Score"] = pd.to_numeric(data["sat_results"]["SAT Writing Avg. Score"], errors = "coerce")
print("SAT column types after converting:")
print(data["sat_results"]["SAT Math Avg. Score"].dtype)
print(data["sat_results"]["SAT Critical Reading Avg. Score"].dtype)
print(data["sat_results"]["SAT Writing Avg. Score"].dtype)

SAT column types before converting:
object
object
object
SAT column types after converting:
float64
float64
float64


In [23]:
#Adding up the three SAT score columns into a single column "sat_score"
data["sat_results"]["sat_score"] = data["sat_results"]["SAT Math Avg. Score"] + data["sat_results"]["SAT Critical Reading Avg. Score"] + data["sat_results"]["SAT Writing Avg. Score"]
print("New SAT combined column results:")
data["sat_results"]["sat_score"].head()

New SAT combined column results:


0    1122.0
1    1172.0
2    1149.0
3    1174.0
4    1207.0
Name: sat_score, dtype: float64

## School coordinates

The coordinates are located in the hs_directory dataset, in parentheses at the end of each "Location 1" value.
* We need to create functions that extract the latitude and the longitude
* The extracted values will be stored in new "lat" and "lon" columns
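One possible way to pull both values out in a single pass is a regular expression with two capture groups. This is a sketch, not the notebook's method: the helper name `extract_lat_lon` is hypothetical, and the sample string assumes the coordinates appear as "(latitude, longitude)" at the end of the address.

```python
import re

# Two capture groups: latitude and longitude inside one pair of parentheses
COORD_PATTERN = re.compile(r"\((-?\d+\.?\d*),\s*(-?\d+\.?\d*)\)")

def extract_lat_lon(location):
    """Return (lat, lon) as floats, or (None, None) when no coordinates are found."""
    match = COORD_PATTERN.search(location)
    if match is None:
        return None, None
    return float(match.group(1)), float(match.group(2))

# Assumed layout of a "Location 1" value
toy_location = (
    "883 Classon Avenue\nBrooklyn, NY 11225\n"
    "(40.67029890700047, -73.96164787599963)"
)

lat, lon = extract_lat_lon(toy_location)
print(lat, lon)
```

Returning floats directly removes the need for a later numeric conversion, and the explicit `None` branch avoids an IndexError on rows without coordinates.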

In [24]:
data["hs_directory"]["Location 1"].head(3)

0    883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...
1    1110 Boston Road\nBronx, NY 10456\n(40.8276026...
2    1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...
Name: Location 1, dtype: object

**Extracting Latitude**

In [25]:
import re #regular expression
#extract_latitude takes in a string and extracts the text in parentheses
#Takes in a value of hs_directory's "Location 1" column and returns the latitude as a string
def extract_latitude(string):
    coordinate = re.findall(r"\(.+\)", string) #raw string avoids invalid escape sequence warnings
    latitude = coordinate[0].split(",")[0].replace("(","")
    return latitude

In [26]:
#Applying function to Location column
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(extract_latitude)
print("Latitudes in new 'lat' column:")
data["hs_directory"]["lat"].head()

Latitudes in new 'lat' column:


0     40.67029890700047
1      40.8276026690005
2    40.842414068000494
3     40.71067947100045
4    40.718810094000446
Name: lat, dtype: object

**Extracting Longitude**

In [27]:
#Function that extracts longitude (as a string) from a "Location 1" value
def extract_longitude(string):
    coordinate = re.findall(r"\(.+\)", string) #raw string avoids invalid escape sequence warnings
    longitude = coordinate[0].split(",")[1].replace(")","")
    return longitude

In [28]:
#Applying function to Location column
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(extract_longitude)
print("Longitudes in new 'lon' column:")
data["hs_directory"]["lon"].head()

Longitudes in new 'lon' column:


0     -73.96164787599963
1     -73.90447525699966
2     -73.91616158599965
3     -74.00080702099967
4     -73.80650045499965
Name: lon, dtype: object

## Modifying to unique DBN columns

Several datasets contain multiple rows with the same DBN value, and these duplicates will cause problems when DBN is used as the index to combine the datasets.
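The problem can be seen on a small made-up example: merging on a DBN that is repeated in one dataframe repeats the matching rows of the other. The dataframes and the averaging step below are assumptions used for illustration; averaging is only one possible way to condense a dataset to one row per school.

```python
import pandas as pd

# One row per school
toy_sat = pd.DataFrame({
    "DBN": ["01M015", "01M019"],
    "sat_score": [1122.0, 1172.0],
})

# Several rows for the same school
toy_class_size = pd.DataFrame({
    "DBN": ["01M015", "01M015", "01M019"],
    "AVERAGE CLASS SIZE": [19.0, 21.0, 20.0],
})

# Merging directly repeats the 01M015 SAT row once per class_size row
merged_raw = toy_sat.merge(toy_class_size, on="DBN", how="left")

# Condensing to one row per DBN first keeps one row per school after the merge
per_school = toy_class_size.groupby("DBN", as_index=False)["AVERAGE CLASS SIZE"].mean()
merged_clean = toy_sat.merge(per_school, on="DBN", how="left")

print(len(merged_raw), len(merged_clean))
```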