# Looking into the SAT data of New York schools
This project is based on the DataQuest article [Storytelling with Data](https://www.dataquest.io/blog/data-science-portfolio-project/), part of the series Building a Data Science Portfolio. 
The project consists of analyzing the [SAT scores](https://en.wikipedia.org/wiki/SAT) of high schoolers from different boroughs, along with other related information. As explained in the [article](https://www.dataquest.io/blog/data-science-portfolio-project/):
> The SAT, or Scholastic Aptitude Test, is a test that high schoolers take in the US before applying to college. Colleges take the test scores into account when making admissions decisions, so it’s fairly important to do well on. The test is divided into 3 sections, each of which is scored out of 800 points. The total score is out of 2400 (although this has changed back and forth a few times, the scores in this dataset are out of 2400). High schools are often ranked by their average SAT scores, and high SAT scores are considered a sign of how good a school district is.

There have been allegations that the SAT is unfair to certain racial groups in the US, so we are going to look at this too.  

## 1 Collecting our data
In this step we download the files and gather extra information to complement them, so we can then see what the data looks like.
### 1.1 Downloading all data

An analysis of this scope needs data from several sources. We are going to use the following datasets:
* [High School Directory](https://data.cityofnewyork.us/Education/2014-2015-DOE-High-School-Directory/n3p6-zve2) - directory containing information about each high school
* [SAT scores by school](https://data.cityofnewyork.us/Education/SAT-Results/f9bf-2cp4) — SAT scores for each high school in New York City.
* [School attendance](https://data.cityofnewyork.us/Education/School-Attendance-and-Enrollment-Statistics-by-Dis/7z8d-msnt) — attendance information on every school in NYC.
* [Math test results](https://data.cityofnewyork.us/Education/NYS-Math-Test-Results-By-Grade-2006-2011-School-Le/jufi-gzgp) — math test results for every school in NYC.
* [Class size](https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3) — class size information for each school in NYC.
* [AP test results](https://data.cityofnewyork.us/Education/AP-College-Board-2010-School-Level-Results/itfs-ms3e) — Advanced Placement exam results for each high school. Passing AP exams can get you college credit in the US.
* [Graduation outcomes](https://data.cityofnewyork.us/Education/Graduation-Outcomes-Classes-Of-2005-2010-School-Le/vh2h-md7a) — percentage of students who graduated, and other outcome information.
* [Demographics](https://data.cityofnewyork.us/Education/School-Demographics-and-Accountability-Snapshot-20/ihfw-zy9j) — demographic information for each school.
* [School survey](https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8) — surveys of parents, teachers, and students at each school.
* [School district maps](https://data.cityofnewyork.us/Education/School-Districts/r8nu-ymqj) — contains information on the layout of the school districts, so that we can map them out.

### 1.2 Getting background information

- New York City is divided into 5 boroughs, which are essentially distinct regions;
- Schools in New York City are divided into several school districts, each of which can contain dozens of schools;
- Not all the schools in all of the datasets are high schools;
- Each school in New York City has a unique code called a `DBN`, or District Borough Number;
- By aggregating the data by district, we can use the district mapping data to plot district-by-district differences.
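
To make the last point concrete, here is a minimal sketch of a district-level aggregation. It assumes that the first two characters of the `DBN` are the district number; the `sat_score` column and all values are made up for illustration and do not come from the real datasets.

```python
import pandas as pd

# Made-up scores for illustration only; not taken from the real datasets.
schools = pd.DataFrame({
    "DBN": ["01M448", "01M450", "02M296"],
    "sat_score": [1100.0, 1200.0, 1300.0],  # hypothetical column name
})

# Assumption: the first two characters of the DBN are the district number.
schools["district"] = schools["DBN"].str[:2]

# Average the score of all schools within each district.
district_means = schools.groupby("district")["sat_score"].mean()
print(district_means)
```

A table like `district_means`, with one row per district, is the kind of input that could later be joined to the district map data.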

## 2 Understanding our data
After collecting the data, we need to see how it is structured, but first let's import the libraries we are going to use.
### 2.1 Setting up the environment
Let's import all the libraries we are going to need to clean, analyze and display the data.

In [1]:
# importing numpy and pandas to the analysis
import numpy as np
import pandas as pd

# importing the visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

print("Libraries imported!")

Libraries imported!


### 2.2 Opening and reading our data
Now we are going to open our data and read it into pandas dataframes so we can see what it looks like.

In [2]:
# creating a list with the name of our csv files
csv_files = ['ap_2010.csv', 'class_size.csv', 'demographics.csv', 'graduation.csv', 'hs_directory.csv', 'math_test_results.csv', 'sat_results.csv', 'school_attendence.csv']

# creating a dictionary to store the future dataframes
data = {}

# creating and populating the dataframes with the csv
for f in csv_files:
    d = pd.read_csv('/home/nathalia/Documents/2 Data science/1 Projetos/4 New york SAT Analysis/{}'.format(f))
    # creating the key without the .csv in the name and storing the data
    data[f.replace('.csv', '')] = d
    
# checking to see if it worked
data.keys()

dict_keys(['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'math_test_results', 'sat_results', 'school_attendence'])
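
The cell above reads the files from an absolute path that only exists on one machine. A minimal sketch of a more portable variant is shown below, assuming the CSV files sit together in a single folder. The `load_csvs` helper and its `data_dir` argument are hypothetical names, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_csvs(data_dir, filenames):
    """Read each CSV into a DataFrame keyed by the file name without its extension."""
    frames = {}
    for name in filenames:
        path = Path(data_dir) / name
        # Path.stem drops the final extension, e.g. 'sat_results.csv' -> 'sat_results'
        frames[path.stem] = pd.read_csv(path)
    return frames
```

With this helper, something like `data = load_csvs('path/to/folder', csv_files)` would build the same kind of dictionary as the loop above, with the folder given once instead of being embedded in the format string.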

After reading our data and storing it in a dictionary, the next step is to look at it.

In [3]:
# reading the first five rows of each dataframe
for k in data.keys():
    print('\n {} \n {}'.format(k, data[k].head()))


 ap_2010 
       DBN                             SchoolName  AP Test Takers   \
0  01M448           UNIVERSITY NEIGHBORHOOD H.S.             39.0   
1  01M450                 EAST SIDE COMMUNITY HS             19.0   
2  01M515                    LOWER EASTSIDE PREP             24.0   
3  01M539         NEW EXPLORATIONS SCI,TECH,MATH            255.0   
4  02M296  High School of Hospitality Management              NaN   

   Total Exams Taken  Number of Exams with scores 3 4 or 5  
0               49.0                                  10.0  
1               21.0                                   NaN  
2               26.0                                  24.0  
3              377.0                                 191.0  
4                NaN                                   NaN  

 class_size 
    CSD BOROUGH SCHOOL CODE                SCHOOL NAME GRADE  PROGRAM TYPE  \
0    1       M        M015  P.S. 015 Roberto Clemente     0K       GEN ED   
1    1       M        M015  P.S. 015 R

Looking at our data we can see some things:
- Most of the datasets contain the `DBN` column;
- Some fields look interesting for mapping;
- Some of the datasets appear to contain multiple rows for each school, showing that we'll have to do some preprocessing.

After seeing this, let's move on to our next step and unify this data.
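
The observation about multiple rows per school can be checked directly rather than by eye. The sketch below uses a small made-up dataframe as a stand-in for one of the datasets; the `rows_per_school` helper and the `schoolyear` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy stand-in for one of the datasets; the values are made up for illustration.
toy = pd.DataFrame({
    "DBN": ["01M015", "01M015", "01M019"],
    "schoolyear": [20052006, 20062007, 20052006],
})


def rows_per_school(df, key="DBN"):
    """Return the number of rows for each value of `key`, largest first."""
    return df[key].value_counts()


counts = rows_per_school(toy)
# True when at least one school appears on more than one row.
has_duplicates = bool((counts > 1).any())
print(counts)
print("Multiple rows per school:", has_duplicates)
```

Running the same check over each dataframe in `data` that has a `DBN` column would show which datasets need to be condensed to one row per school before they can be combined.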

## 3 Data Preprocessing
The first step is to unify our data based on a common column, which in our case is `DBN`. As noted in the background section, the `DBN` (District Borough Number) is the unique code that identifies each school in New York City. 
We are going to standardize this column name to lowercase `dbn` in every dataset, since the `hs_directory` dataframe already uses `dbn`, and we will need to create a `dbn` column in the `class_size` dataset too.  
Let's rename the `DBN` column to `dbn` in every dataset that has it.

In [38]:
# listing the datasets whose DBN column we want to standardize
datasets = ['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'math_test_results', 'sat_results']

# changing the DBN column name to lowercase in every dataset that has it
for item in datasets:
    if 'DBN' in data[item].columns:
        data[item].rename({'DBN': 'dbn'}, axis=1, inplace=True)
        print("The {} dataset was changed.".format(item))
    elif 'dbn' in data[item].columns:
        print("The {} is already lowercase.".format(item))
    else:
        # let's make this information a little bit more visible
        print("\nThe {} has no DBN.\n".format(item))

The ap_2010 is already lowercase.

The class_size has no DBN.

The demographics is already lowercase.
The graduation is already lowercase.
The hs_directory is already lowercase.
The math_test_results is already lowercase.
The sat_results is already lowercase.


As we can see above, `class_size` has no `DBN` or `dbn` column. Let's look at this specific dataframe to see if the information is there in some other form. But first, let's look at the `dbn` values to see what they look like.

In [46]:
# a sample of what the dbn looks like
data['demographics']['dbn'].unique()

array(['01M015', '01M019', '01M020', ..., '32K554', '32K556', '32K564'],
      dtype=object)

As we can see above, the `dbn` values are made of two digits, followed by a letter and three more digits.
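
This pattern suggests one possible way to build the missing `dbn` column for `class_size`. The sketch below assumes that a `dbn` is the `CSD` value zero-padded to two digits followed by the `SCHOOL CODE`, which is consistent with the preview (`CSD` 1 and `SCHOOL CODE` `M015`) and with the value `01M015` listed above. It runs on a small toy dataframe rather than on the real data.

```python
import pandas as pd

# Toy rows mirroring the `CSD` and `SCHOOL CODE` columns seen in the class_size preview.
class_size = pd.DataFrame({
    "CSD": [1, 1, 32],
    "SCHOOL CODE": ["M015", "M019", "K554"],
})

# Assumption: dbn = CSD zero-padded to two digits + SCHOOL CODE.
class_size["dbn"] = class_size["CSD"].astype(str).str.zfill(2) + class_size["SCHOOL CODE"]

# Check that every value matches the pattern: two digits, one letter, three digits.
matches_pattern = class_size["dbn"].str.match(r"^\d{2}[A-Z]\d{3}$")
print(class_size["dbn"].tolist())
print("All values match the pattern:", bool(matches_pattern.all()))
```

The regular expression check is a cheap guard: if the assumption about how the two columns combine were wrong for some rows, they would show up as `False` in `matches_pattern`.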