---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

{{< include instructions.qmd >}} 


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

### Overview

The goal of this section is to collect enough data to build a comprehensive dataset covering the factors that I believe drive negative high school outcomes in New York City. This includes data on high school offerings, school quality, and socioeconomic conditions.

### American Community Survey API
In the following code, I use the `requests` library to retrieve Census data from the American Community Survey (ACS) 5-year data profile. The request parameters include `get`, which specifies the columns to return, and `for`, which requests data at the ZIP code tabulation area level. I used ChatGPT to brainstorm potentially relevant columns, out of the 10,000+ available, that would satisfy the project objectives. I store the response in a pandas DataFrame and rename the columns for readability. Finally, I export the DataFrame to a CSV file to prepare for data cleaning.

In [2]:
import os

import pandas as pd
import requests

# base URL + specify params
url = "https://api.census.gov/data/2022/acs/acs5/profile"
params = {
    "get": "DP02_0060PE,DP02_0068PE,DP02_0114PE,DP02_0072PE,DP02_0094PE,DP02_0154PE,DP03_0062E,DP03_0074PE",
    "for": "zip code tabulation area:*",
    # read the API key from the environment instead of hard-coding it
    # (CENSUS_API_KEY is an assumed variable name)
    "key": os.environ["CENSUS_API_KEY"],
}

response = requests.get(url, params=params)

# check the response
if response.status_code == 200:
    data = response.json()
    # the first row of the payload holds the column names
    df = pd.DataFrame(data[1:], columns=data[0])

    # rename the ACS variable codes to readable column names
    df.rename(columns={
        "DP02_0060PE": "Percent No High School (25+)",
        "DP02_0068PE": "Percent Bachelor's Degree or Higher (25+)",
        "DP02_0114PE": "Percent Language Other Than English at Home",
        "DP02_0072PE": "Percent Population with Disabilities",
        "DP02_0094PE": "Percent Foreign-Born Population",
        "DP02_0154PE": "Percent Households with Broadband Internet",
        "DP03_0062E": "Median Household Income",
        "DP03_0074PE": "Percent Households on SNAP/Food Stamps",
    }, inplace=True)

    print(df)
else:
    print(f"Error: {response.status_code} - {response.text}")

      Percent No High School (25+) Percent Bachelor's Degree or Higher (25+)  \
0                             None                                      None   
1                             None                                      None   
2                             None                                      None   
3                             None                                      None   
4                             None                                      None   
...                            ...                                       ...   
33769                          0.0                                      64.0   
33770                          3.0                                      17.0   
33771                          0.9                                       9.2   
33772                          0.0                                       0.0   
33773                          1.7                                      18.0   

[output truncated]

Export to CSV

In [3]:
df.to_csv('../../data/raw-data/ACS_data.csv')
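
The request above lists the ACS variable codes twice: once in the `get` string and once in the rename mapping. The sketch below shows one way to keep them in sync by deriving both from a single dictionary. The helper names `build_params` and `rows_to_records` and the sample payload are hypothetical, introduced only for illustration; the payload mimics the header-first list-of-lists shape that the code above unpacks with `data[0]` and `data[1:]`.

```python
# Single mapping from ACS variable code to readable column name
# (codes and labels copied from the request above).
ACS_VARIABLES = {
    "DP02_0060PE": "Percent No High School (25+)",
    "DP02_0068PE": "Percent Bachelor's Degree or Higher (25+)",
    "DP02_0114PE": "Percent Language Other Than English at Home",
    "DP02_0072PE": "Percent Population with Disabilities",
    "DP02_0094PE": "Percent Foreign-Born Population",
    "DP02_0154PE": "Percent Households with Broadband Internet",
    "DP03_0062E": "Median Household Income",
    "DP03_0074PE": "Percent Households on SNAP/Food Stamps",
}


def build_params(variables, api_key=None):
    """Build the query parameters from the variable mapping (hypothetical helper)."""
    params = {
        "get": ",".join(variables),  # joins the dictionary keys in insertion order
        "for": "zip code tabulation area:*",
    }
    if api_key:
        params["key"] = api_key
    return params


def rows_to_records(payload, variables):
    """Turn a header-first list of rows into dicts keyed by readable names."""
    header, *rows = payload
    names = [variables.get(code, code) for code in header]
    return [dict(zip(names, row)) for row in rows]


# Made-up payload in the same shape as the API response; the values are not real data.
sample_payload = [
    ["DP02_0060PE", "DP03_0062E", "zip code tabulation area"],
    ["3.0", "65000", "10001"],
    [None, None, "00601"],
]

params = build_params(ACS_VARIABLES)
records = rows_to_records(sample_payload, ACS_VARIABLES)
```

The resulting `records` list can be passed to `pd.DataFrame.from_records`. Note that the values arrive as strings or `None`, so numeric conversion is still left to the data cleaning step.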

### NYC Open Data API
This data on graduation outcomes comes from NYC Open Data and contains the dropout rates that we are interested in. The code is adapted from the API documentation (https://dev.socrata.com/foundry/data.cityofnewyork.us/3vje-du8p). We use the `Socrata` client from the `sodapy` library to retrieve the data. The client returns the JSON response as a Python list of dictionaries, which converts easily to a pandas DataFrame.

In [4]:
from sodapy import Socrata

# unauthenticated client works with public data sets

client = Socrata("data.cityofnewyork.us", None)
results = client.get("mjm3-8dw8", limit=321002)

# convert to pandas df
results_df = pd.DataFrame.from_records(results)



Export to CSV

In [32]:
results_df.to_csv('../../data/raw-data/dropout_data.csv')
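
The call above hard-codes the row count in `limit`. As an alternative, the records could be collected page by page until a short page signals the end. This is a minimal sketch, not the method used in this project: `fetch_all` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for the API so the example runs offline.

```python
def fetch_all(fetch_page, page_size=50000):
    """Collect every record by requesting pages until a short page is returned."""
    records = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        records.extend(page)
        if len(page) < page_size:
            # Fewer rows than requested means there is nothing left to fetch.
            break
        offset += page_size
    return records


# With sodapy, fetch_page could wrap the client used above, for example:
#   fetch_page = lambda limit, offset: client.get("mjm3-8dw8", limit=limit, offset=offset)

# Offline stand-in for the API: seven made-up rows served in slices.
fake_rows = [{"row_id": i} for i in range(7)]


def fake_fetch(limit, offset):
    return fake_rows[offset:offset + limit]


all_rows = fetch_all(fake_fetch, page_size=3)
```

The combined list can then go through `pd.DataFrame.from_records` exactly as before. When paging against the real API, it is probably worth also passing an explicit sort order so that pages do not overlap between requests.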

## Other Datasets Used

To get data on the quality of each high school in New York City, including some of my targets of interest such as chronic absenteeism and college persistence, I downloaded a dataset from NYC InfoHub.

[Download the 2022-23 High School Quality Report](../../data/raw-data/202223-hs-sqr-results.xlsx)

In addition, to map the quality data to the Census data, I used an NYC High School Directory file found on the InfoHub site. It provides a ZIP code for each school, which I could use to join the Census data.

[Download the mapping file](../../data/mapping/2021_DOE_High_School_Directory.csv)
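
Joining on ZIP code only works if both sides store it in the same form. The Census API returns ZIP code tabulation areas as strings, while a CSV column may be read as an integer or float. The sketch below shows one possible normalization step to apply before the join. `normalize_zip` is a hypothetical helper, and the column name mentioned in the usage note is an assumption, not taken from the directory file.

```python
def normalize_zip(value):
    """Return a 5-character ZIP code string, or None if the value is unusable."""
    if value is None:
        return None
    text = str(value).strip()
    if not text:
        return None
    # Drop a trailing ".0" left behind when the column was parsed as float.
    if text.endswith(".0"):
        text = text[:-2]
    # Keep only the 5-digit part of a ZIP+4 code such as "10001-1234".
    text = text.split("-")[0]
    if not text.isdigit() or len(text) > 5:
        return None
    # Restore leading zeros lost when the value was parsed as a number.
    return text.zfill(5)


examples = [10001, 10451.0, "10001-1234", " 601 ", None, "n/a"]
normalized = [normalize_zip(value) for value in examples]
```

With pandas, the same function could be applied to the ZIP column on each side before merging, for example `directory["zip"].map(normalize_zip)`, where `zip` is an assumed column name.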


### Summary

## Challenges
One major challenge was finding data that could be joined to my dataset. The NYC high school quality data is aggregated by school name and DBN, a unique New York City identifier composed of the district, borough, and NYC DOE school number. However, many other datasets that I wanted to draw features from, such as nationwide ones, do not include this identifier, since it appears to be specific to New York. As a result, I spent a lot of time searching and was somewhat limited to data with this level of aggregation. I was able to find a mapping of DBN to ZIP code, which allowed me to join the Census data to my school data and incorporate a few socioeconomic factors.
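
To make the DBN structure concrete, the sketch below splits an identifier into its three parts. It assumes the common six-character layout (two-digit district, one borough letter, three-digit school number) and the borough letters M, X, K, Q, and R. Both are assumptions to verify against the directory file, and `parse_dbn` is a hypothetical helper.

```python
import re

# Assumed borough letters used in DBNs.
BOROUGHS = {
    "M": "Manhattan",
    "X": "Bronx",
    "K": "Brooklyn",
    "Q": "Queens",
    "R": "Staten Island",
}

# Assumed layout: 2-digit district + borough letter + 3-digit school number.
DBN_PATTERN = re.compile(r"^(\d{2})([MXKQR])(\d{3})$")


def parse_dbn(dbn):
    """Split a DBN into district, borough, and school number."""
    match = DBN_PATTERN.match(dbn.strip().upper())
    if match is None:
        raise ValueError(f"Unexpected DBN format: {dbn!r}")
    district, borough, school_number = match.groups()
    return {
        "district": district,
        "borough": BOROUGHS[borough],
        "school_number": school_number,
    }


parsed = parse_dbn("02M475")
```

A check like this can also flag malformed identifiers before joining the quality data to the directory file.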

In future work, I would be interested in scaling this project to all of New York, or even nationwide. Since I am only looking at New York City high schools, I have only about 500 rows, which is a fairly small dataset. If this can be scaled to a larger number of schools, I would likely be able to create a more robust dataset and more robust models.

## Conclusion and Future Steps

Despite these challenges, by aggregating data from three different datasets, we are able to evaluate many features that may affect our target variables, which helps us assess how high school qualities and socioeconomic factors affect high school outcomes in New York City.