---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

{{< include instructions.qmd >}} 


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

### American Community Survey API
The code below uses the requests library to retrieve Census data from the American Community Survey (ACS). The request parameters include get, which lists the columns we want, and for, which requests the data at the zip code level (ZIP Code Tabulation Areas). ChatGPT was used to brainstorm potentially relevant columns, out of the 10,000+ available, that would satisfy the project objectives. I stored the response in a pandas dataframe and renamed the columns for readability. Finally, I exported the dataframe to a CSV file to prepare for data cleaning.
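As a rough illustration of how the get and for parameters end up in the request, the sketch below builds the query string by hand with the standard library. It is not part of the collection pipeline: requests performs this encoding itself, and its exact percent-encoding may differ slightly from `urlencode`. Only two of the columns are listed for brevity, and the API key is left out.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://api.census.gov/data/2022/acs/acs5/profile"

# Shortened parameter set for illustration (two columns, no API key)
params = {
    "get": "DP02_0060PE,DP03_0062E",
    "for": "zip code tabulation area:*",
}

# Encode the parameters into a query string and attach it to the base URL
query_url = f"{base_url}?{urlencode(params)}"
print(query_url)

# Decode the query string again to confirm what the server would receive
decoded = parse_qs(urlsplit(query_url).query)
print(decoded["get"][0].split(","))  # requested columns
print(decoded["for"][0])             # requested geography
```

The round trip shows that get carries a comma-separated column list, while for names the geography, with `*` acting as a wildcard for every ZIP Code Tabulation Area.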


In [2]:
import requests
import pandas as pd


# base URL and query parameters
url = "https://api.census.gov/data/2022/acs/acs5/profile"
params = {
    "get": "DP02_0060PE,DP02_0068PE,DP02_0114PE,DP02_0072PE,DP02_0094PE,DP02_0154PE,DP03_0062E,DP03_0074PE",
    "for": "zip code tabulation area:*",
    "key": "1c6835368d6cc1f7472ed2e8a39e07ee7e9d1cd6"
}

response = requests.get(url, params=params)

# check the response
if response.status_code == 200:
    data = response.json()
    df = pd.DataFrame(data[1:], columns=data[0])

    df.rename(columns={
        "DP02_0060PE": "Percent No High School (25+)",
        "DP02_0068PE": "Percent Bachelor's Degree or Higher (25+)",
        "DP02_0114PE": "Percent Language Other Than English at Home",
        "DP02_0072PE": "Percent Population with Disabilities",
        "DP02_0094PE": "Percent Foreign-Born Population",
        "DP02_0154PE": "Percent Households with Broadband Internet",
        "DP03_0062E": "Median Household Income",
        "DP03_0074PE": "Percent Households on SNAP/Food Stamps"
    }, inplace=True)

    print(df)
else:
    print(f"Error: {response.status_code} - {response.text}")

      Percent No High School (25+) Percent Bachelor's Degree or Higher (25+)  \
0                             None                                      None   
1                             None                                      None   
2                             None                                      None   
3                             None                                      None   
4                             None                                      None   
...                            ...                                       ...   
33769                          0.0                                      64.0   
33770                          3.0                                      17.0   
33771                          0.9                                       9.2   
33772                          0.0                                       0.0   
33773                          1.7                                      18.0   


In [3]:
#export to csv
df.to_csv('../../data/raw-data/ACS_data.csv')
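To make the response-handling step above more concrete, here is a minimal sketch that runs the same dataframe construction on a small mock payload instead of a live API call. The rows and values in `mock_data` are made up for illustration. The assumed shape (a header row followed by data rows, with values as strings or null) mirrors how `data[1:]` and `data[0]` are used in the code above.

```python
import pandas as pd

# Hypothetical payload in the same shape as response.json():
# the first inner list holds column names, the rest are data rows.
mock_data = [
    ["DP02_0060PE", "DP03_0062E", "zip code tabulation area"],
    [None, None, "00601"],       # missing values arrive as null/None
    ["3.0", "52000", "10001"],   # numeric values arrive as strings
]

# Same construction as above: rows from data[1:], header from data[0]
df_mock = pd.DataFrame(mock_data[1:], columns=mock_data[0])

# Rename the coded columns to readable names
df_mock = df_mock.rename(columns={
    "DP02_0060PE": "Percent No High School (25+)",
    "DP03_0062E": "Median Household Income",
})

print(df_mock)
print(df_mock.dtypes)
```

Because the values are strings rather than numbers in this sketch, a later cleaning step would likely need to convert them (for example with `pd.to_numeric`) before any analysis.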

### NYC Open Data API
This data on graduation rate outcomes is taken from NYC Open Data. The code is adapted from the API documentation (https://dev.socrata.com/foundry/data.cityofnewyork.us/3vje-du8p). We use the Socrata class from the sodapy library to retrieve the data. The package converts the JSON response into a Python list of dictionaries, which we can easily convert to a pandas dataframe.

In [4]:
from sodapy import Socrata

# Unauthenticated client works with public data sets

client = Socrata("data.cityofnewyork.us", None)
results = client.get("mjm3-8dw8", limit=321002)

# convert to pandas df
results_df = pd.DataFrame.from_records(results)



In [32]:
#export to csv
results_df.to_csv('../../data/raw-data/dropout_data.csv')
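The conversion from a list of dictionaries to a dataframe can be sketched without calling the API. The records and field names below (`school_name`, `cohort_year`, `grad_rate`) are hypothetical placeholders, not the dataset's actual columns. The second record deliberately omits a field to show how a missing key is handled.

```python
import pandas as pd

# Hypothetical records in the list-of-dictionaries shape that client.get returns
mock_results = [
    {"school_name": "Example High School", "cohort_year": "2015", "grad_rate": "85.0"},
    {"school_name": "Sample Academy", "cohort_year": "2015"},  # no grad_rate key
]

# Each dictionary becomes a row; keys become column names
mock_df = pd.DataFrame.from_records(mock_results)

print(mock_df)
```

A key that is absent from a record shows up as a missing value in that row, so gaps in the source data carry over into the exported CSV.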

{{< include closing.qmd >}} 