# DATA EXTRACTION DOCUMENTATION

This notebook outlines how all of the different data sets were collected and added to the repository.

### EPA Data Extraction

The EPA publishes block-group-level data through its EJSCREEN tool. Data are collected from https://gaftp.epa.gov/EJSCREEN and stored as (geopandas) parquet files. This notebook will assist with extracting data that may be streamed, downloaded or cloned via Quilt or other data portals. The EPA EJSCREEN variable descriptions are available in an MD file in the repo titled "EPA EJSCREEN Variables."

In [31]:
# Import software

import pandas as pd
import geopandas as gpd 
import matplotlib.pyplot as plt 
import contextily as ctx
import numpy as np
import quilt3
from geopandas_view import view 

In [43]:
# Retrieve and download EPA EJ Screen data from UCR CGS Quilt Bucket in parquet form

b = quilt3.Bucket("s3://spatial-ucr")
b.fetch("epa/ejscreen/ejscreen_2020.parquet", "./EPAData/ejscreen_2020.parquet")

100%|██████████| 146M/146M [00:06<00:00, 22.1MB/s]

In [47]:
# Convert into CSV and add to Repository

ej = pd.read_parquet("./EPAData/ejscreen_2020.parquet")

ej.to_csv("./EPAData/ejscreen_2020.csv")
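
The fetch and conversion cells both depend on the same local folder name, so a small path helper can keep the two in step. The following is a minimal sketch, not part of the original notebook: `DATA_DIRS` and `local_path` are hypothetical names, and the folder names are assumed from the paths used in the cells.

```python
from pathlib import Path

# Hypothetical mapping of data source to local folder (names assumed from this notebook).
DATA_DIRS = {"epa": Path("EPAData"), "census": Path("CensusData")}


def local_path(source, filename, root=Path("."), create=False):
    """Return root/<source folder>/filename; optionally create the folder first."""
    folder = root / DATA_DIRS[source]
    if create:
        # Create the folder up front so a later download has somewhere to write.
        folder.mkdir(parents=True, exist_ok=True)
    return folder / filename


# Build the parquet path once and derive the CSV path from it.
parquet_path = local_path("epa", "ejscreen_2020.parquet")
csv_path = parquet_path.with_suffix(".csv")
print(parquet_path)
print(csv_path)
```

Deriving the CSV path from the parquet path means a folder rename only has to happen in one place.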

### Census Data Extraction

There are many Census data sets available; this notebook will assist with extracting data that may be streamed, downloaded or cloned via Quilt or other data portals. For this study's purposes, the ACS 5-Year Survey data may be the more up-to-date collection.

In [49]:
# Retrieve ACS Tract and Blockgroup data from UCR CGS Quilt Bucket

b = quilt3.Bucket("s3://spatial-ucr")

b.fetch("/census/acs/acs_2018_tract.parquet", "./CensusData/acs_2018_tract.parquet")
b.fetch("/census/acs/acs_2018_bg.parquet", "./CensusData/acs_2018_bg.parquet")

100%|██████████| 575M/575M [00:06<00:00, 87.4MB/s]
100%|██████████| 921M/921M [00:09<00:00, 101MB/s]

In [50]:
# Convert into CSVs and add to Repository

acs_tract = pd.read_parquet("./CensusData/acs_2018_tract.parquet")
acs_tract.to_csv("./CensusData/acs_2018_tract.csv")

acs_bg = pd.read_parquet("./CensusData/acs_2018_bg.parquet")
acs_bg.to_csv("./CensusData/acs_2018_bg.csv")
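
Two details of the CSV export may be worth checking. By default `to_csv` writes the DataFrame index as an unnamed first column, and if the parquet file stores geometry as binary (as GeoParquet does with WKB), a plain pandas read leaves that column as raw bytes, which is of little use in a CSV. The sketch below illustrates both points on a hypothetical two-row table; the `geoid` column and all values are made up for illustration and do not come from the ACS files.

```python
import io

import pandas as pd

# Hypothetical stand-in for a table loaded with pd.read_parquet (values are invented).
df = pd.DataFrame(
    {
        "geoid": ["060650301011", "060650301012"],
        "n_total_pop": [1200, 950],
        "geometry": [b"\x01", b"\x01"],  # placeholder bytes, not real WKB
    }
)

# Drop the binary geometry column and skip the index so the CSV holds only attributes.
buffer = io.StringIO()
df.drop(columns=["geometry"]).to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

If the geometry is needed downstream, reading the parquet file with `geopandas.read_parquet` instead keeps it as a proper geometry column.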

In [52]:
# Iterate the ACS columns to list the variable names; these list the demographic characteristics 
for col in acs_bg.columns:
    print(col)

n_persons_under_18
n_persons_over_60
n_persons_over_75
n_persons_over_15
n_married
n_widowed_divorced
n_total_families
n_female_headed_families
n_nonhisp_white_persons
n_nonhisp_black_persons
n_hispanic_persons
n_native_persons
n_hawaiian_persons
n_asian_indian_persons
n_asian_persons
n_veterans
median_household_income
n_total_households
per_capita_income
n_poverty_families_children
n_total_pop
p_persons_under_18
p_persons_over_60
p_persons_over_75
p_married
p_widowed_divorced
p_female_headed_families
p_nonhisp_white_persons
p_nonhisp_black_persons
p_hispanic_persons
p_native_persons
p_asian_persons
p_hawaiian_persons
p_asian_indian_persons
p_veterans
geometry
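
The printed names appear to follow a prefix convention. As a minimal sketch, assuming `n_` marks counts and `p_` marks shares (an inference from the names, not something the source documents), the columns can be grouped like this. `split_by_prefix` is a hypothetical helper, and the sample list is a subset of the output above.

```python
def split_by_prefix(columns):
    """Group column names into n_ columns, p_ columns, and everything else."""
    counts = [c for c in columns if c.startswith("n_")]
    shares = [c for c in columns if c.startswith("p_")]
    other = [c for c in columns if not c.startswith(("n_", "p_"))]
    return counts, shares, other


# A few names taken from the printed column list.
sample = [
    "n_total_pop",
    "n_veterans",
    "p_veterans",
    "median_household_income",
    "per_capita_income",
    "geometry",
]

counts, shares, other = split_by_prefix(sample)
print(counts)
print(shares)
print(other)
```

In the notebook, the same helper could be called as `split_by_prefix(acs_bg.columns)`.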
