# Cleaning data for visualization
To save space, the data sets are not committed to the repository. To run this notebook, download each data set from the links below and store it locally in the `data` subdirectory of the `propublica_hotspots` directory:
- [ProPublica: Toxic Air Pollution Hot Spots](https://www.propublica.org/datastore/dataset/toxic-air-pollution-hot-spots)
- [2020 DEC Redistricting Data Race Table, geographically filtered by Place](https://data.census.gov/cedsci/table?g=0100000US%241600000&y=2020&d=DEC%20Redistricting%20Data%20%28PL%2094-171%29&tid=DECENNIALPL2020.P1)
- [2010 DEC Redistricting Data Race Table, geographically filtered by Block](https://data.census.gov/cedsci/table?g=0400000US01%241000000,02%241000000,04%241000000,05%241000000,06%241000000,08%241000000,09%241000000,10%241000000,11%241000000,12%241000000,13%241000000,15%241000000,16%241000000,17%241000000,18%241000000,19%241000000,20%241000000,21%241000000,22%241000000,23%241000000,24%241000000,25%241000000,26%241000000,27%241000000,28%241000000,29%241000000,30%241000000,31%241000000,32%241000000,33%241000000,34%241000000,35%241000000,36%241000000,37%241000000,38%241000000,39%241000000,40%241000000,41%241000000,42%241000000,44%241000000,45%241000000,46%241000000,47%241000000,48%241000000,49%241000000,50%241000000,51%241000000,53%241000000,54%241000000,55%241000000,56%241000000&d=DEC%20Redistricting%20Data%20%28PL%2094-171%29&tid=DECENNIALPL2010.P1)

Resources:
- [EPA's explanation of the components of the Risk-Screening Environmental Indicators (RSEI) Model](https://www.epa.gov/rsei/ways-get-rsei-results): EPA data set that ProPublica used to form its grids.
- [EPA's Risk-Screening Environmental Indicators (RSEI) Methodology](https://www.epa.gov/sites/default/files/2020-02/documents/rsei_methodology_v2.3.8.pdf): documentation for the RSEI data set, useful for understanding how grids are represented.
- [RSEI data dictionary for crosswalk data set](https://www.epa.gov/rsei/rsei-data-dictionary-census-crosswalks)
- [ProPublica: The Most Detailed Map of Cancer-Causing Industrial Air Pollution in the U.S.](https://projects.propublica.org/toxmap/): the original visualization created using the data in this project, and a good way to explore the data and become familiar with it.


In [1]:
import pandas as pd
import json
import numpy as np

## Loading pollution data

In [2]:
# Using a context manager so the file handle is closed after loading
with open("../data/toxmaps_files_2022-03-15b/toxmaps_files_for_data_store/hotspot_perimeters_for_data_store.geojson") as f:
    data = json.load(f)
data = data["features"]
pollution_hotspot_df = pd.DataFrame([d["properties"] for d in data])

# Scaling by 10,000 so that a value of 1.0 equals the EPA acceptable risk level of 1 in 10,000
pollution_hotspot_df["avg_ilcr"] = pollution_hotspot_df["avg_ilcr"] * 10000
pollution_hotspot_df["max_ilcr"] = pollution_hotspot_df["max_ilcr"] * 10000

pollution_hotspot_df


Unnamed: 0,id,place,state,avg_ilcr,max_ilcr,area,pop
0,24228,Corpus Christi,Texas,0.294000,3.403039,0.006039,57150
1,27627,Catoosa,Oklahoma,0.213759,0.472415,0.000526,68
2,38309,Chicago,Illinois,0.252534,0.252534,0.000071,388
3,27293,Attica,Indiana,0.280931,0.280931,0.000069,5
4,1180,Watertown,Wisconsin,0.313408,0.418059,0.000145,872
...,...,...,...,...,...,...,...
1356,38536,Monaca,Pennsylvania,0.183536,0.183536,0.000070,0
1357,8056,Prairie du Sac,Wisconsin,0.178760,0.178760,0.000073,280
1358,209,Sheboygan,Wisconsin,0.175148,0.175148,0.000073,210
1359,9940,Hooksett,New Hampshire,0.186700,0.186700,0.000072,36
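
As a quick illustration of the scaling applied above: multiplying an ILCR by 10,000 re-expresses it as a multiple of the 1-in-10,000 level, so a scaled value of 1.0 sits exactly at that level. This is only a sketch; the helper name `scale_ilcr` and the raw values are made up for illustration and are not taken from the data set.

```python
import pandas as pd

ACCEPTABLE_RISK = 1 / 10_000  # the 1-in-10,000 level used in this notebook

def scale_ilcr(raw_ilcr: pd.Series) -> pd.Series:
    """Express raw ILCR values as multiples of the acceptable-risk level."""
    return raw_ilcr / ACCEPTABLE_RISK

# Hypothetical raw ILCR values (not from the ProPublica data)
raw = pd.Series([0.00005, 0.0001, 0.0003])
scaled = scale_ilcr(raw)

# Values above 1.0 exceed the acceptable-risk level
print(scaled.tolist())
print((scaled > 1).sum(), "value(s) above the acceptable-risk level")
```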


## Loading census data

In [3]:
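# Place-type suffixes that follow the place name in the census data.
# These are plain strings (no regex escapes) because they are matched with `in` and str.split.
name_keys = ["city", "CDP", "town", "municipality", "borough", "village", "corporation", "unified government", "urban county", "consolidated government", "consolidated government (balance)", "metropolitan government (balance)", "metro government (balance)", "(balance)"]

def extract_city_name(x):
    """Return the place name with its trailing place-type suffix removed."""
    for key in name_keys:
        # Require a leading space so that a key inside the name itself
        # (e.g. "town" in "Watertown village") is not treated as the suffix
        suffix = " " + key
        if suffix in x:
            return x.split(suffix)[0]

    # No known suffix found: return the name unchanged
    return x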
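
Place names can themselves contain a suffix word (for example, "Watertown village" contains "town"), so it helps to match a suffix only when it follows whitespace. The following is a minimal alternative sketch using a regular expression. The pattern covers only an illustrative subset of the suffixes, and `strip_place_type` is a hypothetical name rather than part of this notebook.

```python
import re

# Illustrative subset of place-type suffixes; the full list would be longer.
# The suffix must follow whitespace, and everything after it is dropped.
SUFFIX_PATTERN = re.compile(r"\s+(?:city|CDP|town|village|borough|municipality)\b.*$")

def strip_place_type(name: str) -> str:
    """Remove the first whitespace-delimited place-type suffix and anything after it."""
    return SUFFIX_PATTERN.sub("", name)

for name in ["Corpus Christi city", "Watertown village", "Woods Landing-Jelm CDP", "Hooksett"]:
    print(repr(name), "->", repr(strip_place_type(name)))
```

Because the pattern cuts at the first suffix word it finds, a name that contains a lowercase suffix word in the middle would be truncated early, so the result is worth spot-checking against the real data.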

In [4]:
raw_census_df = pd.read_csv("../data/DECENNIALPL2020.P1_2022-07-19T162239/DECENNIALPL2020.P1_data_with_overlays_2022-04-27T142004.csv", skiprows=1)

In [5]:
# Working on a copy so that raw_census_df is left unmodified
census_df = raw_census_df.copy()
census_df.columns = census_df.columns.str.strip()

# Filtering out unneeded columns
cols = [col for col in census_df.columns if "!!Total:!!Population of one race:!!" in col or col in ["id", "Geographic Area Name", "!!Total:"]]
census_df = census_df.loc[:, cols]

# Renaming columns
census_df.columns = [s.replace("!!Total:!!Population of one race:!!", "") for s in census_df.columns]
census_df = census_df.rename(columns={"!!Total:": "Total"})

# Splitting geographic area name into city and state (separated by comma)
location_name_split = census_df["Geographic Area Name"].str.split(",", expand=True)
census_df = census_df.drop("Geographic Area Name", axis=1)
census_df["City"] = location_name_split[0]
census_df["State"] = location_name_split[1].str.strip()

# Filling in the state of places where the county is also stored in the geographic area name
census_df.loc[~location_name_split[2].isna(), "State"] = location_name_split[2].str.strip()

# Filtering out Puerto Rico (copying so the column assignment below acts on an independent DataFrame)
census_df = census_df.loc[census_df["State"] != "Puerto Rico"].copy()

# Extracting city names without place type indicator
census_df["City"] = census_df["City"].apply(extract_city_name)

census_df

Unnamed: 0,id,Total,White alone,Black or African American alone,American Indian and Alaska Native alone,Asian alone,Native Hawaiian and Other Pacific Islander alone,Some Other Race alone,City,State
0,1600000US0100100,133,95,34,0,0,0,3,Abanda,Alabama
1,1600000US0100124,2358,1165,1039,5,15,0,51,Abbeville,Alabama
2,1600000US0100460,4366,1741,2313,31,10,3,123,Adamsville,Alabama
3,1600000US0100484,659,624,0,9,2,1,7,Addison,Alabama
4,1600000US0100676,225,19,199,0,2,0,0,Akron,Alabama
...,...,...,...,...,...,...,...,...,...,...
31612,1600000US5684852,118,105,1,0,2,0,0,Woods Landing-Jelm,Wyoming
31613,1600000US5684925,4773,3944,6,66,22,3,313,Worland,Wyoming
31614,1600000US5685015,1644,1447,0,18,0,0,86,Wright,Wyoming
31615,1600000US5686665,131,123,0,0,0,0,0,Yoder,Wyoming
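
The handling of two-part and three-part area names above can also be expressed by always taking the last comma-separated piece as the state. This is a sketch on made-up input: the second name is a hypothetical placeholder for the "place, county, state" case, not a real census entry.

```python
import pandas as pd

names = pd.Series([
    "Abanda CDP, Alabama",
    "Example town, Example County, Example State",  # hypothetical three-part name
])

# Split on commas; the first piece is the place, the last piece is the state
parts = names.str.split(",")
city = parts.str[0].str.strip()
state = parts.str[-1].str.strip()

print(pd.DataFrame({"City": city, "State": state}))
```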


## Merging data sets

In [6]:
df = census_df.merge(pollution_hotspot_df, how="left", left_on=["City", "State"], right_on=["place", "state"])
df

Unnamed: 0,id_x,Total,White alone,Black or African American alone,American Indian and Alaska Native alone,Asian alone,Native Hawaiian and Other Pacific Islander alone,Some Other Race alone,City,State,id_y,place,state,avg_ilcr,max_ilcr,area,pop
0,1600000US0100100,133,95,34,0,0,0,3,Abanda,Alabama,,,,,,,
1,1600000US0100124,2358,1165,1039,5,15,0,51,Abbeville,Alabama,,,,,,,
2,1600000US0100460,4366,1741,2313,31,10,3,123,Adamsville,Alabama,,,,,,,
3,1600000US0100484,659,624,0,9,2,1,7,Addison,Alabama,,,,,,,
4,1600000US0100676,225,19,199,0,2,0,0,Akron,Alabama,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31870,1600000US5684852,118,105,1,0,2,0,0,Woods Landing-Jelm,Wyoming,,,,,,,
31871,1600000US5684925,4773,3944,6,66,22,3,313,Worland,Wyoming,,,,,,,
31872,1600000US5685015,1644,1447,0,18,0,0,86,Wright,Wyoming,,,,,,,
31873,1600000US5686665,131,123,0,0,0,0,0,Yoder,Wyoming,,,,,,,
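
The merged frame has more rows (31,874) than the census frame (31,616), which is what a left join produces when one place matches more than one hotspot. A small sketch of how `merge(..., indicator=True)` can be used to count matched and unmatched rows follows. The toy frames are illustrative: the Chicago total and the ILCR values are made up.

```python
import pandas as pd

# Toy stand-ins for census_df and pollution_hotspot_df (values are illustrative)
census = pd.DataFrame({
    "City": ["Abanda", "Chicago"],
    "State": ["Alabama", "Illinois"],
    "Total": [133, 1000],
})
hotspots = pd.DataFrame({
    "place": ["Chicago", "Chicago"],
    "state": ["Illinois", "Illinois"],
    "avg_ilcr": [0.25, 0.31],
})

merged = census.merge(
    hotspots,
    how="left",
    left_on=["City", "State"],
    right_on=["place", "state"],
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

print(merged["_merge"].value_counts())
print(len(census), "census rows ->", len(merged), "merged rows")
```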


## Block data retrieval

In [7]:
# Using Hawaii as example/proof of concept
hi_blocks_raw = pd.read_csv("../data/HI/DECENNIALPL2010.P1_2022-07-22T131852/DECENNIALPL2010.P1_data_with_overlays_2022-07-22T131846.csv", skiprows=1)
hi_blocks_raw



Unnamed: 0,id,Total,Total!!Population of one race,Total!!Population of one race!!White alone,Total!!Population of one race!!Black or African American alone,Total!!Population of one race!!American Indian and Alaska Native alone,Total!!Population of one race!!Asian alone,Total!!Population of one race!!Native Hawaiian and Other Pacific Islander alone,Total!!Population of one race!!Some Other Race alone,Total!!Two or More Races,...,Total!!Two or More Races!!Population of five races,Total!!Two or More Races!!Population of five races!!White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander,Total!!Two or More Races!!Population of five races!!White; Black or African American; American Indian and Alaska Native; Asian; Some Other Race,Total!!Two or More Races!!Population of five races!!White; Black or African American; American Indian and Alaska Native; Native Hawaiian and Other Pacific Islander; Some Other Race,Total!!Two or More Races!!Population of five races!!White; Black or African American; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total!!Two or More Races!!Population of five races!!White; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total!!Two or More Races!!Population of five races!!Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total!!Two or More Races!!Population of six races,Total!!Two or More Races!!Population of six races!!White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Geographic Area Name
0,1000000US150010201001000,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Block 1000, Block Group 1, Census Tract 201, H..."
1,1000000US150010201001001,123,95,62,0,1,27,4,1,28,...,0,0,0,0,0,0,0,0,0,"Block 1001, Block Group 1, Census Tract 201, H..."
2,1000000US150010201001002,16,12,9,0,0,1,2,0,4,...,0,0,0,0,0,0,0,0,0,"Block 1002, Block Group 1, Census Tract 201, H..."
3,1000000US150010201001003,11,9,4,0,0,4,0,1,2,...,0,0,0,0,0,0,0,0,0,"Block 1003, Block Group 1, Census Tract 201, H..."
4,1000000US150010201001004,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Block 1004, Block Group 1, Census Tract 201, H..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25011,1000000US150099912000002,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Block 0002, Block Group 0, Census Tract 9912, ..."
25012,1000000US150099912000003,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Block 0003, Block Group 0, Census Tract 9912, ..."
25013,1000000US150099912000004,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Block 0004, Block Group 0, Census Tract 9912, ..."
25014,1000000US150099912000005,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Block 0005, Block Group 0, Census Tract 9912, ..."


In [8]:
def clean_blocks(raw_blocks):
    # Working on a copy so the raw DataFrame is left unmodified
    blocks = raw_blocks.copy()

    # Cleaning "Total": keep only the number before any "(" annotation, then convert to int
    blocks["Total"] = blocks["Total"].astype(str).str.split("(").str[0].astype(int)

    # Filtering out unneeded columns
    cols = [col for col in blocks.columns if "Total!!Population of one race!!" in col or col in ["id", "Total"]]
    blocks = blocks.loc[:, cols]

    # Renaming columns (dropping the shared prefix from the race columns)
    blocks.columns = [s.replace("Total!!Population of one race!!", "") for s in blocks.columns]

    # Removing rows with no residents (copying so the assignments below act on an independent DataFrame)
    blocks = blocks.loc[blocks["Total"] > 0].copy()

    # Formatting ID for use in crosswalk: strip the geography prefix and convert to a number
    blocks["id"] = blocks["id"].str.replace("1000000US", "", regex=False)
    blocks["id"] = pd.to_numeric(blocks["id"])

    return blocks

In [9]:
hi_blocks = clean_blocks(hi_blocks_raw)
hi_blocks

Unnamed: 0,id,Total,White alone,Black or African American alone,American Indian and Alaska Native alone,Asian alone,Native Hawaiian and Other Pacific Islander alone,Some Other Race alone
0,150010201001000,1,1,0,0,0,0,0
1,150010201001001,123,62,0,1,27,4,1
2,150010201001002,16,9,0,0,1,2,0
3,150010201001003,11,4,0,0,4,0,1
4,150010201001004,1,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...
24917,150090320002118,4,4,0,0,0,0,0
24918,150090320002119,4,4,0,0,0,0,0
24919,150090320002120,2,2,0,0,0,0,0
24930,150090320002131,3,2,0,0,0,0,0
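
To make the two cleaning steps concrete, here is a sketch on a two-row toy frame. The block IDs are taken from the output above, but the `"123(r456)"` value is an assumed example of an annotated total; the exact annotation format in the census file may differ.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["1000000US150010201001000", "1000000US150010201001001"],
    "Total": [1, "123(r456)"],  # the "(r456)" annotation is assumed for illustration
})

# Keep only the number before any "(" and convert to int
toy["Total"] = toy["Total"].astype(str).str.split("(").str[0].astype(int)

# Strip the geography prefix so the ID matches the crosswalk's numeric BlockID00
toy["id"] = pd.to_numeric(toy["id"].str.replace("1000000US", "", regex=False))

print(toy)
```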


## Crosswalk retrieval
The crosswalk is the file that the EPA uses to match census blocks to the cells of the grid on which it reports pollution levels.

In [10]:
hi_x_walk_raw = pd.read_csv("../data/HI/CensusBlock2010_Hawaii_810m.csv")
hi_x_walk_raw

Unnamed: 0,GridID,X,Y,BlockID00,UR,PCT_B_C,PCT_C_B,PCT_CP_B
0,34,254,-203,150010212022293,Z,0.547465,0.028747,0.000000
1,34,254,-202,150010212022293,Z,0.451824,0.023725,0.000000
2,34,257,-203,150010212022359,Z,0.649448,0.145474,0.000000
3,34,257,-202,150010212022359,Z,0.350130,0.078428,0.000000
4,34,258,-202,150010212022359,Z,0.000463,0.000104,0.000000
...,...,...,...,...,...,...,...,...
109999,34,115,58,150090315023000,Z,0.242369,0.068603,0.087691
110000,34,135,48,150090308002023,Z,0.797387,0.097005,0.000000
110001,34,136,48,150090308002023,Z,0.192064,0.023365,0.000000
110002,34,136,49,150090308002023,Z,0.010631,0.001293,0.000000


In [11]:
def add_race_to_crosswalk(raw_crosswalk, blocks_df):
    # merge() below returns a new DataFrame, so the raw crosswalk is left unmodified
    crosswalk = raw_crosswalk
    
    # Getting the total and white-identifying population of each block
    crosswalk = crosswalk.merge(blocks_df.loc[:, ["id", "Total", "White alone"]], left_on="BlockID00", right_on="id")
    crosswalk = crosswalk.drop(columns="id")
    crosswalk = crosswalk.rename(columns={"White alone": "White in block", "Total": "Block total"})

    # Apportioning block counts to each grid cell by PCT_B_C, the share of the block that falls in that cell
    crosswalk["Number of white from block"] = crosswalk["White in block"] * crosswalk["PCT_B_C"]
    crosswalk["Total from block"] = crosswalk["Block total"] * crosswalk["PCT_B_C"]

    return crosswalk

In [12]:
add_race_to_crosswalk(hi_x_walk_raw, hi_blocks)

Unnamed: 0,GridID,X,Y,BlockID00,UR,PCT_B_C,PCT_C_B,PCT_CP_B,Block total,White in block,Number of white from block,Total from block
0,34,243,-199,150010212022003,Z,0.000027,0.007577,1.000095,91,49,0.001323,0.002457
1,34,243,-198,150010212022003,Z,0.000180,0.050827,1.000095,91,49,0.008820,0.016380
2,34,243,-188,150010212022003,Z,0.000013,0.003720,1.000095,91,49,0.000637,0.001183
3,34,244,-201,150010212022003,Z,0.000551,0.155338,1.000095,91,49,0.026999,0.050141
4,34,244,-200,150010212022003,Z,0.001667,0.470405,1.000095,91,49,0.081683,0.151697
...,...,...,...,...,...,...,...,...,...,...,...,...
41435,34,157,23,150090303011087,Z,1.000000,0.019423,0.018935,3,1,1.000000,3.000000
41436,34,114,57,150090315023000,Z,0.301733,0.085406,0.184975,483,317,95.649361,145.737039
41437,34,114,58,150090315023000,Z,0.455723,0.128993,0.794152,483,317,144.464191,220.114209
41438,34,115,58,150090315023000,Z,0.242369,0.068603,0.087691,483,317,76.830973,117.064227
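
The last three rows of the output above (block 150090315023000) make a convenient worked example of the apportionment. The block has 483 residents, 317 of them white-identifying, and is split across three grid cells. The sketch below simply repeats the multiplication with the rounded values shown in the table.

```python
# Values copied from the last three rows of the output above
block_total, block_white = 483, 317
pct_b_c = [0.301733, 0.455723, 0.242369]  # share of the block in each grid cell

for share in pct_b_c:
    total_from_block = block_total * share
    white_from_block = block_white * share
    print(f"share={share:.6f}  total={total_from_block:.3f}  white={white_from_block:.3f}")

# The shares for a single block add up to roughly 1, as expected if PCT_B_C
# is the fraction of the block that lies in each cell
print(round(sum(pct_b_c), 6))
```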


## Creating DataFrame for grid layout
Each row corresponds to one cell of the grid on which the EPA reports pollution levels.

In [13]:
def create_grid_from_crosswalk(crosswalk: pd.DataFrame) -> pd.DataFrame:
    # One row per unique (X, Y) grid cell found in the crosswalk
    grid = pd.DataFrame(columns=["X", "Y"])

    # Get list of unique cell coordinates from crosswalk
    unique_cell_points = crosswalk[["X", "Y"]].value_counts().reset_index(name="count")

    grid["X"] = unique_cell_points["X"]
    grid["Y"] = unique_cell_points["Y"]
    
    return grid

In [14]:
create_grid_from_crosswalk(hi_x_walk_raw)

Unnamed: 0,X,Y
0,-34,100
1,-32,99
2,-38,102
3,-51,106
4,-30,98
...,...,...
40352,219,-90
40353,219,-89
40354,219,-86
40355,219,-77


In [15]:
def create_race_grid(raw_crosswalk: pd.DataFrame, raw_blocks: pd.DataFrame) -> pd.DataFrame:
    cleaned_blocks = clean_blocks(raw_blocks)

    # Retrieve crosswalk with the proportion of total and white residents contributed by the block to the grid space
    crosswalk_with_race = add_race_to_crosswalk(raw_crosswalk, cleaned_blocks)

    grid = create_grid_from_crosswalk(crosswalk_with_race)

    # Sum the apportioned counts over all blocks that overlap each grid cell
    # (selecting the two count columns first so that non-numeric columns such as "UR" are not summed)
    count_cols = ["Number of white from block", "Total from block"]
    grid_race_counts = crosswalk_with_race.groupby(["X", "Y"])[count_cols].sum().reset_index()
    grid = grid.merge(grid_race_counts, on=["X", "Y"])

    grid = grid.rename(columns={"Number of white from block": "White alone", "Total from block": "Total"})

    return grid

In [16]:
grid = create_race_grid(hi_x_walk_raw, hi_blocks_raw)
grid

Unnamed: 0,X,Y,White alone,Total
0,-34,100,773.159772,5913.056076
1,-32,98,733.214695,4572.439720
2,-30,98,337.111481,2610.817266
3,-35,100,883.245731,7286.946884
4,-47,112,122.473944,2048.070305
...,...,...,...,...
15785,225,-172,2.271204,4.309464
15786,225,-173,2.271204,4.309464
15787,225,-174,2.271204,4.309464
15788,225,-176,0.253070,0.253070
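
The aggregation step can be seen on a toy crosswalk: each grid cell's count is the sum of the apportioned contributions from every block that overlaps it. The frame below is made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy crosswalk: cell (0, 0) receives people from blocks 1 and 2, cell (1, 0) only from block 2
toy_crosswalk = pd.DataFrame({
    "X": [0, 0, 1],
    "Y": [0, 0, 0],
    "BlockID00": [1, 2, 2],
    "Total from block": [10.0, 4.0, 6.0],
    "Number of white from block": [5.0, 1.0, 1.5],
})

count_cols = ["Number of white from block", "Total from block"]
toy_grid = toy_crosswalk.groupby(["X", "Y"])[count_cols].sum().reset_index()
toy_grid = toy_grid.rename(columns={"Number of white from block": "White alone", "Total from block": "Total"})

print(toy_grid)
```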


## Retrieving pollution data by grid space

In [17]:
# Using a context manager so the file handle is closed after loading
with open("../data/toxmaps_files_2022-03-15b/toxmaps_files_for_data_store/hotspot_gridsquares_for_data_store.geojson") as f:
    data = json.load(f)
data = data["features"]

rows = []

for d in data:
    x_centerpoint = sum([d["geometry"]["coordinates"][0][i][0] for i in range(4)]) / 4
    y_centerpoint = sum([d["geometry"]["coordinates"][0][i][1] for i in range(4)]) / 4

    centerpoint = (x_centerpoint, y_centerpoint)

    row = d["properties"]
    row["centerpoint"] = centerpoint

    rows.append(row)

pollution_gridspace_df = pd.DataFrame(rows)
pollution_gridspace_df = pollution_gridspace_df.rename(columns={"x": "X", "y": "Y"})

# Scaling by 10,000 so that a value of 1.0 equals the EPA acceptable risk level of 1 in 10,000
pollution_gridspace_df["ilcr"] = pollution_gridspace_df["ilcr"] * 10000

pollution_gridspace_df

Unnamed: 0,X,Y,gridcode,pop,ilcr,ilcr_2014,ilcr_2015,ilcr_2016,ilcr_2017,ilcr_2018,cluster_id,centerpoint
0,251,966,14,38,0.420968,4.790455e-05,3.769349e-05,6.004251e-05,0.000032,0.000033,26,"(-93.8879279214834, 30.097545326241)"
1,91,892,14,1926,0.121234,1.414638e-05,1.198967e-05,1.055006e-05,0.000010,0.000014,6,"(-95.23914509756315, 29.574912519332347)"
2,263,985,14,11,0.161599,1.559376e-05,1.456587e-05,2.335146e-05,0.000013,0.000014,26,"(-93.78325946023887, 30.234148583816776)"
3,648,977,14,338,2.323963,2.435303e-04,2.340715e-04,3.008869e-04,0.000257,0.000127,2,"(-90.54709637204613, 30.062467554026572)"
4,261,979,14,47,0.244414,2.325287e-05,2.111681e-05,3.412669e-05,0.000021,0.000022,26,"(-93.80128060364268, 30.190724935572675)"
...,...,...,...,...,...,...,...,...,...,...,...,...
41183,-113,750,14,0,0.175297,9.806828e-06,9.468954e-06,7.979511e-06,0.000012,0.000048,8087,"(-96.93312965135505, 28.53472346994222)"
41184,241,911,14,0,0.116320,1.193648e-05,1.036187e-05,1.696981e-05,0.000009,0.000009,26,"(-93.98187490836554, 29.697678817620076)"
41185,233,978,14,48,0.240806,2.288500e-05,2.135100e-05,3.396695e-05,0.000021,0.000022,26,"(-94.03726140364626, 30.187903120807775)"
41186,1094,1976,14,135,0.155266,1.076361e-07,2.090732e-05,2.132226e-05,0.000021,0.000014,680,"(-85.92636006604992, 37.02493370273622)"
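
The centerpoint above is the mean of the first four vertices of each polygon's outer ring. GeoJSON polygon rings are closed, meaning the first position is repeated at the end, so a square has five positions and the first four are its distinct corners. A small sketch of the same idea that drops the closing vertex explicitly follows; `ring_centerpoint` is a hypothetical helper name, and the mean of the corners is the true center only for shapes such as rectangles.

```python
def ring_centerpoint(ring):
    """Mean of the distinct corner positions of a closed GeoJSON linear ring."""
    # Drop the repeated closing position if the ring is closed
    corners = ring[:-1] if ring[0] == ring[-1] else ring
    xs = [point[0] for point in corners]
    ys = [point[1] for point in corners]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# A 2 x 2 square given as a closed ring (made-up coordinates)
square = [[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0], [0.0, 0.0]]
print(ring_centerpoint(square))
```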


## Alabama case study

In [18]:
al_blocks_raw = pd.read_csv("../data/AL/DECENNIALPL2010.P1_data_with_overlays_2022-07-26T154032.csv", skiprows=1)
al_blocks_raw



Unnamed: 0,id,Total,Total!!Population of one race,Total!!Population of one race!!White alone,Total!!Population of one race!!Black or African American alone,Total!!Population of one race!!American Indian and Alaska Native alone,Total!!Population of one race!!Asian alone,Total!!Population of one race!!Native Hawaiian and Other Pacific Islander alone,Total!!Population of one race!!Some Other Race alone,Total!!Two or More Races,...,Total!!Two or More Races!!Population of five races,Total!!Two or More Races!!Population of five races!!White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander,Total!!Two or More Races!!Population of five races!!White; Black or African American; American Indian and Alaska Native; Asian; Some Other Race,Total!!Two or More Races!!Population of five races!!White; Black or African American; American Indian and Alaska Native; Native Hawaiian and Other Pacific Islander; Some Other Race,Total!!Two or More Races!!Population of five races!!White; Black or African American; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total!!Two or More Races!!Population of five races!!White; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total!!Two or More Races!!Population of five races!!Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total!!Two or More Races!!Population of six races,Total!!Two or More Races!!Population of six races!!White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Geographic Area Name
0,1000000US010010201001000,61,60,55,4,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,"Block 1000, Block Group 1, Census Tract 201, A..."
1,1000000US010010201001001,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Block 1001, Block Group 1, Census Tract 201, A..."
2,1000000US010010201001002,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Block 1002, Block Group 1, Census Tract 201, A..."
3,1000000US010010201001003,75,74,66,4,0,0,0,4,1,...,0,0,0,0,0,0,0,0,0,"Block 1003, Block Group 1, Census Tract 201, A..."
4,1000000US010010201001004,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Block 1004, Block Group 1, Census Tract 201, A..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252261,1000000US011339659003101,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Block 3101, Block Group 3, Census Tract 9659, ..."
252262,1000000US011339659003102,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Block 3102, Block Group 3, Census Tract 9659, ..."
252263,1000000US011339659003103,3,3,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Block 3103, Block Group 3, Census Tract 9659, ..."
252264,1000000US011339659003104,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Block 3104, Block Group 3, Census Tract 9659, ..."


In [19]:
cont_x_walk_raw = pd.read_csv("../data/CensusBlock2010_ConUS_810m.csv")
cont_x_walk_raw

Unnamed: 0,GridID,X,Y,BlockID00,UR,PCT_B_C,PCT_C_B,PCT_CP_B
0,14,-2126,1523,40270111071059,Z,0.006029,0.000754,0.000000
1,14,-2125,1523,40270111071059,Z,0.993539,0.124338,0.000000
2,14,-2125,1522,40270111072002,Z,0.431253,0.219350,0.000000
3,14,-2125,1523,40270111072002,Z,0.441065,0.224341,0.000000
4,14,-2124,1522,40270111072002,Z,0.002394,0.001218,0.000000
...,...,...,...,...,...,...,...,...
41612389,14,-866,2897,560459511001785,Z,0.029898,0.075755,0.000000
41612390,14,-810,2896,560459513003023,Z,1.000000,0.000313,0.000000
41612391,14,-847,2936,560459511001065,Z,1.000000,0.009000,0.272847
41612392,14,-810,2897,560459513001015,Z,0.979167,0.002615,0.000000


In [20]:
al_grid = create_race_grid(cont_x_walk_raw, al_blocks_raw)
al_grid = al_grid.loc[al_grid["Total"] >= 1]
al_grid

Unnamed: 0,X,Y,White alone,Total
0,1035,1481,58.729173,1143.021901
1,1033,1481,82.730295,876.506583
2,1035,1478,46.142953,1445.064405
3,1055,1655,742.971717,823.838064
4,1034,1478,16.856246,1513.377625
...,...,...,...,...
192094,1025,1349,1.514943,3.534867
192096,1025,1350,1.514943,3.534867
192097,1025,1354,0.935606,1.202922
192098,1025,1355,0.935606,1.202922


**NOTE:** In the grid below, the "Total" field and the "pop" field are each supposed to represent the total number of people residing in the grid space. "Total" comes from apportioning the 2010 census onto the grid, and "pop" comes from the ProPublica data. The two values are of the same order of magnitude but are not equal. Page 13 of the [RSEI methodology document](https://www.epa.gov/sites/default/files/2020-02/documents/rsei_methodology_v2.3.8.pdf), which describes the source of ProPublica's grid pollution data, states:

> Population values for non-decennial years are estimated based on linear interpolations at
the block level between the 1990 and 2000 and between the 2000 and 2010 U.S. census
datasets, and on extrapolation back to 1988 and forward to 2018.

The documentation ProPublica provided does not make clear which year the "pop" field represents, but it likely relies on a similar extrapolation method (the ProPublica data dictionary also notes that the "pop" field at the grid-space level may be extrapolated). Because there is limited information on how to make the "Total" field more comparable to the "pop" field, the analysis proceeds despite this discrepancy.
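
To illustrate what linear interpolation and extrapolation between census years means in the quoted passage, here is a minimal sketch. It is not the EPA's implementation; the function name and the block populations are hypothetical.

```python
def interpolate_population(pop_start, pop_end, year_start, year_end, year):
    """Linearly interpolate (or extrapolate) a population to the given year."""
    slope = (pop_end - pop_start) / (year_end - year_start)
    return pop_start + slope * (year - year_start)

# Hypothetical block with 100 residents in 2000 and 120 residents in 2010
print(interpolate_population(100, 120, 2000, 2010, 2005))  # interpolation between censuses
print(interpolate_population(100, 120, 2000, 2010, 2018))  # extrapolation forward to 2018
```

Under this kind of estimate, a value for a year after 2010 will generally differ from the 2010 count itself, which is one plausible reason "pop" and "Total" do not match.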

In [21]:
al_grid = al_grid.merge(pollution_gridspace_df, on=["X", "Y"])
al_grid

Unnamed: 0,X,Y,White alone,Total,gridcode,pop,ilcr,ilcr_2014,ilcr_2015,ilcr_2016,ilcr_2017,ilcr_2018,cluster_id,centerpoint
0,1043,1489,13.955425,896.018766,14,648,0.101050,0.000012,0.000012,0.000013,7.494950e-06,6.224917e-06,417,"(-86.82441083440537, 33.55513399939133)"
1,1044,1486,119.830108,1461.063865,14,1130,0.234299,0.000035,0.000035,0.000036,6.846359e-06,4.849859e-06,417,"(-86.81821013562913, 33.532779492334804)"
2,1046,1487,65.023603,961.356534,14,660,0.966275,0.000151,0.000151,0.000152,1.882699e-05,1.040820e-05,417,"(-86.79988131536486, 33.53859727598715)"
3,1010,1631,307.518815,1102.090684,14,970,0.173141,0.000013,0.000014,0.000016,2.288464e-05,1.971056e-05,264,"(-86.99495961462304, 34.6023676682713)"
4,1048,1492,325.264265,1141.941042,14,1025,0.178581,0.000019,0.000017,0.000019,1.847088e-05,1.645158e-05,417,"(-86.77814619464759, 33.57328380811362)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1381,1007,1644,3.923892,8.937754,14,7,0.116650,0.000010,0.000010,0.000011,1.486110e-05,1.235139e-05,264,"(-87.01061674469655, 34.69814863536888)"
1382,1008,1634,9.313563,9.313563,14,15,0.565904,0.000041,0.000042,0.000056,7.365082e-05,6.922843e-05,264,"(-87.01016240869703, 34.62536720253925)"
1383,1008,1642,0.000000,4.332909,14,4,0.140085,0.000012,0.000012,0.000014,1.727422e-05,1.521399e-05,264,"(-87.00342820570859, 34.68304538442057)"
1384,1011,1641,7.678400,12.957300,14,18,0.116777,0.000010,0.000010,0.000013,1.289314e-05,1.246620e-05,264,"(-86.97765787792477, 34.67377931295585)"
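
The size of the discrepancy between "Total" and "pop" can be checked with a simple ratio. The sketch below uses the first five rows of the output above; on the full data the same two lines could be applied to `al_grid` directly.

```python
import pandas as pd

# "Total" and "pop" copied from the first five rows of the merged output above
sample = pd.DataFrame({
    "Total": [896.018766, 1461.063865, 961.356534, 1102.090684, 1141.941042],
    "pop": [648, 1130, 660, 970, 1025],
})

# Ratio above 1 means the census-derived estimate exceeds the ProPublica value
sample["ratio"] = sample["Total"] / sample["pop"]
print(sample.round(3))
```

For these five rows, "Total" exceeds "pop" by roughly 11% to 46%. Other rows in the output show "Total" below "pop", so the direction of the difference is not uniform.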


In [22]:
al_grid.to_csv("../data/al_grid.csv", index=False)