# Cleaning and Merging the Data

Open Streets data:

- Downloaded from 311 Open Data portal (Vector file of lines for each open street)
- Calculated mean coordinates for each line (i.e. each registered Open Street) in QGIS

Census data:

- From the American Community Survey (ACS)
- Table B19013 from the ACS 5-year estimates (2021), downloaded from Census Reporter

Census tract data for NYC:

- Downloaded [here](https://www.nyc.gov/site/planning/data-maps/open-data/census-download-metadata.page) via the NYC portal
- Originally a shapefile, then converted to CSV in QGIS


In [70]:
import pandas as pd

In [71]:
! ls

README.md                             geocode_cache
acs2021_5yr_B19013_14000US36047030600 mean_coords_table.tsv
acs5_B19113_2019_NY.csv               mean_coords_table_with_geos.csv
analysis.ipynb                        ny_census_tracts.csv
clean_merge.ipynb                     ny_census_tracts.qmd
edited_complaints.csv                 open_streets_pivot.csv
geocode.ipynb                         requirements.txt


Reading the ACS income data

In [72]:
income = pd.read_csv('./acs2021_5yr_B19013_14000US36047030600/acs2021_5yr_B19013_14000US36047030600.csv')

In [73]:
income

Unnamed: 0,geoid,name,B19013001,"B19013001, Error"
0,14000US36005000100,"Census Tract 1, Bronx, NY",,
1,14000US36005000200,"Census Tract 2, Bronx, NY",70867.0,25423.0
2,14000US36005000400,"Census Tract 4, Bronx, NY",98090.0,18180.0
3,14000US36005001600,"Census Tract 16, Bronx, NY",40033.0,9907.0
4,14000US36005001901,"Census Tract 19.01, Bronx, NY",55924.0,12028.0
...,...,...,...,...
2322,14000US36085030302,"Census Tract 303.02, Richmond, NY",85842.0,18154.0
2323,14000US36085031901,"Census Tract 319.01, Richmond, NY",,
2324,14000US36085031902,"Census Tract 319.02, Richmond, NY",76066.0,35257.0
2325,14000US36085032300,"Census Tract 323, Richmond, NY",86471.0,25095.0


In [74]:
# rename the estimate column

# the 3rd column of the file (B19013001) is Median Household Income in the Past 12 Months (In 2021 Inflation-adjusted Dollars)

income.rename(columns={'B19013001': 'median_household_income'}, inplace=True)

In [75]:
# keep only tracts with a reported income; .copy() makes the result an independent
# dataframe, so the column assignments below do not raise SettingWithCopyWarning
income = income[income['median_household_income'] > 0].copy()

In [76]:
# strip the '14000US' prefix so the id matches the 11-digit GEOID in the tract file
income['geoid'] = income['geoid'].str.replace('14000US', '')

In [77]:
# explicit 64-bit integer, matching the dtype of GEOID in the tract file
income['geoid'] = income['geoid'].astype('int64')
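
The identifier cleanup can be tried in isolation on a small sample. This is a minimal sketch, not the notebook's own code: the three rows mirror the first rows of the ACS table shown earlier, and the names `sample` and `clean` are hypothetical.

```python
import pandas as pd

# Hypothetical three-row sample in the same layout as the ACS export
sample = pd.DataFrame({
    "geoid": ["14000US36005000100", "14000US36005000200", "14000US36005000400"],
    "median_household_income": [None, 70867.0, 98090.0],
})

# Filter out tracts without an income estimate, then take an explicit copy
# so that assigning to a column does not raise SettingWithCopyWarning
clean = sample[sample["median_household_income"] > 0].copy()

# Drop the '14000US' prefix (literal match, not a regex) and store the
# remaining 11-digit id as a 64-bit integer
clean["geoid"] = (
    clean["geoid"].str.replace("14000US", "", regex=False).astype("int64")
)
```

Passing `regex=False` keeps the replacement a plain substring match regardless of the pandas version's default.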


Although all my data is at the tract level, the tract identifiers are formatted differently in each dataset. The cells below align them.

In [78]:
tracts = pd.read_csv('ny_census_tracts.csv')

In [79]:
tracts

Unnamed: 0,CTLabel,BoroCode,BoroName,CT2020,BoroCT2020,CDEligibil,NTAName,NTA2020,CDTA2020,CDTANAME,GEOID,Shape_Leng,Shape_Area
0,1.00,1,Manhattan,100,1000100,,The Battery-Governors Island-Ellis Island-Libe...,MN0191,MN01,MN01 Financial District-Tribeca (CD 1 Equivalent),36061000100,11023.048501,1.844421e+06
1,2.01,1,Manhattan,201,1000201,,Chinatown-Two Bridges,MN0301,MN03,MN03 Lower East Side-Chinatown (CD 3 Equivalent),36061000201,4754.495247,9.723121e+05
2,6.00,1,Manhattan,600,1000600,,Chinatown-Two Bridges,MN0301,MN03,MN03 Lower East Side-Chinatown (CD 3 Equivalent),36061000600,6976.286456,2.582705e+06
3,14.01,1,Manhattan,1401,1001401,,Lower East Side,MN0302,MN03,MN03 Lower East Side-Chinatown (CD 3 Equivalent),36061001401,5075.332000,1.006117e+06
4,14.02,1,Manhattan,1402,1001402,,Lower East Side,MN0302,MN03,MN03 Lower East Side-Chinatown (CD 3 Equivalent),36061001402,4459.156019,1.226206e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2320,77.00,5,Staten Island,7700,5007700,,St. George-New Brighton,SI0101,SI01,SI01 North Shore (CD 1 Equivalent),36085007700,7325.091410,2.674908e+06
2321,19.02,4,Queens,1902,4001902,,Long Island City-Hunters Point,QN0201,QN02,QN02 Long Island City-Sunnyside-Woodside (CD 2...,36081001902,5659.156615,1.909110e+06
2322,171.01,4,Queens,17101,4017101,,Sunnyside Yards (South),QN0261,QN02,QN02 Long Island City-Sunnyside-Woodside (CD 2...,36081017101,22732.905385,8.783519e+06
2323,475.00,4,Queens,47500,4047500,,Elmhurst,QN0401,QN04,QN04 Elmhurst-Corona (CD 4 Approximation),36081047500,8890.142310,3.028836e+06


In [80]:
# only keep the columns we need; .copy() avoids SettingWithCopyWarning on the rename below

tracts = tracts[['CTLabel', 'CT2020', 'GEOID']].copy()

In [81]:
tracts.rename(columns={'GEOID': 'geoid'}, inplace=True)


In [82]:
# find column types

tracts.dtypes

CTLabel    float64
CT2020       int64
geoid        int64
dtype: object

In [83]:
# join tracts to income on geoid using an outer join; indicator=True adds a _merge column showing where each row came from

income_tracts = pd.merge(income, tracts, on='geoid', how='outer', indicator=True)

In [84]:
income_tracts._merge.value_counts()

# confirming that all the tracts in the income df are accounted for 

both          2196
right_only     129
left_only        0
Name: _merge, dtype: int64
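
The check above relies on the `_merge` column that `indicator=True` adds to the result. A minimal sketch of the same pattern on toy data follows; the frames `left` and `right` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames; the geoid values are placeholders
left = pd.DataFrame({"geoid": [1, 2], "median_household_income": [70867.0, 98090.0]})
right = pd.DataFrame({"geoid": [1, 2, 3], "CT2020": [200, 400, 100]})

# An outer join keeps unmatched rows from both sides; the _merge column
# labels each row as 'both', 'left_only', or 'right_only'
check = pd.merge(left, right, on="geoid", how="outer", indicator=True)
counts = check["_merge"].value_counts()

# After inspecting the counts, keep the matched rows and drop the helper column
matched = check[check["_merge"] == "both"].drop(columns="_merge")
```

Here `geoid` 3 exists only in `right`, so it is counted as `right_only` and dropped from `matched`.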

In [85]:
# keep only the rows that matched in both dataframes

income_tracts = income_tracts[income_tracts['_merge'] == 'both']

# now I have a dataset I can use to join to the other datasets

In [86]:
income_tracts


Unnamed: 0,geoid,name,median_household_income,"B19013001, Error",CTLabel,CT2020,_merge
0,36005000200,"Census Tract 2, Bronx, NY",70867.0,25423.0,2.00,200,both
1,36005000400,"Census Tract 4, Bronx, NY",98090.0,18180.0,4.00,400,both
2,36005001600,"Census Tract 16, Bronx, NY",40033.0,9907.0,16.00,1600,both
3,36005001901,"Census Tract 19.01, Bronx, NY",55924.0,12028.0,19.01,1901,both
4,36005001902,"Census Tract 19.02, Bronx, NY",60804.0,12156.0,19.02,1902,both
...,...,...,...,...,...,...,...
2191,36085029106,"Census Tract 291.06, Richmond, NY",127671.0,25994.0,291.06,29106,both
2192,36085030301,"Census Tract 303.01, Richmond, NY",95913.0,6123.0,303.01,30301,both
2193,36085030302,"Census Tract 303.02, Richmond, NY",85842.0,18154.0,303.02,30302,both
2194,36085031902,"Census Tract 319.02, Richmond, NY",76066.0,35257.0,319.02,31902,both


Reading the Open Streets data

In [87]:
open_streets = pd.read_csv('mean_coords_table_with_geos.csv')

In [88]:
open_streets

Unnamed: 0,wkt_geom,MEAN_X,MEAN_Y,object_id,lat,lon,GEOID,STATE,COUNTY,TRACT,BLOCK
0,Point (932357.50813676416873932 129799.6779506...,9.323575e+05,129799.677951,1.0,40.52279,-74.186652,360850198004041,36,85,19800,4041
1,Point (952222.48087393492460251 147937.0785196...,9.522225e+05,147937.078520,384.0,40.57267,-74.115286,360850134001007,36,85,13400,1007
2,Point (957307.78094421629793942 167722.3556344...,9.573078e+05,167722.355634,480.0,40.62699,-74.097060,360850059021000,36,85,5902,1000
3,Point (961974.16859200922772288 165665.9565205...,9.619742e+05,165665.956521,501.0,40.62136,-74.080242,360850029002000,36,85,2900,2000
4,Point (962028.83035567402839661 168307.7564461...,9.620288e+05,168307.756446,6.0,40.62861,-74.080054,360850033002000,36,85,3300,2000
...,...,...,...,...,...,...,...,...,...,...,...
537,Point (1056074.98569662659429014 158679.738558...,1.056075e+06,158679.738558,325.0,40.60192,-73.741347,360811010021009,36,81,101002,1009
538,Point (1056284.10559147596359253 158493.067578...,1.056284e+06,158493.067578,326.0,40.60141,-73.740596,360811010021003,36,81,101002,1003
539,Point (1056446.62167064100503922 158342.209940...,1.056447e+06,158342.209940,320.0,40.60099,-73.740012,360811010021000,36,81,101002,1000
540,Point (1056606.48847688734531403 158161.324475...,1.056606e+06,158161.324475,321.0,40.60050,-73.739439,360811010021000,36,81,101002,1000


In [93]:
# this gives us the number of open streets per tract

open_streets_count = open_streets['TRACT'].value_counts()

In [95]:
# save as a dataframe

open_streets_count = pd.DataFrame(open_streets_count)

In [96]:
open_streets_count

Unnamed: 0,TRACT
38302,14
16900,12
7300,10
101002,9
29100,9
...,...
24700,1
34100,1
13604,1
20000,1


In [97]:
open_streets_count.rename(columns={'TRACT': 'count'}, inplace=True)

In [98]:
open_streets_count.reset_index(inplace=True)

In [100]:
open_streets_count.rename(columns={'index': 'tract'}, inplace=True)

In [102]:
open_streets_count

Unnamed: 0,tract,count
0,38302,14
1,16900,12
2,7300,10
3,101002,9
4,29100,9
...,...,...
192,24700,1
193,34100,1
194,13604,1
195,20000,1
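
The counting, renaming, and index reset above can also be written as one chain. This is a sketch of an alternative, not the code that was run here; it uses `groupby(...).size()` because the column name that `value_counts()` produces differs between pandas versions. The `sample` frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample with the same TRACT column as the geocoded table
sample = pd.DataFrame({"TRACT": [16900, 16900, 7300, 16900, 7300, 20000]})

counts = (
    sample.groupby("TRACT")
    .size()                               # number of rows per tract code
    .reset_index(name="count")            # back to a dataframe with a 'count' column
    .rename(columns={"TRACT": "tract"})
    .sort_values("count", ascending=False, ignore_index=True)
)
```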


In [103]:
income_tracts.rename(columns={'CT2020': 'tract'}, inplace=True)

In [110]:
# drop the _merge column

income_tracts.drop(columns=['_merge'], inplace=True)

In [111]:
merged = pd.merge(open_streets_count, income_tracts, on='tract', how='outer', indicator=True)

In [112]:
merged

Unnamed: 0,tract,count,geoid,name,median_household_income,"B19013001, Error",CTLabel,_merge
0,38302,14.0,,,,,,left_only
1,16900,12.0,3.600502e+10,"Census Tract 169, Bronx, NY",45273.0,7447.0,169.00,both
2,16900,12.0,3.604702e+10,"Census Tract 169, Kings, NY",112419.0,13532.0,169.00,both
3,16900,12.0,3.606102e+10,"Census Tract 169, New York, NY",153854.0,36359.0,169.00,both
4,16900,12.0,3.608102e+10,"Census Tract 169, Queens, NY",77027.0,15493.0,169.00,both
...,...,...,...,...,...,...,...,...
2196,29105,,3.608503e+10,"Census Tract 291.05, Richmond, NY",97279.0,19464.0,291.05,right_only
2197,29106,,3.608503e+10,"Census Tract 291.06, Richmond, NY",127671.0,25994.0,291.06,right_only
2198,30301,,3.608503e+10,"Census Tract 303.01, Richmond, NY",95913.0,6123.0,303.01,right_only
2199,30302,,3.608503e+10,"Census Tract 303.02, Richmond, NY",85842.0,18154.0,303.02,right_only


In [113]:
# show the value counts of the _merge column

merged._merge.value_counts()

# left_only = 5: five tract codes with open streets have no matching row in the median household income data

right_only    1748
both           448
left_only        5
Name: _merge, dtype: int64

In [114]:
# keep only the rows where the _merge column is 'both'

merged = merged[merged['_merge'] == 'both']

In [115]:
merged

Unnamed: 0,tract,count,geoid,name,median_household_income,"B19013001, Error",CTLabel,_merge
1,16900,12.0,3.600502e+10,"Census Tract 169, Bronx, NY",45273.0,7447.0,169.0,both
2,16900,12.0,3.604702e+10,"Census Tract 169, Kings, NY",112419.0,13532.0,169.0,both
3,16900,12.0,3.606102e+10,"Census Tract 169, New York, NY",153854.0,36359.0,169.0,both
4,16900,12.0,3.608102e+10,"Census Tract 169, Queens, NY",77027.0,15493.0,169.0,both
5,7300,10.0,3.600501e+10,"Census Tract 73, Bronx, NY",25619.0,6243.0,73.0,both
...,...,...,...,...,...,...,...,...
448,20000,1.0,3.606102e+10,"Census Tract 200, New York, NY",61250.0,23829.0,200.0,both
449,19800,1.0,3.604702e+10,"Census Tract 198, Kings, NY",66563.0,23189.0,198.0,both
450,19800,1.0,3.606102e+10,"Census Tract 198, New York, NY",88646.0,35928.0,198.0,both
451,19800,1.0,3.608102e+10,"Census Tract 198, Queens, NY",84034.0,25699.0,198.0,both
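
In the output above, tract code 16900 is paired with tracts in four different counties, because a six-digit tract code is unique only within a county. One possible way to avoid this is to join on the county-qualified tract id instead. The sketch below assumes that the 15-digit `GEOID` in the Open Streets table is a block id (state, county, tract, block), so dropping its last four digits gives the 11-digit tract `geoid`. The Open Streets rows are hypothetical; the income values are the Tract 169 figures shown above.

```python
import pandas as pd

# Hypothetical rows shaped like the geocoded Open Streets table
open_streets_sample = pd.DataFrame({
    "GEOID": [360470169001000, 360470169002001, 360610169001003],
    "TRACT": [16900, 16900, 16900],
})

# Rows shaped like income_tracts (Tract 169 in Bronx, Kings, New York)
income_sample = pd.DataFrame({
    "geoid": [36005016900, 36047016900, 36061016900],
    "median_household_income": [45273.0, 112419.0, 153854.0],
})

# Integer-divide by 10_000 to drop the 4-digit block suffix,
# leaving the 11-digit state + county + tract id
open_streets_sample["geoid"] = open_streets_sample["GEOID"] // 10_000

# Count open streets per county-qualified tract
counts = open_streets_sample.groupby("geoid").size().reset_index(name="count")

# validate="one_to_one" raises an error if a key repeats on either side
merged_sample = pd.merge(
    counts, income_sample, on="geoid", how="inner", validate="one_to_one"
)
```

With this key, the two Kings County streets and the one New York County street each match a single income row, and the Bronx tract, which has no open street in the sample, is left out.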


In [116]:
# save the merged dataset

merged.to_csv('final_data.csv', index=False)