In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

***

# CENSUS TRACTS

The aim of this notebook is to build a set of dictionaries that remap all the census tracts to the correct ones:
- [ ] Part 1: existing census tract numbers to the correct format -> {current census tract -> correct census tract} (*this might also be usable to remap 'vintage' census tracts to the current ones*)
- [ ] Part 2: missing census tracts with Lat & Lon values -> {Lat & Lon ref -> correct census tract}
- [ ] Part 3: missing census tracts with no Lat & Lon values -> {ID -> correct census tract}
- [ ] Plus save all the tract numbers that we need

***

**OLD DESCRIPTION**

Ultimately, this subsection should give the dataframe permits_wip a column with the correct census tract numbers, which can then be used to join it with the Income & Education datasets:
* Import a sample Income & Education Dataframe (let's use Income 2021 Dataframe)
    - Save the set of the available census tracts
* Separate the dataframe into 2 (present census tract number / NaN census tract number)
    - For present tract numbers:
        - Create a new column 'Census_Tract_WIP' which will store the correct tract numbers
        - Modify CT numbers from Permits to match CT numbers from Income
        - Fill in the ones not matched using external resources
        - Save the ID for which there were still no matches
        - Create a dictionary from this column {'ID':'Correct Census Tract'} that can be applied onto permits_wip
    - For NaN tract numbers:
        - Fill in CT numbers based on Lat&Lon
        - Reverse engineer to create a dictionary {'ID':'Correct Census Tract'}
        - Save the ID for which there were still no matches
    - For IDs where there are still no matches, investigate further methods:
        - If successful, produce {'ID':'Correct Census Tract'}
* Apply {'ID':'Census Tract'} and create a new column Census_Tract_Clean
* Export the dataframe for EDA Analysis

In [3]:
tracts_all=pd.read_csv('../data/interim/wip/census_tracts.csv',index_col=0)
tracts_all.sample()

Almost 15% of the values in the CENSUS_TRACT column are null. This is a problem because the census tract number is the key used to join the datasets.<br>
Hence, the null values need to be filled in.

# **TARGET CENSUS TRACTS**

**Filling in census_tract null values & ensuring CENSUS_TRACT numbers from the PERMIT dataframe match INCOME & EDUCATION**

Import the edu_2021 dataframe (its census tracts serve as the reference set of US Census tract numbers)

In [37]:
edu_2021=pd.read_csv('../data/interim/wip/edu_2021.csv',index_col=0)
edu_2021['Census_Tract']=edu_2021['Census_Tract'].apply(lambda x: x.lstrip('s'))

check=[int(i) for i in edu_2021['Census_Tract'].values]
check[:10]

[10100, 10201, 10202, 10300, 10400, 10501, 10502, 10503, 10600, 10701]

Note: the Census_Tract column of the US Census dataframe is deliberately stored with dtype object (string):
* storing the census tract as a string ensures no leading zeros are lost
* it makes character manipulations easier
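
A minimal illustration of the first point, using a tract code that also appears in the permits data. Once a tract number passes through a numeric type the leading zero is gone, and it can only be restored because the intended width (six digits) is known:

```python
# A 6-digit tract code with a leading zero.
tract = "081202"

as_number = int(tract)      # numeric storage drops the leading zero
as_text = str(as_number)    # "81202": one character short

# The zero can be restored only because tract codes are known to be 6 digits wide.
restored = as_text.zfill(6)

print(as_number, as_text, restored)
```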

In [7]:
tracts_all['CENSUS_TRACT']

0              NaN
1              NaN
2              NaN
3              NaN
4              NaN
            ...   
730506    650500.0
730507    220400.0
730508     81202.0
730509    837300.0
730510    530503.0
Name: CENSUS_TRACT, Length: 730511, dtype: float64

### AIM: MATCH permits_wip['CENSUS_TRACT'] with income_2021 ['CENSUS_TRACT']

***

Let's create a separate dataframe (instead of permits_wip) for the process of reassigning Census_Tract values.<br>
The following columns can be useful: 'ID','CENSUS_TRACT','STREET_NAME','STREET_NUMBER','SUFFIX','LATITUDE','LONGITUDE'

Separate the rows into those that have a value in the CENSUS_TRACT column and those where CENSUS_TRACT is NaN

In [8]:
permit_tracts=tracts_all[~tracts_all['CENSUS_TRACT'].isna()]
#df where census tract number present

# PART 1: PRESENT TRACT NUMBERS

Match the present census tract numbers from the Building Permits dataset with the numbers from the Income dataset.<br>
**Store the correct census tract number in permit_tracts['Census_Tract_WIP']**

In [9]:
#create a new column
permit_tracts.insert(permit_tracts.columns.get_loc('CENSUS_TRACT'),'Census_Tract_WIP',[str(int(i)) for i in permit_tracts['CENSUS_TRACT']])

**Format the present tract numbers of Permits to match the format from US Census Data**

In [10]:
permit_tracts['Census_Tract_WIP']=[(lambda x:''.join(('0',x)) if (len(x)==3) else x)(str(i)) for i in list(permit_tracts['Census_Tract_WIP'])]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  permit_tracts['Census_Tract_WIP']=[(lambda x:''.join(('0',x)) if (len(x)==3) else x)(str(i)) for i in list(permit_tracts['Census_Tract_WIP'])]
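
The warning above appears because permit_tracts was created by filtering tracts_all, so pandas cannot tell whether the assignment is meant to reach the original frame. A common way to avoid it, sketched here on a toy frame rather than the real data, is to take an explicit copy right after filtering:

```python
import numpy as np
import pandas as pd

# Toy stand-in for tracts_all; the values are illustrative only.
toy = pd.DataFrame({"ID": [1, 2, 3], "CENSUS_TRACT": [np.nan, 650500.0, 81202.0]})

# .copy() makes the filtered frame independent of the original,
# so later column assignments are unambiguous.
present = toy[toy["CENSUS_TRACT"].notna()].copy()
present["Census_Tract_WIP"] = present["CENSUS_TRACT"].astype(int).astype(str)

print(present)
```

The original frame is left untouched, which is the intent here since the remapping is carried by a separate dictionary.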


In [11]:
permit_tracts['Census_Tract_WIP']=[''.join((i,'0'*(6-len(i)))) for i in permit_tracts['Census_Tract_WIP']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  permit_tracts['Census_Tract_WIP']=[''.join((i,'0'*(6-len(i)))) for i in permit_tracts['Census_Tract_WIP']]
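
The two formatting steps above can be read as a single rule: a 3-character code gets a leading zero, and every code is then right-padded with zeros to six characters. A small standalone sketch of that rule (the helper name `format_tract` is ours, not part of the notebook):

```python
def format_tract(raw: float) -> str:
    """Mirror the two notebook steps for one CENSUS_TRACT value."""
    code = str(int(raw))           # 1404.0 -> "1404"
    if len(code) == 3:             # step 1: 3-digit codes get a leading zero
        code = "0" + code
    return code.ljust(6, "0")      # step 2: right-pad with zeros to 6 characters

for raw in (1404.0, 80500.0, 650500.0):
    print(raw, "->", format_tract(raw))
```

This is a heuristic: it assumes the missing digits are always trailing ones, which is why some codes still fail to match after this step.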


In [12]:
#Now we need to find census tracts in the Permit_df for which we do not have a corresponding Census_Tract in Income_df
missing=list(set(permit_tracts['Census_Tract_WIP'])-set(edu_2021['Census_Tract']))

In [13]:
print('After formatting, CT numbers from the Census_Tract_WIP column can now be used to join dataframes from US Census Datasets')
print(f"{len(set(permit_tracts['Census_Tract_WIP']))-len(set(missing))} tract numbers are now matched")
print(f'However, {len(set(missing))} tract numbers are still unmatched')

After formatting, CT numbers from the Census_Tract_WIP column can now be used to join dataframes from US Census Datasets
751 tract numbers are now matched
However, 454 tract numbers are still unmatched


 ********

**Look for alternative ways to reformat the missing tract numbers**

In [14]:
%pip install censusgeocode

Note: you may need to restart the kernel to use updated packages.


In [15]:
import censusgeocode as cg;

In [16]:
permit_tracts.isna().sum()

ID                     0
Census_Tract_WIP       0
CENSUS_TRACT           0
STREET_NAME            0
STREET DIRECTION       0
STREET_NUMBER          0
SUFFIX              5865
LATITUDE            1688
LONGITUDE           1688
dtype: int64

In [17]:
'''#at this stage, we want a dataframe with one census tract from 'missing' & Latitude + Longitude
missing_p_tracts_coord=permit_tracts[(permit_tracts['Census_Tract_WIP'].isin(missing))&(~permit_tracts['LATITUDE'].isna())].groupby('Census_Tract_WIP').mean()[['LATITUDE','LONGITUDE']]'''
#but we can fill in for the missing lat & long from the existing ones. Mean is calculated by ignoring NaNs

At this stage, we want a dataframe with one row per unmatched census tract, holding the average of the Latitude & Longitude values of its applications

In [18]:
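#select the numeric columns before averaging so that text columns (street names etc.) are not passed to mean()
missing_p_tracts_coord=permit_tracts[permit_tracts['Census_Tract_WIP'].isin(missing)].groupby('Census_Tract_WIP')[['LATITUDE','LONGITUDE']].mean()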
missing_p_tracts_coord.sample(2)

Census_Tract_WIP,LATITUDE,LONGITUDE
701030,41.926773,-87.642288
303000,41.993011,-87.672826


Do we have any unmatched census tracts for which there are no Lat & Lon values?<br>
We have one unmatched tract without coordinates -> let's drop it but record its tract number

In [19]:
missing_p_tracts_coord.isna().sum()

LATITUDE     1
LONGITUDE    1
dtype: int64

In [20]:
#census tracts for which no Lat & Lon values are available (empty list if there are none)
ct_missing_coord=list(missing_p_tracts_coord[missing_p_tracts_coord['LATITUDE'].isna()].index)
ct_missing_coord

['805000']

Note: ct_missing_coord - list storing application census tracts that are missing Lat & Lon values

In [21]:
missing_p_tracts_coord=missing_p_tracts_coord.dropna(axis=0)

In [22]:
missing_p_tracts_coord.shape[0]

453

**Generating the new census tract numbers:**

In [23]:
#THIS CELL TAKES 4 MIN TO RUN

tracts_new=[(cg.coordinates(x=row['LONGITUDE'],y=row['LATITUDE']))['Census Tracts'][0]['TRACT'] for i,row in missing_p_tracts_coord.iterrows()]
#this is a list with census tracts generated based on latitude & longitude for Permit Census Tracts that did not match Income Census Tracts
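
Because each lookup is a network request, it can help to separate the looping logic from the geocoder call. The sketch below takes the lookup as a parameter, so the loop can be exercised offline. `lookup_tracts` and `fake_lookup` are hypothetical names; in the notebook the lookup would wrap `cg.coordinates(x=lon, y=lat)['Census Tracts'][0]['TRACT']` as in the cell above. Failed lookups are collected instead of aborting the whole run:

```python
from typing import Callable, Dict, List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)

def lookup_tracts(
    coords: Dict[str, Coord],
    lookup: Callable[[float, float], str],
) -> Tuple[Dict[str, str], List[str]]:
    """Map each key to a tract; return (results, keys whose lookup failed)."""
    results: Dict[str, str] = {}
    failed: List[str] = []
    for key, (lat, lon) in coords.items():
        try:
            results[key] = lookup(lat, lon)
        except (KeyError, IndexError):
            # e.g. the response had no 'Census Tracts' entry for this point
            failed.append(key)
    return results, failed

# Offline stand-in for the real geocoder call; the returned tract is a placeholder.
def fake_lookup(lat: float, lon: float) -> str:
    if lat > 42.5:
        raise KeyError("Census Tracts")
    return "000000"

found, failed = lookup_tracts(
    {"701030": (41.926773, -87.642288), "bad": (43.0, -87.6)}, fake_lookup
)
print(found, failed)
```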

In [24]:
#let's check if they (tracts_new) now match the income census tracts
missing_2=[i for i in set(tracts_new) if i not in set(edu_2021['Census_Tract'])]
(f'There are now {len(missing_2)} unmatched census tracts for the applications that had a value for census tract and had Lat&Lon values')

'There are now 0 unmatched census tracts for the applications that had a value for census tract and had Lat&Lon values'

**Dictionary for the unmatched census tracts:**

In [25]:
#dictionary with new census tracts for the previously unmatched census tracts
missing_to_new={i[0]:i[1] for i in zip(missing_p_tracts_coord.index, tracts_new)}
missing_to_new;

**Update Census_Tract_WIP using the dictionary**

In [26]:
permit_tracts['Census_Tract_WIP'].replace(missing_to_new,inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  permit_tracts['Census_Tract_WIP'].replace(missing_to_new,inplace=True)


In [38]:
set([i for i in permit_tracts['Census_Tract_WIP'] if int(i) not in check])

{'805000'}

***

**Sanity check:**

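Recall the dictionary 'missing_to_new', which was used to update the existing census tracts. If we look inside the dictionary, many of the new values look very different from the tract numbers they replace. As a check, let's also geocode the tracts that did match and compare the results.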

In [39]:
permit_tracts.sample()

Unnamed: 0,ID,Census_Tract_WIP,CENSUS_TRACT,STREET_NAME,STREET DIRECTION,STREET_NUMBER,SUFFIX,LATITUDE,LONGITUDE
402209,2787454,740400,740400.0,DRAKE,S,11215,AVE,41.689073,-87.709125


In [40]:
p_tracts_check=permit_tracts[~permit_tracts['Census_Tract_WIP'].isin(missing)].groupby('Census_Tract_WIP')[['LATITUDE','LONGITUDE']].mean()

p_tracts_check.isna().sum()

LATITUDE     0
LONGITUDE    0
dtype: int64

In [41]:
p_tracts_check.sample(2)

Census_Tract_WIP,LATITUDE,LONGITUDE
90300,41.999205,-87.816643
301200,41.849229,-87.691815


In [42]:
p_tracts_check_di={i:(cg.coordinates(x=row['LONGITUDE'],y=row['LATITUDE']))['Census Tracts'][0]['TRACT'] for i,row in p_tracts_check.iterrows()}
#key = existing, value = generated

In [43]:
incorrect=[(key,p_tracts_check_di[key]) for key in p_tracts_check_di if key!=p_tracts_check_di[key]]

In [44]:
incorrect={key:p_tracts_check_di[key] for key in p_tracts_check_di if key!=p_tracts_check_di[key]}

In [45]:
print(len(incorrect),'census tracts were matched incorrectly')

19 census tracts were matched incorrectly


In [46]:
permit_tracts['Census_Tract_WIP']=permit_tracts['Census_Tract_WIP'].replace(incorrect)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  permit_tracts['Census_Tract_WIP']=permit_tracts['Census_Tract_WIP'].replace(incorrect)


In [47]:
print(f'{ct_missing_coord} - census tract number(s) still unmatched')

['805000'] - census tract number(s) still unmatched


In [48]:
permit_tracts

Unnamed: 0,ID,Census_Tract_WIP,CENSUS_TRACT,STREET_NAME,STREET DIRECTION,STREET_NUMBER,SUFFIX,LATITUDE,LONGITUDE
32,3201678,611900,611900.0,51ST,W,1339,ST,41.801286,-87.659013
33,2694469,830600,830600.0,GREENVIEW,N,6538,AVE,42.001103,-87.668190
34,2694826,170600,170600.0,PLAINFIELD,N,3522,AVE,41.943635,-87.834024
46,2699488,100500,100500.0,ORIOLE,N,5500,AVE,41.979777,-87.817015
47,2700325,292500,292500.0,CERMAK,W,4044,RD,41.851631,-87.726538
...,...,...,...,...,...,...,...,...,...
730506,3307044,650500,650500.0,KARLOV,S,6926,AVE,41.766834,-87.725048
730507,3308612,220400,220400.0,RICHMOND,N,2529,ST,41.927464,-87.701016
730508,3308742,081202,81202.0,STATE,N,1125,ST,41.902604,-87.628254
730509,3308572,837300,837300.0,LEXINGTON,W,3328,ST,41.871944,-87.709470


In [49]:
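#applications whose census tract has no Lat & Lon values (covers every tract in ct_missing_coord, not just the last one)
permit_coord_missing=permit_tracts[permit_tracts['Census_Tract_WIP'].isin(ct_missing_coord)]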

In [50]:
permit_coord_missing

Unnamed: 0,ID,Census_Tract_WIP,CENSUS_TRACT,STREET_NAME,STREET DIRECTION,STREET_NUMBER,SUFFIX,LATITUDE,LONGITUDE
61121,3243454,805000,80500.0,SCHILLER,W,749,ST,,
334619,3179764,805000,80500.0,SCHILLER,W,749,ST,,
438192,3170470,805000,80500.0,SCHILLER,W,749,ST,,
520483,3144794,805000,80500.0,SCHILLER,W,749,ST,,


In [51]:
result=cg.address('749 W Schiller Street', city='Chicago', state='IL')
print(result)

[]


Let's manually find the coordinates and follow the same approach as before

In [52]:
update={'805000':cg.coordinates(x=-87.64779131343974,y=41.907408400875006)['Census Tracts'][0]['TRACT']}

In [53]:
permit_tracts['Census_Tract_WIP']=permit_tracts['Census_Tract_WIP'].replace(update)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  permit_tracts['Census_Tract_WIP']=permit_tracts['Census_Tract_WIP'].replace(update)


In [54]:
permit_tracts.head(2)

Unnamed: 0,ID,Census_Tract_WIP,CENSUS_TRACT,STREET_NAME,STREET DIRECTION,STREET_NUMBER,SUFFIX,LATITUDE,LONGITUDE
32,3201678,611900,611900.0,51ST,W,1339,ST,41.801286,-87.659013
33,2694469,830600,830600.0,GREENVIEW,N,6538,AVE,42.001103,-87.66819


In [55]:
permit_tracts_part1_di=dict(zip(permit_tracts['CENSUS_TRACT'],permit_tracts['Census_Tract_WIP']))

In [56]:
# load csv module
import csv

# open the file for writing ("w"); the with-block closes it so that all rows are flushed to disk
with open("../data/interim/di_part1.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(['REF', 'Census_Tract_Match'])

    # loop over dictionary keys and values
    for key, val in permit_tracts_part1_di.items():
        # write every key and value to file
        w.writerow([key, val])

In [57]:
check_part1=pd.read_csv("../data/interim/di_part1.csv")

In [58]:
check_part1.sample()

Unnamed: 0,REF,Census_Tract_Match
665,837800.0,837800
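
One thing to watch when reading the dictionary back: the tract codes are written as text, but a reader that infers types may parse them as integers and drop leading zeros again. The sketch below round-trips one row through the standard `csv` module, which returns every field as a string; with `pd.read_csv`, passing `dtype={'Census_Tract_Match': str}` should have the same effect.

```python
import csv
import io

# Write one mapping row to an in-memory file instead of disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["REF", "Census_Tract_Match"])
writer.writerow([81202.0, "081202"])

# csv.DictReader does no type inference, so the leading zero survives.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows)
```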


In [59]:
[i for i in check_part1['Census_Tract_Match'] if int(i) not in check]

[]

**'permit_tracts_part1_di' is the dictionary that will be used to update the Permit Dataframe before joining it with the US Census Data**

***

**#RUN THE SCRAPING METHOD ON ALL OF THE PERMIT_TRACTS FOR CONSISTENCY WHEN CLEANING THE NOTEBOOK NEXT**

***
***

# **PART 2**

#### We still need to fill in the tract numbers for the applications whose census tract number is missing (permit_tracts_missing):

**Where possible, let's fill in using the latitude & longitude:**

In [61]:
#Create the new dataframe permit_tracts_missing by filtering 'tracts_all' on 'CENSUS_TRACT'==NaN
permit_tracts_missing=tracts_all[tracts_all['CENSUS_TRACT'].isna()]

In [62]:
permit_tracts_missing=permit_tracts_missing.reset_index(drop=True)

In [63]:
permit_tracts_missing.sample()

Unnamed: 0,ID,CENSUS_TRACT,STREET_NAME,STREET DIRECTION,STREET_NUMBER,SUFFIX,LATITUDE,LONGITUDE
16443,1739070,,GRANVILLE,W,1543,AVE,41.994336,-87.669425


In [64]:
print(f"Currently there are {permit_tracts_missing.groupby(['LATITUDE','LONGITUDE']).size().shape[0]} unique coordinate pairs among the applications that do not have a value for Census_tract. This is too many to query census_tract using the geocoder")

Currently there are 59751 unique coordinate pairs among the applications that do not have a value for Census_tract. This is too many to query census_tract using the geocoder


In [65]:
permit_tracts_missing[['LATITUDE_REF','LONGITUDE_REF']]=permit_tracts_missing[['LATITUDE','LONGITUDE']].apply(lambda x: round(x,3))

In [66]:
print(f"Rounding to 3 decimal places still leaves us with {permit_tracts_missing.groupby(['LATITUDE_REF','LONGITUDE_REF']).size().shape[0]} unique coordinate pairs")

Rounding to 3 decimal places still leaves us with 29164 unique coordinate pairs


In [67]:
missing_ID=[]

In [68]:
missing_ID.extend(set(permit_tracts_missing[permit_tracts_missing['LATITUDE'].isna()]['ID']))

In [69]:
#before running the query we need to drop the missing values & let's only select the columns that we actually need
permit_tracts_missing_coord=permit_tracts_missing[~permit_tracts_missing['LATITUDE'].isna()][['ID','LATITUDE','LONGITUDE','LATITUDE_REF','LONGITUDE_REF']]

In [70]:
permit_tracts_missing_coord.sample()

Unnamed: 0,ID,LATITUDE,LONGITUDE,LATITUDE_REF,LONGITUDE_REF
77319,1657002,41.98418,-87.781791,41.984,-87.782


In [71]:
'''tracts_missing_new=[(cg.coordinates(x=row['LONGITUDE_RND'],y=row['LATITUDE_RND']))['Census Tracts'][0]['TRACT'] for i,row in permit_tracts_missing_coord[:100].iterrows()]'''

In [72]:
#filling in the missing values for 100 rows took 1 min. Hence, we need to reduce the number of rows. Currently, the query would take 29164/100 -> 291 minutes

In [73]:
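#coarsen the grid: dividing by 4 before rounding merges nearby coordinates into cells of about 0.004 degrees
permit_tracts_missing_coord[['LATITUDE_REF','LONGITUDE_REF']]=permit_tracts_missing_coord[['LATITUDE','LONGITUDE']].apply(lambda x: round(x/4,3))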
permit_tracts_missing_coord.groupby(['LATITUDE_REF','LONGITUDE_REF']).size().shape

(3530,)
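
Dividing by 4 before rounding to 3 decimals means two points share a reference whenever they fall in the same cell of roughly 0.004 degrees, which is what brings the number of distinct pairs down from 29164 to 3530. A standalone sketch of the key construction (the helper name `grid_ref` is ours, and the second point is made up for illustration):

```python
def grid_ref(lat: float, lon: float, factor: int = 4, decimals: int = 3) -> str:
    """Build the 'LAT_REF, LON_REF' key used to group nearby coordinates."""
    return f"{round(lat / factor, decimals)}, {round(lon / factor, decimals)}"

# Two nearby points (under 0.001 degrees apart) land in the same cell.
a = grid_ref(41.807127, -87.651822)
b = grid_ref(41.807900, -87.652500)
print(a, b, a == b)
```

The trade-off is that every point in a cell inherits the tract of the cell's mean coordinate, so points near a tract boundary may be assigned to a neighbouring tract.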

In [74]:
coord_pairs=permit_tracts_missing_coord.groupby(['LATITUDE_REF','LONGITUDE_REF']).mean().reset_index().drop(columns='ID')
coord_pairs.sample(2)

Unnamed: 0,LATITUDE_REF,LONGITUDE_REF,LATITUDE,LONGITUDE
1502,10.452,-21.913,41.807127,-87.651822
2086,10.47,-21.933,41.879811,-87.732038


In [75]:
coord_pairs.sample(2)

Unnamed: 0,LATITUDE_REF,LONGITUDE_REF,LATITUDE,LONGITUDE
1908,10.465,-21.914,41.85988,-87.655852
1484,10.452,-21.932,41.807937,-87.728074


In [76]:
coord_pairs['REF']= [str(row['LATITUDE_REF'])+', '+str(row['LONGITUDE_REF'])for i,row in coord_pairs.iterrows()]

In [77]:
coord_pairs.tail()

Unnamed: 0,LATITUDE_REF,LONGITUDE_REF,LATITUDE,LONGITUDE,REF
3525,10.505,-21.916,42.018751,-87.665456,"10.505, -21.916"
3526,10.506,-21.919,42.022447,-87.674689,"10.506, -21.919"
3527,10.506,-21.918,42.022516,-87.67199,"10.506, -21.918"
3528,10.506,-21.917,42.022256,-87.668178,"10.506, -21.917"
3529,10.506,-21.916,42.022286,-87.665795,"10.506, -21.916"


Creating a dictionary with REF: correct census tract

In [78]:
'''tracts_missing_coord_pairs={row['REF']:(cg.coordinates(x=row['LONGITUDE'],y=row['LATITUDE']))['Census Tracts'][0]['TRACT'] for i,row in coord_pairs.iterrows()}''';
#this cell takes 31 min to run; the output was saved and can be loaded directly

In [79]:
def coord_ref(lat,lon):
    #build the 'LAT_REF, LON_REF' reference string for each (lat, lon) pair
    coord_ref_li=[]
    for (j,i) in zip(lat,lon):
        ref=str(round(j/4,3))+', '+str(round(i/4,3))
        coord_ref_li.append(ref)
    return coord_ref_li

In [80]:
'''# load csv module
import csv

# open file for writing, "w" is writing
w = csv.writer(open("../data/interim/di_part2.csv", "w"))
w.writerow(['REF', 'Census_Tract_Match'])

# loop over dictionary keys and values
for key, val in tracts_missing_coord_pairs.items():
    # write every key and value to file
    w.writerow([key, val])'''


In [81]:
tracts_missing_coord_pairs=pd.read_csv('../data/interim/di_part2.csv')

In [82]:
[i for i in list(tracts_missing_coord_pairs['Census_Tract_Match']) if int(i) not in check]

[]

# **PART 3**

In [83]:
permit_tracts_missing_no_cord=permit_tracts_missing[permit_tracts_missing['LATITUDE'].isna()][['STREET_NAME','SUFFIX','STREET DIRECTION','STREET_NUMBER']]
permit_tracts_missing_no_cord

Unnamed: 0,STREET_NAME,SUFFIX,STREET DIRECTION,STREET_NUMBER
20,COTTAGE GROVE,AVE,S,6349
325,BESSIE COLEMAN,DR,N,10000
399,HERMITAGE,AVE,N,7535
9048,CULLERTON,ST,W,248
31494,SACRAMENTO,AVE,S,1901
...,...,...,...,...
108567,BESSIE COLEMAN,DR,N,10000
108579,CHESTNUT,ST,W,1223
108706,BESSIE COLEMAN,DR,N,10000
108955,BESSIE COLEMAN,DR,N,10000


566 rows are missing Census Tract Numbers as well as Lat & Lon Values

In [84]:
print(permit_tracts_missing_no_cord.duplicated().sum())
permit_tracts_missing_no_cord=permit_tracts_missing_no_cord.drop_duplicates()

353


Out of 566 rows, 353 are duplicates

In [85]:
permit_tracts_missing_no_cord=permit_tracts_missing_no_cord.drop_duplicates()

In [86]:
len(permit_tracts_missing_no_cord)

213

213 are unique and these are the address combinations that we want to find the census tracts for

In [87]:
#Combine the address columns into a single string for the geocoder input
permit_tracts_missing_no_cord['ADDRESS_REF']=permit_tracts_missing_no_cord['STREET_NUMBER'].astype('str')+' '+permit_tracts_missing_no_cord['STREET DIRECTION'].astype('str')+' '+\
    permit_tracts_missing_no_cord['STREET_NAME'].astype('str')+' '+permit_tracts_missing_no_cord['SUFFIX'].astype('str')

In [88]:
permit_tracts_missing_no_cord.sample()

Unnamed: 0,STREET_NAME,SUFFIX,STREET DIRECTION,STREET_NUMBER,ADDRESS_REF
103017,66TH,PL,W,6216,6216 W 66TH PL
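
Note that `astype('str')` turns a missing SUFFIX into the literal text 'nan', which then ends up in the address string. A hedged sketch of a builder that skips missing parts instead (the helper `build_address` is hypothetical and not used by the notebook):

```python
import math

def build_address(number, direction, name, suffix) -> str:
    """Join the address parts with spaces, skipping None, NaN and empty values."""
    parts = []
    for part in (number, direction, name, suffix):
        if part is None:
            continue
        if isinstance(part, float) and math.isnan(part):
            continue
        text = str(part).strip()
        if text:
            parts.append(text)
    return " ".join(parts)

print(build_address(6216, "W", "66TH", "PL"))
print(build_address(9929, "S", "AVENUE N", float("nan")))
```

If something like this were adopted, any dictionary keyed on the old strings (for example an entry ending in ' nan') would need the same change.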


In [89]:
census_tracts_part3={}
missing_addresses={}
for address in permit_tracts_missing_no_cord['ADDRESS_REF']:
    t=cg.address(address, city='Chicago', state='IL')
    if t:
        census_tracts_part3[address]=t[0]['geographies']['Census Tracts'][0]['TRACT']
    else:
        missing_addresses[address]=t

In [None]:
'''cg.address('1550 S THROOP ST', city='Chicago', state='IL')[0]['geographies']['Census Tracts'][0]['TRACT']''';

In [None]:
'''location=cg.address('1550 S THROOP ST', city='Chicago', state='IL',returntype='locations')
cg.coordinates(x=location[0]['coordinates']['x'],y=location[0]['coordinates']['y'])['Census Tracts'][0]['TRACT']''';

In [91]:
missing=[i for i in list(census_tracts_part3.values()) if int(i) not in check]
missing
#862901 is unmatched because it lies in Lake County, outside of Cook County
#https://censusreporter.org/profiles/14000US17097862901-census-tract-862901-lake-il/
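
The URL above embeds the full 11-digit tract GEOID, 17097862901, which is the 2-digit state FIPS code, the 3-digit county FIPS code and the 6-digit tract code concatenated. Cook County is 17031, so a tract under 17097 (Lake County) cannot appear in a Cook County table. A small sketch of that check (the helper name `split_geoid` is ours):

```python
from typing import NamedTuple

class TractGeoid(NamedTuple):
    state: str
    county: str
    tract: str

def split_geoid(geoid: str) -> TractGeoid:
    """Split an 11-digit census tract GEOID into state, county and tract."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected 11 digits, got {geoid!r}")
    return TractGeoid(geoid[:2], geoid[2:5], geoid[5:])

COOK_COUNTY = ("17", "031")  # Illinois, Cook County

parts = split_geoid("17097862901")
in_cook = (parts.state, parts.county) == COOK_COUNTY
print(parts, in_cook)
```

This assumes the full GEOID (or separate state and county fields) is available for each geocoded result; if so, filtering on it would catch such cases without hard-coding a tract number.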

In [93]:
wrong_address=[key for key in census_tracts_part3 if census_tracts_part3[key]=='862901'][len(missing)-1]
wrong_address

'1215 W SHERIDAN RD'

In [94]:
del census_tracts_part3[wrong_address]
len(census_tracts_part3)

In [103]:
part3_add_di={'10000 N BESSIE COLEMAN DR': [41.9744188,-87.892622],
 '1901 S SACRAMENTO AVE': [41.8554961,-87.6992071],
 '2324 N FREMONT ST': [41.9244996,-87.6534804],
 '12900 S METRON DR': [41.6625611,-87.5883168],
 '6500 S MICHIGAN AVE': [41.774537,-87.6252669],
 '1223 W CHESTNUT ST': [41.8979353,-87.6606072],
 '404 W MERCHANDISE MART PLZ': [41.8879948,-87.6385347],
 '10000 W OHARE ST': [41.9790169,-87.9087582],
 '444 W MERCHANDISE MART PLZ': [41.8879948,-87.6385347],
 '5400 S DR MARTIN L KING JR DR': [41.7963736,-87.6185842],
 '1902 S SACRAMENTO AVE': [41.8516478,-87.7026112],
 '222 N RIVERSIDE PLZ': [41.8848886,-87.6398917],
 '11061 W TOUHY AVE': [42.0113822,-87.8221779],
 '10700 S OGLESBY AVE': [41.7018567,-87.5678331],
 '9525 W BRYN MAWR AVE': [41.9804602,-87.8600796],
 '4100 W 71ST ST': [41.751877,-87.6725605],
 '1160 W TOUHY AVE': [42.0083279,-87.9415539],
 '3232 E CHELTENHAM DR': [41.7540108,-87.5509272],
 '1048 W MERCHANDISE MART PLZ': [41.8879882,-87.6374229],
 '200 N WACKER DR': [41.8856911,-87.6398546],
 '600 W MONTROSE HARBOR DR': [41.9628527,-87.6365966],
 '820 W POLK ST': [41.8719809,-87.6491222],
 '300 N RIVERSIDE PLZ': [41.8854109,-87.6410865],
 '9929 S AVENUE N nan': [41.7182278,-87.5410035],
 '1400 W 32ND ST': [41.836034,-87.6636717],
 '4401 N LAKE SHORE DR': [41.9546256,-87.6474362],
 '111601 W TOUHY AVE': [42.0116941,-87.8046477],
 '400 E WALDRON DR': [41.8603206,-87.6195214],
 '9177 W CHICAGO AVE': [41.8945072,-87.777267],
 '1440 N SACRAMENTO AVE': [41.8964216,-87.7015229],
 '8600 S GREEN BAY AVE': [41.7047451,-87.5397905],
 '1215 W SHERIDAN RD':[41.9982024,-87.6626523]}

In [104]:
part3_add2_di={k:(cg.coordinates(x=v[1],y=v[0]))['Census Tracts'][0]['TRACT'] for (k,v) in part3_add_di.items()}
#this is a dictionary {address: census tract} generated from the manually collected latitude & longitude values

In [105]:
census_tracts_part3.update(part3_add2_di)

In [106]:
len(census_tracts_part3)

213

In [107]:
[i for i in census_tracts_part3.values() if int(i) not in check]

[]

In [115]:
# load csv module
import csv

# open the file for writing ("w"); the with-block closes it so that all rows are flushed to disk
with open("../data/interim/di_part3.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(['Adress', 'Census_Tract_Match'])

    # loop over dictionary keys and values
    for key, val in census_tracts_part3.items():
        # write every key and value to file
        w.writerow([key, val])

****

In [109]:
part1_df=pd.read_csv('../data/interim/di_part1.csv')

In [110]:
part1_df.sample()

Unnamed: 0,REF,Census_Tract_Match
90,1404.0,140400


In [111]:
part2_df=pd.read_csv('../data/interim/di_part2.csv')

In [112]:
part2_df.sample()

Unnamed: 0,REF,Census_Tract_Match
1914,"10.465, -21.908",330200


In [116]:
part3_df=pd.read_csv('../data/interim/di_part3.csv')

In [117]:
part3_df.sample()

Unnamed: 0,Adress,Census_Tract_Match
106,2533 S CALIFORNIA AVE,301100
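
With the three lookups saved, each permit row can in principle be resolved by trying them in order: the original tract number, then the rounded coordinate reference, then the address. The sketch below is one possible way to chain them, not the notebook's own code; `resolve_tract` is a hypothetical helper and the three dictionaries hold only the sample rows shown above.

```python
import math
from typing import Optional

# One sample row from each saved dictionary.
part1 = {1404.0: "140400"}
part2 = {"10.465, -21.908": "330200"}
part3 = {"2533 S CALIFORNIA AVE": "301100"}

def is_missing(value) -> bool:
    return value is None or (isinstance(value, float) and math.isnan(value))

def resolve_tract(tract, lat, lon, address) -> Optional[str]:
    """Try Part 1, then Part 2, then Part 3; return None if nothing matches."""
    if not is_missing(tract):
        return part1.get(tract)
    if not is_missing(lat) and not is_missing(lon):
        ref = f"{round(lat / 4, 3)}, {round(lon / 4, 3)}"
        return part2.get(ref)
    return part3.get(address)

nan = float("nan")
print(resolve_tract(1404.0, 41.9, -87.7, "any"))
print(resolve_tract(nan, 41.860, -87.632, "any"))
print(resolve_tract(nan, nan, nan, "2533 S CALIFORNIA AVE"))
```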


In [120]:
len(part1_df)

1671

In [121]:
[i for i in part1_df['Census_Tract_Match'] if int(i) not in check]

[]

In [122]:
[i for i in part2_df['Census_Tract_Match'] if int(i) not in check]

[]

In [123]:
[i for i in part3_df['Census_Tract_Match'] if int(i) not in check]

[]

In [124]:
tract_li=[]
for df in [part1_df,part2_df,part3_df]:
    for i in df['Census_Tract_Match']:
        tract_li.append(i)
tract_li=list(set(tract_li))

In [125]:
len(tract_li)

809

In [126]:
[i for i in tract_li if int(i) not in check]

[]

In [128]:
np.savetxt('../data/interim/list.csv', tract_li, delimiter=',', fmt='%s')

***