In [2]:
import pandas as pd

I took this data from the [City of Chicago](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-ZIP-Codes/gdcf-axmw) data site.

In [10]:
zip_codes_df = pd.read_csv('Zip_Codes.csv')

In [11]:
zip_codes_df.head()

| | the_geom | OBJECTID | ZIP | SHAPE_AREA | SHAPE_LEN |
|---|---|---|---|---|---|
| 0 | MULTIPOLYGON (((-87.67762151065281 41.91775780... | 33 | 60647 | 106052300.0 | 42720.044406 |
| 1 | MULTIPOLYGON (((-87.72683253163021 41.92264626... | 34 | 60639 | 127476100.0 | 48103.782721 |
| 2 | MULTIPOLYGON (((-87.78500237831095 41.90914785... | 35 | 60707 | 45069040.0 | 27288.609612 |
| 3 | MULTIPOLYGON (((-87.6670686895295 41.888851884... | 36 | 60622 | 70853830.0 | 42527.989679 |
| 4 | MULTIPOLYGON (((-87.70655631674127 41.89555340... | 37 | 60651 | 99039620.0 | 47970.140153 |


I want to see what exactly this ```the_geom``` object is because it looks like it designates the coordinates of each ZIP code, but I need to see what form it takes.

In [12]:
zip_codes_df.loc[0,"the_geom"]

'MULTIPOLYGON (((-87.67762151065281 41.917757801062976, -87.67760690174252 41.91734720847048, -87.67760511932958 41.91730238504324, -87.67760063005984 41.917188199438144, -87.67759765058332 41.916991119379674, -87.67759310412458 41.91669230188045, -87.6775879922483 41.91640204306182, -87.67758757917139 41.91637778172058, -87.677586292912 41.91623784614856, -87.67758575813848 41.9161788147321, -87.67758053649547 41.91594236952244, -87.677578323991 41.91584265904772, -87.67757444476528 41.91567606225867, -87.67757172877451 41.915553763898004, -87.6775700490947 41.915473787467334, -87.67756756301671 41.91536131475606, -87.67756217455818 41.91510527489929, -87.67755693540155 41.91485617857848, -87.67755510345481 41.91471571851807, -87.6775517264733 41.91445700077503, -87.67754655234228 41.914230270407344, -87.67754521366163 41.91417082274463, -87.67754237233984 41.91404641051031, -87.67754180121058 41.914016083576996, -87.67753818696835 41.91382720489476, -87.67753749882326 41.913790346093

Each ```the_geom``` object is just a string that starts with ```'MULTIPOLYGON (((``` and ends with ```)))```. Everything in between seems to be pairs of coordinates: the two values within a pair are separated by whitespace, and the pairs are separated from each other by commas. I'm not particularly interested in ```OBJECTID```, ```SHAPE_AREA``` or ```SHAPE_LEN```, so I will drop those columns.
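To make that structure concrete, here is a minimal sketch on a made-up, shortened string. The ```toy_geom``` value and the ```parse_single_polygon``` helper are hypothetical and not taken from the dataset, and the sketch assumes a single polygon with one ring, which is the form described above.

```python
# Hypothetical, shortened example of the format described above (not real data).
toy_geom = 'MULTIPOLYGON (((-87.68 41.92, -87.67 41.91, -87.66 41.92, -87.68 41.92)))'


def parse_single_polygon(geom):
    """Turn a one-ring MULTIPOLYGON string into a list of (longitude, latitude) tuples."""
    # Remove the wrapper so only "longitude latitude" pairs remain.
    body = geom.replace('MULTIPOLYGON (((', '').replace(')))', '')
    pairs = []
    for pair in body.split(','):
        # Pairs are comma-separated; the two values inside a pair are whitespace-separated.
        longitude, latitude = pair.split()
        pairs.append((float(longitude), float(latitude)))
    return pairs


toy_pairs = parse_single_polygon(toy_geom)
print(toy_pairs[0])  # (-87.68, 41.92)
```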

In [13]:
zip_codes_df = zip_codes_df.drop(columns=['OBJECTID', 'SHAPE_AREA', 'SHAPE_LEN'])

In [14]:
# Strip the MULTIPOLYGON wrapper so only the coordinate list remains.
# regex=False treats the parentheses as literal characters.
zip_codes_df['set_of_coordinates'] = (
    zip_codes_df['the_geom']
    .str.replace('MULTIPOLYGON (((', '', regex=False)
    .str.replace(')))', '', regex=False)
)

In [16]:
zip_codes_df = zip_codes_df.drop(columns='the_geom')

Now I need to figure out a way to extract all the coordinates. Hopefully the coordinates are in the correct order. I'll assume they are and go from there. (See R notebook: it worked!)

I need to:

1) For each ZIP code, take the string that is in the set of coordinates

2) Split the set of coordinates by the comma

3) Extract the longitude (the value around -87) and the latitude (the value around 41)

I will do all this and append that data into a new data frame.
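The three steps above can also be sketched without an explicit loop. This is only one possible approach, shown on a made-up two-row frame (```toy_df``` and ```toy_coords_df``` are hypothetical names, and the coordinate values are invented), and it assumes every pair is a clean ```longitude latitude``` string with no leftover parentheses.

```python
import pandas as pd

# Hypothetical two-row stand-in for zip_codes_df after the cleanup above.
toy_df = pd.DataFrame({
    'ZIP': [60647, 60639],
    'set_of_coordinates': ['-87.68 41.92, -87.67 41.91', '-87.73 41.92, -87.72 41.93'],
})

# Steps 1 and 2: split each string on commas and give every pair its own row.
pairs = toy_df.assign(pair=toy_df['set_of_coordinates'].str.split(',')).explode('pair')

# Step 3: split each pair on whitespace into longitude and latitude.
lon_lat = pairs['pair'].str.split(expand=True).astype(float)

toy_coords_df = pd.DataFrame({
    'zip': pairs['ZIP'].to_numpy(),
    'longitude': lon_lat[0].to_numpy(),
    'latitude': lon_lat[1].to_numpy(),
})
print(toy_coords_df)
```

Because ```explode``` keeps the rows in their original order, the points of each ZIP code stay in the sequence in which they appear in the string.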

In [35]:
output_columns = ['zip','longitude','latitude']

zip_code_coords_df = pd.DataFrame(columns=output_columns)

In [23]:
zip_code_list = list(zip_codes_df['ZIP'].unique())

Yeah, there is a crazy amount of replacing to be done here. I noticed that there were a couple of strange ZIP codes whose lines weren't exactly in the form I expected; the end result was a latitude that had a trailing ```)```. I looked up those ZIP codes in the raw data but couldn't figure out why they were being processed strangely. There *is* a lot of text, so it's quite possible that I missed something. Consequently, there may be something wrong with the visualization of one or two of the ZIP codes, but for the most part it's indistinguishable from what you can see online, so I'll ignore that part for now.
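One possible explanation, offered as an assumption rather than something verified against this file: in the WKT text format, a ```MULTIPOLYGON``` can hold several polygons, separated by ```)), ((```, and a polygon can hold several rings, separated by ```), (```. Stripping only the leading ```MULTIPOLYGON (((``` and the final ```)))``` would leave those inner parentheses in place. The sketch below uses a made-up two-part string (```toy_multi``` is hypothetical) to show the effect, and then captures each ring separately with a regular expression so that points from different parts are not joined into one outline.

```python
import re

# Hypothetical MULTIPOLYGON with two separate parts (not real data).
toy_multi = ('MULTIPOLYGON (((-87.68 41.92, -87.67 41.91, -87.68 41.92)), '
             '((-87.60 41.80, -87.59 41.79, -87.60 41.80)))')

# Stripping only the outer prefix and suffix leaves inner parentheses behind.
stripped = toy_multi.replace('MULTIPOLYGON (((', '').replace(')))', '')
leftover = [pair.strip() for pair in stripped.split(',') if '(' in pair or ')' in pair]
print(leftover)  # ['-87.68 41.92))', '((-87.60 41.80']

# Each innermost (...) group is one ring; capture the rings separately.
rings = re.findall(r'\(([^()]+)\)', toy_multi)
parsed_rings = [
    [tuple(float(value) for value in pair.split()) for pair in ring.split(',')]
    for ring in rings
]
print(len(parsed_rings))  # 2
```

If this is what happens in the real data, removing every parenthesis would still give valid numbers, but the points of separate parts would end up in one continuous sequence, which could account for an odd-looking outline.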

In [40]:
# Collect rows in a plain list and build the data frame once at the end.
# DataFrame.append was removed in pandas 2.0 and was slow when called in a loop.
rows = []

for zip_code in zip_code_list:

    # Take the coordinate string for this ZIP code.
    coordinate_string = zip_codes_df.loc[zip_codes_df['ZIP'] == zip_code, 'set_of_coordinates'].values[0]

    for pair in coordinate_string.split(','):
        # Drop any leftover parentheses, then split the pair on whitespace.
        longitude, latitude = pair.replace('(', '').replace(')', '').split()
        rows.append([zip_code, float(longitude), float(latitude)])

zip_code_coords_df = pd.DataFrame(rows, columns=output_columns)

zip_code_coords_df.to_csv('chicago_county_zip_codes.csv', sep=',', index=False)

Actually, it should be Cook County (in the file name)!