# RQ3 Prep: Extracting Zip Code Data
## Part 2: Making a Dictionary

In the last part we extracted all of the relevant coordinates from the KML file. In this notebook:

- The text is formatted with `str.replace()` to look like a dictionary
- The coordinates are formatted correctly
- We filter it to keep only the 830 postal codes in our dataset, instead of the roughly 33,000 in the full KML file
- The resulting dictionary is written to a text file to be accessed in other Jupyter Notebooks

In [13]:
import os 
import sys

module_path = os.path.abspath(os.path.join('../../data/..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [14]:
#import the tags created in part 1 of the prep
with open("../../data/prep/coordsXML.txt", "r") as file:
    lines = file.readlines()

This cell cleans the tags to make a list. For example, 

```
['<at><openparen>00601<closeparen>']
```
becomes
```
"00601":
```

and for the coordinates, 
```
<coordinates>-66.835256,18.209981,0.0 ..... -66.836246,18.209447,0.0</coordinates>
```
becomes
```
"[[-66.835256,18.209981],....,[-66.836246,18.209447]]",
```

In [15]:
cleanedText =[]

for line in lines:
    newline = line.replace("['<at><openparen>", '"')
    newline = newline.replace("<closeparen>\']\n", '":')
    newline = newline.replace("['", '"[[')
    newline = newline.replace(",0.0 ", '],[')
    newline = newline.replace(",0.0\']\n", ']]"')
    cleanedText.append(newline)
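
To see what these replacements do, the sketch below applies them to one postcode line and one coordinate line. The two sample strings are assumptions about how the raw lines in `coordsXML.txt` look, inferred from the patterns being replaced, and `clean_line` is a hypothetical helper name that does not appear in the notebook.

```python
def clean_line(line):
    """Apply the same chain of replacements as the cell above to one line."""
    newline = line.replace("['<at><openparen>", '"')
    newline = newline.replace("<closeparen>']\n", '":')
    newline = newline.replace("['", '"[[')
    newline = newline.replace(",0.0 ", '],[')
    newline = newline.replace(",0.0']\n", ']]"')
    return newline

# Assumed sample lines (not copied from the real file)
sample_code = "['<at><openparen>00601<closeparen>']\n"
sample_coords = "['-66.835256,18.209981,0.0 -66.836246,18.209447,0.0']\n"

cleaned_code = clean_line(sample_code)
cleaned_coords = clean_line(sample_coords)
print(cleaned_code)    # "00601":
print(cleaned_coords)  # "[[-66.835256,18.209981],[-66.836246,18.209447]]"
```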

The cell above leaves the postcode and its coordinates on separate lines. The cell below combines each pair into a single line like
```
"00601":"[[-66.835256,18.209981],....,[-66.836246,18.209447]]",
```

In [16]:
codeCoords = []

i = 0
while i < len(cleanedText):
    combine = cleanedText[i] + cleanedText[i+1] + ","
    codeCoords.append(combine)
    i = i+2
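
The pairing step could also be written with slicing and `zip`, which avoids manual index arithmetic. This is only a sketch on a small hand-made list: `sample_cleaned` and `paired` are illustrative names, and the second postcode's coordinates are made up.

```python
# Hand-made stand-in for cleanedText: postcode lines alternate with coordinate lines
sample_cleaned = [
    '"00601":', '"[[-66.835256,18.209981],[-66.836246,18.209447]]"',
    '"00602":', '"[[-67.2,18.3],[-67.1,18.4]]"',  # made-up coordinates
]

# Even positions hold postcodes, odd positions hold coordinates
paired = [
    code + coords + ","
    for code, coords in zip(sample_cleaned[0::2], sample_cleaned[1::2])
]

for entry in paired:
    print(entry)
```

One difference to keep in mind: if the list has an odd number of elements, `zip` silently drops the last one, whereas the `while` loop raises an `IndexError`.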
    

However, some postal codes are made up of multiple "islands", so after the pairing above some coordinate lists are left behind on their own line.

The cell below checks whether the second character of each line (the one after the opening quotation mark) is a digit. If it is, the line starts with a postal code and is already in the correct format. If it is not, the line is an island that was left behind, so it is appended to the line before it, which holds the postal code it belongs to.

In [17]:
multiLists = []

for checking in codeCoords: #the line we're checking
    if checking[1].isdigit():
        line = checking #fine as it is if it's a postcode
    else:
        line = multiLists.pop() + checking #combine with the line before, removing the old version
    multiLists.append(line) #add the new (possibly combined) line
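
As a small illustration of the merge, the sketch below runs the same digit check on three made-up lines. The shape of the leftover island line is an assumption based on how the pairing step concatenates two coordinate lines; `sample_pairs` and `merged` are illustrative names.

```python
# Made-up stand-in for codeCoords: the second entry is a leftover "island" line
sample_pairs = [
    '"00601":"[[1.0,2.0]]",',
    '"[[3.0,4.0]]""[[5.0,6.0]]",',
    '"00602":"[[7.0,8.0]]",',
]

merged = []
for entry in sample_pairs:
    if entry[1].isdigit():
        merged.append(entry)   # starts with a postcode: keep as its own line
    else:
        merged[-1] += entry    # leftover island: attach to the previous line

print(len(merged))  # 2
print(merged[0])
```

The island line ends up attached to `00601`, and `00602` stays on its own line.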

The combinations left some brackets misaligned, so this cell completes the last of the string replacements.

In [18]:
finalList = []

for line in multiLists:
    new = line.replace(']]","[[', '],[')
    new = new.replace('""', ",")
    new = new.replace(']],[[', '],[')
    finalList.append(new)
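
To make the three replacements easier to follow, here is a sketch that applies them one at a time to a single line. The sample line is an assumed shape for a postcode with two islands appended to it, not a line from the real data.

```python
# Assumed shape of a line after an island has been appended to its postcode
merged_line = '"00601":"[[1.0,2.0]]","[[3.0,4.0]]""[[5.0,6.0]]",'

step1 = merged_line.replace(']]","[[', '],[')  # join the first island to the main list
step2 = step1.replace('""', ",")               # adjacent quotes between islands become a comma
step3 = step2.replace(']],[[', '],[')          # flatten the remaining list boundary

print(step3)  # "00601":"[[1.0,2.0],[3.0,4.0],[5.0,6.0]]",
```

The result is one flat list of points, so the boundaries between the individual islands are no longer recorded.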

The cleaned list above contains every postcode in the US, so now we have to keep only the relevant codes.

This cell takes our business table and keeps only the rows located in the USA. It then collects the unique US postcodes in our dataset into a list.

In [19]:
import pandas as pd

df = pd.read_csv("../../data/raw/yelp_business.csv")

#the 50 state abbreviations plus DC, taken from https://gist.github.com/JeffPaine/3083347
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

USA = df[df['state'].isin(states)] #from our dataframe, take all the US areas
listCodeswithnan = USA['postal_code'].unique().tolist() #get all the unique US postcodes in the dataframe and make a list
listCodes = [i for i in listCodeswithnan if str(i) != 'nan'] #remove nan values
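
The same selection can be sketched on a few made-up rows. This version assumes it is acceptable to read `postal_code` explicitly as text (so a code such as `02139` keeps its leading zero) and uses `dropna()` in place of the string comparison against `'nan'`. The CSV contents, `sample_df`, `us_states` and `us_codes` are all illustrative.

```python
import io
import pandas as pd

# Made-up rows standing in for yelp_business.csv
csv_text = """state,postal_code
NV,89121
ON,M5V 2T6
MA,02139
NV,
NV,89121
"""

# dtype=str keeps codes such as 02139 as text, so the leading zero survives
sample_df = pd.read_csv(io.StringIO(csv_text), dtype={"postal_code": str})

us_states = ["NV", "MA"]  # shortened list, for the example only
us_codes = (
    sample_df.loc[sample_df["state"].isin(us_states), "postal_code"]
    .dropna()   # drop rows with no postal code
    .unique()
    .tolist()
)
print(us_codes)  # ['89121', '02139']
```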

This cell selects all the relevant postcodes from our master list of cleaned codes.

In [20]:
sortYelpCodes = []
for i in finalList:
    if i[1:6] in listCodes: #[1:6] is where our postcode is in the string
        sortYelpCodes.append(i)
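
The filter relies on characters 1 to 5 of each line being the postcode, and on the dataset's codes being strings. A minimal sketch of the same idea on made-up data follows. It converts the list of codes to a `set` first, which is an optional change that keeps each membership test fast when the master list is long; `sample_lines`, `sample_codes`, `wanted` and `kept` are illustrative names.

```python
# Made-up stand-ins for finalList and listCodes
sample_lines = [
    '"00601":"[[1.0,2.0],[3.0,4.0]]",',
    '"89121":"[[5.0,6.0],[7.0,8.0]]",',
]
sample_codes = ["89121", "02139"]

wanted = set(sample_codes)  # set lookups do not scan the whole collection
kept = [line for line in sample_lines if line[1:6] in wanted]

print(kept)
```

Only the `89121` line is kept, because `00601` is not among the sample codes.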

The very last entry in the list ends with a comma, which we don't need.

In [21]:
#take the last entry in the list and drop its final character (the trailing comma)
sortYelpCodes[-1] = sortYelpCodes[-1][:-1]

Finally, we make it all one line and add curly brackets to make it look like a dictionary.

In [24]:
yelpCodesFinal = "".join(sortYelpCodes) #make it all one line
makeDict = "{" + yelpCodesFinal +"}" #add brackets to make it a dict
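
Dropping the trailing comma and adding the braces could also be done in a single expression. The sketch below assumes every entry still ends with a comma and uses made-up entries; `sample_entries` and `dict_text` are illustrative names.

```python
# Made-up stand-in for sortYelpCodes; every entry still ends with a comma
sample_entries = [
    '"00601":"[[1.0,2.0],[3.0,4.0]]",',
    '"89121":"[[5.0,6.0],[7.0,8.0]]",',
]

# Join first, then strip the trailing comma before adding the braces
dict_text = "{" + "".join(sample_entries).rstrip(",") + "}"
print(dict_text)
```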

Write it to a text file to be accessed by the Basemap notebook.

In [25]:
with open("../../data/prep/yelpCodes.txt", "w") as yc:
    yc.write(makeDict)

### Example

In the Basemap Notebook, we will take this text file and evaluate it to make it a dictionary. We can then look up any postcode from our data and it will give us all the coordinates needed to draw that postcode's boundary.
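
Because the file contains only string literals, one option when reading it back is `ast.literal_eval` from the standard library, which parses literals without executing arbitrary code. The sketch below uses a shortened stand-in for the file contents. It also shows that each value is itself a quoted string, so it needs a second parse before it can be used as a list of points.

```python
import ast

# Shortened stand-in for the contents of yelpCodes.txt (first two points of 89121)
sample_text = '{"89121":"[[-115.119175,36.11475],[-115.11912,36.11097]]"}'

codes = ast.literal_eval(sample_text)       # parses literals only, runs no code
boundary_text = codes["89121"]              # still a string at this point
boundary = ast.literal_eval(boundary_text)  # now a list of [longitude, latitude] pairs

print(type(boundary_text).__name__)  # str
print(boundary[0])                   # [-115.119175, 36.11475]
```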

For example, the coordinates below trace the outline of postcode 89121 (in Nevada):

In [26]:
example = eval(makeDict)
example['89121']

'[[-115.119175,36.11475],[-115.11912,36.11097],[-115.117069,36.111026],[-115.117031,36.109484],[-115.119122,36.109976],[-115.119067,36.1073],[-115.11674,36.107327],[-115.116743,36.103634],[-115.118992,36.103612],[-115.118925,36.099947],[-115.109963,36.099981],[-115.109469,36.09725],[-115.104396,36.09727],[-115.103825,36.09563],[-115.101015,36.0957],[-115.101037,36.100034],[-115.091771,36.100168],[-115.073216,36.100281],[-115.0732,36.099001],[-115.071308,36.098982],[-115.070106,36.099785],[-115.066353,36.099742],[-115.066324,36.100251],[-115.063931,36.100234],[-115.064097,36.102993],[-115.060349,36.102875],[-115.061336,36.103425],[-115.064124,36.103402],[-115.064959,36.118632],[-115.062616,36.118627],[-115.060093,36.11957],[-115.060182,36.121334],[-115.062633,36.122301],[-115.065124,36.122277],[-115.065432,36.129598],[-115.060742,36.129652],[-115.060713,36.137033],[-115.065438,36.136967],[-115.06559,36.13697],[-115.065524,36.142332],[-115.078675,36.142272],[-115.082936,36.143008],[-115.