# RQ3 Prep: Extracting Zip Code Data
## Part 1: Cleaning the KML file

In order to draw lines on Basemap to represent each Zip Code, we need information about the exact coordinates of where these boundaries lie on a US map. The only Zip Code information that could be found online was a KML file from the US Census Bureau.

This file has over 5 million lines, most of which are irrelevant to our project.

This notebook extracts each relevant tag: the zip code and all of the coordinates detailing its boundary. As this notebook takes more than 20 minutes to run, the results are saved to a text file for easy access later.

In [1]:
import os 
import sys

module_path = os.path.abspath(os.path.join('../../data/..'))
if module_path not in sys.path:
    sys.path.append(module_path)

We will use BeautifulSoup to find all the tags we need. The original KML file was renamed and changed to an XML file so that BeautifulSoup could interpret it. All of the information inside has been preserved.

In [2]:
import bs4 # import BeautifulSoup 4

# open the full KML file (saved as XML) and read its contents
with open("../../data/raw/fullInfo.xml") as f:
    html = f.read()

soup = bs4.BeautifulSoup(html, 'lxml-xml') # parse the contents with the lxml XML parser

This KML file contains 33,144 "Placemark" tags, each of which represents an individual postcode.

We are interested in two parts:
- The 'name' tag: In the example below we can see '00601' inside, which is a postcode.
- The 'coordinates' tag: This tag contains a list of comma-separated coordinates, each of which represents a point on a map. When each point is connected by a line to the next in the list, it forms a shape: the boundary of the postcode.
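
As a minimal sketch of how these two parts could be turned into usable values: the helper names `extract_zip` and `parse_coordinates` are hypothetical, and the sample coordinate values are made up. It assumes the standard KML layout, in which each point is a `lon,lat[,alt]` tuple and tuples are separated by whitespace.

```python
import re

def extract_zip(name_text):
    """Return the first 5-digit run in a <name> string, or None if there is none."""
    match = re.search(r"\d{5}", name_text)
    return match.group(0) if match else None

def parse_coordinates(text):
    """Turn a KML coordinate string into a list of (longitude, latitude) pairs."""
    points = []
    for token in text.split():      # tuples are separated by whitespace
        parts = token.split(",")    # each tuple is lon,lat[,alt]
        points.append((float(parts[0]), float(parts[1])))
    return points

# Hypothetical inputs shaped like the tags described above.
zip_code = extract_zip("<at><openparen>00601<closeparen>")
ring = parse_coordinates("-66.0,18.0,0.0 -66.5,18.0,0.0 -66.5,18.5,0.0 -66.0,18.0,0.0")
```

Dropping the altitude keeps only what is needed to draw a 2D boundary; the first and last points of a ring coincide, which closes the shape.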

In [None]:
soup.find('Placemark')

<Placemark id="cb_2016_us_zcta510_500k.kml">
<name>&lt;at&gt;&lt;openparen&gt;00601&lt;closeparen&gt;</name>
<visibility>1</visibility>
<description>&lt;center&gt;&lt;table&gt;&lt;tr&gt;&lt;th colspan='2' align='center'&gt;&lt;em&gt;Attributes&lt;/em&gt;&lt;/th&gt;&lt;/tr&gt;&lt;tr bgcolor="#E3E3F3"&gt;
&lt;th&gt;ZCTA5CE10&lt;/th&gt;
&lt;td&gt;00601&lt;/td&gt;
&lt;/tr&gt;&lt;tr bgcolor=""&gt;
&lt;th&gt;AFFGEOID10&lt;/th&gt;
&lt;td&gt;8600000US00601&lt;/td&gt;
&lt;/tr&gt;&lt;tr bgcolor="#E3E3F3"&gt;
&lt;th&gt;GEOID10&lt;/th&gt;
&lt;td&gt;00601&lt;/td&gt;
&lt;/tr&gt;&lt;tr bgcolor=""&gt;
&lt;th&gt;ALAND10&lt;/th&gt;
&lt;td&gt;166659884&lt;/td&gt;
&lt;/tr&gt;&lt;tr bgcolor="#E3E3F3"&gt;
&lt;th&gt;AWATER10&lt;/th&gt;
&lt;td&gt;799293&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;&lt;/center&gt;</description>
<styleUrl>#KMLStyler</styleUrl>
<ExtendedData>
<SchemaData schemaUrl="#kml_schema_ft_cb_2016_us_zcta510_500k">
<SimpleData name="ZCTA5CE10">00601</SimpleData>
<SimpleData name="AFFGEOID10">86000

The code below uses BeautifulSoup to find each Placemark tag, extract the name, then extract the coordinates.

Sometimes a postcode will have multiple coordinate lists, arranged differently inside its Placemark tag. This happens if a postcode is made up of multiple 'islands'. The code below accounts for this.

In [None]:
bigList = "" # All relevant tags are accumulated in this master string

allSections = soup.find_all('Placemark') # Find each individual 'Placemark' tag

for section in allSections:
    postcode = section.find('name') # The postcode is a string inside a <name> tag
    bigList += "%s\n" % postcode.contents # add the contents of the tag to the master string

    multi = section.find('MultiGeometry') # present only if the postcode has multiple 'islands'
    if multi is not None:
        for poly in multi.find_all('Polygon'): # visit every <Polygon>, one per island
            outer = poly.find('outerBoundaryIs') # find <outerBoundaryIs> inside <Polygon>
            lin = outer.find('LinearRing') # find <LinearRing> inside <outerBoundaryIs>
            coords = lin.find('coordinates') # find <coordinates> inside the <LinearRing>
            bigList += "%s\n" % coords.contents # add this island's coordinates to the master string
    else: # the section does not have multiple islands
        coord = section.find('coordinates') # find the single coordinates tag
        bigList += "%s\n" % coord.contents # add to the master string
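
The island handling can be checked on a small input. The sketch below is an illustration under stated assumptions, not the notebook's method: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra packages, the miniature KML document and its coordinate values are made up, and `extract_boundaries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

KML_URI = "http://www.opengis.net/kml/2.2"
KML_NS = {"kml": KML_URI}

# Hypothetical miniature KML: one single-polygon postcode and one postcode
# made of two 'islands'. All values are invented for illustration.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>00001</name>
      <Polygon><outerBoundaryIs><LinearRing>
        <coordinates>0,0,0 1,0,0 1,1,0 0,0,0</coordinates>
      </LinearRing></outerBoundaryIs></Polygon>
    </Placemark>
    <Placemark>
      <name>00002</name>
      <MultiGeometry>
        <Polygon><outerBoundaryIs><LinearRing>
          <coordinates>2,2,0 3,2,0 3,3,0 2,2,0</coordinates>
        </LinearRing></outerBoundaryIs></Polygon>
        <Polygon><outerBoundaryIs><LinearRing>
          <coordinates>5,5,0 6,5,0 6,6,0 5,5,0</coordinates>
        </LinearRing></outerBoundaryIs></Polygon>
      </MultiGeometry>
    </Placemark>
  </Document>
</kml>"""

def extract_boundaries(kml_text):
    """Map each Placemark name to a list of coordinate strings, one per polygon."""
    root = ET.fromstring(kml_text)
    boundaries = {}
    for placemark in root.iter(f"{{{KML_URI}}}Placemark"):
        name = placemark.find("kml:name", KML_NS).text.strip()
        # Matches outer rings whether the Polygon sits directly under the
        # Placemark or inside a MultiGeometry.
        rings = placemark.findall(
            ".//kml:Polygon/kml:outerBoundaryIs/kml:LinearRing/kml:coordinates",
            KML_NS,
        )
        boundaries[name] = [ring.text.strip() for ring in rings]
    return boundaries

boundaries = extract_boundaries(SAMPLE_KML)
```

A postcode with two islands should yield two separate coordinate strings, while a single-polygon postcode yields one.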

Save the relevant tags to a text file so that we do not have to traverse the full file again.

In [None]:
# write the extracted tags to disk; the file is closed automatically
with open("../../data/prep/coordsXML.txt", "w") as text_file:
    text_file.write(bigList)
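
One way to make use of the saved file is to run the slow extraction only when the file does not exist yet. This is a minimal sketch, assuming a hypothetical helper `load_or_build` and a stand-in `slow_extraction` function in place of the real parsing step; it writes to a temporary directory rather than the project's data folder.

```python
import os
import tempfile

def load_or_build(cache_path, build):
    """Return the cached text if cache_path exists; otherwise call build() and cache its result."""
    if os.path.exists(cache_path):
        with open(cache_path) as cached:
            return cached.read()
    text = build()
    with open(cache_path, "w") as out:
        out.write(text)
    return text

calls = []  # records how many times the slow step actually runs

def slow_extraction():
    """Stand-in for the 20+ minute extraction; returns made-up content."""
    calls.append(1)
    return "00601\n-66.0,18.0,0.0 -66.5,18.0,0.0\n"

demo_path = os.path.join(tempfile.mkdtemp(), "coordsXML.txt")
first = load_or_build(demo_path, slow_extraction)   # runs the extraction and saves it
second = load_or_build(demo_path, slow_extraction)  # served from the saved file
```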