# Data Project - Web Scraping - Collecting Postal Codes from Wiki

Good Reference for Web Scraping: [here](https://www.youtube.com/watch?v=Ewgy-G9cmbg&list=PLFCB5Dp81iNWRZu_TqtS5NPYvyfcyrD3F&index=3)

Starting page for Postal Code data: [here](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_A)

BeautifulSoup Documentation: [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all)

For any web scraping project, check [insert webpage]/robots.txt first to see what scraping the site permits.
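
The robots.txt check can also be done in code. The sketch below uses the standard-library `urllib.robotparser`. The sample rules, the `my-scraper` user agent, and the `example.org` URLs are made-up placeholders for illustration, not Wikipedia's actual policy.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_RULES, 'my-scraper', 'https://example.org/wiki/Some_page'))
print(is_allowed(SAMPLE_RULES, 'my-scraper', 'https://example.org/private/data'))
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the real file instead of parsing a string.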

## What is the problem that we are solving?

For our AMC project, we need Canadian address data. Instead of taking a random subset of fake addresses, we decided to use Statistics Canada's ODA [Open Database of Addresses]. The downside of this dataset is that the data providers did not include postal codes. Therefore, we are trying to get a rough approximation of postal codes without using a Google Maps API.

In short, we need postal codes for our database, and we are choosing to web scrape them from Wikipedia.

## Importing Libraries:

In [1]:
# Importing Libraries:

#Standard Data Imports [i.e. Data Manipulation, etc.]
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

#Standard Plotting Libraries
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

#Modification of seaborn background
sns.set_style('whitegrid')

#Standard Stats Library
from scipy import stats

#Command for showing plots in jupyter notebooks
%matplotlib inline

In [2]:
# Importing Scraping Libraries
from bs4 import BeautifulSoup as bs
import requests

### Task 1 - Figure out what to scrape from the wiki page (save in python dictionary)

Scrape the Urban table, with postal codes and cities, into a dictionary.

#### Task 1.1 - Load the webpage

In [3]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_A')

#### Task 1.2 - Convert webpage to a beautiful soup object

In [48]:
# Name the parser explicitly so bs4 does not have to guess one (and warn about it)
soup = bs(r.content, 'html.parser')

# Print out the HTML
contents = soup.prettify()
print(contents)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: A - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebr

#### Task 1.3 - Find the table where all the Urban postal codes are - put it into a variable and print

In [47]:
urban_table = soup.find('table', style = "border-collapse: collapse; border: 1px solid #ccc")

print(urban_table.prettify())

<table cellpadding="2" rules="all" style="border-collapse: collapse; border: 1px solid #ccc" width="100%">
 <tbody>
  <tr>
   <td valign="top" width="20%">
    <span style="line-height: 125%">
     <b>
      A1A
     </b>
     <br/>
     <a href="/wiki/St._John%27s,_Newfoundland_and_Labrador" title="St. John's, Newfoundland and Labrador">
      St. John's
     </a>
     <br/>
     North
    </span>
   </td>
   <td valign="top" width="20%">
    <span style="line-height: 125%">
     <b>
      A2A
     </b>
     <br/>
     <a href="/wiki/Grand_Falls-Windsor" title="Grand Falls-Windsor">
      Grand Falls-Windsor
     </a>
    </span>
   </td>
   <td valign="top" width="20%">
    <span style="line-height: 125%">
     <b>
      A5A
     </b>
     <br/>
     <a href="/wiki/Clarenville" title="Clarenville">
      Clarenville
     </a>
    </span>
   </td>
   <td valign="top" width="20%">
    <span style="line-height: 125%">
     <b>
      A8A
     </b>
     <br/>
     <a href="/wiki/Deer_Lake

It looks like the data we are looking for is in the table rows [tr], and more specifically in the table data [td] tags.

#### Task 1.4 - Start building your dictionary by pulling out all the table rows "tr"

In [46]:
# Pull out all the table rows from the table in question
urban_table_rows = urban_table.find_all('tr')

# Loop through table rows to see the table data structure
for row in urban_table_rows:
    print(row.prettify())

<tr>
 <td valign="top" width="20%">
  <span style="line-height: 125%">
   <b>
    A1A
   </b>
   <br/>
   <a href="/wiki/St._John%27s,_Newfoundland_and_Labrador" title="St. John's, Newfoundland and Labrador">
    St. John's
   </a>
   <br/>
   North
  </span>
 </td>
 <td valign="top" width="20%">
  <span style="line-height: 125%">
   <b>
    A2A
   </b>
   <br/>
   <a href="/wiki/Grand_Falls-Windsor" title="Grand Falls-Windsor">
    Grand Falls-Windsor
   </a>
  </span>
 </td>
 <td valign="top" width="20%">
  <span style="line-height: 125%">
   <b>
    A5A
   </b>
   <br/>
   <a href="/wiki/Clarenville" title="Clarenville">
    Clarenville
   </a>
  </span>
 </td>
 <td valign="top" width="20%">
  <span style="line-height: 125%">
   <b>
    A8A
   </b>
   <br/>
   <a href="/wiki/Deer_Lake,_Newfoundland_and_Labrador" title="Deer Lake, Newfoundland and Labrador">
    Deer Lake
   </a>
  </span>
 </td>
 <td valign="top" width="20%">
  <span style="color: #CCC; line-height: 125%">
   <b>
  

In the table row, where would you find the data we need? It looks like the key for the dictionary is in the table data td -> b tag, and the value for the dictionary is in the table data td -> a tag.

NOTE - what would you want to do with the city modifiers outside the b and a tags? [ex. the first td has St. John's and North]
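
One possible answer to the question above is to keep the modifier as a third field instead of discarding it. The sketch below is only an illustration: `split_cell` is a hypothetical helper, and the sample cell is trimmed from the Urban table output shown earlier. It relies on bs4's `stripped_strings`, which yields each text fragment in a tag with the whitespace trimmed.

```python
from bs4 import BeautifulSoup

# A single cell trimmed from the Urban table output above
SAMPLE_TD = """
<td valign="top" width="20%">
 <span style="line-height: 125%">
  <b>A1A</b><br/>
  <a href="/wiki/St._John%27s,_Newfoundland_and_Labrador">St. John's</a><br/>
  North
 </span>
</td>
"""


def split_cell(td):
    """Return (postal_code, city, modifier) for one <td> cell (hypothetical helper)."""
    # stripped_strings yields every text fragment in the cell, whitespace-trimmed
    parts = list(td.stripped_strings)
    postal_code = parts[0]
    city = parts[1] if len(parts) > 1 else None
    # Anything after the city (e.g. 'North') is kept as a modifier
    modifier = ' '.join(parts[2:]) or None
    return postal_code, city, modifier


td = BeautifulSoup(SAMPLE_TD, 'html.parser').find('td')
print(split_cell(td))
```

Keeping the modifier separate leaves the city name clean for joins while still preserving the extra detail.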

#### Task 1.5 - Loop through the table data [td] to get the information you need into a python dictionary:

    1.) Postal Code table data [td] into dictionary keys
    2.) 'a' tag or 'i' tag data into values of dictionary
        - 'i' = 'Not assigned' and 'Reserved' postal codes.
        - NOTE - I could figure out a way to exclude these, BUT since we are here we might as well scrape them just in case.

In [67]:
# Filter the Webpage table data [td] into a bs4 object
urban_table_data = urban_table.find_all('td')

In [68]:
# Initiate Empty Dictionary
urban_postal_code_dict = {}

# Create a FOR loop that satisfies 1. and 2. from the Task 1.5 declaration above.
for index, data in enumerate(urban_table_data):
    
    #pull the postal code into a variable to be added to the dictionary
    postal_code = data.find('b').get_text()
    
    #logical statement to update the dictionary based on criterion 2. of Task 1.5
    if data.find('i'):
        city_town = data.find('i').get_text()
        urban_postal_code_dict[postal_code] = city_town
    else:
        #pull the cities/towns from the 'a' tags and use the next_element method to filter out href information
        urban_postal_code_dict[postal_code] = data.find('a').next_element
        
# Check the results
urban_postal_code_dict

{'A1A': "St. John's",
 'A2A': 'Grand Falls-Windsor',
 'A5A': 'Clarenville',
 'A8A': 'Deer Lake',
 'A9A': 'Not assigned',
 'A1B': "St. John's",
 'A2B': 'Grand Falls-Windsor',
 'A5B': 'Not assigned',
 'A8B': 'Not assigned',
 'A9B': 'Not assigned',
 'A1C': "St. John's",
 'A2C': 'Not assigned',
 'A5C': 'Not assigned',
 'A8C': 'Not assigned',
 'A9C': 'Not assigned',
 'A1E': "St. John's",
 'A2E': 'Not assigned',
 'A5E': 'Not assigned',
 'A8E': 'Not assigned',
 'A9E': 'Not assigned',
 'A1G': "St. John's",
 'A2G': 'Not assigned',
 'A5G': 'Not assigned',
 'A8G': 'Not assigned',
 'A9G': 'Not assigned',
 'A1H': "St. John's",
 'A2H': 'Corner Brook',
 'A5H': 'Not assigned',
 'A8H': 'Not assigned',
 'A9H': 'Not assigned',
 'A1J': 'Not assigned',
 'A2J': 'Not assigned',
 'A5J': 'Not assigned',
 'A8J': 'Not assigned',
 'A9J': 'Not assigned',
 'A1K': 'Torbay',
 'A2K': 'Not assigned',
 'A5K': 'Not assigned',
 'A8K': 'Not assigned',
 'A9K': 'Not assigned',
 'A1L': 'Paradise',
 'A2L': 'Not assigned',
 'A5

NOTE - All the rough work that was done on this question [i.e. see doc] showed difficulty in pulling the 'a' tag AND filtering the next element out in one step.

It was throwing an error, and the way around it was to treat pulling the 'a' tag and filtering as separate steps: pull the 'a' tags into a bs4 object, then loop over them and put each tag's next_element into a list.

From there we would loop through the keys of the previously built dictionary and replace the values, which were ints, with the indexed list of cities/towns.

In essence we removed a step here, but we also realized that we need to check for cells where multiple 'a' tags are scraped, because taking only the first one might throw off the cities.
    
    - EXAMPLE - In this case we could change A1S from 'St. John's' to 'Goulds'
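
A check for cells with more than one link could look like the sketch below. The sample cell is hypothetical (the real markup for A1S may differ), and `link_texts` is an illustrative helper rather than part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical cell with two links; the real markup for A1S may differ
SAMPLE_TD = """
<td><b>A1S</b><br/>
 <a href="/wiki/St._John%27s">St. John's</a><br/>
 (<a href="/wiki/Goulds">Goulds</a>)
</td>
"""


def link_texts(td):
    """Return the text of every <a> tag in the cell, in document order."""
    return [a.get_text(strip=True) for a in td.find_all('a')]


td = BeautifulSoup(SAMPLE_TD, 'html.parser').find('td')
cities = link_texts(td)
if len(cities) > 1:
    # Flag the cell for a manual decision instead of silently keeping the first link
    print(td.find('b').get_text(strip=True), 'has multiple links:', cities)
```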

### Task 2 - Figure out how to scrape the Rural table for postal codes (save as a list of dictionaries)

#### Task 2.1 - Find the table where all the Rural postal codes are and loop through it to put 'b' and 'p' tags into a dictionary

I think the logic to put this all together is:

    1.) loop through the table data [td]
    2.) find the 'b' tag and make it the key of the dictionary
    3.) find the 'p' tag, iterate it into a list and make it the value of the dictionary

In [63]:
# Find the Rural Table
rural_table = soup.find('table', style= "border-collapse: collapse; border: 1px solid #ccc; line-height: 125%;")

print(rural_table.prettify())

<table cellpadding="2" cellspacing="0" rules="all" style="border-collapse: collapse; border: 1px solid #ccc; line-height: 125%;" width="100%">
 <tbody>
  <tr>
   <td valign="top" width="20%">
    <b>
     A0A
    </b>
    <br/>
    Southeastern
    <a href="/wiki/Avalon_Peninsula" title="Avalon Peninsula">
     Avalon Peninsula
    </a>
    <p>
     <span style="font-size: smaller; line-height: 125%;">
      1A0:
      <a class="mw-redirect" href="/wiki/Aquaforte,_Newfoundland_and_Labrador" title="Aquaforte, Newfoundland and Labrador">
       Aquaforte
      </a>
      <br/>
      1B0:
      <a href="/wiki/Avondale,_Newfoundland_and_Labrador" title="Avondale, Newfoundland and Labrador">
       Avondale
      </a>
      <br/>
      1C0:
      <a href="/wiki/Bay_Bulls,_Newfoundland_and_Labrador" title="Bay Bulls, Newfoundland and Labrador">
       Bay Bulls
      </a>
      <br/>
      1E0:
      <a class="mw-redirect" href="/wiki/Bay_de_Verde,_Newfoundland_and_Labrador" title="Bay de Ve

#### Task 2.2 - Pull out the table data [td] that the Rural FSAs are in.

In [83]:
# Pull out all the table data [td] cells from the rural table in question 
rural_table_data = rural_table.find_all('td')

# Loop through the table data to see its structure
for data in rural_table_data:
    print(data.prettify())

<td valign="top" width="20%">
 <b>
  A0A
 </b>
 <br/>
 Southeastern
 <a href="/wiki/Avalon_Peninsula" title="Avalon Peninsula">
  Avalon Peninsula
 </a>
 <p>
  <span style="font-size: smaller; line-height: 125%;">
   1A0:
   <a class="mw-redirect" href="/wiki/Aquaforte,_Newfoundland_and_Labrador" title="Aquaforte, Newfoundland and Labrador">
    Aquaforte
   </a>
   <br/>
   1B0:
   <a href="/wiki/Avondale,_Newfoundland_and_Labrador" title="Avondale, Newfoundland and Labrador">
    Avondale
   </a>
   <br/>
   1C0:
   <a href="/wiki/Bay_Bulls,_Newfoundland_and_Labrador" title="Bay Bulls, Newfoundland and Labrador">
    Bay Bulls
   </a>
   <br/>
   1E0:
   <a class="mw-redirect" href="/wiki/Bay_de_Verde,_Newfoundland_and_Labrador" title="Bay de Verde, Newfoundland and Labrador">
    Bay de Verde
   </a>
   <br/>
   1G0:
   <a class="mw-redirect" href="/wiki/Bay_Roberts,_Newfoundland_and_Labrador" title="Bay Roberts, Newfoundland and Labrador">
    Bay Roberts
   </a>
   <br/>
   1H0:
 

#### Task 2.3 - Build the functions and loops that will pull and clean the data for your dictionary 

In [92]:
# Initiate the dictionary for storage
rural_postal_code_dict = {}

# String Conversion
def conver_string_to_list(string):
    '''
    This function takes in a string,
    separates it on a newline delimiter,
    and puts the non-empty pieces into a list
    '''
    list_from_string = list(string.split('\n'))
    
    while('' in list_from_string):
        list_from_string.remove('')
    
    return list_from_string

In [98]:
# Create the loop for the table data, that will pull 'b' and 'p' tags, and put them in a dictionary
for data in rural_table_data:
    
    # Pull out the 'b' tag/FSA for that table data section:
    dictionary_key = data.find('b')
    
    # Get just the text to put it in the final dictionary
    dictionary_key = dictionary_key.get_text()
    
    
    # Pull the 'p' tag data and put it in a variable
    if data.find('p'):
        p_tag_data = data.find('p').get_text()
           
        # Convert string to list so you can put it in the values of your dictionary
        dictionary_values = conver_string_to_list(p_tag_data)
    else:
        dictionary_values = data.find('i').get_text()
    
    # Put the keys and values into your dictionary
    rural_postal_code_dict[dictionary_key] = dictionary_values
    
    
    

In [99]:
rural_postal_code_dict

{'A0A': ['1A0: Aquaforte',
  '1B0: Avondale',
  '1C0: Bay Bulls',
  '1E0: Bay de Verde',
  '1G0: Bay Roberts',
  '1H0: Bell Island Front',
  '1J0: Shea Heights',
  '1K0: Brigus',
  '1L0: Broad Cove',
  '1M0: Burnt Point',
  '1N0: Calvert',
  '1P0: Cape Broyle',
  '1R0: Caplin Cove',
  '1S0: Cappahayden',
  '1V0: Chapel Cove',
  "1W0: Clarke's Beach",
  '1X0: Coleys Point South',
  '1Y0: Colliers Riverhead',
  '1Z0: Conception Harbour',
  '2B0: Cupids',
  '2G0: Fermeuse',
  '2H0: Ferryland',
  '2L0: Grates Cove',
  '2M0: Harbour Grace',
  '2N0: Harbour Grace South',
  '2P0: Harbour Main',
  '2R0: Holyrood',
  '2S0: Jobs Cove',
  '2W0: Lower Island Cove',
  '2X0: Makinsons',
  '2Z0: Marysvale',
  '3A0: Mobile',
  '3B0: Northern Bay',
  '3C0: North River',
  '3E0: Ochre Pit Cove',
  '3G0: Old Perlican',
  '3H0: Petty Harbour',
  '3J0: Port de Grave',
  '3L0: Pouch Cove',
  '3M0: Red Head Cove',
  '3N0: Renews',
  '3P0: Riverhead Harbour Grace',
  "3R0: St. Shott's",
  '3S0: Salmon Cove',
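
Each rural entry pairs a three-character suffix with a place name, and a full Canadian postal code is the FSA followed by that suffix (for example `A0A 1A0`). As a minimal sketch, assuming the dictionary has the shape shown above, the entries can be expanded into full codes. `expand_rural_codes` is a hypothetical helper, not part of the notebook.

```python
def expand_rural_codes(rural_dict):
    """Turn {'A0A': ['1A0: Aquaforte', ...]} into [('A0A 1A0', 'Aquaforte'), ...]."""
    rows = []
    for fsa, entries in rural_dict.items():
        if isinstance(entries, str):
            # A bare string (e.g. 'Not in use') has no suffix list to expand
            continue
        for entry in entries:
            # partition() splits on the first ': ' only
            suffix, sep, place = entry.partition(': ')
            if not sep:
                continue  # skip entries that do not match the 'suffix: place' pattern
            rows.append((f'{fsa} {suffix}', place))
    return rows


sample = {'A0A': ['1A0: Aquaforte', '1B0: Avondale'], 'A0Z': 'Not in use'}
print(expand_rural_codes(sample))
```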


### Task 3 - Take the dictionaries and convert them into dataframes

#### Task 3.1 - Convert the Urban dictionary into a dataframe

In [100]:
# Convert the dictionary into a dataframe
urban_postal_df = DataFrame(urban_postal_code_dict.items(), columns = ['FSA','city_town'])

In [101]:
# show the dataframe
urban_postal_df

| | FSA | city_town |
|---|---|---|
| 0 | A1A | St. John's |
| 1 | A2A | Grand Falls-Windsor |
| 2 | A5A | Clarenville |
| 3 | A8A | Deer Lake |
| 4 | A9A | Not assigned |
| ... | ... | ... |
| 95 | A1Z | Not assigned |
| 96 | A2Z | Not assigned |
| 97 | A5Z | Not assigned |
| 98 | A8Z | Not assigned |


#### Task 3.2 - Convert the Rural dictionary into a dataframe

In [111]:
# Convert dictionary into a dataframe
rural_postal_df = DataFrame([(key, value) for key, values in rural_postal_code_dict.items() for value in values], 
                            columns = ['FSA', 'city_town'])

# Another way of doing this conversion:
#     rural_postal_df = DataFrame.from_dict(rural_postal_code_dict, orient = 'index')
# It seems like both approaches have a similar problem: they split the
# 'Not in use' FSAs into separate rows. Either way this is fixable.

In [103]:
#show dataframe
rural_postal_df

| | FSA | city_town |
|---|---|---|
| 0 | A0A | 1A0: Aquaforte |
| 1 | A0A | 1B0: Avondale |
| 2 | A0A | 1C0: Bay Bulls |
| 3 | A0A | 1E0: Bay de Verde |
| 4 | A0A | 1G0: Bay Roberts |
| ... | ... | ... |
| 484 | A0Z | n |
| 485 | A0Z | |
| 486 | A0Z | u |
| 487 | A0Z | s |
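
The single-character rows at the bottom of this table appear because some dictionary values are plain strings rather than lists, and iterating over a Python string yields its characters one at a time. One way to avoid the problem at the source, sketched here with hypothetical helper names (`as_list`, `to_rows`), is to wrap bare strings in a list before flattening. The notebook itself takes a different route and cleans the rows up afterwards in Task 3.3.

```python
def as_list(value):
    """Wrap a bare string in a list so it is treated as a single value."""
    return [value] if isinstance(value, str) else list(value)


def to_rows(postal_dict):
    """Flatten {FSA: string-or-list} into (FSA, city_town) tuples."""
    return [(fsa, item)
            for fsa, values in postal_dict.items()
            for item in as_list(values)]


sample = {'A0R': ['1A0: Churchill Falls'], 'A0Z': 'Not in use'}
print(to_rows(sample))
```

The resulting tuples can be passed to `DataFrame(..., columns = ['FSA', 'city_town'])` in the same way as the list comprehension used in Task 3.2.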


#### Task 3.3 - Cleaning the rural dataframe to consolidate the Not in use FSAs

In [186]:
# Get the length of the City column and put it in a new column
rural_postal_df['string_length'] = rural_postal_df['city_town'].str.len()

In [187]:
# Show DataFrame
rural_postal_df

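| | FSA | city_town | string_length |
|---|---|---|---|
| 0 | A0A | 1A0: Aquaforte | 14 |
| 1 | A0A | 1B0: Avondale | 13 |
| 2 | A0A | 1C0: Bay Bulls | 14 |
| 3 | A0A | 1E0: Bay de Verde | 17 |
| 4 | A0A | 1G0: Bay Roberts | 16 |
| ... | ... | ... | ... |
| 484 | A0Z | n | 1 |
| 485 | A0Z | | 1 |
| 486 | A0Z | u | 1 |
| 487 | A0Z | s | 1 |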


In [206]:
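# .copy() gives an independent frame, so the assignment below does not trigger a SettingWithCopyWarning
not_used_FSAs_df = rural_postal_df[rural_postal_df['string_length'] == 1].copy()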

In [207]:
not_used_FSAs_df = not_used_FSAs_df.drop_duplicates('FSA')

In [208]:
not_used_FSAs_df['city_town'] = 'Not in use'

In [209]:
not_used_FSAs_df

| | FSA | city_town | string_length |
|---|---|---|---|
| 419 | A0S | Not in use | 1 |
| 429 | A0T | Not in use | 1 |
| 439 | A0V | Not in use | 1 |
| 449 | A0W | Not in use | 1 |
| 459 | A0X | Not in use | 1 |
| 469 | A0Y | Not in use | 1 |
| 479 | A0Z | Not in use | 1 |


In [210]:
used_FSAs_df = rural_postal_df[rural_postal_df['string_length'] != 1]

In [211]:
used_FSAs_df

| | FSA | city_town | string_length |
|---|---|---|---|
| 0 | A0A | 1A0: Aquaforte | 14 |
| 1 | A0A | 1B0: Avondale | 13 |
| 2 | A0A | 1C0: Bay Bulls | 14 |
| 3 | A0A | 1E0: Bay de Verde | 17 |
| 4 | A0A | 1G0: Bay Roberts | 16 |
| ... | ... | ... | ... |
| 414 | A0P | 1N0: Postville | 14 |
| 415 | A0P | 1P0: Rigolet | 12 |
| 416 | A0P | 1S0: Happy Valley-Goose Bay | 27 |
| 417 | A0R | 1A0: Churchill Falls | 20 |


In [228]:
rural_postal_df = pd.concat((used_FSAs_df, not_used_FSAs_df))

In [231]:
rural_postal_df = rural_postal_df.reset_index(drop = True)

In [233]:
rural_postal_df = rural_postal_df.drop('string_length', axis = 1)

In [234]:
rural_postal_df

| | FSA | city_town |
|---|---|---|
| 0 | A0A | 1A0: Aquaforte |
| 1 | A0A | 1B0: Avondale |
| 2 | A0A | 1C0: Bay Bulls |
| 3 | A0A | 1E0: Bay de Verde |
| 4 | A0A | 1G0: Bay Roberts |
| ... | ... | ... |
| 421 | A0V | Not in use |
| 422 | A0W | Not in use |
| 423 | A0X | Not in use |
| 424 | A0Y | Not in use |


### Task 4 - Bring everything together so you can loop through Postal Code pages and pull data into a DataFrame

NOTE - you could pull the reference URLs from the table at the bottom of any one of the pages, but for this number of pages it is easier to type them into a list manually. Additionally, the web scraping reference linked at the top of this notebook covers how to pull links from a page.

In [290]:
'''
postal_code_page_urls = ['https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_A',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_B',
                         'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_C',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_E',
                         'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_G',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_H',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_J',
                         'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_K',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_L',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_N',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_P',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_R',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_S',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_T',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_V',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_X',
                        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_Y']
'''

NOTE - yes, you could also have just made a list of the letters and appended each one to a base URL.
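
A minimal sketch of that alternative, with the letters taken from the URL list above (`BASE_URL` and `FSA_LETTERS` are names introduced here for illustration):

```python
BASE_URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_'

# First letters taken from the URL list above
FSA_LETTERS = 'ABCEGHJKLMNPRSTVXY'

postal_code_page_urls = [BASE_URL + letter for letter in FSA_LETTERS]

print(len(postal_code_page_urls))
print(postal_code_page_urls[0])
```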

#### Task 4.1 - Create a Function to loop through pages and get information

CORRECTION - there is more variability in the pages than initially thought, so it is too laborious to loop through them all automatically. It is easier to call the function manually with each URL we need. Furthermore, small adjustments to the function will let us manually scrape the other tables with grouped stylings. The next Tasks achieve this.

In [None]:
# String Conversion
def conver_string_to_list(string):
    '''
    This function takes in a string,
    separates it on a newline delimiter,
    and puts the non-empty pieces into a list
    '''
    list_from_string = list(string.split('\n'))
    
    while('' in list_from_string):
        list_from_string.remove('')
    
    return list_from_string

In [None]:
def get_postal_code_data(url):
    
    r = requests.get(url)
    soup = bs(r.content)

    tables = soup.find_all('table')
    
    # Filter the Webpage table data [td] into a bs4 object
    urban_table_data = tables[0].find_all('td')
    print(urban_table_data)
    
    # Initiate Empty Dictionary
    urban_postal_code_dict = {}

    # Create a FOR loop that satifies 1. and 2. from above Task declaration.
    for index, data in enumerate(urban_table_data):

        #pull the postal code into a variable to be added to the dictionary
        postal_code = data.find('b').get_text()

        #logically statment to update dictionary based on criteria in 2. of declared Task 1.5
        if data.find('i'):
            city_town = data.find('i').get_text()
            urban_postal_code_dict[postal_code] = city_town
        else:
            #pull the cities/towns from the 'a' tags and use the next_element method to filter out href information
            urban_postal_code_dict[postal_code] = data.find('a').next_element
    '''       
    # Find the Rural Table    
    # Pull out all the table rows from the rural table in question 
    rural_table_data = tables[1].find_all('td')
    
    # Initiate the dictionary for storage
    rural_postal_code_dict = {}

    # Create the loop for the table data, that will pull 'b' and 'p' tags, and put them in a dictionary
    for data in rural_table_data:

        # Pull out the 'b' tag/FSA for that table data section:
        dictionary_key = data.find('b')

        # Get just the text to put it in the final dictionary
        dictionary_key = dictionary_key.get_text()


        # Pull the 'p' tag data and put it in a variable
        if data.find('p'):
            p_tag_data = data.find('p').get_text()

            # Convert string to list so you can put it in the values of your dictionary
            dictionary_values = conver_string_to_list(p_tag_data)
        else:
            dictionary_values = data.find('i').get_text()

        # Put the keys and values into your dictionary
        rural_postal_code_dict[dictionary_key] = dictionary_values
'''
    # Convert the dictionary into a dataframe
    urban_postal_df = DataFrame(urban_postal_code_dict.items(), columns = ['FSA','city_town'])
    
        #DELETE AFTER
    combined_postal_data_df = pd.concat((combined_postal_data_df,urban_postal_df))
    combined_postal_data_df = combined_postal_data_df.reset_index(drop = True)
'''   
    # Convert dictionary into a dataframe
    rural_postal_df = DataFrame([(key, value) for key, values in rural_postal_code_dict.items() for value in values], 
                            columns = ['FSA', 'city_town'])
    
    # Get the length of the City column and put it in a new column
    rural_postal_df['string_length'] = rural_postal_df['city_town'].str.len()
    
    not_used_FSAs_df = rural_postal_df[rural_postal_df['string_length'] == 1]
    not_used_FSAs_df = not_used_FSAs_df.drop_duplicates('FSA')
    not_used_FSAs_df['city_town'] = 'Not in use'
    
    used_FSAs_df = rural_postal_df[rural_postal_df['string_length'] != 1]
    rural_postal_df = pd.concat((used_FSAs_df, not_used_FSAs_df))
    rural_postal_df = rural_postal_df.reset_index(drop = True)
    rural_postal_df = rural_postal_df.drop('string_length', axis = 1)
    
    combined_postal_data_df = pd.concat((urban_postal_df,rural_postal_df))
    combined_postal_data_df = combined_postal_data_df.reset_index(drop = True)
''' 
    region_title = soup.find('h2')
    region_title = region_title.get_text()
    region_title = region_title.split(sep = '[')
    
    combined_postal_data_df['FSA_Region'] = region_title[0]
    
    return combined_postal_data_df

In [319]:
# for getting data from pages with Urban and Rural tables separated
def get_postal_code_data(url):
    
    r = requests.get(url)
    soup = bs(r.content, 'html.parser')

    tables = soup.find_all('table')
    
    # Filter the Webpage table data [td] into a bs4 object
    urban_table_data = tables[0].find_all('td')
    print(urban_table_data)
    
    # Initiate Empty Dictionary
    urban_postal_code_dict = {}

    # Create a FOR loop that satisfies 1. and 2. from the Task 1.5 declaration above.
    for index, data in enumerate(urban_table_data):

        #pull the postal code into a variable to be added to the dictionary
        postal_code = data.find('b').get_text()

        #logical statement to update the dictionary based on criterion 2. of Task 1.5
        if data.find('i'):
            city_town = data.find('i').get_text()
            urban_postal_code_dict[postal_code] = city_town
        else:
            #pull the cities/towns from the 'a' tags and use the next_element method to filter out href information
            urban_postal_code_dict[postal_code] = data.find('a').next_element
      
    # Find the Rural Table    
    # Pull out all the table rows from the rural table in question 
    rural_table_data = tables[1].find_all('td')
    
    # Initiate the dictionary for storage
    rural_postal_code_dict = {}

    # Create the loop for the table data, that will pull 'b' and 'p' tags, and put them in a dictionary
    for data in rural_table_data:

        # Pull out the 'b' tag/FSA for that table data section:
        dictionary_key = data.find('b')

        # Get just the text to put it in the final dictionary
        dictionary_key = dictionary_key.get_text()


        # Pull the 'p' tag data and put it in a variable
        if data.find('p'):
            p_tag_data = data.find('p').get_text()

            # Convert string to list so you can put it in the values of your dictionary
            dictionary_values = conver_string_to_list(p_tag_data)
        else:
            dictionary_values = data.find('i').get_text()

        # Put the keys and values into your dictionary
        rural_postal_code_dict[dictionary_key] = dictionary_values

    # Convert the dictionary into a dataframe
    urban_postal_df = DataFrame(urban_postal_code_dict.items(), columns = ['FSA','city_town'])
    
    # Convert dictionary into a dataframe
    rural_postal_df = DataFrame([(key, value) for key, values in rural_postal_code_dict.items() for value in values], 
                            columns = ['FSA', 'city_town'])
    
    # Get the length of the City column and put it in a new column
    rural_postal_df['string_length'] = rural_postal_df['city_town'].str.len()
    
    not_used_FSAs_df = rural_postal_df[rural_postal_df['string_length'] == 1].copy()
    not_used_FSAs_df = not_used_FSAs_df.drop_duplicates('FSA')
    not_used_FSAs_df['city_town'] = 'Not in use'
    
    used_FSAs_df = rural_postal_df[rural_postal_df['string_length'] != 1]
    rural_postal_df = pd.concat((used_FSAs_df, not_used_FSAs_df))
    rural_postal_df = rural_postal_df.reset_index(drop = True)
    rural_postal_df = rural_postal_df.drop('string_length', axis = 1)
    
    combined_postal_data_df = pd.concat((urban_postal_df,rural_postal_df))
    combined_postal_data_df = combined_postal_data_df.reset_index(drop = True)

    region_title = soup.find('h2')
    region_title = region_title.get_text()
    region_title = region_title.split(sep = '[')
    
    combined_postal_data_df['FSA_Region'] = region_title[0]
    
    return combined_postal_data_df
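
The last step of the function takes the text of the first `h2` heading and splits it on `[`, presumably to drop a trailing bracketed suffix such as `[edit]`. The same idea in isolation, as a small sketch (`clean_heading` is a hypothetical helper and the sample strings are illustrative):

```python
def clean_heading(raw_heading: str) -> str:
    """Keep the text before the first '[' and trim surrounding whitespace."""
    return raw_heading.split('[')[0].strip()


print(clean_heading('Newfoundland and Labrador[edit]'))
print(clean_heading('British Columbia'))
```

A heading without a bracket passes through unchanged, so the helper is safe to apply to every page.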

In [285]:
canadian_FSAs_df = DataFrame()

In [340]:
canadian_FSAs_df = pd.concat((canadian_FSAs_df, get_postal_code_data('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_V')))

[<td valign="top" width="11.1%"><b>V1A</b><br/><span style="font-size: smaller; line-height: 125%;"><a href="/wiki/Kimberley,_British_Columbia" title="Kimberley, British Columbia">Kimberley</a></span>
</td>, <td valign="top" width="11.1%"><b>V2A</b><br/><span style="font-size: smaller; line-height: 125%;"><a href="/wiki/Penticton" title="Penticton">Penticton</a></span>
</td>, <td valign="top" width="11.1%"><b>V3A</b><br/><span style="font-size: smaller; line-height: 125%;"><a href="/wiki/Langley,_British_Columbia_(district_municipality)" title="Langley, British Columbia (district municipality)">Langley Township</a><br/>(Langley City)</span>
</td>, <td valign="top" width="11.1%"><b>V4A</b><br/><span style="font-size: smaller; line-height: 125%;"><a href="/wiki/Surrey,_British_Columbia" title="Surrey, British Columbia">Surrey</a><br/>Southwest</span>
</td>, <td valign="top" width="11.1%"><b>V5A</b><br/><span style="font-size: smaller; line-height: 125%;"><a href="/wiki/Burnaby" title="Bu

In [341]:
canadian_FSAs_df

| | FSA | city_town | FSA_Region |
|---|---|---|---|
| 0 | A1A | St. John's | Newfoundland and Labrador |
| 1 | A2A | Grand Falls-Windsor | Newfoundland and Labrador |
| 2 | A5A | Clarenville | Newfoundland and Labrador |
| 3 | A8A | Deer Lake | Newfoundland and Labrador |
| 4 | A9A | Not assigned | Newfoundland and Labrador |
| ... | ... | ... | ... |
| 555 | V0X | 1X0: Rosedale | British Columbia |
| 556 | V0X | 2L0: Tulameen | British Columbia |
| 557 | V0X | 2W0: Princeton | British Columbia |
| 558 | V0Y | Not in use | British Columbia |


In [342]:
canadian_FSAs_df.to_csv('some_canadian_FSA.csv')
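
Two pandas options are worth knowing when combining pages and saving the result. By default `pd.concat` keeps each frame's own row index, and `to_csv` writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below uses two tiny stand-in frames (values copied from the outputs above) to show `ignore_index=True` and `index=False`. It writes to an in-memory buffer rather than to `some_canadian_FSA.csv`.

```python
import io

import pandas as pd

# Two tiny stand-ins for per-page results
page_a = pd.DataFrame({'FSA': ['A1A', 'A2A'],
                       'city_town': ["St. John's", 'Grand Falls-Windsor']})
page_v = pd.DataFrame({'FSA': ['V1A'], 'city_town': ['Kimberley']})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each page's index
combined = pd.concat([page_a, page_v], ignore_index=True)

# index=False leaves the row index out of the file
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
print(buffer.getvalue())
```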

#### Task 4.2 - Adjust your function to pull only table data that looks like the Urban table for Newfoundland. Use it to scrape the table data for the city-oriented postal codes of H and M. Put the results into the same dataframe.

In [377]:
# Adjust the function to get Urban table data only
def get_postal_code_data_urban(url):
    
    r = requests.get(url)
    soup = bs(r.content)

    tables = soup.find_all('table')
    
    # Filter the Webpage table data [td] into a bs4 object
    urban_table_data = tables[0].find_all('td')
    
    # Initiate Empty Dictionary
    urban_postal_code_dict = {}

    # Create a FOR loop that satisfies 1. and 2. from the Task 1.5 declaration above.
    for index, data in enumerate(urban_table_data):

        #pull the postal code into a variable to be added to the dictionary
        postal_code = data.find('b').get_text()

        #logical statement to update the dictionary based on criterion 2. of Task 1.5
        if data.find('i'):
            city_town = data.find('i').get_text()
            urban_postal_code_dict[postal_code] = city_town
        else:
            #The M postal codes page is organized differently from the other Urban tables:
            #some 'td' cells have no 'a' tag, so data.find('a') returns None and
            #data.find('a').next_element raises an AttributeError.
            #Fall back to the text of the 'span' tag in that case.
            try:
                urban_postal_code_dict[postal_code] = data.find('a').next_element
            except AttributeError:
                urban_postal_code_dict[postal_code] = data.find('span').get_text()

    # Convert the dictionary into a dataframe
    urban_postal_df = DataFrame(urban_postal_code_dict.items(), columns = ['FSA','city_town'])
        
    combined_postal_data_df = urban_postal_df
    combined_postal_data_df = combined_postal_data_df.reset_index(drop = True)

    region_title = soup.find('h2')
    region_title = region_title.get_text()
    region_title = region_title.split(sep = '[')
    
    combined_postal_data_df['FSA_Region'] = region_title[0]
    
    return combined_postal_data_df
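
The cell-by-cell fallback used above ('i' tag, then 'a' tag, then 'span') can be tried offline on a small snippet. This is a minimal sketch, not the notebook's function: `SAMPLE_HTML` is a made-up fragment that only imitates the table layout, `extract_city` is a hypothetical helper, and it swaps the `try`/`except` for explicit `None` checks and reads the link text with `get_text()` instead of `next_element`.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating three kinds of table cell (not real page content)
SAMPLE_HTML = """
<table>
<tr>
<td><b>M1A</b><br/><span><i>Not assigned</i></span></td>
<td><b>M3A</b><br/><span><a href="/wiki/North_York">North York</a><br/>(Parkwoods)</span></td>
<td><b>M7A</b><br/><span>Queen's Park</span></td>
</tr>
</table>
"""

def extract_city(cell):
    """Return the city/town text of one 'td' cell, trying 'i', then 'a', then 'span'."""
    italic = cell.find('i')
    if italic is not None:
        return italic.get_text(strip=True)
    link = cell.find('a')
    if link is not None:
        return link.get_text(strip=True)
    span = cell.find('span')
    return span.get_text(strip=True) if span is not None else None

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

fsa_to_city = {}
for cell in soup.find_all('td'):
    bold = cell.find('b')
    if bold is None:
        # Skip cells without a postal code instead of raising AttributeError
        continue
    fsa_to_city[bold.get_text(strip=True)] = extract_city(cell)

print(fsa_to_city)
```

Checking for `None` explicitly makes it clear which tag was missing, whereas a broad `except` can hide unrelated errors.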

In [379]:
# Apply the function to H and M postal codes
canadian_FSAs_df = pd.concat((canadian_FSAs_df, get_postal_code_data_urban('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')))

In [378]:
canadian_FSAs_df

Unnamed: 0,FSA,city_town,FSA_Region
0,A1A,St. John's,Newfoundland and Labrador
1,A2A,Grand Falls-Windsor,Newfoundland and Labrador
2,A5A,Clarenville,Newfoundland and Labrador
3,A8A,Deer Lake,Newfoundland and Labrador
4,A9A,Not assigned,Newfoundland and Labrador
...,...,...,...
175,M5Z,Not assigned,Toronto
176,M6Z,Not assigned,Toronto
177,M7Z,Not assigned,Toronto
178,M8Z,Etobicoke,Toronto
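
The cell above shows a single call for the M page. One possible way to cover several letter pages in one pass is to build each URL from the shared pattern and stack the results. The sketch below is an illustration under assumptions: `scrape_letters` and `fake_scraper` are hypothetical names, and `fake_scraper` returns made-up rows so the example runs without a network connection.

```python
import pandas as pd

# URL pattern shared by the Wikipedia pages used in this notebook
BASE_URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_{}'

def scrape_letters(letters, scraper):
    """Run `scraper` on each letter's page and stack the resulting dataframes."""
    frames = [scraper(BASE_URL.format(letter)) for letter in letters]
    return pd.concat(frames, ignore_index=True)

def fake_scraper(url):
    # Stand-in for a real scraping function: one made-up row per page
    letter = url[-1]
    return pd.DataFrame({
        'FSA': [letter + '1A'],
        'city_town': ['Example'],
        'FSA_Region': ['Example region'],
    })

demo_df = scrape_letters(['H', 'M'], fake_scraper)
print(demo_df)
```

In the notebook, the same helper could be called as `scrape_letters(['H', 'M'], get_postal_code_data_urban)`, assuming both pages follow the layout that function expects.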


#### Task 4.3 - Adjust your function to pull data from postal code pages that merge Rural and Urban postal codes into one table [i.e. C, X, and Y]. Put the pulled data into the same dataframe.

In [380]:
# for getting data from pages with Urban and Rural tables combined
def get_postal_code_data_rural_urban_combined(url):
    
    r = requests.get(url)
    soup = bs(r.content)

    tables = soup.find_all('table')
    
    # Filter the Webpage table data [td] into a bs4 object
    combined_table_data = tables[0].find_all('td')
    
    # Initiate an empty dictionary for the urban postal codes
    urban_postal_code_dict = {}
    
    # Initiate an empty dictionary for the rural postal codes
    rural_postal_code_dict = {}

    # Create a FOR loop that satisfies 1. and 2. from the Task declaration above.
    for index, data in enumerate(combined_table_data):

        # Pull the postal code into a variable to be added to the dictionary
        postal_code = data.find('b').get_text()
        
        # Pull the 'p' tag data and put it in a variable
        if data.find('p'):
            p_tag_data = data.find('p').get_text()
            # Convert the string to a list so it can be stored as the dictionary value
            dictionary_values = conver_string_to_list(p_tag_data)
            # Put the keys and values into the rural dictionary
            rural_postal_code_dict[postal_code] = dictionary_values
        # Logical statement to update the dictionary based on criterion 2 of Task 1.5
        elif data.find('i'):
            city_town = data.find('i').get_text()
            urban_postal_code_dict[postal_code] = city_town
        else:
            # Pull the city/town from the 'a' tag; next_element skips the href information
            urban_postal_code_dict[postal_code] = data.find('a').next_element



    # Convert the dictionary into a dataframe
    urban_postal_df = DataFrame(urban_postal_code_dict.items(), columns = ['FSA','city_town'])
    
    # Convert dictionary into a dataframe
    rural_postal_df = DataFrame([(key, value) for key, values in rural_postal_code_dict.items() for value in values], 
                            columns = ['FSA', 'city_town'])
    
    # Get the length of each city_town string and put it in a new column
    rural_postal_df['string_length'] = rural_postal_df['city_town'].str.len()
    
    # Single-character entries are treated as unused FSAs: keep one row per FSA
    # and label it 'Not in use' (.copy() avoids pandas' SettingWithCopyWarning)
    not_used_FSAs_df = rural_postal_df[rural_postal_df['string_length'] == 1]
    not_used_FSAs_df = not_used_FSAs_df.drop_duplicates('FSA').copy()
    not_used_FSAs_df['city_town'] = 'Not in use'
    
    # Recombine the used and unused FSAs, then drop the helper column
    used_FSAs_df = rural_postal_df[rural_postal_df['string_length'] != 1]
    rural_postal_df = pd.concat((used_FSAs_df, not_used_FSAs_df))
    rural_postal_df = rural_postal_df.reset_index(drop = True)
    rural_postal_df = rural_postal_df.drop('string_length', axis = 1)
    
    combined_postal_data_df = pd.concat((urban_postal_df,rural_postal_df))
    combined_postal_data_df = combined_postal_data_df.reset_index(drop = True)

    region_title = soup.find('h2')
    region_title = region_title.get_text()
    region_title = region_title.split(sep = '[')
    
    combined_postal_data_df['FSA_Region'] = region_title[0]
    
    return combined_postal_data_df
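
The rural half of the function turns a dictionary of lists into one row per (FSA, place) pair and then collapses single-character entries into a single 'Not in use' row. A small sketch of that step on made-up data follows. The `'-'` placeholder is an assumption for illustration; the notebook's actual single-character values depend on what `conver_string_to_list` returns.

```python
import pandas as pd

# Made-up input: one FSA with real entries, one with a one-character placeholder
rural = {
    'V0X': ['1C0: Cawston', '1K0: Hedley'],
    'V0Y': ['-'],  # hypothetical placeholder value
}

# One row per (FSA, place) pair
rows = [(fsa, place) for fsa, places in rural.items() for place in places]
rural_df = pd.DataFrame(rows, columns=['FSA', 'city_town'])

# Flag one-character entries, keep one row per FSA, and relabel them
is_placeholder = rural_df['city_town'].str.len() == 1
unused_df = rural_df[is_placeholder].drop_duplicates('FSA').copy()
unused_df['city_town'] = 'Not in use'

result_df = pd.concat([rural_df[~is_placeholder], unused_df], ignore_index=True)
print(result_df)
```

Using a boolean mask here avoids adding and later dropping a helper `string_length` column, but the outcome is the same as in the function above.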

In [387]:
# Apply the function to C, X, and Y postal codes
test_FSAs_df = pd.concat((canadian_FSAs_df, get_postal_code_data_rural_urban_combined('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_C')))

In [403]:
test_FSAs_df = get_postal_code_data_rural_urban_combined('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_Y')

In [404]:
test_FSAs_df

Unnamed: 0,FSA,city_town,FSA_Region
0,Y0A,Teslin,Yukon
1,Y1A,Whitehorse,Yukon
2,Y1B,Not assigned,Yukon
3,Y1C,Not assigned,Yukon
4,Y1E,Not assigned,Yukon
5,Y1G,Not assigned,Yukon
6,Y1H,Not assigned,Yukon
7,Y1J,Not assigned,Yukon
8,Y1K,Not assigned,Yukon
9,Y1L,Not assigned,Yukon


In [396]:
CXY_FSA_df = DataFrame()

In [405]:
CXY_FSA_df = pd.concat((CXY_FSA_df, test_FSAs_df))

In [406]:
CXY_FSA_df

Unnamed: 0,FSA,city_town,FSA_Region
0,C1A,Charlottetown,Prince Edward Island
1,C1B,Stratford,Prince Edward Island
2,C1C,Charlottetown,Prince Edward Island
3,C1E,Charlottetown,Prince Edward Island
4,C1G,Not assigned,Prince Edward Island
...,...,...,...
17,Y1V,Not assigned,Yukon
18,Y1W,Not assigned,Yukon
19,Y1X,Not assigned,Yukon
20,Y1Y,Not assigned,Yukon


In [439]:
canadian_FSAs_df[canadian_FSAs_df['FSA_Region'] == 'British Columbia']

Unnamed: 0,FSA,city_town,FSA_Region
0,V1A,Kimberley,British Columbia
1,V2A,Penticton,British Columbia
2,V3A,Langley Township,British Columbia
3,V4A,Surrey,British Columbia
4,V5A,Burnaby,British Columbia
...,...,...,...
555,V0X,1X0: Rosedale,British Columbia
556,V0X,2L0: Tulameen,British Columbia
557,V0X,2W0: Princeton,British Columbia
558,V0Y,Not in use,British Columbia


In [440]:
canadian_FSAs_df[canadian_FSAs_df['FSA'] == 'V0X']

Unnamed: 0,FSA,city_town,FSA_Region
548,V0X,1C0: Cawston,British Columbia
549,V0X,1G0: Coalmont,British Columbia
550,V0X,1K0: Hedley,British Columbia
551,V0X,1L0: Hope,British Columbia
552,V0X,1N0: Keremeos,British Columbia
553,V0X,1R0: East Gate,British Columbia
554,V0X,1W0: Princeton,British Columbia
555,V0X,1X0: Rosedale,British Columbia
556,V0X,2L0: Tulameen,British Columbia
557,V0X,2W0: Princeton,British Columbia


In [446]:
canadian_FSAs_df = canadian_FSAs_df.drop_duplicates()

In [447]:
canadian_FSAs_df = pd.concat((canadian_FSAs_df, CXY_FSA_df))

In [448]:
canadian_FSAs_df

Unnamed: 0,FSA,city_town,FSA_Region
0,A1A,St. John's,Newfoundland and Labrador
1,A2A,Grand Falls-Windsor,Newfoundland and Labrador
2,A5A,Clarenville,Newfoundland and Labrador
3,A8A,Deer Lake,Newfoundland and Labrador
4,A9A,Not assigned,Newfoundland and Labrador
...,...,...,...
17,Y1V,Not assigned,Yukon
18,Y1W,Not assigned,Yukon
19,Y1X,Not assigned,Yukon
20,Y1Y,Not assigned,Yukon
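
In the output above, the index restarts (the last rows are labelled 17 to 20) because `pd.concat` keeps each input frame's own index by default. If unique row labels are wanted, passing `ignore_index=True` renumbers the rows. A minimal sketch with made-up frames:

```python
import pandas as pd

first = pd.DataFrame({'FSA': ['A1A', 'A2A']})
second = pd.DataFrame({'FSA': ['Y1A', 'Y1B']})

# Default behaviour: each frame keeps its own labels, so 0 and 1 appear twice
kept = pd.concat((first, second))

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame
renumbered = pd.concat((first, second), ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

With repeated labels, a lookup such as `kept.loc[0]` returns two rows instead of one, which is easy to overlook later.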


In [449]:
# Save the DataFrame, with all the FSAs, to a .csv
canadian_FSAs_df.to_csv('all_canadian_FSAs.csv')
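
The `Unnamed: 0` column in the tables above is what pandas produces when a CSV that was saved with its index is read back in. A short sketch of the round trip, using an in-memory buffer and made-up rows instead of a file on disk:

```python
import io
import pandas as pd

df = pd.DataFrame({'FSA': ['A1A', 'Y1A'], 'city_town': ["St. John's", 'Whitehorse']})

# Saving with the default index=True writes the index as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# Saving with index=False leaves only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

If the saved index carries no meaning, `canadian_FSAs_df.to_csv('all_canadian_FSAs.csv', index=False)` would keep the extra column out of the file.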

## Web Scraping - COMPLETED