# Segmentation and Clustering Stage 2: Adding Coordinate Data
### Authored By: Jon Ingram  

  
### NOTE: Skip to section labeled "Stage 2" for relevant changes.

## *Stage 1*

Step 1: 
- Import the required libraries.

In [4]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

Step 2:  
- Use the 'requests' library to retrieve the dataset from the Wikipedia page.  
- Create a BeautifulSoup object using the retrieved file.  

In [5]:
postal_file = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

soup = BeautifulSoup(postal_file, 'lxml')

Step 3:  
- Create an empty DataFrame, `df_tor`, to store the data.

In [6]:
columns = ['PostalCode', 'Borough', 'Neighborhood']
df_tor = pd.DataFrame(columns=columns)
df_tor

Unnamed: 0,PostalCode,Borough,Neighborhood


Step 4:
- Define a function that appends a row of data, as a dictionary, to a Python list.

In [7]:
def addRowToData(post, bor, neigh, data):
    if bor == 'Not assigned':
        return
    
    if neigh == 'Not assigned':
        neigh = bor
    
    data.append({'PostalCode':post, 'Borough':bor, 'Neighborhood':neigh})
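
To illustrate the two cleaning rules in the helper, here is a minimal self-contained sketch. It repeats the function under the hypothetical name `add_row_to_data` and calls it on three made-up rows; the sample values are placeholders and are not taken from the Wikipedia table.

```python
def add_row_to_data(post, bor, neigh, data):
    """Append one row dict to `data`, applying the two cleaning rules."""
    # Rule 1: skip rows whose borough is unassigned.
    if bor == 'Not assigned':
        return
    # Rule 2: fall back to the borough name when the neighborhood is unassigned.
    if neigh == 'Not assigned':
        neigh = bor
    data.append({'PostalCode': post, 'Borough': bor, 'Neighborhood': neigh})


# Hypothetical sample rows, for illustration only.
sample_rows = [
    ('X1A', 'Not assigned', 'Not assigned'),  # dropped by rule 1
    ('X2A', 'Some Borough', 'Not assigned'),  # neighborhood falls back to borough
    ('X3A', 'Some Borough', 'Some Place'),    # kept unchanged
]

rows_out = []
for post, bor, neigh in sample_rows:
    add_row_to_data(post, bor, neigh, rows_out)

print(rows_out)
```

Only the second and third rows should end up in `rows_out`, and the second one should carry the borough name as its neighborhood.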

Step 5:
- Use the 'soup' object to extract the information.
    - Iterate through each row of the website's data table. (Denoted by `<tr>` blocks)
    - Store the row information in three objects: 'post', 'bor', and 'neigh'.
    - Place the row in one of two lists: 
        - `data` if the current postal code has not been seen yet (tracked in `setOfCodes`)
        - `dupCodeData` if it has
- Move the data from the lists into pandas DataFrames.
- Fix the column order in `df_tor`.

In [8]:
rows = soup.find('tbody').find_all('tr')[1:]
data = []
dupCodeData = []
setOfCodes = set()

for row in rows:
    row_content = row.find_all('td')
    post = row_content[0].text
    bor = row_content[1].text
    neigh = row_content[2].text.split('\n')[0]
    
    if post in setOfCodes:
        addRowToData(post, bor, neigh, dupCodeData)
    else:
        setOfCodes.add(post)
        addRowToData(post, bor, neigh, data)
        
df_tor = pd.DataFrame(data)
df_dupes = pd.DataFrame(dupCodeData)

df_tor = df_tor[['PostalCode', 'Borough', 'Neighborhood']]

Step 6:  
- Check the current state of the DataFrame

In [9]:
df_tor.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,Rouge
7,M3B,North York,Don Mills North
8,M4B,East York,Woodbine Gardens
9,M5B,Downtown Toronto,Ryerson


Step 7:
- Append each neighborhood in `df_dupes` to the row in `df_tor` with the same postal code, separated by commas, as the project requires.

In [10]:
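for index, row in df_dupes.iterrows():
    # Select the df_tor row(s) that share this duplicate's postal code.
    mask = df_tor['PostalCode'] == row['PostalCode']
    # Append the duplicate's neighborhood to the existing comma-separated string.
    newNeighborhoodString = '{}, {}'.format(df_tor.loc[mask, 'Neighborhood'].values[0], row['Neighborhood'])
    df_tor.loc[mask, 'Neighborhood'] = newNeighborhoodString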
    
df_tor.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
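
As a point of comparison with the loop above, the same combining step can be sketched with a `groupby` aggregation. This is an alternative, not the method used in this notebook, and it assumes the data is held in a single long-format frame (hypothetically named `long_rows`) with one row per postal code and neighborhood pair. The sample values are a few of the rows shown above.

```python
import pandas as pd

# Hypothetical long-format frame: one row per (postal code, neighborhood) pair.
long_rows = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M1B', 'M1B', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'Scarborough', 'Scarborough', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Rouge', 'Malvern', 'Parkwoods'],
})

# Join all neighborhoods that share a postal code into one comma-separated string.
# sort=False keeps the groups in order of first appearance.
combined = (
    long_rows
    .groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

print(combined)
```

This avoids updating the DataFrame row by row, at the cost of requiring the duplicates and first occurrences to live in the same frame.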


Final Step:
- Print the shape of the finalized DataFrame

In [11]:
df_tor.shape

(103, 3)

## *Stage 2*

Step 1:
- Read the coordinate data from the .csv file into a pandas DataFrame and rename its columns to match `df_tor`

In [13]:
# read_csv accepts a file path directly, so no explicit open() is needed.
df_coor = pd.read_csv('Geospatial_Coordinates.csv')

df_coor.columns = ['PostalCode', 'Latitude', 'Longitude']
df_coor.head(5)

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
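
Assigning to `df_coor.columns` relabels columns by position, so it relies on the file's column order. A small sketch of renaming by name instead is given below. It reads from an in-memory string rather than the real file, and the header row `Postal Code,Latitude,Longitude` is an assumption about the file's contents; the two data rows are copied from the output above.

```python
import io

import pandas as pd

# Stand-in for Geospatial_Coordinates.csv. The header row is an assumption;
# the data rows are copied from the output shown in this notebook.
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""

df_sample = pd.read_csv(io.StringIO(csv_text))

# Rename by name: only the column called 'Postal Code' is changed.
df_sample = df_sample.rename(columns={'Postal Code': 'PostalCode'})

print(df_sample.dtypes)
```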


Step 2:
- Left-join the coordinates DataFrame onto `df_tor`, using `PostalCode` as the key, so every row of `df_tor` is kept

In [16]:
df_tor_merged = pd.merge(df_tor, df_coor, how='left', on='PostalCode')
df_tor_merged.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
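
A left join keeps rows from the left frame even when no coordinates match, leaving `NaN` in `Latitude` and `Longitude`. The sketch below shows one way to check for that, using the `validate` and `indicator` parameters of `pd.merge`. The frames `left` and `right` are small hypothetical stand-ins, and the postal code `Z9Z` is made up to force an unmatched row.

```python
import pandas as pd

# Hypothetical stand-ins for df_tor and df_coor.
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'Z9Z'],  # 'Z9Z' is a made-up code with no coordinates
    'Borough': ['North York', 'North York', 'Nowhere'],
})
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# validate='one_to_one' raises MergeError if a postal code repeats in either frame.
# indicator=True adds a '_merge' column recording where each row came from.
merged = pd.merge(left, right, how='left', on='PostalCode',
                  validate='one_to_one', indicator=True)

# Postal codes present in the left frame but missing from the coordinates.
missing = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

An empty `missing` list on the real data would indicate that every postal code in `df_tor` received coordinates.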
