# Exploring Toronto (cont.)
I will be reusing much of the code from the previous project. To save space, I have merged most of it into a single cell.

The goal of this notebook is simply to add the geocodes (latitude and longitude) for the applicable postal codes in Toronto, Canada, which were extracted in the previous phase of this project.

### Installing Libraries

In [1]:
!conda install beautifulsoup4
!conda install lxml
!conda install requests

Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

    beautifulsoup4: 4.6.0-py35h442a8c9_1 --> 4.6.3-py35_0

beautifulsoup4 100% |################################| Time: 0:00:00  40.66 MB/s
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

    libgcc-ng: 7.2.0-h7cc24e2_2     --> 8.2.0-hdf63c60_1    
    libxml2:   2.9.4-h6b072ca_5     --> 2.9.8-hf84eae3_0    
    libxslt:   1.1.29-hcf9102b_5    --> 1.1.33-h7d1a2b0_0   
    lxml:      4.1.0-py35ha401a81_0 --> 4.2.5-py35hefd8a0e_0

libgcc-ng-8.2. 100% |################################| Time: 0:00:00  67.50 MB/s
libxml2-2.9.8- 100% |################################| Time: 0:00:00  20.04 MB/s
libxslt-1.1.33 100% |################################| Time: 0:00:00  43.

### Importing the libraries

In [2]:
from bs4 import BeautifulSoup as bs
import requests as rq
import pandas as pd
import numpy as np

In [3]:
source = rq.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = bs(source, 'lxml')

table = soup.table

column_names = []
for te in table.find_all('th'):
    column_names.append(te.text)


## Creating a blank dataframe
postalCodes_DF = pd.DataFrame(columns = column_names)

for i,te in enumerate(table.find_all('tr')):
    if (te.td):
        row_lst = te.text[:].split('\n')
        # Below: Eliminating the 'Not assigned'-rows
        if row_lst[2] != 'Not assigned':
            postcode = row_lst[1]
            borough = row_lst[2]
            # Below: where a 'Borough' is given but the 'Neighbourhood' is 'Not assigned', use the borough name as the neighbourhood
            if row_lst[3] == 'Not assigned': 
                nhood = borough
            else:
                nhood = row_lst[3]

            postalCodes_DF = postalCodes_DF.append({'Postcode': postcode,
                                                    'Borough': borough,
                                                    'Neighbourhood': nhood},ignore_index=True)

# Drop the stray 'Neighbourhood\n' column that came from the raw header text
pc_DF = postalCodes_DF.drop(['Neighbourhood\n'], axis=1)


new_pc_DF = pc_DF.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join)
new_pc_DF = new_pc_DF.to_frame().reset_index()
new_pc_DF.loc[new_pc_DF['Postcode'] == 'M7A']

print(('The shape of the DF is: {}').format(new_pc_DF.shape))

The shape of the DF is: (103, 3)
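
The row-by-row `DataFrame.append` used above works on the pandas version in this environment, but it was removed in pandas 2.0. Below is a minimal sketch of the same cleaning and grouping logic that collects the rows in a list first. The `sample_rows` data and the names `sample_DF` and `grouped_DF` are hypothetical stand-ins for illustration, not the real scraped table.

```python
import pandas as pd

# Hypothetical stand-in for the scraped (postcode, borough, neighbourhood) rows.
sample_rows = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M3A', 'North York', 'Parkwoods'),
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M5A', 'Downtown Toronto', 'Regent Park'),
    ('M7A', "Queen's Park", 'Not assigned'),
]

records = []
for postcode, borough, nhood in sample_rows:
    # Skip rows whose borough is 'Not assigned'
    if borough == 'Not assigned':
        continue
    # Fall back to the borough name when the neighbourhood is 'Not assigned'
    if nhood == 'Not assigned':
        nhood = borough
    records.append({'Postcode': postcode, 'Borough': borough, 'Neighbourhood': nhood})

# Build the dataframe once, from the collected list of dicts
sample_DF = pd.DataFrame(records, columns=['Postcode', 'Borough', 'Neighbourhood'])

# Combine neighbourhoods that share a postcode into one comma-separated string
grouped_DF = (sample_DF.groupby(['Postcode', 'Borough'])['Neighbourhood']
              .apply(','.join)
              .reset_index())
print(grouped_DF)
```

Building the dataframe in one step also avoids copying the whole frame on every iteration, which is what `append` did.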


### GeoCoder

I have tried implementing GeoCoder a couple of times and have not succeeded.

In [7]:
!pip install geocoder

Requirement not upgraded as not directly required: geocoder in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages
Requirement not upgraded as not directly required: click in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from geocoder)
Requirement not upgraded as not directly required: ratelim in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from geocoder)
Requirement not upgraded as not directly required: requests in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from geocoder)
Requirement not upgraded as not directly required: six in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from geocoder)
Requirement not upgraded as not directly required: future in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from geocoder)
Requirement not upgraded as not directly required: decorator in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from ratelim->geocoder)
Requirement not upgraded as not directly required: urllib3
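
This notebook does not record why the GeoCoder attempts failed. If the cause was lookups that intermittently return nothing (an assumption), a retry wrapper along the following lines might help. The function `geocode_with_retry` and its parameters are hypothetical; the `lookup` argument is any callable that takes an address string and returns a latitude/longitude pair or a falsy value.

```python
import time

def geocode_with_retry(postal_code, lookup, max_tries=5, pause=0.0):
    """Call lookup(address) until it returns a (lat, lng) pair or the tries run out."""
    address = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_tries):
        result = lookup(address)
        if result:  # a falsy result (None or an empty list) means nothing was found
            return tuple(result)
        time.sleep(pause)  # wait before trying again
    return None  # the caller decides how to handle a failed lookup


# Possible use with the geocoder package (needs network access, so it is not run here):
# import geocoder
# coords = geocode_with_retry('M5G', lambda addr: geocoder.arcgis(addr).latlng, pause=1.0)
```

Passing the lookup in as a callable keeps the retry logic independent of any one provider, so it can be tried with a fake lookup before spending real requests.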

In the interest of time, I am moving on to the CSV file and will use it to attach the latitudes and longitudes to the dataframe.

In [4]:
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

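# NOTE: `body` is not defined in this excerpt. It is assumed to be the file-like
# object for the coordinates CSV, created by storage-client code not shown here.
df_data_1 = pd.read_csv(body)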
df_data_1.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


### Checking the basics
The checks below cover the basics of the imported data:
1. Checking that there are no surprises (hidden spaces, etc.) in the column names

In [24]:
df_data_1.columns

Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')

2. Checking what the dimensions of the dataframe are:

In [25]:
df_data_1.shape

(103, 3)

Both dataframes therefore have the same dimensions, (103, 3). If the contents of the key columns are misaligned, however, some postal codes will end up without coordinates after the join, i.e. there will be data loss.
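
Matching shapes do not guarantee matching keys. One way to look for misaligned keys before joining is to compare the two key columns as sets. The sketch below uses hypothetical miniature dataframes (`left_DF`, `right_DF`) with a deliberately planted trailing space; it is an illustration, not a check that was run on the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes.
left_DF = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M1E']})
right_DF = pd.DataFrame({'Postal Code': ['M1B', 'M1C ', 'M1G']})  # note the trailing space

left_keys = set(left_DF['Postcode'])
right_keys = set(right_DF['Postal Code'])

# Keys that appear on one side only would not find a partner in the join
only_left = sorted(left_keys - right_keys)
only_right = sorted(right_keys - left_keys)
print(only_left, only_right)

# Stripping whitespace removes mismatches that are purely formatting
right_clean = set(right_DF['Postal Code'].str.strip())
still_missing = sorted(left_keys - right_clean)
print(still_missing)
```

Two empty lists from the first comparison would mean every postal code has a partner on the other side.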

### Joining the datasets
First, however, I will join the data:

In [17]:
ll_df = (pd.merge(new_pc_DF, df_data_1, left_on='Postcode', right_on='Postal Code', how='left')
           .drop('Postal Code', axis=1)
           .rename(columns={'Postcode': 'PostalCode'}))
ll_df.shape

(103, 5)

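From the above it seems as though no data loss has occurred. A left join keeps every row of the left dataframe whether or not it finds a match, though, so the unchanged row count alone is not proof. To be sure, I will select all rows containing nulls after the left join (if any).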

In [26]:
ll_df[ll_df.isnull().any(axis=1)]

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|


No rows have been selected. It is therefore a perfect join (i.e. equivalent to an inner join) and no data loss has occurred.
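
The null check relies on missing coordinates showing up as NaN. As a complementary sketch, `pd.merge` can also report the match status of each row through its `indicator` parameter, and `validate` raises an error if a key is duplicated. The dataframes `codes_DF` and `coords_DF` below are hypothetical miniatures in which the coordinates for M1E are deliberately left out. `validate` assumes pandas 0.21 or later.

```python
import pandas as pd

# Hypothetical miniature inputs; M1E deliberately has no coordinates.
codes_DF = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M1E'],
                         'Borough': ['Scarborough'] * 3})
coords_DF = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                          'Latitude': [43.806686, 43.784535],
                          'Longitude': [-79.194353, -79.160497]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
# validate='one_to_one' raises MergeError if either key column has duplicates
checked_DF = pd.merge(codes_DF, coords_DF,
                      left_on='Postcode', right_on='Postal Code',
                      how='left', validate='one_to_one', indicator=True)

# Rows that found no partner in the coordinates table
unmatched = checked_DF[checked_DF['_merge'] == 'left_only']
print(unmatched['Postcode'].tolist())
```

An empty `unmatched` frame corresponds to the "perfect join" described above.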

### Checking the received data against Coursera
I created a list of the first few postal codes from the Coursera table for this assignment. I then checked the latitudes and longitudes against the table given in their example.

In [22]:
ll_df[ll_df['PostalCode'].isin(['M5G','M2H','M4B','M1J','M4G','M4M'])]

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 17 | M2H | North York | Hillcrest Village | 43.803762 | -79.363452 |
| 35 | M4B | East York | Woodbine Gardens,Parkview Hill | 43.706397 | -79.309937 |
| 38 | M4G | East York | Leaside | 43.70906 | -79.363452 |
| 43 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 57 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 |
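
The comparison against the reference table was done by eye. A programmatic version could look like the sketch below, which joins the result to a reference table and compares the coordinates with a small tolerance. The Coursera table is not reproduced in this notebook, so `reference_DF` here is a hypothetical stand-in filled with the values from the output above; in practice it would be typed in from the assignment page.

```python
import numpy as np
import pandas as pd

# Three rows copied from the output above
result_DF = pd.DataFrame({'PostalCode': ['M1J', 'M2H', 'M5G'],
                          'Latitude': [43.744734, 43.803762, 43.657952],
                          'Longitude': [-79.239476, -79.363452, -79.387383]})

# Hypothetical reference table (stand-in for the values on the assignment page)
reference_DF = pd.DataFrame({'PostalCode': ['M5G', 'M2H', 'M1J'],
                             'Latitude': [43.657952, 43.803762, 43.744734],
                             'Longitude': [-79.387383, -79.363452, -79.239476]})

# Align the two tables on the postal code, keeping both sets of coordinates
compare_DF = result_DF.merge(reference_DF, on='PostalCode', suffixes=('', '_ref'))

# Compare floats with a tolerance rather than exact equality
lat_ok = np.isclose(compare_DF['Latitude'], compare_DF['Latitude_ref'], atol=1e-6)
lon_ok = np.isclose(compare_DF['Longitude'], compare_DF['Longitude_ref'], atol=1e-6)

# Every reference row must be present and must agree on both coordinates
all_match = bool((lat_ok & lon_ok).all()) and len(compare_DF) == len(reference_DF)
print(all_match)
```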


## CONCLUSION
The values match and I am now 100% satisfied that this data is the same as what I will need for the further analysis.

In [18]:
ll_df.head()

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
