# Wikipedia Data Scraping

__Scraping data on Toronto Neighbourhoods__
In this assignment we will scrape data on Toronto neighbourhoods from the Wikipedia page
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

We will fetch the following columns:

__Postcode__, __Borough__, __Neighbourhood__

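### 1) Creating a pandas DataFrame
First we read the table on the web page into a pandas DataFrame. `pd.read_html` returns a list with one DataFrame per table it finds on the page; `[0]` selects the first one.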

In [1]:
import pandas as pd
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

df=pd.read_html(url, header=0)[0]

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


# 2) Data cleaning

In [2]:
# Replace every "Not assigned" value in the Borough column with np.nan

# import the numpy library for np.nan
import numpy as np

# np.NaN was removed in NumPy 2.0, so use np.nan; assign the result back
# because newer pandas versions warn about inplace=True on a selected column
df['Borough'] = df['Borough'].replace('Not assigned', np.nan)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,,Not assigned
1,M2A,,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


As you can see above, all the "Not assigned" values in the Borough column are now __np.nan__.

This makes dropping the rows with a missing Borough easier.

The __.dropna()__ function drops every row that contains an __np.nan__ value from the dataframe.

In [3]:
# Use the .dropna() function to drop every row that contains a null value.
# inplace=True modifies the original dataframe.

df.dropna(inplace=True)
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
9,M9A,Queen's Park,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North
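
The replace-then-drop step can be tried on a small frame first. The following is a minimal sketch on made-up toy rows that reuse the column names above; restricting the check to the Borough column with `subset` is an assumption about the intent, since the notebook's own call drops rows with a null in any column.

```python
import numpy as np
import pandas as pd

# Toy rows with the same column names as the scraped table (illustrative values).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Mark the placeholder as missing, then drop only rows whose Borough is missing.
toy["Borough"] = toy["Borough"].replace("Not assigned", np.nan)
cleaned = toy.dropna(subset=["Borough"])

print(cleaned)
```

Because `dropna` is called without `inplace=True` here, `toy` keeps all three rows and `cleaned` holds the filtered copy.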


We need a __for__ loop that replaces each 'Not assigned' value in the 'Neighbourhood' column with the value of the 'Borough' column from the same row.

In [4]:
# Column positions: 1 = Borough, 2 = Neighbourhood
for i in range(len(df)):
    if df.iloc[i,2]=='Not assigned':
        df.iloc[i,2]=df.iloc[i,1]

# Take a look at df
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Queen's Park
9,M9A,Queen's Park,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North


As you can see, the value of the Neighbourhood column at index 7 has changed to __Queen's Park__.
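
The same fill can also be written without an explicit loop. This is a hedged sketch of one alternative using a boolean mask with `.loc`, shown on toy rows rather than the scraped data:

```python
import pandas as pd

# Toy rows (illustrative values) with one unassigned neighbourhood.
toy = pd.DataFrame({
    "Postcode": ["M6A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Lawrence Heights", "Not assigned"],
})

# Boolean mask of the rows to fix, then copy Borough into Neighbourhood for those rows only.
unassigned = toy["Neighbourhood"] == "Not assigned"
toy.loc[unassigned, "Neighbourhood"] = toy.loc[unassigned, "Borough"]

print(toy)
```

Selecting by column name instead of position also keeps the code working if the column order changes.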

Next we will use the __groupby__ command to group the rows by unique values of Postcode.

In [5]:
# groupby() to group rows by Postcode
df.groupby(['Postcode']).head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
...,...,...,...
281,M8Z,Etobicoke,Kingsway Park South West
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
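
Note that `groupby(...).head()` returns the first five rows of each group, not one row per postcode, which is why the output above still looks like the ungrouped table. To see how many neighbourhoods share a postcode, a small sketch with `size()` on toy rows (illustrative values) could look like this:

```python
import pandas as pd

# Toy rows: two postcodes with two neighbourhoods each, one with a single neighbourhood.
toy = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M1B", "M1B", "M3A"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Rouge", "Malvern", "Parkwoods"],
})

# size() counts the rows in each group; the result is indexed by Postcode.
counts = toy.groupby("Postcode").size()

print(counts)
```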


Next we need to __aggregate__ the values of the __Neighbourhood__ column based on the __Postcode__ values.

Here, we update the original dataframe and use the __.agg(','.join)__ command to aggregate the values of the __Neighbourhood__ and __Borough__ columns.

__.agg__ will then __join__ the values within each group, separated by __","__.

In [6]:
df=df[['Postcode','Borough','Neighbourhood']].groupby('Postcode',as_index=False).agg(','.join)

Notice that in the table below, the values of the __Borough__ column also got __aggregated__, resulting in a lot of repeated values.

Example: Scarborough,Scarborough,Scarborough

In [7]:
# Let's take a look at the updated dataframe. 
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,"Scarborough,Scarborough","Rouge,Malvern"
1,M1C,"Scarborough,Scarborough,Scarborough","Highland Creek,Rouge Hill,Port Union"
2,M1E,"Scarborough,Scarborough,Scarborough","Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
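
To see why the borough names repeat, it may help to run the same aggregation on a few toy rows. In this sketch (illustrative values only), `','.join` is applied to every non-key column of each group, so Borough is joined just like Neighbourhood:

```python
import pandas as pd

# Toy rows: M1B appears twice, M1G once.
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# The same callable is applied to each column within each Postcode group.
joined = toy.groupby("Postcode", as_index=False).agg(",".join)

print(joined)
```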


Next, we need to get rid of the repeated values in the Borough column.

First we create the object __'col'__. We __split__ each string in the column on __","__, which gives a list of borough names.

Next we convert the resulting list to a __set__. Recall that a __set__ does not contain repeated values. We then __join__ the remaining strings, separated by __","__.

In [8]:
# use .apply(set) to convert each list into a set
col=df['Borough'].str.split(',').apply(set).str.join(',')

In [9]:
# Next we update the df with the col values. 
df.update(col)

In [10]:
# Let's take a look at the final dataframe
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
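
The join-then-deduplicate steps could also be collapsed into a single aggregation by giving each column its own rule. The sketch below is an alternative, not the approach used above, and it assumes every postcode belongs to exactly one borough, which the output suggests but does not guarantee. The toy rows are illustrative.

```python
import pandas as pd

# Toy rows: M1B appears twice, M1G once.
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# Keep the first borough of each group; join the neighbourhood names with commas.
grouped = toy.groupby("Postcode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": ",".join}
)

print(grouped)
```

If a postcode could span several boroughs, `"first"` would silently drop the others, so the set-based approach is the safer choice in that case.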


#### Shape of the dataframe

In [11]:
df.shape

(103, 3)

# Part 3 : geocoder

__Assignment__:
Now that we have built a dataframe of the postal code of each neighbourhood along with the borough name and neighbourhood name, we need the __latitude__ and __longitude__ coordinates of each neighbourhood in order to use the Foursquare location data.

The geocoder package kept denying the API request, so we decided to go ahead with __pgeocode__ instead.

In [12]:
# import pgeocode
import pgeocode as pgeo

Let us explore how pgeocode works:
#### 1) First we create an object
#### 2) Then we post a query

In [13]:
# create an object for Canadian postal codes
nomi=pgeo.Nominatim("ca")

# post a query
nomi.query_postal_code("M1B")

postal_code                                       M1B
country code                                       CA
place_name        Scarborough (Malvern / Rouge River)
state_name                                    Ontario
state_code                                         ON
county_name                               Scarborough
county_code                                       NaN
community_name                                    NaN
community_code                                    NaN
latitude                                      43.8113
longitude                                     -79.193
accuracy                                            6
Name: 0, dtype: object

As you can see in the Series above, the query returns:

postal_code, country code, place_name, state_name, state_code, county_name, county_code, community_name, community_code, __latitude__, __longitude__, accuracy.

We are only interested in the latitude and longitude of the place. This is how we access them:

In [14]:
# save the output of the query in a variable
a=nomi.query_postal_code("M1C")

# convert to pandas dataframe
df_2=pd.DataFrame(a)
df_2.head()

Unnamed: 0,0
postal_code,M1C
country code,CA
place_name,Scarborough (Rouge Hill / Port Union / Highlan...
state_name,Ontario
state_code,ON


In [15]:
#Accessing latitude and longitude
print("Latitude  :" + str(a.latitude))
print("Longitude :" + str(a.longitude))

Latitude  :43.7878
Longitude :-79.1564
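
Since the query result is a pandas Series, its fields can be read either as attributes or by label. The sketch below uses a hand-built Series as a stand-in for the real query result, with values copied from the output above, so it runs without pgeocode:

```python
import pandas as pd

# Hand-built stand-in for nomi.query_postal_code("M1C"); only three fields are included.
result = pd.Series({"postal_code": "M1C", "latitude": 43.7878, "longitude": -79.1564})

# Attribute access and label access return the same value.
lat_by_attribute = result.latitude
lat_by_label = result["latitude"]

# Label access also works for several fields at once.
coords = tuple(result[["latitude", "longitude"]])

print(lat_by_attribute, lat_by_label, coords)
```

Label access is the only option for fields whose names contain a space, such as `country code`.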


Now let's turn back to our original dataframe.

First we create two additional columns in the dataframe, titled __Latitude__ and __Longitude__.

In [16]:
df.insert(3, "Latitude","")
df.insert(4, "Longitude","")

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",,
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",,
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",,
3,M1G,Scarborough,Woburn,,
4,M1H,Scarborough,Cedarbrae,,


Next, we make query calls inside a for loop, extract the latitude and longitude values from each result, and update the dataframe.

In [17]:
# For loop to extract latitude and longitude and update dataframe
for i in range(len(df)):
    A=nomi.query_postal_code(df.iloc[i,0])
    df.iloc[i,3]=A.latitude
    df.iloc[i,4]=A.longitude

In [18]:
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.8113,-79.193
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.7878,-79.1564
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.7678,-79.1866
3,M1G,Scarborough,Woburn,43.7712,-79.2144
4,M1H,Scarborough,Cedarbrae,43.7686,-79.2389
5,M1J,Scarborough,Scarborough Village,43.7464,-79.2323
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.7298,-79.2639
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.7122,-79.2843
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.7247,-79.2312
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.6952,-79.2646


The above dataframe is complete.
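
The row-by-row loop could possibly be replaced by one batch query. The sketch below assumes that `query_postal_code` also accepts a list of codes and returns a DataFrame with one row per code in the same order, as the pgeocode documentation describes. `StubGeocoder` and `add_coordinates` are hypothetical names introduced here; the stub stands in for `pgeo.Nominatim("ca")` so the example runs offline, using coordinates copied from the output above.

```python
import pandas as pd


class StubGeocoder:
    """Hypothetical offline stand-in for pgeocode.Nominatim("ca")."""

    _coords = {"M1B": (43.8113, -79.193), "M1C": (43.7878, -79.1564)}

    def query_postal_code(self, codes):
        # Return one row per requested code, in the order given.
        rows = [
            {"postal_code": c, "latitude": self._coords[c][0], "longitude": self._coords[c][1]}
            for c in codes
        ]
        return pd.DataFrame(rows)


def add_coordinates(frame, geocoder):
    """Return a copy of frame with Latitude and Longitude filled by one batch query."""
    result = geocoder.query_postal_code(frame["Postcode"].tolist())
    out = frame.copy()
    # to_numpy() assigns by position, so the indexes of frame and result need not match.
    out["Latitude"] = result["latitude"].to_numpy()
    out["Longitude"] = result["longitude"].to_numpy()
    return out


toy = pd.DataFrame({"Postcode": ["M1B", "M1C"], "Borough": ["Scarborough", "Scarborough"]})
with_coords = add_coordinates(toy, StubGeocoder())

print(with_coords)
```

With the real library, the call would be `add_coordinates(df, nomi)`; the new columns then hold floats rather than the empty strings used as placeholders above.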