## This notebook contains the assignment for week 3. The work is divided into three parts.
### Part 1 consists of scraping the Wikipedia page and building the dataframe
### Part 2 consists of adding the coordinates to the dataframe created in Part 1
### Part 3 consists of clustering the neighborhoods

## Part 1

### In this part, we use the BeautifulSoup and urllib packages to request and scrape the wiki page. We use the lxml parser to parse the page into a BeautifulSoup object that contains the tags.
### Then we use tag names to get the content we want

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [2]:
page = urlopen(r'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

In [3]:
bsobj = BeautifulSoup(page, 'lxml')

### In the cell above, BeautifulSoup parsed the HTML page into a tree-like structure. Now we use the table tag to get the content we want

In [5]:
table = bsobj.table

### In the cell above, `bsobj.table` returned the first table tag on the page.
### Now, the code in the cell below loops through each row in the table and appends the data from each row to the lists

In [6]:
PostalCode=[]
Borough=[]
Neighborhood=[]

# Skip the header row, then read the three cells of each data row
for tr in table.find_all('tr')[1:]:
    cells = tr.find_all('td')
    PostalCode.append(cells[0].get_text().strip())
    Borough.append(cells[1].get_text().strip())
    Neighborhood.append(cells[2].get_text().strip())
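
The loop above assumes that every row after the header carries at least three `td` cells. As a hedged sketch of a more defensive variant, the snippet below parses a tiny hand-written table and skips any row that does not have exactly three cells. The HTML string and the helper name `rows_from_table` are illustrative assumptions, not taken from the Wikipedia page, and the built-in `html.parser` backend is used here so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page; its layout is an assumption.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td colspan="3">malformed row</td></tr>
</table>
"""


def rows_from_table(table):
    """Return (postal_code, borough, neighborhood) tuples, one per data row."""
    rows = []
    for tr in table.find_all('tr'):
        cells = tr.find_all('td')
        if len(cells) != 3:  # header rows use <th>; malformed rows are skipped
            continue
        rows.append(tuple(cell.get_text().strip() for cell in cells))
    return rows


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = rows_from_table(soup.table)
print(rows)
```

Returning tuples instead of appending to three parallel lists keeps each row's values together, and the result can be passed to `pd.DataFrame(rows, columns=[...])` directly.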


### As mentioned in the assignment, if a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. The next block of code does that

In [7]:
# A 'Not assigned' neighborhood takes its borough's name; rows whose borough
# is also 'Not assigned' keep that value and are dropped later
for i, name in enumerate(Neighborhood):
    if name == 'Not assigned':
        Neighborhood[i] = Borough[i]
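
The same rule can be written without index bookkeeping. The following is a minimal sketch, assuming the two lists are aligned row by row as they are in this notebook; the function name `fill_missing_neighborhoods` is hypothetical, and the three sample rows mirror rows shown in the `df.head(10)` output.

```python
def fill_missing_neighborhoods(boroughs, neighborhoods, missing='Not assigned'):
    """Return a new list where a missing neighborhood takes its borough's name."""
    return [b if n == missing else n for b, n in zip(boroughs, neighborhoods)]


boroughs = ['Not assigned', 'North York', "Queen's Park"]
neighborhoods = ['Not assigned', 'Parkwoods', 'Not assigned']
filled = fill_missing_neighborhoods(boroughs, neighborhoods)
print(filled)
```

A row whose borough is also Not assigned keeps that value, which is why such rows still have to be filtered out afterwards.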

In [8]:
df = pd.DataFrame({
        'PostalCode':PostalCode,
        'Borough':Borough,
        'Neighborhood':Neighborhood
    })

In [9]:
df.head(10)

Unnamed: 0,Borough,Neighborhood,PostalCode
0,Not assigned,Not assigned,M1A
1,Not assigned,Not assigned,M2A
2,North York,Parkwoods,M3A
3,North York,Victoria Village,M4A
4,Downtown Toronto,Harbourfront,M5A
5,Downtown Toronto,Regent Park,M5A
6,North York,Lawrence Heights,M6A
7,North York,Lawrence Manor,M6A
8,Queen's Park,Queen's Park,M7A
9,Not assigned,Not assigned,M8A


### We see in the two cells below that the Borough and Neighborhood columns each have 77 Not assigned values

In [10]:
df['Borough'].value_counts()

Not assigned        77
Etobicoke           45
North York          38
Scarborough         37
Downtown Toronto    37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Queen's Park         1
Mississauga          1
Name: Borough, dtype: int64

In [12]:
df['Neighborhood'].value_counts()

Not assigned            77
St. James Town           2
Runnymede                2
Christie                 1
Thistletown              1
Queen's Park             1
East Birchmount Park     1
Cedarbrae                1
Regent Park              1
The Beaches              1
Little Portugal          1
Jamestown                1
Bathurst Quay            1
Don Mills North          1
Parkdale                 1
Weston                   1
Woburn                   1
The Queensway West       1
Upper Rouge              1
Clairlea                 1
South Niagara            1
Keelesdale               1
Woodbine Heights         1
King and Spadina         1
The Kingsway             1
Exhibition Place         1
CFB Toronto              1
L'Amoreaux West          1
Oriole                   1
Union Station            1
                        ..
Church and Wellesley     1
Downsview Central        1
West Deane Park          1
Toronto Islands          1
Old Burnhamthorpe        1
Glencairn                1

### Now we drop the rows whose Borough is Not assigned

In [13]:
df = df[df.Borough!='Not assigned']

### Now, by applying the groupby function, we combine the neighborhoods that share a postal code into one comma-separated row.

In [14]:
df_new = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
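
To make the effect of the `groupby` and `', '.join` step concrete, here is a plain-Python sketch of the same computation on three rows taken from the table above. It is an illustration of what the pandas call produces, not a replacement for it, and the helper name `join_neighborhoods` is hypothetical.

```python
from collections import defaultdict


def join_neighborhoods(rows):
    """Group (postal_code, borough, neighborhood) rows and join the names."""
    grouped = defaultdict(list)
    for postal_code, borough, neighborhood in rows:
        grouped[(postal_code, borough)].append(neighborhood)
    # Sort by key to mirror groupby's default sorted output.
    return [(pc, b, ', '.join(names)) for (pc, b), names in sorted(grouped.items())]


rows = [
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M5A', 'Downtown Toronto', 'Regent Park'),
    ('M3A', 'North York', 'Parkwoods'),
]
combined = join_neighborhoods(rows)
print(combined)
```

The two M5A rows collapse into a single entry, which matches the `Harbourfront, Regent Park` value that appears for M5A later in the notebook.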

In [15]:
df_new.shape

(103, 3)

## Part 2
### We will add the coordinates into the dataframe we created above

In [16]:
df_coordinates = pd.read_csv(r'c:\users\paras\desktop\Geospatial_Coordinates.csv')

In [17]:
df_coordinates.shape

(103, 3)

### We rename the key column so that it has the same name in both dataframes before merging them on it, just like joining tables in SQL

In [18]:
df_coordinates.rename(columns={'Postal Code':'PostalCode'}, inplace=True)

In [19]:
df_coordinates.columns

Index(['PostalCode', 'Latitude', 'Longitude'], dtype='object')

In [20]:
df_data = pd.merge(df_new, df_coordinates, on='PostalCode')
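
`pd.merge` performs an inner join by default, so a postal code that appears in only one of the two dataframes would be dropped without any warning. A small check like the one below can confirm that nothing is lost. This is a sketch with made-up key lists; the helper name `unmatched_keys` is an assumption.

```python
def unmatched_keys(left_keys, right_keys):
    """Return the keys present on only one side, as two sorted lists."""
    left, right = set(left_keys), set(right_keys)
    return sorted(left - right), sorted(right - left)


# Illustrative key lists; the third code differs on purpose.
neighborhood_codes = ['M1B', 'M1C', 'M1E']
coordinate_codes = ['M1B', 'M1C', 'M1G']
only_left, only_right = unmatched_keys(neighborhood_codes, coordinate_codes)
print(only_left, only_right)
```

In this notebook the call would be `unmatched_keys(df_new['PostalCode'], df_coordinates['PostalCode'])`. Since both inputs have 103 rows and the merged frame also has 103 rows, both returned lists should be empty.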

In [21]:
df_data.shape

(103, 5)

In [22]:
df_data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


## Part 3
### This part focuses on exploring and clustering the neighborhoods in Toronto

In [23]:
import folium

In [24]:
latitude = 43.6532
longitude = -79.3832
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
map_toronto

In [25]:

# add markers to map
for lat, lng, borough, neighborhood in zip(df_data['Latitude'], df_data['Longitude'], df_data['Borough'], df_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [26]:
df_data['Borough'].value_counts()

North York          24
Downtown Toronto    18
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East Toronto         5
York                 5
East York            5
Queen's Park         1
Mississauga          1
Name: Borough, dtype: int64

### We will explore only the Downtown Toronto borough

In [27]:
downtown_toronto = df_data[df_data['Borough']=='Downtown Toronto'].reset_index(drop=True)

In [28]:
downtown_toronto

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752


In [29]:
latitude = 43.6543 
longitude = -79.3860
map_downtown = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lng, borough, neighborhood in zip(downtown_toronto['Latitude'], downtown_toronto['Longitude'], downtown_toronto['Borough'], downtown_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_downtown)  
    
map_downtown

### Let's explore the neighborhood around Central Bay Street

In [30]:
downtown_toronto.loc[7, 'Neighborhood']

'Central Bay Street'

In [31]:
neighborhood_latitude = downtown_toronto.loc[7, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = downtown_toronto.loc[7, 'Longitude'] # neighborhood longitude value

neighborhood_name = downtown_toronto.loc[7, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Central Bay Street are 43.6579524, -79.3873826.


In [32]:
# @hidden_cell
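import os  # credentials come from the environment instead of being stored in the notebook
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']          # assumed variable name
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']  # assumed variable name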

In [35]:
import requests

radius = 100          # search radius in metres around the neighborhood centre
LIMIT = 10            # maximum number of venues to return
VERSION = '20190915'  # Foursquare API version date
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    neighborhood_latitude,
    neighborhood_longitude,
    radius,
    LIMIT)
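
Formatting the query string by hand works for these values, but it is easy to misplace a separator or to forget to escape a character. As a hedged alternative, the sketch below builds the same URL with the standard library's `urlencode`. The function name `build_explore_url` and the placeholder credentials `MY_ID` and `MY_SECRET` are assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the venues/explore URL with a properly encoded query string."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# Placeholder credentials; the coordinates are those of Central Bay Street.
url = build_explore_url('MY_ID', 'MY_SECRET', '20190915',
                        43.6579524, -79.3873826, 100, 10)
print(url)
```

The comma inside `ll` is percent-encoded as `%2C`. Passing the same dictionary to `requests.get(base_url, params=params)` would achieve the same encoding without building the string at all.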

In [36]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d7e709c531593002ce455d0'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-4c6affbe9669e21e3708aa51-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_',
          'suffix': '.png'},
         'id': '4bf58dd8d48988d1e0931735',
         'name': 'Coffee Shop',
         'pluralName': 'Coffee Shops',
         'primary': True,
         'shortName': 'Coffee Shop'}],
       'id': '4c6affbe9669e21e3708aa51',
       'location': {'address': '555 University Ave',
        'cc': 'CA',
        'city': 'Toronto',
        'country': 'Canada',
        'crossStreet': 'in Sick Kids Hospital',
        'distance': 97,
        'formattedAddress': ['555 University Ave (in Sick Kids Hospital)',
         'Toronto ON M5G 1X8',
         'Canada'],


In [37]:
def get_category_type(row):
    # Rows from the flattened dataframe use the 'venue.categories' key
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [38]:
venues = results['response']['groups'][0]['items']

# flatten JSON (pd.json_normalize needs pandas >= 1.0; pandas.io.json.json_normalize is deprecated)
nearby_venues = pd.json_normalize(venues)

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Tim Hortons,Coffee Shop,43.657399,-79.388313
1,Shoppers Drug Mart,Pharmacy,43.657566,-79.388477
2,Subway,Sandwich Place,43.657461,-79.386479
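
To show what the flattening and category extraction amount to, here is a plain-Python sketch that walks the nested items directly. The sample data is trimmed down to the keys this notebook uses; the first entry follows the Tim Hortons row above, the second entry is a hypothetical venue with no category, and the function name `summarize_venues` is an assumption.

```python
def summarize_venues(items):
    """Turn Foursquare 'items' into flat dicts with name, category, lat, lng."""
    summary = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        summary.append({
            'name': venue['name'],
            # The first listed category is used, as in get_category_type.
            'category': categories[0]['name'] if categories else None,
            'lat': venue['location']['lat'],
            'lng': venue['location']['lng'],
        })
    return summary


# Trimmed-down sample shaped like results['response']['groups'][0]['items'].
sample_items = [
    {'venue': {'name': 'Tim Hortons',
               'categories': [{'name': 'Coffee Shop', 'primary': True}],
               'location': {'lat': 43.657399, 'lng': -79.388313}}},
    {'venue': {'name': 'Unnamed kiosk',  # hypothetical venue with no category
               'categories': [],
               'location': {'lat': 43.6575, 'lng': -79.3880}}},
]
flat = summarize_venues(sample_items)
print(flat)
```

A list of flat dictionaries like this can also be handed to `pd.DataFrame(flat)` to get a table with the same four columns.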


In [39]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

3 venues were returned by Foursquare.
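
Only three venues came back because the search radius was 100 metres. As a rough sanity check, the great-circle distance between the neighborhood centre and a returned venue can be computed with the haversine formula. This is a sketch under a spherical-Earth assumption, and the function name `haversine_m` is hypothetical; the result should land close to the `'distance': 97` value that the response reports for its first venue.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


center = (43.6579524, -79.3873826)     # Central Bay Street, from the notebook
tim_hortons = (43.657399, -79.388313)  # first row of nearby_venues
distance = haversine_m(*center, *tim_hortons)
print(round(distance))
```

Applying the same function to every row of `nearby_venues` would confirm that each venue lies within the requested radius.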
