# Python Week 3 Assignment

### Part 1: Webscraping

In the cell below, I import the required libraries and download the Wikipedia page of Canadian postal codes beginning with M using the requests library.

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
wiki_page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

In the cell below, I create a Beautiful Soup object named soup from the page HTML and the parser type. I then create an empty list called table_contents and use the .find method to locate the first table in the HTML. I iterate over the table's data cells and skip any cell marked "Not assigned". For each remaining cell, I build a dictionary called "cell" holding the postal code, borough, and neighborhood, and append it to table_contents. Finally, I use pandas to build a DataFrame from the list and normalize the borough names that the scrape ran together, so the values are consistent. At that point the web scraping is complete.

In [2]:
soup = BeautifulSoup(wiki_page.text, 'html.parser')
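table_contents=[]
table=soup.find('table')
# each <td> cell holds one postal code; find_all is the current name for the deprecated findAll
for row in table.find_all('td'):
    cell = {}
    # skip cells whose postal code is not assigned to a borough
    if row.span.text=='Not assigned':
        pass
    else: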
        cell['Postal Code'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# print(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
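
To make the chained string operations above easier to follow, here is a minimal sketch that applies the same steps to the text of a single cell. The helper name `split_cell_text` and the sample strings are illustrative assumptions, not part of the notebook; the first sample mimics the "Borough(Name / Name)" shape that the scraping code expects.

```python
def split_cell_text(text):
    """Split 'Borough(Name / Name)' into (borough, 'Name, Name')."""
    # Everything before the first '(' is the borough
    borough, _, rest = text.partition("(")
    # Same clean-up chain as the notebook: drop ')' and turn ' /' into ','
    neighborhood = rest.strip(")").replace(" /", ",").replace(")", " ").strip(" ")
    return borough, neighborhood


print(split_cell_text("Scarborough(Malvern / Rouge)"))
# ('Scarborough', 'Malvern, Rouge')
```

Unlike indexing `split('(')[1]`, `partition` returns an empty string when the text has no parenthesis, so a cell without a neighborhood list would produce an empty neighborhood instead of raising an `IndexError`.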

In the cell below, I print the shape of the DataFrame, which in this case is 103 rows and 3 columns. I then sort the data by postal code, reset the index, and display the first 10 rows to confirm that the formatting is correct.

In [3]:
print(df.shape)
df.sort_values(by = "Postal Code", ascending = True, inplace = True)
df.reset_index(drop = True, inplace = True)
df.head(10)

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


### Part 2: Getting Latitude and Longitude Values

Because I was unable to use Geocoder, I load the coordinates from the CSV file provided by Coursera instead. I read it into a pandas DataFrame and check the first 5 rows.

In [4]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv"
lat_long = pd.read_csv(url)
lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Here, I merge the two DataFrames on the "Postal Code" column and check the first 10 rows. Each postal code now has its latitude and longitude attached.

In [5]:
df = pd.merge(df, lat_long, on = "Postal Code")
df.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
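
The default inner merge silently drops any postal code that has no match in the coordinates file. As a hedged sketch, a left merge with `indicator=True` makes unmatched codes visible. The small frames below are made-up stand-ins, and the code "M9Z" is hypothetical; this is not the notebook's data.

```python
import pandas as pd

# Made-up stand-ins for df and lat_long; "M9Z" is a hypothetical unmatched code
codes = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Example"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate= raises if a postal code is duplicated on either side;
# indicator= adds a "_merge" column saying where each row came from
checked = pd.merge(codes, coords, on="Postal Code", how="left",
                   validate="one_to_one", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
# ['M9Z']
```

An empty `unmatched` list would confirm that every postal code received coordinates.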


### Part 3: Exploring the Data

I install and import folium in order to create a map of Toronto.

In [6]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



I first define the location of Toronto in latitude and longitude. I then use folium.Map to create a map centered on Toronto with a starting zoom of 11. Next, I iterate through the data and add a blue circle marker at each location, with a popup showing the neighborhood and borough. Finally, I display the map, which shows how the postal code areas are spread across the city.

In [7]:
latitude = 43.6532
longitude = -79.3832
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label='{}, {}'.format(neighborhood, borough)
    label=folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

The cell below holds the Foursquare API credentials and version. I also set the venue limit to 25 and the search radius to 500 meters.

In [8]:
# @hidden_cell
CLIENT_ID = 'YM1LMKNPYUUCMV1TD1XKTCOZZ1ZWJZR2AHISIGPJST0HBJWC' # your Foursquare ID
CLIENT_SECRET = 'LO0OLVMN2Y2ARRGQZHCHG5QLRYUSQWEQDLHKRZEFLKOY34SK' # your Foursquare Secret
VERSION = '20210705' # Foursquare API version
LIMIT = 25
radius = 500

The function below queries the Foursquare explore endpoint for each neighborhood and collects the name, coordinates, and category of the venues within the given radius.

In [9]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]["groups"][0]["items"]
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
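
The function above assumes that every response contains at least one group and that every venue has at least one category. As a sketch under that observation, the row extraction can be separated from the HTTP request and made tolerant of missing pieces. The helper name `extract_venue_rows` and the sample payload are hypothetical; only the nesting of keys is taken from the function above.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Turn one explore-style JSON payload into a list of row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # A venue without categories gets None instead of raising IndexError
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


# Hypothetical payload shaped like the keys the notebook reads
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 43.8, "lng": -79.2},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabelled Spot",
               "location": {"lat": 43.81, "lng": -79.21},
               "categories": []}},
]}]}}

rows = extract_venue_rows("Example", 43.806686, -79.194353, sample)
print(len(rows), rows[0][6], rows[1][6])
# 2 Coffee Shop None
```

Keeping the parsing in its own function also makes it possible to test the logic without calling the API.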

The cell below calls the function for every neighborhood to build the Toronto venues DataFrame. The function prints each neighborhood name as it is processed.

In [10]:
toronto_venues = getNearbyVenues(names = df['Neighborhood'], latitudes = df['Latitude'], longitudes = df['Longitude'])

Malvern, Rouge
Rouge Hill, Port Union, Highland Creek
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
Kennedy Park, Ionview, East Birchmount Park
Golden Mile, Clairlea, Oakridge
Cliffside, Cliffcrest, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Wexford Heights, Scarborough Town Centre
Wexford, Maryvale
Agincourt
Clarks Corners, Tam O'Shanter, Sullivan
Milliken, Agincourt North, Steeles East, L'Amoreaux East
Steeles West, L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
York Mills, Silver Hills
Willowdale, Newtonbrook
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Don Mills South
Bathurst Manor, Wilson Heights, Downsview North
Northwood Park, York University
Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Parkview Hill, Woodbine Gardens
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
The Danforth  East
The Danforth West, Riverdale


Here I check the first rows of the Toronto venues DataFrame and count the venues returned per neighborhood to make sure the results look reasonable.

In [11]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Guildwood, Morningside, West Hill",43.763573,-79.188711,RBC Royal Bank,43.76679,-79.191151,Bank
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Sail Sushi,43.765951,-79.191275,Restaurant


In [12]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,4,4,4,4,4,4
"Alderwood, Long Branch",8,8,8,8,8,8
"Bathurst Manor, Wilson Heights, Downsview North",22,22,22,22,22,22
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
...,...,...,...,...,...,...
Willowdale West,6,6,6,6,6,6
"Willowdale, Newtonbrook",2,2,2,2,2,2
Woburn,3,3,3,3,3,3
Woodbine Heights,8,8,8,8,8,8


In this cell, I use one-hot encoding to create a DataFrame with one indicator column per venue category and one row per venue.

In [13]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
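
The preview above begins with "Yoga Studio" rather than "Neighborhood". One possible explanation, offered here as an assumption, is that a venue category is itself named "Neighborhood": the assignment then overwrites that existing column in place, and the column move shifts whichever category happened to be last. A minimal sketch of a guard against this, using made-up data and an arbitrary replacement column name:

```python
import pandas as pd

# Made-up venues; one category is literally named "Neighborhood"
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Venue Category": ["Neighborhood", "Park", "Yoga Studio"],
})

onehot = pd.get_dummies(venues["Venue Category"])
# Rename the clashing category so it is not overwritten (the new name is arbitrary)
onehot = onehot.rename(columns={"Neighborhood": "Neighborhood (category)"})
# insert() places the column first and raises if the name already exists
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

print(list(onehot.columns))
# ['Neighborhood', 'Neighborhood (category)', 'Park', 'Yoga Studio']
```

Because `insert` refuses to reuse an existing column name, a clash would surface as an error instead of a silently reordered table.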


Here I group the one-hot rows by neighborhood and take the mean, which gives the relative frequency of each venue category in that neighborhood.

In [14]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
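
To see why the mean of one-hot columns is a frequency, consider this small sketch with made-up data: neighborhood "A" has four venues, one of which is a park, so its "Park" value becomes 0.25.

```python
import pandas as pd

# Made-up one-hot rows: one row per venue, one column per category
onehot = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B"],
    "Park":         [1, 0, 0, 0, 0],
    "Cafe":         [0, 1, 1, 1, 1],
})

# The mean of 0/1 indicators is the share of venues in that category
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of the result sums to 1 across the category columns, since every venue belongs to exactly one category in this toy data.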

Here I define a function that returns the most common venue categories for a neighborhood.

In [15]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
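
As a quick illustration, the function can be applied to a single made-up row. The row below is hypothetical; its first entry plays the role of the Neighborhood column, which the function skips with `iloc[1:]`.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighborhood name), then rank the frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Made-up frequencies for one neighborhood
row = pd.Series({"Neighborhood": "Example", "Park": 0.2,
                 "Coffee Shop": 0.5, "Bank": 0.3})
top = list(return_most_common_venues(row, 2))
print(top)
# ['Coffee Shop', 'Bank']
```

When several categories share the same frequency (for example, many zeros), the order among the tied categories is not meaningful, so the lower-ranked entries of a sparse neighborhood should be read with care.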

Here I create a DataFrame listing the top 3 venue categories for each neighborhood.

In [16]:
num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Agincourt,Breakfast Spot,Lounge,Latin American Restaurant
1,"Alderwood, Long Branch",Pizza Place,Pub,Pharmacy
2,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Pizza Place
3,Bayview Village,Japanese Restaurant,Chinese Restaurant,Bank
4,"Bedford Park, Lawrence Manor East",Coffee Shop,Sandwich Place,Italian Restaurant
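
The column names in the table above come from pairing each rank with an ordinal suffix. A small sketch of the same idea without `try`/`except` follows; the helper name `top_venue_columns` is hypothetical, and it mirrors the notebook's behavior of using "th" for every rank past the third.

```python
def top_venue_columns(num_top_venues):
    """Build ['Neighborhood', '1st Most Common Venue', ...] column names."""
    suffixes = ['st', 'nd', 'rd']
    columns = ['Neighborhood']
    for position in range(1, num_top_venues + 1):
        # Ranks 1 to 3 use st/nd/rd; every later rank falls back to 'th'
        suffix = suffixes[position - 1] if position <= len(suffixes) else 'th'
        columns.append('{}{} Most Common Venue'.format(position, suffix))
    return columns


print(top_venue_columns(4))
# ['Neighborhood', '1st Most Common Venue', '2nd Most Common Venue',
#  '3rd Most Common Venue', '4th Most Common Venue']
```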


Here I import the KMeans model so I can cluster the data. 

In [17]:
from sklearn.cluster import KMeans

I define the number of clusters I want (11) and then fit a KMeans model to the grouped Toronto DataFrame with the Neighborhood column dropped.

In [18]:
kclusters = 11

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
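
The value of 11 clusters is a choice rather than something the data dictates. One common way to compare candidate values, sketched here on synthetic data standing in for `toronto_grouped_clustering`, is the silhouette score from scikit-learn. The blob data, the range of k, and `n_init=10` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in: three tight, well-separated blobs of 20 points each
features = np.vstack([rng.normal(loc=center, scale=0.05, size=(20, 2))
                      for center in (0.0, 1.0, 2.0)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(features)
    # Silhouette is higher when clusters are compact and far apart
    scores[k] = silhouette_score(features, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On real venue frequencies the scores are usually much less clear-cut than on this toy data, so the result is a guide rather than a verdict.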



Here I create a DataFrame that includes the cluster each neighborhood belongs to, its top 3 venue categories, and the neighborhood's location information. I also drop the rows with no cluster label (NaN), which are the neighborhoods missing from the venue results.

In [19]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

# join the cluster labels and top venues onto df, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.dropna(subset = ["Cluster Labels"], inplace=True)
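
A side effect of this join is that "Cluster Labels" becomes a float column, because the unmatched rows hold NaN before they are dropped. The sketch below reproduces this with made-up frames and shows one way to cast the labels back to integers afterwards; the cast is a suggestion, not something the notebook does.

```python
import pandas as pd

# Made-up frames: neighborhood "B" has no cluster label
neighborhoods = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
clusters = pd.DataFrame({"Neighborhood": ["A", "C"], "Cluster Labels": [0, 2]})

merged = neighborhoods.join(clusters.set_index("Neighborhood"), on="Neighborhood")
dtype_after_join = merged["Cluster Labels"].dtype  # float64 because of the NaN

# Drop unlabeled rows, then restore integer labels
merged = merged.dropna(subset=["Cluster Labels"])
merged["Cluster Labels"] = merged["Cluster Labels"].astype(int)
print(dtype_after_join, merged["Cluster Labels"].tolist())
# float64 [0, 2]
```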

In [20]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In the cell below, I create a map with a marker at each neighborhood, colored according to the cluster it belongs to.

In [21]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The cells below list the neighborhoods assigned to each cluster, one cluster per cell, along with each neighborhood's borough and top 3 venue categories.

In [22]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
96,North York,0.0,Pizza Place,Yoga Studio,Museum
100,Etobicoke,0.0,Pizza Place,Mobile Phone Shop,Bus Line


In [23]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
2,Scarborough,1.0,Breakfast Spot,Donut Shop,Mexican Restaurant
3,Scarborough,1.0,Coffee Shop,Korean BBQ Restaurant,Yoga Studio
6,Scarborough,1.0,Hobby Shop,Convenience Store,Discount Store
8,Scarborough,1.0,American Restaurant,Motel,Movie Theater
12,Scarborough,1.0,Breakfast Spot,Lounge,Latin American Restaurant
15,Scarborough,1.0,Fast Food Restaurant,Coffee Shop,Chinese Restaurant
18,North York,1.0,Coffee Shop,Fast Food Restaurant,Clothing Store
24,North York,1.0,Grocery Store,Pharmacy,Supermarket
27,North York,1.0,Coffee Shop,Gym,Restaurant
28,North York,1.0,Bank,Coffee Shop,Pizza Place


In [24]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
5,Scarborough,2.0,Playground,Yoga Studio,Museum


In [25]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
14,Scarborough,3.0,Playground,Park,Intersection
25,North York,3.0,Fast Food Restaurant,Food & Drink Shop,Park
30,North York,3.0,Airport,Park,Electronics Store
31,North York,3.0,Grocery Store,Park,Bank
33,North York,3.0,Liquor Store,Gym / Fitness Center,Grocery Store
44,Central Toronto,3.0,Swim School,Park,Bus Line
48,Central Toronto,3.0,Lawyer,Park,Restaurant
64,Central Toronto,3.0,Jewelry Store,Sushi Restaurant,Park
75,Downtown Toronto,3.0,Grocery Store,Café,Park
79,North York,3.0,Construction & Landscaping,Massage Studio,Park


In [26]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
4,Scarborough,4.0,Caribbean Restaurant,Hakka Restaurant,Gas Station
7,Scarborough,4.0,Bus Line,Bakery,Soccer Field
9,Scarborough,4.0,College Stadium,Skating Rink,Café
10,Scarborough,4.0,Indian Restaurant,Vietnamese Restaurant,Pet Store
11,Scarborough,4.0,Middle Eastern Restaurant,Smoke Shop,Bakery
13,Scarborough,4.0,Fast Food Restaurant,Pizza Place,Italian Restaurant
17,North York,4.0,Pool,Golf Course,Dog Run
19,North York,4.0,Japanese Restaurant,Chinese Restaurant,Bank
22,North York,4.0,Ramen Restaurant,Pizza Place,Japanese Restaurant
26,North York,4.0,Caribbean Restaurant,Athletics & Sports,Café


In [27]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
32,North York,5.0,Food Truck,Baseball Field,Yoga Studio
91,Etobicoke,5.0,Baseball Field,Yoga Studio,Music Venue
97,North York,5.0,Baseball Field,Yoga Studio,Music Venue


In [28]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 6, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
21,North York,6.0,Park,Yoga Studio,Museum
23,North York,6.0,Construction & Landscaping,Convenience Store,Park
40,East York/East Toronto,6.0,Convenience Store,Park,Yoga Studio
50,Downtown Toronto,6.0,Park,Playground,Trail
74,York,6.0,Park,Women's Store,Pool
98,York,6.0,Park,Convenience Store,Yoga Studio


In [29]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 7, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Scarborough,7.0,Fast Food Restaurant,Poke Place,Men's Store


In [30]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 8, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
94,Etobicoke,8.0,Bakery,Yoga Studio,Music Venue


In [31]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 9, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
1,Scarborough,9.0,Bar,Yoga Studio,Music Venue


In [32]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 10, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
90,Etobicoke,10.0,Pool,River,Music Venue
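
The eleven cells above differ only in the cluster number. As a compact complement, the size of every cluster can be summarized in one step. The sketch uses a made-up frame with the same "Cluster Labels" column name; the rows are not the notebook's data.

```python
import pandas as pd

# Made-up stand-in for toronto_merged with float cluster labels
toy = pd.DataFrame({
    "Borough": ["North York", "Scarborough", "Scarborough", "Etobicoke"],
    "Cluster Labels": [0.0, 1.0, 1.0, 3.0],
})

# Count neighborhoods per cluster, ordered by cluster number
sizes = toy["Cluster Labels"].astype(int).value_counts().sort_index()
print(sizes.to_dict())
# {0: 1, 1: 2, 3: 1}
```

A summary like this makes it easy to spot clusters that contain only a single neighborhood.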


## Thank You!