<h3>Introduction to data</h3>

We will use data from the sources below.

For **Toronto**, we scrape the postal code table from "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" and cleanse it.

**Geocoder_File**: Geospatial_Coordinates.csv

The **New York** neighbourhood data is freely available on the internet; it has been downloaded to a local file for use in this assignment.

**Filename**: 
nyu_2451_34572-geojson.json

**FourSquare Data**:

Another part of this project is the Foursquare venue data. Although the data is good as provided,
its coverage and accuracy are limited, so it cannot guarantee a fully correct classification in the real world.



**Let's have a look at the data in these files**

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0 exposes this as pd.json_normalize)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

from bs4 import BeautifulSoup # HTML parsing for the Wikipedia table

Solving environment: ...working... done

# All requested packages already installed.

Solving environment: ...working... done

# All requested packages already installed.



<h3>Load and Explore the data</h3>

In [5]:
# Download the Wikipedia page and parse the postal code table
website_url = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
soup = BeautifulSoup(website_url, 'lxml')
#print(soup.prettify())
my_table = soup.find('table', {'class': "wikitable sortable"})

# Create one list per table column
A = []
B = []
C = []

for row in my_table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 3: # only extract table body rows, not the heading
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))

df = pd.DataFrame(A, columns=['PostCode'])
df['Borough'] = B
df['Neighbourhood'] = C

# Drop rows that have no borough assigned
df = df[df.Borough != 'Not assigned'].copy()

# Where no neighbourhood is assigned, fall back to the borough name
not_assigned = df.Neighbourhood == "Not assigned"
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']

# Combine the neighbourhoods that share a postcode into one comma-separated row
def neighbourhood_list(grouped):
    return ', '.join(sorted(grouped['Neighbourhood'].tolist()))

grp = df.groupby(['PostCode', 'Borough'])
new_df = grp.apply(neighbourhood_list).reset_index(name='Neighbourhood')
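
The cleansing steps in the cell above can be tried on a few rows without downloading the page. This is only a sketch: the rows in `raw` are hypothetical stand-ins for the scraped table, and the variable names `raw`, `clean` and `grouped` are illustrative, not part of the notebook.

```python
import pandas as pd

# Hypothetical rows mimicking the scraped table; values are illustrative only.
raw = pd.DataFrame({
    "PostCode": ["M1A", "M3A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Regent Park",
                      "Harbourfront", "Not assigned"],
})

# 1. Drop rows with no borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# 2. Fall back to the borough name where the neighbourhood is missing.
missing = clean["Neighbourhood"] == "Not assigned"
clean.loc[missing, "Neighbourhood"] = clean.loc[missing, "Borough"]

# 3. One row per postcode, neighbourhoods joined in alphabetical order.
grouped = (
    clean.groupby(["PostCode", "Borough"])["Neighbourhood"]
    .apply(lambda names: ", ".join(sorted(names)))
    .reset_index()
)
print(grouped)
```

The two `M5A` rows collapse into a single row, and the `M7A` row takes its borough name as its neighbourhood.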

In [6]:
df_geo=pd.read_csv('C:\\Users\\HimaniVerma\\Downloads\\Geospatial_Coordinates.csv')
df_geo = df_geo.rename(columns={'Postal Code': 'PostCode'})
df_geo.drop_duplicates(subset='PostCode', keep="last", inplace=True)

#creating Toronto_df by merging the two dataframes
toronto_df = pd.merge(new_df,df_geo,how='inner', on='PostCode')
toronto_df.head()

|   | PostCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Port Union, Rouge Hill | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood\n, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


In [7]:
toronto_df.shape

(103, 5)
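
An inner merge silently drops any postcode that has no matching coordinate row. One way to check for this, sketched here on hypothetical data, is a left merge with `indicator=True`. The frames `postcodes` and `coords` stand in for `new_df` and `df_geo`, and the postcode `M9Z` is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for new_df and df_geo; M9Z is a made-up postcode.
postcodes = pd.DataFrame({
    "PostCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
coords = pd.DataFrame({
    "PostCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left merge with indicator=True keeps every postcode and records
# in the "_merge" column whether a coordinate row was found for it.
checked = postcodes.merge(coords, how="left", on="PostCode", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostCode"].tolist()
print("Postcodes without coordinates:", unmatched)
```

If `unmatched` is empty, the inner merge and the left merge return the same rows.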

In [8]:
#NYU data
with open('C:\\Users\\HimaniVerma\\Downloads\\nyu_2451_34572-geojson.json') as json_data:
    newyork_data = json.load(json_data)

neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)
neighborhoods.head()

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


In [9]:
neighborhoods.shape

(306, 4)
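
The extraction loop relies on GeoJSON storing each point as `[longitude, latitude]`, which is easy to get backwards. The sketch below runs the same logic on a two-feature sample. The `sample` string is a hypothetical cut-down of the file that contains only the fields the loop reads, and `feature_to_row` is an illustrative helper, not part of the notebook.

```python
import json

# Hypothetical two-feature GeoJSON, reduced to the fields the loop reads.
sample = json.loads("""
{"features": [
  {"properties": {"borough": "Bronx", "name": "Wakefield"},
   "geometry": {"type": "Point", "coordinates": [-73.847201, 40.894705]}},
  {"properties": {"borough": "Bronx", "name": "Co-op City"},
   "geometry": {"type": "Point", "coordinates": [-73.829939, 40.874294]}}
]}
""")

def feature_to_row(feature):
    """Turn one GeoJSON point feature into a flat dictionary."""
    # GeoJSON positions list longitude first, then latitude.
    lon, lat = feature["geometry"]["coordinates"][:2]
    props = feature["properties"]
    return {"Borough": props["borough"], "Neighborhood": props["name"],
            "Latitude": lat, "Longitude": lon}

rows = [feature_to_row(f) for f in sample["features"]]
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`.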

_We can now use this data to plot maps of Toronto and New York City._

In [10]:
#Map_Toronto

#Locating Toronto on map

address = 'Toronto, ON'

# geopy requires a user_agent; the name "toronto_explorer" is an arbitrary placeholder
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
#print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto



In [11]:
#Map New York City

#Use geopy library to get the latitude and longitude values of New York City

address = 'New York City, NY'

# geopy requires a user_agent; the name "ny_explorer" is an arbitrary placeholder
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
#print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_newyork)
    
map_newyork



We have cleansed and converted our source data into pandas DataFrames.
However, for illustration purposes, let's simplify the above maps and segment and cluster only the neighborhoods in East Toronto (for Toronto) and Manhattan (for New York).
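
Restricting each city to a single borough is a plain row filter on the `Borough` column. The sketch below shows one way to do it. The sample frames hold a few illustrative rows rather than the full data, and `select_borough` is a hypothetical helper name.

```python
import pandas as pd

# Small illustrative samples; the real frames are toronto_df and neighborhoods.
toronto_sample = pd.DataFrame({
    "PostCode": ["M4E", "M1B", "M4K"],
    "Borough": ["East Toronto", "Scarborough", "East Toronto"],
    "Neighbourhood": ["The Beaches", "Malvern, Rouge",
                      "The Danforth West, Riverdale"],
})
newyork_sample = pd.DataFrame({
    "Borough": ["Bronx", "Manhattan"],
    "Neighborhood": ["Wakefield", "Marble Hill"],
})

def select_borough(df, borough):
    """Keep only the rows of one borough and renumber the index from 0."""
    return df[df["Borough"] == borough].reset_index(drop=True)

east_toronto = select_borough(toronto_sample, "East Toronto")
manhattan = select_borough(newyork_sample, "Manhattan")
print(east_toronto)
print(manhattan)
```

Resetting the index keeps later positional lookups consistent after the filter.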

Then we can use **Foursquare** data and apply machine learning algorithms such as **K-means** clustering to classify the neighborhoods and achieve our objective of comparing the two popular cities.
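
As a rough preview of the clustering step, the sketch below runs scikit-learn's `KMeans` on a tiny feature matrix. Everything in it is an assumption for illustration: the neighborhood labels `A` to `D`, the three venue categories, and the frequencies are made up and are not Foursquare output.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical venue-category frequencies per neighborhood
# (columns: cafe, park, bar); not real Foursquare data.
names = ["A", "B", "C", "D"]
features = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
])

# Fit k-means with two clusters; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = dict(zip(names, kmeans.labels_))
print(labels)
```

Neighborhoods with similar venue profiles receive the same cluster label; the label numbers themselves carry no meaning.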