## Introduction/Business Problem

In this project we want to give real estate agencies an objective criterion for identifying neighborhoods with similar characteristics in two cities, without requiring first-hand knowledge of either city. This should lead to more satisfied customers, since they will move to an environment with the characteristics they expect. One question our application will be able to answer is: which Toronto neighborhoods should I move to if I want characteristics similar to those of Williamsbridge?


To achieve our goal we are going to use a concept called UQI (Urban Quality Index). Since the beginning of the 20th century, many public and private administrations have worked to measure the quality of cities, neighborhoods, etc. To this end, many indicators and indices have been created that allow different administrative units to be compared quantitatively. These indicators have several dimensions: some are objective, such as the economic, social and services dimensions, and others are subjective, such as the perception that citizens have of their city or neighborhood.

In our project we are going to create three objective dimensions (Social, Economic and Services), leaving for future projects the inclusion of a subjective dimension, which could be obtained by processing customer reviews of the places.

We are also not going to build a UQI in the usual way, by weighting the dimensions to obtain a single numerical value. Instead, we will apply Machine Learning to the dimensions, using them as the input variables of an unsupervised clustering algorithm.
To weight each of the three dimensions in each neighborhood, we are going to measure the number of places within a radius of 500 meters around the point where the neighborhood is geolocated, that is, the density. For example, for the Economic dimension we will count how many places of an economic type Foursquare returns.
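
The density count described above can be illustrated offline with a minimal sketch. In the project the radius filter is applied by the Foursquare search itself; here the helper names `haversine_m` and `count_by_dimension` are hypothetical, the first two venues are taken from the Wakefield rows shown later in this notebook, and the third venue is invented to show a point outside the radius.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    earth_radius_m = 6371000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

def count_by_dimension(center, venues, radius_m=500):
    """Count venues per dimension (ECO, SER, SOC) within radius_m of center."""
    counts = {"ECO": 0, "SER": 0, "SOC": 0}
    for lat, lon, dimension in venues:
        if haversine_m(center[0], center[1], lat, lon) <= radius_m:
            counts[dimension] += 1
    return counts

center = (40.894705, -73.847201)  # Wakefield, from the neighborhood data
venues = [
    (40.894123, -73.845892, "ECO"),  # Lollipops Gelato
    (40.896649, -73.844846, "ECO"),  # Rite Aid
    (40.950000, -73.847201, "SER"),  # hypothetical venue, several km away
]
densities = count_by_dimension(center, venues)
print(densities)
```

The three counts per neighborhood are the values that later feed the clustering step.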

However, Foursquare does not provide these dimensions, so we first have to perform a manual task: assigning each Foursquare category to exactly one dimension. For example, "Music School" is labeled as "Social", "Internet Café" as "Economic" and "Taxi" as "Service".

Once the three dimensions have been computed for each New York neighborhood, we group the neighborhoods into 5 clusters using the unsupervised algorithm mentioned above.
We then have a dataset with three features and a class label, so we can train a model that predicts the class.

Therefore, if we take the neighborhoods of another city and calculate the same dimensions for them, we can infer which of our five neighborhood types each one belongs to. Ultimately, we can tell a client in New York which neighborhoods in Toronto have characteristics similar to Williamsbridge, as we said at the beginning.
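
The cluster-then-predict idea can be sketched as follows. This is only an illustration under stated assumptions: the text does not name the algorithms, so `KMeans` and a 1-nearest-neighbor classifier from scikit-learn are stand-ins, the helper `similar_neighborhoods` is hypothetical, and the rows are the small samples printed later in this notebook rather than the full datasets.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

features = ["ECO", "SER", "SOC"]

# Sample rows copied from the grouped New York and Toronto outputs.
ny = pd.DataFrame({
    "Neighborhood": ["Allerton", "Annadale", "Arden Heights", "Arlington",
                     "Arrochar", "Arverne", "Astoria", "Astoria Heights",
                     "Auburndale", "Bath Beach"],
    "ECO": [25, 11, 3, 3, 19, 14, 48, 7, 18, 47],
    "SER": [4, 1, 1, 3, 4, 7, 0, 4, 2, 3],
    "SOC": [0, 0, 0, 0, 0, 0, 2, 1, 0, 0],
})
toronto = pd.DataFrame({
    "Neighborhood": ["Agincourt", "Alderwood, Long Branch",
                     "Bathurst Manor, Wilson Heights, Downsview North",
                     "Bayview Village", "Bedford Park, Lawrence Manor East"],
    "ECO": [4, 7, 22, 4, 24],
    "SER": [0, 1, 1, 0, 0],
    "SOC": [0, 1, 0, 0, 0],
})

# Step 1: unsupervised clustering of New York neighborhoods into 5 groups.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
ny["Cluster"] = kmeans.fit_predict(ny[features])

# Step 2: train a classifier on (features -> cluster label).
clf = KNeighborsClassifier(n_neighbors=1).fit(ny[features], ny["Cluster"])

# Step 3: infer the neighborhood type of each Toronto neighborhood.
toronto["Cluster"] = clf.predict(toronto[features])

def similar_neighborhoods(ny_name):
    """Toronto neighborhoods predicted to share the cluster of ny_name."""
    cluster = ny.loc[ny["Neighborhood"] == ny_name, "Cluster"].iloc[0]
    return toronto.loc[toronto["Cluster"] == cluster, "Neighborhood"].tolist()

print(similar_neighborhoods("Allerton"))
```

With the full data, the same lookup would be called with "Williamsbridge" instead of "Allerton", which is used here only because it appears in the sample rows.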


## Data Section

#### New York Neighborhood

In order to get the New York neighborhood dataset:

1. Download json file from:  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json

2. A dataframe named "newyork_neighborhoods" is created from the geolocated data.
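
The two steps above can be sketched as follows. The JSON layout (a `features` list whose items carry `properties.name`, `properties.borough` and `geometry.coordinates` in longitude, latitude order) is an assumption about the file, and `features_to_frame` is a hypothetical helper; the inline sample stands in for the downloaded file.

```python
import json
import pandas as pd

# Two-feature excerpt mimicking the assumed layout of newyork_data.json.
sample = json.loads("""
{"features": [
  {"properties": {"name": "Wakefield", "borough": "Bronx"},
   "geometry": {"coordinates": [-73.847201, 40.894705]}},
  {"properties": {"name": "Co-op City", "borough": "Bronx"},
   "geometry": {"coordinates": [-73.829939, 40.874294]}}
]}
""")

def features_to_frame(data):
    """Turn the assumed GeoJSON-like structure into one row per neighborhood."""
    rows = []
    for feature in data["features"]:
        lon, lat = feature["geometry"]["coordinates"]  # longitude comes first
        rows.append({
            "Borough": feature["properties"]["borough"],
            "Neighborhood": feature["properties"]["name"],
            "Latitude": lat,
            "Longitude": lon,
        })
    return pd.DataFrame(rows, columns=["Borough", "Neighborhood",
                                       "Latitude", "Longitude"])

newyork_neighborhoods = features_to_frame(sample)
print(newyork_neighborhoods)
```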

In [1]:
import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.

if os.environ.get('RUNTIME_ENV_LOCATION_TYPE') == 'external':
    endpoint_b052ad164f274d0b9ec21008590c5f23 = 'https://s3.eu.cloud-object-storage.appdomain.cloud'
else:
    endpoint_b052ad164f274d0b9ec21008590c5f23 = 'https://s3.private.eu.cloud-object-storage.appdomain.cloud'

client_b052ad164f274d0b9ec21008590c5f23 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_API_KEY>',  # credential removed before sharing
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url=endpoint_b052ad164f274d0b9ec21008590c5f23)

body = client_b052ad164f274d0b9ec21008590c5f23.get_object(Bucket='courseracapstoneproject-donotdelete-pr-ilrbhif6fhl0iz',Key='newyork_neighborhoods.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

Newyork_Neighborhood = pd.read_csv(body)
Newyork_Neighborhood.head()


Unnamed: 0.1,Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,0,Bronx,Wakefield,40.894705,-73.847201
1,1,Bronx,Co-op City,40.874294,-73.829939
2,2,Bronx,Eastchester,40.887556,-73.827806
3,3,Bronx,Fieldston,40.895437,-73.905643
4,4,Bronx,Riverdale,40.890834,-73.912585


####  Foursquare Categories

In order to get the Foursquare Categories dataset:

1. The Foursquare API is called to get all the categories it provides, and the response is stored in a dataframe: https://api.foursquare.com/v2/venues/categories?&client_id=XXX&client_secret=YYY&v=20180605&m=foursquare

2. The dataframe is exported to Excel and classified manually: each category is assigned one dimension, ECO (Economic), SOC (Social) or SER (Service). Finally, the Excel file is imported again and saved in a dataframe called "categories_pos".
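
Since the categories response is hierarchical, it has to be flattened before it can be stored as a table. A minimal sketch, assuming the response nests sub-categories under a `categories` key at every level; `flatten_categories` is a hypothetical helper, the parent id is a placeholder, and the nesting in the sample is illustrative.

```python
def flatten_categories(categories):
    """Recursively flatten a nested category tree into a list of rows."""
    rows = []
    for category in categories:
        rows.append({"id_category": category["id"],
                     "Venue Category": category["name"]})
        # Descend into sub-categories, if any.
        rows.extend(flatten_categories(category.get("categories", [])))
    return rows

# Hand-written stand-in for the API response (assumed structure).
sample_response = {"response": {"categories": [
    {"id": "hypothetical-parent-id", "name": "Arts & Entertainment",
     "categories": [
         {"id": "4bf58dd8d48988d1e1931735", "name": "Arcade", "categories": []},
         {"id": "4bf58dd8d48988d1e2931735", "name": "Art Gallery", "categories": []},
     ]},
]}}

rows = flatten_categories(sample_response["response"]["categories"])
for row in rows:
    print(row["id_category"], row["Venue Category"])
```

The resulting rows can be passed to `pd.DataFrame(rows)` and exported for the manual classification step.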



In [2]:

body = client_b052ad164f274d0b9ec21008590c5f23.get_object(Bucket='courseracapstoneproject-donotdelete-pr-ilrbhif6fhl0iz',Key='categories_pos.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

categories_pos = pd.read_csv(body)
categories_pos.head()


Unnamed: 0,id_category,Venue Category,index
0,56aa371be4b08b9a8d5734db,Amphitheater,SER
1,4fceea171983d5d06c3e9823,Aquarium,SER
2,4bf58dd8d48988d1e1931735,Arcade,ECO
3,4bf58dd8d48988d1e2931735,Art Gallery,ECO
4,4bf58dd8d48988d1e4931735,Bowling Alley,ECO


####  New York Venues

1. For each neighborhood stored in the "newyork_neighborhoods" dataframe, a search is performed for the venues within a 500-meter radius, and the results are stored in the "newyork_venues" dataframe.

2. Finally, we combine the dataframe "" to obtain the dataframe "result", which includes the dimension of each venue.
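
One way to attach the dimension to each venue is a join on the category name. The sketch below assumes that the frames being combined are "newyork_venues" and "categories_pos" and that "Venue Category" is the join key; neither is stated explicitly in the text. The third venue is hypothetical and shows how unclassified categories surface.

```python
import pandas as pd

newyork_venues = pd.DataFrame({
    "Neighborhood": ["Wakefield", "Wakefield", "Wakefield"],
    "Venue": ["Lollipops Gelato", "Rite Aid", "Hypothetical Venue"],
    "Venue Category": ["Dessert Shop", "Pharmacy", "Unlisted Category"],
})
categories_pos = pd.DataFrame({
    "Venue Category": ["Dessert Shop", "Pharmacy"],
    "index": ["ECO", "ECO"],  # dimension assigned in the manual step
})

# A left join keeps every venue, even if its category was never classified.
result = newyork_venues.merge(categories_pos, on="Venue Category", how="left")

# Venues whose category has no dimension yet need manual review.
unclassified = result[result["index"].isna()]
print(result)
print(len(unclassified), "venue(s) without a dimension")
```

A left join is used here so that missing classifications are visible rather than silently dropped; an inner join would discard those venues.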





In [3]:

body = client_b052ad164f274d0b9ec21008590c5f23.get_object(Bucket='courseracapstoneproject-donotdelete-pr-ilrbhif6fhl0iz',Key='newyork_venues.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

newyork_venues = pd.read_csv(body)
newyork_venues.head()


Unnamed: 0,Borough,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Bronx,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Bronx,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy
2,Bronx,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop
3,Bronx,Wakefield,40.894705,-73.847201,Walgreens,40.896528,-73.8447,Pharmacy
4,Bronx,Wakefield,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop


In [4]:

body = client_b052ad164f274d0b9ec21008590c5f23.get_object(Bucket='courseracapstoneproject-donotdelete-pr-ilrbhif6fhl0iz',Key='result (3).csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

result = pd.read_csv(body)
result.head()


Unnamed: 0.1,Unnamed: 0,Borough,City,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Category,Venue Latitude,Venue Longitude,index
0,0,Bronx,NewYork,Wakefield,40.894705,-73.847201,Lollipops Gelato,Dessert Shop,40.894123,-73.845892,ECO
1,1,Bronx,NewYork,Wakefield,40.894705,-73.847201,Rite Aid,Pharmacy,40.896649,-73.844846,ECO
2,2,Bronx,NewYork,Wakefield,40.894705,-73.847201,Walgreens,Pharmacy,40.896528,-73.8447,ECO
3,3,Bronx,NewYork,Wakefield,40.894705,-73.847201,Carvel Ice Cream,Ice Cream Shop,40.890487,-73.848568,ECO
4,4,Bronx,NewYork,Wakefield,40.894705,-73.847201,Dunkin',Donut Shop,40.890459,-73.849089,ECO


####  New York Venues Grouped

1. The venues are grouped by neighborhood, summing the number of venues in each dimension. The dataframe that we are going to use to perform the clustering is named "newyork_grouped".
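
The grouping step can be sketched with `pd.crosstab`, which counts venues per neighborhood and dimension. This is one possible way to do it, not necessarily the one used to produce the file loaded below; the input rows are illustrative and do not reproduce the real counts.

```python
import pandas as pd

# Illustrative venue rows with their assigned dimension.
result = pd.DataFrame({
    "Neighborhood": ["Allerton", "Allerton", "Allerton", "Astoria"],
    "index": ["ECO", "ECO", "SER", "SOC"],
})

# Count venues per (neighborhood, dimension) pair.
counts = pd.crosstab(result["Neighborhood"], result["index"])

# Guarantee all three dimension columns exist, filling absent ones with 0.
counts = counts.reindex(columns=["ECO", "SER", "SOC"], fill_value=0)

newyork_grouped = counts.reset_index()
newyork_grouped.columns.name = None  # drop the leftover "index" axis label
print(newyork_grouped)
```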

In [5]:

body = client_b052ad164f274d0b9ec21008590c5f23.get_object(Bucket='courseracapstoneproject-donotdelete-pr-ilrbhif6fhl0iz',Key='newyork_grouped.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

newyork_grouped = pd.read_csv(body)
newyork_grouped.head(10)


Unnamed: 0.1,Unnamed: 0,Neighborhood,ECO,SER,SOC
0,0,Allerton,25,4,0
1,1,Annadale,11,1,0
2,2,Arden Heights,3,1,0
3,3,Arlington,3,3,0
4,4,Arrochar,19,4,0
5,5,Arverne,14,7,0
6,6,Astoria,48,0,2
7,7,Astoria Heights,7,4,1
8,8,Auburndale,18,2,0
9,9,Bath Beach,47,3,0


#### Toronto Neighborhood

In order to get the Toronto neighborhood dataframe:

1. Web scraping is performed on: "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

2. Geospatial information is obtained from "https://cocl.us/Geospatial_data"

3. A Toronto_neighborhoods dataframe is created by joining both.
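
The join in step 3 can be sketched as a merge on the postal code. The column name "Postal Code" in the geospatial frame is an assumption about that file; the rows are taken from the Toronto output shown below, and both frames are built inline instead of being scraped or downloaded.

```python
import pandas as pd

# Stand-in for the table scraped from the postal codes page.
postal_codes = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})

# Stand-in for the geospatial CSV (column names assumed).
geo = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# Match rows on the postal code, then drop the duplicated key column.
toronto_neighborhoods = (
    postal_codes
    .merge(geo, left_on="PostalCode", right_on="Postal Code", how="inner")
    .drop(columns="Postal Code")
)
print(toronto_neighborhoods)
```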

In [6]:
body = client_b052ad164f274d0b9ec21008590c5f23.get_object(Bucket='courseracapstoneproject-donotdelete-pr-ilrbhif6fhl0iz',Key='Toronto_neighborhoods.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

Toronto_Neighborhood = pd.read_csv(body)
Toronto_Neighborhood.head()


Unnamed: 0.1,Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,0,M3A,North York,Parkwoods,43.753259,-79.329656
1,1,M4A,North York,Victoria Village,43.725882,-79.315572
2,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


####  Toronto Venues Grouped

1. The same procedure is carried out as for New York. This dataframe will be used as input to the model built from the New York neighborhoods, in order to predict which type of neighborhood each Toronto neighborhood belongs to.

In [7]:

body = client_b052ad164f274d0b9ec21008590c5f23.get_object(Bucket='courseracapstoneproject-donotdelete-pr-ilrbhif6fhl0iz',Key='toronto_grouped.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

toronto_grouped = pd.read_csv(body)
toronto_grouped.head()


Unnamed: 0.1,Unnamed: 0,Neighborhood,ECO,SER,SOC
0,0,Agincourt,4,0,0
1,1,"Alderwood, Long Branch",7,1,1
2,2,"Bathurst Manor, Wilson Heights, Downsview North",22,1,0
3,3,Bayview Village,4,0,0
4,4,"Bedford Park, Lawrence Manor East",24,0,0
