# Capstone Project - The Battle of the Neighborhoods (Week 1)
### Applied Data Science Capstone by IBM/Coursera

## Introduction: Business Problem 

For a healthy life, you need enough sleep, a balanced diet, and regular exercise. Working out three times a week is a good target. You can exercise in a park, but going to a gym is sometimes more motivating. In this project we will therefore try to find the best place to open a **gym** in **Toronto** for a contractor who wants to start his own business.

There are obviously already gyms in **Toronto**, so our study will focus on finding areas with many gyms as well as areas with none. It is important to understand why some places do or do not have this kind of activity, and to let the contractor choose whether he prefers a gym near an industrial or a residential area.


Is it possible to segment neighborhoods with clustering in order to choose a location for a gym?



## Data

To answer this question, we will build a dataframe of the neighborhoods of **Toronto** by scraping a Wikipedia page.

We will obtain the geographical coordinates of **Toronto**, then use the **Foursquare API** to analyse each neighborhood and rank its most common venues. We also need the number of gyms in each neighborhood, along with their names and locations.
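As a rough illustration of the per-neighborhood summary described above, the sketch below counts gyms and finds the most common venue category on a small hand-made table. The venue names and the column names `Venue` and `Venue Category` are assumptions made for illustration; they are not taken from an actual Foursquare response.

```python
import pandas as pd

# Hypothetical venue records (toy data, not real API results).
venues = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Parkwoods",
                     "Don Mills", "Don Mills", "Don Mills",
                     "Victoria Village"],
    "Venue": ["Gym A", "Coffee B", "Gym C",
              "Park D", "Park E", "Gym F",
              "Coffee G"],
    "Venue Category": ["Gym", "Coffee Shop", "Gym",
                       "Park", "Park", "Gym",
                       "Coffee Shop"],
})

# Number of gyms per neighborhood; neighborhoods without a gym get 0.
is_gym = venues["Venue Category"].eq("Gym")
gym_counts = (
    is_gym.groupby(venues["Neighborhood"]).sum().astype(int).rename("Gym Count")
)

# Most common venue category per neighborhood.
top_category = (
    venues.groupby("Neighborhood")["Venue Category"]
    .agg(lambda s: s.value_counts().idxmax())
    .rename("Most Common Category")
)

# One row per neighborhood with both pieces of information.
summary = pd.concat([gym_counts, top_category], axis=1).reset_index()
print(summary)
```

A table like `summary` makes it easy to separate neighborhoods with many gyms from those with none before any clustering is attempted.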


The neighborhood data come from a Wikipedia page:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

A second dataset, with the geographical coordinates of each neighborhood, is available here. It was obtained in week 3 of the Capstone project.
https://cocl.us/Geospatial_data

***As a first step, we will extract the data from Wikipedia, load them into a Pandas dataframe, and clean it.***

Libraries we need.

In [1]:
! pip install lxml

import pandas as pd
import numpy as np

# import k-means from clustering
from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform a JSON file into a pandas dataframe (top-level import, pandas >= 1.0)


Collecting lxml
  Downloading https://files.pythonhosted.org/packages/64/28/0b761b64ecbd63d272ed0e7a6ae6e4402fc37886b59181bfdf274424d693/lxml-4.6.1-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
Installing collected packages: lxml
Successfully installed lxml-4.6.1


In [None]:
# Geocoder to convert an address into geographical coordinates
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim 

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported and installed')

***We will read the page with pandas, which returns a list of the tables found on it.***

In [3]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
list_wiki=pd.read_html(url)
list_wiki


[     Unnamed: 0         Neighborhood name Within larger district  \
 0             1             North Seattle                Seattle   
 1             2                 Broadview      North Seattle[42]   
 2             3               Bitter Lake      North Seattle[42]   
 3             4  North Beach / Blue Ridge      North Seattle[42]   
 4             5                Crown Hill      North Seattle[42]   
 ..          ...                       ...                    ...   
 122         123                 Riverview          Delridge[164]   
 123         124             Highland Park          Delridge[165]   
 124         125            South Delridge          Delridge[166]   
 125         126                   Roxhill          Delridge[167]   
 126         127                High Point          Delridge[168]   
 
                       Annexed[41]  Locator map  Street map  Image  \
 0                         Various          NaN         NaN    NaN   
 1                        1954

***We will convert the first table of the list to a dataframe df and rename the column "Neighbourhood" to "Neighborhood".***

In [4]:
df=pd.DataFrame(list_wiki[0])
df = df.rename(columns = {"Neighbourhood":"Neighborhood"})
df.head()

Unnamed: 0.1,Unnamed: 0,Neighborhood name,Within larger district,Annexed[41],Locator map,Street map,Image,Notes
0,1,North Seattle,Seattle,Various,,,,North of the Lake Washington Ship Canal[42]
1,2,Broadview,North Seattle[42],1954[43],,,,[44]
2,3,Bitter Lake,North Seattle[42],1954[43],,,,[45]
3,4,North Beach / Blue Ridge,North Seattle[42],"1940,[43] 1954[43]",,,,[46]
4,5,Crown Hill,North Seattle[42],"1907,[47] 1952,[43] 1954[43]",,,,[48]
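Since `pd.read_html` returns every table on the page, the table at index 0 is not guaranteed to be the one we want if the page layout changes. A minimal sketch of a more defensive selection is shown below, assuming the expected column names are `Postal Code`, `Borough` and `Neighbourhood`. The helper `pick_table` is hypothetical, and the list of tables is toy data standing in for the result of `pd.read_html`.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required_columns."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"No table contains the columns {sorted(required)}")

# Toy stand-ins for the list returned by pd.read_html (hypothetical data).
tables = [
    pd.DataFrame({"Unnamed: 0": [1], "Notes": ["navigation box"]}),
    pd.DataFrame({
        "Postal Code": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighbourhood": ["Parkwoods", "Victoria Village"],
    }),
]

# Select the table by its columns instead of by position, then rename.
df_demo = pick_table(tables, ["Postal Code", "Borough", "Neighbourhood"])
df_demo = df_demo.rename(columns={"Neighbourhood": "Neighborhood"})
print(df_demo.columns.tolist())
```

Selecting by column names fails loudly with a `ValueError` when no table matches, rather than silently continuing with an unrelated table.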


## Pre-Processing

*Now let us clean the dataframe: we first replace every "Not assigned" value with NaN,
then drop the rows whose Borough is NaN.*

In [4]:
df.replace("Not assigned", np.nan, inplace=True)  # np.nan: the np.NaN alias was removed in NumPy 2.0

In [5]:
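# Drop the rows without a borough
df.dropna(subset=['Borough'], axis=0, inplace=True)
# reset_index returns a new dataframe, so assign it back to keep the new index
df = df.reset_index(drop=True)
df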

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [6]:
df.shape

(103, 3)
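The cleaning steps above can also be written as a single chain without `inplace=True`. The following is a minimal sketch on toy data (the rows are made up for illustration), not the notebook's actual table.

```python
import numpy as np
import pandas as pd

# Toy data imitating the raw postal code table (hypothetical rows).
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

clean = (
    raw.replace("Not assigned", np.nan)   # mark placeholder strings as missing
       .dropna(subset=["Borough"])        # drop rows without a borough
       .reset_index(drop=True)            # renumber the remaining rows from 0
)
print(clean)
```

Each method returns a new dataframe, so the original `raw` is left unchanged and the result only exists once it is assigned to `clean`.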

***We will read another dataframe with the geographical coordinates of each postal code and merge it with our first dataframe.***

In [10]:
#Read the new dataframe
df_coord=pd.read_csv('https://cocl.us/Geospatial_data')
df_coord.head()

In [8]:
#Merge the two dataframes
df_merged = df.merge(df_coord, on='Postal Code')  # join explicitly on the shared key
df_merged.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
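When merging on the postal code, it can be useful to check that no row loses its coordinates. The sketch below uses toy data and assumes the key column is named `Postal Code` in both tables. The missing `M5A` coordinates are a made-up scenario for illustration.

```python
import pandas as pd

# Toy neighborhood table (left) and coordinate table (right).
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

merged = left.merge(
    right,
    on="Postal Code",        # explicit join key
    how="left",              # keep every neighborhood row
    validate="one_to_one",   # raise if a postal code is duplicated on either side
    indicator=True,          # add a _merge column describing each row's origin
)

# Postal codes that found no match in the coordinate table.
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

With the default inner join, unmatched rows would simply disappear. A left join with `indicator=True` makes them visible so they can be handled deliberately.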


**This is the main dataframe we will use to solve our problem.** 