# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will help a real estate client who is about to move to London find a neighborhood to live in. As the client and their family enjoy the outdoors, their main criterion is greenery, i.e. parks within close proximity.

The aim of this project is to provide a list of locations within London/Greater London in which our real estate client will have close access to greenery. We will provide a list and map of these locations so that the client can then select the “green” locations that satisfy their other needs.


## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* The density of "green" areas within a 200m radius of each neighborhood.
* The distance of each "green" neighborhood from the city center.

Note that "green" areas are defined as parks, gardens and botanical gardens.

The following data sources are needed to extract/generate the required information:
* A list of London neighborhoods is extracted from the Wikipedia page: https://en.wikipedia.org/wiki/List_of_areas_of_London.
* The coordinates of each London neighborhood are taken from a CSV file downloaded from: https://www.doogal.co.uk/london_postcodes.php
* The number of "green" areas and their type and location in every neighborhood are obtained using the **Foursquare API**
* The coordinates of the London center are obtained by geocoding with the **geopy** library (Nominatim)
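The distance-from-centre factor listed above can be computed directly from latitude/longitude pairs. The following is a minimal sketch using the haversine great-circle formula; the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for illustration, and the two coordinate pairs are the London centre and Addiscombe values that appear later in this notebook.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption; adequate at city scale)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

london_centre = (51.5073219, -0.1276474)
addiscombe = (51.382106, -0.068576)
distance_km = haversine_km(*london_centre, *addiscombe)
print(round(distance_km, 1))
```

Applied row by row to a dataframe of neighbourhood coordinates, this would give one distance column that can be sorted or filtered alongside the green-area counts.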

In [3]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [4]:
!conda install -c anaconda beautifulsoup4 --yes

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.1

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    numpy-base-1.15.4          |   py36h81de0dd_0         4.2 MB  anaconda
    beautifulsoup4-4.8.2       |           py36_0         161 KB  anaconda
    numpy-1.15.4               |   py36h1d66e8a_0          35 KB  anaconda
    openssl-1.1.1              |       h7b6447c_0         5.0 MB  anaconda
    soupsieve-1.9.5            |           py36_0          61 KB  anaconda
    mkl_fft-1.0.6              |   py36h7dd41cf_0         150 KB  anaconda
    certifi-2019.11.28         |           py36_0         156 KB  anaconda
    blas-1.0                   |           

In [5]:
!conda install -c anaconda requests --yes

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.1

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - requests


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    requests-2.22.0            |           py36_1          90 KB  anaconda

The following packages will be UPDATED:

    requests: 2.22.0-py36_1 conda-forge --> 2.22.0-py36_1 anaconda


Downloading and Extracting Packages
requests-2.22.0      | 90 KB     | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done


In [6]:
!conda install -c anaconda lxml --yes # may need to restart kernel for lxml to work

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.1

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - lxml


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    lxml-4.3.0                 |   py36hefd8a0e_0         1.5 MB  anaconda

The following packages will be UPDATED:

    lxml: 4.2.5-py37hefd8a0e_0 --> 4.3.0-py36hefd8a0e_0 anaconda


Downloading and Extracting Packages
lxml-4.3.0           | 1.5 MB    | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done


### Create data frame with London Neighbourhoods and corresponding coordinates:

In [9]:
import requests
from bs4 import BeautifulSoup # The Beautiful Soup package is used to extract data from html files.

#### Extract table of London Areas from Wikipedia link defined:

In [10]:
url = "https://en.wikipedia.org/wiki/List_of_areas_of_London"
source = requests.get(url).text

The Beautiful Soup package is used to parse the HTML, i.e. to take the raw HTML text and break it into Python objects. The second argument, 'lxml', selects the parser; it can also be passed by keyword as features='lxml'.

In [11]:
soup= BeautifulSoup(source,'lxml') 
type(soup)

bs4.BeautifulSoup

##### Scrape webpage and isolate table of interest:

In [12]:
tables = soup.findChildren('table')
ldn_table = tables[1]
ldn_table_html = str(tables[1])
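Picking `tables[1]` depends on the current layout of the Wikipedia page. As a hedged alternative, the sketch below selects a table by one of its header cells instead of by position. The HTML snippet is made up to stand in for the real page, the helper name `find_table_with_header` is hypothetical, and the built-in `html.parser` is used here only so the example does not need lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables, standing in for the Wikipedia page.
html = """
<table><tr><th>Key</th></tr><tr><td>legend</td></tr></table>
<table>
  <tr><th>Location</th><th>OS grid ref</th></tr>
  <tr><td>Addiscombe</td><td>TQ345665</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed

def find_table_with_header(soup, header_text):
    """Return the first <table> with a <th> whose text equals header_text, else None."""
    for table in soup.find_all("table"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if header_text in headers:
            return table
    return None

areas_table = find_table_with_header(soup, "Location")
first_row = [td.get_text(strip=True) for td in areas_table.find_all("td")]
print(first_row)
```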

##### Convert html table to dataframe: 

In [13]:
from IPython.display import display_html
df_ldn = pd.read_html(ldn_table_html)
df_ldn

[            Location                     London borough       Post town  \
 0         Abbey Wood              Bexley, Greenwich [7]          LONDON   
 1              Acton  Ealing, Hammersmith and Fulham[8]          LONDON   
 2          Addington                         Croydon[8]         CROYDON   
 3         Addiscombe                         Croydon[8]         CROYDON   
 4        Albany Park                             Bexley  BEXLEY, SIDCUP   
 ..               ...                                ...             ...   
 528         Woolwich                          Greenwich          LONDON   
 529   Worcester Park       Sutton, Kingston upon Thames  WORCESTER PARK   
 530  Wormwood Scrubs             Hammersmith and Fulham          LONDON   
 531          Yeading                         Hillingdon           HAYES   
 532         Yiewsley                         Hillingdon    WEST DRAYTON   
 
     Postcode district Dial code OS grid ref  
 0                 SE2       020    TQ4

In [14]:
df_ldn = df_ldn[0]
df_ldn

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,020,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",020,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,020,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,020,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",020,TQ478728
...,...,...,...,...,...,...
528,Woolwich,Greenwich,LONDON,SE18,020,TQ435795
529,Worcester Park,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4,020,TQ225655
530,Wormwood Scrubs,Hammersmith and Fulham,LONDON,W12,020,TQ225815
531,Yeading,Hillingdon,HAYES,UB4,020,TQ115825


In [15]:
list(df_ldn.columns.values)

['Location',
 'London\xa0borough',
 'Post town',
 'Postcode\xa0district',
 'Dial\xa0code',
 'OS grid ref']

In [16]:
#remove spaces
df_ldn = df_ldn.rename(columns={'Post town': 'Post_Town', 'London\xa0borough': 'Borough', 'Postcode\xa0district': 'Postal_district', 'Dial\xa0code': 'Dial_code', 'OS grid ref': 'OS_grid_ref'})
list(df_ldn.columns.values)

['Location',
 'Borough',
 'Post_Town',
 'Postal_district',
 'Dial_code',
 'OS_grid_ref']
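The column names above contain non-breaking spaces (`\xa0`), which is why they have to be spelled out in the rename dictionary. A small sketch of a generic alternative follows: it replaces both kinds of space in every column name in one pass. Note that the resulting names (for example `London_borough`) differ from the shorter names chosen in this notebook, so this is an illustration rather than a drop-in replacement.

```python
import pandas as pd

# Column names as returned by the scrape, with non-breaking spaces (\xa0).
df = pd.DataFrame(columns=['Location', 'London\xa0borough', 'Post town',
                           'Postcode\xa0district', 'Dial\xa0code', 'OS grid ref'])

# Turn non-breaking spaces into ordinary spaces, then spaces into underscores.
df.columns = (df.columns
                .str.replace('\xa0', ' ', regex=False)
                .str.replace(' ', '_', regex=False))
cleaned_columns = list(df.columns)
print(cleaned_columns)
```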

### Dataframe: London Areas

In [17]:
df_ldn1= df_ldn[['OS_grid_ref','Location']].copy()
df_ldn1.head()

Unnamed: 0,OS_grid_ref,Location
0,TQ465785,Abbey Wood
1,TQ205805,Acton
2,TQ375645,Addington
3,TQ345665,Addiscombe
4,TQ478728,Albany Park


In [18]:
df_ldn1.shape

(533, 2)

#### Read data for London area coordinates into a new dataframe:
Downloaded from https://www.doogal.co.uk/london_postcodes.php

In [19]:
df_coords = pd.read_csv('/resources/labs/Capstone Project/London_postcodes.csv')
df_coords.head()

Unnamed: 0,Postcode,In Use?,Latitude,Longitude,Easting,Northing,Grid Ref,County,District,Ward,...,Quality,User Type,Last updated,Nearest station,Distance to station,Postcode area,Postcode district,Police force,Water company,Plus Code
0,BR1 1AA,Yes,51.401546,0.015415,540291,168873,TQ402688,Greater London,Bromley,Bromley Town,...,1,0,23/11/2019,Bromley South,0.218257,BR,BR1,Metropolitan Police,Thames Water,9F32C228+J5
1,BR1 1AB,Yes,51.406333,0.015208,540262,169405,TQ402694,Greater London,Bromley,Bromley Town,...,1,0,23/11/2019,Bromley North,0.253666,BR,BR1,Metropolitan Police,Thames Water,9F32C248+G3
2,BR1 1AD,No,51.400057,0.016715,540386,168710,TQ403687,Greater London,Bromley,Bromley Town,...,1,1,23/11/2019,Bromley South,0.044559,BR,BR1,Metropolitan Police,,9F32C228+2M
3,BR1 1AE,Yes,51.404543,0.014195,540197,169204,TQ401692,Greater London,Bromley,Bromley Town,...,1,0,23/11/2019,Bromley North,0.462939,BR,BR1,Metropolitan Police,Thames Water,9F32C237+RM
4,BR1 1AF,Yes,51.401392,0.014948,540259,168855,TQ402688,Greater London,Bromley,Bromley Town,...,1,0,23/11/2019,Bromley South,0.227664,BR,BR1,Metropolitan Police,Thames Water,9F32C227+HX


##### Clean data - Remove rows in which the postcode is not "in use".

In [20]:
df_coords = df_coords.rename(columns={'In Use?': 'In_Use', 'Grid Ref': 'Grid_Ref', 'Postcode district': 'Postal_district'})


In [21]:
df_coords= df_coords[df_coords.In_Use != 'No']
df_coords.head()

Unnamed: 0,Postcode,In_Use,Latitude,Longitude,Easting,Northing,Grid_Ref,County,District,Ward,...,Quality,User Type,Last updated,Nearest station,Distance to station,Postcode area,Postal_district,Police force,Water company,Plus Code
0,BR1 1AA,Yes,51.401546,0.015415,540291,168873,TQ402688,Greater London,Bromley,Bromley Town,...,1,0,23/11/2019,Bromley South,0.218257,BR,BR1,Metropolitan Police,Thames Water,9F32C228+J5
1,BR1 1AB,Yes,51.406333,0.015208,540262,169405,TQ402694,Greater London,Bromley,Bromley Town,...,1,0,23/11/2019,Bromley North,0.253666,BR,BR1,Metropolitan Police,Thames Water,9F32C248+G3
3,BR1 1AE,Yes,51.404543,0.014195,540197,169204,TQ401692,Greater London,Bromley,Bromley Town,...,1,0,23/11/2019,Bromley North,0.462939,BR,BR1,Metropolitan Police,Thames Water,9F32C237+RM
4,BR1 1AF,Yes,51.401392,0.014948,540259,168855,TQ402688,Greater London,Bromley,Bromley Town,...,1,0,23/11/2019,Bromley South,0.227664,BR,BR1,Metropolitan Police,Thames Water,9F32C227+HX
5,BR1 1AG,Yes,51.401392,0.014948,540259,168855,TQ402688,Greater London,Bromley,Bromley Town,...,1,0,23/11/2019,Bromley South,0.227664,BR,BR1,Metropolitan Police,Thames Water,9F32C227+HX


##### Create new dataframe with London coordinate data of interest : 

In [22]:
df_coords1= df_coords[['Grid_Ref','Latitude','Longitude']].copy()
df_coords1.reset_index(drop=True, inplace=True) #reset index after dropping rows
df_coords1.head()

Unnamed: 0,Grid_Ref,Latitude,Longitude
0,TQ402688,51.401546,0.015415
1,TQ402694,51.406333,0.015208
2,TQ401692,51.404543,0.014195
3,TQ402688,51.401392,0.014948
4,TQ402688,51.401392,0.014948
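Only three of the postcode file's columns are used from here on. As a sketch of a lighter-weight load, the example below reads just the needed columns with `usecols` and filters on the in-use flag in one chain. The inline CSV is a three-row excerpt of the values shown above, reduced to a few columns; the variable names are illustrative.

```python
import io
import pandas as pd

# Three rows copied from the postcode file, reduced to a few columns.
csv_text = """Postcode,In Use?,Latitude,Longitude,Grid Ref
BR1 1AA,Yes,51.401546,0.015415,TQ402688
BR1 1AB,Yes,51.406333,0.015208,TQ402694
BR1 1AD,No,51.400057,0.016715,TQ403687
"""

wanted = ['In Use?', 'Latitude', 'Longitude', 'Grid Ref']
coords = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# Keep postcodes that are in use, then drop the helper column.
coords = (coords[coords['In Use?'] == 'Yes']
          .drop(columns='In Use?')
          .rename(columns={'Grid Ref': 'Grid_Ref'})
          .reset_index(drop=True))
print(coords)
```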


### Dataframe: London Neighbourhood Coordinates

Since several postcodes can share the same grid reference, take the average of their latitude/longitude coordinates for each Grid_Ref.

In [23]:
df_coords2=df_coords1.groupby(['Grid_Ref'])[['Latitude', 'Longitude']].mean().reset_index() # select columns with a list; tuple-style selection fails in recent pandas
df_coords2.title = "London coordinates"
df_coords2.head()

Unnamed: 0,Grid_Ref,Latitude,Longitude
0,TL273000,51.684825,-0.158865
1,TL306000,51.68423,-0.11101
2,TL307000,51.684317,-0.109368
3,TL308000,51.684247,-0.108937
4,TL310001,51.684383,-0.105662


In [24]:
df_coords2.shape

(79166, 3)
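To make the averaging step concrete, here is a small sketch on four of the rows shown earlier. It also records how many postcodes contributed to each average, which the notebook itself does not do; the column name `n_postcodes` is an assumption for illustration.

```python
import pandas as pd

rows = pd.DataFrame({
    'Grid_Ref': ['TQ402688', 'TQ402694', 'TQ402688', 'TQ402688'],
    'Latitude': [51.401546, 51.406333, 51.401392, 51.401392],
    'Longitude': [0.015415, 0.015208, 0.014948, 0.014948],
})

# One row per grid reference: mean coordinates plus the number of postcodes averaged.
centroids = (rows.groupby('Grid_Ref', as_index=False)
                 .agg(Latitude=('Latitude', 'mean'),
                      Longitude=('Longitude', 'mean'),
                      n_postcodes=('Latitude', 'size')))
print(centroids)
```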

#### Merge Dataframes: "London Neighbourhood Coordinates" and "London Areas" to get the coordinates of each London neighbourhood

In [25]:
df_merge= pd.merge(left=df_ldn1,right=df_coords2, how='left', left_on='OS_grid_ref', right_on='Grid_Ref') #merged_left
df_merge.head()

Unnamed: 0,OS_grid_ref,Location,Grid_Ref,Latitude,Longitude
0,TQ465785,Abbey Wood,,,
1,TQ205805,Acton,,,
2,TQ375645,Addington,,,
3,TQ345665,Addiscombe,TQ345665,51.382106,-0.068576
4,TQ478728,Albany Park,,,


In [26]:
df_merge.shape

(533, 5)
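Many rows in the merged frame have no coordinates because their OS grid reference has no exact match in the postcode data. One way to quantify this before dropping rows is the `indicator` option of `pd.merge`. This is a sketch on two rows taken from the tables above, not a change to the notebook's pipeline.

```python
import pandas as pd

areas = pd.DataFrame({'OS_grid_ref': ['TQ465785', 'TQ345665'],
                      'Location': ['Abbey Wood', 'Addiscombe']})
coords = pd.DataFrame({'Grid_Ref': ['TQ345665'],
                       'Latitude': [51.382106],
                       'Longitude': [-0.068576]})

merged = pd.merge(areas, coords, how='left',
                  left_on='OS_grid_ref', right_on='Grid_Ref',
                  indicator=True)  # adds a '_merge' column: 'both' or 'left_only'

match_counts = merged['_merge'].value_counts()
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Location'].tolist()
print(match_counts)
print(unmatched)
```

Listing the unmatched locations makes it easier to judge whether dropping them biases the final list of candidate neighbourhoods.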

Clean dataframe: 

In [27]:
df_merge2=df_merge.dropna() #Drop all rows with NAN 
df_merge2.reset_index(drop=True, inplace=True) #reset index after dropping rows
df_merge2.head()

Unnamed: 0,OS_grid_ref,Location,Grid_Ref,Latitude,Longitude
0,TQ345665,Addiscombe,TQ345665,51.382106,-0.068576
1,TQ334813,Aldgate,TQ334813,51.515328,-0.078387
2,TQ307810,Aldwych,TQ307810,51.513626,-0.117027
3,TQ185835,Alperton,TQ185835,51.538395,-0.292108
4,TQ345695,Anerley,TQ345695,51.408792,-0.066646


In [28]:
df_merge2.shape

(359, 5)

In [29]:
df_London = df_merge2.drop(df_merge.columns[2], axis=1)
df_London.head()

Unnamed: 0,OS_grid_ref,Location,Latitude,Longitude
0,TQ345665,Addiscombe,51.382106,-0.068576
1,TQ334813,Aldgate,51.515328,-0.078387
2,TQ307810,Aldwych,51.513626,-0.117027
3,TQ185835,Alperton,51.538395,-0.292108
4,TQ345695,Anerley,51.408792,-0.066646


#### Use geopy library to get the latitude and longitude values of London.

In [30]:
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if it is already installed
import folium # map rendering library

print('Libraries imported.')

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.1

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geopy-1.20.0               |             py_0          57 KB  conda-forge
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         240 KB

The following NEW packages will be INSTALLED:

    geographiclib: 1.50-py_0         conda-forge
    geopy:         1.20.0-py_0       conda-forge

The following packages will be UPDATED:

    certifi:       2019.1

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>Tor_explorer</em>, as shown below.

In [31]:
address = 'London, England'

geolocator = Nominatim(user_agent="Tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(latitude, longitude))

The geographical coordinates of London are 51.5073219, -0.1276474.


### Foursquare
Now that we have our location candidates, let's use the Foursquare API to get info on the "green" areas in each neighbourhood.

We're interested in isolating the following categories: parks, gardens, nature preserves and botanical gardens.

#### Define Foursquare Credentials and Version

In [32]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


#### Create a function to get the top 20 venues in each London neighbourhood within a radius of 200 meters.

In [33]:
def getNearbyVenues(names, latitudes, longitudes, radius=200, limit=20):
    """Return a dataframe with up to `limit` venues within `radius` metres of each location."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            limit)

        # make the GET request; a response without 'groups' (e.g. an API error) would
        # otherwise raise KeyError: 'groups', so such locations are skipped
        response = requests.get(url).json().get('response', {})
        if 'groups' not in response:
            continue
        results = response['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Location',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues



London_venues = getNearbyVenues(names=df_London['Location'],
                                   latitudes=df_London['Latitude'],
                                   longitudes=df_London['Longitude']
                                  )


Addiscombe
Aldgate
Aldwych
Alperton
Anerley
Angel
Balham
Bankside
Barbican
Barkingside
Barnet Gate
Barnet (also Chipping Barnet, High Barnet)
Barnsbury
Bayswater
Beckenham
Becontree
Bedford Park
Belgravia
Bellingham
Belmont
Belmont
Belsize Park
Bermondsey
Berrylands
Bexleyheath (also Bexley New Town)
Biggin Hill
Blackfriars
Blackwall
Bloomsbury
Bounds Green
Bow
Brentford
Brent Cross
Brimsdown
Brixton
Bromley
Bromley (also Bromley-by-Bow)
Brompton
Brondesbury
Bulls Cross
Burnt Oak
Burroughs, The
Camberwell
Cambridge Heath
Camden Town
Canary Wharf
Cann Hall
Canning Town
Canonbury
Castelnau
Castle Green
Catford
Chalk Farm
Charlton
Chelsfield
Chinbrook
Chislehurst
Chiswick
Church End
Church End
Clapham
Clerkenwell
Colindale
Collier Row
Colliers Wood
Colyers
Coney Hall
Coulsdon
Covent Garden
Cranford
Cranham
Crayford
Cricklewood
Crofton Park
Crook Log
Croydon
Custom House
Dagenham
Dalston
Dartford
De Beauvoir Town
Deptford
Dollis Hill
Ealing
Earls Court
East Bedfont
East Ham
East Sheen
East
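The request loop reads a nested JSON structure, and a single malformed response stops the whole run. The sketch below isolates the parsing step so it can be exercised without network access. The helper name `extract_venues` and both sample payloads are hypothetical; they only mirror the fields that `getNearbyVenues` reads and are not real Foursquare responses.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style response dict."""
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []  # nothing usable in this response
    rows = []
    for item in groups[0].get('items', []):
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hypothetical payloads shaped like the fields that getNearbyVenues reads.
ok_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Co-op Food',
               'location': {'lat': 51.381969, 'lng': -0.069717},
               'categories': [{'name': 'Grocery Store'}]}}]}]}}
error_payload = {'meta': {'code': 429}, 'response': {}}  # made-up error shape

venues = extract_venues(ok_payload)
no_venues = extract_venues(error_payload)
print(venues, no_venues)
```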

In [34]:
print(London_venues.shape)
London_venues.head()

(2278, 7)


Unnamed: 0,Location,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Addiscombe,51.382106,-0.068576,Co-op Food,51.381969,-0.069717,Grocery Store
1,Addiscombe,51.382106,-0.068576,Kidplay cafe,51.382178,-0.069117,Café
2,Addiscombe,51.382106,-0.068576,Zafran,51.382394,-0.067949,Indian Restaurant
3,Addiscombe,51.382106,-0.068576,Favourite Chicken,51.381501,-0.069221,Fried Chicken Joint
4,Aldgate,51.515328,-0.078387,1Rebel,51.515569,-0.08004,Gym / Fitness Center


Let's check how many venues were returned for each neighborhood
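The per-neighbourhood count mentioned above can be obtained with a `groupby`. A minimal sketch on the first five rows of the venues table follows (the full table has more rows per location than this excerpt shows).

```python
import pandas as pd

sample = pd.DataFrame({
    'Location': ['Addiscombe'] * 4 + ['Aldgate'],
    'Venue': ['Co-op Food', 'Kidplay cafe', 'Zafran', 'Favourite Chicken', '1Rebel'],
})

# Number of venues returned for each location in this excerpt.
venue_counts = sample.groupby('Location')['Venue'].count()
print(venue_counts)
```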

#### Let's find out how many unique categories can be curated from all the returned venues

In [35]:
print('There are {} unique categories.'.format(len(London_venues['Venue Category'].unique())))

There are 245 unique categories.


#### Analyze Each London Neighborhood and what types of venues are present

In [36]:
# one hot encoding
London_onehot = pd.get_dummies(London_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
London_onehot['Location'] = London_venues['Location'] 

# move neighborhood column to the first column
fixed_columns = [London_onehot.columns[-1]] + list(London_onehot.columns[:-1])
London_onehot = London_onehot[fixed_columns]

London_onehot = London_onehot.rename(columns={'Botanical Garden': 'Botanical_Garden'}) #remove spaces
London_onehot.head()

Unnamed: 0,Location,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Australian Restaurant,Austrian Restaurant,Auto Garage,Auto Workshop,BBQ Joint,Bagel Shop,Bakery,Bar,Basketball Court,Bed & Breakfast,Beer Bar,Beer Store,Bistro,Boat or Ferry,Bookstore,Botanical_Garden,Boutique,Breakfast Spot,Brewery,Bubble Tea Shop,Buffet,Building,Burger Joint,Burrito Place,Bus Station,Bus Stop,Business Service,Café,Camera Store,Canal,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Chocolate Shop,Clothing Store,Club House,Cocktail Bar,Coffee Shop,College Cafeteria,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Cricket Ground,Cuban Restaurant,Cupcake Shop,Cycle Studio,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dive Bar,Donut Shop,Electronics Store,English Restaurant,Event Space,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Fish Market,Fishing Store,Flea Market,Flower Shop,Food & Drink Shop,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Fruit & Vegetable Store,Furniture / Home Store,Garden,Garden Center,Gas Station,Gastropub,Gay Bar,Gift Shop,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Harbor / Marina,Hardware Store,Health & Beauty Service,Historic Site,History Museum,Home Service,Hookah Bar,Hostel,Hotel,Hotel Bar,Hunan Restaurant,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Karaoke Bar,Kebab Restaurant,Kids Store,Korean Restaurant,Kosher Restaurant,Lake,Latin American Restaurant,Laundromat,Leather Goods Store,Lebanese Restaurant,Library,Lighthouse,Lingerie Store,Liquor Store,Locksmith,Lounge,Market,Martial Arts Dojo,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Mini Golf,Mobile Phone Shop,Modern European Restaurant,Moroccan Restaurant,Movie Theater,Multiplex,Museum,Music Store,Music Venue,Nail Salon,Nature Preserve,Newsstand,Nightclub,Noodle House,North Indian Restaurant,Opera House,Organic Grocery,Outdoor Supply Store,Pakistani Restaurant,Park,Pastry Shop,Pedestrian Plaza,Performing Arts Venue,Perfume Shop,Persian Restaurant,Peruvian Restaurant,Pet Store,Pharmacy,Piano Bar,Pier,Piercing Parlor,Pizza Place,Platform,Playground,Plaza,Pool,Portuguese Restaurant,Print Shop,Pub,Public Art,Rafting,Ramen Restaurant,Record Shop,Recording Studio,Rental Car Location,Residential Building (Apartment / Condo),Restaurant,Road,Rock Club,Rugby Pitch,Russian Restaurant,Salad Place,Sandwich Place,Scandinavian Restaurant,Scenic Lookout,Science Museum,Seafood Restaurant,Shoe Repair,Shoe Store,Shopping Mall,Shopping Plaza,Skate Park,Soccer Field,Soccer Stadium,South American Restaurant,South Indian Restaurant,Southern / Soul Food Restaurant,Souvenir Shop,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Sri Lankan Restaurant,Stables,Stationery Store,Steakhouse,Street Food Gathering,Student Center,Supermarket,Sushi Restaurant,Szechuan Restaurant,Tapas Restaurant,Taxi Stand,Tea Room,Tennis Court,Thai Restaurant,Theater,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Tram Station,Turkish Restaurant,Used Bookstore,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Addiscombe,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Addiscombe,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Addiscombe,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Addiscombe,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Aldgate,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
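The one-hot table is wide, so a reduced example may help show what the encoding does. The sketch below encodes four illustrative venue rows (the 'Pub' row is made up) and then keeps only the green categories. It uses `reindex` so that a green category that never occurs becomes an all-zero column instead of raising a `KeyError`; that safeguard is an addition for illustration and is not part of the notebook's code.

```python
import pandas as pd

# Illustrative rows; the 'Pub' entry is made up for the example.
venues = pd.DataFrame({
    'Location': ['Anerley', 'Anerley', 'Bankside', 'Bankside'],
    'Venue Category': ['Park', 'Pub', 'Garden', 'Park'],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Location', venues['Location'])

green_columns = ['Botanical Garden', 'Garden', 'Park']
# reindex adds any green category that never occurred as an all-zero column
green = onehot.reindex(columns=['Location'] + green_columns, fill_value=0)
print(green)
```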


### Isolate "Green" Neighbourhoods
We assume that a neighbourhood is "green" if it has a Botanical Garden, Garden or Park within the search radius. Note that the Nature Preserve category is not included in the column selection used here.

In [37]:
London_onehot_green = London_onehot[['Location', 'Botanical_Garden','Garden','Park']].copy()
London_onehot_green.head()

Unnamed: 0,Location,Botanical_Garden,Garden,Park
0,Addiscombe,0,0,0
1,Addiscombe,0,0,0
2,Addiscombe,0,0,0
3,Addiscombe,0,0,0
4,Aldgate,0,0,0


In [38]:
London_onehot_green.shape

(2278, 4)

In [39]:
London_grouped_green = London_onehot_green.groupby('Location').agg({'Botanical_Garden':'sum','Garden':'sum','Park':'sum'}).reset_index() 
London_grouped_green.head(10)

Unnamed: 0,Location,Botanical_Garden,Garden,Park
0,Addiscombe,0,0,0
1,Aldgate,0,0,0
2,Aldwych,0,0,0
3,Alperton,0,0,0
4,Anerley,0,0,1
5,Angel,0,0,0
6,Balham,0,0,0
7,Bankside,0,1,1
8,Barbican,1,1,0
9,Barkingside,0,0,0


#### Add column for total number of "green areas" in neighbourhood:

In [40]:
London_grouped_green['Total'] = London_grouped_green[['Botanical_Garden', 'Garden', 'Park']].sum(axis=1) # sum only the numeric green columns
London_grouped_green.head(10)

Unnamed: 0,Location,Botanical_Garden,Garden,Park,Total
0,Addiscombe,0,0,0,0
1,Aldgate,0,0,0,0
2,Aldwych,0,0,0,0
3,Alperton,0,0,0,0
4,Anerley,0,0,1,1
5,Angel,0,0,0,0
6,Balham,0,0,0,0
7,Bankside,0,1,1,2
8,Barbican,1,1,0,2
9,Barkingside,0,0,0,0


In [41]:
London_grouped_green.shape

(297, 5)

Now let's create a new dataframe that keeps only each Location and its total number of "green" areas.

In [42]:
df_Green= London_grouped_green[['Location','Total']].copy()
df_Green.head()

Unnamed: 0,Location,Total
0,Addiscombe,0
1,Aldgate,0
2,Aldwych,0
3,Alperton,0
4,Anerley,1


In [43]:
df_Green.shape

(297, 2)

We now have the total number of "green" areas within a 200m radius of each London neighbourhood.
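The totals table has fewer rows (297) than the coordinates table (359), so some locations have no total at all, for instance because no venues were returned for them. Whether such a location should count as having zero green areas is a modelling choice; the sketch below shows one way to make it explicit. The frames are small stand-ins, and 'Example Area' is a hypothetical location name.

```python
import pandas as pd

# 'Example Area' is a hypothetical location with no venues returned.
locations = pd.DataFrame({'Location': ['Addiscombe', 'Anerley', 'Example Area']})
green_totals = pd.DataFrame({'Location': ['Addiscombe', 'Anerley'],
                             'Total': [0, 1]})

# Left merge keeps every location; missing totals are treated as zero.
with_zeros = locations.merge(green_totals, on='Location', how='left')
with_zeros['Total'] = with_zeros['Total'].fillna(0).astype(int)
print(with_zeros)
```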

This concludes the data gathering phase. 

## Methodology <a name="methodology"></a>

In this project, based on the client's preference for close access to greenery, we will direct our efforts towards creating a list of "green" neighbourhoods across London and displaying these locations on a map, to help the client decide where in London to move to.

In the first step we collected the required **data: London neighbourhoods and their corresponding latitude/longitude coordinates**. We also **determined the number of green areas within a 200m radius of each London neighbourhood** (according to Foursquare categorisation).

In the analysis section we will use a machine learning technique, **k-means clustering**, to group neighbourhoods into clusters ranging from locations with zero green areas within a 200m radius to locations with multiple green areas within a 200m radius. We will present these clusters on a map to show the "green" neighbourhoods in relation to the centre of London. The client can then use this map as a baseline to decide which areas they would consider living in.


## Analysis <a name="analysis"></a>

### Cluster Green Neighborhoods

In the Data section, we created a list of London neighbourhoods and the number of green areas present within 200m of each. We will now create clusters to separate the neighbourhoods according to their access to green areas.

After experimenting with the number of clusters, we decided that 3 neighbourhood clusters were sufficient to provide a list of appropriate neighbourhood candidates to the client.

In [44]:
# set number of clusters
kclusters = 3

df_Green_clustering = df_Green.drop(columns='Location') # keep only the numeric Total column

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_Green_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 2, 0, 0, 1, 1, 0], dtype=int32)
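The number of clusters was chosen by experimentation. One common way to make such a comparison explicit is to look at the k-means inertia (within-cluster sum of squared distances) for several values of k. The following is a sketch on made-up totals, not on the notebook's data, and `n_init=10` is set explicitly as an assumption to keep behaviour stable across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical green-area totals for twelve neighbourhoods.
totals = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3], dtype=float).reshape(-1, 1)

inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(totals)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 3))
```

A k after which inertia stops dropping sharply is a reasonable candidate; with a single integer feature like `Total`, the clusters mostly reproduce the distinct values.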

Let's create a new dataframe that includes the cluster as well as the latitude/longitude coordinates of each neighbourhood.

In [45]:
# add clustering labels
df_Green.insert(0, 'Cluster_Labels', kmeans.labels_)
London_Green_merged = df_London

# merge london_grouped with london_data to add latitude/longitude for each neighborhood
London_Green_merged = London_Green_merged.join(df_Green.set_index('Location'), on='Location')

London_Green_merged.head() # check the last columns!

Unnamed: 0,OS_grid_ref,Location,Latitude,Longitude,Cluster_Labels,Total
0,TQ345665,Addiscombe,51.382106,-0.068576,0.0,0.0
1,TQ334813,Aldgate,51.515328,-0.078387,0.0,0.0
2,TQ307810,Aldwych,51.513626,-0.117027,0.0,0.0
3,TQ185835,Alperton,51.538395,-0.292108,0.0,0.0
4,TQ345695,Anerley,51.408792,-0.066646,2.0,1.0


#### Clean Dataframe removing all rows with NAN

In [46]:
London_Green_merged2 = London_Green_merged.dropna().reset_index(drop=True) # drop all rows with NaN, then reset the index
London_Green_merged2['Cluster_Labels'] = London_Green_merged2['Cluster_Labels'].astype(int) # assigning on a fresh frame avoids the SettingWithCopyWarning
London_Green_merged2.head() # check the last columns!


Unnamed: 0,OS_grid_ref,Location,Latitude,Longitude,Cluster_Labels,Total
0,TQ345665,Addiscombe,51.382106,-0.068576,0,0.0
1,TQ334813,Aldgate,51.515328,-0.078387,0,0.0
2,TQ307810,Aldwych,51.513626,-0.117027,0,0.0
3,TQ185835,Alperton,51.538395,-0.292108,0,0.0
4,TQ345695,Anerley,51.408792,-0.066646,2,1.0
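
One likely source of the NaN rows is that `join` keeps every row of the left dataframe by default, so a neighbourhood with no matching entry on the right gets missing values, and the presence of NaN forces the label column to float. A minimal sketch with invented toy frames (the names `coords` and `green` and all values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for the coordinates table and the clustered table
coords = pd.DataFrame({
    'Location': ['A', 'B', 'C'],
    'Latitude': [51.50, 51.51, 51.52],
    'Longitude': [-0.10, -0.11, -0.12],
})
green = pd.DataFrame({
    'Location': ['A', 'C'],
    'Cluster_Labels': [0, 2],
    'Total': [0.0, 1.0],
})

# join is a left join by default: 'B' has no match, so its new columns are NaN
merged = coords.join(green.set_index('Location'), on='Location')
n_missing = int(merged['Cluster_Labels'].isna().sum())

# after dropping the incomplete rows the labels can be cast back to integers
cleaned = merged.dropna().reset_index(drop=True)
cleaned['Cluster_Labels'] = cleaned['Cluster_Labels'].astype(int)
print(n_missing, len(cleaned))
```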


### Examine Clusters

#### Cluster 1

In [47]:
Cluster1=London_Green_merged2.loc[London_Green_merged2['Cluster_Labels'] == 0, London_Green_merged2.columns[[1] + list(range(5, London_Green_merged2.shape[1]))]]
Cluster1.reset_index()

Unnamed: 0,index,Location,Total
0,0,Addiscombe,0.0
1,1,Aldgate,0.0
2,2,Aldwych,0.0
3,3,Alperton,0.0
4,5,Angel,0.0
5,6,Balham,0.0
6,9,Barkingside,0.0
7,10,Barnet Gate,0.0
8,11,"Barnet (also Chipping Barnet, High Barnet)",0.0
9,12,Barnsbury,0.0


#### Cluster 2

In [48]:
Cluster2=London_Green_merged2.loc[London_Green_merged2['Cluster_Labels'] == 1, London_Green_merged2.columns[[1] + list(range(5, London_Green_merged2.shape[1]))]]
Cluster2.reset_index()

Unnamed: 0,index,Location,Total
0,7,Bankside,2.0
1,8,Barbican,2.0
2,147,Kensington,2.0
3,188,Newington,2.0
4,193,North Kensington,2.0
5,247,Stepney,2.0
6,259,Temple,3.0


#### Cluster 3

In [49]:
Cluster3=London_Green_merged2.loc[London_Green_merged2['Cluster_Labels'] == 2, London_Green_merged2.columns[[1] + list(range(5, London_Green_merged2.shape[1]))]]
Cluster3.reset_index()

Unnamed: 0,index,Location,Total
0,4,Anerley,1.0
1,14,Beckenham,1.0
2,18,Belmont,1.0
3,19,Belmont,1.0
4,20,Belsize Park,1.0
5,22,Berrylands,1.0
6,27,Bloomsbury,1.0
7,39,Camberwell,1.0
8,41,Camden Town,1.0
9,42,Canary Wharf,1.0


Our K-means clustering analysis highlights that the clusters represent the following:

**Cluster 1: 0 "green" areas within a 200m radius of the London neighbourhood**

**Cluster 2: 2 or more "green" areas within a 200m radius of the London neighbourhood**

**Cluster 3: 1 "green" area within a 200m radius of the London neighbourhood**
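
Because k-means assigns label numbers arbitrarily, the meaning of each cluster is best read off the data rather than assumed from the label. A minimal sketch of such a check, using a handful of rows copied from the cluster tables above (the frame name `sample` is an assumption for illustration):

```python
import pandas as pd

# A few rows taken from the cluster tables shown earlier
sample = pd.DataFrame({
    'Location': ['Addiscombe', 'Aldgate', 'Anerley', 'Beckenham', 'Bankside', 'Temple'],
    'Cluster_Labels': [0, 0, 2, 2, 1, 1],
    'Total': [0.0, 0.0, 1.0, 1.0, 2.0, 3.0],
})

# range of green-area counts and number of neighbourhoods per cluster label
summary = sample.groupby('Cluster_Labels')['Total'].agg(['min', 'max', 'count'])
print(summary)
```

Running the same `groupby` on the full `London_Green_merged2` dataframe would give the count range for every cluster in one table.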


Finally, let's visualize the resulting clusters.

### Client Map 1: Create a map of London with ALL neighborhoods superimposed on top.

In [50]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10) # Coordinates of london centre

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(London_Green_merged2['Latitude'], London_Green_merged2['Longitude'], London_Green_merged2['Location'], London_Green_merged2['Cluster_Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster+1), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
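
Indexing the palette with `rainbow[cluster-1]` means that label 0 wraps around to the last colour in the list, so the colour of each marker depends on label numbers that k-means assigns arbitrarily. An alternative, sketched here, is to choose the colour from the green-area count itself. The helper name `colour_for_total` and the colour names are assumptions for illustration; the returned string could be passed as `color` and `fill_color` to `folium.CircleMarker`.

```python
def colour_for_total(total):
    """Return a marker colour for a neighbourhood's number of green areas."""
    if total <= 0:
        return 'red'      # no park or garden within the search radius
    if total < 2:
        return 'green'    # exactly one green area
    return 'blue'         # two or more green areas

# example rows taken from the cluster tables above
for name, total in [('Addiscombe', 0.0), ('Anerley', 1.0), ('Temple', 3.0)]:
    print(name, colour_for_total(total))
```

Keying the colour on the count keeps the map legend stable even if a rerun of k-means numbers the clusters differently.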

In [51]:
London_green_neighbourhoods = pd.concat([Cluster2, Cluster3], axis=0)
London_green_neighbourhoods.reset_index()



Unnamed: 0,index,Location,Total
0,7,Bankside,2.0
1,8,Barbican,2.0
2,147,Kensington,2.0
3,188,Newington,2.0
4,193,North Kensington,2.0
5,247,Stepney,2.0
6,259,Temple,3.0
7,4,Anerley,1.0
8,14,Beckenham,1.0
9,18,Belmont,1.0


In [52]:
print("Total number of 'green' neighbourhoods in london:", London_green_neighbourhoods.shape[0])

Total number of 'green' neighbourhoods in london: 53


In [53]:
print("Total number of 'non-green' neighbourhoods in london:", Cluster1.shape[0])

Total number of 'non-green' neighbourhoods in london: 249


### Client Map 2: Create a map of London with all "green" neighborhoods superimposed on top.

In [54]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10) # Coordinates of london centre

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(London_Green_merged2['Latitude'], London_Green_merged2['Longitude'], London_Green_merged2['Location'], London_Green_merged2['Cluster_Labels']):
    if cluster >= 1:
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster+1), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True, 
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Results and Discussion <a name="results"></a>

Our K-means cluster analysis shows that 53 London neighbourhoods have at least one park or garden within 200m; these will be referred to as "green" neighbourhoods. Most London neighbourhoods (249), according to this analysis, are classed as "non-green" neighbourhoods, meaning that they do not have a park or garden within a 200m radius.

Two maps were created:
* Client Map 1: a map of all London neighbourhoods, colour coded according to their "green" status.
* Client Map 2: a map of all "green" neighbourhoods, colour coded according to their "green" status.

Red Markers: Neighbourhoods with 0 parks/gardens within 200m radius

Green Markers: Neighbourhoods with 1 park/garden within 200m radius 

Blue Markers: Neighbourhoods with 2+ parks/gardens within 200m radius

As a result of this analysis, we were able to narrow down the neighbourhood options for a client who intends to move to London and wants to live in an area with close access to a park and/or garden. This information will serve as a baseline recommendation to help the client research these neighbourhoods further and add criteria to narrow down her options, e.g. distance from the city centre or the reputation of the neighbourhood.
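
As one example of such a further criterion, the distance of a neighbourhood from the city centre can be computed from the latitude/longitude columns with the haversine formula. This is a minimal sketch, assuming a city-centre coordinate of (51.5074, -0.1278), which is an illustrative value and not necessarily the one geocoded in this notebook; the function name `haversine_km` is likewise hypothetical. The two neighbourhood coordinates are copied from the tables above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

centre = (51.5074, -0.1278)  # assumed city-centre coordinate (illustrative)
neighbourhoods = {
    'Anerley': (51.408792, -0.066646),
    'Aldwych': (51.513626, -0.117027),
}

# distance of each neighbourhood from the assumed centre, nearest first
distances = {name: haversine_km(lat, lon, *centre) for name, (lat, lon) in neighbourhoods.items()}
for name in sorted(distances, key=distances.get):
    print(name, round(distances[name], 1), 'km')
```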


## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify the London neighbourhoods with close access to greenery, in order to help a client narrow down her London neighbourhood options given that she wants to live close to a park or garden. By calculating the greenery (parks and gardens) density distribution from Foursquare data and then clustering these locations, we were able to create a short list of "green" neighbourhoods that the client could consider moving to.

The final decision on the optimal neighbourhood will be made by the stakeholder based on the further living criteria she deems important. The information provided to the client (maps and tables) will serve as a solid foundation from which she can begin her decision-making process.