# Capstone Project - The Battle of the Neighborhoods (Data)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data setup](#data)
* [London Map with area clusters](#map1)
* [Foursquare API setup](#foursq)
* [Generate venue list](#venuelist)
* [Generate common venues](#commonvenues)
* [Footer Note](#footer)


# A Recommender System for Travel Consultants

### Introduction <a name="introduction"></a>

We introduced our business problem in Part 1: the client requires a recommender system for customers staying in areas of London.
As discussed, we will use the Wikipedia page https://en.wikipedia.org/wiki/List_of_areas_of_London to fetch a list of areas in the city.
We will use the Foursquare API to generate a list of common venues for each area.
We will then use the generated data to determine the optimum area and the list of hotels within that area to recommend to customers.

### Data Setup <a name="data"></a>

###### Import required Libraries

In [3]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium 
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated since pandas 1.0

website_url = requests.get('https://en.wikipedia.org/wiki/List_of_areas_of_London').text
soup = BeautifulSoup(website_url,'lxml')
table = soup.find("table", { "class" : "wikitable sortable" })

In [6]:
# For this introductory data setup phase, we only read the first 100 rows of the Wikipedia table
# Our main deliverable will use the complete set of data
Location=[]
Borough=[]
Town=[]
Postcode=[]
mycounter = 0


for row in table.find_all("tr"):
    cells = row.find_all("td")
    mycounter += 1
    if mycounter <= 100:
        # Keep only rows with exactly six "td" cells; header rows have none and are skipped
        if len(cells) == 6:
            Location.append(cells[0].find(string=True))
            Borough.append(cells[1].find(string=True))
            Town.append(cells[2].find(string=True))
            Postcode.append(cells[3].find(string=True).replace("\n", ""))
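
To illustrate why the `len(cells) == 6` check matters, here is a minimal sketch run against a tiny hand-written table. To stay dependency-free it uses Python's built-in `html.parser` instead of BeautifulSoup; the HTML snippet, its placeholder values, and the `RowCollector` class are assumptions made for illustration only. The header row holds `<th>` cells, so it produces no `<td>` values and fails the length check, as does any malformed row.

```python
from html.parser import HTMLParser

# Hand-written stand-in for the Wikipedia table (illustrative data only;
# the last two columns are placeholders).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Location</th><th>Borough</th><th>Town</th>
      <th>Postcode</th><th>col5</th><th>col6</th></tr>
  <tr><td>Abbey Wood</td><td>Bexley, Greenwich</td><td>LONDON</td>
      <td>SE2</td><td>-</td><td>-</td></tr>
  <tr><td>Broken row</td><td>only two cells</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per <tr>
        self._in_td = False   # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_td = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.rows[-1][-1] += data.strip()


parser = RowCollector()
parser.feed(SAMPLE_HTML)

# Keep only rows with exactly six <td> cells, then take the first four columns.
records = [row[:4] for row in parser.rows if len(row) == 6]
print(records)
```

Only the Abbey Wood row survives the filter; the header row and the two-cell row are dropped.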

In [8]:
df=pd.DataFrame(Location,columns=['Location'])
df['Borough']=Borough
df['Town']=Town
df['Postcode']=Postcode
df.head(10)

| | Location | Borough | Town | Postcode |
|---|---|---|---|---|
| 0 | Abbey Wood | Bexley, Greenwich | LONDON | SE2 |
| 1 | Acton | Ealing, Hammersmith and Fulham | LONDON | W3, W4 |
| 2 | Addington | Croydon | CROYDON | CR0 |
| 3 | Addiscombe | Croydon | CROYDON | CR0 |
| 4 | Albany Park | Bexley | BEXLEY, SIDCUP | DA5, DA14 |
| 5 | Aldborough Hatch | Redbridge | ILFORD | IG2 |
| 6 | Aldgate | City | LONDON | EC3 |
| 7 | Aldwych | Westminster | LONDON | WC2 |
| 8 | Alperton | Brent | WEMBLEY | HA0 |
| 9 | Anerley | Bromley | LONDON | SE20 |


###### Let's check the data types

In [9]:
df.dtypes

Location    object
Borough     object
Town        object
Postcode    object
dtype: object

###### Let's add Latitudes and Longitudes to the dataframe

In [11]:
df['Latitude'] = ''
df['Longitude'] = ''
# Create the geocoder once instead of once per row
geolocator = Nominatim(user_agent="df_explorer")
for i in df.index:
    # Fall back to the borough when no post town is assigned
    if df.at[i, 'Town'] == "Not assigned":
        df.at[i, 'Town'] = df.at[i, 'Borough']
    address = df.at[i, 'Location'] + ', ' + df.at[i, 'Town']
    try:
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
        df.at[i, 'Latitude'] = latitude
        df.at[i, 'Longitude'] = longitude
    except Exception:
        # The lookup failed or returned None: leave the coordinates empty for this row
        continue
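
A geocoding lookup can fail in two ways: the service may raise an error, or it may return `None` when it finds no match. The sketch below shows one way to isolate that handling in a small helper. The names `safe_geocode` and `fake_geocode` are hypothetical, and the stub stands in for `geolocator.geocode` so the example runs without network access.

```python
from types import SimpleNamespace


def safe_geocode(geocode_fn, address):
    """Return (latitude, longitude) for address, or (None, None) on failure.

    geocode_fn is any callable shaped like geopy's geocode(): it takes an
    address string and returns an object with .latitude and .longitude,
    or None when nothing matches.
    """
    try:
        location = geocode_fn(address)
    except Exception as exc:  # e.g. a timeout raised by the geocoding service
        print(f"Lookup failed for {address!r}: {exc}")
        return None, None
    if location is None:
        return None, None
    return location.latitude, location.longitude


def fake_geocode(address):
    """Stub geocoder (an assumption for illustration): knows one address only."""
    known = {
        "Abbey Wood, LONDON": SimpleNamespace(latitude=51.4876, longitude=0.11405)
    }
    if address == "timeout":
        raise TimeoutError("service did not respond")
    return known.get(address)


print(safe_geocode(fake_geocode, "Abbey Wood, LONDON"))
print(safe_geocode(fake_geocode, "Nowhere, LONDON"))
print(safe_geocode(fake_geocode, "timeout"))
```

In the notebook, `geolocator.geocode` would be passed as `geocode_fn`. geopy also ships `geopy.extra.rate_limiter.RateLimiter`, which can wrap `geolocator.geocode` to space out requests to the public Nominatim service.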

    

In [12]:
df.head(10)

| | Location | Borough | Town | Postcode | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | Abbey Wood | Bexley, Greenwich | LONDON | SE2 | 51.4876 | 0.11405 |
| 1 | Acton | Ealing, Hammersmith and Fulham | LONDON | W3, W4 | 51.5081 | -0.273261 |
| 2 | Addington | Croydon | CROYDON | CR0 | 44.4206 | -76.9782 |
| 3 | Addiscombe | Croydon | CROYDON | CR0 | 51.3797 | -0.0742821 |
| 4 | Albany Park | Bexley | BEXLEY, SIDCUP | DA5, DA14 | 51.434 | 0.1032 |
| 5 | Aldborough Hatch | Redbridge | ILFORD | IG2 | | |
| 6 | Aldgate | City | LONDON | EC3 | 51.5142 | -0.0757186 |
| 7 | Aldwych | Westminster | LONDON | WC2 | 51.5129 | -0.118101 |
| 8 | Alperton | Brent | WEMBLEY | HA0 | 51.5378 | -0.297924 |
| 9 | Anerley | Bromley | LONDON | SE20 | 51.4128 | -0.0653006 |
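
Not every geocoded point is in London: the Addington row above resolved to roughly 44.4, -77.0, which lies in North America. One simple sanity check is to keep only points that fall inside a rough bounding box around Greater London. The bounds below are approximate values chosen for illustration, and `in_greater_london` is a hypothetical helper.

```python
# Approximate bounding box for Greater London (illustrative assumption).
LAT_MIN, LAT_MAX = 51.28, 51.70
LON_MIN, LON_MAX = -0.52, 0.34


def in_greater_london(lat, lon):
    """Return True if (lat, lon) lies inside the rough London bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


# Coordinates copied from the table above.
samples = {
    "Abbey Wood": (51.4876, 0.11405),
    "Addington": (44.4206, -76.9782),
    "Addiscombe": (51.3797, -0.0742821),
}

for name, (lat, lon) in samples.items():
    print(name, in_greater_london(lat, lon))
```

Rows that fail the check could be re-geocoded with a more specific address (for example by appending the borough or "London, UK") or dropped before querying Foursquare.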


###### Great, now let's clean up the dataframe for further analysis

In [13]:
neighborhoods = df.copy()
# Treat empty strings as missing values
for col in ["Borough", "Latitude", "Longitude"]:
    neighborhoods[col] = neighborhoods[col].replace("", np.nan)
# Drop rows that lack coordinates, a borough, or a town
neighborhoods = neighborhoods.dropna(subset=["Latitude", "Longitude", "Borough", "Town"])
# reset index, because we dropped rows with missing values
neighborhoods.reset_index(drop=True, inplace=True)


neighborhoods['Latitude'] = neighborhoods['Latitude'].astype(float)
neighborhoods['Longitude'] = neighborhoods['Longitude'].astype(float)

neighborhoods.dtypes

Location      object
Borough       object
Town          object
Postcode      object
Latitude     float64
Longitude    float64
dtype: object

##### Create a cluster map based on the list collected  <a name="map1"></a>

In [14]:
# Centre the map on central London (approximate coordinates) rather than on the last geocoded area
map_london = folium.Map(location=[51.5074, -0.1278], zoom_start=11)

# add markers to map
for lat, lng, label in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Location']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_london)

map_london

##### The interactive map may not render on GitHub, so here is an image of the generated map

![London map with area markers](Capstone_Images_02_Data_01_ClusterMap.JPG "MapCluster")

#### Now, let's set up the Foursquare API <a name="foursq"></a>

In [16]:
CLIENT_ID = '<Hidden>' # your Foursquare ID
CLIENT_SECRET = '<Hidden>' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [19]:
neighborhood_latitude = neighborhoods.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods.loc[0, 'Longitude'] # neighborhood longitude value

# Check that the values are populated correctly by inspecting the first row of the DataFrame
neighborhood_name = neighborhoods.loc[0, 'Location'] # neighborhood name
neighborhood_name

'Abbey Wood'

In [21]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
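
Formatting the query string by hand means every value must already be URL-safe. A hedged alternative is to let the standard library build and escape the parameters. The credentials and coordinates below are placeholders, and `build_explore_url` is a hypothetical helper, not part of any Foursquare SDK.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the venues/explore request URL with properly escaped parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials (assumptions); the coordinates are those of Abbey Wood.
url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 51.4876, 0.11405)
print(url)

# Round-trip check: the parameters can be read back from the URL.
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"])
```

Passing the same dictionary as `requests.get(EXPLORE_ENDPOINT, params=params)` performs the equivalent encoding inside `requests`.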


In [22]:
results = requests.get(url).json()

In [23]:
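def get_category_type(row):
    # The category list is stored under 'categories' or 'venue.categories',
    # depending on how the JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    # Return the name of the first category, or None if the venue has none
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']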

In [24]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Co-op Food | Grocery Store | 51.48765 | 0.11349 |
| 1 | Bostal Gardens | Playground | 51.48667 | 0.110462 |
| 2 | Cheers Off License | Grocery Store | 51.486808 | 0.107396 |
| 3 | Abbey Wood Caravan Club | Campground | 51.485502 | 0.120014 |
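
The flattening above relies on the nested shape of the explore response: `response`, then `groups[0]`, then `items`, each of which holds a `venue` object. The sketch below walks a small hand-written dictionary of that shape without pandas. The payload is trimmed and illustrative (it contains only the fields the notebook reads, and the second venue is invented to show the empty-category case), and `extract_venues` is a hypothetical helper.

```python
# Hand-written payload mimicking the fields the notebook reads (illustrative).
sample_results = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Co-op Food",
                               "location": {"lat": 51.48765, "lng": 0.11349},
                               "categories": [{"name": "Grocery Store"}]}},
                    {"venue": {"name": "Uncategorised place",
                               "location": {"lat": 51.48, "lng": 0.11},
                               "categories": []}},
                ]
            }
        ]
    }
}


def extract_venues(results):
    """Return (name, category, lat, lng) tuples; category is None if missing."""
    rows = []
    for item in results["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], category,
                     venue["location"]["lat"], venue["location"]["lng"]))
    return rows


for row in extract_venues(sample_results):
    print(row)
```

Returning `None` for a venue without categories mirrors what `get_category_type` does for an empty list.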


In [25]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Query Foursquare for venues around each location and return them as one DataFrame."""
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        # (the category is None when a venue has no categories)
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'] if v['venue']['categories'] else None) for v in results])

    # flatten the per-location lists into a single table
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Location', 
                  'Location Latitude', 
                  'Location Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

###### Generate venues list <a name="venuelist"></a>

In [26]:
london_venues = getNearbyVenues(names=neighborhoods['Location'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

Abbey Wood
Acton
Addington
Addiscombe
Albany Park
Aldgate
Aldwych
Alperton
Anerley
Angel
Aperfield
Archway
Ardleigh Green
Arkley
Arnos Grove
Balham
Bankside
Barbican
Barking
Barkingside
Barnehurst
Barnes
Barnet Gate
Barnet
Barnsbury
Battersea
Bayswater
Beckenham
Beckton
Becontree
Becontree Heath
Beddington
Bedford Park
Belgravia
Bellingham
Belmont
Belmont
Belsize Park
Belvedere
Bermondsey
Berrylands
Bethnal Green
Bexley
Bexleyheath
Bickley
Biggin Hill
Blackfen
Blackfriars
Blackheath
Blackheath Royal Standard
Blackwall
Blendon
Bloomsbury
Botany Bay
Bounds Green
Bow
Bowes Park
Brentford
Brent Cross
Brent Park
Brimsdown
Brixton
Brockley
Bromley
Bromley
Bromley Common
Brompton
Brondesbury
Brunswick Park
Bulls Cross
Burnt Oak
Burroughs, The
Camberwell
Cambridge Heath
Camden Town
Canary Wharf
Cann Hall
Canning Town
Canonbury
Carshalton
Castelnau
Castle Green
Catford
Chadwell Heath
Chalk Farm
Charing Cross
Charlton
Chase Cross
Cheam
Chelsea
Chelsfield
Chessington
Childs Hill
Chinatown
Chinbro

In [28]:
london_venues.head()

| | Location | Location Latitude | Location Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Abbey Wood | 51.487621 | 0.11405 | Co-op Food | 51.48765 | 0.11349 | Grocery Store |
| 1 | Abbey Wood | 51.487621 | 0.11405 | Bostal Gardens | 51.48667 | 0.110462 | Playground |
| 2 | Abbey Wood | 51.487621 | 0.11405 | Cheers Off License | 51.486808 | 0.107396 | Grocery Store |
| 3 | Abbey Wood | 51.487621 | 0.11405 | Abbey Wood Caravan Club | 51.485502 | 0.120014 | Campground |
| 4 | Acton | 51.50814 | -0.273261 | London Star Hotel | 51.509624 | -0.272456 | Hotel |


## Analyze Each Neighborhood

In [30]:
london_onehot = pd.get_dummies(london_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
london_onehot['Location'] = london_venues['Location'] 

# move neighborhood column to the first column
fixed_columns = [london_onehot.columns[-1]] + list(london_onehot.columns[:-1])
london_onehot = london_onehot[fixed_columns]

london_onehot.head()

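| | Location | Accessories Store | Afghan Restaurant | African Restaurant | American Restaurant | Aquarium | Arepa Restaurant | Argentinian Restaurant | Art Gallery | Art Museum | ... | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Xinjiang Restaurant | Yoga Studio | Yoshoku Restaurant | Zoo | Zoo Exhibit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbey Wood | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Abbey Wood | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Abbey Wood | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Abbey Wood | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Acton | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |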


In [31]:
london_grouped = london_onehot.groupby('Location').mean().reset_index()
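
Taking the mean of the one-hot columns per area gives the share of that area's venues that fall in each category. The plain-Python sketch below reproduces that calculation with `collections.Counter` for the four Abbey Wood venues listed earlier; `category_shares` is a hypothetical helper, not a pandas function.

```python
from collections import Counter


def category_shares(categories):
    """Return {category: share of venues}, ordered by descending share, then name."""
    counts = Counter(categories)
    total = len(categories)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {name: round(count / total, 2) for name, count in ordered}


# Venue categories for Abbey Wood, taken from the venue table above.
abbey_wood = ["Grocery Store", "Playground", "Grocery Store", "Campground"]

shares = category_shares(abbey_wood)
print(shares)
# {'Grocery Store': 0.5, 'Campground': 0.25, 'Playground': 0.25}
```

Two of the four venues are grocery stores, hence the share of 0.5; the remaining two categories each account for 0.25.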

###### Let's print each neighborhood along with the top 3 most common venues <a name="commonvenues"></a>

In [33]:
# For this introductory data setup phase, we list only the top 3 venues for the first 5 areas
num_top_venues = 3
DisplayRange = 0

for hood in london_grouped['Location']:
    if DisplayRange < 5:
        print("----"+hood+"----")
        temp = london_grouped[london_grouped['Location'] == hood].T.reset_index()
        temp.columns = ['venue','freq']
        temp = temp.iloc[1:]
        temp['freq'] = temp['freq'].astype(float)
        temp = temp.round({'freq': 2})
        print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
        DisplayRange += 1
        print('\n')

----Abbey Wood----
           venue  freq
0  Grocery Store  0.50
1     Campground  0.25
2     Playground  0.25


----Acton----
                  venue  freq
0                   Pub  0.17
1  Gym / Fitness Center  0.17
2  Fast Food Restaurant  0.09


----Addiscombe----
           venue  freq
0           Park  0.27
1  Grocery Store  0.18
2            Pub  0.09


----Albany Park----
                venue  freq
0  Italian Restaurant  0.14
1                 Pub  0.14
2                Café  0.07


----Aldgate----
               venue  freq
0        Coffee Shop  0.11
1              Hotel  0.11
2  Indian Restaurant  0.06




###### Footnote: This concludes the data setup for this project. Thank you for reviewing my work <a name="footer"></a>