<h1 style='text-align:center'>Neighborhood Recommendation Tool</h1>
<h2 style='text-align:center'>Applied Data Science Capstone by IBM/Coursera - The Battle of the Neighborhoods</h2>
<h4 style='text-align:center'>This tool uses Foursquare venue data and ratings to recommend a neighborhood to live in, based on a user's ranking of neighborhood features.</h4>

<h2>Introduction/Business Problem</h2>

<p>Problem Statement: A user is moving to a new city and wants to understand how well each neighborhood in that city matches the user's own tastes and preferences.</p>
<p>Introduction: Using venue data from the Foursquare API, each neighborhood in a city is given a score for a set of neighborhood features. The final output will be a map showing the neighborhood rankings for the user to explore.</p>
<p>For the sake of this exercise, the city will be <b>Washington, DC</b>, and the neighborhood features will be: <b>Parks, Restaurants, Metro/Subway Options, Shopping Venues, and Art Venues.</b></p>

<h2>Data</h2>

Based on the business problem, the following datasets are needed before applying a content-based recommendation algorithm to recommend a neighborhood:
1. Washington, DC neighborhoods with latitude and longitude
2. Foursquare venue type rating data for each Washington, DC neighborhood
3. User input rating of neighborhood features Parks, Restaurants, Metro/Subway Options, Shopping Venues, and Art Venues

<b>Below I will describe in detail the procedure for pulling these datasets.</b>

<h3>Import relevant packages</h3>

In [None]:
##import packages
import pandas as pd
import requests
from bs4 import BeautifulSoup

import numpy as np # library to handle data in a vectorized manner

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if it is already installed
import folium # map rendering library

print('Import Complete')

<h3>Dataset 1: Washington, DC neighborhoods with latitude, longitude</h3>
<b>First, I will scrape the Washington, DC neighborhood names from the washington.org website using BeautifulSoup. Some headings on the page are not plain neighborhood names, so these are replaced manually, in a few cases with a street address.</b>

In [17]:
url = 'https://washington.org/dc-neighborhoods'

html_data = requests.get(url).text
soup = BeautifulSoup(html_data, "html.parser")

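h5 = soup.find_all('h5')

dc_neighborhoods = []

# the neighborhood names are the h5 elements at positions 27 through 50 on this page
# (hard-coded, so this range needs updating if the page layout changes)
for i in range(27,51):
    cell={}
    cell['Neighborhood'] = h5[i].text.strip()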
    cell['City'] = 'Washington'
    cell['State'] = 'D.C.'
    dc_neighborhoods.append(cell)

dc_neighborhoods_df = pd.DataFrame(dc_neighborhoods)
dc_neighborhoods_df['Neighborhood']=dc_neighborhoods_df['Neighborhood'].replace({'Explore Dupont Circle':'Dupont Circle',
                                             'Your Guide to the Georgetown Neighborhood':'Georgetown',
                                             'Your Guide to the National Mall':'National Mall',
                                             'Penn Quarter & Chinatown':'Penn Quarter',
                                             'Capitol Riverfront':'Navy Yard',
                                             'Upper Northwest':'Tenleytown',
                                             'U Street':'1200 U St NW',
                                             'Mount Vernon Square':'700 M St NW'})

print(dc_neighborhoods_df.head())
print(dc_neighborhoods_df.shape)

   Neighborhood        City State
0  Adams Morgan  Washington  D.C.
1     Anacostia  Washington  D.C.
2     Brookland  Washington  D.C.
3  Capitol Hill  Washington  D.C.
4     Navy Yard  Washington  D.C.
(24, 3)


<b>Then, I will use geopy to get the latitude and longitude of each neighborhood.</b>

In [18]:
locator = Nominatim(user_agent='dc_explorer')

lat_long = []

for ind in dc_neighborhoods_df.index:
    cell={}
    #print('Neighborhood = {}'.format(dc_neighborhoods_df['Neighborhood'][ind]))
    location = locator.geocode(dc_neighborhoods_df['Neighborhood'][ind]+','+dc_neighborhoods_df['City'][ind]+','+dc_neighborhoods_df['State'][ind])
    #print('Neighborhood = {}, Latitude = {}, Longitude = {}'.format(dc_neighborhoods_df['Neighborhood'][ind],location.latitude, location.longitude))
    cell['Neighborhood'] = dc_neighborhoods_df['Neighborhood'][ind]
    cell['Latitude'] = location.latitude
    cell['Longitude'] = location.longitude
    lat_long.append(cell)
    
lat_long_df = pd.DataFrame(lat_long)

print(lat_long_df.head())
print(lat_long_df.shape)

   Neighborhood   Latitude  Longitude
0  Adams Morgan  38.921500 -77.042199
1     Anacostia  38.862581 -76.984441
2     Brookland  38.932832 -76.984226
3  Capitol Hill  38.889803 -77.009418
4     Navy Yard  38.876307 -77.000478
(24, 3)
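
The loop above reads `location.latitude` directly, but geopy's `geocode` returns `None` when it finds no match, which would raise an `AttributeError`. A minimal sketch of a guarded version follows. `geocode_rows` and `fake_geocode` are hypothetical names used only for illustration; in the notebook, `locator.geocode` would be passed in place of the stand-in.

```python
from types import SimpleNamespace


def geocode_rows(geocode, neighborhoods, city='Washington', state='D.C.'):
    """Geocode each neighborhood; return (rows, not_found) instead of failing on a miss."""
    rows, not_found = [], []
    for name in neighborhoods:
        location = geocode('{},{},{}'.format(name, city, state))
        if location is None:
            # no match for this query: record the name and move on
            not_found.append(name)
            continue
        rows.append({'Neighborhood': name,
                     'Latitude': location.latitude,
                     'Longitude': location.longitude})
    return rows, not_found


def fake_geocode(query):
    """Stand-in for locator.geocode so the sketch runs without network access."""
    known = {
        'Adams Morgan,Washington,D.C.': SimpleNamespace(latitude=38.9215, longitude=-77.042199),
    }
    return known.get(query)


rows, not_found = geocode_rows(fake_geocode, ['Adams Morgan', 'Not A Real Place'])
print(rows)
print(not_found)
```

Collecting the misses in a list makes it easy to see which names need a manual replacement, such as the street addresses used earlier.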


<b>Finally, the two dataframes are merged.</b>

In [19]:
dc_df = dc_neighborhoods_df.set_index('Neighborhood').join(lat_long_df.set_index('Neighborhood'))
dc_df.reset_index(inplace=True)
dc_df

       Neighborhood        City State   Latitude  Longitude
0      Adams Morgan  Washington  D.C.  38.921500 -77.042199
1         Anacostia  Washington  D.C.  38.862581 -76.984441
2         Brookland  Washington  D.C.  38.932832 -76.984226
3      Capitol Hill  Washington  D.C.  38.889803 -77.009418
4         Navy Yard  Washington  D.C.  38.876307 -77.000478
5  Columbia Heights  Washington  D.C.  38.928185 -77.031923
6  Congress Heights  Washington  D.C.  38.842897 -77.000255
7          Downtown  Washington  D.C.  38.900397 -77.028259
8     Dupont Circle  Washington  D.C.  38.912423 -77.041251
9      Foggy Bottom  Washington  D.C.  38.899114 -77.054728


<b>Let's visualize the 24 neighborhoods with Folium to see that we have good coverage around the city.</b>

In [20]:
address = 'Washington, D.C.'

location = locator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map 
map_dc = folium.Map(width=800,height=500,location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, neighborhood in zip(dc_df['Latitude'], dc_df['Longitude'], dc_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dc)  
    
map_dc

<b>Now let's see whether any neighborhoods overlap when each one is drawn as a circle with a 400-meter radius around its center.</b>

In [21]:
# create map 
map_dc_overlap = folium.Map(width=800,height=500,location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, neighborhood in zip(dc_df['Latitude'], dc_df['Longitude'], dc_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.Circle(
        [lat, lng],
        radius=400, #use circle and set radius to 400m
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dc_overlap)  
    
map_dc_overlap
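
The overlap can also be checked numerically rather than by eye. The sketch below assumes two 400 m circles overlap when their centers are less than 800 m apart, and uses the haversine formula for the distance. `haversine_m` and `overlapping_pairs` are hypothetical helper names, and the three centers are copied from the merged table above.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def overlapping_pairs(centers, radius_m=400):
    """Return the pairs of names whose circles of radius_m intersect."""
    pairs = []
    for (name_a, (lat_a, lng_a)), (name_b, (lat_b, lng_b)) in combinations(centers.items(), 2):
        if haversine_m(lat_a, lng_a, lat_b, lng_b) < 2 * radius_m:
            pairs.append((name_a, name_b))
    return pairs


# three neighborhood centers taken from the table above
centers = {
    'Adams Morgan': (38.921500, -77.042199),
    'Dupont Circle': (38.912423, -77.041251),
    'Columbia Heights': (38.928185, -77.031923),
}
print(overlapping_pairs(centers))
```

In the notebook, `centers` could be built from `dc_df` so that all 24 neighborhoods are compared.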

<b>Looks good! Now let's get the Foursquare data.</b>
<h3>Dataset 2: Foursquare venue type rating data associated with Washington, DC neighborhood</h3>
<b>First, initiate the credentials (hidden)</b>

In [96]:
# The code was removed by Watson Studio for sharing.

<b>Then, set up the function to get Foursquare venues based on category.</b>

In [84]:
def getVenues(names, latitudes, longitudes, features, radius=400):
    """Return one row per Foursquare venue found near each neighborhood center.

    One request is made per neighborhood and feature (an explore 'section').
    CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT must already be defined,
    for example in the credentials cell above.
    """
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print('Pulling venues for: ', name)
        for feat in features:
            # create the API request URL
            url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&section={}&sortByPopularity=1'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
                lat, 
                lng, 
                radius, 
                LIMIT,
                feat)

            # make the GET request and keep the list of venue items
            results = requests.get(url).json()["response"]['groups'][0]['items']

            # keep only the relevant information for each nearby venue
            venues_list.append([(
                name, 
                lat, 
                lng,
                feat,
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-request lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude',
                  'Feature',
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
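
The parsing step in `getVenues` indexes `v['venue']['categories'][0]`, which would raise an `IndexError` if a venue came back with an empty category list. A minimal sketch of a more tolerant parser follows, assuming the same response shape as above. `parse_items` is a hypothetical helper, and the second venue in the sample payload is made up purely to exercise the empty-category case.

```python
def parse_items(items, name, lat, lng, feat):
    """Turn the 'items' list of an explore response into row tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None when a venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, feat,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# sample payload in the shape read by getVenues
sample_items = [
    {'venue': {'name': 'Lapis',
               'location': {'lat': 38.921302, 'lng': -77.04389},
               'categories': [{'name': 'Afghan Restaurant'}]}},
    {'venue': {'name': 'Hypothetical Venue',  # made up for illustration
               'location': {'lat': 38.9210, 'lng': -77.0420},
               'categories': []}},
]

parsed = parse_items(sample_items, 'Adams Morgan', 38.9215, -77.042199, 'food')
for row in parsed:
    print(row)
```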

<b>Let's try an example neighborhood to explore how this will work</b>

In [73]:
sample = dc_df.head(1)
print(sample)

neigh_features = ['food', 'drinks', 'shops', 'arts', 'outdoors']

sample_venues = getVenues(names=sample['Neighborhood'], latitudes=sample['Latitude'], longitudes=sample['Longitude'], features=neigh_features)

sample_venues.head()


   Neighborhood        City State  Latitude  Longitude
0  Adams Morgan  Washington  D.C.   38.9215 -77.042199


   Neighborhood  Neighborhood Latitude  Neighborhood Longitude  Feature                      Venue  Venue Latitude  Venue Longitude           Venue Category
0  Adams Morgan                38.9215              -77.042199     food                      Lapis       38.921302       -77.043890        Afghan Restaurant
1  Adams Morgan                38.9215              -77.042199     food  Popeyes Louisiana Kitchen       38.923937       -77.040411      Fried Chicken Joint
2  Adams Morgan                38.9215              -77.042199     food      Amsterdam Falafelshop       38.921162       -77.041959       Falafel Restaurant
3  Adams Morgan                38.9215              -77.042199     food             Mintwood Place       38.922053       -77.043611  New American Restaurant
4  Adams Morgan                38.9215              -77.042199     food              So's Your Mom       38.921671       -77.043753               Bagel Shop


<b>Let's visualize these venues on the map. Food will be blue, drinks purple, shops green, arts red, and outdoors orange.</b>

In [74]:
feature_to_color= {'food':'blue','drinks':'purple','shops':'green','arts':'red','outdoors':'orange'}


latitude = sample['Latitude'][0]
longitude = sample['Longitude'][0]

# create map 
map_admo = folium.Map(width=800,height=500,location=[latitude, longitude], zoom_start=16)

# add markers to map
for lat, lng, venue, feat in zip(sample_venues['Venue Latitude'], sample_venues['Venue Longitude'], sample_venues['Venue'], sample_venues['Feature']):
    label = '{}, {}'.format(venue, feat)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=feature_to_color[feat],
        fill=True,
        fill_color=feature_to_color[feat],
        fill_opacity=0.3,
        parse_html=False).add_to(map_admo)  
    
map_admo

<b>Now let's see the breakdown by feature for this neighborhood.</b>

In [82]:
#sample_venues[sample_venues['Feature']=='arts']
sample_venues.groupby('Feature').count()

          Neighborhood  Neighborhood Latitude  Neighborhood Longitude  Venue  Venue Latitude  Venue Longitude  Venue Category
Feature
arts                 4                      4                       4      4               4                4               4
drinks              29                     29                      29     29              29               29              29
food                40                     40                      40     40              40               40              40
outdoors            19                     19                      19     19              19               19              19
shops               74                     74                      74     74              74               74              74


<b>Once we do this for all the neighborhoods and normalize the counts, we can compare neighborhoods and assign each one a rating for each feature.</b>
<p><b>Now, let's repeat with all neighborhoods.</b></p>

In [85]:
full_venues = getVenues(names=dc_df['Neighborhood'], latitudes=dc_df['Latitude'], longitudes=dc_df['Longitude'], features=neigh_features)
print('Venues Pulled')

Pulling venues for:  Adams Morgan
Pulling venues for:  Anacostia
Pulling venues for:  Brookland
Pulling venues for:  Capitol Hill
Pulling venues for:  Navy Yard
Pulling venues for:  Columbia Heights
Pulling venues for:  Congress Heights
Pulling venues for:  Downtown
Pulling venues for:  Dupont Circle
Pulling venues for:  Foggy Bottom
Pulling venues for:  Georgetown
Pulling venues for:  H Street NE
Pulling venues for:  Ivy City
Pulling venues for:  Logan Circle
Pulling venues for:  700 M St NW
Pulling venues for:  National Mall
Pulling venues for:  NoMa
Pulling venues for:  Penn Quarter
Pulling venues for:  Petworth
Pulling venues for:  Shaw
Pulling venues for:  Southwest & The Wharf
Pulling venues for:  1200 U St NW
Pulling venues for:  Tenleytown
Pulling venues for:  Woodley Park
Venues Pulled


In [93]:
print(full_venues.groupby(['Neighborhood','Feature']).count().to_string())

                                Neighborhood Latitude  Neighborhood Longitude  Venue  Venue Latitude  Venue Longitude  Venue Category
Neighborhood          Feature                                                                                                        
1200 U St NW          arts                         16                      16     16              16               16              16
                      drinks                       57                      57     57              57               57              57
                      food                         60                      60     60              60               60              60
                      outdoors                     26                      26     26              26               26              26
                      shops                        75                      75     75              75               75              75
700 M St NW           arts                          7         
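
As a rough illustration of the normalization step mentioned earlier, the sketch below turns venue counts into per-feature scores between 0 and 1. Dividing by the column maximum is an assumption made here, since the notebook has not yet chosen a normalization method. The counts are the ones printed above for Adams Morgan and 1200 U St NW, and `counts_long`, `counts` and `scores` are hypothetical names.

```python
import pandas as pd

# venue counts copied from the outputs above for two neighborhoods (illustration only);
# in the notebook these would come from grouping full_venues by Neighborhood and Feature
counts_long = pd.DataFrame({
    'Neighborhood': ['Adams Morgan'] * 5 + ['1200 U St NW'] * 5,
    'Feature': ['arts', 'drinks', 'food', 'outdoors', 'shops'] * 2,
    'Venue': [4, 29, 40, 19, 74, 16, 57, 60, 26, 75],
})

# one row per neighborhood, one column per feature
counts = counts_long.pivot(index='Neighborhood', columns='Feature', values='Venue')

# assumption: scale each feature by its largest count, so the best neighborhood scores 1.0
scores = counts / counts.max()
print(scores.round(2))
```

With all 24 neighborhoods in the table, each column would still top out at 1.0, which keeps features with many venues (such as shops) from dominating features with few (such as arts).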

<b>Great, now we will get the user input rating for these features.</b>
<h3>Dataset 3: User input rating of neighborhood features Parks, Restaurants, Metro/Subway Options, Shopping Venues, and Art Venues</h3>

In [95]:
table_contents=[]
cell={}

for x in neigh_features:
    val = input('On a scale of 1 (least important) to 5 (most important), how important are '+x+' venues to you: ')
    cell[x] = int(val)  # input() returns a string, so store the rating as a number
    
table_contents.append(cell)
    
user_df=pd.DataFrame(table_contents)
user_df

On a scale of 1 (least important) to 5 (most important), how important are food venues to you: 5
On a scale of 1 (least important) to 5 (most important), how important are drinks venues to you: 4
On a scale of 1 (least important) to 5 (most important), how important are shops venues to you: 3
On a scale of 1 (least important) to 5 (most important), how important are arts venues to you: 2
On a scale of 1 (least important) to 5 (most important), how important are outdoors venues to you: 1


   food  drinks  shops  arts  outdoors
0     5       4      3     2         1
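
One way these ratings might later be used in a content-based recommendation is as weights in a weighted sum over per-neighborhood feature scores. The sketch below is only an illustration under that assumption. The user ratings are the ones entered above, while the two neighborhoods and their feature scores are placeholders, not results from the Foursquare data.

```python
import pandas as pd

# ratings entered by the user above
user_ratings = pd.Series({'food': 5, 'drinks': 4, 'shops': 3, 'arts': 2, 'outdoors': 1})

# assumption: turn the ratings into weights that sum to 1
weights = user_ratings / user_ratings.sum()

# placeholder feature scores on a 0-1 scale (made up for illustration)
feature_scores = pd.DataFrame(
    {'food': [1.0, 0.5],
     'drinks': [0.5, 1.0],
     'shops': [0.5, 1.0],
     'arts': [0.0, 1.0],
     'outdoors': [0.0, 1.0]},
    index=['Neighborhood A', 'Neighborhood B'])

# weighted sum per neighborhood, best match first
match = feature_scores.dot(weights).sort_values(ascending=False)
print(match)
```

`DataFrame.dot` aligns the columns of `feature_scores` with the index of `weights`, so the feature names must match in both.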


<b>This concludes the Week 1 work for the Data Science Capstone. The next steps are to describe and apply a methodology to the above datasets to arrive at a recommendation for the user, present the results, and conclude the report.</b>