# Capstone Project: The Battle of Neighbourhoods

By Harry Ho

### 1. Introduction

US cities differ in which types of business dominate them. An entrepreneur choosing where to set up a business needs to consider how many competitors a city already has and which types of business are most common there, so that they can pick a suitable city. In this report, I analyze the most common venues in US cities by clustering the cities on the overall mix of business types in their neighbourhoods.

### 2. Data Analysis

This report uses data from Wikipedia (a list of US cities by population) and Foursquare (venues near each city's coordinates). The cities are clustered with K-means on their venue-category frequencies, and the resulting groups are displayed on a map.

In [1]:
import urllib.request as urllib2
from bs4 import BeautifulSoup
import pandas as pd
import json
import requests
import numpy as np
from geopy.geocoders import Nominatim
from pandas import json_normalize  # public import location since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

#### Cities data from Wikipedia

In [2]:
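url = r'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
# First sortable wikitable on the page: the ranked list of cities
table = soup.find("table", {"class": "wikitable sortable"})
table_rows = table.find_all('tr')

# Collect the text of every <td> cell, one list per table row
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    rows.append([cell.text for cell in cells])

# Column 1 holds the city name; iloc[1:] skips the header row, which has no <td> cells
names = pd.DataFrame(rows).iloc[1:, 1]
# Drop footnote markers such as "[d]" and surrounding whitespace
names = names.str.split('[', n=1).str[0].str.strip()
# Note: values[1:] also skips the first city in the table (New York is absent from the output below)
cities = [x + ', US' for x in names.values[1:]]
cities[:10]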

['Los Angeles, US',
 'Chicago, US',
 'Houston, US',
 'Phoenix, US',
 'Philadelphia, US',
 'San Antonio, US',
 'San Diego, US',
 'Dallas, US',
 'San Jose, US',
 'Austin, US']
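
The name-cleaning step can be checked in isolation, without fetching the page. The sketch below applies the same split-on-`[` rule to a few made-up cell strings; `clean_city` and the sample strings are hypothetical and only illustrate the rule, they are not taken from the Wikipedia table.

```python
def clean_city(raw, country="US"):
    """Strip a trailing footnote marker such as '[d]' and append the country."""
    # Keep everything before the first '[' and trim whitespace/newlines
    name = raw.split("[", 1)[0].strip()
    return "{}, {}".format(name, country)


# Made-up cell texts that mimic what tr.find_all('td') might return
samples = ["New York[d]\n", "Los Angeles\n", "  Chicago [note 1]"]
cleaned = [clean_city(s) for s in samples]
print(cleaned)
```

Cells without a footnote marker pass through unchanged apart from the whitespace trim, which is why the same rule can be applied to the whole column.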

In [3]:
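def lat_lng(address):
    """Return (latitude, longitude) for an address using Nominatim."""
    geolocator = Nominatim(user_agent="explorer")
    location = geolocator.geocode(address)
    if location is None:
        # geocode() returns None when the address cannot be resolved
        raise ValueError('No geocoding result for {}'.format(address))
    return (location.latitude, location.longitude)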
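
Geocoding a few hundred cities can fail for individual names, so it helps to record which ones were skipped instead of dropping them silently. The following is a minimal sketch under that assumption: `geocode_all` and `fake_geocode` are hypothetical helpers, and the fake geocoder stands in for the real service so the example runs offline.

```python
from types import SimpleNamespace


def geocode_all(addresses, geocode):
    """Geocode each address with the supplied callable.

    `geocode` takes an address string and returns an object with
    `.latitude` and `.longitude`, or None when nothing is found
    (the behaviour of geopy's Nominatim.geocode).
    """
    rows, failed = [], []
    for address in addresses:
        try:
            location = geocode(address)
        except Exception:
            # Treat timeouts and service errors like a missing result
            location = None
        if location is None:
            failed.append(address)
            continue
        rows.append({"City": address,
                     "Latitude": location.latitude,
                     "Longitude": location.longitude})
    return rows, failed


# Offline stand-in for the geocoding service (coordinates copied from this report)
KNOWN = {"Chicago, US": (41.875562, -87.624421)}


def fake_geocode(address):
    if address not in KNOWN:
        return None
    lat, lng = KNOWN[address]
    return SimpleNamespace(latitude=lat, longitude=lng)


rows, failed = geocode_all(["Chicago, US", "Nowhere At All, US"], fake_geocode)
print(rows)
print(failed)
```

With geopy, the callable could be `Nominatim(user_agent="explorer").geocode`, optionally wrapped in `geopy.extra.rate_limiter.RateLimiter(..., min_delay_seconds=1)` to space out the requests.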

In [4]:
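rows = []

for city in cities:
    print(city)
    try:
        (lat, lng) = lat_lng(city)
    except Exception:
        # Skip cities that cannot be geocoded (no result, timeout, service error)
        continue
    rows.append({'City': city, 'Latitude': lat, 'Longitude': lng})

# DataFrame.append was removed in pandas 2.0, so build the frame once from a list of rows
data = pd.DataFrame(rows, columns=['City', 'Latitude', 'Longitude'])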

Los Angeles, US
Chicago, US
Houston, US
Phoenix, US
Philadelphia, US
San Antonio, US
San Diego, US
Dallas, US
San Jose, US
Austin, US
Jacksonville, US
Fort Worth, US
Columbus, US
San Francisco, US
Charlotte, US
Indianapolis, US
Seattle, US
Denver, US
Washington, US
Boston, US
El Paso, US
Detroit, US
Nashville, US
Portland, US
Memphis, US
Oklahoma City, US
Las Vegas, US
Louisville, US
Baltimore, US
Milwaukee, US
Albuquerque, US
Tucson, US
Fresno, US
Mesa, US
Sacramento, US
Atlanta, US
Kansas City, US
Colorado Springs, US
Miami, US
Raleigh, US
Omaha, US
Long Beach, US
Virginia Beach, US
Oakland, US
Minneapolis, US
Tulsa, US
Arlington, US
Tampa, US
New Orleans, US
Wichita, US
Cleveland, US
Bakersfield, US
Aurora, US
Anaheim, US
Honolulu, US
Santa Ana, US
Riverside, US
Corpus Christi, US
Lexington, US
Stockton, US
Henderson, US
Saint Paul, US
St. Louis, US
Cincinnati, US
Pittsburgh, US
Greensboro, US
Anchorage, US
Plano, US
Lincoln, US
Orlando, US
Irvine, US
Newark, US
Toledo, US
Durham, U

In [5]:
data

| | City | Latitude | Longitude |
|---|---|---|---|
| 0 | Los Angeles, US | 34.053691 | -118.242767 |
| 1 | Chicago, US | 41.875562 | -87.624421 |
| 2 | Houston, US | 29.758938 | -95.367697 |
| 3 | Phoenix, US | 33.448587 | -112.077346 |
| 4 | Philadelphia, US | 39.952724 | -75.163526 |
| 5 | San Antonio, US | 29.424600 | -98.495141 |
| 6 | San Diego, US | 32.717421 | -117.162771 |
| 7 | Dallas, US | 32.776272 | -96.796856 |
| 8 | San Jose, US | 37.336191 | -121.890583 |
| 9 | Austin, US | 30.271129 | -97.743700 |


#### Foursquare data

In [6]:
import os

# Read the Foursquare credentials from environment variables instead of hard-coding them.
# The variable names below are this notebook's own choice, not a Foursquare convention.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605'
print('Your credentials are loaded.')

Your credentials are loaded.


In [7]:
LIMIT = 100
radius = 500

def get_category_type(row):
    """Return the name of the first category of a venue row, or None."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Query Foursquare for up to LIMIT venues within `radius` metres of each city."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # Let requests build and encode the query string
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }
        response = requests.get('https://api.foursquare.com/v2/venues/explore',
                                params=params, timeout=30)
        results = response.json()["response"]['groups'][0]['items']

        for v in results:
            categories = v['venue']['categories']
            venues_list.append((
                name,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                # Some venues may have no category; avoid an IndexError
                categories[0]['name'] if categories else None))

    nearby_venues = pd.DataFrame(venues_list, columns=[
        'City',
        'Latitude',
        'Longitude',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category'])

    return nearby_venues
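
The flattening of a response can be tried without calling the API. The sketch below assumes the `response -> groups[0] -> items -> venue` shape used in `getNearbyVenues`; `extract_venues` is a hypothetical helper, and the payload is a hand-written mock (the first venue reuses values from this report, the second is invented to show the empty-category case).

```python
def extract_venues(city, payload):
    """Flatten one explore-style payload into (city, name, lat, lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        rows.append((
            city,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            # None when the venue carries no category at all
            categories[0]["name"] if categories else None,
        ))
    return rows


# Hand-written mock payload; not a recorded API response
mock_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Grand Park",
               "location": {"lat": 34.055034, "lng": -118.245179},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Invented Kiosk",
               "location": {"lat": 34.05, "lng": -118.24},
               "categories": []}},
]}]}}

venue_rows = extract_venues("Los Angeles, US", mock_payload)
empty_rows = extract_venues("Nowhere, US", {"response": {}})
print(venue_rows)
print(empty_rows)
```

Keeping the parsing separate from the HTTP call makes it possible to test the edge cases (no groups, no categories) that would otherwise only show up in the middle of a long run.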

In [8]:
venues = getNearbyVenues(names=data['City'],
                         latitudes=data['Latitude'],
                         longitudes=data['Longitude'])

In [9]:
venues.head()

| | City | Latitude | Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Los Angeles, US | 34.053691 | -118.242767 | Grand Park | 34.055034 | -118.245179 | Park |
| 1 | Los Angeles, US | 34.053691 | -118.242767 | Badmaash | 34.051342 | -118.244571 | Indian Restaurant |
| 2 | Los Angeles, US | 34.053691 | -118.242767 | Renegade Craft Fair | 34.054445 | -118.244471 | Arts & Crafts Store |
| 3 | Los Angeles, US | 34.053691 | -118.242767 | Redbird | 34.050666 | -118.244068 | American Restaurant |
| 4 | Los Angeles, US | 34.053691 | -118.242767 | CVS pharmacy | 34.053426 | -118.242107 | Pharmacy |


In [10]:
print('There are {} unique categories.'.format(venues['Venue Category'].nunique()))

There are 407 unique categories.


#### Analysis

In [11]:
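# One indicator column per venue category; dtype=int keeps 0/1 values (newer pandas defaults to booleans)
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="", dtype=int)
onehot['City'] = venues['City']
# Move the City column from the last position to the first
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]
onehot.head()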

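| | City | ATM | Accessories Store | Adult Boutique | Advertising Agency | Afghan Restaurant | African Restaurant | American Restaurant | Antique Shop | Arcade | ... | Vietnamese Restaurant | Warehouse Store | Weight Loss Center | Whisky Bar | Wine Bar | Wine Shop | Winery | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Los Angeles, US | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Los Angeles, US | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Los Angeles, US | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Los Angeles, US | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Los Angeles, US | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |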


In [12]:
grouped = onehot.groupby('City').mean().reset_index()
grouped

| | City | ATM | Accessories Store | Adult Boutique | Advertising Agency | Afghan Restaurant | African Restaurant | American Restaurant | Antique Shop | Arcade | ... | Vietnamese Restaurant | Warehouse Store | Weight Loss Center | Whisky Bar | Wine Bar | Wine Shop | Winery | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abilene, US | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.083333 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.083333 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 |
| 1 | Akron, US | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.027027 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.027027 | 0.027027 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 |
| 2 | Albuquerque, US | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.018519 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 |
| 3 | Alexandria, US | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.333333 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 |
| 4 | Allentown, US | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.028571 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 |
| 5 | Amarillo, US | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.045455 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 |
| 6 | Anaheim, US | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.015152 | 0.000000 | 0.000000 | ... | 0.015152 | 0.0 | 0.0 | 0.000000 | 0.030303 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 |
| 7 | Anchorage, US | 0.000000 | 0.041667 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.020833 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.020833 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 |
| 8 | Ann Arbor, US | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 |
| 9 | Antioch, US | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 |
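
Each value in the grouped table is the share of a city's venues that fall into one category, because the mean of a 0/1 indicator column is a frequency. A small sketch on invented data (the cities `A`/`B` and their venues are made up) shows the two steps together:

```python
import pandas as pd

# Invented venue list: city A has 4 venues, city B has 2
toy = pd.DataFrame({
    "City": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Cafe", "Cafe", "Hotel", "Park", "Park"],
})

# One 0/1 column per category, then the city name in front
toy_onehot = pd.get_dummies(toy["Venue Category"], dtype=int)
toy_onehot.insert(0, "City", toy["City"])

# Mean of the indicators per city = relative frequency of each category
toy_freq = toy_onehot.groupby("City").mean()
print(toy_freq)
```

Every row of `toy_freq` sums to 1, so cities with many venues and cities with few are compared on the same scale when they are clustered.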


In [13]:
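def return_most_common_venues(row, num_top_venues):
    """Return the names of the `num_top_venues` most frequent categories in a row of `grouped`."""
    # The first entry is the city name; the rest are category frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]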

In [14]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    # 1st, 2nd and 3rd take a special suffix; every later rank used here takes 'th'
    suffix = indicators[ind] if ind < len(indicators) else 'th'
    columns.append('{}{} Most Common Venue'.format(ind+1, suffix))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['City'] = grouped['City']

for ind in np.arange(grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)

venues_sorted.head()

| | City | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abilene, US | Miscellaneous Shop | Pub | Mexican Restaurant | Museum | Wine Bar | Candy Store | Deli / Bodega | American Restaurant | Construction & Landscaping | Seafood Restaurant |
| 1 | Akron, US | Bank | Bar | Art Gallery | Music Venue | Café | Sandwich Place | Coffee Shop | Steakhouse | Boutique | Jazz Club |
| 2 | Albuquerque, US | Brewery | Bar | Restaurant | Coffee Shop | Theater | Asian Restaurant | Music Venue | Sandwich Place | Pizza Place | Hotel |
| 3 | Alexandria, US | Park | American Restaurant | Fish & Chips Shop | Event Space | Exhibit | Fabric Shop | Falafel Restaurant | Farm | Farmers Market | Fast Food Restaurant |
| 4 | Allentown, US | Sandwich Place | Coffee Shop | New American Restaurant | Pharmacy | Auto Garage | BBQ Joint | Bagel Shop | Donut Shop | Park | Pub |
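
The column labels above only need ranks 1 to 10, where every rank past the third takes "th". If `num_top_venues` were raised beyond 20, ranks such as 21 or 22 would need "st" and "nd" again. A possible general helper is sketched below; `ordinal` is a hypothetical function and is not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    # 11, 12 and 13 (and 111, 112, ...) are irregular and always take 'th'
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


labels = ["{} Most Common Venue".format(ordinal(rank)) for rank in range(1, 11)]
print(labels[:3])
print(ordinal(21), ordinal(22), ordinal(111))
```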


In [15]:
kclusters = 10
grouped_clustering = grouped.drop(columns='City')  # positional axis argument was removed in pandas 2.0
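# n_init=10 was the scikit-learn default when this was written; set it explicitly for reproducibility
kmeans = KMeans(n_clusters=kclusters, n_init=10, random_state=0).fit(grouped_clustering)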
kmeans.labels_[0:10]

array([2, 2, 2, 3, 8, 8, 4, 8, 3, 4], dtype=int32)
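
The value `kclusters = 10` is fixed by hand in this report. One common way to sanity-check such a choice is the elbow heuristic: fit K-means for several values of k and look at where the within-cluster sum of squares (`inertia_`) stops dropping sharply. The sketch below runs on synthetic 2-D points with three obvious groups, not on the venue data, so the numbers say nothing about the cities.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: 30 points around each of three well-separated centres
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

# Within-cluster sum of squares for k = 1..6
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

On this toy data the inertia falls steeply up to k = 3 and only slightly afterwards. Applying the same loop to `grouped_clustering` would be one way to judge whether 10 clusters is a reasonable setting.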

In [16]:
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
# Left join on City: cities without venue data get NaN labels, so the label column is shown as float
merged = data.join(venues_sorted.set_index('City'), on='City')
merged.head()

| | City | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Los Angeles, US | 34.053691 | -118.242767 | 2.0 | Speakeasy | Sushi Restaurant | Museum | Noodle House | American Restaurant | Coffee Shop | Bookstore | Theater | Breakfast Spot | Japanese Restaurant |
| 1 | Chicago, US | 41.875562 | -87.624421 | 8.0 | Coffee Shop | Sandwich Place | Donut Shop | Hotel | Pub | Pizza Place | Museum | Bakery | Bookstore | American Restaurant |
| 2 | Houston, US | 29.758938 | -95.367697 | 8.0 | Hotel | Park | Theater | Sandwich Place | Concert Hall | Burger Joint | Coffee Shop | Performing Arts Venue | Fast Food Restaurant | Italian Restaurant |
| 3 | Phoenix, US | 33.448587 | -112.077346 | 2.0 | Coffee Shop | American Restaurant | Mexican Restaurant | Hotel | Pizza Place | Music Venue | Sushi Restaurant | Pub | Cocktail Bar | Rock Club |
| 4 | Philadelphia, US | 39.952724 | -75.163526 | 8.0 | Hotel | Coffee Shop | American Restaurant | Clothing Store | Chinese Restaurant | Bakery | Public Art | Mexican Restaurant | Salad Place | Chocolate Shop |
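
The cluster labels appear as `2.0` and `8.0` rather than integers. A likely explanation, which I assume here, is that at least one city in `data` has no row in `venues_sorted`, so the join fills its label with NaN and pandas stores the column as float. The sketch below reproduces the effect on invented data and shows one way to get integer labels back.

```python
import pandas as pd

# Invented data: city B has coordinates but no cluster label
toy_cities = pd.DataFrame({"City": ["A", "B", "C"], "Latitude": [1.0, 2.0, 3.0]})
toy_labels = pd.DataFrame({"City": ["A", "C"], "Cluster Labels": [2, 8]})

# Same kind of left join as in the notebook
toy_merged = toy_cities.join(toy_labels.set_index("City"), on="City")
print(toy_merged["Cluster Labels"].dtype)

# Drop unlabelled cities, then the labels can be integers again
toy_clean = toy_merged.dropna(subset=["Cluster Labels"]).astype({"Cluster Labels": int})
print(toy_clean)
```

Dropping the unlabelled rows before plotting also removes the need to guard the map loop against NaN labels.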


In [17]:
map_clusters = folium.Map(location=[41.479014,-101.9245357], zoom_start=4)
# One colour per cluster, indexed directly by the cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# Cities without a cluster label (NaN) cannot be coloured, so leave them off the map
labelled = merged.dropna(subset=['Cluster Labels'])
for lat, lon, poi, cluster in zip(labelled['Latitude'], labelled['Longitude'], labelled['City'], labelled['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(int(cluster)), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters