# Comparison of Davidson (Nashville), Mecklenburg (Charlotte), and Orange (Chapel Hill) Counties

### Table of Contents

* [Introduction](#intro)   
* [Data](#data)
* [Methodology](#method)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion and Future Direction](#done)

## Introduction<a id = "intro"></a>

The goal of this project is to compare three cities and see how similar they are. Specifically, this report compares **Nashville, TN**, **Charlotte, NC**, and **Chapel Hill, NC**.

Since there is no way to drill down to the neighborhood level for these particular cities, we instead use the cities and zip codes of the counties they are in. We look at the venues near each location to find what makes these areas similar or different in terms of venue types. Using the skills learned throughout this course, we explain our findings in detail.

## Data <a id = 'data'></a>

To obtain the zip codes, we scrape them from a government website. We take the zip code, city name, and county name from the site and add the state manually. Using the state, zip code, and city, we can get the geographic coordinates of each zip code.

Since we are determining similarity of neighborhoods based on nearby venues, we need to obtain those venues. Using the data scraped from the internet, we will use the **Foursquare API** to obtain the venues within a 750 meter radius. 

### Web Scraping

Here, I am gathering the data for the city, zip code, and county of each of the places in the counties being analyzed.

In [1]:
# import libraries for webscraping
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

In [2]:
#Getting the Mecklenburg County (Charlotte area) zip codes
url = requests.get('https://www.ciclt.net/sn/clt/capitolimpact/gw_ziplist.aspx?ClientCode=capitolimpact&State=nc&StName=North%20Carolina&StFIPS=37&FIPS=37119').text
soup = BeautifulSoup(url, 'html.parser')
tables = soup.find_all('table', border = 3)
headings = ['ZipCode', 'City', 'County']
zips = []
for table in tables:

    rows = table.find_all('tr')

    zips.append([[td.getText().rstrip() for td in rows[i].findAll('td')]
    for i in range(len(rows))])

#remove the blank row from both lists
zip1 = zips[0][1:]
zip2 = zips[1][1:]


#make df for the zip codes
# DataFrame.append was removed in pandas 2.0, so combine the two tables with pd.concat
clt_zips = pd.concat([pd.DataFrame(zip1, columns = headings),
                      pd.DataFrame(zip2, columns = headings)])

clt_zips['State'] = "NC"

In [3]:
#Repeat above steps for Orange County (Chapel Hill area) zip codes
url = requests.get('http://www.ciclt.net/sn/clt/capitolimpact/gw_ziplist.aspx?ClientCode=capitolimpact&State=nc&StName=North%20Carolina&StFIPS=37&FIPS=37135').text
soup = BeautifulSoup(url, 'html.parser')
table = soup.find('table', border = 3)
headings = ['ZipCode', 'City', 'County']
rows = table.find_all('tr')

ch_zips = [[td.getText().rstrip() for td in rows[i].findAll('td')]
    for i in range(len(rows))]

#remove the blank row from the list
ch_zips = ch_zips[1:]


#make df for the zip codes
ch_zips = pd.DataFrame(ch_zips, columns = headings)

ch_zips['State'] = "NC"

In [4]:
#repeat for Nashville-Davidson County zip codes
url = requests.get('http://www.ciclt.net/sn/clt/capitolimpact/gw_ziplist.aspx?ClientCode=capitolimpact&State=tn&StName=Tennessee&StFIPS=47&FIPS=47037').text
soup = BeautifulSoup(url, 'html.parser')
table = soup.find('table', cellpadding = 3)
headings = ['ZipCode', 'City', 'County']
rows = table.find_all('tr')

tn_zips = [[td.getText().rstrip() for td in rows[i].findAll('td')]
    for i in range(len(rows))]

#remove the blank row from the list
tn_zips = tn_zips[1:]


#make df for the zip codes
tn_zips = pd.DataFrame(tn_zips, columns = headings)

tn_zips['State'] = "TN"
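The three scraping cells above repeat the same pattern: download a page, find the table, and read the `td` cells of each row. As a minimal sketch of that pattern, the helper below does the row extraction with the standard library's `html.parser` instead of BeautifulSoup, so it runs without extra installs. The names `TableRows` and `parse_zip_rows` and the HTML fragment are made up for illustration; the real page layout may differ.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # rows without any <td> (headers, blanks) are skipped
                self.rows.append(self._row)
            self._row = None


def parse_zip_rows(html, state):
    """Return one dict per table row, using the same column names as above."""
    parser = TableRows()
    parser.feed(html)
    headings = ["ZipCode", "City", "County"]
    return [dict(zip(headings, row), State=state)
            for row in parser.rows if len(row) == len(headings)]


# Made-up HTML fragment standing in for the downloaded page
sample_html = """
<table border="3">
  <tr><th>Zip</th><th>City</th><th>County</th></tr>
  <tr><td>28031 </td><td>Cornelius</td><td>Mecklenburg County</td></tr>
  <tr><td>28035 </td><td>Davidson</td><td>Mecklenburg County</td></tr>
</table>
"""
zip_rows = parse_zip_rows(sample_html, "NC")
print(zip_rows)
```

A list of dicts like `zip_rows` can be passed straight to `pd.DataFrame`, which would let one function serve all three counties.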

In [5]:
# combine all zip codes into one dataframe

# DataFrame.append was removed in pandas 2.0, so use pd.concat
all_zips = pd.concat([clt_zips, ch_zips, tn_zips])

all_zips = all_zips.reset_index(drop = True)

#to verify print the unique counties (should be 3)
np.unique(all_zips.County)

array(['Davidson County', 'Mecklenburg County', 'Orange County'],
      dtype=object)

Next, I run the data through a geocoder to get the latitude and longitude of each zip code.

In [6]:
#import geolocator packages
from geopy.geocoders import Nominatim

In [7]:
#joining lon - lat values to the dataframe
def get_coords(df):
    lat = []
    lon = []
    # create the geocoder once rather than on every iteration
    geolocator = Nominatim(user_agent = 'my_explorer')
    for i in range(len(df)):
        state = df.State[i]
        city = df.City[i]
        zipcode = df.ZipCode[i]
        address = city + ', '+ state + ' ' +zipcode
        location = geolocator.geocode(address)
        lat.append(location.latitude)
        lon.append(location.longitude)
    df['lat'] = lat
    df['lon'] = lon
    return df
final_zips = get_coords(all_zips)
final_zips

Unnamed: 0,ZipCode,City,County,State,lat,lon
0,28031,Cornelius,Mecklenburg County,NC,35.481705,-80.859001
1,28035,Davidson,Mecklenburg County,NC,35.499261,-80.848522
2,28036,Davidson,Mecklenburg County,NC,35.499261,-80.848522
3,28070,Huntersville,Mecklenburg County,NC,35.410828,-80.842930
4,28078,Huntersville,Mecklenburg County,NC,35.410828,-80.842930
...,...,...,...,...,...,...
128,37244,Nashville,Davidson County,TN,36.162230,-86.774353
129,37245,Nashville,Davidson County,TN,36.162230,-86.774353
130,37247,Nashville,Davidson County,TN,36.162230,-86.774353
131,37248,Nashville,Davidson County,TN,36.312392,-86.672015
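One thing to watch for in the geocoding step: geopy's `geocode` returns `None` when it finds no match, and reading `.latitude` from `None` raises an error. The sketch below shows one way to guard against that. It uses a stand-in geocoder (`fake_geocode`, a hypothetical function with a single hard-coded address) so it runs offline; `coords_or_none` is also a name invented for this example.

```python
from collections import namedtuple

# Stand-in for geopy's Location object (illustration only)
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def fake_geocode(address):
    """Hypothetical offline geocoder: knows one address, returns None otherwise."""
    known = {"Cornelius, NC 28031": FakeLocation(35.481705, -80.859001)}
    return known.get(address)


def coords_or_none(address, geocode):
    """Return (lat, lon), or (None, None) when the lookup finds nothing."""
    location = geocode(address)
    if location is None:
        return (None, None)
    return (location.latitude, location.longitude)


addresses = ["Cornelius, NC 28031", "Nowhere, NC 00000"]
results = [coords_or_none(a, fake_geocode) for a in addresses]
print(results)
```

In the notebook, `geolocator.geocode` would take the place of `fake_geocode`, and rows with missing coordinates could be dropped before mapping.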


In [8]:
#clean the data to remove duplicate lat/lon values
final_zips = final_zips.drop_duplicates(subset = ['lat', 'lon'])
print(final_zips.shape)
final_zips.head()

(36, 6)


Unnamed: 0,ZipCode,City,County,State,lat,lon
0,28031,Cornelius,Mecklenburg County,NC,35.481705,-80.859001
1,28035,Davidson,Mecklenburg County,NC,35.499261,-80.848522
3,28070,Huntersville,Mecklenburg County,NC,35.410828,-80.84293
5,28105,Matthews,Mecklenburg County,NC,35.115953,-80.722439
7,28126,Newell,Mecklenburg County,NC,35.279169,-80.736576


In [9]:
#import the map package from the earlier pieces of the capstone

!conda install -c conda-forge folium=0.5.0 --yes
import folium




In [10]:
#split into the individual counties in order to create maps
nsh = final_zips[final_zips.County == 'Davidson County']
clt = final_zips[final_zips.County == 'Mecklenburg County']
ch = final_zips[final_zips.County == 'Orange County']

In [11]:
#Nashville Map
address = 'Nashville, TN'
geolocator = Nominatim(user_agent = 'my_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_nashville = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, City, State in zip(nsh['lat'], nsh['lon'], nsh['City'], nsh['State']):
    label = '{}, {}'.format(City, State)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_nashville)  
map_nashville

In [12]:
#Charlotte Map
address = 'Charlotte, NC'
geolocator = Nominatim(user_agent = 'my_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_charlotte = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, City, State in zip(clt['lat'], clt['lon'], clt['City'], clt['State']):
    label = '{}, {}'.format(City, State)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_charlotte)  
map_charlotte

In [13]:
#Chapel Hill Map
address = 'Chapel Hill, NC'
geolocator = Nominatim(user_agent = 'my_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_chapelhill = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, City, State in zip(ch['lat'], ch['lon'], ch['City'], ch['State']):
    label = '{}, {}'.format(City, State)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_chapelhill)  
map_chapelhill

In [14]:
# @hidden_cell
import os

# Read the Foursquare credentials from environment variables instead of
# hard-coding them (the variable names below are an assumption)
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')
CODE = os.environ.get('FOURSQUARE_CODE', '')
ACCESS = os.environ.get('FOURSQUARE_ACCESS_TOKEN', '')
VERSION = '20180605'

In [15]:
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=750):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
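The request URL above is assembled with a long format string. As an alternative sketch, the same query parameters can be collected in a dictionary and encoded with `urllib.parse.urlencode`, which also takes care of escaping. The endpoint and parameter names are taken from the cell above; `build_explore_url` and the placeholder credentials are made up for this example, and no request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=750, limit=100):
    """Build the explore URL from named parameters (no request is made)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials; coordinates are the Cornelius row from the table above
url = build_explore_url("PLACEHOLDER_ID", "PLACEHOLDER_SECRET", "20180605",
                        35.481705, -80.859001)

# Decode the query string again to confirm what would be sent
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"], query["limit"])
```

Passing the same dictionary as `params=` to `requests.get` would be another option, since `requests` encodes it in a similar way.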

In [16]:
all_venues = getNearbyVenues(names = final_zips['City'],
                                 latitudes=final_zips['lat'],
                                   longitudes=final_zips['lon']
                                  )

all_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Cornelius,35.481705,-80.859001,Old Town Public House,35.482148,-80.860568,Bar
1,Cornelius,35.481705,-80.859001,SweetCakes Bakery,35.481886,-80.858196,Cupcake Shop
2,Cornelius,35.481705,-80.859001,crafty burg'r,35.481221,-80.855872,New American Restaurant
3,Cornelius,35.481705,-80.859001,Parker Banner Kent & Wayne,35.480516,-80.858248,Hobby Shop
4,Cornelius,35.481705,-80.859001,Fork Restaurant,35.487766,-80.85843,Restaurant
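As a sanity check on the 750 meter radius, the distance between a search point and a returned venue can be estimated with the haversine formula. The sketch below uses the first row of the output above (the Cornelius search point and Old Town Public House). It assumes a spherical Earth with a mean radius of 6,371 km, so the result is an approximation and may not match exactly how Foursquare measures distance.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Cornelius search point and the first venue returned for it
distance = haversine_m(35.481705, -80.859001, 35.482148, -80.860568)
print(round(distance), distance <= 750)
```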


## Methodology <a id = 'method'></a>

For our analysis, we look at the number of venues in each city, as well as the total number of unique venue types overall. We then create a frequency table for each venue type and city pairing and show the 3 most frequent venue types per city.

We then use k-means clustering to group the cities by the similarity of their venue-type frequencies. Taking the 8 most common venue types for each city, we display them for each group of cities to find the common themes and show the different groupings on a map.

## Analysis <a id = 'analysis'></a>

Let's first look at the total number of venues in each city, and then at how many unique venue types we have overall.

In [17]:
all_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Antioch,10,10,10,10,10,10
Bellevue,7,7,7,7,7,7
Carrboro,49,49,49,49,49,49
Cedar Grove,1,1,1,1,1,1
Chapel Hill,70,70,70,70,70,70
Charlotte,219,219,219,219,219,219
Cornelius,19,19,19,19,19,19
Davidson,28,28,28,28,28,28
Efland,3,3,3,3,3,3
Goodlettsville,58,58,58,58,58,58


In [18]:
print('There are {} unique categories.'.format(len(all_venues['Venue Category'].unique())))


There are 209 unique categories.


Next, I create a frequency table for each venue type and city combination.

In [19]:
all_fixed = pd.get_dummies(all_venues[['Venue Category']], prefix = '', prefix_sep = '')

all_fixed['Neighbourhood'] = all_venues.Neighborhood

fixed_columns = [all_fixed.columns[-1]] + list(all_fixed.columns[:-1])
all_fixed = all_fixed[fixed_columns]

all_group = all_fixed.groupby('Neighbourhood').mean().reset_index()

all_group

Unnamed: 0,Neighbourhood,ATM,Accessories Store,American Restaurant,Antique Shop,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Antioch,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bellevue,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Carrboro,0.0,0.0,0.040816,0.0,0.0,0.0,0.0,0.0,0.0,...,0.020408,0.0,0.020408,0.0,0.0,0.020408,0.0,0.020408,0.0,0.020408
3,Cedar Grove,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Chapel Hill,0.0,0.0,0.042857,0.0,0.0,0.0,0.014286,0.0,0.014286,...,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.014286
5,Charlotte,0.004566,0.0,0.027397,0.0,0.004566,0.004566,0.004566,0.0,0.0,...,0.0,0.0,0.0,0.009132,0.0,0.004566,0.004566,0.004566,0.0,0.0
6,Cornelius,0.0,0.0,0.052632,0.052632,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Davidson,0.0,0.0,0.107143,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Efland,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Goodlettsville,0.017241,0.017241,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.017241,0.017241,0.0,0.0,0.0,0.0,0.0,0.017241,0.0
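Each value in the table above is the share of a city's venues that fall in a given category: one-hot encoding followed by a group mean is the same as counting the category and dividing by the city's total. The sketch below reproduces that arithmetic with plain Python on a small made-up data set (neighborhoods "A" and "B" are not from the real results).

```python
from collections import Counter, defaultdict

# Toy (neighborhood, category) pairs, not the real Foursquare results
toy_venues = [
    ("A", "Bar"), ("A", "Bar"), ("A", "Pizza Place"), ("A", "Hotel"),
    ("B", "Pizza Place"), ("B", "Pizza Place"),
]

# Count how often each category appears in each neighborhood
counts = defaultdict(Counter)
for hood, category in toy_venues:
    counts[hood][category] += 1

# Divide by the neighborhood total; categories that never occur get 0.0
categories = sorted({category for _, category in toy_venues})
freq = {
    hood: {c: counter[c] / sum(counter.values()) for c in categories}
    for hood, counter in counts.items()
}
print(freq)
```

Every row sums to 1, which is why a city with a single venue, such as Cedar Grove, shows a frequency of 1.0 for its only category.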


Let's look at the 3 most common venue types for each city.

In [20]:
for name in all_group.Neighbourhood:
    print('--------'+name+'-------')
    info = all_group[all_group.Neighbourhood == name].T.reset_index()
    info.columns = ['Venue', 'Freq']
    info = info.iloc[1:]
    info.Freq = info.Freq.astype(float)
    info = info.round({'Freq': 3})
    print(info.sort_values('Freq', ascending=False).reset_index(drop=True).head(3))
    print('\n')

--------Antioch-------
                     Venue  Freq
0  Comfort Food Restaurant   0.1
1             Soccer Field   0.1
2        Mobile Phone Shop   0.1


--------Bellevue-------
                Venue   Freq
0               Track  0.143
1  Light Rail Station  0.143
2      Clothing Store  0.143


--------Carrboro-------
         Venue   Freq
0  Pizza Place  0.082
1    Gift Shop  0.082
2  Coffee Shop  0.061


--------Cedar Grove-------
             Venue  Freq
0    Auto Workshop   1.0
1              ATM   0.0
2  Other Nightlife   0.0


--------Chapel Hill-------
                 Venue   Freq
0                  Bar  0.071
1       Sandwich Place  0.043
2  American Restaurant  0.043


--------Charlotte-------
         Venue   Freq
0  Pizza Place  0.059
1        Hotel  0.037
2  Coffee Shop  0.032


--------Cornelius-------
                    Venue   Freq
0             Supermarket  0.105
1          Cosmetics Shop  0.105
2  Furniture / Home Store  0.105


--------Davidson-------
           

In [21]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [22]:
x_ven = 8

indicators = ['st', 'nd', 'rd']


columns = ['Neighbourhood']
for ind in np.arange(x_ven):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions past the 3rd fall outside the indicators list and use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = all_group['Neighbourhood']

for ind in np.arange(all_group.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(all_group.iloc[ind, :], x_ven)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,Antioch,Comfort Food Restaurant,Soccer Field,Mobile Phone Shop,Trail,BBQ Joint,Mattress Store,Moving Target,Convenience Store
1,Bellevue,Track,Light Rail Station,Clothing Store,Bakery,Gas Station,Intersection,Pizza Place,Museum
2,Carrboro,Pizza Place,Gift Shop,Coffee Shop,Bar,Brewery,American Restaurant,Convenience Store,Record Shop
3,Cedar Grove,Auto Workshop,ATM,Other Nightlife,Middle Eastern Restaurant,Mobile Phone Shop,Monument / Landmark,Movie Theater,Moving Target
4,Chapel Hill,Bar,Sandwich Place,American Restaurant,Brewery,Greek Restaurant,Mexican Restaurant,Pizza Place,Performing Arts Venue


## Modeling

Let's use k-means to cluster our neighborhoods.

In [23]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

In [24]:
k = 7

all_clusters = all_group.drop('Neighbourhood', axis = 1)

kmeans = KMeans(n_clusters=k, random_state=0).fit(all_clusters)
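To make the clustering step less of a black box, here is a minimal sketch of the basic k-means loop (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. This is not a replica of scikit-learn's `KMeans`, which uses a different initialization (k-means++ by default) and its own stopping rule. The function names and the toy points are made up for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def simple_kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, seeded with the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point takes the label of its nearest centroid
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its assigned points
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy frequency vectors with two features each, not the real venue data
toy_points = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels, centroids = simple_kmeans(toy_points, k=2)
print(labels, centroids)
```

In the notebook, each row of `all_clusters` plays the role of one point, with one feature per venue category.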

In [25]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

all_merge = final_zips.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='City')

all_merge

Unnamed: 0,ZipCode,City,County,State,lat,lon,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,28031,Cornelius,Mecklenburg County,NC,35.481705,-80.859001,0,Supermarket,Cosmetics Shop,Furniture / Home Store,Sandwich Place,Cupcake Shop,New American Restaurant,Bar,Restaurant
1,28035,Davidson,Mecklenburg County,NC,35.499261,-80.848522,0,American Restaurant,Ice Cream Shop,Café,Bank,Mexican Restaurant,Indie Movie Theater,Pizza Place,Steakhouse
3,28070,Huntersville,Mecklenburg County,NC,35.410828,-80.84293,0,Coffee Shop,Pub,Italian Restaurant,Pizza Place,Print Shop,Arts & Crafts Store,Farmers Market,Museum
5,28105,Matthews,Mecklenburg County,NC,35.115953,-80.722439,0,American Restaurant,Deli / Bodega,Ice Cream Shop,Japanese Restaurant,Pizza Place,Beer Garden,BBQ Joint,Bar
7,28126,Newell,Mecklenburg County,NC,35.279169,-80.736576,2,Rental Service,Convenience Store,Farm,ATM,Optical Shop,Mobile Phone Shop,Monument / Landmark,Movie Theater
8,28130,Paw Creek,Mecklenburg County,NC,35.266394,-80.914012,6,Bookstore,Flower Shop,Rental Service,Paper / Office Supplies Store,ATM,Nightlife Spot,Mobile Phone Shop,Monument / Landmark
9,28134,Pineville,Mecklenburg County,NC,35.085541,-80.887125,0,Hotel,Toy / Game Store,Jewelry Store,Bakery,Paper / Office Supplies Store,Diner,Discount Store,Coffee Shop
10,28201,Charlotte,Mecklenburg County,NC,35.241229,-80.822664,0,Pizza Place,Hotel,Coffee Shop,Restaurant,American Restaurant,Park,Gym,Sandwich Place
11,28202,Charlotte,Mecklenburg County,NC,35.227209,-80.843083,0,Pizza Place,Hotel,Coffee Shop,Restaurant,American Restaurant,Park,Gym,Sandwich Place
14,28205,Charlotte,Mecklenburg County,NC,35.220475,-80.788593,0,Pizza Place,Hotel,Coffee Shop,Restaurant,American Restaurant,Park,Gym,Sandwich Place


In [26]:
#split into the individual counties in order to create maps
nsh = all_merge[all_merge.County == 'Davidson County']
clt = all_merge[all_merge.County == 'Mecklenburg County']
ch = all_merge[all_merge.County == 'Orange County']

## Results and Discussion <a id = 'results'></a>

The maps below show the different clusters visually.

In [27]:
# create map
address = 'Nashville, TN'
geolocator = Nominatim(user_agent = 'my_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(all_merge['lat'], all_merge['lon'], all_merge['City'], all_merge['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [28]:
# create map
address = 'Charlotte, NC'
geolocator = Nominatim(user_agent = 'my_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(all_merge['lat'], all_merge['lon'], all_merge['City'], all_merge['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [29]:
# create map
address = 'Chapel Hill, NC'
geolocator = Nominatim(user_agent = 'my_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(all_merge['lat'], all_merge['lon'], all_merge['City'], all_merge['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [30]:
all_merge.loc[all_merge['Cluster Labels'] == 0, all_merge.columns[[1] + list(range(5, all_merge.shape[1]))]]

Unnamed: 0,City,lon,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,Cornelius,-80.859001,0,Supermarket,Cosmetics Shop,Furniture / Home Store,Sandwich Place,Cupcake Shop,New American Restaurant,Bar,Restaurant
1,Davidson,-80.848522,0,American Restaurant,Ice Cream Shop,Café,Bank,Mexican Restaurant,Indie Movie Theater,Pizza Place,Steakhouse
3,Huntersville,-80.84293,0,Coffee Shop,Pub,Italian Restaurant,Pizza Place,Print Shop,Arts & Crafts Store,Farmers Market,Museum
5,Matthews,-80.722439,0,American Restaurant,Deli / Bodega,Ice Cream Shop,Japanese Restaurant,Pizza Place,Beer Garden,BBQ Joint,Bar
9,Pineville,-80.887125,0,Hotel,Toy / Game Store,Jewelry Store,Bakery,Paper / Office Supplies Store,Diner,Discount Store,Coffee Shop
10,Charlotte,-80.822664,0,Pizza Place,Hotel,Coffee Shop,Restaurant,American Restaurant,Park,Gym,Sandwich Place
11,Charlotte,-80.843083,0,Pizza Place,Hotel,Coffee Shop,Restaurant,American Restaurant,Park,Gym,Sandwich Place
14,Charlotte,-80.788593,0,Pizza Place,Hotel,Coffee Shop,Restaurant,American Restaurant,Park,Gym,Sandwich Place
20,Charlotte,-80.787483,0,Pizza Place,Hotel,Coffee Shop,Restaurant,American Restaurant,Park,Gym,Sandwich Place
21,Charlotte,-80.744604,0,Pizza Place,Hotel,Coffee Shop,Restaurant,American Restaurant,Park,Gym,Sandwich Place


In [31]:
all_merge.loc[all_merge['Cluster Labels'] == 1, all_merge.columns[[1] + list(range(5, all_merge.shape[1]))]]

Unnamed: 0,City,lon,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
76,Cedar Grove,-79.167793,1,Auto Workshop,ATM,Other Nightlife,Middle Eastern Restaurant,Mobile Phone Shop,Monument / Landmark,Movie Theater,Moving Target


In [32]:
all_merge.loc[all_merge['Cluster Labels'] == 2, all_merge.columns[[1] + list(range(5, all_merge.shape[1]))]]

Unnamed: 0,City,lon,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
7,Newell,-80.736576,2,Rental Service,Convenience Store,Farm,ATM,Optical Shop,Mobile Phone Shop,Monument / Landmark,Movie Theater


In [33]:
all_merge.loc[all_merge['Cluster Labels'] == 3, all_merge.columns[[1] + list(range(5, all_merge.shape[1]))]]

Unnamed: 0,City,lon,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
77,Efland,-79.169181,3,Burger Joint,Grocery Store,Baseball Field,ATM,Optical Shop,Mobile Phone Shop,Monument / Landmark,Movie Theater


In [34]:
all_merge.loc[all_merge['Cluster Labels'] == 4, all_merge.columns[[1] + list(range(5, all_merge.shape[1]))]]

Unnamed: 0,City,lon,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
89,Joelton,-86.865277,4,Ice Cream Shop,Discount Store,Park,Gas Station,ATM,Optical Shop,Mobile Phone Shop,Monument / Landmark


In [35]:
all_merge.loc[all_merge['Cluster Labels'] == 5, all_merge.columns[[1] + list(range(5, all_merge.shape[1]))]]

Unnamed: 0,City,lon,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
88,Hermitage,-86.605823,5,Bar,Buffet,Pet Store,Automotive Shop,Optical Shop,ATM,Mobile Phone Shop,Monument / Landmark


In [36]:
all_merge.loc[all_merge['Cluster Labels'] == 6, all_merge.columns[[1] + list(range(5, all_merge.shape[1]))]]

Unnamed: 0,City,lon,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
8,Paw Creek,-80.914012,6,Bookstore,Flower Shop,Rental Service,Paper / Office Supplies Store,ATM,Nightlife Spot,Mobile Phone Shop,Monument / Landmark


Our findings show one large group of cities, along with 6 individual clusters of 1 city apiece. In the largest group, the majority of the common venues relate to food and tourism. These types of venues dominate the more "functional" venues like healthcare, gyms, ATMs, automotive shops, etc. Conversely, these functional venues make up most of the distinguishing characteristics of the small clusters. Most of the small clusters are rural towns outside the major city centers.

Nashville, Charlotte, and Chapel Hill are all in the large cluster, indicating a similarity in the types of venues in each city. I do not find this surprising, since each is a major city center with a vibrant nightlife built around the food venues in the area. They are all very out-and-about 'foodie' areas.
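The split between one large cluster and several single-city clusters can be checked by counting how many cities carry each label. The sketch below does this with `collections.Counter` on a made-up list of labels (the values are illustrative, not the notebook's actual assignments).

```python
from collections import Counter

# Toy cluster labels, not the notebook's actual k-means output
toy_labels = [0, 0, 0, 0, 2, 6, 0, 1, 3]

sizes = Counter(toy_labels)

# Clusters that contain exactly one city
singletons = sorted(label for label, n in sizes.items() if n == 1)

# The most populated cluster and its size
largest_label, largest_size = sizes.most_common(1)[0]
print(largest_label, largest_size, singletons)
```

In the notebook itself, `all_merge['Cluster Labels'].value_counts()` would give the same kind of summary directly.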

## Conclusion and Future Direction <a id = 'done'></a>

Our analysis found that Nashville, Charlotte, and Chapel Hill are similar based on the number of dining options available in the area. However, I believe there are steps that would improve this analysis. First, getting down to the neighborhood level in Charlotte and Nashville is crucial. Each neighborhood has its own flair and feeling, along with different venues. This would also allow us to remove the rural cities and suburbs, since they are drastically different in terms of lifestyle and venue availability. With that change, we could compare just the major cities to see which neighborhoods are most similar.

Adding data such as crime rate, walkability, weather patterns, and housing cost would also strengthen this analysis. This is key data for determining city similarity and would be crucial for anyone using the analysis to decide on a move. It would be important to get this information on a neighborhood basis as well, so that it mirrors the improvements mentioned above. Traffic and distance to downtown could be included, but I do not expect large improvements from them, given the similar population sizes of the major cities and the traffic problems that come with them.