# Capstone Project Overview

This is the final capstone project of the IBM-Coursera Data Science Professional Certificate program. It aims to integrate the technical skills and methods taught throughout the courses.

## Background Information

Atlanta is a vibrant city with a colorful history. It is the place in the United States where I spent four of the most important years of my life. Not only does the city offer modern, dynamic culture, with arts venues and tourist spots like the [High Museum of Art](https://en.wikipedia.org/wiki/High_Museum_of_Art) and the [World of Coca-Cola](https://en.wikipedia.org/wiki/World_of_Coca-Cola), it also encompasses historical places such as the [Margaret Mitchell House](https://en.wikipedia.org/wiki/Margaret_Mitchell_House_and_Museum), where ***Gone with the Wind*** was born, and the [Swan House](https://en.wikipedia.org/wiki/Swan_House_(Atlanta)), which witnessed the civil rights movement taking place decades ago.

I am therefore interested in digging deeper into what Atlanta has to offer in terms of businesses and venues, both for its current residents and for people who are interested in visiting or settling down there.

Before I move on to the actual analyses, a few things should be clarified. First, Atlanta can be defined either as a single city or as a metropolitan area that includes many surrounding cities. Second, the reason I pick '**cities**' instead of '**neighborhoods**' is that, after a few years of living in the States, I feel that neighborhoods in the US cover a much smaller geographic area than cities. This means that if we used neighborhoods for our analyses, we might not observe meaningful results for small neighborhoods that only have a gas station and a grocery store. More importantly, even if there were some results, they might not be interesting or distinctive because of gentrification within the neighborhoods.

Therefore, for the purpose of this project, I use only certain cities around Atlanta (including Atlanta itself) for the unsupervised clustering, instead of the neighborhoods that appeared in the lab sessions. The assumption is that these cities differ significantly from each other in geographic or ethnic characteristics that are outside the scope of this project. To pin down the number of cities and their exact names, I use this [Wikipedia](https://en.wikipedia.org/wiki/Atlanta_metropolitan_area) page on the **Atlanta Metropolitan Area** as my reference:

> The counties listed below are included in the Atlanta–Sandy Springs–Alpharetta, GA Metropolitan Statistical Area ... **Fulton**, **DeKalb**, **Gwinnett**, **Cobb**, and **Clayton** were the five original counties when the Atlanta metropolitan area was first defined in 1950, and continue to be the core of the metro area.

## Dataset

With the targets of interest narrowed down, I start by obtaining all the cities in the five counties listed above. This can be done by scraping their associated Wikipedia pages with BeautifulSoup (one example can be found here: [Cities in Fulton County](https://en.wikipedia.org/wiki/Category:Cities_in_Fulton_County,_Georgia)). The final dataset has 3 columns: the names of the cities, their latitudes, and their longitudes. This lets us combine the dataset with the Foursquare API to retrieve the businesses within a certain distance of each city.

## Install and Import the Packages

In [2]:
import pandas as pd
import numpy as np

import matplotlib.colors as colors
import matplotlib.cm as cm
import folium

from sklearn.cluster import KMeans

from bs4 import BeautifulSoup
import requests

In [20]:
!pip install geopy



Here is what each package is used for:
* Pandas & NumPy: Clean, combine, and organize the data that serves as the foundation of our analyses
* Requests & BeautifulSoup: Handle HTTP requests and scrape useful content based on tags and classes
* Matplotlib: Add details (the color scheme) to the generated maps
* Folium: Generate maps
* Sklearn: Provide K-means, the machine learning algorithm
* GeoPy: Obtain latitude and longitude information

## Scrape the Wikipedia Page for Information

I use only the cities from these 5 counties. Since their associated Wikipedia pages share the same URL format and page layout, I keep the county names in a list and insert each one into the URL template. A for loop then scrapes the names of all the cities on each page and stores them in a single list.

In [172]:
county_list = ["DeKalb", "Gwinnett", "Cobb", "Clayton", "Fulton"]

all_cities = []

for county in county_list:
    url_info = "https://en.wikipedia.org/wiki/Category:Cities_in_{}_County,_Georgia".format(county)
    wiki_info = requests.get(url_info).text
    soup_test = BeautifulSoup(wiki_info, "lxml")
    places = soup_test.find_all("div", class_ = "mw-content-ltr")[2]
    
    for place in places.find_all("a"):
        all_cities.append(place.text)

all_cities

['Atlanta',
 'Avondale Estates, Georgia',
 'Brookhaven, Georgia',
 'Chamblee, Georgia',
 'Clarkston, Georgia',
 'Decatur, Georgia',
 'Doraville, Georgia',
 'Dunwoody, Georgia',
 'Lithonia, Georgia',
 'Pine Lake, Georgia',
 'Stone Mountain, Georgia',
 'Stonecrest, Georgia',
 'Tucker, Georgia',
 'Auburn, Georgia',
 'Berkeley Lake, Georgia',
 'Buford, Georgia',
 'Dacula, Georgia',
 'Duluth, Georgia',
 'Grayson, Georgia',
 'Lawrenceville, Georgia',
 'Lilburn, Georgia',
 'Loganville, Georgia',
 'Norcross, Georgia',
 'Peachtree Corners, Georgia',
 'Snellville, Georgia',
 'Sugar Hill, Georgia',
 'Suwanee, Georgia',
 'Acworth, Georgia',
 'Austell, Georgia',
 'Kennesaw, Georgia',
 'Marietta, Georgia',
 'Powder Springs, Georgia',
 'Smyrna, Georgia',
 'College Park, Georgia',
 'Forest Park, Georgia',
 'Jonesboro, Georgia',
 'Lake City, Georgia',
 'Lovejoy, Georgia',
 'Morrow, Georgia',
 'Riverdale, Georgia',
 'Alpharetta, Georgia',
 'Atlanta',
 'Chattahoochee Hills, Georgia',
 'College Park, Geor

It is a good idea to check for duplicates now. I did not realize this until I created and checked the venues returned near the very end of the project. Here is an easy checker, adapted from answers I found online, that you might find useful:

```python
def check_for_duplicates(your_list):
    if len(your_list) != len(set(your_list)):
        print("There are duplicates in your list.")
    else:
        print("No duplicates. Proceed on to the next step.")
```

After checking for duplicates, I remove one copy each of 'Atlanta' and 'College Park, Georgia', the two cities that appear twice.
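
As a possible alternative to removing the duplicates by hand, the list can be de-duplicated while keeping its order. This is only a sketch: `find_duplicates` and `drop_duplicates_keep_order` are hypothetical helper names of mine, and `sample_cities` is a shortened stand-in for `all_cities`.

```python
from collections import Counter


def find_duplicates(items):
    """Return the entries that appear more than once, in first-seen order."""
    counts = Counter(items)
    return [item for item in counts if counts[item] > 1]


def drop_duplicates_keep_order(items):
    """Remove repeated entries while keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# Shortened stand-in for the scraped list
sample_cities = ["Atlanta", "College Park, Georgia", "Alpharetta, Georgia",
                 "Atlanta", "College Park, Georgia"]

duplicates = find_duplicates(sample_cities)
unique_cities = drop_duplicates_keep_order(sample_cities)
print(duplicates)
print(unique_cities)
```

Note that this keeps the first occurrence of each city, whereas `list.remove` drops the first occurrence, so the order of the resulting list would differ slightly from the one used in the rest of this notebook.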

In [173]:
all_cities.remove("Atlanta")
all_cities.remove("College Park, Georgia")

print("We have {} cities in our analyses.".format(len(all_cities)))

We have 53 cities in our analyses.


## Retrieve the Geographic Information for each City

In [174]:
from geopy.geocoders import Nominatim

locator = Nominatim(user_agent = "myGeocoder")

lat = []
long = []

# Look up each city name and store its coordinates
for city in all_cities:
    location = locator.geocode(city)
    lat.append(location.latitude)
    long.append(location.longitude)

if len(lat) == len(long):
    print("No error. The lengths of the latitude and the longitude lists are both {}.".format(len(lat)))
else:
    print("Something is wrong.")

No error. The lengths of the latitude and the longitude lists are both 53.
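
The length check above cannot catch a failed lookup, because `geocode` returns `None` when Nominatim finds no match, and the loop would then stop with an `AttributeError`. A more defensive version might look like the sketch below. `geocode_cities` is a hypothetical helper of mine, and `fake_geocode` stands in for `locator.geocode` so the example runs without network access; its coordinates are the ones reported for Avondale Estates in this notebook.

```python
from types import SimpleNamespace


def geocode_cities(cities, geocode):
    """Return (rows, failed): rows of (city, lat, lng) plus names with no match."""
    rows, failed = [], []
    for city in cities:
        location = geocode(city)
        if location is None:
            # Remember the name instead of crashing on location.latitude
            failed.append(city)
            continue
        rows.append((city, location.latitude, location.longitude))
    return rows, failed


def fake_geocode(query):
    """Stand-in for locator.geocode (assumption for this demo); knows one city."""
    known = {"Avondale Estates, Georgia": (33.771494, -84.267144)}
    if query not in known:
        return None
    lat, lng = known[query]
    return SimpleNamespace(latitude=lat, longitude=lng)


rows, failed = geocode_cities(
    ["Avondale Estates, Georgia", "Nowhere, Georgia"], fake_geocode)
print(rows)
print(failed)
```

With the real geocoder, the call would be `geocode_cities(all_cities, locator.geocode)`. Since the public Nominatim service is rate limited, it may also help to wrap `locator.geocode` in `geopy.extra.rate_limiter.RateLimiter` to space out the requests.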


Then, I convert the list of all cities to a dataframe.

In [175]:
all_cities = pd.DataFrame({'City':all_cities})
print(type(all_cities))
print(all_cities)

<class 'pandas.core.frame.DataFrame'>
                                     City
0               Avondale Estates, Georgia
1                     Brookhaven, Georgia
2                       Chamblee, Georgia
3                      Clarkston, Georgia
4                        Decatur, Georgia
5                      Doraville, Georgia
6                       Dunwoody, Georgia
7                       Lithonia, Georgia
8                      Pine Lake, Georgia
9                 Stone Mountain, Georgia
10                    Stonecrest, Georgia
11                        Tucker, Georgia
12                        Auburn, Georgia
13                 Berkeley Lake, Georgia
14                        Buford, Georgia
15                        Dacula, Georgia
16                        Duluth, Georgia
17                       Grayson, Georgia
18                 Lawrenceville, Georgia
19                       Lilburn, Georgia
20                    Loganville, Georgia
21                      Norcross, Geor

In [176]:
all_cities["Latitude"] = lat
all_cities["Longitude"] = long
print(all_cities)

                                     City   Latitude  Longitude
0               Avondale Estates, Georgia  33.771494 -84.267144
1                     Brookhaven, Georgia  33.858437 -84.340203
2                       Chamblee, Georgia  33.892176 -84.298830
3                      Clarkston, Georgia  33.809549 -84.239643
4                        Decatur, Georgia  33.773758 -84.296069
5                      Doraville, Georgia  33.898158 -84.283256
6                       Dunwoody, Georgia  33.948365 -84.334963
7                       Lithonia, Georgia  33.712331 -84.105194
8                      Pine Lake, Georgia  33.793716 -84.206031
9                 Stone Mountain, Georgia  33.806217 -84.145751
10                    Stonecrest, Georgia  33.683130 -84.135851
11                        Tucker, Georgia  33.853270 -84.220073
12                        Auburn, Georgia  34.013662 -83.827350
13                 Berkeley Lake, Georgia  33.983712 -84.186585
14                        Buford, Georgi

In [67]:
address = "Atlanta, Georgia"

geolocator = Nominatim(user_agent ="atlanta_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Atlanta are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Atlanta are 33.7490987, -84.3901849.


In this step, I use the code from the lab to plot our cities as points on a map centered on Atlanta.

In [68]:
map_atlanta = folium.Map(location = [latitude, longitude], zoom_start = 9)

# Distinct loop names keep the `lat` and `long` lists defined earlier from being overwritten
for city, city_lat, city_lng in zip(all_cities["City"], all_cities["Latitude"], all_cities["Longitude"]):
    label = '{}; {}, {}'.format(city, round(city_lat, 1), round(city_lng, 1))
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [city_lat, city_lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7).add_to(map_atlanta)
    
map_atlanta

## Define Foursquare API for Communication

In [33]:
import os

# Credentials are read from environment variables instead of being hard-coded and printed;
# the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are my own choice
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value


Here is a function to repeat the process of retrieving venues for all the cities in our list.

In [35]:
def getNearbyVenues(names, latitudes, longitudes, radius = 2500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)

        results = requests.get(url).json()["response"]['groups'][0]['items']

        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
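
One fragile spot in the function above is `v['venue']['categories'][0]`, which raises an `IndexError` for a venue that has no category. The sketch below separates the parsing step so that it can be tried without calling the API. `extract_venue_rows` is a hypothetical helper of mine, and `sample_payload` is a hand-made dictionary that only mimics the nesting the function above indexes into (`response`, then `groups`, then `items`); it is not real Foursquare output.

```python
def extract_venue_rows(payload, city, city_lat, city_lng):
    """Flatten one explore-style response into row tuples for a DataFrame."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of failing when a venue has no category
        category = categories[0]["name"] if categories else None
        rows.append((city, city_lat, city_lng,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Hand-made payload (assumption): same nesting as the response parsed above
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Rising Son",
               "location": {"lat": 33.776417, "lng": -84.267515},
               "categories": [{"name": "Restaurant"}]}},
    {"venue": {"name": "Venue Without Category",
               "location": {"lat": 33.77, "lng": -84.26},
               "categories": []}},
]}]}}

rows = extract_venue_rows(
    sample_payload, "Avondale Estates, Georgia", 33.771494, -84.267144)
empty_rows = extract_venue_rows(
    {"response": {}}, "Avondale Estates, Georgia", 33.771494, -84.267144)
print(rows)
print(empty_rows)
```

Keeping the request and the parsing apart also makes it easier to skip a city whose response is empty instead of stopping the whole loop.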

In [69]:
atlanta_venues = getNearbyVenues(names = all_cities['City'], latitudes = all_cities['Latitude'],
                                   longitudes = all_cities['Longitude'])

Avondale Estates, Georgia
Brookhaven, Georgia
Chamblee, Georgia
Clarkston, Georgia
Decatur, Georgia
Doraville, Georgia
Dunwoody, Georgia
Lithonia, Georgia
Pine Lake, Georgia
Stone Mountain, Georgia
Stonecrest, Georgia
Tucker, Georgia
Auburn, Georgia
Berkeley Lake, Georgia
Buford, Georgia
Dacula, Georgia
Duluth, Georgia
Grayson, Georgia
Lawrenceville, Georgia
Lilburn, Georgia
Loganville, Georgia
Norcross, Georgia
Peachtree Corners, Georgia
Snellville, Georgia
Sugar Hill, Georgia
Suwanee, Georgia
Acworth, Georgia
Austell, Georgia
Kennesaw, Georgia
Marietta, Georgia
Powder Springs, Georgia
Smyrna, Georgia
Forest Park, Georgia
Jonesboro, Georgia
Lake City, Georgia
Lovejoy, Georgia
Morrow, Georgia
Riverdale, Georgia
Alpharetta, Georgia
Atlanta
Chattahoochee Hills, Georgia
College Park, Georgia
East Point, Georgia
Fairburn, Georgia
Hapeville, Georgia
Johns Creek, Georgia
Milton, Georgia
Mountain Park, Fulton County, Georgia
Palmetto, Georgia
Roswell, Georgia
Sandy Springs, Georgia
South Fult

In [70]:
print(atlanta_venues.shape)
atlanta_venues.head()

(3666, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Avondale Estates, Georgia | 33.771494 | -84.267144 | Rising Son | 33.776417 | -84.267515 | Restaurant |
| 1 | Avondale Estates, Georgia | 33.771494 | -84.267144 | Good Karma Coffee House | 33.776037 | -84.265889 | Gluten-free Restaurant |
| 2 | Avondale Estates, Georgia | 33.771494 | -84.267144 | Avondale Estates | 33.775358 | -84.2671 | City |
| 3 | Avondale Estates, Georgia | 33.771494 | -84.267144 | Avondale Lake & Bess Walking Park | 33.768739 | -84.266118 | Park |
| 4 | Avondale Estates, Georgia | 33.771494 | -84.267144 | My Parents' Basement | 33.775612 | -84.272334 | Beer Bar |


In [86]:
atlanta_venues.columns = ["City" if x == "Neighborhood" 
                          else "City Latitude" if x == "Neighborhood Latitude" 
                          else "City Longitude" if x == "Neighborhood Longitude"
                          else x for x in atlanta_venues.columns]
atlanta_venues.head()

| | City | City Latitude | City Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Avondale Estates, Georgia | 33.771494 | -84.267144 | Rising Son | 33.776417 | -84.267515 | Restaurant |
| 1 | Avondale Estates, Georgia | 33.771494 | -84.267144 | Good Karma Coffee House | 33.776037 | -84.265889 | Gluten-free Restaurant |
| 2 | Avondale Estates, Georgia | 33.771494 | -84.267144 | Avondale Estates | 33.775358 | -84.2671 | City |
| 3 | Avondale Estates, Georgia | 33.771494 | -84.267144 | Avondale Lake & Bess Walking Park | 33.768739 | -84.266118 | Park |
| 4 | Avondale Estates, Georgia | 33.771494 | -84.267144 | My Parents' Basement | 33.775612 | -84.272334 | Beer Bar |


In [None]:
venues = atlanta_venues.copy()
print("There are {} unique categories of venue.".format(len(venues["Venue Category"].unique())))

With the steps above, we have obtained a dataframe in which each row represents a single venue near a given city. The next step is to turn the venue categories into columns so that we can measure how often each category occurs in each city.

## Drill Down into the Cities

In [91]:
# one hot encoding
atlanta_onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")

# add city column back to dataframe
atlanta_onehot["City"] = venues["City"]

# move 'City' column to the first
atlanta_onehot = atlanta_onehot[["City"] + [col for col in atlanta_onehot.columns if col != "City"]]

atlanta_onehot.head(5)

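| | City | Accessories Store | Adult Boutique | African Restaurant | Airport | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | Amphitheater | ... | Video Store | Vietnamese Restaurant | Warehouse Store | Whisky Bar | Wine Bar | Wine Shop | Winery | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Avondale Estates, Georgia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Avondale Estates, Georgia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Avondale Estates, Georgia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Avondale Estates, Georgia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Avondale Estates, Georgia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |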


In [97]:
atlanta_grouped = atlanta_onehot.groupby("City").mean().reset_index()
atlanta_grouped.head(5)

| | City | Accessories Store | Adult Boutique | African Restaurant | Airport | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | Amphitheater | ... | Video Store | Vietnamese Restaurant | Warehouse Store | Whisky Bar | Wine Bar | Wine Shop | Winery | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Acworth, Georgia | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.070175 | 0.0 | ... | 0.035088 | 0.0 | 0.0 | 0.0 | 0.0 | 0.017544 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Alpharetta, Georgia | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.05 | 0.0 | ... | 0.0 | 0.01 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.02 | 0.0 |
| 2 | Atlanta | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | ... | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 |
| 3 | Auburn, Georgia | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Austell, Georgia | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.04 | 0.0 | 0.0 |


By doing this, we calculate, for each city, the relative frequency of every venue category: the mean of a one-hot column within a city equals the share of that city's venues that fall into that category.
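
To see why the mean of the one-hot columns is a frequency, here is a minimal sketch on made-up data (the cities "A" and "B" and their venues are invented for illustration and do not come from the Foursquare results):

```python
import pandas as pd

# Toy data (assumption): city "A" has four venues, city "B" has one
toy = pd.DataFrame({
    "City": ["A", "A", "A", "A", "B"],
    "Venue Category": ["Park", "Park", "Cafe", "Bar", "Cafe"],
})

# One indicator column per category (1.0 where the venue belongs to it)
onehot = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot.insert(0, "City", toy["City"])

# The mean of each indicator column is the category's share within the city
grouped = onehot.groupby("City").mean()
print(grouped)
```

City "A" has two parks among four venues, so its `Park` value is 0.5, and every row sums to 1 because each venue contributes to exactly one category.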

In [99]:
atlanta_grouped.shape

(53, 314)

In [102]:
num_top_venues = 5

for hood in atlanta_grouped['City']:
    print("----" + hood + "----")
    temp = atlanta_grouped[atlanta_grouped['City'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Acworth, Georgia----
                  venue  freq
0                  Park  0.09
1              Pharmacy  0.07
2  Fast Food Restaurant  0.07
3   American Restaurant  0.07
4                 Hotel  0.07


----Alpharetta, Georgia----
                     venue  freq
0           Clothing Store  0.06
1      American Restaurant  0.05
2              Coffee Shop  0.04
3  New American Restaurant  0.04
4               Taco Place  0.03


----Atlanta----
                  venue  freq
0                  Park  0.04
1             BBQ Joint  0.04
2           Art Gallery  0.04
3  Caribbean Restaurant  0.03
4           Coffee Shop  0.03


----Auburn, Georgia----
                     venue  freq
0           Discount Store  0.19
1  New American Restaurant  0.06
2            Moving Target  0.06
3     Gym / Fitness Center  0.06
4       Chinese Restaurant  0.06


----Austell, Georgia----
                venue  freq
0  Mexican Restaurant  0.12
1         Pizza Place  0.12
2                Park  0.08
3     

                  venue  freq
0        Discount Store  0.04
1  Fast Food Restaurant  0.04
2           Pizza Place  0.04
3      Department Store  0.04
4        Cosmetics Shop  0.04


----South Fulton, Georgia----
               venue  freq
0                Gym   0.2
1      Garden Center   0.2
2         Campground   0.2
3  Convenience Store   0.2
4   Business Service   0.2


----Stone Mountain, Georgia----
      venue  freq
0     Trail  0.09
1  Mountain  0.07
2   Theater  0.06
3      Park  0.06
4      Café  0.04


----Stonecrest, Georgia----
             venue  freq
0            Trail  0.25
1           Lounge  0.06
2           Forest  0.06
3              Gym  0.06
4  Nature Preserve  0.06


----Sugar Hill, Georgia----
                  venue  freq
0  Fast Food Restaurant  0.06
1           Gas Station  0.04
2   American Restaurant  0.04
3                  Bank  0.04
4                   Bar  0.04


----Suwanee, Georgia----
                  venue  freq
0  Fast Food Restaurant  0.06
1      

In [100]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [113]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']

for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind + 1))

# create a new dataframe
cities_venues_sorted = pd.DataFrame(columns = columns)
cities_venues_sorted['City'] = atlanta_grouped['City']

for ind in np.arange(atlanta_grouped.shape[0]):
    cities_venues_sorted.iloc[ind, 1:] = return_most_common_venues(atlanta_grouped.iloc[ind, :], num_top_venues)

cities_venues_sorted.head()

| | City | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Acworth, Georgia | Park | Hotel | Fast Food Restaurant | Pharmacy | American Restaurant | Beach | Video Store | Fried Chicken Joint | Mexican Restaurant | Sandwich Place |
| 1 | Alpharetta, Georgia | Clothing Store | American Restaurant | Coffee Shop | New American Restaurant | Fast Food Restaurant | Breakfast Spot | Taco Place | Café | Furniture / Home Store | Women's Store |
| 2 | Atlanta | Art Gallery | BBQ Joint | Park | Coffee Shop | Mexican Restaurant | Caribbean Restaurant | Burger Joint | Fast Food Restaurant | Historic Site | Hotel |
| 3 | Auburn, Georgia | Discount Store | Moving Target | Chinese Restaurant | Gas Station | Breakfast Spot | Grocery Store | Sandwich Place | Gym / Fitness Center | Fast Food Restaurant | Pharmacy |
| 4 | Austell, Georgia | Pizza Place | Mexican Restaurant | Train Station | Bar | Park | Hotel | Moving Target | Gas Station | Fast Food Restaurant | Campground |


## Apply K-means Algorithm to Cluster the Cities

In [114]:
# set number of clusters
kclusters = 5

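# drop the label column by keyword; the positional axis argument is not accepted by recent pandas
atlanta_grouped_clustering = atlanta_grouped.drop(columns = "City")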

# run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(atlanta_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([3, 0, 0, 3, 3, 3, 0, 0, 0, 0])

In [115]:
cities_venues_sorted.insert(0, "Cluster Label", kmeans.labels_)



In [117]:
all_cities_merged = all_cities.copy()

all_cities_merged = all_cities_merged.join(cities_venues_sorted.set_index("City"), on = "City")

all_cities_merged.head()

| | City | Latitude | Longitude | Cluster Label | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Avondale Estates, Georgia | 33.771494 | -84.267144 | 3 | Pizza Place | Gas Station | Sandwich Place | Thrift / Vintage Store | Bakery | Wings Joint | Breakfast Spot | Pub | Cosmetics Shop | Fried Chicken Joint |
| 1 | Brookhaven, Georgia | 33.858437 | -84.340203 | 0 | American Restaurant | Mexican Restaurant | Department Store | Boutique | Shopping Mall | Cosmetics Shop | Pizza Place | Italian Restaurant | Movie Theater | Women's Store |
| 2 | Chamblee, Georgia | 33.892176 | -84.29883 | 0 | Chinese Restaurant | Mexican Restaurant | Ice Cream Shop | Asian Restaurant | Furniture / Home Store | Gym | Vietnamese Restaurant | Sandwich Place | Department Store | Convenience Store |
| 3 | Clarkston, Georgia | 33.809549 | -84.239643 | 3 | Discount Store | Hotel | Intersection | Breakfast Spot | Park | Gas Station | Adult Boutique | Storage Facility | Pizza Place | Sandwich Place |
| 4 | Decatur, Georgia | 33.773758 | -84.296069 | 0 | Pizza Place | Pub | American Restaurant | Coffee Shop | Brewery | Gastropub | Breakfast Spot | Grocery Store | Mediterranean Restaurant | Park |


In [167]:
# create map
map_clusters = folium.Map(location = [latitude, longitude],
                          width = 650,
                          height = 550,
                          tiles = "Stamen Toner",
                          position = "absolute",
                          zoom_start = 9)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(all_cities_merged['Latitude'], 
                                  all_cities_merged['Longitude'], 
                                  all_cities_merged['City'], 
                                  all_cities_merged['Cluster Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = rainbow[cluster - 1],
        fill = True,
        fill_color = rainbow[cluster-1],
        fill_opacity = 0.8).add_to(map_clusters)
       
map_clusters

## Results and Discussion

With the number of clusters set to 5, the clusters shown in green and in red contain the most cities, while each of the remaining clusters contains only three cities in total. The green dots also seem to be scattered farther from the center than the red dots.
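
The cluster sizes can also be read off the labels directly rather than from the map. As a small sketch, the snippet below counts only the first ten labels printed earlier by `kmeans.labels_[0:10]`, so the numbers are not the full result; with the notebook's objects, `kmeans.labels_` or `all_cities_merged['Cluster Label']` could be counted in the same way.

```python
from collections import Counter

# First ten cluster labels, copied from the output of kmeans.labels_[0:10]
first_ten_labels = [3, 0, 0, 3, 3, 3, 0, 0, 0, 0]

# Count how many cities fall into each cluster label
cluster_sizes = Counter(first_ten_labels)
for cluster, size in sorted(cluster_sizes.items()):
    print("Cluster {}: {} cities".format(cluster, size))
```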

One variable worth exploring further is the search radius that is set before calling the Foursquare API. The default is 500 meters, but I set it to 2500 meters, which is roughly 1.55 miles. This is because American cities are typically larger than ordinary neighborhoods. However, cities are not all the same size, so a fixed radius might include venues located in neighboring cities or exclude venues that belong to the same city. Google Maps, for example, outlines a place with a red boundary when users search for it, so if boundaries could be drawn in a similar way (for instance from street names) instead of from a latitude/longitude pair and a radius, the results might turn out to be more interesting and more accurate.
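
To make the radius discussion concrete, the sketch below converts 2500 meters to miles and computes the great-circle distance between a city's coordinates and one of its returned venues with the haversine formula. The function name `haversine_m` and the Earth radius of 6,371,000 m (a commonly used mean value) are my assumptions; the coordinates are those of Avondale Estates and the venue Rising Son from the venue table earlier in this notebook.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (assumed value)
METERS_PER_MILE = 1609.344


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


radius_m = 2500
radius_miles = radius_m / METERS_PER_MILE

# Avondale Estates (city coordinates) and Rising Son (venue coordinates)
distance_m = haversine_m(33.771494, -84.267144, 33.776417, -84.267515)
print(round(radius_miles, 2), round(distance_m), distance_m <= radius_m)
```

A check like this could be used to compare how many venues fall within different radii of a city's coordinates before settling on a single value for all cities.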

## Conclusion

This analysis clustered the cities in the 5 core counties of the Atlanta Metropolitan Area. The cities were not evenly distributed across the clusters, as illustrated by the final visualization of the city clusters.