# Segmenting and Clustering Neighborhoods in Atlanta

## 1. Introduction/Business Problem 

An entrepreneur who owns a fast food restaurant wants to open the second branch of his restaurant in Atlanta. Since he wants to increase his profit, he must open his restaurant in a crowded population and in a neighborhood with low competition in this sector.To find a solution, he applies to a consulting firm who can help with this. I work as a data analyst in the information technology department of this company. In this project, I will try to analyze data that I have and find the most effective solution by using the machine learning clustering algorithm which is 'k-means' to solve the problem of our customer. First, in data processing part of this project, I will determine the top 10 most crowded neighborhoods of Atlanta by cleaning my data. Next, I will visualize my data using Folium library and analyze it.

## 2. Data Section

In this section, I will build the code to scrape the Wikipedia page which is https://en.wikipedia.org/wiki/Table_of_Atlanta_neighborhoods_by_population

In [1]:
pip install wikipedia

Note: you may need to restart the kernel to use updated packages.


### Tranform the data into a pandas dataframe

In [2]:
import pandas as pd
import wikipedia as wp
 
#Get the html source
html = wp.page("Table of Atlanta neighborhoods by population").html().encode("UTF-8")
neighborhoods_Atlanta = pd.read_html(html)[0]
neighborhoods_Atlanta.to_csv('beautifulsoup_pandas.csv',header=0,index=False)
print (neighborhoods_Atlanta)

              Neighborhood  Population (2010) NPU
0               Adair Park               1331   V
1               Adams Park               1763   R
2               Adamsville               2403   H
3              Almond Park               1020   G
4              Ansley Park               2277   E
..                     ...                ...  ..
156       Westwood Terrace                733   I
157  Whittier Mill Village                617   D
158               Wildwood               1840   C
159    Wilson Mill Meadows               1096   H
160       Wisteria Gardens                512   H

[161 rows x 3 columns]


In [3]:
neighborhoods_Atlanta.head(10)

Unnamed: 0,Neighborhood,Population (2010),NPU
0,Adair Park,1331,V
1,Adams Park,1763,R
2,Adamsville,2403,H
3,Almond Park,1020,G
4,Ansley Park,2277,E
5,Ardmore,756,E
6,Argonne Forest,590,C
7,Arlington Estates,776,P
8,Ashview Heights,1292,T
9,Atlanta University Center,5703,T


I will remove the column which is 'NPU' using 'drop' function because I don't need this information. 

In [4]:
neighborhoods_Atlanta.drop(['NPU'], axis=1,inplace= True)

In [5]:
doubled = neighborhoods_Atlanta['Neighborhood'].unique().shape
if (neighborhoods_Atlanta.shape[0]==doubled[0]):
     print ('Neighborhood is OK, none of its values is doubled')
else:
     print ('some incongruences found, please check consistency')

Neighborhood is OK, none of its values is doubled


In [6]:
neighborhoods_Atlanta.shape

(161, 2)

I need to top 10 crowded neighborhoods of Atlanta.So I will sort my neighborhood data by population.

In [7]:
neighborhoods_Atlanta_firstten=neighborhoods_Atlanta.sort_values(by='Population (2010)', ascending=False)

In [8]:
neighborhoods_Atlanta_firstten.head(10)

Unnamed: 0,Neighborhood,Population (2010)
95,Midtown,16569
51,Downtown,13411
104,Old Fourth Ward,10505
101,North Buckhead,8270
119,Pine Hills,8033
98,Morningside/Lenox Park,8030
149,Virginia-Highland,7800
66,Grant Park,6771
64,Georgia Tech,6607
80,Kirkwood,5897


I manually created csv file for top 10 crowded neighborhoods with coordinates.

In [9]:
import pandas
df_coordinates= pandas.read_csv('neigborhoods_atlanta_coordinates.csv')
print(df_coordinates)

             Neighborhood  Population (2010)   Latitude  Longitude
0                 Midtown              16569  33.783020 -84.382332
1                Downtown              13411  33.921520 -84.381912
2         Old Fourth Ward              10505  33.766430 -84.370407
3          North Buckhead               8270  33.852700 -84.365400
4              Pine Hills               8033  33.838715 -84.350830
5  Morningside/Lenox Park               8030  33.796200 -84.359500
6       Virginia-Highland               7800  33.781700 -84.363500
7              Grant Park               6771  33.737200 -84.368200
8            Georgia Tech               6607  33.775600 -84.396300
9                Kirkwood               5897  33.753300 -84.326200


In [10]:
df_coordinates.head(10)

Unnamed: 0,Neighborhood,Population (2010),Latitude,Longitude
0,Midtown,16569,33.78302,-84.382332
1,Downtown,13411,33.92152,-84.381912
2,Old Fourth Ward,10505,33.76643,-84.370407
3,North Buckhead,8270,33.8527,-84.3654
4,Pine Hills,8033,33.838715,-84.35083
5,Morningside/Lenox Park,8030,33.7962,-84.3595
6,Virginia-Highland,7800,33.7817,-84.3635
7,Grant Park,6771,33.7372,-84.3682
8,Georgia Tech,6607,33.7756,-84.3963
9,Kirkwood,5897,33.7533,-84.3262


By processing the data, I gathered the data I needed in a table and made it ready for visualization and analysis. In the methodology section, I will make more detailed inferences using this data frame.

## 3. Methodology

### 3.1. Create a map of Atlanta with neighborhoods 

In [11]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium

print('Folium installed and imported!')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Folium installed and imported!


In [12]:
def generateMapAtlanta(default_location=[33.7490,-84.3880], default_zoom_start=12):
    map_Atlanta= folium.Map(location=default_location, zoom_start=default_zoom_start)
    return map_Atlanta

In [13]:
map_Atlanta= generateMapAtlanta()
map_Atlanta

Lets map first ten neighborhoods of Atlanta which are most crowded.

In [14]:
# add markers to map
for lat, lng,neighborhood in zip(df_coordinates['Latitude'], df_coordinates['Longitude'], df_coordinates['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Atlanta)  
    
map_Atlanta

Next, I am going to start utilizing the Foursquare API to explore the most crowded neighborhoods and segment them.

### 3.2. Define Foursquare Credentials and Version

In [15]:
CLIENT_ID = 'XZZVXLQS53PT4TENTT3S0WA2OWV1JPJDRFVN4JQ5VJVJODJ4' # your Foursquare ID
CLIENT_SECRET = 'FMEINADFLUYLE0BAY2ABUWGWGXHYZ2O1ZVVGM0ZBGT0AS4ZY' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: XZZVXLQS53PT4TENTT3S0WA2OWV1JPJDRFVN4JQ5VJVJODJ4' + CLIENT_ID)
print('CLIENT_SECRET:FMEINADFLUYLE0BAY2ABUWGWGXHYZ2O1ZVVGM0ZBGT0AS4ZY' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: XZZVXLQS53PT4TENTT3S0WA2OWV1JPJDRFVN4JQ5VJVJODJ4XZZVXLQS53PT4TENTT3S0WA2OWV1JPJDRFVN4JQ5VJVJODJ4
CLIENT_SECRET:FMEINADFLUYLE0BAY2ABUWGWGXHYZ2O1ZVVGM0ZBGT0AS4ZYFMEINADFLUYLE0BAY2ABUWGWGXHYZ2O1ZVVGM0ZBGT0AS4ZY


#### Let's explore the first neighborhood in our dataframe.

In [16]:
df_coordinates.loc[0, 'Neighborhood']

'Midtown'

In [17]:
neighborhood_latitude = df_coordinates.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_coordinates.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = df_coordinates.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Midtown are 33.78302, -84.38233199999999.


#### Now, let's get the top 100 venues that are in Midtown within a radius of 500 meters.

In [18]:
LIMIT = 100 
radius = 500 
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=XZZVXLQS53PT4TENTT3S0WA2OWV1JPJDRFVN4JQ5VJVJODJ4&client_secret=FMEINADFLUYLE0BAY2ABUWGWGXHYZ2O1ZVVGM0ZBGT0AS4ZY&v=20180605&ll=33.78302,-84.38233199999999&radius=500&limit=100'

In [19]:
import requests  # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe
results= requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f528c27b52da7221b2dd6dc'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': '$-$$$$', 'key': 'price'},
    {'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Midtown',
  'headerFullLocation': 'Midtown, Atlanta',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 72,
  'suggestedBounds': {'ne': {'lat': 33.7875200045, 'lng': -84.37692791374903},
   'sw': {'lat': 33.7785199955, 'lng': -84.38773608625095}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4c50aec5375c0f476b27b392',
       'name': 'Exhale',
       'location': {'address': '1065 Peachtree St NE',
        'crossStreet': 'at 11th St NE',
        'lat': 33.78329394640629,
        'lng': -84.38336849212646,
        'l

In [20]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a pandas dataframe.

In [21]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,name,categories,lat,lng
0,Exhale,Spa,33.783294,-84.383368
1,Loews Atlanta Hotel,Hotel,33.783366,-84.383188
2,Café Intermezzo,Café,33.783136,-84.38347
3,Street Food Thursdays (& Mondays),Food Truck,33.784558,-84.382534
4,Einstein's,New American Restaurant,33.784143,-84.382086


And how many venues were returned by Foursquare?

In [22]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

72 venues were returned by Foursquare.


### 3.3. Explore Neighborhoods in Atlanta

#### Let's create a function to repeat the same process to all the neighborhoods have high population in Atlanta

In [23]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

#### Now write the code to run the above function on each neighborhood and create a new dataframe called *Atlanta_venues*.

In [24]:
Atlanta_venues = getNearbyVenues(names=df_coordinates['Neighborhood'],
                                   latitudes=df_coordinates['Latitude'],
                                   longitudes=df_coordinates['Longitude']
                                  )

Midtown
Downtown
Old Fourth Ward
North Buckhead
Pine Hills
Morningside/Lenox Park
Virginia-Highland
Grant Park
Georgia Tech
Kirkwood


#### Let's check the size of the resulting dataframe

In [25]:
print(Atlanta_venues.shape)
Atlanta_venues.head()

(240, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Midtown,33.78302,-84.382332,Exhale,33.783294,-84.383368,Spa
1,Midtown,33.78302,-84.382332,Loews Atlanta Hotel,33.783366,-84.383188,Hotel
2,Midtown,33.78302,-84.382332,Café Intermezzo,33.783136,-84.38347,Café
3,Midtown,33.78302,-84.382332,Street Food Thursdays (& Mondays),33.784558,-84.382534,Food Truck
4,Midtown,33.78302,-84.382332,Einstein's,33.784143,-84.382086,New American Restaurant


Let's check how many venues were returned for each neighborhood

In [26]:
Atlanta_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Downtown,35,35,35,35,35,35
Georgia Tech,12,12,12,12,12,12
Grant Park,9,9,9,9,9,9
Kirkwood,15,15,15,15,15,15
Midtown,72,72,72,72,72,72
Morningside/Lenox Park,3,3,3,3,3,3
North Buckhead,56,56,56,56,56,56
Old Fourth Ward,2,2,2,2,2,2
Pine Hills,3,3,3,3,3,3
Virginia-Highland,33,33,33,33,33,33


#### Let's find out how many unique categories can be curated from all the returned venues

In [27]:
print('There are {} uniques categories.'.format(len(Atlanta_venues['Venue Category'].unique())))

There are 106 uniques categories.


### 3.4. Analyze Each Neighborhood

In [28]:
# one hot encoding
Atlanta_onehot = pd.get_dummies(Atlanta_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Atlanta_onehot['Neighborhood'] = Atlanta_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [Atlanta_onehot.columns[-1]] + list(Atlanta_onehot.columns[:-1])
Atlanta_onehot = Atlanta_onehot[fixed_columns]

Atlanta_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bank,Bar,Bed & Breakfast,Board Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Video Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,Midtown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Midtown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Midtown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Midtown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Midtown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [29]:
Atlanta_onehot.shape

(240, 107)

#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [30]:
Atlanta_grouped = Atlanta_onehot.groupby('Neighborhood').mean().reset_index()
Atlanta_grouped

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bank,Bar,Bed & Breakfast,Board Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Video Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,Downtown,0.028571,0.028571,0.0,0.0,0.057143,0.0,0.057143,0.0,0.0,...,0.0,0.0,0.0,0.057143,0.0,0.0,0.0,0.0,0.028571,0.0
1,Georgia Tech,0.0,0.0,0.083333,0.0,0.083333,0.083333,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Grant Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.111111,0.111111,0.0,0.0,0.0,0.222222
3,Kirkwood,0.0,0.0,0.0,0.066667,0.066667,0.0,0.066667,0.0,0.0,...,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0
4,Midtown,0.0,0.097222,0.0,0.0,0.027778,0.0,0.027778,0.013889,0.013889,...,0.0,0.0,0.0,0.0,0.013889,0.0,0.013889,0.0,0.013889,0.0
5,Morningside/Lenox Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,North Buckhead,0.035714,0.035714,0.017857,0.0,0.0,0.0,0.0,0.0,0.0,...,0.017857,0.035714,0.017857,0.0,0.0,0.0,0.0,0.089286,0.0,0.0
7,Old Fourth Ward,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Pine Hills,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Virginia-Highland,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,...,0.0,0.0,0.090909,0.0,0.0,0.030303,0.0,0.0,0.030303,0.0


#### Let's confirm the new size

In [31]:
Atlanta_grouped.shape

(10, 107)

#### Let's print each neighborhood along with the top 5 most common venues

In [32]:
num_top_venues = 5

for hood in Atlanta_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = Atlanta_grouped[Atlanta_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Downtown----
                           venue  freq
0                            Spa  0.09
1  Vegetarian / Vegan Restaurant  0.06
2      Middle Eastern Restaurant  0.06
3                         Bakery  0.06
4                            Bar  0.06


----Georgia Tech----
                  venue  freq
0  Fast Food Restaurant  0.17
1            Food Court  0.08
2        Sandwich Place  0.08
3    Chinese Restaurant  0.08
4           Music Venue  0.08


----Grant Park----
         venue  freq
0  Zoo Exhibit  0.22
1  Music Venue  0.11
2         Pool  0.11
3     Wine Bar  0.11
4  Video Store  0.11


----Kirkwood----
            venue  freq
0       Pet Store  0.20
1       Juice Bar  0.07
2  Breakfast Spot  0.07
3      Sports Bar  0.07
4     Pizza Place  0.07


----Midtown----
                 venue  freq
0  American Restaurant  0.10
1                Hotel  0.07
2                  Spa  0.04
3   Seafood Restaurant  0.04
4               Lounge  0.03


----Morningside/Lenox Park----
           

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [33]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [34]:
import numpy as np # library to handle data in a vectorized manner
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Atlanta_grouped['Neighborhood']

for ind in np.arange(Atlanta_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Atlanta_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown,Spa,Restaurant,Bakery,Vegetarian / Vegan Restaurant,Pizza Place,Bar,Middle Eastern Restaurant,Accessories Store,Park,Chinese Restaurant
1,Georgia Tech,Fast Food Restaurant,Sandwich Place,Chinese Restaurant,Food Court,College Theater,Coffee Shop,Music Venue,Restaurant,Bank,Athletics & Sports
2,Grant Park,Zoo Exhibit,Music Venue,Playground,Pharmacy,Park,Pool,Wine Bar,Video Store,Historic Site,Fast Food Restaurant
3,Kirkwood,Pet Store,Pizza Place,Bar,Coffee Shop,Mexican Restaurant,Breakfast Spot,Sandwich Place,Historic Site,Sports Bar,Juice Bar
4,Midtown,American Restaurant,Hotel,Seafood Restaurant,Spa,New American Restaurant,Italian Restaurant,Coffee Shop,Southern / Soul Food Restaurant,Gay Bar,Indian Restaurant


### 3.5. Cluster Neighborhoods

Run *k*-means to cluster the neighborhood into 3 clusters.

In [35]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

In [36]:
# set number of clusters
kclusters = 3

Atlanta_grouped_clustering = Atlanta_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Atlanta_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 2, 1, 0], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [37]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Atlanta_merged = df_coordinates

# merge Atlanta_grouped with toronto_data to add latitude/longitude for each neighborhood
Atlanta_merged = Atlanta_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

Atlanta_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Population (2010),Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Midtown,16569,33.78302,-84.382332,0,American Restaurant,Hotel,Seafood Restaurant,Spa,New American Restaurant,Italian Restaurant,Coffee Shop,Southern / Soul Food Restaurant,Gay Bar,Indian Restaurant
1,Downtown,13411,33.92152,-84.381912,0,Spa,Restaurant,Bakery,Vegetarian / Vegan Restaurant,Pizza Place,Bar,Middle Eastern Restaurant,Accessories Store,Park,Chinese Restaurant
2,Old Fourth Ward,10505,33.76643,-84.370407,2,Italian Restaurant,Playground,Furniture / Home Store,Dive Bar,Doctor's Office,Electronics Store,Exhibit,Farmers Market,Fast Food Restaurant,Food Court
3,North Buckhead,8270,33.8527,-84.3654,0,Women's Store,Steakhouse,Boutique,Hotel,Italian Restaurant,Coffee Shop,Furniture / Home Store,Kids Store,Accessories Store,Toy / Game Store
4,Pine Hills,8033,33.838715,-84.35083,1,Pool,Scenic Lookout,Furniture / Home Store,Dive Bar,Doctor's Office,Electronics Store,Exhibit,Farmers Market,Fast Food Restaurant,Food Court


Finally, let's visualize the resulting clusters

In [38]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[33.7490,-84.3880], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Atlanta_merged['Latitude'], Atlanta_merged['Longitude'], Atlanta_merged['Neighborhood'], Atlanta_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 4. Results

### Examine Clusters

Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can then assign a name to each cluster.

#### Cluster 1

In [39]:
Atlanta_merged.loc[Atlanta_merged['Cluster Labels'] == 0, Atlanta_merged.columns[[1] + list(range(5, Atlanta_merged.shape[1]))]]

Unnamed: 0,Population (2010),1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,16569,American Restaurant,Hotel,Seafood Restaurant,Spa,New American Restaurant,Italian Restaurant,Coffee Shop,Southern / Soul Food Restaurant,Gay Bar,Indian Restaurant
1,13411,Spa,Restaurant,Bakery,Vegetarian / Vegan Restaurant,Pizza Place,Bar,Middle Eastern Restaurant,Accessories Store,Park,Chinese Restaurant
3,8270,Women's Store,Steakhouse,Boutique,Hotel,Italian Restaurant,Coffee Shop,Furniture / Home Store,Kids Store,Accessories Store,Toy / Game Store
5,8030,Playground,Trail,Park,Zoo Exhibit,Furniture / Home Store,Doctor's Office,Electronics Store,Exhibit,Farmers Market,Fast Food Restaurant
6,7800,Trail,Park,Plaza,Movie Theater,Sandwich Place,Salon / Barbershop,Grocery Store,Cosmetics Shop,Pizza Place,Pet Store
7,6771,Zoo Exhibit,Music Venue,Playground,Pharmacy,Park,Pool,Wine Bar,Video Store,Historic Site,Fast Food Restaurant
8,6607,Fast Food Restaurant,Sandwich Place,Chinese Restaurant,Food Court,College Theater,Coffee Shop,Music Venue,Restaurant,Bank,Athletics & Sports
9,5897,Pet Store,Pizza Place,Bar,Coffee Shop,Mexican Restaurant,Breakfast Spot,Sandwich Place,Historic Site,Sports Bar,Juice Bar


#### Cluster 2

In [40]:
Atlanta_merged.loc[Atlanta_merged['Cluster Labels'] == 1, Atlanta_merged.columns[[1] + list(range(5, Atlanta_merged.shape[1]))]]

Unnamed: 0,Population (2010),1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,8033,Pool,Scenic Lookout,Furniture / Home Store,Dive Bar,Doctor's Office,Electronics Store,Exhibit,Farmers Market,Fast Food Restaurant,Food Court


#### Cluster 3

In [41]:
Atlanta_merged.loc[Atlanta_merged['Cluster Labels'] == 2, Atlanta_merged.columns[[1] + list(range(5, Atlanta_merged.shape[1]))]]

Unnamed: 0,Population (2010),1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,10505,Italian Restaurant,Playground,Furniture / Home Store,Dive Bar,Doctor's Office,Electronics Store,Exhibit,Farmers Market,Fast Food Restaurant,Food Court


## 5. Discusion

Atlanta is a city developed in terms of art, culture and food and beverage tourism. Opening a restaurant in such a city can be advantageous as well as risky. Since there are many people from different cultures living in this city, opening a place that is open to innovations may attract the attention of customers. If I have to interpret the concrete data I worked on in my project, fast food restaurants in the downtown of Atlanta are less than other restaurants. For this reason, opening a new fast-food restaurant in the neighborhoods close to downtown may be logical in terms of investment, but it should be kept in mind that the rent of the venue here is high. 

## 6. Conclusion

In this project, I analyzed the relationship between the population and restaurant selection in the most crowded neighborhoods of Atlanta. I created a data frame that contains population, latitude and longitude information for each neighborhood by cleaning data. I built the clustering method using the k mean algorithm to predict the most common venue in Atlanta. I did visualization using the folium library to support my predictions. Although opening a new place or restaurant in a big city such as Atlanta depends on many sociological and economic factors, I believe that I have reached the most accurate result with the available data. By using the analysis methods, I have used, this project can be studied in more detail and extensively if more features are achieved regarding these neighborhoods.