# Cluster Analysis Chicago
###  Code
**_Opening a coffee shop in Chicago, Illinois USA_**
- Build a dataframe of neighborhoods in Chicago, Illinois by web scraping the data from Wikipedia page
- Get the geographical coordinates of the neighborhoods
- Obtain the venue data for the neighborhoods from Foursquare API
- Explore and cluster the neighborhoods
- Select the best cluster to open a new coffee shop
***
### 1. Import  necessary libraries

In [None]:
!pip install geocoder

In [1]:
import numpy as np 
import pandas as pd 
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json 

from geopy.geocoders import Nominatim 
import geocoder 

import requests 
from bs4 import BeautifulSoup 

from pandas.io.json import json_normalize 


import matplotlib.cm as cm
import matplotlib.colors as colors


from sklearn.cluster import KMeans

import folium 

print("Libraries imported.")

Libraries imported.


### 2. Scrap data from rentcafe table into Pandas Dataframe. See average rent per neighborhood

In [2]:
# send the GET request
url = 'https://www.rentcafe.com/average-rent-market-trends/us/il/chicago/'

In [3]:
 r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

In [4]:
rent_table = soup.find('table', class_ = 'market-trends market-trends-nhood')

In [5]:
for neighborhood in rent_table.find_all('tbody'):
    neighborhoods = []
    money = []
    rows = neighborhood.find_all('tr')
    for row in rows:
        pl_neighborhood = row.find('th').text.strip()
        pl_money = getattr(row.find('td'), 'text', None)
        neighborhoods.append(pl_neighborhood)
        money.append(pl_money)
        

In [6]:
df = pd.DataFrame(neighborhoods,columns=['neighborhood'])
df=df.drop(0)
df2 = pd.DataFrame(money, columns = ['rent'])
df2=df2.drop(0)


In [7]:
ci_df = pd.merge(df, df2, right_index = True, left_index=True)
ci_df2 = pd.merge(df, df2, right_index = True, left_index=True)
ci_df2 = ci_df2.rename(columns={'neighborhood':'Neighborhood'})
#we will use ci_df2 dataframe for later
print(ci_df)

                    neighborhood    rent
1                     The Island    $562
2                         Austin    $562
3                   West Pullman    $612
4                       Rosemoor    $612
5                       Roseland    $612
6                      Riverdale    $612
7                        Pullman    $612
8                 Princeton Park    $612
9                 Longwood Manor    $612
10                      Fernwood    $612
11         Cottage Grove Heights    $612
12             West Chesterfield    $783
13                      Marynook    $783
14                  East Chatham    $783
15                       Chatham    $783
16               Calumet Heights    $783
17                      Burnside    $783
18                   Avalon Park    $783
19                     Englewood    $842
20                        Pilsen    $887
21                Heart of Italy    $887
22              Heart of Chicago    $887
23            West Garfield Park    $911
24              

## Now that we have an idea for the expense of each neighborhood, let's get the coordinates

In [8]:
ci_df = ci_df.drop(['rent'], axis=1)

### 3. Get the geographical coordinates

In [9]:
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Chicago, Illinois'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords

In [10]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in ci_df["neighborhood"].tolist() ]

In [11]:
coords

[[41.656612688885254, -87.72627910278086],
 [41.88775000000004, -87.76362999999998],
 [41.67951000000005, -87.64188999999999],
 [41.9735150397632, -87.86545992729509],
 [41.70211000000006, -87.62573999999995],
 [41.65384559658816, -87.60965495845035],
 [41.69282000000004, -87.60575999999998],
 [41.74568000000005, -87.63170999999994],
 [41.69917448038685, -87.67121680497412],
 [41.701850000000036, -87.63903999999997],
 [41.70014547055174, -87.60818232903685],
 [39.25452000000007, -90.06416999999999],
 [41.84619003294324, -87.65335993266356],
 [41.74108000000007, -87.61302999999998],
 [41.74108000000007, -87.61302999999998],
 [41.73336000000006, -87.57741999999996],
 [41.72944000000007, -87.59767999999997],
 [41.745070000000055, -87.58815999999996],
 [41.77978000000007, -87.64512999999994],
 [41.857020000000034, -87.65758999999997],
 [41.884250000000065, -87.63244999999995],
 [41.889250012619236, -87.67538269270521],
 [41.87702000000007, -87.73073999999997],
 [41.66425509193626, -87.7046

In [12]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [13]:
# merge the coordinates into the original dataframe
ci_df['Latitude'] = df_coords['Latitude']
ci_df['Longitude'] = df_coords['Longitude']

In [14]:
# check the neighborhoods and the coordinates
print(ci_df.shape)
ci_df

(174, 3)


Unnamed: 0,neighborhood,Latitude,Longitude
1,The Island,41.88775,-87.76363
2,Austin,41.67951,-87.64189
3,West Pullman,41.973515,-87.86546
4,Rosemoor,41.70211,-87.62574
5,Roseland,41.653846,-87.609655
6,Riverdale,41.69282,-87.60576
7,Pullman,41.74568,-87.63171
8,Princeton Park,41.699174,-87.671217
9,Longwood Manor,41.70185,-87.63904
10,Fernwood,41.700145,-87.608182


In [15]:
ci_df=ci_df.drop(174) #NA
 #NA

In [16]:
# save the DataFrame as CSV file
ci_df.to_csv("ci_df.csv", index=False)

### 4. Create a map of Chicago, Illinois with neighborhoods superimposed on top

In [17]:
# get the coordinates of the city
address = 'Chicago, Illinois'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Chicago Illinois, USA {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Chicago Illinois, USA 41.8755616, -87.6244212.


In [18]:
# create map of chi-town using latitude and longitude values
map_ci = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(ci_df['Latitude'], ci_df['Longitude'], ci_df['neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.5).add_to(map_ci)  
    
map_ci

In [19]:
# save the map as HTML file
map_ci.save('map_ci.html')

### 5. Use the Foursquare API to explore the neighborhoods

In [20]:
CLIENT_ID = '11HZ5QGHAEFP0FF2WOUTTLBG42CCTWUUPIHSAEUODDNWH0PJ' # your Foursquare ID
CLIENT_SECRET = 'D0QN0BCYUL2MUXCCVSBWGJUJAOOI54DRCMA3HPXYILLNP4ED' # your Foursquare Secret
VERSION = '20200709' # Foursquare API version

print('d54347n@pace.edu:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

d54347n@pace.edu:
CLIENT_ID: 11HZ5QGHAEFP0FF2WOUTTLBG42CCTWUUPIHSAEUODDNWH0PJ
CLIENT_SECRET:D0QN0BCYUL2MUXCCVSBWGJUJAOOI54DRCMA3HPXYILLNP4ED


**Now, let's get the top 50 venues that are within a radius of 2000 meters.**

In [21]:
radius = 2000
LIMIT = 50

venues = []

for lat, long, neighborhood in zip(ci_df['Latitude'], ci_df['Longitude'], ci_df['neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))

In [22]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(8482, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,The Island,41.88775,-87.76363,Oak Park Gymnastics Center,41.88785,-87.775836,Gymnastics Gym
1,The Island,41.88775,-87.76363,Uncle Remus Saucy Fried Chicken,41.880186,-87.765239,Fried Chicken Joint
2,The Island,41.88775,-87.76363,Pete's Fresh Marketplace,41.887901,-87.782082,Grocery Store
3,The Island,41.88775,-87.76363,Jerk Taco Stand,41.891579,-87.745923,African Restaurant
4,The Island,41.88775,-87.76363,MacArthur's Restaurant,41.880611,-87.760757,Southern / Soul Food Restaurant


**Let's check how many venues were returned for each neighorhood**

In [23]:
venues_df.groupby(["Neighborhood"]).count()

Unnamed: 0_level_0,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Albany Park,50,50,50,50,50,50
Andersonville,50,50,50,50,50,50
Arcadia Terrace,50,50,50,50,50,50
Armour Square,50,50,50,50,50,50
Austin,50,50,50,50,50,50
Avalon Park,50,50,50,50,50,50
Avondale,50,50,50,50,50,50
Back of the Yards,50,50,50,50,50,50
Belmont Central,50,50,50,50,50,50
Belmont Heights,50,50,50,50,50,50


**Let's find out how many unique categories can be curated from all the returned venues**

In [24]:
print('There are {} uniques categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 337 uniques categories.


In [25]:
# print out the list of categories
venues_df['VenueCategory'].unique()[:50]

array(['Gymnastics Gym', 'Fried Chicken Joint', 'Grocery Store',
       'African Restaurant', 'Southern / Soul Food Restaurant',
       'Farmers Market', 'Liquor Store', 'Seafood Restaurant',
       'Donut Shop', 'Yoga Studio', 'Fast Food Restaurant', 'Park',
       'BBQ Joint', 'Gym', 'Pharmacy', 'ATM', 'Coffee Shop',
       'Cosmetics Shop', 'Discount Store', 'Convenience Store',
       'Toy / Game Store', 'Pizza Place', 'Ice Cream Shop',
       'Rental Car Location', 'Sandwich Place', 'Hobby Shop',
       'Mexican Restaurant', 'Bakery', 'American Restaurant',
       'Construction & Landscaping', 'Golf Course', 'Asian Restaurant',
       'Steakhouse', 'Chinese Restaurant', 'General Entertainment',
       'Gym / Fitness Center', 'Bank', 'Video Game Store',
       'Shopping Mall', 'Supermarket', 'Bar', 'Lounge', 'Burrito Place',
       'Nail Salon', 'Mobile Phone Shop', 'Kids Store', 'Wings Joint',
       'Breakfast Spot', 'Optical Shop', 'Department Store'], dtype=object)

In [26]:
venues_df['VenueCategory'].describe()

count            8482
unique            337
top       Coffee Shop
freq              345
Name: VenueCategory, dtype: object

In [27]:
# check if the results contain "Grocery Store"
"Coffee Shop" in venues_df['VenueCategory'].unique()

True

### 6. Analyze Each Neighborhood

In [28]:
# one hot encoding
ci_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
ci_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [ci_onehot.columns[-1]] + list(ci_onehot.columns[:-1])
ci_onehot = ci_onehot[fixed_columns]

print(ci_onehot.shape)
ci_onehot.head()

(8482, 338)


Unnamed: 0,Neighborhoods,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport Lounge,Airport Service,American Restaurant,Amphitheater,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,Athletics & Sports,Auto Dealership,Automotive Shop,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Court,Basketball Stadium,Bavarian Restaurant,Beach,Bed & Breakfast,Beer Bar,Beer Garden,Beer Store,Big Box Store,Bike Shop,Bistro,Boat or Ferry,Bookstore,Border Crossing,Botanical Garden,Boutique,Bowling Alley,Boxing Gym,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Bus Station,Business Service,Butcher,Café,Cajun / Creole Restaurant,Camera Store,Candy Store,Caribbean Restaurant,Carpet Store,Casino,Cheese Shop,Chinese Restaurant,Chocolate Shop,Circus,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Arts Building,College Gym,College Quad,College Rec Center,Comedy Club,Comic Shop,Community Center,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Cuban Restaurant,Cultural Center,Cupcake Shop,Currency Exchange,Cycle Studio,Czech Restaurant,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distillery,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Eastern European Restaurant,Electronics Store,Empanada Restaurant,Ethiopian Restaurant,Event Space,Exhibit,Eye Doctor,Fabric Shop,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Floating Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Service,Food Truck,Football Stadium,Forest,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Gaming Cafe,Garden,Garden Center,Gas Station,Gastropub,Gay Bar,General Entertainment,General Travel,German Restaurant,Gift Shop,Golf Course,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gymnastics Gym,Halal Restaurant,Harbor / Marina,Hardware Store,Health & Beauty Service,Heliport,Historic Site,History Museum,Hobby Shop,Hockey Arena,Home Service,Hookah Bar,Hot Dog Joint,Hotel,Hotel Bar,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indie Theater,Indonesian Restaurant,Indoor Play Area,Intersection,Irish Pub,Israeli Restaurant,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Karaoke Bar,Kids Store,Kitchen Supply Store,Korean Restaurant,Lake,Latin American Restaurant,Laundromat,Library,Light Rail Station,Lighthouse,Lingerie Store,Liquor Store,Lounge,Malay Restaurant,Market,Martial Arts Dojo,Massage Studio,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Military Base,Mini Golf,Miscellaneous Shop,Mobile Phone Shop,Molecular Gastronomy Restaurant,Moroccan Restaurant,Motel,Motorcycle Shop,Movie Theater,Multiplex,Museum,Music School,Music Store,Music Venue,Nail Salon,National Park,Nature Preserve,New American Restaurant,Newsstand,Nightclub,Non-Profit,Noodle House,North Indian Restaurant,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Other Nightlife,Outdoor Sculpture,Outdoors & Recreation,Outlet Store,Pakistani Restaurant,Paper / Office Supplies Store,Park,Parking,Pastry Shop,Pedestrian Plaza,Performing Arts Venue,Persian Restaurant,Peruvian Restaurant,Pet Service,Pet Store,Pharmacy,Photography Studio,Pie Shop,Pier,Pizza Place,Playground,Plaza,Polish Restaurant,Pool,Pool Hall,Portuguese Restaurant,Post Office,Pub,Public Art,Ramen Restaurant,Record Shop,Recording Studio,Rental Car Location,Residential Building (Apartment / Condo),Restaurant,Rock Club,Roof Deck,Salad Place,Salon / Barbershop,Sandwich Place,Scandinavian Restaurant,Scenic Lookout,Science Museum,Sculpture Garden,Seafood Restaurant,Shipping Store,Shoe Store,Shop & Service,Shopping Mall,Skate Park,Skating Rink,Smoke Shop,Smoothie Shop,Snack Place,Soccer Field,Soccer Stadium,Soup Place,South American Restaurant,Southern / Soul Food Restaurant,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Storage Facility,Street Art,Strip Club,Supermarket,Supplement Shop,Sushi Restaurant,Taco Place,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tattoo Parlor,Tea Room,Tennis Court,Tex-Mex Restaurant,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Tiki Bar,Tour Provider,Toy / Game Store,Track,Track Stadium,Trade School,Trail,Train,Train Station,Turkish Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse,Warehouse Store,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,The Island,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,The Island,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,The Island,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,The Island,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,The Island,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


**Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category**

In [29]:
ci_grouped = ci_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(ci_grouped.shape)


(173, 338)


In [30]:
len(ci_grouped[ci_grouped["Coffee Shop"] > 0])

143

**Create a new DataFrame for coffee shop data only**

In [31]:
ci_gs = ci_grouped[["Neighborhoods","Coffee Shop"]]

In [32]:
ci_gs.head()

Unnamed: 0,Neighborhoods,Coffee Shop
0,Albany Park,0.1
1,Andersonville,0.04
2,Arcadia Terrace,0.04
3,Armour Square,0.04
4,Austin,0.0


### 7. Cluster Neighborhoods
Run k-means to cluster the neighborhoods in Chicago IL into 3 clusters.

In [33]:
# set number of clusters
kclusters = 3

kl_clustering = ci_gs.drop(["Neighborhoods"], 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(kl_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 0, 0, 0, 2, 2, 2, 0, 1, 2])

In [34]:
# create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
ci_merged = ci_gs.copy()

# add clustering labels
ci_merged["Cluster Labels"] = kmeans.labels_

In [35]:
ci_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
ci_merged.head()

Unnamed: 0,Neighborhood,Coffee Shop,Cluster Labels
0,Albany Park,0.1,1
1,Andersonville,0.04,0
2,Arcadia Terrace,0.04,0
3,Armour Square,0.04,0
4,Austin,0.0,2


In [36]:
# merge austin_grouped with chicago_data to add latitude/longitude for each neighborhood
ci_merged = ci_merged.join(ci_df.set_index("neighborhood"), on="Neighborhood")

print(ci_merged.shape)
ci_merged.head() # check the last columns!

(173, 5)


Unnamed: 0,Neighborhood,Coffee Shop,Cluster Labels,Latitude,Longitude
0,Albany Park,0.1,1,41.98294,-87.71915
1,Andersonville,0.04,0,41.78344,-87.630197
2,Arcadia Terrace,0.04,0,41.87557,-87.67652
3,Armour Square,0.04,0,41.95411,-87.68142
4,Austin,0.0,2,41.67951,-87.64189


In [37]:
# sort the results by Cluster Labels
print(ci_merged.shape)
ci_merged.sort_values(["Cluster Labels"], inplace=True)
ci_merged.head()

(173, 5)


Unnamed: 0,Neighborhood,Coffee Shop,Cluster Labels,Latitude,Longitude
172,Wrigleyville,0.04,0,41.93982,-87.65682
117,Printer's Row,0.04,0,41.87227,-87.62869
63,Horner Park,0.06,0,41.93925,-87.71125
62,Homan Square,0.04,0,41.83681,-87.68455
61,Hollywood Park,0.04,0,41.98289,-87.70908


**Finally, let's visualize the resulting clusters**

In [38]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ci_merged['Latitude'], ci_merged['Longitude'], ci_merged['Neighborhood'], ci_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [39]:
# save the map as HTML file
map_clusters.save('map_clusters.html')

### 8. Examine Clusters

#### Cluster 0

In [40]:
cluster0 = ci_merged.loc[ci_merged['Cluster Labels'] == 0]
cluster0_merged = pd.merge(left = cluster0, right = ci_df2, how = 'left', on = 'Neighborhood')
cluster0_merged

Unnamed: 0,Neighborhood,Coffee Shop,Cluster Labels,Latitude,Longitude,rent
0,Wrigleyville,0.04,0,41.93982,-87.65682,"$1,661"
1,Printer's Row,0.04,0,41.87227,-87.62869,"$2,443"
2,Horner Park,0.06,0,41.93925,-87.71125,"$1,124"
3,Homan Square,0.04,0,41.83681,-87.68455,$911
4,Hollywood Park,0.04,0,41.98289,-87.70908,"$1,230"
5,Heart of Italy,0.04,0,41.88925,-87.675383,$887
6,Groveland Park,0.06,0,41.88424,-87.62943,"$1,457"
7,Greektown,0.04,0,41.886671,-87.657261,"$2,255"
8,Greater Grand Crossing,0.04,0,41.82857,-87.67338,"$1,103"
9,Hyde Park,0.06,0,41.802391,-87.595016,"$1,447"


#### Cluster 1

In [41]:
cluster1 = ci_merged.loc[ci_merged['Cluster Labels'] == 1]
cluster1_merged = pd.merge(left = cluster1, right = ci_df2, how = 'left', on = 'Neighborhood')
cluster1_merged

Unnamed: 0,Neighborhood,Coffee Shop,Cluster Labels,Latitude,Longitude,rent
0,Prairie District,0.1,1,41.959896,-87.659426,"$1,457"
1,Sauganash Woods,0.08,1,41.98997,-87.74227,"$1,313"
2,North Mayfair,0.08,1,41.96793,-87.73788,"$1,313"
3,Old Town Triangle,0.08,1,41.92184,-87.64744,"$1,846"
4,South Shore,0.08,1,41.77142,-87.57894,"$1,021"
5,North Park,0.08,1,42.00897,-87.66619,"$1,198"
6,West Rogers Park,0.14,1,41.96307,-87.64992,"$1,295"
7,Park West,0.08,1,41.928979,-87.65619,"$1,782"
8,Pulaski Park,0.1,1,41.98582,-87.72848,"$1,230"
9,Wentworth Gardens,0.08,1,41.82976,-87.62783,"$1,171"


#### Cluster 2

In [42]:
cluster2 = ci_merged.loc[ci_merged['Cluster Labels'] == 2]
cluster2_merged = pd.merge(left = cluster2, right = ci_df2, how = 'left', on = 'Neighborhood')
cluster2_merged

Unnamed: 0,Neighborhood,Coffee Shop,Cluster Labels,Latitude,Longitude,rent
0,Chicago Loop,0.0,2,41.71207,-87.53068,"$2,534"
1,The Island,0.02,2,41.88775,-87.76363,$562
2,The Gap,0.0,2,41.96444,-87.837837,"$1,457"
3,Streeterville,0.02,2,41.89252,-87.62477,"$2,566"
4,Tri-Taylor,0.02,2,41.85702,-87.65759,"$1,433"
5,Cottage Grove Heights,0.0,2,39.25452,-90.06417,$612
6,Dearborn,0.0,2,41.85277,-87.63499,"$1,457"
7,South Loop,0.02,2,41.89907,-87.71947,"$2,110"
8,East Chatham,0.02,2,41.74108,-87.61303,$783
9,East Garfield Park,0.0,2,41.99768,-87.69414,"$1,286"


#### Cluster 2 average rent price

In [56]:
cluster2_merged['rent'] = cluster2_merged['rent'].replace(',','', regex=True)
cluster2_merged['rent'] = cluster2_merged['rent'].str.replace('$', '')
cluster2_merged['rent'] = pd.to_numeric(cluster2_merged['rent'])
cluster2_merged['rent'].mean()

1306.2205882352941

#### Cluster 1 average rent price

In [57]:
cluster1_merged['rent'] = cluster1_merged['rent'].replace(',','', regex=True)
cluster1_merged['rent'] = cluster1_merged['rent'].str.replace('$', '')
cluster1_merged['rent'] = pd.to_numeric(cluster1_merged['rent'])
cluster1_merged['rent'].mean()

1432.909090909091

#### Cluster 0 average rent price

In [58]:
cluster0_merged['rent'] = cluster0_merged['rent'].replace(',','', regex=True)
cluster0_merged['rent'] = cluster0_merged['rent'].str.replace('$', '')
cluster0_merged['rent'] = pd.to_numeric(cluster0_merged['rent'])
cluster0_merged['rent'].mean()

1542.5903614457832

#### Observations

Cluster 0 is portrayed by the red dots, we can see that these neighbourhoods have the highest rent averages across the map. Cluster 1 is the smallest and is portrayed with the blue dots. This cluster has the highest amounts of pizza parlors already establsihed. It's mainly based in the inner city and near Lake Michigan, it does not reach southern Chicago like the other two clusters. Opening a business there could lead to failiure because of the competetition. Cluster 2 is represented by the light green dots. It's the biggest cluster and according to foursquare has the least amount of pizza parlors. This could be because Foursquare is missing venue data pertaining to pizza places in these neighborhoods or there really are just a low amount of pizza establishments there. I find it hard to believe that the largest cluster is the one with the least amount of competetition though. Based of this data analysis, opening a new pizza place in cluster 2 would be ideal since it has the cheapest rent prices and has the least amount of competetiton. 