# Data Science Capstone Project
### Applied Data Science Capstone by MM Chaava
------

## Table of contents
----

* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)
* [References](#references)

## Introduction: Business Problem <a name="introduction"></a>
----

In this project, we try to locate an ideal place for a restaurant. More specifically, the business venture of this project and report will be targeted to stakeholders in operating an **Ethiopian restaurant** in **Oklahoma City, Oklahoma**.

Lawton, Oklahoma is a rapidly growing city with a diverse population of 120,000 people. Many small, medium, and large businesses cater to various needs associated with the growing population. There are many restaurants in Lawton; in this project we try to locate an **area with no Ethiopian restaurants in vicinity** preferrably locations **as close to city center as possible**.

Using our data science powers, we will generate a few most promising neighborhoods based on the criteria outlined. Then, we will state the advantages of each area so as to finally determine the location preferred.

## Data
----
* Based on the definition of our business problem, factors to influence our decision are:
* number of existing restaurants in the neighborhood (all types of restaurants)
* number of and distance to African restaurants in the neighborhood if there are any
* distance from the city center to the neighborhood

The following sources of data are needed to obtain, extract, and generate the required information:
* The data required to solve the problem will help to explore venues in Lawton, to establish the most common businesses and most popular venues [1]
* number of all types of restaurants and their type and location in each neughborhood obtained by **Foursquare API** [2]
* coordinates of Lawton center will be obtained using **Folium** of well known Lawton locations

* The data required to solve the problem will help to explore venues in Lawton, to establish the most common businesses and most popular venues [1]
. After searching the Internet for data, a source of free United States data was found, a "personal free copy" of zip codes data: 
#### The file was downloaded as a comma delimited csv file.

### Methodology

Creating data and data pre-processing

*The data will be used to describe Oklahoma, and to explore the neighborhoods of Lawton. Python 3 and various Libraries will be used to explore the data for venues in Lawton*

## Importing Necessary Libraries

In [1]:
# Import library to process data into pandas DataFrame
import pandas as pd
# Transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize
print('Libraries imported!')

Libraries imported!


In [2]:
# To work with Extensible Markup Language. 
!pip install lxml



In [3]:
# import libraries for accessing website url
import requests
# Import library for webscrapping
from bs4 import BeautifulSoup
from urllib.request import urlopen
print('Libraries imported!')

Libraries imported!


In [4]:
# To locate the coordinates of addresses, cities, countries, and landmarks across the globe
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [5]:
# Python library for working with arrays. And for working in domain of linear algebra etc
import numpy as np
import json
import csv
print('Libraries imported!')

Libraries imported!


In [6]:
# Building on the data wrangling strengths of the Python ecosystem
# visualize it in on a Leaflet map
import folium
print('Libraries imported!')

Libraries imported!


In [7]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
print('Libraries imported!')

Libraries imported!


### Identify Source Document

In [8]:
# identify the data/website
df_us = pd.read_csv('zipUSCities.csv')

In [9]:
df_us.head(3)

Unnamed: 0,city,city_ascii,state_id,state_name,county_fips,county_name,lat,lng,population,density,source,military,incorporated,timezone,ranking,zips,id
0,New York,New York,NY,New York,36061,New York,40.6943,-73.9249,18713220,10715,polygon,False,True,America/New_York,1,11229 11226 11225 11224 11222 11221 11220 1138...,1840034016
1,Los Angeles,Los Angeles,CA,California,6037,Los Angeles,34.1139,-118.4068,12750807,3276,polygon,False,True,America/Los_Angeles,1,90291 90293 90292 91316 91311 90037 90031 9000...,1840020491
2,Chicago,Chicago,IL,Illinois,17031,Cook,41.8373,-87.6862,8604203,4574,polygon,False,True,America/Chicago,1,60018 60649 60641 60640 60643 60642 60645 6064...,1840000494


In [10]:
df_us.shape

(28338, 17)

In [11]:
# List the columns
titles = list(df_us.columns)
titles

['city',
 'city_ascii',
 'state_id',
 'state_name',
 'county_fips',
 'county_name',
 'lat',
 'lng',
 'population',
 'density',
 'source',
 'military',
 'incorporated',
 'timezone',
 'ranking',
 'zips',
 'id']

In [12]:
df1 = df_us[['city','state_id','state_name','county_name','lat','lng','population','zips','id']]
df1

Unnamed: 0,city,state_id,state_name,county_name,lat,lng,population,zips,id
0,New York,NY,New York,New York,40.6943,-73.9249,18713220,11229 11226 11225 11224 11222 11221 11220 1138...,1840034016
1,Los Angeles,CA,California,Los Angeles,34.1139,-118.4068,12750807,90291 90293 90292 91316 91311 90037 90031 9000...,1840020491
2,Chicago,IL,Illinois,Cook,41.8373,-87.6862,8604203,60018 60649 60641 60640 60643 60642 60645 6064...,1840000494
3,Miami,FL,Florida,Miami-Dade,25.7839,-80.2102,6445545,33129 33125 33126 33127 33128 33149 33144 3314...,1840015149
4,Dallas,TX,Texas,Dallas,32.7936,-96.7662,5743938,75287 75098 75233 75254 75251 75252 75253 7503...,1840019440
...,...,...,...,...,...,...,...,...,...
28333,Gross,NE,Nebraska,Boyd,42.9461,-98.5697,2,68719,1840011032
28334,Lotsee,OK,Oklahoma,Tulsa,36.1334,-96.2091,2,74063,1840021674
28335,The Ranch,MN,Minnesota,Mahnomen,47.3198,-95.6952,2,56557,1840039629
28336,Shamrock,OK,Oklahoma,Creek,35.9113,-96.5772,2,74068,1840022701


In [13]:
# List the columns
titles = list(df1.columns)
titles

['city',
 'state_id',
 'state_name',
 'county_name',
 'lat',
 'lng',
 'population',
 'zips',
 'id']

In [14]:
# Oklahoma data
oklahoma_data = df1[(df1['state_id']=='OK')]
print(len(oklahoma_data))

732


In [15]:
oklahoma_data.to_csv('oklahoma_data.csv')

In [16]:
header_names = ['neighborhood', 'state_id','state_name','county','lat', 'lng', 'pop', 'postcode','id']
ok_data = pd.read_csv('oklahoma_data.csv', header=None, skiprows=1, names=header_names)
ok_data.to_csv('ok_data.csv')

In [17]:
ok_data.head()

Unnamed: 0,neighborhood,state_id,state_name,county,lat,lng,pop,postcode,id
50,Oklahoma City,OK,Oklahoma,Oklahoma,35.4676,-97.5136,972943,73012 73013 73099 73097 73119 73118 73114 7311...,1840020428
67,Tulsa,OK,Oklahoma,Tulsa,36.1284,-95.9042,671033,74133 74134 74135 74132 74130 74136 74137 7414...,1840021672
355,Norman,OK,Oklahoma,Cleveland,35.2335,-97.3471,124880,73019 73026 74857 73072 73071 73069 73068 73070,1840020451
414,Broken Arrow,OK,Oklahoma,Tulsa,36.0365,-95.7809,110198,74014 74011 74012 74013,1840019059
494,Edmond,OK,Oklahoma,Oklahoma,35.6689,-97.4159,94054,73012 73013 73007 73003 73034,1840020423


## To Explore and Cluster SW Oklahoma Counties

### Using the geopy library to get the latitude and longitude values of Oklahoma, US.

In [18]:
address = 'Oklahoma, US'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Oklahoma are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Oklahoma are 34.9550817, -97.2684063.


In [19]:
# create map of Oklahoma using latitude and longitude values
map_oklahoma = folium.Map(location=[latitude,longitude], zoom_start=10)

# add markers to map
for lat, lng, state, county in zip(ok_data['lat'], ok_data['lng'], ok_data['state_name'], ok_data['county']):
    label = '{}, {}'.format(state, county)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_oklahoma
)
map_oklahoma

### Define Foursquare Credentials and Version

In [20]:
CLIENT_ID = '0C2BD04A2V1DGSC4FXV5GIU01V2WURJBZ2XM044H3IGYJVRX' # My Foursquare ID
CLIENT_SECRET = 'ILVXI2IA2FRQ5V2RKVBSEFDAG2NAAJNEXYODYFMGIZN1OXYJ' # My Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: 0C2BD04A2V1DGSC4FXV5GIU01V2WURJBZ2XM044H3IGYJVRX
CLIENT_SECRET: ILVXI2IA2FRQ5V2RKVBSEFDAG2NAAJNEXYODYFMGIZN1OXYJ


### Looking closer at the neighborhoods of Comanche County - Lawton, Oklahoma

In [21]:
# What are the neighborhoods of Comanche? Create a pandas DataFrame
comanche_data = ok_data[ok_data['county'] == "Comanche"].reset_index(drop=True)
comanche_data.head(15)

Unnamed: 0,neighborhood,state_id,state_name,county,lat,lng,pop,postcode,id
0,Lawton,OK,Oklahoma,Comanche,34.6176,-98.4203,93025,73501 73507 73505 73503 73502 73506,1840020477
1,Elgin,OK,Oklahoma,Comanche,34.7848,-98.301,3183,73538,1840020476
2,Cache,OK,Oklahoma,Comanche,34.6297,-98.6181,2811,73527,1840019203
3,Geronimo,OK,Oklahoma,Comanche,34.4852,-98.3893,1215,73543,1840021853
4,Fletcher,OK,Oklahoma,Comanche,34.8224,-98.2391,1143,73541,1840021852
5,Sterling,OK,Oklahoma,Comanche,34.7498,-98.1729,773,73567,1840022822
6,Chattanooga,OK,Oklahoma,Comanche,34.4242,-98.654,453,73528,1840022821
7,Medicine Park,OK,Oklahoma,Comanche,34.7294,-98.4846,452,73557 73507,1840022820
8,Indiahoma,OK,Oklahoma,Comanche,34.6196,-98.752,330,73552,1840021854
9,Faxon,OK,Oklahoma,Comanche,34.4603,-98.5794,130,73540,1840021851


In [22]:
cmh_data = comanche_data[['pop','postcode','id','state_id','state_name','county','neighborhood','lat','lng']]
cmh_data.head()

Unnamed: 0,pop,postcode,id,state_id,state_name,county,neighborhood,lat,lng
0,93025,73501 73507 73505 73503 73502 73506,1840020477,OK,Oklahoma,Comanche,Lawton,34.6176,-98.4203
1,3183,73538,1840020476,OK,Oklahoma,Comanche,Elgin,34.7848,-98.301
2,2811,73527,1840019203,OK,Oklahoma,Comanche,Cache,34.6297,-98.6181
3,1215,73543,1840021853,OK,Oklahoma,Comanche,Geronimo,34.4852,-98.3893
4,1143,73541,1840021852,OK,Oklahoma,Comanche,Fletcher,34.8224,-98.2391


###  neighborhoods were identified

In [23]:
# What is the first county of SW Oklahoma?
cmh_data.loc[0, 'neighborhood']

'Lawton'

In [24]:
cmh_data.dtypes

pop               int64
postcode         object
id                int64
state_id         object
state_name       object
county           object
neighborhood     object
lat             float64
lng             float64
dtype: object

In [25]:
# What are the latitude and longitude values for Lawton, Oklahoma?
neighborhood_latitude = cmh_data.loc[0,'lat'] # city latitude value
neighborhood_longitude = cmh_data.loc[0, 'lng'] # city longitude value
neighborhood_name = cmh_data.loc[0, 'neighborhood'] # city name
print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name,neighborhood_latitude,
neighborhood_longitude))

Latitude and longitude values of Lawton are 34.6176, -98.4203.


In [26]:
# The top 100 venues that are in Regent Park, Harbourfront within a radius of 300 meters.
# Limit the number of venues returned.
LIMIT = 100
# Define the radius explored
radius =500
# First, we create the GET request URL, which we name 'url'.
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    neighborhood_latitude,
    neighborhood_longitude,
    radius,
    LIMIT)
url
# display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=0C2BD04A2V1DGSC4FXV5GIU01V2WURJBZ2XM044H3IGYJVRX&client_secret=ILVXI2IA2FRQ5V2RKVBSEFDAG2NAAJNEXYODYFMGIZN1OXYJ&v=20180605&ll=34.6176,-98.4203&radius=500&limit=100'

In [27]:
# Send the request and examine the results
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '60f4e067ab849b1373ce9584'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Lawton',
  'headerFullLocation': 'Lawton',
  'headerLocationGranularity': 'city',
  'totalResults': 14,
  'suggestedBounds': {'ne': {'lat': 34.622100004500005,
    'lng': -98.41484215016118},
   'sw': {'lat': 34.6130999955, 'lng': -98.42575784983882}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '5be1b7ad64c8e1002c320292',
       'name': "Raising Cane's Chicken Fingers",
       'location': {'address': '805 NW Sheridan Rd',
        'lat': 34.61781156171866,
        'lng': -98.42178033862308,
        'labeledLatLngs': [{'label': 'display',
          'lat': 34.6178115617

From the Foursquare lab in the previous module, we know that all the
information is in the items key. Before proceeding, borrow the
get_category_type function from the Foursquare lab.

In [28]:
# The function that extracts the category of the venue...

def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


In [29]:
# clean the json and structure it into a pandas dataframe.
venues = results['response']['groups'][0]['items']
nearby_venues = json_normalize(venues) # flatten JSON
# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]
# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
# clean columns
nearby_venues.columns = [col.split(".")[-1]
for col in nearby_venues.columns]
nearby_venues.head()

  nearby_venues = json_normalize(venues) # flatten JSON


Unnamed: 0,name,categories,lat,lng
0,Raising Cane's Chicken Fingers,Fried Chicken Joint,34.617812,-98.42178
1,Walgreens,Pharmacy,34.616439,-98.421339
2,Sam's Club,Warehouse Store,34.617333,-98.423762
3,Tropical Smoothie Cafe,Café,34.619997,-98.422824
4,Staples,Paper / Office Supplies Store,34.620735,-98.421317


In [30]:
# The number of venues returned by Foursquare:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))
nearby_venues

14 venues were returned by Foursquare.


Unnamed: 0,name,categories,lat,lng
0,Raising Cane's Chicken Fingers,Fried Chicken Joint,34.617812,-98.42178
1,Walgreens,Pharmacy,34.616439,-98.421339
2,Sam's Club,Warehouse Store,34.617333,-98.423762
3,Tropical Smoothie Cafe,Café,34.619997,-98.422824
4,Staples,Paper / Office Supplies Store,34.620735,-98.421317
5,Redbox,Video Store,34.61608,-98.422209
6,Carl's Jr.,Fast Food Restaurant,34.621319,-98.42179
7,Dollar Tree,Discount Store,34.619925,-98.423185
8,ALDI,Market,34.619401,-98.421474
9,Cicis,Pizza Place,34.620616,-98.423421


### To create a function that repeats the same process to all the neighborhoods in Comanche County, Lawton Oklahoma.

In [31]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, latitude, longitude in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            latitude, 
            longitude, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            latitude, 
            longitude, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['neighborhood', 
                  'neighborhood lat',
                  'neighborhood lng', 
                  'Venue', 
                  'Venue lat', 
                  'Venue lng', 
                  'Venue Category']
    
    return(nearby_venues)

In [32]:
nearby_venues

Unnamed: 0,name,categories,lat,lng
0,Raising Cane's Chicken Fingers,Fried Chicken Joint,34.617812,-98.42178
1,Walgreens,Pharmacy,34.616439,-98.421339
2,Sam's Club,Warehouse Store,34.617333,-98.423762
3,Tropical Smoothie Cafe,Café,34.619997,-98.422824
4,Staples,Paper / Office Supplies Store,34.620735,-98.421317
5,Redbox,Video Store,34.61608,-98.422209
6,Carl's Jr.,Fast Food Restaurant,34.621319,-98.42179
7,Dollar Tree,Discount Store,34.619925,-98.423185
8,ALDI,Market,34.619401,-98.421474
9,Cicis,Pizza Place,34.620616,-98.423421


In [33]:
lawton_venues = getNearbyVenues(names=nearby_venues['name'],
                                   latitudes=nearby_venues['lat'],
                                   longitudes=nearby_venues['lng']
                                  )
lawton_venues

Raising Cane's Chicken Fingers
Walgreens
Sam's Club
Tropical Smoothie Cafe
Staples
Redbox
Carl's Jr.
Dollar Tree
ALDI
Cicis
Vaska Theater
Walmart Supercenter
The Brow Parlour
U.S. Lawns - Lawton


Unnamed: 0,neighborhood,neighborhood lat,neighborhood lng,Venue,Venue lat,Venue lng,Venue Category
0,Raising Cane's Chicken Fingers,34.617812,-98.42178,Raising Cane's Chicken Fingers,34.617812,-98.421780,Fried Chicken Joint
1,Raising Cane's Chicken Fingers,34.617812,-98.42178,Sam's Club,34.617333,-98.423762,Warehouse Store
2,Raising Cane's Chicken Fingers,34.617812,-98.42178,Walgreens,34.616439,-98.421339,Pharmacy
3,Raising Cane's Chicken Fingers,34.617812,-98.42178,Tropical Smoothie Cafe,34.619997,-98.422824,Café
4,Raising Cane's Chicken Fingers,34.617812,-98.42178,Chick-fil-A,34.623789,-98.425599,Fast Food Restaurant
...,...,...,...,...,...,...,...
693,U.S. Lawns - Lawton,34.613360,-98.42119,Green Acres Market,34.609362,-98.423474,Market
694,U.S. Lawns - Lawton,34.613360,-98.42119,Walmart Supercenter,34.618552,-98.424117,Big Box Store
695,U.S. Lawns - Lawton,34.613360,-98.42119,Tipton's Fine Jewelry,34.608581,-98.415175,Jewelry Store
696,U.S. Lawns - Lawton,34.613360,-98.42119,"Chizum's Fencing, LLC",34.616793,-98.413820,Business Service


In [34]:
# What is the resulting dataframe?
print(lawton_venues.shape)
#What does it contain, for exmple?
lawton_venues.tail()

(698, 7)


Unnamed: 0,neighborhood,neighborhood lat,neighborhood lng,Venue,Venue lat,Venue lng,Venue Category
693,U.S. Lawns - Lawton,34.61336,-98.42119,Green Acres Market,34.609362,-98.423474,Market
694,U.S. Lawns - Lawton,34.61336,-98.42119,Walmart Supercenter,34.618552,-98.424117,Big Box Store
695,U.S. Lawns - Lawton,34.61336,-98.42119,Tipton's Fine Jewelry,34.608581,-98.415175,Jewelry Store
696,U.S. Lawns - Lawton,34.61336,-98.42119,"Chizum's Fencing, LLC",34.616793,-98.41382,Business Service
697,U.S. Lawns - Lawton,34.61336,-98.42119,Phillip's Music,34.609211,-98.411531,Music Store


In [35]:
# How many venues were returned for each neighborhood?
lawton_venues.groupby('neighborhood').count()

Unnamed: 0_level_0,neighborhood lat,neighborhood lng,Venue,Venue lat,Venue lng,Venue Category
neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ALDI,48,48,48,48,48,48
Carl's Jr.,55,55,55,55,55,55
Cicis,54,54,54,54,54,54
Dollar Tree,55,55,55,55,55,55
Raising Cane's Chicken Fingers,53,53,53,53,53,53
Redbox,54,54,54,54,54,54
Sam's Club,53,53,53,53,53,53
Staples,54,54,54,54,54,54
The Brow Parlour,43,43,43,43,43,43
Tropical Smoothie Cafe,54,54,54,54,54,54


In [36]:
# Unique categories that can be curated from all the venues returned...
print('There are {} unique categories.'.format(len(lawton_venues['Venue Category'].unique())))

There are 46 unique categories.


In [37]:
# To analyze each neighborhood...
# one hot encoding
lawton_onehot = pd.get_dummies(lawton_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
lawton_onehot['neighborhood'] = lawton_venues['neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [lawton_onehot.columns[-1]] + list(lawton_onehot.columns[:-1])
lawton_onehot = lawton_onehot[fixed_columns]

lawton_onehot.head()

Unnamed: 0,neighborhood,American Restaurant,Arts & Crafts Store,Asian Restaurant,Automotive Shop,BBQ Joint,Bagel Shop,Bar,Beer Garden,Big Box Store,...,Pharmacy,Pizza Place,Sandwich Place,Seafood Restaurant,Spa,Supplement Shop,Sushi Restaurant,Video Store,Warehouse Store,Wings Joint
0,Raising Cane's Chicken Fingers,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Raising Cane's Chicken Fingers,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,Raising Cane's Chicken Fingers,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,Raising Cane's Chicken Fingers,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Raising Cane's Chicken Fingers,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
# Closer examiniation of the dataframe size
lawton_onehot.shape

(698, 47)

#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category


In [39]:
lawton_grouped = lawton_onehot.groupby('neighborhood').mean().reset_index()
lawton_grouped

Unnamed: 0,neighborhood,American Restaurant,Arts & Crafts Store,Asian Restaurant,Automotive Shop,BBQ Joint,Bagel Shop,Bar,Beer Garden,Big Box Store,...,Pharmacy,Pizza Place,Sandwich Place,Seafood Restaurant,Spa,Supplement Shop,Sushi Restaurant,Video Store,Warehouse Store,Wings Joint
0,ALDI,0.0,0.020833,0.0,0.020833,0.020833,0.0,0.020833,0.0,0.020833,...,0.041667,0.083333,0.041667,0.0,0.020833,0.020833,0.041667,0.020833,0.020833,0.041667
1,Carl's Jr.,0.0,0.0,0.018182,0.018182,0.018182,0.018182,0.018182,0.0,0.018182,...,0.036364,0.090909,0.036364,0.018182,0.018182,0.018182,0.036364,0.018182,0.018182,0.036364
2,Cicis,0.0,0.0,0.0,0.018519,0.018519,0.018519,0.018519,0.0,0.018519,...,0.037037,0.092593,0.037037,0.018519,0.018519,0.018519,0.037037,0.018519,0.018519,0.037037
3,Dollar Tree,0.0,0.018182,0.0,0.018182,0.018182,0.018182,0.018182,0.0,0.018182,...,0.036364,0.090909,0.036364,0.018182,0.018182,0.018182,0.036364,0.018182,0.018182,0.036364
4,Raising Cane's Chicken Fingers,0.018868,0.018868,0.0,0.018868,0.018868,0.0,0.018868,0.018868,0.018868,...,0.037736,0.075472,0.037736,0.0,0.018868,0.018868,0.037736,0.018868,0.018868,0.037736
5,Redbox,0.018519,0.018519,0.0,0.0,0.018519,0.0,0.018519,0.018519,0.018519,...,0.037037,0.074074,0.037037,0.0,0.018519,0.018519,0.037037,0.018519,0.018519,0.037037
6,Sam's Club,0.018868,0.018868,0.0,0.0,0.018868,0.0,0.018868,0.0,0.018868,...,0.037736,0.075472,0.037736,0.018868,0.018868,0.018868,0.037736,0.018868,0.018868,0.037736
7,Staples,0.0,0.0,0.018519,0.018519,0.018519,0.0,0.018519,0.0,0.018519,...,0.037037,0.092593,0.037037,0.018519,0.018519,0.018519,0.037037,0.018519,0.018519,0.037037
8,The Brow Parlour,0.023256,0.023256,0.0,0.0,0.046512,0.0,0.0,0.023256,0.023256,...,0.023256,0.046512,0.046512,0.0,0.023256,0.023256,0.046512,0.023256,0.023256,0.046512
9,Tropical Smoothie Cafe,0.0,0.0,0.0,0.018519,0.018519,0.018519,0.018519,0.0,0.018519,...,0.037037,0.092593,0.037037,0.018519,0.018519,0.018519,0.037037,0.018519,0.018519,0.037037


#### Confirming the new size

In [40]:
num_top_venues = 5

for hood in lawton_grouped['neighborhood']:
    print("----"+hood+"----")
    temp = lawton_grouped[lawton_grouped['neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----ALDI----
                  venue  freq
0  Fast Food Restaurant  0.12
1           Pizza Place  0.08
2           Wings Joint  0.04
3      Sushi Restaurant  0.04
4     Mobile Phone Shop  0.04


----Carl's Jr.----
                  venue  freq
0  Fast Food Restaurant  0.11
1           Pizza Place  0.09
2           Wings Joint  0.04
3   Fried Chicken Joint  0.04
4     Korean Restaurant  0.04


----Cicis----
                  venue  freq
0  Fast Food Restaurant  0.13
1           Pizza Place  0.09
2           Wings Joint  0.04
3        Sandwich Place  0.04
4    Italian Restaurant  0.04


----Dollar Tree----
                  venue  freq
0  Fast Food Restaurant  0.13
1           Pizza Place  0.09
2           Wings Joint  0.04
3     Mobile Phone Shop  0.04
4   Fried Chicken Joint  0.04


----Raising Cane's Chicken Fingers----
                  venue  freq
0  Fast Food Restaurant  0.13
1           Pizza Place  0.08
2           Wings Joint  0.04
3                Market  0.04
4      Sushi Rest

## To put the information above into a pandas dataframework

### First, we write a function to sort the venues in descending order.

In [41]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [42]:
# Create the dtaframe and display the top 10 venues
num_top_venues = 7
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
county_venues_sorted = pd.DataFrame(columns=columns)
county_venues_sorted['neighborhood'] = lawton_grouped['neighborhood']

for ind in np.arange(lawton_grouped.shape[0]):
    county_venues_sorted.iloc[ind, 1:] = return_most_common_venues(lawton_grouped.iloc[ind, :], num_top_venues)

county_venues_sorted.head()

Unnamed: 0,neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,ALDI,Fast Food Restaurant,Pizza Place,Wings Joint,Sushi Restaurant,Mobile Phone Shop,Sandwich Place,Italian Restaurant
1,Carl's Jr.,Fast Food Restaurant,Pizza Place,Wings Joint,Fried Chicken Joint,Korean Restaurant,Deli / Bodega,Mobile Phone Shop
2,Cicis,Fast Food Restaurant,Pizza Place,Wings Joint,Sandwich Place,Italian Restaurant,Korean Restaurant,Mobile Phone Shop
3,Dollar Tree,Fast Food Restaurant,Pizza Place,Wings Joint,Mobile Phone Shop,Fried Chicken Joint,Sushi Restaurant,Italian Restaurant
4,Raising Cane's Chicken Fingers,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Sushi Restaurant,Italian Restaurant,Sandwich Place


# Step 4. Clustering Neighborhoods

### To run k-means to cluster the neighborhood into 5 clusters

In [43]:
# set number of clusters
kclusters = 4

lawton_grouped_clustering = lawton_grouped.drop('neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(lawton_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 2, 2, 2, 0, 1, 0])

## To create a new dataframe that includes the cluster as well as the top venues for each neighborhood

In [44]:
# add clustering labels
county_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

lawton_merged = lawton_venues

# merge lawton_grouped with swo_data to add latitude/longitude for each neighborhood
lawton_merged = lawton_merged.join(county_venues_sorted.set_index('neighborhood'), on='neighborhood')

lawton_merged.head() # check the last columns!

Unnamed: 0,neighborhood,neighborhood lat,neighborhood lng,Venue,Venue lat,Venue lng,Venue Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,Raising Cane's Chicken Fingers,34.617812,-98.42178,Raising Cane's Chicken Fingers,34.617812,-98.42178,Fried Chicken Joint,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Sushi Restaurant,Italian Restaurant,Sandwich Place
1,Raising Cane's Chicken Fingers,34.617812,-98.42178,Sam's Club,34.617333,-98.423762,Warehouse Store,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Sushi Restaurant,Italian Restaurant,Sandwich Place
2,Raising Cane's Chicken Fingers,34.617812,-98.42178,Walgreens,34.616439,-98.421339,Pharmacy,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Sushi Restaurant,Italian Restaurant,Sandwich Place
3,Raising Cane's Chicken Fingers,34.617812,-98.42178,Tropical Smoothie Cafe,34.619997,-98.422824,Café,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Sushi Restaurant,Italian Restaurant,Sandwich Place
4,Raising Cane's Chicken Fingers,34.617812,-98.42178,Chick-fil-A,34.623789,-98.425599,Fast Food Restaurant,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Sushi Restaurant,Italian Restaurant,Sandwich Place


In [45]:
# To visualize the resulting clusters...
# create the map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lng, poi, cluster in zip(lawton_merged['Venue lat'], lawton_merged['Venue lng'], lawton_merged['neighborhood'], lawton_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

 ### Examining the clusters to see the discriminating venue category that distinguishes each cluster. 

In [46]:
# Cluster 1
lawton_merged.loc[lawton_merged['Cluster Labels'] == 1, lawton_merged.columns[[1] + list(range(5, lawton_merged.shape[1]))]]

Unnamed: 0,neighborhood lat,Venue lng,Venue Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
532,34.615887,-98.419141,Beer Garden,1,Fast Food Restaurant,Sushi Restaurant,Discount Store,Pizza Place,Market,Mobile Phone Shop,American Restaurant
533,34.615887,-98.423552,Clothing Store,1,Fast Food Restaurant,Sushi Restaurant,Discount Store,Pizza Place,Market,Mobile Phone Shop,American Restaurant
534,34.615887,-98.421339,Pharmacy,1,Fast Food Restaurant,Sushi Restaurant,Discount Store,Pizza Place,Market,Mobile Phone Shop,American Restaurant
535,34.615887,-98.421780,Fried Chicken Joint,1,Fast Food Restaurant,Sushi Restaurant,Discount Store,Pizza Place,Market,Mobile Phone Shop,American Restaurant
536,34.615887,-98.418918,Italian Restaurant,1,Fast Food Restaurant,Sushi Restaurant,Discount Store,Pizza Place,Market,Mobile Phone Shop,American Restaurant
...,...,...,...,...,...,...,...,...,...,...,...
663,34.614517,-98.422630,Gift Shop,1,Fast Food Restaurant,Wings Joint,Mobile Phone Shop,BBQ Joint,Sushi Restaurant,Discount Store,Sandwich Place
664,34.614517,-98.423474,Market,1,Fast Food Restaurant,Wings Joint,Mobile Phone Shop,BBQ Joint,Sushi Restaurant,Discount Store,Sandwich Place
665,34.614517,-98.420448,Sushi Restaurant,1,Fast Food Restaurant,Wings Joint,Mobile Phone Shop,BBQ Joint,Sushi Restaurant,Discount Store,Sandwich Place
666,34.614517,-98.413820,Business Service,1,Fast Food Restaurant,Wings Joint,Mobile Phone Shop,BBQ Joint,Sushi Restaurant,Discount Store,Sandwich Place


In [47]:
# Examining the clusters to see the discriminating venue category that distinguishes each cluster. 
lawton_merged.loc[lawton_merged['Cluster Labels'] == 2, lawton_merged.columns[[1] + list(range(5, lawton_merged.shape[1]))]]

Unnamed: 0,neighborhood lat,Venue lng,Venue Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,34.617812,-98.421780,Fried Chicken Joint,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Sushi Restaurant,Italian Restaurant,Sandwich Place
1,34.617812,-98.423762,Warehouse Store,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Sushi Restaurant,Italian Restaurant,Sandwich Place
2,34.617812,-98.421339,Pharmacy,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Sushi Restaurant,Italian Restaurant,Sandwich Place
3,34.617812,-98.422824,Café,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Sushi Restaurant,Italian Restaurant,Sandwich Place
4,34.617812,-98.425599,Fast Food Restaurant,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Sushi Restaurant,Italian Restaurant,Sandwich Place
...,...,...,...,...,...,...,...,...,...,...,...
315,34.616080,-98.420448,Sushi Restaurant,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Italian Restaurant,Sushi Restaurant,Discount Store
316,34.616080,-98.422630,Gift Shop,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Italian Restaurant,Sushi Restaurant,Discount Store
317,34.616080,-98.423474,Market,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Italian Restaurant,Sushi Restaurant,Discount Store
318,34.616080,-98.413820,Business Service,2,Fast Food Restaurant,Pizza Place,Wings Joint,Market,Italian Restaurant,Sushi Restaurant,Discount Store


In [48]:
# Examining the clusters to see the discriminating venue category that distinguishes each cluster. 
lawton_merged.loc[lawton_merged['Cluster Labels'] == 3, lawton_merged.columns[[1] + list(range(5, lawton_merged.shape[1]))]]

Unnamed: 0,neighborhood lat,Venue lng,Venue Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
668,34.61336,-98.423552,Clothing Store,3,Fast Food Restaurant,Market,Discount Store,Convenience Store,Movie Theater,Music Store,Jewelry Store
669,34.61336,-98.419141,Beer Garden,3,Fast Food Restaurant,Market,Discount Store,Convenience Store,Movie Theater,Music Store,Jewelry Store
670,34.61336,-98.422546,Burger Joint,3,Fast Food Restaurant,Market,Discount Store,Convenience Store,Movie Theater,Music Store,Jewelry Store
671,34.61336,-98.424087,Arts & Crafts Store,3,Fast Food Restaurant,Market,Discount Store,Convenience Store,Movie Theater,Music Store,Jewelry Store
672,34.61336,-98.421339,Pharmacy,3,Fast Food Restaurant,Market,Discount Store,Convenience Store,Movie Theater,Music Store,Jewelry Store
673,34.61336,-98.42178,Fried Chicken Joint,3,Fast Food Restaurant,Market,Discount Store,Convenience Store,Movie Theater,Music Store,Jewelry Store
674,34.61336,-98.423762,Warehouse Store,3,Fast Food Restaurant,Market,Discount Store,Convenience Store,Movie Theater,Music Store,Jewelry Store
675,34.61336,-98.41596,BBQ Joint,3,Fast Food Restaurant,Market,Discount Store,Convenience Store,Movie Theater,Music Store,Jewelry Store
676,34.61336,-98.42257,Italian Restaurant,3,Fast Food Restaurant,Market,Discount Store,Convenience Store,Movie Theater,Music Store,Jewelry Store
677,34.61336,-98.422768,American Restaurant,3,Fast Food Restaurant,Market,Discount Store,Convenience Store,Movie Theater,Music Store,Jewelry Store


In [49]:
# Examining the clusters to see the discriminating venue category that distinguishes each cluster. 
lawton_merged.loc[lawton_merged['Cluster Labels'] == 0, lawton_merged.columns[[1] + list(range(5, lawton_merged.shape[1]))]]

Unnamed: 0,neighborhood lat,Venue lng,Venue Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
158,34.619997,-98.425599,Fast Food Restaurant,0,Fast Food Restaurant,Pizza Place,Wings Joint,Sandwich Place,Italian Restaurant,Korean Restaurant,Mobile Phone Shop
159,34.619997,-98.422824,Café,0,Fast Food Restaurant,Pizza Place,Wings Joint,Sandwich Place,Italian Restaurant,Korean Restaurant,Mobile Phone Shop
160,34.619997,-98.421780,Fried Chicken Joint,0,Fast Food Restaurant,Pizza Place,Wings Joint,Sandwich Place,Italian Restaurant,Korean Restaurant,Mobile Phone Shop
161,34.619997,-98.418918,Italian Restaurant,0,Fast Food Restaurant,Pizza Place,Wings Joint,Sandwich Place,Italian Restaurant,Korean Restaurant,Mobile Phone Shop
162,34.619997,-98.423762,Warehouse Store,0,Fast Food Restaurant,Pizza Place,Wings Joint,Sandwich Place,Italian Restaurant,Korean Restaurant,Mobile Phone Shop
...,...,...,...,...,...,...,...,...,...,...,...
620,34.618552,-98.421875,Spa,0,Fast Food Restaurant,Pizza Place,Wings Joint,Sushi Restaurant,Italian Restaurant,Pharmacy,Mobile Phone Shop
621,34.618552,-98.429072,Fast Food Restaurant,0,Fast Food Restaurant,Pizza Place,Wings Joint,Sushi Restaurant,Italian Restaurant,Pharmacy,Mobile Phone Shop
622,34.618552,-98.429760,Fast Food Restaurant,0,Fast Food Restaurant,Pizza Place,Wings Joint,Sushi Restaurant,Italian Restaurant,Pharmacy,Mobile Phone Shop
623,34.618552,-98.418134,Bar,0,Fast Food Restaurant,Pizza Place,Wings Joint,Sushi Restaurant,Italian Restaurant,Pharmacy,Mobile Phone Shop


Conclusion

There is no Ethiopian Restaurant in the City of Lawton, Oklahoma. 

Therefore, a location close to the area of trending venues, close to Cluster 0 would be preferred

In [1]:
pwd

'C:\\Users\\MacDonald Chaava\\git_projects\\Toronto_Neighborhood_Clustering'

## **References:**

* [1] [US Cities zip codes](https://www.unitedstateszipcodes.org/zip-code-database/)
* [2] [Forsquare API](https://developer.foursquare.com/)
* [3] [Folium"](https://python-visualization.github.io/folium/)
* [4] [Google Map](https://www.google.com/maps/)