# Capstone Project: The Battle of Neighborhoods (Week 1)

## Instructions
Now that you have been equipped with the skills and the tools to use location data to explore a geographical location, over the course of two weeks, you will have the opportunity to be as creative as you want and come up with an idea to leverage the Foursquare location data to explore or compare neighborhoods or cities of your choice or to come up with a problem that you can use the Foursquare location data to solve. If you cannot think of an idea or a problem, here are some ideas to get you started:

In Module 3, we explored New York City and the city of Toronto and segmented and clustered their neighborhoods. Both cities are very diverse and are the financial capitals of their respective countries. One interesting idea would be to compare the neighborhoods of the two cities and determine how similar or dissimilar they are. Is New York City more like Toronto or Paris or some other multicultural city? I will leave it to you to refine this idea.
In a city of your choice, if someone is looking to open a restaurant, where would you recommend that they open it? Similarly, if a contractor is trying to start their own business, where would you recommend that they set up their office?

The U.S. has been a land of opportunity for many who seek a better life and a fresh start. Not all migrants have technical skills or are willing to join the corporate world of a very different culture. Starting one's own business, such as a restaurant, may be more viable for such individuals.
This project demonstrates how the basic law of supply and demand can be used to analyze statistical data.
We will use the greater area of Dallas and narrow down to a few feasible neighborhoods.  The same process can be repeated with other areas of interest.
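
One way to make the supply-and-demand idea concrete is to compare the number of residents (a rough proxy for demand) with the number of venues already serving them (supply). The sketch below is only an illustration under that assumption. The `residents_per_venue` helper is a hypothetical name, and the figures are the Addison (ZIP 75001) and Carrollton (ZIP 75006) population and Asian-venue counts reported in this notebook's tables.

```python
def residents_per_venue(population, venue_count):
    """Return residents per existing venue; a higher value hints at less competition."""
    if venue_count == 0:
        return float("inf")  # no existing supply at all
    return population / venue_count


# Figures copied from this notebook's tables: ZIP-level population and
# the number of Asian venues returned for the 3000 m search radius.
areas = {
    "Addison": {"population": 12414, "asian_venues": 38},
    "Carrollton": {"population": 46364, "asian_venues": 65},
}

ratios = {
    name: residents_per_venue(values["population"], values["asian_venues"])
    for name, values in areas.items()
}

# List the areas from least to most crowded market.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.0f} residents per Asian venue")
```

The ratio is a crude indicator: the search radius can reach beyond the ZIP code whose population is used, so it is best read as a ranking aid rather than a measurement.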

# Start of Battle of Neighborhoods - code section (part of Week 2)

In [1]:
# Import libraries - resolving the conda environment can take a while
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle HTTP requests (used here for the Foursquare API)
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if it is already installed
import folium # map rendering library

print('Libraries imported!')



## Import data and prepare a dataframe of Dallas neighborhoods

Next, we import the postal (ZIP) codes for Dallas, along with the corresponding neighborhoods and their latitude/longitude coordinates.

In [2]:
# Import zip codes, lat/longs and cities/neighborhoods of whole TX 
raw_zip = pd.read_csv('texas_zips.csv')
#raw_zip.set_index("city")

print(raw_zip.dtypes)
##print(raw_zip.shape)
##print(raw_zip.columns)
raw_zip.head()
#raw_zip

zip                     int64
lat                   float64
lng                   float64
city                   object
state_id               object
state_name             object
zcta                     bool
parent_zcta           float64
population              int64
density               float64
county_fips             int64
county_name            object
all_county_weights     object
imprecise                bool
military                 bool
timezone               object
dtype: object


Unnamed: 0,zip,lat,lng,city,state_id,state_name,zcta,parent_zcta,population,density,county_fips,county_name,all_county_weights,imprecise,military,timezone
0,75001,32.96,-96.83847,Addison,TX,Texas,True,,12414,1250.2,48113,Dallas,{'48113':100},False,False,America/Chicago
1,75002,33.08966,-96.60751,Allen,TX,Texas,True,,63140,655.6,48085,Collin,{'48085':100},False,False,America/Chicago
2,75006,32.96188,-96.89701,Carrollton,TX,Texas,True,,46364,1065.0,48113,Dallas,{'48113':100},False,False,America/Chicago
3,75007,33.00462,-96.89714,Carrollton,TX,Texas,True,,51624,1709.9,48121,Denton,"{'48113':5.79,'48121':94.21}",False,False,America/Chicago
4,75009,33.34028,-96.75033,Celina,TX,Texas,True,,8785,35.5,48085,Collin,"{'48085':94.8,'48121':5.2}",False,False,America/Chicago


Clean up the raw data of Texas and create a new dataframe with only Dallas neighborhoods

In [3]:
# Drop the columns we do not need: zcta, parent_zcta, density, county_fips, all_county_weights, imprecise, military, timezone
raw_zip.drop(['zcta','parent_zcta','density','county_fips','all_county_weights','imprecise','military','timezone'], axis=1, inplace=True)
raw_zip=raw_zip[['state_name', 'state_id', 'county_name', 'city', 'zip', 'lat', 'lng', 'population']]

# Rename city-> neighborhood, county-> city, lat/long, state_name->state
raw_zip.rename(columns={'lat':'latitude','lng':'longitude','city':'neighborhood'}, inplace=True)
raw_zip.rename(columns={'county_name':'city','state_name':'state'}, inplace=True)
#raw_zip.set_index("city")


In [4]:
# create a new dataframe with only the Dallas County rows
dallas_zip = raw_zip[raw_zip.city == 'Dallas'].copy()   # after the rename above, 'city' holds the county name

print(dallas_zip.dtypes)
print(dallas_zip.shape)
#print(raw_zip.columns)
dallas_zip.head()       # this is the dataframe to work with

state            object
state_id         object
city             object
neighborhood     object
zip               int64
latitude        float64
longitude       float64
population        int64
dtype: object
(84, 8)


Unnamed: 0,state,state_id,city,neighborhood,zip,latitude,longitude,population
0,Texas,TX,Dallas,Addison,75001,32.96,-96.83847,12414
2,Texas,TX,Dallas,Carrollton,75006,32.96188,-96.89701,46364
7,Texas,TX,Dallas,Coppell,75019,32.96329,-96.98553,38666
18,Texas,TX,Dallas,Irving,75038,32.87458,-96.99758,27802
19,Texas,TX,Dallas,Irving,75039,32.88752,-96.94225,11032


### Render a map of the neighborhoods (cities) in the County of Dallas

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>dallas_explorer</em>, as shown below.

In [5]:
address = 'Dallas, TX'
geolocator = Nominatim(user_agent="dallas_explorer")
location = geolocator.geocode(address)
d_latitude = location.latitude
d_longitude = location.longitude
print('The geographical coordinate of Dallas city is {}, {}.'.format(d_latitude, d_longitude))


The geographical coordinate of Dallas city is 32.7762719, -96.7968559.
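
The geocoding cell above assumes the lookup succeeds. geopy's `geocode` returns `None` when it cannot resolve an address, so a small guard can make the failure explicit. The sketch below is a hedged illustration: `resolve_coordinates` and `fake_geocode` are hypothetical names, and the fake geocoder only stands in for `geolocator.geocode` so the example runs offline.

```python
from types import SimpleNamespace


def resolve_coordinates(geocode, address):
    """Return (latitude, longitude) for an address, or raise if nothing is found.

    `geocode` is any callable that behaves like `Nominatim(...).geocode`:
    it returns an object with `.latitude` and `.longitude`, or None.
    """
    location = geocode(address)
    if location is None:
        raise ValueError(f"Could not geocode address: {address!r}")
    return location.latitude, location.longitude


def fake_geocode(address):
    """Offline stand-in for the real geocoder (illustration only)."""
    known = {
        "Dallas, TX": SimpleNamespace(latitude=32.7762719, longitude=-96.7968559),
    }
    return known.get(address)


d_latitude, d_longitude = resolve_coordinates(fake_geocode, "Dallas, TX")
print(d_latitude, d_longitude)
```

In the notebook itself, the call would presumably be `resolve_coordinates(geolocator.geocode, address)`.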


Create a map of Dallas county (Borough) with cities (neighborhood) superimposed on top.

In [6]:
# create map of Dallas using latitude and longitude values
map_dallas = folium.Map(location=[d_latitude, d_longitude], zoom_start=10)

# add markers to map
for lat, lng, city, neighborhood in zip(dallas_zip['latitude'], dallas_zip['longitude'], dallas_zip['city'], dallas_zip['neighborhood']):
    label = '{}, {}'.format(neighborhood, city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dallas)  
    
map_dallas

## Explore Neighborhoods (cities) venues in the county of Dallas using Foursquare

Explore and cluster the cities in Dallas County.
First, define the Foursquare credentials, the API version, and the query parameters.

In [7]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'         # your Foursquare ID (never publish the real value)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (never publish the real value)
VERSION = '20180605'                # Foursquare API version

radius = 3000                       # search radius in meters around each lat/lng (raised from the 500 m default)
LIMIT = 500                         # maximum number of venues requested per query

Create a function that repeats the venue exploration for every city in Dallas County.

In [8]:
def getNearbyVenues(names, latitudes, longitudes):   # names -> cities in Dallas county; uses the global radius and LIMIT
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?categoryId=4d4b7105d754a06374d81259&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,                # radius = 3000 as set in previous cell
            LIMIT)                 # LIMIT  =  500 as set in previous cell
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
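
The function above builds the query string by hand and indexes straight into the response. As a hedged alternative sketch, the query can be expressed as a parameter dictionary and the response flattened by a helper that tolerates an empty result or a venue without categories. `build_explore_params` and `extract_venue_rows` are hypothetical names, the parameter keys are the ones used in the URL above, and the sample payload is made up to mirror the structure the notebook reads.

```python
def build_explore_params(client_id, client_secret, version, lat, lng,
                         radius, limit, category_id):
    """Query parameters for the venues/explore endpoint used above."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
        "categoryId": category_id,
    }


def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore response into row tuples, tolerating missing parts."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


# Made-up payload with the same nesting the notebook relies on.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Noodle House",
               "location": {"lat": 32.96, "lng": -96.83},
               "categories": [{"name": "Noodle House"}]}},
    {"venue": {"name": "Uncategorized Spot",
               "location": {"lat": 32.97, "lng": -96.84},
               "categories": []}},
]}]}}

params = build_explore_params("ID", "SECRET", "20180605", 32.96, -96.83847,
                              3000, 500, "4d4b7105d754a06374d81259")
rows = extract_venue_rows(sample_payload, "Addison", 32.96, -96.83847)
print(params["ll"], len(rows))

# With network access the request could then be sent as, for example:
# payload = requests.get("https://api.foursquare.com/v2/venues/explore",
#                        params=params, timeout=10).json()
```

Passing `params` lets `requests` handle the URL encoding, which avoids stray characters in a hand-built query string.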

Run the above function for each city and store it in a new dataframe called dallas_venues

In [9]:
dallas_venues = getNearbyVenues(names=dallas_zip['neighborhood'],
                                   latitudes=dallas_zip['latitude'],
                                   longitudes=dallas_zip['longitude']
                                  )

Addison
Carrollton
Coppell
Irving
Irving
Garland
Garland
Garland
Garland
Garland
Sachse
Grand Prairie
Grand Prairie
Grand Prairie
Irving
Irving
Irving
Irving
Richardson
Richardson
Rowlett
Rowlett
Cedar Hill
Desoto
Duncanville
Lancaster
Duncanville
Hutchins
Lancaster
Mesquite
Mesquite
Seagoville
Wilmer
Balch Springs
Mesquite
Sunnyvale
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas


Let's do a sanity check on the resulting data

In [10]:
##print(dallas_venues.shape)
##dallas_venues.head()

Let's check how many venues were returned for each city (neighborhood)

In [11]:
dallas_venues.groupby('Neighborhood').count()        

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Addison,100,100,100,100,100,100
Balch Springs,40,40,40,40,40,40
Carrollton,100,100,100,100,100,100
Cedar Hill,50,50,50,50,50,50
Coppell,75,75,75,75,75,75
Dallas,3518,3518,3518,3518,3518,3518
Desoto,40,40,40,40,40,40
Duncanville,117,117,117,117,117,117
Garland,325,325,325,325,325,325
Grand Prairie,135,135,135,135,135,135
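
Several ZIP codes share the same neighborhood name, and their 3000 m search circles can overlap, so the same venue may be counted more than once in the totals above. The sketch below shows one possible way to check for that with `drop_duplicates`. The toy rows are made up and only mimic the column layout of `dallas_venues`.

```python
import pandas as pd

# Made-up rows in the same column layout as dallas_venues.
toy_venues = pd.DataFrame({
    "Neighborhood": ["Irving", "Irving", "Irving", "Addison"],
    "Venue": ["Cafe A", "Cafe A", "Taqueria B", "Cafe A"],
    "Venue Latitude": [32.88, 32.88, 32.87, 32.96],
    "Venue Longitude": [-96.95, -96.95, -96.99, -96.84],
    "Venue Category": ["Café", "Café", "Mexican Restaurant", "Café"],
})

# Treat a venue as a duplicate when name and coordinates repeat
# within the same neighborhood.
unique_venues = toy_venues.drop_duplicates(
    subset=["Neighborhood", "Venue", "Venue Latitude", "Venue Longitude"]
)

raw_counts = toy_venues.groupby("Neighborhood")["Venue"].count()
unique_counts = unique_venues.groupby("Neighborhood")["Venue"].count()
print(pd.DataFrame({"raw": raw_counts, "unique": unique_counts}))
```

Whether deduplication changes the real counts depends on how close the ZIP centroids are to each other, which this toy example does not establish.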


In [12]:
# Let's find out how many unique categories can be curated from all the returned venues - debug 
print('There are {} unique categories.'.format(len(dallas_venues['Venue Category'].unique())))

There are 86 unique categories.


### Analyze Each Neighborhood

In [13]:
# one hot encoding
dallas_onehot = pd.get_dummies(dallas_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
dallas_onehot['Neighborhood'] = dallas_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [dallas_onehot.columns[-1]] + list(dallas_onehot.columns[:-1])
dallas_onehot = dallas_onehot[fixed_columns]

dallas_onehot.shape   # examine the new dataframe size

(5524, 87)

Next, let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [14]:
dallas_grouped = dallas_onehot.groupby('Neighborhood').mean().reset_index()
dallas_grouped                         # 
dallas_grouped.shape                   # Let's confirm the new size

(19, 87)
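
To see what the one-hot encoding followed by a grouped mean produces, the following toy sketch (made-up data, same column names) may help. It also shows that `pd.crosstab` with `normalize='index'` yields the same per-neighborhood category shares in one step. This is offered as an equivalent formulation, not as the method the notebook uses.

```python
import pandas as pd

# Made-up venues: neighborhood A has four venues, B has two.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Pizza Place", "Pizza Place", "Bakery", "Diner",
                       "Bakery", "Bakery"],
})

# Same steps as the notebook: one-hot encode, then average per neighborhood.
onehot = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])
grouped = onehot.groupby("Neighborhood").mean()

# Equivalent one-step formulation: row-normalized contingency table.
freq = pd.crosstab(toy["Neighborhood"], toy["Venue Category"], normalize="index")

print(grouped)
print(freq)
```

Each row sums to 1, so a value such as 0.5 for "Pizza Place" in neighborhood A means half of the venues returned for A fall in that category.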

Let's print each neighborhood along with its top 20 most common venues

In [15]:
num_top_venues = 20     # normally only the top 10 

for hood in dallas_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = dallas_grouped[dallas_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 3})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


----Addison----
                  venue  freq
0    Mexican Restaurant  0.12
1    Italian Restaurant  0.10
2   American Restaurant  0.08
3          Burger Joint  0.06
4           Pizza Place  0.05
5            Steakhouse  0.04
6        Sandwich Place  0.04
7      Sushi Restaurant  0.03
8                 Diner  0.03
9            Restaurant  0.03
10      Thai Restaurant  0.03
11          Wings Joint  0.03
12   Seafood Restaurant  0.03
13       Breakfast Spot  0.02
14               Bakery  0.02
15        Deli / Bodega  0.02
16  Fried Chicken Joint  0.02
17            BBQ Joint  0.02
18     Asian Restaurant  0.02
19  Japanese Restaurant  0.01


----Balch Springs----
                    venue   freq
0    Fast Food Restaurant  0.275
1     Fried Chicken Joint  0.125
2                    Food  0.100
3      Chinese Restaurant  0.100
4             Pizza Place  0.100
5      Mexican Restaurant  0.050
6              Taco Place  0.050
7                  Bakery  0.025
8              Bagel Shop  0.025


First, let's write a function to sort the venues in descending order.

In [16]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 20 venues for each neighborhood.

In [17]:
num_top_venues = 20     # normally only the top 10 

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:   # only 1st, 2nd and 3rd have their own suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = dallas_grouped['Neighborhood']

for ind in np.arange(dallas_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dallas_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted   # now to eyeball our data

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Addison,Mexican Restaurant,Italian Restaurant,American Restaurant,Burger Joint,Pizza Place,Sandwich Place,Steakhouse,Seafood Restaurant,Diner,Restaurant,Wings Joint,Thai Restaurant,Sushi Restaurant,Bakery,Fried Chicken Joint,Breakfast Spot,Deli / Bodega,BBQ Joint,Asian Restaurant,Mediterranean Restaurant
1,Balch Springs,Fast Food Restaurant,Fried Chicken Joint,Food,Chinese Restaurant,Pizza Place,Taco Place,Mexican Restaurant,Bagel Shop,Bakery,Asian Restaurant,Burger Joint,Diner,Café,Sandwich Place,Seafood Restaurant,Dumpling Restaurant,Ethiopian Restaurant,Fish & Chips Shop,Donut Shop,Wings Joint
2,Carrollton,Fast Food Restaurant,Mexican Restaurant,Korean Restaurant,Pizza Place,Sandwich Place,Sushi Restaurant,Fried Chicken Joint,Café,Burger Joint,Indian Restaurant,Chinese Restaurant,Vietnamese Restaurant,Asian Restaurant,Bakery,Thai Restaurant,Donut Shop,BBQ Joint,Diner,Breakfast Spot,Caribbean Restaurant
3,Cedar Hill,Fast Food Restaurant,American Restaurant,Mexican Restaurant,Pizza Place,Burger Joint,Fried Chicken Joint,Seafood Restaurant,Donut Shop,Breakfast Spot,Italian Restaurant,Wings Joint,Food,Sandwich Place,Café,Sushi Restaurant,Southern / Soul Food Restaurant,Tex-Mex Restaurant,Bakery,BBQ Joint,Asian Restaurant
4,Coppell,Pizza Place,Fast Food Restaurant,American Restaurant,Sandwich Place,Mexican Restaurant,Donut Shop,Tex-Mex Restaurant,Bakery,Burger Joint,BBQ Joint,Food,Indian Restaurant,Café,Peking Duck Restaurant,Deli / Bodega,Diner,Wings Joint,Mediterranean Restaurant,Japanese Restaurant,Chinese Restaurant
5,Dallas,Mexican Restaurant,Fast Food Restaurant,American Restaurant,Pizza Place,Sandwich Place,Burger Joint,Fried Chicken Joint,Taco Place,Seafood Restaurant,Italian Restaurant,BBQ Joint,Restaurant,Chinese Restaurant,Bakery,Steakhouse,New American Restaurant,Breakfast Spot,Diner,Thai Restaurant,Donut Shop
6,Desoto,Donut Shop,Pizza Place,Fast Food Restaurant,American Restaurant,Sandwich Place,Fried Chicken Joint,Seafood Restaurant,Burger Joint,Wings Joint,Snack Place,Fish & Chips Shop,Mexican Restaurant,Chinese Restaurant,Restaurant,Bakery,Tex-Mex Restaurant,Cafeteria,Café,Gastropub,Asian Restaurant
7,Duncanville,Fast Food Restaurant,Pizza Place,American Restaurant,Mexican Restaurant,Fried Chicken Joint,Wings Joint,Chinese Restaurant,BBQ Joint,Italian Restaurant,Seafood Restaurant,Sandwich Place,Donut Shop,Burger Joint,Restaurant,Diner,Sushi Restaurant,Taco Place,Bakery,Bagel Shop,Food Court
8,Garland,Fast Food Restaurant,Mexican Restaurant,Pizza Place,Burger Joint,American Restaurant,Chinese Restaurant,Sandwich Place,Fried Chicken Joint,Donut Shop,Wings Joint,Seafood Restaurant,Restaurant,Taco Place,Food,BBQ Joint,Breakfast Spot,Bakery,Italian Restaurant,Sushi Restaurant,Vietnamese Restaurant
9,Grand Prairie,Fast Food Restaurant,Mexican Restaurant,Pizza Place,Fried Chicken Joint,Bakery,Sandwich Place,American Restaurant,BBQ Joint,Donut Shop,Taco Place,Wings Joint,Chinese Restaurant,Burger Joint,Restaurant,Seafood Restaurant,Italian Restaurant,Diner,Deli / Bodega,Café,Buffet
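
The column labels in the table above come from a three-element suffix list with a fallback to "th". That works for up to 20 columns, but a 21st column would be labelled "21th". A small helper, sketched here under the hypothetical name `ordinal`, handles the general case.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 20
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns[:4], columns[-1])
```

For `num_top_venues = 20` this produces the same labels as the notebook's loop.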


In [18]:
neighborhoods_venues_sorted.to_csv(r'Dallas Venues.csv')

## Now to explore Asian food in the county of Dallas using Foursquare

In [19]:
def getNearbyAsianVenues(names, latitudes, longitudes):   # names -> cities in Dallas county; uses the global radius and LIMIT
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?categoryId=4bf58dd8d48988d142941735&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,                # radius = 3000 as set in cell 7
            LIMIT)                 # LIMIT  =  500 as set in cell 7
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [20]:
dallas_asian_venues = getNearbyAsianVenues(names=dallas_zip['neighborhood'],
                                   latitudes=dallas_zip['latitude'],
                                   longitudes=dallas_zip['longitude']
                                  )

Addison
Carrollton
Coppell
Irving
Irving
Garland
Garland
Garland
Garland
Garland
Sachse
Grand Prairie
Grand Prairie
Grand Prairie
Irving
Irving
Irving
Irving
Richardson
Richardson
Rowlett
Rowlett
Cedar Hill
Desoto
Duncanville
Lancaster
Duncanville
Hutchins
Lancaster
Mesquite
Mesquite
Seagoville
Wilmer
Balch Springs
Mesquite
Sunnyvale
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas
Dallas


In [21]:
dallas_asian_venues.groupby('Neighborhood').count()     

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Addison,38,38,38,38,38,38
Balch Springs,5,5,5,5,5,5
Carrollton,65,65,65,65,65,65
Cedar Hill,3,3,3,3,3,3
Coppell,9,9,9,9,9,9
Dallas,1235,1235,1235,1235,1235,1235
Desoto,3,3,3,3,3,3
Duncanville,20,20,20,20,20,20
Garland,70,70,70,70,70,70
Grand Prairie,15,15,15,15,15,15


In [22]:
# one hot encoding
dallas_asian_onehot = pd.get_dummies(dallas_asian_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
dallas_asian_onehot['Neighborhood'] = dallas_asian_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [dallas_asian_onehot.columns[-1]] + list(dallas_asian_onehot.columns[:-1])
dallas_asian_onehot = dallas_asian_onehot[fixed_columns]

dallas_asian_onehot.shape   # examine the new dataframe size

(1736, 37)

In [23]:
dallas_asian_grouped = dallas_asian_onehot.groupby('Neighborhood').mean().reset_index()
dallas_asian_grouped                         # 
dallas_asian_grouped.shape                   # Let's confirm the new size

(18, 37)

In [24]:
num_top_venues = 20     # normally only the top 10 

for hood in dallas_asian_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = dallas_asian_grouped[dallas_asian_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 3})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


----Addison----
                            venue   freq
0                Asian Restaurant  0.421
1              Chinese Restaurant  0.158
2                 Thai Restaurant  0.132
3                Sushi Restaurant  0.105
4             Japanese Restaurant  0.079
5           Vietnamese Restaurant  0.053
6                    Noodle House  0.026
7               Korean Restaurant  0.026
8                  Sandwich Place  0.000
9                Ramen Restaurant  0.000
10                     Restaurant  0.000
11                    Salad Place  0.000
12            American Restaurant  0.000
13             Seafood Restaurant  0.000
14                     Poke Place  0.000
15            Szechuan Restaurant  0.000
16           Taiwanese Restaurant  0.000
17             Tianjin Restaurant  0.000
18  Vegetarian / Vegan Restaurant  0.000
19                     Soup Place  0.000


----Balch Springs----
                            venue  freq
0              Chinese Restaurant   0.6
1                As

In [25]:
def return_most_common_asian_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [26]:
num_top_venues = 20     # normally only the top 10 

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:   # only 1st, 2nd and 3rd have their own suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_asian_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_asian_venues_sorted['Neighborhood'] = dallas_asian_grouped['Neighborhood']

for ind in np.arange(dallas_asian_grouped.shape[0]):
    neighborhoods_asian_venues_sorted.iloc[ind, 1:] = return_most_common_asian_venues(dallas_asian_grouped.iloc[ind, :], num_top_venues)

neighborhoods_asian_venues_sorted   # now to eyeball our data

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Addison,Asian Restaurant,Chinese Restaurant,Thai Restaurant,Sushi Restaurant,Japanese Restaurant,Vietnamese Restaurant,Korean Restaurant,Noodle House,Food Truck,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Wings Joint,Filipino Restaurant,Fast Food Restaurant,Dim Sum Restaurant,Burmese Restaurant,Buffet,Bakery,Diner
1,Balch Springs,Chinese Restaurant,Asian Restaurant,Wings Joint,Filipino Restaurant,Karaoke Bar,Japanese Restaurant,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Food Truck,Fast Food Restaurant,Vietnamese Restaurant,Diner,Dim Sum Restaurant,Burmese Restaurant,Buffet,Bakery,Korean Restaurant,Mongolian Restaurant,New American Restaurant
2,Carrollton,Korean Restaurant,Asian Restaurant,Chinese Restaurant,Vietnamese Restaurant,Sushi Restaurant,Thai Restaurant,Ramen Restaurant,Japanese Restaurant,Karaoke Bar,Soup Place,Noodle House,Filipino Restaurant,Fried Chicken Joint,Food Truck,Wings Joint,Fast Food Restaurant,Himalayan Restaurant,Dim Sum Restaurant,Burmese Restaurant,Buffet
3,Cedar Hill,Asian Restaurant,Chinese Restaurant,Sushi Restaurant,Wings Joint,Filipino Restaurant,Japanese Restaurant,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Food Truck,Diner,Fast Food Restaurant,Korean Restaurant,Dim Sum Restaurant,Burmese Restaurant,Buffet,Bakery,Karaoke Bar,Mongolian Restaurant,Vietnamese Restaurant
4,Coppell,Vietnamese Restaurant,Chinese Restaurant,Japanese Restaurant,Asian Restaurant,Thai Restaurant,Sushi Restaurant,Peking Duck Restaurant,Filipino Restaurant,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Food Truck,Wings Joint,Fast Food Restaurant,Karaoke Bar,Dim Sum Restaurant,Burmese Restaurant,Buffet,Bakery,Diner
5,Dallas,Asian Restaurant,Chinese Restaurant,Sushi Restaurant,Thai Restaurant,Japanese Restaurant,Vietnamese Restaurant,Korean Restaurant,Ramen Restaurant,Noodle House,Food Truck,Fast Food Restaurant,Seafood Restaurant,Taiwanese Restaurant,Fried Chicken Joint,Sandwich Place,New American Restaurant,American Restaurant,Vegetarian / Vegan Restaurant,Poke Place,Szechuan Restaurant
6,Desoto,Chinese Restaurant,Thai Restaurant,Wings Joint,Fast Food Restaurant,Japanese Restaurant,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Food Truck,Filipino Restaurant,Diner,Korean Restaurant,Dim Sum Restaurant,Burmese Restaurant,Buffet,Bakery,Asian Restaurant,Karaoke Bar,Mongolian Restaurant,Vietnamese Restaurant
7,Duncanville,Chinese Restaurant,Asian Restaurant,Vietnamese Restaurant,Sushi Restaurant,Mongolian Restaurant,Burmese Restaurant,Buffet,Bakery,Dim Sum Restaurant,Korean Restaurant,Fast Food Restaurant,Filipino Restaurant,Food Truck,Fried Chicken Joint,Himalayan Restaurant,Indian Restaurant,Japanese Restaurant,Karaoke Bar,Diner,Wings Joint
8,Garland,Chinese Restaurant,Vietnamese Restaurant,Asian Restaurant,Sushi Restaurant,Thai Restaurant,Korean Restaurant,Tianjin Restaurant,Filipino Restaurant,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Food Truck,Wings Joint,Fast Food Restaurant,Diner,Japanese Restaurant,Burmese Restaurant,Buffet,Bakery,Dim Sum Restaurant
9,Grand Prairie,Chinese Restaurant,Asian Restaurant,Japanese Restaurant,Filipino Restaurant,Sushi Restaurant,Wings Joint,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Food Truck,Diner,Fast Food Restaurant,Korean Restaurant,Dim Sum Restaurant,Burmese Restaurant,Buffet,Bakery,Karaoke Bar,Mongolian Restaurant,Vietnamese Restaurant


In [27]:
neighborhoods_asian_venues_sorted.to_csv(r'Dallas Asian Venues.csv') 

End here for Asian Venues

### Visualization of Cluster venues

Run *k*-means to cluster the 18 neighborhoods that have Asian venues into 5 clusters.

In [28]:

kclusters = 5           # set number of clusters

# new data frame for grouping by neighborhood
dallas_asian_grouped_clustering = dallas_asian_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dallas_asian_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:18] 

array([3, 1, 4, 3, 3, 3, 1, 1, 3, 1, 2, 3, 1, 1, 3, 1, 4, 0], dtype=int32)
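
A quick way to read the label array is to count how many neighborhoods land in each cluster. The sketch below copies the labels printed above and tallies them with `collections.Counter`.

```python
from collections import Counter

# Labels copied from the k-means output above, one per neighborhood,
# in the row order of dallas_asian_grouped.
labels = [3, 1, 4, 3, 3, 3, 1, 1, 3, 1, 2, 3, 1, 1, 3, 1, 4, 0]

cluster_sizes = Counter(labels)
for cluster in sorted(cluster_sizes):
    print(f"cluster {cluster}: {cluster_sizes[cluster]} neighborhoods")
```

Clusters with a single member may indicate neighborhoods whose venue mix differs markedly from the rest, or simply that 5 clusters is a lot for 18 rows. The tally alone cannot distinguish the two.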

Let's create a new dataframe that includes the cluster as well as the top 20 venues for each neighborhood.

In [29]:
# add clustering labels
#backup_neighborhoods_asian_venues_sorted=neighborhoods_asian_venues_sorted # debug - create a backup 
neighborhoods_asian_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dallas_merged = dallas_zip

# join the sorted venue table (with cluster labels) onto dallas_zip to add latitude/longitude for each neighborhood
dallas_merged = dallas_merged.join(neighborhoods_asian_venues_sorted.set_index('Neighborhood'), on='neighborhood')
dallas_merged.dropna(axis = 0, inplace = True)             # in case NaN shows up after merging
dallas_merged['Cluster Labels'] = dallas_merged['Cluster Labels'].astype('int64') # Cluster Labels must be int

print(dallas_merged.dtypes)
dallas_merged                      # check the columns!  


state                      object
state_id                   object
city                       object
neighborhood               object
zip                         int64
latitude                  float64
longitude                 float64
population                  int64
Cluster Labels              int64
1st Most Common Venue      object
2nd Most Common Venue      object
3rd Most Common Venue      object
4th Most Common Venue      object
5th Most Common Venue      object
6th Most Common Venue      object
7th Most Common Venue      object
8th Most Common Venue      object
9th Most Common Venue      object
10th Most Common Venue     object
11th Most Common Venue     object
12th Most Common Venue     object
13th Most Common Venue     object
14th Most Common Venue     object
15th Most Common Venue     object
16th Most Common Venue     object
17th Most Common Venue     object
18th Most Common Venue     object
19th Most Common Venue     object
20th Most Common Venue     object
dtype: object


Unnamed: 0,state,state_id,city,neighborhood,zip,latitude,longitude,population,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Texas,TX,Dallas,Addison,75001,32.96,-96.83847,12414,3,Asian Restaurant,Chinese Restaurant,Thai Restaurant,Sushi Restaurant,Japanese Restaurant,Vietnamese Restaurant,Korean Restaurant,Noodle House,Food Truck,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Wings Joint,Filipino Restaurant,Fast Food Restaurant,Dim Sum Restaurant,Burmese Restaurant,Buffet,Bakery,Diner
2,Texas,TX,Dallas,Carrollton,75006,32.96188,-96.89701,46364,4,Korean Restaurant,Asian Restaurant,Chinese Restaurant,Vietnamese Restaurant,Sushi Restaurant,Thai Restaurant,Ramen Restaurant,Japanese Restaurant,Karaoke Bar,Soup Place,Noodle House,Filipino Restaurant,Fried Chicken Joint,Food Truck,Wings Joint,Fast Food Restaurant,Himalayan Restaurant,Dim Sum Restaurant,Burmese Restaurant,Buffet
7,Texas,TX,Dallas,Coppell,75019,32.96329,-96.98553,38666,3,Vietnamese Restaurant,Chinese Restaurant,Japanese Restaurant,Asian Restaurant,Thai Restaurant,Sushi Restaurant,Peking Duck Restaurant,Filipino Restaurant,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Food Truck,Wings Joint,Fast Food Restaurant,Karaoke Bar,Dim Sum Restaurant,Burmese Restaurant,Buffet,Bakery,Diner
18,Texas,TX,Dallas,Irving,75038,32.87458,-96.99758,27802,3,Asian Restaurant,Chinese Restaurant,Thai Restaurant,Japanese Restaurant,Sushi Restaurant,Korean Restaurant,Vietnamese Restaurant,Indian Restaurant,Himalayan Restaurant,Sandwich Place,Bakery,Wings Joint,Dim Sum Restaurant,Diner,Fast Food Restaurant,Filipino Restaurant,Food Truck,Fried Chicken Joint,Burmese Restaurant,Buffet
19,Texas,TX,Dallas,Irving,75039,32.88752,-96.94225,11032,3,Asian Restaurant,Chinese Restaurant,Thai Restaurant,Japanese Restaurant,Sushi Restaurant,Korean Restaurant,Vietnamese Restaurant,Indian Restaurant,Himalayan Restaurant,Sandwich Place,Bakery,Wings Joint,Dim Sum Restaurant,Diner,Fast Food Restaurant,Filipino Restaurant,Food Truck,Fried Chicken Joint,Burmese Restaurant,Buffet
20,Texas,TX,Dallas,Garland,75040,32.92766,-96.62008,59406,3,Chinese Restaurant,Vietnamese Restaurant,Asian Restaurant,Sushi Restaurant,Thai Restaurant,Korean Restaurant,Tianjin Restaurant,Filipino Restaurant,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Food Truck,Wings Joint,Fast Food Restaurant,Diner,Japanese Restaurant,Burmese Restaurant,Buffet,Bakery,Dim Sum Restaurant
21,Texas,TX,Dallas,Garland,75041,32.88091,-96.65147,30700,3,Chinese Restaurant,Vietnamese Restaurant,Asian Restaurant,Sushi Restaurant,Thai Restaurant,Korean Restaurant,Tianjin Restaurant,Filipino Restaurant,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Food Truck,Wings Joint,Fast Food Restaurant,Diner,Japanese Restaurant,Burmese Restaurant,Buffet,Bakery,Dim Sum Restaurant
22,Texas,TX,Dallas,Garland,75042,32.9139,-96.67493,37881,3,Chinese Restaurant,Vietnamese Restaurant,Asian Restaurant,Sushi Restaurant,Thai Restaurant,Korean Restaurant,Tianjin Restaurant,Filipino Restaurant,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Food Truck,Wings Joint,Fast Food Restaurant,Diner,Japanese Restaurant,Burmese Restaurant,Buffet,Bakery,Dim Sum Restaurant
23,Texas,TX,Dallas,Garland,75043,32.85707,-96.57941,58094,3,Chinese Restaurant,Vietnamese Restaurant,Asian Restaurant,Sushi Restaurant,Thai Restaurant,Korean Restaurant,Tianjin Restaurant,Filipino Restaurant,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Food Truck,Wings Joint,Fast Food Restaurant,Diner,Japanese Restaurant,Burmese Restaurant,Buffet,Bakery,Dim Sum Restaurant
24,Texas,TX,Dallas,Garland,75044,32.96264,-96.65323,40811,3,Chinese Restaurant,Vietnamese Restaurant,Asian Restaurant,Sushi Restaurant,Thai Restaurant,Korean Restaurant,Tianjin Restaurant,Filipino Restaurant,Indian Restaurant,Himalayan Restaurant,Fried Chicken Joint,Food Truck,Wings Joint,Fast Food Restaurant,Diner,Japanese Restaurant,Burmese Restaurant,Buffet,Bakery,Dim Sum Restaurant
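The `dropna` and `astype('int64')` steps in the cell above are there because a left join fills unmatched rows with NaN, and a NaN in an integer column makes pandas store the whole column as float. A minimal sketch of that behavior, using two tiny hypothetical frames in place of `dallas_zip` and `neighborhoods_asian_venues_sorted` (the "Nowhere" row and its zip code are made up so that one row has no match):

```python
import pandas as pd

# Hypothetical miniature stand-in for dallas_zip
zips = pd.DataFrame({
    "neighborhood": ["Addison", "Carrollton", "Nowhere"],
    "zip": [75001, 75006, 75999],   # 75999 / "Nowhere" is invented for the example
})

# Hypothetical miniature stand-in for neighborhoods_asian_venues_sorted
venues = pd.DataFrame({
    "Neighborhood": ["Addison", "Carrollton"],
    "Cluster Labels": [3, 4],
})

# Left join on the neighborhood name; "Nowhere" has no match and gets NaN
joined = zips.join(venues.set_index("Neighborhood"), on="neighborhood")
dtype_after_join = str(joined["Cluster Labels"].dtype)
print(dtype_after_join)             # float, because of the NaN

# Drop the unmatched row, then restore the integer type
cleaned = joined.dropna(axis=0).copy()
cleaned["Cluster Labels"] = cleaned["Cluster Labels"].astype("int64")
print(cleaned)
```

Casting back to an integer type matters when the label is later used as a list index, since a float index raises a `TypeError`.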


In [30]:
# create map of Dallas using latitude and longitude values
map_dallas = folium.Map(location=[d_latitude, d_longitude], zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one circle marker per row, colored by its cluster label
for lat, lon, poi, cluster in zip(dallas_merged['latitude'], dallas_merged['longitude'], dallas_merged['neighborhood'], dallas_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],          # labels run 0..kclusters-1, so index the color list directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_dallas)

map_dallas
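
The marker loop picks each color by indexing a list of `kclusters` colors with the cluster label. The sketch below, which uses hypothetical color names in place of the hex strings in `rainbow`, compares two index choices for labels 0 to 4. Indexing with `label - 1` still gives every cluster its own color, but only because Python's negative indexing sends label 0 to the last entry; indexing with `label` avoids relying on that wrap-around.

```python
# Hypothetical stand-in for the `rainbow` list of hex colors (kclusters = 5)
palette = ["violet", "blue", "green", "orange", "red"]
labels = range(len(palette))          # k-means labels run 0 .. kclusters-1

# Color chosen for each label under the two indexing schemes
direct = {label: palette[label] for label in labels}
shifted = {label: palette[label - 1] for label in labels}   # label 0 -> palette[-1]

for label in labels:
    print(label, direct[label], shifted[label])
```

Both schemes are valid for drawing the map; they only differ in which color ends up on which cluster.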

# End of Battle of Neighborhoods - code section (part of Week 2)