<h1>Choosing a suburb city when you work in Dallas</h1>

<h3>Introduction / Business Problem</h3>

Dallas is a major city that concentrates many businesses and corporate headquarters and, therefore, hundreds of thousands of jobs.<br>
Housing in the city is either very expensive or poorly served in terms of services and infrastructure, which is why many people who work in Dallas look for houses in the surrounding suburb cities, where the combination of lower prices and available services is more attractive.
<p>
This Capstone project explores, segments and clusters the suburb cities in the Dallas-Fort Worth Metroplex area located within 35 miles of downtown Dallas. Cities are grouped by similarity in their offer of education, healthcare, shopping, outdoor options, entertainment, restaurants and hotels, in order to indicate which cities best fit a homebuyer's interests.

<h3>Data Description</h3>

To explore, segment and cluster the suburb cities in the Dallas-Fort Worth Metroplex area, the following data is obtained and the following steps are executed:<br>
<ol>
<li>Produce a CSV file with the list of all the cities in the DFW area from <a href='https://www.hdavidballinger.com/dfw-metroplex.php'>https://www.hdavidballinger.com/dfw-metroplex.php</a> and load it into a dataframe. The list provides the population and the distance from downtown Dallas for each city.</li>
<li>Filter out Dallas, Fort Worth and cities farther than 35 miles from downtown Dallas.</li>    
<li>Obtain, via API, the geolocation of each city and add it to the dataframe.</li>
<li>Obtain the nearby POIs from Foursquare, via API, for the target cities.</li>
<li>Pivot the POIs to count the number of schools, hospitals, groceries / malls, parks, theaters, restaurants and hotels per city.</li>    
</ol>
This data then makes it possible to segment and cluster the suburb cities and to analyze the profile of each one in terms of services offered.

<b>Loading list of cities from CSV</b>

The block below reads the list of cities in the DFW area from a CSV file into a dataframe, filters out cities with more than 400K population or farther than 35 miles from downtown Dallas, and displays the result.

In [1]:
import pandas as pd 
import numpy as np
  
url = 'https://raw.githubusercontent.com/jzanardo/Coursera_Capstone/master/dfw-cities.csv'

df = pd.read_csv(url)

df = df[df.distance <= 35]
df = df[df.population <= 400000]
df = df.reset_index()

display(df)

Unnamed: 0,index,city,population,distance,county
0,0,Addison,15500.0,16,Dallas
1,2,Allen,98100.0,26,Collin
2,11,Arlington,388100.0,21,Tarrant
3,15,Balch Springs,25200.0,15,Dallas
4,17,Bartonville,1700.0,33,Denton
...,...,...,...,...,...
87,189,Watauga,24500.0,32,Tarrant
88,190,Waxahachie,33400.0,30,Ellis
89,193,Westlake,1300.0,31,Tarrant
90,199,Wilmer,3900.0,16,Dallas


<b>Associating Coordinates</b>

This block uses the Nominatim geocoding service (via geopy) to associate coordinates with each city in the dataframe.

In [2]:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="dal_explorer")

for index in range(len(df)):
    address = df.loc[index,'city'] + ', TX'
    location = geolocator.geocode(address)
    # geocode() returns None when no match is found; leave the coordinates empty in that case
    if location is None:
        continue
    df.at[index,'latitude'] = location.latitude
    df.at[index,'longitude'] = location.longitude

display(df)

Unnamed: 0,index,city,population,distance,county,latitude,longitude
0,0,Addison,15500.0,16,Dallas,32.960431,-96.830260
1,2,Allen,98100.0,26,Collin,33.103174,-96.670550
2,11,Arlington,388100.0,21,Tarrant,32.701939,-97.105624
3,15,Balch Springs,25200.0,15,Dallas,32.728741,-96.622771
4,17,Bartonville,1700.0,33,Denton,33.073177,-97.131679
...,...,...,...,...,...,...,...
87,189,Watauga,24500.0,32,Tarrant,32.857906,-97.254737
88,190,Waxahachie,33400.0,30,Ellis,32.394491,-96.843936
89,193,Westlake,1300.0,31,Tarrant,32.991226,-97.194370
90,199,Wilmer,3900.0,16,Dallas,32.589024,-96.685272
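A free-text query such as `'<city>, TX'` can occasionally resolve to a different place with the same name, so a quick plausibility check on the returned coordinates may be worthwhile. The sketch below is not part of the original notebook: it computes the great-circle (haversine) distance from downtown Dallas and flags points that fall well outside the study area. The Dallas reference coordinates and the 50-mile tolerance are assumptions chosen for illustration; a longitude near -100.9, for instance, would fail the check.

```python
import math

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Assumed reference point: approximate coordinates of downtown Dallas
DALLAS = (32.7767, -96.7970)

def is_plausible(lat, lon, max_miles=50):
    """True when the point lies within max_miles (an assumed tolerance) of DALLAS."""
    return haversine_miles(DALLAS[0], DALLAS[1], lat, lon) <= max_miles

# Addison, using the coordinates shown in the table above
print(is_plausible(32.960431, -96.830260))
```

Rows that fail the check could then be geocoded again with a more specific query (for example, including the county name) or corrected by hand.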


<h3>Displaying all cities on the map with folium...</h3>

In [3]:
import json # library to handle JSON files

!pip install folium
import folium # map rendering library

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
     |████████████████████████████████| 94 kB 6.8 MB/s  eta 0:00:01
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1


In [4]:
# create map of DFW area using latitude and longitude values
location = geolocator.geocode('Dallas, TX')
latitude = location.latitude
longitude = location.longitude

map_dfw = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, city, population in zip(df['latitude'], df['longitude'], df['city'], df['population']):
    label = '{}, {}'.format(city, str(population/1000) + 'K')
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_dfw)  
    
map_dfw

<h3>Foursquare credentials to connect the API...</h3>

In [5]:
import os

# Credentials are read from environment variables (the variable names are an assumption)
# so that the secret is never stored or printed in the notebook
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value


<h3>Gathering and formatting nearby venues within a radius of 6 KM...</h3>

In [6]:
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

def getNearbyVenues(names, latitudes, longitudes, radius=6000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['city', 
                  'city latitude', 
                  'city longitude', 
                  'venue', 
                  'venue latitude', 
                  'venue longitude', 
                  'venue category']
    
    return(nearby_venues)
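The request URL in `getNearbyVenues` is assembled with string formatting. As a hedged alternative, the sketch below builds the same query string with the standard library's `urlencode`, which escapes the values and keeps the separators consistent. The helper name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used in the function above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius=6000, limit=100):
    """Return the explore URL for one city centre (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude of the city
        'radius': radius,                # search radius in meters
        'limit': limit,                  # maximum number of venues returned
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

# Placeholder credentials; Addison's coordinates are taken from the dataframe above
url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605', 32.960431, -96.83026)
print(url)
```

Equivalently, the same dictionary can be passed to `requests.get(EXPLORE_ENDPOINT, params=params)`, which performs the encoding itself.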

In [7]:
dfw_venues = getNearbyVenues(names=df['city'],
                                   latitudes=df['latitude'],
                                   longitudes=df['longitude']
                                  )

Addison
Allen
Arlington
Balch Springs
Bartonville
Bedford
Carrollton
Cedar Hill
Cockrell Hill
Colleyville
Combine
Coppell
Copper Canyon
Corinth
Crandall
Dalworthington Gardens
DeSoto
Double Oak
Duncanville
Ennis
Euless
Fairview
Farmers Branch
Fate
Ferris
Flower Mound
Forest Hill
Forney
Frisco
Garland
Garrett
Glenn Heights
Grand Prairie
Grapevine
Hackberry
Haltom City
Heath
Hebron
Hickory Creek
Highland Park
Highland Village
Hurst
Hutchins
Irving
Kaufman
Keller
Kennedale
Lake Dallas
Lancaster
Lantana
Lavon
Lewisville
Lucas
Mansfield
McKinney
McLendon Chisholm
Mesquite
Midlothian
Mobile City
Murphy
North Richland Hills
Oak Leaf
Ovilla
Palmer
Pantego
Parker
Plano
Red Oak
Richardson
Richland Hills
Roanoke
Rockwall
Rowlett
Royse City
Sachse
Scurry
Seagoville
Shady Shores
Southlake
St. Paul
Sunnyvale
Talty
Terrell
The Colony
Trophy Club
University Park
Venus
Watauga
Waxahachie
Westlake
Wilmer
Wylie


In [10]:
print(dfw_venues.shape)
dfw_venues.head()

(6943, 7)


Unnamed: 0,city,city latitude,city longitude,venue,venue latitude,venue longitude,venue category
0,Addison,32.960431,-96.83026,Addison Circle Park,32.960917,-96.826488,Park
1,Addison,32.960431,-96.83026,Ida Claire,32.954487,-96.825878,Southern / Soul Food Restaurant
2,Addison,32.960431,-96.83026,Kenny's Wood Fired Grill,32.953615,-96.823573,American Restaurant
3,Addison,32.960431,-96.83026,Texas de Brazil,32.954592,-96.830206,Brazilian Restaurant
4,Addison,32.960431,-96.83026,Mr. Sushi,32.953424,-96.829081,Sushi Restaurant


<h3>Creating a master category for the venues based on their category and filtering "Other" out...</h3>

In [25]:
# Keywords are checked in order against the venue category; the first matching group wins
MASTER_CATEGORY_KEYWORDS = [
    ('Food', ['Restaurant', 'Bar']),
    ('Outdoor', ['Park', 'Trail', 'Court']),
    ('Entertainment', ['Museum', 'Theater', 'Stadium']),
    ('Shopping', ['Shop', 'Store', 'Mall', 'Service', 'Market', 'Grocery']),
    ('Education', ['School', 'Library', 'University']),
    ('Health', ['Medical']),
    ('Hotels', ['Hotel', 'Motel', 'Resort']),
]

def get_master_category(venue_category):
    for master_category, keywords in MASTER_CATEGORY_KEYWORDS:
        if any(keyword in venue_category for keyword in keywords):
            return master_category
    return 'Other'

dfw_venues['masterCategory'] = dfw_venues['venue category'].apply(get_master_category)

dfw_venues = dfw_venues[dfw_venues.masterCategory != 'Other']
dfw_venues = dfw_venues.reset_index()
dfw_venues.rename(columns = {'masterCategory':'master category'}, inplace = True)
    
display(dfw_venues)

Unnamed: 0,level_0,index,city,city latitude,city longitude,venue,venue latitude,venue longitude,venue category,master category
0,0,0,Addison,32.960431,-96.830260,Addison Circle Park,32.960917,-96.826488,Park,Outdoor
1,1,1,Addison,32.960431,-96.830260,Ida Claire,32.954487,-96.825878,Southern / Soul Food Restaurant,Food
2,2,2,Addison,32.960431,-96.830260,Texas de Brazil,32.954592,-96.830206,Brazilian Restaurant,Food
3,3,3,Addison,32.960431,-96.830260,Kenny's Wood Fired Grill,32.953615,-96.823573,American Restaurant,Food
4,4,4,Addison,32.960431,-96.830260,Mr. Sushi,32.953424,-96.829081,Sushi Restaurant,Food
...,...,...,...,...,...,...,...,...,...,...
4377,4377,6913,Wylie,33.015120,-96.538879,The Rock Wood Fired Pizza,33.010535,-96.575365,Restaurant,Food
4378,4378,6915,Wylie,33.015120,-96.538879,Dollar Tree,33.022714,-96.510629,Discount Store,Shopping
4379,4379,6916,Wylie,33.015120,-96.538879,Woodbridge Trails,32.991589,-96.583112,Trail,Outdoor
4380,4380,6917,Wylie,33.015120,-96.538879,The Harbor House,33.049679,-96.534218,American Restaurant,Food


<h3>Counting venues by master category...</h3>

In [26]:
# one hot encoding
dfw_onehot = pd.get_dummies(dfw_venues[['master category']], prefix="", prefix_sep="")

# add city column back to dataframe
dfw_onehot['city'] = dfw_venues['city'] 

# move city column to the first column
fixed_columns = [dfw_onehot.columns[-1]] + list(dfw_onehot.columns[:-1])
dfw_onehot = dfw_onehot[fixed_columns]

dfw_onehot.head()

Unnamed: 0,city,Education,Entertainment,Food,Health,Hotels,Outdoor,Shopping
0,Addison,0,0,0,0,0,1,0
1,Addison,0,0,1,0,0,0,0
2,Addison,0,0,1,0,0,0,0
3,Addison,0,0,1,0,0,0,0
4,Addison,0,0,1,0,0,0,0


<h3>Grouping the dataset by city and ranking master venue types, in order to perform the clustering...</h3>

In [27]:
dfw_grouped = dfw_onehot.groupby('city').mean().reset_index()
dfw_grouped
dfw_grouped.shape

(91, 8)
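Taking the mean of the one-hot columns per city yields, for each city, the share of its venues that falls in each master category, and these shares are the features later passed to the clustering. The sketch below reproduces that computation in plain Python to make the meaning of the values explicit. The function name is hypothetical, and the sample rows are the five Addison rows shown in the venue table above, not the full dataset.

```python
from collections import Counter, defaultdict

CATEGORIES = ['Education', 'Entertainment', 'Food', 'Health', 'Hotels', 'Outdoor', 'Shopping']

def category_frequencies(rows):
    """rows: iterable of (city, master_category) pairs. Returns {city: {category: share}}."""
    counts = defaultdict(Counter)
    for city, category in rows:
        counts[city][category] += 1
    result = {}
    for city, counter in counts.items():
        total = sum(counter.values())
        # share of the city's venues in each category (0 for absent categories)
        result[city] = {c: counter[c] / total for c in CATEGORIES}
    return result

# Small sample only: the first five Addison venues from the table above
sample = [
    ('Addison', 'Outdoor'),
    ('Addison', 'Food'),
    ('Addison', 'Food'),
    ('Addison', 'Food'),
    ('Addison', 'Food'),
]
freq = category_frequencies(sample)
print(freq['Addison'])
```

Because each row sums to 1, cities are compared by the mix of their venues rather than by the absolute number of venues.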

In [28]:
import numpy as np

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 7

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['city']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
cities_venues_sorted = pd.DataFrame(columns=columns)
cities_venues_sorted['city'] = dfw_grouped['city']

for ind in np.arange(dfw_grouped.shape[0]):
    cities_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dfw_grouped.iloc[ind, :], num_top_venues)

cities_venues_sorted.head()

Unnamed: 0,city,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,Addison,Food,Shopping,Outdoor,Entertainment,Hotels,Health,Education
1,Allen,Shopping,Food,Outdoor,Entertainment,Hotels,Health,Education
2,Arlington,Food,Shopping,Outdoor,Hotels,Health,Entertainment,Education
3,Balch Springs,Shopping,Food,Outdoor,Entertainment,Hotels,Health,Education
4,Bartonville,Shopping,Food,Outdoor,Entertainment,Hotels,Health,Education


<h3>Running KMeans clustering with K = 7...</h3>

In [29]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 7

dfw_grouped_clustering = dfw_grouped.drop(columns='city')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dfw_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:80] 

array([6, 0, 4, 1, 0, 4, 3, 1, 0, 4, 2, 0, 0, 0, 1, 0, 0, 0, 0, 4, 0, 0,
       3, 4, 0, 0, 1, 4, 4, 4, 4, 4, 0, 4, 4, 4, 4, 0, 0, 4, 0, 0, 4, 3,
       0, 0, 0, 3, 1, 0, 0, 0, 0, 4, 0, 1, 1, 0, 0, 4, 0, 0, 0, 1, 4, 0,
       4, 4, 6, 0, 0, 1, 0, 4, 0, 5, 0, 4, 0, 0], dtype=int32)

In [30]:
# add clustering labels
cities_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dfw_merged_2 = df

# join the cluster labels and ranked categories onto df, which holds latitude/longitude for each city
dfw_merged_2 = dfw_merged_2.join(cities_venues_sorted.set_index('city'), on='city')

dfw_merged_2.head()

Unnamed: 0,index,city,population,distance,county,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,0,Addison,15500.0,16,Dallas,32.960431,-96.83026,6.0,Food,Shopping,Outdoor,Entertainment,Hotels,Health,Education
1,2,Allen,98100.0,26,Collin,33.103174,-96.67055,0.0,Shopping,Food,Outdoor,Entertainment,Hotels,Health,Education
2,11,Arlington,388100.0,21,Tarrant,32.701939,-97.105624,4.0,Food,Shopping,Outdoor,Hotels,Health,Entertainment,Education
3,15,Balch Springs,25200.0,15,Dallas,32.728741,-96.622771,1.0,Shopping,Food,Outdoor,Entertainment,Hotels,Health,Education
4,17,Bartonville,1700.0,33,Denton,33.073177,-97.131679,0.0,Shopping,Food,Outdoor,Entertainment,Hotels,Health,Education


<h3>Displaying Cities' clusters on the map...</h3>

In [31]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

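# cities with no venues in the seven categories have no label; they are placed in cluster 0
dfw_merged_2['Cluster Labels'] = dfw_merged_2['Cluster Labels'].fillna(0)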

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dfw_merged_2['latitude'], dfw_merged_2['longitude'], dfw_merged_2['city'], dfw_merged_2['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<h3>Listing each of the 7 clusters...</h3>

<h3>Cluster 0 : Predominance of Shopping services, with Theaters more numerous than Hotels</h3>

In [32]:
print('Cluster 0:')
dfw_merged_2.loc[dfw_merged_2['Cluster Labels'] == 0, dfw_merged_2.columns[[1] + list(range(5, dfw_merged_2.shape[1]))]]

Cluster 0:


Unnamed: 0,city,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
1,Allen,33.103174,-96.67055,0.0,Shopping,Food,Outdoor,Entertainment,Hotels,Health,Education
4,Bartonville,33.073177,-97.131679,0.0,Shopping,Food,Outdoor,Entertainment,Hotels,Health,Education
8,Cockrell Hill,32.736242,-96.886948,0.0,Shopping,Food,Outdoor,Entertainment,Hotels,Health,Education
11,Coppell,32.95526,-97.01557,0.0,Shopping,Food,Hotels,Outdoor,Entertainment,Education,Health
12,Copper Canyon,33.095955,-97.096678,0.0,Shopping,Food,Outdoor,Entertainment,Hotels,Health,Education
13,Corinth,33.154009,-97.064732,0.0,Shopping,Food,Hotels,Entertainment,Outdoor,Health,Education
15,Dalworthington Gardens,32.70291,-97.155289,0.0,Shopping,Food,Outdoor,Hotels,Health,Entertainment,Education
16,DeSoto,32.606287,-96.865622,0.0,Shopping,Food,Outdoor,Hotels,Health,Entertainment,Education
17,Double Oak,33.065122,-97.110567,0.0,Shopping,Food,Entertainment,Outdoor,Hotels,Health,Education
18,Duncanville,32.6518,-96.908337,0.0,Shopping,Food,Outdoor,Hotels,Health,Entertainment,Education


<h3>Cluster 1 : Predominance of Shopping services, with Hotels more numerous than Theaters</h3>

In [33]:
print('Cluster 1:')
dfw_merged_2.loc[dfw_merged_2['Cluster Labels'] == 1, dfw_merged_2.columns[[1] + list(range(5, dfw_merged_2.shape[1]))]]

Cluster 1:


Unnamed: 0,city,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
3,Balch Springs,32.728741,-96.622771,1.0,Shopping,Food,Outdoor,Entertainment,Hotels,Health,Education
7,Cedar Hill,32.588807,-96.955367,1.0,Shopping,Food,Outdoor,Entertainment,Hotels,Health,Education
14,Crandall,32.627911,-96.45582,1.0,Shopping,Food,Outdoor,Hotels,Health,Entertainment,Education
26,Forest Hill,32.672078,-97.269181,1.0,Shopping,Food,Outdoor,Hotels,Entertainment,Health,Education
48,Lancaster,32.59208,-96.756108,1.0,Shopping,Food,Outdoor,Hotels,Health,Entertainment,Education
55,McLendon Chisholm,32.842348,-96.380539,1.0,Shopping,Outdoor,Food,Hotels,Health,Entertainment,Education
56,Mesquite,32.76661,-96.599472,1.0,Shopping,Food,Outdoor,Entertainment,Hotels,Health,Education
63,Palmer,32.431252,-96.66777,1.0,Shopping,Food,Entertainment,Outdoor,Hotels,Health,Education
71,Rockwall,32.892346,-96.406699,1.0,Shopping,Food,Outdoor,Hotels,Health,Entertainment,Education


<h3>Cluster 2 : Restaurants first, Theaters second</h3>

In [34]:
print('Cluster 2:')
dfw_merged_2.loc[dfw_merged_2['Cluster Labels'] == 2, dfw_merged_2.columns[[1] + list(range(5, dfw_merged_2.shape[1]))]]

Cluster 2:


Unnamed: 0,city,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
10,Combine,32.588303,-96.511797,2.0,Food,Entertainment,Shopping,Outdoor,Hotels,Health,Education


<h3>Cluster 3 : Restaurants first, Shopping second</h3>

In [35]:
print('Cluster 3:')
dfw_merged_2.loc[dfw_merged_2['Cluster Labels'] == 3, dfw_merged_2.columns[[1] + list(range(5, dfw_merged_2.shape[1]))]]

Cluster 3:


Unnamed: 0,city,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
6,Carrollton,32.953735,-96.890282,3.0,Food,Shopping,Outdoor,Entertainment,Hotels,Health,Education
22,Farmers Branch,32.926514,-96.896115,3.0,Food,Shopping,Outdoor,Entertainment,Hotels,Health,Education
43,Irving,32.829518,-96.944218,3.0,Food,Shopping,Outdoor,Hotels,Health,Entertainment,Education
47,Lake Dallas,33.119287,-97.025564,3.0,Shopping,Food,Outdoor,Hotels,Health,Entertainment,Education
81,Talty,32.683187,-96.385539,3.0,Food,Shopping,Outdoor,Hotels,Health,Entertainment,Education
84,Trophy Club,33.004677,-97.205599,3.0,Food,Shopping,Hotels,Outdoor,Entertainment,Education,Health
86,Venus,32.433474,-97.102508,3.0,Food,Shopping,Outdoor,Hotels,Health,Entertainment,Education


<h3>Cluster 4 : Restaurants first, Shopping second - similar to Cluster 3</h3>

In [36]:
print('Cluster 4:')
dfw_merged_2.loc[dfw_merged_2['Cluster Labels'] == 4, dfw_merged_2.columns[[1] + list(range(5, dfw_merged_2.shape[1]))]]

Cluster 4:


Unnamed: 0,city,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
2,Arlington,32.701939,-97.105624,4.0,Food,Shopping,Outdoor,Hotels,Health,Entertainment,Education
5,Bedford,32.844017,-97.143067,4.0,Food,Shopping,Outdoor,Entertainment,Hotels,Health,Education
9,Colleyville,32.88096,-97.155012,4.0,Food,Shopping,Outdoor,Entertainment,Hotels,Health,Education
19,Ennis,32.329311,-96.625268,4.0,Food,Shopping,Outdoor,Hotels,Health,Entertainment,Education
23,Fate,32.941511,-96.381372,4.0,Food,Shopping,Education,Outdoor,Hotels,Health,Entertainment
27,Forney,32.747893,-96.471929,4.0,Food,Shopping,Entertainment,Outdoor,Hotels,Health,Education
28,Frisco,33.150674,-96.823612,4.0,Food,Shopping,Entertainment,Outdoor,Education,Hotels,Health
29,Garland,32.912624,-96.638883,4.0,Food,Shopping,Outdoor,Entertainment,Hotels,Health,Education
30,Garrett,32.363476,-96.654713,4.0,Food,Shopping,Outdoor,Hotels,Entertainment,Health,Education
31,Glenn Heights,32.543873,-96.855183,4.0,Food,Shopping,Outdoor,Hotels,Health,Entertainment,Education


<h3>Cluster 5 : Higher proportion of Hotels</h3>

In [37]:
print('Cluster 5:')
dfw_merged_2.loc[dfw_merged_2['Cluster Labels'] == 5, dfw_merged_2.columns[[1] + list(range(5, dfw_merged_2.shape[1]))]]

Cluster 5:


Unnamed: 0,city,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
75,Scurry,32.722517,-100.905202,5.0,Shopping,Hotels,Food,Entertainment,Outdoor,Health,Education


<h3>Cluster 6 : Restaurants first, Shopping second - similar to Cluster 3</h3>

In [38]:
print('Cluster 6:')
dfw_merged_2.loc[dfw_merged_2['Cluster Labels'] == 6, dfw_merged_2.columns[[1] + list(range(5, dfw_merged_2.shape[1]))]]

Cluster 6:


Unnamed: 0,city,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,Addison,32.960431,-96.83026,6.0,Food,Shopping,Outdoor,Entertainment,Hotels,Health,Education
68,Richardson,32.948179,-96.729721,6.0,Food,Shopping,Hotels,Outdoor,Entertainment,Health,Education


<h3>Results and Discussion</h3>

The criterion adopted to cluster the suburb cities was the frequency of venues in those cities across the following 7 categories: Shopping, Food, Outdoor, Entertainment, Health, Hotels and Education. <br>
Venues of other categories were not considered.
<br>
Other aspects, such as distance from downtown Dallas and population, were not considered for clustering either. The first because the resulting commute times do not vary enough to make distance a variable of interest; the second because a higher population in a city in the DFW area does not mean a higher concentration, as these cities usually spread evenly over the land.
<p>
KMeans clustering with K=7 gave a result similar to the one obtained with K=4, because the frequencies of the categories selected for clustering are very similar across the suburb cities. 
<br>
Food and Shopping services are almost always the most frequent, with outdoor activities usually in third place and schools (education) predominantly in the last place.
<br>
If we try to force a clear difference among the clusters, the most notable one is that shopping prevails in 3 clusters, while restaurants (food) are more frequent in 4 clusters.<p>
Naming the clusters according to the services most offered gives:<p>
- Cluster 0 : Predominance of Shopping services, with Theaters more numerous than Hotels - 45 cities<P>
- Cluster 1 : Predominance of Shopping services, with Hotels more numerous than Theaters - 9 cities<P>
- Cluster 2 : Restaurants first, Theaters second - 1 city<P>
- Cluster 3 : Restaurants first, Shopping second - 7 cities<P>
- Cluster 4 : Restaurants first, Shopping second - similar to Cluster 3 - 27 cities<P>
- Cluster 5 : Higher proportion of Hotels - 1 city<P>
- Cluster 6 : Restaurants first, Shopping second - similar to Cluster 3 - 2 cities<P>

<h3>Conclusion</h3>

When it comes to services offered in the suburb cities in the DFW Metroplex area, the frequency by service category is quite similar across all of them, with only one notable division: the group where restaurants are the most common venue and the group where shopping venues are more frequent. These 2 categories are nevertheless the most frequent for the vast majority of cities.
<p>
Therefore, differentiating the cities by the frequency of the categories of services offered in each one is not very helpful, as they are very similar from that perspective. Quantitative economic attributes like house prices and household income, or qualitative ones like school ratings, would likely produce a better differentiation than venue categories.    