# The battle of neighborhoods
## Author: [Carlos Morlan](https://www.linkedin.com/in/carlos-morlan-96343a15/)
### Published date: July 24<sup>th</sup>, 2019

[![Battle of neighborhoods](https://www.garybarker.co.uk/files/uk-city-life-cartoon-illustration.jpg)](https://www.garybarker.co.uk)

# Table of contents

  - [Executive Summary](#ExSum)
  - [Introduction](#Intro)
  - [Methodology](#Metho)
  - [Detailed results](#DetRe)
  - [Discussion section](#DiSec)
  - [Conclusions](#Concl)
  - [References](#Refer)
  - [Appendices](#Appen)

### <a name="ExSum"></a>Executive Summary

This notebook takes a Mexican state as input and creates clusters by category, visitor profile, and time series.

Explain the crux of your arguments in 3 paragraphs or less.

> The ultimate purpose of analytics is to communicate findings to the concerned who might use these insights to formulate policy or strategy.
> The data scientist should then use the insights to build the narrative to communicate the findings.

### <a name="Intro"></a>Introduction

This section sets up the problem for readers who may be new to the topic and need a gentle introduction to the subject matter before being immersed in intricate details.

### <a name="Metho"></a>Methodology (this section will be expanded in the second week's submission)

Research methods will be introduced and data sources used for the analysis will be described.

### <a name="DetRe"></a>Detailed results (this section will be expanded in the second week's submission)

Illustrative graphics will be included to present my empirical findings and I will formally test my hypothesis.

### <a name="DiSec"></a>Discussion section (this section will be expanded in the second week's submission)

I will craft my main arguments, supported by the results presented earlier. I will try to rely on the power of narrative to enable numbers to communicate my thesis to my readers.

### <a name="Concl"></a>Conclusions (this section will be expanded in the second week's submission)

I will generalize the specific findings and promote them.

### <a name="Refer"></a>References

Housekeeping:

  - [Notebook image](https://www.garybarker.co.uk)
  - [Geolocation Mexico Postal Codes](http://download.geonames.org/export/zip/)

### <a name="Appen"></a>Appendices (this section will be expanded in the second week's submission)

This section will be included only if needed.


# Data Processing

The data format is tab-delimited text in UTF-8 encoding, with the following fields:

* country code      : iso country code, 2 characters
* postal code       : varchar(20)
* place name        : varchar(180)
* admin name1       : 1. order subdivision (state) varchar(100)
* admin code1       : 1. order subdivision (state) varchar(20)
* admin name2       : 2. order subdivision (county/province) varchar(100)
* admin code2       : 2. order subdivision (county/province) varchar(20)
* admin name3       : 3. order subdivision (community) varchar(100)
* admin code3       : 3. order subdivision (community) varchar(20)
* latitude          : estimated latitude (wgs84)
* longitude         : estimated longitude (wgs84)
* accuracy          : accuracy of lat/lng from 1=estimated, 4=geonameid, 6=centroid of addresses or shape
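
As a minimal sketch of how one record in this layout maps onto the fields listed above, the snippet below parses a single tab-delimited line with the standard library. The field names and the sample record are assumptions made for illustration (the record is made up, not copied from the GeoNames file). The point it shows is that keeping the postal code as text preserves its leading zero.

```python
import csv
import io

# Field names follow the list above; the exact spellings are this sketch's choice.
FIELDS = [
    "country_code", "postal_code", "place_name",
    "admin_name1", "admin_code1", "admin_name2", "admin_code2",
    "admin_name3", "admin_code3", "latitude", "longitude", "accuracy",
]

# One made-up record in the documented layout (illustrative values only).
sample = (
    "MX\t04000\tVilla Coyoacán\tDistrito Federal\t09\t"
    "Coyoacán\t003\t\t\t19.34\t-99.1617\t1\n"
)

reader = csv.DictReader(io.StringIO(sample), fieldnames=FIELDS, delimiter="\t")
record = next(reader)

# Text fields stay strings, so the leading zero of the postal code survives.
postal_code = record["postal_code"]
latitude = float(record["latitude"])
longitude = float(record["longitude"])
print(postal_code, record["place_name"], latitude, longitude)
```

Reading the same column as a number would turn `04000` into `4000`, which is why the notebook cell that follows forces the postal code column to `str`.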

In [1]:
import pandas as pd
import numpy as np

# Read the source; the file is tab-delimited and the postal code column (#2) must be read as a string
postal_codes_tmp = pd.read_csv('MX.txt', sep='\t', header=None, dtype={1:str})

# Assign column headers because the file has none
postal_codes_tmp.columns = ['CountryCode', 'PostalCode', 'PlaceName', 'State', 'StateCode', 'TownHall', 'TownHallCode', 'AdminName3', 'AdminCode3', 'Latitude', 'Longitude', 'Accuracy']
# print(postal_codes_tmp.dtypes)

# Add leading zeros to the postal code
postal_codes_tmp['PostalCode'] = postal_codes_tmp['PostalCode'].str.zfill(5)

# Filter to a single state and town hall for testing purposes;
# .copy() makes the result independent of postal_codes_tmp and avoids SettingWithCopyWarning
state_filter = postal_codes_tmp['StateCode']==9
townhall_filter = postal_codes_tmp['TownHallCode']==3
filtered_postal_codes = postal_codes_tmp[state_filter & townhall_filter].copy()
# single_code_filter = filtered_postal_codes['PostalCode']=='04260'
# filtered_postal_codes = filtered_postal_codes[single_code_filter]
# filtered_postal_codes.head()

# Remove unused columns
filtered_postal_codes = filtered_postal_codes.drop(columns=['CountryCode', 'State', 'StateCode', 'TownHall', 'TownHallCode', 'AdminName3', 'AdminCode3'])

# Leave only one latitude-longitude by PostalCode (mean will be used)
unique_coordinates = filtered_postal_codes.groupby('PostalCode').agg({'PlaceName': [(', '.join)], 'Latitude': 'mean', 'Longitude': 'mean', 'Accuracy': 'count'}).reset_index()

# Rename column headers
unique_coordinates.columns = ['PostalCode', 'PlaceName', 'Latitude', 'Longitude', 'RecordCount']

postal_codes = unique_coordinates

# Verify data frame consistency
print('{} rows in dataframe'.format(postal_codes.shape[0]))
print('{} unique postal codes'.format(len(postal_codes['PostalCode'].unique())))
postal_codes.head()
#postal_codes

#postal_codes[postal_codes['PostalCode']=='04100']

101 rows in dataframe
101 unique postal codes



| | PostalCode | PlaceName | Latitude | Longitude | RecordCount |
|---|---|---|---|---|---|
| 0 | 4000 | Villa Coyoacán | 19.34 | -99.1617 | 1 |
| 1 | 4009 | Delegación Política Coyoacán | 19.34 | -99.1617 | 1 |
| 2 | 4010 | Barrio Santa Catarina | 19.3175 | -99.1327 | 1 |
| 3 | 4020 | Barrio La Concepción | 19.3267 | -99.1504 | 1 |
| 4 | 4030 | Barrio San Lucas | 19.3175 | -99.1327 | 1 |
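
The aggregation step in the cell above (one row per postal code, with joined place names and mean coordinates) can be sketched on toy data as follows. The rows are invented for illustration. The sketch uses pandas named aggregation, which is an alternative to the dictionary form in the notebook rather than the author's method; it yields flat column names directly, so no renaming step is needed.

```python
import pandas as pd

# Toy rows (invented): two places share postal code 04000, one place has 04010.
rows = pd.DataFrame({
    "PostalCode": ["04000", "04000", "04010"],
    "PlaceName": ["Place A", "Place B", "Place C"],
    "Latitude": [19.0, 19.5, 19.25],
    "Longitude": [-99.0, -99.5, -99.25],
})

# Named aggregation: output_column=(input_column, aggregation).
per_code = (
    rows.groupby("PostalCode")
        .agg(
            PlaceName=("PlaceName", ", ".join),
            Latitude=("Latitude", "mean"),
            Longitude=("Longitude", "mean"),
            RecordCount=("PlaceName", "count"),
        )
        .reset_index()
)
print(per_code)
```

`RecordCount` records how many source rows were merged into each postal code, which makes it easy to spot codes whose coordinates are an average of several places.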


## Creating Mexico City Map

In [7]:
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

import requests # library to handle requests

# import k-means from clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

address = 'Mexico City, Mexico'

geolocator = Nominatim(user_agent="city_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

# create map of the chosen city using latitude and longitude values
map_city = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, postalcode, placename in zip(postal_codes['Latitude'], postal_codes['Longitude'], postal_codes['PostalCode'], postal_codes['PlaceName']):
    label = '{}, {}'.format(postalcode, placename)
    # parse_html belongs to the popup, not to the marker
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_city)

map_city

The geographical coordinates of Mexico City, Mexico are 19.4326009, -99.1333416.


## Defining Foursquare functions

1. To get nearby venues
2. To sort venues in descending order

In [3]:
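# Define Foursquare credentials and version.
# Credentials are read from environment variables so they are not stored in the notebook
# (the variable names below are this notebook's choice).
import os

CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605'
LIMIT = 100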

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a DataFrame with the venues Foursquare reports near each place."""
    url = 'https://api.foursquare.com/v2/venues/explore'
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print('Getting data for ' + name)

        # build the query parameters for the API request
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }

        # make the GET request and fail early on HTTP errors
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        results = response.json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        for v in results:
            categories = v['venue']['categories']
            venues_list.append((
                name,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                # some venues may come back without a category
                categories[0]['name'] if categories else None))

    nearby_venues = pd.DataFrame(venues_list, columns=[
        'PlaceName',
        'PlaceName Latitude',
        'PlaceName Longitude',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category'])

    return nearby_venues

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
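
To make the response handling in `getNearbyVenues` easier to follow, here is a small offline sketch of the nested structure that function walks through. The payload is hypothetical: it only mirrors the keys the function reads (`response`, `groups`, `items`, `venue`, `location`, `categories`), and the venue values are invented. The helper `extract_venues` is likewise an illustrative name, not part of the notebook.

```python
# Hypothetical payload shaped like the keys getNearbyVenues reads;
# the venue values are invented for illustration, not real API output.
payload = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Venue A",
                               "location": {"lat": 19.34, "lng": -99.16},
                               "categories": [{"name": "Café"}]}},
                    {"venue": {"name": "Venue B",
                               "location": {"lat": 19.35, "lng": -99.17},
                               "categories": []}},
                ]
            }
        ]
    }
}


def extract_venues(payload):
    """Flatten the nested response into (name, lat, lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    venues = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        venues.append((venue["name"], venue["location"]["lat"],
                       venue["location"]["lng"], category))
    return venues


venues = extract_venues(payload)
print(venues)
```

Separating the parsing from the HTTP call like this makes it possible to check the flattening logic without credentials or network access.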

## Getting Foursquare information

In [4]:
# Get nearby venues data
city_venues = getNearbyVenues(names=postal_codes['PlaceName'],
                                   latitudes=postal_codes['Latitude'],
                                   longitudes=postal_codes['Longitude']
                                  )

print(str(city_venues.shape[0]) + ' venues with ' + str(city_venues.shape[1]) + ' columns')
print('There are {} unique categories.'.format(city_venues['Venue Category'].nunique()))

city_venues.head()

Getting data for Villa Coyoacán
Getting data for Delegación Política Coyoacán
Getting data for Barrio Santa Catarina
Getting data for Barrio La Concepción
Getting data for Barrio San Lucas
Getting data for Parque San Andrés
Getting data for Del Carmen
Getting data for Viveros de Coyoacán
Getting data for San Diego Churubusco, San Mateo
Getting data for Cámara Nacional de la Industria Editorial
Getting data for Campestre Churubusco
Getting data for Churubusco Country Club
Getting data for Prado Churubusco
Getting data for Ermita Churubusco
Getting data for Hermosillo
Getting data for Paseos de Taxqueña
Getting data for 20 de Agosto
Getting data for San Francisco Culhuacán Barrio de La Magdalena, San Francisco Culhuacán Barrio de San Francisco, San Francisco Culhuacán Barrio de Santa Ana, San Francisco Culhuacán Barrio de San Juan
Getting data for Santa Martha del Sur Quetzalcoatl
Getting data for Taxqueña
Getting data for Ajusco
Getting data for Ajusco Montserrat
Getting data for Pedreg

| | PlaceName | PlaceName Latitude | PlaceName Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Villa Coyoacán | 19.34 | -99.1617 | Acuatica y Family Fitness Nelson Vargas | 19.338527 | -99.15912 | Gym / Fitness Center |
| 1 | Villa Coyoacán | 19.34 | -99.1617 | El Tajín | 19.344046 | -99.163164 | Mexican Restaurant |
| 2 | Villa Coyoacán | 19.34 | -99.1617 | The Green Corner | 19.344169 | -99.161397 | Vegetarian / Vegan Restaurant |
| 3 | Villa Coyoacán | 19.34 | -99.1617 | Café El Jarocho | 19.344178 | -99.160771 | Café |
| 4 | Villa Coyoacán | 19.34 | -99.1617 | Helado Obscuro | 19.343997 | -99.159541 | Ice Cream Shop |


In [9]:
city_onehot = pd.get_dummies(city_venues[['Venue Category']], prefix="", prefix_sep="")

# add place name column back to dataframe
city_onehot['PlaceName'] = city_venues['PlaceName'] 

# move place name column to the first column
fixed_columns = [city_onehot.columns[-1]] + list(city_onehot.columns[:-1])
city_onehot = city_onehot[fixed_columns]

# city_onehot.head()
print('{} places and {} categories before grouping by Place Name'.format(city_onehot.shape[0], city_onehot.shape[1]))

city_grouped = city_onehot.groupby('PlaceName').mean().reset_index()
# print(city_grouped)

print('{} places and {} categories after grouping by Place Name'.format(city_grouped.shape[0], city_grouped.shape[1]))

num_top_venues = 25

for hood in city_grouped['PlaceName']:
    print("----"+hood+"----")
    temp = city_grouped[city_grouped['PlaceName'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    if hood == 'Del Carmen':
        print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
        print('\n')
        
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PlaceName']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 'st', 'nd', 'rd' every ordinal uses the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
city_venues_sorted = pd.DataFrame(columns=columns)
city_venues_sorted['PlaceName'] = city_grouped['PlaceName']

for ind in np.arange(city_grouped.shape[0]):
    city_venues_sorted.iloc[ind, 1:] = return_most_common_venues(city_grouped.iloc[ind, :], num_top_venues)

city_venues_sorted.head()

3448 places and 232 categories before grouping by Place Name
101 places and 232 categories after grouping by Place Name
----20 de Agosto----
----Adolfo Ruiz Cortínes----
----Ajusco----
----Ajusco Montserrat----
----Alianza Popular Revolucionaria, Los Cedros----
----Altillo Universidad, Acasulco, Copilco, Integración Latinoamericana, Villas Copilco----
----Avante----
----Barrio La Concepción----
----Barrio Oxtopulco Universidad----
----Barrio San Lucas----
----Barrio Santa Catarina----
----Cafetales----
----Campestre Churubusco----
----Campestre Coyoacán----
----Cantil del Pedregal, Bosques de Tetlameya----
----Carmen Serdán----
----Churubusco Country Club----
----Copilco El Alto, Copilco Universidad----
----Culhuacán CTM Sección I, Culhuacán CTM Sección V, Culhuacán CTM Sección II----
----Culhuacán CTM Sección IX-A, Culhuacán CTM Sección IX-B, Culhuacán CTM Sección VIII----
----Culhuacán CTM Sección Piloto, Culhuacán CTM Canal Nacional----
----Culhuacán CTM Sección VII----
----Culhuacá

| | PlaceName | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
|---|---|---|---|---|---|---|
| 0 | 20 de Agosto | Park | Gym | Ice Cream Shop | Auto Workshop | Speakeasy |
| 1 | Adolfo Ruiz Cortínes | Taco Place | Speakeasy | Mexican Restaurant | Grocery Store | Seafood Restaurant |
| 2 | Ajusco | Mexican Restaurant | Seafood Restaurant | Taco Place | Thrift / Vintage Store | Convenience Store |
| 3 | Ajusco Montserrat | Mexican Restaurant | Seafood Restaurant | Taco Place | Thrift / Vintage Store | Convenience Store |
| 4 | Alianza Popular Revolucionaria, Los Cedros | Mexican Restaurant | Taco Place | Restaurant | Seafood Restaurant | Breakfast Spot |
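
The cell above turns venue categories into per-place frequencies and then ranks them. A compact sketch of that idea on invented data is shown below. The place names and categories are toy values, and the dictionary comprehension is a simplified stand-in for the notebook's `return_most_common_venues` loop rather than a copy of it.

```python
import pandas as pd

# Toy venue list (invented): place A has six venues, place B has three.
venues = pd.DataFrame({
    "PlaceName": ["A"] * 6 + ["B"] * 3,
    "Venue Category": ["Café", "Café", "Café", "Park", "Park", "Gym",
                       "Park", "Park", "Gym"],
})

# One indicator column per category; the mean per place is the category frequency.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "PlaceName", venues["PlaceName"])
freq = onehot.groupby("PlaceName").mean()

# Rank categories per place by frequency and keep the top two names.
top2 = {place: row.sort_values(ascending=False).index[:2].tolist()
        for place, row in freq.iterrows()}
print(freq)
print(top2)
```

Because each row of the one-hot table has a single 1, the group mean equals the share of venues in that category, so the rows of `freq` sum to 1 and are comparable between places with different venue counts.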


## Clustering places with k-means

In [10]:
# set number of clusters
kclusters = 7

# drop the label column so only the category frequencies are clustered
city_grouped_clustering = city_grouped.drop(columns='PlaceName')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(city_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

# add clustering labels
city_venues_sorted.insert(0, 'ClusterL', kmeans.labels_)

city_merged = postal_codes

# merge city_grouped with postal_codes to add latitude/longitude for each place
city_merged = city_merged.join(city_venues_sorted.set_index('PlaceName'), on='PlaceName')
# The join can leave NaN for places without a match, which turns the labels into floats;
# filling with 0 restores integers but also places any unmatched rows in cluster 0
city_merged.fillna(0, inplace=True)
# city_merged.dtypes

city_merged['ClusterL'] = city_merged['ClusterL'].astype(int)

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(city_merged['Latitude'], city_merged['Longitude'], city_merged['PlaceName'], city_merged['ClusterL']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
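
One plausible explanation for the float cluster labels handled in the cell above is that a left join inserts NaN wherever a place has no matching row, and NaN cannot be stored in a plain integer column. The sketch below reproduces that effect on invented frames and shows two possible ways to handle it. Both are suggestions under that assumption, not the notebook's approach.

```python
import pandas as pd

# Toy frames (invented): place "C" has no row on the right-hand side.
places = pd.DataFrame({"PlaceName": ["A", "B", "C"]})
labels = pd.DataFrame({"PlaceName": ["A", "B"], "ClusterL": [1, 0]})

merged = places.join(labels.set_index("PlaceName"), on="PlaceName")
print(merged["ClusterL"].dtype)  # float64: NaN cannot live in a plain int column

# Option 1: drop unmatched places so they are not silently assigned to cluster 0.
matched = merged.dropna(subset=["ClusterL"]).astype({"ClusterL": int})

# Option 2: keep every place and mark the gap with pandas' nullable integer dtype.
nullable = merged.astype({"ClusterL": "Int64"})
print(matched)
print(nullable)
```

Either option keeps unmatched places distinguishable from genuine members of cluster 0, which `fillna(0)` does not.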

### Cluster 0

In [11]:
city_merged.loc[city_merged['ClusterL'] == 0, city_merged.columns[[1] + list(range(5, city_merged.shape[1]))]]

| | PlaceName | ClusterL | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
|---|---|---|---|---|---|---|---|
| 30 | Pacifico, Rinconada de los Reyes, El Rosedal, ... | 0 | Mexican Restaurant | Taco Place | Convenience Store | Bakery | Flea Market |
| 31 | Zapata | 0 | Pizza Place | Mexican Restaurant | Convenience Store | Bakery | Flea Market |
| 32 | Montserrat | 0 | Pizza Place | Mexican Restaurant | Convenience Store | Bakery | Flea Market |
| 33 | Mariana | 0 | Mexican Restaurant | Taco Place | Food Truck | BBQ Joint | Pizza Place |
| 88 | Carmen Serdán | 0 | Mexican Restaurant | Flea Market | Taco Place | Bakery | Food Truck |
| 89 | Cafetales | 0 | Mexican Restaurant | Flea Market | Taco Place | Bakery | Food Truck |
| 90 | Emiliano Zapata Fraccionamiento Popular | 0 | Mexican Restaurant | Flea Market | Taco Place | Bakery | Food Truck |
