# Relocation

Coursera Capstone project

# Introduction

Relocation is the act of moving to a different place and establishing a residence there. People relocate for many reasons, such as taking a new position in a different city or getting married. This project aims to inform people who are relocating between the districts of Bangkok and the districts of Chiang Mai, two of the largest provinces in Thailand. Specifically, it groups similar districts across the two provinces, focusing only on the lifestyle aspects of each district.

## Target Audience

The target audience is people who need to relocate from Bangkok to Chiang Mai or vice versa, for example to accept a new position at a different location or for marriage. They can use the groupings to negotiate their relocation or to decide on the best course of action.

# Data

Because this project groups the districts of Chiang Mai and Bangkok by lifestyle aspects, it looks at the proportion of venue types in each district. The population density of each district is also taken into account.

## Proportion of venue types

The venue types in each district are taken from the FourSquare API. The `/venues/explore` endpoint is used to retrieve a list of venues along with each venue's name, latitude, longitude, and category. The categories returned by this endpoint are very specific. They can be generalized using the `/venues/categories` endpoint, which returns the whole hierarchy of available categories. The proportion of a venue type is then the number of venues of that type divided by the total number of venues in the district.
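
As a rough illustration of the proportion calculation described above, the sketch below counts venue categories for a single district in plain Python. The venue list and the function name `venue_type_proportions` are made up for this example; they are not the notebook's own data or code.

```python
from collections import Counter
from typing import Dict, List


def venue_type_proportions(categories: List[str]) -> Dict[str, float]:
    """Return the share of each venue type among all venues of one district."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        # A district without venues has no meaningful proportions.
        return {}
    return {category: count / total for category, count in counts.items()}


# Hypothetical venues for one district (top-level categories only).
sample_categories = ["Food", "Food", "Food", "Shop & Service", "Travel & Transport"]
proportions = venue_type_proportions(sample_categories)
print(proportions)
```

The proportions of one district always sum to 1, which makes districts with very different venue counts comparable.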

The proportion of venue types matters because it broadly shows what a district specializes in. If a district contains mostly business and professional venues, it is likely a business center. If it contains many hotels, it probably leans toward tourism.

## Population Density

The population density is the district population divided by the district area. The district population data is taken from the Bureau of Registration Administration (BORA) and the district area is taken from the Energy Policy and Planning Office (EPPO). The population data consists of the male population, the female population, and the total population, available annually from 2000 to 2017. The area data consists only of the district area in square kilometers.
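
As a small worked example of this calculation, the sketch below uses the figures for Mueang Chiang Mai that appear in the tables later in this notebook (2017 population 234649, area 152.359 km²). The function name `population_density` is hypothetical.

```python
def population_density(total_population: int, area_km_sq: float) -> float:
    """People per square kilometer."""
    if area_km_sq <= 0:
        raise ValueError("area must be positive")
    return total_population / area_km_sq


# Mueang Chiang Mai, 2017: population and area as listed in this notebook's tables.
density = population_density(234649, 152.359)
print(round(density, 2))
```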

The population density characterizes a district in several ways: it can be used to estimate the capacity of the venues in that district and to approximate the amount of social activity in the area.

# Content

In [1]:
from typing import List, Set, Dict, Tuple, Optional
from bs4 import BeautifulSoup
import requests
import lxml
from tqdm import tqdm, tqdm_notebook
from time import sleep

## List of District Names in English and Thai

Scraped from websites.

For Bangkok: https://sathai.com/geo/%E0%B8%81%E0%B8%A3%E0%B8%B8%E0%B8%87%E0%B9%80%E0%B8%97%E0%B8%9E%E0%B8%A1%E0%B8%AB%E0%B8%B2%E0%B8%99%E0%B8%84%E0%B8%A3  
For Chiang Mai: https://sathai.com/geo/%E0%B9%80%E0%B8%8A%E0%B8%B5%E0%B8%A2%E0%B8%87%E0%B9%83%E0%B8%AB%E0%B8%A1%E0%B9%88  

In [2]:
CHIANG_MAI_DISTRICT_NAMES_URL = 'https://sathai.com/geo/%E0%B9%80%E0%B8%8A%E0%B8%B5%E0%B8%A2%E0%B8%87%E0%B9%83%E0%B8%AB%E0%B8%A1%E0%B9%88'
BANGKOK_DISTRICT_NAMES_URL = 'https://sathai.com/geo/%E0%B8%81%E0%B8%A3%E0%B8%B8%E0%B8%87%E0%B9%80%E0%B8%97%E0%B8%9E%E0%B8%A1%E0%B8%AB%E0%B8%B2%E0%B8%99%E0%B8%84%E0%B8%A3'

DISTRICT_NAMES_URLS = (
    CHIANG_MAI_DISTRICT_NAMES_URL,
    BANGKOK_DISTRICT_NAMES_URL
)

In [3]:
def parse_districts_page(url: str) -> Tuple[List[Tuple[str, str]], str, str]:
    assert 'sathai.com' in url
    
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    
    header = soup.select_one('.container h3').text
    province_th, province_en = header.split(' ', 1)
    
    districts : List[Tuple[str, str]] = []
    for record in tqdm_notebook(soup.select('.list .record a'), desc=province_en):
        thai = record.select_one('.thai').text
        english = record.select_one('.english').text
        districts.append((thai, english))
    return districts, province_th, province_en

def get_districts(url: str) -> List[Tuple[str, str]]:
    return parse_districts_page(url)[0]

### Convert and Combine into pandas DataFrame

In [4]:
import pandas as pd
import numpy as np

In [5]:
def get_districts_df(url: str) -> pd.core.frame.DataFrame:
    districts : List[Tuple[str, str]]
    province_th: str
    province_en: str
        
    districts, province_th, province_en = parse_districts_page(url)
    
    districts_df = pd.DataFrame(districts, columns=['district_th', 'district_en'])
    districts_df.insert(loc=0, column='province_th', value=province_th)
    districts_df.insert(loc=1, column='province_en', value=province_en)
    return districts_df

In [6]:
def get_all_districts_df(urls: Tuple[str]) -> pd.core.frame.DataFrame:
    df = pd.DataFrame(columns = ['province_th', 'province_en', 'district_th', 'district_en'])
    for url in urls:
        province_df = get_districts_df(url)
        df = pd.concat([df, province_df], axis = 0).reset_index(drop=True)
    return df

In [7]:
districts_df = get_all_districts_df(DISTRICT_NAMES_URLS)


In [8]:
districts_df.head()

| | province_th | province_en | district_th | district_en |
|---|---|---|---|---|
| 0 | เชียงใหม่ | Chiang mai | เชียงดาว | Chiang Dao |
| 1 | เชียงใหม่ | Chiang mai | เมืองเชียงใหม่ | Mueang Chiang Mai |
| 2 | เชียงใหม่ | Chiang mai | เวียงแหง | Wiang Haeng |
| 3 | เชียงใหม่ | Chiang mai | แม่แจ่ม | Mae Chaem |
| 4 | เชียงใหม่ | Chiang mai | แม่แตง | Mae Taeng |


## List of Latitude and Longitude for Each District

From `geopy`.

In [9]:
GEOCODER_AGENT = "coursera_capstone"

In [10]:
from geopy.geocoders import Nominatim, Photon
import geopy.location
import folium
nominatim_agent = Nominatim(user_agent=GEOCODER_AGENT)
photon_agent = Photon(user_agent=GEOCODER_AGENT)

In [11]:
def get_geocode(agent, address: str) -> Optional[geopy.location.Location]:
    # Try up to 10 times, waiting a little longer after each failed attempt.
    for i in range(0, 10):
        try:
            location = agent.geocode(address)
            if location is not None:
                return location
        except Exception as e:
            print("Error occurred: %s" % (e))
        sleep(0.01 * i ** 2) # quadratic backoff

    print("Geolocator agent %s reached max retries [query='%s']" % (type(agent), address))
    return None
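
The retry loop above can be factored into a small reusable helper. The following is only a sketch under assumptions: `retry_with_backoff` and its parameters are hypothetical names, and the `sleep` function is injectable so that the delays can be observed without actually waiting.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], Optional[T]],
    max_retries: int = 10,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `action` until it returns a non-None value or the retries run out."""
    for attempt in range(max_retries):
        try:
            result = action()
            if result is not None:
                return result
        except Exception as error:
            print("Error occurred: %s" % error)
        # Quadratic backoff: 0, 1, 4, 9, ... times the base delay.
        sleep(base_delay * attempt ** 2)
    return None


# Usage with a fake action that fails twice before succeeding.
calls = {"count": 0}


def flaky_lookup() -> Optional[str]:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("Service timed out")
    return "ok"


delays: List[float] = []
outcome = retry_with_backoff(flaky_lookup, sleep=delays.append)
print(outcome, delays)
```

Waiting after an exception as well as after an empty result keeps the client from hammering a geocoding service that is already timing out.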

In [12]:
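def get_coordinates(address) -> Tuple[float, float]:
    # Try Nominatim first and fall back to Photon.
    location = get_geocode(nominatim_agent, address)
    if location is None:
        location = get_geocode(photon_agent, address)
    if location is None:
        print("Unable to get coordinates for %s" % (address))
        # Return NaN instead of failing on `None.latitude`.
        return (float('nan'), float('nan'))
    return (location.latitude, location.longitude)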

In [13]:
def get_district_coordinates(district: str, province: str, country: str = 'Thailand') -> Tuple[float, float]:
    address = "%s, %s, %s" % (district, province, country)
    return get_coordinates(address)

In [14]:
def append_coordinates(districts_df: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:
    df = districts_df.copy(deep = True)
    df.insert(loc = 4, column='latitude', value = np.nan)
    df.insert(loc = 5, column='longitude', value = np.nan)
    for index, province_en, district_en in tqdm_notebook(districts_df[['province_en', 'district_en']].reset_index().values, desc='Getting coordinates'):
        latitude, longitude = get_district_coordinates(district_en, province_en)
        df.loc[index, 'latitude'] = latitude
        df.loc[index, 'longitude'] = longitude
    return df

In [15]:
districts_df = append_coordinates(districts_df)

Error occurred: Service timed out
Geolocator agent <class 'geopy.geocoders.osm.Nominatim'> reached max retries [query='Mae On, Chiang mai, Thailand']
Error occurred: Service timed out
Error occurred: Service timed out
Geolocator agent <class 'geopy.geocoders.osm.Nominatim'> reached max

On manual inspection, the latitude and longitude returned for 'Galyani Wattana' are very far off, probably because it is a new district. Let's drop it for now.

In [16]:
districts_df = districts_df[districts_df['district_en'] != 'Galyani Wattana'].copy()
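
The manual inspection could be backed by an automated sanity check. The sketch below flags a geocoded point that lies implausibly far from a reference point in the same province, using the haversine formula. The 300 km threshold, the function names, and the "far off" sample point are assumptions for illustration; the two Chiang Mai coordinates are the ones shown in this notebook.

```python
from math import asin, cos, radians, sin, sqrt
from typing import Tuple

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometers."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def is_suspicious(point: Tuple[float, float], reference: Tuple[float, float],
                  max_distance_km: float = 300.0) -> bool:
    """True if `point` is farther from `reference` than the (assumed) threshold."""
    return haversine_km(point[0], point[1], reference[0], reference[1]) > max_distance_km


mueang_chiang_mai = (18.788922, 98.987309)  # reference point for the province
chiang_dao = (19.368503, 98.967102)         # a plausible district coordinate
far_off_point = (13.75, 100.50)             # hypothetical bad result, roughly Bangkok

print(is_suspicious(chiang_dao, mueang_chiang_mai))
print(is_suspicious(far_off_point, mueang_chiang_mai))
```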

In [17]:
districts_df.head()

| | province_th | province_en | district_th | district_en | latitude | longitude |
|---|---|---|---|---|---|---|
| 0 | เชียงใหม่ | Chiang mai | เชียงดาว | Chiang Dao | 19.368503 | 98.967102 |
| 1 | เชียงใหม่ | Chiang mai | เมืองเชียงใหม่ | Mueang Chiang Mai | 18.788922 | 98.987309 |
| 2 | เชียงใหม่ | Chiang mai | เวียงแหง | Wiang Haeng | 19.559497 | 98.634181 |
| 3 | เชียงใหม่ | Chiang mai | แม่แจ่ม | Mae Chaem | 18.498699 | 98.363784 |
| 4 | เชียงใหม่ | Chiang mai | แม่แตง | Mae Taeng | 19.130423 | 98.940978 |


In [18]:
# create a map of Thailand using latitude and longitude values
m = folium.Map(location=get_coordinates("Thailand"), zoom_start=6)

# add markers to map
data_points = zip(districts_df['latitude'], districts_df['longitude'], districts_df['province_en'], districts_df['district_en'])
for lat, long, province_en, district_en in data_points:
    label = '{}, {}'.format(province_en, district_en)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(m)  
    
m

The coordinates for 'Phra Nakhon, Bangkok' also look wrong. Let's use the `Photon` geolocator for that district instead.

In [19]:
def get_phra_nakhon_coordinates():
    location = get_geocode(agent = photon_agent, address='Phra Nakhon, Bangkok, Thailand')
    latitude, longitude = (location.latitude, location.longitude)
    return (latitude, longitude)

In [20]:
def correct_phra_nakhon_coordinates(districts_df: pd.core.frame.DataFrame) -> None:
    conditions = (districts_df['province_en'] == 'Bangkok') & (districts_df['district_en'] == 'Phra Nakhon')
    index = districts_df.index[conditions]
    # Geocode once and reuse the result for both columns.
    latitude, longitude = get_phra_nakhon_coordinates()
    districts_df.loc[index, 'latitude'] = latitude
    districts_df.loc[index, 'longitude'] = longitude

In [21]:
correct_phra_nakhon_coordinates(districts_df)

## Get District Population Data

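Retrieved from http://stat.bora.dopa.go.th/stat/statnew/statTDD/. The `year` parameter in the URLs below is the last two digits of the Buddhist Era year (55 for 2012, 60 for 2017).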

In [22]:
import re

In [23]:
BANGKOK_DISTRICT_POPULATION_2012_URL = 'http://stat.bora.dopa.go.th/stat/statnew/statTDD/datasource/showStatDistrict.php?statType=1&year=55&rcode=10&_=1538440865517'
BANGKOK_DISTRICT_POPULATION_2017_URL = 'http://stat.bora.dopa.go.th/stat/statnew/statTDD/datasource/showStatDistrict.php?statType=1&year=60&rcode=10&_=1538440865517'
CHIANG_MAI_DISTRICT_POPULATION_2012_URL = 'http://stat.bora.dopa.go.th/stat/statnew/statTDD/datasource/showStatDistrict.php?statType=1&year=55&rcode=50&_=1538440865517'
CHIANG_MAI_DISTRICT_POPULATION_2017_URL = 'http://stat.bora.dopa.go.th/stat/statnew/statTDD/datasource/showStatDistrict.php?statType=1&year=60&rcode=50&_=1538440865517'


Import a manually collected mapping from Chiang Mai subdistricts (tambon) to their respective districts (amphoe). The mapping is needed because the population data is divided irregularly: some rows are reported per municipality (names containing `ท้องถิ่นเทศบาล`) instead of per district. The mapping was built by manually searching for each subdistrict name and selecting the correct district.

In [24]:
tambon_map = pd.read_csv('data/chiang_mai_tambon.csv', header=None).set_index(0).T.to_dict('records')[0]
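
The one-liner above turns a header-less CSV into a dictionary from its first column to its second. Assuming the file has exactly two columns, an equivalent reading with only the standard library may be easier to follow. This is a sketch: `load_two_column_map` is a hypothetical helper and the rows below are placeholder names, not entries from `data/chiang_mai_tambon.csv`.

```python
import csv
import io
from typing import Dict, TextIO


def load_two_column_map(handle: TextIO) -> Dict[str, str]:
    """Map the first column of a header-less CSV to its second column."""
    mapping: Dict[str, str] = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        mapping[row[0]] = row[1]
    return mapping


# Placeholder content standing in for the real mapping file.
sample_csv = io.StringIO("subdistrict_a,district_x\nsubdistrict_b,district_x\nsubdistrict_c,district_y\n")
sample_map = load_two_column_map(sample_csv)
print(sample_map)
```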

In [25]:
province_name_pattern = re.compile(r'<font color=\'red\'><b>(.*)</b></font>')
def parse_province_name(raw_name):
    province_name = province_name_pattern.match(raw_name).group(1)
    if (len(province_name.replace("จังหวัด", "")) > 0):
        province_name = province_name.replace("จังหวัด", "")
    return province_name

In [26]:
name_pattern = re.compile(r'<a href=javascript:openWindow\(\'\?rcode=\d+&statType=\d+&year=\d+\'\)>(.*)<\/a>')

def parse_district_name(raw_name):
    name = name_pattern.match(raw_name).group(1)
    if (len(name.replace("ท้องถิ่นเขต", "")) > 0):
        name = name.replace("ท้องถิ่นเขต", "")
    if (len(name.replace("อำเภอ", "")) > 0):
        name = name.replace("อำเภอ", "")
    return name
    
def parse_district_data(row):
    area_code, name_html, male, female, total, residence = row
    
    name = parse_district_name(name_html)
    if (name.find('ท้องถิ่นเทศบาล') != -1):
        name = tambon_map[name]
        
    male_population = int(male.replace(",", ""))
    female_population = int(female.replace(",", ""))
    total_population = int(total.replace(",", ""))
    
    return (name, male_population, female_population, total_population)

def parse_districts_data(data):
    districts = []
    for row in data:
        district = parse_district_data(row)
        if (district != None):
            districts.append(district)
            
    return districts

In [27]:
def parse_districts_population(url: str) -> Tuple[List[Tuple[str, int, int, int]], str]:
    assert 'stat.bora.dopa.go.th' in url

    response = requests.get(url)
    data = response.json()['aaData']
    province_th = parse_province_name(data[0][1])
    districts_population = parse_districts_data(data[1:])
    return (districts_population, province_th)

In [28]:
def get_district_population_df(url: str) -> pd.core.frame.DataFrame:
    districts_population, province_th = parse_districts_population(url)
    column_names = ['district_th', 'male_population', 'female_population', 'total_population']
    df = pd.DataFrame(districts_population, columns = column_names)
    df.insert(loc = 0, column='province_th', value = province_th)
    df = df.groupby(['province_th', 'district_th']).sum().reset_index()
    return df

In [29]:
def get_all_districts_population_single_year_df(urls: Tuple[str]) -> pd.core.frame.DataFrame:
    df = pd.DataFrame(columns = ['province_th', 'district_th', 'male_population', 'female_population', 'total_population'])
    for url in urls:
        district_population_df = get_district_population_df(url)
        df = pd.concat([df, district_population_df], axis = 0).reset_index(drop=True)
    return df

In [30]:
district_population_2012 = get_all_districts_population_single_year_df([BANGKOK_DISTRICT_POPULATION_2012_URL, CHIANG_MAI_DISTRICT_POPULATION_2012_URL])

In [31]:
district_population_2017 = get_all_districts_population_single_year_df([BANGKOK_DISTRICT_POPULATION_2017_URL, CHIANG_MAI_DISTRICT_POPULATION_2017_URL])

In [32]:
district_population = district_population_2012.merge(district_population_2017, on=['province_th', 'district_th'], suffixes=('_2012', '_2017'))

In [33]:
districts_df = districts_df.merge(district_population, on=['province_th', 'district_th'])
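
Because `merge` defaults to an inner join, any district whose Thai name differs between the two sources silently drops out of `districts_df`. A quick way to spot such mismatches is to compare the key sets before merging. The sketch below uses plain Python sets and made-up keys; `unmatched_keys` is a hypothetical helper.

```python
from typing import Iterable, Set, Tuple

Key = Tuple[str, str]  # (province, district)


def unmatched_keys(left: Iterable[Key], right: Iterable[Key]) -> Tuple[Set[Key], Set[Key]]:
    """Return the keys found only on the left and only on the right."""
    left_keys, right_keys = set(left), set(right)
    return left_keys - right_keys, right_keys - left_keys


# Made-up keys: "B" is missing on the right, "D" is missing on the left.
left_sample = [("P1", "A"), ("P1", "B"), ("P2", "C")]
right_sample = [("P1", "A"), ("P2", "C"), ("P2", "D")]
only_left, only_right = unmatched_keys(left_sample, right_sample)
print(only_left, only_right)
```

With the notebook's DataFrames, the keys could be built with `zip(df['province_th'], df['district_th'])` for each side of the merge.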

In [34]:
districts_df.head()

| | province_th | province_en | district_th | district_en | latitude | longitude | male_population_2012 | female_population_2012 | total_population_2012 | male_population_2017 | female_population_2017 | total_population_2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | เชียงใหม่ | Chiang mai | เชียงดาว | Chiang Dao | 19.368503 | 98.967102 | 41291 | 40746 | 82037 | 46241 | 46347 | 92588 |
| 1 | เชียงใหม่ | Chiang mai | เมืองเชียงใหม่ | Mueang Chiang Mai | 18.788922 | 98.987309 | 110334 | 124725 | 235059 | 110168 | 124481 | 234649 |
| 2 | เชียงใหม่ | Chiang mai | เวียงแหง | Wiang Haeng | 19.559497 | 98.634181 | 13649 | 13415 | 27064 | 22768 | 22381 | 45149 |
| 3 | เชียงใหม่ | Chiang mai | แม่แจ่ม | Mae Chaem | 18.498699 | 98.363784 | 29522 | 28350 | 57872 | 30474 | 29254 | 59728 |
| 4 | เชียงใหม่ | Chiang mai | แม่แตง | Mae Taeng | 19.130423 | 98.940978 | 37159 | 37716 | 74875 | 37370 | 38420 | 75790 |


## Get District Area Data

Taken from http://www.e-report.energy.go.th/area.html. However, the pages appear to be exported from Microsoft Word and could not be scraped directly, so they were downloaded manually instead.

In [35]:
CHIANG_MAI_DISTRICT_AREA_URL = 'http://www.e-report.energy.go.th/area/Chingmai.htm'
BANGKOK_DISTRICT_AREA_URL = 'http://www.e-report.energy.go.th/area/Bangkok.htm'

In [36]:
def parse_districts_area_page(filepath: str) -> List[Tuple[str, float]]:    
    with open(filepath, 'rb') as f:
        text = f.read()
        soup = BeautifulSoup(text, 'lxml')
    
        table = soup.select('.MsoNormalTable tr')
        table_header = table[0]
        table_body = table[1:]
    
        districts : List[Tuple[str, float]] = []
        for table_row in tqdm_notebook(table_body, desc=filepath):
            record = [cell.text.strip() for cell in table_row.select('td')]
            districts.append((record[1], float(record[2])))
    return districts

In [37]:
def get_districts_area_df(province_th: str, filepath: str) -> pd.core.frame.DataFrame:
    districts: List[Tuple[str, float]] = parse_districts_area_page(filepath)
    df = pd.DataFrame(districts, columns=['district_th', 'area_km_sq'])
    df.insert(loc = 0, column='province_th', value = province_th)
    return df

In [38]:
def get_bangkok_districts_area_df() -> pd.core.frame.DataFrame:
    areas_df = get_districts_area_df('กรุงเทพมหานคร', 'data/bangkok.htm')
    areas_df = areas_df[~ areas_df.district_th.str.startswith('แขวง')]
    areas_df.loc[:, 'district_th'] = areas_df.district_th.str.replace('เขต', '')
    areas_df.reset_index(drop = True, inplace=True)
    return areas_df

In [39]:
def get_chiang_mai_districts_area_df() -> pd.core.frame.DataFrame:
    areas_df = get_districts_area_df('เชียงใหม่', 'data/chiang_mai.htm')
    areas_df.loc[:, 'district_th'] = areas_df.district_th.replace('เมือง', 'เมืองเชียงใหม่')
    return areas_df

In [40]:
def get_all_districts_area_df() -> pd.core.frame.DataFrame:
    chiang_mai_districts_area_df = get_chiang_mai_districts_area_df()
    bangkok_districts_area_df = get_bangkok_districts_area_df()
    df = pd.concat([chiang_mai_districts_area_df, bangkok_districts_area_df], axis = 0).reset_index(drop=True)
    return df

In [41]:
districts_area = get_all_districts_area_df()


In [42]:
districts_df = districts_df.merge(districts_area, on=['province_th', 'district_th'])

In [43]:
districts_df

| | province_th | province_en | district_th | district_en | latitude | longitude | male_population_2012 | female_population_2012 | total_population_2012 | male_population_2017 | female_population_2017 | total_population_2017 | area_km_sq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | เชียงใหม่ | Chiang mai | เชียงดาว | Chiang Dao | 19.368503 | 98.967102 | 41291 | 40746 | 82037 | 46241 | 46347 | 92588 | 1882.082 |
| 1 | เชียงใหม่ | Chiang mai | เมืองเชียงใหม่ | Mueang Chiang Mai | 18.788922 | 98.987309 | 110334 | 124725 | 235059 | 110168 | 124481 | 234649 | 152.359 |
| 2 | เชียงใหม่ | Chiang mai | เวียงแหง | Wiang Haeng | 19.559497 | 98.634181 | 13649 | 13415 | 27064 | 22768 | 22381 | 45149 | 672.172 |
| 3 | เชียงใหม่ | Chiang mai | แม่แจ่ม | Mae Chaem | 18.498699 | 98.363784 | 29522 | 28350 | 57872 | 30474 | 29254 | 59728 | 3361.151 |
| 4 | เชียงใหม่ | Chiang mai | แม่แตง | Mae Taeng | 19.130423 | 98.940978 | 37159 | 37716 | 74875 | 37370 | 38420 | 75790 | 1362.784 |
| 5 | เชียงใหม่ | Chiang mai | แม่ริม | Mae Rim | 18.920660 | 98.946758 | 42476 | 43989 | 86465 | 45792 | 47393 | 93185 | 443.634 |
| 6 | เชียงใหม่ | Chiang mai | แม่วาง | Mae Wang | 18.630793 | 98.707169 | 15503 | 15841 | 31344 | 15747 | 16087 | 31834 | 601.218 |
| 7 | เชียงใหม่ | Chiang mai | แม่ออน | Mae On | 18.733222 | 99.122954 | 10695 | 10565 | 21260 | 10614 | 10652 | 21266 | 442.263 |
| 8 | เชียงใหม่ | Chiang mai | แม่อาย | Mae Ai | 20.020057 | 99.273625 | 36583 | 36331 | 72914 | 39161 | 39139 | 78300 | 736.701 |
| 9 | เชียงใหม่ | Chiang mai | ไชยปราการ | Chai Prakan | 19.730316 | 99.139032 | 22042 | 22600 | 44642 | 22673 | 23340 | 46013 | 510.851 |


## Get Venues Data from FourSquare

In [44]:
# @hidden_cell

CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [45]:
def make_foursquare_explore_url_(latitude, longitude, page = 1, radius = 2000, limit = 50, section = None):
    # Note: `page` is sent to the API as the `offset` query parameter.
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&offset={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            latitude, 
            longitude, 
            page,
            radius, 
            limit)
    if section is not None:
        url += '&section={}'.format(section)
    return url
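
String formatting does not escape query values. As an alternative sketch, the same query string could be assembled with `urllib.parse.urlencode`, which handles escaping. The function name `build_explore_url`, the placeholder credentials, and the `'food'` section value are assumptions for illustration; the parameter names are the ones used in the URL above.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id: str, client_secret: str, version: str,
                      latitude: float, longitude: float,
                      offset: int = 0, radius: int = 2000, limit: int = 50,
                      section: Optional[str] = None) -> str:
    """Build the explore URL with properly escaped query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(latitude, longitude),
        'offset': offset,
        'radius': radius,
        'limit': limit,
    }
    if section is not None:
        params['section'] = section
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# Placeholder credentials; the coordinates are those of Mueang Chiang Mai.
example_url = build_explore_url('ID', 'SECRET', '20180605', 18.788922, 98.987309, section='food')
query = parse_qs(urlsplit(example_url).query)
print(query['ll'], query['section'])
```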

In [46]:
def get_venues_from_foursquare_explore_response_(response_json):
    results = response_json["response"]['groups'][0]['items']
    filtered_venues = [(v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results]
    return filtered_venues

In [47]:
def get_nearby_venues_columns_():
    return [
        'province_en',
        'district_en',
        'venue_name', 
        'venue_latitude', 
        'venue_longitude', 
        'venue_category'
    ]

def get_nearby_venues(df, radius=500):
    # NOTE: `radius` is not passed on to the API calls below; the search radius
    # is the default of make_foursquare_explore_url_.
    all_venues_list : List[pd.core.frame.DataFrame] = []
    for row in tqdm_notebook(df[['province_en', 'district_en', 'latitude', 'longitude']].values):
        province, district, lat, long = row
        district_venue_pages = get_district_venue_pages(lat, long)
        
        if (len(district_venue_pages) == 0):
            print("No venues for %s, %s" % (district, province))
            continue
        
        district_venues_df: pd.core.frame.DataFrame = pd.concat(district_venue_pages, axis = 0, sort = False).reset_index(drop=True)
        district_venues_df.drop_duplicates(inplace = True)
        
        district_venues_df.insert(loc = 0, column='province_en', value = province)
        district_venues_df.insert(loc = 1, column='district_en', value = district)
        all_venues_list.append(district_venues_df)
    
    venues_df = pd.concat(all_venues_list, axis = 0, sort = False).reset_index(drop=True)
    venues_df.columns = get_nearby_venues_columns_()
    
    return venues_df

In [55]:
def get_district_venue_page(lat: float, long: float, page: int):
    df = None
    url = make_foursquare_explore_url_(lat, long, page)
    for i in range(0, 5):
        response = None
        try:
            response = requests.get(url).json()
            district_venues = get_venues_from_foursquare_explore_response_(response) 
            df = pd.DataFrame(district_venues)
            break
        except Exception as e:
            print("An error occurred while getting from FourSquare: %s" % (e))
            print("URL = ", url)
            print("Response = ", response)
            sleep(0.1)
    else:
        print("Error reached maximum retries.")
        df = pd.DataFrame()  # empty page, so callers can still call len()
    return df

In [56]:
def get_district_venue_pages(lat: float, long: float):
    district_venue_pages : List[pd.core.frame.DataFrame] = []
    # `page` is the result offset: 0, 50, ..., 450 (50 venues per request).
    for page in range(0, 500, 50):
        district_venue_page = get_district_venue_page(lat, long, page)
        if (district_venue_page is None or len(district_venue_page) == 0):
            break
        district_venue_pages.append(district_venue_page)
        if (len(district_venue_page) % 50 != 0):
            break  # a short page means there are no more results
    return district_venue_pages
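
The paging logic above (request pages of 50 until an empty or short page arrives) can be checked without touching the API. The sketch below swaps the network call for a stand-in function; `collect_pages` and `fake_fetch` are hypothetical names, and the 120 fake venues are made up.

```python
from typing import Callable, List


def collect_pages(fetch_page: Callable[[int], List[str]],
                  page_size: int = 50, max_results: int = 500) -> List[List[str]]:
    """Fetch pages by offset until a page is empty or shorter than `page_size`."""
    pages: List[List[str]] = []
    for offset in range(0, max_results, page_size):
        page = fetch_page(offset)
        if len(page) == 0:
            break
        pages.append(page)
        if len(page) < page_size:
            break  # a short page is the last one
    return pages


# Stand-in for the API: 120 fake venues served 50 at a time.
all_items = ['venue_%d' % i for i in range(120)]


def fake_fetch(offset: int) -> List[str]:
    return all_items[offset:offset + 50]


pages = collect_pages(fake_fetch)
print([len(page) for page in pages])
```

For pages of at most 50 items, the `len(page) < page_size` test is equivalent to the `% 50 != 0` test used above.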

In [57]:
venues_df = get_nearby_venues(districts_df)

No venues for Doi Lo, Chiang mai


In [58]:
venues_df.tail(250)

| | province_en | district_en | venue_name | venue_latitude | venue_longitude | venue_category |
|---|---|---|---|---|---|---|
| 6984 | Bangkok | Lak Si | อาหารปักษ์ใต้พี่เก๋ (หาดใหญ่) | 13.897037 | 100.589835 | Thai Restaurant |
| 6985 | Bangkok | Lak Si | CAT FOODCOURT | 13.883936 | 100.570747 | Food Court |
| 6986 | Bangkok | Lak Si | Cafe' Amazon (คาเฟ่ อเมซอน) | 13.890604 | 100.568440 | Coffee Shop |
| 6987 | Bangkok | Lak Si | ศูนย์อาหารศูนย์ราชการ | 13.889256 | 100.564937 | Food Court |
| 6988 | Bangkok | Lak Si | ราชาข้าวกล่อง | 13.877571 | 100.574163 | Food Truck |
| 6989 | Bangkok | Lak Si | De Cafe' | 13.894862 | 100.585794 | Coffee Shop |
| 6990 | Bangkok | Lak Si | TrueCoffee (ทรูคอฟฟี่) | 13.878917 | 100.566964 | Coffee Shop |
| 6991 | Bangkok | Lak Si | Black Canyon Coffee (แบล็คแคนยอนคอฟฟี่) | 13.890356 | 100.567183 | Coffee Shop |
| 6992 | Bangkok | Lak Si | ตลาดนัด \| ซอยต้นสน | 13.873377 | 100.572839 | Flea Market |
| 6993 | Bangkok | Lak Si | SUM RAN JAI | 13.898133 | 100.585475 | Thai Restaurant |


## Get Inverse Category Map

The inverse category map takes each venue category to its top-level category. Mapping venues to top-level categories reduces the number of dimensions used for clustering.

In [59]:
def make_foursquare_categories_url_():
    url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION)
    return url

In [60]:
response = requests.get(make_foursquare_categories_url_()).json()

In [61]:
categories = response['response']['categories']

In [62]:
inverse_categories = {}

def add_subcategories(category, top_level_name):
    # Map this category and all of its descendants to the top-level name.
    inverse_categories[category['name']] = top_level_name
    for subcategory in category['categories']:
        add_subcategories(subcategory, top_level_name)

for category in categories:
    print(category['name'])
    add_subcategories(category, category['name'])

inverse_categories

Arts & Entertainment
College & University
Event
Food
Nightlife Spot
Outdoors & Recreation
Professional & Other Places
Residence
Shop & Service
Travel & Transport


{'Arts & Entertainment': 'Arts & Entertainment',
 'Amphitheater': 'Arts & Entertainment',
 'Aquarium': 'Arts & Entertainment',
 'Arcade': 'Arts & Entertainment',
 'Art Gallery': 'Arts & Entertainment',
 'Bowling Alley': 'Arts & Entertainment',
 'Casino': 'Arts & Entertainment',
 'Circus': 'Arts & Entertainment',
 'Comedy Club': 'Arts & Entertainment',
 'Concert Hall': 'Arts & Entertainment',
 'Country Dance Club': 'Arts & Entertainment',
 'Disc Golf': 'Arts & Entertainment',
 'Exhibit': 'Arts & Entertainment',
 'General Entertainment': 'Arts & Entertainment',
 'Go Kart Track': 'Arts & Entertainment',
 'Historic Site': 'Arts & Entertainment',
 'Karaoke Box': 'Arts & Entertainment',
 'Laser Tag': 'Arts & Entertainment',
 'Memorial Site': 'Arts & Entertainment',
 'Mini Golf': 'Arts & Entertainment',
 'Movie Theater': 'Arts & Entertainment',
 'Drive-in Theater': 'Arts & Entertainment',
 'Indie Movie Theater': 'Arts & Entertainment',
 'Multiplex': 'Arts & Entertainment',
 'Museum': 'Arts & 

In [63]:
venues_df['venue_category'] = venues_df['venue_category'].apply(lambda x: inverse_categories[x])

In [64]:
venues_df

| | province_en | district_en | venue_name | venue_latitude | venue_longitude | venue_category |
|---|---|---|---|---|---|---|
| 0 | Chiang mai | Chiang Dao | ขาหมูเชียงดาวเจ้าเก่า แยกวัดอินทาราม | 19.368530 | 98.965055 | Food |
| 1 | Chiang mai | Chiang Dao | Azalea | 19.379677 | 98.965886 | Travel & Transport |
| 2 | Chiang mai | Chiang Dao | ครัวเชียงดาว Chaing Dao Restaurant | 19.372483 | 98.965953 | Food |
| 3 | Chiang mai | Chiang Dao | ร้านปูอลาสก้า | 19.352785 | 98.963730 | Food |
| 4 | Chiang mai | Chiang Dao | ประเสริฐทองวิลล์ | 19.352748 | 98.963505 | Food |
| 5 | Chiang mai | Chiang Dao | พรเพ็ญขาหมูเสวย | 19.368232 | 98.964996 | Food |
| 6 | Chiang mai | Chiang Dao | Cafe' Amazon PTT Chiang Dao | 19.353228 | 98.963767 | Food |
| 7 | Chiang mai | Chiang Dao | ถนนคนเดินเชียงดาว | 19.368338 | 98.965775 | Outdoors & Recreation |
| 8 | Chiang mai | Chiang Dao | ผึ้งน้อยเชียงดาว | 19.371910 | 98.965322 | Food |
| 9 | Chiang mai | Chiang Dao | ข้าวขาหมูเชียงดาว | 19.372507 | 98.965999 | Food |


## Get Venue Data for each District

In [65]:
venues_with_dummies = pd.concat((venues_df, pd.get_dummies(venues_df['venue_category'])), axis = 1)

In [66]:
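# Drop the non-count columns first so that only the category indicator columns are summed.
district_venues_df = venues_with_dummies.drop(['venue_name', 'venue_latitude', 'venue_longitude', 'venue_category'], axis = 1).groupby(['province_en', 'district_en']).sum().reset_index()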

In [67]:
district_venues_df.set_index(['province_en', 'district_en'], inplace=True)

In [68]:
district_venues_df = district_venues_df.div(district_venues_df.sum(axis=1), axis=0)

In [69]:
district_venues_df.reset_index(inplace = True)

In [71]:
districts_df = districts_df.merge(district_venues_df, how = 'left', on=['province_en', 'district_en']).fillna(0)

## Feature Engineering

In [72]:
districts_df['population_density_2012'] = districts_df['total_population_2012'] / districts_df['area_km_sq']
districts_df['population_density_2017'] = districts_df['total_population_2017'] / districts_df['area_km_sq']

In [73]:
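# The venue columns are already per-district proportions, so they are left unscaled (divided by 1).
denominator = 1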

In [74]:
districts_df['arts_entertainment_density_2017'] = districts_df['Arts & Entertainment'] / denominator
districts_df['college_university_density_2017'] = districts_df['College & University'] / denominator
districts_df['food_density_2017'] = districts_df['Food'] / denominator
districts_df['nightlife_spot_density_2017'] = districts_df['Nightlife Spot'] / denominator
districts_df['outdoors_recreation_density_2017'] = districts_df['Outdoors & Recreation'] / denominator
districts_df['professional_density_2017'] = districts_df['Professional & Other Places'] / denominator
districts_df['residence_density_2017'] = districts_df['Residence'] / denominator
districts_df['shop_service_density_2017'] = districts_df['Shop & Service'] / denominator
districts_df['travel_transport_density_2017'] = districts_df['Travel & Transport'] / denominator

In [75]:
districts_df

| | province_th | province_en | district_th | district_en | latitude | longitude | male_population_2012 | female_population_2012 | total_population_2012 | male_population_2017 | ... | population_density_2017 | arts_entertainment_density_2017 | college_university_density_2017 | food_density_2017 | nightlife_spot_density_2017 | outdoors_recreation_density_2017 | professional_density_2017 | residence_density_2017 | shop_service_density_2017 | travel_transport_density_2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | เชียงใหม่ | Chiang mai | เชียงดาว | Chiang Dao | 19.368503 | 98.967102 | 41291 | 40746 | 82037 | 46241 | ... | 49.194456 | 0.050000 | 0.000000 | 0.700000 | 0.000000 | 0.050000 | 0.000000 | 0.000000 | 0.100000 | 0.100000 |
| 1 | เชียงใหม่ | Chiang mai | เมืองเชียงใหม่ | Mueang Chiang Mai | 18.788922 | 98.987309 | 110334 | 124725 | 235059 | 110168 | ... | 1540.105934 | 0.026432 | 0.000000 | 0.753304 | 0.026432 | 0.017621 | 0.017621 | 0.000000 | 0.048458 | 0.110132 |
| 2 | เชียงใหม่ | Chiang mai | เวียงแหง | Wiang Haeng | 19.559497 | 98.634181 | 13649 | 13415 | 27064 | 22768 | ... | 67.168820 | 0.000000 | 0.000000 | 0.800000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.200000 | 0.000000 |
| 3 | เชียงใหม่ | Chiang mai | แม่แจ่ม | Mae Chaem | 18.498699 | 98.363784 | 29522 | 28350 | 57872 | 30474 | ... | 17.770103 | 0.000000 | 0.000000 | 0.428571 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.142857 | 0.428571 |
| 4 | เชียงใหม่ | Chiang mai | แม่แตง | Mae Taeng | 19.130423 | 98.940978 | 37159 | 37716 | 74875 | 37370 | ... | 55.614096 | 0.000000 | 0.000000 | 0.200000 | 0.200000 | 0.400000 | 0.000000 | 0.000000 | 0.000000 | 0.200000 |
| 5 | เชียงใหม่ | Chiang mai | แม่ริม | Mae Rim | 18.920660 | 98.946758 | 42476 | 43989 | 86465 | 45792 | ... | 210.049275 | 0.090909 | 0.000000 | 0.454545 | 0.030303 | 0.181818 | 0.000000 | 0.000000 | 0.181818 | 0.060606 |
| 6 | เชียงใหม่ | Chiang mai | แม่วาง | Mae Wang | 18.630793 | 98.707169 | 15503 | 15841 | 31344 | 15747 | ... | 52.949180 | 0.000000 | 0.000000 | 0.200000 | 0.000000 | 0.400000 | 0.000000 | 0.000000 | 0.000000 | 0.400000 |
| 7 | เชียงใหม่ | Chiang mai | แม่ออน | Mae On | 18.733222 | 99.122954 | 10695 | 10565 | 21260 | 10614 | ... | 48.084511 | 0.000000 | 0.000000 | 0.593750 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.375000 | 0.031250 |
| 8 | เชียงใหม่ | Chiang mai | แม่อาย | Mae Ai | 20.020057 | 99.273625 | 36583 | 36331 | 72914 | 39161 | ... | 106.284639 | 0.000000 | 0.000000 | 0.800000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.200000 | 0.000000 |
| 9 | เชียงใหม่ | Chiang mai | ไชยปราการ | Chai Prakan | 19.730316 | 99.139032 | 22042 | 22600 | 44642 | 22673 | ... | 90.071273 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |


### Select Columns

In [76]:
X_columns = [
    'population_density_2012', 
    'population_density_2017', 
    'arts_entertainment_density_2017',
    'college_university_density_2017', 
    'food_density_2017',
    'nightlife_spot_density_2017', 
    'outdoors_recreation_density_2017',
    'professional_density_2017', 
    'residence_density_2017',
    'shop_service_density_2017', 
    'travel_transport_density_2017']
X = districts_df[X_columns]

### Normalization

In [77]:
from sklearn.preprocessing import StandardScaler

In [78]:
scaler = StandardScaler()
X = scaler.fit_transform(X)
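
`StandardScaler` rescales each column to zero mean and unit variance, so that population density (in the tens to thousands) does not dominate the venue proportions (between 0 and 1) when distances are computed. The plain-Python sketch below shows the same transformation for a single column. `standardize` is a hypothetical helper, and the sample values are the 2017 densities of the first three districts in the table above, rounded to two decimals.

```python
from statistics import mean, pstdev
from typing import List


def standardize(values: List[float]) -> List[float]:
    """Z-score a column using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


# Chiang Dao, Mueang Chiang Mai, Wiang Haeng (people per square kilometer).
densities = [49.19, 1540.11, 67.17]
scaled = standardize(densities)
print([round(value, 2) for value in scaled])
```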

## Clustering

In [87]:
from sklearn.cluster import KMeans

In [110]:
n_clusters = 5
clustering = KMeans(n_clusters = n_clusters).fit(X)

In [111]:
clustering.labels_

array([0, 0, 0, 1, 1, 3, 1, 3, 0, 0, 1, 3, 3, 3, 0, 1, 0, 3, 1, 3, 0, 0,
       0, 3, 0, 4, 3, 0, 0, 0, 0, 4, 0, 0, 3, 0, 0, 0, 0, 4, 0, 0, 0, 0,
       0, 0, 0, 0, 4, 0, 4, 3, 4, 4, 2, 4, 0, 0, 0, 4, 0, 2, 0, 0, 2, 0,
       2, 4, 4, 2, 0, 3, 3, 0], dtype=int32)

In [112]:
districts_df['cluster'] = clustering.labels_

## Validate Clustering

In [113]:
from matplotlib import cm, colors

In [114]:
# create map
map_clusters = folium.Map(location=get_coordinates("Thailand"), zoom_start=6)

# set color scheme for the clusters
x = np.arange(n_clusters)
ys = [i+x+(i*x)**2 for i in range(n_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in districts_df[['latitude', 'longitude', 'district_th', 'cluster']].values:
    label = folium.Popup(str(poi) + '\nCluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [115]:
pd.DataFrame(scaler.inverse_transform(clustering.cluster_centers_), columns = X_columns, dtype = float).apply(lambda x: round(x / 10, 2))

| | population_density_2012 | population_density_2017 | arts_entertainment_density_2017 | college_university_density_2017 | food_density_2017 | nightlife_spot_density_2017 | outdoors_recreation_density_2017 | professional_density_2017 | residence_density_2017 | shop_service_density_2017 | travel_transport_density_2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1271.79 | 1175.85 | 0.0 | 0.0 | 0.06 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 | 0.01 |
| 1 | 350.86 | 345.59 | 0.0 | 0.0 | 0.07 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 |
| 2 | 45.69 | 48.85 | 0.0 | 0.0 | 0.04 | 0.0 | 0.02 | 0.0 | 0.0 | 0.02 | 0.02 |
| 3 | 923.89 | 910.21 | 0.0 | 0.0 | 0.06 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 | 0.01 |
| 4 | 438.34 | 451.47 | 0.0 | 0.0 | 0.06 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 | 0.01 |
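
One way to check which clusters contain districts from both provinces is to count how many districts of each province fall into each cluster. The sketch below does this with `collections.Counter` on made-up (province, cluster) pairs; `cluster_counts_by_province` is a hypothetical helper. With the notebook's data, the pairs would come from the `province_en` and `cluster` columns.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def cluster_counts_by_province(pairs: Iterable[Tuple[str, int]]) -> Dict[int, Dict[str, int]]:
    """Return {cluster: {province: number of districts}}."""
    table: Dict[int, Dict[str, int]] = {}
    for (province, cluster), count in Counter(pairs).items():
        table.setdefault(cluster, {})[province] = count
    return table


# Made-up assignments, not the notebook's clustering result.
sample_pairs = [
    ("Bangkok", 0), ("Bangkok", 0), ("Chiang mai", 0),
    ("Chiang mai", 1), ("Bangkok", 2),
]
table = cluster_counts_by_province(sample_pairs)

# Clusters that contain districts from more than one province.
shared = sorted(cluster for cluster, by_province in table.items() if len(by_province) > 1)
print(table, shared)
```

Clusters that appear in `shared` are the ones where a district in one province has a counterpart in the other.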


# Summary

The clustering suggests that some districts in Bangkok are indeed similar to districts in Chiang Mai. Some relocations are therefore more plausible than others: the suburbs of Bangkok are very similar to the city area of Chiang Mai, while the outskirts of Bangkok are similar to the suburbs of Chiang Mai. The city center of Bangkok is unlike any area in Chiang Mai, so people living in central Bangkok are less likely to find a comparable district when relocating.