# Capstone Project - The Battle of the Neighborhoods (Week 2)
## Guide to Explore Hanoi

## I. Introduction

As someone who likes to travel, I find it a good idea to research a new destination beforehand. This time I chose Hanoi, our country's capital city.

For a traveler, the most important thing is to know the place they are traveling to. So I need to find out which areas of Hanoi have many services for travelers, such as hotels, restaurants and cafes, and where to go if I have a health problem. With this information, my trip should be both exciting and safe.

## II. Data Source and Tools

For this purpose, we will need the following data and tools:
- List of urban districts of Hanoi from **Wikipedia**: https://en.wikipedia.org/wiki/Hanoi#List_of_local_government_divisions - I already scraped the info and put it into the ***hanoi_district.csv*** file
- List of wards for each district from **Wikipedia**: https://vi.wikipedia.org/wiki/Thể_loại:Xã,_phường,_thị_trấn_Hà_Nội - The link to each list is included in the ***hanoi_district.csv*** file
- **Google Chrome** is used to inspect the content of the Wikipedia pages.
- Map data, district boundaries and coordinates from **OpenStreetMap** (OSM)
- Location data from **Foursquare**. I'm mainly interested in **hotels**, **cafes**, **restaurants** and **hospitals**
- Online tool from https://tyrasd.github.io/osmtogeojson/ to convert OSM data into GeoJSON for use with **folium**

## III. Build a Map for Hanoi's main districts

## Import libraries

In [1]:
import pandas as pd
import io
import requests

## Gather Data and Cleanup

We already have the list of Hanoi's main districts in a CSV file, so we will load that data.

In [2]:
with io.open('hanoi_district.csv','r', encoding='utf-8-sig') as file:
    district_df = pd.read_csv(file)

print('There are', district_df.shape[0], 'main district in Hanoi')
district_df.head()

There are 12 main district in Hanoi


Unnamed: 0,d_code,d_en_name,d_vi_name,d_lat,d_lon,d_url
0,1,Ba Dinh District,Ba Đình,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...
1,2,Hoan Kiem District,Hoàn Kiếm,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...
2,3,Tay Ho District,Tây Hồ,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...
3,4,Long Bien District,Long Biên,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...
4,5,Cau Giay District,Cầu Giấy,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...


Each district has a link to a Wikipedia page containing the list of wards that belong to it. We will need to extract that data and put it into a new dataframe.

Define a function to get the content of a URL so we can reuse it when needed.

In [3]:
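def getData(url):
    # Fetch a URL and return the raw response body (bytes)
    html = requests.get(url)

    if html.status_code != 200:
        print('Error! Please check the url or your network')
        html.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page

    return html.content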

Let's take a look at our first district.

Using **Google Chrome** to inspect the source code of the Wikipedia page, we see that each ward is listed inside a ***li*** tag, which in turn sits inside a ***div*** with class ***mw-category***.

In [4]:
from bs4 import BeautifulSoup

data = BeautifulSoup(getData(district_df.loc[0, 'd_url']), 'html.parser').find('div',{'class':'mw-category'}).findAll('li')

data

[<li><a href="/wiki/C%E1%BB%91ng_V%E1%BB%8B" title="Cống Vị">Cống Vị</a></li>,
 <li><a href="/wiki/%C4%90i%E1%BB%87n_Bi%C3%AAn,_Ba_%C4%90%C3%ACnh" title="Điện Biên, Ba Đình">Điện Biên, Ba Đình</a></li>,
 <li><a href="/wiki/%C4%90%E1%BB%99i_C%E1%BA%A5n,_Ba_%C4%90%C3%ACnh" title="Đội Cấn, Ba Đình">Đội Cấn, Ba Đình</a></li>,
 <li><a href="/wiki/Gi%E1%BA%A3ng_V%C3%B5" title="Giảng Võ">Giảng Võ</a></li>,
 <li><a href="/wiki/Kim_M%C3%A3" title="Kim Mã">Kim Mã</a></li>,
 <li><a href="/wiki/Li%E1%BB%85u_Giai" title="Liễu Giai">Liễu Giai</a></li>,
 <li><a href="/wiki/Ng%E1%BB%8Dc_H%C3%A0,_Ba_%C4%90%C3%ACnh" title="Ngọc Hà, Ba Đình">Ngọc Hà, Ba Đình</a></li>,
 <li><a href="/wiki/Ng%E1%BB%8Dc_Kh%C3%A1nh" title="Ngọc Khánh">Ngọc Khánh</a></li>,
 <li><a href="/wiki/Nguy%E1%BB%85n_Trung_Tr%E1%BB%B1c_(ph%C6%B0%E1%BB%9Dng)" title="Nguyễn Trung Trực (phường)">Nguyễn Trung Trực (phường)</a></li>,
 <li><a href="/wiki/Ph%C3%BAc_X%C3%A1" title="Phúc Xá">Phúc Xá</a></li>,
 <li><a href="/wiki/Qu%C3%A1n_Th%C3

Each element contains the URL and name of a ward. We will need the URL to extract the latitude and longitude of each ward in case OpenStreetMap can't provide them.

Get the names of all wards in each district.

In [5]:
# Column layout of the new dataframe: district columns plus ward columns
col_list = district_df.columns.to_list()
col_list.extend(['w_name', 'w_url', 'w_lat', 'w_lon'])

total_districts = district_df.shape[0]
rows = []   # collect one record per ward (DataFrame.append was removed in pandas 2.0)

# Loop through each row in district_df to get list of wards
for idx, row in district_df.iterrows():
    # Get content of the Wikipedia page, then find the list of all wards, put it in variable data
    print(str(idx+1)+'/'+str(total_districts)+'. Getting list of wards for', row['d_en_name']+'...')
    data = BeautifulSoup(getData(row['d_url']), 'html.parser').find('div',{'class':'mw-category'}).findAll('li')
    for w in data:
        r = row.to_dict()
        w_name = w.text.replace(', '+ row['d_vi_name'], '').replace(' (phường)', '')  # Clean up the ward's name
        w_url = 'https://vi.wikipedia.org'+w.a['href']  # Add prefix to url
        r['w_name'] = w_name
        r['w_url'] = w_url
        rows.append(r)

# Build the dataframe once; w_lat and w_lon start empty and are kept as object columns
ward_df = pd.DataFrame(rows, columns=col_list).astype({'w_lat': object, 'w_lon': object})

print('\nDone!')


1/12. Getting list of wards for Ba Dinh District...
2/12. Getting list of wards for Hoan Kiem District...
3/12. Getting list of wards for Tay Ho District...
4/12. Getting list of wards for Long Bien District...
5/12. Getting list of wards for Cau Giay District...
6/12. Getting list of wards for Dong Da District...
7/12. Getting list of wards for Hai Ba Trung District...
8/12. Getting list of wards for Hoang Mai District...
9/12. Getting list of wards for Thanh Xuan District...
10/12. Getting list of wards for Nam Tu Liem District...
11/12. Getting list of wards for Bac Tu Liem District...
12/12. Getting list of wards for Ha Dong District...

Done!
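
The name clean-up and URL prefixing done inside the loop can be tried on plain strings, without any network access. The sketch below is only illustrative: `clean_ward_name` and `ward_url` are hypothetical helper names that do not appear in this notebook, and `urljoin` from the standard library is swapped in for the plain string concatenation used above.

```python
from urllib.parse import urljoin

WIKI_BASE = 'https://vi.wikipedia.org'  # same prefix as in the loop above

def clean_ward_name(title, district_vi_name):
    # Hypothetical helper: drop the ", <district>" suffix and the " (phường)" marker
    return title.replace(', ' + district_vi_name, '').replace(' (phường)', '')

def ward_url(href):
    # Hypothetical helper: turn a relative /wiki/... link into an absolute URL
    return urljoin(WIKI_BASE, href)

# Titles taken from the list printed for Ba Dinh District
for title in ['Cống Vị', 'Điện Biên, Ba Đình', 'Nguyễn Trung Trực (phường)']:
    print(clean_ward_name(title, 'Ba Đình'))

print(ward_url('/wiki/Kim_M%C3%A3'))
```

Titles that carry no suffix, such as Cống Vị, pass through unchanged, so the same call can be applied to every list item.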


Let's take a look at our new dataframe

In [6]:
print('There are', ward_df.shape[0], 'wards in total')
ward_df.head()

There are 166 wards in total


Unnamed: 0,d_code,d_en_name,d_vi_name,d_lat,d_lon,d_url,w_name,w_url,w_lat,w_lon
0,1,Ba Dinh District,Ba Đình,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Cống Vị,https://vi.wikipedia.org/wiki/C%E1%BB%91ng_V%E...,,
1,1,Ba Dinh District,Ba Đình,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Điện Biên,https://vi.wikipedia.org/wiki/%C4%90i%E1%BB%87...,,
2,1,Ba Dinh District,Ba Đình,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Đội Cấn,https://vi.wikipedia.org/wiki/%C4%90%E1%BB%99i...,,
3,1,Ba Dinh District,Ba Đình,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Giảng Võ,https://vi.wikipedia.org/wiki/Gi%E1%BA%A3ng_V%...,,
4,1,Ba Dinh District,Ba Đình,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Kim Mã,https://vi.wikipedia.org/wiki/Kim_M%C3%A3,,


We will use OpenStreetMap to get the boundary of each district and the coordinates of each ward, so we will build a function to do that.

## Build the function to get district's boundary or ward's coordinates

In [7]:
def get_OSM_data(POI_list, info='bound', filename=''):
    query_head = """
    [out:xml][timeout:25];
    (\n"""

    query_body = ''
    
    # Get boundary
    if (info == 'bound'):
        for d in POI_list:
            query_body += """relation["name:en"=\""""+d+"""\"]; (._;>;);\n"""

        query_tail = """);
        out center;
        """
    # Get coordinates
    elif (info == 'coord'):
        for w in POI_list:
            query_body += """node["name"=\""""+w+"""\"][place=suburb];\n"""
        
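        query_tail = """);
        out;
        """
    else:
        # Fail early instead of hitting an undefined query_tail below
        raise ValueError("info must be 'bound' or 'coord'")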
    
    query = (query_head+query_body+query_tail)

    overpass_url = "http://overpass-api.de/api/interpreter"
    
    response = requests.get(overpass_url, params={'data': query})

    # Write the response content to a file if needed
    if filename != '':
        with io.open(filename,'w+', encoding='utf8') as file:
            file.write(response.text)

    return response.text
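
To see what is actually sent to the Overpass API, the query-building step can be reproduced on its own and printed. This is a minimal sketch, not the function used in this notebook: `build_overpass_query` is a hypothetical name, the whitespace is laid out differently from the strings above, and no request is made.

```python
def build_overpass_query(names, info='bound'):
    # Hypothetical helper mirroring the string assembly in get_OSM_data
    if info == 'bound':
        # One relation per district, plus its member ways and nodes
        body = ''.join('relation["name:en"="%s"]; (._;>;);\n' % n for n in names)
        tail = 'out center;'
    elif info == 'coord':
        # One suburb node per ward name
        body = ''.join('node["name"="%s"][place=suburb];\n' % n for n in names)
        tail = 'out;'
    else:
        raise ValueError("info must be 'bound' or 'coord'")
    return '[out:xml][timeout:25];\n(\n' + body + ');\n' + tail + '\n'

query = build_overpass_query(['Cống Vị', 'Kim Mã'], info='coord')
print(query)
```

Printing the query this way makes it easier to paste into an Overpass front end and check a single ward by hand when a lookup returns nothing.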

Get OSM boundary data for all districts and save as ***hanoi_districts_boundary.xml***

In [9]:
data = get_OSM_data(district_df['d_en_name'], 'bound', 'hanoi_districts_boundary.xml')

With the XML file, we can use the online tool at https://tyrasd.github.io/osmtogeojson/ to convert it to a GeoJSON file, which we name ***hanoi_districts.geojson***.

## Draw map using geojson data from our *hanoi_districts.geojson* file

We need to get the center point for our map

In [10]:
data = get_OSM_data(['Hanoi'])

Use BeautifulSoup to extract the center point. Since we know the node for Hanoi has **admin_level = 2**, we only take the coordinates of the node that satisfies this condition.

In [11]:
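from bs4 import BeautifulSoup

nodes = BeautifulSoup(data, 'lxml').findAll('node')

for node in nodes:
    # Attribute values are strings in the parsed XML, so match on '2'
    if node.find('tag', {'k': 'admin_level', 'v': '2'}) is not None:
        # Convert to numbers once so the map code receives floats
        latitude = float(node['lat'])
        longitude = float(node['lon'])
        break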
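
The same lookup can also be written with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. The sketch below is an alternative under stated assumptions, not the method used in this notebook: `find_center` is a hypothetical name, and the XML is a made-up sample in the general shape of an OSM response, with placeholder names and coordinates rather than real data.

```python
import xml.etree.ElementTree as ET

# Made-up sample; ids, names and coordinates are placeholders, not real OSM data
sample = """<osm>
  <node id="1" lat="1.5" lon="2.5"><tag k="name" v="Somewhere"/></node>
  <node id="2" lat="10.0" lon="20.0">
    <tag k="admin_level" v="2"/>
    <tag k="name" v="Capital"/>
  </node>
</osm>"""

def find_center(xml_text, level='2'):
    # Return (lat, lon) of the first node tagged with the given admin_level
    root = ET.fromstring(xml_text)
    for node in root.iter('node'):
        for tag in node.findall('tag'):
            if tag.get('k') == 'admin_level' and tag.get('v') == level:
                return float(node.get('lat')), float(node.get('lon'))
    return None  # no matching node in the response

center = find_center(sample)
print(center)
```

Returning `None` when nothing matches makes the failure explicit, whereas a bare loop leaves the latitude and longitude variables undefined.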

Draw map

In [19]:
import folium

map = folium.Map(location=[latitude, longitude], width='100%', height=400, zoom_start=11)

# Use io.open with an explicit encoding so the UTF-8 file is read reliably
with io.open('hanoi_districts.geojson','r', encoding='utf8') as file:
    geojson = file.read()

folium.GeoJson(geojson, name='District boundaries').add_to(map)

# Add marker for center point
folium.CircleMarker([latitude, longitude], radius=5,color='red',fill=True,
                       fill_color='#cc563f', fill_opacity=0.7, popup='Center of Hanoi',
                       parse_html=False).add_to(map)

map

Query OSM for coordinates of wards

In [18]:
# Loop through each district
for idx, row in district_df.iterrows():
    print(str(idx+1),'of',total_districts,'- Processing', row['d_en_name'])
    
    data = get_OSM_data(ward_df.loc[ward_df['d_en_name']==row['d_en_name'], 'w_name'], 'coord')
    nodes = BeautifulSoup(data, 'lxml').findAll('node')
    for node in nodes:
        # Look up the node's name once, then fill both coordinates for matching wards
        name = node.find('tag',{'k':'name'})['v']
        ward_df.loc[ward_df['w_name']==name,'w_lat']=node['lat']
        ward_df.loc[ward_df['w_name']==name,'w_lon']=node['lon']

1 of 12 - Processing Ba Dinh District
2 of 12 - Processing Hoan Kiem District
3 of 12 - Processing Tay Ho District
4 of 12 - Processing Long Bien District
5 of 12 - Processing Cau Giay District
6 of 12 - Processing Dong Da District
7 of 12 - Processing Hai Ba Trung District
8 of 12 - Processing Hoang Mai District
9 of 12 - Processing Thanh Xuan District
10 of 12 - Processing Nam Tu Liem District
11 of 12 - Processing Bac Tu Liem District
12 of 12 - Processing Ha Dong District


Let's check the result

In [20]:
ward_df.head()

Unnamed: 0,d_code,d_en_name,d_vi_name,d_lat,d_lon,d_url,w_name,w_url,w_lat,w_lon
0,1,Ba Dinh District,Ba Đình,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Cống Vị,https://vi.wikipedia.org/wiki/C%E1%BB%91ng_V%E...,21.0356697,105.8102348
1,1,Ba Dinh District,Ba Đình,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Điện Biên,https://vi.wikipedia.org/wiki/%C4%90i%E1%BB%87...,21.030667,105.8383505
2,1,Ba Dinh District,Ba Đình,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Đội Cấn,https://vi.wikipedia.org/wiki/%C4%90%E1%BB%99i...,21.0348806,105.830439
3,1,Ba Dinh District,Ba Đình,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Giảng Võ,https://vi.wikipedia.org/wiki/Gi%E1%BA%A3ng_V%...,21.0257843,105.8188201
4,1,Ba Dinh District,Ba Đình,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Kim Mã,https://vi.wikipedia.org/wiki/Kim_M%C3%A3,21.0312786,105.8263686



This looks good, but I want to make sure the latitude and longitude are filled in for every ward.

In [26]:
ward_df[(ward_df['w_lat'].isnull()) | (ward_df['w_lon'].isnull())]

Unnamed: 0,d_code,d_en_name,d_vi_name,d_lat,d_lon,d_url,w_name,w_url,w_lat,w_lon
14,2,Hoan Kiem District,Hoàn Kiếm,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Chương Dương,https://vi.wikipedia.org/wiki/Ch%C6%B0%C6%A1ng...,,
17,2,Hoan Kiem District,Hoàn Kiếm,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Đồng Xuân,https://vi.wikipedia.org/wiki/%C4%90%E1%BB%93n...,,
36,3,Tay Ho District,Tây Hồ,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Thụy Khuê,https://vi.wikipedia.org/wiki/Th%E1%BB%A5y_Khu...,,
37,3,Tay Ho District,Tây Hồ,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Tứ Liên,https://vi.wikipedia.org/wiki/T%E1%BB%A9_Li%C3...,,
43,4,Long Bien District,Long Biên,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Gia Thụy,https://vi.wikipedia.org/wiki/Gia_Th%E1%BB%A5y,,
...,...,...,...,...,...,...,...,...,...,...
160,268,Ha Dong District,Hà Đông,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Phúc La,https://vi.wikipedia.org/wiki/Ph%C3%BAc_La,,
161,268,Ha Dong District,Hà Đông,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Quang Trung,"https://vi.wikipedia.org/wiki/Quang_Trung,_H%C...",,
162,268,Ha Dong District,Hà Đông,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Vạn Phúc,https://vi.wikipedia.org/wiki/V%E1%BA%A1n_Ph%C...,,
163,268,Ha Dong District,Hà Đông,,,https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...,Văn Quán,https://vi.wikipedia.org/wiki/V%C4%83n_Qu%C3%A...,,


Out of 166 wards, 98 have no coordinate data, which is far too many to ignore. We have to find another way to fill in the coordinates.

Looking at the Wikipedia page for each ward, we find the coordinates, but in DMS (degrees, minutes, seconds) format. So we will have to extract the coordinates and convert them to DD (decimal degrees) for use in Folium.

Let's try with the ward that has index value 14.

In [42]:
print(ward_df.loc[14])
data = BeautifulSoup(getData(ward_df.loc[14, 'w_url']), 'html.parser')

d_code                                                       2
d_en_name                                   Hoan Kiem District
d_vi_name                                            Hoàn Kiếm
d_lat                                                      NaN
d_lon                                                      NaN
d_url        https://vi.wikipedia.org/wiki/Th%E1%BB%83_lo%E...
w_name                                            Chương Dương
w_url        https://vi.wikipedia.org/wiki/Ch%C6%B0%C6%A1ng...
w_lat                                                      NaN
w_lon                                                      NaN
Name: 14, dtype: object


In [46]:
lat_dms = data.find('span',{'class':'latitude'}).text
lon_dms = data.find('span',{'class':'longitude'}).text
print(ward_df.loc[14, 'w_name'], 'have latitude', lat_dms, 'and longitude', lon_dms, 'in DMS format')

Chương Dương have latitude 21°01′38″B and longitude 105°51′37″Đ in DMS format
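
The conversion from DMS to DD is plain arithmetic: degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. Below is a minimal sketch of that step. The name `dms_to_dd` is hypothetical, and the code assumes the Vietnamese hemisphere letters B (Bắc, north), N (Nam, south), Đ (Đông, east) and T (Tây, west), as well as the ′ and ″ symbols seen in the output above.

```python
import re

# Degrees, optional minutes, optional seconds, then a hemisphere letter
DMS_PATTERN = re.compile(r'(\d+)°(?:(\d+)′)?(?:(\d+(?:\.\d+)?)″)?\s*([BNĐT])')

def dms_to_dd(dms):
    # Hypothetical helper: convert a string such as 21°01′38″B to decimal degrees
    match = DMS_PATTERN.fullmatch(dms.strip())
    if match is None:
        raise ValueError('Unrecognised DMS string: ' + repr(dms))
    degrees, minutes, seconds, hemisphere = match.groups()
    dd = int(degrees) + int(minutes or 0) / 60 + float(seconds or 0) / 3600
    # Assumption: N (Nam, south) and T (Tây, west) are the negative hemispheres
    return -dd if hemisphere in ('N', 'T') else dd

lat_dd = dms_to_dd('21°01′38″B')
lon_dd = dms_to_dd('105°51′37″Đ')
print(round(lat_dd, 6), round(lon_dd, 6))
```

For Chương Dương this gives roughly 21.0272 and 105.8603, which is in the same range as the coordinates OpenStreetMap returned for the other wards.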
