<h1 align=center><font size = 8>Capstone Project: The Battle of Neighbourhoods - Week #1</font></h1>

<h2 align=left><font size = 6>1. Introduction</font></h2>

<h3 align=left><font size = 5>1.1 Description of the Problem</font></h3>

Many practical questions require comparing neighborhoods across a city. For example, a job seeker with transferable skills may wish to focus the search on a single neighborhood whose jobs best match their qualifications, rather than spreading the effort across many neighborhoods. Likewise, a restaurant looking to expand might first select target neighborhoods before considering particular sites. Many within-city computations can also be aided by modelling a neighborhood's relationship to other neighborhoods. For example, a person buying or renting a home in a new city might want to compare the city's neighborhoods.

<h3 align=left><font size = 5>1.2 Objective</font></h3>
Taipei City and New Taipei City are two major cities in Taiwan. Both are centers for residence, employment, tourism, education, shopping and sports. Both municipalities are located in the north of Taiwan, and New Taipei City surrounds Taipei City. The aim of this project is to segment the neighborhoods of both cities based on Foursquare data about venue categories in a total of 41 neighborhoods across the two cities. Using segmentation and clustering, I hope to determine:


1. the similarity or dissimilarity between neighborhoods.
2. the classification of each neighborhood within a city, for example residential, tourist-oriented, or other.

<h2 align=left><font size = 6>2. Data</font></h2>

<h3 align=left><font size = 5>2.1 Description of Data</font></h3>

This project relies on public neighborhood boundary data hosted on GitHub and on venue data from Foursquare.

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import since pandas 1.0)
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs # the samples_generator submodule was removed in newer scikit-learn releases
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
from bs4 import BeautifulSoup
import lxml
print('Libraries imported.')

Libraries imported.


<h1 align=left><font size = 5>1. Download, scrape and wrangle</font></h1>

There are 2 boroughs and 41 neighborhoods in total across Taipei City and New Taipei City (here each city is treated as a borough and each district as a neighborhood). In order to segment the neighborhoods and explore them, we need a dataset that contains the 2 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

Luckily, these datasets exist for free on the web.

### A. City Data

First, set up the URLs of the GitHub directories that contain the JSON files for both cities.

In [2]:
# JSON file urls
url_dict={'Taipei City': 'https://github.com/comdan66/TaipeiTowns/tree/master/towns/%E5%8F%B0%E5%8C%97%E5%B8%82', 
          'New Taipei City': 'https://github.com/comdan66/TaipeiTowns/tree/master/towns/%E6%96%B0%E5%8C%97%E5%B8%82'}


Next, make a dictionary as a lookup table for neighborhood name translation from Chinese to English.

In [3]:
neighbor_dict={'中山區': 'Zhongshan', '中正區': 'Zhongzheng', '信義區': 'Xinyi', '內湖區': 'Neihu', '北投區': 'Beitou', 
               '南港區': 'Nangang', '士林區': 'Shilin', '大同區': 'Datong', '大安區': 'Daan', '文山區': 'Wenshan', '松山區': 'Songshan',
               '萬華區': 'Wanhua',  '三峽區': 'Sanxia', '三芝區': 'Sanjhih', '三重區': 'Sanchong', '中和區': 'Zhonghe', '五股區': 'Wugu',
               '八里區': 'Bali', '土城區': 'Tucheng', '坪林區': 'Pinglin', '平溪區': 'Pingxi', '新店區': 'Xindian', '新莊區': 'Xinzhuang',
               '板橋區': 'Banqiao', '林口區': 'Linkou', '樹林區': 'Shulin', '永和區': 'Yonghe', '汐止區': 'Xizhi', '泰山區': 'Taishan',
               '淡水區': 'Tamsui', '深坑區': 'Shenkeng', '烏來區': 'Wulai', '瑞芳區': 'Rueifang', '石碇區': 'Shiding', '石門區': 'Shimen',
               '萬里區': 'Wanli', '蘆洲區': 'Luzhou', '貢寮區': 'Gongliao', '金山區': 'Jinshan', '雙溪區': 'Shuangxi', '鶯歌區': 'Yingge'}
neighbor_dict

{'中山區': 'Zhongshan',
 '中正區': 'Zhongzheng',
 '信義區': 'Xinyi',
 '內湖區': 'Neihu',
 '北投區': 'Beitou',
 '南港區': 'Nangang',
 '士林區': 'Shilin',
 '大同區': 'Datong',
 '大安區': 'Daan',
 '文山區': 'Wenshan',
 '松山區': 'Songshan',
 '萬華區': 'Wanhua',
 '三峽區': 'Sanxia',
 '三芝區': 'Sanjhih',
 '三重區': 'Sanchong',
 '中和區': 'Zhonghe',
 '五股區': 'Wugu',
 '八里區': 'Bali',
 '土城區': 'Tucheng',
 '坪林區': 'Pinglin',
 '平溪區': 'Pingxi',
 '新店區': 'Xindian',
 '新莊區': 'Xinzhuang',
 '板橋區': 'Banqiao',
 '林口區': 'Linkou',
 '樹林區': 'Shulin',
 '永和區': 'Yonghe',
 '汐止區': 'Xizhi',
 '泰山區': 'Taishan',
 '淡水區': 'Tamsui',
 '深坑區': 'Shenkeng',
 '烏來區': 'Wulai',
 '瑞芳區': 'Rueifang',
 '石碇區': 'Shiding',
 '石門區': 'Shimen',
 '萬里區': 'Wanli',
 '蘆洲區': 'Luzhou',
 '貢寮區': 'Gongliao',
 '金山區': 'Jinshan',
 '雙溪區': 'Shuangxi',
 '鶯歌區': 'Yingge'}
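As a minimal sketch of how the lookup table is used later, the helper below strips the `.json` suffix from a file name and translates the remaining Chinese district name. The function name `to_english` and the two-entry `sample_lookup` are illustrative assumptions, not part of the notebook; returning `None` for an unknown name is a design choice that differs from direct indexing, which would raise a `KeyError`.

```python
def to_english(filename, lookup):
    """Strip a trailing '.json' and translate the district name.

    Returns None when the name is not in the lookup table
    (hypothetical helper, shown for illustration only).
    """
    suffix = '.json'
    name = filename[:-len(suffix)] if filename.endswith(suffix) else filename
    return lookup.get(name)


# A two-entry subset of the lookup table defined above
sample_lookup = {'中山區': 'Zhongshan', '淡水區': 'Tamsui'}

print(to_english('中山區.json', sample_lookup))
print(to_english('不存在區.json', sample_lookup))
```

Checking for `None` before using the result makes it easier to spot a district whose name is missing from the table.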

Now, let's retrieve Geo-spatial data from websites in order to explore city venues with Foursquare.

Because the websites contain only neighborhood boundary data, we need to calculate each neighborhood's center latitude and longitude by averaging the retrieved boundary points.
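The averaging step can be sketched in plain Python as follows. `vertex_mean` mirrors the averaging described above; `polygon_centroid` is an alternative that is not used in this notebook, included only to show that the plain mean is pulled toward stretches of the boundary that are sampled more densely. Both function names and the toy square are assumptions made for illustration, and the centroid treats latitude and longitude as planar coordinates, which is only a rough approximation at district scale.

```python
def vertex_mean(points):
    """Average the boundary vertices (the approach described above)."""
    n = len(points)
    first = sum(p[0] for p in points) / n
    second = sum(p[1] for p in points) / n
    return first, second


def polygon_centroid(points):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    # Pair each vertex with the next one, wrapping around to the first
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        return vertex_mean(points)  # degenerate polygon: fall back to the mean
    return cx / (3 * area2), cy / (3 * area2)


# Toy square with three extra vertices along one edge (hypothetical data)
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0),
          (0.0, 3.0), (0.0, 2.0), (0.0, 1.0)]

print(vertex_mean(square))       # pulled toward the densely sampled edge
print(polygon_centroid(square))  # geometric center of the square
```

For marker placement on a city-scale map the vertex mean is usually close enough, which is presumably why the simpler approach is used here.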

In [4]:
root='https://github.com'

In [5]:
# Define function for retrieving Geo-spatial data

def get_geo(city, url, df):
    results = requests.get(url)
    soup = BeautifulSoup(results.text, 'html.parser')
    table=soup.find('table')
    url_strings=table.findAll('span', class_="css-truncate css-truncate-target")
    
    # Get neighborhood json file urls and neighborhood names
    url_list=[]
    neighbor_list=[]

    for i in range(0,len(url_strings),3):
        url_list.append(root + url_strings[i].a['href'])
        neighbor_list.append(neighbor_dict[url_strings[i].text.replace('.json','')])
        
    # Get neighborhood boundary latitude and longitude
    geo_list=[]

    for i in range(0,len(url_list)):
        geo=[]
        url=url_list[i]
        results = requests.get(url)
        soup = BeautifulSoup(results.text, 'html.parser')
        table=soup.find('table', class_='highlight tab-size js-file-line-container')
        rows=table.findAll('tr')

        location = 'LC'

        for j in range(2,len(rows)-1):
            id=location+str(j+1)
            data=rows[j].find(id=id).text

            data=data.lstrip(' [').rstrip('], ').split(', ')
            data[0]=float(data[0])
            data[1]=float(data[1])
            geo.append(data)
            
        # Calculate neighborhood center latitude and longitude values by averaging the boundary data
        geo = pd.DataFrame(geo).mean().to_list()

        # Fill the dataframe with data
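        # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
        row = pd.DataFrame([{'Borough': city,
                             'Neighborhood': neighbor_list[i],
                             'Latitude': geo[0],
                             'Longitude': geo[1]}])
        df = pd.concat([df, row], ignore_index=True)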
    return(df)
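The string handling inside the inner loop can be isolated into a small helper, sketched below. It assumes each boundary line looks like `[25.071279, 121.537923],`, which is inferred from the `lstrip`/`rstrip` calls in the function above rather than confirmed against the source files. The name `parse_pair` is hypothetical.

```python
def parse_pair(line):
    """Parse a line such as '  [25.07, 121.54],' into a (float, float) tuple.

    Raises ValueError when the line does not hold exactly two numbers.
    """
    # Drop surrounding whitespace, the trailing comma, and the brackets
    cleaned = line.strip().rstrip(',').strip().lstrip('[').rstrip(']')
    parts = [p.strip() for p in cleaned.split(',')]
    if len(parts) != 2:
        raise ValueError('expected two numbers, got: {!r}'.format(line))
    return float(parts[0]), float(parts[1])


print(parse_pair('  [25.071279, 121.537923],'))
```

Raising on malformed lines, instead of silently skipping them, makes a change in the page layout visible as soon as the scraper runs.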

The next task is essentially transforming Geo-spatial data into a *pandas* dataframe. So let's start by creating an empty dataframe.

In [6]:
# define the dataframe columns
columns=['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# instantiate the dataframe
geo_data=pd.DataFrame(columns=columns)

Take a look at the empty dataframe to confirm that the columns are as intended.

In [7]:
geo_data

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Then let's loop through the websites to retrieve Geo-spatial data and fill the dataframe with data.

In [8]:
for city, url in url_dict.items():
    geo_data=get_geo(city, url, geo_data)
    
geo_data

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Taipei City,Zhongshan,25.071279,121.537923
1,Taipei City,Zhongzheng,25.031602,121.518729
2,Taipei City,Xinyi,25.030042,121.573207
3,Taipei City,Neihu,25.081642,121.59561
4,Taipei City,Beitou,25.151112,121.522906
5,Taipei City,Nangang,25.03355,121.620147
6,Taipei City,Shilin,25.128001,121.547409
7,Taipei City,Datong,25.060159,121.512601
8,Taipei City,Daan,25.021382,121.546949
9,Taipei City,Wenshan,24.987646,121.576226


Quickly examine the resulting dataframe.

Also make sure that the dataset has both boroughs and all 41 neighborhoods.

In [9]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(geo_data['Borough'].unique()),
        geo_data.shape[0]
    )
)

print('The boroughs are: ', set(geo_data['Borough']))

The dataframe has 2 boroughs and 41 neighborhoods.
The boroughs are:  {'New Taipei City', 'Taipei City'}
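A per-borough breakdown is a slightly stronger check than the totals alone. The sketch below uses only the standard library and a few hypothetical `(borough, neighborhood)` records; the helper name `count_by_borough` is an assumption. With the real data, the same counts could be obtained from the `Borough` column of `geo_data`.

```python
from collections import Counter


def count_by_borough(records):
    """Count neighborhoods per borough from (borough, neighborhood) pairs."""
    return Counter(borough for borough, _ in records)


# Hypothetical sample records, not the full dataset
sample_records = [
    ('Taipei City', 'Zhongshan'),
    ('Taipei City', 'Xinyi'),
    ('New Taipei City', 'Tamsui'),
]

for borough, n in count_by_borough(sample_records).items():
    print('{}: {} neighborhoods'.format(borough, n))
```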


### B. City Map

#### Use geopy library to get the latitude and longitude values of Taipei City.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>taipei_explorer</em>, as shown below.

In [10]:
address = 'Taipei, Taiwan'

geolocator = Nominatim(user_agent="taipei_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Taipei City are {}, {}.'.format(latitude, longitude))
print('There are {} neighborhoods in Taipei City and New Taipei City.'.format(geo_data.shape[0]))

The geographical coordinates of Taipei City are 25.0375198, 121.5636796.
There are 41 neighborhoods in Taipei City and New Taipei City.


#### Create a map of Taipei City and New Taipei City with neighborhoods superimposed on top.

In [11]:
# create map of Taipei City and New Taipei City using latitude and longitude values
map_taipei = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(geo_data['Latitude'], geo_data['Longitude'], geo_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_taipei)  
    
map_taipei