# The Battle of Neighborhoods (Week 1)

#### By: Pierre Shi

## Assignment 1 - Introduction

### Finding a location to set up a Japanese restaurant in Gyeonggi

Gyeonggi Province is the most populous province in South Korea. Gyeonggi-do can be translated as "province surrounding Seoul". A group of Japanese investors wants to set up a business in this province but does not know where to start, so they propose opening a Japanese restaurant first to evaluate the local business environment.

The question is where such a business should be located. A successful startup could indicate a promising future, while a failed investment may frustrate the investors. Comparing the existing restaurant landscape across cities using location data is one way to approach this.

In this project, I will combine Foursquare location data with population figures for Gyeonggi's cities and information about existing restaurants to find the location that maximizes the new Japanese restaurant's chance of success.

## Assignment 2 - Data

### Description of the data source and method

1. Wikipedia data on South Korean cities provides the population (2017 figures) and density of each city.
2. Foursquare data is then used to extract venue information for each city.
3. Combining these sources, K-means clustering on population and the existing restaurant mix is used to find the best city in which to start the business.

In [23]:
## Install and import related packages
!conda install -c conda-forge folium=0.5.0 --yes
!pip install lxml
!pip install html5lib
!pip install requests
!pip install beautifulsoup4
!pip install tqdm
!pip install geopy

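import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
from bs4 import BeautifulSoup
from tqdm import tqdm
import seaborn as sns
from geopy.geocoders import Nominatim # geocoder used to look up city coordinates
from geopy.extra.rate_limiter import RateLimiter # throttles geocoding requests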

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



### 2.1 Download Data From Wiki

We load the table of South Korean cities from Wikipedia.

In [26]:
df_K = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_in_South_Korea")[5]
df_K.head()

| | City | Hangul | Hanja | Province | Population(2017) | Area | Density | Founded |
|---|---|---|---|---|---|---|---|---|
| 0 | Andong | 안동시 | 安東市 | North Gyeongsang | 168226 | 1521.26 | 110.6 | 1963-01-01 |
| 1 | Ansan | 안산시 | 安山市 | Gyeonggi | 689326 | 149.06 | 4624.5 | 1986-01-01 |
| 2 | Anseong | 안성시 | 安城市 | Gyeonggi | 182784 | 553.47 | 330.3 | 1998-04-01 |
| 3 | Anyang | 안양시 | 安養市 | Gyeonggi | 598392 | 58.46 | 10235.9 | 1973-07-01 |
| 4 | Asan | 아산시 | 牙山市 | South Chungcheong | 303043 | 542.15 | 559.0 | 1986-01-01 |


### 2.2 Data Cleanup

In [27]:
# Drop unneeded columns, rename the remaining ones, and replace "*" placeholders with 0
df_K.drop(columns=["Hangul", "Hanja", "Founded"], inplace = True)
df_K.columns = ["City", "Province", "Population", "Area", "Density"]
df_K.replace("*", 0, inplace=True)

# Correcting data types
df_K = df_K.astype({"Population":"float64", "Density":"float64"})
df_K.dtypes

City           object
Province       object
Population    float64
Area          float64
Density       float64
dtype: object

In [28]:
df_K

| | City | Province | Population | Area | Density |
|---|---|---|---|---|---|
| 0 | Andong | North Gyeongsang | 168226.0 | 1521.26 | 110.6 |
| 1 | Ansan | Gyeonggi | 689326.0 | 149.06 | 4624.5 |
| 2 | Anseong | Gyeonggi | 182784.0 | 553.47 | 330.3 |
| 3 | Anyang | Gyeonggi | 598392.0 | 58.46 | 10235.9 |
| 4 | Asan | South Chungcheong | 303043.0 | 542.15 | 559.0 |
| ... | ... | ... | ... | ... | ... |
| 80 | Yeoju | Gyeonggi | 111558.0 | 608.64 | 183.3 |
| 81 | Yeongcheon | North Gyeongsang | 100384.0 | 920.29 | 109.1 |
| 82 | Yeongju | North Gyeongsang | 109281.0 | 669.05 | 163.3 |
| 83 | Yeosu | South Jeolla | 288818.0 | 501.27 | 576.2 |


Next, I use geopy's Nominatim geocoder to look up the latitude and longitude of each city and add them to the dataframe.

In [29]:
geolocator = Nominatim(user_agent="Korea_Explorer1")

In [30]:
# Get the coordinates of each city, at most one request per second.
# Note: the query is the bare city name, so a same-named place in
# another country can be returned.
tqdm.pandas()
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
coords = df_K["City"].progress_apply(geocode)
# Populate the Latitude and Longitude columns from coords;
# cities the geocoder could not resolve (None) are left as NaN
df_K["Latitude"] = coords.apply(lambda loc: loc.latitude if loc else np.nan)
df_K["Longitude"] = coords.apply(lambda loc: loc.longitude if loc else np.nan)

100%|██████████| 85/85 [01:24<00:00,  1.01it/s]


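Because only the Gyeonggi data is needed, I keep those rows and drop the rest. In the output below, Ansan and Anyang were resolved to coordinates far from the other cities (longitude 0.77 and 114.34 rather than about 127), which indicates that the geocoder matched same-named places outside Korea; these two rows need correcting before the venue results for them can be trusted.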

In [31]:
df_Gyeonggi = df_K.drop(df_K[df_K.Province != 'Gyeonggi'].index)
df_Gyeonggi.head()

| | City | Province | Population | Area | Density | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 1 | Ansan | Gyeonggi | 689326.0 | 149.06 | 4624.5 | 43.690843 | 0.773523 |
| 2 | Anseong | Gyeonggi | 182784.0 | 553.47 | 330.3 | 37.007773 | 127.279971 |
| 3 | Anyang | Gyeonggi | 598392.0 | 58.46 | 10235.9 | 36.102355 | 114.33633 |
| 6 | Bucheon | Gyeonggi | 851245.0 | 53.4 | 15940.9 | 37.501442 | 126.766014 |
| 16 | Dongducheon | Gyeonggi | 98062.0 | 95.66 | 1025.1 | 37.903086 | 127.060518 |
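
A quick sanity check on the geocoded coordinates can catch rows that landed outside Korea. The sketch below uses the five rows from the output above and a rough bounding box for South Korea; the box limits and the helper name `outside_korea` are my own assumptions for illustration, not part of the original analysis.

```python
# Rough bounding box for South Korea (an assumption for illustration,
# not an official boundary).
LAT_RANGE = (33.0, 39.0)
LON_RANGE = (124.0, 132.0)

# (city, latitude, longitude) rows copied from the output above.
rows = [
    ("Ansan", 43.690843, 0.773523),
    ("Anseong", 37.007773, 127.279971),
    ("Anyang", 36.102355, 114.33633),
    ("Bucheon", 37.501442, 126.766014),
    ("Dongducheon", 37.903086, 127.060518),
]


def outside_korea(lat, lon):
    """Return True when the point falls outside the rough bounding box."""
    in_lat = LAT_RANGE[0] <= lat <= LAT_RANGE[1]
    in_lon = LON_RANGE[0] <= lon <= LON_RANGE[1]
    return not (in_lat and in_lon)


# Cities whose geocoded point is not inside the box
suspect = [city for city, lat, lon in rows if outside_korea(lat, lon)]
print(suspect)  # ['Ansan', 'Anyang']
```

Flagged cities could then be geocoded again with a more specific query, for example the city name followed by "South Korea". I believe geopy's Nominatim `geocode` also accepts a `country_codes` argument that restricts results to given countries, but I have not verified that here.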


### 2.3 Get Data from Foursquare

I use my Foursquare API credentials to retrieve the venue data.

In [32]:
# Foursquare client ID and client secret
CLIENT_ID = 'ZIO5LTYH2NFRAFS0EDV20VNBTXJZIMRLYC1AR0QQKOTLQUHF'
CLIENT_SECRET = 'SVRUJKAHEOEUHGIKDIZS5VHQNN5N2PVNOQGVKZJ3P5PYDDTF'
VERSION = '20180604'
LIMIT = 100
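
Since API credentials are easy to leak when a notebook is shared, one alternative is to read them from environment variables instead of writing them into a cell. This is only a sketch: the function `load_foursquare_credentials` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical names chosen for illustration.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from environment variables.

    The variable names are hypothetical; use whatever fits your setup.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail with a readable message instead of a bare KeyError
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Usage in the notebook, assuming both variables are set beforehand:
# CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()
```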

In [34]:
from math import sqrt, pi
df_Gyeonggi["Search Radius"] = df_Gyeonggi["Area"].apply(lambda x: round(sqrt(x/pi)*1000))
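
The search radius above treats each city as a circle with the same area: from A = πr², the radius is r = sqrt(A/π), and multiplying by 1000 converts kilometres to metres. That assumes the Area column is in square kilometres. The sketch below works through Bucheon's area of 53.4 from the table; the function name `search_radius_m` is mine.

```python
from math import pi, sqrt


def search_radius_m(area_km2):
    """Radius in metres of a circle whose area equals area_km2."""
    return round(sqrt(area_km2 / pi) * 1000)


# Bucheon has an area of 53.4 in the table above
print(search_radius_m(53.4))  # 4123
```

Real city boundaries are not circular, so this radius only approximates the area each city covers.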

In [35]:
def getNearbyVenues(names, latitudes, longitudes, radius):
    """Query the Foursquare explore endpoint for venues around each city."""
    venues_list = []
    # r is the per-city search radius; radius is the whole column
    for name, lat, lng, r in zip(names, latitudes, longitudes, radius):
        print(name)

        # Build the API request; requests encodes the query string
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': r,
            'limit': LIMIT,
        }

        # Make the GET request and pull out the list of recommended items
        results = requests.get(url, params=params).json()['response']['groups'][0]['items']

        # Keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # Flatten the per-city lists into one dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City',
                  'City Latitude',
                  'City Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues

# Call the function and save the results into a variable
df_Gyeonggi_venues = getNearbyVenues(
    names=df_Gyeonggi['City'],
    latitudes=df_Gyeonggi['Latitude'],
    longitudes=df_Gyeonggi['Longitude'],
    radius=df_Gyeonggi['Search Radius'],
)

Ansan
Anseong
Anyang
Bucheon
Dongducheon
Gimpo
Goyang
Gunpo
Guri
Gwacheon
Gwangju
Gwangmyeong
Hanam
Hwaseong
Icheon
Namyangju
Osan
Paju
Pocheon
Pyeongtaek
Seongnam
Siheung
Suwon
Uijeongbu
Uiwang
Yangju
Yeoju
Yongin
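
The function above indexes straight into `response`, `groups[0]` and `categories[0]`, so one unexpected response would stop the whole loop. As a rough sketch of a more defensive version of the parsing step, the helper below (a hypothetical `extract_venues`, not part of the notebook) flattens a payload with the same nesting and tolerates missing groups or categories. The second venue in the sample is invented to show the empty-category case.

```python
def extract_venues(payload):
    """Flatten an explore-style payload into (name, lat, lng, category) tuples.

    A venue with an empty category list gets the category None instead of
    raising IndexError; a payload without groups yields an empty list.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], category))
    return rows


# Sample payload: the first venue comes from the notebook output,
# the second is made up to illustrate a venue without categories.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "장터국밥",
               "location": {"lat": 36.999503, "lng": 127.268721},
               "categories": [{"name": "Korean Restaurant"}]}},
    {"venue": {"name": "Example venue",
               "location": {"lat": 37.0, "lng": 127.3},
               "categories": []}},
]}]}}

for row in extract_venues(sample):
    print(row)
```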


In [36]:
print(df_Gyeonggi_venues.shape)
df_Gyeonggi_venues.head()

(2047, 7)


| | City | City Latitude | City Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Ansan | 43.690843 | 0.773523 | Aubiet Terrain De Foot | 43.657431 | 0.777255 | Soccer Field |
| 1 | Ansan | 43.690843 | 0.773523 | Gare SNCF d'Aubiet | 43.650207 | 0.789071 | Train Station |
| 2 | Ansan | 43.690843 | 0.773523 | Un jour inoubliable photographe Gers | 43.747599 | 0.744879 | Photography Studio |
| 3 | Ansan | 43.690843 | 0.773523 | Traiteur Gers (Gilbert Saint-Loubert) | 43.747506 | 0.742564 | Food Service |
| 4 | Anseong | 37.007773 | 127.279971 | 장터국밥 | 36.999503 | 127.268721 | Korean Restaurant |


In [37]:
print("Number of unique venue categories in Gyeonggi of Korea:{}".format(len(df_Gyeonggi_venues["Venue Category"].unique())))

Number of unique venue categories in Gyeonggi of Korea:187


All the data needed for further analysis is now available.