# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results](#results)
* [Conclusion](#conclusion)

## 1. Introduction: Business Problem <a name="introduction"></a>

The goal of this capstone project is to analyze and recommend the best locations in **Osaka** for opening a new **Vietnamese restaurant**. Within the scope of this project, I am not considering a high-end, exclusive restaurant of the kind often associated with Italian or French cuisine, but a small family restaurant. In Japan, the first and most important factor to consider when choosing a place for this kind of business is accessibility, because most people travel by public transportation such as train, subway, and bus. Another factor to consider is the target customers. Because Southeast Asian cuisines share many similarities, I expect the main customers to be Southeast Asian.  

By applying what I have learned throughout this course, such as data science methodology and machine learning techniques, I will try to answer the business question: **if someone is looking to open a new Vietnamese restaurant in the Osaka area, where should they consider opening it?**

## 2. Data <a name="data"></a>

We need data from reliable sources for the analysis. To understand the problem and quantify the results, we will use the following data.  

+ List of wards in Osaka from the Wikipedia page https://en.wikipedia.org/wiki/Osaka.   
+ Latitude and longitude coordinates of those wards, obtained with the Python Geocoder package.   
+ Venue data, particularly restaurant data, from https://foursquare.com/.   
+ Foreign population, particularly the Vietnamese population in Osaka, from the official Osaka city website https://www.city.osaka.lg.jp.   

In [2]:
# The code was removed by Watson Studio for sharing.

Libraries imported.


**Scrape data from the Wikipedia page into a DataFrame**

In [3]:
# send the GET request
data = requests.get('https://en.wikipedia.org/wiki/Osaka').text

In [4]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')
#print(soup)

In [5]:
# create a list to store ward data
romanji_wardList = []
kanji_wardList = []
populationList = []

In [6]:
# append the data into the list
for row in soup.find(lambda tag: tag.name == 'table' and tag.get('class') == ['wikitable']).find_all('tr'):
    cells = row.find_all('td')
    if(len(cells) > 0):
        romanji_wardList.append(cells[1].text.rstrip('\n'))
        kanji_wardList.append(cells[2].text.rstrip('\n'))
        populationList.append(cells[3].text.rstrip('\n'))

In [7]:
# create a new DataFrame from the list
Osaka_ward_df = pd.DataFrame({'Romanji Ward': romanji_wardList,
                           'Kanji Ward': kanji_wardList,
                           'Population': populationList})

Osaka_ward_df

Unnamed: 0,Romanji Ward,Kanji Ward,Population
0,Abeno-ku,阿倍野区,107000
1,Asahi-ku,旭区,90854
2,Chūō-ku,中央区,100998
3,Fukushima-ku,福島区,78348
4,Higashinari-ku,東成区,83684
5,Higashisumiyoshi-ku,東住吉区,126704
6,Higashiyodogawa-ku,東淀川区,176943
7,Hirano-ku,平野区,193282
8,Ikuno-ku,生野区,129641
9,Jōtō-ku,城東区,167925


In [8]:
# print the number of rows of the dataframe
Osaka_ward_df.shape

(24, 3)

**Get the geographical coordinates**

In [9]:
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, 大阪市'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
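
The loop in `get_latlng` keeps retrying for as long as the geocoder returns no result, so a ward that can never be resolved would stall the notebook. A minimal sketch of a bounded retry with a short pause between attempts is shown below. The helper name `retry_lookup`, its parameters, and the stand-in `flaky_lookup` function are assumptions made for illustration and are not part of the original notebook; in practice the lookup would wrap the `geocoder.arcgis(...)` call.

```python
import time


def retry_lookup(lookup, query, max_attempts=5, pause_seconds=1.0):
    """Call lookup(query) until it returns a non-None value.

    Hypothetical helper: gives up after max_attempts and returns None
    instead of looping forever.
    """
    for attempt in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        # wait a little before trying again (skipped after the last attempt)
        if attempt < max_attempts - 1:
            time.sleep(pause_seconds)
    return None


# Stand-in for a geocoder call that fails twice before succeeding.
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return [34.6937569, 135.5014539]


coords = retry_lookup(flaky_lookup, "大阪市", pause_seconds=0.0)
print(coords)
```

A caller would then need to decide what to do with a `None` result, for example by reporting the ward name rather than building the DataFrame with a missing coordinate.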

In [10]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in Osaka_ward_df['Kanji Ward'].tolist() ]

In [11]:
coords

[[34.638732384000036, 135.51846701600005],
 [34.721168302000024, 135.54426935900005],
 [34.681143992000045, 135.50988413100004],
 [34.69230835500008, 135.47221971200008],
 [34.66995105700005, 135.54127030600011],
 [34.62212290100007, 135.52666107200002],
 [34.74122629100003, 135.52941332900002],
 [34.621289991000026, 135.54638145100012],
 [34.65364726300004, 135.53435383200008],
 [34.703209542000025, 135.5447971630001],
 [34.70536324600005, 135.51004948800005],
 [34.68305881100002, 135.4523606150001],
 [34.66392284600005, 135.46077739200007],
 [34.70128087300003, 135.52807667800005],
 [34.65940480000006, 135.4995426050001],
 [34.676302572000054, 135.48596146500006],
 [34.634954497000024, 135.49441456700004],
 [34.71133350000008, 135.45616510000002],
 [34.609624101000065, 135.48283201300012],
 [34.60365252500003, 135.5005810240001],
 [34.65034192100006, 135.47280482500003],
 [34.657815604000064, 135.51930857600007],
 [34.704365030000076, 135.57423884500008],
 [34.721054991000074, 135.48

In [12]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [13]:
# merge the coordinates into the original dataframe
Osaka_ward_df['Latitude'] = df_coords['Latitude']
Osaka_ward_df['Longitude'] = df_coords['Longitude']

In [14]:
# check the neighborhoods and the coordinates
print(Osaka_ward_df.shape)
Osaka_ward_df

(24, 5)


Unnamed: 0,Romanji Ward,Kanji Ward,Population,Latitude,Longitude
0,Abeno-ku,阿倍野区,107000,34.638732,135.518467
1,Asahi-ku,旭区,90854,34.721168,135.544269
2,Chūō-ku,中央区,100998,34.681144,135.509884
3,Fukushima-ku,福島区,78348,34.692308,135.47222
4,Higashinari-ku,東成区,83684,34.669951,135.54127
5,Higashisumiyoshi-ku,東住吉区,126704,34.622123,135.526661
6,Higashiyodogawa-ku,東淀川区,176943,34.741226,135.529413
7,Hirano-ku,平野区,193282,34.62129,135.546381
8,Ikuno-ku,生野区,129641,34.653647,135.534354
9,Jōtō-ku,城東区,167925,34.70321,135.544797


In [15]:
# save the DataFrame as CSV file
Osaka_ward_df.to_csv("Osaka_ward_df.csv", index=False)

In [16]:
# create a copy of Osaka dataset to merge with number of Vietnamese
Osaka_df_copy = Osaka_ward_df.copy()
Osaka_df_copy.head()

Unnamed: 0,Romanji Ward,Kanji Ward,Population,Latitude,Longitude
0,Abeno-ku,阿倍野区,107000,34.638732,135.518467
1,Asahi-ku,旭区,90854,34.721168,135.544269
2,Chūō-ku,中央区,100998,34.681144,135.509884
3,Fukushima-ku,福島区,78348,34.692308,135.47222
4,Higashinari-ku,東成区,83684,34.669951,135.54127


In [17]:
# load the number of Vietnamese residents per ward from the Osaka city website
kubetu_gaikokujin_df = pd.read_excel('https://www.city.osaka.lg.jp/shimin/cmsfiles/contents/0000006/6893/(0109)07_kubetu_kokusekibetu_gaikokujin.xls', index_col=None, skiprows=3)
kubetu_gaikokujin_df.head()

Unnamed: 0.1,Unnamed: 0,アフガニスタン,アルジェリア,アルゼンチン,オーストラリア,オーストリア,アラブ首長国連邦,ベルギー,ボリビア,ブラジル,ブルガリア,ミャンマー,ブータン,バングラデシュ,バハマ,ブルネイ,ベラルーシ,カンボジア,カメルーン,カナダ,スリランカ,チリ,中国,台湾,コロンビア,コンゴ共和国,コンゴ民主共和国,コスタリカ,キューバ,キプロス,クロアチア,チェコ,ベナン,デンマーク,ドミニカ共和国,エクアドル,エルサルバドル,エチオピア,フィンランド,フランス,フィジー,ドイツ,ガーナ,ギリシャ,グアテマラ,ギニア,ガンビア,ギニアビサウ,ハイチ,ホンジュラス,ハンガリー,アイスランド,インド,インドネシア,イラン,イラク,アイルランド,イスラエル,イタリア,ジャマイカ,ヨルダン,韓国及び朝鮮,ケニア,キルギス,カザフスタン,ラオス,レバノン,リビア,リヒテンシュタイン,ラトビア,リトアニア,マダガスカル,マレーシア,マリ,メキシコ,モンゴル,モロッコ,マラウイ,マルタ,モルディブ,モーリシャス,モザンビーク,ミクロネシア,モルドバ,北マケドニア,ネパール,オランダ,ニュージーランド,ニカラグア,ナイジェリア,ノルウェー,ナウル,パキスタン,パナマ,パラグアイ,ペルー,フィリピン,ポーランド,ポルトガル,パプアニューギニア,ルーマニア,ルワンダ,ロシア,サウジアラビア,セネガル,シエラレオネ,スペイン,スーダン,スウェーデン,スイス,シリア,シンガポール,ソロモン,セントルシア,タイ,タンザニア,トーゴ,トリニダード・トバゴ,チュニジア,トルコ,トンガ,ツバル,タジキスタン,ウガンダ,南アフリカ共和国,エジプト,英国,米国,ブルキナファソ,ウルグアイ,ウクライナ,ウズベキスタン,ベネズエラ,ベトナム,イエメン,ジンバブエ,無国籍,アンゴラ,アルメニア,アゼルバイジャン,スロベニア,スロバキア,ボスニア・ヘルツェゴビナ,セルビア,南スーダン共和国,不詳
0,北 区,0,0,0,34,0,0,4,0,33,1,13,1,5,1,0,1,2,1,40,17,0,1950,326,0,0,0,0,0,0,0,1,0,2,0,0,0,0,5,48,0,22,2,0,0,0,0,0,0,0,1,0,87,80,2,0,2,2,14,0,0,1888,1,0,2,0,0,0,0,0,0,0,22,0,6,9,0,0,0,0,0,0,0,0,0,124,1,12,0,2,0,0,6,0,0,12,155,4,3,0,13,0,28,13,0,0,13,0,1,7,4,22,0,0,137,0,0,0,1,3,0,0,0,0,3,3,45,193,0,0,12,2,3,261,0,1,2,0,0,0,0,1,0,1,0,3
1,都 島 区,0,0,0,17,1,0,0,6,26,1,11,0,2,0,0,0,10,0,15,5,0,1029,153,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,14,1,2,3,0,0,0,0,0,0,0,0,0,25,96,2,0,0,1,8,2,0,1151,1,0,0,1,0,0,0,0,0,0,7,0,4,2,0,0,0,1,0,0,0,0,0,75,1,3,0,1,0,0,5,0,1,0,95,2,0,0,1,0,6,3,0,0,4,0,2,2,0,2,0,0,50,0,0,0,2,2,0,0,0,1,1,0,16,64,0,0,1,0,0,371,0,0,0,0,0,0,0,0,0,0,0,2
2,福 島 区,0,0,0,14,0,0,0,5,7,0,3,0,2,0,0,0,0,0,8,10,0,440,71,1,0,0,0,0,0,0,0,0,1,0,1,0,0,2,5,0,5,0,0,0,0,0,0,0,0,0,0,18,6,0,0,2,0,12,0,0,647,1,0,0,0,0,0,1,0,0,0,0,0,0,6,0,1,0,0,0,0,0,0,0,15,1,3,0,1,0,0,1,0,0,15,62,3,2,0,1,0,7,0,0,0,5,0,1,0,0,0,0,0,31,0,0,0,2,3,0,0,0,0,1,0,15,73,0,0,4,0,0,82,0,0,1,0,0,0,0,0,0,0,0,3
3,此 花 区,0,0,1,3,0,0,1,2,16,0,4,0,2,0,0,1,4,0,5,3,1,543,57,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,4,0,5,0,0,0,0,0,0,0,1,0,0,20,7,0,0,0,0,0,2,0,675,5,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,36,3,1,0,1,1,0,1,0,0,10,107,0,0,0,4,0,11,1,0,0,1,0,4,1,0,2,0,0,24,0,0,0,0,1,0,1,0,0,0,0,10,27,0,0,0,0,0,480,0,0,0,0,0,1,0,0,0,0,0,2
4,中 央 区,0,0,4,40,5,0,3,5,84,0,27,0,5,0,0,2,2,0,31,13,9,3365,608,9,0,1,1,2,2,0,0,0,4,1,0,0,0,2,59,0,15,7,0,0,0,0,0,0,0,1,1,113,78,8,2,2,2,24,1,1,2805,1,0,3,1,0,0,0,1,0,0,18,0,10,27,0,0,0,0,1,0,0,3,0,154,2,12,0,0,0,0,12,0,0,18,521,4,3,0,25,1,26,1,3,0,17,0,5,4,1,9,0,0,146,0,0,0,1,13,0,0,0,0,2,0,68,132,0,0,4,5,0,372,0,0,4,0,0,0,1,0,0,1,0,5


In [18]:
# create a new dataframe to load number of Vietnamese in each ward
kubetu_gaikokujin_df.rename(columns={'Unnamed: 0': 'Kanji Ward'}, inplace=True)
number_of_vietnamese_df = pd.DataFrame()
number_of_vietnamese_df['Kanji Ward'] = kubetu_gaikokujin_df['Kanji Ward']
number_of_vietnamese_df['Number of Vietnamese'] = kubetu_gaikokujin_df['ベトナム']
number_of_vietnamese_df.head()

Unnamed: 0,Kanji Ward,Number of Vietnamese
0,北 区,261
1,都 島 区,371
2,福 島 区,82
3,此 花 区,480
4,中 央 区,372


In [19]:
# strip all full-width and half-width spaces from the ward names
number_of_vietnamese_df['Kanji Ward'] = number_of_vietnamese_df['Kanji Ward'].str.replace('　', '')
number_of_vietnamese_df['Kanji Ward'] = number_of_vietnamese_df['Kanji Ward'].str.replace(' ', '')
number_of_vietnamese_df.head()

Unnamed: 0,Kanji Ward,Number of Vietnamese
0,北区,261
1,都島区,371
2,福島区,82
3,此花区,480
4,中央区,372


In [20]:
# merge number of Vietnamese into Osaka dataset
Osaka_df_copy = Osaka_df_copy.merge(number_of_vietnamese_df, on=['Kanji Ward'])
Osaka_df_copy.head()

Unnamed: 0,Romanji Ward,Kanji Ward,Population,Latitude,Longitude,Number of Vietnamese
0,Abeno-ku,阿倍野区,107000,34.638732,135.518467,490
1,Asahi-ku,旭区,90854,34.721168,135.544269,283
2,Chūō-ku,中央区,100998,34.681144,135.509884,372
3,Fukushima-ku,福島区,78348,34.692308,135.47222,82
4,Higashinari-ku,東成区,83684,34.669951,135.54127,678
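
The merge above is an inner join on `Kanji Ward`, so any ward whose name is spelled differently in the two sources would silently drop out of the result. A small sketch of how the names could be checked before merging is shown below. The function name `normalize_ward` and the two short lists are assumptions for illustration; they contain only a subset of the wards shown in the outputs above.

```python
def normalize_ward(name):
    """Remove half-width and full-width spaces from a ward name."""
    return name.replace(" ", "").replace("\u3000", "")


# A few ward names as they appear in the two sources (subset, for illustration)
wikipedia_wards = ["北区", "都島区", "福島区", "此花区", "中央区"]
city_office_wards = ["北 区", "都 島 区", "福 島 区", "此 花 区", "中 央 区"]

normalized = {normalize_ward(w) for w in city_office_wards}

# wards from the Wikipedia list that would not find a match in the merge
unmatched = [w for w in wikipedia_wards if w not in normalized]
print(unmatched)
```

In pandas, a similar check is possible with `merge(..., how='left', indicator=True)` and inspecting the rows marked `left_only`.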


In [21]:
# save the DataFrame as CSV file
Osaka_df_copy.to_csv("Osaka_df_with_number_of_Vietnamese.csv", index=False)

**Create a map of Osaka with neighborhoods superimposed on top**

In [22]:
# get the coordinates of Osaka
address = '大阪市, 日本'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Osaka, Japan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Osaka, Japan are 34.6937569, 135.5014539.


In [23]:
# create map of Osaka using latitude and longitude values
map_Osaka = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(Osaka_ward_df['Latitude'], Osaka_ward_df['Longitude'], Osaka_ward_df['Kanji Ward']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_Osaka)  
    
map_Osaka

In [24]:
# save the map as HTML file
map_Osaka.save('map_Osaka.html')

**Use the Foursquare API to explore the neighborhoods**

In [25]:
# The code was removed by Watson Studio for sharing.

**Now, let's get the Vietnamese restaurants within a radius of 2000 meters of each ward center (up to `LIMIT` results requested per ward).**

In [26]:
import math

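def calc_xy_distance(x1, y1, x2, y2):
    # straight-line distance between two (latitude, longitude) points;
    # the inputs are in degrees, so the result is in degrees, not meters
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)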
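
Because `calc_xy_distance` works directly on latitude and longitude, the `Distance from center` values it produces are expressed in degrees. If a distance in meters is wanted, one common option is the haversine formula. The sketch below is an illustration only: the function name `haversine_m` is hypothetical, and the Earth is assumed to be a sphere with a mean radius of 6,371 km.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    earth_radius_m = 6371000.0  # mean Earth radius (spherical Earth assumed)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


# Ward center of 阿倍野区 and the first venue returned for it in this notebook
d = haversine_m(34.638732, 135.518467, 34.648581, 135.525135)
print(round(d))
```

The result for this pair is a little over 1.2 km, which is consistent with the 2000 meter search radius used in the request.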

In [46]:
radius = 2000
LIMIT = 200
# Foursquare category ID for "Food" (kept for reference, not used)
#section = '4d4b7105d754a06374d81259'
# Foursquare category ID for "Vietnamese Restaurant"
section = '4bf58dd8d48988d14a941735'

venues = []

for lat, long, neighborhood in zip(Osaka_ward_df['Latitude'], Osaka_ward_df['Longitude'], Osaka_ward_df['Kanji Ward']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        section,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],
            calc_xy_distance(lat, long, venue['venue']['location']['lat'], venue['venue']['location']['lng']),
            venue['venue']['categories'][0]['name']))
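
The request URL above is assembled with `str.format`, and the JSON response is indexed without checking that the request succeeded. A hedged sketch of building the same query string with the standard library is shown below. The helper name `build_explore_url` is hypothetical, the credentials and version string are placeholders, and the endpoint and parameter names are simply the ones used in the cell above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# endpoint as used in this notebook
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      category_id, radius, limit):
    """Hypothetical helper: same query parameters as the notebook, URL-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "categoryId": category_id,
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials and version string, for illustration only.
url = build_explore_url("MY_ID", "MY_SECRET", "20180605",
                        34.638732, 135.518467,
                        "4bf58dd8d48988d14a941735", 2000, 200)

# decode the query string again to confirm the parameters survived encoding
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"])
```

With `requests`, the same dictionary could instead be passed as `requests.get(EXPLORE_ENDPOINT, params=params, timeout=10)`, followed by `raise_for_status()` on the response before reading the JSON.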

In [47]:
# convert the venues list into a new DataFrame
Osaka_venues_df = pd.DataFrame(venues)

# define the column names
Osaka_venues_df.columns = ['Kanji Ward', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'Distance from center', 'VenueCategory']

print(Osaka_venues_df.shape)
Osaka_venues_df.head()

(68, 8)


Unnamed: 0,Kanji Ward,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,Distance from center,VenueCategory
0,阿倍野区,34.638732,135.518467,クアン オン ニャット,34.648581,135.525135,0.011894,Vietnamese Restaurant
1,阿倍野区,34.638732,135.518467,Banh Mi Viet Nam,34.649674,135.506522,0.016199,Vietnamese Restaurant
2,阿倍野区,34.638732,135.518467,バインミーサンキュー,34.649029,135.504624,0.017252,Vietnamese Restaurant
3,阿倍野区,34.638732,135.518467,バインミー39,34.649284,135.50291,0.018798,Vietnamese Restaurant
4,阿倍野区,34.638732,135.518467,ベトナム料理 センレストラン,34.652105,135.505637,0.018532,Vietnamese Restaurant


**Let's find out how many unique categories can be curated from all the returned venues**

In [48]:
print('There are {} unique categories.'.format(len(Osaka_venues_df['VenueCategory'].unique())))

There are 2 unique categories.


In [49]:
# print out the list of categories
Osaka_venues_df['VenueCategory'].unique()[:2]

array(['Vietnamese Restaurant', 'Fast Food Restaurant'], dtype=object)

In [50]:
# the results include one 'Fast Food Restaurant' entry, so we remove it
Osaka_venues_df.drop(Osaka_venues_df[ Osaka_venues_df['VenueCategory'] == 'Fast Food Restaurant' ].index, inplace=True)
print(Osaka_venues_df['VenueCategory'].value_counts())

Vietnamese Restaurant    67
Name: VenueCategory, dtype: int64


**Let's check how many venues were returned for each neighborhood**

In [51]:
# count the venues returned for each ward
Osaka_venues_df_copy = pd.DataFrame({'Kanji Ward': Osaka_venues_df['Kanji Ward'],
                           'Number of venue': Osaka_venues_df['VenueName']})

Osaka_venues_grouped_df = Osaka_venues_df_copy.groupby('Kanji Ward').count()
Osaka_venues_grouped_df.head()

Kanji Ward,Number of venue
中央区,10
北区,9
城東区,1
天王寺区,16
平野区,1


In [53]:
# create a dataframe with the number of Vietnamese residents and the number of Vietnamese restaurants in each ward of Osaka
Osaka_merge_df = Osaka_df_copy.merge(Osaka_venues_grouped_df, how='left', on=['Kanji Ward'])
Osaka_merge_df.fillna(0, inplace=True)
Osaka_merge_df.head()

Unnamed: 0,Romanji Ward,Kanji Ward,Population,Latitude,Longitude,Number of Vietnamese,Number of venue
0,Abeno-ku,阿倍野区,107000,34.638732,135.518467,490,5.0
1,Asahi-ku,旭区,90854,34.721168,135.544269,283,1.0
2,Chūō-ku,中央区,100998,34.681144,135.509884,372,10.0
3,Fukushima-ku,福島区,78348,34.692308,135.47222,82,0.0
4,Higashinari-ku,東成区,83684,34.669951,135.54127,678,3.0


## 3. Methodology <a name="methodology"></a> 

In this project we focus on detecting areas of Osaka that have a low density of Vietnamese restaurants. We limit the analysis to the area within about 2 km of each ward's center.

In the first step we collected the required **data: location and type (category) of restaurants within 2 km of each ward's center** (24 wards). We also **identified Vietnamese restaurants** (according to Foursquare categorization).

The second step in our analysis is the calculation and exploration of '**restaurant density**' across different areas of Osaka. We will use **heatmaps** to identify a few promising areas close to the center with a low number of restaurants in general (*and* no Vietnamese restaurants in the vicinity) and focus our attention on those areas.

In the third and final step we focus on the most promising areas and, within those, create **clusters of locations that meet some basic requirements** established in discussion with stakeholders: we take into consideration locations with a **high density of Vietnamese residents**, and we want locations **without Vietnamese restaurants within a radius of 200 meters**. We will present a map of all such locations and also create clusters (using **k-means clustering**) of those locations to identify general zones, neighborhoods, and addresses that should serve as a starting point for the final 'street level' exploration and search for the optimal venue location by stakeholders.

## 4. Analysis <a name="analysis"></a>

**Pre-processing and normalizing**

In [54]:
Osaka = Osaka_merge_df.drop(['Romanji Ward', 'Latitude', 'Longitude', 'Population'], axis=1)
Osaka.head()

Unnamed: 0,Kanji Ward,Number of Vietnamese,Number of venue
0,阿倍野区,490,5.0
1,旭区,283,1.0
2,中央区,372,10.0
3,福島区,82,0.0
4,東成区,678,3.0


In [55]:
from sklearn.preprocessing import StandardScaler
X = Osaka.values[:,1:]
X = np.nan_to_num(X)
Clus_dataSet = StandardScaler().fit_transform(X)
Clus_dataSet



array([[-0.30071396,  0.56552841],
       [-0.6623569 , -0.45882494],
       [-0.50686791,  1.8459701 ],
       [-1.01351744, -0.71491327],
       [ 0.02773469,  0.05335174],
       [-0.64139209, -0.45882494],
       [ 0.43654845, -0.71491327],
       [ 0.91699197, -0.45882494],
       [ 2.98027849,  0.05335174],
       [-0.54705046, -0.45882494],
       [-0.70079238,  1.58988176],
       [-0.31818464, -0.71491327],
       [-0.42999695, -0.71491327],
       [-0.50861498,  0.05335174],
       [ 1.177305  ,  0.56552841],
       [-0.60470368,  0.56552841],
       [ 2.55923526,  0.05335174],
       [ 0.27581826, -0.71491327],
       [-0.20637233, -0.71491327],
       [-0.42475575, -0.71491327],
       [-0.54355632, -0.71491327],
       [-0.65536863,  3.38250012],
       [-0.9139346 , -0.71491327],
       [ 0.60426691, -0.45882494]])
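
The first column of the standardized array can be reproduced by hand, which makes it clearer what `StandardScaler` does: it subtracts the column mean and divides by the population standard deviation. The sketch below uses only the standard library and the `Number of Vietnamese` values shown in this notebook; the variable names are chosen for illustration.

```python
import statistics

# "Number of Vietnamese" column, copied from the array X in this notebook
vietnamese = [490, 283, 372, 82, 678, 295, 912, 1187, 2368, 349, 261, 480,
              416, 371, 1336, 316, 2127, 820, 544, 419, 351, 287, 139, 1008]

mean = statistics.mean(vietnamese)
std = statistics.pstdev(vietnamese)  # population standard deviation (ddof=0)

# z-score: how many standard deviations each ward lies from the mean
z_scores = [(v - mean) / std for v in vietnamese]
print(round(z_scores[0], 4), round(z_scores[8], 4))
```

The values for 阿倍野区 (index 0) and 生野区 (index 8) should agree with the first column of the `Clus_dataSet` output to within rounding.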

In [56]:
X

array([[490, 5.0],
       [283, 1.0],
       [372, 10.0],
       [82, 0.0],
       [678, 3.0],
       [295, 1.0],
       [912, 0.0],
       [1187, 1.0],
       [2368, 3.0],
       [349, 1.0],
       [261, 9.0],
       [480, 0.0],
       [416, 0.0],
       [371, 3.0],
       [1336, 5.0],
       [316, 5.0],
       [2127, 3.0],
       [820, 0.0],
       [544, 0.0],
       [419, 0.0],
       [351, 0.0],
       [287, 16.0],
       [139, 0.0],
       [1008, 1.0]], dtype=object)

**Cluster Neighborhoods**  

Run k-means to cluster the neighborhoods in Osaka into 3 clusters.

In [57]:
clusterNum = 3
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
# note: fitted on the raw features X, not the standardized Clus_dataSet
k_means.fit(X)
labels = k_means.labels_
print(labels)

[0 0 0 0 0 0 1 1 2 0 0 0 0 0 1 0 2 1 0 0 0 0 0 1]
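
Since k-means relies on Euclidean distance, the scale of each feature decides how much it influences the clusters. The worked comparison below is a sketch using two wards from this notebook, 天王寺区 and 大正区, which differ in venue count by the full observed range (16 versus 0) but by only 64 residents. The raw values come from `X` and the standardized values from the `Clus_dataSet` output; the function name `venue_share` is hypothetical.

```python
import math

# (Number of Vietnamese, Number of venue) for two wards, from the array X
tennoji_raw, taisho_raw = (287, 16.0), (351, 0.0)
# the same two wards after StandardScaler, from the Clus_dataSet output
tennoji_std, taisho_std = (-0.65536863, 3.38250012), (-0.54355632, -0.71491327)


def venue_share(p, q):
    """Fraction of the squared Euclidean distance contributed by the venue count."""
    d_pop = (p[0] - q[0]) ** 2
    d_venue = (p[1] - q[1]) ** 2
    return d_venue / (d_pop + d_venue)


raw_distance = math.dist(tennoji_raw, taisho_raw)
share_raw = venue_share(tennoji_raw, taisho_raw)
share_std = venue_share(tennoji_std, taisho_std)
print(round(raw_distance, 1), round(share_raw, 3), round(share_std, 3))
```

On the raw features the venue count contributes only a small fraction of the squared distance between these wards, while on the standardized features it contributes almost all of it. This may explain why the three clusters above separate mainly by the number of Vietnamese residents.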


**We assign the labels to each row in dataframe**

In [58]:
Osaka["Clus_km"] = labels
Osaka.head(5)

Unnamed: 0,Kanji Ward,Number of Vietnamese,Number of venue,Clus_km
0,阿倍野区,490,5.0,0
1,旭区,283,1.0,0
2,中央区,372,10.0,0
3,福島区,82,0.0,0
4,東成区,678,3.0,0


**We can easily check the centroid values by averaging the features in each cluster**

In [59]:
Osaka.groupby('Clus_km').mean()

Clus_km,Number of Vietnamese,Number of venue
0,360.764706,3.176471
1,1052.6,1.4
2,2247.5,3.0
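
As a quick check of the table above, the means of the two smaller clusters can be recomputed by hand from the ward values and labels shown in this notebook. The sketch below uses only the standard library; the helper name `centroid` is an assumption for illustration.

```python
from statistics import mean

# (Number of Vietnamese, Number of venue) for the wards labelled 1 and 2
cluster_1 = [(912, 0.0), (1187, 1.0), (1336, 5.0), (820, 0.0), (1008, 1.0)]
cluster_2 = [(2368, 3.0), (2127, 3.0)]


def centroid(rows):
    """Column-wise mean of a list of (residents, venues) rows."""
    return tuple(mean(column) for column in zip(*rows))


c1 = centroid(cluster_1)
c2 = centroid(cluster_2)
print(c1)
print(c2)
```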


**Next, let's add the ward details (romanized name, population, and coordinates) back to the clustered data**

In [60]:
# create a new dataframe that will combine the cluster labels with the ward details
Osaka_merged = Osaka.copy()

# the clustering labels are already stored in the 'Clus_km' column

In [61]:
#Osaka_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
Osaka_merged.head()

Unnamed: 0,Kanji Ward,Number of Vietnamese,Number of venue,Clus_km
0,阿倍野区,490,5.0,0
1,旭区,283,1.0,0
2,中央区,372,10.0,0
3,福島区,82,0.0,0
4,東成区,678,3.0,0


In [64]:
# join Osaka_merged with Osaka_ward_df to add the romanized name, population and latitude/longitude for each ward
Osaka_merged = Osaka_merged.join(Osaka_ward_df.set_index('Kanji Ward'), on='Kanji Ward')

print(Osaka_merged.shape)
Osaka_merged.head() # check the last columns!

(24, 8)


Unnamed: 0,Kanji Ward,Number of Vietnamese,Number of venue,Clus_km,Romanji Ward,Population,Latitude,Longitude
0,阿倍野区,490,5.0,0,Abeno-ku,107000,34.638732,135.518467
1,旭区,283,1.0,0,Asahi-ku,90854,34.721168,135.544269
2,中央区,372,10.0,0,Chūō-ku,100998,34.681144,135.509884
3,福島区,82,0.0,0,Fukushima-ku,78348,34.692308,135.47222
4,東成区,678,3.0,0,Higashinari-ku,83684,34.669951,135.54127


In [65]:
# sort the results by Cluster Labels
print(Osaka_merged.shape)
Osaka_merged.sort_values(['Clus_km'], inplace=True)
Osaka_merged

(24, 8)


Unnamed: 0,Kanji Ward,Number of Vietnamese,Number of venue,Clus_km,Romanji Ward,Population,Latitude,Longitude
0,阿倍野区,490,5.0,0,Abeno-ku,107000,34.638732,135.518467
21,天王寺区,287,16.0,0,Tennōji-ku,80830,34.657816,135.519309
20,大正区,351,0.0,0,Taishō-ku,62872,34.650342,135.472805
19,住吉区,419,0.0,0,Sumiyoshi-ku,153425,34.603653,135.500581
18,住之江区,544,0.0,0,Suminoe-ku,120629,34.609624,135.482832
15,西区,316,5.0,0,Nishi-ku,103089,34.676303,135.485961
13,都島区,371,3.0,0,Miyakojima-ku,107555,34.701281,135.528077
12,港区,416,0.0,0,Minato-ku,80759,34.663923,135.460777
22,鶴見区,139,0.0,0,Tsurumi-ku,111501,34.704365,135.574239
10,北区,261,9.0,0,Kita-ku (administrative center),136602,34.705363,135.510049


**Finally, let's visualize the resulting clusters**

In [66]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters (one color per cluster)
colors_array = cm.rainbow(np.linspace(0, 1, clusterNum))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(Osaka_merged['Latitude'], Osaka_merged['Longitude'], Osaka_merged['Kanji Ward'], Osaka_merged['Clus_km']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [67]:
# save the map as HTML file
map_clusters.save('map_clusters.html')

## 5. Results <a name="results"></a>

#### Cluster 0  

Low number of Vietnamese residents (cluster mean about 361) and a moderate number of Vietnamese restaurants (mean about 3.2 per ward).

In [71]:
Osaka_merged.loc[Osaka_merged['Clus_km'] == 0].sort_index()

Unnamed: 0,Kanji Ward,Number of Vietnamese,Number of venue,Clus_km,Romanji Ward,Population,Latitude,Longitude
0,阿倍野区,490,5.0,0,Abeno-ku,107000,34.638732,135.518467
1,旭区,283,1.0,0,Asahi-ku,90854,34.721168,135.544269
2,中央区,372,10.0,0,Chūō-ku,100998,34.681144,135.509884
3,福島区,82,0.0,0,Fukushima-ku,78348,34.692308,135.47222
4,東成区,678,3.0,0,Higashinari-ku,83684,34.669951,135.54127
5,東住吉区,295,1.0,0,Higashisumiyoshi-ku,126704,34.622123,135.526661
9,城東区,349,1.0,0,Jōtō-ku,167925,34.70321,135.544797
10,北区,261,9.0,0,Kita-ku (administrative center),136602,34.705363,135.510049
11,此花区,480,0.0,0,Konohana-ku,65086,34.683059,135.452361
12,港区,416,0.0,0,Minato-ku,80759,34.663923,135.460777


#### Cluster 1  

Medium number of Vietnamese residents (cluster mean about 1,053) and a low number of Vietnamese restaurants (mean 1.4 per ward).

In [72]:
Osaka_merged.loc[Osaka_merged['Clus_km'] == 1].sort_index()

Unnamed: 0,Kanji Ward,Number of Vietnamese,Number of venue,Clus_km,Romanji Ward,Population,Latitude,Longitude
6,東淀川区,912,0.0,1,Higashiyodogawa-ku,176943,34.741226,135.529413
7,平野区,1187,1.0,1,Hirano-ku,193282,34.62129,135.546381
14,浪速区,1336,5.0,1,Naniwa-ku,74992,34.659405,135.499543
17,西淀川区,820,0.0,1,Nishiyodogawa-ku,95960,34.711334,135.456165
23,淀川区,1008,1.0,1,Yodogawa-ku,182254,34.721055,135.486691


#### Cluster 2  

High number of Vietnamese residents (cluster mean about 2,248) and a low number of Vietnamese restaurants relative to that population (3 per ward).

In [73]:
Osaka_merged.loc[Osaka_merged['Clus_km'] == 2].sort_index()

Unnamed: 0,Kanji Ward,Number of Vietnamese,Number of venue,Clus_km,Romanji Ward,Population,Latitude,Longitude
8,生野区,2368,3.0,2,Ikuno-ku,129641,34.653647,135.534354
16,西成区,2127,3.0,2,Nishinari-ku,108654,34.634954,135.494415


## 6. Conclusion <a name="conclusion"></a>  


The purpose of this project was to identify areas of Osaka with a low number of Vietnamese restaurants relative to the number of Vietnamese residents, in order to help stakeholders narrow down the search for the optimal location for a new Vietnamese restaurant. Using Vietnamese restaurant counts from Foursquare data and ward-level population figures from the Osaka city website, we clustered the 24 wards with k-means and identified two wards that justify further analysis (*生野区* and *西成区*). Each has more than 2,000 Vietnamese residents but only three Vietnamese restaurants within 2 km of the ward center. These wards can be used as starting points for the final exploration by stakeholders.

The final decision on the optimal restaurant location will be made by stakeholders based on the specific characteristics of the wards and locations in every recommended zone, taking into consideration additional factors such as the attractiveness of each location, accessibility to train stations, real estate availability, prices, and the social and economic dynamics of every ward.