# Capstone Project - The Battle of Neighborhoods

## Introduction

#### "Would you recommend a location in Hong Kong to open a new cinema?"  
My boss, the stakeholder, wants to **open a new cinema as the company's new business**.  
  
He explains that watching a movie is usually one part of a whole afternoon or evening out, so a cinema should have **many restaurants and shopping places nearby**. Transportation is also an important factor: ideally, customers can walk to the cinema from **public transport facilities** within **5 minutes**.  
  
He wants me to concentrate on selecting the cinema location according to its surrounding environment. Cinema facilities and rental prices are not my concern. He lists his **top 10 favorite cinemas** in Hong Kong, each with a rating.  

My teammates and I have selected **5 possible locations** for the new cinema. Which location should be suggested to the stakeholder?

## Data

To answer the question, the following data are required.

#### 1. Geographic coordinate of Hong Kong cinemas

I need to **compare the 5 possible locations with the existing cinemas** in Hong Kong. Therefore, I need a list of Hong Kong cinemas and their geographic coordinates. Luckily, both are available from the website https://hkmovie6.com/cinema .

In [2]:
# Import necessary library
import json
import pandas as pd

In [3]:
# Download the cinema list
!wget -O hk_cinema_list.json https://hkmovie6.com/api/cinemas/lists

--2019-08-21 19:39:56--  https://hkmovie6.com/api/cinemas/lists
Resolving hkmovie6.com (hkmovie6.com)... 2606:4700:30::681f:4301, 2606:4700:30::681f:4201, 104.31.66.1, ...
Connecting to hkmovie6.com (hkmovie6.com)|2606:4700:30::681f:4301|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘hk_cinema_list.json’

hk_cinema_list.json     [ <=>                ]  55.68K  --.-KB/s    in 0.1s    

2019-08-21 19:39:57 (372 KB/s) - ‘hk_cinema_list.json’ saved [57018]



In [4]:
# Convert the JSON data into a DataFrame
cinemas_json = None
with open('hk_cinema_list.json', 'r', encoding='utf-8') as f:
    cinemas_json = json.load(f)
    
cinemas = []
for data in cinemas_json['data']:
    # Use .get() so that an entry without a given key (e.g. 'lat') yields None
    # instead of raising KeyError
    cinemas.append({
        'Name': data.get('name'),
        'ChiName': data.get('chiName'),
        'Address': data.get('address'),
        'Latitude': data.get('lat'),
        'Longitude': data.get('lon')
    })
df_cinemas = pd.DataFrame(cinemas, columns=['Name','ChiName','Address','Latitude','Longitude'])

In [None]:
print('There are {} cinemas in Hong Kong'.format(len(df_cinemas)))

First five records of Hong Kong cinemas

In [None]:
df_cinemas.head()

#### 2. Geographic coordinates of 5 possible cinema addresses
The geographic coordinates of the 5 possible cinema locations are required. I can use the Google Maps API to find this information.

In [None]:
possible_locations = [
    { 'Location': 'L1', 'Address': 'Sau Mau Ping Shopping Centre, Sau Mau Ping'},
    { 'Location': 'L2', 'Address': 'Tuen Mun Ferry, Tuen Mun'},
    { 'Location': 'L3', 'Address': 'Un Chau Shopping Centre, Cheung Sha Wan'},
    { 'Location': 'L4', 'Address': 'Prosperity Millennia Plaza, North Point'},
    { 'Location': 'L5', 'Address': 'Tsuen Fung Centre Shopping Arcade, Tsuen Wan'},
]

In [None]:
# Install the Google Maps API client library
!pip install -U googlemaps

In [None]:
google_act = None
with open('google_map_act.json', 'r') as f:
    google_act = json.load(f)
    
GOOGLE_MAP_API_KEY = google_act['api_key']    

import googlemaps
gmaps = googlemaps.Client(key=GOOGLE_MAP_API_KEY)

In [None]:
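# Retrieve the geographic coordinates (latitude, longitude) of an address in Hong Kong
def getLatLng(address):
    results = gmaps.geocode('{}, Hong Kong'.format(address))
    if not results:
        # geocode() returns an empty list when nothing matches
        raise ValueError('No geocoding result for "{}"'.format(address))
    location = results[0]['geometry']['location']
    return (location['lat'], location['lng'])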

Dataframe of 5 target locations with geographic coordinates information

In [None]:
for loc in possible_locations:        
    (lat, lng) = getLatLng(loc['Address'])
    loc['Latitude'] = lat
    loc['Longitude'] = lng
    
df_possible_locations = pd.DataFrame(possible_locations, columns=['Location', 'Address', 'Latitude', 'Longitude'])
df_possible_locations

#### 3. Favorite cinema list of stakeholder

The stakeholder's list of favorite cinemas is important information: I can **use it as a profile to select the best location**.  

In [None]:
boss_favorite = [
    {'Name': 'Broadway Circuit - MONGKOK', 'Rating': 4.5},
    {'Name': 'Broadway Circuit - The ONE', 'Rating': 4.5},
    {'Name': 'Grand Ocean', 'Rating': 4.3},
    {'Name': 'The Grand Cinema', 'Rating': 3.4},
    {'Name': 'AMC Pacific Place', 'Rating': 2.3},
    {'Name': 'UA IMAX @ Airport', 'Rating': 1.5},
]

df_boss_favorite = pd.DataFrame(boss_favorite, columns=['Name','Rating'])
df_boss_favorite

#### 4. Eating, Shopping and Public transportation facilities around cinema
The recommended cinema location needs to have many eating and shopping venues nearby. Convenient public transport is also required.  
These data can be obtained by using the FourSquare API to search for venues around each location. The search radius is set to 500 meters, which is about a 5-minute walk.

The following venue categories are used in the search:

In [None]:
fs_categories = {
    'Food': '4d4b7105d754a06374d81259',
    'Shop & Service': '4d4b7105d754a06378d81259',
    'Bus Stop': '52f2ab2ebcbc57f1066b8b4f',
    'Metro Station': '4bf58dd8d48988d1fd931735',
    'Nightlife Spot': '4d4b7105d754a06376d81259',
    'Arts & Entertainment': '4d4b7104d754a06370d81259'
}

In [None]:
', '.join([ cat for cat in fs_categories])

In [None]:
cinema = df_cinemas.loc[0]

In [None]:
print('Use the first cinema "{}" in the list as an example to explore venues nearby'.format(cinema['Name']))

In [None]:
# Install FourSquare client library
!pip install foursquare

In [None]:
fs_act = None
with open('fs_act.json') as json_data:
    fs_act = json.load(json_data)

In [None]:
import foursquare
from pandas.io.json import json_normalize # transform JSON data into a pandas DataFrame
fs = foursquare.Foursquare(client_id=fs_act['client_id'], client_secret=fs_act['client_secret'])

In [None]:
RADIUS = 500 # 500m, around 5 minutes walking time
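
To sanity-check the 500 m radius, the straight-line distance between a cinema and a venue can be computed directly from their coordinates. The following is a minimal sketch using the haversine formula; the helper names `haversine_m` and `within_walking_distance` and the sample coordinates are illustrative assumptions and are not part of the original notebook.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within_walking_distance(origin, venue, radius_m=500):
    """True if the venue lies within radius_m metres of the origin."""
    return haversine_m(origin[0], origin[1], venue[0], venue[1]) <= radius_m

# Hypothetical coordinates, used only for illustration
origin = (22.3000, 114.1700)
near_venue = (22.3030, 114.1700)  # 0.003 degrees of latitude to the north
far_venue = (22.3100, 114.1700)   # 0.010 degrees of latitude to the north

print(round(haversine_m(*origin, *near_venue)))
print(within_walking_distance(origin, near_venue))
print(within_walking_distance(origin, far_venue))
```

Note that covering 500 m in 5 minutes corresponds to 100 m per minute (6 km/h), and the straight-line distance is a lower bound on the actual walking route.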

In [None]:
# Define a function to search for nearby venues and return the result as a DataFrame
def venues_nearby(latitude, longitude, category, verbose=True):    
    results = fs.venues.search(
        params = {
            'query': category, 
            'll': '{},{}'.format(latitude, longitude),
            'radius': RADIUS,
            'categoryId': fs_categories[category]
        }
    )    
    df = json_normalize(results['venues'])
    cols = ['Name','Latitude','Longitude','Tips','Users','Visits']    
    if( len(df) == 0 ):        
        df = pd.DataFrame(columns=cols)
    else:        
        df = df[['name','location.lat','location.lng','stats.tipCount','stats.usersCount','stats.visitsCount']]
        df.columns = cols
    if( verbose ):
        print('{} "{}" venues are found within {}m of location'.format(len(df), category, RADIUS))
    return df
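
The dotted column names used above, such as `location.lat` and `stats.tipCount`, come from flattening nested JSON. As a small sketch, the payload below is made up and only mimics the nesting (it is not a real FourSquare response); it shows how `json_normalize` produces such columns. It assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`.

```python
import pandas as pd

# Hypothetical venue records that only mimic the nested structure
sample_venues = [
    {"name": "Venue A", "location": {"lat": 22.30, "lng": 114.17}, "stats": {"tipCount": 3}},
    {"name": "Venue B", "location": {"lat": 22.31, "lng": 114.18}, "stats": {"tipCount": 0}},
]

# Nested keys are flattened into dot-separated column names
flat = pd.json_normalize(sample_venues)

# Select and rename the columns, as venues_nearby does
flat = flat[["name", "location.lat", "location.lng", "stats.tipCount"]]
flat.columns = ["Name", "Latitude", "Longitude", "Tips"]
print(flat)
```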
    

Find Metro Station around the cinema

In [None]:
venues_nearby(cinema['Latitude'], cinema['Longitude'], 'Metro Station').head()

Find Bus Stop around the cinema

In [None]:
venues_nearby(cinema['Latitude'], cinema['Longitude'], 'Bus Stop').head()

Find eating places around the cinema

In [None]:
venues_nearby(cinema['Latitude'], cinema['Longitude'], 'Food').head()

In [None]:
venues_nearby(cinema['Latitude'], cinema['Longitude'], 'Arts & Entertainment').head()

## Methodology 

With the above data, I can use a content-based recommendation technique to solve the problem.

Using the FourSquare API, which gives the number of venues in each category around every Hong Kong cinema, I build a matrix that captures the characteristics of the venues near each cinema. The stakeholder's favorite list serves as the profile: combining his ratings with the matrix yields a weighted matrix of his favorite cinemas.

The weighted matrix can then be applied to the venue information of the 5 target locations to generate a ranking. The top location in the ranking can be recommended to the stakeholder.
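
In symbols, the approach can be summarized as follows. The notation is introduced here only for illustration and does not appear in the original notebook: $X$ is the matrix of normalized venue counts of the favorite cinemas (one row per cinema, one column per category), $r$ is the vector of the stakeholder's ratings, and $x_t$ is the vector of normalized venue counts of target location $t$.

$$
p = X^{\top} r, \qquad s_t = \frac{x_t \cdot p}{\sum_{j} p_j}
$$

Here $p$ is the stakeholder's profile (one weight per category) and $s_t$ is the score of target location $t$. The location with the highest $s_t$ is the one to recommend.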

Before building the matrix, I have to prepare the required data and perform some data analysis.

#### Data Cleansing and Preparation

Check whether the cinemas dataset contains any duplicated addresses

In [None]:
duplicated = df_cinemas.duplicated('Address', keep=False)
df_cinemas[duplicated].sort_values('Address')
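
As a small illustration of the pandas calls used in this step, the sketch below runs them on a made-up table (the names and addresses are hypothetical). `duplicated(..., keep=False)` flags every row that shares an address with another row, while `drop_duplicates(..., keep='first')` keeps only the first row of each group.

```python
import pandas as pd

# Hypothetical records: two entries share one address
toy = pd.DataFrame({
    "Name": ["Cinema A", "Cinema A (VIP House)", "Cinema B"],
    "Address": ["1 Example Road", "1 Example Road", "2 Sample Street"],
})

# keep=False marks all members of a duplicated group, not just the later ones
mask = toy.duplicated("Address", keep=False)
print(toy[mask])

# keep='first' retains the first row for each address
deduped = toy.drop_duplicates("Address", keep="first")
print(deduped)
```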

Some "special houses" within a cinema are listed as separate cinemas on www.hkmovie6.com.  
These records are duplicates in my case and should be corrected.

In [None]:
# The Grand SC Starsuite -> The Grand Cinema
df_cinemas.loc[29, 'Name'] = 'The Grand Cinema'

# XXX @ UA MegaBox -> UA MegaBox
df_cinemas.loc[44, 'Name'] = 'UA MegaBox'
df_cinemas.loc[45, 'Name'] = 'UA MegaBox'

# BEA IMAX @ UA Cine Moko -> UA Cine Moko
df_cinemas.loc[42, 'Name'] = 'UA Cine Moko'

# XXX @ UA iSQUARE -> UA iSQUARE
df_cinemas.loc[43, 'Name'] = 'UA iSQUARE'
df_cinemas.loc[46, 'Name'] = 'UA iSQUARE'

# Emperor Cinemas - Entertainment Building
df_cinemas.loc[1, 'Name'] = 'Emperor Cinemas - Entertainment Building'

# Cinema City VICTORIA (Causeway Bay)
df_cinemas.loc[6, 'Name'] = 'Cinema City VICTORIA (Causeway Bay)'

In [None]:
df_cinemas[duplicated]

Drop the duplicated cinema records

In [None]:
df_cinemas.drop_duplicates('Address', inplace=True, keep='first')

Check that no duplicated cinema names remain

In [None]:
df_cinemas[df_cinemas.duplicated('Name')]

In [None]:
df_cinemas.head()

In [None]:
df_cinemas['ChiName'].to_frame()

'新光戲院大劇場' (Sunbeam Theatre) and '大館' (Tai Kwun) should not be considered cinemas in Hong Kong. These records must be removed.

In [None]:
df_cinemas.drop(index=[65,67], inplace=True)

In [None]:
df_cinemas.drop(axis=1, columns=['ChiName'], inplace=True)

In [None]:
df_cinemas.head()

Check the shape of cinemas dataset

In [None]:
df_cinemas.shape

Now I can use the FourSquare API to explore nearby venues of Hong Kong cinemas

In [None]:
from pathlib import Path

venues_csv = Path('./cinemas_venues.csv')
df_venues = None

# check whether the venues data has already been explored and downloaded
if( venues_csv.exists() ):
    df_venues = pd.read_csv('./cinemas_venues.csv')
else:    
    # construct a dataframe to store data
    df_venues = pd.DataFrame(columns=['Cinema Name', 'Category', 'Name', 'Latitude', 'Longitude', 'Tips', 'Users', 'Visits'])
    for (name, address, latitude, longitude) in df_cinemas.itertuples(index=False):
        for cat, cat_id in fs_categories.items():
            df = venues_nearby(latitude, longitude, cat, verbose=False)
            df['Cinema Name'] = name
            df['Category'] = cat
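            # pd.concat replaces DataFrame.append, which was removed in pandas 2.0;
            # ignore_index=True gives a unique row index, as in the CSV branch
            df_venues = pd.concat([df_venues, df], sort=True, ignore_index=True)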
    df_venues.to_csv('cinemas_venues.csv', index=False)

In [None]:
print('Total {} of venues are found'.format(len(df_venues)))

In [None]:
# check the shape of data
df_venues.shape

In [None]:
# check some data
df_venues.head()

Number of venues in each category

In [None]:
df_venues['Category'].value_counts().to_frame(name='Count')

In [None]:
df_venues[(df_venues.Tips > 0)|(df_venues.Users > 0)|(df_venues.Visits > 0)]

In [None]:
df_venues.drop(columns=['Tips','Users','Visits'], inplace=True)

In [None]:
df_venues[df_venues.Category=='Nightlife Spot']

In [None]:
df_venues.drop(index=87, inplace=True)

Compared with the other categories, there is only one 'Nightlife Spot' venue, so this category is removed.

In [None]:
df_venues.shape

Explore nearby venues of 5 possible/target locations

In [None]:
df_target_venues = pd.DataFrame(columns=['Location', 'Category', 'Name', 'Latitude', 'Longitude', 'Tips', 'Users', 'Visits'])
for (location, address, latitude, longitude) in df_possible_locations.itertuples(index=False):
    for cat, cat_id in fs_categories.items():
        df = venues_nearby(latitude, longitude, cat, verbose=False)
        df['Location'] = location
        df['Category'] = cat
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        df_target_venues = pd.concat([df_target_venues, df], sort=True)

In [None]:
df_target_venues.head()

In [None]:
df_target_venues[(df_target_venues.Tips > 0)|(df_target_venues.Users > 0)|(df_target_venues.Visits > 0)]

In [None]:
df_target_venues.drop(columns=['Tips','Users','Visits'], inplace=True)

In [None]:
df_target_venues['Category'].value_counts().to_frame(name='Count')

No venue is found for 'Nightlife Spot' category

In [None]:
df_target_venues.shape

I am only interested in the number of venues in each category.  

In [None]:
df_venues_count = df_venues.groupby(['Cinema Name','Category'], as_index=False).count()
df_venues_count.drop(columns=['Latitude','Longitude'], inplace=True)
df_venues_count.rename(columns={'Name':'Count'}, inplace=True)
df_venues_count.head()

In [None]:
df_venues_count = df_venues_count.pivot(index='Cinema Name', columns='Category', values='Count').fillna(0)
df_venues_count.head()
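
The count-then-pivot step can be checked on a tiny made-up table. The sketch below reproduces it with hypothetical names and also shows `pd.crosstab`, an alternative that the notebook itself does not use, which builds the same kind of cinema-by-category count matrix in one call.

```python
import pandas as pd

# Hypothetical venue records: cinema A has 2 Food venues and 1 Bus Stop, cinema B has 1 Food venue
toy_venues = pd.DataFrame({
    "Cinema Name": ["A", "A", "A", "B"],
    "Category": ["Food", "Food", "Bus Stop", "Food"],
    "Name": ["v1", "v2", "v3", "v4"],
})

# Count venues per (cinema, category), then pivot categories into columns
counts = (
    toy_venues.groupby(["Cinema Name", "Category"]).size()
    .reset_index(name="Count")
    .pivot(index="Cinema Name", columns="Category", values="Count")
    .fillna(0)  # a missing combination means zero venues
)

# Alternative: crosstab counts the combinations directly
counts_alt = pd.crosstab(toy_venues["Cinema Name"], toy_venues["Category"])
print(counts)
```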

In [None]:
# Do the same process on target locations
df_target_venues_count = df_target_venues.groupby(['Location','Category']).size().reset_index(name='Count')
df_target_venues_count = df_target_venues_count.pivot(index='Location', columns='Category', values='Count').fillna(0)

In [None]:
df_target_venues_count

Check boss's favorite cinema list

In [None]:
boss_favorite

Check that the Hong Kong cinema list contains all of the stakeholder's favorite cinemas

In [None]:
names = [ cinema['Name'] for cinema in boss_favorite ]
df_cinemas[df_cinemas.Name.isin(names)]

Stakeholder's favorite cinema list

In [None]:
df_boss_favorite = pd.DataFrame(boss_favorite, columns=['Name','Rating'])
df_boss_favorite

#### Data Analysis

In [None]:
!conda install seaborn=0.9 --yes

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Check the data type of variables

In [None]:
df_venues_count.dtypes.to_frame(name='Data Type')

All columns are numeric

Generates descriptive statistics that summarize the central tendency,
dispersion and shape of a dataset's distribution

In [None]:
df_venues_count.describe()

Cinemas do have many 'Bus Stop', 'Food' and 'Shop & Service' venues around them. However, it is unusual for a cinema to have 4 metro stations nearby (within 500 meters).  

In [None]:
df_venues_count['Metro Station'].value_counts().sort_index().to_frame('Cinema Count')

One cinema has 4 'Metro Station' venues nearby

In [None]:
df_venues_count[df_venues_count['Metro Station'] > 2]

In [None]:
metro_over_2 = df_venues_count[df_venues_count['Metro Station'] > 2].index.tolist()
df_venues[(df_venues['Cinema Name'].isin(metro_over_2)) & (df_venues.Category == 'Metro Station')]

Venue 'Mtr Hung Hom Station Platform 4' is duplicated and should be removed.

In [None]:
df_venues.loc[2182, 'Name'] = 'MTR Hung Hom Station'

In [None]:
df_venues.drop(index=2183, inplace=True)

Reconstruct the venue count dataframe

In [None]:
df_venues_count = df_venues.groupby(['Cinema Name','Category'], as_index=False).count()
df_venues_count.drop(columns=['Latitude','Longitude'], inplace=True)
df_venues_count.rename(columns={'Name':'Count'}, inplace=True)
df_venues_count = df_venues_count.pivot(index='Cinema Name', columns='Category', values='Count').fillna(0)
df_venues_count.head()

Plot the distribution of other variables

In [None]:
f, axes = plt.subplots(2, 2, figsize=(10, 10))
sns.distplot(df_venues_count['Arts & Entertainment'] , color="skyblue", ax=axes[0, 0], kde=False)
sns.distplot(df_venues_count['Bus Stop'] , color="olive", ax=axes[0, 1], kde=False)
sns.distplot(df_venues_count['Food'] , color="gold", ax=axes[1, 0], kde=False)
sns.distplot(df_venues_count['Shop & Service'] , color="teal", ax=axes[1, 1], kde=False)

The distributions of the other variables are quite similar. Now check their **Pearson correlation**.

In [None]:
df_venues_count.corr()

It seems that the 'Bus Stop', 'Shop & Service' and 'Food' categories are highly correlated.  
Find the **p-value** for each pair of variables.

By convention, when the p-value is:
- < 0.001, there is strong evidence that the correlation is significant;
- < 0.05, there is moderate evidence that the correlation is significant;
- < 0.1, there is weak evidence that the correlation is significant;
- ≥ 0.1, there is no evidence that the correlation is significant.

In [None]:
from scipy import stats

In [None]:
p_value_data = []
for left in df_venues_count.columns:
    p_values = [left]
    for right in df_venues_count.columns:        
        pearson_coef, p_value = stats.pearsonr(df_venues_count[left], df_venues_count[right])
        if(p_value < 0.001):
            p_values.append('strong')
        elif(p_value < 0.05):
            p_values.append('moderate')
        elif(p_value < 0.1):
            p_values.append('weak')
        else:
            p_values.append('no')            
    p_value_data.append(p_values)

In [None]:
df_p_values = pd.DataFrame(p_value_data, columns=['Category'] + df_venues_count.columns.tolist())

In [None]:
df_p_values

The correlations between 'Bus Stop', 'Food', 'Metro Station' and 'Shop & Service' are statistically significant, and the coefficients of more than 0.5 show that the relationships are positive.

In [None]:
df_boss_favorite

In [None]:
!conda install -c conda-forge folium=0.5 --yes
import folium

print('Folium installed and imported!')

In [None]:
hk_coords = getLatLng('Hong Kong')

Visualize the locations of the cinemas, the target locations and the stakeholder's favorite cinemas on the map

In [None]:
hk_map = folium.Map(location=hk_coords, zoom_start=12, tiles='Stamen Toner')

cinemas_fg = folium.FeatureGroup()
targets_fg = folium.FeatureGroup()

for(location, address, latitude, longitude) in df_possible_locations.itertuples(index=False):
    targets_fg.add_child(
        folium.features.CircleMarker(
            location=(latitude, longitude),
            popup=location,
            radius=5,
            fill=True,
            color='yellow',
            fill_opacity=1.
        )
    )

boss_ratings = df_boss_favorite.set_index('Name')    
name_list = boss_ratings.index.tolist()

for (name, address, latitude, longitude ) in df_cinemas.itertuples(index=False):    
    
    color = 'blue'        
    popup = name
    
    if( name in name_list ):
        color = 'red'    
        popup = '{} - Rating: {}'.format(name, boss_ratings.loc[name,'Rating'])
        
    cinemas_fg.add_child(        
        folium.features.CircleMarker(
            location=(latitude, longitude),
            popup=popup,
            radius=5,
            fill=True,
            color=color,
            fill_opacity=1.
        )
    )
    
hk_map.add_child(cinemas_fg)
hk_map.add_child(targets_fg)

Most Hong Kong cinemas (blue circles) and the stakeholder's favorite cinemas (red circles) are located near main roads and concentrated in the urban areas of Hong Kong. 
The target locations (yellow circles) for the new cinema are not near main roads.

#### Machine Learning

Now, let's use __Content-Based__ or __Item-Item recommendation systems__. In this case, I am going to work out which new cinema location the boss would favor most, based on the number of nearby venues and the ratings he gave.

Normalize the values of venues dataframe by using MinMaxScaler method

In [None]:
df_venues_count.head()

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
venues_normalized = scaler.fit_transform(df_venues_count)

In [None]:
df_venues_normalized = pd.DataFrame(
    venues_normalized,
    index=df_venues_count.index,
    columns=df_venues_count.columns
)

In [None]:
df_venues_normalized.head()

Merge the data with boss's favorite list

In [None]:
boss_rating_table = pd.merge(
    df_boss_favorite,
    df_venues_normalized,
    how='inner',
    left_on='Name',
    right_index=True
)
boss_rating_table.drop(['Name','Rating'], axis=1, inplace=True)
boss_rating_table

Take the dot product to get the weight of each category according to the boss's ratings

In [None]:
boss_profile = boss_rating_table.transpose().dot(df_boss_favorite['Rating'])

In [None]:
boss_profile
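
To make the dot product concrete, here is a minimal sketch on made-up numbers (the cinema names, feature values and ratings are hypothetical). Each category weight in the profile is the sum, over the rated cinemas, of the cinema's normalized venue count multiplied by its rating.

```python
import pandas as pd

# Hypothetical normalized venue counts for two rated cinemas
features = pd.DataFrame(
    {"Food": [1.0, 0.5], "Bus Stop": [0.0, 1.0]},
    index=["Cinema A", "Cinema B"],
)

# Hypothetical ratings, indexed by the same cinema names
ratings = pd.Series([4.0, 2.0], index=["Cinema A", "Cinema B"])

# (categories x cinemas) dot (cinemas) -> one weight per category
profile = features.transpose().dot(ratings)
print(profile)
# Food:     1.0 * 4.0 + 0.5 * 2.0 = 5.0
# Bus Stop: 0.0 * 4.0 + 1.0 * 2.0 = 2.0
```

`DataFrame.dot` aligns on labels, so the columns of the transposed feature table and the index of the ratings must contain the same labels; otherwise pandas raises an error rather than silently pairing rows by position.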

Normalize the values of target venues

In [None]:
df_targets_normalized = pd.DataFrame(
    scaler.transform(df_target_venues_count),
    index=df_target_venues_count.index,
    columns=df_target_venues_count.columns
)

In [None]:
df_targets_normalized
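
Two details of this step are easy to miss, and the sketch below illustrates them on made-up numbers (all names and values are hypothetical). First, `scaler.transform` reuses the minimum and maximum learned from the cinema data, so target values can fall outside the range 0 to 1. Second, the target table must have the same columns, in the same order, as the table the scaler was fitted on; `reindex` is one possible way to enforce that, and it is an addition not present in the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical venue counts used to fit the scaler
train = pd.DataFrame(
    {"Bus Stop": [0.0, 10.0], "Food": [20.0, 40.0]},
    index=["Cinema A", "Cinema B"],
)

# Hypothetical targets: the 'Bus Stop' column is missing entirely
targets = pd.DataFrame({"Food": [30.0, 10.0]}, index=["L1", "L2"])

scaler = MinMaxScaler().fit(train)

# Align the target columns with the training columns; missing categories count as 0
aligned = targets.reindex(columns=train.columns, fill_value=0)

scaled = pd.DataFrame(
    scaler.transform(aligned),
    index=aligned.index,
    columns=aligned.columns,
)
print(scaled)
# L2 has fewer 'Food' venues than any training cinema, so its scaled value is negative
```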

## Results

With the boss's profile and the complete list of cinemas and their venue counts in hand, I am going to take the weighted average of every location based on the profile and recommend the location that best satisfies it.

In [None]:
df_recommend = (df_targets_normalized*boss_profile).sum(axis=1)/boss_profile.sum()
df_recommend = df_recommend.reset_index(name='Rating')
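
The weighted average can be verified by hand on a small made-up example (the profile weights and target values below are hypothetical). Each location's score is the profile-weighted sum of its normalized venue counts, divided by the total of the profile weights.

```python
import pandas as pd

# Hypothetical profile weights per category
profile = pd.Series({"Food": 5.0, "Bus Stop": 2.0})

# Hypothetical normalized venue counts of two target locations
targets = pd.DataFrame(
    {"Food": [0.2, 0.8], "Bus Stop": [1.0, 0.5]},
    index=["L1", "L2"],
)

# Multiplying a DataFrame by a Series aligns the Series index with the columns
scores = (targets * profile).sum(axis=1) / profile.sum()
ranking = scores.sort_values(ascending=False)
print(ranking)
# L1: (0.2 * 5 + 1.0 * 2) / 7 = 3/7
# L2: (0.8 * 5 + 0.5 * 2) / 7 = 5/7
```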

In [None]:
df_possible_locations

In [None]:
df_final = pd.merge(
    df_possible_locations,
    df_recommend,
    left_on='Location',
    right_on='Location'
)
df_final.sort_values('Rating', ascending=False, inplace=True)

In [None]:
df_final

In [None]:
print('I should recommend the location "{}" of address "{}" to the stakeholder'.format(df_final.iat[0,0], df_final.iat[0,1]))

The result is reasonable. Location "L5" has the largest number of venues in the "Bus Stop", "Food", "Metro Station" and "Shop & Service" categories. 

In [None]:
df_target_venues_count.head()

Moreover, according to the profile ratings, these are the categories that matter most to the stakeholder

In [None]:
boss_profile.sort_values(ascending=False)

Therefore, Location "L5" should be recommended to the stakeholder

## Discussion 

The numbers of venues around the 5 target locations are actually below the average for existing cinemas

In [None]:
df_venues_count.mean().to_frame(name='Average Count')

In [None]:
df_target_venues_count.mean().to_frame('Average Count')

I should contact local commercial property agents to find more suitable locations. Moreover, FourSquare is not popular in Hong Kong, so its data may be outdated or unreliable. The report should gather more data from other location data sources, such as the Google Places API.

## Conclusion 

The stakeholder's problem is resolved. The stakeholder wants to find the best place to build a new cinema in Hong Kong, where the "best location" is judged by the number of eating, shopping and transportation venues around it. He also provided a list of his favorite cinemas to further explain what the "best location" means. A content-based filtering technique is well suited to this problem: it combines the stakeholder's preferences with the cinema profiles to produce the recommendation.

The 5 target locations for the new cinema may not be good choices. Since the weighting matrix has been developed, I can quickly pick other locations and run the recommendation again.
