# <p style="text-align:center;"><u>THE ASIAN TOURIST IN PARIS</u></p>

# Introduction

<font size=4>Paris draws thousands of tourists who come to see the city for its sheer beauty. These tourists come from different nations, each with its own cuisine. For example, Asian tourists may prefer to eat in restaurants that specialize in Asian cuisine rather than in restaurants that serve a mix of cuisines. To this end, the problem I intend to solve is deciding the best AirBnB apartment locations where Asian tourists can reside during a visit to the city of Paris.</font>
<br><br>

<font size=4>By combining Asian restaurant data from the Foursquare API with AirBnB private room listings data for the city of Paris, I can cluster the restaurants into groups and propose the best AirBnB private room based on the average price per night and the number of reviews each room has. Solving this problem helps Asian tourists coming to Paris choose an AirBnB apartment situated in proximity to Asian restaurants.</font>
<br><br>

<font size=4>In conclusion, the problem I intend to solve can be framed as the question: <br><br><b>Which AirBnB apartment should an Asian tourist choose to reside in during a visit to Paris in order to enjoy Asian cuisine located nearby?</b></font>

# Data 

<font size=4>To solve the stated problem, I will use Foursquare's location data, accessed through its GET API. The data I will access includes:
<br>
<br>

<li>Restaurants in Paris city</li>
<li>Asian restaurants and other interesting places around these restaurants</li>


<br>
Also, I will use data from http://public.opendatasoft.com. This data comprises all AirBnB private room listings in the city of Paris, with their latitude and longitude, price per night, neighborhood and some other fields that are not needed for this analysis.
With this data, I can determine the best location/AirBnB property for an Asian tourist visiting Paris by clustering the restaurant data and determining the cluster with the largest number of restaurants serving Asian cuisine. I can also explore the AirBnB listings around the chosen sites and propose them as the accommodation of choice for Asian tourists coming to Paris.
<br>
<br>
The recommendation will take into account the price of the rooms per night and the number of reviews for each AirBnB listing.</font>

# Methodology

<font size=4>The methodology to adopt for this problem can be briefly described in the following phases:<br></font>
<br>
    <b>1. AirBnB Data collection from http://public.opendatasoft.com</b>
<br>
The AirBnB data will be accessed from OpenDataSoft, and the data set will be limited to private rooms in AirBnB's listings in the city of Paris. The collected data set will be stored in .csv format.
<br>    
    <b>2. AirBnB Data preprocessing.</b>
<br>
The Airbnb data will be loaded into a pandas dataframe and data preprocessing techniques such as cleaning, trimming, shaping etc will be carried out on it to prepare it for processing.
<br><br>
    <b>3. EDA</b>
<br>
Exploratory data analysis will be carried out on the data to better describe it and gain some insight into it.
<br><br>
    <b>4. Utilization of Foursquare API search function</b>
<br>
The Foursquare API search function will then be used to find Asian restaurants within 500m of each AirBnB private room in Paris city. The JSON result will be cleaned and made ready for clustering
<br><br>
    <b>5. Clustering of Asian restaurants</b>
<br>
To group the Asian restaurants with respect to the locations of AirBnB private rooms, the k-means clustering technique will be used to cluster the restaurants into five (5) different clusters.
<br>
<br>
<br>


#### <u>AirBnB Data Preprocessing</u>

##### 1. Import necessary dependencies

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import in pandas 1.0+)

#!pip install beautifulsoup4
#from bs4 import BeautifulSoup
import urllib.request
import csv


# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


##### 2. Read the AirBnB Paris listing data into Pandas dataframe

In [2]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Room ID,Name,Host ID,Neighbourhood,Room type,Room Price,Minimum nights,Number of reviews,Date last review,Number of reviews per month,Rooms rent by the host,Availibility,Updated Date,City,Country,Coordinates,Location
0,27425430,Bâtiments et cartier chik,94000621,Observatoire,Private room,30,20,0,,,1,0,2019-07-09,Paris,France,"48.8324388237,2.32971312235","France, Paris, Observatoire"
1,28003325,Chambre Double Galilée B&B,211467698,Passy,Private room,125,3,12,2019-06-19,1.2,4,105,2019-07-09,Paris,France,"48.8709289603,2.29685938737","France, Paris, Passy"
2,28069177,Two Rooms perfect for families,135335766,Batignolles-Monceau,Private room,398,1,0,,,7,29,2019-07-09,Paris,France,"48.8817959345,2.31241028731","France, Paris, Batignolles-Monceau"
3,28115947,Accommodations for 3 next to the train station,126446227,Opéra,Private room,119,1,4,2019-06-08,0.49,6,317,2019-07-09,Paris,France,"48.8786312806,2.32969475795","France, Paris, Opéra"
4,28122260,"Jolie chambre 2p ds cosy F3, 5m du Parc Expos",184941981,Vaugirard,Private room,50,1,27,2019-06-21,4.53,2,295,2019-07-09,Paris,France,"48.8300155978,2.29669152579","France, Paris, Vaugirard"


In [3]:
# Because of running time, we'll be using just 1000 rows of data
airbnb = df.sample(n=1000, random_state=3) # Get 1000 random rows from the AirBnB data set
airbnb.reset_index(drop=True, inplace=True) # Reset the index
airbnb.shape # Output the shape of the selected data

(1000, 17)

##### 3. Modify the data by dropping unnecessary columns

In [4]:
# Drop the unneeded data columns
airbnb.drop(['Name', 'Host ID', 'Room type', 'Minimum nights', 'Date last review', 'Number of reviews per month', 'Rooms rent by the host', 'Availibility',
       'Updated Date', 'City', 'Country', 'Location'], axis=1, inplace=True)
# Display the modified dataframe shape
print(airbnb.shape)

airbnb.head() # Display the 1st 5 rows of the dataframe

(1000, 5)


Unnamed: 0,Room ID,Neighbourhood,Room Price,Number of reviews,Coordinates
0,12817090,Entrepôt,30,8,"48.8733818465,2.37345456201"
1,2181753,Reuilly,40,104,"48.8413143526,2.38339810857"
2,29895295,Reuilly,50,7,"48.8472675613,2.40041154285"
3,30728070,Popincourt,139,0,"48.8609207147,2.36660350692"
4,21862193,Entrepôt,40,1,"48.8790213314,2.37008084215"


In [None]:
# Split the data in the coordinates column into latitude and longitude
airbnb[['Latitude', 'Longitude']] = airbnb['Coordinates'].str.split(',', n=1, expand=True)
# Convert the latitude and longitude to float datatype
airbnb['Latitude'] = airbnb['Latitude'].astype(float)
airbnb['Longitude'] = airbnb['Longitude'].astype(float)
# Drop the coordinate column
airbnb.drop(['Coordinates'], axis=1, inplace=True)

In [None]:
airbnb.rename(columns={'Neighbourhood': 'Neighborhood'}, inplace=True) # Rename the Neighbourhood column
airbnb.head()

Unnamed: 0,Room ID,Neighborhood,Room Price,Number of reviews,Latitude,Longitude
0,12817090,Entrepôt,30,8,48.873382,2.373455
1,2181753,Reuilly,40,104,48.841314,2.383398
2,29895295,Reuilly,50,7,48.847268,2.400412
3,30728070,Popincourt,139,0,48.860921,2.366604
4,21862193,Entrepôt,40,1,48.879021,2.370081


In [None]:
# Output the number of neighborhoods where AirBnB listings are located in Paris
print('The total number of neighborhoods is', airbnb['Neighborhood'].nunique())

The total number of neighborhoods is 20


##### 4a. Use geopy to get the lat. and long. of Paris, France

In [None]:
address = 'Paris, France' # address to be inputted into the geolocator

geolocator = Nominatim(user_agent='Paris_Explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of Paris are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Paris are 48.8566101, 2.3514992.


##### 4b. Create a map of Paris with neighbourhoods superimposed on top.

In [None]:
# create map of Paris using latitude and longitude values
map_paris = folium.Map(location=[latitude, longitude], zoom_start=13)

# add one marker per AirBnB listing to the map
for lat, lng, neighbourhood, ID in zip(airbnb['Latitude'], airbnb['Longitude'], airbnb['Neighborhood'], airbnb['Room ID']):
    label = '{}, Room ID:{}'.format(neighbourhood, ID) # Show neighborhood and Room ID as popup label
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_paris)

map_paris

#### <u>Foursquare API Utilization</u>

In [None]:
CLIENT_ID = 'MFKZJY4N32FIOZQ0XWT1UTD53RMQWZY3RHBFZ4KTTKNOO0X0' # my Foursquare ID
CLIENT_SECRET = '5TIVBR0L34XJGXQDUZVMELO5YZURRRG2W3UWJNRYMTH3T0VJ' # my Foursquare Secret
VERSION = '20191002' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: MFKZJY4N32FIOZQ0XWT1UTD53RMQWZY3RHBFZ4KTTKNOO0X0
CLIENT_SECRET:5TIVBR0L34XJGXQDUZVMELO5YZURRRG2W3UWJNRYMTH3T0VJ


##### Now, let's get the Asian restaurants within a radius of 500 meters of each AirBnB listing
This will be done using a defined function that will return the needed parameters from the Foursquare API JSON result.

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    limit = 20 # maximum number of venues returned by the Foursquare API per request
    search_query = 'Asian Restaurant'
    url = 'https://api.foursquare.com/v2/venues/search' # API endpoint
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # build the query parameters; requests URL-encodes them (e.g. the space in the query)
        params = {
            'client_id': CLIENT_ID, 
            'client_secret': CLIENT_SECRET, 
            'll': '{},{}'.format(lat, lng), 
            'v': VERSION, 
            'query': search_query, 
            'radius': radius, 
            'limit': limit}
          
        # make the GET request
        results = requests.get(url, params=params).json()['response']['venues']
        
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],  
            v['categories']) for v in results])
        
    # flatten the per-listing lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude',
                  'Venue Category']
    
    return nearby_venues
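
The parsing step inside `getNearbyVenues` can be checked offline, without calling the API. The following is a minimal sketch, assuming a response shaped like the `response.venues` list the function reads; the `sample_venues` payload and the helper name `flatten_venues` are hypothetical and used only for illustration.

```python
import pandas as pd

def flatten_venues(name, lat, lng, venues):
    """Turn one listing's venue list into flat rows (hypothetical helper)."""
    return [
        (name, lat, lng,
         v['name'], v['location']['lat'], v['location']['lng'], v['categories'])
        for v in venues
    ]

# Hypothetical payload that mimics only the fields read from the search response
sample_venues = [
    {'name': 'Venue A', 'location': {'lat': 48.8735, 'lng': 2.3740},
     'categories': [{'name': 'Asian Restaurant'}]},
    {'name': 'Venue B', 'location': {'lat': 48.8729, 'lng': 2.3731},
     'categories': []},  # some venues come back without a category
]

rows = flatten_venues('Entrepôt', 48.8733818465, 2.37345456201, sample_venues)
nearby = pd.DataFrame(rows, columns=[
    'Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
    'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category'])
print(nearby[['Neighborhood', 'Venue', 'Venue Category']])
```

Each venue becomes one row that repeats the listing's neighborhood and coordinates, which is why the resulting dataframe can hold many more rows than there are listings.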

In [None]:
# Use the defined function to get all the venues
asian_restaurants = getNearbyVenues(names=airbnb['Neighborhood'],
                                   latitudes=airbnb['Latitude'],
                                   longitudes=airbnb['Longitude']
                                  )

Entrepôt
Reuilly
Reuilly
Popincourt
Entrepôt
Élysée
Buttes-Montmartre
Observatoire
Gobelins
Reuilly
Buttes-Montmartre
Ménilmontant
Popincourt
Popincourt
Buttes-Montmartre
Popincourt
Hôtel-de-Ville
Popincourt
Observatoire
Vaugirard
Vaugirard
Luxembourg
Popincourt
Reuilly
Gobelins
Bourse
Panthéon
Opéra
Buttes-Chaumont
Gobelins
Élysée
Entrepôt
Bourse
Palais-Bourbon
Opéra
Buttes-Chaumont
Vaugirard
Opéra
Observatoire
Reuilly
Ménilmontant
Gobelins
Passy
Vaugirard
Buttes-Chaumont
Ménilmontant
Buttes-Chaumont
Passy
Buttes-Montmartre


In [None]:
print(asian_restaurants.shape)
asian_restaurants.head(20)

Define a function that will get the category type for each returned venue from the JSON result file.

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['Venue Category']
    except KeyError:
        categories_list = row['venues.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
asian_restaurants['categories'] = asian_restaurants.apply(get_category_type, axis=1)

In [None]:
asian_restaurants.head()

In [None]:
asian_restaurants.drop(['Venue Category'], inplace=True, axis=1) # Drop the initial Venue category column
asian_restaurants.rename(columns={'categories': "Venue Category"}, inplace=True) # Rename the new category as venue category
asian_restaurants.head(20)

In [None]:
print('The number of unique venue categories is:', asian_restaurants['Venue Category'].nunique(dropna=False)) # Print the number of unique venue categories returned
asian_restaurants['Venue Category'].unique().tolist() # Print the list of unique venue categories

<b>NB:</b> From the output above, the venue category contains results which are not restaurants, results which are restaurants of an unspecified kind, and some results which are not restaurants of Asian origin. These rows will be dropped using the lines of code below.

In [None]:
# Create a list of unwanted venue categories and store in variable 'unwanted'
unwanted = ['Restaurant', 'Brazilian Restaurant', 'African Restaurant', 'Greek Restaurant', 'Fast Food Restaurant','Italian Restaurant','French Restaurant', 'Moroccan Restaurant', 'Romanian Restaurant',
 'Cafeteria', 'German Restaurant', 'Modern European Restaurant', 'South American Restaurant','Ethiopian Restaurant']

In [None]:
# Store rows in which the venue category is a restaurant in variable 'asian'; rows without a category are excluded
asian = asian_restaurants[asian_restaurants['Venue Category'].str.contains("Restaurant", na=False)]

In [None]:
asian.shape # Display the new shape of the data set

In [None]:
asian = asian[~asian['Venue Category'].isin(unwanted)]
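
The two filtering steps above can be illustrated on a toy table. This is a minimal sketch, assuming hypothetical venue rows and a shortened `unwanted` list; it only shows how `str.contains` and `isin` combine, and is not the notebook's actual data.

```python
import pandas as pd

# Hypothetical rows; None stands for a venue returned without a category
venues = pd.DataFrame({
    'Venue': ['A', 'B', 'C', 'D', 'E'],
    'Venue Category': ['Thai Restaurant', 'Restaurant', 'Bakery',
                       None, 'Italian Restaurant'],
})
unwanted = ['Restaurant', 'Italian Restaurant']  # shortened list for the example

# Step 1: keep rows whose category mentions "Restaurant" (missing values count as no match)
is_restaurant = venues['Venue Category'].str.contains('Restaurant', na=False)
# Step 2: drop generic or non-Asian restaurant categories
is_unwanted = venues['Venue Category'].isin(unwanted)

kept = venues[is_restaurant & ~is_unwanted]
print(kept)
```

Only the Thai restaurant survives both filters: the bakery and the uncategorized venue fail the first step, and the generic and Italian restaurants fail the second.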

In [None]:
print(asian.shape)
asian.head(20)

In [None]:
asian.groupby('Neighborhood').count() # Group the dataframe by neighborhood

In [None]:
print('There are {} unique categories.'.format(len(asian['Venue Category'].unique())))

With one-hot encoding, convert the venue categories into a numeric form that can be processed by the clustering algorithm.

In [None]:
# one hot encoding
asian_onehot = pd.get_dummies(asian[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
asian_onehot['Neighborhood'] = asian['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [asian_onehot.columns[-1]] + list(asian_onehot.columns[:-1])
asian_onehot = asian_onehot[fixed_columns]

asian_onehot.head()

In [None]:
asian_onehot.shape

In [None]:
asian_grouped = asian_onehot.groupby('Neighborhood').mean().reset_index() # Find the mean of the data set and reset the index
asian_grouped
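
To make the encoding and averaging steps concrete, here is a minimal sketch on a toy table. The rows are hypothetical; the point is that the mean of the one-hot columns per neighborhood is the share of each restaurant category among the venues found there.

```python
import pandas as pd

# Hypothetical venues: three in Reuilly, one in Passy
toy = pd.DataFrame({
    'Neighborhood': ['Reuilly', 'Reuilly', 'Reuilly', 'Passy'],
    'Venue Category': ['Thai Restaurant', 'Thai Restaurant',
                       'Sushi Restaurant', 'Sushi Restaurant'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# Mean per neighborhood = relative frequency of each category
freq = onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

In this toy case Reuilly ends up with two thirds Thai and one third sushi, while Passy is entirely sushi. These frequency vectors are the features the clustering step works on.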

Find the top 5 restaurant categories in each neighborhood

In [None]:
num_top_venues = 5

for hood in asian_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = asian_grouped[asian_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
# A function to return the most common venues
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = asian_grouped['Neighborhood']

for ind in np.arange(asian_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(asian_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

#### Clustering the Neighborhoods using k-means

In [None]:
# set number of clusters
kclusters = 5

asian_grouped_clustering = asian_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(asian_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
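
The value `kclusters = 5` is fixed above without a stated selection procedure. One common check, which is not part of the original analysis, is to compare the k-means inertia (within-cluster sum of squares) across several values of k and look for the point where it stops dropping sharply. The following is a minimal sketch, assuming random data `X` as a hypothetical stand-in for `asian_grouped_clustering`.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in: 20 neighborhoods x 6 category frequencies, each row summing to 1
X = rng.random((20, 6))
X = X / X.sum(axis=1, keepdims=True)

inertias = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, inertia in inertias.items():
    print(k, round(inertia, 4))
```

With the real frequency table in place of `X`, a clear bend in these values would support a particular choice of k; if there is no bend, the choice of five clusters is a matter of convenience rather than structure in the data.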

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

asian_merged = airbnb

# merge the AirBnB listings with the sorted neighborhood data (cluster label and most common venues)
asian_merged = asian_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

asian_merged.head()

# Results & Discussion

After clustering the restaurants using k-means, the cluster labels were generated. Hence, the AirBnB listings can now be superimposed on the map of Paris, colored by cluster, as depicted below.

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13)

# set color scheme for the clusters: one color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one marker per AirBnB listing, colored by the cluster of its neighborhood
for lat, lon, poi, cluster in zip(asian_merged['Latitude'], asian_merged['Longitude'], asian_merged['Neighborhood'], asian_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)], # cluster labels are zero-based
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

From the map above, it is seen that there is clear overlap in the clusters due to the high concentration of similar venue categories across the AirBnB listings.
To explain this further, the data set will be grouped by cluster label in order to see how the clustering algorithm grouped the restaurants with respect to the average price of AirBnB apartments.

In [None]:
asian_merged.groupby('Cluster Labels').mean(numeric_only=True) # average only the numeric columns

From the result above, the average room price differs from cluster to cluster, which can help an Asian tourist choose a room.

In [None]:
asian_bar = asian_merged.groupby('Cluster Labels').mean(numeric_only=True)
ax = asian_bar.plot.bar(y='Room Price', rot=0)

The dataframes below show the result for each cluster

<b>CLUSTER 1</b>

In [None]:
asian_merged.loc[asian_merged['Cluster Labels'] == 0, asian_merged.columns[[0,1,2,3,4,5] + list(range(7, asian_merged.shape[1]))]]

<b>CLUSTER 2</b>

In [None]:
asian_merged.loc[asian_merged['Cluster Labels'] == 1, asian_merged.columns[[0,1,2,3,4,5] + list(range(7, asian_merged.shape[1]))]]

<b>CLUSTER 3</b>

In [None]:
asian_merged.loc[asian_merged['Cluster Labels'] == 2, asian_merged.columns[[0,1,2,3,4,5] + list(range(7, asian_merged.shape[1]))]]

<b>CLUSTER 4</b>

In [None]:
asian_merged.loc[asian_merged['Cluster Labels'] == 3, asian_merged.columns[[0,1,2,3,4,5] + list(range(7, asian_merged.shape[1]))]]

<b>CLUSTER 5</b>

In [None]:
asian_merged.loc[asian_merged['Cluster Labels'] == 4, asian_merged.columns[[0,1,2,3,4,5] + list(range(7, asian_merged.shape[1]))]]
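
The introduction proposes choosing a room by price per night and number of reviews, while the cells above only display the listings of each cluster. One possible way to turn a cluster into a shortlist is sketched below. The toy dataframe, the helper name `rank_listings`, and the ordering rule (most reviews first, cheaper room first on ties) are assumptions made for illustration, not the author's method.

```python
import pandas as pd

# Hypothetical listings with the same column names as asian_merged
toy = pd.DataFrame({
    'Room ID': [101, 102, 103, 104],
    'Room Price': [40, 30, 55, 30],
    'Number of reviews': [104, 8, 104, 27],
    'Cluster Labels': [0, 0, 0, 1],
})

def rank_listings(df, cluster, top_n=3):
    """Return the top_n listings of one cluster (hypothetical ranking rule)."""
    subset = df[df['Cluster Labels'] == cluster]
    ranked = subset.sort_values(
        ['Number of reviews', 'Room Price'],
        ascending=[False, True])  # more reviews first, then lower price
    return ranked.head(top_n)

best = rank_listings(toy, cluster=0)
print(best[['Room ID', 'Room Price', 'Number of reviews']])
```

Applied to `asian_merged`, the same call would give a short, ordered list of rooms per cluster instead of the full table. A different weighting of price against reviews would change the order.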

# Conclusion & Recommendations

From the results, the k-means clustering algorithm proved a useful tool for grouping Asian restaurants which are in proximity to AirBnB private room listings in the city of Paris.
<br>
<br>
<br>
As a recommendation, to improve the outcome further, one can consider the reviews of each of these restaurants to help the Asian tourist determine the locations with the most positive reviews. The price of the rooms can be factored into the k-means model. Also, the distance of each restaurant from the AirBnB private apartment can be computed to help improve the outcome of the model.
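
The distance suggested above can be computed from the latitude and longitude columns already in the data. The following is a minimal sketch using the haversine great-circle formula, assuming a spherical Earth with a mean radius of 6,371 km. The function name `haversine_m` and the restaurant coordinates are hypothetical; the listing coordinates are taken from the first row of the preprocessed table.

```python
import numpy as np

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Listing from the Entrepôt row of the table; restaurant position is hypothetical
listing = (48.873382, 2.373455)
restaurant = (48.8760, 2.3700)

distance = haversine_m(*listing, *restaurant)
print('Distance in meters:', round(float(distance)))
```

Because the function uses NumPy operations, it also accepts whole columns such as `asian['Neighborhood Latitude']` and `asian['Venue Latitude']`, which would give one distance per restaurant row.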