### INF 510 Fall 2019 Final Project Submission

1.	**The names of team member(s)**:

    Linle Jiang

2.	**How to run your code (what command-line switches they are, what happens when you invoke the code, etc.)**

    This project requires the following packages:
    - requests, BeautifulSoup, pandas, json, csv, argparse and bokeh
    
    To use the Bokeh package, please make sure you activate the environment created from the environment file.
    
    To run this project, make sure the above packages are installed, and then simply clone the repo at https://github.com/linlejiang/inf550_project and execute this notebook.

3.	**Any major “gotchas” to the code (i.e. things that don’t work, go slowly, could be improved, etc.)**
    
    The code takes about 10 minutes to scrape the data from Google and TripAdvisor. Also, the generated interactive plots are all displayed in a single row; I couldn't make them display in two rows.
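One possible way around the single-row layout is sketched below, under the assumption that a grid layout is acceptable. The helper `to_grid` is hypothetical (it is not part of the project code) and only reshapes a flat list into rows. Bokeh's `gridplot` accepts such a list of lists, with `None` marking an empty cell; whether it behaves well with the `gmap` figures used here has not been verified.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def to_grid(items: Sequence[T], ncols: int) -> List[List[Optional[T]]]:
    """Split a flat sequence into rows of `ncols`, padding the last row with None."""
    if ncols < 1:
        raise ValueError("ncols must be at least 1")
    rows = []
    for start in range(0, len(items), ncols):
        chunk = list(items[start:start + ncols])
        chunk += [None] * (ncols - len(chunk))  # fill the empty cells of the last row
        rows.append(chunk)
    return rows


# Strings stand in for the three figures p1, p2, p3 built in the notebook.
layout = to_grid(["p1", "p2", "p3"], ncols=2)
print(layout)  # [['p1', 'p2'], ['p3', None]]

# With Bokeh installed, the same nested list could be handed to a grid layout
# (an untested suggestion for the gmap figures in this project):
#   from bokeh.layouts import gridplot
#   show(gridplot(to_grid([p1, p2, p3], ncols=2)))
```

With `ncols=2`, the first two maps would share the top row and the third map would sit alone in the second row.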

4.  **Anything else you feel is relevant to the grading of your project.**

    If you don't run the remote mode before the local mode, you may find that the two modes generate slightly different hotel data. Specifically, the number of hotels may differ. For more details and explanations, please refer to the comments under 'Hotel data criteria' in the trip_data() function in Milestone 2.

5. **What did you set out to study?  (i.e. what was the point of your project?  This should be close to your Milestone 1 assignment, but if you switched gears or changed things, note it here.)**

    Consistent with my Milestone 1 assignment, my motivation for this project is to plan a trip to Hawaii. In my experience, there aren't any interactive maps that plot both top sights and hotels together without substantial manual effort. Therefore, the ultimate goals of this project are: 1) to automatically create three interactive maps of Hawaii that users can interact with (e.g., the size of the maps can be changed within the visualization, and when the mouse cursor is over a hotel data point, the name of the hotel shows up); and 2) to visually compare how the distributions of hotels differ when plotted by different popularity metrics.

    However, there are some minor changes: 

      1. The hotel address data was initially planned to be scraped from TripAdvisor, so that the Google API could be used to get geometry results. However, this plan was dropped for the following reasons:  
          1) some hotel addresses obtained from TripAdvisor were incorrect when checked against the website;  
          2) the Google API accepts hotel names as inputs when requesting hotel geometry, and this approach yields substantially fewer missing values than scraping addresses from TripAdvisor;  
          3) dropping the address scraping shortens the data acquisition time significantly and lowers the probability of being banned by the website, as the number of requests drops substantially.

      2. I plot 20 sights instead of 10.

      3. Instead of collecting all hotel listings from TripAdvisor, I specified certain criteria to determine which hotels should be included in the dataset. This resembles real-life hotel selection and limits the number of requests made when the code scrapes data from TripAdvisor.

      4. The three metrics used to describe the hotels are price, popularity (i.e., number of reviews), and recommendation (i.e., computed by the product of the ratings and the standardized number of reviews). The recommendation metric is also used to select the sights in the plot.
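The recommendation metric described above can be sketched in plain Python. This is an illustration only: the function name `recommendation_scores` and the numbers are made up, and the sample standard deviation is used to match the default of pandas' `.std()`, which the notebook relies on.

```python
from statistics import mean, stdev


def recommendation_scores(reviews, ratings):
    """Return rating * z-score(review count) for each item."""
    mu = mean(reviews)
    sigma = stdev(reviews)  # sample standard deviation (n - 1), like pandas' default .std()
    return [rating * (count - mu) / sigma for count, rating in zip(reviews, ratings)]


# Made-up numbers for illustration only, not project data.
reviews = [100, 200, 300]
ratings = [4.0, 4.5, 5.0]
scores = recommendation_scores(reviews, ratings)
print(scores)  # [-4.0, 0.0, 5.0]
```

One property worth keeping in mind when reading the maps: items with a below-average number of reviews get a negative score, and for those items a higher rating makes the score more negative.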

6. **What did you Discover/what were your conclusions (i.e. what were your findings?  Were your original assumptions confirmed, etc.?)**

    I generate three interactive plots in an HTML file; if you run the code, they will automatically open in your browser. The plots let users interact with the maps: 1) the size of the maps can be changed; 2) when the mouse cursor is over a sight, the name of the sight is shown; 3) when the mouse cursor is over a hotel, the name and lowest price of the hotel are shown; 4) clicking a hotel circle opens a new browser tab with the hotel's TripAdvisor page; 5) the map can be zoomed in and out once the wheel zoom tool is activated; and 6) the map can be dragged to view different areas.
    
    Overall, in this project, I found that the most recommended sights (i.e., those ranked highest by the metric based on reviews and ratings) are concentrated in urban areas; only a few of them are natural sights. Also, although every island has at least one of the most recommended sights, most of them are in Honolulu. This makes sense, considering that Honolulu is the most developed area in the Hawaiian Islands.
    
    For the hotels included in the dataset, first, most of them are located near the sea. This is consistent with the idea that the general public enjoys the beach and hotels with a sea view. Additionally, it appears that the cheaper hotels tend to cluster together, likely due to high demand and competition, while the more expensive hotels are more spread out. Also, only about half of the sights have hotels close by; this is especially the case on the Island of Hawaii and in the urban area of Honolulu. Moreover, the hotels closer to the sights tend to have lower prices. Finally, there appear to be positive correlations among hotel price, popularity (i.e., number of reviews), and recommendation. Considering the higher expenses in Hawaii, it is possible that visitors to Hawaii tend to have higher incomes, and thus tend to stay at higher-quality hotels and can afford such expenses.

7. **What difficulties did you have in completing the project?**  

    *What didn't work?  What was hard to do?  What stumbling blocks did you run into?*
    
    The most difficult part for me was scraping a large amount of data from TripAdvisor. According to my initial plan, I would have had to make thousands of requests to the website to get all the hotel data. To avoid being banned by the website, I tried rotating proxies and user agents. However, that was not practical because the free public proxies significantly slowed down the scraping, likely due to the extra time spent routing requests through different proxies. Therefore, I had to step back and figure out which data might not be necessary. Fortunately, I was able to increase the number of hotels included in the dataset by removing one unnecessary attribute (i.e., the hotel address; each address needs its own request when scraping).
    
    Another difficulty was optimizing my code to shorten the time spent scraping the hotel data. However, my hands were tied by the website structure, which requires a large number of requests.
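When many independent requests are unavoidable, one common way to shorten the total wait is to issue a few of them concurrently. The sketch below is not the project's scraper: `fetch_hotel_page` is a hypothetical stand-in that sleeps instead of calling the network, and the URLs are placeholders. It only illustrates the thread-pool pattern from the standard library.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_hotel_page(url):
    """Hypothetical stand-in for one HTTP request; sleeps instead of hitting the network."""
    time.sleep(0.05)  # simulated network latency
    return f"<html>{url}</html>"


def fetch_all(urls, max_workers=4):
    """Fetch every URL using a small thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_hotel_page, urls))


urls = [f"https://example.com/hotel/{i}" for i in range(8)]  # placeholder URLs
start = time.perf_counter()
pages = fetch_all(urls)
elapsed = time.perf_counter() - start
print(len(pages), f"{elapsed:.2f}s")
```

A small `max_workers` value is deliberate here: sending more requests in parallel may raise the chance of being blocked, so the pool size trades speed against that risk.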

8. **What skills did you wish you had while you were doing the project?**

    *Was there something that you wish you'd have known better while you were doing the project?  If you learned that skill while doing the project, note it here, but even if not, what would have helped?*

    As I mentioned in question 7, first, I wish I had known how to optimize my code to collect data faster, but I wasn't sure whether there were ways to make it more efficient. Second, I wish I had known how to perform proxy rotation and user agent rotation to avoid being banned by the website. I did learn some techniques and applied them, but the attempts weren't successful. Third, I wish I had known how to create interactive visualizations; through this project, I learned how to use the bokeh package to create interactive maps, and I was able to include all the features as planned. Still, there are many more useful and interesting features to learn. Finally, I wish I had learned some NLP techniques, so that I could collect the reviews for each hotel and extract keywords from them to provide qualitative descriptions of each hotel.

9. **What would you do “next” to expand or augment the project?**

    *If you had to continue this project, what would you add to it?  If you had the skills you mentioned in question 8, what could you do to enhance things?*
    
    First, I would extend the program so that whenever a user searches for a location, the program automatically scrapes data for that location and generates the interactive maps. Second, if I had known how to optimize my code and rotate proxies and user agents efficiently, my program would scale better, collect more data, and run faster. Third, if I had learned some NLP keyword extraction techniques, I could include extra attributes in this dataset.
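As a rough idea of what the keyword extraction mentioned above could look like, here is a minimal frequency-based sketch. It is an assumption about one simple approach, not something the project implements: the stopword list is tiny and hand-picked, the function `top_keywords` is hypothetical, and the reviews are invented.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list; a real project would likely use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "was", "is", "to", "of", "in", "it",
             "with", "very", "but", "were", "for"}


def top_keywords(reviews, n=3):
    """Return the n most frequent non-stopword tokens across the review texts."""
    counts = Counter()
    for text in reviews:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]


# Made-up reviews for illustration only.
sample_reviews = [
    "The beach was close and the pool was clean.",
    "Great beach access, friendly staff.",
    "Clean rooms and a short walk to the beach.",
]
print(top_keywords(sample_reviews, n=2))  # ['beach', 'clean']
```

The resulting keywords could then be stored as an extra column per hotel and shown in the hover tooltip.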

In [1]:
import pandas as pd
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, GMapOptions, LinearColorMapper, ColorBar, HoverTool, TapTool, OpenURL
from bokeh.plotting import gmap
from bokeh.transform import transform
from bokeh.layouts import row

API_key = input('Google API Key: ')

Google API Key: 


In [4]:
sight_df = pd.read_csv('data/Sights.csv')
hotel_df = pd.read_csv('data/Hotels.csv')
        
# get the z score of the sight review variable
sight_df['z_sight_review'] = (sight_df['sight_reviews'] - sight_df['sight_reviews'].mean())/sight_df['sight_reviews'].std()
        
# generate the recommendation metric, taking the product of sight rating and the z score of sight review
sight_df['rec_sights'] = sight_df['z_sight_review']*sight_df['sight_rating']
        
# get the z score of the hotel review variable
hotel_df['z_hotel_review'] = (hotel_df['hotel_review'] - hotel_df['hotel_review'].mean())/hotel_df['hotel_review'].std()
        
# generate the recommendation metric, taking the product of hotel rating and the z score of hotel review
hotel_df['rec_hotels'] = hotel_df['z_hotel_review']*hotel_df['hotel_rating']

# create an HTML file to store the three interactive maps
output_file("data/Hawaii Sights and Hotels.html")

# set the map at Hawaii (with the geometry data provided), using Google terrain map as tile with a zoom level of 7
map_options = GMapOptions(lat = 20.716179, lng = -158.214676, map_type = "terrain", zoom = 7)

# Bokeh glyphs read their data from ColumnDataSource objects, so convert the dataframes to that format
# sort the sight data by the recommendation metric, and select the 20 most recommended sights
source1 = ColumnDataSource(sight_df.sort_values(by = ['rec_sights'], ascending = False)[:20])
source2 = ColumnDataSource(hotel_df)

# plot Recommended Sights with Hawaii Hotels by Price
        
# generate a Hawaii base map by making a request to Google MAP API, and specify the title for the map
p1 = gmap(API_key, map_options, title = "Recommended Sights with Hawaii Hotels by Price")

# create triangle objects to represent the sights, specifying the geometry data as x- and y-axis,
# set the color, size of the triangles, also provide a legend to the map
o1 = p1.triangle(x = "LNG", y = "LAT", size = 9, legend = 'sight', fill_color = "blue", fill_alpha = 0.8, source = source1)
# add the hovertool so that when the mouse cursor is at a sight, the sight name will be displayed
p1.add_tools(HoverTool(renderers = [o1], tooltips = [('Sight', '@sight_name')]))

# specify the color for the hotel circles
color_mapper_price = LinearColorMapper(palette = 'BuPu9', low = hotel_df['hotel_price'].min(), high = hotel_df['hotel_price'].max())
        
# create a color bar to indicate the relationship between color and price
color_bar_price = ColorBar(color_mapper = color_mapper_price, label_standoff = 12, location = (0,0), title = 'Price')
        
# create circle objects to represent the hotels, specifying the geometry data as x- and y-axis,
# set the color, size of the circles, also provide a legend to the map
o2 = p1.circle(x = "LNG", y = "LAT", size = 7, legend = 'hotel', color = transform('hotel_price', color_mapper_price), alpha = 0.8, source = source2)
# add the hovertool so that when the mouse cursor is at a hotel, the hotel name and lowest price will be displayed
p1.add_tools(HoverTool(renderers = [o2], tooltips = [('Hotel', '@hotel_name'),
                                                    ('Price', '@hotel_price')]))

# add a taptool that enables the user to click a hotel circle and jump to the hotel webpage in TripAdvisor
url = '@hotel_url'  # specify the url source in our ColumnDataSource
tap = TapTool(callback = OpenURL(url = url))  # specify the open link action for this taptool
        
# avoid the hotel circles being greyed out during any interactions
o2.selection_glyph = None
o2.nonselection_glyph = None
        
# add the taptool to the plot
p1.add_tools(tap)

# add the color bar next to the plot
p1.add_layout(color_bar_price, 'right')

# similar to the above, plot Recommended Sights with Hawaii Hotels by Popularity Metric       
p2 = gmap(API_key, map_options, title = "Recommended Sights with Hawaii Hotels by Popularity Metric")
o3 = p2.triangle(x = "LNG", y = "LAT", size = 9, legend = 'sight', fill_color = "blue", fill_alpha = 0.8, source = source1)
p2.add_tools(HoverTool(renderers = [o3], tooltips = [('Sight', '@sight_name')]))
color_mapper_pop = LinearColorMapper(palette = 'Viridis9', low = hotel_df['hotel_review'].min(), high = hotel_df['hotel_review'].max())
color_bar_pop = ColorBar(color_mapper = color_mapper_pop, label_standoff = 12, location = (0,0), title = 'Popularity')
o4 = p2.circle(x = "LNG", y = "LAT", size = 7, legend = 'hotel', color = transform('hotel_review', color_mapper_pop), alpha = 0.8, source = source2)
p2.add_tools(HoverTool(renderers = [o4], tooltips = [('Hotel', '@hotel_name'),
                                                    ('Price', '@hotel_price')]))
url = '@hotel_url'
tap = TapTool(callback = OpenURL(url = url))
o4.selection_glyph = None
o4.nonselection_glyph = None
p2.add_tools(tap)
p2.add_layout(color_bar_pop, 'right')
        
# similar to the above, plot Recommended Sights with Hawaii Hotels by Recommendation Metric
p3 = gmap(API_key, map_options, title = "Recommended Sights with Hawaii Hotels by Recommendation Metric")
o5 = p3.triangle(x = "LNG", y = "LAT", size = 9, legend = 'sight', fill_color = "blue", fill_alpha = 0.8, source = source1)
p3.add_tools(HoverTool(renderers = [o5], tooltips = [('Sight', '@sight_name')]))
# use dedicated names for the recommendation mapper and color bar instead of reusing the popularity ones
color_mapper_rec = LinearColorMapper(palette = 'PiYG9', low = hotel_df['rec_hotels'].min(), high = hotel_df['rec_hotels'].max())
color_bar_rec = ColorBar(color_mapper = color_mapper_rec, label_standoff = 12, location = (0,0), title = 'Recommendation')
o6 = p3.circle(x = "LNG", y = "LAT", size = 7, legend = 'hotel', color = transform('rec_hotels', color_mapper_rec), alpha = 0.8, source = source2)
p3.add_tools(HoverTool(renderers = [o6], tooltips = [('Hotel', '@hotel_name'),
                                                    ('Price', '@hotel_price')]))
url = '@hotel_url'
tap = TapTool(callback = OpenURL(url = url))
o6.selection_glyph = None
o6.nonselection_glyph = None
p3.add_tools(tap)
p3.add_layout(color_bar_rec, 'right')

# automatically pop up the interactive plots in the browser, displayed in a row
show(row(p1,p2,p3))
        
# store all data to two csv files
sight_df.to_csv("data/Google_sights_data_complete.csv", index=False)
hotel_df.to_csv("data/Hotel_data_complete.csv", index=False)
 