# Holiday Rental Chooser
##### Team Members:
Nikolas Racelis-Russell, A15193225  
Romain Coville

In [1]:
import arcgis
from arcgis.gis import GIS
from ipywidgets import *
from IPython.display import display
from arcgis.features import use_proximity
import pandas as pd
import numpy as np
import json

In [6]:
gis = GIS(username="nracelis_ucsd5")

Enter password: ········


### Question
This project aims to build a module that chooses a rental place (apartment, house …) for someone planning a holiday. The task is to recommend an ad that respects the user's specifications and best fits their needs and wants.

We can divide this project into two parts:
- Finding a location that best fits the user's specifications (walking distance, price, location, nearby activities …).
- Discovering what makes an ad liked or disliked by the public.

The audience we aim to reach is ordinary consumers who want to organize their holidays without spending much time searching for the best place to stay. This kind of application could be sold to online travel platforms and holiday planning agencies.

### Background and Literature
The main focus of our literature work is to identify, as precisely as possible, the parameters that explain why an Airbnb listing is liked by users or not. Below are two reports/articles discussing this subject; we'll try to find more as we go through the project.

- Classifications of what locations tourists may like: https://bit.ly/37M3yN5
    - Helped define which locations would be useful for the map selection part of the project, though our data was limited because some of it was behind a paywall.
- Sentiment analysis of Airbnb review data: https://bit.ly/2T5d9cF
- AIRBNB LISTING PRICE AND RATING PREDICTION: https://github.com/ManuGMathew/AML-Project---Airbnb
- GA_capstone-Housing_prices: https://github.com/tmkilian/GA_capstone-Housing_prices

The first two were submitted with the project proposal and, though interesting, had little effect on our project's methods. The latter two are GitHub repos that we used as references when building the regression. Our regression still differed considerably, because each city has its own characteristics: for example, Paris zip codes follow the spiral layout of the arrondissements from the city center outward, and zip code ended up being one of our more important features.
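As an illustration of the zip code point, here is a minimal sketch of how the spiral ordering could be exposed to a regression. It is not the feature engineering used in the project; the function name `arrondissement_from_zipcode` is hypothetical. It relies only on the fact that Paris postal codes run from 75001 to 75020, with 75116 as an alternate code for the 16th arrondissement.

```python
def arrondissement_from_zipcode(zipcode):
    """Return the arrondissement (1-20) for a Paris postal code, else None.

    Hypothetical helper: accepts ints, floats or strings, because a zipcode
    column read from CSV can hold mixed types.
    """
    try:
        code = int(float(zipcode))
    except (TypeError, ValueError, OverflowError):
        return None
    # Paris postal codes run 75001-75020; 75116 is an alternate code for the 16th.
    if 75001 <= code <= 75020 or code == 75116:
        return code % 100
    return None


for raw in [75003, "75018", "75116", "92100", None]:
    print(raw, "->", arrondissement_from_zipcode(raw))
```

Assuming the listings are in a DataFrame with a `zipcode` column, the helper could be applied with `df_listings["zipcode"].map(arrondissement_from_zipcode)`.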

### Python Libraries & ArcGIS modules
- <s>Geopandas (exploratory data analysis).</s> ended up not being needed
- NumPy and pandas (for data manipulation)
- scikit-learn (finding which attributes make a place liked by the public)
- ArcGIS API for Python (web maps for selection of areas).
    - Specifically use_proximity from arcgis.features
- <s>Jupyter dashboard (to create a quick UI to present our work results).</s> ended up not being needed
- Plotly (for data visualization)


### Data Sources
Include a URL and a one-sentence description of each source you used. At least two of the
sources should have more than 1000 spatial objects each. This is basically how we expected it in
the project proposal.
Please reflect on how your choice of sources evolved since the proposal phase, and any concerns
about the sources you used - related to data quality, provenance, access constraints, etc. Also,
reflect on any data that would be helpful to address your research question, which you could not
obtain (and why). Be creative! There is a lot of additional information, sometimes from less
traditional sources, that may help.

Initial Data Sources:
- Airbnb open data: http://insideairbnb.com/get-the-data.html
- City landmarks : https://www.statista.com/statistics/303351/most-visited-tourist-attractions-worldwide/ (not free)
- Data for local restaurants to recommend based on preferences: https://www.yelp.com/dataset
    - Eventually had to drop this due to the lack of both NYC and Paris data here
    
Final Data Sources:
- Airbnb open data: http://insideairbnb.com/get-the-data.html
- Data scraped from the Wikipedia page of Paris tourist attractions: https://en.wikipedia.org/wiki/List_of_tourist_attractions_in_Paris
    
Data was a really big issue for this project, especially since we had wanted to use the Yelp dataset. We did not know initially that it only covers certain cities, so by the time we had the Paris Airbnb data set up, we had lost a lot of data we had counted on for the rental selection. We had to manually scrape tourist data from Wikipedia to make up for the loss.
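The scraping itself was done by hand, so the following is only a sketch of how attraction names could be pulled from list markup like that on the Wikipedia page. The class `ListLinkParser`, the function `extract_attraction_names` and the sample HTML are all hypothetical; the real page layout may differ, and coordinates would still have to be added separately.

```python
from html.parser import HTMLParser


class ListLinkParser(HTMLParser):
    """Collect the text of the first link inside each <li> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_li = False      # True while inside a list item
        self._in_link = False    # True while inside the link being captured
        self._captured = False   # True once the current <li> has yielded a name
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self._captured = False
        elif tag == "a" and self._in_li and not self._captured:
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            name = "".join(self._buffer).strip()
            if name:
                self.names.append(name)
                self._captured = True
            self._in_link = False
        elif tag == "li":
            self._in_li = False


def extract_attraction_names(html):
    """Return the first linked name of every list item in the HTML string."""
    parser = ListLinkParser()
    parser.feed(html)
    return parser.names


# Made-up sample that mimics a list of attractions.
sample_html = """
<ul>
  <li><a href="/wiki/Eiffel_Tower">Eiffel Tower</a>, a wrought-iron tower</li>
  <li><a href="/wiki/Louvre">Louvre</a> (<a href="/wiki/Museum">museum</a>)</li>
</ul>
"""
print(extract_attraction_names(sample_html))  # ['Eiffel Tower', 'Louvre']
```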

### Data Cleaning

Describe the cleaning/wrangling operations that you performed, and whether you realized you
needed to clean the data by examining metadata documents or by introspecting the data itself.
Was the amount of cleaning/data preparation similar to what you expected when writing project
proposal?
(a combination of markdown and code cells)
(at least 100 words – less if you didn’t have to do any cleaning!)

### Descriptive statistics

Explore the data using maps, charts, and common descriptive statistics. Sample questions you
can try to answer: is there spatial autocorrelation in the data? Are you dealing with random point
patterns? What is the spatial mean and standard distance? Or why the above questions are not
relevant to your research theme and the data?
(a combination of markdown and documented code cells)

# insert stuff from Romain's here

### Analysis

Provide a general outline of your analysis (in a markdown), and then document it step by step as
you code the solution. Please make sure that we can reproduce your analysis by running your
notebook. As before, a diagram describing your workflow would be helpful.
Please also reflect on how your actual analysis steps were different compared to your project
proposal - or state that you didn’t deviate from the initial plan.

### Initial Plan:
##### Joining of Data: 
- Make several tables for different functions
    - User data table
    - Business data table
    
##### Creating holiday recommendation framework: 
- Find what are the best attributes to determine the best location of a rental place.
- Geo enrich yelp data with landmarks
    - Finding walking and commuting distance from potential rental places
- Give recommendations based on search made by the user

##### Tourist Attraction Classification: 
- Find main attractions coordinates for the cities.
- Label the attraction by types using Yelp data set.
- Use this label data for the recommendation.

##### Client taste analysis : 
- Putting all the data together
- Dimension reduction if necessary (via PCA or clustering)
- Regression to find which attributes make a place liked by the public.

### Final Plan:
##### Joining of Data: 
- Make several tables for different functions
    - <s> User data table </s> no user data could be found; individual Airbnb user data is private and not accessible
    - Business data table
        - This became much more complicated because no business data was publicly available online, except through the Google API, which costs money
        
##### Creating holiday recommendation framework: 
- Find what are the best attributes to determine the best location of a rental place.
- <s>Geo enrich yelp data with landmarks</s> had to drop the Yelp dataset as it has no Paris data
    - Finding walking and commuting distance from potential rental places
- <s>Give recommendations based on search made by the user</s> not possible due to no user data
- What we can do with the ArcGIS API is return all points within a walking distance area, and use a web app to filter out tourist attractions based on attributes.

##### Tourist Attraction Classification: 
- Find main attractions coordinates for the cities.
- <s>Label the attraction by types using Yelp data set.</s> had to use scraped Wikipedia data instead
- Use this label data for the recommendation.

<s>Client taste analysis : </s>
- <s>Putting all the data together</s>
- <s>Dimension reduction if necessary (via PCA or clustering)</s>
- <s>Regression to find which attributes make a place liked by the public.</s>



(a combination of markdown and documented code cells)
(at least 500 words) 

### Joining of Data:

In [2]:
# create separate listings df to sample from
df_listings = pd.read_csv('Data/airbnb_listings.csv', low_memory=False)
df_listings.shape



(66414, 106)

In [3]:
# due to crashing of map when loading, sample from only these separate zip codes
zipcodes = [75003, 75005, 75007, 75012, 75014, 75016, 75018, 75020]
df_listings_restricted = df_listings[df_listings.zipcode.isin(zipcodes)]
df_listings_restricted_samplesize = round(df_listings_restricted.shape[0] * .20)
df_listings_restricted_sample = df_listings_restricted.sample(df_listings_restricted_samplesize, random_state= 42)
df_listings_restricted_sample.shape

(1318, 106)

In [4]:
# save as restricted csv to upload to web
df_listings_restricted_sample.to_csv('airbnb_listings_restricted.csv')

In [11]:
airbnb_restricted = gis.content.get('14afc76851064943b7fd61afd009db58')
airbnb_restricted

### Creating holiday recommendation framework:

In [15]:
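# Callback intended for the map's on-click event. Computing distances here usually
# crashed the notebook, so it only prints the clicked geometry.
def compute_distance(callback_map, g):
    try:
        print(g)
    except Exception as exc:  # avoid a bare except, which also swallows KeyboardInterrupt
        print('crash:', exc)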

In [17]:
# Define map, add layer of restricted points, and add onclick function
callback_map = gis.map('Paris')
callback_map.add_layer(airbnb_restricted)
callback_map.on_click(compute_distance)
callback_map

MapView(layout=Layout(height='400px', width='100%'))

{'spatialReference': {'latestWkid': 3857, 'wkid': 102100}, 'x': 247419.84985005105, 'y': 6256507.123786469}
{'spatialReference': {'latestWkid': 3857, 'wkid': 102100}, 'x': 247063.1444956117, 'y': 6254927.423646796}
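The click geometries printed above are in Web Mercator (wkid 102100), whereas the test point used in this notebook is in WGS84 (wkid 4326). The sketch below shows the standard spherical Web Mercator inverse for converting one to the other locally, without an API call. The helper name `web_mercator_to_wgs84` is our own and not part of the ArcGIS API.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator


def web_mercator_to_wgs84(x, y):
    """Convert Web Mercator metres (wkid 102100 / 3857) to (lon, lat) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat


# First geometry printed by the on-click callback above.
clicked = {'spatialReference': {'latestWkid': 3857, 'wkid': 102100},
           'x': 247419.84985005105, 'y': 6256507.123786469}

lon, lat = web_mercator_to_wgs84(clicked['x'], clicked['y'])
point_wgs84 = {'spatialReference': {'latestWkid': 4326, 'wkid': 4326},
               'x': lon, 'y': lat}
print(point_wgs84)
```

The resulting dictionary has the same shape as a WGS84 point geometry, so it could stand in for a hand-written test point.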


Below is how we would have liked it to work, demonstrated with a single test point.

In [18]:
# example geometry in the shape returned by on_click (here in WGS84, wkid 4326)
test_point = {'spatialReference': {'latestWkid': 4326, 'wkid': 4326},
 'x': 2.3158225,
 'y': 48.8355492}

In [19]:
from arcgis.features import Feature, FeatureSet
features = [Feature(geometry=test_point, attributes={'name': 'my_location'})]
fset = FeatureSet(
    features=features, 
    geometry_type='esriGeometryPoint',
    spatial_reference = 4326
)

In [33]:
flayer = fset.sdf.spatial.to_featurelayer('test_point')
flayer

In [39]:
# create walking-time areas around the test point (break values are in minutes by default)
walking_times = arcgis.create_drive_time_areas(
    flayer, [5,10,15], travel_mode='Walking'
)
walking_times

<FeatureCollection>

In [43]:
# import tourist layer with tourist location data
tourist_flayer = gis.content.get('8ce9d7f23e084aedbc2466b6832fbb98')
tourist_flayer

In [49]:
from arcgis.features.analysis import join_features

# now that we have all the tourist attractions, and the walking times, we can return all the tourist attractions
# within the walking distance
joined_flayer = join_features(target_layer = tourist_flayer,
                             join_layer = walking_times,
                             spatial_relationship = 'completelywithin')

In [50]:
meep = gis.map('Paris')
meep.add_layer(flayer)
meep.add_layer(walking_times, {'opacity':0.25,})
meep.add_layer(joined_flayer)
meep

MapView(layout=Layout(height='400px', width='100%'))

However, because the Python API does not include the web app portion (or at least not from what we learned in class), we customized a web app through the ArcGIS Online map viewer and load it here.

In [53]:
web_app = gis.content.get('390604a3e45d4f288fef31de1701c11f')
web_app

### Tourist Attraction Classification:

### Summary

Describe what you found, and why it is important; illustrate the findings with maps/charts
reflecting your results. 
If you create new datasets or map in AGOL, please share them to the DSC 170 Data group, and
reference them in your notebook by ID via gis.content.get.
(a combination of markdown and documented code cells)
(200 words)


### Discussion

The discussion should include the following parts:
1) Discuss your findings with respect to the literature sources in section 3. What do the
results mean in the context of what is already known? What is new? Does it validate what
was found in literature? How do your results improve our understanding of the problem?
2) Of particular importance is a discussion of any trade-offs and decision points that you had
to consider. This may include a discussion of any performance issues, width of buffers
you applied, projections you chose, spatial operations you used, map combination
techniques, and other issues we discussed in class.
(this can be done as a markdown, at least 200 words)

#### Findings
With respect to the literature, the first two sources were not as relevant as we had initially thought. We did not have so much a question to answer as a concept to create, which was perhaps one of the bigger challenges of this project.

#### Trade-offs
This was the biggest problem we had. As the project progressed, we constantly had to compromise on our initial idea because of how the ArcGIS API works. For example, the on-click map was flawed: clicking on the map returned only the geometry of the clicked point, not the attributes of the listing at that point. This could perhaps have been remedied with another proximity function that returns only the data for the point nearest to the geometry, but that would have been yet another API call, and performance was already too slow to take it.
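The nearest-point lookup described above does not necessarily need the API. Below is a minimal sketch of doing it locally with a haversine distance. We did not run this in the project; the function names are hypothetical, the rows are made up, and we assume the listings expose `latitude` and `longitude` fields in WGS84 degrees.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_listing(click_lat, click_lon, listings):
    """Return the listing (a dict) closest to the clicked point."""
    return min(
        listings,
        key=lambda row: haversine_m(click_lat, click_lon, row["latitude"], row["longitude"]),
    )


# Made-up listings; real rows would come from the Airbnb listings table.
listings = [
    {"id": 1, "latitude": 48.8600, "longitude": 2.3500},
    {"id": 2, "latitude": 48.8360, "longitude": 2.3160},
    {"id": 3, "latitude": 48.8900, "longitude": 2.3400},
]

# Coordinates of the test point used in this notebook.
match = nearest_listing(48.8355492, 2.3158225, listings)
print(match["id"])  # 2
```

With a pandas DataFrame, the rows could be passed as `df.to_dict("records")`. A linear scan like this is a simple choice for a sample of about 1,300 listings; a spatial index would be the natural next step for the full dataset.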

Additionally, when too many points were displayed at once, the entire page usually crashed.

Eventually, we had to abandon the idea of an interactive UI and compromise by creating a proof of concept using the available tools as best we could.

### Conclusions

Did you manage to completely answer your initial research question? If not, what additional data
and additional analysis steps can you think of? Can your approach be extended to other areas or
topics, and use additional datasets? How do you expect the results to be used and by whom?

No, we did not get to answer our initial research question, but we were able to compromise with what we had, and ended up with a proof of concept rather than a final product.

ArcGIS is useful for gaining insight into data, but it is not well suited to building a good UI, as this project showed. Many of the computations have to be sent to an API, which greatly hinders performance. This was evident in our struggles with the on-click map: it could return the clicked geometry, but small feature layers were converted lossily when sent through the API, so the on-click workflow could not work. We had to manually create the feature layer for a single point.

One possible extension is to take all of the tourist location data we scraped and use the Google API to build a cost-distance matrix, which could return distances instantly after the initial computation. However, this costs money and would have been another API to wrangle.
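To make the precomputation idea concrete, here is a rough sketch that builds the matrix once and then answers lookups from memory. It is an assumption-laden stand-in: it uses straight-line distances (an equirectangular approximation) instead of walking distances from a routing API, the listing coordinates are made up, the attraction coordinates are approximate, and all function names are hypothetical.

```python
import math


def approx_distance_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation in metres; adequate at city scale."""
    r = 6371000.0  # mean Earth radius in metres
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return r * math.hypot(x, y)


def build_distance_matrix(listings, attractions):
    """Precompute distances once: {listing_id: {attraction_name: metres}}."""
    return {
        listing_id: {
            name: approx_distance_m(l_lat, l_lon, a_lat, a_lon)
            for name, (a_lat, a_lon) in attractions.items()
        }
        for listing_id, (l_lat, l_lon) in listings.items()
    }


def attractions_within(matrix, listing_id, max_m):
    """Return attractions within max_m metres of a listing, nearest first."""
    row = matrix[listing_id]
    return sorted((name for name, d in row.items() if d <= max_m), key=row.get)


# Approximate attraction coordinates and made-up listing coordinates (lat, lon).
attractions = {
    "Eiffel Tower": (48.8584, 2.2945),
    "Louvre": (48.8606, 2.3376),
    "Notre-Dame": (48.8530, 2.3499),
}
listings = {"A": (48.8600, 2.3400), "B": (48.8570, 2.2990)}

matrix = build_distance_matrix(listings, attractions)  # the slow step, done once
print(attractions_within(matrix, "A", 1500))
print(attractions_within(matrix, "B", 500))
```

Swapping `approx_distance_m` for walking distances from a paid routing service would change only how the matrix is filled, not the lookup.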