# Exam Project 

####  42578 Advanced Business Analytics

**Group members:** Toke Holst Møller (s172827), Johanne Juul Iversen (s174308), Ann-Katrine Christiansen (s184287) and Jesper Hauch (s174287)


## Introduction

Have you ever been urban travelling? Then you have likely perused Google Maps, Yelp, Tripadvisor and the like for good recommendations and tips on where to go, where to eat and what to do. You have probably spent hours agonizing over one-star reviews that contradict what you thought you knew, while skipping over the five-star reviews that offer nothing but praise. If that sounds like you, a solution is on its way. Imagine this: in your hometown you know exactly what you like and where to go. What if you could get exciting areas at your destination recommended, with the vibe you know from home? That is the goal of the Urban Travellers Guide&trade;.

In this project, the aim is to develop a guide that can help urban travellers find the perfect area for their next holiday, by comparing their favorite neighbourhoods to areas at their next destination. This project will act as a proof-of-concept considering only Manhattan in NYC and London. The guide will be based on the individual neighbourhoods' establishments and their reviews.

The project uses the [Google Local](https://cseweb.ucsd.edu/~jmcauley/datasets.html#google_local) data published by Professor Julian McAuley from the University of California, San Diego. The data contains reviews of businesses along with their geographical location. The raw data has 3,116,785 businesses and 11,453,845 reviews, spanning locations all over the world.

As the aim was to focus solely on businesses located in London and Manhattan in NYC, the data was partitioned during preprocessing. Initial data exploration is then performed in the descriptive statistics notebook, which also elaborates on the ideas for the direction of the guide.

The guide was developed using a space embedding model consisting of a neural network with a single layer, where the weights represent the embeddings of a neighbourhood. Since some features have missing entries, such as the price range of an establishment, an NLP classification model was applied to impute the missing values based on the reviews of the establishment. The recommender system was then built on the space embeddings, which characterize the different areas of the cities. It takes the user in question and their reviews of areas in their hometown, bundles these per area and uses cosine similarity to return other areas as recommendations.
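The recommendation step described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes that "bundling" means averaging the embeddings of the areas a user likes, and the names `bundle`, `recommend` and the three-dimensional vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (norm_a * norm_b)

def bundle(embeddings):
    """Average several area embeddings into one profile (an assumption)."""
    n = len(embeddings)
    return [sum(component) / n for component in zip(*embeddings)]

def recommend(home_embeddings, destination_embeddings, top_k=2):
    """Rank destination areas by cosine similarity to the user's profile."""
    profile = bundle(home_embeddings)
    scored = [
        (area, cosine_similarity(profile, embedding))
        for area, embedding in destination_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical embeddings of two areas the user likes at home
home = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
# Hypothetical embeddings of areas at the destination
destination = {
    "cell_a": [0.8, 0.2, 0.0],
    "cell_b": [0.0, 0.1, 0.9],
    "cell_c": [0.5, 0.5, 0.1],
}
top = recommend(home, destination)
```

Here `top` lists the two destination areas whose embeddings point in the direction most similar to the user's averaged home profile.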

Finally, the model was pushed to a production environment in collaboration with 2021.ai and their product Grace.

## How to read this report

The report consists of four main notebooks besides this README notebook. They should be read in the order below. You can simply click on a link to open the notebook if your working directory is set to the source file location.

1. [Descriptive Statistics](./DescriptiveStatistics.ipynb)
2. [NLP Classification](./NLPClassification_withSHAP.ipynb)
3. [Space Embeddings](./SpaceEmbeddings.ipynb)
4. [Recommender System](./RecommenderSystem.ipynb)

Besides the above four, we have used the following notebooks for downloading and preprocessing the data. These are available should they be of interest to the reader.

5. [Download Data](./DataDownload.ipynb)
6. [Preprocessing of Data](./DataProcessing.ipynb)


## Discussion

By going through the notebooks above, one should have gained a thorough understanding of the POC that is the Urban Travellers Guide. While the project is a POC, there are still several points that could have altered the outcome of the guide.

First, several data points had wrong coordinates, placing central establishments outside of the cities. This is a general issue with Google data, as users register their businesses themselves, which introduces an inherent level of error. This raises questions about the data quality in general. These data issues also affect the price levels of businesses. It is mostly restaurants that have a price attached to them, which raises the question of whether prices can be predicted at all for other businesses. After all, is a clothing store comparable to a restaurant? Using the review sentiment should alleviate some of these issues, but the most important word features are restaurant-related. This biases the data towards restaurants and neglects the facets of other businesses. The bias could be reduced by finding actual price points for other types of establishments, but that would require an intensive preprocessing step.
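One simple way to surface misplaced coordinates is a bounding-box check. The sketch below is only an illustration: the box limits are rough values chosen for the example, and the field names `lat` and `lon` are assumptions, not the project's actual schema.

```python
# Rough, illustrative bounding boxes as (min_lat, max_lat, min_lon, max_lon).
# These limits are assumptions for the sketch, not the project's actual values.
BOUNDING_BOXES = {
    "manhattan": (40.68, 40.88, -74.03, -73.90),
    "london": (51.28, 51.70, -0.52, 0.34),
}

def in_bounding_box(lat, lon, box):
    """Return True if the coordinate lies inside the box."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def split_by_location(businesses, city):
    """Separate records inside the city's box from records to inspect."""
    box = BOUNDING_BOXES[city]
    inside, outside = [], []
    for business in businesses:
        if in_bounding_box(business["lat"], business["lon"], box):
            inside.append(business)
        else:
            outside.append(business)
    return inside, outside

# Hypothetical records expected to be in Manhattan
businesses = [
    {"name": "Midtown diner", "lat": 40.7549, "lon": -73.9840},
    {"name": "Misplaced cafe", "lat": 0.0, "lon": 0.0},
]
kept, flagged = split_by_location(businesses, "manhattan")
```

Records in `flagged` could then be inspected or corrected by hand rather than silently dropped.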

When looking at the data the NLP model uses to predict prices, another point stands out. Some of the highly rated businesses have only a few reviews, e.g. an average rating of five stars based on two reviews. This sample size is far too small, and one could consider not predicting the price for these businesses. The predictor is more prone to errors with so few reviews, as a single very positive review could lead to the business being predicted as expensive. Another important facet of the data is that London is generally cheaper than New York. This was not reflected in the prediction model; the city could be added as a feature to ensure that the general price range of the city is reflected in the predictions.
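Both suggestions, skipping businesses with too few reviews and adding the city as a feature, could look roughly like this. The threshold `MIN_REVIEWS`, the record layout and the feature names are hypothetical; this is a sketch of the idea, not the model used in the notebooks.

```python
MIN_REVIEWS = 5  # hypothetical threshold; would need tuning on real data

def has_enough_reviews(business, min_reviews=MIN_REVIEWS):
    """Only predict a price when the review sample is reasonably large."""
    return len(business["reviews"]) >= min_reviews

def build_features(business):
    """Combine the review text with one-hot indicators for the city."""
    return {
        "text": " ".join(business["reviews"]),
        "city_london": int(business["city"] == "london"),
        "city_manhattan": int(business["city"] == "manhattan"),
    }

# Hypothetical records
businesses = [
    {"name": "Two-review bistro", "city": "london",
     "reviews": ["Amazing!", "Loved it"]},
    {"name": "Busy deli", "city": "manhattan",
     "reviews": ["Good", "Cheap and quick", "Fine", "Decent bagels", "Ok", "Solid"]},
]
features = [build_features(b) for b in businesses if has_enough_reviews(b)]
```

Only the second business passes the threshold, and its feature dictionary carries the city indicator alongside the text.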

Finally, for the NLP model, the text went through extensive cleaning, removing numbers and the like. Some things were left over, though, such as a few reviews in the Cyrillic alphabet and whole reviews in Chinese, Japanese or Korean. These should also have been removed, but as the data processing time was already so extensive, this was not done. Generally, the goal was to reduce the size of the bag of words. Before the first reduction, the number of "words" was about half the size of the English dictionary. This was reduced significantly, which sped up training of the model.
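Removing reviews written in non-Latin scripts does not have to be expensive. The sketch below uses only the standard library and keeps a review when most of its letters are Latin; the 0.8 threshold and the function names are assumptions made for illustration.

```python
import unicodedata

def latin_fraction(text):
    """Share of alphabetic characters whose Unicode name starts with LATIN."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0  # no letters at all, e.g. only digits or punctuation
    latin = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN")
    )
    return latin / len(letters)

def keep_latin_reviews(reviews, threshold=0.8):
    """Keep reviews that are written mostly in the Latin alphabet."""
    return [review for review in reviews if latin_fraction(review) >= threshold]

reviews = [
    "Great coffee and friendly staff",
    "Очень вкусно",
    "とても美味しい",
    "Café très sympa",
]
kept = keep_latin_reviews(reviews)
```

Accented Latin letters still count as Latin, so reviews in, for example, French are kept, while Cyrillic and CJK reviews are dropped. Note that this filters by script, not by language.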

From the NLP model, the data was prepared for the space embeddings. As the space embeddings are defined from the grids, one should consider that the grid structure is somewhat rigid. Cities are fluid in that they expand continuously over time according to the desires of the people who live in them, and thus coherent areas such as a neighborhood are almost never perfectly square. An area might stretch across several grids, and while these grids should in theory get similar embeddings, this is not always the case. Using Manhattan as an example, a grid that overlaps the border between Turtle Bay (posh and expensive) and Hell's Kitchen (somewhat gentrified but still working class) could give some confusing recommendations to travellers.
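The rigidity of a grid can be seen in a small sketch of how a coordinate is mapped to a cell. The origin and cell size below are hypothetical and do not reflect the grid used in the Space Embeddings notebook.

```python
import math

def grid_cell(lat, lon, origin_lat, origin_lon, cell_size_deg):
    """Map a coordinate to the (row, col) index of a square grid cell."""
    row = math.floor((lat - origin_lat) / cell_size_deg)
    col = math.floor((lon - origin_lon) / cell_size_deg)
    return row, col

# Hypothetical grid: south-west corner and cell size in degrees
ORIGIN_LAT, ORIGIN_LON = 40.70, -74.02
CELL_SIZE = 0.01

# Two establishments roughly 100 metres apart on the same street
cell_south = grid_cell(40.7595, -73.9840, ORIGIN_LAT, ORIGIN_LON, CELL_SIZE)
cell_north = grid_cell(40.7605, -73.9840, ORIGIN_LAT, ORIGIN_LON, CELL_SIZE)
```

The two points end up in different cells even though they are close neighbours, because a cell boundary happens to run between them.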

As mentioned above, how the cities are partitioned is paramount to the model. This also became clear when testing the model. Since New York is only represented by Manhattan while London includes some of the suburbs, there are some inconsistencies in how recommendations are given. Manhattan is extremely dense compared to the greater London area, and in further analysis it would be sensible to compare similar city structures, however hard that might be. This would affect grid size and bring another heap of issues, as no two cities are completely alike, but it is definitely worth exploring.

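Several open questions remain about the model. It may be overfitting the data, and it is hard to validate. It is also unclear whether the right attributes are used and whether more data, for example demographic data, is needed.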

Discrete 

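For the recommender system, it is an open question whether cosine similarity is the right choice or whether different metrics should be used. In testing, recommendations from London to New York were good, while recommendations from New York to London were bad.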
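How the choice of metric changes the result can be illustrated with a small comparison. The vectors are hypothetical, and Euclidean distance is just one possible alternative; it was not evaluated in the project.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings with the same direction but different magnitude
area_small = [1.0, 2.0]
area_large = [2.0, 4.0]

cos = cosine_similarity(area_small, area_large)
dist = euclidean_distance(area_small, area_large)
```

Cosine similarity treats the two areas as practically identical because it only compares direction, while Euclidean distance separates them because it also reacts to magnitude. Whether magnitude carries useful information depends on how the embeddings are learned.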

As mentioned earlier, the rigidity of the grids is a problem. This could be alleviated, e.g., by using shapefiles for the different neighborhoods and computing space embeddings for these areas. This would be an interesting extension of the model that would make it more user-friendly and likely more accurate in its recommendations as well.
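Assigning establishments to neighborhood outlines instead of grid cells comes down to a point-in-polygon test. The sketch below uses a plain ray-casting test with made-up outlines; in practice the polygons would be read from shapefiles with a geospatial library, which is not shown here.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def assign_neighbourhood(lon, lat, neighbourhoods):
    """Return the name of the first outline containing the point, else None."""
    for name, polygon in neighbourhoods.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None

# Hypothetical, simplified outlines that share a slanted border
neighbourhoods = {
    "area_west": [(0.0, 0.0), (2.0, 0.0), (3.0, 2.0), (0.0, 2.0)],
    "area_east": [(2.0, 0.0), (5.0, 0.0), (5.0, 2.0), (3.0, 2.0)],
}
west = assign_neighbourhood(1.0, 1.0, neighbourhoods)
east = assign_neighbourhood(4.0, 1.0, neighbourhoods)
nowhere = assign_neighbourhood(6.0, 1.0, neighbourhoods)
```

Because the outlines follow the actual border, an establishment is grouped with its own neighborhood regardless of where a grid line would have fallen.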

To test the model, a set of personas was used. This qualitative testing is a way to test the recommender system, which is not possible to test quantitatively. The personas are somewhat extreme. They have strong tastes, which raises the question: are they representative? The model would likely provide weaker recommendations to the average person, who has less strong opinions on where they are going, but then again that can be argued to be the point of the model. It tries to mirror who you are as a consumer and traveller. This is not inherently good or bad, but it would have to be addressed should the guide be developed further, as it would then be used by average people.

## Conclusion

Generally, this project has been a very exciting endeavour. Though there are shortcomings, it produced promising results, and as a POC it could be tweaked in several ways to perform differently. The recommender system was tested qualitatively through the personas, which produced great results for Londoners going to New York but not so much vice versa.

## Individual contribution

|Member  | Contribution | 
|:--- |:--- |
| Toke (s172827)| x, y, z | 
| Johanne (s174308)| x, y, z|
| Ann-Katrine (s184287) | x, y, z |
|Jesper (s174287)|x, y, z |

