# 1. Introduction

In this project, we analyze data fetched from Yelp. Yelp is an app that collects information about restaurants and other businesses, and its recommendation system picks the best options for customers based on past reviews. Our dataset contains over 1 million reviews and information on over 50,000 businesses. In this project, we focus only on **Chinese restaurants**.

We have two main goals. First, we want to provide practical advice for business owners. On the one hand, we give suggestions based on business attributes: we extracted the attributes, ranked them by XGBoost feature importance, and used ANOVA to test whether the top-ranked attributes affect a business's average stars. On the other hand, we analyzed customer reviews: we vectorized the preprocessed words and grouped them into 5 features (sanitation, food, waiting time, service, price) by cosine similarity. We then counted the frequency of each group in each review and normalized the counts. Finally, we used Earth Mover's Distance to measure how these features affect the dining experience. We randomly picked some Chinese restaurants and provide suggestions for them based on the analysis above.

Second, we want to predict a review's star rating from its text. We tried several methods and finally chose an LSTM model. It performed well, with a final RMSE of 0.59.

# 2. Data Cleaning

We extracted Chinese restaurants by searching for the word 'Chinese' in the 'categories' column of the business data. This left 3557 businesses, along with 209897 reviews.

### 2.1 Attributes

We split the categories on commas and then picked out the Chinese restaurants. For the other attributes, we mapped all missing values, such as 'N/A', 'None', and empty strings, to one distinct value, 'None'.
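
The two cleaning steps above can be sketched in a few lines. This is only a minimal illustration, not the project's actual code: the sample records are made up, the helper names are hypothetical, and matching 'Chinese' as an exact label after splitting is an assumption about how the search was done.

```python
# Hypothetical records; the field names mirror the text ('categories',
# attributes), but the values are invented for illustration.
MISSING_MARKERS = {"", "n/a", "none"}

def split_categories(raw):
    """Split a comma-separated category string into clean labels."""
    if not raw:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]

def is_chinese_restaurant(business):
    """True if 'Chinese' is one of the business's category labels."""
    return "Chinese" in split_categories(business.get("categories"))

def normalize_missing(value):
    """Map every missing-value spelling to the single category 'None'."""
    if value is None or str(value).strip().lower() in MISSING_MARKERS:
        return "None"
    return value

businesses = [
    {"name": "A", "categories": "Chinese, Restaurants",
     "attributes": {"WiFi": "free", "HasTV": "N/A"}},
    {"name": "B", "categories": "Pizza, Restaurants",
     "attributes": {"WiFi": None}},
    {"name": "C", "categories": "Restaurants,Chinese",
     "attributes": {"WiFi": "", "HasTV": "True"}},
]

# Keep only Chinese restaurants, then unify their missing attribute values.
chinese = [b for b in businesses if is_chinese_restaurant(b)]
cleaned = [
    {key: normalize_missing(val) for key, val in b["attributes"].items()}
    for b in chinese
]
```

Here `chinese` keeps businesses A and C, and every missing spelling in their attributes becomes the string 'None'.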

### 2.2 Reviews

We preprocessed the words in the following steps:
1. Remove reviews that contain non-English characters
2. Split the text into words on whitespace and punctuation
3. Convert upper case to lower case
4. Expand common contractions, e.g. wouldn't → would, not
5. Remove punctuation
6. Delete stop words, but keep some words such as: not
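
The steps above could look roughly like the following sketch. The contraction table and stop-word list are tiny placeholders (the lists actually used are not shown in this write-up), and treating "non-English" as "non-ASCII" is an assumption made for illustration.

```python
import re

# Small illustrative tables; these entries are assumptions, not the
# project's real lists.
CONTRACTIONS = {
    "wouldn't": ["would", "not"],
    "don't": ["do", "not"],
    "isn't": ["is", "not"],
}
STOP_WORDS = {"the", "a", "is", "was", "would", "do", "i", "it", "not"}
KEEP = {"not"}  # negations carry sentiment, so they survive step 6

def is_english(text):
    """Crude filter: treat a review as English if it is pure ASCII."""
    return text.isascii()

def preprocess(text):
    """Return the cleaned token list for one review, or None if dropped."""
    if not is_english(text):                          # step 1
        return None
    # Step 2: keep runs of letters and apostrophes, so contractions stay
    # in one piece; digits and other symbols are dropped here.
    tokens = re.findall(r"[A-Za-z']+", text)
    tokens = [t.lower() for t in tokens]              # step 3
    expanded = []
    for tok in tokens:                                # step 4
        expanded.extend(CONTRACTIONS.get(tok, [tok]))
    expanded = [t.replace("'", "") for t in expanded] # step 5
    return [t for t in expanded                       # step 6
            if t and (t in KEEP or t not in STOP_WORDS)]

example = preprocess("I wouldn't say the food was GREAT!")
```

In this example the contraction is expanded before stop words are removed, so the negation "not" is preserved in the output.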

# 3. Preliminary Analysis

### 3.1 Visualization of Attributes

After data cleaning, we first looked into the working hours of the restaurants. However, restaurants with different ratings appear to have similar average working hours per week. We then drew barplots of average stars across the categories of specific attributes. Some interesting patterns emerged: for example, the average rating of restaurants serving dinner tends to be high while that of restaurants serving breakfast tends to be low, and noise level also influences the ratings.

The 'NoiseLevel' example: <img src="NoiseLevel.png", width=300>

### 3.2 Visualization of Reviews

For the review texts, we first drew some word clouds (words are ordered by frequency of appearance):
<img src="Wordcloud.png", width = 700> 
Words such as "food", "service", and "time" were frequently mentioned in the reviews, so we assumed that these aspects have a strong impact on stars.

Second, we calculated the word frequencies of various kinds of food for each star level and visualized them with barplots. It turned out that beef, shrimp, crab and eggplant are the most popular ingredients and spicy is the most popular flavor. This also suggests that food is an important feature associated with ratings.

# 4. Suggestions for Business

### 4.1 Attributes Analysis

#### Missing Data

We believe that missing data also carries information. Generally, a missing value means the restaurant does not have the corresponding equipment or service. For example, if a restaurant provides no information about WiFi, it probably does not offer accessible WiFi. It is also possible that the owner simply forgot to provide the information. Even then, the feature is unlikely to be an advantage of the restaurant; otherwise, the owner would certainly advertise it to attract customers. Finally, the missing data may be caused by the Yelp database itself, but we do not think this is the main cause, and since we cannot distinguish such cases from the others, we ignore this possibility. Our final decision was to encode missing values as a separate category called 'None'.

#### XGBoost

After preprocessing the attributes, we used XGBoost to rank feature importance. XGBoost is a tree-based boosting method that can do classification or regression. Intuitively, at each tree node XGBoost finds, for each feature, the best split rule with respect to the regularized objective function, and then compares all candidate features to pick the split with the lowest objective. The tree keeps growing until the improvement in the objective falls below a threshold. This process is repeated several times, with each new tree fitted to correct the errors of the trees built so far. The final prediction combines the leaf values of all these trees (usually by summing them).

From the algorithm above, we can see that XGBoost tends to split first on the most informative features. For features that are highly correlated with an already selected feature, XGBoost may not choose them again. This gives us a way to rank the importance of the attributes.

The following barplot shows the importance levels of the attributes: <img src="FeatureImportance.png", width=600>
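
To make the split search described above more concrete, here is a simplified sketch of how one split can be scored with the regularized gain from the XGBoost paper, assuming squared-error loss (gradient = prediction minus target, hessian = 1) and one-vs-rest splits on categorical attributes. The function names, toy data, and default `lam` and `gamma` values are hypothetical, and the real library uses far more efficient split-finding algorithms.

```python
def split_gain(grad_left, hess_left, grad_right, hess_right,
               lam=1.0, gamma=0.0):
    """Regularized gain of splitting a node into a left and right child."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(grad_left + grad_right, hess_left + hess_right)
    children = score(grad_left, hess_left) + score(grad_right, hess_right)
    return 0.5 * (children - parent) - gamma

def best_split(rows, targets, preds, lam=1.0, gamma=0.0):
    """Try every 'feature == value' split and return the best one found.

    Returns (gain, feature, value); feature is None if no split helps.
    """
    grads = [p - y for p, y in zip(preds, targets)]  # squared-error gradients
    best = (0.0, None, None)
    for feature in rows[0]:
        for value in sorted({row[feature] for row in rows}):
            left = [g for g, r in zip(grads, rows) if r[feature] == value]
            right = [g for g, r in zip(grads, rows) if r[feature] != value]
            if not left or not right:
                continue
            # With squared error each hessian is 1, so the sum is the count.
            gain = split_gain(sum(left), len(left),
                              sum(right), len(right), lam, gamma)
            if gain > best[0]:
                best = (gain, feature, value)
    return best

# Made-up toy data: ratings depend on noise level, not on having a TV.
rows = [
    {"NoiseLevel": "quiet", "HasTV": "True"},
    {"NoiseLevel": "quiet", "HasTV": "False"},
    {"NoiseLevel": "loud", "HasTV": "True"},
    {"NoiseLevel": "loud", "HasTV": "False"},
]
stars = [4.5, 4.0, 2.5, 2.0]
preds = [3.25] * 4  # start from the mean rating
gain, feature, value = best_split(rows, stars, preds)
```

On this toy data the search selects a split on `NoiseLevel`, since it separates high and low ratings far better than `HasTV`.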

#### ANOVA

We then applied ANOVA to the selected 88 attributes. The p-values, feature importance levels, and non-missing sample sizes of the attributes are recorded in the file *'Tables/Attributes_ANOVA.csv'* in the GitHub repository. Only 5 attributes (**'Noise Level', 'Caters', 'HasTV', 'Restaurants Reservations' and 'Outdoor Seating'**) passed both cutoffs: an importance level greater than or equal to 50 and an ANOVA p-value less than or equal to the Bonferroni-corrected threshold 0.05/88 ≈ 0.00057. We therefore focus mainly on these five attributes in the suggestion part.
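
As a rough illustration of this selection rule, the sketch below computes a one-way ANOVA F statistic by hand and then applies the two cutoffs. The ratings, importance levels, and p-values in it are invented, and the function names are hypothetical; turning an F statistic into a p-value needs the F distribution, which a statistics library (for example `scipy.stats.f_oneway`) would normally provide.

```python
def one_way_f(groups):
    """F statistic of a one-way ANOVA over several groups of ratings."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    k, n = len(groups), len(values)
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def select_attributes(table, n_tests, alpha=0.05, min_importance=50):
    """Keep attributes passing both the importance and p-value cutoffs."""
    cutoff = alpha / n_tests  # Bonferroni-corrected significance level
    return [name for name, (importance, p_value) in table.items()
            if importance >= min_importance and p_value <= cutoff]

# Made-up average stars for two categories of one attribute.
f_stat = one_way_f([[4.0, 4.5, 5.0], [2.0, 2.5, 3.0]])

# Made-up (importance, p-value) pairs, for illustration only.
table = {
    "NoiseLevel": (120, 1e-8),   # passes both cutoffs
    "WiFi": (80, 0.01),          # important, but p-value too large
    "BikeParking": (10, 1e-6),   # significant, but importance too low
}
selected = select_attributes(table, n_tests=88)
```

With 88 tests the cutoff is 0.05/88, so only attributes that are both important and clearly significant remain.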

### 4.2 Review Analysis

In this part, we use the word2vec model to vectorize the words; more specifically, we use the Skip-Gram model. To introduce Skip-Gram, we first need to define the target word and its context. For each target word, we define the previous 5 and following 5 words as its context, i.e. window_size = 5. We then train Skip-Gram with the target word as input and the context words as output. Generally speaking, Skip-Gram predicts the context words from the target word, but here we use it to find similar words.
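
The notion of target and context words can be illustrated by generating the (target, context) training pairs for one tokenized review. This is a minimal sketch with a hypothetical helper name; it shows only how the window defines the pairs, not the training of the model itself.

```python
def skipgram_pairs(tokens, window_size=5):
    """List the (target, context) pairs that Skip-Gram is trained on."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        stop = min(len(tokens), i + window_size + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# A window of 1 keeps the example short; the project uses window_size = 5.
pairs = skipgram_pairs(["food", "was", "not", "great"], window_size=1)
```

Words that appear in similar contexts produce similar training pairs, which is why their learned vectors end up close to each other.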

We extract the word vectors and then use cosine similarity to find related words. The cosine similarity is defined as:
$$\frac{x \cdot y}{\Vert x\Vert\,\Vert y\Vert},$$
where $x$ and $y$ are the vectors of two different words. After reading a number of reviews and searching for background information, we decided to analyze the following five aspects: sanitation, food, waiting time, service, and price. We first picked some seed words for each aspect and used cosine similarity to find similar words. We then manually checked the selected words and repeated the process several times. The resulting word lists can be found in the data folder.
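
A small sketch of the similarity search may help here. The two-dimensional vectors below are invented for illustration (real word2vec vectors have many more dimensions), and `most_similar` is a hypothetical helper rather than the function used in the project.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def most_similar(seed, vectors, top_n=2):
    """Rank all other words by cosine similarity to the seed word."""
    scores = [(word, cosine_similarity(vectors[seed], vec))
              for word, vec in vectors.items() if word != seed]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Made-up toy vectors.
vectors = {
    "service": [1.0, 0.0],
    "waiter": [0.9, 0.1],
    "staff": [0.8, 0.3],
    "noodle": [0.0, 1.0],
}
neighbours = most_similar("service", vectors)
```

Starting from a seed word such as "service", the top-ranked neighbours are candidates for that aspect's word list, which are then checked by hand.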

We then count how often each aspect's bag of words appears in each review and normalize the count by the review's length in words. We group these frequencies by star rating, defining 1 or 2 stars as negative and 4 or 5 stars as positive. We then use Earth Mover’s Distance (EMD) to measure the difference between the frequency distributions of positive and negative reviews. We did not use KL divergence because it can 'explode' in some situations (consider the KL divergence between N(0, $\varepsilon^3$) and N($\varepsilon$, $\varepsilon^3$) as $\varepsilon\to 0$). It is also not a distance in the strict sense, because it is not symmetric. Earth Mover’s Distance remedies both problems. See more details [here](https://vincentherrmann.github.io/blog/wasserstein/). Finally, we sort the 5 features by EMD and determine whether each one is an advantage or a disadvantage, and provide related suggestions accordingly.
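
The following sketch illustrates the three ingredients of this paragraph under stated assumptions: the aspect frequency is taken as the share of a review's tokens that belong to the aspect's word list, the EMD is computed for one-dimensional empirical samples as the area between their cumulative distribution functions, and the second parameter of N(·, ·) is read as a variance. All function names and sample values are hypothetical.

```python
def aspect_frequency(tokens, aspect_words):
    """Share of a review's tokens that belong to one aspect's word list."""
    if not tokens:
        return 0.0
    return sum(tok in aspect_words for tok in tokens) / len(tokens)

def emd_1d(sample_a, sample_b):
    """1-D Earth Mover's Distance: integral of |F_a(x) - F_b(x)| dx."""
    points = sorted(set(sample_a) | set(sample_b))
    distance = 0.0
    for left, right in zip(points, points[1:]):
        # Both empirical CDFs are constant on the interval [left, right).
        cdf_a = sum(v <= left for v in sample_a) / len(sample_a)
        cdf_b = sum(v <= left for v in sample_b) / len(sample_b)
        distance += abs(cdf_a - cdf_b) * (right - left)
    return distance

def kl_equal_variance_gaussians(mean_gap, variance):
    """KL divergence between two Gaussians sharing the same variance."""
    return mean_gap ** 2 / (2 * variance)

freq = aspect_frequency(["slow", "service", "not", "great"],
                        {"service", "waiter"})

# Made-up 'service' frequencies in positive and negative reviews.
positive = [0.0, 0.2]
negative = [0.1, 0.3]
gap = emd_1d(positive, negative)

# With eps = 0.01 the means are only 0.01 apart, yet KL is already 50.
eps = 0.01
kl = kl_equal_variance_gaussians(eps, eps ** 3)
```

In the Gaussian example the two distributions move closer together as ε shrinks, while the KL divergence keeps growing; a transport-based distance between them shrinks instead, which matches the motivation given above.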

# 5. Example Illustration

# 6. Predict Review Stars

# Notebook Contribution
Lijie Liu: Introduction, Preprocess, Missing Data, XGBoost  
Ning Shen: ANOVA, Preliminary Analysis  
Xiangan Zhang: Prediction

# Code Contribution
Lijie Liu: Word Preprocess, Word Similarity, slides  
Ning Shen: ANOVA, Preliminary Analysis, slides  
Xiangan Zhang: LSTM prediction, XGBoost, EMD, slides