# Nightlife in Las Vegas

#### Junxia Zhu, Yiqun Jiang, Yingjing Jiang

## 1. Introduction

In this project, we analysed Yelp data focusing on nightlife and bars in Las Vegas. We aimed to explore which factors influence review ratings and, furthermore, to give owners advice on improving their ratings. Our work is divided into two main parts: keyword analysis of reviews and feature extraction from business attributes. For the first part, we used Python for natural language processing to find meaningful keywords. For the second, we applied the random forest algorithm and ANOVA. We interpreted the factors and made suggestions based on the outcomes. In addition, we constructed prediction models to achieve a "bonus" goal of predicting ratings.

## 2. Background

The Yelp data includes four JSON files: review_train.json, review_test.json, business_train.json and business_test.json, which contain 5364626 reviews, 1321274 reviews, 154606 businesses and 38000 businesses, respectively.

## 3. Goal1--Analysis

### 3.1 Data Filtering 

For the analysis part, we focused on the two training files, which contain 5364626 reviews for 154606 businesses. After filtering by "bars", "nightlife" and "Las Vegas" to fit our topic, the data used in this part includes about 265847 reviews for 1201 bars.
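The filtering step can be sketched as follows. This is a minimal illustration, not the exact code we ran: the field names `business_id`, `city` and `categories` are assumed, the sample records are made up, and a business is assumed to qualify when it is in Las Vegas and has either of the two category labels.

```python
def category_set(business):
    """Return the business's categories as a set of lowercase strings."""
    raw = business.get("categories") or []
    if isinstance(raw, str):  # assumption: categories may be a comma-separated string
        raw = raw.split(",")
    return {c.strip().lower() for c in raw}


def is_las_vegas_bar(business):
    """True if the business is in Las Vegas and is labelled as a bar or nightlife venue."""
    city = (business.get("city") or "").strip().lower()
    return city == "las vegas" and bool(category_set(business) & {"bars", "nightlife"})


def filter_reviews(reviews, businesses):
    """Keep only the reviews that belong to the selected businesses."""
    keep_ids = {b["business_id"] for b in businesses if is_las_vegas_bar(b)}
    return [r for r in reviews if r["business_id"] in keep_ids]


# Made-up records for illustration only.
businesses = [
    {"business_id": "b1", "city": "Las Vegas", "categories": "Bars, Nightlife"},
    {"business_id": "b2", "city": "Las Vegas", "categories": "Dentists"},
    {"business_id": "b3", "city": "Phoenix", "categories": ["Bars"]},
]
reviews = [
    {"business_id": "b1", "stars": 5, "text": "Great cocktails."},
    {"business_id": "b2", "stars": 4, "text": "Friendly staff."},
    {"business_id": "b3", "stars": 3, "text": "Loud music."},
]
bar_reviews = filter_reviews(reviews, businesses)
```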

### 3.2 Business Analysis

#### 3.2.1 Data Cleaning

1. Business attributes contain nested dictionaries, so we first extracted all nested dictionaries into a new attribute list.   
2. We then calculated the average star rating of all reviews for each business and used it as our response variable to examine the relationship between ratings and attributes.   
3. Many attributes have missing values, so we marked the blanks as "unknown", which is treated as a new level.  
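The three cleaning steps above can be sketched in plain Python. The helper names (`flatten_attributes`, `average_stars`, `fill_unknown`), the `Ambience` attribute and the sample records are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from statistics import mean


def flatten_attributes(attributes, prefix=""):
    """Turn nested attribute dictionaries into one flat dictionary."""
    flat = {}
    for key, value in (attributes or {}).items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_attributes(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def average_stars(reviews):
    """Average star rating of all reviews for each business."""
    by_business = defaultdict(list)
    for review in reviews:
        by_business[review["business_id"]].append(review["stars"])
    return {business: mean(stars) for business, stars in by_business.items()}


def fill_unknown(rows, columns):
    """Replace missing or blank values with the extra level "unknown"."""
    return [
        {c: ("unknown" if row.get(c) in (None, "") else row[c]) for c in columns}
        for row in rows
    ]


# Made-up records for illustration only.
businesses = [
    {"business_id": "b1", "attributes": {"NoiseLevel": "loud", "Ambience": {"casual": True}}},
    {"business_id": "b2", "attributes": {"RestaurantPriceRange": "2"}},
]
reviews = [
    {"business_id": "b1", "stars": 5},
    {"business_id": "b1", "stars": 3},
    {"business_id": "b2", "stars": 4},
]
flat_rows = [flatten_attributes(b["attributes"]) for b in businesses]
columns = sorted({name for row in flat_rows for name in row})
clean_rows = fill_unknown(flat_rows, columns)
response = average_stars(reviews)
```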

#### 3.2.2 Important Attributes Analysis

Here we used a random forest to compute variable importance scores and select useful attributes. "NoiseLevel" and "RestaurantPriceRange" have the highest importance, which means these two attributes are strongly related to ratings. We ran a one-way ANOVA for each attribute; the models and outcomes are shown below:

| Model | P-value | Result |
|:---------:| :-------------------------: |:----------:|
| stars~NoiseLevel |  4.36e-8  | reject H_0 |
| stars~RestaurantPriceRange |  2.44e-3  | reject H_0 |
 
According to the results, we reject H_0, the hypothesis that there is no difference in ratings between the levels of an attribute. This means that ratings do differ across the levels of NoiseLevel and across the levels of RestaurantPriceRange.
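For readers unfamiliar with the test, the one-way ANOVA F statistic can be computed by hand as sketched below. The ratings are made-up numbers, not our data, and the function name is hypothetical. The p-value is the upper tail of the F distribution with the returned degrees of freedom, which a statistics library can supply (for example `scipy.stats.f.sf`).

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict {level: [values]}."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (v - mean(values)) ** 2 for values in groups.values() for v in values
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical average star ratings grouped by NoiseLevel (made-up numbers).
stars_by_noise = {
    "quiet": [4.0, 4.5, 4.0],
    "average": [3.5, 4.0, 3.5],
    "loud": [3.0, 2.5, 3.0],
    "unknown": [3.5, 3.0, 4.0],
}
f_stat, df_between, df_within = one_way_anova_f(stars_by_noise)
```

A large F statistic relative to the F distribution leads to rejecting H_0, as in the table above.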

We also wanted to check the interaction between these two attributes, so we constructed a full model with an interaction term. The outcome is shown below:


| Term | P-value |
|:---------:| :-------------------------: |
| NoiseLevel |  3.02e-8  |
| RestaurantPriceRange |  3.50e-3  |
| NoiseLevel*RestaurantPriceRange |  3.58e-2  |

From the table above, all terms in the full model are significant, so in addition to NoiseLevel and RestaurantPriceRange, their interaction is also related to ratings. We also compared the full model with all reduced models and found that it had the lowest RSS, which supports it as the best model. This result also supports the reliability of the two-way ANOVA outcome.
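The RSS comparison can be illustrated with a small sketch. For categorical predictors, the intercept-only model, each single-factor model and the full model with interaction all predict a row by the mean of its cell, so their RSS can be computed directly. The additive two-factor model is left out here because it needs a least-squares fit. The data and the helper name `rss_by_cells` are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def rss_by_cells(rows, key):
    """RSS of a model that predicts each row by the mean of its cell; key(row) names the cell."""
    cells = defaultdict(list)
    for row in rows:
        cells[key(row)].append(row["stars"])
    return sum((y - mean(ys)) ** 2 for ys in cells.values() for y in ys)


# Made-up ratings for a 2 x 2 design (illustration only).
rows = [
    {"noise": "quiet", "price": "1", "stars": 4.0},
    {"noise": "quiet", "price": "1", "stars": 4.5},
    {"noise": "quiet", "price": "2", "stars": 3.5},
    {"noise": "quiet", "price": "2", "stars": 4.0},
    {"noise": "loud", "price": "1", "stars": 3.0},
    {"noise": "loud", "price": "1", "stars": 3.5},
    {"noise": "loud", "price": "2", "stars": 4.0},
    {"noise": "loud", "price": "2", "stars": 4.5},
]
rss_null = rss_by_cells(rows, lambda r: None)                         # intercept only
rss_noise = rss_by_cells(rows, lambda r: r["noise"])                  # stars ~ NoiseLevel
rss_price = rss_by_cells(rows, lambda r: r["price"])                  # stars ~ RestaurantPriceRange
rss_full = rss_by_cells(rows, lambda r: (r["noise"], r["price"]))     # full model with interaction
```

Because the reduced models are nested in the full model, the full model's RSS can never be larger than theirs; the significance of the interaction term in the ANOVA table is what indicates that the reduction is more than chance.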

#### 3.2.3 Missing Value Analysis 

The random forest results do not reveal the influence of missing values, so we applied a decision tree method using GUIDE, a software package for machine learning [1].

    
TODO: Insert the tree plot

The plot gave us insight into how missing values relate to ratings. TODO: detailed explanation

### 3.3 Review Analysis

#### 3.3.1 Data Cleaning

1. Tokenize each review, that is, break its paragraphs into sentences.
2. To handle negation in reviews, we checked each sentence and added the prefix "NOT_" to every word in negated sentences.
3. Because we only cared about the text, we removed punctuation and meaningless symbols.
4. We stemmed each word to avoid different word forms caused by tense and by singular or plural. We could then break each review into words and count their frequencies.
5. Some words in the word lists, such as "is" and "the", carry no meaning, so we constructed a stopword list to remove them.
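The cleaning steps above can be sketched with the standard library alone. This is a simplified illustration under several assumptions: the negation and stopword lists are tiny made-up samples, `crude_stem` is a rough suffix-stripping stand-in for a real stemmer such as Porter's, and the order of the steps inside each sentence is one possible choice.

```python
import re

# Small illustrative lists; real lists would be much longer.
NEGATION_WORDS = {"not", "no", "never", "nor", "cannot"}
STOPWORDS = {"is", "the", "a", "an", "i", "was", "were", "did", "it", "and", "to"}


def split_sentences(review):
    """Break a review into sentences at ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", review.strip()) if s]


def crude_stem(word):
    """Strip a few common suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def is_negation(word):
    return word in NEGATION_WORDS or word.endswith("n't")


def preprocess(review):
    """Return stemmed tokens, prefixed with NOT_ when their sentence is negated."""
    tokens = []
    for sentence in split_sentences(review):
        # Lowercase and keep only alphabetic words, which drops punctuation and symbols.
        words = re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())
        negated = any(is_negation(w) for w in words)
        for word in words:
            if word in STOPWORDS or is_negation(word):
                continue
            stem = crude_stem(word)
            tokens.append("NOT_" + stem if negated else stem)
    return tokens


tokens = preprocess("The beers were great. I did not like the service!")
# tokens == ["beer", "great", "NOT_like", "NOT_service"]
```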

#### 3.3.2 Key Word Analysis

We first divided words into four aspects (food, beverage, entertainment and service) and then chose high-rate words from each aspect. To avoid the influence of baseline differences in the star distribution, we divided each word's frequency by the number of words in the reviews at each star level, which turned the frequencies into rates. To make our word analysis more specific, we picked words particular to bars, such as "beer", "cocktail", "strip" and "casino", and plotted histograms of their rates at each star level to see whether any special patterns existed. 
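The conversion from frequencies to rates can be sketched as follows. The function name `word_rates_by_star` and the tokenized reviews are hypothetical.

```python
from collections import Counter, defaultdict


def word_rates_by_star(reviews):
    """reviews: iterable of (stars, tokens). Return {stars: {word: rate}}."""
    counts = defaultdict(Counter)
    totals = Counter()
    for stars, tokens in reviews:
        counts[stars].update(tokens)
        totals[stars] += len(tokens)
    # Divide each count by the total number of words at that star level.
    return {
        stars: {word: count / totals[stars] for word, count in counter.items()}
        for stars, counter in counts.items()
    }


# Made-up tokenized reviews for illustration only.
reviews = [
    (5, ["beer", "great", "beer"]),
    (5, ["cocktail"]),
    (1, ["beer", "slow"]),
]
rates = word_rates_by_star(reviews)
```

In this toy example "beer" appears twice in 5-star reviews and once in 1-star reviews, but its rate is 0.5 at both levels, because 5-star reviews contain twice as many words overall.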

When we tried to interpret the histograms, we found the pattern for the word "beer" hard to explain, since its rates are nearly the same at every star level. So, building on the unigram analysis, we generated bigrams to explore "beer" more deeply. We found that "beer selection", "draft beer" and "craft beer" have the highest rates.  

<table>
<tr>
<td><img src="craft_beer.png" width=400 height=500> </td>
<td><img src="draft_beer.png" width=400 height=500> </td>
</tr>
</table>
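The bigram step for "beer" can be sketched as below, assuming the reviews are already cleaned and tokenized. The function names and the sample token lists are made up for illustration.

```python
from collections import Counter


def bigrams(tokens):
    """Return the list of adjacent word pairs in a token list."""
    return list(zip(tokens, tokens[1:]))


def bigrams_containing(token_lists, keyword):
    """Count the bigrams that include the keyword across all token lists."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(pair for pair in bigrams(tokens) if keyword in pair)
    return counts


# Made-up tokenized reviews for illustration only.
token_lists = [
    ["great", "craft", "beer", "selection"],
    ["cold", "draft", "beer"],
    ["craft", "beer", "list"],
]
beer_bigrams = bigrams_containing(token_lists, "beer")
top_bigram = beer_bigrams.most_common(1)
# top_bigram == [(("craft", "beer"), 2)]
```

The counts can then be turned into rates per star level in the same way as for single words.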

After that, we searched for the adjectives next to the bigrams to get more information. By analysing these adjectives, we were able to give corresponding advice. To make our conclusions easy to understand, we built a Shiny app to visualize our results: https://yingjingjiang.shinyapps.io/shiny_app/    

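```python
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib import rcParams

# IPython magic: render figures inline (only valid inside a Jupyter notebook).
%matplotlib inline
rcParams['figure.figsize'] = 11, 8

# Raw strings stop the backslashes in the placeholder paths from being read as escapes such as '\t'.
img_A = mpimg.imread(r'\path\to\img_A.png')
img_B = mpimg.imread(r'\path\to\img_B.png')

# Show the two images side by side.
fig, ax = plt.subplots(1, 2)
ax[0].imshow(img_A)
ax[1].imshow(img_B)
```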

## 4. Goal2--Prediction

We combined features extracted from both reviews and attributes. We built a data frame with columns for keywords and attributes and then constructed several models for prediction. For reviews, we applied TF-IDF.
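As a reminder of what TF-IDF computes, here is a minimal sketch using the plain definition, term frequency times log(N / document frequency). This is not necessarily the exact variant used in our models: libraries such as scikit-learn apply smoothing and normalization by default. The function name and the token lists are made up.

```python
import math
from collections import Counter


def tfidf(documents):
    """documents: list of token lists. Return one {word: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append(
            {
                word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
                for word, count in term_freq.items()
            }
        )
    return weights


# Made-up tokenized reviews for illustration only.
documents = [
    ["craft", "beer", "great"],
    ["draft", "beer"],
    ["slow", "service"],
]
weights = tfidf(documents)
```

Words that appear in many reviews, such as "beer" here, receive a lower weight than words specific to a few reviews, such as "craft".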
    

## 5. Conclusions and Our Advice

## 6. Strength and Weakness

### Strength:
1. We handled the reviews carefully and accounted for different baselines.
2. We used several methods to confirm our results, which makes our conclusions more robust.
3. We treated NA as a level to see whether it makes a difference, rather than simply omitting it.

### Weakness:
1. We failed to find patterns in the time and hour analysis, which may need more careful inspection.
2. The tree method outcome is hard to interpret objectively; our interpretation relies largely on subjective analysis.

### Contribution: 

#### 1. Goal1:   
* Business Analysis:   
    Data cleaning: Junxia Zhu, Yiqun Jiang, Yingjing Jiang  
    Model: Yiqun Jiang, Yingjing Jiang  
    Plots: Junxia Zhu  
* Word Analysis:   
    Data cleaning: Yiqun Jiang, Junxia Zhu  
    Ngrams generating: Yiqun Jiang  
    Adj analysis: Junxia Zhu  
    Plots: Junxia Zhu  
    Shiny app: Yingjing Jiang  
    
#### 2. Goal2:  
* Kaggle: Junxia Zhu, Yiqun Jiang, Yingjing Jiang  

### Reference
[1] GUIDE Manual: http://www.stat.wisc.edu/~loh/treeprogs/guide/guideman.pdf