# Stat 628 Module 2 Report

## 1. Motivation

Recently, Yelp released a portion of its 121 million reviews to the public as part of the “Yelp Dataset Challenge.” This project analyzes a dataset of about 1.5 million reviews whose main feature, the review text, is inherently high-dimensional. The project has two goals:

1. Identify what makes a review positive or negative, based on the review text and a small set of attributes.

2. Propose a model that predicts review ratings from the text and attributes.

We expect to provide useful suggestions for merchants based on our findings.

## 2. Data Description & Cleaning

### 2.1 Data Description

The dataset consists of a training set and a test set, containing 1,546,379 and 1,016,664 observations, respectively. Each observation has three types of data associated with the review:

1. Stars (dependent variable)

2. Review text (primary independent variable) 

3. A small set of covariates that provide supplementary information such as date, city and categories. 

### 2.2 Data Cleaning

In the data cleaning step, we apply a series of operations to the raw text:

1. Punctuation and stopwords: We remove all of them. The stopword list is provided by the NLTK package in Python.

2. Non-English reviews: Because they make up only a very small portion of the dataset, we ignore them.

3. Lemmatization: All remaining words are transformed by a lemmatization function in Python. For example, "broken", "broke" and "breaking" are all converted to "break".
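The cleaning steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: `STOPWORDS` and `LEMMAS` are tiny hand-written stand-ins so the example runs without downloads, and `clean_review` is a hypothetical helper name. In a real pipeline one would presumably swap in NLTK's English stopword list (`nltk.corpus.stopwords.words("english")`) and a proper lemmatizer such as NLTK's `WordNetLemmatizer`; the report does not say which lemmatizer was used.

```python
import string

# Tiny stand-in resources so the sketch runs without any downloads.
# Assumption: the real pipeline uses NLTK's stopword list and a full
# lemmatizer instead of these hand-written tables.
STOPWORDS = {"the", "a", "an", "and", "was", "is", "it", "my", "but"}
LEMMAS = {"broken": "break", "broke": "break", "breaking": "break"}

# Map every punctuation character to a space so words stay separated.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_review(text, stopwords=STOPWORDS, lemmas=LEMMAS):
    """Lowercase, strip punctuation, drop stopwords, and lemmatize tokens."""
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    kept = [tok for tok in tokens if tok not in stopwords]
    # Words missing from the lemma table are passed through unchanged.
    return [lemmas.get(tok, tok) for tok in kept]
```

Calling `clean_review` on a raw review returns a list of cleaned tokens, which is the form the feature construction step needs.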

## 3. Polarity Analysis

## 4. Ratings Prediction

### 4.1 Model Description

Multinomial Logistic Classification (MLC) and Support Vector Regression (SVR) are widely used for prediction tasks and usually produce decent results. In practice, their running time on large datasets is still acceptable. We therefore chose them as our pre-models.

We obtained 20 pre-models by tuning the L2 penalty of MLC and SVR. Each pre-model is fit on a large sparse matrix of unigram and bigram features built from the cleaned data.

Our final prediction, which achieves an RMSE of 0.62, is calculated by combining the results of the 20 pre-models.

### 4.2 Feature Construction

To include as much useful textual information as possible, we consider every single word (unigram) and every pair of adjacent words (bigram) in each review, and represent them as a very large sparse matrix of counts. To scale the data, we then apply the TF-IDF transformation to the raw count matrix. A visual workflow is shown below.
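To make the construction concrete, here is a small pure-Python sketch of unigram and bigram counting followed by TF-IDF weighting. The function names `ngram_counts` and `tfidf` are hypothetical, and the exact TF-IDF variant is an assumption: the sketch uses a smoothed IDF with L2 row normalization, which is the default form in scikit-learn, because the report does not state which variant was applied. On the full dataset a sparse-matrix library (for example scikit-learn's `TfidfVectorizer` with `ngram_range=(1, 2)`) would be the practical choice.

```python
import math
from collections import Counter


def ngram_counts(tokens):
    """Count unigrams and adjacent-word bigrams in one tokenized review."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(list(tokens) + bigrams)


def tfidf(docs):
    """Return one {term: weight} dict per document (a sparse row).

    Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1, followed by
    L2 normalization of each row.
    """
    counts = [ngram_counts(doc) for doc in docs]
    n_docs = len(counts)
    # Document frequency: number of reviews containing each term.
    doc_freq = Counter(term for row in counts for term in row)
    rows = []
    for row in counts:
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in row.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows
```

Each returned dict stores only the nonzero entries of one row, which mirrors why a sparse representation is needed when the vocabulary of unigrams and bigrams is very large.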

### 4.3 Model Fitting

Models with different penalties emphasize different information, because a higher penalty shrinks the coefficients more strongly. To choose pre-models for the final prediction, we tried penalty values from 0 to 5 in increments of 0.1 and computed the RMSE for each. We then kept the 20 models with the lowest RMSE; as it happened, 10 of them were MLC and the other 10 were SVR.
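The selection step can be sketched as below. Both helper names (`penalty_grid`, `select_pre_models`) are hypothetical, and the sketch assumes the RMSE of each candidate has already been computed and stored in a dict keyed by `(model_name, penalty)`. Whether a penalty of exactly 0 was fitted, and how the penalty maps onto a particular library's regularization parameter, is not stated in the report, so the grid endpoints are left as arguments.

```python
import heapq


def penalty_grid(start=0.0, stop=5.0, step=0.1):
    """Penalty values from start to stop inclusive.

    Values are built from an integer index and rounded so that
    floating-point drift does not produce entries like 0.30000000000000004.
    """
    n_steps = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n_steps + 1)]


def select_pre_models(scores, k=20):
    """Return the k (model_name, penalty) keys with the lowest RMSE.

    `scores` maps (model_name, penalty) -> RMSE; the result is sorted
    from best (lowest RMSE) to worst.
    """
    return heapq.nsmallest(k, scores, key=scores.get)
```

Keeping the candidates in one dict lets MLC and SVR models compete in the same ranking, which is how a mix of both model types can end up among the 20 selected.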

In ensemble learning, a mixture of models usually outperforms any single one. In light of that, we combined the predictions of the 20 pre-models in two ways, with a Lasso regression and with a simple average, to achieve better performance. For convenience, we denote the two results as Lasso and Mean, respectively. Our final predictions were calculated by:

\begin{align*}
\text{Final Prediction} = \frac{\text{Lasso} + \text{Mean}}{2}
\end{align*}
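The combination rule can be illustrated with a short sketch. It assumes the Lasso coefficients over the pre-model predictions have already been fitted elsewhere and are passed in as `weights` and `intercept`; those names, and the three function names, are hypothetical. Rows of `pre_model_preds` correspond to reviews and columns to pre-models.

```python
def mean_blend(pre_model_preds):
    """Simple average of the pre-model predictions for each review."""
    return [sum(row) / len(row) for row in pre_model_preds]


def lasso_blend(pre_model_preds, weights, intercept=0.0):
    """Weighted sum of pre-model predictions using fitted Lasso coefficients.

    Assumption: `weights` and `intercept` come from a Lasso regression
    fitted beforehand on the pre-model predictions.
    """
    return [
        intercept + sum(w * p for w, p in zip(weights, row))
        for row in pre_model_preds
    ]


def final_prediction(pre_model_preds, weights, intercept=0.0):
    """Final Prediction = (Lasso + Mean) / 2, computed per review."""
    lasso = lasso_blend(pre_model_preds, weights, intercept)
    mean = mean_blend(pre_model_preds)
    return [(l + m) / 2 for l, m in zip(lasso, mean)]
```

Because Lasso can set some coefficients to zero, the Lasso blend may rely on only a few pre-models, while the simple average uses all of them equally; averaging the two sits between those extremes.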

### 4.4 Model Evaluations

As the graph below shows, the final ensemble model has the best predictive power. All models were trained on the 1.5 million-review training set, and the reported results are RMSE values computed by Kaggle.
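For reference, RMSE in its usual form is given below. The notation is introduced here and is not from the original analysis: $n$ is the number of reviews scored, $y_i$ is the true star rating of review $i$, and $\hat{y}_i$ is the predicted rating. Lower values indicate better predictions.

\begin{align*}
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}
\end{align*}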

<img src="rmse_barplot.jpeg" alt="Root Mean Square Error" style="width: 500px; height: 300px"/>


* **Strengths**
 * Our method is quite simple: it does not require many data cleaning or transformation operations.
 * The final model has decent predictive power, achieving an RMSE of 0.624 in the Kaggle competition.

* **Weaknesses**
 * Neural networks tend to work better on NLP problems, but we were unable to tune their parameters within the time limit. We expect accuracy would improve given more time.
 * Our final model takes some time to explain. In addition, it is not actually the best possible ensemble.

## 5. Conclusions



## 6. References

[1] Multinomial Logistic Classification: https://en.wikipedia.org/wiki/Softmax_function

[2] Support Vector Regression: https://en.wikipedia.org/wiki/Support_vector_machine

[3] TF-IDF Transformation: https://en.wikipedia.org/wiki/Tf%E2%80%93idf


## Contributions

|Name            | Contribution                | 
|----------------|-----------------------------| 
|Yishan Cai| exploratory analysis, text mining in R, extracted features from sentiment lexicons|
|Jiashun Cheng| basic data investigation, created a dictionary with scores based on training data, conducted basic feature extraction, tuned model parameters|
|Yilun Chen| data cleaning, created the sparse matrix with unigram and bigram features, built regression and classification models, proposed final model|