## 1. Abstract

This project has two parts: natural language processing and rating prediction with several machine learning techniques. In the natural language processing part, a sentiment dictionary that reveals the polarity of reviews is extracted from the training dataset based on the properties and distributions of individual words. During this process, keyword substitution and unigram and bigram analysis are applied to obtain a high-dimensional representation that serves as the preprocessed input for the prediction part. For rating prediction, a mixture of Multinomial Logistic Classification and Support Vector Regression models, which shows decent predictive power, is chosen as the final model.

## 2. Sentiment Dictionary Extraction

During text cleaning, a number of adjectives, adverbs and emotional words are recorded because they occur with high frequency. To construct the sentiment scoring system, each selected word is assigned a value computed from the average star rating with respect to the word's density in each rating category. Some examples are shown below:

|   great| amazing|   fresh|       no|      old|
|--------|--------|--------|---------|---------|
| 2.2228 | 1.0346 | 0.7687 | -9.6555 | -2.1107 |


To some extent, this dictionary reveals the polarity of each review. Averaging the scores of the keywords in a review yields the review's own sentiment score, and the mean sentiment score within each star rating category follows the order of the ratings, as listed below:

| Star rating          |     1 |     2 |     3 |    4 |    5 |
|----------------------|-------|-------|-------|------|------|
| Mean sentiment score | -2.71 | -2.13 | -0.78 | 0.83 | 1.99 |
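
The review-level score described above can be sketched as follows. This is a minimal illustration, not the original implementation: the dictionary holds only the five example words from the first table, and the whitespace tokenization and the `None` result for reviews without keywords are assumptions.

```python
# Word scores copied from the example table above; a real dictionary
# would contain many more entries.
SENTIMENT_SCORES = {
    "great": 2.2228,
    "amazing": 1.0346,
    "fresh": 0.7687,
    "no": -9.6555,
    "old": -2.1107,
}


def review_sentiment(review, scores=SENTIMENT_SCORES):
    """Average the scores of the dictionary keywords found in a review.

    Returns None when the review contains no dictionary keyword
    (an assumption; the report does not say how such reviews are handled).
    """
    tokens = review.lower().split()
    hits = [scores[t] for t in tokens if t in scores]
    if not hits:
        return None
    return sum(hits) / len(hits)


positive = review_sentiment("great food and amazing fresh bread")
negative = review_sentiment("old bread and no service")
```

Here `positive` averages three positive word scores and `negative` averages two negative ones, so the two hypothetical reviews land on opposite sides of zero.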

## 3. Model Fitting

<img src="process.png" alt="Process" style="width: 550px;"/>

As shown in the figure above, text reviews are converted into a sparse matrix by tf-idf, and a set of models is then fitted with different hyperparameters. Next, two features are created from these models (Section 3.3), and the final prediction is the average of the two features.


### 3.1 Text Mining

In the preprocessing step, a series of operations is applied to the raw text:

1) Remove punctuation, non-English characters and stopwords.

2) Lemmatize all remaining words.

3) Transform the reviews into a high-dimensional tf-idf sparse matrix containing unigrams and bigrams.
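
The steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the stopword list is a tiny hypothetical one, lemmatization is omitted because it needs an external lexical resource, and the weighting (normalized term frequency times `log(N / df)`) is one common tf-idf variant that may differ from the one used in the analysis.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of"}


def tokenize(text):
    """Lowercase, keep only runs of ASCII letters, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def ngrams(tokens):
    """Return the unigrams followed by the bigrams (adjacent word pairs)."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(reviews):
    """Map each review to a sparse {term: weight} dictionary."""
    docs = [Counter(ngrams(tokenize(r))) for r in reviews]
    n_docs = len(docs)
    # Number of reviews in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in doc)
    rows = []
    for doc in docs:
        total = sum(doc.values())
        rows.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in doc.items()
        })
    return rows


reviews = [
    "The food was great!",
    "The service was not great.",
    "Great food, great price",
]
matrix = tfidf(reviews)
```

Each row stores only the terms present in that review, which is what makes the representation sparse. A term that occurs in every review, such as "great" here, gets weight zero under this variant, while bigrams such as "not great" keep information that unigrams alone would lose.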

### 3.2 Model Specification

#### Multinomial Logistic Classification

Multinomial Logistic Classification is based on the softmax function, which "squashes" the high-dimensional input, through the linear scores $X \beta_k$, into a $K$-dimensional vector of class probabilities. Specifically,

\begin{align*}
P(y = j \ | \ X) = \frac{e^{X \beta_j}}{\sum_{k = 1}^{K} e^{X \beta_k}}
\end{align*}

In this analysis, $K = 5$, the classes $j = 1, 2, 3, 4, 5$ are the star ratings, and $y$ is the rating to be predicted.
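
As a small worked example of the formula above, the sketch below turns five linear scores into class probabilities. The score values are hypothetical stand-ins for $X \beta_j$, and subtracting the maximum score is a standard numerical safeguard that does not change the result.

```python
import math


def softmax(scores):
    """Convert K raw class scores into K probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical linear scores X @ beta_j for star ratings 1..5.
scores = [-1.0, -0.5, 0.2, 1.5, 0.9]
probs = softmax(scores)

# The predicted rating is the class with the highest probability.
predicted_rating = probs.index(max(probs)) + 1
```

With these scores the largest probability belongs to the fourth class, so the predicted rating is 4 stars.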

#### Support Vector Regression

Support Vector Regression resembles simple linear regression, but its loss function is more flexible: residuals smaller than a small bound $\epsilon$ incur no loss. The $\epsilon$-insensitive loss is:

\begin{align*}
L_{\epsilon}(y, f(x, \beta)) = \max (|y - f(x, \beta)| - \epsilon, 0)
\end{align*}
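
The loss above is straightforward to evaluate directly. The sketch below does so for two hypothetical predictions; the value `epsilon=0.1` is chosen only for illustration and is not taken from the analysis.

```python
def epsilon_insensitive_loss(y, prediction, epsilon=0.1):
    """Return max(|y - prediction| - epsilon, 0)."""
    return max(abs(y - prediction) - epsilon, 0.0)


# Residual 0.05 is within the bound, so no loss is incurred.
inside_bound = epsilon_insensitive_loss(4.0, 4.05)

# Residual 0.5 exceeds the bound, so only the excess is penalized.
outside_bound = epsilon_insensitive_loss(4.0, 3.5)
```

Predictions within $\epsilon$ of the true rating cost nothing, and larger errors are penalized linearly rather than quadratically.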

### 3.3 Feature Extraction

Varying the penalty coefficient produces 20 models: 10 Multinomial Logistic Classification models and 10 Support Vector Regression models. Within each model type, every model focuses on a different subset of word attributes, because even a small change in the penalty coefficient has a large effect on high-dimensional data. This yields diversity among models of the same type.

To extract useful information from these models, two techniques are applied: the naive mean and Lasso regression.

1) Naively averaging the outputs of all models retains information from every model, regardless of its importance.

2) Lasso regression selects the few models that genuinely contribute.

These two operations create two features, which are used in the final fit.
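
The two operations can be illustrated on toy data as follows. This is a sketch under stated assumptions, not the original code: it uses two hypothetical models instead of 20, fits the Lasso without an intercept by plain coordinate descent, and uses an arbitrary penalty `alpha=0.1`. In practice a library implementation would normally be used.

```python
def naive_mean(columns):
    """Average the predictions of all models for each review."""
    n_models = len(columns)
    return [sum(values) / n_models for values in zip(*columns)]


def soft_threshold(value, threshold):
    """Shrink value toward zero; values within the threshold become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_weights(columns, y, alpha, n_iter=100):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + alpha*||w||_1, no intercept."""
    n = len(y)
    weights = [0.0] * len(columns)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Residual of the current fit with model j left out.
            partial = [
                y[i] - sum(w * c[i]
                           for k, (w, c) in enumerate(zip(weights, columns))
                           if k != j)
                for i in range(n)
            ]
            rho = sum(x * r for x, r in zip(col, partial)) / n
            scale = sum(x * x for x in col) / n
            weights[j] = soft_threshold(rho, alpha) / scale if scale > 0 else 0.0
    return weights


# Toy data: model_a tracks the ratings, model_b always predicts 3 stars.
ratings = [1, 2, 3, 4, 5]
model_a = [1.0, 2.0, 3.0, 4.0, 5.0]
model_b = [3.0, 3.0, 3.0, 3.0, 3.0]
columns = [model_a, model_b]

feature_mean = naive_mean(columns)
weights = lasso_weights(columns, ratings, alpha=0.1)
feature_lasso = [
    sum(w * c[i] for w, c in zip(weights, columns))
    for i in range(len(ratings))
]
```

In this toy run the uninformative `model_b` receives a weight of exactly zero, while the naive mean still blends both models equally. That contrast is why the two features carry different information.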

### 3.4 Final Adjustment

Finally, to integrate the condensed information, the average of the two features is taken as the prediction. As expected, the final mixture model performs better in the prediction task regardless of which data set is used as input. In short, the final model is

\begin{align*}
\text{Final Model} = \sum_{k = 1}^{10} (a_k L_k + b_k S_k)
\end{align*}

where $L_k$ and $S_k$ are the Multinomial Logistic Classification (MLC) and Support Vector Regression (SVR) models, respectively, with coefficients $a_k$ and $b_k$.
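
One possible way to see how the equation above follows from averaging the two features is sketched below. It assumes that the naive mean weights all 20 models equally and that the Lasso fit has no intercept. The symbols $F_1$, $F_2$, $\alpha_k$ and $\gamma_k$ are introduced here for illustration only.

\begin{align*}
F_1 &= \frac{1}{20} \sum_{k = 1}^{10} (L_k + S_k) \\
F_2 &= \sum_{k = 1}^{10} (\alpha_k L_k + \gamma_k S_k) \\
\frac{F_1 + F_2}{2} &= \sum_{k = 1}^{10} \left[ \left( \frac{1}{40} + \frac{\alpha_k}{2} \right) L_k + \left( \frac{1}{40} + \frac{\gamma_k}{2} \right) S_k \right]
\end{align*}

Under these assumptions, $a_k = \frac{1}{40} + \frac{\alpha_k}{2}$ and $b_k = \frac{1}{40} + \frac{\gamma_k}{2}$, where the Lasso weights $\alpha_k$ and $\gamma_k$ are zero for every model that Lasso does not select.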

In terms of prediction, the model shows decent predictive power: the Root Mean Square Error (RMSE) stays at 0.624 on the training set, test set and evaluation set.
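
For reference, the RMSE of an averaged prediction can be computed as in the sketch below. The ratings and feature values are hypothetical and serve only to show the calculation; they are unrelated to the 0.624 figure reported above.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical star ratings and feature values for four reviews.
actual = [5, 3, 1, 4]
naive_mean_feature = [4.6, 3.4, 1.8, 3.5]
lasso_feature = [4.8, 2.8, 1.2, 4.3]

# The final prediction is the average of the two features.
final_prediction = [
    (a + b) / 2 for a, b in zip(naive_mean_feature, lasso_feature)
]
score = rmse(actual, final_prediction)
```

On this toy data the averaged prediction has a lower RMSE than the naive mean feature alone, although such a gain is not guaranteed in general.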

### 3.5 Prediction

According to the bar plot below, the final model outperforms each of its components in the prediction task on the test data.

![Root Mean Square Error](rmse_barplot.jpeg)