cs107-ml.Rmd

---
title: 'CS107 Final Froject : Sentiment Analysis of Product Reviews'
author: "Sahbi Ben Gdaiem, Mohammed Hadj-Ali"
date: "April 27, 2016"
output: 
  html_document:
    toc: true
    number_sections: true
    highlight: tango
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction
## Motivation :
Sentiment Analysis, is receiving a big attention these days, because of its huge spectrum of applications ranging from product review analysis, campaign feedback, competition bench-marking, customer profiles, political trends, etc...

There is a huge flow of information going through the internet and social networks. Online discussions are only relevant to people for a couple of days. Nobody actually goes in past to tweets that are older than maybe a week, for instance. This entire humanity archive of discussion could help in many applications if we train machines to understand the sentiment of people towards a specific theme at a specific time. 

## Background
“Sentiment analysis is the computational study of people's opinions, sentiments, emotions, and attitudes.” [Excerpt From: Bing Liu. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions.]. This book is an excellent survey of NLP and SA research.

Given the large amount of data available on the Web, it is now possible to investigate high-level Information Retrieval tasks like user's intentions and feelings about facts or objects discussed. [Pang, B., Lee, L., 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval] 

# Sentiment Analyis

## Challenge :
There are several things to take into consideration when approaching a Sentiment Analysis task. In general, there are two main approaches: 

- Sentiment lexicons using Natural Language Processing (NLP) techniques. A Sentiment lexicon is a list of words that are associated to polarity values (positive or negative). NLP techniques offer a deep level of analysis since they take into account the context words in the sentence. 

- Machine Learning classification algorithms. Because sentiment classification is a text classification problem, any existing supervised learning method can be directly applied [Bing Liu]. For example,  naive Bayes classification , logistic regression, support vector machines (SVM), etc..

In this work we'll work on ML classification and then try to get into the NLP and experience some of the basic techniques used.


```{r message=FALSE,echo=FALSE}
# Specify and load required R packages. 
libs = c(
  "jsonlite",    ##  loading JSON data input 
  "caret",       ##  confusion matrix
  "stringr",     ##  gsub
  "knitr",       ##  kable
  "ggplot2",     ##  plots
  "dplyr",       ##  dataframe wrangling
  "randomForest", #  randomForest classification
  "e1071",       ##  naive bayes and SVM 
  "MASS",        ##  lda
  "wordcloud",   ##  word cloud
  "tidyr",       ##  ubiquitous
  "tidytext"     ##  tokenizer and sentiment dataset
)

sapply(libs, require, character.only=TRUE)
```


## Data Wrangling and Preparation

### Load data

In this work we'll use a data-set that we obtained thankfully from Julian McAuley, at the University of San Diego (here)[http://jmcauley.ucsd.edu/data/amazon/] 

We'll be also inspired by their SIGIR and KDD papers (listed on the above page) as a baseline for our accuracy bench-marking.

This data-set contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. 
We have decided to use electronics reviews for this work. Because electronics are not perfect so create a lot of contrasted opinions.
 

```{r}
# TODO: Set working directory to where this script is (avoid hardcoded paths)
this.dir <- "~/HES/CS107/final project/cs107final"
setwd(this.dir)

json_file <- "data/reviews_electronics_extract.json"
json_file <- "data/reviews_elect_50K.json.gz"
```

The file format is **NDJSON** (Newline Delimited JSON), So `stream_in` is the appropriate way of to load the data into a data frame.

```{r cache=TRUE}
dat <- stream_in(gzfile(json_file), verbose = FALSE)

dat <- dat %>% head(50000)    ## comment after testing

#Preview the result.
glimpse(dat)
```


## Exploratory Data Analysis
### Basic numbers
Let's explore some facts about our data.

Checking how many users are there. Almost we have 46K users for 50K reviews. So each user has done 1 unique review per product.

```{r}
n_distinct(dat$reviewerID)
```

Checking how many products are there using the distinct ASIN (Amazon Standard Identification Number) amazon's unique product identifier. 

```{r}
n_distinct(dat$asin)
```

Let's look at at the the number of ratings per product. It's quite skewed with some extreme best sellers (a headphone from Koss) having 3000 reviews. 

```{r}
dat %>% group_by(asin) %>%
        summarize(n_ratings = n() ) %>% 
        ggplot(aes(n_ratings)) + 
        geom_histogram() + 
        scale_x_log10() +
        labs(title="Histogram for Ratings per Product") +
        labs(x="Number of Ratings", y="Count") 

# best seller
dat %>% group_by(asin) %>% 
        summarize(n_ratings = n() ) %>% arrange(desc(n_ratings)) %>% head(1) %>% .$asin
```

### Simplification of data

Prepare the text for processing, convert it to lower case, and keep only relevant columns.
We then garbage collect the old `dat`

```{r}
df <- dat %>% mutate(reviewText = summary) %>% 
              dplyr::select(reviewText, overall)
# rm(dat)  ## TODO : uncomment for production version
gc()
```


### Rating system in Amazon

The rating system used in Amazon is as follows : 

* emotional positive (5 stars)
* rational positive  (4 stars)
* neutral            (3 stars)
* rational negative  (2 stars)
* emotional negative (1 star )

**The user's star rating of his own review description as a subjective human interpretation of opinion. So, we consider that as the ground truth.** 

Let's see the distribution of these ratings in our case

```{r}
df %>%  ggplot(aes(x=as.factor(overall))) + 
        geom_bar(aes(fill=factor(overall))) + 
                labs(title="Histogram for Ratings per Product") +
                labs(x="Rating Stars", y="number of reviews") 

mean(dat$overall)
```

> The ratings are very skewed towards positive feedback. Which is an indication that Amazon is not selling junk at least but it's not going to help in our modeling. We have to have equal likelihood of each class of the ratings. 

> In the next sections we'll solve this.

## Design decisions on reviews

In this work we will have a binary classification. Either Positive or Negative.
We objectively chose to demarcate each rating at the 2.x level divide, such that star ratings above this level would be marked as “1” and star ratings below this level would be marked as “0” 

We will collapse the 3, 4 and 5 stars ratings into “1” value, and the 1 and 2 stars ratings into Negative “0”

So at the end we'll have only 2 opinions : negative/positive
The simplest way to do this is by joining a mapping matrix

```{r}
overall <-seq(1,5)
rating <- c(0,0,0,1,1)
trans <- data.frame(cbind(overall, rating))

df <- df %>% 
     # filter(overall != 3)  %>% 
      left_join(trans) %>% 
      dplyr::select(-overall) %>% 
      mutate(id = row_number())
```

Let's check again the distribution of opinions

```{r}
df %>%  ggplot(aes(x=as.factor(rating))) + 
        geom_bar(aes(fill=factor(rating))) + 
labs(title="Histogram of Ratings") +
labs(x="Class", y="number of reviews") 
```

Even after this mapping the class proportions need to be corrected manually.
We calculate the skew (disproportion rate)

```{r}
sum(df$rating == 0) / sum(df$rating == 1)
n_remov <- sum(df$rating == 1) - sum(df$rating == 0)
```

we have 3x more Positive than Negative. Let's remove 2/3 of Positive to adjust the proportions

```{r}
positive_ids <- df$id[df$rating == 1]
remove_ids <-  sample(positive_ids, n_remov)

## remove these rows
df3 <- df[-remove_ids,]
```

Check the numbers, we are good to go :

```{r}
sum(df3$rating == 0) / sum(df3$rating == 1)

## clean up things
df <- df3
rm(df3, dat, positive_ids, remove_ids)
gc()
```

# Machine Learning Classification

## Bag of Words

One of the simpler things to do with text is to treat each text as a "bag of words". We have used the `tm` package in order to construct a Term Document Matrix but the computer couldn't handle such  huge dimensions. So let's go with `tidytext`

Let's discover the top words for both positive and negative ratings. We use the `wordcloud` package to have a nice display. 


```{r warning=FALSE}

emo_pos <- df %>% filter(rating == 1)  %>%
            unnest_tokens(word, reviewText)  %>%
            filter(!word %in% stop_words$word) %>%
            filter(nchar(word) > 2) %>%
            count(word) 

emo_neg <- df %>% filter(rating == 0)  %>%
            unnest_tokens(word, reviewText)  %>%
            filter(!word %in% stop_words$word) %>%
            filter(nchar(word) > 2) %>%
            count(word) 

layout(matrix(c(1:4), nrow=2, ncol = 2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "Positive ratings", cex=2)

wordcloud(words = emo_pos$word, freq = emo_pos$n, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))

plot.new()
text(x=0.5, y=0.5, "Negative ratings", cex=2)

wordcloud(words = emo_neg$word, freq = emo_neg$n, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))

```


## Text tidiying

Do a series of transformations :

- lowercase
- remove punctuation
- strip white space

```{r}
df$reviewText <- gsub( "[^a-zA-Z]", "\\1 \\2", tolower(df$reviewText))
df$reviewText <- gsub( " +", " ", df$reviewText)
```

Use `nrc` lexicon as a bag of words that have sentiments but we're not going to look into these sentiments for the time being.

```{r}
bow <- sentiments %>%
    filter(lexicon == "nrc") %>%
    dplyr::select(word)
```


## Creating features from Bag of Words

Use `unrest_tokens` and join function in order to convert our reviews into a sparse Matrix.


```{r}
Xdf <- df %>%
    unnest_tokens(word, reviewText) %>%
    filter(!word %in% stop_words$word) %>%
    inner_join(bow) %>%
    count(word, id) %>%
    spread(word, n, fill = 0)

## join back to save the truth y
Xdf2 <- df %>% right_join(Xdf, by = "id") 

## save the truth in order:
Y   <- Xdf2[,2]
Xdf <- Xdf2 %>% dplyr::select(-c(1:3))

```

## TF-IDF

Term Frequency - Inverse Document Frequency is term count within a document weighted against the term's ubiquity within the corpus. This weight is based on the principle that terms occurring in almost every document are therefore less specific to an individual document and should be scaled down. 
So a tf-idf value represents the term's relative importance within a document.

Compute tf-idf, inverse document frequency, and relative term frequency on document-feature matrices

$$tf(t,d) = \frac{f_{d}(t)}{\underset{w \in d}{max}}$$
$$idf(t,D) = log \left (\frac{|D|}{|d \in D : t \in d|}  \right )$$
$$tfidf(t,d,D) = tf(t,d)\cdot idf(t,D)$$
$f_d(t):=$ freqency of term $t$ in document $d$

$D :$ corpus of documents

$|D| :$ number of documents where the term $t$ appears

$|d \in D : t \in d|\ :$ number of documents where the term $t$ appears

```{r}


## tfidf function implementaion
## the normalize will divide x by the text total freq.
tfidf <- function(x, normalize=TRUE){

  if(normalize) x <- x/rowSums(x)
  
  tf <- t(x)
  idf <- log(nrow(x)) - log(colSums(x>0) + 1)
  t( tf*idf )
}

Xdf <- tfidf(Xdf) 

```


## Reduce dimensionality (Single Vector Decomposition)

Suppose we have m words (features) and n documents 
Single vector decomposition (SVD) of an m*n real or complex matrix X is a factorization of the form :


$$\mathbf{ X = U \Sigma V^{T} }$$
$$ U\ is\ m x r\ matrix,\ columns\ of\ U\ contain\ the\ Eigenvectors\ of\ X \cdot X^{T} $$
$$ V\ is\ an\ r x n,\ columns\ of\ V\ contain\ the\ Eigenvectors\ of\ X^{T} \cdot X $$
$$ \Sigma\ is\ a\ diagonal\ r x r\ matrix.\ diagonal\ values\ are\ eigenvalues\ of\ X \cdot X^{T}$$

$$\begin{bmatrix} x_{1,1} & x_{1,2}  & \cdots & x_{1,n} \\ x_{2,1}& \ddots & & \vdots \\ \vdots &  & \ddots & \vdots \\ 
 x_{m,1}& \cdots &  \cdots & x_{m,n} \\ \end{bmatrix}= \begin{bmatrix} u_{1,1} & u_{1,2}  & \cdots & u_{1,r} \\ u_{2,1}& \ddots & & \vdots \\ \vdots &  & \ddots & \vdots \\u_{m,1}& \cdots &  \cdots & u_{m,r} \\\end{bmatrix} \cdot diag \begin{bmatrix}d_{1}\\\vdots \\  \vdots  \\  d_{r} \\ \end{bmatrix}\cdot \begin{bmatrix} v_{1,1} & v_{1,2}  & \cdots & v_{1,n} \\ v_{2,1}& \ddots & & \vdots \\ \vdots &  & \ddots & \vdots \\ v_{r,1}& \cdots &  \cdots & v_{r,n} \\ \end{bmatrix}$$

It's basically a PCA but without mean shifting.But SVD works better in SA because it is able to detect and extract small signals from noisy data. Noisy data here means words that are not significant for prediction.

In this context, it is known as latent semantic analysis (LSA).

```{r cache=T}
# we transpose id because we want docunent as 
Xmat <- t(as.matrix(Xdf))
dim(Xmat)

# calculate the sparsity of the matrix
sprintf("Matrix sparsity = %.2f%%", 100*(1- sum(Xmat > 0) / length(Xmat)))

# call svd on this matrix
Xmat.svd <- svd(Xmat, LINPACK=T)
#save(Xmat.svd, file=paste(this.dir, "Xmat.svd", ".RData", sep=""))
#load(paste(this.dir, 'Xmat.svd', '.RData', sep=''))

sprintf("U matrix dimension is: %dx%d", dim(Xmat.svd$u)[1], dim(Xmat.svd$u)[2])
sprintf("d (sigma) matrix is : %d ", length(Xmat.svd$d))
sprintf("V matrix dimension is: %dx%d", dim(Xmat.svd$v)[1], dim(Xmat.svd$v)[2])
```


The diagonal vector `d` generated is ordered by importance of each dimension.
Let's graph a cumulative variance:  

```{r}
s <- sum(Xmat.svd$d)
d <- data.frame(var = cumsum(Xmat.svd$d^2/sum(Xmat.svd$d^2)))
ggplot(d, aes(x=1:nrow(d), y=var)) + geom_line() +
      xlab("number of eigenvectors") +
      ylab("Cumulative variance") + scale_x_log10()
```

> Design decision
>
>We have to decide how many dimensions to keep. Looking at the graph we can keep 10 dimensions which represent almost 99% of the features.
This is curious and a puzzling choice to make. But let's aim for fast processing and later on see if adding more dimensions improves our prediction.

In order to rotate into our new space of reduced dimensions we have to limit the number of eigenvectors in `U` then transpose it an multiply it by the original X :

$$ \hat{X} = U^{T}\cdot X $$

```{r}
## trauncate 10% of the column
truncate <- round(0.1 * ncol(Xmat.svd$u))
Xmat_h <- t(Xmat.svd$u[,1:truncate]) %*% Xmat
```

Transform it back to a dataframe and bring back the truth y

```{r}
X_h <- data.frame(y=Y , t(Xmat_h))
```


If we consider the first 2 dimensions as predictors let's see how it looks

```{r}
X_h  %>%
    ggplot(aes(X2, X1, color=factor(y))) + 
    geom_jitter() # + scale_x_log10() + scale_y_log10()
```


## Supervised training

Now that we know some basic facts about our data set, 
let's randomly split the data into training and test data. 
We set the seed `set.seed(755)` and use `sample()` function to select 
your test index to create two separate data frames
called: `train` and `test` from the original `ratings` data frame. 
 `test` contains a randomly selected 20% of the rows and training the other 80%. We will 
use these data frames to do the rest of the analyses in the problem set.


```{r}

set.seed(123)
# sample from 1 to the length of X_h
test_sample <- sample(1:nrow(X_h), nrow(X_h)/5) 
test <- X_h[test_sample,]
train <- X_h[-test_sample,]
# just checking the dimensions
nrow(train) / nrow(test)
```

As we go along we will be comparing different approaches. Let's start by creating a benchmark table:

```{r}
benchmark <- data_frame()
```

### Naive Bayes supervised training 

The basic idea to find the probabilities of categories given a text document by using the joint probabilities of words and categories. It is based on the assumption of word independence.
The starting point is the Bayes theorem for conditional probability, stating that, for a given data point x and class C: 

$$P(C/x) = \frac{P(x/C) \cdot P(C)}{P(x)}$$

By making the assumption that for a data point x = {x1,x2,…xj}, the probability of each of its attributes occurring in a given class is independent, we can estimate the probability of x as follows :

$$P(C/x) = P(C) \cdot \prod_{j} P(x_{j}/C)$$


Now, we can train the naive Bayes model with the training set. We'll be using e1071 package made by David Meyer from TU Wien.
The naiveBayes function requires a 

```{r}

# transform into matrices
mat_train <- as.matrix(train)
mat_test <- as.matrix(test)

# train the model
nb.model = naiveBayes(mat_train[,-1],   ## these are being the features
                      as.factor(mat_train[,1]) )  ## this is the truth y

nb.pred = predict(nb.model, mat_test[,-1])

# calcualte the confusion matrix
table(as.factor(mat_test[,1]), nb.pred)

## calcualte accuracy
accuracy <- function(true_, predicted_) 
{
	true_ <- as.vector(true_)
	predicted_ <- as.vector(predicted_, mode=class(true_))
	res <- predicted_ == true_

	accuracy <- length(res[res == TRUE])/length(true_)
	return(accuracy)
}

nb.accuracy <- accuracy(as.factor(mat_test[,1]), nb.pred) ; nb.accuracy

benchmark <- bind_rows(benchmark,
                          data_frame(method="Naive Bayes",  
                                     accu = nb.accuracy ))

## calculate RMSE

RMSE <- function(true_ratings, predicted_ratings){
    sqrt(mean((true_ratings - predicted_ratings)^2))
}

RMSE(mat_test[,1], as.numeric(nb.pred))


```


### Support Vector Machine 

SVMs were developed by Cortes & Vapnik (1995) [1] for binary classification. Their approach may be roughly sketched as follows:
> Class separation: basically, we are looking for the optimal separating hyperplane between the two classes by maximizing the margin between the classes’ closest points —the points lying on the boundaries are called support vectors, and the middle of the margin is our optimal separating hyperplane;

[1] : Cortes, C. & Vapnik, V. (1995). Support-vector network. Machine Learning, 20, 1–25

```{r}

## svm
svm.model <- svm(mat_train[,-1],           ## these are being the features
                 as.factor(mat_train[,1]),  ## this is the truth y
                 cost = 5, gamma = 1)
svm.pred <- predict(svm.model, mat_test[,-1])

## compute svm confusion matrix
table(as.factor(mat_test[,1]), svm.pred)

## compute svm accuracy
svm.accuracy <- accuracy(as.factor(mat_test[,1]), svm.pred) ; svm.accuracy

benchmark <- bind_rows(benchmark,
                          data_frame(method="Support Vector Machine",  
                                     accu = svm.accuracy ))
```

### Logistic regression

```{r warning=FALSE}

## glm
glm.model <- glm(y~., data=train, family=binomial)

glm.pred <- predict(glm.model, newdata=test, type="response") 
summary(glm.model)

## compute lr confusion matrix assuming a cutoff at 0.5
table(test$y, glm.pred>.5) 

## compute lr accuracy
accuracy(test$y, glm.pred>.5)

```

In the previous model we were assuming a less optimal cut-off. Let's use LDA assuming of course that our covariates are bivariate normal


```{r}
lda.model <- lda(y ~ ., train)
 
plda = predict(lda.model, newdata = test, type="response")

## compute lda confusion matrix
lda.tab <- table(test$y, plda$class)
 
lda.accuracy <- confusionMatrix(lda.tab , positive="1")$overall[[1]];lda.accuracy


benchmark <- bind_rows(benchmark,
                          data_frame(method="Linear Discriminent Analysis",  
                                     accu = lda.accuracy ))
```


### Cross-Validation of Logistic Regression

```{r warning=FALSE}

#define training control
train_control <- trainControl(method="cv", number=10)
X_h <- mutate(X_h, y = factor(y))
# train the model
model <- train(y ~ .,
             data = X_h,
             method = "glm",
             trControl = train_control,
             tuneLength = 1, # How fine a mesh to go on grid
             #tuneGrid=data.frame(k=seq(1,3,1)),
             metric="Accuracy")
# summarize results
print(model)
```


### Randon Forest 

The `randomForest` prediction function works similarly to decision trees

```{r}
rf.fit <- randomForest(as.factor(y) ~ ., 
                       data=train, 
                       importance=TRUE, 
                       mtry = 3, 
                       ntree=2000)

rf.pred <- predict(rf.fit, newdata = test, type="response" )

## compute confusion matrix 
table(as.factor(test$y), rf.pred ) 

## compute accuracy
rf.accuracy <- accuracy(as.factor(test$y), rf.pred );rf.accuracy


benchmark <- bind_rows(benchmark,
                          data_frame(method="Random Forest",  
                                     accu = rf.accuracy ))
```
 

```{r}
rf.fit$importance[1:5,]
```


## Conclusion

In this section we used a supervised machine learning approach and tried several classification models. From the table below, the algorithmic approach using randomForest gives the best accuracy. we are 25% better than flipping a coin with only a classified bag of words.


```{r}
# summary of 5 methods
benchmark %>% kable
```