/
cs107-ml.Rmd
640 lines (430 loc) · 20.7 KB
/
cs107-ml.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
---
title: 'CS107 Final Froject : Sentiment Analysis of Product Reviews'
author: "Sahbi Ben Gdaiem, Mohammed Hadj-Ali"
date: "April 27, 2016"
output:
html_document:
toc: true
number_sections: true
highlight: tango
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
## Motivation :
Sentiment Analysis, is receiving a big attention these days, because of its huge spectrum of applications ranging from product review analysis, campaign feedback, competition bench-marking, customer profiles, political trends, etc...
There is a huge flow of information going through the internet and social networks. Online discussions are only relevant to people for a couple of days. Nobody actually goes in past to tweets that are older than maybe a week, for instance. This entire humanity archive of discussion could help in many applications if we train machines to understand the sentiment of people towards a specific theme at a specific time.
## Background
“Sentiment analysis is the computational study of people's opinions, sentiments, emotions, and attitudes.” [Excerpt From: Bing Liu. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions.]. This book is an excellent survey of NLP and SA research.
Given the large amount of data available on the Web, it is now possible to investigate high-level Information Retrieval tasks like user's intentions and feelings about facts or objects discussed. [Pang, B., Lee, L., 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval]
# Sentiment Analyis
## Challenge :
There are several things to take into consideration when approaching a Sentiment Analysis task. In general, there are two main approaches:
- Sentiment lexicons using Natural Language Processing (NLP) techniques. A Sentiment lexicon is a list of words that are associated to polarity values (positive or negative). NLP techniques offer a deep level of analysis since they take into account the context words in the sentence.
- Machine Learning classification algorithms. Because sentiment classification is a text classification problem, any existing supervised learning method can be directly applied [Bing Liu]. For example, naive Bayes classification , logistic regression, support vector machines (SVM), etc..
In this work we'll work on ML classification and then try to get into the NLP and experience some of the basic techniques used.
```{r message=FALSE,echo=FALSE}
# Specify and load required R packages.
libs = c(
"jsonlite", ## loading JSON data input
"caret", ## confusion matrix
"stringr", ## gsub
"knitr", ## kable
"ggplot2", ## plots
"dplyr", ## dataframe wrangling
"randomForest", # randomForest classification
"e1071", ## naive bayes and SVM
"MASS", ## lda
"wordcloud", ## word cloud
"tidyr", ## ubiquitous
"tidytext" ## tokenizer and sentiment dataset
)
sapply(libs, require, character.only=TRUE)
```
## Data Wrangling and Preparation
### Load data
In this work we'll use a data-set that we obtained thankfully from Julian McAuley, at the University of San Diego (here)[http://jmcauley.ucsd.edu/data/amazon/]
We'll be also inspired by their SIGIR and KDD papers (listed on the above page) as a baseline for our accuracy bench-marking.
This data-set contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
We have decided to use electronics reviews for this work. Because electronics are not perfect so create a lot of contrasted opinions.
```{r}
# TODO: Set working directory to where this script is (avoid hardcoded paths)
this.dir <- "~/HES/CS107/final project/cs107final"
setwd(this.dir)
json_file <- "data/reviews_electronics_extract.json"
json_file <- "data/reviews_elect_50K.json.gz"
```
The file format is **NDJSON** (Newline Delimited JSON), So `stream_in` is the appropriate way of to load the data into a data frame.
```{r cache=TRUE}
dat <- stream_in(gzfile(json_file), verbose = FALSE)
dat <- dat %>% head(50000) ## comment after testing
#Preview the result.
glimpse(dat)
```
## Exploratory Data Analysis
### Basic numbers
Let's explore some facts about our data.
Checking how many users are there. Almost we have 46K users for 50K reviews. So each user has done 1 unique review per product.
```{r}
n_distinct(dat$reviewerID)
```
Checking how many products are there using the distinct ASIN (Amazon Standard Identification Number) amazon's unique product identifier.
```{r}
n_distinct(dat$asin)
```
Let's look at at the the number of ratings per product. It's quite skewed with some extreme best sellers (a headphone from Koss) having 3000 reviews.
```{r}
dat %>% group_by(asin) %>%
summarize(n_ratings = n() ) %>%
ggplot(aes(n_ratings)) +
geom_histogram() +
scale_x_log10() +
labs(title="Histogram for Ratings per Product") +
labs(x="Number of Ratings", y="Count")
# best seller
dat %>% group_by(asin) %>%
summarize(n_ratings = n() ) %>% arrange(desc(n_ratings)) %>% head(1) %>% .$asin
```
### Simplification of data
Prepare the text for processing, convert it to lower case, and keep only relevant columns.
We then garbage collect the old `dat`
```{r}
df <- dat %>% mutate(reviewText = summary) %>%
dplyr::select(reviewText, overall)
# rm(dat) ## TODO : uncomment for production version
gc()
```
### Rating system in Amazon
The rating system used in Amazon is as follows :
* emotional positive (5 stars)
* rational positive (4 stars)
* neutral (3 stars)
* rational negative (2 stars)
* emotional negative (1 star )
**The user's star rating of his own review description as a subjective human interpretation of opinion. So, we consider that as the ground truth.**
Let's see the distribution of these ratings in our case
```{r}
df %>% ggplot(aes(x=as.factor(overall))) +
geom_bar(aes(fill=factor(overall))) +
labs(title="Histogram for Ratings per Product") +
labs(x="Rating Stars", y="number of reviews")
mean(dat$overall)
```
> The ratings are very skewed towards positive feedback. Which is an indication that Amazon is not selling junk at least but it's not going to help in our modeling. We have to have equal likelihood of each class of the ratings.
> In the next sections we'll solve this.
## Design decisions on reviews
In this work we will have a binary classification. Either Positive or Negative.
We objectively chose to demarcate each rating at the 2.x level divide, such that star ratings above this level would be marked as “1” and star ratings below this level would be marked as “0”
We will collapse the 3, 4 and 5 stars ratings into “1” value, and the 1 and 2 stars ratings into Negative “0”
So at the end we'll have only 2 opinions : negative/positive
The simplest way to do this is by joining a mapping matrix
```{r}
overall <-seq(1,5)
rating <- c(0,0,0,1,1)
trans <- data.frame(cbind(overall, rating))
df <- df %>%
# filter(overall != 3) %>%
left_join(trans) %>%
dplyr::select(-overall) %>%
mutate(id = row_number())
```
Let's check again the distribution of opinions
```{r}
df %>% ggplot(aes(x=as.factor(rating))) +
geom_bar(aes(fill=factor(rating))) +
labs(title="Histogram of Ratings") +
labs(x="Class", y="number of reviews")
```
Even after this mapping the class proportions need to be corrected manually.
We calculate the skew (disproportion rate)
```{r}
sum(df$rating == 0) / sum(df$rating == 1)
n_remov <- sum(df$rating == 1) - sum(df$rating == 0)
```
we have 3x more Positive than Negative. Let's remove 2/3 of Positive to adjust the proportions
```{r}
positive_ids <- df$id[df$rating == 1]
remove_ids <- sample(positive_ids, n_remov)
## remove these rows
df3 <- df[-remove_ids,]
```
Check the numbers, we are good to go :
```{r}
sum(df3$rating == 0) / sum(df3$rating == 1)
## clean up things
df <- df3
rm(df3, dat, positive_ids, remove_ids)
gc()
```
# Machine Learning Classification
## Bag of Words
One of the simpler things to do with text is to treat each text as a "bag of words". We have used the `tm` package in order to construct a Term Document Matrix but the computer couldn't handle such huge dimensions. So let's go with `tidytext`
Let's discover the top words for both positive and negative ratings. We use the `wordcloud` package to have a nice display.
```{r warning=FALSE}
emo_pos <- df %>% filter(rating == 1) %>%
unnest_tokens(word, reviewText) %>%
filter(!word %in% stop_words$word) %>%
filter(nchar(word) > 2) %>%
count(word)
emo_neg <- df %>% filter(rating == 0) %>%
unnest_tokens(word, reviewText) %>%
filter(!word %in% stop_words$word) %>%
filter(nchar(word) > 2) %>%
count(word)
layout(matrix(c(1:4), nrow=2, ncol = 2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "Positive ratings", cex=2)
wordcloud(words = emo_pos$word, freq = emo_pos$n, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
plot.new()
text(x=0.5, y=0.5, "Negative ratings", cex=2)
wordcloud(words = emo_neg$word, freq = emo_neg$n, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
```
## Text tidiying
Do a series of transformations :
- lowercase
- remove punctuation
- strip white space
```{r}
df$reviewText <- gsub( "[^a-zA-Z]", "\\1 \\2", tolower(df$reviewText))
df$reviewText <- gsub( " +", " ", df$reviewText)
```
Use `nrc` lexicon as a bag of words that have sentiments but we're not going to look into these sentiments for the time being.
```{r}
bow <- sentiments %>%
filter(lexicon == "nrc") %>%
dplyr::select(word)
```
## Creating features from Bag of Words
Use `unrest_tokens` and join function in order to convert our reviews into a sparse Matrix.
```{r}
Xdf <- df %>%
unnest_tokens(word, reviewText) %>%
filter(!word %in% stop_words$word) %>%
inner_join(bow) %>%
count(word, id) %>%
spread(word, n, fill = 0)
## join back to save the truth y
Xdf2 <- df %>% right_join(Xdf, by = "id")
## save the truth in order:
Y <- Xdf2[,2]
Xdf <- Xdf2 %>% dplyr::select(-c(1:3))
```
## TF-IDF
Term Frequency - Inverse Document Frequency is term count within a document weighted against the term's ubiquity within the corpus. This weight is based on the principle that terms occurring in almost every document are therefore less specific to an individual document and should be scaled down.
So a tf-idf value represents the term's relative importance within a document.
Compute tf-idf, inverse document frequency, and relative term frequency on document-feature matrices
$$tf(t,d) = \frac{f_{d}(t)}{\underset{w \in d}{max}}$$
$$idf(t,D) = log \left (\frac{|D|}{|d \in D : t \in d|} \right )$$
$$tfidf(t,d,D) = tf(t,d)\cdot idf(t,D)$$
$f_d(t):=$ freqency of term $t$ in document $d$
$D :$ corpus of documents
$|D| :$ number of documents where the term $t$ appears
$|d \in D : t \in d|\ :$ number of documents where the term $t$ appears
```{r}
## tfidf function implementaion
## the normalize will divide x by the text total freq.
tfidf <- function(x, normalize=TRUE){
if(normalize) x <- x/rowSums(x)
tf <- t(x)
idf <- log(nrow(x)) - log(colSums(x>0) + 1)
t( tf*idf )
}
Xdf <- tfidf(Xdf)
```
## Reduce dimensionality (Single Vector Decomposition)
Suppose we have m words (features) and n documents
Single vector decomposition (SVD) of an m*n real or complex matrix X is a factorization of the form :
$$\mathbf{ X = U \Sigma V^{T} }$$
$$ U\ is\ m x r\ matrix,\ columns\ of\ U\ contain\ the\ Eigenvectors\ of\ X \cdot X^{T} $$
$$ V\ is\ an\ r x n,\ columns\ of\ V\ contain\ the\ Eigenvectors\ of\ X^{T} \cdot X $$
$$ \Sigma\ is\ a\ diagonal\ r x r\ matrix.\ diagonal\ values\ are\ eigenvalues\ of\ X \cdot X^{T}$$
$$\begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1}& \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\
x_{m,1}& \cdots & \cdots & x_{m,n} \\ \end{bmatrix}= \begin{bmatrix} u_{1,1} & u_{1,2} & \cdots & u_{1,r} \\ u_{2,1}& \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\u_{m,1}& \cdots & \cdots & u_{m,r} \\\end{bmatrix} \cdot diag \begin{bmatrix}d_{1}\\\vdots \\ \vdots \\ d_{r} \\ \end{bmatrix}\cdot \begin{bmatrix} v_{1,1} & v_{1,2} & \cdots & v_{1,n} \\ v_{2,1}& \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ v_{r,1}& \cdots & \cdots & v_{r,n} \\ \end{bmatrix}$$
It's basically a PCA but without mean shifting.But SVD works better in SA because it is able to detect and extract small signals from noisy data. Noisy data here means words that are not significant for prediction.
In this context, it is known as latent semantic analysis (LSA).
```{r cache=T}
# we transpose id because we want docunent as
Xmat <- t(as.matrix(Xdf))
dim(Xmat)
# calculate the sparsity of the matrix
sprintf("Matrix sparsity = %.2f%%", 100*(1- sum(Xmat > 0) / length(Xmat)))
# call svd on this matrix
Xmat.svd <- svd(Xmat, LINPACK=T)
#save(Xmat.svd, file=paste(this.dir, "Xmat.svd", ".RData", sep=""))
#load(paste(this.dir, 'Xmat.svd', '.RData', sep=''))
sprintf("U matrix dimension is: %dx%d", dim(Xmat.svd$u)[1], dim(Xmat.svd$u)[2])
sprintf("d (sigma) matrix is : %d ", length(Xmat.svd$d))
sprintf("V matrix dimension is: %dx%d", dim(Xmat.svd$v)[1], dim(Xmat.svd$v)[2])
```
The diagonal vector `d` generated is ordered by importance of each dimension.
Let's graph a cumulative variance:
```{r}
s <- sum(Xmat.svd$d)
d <- data.frame(var = cumsum(Xmat.svd$d^2/sum(Xmat.svd$d^2)))
ggplot(d, aes(x=1:nrow(d), y=var)) + geom_line() +
xlab("number of eigenvectors") +
ylab("Cumulative variance") + scale_x_log10()
```
> Design decision
>
>We have to decide how many dimensions to keep. Looking at the graph we can keep 10 dimensions which represent almost 99% of the features.
This is curious and a puzzling choice to make. But let's aim for fast processing and later on see if adding more dimensions improves our prediction.
In order to rotate into our new space of reduced dimensions we have to limit the number of eigenvectors in `U` then transpose it an multiply it by the original X :
$$ \hat{X} = U^{T}\cdot X $$
```{r}
## trauncate 10% of the column
truncate <- round(0.1 * ncol(Xmat.svd$u))
Xmat_h <- t(Xmat.svd$u[,1:truncate]) %*% Xmat
```
Transform it back to a dataframe and bring back the truth y
```{r}
X_h <- data.frame(y=Y , t(Xmat_h))
```
If we consider the first 2 dimensions as predictors let's see how it looks
```{r}
X_h %>%
ggplot(aes(X2, X1, color=factor(y))) +
geom_jitter() # + scale_x_log10() + scale_y_log10()
```
## Supervised training
Now that we know some basic facts about our data set,
let's randomly split the data into training and test data.
We set the seed `set.seed(755)` and use `sample()` function to select
your test index to create two separate data frames
called: `train` and `test` from the original `ratings` data frame.
`test` contains a randomly selected 20% of the rows and training the other 80%. We will
use these data frames to do the rest of the analyses in the problem set.
```{r}
set.seed(123)
# sample from 1 to the length of X_h
test_sample <- sample(1:nrow(X_h), nrow(X_h)/5)
test <- X_h[test_sample,]
train <- X_h[-test_sample,]
# just checking the dimensions
nrow(train) / nrow(test)
```
As we go along we will be comparing different approaches. Let's start by creating a benchmark table:
```{r}
benchmark <- data_frame()
```
### Naive Bayes supervised training
The basic idea to find the probabilities of categories given a text document by using the joint probabilities of words and categories. It is based on the assumption of word independence.
The starting point is the Bayes theorem for conditional probability, stating that, for a given data point x and class C:
$$P(C/x) = \frac{P(x/C) \cdot P(C)}{P(x)}$$
By making the assumption that for a data point x = {x1,x2,…xj}, the probability of each of its attributes occurring in a given class is independent, we can estimate the probability of x as follows :
$$P(C/x) = P(C) \cdot \prod_{j} P(x_{j}/C)$$
Now, we can train the naive Bayes model with the training set. We'll be using e1071 package made by David Meyer from TU Wien.
The naiveBayes function requires a
```{r}
# transform into matrices
mat_train <- as.matrix(train)
mat_test <- as.matrix(test)
# train the model
nb.model = naiveBayes(mat_train[,-1], ## these are being the features
as.factor(mat_train[,1]) ) ## this is the truth y
nb.pred = predict(nb.model, mat_test[,-1])
# calcualte the confusion matrix
table(as.factor(mat_test[,1]), nb.pred)
## calcualte accuracy
accuracy <- function(true_, predicted_)
{
true_ <- as.vector(true_)
predicted_ <- as.vector(predicted_, mode=class(true_))
res <- predicted_ == true_
accuracy <- length(res[res == TRUE])/length(true_)
return(accuracy)
}
nb.accuracy <- accuracy(as.factor(mat_test[,1]), nb.pred) ; nb.accuracy
benchmark <- bind_rows(benchmark,
data_frame(method="Naive Bayes",
accu = nb.accuracy ))
## calculate RMSE
RMSE <- function(true_ratings, predicted_ratings){
sqrt(mean((true_ratings - predicted_ratings)^2))
}
RMSE(mat_test[,1], as.numeric(nb.pred))
```
### Support Vector Machine
SVMs were developed by Cortes & Vapnik (1995) [1] for binary classification. Their approach may be roughly sketched as follows:
> Class separation: basically, we are looking for the optimal separating hyperplane between the two classes by maximizing the margin between the classes’ closest points —the points lying on the boundaries are called support vectors, and the middle of the margin is our optimal separating hyperplane;
[1] : Cortes, C. & Vapnik, V. (1995). Support-vector network. Machine Learning, 20, 1–25
```{r}
## svm
svm.model <- svm(mat_train[,-1], ## these are being the features
as.factor(mat_train[,1]), ## this is the truth y
cost = 5, gamma = 1)
svm.pred <- predict(svm.model, mat_test[,-1])
## compute svm confusion matrix
table(as.factor(mat_test[,1]), svm.pred)
## compute svm accuracy
svm.accuracy <- accuracy(as.factor(mat_test[,1]), svm.pred) ; svm.accuracy
benchmark <- bind_rows(benchmark,
data_frame(method="Support Vector Machine",
accu = svm.accuracy ))
```
### Logistic regression
```{r warning=FALSE}
## glm
glm.model <- glm(y~., data=train, family=binomial)
glm.pred <- predict(glm.model, newdata=test, type="response")
summary(glm.model)
## compute lr confusion matrix assuming a cutoff at 0.5
table(test$y, glm.pred>.5)
## compute lr accuracy
accuracy(test$y, glm.pred>.5)
```
In the previous model we were assuming a less optimal cut-off. Let's use LDA assuming of course that our covariates are bivariate normal
```{r}
lda.model <- lda(y ~ ., train)
plda = predict(lda.model, newdata = test, type="response")
## compute lda confusion matrix
lda.tab <- table(test$y, plda$class)
lda.accuracy <- confusionMatrix(lda.tab , positive="1")$overall[[1]];lda.accuracy
benchmark <- bind_rows(benchmark,
data_frame(method="Linear Discriminent Analysis",
accu = lda.accuracy ))
```
### Cross-Validation of Logistic Regression
```{r warning=FALSE}
#define training control
train_control <- trainControl(method="cv", number=10)
X_h <- mutate(X_h, y = factor(y))
# train the model
model <- train(y ~ .,
data = X_h,
method = "glm",
trControl = train_control,
tuneLength = 1, # How fine a mesh to go on grid
#tuneGrid=data.frame(k=seq(1,3,1)),
metric="Accuracy")
# summarize results
print(model)
```
### Randon Forest
The `randomForest` prediction function works similarly to decision trees
```{r}
rf.fit <- randomForest(as.factor(y) ~ .,
data=train,
importance=TRUE,
mtry = 3,
ntree=2000)
rf.pred <- predict(rf.fit, newdata = test, type="response" )
## compute confusion matrix
table(as.factor(test$y), rf.pred )
## compute accuracy
rf.accuracy <- accuracy(as.factor(test$y), rf.pred );rf.accuracy
benchmark <- bind_rows(benchmark,
data_frame(method="Random Forest",
accu = rf.accuracy ))
```
```{r}
rf.fit$importance[1:5,]
```
## Conclusion
In this section we used a supervised machine learning approach and tried several classification models. From the table below, the algorithmic approach using randomForest gives the best accuracy. we are 25% better than flipping a coin with only a classified bag of words.
```{r}
# summary of 5 methods
benchmark %>% kable
```