In [1]:
paste("R version is:", paste0(R.Version()[c("major","minor")], collapse = "."))

## Scoring Opinions and Sentiments

### Understanding How Machines Read

In this chapter we make extensive use of the tm library (http://tm.r-forge.r-project.org/) for R, which transforms textual data into numeric matrices.



In [2]:
# installing tm library, if not yet available
if (!("tm" %in% rownames(installed.packages()))) {
    install.packages("tm")
}

In [3]:
text_1 <- 'The quick brown fox jumps over the lazy dog.'
text_2 <- 'My dog is quick and can jump over fences.'
text_3 <- 'Your dog is so lazy that it sleeps all the day.'
corpus <- c(text_1, text_2, text_3)

R lacks direct equivalents of the text-vectorization functions that the Scikit-learn package provides in Python, but we can obtain similar results with tm data structures such as the *DocumentTermMatrix*.

In [4]:
library(tm)
corpus <- VCorpus(VectorSource(corpus)) 
dtm <- DocumentTermMatrix(corpus,            
                          control = list(removePunctuation = TRUE,
                                         stopwords=FALSE,
                                         tolower = TRUE)
)
inspect(dtm)

Loading required package: NLP



<<DocumentTermMatrix (documents: 3, terms: 17)>>
Non-/sparse entries: 23/28
Sparsity           : 55%
Maximal term length: 6
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs all and brown can day dog lazy over quick the
   1   0   0     1   0   0   1    1    1     1   2
   2   0   1     0   1   0   1    0    1     1   0
   3   1   0     0   0   1   1    1    0     0   1
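
To make the counts above less opaque, here is a minimal base-R sketch of what a document-term matrix holds. It assumes plain whitespace tokenization and tm's default of ignoring words shorter than three characters (wordLengths = c(3, Inf)). The helper tokenize and the name manual_dtm are illustrative and not part of tm.

```r
docs <- c("The quick brown fox jumps over the lazy dog.",
          "My dog is quick and can jump over fences.",
          "Your dog is so lazy that it sleeps all the day.")

# Lowercase, strip punctuation, split on whitespace, drop short words
tokenize <- function(text, min_chars = 3) {
  clean <- gsub("[[:punct:]]", "", tolower(text))
  tokens <- strsplit(clean, "[[:space:]]+")[[1]]
  tokens[nchar(tokens) >= min_chars]
}

tokens_by_doc <- lapply(docs, tokenize)
vocabulary <- sort(unique(unlist(tokens_by_doc)))

# One row per document, one column per vocabulary term
manual_dtm <- t(vapply(tokens_by_doc, function(tokens) {
  as.integer(table(factor(tokens, levels = vocabulary)))
}, integer(length(vocabulary))))
dimnames(manual_dtm) <- list(Docs = as.character(seq_along(docs)),
                             Terms = vocabulary)

manual_dtm
```

On these three sentences the sketch yields a 3 x 17 matrix with 23 non-zero cells, matching the summary printed by inspect(dtm).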


To extract the dictionary of terms from our matrix, we can use the Terms function:

In [5]:
Terms(dtm)

### Processing and Enhancing Text

In [6]:
text_4 <- 'A black dog just passed by but my dog is brown.'
corpus <- c(text_1, text_2, text_3, text_4)
corpus <- VCorpus(VectorSource(corpus)) 
dtm <- DocumentTermMatrix(corpus,            
                          control = list(removePunctuation = TRUE,
                                         stopwords=FALSE,
                                         tolower = TRUE)
                         )
inspect(dtm)

<<DocumentTermMatrix (documents: 4, terms: 21)>>
Non-/sparse entries: 29/55
Sparsity           : 65%
Maximal term length: 6
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs all and black brown but dog lazy over quick the
   1   0   0     0     1   0   1    1    1     1   2
   2   0   1     0     0   0   1    0    1     1   0
   3   1   0     0     0   0   1    1    0     0   1
   4   0   0     1     1   1   2    0    0     0   0


In [7]:
apply(as.matrix(dtm), 2, sum)

In [8]:
dtm <- DocumentTermMatrix(corpus,
           control = list(weighting = function(x) weightTfIdf(x, normalize=TRUE),
                          removePunctuation = TRUE,
                          stopwords=FALSE,
                          tolower = TRUE)
                          )


inspect(dtm)

<<DocumentTermMatrix (documents: 4, terms: 21)>>
Non-/sparse entries: 25/59
Sparsity           : 70%
Maximal term length: 6
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
Sample             :
    Terms
Docs       and     black     brown       but       can    fences      jump
   1 0.0000000 0.0000000 0.1111111 0.0000000 0.0000000 0.0000000 0.0000000
   2 0.2857143 0.0000000 0.0000000 0.0000000 0.2857143 0.2857143 0.2857143
   3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
   4 0.0000000 0.2857143 0.1428571 0.2857143 0.0000000 0.0000000 0.0000000
    Terms
Docs      just    passed       the
   1 0.0000000 0.0000000 0.2222222
   2 0.0000000 0.0000000 0.0000000
   3 0.0000000 0.0000000 0.1250000
   4 0.2857143 0.2857143 0.0000000
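
The weights above can be reproduced by hand. The following sketch assumes the formula that weightTfIdf applies with normalize = TRUE: each count is divided by the number of terms kept in its document, then multiplied by log2(N / df), where N is the number of documents and df is the number of documents containing the term. The variable names are illustrative.

```r
docs <- c("The quick brown fox jumps over the lazy dog.",
          "My dog is quick and can jump over fences.",
          "Your dog is so lazy that it sleeps all the day.",
          "A black dog just passed by but my dog is brown.")

# Tokenize: lowercase, drop punctuation, keep words of 3+ characters
tokens_by_doc <- lapply(docs, function(text) {
  clean <- gsub("[[:punct:]]", "", tolower(text))
  tokens <- strsplit(clean, "[[:space:]]+")[[1]]
  tokens[nchar(tokens) >= 3]
})
vocabulary <- sort(unique(unlist(tokens_by_doc)))

# Raw counts: one row per document, one column per term
counts <- t(vapply(tokens_by_doc, function(tokens) {
  as.numeric(table(factor(tokens, levels = vocabulary)))
}, numeric(length(vocabulary))))
colnames(counts) <- vocabulary

tf    <- counts / rowSums(counts)                  # share of each term in its document
idf   <- log2(nrow(counts) / colSums(counts > 0))  # rarer terms get larger weights
tfidf <- sweep(tf, 2, idf, `*`)

round(tfidf[, c("the", "brown", "black", "dog")], 7)
```

The term dog occurs in all four documents, so its idf is log2(4/4) = 0 and its weight vanishes everywhere. That is why the number of non-sparse entries drops from 29 to 25 after reweighting.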


As explained in the tm FAQ (http://tm.r-forge.r-project.org/faq.html#Bigrams), n-grams can be obtained using the ngrams function from the NLP package. In our example we prefer the Weka tokenizers (Weka_tokenizers) from the RWeka package because they are faster and more robust; their usage is quite similar.

In [9]:
if (!require("RWeka")) install.packages("RWeka", repos='http://cran.us.r-project.org')

Loading required package: RWeka



In [10]:
library(RWeka)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))

dtm <- TermDocumentMatrix(corpus, 
                          control=list(tokenize=BigramTokenizer,
                                       removePunctuation = TRUE,
                                       stopwords=FALSE,
                                       tolower = TRUE))

inspect(dtm)

<<TermDocumentMatrix (terms: 33, documents: 4)>>
Non-/sparse entries: 36/96
Sparsity           : 73%
Maximal term length: 11
Weighting          : term frequency (tf)
Sample             :
           Docs
Terms       1 2 3 4
  a black   0 0 0 1
  all the   0 0 1 0
  and can   0 1 0 0
  black dog 0 0 0 1
  brown fox 1 0 0 0
  but my    0 0 0 1
  by but    0 0 0 1
  can jump  0 1 0 0
  dog is    0 1 1 1
  my dog    0 1 0 1
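
For intuition, word bigrams can also be produced in base R by pairing every token with its successor. This is only a sketch of the idea, assuming plain whitespace tokenization; it is not how NGramTokenizer is implemented, and the function name word_bigrams is hypothetical.

```r
# Pair each token with the one that follows it
word_bigrams <- function(text) {
  clean <- gsub("[[:punct:]]", "", tolower(text))
  tokens <- strsplit(clean, "[[:space:]]+")[[1]]
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1))
}

docs <- c("The quick brown fox jumps over the lazy dog.",
          "My dog is quick and can jump over fences.",
          "Your dog is so lazy that it sleeps all the day.",
          "A black dog just passed by but my dog is brown.")

bigrams_by_doc <- lapply(docs, word_bigrams)
bigrams_by_doc[[4]]

# Number of distinct bigrams across the corpus
length(unique(unlist(bigrams_by_doc)))
```

On the four sentences this gives 33 distinct bigrams, the same number of terms reported by inspect(dtm).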


In [11]:
Terms(dtm)

### Stemming and Removing Stop Words

In [12]:
if (!("SnowballC" %in% rownames(installed.packages()))) {
    install.packages("SnowballC")
}

In [13]:
dtm <- TermDocumentMatrix(corpus, 
                          control=list(removePunctuation = TRUE,
                                       stopwords = stopwords("english"), 
                                       tolower = TRUE, 
                                       stemming = TRUE
                                       ))

inspect(dtm)

<<TermDocumentMatrix (terms: 13, documents: 4)>>
Non-/sparse entries: 20/32
Sparsity           : 62%
Maximal term length: 5
Weighting          : term frequency (tf)
Sample             :
       Docs
Terms   1 2 3 4
  black 0 0 0 1
  brown 1 0 0 1
  can   0 1 0 0
  day   0 0 1 0
  dog   1 1 1 2
  fenc  0 1 0 0
  fox   1 0 0 0
  jump  1 1 0 0
  lazi  1 0 1 0
  quick 1 1 0 0


### Scraping Textual Datasets from the Web

In [14]:
if (!("rvest" %in% rownames(installed.packages()))) {
    install.packages("rvest")
}

library(rvest)

Loading required package: xml2



In [15]:
wiki <- "https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population"
html_data <- read_html(wiki)

In [16]:
extracted_tables <- html_data %>% html_table(fill=TRUE)

In [17]:
table_of_cities <- extracted_tables[[5]]

In [18]:
selected_columns <- c(1, 2, 3, 4, 5, 7)
table_of_cities[1:5, selected_columns]

  2019rank City        State[c]   2019estimate 2010Census 2016 land area
     <int> <chr>       <chr>      <chr>        <chr>      <chr>
1        1 New York[d] New York   8336817      8175133    301.5 sq mi
2        2 Los Angeles California 3979576      3792621    468.7 sq mi
3        3 Chicago     Illinois   2693976      2695598    227.3 sq mi
4        4 Houston[3]  Texas      2320268      2100263    637.5 sq mi
5        5 Phoenix     Arizona    1680992      1445632    517.6 sq mi


### Using Scoring and Classification

In [19]:
# installing feather library, if not yet available
if (!("feather" %in% rownames(installed.packages()))) {
    install.packages("feather")
}

In [20]:
# installing RCurl library, if not yet available
if (!("RCurl" %in% rownames(installed.packages()))) {
    install.packages("RCurl")
}

In [21]:
library(feather)
library(RCurl)

In [22]:
url <- "https://github.com/lmassaron/datasets/releases/download/1.0/shakespeare_lines_in_plays.feather"
destfile <- "shakespeare_lines_in_plays.feather"
download.file(url, destfile, mode =  "wb")

In [23]:
shakespeare <- read_feather(destfile)

In [24]:
library(tm)

if (!("irlba" %in% rownames(installed.packages()))) {
    install.packages("irlba")
}

library(irlba)

if (!("Matrix" %in% rownames(installed.packages()))) {
    install.packages("Matrix")
}

library(Matrix)

Loading required package: Matrix



In [25]:
corpus <- VCorpus(VectorSource(shakespeare$lines)) 
dtm <- DocumentTermMatrix(corpus,
           control = list(weighting=function(x) weightTfIdf(x, normalize=TRUE),
                          stopwords=stopwords("english"), 
                          removePunctuation=TRUE,
                          removeNumbers=TRUE,
                          tolower=TRUE,
                          wordLengths=c(4,Inf))
                          )

inspect(dtm)

<<DocumentTermMatrix (documents: 217, terms: 26675)>>
Non-/sparse entries: 195029/5593446
Sparsity           : 97%
Maximal term length: 37
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
Sample             :
     Terms
Docs  antipholus      enter     exeunt       exit        king    reenter
  1    0.4537249 0.10261048 0.11811845 0.13563933 0.000000000 0.01860786
  109  0.0000000 0.08278573 0.12811678 0.12923548 0.004722556 0.04986377
  121  0.0000000 0.11077519 0.15882713 0.11978962 0.000000000 0.00000000
  133  0.0000000 0.10086558 0.18944237 0.10877148 0.000000000 0.04476587
  139  0.0000000 0.08862015 0.11294374 0.18157585 0.000000000 0.07265336
  169  0.0000000 0.08422998 0.12686652 0.17258073 0.000000000 0.09147436
  176  0.0000000 0.06667025 0.09150534 0.09807338 0.000000000 0.05766139
  200  0.0000000 0.10553582 0.12591700 0.22696981 0.000000000 0.08836219
  31   0.0000000 0.12389330 0.16718645 0.10618471 0.000000000 0.01365665
  7    0.000

In [26]:
terms <- dtm$dimnames[[2]]

In [27]:
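# Pass dims explicitly so that trailing empty rows or columns are not dropped
sparse_matrix <- sparseMatrix(i=dtm$i, j=dtm$j, x=dtm$v,
                              dims=c(dtm$nrow, dtm$ncol))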

In [28]:
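# Truncated SVD: compute only the first n_topics singular triplets
n_topics <- 10
res <- irlba(sparse_matrix, n_topics)

In [29]:
# Right singular vectors: one row per term, one column per topic
topics <- res$v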

In [30]:
top_words <- 5
for (topic in 1:n_topics) {
    # Rank terms by the absolute value of their loading on this topic
    ranked <- order(abs(topics[, topic]), decreasing=TRUE)
    print(paste("topic", topic,
                "| top words:", paste(terms[ranked[1:top_words]], collapse=" ")))
}

[1] "topic 1 | top words: exit exeunt enter scene reenter"
[1] "topic 2 | top words: syracuse antipholus dromio ephesus luciana"
[1] "topic 3 | top words: maria toby belch fabian viola"
[1] "topic 4 | top words: iago othello desdemona cassio emilia"
[1] "topic 5 | top words: orlando leonato rosalind pedro celia"
[1] "topic 6 | top words: iago othello cassio desdemona emilia"
[1] "topic 7 | top words: ariel ferdinand caliban costard biondello"
[1] "topic 8 | top words: biondello bianca katharina lucentio ariel"
[1] "topic 9 | top words: ariel costard moth boyet biron"
[1] "topic 10 | top words: proteus silvia thurio salarino provost"
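
The loop above reads topics off the right singular vectors. The same mechanics can be seen on a toy matrix with base R's svd(), used here as a stand-in for irlba, which computes only the leading singular vectors. The matrix and the way its term names are grouped are invented for illustration.

```r
# Toy document-term matrix with two non-overlapping groups of terms
toy <- matrix(c(3, 2, 0, 0,
                2, 3, 0, 0,
                0, 0, 5, 2,
                0, 0, 2, 5),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("iago", "othello", "ariel", "caliban")))

decomp <- svd(toy)
k <- 2
loadings <- decomp$v[, 1:k]   # one row per term, one column per topic

# For each topic, pick the two terms with the largest absolute loading
top_terms <- lapply(seq_len(k), function(topic) {
  colnames(toy)[order(abs(loadings[, topic]), decreasing = TRUE)[1:2]]
})
top_terms
```

Singular vectors are defined only up to sign, hence the abs() when ranking terms. In this toy case each of the two leading components isolates one group of terms.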


### Analyzing Reviews from E-commerce

In [31]:
library(feather)
library(RCurl)

In [32]:
url <- "https://github.com/lmassaron/datasets/releases/download/1.0/imdb_50k.feather"
destfile <- "imdb_50k.feather"
download.file(url, destfile, mode =  "wb")

In [33]:
reviews <- read_feather(destfile)

In [34]:
# On Windows, first install Rtools following instructions at: https://cran.r-project.org/bin/windows/Rtools/

if (!("keras" %in% rownames(installed.packages()))) {
    install.packages("backports", type='binary')
    install.packages("devtools")
    devtools::install_github("rstudio/keras", force=TRUE)
    reticulate::py_config()
} else {
    reticulate::py_config()
}

# If necessary, please download and install Rtools 3.5 from http://cran.r-project.org/bin/windows/Rtools/

"path[1]="C:\Users\Luca\anaconda3\envs\ml4d/python.exe": The system cannot find the file specified"


python:         C:/Users/Luca/anaconda3/python.exe
libpython:      C:/Users/Luca/anaconda3/python37.dll
pythonhome:     C:/Users/Luca/anaconda3
version:        3.7.6 (default, Jan  8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:/Users/Luca/anaconda3/Lib/site-packages/numpy
numpy_version:  1.18.1

python versions found: 
 C:/Users/Luca/anaconda3/python.exe
 C:/Users/Luca/anaconda3/envs/algo4dummies/python.exe
 C:/Users/Luca/anaconda3/envs/dl4dummies/python.exe
 C:/Users/Luca/anaconda3/envs/ml4dit/python.exe
 C:/Users/Luca/anaconda3/envs/p4ds4d/python.exe

In [35]:
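library(keras) # installation instructions are in the Chapter 14 notebook

# num_words is defined here but never passed to text_tokenizer(),
# so the tokenizer keeps the full vocabulary
num_words=10000
tokenizer <- text_tokenizer()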
tokenizer %>% fit_text_tokenizer(reviews$review[1:30000])

max_len = 256

X <- pad_sequences(texts_to_sequences(tokenizer, reviews$review[1:30000]), maxlen=max_len)
y <- reviews$sentiment[1:30000]

Xv <- pad_sequences(texts_to_sequences(tokenizer, reviews$review[30001:40000]), maxlen=max_len)
yv <- reviews$sentiment[30001:40000]

Xt <- pad_sequences(texts_to_sequences(tokenizer, reviews$review[40001:50000]), maxlen=max_len)
yt <- reviews$sentiment[40001:50000]
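
pad_sequences turns token sequences of varying length into a fixed-width integer matrix. The base-R sketch below mimics its behaviour for a single sequence, assuming the Keras defaults of padding and truncating at the start of the sequence. The function pad_left is hypothetical and not part of keras.

```r
# Left-pad with `value` up to max_len, or keep only the last max_len tokens
pad_left <- function(sequence, max_len, value = 0) {
  n <- length(sequence)
  if (n >= max_len) {
    # Too long: drop tokens from the start
    return(sequence[(n - max_len + 1):n])
  }
  c(rep(value, max_len - n), sequence)
}

pad_left(c(5, 3), 4)   # 0 0 5 3
pad_left(1:6, 4)       # 3 4 5 6
```

Index 0 is reserved for padding, which is one reason the embedding layer is later sized as the largest token index plus one.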

In [36]:
embedding_size <- 8

# Add layers to the model

model <- keras_model_sequential() %>% 
  layer_embedding(
      input_dim=max(X, Xv, Xt) + 1, 
      output_dim=embedding_size, 
      input_length=max_len) %>% 
  layer_flatten() %>%
  layer_dropout(rate=0.2) %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile the model
model %>% compile(
  loss = loss_binary_crossentropy,
  optimizer = optimizer_adam(),
  metrics=c('acc')
)

# Summary of the model
model

Model
Model: "sequential"
________________________________________________________________________________
Layer (type)                        Output Shape                    Param #     
embedding (Embedding)               (None, 256, 8)                  787184      
________________________________________________________________________________
flatten (Flatten)                   (None, 2048)                    0           
________________________________________________________________________________
dropout (Dropout)                   (None, 2048)                    0           
________________________________________________________________________________
dense (Dense)                       (None, 1)                       2049        
Total params: 789,233
Trainable params: 789,233
Non-trainable params: 0
________________________________________________________________________________



In [37]:
# Setting the model's training parameters
epochs=2
batch_size=16

# Training the model
history <- model %>% fit(
    X, y,
    batch_size = batch_size,
    epochs = epochs,
    validation_data=list(Xv, yv),
    verbose=1
        )

In [38]:
# Computing validation metrics
scores <- model %>% evaluate(Xv, yv, verbose=0)

# Printing the scores
cat('Validation loss:', scores[[1]], '\n')
cat('Validation accuracy:', scores[[2]], '\n')

Validation loss: 0.2561352 
Validation accuracy: 0.897 


In [39]:
# Computing test metrics
scores <- model %>% evaluate(Xt, yt, verbose=0)

# Printing the scores
cat('Test loss:', scores[[1]], '\n')
cat('Test accuracy:', scores[[2]], '\n')

Test loss: 0.2496146 
Test accuracy: 0.8988 


In [40]:
proba <- model %>% predict(Xt)

# Transforming probabilities into a binary variable
# using the 0.5 probability threshold
preds <- as.numeric(proba>=0.5)
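
Once probabilities are thresholded, accuracy and a confusion matrix follow directly. Here is a small sketch with made-up probabilities and labels; toy_proba and toy_labels are not taken from the IMDb data.

```r
toy_proba  <- c(0.9, 0.2, 0.6, 0.4, 0.5)   # hypothetical model outputs
toy_labels <- c(1, 0, 0, 1, 1)             # hypothetical true sentiments

# Same rule as above: probabilities of at least 0.5 count as positive
toy_preds <- as.numeric(toy_proba >= 0.5)

# Share of predictions that agree with the labels
accuracy <- mean(toy_preds == toy_labels)

# Rows are predictions, columns are true labels
confusion <- table(predicted = toy_preds, actual = toy_labels)

accuracy
confusion
```

With the real model, the same two computations applied to preds and yt should reproduce the test accuracy reported by evaluate().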