### Some packages that can be used:

[CRAN Task View for NLP in R](https://cran.rstudio.com/web/views/NaturalLanguageProcessing.html)  
[CRAN Task View for Machine Learning and Statistical Learning](https://cran.r-project.org/web/views/MachineLearning.html)  


In [1]:
library(tm)
library(coreNLP)
library(openNLP)
library(tidyverse)
library(tidytext)
library(topicmodels)

Loading required package: NLP
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.3.0
✔ tibble  2.0.1     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.3.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::annotate() masks NLP::annotate()
✖ dplyr::filter()     masks stats::filter()
✖ dplyr::lag()        masks stats::lag()


In [2]:
data(crude)

In [3]:
crude

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20

In [4]:
crude[[9]]$content
s9 = as.String(crude[[9]])
s9

The Gulf oil state of Qatar, recovering
slightly from last year's decline in world oil prices,
announced its first budget since early 1985 and projected a
deficit of 5.472 billion riyals.
    The deficit compared with a shortfall of 7.3 billion riyals
in the last published budget for 1985/86.
    In a statement outlining the budget for the fiscal year
1987/88 beginning today, Finance and Petroleum Minister Sheikh
Abdul-Aziz bin Khalifa al-Thani said the government expected to
spend 12.217 billion riyals in the period.
    Projected expenditure in the 1985/86 budget had been 15.6
billion riyals.
    Sheikh Abdul-Aziz said government revenue would be about
6.745 billion riyals, down by about 30 pct on the 1985/86
projected revenue of 9.7 billion.
    The government failed to publish a 1986/87 budget due to
uncertainty surrounding oil revenues.
    Sheikh Abdul-Aziz said that during that year the government
decided to limit recurrent expenditure each month to
one-twelfth of the previous f

### Basic Tokenization:

In [5]:
word_ann = Maxent_Word_Token_Annotator()
sent_ann = Maxent_Sent_Token_Annotator()

In [6]:
crude9_ann = NLP::annotate(s9, list(sent_ann, word_ann))

In [7]:
crude9_ann_doc = AnnotatedPlainTextDocument(s9, crude9_ann)

In [8]:
words(crude9_ann_doc)

In [9]:
wds = words(crude9_ann_doc)
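# Drop punctuation tokens. Logical indexing is safe even when nothing matches,
# whereas -which(...) would return an empty vector in that case.
wds = wds[!(wds %in% c('.', ','))]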
wds

In [10]:
bing_sentiments = get_sentiments('bing')
bing_sentiments

word,sentiment
2-faced,negative
2-faces,negative
a+,positive
abnormal,negative
abolish,negative
abominable,negative
abominably,negative
abominate,negative
abomination,negative
abort,negative


In [11]:
# Keep wds as character so the join does not have to coerce a factor column
inner_join(data.frame(wds, stringsAsFactors = FALSE), bing_sentiments, by = c("wds" = "word"))

wds,sentiment
decline,negative
failed,negative
limit,negative
limit,negative
burden,negative
positive,positive
foremost,positive
protect,positive
helped,positive
reasonable,positive
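
One way to summarise a join like this is to count the matches per polarity and take the difference. The sketch below is only an illustration in base R: it re-enters the ten rows visible above by hand (the printed result may be truncated, so the real counts could differ), and the names `matched`, `n_pos`, `n_neg` and `net_score` are hypothetical.

```r
# Rows copied from the join output printed above (possibly incomplete).
matched <- data.frame(
  wds = c("decline", "failed", "limit", "limit", "burden",
          "positive", "foremost", "protect", "helped", "reasonable"),
  sentiment = c(rep("negative", 5), rep("positive", 5)),
  stringsAsFactors = FALSE
)

# Count the matched words for each polarity.
n_pos <- sum(matched$sentiment == "positive")
n_neg <- sum(matched$sentiment == "negative")

# Net score: positive matches minus negative matches.
net_score <- n_pos - n_neg
net_score
```

Note that the lexicon lookup is case-sensitive, so lower-casing the tokens before the join would probably catch a few more matches.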


In [12]:
sents(crude9_ann_doc)

[[1]]
 [1] "The"        "Gulf"       "oil"        "state"      "of"        
 [6] "Qatar"      ","          "recovering" "slightly"   "from"      
[11] "last"       "year"       "'s"         "decline"    "in"        
[16] "world"      "oil"        "prices"     ","          "announced" 
[21] "its"        "first"      "budget"     "since"      "early"     
[26] "1985"       "and"        "projected"  "a"          "deficit"   
[31] "of"         "5.472"      "billion"    "riyals"     "."         

[[2]]
 [1] "The"       "deficit"   "compared"  "with"      "a"         "shortfall"
 [7] "of"        "7.3"       "billion"   "riyals"    "in"        "the"      
[13] "last"      "published" "budget"    "for"       "1985/86"   "."        

[[3]]
 [1] "In"         "a"          "statement"  "outlining"  "the"       
 [6] "budget"     "for"        "the"        "fiscal"     "year"      
[11] "1987/88"    "beginning"  "today"      ","          "Finance"   
[16] "and"        "Petroleum"  "Minister"   "Shei

In [13]:
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
crude9_pos <- NLP::annotate(s9, pos_tag_annotator, crude9_ann)
crude9_pos

 id  type     start end  features
   1 sentence     1  187 constituents=<<integer,35>>
   2 sentence   193  293 constituents=<<integer,18>>
   3 sentence   299  523 constituents=<<integer,36>>
   4 sentence   529  601 constituents=<<integer,13>>
   5 sentence   607  754 constituents=<<integer,27>>
   6 sentence   760  853 constituents=<<integer,16>>
   7 sentence   859 1038 constituents=<<integer,29>>
   8 sentence  1044 1155 constituents=<<integer,16>>
   9 sentence  1157 1221 constituents=<<integer,13>>
  10 sentence  1227 1374 constituents=<<integer,27>>
  11 sentence  1380 1617 constituents=<<integer,39>>
  12 sentence  1623 1685 constituents=<<integer,11>>
  13 sentence  1687 1731 constituents=<<integer,9>>
  14 sentence  1737 1808 constituents=<<integer,16>>
  15 sentence  1814 2107 constituents=<<integer,53>>
  16 sentence  2110 2115 constituents=375
  17 word         1    3 POS=DT
  18 word         5    8 POS=NNP
  19 word        10   12 POS=NN
  20 word        14   18 POS=NN
 

### Latent Dirichlet Allocation (LDA):

In [15]:
crudeDTM = DocumentTermMatrix(crude)
crudeDTM

<<DocumentTermMatrix (documents: 20, terms: 1266)>>
Non-/sparse entries: 2255/23065
Sparsity           : 91%
Maximal term length: 17
Weighting          : term frequency (tf)

In [16]:
lda = LDA(crudeDTM, k=2)
topic_df = tidy(lda)
topic_groups = group_by(topic_df, topic)
topic_groups = top_n(topic_groups, 10)
arrange(ungroup(topic_groups), topic, -beta)

Selecting by beta


topic,term,beta
1,the,0.065868263
1,oil,0.024412713
1,and,0.022109627
1,opec,0.017042837
1,said,0.015200368
1,its,0.011054813
1,was,0.010133579
1,prices,0.010133579
1,mln,0.009672962
1,for,0.009672962


In [None]:
?stop_words

In [17]:
cleaned_topic_df = anti_join(topic_df, stop_words, by = c("term" = "word"))

In [18]:
topic_groups = group_by(cleaned_topic_df, topic)
topic_groups = top_n(topic_groups, 10)
arrange(ungroup(topic_groups), topic, -beta)

Selecting by beta


topic,term,beta
1,oil,0.024412713
1,opec,0.017042837
1,prices,0.010133579
1,mln,0.009672962
1,saudi,0.00829111
1,bpd,0.007369876
1,official,0.006448641
1,kuwait,0.005988024
1,said.,0.005527407
1,market,0.005066789
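
The term `said.` in the table above suggests that punctuation stayed attached to some tokens when the document-term matrix was built. A possible alternative to filtering afterwards with `anti_join` is to clean the terms while building the matrix, using the `removePunctuation` and `stopwords` control options of `tm`. This is a minimal sketch, assuming `tm` is installed as in the cells above; `clean_dtm` and `clean_terms` are names introduced here, and `tm`'s built-in stop word list is not the same list as tidytext's `stop_words`.

```r
library(tm)  # assumed installed, as in the first cell

data("crude")

# removePunctuation strips punctuation from each token;
# stopwords = TRUE drops tm's built-in stop words for the documents' language.
clean_dtm <- DocumentTermMatrix(
  crude,
  control = list(removePunctuation = TRUE, stopwords = TRUE)
)

# Terms() returns the vocabulary of the matrix.
clean_terms <- Terms(clean_dtm)

# Neither the stop word "the" nor the punctuated token "said." should remain.
c("the", "said.") %in% clean_terms
```

The resulting `clean_dtm` could then be passed to `LDA()` in place of `crudeDTM`; the topics would likely differ from those shown above.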
