/
exp_NLP_tm_package.Rmd
331 lines (240 loc) · 12.6 KB
/
exp_NLP_tm_package.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
---
title: "Managing Unstructured Data with the `tm` package"
author: "Pier Lorenzo Paracchini"
date: "03. dec. 2016"
output:
html_document:
fig_height: 9
fig_width: 9
keep_md: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, message = FALSE, warning = FALSE)
```
##Required packages
* `readr` package for reading the file containing the dataset,
* `tm` package, used perform text mining operations on unstructured data,
* `snowballC`, used for stemming,
* `wordcloud`, used for visualization purpose and the creation of wordclouds,
* `ggplot2`, used for basic plotting.
```{r echo = FALSE}
require(tm)
require(SnowballC)
require(wordcloud)
require(ggplot2)
require(readr)
```
## Introduction
The idea is to play around with the `tm` package and perform some common text mining operations on a set of documents, the __corpus__ under analysis in order to learn the potential of this package and, specifically, to understand how to __use the `tm` package functionality to engineer features from free text__.
## The Data
The data used for this experiment is the [SMS Spam Collection v. 1](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/), a public set of SMS labeled messages that have been collected for mobile phone spam research.
The original file has been pre-processed in order to create a CSV file
* `\t` separator has been replaced by `,`
* `"` in the free text have been replaced by `'`
* the sms text has been included in `"` (quoted text)
```{r getData}
dataUri <- "./data/smsspamcollection/SMSSpamCollection.txt"
rawData <- read_csv(dataUri,col_names = F)
```
### A quick look at the data
The dataset contains `r dim(rawData)[1]` entries/ observations, where each entry is a classified sms. The first column represents the label, while the second column represents the text of the sms, both are `character` type. The features/ columns do not have proper names (just the standard names assigned by R).
```{r datasetOverview}
str(rawData)
```
Proper names are given to the columns - `label` and `text` respectively. The `label`, being a categorical variable with two only possible valiues `ham, spam`, is transformed into a factor; the `text` is kept as `character`.
```{r datasetPrepartion}
#Adding the column names
colnames(rawData) <- c("label", "text")
#Setting the label as a factor
rawData$label <- as.factor(rawData$label)
```
How many sms have been tagged as `spam` and `ham`?
```{r someMoreInfo}
x <- table(rawData$label)
x
prop.table(x) * 100
```
## Importing the __corpus__
The `tm` package offers different options for importing a __corpus__. The creation of a __corpus__ is done using `tm::Corpus` class providing two different arguments
* a __source__ object, containing the documents
* a set of parameters for reading the content from teh source object
* a __reader__, a function capable of reading in and processing the format delivered by the source object
The list of available __source__ options can be seen using `tm::getSources()`. Text documents can be imported using a dataframe, a vector, a URI, a directory, etc.
```{r}
getSources()
```
The list of available __readers__ can be found using `tm::getReaders()` with the supported text file formats. Each __source__ has a __default reader__ that can be overriden.
```{r collapse=TRUE}
getReaders()
```
We have a dataframe containing all of the documents that should be part of our corpus. The __corpus__ is created using `vectorSource` and the sms text `text`, using the default __reader__.
```{r createCorpus}
theCorpus <- Corpus(VectorSource(x = rawData$text))
print(theCorpus)
```
### Inspecting the __corpus__
In order to display detailed information on a corpus, the `tm::inspect` function can be used.
```{r inspectCorpus1}
inspect(head(theCorpus,2))
```
In order to access the individual documents the `[[` can be used either via position or document identifier using the relevant attribute.
```{r inspectCorpus2}
#Using the [[]] format and the position
#first document
theCorpus[[1]]$meta
theCorpus[[1]]$content
#second document
theCorpus[[2]]$content
#Getting the document identifier
doc_id <- meta(theCorpus[[2]], "id")
identical(theCorpus[[2]], theCorpus[[doc_id]])
```
## Transforming the corpus
Once we have a corpus, tipically we want to transform its content in order to make it easy to extract information/ insights from it. The `tm` package offfers a variety of __predefined transformations__ that can be applied on the __corpus__. To get a list of the transformation supported use the `tm::getTransformations()` function.
```{r showTrasformations, collapse=TRUE}
getTransformations()
```
* `removeNumbers`, remove numbers from a text document
* `removePunctuation`, remove punctuation `[:punct:]` from a text document
* `removeWords`, remove words specified in the provided list of words from a text document
* `stemDocument`, stem words in a text document using Porter's stemming algorithm
* `stripWhitespace`, strip extra whitespace from a text document, normalized to a single whitespace
Custom transformation can be created and used on the corpus using the `tm::content_transformer` function as a wrapper e.g. `tm::content_tansform(customTransformation)`.
Usually one of the first steps is to remove the most frequently used words from the corpus, called __stopwords__. __Stopwords__ are the most common, short function terms, whit no important meaning. The `tm` package offers a list of such words `tm::stopwords` function.
```{r stopWords, collapse=TRUE}
stopwords()
```
Stopwords are unimportant words that will not actually change the meaning of the documents (the sms text). Running a simple example of removing stopwords, it is possible to see that the "and" word is replaced with an empty space, but "AND" is not removed cause of the uppercase letter (case sensitive).
```{r simpleTransformation, collapse=TRUE}
removeWords("going and running", stopwords())
removeWords("going AND running", stopwords())
```
In order to remove the possible challenges connected with uppercase letters, meaningless words, punctuation, numbers, common practices are to simply apply the following transformations to the corpus
* transform the uppercase to lowercase
* remove the stopwords
* remove punctuation symbols
* remove numbers
* normalize the white spaces (a words removed is transformed into a white space)
To iteratively apply transformations to the corpus the `tm::tm_map` function is used.
```{r transformCorpus, collapse=TRUE}
theCorpus <- tm_map(theCorpus, content_transformer(tolower))
#we had to wrap the tolower function in the content_transformer function, so that our transformation really complies with the tm package's object structure.This is usually required when using a transformation function outside of the tm package (custom transformation).
theCorpus <- tm_map(theCorpus, removeWords, stopwords())
theCorpus <- tm_map(theCorpus, removePunctuation)
theCorpus <- tm_map(theCorpus, removeNumbers)
theCorpus <- tm_map(theCorpus, stripWhitespace)
##see the content of 3 documents
theCorpus[[1]]$content
theCorpus[[50]]$content
theCorpus[[100]]$content
```
## Creating the Term Document Matrix (TDM)
In order to find the most common words in the corpus we need to create a sparse matrix from the corpus using `tm::TermDocumentMatrix`.
A `tm::TermDocumentMatrix` is basically a matrix (__sparse matrix__) which includes the all of the possible words in the rows and the documents in the columns. Each cell represents tne number of occurences of that specific word in that specific document.
```{r sparseMatrix}
tdm <- TermDocumentMatrix(theCorpus)
inspect(tdm[1:5, 1:20])
```
To find the most frequent terms using the sparse matrix, the `tm::findFrequentTerms` function can be used. Example which terms is found at least 80 times in the corpus?
```{r findFrequentTerms, collapse=TRUE}
#Note the bigger the number of documents, the bigger the lowfreq numer - use 100 if using all of the documents.
findFreqTerms(tdm, lowfreq = 80)
```
One of the limitation of the `tm::findFrequentTerms` function is that it does just return a character vector of terms. To get the number of times a term has been found in the __corpus__ the `tm::tm_term_score` function is of great help.
```{r termScore}
tm_term_score(tdm, c("already", "tell", "day"), FUN = slam::row_sums)
```
The `tm::findFrequentTerms` and `tm::tm_term_score` functions can be used to find how many times each term has been used in the __corpus__ overall.
```{r frequentTerms}
getTermsFrequency <- function(corpus.tdm){
all.terms <- findFreqTerms(corpus.tdm)
freq = tm_term_score(x = corpus.tdm, terms = all.terms, FUN = slam::row_sums)
terms <- names(freq); names(freq) <- NULL
corpora.allTermsFrequency <- data.frame(term = terms, freq = freq)
corpora.allTermsFrequency[order(corpora.allTermsFrequency$freq, decreasing = T), ]
}
frequent_terms_df <- getTermsFrequency(tdm)
head(frequent_terms_df, 10)
tail(frequent_terms_df, 10)
```
## Useful Visualizations
Visualizations allows to inspecting the list of words/ terms that are part of the __corpus__ and understand how good the transformation process is.
Visualization can help to identify
* rules that need to be considered when removing punctuation or numbers,
* other words that can be added to the list of stopwords,
* other meaningful insights.
```{r plotting}
visualizeBarPlot <- function(ftm.df, colorBars = "grey40", titleBarPlot = ""){
ggplot(ftm.df[1:50,], aes(x = reorder(term,freq), y = freq/1000)) +
geom_bar(stat = "identity", fill=colorBars) +
xlab("Terms") + ylab("Frequency (* 1000)")+
ggtitle(paste(titleBarPlot, "(Top 50)")) + coord_flip()
}
visualizeWordcloud <- function(ftm.df){
mypal <- brewer.pal(8,"Dark2")
wordcloud(words = ftm.df$term,
freq = ftm.df$freq,
colors = mypal,
scale=c(6,.5),
random.order = F, max.words = 200)
}
```
Visualizing a boxplot with the top 50 terms in decreasing order
```{r boxplot}
visualizeBarPlot(frequent_terms_df, titleBarPlot = "High Frequency Terms")
```
Visualizing a wordcloud with the top 200 terms
```{r wordcloud}
visualizeWordcloud(frequent_terms_df)
```
### Stemming words
[Stemming](https://en.wikipedia.org/wiki/Stemming) is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. Stemming in R can be performed using the `SnowballC` package. This package has a `SnowballC::wordStem` function that support several languages based on the __Porter's stemming algorithm__.
```{r wordStem}
#example for word stemming
wordStem(c("cats", "mastering", "using", "modelling", "models", "model"))
#Note that the Porter's algorithm does not provide real english words in all cases
#Something to consider later when looking at the results.
wordStem(c("are", "analyst", "analyze", "analysis"))
```
Once the stem has been defined, it is possible to complete the stemmed words using `tm::stemCompletion` function. Be careful when performing stem completion ....
```{r stemCompletionExample, collapse=TRUE}
a <- c("mining", "miners", "mining") #Original words
b <- stemDocument(a) #Stemmed words
d <- stemCompletion(
b, #stemmed words that we want to complete
dictionary = a #dictionary to be used to complete stemmed words
)
#Original words
a
#Stemmed words
b
#Completion Stemmed Words
d
#Do u see anything strange?
```
```{r stemCorpus}
theCorpus_stem <- theCorpus
theCorpus_stem <- tm_map(theCorpus_stem, stemDocument, language = "english")
tdm_stem <- TermDocumentMatrix(theCorpus_stem)
frequent_terms_df_stem <- getTermsFrequency(tdm_stem)
visualizeBarPlot(frequent_terms_df_stem, titleBarPlot = "High Frequency Terms")
visualizeWordcloud(frequent_terms_df_stem)
```
## Analysing the Associations among terms
The `tm::TermDocumentMatrix` can be used to identify the association between the cleaned terms found in the corpus. For this purpose we can use the `tm::findAssocs` function in the `tm` package.
```{r Call}
#Lets find the associations between the term and others terms that have a correlation higher than 0.2
findAssocs(tdm, "call", 0.20)
```
```{r findAssocsBig}
#Lets find the associations between the bayesian term and others terms that have a correlation higher than 0.2
findAssocs(tdm, "big", 0.20)
```
## References
["SMS Spam Collection v. 1"](http://dcomp.sor.ufscar.br/talmeida/smspamcollection/), Almeida, T.A., Gómez Hidalgo, J.M.
["Introduction to the tm package"](https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf), `tm` vignette
["Text Mining Infrastructure in R"](https://www.jstatsoft.org/article/view/v025i05), Ingo Feinerer, Kurt Hornik, David Meyer
## Session Information
```{r}
sessionInfo()
```