# 9. Text Analytics

Text analysis, sometimes called text analytics, refers to the representation, processing, and modeling of textual data to derive useful insights.

In general, text analysis is concerned with a *corpus* of documents. The documents could be sentences in a paragraph, chapters in a book, or whole books in a library. We can typically break the process of analysing text into three steps:

1. **Parsing** is the process of imposing some structure on unstructured text. For example, we could break a raw HTML file into paragraphs, and each paragraph into individual words.

2. **Search and Retrieval** is the identification of documents containing certain *key terms* deemed relevant to the analysis.

3. **Text Mining** uses the terms and indices of the previous steps to discover patterns and insights.
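The first two steps can be sketched in a few lines of base R. This is only an illustration: the HTML string, the `tokenize` helper, and the key term `fox` are made up for this example, and real HTML would need a proper parser rather than regular expressions.

```r
# Hypothetical raw HTML input, invented for this illustration
raw_html <- "<p>The quick brown fox.</p><p>It jumps over the lazy dog!</p>"

# Parsing, part 1: split the document into paragraphs using the <p> tags
paragraphs <- regmatches(raw_html,
                         gregexpr("<p>.*?</p>", raw_html, perl = TRUE))[[1]]
paragraphs <- gsub("</?p>", "", paragraphs)

# Parsing, part 2: lower-case, replace punctuation with spaces, split into words
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z0-9 ]", " ", text)
  words <- strsplit(text, "\\s+")[[1]]
  words[words != ""]
}
tokens <- lapply(paragraphs, tokenize)

# Search and retrieval: which paragraphs contain the key term "fox"?
has_fox <- sapply(tokens, function(w) "fox" %in% w)

print(tokens)
print(has_fox)
```

The result of parsing is a list with one character vector of words per paragraph, which is the structure the later steps work from.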

Text data is incompatible with the models we have discussed so far because the models require numeric values. For example, we don't have a direct numeric distance between the words "hello" and "friend", so the k-means clustering algorithm cannot be applied to raw text data.

Moreover, a single data point such as a tweet or a Facebook post can contain a large number of distinct words, so we need a way to convert text into a flexible numeric representation. This representation should tell us which words occurred and how important each occurrence is to the meaning of a document. That importance should depend on how many times the word appears and on how much information the word carries.

There are many ways to represent text numerically. The simplest way is known as Bag-of-Words (BoW) representation.


## Bag-of-Words

The bag-of-words model is a simple method of transforming strings of text into a numeric representation. BoW treats each word as a feature, and the value of the feature is the number of times the word occurs.

For example, the string

    "The quick brown fox jumps over the lazy dog"
    
would be transformed into

| the | quick | brown | fox | jumps | over | lazy | dog |
|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Note that:
- the word "the" occurs twice so its count is 2
- other words are unique so they only occur once
- the number of features is the number of unique words

Also note that a single sentence already gives us 8 columns, since most of its words are distinct. If our analysis is focussed on a large *corpus* (or collection) of text documents, the number of columns is far larger. For instance, the Google n-Grams corpus of publicly accessible web pages contains about one million distinct words.

So text analytics suffers from the *high dimensionality* problem, wherein the number of columns in a dataset is large relative to the number of rows. With so many features and comparatively few observations, a model has a lot of room to fit noise rather than signal, so there is a large focus on reducing the dimensionality of textual data representations.
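To make the growth in columns concrete, here is a small sketch that builds a document-term matrix for three short documents. The documents are made up for this illustration. Even at this tiny scale there are far more columns than rows, and most entries are zero.

```r
# Three hypothetical documents, already lower-cased
docs <- c("the quick brown fox jumps over the lazy dog",
          "the five boxing wizards jump quickly",
          "pack my box with five dozen liquor jugs")

tokens <- strsplit(docs, "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# One row per document, one column per distinct word
dtm <- t(sapply(tokens, function(w) as.vector(table(factor(w, levels = vocab)))))
colnames(dtm) <- vocab

dim(dtm)        # rows (documents) and columns (distinct words)
mean(dtm == 0)  # fraction of entries that are zero

# Once documents are numeric vectors, distances between them are defined
dist(dtm)
```

Each new document tends to add columns for its own new words, which is why the vocabulary of a real corpus grows so large.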


## TF-IDF

In text data there will be lots of repeated words such as "a", "is" and "the" that aren't very useful, yet with BoW representation they will have a high associated weight. We should ignore them as much as possible.

Term Frequency–Inverse Document Frequency (TF-IDF) is a weighting procedure for BoW data. The TF-IDF weights boost the counts or frequencies of uncommon words (which tend to be useful) and shrink the magnitude associated with common words. There are two components to the TF-IDF weights, and each of these can be calculated in different ways:

- Term Frequency, often the _raw count_ of a term in a document $tf = f_D$. Other possibilities are boolean (1 if the term appears, otherwise 0), length adjusted ($tf = \frac{f_D}{n_{words}}$) or logarithmic ($tf = \log(1+f_D)$).

- Inverse Document Frequency, or a measure of the information contained in a word. This is a penalty for commonly used words like 'a' and 'the'. It's the logarithmically scaled inverse fraction of documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient). $idf = \log(\frac{N}{n_D})$ where $N$ is the number of documents and $n_D$ is the number of documents in which the word appears.
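The term-frequency variants and the inverse document frequency listed above can be written directly in base R. This is a sketch only: the example document and the variable names (`tf_raw`, `tf_boolean`, `tf_length`, `tf_log`) are chosen here for illustration, and `log` is the natural logarithm.

```r
# f_D: raw count of each term in one (hypothetical) document
words <- strsplit("the quick brown fox jumps over the lazy dog", " ")[[1]]
f_D   <- table(words)

tf_raw     <- f_D                  # raw count
tf_boolean <- (f_D > 0) * 1        # 1 if the term appears in the document
tf_length  <- f_D / length(words)  # count divided by document length
tf_log     <- log(1 + f_D)         # logarithmic scaling

# idf for a term that appears in n_D out of N documents
idf <- function(N, n_D) log(N / n_D)

tf_raw[["the"]]
tf_length[["the"]]
idf(N = 2, n_D = 2)  # a term found in every document carries no weight
```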

The TF-IDF score is calculated as follows:

$$tfidf = tf \cdot idf $$

As an example, consider the following corpus of two documents:

    "The quick brown fox jumps over the lazy dog"
    "The five boxing wizards jump quickly"
    
The *raw count* $tf$ weight of the word `The` (ignoring case) is 2 in the first document, and 1 in the second. The $idf$ weight of `The`, defined for the whole corpus, is $idf = \log(\frac{2}{2}) = 0$. Hence the $tfidf$ weight of `The` is 0 in both documents. This is appropriate since the word `The` carries very little meaning in each sentence.

Now consider the word `wizards`. The raw count $tf$ weight is 0 in the first document, and 1 in the second. Using the natural logarithm, the $idf$ weight is $idf = \log(\frac{2}{1}) \approx 0.7$, the largest possible value in this corpus. Hence the $tfidf$ weight is 0 in the first document, and $tfidf \approx 0.7$ in the second, which is appropriate since, intuitively, `wizards` contributes a lot of meaning to the sentence.
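The worked example can be checked with a short base R sketch. It assumes raw-count term frequency, the natural logarithm, and case-insensitive matching. The variable names are chosen for this illustration.

```r
docs <- c("The quick brown fox jumps over the lazy dog",
          "The five boxing wizards jump quickly")

tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- unique(unlist(tokens))

# Raw-count tf: one row per word, one column per document
tf <- sapply(tokens, function(w) as.vector(table(factor(w, levels = vocab))))
rownames(tf) <- vocab

# idf = log(N / n_D), where n_D is the number of documents containing the word
n_D <- rowSums(tf > 0)
idf <- log(length(docs) / n_D)

# Each row of tf is multiplied by the idf of its word
tfidf <- tf * idf
round(tfidf[c("the", "wizards"), ], 3)
```

The row for `the` is zero in both documents, while `wizards` is zero in the first document and about 0.693 in the second, matching the calculation above.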

## Topic Modelling

Topic models are statistical models that examine words from a set of documents, determine the themes that run through the text, and discover how the themes are associated or change over time. The process of topic modelling can be simplified to the following.

1. Uncover the hidden topical patterns within a corpus.

2. Annotate documents according to these topics.

3. Use annotations to organize, search, and summarise texts.

A topic is formally defined as a distribution over a fixed vocabulary of words. Different topics would have different distributions over the same vocabulary. A topic can be viewed as a cluster of words with related meanings, and each word has a corresponding weight inside this topic. Note that a word from the vocabulary can reside in multiple topics with different weights. Topic models do not necessarily require prior knowledge of the texts. The topics can emerge solely based on analyzing the text.
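As a toy illustration of this definition, a set of topics can be stored as a matrix with one row per topic and one column per vocabulary word. The vocabulary, topic names, and weights below are invented for this sketch and are not the output of any fitted model. The last step assumes, as mixture-style topic models do, that a document blends topics in some proportions.

```r
# Hypothetical vocabulary and topic weights, chosen only for illustration
vocab <- c("gene", "cell", "network", "learning", "model")
topics <- rbind(
  biology  = c(0.45, 0.40, 0.05, 0.02, 0.08),
  learning = c(0.02, 0.03, 0.30, 0.35, 0.30)
)
colnames(topics) <- vocab

# Each topic is a probability distribution over the same vocabulary
rowSums(topics)

# The word "model" resides in both topics, with different weights
topics[, "model"]

# Assumed topic proportions for one document, and the implied word distribution
doc_topics <- c(biology = 0.7, learning = 0.3)
doc_words  <- drop(doc_topics %*% topics)
round(doc_words, 3)
```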

    library(ggplot2)
    library(reshape2)
    library(lda)

    # load documents and vocabulary
    data(cora.documents)
    data(cora.vocab)
    theme_set(theme_bw())

    # Number of topic clusters to display
    K <- 10
    # Number of documents to display
    N <- 9

    result <- lda.collapsed.gibbs.sampler(cora.documents,
                                          K,          ## Num clusters
                                          cora.vocab, ## Vocabulary
                                          25,         ## Num iterations
                                          0.1,        ## alpha (prior on topic proportions)
                                          0.1,        ## eta (prior on topic word weights)
                                          compute.log.likelihood=TRUE)

    # Get the top 5 words in each topic cluster
    top.words <- top.topic.words(result$topics, 5, by.score=TRUE)

    # build topic proportions (one row per document)
    topic.props <- t(result$document_sums) / colSums(result$document_sums)

    # pick N documents at random to display
    document.samples <- sample(1:dim(topic.props)[1], N)

    topic.props <- topic.props[document.samples,]

    # documents with no assigned words get a uniform distribution over topics
    topic.props[is.na(topic.props)] <- 1 / K

    # label each topic by its top words and reshape to long format for plotting
    colnames(topic.props) <- apply(top.words, 2, paste, collapse=" ")
    topic.props.df <- melt(cbind(data.frame(topic.props),
                                 document=factor(1:N)),
                           variable.name="topic",
                           id.vars = "document")

    # one bar chart of topic proportions per sampled document
    ggplot(topic.props.df, aes(x=topic, y=value*100, fill=topic)) +
      geom_col() +
      ylab("proportion (%)") +
      theme(axis.text.x = element_text(angle=0, hjust=1, size=12)) +
      coord_flip() +
      facet_wrap(~ document, ncol=3)

