# PC Session 3

**Author:**
[Helge Liebert](https://hliebert.github.io/)

# **Text analysis**: Kiva loans

## Dependencies

In [41]:
## Libraries
library(tm)
library(data.table)
library(ggplot2)
library(tidytext)
library(dplyr)
library(topicmodels)
library(wordcloud)
library(SentimentAnalysis)
library(naivebayes)
library(slam)
library(glmnet)
library(lexicon)

In [None]:
## Simple helper function to view the first corpus elements, only for illustration in lecture
chead <- function(c) lapply(c[1:2], as.character)

## Setting up a corpus and applying transformations

This tutorial relies on the Kiva data from the last lecture. The text analyses are based on the loan description and, to a limited extent, the loan purpose statement. The data come from a CSV database dump provided on the Kiva site. To ease computation, we use a limited sample of 10,000 observations. Using the CSV file from the Kiva homepage and the file `prep-kiva.r`, you can clean the data yourself and use a larger sample; the full sample is close to one million observations. (You might need to use a text editor to delete a few nested parentheses in the loan descriptions, which cause errors when reading the CSV file.) If you look at the script, you can also see that I spent some time pre-processing the data, filtering HTML tags and similar things, to get mostly clean loan descriptions. This type of pre-processing is common, but also very application-specific.

In case you are using Windows, or any other OS not using UTF-8 encoding as default, setting the encoding when reading data files is good practice. When working with text data, take care to ensure you are using the correct encoding and that transformations between files and encodings do not lead to broken characters.
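As a minimal sketch of such a check, the base-R snippet below uses two toy strings (not the Kiva data) to show how wrongly encoded text can be detected with `validUTF8()` and repaired with `iconv()`. The latin1 source encoding is an assumption made purely for the example.

```r
## Toy strings with non-ASCII characters, stored as UTF-8
x <- c("caf\u00e9", "na\u00efve")

## Simulate text that was saved in a different encoding (latin1 assumed here)
x_latin1 <- iconv(x, from = "UTF-8", to = "latin1")

## validUTF8() checks the raw bytes: the latin1 bytes are not valid UTF-8
validUTF8(x)
validUTF8(x_latin1)

## Convert back explicitly, naming the source encoding
x_fixed <- iconv(x_latin1, from = "latin1", to = "UTF-8")
all(x_fixed == x)
```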

In [None]:
## Read data
loans <- fread("Data/kiva-tiny.csv", encoding = "UTF-8")
names(loans)

The first thing we are going to do is set up a corpus. We will focus on the loan description. For all transformations in the remainder of this tutorial, we are going to print the first two loans' descriptions for illustration.

In [None]:
## Set up corpus
setnames(loans, "loanid", "doc_id")
setnames(loans, "description", "text")
corp <- Corpus(DataframeSource(loans))

## Inspect it
corp
lapply(corp[1:2], as.character)
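The renaming matters because `DataframeSource()` expects the first two columns to be named `doc_id` and `text`; any remaining columns are kept as document metadata. A minimal sketch with a made-up two-row data frame (the rows and the `sector` column are invented for illustration, and `VCorpus()` is used here instead of `Corpus()` only to keep the example explicit):

```r
library(tm)

## Toy data: doc_id and text must be the first two columns
toy <- data.frame(doc_id = c("a", "b"),
                  text   = c("Maria sells vegetables at the market.",
                             "John repairs bikes in his workshop."),
                  sector = c("Food", "Services"),
                  stringsAsFactors = FALSE)

toycorp <- VCorpus(DataframeSource(toy))

## Number of documents and the text of the first one
length(toycorp)
as.character(toycorp[[1]])

## Remaining columns are stored as document-level metadata
meta(toycorp)
```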

These are the main transformations available in the `tm` library, but any other customized transformation can be applied.

In [None]:
## Main corpus transformations, passed via tm_map()
## Other transformations have to be wrapped in content_transformer()
getTransformations()

We apply the `base` string function `tolower()` to transform all strings to lower case.

In [None]:
## All chars to lower case
corp <- tm_map(corp, content_transformer(tolower))
chead(corp)

Remove all punctuation, since it is unlikely to carry special meaning in the context of loans and we want to simplify the text input to get token counts. To get rid of all punctuation, including non-ASCII characters such as typographic quotation marks, set the `ucp` option to `TRUE`.

In [None]:
## Remove punctuation
corp <- tm_map(corp, removePunctuation)
chead(corp)
## corp <- tm_map(corp, removePunctuation, ucp = TRUE)
## chead(corp)
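To see why Unicode-aware matching matters, here is a base-R sketch on a made-up sentence. It uses the Unicode punctuation class `\p{P}`, which is, as far as I can tell, what the `ucp = TRUE` option relies on. Note that currency symbols such as `$` are Unicode symbols rather than punctuation, so they survive.

```r
## Toy sentence with typographic quotes, a hyphen and a currency symbol
x <- "\u201cHard-working\u201d farmer, aged 35: needs $500!"

## Remove everything in the Unicode punctuation class
clean <- gsub("\\p{P}+", "", x, perl = TRUE)
clean
```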

Now we remove all numbers. We observe the loan amount and the repayment schedule in other variables, so we can get rid of numbers. Extracting the meaning of numbers within their context is difficult. 

In [None]:
## Remove numbers
corp <- tm_map(corp, removeNumbers)
chead(corp)

Any other transformation - like substituting specific patterns based on regular expressions - can be passed to `tm_map()` using a user-defined function.

In [None]:
## For specific transformations, you could also pass a lambda function to remove patterns based on a regex

## Example:
## toSpace <- content_transformer(function (x , pattern) gsub(pattern, " ", x))
## corp <- tm_map(corp, toSpace, "patternhere")
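A runnable version of this idea on a toy corpus might look as follows. The two documents and the URL pattern are made up for illustration; the pattern is deliberately simple and not meant as a complete URL matcher.

```r
library(tm)

## Wrap a regex substitution so it can be passed to tm_map()
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

toy <- VCorpus(VectorSource(c("visit http://example.org today",
                              "no link in this text")))

## Replace anything starting with "http" up to the next whitespace
toy <- tm_map(toy, toSpace, "http[^[:space:]]*")
lapply(toy, as.character)
```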

Looking at a frequency plot of the token counts, there is still plenty of filtering to do to get meaningful token counts.

In [None]:
## Look at the most frequent words in our text and see whether we should get rid of some
frequent_terms <- qdap::freq_terms(corp, 30)
plot(frequent_terms)

Next, we remove stopwords and other generic words which do not carry special meaning in our context.

In [None]:
## More invasive changes: remove generic and custom stopwords
corp <- tm_map(corp, removeWords, stopwords('english'))
chead(corp)

In [None]:
## And a few more words we filter for lack of being informative, this could be extended
corp <- tm_map(corp, removeWords, "loan")
corp <- tm_map(corp, removeWords, "kiva")

In [None]:
## There are a lot of names in the data, these are not really informative
## We apply a dictionary to get rid of some of them
## The name lists are split in halves because of a regex length limit
corp <- tm_map(corp, removeWords, common_names[1:floor(length(common_names)/2)])
corp <- tm_map(corp, removeWords, common_names[floor(length(common_names)/2):length(common_names)])
corp <- tm_map(corp, removeWords, freq_first_names[1:floor(nrow(freq_first_names)/2), Name])
corp <- tm_map(corp, removeWords, freq_first_names[floor(nrow(freq_first_names)/2):nrow(freq_first_names), Name])
## corp <- tm_map(corp, removeWords, freq_last_names) # needs to be truncated as well, even longer
chead(corp)
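Instead of splitting each list by hand, the removal can be wrapped in a small helper that works through a long word list in chunks. This is only a sketch: `remove_words_chunked()` and its `chunk_size` default are my own hypothetical choices, not part of `tm`.

```r
library(tm)

## Hypothetical helper: apply removeWords() chunk by chunk
remove_words_chunked <- function(corpus, words, chunk_size = 1000) {
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))
  for (chunk in chunks) {
    corpus <- tm_map(corpus, removeWords, chunk)
  }
  corpus
}

## Toy example with a tiny chunk size to force several passes
toy <- VCorpus(VectorSource(c("maria sells fruit",
                              "john and ana repair bikes")))
toy <- remove_words_chunked(toy, c("maria", "john", "ana"), chunk_size = 2)
lapply(toy, as.character)
```

Since `removeWords()` is case-sensitive, this should be run after the conversion to lower case, as in the pipeline above.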

You can also stem the document here. For illustration purposes I refrain from it, but in real applications you might want to do this. Stemmers do not work equally well for all languages. Depending on your application, there may be added value in grouping and transforming tokens further yourself.

In [None]:
## Stem document
## corp <- tm_map(corp, stemDocument, language = 'english')
## chead(corp)

Strip all extra whitespace (this is without consequences for tokenization).

In [None]:
## Strip extra whitespace
corp <- tm_map(corp, stripWhitespace)
chead(corp)

## Building a document-term matrix and restricting the feature set

We transform the corpus to a document-term matrix. We use simple term-frequency weighting, i.e. the simple token counts. This is the default. You can also choose term frequency-inverse document frequency (tfidf) weighting at this point.

In [None]:
## Build a document-term or term-document matrix
## Default is term-frequency weighting (raw token counts)
## TF-IDF weighting also possible

## dtm <- TermDocumentMatrix(corp)
dtm <- DocumentTermMatrix(corp)

## Inspect the document-term matrix
inspect(dtm)
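To make the two weighting schemes concrete, the following sketch builds both matrices for a made-up three-document corpus. The documents are invented for illustration; `weightTfIdf` is passed through the `control` list.

```r
library(tm)

toy <- VCorpus(VectorSource(c("farm seeds farm",
                              "shop goods",
                              "farm goods")))

## Default: raw token counts
dtm_tf <- DocumentTermMatrix(toy)
as.matrix(dtm_tf)

## Alternative: tf-idf weighting
dtm_tfidf <- DocumentTermMatrix(toy, control = list(weighting = weightTfIdf))
round(as.matrix(dtm_tfidf), 2)
```

In the tf-idf version, a term such as `farm` that appears in most documents receives a lower weight than its raw count suggests.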

Following the tokenization, we can inspect the most popular words.

In [None]:
## Inspect most popular words
findFreqTerms(dtm, lowfreq=1000)

The document-word vectors allow us to inspect the correlation between words as we would between variables in other data.

In [None]:
## Inspect associations
findAssocs(dtm, 'hard', 0.15)
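The idea behind this can be sketched in base R: treat each term column as a variable, compute its correlation with the term of interest, and keep the terms above the threshold. The small count matrix below is made up, and the snippet is an illustration of the concept rather than a reimplementation of `findAssocs()`.

```r
## Toy document-term matrix: 4 documents, 3 terms
m <- cbind(hard = c(1, 0, 1, 0),
           work = c(1, 0, 1, 1),
           farm = c(0, 1, 0, 1))

## Correlation of every term with "hard"
assoc <- cor(m)["hard", ]
assoc <- assoc[names(assoc) != "hard"]

## Keep terms with a correlation of at least 0.15
strong <- assoc[assoc >= 0.15]
strong
```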

Since the matrix is very wide and sparse, we are going to remove terms to arrive at a tractable representation. We can filter words simply by removing words that are very rare.

In [None]:
## Remove sparse terms, prevents cluster node from choking and saves time
## may also improve tractability

## Tweak the sparse parameter to influence # of words
dtms <- removeSparseTerms(dtm, sparse=0.90)
dim(dtms)
dtms <- dtms[row_sums(dtms) > 0, ]
dim(dtms)
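The `sparse` parameter can be read as follows, at least as I understand the `tm` documentation: a term is dropped if it is missing from at least a share `sparse` of all documents, so it survives only if it appears in more than a share `1 - sparse` of them. A base-R sketch of that rule on a made-up count matrix:

```r
## Toy document-term matrix: 4 documents, 3 terms
m <- cbind(hard = c(1, 0, 1, 0),
           work = c(1, 0, 1, 1),
           farm = c(0, 1, 0, 1))

sparse <- 0.5

## Share of documents in which each term occurs
doc_share <- colMeans(m > 0)
doc_share

## Keep terms occurring in more than (1 - sparse) of the documents
kept <- m[, doc_share > 1 - sparse, drop = FALSE]
colnames(kept)
```

With `sparse = 0.90` as above, a term therefore has to appear in more than roughly 10% of the loan descriptions to be kept.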

We can also filter by tf-idf, only keeping words which occur frequently in some documents but not in others, helping us to keep those words that disambiguate loans. We compute the average tf-idf score for each token, then filter by that. I went back-and-forth a bit tweaking the threshold value for filtering.

In [None]:
## Alternatively, filter words by mean tf-idf

## Calculate average term-specific tf-idf weights as
## mean(word count/document length) * log(ndocs/ndocs containing word)
termtfidf <- tapply(dtm$v/row_sums(dtm)[dtm$i], dtm$j, mean) *
             log(nDocs(dtm)/col_sums(dtm > 0))
summary(termtfidf)

## Only include terms whose mean tf-idf score is at least 0.15 (threshold chosen after inspecting the summary above)
dtmw <- dtm[, (termtfidf >= 0.15)]
dim(dtmw)
## And documents within which these terms occur - this may induce selection
dtmw <- dtmw[row_sums(dtmw) > 0, ]
dim(dtmw)
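The sparse-matrix expression above is compact but hard to read. The following base-R sketch computes the same quantity on a small dense matrix with made-up counts: the mean length-normalized term frequency over the documents containing the term, multiplied by the log inverse document frequency.

```r
## Toy count matrix: 3 documents, 3 terms
m <- matrix(c(2, 1, 0,
              0, 0, 2,
              1, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), c("farm", "seeds", "shop")))

## Term frequency normalized by document length
tf <- m / rowSums(m)

## Mean tf over the documents that contain the term
ndocs_with_term <- colSums(m > 0)
mean_tf <- colSums(tf) / ndocs_with_term

## Inverse document frequency and the resulting mean tf-idf
idf <- log(nrow(m) / ndocs_with_term)
mean_tfidf <- mean_tf * idf
mean_tfidf
```

A term that occurs in every document gets an idf of zero and would always be filtered out, whatever the threshold.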

In [None]:
## Far fewer frequent terms now
findFreqTerms(dtmw, lowfreq=100)

## Visualizations of word frequencies

These are just very simple visualizations of word frequencies. First the unfiltered but transformed corpus.

In [None]:
## Simple visualization
wordcloud(corp, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

Next the term-frequency filtered document-term matrix. Obviously this is very similar to the plot above. 

In [None]:
## Counts from dtms
counts <- sort(colSums(as.matrix(dtms)), decreasing = TRUE)
counts <- data.frame(word = names(counts), freq = counts)
wordcloud(words = counts$word, freq = counts$freq,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

However, the frequency plot of the tfidf-filtered document-term matrix looks different. There are a lot more terms that disambiguate professions and investment goods.

In [None]:
## Counts from dtmw
counts <- sort(colSums(as.matrix(dtmw)), decreasing = TRUE)
counts <- data.frame(word = names(counts), freq = counts)
wordcloud(words = counts$word, freq = counts$freq,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

## Dictionary methods: Inferring sentiment

We are using a fixed mapping of terms to infer a sentiment score, and then convert it to discrete sentiment categories. Unsurprisingly, most loan descriptions are phrased to convey a positive message.

In [None]:
## Dictionary method: Sentiment analysis using dictionaries
sentiment <- analyzeSentiment(dtms, language = "english")
sentiment <- convertToDirection(sentiment$SentimentGI)

## Potentially add back to original data for further analysis
## loans$sentiment <- sentiment

## look at sentiment distribution
table(sentiment)
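What a dictionary method does can be sketched in a few lines of base R. The word lists below are tiny and invented for illustration (they are not the dictionary used above), and `sentiment_score()` is a hypothetical helper: it counts positive minus negative words and divides by the number of tokens.

```r
## Tiny made-up dictionaries
positive_words <- c("good", "happy", "improve")
negative_words <- c("poor", "loss")

## Hypothetical helper: (positive - negative) / number of tokens
sentiment_score <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  (sum(tokens %in% positive_words) - sum(tokens %in% negative_words)) /
    length(tokens)
}

texts <- c("She hopes to improve her happy family life",
           "Poor harvest caused a loss",
           "He sells rice")

scores <- sapply(texts, sentiment_score, USE.NAMES = FALSE)

## Convert the continuous score to a discrete direction
direction <- ifelse(scores > 0, "positive",
                    ifelse(scores < 0, "negative", "neutral"))
data.frame(score = scores, direction = direction)
```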

## Unsupervised generative model: Topic model

Next we are going to train an unsupervised topic model on the term-frequency filtered document-term matrix. You will find that it is hard to find distinct topics, both because of the term filtering and because most loans are handed out by partner organizations that use a standard questionnaire to collect basic information, which is then translated into an English description.

In [None]:
## Unsupervised method: Topic model
lda <- LDA(dtms, k = 5, control = list(seed = 100))
## lda <- LDA(dtmw, k = 5, control = list(seed = 100))

## Most likely topic for each document, could merge this to original data
## topic <- topics(lda, 1)

## Ten most frequent terms for each topic
terms(lda, 10)

## Plot most frequent terms and associated probabilities by topic
tpm <- tidy(lda, matrix = "beta")

topterms <-
    tpm %>%
    group_by(topic) %>%
    top_n(10, beta) %>%
    ungroup() %>%
    arrange(topic, -beta)

topterms %>%
    mutate(term = reorder(term, beta)) %>%
    ggplot(aes(term, beta, fill = factor(topic))) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ topic, scales = "free") +
    coord_flip()
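Besides the most likely topic, the fitted model also gives a full topic distribution per document, which is what you would merge back to the loan data. A minimal sketch on a made-up six-document corpus (the documents, `k = 2` and the variable names are assumptions for illustration; results on such tiny data are not meaningful in themselves):

```r
library(tm)
library(topicmodels)

toy <- VCorpus(VectorSource(c("farm seeds fertilizer crops harvest",
                              "crops harvest farm land seeds",
                              "shop goods stock customers store",
                              "store stock goods shop sales",
                              "farm crops seeds harvest land",
                              "shop store goods customers sales")))
toydtm <- DocumentTermMatrix(toy)

toylda <- LDA(toydtm, k = 2, control = list(seed = 100))

## Per-document topic probabilities (one row per document)
doc_topics <- posterior(toylda)$topics
round(doc_topics, 2)

## Most likely topic per document, ready to merge by document id
assigned <- data.frame(doc_id = rownames(doc_topics),
                       topic  = topics(toylda, 1))
assigned
```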

Instead, let us use the `loan use` statement, filtered by tfidf. These topics already look more distinct. You could tweak the filter and the number of topics further to arrive at a more meaningful result.

In [None]:
## not working well due to standardized templates
## let us try to use the `loanuse' statement text for the generative topic model 

# new data
loanuse <- loans[, .(doc_id, loanuse)]
setnames(loanuse, "loanuse", "text")

# new dtm, this time do most of the transformations in one step
dtmuse <- DocumentTermMatrix(Corpus(DataframeSource(loanuse)),
                             control = list(weighting = weightTf,
                                            language = "english",
                                            tolower = TRUE,
                                            removePunctuation = TRUE,
                                            removeNumbers = TRUE,
                                            stopwords = TRUE,
                                            stemming = FALSE,
                                            wordLengths = c(3, Inf)))
inspect(dtmuse)

# Recalculate weights
termtfidf <- tapply(dtmuse$v/row_sums(dtmuse)[dtmuse$i], dtmuse$j, mean) *
    log2(nDocs(dtmuse)/col_sums(dtmuse > 0))
summary(termtfidf)

## Filter by tf-idf
## dim(dtmuse)
dtmuse <- dtmuse[, (termtfidf >= 1.70)]
dtmuse <- dtmuse[row_sums(dtmuse) > 0, ]
## dim(dtmuse)

In [None]:
## Unsupervised method: Topic model, this time for loanuse statement
lda <- LDA(dtmuse, k = 3, control = list(seed = 100))
## str(lda)

## Most likely topic for each document, could merge this to original data
topic <- topics(lda, 1)
## Ten most frequent terms for each topic
terms(lda, 10)

## Plot most frequent terms and associated probabilities by topic
tpm <- tidy(lda, matrix = "beta")

topterms <-
    tpm %>%
    group_by(topic) %>%
    top_n(10, beta) %>%
    ungroup() %>%
    arrange(topic, -beta)

topterms %>%
    mutate(term = reorder(term, beta)) %>%
    ggplot(aes(term, beta, fill = factor(topic))) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ topic, scales = "free") +
    coord_flip()


Look at unique terms not appearing in other topics.

In [None]:
freqterms <- terms(lda, 40)
duplicates <- c(freqterms)[duplicated(c(freqterms))]
distinctterms <- lapply(as.list(as.data.frame(freqterms)), function(x) x[!(x %in% duplicates)])
distinctterms

## Supervised methods: Data preparation

The following cells transform and prep the data to be used as inputs for supervised methods. We split the data into a test and a training sample. The outcome we try to predict is whether the loan is obtained for a business proposition in the agricultural sector.

In [None]:
## Supervised methods: Prep data
## Convert the sparse document-term matrix to a standard data frame
bag <- as.data.frame(as.matrix(dtms))
dim(bag)

In [None]:
## Convert token counts to simple binary indicators
bag <- as.data.frame(sapply(bag, function(x) as.numeric(x > 0)))
bag$doc_id <- rownames(as.matrix(dtms))

## Add outcomes from the original data: Predict agricultural sector
loans$agsector <- as.numeric(loans$sectorname=="Agriculture")
bag <- merge(bag, loans[, .(agsector, loanamount, doc_id)], by = "doc_id")
                            
# How many people want a loan in the agricultural sector?                            
table(bag$agsector)

In [None]:
## Partition data in test and training sample
set.seed(100)
testids <- sample(nrow(bag), floor(nrow(bag)/3))  # draw a random third of all rows
xtrain <- as.matrix(bag[-testids, !(names(bag) %in% c("agsector", "loanamount", "doc_id"))])
ytrain <- as.factor(bag[-testids,  "agsector"])
xtest  <- as.matrix(bag[ testids, !(names(bag) %in% c("agsector", "loanamount", "doc_id"))])
ytest  <- as.factor(bag[ testids,  "agsector"])
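A random holdout split can be sketched on toy data as follows. The point to note is that `sample(n, size)` draws `size` distinct row indices from `1:n`, whereas `sample(k)` on its own only returns a permutation of `1:k`. The data frame below is made up for illustration.

```r
set.seed(100)

## Toy data: 9 observations with a binary outcome
n <- 9
toy <- data.frame(id = 1:n, y = rep(0:1, length.out = n))

## Randomly pick one third of the row indices as test sample
test_rows <- sample(n, size = floor(n / 3))

train <- toy[-test_rows, ]
test  <- toy[ test_rows, ]

## The two samples are disjoint and together cover all rows
nrow(train)
nrow(test)
length(intersect(train$id, test$id))
```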

## Supervised generative model: Naive Bayes classifier

Naive Bayes is a simple model relying on a conditional independence assumption for the token counts. It often performs acceptably. In this case it does not perform very well, possibly because we filtered the input token data too aggressively. Among other things, there is no need to remove stopwords here, and doing so may not improve performance. You can feed in the unfiltered data or try different transformations and see whether this improves matters. However, Naive Bayes may also simply not be well suited to this setting.

In [None]:
## Supervised generative model: Naive Bayes
nbclassifier <- naive_bayes(xtrain, ytrain, laplace = 1)
nbpred <- predict(nbclassifier, xtest)
summary(nbpred)
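To make the conditional independence assumption tangible, here is a hand-written Bernoulli Naive Bayes for binary token indicators with Laplace smoothing. It is a sketch under my own assumptions: `fit_bnb()` and `predict_bnb()` are hypothetical helpers, the data are made up, and this variant is not necessarily what `naive_bayes()` fits internally for a numeric matrix.

```r
## Toy binary bag of words: 4 documents, 3 tokens
x <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 1, 1,
              0, 0, 1),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("farm", "seeds", "shop")))
y <- factor(c(1, 1, 0, 0))

## Estimate class priors and smoothed P(token present | class)
fit_bnb <- function(x, y, laplace = 1) {
  classes <- levels(y)
  prior <- as.numeric(table(y)) / length(y)
  p <- t(sapply(classes, function(k) {
    (colSums(x[y == k, , drop = FALSE]) + laplace) /
      (sum(y == k) + 2 * laplace)
  }))
  list(classes = classes, prior = prior, p = p)
}

## Sum the log probabilities of all tokens independently, add the log prior
predict_bnb <- function(fit, newx) {
  loglik <- newx %*% t(log(fit$p)) + (1 - newx) %*% t(log(1 - fit$p))
  logpost <- sweep(loglik, 2, log(fit$prior), "+")
  factor(fit$classes[max.col(logpost, ties.method = "first")],
         levels = fit$classes)
}

fit <- fit_bnb(x, y)
newdocs <- rbind(c(farm = 1, seeds = 0, shop = 0),
                 c(farm = 0, seeds = 0, shop = 1))
pred <- predict_bnb(fit, newdocs)
pred
```

The independence assumption shows up in `predict_bnb()`: the log-likelihood of a document is just the sum of per-token terms, with no interaction between tokens.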

In [None]:
## Performance statistics: Classification rate
round(1-mean(as.numeric(nbpred != ytest)), 2)

## Performance statistics: Confusion matrix
## table(nbpred, ytest)
caret::confusionMatrix(nbpred, ytest)
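The statistics reported in the confusion matrix output can be computed by hand, which helps when interpreting them. The sketch below uses made-up predictions and treats class `1` as the positive class. Keep in mind that, as far as I know, `caret::confusionMatrix()` treats the first factor level as the positive class unless its `positive` argument is set.

```r
## Toy predictions and true labels
pred  <- factor(c(1, 0, 0, 1, 0, 0), levels = c(0, 1))
truth <- factor(c(1, 1, 0, 0, 0, 1), levels = c(0, 1))

## Rows: predicted class, columns: true class
cm <- table(pred, truth)
cm

tp <- cm["1", "1"]  # true positives
fp <- cm["1", "0"]  # false positives
fn <- cm["0", "1"]  # false negatives

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- tp / (tp + fp)  # share of predicted positives that are correct
recall    <- tp / (tp + fn)  # share of true positives that are found
c(accuracy = accuracy, precision = precision, recall = recall)
```

With an imbalanced outcome, accuracy can look good while recall is poor, so it is worth looking at all three.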

In [None]:
## Supervised text regression: L1 penalized logistic regression
l1classifier <- cv.glmnet(xtrain, ytrain, alpha = 1, family = "binomial")
l1pred <- as.factor(predict(l1classifier, xtest, s = "lambda.min", type = "class"))
summary(l1pred)

In [None]:
## Performance statistics: Classification rate
round(1-mean(as.numeric(l1pred != ytest)), 2)

## Performance statistics: Confusion matrix
caret::confusionMatrix(l1pred, ytest)

## Supervised text regression: L<sub>1</sub> penalized logistic classifier

This trains a logistic lasso estimator, weighting the penalty factor for each input token by the token's standard deviation. Judging by the misclassification rate and the confusion matrix, the model performs better than Naive Bayes in predicting the agricultural sector. However, looking at precision and recall, the model does poorly at identifying the loans that truly are agricultural, leading to a large number of false negatives.

In [None]:
## Plain L1 logistic classifier without reweighting, for comparison
# l1classifier <- cv.glmnet(xtrain, ytrain, alpha = 1, family = "binomial")
## L1 logistic classifier using rare feature upweighting
sdweights <- apply(xtrain, 2, sd)
l1classifier <- cv.glmnet(xtrain, ytrain, alpha = 1, family = "binomial",
                          standardize = FALSE, penalty.factor  = sdweights)
l1pred <- as.factor(predict(l1classifier, xtest, s = "lambda.min", type = "class",
                            penalty.factor  = sdweights))
summary(l1pred)
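Why this amounts to upweighting rare features can be seen from the standard deviations themselves. For a binary indicator, the standard deviation is small when the token is rare, so a rare token receives a small penalty factor and its coefficient is shrunk less. A base-R sketch with two made-up indicator columns:

```r
## Toy indicators: one token occurs in 1 of 10 documents, the other in 5 of 10
x <- cbind(rare   = c(1, rep(0, 9)),
           common = c(rep(1, 5), rep(0, 5)))

## Column-wise standard deviations, used as penalty factors above
sdweights <- apply(x, 2, sd)
round(sdweights, 3)
```

Setting `standardize = FALSE` matters here: with standardized inputs the columns would be rescaled to unit variance first, which would interact with the intended reweighting.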

In [None]:
## Performance statistics: Classification rate
round(1-mean(as.numeric(l1pred != ytest)), 2)
## Performance statistics: Confusion matrix
caret::confusionMatrix(l1pred, ytest)

## Remarks and additions

In [None]:
## How would you go about improving performance for the classifiers?






## (Addendum: Regression example: L<sub>1</sub> penalized linear regression)

In [None]:
## Further example: Predict Loan Amount
## Supervised text regression: L1 penalized linear regression

## Rebuild outcome vectors
#ytrain <- as.matrix(bag[-testids,  "loanamount"])
#ytest  <- as.matrix(bag[ testids,  "loanamount"])

## Estimate and predict
#l1predictor <- cv.glmnet(xtrain, ytrain, alpha = 1, family = "gaussian")
#l1pred <- predict(l1predictor, xtest, s = "lambda.min", type = "response")

## RMSE
#round(sqrt(mean((l1pred - ytest)^2)), 2)
#caret::postResample(l1pred, ytest)