# PC Session 3

**Author:**
[Helge Liebert](https://hliebert.github.io/)

# **Text analysis**: Kiva loans

## Dependencies

In [None]:
## Libraries
library("tm")
library("data.table")
library("ggplot2")
library("tidytext")
library("dplyr")
library("topicmodels")
library("wordcloud")
library("SentimentAnalysis")
library("naivebayes")
library("slam")
library("glmnet")
library("lexicon")
library("fastNaiveBayes")
library("caret")
library("ranger")

In [None]:
## Simple helper function to view the first two corpus elements, only for illustration in the lecture
chead <- function(c) lapply(c[1:2], as.character)

## Setting up a corpus and applying transformations

This tutorial relies on Kiva data on crowdfunded loans. The analyses below are based on the loan description and, to a limited extent, the loan purpose statement. The data come from a CSV database dump provided on the Kiva website. To ease computation, we use a limited sample of 10,000 observations. Using the CSV file from the Kiva homepage, you can clean the data yourself and work with a larger sample; the full sample is close to one million observations.

I spent some time pre-processing the data, filtering HTML tags and similar things, to get mostly clean loan descriptions. This type of pre-processing is common, but also very application specific. If you prep the data yourself, you may need to delete a few nested parentheses in the loan description using a text editor to make it readable. 

In case you are using Windows, or any other OS not using UTF-8 encoding as default, setting the encoding when reading data files is good practice. When working with text data, take care to ensure you are using the correct encoding and that transformations between files and encodings do not lead to broken characters.
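
As a small illustration of the encoding issue, the following base R sketch checks whether a string is valid UTF-8 and re-encodes it. The example string and the assumption that its source encoding is Latin-1 are made up for illustration; for real files you need to know (or find out) the actual source encoding.

```r
# A string whose bytes are Latin-1: "café" with é stored as the single byte 0xE9
x <- "caf\xe9"

# Is the byte sequence valid UTF-8? (It is not: 0xE9 alone is an incomplete UTF-8 sequence)
validUTF8(x)

# Re-encode from the assumed source encoding (Latin-1) to UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")

# After conversion the string is valid UTF-8 and has four characters
validUTF8(y)
nchar(y)
```

If `validUTF8()` flags entries after reading a file, the file was probably read with the wrong encoding and should be re-read or converted before any text processing.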

In [None]:
## Read data
loans <- data.table::fread("Data/kiva-tiny.csv", encoding = "UTF-8")
names(loans)

The first thing we are going to do is set up a corpus. We will focus on the loan description. For all transformations in the remainder of this tutorial, we are going to print the first two loans' descriptions for illustration.

In [None]:
## Set up corpus
setnames(loans, "loanid", "doc_id")
setnames(loans, "description", "text")
corp <- Corpus(DataframeSource(loans))

## Inspect it
corp
lapply(corp[1:2], as.character)

These are the main transformations available in the `tm` library, but any other customized transformation can be applied.

In [None]:
## Main corpus transformations, passed via tm_map()
## Other transformations have to be wrapped in content_transformer()
getTransformations()

In [None]:
tolower("allcAPS")

We apply the `base` string function `tolower()` to transform all strings to lower case.

In [None]:
## All chars to lower case
corp <- tm_map(corp, content_transformer(tolower))
chead(corp)

We remove all punctuation, since it is unlikely to carry special meaning in the context of loans and we want to simplify the text input to get token counts. We need to set the `ucp` option to `TRUE` to get rid of all Unicode punctuation characters (e.g. typographic quotation marks).

In [None]:
## Remove punctuation
# corp <- tm_map(corp, removePunctuation)
# chead(corp)
corp <- tm_map(corp, removePunctuation, ucp = TRUE)
chead(corp)
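
To see why the Unicode-aware option matters, here is a minimal base R sketch. It assumes that `ucp = TRUE` behaves like a Perl-style regular expression on the Unicode punctuation class `\p{P}`; the example string is made up.

```r
# A string with typographic (curly) quotation marks and ASCII punctuation
x <- "\u201cgood\u201d harvest, repaid on time!"

# Remove every character in the Unicode punctuation class
cleaned <- gsub("\\p{P}+", "", x, perl = TRUE)
cleaned
```

A purely ASCII-based pattern can leave characters such as the curly quotes in place, depending on platform and locale, which is why the Unicode class is the safer choice for text scraped from the web.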

Now we remove all numbers. We observe the loan amount and the repayment schedule in other variables, so we can get rid of numbers. Extracting the meaning of numbers within their context is difficult. 

In [None]:
## Remove numbers
corp <- tm_map(corp, removeNumbers)
chead(corp)

Any other transformation - like substituting specific patterns based on regular expressions - can be passed to `tm_map()` using a user-defined function.

In [None]:
## For specific transformations, you could also pass a lambda function to remove patterns based on a regex

## Example:
# toSpace <- content_transformer(function (x , pattern) gsub(pattern, " ", x))
# corp <- tm_map(corp, toSpace, "ocongate")
# chead(corp)

Looking at a frequency plot of the token counts, there is still plenty of filtering to do to get meaningful token counts.

In [None]:
## Look at the most frequent words in our text and see whether we should get rid of some
# frequent_terms <- qdap::freq_terms(corp, 30)
# plot(frequent_terms)

Next, we remove stopwords and other generic words which do not carry special meaning in our context.

In [None]:
## More invasive changes: remove generic and custom stopwords
corp <- tm_map(corp, removeWords, stopwords('english'))
chead(corp)

In [None]:
## And a few more words we filter for lack of being informative, this could be extended
corp <- tm_map(corp, removeWords, "loan")
corp <- tm_map(corp, removeWords, "kiva")

In [None]:
## There are a lot of names in the data, these are not really informative
## We apply a dictionary to get rid of some of them
## Truncation because of regex limit
corp <- tm_map(corp, removeWords, common_names[1:floor(length(common_names)/2)])
corp <- tm_map(corp, removeWords, common_names[floor(length(common_names)/2):length(common_names)])
corp <- tm_map(corp, removeWords, freq_first_names[1:floor(nrow(freq_first_names)/2), Name])
corp <- tm_map(corp, removeWords, freq_first_names[floor(nrow(freq_first_names)/2):nrow(freq_first_names), Name])
## corp <- tm_map(corp, removeWords, freq_last_names) # needs to be truncated as well, even longer
chead(corp)
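
The manual halving of the name lists above can be generalized. The following is a sketch in base R operating on a plain character vector rather than a `tm` corpus; the function name `remove_words_chunked` and the `chunk_size` argument are hypothetical, and the words are assumed to contain no regex metacharacters.

```r
# Remove a long word list from text in batches, so that no single
# regular expression grows too large
remove_words_chunked <- function(x, words, chunk_size = 1000) {
  # Split the word list into non-overlapping chunks of at most chunk_size words
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))
  for (chunk in chunks) {
    # Match any word of the chunk as a whole word
    pattern <- paste0("\\b(", paste(chunk, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x
}

texts <- c("maria buys a cow", "john sells rice")
cleaned <- remove_words_chunked(texts, c("maria", "john", "peter"), chunk_size = 2)
trimws(cleaned)
```

With a corpus, the same idea would amount to looping over the chunks and calling `tm_map(corp, removeWords, chunk)` for each one.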

In [None]:
common_names[1:100]

You can also stem the documents at this point. For illustration purposes I refrain from doing so, but in real applications you might want to. Stemmers do not work equally well for all languages. Depending on your application, there may be added value in grouping and transforming tokens further yourself.

In [None]:
## Stem document
## corp <- tm_map(corp, stemDocument, language = 'english')
## chead(corp)

Strip all extra whitespace (this is without consequences for tokenization).

In [None]:
## Strip extra whitespace
corp <- tm_map(corp, stripWhitespace) 
chead(corp)

## Building a document-term matrix and restricting the feature set

We transform the corpus to a document-term matrix. We use simple term-frequency weighting, i.e. the simple token counts. This is the default. You can also choose term frequency-inverse document frequency (tfidf) weighting at this point.

In [None]:
## Build a document-term or term-document matrix
## Default is term-frequency weighting (raw token counts, not normalized by document length)
## TF-IDF weighting also possible

## dtm <- TermDocumentMatrix(corp)
dtm <- DocumentTermMatrix(corp)

## Inspect the document-term matrix
inspect(dtm)

In [None]:
## Sparse plus non-sparse entries: total number of cells (documents x terms)
230763649+516351

Following the tokenization, we can inspect the most popular words.

In [None]:
## Inspect most popular words
findFreqTerms(dtm, lowfreq = 1000)

The document-word vectors allow us to inspect the correlation between words as we would between variables in other data.

In [None]:
## Inspect associations
findAssocs(dtm, 'profit', 0.15)

Since the matrix is very wide and sparse, we are going to remove terms to arrive at a tractable representation. We can filter words simply by removing words that are very rare.

In [None]:
## Remove sparse terms, prevents cluster node from choking and saves time
## Don't do this at all if you can avoid it (or limit only as little as possible).

## Tweak the sparse parameter to control the number of terms kept
## dtms <- removeSparseTerms(dtm, sparse = 0.90)
## A higher threshold keeps more (rarer) terms: works much better for modeling, but computation takes longer
dtms <- removeSparseTerms(dtm, sparse = 0.95)
dim(dtms)
dtms <- dtms[row_sums(dtms) > 0, ]
dim(dtms)
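
The effect of the `sparse` parameter can be illustrated on a small dense matrix. This sketch assumes that a term is kept when it occurs in more than a share `1 - sparse` of all documents, which is my reading of how `removeSparseTerms()` works; the helper `filter_sparse` and the toy counts are hypothetical.

```r
# Toy document-term counts: 4 documents, 3 terms
counts <- matrix(c(1, 0, 2,
                   1, 0, 1,
                   3, 0, 0,
                   1, 1, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4), c("buy", "rare", "farm")))

# Keep terms whose document frequency exceeds nrow * (1 - sparse)
filter_sparse <- function(m, sparse) {
  docfreq <- colSums(m > 0)
  m[, docfreq > nrow(m) * (1 - sparse), drop = FALSE]
}

colnames(filter_sparse(counts, sparse = 0.9))  # lenient threshold
colnames(filter_sparse(counts, sparse = 0.6))  # stricter threshold
colnames(filter_sparse(counts, sparse = 0.4))  # strictest threshold
```

Lowering `sparse` removes progressively more of the rare terms, so values close to 1 are the least invasive.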

We can also filter by tf-idf, only keeping words which occur frequently in some documents but not in others, helping us to keep those words that disambiguate loans. We compute the average tf-idf score for each token, then filter by that. I went back-and-forth a bit tweaking the threshold value for filtering.

In [None]:
## Alternatively, filter words by mean tf-idf

## Calculate average term-specific tf-idf weights as
## mean(word count/document length) * log(ndocs/ndocs containing word)
termtfidf <- tapply(dtm$v/row_sums(dtm)[dtm$i], dtm$j, mean) *
             log(nDocs(dtm)/col_sums(dtm > 0))
summary(termtfidf)
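
To make the weighting formula concrete, here is the same calculation on a tiny dense matrix in base R. The toy counts are made up; as in the cell above, the mean term frequency is taken only over documents that contain the term.

```r
# Toy document-term counts: 3 documents, 3 terms
m <- matrix(c(2, 0, 1,
              1, 1, 1,
              0, 3, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), c("farm", "shop", "buy")))

# Term frequency: word count divided by document length
tf <- m / rowSums(m)

# Mean term frequency over the documents that contain the term
mean_tf <- apply(tf, 2, function(col) mean(col[col > 0]))

# Inverse document frequency: log(ndocs / ndocs containing the term)
idf <- log(nrow(m) / colSums(m > 0))

score <- mean_tf * idf
score
```

The term `buy` occurs in every document, so its inverse document frequency is `log(1) = 0` and its score is zero: it cannot help to tell documents apart.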

In [None]:
## Only include terms with a mean tf-idf score of at least 0.15 (threshold tuned by hand, see summary above)
dtmw <- dtm[, (termtfidf >= 0.15)]
dim(dtmw)
## And documents within which these terms occur - this may induce selection
dtmw <- dtmw[row_sums(dtmw) > 0, ]
dim(dtmw)
dim(dtm)

In [None]:
## Far fewer frequent terms now
findFreqTerms(dtmw, lowfreq=100)

## Visualizations of word frequencies

These are just very simple visualizations of word frequencies. First the unfiltered but transformed corpus.

In [None]:
## Simple visualization
wordcloud(corp, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

Next the term-frequency filtered document-term matrix. Obviously this is very similar to the plot above. 

In [None]:
## Counts from dtms
counts <- sort(colSums(as.matrix(dtms)), decreasing = TRUE)
counts <- data.frame(word = names(counts), freq = counts)
wordcloud(words = counts$word, freq = counts$freq,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

However, the frequency plot of the tfidf-filtered document-term matrix looks different. There are a lot more terms that disambiguate professions and investment goods.

In [None]:
## Counts from dtmw
counts <- sort(colSums(as.matrix(dtmw)), decreasing = TRUE)
counts <- data.frame(word = names(counts), freq = counts)
wordcloud(words = counts$word, freq = counts$freq,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

## Dictionary methods: Inferring sentiment

We use a fixed mapping of terms to infer a sentiment score, and then convert it to discrete sentiment categories. Unsurprisingly, most loan descriptions are phrased to convey a positive message.

In [None]:
## Dictionary method: Sentiment analysis using dictionaries
sentiment <- analyzeSentiment(dtms, language = "english")
sentiment <- convertToDirection(sentiment$SentimentGI)

## look at sentiment distribution
table(sentiment)
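
The logic of a dictionary method can be sketched in a few lines of base R. The word lists below are hypothetical stand-ins for a real dictionary, and the score (positive minus negative matches, divided by the number of tokens) is an assumption about how such scores are typically built, not a reproduction of the package internals.

```r
# Hypothetical mini-dictionary
positive <- c("good", "improve", "success")
negative <- c("poor", "debt", "loss")

# Score: (number of positive tokens - number of negative tokens) / number of tokens
score_text <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  (sum(tokens %in% positive) - sum(tokens %in% negative)) / length(tokens)
}

# Map the continuous score to discrete categories by its sign
to_direction <- function(score) {
  factor(ifelse(score > 0, "positive", ifelse(score < 0, "negative", "neutral")),
         levels = c("negative", "neutral", "positive"))
}

texts <- c("She hopes to improve her shop and have success",
           "Poor harvest left the family in debt",
           "He sells rice")
scores <- vapply(texts, score_text, numeric(1), USE.NAMES = FALSE)
table(to_direction(scores))
```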

## Unsupervised generative model: Topic model

Next we are going to train an unsupervised topic model on the tf-idf filtered document-term matrix (the variant using the term-frequency filtered matrix is left commented out). You will find that it is hard to find distinct topics, both due to the term filtering and due to the fact that most loans are handed out by partner organizations who use a standard questionnaire to get basic information, which is then translated to an English description.

In [None]:
## Unsupervised method: Topic model
## lda <- LDA(dtms, k = 5, control = list(seed = 100))
lda <- LDA(dtmw, k = 5, control = list(seed = 1000))

## Most likely topic for each document, could merge this to original data
## topic <- topics(lda, 1)

## Ten most likely terms for each topic
terms(lda, 10)

## Plot most frequent terms and associated probabilities by topic
tpm <- tidy(lda, matrix = "beta")

topterms <-
    tpm %>%
    group_by(topic) %>%
    top_n(10, beta) %>%
    ungroup() %>%
    arrange(topic, -beta)

topterms %>%
    mutate(term = reorder(term, beta)) %>%
    ggplot(aes(term, beta, fill = factor(topic))) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ topic, scales = "free") +
    coord_flip()

In [None]:
terms(lda, 10)

Instead, let us use the `loan use` statement, filtered by tfidf. These topics already look more distinct. You could tweak the filter and the number of topics further to arrive at a more meaningful result.

In [None]:
## Not working well due to standardized templates
## Let us try to use the `loanuse` statement text for the generative topic model

# new data
loanuse <- loans[, .(doc_id, loanuse)]
setnames(loanuse, "loanuse", "text")

# new dtm, this time do most of the transformations in one step
dtmuse <- DocumentTermMatrix(Corpus(DataframeSource(loanuse)),
                             control = list(weighting = weightTf,
                                            language = "english",
                                            tolower = TRUE,
                                            removePunctuation = TRUE,
                                            removeNumbers = TRUE,
                                            stopwords = TRUE,
                                            stemming = FALSE,
                                            wordLengths = c(3, Inf)))
inspect(dtmuse)


In [None]:
str(dtmuse)

In [None]:
# Recalculate weights
termtfidf <- tapply(dtmuse$v/row_sums(dtmuse)[dtmuse$i], dtmuse$j, mean) *
    log2(nDocs(dtmuse)/col_sums(dtmuse > 0))
summary(termtfidf)

## Filter by tf-idf
## dim(dtmuse)
dtmuse.tfidf <- dtmuse[, (termtfidf >= 1.30)]
dtmuse.tfidf <- dtmuse.tfidf[row_sums(dtmuse.tfidf) > 0, ]
dim(dtmuse.tfidf)

In [None]:
## Unsupervised method: Topic model, this time for loanuse statement
lda <- LDA(dtmuse.tfidf, k = 5, control = list(seed = 1000))
## str(lda)

In [None]:
## Alternative: fit the topic model on the unfiltered loanuse matrix instead
## Note: this overwrites the tf-idf filtered model from the previous cell
lda <- LDA(dtmuse, k = 5, control = list(seed = 1000))
## str(lda)

In [None]:
## Most likely topic for each document, could merge this to original data
topic <- topics(lda, 1)
head(topic)

In [None]:
## Ten most likely terms for each topic
terms(lda, 10)

In [None]:
## Plot most frequent terms and associated probabilities by topic
tpm <- tidy(lda, matrix = "beta")

topterms <-
    tpm %>%
    group_by(topic) %>%
    top_n(10, beta) %>%
    ungroup() %>%
    arrange(topic, -beta)

topterms %>%
    mutate(term = reorder(term, beta)) %>%
    ggplot(aes(term, beta, fill = factor(topic))) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ topic, scales = "free") +
    coord_flip()

Look at unique terms not appearing in other topics.

In [None]:
freqterms <- terms(lda, 40)
duplicates <- c(freqterms)[duplicated(c(freqterms))]
distinctterms <- lapply(as.list(as.data.frame(freqterms)), function(x) x[!(x %in% duplicates)])
distinctterms

## Supervised methods: Data preparation

The following cells transform and prep the data to be used as inputs for supervised methods. We split the data into a test and a training sample. The outcome we try to predict is whether the loan is obtained for a business proposition in the agricultural sector.

In [None]:
## Input: Only filtering here for tractability and runtime. 
## Use dtm without restrictions or leave sparsity as large as possible.
dtms <- removeSparseTerms(dtm, sparse = 0.95)
dim(dtms)

In [None]:
## Supervised methods: Prep data
## Convert the sparse document-term matrix to a standard data frame
bag <- as.data.frame(as.matrix(dtms))
dim(bag)
#bag
head(bag)

In [None]:
## Convert token counts to simple binary indicators
bag.bin <- as.data.frame(sapply(bag, function(x) as.numeric(x > 0)))
dim(bag.bin)
head(bag.bin)
hist(bag$buy)

In [None]:
## Add names to rows
bag$doc_id <- rownames(as.matrix(dtms))
bag.bin$doc_id <- rownames(as.matrix(dtms))
head(bag)

In [None]:
## Different sectors
table(loans$sectorname)

In [None]:
## Add outcomes from the original data: Predict agricultural sector
loans$agsector <- as.numeric(loans$sectorname == "Agriculture")
bag <- merge(bag, loans[, .(agsector, loanamount, doc_id)], by = "doc_id")
bag.bin <- merge(bag.bin, loans[, .(agsector, loanamount, doc_id)], by = "doc_id")
                            
# How many people want a loan in the agricultural sector?                            
table(bag$agsector)

In [None]:
## Partition data in test and training sample
set.seed(100)
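## Draw 20% of the row indices at random from all rows;
## sample(k) alone would only permute 1:k, i.e. always select the first k rows
testids <- sample(nrow(bag), floor(nrow(bag)/5))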
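
A minimal sketch of a random hold-out split in base R, on a hypothetical toy data frame. It also shows the difference between `sample(n, size = k)`, which draws `k` indices from `1:n`, and `sample(k)`, which merely shuffles `1:k`.

```r
set.seed(100)
n <- 50
toy <- data.frame(id = seq_len(n), x = rnorm(n))

# Draw 20% of all row indices without replacement
test_rows <- sample(n, size = floor(n / 5))
train <- toy[-test_rows, ]
test  <- toy[test_rows, ]

# For contrast: a single-number call only permutes 1..k,
# so the selected rows are always the first k rows of the data
first_block <- sample(floor(n / 5))
sort(first_block)

dim(train)
dim(test)
```

If the rows are sorted in any meaningful way (for example by id or date), only the first variant gives a test set that is representative of the full sample.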

In [None]:
names(bag)

In [None]:
xtrain <- as.matrix(bag[-testids, !(names(bag) %in% c("agsector", "loanamount", "doc_id"))])
xtest  <- as.matrix(bag[ testids, !(names(bag) %in% c("agsector", "loanamount", "doc_id"))])

xtrain.bin <- as.matrix(bag.bin[-testids, !(names(bag.bin) %in% c("agsector", "loanamount", "doc_id"))])
xtest.bin  <- as.matrix(bag.bin[ testids, !(names(bag.bin) %in% c("agsector", "loanamount", "doc_id"))])

ytrain <- as.factor(bag[-testids,  "agsector"])
ytest  <- as.factor(bag[ testids,  "agsector"])

dim(xtrain)
length(ytrain)

dim(xtest)
length(ytest)

## Supervised generative model: Naive Bayes classifier

### With binary token indicators

Naive Bayes is a simple model that relies on the assumption that the tokens are conditionally independent given the class. It often performs acceptably. In this case it does not perform very well, possibly because we filtered the input token data too aggressively. If you redo the analysis with a larger set of features (for example a higher `sparse` threshold in `removeSparseTerms()`, or not filtering at all), the Naive Bayes classifier performs much better. Among other things, there is also no need to remove stopwords here, and doing so may not improve performance. You can feed the unfiltered data or try different transformations and see whether this improves matters.
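
To make the conditional independence assumption concrete, here is a from-scratch sketch of a Bernoulli Naive Bayes classifier with Laplace smoothing in base R. The function names and the toy data are hypothetical, and this is an illustration of the idea rather than the implementation used by the packages below.

```r
# Fit: class priors and smoothed P(token present | class)
fit_bernoulli_nb <- function(x, y, laplace = 1) {
  classes <- levels(y)
  prior <- as.numeric(table(y)[classes]) / length(y)
  # Rows = classes, columns = tokens; two possible values per token (0/1)
  theta <- t(sapply(classes, function(cl) {
    rows <- x[y == cl, , drop = FALSE]
    (colSums(rows) + laplace) / (nrow(rows) + 2 * laplace)
  }))
  list(classes = classes, prior = prior, theta = theta)
}

# Predict: log prior + sum of per-token log likelihoods (independence assumption)
predict_bernoulli_nb <- function(model, newx) {
  loglik <- newx %*% t(log(model$theta)) + (1 - newx) %*% t(log(1 - model$theta))
  scores <- sweep(loglik, 2, log(model$prior), "+")
  model$classes[max.col(scores, ties.method = "first")]
}

# Toy data: binary token indicators and a binary outcome
x <- rbind(c(1, 0), c(1, 1), c(0, 1), c(0, 1))
colnames(x) <- c("farm", "shop")
y <- factor(c("1", "1", "0", "0"))

model <- fit_bernoulli_nb(x, y, laplace = 1)
model$theta
pred <- predict_bernoulli_nb(model, matrix(c(1, 0), nrow = 1))
pred
```

Because the per-token likelihoods are simply multiplied, each token contributes independently to the class score, which is what makes the model fast but also crude.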

In [None]:
## Supervised generative model: Naive Bayes
## naive_bayes package requires transforming everything to factors and using binary indicators, not counts.
xtrain.factor <- as.data.frame(lapply(as.data.frame(xtrain.bin), as.factor))
xtest.factor <- as.data.frame(lapply(as.data.frame(xtest.bin), as.factor))

In [None]:
nbclassifier <- naive_bayes(xtrain.factor, ytrain, laplace = 1)
nbclassifier

In [None]:
nbpred <- predict(nbclassifier, xtest.factor)
# nbclassifier
summary(nbpred)

In [None]:
## Performance statistics: Classification rate
round(1 - mean(as.numeric(nbpred != ytest)), 2)

In [None]:
## Performance statistics: Confusion matrix
## table(nbpred, ytest)
confusionMatrix(nbpred, ytest)

In [None]:
expand.grid(
    mtry = seq(2, 2 * floor(sqrt(ncol(xtrain))), length.out = 10),
    splitrule = "gini",
    min.node.size = c(1,3)
  )

#### Tuning the laplace smoothing parameter

There isn't really much scope for tuning with naive bayes.

In [None]:
## parameter grid
nb.grid <- expand.grid(
  laplace = seq(0, 1, 0.1),
  adjust = 1,
  usekernel = TRUE
)
nb.grid

In [None]:
## use k-fold cv to tune the laplace smoothing parameter
nbclassifier <- train(
  xtrain.factor, ytrain,
  method = "naive_bayes",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = nb.grid
)

nbclassifier
summary(nbclassifier)

In [None]:
nbpred <- predict(nbclassifier, xtest.factor)
1-mean(as.numeric(nbpred != ytest))
confusionMatrix(nbpred, ytest)

### With token counts

Binary features perform only slightly better or worse than counts, and mostly just about the same, depending on the size of the design matrix. Whether a word occurs at all encodes about the same information as how frequently it occurs.

In [None]:
## fastNaiveBayes is the better package (supports multinomial distribution, for non-binary feature counts)
## fnb.detect_distribution(xtrain)

nbclassifier <- fastNaiveBayes(xtrain, ytrain)
# nbclassifier <- multinomial_naive_bayes(xtrain, ytrain)

nbpred <- predict(nbclassifier, xtest)

In [None]:
## Performance statistics: Classification rate
round(1-mean(as.numeric(nbpred != ytest)), 2)

## Performance statistics: Confusion matrix
## table(nbpred, ytest)
confusionMatrix(nbpred, ytest)

## Supervised text regression: L<sub>1</sub> penalized logistic classifier

Looking at the classification rate and the confusion matrix, the L<sub>1</sub> penalized model performs better than Naive Bayes in predicting the agricultural sector. However, judging by precision and recall, the model again does poorly at identifying the loans that truly belong to the agricultural sector, leading to a large number of false negatives.
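
The distinction between accuracy, precision and recall can be computed by hand from a confusion table. The vectors below are made-up toy data, with `1` denoting the agricultural sector.

```r
actual    <- factor(c(1, 1, 1, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 0, 0, 0, 0, 1), levels = c(0, 1))

tab <- table(predicted, actual)
tp <- tab["1", "1"]  # true positives
fp <- tab["1", "0"]  # false positives
fn <- tab["0", "1"]  # false negatives

accuracy  <- sum(diag(tab)) / sum(tab)
precision <- tp / (tp + fp)  # share of predicted positives that are correct
recall    <- tp / (tp + fn)  # share of actual positives that are found

c(accuracy = accuracy, precision = precision, recall = recall)
```

Note that `caret::confusionMatrix()` treats the first factor level as the positive class by default, so with a 0/1 outcome you may want to pass `positive = "1"` to get sensitivity and precision for the agricultural sector.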

### With binary token indicators

In [None]:
## Supervised text regression: L1 penalized logistic regression
l1classifier <- cv.glmnet(xtrain.bin, ytrain, alpha = 1, family = "binomial")
l1pred <- as.factor(predict(l1classifier, xtest.bin, s = "lambda.min", type = "class"))
summary(l1pred)

In [None]:
plot(l1classifier)

In [None]:
## Performance statistics: Classification rate
round(1-mean(as.numeric(l1pred != ytest)), 2)

## Performance statistics: Confusion matrix
caret::confusionMatrix(l1pred, ytest)

### With token counts

If you check, the model with feature counts does not do better than the binary model.

In [None]:
## Supervised text regression: L1 penalized logistic regression
l1classifier <- cv.glmnet(xtrain, ytrain, alpha = 1, family = "binomial")
l1pred <- as.factor(predict(l1classifier, xtest, s = "lambda.min", type = "class"))
summary(l1pred)

In [None]:
plot(l1classifier)

In [None]:
## Performance statistics: Classification rate
round(1-mean(as.numeric(l1pred != ytest)), 2)

## Performance statistics: Confusion matrix
caret::confusionMatrix(l1pred, ytest)

The following cell also trains a logistic lasso estimator, scaling the penalty factor for each input token by the token's standard deviation. Results barely differ from simply standardizing the inputs, which is no surprise.
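
A short derivation shows why the two approaches should give nearly the same result. Let $s_j$ denote the standard deviation of token $j$, $\tilde{x}_{ij} = x_{ij} / s_j$ the standardized input, and $\tilde{\beta}_j$ its coefficient (notation introduced here for illustration). Defining $\beta_j = \tilde{\beta}_j / s_j$ leaves the linear predictor unchanged, so only the penalty term is affected:

$$
\tilde{x}_{ij}\,\tilde{\beta}_j = \frac{x_{ij}}{s_j}\, s_j \beta_j = x_{ij}\,\beta_j
\qquad\Longrightarrow\qquad
\lambda \sum_j |\tilde{\beta}_j| = \lambda \sum_j s_j\,|\beta_j| .
$$

A lasso on standardized inputs is therefore equivalent to a lasso on the raw inputs with penalty factor $s_j$ for token $j$. Centering only shifts the intercept, and any constant rescaling of the penalty factors is absorbed into $\lambda$, so the selected $\lambda$ values may differ while the fitted models remain close.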

In [None]:
## L1 logistic classifier using rare feature upweighting:
## tokens with a low standard deviation (rare tokens) receive a smaller penalty
sdweights <- apply(xtrain, 2, sd)
l1classifier <- cv.glmnet(xtrain, ytrain, alpha = 1, family = "binomial",
                          standardize = FALSE, penalty.factor = sdweights)
## The penalty factors are part of the fitted model, no need to pass them again
l1pred <- as.factor(predict(l1classifier, xtest, s = "lambda.min", type = "class"))
summary(l1pred)

In [None]:
## Performance statistics: Classification rate
round(1-mean(as.numeric(l1pred != ytest)), 2)
## Performance statistics: Confusion matrix
caret::confusionMatrix(l1pred, ytest)

## Supervised: Random forest

In [None]:
## using library(caret) for training. alternative: library(mlr), or library(tidymodels) if you prefer tidyverse 

## using oob
## small scale for cluster, adjust mtry to finer grid and increase num.trees substantially
rfclassifier <- train(
  y = ytrain,
  x = xtrain,
  method = "ranger",
  num.trees = 200,
  tuneGrid = expand.grid(
    mtry = seq(2, 2 * floor(sqrt(ncol(xtrain))), length.out = 10),
    splitrule = "gini",
    min.node.size = c(1,3)
  ),
  trControl = trainControl(
    method = "oob"
  )
)

In [None]:
rfclassifier

In [None]:
plot(rfclassifier)

In [None]:
rfclassifier$bestTune

In [None]:
rfpred <- predict(rfclassifier, xtest)
1 - mean(as.numeric(rfpred != ytest))

In [None]:
confusionMatrix(rfpred, ytest)

## Boosted trees

In [None]:
## AdaBoost with decision trees as base learner
## Simplified to lower runtime!
## Increase iterations and use a larger parameter grid: expand the grid for maxdepth, use a finer grid for the learning rate.
gbclassifier <- train(
  y = ytrain,
  x = xtrain,
  method = "ada",
  tuneGrid = expand.grid(
    iter = 10, 
    maxdepth = seq(2, 5, 1), 
    nu = seq(0.1, 1, 0.3)
  ),
  trControl = trainControl(
    method = "cv",
    number = 5
  )
)

In [None]:
gbclassifier

In [None]:
plot(gbclassifier)

In [None]:
gbclassifier$bestTune

In [None]:
gbpred <- predict(gbclassifier, xtest)
1 - mean(as.numeric(gbpred != ytest))

In [None]:
confusionMatrix(gbpred, ytest)

In [None]:
## gradient boosting
## try method = "xgbTree" from library(xgboost), may have better performance but more tuning parameters
## ...

# Tasks

In [None]:
## Task: The above examples are all classification. Implement a regression example. 
## Predict the loanamount using L1 penalized linear regression (lasso).
## ...

In [None]:
## Discuss: How would you go about improving performance for the classifiers?
## Do not restrict sparsity/drop columns for tractability. 
## Consider adding other predictors as inputs. 
## Use the loanuse statement.

In [None]:
## Task: Re-implement the classifiers using only the loan use statement. 
## ...

# Addendum

## Regression example: Predicting loanamount using L<sub>1</sub> penalized linear regression

In [None]:
## Further example: Predict Loan Amount
## Supervised text regression: L1 penalized linear regression

## Rebuild outcome vectors
ytrain <- as.matrix(bag[-testids,  "loanamount"])
ytest  <- as.matrix(bag[ testids,  "loanamount"])

In [None]:
## Estimate and predict
l1predictor <- cv.glmnet(xtrain, ytrain, alpha = 1, family = "gaussian")
l1pred <- predict(l1predictor, xtest, s = "lambda.min", type = "response")


In [None]:
## RMSE
round(sqrt(mean((l1pred - ytest)^2)), 2)
postResample(l1pred, ytest)
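
For reference, the error measures can be computed by hand in base R. The prediction and outcome vectors are made-up toy values.

```r
pred <- c(100, 200, 300)
obs  <- c(110, 190, 330)

# Root mean squared error: penalizes large errors more heavily
rmse <- sqrt(mean((pred - obs)^2))

# Mean absolute error: average size of the error in the outcome's units
mae <- mean(abs(pred - obs))

c(RMSE = rmse, MAE = mae)
```

Both measures are in the units of the outcome (here the loan amount), so they should be judged relative to the scale and spread of the outcome variable.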

In [None]:
hist(ytrain)

## Understanding signal-to-noise

This section demonstrates the value of good data.

 
## Using the 'loanuse' statement for prediction

In [None]:
## Input: Only filtering here for tractability and runtime. 
## Use dtm without restrictions or leave sparsity as large as possible.
dtms <- removeSparseTerms(dtmuse, sparse = 0.995)
dim(dtms)

In [None]:
dtmuse

In [None]:
## Supervised methods: Prep data
## Convert the sparse document-term matrix to a standard data frame
bag <- as.data.frame(as.matrix(dtms))
## Convert token counts to simple binary indicators
bag.bin <- as.data.frame(sapply(bag, function(x) as.numeric(x > 0)))

In [None]:
## Add names to rows
bag$doc_id <- rownames(as.matrix(dtms))
bag.bin$doc_id <- rownames(as.matrix(dtms))
head(bag)

In [None]:
## Add outcomes from the original data: Predict agricultural sector
loans$agsector <- as.numeric(loans$sectorname == "Agriculture")
bag <- merge(bag, loans[, .(agsector, loanamount, doc_id)], by = "doc_id")
bag.bin <- merge(bag.bin, loans[, .(agsector, loanamount, doc_id)], by = "doc_id")
                            
# How many people want a loan in the agricultural sector?                            
table(bag$agsector)

In [None]:
## Partition data in test and training sample
set.seed(100)
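## Draw 20% of the row indices at random from all rows (sample(k) alone only permutes 1:k)
testids <- sample(nrow(bag), floor(nrow(bag)/5))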

In [None]:
xtrain <- as.matrix(bag[-testids, !(names(bag) %in% c("agsector", "loanamount", "doc_id"))])
xtest  <- as.matrix(bag[ testids, !(names(bag) %in% c("agsector", "loanamount", "doc_id"))])

xtrain.bin <- as.matrix(bag.bin[-testids, !(names(bag.bin) %in% c("agsector", "loanamount", "doc_id"))])
xtest.bin  <- as.matrix(bag.bin[ testids, !(names(bag.bin) %in% c("agsector", "loanamount", "doc_id"))])

ytrain <- as.factor(bag[-testids,  "agsector"])
ytest  <- as.factor(bag[ testids,  "agsector"])

dim(xtrain)
length(ytrain)

dim(xtest)
length(ytest)

## Running Naive Bayes on only the loan use statement

In [None]:
## Supervised generative model: Naive Bayes
## naive_bayes package requires transforming everything to factors and using binary indicators, not counts.
options(encoding = "UTF-8")
xtrain.factor <- as.data.frame(lapply(as.data.frame(xtrain.bin), as.factor))
xtest.factor <- as.data.frame(lapply(as.data.frame(xtest.bin), as.factor))

In [None]:
nbclassifier <- naive_bayes(xtrain.factor, ytrain, laplace = 1)

In [None]:
nbpred <- predict(nbclassifier, xtest.factor)
# nbclassifier
summary(nbpred)

In [None]:
## Performance statistics: Classification rate
round(1 - mean(as.numeric(nbpred != ytest)), 2)

## Performance statistics: Confusion matrix
## table(nbpred, ytest)
confusionMatrix(nbpred, ytest)

## Running Lasso on only the loan use statement

In [None]:
## Supervised text regression: L1 penalized logistic regression
l1classifier <- cv.glmnet(xtrain.bin, ytrain, alpha = 1, family = "binomial")
l1pred <- as.factor(predict(l1classifier, xtest.bin, s = "lambda.min", type = "class"))
summary(l1pred)

In [None]:
confusionMatrix(l1pred, ytest)

## Re-estimating the models using both loan use statement and description as inputs

This actually worsens the signal to noise ratio.

In [None]:
bag.use <- as.data.frame(as.matrix(removeSparseTerms(dtmuse, sparse = 0.995)))
bag.use.bin <- as.data.frame(sapply(bag.use, function(x) as.numeric(x > 0)))
bag.desc <- as.data.frame(as.matrix(removeSparseTerms(dtm, sparse = 0.95)))
bag.desc.bin <- as.data.frame(sapply(bag.desc, function(x) as.numeric(x > 0)))

set.seed(100)
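## Draw 20% of the row indices at random from all rows (sample(k) alone only permutes 1:k)
testids <- sample(nrow(bag.use), floor(nrow(bag.use)/5))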

xtrain.use <- as.matrix(bag.use[-testids, !(names(bag.use) %in% c("agsector", "loanamount", "doc_id"))])
xtest.use  <- as.matrix(bag.use[ testids, !(names(bag.use) %in% c("agsector", "loanamount", "doc_id"))])
xtrain.desc <- as.matrix(bag.desc[-testids, !(names(bag.desc) %in% c("agsector", "loanamount", "doc_id"))])
xtest.desc  <- as.matrix(bag.desc[ testids, !(names(bag.desc) %in% c("agsector", "loanamount", "doc_id"))])

xtrain.use.bin <- as.matrix(bag.use.bin[-testids, !(names(bag.use) %in% c("agsector", "loanamount", "doc_id"))])
xtest.use.bin  <- as.matrix(bag.use.bin[ testids, !(names(bag.use) %in% c("agsector", "loanamount", "doc_id"))])
xtrain.desc.bin <- as.matrix(bag.desc.bin[-testids, !(names(bag.desc) %in% c("agsector", "loanamount", "doc_id"))])
xtest.desc.bin  <- as.matrix(bag.desc.bin[ testids, !(names(bag.desc) %in% c("agsector", "loanamount", "doc_id"))])

xtrain <- cbind(xtrain.use, xtrain.desc)
xtrain.bin <- cbind(xtrain.use.bin, xtrain.desc.bin)
xtest <- cbind(xtest.use, xtest.desc)
xtest.bin <- cbind(xtest.use.bin, xtest.desc.bin)
                                     
ytrain <- as.factor(as.numeric(loans$sectorname == "Agriculture")[-testids])
ytest  <- as.factor(as.numeric(loans$sectorname == "Agriculture")[testids])

dim(xtrain.use)
dim(xtrain.desc)
dim(xtrain)
table(ytrain)

In [None]:
## Naive Bayes
xtrain.factor <- as.data.frame(lapply(as.data.frame(xtrain.bin), as.factor))
xtest.factor <- as.data.frame(lapply(as.data.frame(xtest.bin), as.factor))
nbclassifier <- naive_bayes(xtrain.factor, ytrain, laplace = 1)
nbpred <- predict(nbclassifier, xtest.factor)
summary(nbpred)
confusionMatrix(nbpred, ytest)

In [None]:
## Supervised text regression: L1 penalized logistic regression
l1classifier <- cv.glmnet(xtrain.bin, ytrain, alpha = 1, family = "binomial")
l1pred <- as.factor(predict(l1classifier, xtest.bin, s = "lambda.min", type = "class"))
summary(l1pred)
confusionMatrix(l1pred, ytest)