
Report on Creating a ‘Predictive Text App’ using text mining libraries and Shiny

Polong Lin

October 12, 2014


Summary of report:

In this report, I discuss my creation of a predictive text app that was trained on a large corpus of text from news-related sources. I describe how the data was used to create n-gram models, which were subsequently used to generate my first prediction model, a very simplified version of the Katz back-off model. Next, I tried to implement the Kneser-Ney smoothing algorithm, but with limited time to process the scripts, the results were not as good as hoped. Both prediction models are shown in a final Shiny app.

Introduction:

The purpose of this project was to create a predictive text app that predicts the next word that follows a sentence input. For example, given “I can not wait to” as the input sentence, the app should suggest words like “see”, “go”, and “eat”.

This report is broken down into two major steps:

  1. Training the app on a large corpus of text
    • importing the corpus
    • cleaning the corpus to remove undesired items
    • converting the corpus into a proper format (data tables of n-gram models)
  2. Generating the predictions
    • smoothing the n-gram model
    • retrieving predictions

1. Training the app on a large corpus of text

Overall Process:

  1. Import the corpus (or a subset of, say, 500,000 lines, as I did)

  2. Clean the text (e.g., remove symbols, emoticons, profanity)

  3. Tokenize the text

  4. Create a data table of a unigram model with the token and token counts as the columns

  5. Repeat #4 for higher n-grams, up to 5-grams.

  6. Add columns to split each token into its individual words, for better indexing (e.g., token1, token2, token, and count as the columns of a data table for the bigram model).

  7. Remove any tokens with a count of 1 from each model to reduce size.


a. Corpus Source:

There were three corpora available in English: text mined from Twitter, from blogs, and from news articles online. All of the text files used in this project can be downloaded at:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Summary Statistics of data sources (blog, news, Twitter)

Twitter data: 2.36 × 10^6 lines of text; NA MB

Blog data: 9 × 10^5 lines of text; NA MB

News data: 1.01 × 10^6 lines of text; NA MB

Note: Although I worked on all three datasets, all of the code in this report uses the “News” data set.

Example of text from the news corpus:

## He wasn't home alone, apparently.
## The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.

b. Creating n-gram models

Natural language processing (NLP) is a field concerned with using computational methods to analyze text. Borrowing from concepts in NLP, we divided each text document into tokens (in this case, individual words), which could be grouped into n-grams for analysis. N-grams are sequences of words that occur consecutively, with “n” representing the number of words in question. For example, trigrams are triplets of words, bigrams are pairs of words, and unigrams are individual words. By “tokenizing” a corpus, we can take tokens that occur together and count the number of times each n-gram occurs throughout the corpus (e.g., the number of times the trigram “it is the” occurs in the text).
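
As a small illustration of n-gram counting, the snippet below uses the tau package (the same package used for tokenization later in this report) to count the bigrams in a single, arbitrary sentence; the sentence itself is only an example.

# Count the bigrams in one short sentence with tau::textcnt, mirroring the
# call used later in this report (method = "string", n = 2 for bigrams)
require(tau)
bigram.counts <- textcnt("it is the best it is the worst", method = "string", n = 2)
head(sort(bigram.counts, decreasing = TRUE))  # "it is" and "is the" each occur twice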

Thus, for each value of n, the corpus of text was converted into counts of n-grams. Here is an example of the final product:

## The most common bigrams from the news corpus
##   token1: first word in the bigram
##   token2: second word in the bigram
##   token: the bigram
##   count: the number of times the bigram appeared
##    token1 token2   token count
## 1:     of    the  of the 79156
## 2:     in    the  in the 76945
## 3:     to    the  to the 38823
## 4:     on    the  on the 33406
## 5:    for    the for the 32076
## 6:     it     is   it is 31614

c. Explanation of how I created the n-gram models in R

The text was imported using readLines, and converted into a corpus using the tm package in R.

require(tm)
sourcename <- "news"; lines <- 500000
# "directory" holds the path to the unzipped Coursera-SwiftKey dataset
text <- readLines(paste0(directory, "/final/en_US/en_US.", sourcename, ".txt"),
                  n = lines)
cleanedCorpus <- cleanCorpus(Corpus(VectorSource(text)))

cleanCorpus is a custom function I built to pre-process the text (a sketch of such a function appears after the list below). Its purpose was the following:

  • remove most symbols such as #%$, because they do not add any value to word predictions
  • convert “!” and “?” to “.” to signify the end of a sentence
  • reduce consecutive periods into a single period
  • convert “&” symbol into “and” as they mean the same thing
  • remove profanity based on a list of words from a separate text file. This list was obtained from: https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
  • remove the numbers “0123456789” to simplify the corpus
  • convert all text to lowercase for simplicity
  • expand contractions so that they are tokenized the same way as their full forms
  • remove extra whitespace
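
Here is a minimal sketch of what such a cleaning function might look like using the tm package; the exact regular expressions, the profanity-list path, and the omission of contraction expansion are illustrative assumptions, not my original cleanCorpus code.

require(tm)
cleanCorpus.sketch <- function(corpus, profanityFile = "profanity.txt") {
  sub.pattern <- content_transformer(function(x, pattern, repl) gsub(pattern, repl, x))
  corpus <- tm_map(corpus, sub.pattern, "[!?]+", ".")      # "!" and "?" mark sentence ends
  corpus <- tm_map(corpus, sub.pattern, "\\.{2,}", ".")    # collapse repeated periods
  corpus <- tm_map(corpus, sub.pattern, "&", " and ")      # "&" means "and"
  corpus <- tm_map(corpus, sub.pattern, "[#%$]+", " ")     # drop uninformative symbols
  corpus <- tm_map(corpus, content_transformer(tolower))   # lowercase for simplicity
  corpus <- tm_map(corpus, removeNumbers)                  # remove 0-9
  corpus <- tm_map(corpus, removeWords, readLines(profanityFile))  # profanity filter
  corpus <- tm_map(corpus, stripWhitespace)                # collapse extra whitespace
  corpus
}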

Next, the text was tokenized using the tau or RWeka packages, which were interchangeable.

require(tau)
ngrams <- 2  # n for the current n-gram model (e.g., 2 for the bigram model)
tau_ngrams <- function(x, ngrams)
  return(rownames(as.data.frame(unclass(textcnt(x, method = "string", n = ngrams)))))
clean.tdm <- TermDocumentMatrix(cleanedCorpus,
                                control = list(tokenize = function(x) tau_ngrams(x, ngrams),
                                               wordLengths = c(1, Inf)
                                               ))

Next, data tables were created for each n-gram model.

require(slam); require(data.table)
ngram_model <- rollup(clean.tdm, 2, na.rm = TRUE, FUN = sum)  # sum counts across documents
ngram_model_dt <- data.table(token = ngram_model$dimnames$Terms, count = ngram_model$v)

After splitting each token, which contained multiple words, into one word per column (a sketch of this step is shown below), we obtained our final n-gram model output.
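
A minimal sketch of this splitting step for the trigram model, assuming the data table built above; tstrsplit is one possible helper from the data.table package, not necessarily the one used in my original scripts.

require(data.table)
# split each space-separated trigram into one word per column
ngram_model_dt[, c("token1", "token2", "token3") := tstrsplit(token, " ", fixed = TRUE)]
setcolorder(ngram_model_dt, c("token1", "token2", "token3", "token", "count"))
ngram_model_dt <- ngram_model_dt[count > 1]  # drop tokens with a count of 1 (step 7 above)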

Here’s an example:

The final output for a trigram model:

##            token1    token2 token3                  token count
##       1:      one        of    the             one of the  7019
##       2:        a       lot     of               a lot of  5196
##       3:       it        is      a                it is a  4809
##       4:        i        do    not               i do not  3693
##       5:       it        is    not              it is not  3206
##      ---                                                       
## 1397053:   zydeco        or  cajun        zydeco or cajun     2
## 1397054: zydrunas ilgauskas    was zydrunas ilgauskas was     2
## 1397055:     zygi      wilf   said         zygi wilf said     2
## 1397056:    zynga         s  other          zynga s other     2
## 1397057:       zz       top      s               zz top s     2

As you can see above, some trigrams contain words that do not appear to be actual English words. This shows that the functions I used to clean the corpus could still be improved. In the end, I decided to spend more time on the next phase of producing the prediction algorithm, rather than on generating a cleaner corpus!


2. Generating the predictions

Based on the data tables and their counts, I was now able to create a very simple prediction model.

a. Prediction model 1: Using raw counts

This basic prediction model is best summarized as a simplified version of Katz’s back-off model. For a given sentence ending in k words, it returns the (k+1)th words that have the highest counts according to the (k+1)-gram data table:

  1. Input a sentence.
  2. Reduce the sentence to the final k words, where k is the number of n-gram models created minus one.
  3. Check the n-gram data table and match the n-1 words to return a list of nth words, according to the data table.
  4. If Step 3 returns at least one word, the prediction stops here. Otherwise, if no nth word was found, reduce the k words to the final k-1 words and repeat Step 3.
  5. If no words are found even at the bigram level (e.g., no continuation of the final word, “to”), then the prediction model does not return any words.


Here’s an example of the algorithm in Prediction Model 1:

We begin with a string:

string <- tolower("The most advanced vehicle of the")
string <- strsplit(string, " ")[[1]]
string
## [1] "the"      "most"     "advanced" "vehicle"  "of"       "the"
length(string)
## [1] 6

Then we reduce the string to the final k words, where k is equal to the number of n-gram models minus one.

k <- length(ndictlist) - 1 #Number of n-gram models in "ndictlist" minus one
string <- tail(string, k) #Reducing the string
string
## [1] "vehicle" "of"      "the"

Then we look up the highest n-gram model to see what candidate nth words are returned:

setkey(ndictlist[[4]], token1, token2, token3)
result <- ndictlist[[4]][list(string[1], string[2], string[3])]
result[order(-count)]
##     token1 token2 token3 token4 token count
## 1: vehicle     of    the     NA    NA    NA

As we can see, there are no candidate words that follow "vehicle of the": the lookup returns NA because this sequence of words was never found in the original news corpus.


Next, the prediction model reduces the sentence by one word and repeats the process.

k <- k - 1 #Back off to the next-lower n-gram model
string <- tail(string, k) #Reducing the string
string
## [1] "of"  "the"
setkey(ndictlist[[3]], token1, token2)
result <- ndictlist[[3]][list(string[1], string[2])]
result[order(-count)][1:10]
##     token1 token2 token3         token count
##  1:     of    the   year   of the year  1656
##  2:     of    the season of the season  1262
##  3:     of    the   most   of the most  1187
##  4:     of    the  state  of the state   954
##  5:     of    the    new    of the new   830
##  6:     of    the  world  of the world   799
##  7:     of    the   city   of the city   773
##  8:     of    the   game   of the game   744
##  9:     of    the   best   of the best   666
## 10:     of    the  first  of the first   644

The trigram model was able to find trigrams beginning with "of the" and returned the results above.
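
Putting the steps together, here is a minimal sketch of the whole backoff lookup, assuming the ndictlist structure shown above (element n holds the n-gram data table with columns token1 through token n, token, and count); the function name and arguments are illustrative, not my original code.

require(data.table)
predict.raw <- function(sentence, ndictlist, top = 5) {
  words <- strsplit(tolower(sentence), " ")[[1]]
  for (n in length(ndictlist):2) {              # start at the highest n-gram model
    k <- n - 1                                  # number of history words to match
    if (length(words) < k) next
    dt <- ndictlist[[n]]
    setkeyv(dt, paste0("token", 1:k))           # index on the history columns
    result <- dt[as.list(tail(words, k)), nomatch = 0]  # rows matching the history
    if (nrow(result) > 0)                       # found continuations: stop here
      return(head(result[order(-count)][[paste0("token", n)]], top))
  }
  character(0)                                  # no continuation found at any level
}

Calling predict.raw("The most advanced vehicle of the", ndictlist) would then reproduce the backoff behaviour walked through above.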


There are, however, a number of drawbacks to this model:

  • it would have to train on an extremely large corpus to be able to retrieve all the valid combinations (n-grams) of words
  • a few n-grams occur very frequently, while most n-grams appear seldom, and no discounting is applied to the counts to even out the distribution
  • it cannot handle combinations of words it has never seen before in the corpus

For these reasons, I decided to try a prediction model using the Kneser-Ney method.

b. Prediction Model 2: My attempt at Kneser-Ney smoothing

Kneser-Ney smoothing is an algorithm designed to adjust the weights (through discounting) by using the continuation counts of lower n-grams. The concept is fairly simple though a bit more difficult to implement in a program than the one used in Prediction Model 1.

Given the sentence below, “Francisco” is presented as the suggested ending, because it appears more often than “glasses” in some text:

I can’t see without my reading ______ (suggested: “Francisco”)

However, even though “Francisco” appears more often than “glasses”, “Francisco” rarely occurs outside of the context of “San Francisco”. Thus, instead of observing how often a word appears, the Kneser-Ney algorithm takes into account how often a word completes a bigram type (e.g., “prescription glasses”, “reading glasses”, “small glasses” vs. “San Francisco”). This example was taken from Jurafsky’s video lecture on Kneser-Ney Smoothing, which also describes the equation used to calculate the Kneser-Ney probability. https://www.youtube.com/watch?v=wtB00EczoCM
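
For reference, here is the standard bigram form of the interpolated Kneser-Ney estimate described in that lecture (the notation below is my own summary, not taken from my scripts):

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1} w_i) - d,\ 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{continuation}}(w_i)$$

where the discount $d$ is subtracted from every observed bigram count, $\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\big|\{w : c(w_{i-1} w) > 0\}\big|$ redistributes the discounted probability mass, and $P_{\text{continuation}}(w_i)$ is the number of distinct bigram types that $w_i$ completes, divided by the total number of distinct bigram types.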

I believe that, typically, the smoothing algorithm is performed on all of the n-grams (unigram models, bigram models, etc.) prior to attempting any predictions. However, given the constraint of computing time, I only had enough time to create a version that performs the smoothing when a user provides an input.


I was able to write a completely recursive Kneser-Ney algorithm, pkn.calc, for n-gram models of any n. In practice, however, I limited the number of candidate words, and thus the resulting predictions are often quite inaccurate.

The procedure of my pkn.calc function is as follows (a bigram-level sketch appears after the list):

  1. Calculate the Kneser-Ney probability (PKN) of the lowest order (unigram level).
  2. Calculate the PKN of all the lower orders using successively lower PKN values.
  3. Calculate the highest PKN value using the PKN of the lower values, and return this value.
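
To illustrate only the lowest level of that recursion, here is a minimal sketch of the bigram-level PKN calculation, assuming the bigram data table built earlier (columns token1, token2, count); the function name and the discount d = 0.75 are assumptions, not my original pkn.calc code.

require(data.table)
pkn.bigram <- function(w1, w2, bigrams, d = 0.75) {
  c.w1w2 <- sum(bigrams[token1 == w1 & token2 == w2, count])  # count of the bigram (w1, w2)
  c.w1   <- sum(bigrams[token1 == w1, count])                 # count of the history w1 (assumed > 0)
  n.follow <- nrow(bigrams[token1 == w1])                     # distinct words seen after w1
  p.cont <- nrow(bigrams[token2 == w2]) / nrow(bigrams)       # continuation probability of w2
  lambda <- (d / c.w1) * n.follow                             # weight for the redistributed mass
  max(c.w1w2 - d, 0) / c.w1 + lambda * p.cont
}

For example, pkn.bigram("of", "the", ndictlist[[2]]) would estimate the smoothed probability of "the" following "of".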

To implement this in real time, I first select the candidates (the words that could come next in a sentence) to be used for smoothing.

The candidates for the next word are chosen as the top-ranking words that follow wi in the bigram model, where wi (word i) is the final word of the input sentence.

For example, I would take the top 50 words that follow “of” as candidates, and calculate which of them had the highest PKN value.

string <- "vehicle of the"
dictlist <- ndictlist

source("predictive-text-analysis/pkn/pkn.candidateList.R") candidateList <- candidateList(string, dictlist, min = 2) head(candidateList, n = 10)

##  [1] "first"   "state"   "same"    "city"    "new"     "most"    "company"
##  [8] "us"      "last"    "best"

Using all of the words in candidateList, I calculated the PKN value for each, producing the following table:

#PKN calculation (fully recursive)
source("predictive-text-analysis/pkn/pkn.pkn.calc.R")

results <- c()
for (q in 1:length(head(candidateList))) {
  temp <- pkn.calc(string, candidate = head(candidateList)[q], dictlist)
  results <- c(results, temp)
  names(results)[q] <- head(candidateList)[q]
}
pkn.results.sorted <- sort(results, decreasing = TRUE)
pkn.results.df <- data.frame("pkn" = pkn.results.sorted,
                             "rank" = 1:length(pkn.results.sorted))
pkn.results.df

##               pkn rank
## most  0.015240435    1
## state 0.012619176    2
## new   0.010971481    3
## city  0.010297586    4
## first 0.008547671    5
## same  0.004329523    6

As you can see above, these results seem less ideal than the candidates produced by Prediction Model 1, which used raw counts. Certainly this model needs to be tweaked, for example by not limiting the PKN calculations to a small set of candidates.

Nonetheless, this is as much as I was able to do within the timeframe of this project.


Comparing the two models

Two models were used:

  1. Raw counts based on n-grams that already exist in the corpus.
  2. Kneser-Ney probabilities based on candidates (the possible bigram continuations for the final word in the sentence)

Here are the results of the two models for “vehicle of the _____”:

## Prediction Model 1: Using raw counts
result[order(-count)][1:6][, c("token3", "count"), with=FALSE]
##    token3 count
## 1:   year  1656
## 2: season  1262
## 3:   most  1187
## 4:  state   954
## 5:    new   830
## 6:  world   799
## Prediction Model 2: Using Kneser-Ney smoothing
pkn.results.df
##               pkn rank
## most  0.015240435    1
## state 0.012619176    2
## new   0.010971481    3
## city  0.010297586    4
## first 0.008547671    5
## same  0.004329523    6

As you can tell, neither prediction model produces words that are related to “vehicle”. In fact, “vehicle of the” does not appear in the corpus, as I pointed out in the Prediction Model 1 section.

However, Prediction Model 1 seems to be producing better prediction results than Prediction Model 2.

Conclusion

I created two prediction models to try to predict the next word of a sentence. The simpler version, using raw counts, appears to work better, but the second, which uses the Kneser-Ney algorithm, would also be promising if implemented correctly.

More work needs to be done on optimizing processing speed as well. For example, much of the Kneser-Ney smoothing can be done in parallel, across several nodes.
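
As a minimal sketch of that idea, assuming the pkn.calc function and candidateList vector shown earlier, the candidate scoring could be spread across cores with the base parallel package (mc.cores is an illustrative setting, and forking is only available on Unix-alike systems):

require(parallel)
# score each candidate's PKN value on a separate core
pkn.scores <- mclapply(candidateList,
                       function(w) pkn.calc(string, candidate = w, dictlist),
                       mc.cores = 4)
names(pkn.scores) <- candidateList
sort(unlist(pkn.scores), decreasing = TRUE)  # highest-probability candidates first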

Overall, this project was a significant educational experience in which I learned to handle large amounts of data, to process textual data, and to explore algorithms that optimize predictive power.

