In this report, I describe the creation of a predictive text app trained on a large corpus of news-related text. I discuss how the data was used to build n-gram models, which were subsequently used to generate my first prediction model, a much-simplified version of Katz's back-off model. Next, I attempted to implement the Kneser-Ney smoothing algorithm, but due to limited time to run the scripts, the results were not as good as hoped. Both prediction models are shown in a final Shiny app.
The purpose of this project was to create a predictive text app that predicts the next word that follows a sentence input. For example, given “I can not wait to” as the input sentence, the app should suggest words like “see”, “go”, and “eat”.
This report is broken down into two major steps:
- Training the app on a large corpus of text
- importing the corpus
- cleaning the corpus to remove undesired items
- converting the corpus into a proper format (data tables of n-gram models)
- Generating the predictions
- smoothing the n-gram model
- retrieving predictions
1. Import the corpus (or a subset of, say, 500,000 lines, as I did).
2. Clean the text (e.g., remove symbols, emoticons, profanity).
3. Tokenize the text.
4. Create a data table for the unigram model, with the token and token counts as the columns.
5. Repeat Step 4 for higher n-grams, up to 5-grams.
6. Add columns that split each token into its individual words, for better indexing (e.g., token1, token2, token, and count as the columns of the bigram model's data table).
7. Remove any tokens with a count of 1 from each model to reduce size.
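As a minimal sketch of steps 3–5, the n-gram counting stage can be illustrated in base R (`ngram_counts` is a hypothetical helper written for this report, not the code used in the project):

```r
# Count n-grams in a character vector of cleaned sentences.
ngram_counts <- function(text, n) {
  words <- strsplit(text, "\\s+")  # tokenize each line on whitespace
  grams <- unlist(lapply(words, function(w) {
    if (length(w) < n) return(character(0))
    # paste each run of n consecutive words into a single token
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(token = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE, row.names = NULL)
}
```

For example, `ngram_counts(c("it is the best", "it is the worst"), 3)` ranks the trigram "it is the" first, with a count of 2.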
There were three corpora available in English: text mined from Twitter, from blogs, and from news articles online. All of the text files used in this project can be downloaded at:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
- Twitter data: 2.36 × 10^6 lines of text; NA MB
- Blog data: 9 × 10^5 lines of text; NA MB
- News data: 1.01 × 10^6 lines of text; NA MB
Note: Although I worked on all three datasets, all of the code in this report uses the “News” data set.
Example of text from the news corpus:
## He wasn't home alone, apparently.
## The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.
Natural language processing (NLP) is a field that is concerned with using computer science to analyze text. Borrowing from concepts in NLP, we divided each text document into tokens (i.e., individual words in this case), which could be grouped into n-grams for analysis. N-grams are words that occur consecutively, with “n” representing the number of words in question. For example, trigrams are triplets of words together. Bigrams are pairs of words. Unigrams are individual words. By “tokenizing” a corpus, we can take tokens that occur together, and gather the number of times each n-gram occurs throughout the corpus (e.g., the number of times the trigram, “it is the”, occurs in the text).
Thus, for each n, the corpus of text was converted into n-gram counts. Here is an example of the final product:
## The most common bigrams from the news corpus
## token1: first word in the bigram
## token2: second word in the bigram
## token: the bigram
## count: the number of times the bigram appeared
## token1 token2 token count
## 1: of the of the 79156
## 2: in the in the 76945
## 3: to the to the 38823
## 4: on the on the 33406
## 5: for the for the 32076
## 6: it is it is 31614
The text was imported using readLines, and converted into a corpus using the tm package in R.
require(tm)
sourcename <- "news"; lines <- 500000
text <- readLines(paste0(directory, "/final/en_US/en_US.", sourcename, ".txt"),
                  n = lines)
cleanedCorpus <- cleanCorpus(Corpus(VectorSource(text)))
cleanCorpus is a custom function I built to pre-process the text. Its purpose was the following:
- remove most symbols such as #%$, because they do not add any value to word predictions
- convert “!” and “?” to “.” to signify the end of a sentence
- reduce consecutive periods into a single period
- convert “&” symbol into “and” as they mean the same thing
- remove profanity based on a list of words from a separate text file. This list was obtained from: https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
- remove the digits “0123456789” to simplify the corpus
- convert all text to lowercase for simplicity
- expand contractions so that contractions can be tokenized as if they were the same as their full forms
- remove extra whitespace
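A rough sketch of these cleaning steps on a plain character vector (my reconstruction for illustration only; the actual cleanCorpus wrapped similar substitutions in tm transformations, and also handled emoticons, contractions, and profanity):

```r
# Base-R sketch of the main cleaning substitutions (not the original cleanCorpus).
clean_text <- function(x) {
  x <- gsub("[!?]+", ".", x)                 # "!" / "?" mark sentence ends
  x <- gsub("\\.{2,}", ".", x)               # collapse runs of periods
  x <- gsub("&", " and ", x)                 # "&" means "and"
  x <- gsub("[#%$@^*()_+=<>~`\"]", " ", x)   # drop symbols with no predictive value
  x <- gsub("[0-9]", "", x)                  # remove digits
  x <- tolower(x)                            # lowercase for simplicity
  x <- gsub("\\s+", " ", trimws(x))          # squeeze extra whitespace
  x
}
```

For example, `clean_text("Wait!! Really??? 1 & 2")` yields `"wait. really. and"`.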
Next, the text was tokenized using the tau or RWeka packages, which were interchangeable.
require(tau)
# "ngrams" is the n for the model being built (e.g., 2 for the bigram model)
tau_ngrams <- function(x, ngrams) {
  rownames(as.data.frame(unclass(textcnt(x, method = "string", n = ngrams))))
}
clean.tdm <- TermDocumentMatrix(cleanedCorpus,
               control = list(tokenize = function(x) tau_ngrams(x, ngrams),
                              wordLengths = c(1, Inf)))
Next, data tables were created for each ngram model.
require(slam); require(data.table)  # rollup() comes from the slam package
ngram_model <- rollup(clean.tdm, 2, na.rm = TRUE, FUN = sum)
ngram_model_dt <- data.table(token = ngram_model$dimnames$Terms, count = ngram_model$v)
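The subsequent splitting of each token into one word per column can be sketched in base R (the project used data.table; this illustration uses a plain data frame with the column names shown in the examples):

```r
# Split each bigram token into one word per column for easier indexing.
dt <- data.frame(token = c("of the", "in the"),
                 count = c(79156L, 76945L),
                 stringsAsFactors = FALSE)
parts <- do.call(rbind, strsplit(dt$token, " ", fixed = TRUE))
dt$token1 <- parts[, 1]
dt$token2 <- parts[, 2]
dt <- dt[, c("token1", "token2", "token", "count")]  # reorder to match the report
```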
After splitting the tokens, which contained multiple words, into a word per column, this resulted in our final n-gram model output.
Here’s an example:
The final output for a trigram model:
## token1 token2 token3 token count
## 1: one of the one of the 7019
## 2: a lot of a lot of 5196
## 3: it is a it is a 4809
## 4: i do not i do not 3693
## 5: it is not it is not 3206
## ---
## 1397053: zydeco or cajun zydeco or cajun 2
## 1397054: zydrunas ilgauskas was zydrunas ilgauskas was 2
## 1397055: zygi wilf said zygi wilf said 2
## 1397056: zynga s other zynga s other 2
## 1397057: zz top s zz top s 2
As you can see above, some trigrams contain strings that do not appear to be actual English words, which shows that my corpus-cleaning functions could still be improved. In the end, I decided to spend more time on the next phase, producing the prediction algorithm, rather than on generating a cleaner corpus!
Based on the data tables and their counts, I was now able to create a very simple prediction model.
This basic prediction model is best summarized as a simplified version of Katz's back-off model. For a given sentence of k words, it returns the (k+1)th words that have the highest counts according to the (k+1)-gram data table.

1. Input a sentence.
2. Reduce the sentence to its final k words, where k is the number of n-gram models created minus one.
3. Check the n-gram data table, match the first n-1 words, and return a list of nth words according to the data table.
4. If Step 3 returns at least one word, the prediction stops there. Otherwise, if no nth word was found, reduce the k-word sentence to k-1 words and repeat Step 3.
5. If no words are found even at the bigram level (e.g., no continuation of the final word, “to”), then the prediction model does not return any words.
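The back-off procedure can be sketched as a single function (a simplified illustration, assuming models is a list of n-gram data frames with token1, token2, … and count columns as shown earlier; predict_backoff is my own name, not the app's code):

```r
# Simplified back-off: try the highest-order model first, then drop the
# leading context word and retry on the next lower-order model.
predict_backoff <- function(sentence, models, top = 3) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  for (n in rev(seq(2, length(models)))) {   # models[[n]] is the n-gram table
    ctx <- tail(words, n - 1)
    if (length(ctx) < n - 1) next            # sentence too short for this order
    tab <- models[[n]]
    # match the n-1 context words against the token1..token(n-1) columns
    keep <- rep(TRUE, nrow(tab))
    for (j in seq_len(n - 1)) keep <- keep & tab[[paste0("token", j)]] == ctx[j]
    hits <- tab[keep, ]
    if (nrow(hits) > 0)                      # return top continuations by count
      return(head(hits[order(-hits$count), paste0("token", n)], top))
  }
  character(0)  # nothing found even at the bigram level
}
```

For example, with a trigram table containing "of the year" and a bigram table containing "of the" and "of a", `predict_backoff("vehicle of the", models)` matches at the trigram level, while an unseen context backs off to the bigram table.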
We begin with a string:
string <- tolower("The most advanced vehicle of the")
string <- strsplit(string, " ")[[1]]
string
## [1] "the" "most" "advanced" "vehicle" "of" "the"
length(string)
## [1] 6
Then we reduce the string to the final k words, where k is equal to the number of ngram models minus one.
k <- length(ndictlist) - 1 #Number of n-gram models in "ndictlist" minus one
string <- tail(string, k) #Reducing the string
string
## [1] "vehicle" "of" "the"
Then we look up the highest n-gram model to see what candidate nth words are returned:
setkey(ndictlist[[4]], token1, token2, token3)
result <- ndictlist[[4]][list(string[1], string[2], string[3])]
result[order(-count)]
## token1 token2 token3 token4 token count
## 1: vehicle of the NA NA NA
As we can see, there are no candidate words that follow "vehicle of the": the lookup returns NA because no 4-gram beginning with these three words was found in the original news corpus.
Next, the prediction model reduces the sentence by one word and repeats the process.
k <- k - 1 #Reduce the order of the model by one
string <- tail(string, k) #Reducing the string
string
## [1] "of" "the"
setkey(ndictlist[[3]], token1, token2)
result <- ndictlist[[3]][list(string[1], string[2])]
result[order(-count)][1:10]
## token1 token2 token3 token count
## 1: of the year of the year 1656
## 2: of the season of the season 1262
## 3: of the most of the most 1187
## 4: of the state of the state 954
## 5: of the new of the new 830
## 6: of the world of the world 799
## 7: of the city of the city 773
## 8: of the game of the game 744
## 9: of the best of the best 666
## 10: of the first of the first 644
The trigram model was able to find trigrams beginning with "of the" and returned the results above.
Indeed, there are a number of drawbacks to this model:
- it would have to train on an extremely large corpus to be able to retrieve all the valid combinations (n-grams) of words
- a few n-grams occur very frequently, while most n-grams appear seldom; no discounting is applied to the counts to even out the distribution
- it cannot handle combinations of words it has never seen before in the corpus
For these reasons, I decided to try a prediction model using the Kneser-Ney method.
Kneser-Ney smoothing is an algorithm designed to adjust the weights (through discounting) by using the continuation counts of lower n-grams. The concept is fairly simple though a bit more difficult to implement in a program than the one used in Prediction Model 1.
Given the sentence below, “Francisco” is presented as the suggested ending, because it appears more often than “glasses” in some texts:

I can’t see without my reading ____ . (suggested: “Francisco”)
However, even though “Francisco” appears more often than “glasses”, “Francisco” rarely occurs outside of the context of “San Francisco”. Thus, instead of observing how often a word appears, the Kneser-Ney algorithm takes into account how often a word completes a bigram type (e.g., “prescription glasses”, “reading glasses”, “small glasses” vs. “San Francisco”). This example was taken from Jurafsky’s video lecture on Kneser-Ney Smoothing, which also describes the equation used to calculate the Kneser-Ney probability. https://www.youtube.com/watch?v=wtB00EczoCM
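The continuation-count idea can be made concrete: instead of ranking a word w by how often it occurs, rank it by how many distinct bigram types it completes. A minimal sketch (my own naming, assuming bigrams is a data frame of observed bigram types with columns token1 and token2):

```r
# Continuation probability: the fraction of all observed bigram types
# that end in the word w.
p_continuation <- function(w, bigrams) {
  sum(bigrams$token2 == w) / nrow(bigrams)
}
```

In the “Francisco” example, “glasses” completes many bigram types (“reading glasses”, “prescription glasses”, “small glasses”, …) while “Francisco” completes essentially one (“San Francisco”), so its continuation probability is low despite its high raw count.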
I believe that typically, the smoothing algorithm is performed on all of the n-grams (unigram models, bigram models, etc.) prior to attempting any predictions. However, given the restraint of computing time, I only had enough time to create a version which performs the smoothing when a user provides an input.
I was able to write a completely recursive Kneser-Ney algorithm, pkn.calc, for n-gram models of any n. However, in practice I limited the number of candidate words, so the resulting predictions are often inaccurate.
The procedure of my pkn.calc function is as follows:
- Calculate the Kneser-Ney probability (PKN) of the lowest order (unigram level).
- Calculate the PKN of all the lower orders using successively lower PKN values.
- Calculate the highest PKN value using the PKN of the lower values, and return this value.
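At the bigram level (the final recursion step), the interpolated Kneser-Ney estimate combines a discounted bigram count with the continuation probability of the candidate word. A sketch under my own naming (d is the discount, commonly 0.75; this is not the pkn.calc source):

```r
# Bigram-level interpolated Kneser-Ney:
#   P_KN(w2 | w1) = max(c(w1 w2) - d, 0) / c(w1) + lambda(w1) * P_cont(w2)
# bigrams: data frame with columns token1, token2, count (one row per bigram type).
pkn_bigram <- function(w2, w1, bigrams, d = 0.75) {
  c_w1   <- sum(bigrams$count[bigrams$token1 == w1])   # total count of the context
  c_w1w2 <- sum(bigrams$count[bigrams$token1 == w1 &
                              bigrams$token2 == w2])   # count of the full bigram
  n_types <- sum(bigrams$token1 == w1)                 # distinct continuations of w1
  lambda  <- d * n_types / c_w1                        # discounted mass to spread
  p_cont  <- sum(bigrams$token2 == w2) / nrow(bigrams) # continuation probability
  max(c_w1w2 - d, 0) / c_w1 + lambda * p_cont
}
```

For example, with bigram types ("of the", 10), ("of a", 5), and ("san francisco", 3), `pkn_bigram("the", "of", bigrams)` evaluates to 9.25/15 + 0.1 × (1/3) = 0.65.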
To implement this in real time, I first select the candidates (the words that could come next in the sentence) to be used for smoothing.
The candidates are chosen as the top-ranking words that follow wi in the bigram model, where the first word of the bigram is the final word of the sentence (word i, or wi for short).
For example, I would take the top 50 words that follow “of” as candidates, and calculate which had the highest PKN value.
string <- "vehicle of the"
dictlist <- ndictlist
source("predictive-text-analysis/pkn/pkn.candidateList.R")
candidateList <- candidateList(string, dictlist, min = 2)
head(candidateList, n = 10)
## [1] "first" "state" "same" "city" "new" "most" "company"
## [8] "us" "last" "best"
Using all of the words in candidateList, I calculated the PKN value for each, producing the following table:
#PKN calculation (fully recursive)
source("predictive-text-analysis/pkn/pkn.pkn.calc.R")
results <- c()
for (q in 1:length(head(candidateList))) {
  temp <- pkn.calc(string, candidate = head(candidateList)[q], dictlist)
  results <- c(results, temp)
  names(results)[q] <- head(candidateList)[q]
}
pkn.results.sorted <- sort(results, decreasing = TRUE)
pkn.results.df <- data.frame("pkn" = pkn.results.sorted,
                             "rank" = 1:length(pkn.results.sorted))
pkn.results.df
## pkn rank
## most 0.015240435 1
## state 0.012619176 2
## new 0.010971481 3
## city 0.010297586 4
## first 0.008547671 5
## same 0.004329523 6
As you can see above, these results seem less ideal than the candidates from Prediction Model 1, which used raw counts. This model certainly needs tweaking, for instance by not limiting the PKN calculations to the pre-selected candidates.
Nonetheless, this is as much as I was able to do within the timeframe of this project.
Two models were used:
- Raw counts based on n-grams that already exist in the corpus.
- Kneser-Ney probabilities based on candidates (the possible bigram continuations for the final word in the sentence)
Here are the two results for “vehicle of the _____”:
## Prediction Model 1: Using raw counts
result[order(-count)][1:6][, c("token3", "count"), with=FALSE]
## token3 count
## 1: year 1656
## 2: season 1262
## 3: most 1187
## 4: state 954
## 5: new 830
## 6: world 799
## Prediction Model 2: Using Kneser-Ney smoothing
pkn.results.df
## pkn rank
## most 0.015240435 1
## state 0.012619176 2
## new 0.010971481 3
## city 0.010297586 4
## first 0.008547671 5
## same 0.004329523 6
As you can tell, neither prediction model produces words related to “vehicle”. In fact, “vehicle of the” does not appear in the corpus, as pointed out in the Prediction Model 1 section.
However, Prediction Model 1 seems to be producing better prediction results than Prediction Model 2.
I created two prediction models to predict the next word of a sentence. The simpler version, using raw counts, appears to work better, but the second, which uses the Kneser-Ney algorithm, would also be promising if implemented fully.
More work needs to be done on optimizing processing speed as well. For example, much of the Kneser-Ney smoothing can be done in parallel, across several nodes.
Overall, this project was a significant educational experience: I learned to handle large amounts of data, to process textual data, and to explore algorithms to optimize predictive power.