The goal of this project is to take a provided dataset and build a natural language processing (NLP) model that predicts the next word. Blog, Twitter, and news text were used to train the model.
SwiftKey worked in cooperation with professors at Johns Hopkins University to prepare this project, whose objective is to build a predictive model that makes it easier for people to type on their mobile devices.
Besides cleaning and sub-setting the data, n-gram tokenization was used to build the word combinations used by the predictive algorithm.
The project concluded with a Shiny application and a pitch built with R Presentation.
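As an illustration of the n-gram tokenization mentioned above, here is a minimal sketch in Python. It is not the project's R code: the function names, the cleaning rule (lowercase, keep letters and apostrophes), and the toy corpus are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(lines, n):
    """Count every run of n consecutive words, one line at a time."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Hypothetical toy corpus standing in for the blog, Twitter, and news text.
corpus = [
    "Thanks for the follow",
    "Thanks for the great day",
    "Looking forward to the weekend",
]

bigrams = ngram_counts(corpus, 2)   # 2-word n-grams
trigrams = ngram_counts(corpus, 3)  # 3-word n-grams
```

Counting per line keeps n-grams from spanning two unrelated documents, which is one reasonable choice among several.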
This project involves natural language processing. The critical task is to take a user's input phrase (a group of words) and output a predicted next word.
Project deliverables:
- Next Word Prediction Model, as the basis for an app
- Next Word Prediction App, hosted at shinyapps.io
- This presentation, hosted on RPubs
The next word prediction model uses the principles of "tidy data" applied to text mining in R. Key model steps:
- Input: raw text files for model training
- Clean training data; separate into 2-word, 3-word, and 4-word n-grams; save as tibbles
- Sort n-gram tibbles by frequency; save as repos
- N-gram function: uses a "back-off" type prediction model
  - user supplies an input phrase
  - model uses the last 3, 2, or 1 words to find the best-matching 4th, 3rd, or 2nd word in the repos
- Output: next word prediction
Benefits: easy-to-read code; uses "pipes"; fast processing of training data; able to sample up to 25% of the original corpus; relatively small output repos
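The back-off step described above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the project's R implementation: the `REPOS` tables, their counts, and the `predict_next` function are hypothetical, and the sketch simply returns the most frequent continuation for the longest matching context, without the probability discounting used in formal back-off models.

```python
# Hypothetical frequency tables ("repos"), keyed by context length.
# Each maps a tuple of context words to {next word: count}.
# Real tables would be built from the training corpus.
REPOS = {
    3: {("thanks", "for", "the"): {"follow": 5, "help": 2}},
    2: {("for", "the"): {"first": 9, "follow": 5}},
    1: {("the",): {"same": 20, "first": 9}},
}


def predict_next(phrase, repos=REPOS, default=None):
    """Try the last 3 words, then the last 2, then the last 1."""
    words = phrase.lower().split()
    for k in (3, 2, 1):
        if len(words) < k:
            continue  # phrase is too short for this context length
        context = tuple(words[-k:])
        candidates = repos.get(k, {}).get(context)
        if candidates:
            # Return the most frequent continuation for this context.
            return max(candidates, key=candidates.get)
    return default  # no context matched at any length
```

With these toy tables, `predict_next("Thanks for the")` matches the 3-word context and returns `"follow"`, while `predict_next("wait for the")` finds no 3-word match and backs off to the 2-word context, returning `"first"`.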
The next word prediction app provides a simple user interface to the next word prediction model.
Key Features:
- Text box for user input
- Predicted next word outputs dynamically below user input
- Tabs with plots of the most frequent n-grams in the dataset
- Side panel with user instructions
Key Benefits:
- Fast response
- Method allows for large training sets, leading to better next-word predictions
Shiny App "https://groupejopa.shinyapps.io/ngram_match/"
Slide Deck "https://rpubs.com/groupejopa/740513"
App Source Code "https://github.com/groupejopa/JHU-Data-Science-Capstone-/tree/master/ngram_match"
GitHub Repository "https://github.com/groupejopa/JHU-Data-Science-Capstone-"
Tidy Data
Text Mining with R: A Tidy Approach
Mark-blackmore