https://github.com/JianyangZhang/Mini-Search-Engine
Word predictions are offered as the user types. The predictions are based on the probability, in natural language, that a specific word follows a given word or phrase.
The auto-completion program first generates an n-gram library from the training data, then builds a language model from that library and stores the model in a MySQL database. The model can then be used to predict the words that follow a given word or phrase. Details are given in the following sections.
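As a rough illustration of the first step, here is a minimal single-machine sketch of n-gram counting (class and method names are hypothetical; the actual project does this at scale with Hadoop MapReduce):

```java
import java.util.*;

public class NGramCount {
    // Count all n-grams of length 2..n in the given text
    // (in-memory sketch of what the MapReduce job computes).
    static Map<String, Integer> count(String text, int n) {
        String[] words = text.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (int len = 2; len <= n; len++) {
            for (int i = 0; i + len <= words.length; i++) {
                String gram = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = count("this is a test this is fun", 3);
        System.out.println(c.get("this is")); // the bigram "this is" appears twice -> 2
    }
}
```

The raw counts are what the later jobs turn into conditional probabilities (how often each follower appears after each starter phrase).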
This is the most common feature search engines provide: when a user's input contains a typo, the engine suggests the correct spelling.
The spell-checking program uses a trie as its data structure; it takes single words as input and constructs a dictionary with probabilities. When the word "best" is inserted, "bost", "bast", "bist", "bust", etc. all point to "best" as the correct word. If "bust" is inserted later, it overwrites that mapping, so "best" is then "corrected" to "bust".
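A minimal sketch of the described behavior (class names and the single-substitution variant scheme are assumptions for illustration, not the project's exact implementation): every one-letter variant of an inserted word is stored in a trie and points back to the "correct" word, and a later insert overwrites earlier mappings, exactly the caveat noted above.

```java
import java.util.*;

public class SpellTrie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        String correct; // non-null if the path ending here maps to a correct word
    }

    private final Node root = new Node();

    private void put(String key, String correct) {
        Node cur = root;
        for (char ch : key.toCharArray())
            cur = cur.children.computeIfAbsent(ch, k -> new Node());
        cur.correct = correct; // later inserts overwrite earlier ones
    }

    // Insert a word plus every single-character substitution of it.
    public void insert(String word) {
        put(word, word);
        for (int i = 0; i < word.length(); i++)
            for (char ch = 'a'; ch <= 'z'; ch++)
                if (ch != word.charAt(i))
                    put(word.substring(0, i) + ch + word.substring(i + 1), word);
    }

    public String lookup(String typed) {
        Node cur = root;
        for (char ch : typed.toCharArray()) {
            cur = cur.children.get(ch);
            if (cur == null) return null;
        }
        return cur.correct;
    }

    public static void main(String[] args) {
        SpellTrie t = new SpellTrie();
        t.insert("best");
        System.out.println(t.lookup("bost")); // best
        t.insert("bust");
        // "best" is itself a one-letter variant of "bust", so the mapping
        // was overwritten -- the weakness described above:
        System.out.println(t.lookup("best")); // bust
    }
}
```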
mysql> create database tp;
mysql> use tp;
mysql> create table LanguageModel(starter varchar(250), follower varchar(250), probability int);
mysql> grant all on *.* to "root"@"%" identified by "123456";
mysql> flush privileges;
hdfs dfs -mkdir /mysql
hdfs dfs -put ./mysql-connector-java-5.1.39-bin.jar /mysql
hdfs dfs -mkdir /trainingdata
hdfs dfs -put ./trainingdata/* /trainingdata
Usage: hadoop jar <jar file> <main class name> <input dir> <output dir> [GRAM_NUMBER] [THRESHOLD] [TOP_K]
Ex: hadoop jar TextPrediction.jar com.textprediction.ngramlm.Dispatcher /trainingdata /ngram 5 5 5
GRAM_NUMBER: the n of the n-gram (default 5)
THRESHOLD: the minimum count for a phrase to be considered meaningful (default 5)
TOP_K: show only the top k predictions, ranked by probability (default 5)
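The effect of THRESHOLD and TOP_K can be sketched in-memory like this (a hypothetical simplification of what the MapReduce jobs compute per starter phrase): followers below the threshold are dropped, and only the k most frequent survivors are kept.

```java
import java.util.*;
import java.util.stream.*;

public class ModelBuild {
    // For one starter phrase, keep followers whose raw count meets THRESHOLD,
    // then retain only the TOP_K most frequent.
    static List<String> topFollowers(Map<String, Integer> followerCounts,
                                     int threshold, int topK) {
        return followerCounts.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)         // THRESHOLD
                .sorted((a, b) -> b.getValue() - a.getValue())  // most frequent first
                .limit(topK)                                    // TOP_K
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("like", 12);
        counts.put("love", 9);
        counts.put("go", 3); // below threshold 5, dropped
        System.out.println(topFollowers(counts, 5, 2)); // [like, love]
    }
}
```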
hdfs dfs -ls /ngram
hdfs dfs -get /ngram/part-r-* ./generated-n-gram-library/
mysql> select * from LanguageModel limit 50;
mysql> select * from LanguageModel into outfile '/tmp/generated-language-model/LanguageModel.out';
mysql> select * from LanguageModel where starter like 'input%' order by probability desc limit x;
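The prediction query above can be mirrored in plain Java to make its semantics concrete (a hypothetical in-memory stand-in for the LanguageModel table, matching the schema defined earlier): filter rows whose starter begins with the user's input, sort by probability descending, and take the first k.

```java
import java.util.*;
import java.util.stream.*;

public class PredictQuery {
    static class Row {
        String starter, follower;
        int probability;
        Row(String s, String f, int p) { starter = s; follower = f; probability = p; }
    }

    // In-memory equivalent of:
    //   SELECT follower FROM LanguageModel WHERE starter LIKE 'input%'
    //   ORDER BY probability DESC LIMIT k;
    static List<String> predict(List<Row> table, String input, int k) {
        return table.stream()
                .filter(r -> r.starter.startsWith(input)) // LIKE 'input%'
                .sorted((a, b) -> b.probability - a.probability)
                .limit(k)
                .map(r -> r.follower)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> t = Arrays.asList(
                new Row("this is", "a", 40),
                new Row("this is", "the", 25),
                new Row("this was", "a", 30));
        System.out.println(predict(t, "this is", 2)); // [a, the]
    }
}
```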
Example inputs with their predictions (shown in the screenshots): "would", "this is", "away from".
In this project, I used a trie for spell checking only. The phrase-recommendation module, as a practice exercise, is implemented by storing the language model in a database and querying it with SQL. However, the SQL "LIKE" operation is expensive! Real search engines implement all of these functions with distributed tries; in other words, they store all of their language models in tries.
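To show why a trie avoids the LIKE scan, here is a single-machine sketch (names are hypothetical; production systems distribute this structure across machines): the starter phrase is the trie path, so a prefix lookup walks O(prefix length) nodes instead of scanning the whole table.

```java
import java.util.*;

public class PredictionTrie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        Map<String, Integer> followers = new HashMap<>(); // follower -> probability
    }

    private final Node root = new Node();

    public void insert(String starter, String follower, int probability) {
        Node cur = root;
        for (char ch : starter.toCharArray())
            cur = cur.children.computeIfAbsent(ch, k -> new Node());
        cur.followers.put(follower, probability);
    }

    // Walk the trie along the typed prefix, then gather every prediction
    // stored at or below that node -- no table scan needed.
    public List<String> predict(String prefix, int k) {
        Node cur = root;
        for (char ch : prefix.toCharArray()) {
            cur = cur.children.get(ch);
            if (cur == null) return Collections.emptyList();
        }
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        collect(cur, all);
        all.sort((a, b) -> b.getValue() - a.getValue());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(k, all.size()); i++)
            out.add(all.get(i).getKey());
        return out;
    }

    private void collect(Node n, List<Map.Entry<String, Integer>> out) {
        out.addAll(n.followers.entrySet());
        for (Node c : n.children.values()) collect(c, out);
    }

    public static void main(String[] args) {
        PredictionTrie t = new PredictionTrie();
        t.insert("this is", "a", 40);
        t.insert("this is", "the", 25);
        t.insert("this was", "a", 30);
        System.out.println(t.predict("this is", 2)); // [a, the]
    }
}
```

Sharding this trie by prefix across machines is what "distributed trie" means in practice: each machine owns a subtree and answers lookups for its range of prefixes.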