JianyangZhang/Hot-Search-Terms


Mini search engine in Docker:

https://github.com/JianyangZhang/Mini-Search-Engine

Auto-completion:

Words can be predicted while the user is typing. The predictions are based on the probability, in natural language, that specific words appear after a given word or phrase.

The auto-completion program first generates an n-gram library from the training data, then builds a language model from that library and stores the model in a MySQL database. The language model can then be used to predict the words that follow a given word or phrase. Details are shown in the following sections.
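
As a rough sketch of the first step, the n-gram library can be produced by a word-count-style MapReduce job like the one below. The class names and the normalization rule are assumptions for illustration, not the repository's actual code (its real entry point is com.textprediction.ngramlm.Dispatcher):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the first job: emit every 2..n word phrase with count 1, then sum.
public class NGramCount {
    public static class NGramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private int gramNumber;

        @Override
        protected void setup(Context context) {
            // GRAM_NUMBER from the command line would be passed via the configuration.
            gramNumber = context.getConfiguration().getInt("gramNumber", 5);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Normalize: lower-case and keep letters only (an assumed cleanup rule).
            String[] words = value.toString().toLowerCase()
                    .replaceAll("[^a-z]+", " ").trim().split("\\s+");
            for (int i = 0; i < words.length; i++) {
                StringBuilder phrase = new StringBuilder(words[i]);
                for (int n = 2; n <= gramNumber && i + n - 1 < words.length; n++) {
                    phrase.append(' ').append(words[i + n - 1]);
                    context.write(new Text(phrase.toString()), ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum)); // lines in /ngram: "phrase<TAB>count"
        }
    }
}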

Spell checking:

The most common feature search engines provide: when a user's input may contain a typo, the engine suggests the correct spelling.

The spell-checking program uses a trie as its data structure; it takes single words as input and constructs a dictionary with probabilities. When the word "best" is inserted, "bost", "bast", "bist", "bust", etc. will all point to "best" as the correct word. If "bust" is added later, it overwrites "best" as the suggested correction for those variants.
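
A minimal sketch of that behaviour, assuming single-letter substitutions as the variant rule (the repository may generate variants differently):

import java.util.HashMap;
import java.util.Map;

// Illustrative trie-based spell checker: misspelled variants point at the
// correct word, and later insertions overwrite earlier suggestions.
public class SpellTrie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        String suggestion; // correct word this path points to, if any
    }

    private final Node root = new Node();

    // Insert a correct word and make its one-letter-substitution variants point to it.
    public void insert(String word) {
        put(word, word);
        char[] chars = word.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char original = chars[i];
            for (char c = 'a'; c <= 'z'; c++) {
                if (c == original) continue;
                chars[i] = c;
                put(new String(chars), word); // later inserts overwrite earlier suggestions
            }
            chars[i] = original;
        }
    }

    // Return the suggested correct spelling for the input, or null if unknown.
    public String suggest(String typed) {
        Node node = root;
        for (char c : typed.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        return node.suggestion;
    }

    private void put(String key, String suggestion) {
        Node node = root;
        for (char c : key.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.suggestion = suggestion;
    }

    public static void main(String[] args) {
        SpellTrie trie = new SpellTrie();
        trie.insert("best");
        System.out.println(trie.suggest("bost")); // best
        trie.insert("bust");
        System.out.println(trie.suggest("bost")); // bust ("bost" now points to the later word)
    }
}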

Sample MySQL database setup:

mysql> create database tp;
mysql> use tp;
mysql> create table LanguageModel(starter varchar(250), follower varchar(250), probability int);
mysql> grant all on *.* to "root"@"%" identified by "123456";
mysql> flush privileges;


Preparation in HDFS:

hdfs dfs -mkdir /mysql
hdfs dfs -put ./mysql-connector-java-5.1.39-bin.jar /mysql
hdfs dfs -mkdir /trainingdata
hdfs dfs -put ./trainingdata/* /trainingdata


Run the project:

Usage: hadoop jar <jar file> <main class name> <input dir> <output dir> [GRAM_NUMBER] [THRESHOLD] [TOP_K]
Ex: hadoop jar TextPrediction.jar com.textprediction.ngramlm.Dispatcher /trainingdata /ngram 5 5 5

GRAM_NUMBER: the n of the n-gram, i.e. the maximum phrase length; default 5
THRESHOLD: the minimum number of occurrences for a phrase to be considered meaningful; default 5
TOP_K: keep only the top k predictions, ranked by probability; default 5

Check the results of the first MapReduce job, which generates the n-gram library:

hdfs dfs -ls /ngram
hdfs dfs -get /ngram/part-r-* ./generated-n-gram-library/


The second MapReduce job stores the language model in the MySQL database:

mysql> select * from LanguageModel limit 50;
mysql> select * from LanguageModel into outfile '/tmp/generated-language-model/LanguageModel.out';
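
Staging mysql-connector-java in HDFS earlier suggests this job writes its output through Hadoop's DBOutputFormat. Below is a minimal sketch of such a driver configuration, reusing the database, table, and credentials from the setup above; the host, class names, and everything else are assumptions, not the repository's actual driver:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

// Sketch: configure a job so reducer output rows become
// INSERT INTO LanguageModel (starter, follower, probability) VALUES (?, ?, ?)
public class LanguageModelJob {
    public static Job createJob(Configuration conf) throws IOException {
        DBConfiguration.configureDB(conf,
                "com.mysql.jdbc.Driver",              // from mysql-connector-java-5.1.39
                "jdbc:mysql://localhost:3306/tp",     // database created in the setup
                "root", "123456");                    // credentials from the GRANT above

        Job job = Job.getInstance(conf, "build language model");
        job.setJarByClass(LanguageModelJob.class);
        job.setOutputFormatClass(DBOutputFormat.class);
        DBOutputFormat.setOutput(job, "LanguageModel",
                "starter", "follower", "probability");
        // Mapper/reducer classes and a DBWritable output key type would be set here.
        return job;
    }
}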


Sample predictions:

mysql> select * from LanguageModel where starter like '<input>%' order by probability desc limit <TOP_K>;

  user input "would": predictions shown in screenshot
  user input "this is": predictions shown in screenshot
  user input "away from": predictions shown in screenshot
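
The prediction query above can also be issued from application code over JDBC. The client below is hypothetical (not part of the repository) and reuses the connection details from the MySQL setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical JDBC client for the prediction lookup shown above.
public class PredictClient {
    public static void main(String[] args) throws SQLException {
        String input = args.length > 0 ? args[0] : "this is";
        int topK = 5;
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/tp", "root", "123456");
             PreparedStatement ps = conn.prepareStatement(
                "select follower, probability from LanguageModel " +
                "where starter like ? order by probability desc limit ?")) {
            ps.setString(1, input + "%");
            ps.setInt(2, topK);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("follower") + "\t" + rs.getInt("probability"));
                }
            }
        }
    }
}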

The Trie tree structure for spell checking:

(screenshot: visualization of the trie)

Notes:

In this project, I used a trie for spell checking only. The phrase-recommendation module, for the purpose of practice, is implemented by storing the language model in a database and then querying it with SQL. However, the SQL operator LIKE is expensive! Real search engines implement all of these functions with distributed trie trees; in other words, they store all of their language models in tries.
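
As a sketch of that alternative, the language model could be keyed on the starter phrase in a trie, so each lookup is a single prefix walk instead of a LIKE table scan. Everything here is illustrative, not how the repository implements it:

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a language model kept in a trie keyed on the starter phrase.
public class ModelTrie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        List<String> topFollowers = Collections.emptyList(); // pre-ranked by probability
    }

    private final Node root = new Node();

    // Store the top-k followers (already ranked) for a starter phrase.
    public void put(String starter, List<String> rankedFollowers) {
        Node node = root;
        for (char c : starter.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.topFollowers = rankedFollowers;
    }

    // Return the predictions stored for the typed phrase, if any.
    public List<String> predict(String typed) {
        Node node = root;
        for (char c : typed.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList();
        }
        return node.topFollowers;
    }
}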

About

Implemented a mini search engine with spell checking and auto-completion features (using Hadoop)
