https://github.com/JianyangZhang/Mini-Search-Engine
Word predictions are offered as the user types. The predictions are based on the probability, in natural language, that a specific word follows a given word or phrase.
The auto-completion program first generates an n-gram library from the training data, then builds a language model from that library and stores the model in a MySQL database. The model can then be used to predict the words that follow a given word or phrase. Details are given in the following sections.
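As a rough illustration of the first step, here is a minimal single-machine sketch of n-gram counting (class and method names are hypothetical; the actual project does this at scale with Hadoop MapReduce):

```java
import java.util.*;

public class NGramCount {
    // Count all n-grams of length 2..n in the given text
    // (in-memory sketch of what the MapReduce job computes).
    static Map<String, Integer> count(String text, int n) {
        String[] words = text.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (int len = 2; len <= n; len++) {
            for (int i = 0; i + len <= words.length; i++) {
                String gram = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = count("this is a test this is fun", 3);
        System.out.println(c.get("this is")); // the bigram "this is" appears twice -> 2
    }
}
```

The raw counts are what the later jobs turn into conditional probabilities (how often each follower appears after each starter phrase).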
This is the most common feature search engines provide: when a user's input contains a typo, the engine suggests the correct spelling.
The spell-checking program uses a trie as its data structure; it takes single words as input and constructs a dictionary with probabilities. When the word "best" is inserted, "bost", "bast", "bist", "bust", etc. all point to "best" as the correct word. If "bust" is inserted later, it overwrites that mapping, so "best" is then "corrected" to "bust".
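A minimal sketch of the described behavior (class names and the single-substitution variant scheme are assumptions for illustration, not the project's exact implementation): every one-letter variant of an inserted word is stored in a trie and points back to the "correct" word, and a later insert overwrites earlier mappings, exactly the caveat noted above.

```java
import java.util.*;

public class SpellTrie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        String correct; // non-null if the path ending here maps to a correct word
    }

    private final Node root = new Node();

    private void put(String key, String correct) {
        Node cur = root;
        for (char ch : key.toCharArray())
            cur = cur.children.computeIfAbsent(ch, k -> new Node());
        cur.correct = correct; // later inserts overwrite earlier ones
    }

    // Insert a word plus every single-character substitution of it.
    public void insert(String word) {
        put(word, word);
        for (int i = 0; i < word.length(); i++)
            for (char ch = 'a'; ch <= 'z'; ch++)
                if (ch != word.charAt(i))
                    put(word.substring(0, i) + ch + word.substring(i + 1), word);
    }

    public String lookup(String typed) {
        Node cur = root;
        for (char ch : typed.toCharArray()) {
            cur = cur.children.get(ch);
            if (cur == null) return null;
        }
        return cur.correct;
    }

    public static void main(String[] args) {
        SpellTrie t = new SpellTrie();
        t.insert("best");
        System.out.println(t.lookup("bost")); // best
        t.insert("bust");
        // "best" is itself a one-letter variant of "bust", so the mapping
        // was overwritten -- the weakness described above:
        System.out.println(t.lookup("best")); // bust
    }
}
```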
mysql> create database tp;
mysql> use tp;
mysql> create table LanguageModel(starter varchar(250), follower varchar(250), probability int);
mysql> grant all on *.* to "root"@"%" identified by "123456";
mysql> flush privileges;
hdfs dfs -mkdir /mysql
hdfs dfs -put ./mysql-connector-java-5.1.39-bin.jar /mysql
hdfs dfs -mkdir /trainingdata
hdfs dfs -put ./trainingdata/* /trainingdata
Usage: hadoop jar <jar file> <main class name> <input dir> <output dir> [GRAM_NUMBER] [THRESHOLD] [TOP_K]
Ex: hadoop jar TextPrediction.jar com.textprediction.ngramlm.Dispatcher /trainingdata /ngram 5 5 5
GRAM_NUMBER: the n of the n-gram (default 5)
THRESHOLD: the minimum count for a phrase to be considered meaningful (default 5)
TOP_K: show only the top k predictions, ranked by probability (default 5)
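The effect of THRESHOLD and TOP_K can be sketched in-memory like this (a hypothetical simplification of what the MapReduce jobs compute per starter phrase): followers below the threshold are dropped, and only the k most frequent survivors are kept.

```java
import java.util.*;
import java.util.stream.*;

public class ModelBuild {
    // For one starter phrase, keep followers whose raw count meets THRESHOLD,
    // then retain only the TOP_K most frequent.
    static List<String> topFollowers(Map<String, Integer> followerCounts,
                                     int threshold, int topK) {
        return followerCounts.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)         // THRESHOLD
                .sorted((a, b) -> b.getValue() - a.getValue())  // most frequent first
                .limit(topK)                                    // TOP_K
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("like", 12);
        counts.put("love", 9);
        counts.put("go", 3); // below threshold 5, dropped
        System.out.println(topFollowers(counts, 5, 2)); // [like, love]
    }
}
```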
hdfs dfs -ls /ngram
hdfs dfs -get /ngram/part-r-* ./generated-n-gram-library/
mysql> select * from LanguageModel limit 50;
mysql> select * from LanguageModel into outfile '/tmp/generated-language-model/LanguageModel.out';
mysql> select * from LanguageModel where starter like 'input%' order by probability desc limit x;
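The prediction query above can be mirrored in plain Java to make its semantics concrete (a hypothetical in-memory stand-in for the LanguageModel table, matching the schema defined earlier): filter rows whose starter begins with the user's input, sort by probability descending, and take the first k.

```java
import java.util.*;
import java.util.stream.*;

public class PredictQuery {
    static class Row {
        String starter, follower;
        int probability;
        Row(String s, String f, int p) { starter = s; follower = f; probability = p; }
    }

    // In-memory equivalent of:
    //   SELECT follower FROM LanguageModel WHERE starter LIKE 'input%'
    //   ORDER BY probability DESC LIMIT k;
    static List<String> predict(List<Row> table, String input, int k) {
        return table.stream()
                .filter(r -> r.starter.startsWith(input)) // LIKE 'input%'
                .sorted((a, b) -> b.probability - a.probability)
                .limit(k)
                .map(r -> r.follower)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> t = Arrays.asList(
                new Row("this is", "a", 40),
                new Row("this is", "the", 25),
                new Row("this was", "a", 30));
        System.out.println(predict(t, "this is", 2)); // [a, the]
    }
}
```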
Example inputs with their predictions (shown in the screenshots): "would", "this is", "away from".
In this project, I used a trie for spell checking only. The phrase-recommendation module, as a practice exercise, is implemented by storing the language model in a database and querying it with SQL. However, the SQL "LIKE" operation is expensive! Real search engines implement all of these functions with distributed tries; in other words, they store all of their language models in tries.
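To show why a trie avoids the LIKE scan, here is a single-machine sketch (names are hypothetical; production systems distribute this structure across machines): the starter phrase is the trie path, so a prefix lookup walks O(prefix length) nodes instead of scanning the whole table.

```java
import java.util.*;

public class PredictionTrie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        Map<String, Integer> followers = new HashMap<>(); // follower -> probability
    }

    private final Node root = new Node();

    public void insert(String starter, String follower, int probability) {
        Node cur = root;
        for (char ch : starter.toCharArray())
            cur = cur.children.computeIfAbsent(ch, k -> new Node());
        cur.followers.put(follower, probability);
    }

    // Walk the trie along the typed prefix, then gather every prediction
    // stored at or below that node -- no table scan needed.
    public List<String> predict(String prefix, int k) {
        Node cur = root;
        for (char ch : prefix.toCharArray()) {
            cur = cur.children.get(ch);
            if (cur == null) return Collections.emptyList();
        }
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        collect(cur, all);
        all.sort((a, b) -> b.getValue() - a.getValue());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(k, all.size()); i++)
            out.add(all.get(i).getKey());
        return out;
    }

    private void collect(Node n, List<Map.Entry<String, Integer>> out) {
        out.addAll(n.followers.entrySet());
        for (Node c : n.children.values()) collect(c, out);
    }

    public static void main(String[] args) {
        PredictionTrie t = new PredictionTrie();
        t.insert("this is", "a", 40);
        t.insert("this is", "the", 25);
        t.insert("this was", "a", 30);
        System.out.println(t.predict("this is", 2)); // [a, the]
    }
}
```

Sharding this trie by prefix across machines is what "distributed trie" means in practice: each machine owns a subtree and answers lookups for its range of prefixes.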