Insert the arguments as follows:
- jar_path
- output_path (bucket)
- EC2 key pair
- "yes" or "no", depending on whether you want to run the program with a combiner
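
As an illustration of the argument list above, here is a minimal sketch of how the four values might be read and validated. The class name `JobArgs`, its field names, and the choice to accept "yes"/"no" case-insensitively are assumptions made for this example; the project's actual entry point may handle them differently.

```java
import java.util.Locale;

/** Hypothetical holder for the four arguments; all names here are illustrative. */
final class JobArgs {
    final String jarPath;
    final String outputPath;   // the output bucket
    final String keyPair;      // EC2 key pair name
    final boolean useCombiner;

    private JobArgs(String jarPath, String outputPath, String keyPair, boolean useCombiner) {
        this.jarPath = jarPath;
        this.outputPath = outputPath;
        this.keyPair = keyPair;
        this.useCombiner = useCombiner;
    }

    /** Parses the arguments in the order listed in this README. */
    static JobArgs parse(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                "expected 4 arguments: jar_path output_path key_pair yes|no");
        }
        // Assumption: the combiner flag is matched case-insensitively.
        String flag = args[3].trim().toLowerCase(Locale.ROOT);
        boolean useCombiner;
        if (flag.equals("yes")) {
            useCombiner = true;
        } else if (flag.equals("no")) {
            useCombiner = false;
        } else {
            throw new IllegalArgumentException("combiner flag must be \"yes\" or \"no\"");
        }
        return new JobArgs(args[0], args[1], args[2], useCombiner);
    }
}
```

Rejecting anything other than "yes" or "no" early avoids silently running without a combiner because of a typo.
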
N-grams are fixed-size tuples of items.
-
Given a Google three-gram corpus, the program does the following:
- Counts the legal three-grams and divides the corpus into two parts.
- Calculates the probability of each three-gram.
- Sorts the three-grams alphabetically and by probability.
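
The first step above (filtering, splitting, and counting) can be sketched in plain Java, without Hadoop, as follows. Several details are assumptions made only for this example: each record is taken to be `trigram<TAB>count`, a "legal" three-gram is taken to be exactly three space-separated tokens made only of letters, and the corpus is divided by alternating record position. The project's own definitions may differ.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative, Hadoop-free sketch of filtering, splitting, and counting three-grams. */
final class TrigramCounter {

    /** Assumption: legal means exactly three non-empty, letters-only tokens. */
    static boolean isLegal(String trigram) {
        String[] words = trigram.split(" ", -1); // -1 keeps empty tokens so extra spaces are rejected
        if (words.length != 3) {
            return false;
        }
        for (String w : words) {
            if (w.isEmpty() || !w.chars().allMatch(Character::isLetter)) {
                return false;
            }
        }
        return true;
    }

    /**
     * Sums occurrences per three-gram in each of the two corpus parts.
     * Assumption: record i goes to part (i % 2), and the count field is a valid integer.
     */
    static List<Map<String, Long>> countBySplit(List<String> records) {
        Map<String, Long> part0 = new HashMap<>();
        Map<String, Long> part1 = new HashMap<>();
        for (int i = 0; i < records.size(); i++) {
            String[] fields = records.get(i).split("\t");
            if (fields.length != 2 || !isLegal(fields[0])) {
                continue; // skip malformed or illegal records
            }
            long count = Long.parseLong(fields[1]);
            (i % 2 == 0 ? part0 : part1).merge(fields[0], count, Long::sum);
        }
        return Arrays.asList(part0, part1);
    }

    /** Total number of three-gram occurrences in one part. */
    static long total(Map<String, Long> part) {
        long sum = 0;
        for (long c : part.values()) {
            sum += c;
        }
        return sum;
    }
}
```
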
The project has 4 MapReduce jobs:
- Job 1 filters out illegal three-grams, divides the corpus into two parts, sums the occurrences of each three-gram in each part, and counts the number of three-grams in the corpus.
- Job 2 calculates the occurrences of each three-gram in each part of the corpus and joins the data.
- Job 3 joins the data from the 2 splits of the corpus in its mapper and calculates the probability of each three-gram.
- Job 4 performs the sort: first by w1w2 in ascending alphabetical order, then by the probability of w1w2w3 in descending order.
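
The ordering used by Job 4 can be illustrated with a plain Java comparator. This is only a sketch of the sort order described above, not the project's code: the class `TrigramSort`, the nested `Scored` type, and its fields are hypothetical names. In a MapReduce job, the same ordering would more likely be expressed in the comparison logic of the key type.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch of the Job 4 ordering; all names are hypothetical. */
final class TrigramSort {

    /** A three-gram together with its computed probability. */
    static final class Scored {
        final String w1;
        final String w2;
        final String w3;
        final double probability;

        Scored(String w1, String w2, String w3, double probability) {
            this.w1 = w1;
            this.w2 = w2;
            this.w3 = w3;
            this.probability = probability;
        }
    }

    /** w1 then w2 ascending (alphabetical), then probability descending. */
    static final Comparator<Scored> ORDER =
        Comparator.comparing((Scored s) -> s.w1)
            .thenComparing(s -> s.w2)
            .thenComparing(
                Comparator.comparingDouble((Scored s) -> s.probability).reversed());

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<Scored> sorted(List<Scored> input) {
        List<Scored> copy = new ArrayList<>(input);
        copy.sort(ORDER);
        return copy;
    }
}
```

With this comparator, all three-grams that share the same first two words end up adjacent, with the most probable third word first.
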