GitHub - mihail911/Ngram: cs 149 assignment 3

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.gitignore		.gitignore
Ngram.java		Ngram.java
README.txt		README.txt
Tokenizer.java		Tokenizer.java
build.sh		build.sh
pa3-1gb-q1.sh		pa3-1gb-q1.sh
pa3-1gb-q2.sh		pa3-1gb-q2.sh
pa3-8gb-q1.sh		pa3-8gb-q1.sh
pa3-8gb-q2.sh		pa3-8gb-q2.sh
query1.txt		query1.txt
query2.txt		query2.txt

Repository files navigation

CS 149 PA3 - Ngram

Collaborators: Mihail Eric - meric
               Raymond Wu  - wur911

We first created our own InputFormat class that we labeled as NgramInputFormat.
We implemented a Record Reader layered on top of a line record reader. Our 
record reader did the following:
1) Find the title and save it
2) Read in the body of the text until you reach the next title
3) Output the saved title as the key and the body as the value
4) Save the new title read in to replace the old title
5) Repeat
The record reader thus skipped the unlabeled beginning of the file input and
and the incomplete last document. 

In the main function, we parsed the query and we created a HashSet for the 
ngrams in each map task. 

We used a mapper for each fileinput. The mapper tokenized the value (the text
of the document) and created ngrams. It checked if the mapper was in the 
hashset, and incremented the score if so. Then it outputted the result with 
a key of 1 and a value of a (title,score) string.

The combiner took in the results from a mapper and found the top 20 scores,
breaking ties by taking the lexigraphically last title. It then output the 
top 20 with a key of 1 and a value of a [(score,title),(score,title),...]
string.

Lastly, the reducer took in results from the mapper and collected the top 20
results of the groups of 20 results it received from the combiner. 
It outputted these as a "score \tab title" string.

The runtime is O(q + n/p + logP). The main function reads in the query and 
takes O(q) time. Since each mapper is running in parallel, the HashSet creation
takes O(q) time and the creation of document ngrams takes O(n/p) time since
we are dividing the input between p processors. The combiner took O(n/p) time
since it processed O(n/p) intermediate outputs from the mapper. The reducer
received results from the p combiners and thus took O(p) time.