<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Big Data Infrastructure</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="">
<meta name="author" content="">
<!-- Le styles -->
<link href="assets/css/bootstrap.css" rel="stylesheet">
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
pre {
font-size: 90%;
}
</style>
<link href="assets/css/bootstrap-responsive.css" rel="stylesheet">
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<!-- Fav and touch icons -->
<!--link rel="apple-touch-icon-precomposed" sizes="144x144" href="assets/ico/apple-touch-icon-144-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="114x114" href="assets/ico/apple-touch-icon-114-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="72x72" href="assets/ico/apple-touch-icon-72-precomposed.png">
<link rel="apple-touch-icon-precomposed" href="assets/ico/apple-touch-icon-57-precomposed.png">
<link rel="shortcut icon" href="assets/ico/favicon.png"-->
</head>
<body>
<div class="navbar navbar-inverse navbar-fixed-top">
<div class="navbar-inner">
<div class="container">
<a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</a>
<div class="nav-collapse collapse">
<ul class="nav">
<li><a href="index.html">Home</a></li>
<li><a href="overview.html">Overview</a></li>
<li><a href="syllabus.html">Syllabus</a></li>
<li class="active"><a href="assignments.html">Assignments</a></li>
</ul>
</div>
</div>
</div>
</div>
<div class="container">
<div class="page-header">
<h1>Assignments <small>Big Data Infrastructure (Spring 2015)</small></h1>
</div>
<div class="subnav">
<ul class="nav nav-pills">
<li><a href="#assignment0">0</a></li>
<li><a href="#assignment1">1</a></li>
<li><a href="#assignment2">2</a></li>
<li><a href="#assignment3">3</a></li>
<li><a href="#assignment4">4</a></li>
<li><a href="#assignment5">5</a></li>
<li><a href="#assignment6">6</a></li>
<li><a href="#assignment7">7</a></li>
<li><a href="#finalproject">Final Project</a></li>
</ul>
</div>
<section id="assignment0" style="padding-top:35px">
<div>
<h3>Assignment 0: Prelude <small>due 6:00pm January 26</small></h3>
<p>Note that this course requires you to have access to a reasonably
recent computer with at least 4 GB of memory and plenty of hard disk
space.</p>
<p>Complete
the <a href="http://lintool.github.com/Cloud9/docs/word-count.html">word
count tutorial</a> in Cloud<sup>9</sup>, which is a Hadoop toolkit we're
going to use throughout the course. The tutorial will take you
through setting up Hadoop on your local machine and running Hadoop on
the virtual machine. It'll also begin familiarizing you with
GitHub.</p>
<p><b>Note:</b> This assignment is not explicitly graded, except as
part of Assignment 1.</p>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>
<section id="assignment1" style="padding-top:35px">
<div>
<h3>Assignment 1: Warmup <small>due 6:00pm February 2</small></h3>
<p>Make sure you've completed
the <a href="http://lintool.github.com/Cloud9/docs/word-count.html">word
count tutorial</a> in Cloud<sup>9</sup>.</p>
<p>Sign up for a <a href="http://github.com/">GitHub</a> account. It is
very important that you do so as soon as possible, because GitHub is
the mechanism by which you will submit assignments. Once you've signed
up for an account, go to this page
to <a href="https://github.com/edu">request an educational
account</a>.</p>
<p>Next, create a <b>private</b> repo
called <code>bigdata-assignments</code>. Here
is <a href="https://help.github.com/articles/create-a-repo">how you
create a repo on GitHub</a>. For "Who has access to this repository?",
make sure you click "Only the people I specify". If you've
successfully gotten an educational account (per above), you should be
able to create private repos for free. Take some time to learn about
git if you've never used it before. There are plenty of good tutorials
online: do a simple web search and find one you like. If you've used
svn before, many of the concepts will be familiar, except that git
is far more powerful.</p>
<p>After you've learned about git, set aside the repo for now; you'll
come back to it later.</p>
<p>In the single-node virtual cluster from the word count tutorial, you
should have already run the word count demo with five reducers:</p>
<pre>
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.simple.DemoWordCount \
-input bible+shakes.nopunc -output wc -numReducers 5
</pre>
<p>Answer the following questions (see instructions below for how to "turn in" these answers):</p>
<p><b>Question 1.</b> What is the first term
in <code>part-r-00000</code> and how many times does it appear?</p>
<p><b>Question 2.</b> What is the third to last term
in <code>part-r-00004</code> and how many times does it appear?</p>
<p><b>Question 3.</b> How many unique terms are there? (Hint: read the
counter values)</p>
<h4 style="padding-top: 10px">Time to write some code!</h4>
<p>Per above, you should have a private GitHub repo called
<code>bigdata-assignments/</code>. Change into that directory. Once
inside, create a Maven project with the following command:</p>
<pre>
$ mvn archetype:create -DgroupId=edu.umd.YOUR_USERNAME -DartifactId=assignment1
</pre>
<p>For <code>YOUR_USERNAME</code>, please use your GitHub username
(not your UMD directory ID, your email address, or anything
else). In what follows, I will use
<code>jimmylin</code>, but you should obviously substitute your
own. Once you've executed the above command you should be able
to <code>cd</code>
into <code>bigdata-assignments/assignment1</code>. In that directory,
you'll find a <code>pom.xml</code> file (which tells Maven how to
build your code); replace with this
one <a href="assignments/pom.xml">here</a> (which is set up properly
for Hadoop), but inside this <code>pom.xml</code>, change the
following line and replace my username with yours.</p>
<pre>
<groupId>edu.umd.jimmylin</groupId>
</pre>
<p>Next,
copy <code>Cloud9/src/main/java/edu/umd/cloud9/example/simple/DemoWordCount.java</code>
into <code>bigdata-assignments/assignment1/src/main/java/edu/umd/jimmylin/</code>. Open
up this new version of <code>DemoWordCount.java</code>
in <code>assignment1/</code> using a text editor and change the Java
package to <code>edu.umd.jimmylin</code>.</p>
<p>Now, in the <code>bigdata-assignments/assignment1</code> base
directory, you should be able to run Maven to build your package:</p>
<pre>
$ mvn clean package
</pre>
<p>Once the build succeeds, you should be able to run the word count
demo program in your own repository:</p>
<pre>
$ hadoop jar target/assignment1-1.0-SNAPSHOT-fatjar.jar edu.umd.jimmylin.DemoWordCount \
-input bible+shakes.nopunc -output wc -numReducers 5
</pre>
<p>The output should be exactly the same as the program above, but the
difference here is that the code is now in a repository under your
control. Congratulations, you've created your first functional Maven
artifact!</p>
<p>Let's do a little bit of cleanup of the words. Modify the word
count demo (your newly-created version in <code>assignment1/</code>)
so that only words consisting entirely of letters are counted. To be
more specific, the word must match the following Java regular
expression:</p>
<pre>
word.matches("[A-Za-z]+")
</pre>
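<p>As a sanity check on the filter (outside of Hadoop entirely), here is a
plain-Java sketch of the tokenize-and-filter logic. The class name
<code>TokenFilterDemo</code> and the sample line are made up for
illustration; the tokenization and regular expression are the same ones the
assignment asks for:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenFilterDemo {
    // Mimics the mapper's tokenization plus the letters-only filter;
    // plain Java, no Hadoop dependencies.
    static List<String> lettersOnly(String line) {
        List<String> kept = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            String word = itr.nextToken();
            if (word.matches("[A-Za-z]+")) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "2nd" and "o'er" contain non-letters and are dropped.
        System.out.println(lettersOnly("the 2nd o'er witching time"));
        // prints [the, witching, time]
    }
}
```

<p>Note that this filter drops hyphenated and apostrophized forms like
"o'er", which is exactly why your counts will differ from the unfiltered
run.</p>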
<p>Now run this modified word count, again with five reducers. Answer
the following questions:</p>
<p><b>Question 4.</b> What is the first term
in <code>part-r-00000</code> and how many times does it appear?</p>
<p><b>Question 5.</b> What is the third to last term
in <code>part-r-00004</code> and how many times does it appear?</p>
<p><b>Question 6.</b> How many unique terms are there?</p>
<h4 style="padding-top: 10px">Turning in the Assignment</h4>
<p>Please follow these instructions carefully!</p>
<p>At this point, you should have a GitHub
repo <code>bigdata-assignments</code>, and inside the repo, you should
have a directory
called <code>assignment1/</code>. Under <code>assignment1/</code>, you
should already have the code for the modified word count example above
(i.e., questions 4, 5, 6). Commit all of this code and push to
GitHub.</p>
<p>Next, under <code>assignment1/</code>, create a file
called <code>assignment1.md</code>. In that file, put your answers to
the above questions 1 through 6. Use the Markdown annotation format:
here's
a <a href="http://daringfireball.net/projects/markdown/basics">simple
guide</a>. Here's an <a href="http://markable.in/editor/">online
editor</a> that's also helpful.</p>
<p>Make sure you have committed everything and have pushed your repo
back to origin. You can verify that it's there by logging into your
GitHub account (in a web browser): your assignment should be
viewable in the web interface.</p>
<p>Almost there! Add the
user <a href="https://github.com/teachtool">teachtool</a> as a
collaborator to your repo so that I can check it out (under settings
in the main web interface on your repo). Note: do <b>not</b> add my
primary GitHub
account <a href="https://github.com/lintool">lintool</a> as a
collaborator.</p>
<p>Finally, send me an email at jimmylin@umd.edu with the subject
line "Big Data Infrastructure Assignment #1". In the body of the email
message, tell me what your GitHub username is so that I can link your
repo to you. Also, in your email please tell me how long you spent
doing the assignment, including everything (installing the VM,
learning about git, working through the tutorial, etc.).</p>
<h4 style="padding-top: 10px">Grading</h4>
<p>Here's how I am going to grade your assignment. I will clone your
repo, go into your <code>assignment1/</code> directory, and build your
Maven artifact:</p>
<pre>
$ mvn clean package
</pre>
<p>I am then going to run your code:</p>
<pre>
$ hadoop jar target/assignment1-1.0-SNAPSHOT-fatjar.jar edu.umd.jimmylin.DemoWordCount \
-input bible+shakes.nopunc -output wc -numReducers 5
</pre>
<p>Once the code completes, I will verify its output. To make sure
everything is in the proper place, you should do a fresh clone, i.e.,
clone your own repo, but in a different location, and run through
these same steps. If it works for you, it'll work for me.</p>
<p>The purpose of this assignment is to familiarize you with the
Hadoop development environment. You'll get a "pass" if you've
successfully completed the assignment. I expect everyone to get a
"pass".</p>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>
<section id="assignment2" style="padding-top:35px">
<div>
<h3>Assignment 2: Counting <small>due 6:00pm February 23</small></h3>
<p>Begin by setting up your development environment. In your GitHub
repo <code>bigdata-assignments/</code>, create a Maven project with
the following command:</p>
<pre>
$ mvn archetype:create -DgroupId=edu.umd.YOUR_USERNAME -DartifactId=assignment2
</pre>
<p>For <code>YOUR_USERNAME</code>, please use your GitHub username
(not your UMD directory ID, your email address, or anything
else). In what follows, I will use
<code>jimmylin</code>, but you should obviously substitute your
own. Once you've executed the above command, change directory
to <code>bigdata-assignments/assignment2</code>. In that directory,
replace <code>pom.xml</code> with this
version <a href="assignments/pom.xml">here</a> (which is set up
properly for Hadoop). However, inside <code>pom.xml</code>, change the
following line and replace my username with yours.</p>
<pre>
<groupId>edu.umd.jimmylin</groupId>
</pre>
<p>Also, replace all instances of <code>assignment1</code> with
<code>assignment2</code>.</p>
<p>The actual assignment begins with an optional <i>but
recommended</i> component: complete
the <a href="http://lintool.github.com/Cloud9/docs/exercises/bigrams.html">bigram
counts exercise</a> in Cloud<sup>9</sup>. The solution is already
checked in the repo, so it won't be graded. Even if you decide not to
write code for the exercise, take some time to sketch out what the
solution would look like. The exercises are designed to help you
learn: jumping directly to the solution defeats this purpose.</p>
<p>In this assignment you'll be
computing <a href="http://en.wikipedia.org/wiki/Pointwise_mutual_information">pointwise
mutual information</a>, which is a function of two events <i>x</i>
and <i>y</i>:</p>
<p><img width="200" src="assignments/PMI.png"/></p>
<p>The larger the magnitude of PMI for <i>x</i> and <i>y</i> is,
the more information you know about the probability of seeing <i>y</i>
having just seen <i>x</i> (and vice-versa, since PMI is
symmetrical). If seeing <i>x</i> gives you no information about seeing
<i>y</i>, then <i>x</i> and <i>y</i> are independent and the PMI is
zero.</p>
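<p>In code, the formula is just a ratio of relative frequencies estimated
from line-level counts. The sketch below is plain Java with made-up counts
(it is not part of the assignment scaffolding); note the use of
<code>Math.log10</code>, per the log base 10 convention below:</p>

```java
public class PmiDemo {
    // PMI(x, y) = log10( p(x, y) / (p(x) * p(y)) ), where probabilities
    // are estimated as event counts over n, the total number of lines.
    static double pmi(long countX, long countY, long countXY, long n) {
        double px = (double) countX / n;
        double py = (double) countY / n;
        double pxy = (double) countXY / n;
        return Math.log10(pxy / (px * py));
    }

    public static void main(String[] args) {
        // Hypothetical counts where p(x,y) = p(x) * p(y):
        // the events are independent, so PMI is 0.
        System.out.println(pmi(100, 100, 10, 1000)); // prints 0.0
    }
}
```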
<p>To complete this assignment, you'll need to work with the UMIACS
cluster. In the beginning of the assignment you'll be working with the
toy <code>bible+shakes.nopunc.gz</code> corpus, but later you'll move
to a larger corpus (more below). You can start working on the UMIACS
cluster directly, or you can start on the Cloudera VM and move to the
UMIACS cluster later. It's your choice, but as we discussed in class,
debugging may be easier inside your Cloudera VM.</p>
<p>Write a program that computes the PMI of words in the
sample <code>bible+shakes.nopunc.gz</code> corpus. To be more
specific, the event we're after is <i>x</i> occurring on a line in the
file or <i>x</i> and <i>y</i> co-occurring on a line. That is, if a
line contains A, A, B, then there is only one instance of A and B
appearing together, not two. To reduce the number of spurious pairs,
we are only interested in pairs of words that co-occur in ten or more
lines. Use the same definition of "word" as in the word count demo:
whatever Java's <code>StringTokenizer</code> gives.</p>
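<p>One way to get the "once per line" semantics is to collapse each line
into its set of distinct words before emitting anything. The following
standalone sketch (class name hypothetical, no Hadoop dependencies) shows
the idea:</p>

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringTokenizer;

public class LineEventsDemo {
    // A line contributes each distinct word at most once, so "A A B"
    // yields one occurrence of A, one of B, and one co-occurring pair.
    static Set<String> distinctWords(String line) {
        Set<String> words = new LinkedHashSet<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            words.add(itr.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(distinctWords("A A B")); // prints [A, B]
    }
}
```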
<p>You will build two versions of the program (put both in
package <code>edu.umd.YOUR_USERNAME</code>):</p>
<ol>
<li>A "pairs" implementation. The implementation must use
combiners. Name this implementation <code>PairsPMI</code>.</li>
<li>A "stripes" implementation. The implementation must use
combiners. Name this implementation <code>StripesPMI</code>.</li>
</ol>
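<p>To illustrate the difference between the two representations, here is a
toy, non-MapReduce sketch of how "stripes" aggregate co-occurrence counts
into a map per word (assuming each line has already been reduced to its
distinct words). In the real implementation the inner map would be a
<code>Writable</code> emitted from the mapper, one stripe per left word,
instead of many individual pairs:</p>

```java
import java.util.HashMap;
import java.util.Map;

public class StripesDemo {
    // Each stripe maps a co-occurring word to its count; emitting one
    // map per left word replaces many individual pair emissions.
    static Map<String, Map<String, Integer>> stripes(String[][] lines) {
        Map<String, Map<String, Integer>> result = new HashMap<>();
        for (String[] words : lines) {
            for (String x : words) {
                for (String y : words) {
                    if (!x.equals(y)) {
                        result.computeIfAbsent(x, k -> new HashMap<>())
                              .merge(y, 1, Integer::sum);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two hypothetical lines, already deduplicated.
        String[][] lines = {{"a", "b"}, {"a", "c"}};
        System.out.println(stripes(lines));
    }
}
```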
<p>If you feel compelled (for extra credit), you are welcome to try
out the "in-mapper combining" technique for both implementations.</p>
<p>Since PMI is symmetrical, PMI(x, y) = PMI(y, x). However, it's
actually easier in your implementation to compute both values, so
don't worry about duplicates. Also, use <code>TextOutputFormat</code>
so the results of your program are human readable.</p>
<p><b>Note:</b> just so everyone's answer is consistent, please use
log base 10.</p>
<p>Answer the following questions:</p>
<p><b>Question 0.</b> <i>Briefly</i> describe in prose your solution,
both the pairs and stripes implementation. For example: how many
MapReduce jobs? What are the input records? What are the intermediate
key-value pairs? What are the final output records? A paragraph for
each implementation is about the expected length.</p>
<p><b>Question 1.</b> What is the running time of the complete pairs
implementation? What is the running time of the complete stripes
implementation? (Did you run this in your VM or on the UMIACS cluster?
Either is fine, but tell me which one.)</p>
<p><b>Question 2.</b> Now disable all combiners. What is the running
time of the complete pairs implementation now? What is the running
time of the complete stripes implementation? (Did you run this in your
VM or on the UMIACS cluster? Either is fine, but tell me which
one.)</p>
<p><b>Question 3.</b> How many distinct PMI pairs did you extract?</p>
<p><b>Question 4.</b> What's the pair (x, y) with the highest PMI?
Write a sentence or two to explain what it is and why it has such a
high PMI.</p>
<p><b>Question 5.</b> What are the three words that have the highest
PMI with "cloud" and "love"? And what are the PMI values?</p>
<p>Note that you can compute the answer to questions 3–5 however
you wish: a helper Java program, a Python script, command-line
manipulation, etc.</p>
<p>Now, answer the same questions 1–5 for the following corpus, which
is stored on HDFS on the UMIACS cluster:</p>
<pre>
/shared/simplewiki-20141222-pages-articles.txt
</pre>
<p>That file is 121 MB and contains the latest version
of <a href="http://simple.wikipedia.org/wiki/Main_Page">Simple English
Wikipedia</a>. Number the answers to these questions 6–10.</p>
<p>Note that it is possible to complete questions 1–5 on your
Cloudera VM, but you <i>must</i> answer questions 6–10 on the
UMIACS cluster. This is to ensure that your code "scales
correctly".</p>
<h4 style="padding-top: 10px">Turning in the Assignment</h4>
<p>Please follow these instructions carefully!</p>
<p>Make sure your repo has the following items:</p>
<ul>
<li>Similar to your first assignment, the answers to the questions go
in <code>bigdata-assignments/assignment2/assignment2.md</code>.</li>
<li>The pairs and stripes implementation should be
in <code>bigdata-assignments/assignment2/src/</code>. Of course, your
repo may contain other Java code.</li>
</ul>
<p>When grading, I will perform a clean clone of your repo, change
directory into <code>bigdata-assignments/assignment2/</code> and build
your code:</p>
<pre>
$ mvn clean package
</pre>
<p>Your code should build successfully. I am then going to run your
code (for the pairs and stripes implementations, respectively):</p>
<pre>
$ hadoop jar target/assignment2-1.0-SNAPSHOT-fatjar.jar edu.umd.YOUR_USERNAME.PairsPMI \
-input DATA -output output-pairs -numReducers 5
$ hadoop jar target/assignment2-1.0-SNAPSHOT-fatjar.jar edu.umd.YOUR_USERNAME.StripesPMI \
-input DATA -output output-stripes -numReducers 5
</pre>
<p>For <code>DATA</code>, I am either going to
use <code>bible+shakes.nopunc</code>
or <code>simplewiki-20141222-pages-articles.txt</code> (you can assume
that I'll supply the correct HDFS path). Note that I am going to check
the output and I expect the contents in the final output on HDFS to be
human readable.</p>
<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", I would
recommend that you verify everything above works by performing a clean
clone of your repo and going through the steps above.</p>
<p>That's it! There's no need to send me anything—I already know
your username from the first assignment. Note that everything should
be committed and pushed to origin before the deadline (6:00pm on
February 23). Git timestamps your commits, so I can tell if your
assignment is late.</p>
<h4 style="padding-top: 10px">Hints</h4>
<ul>
<li>Did you take a look at the <a href="http://lintool.github.com/Cloud9/docs/exercises/bigrams.html">bigram
counts exercise</a>?</li>
<li>Your solution may require more than one MapReduce job.</li>
<li>Recall the techniques from lecture for loading in "side data".</li>
<li>Look in <code>edu.umd.cloud9.example.cooccur</code> for a
reference implementation of the pairs and stripes techniques.</li>
<li>As discussed in class,
my <a href="https://github.com/lintool/tools/tree/master/lintools-datatypes/">lintools-datatypes
package</a> has <code>Writable</code> datatypes that you
might find useful.</li>
</ul>
<h4 style="padding-top: 10px">Grading</h4>
<p>The entire assignment is worth 40 points:</p>
<ul>
<li>Each of the questions 1 to 10 is worth 1 point, for a total of 10
points.</li>
<li>The pairs implementation is worth 10 points and the stripes
implementation is worth 10 points. The purpose of question 0 is to
help me understand your implementation.</li>
<li>Getting your code to run is worth 5 points for each implementation
(i.e., 10 points total). That is, to earn all 10 points, I should be
able to run your code (building and running), following exactly the
procedure above. Therefore, if all the answers are correct and the
implementation seems correct, but I cannot get your code to build and
run, you will only get a score of 30/40.</li>
</ul>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>
<section id="assignment3" style="padding-top:35px">
<div>
<h3>Assignment 3: Inverted Indexing <small>due 6:00pm March 2</small></h3>
<p>Begin by setting up your development environment. The process is
exactly the same as in the previous assignment. In your GitHub
repo <code>bigdata-assignments/</code>, create a Maven project with
the following command:</p>
<pre>
$ mvn archetype:create -DgroupId=edu.umd.YOUR_USERNAME -DartifactId=assignment3
</pre>
<p>For <code>YOUR_USERNAME</code>, please use your GitHub
username. Once you've executed the above command, change directory
to <code>bigdata-assignments/assignment3</code>. In that directory,
replace <code>pom.xml</code> with this
version <a href="assignments/pom.xml">here</a> (which is set up
properly for Hadoop). However, inside <code>pom.xml</code>, change the
following line and replace my username with yours.</p>
<pre>
<groupId>edu.umd.jimmylin</groupId>
</pre>
<p>Also, replace all instances of <code>assignment1</code> with
<code>assignment3</code>.</p>
<p>This assignment begins with an optional <i>but recommended</i>
component: complete
the <a href="http://lintool.github.com/Cloud9/docs/exercises/indexing.html">inverted
indexing exercise</a>
and <a href="http://lintool.github.com/Cloud9/docs/exercises/retrieval.html">boolean
retrieval exercise</a> in Cloud<sup>9</sup>. The solution is already
checked in the repo, so it won't be graded. However, the rest of the
assignment builds from there. Even if you decide not to write code for
those two exercises, take some time to sketch out what the solution
would look like. The exercises are designed to help you learn: jumping
directly to the solution defeats the purpose.</p>
<p>Starting from the inverted indexing baseline, modify the indexer
code in the two following ways:</p>
<p><b>1. Index Compression.</b> The index should be compressed using
VInts: see <code>org.apache.hadoop.io.WritableUtils</code>. You should
also use gap-compression techniques as appropriate.</p>
<p><b>2. Scalability.</b> The baseline indexer implementation
currently buffers and sorts postings in the reducer, which as we
discussed in class is not a scalable solution. Address this
scalability bottleneck using techniques we discussed in class and in
the textbook.</p>
<p><b>Note:</b> The major scalability issue is
buffering <i>uncompressed</i> postings in memory. In your solution,
you'll still end up buffering each postings list, but
in <i>compressed</i> form (raw bytes, no additional object
overhead). This is fine because if you use the right compression
technique, the postings lists are quite small. As a data point, on a
collection of 50 million web pages, 2GB heap is more than enough for a
full <i>positional</i> index (and in this assignment you're not asked
to store positional information in your postings).</p>
<p>To go into a bit more detail: in the reference implementation, the
final value type is <code>PairOfWritables<IntWritable,
ArrayListWritable<PairOfInts>></code>. The most obvious idea
is to change that into something
like <code>PairOfWritables<VIntWritable,
ArrayListWritable<PairOfVInts>></code>. This does not work!
The reason is that you will still be materializing each posting, i.e.,
all <code>PairOfVInts</code> objects in memory. This translates into a
Java object for every posting, which is wasteful in terms of memory
usage and will exhaust memory pretty quickly as you scale. In other
words, you're <i>still</i> buffering objects—just inside
the <code>ArrayListWritable</code>.</p>
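<p>To make the gap-compression idea concrete, here is a standalone
plain-Java sketch. It uses a generic base-128 varint rather than the exact
byte layout of <code>WritableUtils.writeVInt</code>, and the class name is
made up, but the point is the same: after sorting, consecutive docids are
encoded as small gaps that fit in very few bytes, with no per-posting Java
object overhead:</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class GapVarIntDemo {
    // Encode sorted docids as gaps, each gap as a base-128 varint
    // (7 payload bits per byte; high bit set means "more bytes follow").
    static byte[] encode(int[] sortedDocIds) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        int prev = 0;
        for (int id : sortedDocIds) {
            int gap = id - prev;
            prev = id;
            while ((gap & ~0x7F) != 0) {   // emit 7 bits at a time
                out.writeByte((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.writeByte(gap);
        }
        return bytes.toByteArray();
    }

    // Decode by accumulating gaps back into absolute docids.
    static int[] decode(byte[] data, int count) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int[] ids = new int[count];
        int prev = 0;
        for (int i = 0; i < count; i++) {
            int gap = 0, shift = 0, b;
            do {
                b = in.readByte() & 0xFF;
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += gap;
            ids[i] = prev;
        }
        return ids;
    }

    public static void main(String[] args) throws IOException {
        int[] postings = {3, 9, 150, 151};
        byte[] compressed = encode(postings);
        // Gaps 3, 6, 141, 1 compress to 5 bytes total,
        // versus 16 bytes for four raw ints.
        System.out.println(compressed.length); // prints 5
    }
}
```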
<p>This new indexer should be
named <code>BuildInvertedIndexCompressed</code>. This new class should
be in the package <code>edu.umd.YOUR_USERNAME</code>. Make sure it
works on the <code>bible+shakes.nopunc</code> collection.</p>
<p>Modify <code><a href="https://github.com/lintool/Cloud9/blob/master/src/main/java/edu/umd/cloud9/example/ir/BooleanRetrieval.java">BooleanRetrieval</a></code>
so that it works with the new compressed indexes (on
the <code>bible+shakes.nopunc</code> collection). Name this new
class <code>BooleanRetrievalCompressed</code>. This new class should
be in the package <code>edu.umd.YOUR_USERNAME</code> and
give <i>exactly</i> the same output as the old version.</p>
<p>Next, make sure your <code>BuildInvertedIndexCompressed</code>
and <code>BooleanRetrievalCompressed</code> implementations work on
the larger collection on HDFS in the UMIACS cluster:</p>
<pre>
/shared/simplewiki-20141222-pages-articles.txt
</pre>
<p>Note that <code>BooleanRetrievalCompressed</code> has a number of
queries that are hard-coded in the <code>main</code>. For simplicity,
use those same queries on the simplewiki collection also.</p>
<p>Another note: the <code>BooleanRetrieval</code> reference
implementation prints out the entire line (i.e., "document") that
satisfies the query. For the <code>bible+shakes.nopunc</code>
collection, this is fine since the lines are short. However, in the
simplewiki collection, the lines (i.e., documents) are much longer, so
you should somehow truncate: either print out only the article title
or the first 80 characters, or something along those lines.</p>
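<p>The truncation itself can be as simple as the following standalone
sketch (the 80-character cap and the class name are illustrative, not
required):</p>

```java
public class TruncateDemo {
    // Truncate a long "document" line for display; 80 is an arbitrary cap.
    static String preview(String doc) {
        return doc.length() <= 80 ? doc : doc.substring(0, 80) + "...";
    }

    public static void main(String[] args) {
        System.out.println(preview("a".repeat(200)).length()); // prints 83
    }
}
```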
<p>Answer the following questions:</p>
<p><b>Question 1.</b> What is the size of your compressed index for <code>bible+shakes.nopunc</code>?</p>
<p><b>Question 2.</b> What is the size of your compressed index for <code>simplewiki-20141222-pages-articles.txt</code>?</p>
<p><b>Question 3.</b> Which articles in <code>simplewiki-20141222-pages-articles.txt</code> satisfy the following boolean queries?</p>
<pre>
outrageous AND fortune
means AND deceit
(white OR red) AND rose AND pluck
(unhappy OR outrageous OR (good AND your)) AND fortune
</pre>
<p>Note that I just want the article titles only (not the actual text).</p>
<h4 style="padding-top: 10px">Turning in the Assignment</h4>
<p>Please follow these instructions carefully!</p>
<p>Make sure your repo has the following items:</p>
<ul>
<li>Similar to your second assignment, the answers to the questions go
in <code>bigdata-assignments/assignment3/assignment3.md</code>.</li>
<li>Your code goes in <code>bigdata-assignments/assignment3/src/</code>.</li>
</ul>
<p>When grading, I will perform a clean clone of your repo, change
directory into <code>bigdata-assignments/assignment3/</code> and build
your code:</p>
<pre>
$ mvn clean package
</pre>
<p>Your code should build successfully. I am then going to run your
code on the <code>bible+shakes.nopunc</code> collection:</p>
<pre>
$ hadoop jar target/assignment3-1.0-SNAPSHOT-fatjar.jar edu.umd.YOUR_USERNAME.BuildInvertedIndexCompressed \
-input bible+shakes.nopunc -output index-shakes -numReducers 1
$ hadoop jar target/assignment3-1.0-SNAPSHOT-fatjar.jar edu.umd.YOUR_USERNAME.BooleanRetrievalCompressed \
-index index-shakes -collection bible+shakes.nopunc
</pre>
<p>The index should build properly (the size should match your answer
to Question 1), and the output of the boolean retrieval should be
correct.</p>
<p>I am next going to test your code on
the <code>simplewiki-20141222-pages-articles.txt</code>
collection:</p>
<pre>
$ hadoop jar target/assignment3-1.0-SNAPSHOT-fatjar.jar edu.umd.jimmylin.BuildInvertedIndexCompressed \
-input /shared/simplewiki-20141222-pages-articles.txt -output index-enwiki -numReducers 1
$ hadoop jar target/assignment3-1.0-SNAPSHOT-fatjar.jar edu.umd.jimmylin.BooleanRetrievalCompressed \
-index index-enwiki -collection /shared/simplewiki-20141222-pages-articles.txt
</pre>
<p>The index should build properly (the size should match your answer
to Question 2). The output of <code>BooleanRetrievalCompressed</code>
should match your answer to Question 3 above.</p>
<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", I would
recommend that you verify everything above works by performing a clean
clone of your repo and going through the steps above.</p>
<p>That's it! There's no need to send me anything—I already know
your username from the first assignment. Note that everything should
be committed and pushed to origin before the deadline (before class on
February 23). Git timestamps your commits and so I can tell if your
assignment is late.</p>
<h4 style="padding-top: 10px">Grading</h4>
<p>The entire assignment is worth 40 points:</p>
<ul>
<li>The implementation of <code>BuildInvertedIndexCompressed</code> is
worth 20 points: index compression is worth 10 points and making sure
the algorithm is scalable is worth 10 points.</li>
<li>The implementation of <code>BooleanRetrievalCompressed</code> is
worth 10 points.</li>
<li>Getting your code to run is worth 10 points. That is, to earn all
10 points, I should be able to run your code (building and running),
following exactly the procedure above. Therefore, if all the answers
are correct and the implementation seems correct, but I cannot get
your code to build and run, you will only get a score of 30/40.</li>
</ul>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>
<section id="assignment4" style="padding-top:35px">
<div>
<h3>Assignment 4: Graphs <small>due 6:00pm March 23</small></h3>
<p>Begin by setting up your development environment. The process is
exactly the same as in the previous assignment. In your GitHub
repo <code>bigdata-assignments/</code>, create a Maven project with
the following command:</p>
<pre>
$ mvn archetype:create -DgroupId=edu.umd.YOUR_USERNAME -DartifactId=assignment4
</pre>
<p>For <code>YOUR_USERNAME</code>, please use your GitHub
username. Once you've executed the above command, change directory
to <code>bigdata-assignments/assignment4</code>. In that directory,
replace <code>pom.xml</code> with this
version <a href="assignments/pom.xml">here</a> (which is set up
properly for Hadoop). However, inside <code>pom.xml</code>, change the
following line and replace my username with yours.</p>
<pre>
<groupId>edu.umd.jimmylin</groupId>
</pre>
<p>Also, replace all instances of <code>assignment1</code> with
<code>assignment4</code>.</p>
<p>Begin this assignment by taking the time to understand
the <a href="http://lintool.github.com/Cloud9/docs/exercises/pagerank.html">PageRank
reference implementation</a> in Cloud<sup>9</sup>. There is no need to
try the exercise from scratch, but study the code carefully to
understand exactly how it works.</p>
<p>For this assignment, you are going to implement multiple-source
personalized PageRank. As we discussed in class, personalized PageRank
is different from ordinary PageRank in a few respects:</p>
<ul>
<li>There is the notion of a <i>source</i> node, which is what we're
computing the personalization with respect to.</li>
<li>When initializing PageRank, instead of a uniform distribution
across all nodes, the source node gets a mass of one and every other
node gets a mass of zero.</li>
<li>Whenever the model makes a random jump, the random jump is
always back to the source node; this is unlike in ordinary PageRank,
where there is an equal probability of jumping to any node.</li>
<li>All mass lost at dangling nodes is put back into the source
node; this is unlike in ordinary PageRank, where the missing mass is
evenly distributed across all nodes.</li>
</ul>
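<p>The differences above can be made concrete in a small sequential sketch. This is illustrative only (the assignment requires a MapReduce implementation); the toy graph, jump probability, and iteration count are made up. Note how both the random jump and the dangling mass go back to the source alone, so the total probability mass stays at one:</p>

```java
public class PersonalizedPageRankSketch {
    public static void main(String[] args) {
        // Toy graph as adjacency lists: node -> outgoing neighbors.
        // Node 3 is dangling (no outgoing edges).
        int[][] adj = { {1, 2}, {3}, {0}, {} };
        int n = adj.length, source = 0;
        double alpha = 0.15;  // random-jump probability (illustrative value)

        // Initialization: the source gets a mass of one, everyone else zero.
        double[] pr = new double[n];
        pr[source] = 1.0;

        for (int iter = 0; iter < 20; iter++) {
            double[] next = new double[n];
            double danglingMass = 0.0;
            for (int u = 0; u < n; u++) {
                if (adj[u].length == 0) {
                    danglingMass += pr[u];  // mass lost at dangling nodes
                } else {
                    for (int v : adj[u]) next[v] += pr[u] / adj[u].length;
                }
            }
            for (int v = 0; v < n; v++) next[v] *= (1.0 - alpha);
            // Random jumps AND the dangling mass both return to the source only.
            next[source] += alpha + (1.0 - alpha) * danglingMass;
            pr = next;
        }

        double sum = 0.0;
        for (double p : pr) sum += p;
        System.out.printf("sum=%.5f source=%.5f%n", sum, pr[source]);
    }
}
```

<p>The printed sum should be 1.00000: unlike in a naive implementation that drops dangling mass, no probability leaks out of the graph.</p>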
<p>This assignment can be completed entirely in your
VM. Alternatively, you are welcome to use the UMIACS cluster if you
wish.</p>
<p>Here are some publications about personalized PageRank if you're
interested. They're just provided for background; neither is necessary
for completing the assignment.</p>
<ul>
<li>Daniel Fogaras, Balazs Racz, Karoly Csalogany, and Tamas Sarlos. (2005) <a href="content/Fogaras_etal_2005.pdf">Towards Scaling Fully Personalized PageRank: Algorithms, Lower Bounds, and Experiments.</a> Internet Mathematics, 2(3):333-358.</li>
<li>Bahman Bahmani, Abdur Chowdhury, and Ashish Goel. (2010) <a href="content/Bahmani_etal_VLDB2010.pdf">Fast Incremental and Personalized PageRank.</a> Proceedings of the 36th International Conference on Very Large Data Bases (VLDB 2010).</li>
</ul>
<p>Your implementation is going to run multiple personalized PageRank
computations in parallel, one with respect to each source. The user is
going to specify the sources on the command line. This means that each
PageRank node object (i.e., <code>Writable</code>) is going to contain
an array of PageRank values.</p>
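<p>Concretely, your node <code>Writable</code> will need to serialize and deserialize that array in its <code>write()</code> and <code>readFields()</code> methods. Here is a sketch of one possible layout (length prefix followed by the values) using plain <code>DataOutput</code>/<code>DataInput</code> streams; this is a hypothetical layout for illustration, not the actual Cloud<sup>9</sup> <code>PageRankNode</code> format:</p>

```java
import java.io.*;
import java.util.Arrays;

public class PageRankArraySketch {
    // Writable-style serialization for a node holding one PageRank
    // value per source (hypothetical layout, for illustration only).
    static void write(DataOutput out, int nodeId, float[] pageranks) throws IOException {
        out.writeInt(nodeId);
        out.writeInt(pageranks.length);           // one slot per source node
        for (float p : pageranks) out.writeFloat(p);
    }

    static float[] readValues(DataInput in) throws IOException {
        in.readInt();                             // node id (ignored here)
        float[] p = new float[in.readInt()];
        for (int i = 0; i < p.length; i++) p[i] = in.readFloat();
        return p;
    }

    public static void main(String[] args) throws IOException {
        float[] values = { 0.5f, 0.25f, 0.25f };  // one value per source
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        write(new DataOutputStream(bytes), 42, values);
        float[] back = readValues(new DataInputStream(
            new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(Arrays.equals(values, back));  // prints true
    }
}
```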
<p>Here's how the implementation is going to work; it largely follows
the reference implementation in the exercise above. It's your
responsibility to make your implementation work with respect to the
command-line invocations specified below.</p>
<p>First, the user is going to convert the adjacency list into
PageRank node records:</p>
<pre>
$ hadoop jar target/assignment4-1.0-SNAPSHOT-fatjar.jar edu.umd.jimmylin.BuildPersonalizedPageRankRecords \
-input sample-large.txt -output YOURNAME-PageRankRecords -numNodes 1458 -sources 9627181,9370233,10207721
</pre>
<p>Note that we're going to use the "large" graph from the exercise
linked above. The <code>-sources</code> option specifies the source
nodes for the personalized PageRank computations. In this case, we're
running three computations in parallel, with respect to node
ids 9627181, 9370233, and 10207721. You can expect the option value to
be in the form of a comma-separated list, and that all node ids
actually exist in the graph. The list of source nodes may be
arbitrarily long, but for practical purposes I won't test your code
with more than a few.</p>
<p>Since we're running three personalized PageRank computations in
parallel, each PageRank node is going to hold an array of three
values, the personalized PageRank values with respect to the first
source, second source, and third source. You can expect the array
positions to correspond exactly to the position of the node id in the
source string.</p>
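<p>Parsing the <code>-sources</code> option into that mapping is straightforward: split on commas and record each node id's position. A minimal sketch:</p>

```java
import java.util.*;

public class SourceIndexSketch {
    public static void main(String[] args) {
        // The -sources option value: a comma-separated list of node ids.
        String sources = "9627181,9370233,10207721";

        // Map each source node id to its position in the PageRank array.
        Map<Integer, Integer> sourceIndex = new HashMap<>();
        String[] parts = sources.split(",");
        for (int i = 0; i < parts.length; i++) {
            sourceIndex.put(Integer.parseInt(parts[i]), i);
        }

        // Array slot 1 holds the values personalized to source 9370233.
        System.out.println(sourceIndex.get(9370233));  // prints 1
    }
}
```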
<p>Next, the user is going to partition the graph and get ready to
iterate:</p>
<pre>
$ hadoop fs -mkdir YOURNAME-PageRank
$ hadoop jar target/assignment4-1.0-SNAPSHOT-fatjar.jar edu.umd.jimmylin.PartitionGraph \
-input YOURNAME-PageRankRecords -output YOURNAME-PageRank/iter0000 -numPartitions 5 -numNodes 1458
</pre>
<p>This will be standard hash partitioning.</p>
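<p>By "standard hash partitioning" I mean the behavior of Hadoop's default <code>HashPartitioner</code>: the partition is <code>(key.hashCode() & Integer.MAX_VALUE) % numPartitions</code>, where the mask clears the sign bit so that negative hash codes don't produce negative partition numbers. For an <code>IntWritable</code> key, <code>hashCode()</code> is simply the int value itself. A sketch, with the source node ids from the example invocation:</p>

```java
public class HashPartitionSketch {
    // Same computation as Hadoop's default HashPartitioner; for an
    // IntWritable key, hashCode() is just the underlying int value.
    static int getPartition(int nodeId, int numPartitions) {
        return (nodeId & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 5;
        int[] nodes = { 9627181, 9370233, 10207721 };
        for (int id : nodes) {
            System.out.println(id + " -> partition " + getPartition(id, numPartitions));
        }
    }
}
```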
<p>After setting everything up, the user will iterate multi-source
personalized PageRank:</p>
<pre>
$ hadoop jar target/assignment4-1.0-SNAPSHOT-fatjar.jar edu.umd.jimmylin.RunPersonalizedPageRankBasic \
-base YOURNAME-PageRank -numNodes 1458 -start 0 -end 20 -sources 9627181,9370233,10207721
</pre>
<p>Note that the sources are passed in from the command line
again. Here, we're running twenty iterations.</p>
<p>Finally, the user runs a program to extract the top ten personalized
PageRank values, with respect to each source.</p>
<pre>
$ hadoop jar target/assignment4-1.0-SNAPSHOT-fatjar.jar edu.umd.jimmylin.ExtractTopPersonalizedPageRankNodes \
-input YOURNAME-PageRank/iter0020 -top 10 -sources 9627181,9370233,10207721
</pre>
<p>The above program should print the following answer to stdout:</p>
<pre>
Source: 9627181
0.43721 9627181
0.10006 8618855
0.09015 8980023
0.07705 12135350
0.07432 9562469
0.07432 10027417
0.01749 9547235
0.01607 9880043
0.01402 8070517
0.01310 11122341
Source: 9370233
0.42118 9370233
0.08627 11325345
0.08378 11778650
0.07160 10952022
0.07160 10767725
0.07160 8744402
0.03259 10611368
0.01716 12182886
0.01467 12541014
0.01467 11377835
Source: 10207721
0.38494 10207721
0.07981 11775232
0.07664 12787320
0.06565 12876259
0.06543 8642164
0.06543 10541592
0.02224 8669492
0.01963 10940674
0.01911 10867785
0.01815 9619639
</pre>
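<p>One common way to extract the top entries per source is a bounded min-heap: keep at most <i>k</i> (value, node) pairs, evicting the smallest whenever the heap overflows, then drain and print in descending order. A sketch with made-up values (the assignment uses <code>-top 10</code>; three entries keep the example short):</p>

```java
import java.util.*;

public class TopKSketch {
    public static void main(String[] args) {
        int k = 3;
        // Made-up (pagerank, nodeId) pairs for one source.
        double[] pr  = { 0.437, 0.100, 0.090, 0.077, 0.074 };
        int[] nodes  = { 11,    22,    33,    44,    55 };

        // Min-heap ordered by PageRank value; evict the smallest when over k.
        PriorityQueue<double[]> heap =
            new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[0]));
        for (int i = 0; i < pr.length; i++) {
            heap.add(new double[] { pr[i], nodes[i] });
            if (heap.size() > k) heap.poll();
        }

        // Drain and print in descending order of PageRank value.
        List<double[]> top = new ArrayList<>(heap);
        top.sort((a, b) -> Double.compare(b[0], a[0]));
        for (double[] e : top) {
            System.out.println(String.format("%.5f %d", e[0], (int) e[1]));
        }
    }
}
```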
<h4 style="padding-top: 10px">Additional Specifications</h4>
<p>To make the final output easier to read, in the
class <code>ExtractTopPersonalizedPageRankNodes</code>, use the
following format to print each (personalized PageRank value, node id)
pair:</p>
<pre>
String.format("%.5f %d", pagerank, nodeid)
</pre>
<p>This will generate the final results in the same format as
above. Also note: print actual probabilities, not log
probabilities—although during the actual PageRank computation
keeping values as log probabilities is better.</p>
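<p>If you do keep the mass as log probabilities during the computation, summing incoming contributions in a reducer means adding probabilities without leaving log space. A sketch of the standard stable log-add, which factors out the larger exponent (the sample values here are made up):</p>

```java
public class LogAddSketch {
    // Compute log(e^a + e^b) without leaving log space, stably, by
    // factoring out the larger of the two exponents.
    static double logAdd(double logA, double logB) {
        if (logA < logB) { double t = logA; logA = logB; logB = t; }
        return logA + Math.log1p(Math.exp(logB - logA));
    }

    public static void main(String[] args) {
        double logHalf = Math.log(0.5), logQuarter = Math.log(0.25);
        // 0.5 + 0.25 = 0.75; convert back to a probability only for display.
        System.out.printf("%.5f%n", Math.exp(logAdd(logHalf, logQuarter)));
    }
}
```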
<p>The final class <code>ExtractTopPersonalizedPageRankNodes</code>
does not need to be a MapReduce job (but it does need to read from
HDFS). Obviously, the other classes need to run MapReduce jobs.</p>
<p>The reference implementation of PageRank in Cloud<sup>9</sup> has
many options: you can either use in-mapper combining or
ordinary combiners. In your implementation, choose one or the
other. You do not need to implement both options. Also, the reference
implementation has an option to either use range partitioning or hash
partitioning: you only need to implement hash partitioning. You can
start with the reference implementation and remove code that you don't
need (see #2 below).</p>
<h4 style="padding-top: 10px">Hints and Suggestions</h4>
<p>To help you out, there's a small helper program in
Cloud<sup>9</sup> that computes personalized PageRank using a
sequential algorithm. Use it to check your answers:</p>
<pre>
$ hadoop jar target/assignment4-1.0-SNAPSHOT-fatjar.jar edu.umd.cloud9.example.pagerank.SequentialPersonalizedPageRank \
-input sample-large.txt -source 9627181
</pre>
<p>Note that this isn't actually a MapReduce job; we're simply using
Hadoop to run the <code>main</code> for convenience. The values from
your implementation should be pretty close to the output of the above
program, but might differ a bit due to convergence issues. After 20
iterations, the output of the MapReduce implementation should match to
at least the fourth decimal place.</p>
<p>This is a complex assignment. I would suggest breaking the
implementation into the following steps:</p>
<ol>
<li>First, copy the reference PageRank implementation into your own
assignments repo (renaming the classes appropriately). Make sure you
can get it to run and output the correct results with ordinary
PageRank.</li>
<li>Simplify the code; i.e., if you decide to use the in-mapper