## Assignment 2: Stop Words
Improve the previous program to calculate how many stop words are in the input dataset. The stop words list is in the ‘/datasets/stop_words_en.txt’ file. Use Hadoop counters to count the number of stop words and the total number of words in the dataset. The result is the percentage of stop words in the entire dataset (without the percent symbol).

**Hint.** As the Hadoop Streaming user guide says, "you will need to use "-file" option to tell the framework to pack your executable files as a part of job submission." More generally, you can attach any files to the job, not only executables, and then access them within your mappers and reducers as if they were located in the same directory.

For example, if you've attached the following files to the job:
```
...
-files mapper.py,reducer.py,/dir1/file1.txt,/dir2/file2 \
...
```
you can work with the attached files using relative paths:
```
# mapper.py

with open("file1.txt") as f1, open("file2") as f2:
 ...
```
Note that the following code:
```
# mapper.py

with open("/dir1/file1.txt") as f1, open("/dir2/file2") as f2:
 ...
 ```
will work within Jupyter or a Docker container because it has a single node that is simultaneously the client node, datanode and namenode. However, the code with absolute paths will fail on a real multi-node Hadoop cluster because "/dir1" and "/dir2" don't exist on the datanodes.
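The mapping from an attached path to the name a task opens can be sketched as below. This is a minimal illustration, assuming (as the hint describes) that each file passed to `-files` appears in the task's working directory under its base name; `attached_name` is a hypothetical helper, not part of Hadoop.

```python
import posixpath


def attached_name(path):
    """Return the relative name under which an attached file would be opened."""
    # Hypothetical helper: keeps only the last path component.
    return posixpath.basename(path)


# The value of the -files option from the example above.
files_option = "mapper.py,reducer.py,/dir1/file1.txt,/dir2/file2"
local_names = [attached_name(p) for p in files_option.split(",")]
print(local_names)  # ['mapper.py', 'reducer.py', 'file1.txt', 'file2']
```

So a mapper would open `file1.txt`, not `/dir1/file1.txt`.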


**Hint 2.** The solution must consist of a single Hadoop job, and the number of reducers should be either 0 or more than 1.

After the Hadoop job finishes, run an extra script to calculate the result. The script:

- reads the logs of the Hadoop command from stderr,
- extracts the values of the Hadoop counters for stop words and total words,
- outputs the percentage of stop words in the correct format to stdout.

You should also print the Hadoop logs (the input of the script) to stderr; this is required by the grading system.

The script can be written in Python, bash or whatever you like.
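The steps above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the required solution: the counter names `stop_words` and `total_words` and the `name=value` line shape are assumed to match what the job writes to its log, the helper names are hypothetical, and the sample text uses made-up values.

```python
import re


def read_counter(log_text, name):
    """Return the integer value of the first `name=value` entry, or None if absent."""
    match = re.search(r"\b" + re.escape(name) + r"=(\d+)", log_text)
    return int(match.group(1)) if match else None


def stop_word_percentage(log_text, stop_name="stop_words", total_name="total_words"):
    """Compute 100 * stop / total from two counters found in the log text."""
    stop = read_counter(log_text, stop_name)
    total = read_counter(log_text, total_name)
    if stop is None or not total:
        raise ValueError("counters not found in log, or total is zero")
    return 100.0 * stop / total


# Made-up counter values, for illustration only.
sample_log = "\t\tstop_words=25\n\t\ttotal_words=100\n"
print(stop_word_percentage(sample_log))  # 25.0
```

In an actual run the script would read the log from `sys.stdin` and also echo it to `sys.stderr`, since the grading system expects the logs there.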

In [1]:
%%writefile mapper.py

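import sys
import re

# Python 2 only: make utf-8 the default encoding so unicode() can decode input lines.
reload(sys)
sys.setdefaultencoding('utf-8')

# The stop words file is attached with -files, so it is opened by its relative name.
with open('stop_words_en.txt') as f:
    stop_words = set(f.read().split())

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError:
        # Skip lines that do not have the "<article_id>\t<text>" layout.
        continue
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    for word in words:
        # Each line written to stderr in this format increments a Hadoop counter.
        print >> sys.stderr, "reporter:counter:wiki,total_words,%d" % 1
        # The lookup is case-sensitive: the word is compared before lowercasing.
        if word in stop_words:
            print >> sys.stderr, "reporter:counter:wiki,stop_words,%d" % 1
        print "%s\t%d" % (word.lower(), 1)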

Writing mapper.py
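The counting logic of the mapper can be tried locally without Hadoop. The sketch below is written for Python 3 (the mapper itself targets Python 2) and differs from the mapper in one deliberate way: it tallies words in memory and emits one `reporter:counter` line per counter instead of one per word. That batching is an alternative shown for illustration, not the method used in this notebook; `count_words`, `counter_lines` and the sample input are hypothetical.

```python
import re
from collections import Counter


def count_words(lines, stop_words):
    """Tally total and stop words using the same split rule as the mapper."""
    tally = Counter()
    for line in lines:
        try:
            article_id, text = line.strip().split("\t", 1)
        except ValueError:
            continue  # line without "<article_id>\t<text>" layout
        for word in re.split(r"\W*\s+\W*", text):
            tally["total_words"] += 1
            if word in stop_words:
                tally["stop_words"] += 1
    return tally


def counter_lines(tally, group="wiki"):
    """Format one counter-update line per counter, in sorted name order."""
    return ["reporter:counter:%s,%s,%d" % (group, name, tally[name])
            for name in sorted(tally)]


# Made-up input: one well-formed article line and one malformed line.
sample = ["1\tthe cat sat on the mat", "malformed line without tab"]
tally = count_words(sample, {"the", "on"})
for line in counter_lines(tally):
    print(line)
# reporter:counter:wiki,stop_words,3
# reporter:counter:wiki,total_words,6
```

Inside a real mapper these lines would be written to stderr, where Hadoop Streaming interprets the last field as the amount to add to the counter.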


In [42]:
%%writefile get_stopword_percentage.py
import sys
import re

# Counter lines filtered from the Hadoop log arrive on stdin, e.g. "total_words=<n>".
log_text = sys.stdin.read()


def read_counter(name):
    # Return the value of the first "<name>=<digits>" entry found in the log.
    return int(re.search(r'\b%s=(\d+)' % name, log_text).group(1))


total_words = read_counter('total_words')
stop_words = read_counter('stop_words')

# Percentage of stop words, printed without the percent symbol.
print(stop_words / float(total_words) * 100)

Overwriting get_stopword_percentage.py


In [3]:
%%bash

OUT_DIR="wordcount_result_stopwords"
NUM_REDUCERS=0

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.job.name="Streaming stopwords" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,/datasets/stop_words_en.txt \
    -mapper "python2 mapper.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null 2> output.log

rm: `wordcount_result_stopwords': No such file or directory


In [43]:
%%bash

grep -E "_words=" output.log | python get_stopword_percentage.py
cat output.log >&2

38.44036900909957


18/04/13 06:21:48 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/04/13 06:21:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/04/13 06:21:51 INFO mapred.FileInputFormat: Total input files to process : 1
18/04/13 06:21:51 INFO mapreduce.JobSubmitter: number of splits:2
18/04/13 06:21:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523567791799_0001
18/04/13 06:21:51 INFO impl.YarnClientImpl: Submitted application application_1523567791799_0001
18/04/13 06:21:51 INFO mapreduce.Job: The url to track the job: http://8c786766d7b1:8088/proxy/application_1523567791799_0001/
18/04/13 06:21:51 INFO mapreduce.Job: Running job: job_1523567791799_0001
18/04/13 06:21:59 INFO mapreduce.Job: Job job_1523567791799_0001 running in uber mode : false
18/04/13 06:21:59 INFO mapreduce.Job:  map 0% reduce 0%
18/04/13 06:22:15 INFO mapreduce.Job:  map 37% reduce 0%
18/04/13 06:22:21 INFO mapreduce.Job:  map 55% reduce 0%
18/04/13 06:22:27 INFO 