## Assignment 4: Word Groups
Calculate statistics for groups of words that are equal up to a permutation of their letters (anagrams). For example, 'emit', 'item' and 'time' belong to the same group. Find such groups and sum the counts of all words in each group. Apply the stop-word filter, and discard groups that consist of only one word.

Output: the total number of occurrences of the words in the group, the number of unique words in the group, and a comma-separated list of the words in the group in lexicographical order:
```
sum <tab> group size <tab> word1,word2,...
```
Example: assume 'emit' occurred 3 times, 'item' 2 times and 'time' 5 times. Then 3 + 2 + 5 = 10 and the group contains 3 words, so the result for this group is:
```
10 3 emit,item,time
```
The answer to the task is the output line that contains the word 'english'.

***NB:*** *Do not forget about the lexicographical order of words in the group: 'emit,item,time' is OK, 'emit,time,item' is not.*
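
The grouping rule can be checked locally before any MapReduce job is written. The following is a minimal in-memory sketch in Python 3, not the distributed solution: the function name `group_anagrams` and the sample counts are illustrative assumptions (the counts mirror the example above, with one extra singleton word).

```python
from collections import defaultdict


def group_anagrams(word_counts):
    """Group words by their sorted letters and format the non-trivial groups.

    word_counts: dict mapping a word to its number of occurrences.
    Returns a list of "sum<TAB>group size<TAB>word1,word2,..." lines.
    """
    groups = defaultdict(dict)
    for word, count in word_counts.items():
        key = ''.join(sorted(word))  # words that are permutations share this key
        groups[key][word] = count

    lines = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) < 2:  # drop groups that consist of a single word
            continue
        lines.append('%d\t%d\t%s' % (sum(members.values()), len(members),
                                     ','.join(sorted(members))))
    return lines


# Hypothetical counts taken from the example above, plus one singleton word.
sample = {'emit': 3, 'item': 2, 'time': 5, 'hadoop': 7}
result = group_anagrams(sample)
print(result)
```

The sorted letters of a word serve as a canonical key, which is the same idea the streaming mapper relies on when it emits the key for the shuffle phase.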

In [56]:
%%writefile mapper.py

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8') # required to convert to unicode

with open('stop_words_en.txt') as f:
    stop_words = set(f.read().split())

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue
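    # Raw string keeps the regex escapes intact
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    for word in words:
        word = word.lower()
        # Skip empty tokens produced by the split, and stop words
        if not word or word in stop_words:
            continue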
        word_sorted = ''.join(sorted(word))
        print "%s\t%d\t%s" % (word_sorted, 1, word)

Overwriting mapper.py
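
To see which records the mapper emits for a given piece of text, the tokenization can be tried out without Hadoop. This is a small Python 3 sketch under assumptions: `map_line` is a hypothetical helper, the input strings and the one-word stop list are made up, and only the split pattern is taken from `mapper.py`.

```python
import re

# Same pattern as in mapper.py: a run of whitespace, together with any
# non-word characters directly attached to it, acts as the separator.
TOKEN_SPLIT = re.compile(r"\W*\s+\W*", flags=re.UNICODE)


def map_line(text, stop_words):
    """Return the (sorted_letters, 1, word) records emitted for one text."""
    records = []
    for word in TOKEN_SPLIT.split(text):
        word = word.lower()
        if not word or word in stop_words:  # skip empty tokens and stop words
            continue
        records.append((''.join(sorted(word)), 1, word))
    return records


# Hypothetical inputs; the real job reads "article_id<TAB>text" lines.
records = map_line("Time, the item (emit) ...", {'the'})
edge_case = map_line("item emit!", set())
print(records)
print(edge_case)
```

The second call illustrates a property of this pattern: punctuation is removed only when it is adjacent to whitespace, so a token at the very end of the text keeps its trailing punctuation (`emit!` gets the key `!eimt`).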


In [57]:
%%writefile reducer.py

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8') # required to convert to unicode

current_key = None
current_cnt = 0
words_set = set()

for line in sys.stdin:
    try:
        key, cnt, word = unicode(line.strip()).split('\t')
        cnt = int(cnt)
    except ValueError as e:
        continue
    
    if current_key != key:
        if current_key and (len(words_set) > 1):
            print "%d\t%d\t%s" % (current_cnt, len(words_set), ','.join(sorted(words_set)))
        current_key = key
        words_set = set()
        words_set.add(word)
        current_cnt = cnt
    else:
        words_set.add(word)
        current_cnt += cnt
        
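# Flush the last group, applying the same size filter as inside the loop
if current_key and len(words_set) > 1:
    print "%d\t%d\t%s" % (current_cnt, len(words_set), ','.join(sorted(words_set)))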

Overwriting reducer.py
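
The aggregation performed by the reducer can also be expressed with `itertools.groupby`, which is handy for testing the logic locally. This is a Python 3 sketch and not the streaming reducer itself: `reduce_records` and the sample records are hypothetical, and `sorted()` stands in for Hadoop's shuffle-and-sort phase.

```python
from itertools import groupby


def reduce_records(records):
    """Aggregate (key, count, word) records that are already sorted by key."""
    lines = []
    for _, group in groupby(records, key=lambda rec: rec[0]):
        total = 0
        words = set()
        for _, count, word in group:
            total += count
            words.add(word)
        if len(words) > 1:  # keep only groups with at least two distinct words
            lines.append('%d\t%d\t%s' % (total, len(words),
                                         ','.join(sorted(words))))
    return lines


# Hypothetical mapper output in arbitrary order.
mapper_output = [
    ('eimt', 1, 'time'), ('adhoop', 1, 'hadoop'), ('eimt', 1, 'item'),
    ('eimt', 1, 'time'), ('eimt', 1, 'emit'),
]
reduced = reduce_records(sorted(mapper_output))
print(reduced)
```

With `groupby`, every group, including the last one, goes through the same size check, so no separate handling is needed after the loop.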


In [58]:
%%bash

OUT_DIR="word_groups"
NUM_REDUCERS=8

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming word groups" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py,/datasets/stop_words_en.txt \
    -mapper "python2 mapper.py" \
    -reducer "python2 reducer.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null
    
hdfs dfs -cat ${OUT_DIR}/* | grep -P '(,|\t)english($|,)'

7820	5	english,helsing,hesling,shengli,shingle
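
The `grep` pattern selects lines where 'english' appears as a complete entry of the word list, not as part of a longer word. The Python 3 sketch below applies the same regular expression to the reported line and to one made-up line; the second line is a hypothetical non-match and does not come from the job output.

```python
import re

# Same pattern as the grep -P call: 'english' must be preceded by a tab or a
# comma and followed by a comma or the end of the line.
WORD_IN_GROUP = re.compile(r'(,|\t)english($|,)')

lines = [
    '7820\t5\tenglish,helsing,hesling,shengli,shingle',  # line reported above
    '3\t2\tabenglish,englishab',                          # hypothetical non-match
]
matches = [line for line in lines if WORD_IN_GROUP.search(line)]
print(matches)
```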


18/04/13 14:05:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/04/13 14:05:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/04/13 14:05:28 INFO mapred.FileInputFormat: Total input files to process : 1
18/04/13 14:05:29 INFO mapreduce.JobSubmitter: number of splits:2
18/04/13 14:05:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523569920442_0012
18/04/13 14:05:29 INFO impl.YarnClientImpl: Submitted application application_1523569920442_0012
18/04/13 14:05:29 INFO mapreduce.Job: The url to track the job: http://509cd4f42d8c:8088/proxy/application_1523569920442_0012/
18/04/13 14:05:29 INFO mapreduce.Job: Running job: job_1523569920442_0012
18/04/13 14:05:35 INFO mapreduce.Job: Job job_1523569920442_0012 running in uber mode : false
18/04/13 14:05:35 INFO mapreduce.Job:  map 0% reduce 0%
18/04/13 14:05:51 INFO mapreduce.Job:  map 48% reduce 0%
18/04/13 14:05:57 INFO mapreduce.Job:  map 67% reduce 0%
18/04/13 14:05:59 INFO 