### Word Groups

#### Task description
Calculate statistics for groups of words that are equal up to a permutation of their letters (anagrams). For example, ‘emit’, ‘item’ and ‘time’ are the same word up to a permutation of letters. Find all such groups and, for each group, sum the occurrence counts of its words. Apply the stop-words filter, and filter out groups that consist of only one word.

Output: the total count of occurrences for the group, the number of unique words in the group, and a comma-separated list of the group's words in lexicographical order, with the three fields separated by tabs:

sum < tab > group size < tab > word1,word2,...

Example: assume ‘emit’ occurred 3 times, ‘item’ 2 times, and ‘time’ 5 times. Then 3 + 2 + 5 = 10 and the group contains 3 words, so the result for this group is:

10 3 emit,item,time

The result of the task is the output line that contains the word ‘english’.
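
Before moving to MapReduce, the grouping rule itself can be checked in memory. The following is a minimal sketch, not the notebook's solution: `group_anagrams` and `example_counts` are hypothetical names, the counts come from the example above, and ‘hadoop’ is added only to show a one-word group being dropped. It is written to run under either Python 2 or Python 3.

```python
from collections import defaultdict


def group_anagrams(word_counts):
    """Group words by their sorted letters and format the task's output lines."""
    groups = defaultdict(dict)
    for word, count in word_counts.items():
        key = ''.join(sorted(word))  # 'emit', 'item', 'time' -> 'eimt'
        groups[key][word] = count

    lines = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) < 2:  # drop groups that consist of only one word
            continue
        total = sum(members.values())
        lines.append("%d\t%d\t%s" % (total, len(members), ','.join(sorted(members))))
    return lines


# Hypothetical counts: the three words from the task example,
# plus one word that has no anagram partner.
example_counts = {'emit': 3, 'item': 2, 'time': 5, 'hadoop': 7}
result = group_anagrams(example_counts)
print(result[0])
```

The sorted-letter string serves as a canonical key: two words are permutations of each other exactly when their keys are equal, which is what lets the streaming job below group them with an ordinary shuffle.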

In [1]:
%%writefile mapper.py

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8') # required to convert to unicode

path_to_file='stop_words_en.txt'

with open(path_to_file) as stop_words_file:
    content = stop_words_file.readlines()
    stop_words = set(l.strip().lower() for l in content)

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    for word in words:
        word = word.lower()
        
        if word in stop_words:
            continue
            
        # Skip empty tokens and single characters: neither can form a group.
        if len(word) > 1:
            # Anagrams share the same sorted-letter key, e.g. 'eimt'.
            key = ''.join(sorted(word))
            print "%s\t%s\t%d" % (key, word, 1)

Overwriting mapper.py
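
The mapper's tokenization and key construction can be tried locally without Hadoop. The sketch below roughly mirrors the mapper's logic as a generator. `map_text`, the stop-word set, and the input string are hypothetical and chosen for illustration. Unlike the mapper, it takes the stop words as an argument instead of reading `stop_words_en.txt`.

```python
import re

# Same split pattern as in mapper.py: whitespace plus adjacent non-word characters.
TOKEN_SPLIT = re.compile(r"\W*\s+\W*", flags=re.UNICODE)


def map_text(text, stop_words):
    """Yield (key, word, 1) records similar to the lines the mapper prints."""
    for word in TOKEN_SPLIT.split(text):
        word = word.lower()
        # Assumption of this sketch: empty and one-character tokens are skipped.
        if word in stop_words or len(word) < 2:
            continue
        yield (''.join(sorted(word)), word, 1)


# Hypothetical stop-word set and input text, for illustration only.
records = list(map_text('The rates".In 2008, (time) item', {'the'}))
for key, word, count in records:
    print("%s\t%s\t%d" % (key, word, count))
```

Because the pattern only splits where whitespace occurs, punctuation that sits inside a token with no surrounding whitespace is kept. This is why a token such as `rates".in` survives as a single word here, and it matches the kind of entries visible in the job output further down.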


In [2]:
%%writefile reducer.py

import sys

current_key = None
word_sum = 0
current_words = set()

for line in sys.stdin:
    try:
        key, origin, count = line.strip().split('\t', 2)
        count = int(count)
    except ValueError:
        continue  # skip malformed records
    if current_key != key:
        # Key changed: flush the finished group if it has more than one word.
        if current_key and len(current_words) > 1:
            print "%d\t%d\t%s" % (word_sum, len(current_words), ','.join(sorted(current_words)))
        word_sum = 0
        current_key = key
        current_words = set()
    word_sum += count
    current_words.add(origin)

# Flush the last group, which no key change follows.
if current_key and len(current_words) > 1:
    print "%d\t%d\t%s" % (word_sum, len(current_words), ','.join(sorted(current_words)))

Overwriting reducer.py
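
The reducer relies on Hadoop Streaming delivering all records with the same key contiguously, so a group ends when the key changes. The same idea can be expressed with `itertools.groupby`, shown here as an alternative sketch rather than the notebook's reducer. `reduce_sorted` and the sample records are hypothetical, and the malformed-line handling of `reducer.py` is omitted for brevity.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Collapse sorted 'key<TAB>word<TAB>count' lines into one line per group."""
    records = (line.rstrip('\n').split('\t', 2) for line in lines)
    # groupby only merges adjacent records, so the input must be sorted by key.
    for key, group in groupby(records, key=lambda rec: rec[0]):
        total = 0
        words = set()
        for _, word, count in group:
            total += int(count)
            words.add(word)
        if len(words) > 1:  # keep only groups with at least two distinct words
            yield "%d\t%d\t%s" % (total, len(words), ','.join(sorted(words)))


# Hypothetical mapper output, already sorted by key as the shuffle would deliver it.
sample = [
    "adhoop\thadoop\t1\n",
    "eimt\temit\t1\n",
    "eimt\titem\t1\n",
    "eimt\ttime\t1\n",
    "eimt\ttime\t1\n",
]
output = list(reduce_sorted(sample))
print(output[0])
```

For a local dry run, the two scripts can be chained with a `sort` between them to imitate the shuffle, assuming a small sample of the input is available on the local file system.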


In [3]:
%%bash

OUT_DIR="wordcount_result_lkv"
NUM_REDUCERS=8

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.job.name="Streaming word groups" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py,/datasets/stop_words_en.txt \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null

hdfs dfs -cat ${OUT_DIR}/part-00005 | head

2	2	rates".in,rates."in
2	2	world'.in,world.'in
3	2	sahara's,shaara's
5	2	alban's,nabal's
3	2	adal's,alda's
20	3	ada's,as'ad,sa'ad
3	2	artisan's,sartain's
3	3	abel's,bale's,beal's
43	2	brain's,brian's
8	2	castel's,castle's
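
Since the task asks for the single output line containing a given word, the output has to be searched by the third field. The following is a minimal sketch of that lookup, assuming the three-field tab-separated format shown above. `find_group` is a hypothetical helper, and the sample lines are copied from the listing above rather than from the line with ‘english’.

```python
def find_group(lines, target):
    """Return the first output line whose word list contains `target`, else None."""
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        # Match whole words in the comma-separated list, not substrings.
        if len(fields) == 3 and target in fields[2].split(','):
            return line.rstrip('\n')
    return None


# Two lines taken from the sample output above.
sample_output = [
    "43\t2\tbrain's,brian's\n",
    "8\t2\tcastel's,castle's\n",
]
match = find_group(sample_output, "castle's")
print(match)
```

Splitting the word list before comparing avoids false hits where the target is only a substring of a longer word in another group.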


18/01/15 22:25:30 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/01/15 22:25:30 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/01/15 22:25:31 INFO mapred.FileInputFormat: Total input files to process : 1
18/01/15 22:25:31 INFO mapreduce.JobSubmitter: number of splits:2
18/01/15 22:25:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516011326149_0011
18/01/15 22:25:31 INFO impl.YarnClientImpl: Submitted application application_1516011326149_0011
18/01/15 22:25:31 INFO mapreduce.Job: The url to track the job: http://a0872c38c6f3:8088/proxy/application_1516011326149_0011/
18/01/15 22:25:31 INFO mapreduce.Job: Running job: job_1516011326149_0011
18/01/15 22:25:37 INFO mapreduce.Job: Job job_1516011326149_0011 running in uber mode : false
18/01/15 22:25:37 INFO mapreduce.Job:  map 0% reduce 0%
18/01/15 22:25:53 INFO mapreduce.Job:  map 41% reduce 0%
18/01/15 22:25:59 INFO mapreduce.Job:  map 61% reduce 0%
18/01/15 22:26:04 INFO 