# Task


Calculate statistics for groups of words that are equal up to a permutation of their letters (anagrams). For example, 'emit', 'item' and 'time' are the same word up to a permutation of letters. Find all such groups and sum the occurrence counts of their words. Apply the stop-word filter, and filter out groups that consist of only one word.

Output: count of occurrences for the group of words, number of unique words in the group, comma-separated list of the words in the group in lexicographical order:

sum <tab> group size <tab> word1,word2,...

Example: assume 'emit' occurred 3 times, 'item' 2 times and 'time' 5 times. Since 3 + 2 + 5 = 10 and the group contains 3 words, the result for this group is:

10 3 emit,item,time

The result of the task is the output line that contains the word 'english'.

NB: Do not forget about the lexicographical order of words in the group: 'emit,item,time' is OK, 'emit,time,item' is not.
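
The grouping rule above can be tried on a small in-memory example before moving to MapReduce. The following is a minimal sketch, assuming the word counts are already available as a dictionary. The function name `anagram_groups` and the sample counts are hypothetical and only mirror the example in the task; the code is written for Python 3.

```python
from collections import defaultdict


def anagram_groups(word_counts):
    """Group words by their sorted letters and format the task's output lines.

    word_counts: dict mapping word -> number of occurrences (hypothetical input).
    Returns a list of "sum<TAB>group size<TAB>word1,word2,..." strings.
    """
    groups = defaultdict(dict)
    for word, count in word_counts.items():
        key = "".join(sorted(word))  # anagrams share this key
        groups[key][word] = count

    lines = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) < 2:
            continue  # groups with a single word are filtered out
        words = ",".join(sorted(members))  # lexicographical order
        lines.append("%d\t%d\t%s" % (sum(members.values()), len(members), words))
    return lines


# Counts from the example in the task description, plus one singleton word.
sample = {"emit": 3, "item": 2, "time": 5, "hadoop": 7}
result = anagram_groups(sample)
print(result)
```

The sorted-letter string serves as the group key here; the mapper below uses the same key so that Hadoop's shuffle brings anagrams to the same reducer.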



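Note from Kiril Cvetkov, 10 May 2018: we cannot use a combiner here, because the record format does not stay consistent across mapper -> combiner -> reducer. The reducer's output format differs from its input format, so the reducer cannot be reused as a combiner.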
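
One way to see the constraint in the note: a combiner's output is fed to the reducer, so it has to keep the mapper's record format (`key<TAB>word<TAB>count`). The reducer in this notebook emits `sum<TAB>group size<TAB>words` instead, and its own parser would reject such a line. A minimal sketch, assuming the same parsing rule as reducer.py; the helper `parse_record` is hypothetical, and the code is written for Python 3.

```python
def parse_record(line):
    """Parse a 'key<TAB>word<TAB>count' record; return None if malformed."""
    try:
        key, word, count = line.strip().split("\t", 2)
        return key, word, int(count)
    except ValueError:
        return None


mapper_line = "eimt\ttime\t1"            # shape of a mapper.py record
reducer_line = "10\t3\temit,item,time"   # shape of a reducer.py record

parsed_mapper = parse_record(mapper_line)
parsed_reducer = parse_record(reducer_line)
print(parsed_mapper)   # a (key, word, count) tuple
print(parsed_reducer)  # None: the third field is not an integer
```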


In [437]:
%%writefile mapper.py

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8') # required to convert to unicode

with open('stop_words_en.txt') as f:
    stop_words = set(f.read().split())
    
for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError:
        # skip malformed lines that have no tab-separated text part
        continue
    # split on whitespace, dropping punctuation adjacent to it
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    for word in words:
        word = word.lower()
        # skip empty tokens and stop words
        if not word or word in stop_words:
            continue
        # anagrams share the same sorted-letter key
        key = ''.join(sorted(word))
        print "%s\t%s\t%d" % (key, word, 1)

Overwriting mapper.py
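
To check the mapper logic without a cluster, the per-line work can be pulled into a function and run on a made-up line. This is a sketch for Python 3, not the script above: `map_line` and the sample input are hypothetical. It also skips empty tokens, which the split can produce when the text starts with whitespace or punctuation.

```python
import re


def map_line(line, stop_words):
    """Return the (key, word, 1) records emitted for one input line."""
    try:
        article_id, text = line.strip().split("\t", 1)
    except ValueError:
        return []  # no tab in the line: it is skipped
    records = []
    for word in re.split(r"\W*\s+\W*", text):
        word = word.lower()
        if not word or word in stop_words:
            continue
        records.append(("".join(sorted(word)), word, 1))
    return records


# Hypothetical input: an article id, a tab, then the article text.
records = map_line("12\tTime to emit, then item", {"to", "then"})
for record in records:
    print("%s\t%s\t%d" % record)
```

All three remaining words map to the key `eimt`, so they would reach the same reducer.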


In [438]:
%%writefile reducer.py

import sys

current_key = None
word_sum = 0
words = set()

for line in sys.stdin:
    try:
        key, word, count = line.strip().split('\t', 2)
        count = int(count)
    except ValueError:
        # skip malformed records
        continue
    if current_key != key:
        # key changed: emit the finished group if it has more than one word
        if current_key and len(words) > 1:
            print "%d\t%d\t%s" % (word_sum, len(words), ','.join(sorted(words)))
        word_sum = 0
        current_key = key
        words = set()
    word_sum += count
    words.add(word)

# emit the last group
if current_key and len(words) > 1:
    print "%d\t%d\t%s" % (word_sum, len(words), ','.join(sorted(words)))


Overwriting reducer.py
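
The reducer relies on its input arriving sorted by key, so each group can be closed as soon as the key changes. The same aggregation can be sketched with `itertools.groupby`, with a plain `sorted` call standing in for Hadoop's shuffle. The function `reduce_sorted` and the sample records are hypothetical, malformed lines are assumed not to occur, and the code is written for Python 3.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Aggregate 'key<TAB>word<TAB>count' lines that are already sorted by key."""
    records = (line.rstrip("\n").split("\t", 2) for line in lines)
    output = []
    for key, group in groupby(records, key=lambda rec: rec[0]):
        total = 0
        words = set()
        for _, word, count in group:
            total += int(count)
            words.add(word)
        if len(words) > 1:  # single-word groups are dropped
            output.append("%d\t%d\t%s" % (total, len(words), ",".join(sorted(words))))
    return output


# Hypothetical mapper output; sorting stands in for the shuffle phase.
mapper_output = ["eimt\ttime\t1", "adhoop\thadoop\t1", "eimt\temit\t1",
                 "eimt\ttime\t1", "eimt\titem\t1"]
reduced = reduce_sorted(sorted(mapper_output))
print(reduced)
```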


In [None]:
%%bash

OUT_DIR="wordgroup_result_"$(date +"%s%6N")
NUM_REDUCERS=8

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming wordgroup" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py,/datasets/stop_words_en.txt \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null
    
hdfs dfs -cat ${OUT_DIR}/* | grep -P '(,|\t)english($|,)'
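
The grep pattern keeps only lines where 'english' is a complete entry of the word list, that is, preceded by a tab or a comma and followed by a comma or the end of the line. The same check can be sketched in Python 3 to test the pattern offline. The sample lines are made up to exercise the pattern; they are not job output, and their words are not all real anagram groups.

```python
import re

# Same pattern as the grep call: 'english' preceded by a tab or comma
# and followed by a comma or the end of the line.
ENGLISH_ENTRY = re.compile(r"(,|\t)english($|,)")


def has_english(line):
    """Return True if 'english' appears as a whole entry of the word list."""
    return ENGLISH_ENTRY.search(line) is not None


# Made-up lines for testing the pattern only.
candidates = [
    "5\t2\tenglish,shingle",
    "7\t2\tenglishman,xenglish",
    "3\t2\tabc,english",
]
matches = [line for line in candidates if has_english(line)]
print(matches)
```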
