Calculate statistics for groups of words that are equal up to a permutation of their letters (anagrams). For example, ‘emit’, ‘item’ and ‘time’ are the same word up to a permutation of letters. Find such groups and, for each group, sum the occurrence counts of all its words. Apply the stop-word filter, and drop groups that consist of only one word.
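
One way to decide whether two words belong to the same group is to compare their letters in sorted order. The sketch below is plain Python that runs locally without Hadoop; the helper name `anagram_key` is hypothetical and is not part of the assignment.

```python
def anagram_key(word):
    """Return the lowercased letters of `word` in sorted order.

    Two words are permutations of each other exactly when their keys match.
    """
    return ''.join(sorted(word.lower()))


# 'emit', 'item' and 'time' share one key; 'team' gets a different one.
keys = {w: anagram_key(w) for w in ['emit', 'item', 'time', 'team']}
print(keys)
```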

Output: count of occurrences for the group of words, number of unique words in the group, comma-separated list of the words in the group in lexicographical order:

sum <tab> group size <tab> word1,word2,...

Example: assume ‘emit’ occurred 3 times, ‘item’ 2 times and ‘time’ 5 times. Then 3 + 2 + 5 = 10 and the group contains 3 words, so the result for this group is:

10 3 emit,item,time
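
The arithmetic of this example can be reproduced with a few lines of plain Python. This is only a sketch: `format_group` is a hypothetical helper, and the counts are the ones assumed in the example, not real data.

```python
def format_group(counts):
    """Build 'sum<TAB>group size<TAB>word1,word2,...' from a word -> count dict."""
    total = sum(counts.values())   # 3 + 2 + 5
    words = sorted(counts)         # lexicographical order of the unique words
    return '%d\t%d\t%s' % (total, len(words), ','.join(words))


line = format_group({'time': 5, 'emit': 3, 'item': 2})
print(line)
```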

The result of the task is the output line that contains the word ‘english’.

The result on the sample dataset:


NB: Do not forget about the lexicographical order of words in the group: 'emit,item,time' is OK, 'emit,time,item' is not.
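
A small local check can catch ordering mistakes before submitting. The sketch below assumes the three-field format described above (sum, group size, words); `is_valid_group_line` is a hypothetical helper and not part of the assignment.

```python
def is_valid_group_line(line):
    """Check one 'sum<TAB>size<TAB>w1,w2,...' line for internal consistency."""
    total, size, joined = line.split('\t')
    words = joined.split(',')
    # All words must consist of the same letters (one shared sorted-letter key).
    same_letters = len({''.join(sorted(w)) for w in words}) == 1
    return (int(size) > 1                     # single-word groups are filtered out
            and int(size) == len(words)       # size matches the word list
            and len(set(words)) == len(words) # words are unique
            and int(total) >= int(size)       # every word occurs at least once
            and same_letters
            and words == sorted(words))       # lexicographical order


print(is_valid_group_line('10\t3\temit,item,time'))
print(is_valid_group_line('10\t3\temit,time,item'))
```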

In [1]:
%%writefile mapper.py


import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8')  # required to convert to unicode

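path = 'stop_words_en.txt'

# Use a set of unicode strings: membership tests are O(1), whereas a list
# would be scanned once for every word of the input.
with open(path) as h:
    STOP_WORDS = set(unicode(l).strip().lower() for l in h)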


for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue

    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)

    # Emit (sorted letters, word) so that all anagrams share the same key.
    for word in words:
        wordl = word.lower()
        if not wordl or wordl in STOP_WORDS:
            continue
        key = ''.join(sorted(wordl))
        print "%s\t%s" % (key, wordl)

Overwriting mapper.py
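
The mapper logic can be tried locally before submitting a job. The sketch below wraps the same steps in a function; the name `map_line`, the sample input line and the tiny stop-word set are assumptions made for illustration. It also skips empty tokens, which is an addition of this sketch.

```python
import re


def map_line(line, stop_words):
    """Yield (key, word) pairs for one 'article_id<TAB>text' input line."""
    try:
        _article_id, text = line.strip().split('\t', 1)
    except ValueError:
        return  # malformed line without a tab separator
    for word in re.split(r"\W*\s+\W*", text, flags=re.UNICODE):
        word = word.lower()
        if not word or word in stop_words:
            continue
        # Sorted letters serve as the grouping key for anagrams.
        yield ''.join(sorted(word)), word


pairs = list(map_line("12\tTime to emit the item", {'to', 'the'}))
for key, word in pairs:
    print("%s\t%s" % (key, word))
```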


In [6]:
%%writefile reducer.py

import sys

# State of the group currently being accumulated (input arrives sorted by key).
def reset_count(key):
    global current_key, current_count, current_words
    current_key = key
    current_count = 0
    current_words = set()

reset_count(None)

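def commit_current():
    # Emit only groups with more than one distinct word. The key is printed as
    # an extra second column, so each line is: count, key, group size, words.
    if current_key and len(current_words) > 1:
        print("%s\t%s\t%s\t%s" % (
            current_count,
            current_key,
            len(current_words),
            ','.join(sorted(current_words))
        ))
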
for line in sys.stdin:
    try:
        key, word = line.strip().split('\t', 1)
    except ValueError:
        # Skip malformed lines that have no tab separator.
        continue
    if current_key != key:
        commit_current()
        reset_count(key)
    current_count += 1
    current_words.add(word)

commit_current()

Overwriting reducer.py
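
The shuffle and reduce steps can be approximated locally by sorting the mapper pairs and grouping them by key. This is a sketch under assumptions, not the reducer itself: `reduce_pairs` is a hypothetical helper, the input pairs are made up, and `itertools.groupby` stands in for the grouping that Hadoop performs between map and reduce.

```python
from itertools import groupby


def reduce_pairs(pairs):
    """Group (key, word) pairs by key and format every multi-word group."""
    lines = []
    # Sorting mimics the shuffle phase: equal keys become adjacent.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        words = [word for _key, word in group]
        unique = sorted(set(words))
        if len(unique) > 1:  # drop groups that consist of only one word
            lines.append('%d\t%s\t%d\t%s' % (
                len(words), key, len(unique), ','.join(unique)))
    return lines


# Made-up counts: 'time' x5, 'emit' x3, 'item' x2, plus a single-word group.
pairs = ([('eimt', 'time')] * 5 + [('eimt', 'emit')] * 3 +
         [('eimt', 'item')] * 2 + [('acr', 'car')] * 4)
lines = reduce_pairs(pairs)
print('\n'.join(lines))
```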


In [7]:
%%bash

hdfs dfs -rm -r -skipTrash wordgroups > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="word groups" \
    -D mapreduce.job.reduces=8 \
    -files mapper.py,reducer.py,/datasets/stop_words_en.txt \
    -mapper "python2 mapper.py" \
    -reducer "python2 reducer.py" \
    -input /data/wiki/en_articles_part \
    -output wordgroups > /dev/null

hdfs dfs -cat wordgroups/* | grep $'\teghilns\t'

7820	eghilns	5	english,helsing,hesling,shengli,shingle


rm: `wordgroups': No such file or directory
19/06/22 12:22:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/06/22 12:22:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/06/22 12:22:06 INFO mapred.FileInputFormat: Total input files to process : 1
19/06/22 12:22:06 INFO mapreduce.JobSubmitter: number of splits:2
19/06/22 12:22:06 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1561045447920_0041
19/06/22 12:22:06 INFO impl.YarnClientImpl: Submitted application application_1561045447920_0041
19/06/22 12:22:06 INFO mapreduce.Job: The url to track the job: http://0de65cec5c1c:8088/proxy/application_1561045447920_0041/
19/06/22 12:22:06 INFO mapreduce.Job: Running job: job_1561045447920_0041
19/06/22 12:22:11 INFO mapreduce.Job: Job job_1561045447920_0041 running in uber mode : false
19/06/22 12:22:11 INFO mapreduce.Job:  map 0% reduce 0%
19/06/22 12:22:27 INFO mapreduce.Job:  map 6% reduce 0%
19/06/22 12:22:33 INFO mapreduce.Job: