# Hadoop Streaming assignment 4: Word Groups

Calculate statistics for groups of words that are equal up to a permutation of their letters (anagrams). For example, ‘emit’, ‘item’ and ‘time’ form one such group. Find these groups and sum the counts of all words in each group. Apply the stop-word filter, and filter out groups that consist of only one word.

Output: count of occurrences for the group of words, number of unique words in the group, comma-separated list of the words in the group in lexicographical order:

    sum <tab> group size <tab> word1,word2,...

Example: assume ‘emit’ occurred 3 times, ‘item’ 2 times and ‘time’ 5 times. Then 3 + 2 + 5 = 10 and the group contains 3 words, so the result for this group is:

    10 3 emit,item,time

The answer to the task is the output line that contains the word ‘english’.

The result on the sample dataset:

    7823    eghilns 5   english,helsing,hesling,shengli,shingle

**NB**: Do not forget about the lexicographical order of words in the group: 'emit,item,time' is OK, 'emit,time,item' is not.
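
As a quick sanity check of the specification above, the grouping can be sketched in plain Python without Hadoop. The helper name `group_anagrams` and the in-memory `counts` dictionary are illustrative assumptions, not part of the assignment; the sketch only mirrors the `sum <tab> group size <tab> words` format described earlier.

```python
from collections import defaultdict


def group_anagrams(counts):
    """Group words that are permutations of each other.

    `counts` maps word -> number of occurrences (hypothetical input).
    Returns output lines in the "sum<TAB>size<TAB>words" format.
    """
    groups = defaultdict(list)
    for word in counts:
        # Words with the same sorted letters belong to the same group.
        groups["".join(sorted(word))].append(word)

    lines = []
    for key in sorted(groups):
        words = sorted(groups[key])  # lexicographical order inside the group
        if len(words) < 2:
            continue  # groups of a single word are filtered out
        total = sum(counts[w] for w in words)
        lines.append("{}\t{}\t{}".format(total, len(words), ",".join(words)))
    return lines


example = {"emit": 3, "item": 2, "time": 5, "hadoop": 7}
print("\n".join(group_anagrams(example)))
```

With the counts from the example, the only line produced is the one for the ‘emit’, ‘item’, ‘time’ group; ‘hadoop’ has no anagram partner and is dropped.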

If you want to deploy the environment on your own machine, please use the [bigdatateam/yarn-notebook](https://hub.docker.com/r/bigdatateam/yarn-notebook/) Docker container.

## Step 1. Create first mapper and reducer

Word counter

In [171]:
%%writefile mapper1.py

import sys
import re

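path = "stop_words_en.txt"

# A set gives constant-time membership checks for the stop-word filter.
with open(path, "r") as file:
    stop_words = set(file.read().splitlines())

for line in sys.stdin:
    try:
        article_id, text = line.strip().split('\t', 1)
    except ValueError:
        # Skip malformed lines that have no tab-separated text.
        continue

    words = re.split(r'\W*\s+\W*', text.strip())

    for word in words:
        # Lowercase before filtering so that capitalized stop words are
        # removed too (assumes the stop-word list itself is lowercase).
        word = word.lower()
        if word.isalpha() and word not in stop_words:
            print("{}\t{}".format(word, 1))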

Overwriting mapper1.py
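
The tokenization step can be tried locally before submitting the job. The following is a minimal sketch, assuming a hypothetical helper `tokenize` and a tiny hand-written stop-word set; it reuses the splitting pattern from `mapper1.py` and lowercases each token before the stop-word check.

```python
import re

# Same splitting pattern as in mapper1.py, written as a raw string.
SPLIT_RE = re.compile(r'\W*\s+\W*')


def tokenize(text, stop_words):
    """Return lowercased alphabetic tokens that are not stop words."""
    tokens = []
    for word in SPLIT_RE.split(text.strip()):
        word = word.lower()
        if word.isalpha() and word not in stop_words:
            tokens.append(word)
    return tokens


# Hypothetical stop-word set, used only for this illustration.
stop_words = {"the", "a", "of"}
print(tokenize("The time of an item, emitted (twice) today", stop_words))
```

Punctuation is only stripped when it is adjacent to whitespace, so a token such as `well-known` stays in one piece and is then rejected by `isalpha()`.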


In [172]:
%%writefile reducer1.py

import sys

current_key = None
word_total = 0

# Input lines arrive sorted by key, so all counts for one word are adjacent.
for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError:
        # Skip malformed lines.
        continue

    if current_key != key:
        # A new key starts: flush the total for the previous one.
        if current_key:
            print("{}\t{:d}".format(current_key, word_total))

        current_key = key
        word_total = 0

    word_total += count

# Flush the last key.
if current_key:
    print("{}\t{:d}".format(current_key, word_total))

Overwriting reducer1.py
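
The same key-change logic can also be expressed with `itertools.groupby`, which relies on the input being sorted by key just as the reducer does. This is a sketch of an alternative formulation, not the notebook's reducer; the function name `reduce_counts` and the sample lines are assumptions made for the illustration.

```python
from itertools import groupby


def reduce_counts(lines):
    """Sum counts per key for tab-separated "key<TAB>count" lines.

    The input must already be sorted by key, which is what a streaming
    reducer receives after the shuffle phase.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(int(count) for _, count in group)


sorted_input = ["item\t1\n", "item\t1\n", "time\t2\n", "time\t3\n"]
for key, total in reduce_counts(sorted_input):
    print("{}\t{}".format(key, total))
```

Because the function only adds partial counts, it accepts both raw `1` values from the mapper and partial sums, which is the property that lets a summing reducer double as a combiner.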


## Step 2. Create second mapper and reducer

Aggregate by sorted letters

In [173]:
%%writefile mapper2.py
import sys

# Each input line is "word<TAB>count" from the first job, where every word
# appears once, so the mapper only has to attach the grouping key.
for line in sys.stdin:
    try:
        word, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError as e:
        # Report malformed lines on stderr so they do not pollute the output.
        sys.stderr.write("{}\n".format(e))
        continue

    # Anagrams share the same sorted-letter key.
    sorted_key = "".join(sorted(word))
    print("{}\t{}\t{}".format(sorted_key, word, count))

Overwriting mapper2.py


In [174]:
%%writefile reducer2.py
import sys

current_key = None
word_total = 0
word_set   = set()

def emit(key, total, words):
    # Groups that consist of a single word are filtered out.
    if len(words) > 1:
        print("{}\t{} {}\t{}".format(total, key, len(words), ",".join(sorted(words))))

for line in sys.stdin:
    try:
        sorted_word, word, count = line.strip().split('\t', 2)
        count = int(count)
    except ValueError as e:
        # Report malformed lines on stderr so they do not pollute the output.
        sys.stderr.write("{}\n".format(e))
        continue

    if current_key != sorted_word:
        # A new group starts: flush the previous one.
        if current_key:
            emit(current_key, word_total, word_set)

        current_key = sorted_word
        word_set = set()
        word_total = 0

    word_total += count
    word_set.add(word)

# Flush the last group.
if current_key:
    emit(current_key, word_total, word_set)

Overwriting reducer2.py
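
To check the second stage without a cluster, the grouping can be run on a few hand-made lines. The sketch below assumes a hypothetical function `reduce_groups` and sample data; it keeps the `sum <tab> key size <tab> words` layout printed by `reducer2.py` and follows the assignment text in dropping single-word groups.

```python
from itertools import groupby


def reduce_groups(lines):
    """Merge "key<TAB>word<TAB>count" lines, sorted by key, into groups."""
    rows = (line.rstrip("\n").split("\t", 2) for line in lines)
    for key, group in groupby(rows, key=lambda row: row[0]):
        total = 0
        words = set()
        for _, word, count in group:
            total += int(count)
            words.add(word)
        if len(words) > 1:  # single-word groups are dropped
            yield "{}\t{} {}\t{}".format(
                total, key, len(words), ",".join(sorted(words)))


# Hypothetical mapper2 output, already sorted by the sorted-letter key.
sample = [
    "adhoop\thadoop\t7\n",
    "eimt\ttime\t5\n",
    "eimt\temit\t3\n",
    "eimt\titem\t2\n",
]
for out in reduce_groups(sample):
    print(out)
```

Only the key has to be sorted in the input; the words inside a group may arrive in any order because they are sorted again when the line is formatted.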


## Step 3. Bash commands

**Hint**: To print the exact row, you may use basic UNIX commands such as sed, head or tail (if you know other commands, you can use them).

To run both jobs, you must use two consecutive yarn commands. Remember that the output of the first job is the input for the second job.

In [183]:
%%bash

OUT_DIR_1="assignment4_1_"$(date +"%s%6N")
OUT_DIR_2="assignment4_2_"$(date +"%s%6N")
NUM_REDUCERS=4

# Code for your first job
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files mapper1.py,reducer1.py,/datasets/stop_words_en.txt \
    -mapper 'python2 mapper1.py' \
    -combiner 'python2 reducer1.py' \
    -reducer 'python2 reducer1.py' \
    -numReduceTasks ${NUM_REDUCERS} \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR_1} > /dev/null


# Code for your second job
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files mapper2.py,reducer2.py \
    -mapper 'python2 mapper2.py' \
    -reducer 'python2 reducer2.py' \
    -numReduceTasks 1 \
    -input ${OUT_DIR_1} \
    -output ${OUT_DIR_2} > /dev/null

# Code for obtaining the results
hdfs dfs -cat ${OUT_DIR_2}/part-00000 | grep "english,"

hdfs dfs -rm -r -skipTrash ${OUT_DIR_1}* > /dev/null
hdfs dfs -rm -r -skipTrash ${OUT_DIR_2}* > /dev/null

7820	eghilns 5	english,helsing,hesling,shengli,shingle


18/11/09 08:54:19 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/11/09 08:54:19 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/11/09 08:54:20 INFO mapred.FileInputFormat: Total input files to process : 1
18/11/09 08:54:20 INFO mapreduce.JobSubmitter: number of splits:2
18/11/09 08:54:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541743376671_0019
18/11/09 08:54:21 INFO impl.YarnClientImpl: Submitted application application_1541743376671_0019
18/11/09 08:54:21 INFO mapreduce.Job: The url to track the job: http://5ca157db9fb1:8088/proxy/application_1541743376671_0019/
18/11/09 08:54:21 INFO mapreduce.Job: Running job: job_1541743376671_0019
18/11/09 08:54:28 INFO mapreduce.Job: Job job_1541743376671_0019 running in uber mode : false
18/11/09 08:54:28 INFO mapreduce.Job:  map 0% reduce 0%
18/11/09 08:54:44 INFO mapreduce.Job:  map 29% reduce 0%
18/11/09 08:54:50 INFO mapreduce.Job:  map 44% reduce 0%
18/11/09 08:54:56 INFO 