#### Common warnings:

1. __Back up your solution to the 'work' directory inside the home directory ('/home/jovyan'). It is the only directory whose state is saved across sessions.__

1. Please ensure that you call the right interpreter (python2 or python3). Do not write just "python" without the major version: there is no guarantee that any particular version of Python is the default in the Grading system.

1. Each cell must contain only one programming language.
E.g. if a cell contains Python code and you also want to call a bash command (using “!”) in it, move the bash command to another cell.

1. Our IPython converter is an improved version of the standard converter Nbconvert, and it can process most of Jupyter's magic commands correctly (e.g. it understands "%%bash" and executes the cell as a bash script). However, we highly recommend avoiding magics wherever possible.

#### Hints for the YARN tasks:

1. Please, use relative HDFS paths, i.e. dir1/file1 instead of /user/jovyan/dir1/file1. When you submit the code it will be executed on a real Hadoop cluster. For instance, user ‘jovyan’ may not exist there.

1. Hadoop counters’ names should contain only lowercase Latin letters. One exception: the first letter of the name may be uppercase.

1. In the Hadoop logs, the counter of stop words should appear before the counter of total words. To achieve this, keep in mind that counters are printed in lexicographical order.
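
The counter hints above can be illustrated with a minimal Python sketch. It reuses the `reporter:counter:<group>,<name>,<amount>` line format that the mapper in this notebook writes to stderr. The helper names and the counter name "Stop words" are assumptions made for illustration.

```python
import sys


def increment_counter(group, name, amount=1):
    """Write one Hadoop Streaming counter update to stderr."""
    print("reporter:counter:%s,%s,%d" % (group, name, amount), file=sys.stderr)


def names_in_log_order(names):
    """Return counter names in lexicographical order, as the hint describes."""
    return sorted(names)


if __name__ == "__main__":
    # Hypothetical counter names: "Stop words" sorts before "Total words".
    increment_counter("Wiki stats", "Stop words")
    increment_counter("Wiki stats", "Total words")
    print(names_in_log_order(["Total words", "Stop words"]))
```

Choosing names whose first letters already differ in the desired direction keeps the order independent of any other counter in the same group.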

# Hadoop Streaming assignment 3: Name Count

Write a WordCount program for all the names in the dataset. A name is a word with the following properties:

1) The first character is not a digit (other characters can be digits).

2) The first character is uppercase, all the other characters that are letters are lowercase.

3) Among all occurrences of this word in the dataset, counted regardless of case, fewer than 0.5% do not meet condition (2).

Order by quantity, most popular first, output format:

name <tab> count

The result is the 5th line in the output.

The result on the sample dataset:
    french 5742
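
As a reading aid, the three conditions can be sketched in plain Python, outside Hadoop. The function names, the in-memory `Counter` approach, and the choice to report the number of name-form occurrences under a lowercased key are all assumptions made for illustration; they are not part of the assignment or of the solution in this notebook.

```python
from collections import Counter


def looks_like_name(word):
    """Conditions (1) and (2): the word starts with an uppercase letter
    and contains no uppercase letter after the first character."""
    if not word or word[0].isdigit():
        return False
    if not word[0].isupper():
        return False
    return not any(ch.isupper() for ch in word[1:])


def count_names(words, max_other_share=0.005):
    """Condition (3): keep a word only if fewer than 0.5% of its
    case-insensitive occurrences are not in name form."""
    name_form = Counter()  # occurrences that satisfy (1) and (2)
    total = Counter()      # all occurrences, regardless of case
    for word in words:
        key = word.lower()
        total[key] += 1
        if looks_like_name(word):
            name_form[key] += 1
    result = {}
    for key, n_total in total.items():
        n_name = name_form[key]
        if n_name and (n_total - n_name) < max_other_share * n_total:
            result[key] = n_name
    # Most popular first; ties are broken alphabetically (an assumption).
    return sorted(result.items(), key=lambda kv: (-kv[1], kv[0]))
```

In a MapReduce setting the two counters would have to be produced per lowercased key by the reducer, because condition (3) needs both totals for the same word at once.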

## Step 1. Create mapper and reducer.

<b>Hint:</b> The demo task contains almost all the pieces needed to complete this assignment. You may use the demo to implement the first MapReduce job.

In [48]:
%%writefile mapper.py

import sys
import re


def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)
    
for line in sys.stdin:
    try:
        article_id, text = line.strip().split('\t', 1)
    except ValueError as e:
        continue
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    for word in words:
        eprint("reporter:counter:Wiki stats,Total words,%d" % 1)
        print("%s\t%d" % (word, 1))

Overwriting mapper.py


In [46]:
%%writefile reducer.py

import sys

current_key = None
word_sum = 0

for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError as e:
        continue
    if current_key != key:
        if current_key:
            print("%s\t%d" % (current_key, word_sum))
        word_sum = 0
        current_key = key
    word_sum += count

if current_key:
    print("%s\t%d" % (current_key, word_sum))

Overwriting reducer.py


In [36]:
# You can use this cell for other experiments: for example, for combiner.

## Step 2. Create sort job.

<b>Hint:</b> You may use MapReduce comparator to solve this step. Make sure that the keys are sorted in ascending order.
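
To see what a numeric comparator changes, here is a small local sketch in Python. It imitates the effect of the `-k1,1nr` option used in the sort job of this notebook (first field, numeric, reversed) on `count<tab>word` lines. The function name is hypothetical, and the sketch only mimics the ordering; it does not call Hadoop.

```python
def sort_like_k1_1nr(lines):
    """Sort 'count<TAB>word' lines by the numeric value of the first
    field, largest first, as a numeric reverse comparator would."""
    def count_of(line):
        return int(line.split("\t", 1)[0])
    return sorted(lines, key=count_of, reverse=True)


if __name__ == "__main__":
    sample = ["3\tparis", "10\tfrench", "2\trome"]
    # A plain text sort would place "10" before "2" and "3" for the wrong reason.
    print(sorted(sample))
    print(sort_like_k1_1nr(sample))
```

The difference between the two printed lists is the reason the key has to be compared as a number rather than as text.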

In [45]:
%%writefile mapper_1.py

import sys

for line in sys.stdin:
    try:
        word, count = line.strip().split('\t', 1)
        count = int(count)
        print("%d\t%s" % (count, word))
    except ValueError as e:
        continue

Overwriting mapper_1.py


In [56]:
%%writefile reducer_1.py

import sys

def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)
    
def is_name(word, count):
    if word[0].isdigit():
        return False
    if word[0].isupper() and count <= 5968:
        for letter in word[1:]:
            if letter.isalpha():
                if letter.isupper():
                    return False
        return True
    elif count <= 5968:
        return True
    return False

for line in sys.stdin:
    try:
        count, word = line.strip().split('\t', 1)
        count = int(count)
        if is_name(word, count):
            print("%s\t%d" % (word.lower(), count))
    except ValueError as e:
        continue

Overwriting reducer_1.py


## Step 3. Bash commands

<b> Hint: </b> For printing the exact row you may use basic UNIX commands. For instance, sed/head/tail/... (if you know other commands, you can use them).

To run both jobs, you must use two consecutive yarn commands. Remember that the input of the second job is the output of the first job.

__NB__: Please use an explicit Python major version (e.g. `python3 mapper.py` instead of `python mapper.py`)!

Only the answer to your task should be printed in the output stream (__stdout__) in the last cell. There should be no more output in this stream. In order to get rid of garbage [junk lines] (e.g. created by `hdfs dfs -rm` or `yarn` commands) redirect the output to /dev/null.

#### Final notice:

1. Please take into account that you must __not__ redirect __stderr__ to anywhere. Hadoop, Hive, and Spark print their logs to stderr and the Grading system also reads and analyses it.

1. When checking the code from the notebook, the system runs all of the notebook's cells and reads the output of only the last filled cell. No cell should throw an exception when it runs. If you decide to write some text in a cell, change the style of the cell to Markdown (Cell -> Cell type -> Markdown).

1. The Grader takes into account the output from the sample dataset you have in the notebook. Therefore, you have to "Run All" cells in the notebook before you send the ipynb solution.

1. The name of the notebook must contain only Latin letters, digits, and the characters “-” or “_”. For example, Windows adds something like " (2)" (with the leading space) to the end of a filename if you download a file whose name already exists. This is a problem, because the name then contains a space and the parentheses "(" and ")".
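
The naming rule in the last item can be checked with a short Python sketch. The function name is hypothetical, and it is an assumption here that the rule applies to the file name without its `.ipynb` extension.

```python
import re

# Latin letters, digits, "-" and "_" only.
VALID_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_notebook_name(stem):
    """Return True if the notebook name (without extension) uses only
    the allowed characters."""
    return VALID_NAME.fullmatch(stem) is not None


if __name__ == "__main__":
    print(is_valid_notebook_name("name_count-3"))    # allowed characters only
    print(is_valid_notebook_name("name_count (2)"))  # space and parentheses
```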

In [1]:
%%bash

OUT_DIR="wordcount_result_"$(date +"%s%6N")
OUT_DIR1="wordcount_result1_"$(date +"%s%6N")
NUM_REDUCERS=4
NUM_REDUCERS1=1

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null
hdfs dfs -rm -r -skipTrash ${OUT_DIR1} > /dev/null


yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming wordCount" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py \
    -mapper "python3 mapper.py" \
    -combiner "python3 reducer.py" \
    -reducer "python3 reducer.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null
    
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.reduces=${NUM_REDUCERS1} \
    -D mapreduce.job.name="Sorting wordCount" \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D stream.map.output.field.separator="\t" \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.map.output.key.field.separator="\t" \
    -D mapreduce.partition.keycomparator.options=-k1,1nr \
    -files mapper_1.py,reducer_1.py \
    -mapper 'python3 mapper_1.py' \
    -reducer 'python3 reducer_1.py' \
    -input ${OUT_DIR} \
    -output ${OUT_DIR1} > /dev/null

hdfs dfs -cat ${OUT_DIR1}/part-00000 | sed -n '5p;6q'

french	5741


rm: `wordcount_result_1601840907406415': No such file or directory
rm: `wordcount_result1_1601840907407428': No such file or directory
20/10/04 19:48:31 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/10/04 19:48:31 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/10/04 19:48:31 INFO mapred.FileInputFormat: Total input files to process : 1
20/10/04 19:48:32 INFO mapreduce.JobSubmitter: number of splits:2
20/10/04 19:48:32 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/10/04 19:48:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1601837393566_0026
20/10/04 19:48:32 INFO conf.Configuration: resource-types.xml not found
20/10/04 19:48:32 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/10/04 19:48:32 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
20/10/0