In this task, Hadoop Streaming is used to process a dump of Wikipedia articles.

Dataset location: /data/wiki/en_articles_part

The stop words list is in the file /datasets/stop_words_en.txt.

Format: article_id <tab> article_text

When parsing the articles, remember to handle Unicode (even though this is an English Wikipedia dump, it contains many characters from other languages), remove punctuation marks, and convert words to lowercase to get correct counts. To cope with Unicode, we recommend using the following tokenizer:
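A minimal sketch of such a tokenizer, assuming the same regular expression that mapper.py in this notebook passes to re.split. The function name tokenize is hypothetical, and the snippet is written to run under Python 3 as well as Python 2.

```python
# -*- coding: utf-8 -*-
import re

# Split on whitespace together with any non-word characters around it.
# re.UNICODE makes \W and \s treat non-ASCII letters as word characters.
TOKEN_SPLIT = re.compile(r"\W*\s+\W*", flags=re.UNICODE)


def tokenize(text):
    """Return the lowercased tokens of a unicode string (hypothetical helper)."""
    return [token.lower() for token in TOKEN_SPLIT.split(text)]
```

Because the pattern needs at least one whitespace character to match, punctuation at the very start or end of the text stays attached to the first or last token.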



Calculate tf*idf for each pair (word, article) from the Wikipedia dump. Apply the stop words filter to speed up calculations. Term frequency (tf) is a function of a term (word) and a document (article):

tf(term, doc_id) = Nt/N,

where Nt is the number of occurrences of the term in the document and N is the total number of terms in the document (excluding stop words).

Inverse document frequency (idf) is a function of a term:

idf(term) = 1/log(1 + Dt),

where Dt is the number of documents in the dataset that contain the term.
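The two formulas can be checked on small numbers with a sketch like the one below. The function names and the sample counts are hypothetical, and the logarithm is assumed to be the natural one, as in math.log, which reducer.py in this notebook uses.

```python
from math import log


def tf(term_count, total_terms):
    """tf = Nt / N, with N counted after stop words are removed."""
    return term_count / float(total_terms)


def idf(docs_with_term):
    """idf = 1 / log(1 + Dt), natural logarithm assumed."""
    return 1.0 / log(1 + docs_with_term)


def tf_idf(term_count, total_terms, docs_with_term):
    """Product of the two quantities for one (term, article) pair."""
    return tf(term_count, total_terms) * idf(docs_with_term)


# Hypothetical counts: a term seen 3 times in a 100-term article, in 9 articles.
example = tf_idf(3, 100, 9)
```

Since Dt is at least 1 for any term that occurs in the dataset, log(1 + Dt) is at least log(2) and the division is always defined.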

You can find more information here: https://en.wikipedia.org/wiki/Tf–idf but use only the formulas given above.

Output: tf*idf for term='labor' and article_id=12

Hint: all Wikipedia article_ids are greater than 0, so you can use a dummy article_id=0 to count the number of documents that contain each term.
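One way to read this hint is sketched below as an in-memory simulation. It assumes the records reach the reducer sorted by word and then by article_id, so the dummy records with article_id=0 arrive before the real ones. The function names are hypothetical, and sorted() stands in for the shuffle phase. The reducer.py in this notebook takes a different route and buffers the articles of each word in a dictionary.

```python
from math import log


def map_article(article_id, words):
    """Yield (word, article_id, value) records for one tokenized article."""
    total = float(len(words))
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    for word, count in counts.items():
        yield (word, 0, 1.0)                   # dummy record: one more document has this word
        yield (word, article_id, count / total)  # real record: tf of the word in this article


def reduce_sorted(records):
    """Consume records sorted by (word, article_id) and yield tf*idf triples."""
    current_word, doc_count = None, 0
    for word, article_id, value in records:
        if word != current_word:
            current_word, doc_count = word, 0
        if article_id == 0:
            doc_count += 1                     # still counting Dt for this word
        else:
            yield (word, article_id, value / log(1 + doc_count))


# Hypothetical two-article corpus, already tokenized and filtered.
articles = {1: ["a", "b", "a"], 2: ["a", "c"]}
records = sorted(rec for aid, words in articles.items() for rec in map_article(aid, words))
result = list(reduce_sorted(records))
```

With this ordering the reducer knows Dt before it sees the first real record of a word, so it can stream results without holding a per-word dictionary in memory.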

Passed 100%

In [35]:
%%writefile mapper.py
import sys
import re
from collections import Counter

reload(sys)
sys.setdefaultencoding('utf-8') # required to convert to unicode

with open('stop_words_en.txt') as f:
    stop_words = set(f.read().split())

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError:
        # skip lines that are not in the article_id <tab> article_text format
        continue

    # tokenize, lowercase and drop stop words
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    words = [x.lower() for x in words if x.lower() not in stop_words]

    num_words = len(words)  # N: terms in the article, stop words excluded
    counter = Counter(words)
    article_id = int(article_id)
    # emit one record per distinct word instead of one per occurrence
    for word, frequency in counter.iteritems():
        tf = frequency / float(num_words)
        print "%s\t%d\t%f" % (word, article_id, tf)


Overwriting mapper.py


In [11]:
%%writefile reducer.py
from __future__ import division
import sys
from math import log

current_word = None
article_dict = {}

for line in sys.stdin:
    try:
        word, article_id, tf = line.strip().split('\t')
        article_id = int(article_id)
        tf = float(tf)
    except ValueError as e:
        continue
    
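    if current_word != word:
        if current_word:
            # flush the previous word: Dt is the number of distinct articles seen
            idf = 1 / log(1 + len(article_dict))
            # key_tf keeps the loop from overwriting the tf parsed from this line
            for key_article_id, key_tf in article_dict.iteritems():
                print "%s\t%d\t%f" % (current_word, key_article_id, key_tf * idf)
        article_dict = {}
        current_word = word
    article_dict[article_id] = tf

# flush the last word, which the loop above never emits
if current_word:
    idf = 1 / log(1 + len(article_dict))
    for key_article_id, key_tf in article_dict.iteritems():
        print "%s\t%d\t%f" % (current_word, key_article_id, key_tf * idf)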

Overwriting reducer.py


# Debugging

In [None]:
# !hdfs dfs -cat /data/wiki/en_articles_part/articles-part | head -2 > test_file.txt
# !cp /datasets/stop_words_en.txt stop_words_en.txt
# !cat test_file.txt | python2 mapper.py | sort -k1,1 | python2 reducer.py

In [1]:
# ! hdfs dfs -ls /data/wiki

In [40]:
%%bash

OUT_DIR="tfidf_result_"$(date +"%s%6N")
NUM_REDUCERS=8

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming tfidf" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py,/datasets/stop_words_en.txt \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null

hdfs dfs -cat ${OUT_DIR}/* | grep -P '^labor\t12\t' | cut -f3


0.000351


rm: `tfidf_result_1529466951938651': No such file or directory
18/06/20 03:55:54 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/06/20 03:55:54 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/06/20 03:55:55 INFO mapred.FileInputFormat: Total input files to process : 1
18/06/20 03:55:56 INFO mapreduce.JobSubmitter: number of splits:2
18/06/20 03:55:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1529463856073_0007
18/06/20 03:55:56 INFO impl.YarnClientImpl: Submitted application application_1529463856073_0007
18/06/20 03:55:56 INFO mapreduce.Job: The url to track the job: http://d958f05bfa46:8088/proxy/application_1529463856073_0007/
18/06/20 03:55:56 INFO mapreduce.Job: Running job: job_1529463856073_0007
18/06/20 03:56:00 INFO mapreduce.Job: Job job_1529463856073_0007 running in uber mode : false
18/06/20 03:56:00 INFO mapreduce.Job:  map 0% reduce 0%
18/06/20 03:56:16 INFO mapreduce.Job:  map 73% reduce 0%
18/06/20 03:56:17