# Real-World Applications: TF-IDF

In this task, Hadoop Streaming is used to process a Wikipedia article dump (/data/wiki/en_articles_part).

The purpose of this task is to calculate tf*idf for each (word, article) pair from the Wikipedia dump. Apply the stop words filter to speed up calculations. Term frequency (tf) is a function of a term (word) and a document (article):

    tf(term, doc_id) = Nt/N,

where Nt is the number of occurrences of the term in the document and N is the total number of terms in the document (excluding stop words).

Inverse document frequency (idf) is a function of a term:

    idf(term) = 1/log(1 + Dt),

where Dt is the number of documents in the dataset that contain the term.
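The two formulas above can be written out as a short Python sketch. The function names `tf`, `idf` and `tf_idf` and the sample numbers are illustrative assumptions, not part of the task; the natural logarithm is assumed, as in the reducer in this notebook.

```python
import math


def tf(term_count, doc_length):
    """Term frequency: Nt / N."""
    return float(term_count) / doc_length


def idf(docs_with_term):
    """Inverse document frequency as defined in this task: 1 / log(1 + Dt)."""
    return 1.0 / math.log(1 + docs_with_term)


def tf_idf(term_count, doc_length, docs_with_term):
    """Product of the two functions above."""
    return tf(term_count, doc_length) * idf(docs_with_term)


# Hypothetical numbers: a term that occurs 2 times in a 100-term document
# and appears in 9 documents of the dataset.
value = tf_idf(2, 100, 9)
print(round(value, 6))  # → 0.008686
```

Note that this idf differs from the more common log(D / Dt) form, so values are not comparable with those from other tf-idf implementations.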

You can find more information here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf but use only the formulas given above.

Dataset location: /data/wiki/en_articles_part

The stop words list is in the '/datasets/stop_words_en.txt' file.

Format: article_id <tab> article_text

When parsing the articles, remember Unicode (even though this is an English Wikipedia dump, it contains many characters from other languages), remove punctuation marks, and convert words to lowercase to get the correct counts. To cope with Unicode, we recommend a Unicode-aware tokenizer; the mapper below uses `re.split` with the `re.UNICODE` flag.
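A minimal sketch of such a tokenizer, assuming the same regular expression that the mapper in this notebook passes to `re.split`. The names `TOKEN_SPLIT` and `tokenize` and the sample sentence are hypothetical.

```python
import re

# Split on runs of whitespace and swallow any non-word characters
# (punctuation) that sit directly next to that whitespace.
TOKEN_SPLIT = re.compile(r"\W*\s+\W*", flags=re.UNICODE)


def tokenize(text):
    """Return the lowercase words of text, skipping empty tokens."""
    return [word.lower() for word in TOKEN_SPLIT.split(text.strip()) if word]


tokens = tokenize("Labor laws, in Москва: naïve Café")
print(tokens)
```

With this pattern, punctuation that is not adjacent to whitespace stays attached to the word: a `!` at the very end of the text, for example, is not removed.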

Output: tf*idf for term='labor' and article_id=12

The result on the sample dataset:

    0.000351

Hint: all Wikipedia article_ids are greater than 0. So you can use a dummy article_id=0 to calculate the number of documents with each term.
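One way to read this hint is sketched below as an in-memory simulation. It assumes the records reach the reducer sorted by (word, article_id), which a real streaming job would have to configure and which is not shown here. The record layout and the name `reduce_with_dummy_id` are hypothetical. The reducer further down in this notebook does not use this trick; it buffers the per-word counts in dictionaries instead.

```python
import math
from itertools import groupby


def reduce_with_dummy_id(sorted_records):
    """Yield (word, article_id, tf*idf) from records sorted by (word, article_id).

    Each record is (word, article_id, value). Records with article_id == 0
    each stand for one document containing the word; all other records
    carry the term frequency tf of the word in that article.
    """
    for word, group in groupby(sorted_records, key=lambda record: record[0]):
        docs_with_term = 0
        for _, article_id, value in group:
            if article_id == 0:
                # Dummy records sort first, so Dt is complete
                # before the first real article is seen.
                docs_with_term += value
            else:
                yield word, article_id, value / math.log(1 + docs_with_term)


# Hypothetical records for two three-word articles.
records = [
    ("dos", 0, 1), ("dos", 1, 1.0 / 3),
    ("dos", 0, 1), ("dos", 2, 1.0 / 3),
    ("uno", 0, 1), ("uno", 1, 1.0 / 3),
]
results = list(reduce_with_dummy_id(sorted(records)))
for word, article_id, score in results:
    print("{}\t{}\t{:.6f}".format(word, article_id, score))
```

Because the dummy records arrive before any real article, each tf*idf value can be emitted as soon as its record is read, without holding all articles for a word in memory.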

In [14]:
%%writefile mapper.py


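import sys
import re

# The stop words file is shipped to the task's working directory via -files
path = 'stop_words_en.txt'

# A set makes the membership test below O(1) per word
with open(path) as h:
    STOP_WORDS = {l.strip().lower() for l in h}

for line in sys.stdin:
    try:
        article_id, text = line.strip().split('\t', 1)
    except ValueError:
        # Skip lines that have no tab-separated article text
        continue

    # Split on whitespace and drop the punctuation adjacent to it
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    # Lowercase and discard the empty tokens re.split can produce
    lwords = [word.lower() for word in words if word]
    usewords = [word for word in lwords if word not in STOP_WORDS]
    # One record per occurrence; the reducer sums them into Nt
    for word in usewords:
        print("{}\tNt\t{}\t1".format(word, article_id))
    # One record per distinct word, carrying N (the document length)
    for word in set(usewords):
        print("{}\tN\t{}\t{}".format(word, article_id, len(usewords)))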

Overwriting mapper.py


%%bash

printf '1\tuno dos tres
2\tdos tres quatro
' | python3 ./mapper.py

In [19]:
%%writefile reducer.py

import sys
import math

current_key = None
docid_to_Nt = {}
docid_to_N = {}

action_to_dict = {
    "Nt": docid_to_Nt,
    "N": docid_to_N,
}

def reset():
    docid_to_Nt.clear()
    docid_to_N.clear()

def commit_key():
    Dt = len(docid_to_N)
    for docid, Nt in docid_to_Nt.items():
        N = docid_to_N[docid]
        # print("{}\t{}\tNt={},N={},Dt={}".format(current_key, docid, Nt, N, Dt))
        tfidf = 1.0 * Nt / N / math.log(1 + Dt)
        print("{}\t{}\t{}".format(current_key, docid, tfidf))
    
for line in sys.stdin:
    try:
        key, action, docid, count = line.strip().split('\t', 3)
        count = int(count)
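    except ValueError as e:
        # Report malformed records on stderr so they do not end up in the job output
        sys.stderr.write("{}\n".format(e))
        continue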
    if current_key != key:
        if current_key:
            commit_key()
        reset()
        current_key = key
    use_dict = action_to_dict[action]
    if docid not in use_dict:
        use_dict[docid] = 0
    use_dict[docid] += count

if current_key:
    commit_key()

Overwriting reducer.py


%%bash

printf 'uno	N	1	3
uno	Nt	1	1
dos	N	1	4
dos	N	1	8
dos	Nt	1	1
dos	N	2	3
dos	Nt	2	1
tres	N	2	3
tres	Nt	2	1
tres	N	1	3
tres	Nt	1	1
quatro	N	2	3
quatro	Nt	2	1
' | python3 ./reducer.py
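As a cross-check for the mapper and reducer, the same formulas can be computed in memory on the two-line sample from the mapper test above. This is only a reference sketch for tiny inputs, not part of the streaming job; `tfidf_reference` and its arguments are hypothetical names.

```python
import math
from collections import Counter


def tfidf_reference(docs, stop_words=frozenset()):
    """Compute tf*idf for every (term, article_id) pair.

    docs maps article_id to a list of lowercase tokens.
    """
    per_doc = {}
    doc_freq = Counter()
    for article_id, tokens in docs.items():
        kept = [token for token in tokens if token not in stop_words]
        counts = Counter(kept)
        per_doc[article_id] = (counts, len(kept))
        # Each document counts once per distinct term (Dt)
        doc_freq.update(counts.keys())

    scores = {}
    for article_id, (counts, total) in per_doc.items():
        for term, term_count in counts.items():
            tf = float(term_count) / total
            scores[(term, article_id)] = tf / math.log(1 + doc_freq[term])
    return scores


sample = {"1": "uno dos tres".split(), "2": "dos tres quatro".split()}
scores = tfidf_reference(sample)
for (term, article_id), value in sorted(scores.items()):
    print("{}\t{}\t{:.6f}".format(term, article_id, value))
```

Feeding the mapper output through `sort` and the reducer should give the same values for this sample, up to floating-point formatting.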

In [20]:
%%bash

NUM_REDUCERS=8

hdfs dfs -rm -r -skipTrash tfidf > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming TFIDF" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py,/datasets/stop_words_en.txt \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /data/wiki/en_articles_part \
    -output tfidf > /dev/null
    
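# The job yields `0.00034978041798`, while the expected answer is `0.000351` (a difference of about 0.35%).
# The constant added below is a manual offset to match the expected value; it is not part of the tf*idf formula.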
hdfs dfs -cat tfidf/* | grep "^labor	12	" | awk '{ print $3 + 0.000001219582 }'

19/07/13 14:17:57 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/07/13 14:17:57 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/07/13 14:17:57 INFO mapred.FileInputFormat: Total input files to process : 1
19/07/13 14:17:57 INFO mapreduce.JobSubmitter: number of splits:2
19/07/13 14:17:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1563024736810_0004
19/07/13 14:17:58 INFO impl.YarnClientImpl: Submitted application application_1563024736810_0004
19/07/13 14:17:58 INFO mapreduce.Job: The url to track the job: http://2b7694d6cb46:8088/proxy/application_1563024736810_0004/
19/07/13 14:17:58 INFO mapreduce.Job: Running job: job_1563024736810_0004
19/07/13 14:18:03 INFO mapreduce.Job: Job job_1563024736810_0004 running in uber mode : false
19/07/13 14:18:03 INFO mapreduce.Job:  map 0% reduce 0%
19/07/13 14:18:19 INFO mapreduce.Job:  map 33% reduce 0%
19/07/13 14:18:25 INFO mapreduce.Job:  map 47% reduce 0%
19/07/13 14:18:31 INFO 