# Real-World Applications: TF-IDF

In this task, Hadoop Streaming is used to process a dump of Wikipedia articles (/data/wiki/en_articles_part).

The goal is to calculate tf*idf for each (word, article) pair in the Wikipedia dump. Apply the stop-word filter to speed up the calculation. Term frequency (tf) is a function of a term (word) and a document (article):

    tf(term, doc_id) = Nt/N,

where Nt is the number of occurrences of the term in the document and N is the total number of terms in the document (excluding stop words).

Inverse document frequency (idf) is a function of a term:

    idf(term) = 1/log(1 + Dt),

where Dt is the number of documents in the dataset that contain the term.

You can find more information at https://en.wikipedia.org/wiki/Tf–idf, but use only the formulas given above.
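The two formulas can be checked on a toy corpus before running anything on the cluster. The sketch below is illustrative only: the `docs` dictionary and the helper names `tf` and `idf` are made up for this example, the token lists are assumed to be already lowercased and filtered of stop words, and the logarithm is assumed to be natural (`math.log`), as in the reducer later in this notebook.

```python
import math
from collections import Counter

def tf(term, tokens):
    """Nt / N for one document; tokens are assumed to exclude stop words."""
    counts = Counter(tokens)
    return counts[term] / float(len(tokens))

def idf(term, docs):
    """1 / log(1 + Dt), where Dt is the number of documents containing term."""
    dt = sum(1 for tokens in docs.values() if term in tokens)
    return 1.0 / math.log(1 + dt)

# Hypothetical toy corpus: article_id -> tokens after stop-word removal.
docs = {
    1: ["labor", "market", "labor", "union"],
    2: ["market", "price"],
    3: ["labor", "law"],
}

# "labor" occurs 2 times out of 4 tokens in article 1 and appears in 2 documents.
score = tf("labor", docs[1]) * idf("labor", docs)
print("{:f}".format(score))
```

Note that with this definition idf is undefined for a term that occurs in no document (Dt = 0 gives log(1) = 0), so the helper should only be called for terms that are present in the corpus.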

- Dataset location: /data/wiki/en_articles_part
- Stop words list: /datasets/stop_words_en.txt

Format: article_id <tab> article_text

When parsing the articles, don’t forget about Unicode (even though this is an English Wikipedia dump, it contains many characters from other languages). Remove punctuation marks and convert words to lowercase to get correct counts. To cope with Unicode we recommend using the following tokenizer:

Output: tf*idf for term='labor' and article_id=12

The result on the sample dataset:

    0.000351
    
**Hint**: all Wikipedia article_ids are greater than 0, so you can use a dummy article_id=0 to count the number of documents that contain each term.
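One way to read the hint is sketched below as a local simulation, not as a Hadoop job. The record layout `(word, article_id, value)` and the function name `tfidf_with_dummy_id` are assumptions made for this example: for the dummy article_id 0 the value is a presence marker, and for real articles it is the term frequency. In an actual Streaming job the records would also have to arrive sorted by article_id within each word; that configuration is not shown here. The reducer in this notebook does not rely on the hint and instead buffers the tf values of each word in memory.

```python
import math

# Hypothetical mapper output as (word, article_id, value) tuples.
# For the dummy article_id 0 the value is a presence marker (1 per document
# containing the word); for real articles it is the term frequency.
records = [
    ("labor", 12, 0.25),
    ("labor", 0, 1),
    ("market", 12, 0.5),
    ("labor", 7, 0.1),
    ("labor", 0, 1),
    ("market", 0, 1),
]

def tfidf_with_dummy_id(records):
    """Return {(word, article_id): tf*idf} using the article_id=0 markers."""
    results = {}
    current_word, doc_count = None, 0
    # Sorting by (word, article_id) puts the id-0 markers first for each word.
    for word, article_id, value in sorted(records):
        if word != current_word:
            current_word, doc_count = word, 0
        if article_id == 0:
            doc_count += value  # still counting Dt for this word
        else:
            # All markers for this word have been seen, so Dt is complete.
            idf = 1.0 / math.log(1 + doc_count)
            results[(word, article_id)] = value * idf
    return results

scores = tfidf_with_dummy_id(records)
for key in sorted(scores):
    print("{}\t{}\t{:f}".format(key[0], key[1], scores[key]))
```

Because Dt is known before the first real article of a word is processed, nothing has to be buffered per word, which matters for very frequent terms.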

If you want to deploy the environment on your own machine, please use the [bigdatateam/yarn-notebook](https://hub.docker.com/r/bigdatateam/yarn-notebook/) Docker container.

In [1]:
%%writefile mapper1.py

import sys
import re
import collections

reload(sys)
sys.setdefaultencoding('utf-8') # required to convert to unicode

path = "stop_words_en.txt"

with open(path, "r") as f:
    stop_words = set(f.read().splitlines())  # a set gives constant-time membership tests
    
def cleanup(words):
    return [word.lower().strip() for word in words if (word.lower() not in stop_words)]

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
        # Raw strings keep the regex escapes (\W, \s) from being read as string escapes
        text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
        words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
        words = cleanup(words)
        
        words_counter = collections.Counter(words)
        words_total = sum(words_counter.values())

        for word, count in sorted(words_counter.items()):
            if not word.isalpha(): continue
            tf = float(count)/float(words_total)
            print("{}\t{}\t{:f}".format(word, article_id, tf))
             
    except Exception as e:
        # Log to stderr so error text does not end up in the mapper output
        sys.stderr.write("{}\n".format(e))
        continue

Overwriting mapper1.py
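The mapper above depends on Python 2 idioms (`reload(sys)`, `unicode`). For testing the per-article step locally, the same sequence of operations can be sketched as a plain function that also runs under Python 3. The function name `article_tf` and the sample text are hypothetical; the regular expressions and the order of the steps are taken from the mapper, including the fact that non-alphabetic tokens are dropped from the output but still counted in the total.

```python
import re
from collections import Counter

def article_tf(text, stop_words):
    """Return {word: tf} for one article, mirroring the mapper's steps."""
    # Trim leading and trailing non-word characters, then split on whitespace
    # together with any punctuation attached to it.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    words = [w.lower() for w in words if w.lower() not in stop_words]
    counts = Counter(words)
    total = sum(counts.values())
    # As in the mapper, non-alphabetic tokens are skipped on output
    # but remain part of the denominator.
    return {w: c / float(total) for w, c in counts.items() if w.isalpha()}

# Hypothetical sample input
result = article_tf("The labor market, and labor!", {"the", "and"})
for word in sorted(result):
    print("{}\t{:f}".format(word, result[word]))
```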


In [2]:
%%writefile reducer1.py

import sys
import math

current_word = None
article_count = 0
tf_memory = {}

for line in sys.stdin:
    try:
        word, article_id, tf = line.strip().split('\t', 2)
        tf = float(tf)
        
        if current_word != word:
            if current_word:
                idf = float(1)/math.log(1 + article_count)
                for article, tff in tf_memory.items():
                    print("{}\t{}\t{:f}".format(current_word, article, tff*idf))
            
            current_word = word
            article_count = 0
            tf_memory.clear()
        
        article_count += 1
        tf_memory[article_id] = float(tf)
    
    except Exception as e:
        # Log to stderr so error text does not end up in the reducer output
        sys.stderr.write("{}\n".format(e))
        continue
        
if current_word:
    idf = float(1)/math.log(1 + article_count)
    for article, tff in tf_memory.items():  # flush the buffered tf values of the last word
        print("{}\t{}\t{:f}".format(current_word, article, tff*idf))

Overwriting reducer1.py
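The reducer groups consecutive input lines by word, counts them to obtain Dt, and then emits tf*idf for every buffered article. The same grouping can be sketched with `itertools.groupby`, which makes the end-of-input flush implicit. This is an alternative formulation for local experiments, not the code submitted to the job; the function name `reduce_tfidf` and the sample lines are hypothetical, and the input is assumed to be sorted by word with one line per (word, article) pair, as the mapper produces.

```python
import math
from itertools import groupby

def reduce_tfidf(lines):
    """Yield 'word<TAB>article_id<TAB>tf*idf' for mapper lines sorted by word."""
    parsed = (line.rstrip("\n").split("\t", 2) for line in lines)
    for word, group in groupby(parsed, key=lambda fields: fields[0]):
        # Buffer this word's (article_id, tf) pairs; their count is Dt.
        pairs = [(article_id, float(tf)) for _, article_id, tf in group]
        idf = 1.0 / math.log(1 + len(pairs))
        for article_id, tf in pairs:
            yield "{}\t{}\t{:f}".format(word, article_id, tf * idf)

# Hypothetical mapper output, already sorted by word
sample = [
    "labor\t12\t0.250000\n",
    "labor\t7\t0.100000\n",
    "market\t12\t0.500000\n",
]
output = list(reduce_tfidf(sample))
for out in output:
    print(out)
```

Like the notebook's reducer, this version holds all tf values of one word in memory at a time, which is acceptable for the sample dataset but may be costly for very common terms.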


In [3]:
%%bash

OUT_DIR="assignment3_"$(date +"%s%6N")
NUM_REDUCERS=4

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.job.name="Assignment 3" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -D mapreduce.partition.keypartitioner.options=-k1,1 \
    -files mapper1.py,reducer1.py,/datasets/stop_words_en.txt \
    -mapper "python mapper1.py" \
    -reducer "python reducer1.py" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null 

hdfs dfs -cat ${OUT_DIR}/part* | grep -w "labor" | grep -w "12" | cut -f 3

0.000351


rm: `assignment3_1541973234648812': No such file or directory
18/11/11 21:53:58 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/11/11 21:53:58 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/11/11 21:53:59 INFO mapred.FileInputFormat: Total input files to process : 1
18/11/11 21:54:00 INFO mapreduce.JobSubmitter: number of splits:2
18/11/11 21:54:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541966276258_0002
18/11/11 21:54:01 INFO impl.YarnClientImpl: Submitted application application_1541966276258_0002
18/11/11 21:54:01 INFO mapreduce.Job: The url to track the job: http://a695ff422ebc:8088/proxy/application_1541966276258_0002/
18/11/11 21:54:01 INFO mapreduce.Job: Running job: job_1541966276258_0002
18/11/11 21:54:08 INFO mapreduce.Job: Job job_1541966276258_0002 running in uber mode : false
18/11/11 21:54:08 INFO mapreduce.Job:  map 0% reduce 0%
18/11/11 21:54:24 INFO mapreduce.Job:  map 5% reduce 0%
18/11/11 21:54:30 I