### TF-IDF

#### Task description
In this task Hadoop Streaming is used to process Wikipedia articles dump.

Dataset location: /data/wiki/en_articles_part

Stop words list is in ‘/datasets/stop_words_en.txt’ file.

Format: article_id < tab > article_text

Calculate tf*idf for each pair (word, article) from the Wikipedia dump. Apply the stop words filter to speed up calculations. Term frequency (tf) is a function depending on a term (word) and a document (article):

tf(term, doc_id) = Nt/N,

where Nt - quantity of particular term in the document, N - the total number of terms in the document (without stop words)

Inverse document frequency (idf) is a function depends on a term:

idf(term) = 1/log(1 + Dt),

where Dt - number of documents in the dataset with the particular term.

You can find more information here: https://en.wikipedia.xn--org/wiki/Tfidf-q82h but use just the formulas mentioned above.

Output: tf*idf for term=’labor’ and article_id=12

In [1]:
%%writefile mapper.py

import sys
import re

from collections import Counter

reload(sys)
sys.setdefaultencoding('utf-8')

path_to_file='stop_words_en.txt'
# path_to_file='/datasets/stop_words_en.txt'

with open(path_to_file) as stop_words_file:
    content = stop_words_file.readlines()
    stop_words = set(l.strip().lower() for l in content)

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue
    words = [w.lower() for w in re.split("\W*\s+\W*", text, flags=re.UNICODE) if w.lower() not in stop_words]
    all_wards = float(len(words))
    counters = Counter(words)
        
    tf = {w: counters[w] / all_wards for w in counters}
    
    for word in tf:
        print "%s\t%0.5f\t%s" % (word.lower(), tf[word], article_id)

Overwriting mapper.py


In [2]:
%%writefile reducer.py

import sys

from math import log

current_key = None
all_documents = 0
current_tf = 0.

for line in sys.stdin:
    try:
        key, tf, article_id = line.strip().split('\t', 2)
        article_id = int(article_id)
    except ValueError as e:
        continue
        
    if key == 'labor':
        all_documents += 1
        
    if article_id == 12:
        current_tf = float(tf)
        
    if current_key != key:
        if current_key == 'labor':
            idf = 1. / log(1. + all_documents)
            print "%0.10f" % (current_tf * idf)
            
        current_key = key
        all_documents = 0
        current_tf = 0

if current_key and current_key == 'labor':
    idf = 1. / log(1 + all_documents)
    print "%0.10f" % (current_tf * idf)

Overwriting reducer.py


In [3]:
%%bash

OUT_DIR="stop_words_result"
NUM_REDUCERS=5

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.jab.name="Streaming stop words" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py,/datasets/stop_words_en.txt \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null

hdfs dfs -cat ${OUT_DIR}/part-00004 | head

0.0003509630	


18/01/31 23:12:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/01/31 23:12:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/01/31 23:12:57 INFO mapred.FileInputFormat: Total input files to process : 1
18/01/31 23:12:57 INFO mapreduce.JobSubmitter: number of splits:2
18/01/31 23:12:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1517390126047_0009
18/01/31 23:12:58 INFO impl.YarnClientImpl: Submitted application application_1517390126047_0009
18/01/31 23:12:58 INFO mapreduce.Job: The url to track the job: http://d36a8f213a9d:8088/proxy/application_1517390126047_0009/
18/01/31 23:12:58 INFO mapreduce.Job: Running job: job_1517390126047_0009
18/01/31 23:13:04 INFO mapreduce.Job: Job job_1517390126047_0009 running in uber mode : false
18/01/31 23:13:04 INFO mapreduce.Job:  map 0% reduce 0%
18/01/31 23:13:20 INFO mapreduce.Job:  map 100% reduce 0%
18/01/31 23:13:27 INFO mapreduce.Job:  map 100% reduce 20%
18/01/31 23:13:28 IN