
 Experiment 1
 Aim: To measure the importance of a particular word in a corpus by computing its TF-IDF (Term Frequency - Inverse Document Frequency) score
 Tools to be used: MapReduce framework


 Step 1: Define the following data: data = [(1, 'i love dogs'), (2, 'i hate dogs and knitting'), (3, 'knitting is my hobby and my passion')]
 Here, the key is the document ID and the value is the document text,
 so we have three documents in total.

 Step 2: Define Phase I Mapper and Reducer
 Mapper 1: From the above data, generate the following key-value pair:
 (<word, document ID>, <1>)
 Reducer 1: From the Mapper 1 output, generate the following key-value pair:
 (<word, document ID>, <word count in same document>)


 Step 3: Define Phase II Mapper and Reducer
 Mapper 2: From the output of Reducer 1, generate the following key-value pair:
 (<word>, <document ID, word count in same document, 1>)
 Reducer 2: Finally, produce the desired output:
 (<word>, <document ID, TF-IDF>)

 Formula to compute TF-IDF:
 t_ij = w_ij * log(N / d_ij)
 Here t_ij is the TF-IDF score of word i in document j,
 w_ij is the count of word i in document j,
 N is the total number of documents, and
 d_ij is the number of documents in which word i occurs (the same for every j).
 The code below uses the natural logarithm (math.log).
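
 As a quick check of the formula, the sketch below evaluates it by hand for two words from the Step 1 data. The helper name tf_idf and its parameter names are illustrative assumptions, not part of the assignment, and the natural logarithm is assumed.

```python
import math

def tf_idf(word_count, total_docs, docs_with_word):
    """t_ij = w_ij * log(N / d_ij), using the natural logarithm."""
    return word_count * math.log(total_docs / docs_with_word)

# "my" occurs twice in document 3 and in no other document: w = 2, N = 3, d = 1
score_my = tf_idf(2, 3, 1)

# "dogs" occurs once in document 1 and appears in documents 1 and 2: w = 1, N = 3, d = 2
score_dogs = tf_idf(1, 3, 2)

# A word that appears in every document scores 0, because log(N / N) = 0
score_everywhere = tf_idf(1, 3, 3)

print(score_my, score_dogs, score_everywhere)
```

 A rare word that is repeated within one document ("my") scores highest, while a word shared by several documents ("dogs") is discounted.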

In [24]:
import math
from collections import defaultdict

# Step 1: Define the data

In [25]:
data = [
    (1, 'i love dogs'),
    (2, 'i hate dogs and knitting'),
    (3, 'knitting is my hobby and my passion')
]

# Step 2: Phase I Mapper and Reducer

Mapper 1:

In [26]:
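def mapper1(data):
    # Count every (word, doc_id) occurrence. This already sums the 1s that
    # the Mapper 1 specification would emit, so it returns counts directly.
    word_doc_counts = defaultdict(int)
    for doc_id, text in data:
        words = text.split()  # simple whitespace tokenization
        for word in words:
            word_doc_counts[(word, doc_id)] += 1
    return word_doc_counts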

Reducer 1:

In [27]:
def reducer1(word_doc_counts):
    # Regroup the per-document counts by word: word -> [(doc_id, count), ...]
    word_doc_freq = defaultdict(list)
    for (word, doc_id), count in word_doc_counts.items():
        word_doc_freq[word].append((doc_id, count))
    return word_doc_freq
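
The Phase I specification in Step 2 asks the mapper to emit a 1 per occurrence and the reducer to sum them. A minimal sketch of that literal form is shown here for comparison; the names mapper1_emit and reducer1_sum and the one-document sample are assumptions made for illustration, not part of the notebook's pipeline.

```python
from collections import defaultdict

def mapper1_emit(data):
    """Emit one ((word, doc_id), 1) pair per word occurrence."""
    for doc_id, text in data:
        for word in text.split():
            yield (word, doc_id), 1

def reducer1_sum(pairs):
    """Sum the emitted 1s for each (word, doc_id) key."""
    counts = defaultdict(int)
    for key, one in pairs:
        counts[key] += one
    return dict(counts)

# One document from the Step 1 data, used as a small sample
sample = [(3, 'knitting is my hobby and my passion')]
phase1 = reducer1_sum(mapper1_emit(sample))
print(phase1[('my', 3)])
```

Keeping the mapper free of state, as here, is closer to how a real MapReduce job splits the work: the mapper only emits pairs and all aggregation happens in the reducer.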

# Step 3: Phase II Mapper and Reducer

Mapper 2:

In [28]:
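def mapper2(word_doc_freq):
    # Total number of words per document: doc_id -> total word count.
    # The TF-IDF formula uses raw counts, so the scoring step does not need this.
    doc_word_counts = defaultdict(int)
    for word, doc_counts in word_doc_freq.items():
        for doc_id, count in doc_counts:
            doc_word_counts[doc_id] += count
    return doc_word_counts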

Reducer 2:

In [29]:
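def reducer2(word_doc_freq, total_docs):
    # word_doc_freq maps word -> [(doc_id, count), ...], as returned by reducer1
    tf_idf_scores = {}
    for word, doc_counts in word_doc_freq.items():
        # len(doc_counts) is the number of documents that contain the word
        idf = math.log(total_docs / len(doc_counts))
        for doc_id, count in doc_counts:
            tf_idf_scores[(doc_id, word)] = count * idf
    return tf_idf_scores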
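
For comparison with the Phase II specification in Step 3, the sketch below re-keys each count by word with a trailing 1, then sums those 1s in the reducer to obtain the document frequency. The names mapper2_emit and reducer2_tfidf are assumptions, and the input is a hand-written subset of the Phase I counts for the Step 1 data.

```python
import math
from collections import defaultdict

def mapper2_emit(phase1_counts):
    """Re-key by word: emit (word, (doc_id, count, 1))."""
    for (word, doc_id), count in phase1_counts.items():
        yield word, (doc_id, count, 1)

def reducer2_tfidf(pairs, total_docs):
    """Group by word, sum the trailing 1s to get d, then score each document."""
    grouped = defaultdict(list)
    for word, value in pairs:
        grouped[word].append(value)
    scores = {}
    for word, values in grouped.items():
        doc_freq = sum(one for _, _, one in values)
        idf = math.log(total_docs / doc_freq)
        scores[word] = [(doc_id, count * idf) for doc_id, count, _ in values]
    return scores

# Subset of the Phase I counts for the Step 1 data (written out by hand)
phase1_counts = {('dogs', 1): 1, ('dogs', 2): 1, ('my', 3): 2}
scores = reducer2_tfidf(mapper2_emit(phase1_counts), total_docs=3)
print(scores)
```

The result is keyed by word, matching the (<word>, <document ID, TF-IDF>) shape in Step 3, whereas reducer2 above keys its dictionary by (doc_id, word).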

## Execute the MapReduce tasks

In [30]:
word_doc_counts = mapper1(data)
word_doc_freq = reducer1(word_doc_counts)
doc_word_counts = mapper2(word_doc_freq)  # per-document totals; not used by reducer2
total_docs = len(data)
tf_idf_scores = reducer2(word_doc_freq, total_docs)

# Print the results

In [31]:
for (doc_id, word), tf_idf in tf_idf_scores.items():
    print(f"Document ID: {doc_id}, Word: {word}, TF-IDF Score: {tf_idf}")

Document ID: 1, Word: i, TF-IDF Score: 0.4054651081081644
Document ID: 2, Word: i, TF-IDF Score: 0.4054651081081644
Document ID: 1, Word: love, TF-IDF Score: 1.0986122886681098
Document ID: 1, Word: dogs, TF-IDF Score: 0.4054651081081644
Document ID: 2, Word: dogs, TF-IDF Score: 0.4054651081081644
Document ID: 2, Word: hate, TF-IDF Score: 1.0986122886681098
Document ID: 2, Word: and, TF-IDF Score: 0.4054651081081644
Document ID: 3, Word: and, TF-IDF Score: 0.4054651081081644
Document ID: 2, Word: knitting, TF-IDF Score: 0.4054651081081644
Document ID: 3, Word: knitting, TF-IDF Score: 0.4054651081081644
Document ID: 3, Word: is, TF-IDF Score: 1.0986122886681098
Document ID: 3, Word: my, TF-IDF Score: 2.1972245773362196
Document ID: 3, Word: hobby, TF-IDF Score: 1.0986122886681098
Document ID: 3, Word: passion, TF-IDF Score: 1.0986122886681098
