<a href="https://colab.research.google.com/github/pejmanrasti/Big_Data/blob/main/02_Exe_MapReduce.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2>MapReduce Mini-Project: Analyzing Amazon Movie Reviews</h2>

<p>
In this exercise, you will work as a data engineer for a streaming platform.
Your goal is to perform several analytics tasks on a free and publicly
available dataset of Amazon Movie Reviews using MapReduce in Hadoop.
</p>

<p>
You will complete four tasks:
</p>

<ol>
  <li><b>Count total number of reviews per movie</b></li>
  <li><b>Compute average rating per movie</b></li>
  <li><b>Extract frequent keywords from reviews</b></li>
  <li><b>Join average ratings with top keywords</b></li>
</ol>

<p>
For each task, you will write a MapReduce program (Python with Hadoop Streaming, or Java)
and run it using Hadoop in local mode. Your final outputs will help the
company understand which movies are popular, how viewers rate them, and what
keywords often appear in the reviews.
</p>
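<p>
The map, shuffle, and reduce contract behind these programs can be sketched in
a few lines of plain Python. This is a minimal local stand-in for what Hadoop
does between the two phases, not Hadoop itself. The names
<code>run_local</code>, <code>word_mapper</code>, and <code>sum_reducer</code>
are hypothetical and chosen only for illustration.
</p>

```python
from itertools import groupby


def run_local(mapper, reducer, lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    # Map phase: every input line may yield zero or more (key, value) pairs.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle phase: sort by key so that equal keys end up adjacent.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: hand each key and all of its values to the reducer.
    results = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        results.extend(reducer(key, [value for _, value in group]))
    return results


def word_mapper(line):
    # Hypothetical example mapper: one (word, 1) pair per token.
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key, values):
    # Hypothetical example reducer: total of all values for the key.
    yield key, sum(values)


result = run_local(word_mapper, sum_reducer, ["a b a", "b c"])
print(result)
```

<p>
In real Hadoop Streaming, the mapper and reducer are separate scripts that read
lines from standard input and write tab-separated key/value lines to standard
output. The reducer receives one stream sorted by key rather than pre-grouped
values, so it has to detect key boundaries itself.
</p>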

<h2>About the Dataset</h2>

<p>
We will use the <b>Amazon Movies &amp; TV 5-core dataset</b>, which is publicly
available and contains movie reviews from Amazon. Each entry in the dataset
is stored as a JSON object with fields such as:
</p>

<ul>
  <li><code>reviewerID</code> – the ID of the reviewer</li>
  <li><code>asin</code> – unique movie identifier</li>
  <li><code>reviewText</code> – full written review</li>
  <li><code>overall</code> – the star rating (1 to 5)</li>
  <li><code>vote</code> – how many users found the review helpful</li>
  <li><code>category</code> – always “Movies &amp; TV” in this dataset</li>
</ul>

<p>
You will download the dataset and inspect a few records to understand its
structure before starting the tasks.
</p>

In [None]:
import gzip
import json
import os
import sys # Import sys for printing warnings to stderr

# -------------------------------------------------------------------
# 1) Download the SMALL Movies & TV dataset (correct version)
# -------------------------------------------------------------------
print("Downloading SMALL Movies & TV 5-core dataset...")

URL = "https://jmcauley.ucsd.edu/data/amazon_v2/categoryFilesSmall/Movies_and_TV_5.json.gz"
FILE_GZ = "Movies_and_TV_small.json.gz"

!wget --no-check-certificate -O {FILE_GZ} {URL}

if os.path.getsize(FILE_GZ) == 0:
    raise ValueError("Downloaded file is empty!")

print("Download complete.\n")

# -------------------------------------------------------------------
# 2) Load JSON data (each line is a JSON object)
# -------------------------------------------------------------------
print("Loading JSON data from JSON Lines format...")

data = []
with gzip.open(FILE_GZ, "rt", encoding="utf-8") as f:
    for line_num, line in enumerate(f, 1):
        line = line.strip()
        if line: # Only process non-empty lines
            try:
                data.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Warning: Could not decode JSON on line {line_num}: {line}. Error: {e}", file=sys.stderr)
                # Continue to the next line to be robust against malformed lines
                continue

print(f"Total records loaded: {len(data)}") # Should be ~3.4 million records
print()

# -------------------------------------------------------------------
# 3) Convert to JSON-LINES format for MapReduce (if not already done)
#    This step ensures 'movies.json' is a clean JSON-Lines file.
# -------------------------------------------------------------------
print("Converting to JSON-lines format (outputting to movies.json with 900,000 records)...")

# Limit to 900,000 records to reduce running time on Colab (on a real cluster, remove this limit)
limited_data = data[:900000]

with open("movies.json", "w", encoding="utf-8") as out:
    for entry in limited_data:
        out.write(json.dumps(entry) + "\n")

print(f"Conversion complete. Saved as movies.json with {len(limited_data)} records\n")

# -------------------------------------------------------------------
# 4) Preview
# -------------------------------------------------------------------
print("Sample entries:\n")

with open("movies.json", "r", encoding="utf-8") as f:
    for i in range(3):
        line = f.readline()
        if not line: # Check for end of file
            print("Not enough lines in movies.json to display 3 samples.")
            break
        print(json.loads(line))

<h2>Task 1 — Count Total Number of Reviews per Movie</h2>

<p>
Your first task is to count how many reviews each movie has received. You will
write a MapReduce program where:
</p>

<ul>
  <li>The <b>mapper</b> reads each JSON record, extracts the <code>asin</code>
      field, and emits <code>(asin, 1)</code>.</li>
  <li>The <b>reducer</b> sums the counts for each movie and outputs
      <code>(asin, total_reviews)</code>.</li>
</ul>

<p>
This task is conceptually similar to a word count, but applied to movie IDs.
Complete the mapper and reducer code in the following cell.
</p>
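<p>
As a starting point, the mapper and reducer described above could look roughly
like the sketch below. It assumes the Hadoop Streaming convention of
tab-separated <code>key\tvalue</code> lines and reducer input sorted by key.
The function names and the three sample records are made up for illustration.
</p>

```python
import json
from itertools import groupby


def map_review_counts(lines):
    """Mapper: emit 'asin<TAB>1' for every parseable review line."""
    for line in lines:
        try:
            record = json.loads(line)
            asin = record["asin"]
        except (ValueError, KeyError, TypeError):
            continue  # skip blank, malformed, or incomplete records
        yield f"{asin}\t1"


def reduce_review_counts(sorted_lines):
    """Reducer: input must be sorted by key; sum the counts per asin."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for asin, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{asin}\t{total}"


# Made-up sample records; sorted() stands in for Hadoop's shuffle phase.
sample = [
    '{"asin": "B001", "overall": 5.0}',
    '{"asin": "B002", "overall": 3.0}',
    '{"asin": "B001", "overall": 4.0}',
]
task1_output = list(reduce_review_counts(sorted(map_review_counts(sample))))
print(task1_output)
```

<p>
In a Streaming job, each function would live in its own script, be fed
<code>sys.stdin</code>, and print every yielded line.
</p>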

In [None]:
# Write your Mapper and Reducer code for Task 1 here.
# You may use Python Hadoop Streaming or Java MapReduce.

<h2>Task 2 — Compute Average Rating per Movie</h2>

<p>
In this task, you will compute the <b>average rating</b> for each movie.
</p>

<p>The mapper should:</p>
<ul>
  <li>Extract <code>asin</code> and <code>overall</code> (rating)</li>
  <li>Emit <code>(asin, rating)</code></li>
</ul>

<p>The reducer should:</p>
<ul>
  <li>Sum all ratings for each movie</li>
  <li>Count how many ratings were received</li>
  <li>Compute and output the average rating</li>
</ul>

<p>
Use a MapReduce job to generate a list of movies with their average ratings.
</p>
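<p>
One possible shape for this job is sketched below, again assuming tab-separated
Streaming lines and reducer input sorted by key. The function names, the
two-decimal output format, and the sample records are assumptions made for
illustration.
</p>

```python
import json
from itertools import groupby


def map_ratings(lines):
    """Mapper: emit 'asin<TAB>rating' for every review with a usable rating."""
    for line in lines:
        try:
            record = json.loads(line)
            asin = record["asin"]
            rating = float(record["overall"])
        except (ValueError, KeyError, TypeError):
            continue  # skip blank, malformed, or incomplete records
        yield f"{asin}\t{rating}"


def reduce_average(sorted_lines):
    """Reducer: keep a running sum and count per asin, then divide."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for asin, group in groupby(parsed, key=lambda kv: kv[0]):
        total, count = 0.0, 0
        for _, rating in group:
            total += float(rating)
            count += 1
        yield f"{asin}\t{total / count:.2f}"


# Made-up sample records; sorted() stands in for Hadoop's shuffle phase.
sample = [
    '{"asin": "B001", "overall": 5.0}',
    '{"asin": "B002", "overall": 3.0}',
    '{"asin": "B001", "overall": 4.0}',
]
task2_output = list(reduce_average(sorted(map_ratings(sample))))
print(task2_output)
```

<p>
Note that this reducer cannot be reused as a combiner as written: an average of
partial averages is not the overall average. A combiner would have to pass
along <code>(sum, count)</code> pairs instead.
</p>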

In [None]:
# Write your Mapper and Reducer code for Task 2 here.
# You may use Python Hadoop Streaming or Java MapReduce.

<h2>Task 3 — Extract Frequent Keywords from Reviews</h2>

<p>
Now you will perform text analysis on the <code>reviewText</code> field.
Your task is to extract meaningful keywords for each movie.
</p>

<p>The mapper should:</p>
<ul>
  <li>Clean and tokenize the text</li>
  <li>Remove punctuation and stopwords</li>
  <li>Emit <code>(asin:word, 1)</code> for each keyword</li>
</ul>

<p>The reducer should:</p>
<ul>
  <li>Sum the counts for each <code>(asin, word)</code> pair</li>
  <li>Output the total frequency of each keyword per movie</li>
</ul>

<p>
This task combines text preprocessing with distributed computation.
</p>
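<p>
The steps above might be sketched as follows. The stopword list here is a
deliberately tiny, hypothetical placeholder, and the tokenizer (lowercase
ASCII letters only, words longer than two characters) is one simple choice
among many. Function names and sample records are made up for illustration.
</p>

```python
import json
import re
from itertools import groupby

# Hypothetical, deliberately tiny stopword list; swap in a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
             "is", "it", "this", "that", "was", "i"}
TOKEN_RE = re.compile(r"[a-z]+")  # drops punctuation and digits


def map_keywords(lines):
    """Mapper: emit 'asin:word<TAB>1' for every kept token."""
    for line in lines:
        try:
            record = json.loads(line)
            asin = record["asin"]
            text = str(record.get("reviewText") or "")
        except (ValueError, KeyError, TypeError, AttributeError):
            continue  # skip blank, malformed, or incomplete records
        for word in TOKEN_RE.findall(text.lower()):
            if len(word) > 2 and word not in STOPWORDS:
                yield f"{asin}:{word}\t1"


def reduce_keyword_counts(sorted_lines):
    """Reducer: input must be sorted by key; sum counts per asin:word."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Made-up sample records; sorted() stands in for Hadoop's shuffle phase.
sample = [
    '{"asin": "B001", "reviewText": "A great movie, great cast!"}',
    '{"asin": "B002", "reviewText": "This was boring."}',
]
task3_output = list(reduce_keyword_counts(sorted(map_keywords(sample))))
print(task3_output)
```

<p>
Packing the movie ID and the word into a single <code>asin:word</code> key lets
the shuffle phase group identical pairs without any custom partitioner.
</p>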

In [None]:
# Write your Mapper and Reducer code for Task 3 here.

<h2>Task 4 — Join Ratings with Top Keywords</h2>

<p>
For this task, you will combine the results of Task 2 (average ratings) and
Task 3 (keyword frequencies) using a <b>reduce-side join</b>.
</p>

<p>
You will provide two inputs to your MapReduce job:
</p>

<ul>
  <li><b>Ratings file</b> (Task 2 output) with <code>(asin, average_rating)</code></li>
  <li><b>Keywords file</b> (Task 3 output) with <code>(asin, keyword, count)</code></li>
</ul>

<p>Each mapper should emit the <code>asin</code> as the key and tag the value with its source:</p>

<ul>
  <li><code>("R", rating)</code> for ratings</li>
  <li><code>("K", keyword:count)</code> for keywords</li>
</ul>

<p>
The reducer will receive all entries for a given movie and combine them to
produce an output containing:
</p>

<ul>
  <li>The movie identifier (<code>asin</code>)</li>
  <li>Its average rating</li>
  <li>Its most frequent keywords</li>
</ul>
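<p>
A reduce-side join along these lines could be sketched as below. It assumes the
ratings input has lines of the form <code>asin\trating</code> and the keywords
input has lines of the form <code>asin:word\tcount</code>. The function names,
the <code>TOP_N</code> cutoff, the inner-join behavior, and the sample lines
are all assumptions made for illustration.
</p>

```python
from itertools import groupby

TOP_N = 3  # assumed cutoff for "most frequent keywords"


def map_rating_line(line):
    """Tag a ratings line: 'asin<TAB>rating' -> 'asin<TAB>R<TAB>rating'."""
    asin, rating = line.rstrip("\n").split("\t")
    return f"{asin}\tR\t{rating}"


def map_keyword_line(line):
    """Tag a keywords line: 'asin:word<TAB>count' -> 'asin<TAB>K<TAB>word:count'."""
    key, count = line.rstrip("\n").split("\t")
    asin, word = key.split(":", 1)
    return f"{asin}\tK\t{word}:{count}"


def reduce_join(sorted_lines, top_n=TOP_N):
    """Reducer: combine the rating and the top keywords for each asin."""
    parsed = (line.rstrip("\n").split("\t", 2) for line in sorted_lines)
    for asin, group in groupby(parsed, key=lambda parts: parts[0]):
        rating, keywords = None, []
        for _, tag, value in group:
            if tag == "R":
                rating = value
            elif tag == "K":
                word, count = value.rsplit(":", 1)
                keywords.append((word, int(count)))
        if rating is None:
            continue  # inner join: skip movies that have no rating
        # Highest count first; ties broken alphabetically for stable output.
        keywords.sort(key=lambda wc: (-wc[1], wc[0]))
        top = ",".join(f"{word}:{count}" for word, count in keywords[:top_n])
        yield f"{asin}\t{rating}\t{top}"


# Made-up sample inputs; sorted() stands in for Hadoop's shuffle phase.
ratings = ["B001\t4.50", "B002\t3.00"]
keywords = ["B001:cast\t1", "B001:great\t2", "B001:movie\t1", "B002:boring\t1"]
tagged = [map_rating_line(l) for l in ratings] + [map_keyword_line(l) for l in keywords]
task4_output = list(reduce_join(sorted(tagged), top_n=2))
print(task4_output)
```

<p>
In a real job, a single mapper script would need some way to tell the two
inputs apart, for example by the shape of each line (keyword lines carry a
<code>:</code> in the key, rating lines do not).
</p>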

In [None]:
# Write your Mapper and Reducer code for Task 4 here.