# Lab 5 - Apache Spark Word Frequency Analysis

This notebook implements three tasks for word frequency analysis using Apache Spark:

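1. **Task 1**: Select the words starting with "ho", count them, and find their maximum frequency (`maxfreq`)
2. **Task 2**: Among the selected words, keep those with freq > 0.8 * maxfreq and save them
3. **Task 3**: Count how many words of the input file fall into each frequency group (100-wide bins, with 500 and above in the last group)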

**Input**: A text file with one `word\tfreq` record per line (the word and its frequency, separated by a tab)

**Output**: Printed statistics, plus the result files written to the two output folders
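
As a minimal sketch of this record format, a line can be split on the tab character into its word and its frequency. The helper name `parse_line` and the sample record are hypothetical; the cells below split the lines inline instead.

```python
from typing import Tuple


def parse_line(line: str) -> Tuple[str, float]:
    """Split one word<TAB>freq record into (word, frequency)."""
    word, freq = line.split("\t")
    return word, float(freq)


# Made-up sample record, for illustration only
print(parse_line("house\t120"))  # ('house', 120.0)
```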

## Import libraries and configuration

In [None]:
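from typing import Tuple
from pyspark import SparkConf, SparkContext

# Reuse the kernel's SparkContext if one exists, otherwise create it (the app name is arbitrary)
sc = SparkContext.getOrCreate(SparkConf().setAppName("Lab5"))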

## Parameters configuration

In [None]:
# Configuration of paths and parameters
inputPath = "SampleLocalFile.csv"  # For local environment
# inputPath = "/data/students/bigdata-01QYD/Lab2/"  # For HDFS environment
outputPath = "res_out_Lab5/"  # Task 2 output folder
outputPath2 = "res_out_Lab5_Task3/"  # Task 3 output folder
prefix = "ho"

## Reading input data

In [None]:
# Read input file
wordsFrequenciesRDD = sc.textFile(inputPath)

# Cache the RDD: it is read by both Task 1 and Task 3
wordsFrequenciesRDD.cache()

## Task 1: Filter words starting with "ho"

In [None]:
# Keep the lines whose word (the first field) starts with the prefix
selectedLinesRDD = wordsFrequenciesRDD.filter(lambda line: line.startswith(prefix))

# Cache: reused for the count, the maximum, and the Task 2 filter
selectedLinesRDD.cache()

In [None]:
# Calculate number of selected lines
numLines = selectedLinesRDD.count()
print(f"Number of selected lines (words starting with '{prefix}'): {numLines}")

In [None]:
# Calculate maximum frequency among selected lines
if numLines > 0:
    # Extract the frequency (second field) from each selected line
    freqRDD = selectedLinesRDD.map(lambda line: float(line.split("\t")[1]))

    # Reduce the frequencies to their maximum value
    maxfreq = freqRDD.reduce(lambda freq1, freq2: max(freq1, freq2))

    print(f"Maximum frequency among selected lines: {maxfreq}")
else:
    # Keep maxfreq a float so that it has the same type in both branches
    maxfreq = 0.0
    print(f"No words found starting with '{prefix}'")
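
To make the expected result of Task 1 concrete, the sketch below repeats the same filter, count, and reduce steps in plain Python on a few made-up records, so it runs without Spark. The sample data and the snake_case names are assumptions for illustration only.

```python
from functools import reduce

# Made-up records in the word<TAB>freq format
sample = ["home\t150", "hotel\t200", "car\t900", "hope\t40"]
prefix = "ho"

# Same predicate as the RDD filter: the line starts with the prefix
selected = [line for line in sample if line.startswith(prefix)]
num_lines = len(selected)

# Same pairwise max as the RDD reduce, guarded for the empty case
freqs = [float(line.split("\t")[1]) for line in selected]
maxfreq = reduce(lambda f1, f2: max(f1, f2), freqs) if freqs else 0.0

print(num_lines, maxfreq)  # 3 200.0
```

Note that `car` has the highest frequency in the sample but does not contribute to `maxfreq`, because the maximum is taken over the selected lines only.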

## Task 2: Filter most frequent words and save them

In [None]:
if maxfreq > 0:
    threshold = 0.8 * maxfreq
    print(f"Frequency threshold (0.8 * {maxfreq}): {threshold}")
    
    # Filter lines with frequency > 0.8 * maxfreq
    selectedLinesMaxFreqRDD = selectedLinesRDD.filter(
        lambda line: float(line.split("\t")[1]) > threshold
    )
else:
    print("Cannot calculate threshold (maxfreq = 0)")
    selectedLinesMaxFreqRDD = sc.emptyRDD()
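
As a worked example of the threshold rule, the plain-Python sketch below applies the same strict comparison to made-up records, assuming `maxfreq` is 200. It is only meant to show which lines survive; the sample data is hypothetical.

```python
# Made-up records that all start with the prefix
selected = ["home\t150", "hotel\t200", "hope\t40", "horse\t170", "house\t160"]
maxfreq = 200.0  # assumed result of Task 1 for this sample

threshold = 0.8 * maxfreq

# Strictly greater than the threshold, as in the RDD filter
words = [
    line.split("\t")[0]
    for line in selected
    if float(line.split("\t")[1]) > threshold
]

print(words)  # ['hotel', 'horse']
```

Because the comparison is strict, `house` is dropped even though its frequency equals the threshold.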

In [None]:
# Count number of selected lines
numLinesMaxfreq = selectedLinesMaxFreqRDD.count()
print(f"Number of lines with freq > 0.8*maxfreq: {numLinesMaxfreq}")

In [None]:
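if numLinesMaxfreq > 0:
    # Select only words (first field)
    selectedWordsRDD = selectedLinesMaxFreqRDD.map(lambda line: line.split("\t")[0])
else:
    # Define the variable anyway so that later cells never hit a NameError
    selectedWordsRDD = sc.emptyRDD()
    print("No words meet the frequency threshold")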

In [None]:
# Save selected words to output folder
if numLinesMaxfreq > 0:
    try:
        selectedWordsRDD.saveAsTextFile(outputPath)
        print(f"Words saved successfully to: {outputPath}")
    except Exception as e:
        print(f"Error saving to {outputPath}: {e}")
else:
    print("No words to save")

## Task 3: Frequency distribution in groups

In [None]:
def compute_group(line: str) -> Tuple[str, int]:
    """
    Map an input line to a (group label, 1) pair based on its frequency:
    - Group 0: [0, 100)
    - Group 1: [100, 200)
    - Group 2: [200, 300)
    - Group 3: [300, 400)
    - Group 4: [400, 500)
    - Group 5: [500, +inf)
    """
    fields = line.split('\t')
    # Parse as float, as in Tasks 1 and 2, so non-integer frequencies do not raise
    freq = float(fields[1])

    if freq >= 500:
        group = 5
    else:
        # Floor division by the bin width gives the group index
        group = int(freq // 100)

    return (f'Group{group}', 1)

# Calculate RDD with pairs (group, 1)
groupPairRDD = wordsFrequenciesRDD.map(compute_group)

In [None]:
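# Sum the 1s of each group with reduceByKey to get the number of words per group
countPerGroupPairRDD = groupPairRDD.reduceByKey(lambda v1, v2: v1 + v2)

# Sort by key; the labels Group0..Group5 sort correctly as strings because each index is a single digit
sortedCountPerGroupRDD = countPerGroupPairRDD.sortByKey()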
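
The map, reduceByKey, and sortByKey steps above can be mirrored in plain Python to check the expected shape of the result. This is a sketch on made-up records; `group_label` is a hypothetical helper, and frequencies are parsed with `float` here so that non-integer values also work.

```python
from collections import Counter


def group_label(freq: float) -> str:
    """Return the label of the 100-wide bin, with 500 and above in Group5."""
    return f"Group{min(int(freq // 100), 5)}"


# Made-up records in the word<TAB>freq format
sample = ["home\t150", "hotel\t200", "car\t900", "hope\t40", "sea\t520"]

# Counter plays the role of map to (group, 1) followed by reduceByKey
counts = Counter(group_label(float(line.split("\t")[1])) for line in sample)

# sorted on the (label, count) pairs plays the role of sortByKey
result = sorted(counts.items())

print(result)
# [('Group0', 1), ('Group1', 1), ('Group2', 1), ('Group5', 2)]
```

Groups that contain no words (`Group3` and `Group4` in this sample) do not appear in the result at all, which is also how `reduceByKey` behaves: it only emits keys that occur in the input.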

In [None]:
# Save frequency distribution results
try:
    sortedCountPerGroupRDD.saveAsTextFile(outputPath2)
    print(f"\nFrequency distribution saved to: {outputPath2}")
except Exception as e:
    print(f"Error saving to {outputPath2}: {e}")
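
`saveAsTextFile` raises an error when the output folder already exists, so re-running the save cells ends up in the `except` branch unless the folder is removed first. For local runs, a minimal sketch of such a cleanup is shown below; `remove_local_output` is a hypothetical helper that only handles the local filesystem (on HDFS the folder would be removed with `hdfs dfs -rm -r` instead).

```python
import shutil
from pathlib import Path


def remove_local_output(path: str) -> bool:
    """Delete a local output folder if it exists; return True if it was removed."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Calling `remove_local_output(outputPath)` and `remove_local_output(outputPath2)` before the corresponding save cells would make the notebook safe to re-run locally, at the cost of discarding the previous results.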