# Lab 1: Introduction to Big Data Processing




## Introduction
Big data processing involves techniques to efficiently handle and analyze datasets too large to fit into memory. This lab focuses on understanding partitioning, aggregation, sorting, and distributed systems concepts foundational to tools like MapReduce, Hadoop, and Spark.

## Helper: Timed Decorator
To evaluate the execution time of each function systematically, we can create a reusable timed decorator.
The decorator logs the execution time of any function it wraps.
Here’s the full implementation:

In [None]:
import time
from functools import wraps

function_perf_tracker = {}

def timed(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
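        # perf_counter() is a monotonic, high-resolution clock intended for timing code
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start_time
        print(f"Function '{func.__name__}' executed in {elapsed:.4f} seconds.")
        # Record the latest run so the timings can be compared at the end of the lab
        function_perf_tracker[func.__name__] = elapsed
        return result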
    return wrapper


## Exercise 1: Searching via Partitioning
Efficiently search for integers in a large dataset using both naive (linear) and optimized (partitioned) approaches.



### Step 1: Dataset Generation

Use Python's random module to generate the dataset:

In [None]:
import random

# Generate dataset
with open("search_data.txt", "w") as file:
    for _ in range(900000):
        file.write(f"{random.randint(1, 1000000)}\n")


### Step 2: Linear Search

The naive approach scans the dataset sequentially:

In [None]:
@timed
def linear_search(filename, targets):
    results = []
    # TODO: Implement the linear search logic
    # Hint: Read the file line by line, check if the number is in targets, and store "YES" or "NO"
    return results

# Example usage
targets = [5, 1000, 250000, 750000, 999999]
print(linear_search("search_data.txt", targets))


Function 'linear_search' executed in 0.0000 seconds.
[]


### Step 3: Partitioned Search

Partition the dataset into 1,000 smaller files:

In [None]:
def partition_dataset(input_file, partitions=1000):
    # TODO: Implement dataset partitioning
    # Hint: Write numbers to different files based on a hash function, e.g., num % partitions
    pass

partition_dataset("search_data.txt")


**Search in the relevant partition file:**

In [None]:
@timed
def partitioned_search(partitions, targets):
    results = []
    # TODO: Implement partitioned search logic
    # Hint: Open only the relevant partition file and search for the target number
    return results

print(partitioned_search(1000, targets))


Function 'partitioned_search' executed in 0.0000 seconds.
[]


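### Key Takeaway
Partitioning shrinks the search space: with \(m\) partitions and a well-spread hash, each lookup scans roughly \(1/m\) of the data. This mimics how distributed file systems avoid touching data that cannot contain the answer.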
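
As an illustration of the idea, here is a minimal sketch of hash partitioning followed by a partition-local search. It is not the reference solution: the helper names `partition_numbers` and `search_partitions`, the `part_<i>.txt` file naming, and the tiny demo dataset are all assumptions made for this example.

```python
import os
import tempfile

def partition_numbers(input_file, out_dir, partitions=10):
    """Write each integer to the partition file selected by num % partitions."""
    # Hypothetical naming scheme: part_0.txt, part_1.txt, ...
    handles = [open(os.path.join(out_dir, f"part_{i}.txt"), "w")
               for i in range(partitions)]
    try:
        with open(input_file) as src:
            for line in src:
                line = line.strip()
                if line:
                    handles[int(line) % partitions].write(line + "\n")
    finally:
        for handle in handles:
            handle.close()

def search_partitions(out_dir, targets, partitions=10):
    """Return "YES"/"NO" per target, scanning only the partition it hashes to."""
    results = []
    for target in targets:
        path = os.path.join(out_dir, f"part_{target % partitions}.txt")
        with open(path) as part:
            found = any(int(line) == target for line in part)
        results.append("YES" if found else "NO")
    return results

# Tiny throwaway dataset (assumed values, for demonstration only)
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "numbers.txt")
with open(demo_file, "w") as f:
    f.write("\n".join(str(n) for n in [3, 14, 15, 92, 65, 35]) + "\n")

partition_numbers(demo_file, demo_dir, partitions=4)
demo_results = search_partitions(demo_dir, [15, 16, 92], partitions=4)
print(demo_results)  # ['YES', 'NO', 'YES']
```

This sketch keeps one file handle open per partition while writing. With 1,000 partitions that may approach or exceed the per-process open-file limit on some systems, in which case buffering lines in memory and writing partitions in batches is one possible workaround.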

## Exercise 2: Grouping and Aggregation

Aggregate data by grouping keys, using naive and partitioned methods.

### Step 1: Dataset Generation


In [None]:
# Generate dataset
with open("group_data.txt", "w") as file:
    for _ in range(30000000):
        k = random.randint(1, 7)
        v = random.randint(1, 1000)
        file.write(f"{k},{v}\n")


### Step 2: Naive Grouping

Read the file and compute aggregation:

In [None]:
from collections import defaultdict

@timed
def naive_grouping(filename):
    aggregation = defaultdict(int)
    # TODO: Implement naive grouping logic
    # Hint: Read the file and aggregate values for each key (k)
    return sorted(aggregation.items())

print(naive_grouping("group_data.txt"))


Function 'naive_grouping' executed in 0.0000 seconds.
[]


### Step 3: Partitioned Grouping

**Partition the dataset:**


In [None]:
def partition_group_data(input_file, partitions=10):
    # TODO: Implement partitioning logic for grouping
    # Hint: Write (k, v) pairs to files based on the hash function H(k) = k % partitions
    pass

partition_group_data("group_data.txt")

**Aggregate the partitions:**

In [None]:
@timed
def partitioned_grouping(partitions):
    final_aggregation = defaultdict(int)
    # TODO: Implement partitioned grouping logic
    # Hint: Process each partition file and combine results
    return sorted(final_aggregation.items())

print(partitioned_grouping(10))


Function 'partitioned_grouping' executed in 0.0000 seconds.
[]


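### Key Takeaway
Because every key lands in exactly one partition, each partition can be aggregated independently and the partial results simply combined. This simulates distributed "reduce" operations on large-scale data.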
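
The following is one possible sketch of partitioned grouping, assuming the aggregation is a sum of values per key and using the hash H(k) = k % partitions from the hint. The helper names `partition_pairs` and `aggregate_partitions`, the `group_part_<i>.txt` file names, and the demo data are hypothetical.

```python
import os
import tempfile
from collections import defaultdict

def partition_pairs(input_file, out_dir, partitions=3):
    """Route each "k,v" line to the partition file chosen by k % partitions."""
    handles = [open(os.path.join(out_dir, f"group_part_{i}.txt"), "w")
               for i in range(partitions)]
    try:
        with open(input_file) as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                key = int(line.split(",", 1)[0])
                handles[key % partitions].write(line)
    finally:
        for handle in handles:
            handle.close()

def aggregate_partitions(out_dir, partitions=3):
    """Sum values per key inside each partition, then combine the partial results."""
    totals = {}
    for i in range(partitions):
        partial = defaultdict(int)
        with open(os.path.join(out_dir, f"group_part_{i}.txt")) as part:
            for line in part:
                k, v = line.split(",")
                partial[int(k)] += int(v)
        # A key hashes to exactly one partition, so partial results never collide
        totals.update(partial)
    return sorted(totals.items())

# Tiny throwaway dataset (assumed values, for demonstration only)
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "pairs.txt")
with open(demo_file, "w") as f:
    f.write("1,10\n2,20\n1,5\n3,7\n2,1\n")

partition_pairs(demo_file, demo_dir, partitions=3)
demo_totals = aggregate_partitions(demo_dir, partitions=3)
print(demo_totals)  # [(1, 15), (2, 21), (3, 7)]
```

Each call to the inner loop only needs memory for the keys of one partition, which is the property a distributed reducer relies on.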

## Exercise 2bis: Sorting and Grouping by Key

Perform grouping and aggregation on a sorted dataset by key, simulating how sorting helps optimize grouping operations in big data systems.

### Step 1: Dataset Generation
Generate the sorted dataset:

In [None]:
import random


def generate_sorted_group_data(file_path, size=3000000):
    """
    Generate a dataset of (k, v) pairs and save it sorted by k.

    Args:
        file_path (str): File to save the dataset.
        size (int): Number of (k, v) pairs to generate.
    """
    dataset = [(random.randint(1, 7), random.randint(1, 1000)) for _ in range(size)]
    dataset.sort(key=lambda x: x[0])  # Sort by k
    with open(file_path, "w") as file:
        file.writelines(f"{k},{v}\n" for k, v in dataset)
    print(f"Generated sorted dataset and saved to {file_path}.")

generate_sorted_group_data("sorted_group_data.txt")

### Step 2: Iterator-based Grouping

- Use Python iterators to simulate streaming access to the dataset file:
   - Wrap the dataset file in an iterator that yields one line at a time.
   - This hides file handling from the processing logic and keeps memory use low, even for large datasets.
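
A minimal sketch of this streaming idea, assuming the file is already sorted by key and that the aggregation is a sum. The generator `read_pairs`, the function `sum_by_key_groupby`, and the demo file are hypothetical names introduced for this example.

```python
import os
import tempfile
from itertools import groupby

def read_pairs(file_path):
    """Yield (k, v) integer pairs lazily, one line at a time."""
    with open(file_path) as f:
        for line in f:
            k, v = line.split(",")
            yield int(k), int(v)

def sum_by_key_groupby(file_path):
    """Sum values per key; correct only if the file is sorted by key."""
    pairs = read_pairs(file_path)
    # groupby emits one group per run of consecutive equal keys
    return {k: sum(v for _, v in group)
            for k, group in groupby(pairs, key=lambda pair: pair[0])}

# Tiny sorted demo file (assumed values, for demonstration only)
demo_path = os.path.join(tempfile.mkdtemp(), "sorted_demo.txt")
with open(demo_path, "w") as f:
    f.write("1,4\n1,6\n2,3\n3,1\n3,2\n")

groupby_totals = sum_by_key_groupby(demo_path)
print(groupby_totals)  # {1: 10, 2: 3, 3: 3}
```

Note that `itertools.groupby` only merges adjacent equal keys, so on an unsorted file the same key would appear in several groups and later groups would overwrite earlier ones in the result.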

In [None]:
from itertools import groupby

@timed
def iterator_based_grouping(file_path):
    """
    Group and sum values by key using an iterator over a sorted dataset.

    Args:
        file_path (str): Path to the sorted dataset.

    Returns:
        dict: Aggregated sum of values for each key.
    """
    aggregation = {}
    # TODO: Group and sum values by key using an iterator over the sorted dataset.
    return aggregation

sorted_aggregation_data = iterator_based_grouping("sorted_group_data.txt")
sorted_aggregation_data = sorted(sorted_aggregation_data.items())
print(sorted_aggregation_data)

### Step 3: Grouping via Iteration
- Implement a function to iterate through the sorted dataset, grouping values by \( k \) as they appear. This avoids random access and simulates how sorting simplifies grouping in distributed systems.
   - Since the file is sorted by \( k \), you can group values without random access:
     1. Read the file line by line.
     2. Accumulate values for the current key \( k \).
     3. When \( k \) changes, save the results for the previous key and start accumulating for the new key.
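
The three steps above can be sketched as follows. This is an illustrative version rather than the expected solution: the function name `sum_by_key_streaming` and the demo file are assumptions, and the aggregation is assumed to be a sum.

```python
import os
import tempfile

def sum_by_key_streaming(file_path):
    """Sum values per key in a file sorted by key, holding one running total at a time."""
    totals = {}
    current_key = None
    running_sum = 0
    with open(file_path) as f:
        for line in f:                      # step 1: read line by line
            k_str, v_str = line.split(",")
            k, v = int(k_str), int(v_str)
            if k != current_key:            # step 3: the key changed
                if current_key is not None:
                    totals[current_key] = running_sum
                current_key, running_sum = k, 0
            running_sum += v                # step 2: accumulate for the current key
    # Flush the last key, which is never followed by a key change
    if current_key is not None:
        totals[current_key] = running_sum
    return totals

# Tiny sorted demo file (assumed values, for demonstration only)
demo_path = os.path.join(tempfile.mkdtemp(), "sorted_demo.txt")
with open(demo_path, "w") as f:
    f.write("1,4\n1,6\n2,3\n3,1\n3,2\n")

streaming_totals = sum_by_key_streaming(demo_path)
print(streaming_totals)  # {1: 10, 2: 3, 3: 3}
```

The final flush after the loop is easy to forget; without it the last key in the file would be missing from the result.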

In [None]:
@timed
def grouping_by_iteration(file_path):
    aggregation = {}
    # TODO: Iterate through the sorted dataset, grouping values by k as they appear.
    return aggregation

aggregation_data = grouping_by_iteration("sorted_group_data.txt")
aggregation_data = sorted(aggregation_data.items())
print(aggregation_data)

### Discussion
- Measure execution time for:
  1. **Naive Grouping**: Reads and groups an unsorted file by scanning and aggregating in memory.
  2. **Iterator-based Grouping**: Processes the sorted file line by line using the grouping-by-iteration method.
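- Compare the performance of the iterator-based grouping on the sorted file against the naive grouping on an unsorted dataset. Note that `group_data.txt` holds 30,000,000 pairs while `sorted_group_data.txt` holds 3,000,000, so compare time per record or regenerate both files at the same size.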

## Exercise 3: n-way Merge-Sort


### Step 1: Dataset Preparation
Generate three small sorted lists and save each one to its own file.


In [None]:
import random

# Generate sorted lists and save to files
lists = [
    sorted(random.randint(1, 100) for _ in range(10)),
    sorted(random.randint(50, 150) for _ in range(10)),
    sorted(random.randint(100, 200) for _ in range(10))
]

for i, lst in enumerate(lists):
    with open(f"list_{i}.txt", "w") as file:
        file.writelines(f"{x}\n" for x in lst)


### Step 2: Implement Pointer-Based Merge
The idea is to read numbers from all files sequentially, maintaining a pointer (or current position) in each file to track which number should be considered next.

In [None]:
@timed
def n_way_merge_pointer(files):
    # TODO: Implement pointer-based n-way merge logic
    # Hint: Maintain pointers for each file and iteratively find the smallest element
    pass

# Example usage
files = [f"list_{i}.txt" for i in range(3)]
merged_output = n_way_merge_pointer(files)
print(merged_output)


Function 'n_way_merge_pointer' executed in 0.0000 seconds.
None


**How It Works**
1. **Initialization:**

    - Open all files and read the first line from each file to initialize the pointers.
    - Store these first values in a pointers list.
2. **Find the Smallest Value:**

    - Use the built-in min() function to find the smallest value in the current pointers.
3. **Update Pointers:**

    - Determine which file contributed the smallest value and update its pointer by reading the next line from that file.
4. **Repeat:**

    - Continue until all files are fully read and no values remain in the pointers list.
5. **Write Output:**

    - Store the merged values in a list or write them directly to a file.
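
The steps above can be sketched as below. This is one possible reading of the description, not the official solution: `read_next` and `merge_sorted_files` are hypothetical helper names, each file is assumed to hold one integer per line with no blank lines, and the demo lists are made up.

```python
import os
import tempfile

def read_next(handle):
    """Return the next integer from the file, or None once the file is exhausted."""
    line = handle.readline()
    return int(line) if line.strip() else None

def merge_sorted_files(paths):
    """Merge sorted integer files by repeatedly taking the smallest current value."""
    handles = [open(p) for p in paths]
    try:
        # pointers[i] holds the next unread value of file i (None when exhausted)
        pointers = [read_next(h) for h in handles]
        merged = []
        while any(p is not None for p in pointers):
            # Find which file currently offers the smallest value
            idx = min((i for i, p in enumerate(pointers) if p is not None),
                      key=lambda i: pointers[i])
            merged.append(pointers[idx])
            # Advance only the pointer of the file that contributed the value
            pointers[idx] = read_next(handles[idx])
        return merged
    finally:
        for h in handles:
            h.close()

# Tiny demo files (assumed values, for demonstration only)
demo_dir = tempfile.mkdtemp()
demo_paths = []
for i, values in enumerate([[1, 4, 9], [2, 3, 10], [5]]):
    path = os.path.join(demo_dir, f"demo_list_{i}.txt")
    with open(path, "w") as f:
        f.writelines(f"{x}\n" for x in values)
    demo_paths.append(path)

merged_demo = merge_sorted_files(demo_paths)
print(merged_demo)  # [1, 2, 3, 4, 5, 9, 10]
```

Only one value per file is held in memory at any time, so the same approach works when the individual files are much larger than RAM, provided the output is written to a file instead of a list.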

## Exercise 4: Word Counting
Count word occurrences in a large text dataset using both naive (sequential) and partitioned (distributed) methods. This exercise simulates MapReduce-style word counting.


### Step 1: Dataset Preparation
Prepare a large text dataset (text_data.txt). For simplicity, let's create a file with random sentences.

In [None]:
import random

# Generate random sentences
def generate_text_file(filename, num_lines=1000000):
    words = ["apple", "banana", "orange", "grape", "pineapple", "kiwi", "melon"]
    with open(filename, "w") as file:
        for _ in range(num_lines):
            line = " ".join(random.choices(words, k=random.randint(5, 15)))
            file.write(f"{line}\n")

generate_text_file("text_data.txt")


### Step 2: Sequential Word Count
Count word occurrences by scanning the file line by line.

In [None]:
from collections import defaultdict

@timed
def sequential_word_count(filename):
    word_counts = defaultdict(int)
    # TODO: Implement naive word count logic
    # Hint: Read the file line by line and count occurrences of each word
    return sorted(word_counts.items())

# Example usage
word_counts = sequential_word_count("text_data.txt")
print(word_counts[:10])  # Print the first 10 (word, count) pairs in alphabetical order


Function 'sequential_word_count' executed in 0.0000 seconds.
[]


### Step 3: Partitioned Word Count
Use a hash function to divide the dataset into smaller files, process each partition, and combine results.


**Partition the Dataset**
Partition the dataset into smaller files based on a hash function.

In [None]:
def partition_text_file(input_file, partitions=10):
    # TODO: Implement partitioning logic for word count
    # Hint: Write words to files based on the hash function H(word) = sum(ord(c) for c in word) % partitions
    pass

partition_text_file("text_data.txt")


**Combine Results**
Aggregate word counts from all partitions.

In [None]:
@timed
def partitioned_word_count(partitions=10):
    combined_counts = defaultdict(int)
    # TODO: Implement partitioned word count logic
    # Hint: Process each partition and combine the results
    return sorted(combined_counts.items())

# Example usage
partitioned_counts = partitioned_word_count(10)
print(partitioned_counts[:10])  # Print the first 10 (word, count) pairs in alphabetical order


Function 'partitioned_word_count' executed in 0.0000 seconds.
[]


### Discussion
**Sequential Approach:** Processes the entire dataset in one pass and keeps one counter per distinct word in memory, so it slows down or fails once the vocabulary no longer fits in memory.  
**Partitioned Approach:** Divides the words across multiple files so that each partition holds only a subset of the distinct words. Partitions can be counted independently, which simulates parallel processing and reduces peak memory usage. This approach scales and forms the basis of MapReduce-style word counting.


### Key Takeaway
The partitioned approach is more scalable for large datasets, demonstrating how the "map" and "reduce" steps in distributed frameworks like Hadoop and Spark optimize big data processing.
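
As a rough sketch of the partitioned word count, the example below uses the hash from the hint, H(word) = sum(ord(c) for c in word) % partitions. The helper names `word_hash`, `partition_words`, and `count_partitions`, the `words_part_<i>.txt` file names, and the demo text are assumptions made for illustration.

```python
import os
import tempfile
from collections import Counter

def word_hash(word, partitions):
    """Deterministic hash from the exercise hint: sum of character codes modulo partitions."""
    return sum(ord(c) for c in word) % partitions

def partition_words(input_file, out_dir, partitions=4):
    """The "map"-like step: write every word to the partition file its hash selects."""
    handles = [open(os.path.join(out_dir, f"words_part_{i}.txt"), "w")
               for i in range(partitions)]
    try:
        with open(input_file) as src:
            for line in src:
                for word in line.split():
                    handles[word_hash(word, partitions)].write(word + "\n")
    finally:
        for handle in handles:
            handle.close()

def count_partitions(out_dir, partitions=4):
    """The "reduce"-like step: count each partition on its own, then combine."""
    combined = {}
    for i in range(partitions):
        counts = Counter()
        with open(os.path.join(out_dir, f"words_part_{i}.txt")) as part:
            for line in part:
                word = line.strip()
                if word:
                    counts[word] += 1
        # All occurrences of a word share one partition, so no counts are split
        combined.update(counts)
    return sorted(combined.items())

# Tiny demo text (assumed content, for demonstration only)
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "text.txt")
with open(demo_file, "w") as f:
    f.write("apple banana apple\nkiwi banana apple\n")

partition_words(demo_file, demo_dir, partitions=4)
demo_counts = count_partitions(demo_dir, partitions=4)
print(demo_counts)  # [('apple', 3), ('banana', 2), ('kiwi', 1)]
```

A hand-written hash is used here instead of Python's built-in `hash()` because string hashes are randomized per process by default, so partition assignments would differ between the partitioning run and a later counting run.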

## Detailed Analysis

1. **Exercise 1: Searching**
   - **Naive Approach**: Sequentially scans the entire file for each search query, which scales poorly as \(n\) (number of integers) grows.
   - **Partitioned Approach**: Limits the search to a smaller subset of the data by hashing, improving performance as \(m\) (number of partitions) increases.

2. **Exercise 2: Grouping**
   - **Naive Approach**: Directly aggregates values for each key in a single pass.
   - **Partitioned Approach**: Divides the dataset into \(m\) smaller groups, reducing memory overhead and simulating distributed parallel processing.

3. **Exercise 3: Merge-Sort**
   - **Pointer-Based Merge**: Simpler but less efficient, as each merge step compares all \(k\) current elements, one per input list, costing \(O(k)\) per output element.
   - **`heapq` Merge**: An alternative that maintains a min-heap of the current elements, so the smallest one is found in \(O(\log k)\) time per output element.

4. **Exercise 4: Word Count**
   - **Naive Approach**: Reads the dataset sequentially and counts words in memory, which becomes slow for very large datasets.
   - **Partitioned Approach**: Uses hashing to divide data into manageable chunks, allowing efficient in-memory counting for each partition.
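
The `heapq` merge mentioned under Exercise 3 has no code cell in this lab, so here is a short sketch of what it could look like using the standard library's `heapq.merge`. The helper names `read_ints` and `merge_sorted_files_heap` and the demo files are hypothetical.

```python
import heapq
import os
import tempfile

def read_ints(path):
    """Lazily yield the integers stored one per line in a sorted file."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield int(line)

def merge_sorted_files_heap(paths):
    """Merge sorted files with heapq.merge, which keeps a min-heap of the current heads."""
    return list(heapq.merge(*(read_ints(p) for p in paths)))

# Tiny demo files (assumed values, for demonstration only)
demo_dir = tempfile.mkdtemp()
demo_paths = []
for i, values in enumerate([[1, 4, 9], [2, 3, 10], [5]]):
    path = os.path.join(demo_dir, f"heap_list_{i}.txt")
    with open(path, "w") as f:
        f.writelines(f"{x}\n" for x in values)
    demo_paths.append(path)

heap_merged = merge_sorted_files_heap(demo_paths)
print(heap_merged)  # [1, 2, 3, 4, 5, 9, 10]
```

For three short lists the difference from the pointer-based merge is negligible; the heap only pays off as the number of input files grows.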



## Takeaways
- **Partitioning**: Improves scalability in Exercises 1, 2, and 4 by reducing the size of the data each step processes.
- **Parallelism**: The partitioned methods simulate distributed systems: partitions are independent of one another, so they could be processed on separate machines, which is what makes such systems scale for big data problems.


## Performance Track

In [None]:
for key, value in function_perf_tracker.items():
    print(f"{key}: {value}")

linear_search: 1.430511474609375e-06
partitioned_search: 1.1920928955078125e-06
naive_grouping: 8.106231689453125e-06
partitioned_grouping: 4.0531158447265625e-06
n_way_merge_pointer: 7.152557373046875e-07
sequential_word_count: 6.198883056640625e-06
partitioned_word_count: 6.4373016357421875e-06
