Part 1: Querying Users

Following the instructions in the README file, I opened two windows of a VM instance and executed the commands, storing the listener's output in a .txt file. I then wrote the following Python program to analyze the output file and, for each sender, compute the average number of times each of its unique queries was output. The result was 3.04. This is almost exactly the number I expected: I let the program run for just over 3 minutes, each sender has 60 unique queries it can output, and it outputs an average of one query per second. Three minutes is about 180 queries per sender, or 180 / 60 = 3 repetitions per unique query, so a run slightly longer than 3 minutes should give an average slightly above 3.

In [None]:
# Code to analyze the output for Part1
from collections import defaultdict

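def average_query_repetitions(file_path):
    """Print, for each sender, the mean number of times each unique query appears."""
    # Maps sender -> query -> number of times that query was seen
    sender_query_counts = defaultdict(lambda: defaultdict(int))

    with open(file_path, 'r') as file:
        for line in file:
            parts = line.split()
            # Skip blank or truncated lines that have no sender field
            if len(parts) < 4:
                continue
            sender = parts[3]   # fourth whitespace-separated field
            query = parts[-1]   # last field on the line

            sender_query_counts[sender][query] += 1

    for sender, queries in sender_query_counts.items():
        total_repetitions = sum(queries.values())
        unique_queries = len(queries)
        average_repetitions = total_repetitions / unique_queries if unique_queries > 0 else 0
        print(f"{sender}: Average repetitions per query = {average_repetitions:.2f}")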

# Printing the Average:
average_query_repetitions("part1output.txt")
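As a sanity check on the 3.04 figure, the expected average can be worked out directly from the run length. This is only a sketch of the arithmetic: the function name `expected_repetitions` and its parameters are made up for illustration, and it assumes each sender cycles through its 60 queries roughly evenly at one query per second.

```python
def expected_repetitions(run_seconds, queries_per_second=1.0, unique_queries=60):
    """Expected number of times each unique query is output by one sender.

    Assumes (hypothetically) that the sender spreads its output evenly
    across all of its unique queries.
    """
    total_queries = run_seconds * queries_per_second
    return total_queries / unique_queries


# Exactly 3 minutes gives 3.0 repetitions per unique query.
print(expected_repetitions(180))

# An average of 3.04 corresponds to a run of about 3.04 * 60 = 182.4 seconds.
print(round(expected_repetitions(182.4), 2))
```

Under these assumptions, an observed average of 3.04 corresponds to a run of roughly 182 seconds, which matches "just over 3 minutes".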


Part 2: Bloom Filter

All the following deliverables for this section will be submitted in the .zip folder:

- q6part2.py                The Python code used to complete all the steps, from reading in the bad words to encoding
                            the bit vector in base64.
- bit_vector_base64.txt     The text file containing the final bit vector.
- BloomFilter.py            The Python code, designed to run in Spark, that uses the previous two items to filter bad words
                            out of the input stream and output the clean version with the bad words removed.
- code_explanation.mp4      The short video of me demonstrating the Bloom filter in action.
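For readers without the .zip folder, the general idea behind these files can be sketched as follows. This is not the submitted q6part2.py or BloomFilter.py: the class name, the bit-vector size, the number of hashes, and the salted SHA-256 hashing are all assumptions chosen for illustration, and the sketch runs in plain Python rather than Spark.

```python
import base64
import hashlib


class BloomFilter:
    """Minimal Bloom filter; the sizes and hash scheme are illustrative assumptions."""

    def __init__(self, num_bits=8192, num_hashes=4, bits=None):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per position, packed eight to a byte
        self.bits = bytearray(bits) if bits is not None else bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Derive several hash positions by salting one hash with a seed
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word.lower()):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True if every position is set (false positives are possible)
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word.lower()))

    def to_base64(self):
        return base64.b64encode(bytes(self.bits)).decode("ascii")

    @classmethod
    def from_base64(cls, text, num_bits=8192, num_hashes=4):
        return cls(num_bits, num_hashes, base64.b64decode(text))


def clean(line, bloom):
    """Drop every word the filter reports as (probably) bad."""
    return " ".join(word for word in line.split() if word not in bloom)


bloom = BloomFilter()
for bad_word in ["darn", "heck"]:
    bloom.add(bad_word)

encoded = bloom.to_base64()                  # what a bit-vector text file could hold
restored = BloomFilter.from_base64(encoded)  # what the filtering side could load
print(clean("well darn it", restored))
```

The filter can report a clean word as bad (a false positive) but never misses a word that was added, which is why the bit-vector size and number of hashes matter in practice.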


Part 3: Counting Unique Users

For this section I modified the read_stdin.py file so that it implements a HyperLogLog algorithm to estimate the number of unique users. I also progressively changed the inputs in click-feeder.py so that the n_senders value kept getting larger while the delay mean and delay standard deviation kept getting smaller. Below is the Python code used to generate the graph itself. Submitted in the .zip folder:

- queries_over_time.png             Final Graph

In [None]:
import matplotlib.pyplot as plt

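def load_data(file_path):
    """Read 'elapsed time, unique users' rows from the log, skipping the header line."""
    elapsed_time = []
    unique_users = []
    with open(file_path, 'r') as file:
        next(file)  # skip the header row
        for line in file:
            if not line.strip():
                continue  # ignore blank lines
            # int() tolerates surrounding whitespace, so split on the bare comma
            seconds, users = line.strip().split(',')
            elapsed_time.append(int(seconds))
            unique_users.append(int(users))
    return elapsed_time, unique_users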

def plot_unique_users(file_path):
    elapsed_time, unique_users = load_data(file_path)
    
    plt.figure(figsize=(10, 6))
    plt.plot(elapsed_time, unique_users, marker='o', linestyle='-', markersize=5)
    plt.title("Estimated Unique Users Over Time")
    plt.xlabel("Total Elapsed Time (s)")
    plt.ylabel("Estimated Unique Users")
    plt.grid(True)
    plt.show()

plot_unique_users("output_log.txt")
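
The modified read_stdin.py itself is not reproduced in this write-up. As a rough illustration of the HyperLogLog technique mentioned at the start of this part, here is a minimal sketch. It is not the submitted code: the class name, the precision `p=10`, the SHA-1-based 64-bit hash, and the sample user IDs are all assumptions made for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch; p and the hash choice are illustrative assumptions."""

    def __init__(self, p=10):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)                  # first p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        # Harmonic-mean estimate over all registers
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting
        if estimate <= 2.5 * self.m and zeros > 0:
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


hll = HyperLogLog()
for i in range(5000):
    hll.add(f"user-{i}")        # hypothetical user IDs
print(hll.count())              # an estimate near 5000, not an exact count
```

With 1024 registers the typical relative error is about 1.04 / sqrt(1024), roughly 3%, and adding the same user repeatedly never changes the estimate, which is what makes the structure suitable for counting unique users in a stream.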
