# -*- coding: utf-8 -*-
"""Spring2024_Streaming_Word_Count_with_Apache_Spark_Streaming.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1kfEMsZrlHged3z4gbn9yznFHDvi5QGMI
"""
### Streaming Word Count with Apache Spark Streaming
**This project performs real-time word counting with Apache Spark Streaming, a framework for processing and analyzing streaming data. The application continuously processes incoming text streams, counts words, and visualizes the results as they arrive. The sections below outline its functionality and testing scenarios:**

Functionality Overview:
1. Data Streaming: Apache Spark Streaming ingests data streams from a source, such as a file system, in mini-batches, allowing for continuous processing.
2. Word Counting: The streaming application tokenizes the text data, removes stop words, and calculates the frequency of each word in the stream.
3. Real-time Visualization: The word counts are visualized dynamically, enabling users to observe changes in word frequencies as new data arrives.
4. Continuous Processing: The streaming application operates indefinitely, processing incoming data streams in real-time without interruption.
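
The steps above can be imitated without Spark to see what happens to each mini-batch. The following is a minimal pure-Python sketch, not the project's Spark code: the `STOP_WORDS` set, the `tokenize` and `count_batch` helpers, and the sample batches are all assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "and", "for", "with", "that"}

def tokenize(line):
    """Lowercase a line, split it into words, and drop stop words."""
    return [w for w in re.findall(r"\w+", line.lower()) if w not in STOP_WORDS]

def count_batch(lines):
    """Count word frequencies in one mini-batch of text lines."""
    return Counter(word for line in lines for word in tokenize(line))

# Each inner list stands in for the lines that arrive in one batch interval.
batches = [
    ["love me do", "all you need is love"],
    ["love and the rain"],
]

running_totals = Counter()
for number, batch in enumerate(batches, start=1):
    batch_counts = count_batch(batch)
    running_totals.update(batch_counts)  # carry state across batches
    print(f"batch {number}: {batch_counts.most_common(2)}")

print("running total for 'love':", running_totals["love"])
```

Keeping `running_totals` outside the loop is what turns independent per-batch counts into a continuously updated result.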

Testing Rate Limiting:
To test rate limiting, we can simulate a high influx of text data by continuously feeding the streaming application with a large volume of text. We can monitor the application's behavior to ensure that it handles the incoming data stream efficiently without overwhelming the system. Additionally, we can introduce delays or throttling mechanisms to observe how the application responds to varying data rates.
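
One way to vary the data rate is a small feeder script that drops text files into the watched directory with a configurable pause between them. A minimal sketch follows, assuming the application watches a directory such as `./data_chunks`; the `feed_chunks` name, its parameters, and the staging directory are hypothetical. Each file is written to a staging location first and then moved into place, since a file that is still being written when the stream scans the directory may be read incompletely.

```python
import os
import tempfile
import time

def feed_chunks(lines, target_dir, lines_per_file=100, delay_seconds=1.0):
    """Write `lines` into `target_dir` as numbered files, pausing between files.

    Returns the list of file paths that were created.
    """
    os.makedirs(target_dir, exist_ok=True)
    # Stage files next to the target so the final move stays on one filesystem.
    parent_dir = os.path.dirname(os.path.abspath(target_dir))
    staging_dir = tempfile.mkdtemp(prefix=".staging_", dir=parent_dir)
    written = []
    for start in range(0, len(lines), lines_per_file):
        name = f"chunk_{start // lines_per_file:05d}.txt"
        staged_path = os.path.join(staging_dir, name)
        with open(staged_path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(lines[start:start + lines_per_file]) + "\n")
        final_path = os.path.join(target_dir, name)
        os.replace(staged_path, final_path)  # move the finished file into place
        written.append(final_path)
        time.sleep(delay_seconds)  # lower this value to raise the data rate
    os.rmdir(staging_dir)
    return written
```

Calling `feed_chunks(all_lines, "./data_chunks", lines_per_file=50, delay_seconds=0.2)` and then repeating the run with a smaller delay gives a simple way to compare how the application behaves at different input rates.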

Database Change Monitoring:
While Apache Spark Streaming is primarily designed for processing real-time data streams, we can integrate it with database monitoring systems to track changes in underlying data sources. For example, we can monitor updates to a database table containing text data and trigger the streaming application to reprocess the affected data streams accordingly. This ensures that the streaming application remains synchronized with the underlying data source and reflects any changes in real-time.
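
A simple way to approximate change monitoring is to poll the table for rows added since the last check and hand only those rows to the streaming input. The sketch below uses SQLite from the Python standard library; the `lyrics` table, its `id` and `text` columns, and the `fetch_new_rows` helper are hypothetical, and the approach detects only inserted rows, not updates or deletes.

```python
import sqlite3

def fetch_new_rows(connection, last_seen_id):
    """Return (rows, new_last_seen_id) for rows with id greater than last_seen_id."""
    cursor = connection.execute(
        "SELECT id, text FROM lyrics WHERE id > ? ORDER BY id", (last_seen_id,)
    )
    rows = cursor.fetchall()
    if rows:
        last_seen_id = rows[-1][0]
    return rows, last_seen_id

# Demonstration with an in-memory database standing in for the real source.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE lyrics (id INTEGER PRIMARY KEY, text TEXT)")
connection.executemany(
    "INSERT INTO lyrics (text) VALUES (?)", [("first song",), ("second song",)]
)

rows, last_id = fetch_new_rows(connection, last_seen_id=0)
print(len(rows), "new rows; last id", last_id)

connection.execute("INSERT INTO lyrics (text) VALUES (?)", ("third song",))
rows, last_id = fetch_new_rows(connection, last_id)
print(len(rows), "new row; last id", last_id)
```

Each batch of new rows could then be written to the watched directory as a text file, so the streaming application processes database changes like any other incoming data.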

Additional Features:
We can enhance the streaming word count application by incorporating additional features such as fault tolerance, windowed computations, stateful processing, and integration with external systems for data ingestion and output. Experimenting with these features allows us to explore the full capabilities of Apache Spark Streaming and tailor the application to specific use cases and requirements.
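
As an illustration of a windowed computation, the following pure-Python sketch keeps word counts over the most recent batches only. It imitates the idea behind a sliding window and is not Spark code; the `SlidingWindowCounter` class and the window length of three batches are assumptions.

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Maintain word counts over the last `window_size` batches."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.batches = deque()
        self.totals = Counter()

    def add_batch(self, words):
        """Add one batch of words and evict the oldest batch if the window is full."""
        batch_counts = Counter(words)
        self.batches.append(batch_counts)
        self.totals.update(batch_counts)
        if len(self.batches) > self.window_size:
            expired = self.batches.popleft()
            self.totals.subtract(expired)
            self.totals += Counter()  # drop words whose count fell to zero
        return self.totals

window = SlidingWindowCounter(window_size=3)
for batch in (["love", "rain"], ["love"], ["sun"], ["sun", "moon"]):
    counts = window.add_batch(batch)
print(dict(counts))
```

Subtracting the expired batch avoids recounting the whole window on every update. In the Spark Streaming DStream API, the comparable operation is `reduceByKeyAndWindow`.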

The code below preprocesses and analyzes song lyrics read from a CSV file. It computes five metrics: word_count, average_word_length, most_common_word, most_common_word_count, and unique_word_count.

```python
import csv
import re
from collections import Counter

# Function to read CSV file
def read_csv_file(file_path):
    lyrics = []
    with open(file_path, 'r', encoding='iso-8859-1', newline='') as file:
        csv_reader = csv.reader(file)
        next(csv_reader)  # Skip header row
        for row in csv_reader:
            if len(row) > 4:  # Skip rows that are too short to contain lyrics
                lyrics.append(row[4])  # Assuming lyrics are in the 5th column
    return lyrics

# Function to preprocess text (tokenize, filter, remove stop words)
def preprocess_text(text):
    # Tokenize
    words = re.findall(r'\b\w+\b', text.lower())  # Extract lowercase runs of word characters (punctuation is dropped)
    
    # Filter out short words (length < 3 characters)
    words = [word for word in words if len(word) >= 3]
    
    # Remove stop words (you can define your own list of stop words)
    stop_words = set(['the', 'and', 'of', 'in', 'to', 'a', 'is', 'that', 'it', 'for', 'on', 'with', 'as'])
    words = [word for word in words if word not in stop_words]
    
    return words

# Function to perform word count and additional functionalities
def process_text(lyrics):
    # Flatten the list of preprocessed lyrics
    flattened_lyrics = [word for sublist in lyrics for word in sublist]
    
    # Word count
    word_count = len(flattened_lyrics)
    
    # Calculate average word length
    total_word_length = sum(len(word) for word in flattened_lyrics)
    average_word_length = total_word_length / word_count if word_count > 0 else 0
    
    # Find the most common word (guarding against empty input)
    word_freq = Counter(flattened_lyrics)
    most_common_word, most_common_word_count = word_freq.most_common(1)[0] if word_freq else (None, 0)
    
    # Count the number of unique words
    unique_word_count = len(word_freq)
    
    return word_count, average_word_length, most_common_word, most_common_word_count, unique_word_count

# Read CSV data
lyrics = read_csv_file('billboard_lyrics_1964-2015.csv')

# Process text
preprocessed_lyrics = [preprocess_text(text) for text in lyrics]
word_count, average_word_length, most_common_word, most_common_word_count, unique_word_count = process_text(preprocessed_lyrics)

# Display results
print("Total Word Count:", word_count)
print("Average Word Length:", average_word_length)
print("Most Common Word:", most_common_word, "(Count:", most_common_word_count, ")")
print("Number of Unique Words:", unique_word_count)
```

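This section defines a function, visualize_results, that draws a bar plot of the text analysis metrics. Because the metrics differ greatly in scale, small values such as the average word length may be barely visible next to the total word count.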

```python
import matplotlib.pyplot as plt

# Function to visualize the results
def visualize_results(word_count, average_word_length, most_common_word, most_common_word_count, unique_word_count):
    # Define the data to visualize
    data = {
        'Total Word Count': word_count,
        'Average Word Length': average_word_length,
        f'Count of Most Common Word ("{most_common_word}")': most_common_word_count,
        'Unique Word Count': unique_word_count
    }

    # Create a bar plot
    plt.figure(figsize=(10, 6))
    plt.bar(data.keys(), data.values(), color='skyblue')
    plt.xlabel('Metrics')
    plt.ylabel('Values')
    plt.title('Text Analysis Metrics')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

# Visualize the results
visualize_results(word_count, average_word_length, most_common_word, most_common_word_count, unique_word_count)
```
**The following section initializes the streaming process:**

### Streaming Word Count with Apache Spark Streaming
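
The snippets in this section refer to `ssc` and `process_stream`, neither of which is defined in this file. The following is a minimal sketch of how they might look, assuming PySpark is installed; the application name, the 5-second batch interval, the top-10 cutoff, the stop-word list, and the `tokenize` and `create_context` helpers are assumptions rather than the project's actual settings. This version prints the counts instead of plotting them. PySpark is imported inside `create_context` so that the helper functions can be exercised without a Spark installation.

```python
import re
from operator import add

# Illustrative stop-word list; adjust to match the preprocessing used elsewhere.
STOP_WORDS = {"the", "and", "for", "with", "that"}

def tokenize(line):
    """Split a line into lowercase words of 3+ characters, dropping stop words."""
    words = re.findall(r"\b\w+\b", line.lower())
    return [w for w in words if len(w) >= 3 and w not in STOP_WORDS]

def process_stream(rdd):
    """Print the ten most frequent words in one batch (one RDD) of the stream."""
    if rdd.isEmpty():
        return
    top_words = (
        rdd.flatMap(tokenize)
        .map(lambda word: (word, 1))
        .reduceByKey(add)
        .takeOrdered(10, key=lambda pair: -pair[1])
    )
    for word, count in top_words:
        print(f"{word}: {count}")

def create_context(batch_seconds=5):
    """Build a StreamingContext; the batch interval here is an assumed value."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StreamingWordCount")
    return StreamingContext(sc, batch_seconds)

# ssc = create_context()  # uncomment when running with PySpark available
```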

**The code below creates a DStream (Discretized Stream) by reading text files from a specified directory. The textFileStream method of the StreamingContext (ssc) creates the stream from the directory path passed as its argument. The application then picks up new text files as they appear in that directory and processes them batch by batch.**
```python
# Create DStream by reading text files from the directory
data_dir = "./data_chunks"
stream = ssc.textFileStream(data_dir)
```

**Next, we register a handler for each RDD (Resilient Distributed Dataset) in the stream. The foreachRDD method applies the process_stream function to every RDD, that is, to every batch of data received. The function extracts word counts from the batch and visualizes the results, so the analysis is refreshed each time new data arrives.**
```python
# Process each RDD in the stream
stream.foreachRDD(process_stream)
```

**Finally, we start the streaming computation by calling the start method on the StreamingContext object (ssc). The awaitTermination call then blocks the program until the computation terminates or is stopped manually.**
```python
# Start streaming
ssc.start()
# Wait for streaming to finish
ssc.awaitTermination()
```
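
Because `awaitTermination` blocks until the stream is stopped, experiments in a notebook are often easier with a bounded run. The sketch below assumes `ssc` is a PySpark `StreamingContext` and relies on its `awaitTerminationOrTimeout` and `stop` methods; the `run_for` helper and the 60-second duration are hypothetical.

```python
def run_for(ssc, seconds):
    """Start the streaming context, run for at most `seconds`, then stop it."""
    ssc.start()
    try:
        # Blocks until the context stops or the timeout elapses.
        ssc.awaitTerminationOrTimeout(seconds)
    finally:
        # Stop the underlying SparkContext too, letting received data finish processing.
        ssc.stop(stopSparkContext=True, stopGraceFully=True)

# run_for(ssc, 60)  # example call; requires a configured StreamingContext
```

A context that has been stopped cannot be restarted, so a new `StreamingContext` is needed for each run.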