# Word Cloud Generation from Text Data using PySpark

This notebook generates a word cloud from text data, using PySpark to count word frequencies and the `wordcloud` and `matplotlib` libraries to visualize them. The workflow has five steps:

1. **Initialization**: Importing necessary libraries and initializing a Spark session for distributed data processing.
2. **Data Processing**: Reading text data into a Spark DataFrame, converting it to an RDD, and performing text processing to count word frequencies.
3. **Data Exploration and Cleaning**: Exploring the word frequencies, removing specific unwanted words, and cleaning the dataset for better visualization.
4. **Preparation for Visualization**: Filtering and preparing the word frequency data to be suitable for generating a word cloud.
5. **Word Cloud Visualization**: Creating a visual representation of word frequencies to highlight the most prominent words in the text data.
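
The counting, cleaning, and filtering steps above can be sketched in plain Python on a tiny in-memory sample. This is only an illustration of the logic, not the notebook's Spark code: the sample rows, the `min_count` threshold, and the variable names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample rows standing in for the "text_signature" column.
sample_rows = [
    "transfer approve mint",
    "approve mint mint burn",
    None,  # missing values are skipped
    "transfer mint",
]

# Data processing: split each non-null row into words and count occurrences.
word_counts = Counter(
    word
    for row in sample_rows
    if row is not None
    for word in row.split()
)

# Cleaning: drop unwanted words; pop() with a default ignores absent keys.
for unwanted in ["transfer", "unknown"]:
    word_counts.pop(unwanted, None)

# Preparation: keep only words above a frequency threshold (illustrative value).
min_count = 1
filtered = {word: count for word, count in word_counts.items() if count > min_count}

print(filtered)
```

The resulting dictionary has the same word-to-count shape that `WordCloud.generate_from_frequencies` accepts later in the notebook.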



In [None]:
from pyspark.sql import SparkSession
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter

In [None]:
# Initialize Spark session
spark = SparkSession.builder.appName("word_cloud").getOrCreate()

# Load the CSV file into a Spark DataFrame
csv_file_path = 'combinedFiles/all_data.csv'
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)

In [None]:

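# Split each non-null "text_signature" value into words.
# split() with no argument splits on any whitespace and drops empty strings.
text_rdd = df.select("text_signature").rdd.flatMap(lambda x: x[0].split() if x[0] is not None else [])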

In [None]:
# Count the occurrences of each unique word
word_frequencies = text_rdd.countByValue()

In [None]:
len(word_frequencies)

In [None]:
for key, value in word_frequencies.items():
    print(f"{key}: {value}")

In [None]:
# Sort words by frequency, most frequent first
sorted_dict = dict(sorted(word_frequencies.items(), key=lambda x: x[1], reverse=True))

In [None]:
import csv

# Specify your CSV file name
csv_file = 'sorted_word_freq.csv'

# Write to CSV
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    # Write the header
    writer.writerow(['Key', 'Value'])

    # Write the dictionary data
    for key, value in sorted_dict.items():
        writer.writerow([key, value])

print(f"Sorted items written to {csv_file}")

In [None]:
keys_to_delete = ['transfer', 'unknown', '_SIMONdotBLACK_', 'setApprovalForAll', 'multicall', 'sendMultiSig']

In [None]:
# Show the count of each word that is about to be removed (None if absent)
for key in keys_to_delete:
    print(key, sorted_dict.get(key))

In [None]:
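# Remove the unwanted words; pop() with a default avoids a KeyError
# if a word is missing or this cell is run more than once
for key in keys_to_delete:
    sorted_dict.pop(key, None)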

In [None]:
# Filter the dictionary, keeping only key/value pairs whose value is > 1000
filtered_dict = {key: value for key, value in sorted_dict.items() if value > 1000}

In [None]:
len(filtered_dict)

In [None]:
# Generate the word cloud from the filtered word frequencies
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(filtered_dict)

# Display the WordCloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

# Create a bar chart of the most frequent words
# Plotting every unique word would be unreadable, so keep only the top N
top_n = 20
top_words = sorted(word_frequencies.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
unique_words_sorted = [word for word, _ in top_words]
counts_sorted = [count for _, count in top_words]

# Plot the bar chart
plt.figure(figsize=(12, 6))
plt.bar(unique_words_sorted, counts_sorted, color='skyblue')
plt.title('Word Frequencies')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.show()

# Stop the Spark session
spark.stop()
