# Exploratory Data Analysis on Textual Data

This notebook performs Exploratory Data Analysis (EDA) on textual data. It covers the full process: loading the data, cleaning it, and analyzing and visualizing it before and after cleaning. Libraries used include pandas, matplotlib, seaborn, nltk, TextBlob, WordCloud, spaCy, and tqdm.

After cleaning, the data is written to 'cleaned_section_data.csv'.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from textblob import TextBlob
from wordcloud import WordCloud
import nltk
import spacy
from tqdm import tqdm

# Load spaCy model for lemmatization
nlp = spacy.load("en_core_web_lg")

# Download NLTK resources for tokenization and stopwords
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer from 'punkt_tab'
nltk.download('stopwords')

### Function Definitions

Define helper functions for the two tasks that repeat across the notebook: plotting word frequencies and cleaning text.

In [None]:
# Function to plot word frequencies
def plot_word_frequencies(freq_dist, title, num_words=30):
    words, counts = zip(*freq_dist.most_common(num_words))
    plt.figure(figsize=(12, 6))
    sns.barplot(x=list(words), y=list(counts))
    plt.title(title)
    plt.xlabel('Word')
    plt.ylabel('Frequency')
    plt.xticks(rotation=45)
    plt.show()

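# Function to clean text data
def clean_text(text):
    if pd.isna(text):
        return ''
    # Lowercase the text and keep only alphanumeric characters and whitespace
    text = ''.join([char.lower() for char in text if char.isalnum() or char.isspace()])
    # Frequent terms to remove while cleaning
    other_terms = [
        "copyright",
        "2024",
        "logrhythm inc",
        "all rights reserved",
        "•",
        "powered by",
        "scroll viewport",
        "&",
        "atlassian confluence",
        "please note these errors can depend on your browser setup"
    ]
    # Multi-word terms span several tokens, so remove them from the text before tokenizing.
    # Collapsing and padding the whitespace lets each term match on word boundaries only.
    text = ' ' + ' '.join(text.split()) + ' '
    for term in other_terms:
        if ' ' in term:
            text = text.replace(f' {term} ', ' ')
    words = word_tokenize(text)
    # A set makes the per-word membership test fast
    stop_words = set(nltk.corpus.stopwords.words('english'))
    words = [word for word in words if word not in stop_words and word not in other_terms]
    # Lemmatize with spaCy and drop anything spaCy itself flags as a stopword
    lemmatized_tokens = [token.lemma_ for token in nlp(' '.join(words)) if not token.is_stop]
    return ' '.join(lemmatized_tokens)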
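
The term list in `clean_text` mixes single words with multi-word phrases. A phrase such as "all rights reserved" spans several tokens, so it has to be matched against the text rather than compared with individual tokens. The block below is a minimal standalone sketch of that phrase-level matching using only the standard library; `strip_phrases` and `PHRASES` are hypothetical names introduced for illustration, and the sample sentence is made up.

```python
import re

# Hypothetical subset of the boilerplate phrases listed in clean_text
PHRASES = ["all rights reserved", "powered by", "atlassian confluence"]

def strip_phrases(text, phrases=PHRASES):
    """Remove whole-phrase matches (case-insensitive) and collapse leftover whitespace."""
    for phrase in phrases:
        # \b keeps "powered by" from matching inside "empowered by"
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    return " ".join(text.split())

# Made-up sample text for illustration only
example = "Copyright 2024 All rights reserved Powered by Atlassian Confluence user guide"
print(strip_phrases(example))
```

Single-word terms such as "copyright" can still be filtered after tokenization, which is why the sketch leaves them in place.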

### Data Loading

Load the textual data from a CSV file into a pandas DataFrame. This data will be used for the subsequent analysis.

In [None]:
# Load the CSV file into a DataFrame
section_data = pd.read_csv('section_data.csv')

# Display basic information about the DataFrame
print("DataFrame Information:")
section_data.info()

# Display descriptive statistics of the DataFrame
print("\nDataFrame Description:")
section_data.describe()

### Initial Exploratory Data Analysis on Unprocessed Data

Perform initial EDA on the raw data to understand its basic structure, including text length distribution, word frequency analysis, word cloud generation, and sentiment analysis.


In [None]:
# Treat missing content as empty text so the string operations below do not fail on NaN
raw_text = section_data['content'].fillna('').astype(str)

# Analyze text length distribution
section_data['Text Length'] = raw_text.str.len()
plt.figure(figsize=(10, 6))
plt.title('Text Length Distribution')
sns.histplot(section_data['Text Length'], bins=50, kde=True)
plt.xlabel('Text Length')
plt.ylabel('Frequency')
plt.show()

# Perform word frequency analysis
freq_dist = FreqDist(word_tokenize(' '.join(raw_text)))
plot_word_frequencies(freq_dist, 'Top Words in Raw Data')

# Generate a word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(raw_text))
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Raw Data')
plt.show()

# Perform sentiment analysis
section_data['Sentiment'] = raw_text.apply(lambda x: TextBlob(x).sentiment.polarity)
plt.figure(figsize=(8, 6))
plt.title('Sentiment Distribution')
sns.histplot(section_data['Sentiment'], bins=30, kde=True)
plt.xlabel('Sentiment Polarity')
plt.ylabel('Frequency')
plt.show()
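
NLTK's `FreqDist` is a subclass of `collections.Counter`, so the word frequency step amounts to counting tokens and reading off `most_common`. The sketch below shows that with the standard library alone. The two sample sentences are made up, and the regex tokenizer is an assumption that only approximates `word_tokenize`.

```python
from collections import Counter
import re

# Made-up sample documents standing in for section_data['content']
docs = [
    "The agent sends logs to the server.",
    "The server stores the logs.",
]

# Simple tokenizer (an assumption): runs of word characters, or single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", " ".join(docs).lower())

counts = Counter(tokens)
top = counts.most_common(3)
print(top)
```

On raw text the top of the list tends to be function words and punctuation, as it is here, which is part of the motivation for the cleaning step that follows.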

### Data Cleaning

Clean the text by stripping non-alphanumeric characters, removing stopwords and recurring boilerplate terms, and lemmatizing what remains. Then drop duplicate and blank entries so that every remaining row carries distinct, non-empty text.

In [None]:
# Apply the text cleaning function
tqdm.pandas(desc="Cleaning Text")
section_data['Cleaned Text'] = section_data['content'].progress_apply(clean_text)

# Remove duplicates in the data
print("Removing duplicates...")
section_data = section_data.drop_duplicates(subset=['Cleaned Text'])
print("Duplicates removed.")

# Remove blank entries from the data
print("Removing blank entries...")
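# .copy() gives an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
section_data = section_data[section_data['Cleaned Text'].str.strip() != ''].copy()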
print("Blank entries removed.")
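
The two filtering steps can be illustrated on a plain list. `drop_duplicates` keeps the first occurrence of each value by default, and the blank filter then removes whatever is empty or whitespace-only. Because all empty rows share the same cleaned value, deduplication collapses them to one row, which the blank filter then removes. The sample strings below are made up for illustration.

```python
# Made-up cleaned strings standing in for section_data['Cleaned Text']
cleaned = ["install agent", "", "install agent", "   ", "configure server", ""]

# Keep the first occurrence of each value, mirroring drop_duplicates' default behavior
seen = set()
deduped = []
for text in cleaned:
    if text not in seen:
        seen.add(text)
        deduped.append(text)

# Drop entries that are empty or contain only whitespace
result = [text for text in deduped if text.strip() != ""]
print(result)
```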

### Exploratory Data Analysis on Cleaned Text

Repeat the EDA process on the cleaned text data. This includes analyzing the text length distribution, word frequency, word cloud, and sentiment for the cleaned data.

In [None]:
# Display updated information about the DataFrame
print("DataFrame Information:")
section_data.info()

# Display descriptive statistics of the cleaned DataFrame
print("\nDataFrame Description:")
# describe() is not the last expression in this cell, so print it explicitly
print(section_data.describe())

# Analyze text length distribution after cleaning
section_data['Cleaned Text Length'] = section_data['Cleaned Text'].apply(len)
plt.figure(figsize=(10, 6))
plt.title('Text Length Distribution After Cleaning')
sns.histplot(section_data['Cleaned Text Length'], bins=50, kde=True)
plt.xlabel('Text Length')
plt.ylabel('Frequency')
plt.show()

# Perform word frequency analysis on cleaned text
freq_dist_cleaned = FreqDist(word_tokenize(' '.join(section_data['Cleaned Text'])))
plot_word_frequencies(freq_dist_cleaned, 'Top Words in Cleaned Data')

# Generate a word cloud for cleaned text
wordcloud_cleaned = WordCloud(width=800, height=400, background_color='white').generate(' '.join(section_data['Cleaned Text']))
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_cleaned, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Cleaned Data')
plt.show()

# Perform sentiment analysis on cleaned text
section_data['Cleaned Sentiment'] = section_data['Cleaned Text'].apply(lambda x: TextBlob(x).sentiment.polarity)
plt.figure(figsize=(8, 6))
plt.title('Sentiment Distribution After Cleaning')
sns.histplot(section_data['Cleaned Sentiment'], bins=30, kde=True)
plt.xlabel('Sentiment Polarity')
plt.ylabel('Frequency')
plt.show()
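
Alongside the histograms, a single number can summarize how much text the cleaning step removed. The sketch below computes the fraction of characters dropped per document and then averages it. The text pairs are made up for illustration, and `reductions` is a hypothetical name; in the notebook the same ratio could be derived from the 'Text Length' and 'Cleaned Text Length' columns.

```python
from statistics import mean, median

# Made-up (raw, cleaned) pairs standing in for 'content' and 'Cleaned Text'
pairs = [
    ("The agent is sending the logs to the server.", "agent send log server"),
    ("Configure the collector before you start it.", "configure collector start"),
    ("Please restart the service after the update.", "restart service update"),
]

# Fraction of characters removed by cleaning, per document (empty raw text is skipped)
reductions = [1 - len(cleaned) / len(raw) for raw, cleaned in pairs if raw]

print(f"mean reduction:   {mean(reductions):.2f}")
print(f"median reduction: {median(reductions):.2f}")
```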

### Saving Cleaned Data

After the analysis, save the cleaned and processed data to a new CSV file for future use or further analysis.

In [None]:
# Save the cleaned data to a new CSV file
section_data.to_csv('cleaned_section_data.csv', index=False)
print("Cleaned data has been saved to 'cleaned_section_data.csv'.")