# Exploratory Data Analysis on Textual Data

This notebook performs Exploratory Data Analysis (EDA) on textual data, covering data loading, cleaning, analysis, and visualization. The main steps are:

1. **Data Loading**: Load textual data from a CSV file into a pandas DataFrame.
2. **Initial Exploratory Data Analysis on Unprocessed Data**: Perform initial EDA on the raw data to understand its basic structure, including text length distribution, word frequency analysis, word cloud generation, and sentiment analysis.
3. **Data Cleaning**: Clean the textual data by removing unnecessary characters, stopwords, and lemmatizing the text. Also, remove duplicates and blank entries to ensure the quality of the dataset.
4. **Extract Category Feature from Document Text**: Define categories and corresponding keywords for classification. Assign categories to each document based on keywords.
5. **Exploratory Data Analysis on Cleaned Text**: Repeat the EDA process on the cleaned text data. This includes analyzing the text length distribution, word frequency, word cloud, sentiment, and distribution of assigned categories from text.
6. **Saving Cleaned Data**: Save the cleaned and processed data to a new CSV file, 'cleaned_section_data_with_categories.csv', for future use or further analysis.

Libraries used include pandas, matplotlib, seaborn, NLTK, TextBlob, WordCloud, langdetect, tqdm, and spaCy.

In [None]:
# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk import bigrams, trigrams
from textblob import TextBlob
from wordcloud import WordCloud
from collections import Counter
from langdetect import detect, LangDetectException
import spacy
import string
import re
from tqdm import tqdm

### Load Resources

In [None]:
# Load spaCy model for lemmatization
nlp = spacy.load("en_core_web_lg")

# Download NLTK resources for tokenization and stopwords
nltk.download('punkt')
nltk.download('stopwords')

### Function Definitions

Define functions for plotting word frequencies, cleaning text data, and other repetitive tasks. 

In [None]:
# Function to plot word frequencies
def plot_word_frequencies(freq_dist, title, num_words=30):
    """
    Plot word frequencies.

    Parameters:
    - freq_dist: Frequency distribution of words.
    - title: Title of the plot.
    - num_words: Number of top words to display.

    Returns:
    - None
    """
    words, counts = zip(*freq_dist.most_common(num_words))
    plt.figure(figsize=(12, 6))
    sns.barplot(x=list(words), y=list(counts))
    plt.title(title)
    plt.xlabel('Word')
    plt.ylabel('Frequency')
    plt.xticks(rotation=45)
    plt.show()
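
The `zip(*...)` call in `plot_word_frequencies` transposes the list of `(word, count)` pairs into two parallel tuples. The sketch below isolates that step using `collections.Counter`, which NLTK's `FreqDist` subclasses. The helper name `top_words` and the toy data are assumptions for illustration only; the guard covers an empty distribution, where the bare unpacking would raise `ValueError`.

```python
from collections import Counter


def top_words(freq_dist, num_words=30):
    """Return (words, counts) tuples for the most common entries.

    Hypothetical helper for illustration: returns two empty tuples when
    the distribution has no entries instead of failing on unpacking.
    """
    pairs = freq_dist.most_common(num_words)
    if not pairs:
        return (), ()
    words, counts = zip(*pairs)
    return words, counts


# Toy distribution standing in for a FreqDist built from real text.
toy_dist = Counter("log source log agent log source".split())
words, counts = top_words(toy_dist, num_words=2)
print(words, counts)  # ('log', 'source') (3, 2)
```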

In [None]:
# Function to clean text data
def clean_text(text):
    """
    Clean text data by removing unnecessary characters, stopwords, and lemmatizing the text.
    Also, remove specific text from the original content.

    Parameters:
    - text: Input text to be cleaned.

    Returns:
    - Cleaned text.
    """
    try:
        # Check if the text is NaN
        if pd.isna(text):
            return ''

        # Remove specific text using regular expressions
        specific_text = r"Copyright © 2024 LogRhythm, Inc\. All Rights Reserved.*?If this problem persists, please contact our support\."
        text = re.sub(specific_text, '', text, flags=re.DOTALL)

        # Language detection and filter out non-English
        if detect(text) != 'en':
            return ''

        # List of additional characters to remove
        characters_to_remove = [
            'â€”', 'â€™', 'â€œ', 'â€¦', 'é', 'ø', 'à', 'ç', 'ê', 'ä', 'ü', 'ñ', 'î', 'è', 'ø', 'ü', 'ê'
        ]

        # Remove each character in the list
        for char in characters_to_remove:
            text = text.replace(char, '')
        
        # Lowercase the text, drop every character that is not alphanumeric or whitespace, then tokenize
        text = ''.join([char.lower() for char in text if char.isalnum() or char.isspace()])
        words = word_tokenize(text)

        # Remove stopwords
        stop_words = set(stopwords.words('english'))
        words = [word for word in words if word not in stop_words]

        # Lemmatization
        lemmatized_tokens = [token.lemma_ for token in nlp(' '.join(words)) if not token.is_stop]

        return ' '.join(lemmatized_tokens)

    except LangDetectException:
        # Handle texts that langdetect can't process
        return ''
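
The character filter in `clean_text` deletes punctuation without inserting a space, so hyphenated terms are fused rather than split. A minimal sketch of that normalization step follows, assuming a plain `str.split()` in place of `word_tokenize` and a tiny made-up stopword set instead of NLTK's English list; `normalize` is a hypothetical name.

```python
def normalize(text, stop_words):
    """Lowercase, drop non-alphanumeric characters, and remove stopwords."""
    kept = ''.join(ch.lower() for ch in text if ch.isalnum() or ch.isspace())
    return [tok for tok in kept.split() if tok not in stop_words]


# Tiny illustrative stopword set (an assumption, not NLTK's English list).
toy_stop_words = {"the", "a", "is", "in"}

tokens = normalize("The set-up is done in real-time!", toy_stop_words)
print(tokens)  # ['setup', 'done', 'realtime']
```

Because `str.isalnum()` is Unicode-aware, accented letters such as 'é' pass this filter, which is why `clean_text` removes them through an explicit list beforehand.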

In [None]:
# Function to remove frequent Trigram Phrases
def remove_frequent_trigrams(df, column_name, threshold=300):
    """
    Remove frequent trigrams from the text.

    Parameters:
    - df: DataFrame containing the text data.
    - column_name: Name of the column containing the text.
    - threshold: Threshold frequency for trigrams.

    Returns:
    - DataFrame with frequent trigrams removed.
    """
    # Tokenize the text and create trigrams
    all_trigrams = [' '.join(gram) for text in df[column_name] for gram in trigrams(text.split())]

    # Count the frequency of each trigram
    trigram_freq = Counter(all_trigrams)

    # Identify trigrams that occur more than the threshold
    frequent_trigrams = {gram for gram, freq in trigram_freq.items() if freq > threshold}

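    # Drop every token that falls inside a frequent trigram. Joining the
    # surviving trigrams back together would repeat their overlapping words.
    def remove_trigrams(text):
        tokens = text.split()
        drop = set()
        for i in range(len(tokens) - 2):
            if ' '.join(tokens[i:i + 3]) in frequent_trigrams:
                drop.update((i, i + 1, i + 2))
        return ' '.join(tok for i, tok in enumerate(tokens) if i not in drop)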

    # Apply the function to the DataFrame with a progress bar
    tqdm.pandas(desc="Removing Frequent Trigrams from cleaned data")
    df[column_name] = df[column_name].progress_apply(remove_trigrams)
    
    return df
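
The counting and thresholding step can be checked by hand on a toy corpus. This sketch uses a plain sliding window instead of `nltk.trigrams`; the corpus, the helper name `sliding_trigrams`, and the threshold of 2 are assumptions chosen so the result is easy to verify.

```python
from collections import Counter


def sliding_trigrams(tokens):
    """Return every run of three consecutive tokens as a space-joined string."""
    return [' '.join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


toy_corpus = [
    "contact support team today",
    "please contact support team",
    "contact support team now",
]

# Count each trigram across all documents.
trigram_freq = Counter(
    gram for text in toy_corpus for gram in sliding_trigrams(text.split())
)

# Keep only trigrams that occur strictly more often than the threshold.
toy_threshold = 2
frequent = {gram for gram, freq in trigram_freq.items() if freq > toy_threshold}
print(frequent)  # {'contact support team'}
```

Texts with fewer than three tokens produce no trigrams at all, so they never contribute to the counts.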

In [None]:
# Function to describe dataframe attributes
def describe_dataframe(df):
    """
    Describe DataFrame attributes including information, descriptive statistics, data types,
    missing values, and unique values for each column.

    Parameters:
    - df: DataFrame to be described.

    Returns:
    - None
    """
    print("DataFrame Information:")
    df.info()
    
    print("\nDataFrame Descriptive Statistics:")
    print(df.describe(include='all'))

    print("\nData Types:")
    print(df.dtypes)

    print("\nMissing Values in Each Column:")
    missing_values = df.isnull().sum()
    print(missing_values[missing_values > 0])

    print("\nUnique Values in Each Column:")
    for col in df.columns:
        unique_count = df[col].nunique()
        print(f'Column {col} has {unique_count} unique values.')

In [None]:
# Function to identify unique characters
def unique_characters_count(text_column):
    """
    Count unique characters in a text column.

    Parameters:
    - text_column: Column containing text data.

    Returns:
    - List of (character, count) tuples sorted by count in descending order.
    """
    # Concatenate all text into a single string
    all_text = ''.join(text_column)

    # Count occurrences of each character and return them, most frequent first
    return Counter(all_text).most_common()
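
A character inventory like this is mostly useful for spotting unexpected symbols. As a rough sketch, the same counting idea can be narrowed to characters outside the ASCII range, which is where encoding debris and accented letters show up. The helper name `non_ascii_counts` and the sample strings are assumptions for illustration.

```python
from collections import Counter


def non_ascii_counts(texts):
    """Count characters outside the ASCII range, most frequent first."""
    counts = Counter(ch for text in texts for ch in text if ord(ch) > 127)
    return counts.most_common()


# Escapes keep the sample unambiguous: \u00e9 is e-acute, \u00ef is i-diaeresis.
toy_texts = ["caf\u00e9 r\u00e9sum\u00e9", "na\u00efve caf\u00e9"]
print(non_ascii_counts(toy_texts))
```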

In [None]:
# Function to count frequent words and phrases in text
def find_frequent_ngrams(df, column_name, num_terms=10, ngram_size=1):
    """
    Find frequent n-grams in the text.

    Parameters:
    - df: DataFrame containing the text data.
    - column_name: Name of the column containing the text.
    - num_terms: Number of top terms to display.
    - ngram_size: Size of n-grams (1 for unigrams, 2 for bigrams, 3 for trigrams).

    Returns:
    - List of most common n-grams.
    """
    ngrams_list = []

    # Iterate through each row to generate n-grams
    for text in df[column_name]:
        tokens = text.split()
        
        if ngram_size == 1:
            ngrams = tokens
        elif ngram_size == 2:
            ngrams = [' '.join(gram) for gram in bigrams(tokens)]
        elif ngram_size == 3:
            ngrams = [' '.join(gram) for gram in trigrams(tokens)]
        else:
            raise ValueError("ngram_size must be 1, 2, or 3")

        ngrams_list.extend(ngrams)

    # Count the frequency of each n-gram
    freq_dist = Counter(ngrams_list)

    # Get the most common n-grams
    common_ngrams = freq_dist.most_common(num_terms)

    return common_ngrams
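
`find_frequent_ngrams` handles sizes 1 to 3 through separate branches. If other sizes were ever needed, one possible generalization is to zip shifted copies of the token list, as sketched below. The function name `ngrams_any_size` and the toy tokens are assumptions, not part of the notebook.

```python
from collections import Counter


def ngrams_any_size(tokens, n):
    """Return space-joined n-grams of any positive size n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # zip stops at the shortest shifted copy, so only full n-grams are produced.
    return [' '.join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]


tokens = "alarm rule alarm rule engine".split()
top_bigram = Counter(ngrams_any_size(tokens, 2)).most_common(1)
print(top_bigram)  # [('alarm rule', 2)]
```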

In [None]:
# Function to assign a category to a text based on keywords
def assign_category(text, categories):
    """
    Assign a category to a text based on keywords.

    Parameters:
    - text: Input text to be categorized.
    - categories: Dictionary containing category names as keys and corresponding keywords as values.

    Returns:
    - Assigned category.
    """
    text = str(text).lower()
    category_scores = {category: 0 for category in categories.keys()}
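    for category, keywords in categories.items():
        # Lowercase keywords too, so mixed-case entries such as 'API' can match the lowercased text
        category_scores[category] += sum(text.count(keyword.lower()) for keyword in keywords)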
    assigned_category = max(category_scores, key=category_scores.get)
    return 'Other' if category_scores[assigned_category] == 0 else assigned_category
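
The scoring is easiest to follow on a small case. The sketch below reproduces the idea with a hypothetical helper `score_categories` and two made-up categories; keywords are lowercased here so mixed-case entries can match, which is an assumption of this sketch.

```python
def score_categories(text, categories):
    """Return per-category keyword hit counts using substring matching."""
    text = str(text).lower()
    return {
        category: sum(text.count(keyword.lower()) for keyword in keywords)
        for category, keywords in categories.items()
    }


# Toy categories for illustration only.
toy_categories = {
    'Setup': ['install', 'configure'],
    'Support': ['error', 'fix'],
}

scores = score_categories("Install the agent, then fix the install error.", toy_categories)
print(scores)  # {'Setup': 2, 'Support': 2}
print(max(scores, key=scores.get))  # Setup
```

Two properties are worth keeping in mind. On a tie, `max` returns the first category in dictionary order, so the order of the `categories` dictionary affects the result. And because `str.count` matches substrings, a short keyword such as 'info' also counts hits inside 'information'.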

### Data Loading

Load the textual data from a CSV file into a pandas DataFrame for the subsequent analysis. This step also saves a copy of the original 'section_data.csv' as 'original_section_data.csv', then overwrites the 'content' column of 'section_data.csv' with the boilerplate text removed.

In [None]:
# Load the CSV file into a DataFrame
section_data = pd.read_csv('section_data.csv', encoding='utf-8')

# Copy the DataFrame to create a new one for preservation
original_section_data = section_data.copy()

# Save the original DataFrame to a new CSV file
original_section_data.to_csv('original_section_data.csv', index=False)

# Function to remove specific text from the 'content' column
def remove_specific_text(text):
    specific_text = r"Copyright © 2024 LogRhythm, Inc\. All Rights Reserved.*?If this problem persists, please contact our support\."
    return re.sub(specific_text, '', text, flags=re.DOTALL)

# Apply the function to the 'content' column
section_data['content'] = section_data['content'].apply(remove_specific_text)

# Save the updated DataFrame to the same CSV file, overwriting the original
section_data.to_csv('section_data.csv', index=False, encoding='utf-8')
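
The removal pattern relies on two regex features: a non-greedy `.*?` so the match stops at the first closing phrase, and `re.DOTALL` so `.` can cross line breaks. A minimal sketch with made-up markers (`BEGIN FOOTER` and `END FOOTER.` stand in for the real boilerplate text) shows the effect of the flag.

```python
import re

# Hypothetical boilerplate markers, standing in for the real footer text.
PATTERN = r"BEGIN FOOTER.*?END FOOTER\."
BOILERPLATE = re.compile(PATTERN, flags=re.DOTALL)

toy_text = "Keep this.\nBEGIN FOOTER\nlegal text\nEND FOOTER.\nKeep this too."

# With DOTALL, the multi-line footer is matched and removed.
cleaned = BOILERPLATE.sub('', toy_text)
print(repr(cleaned))  # 'Keep this.\n\nKeep this too.'

# Without DOTALL, '.' cannot cross the newlines, so nothing is removed.
unchanged = re.sub(PATTERN, '', toy_text)
print(unchanged == toy_text)  # True
```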

### Initial Exploratory Data Analysis on Unprocessed Data

Perform initial EDA on the raw data to understand its basic structure, including text length distribution, word frequency analysis, word cloud generation, and sentiment analysis.

In [None]:
# Describe the DataFrame before text cleaning
describe_dataframe(section_data)

# Check for duplicate text
duplicate_text = section_data[section_data.duplicated(['content'])]

# Print the duplicates
print("Duplicate Text Entries:")
print(duplicate_text)

# Check for missing data in each column
missing_data = section_data.isnull().sum()

# Print columns with missing data
print("Columns with Missing Data:")
print(missing_data[missing_data > 0])

# Analyze text length distribution
section_data['Text Length'] = section_data['content'].apply(len)
plt.figure(figsize=(10, 6))
plt.title('Text Length Distribution')
sns.histplot(section_data['Text Length'], bins=50, kde=True)
plt.xlabel('Text Length')
plt.ylabel('Frequency')
plt.show()

# Perform word frequency analysis
freq_dist = FreqDist(word_tokenize(' '.join(section_data['content'])))
plot_word_frequencies(freq_dist, 'Top Words in Raw Data')

# Generate a word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(section_data['content']))
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Raw Data')
plt.show()

# Perform sentiment analysis
section_data['Sentiment'] = section_data['content'].apply(lambda x: TextBlob(x).sentiment.polarity)
plt.figure(figsize=(8, 6))
plt.title('Sentiment Distribution')
sns.histplot(section_data['Sentiment'], bins=30, kde=True)
plt.xlabel('Sentiment Polarity')
plt.ylabel('Frequency')
plt.show()

# Count unique characters in the raw content column
unique_chars_before_cleaning = unique_characters_count(section_data['content'])
print("Unique Characters Before Cleaning:")
print(unique_chars_before_cleaning)

### Data Cleaning

Clean the textual data by removing unnecessary characters, stopwords, and lemmatizing the text. Also, remove duplicates and blank entries to ensure the quality of the dataset.

In [None]:
# Apply the text cleaning function
tqdm.pandas(desc="Cleaning Text")
section_data['Cleaned Text'] = section_data['content'].progress_apply(clean_text)

# Remove duplicates in the data
print("Removing duplicates...")
section_data = section_data.drop_duplicates(subset=['Cleaned Text'])
print("Duplicates removed.")

# Remove blank entries from the data
print("Removing blank entries...")
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
section_data = section_data[section_data['Cleaned Text'].str.strip() != ''].copy()
print("Blank entries removed.")

print("\nProcessing ngrams...")

# Find frequent bigrams in the 'cleaned text' column
print("\nFrequent Bigrams in the Cleaned Text:")
print(find_frequent_ngrams(section_data, 'Cleaned Text', num_terms=10, ngram_size=2))

# Find frequent trigrams in the 'cleaned text' column
print("\nFrequent Trigrams in the Cleaned Text:")
print(find_frequent_ngrams(section_data, 'Cleaned Text', num_terms=50, ngram_size=3))

# Removing frequent trigrams
print("\nRemoving frequent trigrams (count threshold = 300)")
section_data = remove_frequent_trigrams(section_data, 'Cleaned Text')

### Extract Category Feature from Document Text

Define categories and corresponding keywords for classification. Assign categories to each document based on keywords.

In [None]:
# Define categories and corresponding keywords for classification
categories = {
    'Installation & Setup': [
        'install', 'setup', 'implementation', 'deployment', 'configure', 'initialization', 
        'installing', 'deploy', 'configuration', 'set-up', 'initiate', 'launch', 'activate',
        'how to install', 'setting up', 'installation guide', 'deploying', 'configuring'
    ],
    'Maintenance & Management': [
        'maintain', 'maintenance', 'servicing', 'management', 'optimization', 'service', 
        'manage', 'routine check', 'system upkeep', 'system care', 'upkeep', 'tune-up',
        'maintaining', 'managing', 'service routine', 'optimizing', 'how to maintain'
    ],
    'Troubleshooting & Support': [
        'troubleshoot', 'error', 'issue', 'problem', 'diagnosis', 'resolution', 'fix', 
        'solve', 'rectify', 'repair', 'resolve', 'correct', 'debug', 'fault finding',
        'troubleshooting', 'solving issues', 'fixing errors', 'diagnosing problems', 'resolving'
    ],
    'Upgrades & Updates': [
        'upgrade', 'update', 'new version', 'patch', 'release', 'enhancement', 'updating', 
        'upgrading', 'version upgrade', 'system update', 'software update', 'patching',
        'how to upgrade', 'applying updates', 'version updating', 'software enhancement'
    ],
    'General Information & Overview': [
        'overview', 'introduction', 'info', 'summary', 'guide', 'documentation', 
        'information', 'details', 'background', 'basics', 'general data', 'key points',
        'what is', 'explain', 'description of', 'details about'
    ],
    'Security & Monitoring': [
        'surveillance', 'log management', 'event tracking', 'real-time analysis', 
        'security watch', 'monitoring', 'security check', 'system monitoring', 'network watch',
        'security overview', 'monitoring setup', 'event tracking system'
    ],
    'Threat Detection & Analysis': [
        'threat detection', 'anomaly detection', 'intrusion detection', 'threat intelligence', 
        'security alerts', 'risk detection', 'threat identification', 'vulnerability detection', 
        'security threat detection', 'analyzing threats', 'identifying risks', 'detecting anomalies'
    ],
    'Incident Response & Management': [
        'incident response', 'incident management', 'forensics', 'mitigation', 'recovery', 
        'incident handling', 'crisis management', 'incident analysis', 'emergency response',
        'responding to incidents', 'managing incidents', 'incident recovery'
    ],
    'Compliance & Auditing': [
        'compliance', 'regulatory compliance', 'audit', 'reporting', 'policy enforcement', 
        'regulation management', 'compliance tracking', 'legal compliance', 'audit management',
        'compliance policies', 'auditing processes', 'regulatory reporting'
    ],
    'Integration & Compatibility': [
        'integration', 'compatibility', 'third-party integration', 'API', 'interoperability', 
        'system merging', 'software integration', 'data integration', 'platform integration',
        'integrating systems', 'API usage', 'compatibility issues'
    ],
    'Network Security & Protection': [
        'network security', 'firewall', 'traffic analysis', 'intrusion prevention', 
        'network protection', 'cybersecurity', 'network defense', 'network safeguard',
        'protecting networks', 'network firewalls', 'cybersecurity measures'
    ]
}

In [None]:
# Apply the function to assign categories to each document
section_data['Category'] = section_data['content'].apply(lambda text: assign_category(text, categories))

### Exploratory Data Analysis on Cleaned Text

Repeat the EDA process on the cleaned text data. This includes analyzing the text length distribution, word frequency, word cloud, sentiment, and distribution of assigned categories from text.

In [None]:
# Describe the DataFrame after text cleaning
describe_dataframe(section_data)

# Check for duplicate cleaned text (trigram removal can introduce new duplicates)
duplicate_text = section_data[section_data.duplicated(['Cleaned Text'])]

# Print the duplicates
print("Duplicate Text Entries:")
print(duplicate_text)

# Check for missing data in each column
missing_data = section_data.isnull().sum()

# Print columns with missing data
print("Columns with Missing Data:")
print(missing_data[missing_data > 0])

# Analyze text length distribution after cleaning
section_data['Cleaned Text Length'] = section_data['Cleaned Text'].apply(len)
plt.figure(figsize=(10, 6))
plt.title('Text Length Distribution After Cleaning')
sns.histplot(section_data['Cleaned Text Length'], bins=50, kde=True)
plt.xlabel('Text Length')
plt.ylabel('Frequency')
plt.show()

# Perform word frequency analysis on cleaned text
freq_dist_cleaned = FreqDist(word_tokenize(' '.join(section_data['Cleaned Text'])))
plot_word_frequencies(freq_dist_cleaned, 'Top Words in Cleaned Data')

# Generate a word cloud for cleaned text
wordcloud_cleaned = WordCloud(width=800, height=400, background_color='white').generate(' '.join(section_data['Cleaned Text']))
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_cleaned, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Cleaned Data')
plt.show()

# Perform sentiment analysis on cleaned text
section_data['Cleaned Sentiment'] = section_data['Cleaned Text'].apply(lambda x: TextBlob(x).sentiment.polarity)
plt.figure(figsize=(8, 6))
plt.title('Sentiment Distribution After Cleaning')
sns.histplot(section_data['Cleaned Sentiment'], bins=30, kde=True)
plt.xlabel('Sentiment Polarity')
plt.ylabel('Frequency')
plt.show()

# Display the distribution of categories
plt.figure(figsize=(10, 6))
sns.countplot(y='Category', data=section_data)
plt.title('Distribution of Assigned Categories from Text')
plt.xlabel('Count')
plt.ylabel('Category')
plt.show()

# Find frequent bigrams in the 'cleaned text' column
print("\nFrequent Bigrams in Cleaned Text after Trigram removal:")
print(find_frequent_ngrams(section_data, 'Cleaned Text', num_terms=10, ngram_size=2))

# Find frequent trigrams in the 'cleaned text' column
print("\nFrequent Trigrams in Cleaned Text after Trigram removal:")
print(find_frequent_ngrams(section_data, 'Cleaned Text', num_terms=50, ngram_size=3))

# Apply the function to get unique characters in the cleaned content column
unique_chars_after_cleaning = unique_characters_count(section_data['Cleaned Text'])
print("\nUnique Characters After Cleaning:")
print(unique_chars_after_cleaning)

### Saving Cleaned Data

After the analysis, save the cleaned and processed data to a new CSV file for future use or further analysis.

In [None]:
# Save the cleaned data to a new CSV file
section_data.to_csv('cleaned_section_data_with_categories.csv', index=False, encoding='utf-8')