# Text Preprocessing for NLP

This notebook fetches text data from a MongoDB database and preprocesses it for NLP tasks. Preprocessing covers HTML and special-character cleanup, tokenization, stopword removal, Snowball stemming, and n-gram generation.

In [22]:
# Imports
import os
import re
import warnings

import nltk
import pandas as pd
import pymongo
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

## Download Necessary NLTK Data

Downloading required NLTK packages for stopwords, tokenization, and lemmatization.

In [23]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ted59\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ted59\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ted59\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## MongoDB Connection and Data Retrieval

Establishing a connection to MongoDB and retrieving documents from the specified collection.

In [24]:
# MongoDB connection
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["WS_Data_DB"]  # Database name
collection = db["LogRhythmDocs"]  # Collection name

# Fetch data from MongoDB
documents = collection.find()
df = pd.DataFrame(list(documents))
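
Documents returned by `find()` are plain dictionaries, and nothing guarantees that each one carries a usable `content` string. The following is a minimal sketch of filtering them before the DataFrame is built. The helper name `keep_text_documents` and the sample records are hypothetical; the field name `content` is taken from the preprocessing cell later in this notebook.

```python
from typing import Any, Dict, Iterable, List


def keep_text_documents(
    docs: Iterable[Dict[str, Any]], field: str = "content"
) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep only documents whose `field` is a non-empty string."""
    kept = []
    for doc in docs:
        value = doc.get(field)
        if isinstance(value, str) and value.strip():
            kept.append(doc)
    return kept


# Made-up records standing in for the result of collection.find()
sample_docs = [
    {"_id": 1, "content": "<p>Axon agent overview</p>"},
    {"_id": 2, "content": None},
    {"_id": 3},
    {"_id": 4, "content": "   "},
]
filtered = keep_text_documents(sample_docs)
print([doc["_id"] for doc in filtered])  # [1]
```

In the notebook this could wrap the cursor, for example `pd.DataFrame(keep_text_documents(collection.find()))`, so that the preprocessing function never receives a missing value.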

## Text Preprocessing Function

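The cell below defines the preprocessing function. It strips HTML tags, lowercases the text, removes repetitive navigation phrases, URLs, special characters, and digits, tokenizes the result, removes stopwords, applies Snowball stemming, and joins the resulting 5-grams into a single string. A spell checker is instantiated but not applied, and the lemmatizer imported earlier is not used. The same cell then exports the processed examples to one `.yml` file per document.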
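
To make the regex cleanup concrete, here is a minimal standard-library sketch of the URL, special-character, and digit removal applied to a made-up string. The helper name `clean_text` and the sample sentence are illustrative assumptions, and the final whitespace collapse is an extra step added only to keep the printed result readable.

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical helper mirroring the regex cleanup steps used in this notebook."""
    text = text.lower()
    # Drop anything that looks like a URL (http..., https..., www...)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Replace every non-word character with a space
    text = re.sub(r'\W', ' ', text)
    # Delete digits
    text = re.sub(r'\d', '', text)
    # Collapse runs of whitespace (not part of the notebook's function)
    return ' '.join(text.split())


sample = "Install Agent v7.4 from https://example.com/docs (see www.example.com)!"
print(clean_text(sample))  # install agent v from see
```

Note that `www\S+` consumes everything up to the next whitespace, so trailing punctuation attached to a URL is removed along with it.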

In [25]:
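# Spellchecker instance (SpellChecker is assumed to come from the
# pyspellchecker package; it is created here but not applied below)
from spellchecker import SpellChecker
spell = SpellChecker()

# Suppress warnings from BeautifulSoup
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')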

# Define function for text preprocessing
def preprocess_text(text):
    # Remove HTML tags using BeautifulSoup
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()

    # Lowercasing the text
    text = text.lower()

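    # Define a list of repetitive phrases to remove.
    # Matching runs on the lowercased raw text, before stopword removal,
    # so each phrase must be spelled exactly as it appears at that stage.
    repetitive_phrases = [
        "show navigation go homepage",
        "logrhythm documentation",
        "skip to main content"
    ]

    # Remove repetitive phrases
    for phrase in repetitive_phrases:
        text = text.replace(phrase, '')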

    # Remove URLs, special characters, and numbers
    text = re.sub(r'http\S+|www\S+', '', text)  # http\S+ also covers https
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\d', '', text)

    # Tokenize the text
    words = word_tokenize(text)

    # Drop empty or whitespace-only tokens
    words = [word for word in words if word and not word.isspace()]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # Snowball stemming
    stemmer = SnowballStemmer('english')
    stemmed_words = [stemmer.stem(word) for word in words]

    # Generate n-grams
    n = 5
    n_grams = [' '.join(gram) for gram in ngrams(stemmed_words, n)]

    # Combine back to text
    processed_text = ' '.join(n_grams)

    return processed_text

# Apply preprocessing to each document's content
df['processed_content'] = df['content'].apply(preprocess_text)

# Remove None entries and duplicates after preprocessing
df = df.dropna(subset=['processed_content'])
df = df.drop_duplicates(subset=['processed_content'])

# Display the processed content
print(df[['processed_content']].head())

# Define directory to save YAML files
output_dir = "C:\\Users\\ted59\\Knapp069-Practicum-1-Project"

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

# Initialize an empty list to store rows
intent_rows = []

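# Populate intent rows from the n-grams.
# There is no 'ngrams' column: preprocess_text() joins the 5-grams into one
# string, so split that string back into 5-token chunks to recover them.
chunk_size = 5  # must match n in preprocess_text
for index, row in df.iterrows():
    tokens = row['processed_content'].split()
    for start in range(0, len(tokens), chunk_size):
        ngram = ' '.join(tokens[start:start + chunk_size])
        # Append rows to intent_rows list
        intent_rows.append({'intent': row['_id'], 'example': ngram})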

# Create DataFrame from list of rows
intent_df = pd.DataFrame(intent_rows)

# Write grouped examples to YAML files
for intent, examples in intent_df.groupby('intent')['example']:
    with open(os.path.join(output_dir, f"{intent}.yml"), 'w') as f:
        f.write('\n'.join(examples))


                                    processed_content
0   show navig go homepag logrhythm com communiti ...
4   logrhythm axon show navig go homepag axon logr...
6   axon prerequisit consider show navig go homepa...
9   axon agent show navig go homepag axon logrhyth...
11  axon administr guid show navig go homepag axon...


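AttributeError: 'DataFrame' object has no attribute 'append'

This traceback appears to be left over from an earlier version of the cell that called `DataFrame.append`, which was removed in pandas 2.0. The cell above already avoids it by collecting rows in a list and passing that list to `pd.DataFrame`.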
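
The n-gram and export steps can be sketched with the standard library alone. The example below builds overlapping n-grams with a hypothetical `sliding_ngrams` helper (the notebook itself uses `nltk.util.ngrams`) and writes them as a YAML list, one quoted item per line, into a temporary directory. The list layout, the helper names, and the sample tokens are assumptions for illustration; the notebook writes the examples as bare newline-separated lines.

```python
import os
import tempfile
from typing import List, Sequence


def sliding_ngrams(tokens: Sequence[str], n: int) -> List[str]:
    """Hypothetical stand-in for nltk.util.ngrams that returns space-joined n-grams."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def write_yaml_list(path: str, examples: Sequence[str]) -> None:
    """Write each example as a double-quoted YAML list item."""
    with open(path, 'w', encoding='utf-8') as f:
        for example in examples:
            # Escape backslashes and quotes so the item stays a valid YAML string
            escaped = example.replace('\\', '\\\\').replace('"', '\\"')
            f.write(f'- "{escaped}"\n')


tokens = ['axon', 'agent', 'instal', 'guid', 'window', 'server']  # made-up stems
examples = sliding_ngrams(tokens, 5)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'example_intent.yml')
    write_yaml_list(out_path, examples)
    with open(out_path, encoding='utf-8') as f:
        written = f.read()

print(written)
```

Six tokens yield two overlapping 5-grams, which shows why joining all n-grams back into one string repeats most tokens. Keeping the n-grams as a list avoids having to split that string apart again before export.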