*Perform text cleaning, lemmatization (any method), stop-word removal (any method), and label encoding. Create representations using TF-IDF. Save the outputs.*

**Clean the text:** Lowercase the text, remove punctuation and numbers, and collapse extra spaces.

**Remove stop words and perform lemmatization:** Using NLTK’s stopwords list and WordNetLemmatizer.

**Label encode:** Convert textual labels into numeric codes using scikit-learn’s LabelEncoder.

**Create a TF-IDF representation:** Using TfidfVectorizer.

**Save outputs:** Cleaned data and label encoding to cleaned_data.csv, TF-IDF matrix to tfidf_representation.csv, and label mapping to label_mapping.json.
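
As a quick illustration of the cleaning step, the sketch below applies the same two regular expressions used in the notebook to a made-up sentence. The sample string is an assumption chosen only for demonstration; it is not part of the data entered later.

```python
import re

def clean_text(text):
    # Lowercase, then drop every character that is not a-z or whitespace
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    # Collapse runs of whitespace into single spaces and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical sample input (not from the notebook's session)
sample = "  The FOOD cost $25,   and it wasn't worth it!  "
cleaned = clean_text(sample)
print(cleaned)
```

Note that the apostrophe is deleted rather than replaced by a space, so "wasn't" becomes "wasnt" before tokenization.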

In [5]:
import re
import json
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Download required NLTK data packages
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')


# Take User Input for Documents and Labels

documents = []
labels = []
print("Enter your documents along with their labels. When finished, leave the document input empty and press Enter.")

while True:
    doc = input("Enter document text (or leave empty to finish): ").strip()
    if doc == "":
        break
    label = input("Enter label for this document: ").strip()
    documents.append(doc)
    labels.append(label)

if not documents:
    # Stop here; SystemExit halts the cell without relying on the interactive exit() helper
    raise SystemExit("No documents provided. Exiting...")

# Create a DataFrame with the provided documents and labels
df = pd.DataFrame({"text": documents, "label": labels})

# Text Cleaning

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and numbers (retain only alphabets and whitespace)
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df['clean_text'] = df['text'].apply(clean_text)

# Remove Stop Words & Lemmatization

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def remove_stopwords_and_lemmatize(text):
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]
    # Lemmatize tokens (lemmatize() defaults to pos='n', so each token is treated as a noun)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemmatized_tokens)

df['final_text'] = df['clean_text'].apply(remove_stopwords_and_lemmatize)

# Label Encoding

label_encoder = LabelEncoder()
df['label_encoded'] = label_encoder.fit_transform(df['label'])

# Create TF-IDF Representation

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['final_text'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Save Outputs

# Save cleaned data with original, cleaned, final text, and label encoding
df.to_csv("cleaned_data.csv", index=False)
print("Cleaned data saved as 'cleaned_data.csv'.")

# Save the TF-IDF representation to CSV
tfidf_df.to_csv("tfidf_representation.csv", index=False)
print("TF-IDF representation saved as 'tfidf_representation.csv'.")

# Save the label mapping (label to encoded value) to JSON
# classes_ is sorted, and each label's code is its index in that array
label_mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
with open("label_mapping.json", "w", encoding="utf-8") as f:
    json.dump(label_mapping, f, indent=4)
print("Label mapping saved as 'label_mapping.json'.")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Enter your documents along with their labels. When finished, leave the document input empty and press Enter.
Enter document text (or leave empty to finish): I absolutely love the new design of the website. Everything is so clear and intuitive
Enter label for this document: Positive
Enter document text (or leave empty to finish): The food at the restaurant was bland and overpriced. Not worth the visit.
Enter label for this document: Negative
Enter document text (or leave empty to finish): Reading a good book on a rainy day is a perfect escape from reality.
Enter label for this document: Positive
Enter document text (or leave empty to finish): 
Cleaned data saved as 'cleaned_data.csv'.
TF-IDF representation saved as 'tfidf_representation.csv'.
Label mapping saved as 'label_mapping.json'.
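
To make the contents of label_mapping.json concrete, here is a minimal sketch that reproduces the encoding for the three labels entered in the session above. It is a standard-library stand-in for LabelEncoder, not scikit-learn itself, and relies on the fact that LabelEncoder sorts the unique labels and numbers them from 0.

```python
import json

# Labels exactly as entered in the session above
labels = ["Positive", "Negative", "Positive"]

# Sort the unique labels and number them from 0, as LabelEncoder does
classes = sorted(set(labels))
label_mapping = {label: code for code, label in enumerate(classes)}

# Encode each document's label with the mapping
encoded = [label_mapping[label] for label in labels]

print(json.dumps(label_mapping, indent=4))
print(encoded)
```

Under this ordering, "Negative" receives code 0 and "Positive" receives code 1, so the label_encoded column for the three documents is 1, 0, 1.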

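
The values written to tfidf_representation.csv can be understood with a small hand computation. The sketch below follows TfidfVectorizer's default settings as documented (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row). The three short documents are hypothetical, and splitting on whitespace is a simplification: TfidfVectorizer's default tokenizer also ignores single-character tokens.

```python
import math
from collections import Counter

def tfidf(docs):
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        # L2-normalize the row so its Euclidean length is 1
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Hypothetical, already-preprocessed documents (not the notebook's output)
docs = ["good book", "good food", "bland food"]
vocab, rows = tfidf(docs)

print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the first document, "book" appears in only one document while "good" appears in two, so "book" gets the larger weight. Terms absent from a document get a weight of 0.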