**Assignment No. 3:** Perform text cleaning, lemmatization (any method), stop-word removal (any method), and label encoding; create representations using TF-IDF; save the outputs.




1️⃣ Import Libraries – pandas, nltk, string, sklearn.

2️⃣ Download NLTK Resources – Stopwords & WordNet.

3️⃣ Create DataFrame – Define text & label columns.

4️⃣ Text Cleaning – Remove punctuation & convert to lowercase.

5️⃣ Lemmatization & Stopword Removal – Tokenize, lemmatize & filter stopwords.

6️⃣ Label Encoding – Convert categorical labels into numerical values.

7️⃣ TF-IDF Representation – Convert processed_text into feature vectors.

8️⃣ Save Outputs – Store cleaned data & TF-IDF matrix as CSV files.

9️⃣ Display Results – Print processed text & TF-IDF matrix.

---



In [None]:
import pandas as pd
import numpy as np
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder


In [None]:
# Download NLTK resources if not already installed
# (some NLTK releases also need nltk.download('omw-1.4') for WordNet lookups)
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
# Create DataFrame
data = pd.DataFrame({
    'text': [
        'I love programming in Python!',
        'Python is amazing for data analysis.',
        'Data science is the future.',
        'I hate bugs in my code.',
        'Debugging is so frustrating.',
        'The weather is nice today.',
        'I need more coffee to focus.',
        'The Python syntax is easy to learn.',
        'JavaScript can be tricky to learn at first.',
        'Data visualization is important in analysis.'
    ],
    'label': [
        'positive',
        'positive',
        'positive',
        'negative',
        'negative',
        'neutral',
        'neutral',
        'positive',
        'negative',
        'neutral'
    ]
})

In [None]:
# 1. Text Cleaning: Remove punctuation and make lowercase
def clean_text(text):
    """Strip ASCII punctuation (string.punctuation) and lowercase the text."""
    text = ''.join(char for char in text if char not in string.punctuation)
    return text.lower()

# Apply text cleaning
data['cleaned_text'] = data['text'].apply(clean_text)
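
As a quick check of the cleaning step, the sketch below applies an equivalent cleaner to the first sample sentence. It uses `str.translate` with a table built from `string.punctuation`, which is an alternative to the character-by-character join, not the notebook's own implementation; the name `clean_text_translate` is hypothetical. Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes would pass through unchanged.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text_translate(text):
    """Remove ASCII punctuation and lowercase the text (hypothetical helper)."""
    return text.translate(_PUNCT_TABLE).lower()

print(clean_text_translate('I love programming in Python!'))
# i love programming in python
```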

In [None]:
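# 2. Lemmatization and Removing Stop Words
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def lemmatize_and_remove_stopwords(text):
    """Drop English stop words, then lemmatize the remaining tokens."""
    # cleaned_text is already lowercase and punctuation-free, so splitting on whitespace is enough
    words = text.split()
    # lemmatize() defaults to pos='n' (noun): 'bugs' becomes 'bug', while
    # verb-like forms such as 'debugging' are left unchanged
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(lemmatized_words)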

# Apply lemmatization and stop word removal
data['processed_text'] = data['cleaned_text'].apply(lemmatize_and_remove_stopwords)

In [None]:
# 3. Label Encoding
label_encoder = LabelEncoder()
data['encoded_label'] = label_encoder.fit_transform(data['label'])
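
`LabelEncoder` assigns integer codes in sorted order of the distinct labels, which is why `positive` is printed as 2 and `negative` as 0 in the results. The sketch below reproduces that mapping in plain Python for illustration only; `build_label_mapping` is a hypothetical helper, not part of scikit-learn. On the fitted encoder itself, the same ordering can be read from `label_encoder.classes_`.

```python
def build_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order (hypothetical helper)."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

labels = ['positive', 'negative', 'neutral', 'positive']
mapping = build_label_mapping(labels)
encoded = [mapping[label] for label in labels]

print(mapping)   # {'negative': 0, 'neutral': 1, 'positive': 2}
print(encoded)   # [2, 0, 1, 2]
```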


In [None]:
# 4. TF-IDF Representation
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data['processed_text'])

# Convert TF-IDF matrix to DataFrame for easier viewing
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
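
With its default settings, `TfidfVectorizer` computes a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and then scales each row to unit Euclidean length. The sketch below works through that arithmetic in plain Python on three of the processed sentences. It assumes whitespace tokenization, whereas the real vectorizer uses a regular-expression tokenizer that also drops single-character tokens; `tfidf_rows` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2 row normalization."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocab, rows

docs = ['love programming python', 'python amazing data analysis', 'data science future']
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    weights = {t: round(w, 3) for t, w in zip(vocab, row) if w > 0}
    print(doc, '->', weights)
```

Within a row, terms shared across documents (here `python` and `data`) end up with a lower weight than terms that occur in only one sentence.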


In [None]:
# Save the outputs to CSV files
data.to_csv('cleaned_data.csv', index=False)
tfidf_df.to_csv('tfidf_representation.csv', index=False)


In [None]:
# Show outputs
print("Processed DataFrame with Labels and Cleaned Text:")
print(data[['text', 'processed_text', 'encoded_label']])

print("\nTF-IDF Representation:")
print(tfidf_df.head())

Processed DataFrame with Labels and Cleaned Text:
                                           text  \
0                 I love programming in Python!   
1          Python is amazing for data analysis.   
2                   Data science is the future.   
3                       I hate bugs in my code.   
4                  Debugging is so frustrating.   
5                    The weather is nice today.   
6                  I need more coffee to focus.   
7           The Python syntax is easy to learn.   
8   JavaScript can be tricky to learn at first.   
9  Data visualization is important in analysis.   

                          processed_text  encoded_label  
0                love programming python              2  
1           python amazing data analysis              2  
2                    data science future              2  
3                          hate bug code              0  
4                  debugging frustrating              0  
5                     weather nice today