# Assignment 03
**Name : Yugal Salunke**
**Roll_no : 391051**
**PRN : 22210227**
Batch A2

# Title:
Perform text cleaning, perform lemmatization (any method), remove stop words (any method),
label encoding. Create representations using TF-IDF. Save outputs.

**Dataset**: Spam Classifier


# Objectives:
1. Improve text data quality: Clean and prepare text for better analysis.
2. Enable machine learning: Transform text into a usable format for machine learning models.
3. Extract insights: Identify patterns and relationships within text.
4. Real-time applications: Process text efficiently for applications like chatbots and search engines.

# Theory:
The problem statement focuses on text preprocessing and feature extraction for Natural Language Processing (NLP) tasks. Text preprocessing aims to clean and prepare raw text data for analysis, while feature extraction transforms text into numerical representations for machine learning models.

**Key Techniques:**

**Text Cleaning:** Removes noise and irrelevant information from text, such as HTML tags, punctuation, and special characters.

Example: `"This is a\nsample text."` (with a stray line break) becomes `"This is a sample text."`

**Lemmatization:** Reduces words to their base or dictionary form (lemma).

Example: "running," "ran," and "runs" are lemmatized to "run" when they are treated as verbs.

**Stop Word Removal:** Eliminates common words like "the," "a," and "is" that carry little meaning.
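
As a minimal sketch of the idea, the snippet below filters tokens against a small hand-written stop word set. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical names used only for illustration; the notebook itself uses NLTK's much longer English list.

```python
# Hypothetical mini stop word list; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "is", "of", "and"}


def remove_stop_words(text):
    """Return the tokens of `text` that are not in STOP_WORDS (case-insensitive)."""
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


filtered = remove_stop_words("The price of the ticket is a bargain")
print(filtered)  # ['price', 'ticket', 'bargain']
```

Membership tests against a `set` take constant time on average, which is why the stop word list is usually converted to a set before filtering.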

**Label Encoding:** Converts categorical labels into numerical values for machine learning models.

Example: Labels "positive," "negative," and "neutral" might be encoded as 0, 1, and 2.
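
The mapping can be sketched in plain Python. The helper `fit_label_mapping` is a hypothetical name; it assigns integers in sorted label order, which is also how scikit-learn's `LabelEncoder` orders its `classes_`. Under that convention the three labels above would come out as negative = 0, neutral = 1, positive = 2.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, assigned in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


labels = ["positive", "negative", "neutral", "negative"]
mapping = fit_label_mapping(labels)
encoded = [mapping[label] for label in labels]

print(mapping)  # {'negative': 0, 'neutral': 1, 'positive': 2}
print(encoded)  # [2, 0, 1, 0]
```

The integers are identifiers only; their order carries no meaning for nominal labels such as spam versus not spam.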

**TF-IDF (Term Frequency-Inverse Document Frequency):** A numerical statistic reflecting a word's importance in a document relative to a collection of documents (corpus). It's widely used for feature extraction in text analysis.
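
To make the statistic concrete, here is a small from-scratch sketch. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each document vector, which is the variant scikit-learn documents as the `TfidfVectorizer` default (`smooth_idf=True`, `norm='l2'`). Tokenization is simplified to whitespace splitting, and the function name `tfidf` and the three sample documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights = {t: count * idf[t] for t, count in term_counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


docs = ["free prize claim free", "meeting agenda attached", "claim your prize"]
rows = tfidf(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

In the first document, "free" receives the largest weight: it occurs twice there and in no other document, whereas "claim" and "prize" also appear in the third document and are therefore discounted.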



# Import dataset from Kaggle

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("purusinghvi/email-spam-classification-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/purusinghvi/email-spam-classification-dataset?dataset_version_number=1...
Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/purusinghvi/email-spam-classification-dataset/versions/1


# Import libraries

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from textblob import TextBlob
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Download necessary resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
# Load dataset
df = pd.read_csv("/content/filtered_data.csv")

In [None]:
df

| | email | label |
|---|---|---|
| 0 | on fri sep NUMBER NUMBER at NUMBER NUMBER NUMB... | 0.0 |
| 1 | have you thought of bumping up sylpheed claws ... | 0.0 |
| 2 | title page has a login screen and i can t seem... | 0.0 |
| 3 | url URL date not supplied img URL wonderful ga... | 0.0 |
| 4 | URL spamassassin contrib URL changed what rem... | 0.0 |
| ... | ... | ... |
| 195 | volume NUMBER issue NUMBER sept NUMBER hyperl... | 1.0 |
| 196 | important information the new domain names are... | 1.0 |
| 197 | dear sir madam if you are fed up of being ripp... | 1.0 |
| 198 | when america s top companies compete for your ... | 1.0 |


In [None]:
# Initialize tools
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# 1. Text cleaning:
**URL removal**

In [None]:
def remove_urls(text):
    return re.sub(r'http\S+|www\S+', '', text)

**Text cleaning**

In [None]:
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = remove_urls(text)  # Remove URLs
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove special characters
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text
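
To see what these steps do in combination, the sketch below repeats `remove_urls` and `clean_text` as defined in the cells above and applies them to a made-up sample string. The sample text and the `example.com` URL are assumptions for illustration, not rows from the dataset.

```python
import re


def remove_urls(text):
    return re.sub(r'http\S+|www\S+', '', text)


def clean_text(text):
    text = text.lower()                       # Convert to lowercase
    text = remove_urls(text)                  # Remove URLs
    text = re.sub(r'\d+', '', text)           # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)       # Remove special characters
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse extra whitespace
    return text


sample = "Visit https://example.com NOW!!! Only 50% off,   call 555-1234."
cleaned = clean_text(sample)
print(cleaned)  # visit now only off call
```

The order matters: URLs are removed before punctuation is stripped, because stripping punctuation first would break `https://example.com` into fragments that the URL pattern no longer matches in full.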

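**Spelling correction (TextBlob)**

In [None]:
# Spelling correction with TextBlob. Note: correct() can also rewrite valid
# tokens it does not recognize (in the output below, "url" becomes "curl").
def correct_spelling(text):
    return str(TextBlob(text).correct())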

**Process all text (includes steps 2 and 3: lemmatization and stop word removal)**

In [None]:
def preprocess_text(text):
    text = clean_text(text)
    text = correct_spelling(text)
    words = word_tokenize(text)

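    # Drop stop words, then lemmatize the remaining tokens
    # (lemmatize() treats every token as a noun unless a POS tag is passed)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    # Stem the lemmas; this is why the output contains truncated forms such as "titl"
    stemmed_words = [stemmer.stem(word) for word in words]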

    return ' '.join(stemmed_words)

In [None]:
# Apply preprocessing to the email column
df['processed_email'] = df['email'].apply(preprocess_text)

In [None]:
df['processed_email']

| | processed_email |
|---|---|
| 0 | fro see number number number number number num... |
| 1 | thought bump sylphe claw see sylphe got bump s... |
| 2 | titl page login screen seem get apt index anym... |
| 3 | curl curl date suppli ing curl wonder galleri ... |
| 4 | curl spamassassin control curl chang remov ad ... |
| ... | ... |
| 195 | volum number issu number sept number hyperlink... |
| 196 | import inform new domain name final avail gene... |
| 197 | dear sir madam fed grip british govern everi t... |
| 198 | america top compani compet busi win today worl... |


# 4. Label encoding

In [None]:
# LabelEncoder is already imported in the libraries cell
label_encoder = LabelEncoder()

In [None]:
df['label_encoded'] = label_encoder.fit_transform(df['label'])
df['label_encoded']

| | label_encoded |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 195 | 1 |
| 196 | 1 |
| 197 | 1 |
| 198 | 1 |


# 5. TF-IDF


In [None]:
# Fit the vectorizer on the processed emails and build the TF-IDF matrix in one step
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_email'])

In [None]:
# Convert TF-IDF matrix to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

In [None]:
# Save processed data
df.to_csv("processed_dataset.csv", index=False)
tfidf_df.to_csv("tfidf_features.csv", index=False)

# Conclusion:
This assignment applied a text preprocessing pipeline to an email spam dataset: text cleaning, TextBlob spelling correction, NLTK tokenization, stop word removal, lemmatization, and stemming, followed by label encoding and TF-IDF feature extraction with scikit-learn. The processed text and the TF-IDF features were saved to CSV files, ready for use in NLP tasks such as spam classification.