## ***NLP Practical 5: Text Classification using Naïve Bayes and SVM***

***Name: Prexit Joshi***

***Roll No.: UE233118***

***CSE Section II, Group VII***

This notebook demonstrates how to implement and evaluate two classic machine learning models for text classification: Multinomial Naïve Bayes and the Support Vector Machine (SVM).

Text classification is the task of assigning a document to one or more predefined categories. We will use four categories from the popular `20 Newsgroups` dataset and convert the text into numerical TF-IDF features, on which both models are trained and evaluated.

In [5]:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset, focusing on 4 specific categories for simplicity
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
train_data = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
test_data = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

print("Dataset loaded successfully.")
print(f"Number of training samples: {len(train_data.data)}")
print(f"Number of testing samples: {len(test_data.data)}")

Dataset loaded successfully.
Number of training samples: 2257
Number of testing samples: 1502


## ***1. Feature Extraction using TF-IDF***

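Before feeding the text data to our models, we must convert it into a numerical format. The code below performs this **feature extraction** step using TF-IDF (Term Frequency-Inverse Document Frequency). Each document becomes a sparse numerical vector in which a word's weight grows with how often it occurs in that document and shrinks with how many documents in the corpus contain it. The vectorizer is fitted on the training data only and then reused to transform the test data, so the test set does not influence the vocabulary or the IDF weights.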

In [6]:
# Create a TF-IDF vectorizer to transform text into feature vectors
vectorizer = TfidfVectorizer()

# Fit the vectorizer on the training data and transform it
X_train = vectorizer.fit_transform(train_data.data)
y_train = train_data.target

# Transform the test data using the already fitted vectorizer
X_test = vectorizer.transform(test_data.data)
y_test = test_data.target

print(f"Shape of TF-IDF training matrix: {X_train.shape}")
print(f"Shape of TF-IDF testing matrix: {X_test.shape}")

Shape of TF-IDF training matrix: (2257, 35788)
Shape of TF-IDF testing matrix: (1502, 35788)


## ***2. Naïve Bayes Classifier***

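Naïve Bayes is a simple and efficient probabilistic classifier based on Bayes' Theorem. It is called "naïve" because it assumes that the features (here, words) are conditionally independent given the class. Despite this simplification, it is well suited to text classification because it trains quickly and copes well with high-dimensional feature spaces like our TF-IDF matrix. The multinomial variant is designed for word counts, but in practice it also works with fractional TF-IDF values.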
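
As a rough illustration of the decision rule, the sketch below scores each class as `log P(class) + sum(count * log P(word | class))` and checks that the highest score agrees with `MultinomialNB.predict`. The corpus, the labels and the `toy_*` names are hypothetical. Raw counts from `CountVectorizer` are used instead of TF-IDF only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels, not taken from 20 Newsgroups
toy_docs = [
    "gpu renders the image",
    "image rendering on the gpu",
    "the doctor treats the patient",
    "patient sees the doctor",
]
toy_labels = np.array([0, 0, 1, 1])  # 0 = graphics, 1 = medicine

toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_docs)

# alpha=1.0 (Laplace smoothing) is the default
toy_nb = MultinomialNB()
toy_nb.fit(X_toy, toy_labels)

# Words unseen during training are ignored by the vectorizer
X_new = toy_vectorizer.transform(["the doctor looks at the image"]).toarray()

# score(c) = log P(c) + sum_i count_i * log P(word_i | c)
scores = X_new @ toy_nb.feature_log_prob_.T + toy_nb.class_log_prior_
manual_pred = toy_nb.classes_[np.argmax(scores, axis=1)]

print("Manual prediction: ", manual_pred)
print("sklearn prediction:", toy_nb.predict(X_new))
```

The smoothed estimate stored in `feature_log_prob_` is the logarithm of `(N_wc + alpha) / (N_c + alpha * V)`, where `N_wc` is the count of word `w` in class `c`, `N_c` is the total word count of class `c`, and `V` is the vocabulary size.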

In [7]:
# Initialize and train the Multinomial Naïve Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

# Make predictions on the test set
nb_predictions = nb_classifier.predict(X_test)

# Evaluate the classifier's performance
print(f"Naïve Bayes Accuracy: {accuracy_score(y_test, nb_predictions):.4f}")
print("\nClassification Report for Naïve Bayes:")
print(classification_report(y_test, nb_predictions, target_names=test_data.target_names))

Naïve Bayes Accuracy: 0.8349

Classification Report for Naïve Bayes:
                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

              accuracy                           0.83      1502
             macro avg       0.89      0.82      0.83      1502
          weighted avg       0.88      0.83      0.84      1502
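
The per-class precision and recall in the report can be derived from a confusion matrix. The sketch below uses made-up labels (`demo_true` and `demo_pred` are hypothetical, not this notebook's predictions), chosen so that one class has low recall and another has low precision, similar in shape to the report above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for a 3-class problem (not the notebook's results)
demo_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
demo_pred = np.array([0, 0, 2, 2, 1, 1, 1, 2, 2, 2])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(demo_true, demo_pred)

recall = np.diag(cm) / cm.sum(axis=1)     # correct / all samples truly in the class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all samples predicted as the class

print(cm)
print("recall:   ", recall.round(2))
print("precision:", precision.round(2))

# Cross-check against scikit-learn's own per-class metrics
print(np.allclose(recall, recall_score(demo_true, demo_pred, average=None)))
print(np.allclose(precision, precision_score(demo_true, demo_pred, average=None)))
```

Here half of class 0 is predicted as class 2, which lowers the recall of class 0 and the precision of class 2 at the same time. In this notebook, `confusion_matrix(y_test, nb_predictions)` would show in the same way which classes the missed `alt.atheism` posts are assigned to.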



## ***3. Support Vector Machine (SVM) Classifier***

Support Vector Machines work by finding the hyperplane that separates the classes with the largest possible margin. Text data in a high-dimensional TF-IDF space is often close to linearly separable, so a linear kernel is usually sufficient and tends to give high accuracy.

In [8]:
# Initialize and train the SVM classifier with a linear kernel
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)

# Make predictions on the test set
svm_predictions = svm_classifier.predict(X_test)

# Evaluate the classifier's performance
print(f"SVM Accuracy: {accuracy_score(y_test, svm_predictions):.4f}")
print("\nClassification Report for SVM:")
print(classification_report(y_test, svm_predictions, target_names=test_data.target_names))

SVM Accuracy: 0.9208

Classification Report for SVM:
                        precision    recall  f1-score   support

           alt.atheism       0.96      0.83      0.89       319
         comp.graphics       0.90      0.96      0.93       389
               sci.med       0.94      0.91      0.93       396
soc.religion.christian       0.89      0.96      0.93       398

              accuracy                           0.92      1502
             macro avg       0.93      0.92      0.92      1502
          weighted avg       0.92      0.92      0.92      1502
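
As a possible variation, `LinearSVC` is another linear SVM in scikit-learn that is often faster than `SVC(kernel='linear')` on large sparse matrices. The sketch below is not the method used in this notebook: it chains a `TfidfVectorizer` and a `LinearSVC` in a pipeline and trains it on a hypothetical toy corpus (`svm_docs`, `svm_labels` and `text_clf` are illustrative names).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus and labels, not taken from 20 Newsgroups
svm_docs = [
    "gpu renders the image",
    "image rendering on the gpu",
    "the doctor treats the patient",
    "patient sees the doctor",
]
svm_labels = ["graphics", "graphics", "medicine", "medicine"]

# The pipeline fits the vectorizer and the classifier in one step,
# so raw text can be passed directly to fit() and predict()
text_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
text_clf.fit(svm_docs, svm_labels)

print(text_clf.predict(["gpu image", "doctor patient"]))
```

`LinearSVC` uses a squared hinge loss and a one-vs-rest scheme by default, so on the real dataset its scores would not necessarily match the `SVC` results reported above.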



## ***Conclusion***

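In this practical, we built and evaluated two text classification models on TF-IDF features from four `20 Newsgroups` categories. The **Naïve Bayes classifier** gave a fast, simple baseline with an accuracy of 0.8349, but its recall on `alt.atheism` was only 0.60 and its precision on `soc.religion.christian` only 0.65, which suggests that it over-predicts the latter class. The **Support Vector Machine (SVM)** with a linear kernel reached a higher accuracy of 0.9208, with F1-scores between 0.89 and 0.93 for every class, showing that it handles high-dimensional, sparse TF-IDF vectors well. Both methods are effective for categorizing text documents, with the SVM the more accurate of the two on this dataset.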