## LAB 10 - SVM FOR NON VECTORIAL DATA (03.03.2025)

## AIM
To apply an SVM classifier to the 20 Newsgroups dataset (loaded with `fetch_20newsgroups`), which is non-vectorial (text) data, and evaluate the model's performance using `classification_report`.

## ALGORITHM

1. **Load Data:**
   - Download the `20 Newsgroups` dataset using `fetch_20newsgroups`.
   - The dataset consists of text documents and their associated category labels.
   - Display the first few rows of the dataset with raw text and corresponding labels using a `DataFrame` for better understanding.

2. **Preprocess Data:**
   - Convert all text to lowercase to remove case sensitivity.
   - Remove special characters, numbers, and punctuation using regular expressions.
   - Tokenize the text (split into words).
   - Remove stopwords (common words like "the", "is", "and", etc., which are not useful for classification).
   - Apply lemmatization (convert words to their base form).
   - Join the processed tokens back into clean, preprocessed text.

3. **Split Data:**
   - Split the dataset into training and testing sets (e.g., 70% for training and 30% for testing).

4. **Vectorize Data:**
   - Convert the preprocessed text into numerical form using `TfidfVectorizer`.
   - This step produces TF-IDF feature vectors, keeping at most 5000 features (`max_features=5000`).
   
5. **Train SVM:**
   - Create an `SVC` (Support Vector Classifier) model with the `rbf` kernel.
   - Train the SVM model on the vectorized training data using `svm.fit(X_train, y_train)`.

6. **Make Predictions:**
   - Use the trained model to predict the labels for the test data using `svm.predict(X_test)`.

7. **Evaluate Model:**
   - Evaluate the model's performance using `classification_report(y_test, y_pred)`, which reports precision, recall, and F1-score for each class, along with the overall accuracy and the macro and weighted averages.
   - Print the overall accuracy of the classifier.
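
The split, vectorize, train, and predict steps above can be sketched with a scikit-learn `Pipeline`. This is only a minimal illustration on a toy corpus (the `texts` and `labels` values are invented stand-ins, not the newsgroup data). Note that the notebook cells under CODE AND OUTPUT call `fit_transform` on the whole corpus before splitting, whereas this sketch follows the step order listed above, so the TF-IDF vocabulary and IDF weights are learned from the training strings only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus standing in for the preprocessed newsgroup posts (hypothetical data).
texts = [
    "goal scored hockey team ice",
    "hockey player skate ice rink",
    "team won hockey game tonight",
    "ice hockey season playoff goal",
    "scsi card disk drive controller",
    "video card driver disk memory",
    "drive controller memory card bus",
    "disk drive scsi bus driver",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = hockey, 1 = hardware

# Step 3: split the raw strings first, keeping both classes in each part.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Steps 4-5: the pipeline fits TF-IDF on the training strings only,
# then trains the SVM on the resulting sparse vectors.
model = make_pipeline(TfidfVectorizer(max_features=5000), SVC(kernel="rbf"))
model.fit(X_train, y_train)

# Step 6: predict() reuses the already-fitted vectorizer on the test strings.
y_pred = model.predict(X_test)
print("predicted:", list(y_pred), "actual:", y_test)
```

Wrapping both stages in one object also means new documents can be passed to `model.predict` as plain strings, without a separate `transform` call.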

### Non-Vectorial Data:
Non-vectorial data is data that does not come as fixed-length numerical feature vectors, such as raw text, graphs, or variable-length sequences (e.g., audio clips or time series of differing lengths). Such data must be transformed into numerical vectors before it can be used by machine learning algorithms like Support Vector Machines (SVM).
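
As a small illustration of that transformation, the sketch below maps documents of different lengths to count vectors of one fixed length using a shared vocabulary (a plain bag-of-words encoding). The sample sentences and the helper name `to_count_vector` are made up for the example.

```python
from collections import Counter

# Three documents with different numbers of tokens (hypothetical data).
docs = ["the puck hit the net", "new scsi card", "the card failed"]

# Build a shared, sorted vocabulary from every token in the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc, vocab):
    """Map one document to a fixed-length vector of word counts."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_count_vector(doc, vocab) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(len(doc.split()), "tokens ->", len(vec), "dimensions:", vec)
```

Every document ends up with the same number of dimensions regardless of its length, which is the property an SVM needs from its input.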

### Using SVM for Non-Vectorial Data:
1. **Preprocessing**: The non-vectorial data (e.g., text) is first cleaned and normalized (e.g., lowercasing, tokenization, stopword removal, lemmatization).
2. **Feature Extraction**: A technique such as **TF-IDF** (for text) transforms the cleaned data into numerical vectors.
3. **Training the SVM**: The extracted feature vectors are used to train the **SVM** model. The SVM looks for the hyperplane that separates the classes with the largest margin in the feature space (or, with a non-linear kernel such as RBF, in the space induced by the kernel).
4. **Prediction**: Once the model is trained, it predicts labels for new, unseen data by converting that data to the same vector format with the already-fitted feature extractor and passing it through the trained SVM.
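
To make the feature extraction step concrete, here is a hand-rolled TF-IDF computation in plain Python. It assumes the smoothed IDF variant, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector, which is what scikit-learn documents as the default behaviour of `TfidfVectorizer`; the tokenized documents are invented for the example.

```python
import math
from collections import Counter

# Tiny tokenized corpus (hypothetical data).
docs = [["hockey", "goal", "hockey"], ["disk", "drive"], ["hockey", "disk"]]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = {word: sum(word in doc for doc in docs) for word in vocab}

# Smoothed inverse document frequency (assumed variant, see lead-in).
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

def tfidf_vector(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(doc)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

vectors = [tfidf_vector(doc) for doc in docs]

for word in vocab:
    print(f"{word:>7}: df={df[word]} idf={idf[word]:.3f}")
```

Terms that occur in fewer documents (here `goal` and `drive`) receive a larger IDF than terms shared across documents (`hockey`, `disk`), so they contribute more to the final vector.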

## CODE AND OUTPUT

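In [8]:

```python
from sklearn.datasets import fetch_20newsgroups

# Load every post and strip headers, footers, and quoted replies so the
# classifier learns from the message body rather than from metadata.
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
```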

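In [16]:

```python
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

# Build these once instead of on every call to preprocess_text.
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Lowercase, strip non-letters, drop stopwords, and lemmatize."""
    text = text.lower()

    # Keep only lowercase letters and whitespace.
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize on whitespace and remove stopwords.
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]

    # Reduce each token to its base form.
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return ' '.join(tokens)

processed_text = [preprocess_text(text) for text in newsgroups.data]
```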

```
[nltk_data] Downloading package stopwords to /home/ai-a1/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /home/ai-a1/nltk_data...
```


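In [17]:

```python
import pandas as pd

# Collect the cleaned text and the integer labels in one DataFrame.
df = pd.DataFrame({
    'Text': processed_text,
    'Label': newsgroups.target,
})

# Add the human-readable newsgroup name for each integer label.
df['Category'] = df['Label'].apply(lambda x: newsgroups.target_names[x])
df.head()
```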

|   | Text | Label | Category |
|---|------|-------|----------|
| 0 | sure bashers pen fan pretty confused lack kind... | 10 | rec.sport.hockey |
| 1 | brother market highperformance video card supp... | 3 | comp.sys.ibm.pc.hardware |
| 2 | finally said dream mediterranean new area grea... | 17 | talk.politics.mideast |
| 3 | think scsi card dma transfer disk scsi card dm... | 3 | comp.sys.ibm.pc.hardware |
| 4 | old jasmine drive cannot use new system unders... | 4 | comp.sys.mac.hardware |


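In [19]:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep the 5000 most frequent terms; stop_words='english' additionally
# applies scikit-learn's built-in English stopword list.
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

# X is a sparse TF-IDF matrix with one row per document.
X = vectorizer.fit_transform(df['Text'])
y = df['Label']
```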

In [21]:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Hold out 30% of the documents for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```

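In [26]:

```python
# Support vector classifier with the RBF kernel.
svm = SVC(kernel='rbf')

# Train on the TF-IDF vectors of the training documents.
svm.fit(X_train, y_train)

# Predict a newsgroup label for each test document.
y_pred = svm.predict(X_test)
```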

In [27]:

```python
# Per-class precision, recall, and F1-score, labeled with newsgroup names.
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=newsgroups.target_names))
```

```
Classification Report:
                          precision    recall  f1-score   support

             alt.atheism       0.57      0.61      0.59       236
           comp.graphics       0.56      0.68      0.61       287
 comp.os.ms-windows.misc       0.67      0.61      0.64       290
comp.sys.ibm.pc.hardware       0.61      0.64      0.62       285
   comp.sys.mac.hardware       0.78      0.60      0.68       312
          comp.windows.x       0.78      0.72      0.75       308
            misc.forsale       0.76      0.65      0.70       276
               rec.autos       0.49      0.75      0.59       304
         rec.motorcycles       0.59      0.77      0.67       279
      rec.sport.baseball       0.89      0.81      0.85       308
        rec.sport.hockey       0.97      0.84      0.90       309
               sci.crypt       0.90      0.70      0.79       290
         sci.electronics       0.57      0.67      0.62       304
                 sci.med       0.80      0.80      0
```

## RESULT

An SVM classifier was trained on TF-IDF features of the 20 Newsgroups dataset and achieved an overall accuracy of 70% on the test set.
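
The algorithm's evaluation step mentions printing the overall accuracy, which is the fraction of test documents whose predicted label matches the true label. A minimal sketch of that computation is shown here in plain Python; the function name `accuracy` and the demo label lists are illustrative only, and `sklearn.metrics.accuracy_score` computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty label list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Small made-up example: 8 of the 10 predictions are correct.
y_true_demo = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
y_pred_demo = [0, 1, 2, 1, 1, 0, 2, 0, 0, 2]

print(f"Accuracy: {accuracy(y_true_demo, y_pred_demo):.2f}")
```

With the notebook's variables, the equivalent call would be `accuracy(list(y_test), list(y_pred))`.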