# Text Classification

Text classification is the process of assigning predefined categories or labels to text data. It's a fundamental task in natural language processing (NLP) and machine learning, with applications ranging from spam detection and sentiment analysis to topic categorization and language identification. Here's an overview of text classification and its different approaches:

## Approaches to Text Classification

### Rule-Based Systems

- **Description:** These systems use predefined rules created by experts to classify text.
- **Advantages:** Simple to implement and understand.
- **Disadvantages:** Rules can be rigid and may not generalize well to new, unseen data.
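
As a rough illustration of the rule-based idea, the sketch below scores a text against hand-written keyword sets and picks the label with the most matches. The labels, the keywords, and the function name `classify_by_rules` are all invented for this example; real rule sets are usually larger and often rely on patterns rather than single words.

```python
import re

# Hypothetical rules: each label maps to keywords chosen by hand for illustration.
RULES = {
    "spam": {"free", "winner", "prize"},
    "support": {"refund", "broken", "help"},
}

def classify_by_rules(text, rules=RULES, default="other"):
    """Return the label whose keyword set matches the most tokens in `text`."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    scores = {label: len(tokens & keywords) for label, keywords in rules.items()}
    best_label = max(scores, key=scores.get)
    # Fall back to the default label when no rule fires at all.
    return best_label if scores[best_label] > 0 else default

print(classify_by_rules("You are a WINNER, claim your free prize"))   # spam
print(classify_by_rules("My order arrived broken, I need a refund"))  # support
print(classify_by_rules("See you at lunch tomorrow"))                 # other
```

The third call shows the rigidity mentioned above: any text that uses none of the listed words falls through to the default label.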

### Machine Learning Approaches

#### Traditional Machine Learning

- **Algorithms:** Naive Bayes, Support Vector Machines (SVM), Logistic Regression, Decision Trees, and Random Forests.
- **Process:** Involves feature extraction (e.g., TF-IDF, word embeddings) followed by training a model on labeled data.
- **Advantages:** Trains quickly, scales to large datasets, and often gives strong baselines with sparse features such as TF-IDF.
- **Disadvantages:** Requires feature engineering and may not capture semantic meaning well.
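
To make the feature-extraction step more concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It uses the textbook formula (term frequency times `log(N / df)`) with whitespace tokenization; this is an assumption made for clarity, and library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed and normalized variant, so the numbers will differ. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute textbook TF-IDF weights for a list of whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["the cat sat", "the dog barked", "the cat purred"]
vectors = tfidf(docs)
print(vectors[0])
```

In this toy corpus "the" appears in every document and receives a weight of zero, while "sat", which appears only once, outweighs "cat". A classifier trained on such vectors therefore attends mostly to the distinctive words.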

#### Deep Learning

- **Algorithms:** Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Transformers (e.g., BERT).
- **Process:** Involves training neural networks on raw text data, often using word embeddings or contextual embeddings.
- **Advantages:** Can capture complex, non-linear relationships and semantic meaning.
- **Disadvantages:** Requires large amounts of data and computational resources.

### Hybrid Approaches

- **Description:** Combines rule-based systems with machine learning models to leverage the strengths of both.
- **Advantages:** Can improve accuracy and robustness.
- **Disadvantages:** More complex to implement and maintain.
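
One common way to combine the two, sketched below under stated assumptions, is to let a few high-precision rules decide first and to defer to a learned model only when no rule matches. The rules, the stand-in `toy_model`, and the function name `hybrid_classify` are placeholders invented for illustration; in practice `model_predict` would wrap a trained classifier.

```python
def hybrid_classify(text, rules, model_predict, default="unknown"):
    """Apply high-precision rules first; defer to the model when none match."""
    lowered = text.lower()
    for phrase, label in rules:
        if phrase in lowered:
            return label, "rule"
    label = model_predict(text)
    return (label if label is not None else default), "model"

# Hypothetical rules and a stand-in model, both invented for illustration.
RULES = [("unsubscribe", "spam"), ("invoice attached", "billing")]

def toy_model(text):
    # Placeholder for a trained classifier's predict function.
    return "positive" if "great" in text.lower() else "negative"

print(hybrid_classify("Click to UNSUBSCRIBE now", RULES, toy_model))  # ('spam', 'rule')
print(hybrid_classify("Great service, thanks", RULES, toy_model))     # ('positive', 'model')
```

Returning the source of the decision alongside the label makes it easier to audit which component is responsible for a given prediction, which helps with the maintenance cost noted above.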

### Transfer Learning

- **Description:** Uses pre-trained models (e.g., BERT, RoBERTa) fine-tuned on specific text classification tasks.
- **Advantages:** Can achieve high accuracy with less data.
- **Disadvantages:** Fine-tuning large pre-trained models still demands substantial compute and memory, and results can be unstable on very small datasets.

## Key Considerations

- **Data Quality:** The quality and quantity of labeled data significantly impact model performance.
- **Feature Engineering:** Traditional machine learning models require careful feature selection and extraction.
- **Model Selection:** The choice of model depends on the specific task, dataset size, and computational resources.
- **Evaluation Metrics:** Common metrics include accuracy, precision, recall, F1-score, and AUC-ROC.
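
The metrics listed above can be computed by hand for a binary task, which may help when reading a classification report. The sketch below is a minimal pure-Python version; the labels and predictions are made up for illustration, and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3, and accuracy is 4/6. On imbalanced data the per-class metrics usually say more than accuracy alone.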

Text classification is a versatile technique with wide-ranging applications, and the choice of approach depends on the specific requirements and constraints of the task at hand.


# 1. Text Classification Using Random Forest Classifier

In [2]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load dataset
newsgroups = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)
X, y = newsgroups.data, newsgroups.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize text data
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train_vec, y_train)

# Predict and evaluate
y_pred = rf_classifier.predict(X_test_vec)
print("Random Forest Classification Report:\n", classification_report(y_test, y_pred, target_names=newsgroups.target_names))

Random Forest Classification Report:
                           precision    recall  f1-score   support

             alt.atheism       0.88      0.81      0.84       151
           comp.graphics       0.70      0.70      0.70       202
 comp.os.ms-windows.misc       0.73      0.87      0.80       195
comp.sys.ibm.pc.hardware       0.60      0.68      0.64       183
   comp.sys.mac.hardware       0.86      0.79      0.82       205
          comp.windows.x       0.89      0.73      0.80       215
            misc.forsale       0.78      0.78      0.78       193
               rec.autos       0.85      0.83      0.84       196
         rec.motorcycles       0.91      0.92      0.91       168
      rec.sport.baseball       0.83      0.90      0.86       211
        rec.sport.hockey       0.90      0.93      0.91       198
               sci.crypt       0.93      0.90      0.91       201
         sci.electronics       0.73      0.65      0.69       202
                 sci.med       0.81  

# 2. Text Classification Using LSTM

In [3]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SpatialDropout1D
from tensorflow.keras.utils import to_categorical

# Parameters
max_words = 5000
max_len = 300

# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)

# Convert labels to categorical
y_train_cat = to_categorical(y_train, num_classes=20)
y_test_cat = to_categorical(y_test, num_classes=20)

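# Build LSTM model: embedding -> dropout -> LSTM -> softmax over the 20 classes
model = Sequential()
# input_length is deprecated in recent Keras releases; the sequence length is
# inferred from the padded input instead
model.add(Embedding(max_words, 100))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(20, activation='softmax'))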

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

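# Train model
# Note: the test set doubles as the validation set here, so val_accuracy is
# not an independent estimate if it is used to tune the model
model.fit(X_train_pad, y_train_cat, epochs=5, batch_size=64, validation_data=(X_test_pad, y_test_cat))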

# Evaluate model
loss, accuracy = model.evaluate(X_test_pad, y_test_cat, verbose=0)
print(f"LSTM Accuracy: {accuracy}")

Epoch 1/5
236/236 ━━━━━━━━━━━━━━━━━━━━ 215s 882ms/step - accuracy: 0.1097 - loss: 2.8823 - val_accuracy: 0.2828 - val_loss: 2.2313
Epoch 2/5
236/236 ━━━━━━━━━━━━━━━━━━━━ 244s 811ms/step - accuracy: 0.3031 - loss: 2.1527 - val_accuracy: 0.3897 - val_loss: 1.8452
Epoch 3/5
236/236 ━━━━━━━━━━━━━━━━━━━━ 201s 804ms/step - accuracy: 0.4364 - loss: 1.7473 - val_accuracy: 0.4708 - val_loss: 1.6094
Epoch 4/5
236/236 ━━━━━━━━━━━━━━━━━━━━ 192s 761ms/step - accuracy: 0.5063 - loss: 1.5050 - val_accuracy: 0.5154 - val_loss: 1.4770
Epoch 5/5
236/236 ━━━━━━━━━━━━━━━━━━━━ 203s 767ms/step - accuracy: 0.5715 - loss: 1.2868 - val_accuracy: 0.5279 - val_loss: 1.4990
LSTM Accuracy: 0.5278514623641968
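
The `pad_sequences` calls in the cell above rely on the Keras defaults, which pad and truncate at the front of each sequence. The pure-Python sketch below mimics that behavior so the effect on token ids is easy to see; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad or left-truncate each list of token ids to exactly `maxlen` items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # keep the last `maxlen` ids, dropping the front
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[0, 0, 5, 8], [3, 4, 5, 6]]
```

With `max_len = 300`, any post longer than 300 tokens keeps only its final 300, so the opening of long messages never reaches the LSTM.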


# 3. Text Classification Using Transfer Learning

In [12]:
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from torch.optim import AdamW  # use PyTorch's AdamW rather than the one from transformers
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertForSequenceClassification

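# Toy dataset: five hand-written examples, so the split below leaves four for
# training and one for validation, and the reported metrics are only illustrative
data = {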
    'text': [
        "I love this product!",
        "This is the worst experience I've had.",
        "It's an okay service.",
        "Absolutely fantastic!",
        "Terrible, do not recommend."
    ],
    'label': [1, 0, 1, 1, 0]  # labels for positive (1) and negative (0) sentiment
}

df = pd.DataFrame(data)

# Split data into training and testing sets
train, test = train_test_split(df, test_size=0.2, random_state=42)

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Create a custom dataset
class NewsDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        inputs = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=False,
            return_attention_mask=True,
            return_tensors='pt',
            truncation=True
        )
        input_ids = inputs['input_ids'].flatten()
        attention_mask = inputs['attention_mask'].flatten()
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

# Create DataLoader
def create_data_loader(df, tokenizer, max_len, batch_size, shuffle=False):
    ds = NewsDataset(
        texts=df['text'].to_numpy(),
        labels=df['label'].to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len
    )
    return DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=2,
        shuffle=shuffle  # shuffle only the training data
    )

# Set parameters
MAX_LEN = 128
BATCH_SIZE = 16

# Create DataLoader for training and validation sets
train_data_loader = create_data_loader(train, tokenizer, MAX_LEN, BATCH_SIZE, shuffle=True)
val_data_loader = create_data_loader(test, tokenizer, MAX_LEN, BATCH_SIZE)

# Load BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Set up optimizer (torch.optim.AdamW has no correct_bias parameter)
optimizer = AdamW(model.parameters(), lr=2e-5)

# Use a GPU when one is available and move the model to it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Training loop
def train_epoch(model, data_loader, optimizer, device, scheduler, n_examples):
    model = model.train()
    losses = []
    correct_predictions = 0

    for d in data_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        labels = d["labels"].to(device)

        # Zero gradients first before forward pass
        optimizer.zero_grad()

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        loss = outputs.loss
        logits = outputs.logits
        losses.append(loss.item())

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()

        # Count the correct predictions in this batch
        preds = torch.argmax(logits, dim=1)
        correct_predictions += torch.sum(preds == labels).item()

    return correct_predictions / n_examples, np.mean(losses)

# Evaluation loop
def eval_model(model, data_loader, device, n_examples):
    model = model.eval()
    losses = []
    correct_predictions = 0

    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            labels = d["labels"].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )

            loss = outputs.loss
            logits = outputs.logits
            losses.append(loss.item())

            # Count the correct predictions in this batch
            preds = torch.argmax(logits, dim=1)
            correct_predictions += torch.sum(preds == labels).item()

    return correct_predictions / n_examples, np.mean(losses)

# Define epochs
EPOCHS = 10

# Create scheduler after optimizer is defined
from transformers import get_linear_schedule_with_warmup

total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

# Main training loop
for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')

    train_acc, train_loss = train_epoch(
        model,
        train_data_loader,
        optimizer,
        device,
        scheduler,
        len(train)
    )
    print(f'Train loss {train_loss:.4f} accuracy {train_acc:.4f}')

    val_acc, val_loss = eval_model(
        model,
        val_data_loader,
        device,
        len(test)
    )
    print(f'Val   loss {val_loss:.4f} accuracy {val_acc:.4f}\n')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/10
Train loss 0.6210 accuracy 0.7500
Val   loss 0.8118 accuracy 0.0000

Epoch 2/10
Train loss 0.5974 accuracy 0.7500
Val   loss 0.8570 accuracy 0.0000

Epoch 3/10
Train loss 0.5589 accuracy 0.7500
Val   loss 0.9022 accuracy 0.0000

Epoch 4/10
Train loss 0.4992 accuracy 0.7500
Val   loss 0.9414 accuracy 0.0000

Epoch 5/10
Train loss 0.4452 accuracy 0.7500
Val   loss 0.9768 accuracy 0.0000

Epoch 6/10
Train loss 0.4101 accuracy 1.0000
Val   loss 0.9987 accuracy 0.0000

Epoch 7/10
Train loss 0.4320 accuracy 0.7500
Val   loss 1.0201 accuracy 0.0000

Epoch 8/10
Train loss 0.4063 accuracy 0.7500
Val   loss 1.0363 accuracy 0.0000

Epoch 9/10
Train loss 0.4214 accuracy 0.7500
Val   loss 1.0464 accuracy 0.0000

Epoch 10/10
Train loss 0.4040 accuracy 1.0000
Val   loss 1.0518 accuracy 0.0000
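
To see what the scheduler in the training cell does to the learning rate, the sketch below reproduces the shape of a linear schedule with warmup in plain Python: the multiplier rises linearly during warmup, then decays linearly to zero. This is my reading of how `get_linear_schedule_with_warmup` behaves, and `linear_schedule_factor` is a hypothetical name, so treat it as an approximation rather than the library's code.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 10  # one batch per epoch for 10 epochs, matching the toy run above
for step in (0, 5, 10):
    factor = linear_schedule_factor(step, 0, total_steps)
    print(step, factor, base_lr * factor)
```

With four training examples and a batch size of 16, each epoch is a single optimizer step, so the learning rate has already halved by epoch 5 and reaches zero after epoch 10. On a dataset this small, that decay and the single validation example together make the reported accuracies hard to interpret.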

