# Lightweight Stance Detection using Naive Bayes (NaN Handling)

This notebook implements a simple, resource-efficient approach to stance detection using Multinomial Naive Bayes on TF-IDF features. Rows with a missing `processed_text` or `stance` value are dropped before training and evaluation.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import joblib

print("Libraries imported successfully.")

Libraries imported successfully.


## 1. Load and Prepare Data

In [2]:
# Load the preprocessed data
train_data = pd.read_csv('../data/processed/train.csv')
val_data = pd.read_csv('../data/processed/val.csv')
test_data = pd.read_csv('../data/processed/test.csv')

# Combine train and validation data
all_train_data = pd.concat([train_data, val_data], ignore_index=True)

# Drop rows where the text or label is missing; other columns (e.g. 'body') may still contain NaN
all_train_data = all_train_data.dropna(subset=['processed_text', 'stance'])
test_data = test_data.dropna(subset=['processed_text', 'stance'])

print(f"Training data: {len(all_train_data)}, Test data: {len(test_data)}")
print(f"Stance labels: {all_train_data['stance'].unique()}")

# Display some sample data
print("\nSample data from training set:")
print(all_train_data[['processed_text', 'stance']].head())

# Check class distribution
print("\nClass distribution in training set:")
print(all_train_data['stance'].value_counts(normalize=True))

# Check for any remaining NaN values
print("\nNaN values in training set:")
print(all_train_data.isna().sum())
print("\nNaN values in test set:")
print(test_data.isna().sum())

Training data: 6449, Test data: 1613
Stance labels: [2 0 1]

Sample data from training set:
                                      processed_text  stance
0           worst hurrican season evar accord expert       2
1                demand climat action finnish govern       0
2       home depot fine million sell ban superpollut       2
3  mexiko illegal abholz vertreibt ureinwohn orga...       2
4  web mobil dev look help hey web dev realli wan...       0

Class distribution in training set:
stance
2    0.889750
0    0.105288
1    0.004962
Name: proportion, dtype: float64

NaN values in training set:
id                   0
title                0
body              4992
score                0
num_comments         0
created_utc          0
language             0
subreddit            0
text                 0
processed_text       0
stance               0
dtype: int64

NaN values in test set:
id                   0
title                0
body              1228
score                0
num_comment

## 2. Create and Train the Model

In [5]:
# Create a pipeline with TF-IDF vectorizer and Multinomial Naive Bayes classifier
stance_classifier = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('clf', MultinomialNB())
])

# Train the model
stance_classifier.fit(all_train_data['processed_text'], all_train_data['stance'])

print("Model training completed.")

# Save the model, creating the models directory if it doesn't exist
import os

os.makedirs('../models', exist_ok=True)
joblib.dump(stance_classifier, '../models/naive_bayes_stance_classifier.joblib')
print("Model saved successfully.")

Model training completed.
Model saved successfully.


## 3. Evaluate the Model

In [6]:
# Make predictions on the test set
y_pred = stance_classifier.predict(test_data['processed_text'])

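# Generate the classification report; zero_division=0 reports 0.0 (without a
# warning) for any class that is never predicted
report = classification_report(test_data['stance'], y_pred, zero_division=0)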
print("Model Evaluation Report:")
print(report)

Model Evaluation Report:
              precision    recall  f1-score   support

           0       0.90      0.39      0.55       190
           1       0.00      0.00      0.00        13
           2       0.92      0.99      0.95      1410

    accuracy                           0.92      1613
   macro avg       0.61      0.46      0.50      1613
weighted avg       0.91      0.92      0.90      1613





## 4. Analyze Results and Next Steps

Based on the evaluation results, let's analyze the model's performance. The notebook does not define what the integer labels mean, so the stance names below are guesses. Note that the training sample "demand climat action finnish govern" is labelled 0, which sits uneasily with reading class 0 as "Against".

1. Overall accuracy:
   The model achieves an overall accuracy of 92%, which seems high at first glance. However, 87% of the test samples (1410 of 1613) belong to class 2, so a model that always predicts class 2 would already reach 87% accuracy.

2. Performance across different stances:
   - Class 0 (presumably "Against"): Precision is high (90%), but recall is low (39%), resulting in an F1-score of 0.55.
   - Class 1 (presumably "Neutral"): The model fails to detect this class entirely (precision, recall, and F1-score are all 0).
   - Class 2 (presumably "For"): High performance across all metrics (precision: 92%, recall: 99%, F1-score: 0.95).

3. Class imbalance:
   There's a severe class imbalance in the dataset. Out of 1613 test samples:
   - Class 0: 190 samples (11.8%)
   - Class 1: 13 samples (0.8%)
   - Class 2: 1410 samples (87.4%)

4. Model limitations:
   - The model is unable to identify the smallest class (Class 1).
   - It tends to overpredict the majority class (Class 2), which also costs it most of the recall on Class 0.
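
To put the 92% accuracy in context, the short sketch below computes the accuracy of a trivial classifier that always predicts the majority class. It uses only the test-set support counts from the classification report above; the variable names are illustrative.

```python
# Test-set support per class, copied from the classification report.
support = {0: 190, 1: 13, 2: 1410}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = support[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Always-majority baseline accuracy: {baseline_accuracy:.3f}")
```

The Naive Bayes model's 92% is therefore less than five percentage points above a model that has learned nothing about the text.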

Next steps:

1. Address class imbalance:
   - Use techniques like oversampling (e.g., SMOTE), undersampling, or a combination of both.
   - Adjust class weights in the classifier to give more importance to minority classes.

2. Feature engineering:
   - Experiment with different feature extraction methods (e.g., word embeddings like Word2Vec or GloVe).
   - Try different n-gram ranges or increase the number of features in TfidfVectorizer.

3. Try other classifiers:
   - Logistic Regression with balanced class weights
   - Support Vector Machines (SVM) with class weight adjustment
   - Random Forest or Gradient Boosting classifiers, ideally combined with class weights or resampling, since they do not correct for imbalance by themselves

4. Ensemble methods:
   - Implement voting classifiers or stacking with multiple base models

5. Error analysis:
   - Examine misclassified samples, especially for classes 0 and 1, to understand where the model is failing

6. Data augmentation:
   - For the minority classes, consider techniques like back-translation or synonym replacement to create more samples

7. Revisit preprocessing:
   - Ensure that the preprocessing steps are not inadvertently removing important information, especially for the minority classes

8. Consider using a small, lightweight version of a transformer model:
   - While full-scale transformers were too memory-intensive, a small version like DistilBERT or a lightweight BERT might work and could potentially handle the imbalance better
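
For the error-analysis step, a confusion table shows which classes get mistaken for which. The following is a minimal sketch in plain Python; the label lists are hypothetical and only illustrate the shape of the output, they are not taken from the dataset.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

# Hypothetical labels for illustration only; not taken from the dataset.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 2, 2, 2, 2, 2, 2, 2, 2, 0]

counts = confusion_counts(y_true, y_pred)
labels = sorted(set(y_true) | set(y_pred))

# Rows are true classes, columns are predicted classes.
print("true\\pred " + " ".join(f"{p:>4}" for p in labels))
for t in labels:
    print(f"{t:>9} " + " ".join(f"{counts[(t, p)]:>4}" for p in labels))
```

With the real data, `test_data['stance']` and `y_pred` would take the place of the toy lists. scikit-learn's `confusion_matrix` produces the same table as an array.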

Implementation plan:
1. Start by addressing the class imbalance through resampling techniques and class weight adjustments.
2. If performance is still unsatisfactory, experiment with different classifiers, focusing on those that handle imbalanced data well.
3. Conduct thorough error analysis to gain insights into the model's weaknesses.
4. Based on the error analysis, refine feature engineering and potentially revisit the preprocessing steps.
5. If these steps don't yield satisfactory results, consider implementing a lightweight transformer model or an ensemble method.
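
As a rough illustration of the class-weight adjustment in step 1, the sketch below applies the "balanced" heuristic, `n_samples / (n_classes * class_count)`, to the training set. The class counts are an assumption: they are reconstructed from the reported total (6449) and the class proportions, so treat them as approximate.

```python
# Training-set class counts, reconstructed from the reported total (6449)
# and class proportions; treat them as approximate.
counts = {0: 679, 1: 32, 2: 5738}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_for_class)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

for label, w in sorted(weights.items()):
    print(f"class {label}: weight {w:.2f}")
```

Under these counts, a class 1 sample is weighted roughly 180 times as heavily as a class 2 sample, which shows how extreme the imbalance is.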

Remember, for a stance detection task, it's crucial to have good performance across all classes, not just high overall accuracy. The current model's inability to detect class 1 (the presumed neutral stance) is a significant limitation that needs to be addressed before using it in any practical application.
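
The gap between overall accuracy and per-class performance can be made concrete with macro averages, which weight every class equally. The sketch below recomputes them from the per-class rows of the Naive Bayes report; the helper name `macro_average` is illustrative.

```python
# Per-class metrics copied from the Naive Bayes classification report.
per_class = {
    0: {"precision": 0.90, "recall": 0.39, "f1": 0.55},
    1: {"precision": 0.00, "recall": 0.00, "f1": 0.00},
    2: {"precision": 0.92, "recall": 0.99, "f1": 0.95},
}

def macro_average(metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[metric] for row in per_class.values()) / len(per_class)

macro = {m: macro_average(m) for m in ("precision", "recall", "f1")}
for metric, value in macro.items():
    print(f"macro {metric}: {value:.2f}")
```

These values match the "macro avg" row of the report and are a more honest summary than the 0.92 accuracy when the classes are this unbalanced.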

## 5. Test the Model on New Data (Optional)

In [7]:
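# Function to predict stance for new text.
# Note: the model was trained on 'processed_text', so new text should go through
# the same preprocessing before prediction; raw sentences are passed here as-is.
def predict_stance(text):
    return stance_classifier.predict([text])[0]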

# Test the model on some example texts
example_texts = [
    "Climate change is a hoax perpetrated by scientists for grant money.",
    "We need urgent action to reduce carbon emissions and save our planet.",
    "The jury is still out on whether human activities are causing global warming."
]

for text in example_texts:
    stance = predict_stance(text)
    print(f"Text: {text}")
    print(f"Predicted stance: {stance}\n")

Text: Climate change is a hoax perpetrated by scientists for grant money.
Predicted stance: 2

Text: We need urgent action to reduce carbon emissions and save our planet.
Predicted stance: 2

Text: The jury is still out on whether human activities are causing global warming.
Predicted stance: 2
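
All three examples are assigned to class 2. One possible contributing factor, stated here as an assumption because the preprocessing code is not part of this notebook, is that the model was trained on `processed_text` (which looks stemmed and stripped of stop words) while the examples are raw sentences. The sketch below uses a hypothetical vocabulary of stems to show how fewer tokens match when the text is not preprocessed.

```python
import re

# Hypothetical vocabulary of stemmed tokens, modelled on the sample rows
# (e.g. "climat", "hurrican"); the real fitted vocabulary will differ.
stem_vocabulary = {"climat", "action", "urgent", "emiss", "carbon", "planet", "reduc"}

def tokens(text):
    """Lowercase word tokens, roughly what a default tokenizer would see."""
    return re.findall(r"[a-z]+", text.lower())

def known_tokens(text):
    """Tokens of `text` that appear in the stemmed vocabulary."""
    return [t for t in tokens(text) if t in stem_vocabulary]

raw = "We need urgent action to reduce carbon emissions and save our planet."
# Hand-written stand-in for the output of the (unseen) preprocessing step.
preprocessed = "need urgent action reduc carbon emiss save planet"

print("raw text matches:         ", known_tokens(raw))
print("preprocessed text matches:", known_tokens(preprocessed))
```

In this toy setup the raw sentence loses "reduce" and "emissions" because only their stems are in the vocabulary. Running new text through the same preprocessing used to build `processed_text` would rule this effect out.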



In [9]:
# Import necessary libraries
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Create a pipeline with SMOTE, TF-IDF vectorizer, and Multinomial Naive Bayes classifier
balanced_stance_classifier = ImbPipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('smote', SMOTE(random_state=42)),
    ('clf', MultinomialNB())
])

# Train the model
balanced_stance_classifier.fit(all_train_data['processed_text'], all_train_data['stance'])

print("Balanced model training completed.")

# Save the model
import os
import joblib

# Create the models directory if it doesn't exist
os.makedirs('../models', exist_ok=True)

# Save the model
joblib.dump(balanced_stance_classifier, '../models/balanced_naive_bayes_stance_classifier.joblib')
print("Balanced model saved successfully.")

# Evaluate the balanced model
y_pred_balanced = balanced_stance_classifier.predict(test_data['processed_text'])

# Generate the classification report
from sklearn.metrics import classification_report
report_balanced = classification_report(test_data['stance'], y_pred_balanced)
print("Balanced Model Evaluation Report:")
print(report_balanced)

# Test the balanced model on example texts
def predict_stance_balanced(text):
    return balanced_stance_classifier.predict([text])[0]

example_texts = [
    "Climate change is a hoax perpetrated by scientists for grant money.",
    "We need urgent action to reduce carbon emissions and save our planet.",
    "The jury is still out on whether human activities are causing global warming."
]

print("\nPredictions with balanced model:")
for text in example_texts:
    stance = predict_stance_balanced(text)
    print(f"Text: {text}")
    print(f"Predicted stance: {stance}\n")

Balanced model training completed.
Balanced model saved successfully.
Balanced Model Evaluation Report:
              precision    recall  f1-score   support

           0       0.32      0.87      0.47       190
           1       0.24      0.62      0.35        13
           2       0.97      0.74      0.84      1410

    accuracy                           0.75      1613
   macro avg       0.51      0.74      0.55      1613
weighted avg       0.89      0.75      0.79      1613


Predictions with balanced model:
Text: Climate change is a hoax perpetrated by scientists for grant money.
Predicted stance: 2

Text: We need urgent action to reduce carbon emissions and save our planet.
Predicted stance: 0

Text: The jury is still out on whether human activities are causing global warming.
Predicted stance: 2
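
The `SMOTE` step in the cell above oversamples minority classes by creating synthetic points between a minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on two hypothetical feature vectors. The neighbour search and the other details of the imbalanced-learn implementation are left out.

```python
import random

def interpolate(sample, neighbour, rng):
    """Return a synthetic point on the segment between two feature vectors."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(42)

# Hypothetical TF-IDF-like vectors for two minority-class documents.
sample = [0.0, 0.5, 0.2]
neighbour = [0.4, 0.1, 0.2]

synthetic = interpolate(sample, neighbour, rng)
print(synthetic)
```

Because the sampler sits inside an imbalanced-learn pipeline, resampling is applied only while fitting. The test set is evaluated with its original class distribution, which is why the support column in the report is unchanged.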



In [10]:
# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# Compute class weights
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(all_train_data['stance']), y=all_train_data['stance'])
class_weight_dict = dict(zip(np.unique(all_train_data['stance']), class_weights))

# Create a pipeline with TF-IDF vectorizer and Logistic Regression classifier
lr_stance_classifier = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 3))),
    ('clf', LogisticRegression(class_weight=class_weight_dict, max_iter=1000))
])

# Train the model
lr_stance_classifier.fit(all_train_data['processed_text'], all_train_data['stance'])

print("Logistic Regression model training completed.")

# Save the model
import joblib

joblib.dump(lr_stance_classifier, '../models/logistic_regression_stance_classifier.joblib')
print("Logistic Regression model saved successfully.")

# Evaluate the Logistic Regression model
y_pred_lr = lr_stance_classifier.predict(test_data['processed_text'])

# Generate the classification report
report_lr = classification_report(test_data['stance'], y_pred_lr)
print("Logistic Regression Model Evaluation Report:")
print(report_lr)

# Test the Logistic Regression model on example texts
def predict_stance_lr(text):
    return lr_stance_classifier.predict([text])[0]

example_texts = [
    "Climate change is a hoax perpetrated by scientists for grant money.",
    "We need urgent action to reduce carbon emissions and save our planet.",
    "The jury is still out on whether human activities are causing global warming."
]

print("\nPredictions with Logistic Regression model:")
for text in example_texts:
    stance = predict_stance_lr(text)
    print(f"Text: {text}")
    print(f"Predicted stance: {stance}\n")

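# Optional: Print feature weights.
# For a multiclass model, coef_ has one row per class. coef_[0] holds the weights
# for the first class in lr.classes_ (class 0 here), so this ranks features by how
# strongly they push a prediction towards or away from class 0.
tfidf = lr_stance_classifier.named_steps['tfidf']
lr = lr_stance_classifier.named_steps['clf']
feature_importance = pd.DataFrame({
    'feature': tfidf.get_feature_names_out(),
    'importance': lr.coef_[0]
}).sort_values('importance', ascending=False)

print("Top 10 most important features:")
print(feature_importance.head(10))
# The tail holds the most negative weights (evidence against class 0),
# not features that are unimportant
print("\nBottom 10 least important features:")
print(feature_importance.tail(10))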

Logistic Regression model training completed.
Logistic Regression model saved successfully.
Logistic Regression Model Evaluation Report:
              precision    recall  f1-score   support

           0       0.75      0.85      0.80       190
           1       1.00      0.54      0.70        13
           2       0.98      0.96      0.97      1410

    accuracy                           0.95      1613
   macro avg       0.91      0.78      0.82      1613
weighted avg       0.95      0.95      0.95      1613


Predictions with Logistic Regression model:
Text: Climate change is a hoax perpetrated by scientists for grant money.
Predicted stance: 2

Text: We need urgent action to reduce carbon emissions and save our planet.
Predicted stance: 0

Text: The jury is still out on whether human activities are causing global warming.
Predicted stance: 2

Top 10 most important features:
       feature  importance
105     action    7.044596
7191      real    4.594558
7207    realli    4.267383


In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight

# Reload and clean the data: this cell runs in a fresh kernel, where
# all_train_data and test_data are not yet defined
train_data = pd.read_csv('../data/processed/train.csv')
val_data = pd.read_csv('../data/processed/val.csv')
test_data = pd.read_csv('../data/processed/test.csv')
all_train_data = pd.concat([train_data, val_data], ignore_index=True)
all_train_data = all_train_data.dropna(subset=['processed_text', 'stance'])
test_data = test_data.dropna(subset=['processed_text', 'stance'])

# Compute class weights
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(all_train_data['stance']), y=all_train_data['stance'])
class_weight_dict = dict(zip(np.unique(all_train_data['stance']), class_weights))

# Create individual classifiers
lr = LogisticRegression(class_weight=class_weight_dict, max_iter=1000, C=0.1)  # Increased regularization
nb = MultinomialNB()
nn = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000)

# Create the ensemble classifier
ensemble_classifier = VotingClassifier(
    estimators=[('lr', lr), ('nb', nb), ('nn', nn)],
    voting='soft'
)

# Create the pipeline
stance_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 3))),
    ('clf', ensemble_classifier)
])

# Perform cross-validation (default scoring for classifiers is accuracy)
cv_scores = cross_val_score(stance_pipeline, all_train_data['processed_text'], all_train_data['stance'], cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

# Train the model on the full training set
stance_pipeline.fit(all_train_data['processed_text'], all_train_data['stance'])

print("Ensemble model training completed.")

# Save the model
import joblib
joblib.dump(stance_pipeline, '../models/ensemble_stance_classifier.joblib')
print("Ensemble model saved successfully.")

# Evaluate the model on the test set
y_pred = stance_pipeline.predict(test_data['processed_text'])
report = classification_report(test_data['stance'], y_pred)
print("Ensemble Model Evaluation Report:")
print(report)

# Test the model on example texts
def predict_stance(text):
    return stance_pipeline.predict([text])[0]

example_texts = [
    "Climate change is a hoax perpetrated by scientists for grant money.",
    "We need urgent action to reduce carbon emissions and save our planet.",
    "The jury is still out on whether human activities are causing global warming."
]

print("\nPredictions with Ensemble model:")
for text in example_texts:
    stance = predict_stance(text)
    print(f"Text: {text}")
    print(f"Predicted stance: {stance}\n")

# Error analysis: rows where the prediction differs from the true label
misclassified = test_data[y_pred != test_data['stance']]
print("\nSample of misclassified instances:")
print(misclassified[['processed_text', 'stance']].sample(min(5, len(misclassified))))