# Naive Bayes Model Testing

This notebook tests the Naive Bayes model for sentiment prediction on news headlines.

In [1]:
# Import necessary libraries
import sys
import os
import pandas as pd
from pathlib import Path

# Add project root to path for imports
sys.path.append(os.path.abspath('../..'))

# Import project modules
from src.models.predict_model import ModelPredictor
from src.models.train_model import ModelTrainer
from src.data.preprocess import DataPreprocessor
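# Import only the configuration constants this notebook uses
from src.config import MODEL_DIR, PROCESSED_DATA_PATH, RAW_DATA_PATH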

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Avocado\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Avocado\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Avocado\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 1. Initialize the Model Predictor

We check whether a trained Naive Bayes model exists on disk and train one if it does not, then create the `ModelPredictor` instance used for all predictions below.

In [2]:
# Check if model exists, otherwise train it
model_path = os.path.join(MODEL_DIR, 'naive_bayes_model.pkl')
if not os.path.exists(model_path):
    print("Naive Bayes model not found. Training a new model...")
    # Prepare data
    default_path = os.path.join(PROCESSED_DATA_PATH, "processed_dataset.csv")
    preprocessor = DataPreprocessor(default_path)
    preprocessor.clean_data()
    preprocessor.split_data()
    
    # Train the model
    trainer = ModelTrainer()
    trainer.train_naive_bayes(preprocessor=preprocessor)
else:
    print("Naive Bayes model found.")

# Initialize the predictor
predictor = ModelPredictor()

Naive Bayes model found.
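
For readers unfamiliar with what a Naive Bayes text classifier does at prediction time, the following is a minimal sketch of multinomial Naive Bayes with Laplace smoothing in plain Python. It is not the project's `ModelTrainer` implementation, whose internals are not shown in this notebook. The toy training headlines, the `train_nb` and `predict_nb` helpers, and the whitespace tokenizer are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data (not from the project's dataset)
TRAIN = [
    ("profits rise on strong sales", "positive"),
    ("revenue growth beats forecast", "positive"),
    ("company reports heavy losses", "negative"),
    ("sales fall amid weak demand", "negative"),
    ("board meeting scheduled for monday", "neutral"),
    ("company publishes annual report", "neutral"),
]

def train_nb(samples):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict_nb(text, word_counts, class_counts, vocab, alpha=1.0):
    """Return (best_label, probabilities), scoring in log space."""
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label in class_counts:
        # Start from the log prior P(class)
        score = math.log(class_counts[label] / total_docs)
        # Laplace-smoothed denominator for P(word | class)
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][token] + alpha) / denom)
        log_scores[label] = score
    # Convert log scores to probabilities that sum to 1
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    probs = {k: v / norm for k, v in exp_scores.items()}
    return max(probs, key=probs.get), probs

model = train_nb(TRAIN)
label, probs = predict_nb("strong sales growth", *model)
print(label, {k: round(v, 2) for k, v in probs.items()})
```

Because each word contributes independently, a headline made of words that appear in several classes ends up with probabilities that sit close together, which is worth keeping in mind when reading the confidence values in the following sections.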


## 2. Test Single Headline Prediction

Let's test the model on a single headline first to check if everything is working.

In [3]:
# Test with a single positive headline
test_headline = "Company profits exceed expectations in Q1 2025"
result = predictor.predict_naive_bayes(test_headline)

# Display the result
if result:
    r = result[0]  # Get the first result
    print(f"\nHeadline: {r['headline']}")
    print(f"Predicted Sentiment: {r['sentiment']}")
    print(f"Confidence: {r['confidence']:.2f}")
    
    print("\nAll Probabilities:")
    for sentiment, prob in r['probabilities'].items():
        print(f"- {sentiment}: {prob:.2f}")
else:
    print("Prediction failed or no model found.")

Loaded Naive Bayes model from default path: e:\học\Năm 3 Kỳ 2\Machine learnin\BTL\ml-course-shibainu\models\trained\naive_bayes_model.pkl

Headline: Company profits exceed expectations in Q1 2025
Predicted Sentiment: negative
Confidence: 0.41

All Probabilities:
- negative: 0.41
- neutral: 0.20
- positive: 0.39
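
The headline reads as positive, yet the model picks `negative` by a narrow margin, and the reported confidence matches the probability of the winning class. A small sketch of how one might flag such near-ties is shown below. The probabilities are copied from the output above; the `MIN_MARGIN` threshold is a hypothetical value chosen only for illustration and is not part of the project.

```python
# Probabilities copied from the output above
probabilities = {"negative": 0.41, "neutral": 0.20, "positive": 0.39}

# Rank classes from most to least likely
ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
(top_label, top_prob), (runner_up, runner_prob) = ranked[0], ranked[1]

# Margin between the winning class and the runner-up
margin = top_prob - runner_prob

# MIN_MARGIN is a hypothetical threshold chosen for illustration only
MIN_MARGIN = 0.10
is_ambiguous = margin < MIN_MARGIN

print(f"Top: {top_label} ({top_prob:.2f}), runner-up: {runner_up} ({runner_prob:.2f})")
print(f"Margin: {margin:.2f} -> {'ambiguous' if is_ambiguous else 'clear'}")
```

With a margin of about 0.02 between `negative` and `positive`, this prediction is better read as "undecided" than as a confident negative.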


## 3. Test Multiple Headlines

Now let's test the model on multiple headlines with different expected sentiments.

In [4]:
# Test with multiple headlines
test_headlines = [
    "Stock market reaches all-time high as investor confidence grows",
    "Major company announces significant layoffs due to economic downturn",
    "Global trade continues at steady pace despite mild fluctuations",
    "Tech giant releases new product line with innovative features",
    "Retail sales decline for third consecutive quarter"
]

results = predictor.predict_naive_bayes(test_headlines)

# Display the results
if results:
    for r in results:
        print(f"Headline: {r['headline']}")
        print(f"Predicted Sentiment: {r['sentiment']} (confidence: {r['confidence']:.2f})")
        print()
else:
    print("Prediction failed or no model found.")

Loaded Naive Bayes model from default path: e:\học\Năm 3 Kỳ 2\Machine learnin\BTL\ml-course-shibainu\models\trained\naive_bayes_model.pkl
Headline: Stock market reaches all-time high as investor confidence grows
Predicted Sentiment: neutral (confidence: 0.44)

Headline: Major company announces significant layoffs due to economic downturn
Predicted Sentiment: neutral (confidence: 0.62)

Headline: Global trade continues at steady pace despite mild fluctuations
Predicted Sentiment: neutral (confidence: 0.46)

Headline: Tech giant releases new product line with innovative features
Predicted Sentiment: neutral (confidence: 0.56)

Headline: Retail sales decline for third consecutive quarter
Predicted Sentiment: positive (confidence: 0.48)
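
A quick tally of the predicted labels makes the pattern in this output easier to see. The sketch below uses only the five predictions printed above.

```python
from collections import Counter

# Predicted labels copied from the output above, in order
predicted = ["neutral", "neutral", "neutral", "neutral", "positive"]

counts = Counter(predicted)
total = len(predicted)
for label, n in counts.most_common():
    print(f"{label}: {n}/{total} ({n / total:.0%})")
```

Four of the five headlines are labelled `neutral` and none `negative`, even though two of them describe layoffs and declining sales.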



## 4. Test on Real Dataset

Let's load the full dataset from `all-data.csv` in the raw data directory and predict the sentiment of every headline.

In [5]:
# Load test dataset
test_data_path = os.path.join(RAW_DATA_PATH, "all-data.csv")
test_df = pd.read_csv(test_data_path)

print(f"Loaded test data with {len(test_df)} headlines")

# Show a few examples
test_df.head(5)

Loaded test data with 9088 headlines


| | Sentiment | News Headline |
|---|---|---|
| 0 | positive | The firm has seen substantial revenue growth, ... |
| 1 | positive | The recent partnership is expected to signific... |
| 2 | positive | Increased international sales have propelled t... |
| 3 | positive | The company's proactive customer engagement ha... |
| 4 | positive | A successful product launch has already starte... |


In [6]:
# Make predictions
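# Strip whitespace from column names first: the raw CSV header carries a leading space
headlines = test_df.rename(columns=str.strip)['News Headline'].tolist()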
results = predictor.predict_naive_bayes(headlines)

# Create a dataframe with predictions
if results:
    predicted_sentiments = [r['sentiment'] for r in results]
    confidence_scores = [round(r['confidence'], 2) for r in results]
    
    # Add predictions to the dataframe
    results_df = test_df.copy()
    results_df = results_df.rename(columns={'Sentiment': 'Actual Sentiment'})
    results_df['Predicted Sentiment'] = predicted_sentiments
    results_df['Confidence'] = confidence_scores
    
    print("Predictions completed")

    # Show some results (only when predictions succeeded, so results_df is defined)
    display(results_df.head(3))
else:
    print("Prediction failed or no model found.")

Loaded Naive Bayes model from default path: e:\học\Năm 3 Kỳ 2\Machine learnin\BTL\ml-course-shibainu\models\trained\naive_bayes_model.pkl
Predictions completed


| | Actual Sentiment | News Headline | Predicted Sentiment | Confidence |
|---|---|---|---|---|
| 0 | positive | The firm has seen substantial revenue growth, ... | positive | 0.56 |
| 1 | positive | The recent partnership is expected to signific... | positive | 0.69 |
| 2 | positive | Increased international sales have propelled t... | positive | 0.63 |


## 5. Model Evaluation

Let's evaluate the model's performance on the dataset loaded above.

In [7]:
from sklearn.metrics import classification_report

y_true = results_df['Actual Sentiment']
y_pred = results_df['Predicted Sentiment']

print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

    negative       0.85      0.18      0.30      3000
     neutral       0.46      0.94      0.62      3089
    positive       0.76      0.53      0.63      2999

    accuracy                           0.56      9088
   macro avg       0.69      0.55      0.51      9088
weighted avg       0.69      0.56      0.51      9088
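
The report shows high precision but very low recall for `negative`, and the opposite for `neutral`, which suggests many negative headlines are being labelled neutral. A confusion matrix would show exactly where the errors go. The sketch below builds one in plain Python and derives precision, recall and F1 from it; the `confusion_counts` and `per_class_metrics` helpers and the toy labels are hypothetical and are not taken from the notebook's data.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]

def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) pairs."""
    return Counter(zip(y_true, y_pred))

def per_class_metrics(matrix, labels):
    """Derive precision, recall and F1 for each label from the pair counts."""
    metrics = {}
    for label in labels:
        tp = matrix[(label, label)]
        actual = sum(matrix[(label, p)] for p in labels)     # row total
        predicted = sum(matrix[(a, label)] for a in labels)  # column total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics

# Hypothetical toy labels, not taken from the notebook's dataset
y_true = ["negative", "negative", "negative", "neutral", "neutral", "positive", "positive"]
y_pred = ["neutral", "neutral", "negative", "neutral", "neutral", "positive", "neutral"]

matrix = confusion_counts(y_true, y_pred)

# Rows are actual labels, columns are predicted labels
print("actual/predicted".ljust(20) + "".join(name.ljust(10) for name in LABELS))
for a in LABELS:
    print(a.ljust(20) + "".join(str(matrix[(a, p)]).ljust(10) for p in LABELS))

class_metrics = per_class_metrics(matrix, LABELS)
for label, (p, r, f1) in class_metrics.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In the notebook itself, `pd.crosstab(y_true, y_pred)` should give the equivalent table for the real predictions.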



In [8]:
# Count correct vs incorrect predictions
is_correct = results_df['Predicted Sentiment'] == results_df['Actual Sentiment']
is_correct.value_counts()

True     5047
False    4041
Name: count, dtype: int64
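
As a sanity check, the accuracy in the classification report can be recomputed from these two counts alone. The numbers below are copied from the output above.

```python
# Counts copied from the value_counts() output above
correct, incorrect = 5047, 4041

total = correct + incorrect
accuracy = correct / total

print(f"Total headlines: {total}")
print(f"Accuracy: {accuracy:.4f}")
```

The total matches the 9088 headlines loaded earlier, and the accuracy rounds to the 0.56 shown in the report.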

## 6. Compare with Neural Network

Let's compare the Naive Bayes model with the Neural Network model.

In [9]:
# Make predictions with the Neural Network model
nn_results = predictor.predict_neural_network(headlines)  # Uses the full list of headlines, so results align with the Naive Bayes predictions

if nn_results:
    # Create comparison dataframe
    comparison_df = pd.DataFrame({
        'Headline': headlines,
        'Actual': test_df['Sentiment'],
        'NB_Prediction': [r['sentiment'] for r in results],
        'NB_Confidence': [r['confidence'] for r in results],
        'NN_Prediction': [r['sentiment'] for r in nn_results],
        'NN_Confidence': [r['confidence'] for r in nn_results]
    })
    
    # Add columns for correct/incorrect predictions
    comparison_df['NB_Correct'] = comparison_df['NB_Prediction'] == comparison_df['Actual']
    comparison_df['NN_Correct'] = comparison_df['NN_Prediction'] == comparison_df['Actual']
    comparison_df['Both_Correct'] = comparison_df['NB_Correct'] & comparison_df['NN_Correct']
    comparison_df['Both_Wrong'] = ~comparison_df['NB_Correct'] & ~comparison_df['NN_Correct']
    
    # Display summary
    print("Model Comparison Summary:")
    print(f"Naive Bayes Accuracy: {comparison_df['NB_Correct'].mean():.4f}")
    print(f"Neural Network Accuracy: {comparison_df['NN_Correct'].mean():.4f}")
    print(f"Both correct: {comparison_df['Both_Correct'].sum()} headlines")
    print(f"Both wrong: {comparison_df['Both_Wrong'].sum()} headlines")
    print(f"Only Naive Bayes correct: {(comparison_df['NB_Correct'] & ~comparison_df['NN_Correct']).sum()} headlines")
    print(f"Only Neural Network correct: {(~comparison_df['NB_Correct'] & comparison_df['NN_Correct']).sum()} headlines")
    
    # Show a few examples where models disagree
    disagreement = comparison_df[comparison_df['NB_Prediction'] != comparison_df['NN_Prediction']]
    if not disagreement.empty:
        print("\nExamples where models disagree:")
        display(disagreement.head(3))
else:
    print("Neural Network prediction failed.")

Using most recent model: e:\học\Năm 3 Kỳ 2\Machine learnin\BTL\ml-course-shibainu\models\trained\rnn_076.pkl
[1m284/284[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step
Prediction class distribution:
negative (class 0): 2549 predictions
neutral (class 1): 3758 predictions
positive (class 2): 2781 predictions
Model Comparison Summary:
Naive Bayes Accuracy: 0.5553
Neural Network Accuracy: 0.8544
Both correct: 4362 headlines
Both wrong: 638 headlines
Only Naive Bayes correct: 685 headlines
Only Neural Network correct: 3403 headlines

Examples where models disagree:


| | Headline | Actual | NB_Prediction | NB_Confidence | NN_Prediction | NN_Confidence | NB_Correct | NN_Correct | Both_Correct | Both_Wrong |
|---|---|---|---|---|---|---|---|---|---|---|
| 7 | The recent acquisition has already begun to co... | positive | neutral | 0.502478 | positive | 0.972957 | False | True | False | False |
| 9 | The successful introduction of new product lin... | positive | neutral | 0.521444 | positive | 0.992486 | False | True | False | False |
| 14 | Strategic collaborations with industry leaders... | positive | neutral | 0.672266 | positive | 0.990287 | False | True | False | False |
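
The four agreement counts in the summary are enough to reconstruct both accuracies and to estimate how much room is left for combining the models. The sketch below uses only the numbers printed above. The "oracle" figure is an illustrative upper bound for any scheme that picks one of the two models' answers per headline, not something the notebook computes.

```python
# Counts copied from the comparison summary above
both_correct = 4362
both_wrong = 638
only_nb = 685
only_nn = 3403

total = both_correct + both_wrong + only_nb + only_nn

# Each model is correct when both are correct or when it alone is correct
nb_accuracy = (both_correct + only_nb) / total
nn_accuracy = (both_correct + only_nn) / total

# If both models are wrong, choosing between their answers cannot help
oracle_accuracy = (total - both_wrong) / total

print(f"Total: {total}")
print(f"Naive Bayes accuracy: {nb_accuracy:.4f}")
print(f"Neural Network accuracy: {nn_accuracy:.4f}")
print(f"Oracle upper bound: {oracle_accuracy:.4f}")
```

The recomputed accuracies agree with the printed 0.5553 and 0.8544. The 685 headlines that only Naive Bayes gets right are the cases where it adds information the neural network lacks.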
