# Traditional Machine Learning Models for Fake News Detection

This notebook implements and evaluates traditional machine learning models (Naive Bayes and Random Forest) for fake news detection using PySpark MLlib. These models serve as strong baselines and are particularly well-suited for the Databricks Community Edition due to their efficiency.

## Models Implemented
1. TF-IDF + Naive Bayes
2. TF-IDF + Random Forest

## Key Features
- Distributed processing with PySpark MLlib
- 3-fold cross-validation for hyperparameter tuning, with evaluation on a held-out test set
- Feature importance analysis
- Optimized for Databricks Community Edition
- Baseline results for later comparison with deep learning models

## Setup and Configuration

First, we'll set up our Spark session with configurations optimized for the Databricks Community Edition.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when, regexp_replace, length, udf, array, struct
from pyspark.sql.types import StringType, IntegerType, DoubleType, ArrayType
import numpy as np
import time

# Configure Spark session optimized for Databricks Community Edition.
# On Databricks a session already exists, so getOrCreate() returns it;
# spark.driver.memory only takes effect when a new driver JVM is launched.
spark = SparkSession.builder \
    .appName("FakeNewsDetection_TraditionalModels") \
    .config("spark.sql.shuffle.partitions", "8") \
    .config("spark.driver.memory", "8g") \
    .enableHiveSupport() \
    .getOrCreate()

# Display Spark configuration (fall back to a default if a key is not set)
print(f"Spark version: {spark.version}")
print(f"Shuffle partitions: {spark.conf.get('spark.sql.shuffle.partitions')}")
print(f"Driver memory: {spark.conf.get('spark.driver.memory', 'not set')}")

## Data Loading

We'll load data directly from the Hive metastore tables ('fake' and 'real').

In [None]:
# Import the HiveDataIngestion class
import sys
sys.path.append('/dbfs/FileStore/tables')
from hive_data_ingestion import HiveDataIngestion

# Create an instance
data_ingestion = HiveDataIngestion(spark)

# Load data from Hive tables
start_time = time.time()
real_df, fake_df = data_ingestion.load_data_from_hive()
print(f"Data loading time: {time.time() - start_time:.2f} seconds")

# Combine datasets with labels
combined_df = data_ingestion.combine_datasets(real_df, fake_df)
print(f"Combined dataset size: {combined_df.count()} records")

# Display class distribution
combined_df.groupBy("label").count().show()

## Memory Management for Community Edition

The Databricks Community Edition has limited resources (15.3 GB memory, 2 cores). We'll implement strategies to manage memory efficiently.
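
The sampling helper in the cell below reduces to one sampling fraction per class. A minimal sketch of that arithmetic in plain Python, using made-up class counts (the numbers are illustrative and not taken from the dataset):

```python
# Hypothetical class counts; the real values come from df.groupBy("label").count()
class_counts = {0: 23000, 1: 21000}
sample_size_per_class = 2000

# Fraction of each class to keep, capped at 1.0 for classes smaller than the target
fractions = {
    label: min(1.0, sample_size_per_class / count)
    for label, count in class_counts.items()
}

# Expected (not exact) number of rows kept per class
expected_rows = {label: class_counts[label] * f for label, f in fractions.items()}

print(fractions)
print(expected_rows)
```

Because `sampleBy` keeps each row independently with the given probability, the actual per-class counts will vary around the target rather than match it exactly.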

In [None]:
# Function to create a stratified sample if needed
def create_stratified_sample(df, sample_size_per_class=2000, seed=42):
    """
    Create an approximately balanced sample by down-sampling each class
    toward the same target size.
    
    Args:
        df: DataFrame to sample from
        sample_size_per_class: Target number of rows per class (sampleBy is
            probabilistic, so actual counts vary around this value)
        seed: Random seed for reproducibility
        
    Returns:
        DataFrame with approximately balanced classes
    """
    # Get class counts
    class_counts = df.groupBy("label").count().collect()
    
    # Calculate sampling fractions
    fractions = {}
    for row in class_counts:
        label = row["label"]
        count = row["count"]
        fraction = min(1.0, sample_size_per_class / count)
        fractions[label] = fraction
    
    # Create stratified sample
    sampled_df = df.sampleBy("label", fractions, seed)
    
    print("Original class distribution:")
    df.groupBy("label").count().show()
    
    print("Sampled class distribution:")
    sampled_df.groupBy("label").count().show()
    
    return sampled_df

# Manual toggle: traditional models usually fit the full dataset in memory,
# so try that first and set this to False if the cluster runs out of memory
use_full_dataset = True

if use_full_dataset:
    working_df = combined_df
    print("Using full dataset")
else:
    # Create a balanced sample for development/testing
    working_df = create_stratified_sample(combined_df, sample_size_per_class=2000)
    print("Using sampled dataset")

# Cache the working dataset for faster processing
working_df.cache()
print(f"Working dataset size: {working_df.count()} records")

## Text Preprocessing

We'll preprocess the text data to prepare it for feature extraction.
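
To make the effect of the cleaning steps concrete, here is a rough plain-Python equivalent applied to a single made-up record. The helper name `clean_text` is hypothetical and exists only for this illustration; the notebook itself performs these steps with Spark column functions, and this sketch additionally strips leading and trailing spaces.

```python
import re

def clean_text(title, text):
    """Plain-Python approximation of the cleaning steps, for illustration only."""
    # Join the fields that are present with a single space
    content = " ".join(part for part in (title, text) if part is not None)
    content = content.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace
    content = re.sub(r"[^a-z0-9\s]", " ", content)
    # Collapse runs of whitespace into a single space
    content = re.sub(r"\s+", " ", content)
    return content.strip()

sample = clean_text("BREAKING: Senate votes 52-48!", "Read more...\n(Updated)")
print(sample)  # breaking senate votes 52 48 read more updated
```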

In [None]:
from pyspark.sql.functions import col, concat_ws, lower, regexp_replace, trim

# Combine title and text fields; concat_ws skips null values,
# so a missing title or text does not leave a stray separator
preprocessed_df = working_df.withColumn(
    "content",
    concat_ws(" ", col("title"), col("text"))
)

# Convert to lowercase and replace special characters with spaces
# (raw strings keep Python from treating \s as a string escape)
preprocessed_df = preprocessed_df.withColumn("content", lower(col("content")))
preprocessed_df = preprocessed_df.withColumn(
    "content",
    regexp_replace(col("content"), r"[^a-z0-9\s]", " ")
)

# Collapse repeated whitespace and trim the ends
preprocessed_df = preprocessed_df.withColumn(
    "content",
    trim(regexp_replace(col("content"), r"\s+", " "))
)

# Show sample of preprocessed content
preprocessed_df.select("content", "label").show(5, truncate=50)

## Train-Test Split

We'll split our data into training and testing sets.

In [None]:
# Split data into training and testing sets (70% train, 30% test)
train_df, test_df = preprocessed_df.randomSplit([0.7, 0.3], seed=42)

# Cache datasets for faster processing
train_df.cache()
test_df.cache()

print(f"Training set size: {train_df.count()} records")
print(f"Testing set size: {test_df.count()} records")

# Check class distribution in training set
print("Training set class distribution:")
train_df.groupBy("label").count().show()

# Check class distribution in testing set
print("Testing set class distribution:")
test_df.groupBy("label").count().show()

## Feature Engineering with TF-IDF

We'll use TF-IDF (Term Frequency-Inverse Document Frequency) to convert text into numerical features.
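
As a worked example of the weighting, the sketch below computes TF-IDF by hand on a toy corpus (made up for illustration). It uses the smoothed IDF formula documented for Spark MLlib, `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term. It skips the hashing step, so terms are kept as strings.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration); each document is already tokenized
docs = [
    ["senate", "passes", "budget"],
    ["aliens", "endorse", "budget"],
    ["senate", "senate", "votes"],
]

num_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: log((m + 1) / (df + 1))
    return math.log((num_docs + 1) / (doc_freq[term] + 1))

def tf_idf(doc):
    # Term frequency is the raw count of the term within the document
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}

for doc in docs:
    print(tf_idf(doc))
```

Terms that appear in fewer documents ("aliens") receive a higher weight than terms shared across documents ("senate", "budget").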

In [None]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, StringIndexer
from pyspark.ml import Pipeline

# Define feature extraction pipeline
tokenizer = Tokenizer(inputCol="content", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=10000)
idf = IDF(inputCol="rawFeatures", outputCol="features")

# Create feature extraction pipeline
feature_pipeline = Pipeline(stages=[tokenizer, remover, hashingTF, idf])

# Fit the pipeline on the training data
feature_model = feature_pipeline.fit(train_df)

# Transform the training and testing data
train_features = feature_model.transform(train_df)
test_features = feature_model.transform(test_df)

# Show sample of features
train_features.select("content", "filtered", "features", "label").show(2, truncate=50)

## Model 1: Naive Bayes

We'll implement a Naive Bayes classifier using PySpark MLlib.
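
The `smoothing` parameter tuned below is the additive (Laplace) smoothing constant. The following plain-Python sketch of a multinomial Naive Bayes classifier on made-up data shows where it enters the likelihood estimate. It is an illustration of the technique under simplified assumptions (raw token counts instead of TF-IDF weights), not a reproduction of MLlib's implementation.

```python
import math
from collections import Counter

# Toy training data (made up): token lists with labels
train = [
    (["shocking", "secret", "cure"], "fake"),
    (["shocking", "celebrity", "secret"], "fake"),
    (["senate", "passes", "budget"], "real"),
    (["budget", "vote", "delayed"], "real"),
]

vocab = sorted({t for tokens, _ in train for t in tokens})
labels = sorted({label for _, label in train})

def fit(smoothing):
    """Return log priors and smoothed log likelihoods per class (smoothing > 0)."""
    log_prior, log_lik = {}, {}
    for label in labels:
        class_docs = [tokens for tokens, y in train if y == label]
        log_prior[label] = math.log(len(class_docs) / len(train))
        counts = Counter(t for tokens in class_docs for t in tokens)
        # Add `smoothing` to every term count so unseen terms keep nonzero probability
        total = sum(counts.values()) + smoothing * len(vocab)
        log_lik[label] = {t: math.log((counts[t] + smoothing) / total) for t in vocab}
    return log_prior, log_lik

def predict(tokens, model):
    log_prior, log_lik = model
    # Sum log probabilities; tokens outside the vocabulary are ignored
    scores = {
        label: log_prior[label] + sum(log_lik[label][t] for t in tokens if t in vocab)
        for label in labels
    }
    return max(scores, key=scores.get)

model = fit(smoothing=1.0)
print(predict(["shocking", "budget", "secret"], model))  # fake
```

Smaller smoothing values trust the observed counts more, while larger values pull the per-class term probabilities toward uniform, which is why the value is worth tuning.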

In [None]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create Naive Bayes model
nb = NaiveBayes(featuresCol="features", labelCol="label")

# Define parameter grid for cross-validation
paramGrid = ParamGridBuilder() \
    .addGrid(nb.smoothing, [0.1, 0.5, 1.0]) \
    .build()

# Define evaluators
accuracy_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", 
    predictionCol="prediction", 
    metricName="accuracy"
)

f1_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", 
    predictionCol="prediction", 
    metricName="f1"
)

auc_evaluator = BinaryClassificationEvaluator(
    labelCol="label", 
    rawPredictionCol="rawPrediction", 
    metricName="areaUnderROC"
)

# Create cross-validator
cv = CrossValidator(
    estimator=nb,
    estimatorParamMaps=paramGrid,
    evaluator=f1_evaluator,
    numFolds=3  # Use 3 folds for Community Edition (less resource intensive)
)

# Train model with cross-validation
start_time = time.time()
cv_model = cv.fit(train_features)
nb_training_time = time.time() - start_time
print(f"Naive Bayes training time: {nb_training_time:.2f} seconds")

# Get best model
best_nb_model = cv_model.bestModel
print(f"Best smoothing parameter: {best_nb_model.getSmoothing()}")

# Make predictions on test data
nb_predictions = best_nb_model.transform(test_features)

# Evaluate model
nb_accuracy = accuracy_evaluator.evaluate(nb_predictions)
nb_f1 = f1_evaluator.evaluate(nb_predictions)
nb_auc = auc_evaluator.evaluate(nb_predictions)

print(f"Naive Bayes Accuracy: {nb_accuracy:.4f}")
print(f"Naive Bayes F1 Score: {nb_f1:.4f}")
print(f"Naive Bayes AUC: {nb_auc:.4f}")

## Model 2: Random Forest

We'll implement a Random Forest classifier using PySpark MLlib.
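
Grid search multiplies training cost, which matters on a small cluster. A quick back-of-the-envelope count for the grid used in the cell below, assuming one fit per parameter combination per fold plus a final refit of the best combination on the full training set:

```python
from itertools import product

# Grid values from the Random Forest cell in this notebook
grid = {"numTrees": [10, 20], "maxDepth": [5, 10]}
num_folds = 3

combinations = list(product(*grid.values()))

# One fit per combination per fold, plus one final refit of the best combination
total_fits = len(combinations) * num_folds + 1

print(f"{len(combinations)} combinations x {num_folds} folds + 1 refit = {total_fits} fits")
```

Adding one more value to either parameter would raise the count from 13 to 19 fits, so the grid is kept deliberately small.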

In [None]:
from pyspark.ml.classification import RandomForestClassifier

# Create Random Forest model
rf = RandomForestClassifier(featuresCol="features", labelCol="label")

# Define parameter grid for cross-validation
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [10, 20]) \
    .addGrid(rf.maxDepth, [5, 10]) \
    .build()

# Create cross-validator
cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=paramGrid,
    evaluator=f1_evaluator,
    numFolds=3  # Use 3 folds for Community Edition (less resource intensive)
)

# Train model with cross-validation
start_time = time.time()
cv_model = cv.fit(train_features)
rf_training_time = time.time() - start_time
print(f"Random Forest training time: {rf_training_time:.2f} seconds")

# Get best model
best_rf_model = cv_model.bestModel
# On a fitted PySpark tree ensemble model, getNumTrees is a property, so no parentheses
print(f"Best numTrees: {best_rf_model.getNumTrees}")
print(f"Best maxDepth: {best_rf_model.getMaxDepth()}")

# Make predictions on test data
rf_predictions = best_rf_model.transform(test_features)

# Evaluate model
rf_accuracy = accuracy_evaluator.evaluate(rf_predictions)
rf_f1 = f1_evaluator.evaluate(rf_predictions)
rf_auc = auc_evaluator.evaluate(rf_predictions)

print(f"Random Forest Accuracy: {rf_accuracy:.4f}")
print(f"Random Forest F1 Score: {rf_f1:.4f}")
print(f"Random Forest AUC: {rf_auc:.4f}")

## Feature Importance Analysis

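We'll analyze feature importance from the Random Forest model. Because `HashingTF` maps words to hash buckets, each feature index identifies a bucket that several words may share rather than a single word, so the ranking shows which buckets are most predictive of fake news, not which individual words.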
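
The sketch below illustrates why a hashed feature index cannot be read back as one word. It uses `zlib.crc32` as a stand-in hash and a tiny bucket count so collisions are visible; Spark's `HashingTF` uses MurmurHash3, so these bucket numbers will not match Spark's feature indices. The vocabulary is made up.

```python
import zlib
from collections import defaultdict

NUM_FEATURES = 8  # Tiny on purpose; the notebook uses numFeatures=10000

def bucket(term, num_features=NUM_FEATURES):
    # Stand-in hash for illustration only (Spark uses MurmurHash3)
    return zlib.crc32(term.encode("utf-8")) % num_features

# Made-up vocabulary with more terms than buckets, so collisions are unavoidable
vocabulary = [
    "senate", "budget", "shocking", "secret", "cure",
    "vote", "celebrity", "aliens", "delayed", "passes",
]

# Inverse index: which terms landed in each bucket
bucket_to_terms = defaultdict(list)
for term in vocabulary:
    bucket_to_terms[bucket(term)].append(term)

for index in sorted(bucket_to_terms):
    print(index, bucket_to_terms[index])
```

If word-level interpretability is the goal, one option is to replace `HashingTF` with `CountVectorizer`, whose fitted model exposes a `vocabulary` list that maps each feature index to a term.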

In [None]:
# Get feature importances from Random Forest model
feature_importances = best_rf_model.featureImportances.toArray()

# Get top 20 feature indices
top_indices = np.argsort(-feature_importances)[:20]
top_importances = feature_importances[top_indices]

# Create a Spark DataFrame for visualization (Databricks-native approach)
feature_importance_data = [(int(idx), float(imp)) for idx, imp in zip(top_indices, top_importances)]
schema = ["Feature_Index", "Importance"]
feature_importance_spark_df = spark.createDataFrame(feature_importance_data, schema)

# Sort by importance for better visualization
feature_importance_spark_df = feature_importance_spark_df.orderBy("Importance", ascending=False)

# Display using Databricks native visualization
print("Top 20 Feature Importances:")
display(feature_importance_spark_df)

# Save feature importance data for future reference
feature_importance_path = "dbfs:/FileStore/fake_news_detection/results/feature_importance.parquet"
feature_importance_spark_df.write.mode("overwrite").parquet(feature_importance_path)
print(f"Feature importance data saved to {feature_importance_path}")

## Model Comparison

We'll compare the performance of Naive Bayes and Random Forest models using Databricks native visualization.

In [None]:
# Create comparison DataFrame using Spark (Databricks-native approach)
model_comparison_data = [
    ("Naive Bayes", float(nb_accuracy), float(nb_f1), float(nb_auc), float(nb_training_time)),
    ("Random Forest", float(rf_accuracy), float(rf_f1), float(rf_auc), float(rf_training_time))
]
comparison_schema = ["Model", "Accuracy", "F1_Score", "AUC", "Training_Time_Seconds"]
model_comparison_df = spark.createDataFrame(model_comparison_data, comparison_schema)

# Display comparison using Databricks native visualization
print("Model Performance Comparison:")
display(model_comparison_df)

# Save comparison data for future reference
comparison_path = "dbfs:/FileStore/fake_news_detection/results/model_comparison.parquet"
model_comparison_df.write.mode("overwrite").parquet(comparison_path)
print(f"Model comparison data saved to {comparison_path}")

## Confusion Matrix Analysis

We'll analyze the confusion matrices for both models to understand their error patterns.
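
The per-class metrics in the cell below follow directly from the four confusion-matrix counts. A small worked example with hypothetical counts (the helper name `class_metrics` is introduced here for illustration only):

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Hypothetical confusion-matrix counts, with label 1 as the positive class
tp, fp, tn, fn = 90, 10, 80, 20

# Metrics for class 1
print(class_metrics(tp, fp, fn))

# For class 0 the roles swap: TN acts as TP, FN as FP, and FP as FN
print(class_metrics(tn, fn, fp))
```

With these counts, class 1 has precision 0.90 and recall of about 0.82, while class 0 has precision 0.80 and recall of about 0.89, which shows how the two classes can have different error profiles at the same overall accuracy.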

In [None]:
from pyspark.sql.functions import count, col, when

# Function to create confusion matrix using Spark
def create_confusion_matrix(predictions_df, model_name):
    # Create confusion matrix
    confusion_matrix = predictions_df.groupBy("label").pivot("prediction").count().fillna(0)
    
    # Display confusion matrix
    print(f"Confusion Matrix for {model_name}:")
    display(confusion_matrix)
    
    # Calculate metrics by class
    true_positives = predictions_df.filter((col("label") == 1) & (col("prediction") == 1)).count()
    false_positives = predictions_df.filter((col("label") == 0) & (col("prediction") == 1)).count()
    true_negatives = predictions_df.filter((col("label") == 0) & (col("prediction") == 0)).count()
    false_negatives = predictions_df.filter((col("label") == 1) & (col("prediction") == 0)).count()
    
    # Calculate precision, recall, and F1 for each class
    precision_pos = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall_pos = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos) if (precision_pos + recall_pos) > 0 else 0
    
    precision_neg = true_negatives / (true_negatives + false_negatives) if (true_negatives + false_negatives) > 0 else 0
    recall_neg = true_negatives / (true_negatives + false_positives) if (true_negatives + false_positives) > 0 else 0
    f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg) if (precision_neg + recall_neg) > 0 else 0
    
    # Create metrics DataFrame
    metrics_data = [
        ("Real News (1)", float(precision_pos), float(recall_pos), float(f1_pos)),
        ("Fake News (0)", float(precision_neg), float(recall_neg), float(f1_neg))
    ]
    metrics_schema = ["Class", "Precision", "Recall", "F1_Score"]
    metrics_df = spark.createDataFrame(metrics_data, metrics_schema)
    
    # Display metrics
    print(f"Class-wise Metrics for {model_name}:")
    display(metrics_df)
    
    # Save metrics
    metrics_path = f"dbfs:/FileStore/fake_news_detection/results/{model_name.lower().replace(' ', '_')}_metrics.parquet"
    metrics_df.write.mode("overwrite").parquet(metrics_path)
    print(f"Metrics saved to {metrics_path}")

# Create confusion matrices for both models
create_confusion_matrix(nb_predictions, "Naive Bayes")
create_confusion_matrix(rf_predictions, "Random Forest")

## Save Models

We'll save the trained models for future use.

In [None]:
# Create directory for models
models_dir = "dbfs:/FileStore/fake_news_detection/models"
nb_model_path = f"{models_dir}/naive_bayes_model"
rf_model_path = f"{models_dir}/random_forest_model"

# Save Naive Bayes model
best_nb_model.write().overwrite().save(nb_model_path)
print(f"Naive Bayes model saved to {nb_model_path}")

# Save Random Forest model
best_rf_model.write().overwrite().save(rf_model_path)
print(f"Random Forest model saved to {rf_model_path}")

# Save feature pipeline
feature_pipeline_path = f"{models_dir}/feature_pipeline"
feature_model.write().overwrite().save(feature_pipeline_path)
print(f"Feature pipeline saved to {feature_pipeline_path}")

## Conclusion

In this notebook, we implemented and evaluated two traditional machine learning models for fake news detection using PySpark MLlib:

1. **Naive Bayes**: A simple probabilistic classifier based on Bayes' theorem
2. **Random Forest**: An ensemble learning method that constructs multiple decision trees

Both models were trained on a 70% training split, tuned with 3-fold cross-validation on that split, and evaluated on the held-out 30% test split. The accuracy, F1, and AUC values reported above serve as baselines for comparison with more complex models.

The implementation is optimized for the Databricks Community Edition, with careful memory management and efficient distributed processing using PySpark MLlib. All visualizations use Databricks-native display functions for better integration with the Databricks environment.

## Next Steps

1. Compare these traditional models with deep learning approaches (LSTM, Transformers)
2. Explore more advanced feature engineering techniques
3. Implement model deployment for real-time inference
4. Analyze model interpretability to understand prediction factors