# Data Ingestion for Fake News Detection

This notebook demonstrates data ingestion for our fake news detection pipeline using Hive metastore tables. Ingestion is the first step of the pipeline: it loads the raw articles, labels them, and stores them in a form that the later preprocessing and modeling stages can read.

## Why Data Ingestion is Important

In fake news detection, proper data ingestion ensures:
1. Data quality and consistency
2. Appropriate labeling of real and fake news articles
3. Balanced representation of both classes
4. Efficient storage for distributed processing

This notebook walks through loading, combining, and saving news data with Apache Spark. It reads the complete dataset (approximately 45,000 articles) from Hive metastore tables in Databricks.

## Setup and Imports

First, let's import the necessary libraries and set up our environment.

In [None]:
# Import required libraries
import os
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col, when, count, desc, rand
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Import our custom Hive data ingestion module
# Note: In Databricks, you may need to adjust this import path
# You can use %run ./hive_data_ingestion instead
import sys
sys.path.append('/dbfs/FileStore/tables')
from hive_data_ingestion import HiveDataIngestion

## Creating a Spark Session with Hive Support

We'll use Apache Spark for distributed data processing, with Hive support enabled to access the metastore tables. Let's create a Spark session sized for the limits of Databricks Community Edition (1 driver, 15.3 GB memory, 2 cores).

In [None]:
# Create (or reuse) a Spark session configured for Databricks Community Edition
# - appName: Identifies this application in the Spark UI and logs
# - spark.sql.shuffle.partitions: Set to 8 (4x number of cores) for Community Edition
# - spark.driver.memory: Requests 8g, leaving room for the system. This setting only
#   takes effect if it is applied before the driver JVM starts; on an already running
#   Databricks cluster, getOrCreate() returns the existing session and the value is ignored
# - enableHiveSupport: Enables access to Hive metastore tables
spark = SparkSession.builder \
    .appName("FakeNewsDetection") \
    .config("spark.sql.shuffle.partitions", "8") \
    .config("spark.driver.memory", "8g") \
    .enableHiveSupport() \
    .getOrCreate()

# Display Spark version information
print(f"Spark version: {spark.version}")
print(f"Spark configuration: {spark.sparkContext.getConf().getAll()}")
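
The partition count used above follows the rule of thumb stated in the cell's comments (four shuffle partitions per core). As a minimal sketch, the helper below makes that arithmetic explicit so the value can be derived from the cluster size rather than hard-coded. The name `recommended_shuffle_partitions` and the default multiplier are illustrative assumptions, not part of Spark or of this project's modules.

```python
def recommended_shuffle_partitions(cores: int, partitions_per_core: int = 4) -> int:
    """Return a shuffle partition count from a simple per-core multiplier.

    Hypothetical helper: the 4x default mirrors the comment in the cell above
    and is a rule of thumb, not a Spark requirement.
    """
    if cores < 1 or partitions_per_core < 1:
        raise ValueError("cores and partitions_per_core must be positive")
    return cores * partitions_per_core


# Community Edition cluster described above: 2 cores
partitions = recommended_shuffle_partitions(2)
print(partitions)
```

Because `spark.sql.shuffle.partitions` is a runtime SQL setting, the result could also be applied to a live session with `spark.conf.set("spark.sql.shuffle.partitions", str(partitions))`.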

## Understanding Hive Metastore Tables

In Databricks, the Hive metastore provides a centralized repository to store metadata for tables and partitions. For our fake news detection project, we have two tables in the Hive metastore:

1. **`fake`**: Contains fake news articles with columns: title, text, subject, date
2. **`real`**: Contains real news articles with columns: title, text, subject, date

### Why Use Hive Tables?

Using Hive metastore tables offers several advantages over reading from CSV files:

1. **Centralized Metadata**: Schema information is stored centrally, ensuring consistency across sessions and users
2. **Optimized Performance**: The schema is read from the metastore, so Spark does not have to infer column types from the files on every load
3. **Access Control**: Tables can have access controls applied at the table level
4. **Persistence**: Tables persist across cluster restarts and sessions
5. **Catalog Integration**: Tables are visible in the Databricks catalog UI

Let's explore these tables using Spark SQL.

In [None]:
# List all tables in the default database
print("Available tables in the Hive metastore:")
spark.sql("SHOW TABLES").show()
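
Before loading, it can help to confirm that both tables expose the four columns listed above. The following is a small sketch of such a check in plain Python; the helper name `missing_columns` and the case-insensitive comparison are assumptions made for illustration.

```python
# Columns both tables are expected to have, per the table descriptions above
EXPECTED_COLUMNS = ["title", "text", "subject", "date"]


def missing_columns(actual_columns, expected_columns=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from actual_columns.

    Hypothetical helper; names are compared case-insensitively.
    """
    actual = {name.lower() for name in actual_columns}
    return [name for name in expected_columns if name.lower() not in actual]


# Illustrative input: a table that lacks 'subject' and 'date'
print(missing_columns(["Title", "text"]))
```

In the notebook, the column list would come from the table itself, for example `missing_columns(spark.table("fake").columns)`; an empty list means the table matches the expected layout.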

## Creating Directory Structure in DBFS

Let's create a directory structure in DBFS (Databricks File System) to organize our processed data files.

In [None]:
def create_directory_structure():
    """Create directory structure for data storage in DBFS.
    
    This function creates the necessary directories for storing:
    - Combined data: The full dataset with both real and fake news
    - Processed data: Data after preprocessing steps
    - Model data: Data used for model training and evaluation
    - Sample data: Optional balanced samples for development with limited resources
    """
    # In Databricks, we use dbutils to interact with DBFS
    directories = [
        "dbfs:/FileStore/fake_news_detection/data/combined_data",
        "dbfs:/FileStore/fake_news_detection/data/processed_data",
        "dbfs:/FileStore/fake_news_detection/data/model_data",
        "dbfs:/FileStore/fake_news_detection/data/sample_data"
    ]
    
    for directory in directories:
        # Remove dbfs: prefix for dbutils.fs.mkdirs
        dir_path = directory.replace("dbfs:", "")
        dbutils.fs.mkdirs(dir_path)
        print(f"Created directory: {directory}")

# Create the directory structure
create_directory_structure()
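
This notebook refers to DBFS locations in more than one form: `dbfs:/...` URIs in the directory list, bare `/FileStore/...` paths passed to Spark, and the `/dbfs/...` local mount used in `sys.path.append`. The sketch below shows one way to convert between these forms. Both function names are hypothetical, and the `/dbfs` mount is assumed to be available on the cluster.

```python
def to_dbfs_uri(path: str) -> str:
    """Normalize a DBFS path to the 'dbfs:/...' URI form (hypothetical helper)."""
    if path.startswith("dbfs:"):
        return "dbfs:/" + path[len("dbfs:"):].lstrip("/")
    if path.startswith("/dbfs/"):
        return "dbfs:/" + path[len("/dbfs/"):]
    return "dbfs:/" + path.lstrip("/")


def to_local_mount_path(path: str) -> str:
    """Convert a DBFS path to the '/dbfs/...' form used by local file APIs."""
    return "/dbfs/" + to_dbfs_uri(path)[len("dbfs:/"):]


print(to_dbfs_uri("/FileStore/fake_news_detection/data/combined_data"))
print(to_local_mount_path("dbfs:/FileStore/tables"))
```

Spark readers and `dbutils.fs` work with the `dbfs:/` form, while plain Python file access such as `open()` or `sys.path` needs the `/dbfs/` form.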

## Loading and Exploring Data from Hive Tables

Now, let's load the data from Hive metastore tables and explore their structure. We'll use our custom `HiveDataIngestion` class to handle this process.

In [None]:
# Initialize the HiveDataIngestion class
# We specify the table names in the Hive metastore
ingestion = HiveDataIngestion(spark, real_table="real", fake_table="fake")

# Load data from Hive tables
try:
    # This loads data from the Hive tables and registers them as temporary views
    real_df, fake_df = ingestion.load_data_from_hive()
except Exception as e:
    print(f"Error loading datasets from Hive: {e}")
    # Re-raise so that later cells do not run with real_df and fake_df undefined
    raise

## Memory Management for Community Edition

Since we're working with the Databricks Community Edition (15.3 GB Memory, 2 Cores), we need to be careful about memory usage. Let's check the size of our datasets and implement memory-efficient processing strategies.

In [None]:
# Check dataset sizes
real_count = real_df.count()
fake_count = fake_df.count()
total_count = real_count + fake_count

print(f"Real news dataset: {real_count} records")
print(f"Fake news dataset: {fake_count} records")
print(f"Total dataset size: {total_count} records")

# Memory management tips for Community Edition
print("\nMemory Management Tips for Databricks Community Edition:")
print("1. Process data in smaller batches when possible")
print("2. Use .unpersist() to release cached DataFrames when no longer needed")
print("3. Consider using sampling for exploratory analysis and model development")
print("4. Minimize the number of wide transformations (joins, groupBy, etc.)")
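
Since balanced class representation is one of the ingestion goals, the two counts can also be turned into a quick balance check. This is a minimal sketch; `class_balance` is a hypothetical helper and the numbers in the example are placeholders, not the real dataset counts.

```python
def class_balance(real_count: int, fake_count: int) -> dict:
    """Summarize how evenly two classes are represented (hypothetical helper)."""
    total = real_count + fake_count
    if total == 0:
        raise ValueError("both datasets are empty")
    smaller = min(real_count, fake_count)
    larger = max(real_count, fake_count)
    return {
        "real_fraction": real_count / total,
        "fake_fraction": fake_count / total,
        # Ratio of the larger class to the smaller one; 1.0 means perfectly balanced
        "imbalance_ratio": larger / smaller if smaller > 0 else float("inf"),
    }


# Placeholder counts for illustration only
summary = class_balance(600, 400)
print(summary)
```

In the notebook this would be called as `class_balance(real_count, fake_count)` with the values computed above.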

## Data Exploration with Spark SQL

Let's use Spark SQL to explore our datasets. SQL provides a familiar syntax for data exploration while leveraging Spark's distributed processing capabilities.

In [None]:
# Explore real news data using Spark SQL
print("Sample of real news articles:")
spark.sql("""
    SELECT title, text, subject, date
    FROM true_news
    LIMIT 3
""").show(truncate=50)

# Explore fake news data using Spark SQL
print("\nSample of fake news articles:")
spark.sql("""
    SELECT title, text, subject, date
    FROM fake_news
    LIMIT 3
""").show(truncate=50)

# Count articles by subject in real news
print("\nReal news articles by subject:")
spark.sql("""
    SELECT subject, COUNT(*) as count
    FROM true_news
    GROUP BY subject
    ORDER BY count DESC
""").show()

# Count articles by subject in fake news
print("\nFake news articles by subject:")
spark.sql("""
    SELECT subject, COUNT(*) as count
    FROM fake_news
    GROUP BY subject
    ORDER BY count DESC
""").show()

## Analyzing Subject Distribution and Potential Data Leakage

Let's analyze the distribution of the 'subject' column across real and fake news to check if it might be a perfect discriminator between classes, which would indicate potential data leakage.

In [None]:
# Analyze subject distribution across real and fake news
print("Analyzing subject distribution across classes...")

# First, let's combine the datasets with labels
real_df_with_label = real_df.withColumn("label", lit(1))  # 1 for real news
fake_df_with_label = fake_df.withColumn("label", lit(0))  # 0 for fake news
temp_combined = real_df_with_label.unionByName(fake_df_with_label)
temp_combined.createOrReplaceTempView("temp_combined")

# Check if any subjects appear in both real and fake news
print("\nSubjects that appear in both real and fake news:")
spark.sql("""
    SELECT subject, 
           SUM(CASE WHEN label = 1 THEN 1 ELSE 0 END) as real_count,
           SUM(CASE WHEN label = 0 THEN 1 ELSE 0 END) as fake_count
    FROM temp_combined
    GROUP BY subject
    HAVING real_count > 0 AND fake_count > 0
    ORDER BY real_count + fake_count DESC
""").show()

# Compare article counts and distinct subjects per class. The overlap query
# above is the actual separation check: if it returns no rows, no subject
# occurs in both classes and subject alone separates real from fake news.
print("\nCombined dataset statistics by label:")
spark.sql("""
    SELECT label, COUNT(*) as count, COUNT(DISTINCT subject) as unique_subjects
    FROM temp_combined
    GROUP BY label
    ORDER BY label DESC
""").show()

# Show top subjects by class
print("\nSubject distribution by label:")
spark.sql("""
    SELECT 
        subject,
        SUM(CASE WHEN label = 1 THEN 1 ELSE 0 END) as real_count,
        SUM(CASE WHEN label = 0 THEN 1 ELSE 0 END) as fake_count
    FROM temp_combined
    GROUP BY subject
    ORDER BY real_count + fake_count DESC
    LIMIT 10
""").show()

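# temp_combined was never cached, so there is nothing to unpersist;
# drop the temporary view instead
spark.catalog.dropTempView("temp_combined")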

## Data Leakage Warning

**Important**: Based on the analysis above, the 'subject' column appears to be a strong predictor of whether an article is real or fake news. This could indicate potential data leakage, as the subject categories might be directly related to the source of the news rather than the content itself.
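
The leakage check can also be expressed as a small function that reports which subjects occur in both classes and which occur in only one. The sketch below is plain Python operating on `(subject, label)` pairs; the function name and the toy rows are assumptions for illustration and do not reflect the actual subjects in the dataset.

```python
from collections import defaultdict


def subject_class_overlap(rows):
    """Group labels by subject and report shared vs. class-exclusive subjects.

    rows: iterable of (subject, label) pairs. Hypothetical helper.
    """
    labels_by_subject = defaultdict(set)
    for subject, label in rows:
        labels_by_subject[subject].add(label)
    shared = sorted((s for s, ls in labels_by_subject.items() if len(ls) > 1), key=str)
    exclusive = sorted((s for s, ls in labels_by_subject.items() if len(ls) == 1), key=str)
    return {
        "shared": shared,
        "exclusive": exclusive,
        # True when every subject maps to exactly one label
        "perfect_separator": bool(labels_by_subject) and not shared,
    }


# Made-up rows: 'topic_a' only in real (1), 'topic_b' only in fake (0)
toy_rows = [("topic_a", 1), ("topic_a", 1), ("topic_b", 0)]
report = subject_class_overlap(toy_rows)
print(report)
```

With the notebook's data, the pairs could be obtained from `temp_combined.select("subject", "label").distinct().collect()`, which is small enough to collect because it holds at most one row per subject and label.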

### Recommendations:

1. **Consider removing the 'subject' column** from the feature set to ensure the model learns from the actual content rather than the source categorization.

2. **Alternatively, create two model variants** - one with and one without the 'subject' feature - to compare performance and understand the impact.

3. **Perform cross-validation** with careful stratification to ensure the model generalizes well across different subjects.

Let's proceed with adding labels and combining the datasets, keeping this potential data leakage in mind.

## Adding Labels and Combining Datasets

Now, let's add labels to our datasets (1 for real news, 0 for fake news) and combine them into a single dataset. This labeling is crucial for training machine learning models to distinguish between real and fake news.

In [None]:
# Combine datasets with labels using our ingestion class
combined_df = ingestion.combine_datasets(real_df, fake_df)

# Show distribution of subjects across labels
print("\nSubject distribution by label:")
spark.sql("""
    SELECT 
        subject,
        SUM(CASE WHEN label = 1 THEN 1 ELSE 0 END) as real_count,
        SUM(CASE WHEN label = 0 THEN 1 ELSE 0 END) as fake_count
    FROM combined_news
    GROUP BY subject
    ORDER BY real_count + fake_count DESC
    LIMIT 10
""").show()

## Saving Combined Dataset

Let's save the combined dataset to DBFS in Parquet format, a columnar format well suited to distributed processing. We'll also register it as a Hive table over the same files for easier access.

In [None]:
# Save combined dataset to DBFS in Parquet format
combined_path = "/FileStore/fake_news_detection/data/combined_data/full_dataset.parquet"
combined_df.write.mode("overwrite").parquet(combined_path)
print(f"Combined dataset saved to: {combined_path}")

# Create a Hive table for the combined dataset
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS combined_news_table
    USING PARQUET
    LOCATION '{combined_path}'
""")
print("Created Hive table: combined_news_table")

# Verify the table was created
print("\nVerifying table creation:")
spark.sql("SHOW TABLES").show()
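
The `CREATE TABLE` statement above is built with an f-string, so a malformed table name or path would go straight into the SQL text. As a hedged sketch, the statement could be generated by a helper that validates its inputs first. The name `external_parquet_table_ddl` and the validation rules are assumptions, not part of the project's ingestion module.

```python
import re

# Conservative pattern for an unquoted table name (an assumption for this sketch)
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def external_parquet_table_ddl(table_name: str, location: str) -> str:
    """Build the DDL that registers a table over existing Parquet files."""
    if not _IDENTIFIER.match(table_name):
        raise ValueError(f"invalid table name: {table_name!r}")
    if "'" in location:
        raise ValueError("location must not contain single quotes")
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name}\n"
        f"USING PARQUET\n"
        f"LOCATION '{location}'"
    )


ddl = external_parquet_table_ddl(
    "combined_news_table",
    "/FileStore/fake_news_detection/data/combined_data/full_dataset.parquet",
)
print(ddl)
```

The returned string would then be passed to `spark.sql(ddl)` in place of the inline f-string.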

## Creating a Balanced Sample for Development

For development and testing purposes, especially in the Databricks Community Edition with limited resources, it's useful to create a balanced sample of the data. This sample will have an equal number of real and fake news articles.

In [None]:
# Create a balanced sample for development
sample_df = ingestion.create_balanced_sample(combined_df, sample_size_per_class=1000)

# Display sample statistics
print("Balanced sample statistics:")
sample_df.groupBy("label").count().show()

# Save the balanced sample
sample_path = "/FileStore/fake_news_detection/data/sample_data/balanced_sample.parquet"
sample_df.write.mode("overwrite").parquet(sample_path)
print(f"Balanced sample saved to: {sample_path}")

# Create a Hive table for the sample
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS news_sample
    USING PARQUET
    LOCATION '{sample_path}'
""")
print("Created Hive table: news_sample")
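
The internals of `create_balanced_sample` are not shown in this notebook. One common way to draw a roughly balanced sample in Spark is stratified sampling with `DataFrame.sampleBy`, which takes a sampling fraction per label. The sketch below computes those fractions from per-class counts; the helper name and the counts in the example are illustrative assumptions.

```python
def balanced_sample_fractions(class_counts: dict, sample_size_per_class: int) -> dict:
    """Map each label to the fraction expected to yield about sample_size_per_class rows.

    Hypothetical helper. Classes smaller than the target are kept in full
    (fraction 1.0); empty classes are skipped.
    """
    if sample_size_per_class < 1:
        raise ValueError("sample_size_per_class must be positive")
    return {
        label: min(1.0, sample_size_per_class / count)
        for label, count in class_counts.items()
        if count > 0
    }


# Placeholder counts for illustration only
fractions = balanced_sample_fractions({0: 4000, 1: 2000}, sample_size_per_class=1000)
print(fractions)
```

These fractions could be passed as `combined_df.sampleBy("label", fractions=fractions, seed=42)`. Note that `sampleBy` samples each row independently, so the resulting class sizes are approximate rather than exactly equal.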

## Cleanup

Let's clean up our environment to free up resources.

In [None]:
# Unpersist DataFrames to free up memory (no effect on DataFrames that were never cached)
real_df.unpersist()
fake_df.unpersist()
combined_df.unpersist()
if 'sample_df' in locals():
    sample_df.unpersist()

print("Data ingestion completed successfully!")
print("The data is now ready for preprocessing and feature engineering.")