
You're removing older or conflicting versions of the following libraries:

pycaret: A low-code machine learning library.

sktime: A library for time series analysis.

pandas: For data manipulation and analysis.

matplotlib: For plotting/visualization.

numpy: For numerical operations.

transformers: From HuggingFace, used for NLP and deep learning models.
---



In [None]:
!pip install kafka-python transformers


Collecting kafka-python
  Downloading kafka_python-2.1.5-py2.py3-none-any.whl.metadata (9.2 kB)
Downloading kafka_python-2.1.5-py2.py3-none-any.whl (285 kB)
   285.4/285.4 kB 4.4 MB/s eta 0:00:00
Installing collected packages: kafka-python
Successfully installed kafka-python-2.1.5


In [None]:
# Clean install of compatible libraries
!pip uninstall -y pycaret sktime pandas matplotlib numpy transformers -q
!pip install pandas==2.2.2 matplotlib==3.8.0 numpy==1.26.4 transformers==4.40.1 pyspark==3.5.1 -q


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 3.4.1 requires transformers<5.0.0,>=4.41.0, but you have transformers 4.40.1 which is incompatible.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.

In [None]:
# Install only needed versions that work together
!pip install numpy==1.26.4 pandas==2.2.2 matplotlib==3.8.0 transformers==4.40.1 pyspark==3.5.1 -q


torch: Core PyTorch library; handles tensors, automatic differentiation (backprop), and GPU operations.

torch.nn: Contains neural network layers and utilities (like nn.Linear, nn.CrossEntropyLoss, etc.).

train_test_split: Splits your dataset into training and test sets.

classification_report: Gives precision, recall, F1-score, and accuracy for model evaluation.

AutoModel: Automatically loads the correct model architecture for a given checkpoint (e.g., BERT, RoBERTa, DistilBERT).

BertTokenizerFast: Fast tokenizer for BERT-based models to convert text into token IDs.

AdamW: Adam optimizer with decoupled weight decay, the usual choice for fine-tuning transformers.

SparkSession: Entry point to the DataFrame and SQL APIs in PySpark. You need this to load and manipulate large-scale data (especially for streaming or distributed tasks).

lit(): A function that creates a constant-valued column. For example, true_data.withColumn("Target", lit("True")) adds a Target column whose value is "True" in every row.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from transformers import AutoModel, BertTokenizerFast
from transformers.optimization import AdamW
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

Creating the Spark session


In [None]:
# Start Spark session
spark = SparkSession.builder.appName("FakeNewsDetection").getOrCreate()


In [None]:
# Load datasets with Spark
true_data = spark.read.csv('/content/a1_True (1).csv', header=True, inferSchema=True)
fake_data = spark.read.csv('/content/a2_Fake.csv', header=True, inferSchema=True)

In [None]:
# Add labels
true_data = true_data.withColumn("Target", lit("True"))
fake_data = fake_data.withColumn("Target", lit("Fake"))

Merges the two Spark DataFrames (true_data and fake_data) into one.

Converts the Spark DataFrame into a Pandas DataFrame so you can use it with libraries like transformers, sklearn, and torch.

In [None]:
# Merge and convert to Pandas
data = true_data.union(fake_data).toPandas()
data = data.sample(frac=1).reset_index(drop=True)
data['label'] = pd.get_dummies(data.Target)['Fake']

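This cell splits the dataset into training (70%), validation (15%), and test (15%) sets using stratified sampling, so every split keeps the same Fake/True ratio as the full dataset. The first call holds out 30% of the rows; the second call divides that held-out portion evenly into validation and test sets.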



In [None]:
# Data split
train_text, temp_text, train_labels, temp_labels = train_test_split(data['title'], data['label'], random_state=2018, test_size=0.3, stratify=data['Target'])
val_text, test_text, val_labels, test_labels = train_test_split(temp_text, temp_labels, random_state=2018, test_size=0.5, stratify=temp_labels)
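The two-step split above can be checked on toy data. This is a minimal sketch, assuming a made-up dataset of 100 titles with a 60/40 class balance; the names `texts` and `labels` are hypothetical and stand in for `data['title']` and `data['label']`.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 titles, 60 labelled 0 and 40 labelled 1.
texts = [f"title {i}" for i in range(100)]
labels = [0] * 60 + [1] * 40

# Step 1: keep 70% for training and hold out 30%.
train_x, temp_x, train_y, temp_y = train_test_split(
    texts, labels, test_size=0.3, random_state=2018, stratify=labels
)

# Step 2: divide the held-out 30% evenly into validation and test sets.
val_x, test_x, val_y, test_y = train_test_split(
    temp_x, temp_y, test_size=0.5, random_state=2018, stratify=temp_y
)

# Each split should keep the 40% share of class 1.
for name, y in [("train", train_y), ("val", val_y), ("test", test_y)]:
    print(name, len(y), sum(y) / len(y))
```

Stratifying on the label in both calls is what keeps the class ratio identical across the three splits.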


tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
Loads the fast version of the BERT tokenizer (bert-base-uncased).

"Uncased" means the tokenizer lowercases all text before splitting it into tokens, so "Trump" and "trump" map to the same IDs.

The fast tokenizer is implemented with the HuggingFace tokenizers library, which is optimized for speed.

bert = AutoModel.from_pretrained('bert-base-uncased')
Loads the pretrained BERT architecture and weights without a classification head: the model returns hidden states and a pooled output, not class scores.

Used to extract contextual embeddings from input text.

MAX_LENGTH = 15
This sets the maximum number of tokens BERT will process per input, including the special [CLS] and [SEP] tokens.

BERT accepts up to 512 tokens, but shorter sequences are faster to process.

15 is very short. It works for short texts like titles, but longer titles are cut off.



In [None]:
# Load BERT tokenizer and model
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
bert = AutoModel.from_pretrained('bert-base-uncased')

MAX_LENGTH = 15

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



You're using batch_encode_plus to:

Convert lists of text into BERT token IDs

Pad and truncate all sequences to length MAX_LENGTH

Apply this for train, validation, and test sets

In [None]:
# Tokenization
# padding='max_length' replaces the deprecated pad_to_max_length=True
tokens_train = tokenizer.batch_encode_plus(train_text.tolist(), max_length=MAX_LENGTH, padding='max_length', truncation=True)
tokens_val = tokenizer.batch_encode_plus(val_text.tolist(), max_length=MAX_LENGTH, padding='max_length', truncation=True)
tokens_test = tokenizer.batch_encode_plus(test_text.tolist(), max_length=MAX_LENGTH, padding='max_length', truncation=True)
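To make the effect of padding and truncation concrete, here is a pure-Python sketch of what happens to one sequence of token IDs. It is an illustration, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, and the IDs 101 ([CLS]), 102 ([SEP]) and 0 ([PAD]) are the special-token IDs of bert-base-uncased.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Hypothetical helper: mimic fixed-length padding with truncation."""
    body = token_ids[: max_length - 2]      # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)   # 1 marks real tokens
    n_pad = max_length - len(input_ids)
    # Padding positions get the pad ID and a mask value of 0.
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


# A short sequence is padded up to max_length.
short_ids, short_mask = pad_or_truncate([2004, 1057, 1012], max_length=8)
# A long sequence is cut down to max_length.
long_ids, long_mask = pad_or_truncate(list(range(2000, 2020)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every output has exactly `max_length` entries, which is what allows the sequences to be stacked into a single rectangular tensor.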




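A tensor is a multidimensional array used extensively in machine learning and deep learning to represent data and model parameters. Think of it as an extension of arrays (like NumPy arrays) that is optimized for GPU computation and automatic differentiation in deep learning frameworks such as PyTorch and TensorFlow.

A scalar is a 0-dimensional tensor, a vector is 1-dimensional, and a matrix is 2-dimensional. The token ID and attention mask tensors created below are 2-dimensional, with shape (number of examples, MAX_LENGTH).

Higher-dimensional tensors: Arrays with more than two dimensions (3D, 4D, etc.).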



In [None]:
# Convert to tensors
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())
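As a small illustration of tensor dimensions, the sketch below builds a 0-D, a 1-D and a 2-D tensor. The values are made up; the 2-D case mirrors the layout of `train_seq`, with one row per example and one column per token position.

```python
import torch

scalar = torch.tensor(3.0)                     # 0-D tensor
vector = torch.tensor([1.0, 2.0, 3.0])         # 1-D tensor
matrix = torch.tensor([[101, 2004, 102],       # 2-D tensor: 2 examples,
                       [101, 1057, 102]])      # 3 token IDs each

print(scalar.dim(), vector.dim(), matrix.dim())
print(matrix.shape, matrix.dtype)
```

Integer lists become `torch.int64` tensors by default, which is the type an embedding lookup expects for token IDs.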


Similarly, convert the validation and test data into tensors.


In [None]:
val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
val_y = torch.tensor(val_labels.tolist())

test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
test_y = torch.tensor(test_labels.tolist())

TensorDataset: A PyTorch dataset that combines your input sequences (train_seq), attention masks (train_mask), and labels (train_y) into a single dataset object.

RandomSampler: Draws the training examples in a new random order on every pass, so the model cannot learn patterns from the order of the data. The validation set uses SequentialSampler instead, because order does not matter when no weights are updated.

DataLoader: A PyTorch class that manages the batching of the data. It automatically handles:

Shuffling (via the sampler)

Loading batches of data of size batch_size

Returning the data in batches for model training

In [None]:
# DataLoader setup
batch_size = 32
train_data = TensorDataset(train_seq, train_mask, train_y)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

val_data = TensorDataset(val_seq, val_mask, val_y)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
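The difference between the two samplers is easy to see on a tiny dataset. This is a sketch with made-up tensors (10 examples, batch size 4), not the notebook's data.

```python
import torch
from torch.utils.data import (
    DataLoader, RandomSampler, SequentialSampler, TensorDataset,
)

# Toy dataset (hypothetical): 10 examples, each with an ID, a mask and a label.
seq = torch.arange(10).reshape(10, 1)
mask = torch.ones(10, 1, dtype=torch.long)
y = torch.arange(10) % 2
dataset = TensorDataset(seq, mask, y)

seq_loader = DataLoader(dataset, sampler=SequentialSampler(dataset), batch_size=4)
rand_loader = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=4)

# The last batch is smaller when the dataset size is not a multiple of batch_size.
batch_sizes = [len(batch[0]) for batch in seq_loader]
# Sequential sampling always starts with the first rows of the dataset.
first_ids = next(iter(seq_loader))[0].flatten().tolist()
# Random sampling visits every example exactly once, in shuffled order.
shuffled = torch.cat([batch[0] for batch in rand_loader]).flatten().tolist()

print(batch_sizes, first_ids, shuffled)
```

Each batch is a list of three tensors in the same order they were passed to `TensorDataset`, which is why the training loop can unpack it as `sent_id, mask, labels`.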


bert.parameters(): This returns all the parameters (weights and biases) of the BERT model.

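param.requires_grad = True: This tells PyTorch to compute gradients for these parameters during backpropagation, so the optimizer updates BERT's weights and biases during training (fine-tuning). Setting it to False instead would freeze BERT and train only the new classification layers.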

In [None]:
# Fine-tune BERT: keep all of its weights trainable
# (set requires_grad = False here to freeze BERT instead)
for param in bert.parameters():
    param.requires_grad = True
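The cell above leaves every BERT parameter trainable. To see what freezing would change, the sketch below counts trainable parameters before and after setting `requires_grad = False`. The two small `nn.Linear` layers are hypothetical stand-ins for BERT and the classification head, and `count_trainable` is a helper introduced only for this illustration.

```python
import torch.nn as nn

encoder = nn.Linear(4, 3)  # hypothetical stand-in for the pretrained BERT
head = nn.Linear(3, 2)     # hypothetical stand-in for the classifier layers


def count_trainable(*modules):
    """Count the parameters that will receive gradients."""
    return sum(
        p.numel() for m in modules for p in m.parameters() if p.requires_grad
    )


all_trainable = count_trainable(encoder, head)   # fine-tuning: everything trains

# Freezing: the encoder no longer receives gradients, only the head trains.
for param in encoder.parameters():
    param.requires_grad = False
head_only = count_trainable(encoder, head)

print(all_trainable, head_only)
```

Freezing makes each epoch cheaper but usually costs some accuracy; fine-tuning, as done in this notebook, trains all weights and therefore uses a small learning rate.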

In [None]:
class BERT_Arch(nn.Module):
    def __init__(self, bert):
        super(BERT_Arch, self).__init__()
        self.bert = bert  # Pretrained BERT model
        self.dropout = nn.Dropout(0.1)  # Dropout layer to prevent overfitting
        self.relu = nn.ReLU()  # ReLU activation function
        self.fc1 = nn.Linear(768, 512)  # Fully connected layer (input: 768, output: 512)
        self.fc2 = nn.Linear(512, 2)  # Final output layer (output 2 classes: Fake/Real)
        self.softmax = nn.LogSoftmax(dim=1)  # LogSoftmax for classification (cross-entropy)

    def forward(self, sent_id, mask):
        # Forward pass through BERT and take the pooled output (CLS token representation)
        cls_hs = self.bert(sent_id, attention_mask=mask)['pooler_output']

        # Pass through the fully connected layers with ReLU and Dropout
        x = self.fc1(cls_hs)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)

        # Apply LogSoftmax to get the log-probabilities of the two classes
        x = self.softmax(x)
        return x


model = BERT_Arch(bert)
BERT_Arch(bert): You are initializing your custom BERT_Arch model, passing the pretrained BERT model (bert) as an argument.

This custom model includes BERT as the feature extractor and adds fully connected layers for classification (Fake vs. Real).

torch.device("cuda" if torch.cuda.is_available() else "cpu"): This checks if you have access to a GPU (via CUDA) for training. If yes, the model will run on the GPU, otherwise it will run on the CPU.

model.to(device): Moves the model to the appropriate device (GPU or CPU).



In [None]:
# Initialize model
model = BERT_Arch(bert)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Optimizer and loss
optimizer = AdamW(model.parameters(), lr=1e-5)
cross_entropy = nn.NLLLoss()
epochs = 2
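The model ends in nn.LogSoftmax and the loss is nn.NLLLoss. As a quick check on made-up logits, the sketch below shows that this pairing gives the same value as applying nn.CrossEntropyLoss directly to raw logits, which is why the two are used interchangeably in practice.

```python
import torch
import torch.nn as nn

# Made-up raw scores for 2 examples and 2 classes.
logits = torch.tensor([[2.0, 0.5], [0.1, 1.5]])
targets = torch.tensor([0, 1])

# Route 1: LogSoftmax inside the model, NLLLoss as the criterion.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

# Route 2: raw logits passed to CrossEntropyLoss.
ce = nn.CrossEntropyLoss()(logits, targets)

print(nll.item(), ce.item())
```

Exponentiating the log-probabilities recovers ordinary probabilities that sum to 1 for each example.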



train_dataloader: The DataLoader object that provides batches of training data.

enumerate(train_dataloader): Loops through the batches. step is the index, and batch is the data for the current batch.

if step % 50 == 0 and step != 0:: Prints progress every 50 batches to give you an update on the training status.

batch = [r.to(device) for r in batch]: Moves all components of the batch (sent_id, mask, and labels) to the device (GPU or CPU).

sent_id: The tokenized input sequence (token IDs).

mask: The attention mask, which tells the model which tokens are real input (1) and which are padding (0).

labels: The true labels (Fake/Real news).

model.zero_grad(): Clears any gradients computed during the previous batch.

cross_entropy(preds, labels): Computes the loss between the model's predictions and the true labels using Negative Log-Likelihood Loss (NLLLoss).

loss.backward(): Computes the gradients of the loss with respect to the model parameters.

clip_grad_norm_: Prevents exploding gradients by rescaling the gradients so that their combined norm is at most 1. This is useful when training large models like BERT, which can sometimes produce very large gradients.

optimizer.step(): Updates the model parameters using the clipped gradients.

In [None]:
# Training function
def train():
    model.train()
    total_loss = 0
    for step, batch in enumerate(train_dataloader):
        if step % 50 == 0 and step != 0:
            print(f"  Batch {step}  of  {len(train_dataloader)}.")
        batch = [r.to(device) for r in batch]
        sent_id, mask, labels = batch
        labels = labels.long()
        model.zero_grad()
        preds = model(sent_id, mask)
        loss = cross_entropy(preds, labels)
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    return total_loss / len(train_dataloader)
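The rescaling performed by gradient-norm clipping can be sketched in pure Python. `clip_by_global_norm` is a hypothetical helper that imitates the rule used by `clip_grad_norm_` on a flat list of gradient values; the small `eps` term is included to match PyTorch's guard against division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Hypothetical helper: rescale grads so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale factor is capped at 1.0, so small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# Large gradients (norm 5) are shrunk to norm ~1; direction is preserved.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

# Small gradients (norm 0.5) pass through unchanged.
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)

print(norm_before, norm_after, small)
```

Because all gradients are multiplied by the same factor, clipping limits the step size without changing the update direction.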

model.eval(): Switches the model to evaluation mode, which disables dropout.

with torch.no_grad(): This context manager disables gradient calculation, which reduces memory usage and speeds up evaluation, since no gradients are needed during validation or testing.

Similar to the training loop, you're iterating over the validation data (val_dataloader).

step % 50 == 0: Prints progress every 50 batches.

Passes the input through the model to get predictions (preds).

Computes the loss between the predicted values (preds) and the true labels (labels).

Adds the current batch's loss to the cumulative total loss.

Returns the average validation loss after going through all batches in the validation data.





In [None]:
# Evaluation function
def evaluate():
    print("\nEvaluating...")
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for step, batch in enumerate(val_dataloader):
            if step % 50 == 0 and step != 0:
                print(f"  Batch {step}  of  {len(val_dataloader)}.")
            batch = [r.to(device) for r in batch]
            sent_id, mask, labels = batch
            labels = labels.long()
            preds = model(sent_id, mask)
            loss = cross_entropy(preds, labels)
            total_loss += loss.item()
    return total_loss / len(val_dataloader)

# Training loop
best_valid_loss = float('inf')
train_losses = []
valid_losses = []
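The evaluation function relies on two separate switches: `model.eval()` turns off dropout, and `torch.no_grad()` turns off gradient tracking. The sketch below shows both effects on a hypothetical toy layer, not on the notebook's model.

```python
import torch
import torch.nn as nn

# Toy network (hypothetical) with a dropout layer, like BERT_Arch has.
layer = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.5))
x = torch.ones(1, 4)

layer.eval()                 # dropout becomes a no-op in evaluation mode
with torch.no_grad():        # outputs are not attached to a computation graph
    out1 = layer(x)
    out2 = layer(x)

tracked = layer(x)           # outside no_grad, gradients are tracked again

print(torch.equal(out1, out2), out1.requires_grad, tracked.requires_grad)
```

In evaluation mode the two forward passes give identical outputs, which is what makes validation losses comparable between epochs.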

Train: The train() function is called to train the model for one epoch.

Evaluate: The evaluate() function is called to evaluate the model on the validation set.

Model Saving: If the validation loss is lower than the best observed so far, the model is saved to disk as the best model.
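The keep-the-best-checkpoint rule can be sketched on a toy model. The notebook itself never reloads the saved file, so the final `load_state_dict` call is an assumed extra step that restores the best weights before testing. The model, the file path and the validation losses below are all made up for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

toy_model = nn.Linear(2, 2)  # hypothetical stand-in for BERT_Arch
path = os.path.join(tempfile.mkdtemp(), "best_weights.pt")  # hypothetical path

made_up_valid_losses = [0.40, 0.55]  # epoch 2 is worse than epoch 1
best_valid_loss = float("inf")
saved_epoch = None

for epoch, valid_loss in enumerate(made_up_valid_losses, start=1):
    with torch.no_grad():
        toy_model.weight.add_(1.0)  # pretend that training changed the weights
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        saved_epoch = epoch
        saved_weight = toy_model.weight.detach().clone()
        torch.save(toy_model.state_dict(), path)

# Assumed extra step: restore the best checkpoint before evaluating on test data.
toy_model.load_state_dict(torch.load(path))
print(saved_epoch, torch.equal(toy_model.weight, saved_weight))
```

Without the reload, the model in memory holds the weights from the last epoch, which are not necessarily the ones with the lowest validation loss.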

In [None]:

for epoch in range(epochs):
    print(f"\n Epoch {epoch + 1} / {epochs}")
    train_loss = train()
    valid_loss = evaluate()

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'c2_new_model_weights.pt')

    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    print(f'Training Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')



 Epoch 1 / 2
  Batch 50  of  983.
  Batch 100  of  983.
  Batch 150  of  983.
  Batch 200  of  983.
  Batch 250  of  983.
  Batch 300  of  983.
  Batch 350  of  983.
  Batch 400  of  983.
  Batch 450  of  983.
  Batch 500  of  983.
  Batch 550  of  983.
  Batch 600  of  983.
  Batch 650  of  983.
  Batch 700  of  983.
  Batch 750  of  983.
  Batch 800  of  983.
  Batch 850  of  983.
  Batch 900  of  983.
  Batch 950  of  983.

Evaluating...
  Batch 50  of  211.
  Batch 100  of  211.
  Batch 150  of  211.
  Batch 200  of  211.
Training Loss: 0.562
Validation Loss: 0.483

 Epoch 2 / 2
  Batch 50  of  983.
  Batch 100  of  983.
  Batch 150  of  983.
  Batch 200  of  983.
  Batch 250  of  983.
  Batch 300  of  983.
  Batch 350  of  983.
  Batch 400  of  983.
  Batch 450  of  983.
  Batch 500  of  983.
  Batch 550  of  983.
  Batch 600  of  983.
  Batch 650  of  983.
  Batch 700  of  983.
  Batch 750  of  983.
  Batch 800  of  983.
  Batch 850  of  983.
  Batch 900  of  983.
  Batch 950  o

torch.no_grad() – Disables gradient computation (saves memory & speeds up inference).

model(...) – Gets predictions from the test data (test_seq, test_mask).

detach() – Detaches the output from the computation graph.

cpu().numpy() – Moves the predictions to the CPU and converts to NumPy array.

The model's final LogSoftmax layer outputs log-probabilities for each class.

argmax(...) picks the class with the highest log-probability, which is also the class with the highest probability.

Compares true labels (test_y) with predicted labels (preds).

Displays precision, recall, f1-score, and support for each class.

In [None]:
# Testing
with torch.no_grad():
    preds = model(test_seq.to(device), test_mask.to(device))
    preds = preds.detach().cpu().numpy()

preds = np.argmax(preds, axis=1)
print(classification_report(test_y, preds))

              precision    recall  f1-score   support

       False       0.76      0.92      0.83      3213
        True       0.91      0.73      0.81      3523

    accuracy                           0.82      6736
   macro avg       0.84      0.83      0.82      6736
weighted avg       0.84      0.82      0.82      6736
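The argmax step used in the testing cell can be illustrated with NumPy on made-up log-probabilities. Taking the argmax of log-probabilities gives the same class as taking the argmax of the probabilities, because the logarithm is monotonic.

```python
import numpy as np

# Made-up model outputs for 3 examples: log-probabilities of class 0 and class 1.
log_probs = np.log(np.array([[0.9, 0.1],
                             [0.2, 0.8],
                             [0.5001, 0.4999]]))

pred = np.argmax(log_probs, axis=1)   # index of the larger entry in each row
probs = np.exp(log_probs)             # back to ordinary probabilities

print(pred)
print(probs.sum(axis=1))
```

The third example shows that argmax returns a hard decision even when the two classes are almost equally likely, so inspecting `probs` is useful when confidence matters.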



In [None]:
# Test on unseen samples
unseen_news_text = [
    "Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing",
    "WATCH: George W. Bush Calls Out Trump For Supporting White Supremacy",
    "U.S. lawmakers question businessman at 2016 Trump Tower meeting: sources",
    "Trump administration issues new rules on U.S. visa waivers"
]

tokens_unseen = tokenizer.batch_encode_plus(
    unseen_news_text,
    max_length=MAX_LENGTH,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True
)

unseen_seq = torch.tensor(tokens_unseen['input_ids']).to(device)
unseen_mask = torch.tensor(tokens_unseen['attention_mask']).to(device)

with torch.no_grad():
    preds = model(unseen_seq, unseen_mask)
    preds = preds.detach().cpu().numpy()

preds = np.argmax(preds, axis=1)
print("Unseen predictions:", preds)

Unseen predictions: [1 0 0 0]
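The predicted indices are easier to read as class names. Because the `label` column was built from `pd.get_dummies(data.Target)['Fake']`, index 1 means Fake and index 0 means True. The same encoding explains the rows of the classification report, where `False` is real news and `True` is fake news. The mapping dictionary below is a hypothetical helper.

```python
# Hypothetical lookup table derived from how the label column was built.
ID_TO_NAME = {0: "True", 1: "Fake"}

preds = [1, 0, 0, 0]  # copied from the output above
names = [ID_TO_NAME[p] for p in preds]

print(names)
```

Read this way, the model flags the first headline as fake and the other three as real.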




In [None]:
from pyspark.sql.functions import col, length


In [None]:
# Start Spark session
spark = SparkSession.builder.appName("FakeNewsDetection").getOrCreate()

# Show Spark's default parallelism: the default number of partitions for
# distributed operations, typically the number of available cores
print("Number of Spark partitions (example of parallelism):")
print(spark.sparkContext.defaultParallelism)

# Demonstrate parallel transformation
from pyspark.sql.functions import col, length

# Example: compute and show title lengths in parallel
sample = true_data.select("title").withColumn("length", length(col("title")))
sample.show(5)


Number of Spark partitions (example of parallelism):
2
+--------------------+------+
|               title|length|
+--------------------+------+
|As U.S. budget fi...|    64|
|U.S. military to ...|    64|
|Senior U.S. Repub...|    60|
|FBI Russia probe ...|    59|
|Trump wants Posta...|    69|
+--------------------+------+
only showing top 5 rows



In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# Load tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

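# Define Spark UDF for tokenization
@udf(ArrayType(IntegerType()))
def tokenize_text(text):
    # Rows with a missing title arrive as None; return null instead of failing
    if text is None:
        return None
    return tokenizer.encode(text, max_length=15, truncation=True, padding='max_length')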

# Apply UDF on true_data and fake_data in parallel
true_data = true_data.withColumn("input_ids", tokenize_text(col("title")))
fake_data = fake_data.withColumn("input_ids", tokenize_text(col("title")))

# Show result
true_data.select("title", "input_ids").show(5, truncate=False)




+---------------------------------------------------------------------+-------------------------------------------------------------------------------------------+
|title                                                                |input_ids                                                                                  |
+---------------------------------------------------------------------+-------------------------------------------------------------------------------------------+
|As U.S. budget fight looms, Republicans flip their fiscal script     |[101, 2004, 1057, 1012, 1055, 1012, 5166, 2954, 8840, 22225, 1010, 10643, 11238, 2037, 102]|
|U.S. military to accept transgender recruits on Monday: Pentagon     |[101, 1057, 1012, 1055, 1012, 2510, 2000, 5138, 16824, 15024, 2006, 6928, 1024, 20864, 102]|
|Senior U.S. Republican senator: 'Let Mr. Mueller do his job'         |[101, 3026, 1057, 1012, 1055, 1012, 3951, 5205, 1024, 1005, 2292, 2720, 1012, 26774, 102]  |
|FBI Russia prob