# Step 1: Classification Model Development


This notebook trains a BERT-based model to classify tweets as real or fake news. The classifier combines BERT's pooled output with two hand-crafted features: sentiment and post length.

## Imports and Setup

In [None]:
# Install necessary libraries
!pip install transformers
!pip install tweet-preprocessor
!pip install textblob

# Standard libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn

# Sklearn libraries for preprocessing and evaluation
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight

# Transformers library for BERT and tokenizer
import transformers
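from transformers import AutoModel, BertTokenizerFast
# AdamW is taken from torch.optim; the copy that shipped with transformers is deprecated
from torch.optim import AdamW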

# Tweet-preprocessor for text cleaning
import preprocessor as p

# PyTorch utilities for data handling
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Textblob for Sentiment Analysis
from textblob import TextBlob

# Specify GPU if available, else default to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Enable access to Google Drive for file storage
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data Loading and Pre-Processing

This code loads the training and validation datasets, cleans the tweet text, and encodes the labels (real/fake) as integers for modeling.

In [None]:
# Function to load data from TSV file into a DataFrame
def getData(file):
    """Load TSV file and return DataFrame."""
    return pd.read_csv(file, delimiter="\t")

# Define file paths for the training and validation datasets
trainFilename = "/content/drive/MyDrive/Fake News Detection Data/Constrain AI/Constraint_English_Train - Sheet1.tsv"
validFilename = "/content/drive/MyDrive/Fake News Detection Data/Constrain AI/Constraint_English_Val - Sheet1.tsv"

# Load training and validation datasets
trainDF = getData(trainFilename)
validDF = getData(validFilename)
print("Train Data Shape: ", trainDF.shape)
print("Validation Data Shape: ", validDF.shape)

# Function to preprocess tweets: applies tweet-preprocessor's cleaning, converts to lowercase,
# removes punctuation (including any leftover '#' and '@'), and collapses whitespace
import re

def preprocessTweet(row):
    text = row['tweet']
    text = p.clean(text)
    # str.replace matches literal strings only, so regex patterns go through re.sub
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply preprocessing function to each row in the dataset
trainDF['processedTweet'] = trainDF.apply(preprocessTweet, axis=1)
validDF['processedTweet'] = validDF.apply(preprocessTweet, axis=1)

# Encode labels into numerical format for model compatibility
labelEncoder = preprocessing.LabelEncoder()
labelEncoder.fit(['real', 'fake'])
trainDF['numericalLabels'] = labelEncoder.transform(trainDF['label'])
validDF['numericalLabels'] = labelEncoder.transform(validDF['label'])

Train Data Shape:  (6420, 3)
Validation Data Shape:  (2140, 3)
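
The integer assigned to each label matters later when reading the classification report. scikit-learn's `LabelEncoder` stores its classes in sorted order, so with the two labels used here `fake` should map to 0 and `real` to 1. The sketch below reproduces that rule in plain Python; `encode_labels` is a hypothetical helper for illustration, not part of the notebook.

```python
def encode_labels(labels):
    """Map string labels to integers using sorted class order,
    mirroring how LabelEncoder assigns indices."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes

encoded, classes = encode_labels(["real", "fake", "real", "fake"])
print(classes)  # ['fake', 'real']
print(encoded)  # [1, 0, 1, 0]
```

In the notebook itself, `labelEncoder.classes_` can be inspected to confirm the same ordering.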


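## Sentiment Scoring

This function uses TextBlob's polarity score to label a text as positive (1), neutral (0), or negative (-1).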

In [None]:
# Function to get sentiment score
def get_sentiment(text):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return 1  # Positive
    elif polarity == 0:
        return 0  # Neutral
    else:
        return -1  # Negative

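## Adding Sentiment and Length Features

This code adds two columns to each dataset: the sentiment label and the post length in words.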

In [None]:
# Apply sentiment and length calculations
trainDF['sentiment'] = trainDF['processedTweet'].apply(get_sentiment)
trainDF['post_length'] = trainDF['processedTweet'].apply(lambda x: len(x.split()))

validDF['sentiment'] = validDF['processedTweet'].apply(get_sentiment)
validDF['post_length'] = validDF['processedTweet'].apply(lambda x: len(x.split()))

## Data Splitting and Tokenization

This code holds out 4% of the training data as an internal validation set (stratified by label) and keeps the official validation file as the test set. BERT’s tokenizer then converts each text into token IDs and attention masks, the input format the model expects.

In [None]:
from transformers import AutoModel, BertTokenizerFast
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.model_selection import train_test_split
import numpy as np

# Split training data for internal validation
trainText, validText, trainLabels, validLabels = train_test_split(
    trainDF['processedTweet'], trainDF['numericalLabels'],
    random_state=2018,
    test_size=0.04,
    stratify=trainDF['numericalLabels']
)

# Recalculate sentiment and length for each split, stored as float column tensors
# so their dtype matches BERT's output when the features are concatenated
train_sentiment = torch.tensor(trainText.apply(get_sentiment).values, dtype=torch.float).unsqueeze(1)
train_length = torch.tensor(trainText.apply(lambda x: len(x.split())).values, dtype=torch.float).unsqueeze(1)

val_sentiment = torch.tensor(validText.apply(get_sentiment).values, dtype=torch.float).unsqueeze(1)
val_length = torch.tensor(validText.apply(lambda x: len(x.split())).values, dtype=torch.float).unsqueeze(1)

# Load pretrained BERT model and tokenizer
bert = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Function to tokenize and encode sequences for input to BERT model
def tokenize_text(text_data, tokenizer, max_length=200):
    return tokenizer.batch_encode_plus(
        text_data.tolist(),
        max_length=max_length,
        padding=True,
        truncation=True,
        return_tensors="pt"
    )

# Tokenize and encode train, validation, and test datasets
tokens_train = tokenize_text(trainText, tokenizer)
tokens_val = tokenize_text(validText, tokenizer)
tokens_test = tokenize_text(validDF['processedTweet'], tokenizer)

# Prepare tensors for model input
# Training set
train_seq, train_mask, train_y = tokens_train['input_ids'], tokens_train['attention_mask'], torch.tensor(trainLabels.tolist())

# Validation set
val_seq, val_mask, val_y = tokens_val['input_ids'], tokens_val['attention_mask'], torch.tensor(validLabels.tolist())

# Test set (the official validation file; sentiment and length come from validDF)
test_seq, test_mask, test_y = tokens_test['input_ids'], tokens_test['attention_mask'], torch.tensor(validDF['numericalLabels'].tolist())
test_sentiment = torch.tensor(validDF['sentiment'].values, dtype=torch.float).unsqueeze(1)
test_length = torch.tensor(validDF['post_length'].values, dtype=torch.float).unsqueeze(1)
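
To make the tokenizer output more concrete, the following is a simplified sketch of what `truncation=True` and `padding=True` produce: sequences are cut to `max_length`, padded to the longest sequence in the call, and paired with an attention mask. `pad_and_mask` is a hypothetical helper, the token IDs are made up, and a padding ID of 0 is an assumption. The real tokenizer also keeps the closing special token when it truncates, which this sketch does not model.

```python
def pad_and_mask(sequences, max_length, pad_id=0):
    """Truncate to max_length, pad to the longest remaining sequence,
    and build an attention mask (1 = real token, 0 = padding)."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Toy token IDs, not real vocabulary entries
toy_sequences = [[101, 11, 102], [101, 102], [101, 21, 22, 23, 24, 102]]
ids, mask = pad_and_mask(toy_sequences, max_length=4)
print(ids)   # [[101, 11, 102, 0], [101, 102, 0, 0], [101, 21, 22, 23]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
```

The attention mask is what lets BERT ignore the padded positions.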




## Data Loader Creation

This cell wraps each split in a `TensorDataset` and creates `DataLoader` objects that serve the data in batches of 32. The training loader samples examples in random order, while the validation and test loaders iterate sequentially.

In [None]:
# Define batch size
batch_size = 32

# Create DataLoader for train, validation, and test sets
train_data = TensorDataset(train_seq, train_mask, train_sentiment, train_length, train_y)
train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)

val_data = TensorDataset(val_seq, val_mask, val_sentiment, val_length, val_y)
val_dataloader = DataLoader(val_data, sampler=SequentialSampler(val_data), batch_size=batch_size)

test_data = TensorDataset(test_seq, test_mask, test_sentiment, test_length, test_y)
test_dataloader = DataLoader(test_data, sampler=SequentialSampler(test_data), batch_size=batch_size)
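
As a quick check on what `len(dataloader)` returns, the sketch below computes the number of batches and the size of the final batch, assuming the `DataLoader` default of keeping the last partial batch. The helper names are hypothetical; the example uses the 2,140 test examples and the batch size of 32 from this notebook.

```python
import math

def num_batches(num_examples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_examples / batch_size)

def last_batch_size(num_examples, batch_size):
    """Size of the final batch (a full batch if the division is exact)."""
    remainder = num_examples % batch_size
    return remainder if remainder else batch_size

print(num_batches(2140, 32))      # 67
print(last_batch_size(2140, 32))  # 28
```

The training and evaluation functions divide the summed loss by this batch count, so the smaller final batch carries the same weight in the average as a full one.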

## Model Definition

The code sets up a custom BERT-based architecture, which uses BERT as a base model with additional fully connected layers for classification. BERT’s parameters are frozen to reduce training time.

In [None]:
# Load pretrained BERT model
bert = AutoModel.from_pretrained('bert-base-uncased')

# Freeze BERT parameters to prevent updating during training
for param in bert.parameters():
    param.requires_grad = False

# Define custom BERT model architecture with added layers
class BERT_Arch(nn.Module):
    def __init__(self, bert):
        super(BERT_Arch, self).__init__()
        self.bert = bert
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()

        # Adjust input size to account for BERT embeddings + 2 additional features (sentiment & length)
        self.fc1 = nn.Linear(768 + 2, 512)  # 768 for BERT output, +2 for sentiment and post length
        self.fc2 = nn.Linear(512, 2)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, sent_id, mask, sentiment, length):
        # Pass input through BERT and take the pooled output
        # (the [CLS] hidden state after BERT's dense + tanh pooling layer)
        outputs = self.bert(sent_id, attention_mask=mask)
        cls_hs = outputs.pooler_output

        # Concatenate the BERT embedding with sentiment and post length
        x = torch.cat((cls_hs, sentiment, length), dim=1)

        # Pass concatenated features through fully connected layers
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)

        return self.softmax(x)

# Initialize model and send to device
model = BERT_Arch(bert)
model = model.to(device)
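
With BERT frozen, only the two linear layers in the classification head are trained. The arithmetic below counts those trainable parameters from the layer sizes in `BERT_Arch`; `linear_params` is a hypothetical helper that counts weights plus biases for a fully connected layer.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

fc1_params = linear_params(768 + 2, 512)  # BERT output + sentiment + length
fc2_params = linear_params(512, 2)        # hidden layer to two classes
trainable = fc1_params + fc2_params

print(fc1_params)  # 394752
print(fc2_params)  # 1026
print(trainable)   # 395778
```

In the notebook, the same figure could be checked with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.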

## Optimizer, Loss Function, and Class Weights

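The code configures an AdamW optimizer and a class-weighted negative log-likelihood loss. Because the model ends in a log-softmax layer, `NLLLoss` is equivalent to cross-entropy here; the class weights compensate for any class imbalance in the training split.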

In [None]:
# Define optimizer
optimizer = AdamW(model.parameters(), lr=1e-5)

# Compute class weights
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(trainLabels), y=trainLabels)
weights = torch.tensor(class_weights, dtype=torch.float).to(device)
cross_entropy = nn.NLLLoss(weight=weights)  # Weighted loss function
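
For reference, the `'balanced'` option of `compute_class_weight` follows the formula `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula in plain Python to made-up label counts (40 and 60); these are illustrative only and are not the class counts of this dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}

toy_labels = [0] * 40 + [1] * 60  # hypothetical imbalance
weights_by_class = balanced_class_weights(toy_labels)
print(weights_by_class[0])  # 1.25
```

The rarer class receives the larger weight, so its errors contribute more to the loss.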



## Training and Evaluation Functions

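This code defines a training function, which computes the loss and updates the model’s trainable parameters over one epoch, and an evaluation function, which computes the average loss on the internal validation set without updating any parameters.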

In [None]:
# Define training function
def train():
    model.train()
    total_loss = 0
    for batch in train_dataloader:
        # Send each item in the batch to the device
        sent_id, mask, sentiment, length, labels = [item.to(device) for item in batch]

        # Zero out gradients
        model.zero_grad()

        # Forward pass, including sentiment and length as additional features
        preds = model(sent_id, mask, sentiment, length)

        # Compute loss
        loss = cross_entropy(preds, labels)
        total_loss += loss.item()

        # Backward pass and optimization step
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

    return total_loss / len(train_dataloader)

# Define evaluation function
def evaluate():
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for batch in val_dataloader:
            # Send each item in the batch to the device
            sent_id, mask, sentiment, length, labels = [item.to(device) for item in batch]

            # Forward pass, including sentiment and length as additional features
            preds = model(sent_id, mask, sentiment, length)

            # Compute loss
            loss = cross_entropy(preds, labels)
            total_loss += loss.item()

    return total_loss / len(val_dataloader)
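
The call to `clip_grad_norm_` in `train()` rescales all gradients together when their combined L2 norm exceeds the threshold (1.0 here). The sketch below shows the idea on a flat list of gradient values; `clip_by_global_norm` is a hypothetical helper, and the small epsilon in the denominator is an assumption modeled on PyTorch's behavior.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradients so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # approximately [0.6, 0.8]

# Gradients already within the limit are left unchanged
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(unchanged)  # [0.3, 0.4]
```

Clipping keeps the direction of the update and only limits its size.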

## Training Loop

The model is trained over multiple epochs, saving the model whenever validation loss improves, indicating better performance on unseen data.

In [None]:
# Saving is based on best validation loss
epochs = 15
best_valid_loss = float('inf')

for epoch in range(epochs):
    print(f"\nEpoch {epoch + 1}/{epochs}")
    train_loss = train()
    valid_loss = evaluate()

    # Save model if validation loss improves
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), '/content/drive/MyDrive/Fake News Detection Data/Constrain AI/saved_weights.pt')

    print(f"Training Loss: {train_loss:.3f}")
    print(f"Validation Loss: {valid_loss:.3f}")


Epoch 1/15
Training Loss: 0.644
Validation Loss: 0.572

Epoch 2/15
Training Loss: 0.599
Validation Loss: 0.538

Epoch 3/15
Training Loss: 0.577
Validation Loss: 0.519

Epoch 4/15
Training Loss: 0.556
Validation Loss: 0.501

Epoch 5/15
Training Loss: 0.543
Validation Loss: 0.484

Epoch 6/15
Training Loss: 0.526
Validation Loss: 0.469

Epoch 7/15
Training Loss: 0.511
Validation Loss: 0.456

Epoch 8/15
Training Loss: 0.498
Validation Loss: 0.443

Epoch 9/15
Training Loss: 0.481
Validation Loss: 0.432

Epoch 10/15
Training Loss: 0.468
Validation Loss: 0.422

Epoch 11/15
Training Loss: 0.455
Validation Loss: 0.412

Epoch 12/15
Training Loss: 0.446
Validation Loss: 0.403

Epoch 13/15
Training Loss: 0.435
Validation Loss: 0.397

Epoch 14/15
Training Loss: 0.424
Validation Loss: 0.390

Epoch 15/15
Training Loss: 0.416
Validation Loss: 0.383
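
The short sketch below replays the "save on improvement" rule against the validation losses logged above, to show which epochs would have written a checkpoint. `checkpoint_epochs` is a hypothetical helper; the loss values are copied from the output.

```python
valid_losses = [0.572, 0.538, 0.519, 0.501, 0.484, 0.469, 0.456, 0.443,
                0.432, 0.422, 0.412, 0.403, 0.397, 0.390, 0.383]

def checkpoint_epochs(losses):
    """Return the 1-based epochs at which the loss improved on the best so far."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            saved.append(epoch)
    return saved

saved = checkpoint_epochs(valid_losses)
print(len(saved))  # 15
print(saved[-1])   # 15
```

Every epoch improved on the previous best, so the saved weights come from epoch 15. The validation loss was still falling at that point, which may suggest the run had not yet converged.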


## Model Testing and Evaluation

The best checkpoint is loaded and used to predict on the test set (the official validation file). A classification report then summarizes precision, recall, and F1-score for each class.

In [None]:
# Load best model
model.load_state_dict(torch.load('/content/drive/MyDrive/Fake News Detection Data/Constrain AI/saved_weights.pt', map_location=device))
model.eval()

# Perform predictions on test data
preds = []
with torch.no_grad():
    for batch in test_dataloader:
        # Unpack the batch to include sentiment and length
        sent_id, mask, sentiment, length, labels = [item.to(device) for item in batch]

        # Forward pass through the model, including sentiment and length
        batch_preds = model(sent_id, mask, sentiment, length)

        # Append predictions
        preds.extend(batch_preds.detach().cpu().numpy())

# Convert predictions to class labels
preds = np.argmax(np.array(preds), axis=1)
print(classification_report(test_y, preds))



              precision    recall  f1-score   support

           0       0.79      0.86      0.83      1020
           1       0.86      0.79      0.83      1120

    accuracy                           0.83      2140
   macro avg       0.83      0.83      0.83      2140
weighted avg       0.83      0.83      0.83      2140
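
As a reminder of how the per-class figures in the report are derived, the sketch below computes precision, recall, and F1 from true-positive, false-positive, and false-negative counts. The counts are made up for illustration and are not taken from this run; `precision_recall_f1` is a hypothetical helper. Assuming `LabelEncoder`'s sorted class order, row `0` in the report corresponds to `fake` and row `1` to `real`.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for one class
prec, rec, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(prec, 2), round(rec, 2), round(f1, 2))  # 0.8 0.8 0.8
```

F1 is the harmonic mean of precision and recall, so it drops quickly when either one is low.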



Given your paper's focus on identifying features that are relevant to distinguishing fake and real news, whether by a model or a human, including the ablation study could add significant value. Here’s why:

1. **Feature Relevance in Model Prediction**: The ablation study can demonstrate whether features like sentiment and length are practically useful in a complex model like BERT, which can guide future work in model feature selection and help readers understand which features truly contribute to prediction accuracy.

2. **Human Interpretability**: Since the paper also considers features that humans might use, the statistical significance of sentiment and length is highly relevant. Including the ablation study helps emphasize the difference between features that are *statistically relevant* and those that *actually improve model predictions*, a distinction that might be insightful for both modelers and human evaluators.

3. **Balanced View of Feature Impact**: By presenting both statistical tests and ablation results, you show a balanced view: some features (like sentiment and length) may be important for human interpretation or basic statistical distinctions, but their utility may not carry over directly to model performance. This dual perspective strengthens the paper by addressing both human interpretability and model efficacy.

### Suggested Integration

You could structure this section as follows:

1. **Statistical Relevance**: Present hypothesis testing results to show that sentiment and length are statistically significant in distinguishing fake from real news.
2. **Ablation Study**: Show that while these features are statistically relevant, they don’t notably improve model performance, suggesting that BERT embeddings may already capture this information.
3. **Implications for Human and Model Distinction**: Discuss how humans might rely on sentiment or length cues when distinguishing news types, whereas a model like BERT doesn’t necessarily benefit from explicit inclusion of these features.

This approach would provide a comprehensive answer to your research question, showcasing the distinctions and overlaps between features useful to humans and to machine models.
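
One possible way to organize such an ablation study is to enumerate every on/off combination of the optional features and zero out the disabled ones, so the model's input width stays the same across runs. This is a sketch under that assumption, not the procedure used for the results above; `ablation_configs` and `apply_ablation` are hypothetical helpers, and the feature values are made up.

```python
from itertools import product

def ablation_configs(feature_names):
    """Enumerate every on/off combination of the optional features."""
    return [dict(zip(feature_names, flags))
            for flags in product([False, True], repeat=len(feature_names))]

def apply_ablation(features, config):
    """Zero out disabled features so the input width does not change."""
    return {name: (value if config[name] else 0.0)
            for name, value in features.items()}

configs = ablation_configs(["sentiment", "length"])
print(len(configs))  # 4

example = {"sentiment": 1.0, "length": 12.0}  # made-up feature values
print(apply_ablation(example, configs[1]))    # {'sentiment': 0.0, 'length': 12.0}
```

Each configuration would then be trained and evaluated with the same split and seed so that differences in the metrics can be attributed to the features.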

If your research paper shows that certain features (like sentiment and length) are statistically significant in distinguishing real from fake news, but including them in the model didn’t improve performance and in fact slightly decreased it, this outcome is still valuable. Here’s how to interpret and present these findings effectively:

### 1. **Highlight the Statistical Significance as an Insight into the Dataset**

   - **Interpretation of Statistical Findings**: You can emphasize that features like sentiment and length **are meaningfully associated with news type** (real vs. fake), as shown by hypothesis testing. This is important information for understanding the dataset and the types of language patterns that may characterize fake news.
   - **Feature Relevance**: Acknowledge that these features have statistical relevance, which means they differentiate fake news from real news in general terms, even if they don’t improve the model’s performance in a predictive setting.
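
The notes above do not say which hypothesis test was used. As one illustrative option, the sketch below runs a two-sided permutation test on the difference in mean post length between two groups. The function name, the number of permutations, the seed, and the toy lengths are all assumptions made for this example.

```python
import random

def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero
    return (extreme + 1) / (n_permutations + 1)

# Made-up word counts for two groups of posts
fake_lengths = [5, 6, 7, 5, 6, 7, 6, 5]
real_lengths = [20, 21, 22, 19, 20, 21, 22, 20]
p_value = permutation_test(fake_lengths, real_lengths)
print(p_value < 0.05)  # True
```

A small p-value indicates that a difference this large would be unlikely if the group labels were assigned at random. It does not by itself show that the feature improves a classifier.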

### 2. **Explain Why BERT’s Embeddings May Have Diminished the Need for These Features**

   - **Redundant Information**: Mention that BERT embeddings already capture complex language representations, including sentiment and length implicitly. Because BERT has been pre-trained on large amounts of text data, it likely already encodes sentiment-related patterns and length nuances within its embeddings.
   - **Complex Model Dynamics**: Explain that adding explicit sentiment and length features might have created redundancy, leading to a slight decrease in performance. This could be because the model is now handling duplicate information or is overfitting on sentiment and length, which are already well-represented in BERT’s embeddings.

### 3. **Discuss Implications for Future Research and Practical Applications**

   - **Importance of Hypothesis Testing**: Emphasize that statistical tests remain essential for understanding data characteristics, even if these features don’t add predictive power to complex models. In real-world applications, knowing that sentiment and length are distinguishing factors can guide content analysis, policy-making, or simpler models where these features may matter more.
   - **Feature Engineering Recommendations**: Suggest that researchers using simpler models, such as logistic regression or support vector machines, might find these features more useful. Additionally, sentiment and length could be more impactful for specific sub-types of fake news or different domains (e.g., health vs. political news).
   - **Future Model Interpretability**: Encourage future work to explore alternative interpretability techniques, like attention weights in Transformer models, to understand more about how BERT utilizes language patterns in fake news detection.

### 4. **Conclude with the Value of Both Statistical and Predictive Analysis**

   - **Dual Analysis Approach**: Reinforce that combining statistical tests with predictive modeling provided deeper insights into the dataset and the model’s behavior. Although the features did not boost performance, the statistical significance reveals meaningful distinctions in the data.
   - **Balanced Conclusion**: Summarize that while sentiment and length are significant in distinguishing real from fake news, they are not necessarily beneficial in complex models like BERT that already capture nuanced language features. This insight can help refine feature engineering practices in NLP research.

### Summary of Key Points for Your Research Paper

1. **Statistical Significance**: Sentiment and length are statistically significant in distinguishing fake news from real news.
2. **Model Performance Impact**: Including these features didn’t improve BERT’s performance, likely due to redundancy.
3. **Future Directions**: Highlight potential applications in simpler models or as insights for content analysis.
4. **Value of Hypothesis Testing**: Statistical testing remains crucial for understanding dataset characteristics, even if the features don’t directly benefit complex models.

This analysis strengthens your paper by showing a comprehensive approach to feature evaluation, balancing statistical significance with practical implications in model performance.

Given your research focus and the analyses you’ve conducted, here are some potential research questions that align with your investigation of feature relevance for distinguishing fake and real news:

1. **Primary Research Question**:
   - *"What features are relevant for distinguishing fake news from real news, and how do these features contribute to classification when analyzed statistically versus when included in a predictive model?"*
   - This question captures your goal of understanding feature relevance both from a statistical perspective (for human interpretability) and in a practical, predictive sense (for model performance).

2. **Supporting Research Questions**:
   - *"Do features like sentiment and length provide statistically significant distinctions between fake and real news?"*
     - This question addresses the role of hypothesis testing in identifying meaningful differences between fake and real news based on linguistic features.
   - *"Does including statistically relevant features, such as sentiment and post length, improve the performance of a complex model like BERT in classifying fake and real news?"*
     - This question explores the practical utility of adding these features to a model that already captures language nuances, providing insight into the value of explicit versus implicit feature use in machine learning.
   - *"How do features that are relevant to humans for identifying fake news compare to those that are useful for a machine learning model?"*
     - This emphasizes the distinction between human and model interpretability, addressing whether features like sentiment and length are useful in similar ways for humans and machines.

### Summary

Your primary question delves into the relevance of features in both statistical and predictive contexts, while the supporting questions help dissect the nuances of feature significance for humans versus models. This setup highlights a dual approach, analyzing how and why features matter from different perspectives.