# Statement Classifier

The goal of this project is to train a model that determines whether a statement is a claim that can be fact-checked or some other kind of statement, such as an opinion, that cannot.

## TODO:

- [ ] Before training again, set up a file structure that saves the 'latest' model and moves earlier models into time-stamped dirs, tracked with a metadata file or something similar.
- [ ] Config at top to control what runs when you click GO.
- [ ] Wrap the processing steps in functions.

In [40]:
import torch
import pandas as pd
import numpy as np
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    pipeline
)
from datasets import Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.utils import resample
from pathlib import Path
from datetime import datetime
import re
from typing import List
import json
import os

In [25]:
torch.cuda.is_available()

True

Received the following warning while training the model in the first few attempts:

```
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
```

To address this:

In [None]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Step 2: Prepare Data

### Sample Data

I started with the CSV data, but I did not need the extra information, so I settled on the JSON.

In [26]:
here = Path().cwd()
cbdata_path = here / ".data_sets" / "ClaimBuster_Datasets" / "datasets" # ClaimBuster data location
raw_dfs: List[pd.DataFrame] = []

for file_path in cbdata_path.iterdir():
    if file_path.exists() and file_path.is_file() and file_path.suffix == ".json":
        with open(file_path, 'r') as fileo:
            raw_dfs.append(pd.DataFrame(json.load(fileo)))

assert len(raw_dfs) > 0

for i, j in enumerate(raw_dfs):
    assert j is not None
    assert type(j) is pd.DataFrame
    print(f"--- part {i+1:02} ---")
    print(j.head())
    print(j.describe())
    print("\n")

--- part 01 ---
   sentence_id  label                                               text
0        27247      1                We're 9 million jobs short of that.
1        10766      1  You know, last year up to this time, we've los...
2         3327      1  And in November of 1975 I was the first presid...
3        19700      1  And what we've done during the Bush administra...
4        12600      1  Do you know we don't have a single program spo...
        sentence_id        label
count   9674.000000  9674.000000
mean   16268.353628     0.285714
std     9388.575939     0.451777
min       16.000000     0.000000
25%     8344.000000     0.000000
50%    16455.500000     0.000000
75%    24086.250000     1.000000
max    34458.000000     1.000000


--- part 02 ---
   sentence_id  label                                               text
0        15083      1  When I made my decision to stop all trade with...
1        16799      1  We've got the highest inflation we've had in t...
2        325

In [27]:
df = pd.concat(raw_dfs)
print(df.describe())
print(f"Dataset Size: {len(df)}")

        sentence_id         label
count  29022.000000  29022.000000
mean   16281.469161      0.285714
std     9401.659478      0.451762
min       16.000000      0.000000
25%     8384.500000      0.000000
50%    16455.500000      0.000000
75%    24089.000000      1.000000
max    34458.000000      1.000000
Dataset Size: 29022


### Additional Data Exploring

After building the model and performing some manual testing, the statement, "Barack Obama was president from 2009 to 2017," kept being returned as an opinion when it is actually a verifiable claim.

I ended up moving this exploration to its own file, so the cells below are disabled with `if False:`.

In [28]:
if False:
    for sent in df["text"]:
        if "obama" in sent.casefold():
            print(sent)

if False:
    obama_mask = df["text"].str.contains("Obama", case=False, na=False)
    obama_df = df.copy()[obama_mask] # Only Obama entries
    obama_claims_df = obama_df[obama_df["label"] == 1 ]
    obama_opinions_df = obama_df[obama_df["label"] == 0 ]

    obama_mentions_count = len(obama_df)
    obama_claims_count = len(obama_claims_df)
    obama_opinions_count = len(obama_opinions_df)
    print(f"Total Obama mentions: {obama_mentions_count}")
    print(f"Obama Claims (LABEL_1): {obama_claims_count}")
    print(f"Obama Opinions (LABEL_0): {obama_opinions_count}")
    print(f"Obama Claim Percentage: {(obama_claims_count / obama_mentions_count) * 100}%")

    print("\nSample Obama Entries as Claims")
    print("---" * 5 + " Claims " + "---" * 5)
    print(obama_claims_df.head(10))
    print("---" * 5 + " Opinions " + "---" * 5)
    print(obama_opinions_df.head(10))
    # for i, text in enumerate(obama_claims_df["text"].head(10)):
    #     print(f"{i}.) \"{text}\"")
    # print("\nSample Obama Entries as Claims")


## Data Fixing

In [29]:
print("=== CLAIMBUSTERS DATA QUALITY ANALYSIS ===\n")

# 1. Check label distribution
print("1. LABEL DISTRIBUTION:")
print(f"Total samples: {len(df)}")
print(f"Claims (LABEL_1): {len(df[df['label'] == 1])} ({len(df[df['label'] == 1])/len(df)*100:.1f}%)")
print(f"Non-claims (LABEL_0): {len(df[df['label'] == 0])} ({len(df[df['label'] == 0])/len(df)*100:.1f}%)")


=== CLAIMBUSTERS DATA QUALITY ANALYSIS ===

1. LABEL DISTRIBUTION:
Total samples: 29022
Claims (LABEL_1): 8292 (28.6%)
Non-claims (LABEL_0): 20730 (71.4%)


### Balancing Data

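The current data is unbalanced (71.4% non-claims versus 28.6% claims), which caused issues in the first model.

The options appear to be downsampling the majority class or computing class weights. The cell below downsamples.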

In [30]:
# split into majority / minority
df_nonclaims_majority = df[df["label"] == 0] # non-claims
df_claims_minority = df[df["label"] == 1] # claims

print(f"Non-Claims: {len(df_nonclaims_majority)}")
print(f"Claims: {len(df_claims_minority)}")

df_nonclaims_downsampled = resample(
    df_nonclaims_majority,
    replace=False, # sample without replacement
    n_samples=len(df_claims_minority),
    random_state=42
)

print(f"Non-Claims: {len(df_nonclaims_majority)}")
print(f"Non-Claims Down Sampled: {len(df_nonclaims_downsampled)}")
print(f"Claims: {len(df_claims_minority)}")

# Data Frame Balanced: equal parts claim and non-claim
dfb = pd.concat([df_nonclaims_downsampled, df_claims_minority])
dfb.describe()

unused_df = df_nonclaims_majority.drop(df_nonclaims_downsampled.index)

Non-Claims: 20730
Claims: 8292
Non-Claims: 20730
Non-Claims Down Sampled: 8292
Claims: 8292
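
The class-weight alternative mentioned above was not used in this notebook. As a minimal sketch of what it could look like, the weights below follow the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, applied to the label counts printed above. The helper name `balanced_class_weights` is hypothetical, and wiring the weights into training (for example through a weighted cross-entropy loss) is an assumption that is not shown here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Label counts reported above: 20730 non-claims (0) and 8292 claims (1)
labels = [0] * 20730 + [1] * 8292
weights = balanced_class_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"label {cls}: weight {weight:.2f}")
```

With weights like these the full 29022 rows could be kept, at the cost of a custom loss; downsampling discards 12438 non-claims but needs no changes to the training loop.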


### Split Data

Split the data into training and validation/test sets.
The concatenation above placed all non-claims before all claims, so the rows are first shuffled [with this strategy](https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows).

In [31]:
# Shuffle
dfb = dfb.sample(frac=1, random_state=42).reset_index(drop=True)

# Split into train/validation sets
# Data Frame Training
# Data Frame Validation
dft, dfv = train_test_split(
    dfb,
    test_size=0.2, 
    random_state=42,
    shuffle=True,
    stratify=dfb["label"],
)

print(f"Training samples: {len(dft)}")
print(f"Validation samples: {len(dfv)}")


Training samples: 13267
Validation samples: 3317


## Step 3: Load and Setup BERT

### Initialize Tokenizer and Model

In [32]:
# Choose your BERT variant
# TODO: Add in config at top
model_name = "bert-base-uncased"  # Good starting point
# Alternatives: "roberta-base", "distilbert-base-uncased" (faster)

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Ran into a tokenization issue: all tensors in a batch must be the same length.
# Some were 100, but one was 187.
# Use padding so every sequence in a batch has the same length.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=2  # Binary classification: claim vs opinion
)
model.to(device)

print(f"Model loaded: {model_name}")
print(f"Vocabulary size: {tokenizer.vocab_size}")

Using device: cuda


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded: bert-base-uncased
Vocabulary size: 30522
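
The padding issue noted in the cell above (sequences of different lengths in one batch) can be illustrated without any library. This is only a sketch of the idea behind padding a batch to its longest sequence, not the actual `DataCollatorWithPadding` implementation. The function `pad_batch`, the toy token ids, and the pad id of `0` are assumptions for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every token-id list to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids (hypothetical values, not real tokenizer output)
batch = pad_batch([[101, 2057, 102], [101, 2057, 2024, 2182, 102]])

for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Padding per batch rather than to one global length keeps short batches small, which is why a collator is usually preferred over padding the whole dataset up front.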


### Tokenize Data

In [33]:
def tokenize_function(examples):
    return tokenizer(
        examples['text'], 
        truncation=True, 
        padding=True, 
        max_length=256  # Adjust based on your text length
    )

In [34]:
# Convert pandas DataFrames to 🤗 Dataset objects
dst = Dataset.from_pandas(dft)
dsv = Dataset.from_pandas(dfv)

# Apply tokenization
train_dataset = dst.map(tokenize_function, batched=True)
val_dataset = dsv.map(tokenize_function, batched=True)

print("Data tokenized successfully!")

Map:   0%|          | 0/13267 [00:00<?, ? examples/s]

Map:   0%|          | 0/3317 [00:00<?, ? examples/s]

Data tokenized successfully!


## Step 4: Fine-Tune Model

We are doing **transfer learning** with **fine-tuning**.
BERT was pre-trained to understand language - Thank you!
We are fine-tuning the model for a specific task: claim vs opinion.
The technique is supervised learning with backpropagation.

Deep dive: BERT has millions of weights (roughly 110 million in `bert-base-uncased`) that encode its understanding of language. We adjust these to suit our classification task. Only the final classification layer learns from scratch; the rest of BERT adapts rather than being retrained completely.
BERT was pre-trained largely by predicting tokens hidden behind a "[MASK]" token, so on its own it does not output class labels.
By fine-tuning, we add a classification layer on top: `input text -> BERT Encoder -> Classification Head -> [Claim, Opinion] probabilities`.
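
The last step of that pipeline, turning the classification head's two raw scores (logits) into probabilities, is a softmax. The sketch below shows only that step, in plain Python. The logit values are made up for illustration, and the label names assume the default `LABEL_0` / `LABEL_1` naming with 0 as non-claim and 1 as claim, as used elsewhere in this notebook.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits from the classification head for one sentence
logits = [-1.2, 2.3]
probs = softmax(logits)

label_names = ["LABEL_0 (non-claim)", "LABEL_1 (claim)"]
best = max(range(len(probs)), key=lambda i: probs[i])
print(f"Predicted {label_names[best]} with probability {probs[best]:.3f}")
```

The `score` reported by the testing cells later on is this kind of probability for the winning label.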

### Define Training Arguments

[transformers.TrainingArguments](https://huggingface.co/docs/transformers/v4.52.3/en/main_classes/trainer#transformers.TrainingArguments) has a lot of parameters. 

In [None]:
# Set up directories for saving
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")

# TODO: Give model name at top
move_path = Path().cwd() / "trainingresults" / f'hide-bert_{timestamp}'  # archive location for this run
out_path = Path().cwd() / "trainingresults" / "latest"
metadata_file_path = out_path / "metadata.json"
if metadata_file_path.exists():
    # A model exists in latest already - move it to its time-stamped directory
    with open(metadata_file_path, 'r') as file:
        previous = json.load(file)
    out_path.rename(Path(previous["path"]))  # json.load returns a dict, so index by key
    assert not out_path.exists()

# Create a fresh "latest" directory before writing into it
out_path.mkdir(parents=True, exist_ok=True)
with open(metadata_file_path, 'w') as file:
    # Record where this run should be archived when the next run replaces it
    json.dump({"path": str(move_path), "foundation": model_name}, file, indent=2)

In [None]:
training_args = TrainingArguments(
    output_dir=out_path, # Working directory during training for logs and checkpoints.
    num_train_epochs=3,              # Start with 3, adjust based on results
    per_device_train_batch_size=16,  # Reduce if memory issues
    per_device_eval_batch_size=16,
    warmup_steps=500, # gradually increase learning rate over 500 steps | prevents huge destructive changes early on
    weight_decay=0.01, # very mild (0.01) regularization to prevent memorizing training data exactly
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    dataloader_pin_memory=False, # pinned memory is disabled here; True can help with GPU transfer speed
    fp16=True, # mixed precision can speedup training if supported
    dataloader_num_workers=4, # parallel data loading
)
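
To get a feel for what `warmup_steps=500` means for this run, the sketch below works out the step counts from the 13267 training samples and models a linear warmup followed by a linear decay. It assumes a single device, no gradient accumulation, a linear scheduler, and a base learning rate of `5e-5`; none of these are set explicitly in the cell above, so treat them as assumptions. The function `linear_warmup_decay` is a hypothetical stand-in, not the library's scheduler.

```python
import math


def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


train_samples = 13267  # from the split above
batch_size = 16
epochs = 3
steps_per_epoch = math.ceil(train_samples / batch_size)  # 830
total_steps = steps_per_epoch * epochs  # 2490

base_lr = 5e-5  # assumed base learning rate
for step in (0, 250, 500, 1500, total_steps):
    lr = linear_warmup_decay(step, base_lr, 500, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")

print(f"Warmup covers {500 / total_steps:.0%} of training")
```

Under these assumptions the warmup spans about a fifth of all optimizer steps, which is worth keeping in mind if the epoch count or batch size changes.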

### Define Evaluation Metrics

In [36]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='weighted'
    )
    accuracy = accuracy_score(labels, predictions)
    
    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
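
One property of `average='weighted'` is worth knowing when reading the results: support-weighted recall is mathematically the same number as accuracy, so the two columns will always match. The plain-Python sketch below demonstrates this on a small made-up set of labels; the helper names and the toy data are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with each class weighted by its support."""
    total = len(y_true)
    result = 0.0
    for cls in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        result += (hits / support) * (support / total)
    return result


# Toy labels (made up): 1 = claim, 0 = non-claim
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

print(f"accuracy:        {accuracy(y_true, y_pred):.3f}")
print(f"weighted recall: {weighted_recall(y_true, y_pred):.3f}")
```

If recall on the claim class specifically is of interest, a per-class or `average='binary'` score would carry more information than the weighted one.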

### Initialize and Train

This is the fun part we all want to do :)

In [37]:
# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Start training
print("Starting training...")
trainer.train()

# TODO: Update Path - the latest idea and switching...
# Save the model
trainer.save_model(out_path) # Where to save model weights and config
tokenizer.save_pretrained(out_path) # for tokenizer stuff
print("Model saved!")

Starting training...


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.1525 | 0.221319 | 0.923726 | 0.923563 | 0.927361 | 0.923726 |
| 2 | 0.0864 | 0.118954 | 0.970455 | 0.970455 | 0.970466 | 0.970455 |
| 3 | 0.0655 | 0.141335 | 0.970757 | 0.970747 | 0.971395 | 0.970757 |


Model saved!


## Step 5: Test Model

### Load Trained Model for Testing

In [56]:
class ValidationEntry:
    def __init__(self, statement: str, expected: int):
        self.statement = statement
        self.expected = expected
    
    def __str__(self):
        line1 = f"{self.__class__.__name__}\n"
        line2 = f"  Statement: {self.statement}\n"
        # Compare against the stored label; a bare `if 0` is always False
        line3 = f"  Expectation: {'Opinion' if self.expected == 0 else 'Claim'}\n"
        return line1 + line2 + line3

# ADD LATER
manual_tests = [
    ("John Smith was elected mayor in 2020", 1),
    ("The company reported $2 million in revenue", 1),
    ("She graduated from Harvard University", 1),
    ("Billy Joe graduated from Harvard University", 1),
    ("The meeting was scheduled for 3 PM", 1),
    ("COVID-19 cases increased by 15% last month", 1),
    ("This is the best restaurant in town", 0),
    ("We should invest more in education", 0),
    ("That movie was terrible", 0),
    ("This policy is unfair to working families", 0),
    ("Climate change is the most important issue", 0),
    ("Barack Obama was president from 2009 to 2017", 1),
    ("Pizza is the most delicious food ever", 0),
    ("The stock market closed at 4,500 points", 1),
    ("This movie deserves an Oscar", 0),
    ("The man Barack Obama served as Senator from Illinois before becoming president.", 1),
    ("The man John Doe served as Senator from Illinois before becoming president.", 1),
    ("Barack Obama won the Nobel Peace Prize in 2009", 1),
    ("George Washington won the Nobel Peace Prize in 2009", 1),
    ("Ada Lovelace wrote the first computer program way back in the 1840s!", 1),
    ("The unemployment rate in the Arctic is close to 0, that's amazing!", 1),
    ("Donald Trump only serves himself and the top 1%.", 0),
    ("Donald Trump's Big Beautiful Bill implements the biggest cut to Medicaid in American history.", 1),
]

validation_items = [
    ValidationEntry(statement, expected) for statement, expected in manual_tests
]

# for item in validation_items:
#     print(item)

In [57]:
print("\n")
# Load your fine-tuned model
# TODO: UPDATE!!!
classifier = pipeline(
    task="text-classification",
    model=str(out_path),
    tokenizer=str(out_path),
    device=device  # reuse the device chosen earlier instead of hard-coding 'cuda'
)

success_cnt = 0
print("=== Testing the model ===")
print("-" * 50)
for i, item in enumerate(validation_items):
    result = classifier(item.statement)
    actual = 0 if result[0]['label'] == 'LABEL_0' else 1
    success = actual == item.expected
    success_label = "PASS" if success else "FAIL"
    if success:
        success_cnt += 1
    prediction_label = "Claim" if result[0]['label'] == 'LABEL_1' else "Opinion"
    expected_label = "Claim" if item.expected == 1 else "Opinion"
    confidence = result[0]['score']
    print(f"Test {i+1}: {success_label}")
    print(result)
    print(f"Text: '{item.statement}'")
    print(f"Prediction: {prediction_label} (confidence: {confidence:.3f})")
    print(f"Expected: {expected_label}")
    print("-" * 50)

print("-" * 50)
print(f"{success_cnt} Correct")
print(f"{len(validation_items) - success_cnt} Wrong")
print(f"{len(validation_items)} Total")
print(f"Rate of Success: {(success_cnt / len(validation_items))*100:.4f}%")
print("-" * 50)

Device set to use cuda




=== Testing the model ===
--------------------------------------------------
Test 1: FAIL
[{'label': 'LABEL_0', 'score': 0.9942793846130371}]
Text: 'John Smith was elected mayor in 2020'
Prediction: Opinion (confidence: 0.994)
Expected: Claim
--------------------------------------------------
Test 2: PASS
[{'label': 'LABEL_1', 'score': 0.9977515339851379}]
Text: 'The company reported $2 million in revenue'
Prediction: Claim (confidence: 0.998)
Expected: Claim
--------------------------------------------------
Test 3: FAIL
[{'label': 'LABEL_0', 'score': 0.9954456090927124}]
Text: 'She graduated from Harvard University'
Prediction: Opinion (confidence: 0.995)
Expected: Claim
--------------------------------------------------
Test 4: FAIL
[{'label': 'LABEL_0', 'score': 0.9969039559364319}]
Text: 'Billy Joe graduated from Harvard University'
Prediction: Opinion (confidence: 0.997)
Expected: Claim
--------------------------------------------------
Test 5: FAIL
[{'label': 'LABEL_0', 'score

### Manual Evaluation Function

In [None]:
def evaluate_model(texts, true_labels):
    """Evaluate model on a list of texts with known labels"""
    # Pass all texts in one call so the pipeline can batch them on the GPU
    results = classifier(list(texts), batch_size=32)
    # Convert each predicted label to binary (0 or 1)
    predictions = [1 if result['label'] == 'LABEL_1' else 0 for result in results]
    
    accuracy = accuracy_score(true_labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        true_labels, predictions, average='weighted'
    )
    
    print(f"Accuracy: {accuracy:.3f}")
    print(f"Precision: {precision:.3f}")
    print(f"Recall: {recall:.3f}")
    print(f"F1-score: {f1:.3f}")
    
    return predictions

# Transformers warns when a pipeline is called one text at a time on GPU; passing a list or a Dataset lets it batch.
predictions = evaluate_model(dfv["text"], dfv["label"])
# predictions = evaluate_model(val_dataset)

Accuracy: 0.970
Precision: 0.970
Recall: 0.970
F1-score: 0.970


## Step 6: Integration With Fact-Checker

In [53]:
def extract_claims_from_text(text):
    """
    Extract potential factual claims from text
    Returns list of sentences classified as factual claims
    """
    # Simple sentence splitting (you might want to use spaCy for better results)
    sentences = text.split('. ')
    print(sentences)
    
    claims = []
    for sentence in sentences:
        if len(sentence.strip()) > 10:  # Skip very short sentences
            print(sentence)
            result = classifier(sentence)
            print(result)
            if result[0]['label'] == 'LABEL_1':  # Factual claim
                claims.append({
                    'text': sentence,
                    'confidence': result[0]['score']
                })
    
    return claims

# Test with a Twitter example
twitter_text = """My opponent Denver Riggleman, running mate of Corey Stewart, was caught on camera campaigning with a white supremacist. Now he has been exposed as a devotee of Bigfoot erotica. This is not what we need on Capitol Hill."""

claims = extract_claims_from_text(twitter_text)
print(f"Extracted claims: {claims}")
for claim in claims:
    print(f"- {claim['text']} (confidence: {claim['confidence']:.3f})")

['My opponent Denver Riggleman, running mate of Corey Stewart, was caught on camera campaigning with a white supremacist', 'Now he has been exposed as a devotee of Bigfoot erotica', 'This is not what we need on Capitol Hill.']
My opponent Denver Riggleman, running mate of Corey Stewart, was caught on camera campaigning with a white supremacist
[{'label': 'LABEL_1', 'score': 0.9840406179428101}]
Now he has been exposed as a devotee of Bigfoot erotica
[{'label': 'LABEL_0', 'score': 0.6842691898345947}]
This is not what we need on Capitol Hill.
[{'label': 'LABEL_0', 'score': 0.9990630745887756}]
Extracted claims: [{'text': 'My opponent Denver Riggleman, running mate of Corey Stewart, was caught on camera campaigning with a white supremacist', 'confidence': 0.9840406179428101}]
- My opponent Denver Riggleman, running mate of Corey Stewart, was caught on camera campaigning with a white supremacist (confidence: 0.984)
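
The `text.split('. ')` approach above drops the period from every sentence except the last and ignores `!` and `?`. As a lightweight step before reaching for spaCy, a regular-expression split on whitespace that follows sentence-ending punctuation keeps the punctuation attached. This is a sketch only: the helper name `split_sentences` is hypothetical, and the pattern will still mis-split on abbreviations such as "Dr." or "U.S.".

```python
import re


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?', keeping the punctuation."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]


twitter_text = (
    "My opponent Denver Riggleman, running mate of Corey Stewart, was caught "
    "on camera campaigning with a white supremacist. Now he has been exposed "
    "as a devotee of Bigfoot erotica. This is not what we need on Capitol Hill."
)

for sentence in split_sentences(twitter_text):
    print(repr(sentence))
```

Keeping the punctuation may matter here because the training sentences in the ClaimBuster data shown earlier include their final period.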
