# Project Plan: Part 1 - NER-Based Classifier vs. Random Baseline

## ✅ 1. Data Loading and Preprocessing   
   - [x] **Load the dataset** (train, test, dev).  
   - [x] **Extract** `statement` (text) and `label` (truthfulness category).  
   - [x] **Store** dataset in a `pandas DataFrame` for further processing.  

---

## ✅ 2. Named Entity Recognition (NER) Extraction
   - [ ] **Use a RoBERTa-based NER model** (`Jean-Baptiste/roberta-large-ner-english`).  
   - [ ] **Process each statement** and extract entity types:  
      - `PERSON`, `ORG`, `GPE`, `DATE`, `MONEY`, etc.  
   - [ ] **Count the occurrences** of each entity type per statement.  
   - [ ] **Store results** in a new DataFrame (NER feature table).  

---

## ✅ 3. Feature Engineering
   - [ ] **Convert dataset into a structured numerical format:**  
      - Rows = **statements**  
      - Columns = **entity counts** (`PERSON`, `ORG`, etc.).  
   - [ ] **Normalize or scale values** if needed (optional).  
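
The statements-by-entity-counts table described above can be built directly from per-statement counters. This is a minimal sketch, assuming each statement already has a `Counter` of entity labels. The `build_count_matrix` helper and the fixed `ENTITY_COLUMNS` list are hypothetical names for illustration and do not come from the notebook. Fixing the column list up front is one way to keep the train and test tables aligned even when a label never occurs in one split.

```python
from collections import Counter

import pandas as pd

# Assumed label set: the four labels that appear in the notebook's count tables.
ENTITY_COLUMNS = ["LOC", "MISC", "ORG", "PER"]


def build_count_matrix(counters, columns=ENTITY_COLUMNS):
    """Turn a sequence of Counters into an integer DataFrame with fixed columns."""
    table = pd.DataFrame([dict(c) for c in counters])
    # reindex adds any missing label as an empty column and drops unexpected ones
    table = table.reindex(columns=columns)
    # Labels a statement does not contain become 0 instead of NaN
    return table.fillna(0).astype(int)


counters = [Counter({"PER": 2}), Counter(), Counter({"LOC": 1, "PER": 1})]
features = build_count_matrix(counters)
print(features)
```

Calling the same helper on both splits yields identical column order, which scikit-learn estimators expect when they are fitted on one DataFrame and used to predict on another.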

---

## ✅ 4. Training the NER-Based Classifier
   - [ ] **Split dataset** into `train` and `test` sets.  
   - [ ] **Train a classifier** using only NER-based features:  
      - Logistic Regression / Naive Bayes  
   - [ ] **Evaluate performance** using:  
      - Accuracy, Precision, Recall, F1-score  
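
For reference, the four metrics listed above all derive from the confusion counts of a binary classifier. The sketch below computes them by hand on made-up labels; `binary_metrics` is a hypothetical helper written for illustration, and the notebook itself relies on `sklearn.metrics` instead.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here two of three predicted positives are correct and two of three actual positives are found, so precision, recall and F1 all equal 2/3.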

---

## ✅ 5. Random Baseline Comparison
   - [ ] **Implement a random classifier** that assigns labels based on class distribution.  
   - [ ] **Compare performance** of:  
      - 🔹 NER-based classifier  
      - 🔹 Random baseline  
   - [ ] **Analyze if NER helps** (or if model is no better than random).  
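
It can help to know what a class-distribution baseline should score before running it. The sketch below gives the expected metrics when each prediction is 1 with a fixed probability, independently of the true label. The function name and the 0.4 positive rate are assumptions chosen for illustration, not values taken from the dataset, and the F1 figure simply plugs in the expected precision and recall, so it is an approximation.

```python
def expected_stratified_baseline(p_train_pos, p_test_pos):
    """Expected metrics for guesses that are 1 with probability p_train_pos.

    p_train_pos: positive rate used to draw the random guesses.
    p_test_pos:  positive rate among the true labels being scored.
    """
    # Correct when guess and label are both 1, or both 0
    accuracy = p_train_pos * p_test_pos + (1 - p_train_pos) * (1 - p_test_pos)
    # A random positive guess is right as often as positives occur
    precision = p_test_pos
    # Each true positive is guessed as 1 with probability p_train_pos
    recall = p_train_pos
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative positive rate of 40% in both splits (an assumption)
expected = expected_stratified_baseline(0.4, 0.4)
print(expected)
```

With a 40% positive rate in both splits, the expected accuracy is 0.4 * 0.4 + 0.6 * 0.6 = 0.52, which is the figure a trained classifier needs to beat by more than sampling noise.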

---

## ✅ 6. Analysis & Reporting
   - [ ] **Confusion Matrix**: Identify misclassifications.  
   - [ ] **Feature Importance**: Which entity types impact truthfulness?  
   - [ ] **Write summary** of results and findings.  
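
One possible way to carry out the confusion-matrix and feature-importance steps is sketched below with scikit-learn. The data is synthetic and generated only so the example runs on its own; in this toy setup the label depends on the `PER` count by construction, which says nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

feature_names = ["MISC", "PER", "ORG", "LOC"]

# Synthetic counts and labels, for illustration only (not the project's data)
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, len(feature_names)))
y = (X[:, 1] + rng.normal(0, 1, size=200) > 1.5).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
y_pred = model.predict(X)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y, y_pred, labels=[0, 1])
print(cm)

# One coefficient per feature: the sign gives the direction of the effect,
# the magnitude its strength on the log-odds scale
coefficients = dict(zip(feature_names, model.coef_[0]))
for name, weight in sorted(coefficients.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {weight:+.3f}")
```

Coefficient magnitudes are only comparable across features when the features share a similar scale, which roughly holds for small entity counts but is worth checking before drawing conclusions.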

---

## 🎯 **Final Checkpoint**
- [ ] **NER-based classifier is implemented and tested**  
- [ ] **Random baseline comparison is completed**  
- [ ] **Analysis is documented**  

In [3]:
import ast

import pandas as pd

# 1. Data loading

df = pd.read_csv('outputs/output_train.csv')
df_test = pd.read_csv('outputs/output_test.csv')

# The entity columns are stored as stringified Python lists. Parse them with
# ast.literal_eval, which accepts only literals (eval would run arbitrary code).
for frame in (df, df_test):
    for column in ("A_raw_entities", "B_raw_entities"):
        frame[column] = frame[column].apply(ast.literal_eval)

df

Unnamed: 0,statement,label,label_binary,A_raw_entities,B_raw_entities
0,"90 percent of Americans ""support universal bac...",5,1,"[{'entity': 'MISC', 'score': 0.99866974, 'inde...","[{'word': '90 percent', 'entity': 'PERCENT'}, ..."
1,Last year was one of the deadliest years ever ...,1,0,[],"[{'word': 'Last year', 'entity': 'DATE'}, {'wo..."
2,"Bernie Sanders's plan is ""to raise your taxes ...",0,0,"[{'entity': 'PER', 'score': 0.9983652, 'index'...","[{'word': 'Bernie Sanders's', 'entity': 'PERSO..."
3,Voter ID is supported by an overwhelming major...,4,1,"[{'entity': 'MISC', 'score': 0.9153446, 'index...","[{'word': 'NYers', 'entity': 'ORG'}]"
4,"Says Barack Obama ""robbed Medicare (of) $716 b...",2,0,"[{'entity': 'PER', 'score': 0.9980445, 'index'...","[{'word': 'Barack Obama', 'entity': 'PERSON'},..."
...,...,...,...,...,...
18364,18 million illegal immigrants got their govern...,0,0,[],"[{'word': '18 million', 'entity': 'CARDINAL'},..."
18365,Says restoring Georgia pre-k to a 180-day prog...,3,1,"[{'entity': 'LOC', 'score': 0.9999677, 'index'...","[{'word': 'Georgia', 'entity': 'GPE'}, {'word'..."
18366,There is clear legal authority to handcuff and...,1,0,[],[]
18367,Says George Washington said a free people shou...,1,0,"[{'entity': 'PER', 'score': 0.9980171, 'index'...","[{'word': 'George Washington', 'entity': 'PERS..."


In [4]:
from collections import Counter

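def count_entities(raw_entity):
    """Count entity-tagged words per label, merging sub-word tokens into one word."""
    entity_counts = Counter()
    current_word = ""
    current_entity = None

    for item in raw_entity:
        word = item["word"].lstrip("Ġ")
        entity = item["entity"]

        # "Ġ" marks the start of a new word in the tokenizer output.
        # When a new word starts, count the previous one.
        if item["word"].startswith("Ġ") or not current_word:
            if current_word:
                entity_counts[current_entity] += 1

            current_word = word
            current_entity = entity
        else:
            # Sub-word continuation: append it to the current word
            current_word += word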

    # Count the last pending word
    if current_word:
        entity_counts[current_entity] += 1

    return entity_counts


df_A_counts = df["A_raw_entities"].apply(count_entities)


df_A_counts_test = df_test["A_raw_entities"].apply(count_entities)
df_A_counts

0                  {'MISC': 1}
1                           {}
2                   {'PER': 2}
3                  {'MISC': 1}
4        {'PER': 2, 'MISC': 2}
                 ...          
18364                       {}
18365               {'LOC': 1}
18366                       {}
18367               {'PER': 2}
18368               {'PER': 2}
Name: A_raw_entities, Length: 18369, dtype: object
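
The counts above are per word rather than per mention: row 2, whose statement names Bernie Sanders, yields `{'PER': 2}`. If per-mention counts are preferred, one option is to merge runs of adjacent tokens that share a label. The sketch below assumes each token dict carries an integer `index` giving its position in the sentence, as Hugging Face token-classification pipelines usually return. `count_entity_spans` is a hypothetical alternative, not the notebook's method, and the example tokens are made up.

```python
from collections import Counter


def count_entity_spans(raw_entity):
    """Count entity spans: runs of tokens with consecutive `index` and the same label."""
    counts = Counter()
    prev_index = None
    prev_label = None
    for item in raw_entity:
        contiguous = prev_index is not None and item["index"] == prev_index + 1
        if not (contiguous and item["entity"] == prev_label):
            counts[item["entity"]] += 1  # a new span starts here
        prev_index = item["index"]
        prev_label = item["entity"]
    return counts


# Made-up tokens for illustration; the index values are assumptions
tokens = [
    {"word": "ĠBernie", "entity": "PER", "index": 1},
    {"word": "ĠSanders", "entity": "PER", "index": 2},
    {"word": "ĠMedicare", "entity": "MISC", "index": 7},
]
span_counts = count_entity_spans(tokens)
print(span_counts)
```

A limitation of this heuristic is that two distinct mentions of the same type that sit directly next to each other are merged into one span.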

In [5]:
# Extract entities and their counts as new columns

df_A_counts_test = df_A_counts_test.apply(pd.Series)

df_A_counts = df_A_counts.apply(pd.Series)
df_A_counts

Unnamed: 0,MISC,PER,ORG,LOC
0,1.0,,,
1,,,,
2,,2.0,,
3,1.0,,,
4,2.0,2.0,,
...,...,...,...,...
18364,,,,
18365,,,,1.0
18366,,,,
18367,,2.0,,


In [6]:
# Replace NaN with 0 and change float to int
df_A_counts_test = df_A_counts_test.fillna(0).astype(int)

df_A_counts = df_A_counts.fillna(0).astype(int)
df_A_counts

Unnamed: 0,MISC,PER,ORG,LOC
0,1,0,0,0
1,0,0,0,0
2,0,2,0,0
3,1,0,0,0
4,2,2,0,0
...,...,...,...,...
18364,0,0,0,0
18365,0,0,0,1
18366,0,0,0,0
18367,0,2,0,0


In [7]:
# Join the statement and label_binary columns from df with the entity counts
df_A_test = df_test[["statement", "label_binary"]].join(df_A_counts_test)

df_A = df[["statement", "label_binary"]].join(df_A_counts)
df_A


Unnamed: 0,statement,label_binary,MISC,PER,ORG,LOC
0,"90 percent of Americans ""support universal bac...",1,1,0,0,0
1,Last year was one of the deadliest years ever ...,0,0,0,0,0
2,"Bernie Sanders's plan is ""to raise your taxes ...",0,0,2,0,0
3,Voter ID is supported by an overwhelming major...,1,1,0,0,0
4,"Says Barack Obama ""robbed Medicare (of) $716 b...",0,2,2,0,0
...,...,...,...,...,...,...
18364,18 million illegal immigrants got their govern...,0,0,0,0,0
18365,Says restoring Georgia pre-k to a 180-day prog...,1,0,0,0,1
18366,There is clear legal authority to handcuff and...,0,0,0,0,0
18367,Says George Washington said a free people shou...,0,0,2,0,0


In [8]:
# Select features (entity counts) and target (label_binary)

X_test = df_A_test.drop(columns=["statement", "label_binary"])  # Keep only entity counts
y_test = df_A_test["label_binary"]  # Target variable

X_train = df_A.drop(columns=["statement", "label_binary"])  # Keep only entity counts
y_train = df_A["label_binary"]  # Target variable





In [9]:
### LOGISTIC REGRESSION

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Initialize Logistic Regression model
model = LogisticRegression(class_weight="balanced", random_state=42, max_iter=1000)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Accuracy: 0.5414
Precision: 0.4714
Recall: 0.6783
F1-score: 0.5563


In [10]:
### RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Initialize Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluate performance
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)

# Print results
print(f"Random Forest Accuracy: {accuracy_rf:.4f}")
print(f"Precision: {precision_rf:.4f}")
print(f"Recall: {recall_rf:.4f}")
print(f"F1-score: {f1_rf:.4f}")

Random Forest Accuracy: 0.5688
Precision: 0.4897
Recall: 0.4173
F1-score: 0.4506


In [11]:
### SVM

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Initialize SVM model
svm_model = SVC(kernel="linear", class_weight="balanced", random_state=42)


# Train the model
svm_model.fit(X_train, y_train)

# Make predictions
y_pred_svm = svm_model.predict(X_test)

# Evaluate performance
accuracy_svm = accuracy_score(y_test, y_pred_svm)
precision_svm = precision_score(y_test, y_pred_svm)
recall_svm = recall_score(y_test, y_pred_svm)
f1_svm = f1_score(y_test, y_pred_svm)

# Print results
print(f"SVM Accuracy: {accuracy_svm:.4f}")
print(f"Precision: {precision_svm:.4f}")
print(f"Recall: {recall_svm:.4f}")
print(f"F1-score: {f1_svm:.4f}")

SVM Accuracy: 0.5296
Precision: 0.4642
Recall: 0.7122
F1-score: 0.5620


In [12]:
### COMPARING AGAINST BASELINE

In [13]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

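# Get label distribution in training data
p_0 = (y_train == 0).mean()  # Fraction of class 0 in the training labels
p_1 = (y_train == 1).mean()  # Fraction of class 1 in the training labels

# Generate random predictions that follow the training class distribution.
# No seed is set, so the metrics vary slightly from run to run.
y_pred_random = np.random.choice([0, 1], size=len(y_test), p=[p_0, p_1])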

# Evaluate random baseline
accuracy_rand = accuracy_score(y_test, y_pred_random)
precision_rand = precision_score(y_test, y_pred_random, zero_division=0)
recall_rand = recall_score(y_test, y_pred_random)
f1_rand = f1_score(y_test, y_pred_random)

# Print results
print(f"🔹 Random Baseline Accuracy: {accuracy_rand:.4f}")
print(f"🔹 Precision: {precision_rand:.4f}")
print(f"🔹 Recall: {recall_rand:.4f}")
print(f"🔹 F1-score: {f1_rand:.4f}")

🔹 Random Baseline Accuracy: 0.4926
🔹 Precision: 0.4008
🔹 Recall: 0.3988
🔹 F1-score: 0.3998
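
A single random draw gives a noisy estimate of the baseline. One way to reduce that noise is to repeat the draw many times with a seeded generator and report the mean and spread. This is a minimal sketch on synthetic labels; `simulate_random_baseline`, the 40% positive rate, the number of runs and the seed are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_random_baseline(y_true, p_pos, n_runs=200, seed=42):
    """Mean and standard deviation of accuracy over repeated random guesses."""
    rng = np.random.default_rng(seed)
    is_positive = np.asarray(y_true) == 1
    accuracies = []
    for _ in range(n_runs):
        # Each guess is positive with probability p_pos, independent of the label
        guesses = rng.random(len(is_positive)) < p_pos
        accuracies.append(np.mean(guesses == is_positive))
    return float(np.mean(accuracies)), float(np.std(accuracies))


# Synthetic labels with 40% positives, for illustration only
y_demo = np.array([1] * 400 + [0] * 600)
mean_acc, std_acc = simulate_random_baseline(y_demo, p_pos=0.4)
print(f"accuracy: {mean_acc:.4f} +/- {std_acc:.4f}")
```

Comparing a classifier's accuracy against the mean plus a few standard deviations gives a rough sense of whether its margin over the baseline exceeds sampling noise.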
