# May Code Pudding: Bias Detection

## Getting Packages and Reading Data:

In [195]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm

import matplotlib.pyplot as plt
import seaborn as sns

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from textblob import TextBlob
import spacy

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

import torch
from transformers import AutoTokenizer, AutoModel, pipeline

import openpyxl
import xlrd

# Just in case
import requests
from bs4 import BeautifulSoup

pd.set_option('display.max_colwidth', None)

In [169]:
# nltk setup
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/betaknight/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/betaknight/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [170]:
spacy.load('en_core_web_sm')

<spacy.lang.en.English at 0x30aa34f50>

In [None]:
dataset_path = os.path.join(os.pardir, 'datasets')

df_sg1 = pd.read_csv(os.path.join(dataset_path, 'final_labels_SG1.csv'), sep=';')
df_sg2 = pd.read_csv(os.path.join(dataset_path, 'final_labels_SG2.csv'), sep=';')
df_mbic = pd.read_csv(os.path.join(dataset_path, 'final_labels_MBIC.csv'), sep=';')

df_lex = pd.read_excel(os.path.join(dataset_path, 'bias_word_lexicon.xlsx'))

df_bias = pd.read_csv(os.path.join(dataset_path, 'news_headlines_usa_biased.csv'))
df_neutral = pd.read_csv(os.path.join(dataset_path, 'news_headlines_usa_neutral.csv'))

────────────────────────────────────────────────────────────────────────────

This code loads a collection of datasets that will be used to train and evaluate a model for detecting bias in text.

**First**, it sets up the file path to the folder where all the data is stored. The path points one level above the current working directory into a directory called `datasets`. This keeps the project organized, but because the path is relative, the code only finds the files when it is run from a folder that sits next to `datasets`.
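
Because a relative path is resolved against the current working directory, a small helper can make the lookup less fragile. The sketch below is an assumption and not part of the notebook: `find_datasets_dir` is a hypothetical function that walks upward from a starting folder until it finds a `datasets` directory.

```python
from pathlib import Path
from typing import Optional


def find_datasets_dir(start: Optional[Path] = None, name: str = "datasets") -> Path:
    """Return the nearest directory called `name` at or above `start`.

    Hypothetical helper: the notebook itself simply uses os.pardir.
    """
    current = (start or Path.cwd()).resolve()
    # Check the starting folder first, then each parent in turn.
    for folder in [current, *current.parents]:
        candidate = folder / name
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"No '{name}' directory found at or above {current}")
```

With this in place, `dataset_path = find_datasets_dir()` could stand in for the `os.path.join(os.pardir, 'datasets')` line, and a missing folder would fail early with a clear message.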

**Next**, it loads three labeled datasets named `SG1`, `SG2`, and `MBIC`. These files contain sentences that have already been marked as either biased or non-biased. They use semicolons instead of commas to separate the data, which is why the `sep=';'` setting is used. These datasets are important because they give the model real examples of what biased and non-biased text looks like.

**Then**, it reads in a file called `bias_word_lexicon.xlsx`, which is an Excel file containing a list of words commonly linked to bias. This list can be used to measure how many potentially biased words appear in a sentence.

**Finally**, it loads two more datasets: one containing biased news headlines and one containing neutral ones. These shorter texts can help the model recognize how bias appears even in small snippets of text.

Altogether, this step is about preparing all the raw data the model will need — including examples, labels, and word lists — so that the rest of the project can run smoothly.

────────────────────────────────────────────────────────────────────────────

## Getting Data Merged:

In [172]:
df_sg1.head()

Unnamed: 0,text,news_link,outlet,topic,type,label_bias,label_opinion,biased_words
0,The Republican president assumed he was helping the industry at the expense of the environment – a trade-off Trump was happy to make since he rejects climate science anyway.,http://www.msnbc.com/rachel-maddow-show/auto-industry-trump-youre-going-the-wrong-way-emissions,msnbc,environment,left,Biased,Expresses writer’s opinion,[]
1,"Though the indictment of a woman for her own pregnancy loss is unusual in Alabama, it is not unusual for prosecutors to charge people with murder even if they never killed anyone.",https://eu.usatoday.com/story/news/nation/2019/06/28/alabama-prosecute-marshae-jones-pregnant-woman-who-shot/1600459001/,usa-today,abortion,center,Non-biased,Somewhat factual but also opinionated,[]
2,Ingraham began the exchange by noting American graduates’ salaries have been suppressed by the flood of foreign graduates.,https://www.breitbart.com/economy/2020/01/12/donald-trump-we-dont-have-enough-foreign-workers/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+breitbart+%28Breitbart+News%29,breitbart,immigration,right,No agreement,No agreement,['flood']
3,"The tragedy of America’s 18 years in Afghanistan has been a stubborn refusal to admit the Afghan government is incapable of standing on its own, no matter how much money is poured into it, how much training its troops are given, or how many of its battles American soldiers fight.",http://feedproxy.google.com/~r/breitbart/~3/EReXUAKj_UQ/,breitbart,international-politics-and-world-news,right,Biased,Somewhat factual but also opinionated,"['tragedy', 'stubborn']"
4,The justices threw out a challenge from gun rights groups.,https://www.huffpost.com/entry/supreme-court-gun-rights-case_n_5ea6eb53c5b6a30004e59f35,msnbc,gun-control,left,Non-biased,Entirely factual,[]


In [173]:
df_sg1['label_bias'].value_counts()

label_bias
Non-biased      800
Biased          746
No agreement    154
Name: count, dtype: int64

────────────────────────────────────────────────────────────────────────────

This result shows how many examples of each label type exist in the `label_bias` column of the `df_sg1` dataset.

**`Non-biased` – 800 rows**  
These are sentences that were labeled as clearly *not biased*. They are likely written in a neutral or factual tone.

**`Biased` – 746 rows**  
These are sentences that were labeled as *biased*. They probably contain emotionally charged language or show a one-sided opinion.

**`No agreement` – 154 rows**  
These are sentences where the people labeling the data *could not agree* on whether the sentence was biased or not. The sentence may have been unclear, ambiguous, or too balanced to label confidently.

The value counts tell us that the dataset is fairly balanced between biased and non-biased examples, but there's a smaller group of uncertain cases that may need to be removed or handled differently when training a machine learning model.

────────────────────────────────────────────────────────────────────────────

In [174]:
import ast

def prepare_labels(df):
    # Drop rows where annotators could not agree, then encode the label as 0/1.
    df = df[df['label_bias'] != 'No agreement'].copy()
    df['label'] = df['label_bias'].map({'Biased': 1, 'Non-biased': 0})
    # biased_words is stored as a stringified list; parse it without eval().
    df['bias_word_count'] = df['biased_words'].apply(
        lambda x: len(ast.literal_eval(x)) if isinstance(x, str) else 0
    )
    return df.drop(columns='label_bias')

df_sg1 = prepare_labels(df_sg1)
df_sg2 = prepare_labels(df_sg2)
df_mbic = prepare_labels(df_mbic)

keep_cols = ['text', 'label', 'bias_word_count', 'biased_words']
df_mbic = df_mbic[keep_cols]

combined_df = pd.concat([df_sg1, df_sg2, df_mbic], ignore_index=True)

In [175]:
combined_df['label'].value_counts()

label
1    3574
0    3196
Name: count, dtype: int64

In [176]:
bias_words_set = set(df_lex.iloc[:, 0].str.lower().dropna())

combined_df['lexicon_match_count'] = combined_df['text'].apply(
    lambda x: sum(word in bias_words_set for word in str(x).lower().split())
)

In [178]:
df_bias['label'] = 1
df_neutral['label'] = 0
headline_full = pd.concat([df_bias, df_neutral], ignore_index=True)

combined_df = combined_df.merge(headline_full[['url', 'title']], left_on='news_link', right_on='url', how='left')

combined_df['combined_text'] = combined_df.apply(
    lambda row: f"{row['title']}. {row['text']}" if pd.notnull(row['title']) else row['text'],
    axis=1
)

combined_df.drop(columns=['url', 'title'], inplace=True)

In [179]:
combined_df['combined_text'] = combined_df.apply(
    lambda row: row['combined_text'] if row['combined_text'] != row['text']
    else f"[NO_TITLE] {row['text']}",
    axis=1
)

────────────────────────────────────────────────────────────────────────────
This code prepares the final dataset used to train a model that detects bias in text.

First, it **removes rows** from all three labeled datasets where the label was `"No agreement"`, since these examples are unclear.

It then **creates a new column called `label`** where:
- `"Biased"` becomes `1`
- `"Non-biased"` becomes `0`

Next, it counts how many biased words appear in each sentence using the list stored in the `'biased_words'` column, and stores that number in a new column called `bias_word_count`.
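
The `biased_words` column is read from CSV as text such as `"['tragedy', 'stubborn']"`, so it has to be parsed back into a list before counting. A minimal sketch of that step follows, using `ast.literal_eval` (which only accepts Python literals) rather than `eval`. The helper name `count_biased_words` is an assumption for illustration.

```python
import ast


def count_biased_words(cell) -> int:
    """Count entries in a stringified list such as "['tragedy', 'stubborn']".

    Non-string cells (for example NaN from pandas) count as 0,
    mirroring the isinstance check in the notebook.
    """
    if not isinstance(cell, str):
        return 0
    parsed = ast.literal_eval(cell)  # parses literals only, never runs code
    return len(parsed)


print(count_biased_words("['tragedy', 'stubborn']"))  # 2
print(count_biased_words("[]"))                       # 0
print(count_biased_words(float("nan")))               # 0
```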

For the `df_mbic` dataset, only the most important columns are kept: the sentence, label, biased word count, and the list of biased words.

All three datasets are then **combined into one**, called `combined_df`.

Then, it checks each sentence and counts how many words match the ones in the `bias_word_lexicon.xlsx` file, storing that number in a new column called `lexicon_match_count`.
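
One detail worth knowing: splitting on whitespace leaves punctuation attached, so `"tragedy,"` would not match the lexicon entry `tragedy`. The sketch below shows one possible variant that tokenizes with a regular expression first. The tiny lexicon and the function name `lexicon_match_count` are made up for illustration.

```python
import re

# Hypothetical stand-in for the words loaded from bias_word_lexicon.xlsx.
bias_words_set = {"tragedy", "stubborn", "flood"}

# Runs of letters, optionally with internal apostrophes or hyphens.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def lexicon_match_count(text: str, lexicon: set) -> int:
    """Count tokens in `text` that appear in `lexicon`, ignoring case and punctuation."""
    tokens = TOKEN_PATTERN.findall(str(text).lower())
    return sum(token in lexicon for token in tokens)


sentence = "The tragedy, critics say, was a stubborn refusal."
print(lexicon_match_count(sentence, bias_words_set))               # 2
# Whitespace splitting keeps "tragedy," with its comma and misses it:
print(sum(w in bias_words_set for w in sentence.lower().split()))  # 1
```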

Next, it labels the two headline datasets loaded earlier (`1` for biased, `0` for neutral) and concatenates them.

The code then tries to **attach a headline to each sentence** by matching the sentence's `news_link` against the headline `url`. It builds a new column called `combined_text` that holds the headline followed by the sentence. When no headline matches, the sentence is kept and prefixed with the tag `[NO_TITLE]` so the model can tell that none was found.
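
A toy version of the headline join is sketched below with made-up rows. It folds the notebook's two steps into one, and the `drop_duplicates` call is an addition not present in the notebook: a left merge repeats a sentence once per matching headline row, so deduplicating URLs first keeps the row count unchanged.

```python
import pandas as pd

# Made-up rows standing in for combined_df and headline_full.
sentences = pd.DataFrame({
    "text": ["Sentence A.", "Sentence B."],
    "news_link": ["http://example.com/a", "http://example.com/b"],
})
headlines = pd.DataFrame({
    "url": ["http://example.com/a", "http://example.com/a"],
    "title": ["Headline for A", "Headline for A"],
})

# Keep one headline per URL so the left merge cannot multiply sentence rows.
unique_headlines = headlines.drop_duplicates(subset="url")

merged = sentences.merge(
    unique_headlines, left_on="news_link", right_on="url", how="left"
)

# Prefix the headline when one matched, otherwise tag the sentence.
merged["combined_text"] = [
    f"{title}. {text}" if pd.notnull(title) else f"[NO_TITLE] {text}"
    for title, text in zip(merged["title"], merged["text"])
]
merged = merged.drop(columns=["url", "title"])

print(merged["combined_text"].tolist())
# ['Headline for A. Sentence A.', '[NO_TITLE] Sentence B.']
```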

The final result is a dataset where each row has:
- The sentence text
- A binary label
- Extra features such as the annotated biased-word count and the lexicon match count
- A headline for added context, when one was matched

────────────────────────────────────────────────────────────────────────────

In [181]:
combined_df.head()

Unnamed: 0,text,news_link,outlet,topic,type,label_opinion,biased_words,label,bias_word_count,lexicon_match_count,combined_text
0,The Republican president assumed he was helping the industry at the expense of the environment – a trade-off Trump was happy to make since he rejects climate science anyway.,http://www.msnbc.com/rachel-maddow-show/auto-industry-trump-youre-going-the-wrong-way-emissions,msnbc,environment,left,Expresses writer’s opinion,[],1,0,1,[NO_TITLE] The Republican president assumed he was helping the industry at the expense of the environment – a trade-off Trump was happy to make since he rejects climate science anyway.
1,"Though the indictment of a woman for her own pregnancy loss is unusual in Alabama, it is not unusual for prosecutors to charge people with murder even if they never killed anyone.",https://eu.usatoday.com/story/news/nation/2019/06/28/alabama-prosecute-marshae-jones-pregnant-woman-who-shot/1600459001/,usa-today,abortion,center,Somewhat factual but also opinionated,[],0,0,0,"[NO_TITLE] Though the indictment of a woman for her own pregnancy loss is unusual in Alabama, it is not unusual for prosecutors to charge people with murder even if they never killed anyone."
2,"The tragedy of America’s 18 years in Afghanistan has been a stubborn refusal to admit the Afghan government is incapable of standing on its own, no matter how much money is poured into it, how much training its troops are given, or how many of its battles American soldiers fight.",http://feedproxy.google.com/~r/breitbart/~3/EReXUAKj_UQ/,breitbart,international-politics-and-world-news,right,Somewhat factual but also opinionated,"['tragedy', 'stubborn']",1,2,0,"New York Times Publishes Op-ed by Globally Designated Terrorist Taliban Leader. The tragedy of America’s 18 years in Afghanistan has been a stubborn refusal to admit the Afghan government is incapable of standing on its own, no matter how much money is poured into it, how much training its troops are given, or how many of its battles American soldiers fight."
3,The justices threw out a challenge from gun rights groups.,https://www.huffpost.com/entry/supreme-court-gun-rights-case_n_5ea6eb53c5b6a30004e59f35,msnbc,gun-control,left,Entirely factual,[],0,0,0,[NO_TITLE] The justices threw out a challenge from gun rights groups.
4,A review of his posts in online message boards revealed he wanted to plant neo-Nazi propaganda inside Nevada middle and high schools.,https://eu.usatoday.com/story/news/nation/2020/02/10/white-supremacist-las-vegas-guilty-plea-conor-climo/4717616002/,usa-today,white-nationalism,center,Entirely factual,['plant'],1,1,1,[NO_TITLE] A review of his posts in online message boards revealed he wanted to plant neo-Nazi propaganda inside Nevada middle and high schools.


## Getting Merged Data Ready for Modeling:

In [183]:
combined_df.head(1)

Unnamed: 0,text,news_link,outlet,topic,type,label_opinion,biased_words,label,bias_word_count,lexicon_match_count,combined_text
0,The Republican president assumed he was helping the industry at the expense of the environment – a trade-off Trump was happy to make since he rejects climate science anyway.,http://www.msnbc.com/rachel-maddow-show/auto-industry-trump-youre-going-the-wrong-way-emissions,msnbc,environment,left,Expresses writer’s opinion,[],1,0,1,[NO_TITLE] The Republican president assumed he was helping the industry at the expense of the environment – a trade-off Trump was happy to make since he rejects climate science anyway.


In [182]:
X = combined_df['combined_text']
y = combined_df['label']

In [None]:
text_feature = 'combined_text'
categorical_features = ['outlet', 'topic', 'type', 'label_opinion']
numeric_features = ['bias_word_count', 'lexicon_match_count']

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2), stop_words='english')
classifier = LogisticRegression(max_iter=1000, penalty='l2', C=1.0, solver='liblinear')

preprocessor = ColumnTransformer(transformers=[
    ('text', vectorizer, text_feature),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
    ('num', StandardScaler(), numeric_features)
])

pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('classifier', classifier)
])

X = combined_df[[text_feature] + categorical_features + numeric_features]
y = combined_df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))

              precision    recall  f1-score   support

           0       0.87      0.89      0.88       959
           1       0.90      0.88      0.89      1072

    accuracy                           0.89      2031
   macro avg       0.89      0.89      0.89      2031
weighted avg       0.89      0.89      0.89      2031

ROC AUC: 0.9508038535165673


────────────────────────────────────────────────────────────────────────────

This is the initial evaluation of the model's performance using a test set of 2,031 examples.

The model is trying to predict whether a sentence is **biased (`1`)** or **non-biased (`0`)**.

**For label 0 (non-biased):**
- Precision: 0.87 → When the model predicts non-biased, it is correct 87% of the time.
- Recall: 0.89 → It correctly finds 89% of the actual non-biased sentences.
- F1-score: 0.88 → A balanced measure of both precision and recall.

**For label 1 (biased):**
- Precision: 0.90 → When the model predicts bias, it is correct 90% of the time.
- Recall: 0.88 → It correctly detects 88% of the truly biased sentences.
- F1-score: 0.89 → Again, a strong balance.

**Overall accuracy**: 89% of all predictions were correct.

**ROC AUC: 0.9508**  
This is a measure of how well the model separates the two classes. A perfect model scores 1.0, and random guessing is 0.5.  
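A score of **0.95** means the model separates biased and non-biased sentences well on this test set. Note that two of its inputs, `label_opinion` and `bias_word_count`, come from the same human annotations as the label, so this score may overstate how the model performs on text that lacks those annotations.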
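
To make the precision, recall, and F1 definitions above concrete, the sketch below computes them from raw counts. The counts are invented for illustration and are not the notebook's confusion matrix.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts for the "biased" class:
# 90 sentences correctly flagged, 10 flagged by mistake, 30 missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.9 0.75 0.82
```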

────────────────────────────────────────────────────────────────────────────

In [202]:
vectorizer = TfidfVectorizer()
classifier = LogisticRegression(max_iter=1000, solver='liblinear')

preprocessor = ColumnTransformer(transformers=[
    ('text', vectorizer, text_feature),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
    ('num', StandardScaler(), numeric_features)
])

pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('classifier', classifier)
])

param_grid = {
    'preprocessing__text__max_features': [5000, 10000, 20000],
    'preprocessing__text__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'classifier__C': [0.01, 0.1, 1.0, 10.0, 100.0]
}

X = combined_df[[text_feature] + categorical_features + numeric_features]
y = combined_df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

grid_search = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best AUC:", grid_search.best_score_)
print("Best Params:", grid_search.best_params_)

y_proba = grid_search.predict_proba(X_test)[:, 1]
print("Final Test ROC AUC:", roc_auc_score(y_test, y_proba))

Best AUC: 0.9527796871573021
Best Params: {'classifier__C': 1.0, 'preprocessing__text__max_features': 20000, 'preprocessing__text__ngram_range': (1, 3)}
Final Test ROC AUC: 0.9478030576622126


## Data Scraping:

In [203]:
%pip install wikipedia-api

Collecting wikipedia-api
  Using cached Wikipedia_API-0.8.1-py3-none-any.whl
Installing collected packages: wikipedia-api
Successfully installed wikipedia-api-0.8.1

[notice] A new release of pip is available: 24.3.1 -> 25.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [206]:
import wikipediaapi
from nltk.tokenize import sent_tokenize

wiki = wikipediaapi.Wikipedia(
    language='en',
    user_agent='BiasDetectionProject/1.0 (betaknight@yourdomain.com)'
)

def fetch_article(title):
    page = wiki.page(title)
    if page.exists():
        return page.text
    else:
        raise ValueError(f"Article '{title}' not found.")

────────────────────────────────────────────────────────────────────────────

This code lets us pull text from Wikipedia.

It sets up a connection to Wikipedia using English and a custom user agent.  
The `fetch_article` function takes an article title, grabs the page, and returns its full text if it exists. If not, it raises a `ValueError`.

────────────────────────────────────────────────────────────────────────────

In [253]:
def predict_bias_from_article(title, model):
    article_text = fetch_article(title)
    sentences = sent_tokenize(article_text)

    temp_df = pd.DataFrame({'combined_text': sentences})

    temp_df['bias_word_count'] = 0
    temp_df['lexicon_match_count'] = 0
    temp_df['outlet'] = 'usa-today'
    temp_df['topic'] = 'politics'
    temp_df['type'] = 'center'
    temp_df['label_opinion'] = 'Somewhat factual but also opinionated'

    # Score each sentence, then apply a custom 0.33 threshold
    # instead of predict()'s default 0.5 cutoff.
    proba = model.predict_proba(temp_df)[:, 1]
    preds = (proba > 0.33).astype(int)
    bias_score = preds.sum() / len(preds)

    return {
        'bias_score': round(bias_score, 3),
        'biased_sentences': int(preds.sum()),
        'total_sentences': len(sentences),
        'sentences': sentences,
        'predictions': preds,
        'probabilities': proba
    }

────────────────────────────────────────────────────────────────────────────

This function checks how biased a Wikipedia article is.

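It fetches the article, splits it into sentences, and creates a DataFrame.  
Because Wikipedia sentences lack the training metadata, placeholder values are filled in for the outlet, topic, type, and opinion columns, and both word-count features are set to 0.  
It then scores each sentence with the model and flags it as biased when the predicted probability exceeds a custom threshold of `0.33`.  
It returns the bias score (the fraction of sentences flagged), the flagged and total sentence counts, the sentences themselves, the per-sentence predictions, and their probabilities.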
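
The thresholding step can be sketched in isolation. The probabilities below are invented, and `bias_score` is a hypothetical helper that mirrors the `proba > 0.33` comparison in the function above.

```python
def bias_score(probabilities, threshold=0.33):
    """Return (score, flagged) where score is the share of values above threshold."""
    if not probabilities:
        raise ValueError("No sentences to score.")
    flagged = sum(p > threshold for p in probabilities)
    return round(flagged / len(probabilities), 3), flagged


# Invented per-sentence probabilities of the "biased" class.
probs = [0.05, 0.31, 0.34, 0.42, 0.12, 0.60, 0.20, 0.33]

print(bias_score(probs))                 # (0.375, 3)
print(bias_score(probs, threshold=0.5))  # (0.125, 1)
```

Lowering the threshold below 0.5 flags more sentences, which trades some precision for recall.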

────────────────────────────────────────────────────────────────────────────

In [242]:
# Rebuilding based on recent best results
vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 3), stop_words='english')
classifier = LogisticRegression(max_iter=1000, solver='liblinear', C=1.0)

preprocessor = ColumnTransformer(transformers=[
    ('text', vectorizer, text_feature),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
    ('num', StandardScaler(), numeric_features)
])

pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('classifier', classifier)
])

X = combined_df[[text_feature] + categorical_features + numeric_features]
y = combined_df['label']
pipeline.fit(X, y)

────────────────────────────────────────────────────────────────────────────

This code rebuilds and trains the final machine learning pipeline.

It uses `TfidfVectorizer` to turn text into numbers, looking at phrases of up to 3 words, limiting the vocabulary to 20,000 features, and removing English stop words (a setting that was not part of the grid search).  
It one-hot encodes categories like outlet and topic, and standardizes numeric columns.  
All features are combined and passed into a logistic regression model.  
The full pipeline is then trained on the entire labeled dataset, so no rows are held out for evaluation at this stage.

────────────────────────────────────────────────────────────────────────────

In [252]:
results = predict_bias_from_article("Donald Trump", pipeline)

print(f"Bias Score: {results['bias_score']} ({results['biased_sentences']} of {results['total_sentences']} sentences)")

Bias Score: 0.113 (63 of 557 sentences)


────────────────────────────────────────────────────────────────────────────

This runs the bias prediction function on the Wikipedia article for "Donald Trump".

It prints a final score showing that **63 out of 557 sentences** were predicted as biased,  
resulting in a **bias score of 0.113**, or **11.3%** of the article.

────────────────────────────────────────────────────────────────────────────

In [254]:
for sent, prob in zip(results['sentences'], results['probabilities']):
    if prob > 0.3:
        print(f"⚠️ {round(prob, 3)}: {sent}")

⚠️ 0.415: Born into a wealthy family in the New York City borough of Queens, Trump graduated from the University of Pennsylvania in 1968 with a bachelor's degree in economics.
⚠️ 0.349: Presenting himself as a political outsider, Trump won the 2016 presidential election against the Democratic Party's nominee, Hillary Clinton.
⚠️ 0.312: Trump was impeached in 2019 for abuse of power and obstruction of Congress, and in 2021 for incitement of insurrection; the Senate acquitted him both times.
⚠️ 0.314: Trump is the central figure of Trumpism, and his faction is dominant within the Republican Party.
⚠️ 0.351: Many of his comments and actions have been characterized as racist or misogynistic,  and he has made false and misleading statements and promoted conspiracy theories to a degree unprecedented in American politics.
⚠️ 0.308: High-profile cases have underscored Trump's broad interpretation of a unitary executive theory of power, and led to significant conflicts with the federal courts.


────────────────────────────────────────────────────────────────────────────

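These are sentences from the "Donald Trump" Wikipedia article with a bias probability above **0.3**, the cutoff used in the loop above. That is slightly lower than the **0.33** threshold inside `predict_bias_from_article`, so lines scoring between 0.30 and 0.33 (such as 0.312 and 0.308) would not count as biased in the bias score.

Each line shows the **bias probability score** followed by the sentence.  
The scores shown fall between roughly **0.31 and 0.42**, which means the sentences aren't obviously biased, but may contain **framing, emotionally charged words, or subtle implications** the model picked up on.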

Examples include:
- Highlighting **wealth and privilege** in early life
- Mentioning **bankruptcies** and **legal troubles**
- Using phrases like `"racist or misogynistic"` or `"promoted conspiracy theories"`

This suggests the model can pick up on **subtle linguistic bias**, even in an article that follows an encyclopedic style.

────────────────────────────────────────────────────────────────────────────

## Creating Datasets:

In [255]:
def process_wikipedia_articles(titles, model, output_file="../scraped_data/wiki_bias_predictions.csv"):
    all_data = []

    for title in titles:
        try:
            text = fetch_article(title)
            sentences = sent_tokenize(text)

            temp_df = pd.DataFrame({'combined_text': sentences})
            temp_df['bias_word_count'] = 0
            temp_df['lexicon_match_count'] = 0
            temp_df['outlet'] = 'usa-today'
            temp_df['topic'] = 'politics'
            temp_df['type'] = 'center'
            temp_df['label_opinion'] = 'Somewhat factual but also opinionated'

            preds = model.predict(temp_df)
            proba = model.predict_proba(temp_df)[:, 1]

            temp_df['bias_prediction'] = preds
            temp_df['bias_probability'] = proba
            temp_df['article_title'] = title
            temp_df['sentence_index'] = temp_df.index

            all_data.append(temp_df)

        except Exception as e:
            print(f"error: {title} — {e}")

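    if not all_data:
        raise ValueError("No articles could be processed.")

    final_df = pd.concat(all_data, ignore_index=True)
    # Create the output folder if it does not exist yet.
    os.makedirs(os.path.dirname(output_file) or ".", exist_ok=True)
    final_df.to_csv(output_file, index=False)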

────────────────────────────────────────────────────────────────────────────

This function takes in a list of Wikipedia article titles and runs **bias prediction** on every sentence from every article.

For each title:
- It fetches the article and splits it into sentences.
- A temporary dataset is created with default metadata (like outlet and topic) to match what the model expects.
- The model predicts bias for each sentence (using `predict`'s default 0.5 cutoff rather than the 0.33 threshold used earlier) and adds the result and the bias probability.
- It adds the article name and sentence index for organization.

All the sentence results from all articles are combined into one dataset and saved as a single CSV file called `wiki_bias_predictions.csv`.

────────────────────────────────────────────────────────────────────────────

In [256]:
topics = [
    "Donald Trump", "Joe Biden", "Kamala Harris", "Barack Obama", "Ron DeSantis", "Bernie Sanders", "Antifa",
    "Tea Party movement", "QAnon", "Pro-life", "Pro-choice", "Abortion in the United States", "Gun control",
    "Second Amendment", "Immigration to the United States", "Border wall", "Transgender rights", "LGBT adoption",
    "Same-sex xmarriage", "Gender identity", "Critical race theory", "Affirmative action", "Fox News", "MSNBC", "CNN",
    "Breitbart News", "The New York Times", "Israeli-Palestinian conflict", "Hamas", "Ukraine war",
    "Russian invasion of Ukraine", "NATO", "Taliban", "Evangelicalism", "Islamophobia", "Christian nationalism",
    "Religious freedom in the United States", "Climate change", "COVID-19 pandemic", "Vaccine hesitancy",
    "Misinformation", "Flat Earth", "Creationism", "Police brutality", "Black Lives Matter", "Stop and frisk",
    "War on drugs"
]

────────────────────────────────────────────────────────────────────────────

We chose a large number of topics to ensure that the final dataset would be rich, diverse, and well-populated.

Not every title resolves to a usable Wikipedia article: a page may not exist, may fail to load, or may lack usable sentence structure.  
By including a wide mix of political, social, scientific, and controversial topics, we increase the chances that most will work.  
Even if some articles are skipped or throw errors, we should still end up with a **good-sized dataset** for bias analysis.
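
One way to catch bad titles before a long run is to check them up front. In the sketch below, `split_titles` is a hypothetical helper and `exists` is any callable that reports whether a title resolves. With the `wiki` object defined earlier, `lambda t: wiki.page(t).exists()` would be a natural choice; a stub is used here so the example runs offline.

```python
from typing import Callable, Iterable, List, Tuple


def split_titles(
    titles: Iterable[str], exists: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Split titles into (found, missing), dropping exact duplicates."""
    found, missing = [], []
    for title in dict.fromkeys(titles):  # preserves order, removes repeats
        (found if exists(title) else missing).append(title)
    return found, missing


# Offline stub standing in for a real Wikipedia lookup.
known_pages = {"NATO", "Climate change"}
found, missing = split_titles(
    ["NATO", "Climate change", "Climte change", "NATO"],
    exists=lambda title: title in known_pages,
)
print(found)    # ['NATO', 'Climate change']
print(missing)  # ['Climte change']
```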

────────────────────────────────────────────────────────────────────────────

In [257]:
process_wikipedia_articles(topics, pipeline)

error: Same-sex xmarriage — Article 'Same-sex xmarriage' not found.


────────────────────────────────────────────────────────────────────────────

This line runs the full bias detection process on all the topics in the `topics` list using the trained model `pipeline`.

It goes through each article, predicts bias sentence-by-sentence, and saves the results into a single CSV file.  
In this run one title failed: `Same-sex xmarriage` is a typo for `Same-sex marriage`, so that article is missing from the output.  
The result is one dataset with bias predictions across the remaining Wikipedia topics, ready for further analysis or visualization.
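
As a starting point for that analysis, the sketch below computes the share of flagged sentences per article using only the standard library. The column names are a subset of those written by `process_wikipedia_articles`; the rows are invented, and `summarize_by_article` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def summarize_by_article(csv_file):
    """Return {article_title: share of sentences with bias_prediction == 1}."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for row in csv.DictReader(csv_file):
        title = row["article_title"]
        total[title] += 1
        flagged[title] += int(row["bias_prediction"])
    return {title: round(flagged[title] / total[title], 3) for title in total}


# Invented rows using some of the columns in wiki_bias_predictions.csv.
sample = io.StringIO(
    "combined_text,bias_prediction,bias_probability,article_title,sentence_index\n"
    "First sentence.,0,0.12,NATO,0\n"
    "Second sentence.,1,0.71,NATO,1\n"
    "Only sentence.,0,0.08,Flat Earth,0\n"
)
print(summarize_by_article(sample))  # {'NATO': 0.5, 'Flat Earth': 0.0}
```

For the real file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`.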

────────────────────────────────────────────────────────────────────────────