## QTL Paper Classification using BERT and Logistic Regression (Ensemble Approach)

This project focuses on building a document classification pipeline to identify whether a scientific research paper is relevant for curation in the Animal QTLdb (Quantitative Trait Loci database). The input includes a collection of research papers in the form of titles and abstracts, and the goal is to predict a binary label:

* 1 → Relevant for QTL curation

* 0 → Not relevant

To solve this, we use an ensemble of two models:

1. TF-IDF + Logistic Regression – A simple and effective traditional machine learning model.

2. BERT (Bidirectional Encoder Representations from Transformers) – A deep learning model that captures contextual information in text.

The predictions from both models are combined and threshold-optimized to improve performance. The final result is a .csv submission file that follows Kaggle's required format. This file can then be submitted to evaluate how well the model identifies QTL-relevant papers.

This document provides a clear, step-by-step explanation of the full pipeline, including data preprocessing, model training, threshold optimization, and submission generation — written in an easy-to-understand and beginner-friendly way.

In [1]:
# pip install --upgrade accelerate

In [2]:
# pip install --upgrade transformers

In [3]:
# pip install datasets

### Import all the required libraries

The pipeline imports the libraries in the next cell. Three of them are worth a short note on what they do here.

* pandas (pd)

  * Used for data manipulation and reading structured files like .csv or .tsv.

* Dataset from datasets (Hugging Face)

  * Converts pandas DataFrame into a format compatible with Hugging Face's Trainer.

  * Allows use of .map() for tokenization and formatting into PyTorch tensors.

* softmax from scipy.special

  * Converts raw model logits into probabilities between 0 and 1.

  * Useful for interpreting model outputs in classification tasks.
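
To make the softmax step concrete, here is a minimal NumPy sketch of what `scipy.special.softmax(..., axis=1)` computes for a two-class model. The helper name `softmax_rows` and the logits are made up for illustration.

```python
import numpy as np

def softmax_rows(logits):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up logits for two papers; columns are [class 0, class 1].
logits = np.array([[2.0, 0.0],
                   [-1.0, 1.0]])
probs = softmax_rows(logits)
positive_probs = probs[:, 1]  # probability of "relevant" for each paper
print(positive_probs)
```

Each row of `probs` sums to 1, and the second column is the class-1 probability that the pipeline later averages with the Logistic Regression output.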

In [4]:
import json
import pandas as pd
import numpy as np
import re
import random
import torch
from torch.nn import CrossEntropyLoss
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import resample
from sklearn.metrics import f1_score, classification_report
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, EarlyStoppingCallback, Trainer
from scipy.special import softmax
from datasets import Dataset

### Step 1: Set Seed
The function set_seed(seed=42) improves reproducibility by fixing the seed for Python's built-in random, numpy, and torch. With the same seed, data shuffling and weight initialization start from the same random state on every run. Some GPU operations are still non-deterministic, so results can differ slightly between runs.
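
As a small illustration of what seeding does, the sketch below reseeds Python's random module and NumPy and checks that the same seed reproduces the same draws. The helper name `draw` is hypothetical, and torch is left out to keep the snippet light.

```python
import random
import numpy as np

def draw(seed):
    """Seed both generators, then draw a few numbers from each."""
    random.seed(seed)
    np.random.seed(seed)
    return [random.random() for _ in range(3)], np.random.rand(3).tolist()

first = draw(42)
second = draw(42)
other = draw(7)
print(first == second)  # same seed, same sequence
print(first == other)   # different seed, different sequence
```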

In [5]:
# ========== 1. Set Seed ==========
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

### Step 2: Clean Text
The clean_text(text) function simplifies and normalizes the input text. It converts the text to lowercase, removes URLs and email addresses, strips every character that is not a letter or whitespace (punctuation and digits), and collapses repeated whitespace. This gives both models consistent input and reduces noise that does not help classification.
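
To see the effect of these rules, the snippet below applies the same regular expressions to an invented sentence. The sample string and its URL and email address are placeholders, not data from the project.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'\S+@\S+', '', text)                # drop email addresses
    text = re.sub(r'[^a-z\s]', '', text)               # keep letters and whitespace only
    text = re.sub(r'\s+', ' ', text).strip()           # collapse whitespace
    return text

sample = "QTL mapping on BTA6 (p < 0.05); see https://example.org or mail a.b@example.org!"
cleaned = clean_text(sample)
print(cleaned)  # qtl mapping on bta p see or mail
```

Note that digits are stripped too, so an identifier such as "BTA6" becomes "bta". That is a side effect worth keeping in mind when names containing numbers carry meaning.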

In [6]:
# ========== 2. Clean Text ==========
def clean_text(text):
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

### Step 3: Load and Balance Data
The load_and_balance_data(path="QTL_text.json") function reads the JSON file containing paper metadata. It concatenates the paper's title and abstract, cleans the combined text, and extracts the label (0 or 1) from the "Category" field. Since the dataset is imbalanced, with more non-relevant papers (label 0), the function downsamples the majority class to match the number of minority samples. It then shuffles the resulting balanced dataset and returns it as a DataFrame. This helps prevent model bias toward the majority class.
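
The downsampling idea can be sketched without pandas or scikit-learn. This is a simplified stand-in, not the notebook's function: the helper name `downsample_majority`, the record layout, and the toy data are assumptions, and unlike the notebook's code it picks whichever class is larger as the majority.

```python
import random

def downsample_majority(records, label_key="label", seed=42):
    """Return a shuffled list with the majority class cut down to the minority size."""
    rng = random.Random(seed)
    negatives = [r for r in records if r[label_key] == 0]
    positives = [r for r in records if r[label_key] == 1]
    if len(negatives) >= len(positives):
        majority, minority = negatives, positives
    else:
        majority, minority = positives, negatives
    kept = rng.sample(majority, k=len(minority))  # sampling without replacement
    balanced = kept + minority
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: 8 non-relevant papers and 3 relevant ones.
records = [{"id": i, "label": 0} for i in range(8)] + [{"id": 100 + i, "label": 1} for i in range(3)]
balanced = downsample_majority(records)
counts = {label: sum(r["label"] == label for r in balanced) for label in (0, 1)}
print(counts)  # {0: 3, 1: 3}
```

The trade-off is that the discarded majority examples are never seen by either model.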

In [7]:
# ========== 3. Load + Balance Data ==========
def load_and_balance_data(path="QTL_text.json"):
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    df = pd.DataFrame(data)
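    # Treat a missing title or abstract as empty text so clean_text always receives a string
    df['text'] = (df['Title'].fillna('') + ' ' + df['Abstract'].fillna('')).apply(clean_text)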
    df['label'] = df['Category'].astype(int)

    df_majority = df[df.label == 0]
    df_minority = df[df.label == 1]
    df_majority_downsampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=42)
    df_balanced = pd.concat([df_majority_downsampled, df_minority]).sample(frac=1, random_state=42)
    return df_balanced

### Step 4: Train Logistic Regression
This function trains a traditional machine learning classifier using TF-IDF and Logistic Regression:

* TF-IDF Vectorizer
  * max_features=10000: Keeps only the 10,000 most frequent terms across the corpus, which limits overfitting and computation time.

  * ngram_range=(1,2): Includes both unigrams and bigrams (single words and word pairs), which helps capture word context.

  * stop_words='english': Removes common English words like “the”, “and”, “is” that don’t add much meaning.

* Logistic Regression

  * class_weight='balanced': Weights each class inversely to its frequency in the training data, so errors on the rarer class count more.

  * max_iter=1000: Raises the solver's iteration limit above the default of 100, giving it room to converge on the high-dimensional TF-IDF features.

Returns the trained Logistic Regression model and the fitted TF-IDF vectorizer.

In [8]:
# ========== 4. Train Logistic Regression ==========
def train_logistic_regression(X_train, y_train):
    tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 2), stop_words='english')
    X_train_tfidf = tfidf.fit_transform(X_train)
    model = LogisticRegression(class_weight='balanced', max_iter=1000)
    model.fit(X_train_tfidf, y_train)
    return model, tfidf

### Step 5: Prepare BERT Dataset
This function prepares the data for fine-tuning BERT:

* A tokenizer from the Hugging Face model "bert-base-uncased" is used to convert text into token IDs and attention masks.

* The datasets are mapped using a tokenize_fn, which:

  * padding="max_length": Pads all sequences to the same length for batch processing.

  * truncation=True: Truncates texts longer than 256 tokens.

  * max_length=256: Keeps memory usage low while retaining sufficient context.

The function returns training and validation datasets as PyTorch tensors, along with the tokenizer.
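
The effect of padding="max_length" and truncation=True can be sketched on plain lists of token IDs. This is a simplified illustration rather than the tokenizer's implementation: the helper `pad_or_truncate` and the token IDs are made up, and a real BERT tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop the tail
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill with pad_id
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)  # [101, 7592, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 1, 2, 3, 4] [1, 1, 1, 1, 1]
```

The attention mask is what lets the model ignore the padded positions.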

In [9]:
# ========== 5. Prepare BERT Dataset ==========
def create_bert_datasets(X_train, y_train, X_val, y_val):
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    train_df = pd.DataFrame({"text": X_train, "label": y_train})
    val_df = pd.DataFrame({"text": X_val, "label": y_val})

    def tokenize_fn(example):
        return tokenizer(example["text"], padding="max_length", truncation=True, max_length=256)

    train_ds = Dataset.from_pandas(train_df).map(tokenize_fn, batched=True)
    val_ds = Dataset.from_pandas(val_df).map(tokenize_fn, batched=True)
    train_ds.set_format("torch", columns=["input_ids", "attention_mask", "label"])
    val_ds.set_format("torch", columns=["input_ids", "attention_mask", "label"])
    return train_ds, val_ds, tokenizer

### Step 6: Custom Trainer with Class Weights
The WeightedTrainer class extends Hugging Face’s Trainer to apply class weights during BERT training:

* Overrides compute_loss() to apply a weighted cross-entropy loss, helping the model handle class imbalance.

* self.weights.to(model.device) ensures the weights are placed on the same device (CPU or GPU) as the model for compatibility.

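Because Step 3 already balances the classes, the weights computed in Step 7 come out nearly equal, and the weighted loss then behaves almost like the standard cross-entropy. The override matters mainly when the pipeline is run on data that has not been balanced.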
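
For intuition, here is a dependency-free sketch of the weighted cross-entropy. It follows the behavior of PyTorch's CrossEntropyLoss with a weight tensor and the default mean reduction, where the weighted per-example losses are divided by the sum of the weights used. The function name, logits, labels, and weights are invented for illustration.

```python
import math

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-example cross-entropy, normalized by the summed weights."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, label in zip(logits, labels):
        row_max = max(row)
        log_norm = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        nll = log_norm - row[label]  # -log softmax(row)[label]
        weighted_sum += class_weights[label] * nll
        weight_total += class_weights[label]
    return weighted_sum / weight_total

logits = [[2.0, 0.0], [0.5, 1.5], [1.0, 1.0]]
labels = [0, 1, 1]

equal = weighted_cross_entropy(logits, labels, [1.0, 1.0])
scaled = weighted_cross_entropy(logits, labels, [2.0, 2.0])
upweighted = weighted_cross_entropy(logits, labels, [1.0, 3.0])
print(round(equal, 4), round(scaled, 4), round(upweighted, 4))
```

Equal weights of any size give the same loss as no weights at all. Only the ratio between class weights changes the result, which is why the two class-1 examples dominate once class 1 is weighted three times higher.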



In [10]:
# ========== 6. Custom Trainer with Weights ==========
class WeightedTrainer(Trainer):
    def __init__(self, *args, weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.weights = weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        """Override compute_loss with support for optional arguments."""
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = CrossEntropyLoss(weight=self.weights.to(model.device))
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss

### Step 7: Train BERT Model
This function trains the BERT model with early stopping:

* Loads a pre-trained BERT model with 2 output classes.

* Calculates class weights based on training data using:

  weights = torch.tensor([total / c for c in class_counts], dtype=torch.float)

  where c is the count of each class. This gives more weight to underrepresented classes.

* Training configuration:

  * num_train_epochs=5: Train for 5 epochs unless early stopping kicks in.

  * learning_rate=2e-5: Small learning rate for fine-tuning without overwriting pretrained knowledge.

  * per_device_train_batch_size=16 and per_device_eval_batch_size=32: Batch sizes per device for training and evaluation. Evaluation can use a larger batch because no gradients are stored.

  * weight_decay=0.01: Adds regularization to prevent overfitting.

  * load_best_model_at_end=True: Reloads the checkpoint with the best validation score once training finishes.

  * metric_for_best_model="f1": Uses the validation F1 score to decide which checkpoint is best.

  * EarlyStoppingCallback: Stops training if validation F1 fails to improve for 2 consecutive evaluations (here, 2 epochs).

Returns the fine-tuned Trainer object.
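
A quick worked example of the class-weight formula, using NumPy only. The label counts are hypothetical and the helper name `inverse_frequency_weights` is an assumption. The notebook wraps the same values in a torch tensor.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """total / count for each class, as in the formula above."""
    counts = np.bincount(labels)
    return counts.sum() / counts

imbalanced = inverse_frequency_weights([0] * 800 + [1] * 200)
balanced = inverse_frequency_weights([0] * 500 + [1] * 500)

print(imbalanced.tolist())  # [1.25, 5.0]
print(balanced.tolist())    # [2.0, 2.0]
```

With a 4:1 imbalance the rare class gets four times the weight of the common one. With balanced classes the two weights are equal, so the weighting has no effect on the loss.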

In [11]:
# ========== 7. Train BERT ==========
def train_bert_model(train_ds, val_ds, tokenizer, y_train):
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    class_counts = np.bincount(y_train)
    total = class_counts.sum()
    weights = torch.tensor([total / c for c in class_counts], dtype=torch.float)

    training_args = TrainingArguments(
        output_dir="./bert_output",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        num_train_epochs=5,
        learning_rate=2e-5,
        weight_decay=0.01,
        load_best_model_at_end=True,
        logging_dir="./logs",
        metric_for_best_model="f1",
        save_total_limit=2
    )

    def compute_metrics(pred):
        preds = np.argmax(pred.predictions, axis=1)
        return {
            "accuracy": (pred.label_ids == preds).mean(),
            "f1": f1_score(pred.label_ids, preds)
        }

    trainer = WeightedTrainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
        weights=weights
    )
    trainer.train()
    return trainer
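
The patience rule behind early stopping can be sketched in a few lines. This is a simplified model of the idea, not the library's code: the helper `epochs_run` is hypothetical, and the real callback also supports a minimum-improvement threshold. The first list holds the validation F1 values from this notebook's training log; the second is invented.

```python
def epochs_run(metric_per_epoch, patience):
    """Return how many epochs run before a patience-based stop (higher metric is better)."""
    best = float("-inf")
    stale = 0  # consecutive evaluations without a new best
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value > best:
            best, stale = value, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(metric_per_epoch)

# Validation F1 per epoch, taken from the training log in this notebook.
logged_f1 = [0.931765, 0.938679, 0.948655, 0.947368, 0.952153]
print(epochs_run(logged_f1, patience=2))  # 5

# Hypothetical run where F1 stalls after epoch 2.
print(epochs_run([0.90, 0.93, 0.92, 0.93, 0.95], patience=2))  # 4
```

In the logged run the dip at epoch 4 is followed by a new best at epoch 5, so the patience counter resets and all five epochs complete.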

### Step 8: Optimize Threshold
This function finds the best threshold to classify ensemble predictions:

* Tries thresholds from 0.30 up to (but not including) 0.70 in steps of 0.01.

* For each threshold, it combines both model predictions (weighted 50/50), applies the threshold, and calculates the F1 score.

* Returns the threshold that gives the highest F1 score on validation data.

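Tuning the threshold can improve F1 over a fixed 0.5 cutoff, because the averaged probabilities of two different models are not necessarily centered on 0.5. The threshold is chosen on the same validation set that is later used for evaluation, so the reported validation scores are somewhat optimistic.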
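
The sweep can be tried on a handful of invented numbers. This is a sketch under stated assumptions: the labels and probabilities are made up, `f1_binary` is a hand-written stand-in for scikit-learn's f1_score, and the grid is built from integers to avoid floating-point drift in the step size.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Invented validation labels and class-1 probabilities from the two models.
y_val = np.array([0, 0, 1, 1, 1, 0])
probs_lr = np.array([0.21, 0.61, 0.46, 0.91, 0.41, 0.36])
probs_bert = np.array([0.10, 0.50, 0.45, 0.80, 0.70, 0.45])
ensemble = 0.5 * probs_lr + 0.5 * probs_bert

best_thresh, best_f1 = 0.5, 0.0
for thresh in np.arange(30, 70) / 100:  # 0.30, 0.31, ..., 0.69
    f1 = f1_binary(y_val, (ensemble >= thresh).astype(int))
    if f1 > best_f1:  # strict comparison keeps the lowest threshold among ties
        best_thresh, best_f1 = float(thresh), f1

f1_at_half = f1_binary(y_val, (ensemble >= 0.5).astype(int))
print(best_thresh, round(best_f1, 3), round(f1_at_half, 3))  # 0.41 0.857 0.667
```

On this toy data a cutoff of 0.41 keeps a positive example that the default 0.5 would miss. With so few points the chosen threshold is fragile, which is the same overfitting risk the real validation set carries on a smaller scale.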

In [12]:
# ========== 8. Optimize Threshold ==========
def optimize_threshold(val_probs_lr, val_probs_bert, y_val):
    best_thresh, best_f1 = 0.5, 0
    for thresh in np.arange(0.3, 0.7, 0.01):
        preds = (0.5 * val_probs_lr + 0.5 * val_probs_bert) >= thresh
        f1 = f1_score(y_val, preds.astype(int))
        if f1 > best_f1:
            best_f1, best_thresh = f1, thresh
    return best_thresh

### Step 9: Run the Full Pipeline
This function puts everything together:

* Calls all the earlier functions in sequence:

  * Sets the seed

  * Loads and balances the data

  * Splits into train and validation sets

  * Trains the Logistic Regression model

  * Trains the BERT model

  * Optimizes the threshold

* Calculates final ensemble predictions and evaluates using classification metrics:

  * classification_report: Shows precision, recall, F1 for both classes.

  * Weighted F1 Score: The per-class F1 scores averaged with each class weighted by its number of validation samples (its support).

Returns the trained BERT trainer, tokenizer, TF-IDF transformer, optimized threshold, and Logistic Regression model — all of which are required for testing and submission.

### Final Script Execution

The last lines of the script call run_full_pipeline() behind the standard main guard, so the pipeline runs when the script is executed directly but not when the file is imported as a module.

In [13]:
# ========== 9. Full Pipeline ==========
def run_full_pipeline():
    set_seed()
    df = load_and_balance_data()
    X = df['text'].tolist()
    y = df['label'].tolist()
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

    # Train TF-IDF + Logistic Regression
    lr_model, tfidf = train_logistic_regression(X_train, y_train)
    val_probs_lr = lr_model.predict_proba(tfidf.transform(X_val))[:, 1]

    # Train BERT
    train_ds, val_ds, tokenizer = create_bert_datasets(X_train, y_train, X_val, y_val)
    trainer = train_bert_model(train_ds, val_ds, tokenizer, y_train)
    val_outputs_bert = trainer.predict(val_ds)
    val_probs_bert = softmax(val_outputs_bert.predictions, axis=1)[:, 1]

    # Threshold Optimization
    best_thresh = optimize_threshold(val_probs_lr, val_probs_bert, y_val)
    ensemble_val_probs = 0.5 * val_probs_lr + 0.5 * val_probs_bert
    ensemble_val_preds = (ensemble_val_probs >= best_thresh).astype(int)

    # Evaluation
    print(f"\nBest Threshold: {best_thresh:.2f}")
    print(classification_report(y_val, ensemble_val_preds, digits=4))
    print("Weighted F1 Score:", f1_score(y_val, ensemble_val_preds, average='weighted'))

    return trainer, tokenizer, tfidf, best_thresh, lr_model


# ========== Run the Full Pipeline ==========
if __name__ == "__main__":
    trainer, tokenizer, tfidf, best_thresh, lr_model = run_full_pipeline()


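| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----|
| 1 | No log | 0.202065 | 0.92804 | 0.931765 |
| 2 | No log | 0.171688 | 0.935484 | 0.938679 |
| 3 | No log | 0.168256 | 0.947891 | 0.948655 |
| 4 | No log | 0.200081 | 0.945409 | 0.947368 |
| 5 | 0.172900 | 0.209057 | 0.950372 | 0.952153 |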



Best Threshold: 0.64
              precision    recall  f1-score   support

           0     0.9895    0.9307    0.9592       202
           1     0.9343    0.9900    0.9614       201

    accuracy                         0.9603       403
   macro avg     0.9619    0.9604    0.9603       403
weighted avg     0.9619    0.9603    0.9603       403

Weighted F1 Score: 0.9602654741905394
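
The numbers in the report above can be reproduced by hand. The confusion counts below are not printed by the notebook. They are inferred from the reported support, precision, and recall (188 of 202 class-0 papers and 199 of 201 class-1 papers classified correctly), so treat them as an assumption.

```python
# Confusion counts inferred from the report (an assumption, not notebook output).
tn, fp, fn, tp = 188, 14, 2, 199

def f1(tp_, fp_, fn_):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    return 2 * tp_ / (2 * tp_ + fp_ + fn_)

f1_pos = f1(tp, fp, fn)  # class 1 treated as the positive class
f1_neg = f1(tn, fn, fp)  # class 0 treated as the positive class
support_neg, support_pos = tn + fp, fn + tp
total = support_neg + support_pos

weighted_f1 = (support_neg * f1_neg + support_pos * f1_pos) / total
accuracy = (tp + tn) / total

print(round(f1_neg, 4), round(f1_pos, 4))  # 0.9592 0.9614
print(round(accuracy, 4))                  # 0.9603
print(round(weighted_f1, 4))               # 0.9603
```

Because the two classes have almost the same support here, the weighted F1 is nearly identical to the macro average.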


### Step 10: Prepare the Test Set

This function prepares the unlabeled test data (which needs predictions) for both the Logistic Regression model and the BERT model. It ensures the text is cleaned and formatted just like it was during training, which is essential for consistency.

* Parameters:
  * test_path: Path to the test file. Default is "test_unlabeled.tsv" — a tab-separated file.

  * tfidf: The trained TfidfVectorizer that was fitted on the training data.

  * tokenizer: The BERT tokenizer used during training (like bert-base-uncased).

* Steps and Logic:
  1. Load the Test Data:
  
    This reads the .tsv file using tab (\t) as a separator. The test file includes paper titles, abstracts, and placeholder labels.

  2. Clean the Text
    
    Combines the paper title and abstract, then cleans it using the same clean_text() function used in training. This makes the test text uniform and removes unnecessary symbols, links, or punctuation.

  3. Convert Text to TF-IDF Vectors

    Uses the previously trained tfidf vectorizer to convert the test text into the same numerical format expected by the Logistic Regression model.

  4. Tokenize for BERT

    Converts the cleaned text into a Hugging Face dataset and tokenizes it using the same tokenizer as in training. It:

    * Adds padding to ensure equal-length inputs.

    * Truncates texts longer than 256 tokens.

    * Converts everything to PyTorch tensor format so BERT can use it.

* Returns:
  * test_df: The original test metadata (including PMIDs) plus the cleaned text column.

  * X_test_tfidf: The TF-IDF features for Logistic Regression.

  * test_ds: The tokenized BERT-compatible dataset.
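
The reason for reusing the fitted vectorizer, rather than fitting a new one on the test set, can be shown with a small sketch. The toy texts are invented and the default TfidfVectorizer settings are used for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["qtl mapping cattle", "milk yield cattle"]
test_texts = ["qtl milk zebra", "zebra stripes"]  # "zebra" and "stripes" never appear in training

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary and IDF weights
X_test = vectorizer.transform(test_texts)        # reuses them; nothing is refitted

print(X_train.shape, X_test.shape)  # same number of columns
print(X_test[0].nnz, X_test[1].nnz)  # non-zero features per test document
```

The test matrix has the same columns the Logistic Regression model was trained on. Words unseen during training are ignored, so a document made up entirely of new words becomes an all-zero row.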

In [14]:
def prepare_test_set(test_path="test_unlabeled.tsv", tfidf=None, tokenizer=None):
    test_df = pd.read_csv(test_path, sep="\t")
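    # Treat a missing title or abstract as empty text so clean_text always receives a string
    test_df['text'] = (test_df['Title'].fillna('') + ' ' + test_df['Abstract'].fillna('')).apply(clean_text)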

    # TF-IDF for Logistic Regression
    X_test_tfidf = tfidf.transform(test_df['text'].tolist())

    # Tokenization for BERT
    test_raw_df = pd.DataFrame({"text": test_df['text']})
    test_ds = Dataset.from_pandas(test_raw_df).map(
        lambda x: tokenizer(x["text"], padding="max_length", truncation=True, max_length=256), batched=True
    )
    test_ds.set_format("torch", columns=["input_ids", "attention_mask"])

    return test_df, X_test_tfidf, test_ds

### Step 11: Generate and Save Submission

This function uses both trained models to predict the label for each test paper, combines their predictions, and writes the final result to a .csv file in Kaggle submission format.

* Parameters:
  * lr_model: The trained Logistic Regression model.

  * trainer: The fine-tuned BERT model (Hugging Face Trainer object).

  * tfidf: TF-IDF vectorizer used with lr_model.

  * tokenizer: BERT tokenizer used with trainer.

  * threshold: Best threshold value for final classification (calculated on validation).

  * test_path: Path to test file. Default is "test_unlabeled.tsv".

  * output_csv: File path for the output submission. Default is "submission.csv".

* Step-by-Step Explanation:
  1. Prepare Test Data

    Loads and processes test data for both Logistic Regression and BERT using the earlier function.

  2. Make Logistic Regression Predictions

    Predicts the probability of class 1 for each paper using the Logistic Regression model. Only the probability for class 1 ([:, 1]) is needed.

  3. Make BERT Predictions

    Uses the BERT trainer to predict logits, then applies softmax to convert them into probabilities. Again, we extract the probability for class 1.

  4. Combine Predictions (Ensemble)

    Averages the predictions from both models. If the averaged score is greater than or equal to the chosen threshold, the label is set to 1, otherwise 0. Averaging two different models tends to smooth out their individual errors.

  5. Create the Submission File

    Builds a DataFrame with only the PMID and predicted Label, then saves it as a .csv file without the index. This format is required for Kaggle submission.

* Output:
A .csv file with two columns — PMID and Label — that can be directly uploaded to the Kaggle competition page.
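
Before uploading, a quick sanity check of the file can catch formatting slips. The sketch below is a suggestion rather than part of the original pipeline: the helper `check_submission` and the PMIDs are hypothetical, and apart from the PMID and Label column names the specific checks are assumptions about what a valid submission looks like.

```python
import csv
import io

def check_submission(csv_text):
    """Return a list of problems found in a PMID/Label submission; empty means none found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != ["PMID", "Label"]:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["Label"] not in {"0", "1"}:
            problems.append(f"line {line_no}: label {row['Label']!r} is not 0 or 1")
        if row["PMID"] in seen:
            problems.append(f"line {line_no}: duplicate PMID {row['PMID']}")
        seen.add(row["PMID"])
    return problems

# Hypothetical PMIDs, for illustration only.
good = "PMID,Label\n1001,1\n1002,0\n"
bad = "PMID,Label\n1001,1\n1001,2\n"
print(check_submission(good))  # []
print(check_submission(bad))
```

To check a real file, the same function can be given the contents of submission.csv read as a string.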

In [15]:
def generate_submission(lr_model, trainer, tfidf, tokenizer, threshold, test_path="test_unlabeled.tsv", output_csv="submission.csv"):
    test_df, X_test_tfidf, test_ds = prepare_test_set(test_path, tfidf, tokenizer)

    # Logistic Regression predictions
    test_probs_lr = lr_model.predict_proba(X_test_tfidf)[:, 1]

    # BERT predictions
    test_outputs_bert = trainer.predict(test_ds)
    test_probs_bert = softmax(test_outputs_bert.predictions, axis=1)[:, 1]

    # Ensemble predictions
    ensemble_test_probs = 0.5 * test_probs_lr + 0.5 * test_probs_bert
    ensemble_test_preds = (ensemble_test_probs >= threshold).astype(int)

    # Create submission
    submission = pd.DataFrame({
        "PMID": test_df["PMID"],
        "Label": ensemble_test_preds
    })
    submission.to_csv(output_csv, index=False)

Finally, we call the final prediction pipeline that takes all the trained components (Logistic Regression, BERT, TF-IDF, Tokenizer, and threshold) and applies them to the unlabeled test data. It generates the final Kaggle submission file (submission.csv) by combining predictions from both models using the optimized threshold.

In [16]:
generate_submission(
    lr_model=lr_model,
    trainer=trainer,
    tfidf=tfidf,
    tokenizer=tokenizer,
    threshold=best_thresh
)
