# DEFNLP: Hidden Data Citation Extraction Pipeline

This notebook implements the complete three-phase DEFNLP methodology for extracting hidden data citations from scientific publications.

## Methodology Overview

- **Phase I**: Data Cleaning & Baseline Modeling (string matching)
- **Phase II**: SpaCy NER & BERT QA Modeling (advanced NLP)
- **Phase III**: Acronym & Abbreviation Extraction
- **Final**: Merge the predictions of all three phases into one deduplicated list; the transfer learning comes from reusing the pretrained SpaCy and BERT models

## Table of Contents
1. [Setup & Installation](#setup)
2. [Configuration](#config)
3. [Utility Functions](#utils)
4. [Phase I: Baseline Matching](#phase1)
5. [Phase II: NER & QA](#phase2)
6. [Phase III: Acronyms](#phase3)
7. [Pipeline Integration](#pipeline)
8. [Run Inference](#inference)
9. [Results Analysis](#results)

## 1. Setup & Installation <a name="setup"></a>

Install required dependencies and download models.

In [70]:
# Install dependencies (uncomment if needed)
# !pip install pandas numpy transformers torch spacy nltk tqdm scikit-learn
# !python -m spacy download en_core_web_sm

In [71]:
# Import libraries
import os
import json
import re
import pandas as pd
import numpy as np
from typing import List, Dict, Set, Tuple
from tqdm.auto import tqdm

# NLP libraries
import spacy
from transformers import pipeline
import torch

# NLTK for stopwords
import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords

print("✓ All libraries imported successfully")

✓ All libraries imported successfully


## 2. Configuration <a name="config"></a>

Define all hyperparameters and settings.

In [72]:
# ============================================================================
# FILE PATHS
# ============================================================================
# Base directory for all data files: the current working directory,
# which is the notebook's folder only if Jupyter was started from there
NOTEBOOK_DIR = os.getcwd()

# Use absolute paths to avoid FileNotFoundError
TRAIN_CSV = os.path.join(NOTEBOOK_DIR, "train.csv")
TEST_CSV = os.path.join(NOTEBOOK_DIR, "test.csv")
TRAIN_JSON_DIR = os.path.join(NOTEBOOK_DIR, "train")
TEST_JSON_DIR = os.path.join(NOTEBOOK_DIR, "test")
OUTPUT_DIR = os.path.join(NOTEBOOK_DIR, "output")
BIG_GOV_DATASETS = os.path.join(NOTEBOOK_DIR, "big_gov_datasets.txt")  # Optional external datasets

# ============================================================================
# MODEL CONFIGURATION
# ============================================================================
QA_MODEL_NAME = "salti/bert-base-multilingual-cased-finetuned-squad"
SPACY_MODEL = "en_core_web_sm"
QA_MAX_ANSWER_LENGTH = 64
USE_GPU = torch.cuda.is_available()

# ============================================================================
# PHASE II: NER & QA SETTINGS
# ============================================================================
DATA_KEYWORDS = [
    "data", "datasource", "datasources", "dataset", "datasets",
    "database", "databases", "sample", "samples", "corpus",
    "repository", "repositories", "collection", "survey"
]

NER_ENTITY_TYPES = ["DATE", "ORG"]

QA_QUESTIONS = [
    "Which datasets are used?",
    "Which data sources are used?",
    "What datasets were analyzed?",
    "Which databases are mentioned?",
    "What data was collected?"
]

CHUNK_SIZE = 3  # Sentences per chunk
CHUNK_OVERLAP = 1  # Sentences shared by consecutive chunks (must be < CHUNK_SIZE)

# ============================================================================
# PHASE III: ACRONYM SETTINGS
# ============================================================================
MIN_ACRONYM_LENGTH = 2
MAX_ACRONYM_LENGTH = 10

# ============================================================================
# OUTPUT SETTINGS
# ============================================================================
PREDICTION_SEPARATOR = " | "
MIN_CONFIDENCE = 0.0  # QA answers scoring below this are dropped; 0.0 keeps every answer

print(f"✓ Configuration loaded")
print(f"  GPU Available: {USE_GPU}")
print(f"  QA Model: {QA_MODEL_NAME}")
print(f"  SpaCy Model: {SPACY_MODEL}")

✓ Configuration loaded
  GPU Available: True
  QA Model: salti/bert-base-multilingual-cased-finetuned-squad
  SpaCy Model: en_core_web_sm


## 3. Utility Functions <a name="utils"></a>

Helper functions for text processing and data manipulation.

In [73]:
def load_json_publications(json_dir: str, pub_ids: List[str] = None) -> Dict[str, str]:
    """Load publication JSON files and merge all text content."""
    pub_texts = {}
    
    if not os.path.exists(json_dir):
        print(f"Warning: Directory {json_dir} does not exist")
        return pub_texts
    
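    json_files = [f for f in os.listdir(json_dir) if f.endswith('.json')]
    # Set membership keeps the per-file filter fast for long ID lists
    wanted_ids = set(pub_ids) if pub_ids is not None else None
    
    for json_file in tqdm(json_files, desc=f"Loading {json_dir}"):
        # Strip only the trailing extension, not every '.json' in the name
        pub_id = os.path.splitext(json_file)[0]
        
        if wanted_ids is not None and pub_id not in wanted_ids:
            continue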
        
        file_path = os.path.join(json_dir, json_file)
        
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            
            text_parts = []
            for section in data:
                if 'text' in section and section['text']:
                    text_parts.append(section['text'])
            
            pub_texts[pub_id] = ' '.join(text_parts)
        except Exception as e:
            print(f"Error loading {json_file}: {e}")
    
    return pub_texts


def clean_text(text: str) -> str:
    """Clean text by removing special characters, emojis, and extra spaces."""
    if not isinstance(text, str):
        return ""
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove emojis
    emoji_pattern = re.compile(
        "["
        u"\U0001F600-\U0001F64F"
        u"\U0001F300-\U0001F5FF"
        u"\U0001F680-\U0001F6FF"
        u"\U0001F1E0-\U0001F1FF"
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE
    )
    text = emoji_pattern.sub(r'', text)
    
    # Remove special characters
    text = re.sub(r'[^a-z0-9\s\.\,\-\(\)]', ' ', text)
    
    # Remove multiple spaces
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()


def remove_stopwords(text: str, stopwords_set: Set[str]) -> str:
    """Remove stopwords from text."""
    words = text.split()
    filtered_words = [w for w in words if w not in stopwords_set]
    return ' '.join(filtered_words)


def extract_sentences_with_keywords(text: str, keywords: List[str]) -> List[str]:
    """Extract sentences containing specific keywords."""
    sentences = re.split(r'[.!?]+', text)
    matching_sentences = []
    keywords_lower = [k.lower() for k in keywords]
    
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        
        sentence_lower = sentence.lower()
        if any(keyword in sentence_lower for keyword in keywords_lower):
            matching_sentences.append(sentence)
    
    return matching_sentences


def chunk_sentences(sentences: List[str], chunk_size: int = 3, overlap: int = 1) -> List[str]:
    """Chunk sentences into overlapping groups."""
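    chunks = []
    
    # The window advances by chunk_size - overlap; a non-positive step
    # would make range() fail or silently yield nothing
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    
    for i in range(0, len(sentences), step):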
        chunk = sentences[i:i + chunk_size]
        if chunk:
            chunks.append(' '.join(chunk))
        
        if i + chunk_size >= len(sentences):
            break
    
    return chunks


def format_prediction_string(datasets: Set[str]) -> str:
    """Format a set of datasets into a prediction string."""
    if not datasets:
        return ""
    
    cleaned = [d.strip() for d in datasets if d and d.strip()]
    unique_sorted = sorted(list(set(cleaned)))
    
    return PREDICTION_SEPARATOR.join(unique_sorted)


def merge_prediction_strings(predictions: List[str]) -> str:
    """Merge multiple prediction strings, remove duplicates, and sort."""
    all_preds = set()
    
    for pred in predictions:
        if isinstance(pred, str) and pred:
            parts = [p.strip() for p in pred.split('|')]
            all_preds.update([p for p in parts if p])
    
    sorted_preds = sorted(list(all_preds))
    return PREDICTION_SEPARATOR.join(sorted_preds)


print("✓ Utility functions defined")

✓ Utility functions defined
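
To make the overlap behaviour of `chunk_sentences` concrete, here is a small standalone sketch. The helper name `sliding_chunks` and the sample sentences are illustrative and not part of the notebook; the logic mirrors the window-and-stride loop above, where the stride is `chunk_size - overlap`.

```python
from typing import List


def sliding_chunks(sentences: List[str], chunk_size: int = 3, overlap: int = 1) -> List[str]:
    """Illustrative restatement of the overlapping window used above."""
    step = chunk_size - overlap  # how far the window advances each time
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(' '.join(sentences[start:start + chunk_size]))
        # Stop once the window has reached the last sentence
        if start + chunk_size >= len(sentences):
            break
    return chunks


sentences = ["s1", "s2", "s3", "s4", "s5"]
print(sliding_chunks(sentences))  # ['s1 s2 s3', 's3 s4 s5']
```

With five sentences, a chunk size of 3, and an overlap of 1, the second chunk starts at the third sentence, so `s3` appears in both chunks and gives the QA model context on either side of the chunk boundary.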


## 4. Phase I: Data Cleaning & Baseline Matching <a name="phase1"></a>

Baseline string matching for initial dataset identification.

In [74]:
class PhaseIBaseline:
    """Phase I: Data cleaning and baseline matching."""
    
    def __init__(self):
        self.stopwords = set(stopwords.words('english'))
        self.external_datasets = self._load_external_datasets()
    
    def _load_external_datasets(self) -> Set[str]:
        """Load external dataset names from file."""
        datasets = set()
        
        if not os.path.exists(BIG_GOV_DATASETS):
            print(f"Note: External datasets file {BIG_GOV_DATASETS} not found (optional)")
            return datasets
        
        try:
            with open(BIG_GOV_DATASETS, 'r', encoding='utf-8') as f:
                for line in f:
                    dataset = line.strip().lower()
                    if dataset:
                        datasets.add(dataset)
            print(f"Loaded {len(datasets)} external datasets")
        except Exception as e:
            print(f"Error loading external datasets: {e}")
        
        return datasets
    
    def create_internal_labels(self, train_df: pd.DataFrame) -> Set[str]:
        """Create set of internal dataset labels from training data."""
        labels = set()
        label_columns = ['dataset_title', 'dataset_label', 'cleaned_label']
        
        for col in label_columns:
            if col in train_df.columns:
                values = train_df[col].dropna().unique()
                for val in values:
                    if isinstance(val, str):
                        cleaned = clean_text(val)
                        if cleaned:
                            labels.add(cleaned)
        
        print(f"Created {len(labels)} internal dataset labels")
        return labels
    
    def process(self, df: pd.DataFrame, json_dir: str, train_df: pd.DataFrame = None) -> pd.DataFrame:
        """Run complete Phase I pipeline."""
        print("\n" + "="*60)
        print("PHASE I: DATA CLEANING & BASELINE MODELING")
        print("="*60)
        
        # Load and merge text
        pub_ids = df['Id'].unique().tolist()
        pub_texts = load_json_publications(json_dir, pub_ids)
        df['text'] = df['Id'].map(pub_texts).fillna('')
        
        # Clean text
        print("Cleaning text...")
        df['cleaned_text'] = df['text'].apply(clean_text)
        df['cleaned_text'] = df['cleaned_text'].apply(
            lambda x: remove_stopwords(x, self.stopwords)
        )
        
        # Create internal labels
        internal_labels = set()
        if train_df is not None:
            internal_labels = self.create_internal_labels(train_df)
        
        # Perform matching
        print("Performing baseline matching...")
        phase1_predictions = []
        
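        for idx, row in tqdm(df.iterrows(), total=len(df), desc="Phase I"):
            # Match against cleaned text that still contains stopwords: the
            # labels keep theirs, so stopword-stripped text would never match
            # a title such as "survey of doctorate recipients"
            text = clean_text(row['text'])
            matches = set()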
            
            # Internal matching
            for label in internal_labels:
                if label in text:
                    matches.add(label)
            
            # External matching
            for dataset in self.external_datasets:
                if dataset in text:
                    matches.add(dataset)
            
            phase1_predictions.append(format_prediction_string(matches))
        
        df['phase1_predictions'] = phase1_predictions
        print(f"✓ Phase I complete")
        return df


print("✓ Phase I class defined")

✓ Phase I class defined
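
The `label in text` test used in Phase I is a plain substring check, so a short label can also fire inside a longer word. One possible refinement, shown here only as a sketch, is to require word boundaries around each label. The function name `match_whole_labels` and the sample text are hypothetical and not part of the notebook.

```python
import re
from typing import Iterable, Set


def match_whole_labels(text: str, labels: Iterable[str]) -> Set[str]:
    """Return the labels that occur in text as whole words or phrases."""
    found = set()
    for label in labels:
        # (?<!\w) and (?!\w) forbid a word character directly
        # before or after the label; re.escape keeps '.', '(' etc. literal
        pattern = r'(?<!\w)' + re.escape(label) + r'(?!\w)'
        if re.search(pattern, text):
            found.add(label)
    return found


text = "the adni cohort used these services"
print(match_whole_labels(text, ["adni", "ice"]))  # {'adni'}
```

A plain substring test would also report `ice` here, because it occurs inside `services`. The trade-off is speed: one regex search per label is slower than `in`, which may matter with many labels and long texts.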


## 5. Phase II: SpaCy NER & BERT QA <a name="phase2"></a>

Advanced NLP using Named Entity Recognition and Question Answering.

In [75]:
class PhaseIINER_QA:
    """Phase II: Named Entity Recognition and Question Answering."""
    
    def __init__(self):
        device = 0 if USE_GPU else -1
        
        # Load SpaCy model
        print("Loading SpaCy model...")
        self.nlp = spacy.load(SPACY_MODEL)
        
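        # Load BERT QA model. Import under an alias because the inference
        # section rebinds the global name `pipeline` to a DEFNLPPipeline
        # instance, which would break any later re-instantiation of this class.
        from transformers import pipeline as hf_pipeline
        print("Loading BERT QA model...")
        self.qa_pipeline = hf_pipeline(
            "question-answering",
            model=QA_MODEL_NAME,
            tokenizer=QA_MODEL_NAME,
            device=device
        )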
        
        print("✓ Phase II models loaded")
    
    def extract_ner_entities(self, text: str) -> Set[str]:
        """Extract named entities using SpaCy."""
        entities = set()
        doc = self.nlp(text)
        
        for ent in doc.ents:
            if ent.label_ in NER_ENTITY_TYPES:
                entity_text = clean_text(ent.text)
                if entity_text:
                    entities.add(entity_text)
        
        return entities
    
    def qa_extraction(self, chunks: List[str]) -> Set[str]:
        """Extract dataset mentions using BERT QA model."""
        answers = set()
        
        for question in QA_QUESTIONS:
            for chunk in chunks:
                if not chunk.strip():
                    continue
                
                try:
                    result = self.qa_pipeline(
                        question=question,
                        context=chunk,
                        max_answer_len=QA_MAX_ANSWER_LENGTH
                    )
                    
                    if result['score'] >= MIN_CONFIDENCE:
                        answer = clean_text(result['answer'])
                        if answer:
                            answers.add(answer)
                
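                # Skip chunks the model cannot process, but let
                # KeyboardInterrupt and SystemExit propagate
                except Exception:
                    continue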
        
        return answers
    
    def process_single_text(self, text: str) -> Set[str]:
        """Process a single text through Phase II pipeline."""
        all_extractions = set()
        
        # Extract sentences with keywords
        keyword_sentences = extract_sentences_with_keywords(text, DATA_KEYWORDS)
        
        if not keyword_sentences:
            return all_extractions
        
        # Chunk sentences
        chunks = chunk_sentences(keyword_sentences, CHUNK_SIZE, CHUNK_OVERLAP)
        
        # NER extraction
        ner_entities = self.extract_ner_entities(text)
        all_extractions.update(ner_entities)
        
        # QA extraction
        qa_answers = self.qa_extraction(chunks)
        all_extractions.update(qa_answers)
        
        return all_extractions
    
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        """Run complete Phase II pipeline."""
        print("\n" + "="*60)
        print("PHASE II: SPACY NER & BERT QA MODELING")
        print("="*60)
        
        phase2_predictions = []
        
        for idx, row in tqdm(df.iterrows(), total=len(df), desc="Phase II"):
            text = row.get('text', '')
            extractions = self.process_single_text(text)
            phase2_predictions.append(format_prediction_string(extractions))
        
        df['phase2_predictions'] = phase2_predictions
        print(f"✓ Phase II complete")
        return df


print("✓ Phase II class defined")

✓ Phase II class defined
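
`qa_extraction` asks every question of every chunk, so the number of model calls is `len(QA_QUESTIONS) * len(chunks)`, and it keeps an answer only when its score reaches `MIN_CONFIDENCE`. The sketch below isolates that filtering logic with a stand-in for the model so it runs without downloading anything. `collect_answers` and `fake_qa` are hypothetical names, and the canned scores are invented for illustration.

```python
from typing import Any, Callable, Dict, List, Set


def collect_answers(
    qa_fn: Callable[[str, str], Dict[str, Any]],
    questions: List[str],
    chunks: List[str],
    min_score: float,
) -> Set[str]:
    """Ask every question of every chunk; keep answers scoring >= min_score."""
    answers = set()
    for question in questions:
        for chunk in chunks:
            result = qa_fn(question, chunk)
            if result["score"] >= min_score and result["answer"]:
                answers.add(result["answer"].lower())
    return answers


def fake_qa(question: str, context: str) -> Dict[str, Any]:
    """Stand-in for the BERT QA model: returns canned, made-up results."""
    if "census" in context:
        return {"score": 0.9, "answer": "Census Bureau data"}
    return {"score": 0.1, "answer": "the results"}


chunks = ["We analysed census records.", "The results were significant."]
print(collect_answers(fake_qa, ["Which datasets are used?"], chunks, min_score=0.5))
# {'census bureau data'}
```

Extractive QA models return a span for every context, even when no dataset is mentioned. With the configured `MIN_CONFIDENCE = 0.0` every such span is kept, so raising the threshold is one way to trade recall for precision.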


## 6. Phase III: Acronym & Abbreviation Extraction <a name="phase3"></a>

Extract acronyms and match with full forms.

In [76]:
class PhaseIIIAcronyms:
    """Phase III: Acronym and abbreviation extraction."""
    
    def extract_acronyms(self, text: str) -> Set[str]:
        """Extract potential acronyms from text."""
        acronyms = set()
        
        # Pattern 1: Uppercase words
        pattern1 = r'\b[A-Z]{' + str(MIN_ACRONYM_LENGTH) + ',' + str(MAX_ACRONYM_LENGTH) + r'}\b'
        matches1 = re.findall(pattern1, text)
        acronyms.update(matches1)
        
        # Pattern 2: Acronyms in parentheses (a subset of what Pattern 1 already matches)
        pattern2 = r'\(([A-Z]{' + str(MIN_ACRONYM_LENGTH) + ',' + str(MAX_ACRONYM_LENGTH) + r'})\)'
        matches2 = re.findall(pattern2, text)
        acronyms.update(matches2)
        
        return acronyms
    
    def extract_abbreviation_acronym_pairs(self, text: str) -> Dict[str, str]:
        """Extract {acronym: full form} pairs written as 'Full Form (ACRONYM)'."""
        pairs = {}
        
        pattern = r'([A-Z][a-zA-Z\s\-]+?)\s*\(([A-Z]{' + str(MIN_ACRONYM_LENGTH) + ',' + str(MAX_ACRONYM_LENGTH) + r'})\)'
        matches = re.findall(pattern, text)
        
        for full_form, acronym in matches:
            full_form = full_form.strip()
            acronym = acronym.strip()
            pairs[acronym] = full_form
        
        return pairs
    
    def process_single_text(self, text: str, previous_predictions: str) -> Set[str]:
        """Process a single text through Phase III pipeline."""
        all_extractions = set()
        
        # Extract acronyms
        acronyms = self.extract_acronyms(text)
        
        # Extract pairs
        acronym_pairs = self.extract_abbreviation_acronym_pairs(text)
        
        # Match acronyms with previous predictions
        if previous_predictions:
            prev_preds = [p.strip() for p in previous_predictions.split('|')]
            
            for acronym in acronyms:
                acronym_lower = acronym.lower()
                for pred in prev_preds:
                    if acronym_lower in pred.lower():
                        all_extractions.add(acronym_lower)
                        all_extractions.add(pred)
        
        # Create variants from pairs
        for acronym, full_form in acronym_pairs.items():
            acronym_clean = clean_text(acronym)
            full_form_clean = clean_text(full_form)
            
            all_extractions.add(acronym_clean)
            all_extractions.add(full_form_clean)
            all_extractions.add(f"{full_form_clean} {acronym_clean}")
        
        return all_extractions
    
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        """Run complete Phase III pipeline."""
        print("\n" + "="*60)
        print("PHASE III: ACRONYM & ABBREVIATION EXTRACTION")
        print("="*60)
        
        phase3_predictions = []
        
        for idx, row in tqdm(df.iterrows(), total=len(df), desc="Phase III"):
            text = row.get('text', '')
            phase2_preds = row.get('phase2_predictions', '')
            
            extractions = self.process_single_text(text, phase2_preds)
            phase3_predictions.append(format_prediction_string(extractions))
        
        df['phase3_predictions'] = phase3_predictions
        print(f"✓ Phase III complete")
        return df


print("✓ Phase III class defined")

✓ Phase III class defined
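
The pair pattern in `extract_abbreviation_acronym_pairs` begins its capture at the first capital letter from which letters, spaces, and hyphens run unbroken up to the parenthesis, so the captured full form can include the start of the sentence. The sketch below reproduces that behaviour and shows one possible way to trim the capture: keep the shortest trailing span of words whose capitalised initials spell the acronym. `trim_full_form` and `extract_trimmed_pairs` are hypothetical helpers, not part of the notebook, and the length bounds 2 and 10 are inlined from the configuration.

```python
import re
from typing import Dict

# Same shape as the notebook's pair pattern, with the length bounds inlined
PAIR_PATTERN = re.compile(r'([A-Z][a-zA-Z\s\-]+?)\s*\(([A-Z]{2,10})\)')


def trim_full_form(full_form: str, acronym: str) -> str:
    """Keep the shortest trailing word span whose capital initials spell the acronym."""
    words = full_form.split()
    # Extend the span backwards one word at a time until the initials match
    for start in range(len(words) - 1, -1, -1):
        initials = ''.join(w[0] for w in words[start:] if w[0].isupper())
        if initials == acronym:
            return ' '.join(words[start:])
    return full_form  # no match: fall back to the raw capture


def extract_trimmed_pairs(text: str) -> Dict[str, str]:
    """Map each parenthesised acronym to its trimmed full form."""
    return {
        acronym: trim_full_form(full_form.strip(), acronym)
        for full_form, acronym in PAIR_PATTERN.findall(text)
    }


text = "We use the National Education Longitudinal Study (NELS) in this paper."
print(PAIR_PATTERN.findall(text)[0][0])
# We use the National Education Longitudinal Study
print(extract_trimmed_pairs(text))
# {'NELS': 'National Education Longitudinal Study'}
```

Lowercase connecting words are skipped when building the initials, so a title such as "Survey of Doctorate Recipients (SDR)" still resolves. Acronyms that do not follow the initial-letter convention fall back to the untrimmed capture.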


## 7. Pipeline Integration <a name="pipeline"></a>

Orchestrate all phases and merge predictions.

In [77]:
class DEFNLPPipeline:
    """Main pipeline orchestrator for DEFNLP methodology."""
    
    def __init__(self):
        print("\n" + "="*60)
        print("INITIALIZING DEFNLP PIPELINE")
        print("="*60)
        
        self.phase1 = PhaseIBaseline()
        self.phase2 = PhaseIINER_QA()
        self.phase3 = PhaseIIIAcronyms()
        
        print("\n✓ Pipeline initialized successfully\n")
    
    def merge_all_predictions(self, df: pd.DataFrame) -> pd.DataFrame:
        """Merge predictions from all three phases."""
        print("\n" + "="*60)
        print("MERGING ALL PHASE PREDICTIONS")
        print("="*60)
        
        final_predictions = []
        
        for idx, row in df.iterrows():
            all_preds = [
                row.get('phase1_predictions', ''),
                row.get('phase2_predictions', ''),
                row.get('phase3_predictions', '')
            ]
            
            merged = merge_prediction_strings(all_preds)
            final_predictions.append(merged)
        
        df['PredictionString'] = final_predictions
        print(f"✓ Merged predictions for {len(df)} publications")
        return df
    
    def run_inference(self, test_csv: str = TEST_CSV, test_json_dir: str = TEST_JSON_DIR,
                     train_csv: str = TRAIN_CSV) -> pd.DataFrame:
        """Run inference on test data."""
        print("\n" + "="*60)
        print("RUNNING DEFNLP INFERENCE")
        print("="*60)
        
        # Load data
        test_df = pd.read_csv(test_csv)
        train_df = pd.read_csv(train_csv) if os.path.exists(train_csv) else None
        
        # Run phases
        test_df = self.phase1.process(test_df, test_json_dir, train_df)
        test_df = self.phase2.process(test_df)
        test_df = self.phase3.process(test_df)
        
        # Merge predictions
        test_df = self.merge_all_predictions(test_df)
        
        # Prepare output
        output_df = test_df[['Id', 'PredictionString']].copy()
        
        print("\n" + "="*60)
        print("✓ INFERENCE COMPLETE")
        print("="*60)
        
        return output_df


print("✓ Pipeline class defined")

✓ Pipeline class defined
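
The merge step is a set union over the three per-phase prediction strings, followed by an alphabetical sort. The standalone sketch below shows the effect on sample inputs. The function name `merge_predictions` and the sample labels are illustrative; the logic follows `merge_prediction_strings` from the utility section.

```python
from typing import List

SEPARATOR = " | "  # same value as PREDICTION_SEPARATOR in the configuration


def merge_predictions(prediction_strings: List[str]) -> str:
    """Union the labels of several pipe-separated strings, sorted alphabetically."""
    merged = set()
    for pred in prediction_strings:
        if isinstance(pred, str):
            # Splitting on the bare pipe tolerates missing or extra spaces
            merged.update(p.strip() for p in pred.split('|') if p.strip())
    return SEPARATOR.join(sorted(merged))


phase_outputs = ["adni | census", "census | nels", ""]
print(merge_predictions(phase_outputs))  # adni | census | nels
```

Because the union is unweighted, a label proposed by a single phase counts as much as one proposed by all three, and a label that itself contains a pipe character would be split into two.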


## 8. Run Inference <a name="inference"></a>

Execute the complete pipeline on test data.

In [78]:
# Create pipeline
pipeline = DEFNLPPipeline()


INITIALIZING DEFNLP PIPELINE
Note: External datasets file /content/big_gov_datasets.txt not found (optional)
Loading SpaCy model...
Loading BERT QA model...


Device set to use cuda:0


✓ Phase II models loaded

✓ Pipeline initialized successfully



In [None]:
# Verify input paths - run this cell before running inference
# The path constants in Section 2 were resolved from os.getcwd() when the
# configuration cell ran, so calling os.chdir() here would not change them.
# If a check below fails, set NOTEBOOK_DIR in Section 2 to the folder that
# holds the data and re-run the notebook from that cell onward.
print(f"Data directory: {NOTEBOOK_DIR}")
print(f"test.csv exists: {os.path.exists(TEST_CSV)}")
print(f"train.csv exists: {os.path.exists(TRAIN_CSV)}")
print(f"test directory exists: {os.path.isdir(TEST_JSON_DIR)}")
print(f"train directory exists: {os.path.isdir(TRAIN_JSON_DIR)}")

In [None]:
# Run inference
predictions = pipeline.run_inference()

In [None]:
# Save predictions
os.makedirs(OUTPUT_DIR, exist_ok=True)
output_path = os.path.join(OUTPUT_DIR, "predictions.csv")
predictions.to_csv(output_path, index=False)
print(f"\n✓ Predictions saved to {output_path}")
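
One detail to keep in mind when the saved file is read back later: by default `pandas.read_csv` parses empty fields as `NaN`, so publications with no predictions no longer compare equal to `''`. The sketch below demonstrates this with an in-memory buffer instead of a real file. The sample IDs and labels are made up.

```python
import io

import pandas as pd

saved = pd.DataFrame(
    {"Id": ["pub-1", "pub-2"], "PredictionString": ["adni | nels", ""]}
)
buffer = io.StringIO()
saved.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default parsing turns the empty prediction into NaN
default_load = pd.read_csv(io.StringIO(csv_text))
print(default_load["PredictionString"].isna().tolist())  # [False, True]

# keep_default_na=False keeps empty fields as empty strings
string_load = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(string_load["PredictionString"].tolist())  # ['adni | nels', '']
```

Calling `.fillna('')` on the column after loading is an equivalent alternative if other columns should keep the default `NaN` handling.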

## 9. Results Analysis <a name="results"></a>

Analyze and visualize the predictions.

In [None]:
# Display first few predictions
print("\nFirst 10 predictions:")
print("="*80)
predictions.head(10)

In [None]:
# Statistics
print("\nPrediction Statistics:")
print("="*80)
print(f"Total publications: {len(predictions)}")
print(f"Publications with predictions: {(predictions['PredictionString'] != '').sum()}")
print(f"Publications without predictions: {(predictions['PredictionString'] == '').sum()}")

# Average number of datasets per publication
avg_datasets = predictions['PredictionString'].apply(
    lambda x: len([p for p in x.split('|') if p.strip()]) if x else 0
).mean()
print(f"Average datasets per publication: {avg_datasets:.2f}")

In [None]:
# Sample predictions with details
print("\nSample prediction with details:")
print("="*80)

sample_idx = 0
if len(predictions) > 0:
    pub_id = predictions.iloc[sample_idx]['Id']
    pred_string = predictions.iloc[sample_idx]['PredictionString']
    
    print(f"Publication ID: {pub_id}")
    print(f"\nExtracted datasets:")
    if pred_string:
        datasets = [p.strip() for p in pred_string.split('|')]
        for i, dataset in enumerate(datasets, 1):
            print(f"  {i}. {dataset}")
    else:
        print("  (No datasets found)")
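
Beyond per-publication counts, a frequency table across publications can help spot over-generated labels, for example an organisation name that the NER step returns for almost every paper. The sketch below counts in how many prediction strings each label appears. `label_frequencies` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def label_frequencies(
    prediction_strings: Iterable[str], top_n: int = 5
) -> List[Tuple[str, int]]:
    """Count how many publications each predicted label appears in."""
    counts = Counter()
    for pred in prediction_strings:
        # A set ensures each label counts at most once per publication
        labels = {p.strip() for p in pred.split('|') if p.strip()}
        counts.update(labels)
    return counts.most_common(top_n)


sample = ["adni | census bureau", "census bureau", "", "census bureau | nels"]
print(label_frequencies(sample, top_n=1))  # [('census bureau', 3)]
```

In this notebook the function could be called as `label_frequencies(predictions['PredictionString'])`, assuming the column contains only strings.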

## Summary

This notebook implements the complete DEFNLP pipeline:

✅ **Phase I**: Baseline string matching (internal + external datasets)  
✅ **Phase II**: Advanced NLP (SpaCy NER + BERT QA)  
✅ **Phase III**: Acronym and abbreviation extraction  
✅ **Integration**: Deduplicated union of all phase predictions (transfer learning via the pretrained SpaCy and BERT models)  

The final predictions are saved to `output/predictions.csv` in the required format:
- `Id`: Publication ID
- `PredictionString`: Pipe-separated dataset names (sorted alphabetically)

### Next Steps

1. Review the predictions in `output/predictions.csv`
2. Adjust configuration parameters if needed
3. Add more QA questions or keywords for better coverage
4. Fine-tune the BERT model on your specific dataset (optional)
5. Submit predictions to evaluation platform