In [None]:
!pip install requests
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz
!pip install scispacy
!pip install biopython

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz
  Downloading https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz (120.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 120.2/120.2 MB 4.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting spacy<3.5.0,>=3.4.1 (from en_ner_bionlp13cg_md==0.5.1)
  Downloading spacy-3.4.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (24 kB)
Collecting thinc<8.2.0,>=8.1.0 (from spacy<3.5.0,>=3.4.1->en_ner_bionlp13cg_md==0.5.1)
  Downloading thinc-8.1.12-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting wasabi<1.1.0,>=0.9.1 (from spacy<3.5.0,>=3.4.1->en_ner_bionlp13cg_md==0.5.1)
  Downloading wasabi-0.10.1-py3-none-any.whl.metadata (28 kB)
Colle

In [1]:
import csv
import logging
import os
import re
import sys
import time
import uuid
import xml.etree.ElementTree as ET

import requests
import spacy
from Bio import Entrez, Medline
from tqdm import tqdm

# Python Script for Biomedical Literature Analysis: A Step-by-Step Explanation

Imagine you're a detective trying to find connections between certain childhood inflammatory diseases (like MIS-C or Kawasaki Disease) and genetic factors, particularly in research authored by a specific group of scientists (like those led by Casanova). This script is your automated research assistant.

**Overall Goal:**

The script aims to:
1.  Search the PubMed database (a vast library of biomedical research papers) for articles relevant to specific childhood inflammatory syndromes and genetics, with a potential focus on papers by certain authors.
2.  Extract detailed information from these papers, including titles, authors, publication year, journal, and abstracts (summaries).
3.  Identify gene names and genetic variants (like mutations or polymorphisms) mentioned within these papers, using both text analysis and links in NCBI (National Center for Biotechnology Information) databases.
4.  Gather citation counts for these papers from both PubMed and Semantic Scholar (another academic search engine) to gauge their impact.
5.  Consolidate all this information into a single, structured CSV (spreadsheet-like) file for further analysis.

---

## The Pipeline: Step-by-Step Explanation

### Phase 1: Setup and Configuration (The Blueprint)

This is like gathering your tools and instructions before starting a big project.

1.  **Defining Key Information (Lines 1-31 in the script):**
    * `HGNC_TSV_URL`: The web address for a file from the HUGO Gene Nomenclature Committee (HGNC). This committee gives official names and symbols to human genes. This file is a master list.
    * `APPROVED_GENE_SYMBOLS_FILE`: The name of a local file where the script will store a list of "approved" gene symbols extracted from the HGNC master list.
    * `APPROVED_SYMBOL_COLUMN_NAME`, `STATUS_COLUMN_NAME`, `REQUIRED_STATUS`: These tell the script which columns in the HGNC file contain the gene symbol and its status (e.g., "Approved", "Entry Withdrawn"), and that we only want "Approved" ones.
    * `APPROVED_GENE_SYMBOLS = set()`: An empty container (a set, which stores unique items) that will later be filled with the official gene symbols.
    * `ENTREZ_EMAIL`, `ENTREZ_API_KEY`: Your "credentials" for using NCBI's Entrez system, which is the gateway to PubMed and other databases. The email is for NCBI to contact you if there's an issue, and the API key grants higher access limits.
    * `MAX_RESULTS`, `BATCH_SIZE`, `RETRY_BATCH_SIZE`, `RETRIES`, `BASE_SLEEP`: Settings for how many results to fetch, how many papers to process in one go (a batch), how many to retry if a batch fails, how many times to retry, and how long to pause between requests (to be polite to the servers).
    * `TARGET_AUTHOR`: A list of author names (and their variations) the script should pay special attention to.
    * `PMIDS_TO_CHECK`, `CRITICAL_PMIDS`: Lists of specific PubMed IDs (PMIDs are unique identifiers for papers). `PMIDS_TO_CHECK` are papers to verify if they are found by the search query. `CRITICAL_PMIDS` are papers that *must* be included in the analysis, even if the main search doesn't find them.
    * `_GENE_BLACKLIST_TERMS`, `GENE_BLACKLIST`: A list of words (like "COVID", "RESULTS", "GENE") that might look like gene names but aren't, or are too general. The script will ignore these if it finds them during text analysis.
    * `VARIANT_KEYWORDS`: Words that signal the presence of a genetic variant discussion (e.g., "mutation", "allele").
    * `QUERY`: This is the heart of the literature search. It's a PubMed-formatted string made of two parenthesised groups joined by **AND**, telling PubMed to find papers that:
        * Mention "Multisystem Inflammatory Syndrome in Children" (MIS-C), "PIMS", "PIMS-TS", or "Kawasaki Disease" (either as MeSH terms or in the title/abstract).
        * **AND** either mention genetics-related terms such as "Genetic Predisposition to Disease", "Mutation", "genes", "variants", or "GWAS" (Genome-Wide Association Study), **OR** are authored by anyone in `TARGET_AUTHOR`. Because the author clause sits inside this second group, an author match can stand in for the genetics terms but never for the disease terms.
    * `logging.basicConfig(...)`: Sets up a log file (`pubmed_errors.log`) to record progress and any errors encountered.
    * `Entrez.email = ...`, `Entrez.api_key = ...`: Tells the `Bio.Entrez` library (used for NCBI access) your email and API key.
    * `nlp = spacy.load(...)`: Attempts to load a specialized AI model from "SciSpaCy". This model is trained to understand biomedical text and can identify things like gene names within sentences. If it fails to load, the script will still run but won't be as good at finding genes directly from text.
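
The grouping inside `QUERY` is easy to get wrong, so here is a minimal sketch of how such a string can be assembled. The helper name `build_query` and the shortened term lists are illustrative assumptions, not part of the script.

```python
def build_query(disease_terms, genetics_terms, authors):
    """Return a PubMed query shaped like (disease) AND (genetics OR authors).

    Hypothetical helper: the script itself builds QUERY as one literal string.
    """
    disease = " OR ".join(disease_terms)
    second_group = list(genetics_terms)
    if authors:
        # Authors are wrapped in their own parentheses inside the second group.
        author_clause = " OR ".join(f"{name}[Author]" for name in authors)
        second_group.append(f"({author_clause})")
    return f"({disease}) AND ({' OR '.join(second_group)})"


query = build_query(
    ["MIS-C[Title/Abstract]", "Kawasaki Disease[MeSH Terms]"],
    ["Mutation[MeSH Terms]", "gwas[Title/Abstract]"],
    ["Casanova JL"],
)
print(query)
```

Printing the assembled string before sending it to PubMed is a cheap way to confirm that the parentheses group the way you intended.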

### Phase 2: Preparing the Gene List (The Reference Manual)

2.  **`create_approved_gene_list_from_hgnc` function (Lines 40-106):**
    * **Purpose:** To create a clean list of officially approved human gene symbols.
    * **How it works:**
        * Downloads the `hgnc_complete_set.txt` file from the `HGNC_TSV_URL`.
        * Reads this file (it's a tab-separated value file, like a spreadsheet).
        * Finds the columns for gene symbols and their status.
        * Goes row by row, and if a gene's status is "Approved", it adds the gene symbol (converted to uppercase) to a set.
        * Saves this set of approved gene symbols into the `APPROVED_GENE_SYMBOLS_FILE` (`hgnc_approved_genes.txt`).
    * **Why it's important:** When the script later tries to identify genes from paper texts, it can check them against this official list to reduce errors and ensure it's dealing with valid gene symbols.
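
The filtering step can be tried without downloading anything. The sketch below applies the same rule (keep the symbol when the status is "Approved", upper-cased) to a tiny in-memory TSV. The rows are invented for illustration, and `approved_symbols` is a hypothetical helper that uses `csv.DictReader` where the script uses index lookups.

```python
import csv
import io


def approved_symbols(tsv_text, symbol_col="symbol", status_col="status",
                     required="Approved"):
    """Collect upper-cased symbols from rows whose status matches `required`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    symbols = set()
    for row in reader:
        # Short rows yield None for missing cells, so fall back to "".
        status = (row.get(status_col) or "").strip()
        symbol = (row.get(symbol_col) or "").strip()
        if status == required and symbol:
            symbols.add(symbol.upper())
    return symbols


# Invented sample rows; real HGNC files have many more columns.
sample = (
    "hgnc_id\tsymbol\tstatus\n"
    "HGNC:1\tabc1\tApproved\n"
    "HGNC:2\tOLD1\tEntry Withdrawn\n"
    "HGNC:3\t\tApproved\n"
)
print(approved_symbols(sample))
```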

### Phase 3: Searching and Gathering Paper Data (The Investigation)

3.  **`check_paper_in_query` function (Lines 108-119):**
    * **Purpose:** To verify if a specific paper (given by its `pmid`) would be found by the main `QUERY`.
    * **How it works:** Re-runs the full `QUERY` through `Entrez.esearch` (with `retmax=10000`) and checks whether the `pmid` appears in the returned ID list.
    * **Why it's important:** Useful for debugging the `QUERY` or ensuring critical papers are indeed covered.

4.  **`search_pubmed` function (Lines 121-136):**
    * **Purpose:** To execute the main search `QUERY` on PubMed and get a list of relevant paper IDs.
    * **How it works:** Uses `Entrez.esearch` to send the query to PubMed and retrieves up to `MAX_RESULTS` PMIDs. It includes retries in case of network issues.
    * **Output:** A list of PMIDs.
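
The retry pattern used here (and in the fetch functions that follow) can be isolated into a small helper. This is a sketch under assumptions: `call_with_retries` is a hypothetical name, and unlike `search_pubmed`, which returns an empty list after the last failure, it re-raises the final error so the caller can decide what to do.

```python
import time


def call_with_retries(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay + attempt seconds and retry."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return func()
        except Exception as exc:  # broad on purpose, mirroring the script
            last_error = exc
            if attempt < retries - 1:
                sleep(base_delay + attempt)
    raise last_error


# Demo with a stand-in function that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return ["33106546"]


delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the demo instant and makes the waiting schedule easy to inspect.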

5.  **`fetch_paper_titles` function (Lines 138-215):**
    * **Purpose:** For a given list of PMIDs, this function fetches detailed information: title, authors, abstract, publication year, and journal. Crucially, it also performs text analysis on the title and abstract to find potential gene and variant mentions.
    * **How it works:**
        * Processes PMIDs in batches (`b_size`) for efficiency.
        * Uses `Entrez.efetch` to get the full records for each paper in Medline format.
        * Parses each record to extract:
            * PMID, Title (`TI`), Authors (`AU`), Abstract (`AB`), Publication Date (`DP` to get the year), Journal (`JT` or `TA`).
            * Checks if any of the `TARGET_AUTHOR`s are in the author list.
        * **SciSpaCy Text Mining (if `nlp` model is loaded):**
            * Combines the title and abstract into one block of text.
            * Feeds this text to the SciSpaCy model (`doc = nlp(text_content)`).
            * The model identifies entities labeled as "GENE_OR_GENE_PRODUCT".
            * These candidate genes are filtered: must be longer than 2 characters, not just numbers, and not in the `GENE_BLACKLIST`.
            * If the `APPROVED_GENE_SYMBOLS` list is available, it further filters these candidates to keep only those present in the approved list.
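            * It also looks for keywords from `VARIANT_KEYWORDS` and then checks a window of five tokens on either side for patterns that look like variant IDs (e.g., `rs12345`, `p.Gly12Val`). Only the first match in each window is recorded; if none is found, the keyword itself is kept as context.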
        * Stores all this information for each PMID in a dictionary (`paper_info`).
    * **Output:** A tuple of two items: a dictionary where keys are PMIDs and values are dictionaries containing all the extracted details (including lists of `text_genes`, `text_variants`, and `text_variant_context`), and a list of PMIDs whose batches failed after all retries.
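
The two text-mining rules can be exercised without loading SciSpaCy. The sketch below works on plain strings and a whitespace-split token list instead of a spaCy `Doc`; the function names and the sample sentence are illustrative assumptions.

```python
def filter_gene_candidates(candidates, blacklist, approved=None):
    """Keep candidates longer than 2 chars, non-numeric, not blacklisted,
    and (when an approved set is supplied) present in that set."""
    kept = set()
    for cand in candidates:
        upper = cand.upper()
        if len(cand) <= 2 or cand.isdigit() or upper in blacklist:
            continue
        if approved and upper not in approved:
            continue
        kept.add(cand)
    return sorted(kept)


def looks_like_variant(token):
    """True for rs-style IDs (rs12345) or protein changes (p.Gly12Val)."""
    is_rsid = token.startswith("rs") and token[2:].isdigit()
    is_protein = token.startswith("p.") and any(ch.isdigit() for ch in token)
    return is_rsid or is_protein


def scan_variants(tokens, keywords, window=5):
    """For each keyword, record the first variant-like token within `window`
    tokens on either side, or the keyword itself when none is nearby."""
    variants, context = set(), set()
    for i, tok in enumerate(tokens):
        if tok.lower() not in keywords:
            continue
        nearby = tokens[max(0, i - window): i + window + 1]
        hit = next((t for t in nearby if looks_like_variant(t)), None)
        if hit:
            variants.add(hit)
        else:
            context.add(tok.lower())
    return sorted(variants), sorted(context)


genes = filter_gene_candidates(
    ["IL6", "COVID", "42", "XYZ9"], blacklist={"COVID"}, approved={"IL6"}
)
tokens = ("The rs12345 variant was common . Separately , across the whole "
          "cohort , no pathogenic mutation was found").split()
found = scan_variants(tokens, {"variant", "mutation"})
print(genes, found)
```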

6.  **`fetch_citation_counts` function (Lines 217-244):**
    * **Purpose:** To find out how many times each paper (PMID) has been cited by other papers within the PubMed database.
    * **How it works:**
        * Uses `Entrez.elink` with `linkname="pubmed_pubmed_citedin"`. This asks NCBI: "For these PMIDs, which other PubMed papers cite them?"
        * Counts the number of citing papers for each input PMID.
    * **Output:** A dictionary mapping PMIDs to their PubMed citation counts, plus a list of PMIDs whose batches failed after all retries.
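
Counting citations amounts to measuring the length of each link list in the parsed `elink` result. The sketch below uses plain dictionaries shaped like what `Entrez.read` returns, so it runs offline; `count_citations` and the IDs are invented for illustration. One assumption worth stating: per-paper counts require one LinkSet per PMID, which, as far as the E-utilities documentation describes, happens when each ID is sent as its own `id` parameter and not when the IDs are comma-joined.

```python
def count_citations(linksets):
    """Map each source PMID to the number of 'cited in' links it has."""
    counts = {}
    for linkset in linksets:
        pmid = linkset["IdList"][0]
        link_dbs = linkset.get("LinkSetDb") or []
        # No LinkSetDb entry means NCBI found no citing papers.
        counts[pmid] = len(link_dbs[0]["Link"]) if link_dbs else 0
    return counts


# Hand-written stand-in for a parsed elink response (IDs are invented).
fake_record = [
    {
        "IdList": ["111"],
        "LinkSetDb": [
            {"LinkName": "pubmed_pubmed_citedin",
             "Link": [{"Id": "900"}, {"Id": "901"}]}
        ],
    },
    {"IdList": ["222"], "LinkSetDb": []},
]
citation_counts = count_citations(fake_record)
print(citation_counts)
```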

7.  **`fetch_s2_citation_counts` function (Lines 246-294):**
    * **Purpose:** To get citation counts from Semantic Scholar, an alternative academic search engine. This can provide a different or more up-to-date count.
    * **How it works:**
        * Uses the Semantic Scholar API (`https://api.semanticscholar.org/graph/v1/paper/batch`).
        * Sends batches of PMIDs (prefixed with "PMID:") to the API.
        * Parses the JSON response to get the `citationCount` for each paper.
        * Includes specific handling for rate limits (API telling the script to slow down).
    * **Output:** A dictionary mapping PMIDs to their Semantic Scholar citation counts.
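
The request and response handling can be sketched as pure functions. Two assumptions are baked in: that the batch endpoint returns one item per requested ID in the same order, with a null for papers it does not know, and that a capped exponential wait is an acceptable response to rate limiting. The helper names are hypothetical and the script's own rate-limit handling may differ.

```python
def to_s2_ids(pmids):
    """Prefix each PMID the way the Semantic Scholar batch endpoint expects."""
    return [f"PMID:{pmid}" for pmid in pmids]


def parse_s2_batch(pmids, response_items):
    """Pair each PMID with its citationCount (None when the paper is unknown).

    Assumes response_items is aligned one-to-one with pmids.
    """
    counts = {}
    for pmid, item in zip(pmids, response_items):
        counts[pmid] = item.get("citationCount") if item else None
    return counts


def backoff_seconds(attempt, base=1.0, cap=30.0):
    """Capped exponential wait to use after a rate-limit response."""
    return min(cap, base * 2 ** attempt)


pmids = ["111", "222"]  # invented IDs
payload = {"ids": to_s2_ids(pmids)}
# Hand-written stand-in for the decoded JSON response.
fake_response = [{"paperId": "abc", "citationCount": 12}, None]
s2_counts = parse_s2_batch(pmids, fake_response)
print(payload, s2_counts, backoff_seconds(5))
```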

8.  **`fetch_linked_data` function (Lines 296-323):**
    * **Purpose:** To find official links between papers (PMIDs) and entries in other NCBI databases, specifically the Gene database or the SNP (Single Nucleotide Polymorphism, a type of variant) database.
    * **How it works:**
        * Similar to `fetch_citation_counts`, uses `Entrez.elink`.
        * For genes: `db="gene"`, `linkname="pubmed_gene"`. This asks: "Which NCBI Gene IDs are linked to these PMIDs?"
        * For variants: `db="snp"`, `linkname="pubmed_snp"`. This asks: "Which NCBI SNP IDs are linked to these PMIDs?"
    * **Output:** A dictionary mapping PMIDs to a list of linked Gene IDs or SNP IDs.

9.  **`fetch_gene_names` function (Lines 326-349):**
    * **Purpose:** NCBI Gene IDs are numbers (e.g., `7157`). This function converts these IDs into their human-readable official symbols (e.g., `TP53`).
    * **How it works:**
        * Takes a list of Gene IDs.
        * Uses `Entrez.efetch` with `db="gene"` to get detailed records for these Gene IDs.
        * Parses the XML response to find the `Gene-ref_locus` (official symbol) for each ID.
    * **Output:** A dictionary mapping Gene IDs to their symbols.
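
The XML lookup can be tried on a hand-trimmed record. The snippet below keeps only the two elements of interest; real responses from the Gene database are far larger, and `symbols_by_gene_id` is a hypothetical helper, not the script's function.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed stand-in for an efetch response from db="gene".
SAMPLE_XML = """<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>7157</Gene-track_geneid>
      </Gene-track>
    </Entrezgene_track-info>
    <Entrezgene_gene>
      <Gene-ref>
        <Gene-ref_locus>TP53</Gene-ref_locus>
      </Gene-ref>
    </Entrezgene_gene>
  </Entrezgene>
</Entrezgene-Set>"""


def symbols_by_gene_id(xml_text):
    """Map each Gene ID to the first Gene-ref_locus found in its record."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for entry in root.iter("Entrezgene"):
        gene_id = entry.findtext(".//Gene-track_geneid")
        symbol = entry.findtext(".//Gene-ref_locus")
        if gene_id and symbol:
            mapping[gene_id] = symbol
    return mapping


gene_symbols = symbols_by_gene_id(SAMPLE_XML)
print(gene_symbols)
```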

10. **`fetch_variant_details` function (Lines 351-401):**
    * **Purpose:** NCBI SNP IDs are also numbers. This function fetches more details for these variant IDs, such as their common `rs` number (e.g., `rs12345`), the different versions (alleles, like A, T, C, or G), and potentially the gene they are associated with.
    * **How it works:**
        * Takes a list of SNP IDs.
        * Uses `Entrez.efetch` with `db="snp"` to get detailed records.
        * Parses the XML response to extract:
            * `SNP_ID` (to form the `rs` number).
            * Allele information from `GLOBAL_MAFS` (Minor Allele Frequencies).
            * Associated gene name, if available.
    * **Output:** A dictionary mapping SNP IDs to details like `rsid`, `alleles`, and `gene`.

### Phase 4: Execution and Reporting (Putting It All Together)

This happens inside the `if __name__ == "__main__":` block (Lines 403-550). This is the main control flow of the script.

1.  **Initial Checks & Setup (Lines 404-425):**
    * Ensures `ENTREZ_EMAIL` and `QUERY` are set.
    * Warns if the SciSpaCy model (`nlp`) isn't loaded.
    * **Load/Create Approved Gene List:**
        * Checks if `APPROVED_GENE_SYMBOLS_FILE` exists.
        * If not, calls `create_approved_gene_list_from_hgnc` to create it.
        * If it exists (or was just created), it reads the gene symbols from this file into the `APPROVED_GENE_SYMBOLS` set for later use in validation.

2.  **Running the Search and Fetching Data (Lines 427-466):**
    * `check_paper_in_query`: Calls the function for each PMID in `PMIDS_TO_CHECK`.
    * `paper_ids_list = search_pubmed()`: Performs the main PubMed search.
    * **Add Critical PMIDs:** Ensures all PMIDs from `CRITICAL_PMIDS` are included in `paper_ids_list`, even if the search didn't find them.
    * `paper_info_dict, ... = fetch_paper_titles(...)`: Gets titles, abstracts, and performs SciSpaCy text mining for all identified papers. Retries failed ones.
    * `citation_counts_dict, ... = fetch_citation_counts(...)`: Gets PubMed citation counts. Retries.
    * `s2_citation_counts_dict = fetch_s2_citation_counts(...)`: Gets Semantic Scholar citation counts.
    * `paper_to_ncbi_genes, ... = fetch_linked_data(..., "gene", ...)`: Finds NCBI Gene IDs linked to the papers.
    * `paper_to_ncbi_variants, ... = fetch_linked_data(..., "snp", ...)`: Finds NCBI SNP IDs linked to the papers.
    * `gene_names_dict, ... = fetch_gene_names(...)`: Converts all unique linked Gene IDs to gene symbols.
    * `variant_details_dict, ... = fetch_variant_details(...)`: Gets details for all unique linked SNP IDs.

3.  **Structuring and Consolidating Results (Lines 468-530):**
    * This is a crucial loop where all the gathered information is combined for each paper.
    * It iterates through each paper in `paper_info_dict`.
    * For each paper (PMID):
        * It retrieves the basic info (title, year, journal, author flag).
        * It gets the text-mined genes/variants from `paper_info_dict`.
        * It retrieves the linked NCBI gene symbols (using `gene_names_dict`) and linked variant details (using `variant_details_dict`).
        * It gets the PubMed and Semantic Scholar citation counts.
        * **Logic for Output Rows:**
            * If a paper has NCBI-linked genes, a row is created for each gene.
            * If a paper has NCBI-linked variants, a row is created for each variant (showing its associated gene if known, or a placeholder).
            * If a paper has genes identified by SciSpaCy *that were not already covered by NCBI links*, a row is created for these text-mined genes, potentially associating them with text-mined variants if found.
            * If a paper has *no* specific gene or variant data found (neither linked nor from text), a placeholder row is created for that paper to indicate it was retrieved by the search but had no identifiable genetic information, or that SciSpaCy hits were filtered out.
        * Each such piece of information (PMID, Title, Year, Gene, Variant, Citations, etc.) is added as a dictionary to a list called `results_list`. One paper might lead to multiple rows in `results_list` if it discusses multiple genes/variants.
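
The row logic above can be sketched as a single function. The column names, the `Source` field, and the sample values are illustrative assumptions; the script's real rows carry more fields (year, journal, citation counts).

```python
def rows_for_paper(pmid, info, linked_genes, linked_variants):
    """Expand one paper into output rows following the four rules above."""
    base = {"PMID": pmid, "Title": info["title"]}
    rows = []
    # Rule 1: one row per NCBI-linked gene.
    for gene in linked_genes:
        rows.append({**base, "Gene": gene, "Variant": "", "Source": "NCBI gene link"})
    # Rule 2: one row per NCBI-linked variant, with a placeholder gene if unknown.
    for variant in linked_variants:
        rows.append({**base, "Gene": variant.get("gene") or "N/A",
                     "Variant": variant["rsid"], "Source": "NCBI SNP link"})
    # Rule 3: text-mined genes not already covered by NCBI links.
    covered = {gene.upper() for gene in linked_genes}
    for gene in info.get("text_genes", []):
        if gene.upper() not in covered:
            rows.append({**base, "Gene": gene,
                         "Variant": "; ".join(info.get("text_variants", [])),
                         "Source": "text mining"})
    # Rule 4: placeholder row when nothing genetic was found.
    if not rows:
        rows.append({**base, "Gene": "N/A", "Variant": "N/A", "Source": "none"})
    return rows


info = {"title": "Example paper", "text_genes": ["IL6", "TLR7"],
        "text_variants": ["rs12345"]}
rows = rows_for_paper("111", info, ["TLR7"], [{"rsid": "rs999", "gene": None}])
empty_rows = rows_for_paper("222", {"title": "No genetics"}, [], [])
print(len(rows), len(empty_rows))
```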

4.  **Saving the Output (Lines 532-546):**
    * `results_list.sort(...)`: Sorts all the collected rows, primarily by Semantic Scholar citation count (highest first), then by PubMed citation count, then by PMID.
    * Checks if `results_list` actually contains any data.
    * If yes, it defines the column headers (`fieldnames`) for the output CSV file.
    * It creates a unique filename for the CSV (e.g., `pubmed_genetic_results_abcdef12.csv`).
    * Writes all the dictionaries in `results_list` as rows into this CSV file.
    * Prints a message indicating where the results were saved.
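
A compact sketch of the sort-then-write step follows. It writes to an in-memory buffer so it leaves no file behind; the column names are assumptions, citation counts are assumed to be integers, and the eight-character suffix is assumed to come from a UUID, matching the example filename above.

```python
import csv
import io
import uuid


def sort_rows(rows):
    """Highest Semantic Scholar count first, then PubMed count, then PMID."""
    return sorted(
        rows,
        key=lambda r: (-r["S2_Citations"], -r["PubMed_Citations"], r["PMID"]),
    )


def write_rows(rows, fieldnames, handle):
    """Write a header line followed by one CSV line per row dictionary."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


def output_filename(prefix="pubmed_genetic_results"):
    """Build a filename with a short random suffix to avoid overwriting runs."""
    return f"{prefix}_{uuid.uuid4().hex[:8]}.csv"


sample_rows = [
    {"PMID": "111", "S2_Citations": 5, "PubMed_Citations": 2},
    {"PMID": "222", "S2_Citations": 9, "PubMed_Citations": 1},
    {"PMID": "333", "S2_Citations": 5, "PubMed_Citations": 4},
]
ordered = sort_rows(sample_rows)
buffer = io.StringIO()
write_rows(ordered, ["PMID", "S2_Citations", "PubMed_Citations"], buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines, output_filename())
```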

5.  **Script End (Lines 548-550):**
    * Prints "Script finished." to the console and to the log file.

---

This detailed process allows the script to systematically gather, process, and organize a wealth of information from biomedical literature, making it easier for researchers to identify trends, key papers, and genetic factors related to their area of interest.

In [2]:
HGNC_TSV_URL = "https://storage.googleapis.com/public-download-files/hgnc/tsv/hgnc_complete_set.txt"
APPROVED_GENE_SYMBOLS_FILE = "hgnc_approved_genes.txt"
APPROVED_SYMBOL_COLUMN_NAME = "symbol"
STATUS_COLUMN_NAME = "status"
REQUIRED_STATUS = "Approved"
APPROVED_GENE_SYMBOLS = set()

ENTREZ_EMAIL = "michal.uppal@gmail.com"
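# Read the key from the environment instead of hard-coding a secret in the notebook.
# The variable name NCBI_API_KEY is an assumption; set it before running this cell.
ENTREZ_API_KEY = os.environ.get("NCBI_API_KEY", "")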
MAX_RESULTS = 5000
BATCH_SIZE = 10
RETRY_BATCH_SIZE = 3
RETRIES = 3
BASE_SLEEP = 0.1
TARGET_AUTHOR = ['Casanova', 'Jean-Laurent Casanova', 'Casanova JL', 'Casanova J-L']
PMIDS_TO_CHECK = ['33106546']
CRITICAL_PMIDS = ['33106546']

_GENE_BLACKLIST_TERMS = {
    'SARS', 'MIS-C', 'PIMS', 'PIMS-TS', 'COVID', 'HLA', 'HIV', 'EBV',
    'CONCLUSIONS', 'BACKGROUND', 'PATIENTS', 'METHODS', 'RESULTS', 'AND',
    'SD', 'INTRODUCTION', 'ABSTRACT', 'DISCUSSION', 'GENE', 'VARIANT'
}
GENE_BLACKLIST = {term.upper() for term in _GENE_BLACKLIST_TERMS}

VARIANT_KEYWORDS = {'variant', 'variants', 'mutation', 'mutations', 'allele', 'alleles'}
QUERY = (
    "(Multisystem Inflammatory Syndrome in Children[MeSH Terms] OR "
    "MIS-C[Title/Abstract] OR PIMS[Title/Abstract] OR PIMS-TS[Title/Abstract] OR "
    "Paediatric Inflammatory Multisystem Syndrome[Title/Abstract] OR "
    "Kawasaki Disease[MeSH Terms] OR Kawasaki Disease[Title/Abstract] OR Kawasaki Syndrome[Title/Abstract]) AND "
    "(Genetic Predisposition to Disease[MeSH Terms] OR Polymorphism, Genetic[MeSH Terms] OR "
    "Mutation[MeSH Terms] OR Genome-Wide Association Study[MeSH Terms] OR "
    "genes[Title/Abstract] OR variants[Title/Abstract] OR genetic association[Title/Abstract] OR "
    "polymorphism[Title/Abstract] OR mutation[Title/Abstract] OR gwas[Title/Abstract] OR "
    "inborn errors[Title/Abstract] OR genetic defects[Title/Abstract] OR "
    "genomics[Title/Abstract] OR gene expression[Title/Abstract] OR "
    f"({' OR '.join(f'{author}[Author]' for author in TARGET_AUTHOR)}))"
)

logging.basicConfig(filename='pubmed_errors.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

Entrez.email = ENTREZ_EMAIL
if ENTREZ_API_KEY:
    Entrez.api_key = ENTREZ_API_KEY

nlp = None
try:
    nlp = spacy.load("en_ner_bionlp13cg_md")
    logging.info("SciSpaCy model 'en_ner_bionlp13cg_md' loaded successfully.")
except Exception as e:
    logging.error(f"Failed to load SciSpaCy model 'en_ner_bionlp13cg_md': {e}.")
    print(f"CRITICAL: Failed to load SciSpaCy model. Text processing for genes/variants will be skipped. Error: {e}")

def create_approved_gene_list_from_hgnc(url, output_file, symbol_column_name, status_column_name, required_status):
    print(f"Attempting to download HGNC gene data from: {url}")
    logging.info(f"Attempting to download HGNC gene data from: {url}")
    try:
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        print("HGNC download successful.")
        logging.info("HGNC download successful.")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading HGNC data: {e}")
        logging.error(f"Error downloading HGNC data: {e}")
        return False

    print(f"Processing HGNC TSV data to extract '{symbol_column_name}' where '{status_column_name}' is '{required_status}'...")
    lines = response.text.splitlines()
    reader = csv.reader(lines, delimiter='\t')

    extracted_symbols = set()
    header = None
    symbol_column_index = -1
    status_column_index = -1

    try:
        header = next(reader)

        if symbol_column_name in header:
            symbol_column_index = header.index(symbol_column_name)
            logging.info(f"Found '{symbol_column_name}' column at index {symbol_column_index} in HGNC data.")
        else:
            print(f"Error: Column '{symbol_column_name}' not found in HGNC header: {header}")
            logging.error(f"Error: Column '{symbol_column_name}' not found in HGNC header: {header}")
            return False

        if status_column_name in header:
            status_column_index = header.index(status_column_name)
            logging.info(f"Found '{status_column_name}' column at index {status_column_index} in HGNC data.")
        else:
            print(f"Error: Column '{status_column_name}' not found in HGNC header: {header}")
            logging.error(f"Error: Column '{status_column_name}' not found in HGNC header: {header}")
            return False

        for row_number, row in enumerate(reader, 1):
            if not row or all(not cell for cell in row):
                logging.debug(f"Skipping empty or blank row {row_number + 1} in HGNC data.")
                continue

            if len(row) > symbol_column_index and len(row) > status_column_index:
                current_status = row[status_column_index].strip()
                if current_status == required_status:
                    symbol = row[symbol_column_index].strip()
                    if symbol:
                        extracted_symbols.add(symbol.upper())
                    else:
                        logging.debug(f"HGNC data: Row {row_number + 1} has status '{required_status}' but symbol is empty.")
            else:
                logging.warning(f"HGNC data: Row {row_number + 1} has insufficient columns. Length: {len(row)}, Row content (first 5 cells): {row[:5]}")

    except StopIteration:
        print("Error: HGNC file appears to be empty after header or header is missing.")
        logging.error("Error: HGNC file appears to be empty after header or header is missing.")
        return False
    except Exception as e:
        print(f"An error occurred while processing the HGNC TSV file: {e}")
        logging.error(f"An error occurred while processing the HGNC TSV file: {e}")
        return False

    if not extracted_symbols:
        print(f"No gene symbols with status '{required_status}' were extracted from HGNC data. Please check the file content and specified column names/status.")
        logging.warning(f"No gene symbols with status '{required_status}' were extracted from HGNC data.")
        return False

    print(f"Extracted {len(extracted_symbols)} unique gene symbols with status '{required_status}'.")
    logging.info(f"Extracted {len(extracted_symbols)} unique gene symbols with status '{required_status}'.")

    try:
        with open(output_file, 'w') as f:
            for symbol in sorted(list(extracted_symbols)):
                f.write(symbol + "\n")
        print(f"Successfully saved approved gene symbols to {output_file}")
        logging.info(f"Successfully saved approved gene symbols to {output_file}")
        return True
    except IOError as e:
        print(f"Error writing HGNC gene symbols to output file {output_file}: {e}")
        logging.error(f"Error writing HGNC gene symbols to output file {output_file}: {e}")
        return False

def check_paper_in_query(pmid, query_to_check=QUERY):
    try:
        handle = Entrez.esearch(db="pubmed", term=query_to_check, retmax=10000)
        record = Entrez.read(handle)
        handle.close()
        is_found = pmid in record['IdList']
        logging.info(f"Paper {pmid} {'found' if is_found else 'not found'} in query.")
        print(f"Paper {pmid} {'found' if is_found else 'not found'} in query.")
        return is_found
    except Exception as e:
        print(f"Error checking if paper {pmid} is in query: {e}")
        logging.error(f"Error checking if paper {pmid} is in query: {e}")
        return False

def search_pubmed(query_to_search=QUERY, max_r=MAX_RESULTS, num_retries=RETRIES):
    print("Searching PubMed...")
    logging.info(f"Searching PubMed with query (first 200 chars): {query_to_search[:200]}...")
    for attempt in range(num_retries):
        try:
            handle = Entrez.esearch(db="pubmed", term=query_to_search, retmax=max_r)
            record = Entrez.read(handle)
            handle.close()
            paper_ids = record['IdList']
            logging.info(f"Found {len(paper_ids)} papers in PubMed: {paper_ids[:10]}...")
            return paper_ids
        except Exception as e:
            logging.error(f"PubMed search attempt {attempt + 1} failed: {e}")
            if attempt == num_retries - 1:
                print(f"Error searching PubMed after {num_retries} attempts: {e}")
            time.sleep(1 + attempt)
    return []

def fetch_paper_titles(paper_ids, b_size=BATCH_SIZE, num_retries=RETRIES):
    global APPROVED_GENE_SYMBOLS
    paper_info = {}
    failed_ids = []
    print("Fetching paper titles, authors, abstracts, year, and journal...")
    if not nlp:
        logging.error("SciSpaCy model (nlp) not loaded. Gene/variant extraction from text will be skipped.")

    for i in tqdm(range(0, len(paper_ids), b_size), desc="Fetching title batches", unit="batch"):
        batch = paper_ids[i:i + b_size]
        for attempt in range(num_retries):
            try:
                handle = Entrez.efetch(db="pubmed", id=",".join(batch), rettype="medline", retmode="text")
                records = list(Medline.parse(handle))
                handle.close()
                for record in records:
                    if "PMID" in record and "TI" in record:
                        authors = record.get("AU", [])
                        abstract = record.get("AB", "")
                        has_target_author = any(
                            any(target.lower() in author.lower() for target in TARGET_AUTHOR)
                            for author in authors
                        ) if TARGET_AUTHOR else False

                        dp_string = record.get("DP", "")
                        year = "N/A"
                        if dp_string:
                            match = re.search(r'\b(\d{4})\b', dp_string)
                            if match:
                                year = match.group(1)

                        journal = record.get("JT", record.get("TA", "N/A"))

                        text_content = f"{record.get('TI', '')} {abstract}"

                        validated_text_genes = []
                        text_variants = []
                        text_variant_context = []

                        if nlp:
                            doc = nlp(text_content)
                            scispacy_genes = [ent.text for ent in doc.ents if ent.label_ == "GENE_OR_GENE_PRODUCT"]
                            current_paper_validated_genes = set()

                            for gene_candidate_original_case in scispacy_genes:
                                gene_candidate_upper = gene_candidate_original_case.upper()
                                if not (len(gene_candidate_original_case) > 2 and \
                                        not gene_candidate_original_case.isdigit() and \
                                        gene_candidate_upper not in GENE_BLACKLIST):
                                    logging.debug(f"Paper {record['PMID']}: Gene '{gene_candidate_original_case}' failed pre-filtering.")
                                    continue

                                if APPROVED_GENE_SYMBOLS:
                                    if gene_candidate_upper not in APPROVED_GENE_SYMBOLS:
                                        logging.info(f"Paper {record['PMID']}: SciSpaCy gene '{gene_candidate_original_case}' NOT in approved symbols. Discarding.")
                                        continue
                                    current_paper_validated_genes.add(gene_candidate_original_case)
                                else:
                                    current_paper_validated_genes.add(gene_candidate_original_case)
                                    if i == 0 and attempt == 0 and not getattr(fetch_paper_titles, 'warned_no_approved_list', False):
                                        logging.warning("Approved gene list (APPROVED_GENE_SYMBOLS) is empty. Keeping SciSpaCy genes without HGNC validation.")
                                        fetch_paper_titles.warned_no_approved_list = True

                            validated_text_genes = sorted(list(current_paper_validated_genes))

                            for token in doc:
                                if token.text.lower() in VARIANT_KEYWORDS:
                                    start = max(0, token.i - 5)
                                    end = min(len(doc), token.i + 6)
                                    window = doc[start:end]
                                    variant_found = False
                                    for t_variant in window:
                                        if (t_variant.text.startswith('rs') and t_variant.text[2:].isdigit()) or \
                                           (t_variant.text.startswith('p.') and any(c.isdigit() for c in t_variant.text)):
                                            text_variants.append(t_variant.text)
                                            variant_found = True
                                            break
                                    if not variant_found:
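                                        # Store the lower-cased keyword so capitalised hits (e.g. "Mutation")
                                        # survive the later membership check against VARIANT_KEYWORDS.
                                        text_variant_context.append(token.text.lower())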

                            text_variants = sorted(list(set(v for v in text_variants if len(v) > 3 or v.startswith('rs'))))
                            text_variant_context = sorted(list(set(c for c in text_variant_context if c in VARIANT_KEYWORDS)))

                        paper_info[record["PMID"]] = {
                            "title": record["TI"],
                            "authors": authors,
                            "year": year,
                            "journal": journal,
                            "has_target_author": has_target_author,
                            "abstract": abstract,
                            "text_genes": validated_text_genes,
                            "text_variants": text_variants,
                            "text_variant_context": text_variant_context
                        }
                        logging.debug(f"Paper {record['PMID']} metadata: title='{record['TI']}', year='{year}', journal='{journal}', validated_text_genes={validated_text_genes}, text_variants={text_variants}")
                    else:
                        logging.warning(f"Skipping record in batch {i // b_size + 1}: Missing PMID or Title. Record PMID: {record.get('PMID', 'N/A')}")
                time.sleep(max(BASE_SLEEP, b_size / 50.0))
                break
            except Exception as e:
                logging.error(f"Fetch titles batch {i // b_size + 1} attempt {attempt + 1} failed: {e}")
                if attempt == num_retries - 1:
                    print(f"Error fetching title batch {i // b_size + 1} after {num_retries} attempts: {e}")
                    failed_ids.extend(batch)
                time.sleep(1 + attempt)
    return paper_info, failed_ids

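def fetch_citation_counts(paper_ids, b_size=BATCH_SIZE, num_retries=RETRIES):
    """Count PubMed "cited in" links for each PMID; return (counts, failed_ids).

    These links come from PubMed Central, so the counts cover only citing articles indexed there.
    """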
    citation_counts = {}
    failed_ids = []
    print("Fetching citation counts...")
    for i in tqdm(range(0, len(paper_ids), b_size), desc="Fetching citation batches", unit="batch"):
        batch = paper_ids[i:i + b_size]
        for attempt in range(num_retries):
            try:
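                # Pass the IDs as a list: Biopython then sends one id= parameter per PMID, so
                # E-utilities returns a separate linkset for each paper. A comma-joined string
                # yields a single merged linkset that would be credited to the first PMID only.
                handle = Entrez.elink(dbfrom="pubmed", db="pubmed", linkname="pubmed_pubmed_citedin", id=batch)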
                record = Entrez.read(handle)
                handle.close()
                for linkset in record:
                    paper_id = linkset["IdList"][0]
                    try:
                        if "LinkSetDb" in linkset and linkset["LinkSetDb"]:
                            citation_ids = [link["Id"] for link in linkset["LinkSetDb"][0]["Link"]]
                            citation_counts[paper_id] = len(citation_ids)
                        else:
                            citation_counts[paper_id] = 0
                    except Exception as e_inner:
                        logging.error(f"Error processing citation linkset for paper {paper_id}: {e_inner}")
                        citation_counts[paper_id] = 0
                time.sleep(max(BASE_SLEEP, b_size / 50.0))
                break
            except Exception as e:
                logging.error(f"Fetch citations batch {i // b_size + 1} attempt {attempt + 1} failed: {e}")
                if attempt == num_retries - 1:
                    print(f"Error fetching citation batch {i // b_size + 1} after {num_retries} attempts: {e}")
                    failed_ids.extend(batch)
                time.sleep(1 + attempt)
    return citation_counts, failed_ids
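
The loop above relies on each linkset describing exactly one source PMID. A minimal sketch that makes this assumption explicit follows. `links_by_paper` is a hypothetical helper, and the records in the usage note only mimic the nested shape that `Entrez.read` returns for ELink results. Linksets whose `IdList` holds several IDs are set aside instead of being credited to the first ID.

```python
def links_by_paper(linksets):
    """Map each source ID to its linked IDs.

    Returns (mapping, ambiguous), where ambiguous lists the IdLists of
    linksets that do not describe exactly one source ID.
    """
    mapping = {}
    ambiguous = []
    for linkset in linksets:
        source_ids = list(linkset.get("IdList", []))
        if len(source_ids) != 1:
            # Links cannot be attributed to a single paper here.
            ambiguous.append(source_ids)
            continue
        link_dbs = linkset.get("LinkSetDb") or []
        # Like the script, read only the first LinkSetDb entry.
        linked = [link["Id"] for link in link_dbs[0]["Link"]] if link_dbs else []
        mapping[source_ids[0]] = linked
    return mapping, ambiguous
```

Calling it on a list such as `[{"IdList": ["111"], "LinkSetDb": [{"Link": [{"Id": "900"}]}]}]` gives one entry per paper, and the length of each list is that paper's citation count.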

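def fetch_s2_citation_counts(pmids, s2_batch_size=5, num_retries=RETRIES):
    """Return {pmid: citation_count} from the Semantic Scholar batch endpoint; PMIDs that are not found are recorded as 0."""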
    s2_citation_counts = {}
    print("Fetching citation counts from Semantic Scholar...")
    logging.info(f"Starting Semantic Scholar citation fetch for {len(pmids)} PMIDs.")

    s2_api_url = "https://api.semanticscholar.org/graph/v1/paper/batch?fields=citationCount,externalIds"

    pmids_with_prefix = [f"PMID:{pmid}" for pmid in pmids]

    for i in tqdm(range(0, len(pmids_with_prefix), s2_batch_size), desc="Fetching S2 citation batches", unit="batch"):
        batch_ids = pmids_with_prefix[i:i + s2_batch_size]
        original_pmids_in_batch = pmids[i:i + s2_batch_size]

        for attempt in range(num_retries):
            try:
                response = requests.post(s2_api_url, json={"ids": batch_ids}, timeout=20)

                if response.status_code == 429:
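                    retry_after = response.headers.get("Retry-After", "60")
                    # Retry-After may be an HTTP date rather than a number of seconds; fall back to 60 s in that case.
                    wait_seconds = int(retry_after) if retry_after.strip().isdigit() else 60
                    print(f"Semantic Scholar rate limit hit. Retrying after {wait_seconds} seconds...")
                    logging.warning(f"Semantic Scholar rate limit hit. Retrying after {wait_seconds} seconds for batch starting with {batch_ids[0]}")
                    time.sleep(wait_seconds + 5)
                    continue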

                response.raise_for_status()
                data_batch = response.json()

                for idx, paper_data in enumerate(data_batch):
                    original_pmid = original_pmids_in_batch[idx]
                    if paper_data:
                        # A null citationCount would break the numeric sort later, so coerce it to 0.
                        s2_citation_counts[original_pmid] = paper_data.get("citationCount") or 0
                    else:
                        s2_citation_counts[original_pmid] = 0
                        logging.info(f"PMID {original_pmid} not found or no data in Semantic Scholar batch response.")

                time.sleep(3.5)
                break

            except requests.exceptions.Timeout:
                logging.error(f"Semantic Scholar API request timed out for batch starting with {batch_ids[0]}, attempt {attempt + 1}.")
                if attempt == num_retries - 1:
                    print(f"Semantic Scholar API request timed out after {num_retries} attempts for batch {batch_ids[0]}. Skipping this batch.")
                    for pmid_in_batch in original_pmids_in_batch:
                        s2_citation_counts[pmid_in_batch] = 0
                time.sleep(5 * (attempt + 1))
            except requests.exceptions.RequestException as e:
                logging.error(f"Semantic Scholar API request failed for batch {batch_ids[0]}, attempt {attempt + 1}: {e}")
                if attempt == num_retries - 1:
                    print(f"Semantic Scholar API request failed after {num_retries} attempts for batch {batch_ids[0]}. Skipping batch.")
                    for pmid_in_batch in original_pmids_in_batch:
                        s2_citation_counts[pmid_in_batch] = 0
                time.sleep(5 * (attempt + 1))
            except Exception as e_general:
                logging.error(f"Unexpected error processing Semantic Scholar batch {batch_ids[0]}, attempt {attempt + 1}: {e_general}")
                if attempt == num_retries - 1:
                    for pmid_in_batch in original_pmids_in_batch:
                        s2_citation_counts[pmid_in_batch] = 0
                time.sleep(5 * (attempt + 1))

    logging.info(f"Finished Semantic Scholar citation fetch. Found counts for {len(s2_citation_counts)} PMIDs.")
    return s2_citation_counts
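
The rate-limit branch above reads the `Retry-After` header. HTTP allows that header to carry either a number of seconds or an HTTP date, so a parser that accepts both can be sketched as follows. `retry_after_seconds` is a hypothetical helper and is not part of the script. The 60-second default mirrors the fallback used above, and the `now` parameter exists only to make the function easy to test.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, default=60.0, now=None):
    """Convert a Retry-After header value into a non-negative wait in seconds."""
    if not header_value:
        return default
    value = header_value.strip()
    try:
        # Form 1: delay in seconds, e.g. "120".
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable header: use the default wait.
        return default
    if target.tzinfo is None:
        # Treat a date without an offset as UTC.
        target = target.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (target - current).total_seconds())
```

In the loop above, the result could be passed to `time.sleep` in place of `int(retry_after)`.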

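def fetch_linked_data(paper_ids, link_db, link_name, desc_name, b_size=BATCH_SIZE, num_retries=RETRIES):
    """Follow an E-utilities link from PubMed to link_db; return ({pmid: [linked_ids]}, failed_ids)."""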
    paper_to_items = {}
    failed_ids = []
    print(f"Retrieving linked {desc_name} in batches...")
    for i in tqdm(range(0, len(paper_ids), b_size), desc=f"Fetching {desc_name} batches", unit="batch"):
        batch = paper_ids[i:i + b_size]
        for attempt in range(num_retries):
            try:
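                # Pass the IDs as a list so that E-utilities returns one linkset per PMID;
                # a comma-joined string yields a single merged linkset for the whole batch.
                handle = Entrez.elink(dbfrom="pubmed", db=link_db, linkname=link_name, id=batch)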
                record = Entrez.read(handle)
                handle.close()
                for linkset in record:
                    paper_id = linkset["IdList"][0]
                    try:
                        if "LinkSetDb" in linkset and linkset["LinkSetDb"]:
                            item_ids = [link["Id"] for link in linkset["LinkSetDb"][0]["Link"]]
                            paper_to_items[paper_id] = item_ids
                        else:
                            paper_to_items[paper_id] = []
                    except Exception as e_inner:
                        logging.error(f"Error processing {desc_name} linkset for paper {paper_id}: {e_inner}")
                        paper_to_items[paper_id] = []
                time.sleep(max(BASE_SLEEP, b_size / 50.0))
                break
            except Exception as e:
                logging.error(f"Fetch linked {desc_name} batch {i // b_size + 1} attempt {attempt + 1} failed: {e}")
                if attempt == num_retries - 1:
                    print(f"Error fetching {desc_name} batch {i // b_size + 1} after {num_retries} attempts: {e}")
                    failed_ids.extend(batch)
                time.sleep(1 + attempt)
    return paper_to_items, failed_ids


def fetch_gene_names(gene_ids, b_size=BATCH_SIZE, num_retries=RETRIES):
    """Resolve NCBI Gene IDs to gene symbols; return (id_to_symbol, failed_ids)."""
    gene_names_map = {}
    failed_ids = []
    print("Fetching gene names...")
    if not gene_ids:
        return {}, []
    for i in tqdm(range(0, len(gene_ids), b_size), desc="Processing gene name batches", unit="batch"):
        batch = gene_ids[i:i + b_size]
        for attempt in range(num_retries):
            try:
                handle = Entrez.efetch(db="gene", id=",".join(batch), retmode="xml")
                records = Entrez.read(handle)
                handle.close()
                for record_xml in records:
                    gene_id = record_xml["Entrezgene_track-info"]["Gene-track"]["Gene-track_geneid"]
                    symbol = record_xml.get("Entrezgene_gene", {}).get("Gene-ref", {}).get("Gene-ref_locus", "Unknown Symbol")
                    gene_names_map[gene_id] = symbol
                time.sleep(max(BASE_SLEEP, b_size / 50.0))
                break
            except Exception as e:
                logging.error(f"Fetch gene names batch {i // b_size + 1} attempt {attempt + 1} failed: {e}")
                if attempt == num_retries - 1:
                    print(f"Error fetching gene name batch {i // b_size + 1} after {num_retries} attempts: {e}")
                    failed_ids.extend(batch)
                time.sleep(1 + attempt)
    return gene_names_map, failed_ids
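
The fetch functions above all repeat the same pattern: try a batch up to `num_retries` times and wait `1 + attempt` seconds between failures. One way to factor that out is sketched below. `call_with_retries` is a hypothetical helper, not something the script defines, and the `sleep` parameter is injectable only so the waiting can be replaced in tests.

```python
import time


def call_with_retries(func, num_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or num_retries attempts are used up.

    Returns (True, result) on success and (False, last_exception) otherwise.
    """
    last_error = None
    for attempt in range(num_retries):
        try:
            return True, func()
        except Exception as exc:  # broad on purpose, like the loops in the script
            last_error = exc
            if attempt < num_retries - 1:
                # Linear backoff: base_delay, base_delay + 1, ...
                sleep(base_delay + attempt)
    return False, last_error
```

A caller would wrap one batch request in a function, pass it in, and add the batch to `failed_ids` when the first element of the result is `False`. Unlike the loops above, this sketch does not sleep after the final failed attempt.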

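def fetch_variant_details(variant_ids, b_size=BATCH_SIZE, num_retries=RETRIES):
    """Fetch the rsID, the alleles reported in GLOBAL_MAFS, and the first listed gene for each dbSNP ID.

    Returns (details_by_id, failed_ids).
    """
    variant_details_map = {}
    failed_ids = []
    ns = {'default': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}
    print("Fetching variant details...")
    if not variant_ids:
        return {}, []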
    for i in tqdm(range(0, len(variant_ids), b_size), desc="Processing variant detail batches", unit="batch"):
        batch = variant_ids[i:i + b_size]
        for attempt in range(num_retries):
            try:
                handle = Entrez.efetch(db="snp", id=",".join(batch), retmode="xml")
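                raw_data = handle.read()
                # The handle may yield bytes or str depending on the Biopython version.
                xml_data = raw_data.decode('utf-8') if isinstance(raw_data, bytes) else raw_data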
                handle.close()
                if not xml_data.strip():
                    logging.warning(f"Empty XML response for variant batch {i // b_size + 1}")
                    break
                root = ET.fromstring(xml_data)
                for doc in root.findall(".//default:DocumentSummary", namespaces=ns):
                    snp_id_element = doc.find("default:SNP_ID", namespaces=ns)
                    variant_id = snp_id_element.text if snp_id_element is not None else "UnknownID"
                    rsid = f"rs{variant_id}" if variant_id != "UnknownID" else "UnknownRSID"

                    alleles = "N/A"
                    global_mafs = doc.find("default:GLOBAL_MAFS", namespaces=ns)
                    if global_mafs is not None:
                        alleles_set = set()
                        maf_elements = global_mafs.findall("default:MAF", namespaces=ns)
                        for maf in maf_elements:
                            freq = maf.find("default:FREQ", namespaces=ns)
                            if freq is not None and freq.text:
                                try:
                                    allele_part = freq.text.split('=')[0]
                                    alleles_set.add(allele_part)
                                except IndexError:
                                    logging.warning(f"Malformed FREQ text for variant {rsid}: {freq.text}")
                        if alleles_set:
                             alleles = ','.join(sorted(alleles_set))

                    gene_name_from_variant = ""
                    genes_element = doc.find("default:GENES", namespaces=ns)
                    if genes_element is not None:
                        gene_e = genes_element.find("default:GENE_E", namespaces=ns)
                        if gene_e is not None:
                            name_element = gene_e.find("default:NAME", namespaces=ns)
                            if name_element is not None and name_element.text:
                                gene_name_from_variant = name_element.text
                    variant_details_map[variant_id] = {"rsid": rsid, "alleles": alleles, "gene": gene_name_from_variant}
                time.sleep(max(BASE_SLEEP, b_size / 50.0))
                break
            except ET.ParseError as e_parse:
                logging.error(f"XML ParseError for variant batch {i // b_size + 1}: {e_parse}. Data (first 500 chars): {xml_data[:500]}")
                if attempt == num_retries - 1:
                    failed_ids.extend(batch)
                time.sleep(1 + attempt)
            except Exception as e:
                logging.error(f"Fetch variant details batch {i // b_size + 1} attempt {attempt + 1} failed: {e}")
                if attempt == num_retries - 1:
                    print(f"Error fetching variant detail batch {i // b_size + 1} after {num_retries} attempts: {e}")
                    failed_ids.extend(batch)
                time.sleep(1 + attempt)
    return variant_details_map, failed_ids
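
The namespace-aware lookups in `fetch_variant_details` can be tried offline on a small document. The sketch below is an illustration only: `parse_docsums` is a hypothetical helper, and `SAMPLE_XML` is a hand-written fragment that imitates the element names used above (`DocumentSummary`, `SNP_ID`, `GLOBAL_MAFS`, `GENES`); it is not real dbSNP output.

```python
import xml.etree.ElementTree as ET

NS = {"d": "https://www.ncbi.nlm.nih.gov/SNP/docsum"}

# Made-up fragment for illustration; IDs, studies, and gene name are invented.
SAMPLE_XML = """<ExchangeSet xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum">
  <DocumentSummary uid="123">
    <SNP_ID>123</SNP_ID>
    <GLOBAL_MAFS>
      <MAF><STUDY>StudyA</STUDY><FREQ>T=0.25/100</FREQ></MAF>
      <MAF><STUDY>StudyB</STUDY><FREQ>C=0.10/40</FREQ></MAF>
    </GLOBAL_MAFS>
    <GENES><GENE_E><NAME>GENE1</NAME><GENE_ID>1</GENE_ID></GENE_E></GENES>
  </DocumentSummary>
</ExchangeSet>"""


def parse_docsums(xml_text):
    """Return {snp_id: {"rsid", "alleles", "gene"}} from a docsum-style document."""
    root = ET.fromstring(xml_text)
    details = {}
    for doc in root.findall(".//d:DocumentSummary", NS):
        snp_id = doc.findtext("d:SNP_ID", default="UnknownID", namespaces=NS)
        # The allele is the part of each FREQ entry before "=".
        alleles = sorted({
            freq.text.split("=")[0]
            for freq in doc.findall("d:GLOBAL_MAFS/d:MAF/d:FREQ", NS)
            if freq.text
        })
        gene = doc.findtext("d:GENES/d:GENE_E/d:NAME", default="", namespaces=NS)
        details[snp_id] = {
            "rsid": f"rs{snp_id}",
            "alleles": ",".join(alleles) or "N/A",
            "gene": gene,
        }
    return details
```

Because every element lives in the default namespace, an unprefixed path such as `.//DocumentSummary` would match nothing, which is why each lookup carries the `d:` prefix.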

if __name__ == "__main__":
    if not ENTREZ_EMAIL:
        raise ValueError("ENTREZ_EMAIL must be set")
    if not QUERY:
        raise ValueError("QUERY must be set")
    if not nlp:
        print("WARNING: SciSpaCy model not loaded. Text-based gene/variant extraction will be limited.")

    if not os.path.exists(APPROVED_GENE_SYMBOLS_FILE):
        print(f"'{APPROVED_GENE_SYMBOLS_FILE}' not found. Attempting to download and create it...")
        if not create_approved_gene_list_from_hgnc(HGNC_TSV_URL, APPROVED_GENE_SYMBOLS_FILE,
                                                    APPROVED_SYMBOL_COLUMN_NAME, STATUS_COLUMN_NAME, REQUIRED_STATUS):
            print(f"Failed to create '{APPROVED_GENE_SYMBOLS_FILE}'. Gene validation against HGNC list will be skipped.")
            logging.error(f"Failed to create '{APPROVED_GENE_SYMBOLS_FILE}'.")
        else:
             print(f"Successfully created '{APPROVED_GENE_SYMBOLS_FILE}'.")

    if os.path.exists(APPROVED_GENE_SYMBOLS_FILE):
        try:
            with open(APPROVED_GENE_SYMBOLS_FILE, 'r') as f:
                APPROVED_GENE_SYMBOLS = {line.strip().upper() for line in f if line.strip()}
            if APPROVED_GENE_SYMBOLS:
                logging.info(f"Successfully loaded {len(APPROVED_GENE_SYMBOLS)} approved gene symbols from {APPROVED_GENE_SYMBOLS_FILE}.")
                print(f"Loaded {len(APPROVED_GENE_SYMBOLS)} approved gene symbols for validation.")
            else:
                logging.warning(f"No gene symbols loaded from {APPROVED_GENE_SYMBOLS_FILE} (file might be empty).")
        except Exception as e:
            logging.error(f"Error loading gene symbols from {APPROVED_GENE_SYMBOLS_FILE}: {e}")
            print(f"Error loading gene symbols from {APPROVED_GENE_SYMBOLS_FILE}. Validation may be affected.")
    else:
        logging.warning(f"Approved gene symbols file ('{APPROVED_GENE_SYMBOLS_FILE}') not found. Gene validation will be limited.")
        print(f"Warning: '{APPROVED_GENE_SYMBOLS_FILE}' not found. SciSpaCy gene validation against HGNC list will be skipped.")

    for pmid_to_check in PMIDS_TO_CHECK:
        check_paper_in_query(pmid_to_check)

    paper_ids_list = search_pubmed()
    print(f"Found {len(paper_ids_list)} papers initially from PubMed search.")

    initial_pmid_count = len(paper_ids_list)
    paper_ids_set = set(paper_ids_list)
    for critical_pmid in CRITICAL_PMIDS:
        if critical_pmid not in paper_ids_set:
            paper_ids_set.add(critical_pmid)
            logging.info(f"Manually added critical PMID {critical_pmid} to paper_ids_set")
    paper_ids_list = list(paper_ids_set)
    if len(paper_ids_list) > initial_pmid_count:
        print(f"Added {len(paper_ids_list) - initial_pmid_count} critical PMIDs. Total papers to process: {len(paper_ids_list)}")

    fetch_paper_titles.warned_no_approved_list = False
    paper_info_dict, failed_title_ids = fetch_paper_titles(paper_ids_list)
    if failed_title_ids:
        print(f"Retrying {len(failed_title_ids)} failed title IDs with batch size {RETRY_BATCH_SIZE}...")
        retry_info, _ = fetch_paper_titles(failed_title_ids, b_size=RETRY_BATCH_SIZE)
        paper_info_dict.update(retry_info)
    logging.info(f"Retrieved titles, authors, and abstracts for {len(paper_info_dict)} papers.")
    print(f"Fetched details for {len(paper_info_dict)} papers.")

    processed_paper_ids = list(paper_info_dict.keys())
    citation_counts_dict, failed_citation_ids = fetch_citation_counts(processed_paper_ids)
    if failed_citation_ids:
        print(f"Retrying {len(failed_citation_ids)} failed citation IDs...")
        retry_citations, _ = fetch_citation_counts(failed_citation_ids, b_size=RETRY_BATCH_SIZE)
        citation_counts_dict.update(retry_citations)
    logging.info(f"Retrieved citation counts for {len(citation_counts_dict)} papers.")

    s2_citation_counts_dict = {}
    if processed_paper_ids:
        s2_citation_counts_dict = fetch_s2_citation_counts(processed_paper_ids)
    logging.info(f"Retrieved Semantic Scholar citation counts for {len(s2_citation_counts_dict)} papers.")

    paper_to_ncbi_genes, failed_gene_ids = fetch_linked_data(processed_paper_ids, "gene", "pubmed_gene", "NCBI genes")
    paper_to_ncbi_variants, failed_variant_ids = fetch_linked_data(processed_paper_ids, "snp", "pubmed_snp", "NCBI variants")

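    # Flatten and deduplicate the per-paper ID lists without repeated list concatenation.
    all_ncbi_gene_ids = list({gid for ids in paper_to_ncbi_genes.values() for gid in ids})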
    print(f"Total unique NCBI-linked gene IDs to fetch names for: {len(all_ncbi_gene_ids)}")
    gene_names_dict = {}
    if all_ncbi_gene_ids:
        gene_names_dict, _ = fetch_gene_names(all_ncbi_gene_ids)
        logging.info(f"Retrieved names for {len(gene_names_dict)} NCBI-linked genes.")

    all_ncbi_variant_ids = list({vid for ids in paper_to_ncbi_variants.values() for vid in ids})
    print(f"Total unique NCBI-linked variant IDs to fetch details for: {len(all_ncbi_variant_ids)}")
    variant_details_dict = {}
    if all_ncbi_variant_ids:
        variant_details_dict, _ = fetch_variant_details(all_ncbi_variant_ids)
        logging.info(f"Retrieved details for {len(variant_details_dict)} NCBI-linked variants.")

    results_list = []
    print("\nProcessing and structuring results...")
    for pmid, info_data in tqdm(paper_info_dict.items(), desc="Structuring results"):
        title = info_data.get("title", "N/A")
        year = info_data.get("year", "N/A")
        journal = info_data.get("journal", "N/A")
        has_target_author = info_data.get("has_target_author", False)

        text_genes = info_data.get("text_genes", [])
        text_variants = info_data.get("text_variants", [])
        text_variant_context = info_data.get("text_variant_context", [])

        linked_gene_ids_for_paper = paper_to_ncbi_genes.get(pmid, [])
        # Keep only linked genes whose symbol was resolved; unresolved IDs are dropped.
        ncbi_gene_symbols = sorted({
            gene_names_dict[gid] for gid in linked_gene_ids_for_paper
            if gene_names_dict.get(gid) and gene_names_dict[gid] != "Unknown Symbol"
        })

        linked_variant_ids_for_paper = paper_to_ncbi_variants.get(pmid, [])

        pubmed_citation_count = citation_counts_dict.get(pmid, 0)
        s2_citation_count = s2_citation_counts_dict.get(pmid, 0)


        if (text_genes or text_variants or text_variant_context) and \
           not ncbi_gene_symbols and not linked_variant_ids_for_paper:
            logging.info(f"Paper {pmid} has text-derived entities but no NCBI-linked genes/variants.")

        processed_genes_for_this_paper = set()

        for gene_symbol in ncbi_gene_symbols:
            results_list.append({
                'PMID': pmid, 'Title': title, 'Year': year, 'Journal': journal, 'Gene': gene_symbol,
                'Variant': '', 'VariantContext': 'NCBI_Linked_Gene', 'Alleles': '',
                'Citations_PubMed': pubmed_citation_count,
                'Citations_S2': s2_citation_count,
                'AuthorFlag': has_target_author
            })
            processed_genes_for_this_paper.add(gene_symbol.upper())

        for vid in linked_variant_ids_for_paper:
            v_info = variant_details_dict.get(vid, {"rsid": f"ID:{vid}", "alleles": "N/A", "gene": ""})
            variant_gene = v_info['gene']
            if variant_gene and variant_gene != "Unknown Symbol":
                results_list.append({
                    'PMID': pmid, 'Title': title, 'Year': year, 'Journal': journal, 'Gene': variant_gene,
                    'Variant': v_info['rsid'], 'VariantContext': 'NCBI_Linked_Variant',
                    'Alleles': v_info['alleles'],
                    'Citations_PubMed': pubmed_citation_count,
                    'Citations_S2': s2_citation_count,
                    'AuthorFlag': has_target_author
                })
                processed_genes_for_this_paper.add(variant_gene.upper())
            elif v_info['rsid'] != "UnknownRSID":
                 results_list.append({
                    'PMID': pmid, 'Title': title, 'Year': year, 'Journal': journal, 'Gene': f"FromVariant_{v_info['rsid']}",
                    'Variant': v_info['rsid'], 'VariantContext': 'NCBI_Linked_Variant_GeneUnknown',
                    'Alleles': v_info['alleles'],
                    'Citations_PubMed': pubmed_citation_count,
                    'Citations_S2': s2_citation_count,
                    'AuthorFlag': has_target_author
                })

        if text_genes:
            for gene in text_genes:
                if gene.upper() not in processed_genes_for_this_paper:
                    best_variant_for_text_gene = text_variants[0] if text_variants else ""
                    context_str = "Text_Derived_Gene"
                    if best_variant_for_text_gene:
                        context_str = "Text_Gene_With_Text_Variant"
                    elif text_variant_context:
                        context_str = f"Text_Gene_Context:{text_variant_context[0]}"

                    results_list.append({
                        'PMID': pmid, 'Title': title, 'Year': year, 'Journal': journal, 'Gene': gene,
                        'Variant': best_variant_for_text_gene,
                        'VariantContext': context_str,
                        'Alleles': '',
                        'Citations_PubMed': pubmed_citation_count,
                        'Citations_S2': s2_citation_count,
                        'AuthorFlag': has_target_author
                    })

        # Treat the paper as covered if it has NCBI-linked genes or variants,
        # or a text-derived gene that was not already recorded.
        paper_has_entry_in_this_iteration = bool(
            processed_genes_for_this_paper
            or linked_variant_ids_for_paper
            or any(g.upper() not in processed_genes_for_this_paper for g in text_genes)
        )

        if not paper_has_entry_in_this_iteration:
            placeholder_gene = "NoGeneData"
            placeholder_context = "NoGeneticInfoFound"
            # Raw SciSpaCy hits are not tracked at this point, so this label covers every paper
            # for which the loaded model yielded no validated gene, variant, or variant keyword.
            if nlp and not text_genes and not text_variants and not text_variant_context:
                placeholder_gene = "SciSpaCyHit_Filtered"
                placeholder_context = "SciSpaCyHit_NotValidatedOrLinked"

            results_list.append({
                'PMID': pmid, 'Title': title, 'Year': year, 'Journal': journal, 'Gene': placeholder_gene,
                'Variant': '', 'VariantContext': placeholder_context, 'Alleles': '',
                'Citations_PubMed': pubmed_citation_count,
                'Citations_S2': s2_citation_count,
                'AuthorFlag': has_target_author
            })
            logging.info(f"Paper {pmid} ('{title[:50]}...') added with placeholder: Gene='{placeholder_gene}', Context='{placeholder_context}'.")

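    # Most-cited papers first: Semantic Scholar count, then PubMed count, then PMID (compared as a string).
    results_list.sort(key=lambda x: (x.get('Citations_S2', 0), x.get('Citations_PubMed', 0), x.get('PMID', '')), reverse=True)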

    if results_list:
        fieldnames = ['PMID', 'Title', 'Year', 'Journal', 'Gene', 'Variant', 'VariantContext',
                      'Alleles', 'Citations_PubMed', 'Citations_S2', 'AuthorFlag']
        csv_file = f"pubmed_genetic_results_{uuid.uuid4().hex[:8]}.csv"
        with open(csv_file, 'w', newline='', encoding='utf-8') as csvfile_obj:
            writer = csv.DictWriter(csvfile_obj, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(results_list)
        print(f"\nResults saved to {csv_file}")
        logging.info(f"Saved {len(results_list)} rows to {csv_file}")
    else:
        print("\nNo results to save.")
        logging.info("No results generated to save to CSV.")

    print("Script finished.")
    logging.info("Script finished.")



'hgnc_approved_genes.txt' not found. Attempting to download and create it...
Attempting to download HGNC gene data from: https://storage.googleapis.com/public-download-files/hgnc/tsv/hgnc_complete_set.txt


ERROR:root:Error downloading HGNC data: 404 Client Error: Not Found for url: https://storage.googleapis.com/public-download-files/hgnc/tsv/hgnc_complete_set.txt
ERROR:root:Failed to create 'hgnc_approved_genes.txt'.


Error downloading HGNC data: 404 Client Error: Not Found for url: https://storage.googleapis.com/public-download-files/hgnc/tsv/hgnc_complete_set.txt
Failed to create 'hgnc_approved_genes.txt'. Gene validation against HGNC list will be skipped.
Paper 33106546 found in query.
Searching PubMed...
Found 923 papers initially from PubMed search.
Fetching paper titles, authors, abstracts, year, and journal...


Fetching title batches: 100%|██████████| 93/93 [01:47<00:00,  1.15s/batch]


Fetched details for 923 papers.
Fetching citation counts...


Fetching citation batches: 100%|██████████| 93/93 [00:45<00:00,  2.03batch/s]


Fetching citation counts from Semantic Scholar...


Fetching S2 citation batches:  23%|██▎       | 42/185 [03:12<10:54,  4.58s/batch]ERROR:root:Semantic Scholar API request timed out for batch starting with PMID:35571208, attempt 1.
Fetching S2 citation batches:  71%|███████▏  | 132/185 [10:15<04:14,  4.81s/batch]ERROR:root:Semantic Scholar API request failed for batch PMID:26472029, attempt 1: 504 Server Error: Gateway Timeout for url: https://api.semanticscholar.org/graph/v1/paper/batch?fields=citationCount,externalIds
Fetching S2 citation batches: 100%|██████████| 185/185 [14:21<00:00,  4.66s/batch]


Retrieving linked NCBI genes in batches...


Fetching NCBI genes batches: 100%|██████████| 93/93 [00:36<00:00,  2.54batch/s]


Retrieving linked NCBI variants in batches...


Fetching NCBI variants batches: 100%|██████████| 93/93 [00:35<00:00,  2.59batch/s]


Total unique NCBI-linked gene IDs to fetch names for: 1560
Fetching gene names...


Processing gene name batches: 100%|██████████| 156/156 [04:42<00:00,  1.81s/batch]


Total unique NCBI-linked variant IDs to fetch details for: 582
Fetching variant details...


Processing variant detail batches: 100%|██████████| 59/59 [00:36<00:00,  1.61batch/s]



Processing and structuring results...


Structuring results: 100%|██████████| 923/923 [00:00<00:00, 52871.30it/s]


Results saved to pubmed_genetic_results_68a3f3d2.csv
Script finished.
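
The saved CSV holds one row per paper and gene (or variant), so a natural first check is how many distinct papers mention each gene. The sketch below is one possible summary, not part of the script: `papers_per_gene` is a hypothetical helper that expects rows as dictionaries with the `PMID` and `Gene` columns written above, and it skips the placeholder values the script emits.

```python
from collections import defaultdict

# Placeholder gene labels written by the script for papers without gene data.
PLACEHOLDER_GENES = {"NoGeneData", "SciSpaCyHit_Filtered"}


def papers_per_gene(rows):
    """Return [(gene, distinct_paper_count)], highest count first, then alphabetical."""
    pmids_by_gene = defaultdict(set)
    for row in rows:
        gene = row["Gene"]
        if gene in PLACEHOLDER_GENES or gene.startswith("FromVariant_"):
            continue
        # Upper-case the symbol so case variants count as one gene.
        pmids_by_gene[gene.upper()].add(row["PMID"])
    return sorted(
        ((gene, len(pmids)) for gene, pmids in pmids_by_gene.items()),
        key=lambda item: (-item[1], item[0]),
    )
```

The rows can come straight from `csv.DictReader` opened on the results file named in the output above.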



