# Python Script for Full-Text Article Fetching: A Step-by-Step Explanation

This script is a follow-up to a literature search script: it fetches the full text of scientific papers identified by their PubMed IDs (PMIDs), taking the CSV output of that earlier search as its input.

**Overall Goal:**

The primary purpose of this script is to:
1.  Read a list of PMIDs from a CSV file (presumably generated by a prior literature search script).
2.  For each PMID, visit its PubMed page to find links to the full-text article.
3.  Prioritize links to PubMed Central (PMC) or other open access repositories if available.
4.  Download the full-text content (HTML or XML) of the article from the publisher's website or repository.
5.  Save this downloaded content efficiently, avoiding re-downloading if the content for a PMID has already been fetched.
6.  Use parallel processing to speed up the fetching of multiple articles.

---

## The Pipeline: Step-by-Step Explanation

### Phase 1: Setup and Configuration

This section imports necessary Python libraries and defines global settings for the script's operation.

* **Imports:**
    * `requests`: For making HTTP requests to fetch web pages.
    * `BeautifulSoup` (from `bs4`): For parsing HTML and XML content to extract information.
    * `pandas`: For reading and handling data from CSV files.
    * `time`: For adding delays (important for politeness when scraping).
    * `pickle`: For serializing and de-serializing Python objects (saving and loading data like dictionaries).
    * `gzip`: For compressing and decompressing files (to save disk space).
    * `glob`: For finding files matching a pattern (e.g., finding all CSV files).
    * `os`: For interacting with the operating system (e.g., checking file existence, getting file modification times).
    * `logging`: For recording script activity and errors to a file.
    * `urljoin` (from `urllib.parse`): For constructing absolute URLs from relative ones.
    * `concurrent.futures`: For running tasks in parallel, making the script faster.

* **Configuration:**
    * `logging.basicConfig(...)`: Configures how logs are recorded. Errors and informational messages will be saved to `fetch_papers_errors.log`.
    * `REQUEST_HEADERS`: A dictionary defining HTTP headers to be sent with each request. The `User-Agent` mimics a web browser, which can sometimes be necessary to access websites that might block basic script requests.
    * `FETCH_DELAY_SECONDS`: A crucial delay (in seconds) applied *before* fetching the actual full-text from a publisher or PMC site. This is to be polite and avoid overwhelming servers.
    * `MAX_WORKERS`: The number of parallel threads the script will use to fetch papers simultaneously.
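
Because every worker thread sleeps independently, the two settings together bound the total request rate. The sketch below works through that arithmetic. The helper name `max_fulltext_requests_per_second` is hypothetical (it is not part of the script), and the result is only an upper bound, since the requests themselves also take time.

```python
# Values mirror the configuration described above.
FETCH_DELAY_SECONDS = 2
MAX_WORKERS = 5


def max_fulltext_requests_per_second(workers: int, delay_seconds: float) -> float:
    """Upper bound on full-text requests per second across all workers.

    Each worker sleeps `delay_seconds` before every full-text request, so one
    worker issues at most 1 / delay_seconds such requests per second.
    """
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    return workers / delay_seconds


rate = max_fulltext_requests_per_second(MAX_WORKERS, FETCH_DELAY_SECONDS)
print(f"At most {rate:.1f} full-text requests per second in total")
```

Raising `MAX_WORKERS` without also raising `FETCH_DELAY_SECONDS` therefore increases the load on external sites proportionally.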

---

### Phase 2: Helper Functions

These are reusable blocks of code that perform specific tasks.

1.  **`fetch_url_content(url, retries=1, base_retry_delay=1, timeout=1)` function:**
    * **Purpose:** A robust function to download the content of a given URL.
    * **How it works:**
        * It attempts to get the URL using `requests.get()`, including the `REQUEST_HEADERS`.
        * It has a `timeout` to prevent hanging indefinitely on a slow response.
        * `response.raise_for_status()`: Checks if the request was successful (e.g., status code 200 OK). If not (e.g., 404 Not Found, 500 Server Error), it raises an error.
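        * **Retries:** `retries` is the total number of attempts, not the number of extra tries. After a failed attempt (timeout, HTTP error other than 404, or another request error) the function sleeps for `base_retry_delay` multiplied by the attempt number, so the wait grows linearly, and then tries again. A 404 stops the loop immediately. With the default `retries=1`, the URL is tried once and never retried.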
        * Logs success, warnings, and errors.
    * **Output:** Returns the `response` object if successful, otherwise `None`.

2.  **`get_full_text_data(pmid: str)` function:**
    * **Purpose:** This is the core function that takes a single PMID, finds its full-text link on PubMed, and then attempts to download the content.
    * **How it works (step-by-step for one PMID):**
        1.  **Construct PubMed URL:** Creates the URL for the PMID's page on PubMed (e.g., `https://pubmed.ncbi.nlm.nih.gov/YOUR_PMID/`).
        2.  **Fetch PubMed Page:** Uses `fetch_url_content` to download the HTML of this PubMed page. If this fails, it logs an error and returns an error dictionary.
        3.  **Parse PubMed Page for Links:**
            * Uses `BeautifulSoup` to parse the HTML.
            * It looks for a `div` with class `full-text-links-list` which usually contains links to the full article.
            * If that `div` isn't found, it looks for a single prominent link (class `link-item dialog-focus`).
            * If no links are found, it logs a warning and returns an error dictionary.
        4.  **Prioritize Full-Text Links:**
            * It iterates through the found links.
            * It gives priority to links that point to **PubMed Central (PMC)** or **Europe PMC** (e.g., containing "ncbi.nlm.nih.gov/pmc" or "europepmc.org" in the URL, or "pmc" in the link text or attributes). These sites often provide open access or more structured XML versions of articles.
            * If a PMC link is found, it's chosen, and the script moves on.
            * If no PMC link is found after checking all links, it defaults to using the first link found on the PubMed page.
            * If no valid URL can be resolved, it returns an error.
        5.  **Fetch Full-Text Content:**
            * **Crucial Delay:** `time.sleep(FETCH_DELAY_SECONDS)` is called *before* accessing the chosen `full_text_target_url` (the publisher's site or PMC). This helps prevent being blocked or seen as aggressive.
            * Uses `fetch_url_content` again to download the actual article content from the target URL.
        6.  **Determine Content Type and Return:**
            * If the article content is successfully fetched:
                * It checks the `Content-Type` header of the response. If it indicates XML, the `determined_type` is set to 'xml'.
                * Otherwise, for a PMC link whose URL contains `PMC` and does not end in `.pdf` or `.epub`, it inspects the body: if the text starts with `<` and its first 1000 characters contain `<article` or `<front>`, it's marked as 'xml'. In every other case the type stays 'html'.
                * It returns a dictionary containing the `pmid`, the `determined_type` ('html', 'xml', or 'error'), the downloaded `content` (text), and the `final_url` from which the content was fetched.
            * If fetching the full content fails, it returns an error dictionary.
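
The decision logic of the two helpers above can be isolated into pure functions that run without any network access. This is a minimal sketch, not the script's own code: the function names (`retry_delays`, `choose_full_text_url`, `determine_type`) and the dictionary shape used for links are assumptions made for illustration, and the sample URLs and IDs are placeholders.

```python
from urllib.parse import urljoin


def retry_delays(retries, base_retry_delay):
    """Delays slept between attempts: base * 1, base * 2, ... (none after the last)."""
    return [base_retry_delay * (attempt + 1) for attempt in range(retries - 1)]


def choose_full_text_url(page_url, links):
    """Return (url, is_pmc_link), preferring PMC / Europe PMC links.

    `links` is a list of dicts with optional keys 'href', 'text', 'ga_action'
    (a hypothetical stand-in for the BeautifulSoup tags used in the script).
    """
    for link in links:
        href = link.get("href")
        if not href:
            continue
        url = urljoin(page_url, href)
        markers = (
            href.lower(),
            link.get("text", "").lower(),
            link.get("ga_action", "").lower(),
        )
        if (
            "ncbi.nlm.nih.gov/pmc" in url
            or "europepmc.org" in url
            or any("pmc" in marker for marker in markers)
        ):
            return url, True
    # No PMC link: fall back to the first link, as the script does.
    if links and links[0].get("href"):
        return urljoin(page_url, links[0]["href"]), False
    return None, False


def determine_type(content_type, text, url, is_pmc_link):
    """Classify a response as 'xml' or 'html' using the rules described above."""
    if "xml" in content_type.lower():
        return "xml"
    if is_pmc_link and "PMC" in url and not url.endswith((".pdf", ".epub")):
        head = text[:1000]
        if text.strip().startswith("<") and ("<article" in head or "<front>" in head):
            return "xml"
    return "html"


# Placeholder inputs for demonstration only.
page = "https://pubmed.ncbi.nlm.nih.gov/12345/"
links = [
    {"href": "https://doi.org/10.1000/example", "text": "Publisher Site"},
    {"href": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000000/", "text": "Free PMC article"},
]
print(retry_delays(3, 1))
print(choose_full_text_url(page, links))
print(determine_type("text/html; charset=utf-8", "<html></html>", page, False))
```

One consequence of the substring test on the `Content-Type` header is that `application/xhtml+xml` is also classified as 'xml'.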

---

### Phase 3: Main Script Logic (`main()` function)

This function orchestrates the overall process.

1.  **Find Input CSV:**
    * `glob.glob('pubmed_genetic_results_*.csv')`: Searches the current directory for CSV files starting with "pubmed\_genetic\_results\_".
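    * `latest_csv_file = max(csv_files, key=os.path.getctime)`: Selects the CSV file with the latest `ctime`. Note that `os.path.getctime` returns the creation time on Windows but the time of the last metadata change on Unix; `os.path.getmtime` would give the last modification time on both. The selected file is assumed to be the input file containing PMIDs.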
    * Logs which CSV file is being used.

2.  **Load PMIDs from CSV:**
    * `df = pd.read_csv(latest_csv_file)`: Reads the selected CSV file into a pandas DataFrame.
    * Error handling is in place for `FileNotFoundError` or other issues during CSV reading.
    * Checks if a 'PMID' column exists in the DataFrame.
    * `all_pmids_from_csv = df['PMID'].astype(str).unique().tolist()`: Extracts all unique PMIDs from the 'PMID' column and converts them to a list of strings.

3.  **Load Existing Content (if any):**
    * `output_filename = "content_dict.pkl.gz"`: Defines the name of the file where downloaded content will be stored (a compressed pickle file).
    * `content_dict = {}`: Initializes an empty dictionary to store results.
    * **Resuming Progress:** If `output_filename` already exists, the script tries to load its content using `gzip.open` and `pickle.load`. This lets a later run skip PMIDs that an earlier run already stored. The file is written only once, at the end of a run, so a run that is interrupted midway saves nothing.
    * If loading fails, it starts with an empty `content_dict`.

4.  **Determine PMIDs to Fetch:**
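    * `pmids_to_fetch = [pmid for pmid in all_pmids_from_csv if pmid not in content_dict]`: Creates a list of PMIDs that are in the CSV but not yet in the `content_dict` (i.e., new PMIDs that need to be fetched). Because failed fetches are also stored in `content_dict` as entries of type 'error', a PMID that failed in an earlier run is skipped as well and is not retried.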
    * If `pmids_to_fetch` is empty, it means all PMIDs from the CSV have already been processed, so the script prints a message and exits.

5.  **Parallel Fetching of New Content:**
    * If there are `pmids_to_fetch`:
        * It uses `concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS)`. This creates a pool of worker threads that can execute tasks in parallel.
        * `future_to_pmid = {executor.submit(get_full_text_data, pmid): pmid for pmid in pmids_to_fetch}`: For each `pmid` in `pmids_to_fetch`, it submits the `get_full_text_data` function (with that `pmid` as an argument) to the executor. `executor.submit` immediately returns a `Future` object, which represents the pending result of the task. This dictionary maps `Future` objects back to their corresponding PMIDs.
        * `tqdm(...)`: Wraps the loop in a progress bar. `tqdm` is a third-party package and must be imported with `from tqdm import tqdm`. `concurrent.futures.as_completed(future_to_pmid)` yields `Future` objects as they complete (not necessarily in the order they were submitted).
        * As each `future` completes:
            * `data = future.result()`: Retrieves the result from the completed task (the dictionary returned by `get_full_text_data`).
            * If an exception occurred within the task, it's logged, and an error dictionary is appended to `results_from_fetch`.
            * Otherwise, the successful `data` is appended.

6.  **Update `content_dict` and Save Results:**
    * It iterates through the `results_from_fetch`.
    * For each `result_item`:
        * The `pmid` is extracted.
        * The `content_dict` is updated with the new data for that `pmid`. It's careful not to overwrite potentially valid older data with a new error if a PMID was somehow re-queued (though the main logic should prevent this).
        * A counter `newly_processed_count` tracks how many PMIDs were successfully processed in *this run*.
    * **Saving Data:** If any new processing was attempted (`results_from_fetch` is not empty):
        * The entire (updated) `content_dict` is saved to `output_filename` using `pickle.dump` and `gzip.open` for compression.
        * Logs and prints information about the number of newly processed PMIDs and the total entries in the saved file.
    * If no new PMIDs were processed, it prints a message indicating that.

7.  **Final Report:**
    * Logs and prints the total number of new PMIDs successfully processed in the current run and the total number of entries now in the `content_dict.pkl.gz` file.

8.  **Script Execution Trigger (`if __name__ == "__main__":`)**
    * This standard Python construct ensures that the `main()` function is called only when the script is executed directly (not when it's imported as a module into another script).
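
Steps 3 to 6 above (load, filter, fetch in parallel, merge, save) can be exercised end to end without network access. The sketch below is an illustration under stated assumptions: `fake_fetch` is a hypothetical stand-in for `get_full_text_data` that simulates failures for odd-numbered PMIDs, `load_store`, `save_store`, and `run_once` are names invented here, and the store is written to a temporary directory rather than the working directory.

```python
import concurrent.futures
import gzip
import os
import pickle
import tempfile


def fake_fetch(pmid):
    """Stand-in for get_full_text_data: no network, odd-numbered PMIDs 'fail'."""
    if int(pmid) % 2:
        return {"pmid": pmid, "type": "error", "content": "simulated failure", "final_url": None}
    return {"pmid": pmid, "type": "html", "content": f"<html>{pmid}</html>",
            "final_url": f"https://example.org/{pmid}"}


def load_store(path):
    """Load the gzip-compressed pickle, or start empty if it does not exist."""
    if not os.path.exists(path):
        return {}
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


def save_store(path, store):
    """Write the whole dictionary back as a gzip-compressed pickle."""
    with gzip.open(path, "wb") as f:
        pickle.dump(store, f)


def run_once(pmids, path, fetch=fake_fetch, max_workers=2):
    """One run: skip PMIDs already stored, fetch the rest in parallel, save."""
    store = load_store(path)
    to_fetch = [p for p in pmids if p not in store]
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_pmid = {executor.submit(fetch, p): p for p in to_fetch}
        for future in concurrent.futures.as_completed(future_to_pmid):
            pmid = future_to_pmid[future]
            try:
                results.append(future.result())
            except Exception as exc:
                results.append({"pmid": pmid, "type": "error",
                                "content": f"Exception: {exc}", "final_url": None})
    # to_fetch only holds unseen PMIDs, so a plain assignment is enough here.
    for item in results:
        store[item["pmid"]] = {k: item[k] for k in ("type", "content", "final_url")}
    if results:
        save_store(path, store)
    return to_fetch, store


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "content_dict.pkl.gz")
    first, _ = run_once(["1", "2", "3"], path)
    second, store = run_once(["1", "2", "3", "4"], path)

print(first)          # PMIDs fetched in the first run
print(second)         # PMIDs fetched in the second run
print(sorted(store))  # keys stored after both runs
```

The second run fetches only PMID `4`: the simulated failures for `1` and `3` were stored as error entries in the first run, so the membership filter skips them. Retrying failed PMIDs would require filtering on the entry's `type` as well.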

---

This script is a practical tool for researchers needing to gather full-text data for large sets of articles, with considerations for efficiency (parallelism, resuming progress) and politeness to web servers (delays).

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import pickle
import gzip
import glob
import os
import logging
from urllib.parse import urljoin
import concurrent.futures # For parallelization
from tqdm import tqdm # Progress bar used in main(); third-party package

# --- Configuration ---
logging.basicConfig(filename='fetch_papers_errors.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
# This delay is applied *before* fetching the final full-text URL from a publisher/PMC site by each worker.
FETCH_DELAY_SECONDS = 2 # Seconds each worker waits before a full-text request; keeps the script polite.
MAX_WORKERS = 5 # Number of parallel threads. Adjust based on your network and CPU. Too many can still cause issues.

# --- Helper Functions ---

def fetch_url_content(url, retries=1, base_retry_delay=1, timeout=1):
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=REQUEST_HEADERS, timeout=timeout, allow_redirects=True)
            response.raise_for_status()
            logging.info(f"Successfully fetched {url} with status {response.status_code}")
            return response
        except requests.exceptions.Timeout:
            logging.warning(f"Timeout on attempt {attempt + 1}/{retries} for {url}")
        except requests.exceptions.HTTPError as e:
            logging.warning(f"HTTP error {e.response.status_code} on attempt {attempt + 1}/{retries} for {url}")
            if e.response.status_code == 404: # Don't retry if not found
                break
        except requests.exceptions.RequestException as e:
            logging.warning(f"Request error on attempt {attempt + 1}/{retries} for {url}: {e}")

        if attempt < retries - 1:
            # Simple incremental backoff for retries
            current_delay = base_retry_delay * (attempt + 1)
            logging.info(f"Waiting {current_delay} seconds before retry for {url}...")
            time.sleep(current_delay)
    logging.error(f"Failed to fetch {url} (gave up after at most {retries} attempt(s)).")
    return None

def get_full_text_data(pmid: str):
    pubmed_base_url = 'https://pubmed.ncbi.nlm.nih.gov/'
    pmid_url = f'{pubmed_base_url}{pmid}/'
    # Logging the start of processing for a PMID is now better done before submitting to the pool
    # logging.info(f"Processing PMID: {pmid} - URL: {pmid_url}")

    pubmed_response = fetch_url_content(pmid_url)
    if not pubmed_response or not pubmed_response.text:
        logging.error(f"Failed to fetch PubMed page for PMID {pmid}")
        return {"pmid": pmid, "type": "error", "content": "Failed to fetch PubMed page", "final_url": pmid_url}

    soup = BeautifulSoup(pubmed_response.text, 'html.parser')
    full_text_div = soup.find('div', class_='full-text-links-list')
    links_found = []

    if not full_text_div:
        logging.debug(f"No 'full-text-links-list' div found for PMID {pmid}. Checking for single prominent link.")
        single_link = soup.find('a', class_='link-item dialog-focus')
        if single_link and single_link.get('href'):
            links_found = [single_link]
            logging.debug(f"Found single prominent full text link for PMID {pmid}")
        else:
            logging.warning(f"No full text links (list or single) found for PMID {pmid}")
            return {"pmid": pmid, "type": "error", "content": "No full-text links found on PubMed page", "final_url": pmid_url}
    else:
        links_found = full_text_div.find_all('a', class_='link-item')

    if not links_found: # Should be caught above, but as a safeguard
        logging.warning(f"No links extracted from full_text_div or single link for PMID {pmid}")
        return {"pmid": pmid, "type": "error", "content": "No links extracted from full_text_div", "final_url": pmid_url}

    full_text_target_url = None
    is_pmc_link = False

    for link in links_found:
        href = link.get('href')
        if not href:
            continue

        current_link_url = urljoin(pmid_url, href)
        link_text_lower = link.get_text(strip=True).lower()
        href_lower = href.lower()

        if "ncbi.nlm.nih.gov/pmc" in current_link_url or "europepmc.org" in current_link_url or \
           "pmc" in link.get('data-ga-action', '').lower() or "pmc" in link_text_lower or "pmc" in href_lower:
            full_text_target_url = current_link_url
            is_pmc_link = True
            logging.info(f"Prioritized PMC link for PMID {pmid}: {full_text_target_url}")
            break

    if not full_text_target_url and links_found:
        first_link_href = links_found[0].get('href')
        if first_link_href:
            full_text_target_url = urljoin(pmid_url, first_link_href)
            logging.info(f"Using first available link (non-PMC priority) for PMID {pmid}: {full_text_target_url}")

    if not full_text_target_url:
        logging.error(f"No valid full text URL could be resolved for PMID {pmid}")
        return {"pmid": pmid, "type": "error", "content": "No valid full-text URL resolved", "final_url": pmid_url}

    logging.info(f"PMID {pmid}: Attempting to fetch full content from: {full_text_target_url}")
    time.sleep(FETCH_DELAY_SECONDS) # Crucial delay before hitting external sites

    article_response = fetch_url_content(full_text_target_url)
    if article_response and article_response.text:
        content_type_header = article_response.headers.get('Content-Type', '').lower()
        determined_type = 'html'

        if 'xml' in content_type_header:
            determined_type = 'xml'
        elif is_pmc_link and "PMC" in full_text_target_url and not full_text_target_url.endswith(('.pdf', '.epub')):
            if article_response.text.strip().startswith('<') and \
               ("<article" in article_response.text[:1000] or "<front>" in article_response.text[:1000]):
                determined_type = 'xml'
                logging.info(f"Detected XML-like content from PMC for PMID {pmid} by inspection.")

        logging.info(f"Successfully retrieved content for PMID {pmid} from {full_text_target_url}. Type: {determined_type}, Length: {len(article_response.text)}")
        return {
            "pmid": pmid,
            "type": determined_type,
            "content": article_response.text,
            "final_url": full_text_target_url
        }
    else:
        logging.error(f"Failed to fetch full content for PMID {pmid} from {full_text_target_url}")
        return {"pmid": pmid, "type": "error", "content": f"Failed to fetch from {full_text_target_url}", "final_url": full_text_target_url}

# --- Main Script ---
def main():
    csv_files = glob.glob('pubmed_genetic_results_*.csv')
    if not csv_files:
        logging.error("No pubmed_genetic_results_*.csv files found in the current directory.")
        print("Error: No pubmed_genetic_results_*.csv files found.")
        return
    latest_csv_file = max(csv_files, key=os.path.getctime)
    logging.info(f"Using input CSV file: {latest_csv_file}")
    print(f"Using input CSV file: {latest_csv_file}")

    try:
        df = pd.read_csv(latest_csv_file)
    except FileNotFoundError:
        logging.error(f"CSV file {latest_csv_file} not found.")
        print(f"Error: {latest_csv_file} not found.")
        return
    except Exception as e:
        logging.error(f"Error reading CSV {latest_csv_file}: {e}")
        print(f"Error reading CSV {latest_csv_file}: {e}")
        return

    if 'PMID' not in df.columns:
        logging.error("CSV file must contain a 'PMID' column.")
        print("Error: CSV file must contain a 'PMID' column.")
        return

    all_pmids_from_csv = df['PMID'].astype(str).unique().tolist()
    logging.info(f"Found {len(all_pmids_from_csv)} unique PMIDs to process from {latest_csv_file}.")
    print(f"Found {len(all_pmids_from_csv)} unique PMIDs to process.")

    content_dict = {}
    output_filename = "content_dict.pkl.gz"

    if os.path.exists(output_filename):
        try:
            with gzip.open(output_filename, 'rb') as f_load:
                content_dict = pickle.load(f_load)
            logging.info(f"Loaded {len(content_dict)} existing entries from {output_filename}")
            print(f"Loaded {len(content_dict)} existing entries from {output_filename}")
        except Exception as e:
            logging.warning(f"Could not load existing {output_filename}: {e}. Starting fresh.")
            content_dict = {}

    pmids_to_fetch = [pmid for pmid in all_pmids_from_csv if pmid not in content_dict]
    if not pmids_to_fetch:
        print("All PMIDs from the CSV have already been processed. Nothing new to fetch.")
        logging.info("All PMIDs from the CSV have already been processed.")
        return

    logging.info(f"Attempting to fetch content for {len(pmids_to_fetch)} new PMIDs.")
    print(f"Attempting to fetch content for {len(pmids_to_fetch)} new PMIDs.")

    # Using ThreadPoolExecutor for parallel fetching
    # The main rate limiting per external site is handled by FETCH_DELAY_SECONDS within get_full_text_data
    # MAX_WORKERS limits simultaneous calls to PubMed for initial pages.
    results_from_fetch = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        future_to_pmid = {executor.submit(get_full_text_data, pmid): pmid for pmid in pmids_to_fetch}

        for future in tqdm(concurrent.futures.as_completed(future_to_pmid), total=len(pmids_to_fetch), desc="Fetching paper content", unit="PMID"):
            pmid = future_to_pmid[future]
            try:
                data = future.result()
                if data:
                    results_from_fetch.append(data)
            except Exception as exc:
                logging.error(f"PMID {pmid} generated an exception during parallel execution: {exc}")
                results_from_fetch.append({"pmid": pmid, "type": "error", "content": f"Exception: {exc}", "final_url": None})

    newly_processed_count = 0
    for result_item in results_from_fetch:
        pmid = result_item["pmid"]
        # Ensure we don't overwrite potentially valid older data with a new error if it was somehow re-queued
        if pmid not in content_dict or content_dict[pmid].get("type") == "error":
            content_dict[pmid] = {
                "type": result_item["type"],
                "content": result_item["content"],
                "final_url": result_item["final_url"]
            }
        if result_item["type"] != "error":
            newly_processed_count += 1


    if results_from_fetch: # Save if any new processing was attempted
        try:
            with gzip.open(output_filename, 'wb') as f_save:
                pickle.dump(content_dict, f_save)
            logging.info(f"Saved results: {newly_processed_count} new PMIDs successfully processed in this run.")
            logging.info(f"Total entries in {output_filename}: {len(content_dict)}")
            print(f"\nSaved results. Total entries in {output_filename}: {len(content_dict)}")
        except Exception as e_save:
            logging.error(f"Error saving final results to {output_filename}: {e_save}")
            print(f"\nError saving final results: {e_save}")
    else:
        print("No new PMIDs were processed in this run.")


    logging.info(f"Finished processing. Total new PMIDs successfully processed in this run: {newly_processed_count}")
    print(f"\nFetching complete. Total successfully processed new PMIDs: {newly_processed_count}. Total entries in {output_filename}: {len(content_dict)}")

if __name__ == "__main__":
    main()