# Processing Data, Gaining Some Insights and Preparing Tokenizations

This is a summary of the entire Jupyter Notebook:

1. **Markdown Cells:**
    - Titles and section headers to organize the notebook.
    - Descriptions and explanations of various steps and processes.

2. **Imports and Setup:**
    - Import necessary libraries such as `pandas`, `pyarrow.parquet`, `tqdm`, `matplotlib.pyplot`, `scispacy`, `spacy`, `transformers`, `nltk`, etc.
    - Define utility functions for reading and saving parquet files in batches.
    - Define functions for tokenization using simple and Hugging Face tokenizers.
    - Define functions for disease detection using dictionary-based and SciSpacy approaches.

3. **Data Loading and Initial Processing:**
    - Load the main dataset from a parquet file.
    - Perform initial data cleaning, such as removing rows with missing abstracts and filtering rows based on date.
    - Select relevant columns for further processing.

4. **Tokenization:**
    - Tokenize the `title` and `abstract` columns using both simple and Hugging Face tokenizers.
    - Remove stopwords and punctuation from the tokenized data.
    - Analyze token frequencies and generate insights.

5. **Disease Detection:**
    - Implement dictionary-based disease detection by keeping only tokens present in a predefined disease dictionary.
    - Implement SciSpacy-based disease detection using the `en_ner_bc5cdr_md` model to extract disease entities from the `title` and `abstract` columns.
    - Analyze the results of disease detection, including counting the number of detected entities and categorizing rows based on the presence of entities.

6. **Merging and Saving Results:**
    - Merge the results of disease detection with the main dataset.
    - Save the final merged dataset to a parquet file.

7. **Additional Analysis:**
    - Perform additional analysis on the merged dataset, such as counting the number of rows with entities in both `title` and `abstract` columns.
    - Explore potential next steps, such as using UMLS or MeSH for further disease categorization.

8. **Environment Management:**
    - Reset the environment and release memory to ease computer work.
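
As a rough illustration of the stopword removal and frequency analysis in step 4, a minimal sketch follows. The tiny `STOPWORDS` set and the sample token lists are made up for the example; the notebook imports `nltk`, so its stopword list may well be what is used there.

```python
from collections import Counter
import string

# Hypothetical, deliberately tiny stopword set (the notebook may use nltk's list instead)
STOPWORDS = {"the", "of", "in", "a", "and", "with"}

def clean_tokens(tokens):
    """Drop stopwords and tokens that consist only of punctuation."""
    return [
        t for t in tokens
        if t not in STOPWORDS and not all(ch in string.punctuation for ch in t)
    ]

# Made-up token lists standing in for tokenized titles
docs = [
    ["the", "treatment", "of", "asthma", "in", "children", "."],
    ["asthma", "and", "obesity", ":", "a", "review"],
]

# Count how often each remaining token occurs across all documents
freq = Counter(tok for doc in docs for tok in clean_tokens(doc))
print(freq.most_common(3))
```

Counting after cleaning keeps very common function words from dominating the frequency table.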

Here is a summary of the key variables and their values:

- `base_path`: `'h:\\000_Projects\\01_GitHub\\05_PythonProjects\\PubMedResearch'`
- `batch_size`: `100000`
- `col`: `'disease_abstract_spacy'`
- `df`: DataFrame with 1,057,871 rows and 22 columns.
- `file_path`: `'h:\\000_Projects\\01_GitHub\\05_PythonProjects\\PubMedResearch\\Data\\2.Processed\\ModellingData\\P5_final_new.parquet'`
- `final_file`: `'P6_merged_tokens.parquet'`
- `output_folder`: `'h:\\000_Projects\\01_GitHub\\05_PythonProjects\\PubMedResearch\\Data\\2.Processed\\ModellingData'`
- `result_path`: `'h:\\000_Projects\\01_GitHub\\05_PythonProjects\\PubMedResearch\\Data\\2.Processed\\ModellingData\\P6_merged_tokens.parquet'`
- `token_columns`: `['disease_title_tokens_simple', 'disease_title_tokens_hf', 'disease_abstract_tokens_simple', 'disease_abstract_tokens_hf', 'disease_title_spacy', 'disease_mesh_terms_spacy', 'disease_abstract_spacy']`

This summary provides an overview of the notebook's structure, key steps, and important variables.

1. **Markdown Cells:**
    - Titles and section headers to organize the notebook.
    - Descriptions and explanations of various steps and processes.

2. **Imports and Setup:**
    - Import necessary libraries such as `pandas`, `pyarrow.parquet`, `tqdm`, `matplotlib.pyplot`, `scispacy`, `spacy`, `transformers`, `nltk`, etc.
    - Define utility functions for reading and saving parquet files in batches. These functions are used multiple times throughout the notebook to handle large datasets efficiently.
    - Define functions for tokenization using simple and Hugging Face tokenizers. Tokenization is a crucial step for text processing and analysis.
    - Define functions for disease detection using dictionary-based and SciSpacy approaches. SciSpacy is a specialized library for biomedical text processing, and it helps in extracting disease entities from text.

3. **Data Loading and Initial Processing:**
    - Load the main dataset from a parquet file. Parquet is a columnar storage file format optimized for large-scale data processing.
    - Perform initial data cleaning, such as removing rows with missing abstracts and filtering rows based on date. This ensures the dataset is clean and relevant for further analysis.
    - Select relevant columns for further processing to focus on the necessary data.

4. **Tokenization:**
    - Tokenize the `title` and `abstract` columns using both simple and Hugging Face tokenizers. Tokenization breaks down text into individual tokens (words or subwords).
    - Remove stopwords and punctuation from the tokenized data to focus on meaningful words.
    - Analyze token frequencies and generate insights to understand the most common terms in the dataset.

5. **Disease Detection:**
    - Implement dictionary-based disease detection by keeping only tokens present in a predefined disease dictionary. This method relies on a predefined list of disease terms.
    - Implement SciSpacy-based disease detection using the `en_ner_bc5cdr_md` model to extract disease entities from the `title` and `abstract` columns. SciSpacy uses advanced NLP techniques to identify biomedical entities.
    - Analyze the results of disease detection, including counting the number of detected entities and categorizing rows based on the presence of entities.

6. **Merging and Saving Results:**
    - Merge the results of disease detection with the main dataset to create a comprehensive dataset with both original and processed data.
    - Save the final merged dataset to a parquet file for future use.

7. **Additional Analysis:**
    - Perform additional analysis on the merged dataset, such as counting the number of rows with entities in both `title` and `abstract` columns.
    - Explore potential next steps, such as using UMLS or MeSH for further disease categorization. These are standardized vocabularies for biomedical terms.

8. **Environment Management:**
    - Reset the environment and release memory to ease computer work. This includes multiple instances of `gc.collect()` to manually trigger garbage collection and free up RAM.

---

## Summary of the Jupyter Notebook

### Purpose:
This Jupyter Notebook is designed to process a large dataset of PubMed abstracts, perform tokenization, detect disease entities, and analyze the results. The ultimate goal is to gain insights into the prevalence and distribution of diseases mentioned in the abstracts.

### Structure and Key Steps:

1. **Markdown Cells:**
    - **Purpose:** Organize the notebook with titles and section headers.
    - **Content:** Descriptions and explanations of various steps and processes to guide the reader.

2. **Imports and Setup:**
    - **Purpose:** Import necessary libraries and set up the environment.
    - **Content:**
        - Libraries: `pandas`, `pyarrow.parquet`, `tqdm`, `matplotlib.pyplot`, `scispacy`, `spacy`, `transformers`, `nltk`, etc.
        - Utility Functions: Functions for reading and saving parquet files in batches.
        - Tokenization Functions: Functions for tokenizing text using simple and Hugging Face tokenizers.
        - Disease Detection Functions: Functions for detecting diseases using dictionary-based and SciSpacy approaches.

3. **Data Loading and Initial Processing:**
    - **Purpose:** Load and clean the main dataset.
    - **Content:**
        - Load the dataset from a parquet file.
        - Perform initial data cleaning, such as removing rows with missing abstracts and filtering rows based on date.
        - Select relevant columns for further processing.

4. **Tokenization:**
    - **Purpose:** Tokenize the text data for further analysis.
    - **Content:**
        - Tokenize the `title` and `abstract` columns using both simple and Hugging Face tokenizers.
        - Remove stopwords and punctuation from the tokenized data.
        - Analyze token frequencies to generate insights.

5. **Disease Detection:**
    - **Purpose:** Detect disease entities in the text data.
    - **Content:**
        - Dictionary-Based Detection: Keep only tokens present in a predefined disease dictionary.
        - SciSpacy-Based Detection: Use the `en_ner_bc5cdr_md` model to extract disease entities from the `title` and `abstract` columns.
        - Analyze the results, including counting the number of detected entities and categorizing rows based on the presence of entities.

6. **Merging and Saving Results:**
    - **Purpose:** Combine the results of disease detection with the main dataset and save the final dataset.
    - **Content:**
        - Merge the results of disease detection with the main dataset.
        - Save the final merged dataset to a parquet file.

7. **Additional Analysis:**
    - **Purpose:** Perform further analysis on the merged dataset.
    - **Content:**
        - Count the number of rows with entities in both `title` and `abstract` columns.
        - Explore potential next steps, such as using UMLS or MeSH for further disease categorization.

8. **Environment Management:**
    - **Purpose:** Manage the environment to optimize performance.
    - **Content:**
        - Reset the environment and release memory to ease computer work.
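
The dictionary-based detection in step 5 can be sketched roughly as below. This is only an illustration: `DISEASE_TERMS` is a hypothetical mini-dictionary and `keep_disease_tokens` is a made-up helper name, since the notebook's actual dictionary and function are not shown in this summary.

```python
# Hypothetical mini-dictionary; the notebook's real disease dictionary is not shown here
DISEASE_TERMS = {"asthma", "diabetes", "cancer", "obesity"}

def keep_disease_tokens(tokens, dictionary=DISEASE_TERMS):
    """Return only the tokens found in the disease dictionary, preserving order."""
    return [t for t in tokens if t.lower() in dictionary]

# Made-up tokenized title
title_tokens = ["risk", "of", "diabetes", "in", "patients", "with", "asthma"]
print(keep_disease_tokens(title_tokens))
# ['diabetes', 'asthma']
```

Matching token by token cannot catch multi-word names such as "heart failure", which is one reason an NER model is a useful complement to the dictionary approach.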

### Key Variables and Their Values:

- `base_path`: `'h:\\000_Projects\\01_GitHub\\05_PythonProjects\\PubMedResearch'`
- `batch_size`: `100000`
- `col`: `'disease_abstract_spacy'`
- `df`: DataFrame with 1,057,871 rows and 22 columns.
- `file_path`: `'h:\\000_Projects\\01_GitHub\\05_PythonProjects\\PubMedResearch\\Data\\2.Processed\\ModellingData\\P5_final_new.parquet'`
- `final_file`: `'P6_merged_tokens.parquet'`
- `output_folder`: `'h:\\000_Projects\\01_GitHub\\05_PythonProjects\\PubMedResearch\\Data\\2.Processed\\ModellingData'`
- `result_path`: `'h:\\000_Projects\\01_GitHub\\05_PythonProjects\\PubMedResearch\\Data\\2.Processed\\ModellingData\\P6_merged_tokens.parquet'`
- `token_columns`: `['disease_title_tokens_simple', 'disease_title_tokens_hf', 'disease_abstract_tokens_simple', 'disease_abstract_tokens_hf', 'disease_title_spacy', 'disease_mesh_terms_spacy', 'disease_abstract_spacy']`

This enhanced summary provides a comprehensive overview of the notebook's structure, key steps, purposes, and important variables.

## Table Of Contents (ToC)

1. [Utility Functions](#99-utility-functions)
    - 1) [read_parquet_in_batches - function](#1-read_parquet_in_batches---function)
    - 2) [save_batches_to_parquet - function](#2-save_batches_to_parquet---function)

2. [Early Processing](#early-processing)
    - Step 0: [Setup](#step-0-setup)
    - Step 1: [Filtering rows + removal of missing records](#step-1-filtering-rows--removal-of-missing-records)

3. [Further Processing](#further-processing)
    - [Variable: `uid` + `parsed_date`](#variable-uid--parsed_date)
    - [Title + Abstract](#title--abstract)
        - [0) Insight into simple and Hugging Face tokenization](#0-insight-into-simple-and-hugging-face-tokenization)
            - [Conclusions from 0)](#conclusions-from-0)

4. [Variables](#variables)
    - [1) Dictionary-Based Disease Detection](#1-dictionary-based-disease-detection)
    - [2) SciSpacy (or spaCy) Biomedical NER Approach](#2-scispacy-or-spacy-biomedical-ner-approach)
    - [WHAT'S NEXT?](#whats-next)
    - [Further Analysis Of Abstract And Title](#further-analysis-of-abstract-and-title)
    - [3) EMBEDDINGS - maybe to get into later, very heavy for computer](#3-embeddings)

5. [Additional Variables](#additional-variables)
    - [mesh_terms](#mesh_terms)
    - [journal](#journal)
    - [abstract](#abstract)
    - [authors](#authors)
    - [affiliations](#affiliations)
    - [keywords](#keywords)
    - [coi_statement](#coi_statement)



## `99)` Utility Functions

### 1) read_parquet_in_batches - function

In [2]:
import pandas as pd
import pyarrow.parquet as pq
from tqdm.auto import tqdm

def read_parquet_in_batches_with_progress(file_path, batch_size):
    """
    Read a Parquet file in fixed-size row batches with a progress bar and per-chunk logging.

    Args:
        file_path (str): Path to the Parquet file.
        batch_size (int): Number of rows per batch.

    Returns:
        pd.DataFrame: Combined DataFrame after processing all batches.
    """
    # Open the Parquet file
    parquet_file = pq.ParquetFile(file_path)
    
    # Total number of rows in the file
    total_rows = parquet_file.metadata.num_rows
    
    # Initialize a list to store DataFrame chunks
    all_chunks = []
    
    # Initialize the progress bar
    with tqdm(total=total_rows, desc="Processing Batches", unit="rows") as pbar:
        # Enumerate batches for logging
        for batch_number, batch in enumerate(parquet_file.iter_batches(batch_size=batch_size), start=1):
            # Convert the batch to a Pandas DataFrame
            df_batch = batch.to_pandas()
            
            # Simulate processing (add custom logic here if necessary)
            all_chunks.append(df_batch)
            
            # Update the progress bar
            pbar.update(len(df_batch))
            
            # Print per-chunk information
            print(f"Processed Chunk {batch_number}: {len(df_batch)} rows")
    
    # Combine all chunks into a single DataFrame
    combined_df = pd.concat(all_chunks, ignore_index=True)
    
    return combined_df

````
# EXAMPLE USAGE
if __name__ == "__main__":
    file_path = "Data/2.Processed/ModellingData/P1_all.parquet"
    batch_size = 100_000  # Define your desired chunk size

    df = read_parquet_in_batches_with_progress(file_path, batch_size)

    print(f"\nFinal DataFrame with {len(df)} rows:")
    print(df.head())  # a bare df.head() displays nothing inside an if-block
````


### 2) save_batches_to_parquet - function

In [3]:
import os
import pandas as pd
from tqdm import tqdm

def save_and_merge_in_batches(
    df: pd.DataFrame,
    batch_size: int,
    output_folder: str,
    final_filename: str = "final_merged.parquet",
    temp_batch_prefix: str = "temp_batch_"
):
    """
    Splits 'df' into multiple batches (size = batch_size), writes each batch to a Parquet file,
    then merges them into one final Parquet, with a progress bar showing how many batches are done.

    Steps:
    ------
    1) Creates subfolder 'temp_batches' in output_folder for batch files.
    2) For each chunk of rows:
       - Writes it to 'temp_batch_X.parquet'
       - Increments a progress bar
    3) Reads & merges all batch files into 'final_filename', then removes them.

    Returns:
    --------
    str -> path to the final merged Parquet file.
    """

    # Ensure output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Subfolder for temporary batch files
    temp_folder = os.path.join(output_folder, "temp_batches")
    os.makedirs(temp_folder, exist_ok=True)

    total_rows = len(df)
    batch_count = (total_rows + batch_size - 1) // batch_size
    print(f"Splitting DataFrame of {total_rows} rows into {batch_count} batches (size={batch_size}).")

    temp_files = []
    current_row = 0
    batch_index = 1

    # -- 1) SAVE IN MULTIPLE BATCHES WITH A PROGRESS BAR FOR THE BATCHES --
    with tqdm(total=batch_count, desc="Saving Batches", unit="batch") as pbar:
        while current_row < total_rows:
            end_row = min(current_row + batch_size, total_rows)
            df_batch = df.iloc[current_row:end_row]

            temp_file_name = f"{temp_batch_prefix}{batch_index}.parquet"
            temp_file_path = os.path.join(temp_folder, temp_file_name)

            # Write the chunk (one shot for each batch)
            df_batch.to_parquet(temp_file_path, index=False, compression="snappy")

            temp_files.append(temp_file_path)

            # Update progress bar
            pbar.update(1)

            # Optional: Print log
            print(f"  -> Batch {batch_index} rows [{current_row}:{end_row}] saved to {temp_file_path}")

            current_row = end_row
            batch_index += 1

    # -- 2) MERGE ALL BATCH FILES INTO A SINGLE PARQUET --
    final_file_path = os.path.join(output_folder, final_filename)
    print(f"\nMerging {len(temp_files)} batch files into {final_file_path}...")

    merged_parts = []
    # Another progress bar for reading merges (optional)
    with tqdm(total=len(temp_files), desc="Merging Batches", unit="file") as pbar_merge:
        for file_path in temp_files:
            merged_parts.append(pd.read_parquet(file_path))
            pbar_merge.update(1)

    df_merged = pd.concat(merged_parts, ignore_index=True)
    df_merged.to_parquet(final_file_path, index=False, compression="snappy")
    print(f"Final merged DataFrame saved as: {final_file_path}\n")

    # -- 3) CLEAN UP TEMPORARY FILES --
    for path in temp_files:
        os.remove(path)
    os.rmdir(temp_folder)

    print("Temporary batch files removed. All done!")
    return final_file_path

````
# ---------------------------
# EXAMPLE USAGE
# ---------------------------
if __name__ == "__main__":

    folder_path = "Data/2.Processed/ModellingData"
    final_file = "P4_final_merged.parquet"
    batch_size = 100_000  # e.g. if you want ~10 batches

    result_path = save_and_merge_in_batches(
        df=df_final,
        batch_size=batch_size,
        output_folder=folder_path,
        final_filename=final_file,
        temp_batch_prefix="temp_batch_"
    )

    print(f"All done. Merged file at: {result_path}")
````
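
The batch arithmetic inside `save_and_merge_in_batches` is a ceiling division followed by a sliding window over row offsets. The sketch below reproduces just that arithmetic; `batch_count` and `batch_bounds` are hypothetical helper names, and the 1,057,871-row figure is the DataFrame size quoted in the notebook summary.

```python
def batch_count(total_rows: int, batch_size: int) -> int:
    """Ceiling division: number of batches needed to cover total_rows."""
    return (total_rows + batch_size - 1) // batch_size

def batch_bounds(total_rows: int, batch_size: int):
    """Yield (start, end) row offsets, mirroring the while-loop in the function above."""
    start = 0
    while start < total_rows:
        end = min(start + batch_size, total_rows)
        yield start, end
        start = end

total_rows, batch_size = 1_057_871, 100_000
bounds = list(batch_bounds(total_rows, batch_size))
print(batch_count(total_rows, batch_size))  # 11
print(bounds[-1])                           # (1000000, 1057871)
```

With these numbers the last batch is short (57,871 rows), which the `min(...)` in the loop handles.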

## Early Processing 

*Removing NA values and filtering rows with out-of-range dates.* Articles dated after 2024 or before 1995 are excluded: the current year is 2024, and 1994 has too few articles.

### Step 0: Setup

In [3]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_parquet("Data/1.EarlyCleaned/cleaned_parquet/final/PubMedAbstracts_final.parquet")
df.head()

In [None]:
# na per column 
df.isna().sum()

### **Step 1**: Filtering rows + removal of missing records 

(Manual checks showed that the missing abstracts are absent from the articles themselves; they are **NOT** the result of mistakes in processing or in the data-gathering phase.)

In [None]:
missing_abstracts = df[df["abstract"].isna()]
print("Rows where 'abstract' is missing:")
missing_abstracts

In [None]:
x = (df.shape)

# 1) Drop rows with a missing abstract (14).
#    Manual checks showed these abstracts are empty on the website itself,
#    not lost through a coding fault. They contribute ~1% of the dataset,
#    so dropping them is not much of an issue.
df = df.dropna(subset=["abstract"])

print("Removed missing abstract rows:")
print(x[0]-df.shape[0])

# 2) Drop rows from the years 1994 and 2025
# First ensure parsed_date is datetime (unparseable values become NaT)
df["parsed_date"] = pd.to_datetime(df["parsed_date"], errors="coerce")

# Exclude 1994 (low number of articles) and 2025 (the current year is 2024)
df = df[(df["parsed_date"].dt.year != 1994) & (df["parsed_date"].dt.year != 2025)]

print("Removed total rows:")
print(x[0] - df.shape[0])
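
One subtlety of the filter above is how it treats dates that fail to parse. The sketch below uses a made-up five-row frame to show that rows coerced to `NaT` pass a `!=` comparison, and contrasts this with a stricter `between` filter. The stricter variant is an assumption offered for comparison, not the notebook's code.

```python
import pandas as pd

# Made-up rows: one 1994 row, one unparseable date, one 2025 row
demo = pd.DataFrame({
    "uid": [1, 2, 3, 4, 5],
    "parsed_date": ["1994-06-01", "2003-02-15", "not a date", "2024-12-31", "2025-01-02"],
})

# Unparseable strings become NaT instead of raising
demo["parsed_date"] = pd.to_datetime(demo["parsed_date"], errors="coerce")
years = demo["parsed_date"].dt.year

# Same style of filter as the cell above: NaT rows pass because NaN != 1994 is True
kept_ne = demo[(years != 1994) & (years != 2025)]

# Stricter alternative (an assumption, not the notebook's code): keep 1995-2024 only
kept_between = demo[years.between(1995, 2024)]

print(kept_ne["uid"].tolist())
print(kept_between["uid"].tolist())
```

If rows without a valid date should not survive, the `between` form (or an explicit `dropna` on `parsed_date`) makes that intent explicit.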

In [10]:
# Selecting only columns that we will be working with
df = df[["uid", "title", "journal", "abstract", "authors", "affiliations", "mesh_terms", "keywords", "coi_statement", "parsed_date"]].copy()

## Further Processing

In [None]:
df.columns

### Variable: `uid` + `parsed_date`

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# parsed date into datetime format
df["parsed_date"] = pd.to_datetime(df["parsed_date"], errors="coerce")

# Compute articles per year
articles_per_year = df.groupby(df["parsed_date"].dt.year)["uid"].count().sort_index()

# Compute Year-over-Year % change
growth_pct = articles_per_year.pct_change().fillna(0) * 100
growth_pct.iloc[0] = 0  # Set the first year's growth to 0 to avoid outliers

# Ensure all years from 1995 to 2024 are represented (even if no data for some years)
all_years = pd.Series(range(1995, 2025), name="Year")
articles_per_year = articles_per_year.reindex(all_years, fill_value=0)
growth_pct = growth_pct.reindex(all_years, fill_value=0)

# Compute cumulative number of articles
cumulative_articles = articles_per_year.cumsum()

# Compute Indexed Growth (base 1995 = 100)
indexed_growth = (articles_per_year / articles_per_year.iloc[0]) * 100

# --- Visualization 1: Articles per Year + YoY Growth ---
fig, ax1 = plt.subplots(figsize=(16, 10))

# Adjust the bar width and opacity
bar_width = 0.8
bars = ax1.bar(articles_per_year.index, articles_per_year.values, color="skyblue", alpha=0.8, width=bar_width, label="Article Count")
ax1.set_xlabel("Year", fontsize=12)
ax1.set_ylabel("Count of Articles", color="blue", fontsize=12)
ax1.tick_params(axis='y', labelcolor="blue")
ax1.set_xticks(articles_per_year.index)
ax1.set_xticklabels(articles_per_year.index, rotation=90, fontsize=10)

# Annotate each bar with its count (thousands separated by spaces), just below the bar top
for rect in bars:
    height = rect.get_height()
    if height > 0:  # Annotate only if count > 0
        ax1.text(
            rect.get_x() + rect.get_width() / 2, height - 1500,  # Anchored 1500 below the bar top
            f"{int(height):,}".replace(",", " "), ha="center", va="bottom", fontsize=10
        )

# Secondary axis for growth percentage
ax2 = ax1.twinx()
ax2.set_ylabel("Year-over-Year Growth (%)", fontsize=12)
ax2.tick_params(axis='y')

# Custom Y-axis ticks with range -30 to 30
ax2.set_ylim(-30, 30)
ax2.set_yticks(range(-30, 35, 5))
ax2.set_yticklabels(
    [f"{abs(t)}%" if t == 0 else f"{t}%" for t in range(-30, 35, 5)],
    color="gray",
    fontsize=10,
)

# Change color of Y-axis labels dynamically
for label in ax2.get_yticklabels():
    value = int(label.get_text().replace("%", ""))
    if value > 0:
        label.set_color("green")
    elif value < 0:
        label.set_color("red")
    else:
        label.set_color("gray")

# Align the line chart with the center of bars
line_x = articles_per_year.index + (bar_width / 2 - 0.4)
colors = ["green" if y > 0 else "red" for y in growth_pct.values]

# Plot colored growth lines based on percentage
for i in range(1, len(articles_per_year)):
    ax2.plot(
        [line_x[i - 1], line_x[i]],
        [growth_pct.values[i - 1], growth_pct.values[i]],
        color=colors[i],
        linewidth=2
    )

# Annotate growth percentage below/above dots in bold
for x, y in zip(line_x, growth_pct.values):
    if abs(y) > 0.5:  # Annotate only significant changes
        offset = -2 if y < 0 else 2
        ax2.text(
            x, y + offset,
            f"{y:.1f}%", color="green" if y > 0 else "red",
            ha="center", va="bottom" if y > 0 else "top", fontsize=10, fontweight="bold"
        )

# Add a title and adjust layout
ax1.set_title("Articles per Year and Year-over-Year Growth", fontsize=14, pad=20)
fig.tight_layout()
plt.show()

# --- Visualization 2: Indexed Growth + Cumulative Articles ---
fig, ax1 = plt.subplots(figsize=(16, 10))

# Plot cumulative number of articles as bars
bars = ax1.bar(cumulative_articles.index, cumulative_articles.values, color="lightblue", alpha=0.9, width=bar_width, label="Cumulative Articles")
ax1.set_xlabel("Year", fontsize=12)
ax1.set_ylabel("Cumulative Articles", color="blue", fontsize=12)
ax1.tick_params(axis='y', labelcolor="blue")
ax1.set_xticks(cumulative_articles.index)
ax1.set_xticklabels(cumulative_articles.index, rotation=90, fontsize=10)

# Set Y-axis to show absolute values (e.g., 100,000)
ax1.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f"{int(x):,}"))

# Annotate each bar with its cumulative count
for rect in bars:
    height = rect.get_height()
    if height > 0:  # Annotate only if count > 0
        ax1.text(
            rect.get_x() + rect.get_width() / 2, height - 20000,  # Anchored 20,000 below the bar top
            f"{int(height):,}".replace(",", " "), ha="center", va="bottom", fontsize=10
        )

# Secondary axis for Indexed Growth
ax2 = ax1.twinx()
ax2.set_ylabel("Indexed Growth (Base 1995 = 100)", color="purple", fontsize=12)
ax2.plot(cumulative_articles.index, indexed_growth.values, color="purple", marker="o", linewidth=2, label="Indexed Growth")

# Annotate points on the purple line (indexed growth)
for x, y in zip(cumulative_articles.index, indexed_growth.values):
    ax2.text(
        x, y + 5,  # Slightly above each point
        f"{y:.1f}", color="purple", fontsize=10, ha="center"
    )

ax2.tick_params(axis='y', labelcolor="purple")
ax2.set_ylim(0, max(indexed_growth) * 1.1)

# Add a title and adjust layout
ax1.set_title("Cumulative Articles and Indexed Growth", fontsize=14, pad=20)
fig.tight_layout()
plt.show()
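
The three series plotted above (year-over-year growth, cumulative count, and indexed growth) come down to a few one-line pandas operations. The sketch below applies them to made-up yearly counts so the numbers are easy to check by hand; none of the values come from the dataset.

```python
import pandas as pd

# Made-up yearly counts, not taken from the dataset
counts = pd.Series([100, 120, 90], index=[1995, 1996, 1997], name="articles")

yoy_pct = counts.pct_change().fillna(0) * 100   # year-over-year change in percent
cumulative = counts.cumsum()                    # running total of articles
indexed = counts / counts.iloc[0] * 100         # first year scaled to 100

print(yoy_pct.round(1).tolist())
print(cumulative.tolist())
print(indexed.tolist())
```

Indexed growth divides by the first year's count, so it is only meaningful when that first year is non-zero.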


### Variable: `title` + `abstract`

In [13]:
from pathlib import Path  # Import Path
import pandas as pd
import numpy as np
from transformers import AutoTokenizer
from tqdm import tqdm

# Hugging Face tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def huggingface_tokenize(text, max_len=512):
    """
    Tokenize text with the Hugging Face tokenizer and truncate to max_len.
    """
    if not isinstance(text, str) or not text.strip():
        return []
    encoding = tokenizer(
        text,
        add_special_tokens=True,
        max_length=max_len,
        truncation=True,
        return_token_type_ids=False,
        return_attention_mask=False
    )
    return tokenizer.convert_ids_to_tokens(encoding["input_ids"])

def simple_tokenize(text):
    """
    Simple whitespace and punctuation-based tokenizer.
    """
    import re
    if not isinstance(text, str) or not text.strip():
        return []
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()

# Batch processing with progress bar
def process_in_batches(df, column, batch_size=1000, tokenizer_func=None, output_column=None, save_path=None):
    tqdm.pandas()  # Enables progress_apply with tqdm

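    # Reuse existing results only if the saved file covers every row of df
    if save_path and Path(save_path).exists():
        processed = pd.read_parquet(save_path)
        if len(processed) == len(df):
            print(f"Loaded existing results from {save_path}.")
            return processed
        print(f"{save_path} has {len(processed)} of {len(df)} rows; recomputing from scratch.")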

    num_batches = (len(df) + batch_size - 1) // batch_size  # Total number of batches
    results = []

    for i in tqdm(range(num_batches), desc="Processing Batches"):
        start = i * batch_size
        end = start + batch_size

        # Process batch and avoid direct assignment to a slice
        batch = df.iloc[start:end]
        tokenized_data = batch[column].progress_apply(tokenizer_func)
        tokenized_df = pd.DataFrame({output_column: tokenized_data}, index=batch.index)
        results.append(tokenized_df)

        # Save progress after each batch
        if save_path:
            pd.concat(results).to_parquet(save_path, index=True)

    return pd.concat(results)

First, we tokenize the text in two ways: with the simple regex-based tokenizer and with the Hugging Face tokenizer.
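
Before running both tokenizers over the full dataset, it can help to see what the regex-based tokenizer does to typical biomedical strings. The sketch below repeats `simple_tokenize` from the cell above so that it runs on its own; the sample title is made up for illustration and is not taken from the dataset.

```python
import re

def simple_tokenize(text):
    """Lowercase, replace everything except a-z and whitespace with spaces, then split."""
    if not isinstance(text, str) or not text.strip():
        return []
    return re.sub(r"[^a-z\s]", " ", text.lower()).split()

# Hypothetical sample title, not taken from the dataset
sample = "Parkinson's disease and COVID-19 in 2020"
print(simple_tokenize(sample))
# ['parkinson', 's', 'disease', 'and', 'covid', 'in']
```

Digits and apostrophes are dropped, so "COVID-19" becomes `covid` and "Parkinson's" splits into `parkinson` and `s`. This matters later for dictionary matching, where entries containing apostrophes or parentheses cannot match these tokens.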

In [None]:
# Process title with both tokenizers
df["title_tokens_simple"] = process_in_batches(
    df, "title", batch_size=100_000, tokenizer_func=simple_tokenize, output_column="title_tokens_simple", save_path="Data/2.Processed/ModellingData/P0.simple_tokens_title.parquet"
)

df["title_tokens_hf"] = process_in_batches(
    df, "title", batch_size=100_000, tokenizer_func=lambda txt: huggingface_tokenize(txt, max_len=512), output_column="title_tokens_hf", save_path="Data/2.Processed/ModellingData/P0.hf_tokens_title.parquet"
)

In [None]:
# Process abstract with both tokenizers
df["abstract_tokens_simple"] = process_in_batches(
    df, "abstract", batch_size=100_000, tokenizer_func=simple_tokenize, output_column="abstract_tokens_simple", save_path="Data/2.Processed/ModellingData/P0.simple_tokens_abstract.parquet"
)

df["abstract_tokens_hf"] = process_in_batches(
    df, "abstract", batch_size=100_000, tokenizer_func=lambda txt: huggingface_tokenize(txt, max_len=512), output_column="abstract_tokens_hf", save_path="Data/2.Processed/ModellingData/P0.hf_tokens_abstract.parquet"
)

In [None]:
df.head()

#### 0) Insight into `simple` and `hugging face` tokenization

In [None]:
from collections import Counter

# Flatten token lists and count token frequencies
simple_token_flat = [token for tokens in df["title_tokens_simple"] for token in tokens]
hf_token_flat = [token for tokens in df["title_tokens_hf"] for token in tokens]

# Count frequencies
simple_token_freq = Counter(simple_token_flat).most_common(20)
hf_token_freq = Counter(hf_token_flat).most_common(20)

# Convert to DataFrame for visualization
freq_df = pd.DataFrame({
    "Simple Tokens": [token for token, _ in simple_token_freq],
    "Simple Frequency": [freq for _, freq in simple_token_freq],
    "HF Tokens": [token for token, _ in hf_token_freq],
    "HF Frequency": [freq for _, freq in hf_token_freq]
})

print(freq_df)

In [None]:
from collections import Counter

# Flatten token lists and count token frequencies
simple_token_flat = [token for tokens in df["abstract_tokens_simple"] for token in tokens]
hf_token_flat = [token for tokens in df["abstract_tokens_hf"] for token in tokens]

# Count frequencies
simple_token_freq = Counter(simple_token_flat).most_common(20)
hf_token_freq = Counter(hf_token_flat).most_common(20)

# Convert to DataFrame for visualization
freq_df = pd.DataFrame({
    "Simple Tokens": [token for token, _ in simple_token_freq],
    "Simple Frequency": [freq for _, freq in simple_token_freq],
    "HF Tokens": [token for token, _ in hf_token_freq],
    "HF Frequency": [freq for _, freq in hf_token_freq]
})

print(freq_df)

`stopwords removal`

an, a, the, etc.

In [2]:
from nltk.corpus import stopwords
import nltk
# Get the list of stopwords
stop_words = set(stopwords.words("english"))

In [None]:
from nltk.corpus import stopwords
import nltk

# Download stopwords from NLTK
nltk.download("stopwords")

# Get the list of stopwords
stop_words = set(stopwords.words("english"))

# Remove stopwords from the title tokens
df["cleaned_title_tokens_simple"] = df["title_tokens_simple"].apply(lambda tokens: [t for t in tokens if t not in stop_words])
df["cleaned_title_tokens_hf"] = df["title_tokens_hf"].apply(lambda tokens: [t for t in tokens if t not in stop_words])

# Remove stopwords from the abstract tokens
df["cleaned_abstract_tokens_simple"] = df["abstract_tokens_simple"].apply(lambda tokens: [t for t in tokens if t not in stop_words])
df["cleaned_abstract_tokens_hf"] = df["abstract_tokens_hf"].apply(lambda tokens: [t for t in tokens if t not in stop_words])

`punctuations removal` 

 "." , "," , "(" , etc.

In [None]:
punctuation_tokens = {".", ",", "-", ":", ";", "(", ")", "[", "]", "{", "}", "`", "'"}
def remove_punctuation(tokens):
    return [t for t in tokens if t not in punctuation_tokens]

# Apply to the title token columns (apply already returns a new Series, so no copy is needed)
df["cleaned_title_tokens_simple"] = df["cleaned_title_tokens_simple"].apply(remove_punctuation)
df["cleaned_title_tokens_hf"] = df["cleaned_title_tokens_hf"].apply(remove_punctuation)

# Apply to the abstract token columns
df["cleaned_abstract_tokens_simple"] = df["cleaned_abstract_tokens_simple"].apply(remove_punctuation)
df["cleaned_abstract_tokens_hf"] = df["cleaned_abstract_tokens_hf"].apply(remove_punctuation)
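
The hand-written punctuation set above leaves out characters such as `?`, `!` and `/`. In addition, because `huggingface_tokenize` uses `add_special_tokens=True`, the Hugging Face token lists also contain `[CLS]` and `[SEP]`. A minimal sketch of a broader filter follows, assuming every ASCII punctuation character should be dropped. The name `drop_noise_tokens` and the sample token list are illustrative only.

```python
import string

# Assumption: treat every ASCII punctuation character as a removable token
PUNCTUATION_TOKENS = set(string.punctuation)
# Special tokens of BERT-style tokenizers; dropping [UNK] is a choice, not a requirement
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]", "[UNK]"}

def drop_noise_tokens(tokens):
    """Remove single-character punctuation tokens and tokenizer special tokens."""
    return [t for t in tokens if t not in PUNCTUATION_TOKENS and t not in SPECIAL_TOKENS]

# Hypothetical Hugging Face style token list
hf_like = ["[CLS]", "breast", "cancer", "?", "a", "review", "/", "update", "[SEP]"]
print(drop_noise_tokens(hf_like))
# ['breast', 'cancer', 'a', 'review', 'update']
```

Without such a filter, `[CLS]` and `[SEP]` appear once per document and are likely to dominate the Hugging Face frequency tables.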

In [None]:
from collections import Counter

# Flatten token lists and count token frequencies
simple_token_flat = [token for tokens in df["cleaned_title_tokens_simple"] for token in tokens]
hf_token_flat = [token for tokens in df["cleaned_title_tokens_hf"] for token in tokens]

# Count frequencies
simple_token_freq = Counter(simple_token_flat).most_common(50)
hf_token_freq = Counter(hf_token_flat).most_common(50)

# Convert to DataFrame for visualization
freq_df = pd.DataFrame({
    "Simple Tokens": [token for token, _ in simple_token_freq],
    "Simple Frequency": [freq for _, freq in simple_token_freq],
    "HF Tokens": [token for token, _ in hf_token_freq],
    "HF Frequency": [freq for _, freq in hf_token_freq]
})

print(freq_df)

In [None]:
from collections import Counter

# Flatten token lists and count token frequencies
simple_token_flat = [token for tokens in df["cleaned_abstract_tokens_simple"] for token in tokens]
hf_token_flat = [token for tokens in df["cleaned_abstract_tokens_hf"] for token in tokens]

# Count frequencies
simple_token_freq = Counter(simple_token_flat).most_common(50)
hf_token_freq = Counter(hf_token_flat).most_common(50)

# Convert to DataFrame for visualization
freq_df = pd.DataFrame({
    "Simple Tokens": [token for token, _ in simple_token_freq],
    "Simple Frequency": [freq for _, freq in simple_token_freq],
    "HF Tokens": [token for token, _ in hf_token_freq],
    "HF Frequency": [freq for _, freq in hf_token_freq]
})

print(freq_df)

##### Conclusions from 0)

Based on our initial exploration, both simple tokenization and Hugging Face-based tokenization appear to be viable approaches for our research. 

However, the cleaning required for these methods may prove to be overly time-intensive. 

As a result, we will prioritize alternative approaches for the time being. If these alternative methods fail to give satisfactory results, we will revisit and upgrade the tokenization-based strategy.

In [None]:
df.head()

#### 1) Dictionary-Based Disease Detection

In this approach, we manually create (or load) a disease dictionary (e.g., from ICD codes or a list of known disease names). We then filter each token list, keeping only the tokens that appear in that dictionary. This is a straightforward method, but it misses synonyms that are not listed, and it cannot match multi-word diseases as long as tokens are compared one at a time.

In [23]:
###############################################################################
# 1) Dictionary-Based Approach
###############################################################################

# Example dictionary of diseases (just a tiny sample). A fuller list could be loaded from a standard source such as ICD-10 diagnosis codes.
disease_dict = {
    # General and Common Diseases
    "cancer", "tumor", "diabetes", "hiv", "aids", "arthritis", "pneumonia",
    "hypertension", "influenza", "malaria", "tuberculosis", "dementia", "asthma", 
    "depression", "anxiety", "stroke", "heart disease", "kidney disease",
    
    # Infectious Diseases
    "hepatitis", "cholera", "dengue", "zika", "ebola", "typhoid", "plague",
    "meningitis", "measles", "rubella", "chickenpox", "shingles", "covid",
    "scarlet fever", "leprosy", "syphilis", "gonorrhea", "lyme disease",
    
    # Neurological Disorders
    "parkinson's disease", "epilepsy", "multiple sclerosis", "migraine", 
    "alzheimer's disease", "amyotrophic lateral sclerosis", "huntington's disease",
    "cerebral palsy", "autism", "adhd", "schizophrenia", "bipolar disorder",
    
    # Respiratory Diseases
    "bronchitis", "emphysema", "chronic obstructive pulmonary disease (copd)", 
    "sleep apnea", "pulmonary fibrosis", "cystic fibrosis",
    
    # Cardiovascular Diseases
    "high blood pressure", "arrhythmia", "coronary artery disease", 
    "heart failure", "heart attack", "aortic aneurysm", "angina",
    
    # Digestive Disorders
    "irritable bowel syndrome (ibs)", "ulcerative colitis", "crohn's disease",
    "gastritis", "peptic ulcer", "gastroesophageal reflux disease (gerd)",
    "pancreatitis", "hepatitis a", "hepatitis b", "hepatitis c",
    
    # Musculoskeletal Disorders
    "osteoporosis", "rheumatoid arthritis", "osteoarthritis", "gout",
    "fibromyalgia", "scoliosis", "spinal cord injury",
    
    # Endocrine Disorders
    "hyperthyroidism", "hypothyroidism", "cushing's syndrome", "addison's disease",
    "polycystic ovary syndrome (pcos)", "metabolic syndrome",
    
    # Skin Diseases
    "eczema", "psoriasis", "rosacea", "acne", "melanoma", "basal cell carcinoma",
    "squamous cell carcinoma",
    
    # Genetic Disorders
    "down syndrome", "turner syndrome", "klinefelter syndrome", 
    "sickle cell anemia", "cystic fibrosis", "marfan syndrome",
    
    # Blood Disorders
    "anemia", "leukemia", "lymphoma", "hemophilia", "thalassemia",
    "deep vein thrombosis", "pulmonary embolism",
    
    # Eye Diseases
    "cataracts", "glaucoma", "macular degeneration", "diabetic retinopathy",
    "conjunctivitis", "dry eye syndrome",
    
    # Liver Diseases
    "liver cirrhosis", "fatty liver disease", "hepatitis", "liver cancer",
    
    # Kidney and Urinary Diseases
    "kidney stones", "urinary tract infection (uti)", "chronic kidney disease",
    "nephritis", "prostate cancer", "bladder cancer",
    
    # Cancers
    "breast cancer", "lung cancer", "colon cancer", "skin cancer",
    "pancreatic cancer", "prostate cancer", "ovarian cancer", 
    "brain cancer", "thyroid cancer",
    
    # Reproductive Disorders
    "endometriosis", "ovarian cysts", "uterine fibroids", "erectile dysfunction",
    "infertility", "pelvic inflammatory disease (pid)",
    
    # Autoimmune Diseases
    "systemic lupus erythematosus", "hashimoto's disease", "sjogren's syndrome",
    "celiac disease", "graves' disease", "type 1 diabetes",
    
    # Others
    "sepsis", "allergies", "heat stroke", "hypothermia", "obesity",
    "metabolic syndrome", "malnutrition", "alcoholism", "drug addiction",
    "dyslexia", "anorexia", "bulimia", "hyperlipidemia", "bacterial vaginosis"
}



def keep_only_diseases(token_list):
    """
    Return only those tokens present in the disease_dict.
    We do a .lower() to unify. 
    If you have subwords in HF approach (like 'canc', '##er'),
    you might want to check partial matches or reconstruct them.
    """
    return [t for t in token_list if t.lower() in disease_dict]
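
The docstring above mentions reconstructing subwords for the Hugging Face tokens. A minimal sketch of that step follows, assuming WordPiece-style continuation tokens prefixed with `##`, which is what BERT-family tokenizers emit. The helper name `merge_wordpieces` and the example split are hypothetical; the real tokenizer may split the word differently.

```python
def merge_wordpieces(tokens):
    """Glue WordPiece continuation tokens (prefixed with '##') back onto the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the '##' marker and append to the previous word
        else:
            words.append(tok)
    return words

# Hypothetical split of "hyperlipidemia" into subword pieces
print(merge_wordpieces(["hyper", "##lip", "##ide", "##mia", "in", "adults"]))
# ['hyperlipidemia', 'in', 'adults']
```

Merging should happen before stopword or punctuation filtering. Otherwise a removed piece can leave a continuation token attached to the wrong word.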

In [None]:
# Example usage with cleaned columns:
df["disease_title_tokens_simple"] = df["cleaned_title_tokens_simple"].apply(keep_only_diseases)
df["disease_title_tokens_hf"] = df["cleaned_title_tokens_hf"].apply(keep_only_diseases)

# Then, for frequency:
from collections import Counter

disease_flat_simple = [tok for tokens in df["disease_title_tokens_simple"] for tok in tokens]
simple_disease_freq = Counter(disease_flat_simple).most_common(20)

disease_flat_hf = [tok for tokens in df["disease_title_tokens_hf"] for tok in tokens]
hf_disease_freq = Counter(disease_flat_hf).most_common(20)

print("Top 20 diseases title (simple):", simple_disease_freq)
print("Top 20 diseases title (HF):", hf_disease_freq)

In [None]:
# Example usage with cleaned columns:
df["disease_abstract_tokens_simple"] = df["cleaned_abstract_tokens_simple"].apply(keep_only_diseases)
df["disease_abstract_tokens_hf"] = df["cleaned_abstract_tokens_hf"].apply(keep_only_diseases)

# Then, for frequency:
from collections import Counter

disease_flat_simple = [tok for tokens in df["disease_abstract_tokens_simple"] for tok in tokens]
simple_disease_freq = Counter(disease_flat_simple).most_common(20)

disease_flat_hf = [tok for tokens in df["disease_abstract_tokens_hf"] for tok in tokens]
hf_disease_freq = Counter(disease_flat_hf).most_common(20)

print("Top 20 diseases abstract (simple):", simple_disease_freq)
print("Top 20 diseases abstract (HF):", hf_disease_freq)

In [None]:
df.head()

**Pros**:

- Easy to implement.
- You quickly see if “cancer” or “diabetes” is the top disease token.

**Cons**:

- Multi-word diseases like “heart failure” or “chronic obstructive pulmonary disease” won’t match, because each token is checked on its own. Matching them requires comparing sequences of adjacent tokens against the dictionary.
- Subword issues: in Hugging Face tokens, “cancer” might appear as `[canc, ##er]`. That won’t match “cancer” in `disease_dict` unless we merge the pieces first.
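
One way to address the multi-word limitation is to compare contiguous token n-grams against the dictionary, trying the longest window first. The following is a sketch under stated assumptions rather than part of the notebook's pipeline: the function name `find_disease_phrases`, the `max_words` limit and the small sample dictionary are all made up for illustration.

```python
def find_disease_phrases(tokens, dictionary, max_words=4):
    """Return dictionary phrases found as contiguous token n-grams, longest match first."""
    found = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest window first so "breast cancer" wins over plain "cancer"
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in dictionary:
                found.append(phrase)
                match_len = n
                break
        # Skip past the matched phrase, or move one token forward if nothing matched
        i += match_len if match_len else 1
    return found

# Hypothetical mini dictionary and token list
sample_dict = {"heart failure", "cancer", "breast cancer", "diabetes"}
tokens = ["chronic", "heart", "failure", "and", "breast", "cancer", "risk"]
print(find_disease_phrases(tokens, sample_dict))
# ['heart failure', 'breast cancer']
```

Entries that contain apostrophes or parenthesised abbreviations, such as "crohn's disease" or "irritable bowel syndrome (ibs)", would still need to be normalised the same way as the tokens before they can match.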

#### 2) `SciSpacy` (or spaCy) Biomedical NER Approach

In this approach, we use a trained biomedical NER model (such as `en_ner_bc5cdr_md`) that detects DISEASE entities in the text. It can handle multi-word diseases and synonyms without a hand-built dictionary. Setup:

- Install scispacy: `pip install scispacy`
- Install the specific model, e.g. `en_ner_bc5cdr_md`, by running `pip install` on the model archive file downloaded from https://allenai.github.io/scispacy/
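
Before loading the model, a quick check can confirm that the packages are importable in the current environment. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical, and it relies on the fact that spaCy models install as regular Python packages named after the model.

```python
import importlib.util

def is_installed(package_name):
    """Return True if `package_name` can be imported in the current environment."""
    return importlib.util.find_spec(package_name) is not None

# Report which of the required packages are available
for name in ("scispacy", "spacy", "en_ner_bc5cdr_md"):
    print(f"{name}: {'installed' if is_installed(name) else 'missing'}")
```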


In [27]:
# install scispacy
#%pip install scispacy

#en_ner_bc5cdr_md	F1 84.28	DISEASE, CHEMICAL

# Check project directory
# import os
# os.getcwd()

In [28]:
# installing certain model from archive file downloaded from scispacy website
#%pip install ScispaCy/en_ner_bc5cdr_md-0.5.4.tar.gz

##### Example Of Usage - scispacy website

In [None]:
# import scispacy
# import spacy

# nlp = spacy.load("en_ner_bc5cdr_md")
# text = """
# Myeloid derived suppressor cells (MDSC) are immature 
# myeloid cells with immunosuppressive activity. 
# They accumulate in tumor-bearing mice and humans 
# with different types of cancer, including hepatocellular 
# carcinoma (HCC).
# """
# doc = nlp(text)

# print(list(doc.sents))
# # >>> ["Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity.", 
# #      "They accumulate in tumor-bearing mice and humans with different types of cancer, including hepatocellular carcinoma (HCC)."]

# # Examine the entities extracted by the mention detector.
# # Note that they don't have types like in SpaCy, and they
# # are more general (e.g including verbs) - these are any
# # spans which might be an entity in UMLS, a large
# # biomedical database.
# print(doc.ents)
# # >>> (Myeloid derived suppressor cells,
# #      MDSC,
# #      immature,
# #      myeloid cells,
# #      immunosuppressive activity,
# #      accumulate,
# #      tumor-bearing mice,
# #      humans,
# #      cancer,
# #      hepatocellular carcinoma,
# #      HCC)

# # We can also visualise dependency parses
# # (This renders automatically inside a jupyter notebook!):
# from spacy import displacy
# displacy.render(next(doc.sents), style='dep', jupyter=True)

# # See below for the generated SVG.
# # Zoom your browser in a bit!
# # The graphic won't be used with our dataset, as rendering it is rather resource-intensive

##### Cleaning + Setting Up ENV with smaller dataframes

In [None]:
import gc

# Force the garbage collector to run
gc.collect()

In [None]:
%who
%whos

In [None]:
df.head()

In [33]:
df = df.drop(columns=["abstract_tokens_simple", "abstract_tokens_hf","title_tokens_simple","title_tokens_hf"]).copy()

In [None]:
if __name__ == "__main__":

    folder_path = "Data/2.Processed/ModellingData"
    final_file = "P1_all.parquet"
    batch_size = 100_000 
    
    result_path = save_and_merge_in_batches(
        df=df,
        batch_size=batch_size,
        output_folder=folder_path,
        final_filename=final_file,
        temp_batch_prefix="temp_batch_"
    )

    print(f"All done. Merged file at: {result_path}")

`uid` distinct values: 1,057,871 (100% of rows), so `uid` uniquely identifies each row.


In [35]:
df_title = df[['uid', 'title']].copy()
df_abstract = df[['uid', 'abstract']].copy()

In [None]:
import os 

# Folder and file paths
folder_path = "Data/2.Processed/ModellingData"

file_name_title = "P2_title.parquet"
file_name_abstract = "P2_abstract.parquet"

file_path_title = os.path.join(folder_path, file_name_title)
file_path_abstract = os.path.join(folder_path, file_name_abstract)

# 1. Save the DataFrame as a single Parquet file
df_title.to_parquet(file_path_title, index=False, compression="snappy")
print(f"DataFrame saved as a single Parquet file: {file_path_title}")

df_abstract.to_parquet(file_path_abstract, index=False, compression="snappy")
print(f"DataFrame saved as a single Parquet file: {file_path_abstract}")

In [37]:
# Reset the environment to free memory (this also clears imports and helper functions, so re-run the setup cells afterwards)
%reset

In [None]:
import psutil
import os
from multiprocessing import cpu_count

# Print CPU usage percentage
print(f"CPU usage: {psutil.cpu_percent()}%")

# Print memory usage
memory_info = psutil.virtual_memory()
print(f"Memory usage: {memory_info.percent}%")

# Get the total number of CPUs
print(f"Number of CPUs available: {cpu_count()}")

##### SciSpacy `Title`

Processing `Title` with SciSpacy model

In [None]:
# Example usage
if __name__ == "__main__":
    file_path = "Data/2.Processed/ModellingData/P2_title.parquet"
    batch_size = 100_000  # Define your desired chunk size
    
    df_title = read_parquet_in_batches_with_progress(file_path, batch_size)
    
    print(f"\nFinal DataFrame with {len(df_title)} rows:")

In [None]:
df_title.head()

In [None]:
###############################################################################
# COMPLETE CODE: CHUNK-BASED DISEASE EXTRACTION (BC5CDR) + TIME LEFT ESTIMATE
###############################################################################

import os
import time
import scispacy
import spacy
import pandas as pd
from tqdm.auto import tqdm

###############################################################################
# 1) LOAD BC5CDR MODEL, DISABLING COMPONENTS FOR SPEED
###############################################################################
try:
    nlp_bc5cdr = spacy.load(
        "en_ner_bc5cdr_md", 
        disable=["tagger", "parser", "attribute_ruler", "lemmatizer"]
    )
except Exception as e:
    print("Could not load 'en_ner_bc5cdr_md'. Make sure you installed:")
    print("  pip install scispacy")
    print("  pip install en_ner_bc5cdr_md-0.5.4.tar.gz (or your local path)")
    raise e

def extract_diseases_spacy(doc):
    """
    Extract disease mentions from BC5CDR model. (ent.label_ in {CHEMICAL, DISEASE})
    We only keep label == 'DISEASE'.
    """
    diseases = []
    for ent in doc.ents:
        if ent.label_ == "DISEASE":
            diseases.append(ent.text)
    return diseases

###############################################################################
# 2) CHUNK PROCESSING WITH RESUME & TIME REMAIN ESTIMATE
###############################################################################
def process_diseases_in_chunks_with_resume(
    df,
    text_col="title",
    chunk_size=10_000,
    batch_size=32,
    save_path="partial_bc5cdr.parquet"
):
    """
    - df: main DataFrame
    - text_col: column with text to process
    - chunk_size: # of rows per chunk
    - batch_size: # docs per nlp.pipe() batch
    - save_path: Parquet file to store partial/final results

    1) Resumes from an existing partial file if it exists.
    2) Processes row by row in chunks, each chunk using spaCy's pipe for faster NER.
    3) Shows a progress bar + estimates time left based on chunk durations.
    4) Saves partial results after each chunk, then a final full save.
    5) The disease mentions get stored in df["disease_entities_spacy"].
    """
    # Reset index so row order is stable
    df = df.reset_index(drop=True)

    # Initialize the column if absent
    if "disease_entities_spacy" not in df.columns:
        df["disease_entities_spacy"] = None

    # Figure out how many rows are already done if partial file is found
    start_idx = 0
    if os.path.exists(save_path):
        try:
            partial_df = pd.read_parquet(save_path)
            if "disease_entities_spacy" in partial_df.columns:
                df["disease_entities_spacy"] = partial_df["disease_entities_spacy"]
                done_mask = df["disease_entities_spacy"].notna()
                done_rows = done_mask.sum()
                start_idx = done_rows
                print(f"Resuming from row {start_idx} based on partial file {save_path}.")
            else:
                print(f"WARNING: {save_path} lacks 'disease_entities_spacy'. Starting fresh.")
        except Exception as e:
            print(f"Error reading partial file {save_path}: {e}")
            print("Starting from scratch.")
    else:
        print("No partial file found. Starting from scratch.")

    total_rows = len(df)
    if start_idx >= total_rows:
        print(f"All {total_rows} rows processed. Nothing to do.")
        return df

    # Calculate how many chunks remain
    remaining = total_rows - start_idx
    num_chunks = (remaining + chunk_size - 1) // chunk_size
    print(f"Starting chunked processing from row {start_idx}/{total_rows}, "
          f"{remaining} rows left, {num_chunks} chunks.\n")

    cur_row = start_idx
    chunk_times = []  # keep track of each chunk's duration to estimate time left

    # Initialize the progress bar
    with tqdm(total=num_chunks, desc="Processing Chunks", unit="chunk") as pbar:
        for i in range(num_chunks):
            chunk_start_time = time.time()

            end_idx = min(cur_row + chunk_size, total_rows)
            chunk = df.iloc[cur_row:end_idx].copy()
            texts = chunk[text_col].fillna("").tolist()

            # We'll store the results
            results = []

            # Use spaCy pipe in batch
            for doc in nlp_bc5cdr.pipe(texts, batch_size=batch_size):
                diseases = extract_diseases_spacy(doc)
                results.append(diseases)

            # Store in chunk and main df
            chunk["disease_entities_spacy"] = results
            df.iloc[cur_row:end_idx, df.columns.get_loc("disease_entities_spacy")] = chunk["disease_entities_spacy"]

            # Partial save
            df.iloc[:end_idx].to_parquet(save_path, index=False)

            # Chunk timing
            chunk_duration = time.time() - chunk_start_time
            chunk_times.append(chunk_duration)
            chunks_done = i + 1
            chunks_left = num_chunks - chunks_done
            # average chunk time so far
            avg_chunk_time = sum(chunk_times) / chunks_done
            est_time_left = avg_chunk_time * chunks_left

            # Update progress bar description with estimated time left
            pbar.set_postfix({
                "Last Chunk Time": f"{chunk_duration:.1f}s",
                "Est. Time Left": f"{est_time_left/60:.1f} min"
            })
            pbar.update(1)

            cur_row = end_idx

    # Final full save
    df.to_parquet(save_path, index=False)
    print(f"All done! Full results saved to {save_path}.\n")
    return df

###############################################################################
# EXAMPLE USAGE
###############################################################################
if __name__ == "__main__":

    # Adjust chunk_size, batch_size to fit your environment
    df_title = process_diseases_in_chunks_with_resume(
        df_title,
        text_col="title",
        chunk_size=10_000,      
        batch_size=64,       
        save_path="Data/2.Processed/ModellingData/P3_bc5cdr_results_title.parquet"
    )

# Inspect final results (kept at top level so the notebook displays the output)
df_title[["title", "disease_entities_spacy"]]
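
The chunk arithmetic and the time-left estimate used in the function above can be checked in isolation. The sketch below restates both with two hypothetical helpers, `chunk_bounds` and `estimate_seconds_left`, and made-up numbers; it does not touch spaCy or the dataset.

```python
def chunk_bounds(total_rows, start_idx, chunk_size):
    """Yield (start, end) row ranges covering start_idx..total_rows in chunk_size steps."""
    for start in range(start_idx, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)

def estimate_seconds_left(chunk_times, chunks_left):
    """Average duration of finished chunks times the number of chunks still to run."""
    if not chunk_times:
        return 0.0
    return sum(chunk_times) / len(chunk_times) * chunks_left

# Hypothetical numbers: 25 rows in total, 5 already processed, 10 rows per chunk
print(list(chunk_bounds(25, 5, 10)))
# [(5, 15), (15, 25)]
print(estimate_seconds_left([12.0, 8.0], chunks_left=3))
# 30.0
```

Because the estimate is a running average, it can swing noticeably during the first few chunks and settles as more chunks finish.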

In [None]:
df_title.head()

In [None]:
# Count the number of elements in each list inside 'disease_entities_spacy'
df_title['entity_count'] = df_title['disease_entities_spacy'].apply(len)

# 1. Value counts of the number of entities in lists
entity_count_value_counts = df_title['entity_count'].value_counts()

# 2. Binary categorization: 0 count or more than 0
df_title['has_entities'] = df_title['entity_count'].apply(lambda x: 1 if x > 0 else 0)

# Get value counts for the binary categorization
binary_value_counts = df_title['has_entities'].value_counts()

# Print both results
print("Entity Count Value Counts:")
print(entity_count_value_counts)

print("\nBinary Categorization (0 vs. More Than 0):")
print(binary_value_counts)

In [None]:
# Filter the rows where the 'entity_count' is 10
rows_with_10_entities = df_title[df_title['entity_count'] == 10]

rows_with_10_entities

Next, we determine how many rows have no detected disease entities in the `abstract` variable. 
Once we know how far these rows overlap with the entity-free titles, and what that implies for overall data quality, we will decide on the next steps.

##### SciSpacy `Abstract`

Processing `Abstract` with SciSpacy model

In [None]:
# Example usage
if __name__ == "__main__":
    file_path = "Data/2.Processed/ModellingData/P2_abstract.parquet"
    batch_size = 100_000  # Define your desired chunk size
    
    df_abstract = read_parquet_in_batches_with_progress(file_path, batch_size)
    
    print(f"\nFinal DataFrame with {len(df_abstract)} rows:")

In [None]:
df_abstract.head()

In [None]:
if __name__ == "__main__":

    # Adjust chunk_size, batch_size to fit your environment
    df_abstract = process_diseases_in_chunks_with_resume(
        df_abstract,
        text_col="abstract",
        chunk_size=10_000,      
        batch_size=64,       
        save_path="Data/2.Processed/ModellingData/P3_bc5cdr_results_abstract.parquet"
    )

    # Inspect final results
    df_abstract[["abstract", "disease_entities_spacy"]]

In [None]:
df_abstract.head()
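
The chunk-count and time-left arithmetic inside `process_diseases_in_chunks_with_resume` can be checked in isolation. The sketch below is only an illustration: `plan_chunks` and `estimate_minutes_left` are hypothetical helper names that do not exist in the notebook, and the durations are made-up numbers.

```python
def plan_chunks(total_rows, start_idx, chunk_size):
    """Return the (start, end) row ranges that still need processing."""
    ranges = []
    cur = start_idx
    while cur < total_rows:
        end = min(cur + chunk_size, total_rows)
        ranges.append((cur, end))
        cur = end
    return ranges


def estimate_minutes_left(chunk_times, chunks_left):
    """Average duration of the finished chunks times the chunks still to do."""
    if not chunk_times:
        return None  # nothing measured yet, so no estimate
    avg_seconds = sum(chunk_times) / len(chunk_times)
    return avg_seconds * chunks_left / 60


# Resume at row 10,000 of 25,000 with 10,000-row chunks.
ranges = plan_chunks(total_rows=25_000, start_idx=10_000, chunk_size=10_000)
print(ranges)

# Two finished chunks took 120 s and 180 s; three chunks remain.
minutes_left = estimate_minutes_left([120.0, 180.0], chunks_left=3)
print(minutes_left)
```

The number of ranges matches the ceiling division `(remaining + chunk_size - 1) // chunk_size` used in the function, and the last range is shorter when the row count is not a multiple of `chunk_size`.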

In [None]:
# Count the number of elements in each list inside 'disease_entities_spacy'
df_abstract['entity_count'] = df_abstract['disease_entities_spacy'].apply(len)

# 1. Value counts of the number of entities in lists
entity_count_value_counts = df_abstract['entity_count'].value_counts()

# 2. Binary categorization: 0 count or more than 0
df_abstract['has_entities'] = df_abstract['entity_count'].apply(lambda x: 1 if x > 0 else 0)

# Get value counts for the binary categorization
binary_value_counts = df_abstract['has_entities'].value_counts()

# Print both results
print("Entity Count Value Counts:")
print(entity_count_value_counts)

print("\nBinary Categorization (0 vs. More Than 0):")
print(binary_value_counts)

We initially identified 87,103 rows whose abstracts contain no detected disease entities and that might therefore need to be removed. As part of the analysis, we also compared rows with empty entity lists in titles and abstracts to check how they intersect.

In [None]:
# Merge DataFrames on the 'uid' column
merged_df = pd.merge(
    df_abstract[['uid', 'has_entities']], 
    df_title[['uid', 'has_entities']], 
    on='uid', 
    suffixes=('_abstract', '_title')
)

# Analyze rows based on their entity presence
no_entities_abstract = merged_df['has_entities_abstract'] == 0
no_entities_title = merged_df['has_entities_title'] == 0

# Count rows with:
# 1. No entities in both abstract and title
no_entities_both = merged_df[no_entities_abstract & no_entities_title].shape[0]

# 2. No entities in the abstract but entities in the title
no_entities_abstract_only = merged_df[no_entities_abstract & ~no_entities_title].shape[0]

# 3. No entities in the title but entities in the abstract
no_entities_title_only = merged_df[~no_entities_abstract & no_entities_title].shape[0]

# 4. Entities in both
entities_in_both = merged_df[~no_entities_abstract & ~no_entities_title].shape[0]

# Print the results
print("Summary of Rows with Entities:")
print(f"Rows with no entities in both abstract and title: {no_entities_both}")
print(f"Rows with no entities in abstract but entities in title: {no_entities_abstract_only}")
print(f"Rows with no entities in title but entities in abstract: {no_entities_title_only}")
print(f"Rows with entities in both abstract and title: {entities_in_both}")


**The results are as follows:**

1. Rows with no entities in both abstract and title: 80,868
2. Rows with no entities in abstract but entities in title: 6,235
3. Rows with no entities in title but entities in abstract: 323,102
4. Rows with entities in both abstract and title: 647,666
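
The same four counts can also be read from a single cross-tabulation instead of four boolean filters. This is a minimal sketch, assuming a frame shaped like `merged_df` with the two `has_entities_*` columns; the toy values are made up and are not the notebook's data.

```python
import pandas as pd

# Toy stand-in for merged_df; the real frame has one row per uid.
toy = pd.DataFrame({
    "has_entities_abstract": [0, 0, 1, 1, 1],
    "has_entities_title":    [0, 1, 0, 1, 1],
})

# Rows: abstract flag, columns: title flag, margins add row/column totals.
table = pd.crosstab(
    toy["has_entities_abstract"],
    toy["has_entities_title"],
    margins=True,
)
print(table)
```

The cell at row 0, column 0 corresponds to "no entities in both", and the `All` margins give the per-column totals, which makes it easy to confirm that the four cells add up to the number of merged rows.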

Based on this, we now focus on the **80,868** rows that lack entities in both abstract and title. For these rows, we will search the *MeSH terms* to check whether any disease-related tokens are present.

**If no matching tokens are found:**

*These rows may need to be removed entirely.*

*Alternatively, we can explore imputing values from the simple tokenization of another column (e.g., the simple or Hugging Face tokens) to generate adequate terms.*

*Or we can leave these articles as they are, since they were returned by our query. Future research may want to inspect such articles individually, although it is unclear how to do that at scale: other datasets could have even more entries than our 1 million.*
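
The MeSH check described above could look roughly like the sketch below. It assumes a `mesh_terms` column holding `;`-separated strings; `disease_keywords` is a hypothetical hand-made set used only for illustration, and `split_terms` is a helper invented for this example.

```python
import pandas as pd

# Hypothetical keyword set; a real one would be derived from MeSH disease headings.
disease_keywords = {"neoplasms", "diabetes mellitus", "hypertension"}

toy = pd.DataFrame({
    "uid": ["1", "2", "3"],
    "mesh_terms": ["Adult; Neoplasms", "Humans; Female", None],
})


def split_terms(raw):
    """Split a ';'-separated MeSH string into cleaned, lower-cased terms."""
    if not isinstance(raw, str):
        return []  # missing MeSH field
    return [t.strip().lower() for t in raw.split(";") if t.strip()]


# Keep only the terms that appear in the keyword set.
toy["mesh_hits"] = toy["mesh_terms"].apply(
    lambda raw: [t for t in split_terms(raw) if t in disease_keywords]
)
toy["has_mesh_hit"] = toy["mesh_hits"].apply(len) > 0
print(toy[["uid", "mesh_hits", "has_mesh_hit"]])
```

Rows where `has_mesh_hit` is still `False` are the ones the three options above apply to.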

#### *WHAT'S NEXT?*

*Use UMLS or MeSH*:

We could use resources like UMLS (Unified Medical Language System), MeSH (Medical Subject Headings), or other disease-related databases that categorize diseases. These resources contain semantic relationships between terms, which we could use to map diseases to broader categories automatically.

UMLS contains concepts and relationships for diseases, including synonyms and broader categories like "cancer," "neurological disorders," etc.
We could use a Python library such as PyMedTermino to query UMLS and retrieve disease categories automatically.

*Use a Synonym Mapping Dictionary*:

We could download or build a synonym dictionary for diseases (for example, mapping all types of carcinoma to "cancer").
There are various disease databases that already categorize diseases into high-level categories like "cancer," "infectious disease," "neurological disorder," etc.
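
A synonym mapping of this kind can be prototyped with a plain dictionary before committing to an external resource. The rules below are hypothetical examples, not a curated vocabulary, and `categorize` is a name chosen for this sketch.

```python
# Hypothetical mapping from lower-cased substrings to broad categories.
CATEGORY_RULES = {
    "carcinoma": "cancer",
    "tumor": "cancer",
    "lymphoma": "cancer",
    "influenza": "infectious disease",
    "parkinson": "neurological disorder",
}


def categorize(mention):
    """Return the first category whose key occurs in the mention, else 'other'."""
    text = mention.lower()
    for key, category in CATEGORY_RULES.items():
        if key in text:
            return category
    return "other"


mentions = ["Hepatocellular carcinoma", "Parkinson's disease", "asthma"]
categories = [categorize(m) for m in mentions]
print(categories)
```

Substring rules are crude (they depend on rule order and miss abbreviations), which is exactly the gap that a curated resource such as MeSH or UMLS is meant to close.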

*Entity Linking*:

We could perform entity linking to map disease entities to broader categories based on a pre-trained model or a database of known relationships. Some tools (like scispaCy) have built-in support for linking recognized entities to broader concepts.

*Group Diseases into Categories Using Predefined Rules*:

After extracting diseases with spaCy and linking them to categories, we could group them under broader categories.

##### Merging data frames with SciSpacy

In [None]:
# Example usage
if __name__ == "__main__":
    batch_size = 100_000  # Define your desired chunk size

    # File paths
    file_abstract = "Data/2.Processed/ModellingData/P3_bc5cdr_results_abstract.parquet"
    file_title = "Data/2.Processed/ModellingData/P3_bc5cdr_results_title.parquet"
    file_all = "Data/2.Processed/ModellingData/P1_all.parquet"

    # Read the abstract and title datasets
    df_abstract = read_parquet_in_batches_with_progress(file_abstract, batch_size)
    df_title = read_parquet_in_batches_with_progress(file_title, batch_size)

    # Rename columns
    df_abstract.rename(columns={"disease_entities_spacy": "disease_abstract_spacy"}, inplace=True)
    df_title.rename(columns={"disease_entities_spacy": "disease_title_spacy"}, inplace=True)

    # Select only the necessary columns for merging
    df_abstract = df_abstract[["uid", "disease_abstract_spacy"]]
    df_title = df_title[["uid", "disease_title_spacy"]]

    # Merge abstract and title datasets
    df_combined = pd.merge(df_abstract, df_title, on="uid", how="inner")

    # Read the main dataset
    df_all = read_parquet_in_batches_with_progress(file_all, batch_size)

    # Merge with the main dataset
    df_final = pd.merge(df_all, df_combined, on="uid", how="inner")

    print(f"\nFinal DataFrame with {len(df_final)} rows:")

In [None]:
df_final.head()

`saving dataframe` -> uncomment if needed

In [None]:
# import os 

# # Folder and file paths
# folder_path = "Data/2.Processed/ModellingData"

# file_name_final = "P4_final_merged.parquet"

# file_path_final = os.path.join(folder_path, file_name_final)

# # 1. Save the DataFrame as a single Parquet file
# df_final.to_parquet(file_path_final, index=False, compression="snappy")
# print(f"DataFrame saved as a single Parquet file: {file_path_final}")

In [None]:
if __name__ == "__main__":

    folder_path = "Data/2.Processed/ModellingData"
    final_file = "P4_final_merged.parquet"
    batch_size = 100_000 
    
    result_path = save_and_merge_in_batches(
        df=df_final,
        batch_size=batch_size,
        output_folder=folder_path,
        final_filename=final_file,
        temp_batch_prefix="temp_batch_"
    )

    print(f"All done. Merged file at: {result_path}")

`resetting the environment to release RAM and ease the load on the computer`

In [63]:
%reset

In [None]:
if __name__ == "__main__":
    file_path = "Data/2.Processed/ModellingData/P4_final_merged.parquet"
    batch_size = 100_000  # Define your desired chunk size
    
    df = read_parquet_in_batches_with_progress(file_path, batch_size)
    
    print(f"\nFinal DataFrame with {len(df)} rows:")
    df.head()

In [None]:
df.head()

##### Further Analysis Of Abstract And Title

In [None]:
import pandas as pd
import pyarrow.parquet as pq
from tqdm.auto import tqdm

def read_parquet_in_batches_with_progress(file_path, batch_size):
    """
    Read a Parquet file in fixed-size row batches with a progress bar and per-chunk logging.

    Args:
        file_path (str): Path to the Parquet file.
        batch_size (int): Number of rows per batch.

    Returns:
        pd.DataFrame: Combined DataFrame after processing all batches.
    """
    # Open the Parquet file
    parquet_file = pq.ParquetFile(file_path)
    
    # Total number of rows in the file
    total_rows = parquet_file.metadata.num_rows
    
    # Initialize a list to store DataFrame chunks
    all_chunks = []
    
    # Initialize the progress bar
    with tqdm(total=total_rows, desc="Processing Batches", unit="rows") as pbar:
        # Enumerate batches for logging
        for batch_number, batch in enumerate(parquet_file.iter_batches(batch_size=batch_size), start=1):
            # Convert the batch to a Pandas DataFrame
            df_batch = batch.to_pandas()
            
            # Simulate processing (add your custom logic here)
            all_chunks.append(df_batch)
            
            # Update the progress bar
            pbar.update(len(df_batch))
            
            # Print per-chunk information
            print(f"Processed Chunk {batch_number}: {len(df_batch)} rows")
    
    # Combine all chunks into a single DataFrame
    combined_df = pd.concat(all_chunks, ignore_index=True)
    
    return combined_df

# Example usage
if __name__ == "__main__":
    file_path = "Data/2.Processed/ModellingData/P4_final_merged.parquet"
    batch_size = 100_000  # Define your desired chunk size
    
    df = read_parquet_in_batches_with_progress(file_path, batch_size)
    
    print(f"\nFinal DataFrame with {len(df)} rows:")

In [None]:
df.head()

**Pros**:

- Captures multi-word diseases and synonyms.
- Doesn’t require you to maintain a dictionary.
- The model can label “Parkinson’s disease,” “type 2 diabetes,” etc.

**Cons**:

- Dependent on the model’s coverage and accuracy.
- Takes more time than a simple dictionary approach.
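
To see why multi-word diseases matter, here is a small sketch of a token-level dictionary lookup. The dictionary and the title are made up for illustration, and `dictionary_hits` is a hypothetical helper, not the notebook's dictionary-based detector.

```python
# Hypothetical single-word disease dictionary.
disease_dict = {"diabetes", "cancer", "asthma"}


def dictionary_hits(text):
    """Keep only lower-cased whitespace tokens that appear in the dictionary."""
    tokens = [tok.strip(".,;:()").lower() for tok in text.split()]
    return [tok for tok in tokens if tok in disease_dict]


title = "Outcomes in type 2 diabetes and Parkinson's disease."
hits = dictionary_hits(title)
print(hits)
```

The lookup keeps only `diabetes`: the qualifier "type 2" is lost and "Parkinson's disease" is missed entirely unless the dictionary is extended with phrases, whereas an NER model labels whole spans.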

**Why `[CLS]` and `##er` Appear (Subword Splits)**

Hugging Face’s DistilBERT tokenizer uses WordPiece subword tokenization. It splits words that are not in its vocabulary into smaller pieces; the `##` prefix means “this piece attaches to the previous piece.” If you keep them for embedding tasks, that’s normal. For classical LDA/TF-IDF on plain words, they can be awkward. You can:

- Remove `[CLS]`, `[SEP]`, etc. (the special tokens).
- Potentially remove or merge subwords (`canc` + `##er` → `cancer`).
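
Both clean-up steps can be sketched in a few lines of plain Python. The helper name `merge_wordpieces` and the token list are assumptions for illustration; the special-token set covers the usual BERT-style markers.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]"}


def merge_wordpieces(tokens):
    """Drop special tokens and glue '##' continuation pieces onto the previous word."""
    words = []
    for tok in tokens:
        if tok in SPECIAL_TOKENS:
            continue
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # attach the piece without its '##' prefix
        else:
            words.append(tok)
    return words


merged = merge_wordpieces(["[CLS]", "lung", "canc", "##er", "[SEP]"])
print(merged)
```

The merged words can then go through the same stopword and punctuation filtering as the simple tokens.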
                             
**Doing Classical LDA or TF-IDF vs. Embedding Approaches**

If you do classical topic modeling (LDA, etc.):

- You typically want full words rather than subwords.
- You remove or merge subword fragments to form complete tokens.
- You remove punctuation, maybe remove stopwords, etc.
- You might store the final tokens as strings in a column such as `df["final_tokens"]`, then run `TfidfVectorizer(...).fit_transform([" ".join(tokens) for tokens in df["final_tokens"]])`.

If you do embedding-based classification:

- You keep the subword tokens as the model expects them.
- Or you feed raw text into `AutoTokenizer` with `truncation=True, max_length=512` at inference time.


#### 3) EMBEDDINGS - maybe to get into later, very heavy for computer

In [73]:
# import torch
# from transformers import AutoTokenizer, AutoModel

# model_name = "distilbert-base-uncased"
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModel.from_pretrained(model_name)

# # Put the model in eval mode (we don't do further training here)
# model.eval()

# # GPU cuda usage:
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)

In [74]:
# def get_distilbert_embedding(text, max_len=512):
#     """
#     Convert 'text' into a DistilBERT embedding by:
#       1) Tokenizing with subword tokens (including [CLS], [SEP], etc.).
#       2) Running model to get last_hidden_state.
#       3) Mean-pooling the token vectors to produce one 768D vector.
#     """
#     if not isinstance(text, str) or not text.strip():
#         # Return a zero vector if empty
#         return torch.zeros(model.config.hidden_size)

#     # Tokenize & encode
#     inputs = tokenizer(
#         text,
#         add_special_tokens=True,
#         max_length=max_len,
#         truncation=True,
#         return_tensors="pt"
#     )
#     # Move data to GPU if available
#     inputs = {k: v.to(device) for k, v in inputs.items()}

#     # Forward pass
#     with torch.no_grad():
#         outputs = model(**inputs)
#         # DistilBERT -> outputs.last_hidden_state is (batch_size, seq_len, hidden_dim)
#         last_hidden_state = outputs.last_hidden_state

#     # Mean pooling across seq_len dimension
#     # shape: (batch_size, hidden_dim)
#     embedding = last_hidden_state.mean(dim=1)[0].cpu()  # move back to CPU
#     return embedding

In [75]:
# def process_embeddings_in_chunks(df, text_col, batch_size=1000, max_len=512):
#     """
#     For each row in df, convert text_col to a DistilBERT embedding.
#     We'll store the result in df["embedding"] as a list of floats (768D).
    
#     If you have a huge DataFrame, chunking helps avoid GPU out-of-memory.
#     """
#     df["embedding"] = None  # Initialize empty column
#     total_rows = len(df)
#     num_batches = (total_rows + batch_size - 1) // batch_size

#     start_idx = 0
#     for i in range(num_batches):
#         end_idx = min(start_idx + batch_size, total_rows)
#         batch = df.iloc[start_idx:end_idx].copy()

#         # Compute embeddings for each row
#         embeddings_list = []
#         for idx, row in batch.iterrows():
#             text = row[text_col]
#             emb = get_distilbert_embedding(text, max_len=max_len)
#             # Convert to list of floats if you want to store in DataFrame easily
#             embeddings_list.append(emb.tolist())

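#         # Assign them back to df (wrap in a Series so each row receives one list)
#         df.loc[df.index[start_idx:end_idx], "embedding"] = pd.Series(
#             embeddings_list, index=df.index[start_idx:end_idx]
#         )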

#         start_idx = end_idx
#         print(f"Processed batch {i+1}/{num_batches}. Rows {start_idx} so far.")
#     return df

#### **Conclusion & Recommendations** ; title + abstract

1. *Dictionary approach: Quick, but you must handle subwords or multi-word diseases carefully.*

2. *NER approach (SciSpacy, en_ner_bc5cdr_md): Better for multi-word disease detection.*

3. *Embedding approach: Keep subword tokens + special tokens for BERT-based classification or embedding extraction. Use the code above to produce a 768D vector per document.*

Stopwords & punctuation: For embedding-based approaches (like DistilBERT), do not manually remove them.
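
The commented-out helper above embeds one text at a time, so a plain mean over the sequence dimension only averages real tokens. If texts were ever batched with padding, the mean would need to skip padded positions. The NumPy sketch below illustrates that idea on toy arrays (not model outputs); `masked_mean_pool` is a hypothetical helper name.

```python
import numpy as np


def masked_mean_pool(hidden, attention_mask):
    """Mean over seq_len, counting only positions where attention_mask == 1.

    hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# One sequence of three tokens; the last token is padding and must be ignored.
hidden = np.array([[[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

With the padding row masked out, the pooled vector is the average of the first two token vectors only.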

### Variable: `mesh_terms`

This continues the exploration of the ~80k records that lack disease-related tokens, to decide whether we want to keep these rows or not.

In [77]:
df["mesh_list"] = (
    df["mesh_terms"]
    .fillna("")  # replace NaN with empty string
    .str.split(";")
    .apply(lambda x: [m.strip() for m in x if m.strip()])  # strip spaces, remove empties
)

In [None]:
from collections import Counter

mesh_counter = Counter()
for mesh_terms in df["mesh_list"]:
    # mesh_terms is a list, e.g. ["Adolescent", "Adult", ...]
    mesh_counter.update(mesh_terms)

# Turn counter into a DataFrame sorted by frequency
mesh_freq_df = pd.DataFrame(mesh_counter.most_common(), columns=["mesh_term", "count"])
mesh_freq_df.head(20)


In [None]:
# Explode the list so each row in df_exploded is one MeSH term
df_exploded = df.explode("mesh_list")
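# Then we can do a value_counts on the single item.
# Rows with an empty list explode to NaN; dropna=False keeps them visible in the counts.
mesh_freq = df_exploded["mesh_list"].value_counts(dropna=False)
# nunique() ignores NaN, so the placeholder for empty lists is not counted as a term
print("Number of unique MeSH terms:", df_exploded["mesh_list"].nunique())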
print("Top 20 MeSH terms:")
print(mesh_freq.head(20))

In [80]:
df_mesh_list = df[['uid', 'mesh_list']].copy()

In [None]:
import os 

# Folder and file paths
folder_path = "Data/2.Processed/ModellingData"

file_name_mesh_list = "P2_mesh_list.parquet"

file_path_mesh_list = os.path.join(folder_path, file_name_mesh_list)

# 1. Save the DataFrame as a single Parquet file
df_mesh_list.to_parquet(file_path_mesh_list, index=False, compression="snappy")
print(f"DataFrame saved as a single Parquet file: {file_path_mesh_list}")

In [None]:
###############################################################################
# COMPLETE CODE: CHUNK-BASED DISEASE EXTRACTION (BC5CDR) + TIME LEFT ESTIMATE
###############################################################################

import os
import time
import scispacy
import spacy
import pandas as pd
from tqdm.auto import tqdm

###############################################################################
# 1) LOAD BC5CDR MODEL, DISABLING COMPONENTS FOR SPEED
###############################################################################
try:
    nlp_bc5cdr = spacy.load(
        "en_ner_bc5cdr_md", 
        disable=["tagger", "parser", "attribute_ruler", "lemmatizer"]
    )
except Exception as e:
    print("Could not load 'en_ner_bc5cdr_md'. Make sure you installed:")
    print("  pip install scispacy")
    print("  pip install en_ner_bc5cdr_md-0.5.4.tar.gz (or your local path)")
    raise e

def extract_diseases_spacy(doc):
    """
    Extract disease mentions from BC5CDR model. (ent.label_ in {CHEMICAL, DISEASE})
    We only keep label == 'DISEASE'.
    """
    diseases = []
    for ent in doc.ents:
        if ent.label_ == "DISEASE":
            diseases.append(ent.text)
    return diseases

###############################################################################
# 2) CHUNK PROCESSING WITH RESUME & TIME REMAIN ESTIMATE
###############################################################################
def process_diseases_in_chunks_with_resume(
    df,
    text_col="title",
    chunk_size=10_000,
    batch_size=32,
    save_path="partial_bc5cdr.parquet"
):
    """
    - df: main DataFrame
    - text_col: column with text to process
    - chunk_size: # of rows per chunk
    - batch_size: # docs per nlp.pipe() batch
    - save_path: Parquet file to store partial/final results

    1) Resumes from an existing partial file if it exists.
    2) Processes row by row in chunks, each chunk using spaCy's pipe for faster NER.
    3) Shows a progress bar + estimates time left based on chunk durations.
    4) Saves partial results after each chunk, then a final full save.
    5) The disease mentions get stored in df["disease_entities_spacy"].
    """
    # Reset index so row order is stable
    df = df.reset_index(drop=True)

    # Initialize the column if absent
    if "disease_entities_spacy" not in df.columns:
        df["disease_entities_spacy"] = None

    # Figure out how many rows are already done if partial file is found
    start_idx = 0
    if os.path.exists(save_path):
        try:
            partial_df = pd.read_parquet(save_path)
            if "disease_entities_spacy" in partial_df.columns:
                df["disease_entities_spacy"] = partial_df["disease_entities_spacy"]
                done_mask = df["disease_entities_spacy"].notna()
                done_rows = done_mask.sum()
                start_idx = done_rows
                print(f"Resuming from row {start_idx} based on partial file {save_path}.")
            else:
                print(f"WARNING: {save_path} lacks 'disease_entities_spacy'. Starting fresh.")
        except Exception as e:
            print(f"Error reading partial file {save_path}: {e}")
            print("Starting from scratch.")
    else:
        print("No partial file found. Starting from scratch.")

    total_rows = len(df)
    if start_idx >= total_rows:
        print(f"All {total_rows} rows processed. Nothing to do.")
        return df

    # Calculate how many chunks remain
    remaining = total_rows - start_idx
    num_chunks = (remaining + chunk_size - 1) // chunk_size
    print(f"Starting chunked processing from row {start_idx}/{total_rows}, "
          f"{remaining} rows left, {num_chunks} chunks.\n")

    cur_row = start_idx
    chunk_times = []  # keep track of each chunk's duration to estimate time left

    # Initialize the progress bar
    with tqdm(total=num_chunks, desc="Processing Chunks", unit="chunk") as pbar:
        for i in range(num_chunks):
            chunk_start_time = time.time()

            end_idx = min(cur_row + chunk_size, total_rows)
            chunk = df.iloc[cur_row:end_idx].copy()
            texts = chunk[text_col].fillna("").tolist()

            # We'll store the results
            results = []

            # Use spaCy pipe in batch
            for doc in nlp_bc5cdr.pipe(texts, batch_size=batch_size):
                diseases = extract_diseases_spacy(doc)
                results.append(diseases)

            # Store in chunk and main df
            chunk["disease_entities_spacy"] = results
            df.iloc[cur_row:end_idx, df.columns.get_loc("disease_entities_spacy")] = chunk["disease_entities_spacy"]

            # Partial save
            df.iloc[:end_idx].to_parquet(save_path, index=False)

            # Chunk timing
            chunk_duration = time.time() - chunk_start_time
            chunk_times.append(chunk_duration)
            chunks_done = i + 1
            chunks_left = num_chunks - chunks_done
            # average chunk time so far
            avg_chunk_time = sum(chunk_times) / chunks_done
            est_time_left = avg_chunk_time * chunks_left

            # Update progress bar description with estimated time left
            pbar.set_postfix({
                "Last Chunk Time": f"{chunk_duration:.1f}s",
                "Est. Time Left": f"{est_time_left/60:.1f} min"
            })
            pbar.update(1)

            cur_row = end_idx

    # Final full save
    df.to_parquet(save_path, index=False)
    print(f"All done! Full results saved to {save_path}.\n")
    return df

###############################################################################
# EXAMPLE USAGE
###############################################################################
if __name__ == "__main__":

    # Convert list of MeSH terms to a single string per row
    df_mesh_list["mesh_list_text"] = df_mesh_list["mesh_list"].apply(lambda x: " ".join(x) if isinstance(x, list) else str(x))

    # Adjust chunk_size, batch_size to fit your environment
    df_mesh_list = process_diseases_in_chunks_with_resume(
        df_mesh_list,
        text_col="mesh_list_text",
        chunk_size=10_000,       
        batch_size=64,       
        save_path="Data/2.Processed/ModellingData/P3_bc5cdr_results_mesh_keywords.parquet"
    )

    # Inspect final results
    df_mesh_list[["mesh_list_text", "disease_entities_spacy"]]

In [None]:
df_mesh_list.head()

In [None]:
# Example usage
if __name__ == "__main__":
    batch_size = 100_000  # Define your desired chunk size

    # File paths
    file_mesh = "Data/2.Processed/ModellingData/P3_bc5cdr_results_mesh_keywords.parquet"
    file_all = "Data/2.Processed/ModellingData/P4_final_merged.parquet"

    # Read the MeSH NER results and the main dataset (each file is read once)
    df_mesh_list = read_parquet_in_batches_with_progress(file_mesh, batch_size)
    df_all = read_parquet_in_batches_with_progress(file_all, batch_size)

    # Rename columns
    df_mesh_list.rename(columns={"disease_entities_spacy": "disease_mesh_terms_spacy"}, inplace=True)

    # Select only the necessary columns for merging
    df_mesh_list = df_mesh_list[["uid", "disease_mesh_terms_spacy"]]

    # Merge with the main dataset
    df_final = pd.merge(df_all, df_mesh_list, on="uid", how="inner")

    print(f"\nFinal DataFrame with {len(df_final)} rows:")

In [None]:
df_final.copy()

In [None]:
if __name__ == "__main__":

    folder_path = "Data/2.Processed/ModellingData"
    final_file = "P5_final_new.parquet"
    batch_size = 100_000  # e.g. if you want ~10 batches for our current dataset

    result_path = save_and_merge_in_batches(
        df=df_final,
        batch_size=batch_size,
        output_folder=folder_path,
        final_filename=final_file,
        temp_batch_prefix="temp_batch_"
    )

    print(f"All done. Merged file at: {result_path}")

In [90]:
#%reset

##### `Reading dataset`

Finalized dataset (for now)

In [None]:
if __name__ == "__main__":
    file_path = "Data/2.Processed/ModellingData/P5_final_new.parquet"
    batch_size = 100_000  # Define your desired chunk size
    
    df = read_parquet_in_batches_with_progress(file_path, batch_size)
    
    print(f"\nFinal DataFrame with {len(df)} rows:")
    df.head()

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.shape

In [None]:
# Check for the absence of tokens in each variable
no_tokens_abstract = df['disease_abstract_spacy'].apply(len) == 0
no_tokens_title = df['disease_title_spacy'].apply(len) == 0
no_tokens_mesh_terms = df['disease_mesh_terms_spacy'].apply(len) == 0

# Calculate insights
# 1. No tokens in all three variables
no_tokens_in_all = (no_tokens_abstract & no_tokens_title & no_tokens_mesh_terms).sum()

# 2. No tokens in abstract only
no_tokens_in_abstract_only = (no_tokens_abstract & ~no_tokens_title & ~no_tokens_mesh_terms).sum()

# 3. No tokens in title only
no_tokens_in_title_only = (~no_tokens_abstract & no_tokens_title & ~no_tokens_mesh_terms).sum()

# 4. No tokens in mesh terms only
no_tokens_in_mesh_terms_only = (~no_tokens_abstract & ~no_tokens_title & no_tokens_mesh_terms).sum()

# 5. Tokens missing in at least one variable
no_tokens_in_at_least_one = (no_tokens_abstract | no_tokens_title | no_tokens_mesh_terms).sum()

# Print insights
print("Summary of Insights:")
print(f"Rows with no tokens in all three variables: {no_tokens_in_all}")
print(f"Rows with no tokens in abstract only: {no_tokens_in_abstract_only}")
print(f"Rows with no tokens in title only: {no_tokens_in_title_only}")
print(f"Rows with no tokens in mesh terms only: {no_tokens_in_mesh_terms_only}")
print(f"Rows with no tokens in at least one variable: {no_tokens_in_at_least_one}")

There are still over 60,000 records without any classified disease tokens: 60,247 rows have no tokens in any of the three variables.
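
Instead of writing one boolean expression per combination, every emptiness pattern can be counted at once. This is a sketch on toy data, assuming the three list-valued columns used above; `short_names` and `empty_pattern` are names introduced only for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "disease_abstract_spacy": [[], ["asthma"], [], []],
    "disease_title_spacy": [[], [], [], ["cancer"]],
    "disease_mesh_terms_spacy": [[], ["asthma"], ["neoplasms"], []],
})

short_names = {
    "disease_abstract_spacy": "abstract",
    "disease_title_spacy": "title",
    "disease_mesh_terms_spacy": "mesh",
}


def empty_pattern(row):
    """Name the columns whose entity list is empty for this row."""
    missing = [label for col, label in short_names.items() if len(row[col]) == 0]
    return "+".join(missing) if missing else "none empty"


toy["empty_in"] = toy.apply(empty_pattern, axis=1)
pattern_counts = toy["empty_in"].value_counts()
print(pattern_counts)
```

The `abstract+title+mesh` pattern corresponds to the "no tokens in all three variables" count. The same code works when the lists come back from Parquet as arrays, since only `len` is used.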

In [None]:
# Check for the presence or absence of tokens in each variable
no_tokens_abstract = df['disease_abstract_spacy'].apply(len) == 0
no_tokens_title = df['disease_title_spacy'].apply(len) == 0
has_tokens_mesh_terms = df['disease_mesh_terms_spacy'].apply(len) > 0

# Filter rows: Tokens in mesh terms but no tokens in abstract and title
rows_with_tokens_in_mesh_only = df[no_tokens_abstract & no_tokens_title & has_tokens_mesh_terms]

# Display the filtered rows
print(f"Number of rows with tokens in mesh terms but no tokens in abstract and title: {len(rows_with_tokens_in_mesh_only)}")
rows_with_tokens_in_mesh_only


In [None]:
# Check for the absence of tokens in each column
no_tokens_abstract = df['disease_abstract_spacy'].apply(len) == 0
no_tokens_title = df['disease_title_spacy'].apply(len) == 0
no_tokens_mesh_terms = df['disease_mesh_terms_spacy'].apply(len) == 0

# Filter rows: No tokens in all three columns
rows_with_no_tokens_in_all = df[no_tokens_abstract & no_tokens_title & no_tokens_mesh_terms]

# Display the filtered rows
print(f"Number of rows with no tokens in all three columns: {len(rows_with_no_tokens_in_all)}")
rows_with_no_tokens_in_all


In [None]:
# Define all columns to check for emptiness
columns_to_check = [
    'disease_abstract_spacy', 
    'disease_title_spacy', 
    'disease_mesh_terms_spacy', 
    'disease_title_tokens_simple', 
    'disease_title_tokens_hf', 
    'disease_abstract_tokens_simple', 
    'disease_abstract_tokens_hf'
]

# Check for emptiness in each column
conditions = [df[col].apply(len) == 0 for col in columns_to_check]

# Combine conditions: Check if all specified columns are empty for each row
all_empty_condition = conditions[0]
for condition in conditions[1:]:
    all_empty_condition &= condition

# Filter rows where all columns are empty
rows_with_all_columns_empty = df[all_empty_condition]

# Display the filtered rows
print(f"Number of rows with all specified columns empty: {len(rows_with_all_columns_empty)}")
rows_with_all_columns_empty

In [None]:
# Define the original and new columns
original_columns = ['disease_abstract_spacy', 'disease_title_spacy', 'disease_mesh_terms_spacy']
new_columns = ['disease_title_tokens_simple', 'disease_title_tokens_hf', 'disease_abstract_tokens_simple', 'disease_abstract_tokens_hf']

# Check for emptiness in the original columns
original_columns_empty = [df[col].apply(len) == 0 for col in original_columns]

# Check for non-emptiness in the new columns
new_columns_not_empty = [df[col].apply(len) > 0 for col in new_columns]

# Combine conditions: Original columns are empty, and new columns are not empty
original_empty_condition = original_columns_empty[0]
for condition in original_columns_empty[1:]:
    original_empty_condition &= condition

new_not_empty_condition = new_columns_not_empty[0]
for condition in new_columns_not_empty[1:]:
    new_not_empty_condition |= condition  # At least one of the new columns must not be empty

# Filter rows where original columns are empty but new columns are not empty
rows_with_new_columns_not_empty = df[original_empty_condition & new_not_empty_condition]

# Display the filtered rows
print(f"Number of rows where original columns are empty but new columns are not: {len(rows_with_new_columns_not_empty)}")
rows_with_new_columns_not_empty

In [None]:
# Define the target UID values (uids are compared as strings)
target_uids = ["37843779"]

# Filter the DataFrame for rows with the target UID values
filtered_rows_by_uid = df[df['uid'].astype(str).isin(target_uids)]

# Display the filtered rows
print(f"Rows with target UIDs {target_uids}:")
filtered_rows_by_uid

In [None]:
df

##### **Final Decision: Data Retention vs. Row Removal** `title` + `abstract` + `mesh terms`


During the data processing pipeline, we had to decide whether to remove rows that lack illness-related tokens. (It is still possible to inspect these articles for tokens the pipeline missed.)

The goal was to ensure high-quality data for further analysis and modelling while retaining as much relevant information as possible.

**Choices Considered**:

1. Title only: Retain rows with illness-related tokens in the title.
*Result*: ~1,000,000 → 480,000 rows removed (~520,000 retained).

2. Abstract + Title: Retain rows with tokens in either title or abstract.
*Result*: ~1,000,000 → 84,000 rows removed (~916,000 retained).

3. Abstract + Title + MeSH terms: Include tokens found in MeSH terms.
*Result*: ~1,000,000 → 60,000 rows removed (~940,000 retained).

4. Abstract + Title + MeSH + Hybrid Filtering: Use advanced filtering (e.g., hybrid matching and simpler keyword matching).
*Result*: ~1,000,000 → 54,000 rows removed (~946,000 retained).

5. Retain All Rows: Do not remove rows based on token presence.

**Final Decision**:
We chose Option 5 and retained all rows. Data lacking illness-related tokens is not inherently irrelevant, and these articles can still be inspected later to find out why no tokens were detected.

Keeping all rows has additional benefits:

- It maintains the integrity and diversity of the dataset.

- Downstream analyses remain flexible, as token filtering can be performed at a later stage if necessary.

- It enables broader exploration and modeling possibilities without prematurely discarding data.

With this approach, the dataset remains inclusive, so no potential insights are lost. This aligns with our goal of providing a comprehensive resource for PubMed data analysis.
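
Keeping every row while still allowing later filtering can be done by recording a flag rather than dropping data. The sketch below uses toy data; the column name `has_any_disease_token` is an assumption and does not exist in the saved dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "uid": ["1", "2", "3"],
    "disease_abstract_spacy": [["asthma"], [], []],
    "disease_title_spacy": [[], [], ["cancer"]],
    "disease_mesh_terms_spacy": [[], [], []],
})

entity_cols = [
    "disease_abstract_spacy",
    "disease_title_spacy",
    "disease_mesh_terms_spacy",
]

# Flag rows instead of dropping them; the column name is an assumption.
toy["has_any_disease_token"] = toy[entity_cols].apply(
    lambda row: any(len(v) > 0 for v in row), axis=1
)

# A later analysis can opt in to the filtered view without losing the full data.
filtered = toy[toy["has_any_disease_token"]]
print(len(toy), len(filtered))
```

Any of options 1 to 4 can be reproduced later from such flags, which is what makes retaining all rows a low-risk choice.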

## Further Processing - disclaimer

For the remaining variables we do only a brief EDA for now. We will return to them when we move on to bibliometrics, network creation, and other relevant analyses.

### Variable: `journal`

In [None]:
journal_counts = df["journal"].value_counts(dropna=False)
print("Top 20 journals:")
print(journal_counts.head(20))

In [None]:
import matplotlib.pyplot as plt

journal_counts = df["journal"].value_counts().head(20)

plt.figure(figsize=(16, 6))
bars = plt.bar(range(len(journal_counts)), journal_counts.values, color='steelblue')

# X-ticks
plt.xticks(range(len(journal_counts)), journal_counts.index, rotation=90, ha='center')
plt.title("Top 20 Journals")
plt.xlabel("Journal Name")
plt.ylabel("Count of Articles")

# Label each bar with its count
for i, rect in enumerate(bars):
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width()/2, height,
             f"{int(height)}", ha='center', va='bottom', fontsize=8)

plt.tight_layout()
plt.show()

### Variable: `authors`

During the pipeline, we added a new variable (`authors_list`, created below) to the dataset for intermediate analysis. However, we decided not to save the dataset with this variable at this stage.

If this variable is required in further processing steps, the dataset will be updated and saved accordingly in subsequent parts of the pipeline.

This is only a first look at building networks; the full network analysis comes later in the process.

In [19]:
df["authors_list"] = df["authors"].fillna("").str.split(";").apply(
    lambda x: [a.strip() for a in x if a.strip()]
)

In [None]:
import pandas as pd
from collections import Counter

df["authors_list"] = (
    df["authors"]
    .fillna("")
    .str.split(";")
    .apply(lambda x: [a.strip() for a in x if a.strip()])
)

# Now each row has a Python list of authors
author_counter = Counter()

for authors in df["authors_list"]:
    author_counter.update(authors)

# Inspect top 20 authors
for author, freq in author_counter.most_common(20):
    print(author, freq)


##### Network early analysis

In [22]:
df_netx = df[["uid","authors","authors_list"]].copy()

In [None]:
import networkx as nx
from itertools import combinations
from tqdm import tqdm

# Initialize an empty graph
G = nx.Graph()

# Function to process the authors in the DataFrame
def process_authors(df_authors):
    for authors in tqdm(df_authors, desc="Processing Authors"):
        for pair in combinations(authors, 2):  # Create pairs of co-authors
            if G.has_edge(pair[0], pair[1]):
                G[pair[0]][pair[1]]["weight"] += 1
            else:
                G.add_edge(pair[0], pair[1], weight=1)

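# Ensure 'authors_list' holds Python lists; parse string representations safely
# (ast.literal_eval only accepts literals, unlike eval, which runs arbitrary code)
import ast

df_netx["authors_list"] = df_netx["authors_list"].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x
)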

# Process the authors from the DataFrame
print("Building the collaboration network...")
process_authors(df_netx["authors_list"])

print("\nGraph construction completed.")
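
As an alternative sketch, the same edge weights can be accumulated with a `Counter` before touching the graph. Sorting each author list makes `(A, B)` and `(B, A)` the same key, and using a `set` drops names repeated within one paper. The `author_lists` sample is invented for illustration; in the notebook the input would be `df_netx["authors_list"]`.

```python
from collections import Counter
from itertools import combinations

# Invented sample data; each inner list is one paper's authors
author_lists = [
    ["Smith J", "Lee K", "Khan A"],
    ["Lee K", "Smith J"],
    ["Khan A"],  # single-author paper: contributes no pairs
]

pair_counts = Counter()
for authors in author_lists:
    # sorted(set(...)) gives one canonical ordering per pair and removes duplicates
    for pair in combinations(sorted(set(authors)), 2):
        pair_counts[pair] += 1

# Each (u, v, weight) triple can then be passed to
# networkx.Graph.add_weighted_edges_from in a single call
weighted_edges = [(u, v, w) for (u, v), w in pair_counts.items()]
print(weighted_edges)
```

Counting first and adding edges once avoids the repeated `has_edge` lookups, which may matter on a dataset of this size.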


In [None]:
# Extract the top 50 authors by degree
print("\nExtracting the top 50 authors by degree...")
top_authors = sorted(G.degree, key=lambda x: x[1], reverse=True)[:50]

# Create a subgraph with the top 50 authors
top_nodes = [node for node, _ in top_authors]
subgraph = G.subgraph(top_nodes)

# Display statistics for the subgraph
print("\nTop 50 Subgraph Statistics:")
print(f"Total Nodes: {subgraph.number_of_nodes()}")
print(f"Total Edges: {subgraph.number_of_edges()}")

# Save the subgraph to a GEXF file for visualization
nx.write_gexf(subgraph, "Data/2.Processed/EDA_AnalysisData/top_50_author_network.gexf")
print("\nSubgraph saved to 'top_50_author_network.gexf'.")

# Optional: Print the top 10 authors and their degrees
print("\nTop 10 Authors by Degree:")
for author, degree in top_authors[:10]:
    print(f"{author}: {degree} connections")

In [None]:
import matplotlib.pyplot as plt
import networkx as nx

# Customize node sizes based on degree
node_sizes = [subgraph.degree[node] * 20 for node in subgraph.nodes]  # Scale size
node_colors = range(len(subgraph.nodes))  # Gradient color mapping

# Customize edge widths based on weight
edge_widths = [subgraph[u][v]['weight'] for u, v in subgraph.edges]

# Draw graph with spring layout
plt.figure(figsize=(14, 14))
pos = nx.spring_layout(subgraph, seed=42)  # Seed ensures consistent layout

nx.draw_networkx_nodes(
    subgraph,
    pos,
    node_size=node_sizes,
    node_color=node_colors,
    cmap=plt.cm.viridis,  # Color map
    alpha=0.8
)

nx.draw_networkx_edges(
    subgraph,
    pos,
    width=[w / 10 for w in edge_widths],  # Scale down edge weights
    alpha=0.5,
    edge_color="gray"
)

# Add labels for the top 10 authors
top_labels = {node: node for node, degree in sorted(subgraph.degree, key=lambda x: x[1], reverse=True)[:10]}
nx.draw_networkx_labels(
    subgraph, pos, labels=top_labels, font_size=10, font_color="darkred"
)

plt.title("Top 50 Author Collaboration Network (Spring Layout)", fontsize=18)
plt.axis("off")
plt.show()

In [None]:
# Customize node sizes and colors
node_sizes = [subgraph.degree[node] * 20 for node in subgraph.nodes]
node_colors = range(len(subgraph.nodes))

# Draw graph with circular layout
plt.figure(figsize=(14, 14))
pos = nx.circular_layout(subgraph)  # Circular layout for visualization

nx.draw_networkx_nodes(
    subgraph,
    pos,
    node_size=node_sizes,
    node_color=node_colors,
    cmap=plt.cm.coolwarm,  # Color map
    alpha=0.8
)

nx.draw_networkx_edges(
    subgraph,
    pos,
    width=[w / 10 for w in edge_widths],  # Scale down edge weights
    alpha=0.5,
    edge_color="gray"
)

# Add labels for the top 10 authors
nx.draw_networkx_labels(
    subgraph, pos, labels=top_labels, font_size=10, font_color="blue"
)

plt.title("Top 50 Author Collaboration Network (Circular Layout)", fontsize=18)
plt.axis("off")
plt.show()


In [None]:
# Identify top nodes and edges
top_nodes = sorted(subgraph.degree, key=lambda x: x[1], reverse=True)[:10]
top_edges = sorted(subgraph.edges(data=True), key=lambda x: x[2]['weight'], reverse=True)[:20]

# Extract top nodes and edges for highlighting
highlighted_nodes = [node for node, _ in top_nodes]
highlighted_edges = [(u, v) for u, v, _ in top_edges]

# Highlighted graph
plt.figure(figsize=(14, 14))
pos = nx.spring_layout(subgraph, seed=42)

# Regular nodes and edges
nx.draw_networkx_nodes(
    subgraph,
    pos,
    node_size=node_sizes,
    node_color="lightgray",
    alpha=0.6
)
nx.draw_networkx_edges(
    subgraph,
    pos,
    width=0.5,
    alpha=0.3,
    edge_color="lightgray"
)

# Highlighted nodes and edges
nx.draw_networkx_nodes(
    subgraph,
    pos,
    nodelist=highlighted_nodes,
    node_size=300,
    node_color="red"
)
nx.draw_networkx_edges(
    subgraph,
    pos,
    edgelist=highlighted_edges,
    width=2,
    alpha=0.7,
    edge_color="red"
)

# Add labels for top nodes
highlighted_labels = {node: node for node in highlighted_nodes}
nx.draw_networkx_labels(
    subgraph, pos, labels=highlighted_labels, font_size=12, font_color="darkred"
)

plt.title("Highlighted Key Authors and Collaborations", fontsize=18)
plt.axis("off")
plt.show()


In [None]:
import pandas as pd

# Create a DataFrame for Counter-based rankings
counter_ranking = pd.DataFrame(author_counter.most_common(), columns=["Author", "Frequency"])
counter_ranking["Counter Rank"] = counter_ranking.index + 1

# Create a DataFrame for Graph-based rankings
graph_ranking = pd.DataFrame(G.degree, columns=["Author", "Degree"])
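# method="min" gives tied authors the same whole-number rank; the default
# "average" method yields fractional ranks that astype(int) would truncate
graph_ranking["Graph Rank"] = graph_ranking["Degree"].rank(ascending=False, method="min").astype(int)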

# Merge the two rankings
ranking_comparison = pd.merge(counter_ranking, graph_ranking, on="Author", how="inner")

# Display top authors with both rankings
print(ranking_comparison.head(20))

# Check correlation (Pearson, computed on the raw Frequency and Degree values, not on the ranks)
correlation = ranking_comparison["Frequency"].corr(ranking_comparison["Degree"])
print(f"Correlation between publication frequency and degree: {correlation:.2f}")
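
`Series.corr` defaults to Pearson correlation on the raw values. If the aim is to compare the two *rankings*, a rank-based (Spearman) correlation may be more appropriate. One way to get it with pandas alone is to rank both columns and correlate the ranks. The toy numbers below are invented and only stand in for `ranking_comparison`.

```python
import pandas as pd

# Invented values standing in for ranking_comparison
toy = pd.DataFrame({
    "Frequency": [120, 80, 40, 10],
    "Degree": [900, 1000, 50, 20],
})

# Pearson correlation on the raw values
pearson = toy["Frequency"].corr(toy["Degree"])

# Spearman correlation: Pearson correlation of the ranks
spearman = toy["Frequency"].rank().corr(toy["Degree"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

The rank-based value is less sensitive to a few authors with extremely high counts.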

In [28]:
# Use the subgraph created for the top 50 authors.
# Note: this rebinds G, so the cells below operate on the subgraph, not the full graph.
G = subgraph

In [None]:
# Calculate unique and shared connections
unique_connections = {}
shared_connections = {}

for node in G.nodes:
    neighbors = set(G.neighbors(node))
    unique_connections[node] = len(neighbors)
    shared_connections[node] = sum(1 for n in neighbors if len(set(G.neighbors(n)) & neighbors) > 1)

# Combine into a DataFrame
connection_stats = pd.DataFrame.from_dict(unique_connections, orient="index", columns=["Unique Connections"])
connection_stats["Shared Connections"] = connection_stats.index.map(shared_connections)

# Display top authors by unique and shared connections
print(connection_stats.sort_values("Unique Connections", ascending=False).head(10))
print(connection_stats.sort_values("Shared Connections", ascending=False).head(10))


In [None]:
collaborator_ranks = []

for node, degree in top_authors:
    for neighbor in G.neighbors(node):
        neighbor_degree = G.degree(neighbor)
        collaborator_ranks.append({"Author": node, "Author Degree": degree, "Neighbor": neighbor, "Neighbor Degree": neighbor_degree})

collaborator_ranks_df = pd.DataFrame(collaborator_ranks)

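# Example: Check if high-ranked authors collaborate with similarly ranked ones.
# Note: "Author Degree" comes from top_authors (computed on the full graph), whereas
# "Neighbor Degree" is measured on whatever G is bound to here (the top-50 subgraph
# if the reassignment cell above was run), so the two columns may be on different scales.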
high_ranked_collaborators = collaborator_ranks_df[collaborator_ranks_df["Author Degree"] > 5000]
print(high_ranked_collaborators.head(10))

In [None]:
# Clustering coefficient
clustering = nx.clustering(G)
clustering_df = pd.DataFrame.from_dict(clustering, orient="index", columns=["Clustering Coefficient"])

# Betweenness centrality
betweenness = nx.betweenness_centrality(G)
betweenness_df = pd.DataFrame.from_dict(betweenness, orient="index", columns=["Betweenness Centrality"])

# Merge metrics
author_metrics = pd.concat([clustering_df, betweenness_df], axis=1)

# Display top authors by centrality
print(author_metrics.sort_values("Betweenness Centrality", ascending=False).head(10))


### Variable: `affiliations`

In [None]:
# Missing affiliations are counted under the empty string ""
aff_counter = df["affiliations"].fillna("").value_counts()
print("Top 20 affiliations:")
aff_counter.head(20)

In [None]:
aff_freq = df["affiliations"].value_counts(dropna=False)
print("Number of unique affiliations:", len(aff_freq))
print("Top 20 affiliation strings:")
print(aff_freq.head(20))

### Variable: `mesh_terms` continuation, some additional insights

In [36]:
# df["mesh_list"] = df["mesh_terms"].fillna("").str.split(";").apply(
#     lambda x: [m.strip() for m in x if m.strip()]
# )
df["mesh_list"] = (
    df["mesh_terms"]
    .fillna("")  # replace NaN with empty string
    .str.split(";")
    .apply(lambda x: [m.strip() for m in x if m.strip()])  # strip spaces, remove empties
)

In [None]:
from collections import Counter

mesh_counter = Counter()
for mesh_terms in df["mesh_list"]:
    # mesh_terms is a list, e.g. ["Adolescent", "Adult", ...]
    mesh_counter.update(mesh_terms)

# Turn counter into a DataFrame sorted by frequency
mesh_freq_df = pd.DataFrame(mesh_counter.most_common(), columns=["mesh_term", "count"])
mesh_freq_df.head(20)

In [None]:
# Explode the list so each row in df_exploded is one MeSH term
df_exploded = df.explode("mesh_list")
# Then we can do a value_counts on the single item
mesh_freq = df_exploded["mesh_list"].value_counts(dropna=False)
print("Number of unique MeSH terms:", len(mesh_freq))
print("Top 20 MeSH terms:")
print(mesh_freq.head(20))


MeSH terms will be explored further in the next notebook, which covers the analysis.

### Variable: `keywords`

Keywords will be explored in the next notebook, which covers the analysis.

### Variable: `coi_statement`

This column has too many missing values (NA) for us to process it meaningfully.
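
One way to back up a statement like this is to compute the share of missing values per column. The toy frame below is invented; on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({
    "keywords": ["a; b", None, "c", None],
    "coi_statement": [None, None, None, "The authors declare no conflict."],
})

# isna().mean() gives the fraction of missing values in each column
missing_share = toy[["keywords", "coi_statement"]].isna().mean()
print(missing_share)
```

Note that `isna()` does not treat empty strings as missing, so those would need a separate check.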

## FINALIZING DATASET SAVE

In [4]:
import os
os.getcwd()

'h:\\000_Projects\\01_GitHub\\05_PythonProjects\\PubMedResearch\\Notebooks'

In [5]:
if __name__ == "__main__":

    # Define the base path relative to the current file
    base_path = os.path.abspath(os.path.join(os.getcwd(), ".."))  # Move one directory up
    
    # Construct the file path
    file_path = os.path.join(base_path, "Data", "2.Processed", "ModellingData", "P5_final_new.parquet")
    
    batch_size = 100_000  # Define your desired chunk size
    
    # Read the file in batches
    df = read_parquet_in_batches_with_progress(file_path, batch_size)
    
    print(f"\nFinal DataFrame with {len(df)} rows:")
    print(df.head())


Processing Batches:   9%|▉         | 100000/1057871 [01:07<10:49, 1474.29rows/s]

Processed Chunk 1: 100000 rows


Processing Batches:  19%|█▉        | 200000/1057871 [01:14<04:30, 3166.17rows/s]

Processed Chunk 2: 100000 rows


Processing Batches:  28%|██▊       | 300000/1057871 [01:20<02:32, 4966.46rows/s]

Processed Chunk 3: 100000 rows


Processing Batches:  38%|███▊      | 400000/1057871 [01:25<01:32, 7097.79rows/s]

Processed Chunk 4: 100000 rows


Processing Batches:  47%|████▋     | 500000/1057871 [01:30<01:01, 9118.62rows/s]

Processed Chunk 5: 100000 rows


Processing Batches:  57%|█████▋    | 600000/1057871 [01:35<00:40, 11373.10rows/s]

Processed Chunk 6: 100000 rows


Processing Batches:  66%|██████▌   | 700000/1057871 [01:39<00:26, 13581.42rows/s]

Processed Chunk 7: 100000 rows


Processing Batches:  76%|███████▌  | 800000/1057871 [01:44<00:16, 15602.29rows/s]

Processed Chunk 8: 100000 rows


Processing Batches:  85%|████████▌ | 900000/1057871 [01:48<00:09, 17086.30rows/s]

Processed Chunk 9: 100000 rows


Processing Batches:  95%|█████████▍| 1000000/1057871 [01:53<00:03, 18317.15rows/s]

Processed Chunk 10: 100000 rows


Processing Batches: 100%|██████████| 1057871/1057871 [01:56<00:00, 9098.94rows/s] 


Processed Chunk 11: 57871 rows

Final DataFrame with 1057871 rows:
        uid                                              title  \
0  10186596  The potential impact of health care reform on ...   
1  10186588  New Jersey health promotion and disease preven...   
2  10186587  Who will provide preventive services? The chan...   
3  10163501  Cytoreduction of small intestine metastases us...   
4  10157383  Racial differences in access to kidney transpl...   

                                             journal  \
0  Journal of public health management and practi...   
1  Journal of public health management and practi...   
2  Journal of public health management and practi...   
3                     Journal of gynecologic surgery   
4                       Health care financing review   

                                            abstract  \
0  General: This article observes that, despite t...   
1  General: Health promotion is a major component...   
2  General: Health care reform 

In [None]:
# if __name__ == "__main__":

#     # Define the base path relative to the current file
#     base_path = os.path.abspath(os.path.join(os.getcwd(), ".."))  # Move one directory up

#     # Construct the file path
#     file_path = os.path.join(base_path, "Data", "2.Processed", "ModellingData", "P5_final_new.parquet")

#     # Define the output folder and final file
#     output_folder = os.path.join(base_path, "Data", "2.Processed", "ModellingData")
#     final_file = "P6_merged_tokens.parquet"

#     batch_size = 100_000  # Define your desired chunk size
    
#     # Save and merge in batches
#     result_path = save_and_merge_in_batches(
#         df=df,
#         batch_size=batch_size,
#         output_folder=output_folder,
#         final_filename=final_file,
#         temp_batch_prefix="temp_batch_"
#     )

#     print(f"All done. Merged file at: {result_path}")


Splitting DataFrame of 1057871 rows into 11 batches (size=100000).


Saving Batches:   9%|▉         | 1/11 [00:10<01:42, 10.28s/batch]

  -> Batch 1 rows [0:100000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_1.parquet


Saving Batches:  18%|█▊        | 2/11 [00:19<01:28,  9.84s/batch]

  -> Batch 2 rows [100000:200000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_2.parquet


Saving Batches:  27%|██▋       | 3/11 [00:29<01:18,  9.86s/batch]

  -> Batch 3 rows [200000:300000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_3.parquet


Saving Batches:  36%|███▋      | 4/11 [00:39<01:09,  9.93s/batch]

  -> Batch 4 rows [300000:400000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_4.parquet


Saving Batches:  45%|████▌     | 5/11 [00:49<01:00, 10.01s/batch]

  -> Batch 5 rows [400000:500000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_5.parquet


Saving Batches:  55%|█████▍    | 6/11 [01:00<00:50, 10.08s/batch]

  -> Batch 6 rows [500000:600000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_6.parquet


Saving Batches:  64%|██████▎   | 7/11 [01:11<00:41, 10.42s/batch]

  -> Batch 7 rows [600000:700000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_7.parquet


Saving Batches:  73%|███████▎  | 8/11 [01:22<00:32, 10.84s/batch]

  -> Batch 8 rows [700000:800000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_8.parquet


Saving Batches:  82%|████████▏ | 9/11 [01:35<00:22, 11.22s/batch]

  -> Batch 9 rows [800000:900000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_9.parquet


Saving Batches:  91%|█████████ | 10/11 [01:47<00:11, 11.56s/batch]

  -> Batch 10 rows [900000:1000000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_10.parquet


Saving Batches: 100%|██████████| 11/11 [01:54<00:00, 10.40s/batch]


  -> Batch 11 rows [1000000:1057871] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_11.parquet

Merging 11 batch files into h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\P6_merged_tokens.parquet...


Merging Batches: 100%|██████████| 11/11 [01:44<00:00,  9.48s/file]


In [6]:
df.head()

Unnamed: 0,uid,title,journal,abstract,authors,affiliations,mesh_terms,keywords,coi_statement,parsed_date,...,cleaned_title_tokens_hf,cleaned_abstract_tokens_simple,cleaned_abstract_tokens_hf,disease_title_tokens_simple,disease_title_tokens_hf,disease_abstract_tokens_simple,disease_abstract_tokens_hf,disease_abstract_spacy,disease_title_spacy,disease_mesh_terms_spacy
0,10186596,The potential impact of health care reform on ...,Journal of public health management and practi...,"General: This article observes that, despite t...",Auerbach J; McGuire J,"HIV/AIDS Bureau, Massachusetts Department of P...","Financing, Government; HIV Infections; Health ...",,,1995-01-01,...,"[[CLS], potential, impact, health, care, refor...","[general, article, observes, despite, clear, p...","[[CLS], general, article, observes, despite, c...",[hiv],[hiv],"[hiv, aids]","[hiv, aids]","[human immunodeficiency virus (HIV) disease, a...",[],[HIV Infections]
1,10186588,New Jersey health promotion and disease preven...,Journal of public health management and practi...,General: Health promotion is a major component...,Louria D B,Department of Preventive Medicine and Communit...,Female; Health Education; Health Promotion; Hu...,,,1995-01-01,...,"[[CLS], new, jersey, health, promotion, diseas...","[general, health, promotion, major, component,...","[[CLS], general, health, promotion, major, com...",[],[],[],[],[],[],[]
2,10186587,Who will provide preventive services? The chan...,Journal of public health management and practi...,General: Health care reform in the United Stat...,Pearson T A; Spencer M; Jenkins P,"Mary Imogene Bassett Research Institute, Coope...",Delivery of Health Care; Female; Health Care R...,,,1995-01-01,...,"[[CLS], provide, prevent, ##ive, services, ?, ...","[general, health, care, reform, united, states...","[[CLS], general, health, care, reform, united,...",[],[],[],[],[],[],[]
3,10163501,Cytoreduction of small intestine metastases us...,Journal of gynecologic surgery,General: The Cavitron Ultrasonic Surgical Aspi...,Adelson M D,"Department of Obstetrics and Gynecology, Crous...",Adenocarcinoma; Fallopian Tube Neoplasms; Fema...,,,1995-01-01,...,"[[CLS], cy, ##tore, ##duction, small, int, ##e...","[general, cavitron, ultrasonic, surgical, aspi...","[[CLS], general, ca, ##vi, ##tron, ultra, ##so...",[],[],[tumor],[tumor],"[carcinoma of the ovary, and one each had, tub...",[],"[Adenocarcinoma, Neoplasms, Ovarian Neoplasms]"
4,10157383,Racial differences in access to kidney transpl...,Health care financing review,General: Previous work has documented large di...,Eggers P W,"Office of Research and Demonstrations, Health ...",Adolescent; Adult; Black or African American; ...,Empirical Approach; End Stage Renal Disease Pr...,,1995-01-01,...,"[[CLS], racial, differences, access, kidney, t...","[general, previous, work, documented, large, d...","[[CLS], general, previous, work, documented, l...",[],[],[],[],"[renal failure, renal failure, end stage renal...",[],[American Kidney Failure]


In [9]:
df.dtypes

uid                                       object
title                                     object
journal                                   object
abstract                                  object
authors                                   object
affiliations                              object
mesh_terms                                object
keywords                                  object
coi_statement                             object
parsed_date                       datetime64[ns]
cleaned_title_tokens_simple               object
cleaned_title_tokens_hf                   object
cleaned_abstract_tokens_simple            object
cleaned_abstract_tokens_hf                object
disease_title_tokens_simple               object
disease_title_tokens_hf                   object
disease_abstract_tokens_simple            object
disease_abstract_tokens_hf                object
disease_abstract_spacy                    object
disease_title_spacy                       object
disease_mesh_terms_s

In [11]:
token_columns = [
        "disease_title_tokens_simple",
        "disease_title_tokens_hf",
        "disease_abstract_tokens_simple",
        "disease_abstract_tokens_hf",
        "disease_title_spacy",
        "disease_mesh_terms_spacy",
        "disease_abstract_spacy",
    ]

# Step 1: Identify missing rows for each column
def count_missing_rows(column):
    def is_missing(value):
        # Handle lists
        if isinstance(value, list):
            return len(value) == 0  # Empty list
        # Handle numpy arrays
        if hasattr(value, "shape") and len(value.shape) > 0:
            return value.size == 0  # Empty array
        # Handle None or empty scalar values
        return value is None or pd.isna(value)
    return df[column].apply(is_missing).sum()

missing_counts = {col: count_missing_rows(col) for col in token_columns}
sorted_columns = sorted(missing_counts.items(), key=lambda x: x[1], reverse=True)

print("Missing rows for each column (sorted by missing count):")
for col, count in sorted_columns:
    print(f"{col}: {count} missing rows")

# Step 2: Iteratively combine columns and calculate remaining missing rows
cumulative_set = set()  # Tracks rows that are NOT missing across combined columns
results = []  # Stores results for each iteration
final_missing_indices = set(df.index)  # To track rows still missing after final iteration

# Helper defined once, outside the loop: True when a cell holds at least one token
def not_missing(value):
    if isinstance(value, list):
        return len(value) > 0  # Non-empty list
    if hasattr(value, "shape") and len(value.shape) > 0:
        return value.size > 0  # Non-empty array
    return value is not None and not pd.isna(value)

for col, _ in sorted_columns:
    # Get non-missing rows for the current column
    current_set = set(df.index[df[col].apply(not_missing)])
    cumulative_set |= current_set  # Union with cumulative non-missing rows
    final_missing_indices -= current_set  # Remove these rows from the missing set
    missing_rows_after_merge = len(df) - len(cumulative_set)  # Remaining missing rows
    results.append({"Included Columns": col, "Missing Rows": missing_rows_after_merge})
    print(f"Included {col}: {missing_rows_after_merge} rows still missing")

# Step 3: Output results as a DataFrame for inspection
results_df = pd.DataFrame(results)
print("\nFinal results after merging columns:")
print(results_df)

# Step 4: Get insights into the abstracts or rows still missing after the final iteration
print("\nInspecting rows still missing after the final iteration...")

# Convert the set of indices to a list
missing_indices_list = list(final_missing_indices)

# Extract the rows using the corrected list
missing_df = df.loc[missing_indices_list, ["uid", "title", "abstract", "mesh_terms"]]

# Display a few examples of missing rows
print(missing_df.head())
print(f"Total missing rows after all columns included: {len(missing_df)}")

# Save missing rows to a file if needed for further inspection
# missing_df.to_csv("missing_rows_final.csv", index=False)



Missing rows for each column (sorted by missing count):
disease_title_tokens_hf: 854118 missing rows
disease_title_tokens_simple: 796808 missing rows
disease_abstract_tokens_hf: 643789 missing rows
disease_abstract_tokens_simple: 548913 missing rows
disease_title_spacy: 403970 missing rows
disease_mesh_terms_spacy: 299410 missing rows
disease_abstract_spacy: 87103 missing rows
Included disease_title_tokens_hf: 854118 rows still missing
Included disease_title_tokens_simple: 793403 rows still missing
Included disease_abstract_tokens_hf: 590192 rows still missing
Included disease_abstract_tokens_simple: 537052 rows still missing
Included disease_title_spacy: 251758 rows still missing
Included disease_mesh_terms_spacy: 145753 rows still missing
Included disease_abstract_spacy: 54819 rows still missing

Final results after merging columns:
                 Included Columns  Missing Rows
0         disease_title_tokens_hf        854118
1     disease_title_tokens_simple        793403
2      di

In [12]:
missing_df

Unnamed: 0,uid,title,abstract,mesh_terms
131072,11099584,Providing pediatric subspecialty care: A workf...,OBJECTIVE: To provide a snapshot of pediatric ...,Adolescent; Adult; Aged; Cardiology; Child; Cr...
1,10186588,New Jersey health promotion and disease preven...,General: Health promotion is a major component...,Female; Health Education; Health Promotion; Hu...
2,10186587,Who will provide preventive services? The chan...,General: Health care reform in the United Stat...,Delivery of Health Care; Female; Health Care R...
262147,15722370,Impact of misclassification of in vitro fertil...,OBJECTIVE: To determine whether failure to ade...,Data Collection; Dietary Supplements; Female; ...
917506,34794827,Physiologic flow-conditioning limits vascular ...,General: Hemodynamics play a central role in t...,Animals; Capillaries; Endothelial Cells; Hemod...
...,...,...,...,...
786420,30631154,LncRNAs-directed PTEN enzymatic switch governs...,General: Despite the structural conservation o...,Animals; Cell Line; Epithelial-Mesenchymal Tra...
917494,34796962,Autoantibodies to red blood cell surface Glyco...,BACKGROUND: Both M and N alleles encode antige...,Autoantibodies; Blood Group Antigens; Erythroc...
262138,15723817,The JCAHO patient safety event taxonomy: a sta...,BACKGROUND: The current US national discussion...,Causality; Classification; Communication; Heal...
524285,21216555,Health care utilization before and after an ou...,BACKGROUND: Older adults in the United States ...,Aged; Chi-Square Distribution; Delivery of Hea...


In [22]:
import pandas as pd

# Assuming you already have your DataFrame loaded as df
# df = pd.read_parquet("your_file.parquet", engine="pyarrow")  # Example load

token_columns = [
    "disease_title_tokens_simple",
    "disease_title_tokens_hf",
    "disease_abstract_tokens_simple",
    "disease_abstract_tokens_hf",
    "disease_title_spacy",
    "disease_mesh_terms_spacy",
    "disease_abstract_spacy",
]

In [None]:

# # Create a new column with the merged tokens
# df['merged_tokens'] = df.apply(
#     lambda row: [
#         token 
#         for col in token_columns
#         # row[col] is a NumPy array; convert to list before iterating
#         for token in row[col].tolist()
#     ],
#     axis=1
# )

# # (Optional) Inspect the first few rows
# print(df[['merged_tokens']].head())


In [23]:
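# Merge all disease-token columns into one list per row, in token_columns order.
# Zipping the columns avoids building a Series per row as iterrows() does;
# list(...) accepts both NumPy arrays and plain Python lists.
df['merged_tokens'] = [
    [token for tokens in row_values if tokens is not None for token in list(tokens)]
    for row_values in zip(*(df[col] for col in token_columns))
]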

# (Optional) Inspect the first few rows
print(df[['merged_tokens']].head())


                                       merged_tokens
0  [hiv, hiv, hiv, aids, hiv, aids, HIV Infection...
1                                                 []
2                                                 []
3  [tumor, tumor, Adenocarcinoma, Neoplasms, Ovar...
4  [American Kidney Failure, renal failure, renal...
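
The merged lists contain repeats (for example, `hiv` appears once per source column that detected it). If a later step needs unique tokens, an order-preserving de-duplication could look like the sketch below. Lower-casing before comparing is an assumption about what should count as the same token, and the helper name `unique_tokens` is hypothetical.

```python
def unique_tokens(tokens):
    """Return tokens lower-cased and de-duplicated, keeping first-seen order."""
    # dict preserves insertion order, so fromkeys acts as an ordered set
    return list(dict.fromkeys(token.lower() for token in tokens))


example = ["hiv", "hiv", "hiv", "aids", "hiv", "aids", "HIV Infections"]
print(unique_tokens(example))
```

On the DataFrame this could be applied as `df["merged_tokens"].apply(unique_tokens)`. Whether repeats should be kept (as a rough signal of agreement between detectors) or removed depends on the downstream analysis.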


In [24]:
df.head()

Unnamed: 0,uid,title,journal,abstract,authors,affiliations,mesh_terms,keywords,coi_statement,parsed_date,...,cleaned_abstract_tokens_simple,cleaned_abstract_tokens_hf,disease_title_tokens_simple,disease_title_tokens_hf,disease_abstract_tokens_simple,disease_abstract_tokens_hf,disease_abstract_spacy,disease_title_spacy,disease_mesh_terms_spacy,merged_tokens
0,10186596,The potential impact of health care reform on ...,Journal of public health management and practi...,"General: This article observes that, despite t...",Auerbach J; McGuire J,"HIV/AIDS Bureau, Massachusetts Department of P...","Financing, Government; HIV Infections; Health ...",,,1995-01-01,...,"[general, article, observes, despite, clear, p...","[[CLS], general, article, observes, despite, c...",[hiv],[hiv],"[hiv, aids]","[hiv, aids]","[human immunodeficiency virus (HIV) disease, a...",[],[HIV Infections],"[hiv, hiv, hiv, aids, hiv, aids, HIV Infection..."
1,10186588,New Jersey health promotion and disease preven...,Journal of public health management and practi...,General: Health promotion is a major component...,Louria D B,Department of Preventive Medicine and Communit...,Female; Health Education; Health Promotion; Hu...,,,1995-01-01,...,"[general, health, promotion, major, component,...","[[CLS], general, health, promotion, major, com...",[],[],[],[],[],[],[],[]
2,10186587,Who will provide preventive services? The chan...,Journal of public health management and practi...,General: Health care reform in the United Stat...,Pearson T A; Spencer M; Jenkins P,"Mary Imogene Bassett Research Institute, Coope...",Delivery of Health Care; Female; Health Care R...,,,1995-01-01,...,"[general, health, care, reform, united, states...","[[CLS], general, health, care, reform, united,...",[],[],[],[],[],[],[],[]
3,10163501,Cytoreduction of small intestine metastases us...,Journal of gynecologic surgery,General: The Cavitron Ultrasonic Surgical Aspi...,Adelson M D,"Department of Obstetrics and Gynecology, Crous...",Adenocarcinoma; Fallopian Tube Neoplasms; Fema...,,,1995-01-01,...,"[general, cavitron, ultrasonic, surgical, aspi...","[[CLS], general, ca, ##vi, ##tron, ultra, ##so...",[],[],[tumor],[tumor],"[carcinoma of the ovary, and one each had, tub...",[],"[Adenocarcinoma, Neoplasms, Ovarian Neoplasms]","[tumor, tumor, Adenocarcinoma, Neoplasms, Ovar..."
4,10157383,Racial differences in access to kidney transpl...,Health care financing review,General: Previous work has documented large di...,Eggers P W,"Office of Research and Demonstrations, Health ...",Adolescent; Adult; Black or African American; ...,Empirical Approach; End Stage Renal Disease Pr...,,1995-01-01,...,"[general, previous, work, documented, large, d...","[[CLS], general, previous, work, documented, l...",[],[],[],[],"[renal failure, renal failure, end stage renal...",[],[American Kidney Failure],"[American Kidney Failure, renal failure, renal..."


In [16]:
df["is_missing_merged_tokens"].sum()

54819

In [20]:
df.dtypes

uid                                       object
title                                     object
journal                                   object
abstract                                  object
authors                                   object
affiliations                              object
mesh_terms                                object
keywords                                  object
coi_statement                             object
parsed_date                       datetime64[ns]
cleaned_title_tokens_simple               object
cleaned_title_tokens_hf                   object
cleaned_abstract_tokens_simple            object
cleaned_abstract_tokens_hf                object
disease_title_tokens_simple               object
disease_title_tokens_hf                   object
disease_abstract_tokens_simple            object
disease_abstract_tokens_hf                object
disease_abstract_spacy                    object
disease_title_spacy                       object
disease_mesh_terms_s

In [21]:
# Function to check types in all columns
def inspect_column_types(df):
    for col in df.columns:
        if df[col].dtype == "object":  # Focus on object columns
            type_counts = df[col].apply(type).value_counts()
            print(f"Column: {col}")
            print(type_counts)
            print("-" * 40)

# Run the inspection
inspect_column_types(df)

Column: uid
uid
<class 'str'>    1057871
Name: count, dtype: int64
----------------------------------------
Column: title
title
<class 'str'>    1057871
Name: count, dtype: int64
----------------------------------------
Column: journal
journal
<class 'str'>    1057871
Name: count, dtype: int64
----------------------------------------
Column: abstract
abstract
<class 'str'>    1057871
Name: count, dtype: int64
----------------------------------------
Column: authors
authors
<class 'str'>    1057871
Name: count, dtype: int64
----------------------------------------
Column: affiliations
affiliations
<class 'str'>    1057871
Name: count, dtype: int64
----------------------------------------
Column: mesh_terms
mesh_terms
<class 'str'>    1057871
Name: count, dtype: int64
----------------------------------------
Column: keywords
keywords
<class 'str'>    1057871
Name: count, dtype: int64
----------------------------------------
Column: coi_statement
coi_statement
<class 'str'>    1057871
Nam
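
The printout above repeats the same block for every column even when all values share one type. A more compact check, sketched here, reports only the object columns that hold more than one Python type. The helper name `mixed_type_columns` and the toy frame are hypothetical and not part of the notebook; the columns are built with `dtype=object` explicitly so the sketch does not depend on pandas' default string dtype.

```python
import pandas as pd


def mixed_type_columns(frame):
    """Return {column: {type_name: count}} for object columns with more than one Python type."""
    report = {}
    for col in frame.columns:
        if frame[col].dtype == "object":  # Same filter as inspect_column_types
            # Map every value to the name of its Python type, then count the names
            counts = frame[col].map(lambda v: type(v).__name__).value_counts()
            if len(counts) > 1:
                report[col] = counts.to_dict()
    return report


# Toy data (hypothetical): "keywords" mixes str and None, the other columns are uniform
toy = pd.DataFrame({
    "uid": pd.Series(["1", "2", "3"], dtype=object),
    "keywords": pd.Series(["a; b", None, "c"], dtype=object),
    "tokens": pd.Series([["x"], [], ["y", "z"]], dtype=object),
})

report = mixed_type_columns(toy)
print(report)
```

An empty result would correspond to the situation shown above, where each inspected column contains only `str` values.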

In [25]:
if __name__ == "__main__":

    # Define the base path relative to the current working directory
    base_path = os.path.abspath(os.path.join(os.getcwd(), ".."))  # Move one directory up

    # Path of the input file (not used in this cell; df is passed in directly)
    file_path = os.path.join(base_path, "Data", "2.Processed", "ModellingData", "P5_final_new.parquet")

    # Define the output folder and final file
    output_folder = os.path.join(base_path, "Data", "2.Processed", "ModellingData")
    final_file = "P6_merged_tokens.parquet"

    batch_size = 100_000  # Define your desired chunk size
    
    # Save and merge in batches
    result_path = save_and_merge_in_batches(
        df=df,
        batch_size=batch_size,
        output_folder=output_folder,
        final_filename=final_file,
        temp_batch_prefix="temp_batch_"
    )

    print(f"All done. Merged file at: {result_path}")

Splitting DataFrame of 1057871 rows into 11 batches (size=100000).


Saving Batches:   9%|▉         | 1/11 [00:10<01:42, 10.28s/batch]

  -> Batch 1 rows [0:100000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_1.parquet


Saving Batches:  18%|█▊        | 2/11 [00:20<01:30, 10.01s/batch]

  -> Batch 2 rows [100000:200000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_2.parquet


Saving Batches:  27%|██▋       | 3/11 [00:30<01:20, 10.12s/batch]

  -> Batch 3 rows [200000:300000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_3.parquet


Saving Batches:  36%|███▋      | 4/11 [00:40<01:10, 10.11s/batch]

  -> Batch 4 rows [300000:400000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_4.parquet


Saving Batches:  45%|████▌     | 5/11 [00:50<01:00, 10.16s/batch]

  -> Batch 5 rows [400000:500000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_5.parquet


Saving Batches:  55%|█████▍    | 6/11 [01:01<00:51, 10.28s/batch]

  -> Batch 6 rows [500000:600000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_6.parquet


Saving Batches:  64%|██████▎   | 7/11 [01:12<00:42, 10.68s/batch]

  -> Batch 7 rows [600000:700000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_7.parquet


Saving Batches:  73%|███████▎  | 8/11 [01:24<00:33, 11.02s/batch]

  -> Batch 8 rows [700000:800000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_8.parquet


Saving Batches:  82%|████████▏ | 9/11 [01:36<00:22, 11.33s/batch]

  -> Batch 9 rows [800000:900000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_9.parquet


Saving Batches:  91%|█████████ | 10/11 [01:49<00:11, 11.74s/batch]

  -> Batch 10 rows [900000:1000000] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_10.parquet


Saving Batches: 100%|██████████| 11/11 [01:57<00:00, 10.64s/batch]


  -> Batch 11 rows [1000000:1057871] saved to h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\temp_batches\temp_batch_11.parquet

Merging 11 batch files into h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\P6_merged_tokens.parquet...


Merging Batches: 100%|██████████| 11/11 [01:19<00:00,  7.21s/file]


Final merged DataFrame saved as: h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\P6_merged_tokens.parquet

Temporary batch files removed. All done!
All done. Merged file at: h:\000_Projects\01_GitHub\05_PythonProjects\PubMedResearch\Data\2.Processed\ModellingData\P6_merged_tokens.parquet
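
The batch count and row ranges in the log above follow from ceiling division of the row count by the batch size. The snippet below is a minimal sketch of that arithmetic only; `batch_ranges` is a hypothetical helper and is not the notebook's `save_and_merge_in_batches`, whose internals are not shown here.

```python
import math


def batch_ranges(n_rows, batch_size):
    """Yield (batch_number, start, end) using half-open [start:end) row ranges."""
    n_batches = math.ceil(n_rows / batch_size)
    for i in range(n_batches):
        start = i * batch_size
        end = min(start + batch_size, n_rows)  # The last batch may be shorter
        yield i + 1, start, end


ranges = list(batch_ranges(1_057_871, 100_000))
print(len(ranges))   # 11
print(ranges[0])     # (1, 0, 100000)
print(ranges[-1])    # (11, 1000000, 1057871)
```

The final range matches the last batch in the log, which holds the remaining 57,871 rows.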
