# PubMed Analysis Statistics

This notebook loads the `pubmed_analysis.parquet` file and computes summary statistics for the processed columns, including:

- Publication metadata (dates, journals, authors, etc.)
- Swiss affiliation analysis
- Goldhamster model predictions
- Text analysis metrics
- Data quality assessment

## 1. Import Required Libraries

Import all necessary libraries for data analysis and statistics.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Analysis timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Libraries imported successfully!
Pandas version: 2.3.3
NumPy version: 1.26.4
Analysis timestamp: 2025-12-09 11:49:05


## 2. Load the Parquet File

Load the processed PubMed analysis data from the parquet file.

In [2]:
# Define file path
parquet_file = Path("../data/results/pubmed_analysis.parquet")
    
# Load the data
df = pd.read_parquet(parquet_file)

print(f"✅ Successfully loaded {len(df):,} records")
print(f"File size: {parquet_file.stat().st_size / 1024 / 1024:.1f} MB")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024 / 1024:.1f} MB")
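
# mesh_terms appears to be stored as the string form of a Python list, e.g. "['A', 'B']".
# Strip the leading "['" and trailing "']", then split on the "', '" separator.
# Caveat: a record with no terms ("[]") becomes [''], so an empty-string term
# shows up in the per-term counts in Section 4.
df['mesh_terms'] = df['mesh_terms'].str[2:-2].str.split("', '")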

✅ Successfully loaded 350,291 records
File size: 585.6 MB
Memory usage: 2251.9 MB
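
The slice-and-split parsing of `mesh_terms` in the cell above assumes that every term is wrapped in single quotes and that no record is empty. A minimal sketch of a more tolerant alternative, assuming the column really holds the string form of a Python list, uses `ast.literal_eval` from the standard library. The helper name `parse_list_string` and the sample values are hypothetical and not part of the original notebook.

```python
import ast


def parse_list_string(value):
    """Parse a string such as "['A', 'B']" into a list of strings.

    Returns an empty list for missing values or text that is not a list literal.
    """
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


# Made-up sample values covering the awkward cases:
# an empty list, a missing value, a comma inside a term, and a double-quoted term.
samples = [
    "['Animals', 'Goats']",
    "[]",
    None,
    "['Receptors, Cell Surface']",
    "[\"Sjogren's Syndrome\"]",
]
parsed_samples = [parse_list_string(s) for s in samples]
```

In the notebook this could be applied as `df['mesh_terms'] = df['mesh_terms'].apply(parse_list_string)`. Records without terms would then yield `[]` instead of `['']`, so they would drop out of the per-term counts rather than being grouped under an empty string.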


In [3]:
df.head()

Unnamed: 0,pmid,title,abstract,publication_date,journal_title,journal_iso_abbreviation,journal_issn,doi,mesh_terms,keywords,authors,download_timestamp,any_author_has_swiss_affiliation,first_author_has_swiss_affiliation,last_author_has_swiss_affiliation,author_count,mesh_term_count,keyword_count,has_abstract,publication_year,publication_month,goldhamster_in_silico,goldhamster_organs,goldhamster_other,goldhamster_human,goldhamster_in_vivo,goldhamster_invertebrate,goldhamster_primary_cells,goldhamster_immortal_cell_line,processing_timestamp
0,24945814,Introgression from domestic goat generated var...,The major histocompatibility complex (MHC) is ...,2014-06-01,PLoS genetics,PLoS Genet,1553-7404,10.1371/journal.pgen.1004438,"[Animals, Base Sequence, Goats, HLA-DR beta-Ch...",[],"[{'last_name': 'Grossen', 'first_name': 'Chris...",2025-12-06T10:12:00.664494,True,True,False,5,10,0,True,2014.0,6.0,0,0,0,0,1,0,0,0,2025-12-09T11:13:20.440868
1,24895028,Divergent dynamics and the Kauzmann temperatur...,In the last decade the challenging analysis of...,2014-06-04,Scientific reports,Sci Rep,2045-2322,10.1038/srep05160,[],[],"[{'last_name': 'Martinez-Garcia', 'first_name'...",2025-12-06T10:12:02.617077,True,True,False,5,0,0,True,2014.0,6.0,0,0,1,0,0,0,0,0,2025-12-09T11:13:20.440868
2,25053935,Physiology of iron metabolism.,A revolution occurred during the last decade i...,2014-06-01,Transfusion medicine and hemotherapy : offizie...,Transfus Med Hemother,1660-3796,10.1159/000362888,[],"['Iron', 'Metabolism', 'Transfusion medicine']","[{'last_name': 'Waldvogel-Abramowski', 'first_...",2025-12-06T10:11:53.302400,True,True,True,7,0,3,True,2014.0,6.0,0,0,0,0,0,0,0,0,2025-12-09T11:13:20.440868
3,24837263,Afterload mismatch after MitraClip insertion f...,"Afterload mismatch, defined as acute impairmen...",2014-06-01,The American journal of cardiology,Am J Cardiol,1879-1913,10.1016/j.amjcard.2014.03.015,"[Aged, Echocardiography, Female, Follow-Up Stu...",[],"[{'last_name': 'Melisurgo', 'first_name': 'Giu...",2025-12-06T10:11:29.682134,True,False,True,12,18,0,True,2014.0,6.0,0,0,1,0,0,0,0,0,2025-12-09T11:13:20.440868
4,24656396,CTLA4 polymorphisms in minimal change nephroti...,,2014-06-01,American journal of kidney diseases : the offi...,Am J Kidney Dis,1523-6838,10.1053/j.ajkd.2014.01.427,"[Adolescent, CTLA-4 Antigen, Case-Control Stud...",[],"[{'last_name': 'Ohl', 'first_name': 'Kim', 'in...",2025-12-06T10:11:31.898864,True,False,False,11,13,0,True,2014.0,6.0,0,0,1,0,0,0,0,0,2025-12-09T11:13:20.440868


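## 3. Counts by publication_year

Sum the Swiss-affiliation flags and the Goldhamster prediction columns per publication year, transpose the result so that each year is a column, and export it to Excel.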

In [8]:
df2 = df[[
    'publication_year', 'any_author_has_swiss_affiliation', 
    'first_author_has_swiss_affiliation', 'last_author_has_swiss_affiliation',
    'goldhamster_in_silico', 'goldhamster_organs', 'goldhamster_other', 
    'goldhamster_human', 'goldhamster_in_vivo', 'goldhamster_invertebrate', 
    'goldhamster_primary_cells', 'goldhamster_immortal_cell_line'
]].groupby('publication_year').sum().reset_index()

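# Years are stored as floats (e.g. 2014.0); cast to nullable integers for clean labels
df2['publication_year'] = df2['publication_year'].astype('Int64')
# Transpose: one row per metric, one column per year
df2 = df2.set_index('publication_year').T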

df2.to_excel("../data/results/pubmed_analysis_statistics.xlsx", index=True)

df2


| publication_year | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 | 2026 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| any_author_has_swiss_affiliation | 1 | 7889 | 7401 | 9598 | 10843 | 15913 | 18979 | 21583 | 23260 | 23794 | 25110 | 28828 | 32169 | 30911 | 29624 | 30186 | 32776 | 206 |
| first_author_has_swiss_affiliation | 1 | 7816 | 7309 | 9417 | 10256 | 11285 | 12241 | 12820 | 13372 | 13106 | 13704 | 15288 | 16604 | 15846 | 14751 | 14988 | 15751 | 103 |
| last_author_has_swiss_affiliation | 0 | 648 | 530 | 839 | 1336 | 6673 | 9514 | 11820 | 12336 | 12724 | 13247 | 15091 | 16438 | 15814 | 14857 | 14785 | 15601 | 99 |
| goldhamster_in_silico | 0 | 744 | 697 | 979 | 1084 | 1600 | 1867 | 2030 | 2297 | 2290 | 2419 | 2776 | 3134 | 3105 | 3081 | 3202 | 3436 | 25 |
| goldhamster_organs | 0 | 148 | 137 | 181 | 185 | 232 | 275 | 294 | 309 | 321 | 301 | 350 | 413 | 369 | 339 | 370 | 412 | 1 |
| goldhamster_other | 1 | 4213 | 4067 | 5223 | 6119 | 8920 | 11112 | 12814 | 14326 | 15089 | 16116 | 19051 | 21720 | 21207 | 20593 | 21138 | 23032 | 135 |
| goldhamster_human | 0 | 1249 | 1122 | 1385 | 1533 | 2357 | 2761 | 3036 | 3232 | 3155 | 3347 | 3693 | 3930 | 3560 | 3317 | 3184 | 3459 | 24 |
| goldhamster_in_vivo | 0 | 1089 | 939 | 1256 | 1376 | 2071 | 2339 | 2589 | 2691 | 2621 | 2560 | 2740 | 2926 | 2585 | 2442 | 2439 | 2454 | 19 |
| goldhamster_invertebrate | 0 | 143 | 136 | 166 | 178 | 244 | 246 | 317 | 282 | 283 | 308 | 294 | 293 | 289 | 269 | 258 | 317 | 3 |
| goldhamster_primary_cells | 0 | 68 | 61 | 77 | 67 | 106 | 112 | 115 | 122 | 104 | 99 | 115 | 120 | 99 | 89 | 77 | 106 | 2 |
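
To make the shape of the year table easier to follow, here is a small sketch of the same group, cast, and transpose steps on a toy frame. The toy data and the names `toy` and `counts` are illustrative assumptions; only two of the notebook's columns are used. It also shows that a record with a missing year is dropped by `groupby`, which excludes missing keys by default.

```python
import pandas as pd

# Toy records: two from 2014, one from 2015, one with a missing year.
toy = pd.DataFrame({
    "publication_year": [2014.0, 2014.0, 2015.0, None],
    "any_author_has_swiss_affiliation": [True, False, True, True],
    "goldhamster_in_vivo": [1, 0, 1, 1],
})

# Sum the flags per year; the record without a year is excluded.
counts = toy.groupby("publication_year").sum().reset_index()

# Cast float years to nullable integers, then put years in the columns.
counts["publication_year"] = counts["publication_year"].astype("Int64")
counts = counts.set_index("publication_year").T

print(counts)
```

If records without a year should be counted as well, `groupby(..., dropna=False)` keeps them under a missing-value key, although the transposed column label is then a missing value rather than a year.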


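## 4. Counts by mesh_term

Expand the parsed `mesh_terms` lists so that each term gets its own row, sum the same flags per term, and export the result to Excel.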

In [None]:
# A record with several MeSH terms contributes to the count of each of its terms
df2 = df[[
    'mesh_terms', 'any_author_has_swiss_affiliation', 
    'first_author_has_swiss_affiliation', 'last_author_has_swiss_affiliation',
    'goldhamster_in_silico', 'goldhamster_organs', 'goldhamster_other', 
    'goldhamster_human', 'goldhamster_in_vivo', 'goldhamster_invertebrate', 
    'goldhamster_primary_cells', 'goldhamster_immortal_cell_line'
]].rename(columns={
    'mesh_terms': 'mesh_term',
    'any_author_has_swiss_affiliation': 'any_auth',
    'first_author_has_swiss_affiliation': 'first_auth',
    'last_author_has_swiss_affiliation': 'last_auth',
    'goldhamster_in_silico': 'in_silico',
    'goldhamster_organs': 'organs',
    'goldhamster_other': 'other',
    'goldhamster_human': 'human',
    'goldhamster_in_vivo': 'in_vivo',
    'goldhamster_invertebrate': 'invertebrate',
    'goldhamster_primary_cells': 'primary_cells',
    'goldhamster_immortal_cell_line': 'immortal_cell_line'
}).explode('mesh_term').groupby(['mesh_term']).sum().reset_index()

df2.to_excel(Path("../data/results/pubmed_mesh_term_statistics.xlsx"), index=False)
print("✅ Exported mesh term statistics to Excel file")

df2

✅ Exported mesh term statistics to Excel file


| | mesh_term | any_auth | first_auth | last_auth | in_silico | organs | other | human | in_vivo | invertebrate | primary_cells | immortal_cell_line |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 91145 | 52176 | 48131 | 13106 | 851 | 61097 | 7558 | 5673 | 499 | 182 | 213 |
| 1 | 1,2-Dipalmitoylphosphatidylcholine | 19 | 9 | 7 | 5 | 4 | 1 | 1 | 2 | 0 | 0 | 0 |
| 2 | 1,4-alpha-Glucan Branching Enzyme | 4 | 3 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1-(5-Isoquinolinesulfonyl)-2-Methylpiperazine | 5 | 1 | 1 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 0 |
| 4 | 1-Acylglycerol-3-Phosphate O-Acyltransferase | 3 | 2 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26761 | vif Gene Products, Human Immunodeficiency Virus | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 26762 | von Hippel-Lindau Disease | 23 | 14 | 6 | 0 | 0 | 16 | 8 | 1 | 0 | 0 | 0 |
| 26763 | von Willebrand Diseases | 22 | 12 | 7 | 0 | 0 | 16 | 6 | 1 | 0 | 0 | 0 |
| 26764 | von Willebrand Factor | 94 | 55 | 29 | 0 | 4 | 36 | 36 | 14 | 0 | 2 | 0 |
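
The first row of the table has an empty `mesh_term`, which is consistent with records that carry no MeSH terms being parsed as `['']`. A minimal sketch on toy data of the same explode-and-sum step, followed by one possible way to drop the empty term and rank the rest, might look like this. The toy values and the names `toy`, `per_term`, and `top` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy records: the second one has no MeSH terms and was parsed as [''].
toy = pd.DataFrame({
    "mesh_term": [["Animals", "Goats"], [""], ["Animals"]],
    "in_vivo": [1, 0, 1],
    "human": [0, 1, 0],
})

# One row per (record, term) pair, then sum the flags per term.
per_term = toy.explode("mesh_term").groupby("mesh_term").sum().reset_index()

# Drop the empty-string term and rank the remaining terms by in_vivo count.
per_term = per_term[per_term["mesh_term"] != ""]
top = per_term.sort_values("in_vivo", ascending=False)

print(top)
```

Because every record is counted once per term, the column totals of such a table exceed the number of records, so per-term counts are best read as "records tagged with this term" rather than as a partition of the dataset.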
