
#### Technical Analysis of Automated Bibliographic Search Results

**Authors:** EMR, DGO, BJU
**Date:** 13/06/2024
**Reference Paper:** SotA LLM, RAG, KG, Agents - Health

---

#### Description

This notebook documents the technical process of collecting, processing, and analyzing scientific literature through automated queries on Arxiv and PubMed databases.

#### Structure

1.  **Definition of search queries and domains**
2.  **Automation of sweeps on Arxiv and PubMed**
3.  **Concatenation, deduplication, and cleaning of results**
4.  **Extraction and analysis of technical keywords**



> **Note:** This notebook is designed to be reproducible and extensible, facilitating the traceability and validation of the results presented in the paper.

In [1]:
from utils import *

In [7]:
#data folder
data_folder = 'data'

#### Query Lists

In [2]:
#query tree
list_1= ["large language model","LLM"]
list_3= ["medicine","healthcare","cancer"]
list_2 = [
"Retrieved augmented generation",
"Knowledge graph",
"Graph database",
"Knowledge base",
"Agents",
"agentic",
"chatgpt",
"llama",
"ULM"]
list_4 =[
"patient care",
"patient monitoring",
"imaging",
"decision support",
"diagnosis",
"treatment",
"question answering",
"hallucination"]

LomgFormat: This list the whole query, as [main_domain] AND [tools] AND [medical_designation] AND [medical_domain]



In [3]:
combined_list = []
for el1 in list_1:
    for el2 in list_2:
        for el3 in list_3:
            for el4 in list_4:
                combined_element = f'"{el1}" AND "{el2}" AND "{el3}" AND "{el4}"'
                combined_list.append(combined_element)

ShortFormat: This query list skips tool , as [main_domain]  AND [medical_designation] AND [medical_domain]

In [4]:
combined_list = []
for el1 in list_1:
        for el3 in list_3:
            for el4 in list_4:
                combined_element = f'"{el1}" AND "{el3}" AND "{el4}"'
                combined_list.append(combined_element)

In [5]:
#sample
combined_list[:5]

['"large language model" AND "medicine" AND "patient care"',
 '"large language model" AND "medicine" AND "patient monitoring"',
 '"large language model" AND "medicine" AND "imaging"',
 '"large language model" AND "medicine" AND "decision support"',
 '"large language model" AND "medicine" AND "diagnosis"']

#### Arxiv Sweep

In [7]:
year_start=2024
year_end =  datetime.now().year
date_range = (year_start, year_end)

In [None]:
filename_tosave= 'queries_arxiv_v2_1.xlsx'
df_arxiv=pd.DataFrame()
for query in combined_list:
  df_temp =search_and_export(query, 1000, year_start=None, year_end=None, filename=None)# search_arxiv(query, max_results=2000, start=0, sort_by='relevance', date_range=None)
  try:
    print("Query:"+query+str(df_temp.shape))
  except:
    print("Query:"+query+str(None))
  df_arxiv=pd.concat([df_arxiv,df_temp])
  df_arxiv.to_excel(os.path.join(data_folder, filename_tosave=filename_tosave), index=False)
  time.sleep(10)

Consultando arXiv: "large language model" AND "medicine" AND "patient care"
Obtenidos 10 resultados en este lote (total: 10)
Consultando arXiv: "large language model" AND "medicine" AND "patient care"
Totalresultados: 10
Query:"large language model" AND "medicine" AND "patient care"(10, 12)
Consultando arXiv: "large language model" AND "medicine" AND "patient monitoring"
Obtenidos 1 resultados en este lote (total: 1)
Consultando arXiv: "large language model" AND "medicine" AND "patient monitoring"
Totalresultados: 1
Query:"large language model" AND "medicine" AND "patient monitoring"(1, 12)
Consultando arXiv: "large language model" AND "medicine" AND "imaging"
Obtenidos 25 resultados en este lote (total: 25)
Consultando arXiv: "large language model" AND "medicine" AND "imaging"
Totalresultados: 25
Query:"large language model" AND "medicine" AND "imaging"(25, 12)
Consultando arXiv: "large language model" AND "medicine" AND "decision support"
Obtenidos 21 resultados en este lote (total: 

In [26]:
print(f"Arvix queries accounts {df_arxiv.shape[0]} rows and {df_arxiv.shape[1]} columns.")

Arvix queries accounts 1613 rows and 12 columns.


In [81]:
df_arxiv.columns

Index(['title', 'summary', 'published', 'updated', 'arxiv_url', 'pdf_url',
       'authors', 'categories', 'doi', 'year', 'primary_category', 'query'],
      dtype='object')

In [27]:
df_arxiv.sample(3)

Unnamed: 0,title,summary,published,updated,arxiv_url,pdf_url,authors,categories,doi,year,primary_category,query
1,GIT-Mol: A Multi-modal Large Language Model fo...,Large language models have made significant st...,2023-08-14,2024-02-06,http://arxiv.org/abs/2308.06911v3,http://arxiv.org/pdf/2308.06911v3,"Pengfei Liu, Yiming Ren, Jun Tao, Zhixiang Ren","cs.LG, cs.CL, q-bio.BM",http://dx.doi.org/10.1016/j.compbiomed.2024.10...,2023,cs.LG,"""large language model"" AND ""medicine"" AND ""ima..."
42,Enhancing LLM Generation with Knowledge Hyperg...,Evidence-based medicine (EBM) plays a crucial ...,2025-03-18,2025-03-18,http://arxiv.org/abs/2503.16530v1,http://arxiv.org/pdf/2503.16530v1,"Chengfeng Dou, Ying Zhang, Zhi Jin, Wenpin Jia...","cs.CL, cs.AI, cs.IR",,2025,cs.CL,"""LLM"" AND ""healthcare"" AND ""hallucination"""
14,Bias in Large Language Models Across Clinical ...,Background: Large language models (LLMs) are r...,2025-04-03,2025-04-03,http://arxiv.org/abs/2504.02917v1,http://arxiv.org/pdf/2504.02917v1,"Thanathip Suenghataiphorn, Narisara Tribuddhar...","cs.CL, cs.AI",,2025,cs.CL,"""large language model"" AND ""healthcare"" AND ""i..."


#### Pubmed Sweep

In [None]:
filename = 'queries_pubmed_v2_1.xlsx'
df_pubmed=pd.DataFrame()
for query in combined_list:
  df_temp =search_pubmed_and_save_csv(query, year_start , year_end, drive_folder_name="PubMed_Results")# search_arxiv(query, max_results=2000, start=0, sort_by='relevance', date_range=None)
  try:
    print("Query:"+query+str(df_temp.shape))
  except:
    print("Query:"+query+str(None))
  df_pubmed=pd.concat([df_pubmed,df_temp])
  df_pubmed.to_excel(os.path.join(data_folder, filename))
  time.sleep(10)

In [12]:
print(f"Pubmed queries accounts {df_pubmed.shape[0]} rows and {df_pubmed.shape[1]} columns.")

Pubmed queries accounts 2342 rows and 14 columns.


In [14]:
df_pubmed.columns

Index(['PubMed ID', 'title', 'summary', 'Journal', 'Publication Date',
       'authors', 'MeSH Terms', 'Keywords', 'Article Type', 'Volume', 'Issue',
       'Pages', 'DOI', 'query'],
      dtype='object')

In [16]:
df_pubmed.sample(3)

Unnamed: 0,PubMed ID,title,summary,Journal,Publication Date,authors,MeSH Terms,Keywords,Article Type,Volume,Issue,Pages,DOI,query
90,39560053,Cross-modal embedding integrator for disease-g...,,Pharmacology research & perspectives,,"Chang, Munyoung; Ahn, Junyong; Kang, Bong Gyun...",,,,12,6.0,e70034,,"""LLM"" AND ""medicine"" AND ""treatment"""
5,39176947,Unveiling Medical Insights: Advanced Topic Ext...,,Studies in health technology and informatics,,"Bitaraf, Ehsan; Jafarpour, Maryam; Shool, Sina...",,,,316,,944-948,,"""LLM"" AND ""cancer"" AND ""patient care"""
9,38504034,Utilizing large language models in breast canc...,,Journal of cancer research and clinical oncology,,"Sorin, Vera; Glicksberg, Benjamin S; Artsi, Ya...",,,,150,3.0,140,,"""LLM"" AND ""cancer"" AND ""question answering"""


### WoS sweep

In [None]:
[{'D-WoS-lar&cancer&diag-17062025.txt': 'D-WoS-lar&cancer&diag-17062025.txt'},{'D-WoS-lar&cancer&DS-17062025':'D-WoS-lar&cancer&DS-17062025'},{'D-WoS-lar&cancer&hall-17062025':'D-WoS-lar&cancer&hall-17062025'},{'D-WoS-lar&cancer&imaging-17062025':'D-WoS-lar&cancer&imaging-17062025'},{'D-WoS-lar&cancer&monitoring-17062025':'D-WoS-lar&cancer&monitoring-17062025'},{'D-WoS-lar&cancer&treatment-17062025':'D-WoS-lar&cancer&treatment-17062025'}]

In [45]:
list_parse=[]
for file_name in os.listdir(wos_path):
    list_parse.append(({file_name: file_name}))
list_parse
query_mapping={'D-WoS-lar&cancer&diag-17062025.txt': '"large language model" AND "cancer" AND "diagnosis"',
 'D-WoS-lar&cancer&DS-17062025.txt': '"large language model" AND "cancer" AND "decision support"',
 'D-WoS-lar&cancer&hall-17062025.txt': '"large language model" AND "cancer" AND "hallucination"',
 'D-WoS-lar&cancer&img-17062025.txt': '"large language model" AND "cancer" AND "imaging"',
 'D-WoS-lar&cancer&PC-17062025.txt': '"large language model" AND "cancer" AND "patient care"',
 'D-WoS-lar&cancer&PM-17062025.txt': '"large language model" AND "cancer" AND "patient monitoring"',
 'D-WoS-lar&cancer&QA-17062025.txt': '"large language model" AND "cancer" AND "question answering"',
 'D-WoS-lar&cancer&treat-17062025.txt': '"large language model" AND "cancer" AND "treatment"},',
 'D-WoS-lar&HC&diag-17062025.txt': '"large language model" AND "healthcare" AND "diagnosis"',
 'D-WoS-lar&HC&DS-17062025.txt': '"large language model" AND "healthcare" AND "decision support"',
 'D-WoS-lar&HC&hall-17062025.txt': '"large language model" AND "healthcare" AND "hallucination"',
 'D-WoS-lar&HC&img-17062025.txt': '"large language model" AND "healthcare" AND "imaging"',
 'D-WoS-lar&HC&PC-17062025.txt': '"large language model" AND "healthcare" AND "patient care"',
 'D-WoS-lar&HC&QA-17062025.txt': '"large language model" AND "healthcare" AND "question answering"',
 'D-WoS-lar&HC&treat-17062025.txt': '"large language model" AND "healthcare" AND "treatment"',
 'D-WoS-lar&med&diag-17062025.txt': '"large language model" AND "medicine" AND "diagnosis"',
 'D-WoS-lar&med&DS-17062025.txt': '"large language model" AND "medicine" AND "decision support"',
 'D-WoS-lar&med&hall-17062025.txt': '"large language model" AND "medicine" AND "hallucination"',
 'D-WoS-lar&med&img-17062025.txt': '"large language model" AND "medicine" AND "imaging"',
 'D-WoS-lar&med&PC-17062025.txt': '"large language model" AND "medicine" AND "patient care"',
 'D-WoS-lar&med&PM-17062025.txt': '"large language model" AND "medicine" AND "patient monitoring"',
 'D-WoS-lar&med&QA-17062025.txt': '"large language model" AND "medicine" AND "question answering"',
 'D-WoS-lar&med&treat-17062025.txt': '"large language model" AND "medicine" AND "treatment"',
 'D-WoS-LLM&cancer&diag-17062025.txt': '"LLM" AND "cancer" AND "diagnosis"',
 'D-WoS-LLM&cancer&DS-17062025.txt': '"LLM" AND "cancer" AND "decision support"',
 'D-WoS-LLM&cancer&hall-17062025.txt': '"LLM" AND "cancer" AND "hallucination"',
 'D-WoS-LLM&cancer&img-17062025.txt': '"LLM" AND "cancer" AND "imaging"',
 'D-WoS-LLM&cancer&PC-17062025.txt': '"LLM" AND "cancer" AND "patient care"',
 'D-WoS-LLM&cancer&PM-17062025.txt': '"LLM" AND "cancer" AND "patient monitoring"',
 'D-WoS-LLM&cancer&QA-17062025.txt': '"LLM" AND "cancer" AND "question answering"',
 'D-WoS-LLM&cancer&treat-17062025.txt': '"LLM" AND "cancer" AND "treatment"',
 'D-WoS-LLM&HC&diag-17062025.txt': '"LLM" AND "healthcare" AND "diagnosis"',
 'D-WoS-LLM&HC&DS-17062025.txt': '"LLM" AND "healthcare" AND "decision support"',
 'D-WoS-LLM&HC&hall-17062025.txt': '"LLM" AND "healthcare" AND "hallucination"',
 'D-WoS-LLM&HC&img-17062025.txt': 'LLM" AND "healthcare" AND "imaging"',
 'D-WoS-LLM&HC&PC-17062025.txt': '"LLM" AND "healthcare" AND "patient care"',
 'D-WoS-LLM&HC&PM-17062025.txt': '"LLM" AND "healthcare" AND "patient monitoring"',
 'D-WoS-LLM&HC&QA-17062025.txt': '"LLM" AND "healthcare" AND "question answering"',
 'D-WoS-LLM&HC&treat-17062025.txt': '"LLM" AND "healthcare" AND "treatment"',
 'D-WoS-LLM&med&diag-17062025.txt': '"LLM" AND "medicine" AND "diagnosis"',
 'D-WoS-LLM&med&DS-17062025.txt': 'LLM" AND "medicine" AND "decision support"',
 'D-WoS-LLM&med&hall-17062025.txt': 'LLM" AND "medicine" AND "hallucination"',
 'D-WoS-LLM&med&img-17062025.txt': 'LLM" AND "medicine" AND "imaging"',
 'D-WoS-LLM&med&PC-17062025.txt': 'LLM" AND "medicine" AND "patient care"',
 'D-WoS-LLM&med&PM-17062025.txt': 'LLM" AND "medicine" AND "patient monitoring"',
 'D-WoS-LLM&med&QA-17062025.txt': 'LLM" AND "medicine" AND "question answering"',
 'D-WoS-LLM&med&treat-17062025.txt': 'LLM" AND "medicine" AND "treatment"'}

In [47]:

wos_path = 'wos_queries'
wos_file_list = []
for file_name in os.listdir(wos_path):

              #  try:
                    df_temp= parse_wos_file(os.path.join(wos_path, file_name))
                    df_temp['query'] = query_mapping.get(file_name, None)
                    if df_temp is not None and not df_temp.empty:
                        wos_file_list.append(df_temp)
               # except Exception as e:
               #     print(f"Error procesando el fichero {os.path.join(wos_path, file_pth)}: {e}")
        
    
df_wos= pd.concat( wos_file_list, ignore_index=True)

Successfully parsed 99 records from wos_queries\D-WoS-lar&cancer&diag-17062025.txt
Successfully parsed 18 records from wos_queries\D-WoS-lar&cancer&DS-17062025.txt
Successfully parsed 9 records from wos_queries\D-WoS-lar&cancer&hall-17062025.txt
Successfully parsed 71 records from wos_queries\D-WoS-lar&cancer&img-17062025.txt
Successfully parsed 32 records from wos_queries\D-WoS-lar&cancer&PC-17062025.txt
Successfully parsed 1 records from wos_queries\D-WoS-lar&cancer&PM-17062025.txt
Successfully parsed 9 records from wos_queries\D-WoS-lar&cancer&QA-17062025.txt
Successfully parsed 109 records from wos_queries\D-WoS-lar&cancer&treat-17062025.txt
Successfully parsed 110 records from wos_queries\D-WoS-lar&HC&diag-17062025.txt
Successfully parsed 38 records from wos_queries\D-WoS-lar&HC&DS-17062025.txt
Successfully parsed 12 records from wos_queries\D-WoS-lar&HC&hall-17062025.txt
Successfully parsed 44 records from wos_queries\D-WoS-lar&HC&img-17062025.txt
Successfully parsed 62 records f

In [48]:
df_wos.head()

Unnamed: 0,Publication Type,authors,title,Source Title,Volume,Issue,Article Number,Document Type,Publication Date,Publication Year,...,Conference Location,Conference Sponsor,Supplement,Book Group Authors,Author Keywords,query,Group Authors,Special Issue,Part Number,Editors
0,J,"Trager, Megan H. Gordon, Emily R. Breneman, Al...",Accuracy of ChatGPT in diagnosis and managemen...,ARCHIVES OF DERMATOLOGICAL RESEARCH,317,1.0,184 DI 10.1007/s00403-024-03729-z,Letter,JAN 7 2025,2025 ZA 0 ZR 0 Z8 0 ZB 0 ZS 0,...,,,,,,"""large language model"" AND ""cancer"" AND ""diagn...",,,,
1,J,"Yuan, Yue Zhang, Guolong Gu, Yuqi Hao, Sicheng...",Artificial intelligence-assisted machine learn...,ASIA-PACIFIC JOURNAL OF ONCOLOGY NURSING,12,,100680 DI 10.1016/j.apjon.2025.100680 EA MAR 2025,Article,DEC 2025,2025,...,,,,,,"""large language model"" AND ""cancer"" AND ""diagn...",,,,
2,C,"Marques, Adriell Gomes Candido de Figueiredo, ...",New approach Generative AI Melanoma Data Fusio...,"2024 37TH SIBGRAPI CONFERENCE ON GRAPHICS, PAT...",,,,Proceedings Paper,2024,2024,...,"Manaus, BRAZIL","SIBGRAPI; Univ Estado Amazonas, Escola Super T...",,,,"""large language model"" AND ""cancer"" AND ""diagn...",,,,
3,J,"Liu, Jilei Shen, Hongru Chen, Kexin Li, Xiangchun",Large language model produces high accurate di...,BRIEFINGS IN BIOINFORMATICS,25,5.0,bbae430 DI 10.1093/bib/bbae430,Article,SEP 2 2024,2024,...,,,,,,"""large language model"" AND ""cancer"" AND ""diagn...",,,,
4,J,"Orlhac, Fanny Bradshaw, Tyler Buvat, Irene",Can a large language model be an effective ass...,JOURNAL OF NUCLEAR MEDICINE,65 MA 241031,,,Meeting Abstract,JUN 1 2024,2024,...,"Toronto, CANADA",Soc Nuclear Med & Mol Imaging Z8 0 ZB 0 ZA 0,2.0,,,"""large language model"" AND ""cancer"" AND ""diagn...",,,,


In [31]:
df_wos.columns

Index(['Publication Type', 'authors', 'title', 'Source Title', 'Volume',
       'Issue', 'Article Number', 'Document Type', 'Publication Date',
       'Publication Year', 'Times Cited', 'Total Times Cited',
       'Date Processed', 'Accession Number', 'summary', 'Series Title',
       'Beginning Page', 'Ending Page', 'Conference Title', 'Conference Date',
       'Conference Location', 'Conference Sponsor', 'Supplement',
       'Book Group Authors', 'Author Keywords', 'Group Authors',
       'Special Issue', 'Part Number', 'Editors'],
      dtype='object')

In [36]:
df_wos.head()

Unnamed: 0.1,Unnamed: 0,Publication Type,authors,title,Source Title,Volume,Issue,Article Number,Document Type,Publication Date,...,Conference Date,Conference Location,Conference Sponsor,Supplement,Book Group Authors,Author Keywords,Group Authors,Special Issue,Part Number,Editors
0,0,J,"Trager, Megan H. Gordon, Emily R. Breneman, Al...",Accuracy of ChatGPT in diagnosis and managemen...,ARCHIVES OF DERMATOLOGICAL RESEARCH,317,1.0,184 DI 10.1007/s00403-024-03729-z,Letter,JAN 7 2025,...,,,,,,,,,,
1,1,J,"Yuan, Yue Zhang, Guolong Gu, Yuqi Hao, Sicheng...",Artificial intelligence-assisted machine learn...,ASIA-PACIFIC JOURNAL OF ONCOLOGY NURSING,12,,100680 DI 10.1016/j.apjon.2025.100680 EA MAR 2025,Article,DEC 2025,...,,,,,,,,,,
2,2,C,"Marques, Adriell Gomes Candido de Figueiredo, ...",New approach Generative AI Melanoma Data Fusio...,"2024 37TH SIBGRAPI CONFERENCE ON GRAPHICS, PAT...",,,,Proceedings Paper,2024,...,"SEP 30-OCT 03, 2024","Manaus, BRAZIL","SIBGRAPI; Univ Estado Amazonas, Escola Super T...",,,,,,,
3,3,J,"Liu, Jilei Shen, Hongru Chen, Kexin Li, Xiangchun",Large language model produces high accurate di...,BRIEFINGS IN BIOINFORMATICS,25,5.0,bbae430 DI 10.1093/bib/bbae430,Article,SEP 2 2024,...,,,,,,,,,,
4,4,J,"Orlhac, Fanny Bradshaw, Tyler Buvat, Irene",Can a large language model be an effective ass...,JOURNAL OF NUCLEAR MEDICINE,65 MA 241031,,,Meeting Abstract,JUN 1 2024,...,"JUN 08-11, 2024","Toronto, CANADA",Soc Nuclear Med & Mol Imaging Z8 0 ZB 0 ZA 0,2.0,,,,,,


In [50]:
filename = 'queries_wos_v2_1.xlsx'
df_wos.to_excel(os.path.join(data_folder, filename))

#### Concatenate & deduplicate

In [92]:
#Load queries from Excel files
import os
filename = 'queries_arxiv_v2_1.xlsx'
df_arxiv = pd.read_excel(os.path.join(data_folder, filename))
filename = 'queries_pubmed_v2_1.xlsx'
df_pubmed = pd.read_excel(os.path.join(data_folder, filename))
filename = 'queries_wos_v2_1.xlsx'
df_wos = pd.read_excel(os.path.join(data_folder, filename))

In [93]:
df_combined = pd.concat([df_pubmed, df_arxiv,df_wos], ignore_index=True)
df_combined['short_title'] = df_combined['title']#.str[:50]
df_combined=df_combined.drop_duplicates(subset=['query','title'], keep='first')
df_combined.shape

(6753, 47)

Due to diferent sources, we are deduplicating pandas from common title field name

In [94]:
df_pubmed=df_pubmed.drop_duplicates(subset=['title','query'], keep='first')
df_pubmed_query_values= df_pubmed['query'].value_counts().reset_index().astype({'count': int})
df_arxiv=df_arxiv.drop_duplicates(subset=['title','query'], keep='first')
df_arxiv_query_values= df_arxiv['query'].value_counts().reset_index().astype({'count': int})
df_wos=df_wos.drop_duplicates(subset=['title','query'], keep='first')
df_wos_query_values= df_wos['query'].value_counts().reset_index().astype({'count': int})
df_combined_query_values= df_combined['query'].value_counts().reset_index().astype({'count': int}).fillna(0)
df_merged_query_values = df_arxiv_query_values.merge(df_pubmed_query_values, on='query', how='outer', suffixes=('_arxiv', '_pubmed'))
df_merged_query_values = df_merged_query_values.merge(df_wos_query_values,on='query', how='outer', suffixes=( '_ss','_wos'))
df_merged_query_values = df_merged_query_values.merge(df_combined_query_values,on='query', how='outer', suffixes=( '_wos','_unique'))
df_merged_query_values


Unnamed: 0,query,count_arxiv,count_pubmed,count_wos,count_unique
0,"""large language model"" AND ""healthcare"" AND ""q...",119.0,9.0,28.0,154
1,"""LLM"" AND ""healthcare"" AND ""question answering""",110.0,17.0,38.0,160
2,"""large language model"" AND ""healthcare"" AND ""h...",84.0,5.0,12.0,100
3,"""large language model"" AND ""healthcare"" AND ""i...",84.0,43.0,44.0,166
4,"""large language model"" AND ""healthcare"" AND ""d...",84.0,74.0,110.0,262
5,"""LLM"" AND ""healthcare"" AND ""hallucination""",81.0,9.0,19.0,107
6,"""LLM"" AND ""healthcare"" AND ""diagnosis""",73.0,70.0,118.0,249
7,"""large language model"" AND ""healthcare"" AND ""t...",67.0,57.0,94.0,213
8,"""LLM"" AND ""healthcare"" AND ""treatment""",60.0,52.0,84.0,191
9,"""LLM"" AND ""medicine"" AND ""question answering""",56.0,33.0,,89


In [13]:
#etls
df_merged_query = df_merged_query_values.copy() 
for columna in df_merged_query.columns:

    if df_merged_query[columna].dtype == 'object':
        df_merged_query[columna] = pd.to_numeric(df_merged_query[columna], errors='coerce')

    if pd.api.types.is_numeric_dtype(df_merged_query[columna]):
        df_merged_query[columna] = df_merged_query[columna].fillna(0).astype(int)

In [16]:
filename_tosave = 'queries_concat_latex.txt'
df_to_latex_with_integers( df_merged_query,os.path.join(data_folder, filename_tosave))

'\\begin{tabular}{|c|c|c|c|}{rrrr}\n\\hline\\hline\n\\textbf{query} & \\textbf{count_arxiv} & \\textbf{count_pubmed} & \\textbf{count} \\\\\n\\hline\\hline\n0 & 119 & 9 & 128 \\\\\n0 & 110 & 17 & 127 \\\\\n0 & 84 & 5 & 89 \\\\\n0 & 84 & 43 & 127 \\\\\n0 & 84 & 74 & 158 \\\\\n0 & 81 & 9 & 90 \\\\\n0 & 73 & 70 & 143 \\\\\n0 & 67 & 57 & 124 \\\\\n0 & 60 & 52 & 112 \\\\\n0 & 56 & 33 & 89 \\\\\n0 & 55 & 23 & 78 \\\\\n0 & 54 & 52 & 106 \\\\\n0 & 51 & 25 & 76 \\\\\n0 & 50 & 37 & 87 \\\\\n0 & 43 & 44 & 87 \\\\\n0 & 42 & 45 & 87 \\\\\n0 & 38 & 20 & 58 \\\\\n0 & 37 & 225 & 262 \\\\\n0 & 36 & 66 & 102 \\\\\n0 & 36 & 57 & 93 \\\\\n0 & 36 & 223 & 259 \\\\\n0 & 35 & 23 & 58 \\\\\n0 & 31 & 143 & 174 \\\\\n0 & 28 & 159 & 187 \\\\\n0 & 25 & 46 & 71 \\\\\n0 & 25 & 124 & 149 \\\\\n0 & 21 & 65 & 86 \\\\\n0 & 19 & 93 & 112 \\\\\n0 & 18 & 54 & 72 \\\\\n0 & 17 & 149 & 166 \\\\\n0 & 14 & 46 & 60 \\\\\n0 & 14 & 42 & 56 \\\\\n0 & 10 & 53 & 63 \\\\\n0 & 10 & 8 & 18 \\\\\n0 & 10 & 6 & 16 \\\\\n0 & 9 & 75 & 84 \\\

In [95]:
df_unique_combined  = df_combined.drop_duplicates(subset=['title'], keep='first')
print(f"Combined, de-duplicated dataqueries accounts {df_unique_combined.shape[0]} rows and {df_unique_combined.shape[1]} columns.")
df_merged_query_values.drop(columns='query').sum()


Combined, de-duplicated dataqueries accounts 2546 rows and 47 columns.


count_arxiv     1613.0
count_pubmed    2340.0
count_wos       2941.0
count_unique    6753.0
dtype: float64

In [96]:
df_unique_combined.shape

(2546, 47)

#### **Extraction and analysis of technical keywords**

In [97]:
df_unique_combined['keywords'] = df_unique_combined.apply(
        lambda row: detect_keywords(row, domain_keywords ), axis=1
    )

filename = 'queries_concat_unique_v2_1.xlsx'
df_unique_combined.to_excel(os.path.join(data_folder, filename), index=False)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_unique_combined['keywords'] = df_unique_combined.apply(


In [98]:
nombre_columna_keywords = 'keywords' 
keywords_series = df_unique_combined[nombre_columna_keywords].astype(str).str.split(',')
todas_las_keywords = keywords_series.explode()
keywords_limpias = todas_las_keywords.str.strip().dropna()
keywords_finales = keywords_limpias[keywords_limpias != '']
frecuencia_keywords = keywords_finales.value_counts()
print(f"Frecuencia de keywords en la columna '{nombre_columna_keywords}':")
print(frecuencia_keywords)
print(f"\nNúmero total de keywords únicas: {frecuencia_keywords.size}")

Frecuencia de keywords en la columna 'keywords':
keywords
llm                   1972
domain                1108
diag                   728
cllm                   652
rag                    489
eval                   480
decision support       473
treatment              471
cancer                 419
image                  370
rev                    342
qa                     291
neg                    247
graph                  230
nlp                    198
mm                     192
patient care           188
hallucination          181
chatbot                154
osllm                  146
finetune               126
sur                    108
ner                     95
KG                      70
agents                  57
vlms                    23
speech                  19
agentic                 11
conversational AI        8
patient monitoring       6
Name: count, dtype: int64

Número total de keywords únicas: 30


In [99]:
nombre_columna_keywords = 'keywords' 
keywords_series = df_unique_combined[nombre_columna_keywords].astype(str).str.split(',')
todas_las_keywords = keywords_series.explode()
keywords_limpias = todas_las_keywords.str.strip().dropna()
keywords_finales = keywords_limpias[keywords_limpias != '']
frecuencia_keywords = keywords_finales.value_counts()
print(f"Frecuencia de keywords en la columna '{nombre_columna_keywords}':")
print(frecuencia_keywords)
print(f"\nNúmero total de keywords únicas: {frecuencia_keywords.size}")

Frecuencia de keywords en la columna 'keywords':
keywords
llm                   1972
domain                1108
diag                   728
cllm                   652
rag                    489
eval                   480
decision support       473
treatment              471
cancer                 419
image                  370
rev                    342
qa                     291
neg                    247
graph                  230
nlp                    198
mm                     192
patient care           188
hallucination          181
chatbot                154
osllm                  146
finetune               126
sur                    108
ner                     95
KG                      70
agents                  57
vlms                    23
speech                  19
agentic                 11
conversational AI        8
patient monitoring       6
Name: count, dtype: int64

Número total de keywords únicas: 30


In [104]:
pd.set_option('display.max_columns', None)
concurrence_kwrds = concurrence_matriz_keywords(df_unique_combined, columna_keywords='keywords').reset_index()
concurrence_kwrds.to_excel(os.path.join(data_folder, 'concurrence_keywords.xlsx'), index=False)
concurrence_kwrds
concurrence_kwrds

Unnamed: 0,index,KG,agentic,agents,cancer,chatbot,cllm,conversational AI,decision support,diag,domain,eval,finetune,graph,hallucination,image,llm,mm,neg,ner,nlp,osllm,patient care,patient monitoring,qa,rag,rev,speech,sur,treatment,vlms
0,KG,70,0,3,8,1,8,0,14,25,54,14,2,68,17,7,67,6,11,6,5,9,2,0,23,31,6,1,2,21,1
1,agentic,0,11,4,1,0,2,0,3,3,7,2,0,1,1,3,11,2,0,0,1,1,3,0,3,7,1,0,0,2,0
2,agents,3,4,57,3,1,3,2,23,25,37,23,1,4,14,10,54,10,13,0,3,6,3,0,12,15,5,2,2,11,0
3,cancer,8,1,3,419,30,138,0,109,129,152,77,24,41,28,77,351,28,35,26,27,30,23,2,16,88,67,1,15,181,5
4,chatbot,1,0,1,30,154,70,2,35,49,80,32,7,16,11,18,119,8,16,0,11,9,18,1,11,34,26,0,11,45,0
5,cllm,8,2,3,138,70,652,5,168,245,352,162,25,55,45,104,485,36,53,23,44,62,74,3,56,122,123,2,31,155,2
6,conversational AI,0,0,2,0,2,5,8,2,4,6,1,1,0,1,1,8,1,0,0,2,0,1,0,1,2,1,1,0,1,0
7,decision support,14,3,23,109,35,168,2,473,187,300,128,23,50,53,78,436,46,68,16,35,38,63,2,59,133,105,1,24,138,7
8,diag,25,3,25,129,49,245,4,187,728,439,176,49,96,45,155,633,81,109,38,70,63,75,0,61,177,125,6,41,232,10
9,domain,54,7,37,152,80,352,6,300,439,1108,308,85,139,126,212,991,121,140,51,114,99,135,3,231,320,210,8,62,264,19


In [19]:
nombre_columna_keywords = 'keywords' 
keywords_series = df_unique_combined[nombre_columna_keywords].astype(str).str.split(',')
todas_las_keywords = keywords_series.explode()
keywords_limpias = todas_las_keywords.str.strip().dropna()
keywords_finales = keywords_limpias[keywords_limpias != '']
frecuencia_keywords = keywords_finales.value_counts()
print(f"Frecuencia de keywords en la columna '{nombre_columna_keywords}':")
print(frecuencia_keywords)
print(f"\nNúmero total de keywords únicas: {frecuencia_keywords.size}")

Frecuencia de keywords en la columna 'keywords':
keywords
llm                   1093
domain                 563
rag                    270
cllm                   262
decision support       207
diag                   195
image                  185
qa                     167
treatment              155
cancer                 152
neg                    132
rev                    130
hallucination          115
mm                      98
osllm                   80
finetune                71
patient care            63
chatbot                 57
sur                     55
KG                      46
agents                  41
vlms                    17
agentic                  7
conversational AI        5
patient monitoring       3
Name: count, dtype: int64

Número total de keywords únicas: 25


In [21]:
filrnsmr_tosave = 'freckyw_G.txt'
df_to_latex_with_integers(pd.DataFrame(frecuencia_keywords).reset_index(), os.path.join(data_folder, filename_tosave))

'\\begin{tabular}{|c|c|}{lr}\n\\hline\\hline\n\\textbf{keywords} & \\textbf{count} \\\\\n\\hline\\hline\nllm & 1093 \\\\\ndomain & 563 \\\\\nrag & 270 \\\\\ncllm & 262 \\\\\ndecision support & 207 \\\\\ndiag & 195 \\\\\nimage & 185 \\\\\nqa & 167 \\\\\ntreatment & 155 \\\\\ncancer & 152 \\\\\nneg & 132 \\\\\nrev & 130 \\\\\nhallucination & 115 \\\\\nmm & 98 \\\\\nosllm & 80 \\\\\nfinetune & 71 \\\\\npatient care & 63 \\\\\nchatbot & 57 \\\\\nsur & 55 \\\\\nKG & 46 \\\\\nagents & 41 \\\\\nvlms & 17 \\\\\nagentic & 7 \\\\\nconversational AI & 5 \\\\\npatient monitoring & 3 \\\\\n\\hline\n\\hline\n\\end{tabular}\n'

In [14]:
df_unique_combined.head()

Unnamed: 0.1,Unnamed: 0,PubMed ID,title,summary,Journal,Publication Date,authors,MeSH Terms,Keywords,Article Type,...,query,published,updated,arxiv_url,pdf_url,categories,doi,year,primary_category,short_title
0,0,40505763.0,RadGPT: A system based on a large language mod...,,Journal of the American College of Radiology :...,,"Herwald, Sanna E; Shah, Preya; Johnston, Andre...",,,,...,"""large language model"" AND ""medicine"" AND ""pat...",,,,,,,,,RadGPT: A system based on a large language mod...
1,1,40491696.0,Evaluating the Application of Artificial Intel...,,"Clinical ophthalmology (Auckland, N.Z.)",,"Patel, Neeket R; Lacher, Corey R; Huang, Alan ...",,,,...,"""large language model"" AND ""medicine"" AND ""pat...",,,,,,,,,Evaluating the Application of Artificial Intel...
2,2,40435166.0,"Physician awareness of, interest in, and curre...",,PloS one,,"Solmonovich, Rachel L; Kouba, Insaf; Lee, Ji Y...",,,,...,"""large language model"" AND ""medicine"" AND ""pat...",,,,,,,,,"Physician awareness of, interest in, and curre..."
3,3,40423065.0,The Accuracy of ChatGPT-4o in Interpreting Che...,,Journal of personalized medicine,,"Lacaita, Pietro G; Galijasevic, Malik; Swoboda...",,,,...,"""large language model"" AND ""medicine"" AND ""pat...",,,,,,,,,The Accuracy of ChatGPT-4o in Interpreting Che...
4,4,40378254.0,Semi-automated pipeline to accelerate multi-si...,,Journal of the American Medical Informatics As...,,"Fan, Hao; Rossetti, Sarah C; Thate, Jennifer; ...",,,,...,"""large language model"" AND ""medicine"" AND ""pat...",,,,,,,,,Semi-automated pipeline to accelerate multi-si...


In [101]:
df_filter= df_unique_combined[df_unique_combined['keywords'].astype(str).str.contains(r'\bKG\b', case=False, na=False)]


In [102]:
df_filter.shape

(70, 48)

In [103]:
df_filter.to_excel(os.path.join(data_folder, 'kg_queries.xlsx'), index=False)

#### Analisys RAW Listing

In [57]:
filename = 'raw_listing_v1.xlsx'
df_v1 =pd.read_excel(os.path.join(data_folder, filename))

In [60]:
df_v1['keywords'] = df_v1.apply(
        lambda row: detect_keywords(row, my_keywords), axis=1
    )



In [61]:
nombre_columna_keywords = 'keywords' # <-- CAMBIA ESTO
keywords_series = df_v1[nombre_columna_keywords].astype(str).str.split(',')
todas_las_keywords = keywords_series.explode()
keywords_limpias = todas_las_keywords.str.strip().dropna()
keywords_finales = keywords_limpias[keywords_limpias != '']

if not keywords_finales.empty:
        # Contar la frecuencia de cada keyword
        frecuencia_keywords = keywords_finales.value_counts()

        print(f"Frecuencia de keywords en la columna '{nombre_columna_keywords}':")
        print(frecuencia_keywords)

        # Para ver las N keywords más comunes (ej. las 10 primeras)
        top_n = 10
        print(f"\nLas {top_n} keywords más comunes:")
        print(frecuencia_keywords.head(top_n))
        
        # Número total de keywords únicas
        print(f"\nNúmero total de keywords únicas: {frecuencia_keywords.size}")
else:
        print(f"No se encontraron keywords válidas en la columna '{nombre_columna_keywords}' después del preprocesamiento.")
        print("Verifica si la columna contiene datos o si el delimitador es correcto.")



Frecuencia de keywords en la columna 'keywords':
keywords
llm                   5039
domain                4250
cllm                  4043
rag                   3016
rev                   1679
neg                   1556
decision support      1467
image                 1249
treatment             1247
KG                    1169
diag                  1062
chatbot                954
cancer                 909
sur                    587
osllm                  583
qa                     526
hallucination          516
finetune               448
mm                     433
patient care           369
agents                 260
GD                     184
conversational AI       46
vlms                    35
agentic                 35
MAS                     13
patient monitoring      12
Name: count, dtype: int64

Las 10 keywords más comunes:
keywords
llm                 5039
domain              4250
cllm                4043
rag                 3016
rev                 1679
neg                 155

In [None]:

nombre_columna_keywords = 'keywords' 

if nombre_columna_keywords in df_v1.columns:
    # 1. Preparar la columna de keywords para el filtrado
    #    Convertir a minúsculas para que la búsqueda exacta sea insensible al caso original
    columna_keywords_para_filtrar = df_v1[nombre_columna_keywords].astype(str).str.lower()

    # 2. Definir la función para la condición de filtro de filas
    def ambas_keywords_exactas_presentes(texto_keywords_celda):
        if pd.isna(texto_keywords_celda): # Manejar NaNs originales
            return False
        # Separar por comas y limpiar cada keyword individual
        keywords_individuales_limpias = [kw.strip() for kw in texto_keywords_celda.split(',')]
        # Verificar si "llm" Y "domain" están presentes como keywords exactas
        return 'llm' in keywords_individuales_limpias and 'domain' in keywords_individuales_limpias

    # 3. Aplicar la función para crear la condición de filtro
    condicion_filtro_filas = columna_keywords_para_filtrar.apply(ambas_keywords_exactas_presentes)

    # 4. Filtrar el DataFrame original
    df_filas_filtradas = df_v1[condicion_filtro_filas]

    if not df_filas_filtradas.empty:
        print(f"Se encontraron {len(df_filas_filtradas)} filas donde 'llm' y 'domain' existen como keywords exactas separadas.")
        print("Procediendo a contar todas las keywords individuales de estas filas...\n")

        # 5. Tomar solo la columna de keywords de ESTAS FILAS FILTRADAS para el conteo
        keywords_para_conteo = df_filas_filtradas[nombre_columna_keywords].astype(str).str.lower()

        # 6. Procesar estas keywords para el conteo: split, explode, clean
        series_de_listas_k = keywords_para_conteo.str.split(',')
        k_individuales = series_de_listas_k.explode()
        k_limpias = k_individuales.str.strip().dropna()
        k_validas = k_limpias[k_limpias != ''] # Filtrar strings vacíos

        if not k_validas.empty:
            conteo_final_keywords_sorted = k_validas.value_counts()
            print(f"Conteo de todas las keywords individuales de las filas filtradas:")
            print(conteo_final_keywords_sorted)

            top_n = 10
            print(f"\nLas {top_n} keywords más comunes de estas filas:")
            print(conteo_final_keywords_sorted.head(top_n))
            
            print(f"\nNúmero total de keywords únicas encontradas en estas filas: {conteo_final_keywords_sorted.size}")
        else:
            print("No se encontraron keywords válidas para contar en las filas filtradas (después de split y limpieza).")

    else:
        print(f"No se encontraron filas que cumplan con la condición de tener 'llm' Y 'domain' como keywords exactas separadas.")
else:
    print(f"Error: La columna '{nombre_columna_keywords}' no se encuentra en el DataFrame.")
    print(f"Columnas disponibles: {df_v1.columns.tolist()}")


Se encontraron 2232 filas donde 'llm' y 'domain' existen como keywords exactas separadas.
Procediendo a contar todas las keywords individuales de estas filas...

Conteo de todas las keywords individuales de las filas filtradas:
keywords
domain                2232
llm                   2232
cllm                   981
rag                    749
decision support       499
rev                    445
diag                   322
image                  300
treatment              289
neg                    282
osllm                  250
qa                     231
hallucination          195
chatbot                188
finetune               188
cancer                 160
mm                     147
patient care           138
kg                     129
sur                    123
agents                  68
vlms                    19
agentic                 17
conversational ai        8
patient monitoring       3
mas                      2
gd                       1
Name: count, dtype: int64

Las 10 

In [69]:
frecuencia_keywords

keywords
llm                   5039
domain                4250
cllm                  4043
rag                   3016
rev                   1679
neg                   1556
decision support      1467
image                 1249
treatment             1247
KG                    1169
diag                  1062
chatbot                954
cancer                 909
sur                    587
osllm                  583
qa                     526
hallucination          516
finetune               448
mm                     433
patient care           369
agents                 260
GD                     184
conversational AI       46
vlms                    35
agentic                 35
MAS                     13
patient monitoring      12
Name: count, dtype: int64

In [72]:
df_fusionado = pd.merge(
    frecuencia_keywords, 
    conteo_final_keywords_sorted, 
    left_index=True, 
    right_index=True, 
    how='outer',
    suffixes=('_freq', '_conteo')  # Añadir sufijos a columnas con nombres duplicados
)
df_fusionado.sort_values('count_freq', ascending=False)

Unnamed: 0_level_0,count_freq,count_conteo
keywords,Unnamed: 1_level_1,Unnamed: 2_level_1
llm,5039.0,2232.0
domain,4250.0,2232.0
cllm,4043.0,981.0
rag,3016.0,749.0
rev,1679.0,445.0
neg,1556.0,282.0
decision support,1467.0,499.0
image,1249.0,300.0
treatment,1247.0,289.0
KG,1169.0,


In [75]:
df_to_latex_with_integers(df_fusionado.sort_values('count_freq', ascending=False).reset_index().head(15), 'freckyw_G.txt')

'\\begin{tabular}{|c|c|c|}{lrr}\n\\hline\\hline\n\\textbf{keywords} & \\textbf{count_freq} & \\textbf{count_conteo} \\\\\n\\hline\\hline\nllm & 5039 & 2232 \\\\\ndomain & 4250 & 2232 \\\\\ncllm & 4043 & 981 \\\\\nrag & 3016 & 749 \\\\\nrev & 1679 & 445 \\\\\nneg & 1556 & 282 \\\\\ndecision support & 1467 & 499 \\\\\nimage & 1249 & 300 \\\\\ntreatment & 1247 & 289 \\\\\nKG & 1169 & 0 \\\\\ndiag & 1062 & 322 \\\\\nchatbot & 954 & 188 \\\\\ncancer & 909 & 160 \\\\\nsur & 587 & 123 \\\\\nosllm & 583 & 250 \\\\\n\\hline\n\\hline\n\\end{tabular}\n'

In [66]:
frecuencia_keywords.merge(conteo_final_keywords_sorted)

AttributeError: 'Series' object has no attribute 'merge'

In [None]:
# Ejemplo de uso
latex_code = df_to_latex_with_integers(
    df, 
    filename='tabla_formateada.tex',
    caption='Resultados de búsqueda bibliográfica',
    label='tab:resultados'
)
