# Pubmed Data Mining

_Lucas Goiriz Beltrán_

For this analysis we will use:
- `ENTREZ`: the NCBI API (E-utilities) for querying PubMed
- `requests`: to send HTTP requests to the API
- `json`: to parse the API responses
- `xml`: to parse the API responses when `json` is not available
- `pandas`: to build the data tables (note that `pandas` relies on `openpyxl` under the hood to export to `.xlsx`).

In [1]:
# Load the libraries
import requests, json, xml
import xml.etree.ElementTree as ET
import pandas as pd

In [2]:
# ENTREZ API. See https://www.ncbi.nlm.nih.gov/books/NBK25499/
esearchApi = r"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi" # For the PubMed search
efetchApi  = r"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"  # To retrieve the article data
efetchPost = r"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"  # In case the query is too large for efetch

# Query attributes

attbs = {
    "db"       : "pubmed",     # The database to query: PubMed
    "retmode"  : "json",       # Return everything as JSON, which I find easier to work with
    "retmax"   : "50000",      # Return at most 50000 results
    "datetype" : "pdat",       # The date type: pdat is the publication date
    "mindate"  : "2022/01/01", # Minimum date, in YYYY/MM/DD format
    "maxdate"  : "2022/12/31", # Maximum date, in YYYY/MM/DD format
    "term"     : "polylysine"  # The search term (accepts more, separated by commas)
}

# Build the URL: join the key=value pairs with "&"
# (the values are inserted as-is, without URL-encoding)
URL1 = esearchApi + "?" + "&".join("%s=%s" % (k, v) for k, v in attbs.items())

print("The URL of our PubMed search is:\n%s" % URL1)

The URL of our PubMed search is:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=50000&datetype=pdat&mindate=2022/01/01&maxdate=2022/12/31&term=polylysine

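The URL above works because none of the values contains characters that need escaping. As a minimal sketch of a more defensive variant, the standard library's `urllib.parse.urlencode` can percent-encode every value. The helper name `build_url` and the multi-word search term are assumptions made for illustration; they are not part of the original notebook.

```python
from urllib.parse import urlencode

# Same endpoint as in the cell above
ESEARCH_API = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

params = {
    "db": "pubmed",
    "retmode": "json",
    "datetype": "pdat",
    "mindate": "2022/01/01",
    "maxdate": "2022/12/31",
    "term": "poly-L-lysine AND hydrogel",  # hypothetical term containing spaces
}


def build_url(base, params):
    """Join base and params into a URL, percent-encoding every value."""
    return base + "?" + urlencode(params)


url = build_url(ESEARCH_API, params)
print(url)
```

With the default settings, `urlencode` turns spaces into `+` and `/` into `%2F`. `requests.get` also accepts a `params` dictionary and performs the encoding itself, which avoids building the string by hand.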

In [3]:
def queryNstatus(URL):
    """Run the query, report whether it succeeded, and return the response.

    We will reuse this function later.
    """

    # Run the query
    query = requests.get(URL)

    # Check whether it succeeded
    if query.status_code == 200:
        print("The query was successful")

    else:
        print("The query was not successful")

    return query

# Call the function
queryOutput = queryNstatus(URL1)

The query was successful

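The function above only reports a failure; it does not retry. A possible extension, sketched here under stated assumptions, retries a failed query a few times with a pause between attempts. The helper `fetch_with_retry`, its parameters, and the fake response objects used for the demonstration are hypothetical. The default pause of 0.4 s is chosen to stay under the limit of 3 requests per second that NCBI applies to clients without an API key.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.4):
    """Call fetch(url) until it returns status 200 or the attempts run out.

    fetch is any callable that takes a URL and returns an object with a
    status_code attribute, for example requests.get.
    """
    last = None
    for attempt in range(retries):
        last = fetch(url)
        if last.status_code == 200:
            return last
        # Pause before the next attempt, but not after the final one
        if attempt < retries - 1:
            time.sleep(delay)
    raise RuntimeError(
        "Query failed with status %s after %d attempts" % (last.status_code, retries)
    )


# Demonstration without network access: a fake fetch that fails once
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


def make_flaky_fetch(statuses):
    """Return a fetch function that yields the given status codes in order."""
    remaining = iter(statuses)

    def fetch(url):
        return FakeResponse(next(remaining))

    return fetch


response = fetch_with_retry(make_flaky_fetch([429, 200]), "http://example.invalid", delay=0)
print(response.status_code)
```

In the notebook this could be called as `fetch_with_retry(requests.get, URL1)`. Passing the fetch function as an argument keeps the helper testable without touching the network.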

In [4]:
# Assuming the query succeeded,
# extract the IDs of the articles of interest
resultados = json.loads(queryOutput.text)
articleIDs = resultados["esearchresult"]["idlist"]

# Parameters of the second query
pars = ["db", "pubmed", "retmode", "xml"]

# If there are many results
if len(articleIDs) > 199:
    # Upload the IDs with a POST request (epost)
    postreq = requests.post(efetchPost, data={"db": "pubmed", "id": ",".join(articleIDs)})

    # If it goes well
    if postreq.status_code == 200:
        print("The POST call was successful")

        # Add the query_key and the WebEnv to the query parameters,
        # reading them by tag name rather than by position
        qkeys = ET.fromstring(postreq.text)

        URL2 = (
            (efetchApi + "?%s=%s&%s=%s&%s=%s&%s=%s")
            % tuple(pars + ["query_key", qkeys.findtext("QueryKey"), "WebEnv", qkeys.findtext("WebEnv")])
        )

    else:
        print("The POST call was not successful")
        URL2 = ''

else:
    URL2 = (efetchApi + "?%s=%s&%s=%s&") % tuple(pars)
    URL2 += ("id=" + ",".join(articleIDs))

if URL2:
    # Run the query
    queryOutput = queryNstatus(URL2)

The query was successful

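The POST branch relies on two values returned by `epost`: the query key and the WebEnv string. The sketch below shows one way to read them by tag name and to fail loudly when either is missing. The sample XML is hand-written in the shape of an `epost` response with made-up values, and the helper name `history_params` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like an epost response; the values are made up
sample = """<ePostResult>
    <QueryKey>1</QueryKey>
    <WebEnv>MCID_example</WebEnv>
</ePostResult>"""


def history_params(xml_text):
    """Extract the query_key and WebEnv parameters from an epost response."""
    root = ET.fromstring(xml_text)
    query_key = root.findtext("QueryKey")
    web_env = root.findtext("WebEnv")
    # findtext returns None when the tag is absent, e.g. in an error response
    if query_key is None or web_env is None:
        raise ValueError("epost response lacks QueryKey or WebEnv")
    return {"query_key": query_key, "WebEnv": web_env}


params = history_params(sample)
print(params)
```

Looking the tags up by name avoids depending on the order of the child elements, which positional indexing such as `qkeys[0]` silently assumes.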

If the query was successful, we load the XML

In [5]:
XMLroot = ET.fromstring(queryOutput.text)

# Fill a dictionary with
# Publication title : [Num affiliations, Affiliations, Abstract, PubMed URL, DOI]
datos = {}
nonArticles = []

for element in XMLroot:

    # Anything that is not an article is recorded and skipped
    if element.tag != 'PubmedArticle':
        nonArticles.append(element.tag)
        continue

    # Look up the title; itertext() also keeps text inside inline markup such as <i>
    titleNode = element.find(".//ArticleTitle")
    title = "".join(titleNode.itertext()) if titleNode is not None else None

    # Collect the distinct affiliations, skipping empty entries
    affs = set(
        aff[0].text for aff in element.findall(".//AffiliationInfo")
        if len(aff) and aff[0].text
    )

    # Look up the DOI so the article can be accessed later:
    # take the first ELocationID that is a DOI and is marked as valid
    doi = next(
        (
            x for x in element.findall(".//ELocationID") # list of ELocationID tags
            if x.get("EIdType") == "doi" and x.get("ValidYN") == "Y"
        ),
        None
    )

    # If the DOI exists, prepend the rest of the URL
    if doi is not None:
        doi = "https://doi.org/" + doi.text

    # Store the information of interest
    datos[title] = [
        len(affs), # Number of affiliations
        "\n".join(affs), # Affiliations
        element.findtext(".//AbstractText"), # First abstract section (None if there is no abstract)
        r"https://pubmed.ncbi.nlm.nih.gov/%s/" % element.findtext(".//PMID"), # PubMed URL built from the PMID
        doi # DOI
    ]

if len(nonArticles) != 0:
    print(
        "There are %d results that are not articles."
        " They include the following types:\n"
        % len(nonArticles)
    )
    print(
        len(set(nonArticles))*"- %s\n"
        % tuple(set(nonArticles))
    )

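To see how this kind of extraction behaves on edge cases, here is a small self-contained sketch that runs on a hand-written record instead of a live response. The sample XML only imitates the general shape of an efetch result and all of its values are made up; the helper name `summarize` is an assumption for illustration. The record has a title with inline markup, two `ELocationID` entries, and no abstract.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; every value is made up
sample = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>Effect of <i>in vitro</i> polylysine coating</ArticleTitle>
        <ELocationID EIdType="pii" ValidYN="Y">S0000</ELocationID>
        <ELocationID EIdType="doi" ValidYN="Y">10.0000/example</ELocationID>
        <AuthorList>
          <Author>
            <AffiliationInfo><Affiliation>Example University</Affiliation></AffiliationInfo>
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedBookArticle/>
</PubmedArticleSet>"""


def summarize(article):
    """Collect title, abstract, DOI and affiliations from one PubmedArticle."""
    title_node = article.find(".//ArticleTitle")
    # itertext() concatenates the text around and inside child tags like <i>
    title = "".join(title_node.itertext()) if title_node is not None else None
    doi = next(
        (
            e.text for e in article.iter("ELocationID")
            if e.get("EIdType") == "doi" and e.get("ValidYN") == "Y"
        ),
        None,
    )
    return {
        "title": title,
        "abstract": article.findtext(".//AbstractText"),  # None when absent
        "doi": doi,
        "affiliations": [a.text for a in article.iter("Affiliation")],
    }


root = ET.fromstring(sample)
records = [summarize(el) for el in root if el.tag == "PubmedArticle"]
print(records)
```

Using `findtext` and `get` instead of indexing means a missing abstract or attribute yields `None` rather than raising an exception in the middle of the loop.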

We create a pandas `DataFrame` so that we can view the data as a table

In [6]:
df = pd.DataFrame.from_dict(
    datos,
    orient  = "index",
    columns = ("NumAff", "Affiliations", "Abstract", "PMID", "DOI")
)
df.rename_axis("Title", inplace=True)
df.reset_index(inplace=True)
df

| | Title | NumAff | Affiliations | Abstract | PMID | DOI |
|---|---|---|---|---|---|---|
| 0 | Development and characterization of levan/pull... | 2 | College of Biomass Science and Engineering, Si... | The levan/pullulan/chitosan edible films, enri... | https://pubmed.ncbi.nlm.nih.gov/35447595/ | https://doi.org/10.1016/j.foodchem.2022.132989 |
| 1 | Long-Chain Poly-d-Lysines Interact with the Pl... | 1 | State Key Laboratory of Bioelectronics, School... | Polylysines have been frequently used in drug ... | https://pubmed.ncbi.nlm.nih.gov/35442635/ | https://doi.org/10.1021/acs.bioconjchem.2c00153 |
| 2 | Poly-L-Lysine-Based αGal-Glycoconjugates for T... | 12 | MEGA: Asthma Inception and Progression Mechani... | Anti-αGal IgE antibodies mediate a spreading a... | https://pubmed.ncbi.nlm.nih.gov/35432370/ | https://doi.org/10.3389/fimmu.2022.873019 |
| 3 | A poly-l-lysine-bonded TEMPO-oxidized bacteria... | 3 | Microbiological Engineering and Industrial Bio... | Oxidized bacterial nanocellulose (O-BNC) is a ... | https://pubmed.ncbi.nlm.nih.gov/35422281/ | https://doi.org/10.1016/j.carbpol.2022.119266 |
| 4 | Effect of FLOT2 Gene Expression on Invasion an... | 2 | Department of Gastrointestinal Surgery, The Fi... | The study is aimed at investigating the effect... | https://pubmed.ncbi.nlm.nih.gov/35419458/ | https://doi.org/10.1155/2022/2897338 |
| 5 | AdpA, a developmental regulator, promotes ε-po... | 4 | Key Laboratory of Molecular Microbiology and T... | AdpA is a global regulator of morphological di... | https://pubmed.ncbi.nlm.nih.gov/35397580/ | https://doi.org/10.1186/s12934-022-01785-6 |
| 6 | Antibacterial dialdehyde sodium alginate/ε-pol... | 4 | College of Chemistry and Environment Protectio... | Food security is an important global public he... | https://pubmed.ncbi.nlm.nih.gov/35395481/ | https://doi.org/10.1016/j.foodchem.2022.132885 |
| 7 | Effect of crosslinking strategy on the biologi... | 4 | AO Research Institute Davos, Clavadelerstrasse... | The design of multifunctional hydrogels based ... | https://pubmed.ncbi.nlm.nih.gov/35378161/ | https://doi.org/10.1016/j.ijbiomac.2022.03.207 |
| 8 | Inkjet-Patterned Microdroplets as Individual M... | 2 | Department of Chemistry, Beijing Key Laborator... | Adhesion of single cells is the foundation of ... | https://pubmed.ncbi.nlm.nih.gov/35362237/ | https://doi.org/10.1002/smll.202107992 |
| 9 | Construction of Intelligent Responsive Drug De... | 5 | Department of Respiratory Medicine, The First ... | Cancer remains a formidable global problem wit... | https://pubmed.ncbi.nlm.nih.gov/35332623/ | https://doi.org/10.1002/marc.202200034 |


In [7]:
# Filter function
def keywdFilter(row):

    # Keywords to filter out
    keywds = [
        "univ.", "university", "universidad", "universitat", "università", "université", "universität",
        "school", "institute", "instituto", "institut", "hospital", "ministry"
    ]

    # For each row of the data frame, take the affiliations and split them on line breaks
    aff = row.Affiliations.split("\n")
    goodAffs = [
        # keep each affiliation in aff ...
        affiliation for affiliation in aff
        # that contains none of the keywords
        if not any(kwd in affiliation.lower() for kwd in keywds)
    ]

    # True if at least one affiliation contains none of the keywords
    return bool(goodAffs)

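The rule is easier to check on plain lists of strings than on a whole data frame. The sketch below applies the same substring test to two made-up sets of affiliations. The shortened keyword list, the helper name `non_academic`, and the example affiliations are assumptions for illustration only.

```python
# Shortened keyword list for the example
KEYWORDS = ["univ.", "university", "institut", "school", "hospital"]


def non_academic(affiliations, keywords=KEYWORDS):
    """Return the affiliations that contain none of the keywords."""
    return [a for a in affiliations if not any(k in a.lower() for k in keywords)]


# Made-up affiliations
examples = {
    "all academic": ["Example University, City", "Institute of Examples"],
    "mixed": ["Example University, City", "Example Pharma srl"],
}

for label, affs in examples.items():
    kept = non_academic(affs)
    # An article passes the filter when at least one affiliation is kept
    print(label, "->", bool(kept), kept)
```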
We look at the results that pass our filter, that is, articles with at least one affiliation that contains none of the keywords:

In [8]:
dfPositive = df[df.apply(keywdFilter, axis=1)]
dfPositive

| | Title | NumAff | Affiliations | Abstract | PMID | DOI |
|---|---|---|---|---|---|---|
| 2 | Poly-L-Lysine-Based αGal-Glycoconjugates for T... | 12 | MEGA: Asthma Inception and Progression Mechani... | Anti-αGal IgE antibodies mediate a spreading a... | https://pubmed.ncbi.nlm.nih.gov/35432370/ | https://doi.org/10.3389/fimmu.2022.873019 |
| 3 | A poly-l-lysine-bonded TEMPO-oxidized bacteria... | 3 | Microbiological Engineering and Industrial Bio... | Oxidized bacterial nanocellulose (O-BNC) is a ... | https://pubmed.ncbi.nlm.nih.gov/35422281/ | https://doi.org/10.1016/j.carbpol.2022.119266 |
| 13 | Blocking viral infections with lysine-based po... | 3 | ViroStatics srl, Viale Umberto I, 46, 07100 Sa... | The outbreak of the Covid-19 pandemic due to t... | https://pubmed.ncbi.nlm.nih.gov/35297436/ | https://doi.org/10.1039/d2bm00030j |
| 17 | Polylysine dendrigraft is able to differential... | 4 | Laboratoire de Microbiologie Signaux et Microe... | With a view to reducing the impact of Cutibact... | https://pubmed.ncbi.nlm.nih.gov/35231149/ | https://doi.org/10.1111/exd.14554 |
| 29 | Low-Molecular-Weight Polylysines with Excellen... | 6 | Department of Ophthalmology, The Second Hospit... | The steady development of bacterial resistance... | https://pubmed.ncbi.nlm.nih.gov/35050580/ | https://doi.org/10.1021/acsbiomaterials.1c01527 |
| 33 | Amoebicidal Activity of Poly-Epsilon-Lysine Fu... | 4 | Department of Eye and Vision Science, Institut... | To determine the amoebicidal activity of funct... | https://pubmed.ncbi.nlm.nih.gov/34994769/ | https://doi.org/10.1167/iovs.63.1.11 |
| 38 | Development of Icephilic ACTIVE Glycopeptides ... | 3 | School of Materials Science and Engineering, T... | Ice formation and recrystallization exert seve... | https://pubmed.ncbi.nlm.nih.gov/34965723/ | https://doi.org/10.1021/acs.biomac.1c01372 |
| 40 | Facile and green synthesis of reduced graphene... | 5 | College of Food Science and Light Industry, Na... | Facile and green fabrication of reduced graphe... | https://pubmed.ncbi.nlm.nih.gov/34896528/ | https://doi.org/10.1016/j.biortech.2021.126534 |
| 41 | Utility of carboxylated poly L-lysine for the ... | 4 | Kagoshima City Aquarium, Kagoshima, Japan.\nJo... | Assisted reproduction techniques are required ... | https://pubmed.ncbi.nlm.nih.gov/34883419/ | https://doi.org/10.1016/j.anireprosci.2021.106889 |
| 42 | Randomized trial of neoadjuvant vaccination wi... | 9 | Department of Respiratory Medicine & Rheumatol... | BACKGROUNDLong-term prognosis of WHO grade II ... | https://pubmed.ncbi.nlm.nih.gov/34882581/ | https://doi.org/10.1172/JCI151239 |


And we also look at the results that do not pass our filter:

In [9]:
dfNegative = df[~df.apply(keywdFilter, axis=1)]
dfNegative

| | Title | NumAff | Affiliations | Abstract | PMID | DOI |
|---|---|---|---|---|---|---|
| 0 | Development and characterization of levan/pull... | 2 | College of Biomass Science and Engineering, Si... | The levan/pullulan/chitosan edible films, enri... | https://pubmed.ncbi.nlm.nih.gov/35447595/ | https://doi.org/10.1016/j.foodchem.2022.132989 |
| 1 | Long-Chain Poly-d-Lysines Interact with the Pl... | 1 | State Key Laboratory of Bioelectronics, School... | Polylysines have been frequently used in drug ... | https://pubmed.ncbi.nlm.nih.gov/35442635/ | https://doi.org/10.1021/acs.bioconjchem.2c00153 |
| 4 | Effect of FLOT2 Gene Expression on Invasion an... | 2 | Department of Gastrointestinal Surgery, The Fi... | The study is aimed at investigating the effect... | https://pubmed.ncbi.nlm.nih.gov/35419458/ | https://doi.org/10.1155/2022/2897338 |
| 5 | AdpA, a developmental regulator, promotes ε-po... | 4 | Key Laboratory of Molecular Microbiology and T... | AdpA is a global regulator of morphological di... | https://pubmed.ncbi.nlm.nih.gov/35397580/ | https://doi.org/10.1186/s12934-022-01785-6 |
| 6 | Antibacterial dialdehyde sodium alginate/ε-pol... | 4 | College of Chemistry and Environment Protectio... | Food security is an important global public he... | https://pubmed.ncbi.nlm.nih.gov/35395481/ | https://doi.org/10.1016/j.foodchem.2022.132885 |
| 7 | Effect of crosslinking strategy on the biologi... | 4 | AO Research Institute Davos, Clavadelerstrasse... | The design of multifunctional hydrogels based ... | https://pubmed.ncbi.nlm.nih.gov/35378161/ | https://doi.org/10.1016/j.ijbiomac.2022.03.207 |
| 8 | Inkjet-Patterned Microdroplets as Individual M... | 2 | Department of Chemistry, Beijing Key Laborator... | Adhesion of single cells is the foundation of ... | https://pubmed.ncbi.nlm.nih.gov/35362237/ | https://doi.org/10.1002/smll.202107992 |
| 9 | Construction of Intelligent Responsive Drug De... | 5 | Department of Respiratory Medicine, The First ... | Cancer remains a formidable global problem wit... | https://pubmed.ncbi.nlm.nih.gov/35332623/ | https://doi.org/10.1002/marc.202200034 |
| 10 | Preparation and Properties of Pea Starch/ε-Pol... | 2 | School of Innovation & Entrepreneurship, Zheji... | The composite films comprising pea starch (St)... | https://pubmed.ncbi.nlm.nih.gov/35329778/ | https://doi.org/10.3390/ma15062327 |
| 11 | Multi-omics reveals host metabolism associated... | 1 | Institute of Bast Fiber Crops, Chinese Academy... | This study aimed to assess the influence of di... | https://pubmed.ncbi.nlm.nih.gov/35315841/ | https://doi.org/10.1039/d1fo04227k |


In [10]:
# Export the table of positive results to Excel
dfPositive.to_excel("prueba_de_concepto.xlsx", index=False)