This repository contains custom pipes and models for using spaCy on scientific documents.

In particular, it provides a custom tokenizer that adds tokenization rules on top of spaCy's rule-based tokenizer, a POS tagger and syntactic parser trained on biomedical data, and an entity span detection model. Separately, there are also NER models for more specific tasks.

A named entity recognition (NER) system finds entity mentions in raw text and assigns each one a category. It reads a sentence and marks the spans that matter. The entity types of interest differ from project to project, so an NER system designed for one project may not be reusable for another.

https://github.com/allenai/scispacy
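
To make the "find a span, then assign a category" idea concrete, here is a deliberately tiny sketch. It is a plain dictionary lookup, not how scispaCy's statistical models work, and the lexicon entries and label names (`VIRUS`, `DISEASE`, `SYMPTOM`) are hypothetical choices made for illustration.

```python
import re

# Hypothetical mini-lexicon mapping surface forms to entity categories.
# The terms and labels are illustrative only, not scispaCy's label set.
LEXICON = {
    "sars-cov-2": "VIRUS",
    "covid-19": "DISEASE",
    "pneumonia": "DISEASE",
    "fever": "SYMPTOM",
}

def toy_ner(text):
    """Return (matched_text, start, end, label) tuples ordered by position."""
    entities = []
    for term, label in LEXICON.items():
        # Match the term only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            entities.append((m.group(0), m.start(), m.end(), label))
    return sorted(entities, key=lambda e: e[1])

sentence = "SARS-CoV-2 causes COVID-19, which often presents with fever."
for ent in toy_ner(sentence):
    print(ent)
```

A different project would need a different lexicon and label set, which is the reuse limitation described above in miniature.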

In [8]:
!pip install scispacy
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz
Building wheels for collected packages: en-core-sci-sm
  Building wheel for en-core-sci-sm (setup.py): started
  Building wheel for en-core-sci-sm (setup.py): finished with status 'done'
  Created wheel for en-core-sci-sm: filename=en_core_sci_sm-0.2.4-cp37-none-any.whl size=17161113 sha256=1e68917550364851e3dafcaf1449fd1ab3ad01d5794ecc701435987eb0b94139
  Stored in directory: C:\Users\BenettiM\AppData\Local\pip\Cache\wheels\34\60\b9\fabd9c3eeba17ed66df745479f2fc502a6702755cb4a9632f2
Successfully built en-core-sci-sm


In [9]:
import pandas as pd
import scispacy # https://allenai.github.io/scispacy/   https://github.com/allenai/scispacy  
import spacy
import numpy as np
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity

In [10]:
nlp = spacy.load("en_core_sci_sm")
meta = pd.read_csv("C:/Users/BenettiM/Downloads/CORD-19-research-challenge/metadata.csv")

In [11]:
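# Map each paper's sha to the document vector of its abstract.
# Rows with a missing sha or abstract (NaN in pandas) are skipped.
vector_dict = {}
for sha, abstract in tqdm(meta[["sha","abstract"]].values):
    if isinstance(sha, str) and isinstance(abstract, str):
        vector_dict[sha] = nlp(abstract).vector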

100%|██████████| 44220/44220 [20:34<00:00, 35.82it/s]
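
spaCy documents `Doc.vector` as defaulting to an average of the per-token vectors. Which per-token vectors a given model uses depends on the model, so the following is only a sketch of the averaging step, using hypothetical 3-dimensional vectors and a hypothetical helper name `mean_vector`.

```python
def mean_vector(token_vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Hypothetical 3-dimensional token vectors for a three-token text.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
doc_vector = mean_vector(token_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0]
```

Because every token contributes equally, a long abstract and a short one end up with vectors of the same size, which is what lets them be compared directly.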


In [12]:
keys = list(vector_dict.keys())
values = list(vector_dict.values())

In [13]:
cosine_sim_matrix = cosine_similarity(values, values)
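
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it compares direction and ignores magnitude. A minimal pure-Python sketch of the quantity and of the pairwise matrix follows. The helper names are hypothetical, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def cosine_matrix(vectors):
    """Pairwise matrix: entry [i][j] compares vectors[i] with vectors[j]."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
sim = cosine_matrix(vectors)
for row in sim:
    print(row)
```

The first two vectors point the same way and score 1.0 despite their different lengths, while the third is orthogonal to both and scores 0.0.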

In [14]:
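n_sim_articles = 5
input_sha = "e3b40cc8e0e137c416b4a2273a4dca94ae8178cc"

# Rank all abstracts by similarity to the input; position 0 is skipped
# on the assumption that it is the input abstract itself.
sha_index = keys.index(input_sha)
sim_indexes = np.argsort(cosine_sim_matrix[sha_index])[::-1][1:n_sim_articles+1]
sim_shas = [keys[i] for i in sim_indexes]
# isin() returns rows in file order, so re-index by sim_shas to restore the ranking.
meta_info = meta[meta.sha.isin(sim_shas)].drop_duplicates("sha").set_index("sha").loc[sim_shas].reset_index()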
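
Dropping the first position of the sorted row relies on the input abstract ranking first. If another abstract ties with it at the top score (an exact duplicate, for example), that assumption may not hold. A small sketch of an alternative that excludes the query by its index instead, where `top_k_neighbours` is a hypothetical helper and the scores are made up:

```python
def top_k_neighbours(similarities, query_index, k):
    """Indices of the k highest scores, skipping the query's own position."""
    candidates = [i for i in range(len(similarities)) if i != query_index]
    # Stable sort: candidates with equal scores keep their original order.
    candidates.sort(key=lambda i: similarities[i], reverse=True)
    return candidates[:k]

# Made-up similarity row: index 1 is the query, index 3 is an exact duplicate.
row = [0.91, 1.00, 0.15, 1.00, 0.78]
print(top_k_neighbours(row, query_index=1, k=3))  # [3, 0, 4]
```

Here the duplicate at index 3 is kept as the nearest neighbour and the query itself never appears in the result, whichever way the tie is sorted.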

In [15]:
print("-------QUERY ABSTRACT-----")
print(meta[meta.sha == input_sha]["abstract"].values[0])

-------QUERY ABSTRACT-----
In December 2019, cases of unidentified pneumonia with a history of exposure in the Huanan Seafood Market were reported in Wuhan, Hubei Province. A novel coronavirus, SARS-CoV-2, was identified to be accountable for this disease. Human-to-human transmission is confirmed, and this disease (named COVID-19 by World Health Organization (WHO)) spread rapidly around the country and the world. As of 18 February 2020, the number of confirmed cases had reached 75,199 with 2009 fatalities. The COVID-19 resulted in a much lower case-fatality rate (about 2.67%) among the confirmed cases, compared with Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS). Among the symptom composition of the 45 fatality cases collected from the released official reports, the top four are fever, cough, short of breath, and chest tightness/pain. The major comorbidities of the fatality cases include hypertension, diabetes, coronary heart disease, cerebral infa

In [16]:
print(f"----TOP {n_sim_articles} SIMILAR ABSTRACTS-----")
for abst in meta_info.abstract.values:
    print(abst)
    print("---------")

----TOP 5 SIMILAR ABSTRACTS-----
Abstract The 29th International Conference on Antiviral Research (ICAR) was held in La Jolla, CA, USA from April 17 to 21, 2016. This report opens with a tribute to the late Chris McGuigan, a Past-President of ISAR, then continues with summaries of the principal invited lectures. Doug Richman (Elion Award) investigated HIV resistance, Bob Vince (Holý Award) showed how carbocyclic nucleoside analogs led to abacavir and Jerome Deval (Prusoff Award) explained how his group chose to seek a nucleoside analog to treat RSV. ALS-8176 was active in a human RSV-challenge study and is being evaluated in children. The first keynote address, by Richard H. Scheuermann, reported on the remarkable progress made in viral genomics. The second keynote address, by Heinz Feldmann, gave an overview of Ebola virus disease. There were four mini-symposia, Structural Biology, Diagnostic Technologies, DNA viruses and Zika virus. Diagnostic assays are approaching an ideal aim, a c

In [17]:
n_return = 5
query_statement = "Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies"

In [18]:
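# Embed the free-text query, then score every abstract vector against it.
query_vector = nlp(query_statement).vector
cosine_sim_matrix_query = cosine_similarity(values, query_vector.reshape(1,-1))
query_sim_indexes = np.argsort(cosine_sim_matrix_query.ravel())[::-1][:n_return]
query_shas = [keys[i] for i in query_sim_indexes]
# isin() returns rows in file order, so re-index by query_shas to restore the ranking.
meta_info_query = meta[meta.sha.isin(query_shas)].drop_duplicates("sha").set_index("sha").loc[query_shas].reset_index()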
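
A membership filter such as `isin` returns rows in the table's own order, not in ranking order, so the best match is not necessarily printed first. The difference can be sketched with plain dictionaries; the records and sha values below are hypothetical.

```python
# Hypothetical metadata records, listed in "file order".
records = [
    {"sha": "aaa", "title": "Paper A"},
    {"sha": "bbb", "title": "Paper B"},
    {"sha": "ccc", "title": "Paper C"},
    {"sha": "ddd", "title": "Paper D"},
]
# Hypothetical ranking, best match first.
ranked_shas = ["ccc", "aaa", "ddd"]

# Membership filter: keeps the right rows but in file order.
wanted = set(ranked_shas)
filtered = [r for r in records if r["sha"] in wanted]

# Rank-preserving lookup: index the records once, then walk the ranking.
by_sha = {r["sha"]: r for r in records}
ranked = [by_sha[s] for s in ranked_shas if s in by_sha]

print([r["title"] for r in filtered])  # ['Paper A', 'Paper C', 'Paper D']
print([r["title"] for r in ranked])    # ['Paper C', 'Paper A', 'Paper D']
```

Both lists contain the same papers, but only the second one puts the closest match at the top.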

In [19]:
print(f"----TOP {n_return} SIMILAR ABSTRACTS TO QUERY-----")
for abst in meta_info_query.abstract.values:
    print(abst)
    print("---------")

----TOP 5 SIMILAR ABSTRACTS TO QUERY-----
Résumé Introduction Le métapneumovirus humain (hMPV) est un virus récemment identifié chez l’homme, responsable d’infections respiratoires parfois sévères que l’on observe surtout chez l’enfant. Observation Un patient âgé de 59 ans a été hospitalisé pour une atteinte respiratoire fébrile 3 jours après le retour d’un voyage en Chine effectué pendant l’épidémie de syndrome respiratoire aigu sévère (Sras). En dehors de la fièvre (>38 °C), étaient notées une toux sèche, des myalgies, des arthralgies, une opacité paracardiaque droite et une lymphopénie modérée. Les recherches virologiques conventionnelles étaient négatives. La recherche du nouveau coronavirus responsable du Sras était négative, mais la recherche de métapneumovirus humain (hMPV) était positive. Discussion Cette observation indique que le hMPV peut être responsable d’une atteinte respiratoire fébrile pouvant initialement évoquer un Sras chez un patient ayant séjourné en zone d’endémie

In [None]:
# -- Next step: build an analysis based on the data in the document similarity matrix
# 25-03-2020