## **Collection and Preprocessing**

In [3]:
from Bio import Entrez
import pandas as pd

Entrez.email = "conhecimentolinguagem@gmail.com"
term = '("disease"[MeSH Terms]) AND ("symptom"[Title/Abstract] OR "treatment"[Title/Abstract]) AND ("2020"[Date - Publication] : "2025"[Date - Publication])'


# Search PubMed and collect up to 100 matching PMIDs
handle = Entrez.esearch(db="pubmed", term=term, retmax=100)
record = Entrez.read(handle)
handle.close()
ids = record["IdList"]

# Fetch each record as plain text (one request per PMID)
articles = []
for pmid in ids:
    fetch = Entrez.efetch(db="pubmed", id=pmid, rettype="abstract", retmode="text")
    text = fetch.read()
    fetch.close()
    articles.append({"pmid": pmid, "text": text})

df = pd.DataFrame(articles)

# Save the articles to a CSV file
df.to_csv("articles.csv", index=False)
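
The loop above issues one `efetch` request per PMID and stores the full plain-text record, which includes author names and affiliations along with the abstract. Below is a minimal sketch of a batched alternative. It assumes Biopython's `Entrez.efetch` with `retmode="xml"` and the usual PubMed XML layout (`PubmedArticle`, `MedlineCitation`, `Article`, `Abstract`). The helper names `batched` and `fetch_abstracts` and the batch size of 50 are illustrative and not part of the original notebook.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_abstracts(ids, batch_size=50):
    """Fetch abstracts for `ids` in batches and return a list of dicts.

    Assumes Entrez.email has already been set, as in the cell above.
    """
    # Imported here so that `batched` stays usable without Biopython.
    from Bio import Entrez

    articles = []
    for chunk in batched(list(ids), batch_size):
        # A comma-separated ID list retrieves several records in one request.
        handle = Entrez.efetch(db="pubmed", id=",".join(chunk), retmode="xml")
        records = Entrez.read(handle)
        handle.close()
        for rec in records["PubmedArticle"]:
            citation = rec["MedlineCitation"]
            # Some records have no abstract; fall back to an empty string.
            parts = citation["Article"].get("Abstract", {}).get("AbstractText", [])
            articles.append({
                "pmid": str(citation["PMID"]),
                "text": " ".join(str(p) for p in parts),
            })
    return articles
```

`pd.DataFrame(fetch_abstracts(ids))` would then give the same `pmid` and `text` columns as before, with `text` holding only the abstract.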

## **Entity Extraction**

- Create a new virtual environment
- pip install spacy==3.7.4
- pip install scispacy==0.5.1
- Download "en_ner_bc5cdr_md" from https://allenai.github.io/scispacy/
- pip install "location" (where "location" is the path or URL of the downloaded model)
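
After the setup steps above, a quick check of the environment can be sketched with the standard library alone. The helper name `installed_versions` is illustrative, and the sketch assumes the model registers under the distribution name `en_ner_bc5cdr_md`, which may differ depending on how it was installed.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names as used in this section; the model name is an assumption.
for name, version in installed_versions(["spacy", "scispacy", "en_ner_bc5cdr_md"]).items():
    print(f"{name}: {version or 'not installed'}")
```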


### **spaCy and scispaCy**

In [4]:
import pandas as pd
import spacy

df = pd.read_csv("articles.csv")

nlp = spacy.load("en_ner_bc5cdr_md")  # load the scispaCy model

doc = nlp(df.iloc[0]['text'])
for ent in doc.ents:
    print(ent.text, ent.label_)

Acute vestibular syndrome DISEASE
Agger-Nielsen CHEMICAL
Gødstrup CHEMICAL
Acute vestibular syndrome DISEASE
AVS DISEASE
stroke DISEASE
neuritis DISEASE
nystagmus DISEASE
strokes DISEASE
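
The cell above processes only the first article. Below is a minimal sketch of how the same model could be run over every row, collecting one record per entity. It assumes any spaCy-style `nlp` object exposing `pipe`; the helper name `extract_entities` and the output keys are illustrative.

```python
def extract_entities(nlp, pmids, texts):
    """Run `nlp` over `texts` and return one row per recognized entity."""
    rows = []
    # nlp.pipe streams the texts through the model instead of calling nlp() once per text.
    for pmid, doc in zip(pmids, nlp.pipe(texts)):
        for ent in doc.ents:
            rows.append({"pmid": pmid, "entity": ent.text, "label": ent.label_})
    return rows
```

With the objects from the cell above, `pd.DataFrame(extract_entities(nlp, df["pmid"], df["text"]))` would give a table of entities per article.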


### **Transformers**

- Only diseases and treatments are being extracted so far; symptoms are still missing

In [1]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the model and the tokenizer
model_name = "HUMADEX/english_medical_ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create the NER pipeline
nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer)

# Example clinical text
text = """
The patient was diagnosed with acute lymphoblastic leukemia and was prescribed methotrexate as part of the treatment regimen.
"""

# Apply NER
entities = nlp_ner(text)

# Print the recognized entities
for entity in entities:
    print(f"Text: {entity['word']}, Type: {entity['entity']}, Confidence: {entity['score']:.4f}")


  from .autonotebook import tqdm as notebook_tqdm
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Device set to use cpu


Text: acute, Type: B-PROBLEM, Confidence: 0.9998
Text: l, Type: I-PROBLEM, Confidence: 0.9996
Text: ##ymph, Type: I-PROBLEM, Confidence: 0.9996
Text: ##ob, Type: I-PROBLEM, Confidence: 0.9996
Text: ##lastic, Type: I-PROBLEM, Confidence: 0.9996
Text: le, Type: E-PROBLEM, Confidence: 0.9997
Text: ##uke, Type: E-PROBLEM, Confidence: 0.9997
Text: ##mia, Type: E-PROBLEM, Confidence: 0.9998
Text: met, Type: S-TREATMENT, Confidence: 0.9976
Text: ##hot, Type: S-TREATMENT, Confidence: 0.9661
Text: ##re, Type: E-TREATMENT, Confidence: 0.8310
Text: ##xa, Type: E-TREATMENT, Confidence: 0.8887
Text: ##te, Type: S-TREATMENT, Confidence: 0.6880
Text: the, Type: B-TREATMENT, Confidence: 0.9998
Text: treatment, Type: I-TREATMENT, Confidence: 0.9998
Text: regime, Type: E-TREATMENT, Confidence: 0.9997
Text: ##n, Type: E-TREATMENT, Confidence: 0.9996
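
The output above is reported per word piece: `##` marks a continuation of the previous token, and the tag prefixes (`B-`, `I-`, `E-`, `S-`) appear to follow a begin/inside/end/single scheme. Below is a minimal sketch of how these pieces could be merged back into whole entities. It assumes each token dict also carries the `index` key that the `transformers` NER pipeline returns. The helper name `merge_word_pieces` and the `index` values in the sample are illustrative; the words, tags and scores are copied from the output above.

```python
def merge_word_pieces(tokens):
    """Merge word-piece tokens into entity spans with a label and a minimum score."""
    spans = []
    prev_index = None
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        word = tok["word"]
        adjacent = prev_index is not None and tok["index"] == prev_index + 1
        if spans and adjacent and word.startswith("##"):
            # Continuation of the previous word: glue without a space.
            spans[-1]["text"] += word[2:]
            spans[-1]["scores"].append(tok["score"])
        elif spans and adjacent and prefix in ("I", "E") and spans[-1]["label"] == label:
            # Next word of the same entity: join with a space.
            spans[-1]["text"] += " " + word
            spans[-1]["scores"].append(tok["score"])
        else:
            # Anything else starts a new span.
            text = word[2:] if word.startswith("##") else word
            spans.append({"text": text, "label": label, "scores": [tok["score"]]})
        prev_index = tok["index"]
    return [
        {"text": s["text"], "label": s["label"], "score": min(s["scores"])}
        for s in spans
    ]


# Sample taken from the output above; the index values are illustrative.
sample_tokens = [
    {"word": "acute", "entity": "B-PROBLEM", "score": 0.9998, "index": 6},
    {"word": "l", "entity": "I-PROBLEM", "score": 0.9996, "index": 7},
    {"word": "##ymph", "entity": "I-PROBLEM", "score": 0.9996, "index": 8},
    {"word": "##ob", "entity": "I-PROBLEM", "score": 0.9996, "index": 9},
    {"word": "##lastic", "entity": "I-PROBLEM", "score": 0.9996, "index": 10},
    {"word": "le", "entity": "E-PROBLEM", "score": 0.9997, "index": 11},
    {"word": "##uke", "entity": "E-PROBLEM", "score": 0.9997, "index": 12},
    {"word": "##mia", "entity": "E-PROBLEM", "score": 0.9998, "index": 13},
    {"word": "met", "entity": "S-TREATMENT", "score": 0.9976, "index": 17},
    {"word": "##hot", "entity": "S-TREATMENT", "score": 0.9661, "index": 18},
    {"word": "##re", "entity": "E-TREATMENT", "score": 0.8310, "index": 19},
    {"word": "##xa", "entity": "E-TREATMENT", "score": 0.8887, "index": 20},
    {"word": "##te", "entity": "S-TREATMENT", "score": 0.6880, "index": 21},
]

for span in merge_word_pieces(sample_tokens):
    print(f"{span['text']} ({span['label']}, min score {span['score']:.4f})")
```

Taking the minimum score per span is one conservative choice; an average would also be reasonable. The span label comes from the first token, so the inconsistent prefixes on the `methotrexate` pieces do not split the word.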
