Here we explore the Hugging Face `transformers` library, which implements the state-of-the-art transformer models we discussed in class.

Let's first install `transformers` and its dependencies if they are not already installed.

In [2]:
pip install pytest-astropy

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install torch==1.7.1+cpu torchvision==0.8.2+cpu -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Note: you may need to restart the kernel to use updated packages.


In [4]:
!pip install tensorflow



In [5]:
pip install --upgrade huggingface-hub

Requirement already up-to-date: huggingface-hub in c:\users\guptaj\anaconda3\lib\site-packages (0.4.0)
Note: you may need to restart the kernel to use updated packages.


In [6]:
!pip install transformers[sentencepiece]



**Text Classification**

In [2]:
from transformers import pipeline
course_description = '''The class will explore current AI topics through reading, writing, programming,
and exploring some of the classic fiction that has former people's (mis)perceptions
of machine intelligence. This course will give students an appreciation on how to
separate fiction from fact, and how to critically evaluate the impact current and
upcoming AI topics will have on society.'''
text_classifier = pipeline("zero-shot-classification")
text_classifier(
    course_description,
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


{'sequence': "The class will explore current AI topics through reading, writing, programming,\nand exploring some of the classic fiction that has former people's (mis)perceptions\nof machine intelligence. This course will give students an appreciation on how to\nseparate fiction from fact, and how to critically evaluate the impact current and\nupcoming AI topics will have on society.",
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8363314867019653, 0.12098874896764755, 0.042679838836193085]}
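The output above lists the candidate labels with their scores, highest first. As a minimal sketch of how to read it, the snippet below copies the labels and scores from that output (the long `sequence` field is left out) and picks the top label. It assumes the pipeline's default single-label mode, in which the scores are normalized across the candidate labels and so should sum to roughly 1.

```python
# Result copied from the cell output above ('sequence' omitted for brevity).
result = {
    "labels": ["education", "business", "politics"],
    "scores": [0.8363314867019653, 0.12098874896764755, 0.042679838836193085],
}

# Pair each label with its score and pick the highest-scoring one.
label_scores = dict(zip(result["labels"], result["scores"]))
top_label = max(label_scores, key=label_scores.get)

# In single-label mode the scores are expected to sum to about 1.
total = sum(result["scores"])

print(top_label)
print(round(total, 6))
```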

**Text Generation**

In [8]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "Patient was diagnosed with atrial fibrillation because",
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Patient was diagnosed with atrial fibrillation because he had a condition called cardiac dyspepsia. Upon completion of his treatment, he received treatment from the University of California, Los Angeles and California Institute of Technology.\n\n\n\nA'}]

**Question Answering**

In [9]:
AFibContext = ''''Patient should be diagnosed with atrial fibrillation due to ECG readings showing absence of P-wave and irregular waves between heartbeats. Also, patient is showing symptoms of palpitations, chest pain, shortness of breath, and has a history of heart attack, high blood pressure, and sleep apnea. '''

In [36]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Why was the patient diagnosed with atrial fibrillation?",
    context=AFibContext
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'score': 0.3349080979824066,
 'start': 61,
 'end': 138,
 'answer': 'ECG readings showing absence of P-wave and irregular waves between heartbeats'}

In [35]:
Q_A = pipeline("question-answering", model="healx/biomedical-slot-filling-reader-large")
Q_A(
    question="Why was the patient diagnosed with atrial fibrillation?", 
    context=AFibContext
)

{'score': 0.23715996742248535,
 'start': 203,
 'end': 222,
 'answer': 'shortness of breath'}
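Both question-answering outputs report `start` and `end` values, which are character offsets into the context string. The sketch below checks this by slicing the context with the offsets copied from the two outputs above. The context is retyped here with its leading apostrophe, since the original cell opens the string with four quote characters and the offsets depend on that extra character.

```python
# Context copied from the cell above. It begins with a literal apostrophe
# because the original cell opens the string with four quotes.
AFibContext = (
    "'Patient should be diagnosed with atrial fibrillation due to ECG readings "
    "showing absence of P-wave and irregular waves between heartbeats. Also, "
    "patient is showing symptoms of palpitations, chest pain, shortness of "
    "breath, and has a history of heart attack, high blood pressure, and "
    "sleep apnea. "
)

# Answers copied from the two pipeline outputs above.
answers = [
    {"start": 61, "end": 138,
     "answer": "ECG readings showing absence of P-wave and irregular waves between heartbeats"},
    {"start": 203, "end": 222, "answer": "shortness of breath"},
]

# Slice the context with each (start, end) pair and compare to the answer text.
spans = [AFibContext[a["start"]:a["end"]] for a in answers]
for a, span in zip(answers, spans):
    print(repr(span), span == a["answer"])
```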

## **Slot Filling (Mask Filling in This Case)**

**Available Mask Filling Models**

First, here are some candidate models that can be used to generate explanations.

Model 1: BiomedNLP-PubMedBERT
This model was pretrained from scratch using abstracts from PubMed and full-text articles from PubMedCentral. According to its model card, it achieves state-of-the-art performance on many biomedical NLP tasks and, at the time of writing, held the top score on the Biomedical Language Understanding and Reasoning Benchmark (BLURB).

In [13]:
MaskFiller_PubMed = pipeline("fill-mask", model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")

Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Model 2: Bio_ClinicalBERT
This model was trained on all notes from MIMIC III, a database containing electronic health records from ICU patients at the Beth Israel Deaconess Medical Center in Boston, MA.

In [14]:
MaskFiller_MIMIC = pipeline("fill-mask", model="emilyalsentzer/Bio_ClinicalBERT")

Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


**Model Outputs**

Here are sample outputs from both models for the same prompt.

In [16]:
MaskFiller_PubMed("Patient was diagnosed with coronary artery disease because patient is showing symptoms of [MASK].")

[{'score': 0.6843346357345581,
  'token': 16961,
  'token_str': 'angina',
  'sequence': 'patient was diagnosed with coronary artery disease because patient is showing symptoms of angina.'},
 {'score': 0.06489956378936768,
  'token': 5643,
  'token_str': 'cad',
  'sequence': 'patient was diagnosed with coronary artery disease because patient is showing symptoms of cad.'},
 {'score': 0.05206036567687988,
  'token': 7716,
  'token_str': 'ischemia',
  'sequence': 'patient was diagnosed with coronary artery disease because patient is showing symptoms of ischemia.'},
 {'score': 0.02217354066669941,
  'token': 18461,
  'token_str': 'dyspnea',
  'sequence': 'patient was diagnosed with coronary artery disease because patient is showing symptoms of dyspnea.'},
 {'score': 0.012342363595962524,
  'token': 6519,
  'token_str': 'mi',
  'sequence': 'patient was diagnosed with coronary artery disease because patient is showing symptoms of mi.'}]

In [17]:
MaskFiller_MIMIC("Patient was diagnosed with coronary artery disease because patient is showing symptoms of [MASK].")

[{'score': 0.16843591630458832,
  'token': 20360,
  'token_str': 'CP',
  'sequence': 'patient was diagnosed with coronary artery disease because patient is showing symptoms of CP.'},
 {'score': 0.12877030670642853,
  'token': 8006,
  'token_str': 'symptoms',
  'sequence': 'patient was diagnosed with coronary artery disease because patient is showing symptoms of symptoms.'},
 {'score': 0.08207368850708008,
  'token': 18418,
  'token_str': 'fatigue',
  'sequence': 'patient was diagnosed with coronary artery disease because patient is showing symptoms of fatigue.'},
 {'score': 0.06393478065729141,
  'token': 2489,
  'token_str': 'pain',
  'sequence': 'patient was diagnosed with coronary artery disease because patient is showing symptoms of pain.'},
 {'score': 0.040929537266492844,
  'token': 1142,
  'token_str': 'this',
  'sequence': 'patient was diagnosed with coronary artery disease because patient is showing symptoms of this.'}]
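The full fill-mask outputs are long, so a compact side-by-side view can make the two models easier to compare. The sketch below uses a hypothetical helper, `top_tokens`, which is not part of `transformers`. It is applied to the top three entries copied from each output above, keeping only the `score` and `token_str` fields.

```python
def top_tokens(results, k=3):
    """Return the k highest-scoring (token_str, score) pairs from a fill-mask result."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 3)) for r in ranked[:k]]

# Top three predictions copied from the PubMedBERT output above.
pubmed_results = [
    {"score": 0.6843346357345581, "token_str": "angina"},
    {"score": 0.06489956378936768, "token_str": "cad"},
    {"score": 0.05206036567687988, "token_str": "ischemia"},
]

# Top three predictions copied from the Bio_ClinicalBERT output above.
mimic_results = [
    {"score": 0.16843591630458832, "token_str": "CP"},
    {"score": 0.12877030670642853, "token_str": "symptoms"},
    {"score": 0.08207368850708008, "token_str": "fatigue"},
]

print("PubMedBERT:      ", top_tokens(pubmed_results))
print("Bio_ClinicalBERT:", top_tokens(mimic_results))
```

In these copied values, PubMedBERT puts most of its probability on a single symptom (`angina`), while Bio_ClinicalBERT spreads its probability across several tokens, including the repeated word `symptoms`.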

MaskFiller2 = pipeline("fill-mask", model="emilyalsentzer/Bio_ClinicalBERT")
MaskFiller2("Patient was diagnosed with coronary artery disease because patient is showing symptoms of [MASK].")

**Explanation Types**

*Rule-based Explanation.*

Rule-based explanations are concise and typically take the form of if-then statements.
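The if-then prompt used in the cells that follow can be generated from a template, so the same structure can be reused for other diagnoses. This is a minimal sketch. `rule_based_prompt` is a hypothetical helper, and the default `[MASK]` token is an assumption that holds for the BERT-style models used here. The actual token can be read from the pipeline's `tokenizer.mask_token`.

```python
def rule_based_prompt(diagnosis, mask_token="[MASK]"):
    """Build an if-then (counterfactual) prompt with one masked symptom slot."""
    return (
        f"The diagnosis is {diagnosis}. "
        f"If the patient did not have this symptom: {mask_token}, "
        f"then the patient would not be diagnosed with {diagnosis}."
    )

prompt = rule_based_prompt("coronary artery disease/cad")
print(prompt)
```

The resulting string can be passed directly to either mask-filling pipeline.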

In [24]:
MaskFiller_PubMed("The diagnosis is coronary artery disease/cad. If the patient did not have this symptom: [MASK], then the patient would not be diagnosed with coronary artery disease/cad.")

[{'score': 0.475231796503067,
  'token': 16961,
  'token_str': 'angina',
  'sequence': 'the diagnosis is coronary artery disease / cad. if the patient did not have this symptom : angina, then the patient would not be diagnosed with coronary artery disease / cad.'},
 {'score': 0.04605596885085106,
  'token': 5643,
  'token_str': 'cad',
  'sequence': 'the diagnosis is coronary artery disease / cad. if the patient did not have this symptom : cad, then the patient would not be diagnosed with coronary artery disease / cad.'},
 {'score': 0.03935706987977028,
  'token': 6519,
  'token_str': 'mi',
  'sequence': 'the diagnosis is coronary artery disease / cad. if the patient did not have this symptom : mi, then the patient would not be diagnosed with coronary artery disease / cad.'},
 {'score': 0.01705552078783512,
  'token': 18461,
  'token_str': 'dyspnea',
  'sequence': 'the diagnosis is coronary artery disease / cad. if the patient did not have this symptom : dyspnea, then the patient would 

In [25]:
MaskFiller_MIMIC("The diagnosis is coronary artery disease/cad. If the patient did not have this symptom: [MASK], then the patient would not be diagnosed with coronary artery disease/cad.")

[{'score': 0.0874975174665451,
  'token': 1649,
  'token_str': 'however',
  'sequence': 'the diagnosis is coronary artery disease / cad. if the patient did not have this symptom : however, then the patient would not be diagnosed with coronary artery disease / cad.'},
 {'score': 0.035171978175640106,
  'token': 1148,
  'token_str': 'first',
  'sequence': 'the diagnosis is coronary artery disease / cad. if the patient did not have this symptom : first, then the patient would not be diagnosed with coronary artery disease / cad.'},
 {'score': 0.031023846939206123,
  'token': 2331,
  'token_str': 'previously',
  'sequence': 'the diagnosis is coronary artery disease / cad. if the patient did not have this symptom : previously, then the patient would not be diagnosed with coronary artery disease / cad.'},
 {'score': 0.025021931156516075,
  'token': 4146,
  'token_str': 'worse',
  'sequence': 'the diagnosis is coronary artery disease / cad. if the patient did not have this symptom : worse, the

*Example-based Explanations*

Example-based explanations typically give the patient's history and other relevant factors that contributed to a decision.
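An example-based prompt can likewise be assembled from a list of history items, with the diagnosis left as the masked slot. This is a sketch under the same assumptions as before. `example_based_prompt` is a hypothetical helper, and `[MASK]` is assumed to be the mask token of the model in use.

```python
def example_based_prompt(history, mask_token="[MASK]"):
    """Build a history-then-diagnosis prompt with the diagnosis masked."""
    if not history:
        raise ValueError("history must contain at least one item")
    if len(history) == 1:
        listed = history[0]
    elif len(history) == 2:
        listed = " and ".join(history)
    else:
        # Three or more items: comma-separated, with 'and' before the last one.
        listed = ", ".join(history[:-1]) + ", and " + history[-1]
    return f"The patient has had a history of {listed}. The diagnosis is {mask_token}."

prompt = example_based_prompt(
    ["angina", "high systolic blood pressures", "high heart rate"]
)
print(prompt)
```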

In [30]:
MaskFiller_PubMed("The patient has had a history of angina, high systolic blood pressures, and high heart rate. The diagnosis is [MASK].")

[{'score': 0.23419669270515442,
  'token': 5579,
  'token_str': 'hypertension',
  'sequence': 'the patient has had a history of angina, high systolic blood pressures, and high heart rate. the diagnosis is hypertension.'},
 {'score': 0.044368378818035126,
  'token': 7323,
  'token_str': 'uncertain',
  'sequence': 'the patient has had a history of angina, high systolic blood pressures, and high heart rate. the diagnosis is uncertain.'},
 {'score': 0.031248625367879868,
  'token': 11706,
  'token_str': 'hypertensive',
  'sequence': 'the patient has had a history of angina, high systolic blood pressures, and high heart rate. the diagnosis is hypertensive.'},
 {'score': 0.023656319826841354,
  'token': 3982,
  'token_str': 'confirmed',
  'sequence': 'the patient has had a history of angina, high systolic blood pressures, and high heart rate. the diagnosis is confirmed.'},
 {'score': 0.018639344722032547,
  'token': 13513,
  'token_str': 'idiopathic',
  'sequence': 'the patient has had a his

In [31]:
MaskFiller_MIMIC("The patient has had a history of angina, high systolic blood pressures, and high heart rate. The diagnosis is [MASK].")

[{'score': 0.1682717651128769,
  'token': 1207,
  'token_str': 'new',
  'sequence': 'the patient has had a history of angina, high systolic blood pressures, and high heart rate. the diagnosis is new.'},
 {'score': 0.08815031498670578,
  'token': 1861,
  'token_str': 'similar',
  'sequence': 'the patient has had a history of angina, high systolic blood pressures, and high heart rate. the diagnosis is similar.'},
 {'score': 0.0849054604768753,
  'token': 10527,
  'token_str': 'unclear',
  'sequence': 'the patient has had a history of angina, high systolic blood pressures, and high heart rate. the diagnosis is unclear.'},
 {'score': 0.06489607691764832,
  'token': 9591,
  'token_str': 'uncertain',
  'sequence': 'the patient has had a history of angina, high systolic blood pressures, and high heart rate. the diagnosis is uncertain.'},
 {'score': 0.06249962002038956,
  'token': 1208,
  'token_str': 'now',
  'sequence': 'the patient has had a history of angina, high systolic blood pressures,