# Named Entity Recognition (NER) and Relation Extraction on Drug-Related Adverse Effects

Named entity recognition (NER) is a subtask of natural language processing (NLP) that involves identifying and classifying named entities in text into predefined categories such as person names, organization names, locations, dates, and other types of entities. The purpose of NER is to automatically extract structured information from unstructured text, which can be used for various applications such as information retrieval, question answering, machine translation, and more.

For example, given the sentence "John works at Google in New York City", an NER system would identify "John" as a person name, "Google" as an organization name, and "New York City" as a location.

NER is typically performed using rule-based systems, statistical models, or deep learning models. Statistical and deep learning models are trained on annotated datasets, where human annotators have labeled the named entities in the text. The quality of an NER system is usually measured by its precision (the fraction of detected entities that are labeled correctly) and recall (the fraction of actual named entities in the text that are detected and labeled correctly), often summarized by their harmonic mean, the F1 score.
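As a minimal sketch of how these scores could be computed at the entity level, the snippet below compares a set of predicted entities against a gold set. The helper name `precision_recall_f1`, the `(start, end, label)` span format, and the predicted spans are assumptions made for illustration; they are not taken from a specific library.

```python
def precision_recall_f1(predicted, gold):
    """Entity-level scores; each entity is a (start, end, label) tuple."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)  # exact span and label match
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Token spans (end-exclusive) for "John works at Google in New York City".
gold = [(0, 1, "person"), (3, 4, "organization"), (5, 8, "location")]
# Hypothetical system output: one wrong label and one spurious entity.
predicted = [(0, 1, "person"), (3, 4, "location"), (5, 8, "location"), (2, 3, "location")]

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the four predictions are correct, so precision is 2/4, and two of the three gold entities are found, so recall is 2/3.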

Named entity recognition (NER) can be used for identifying and extracting adverse drug effects (ADEs) from unstructured text data such as electronic health records, medical literature, and social media. ADEs are unwanted or harmful reactions that can occur as a result of taking a particular medication.

NER can help automate the identification and extraction of ADEs from large volumes of text, which matters because manually reviewing and annotating this data is time-consuming and costly. It can also make ADE detection more consistent and complete than manual review alone, because every document is processed with the same criteria.

Furthermore, NER can help to identify and extract specific information about the ADEs, such as the name of the drug, the type of reaction, the severity, and the time frame in which the reaction occurred. This information can be used to better understand the safety profile of medications, to identify potential drug interactions or contraindications, and to improve the overall management of patient care.

Overall, using NER for ADE detection can provide significant benefits for both patients and healthcare providers, by enabling more efficient and accurate identification of adverse drug reactions.

In [105]:
!pip install transformers datasets evaluate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [106]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved to /home/diana/.cache/huggingface/token
Login successful


In [107]:
from datasets import load_dataset
wnut = load_dataset("wnut_17")




In [108]:
wnut["train"]

Dataset({
    features: ['id', 'tokens', 'ner_tags'],
    num_rows: 3394
})

In [109]:
wnut["train"][0]

{'id': '0',
 'tokens': ['@paulwalk',
  'It',
  "'s",
  'the',
  'view',
  'from',
  'where',
  'I',
  "'m",
  'living',
  'for',
  'two',
  'weeks',
  '.',
  'Empire',
  'State',
  'Building',
  '=',
  'ESB',
  '.',
  'Pretty',
  'bad',
  'storm',
  'here',
  'last',
  'evening',
  '.'],
 'ner_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  7,
  8,
  8,
  0,
  7,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0]}

In [110]:
# Each number in ner_tags is the integer ID of an entity label.
# Convert the numbers to their label names to find out what the entities are:

label_list = wnut["train"].features["ner_tags"].feature.names
label_list

['O',
 'B-corporation',
 'I-corporation',
 'B-creative-work',
 'I-creative-work',
 'B-group',
 'I-group',
 'B-location',
 'I-location',
 'B-person',
 'I-person',
 'B-product',
 'I-product']

The prefix of each ner_tag indicates the position of the token within an entity:

B- indicates the beginning of an entity.

I- indicates a token is inside the same entity (for example, the State token is part of the Empire State Building entity).

O indicates the token doesn’t correspond to any entity.
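To make the tagging scheme concrete, here is a small sketch that groups BIO-tagged tokens back into entities. The helper `bio_to_spans` is a hypothetical name introduced for illustration; the tokens and tags are the Empire State Building portion of the training example shown earlier, with the integer tags already converted to label names.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        starts_entity = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_entity:
            # Close any open entity before starting a new one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": the token is outside any entity
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Empire", "State", "Building", "=", "ESB", "."]
tags = ["B-location", "I-location", "I-location", "O", "B-location", "O"]
print(bio_to_spans(tokens, tags))
# [('location', 'Empire State Building'), ('location', 'ESB')]
```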

In [111]:
# Load DistilBERT tokenizer to preprocess the tokens field

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [112]:
# You’ll need to set is_split_into_words=True to tokenize the words into subwords.

example = wnut["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

['[CLS]',
 '@',
 'paul',
 '##walk',
 'it',
 "'",
 's',
 'the',
 'view',
 'from',
 'where',
 'i',
 "'",
 'm',
 'living',
 'for',
 'two',
 'weeks',
 '.',
 'empire',
 'state',
 'building',
 '=',
 'es',
 '##b',
 '.',
 'pretty',
 'bad',
 'storm',
 'here',
 'last',
 'evening',
 '.',
 '[SEP]']

However, this adds some special tokens [CLS] and [SEP] and the subword tokenization creates a mismatch between the input and labels. A single word corresponding to a single label may now be split into two subwords. You’ll need to realign the tokens and labels by:

1. Mapping all tokens to their corresponding word with the word_ids method.
2. Assigning the label -100 to the special tokens [CLS] and [SEP] so they’re ignored by the PyTorch loss function.
3. Only labeling the first token of a given word. Assign -100 to other subtokens from the same word.

Here is how you can create a function to realign the tokens and labels, and truncate sequences to be no longer than DistilBERT’s maximum input length:

In [113]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
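The alignment rule inside the function can be checked on a toy input without loading a tokenizer. In this sketch the `word_ids` list is written by hand to mirror what a fast tokenizer would likely report for the words "Empire State Building = ESB" (with "ESB" split into `es` and `##b`, as in the tokenized output above); `align_labels` is a hypothetical helper that applies the same three cases.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Keep the label on the first subtoken of each word; mask everything else."""
    aligned, previous_word_idx = [], None
    for word_idx in word_ids:
        if word_idx is None:  # special tokens such as [CLS] and [SEP]
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx:  # first subtoken of a word
            aligned.append(word_labels[word_idx])
        else:  # later subtokens of the same word
            aligned.append(ignore_index)
        previous_word_idx = word_idx
    return aligned


# Subtokens: [CLS] empire state building = es ##b [SEP]
word_ids = [None, 0, 1, 2, 3, 4, 4, None]  # hand-written, assumed tokenizer output
word_labels = [7, 8, 8, 0, 7]  # B-location, I-location, I-location, O, B-location
print(align_labels(word_ids, word_labels))
# [-100, 7, 8, 8, 0, 7, -100, -100]
```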

In [114]:
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
tokenized_wnut






DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 3394
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1009
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1287
    })
})




In [116]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")

In [117]:
!pip install seqeval
import evaluate

seqeval = evaluate.load("seqeval")


In [118]:
import numpy as np

labels = [label_list[i] for i in example["ner_tags"]]


def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [119]:
id2label = {
    0: "O",
    1: "B-corporation",
    2: "I-corporation",
    3: "B-creative-work",
    4: "I-creative-work",
    5: "B-group",
    6: "I-group",
    7: "B-location",
    8: "I-location",
    9: "B-person",
    10: "I-person",
    11: "B-product",
    12: "I-product",
}
label2id = {
    "O": 0,
    "B-corporation": 1,
    "I-corporation": 2,
    "B-creative-work": 3,
    "I-creative-work": 4,
    "B-group": 5,
    "I-group": 6,
    "B-location": 7,
    "I-location": 8,
    "B-person": 9,
    "I-person": 10,
    "B-product": 11,
    "I-product": 12,
}
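Rather than typing both dictionaries by hand, the same mappings could be derived from the label list so they cannot drift out of sync. This is a sketch: `label_list` is repeated literally here so the snippet is self-contained, whereas in the notebook it comes from the dataset features.

```python
label_list = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]

# Position in the list is the integer ID used in ner_tags.
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

print(id2label[7], label2id["I-product"], len(label_list))
# B-location 12 13
```

With this approach, `len(label_list)` could also replace a hard-coded `num_labels` value.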

In [120]:
from transformers import create_optimizer

batch_size = 16
num_train_epochs = 3
num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
    num_warmup_steps=0,
)
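The step count passed to the optimizer is plain integer arithmetic on the training set size. A worked version with the numbers from this notebook (3394 training rows, batch size 16, 3 epochs) is sketched below; the variable names `train_rows`, `steps_per_epoch`, and `leftover_examples` are introduced only for illustration.

```python
train_rows = 3394  # num_rows of the train split shown earlier
batch_size = 16
num_train_epochs = 3

# Floor division counts only full batches.
steps_per_epoch = train_rows // batch_size
num_train_steps = steps_per_epoch * num_train_epochs
leftover_examples = train_rows - steps_per_epoch * batch_size

print(steps_per_epoch, num_train_steps, leftover_examples)
# 212 636 2
```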

In [121]:
from transformers import TFAutoModelForTokenClassification

model = TFAutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id
)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForTokenClassification: ['vocab_layer_norm', 'activation_13', 'vocab_transform', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_139', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [122]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_wnut["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_wnut["validation"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [123]:
import tensorflow as tf

model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [124]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

In [125]:
from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_wnut_model",
    tokenizer=tokenizer,
)


OSError: Looks like you do not have git-lfs installed, please install. You can install from https://git-lfs.github.com/. Then run `git lfs install` (you only have to do this once).

In [126]:
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3)

Epoch 1/3


2023-03-04 00:50:41.133397: W tensorflow/core/framework/op_kernel.cc:1733] INVALID_ARGUMENT: ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
Traceback (most recent call last):

  File "/home/diana/anaconda3/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 717, in convert_to_tensors
    tensor = as_tensor(value)

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part.


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/home/diana/anaconda3/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
    ret = func

InvalidArgumentError: Graph execution error:

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
Traceback (most recent call last):

  File "/home/diana/anaconda3/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 717, in convert_to_tensors
    tensor = as_tensor(value)

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part.


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/home/diana/anaconda3/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
    ret = func(*args)

  File "/home/diana/anaconda3/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)

  File "/home/diana/anaconda3/lib/python3.9/site-packages/datasets/utils/tf_utils.py", line 104, in np_get_batch
    batch = collate_fn(batch, **collate_fn_args)

  File "/home/diana/anaconda3/lib/python3.9/site-packages/transformers/data/data_collator.py", line 43, in __call__
    return self.tf_call(features)

  File "/home/diana/anaconda3/lib/python3.9/site-packages/transformers/data/data_collator.py", line 347, in tf_call
    batch = self.tokenizer.pad(

  File "/home/diana/anaconda3/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3020, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)

  File "/home/diana/anaconda3/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 210, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)

  File "/home/diana/anaconda3/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 733, in convert_to_tensors
    raise ValueError(

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).


	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]] [Op:__inference_train_function_42823]

In [127]:
text = "The Golden State Warriors are an American professional basketball team based in San Francisco."

In [130]:
from transformers import pipeline

classifier = pipeline("ner", model="stevhliu/my_awesome_wnut_model")
classifier(text)

Some layers from the model checkpoint at stevhliu/my_awesome_wnut_model were not used when initializing TFDistilBertForTokenClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForTokenClassification were not initialized from the model checkpoint at stevhliu/my_awesome_wnut_model and are newly initialized: ['dropout_179']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'entity': 'B-location',
  'score': 0.21302876,
  'index': 1,
  'word': 'the',
  'start': 0,
  'end': 3},
 {'entity': 'B-location',
  'score': 0.4676814,
  'index': 2,
  'word': 'golden',
  'start': 4,
  'end': 10},
 {'entity': 'B-location',
  'score': 0.3126429,
  'index': 3,
  'word': 'state',
  'start': 11,
  'end': 16},
 {'entity': 'B-location',
  'score': 0.20545056,
  'index': 4,
  'word': 'warriors',
  'start': 17,
  'end': 25},
 {'entity': 'B-location',
  'score': 0.57780087,
  'index': 13,
  'word': 'san',
  'start': 80,
  'end': 83},
 {'entity': 'B-location',
  'score': 0.55975974,
  'index': 14,
  'word': 'francisco',
  'start': 84,
  'end': 93}]

In [131]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model")
inputs = tokenizer(text, return_tensors="tf")

In [132]:
from transformers import TFAutoModelForTokenClassification

model = TFAutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model")
logits = model(**inputs).logits

Some layers from the model checkpoint at stevhliu/my_awesome_wnut_model were not used when initializing TFDistilBertForTokenClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForTokenClassification were not initialized from the model checkpoint at stevhliu/my_awesome_wnut_model and are newly initialized: ['dropout_199']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [133]:
predicted_token_class_ids = tf.math.argmax(logits, axis=-1)
predicted_token_class = [model.config.id2label[t] for t in predicted_token_class_ids[0].numpy().tolist()]
predicted_token_class

['O',
 'B-location',
 'B-location',
 'B-location',
 'B-location',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-location',
 'B-location',
 'O',
 'B-location']

## Imports

In [134]:
!pip install tensorflow-datasets --quiet
!pip install pydot --quiet
!pip install transformers --quiet


In [135]:
import pandas as pd

import numpy as np
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.layers import Embedding, Input, Dense, Lambda
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K
import tensorflow_datasets as tfds

from transformers import BertTokenizer, TFBertModel

import sklearn as sk
import os
import nltk
from nltk.data import find
from nltk.corpus import stopwords

import matplotlib.pyplot as plt

import re

import sys
!{sys.executable} -m pip install fastparquet


# EDA

### Drugs and Adverse Effects Data

In [3]:
# Read DRUG-AE.rel file
# Provides relations between drugs and adverse effects

df_adverse_effect = pd.read_csv("DRUG-AE.rel", header=None, delimiter="|")
df_adverse_effect.rename(columns = {0:"PubMed-ID", 
                                    1:"Sentence",
                                    2:"Adverse-Effect", 
                                    3:"Begin Effect Offset", 
                                    4:"End Effect Offset", 
                                    5:"Drug", 
                                    6:"Begin Drug Effect Offset",
                                    7:"End Drug Effect Offset"},inplace = True)

df_adverse_effect.head()

| | PubMed-ID | Sentence | Adverse-Effect | Begin Effect Offset | End Effect Offset | Drug | Begin Drug Effect Offset | End Drug Effect Offset |
|---|---|---|---|---|---|---|---|---|
| 0 | 10030778 | Intravenous azithromycin-induced ototoxicity. | ototoxicity | 43 | 54 | azithromycin | 22 | 34 |
| 1 | 10048291 | Immobilization, while Paget's bone disease was... | increased calcium-release | 960 | 985 | dihydrotachysterol | 908 | 926 |
| 2 | 10048291 | Unaccountable severe hypercalcemia in a patien... | hypercalcemia | 31 | 44 | dihydrotachysterol | 94 | 112 |
| 3 | 10082597 | METHODS: We report two cases of pseudoporphyri... | pseudoporphyria | 620 | 635 | naproxen | 646 | 654 |
| 4 | 10082597 | METHODS: We report two cases of pseudoporphyri... | pseudoporphyria | 620 | 635 | oxaprozin | 659 | 668 |


In [4]:
df_adverse_effect.shape

(6821, 8)

In [5]:
df_adverse_effect.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6821 entries, 0 to 6820
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   PubMed-ID                 6821 non-null   int64 
 1   Sentence                  6821 non-null   object
 2   Adverse-Effect            6821 non-null   object
 3   Begin Effect Offset       6821 non-null   int64 
 4   End Effect Offset         6821 non-null   int64 
 5   Drug                      6821 non-null   object
 6   Begin Drug Effect Offset  6821 non-null   int64 
 7   End Drug Effect Offset    6821 non-null   int64 
dtypes: int64(5), object(3)
memory usage: 426.4+ KB


### Drugs and Dosages Data

In [6]:
# Read DRUG-DOSE.rel file
# Provides relations between drugs and dosages

df_dosage = pd.read_csv("DRUG-DOSE.rel", header=None, delimiter="|")
df_dosage.rename(columns = {0:"PubMed-ID", 
                                    1:"Sentence",
                                    2:"Dose", 
                                    3:"Begin Dose Offset", 
                                    4:"End Dose Offset", 
                                    5:"Drug", 
                                    6:"Begin Drug Dose Offset",
                                    7:"End Drug Dose Offset"},inplace = True)

df_dosage.head()

| | PubMed-ID | Sentence | Dose | Begin Dose Offset | End Dose Offset | Drug | Begin Drug Dose Offset | End Drug Dose Offset |
|---|---|---|---|---|---|---|---|---|
| 0 | 10327035 | An episode of subacute encephalopathy after th... | 1500 mg/m2 | 230 | 240 | methotrexate | 216 | 228 |
| 1 | 10452772 | She continued to receive regular insulin 4 tim... | 4 times per day | 1473 | 1488 | insulin | 1465 | 1472 |
| 2 | 10458196 | A 5-month-old infant became lethargic and poor... | 1 drop | 522 | 528 | brimonidine | 532 | 543 |
| 3 | 10667036 | The presented patient was treated with 200 mg ... | 200 mg | 389 | 395 | TCA | 396 | 399 |
| 4 | 10698143 | Central nervous system manifestations of an ib... | overdose | 64 | 72 | ibuprofen | 54 | 63 |


In [8]:
df_dosage.shape

(279, 8)

In [9]:
df_dosage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279 entries, 0 to 278
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   PubMed-ID               279 non-null    int64 
 1   Sentence                279 non-null    object
 2   Dose                    279 non-null    object
 3   Begin Dose Offset       279 non-null    int64 
 4   End Dose Offset         279 non-null    int64 
 5   Drug                    279 non-null    object
 6   Begin Drug Dose Offset  279 non-null    int64 
 7   End Drug Dose Offset    279 non-null    int64 
dtypes: int64(5), object(3)
memory usage: 17.6+ KB


### Drugs and No Adverse Effects

In [10]:
df_neg = pd.read_csv("ADE-NEG.txt", header=None, delimiter="|")
df_neg.head()

| | 0 |
|---|---|
| 0 | 6460590 NEG Clioquinol intoxication occurring ... |
| 1 | 8600337 NEG "Retinoic acid syndrome" was preve... |
| 2 | 8402502 NEG BACKGROUND: External beam radiatio... |
| 3 | 8700794 NEG Although the enuresis ceased, she ... |
| 4 | 17662448 NEG A 42-year-old woman had uneventfu... |


In [11]:
df_neg["PubMed-ID"] = df_neg[0].str.split(' ').str[0]
df_neg['Sentence'] = df_neg[0].str.split(n=1).str[1]
df_neg.drop(columns=df_neg.columns[0], axis=1, inplace=True)
df_neg

| | PubMed-ID | Sentence |
|---|---|---|
| 0 | 6460590 | NEG Clioquinol intoxication occurring in the t... |
| 1 | 8600337 | NEG "Retinoic acid syndrome" was prevented wit... |
| 2 | 8402502 | NEG BACKGROUND: External beam radiation therap... |
| 3 | 8700794 | NEG Although the enuresis ceased, she develope... |
| 4 | 17662448 | NEG A 42-year-old woman had uneventful bilater... |
| ... | ... | ... |
| 16690 | 946400 | NEG At autopsy, the liver was found to be smal... |
| 16691 | 16416684 | NEG Physical exam revealed a patient with apha... |
| 16692 | 7351000 | NEG At the time when the leukemia appeared sev... |
| 16693 | 19769520 | NEG The American Society for Regional Anesthes... |


In [12]:
df_neg.shape

(16695, 2)

In [24]:
df_neg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16695 entries, 0 to 16694
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   PubMed-ID  16695 non-null  object
 1   Sentence   16695 non-null  object
dtypes: object(2)
memory usage: 261.0+ KB
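Note that the Sentence column above still begins with the `NEG` label, because the raw line is split only once. A minimal sketch of a three-way split that separates the ID, the label, and the sentence is shown below. The helper `parse_neg_line` and the sample line are hypothetical; the line only imitates the `<id> NEG <sentence>` layout and is not taken from the corpus.

```python
def parse_neg_line(line):
    """Split '<pubmed id> NEG <sentence>' into its three parts."""
    pubmed_id, label, sentence = line.split(maxsplit=2)
    return {"PubMed-ID": pubmed_id, "Label": label, "Sentence": sentence}


# Hypothetical line in the same layout as ADE-NEG.txt.
line = "1234567 NEG The patient recovered without complications."
row = parse_neg_line(line)
print(row["Sentence"])
# The patient recovered without complications.
```

In pandas, the same idea could likely be expressed on the raw column (before it is dropped) with `df_neg[0].str.split(n=2, expand=True)`, which returns the three parts as separate columns.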


# Tensor

In [17]:
from datasets import load_dataset

In [21]:
ds = load_dataset('ade_corpus_v2', 'Ade_corpus_v2_drug_ade_relation', split='train')



In [22]:
ds

Dataset({
    features: ['text', 'drug', 'effect', 'indexes'],
    num_rows: 6821
})

In [23]:
ds[:2]

{'text': ['Intravenous azithromycin-induced ototoxicity.',
  "Immobilization, while Paget's bone disease was present, and perhaps enhanced activation of dihydrotachysterol by rifampicin, could have led to increased calcium-release into the circulation."],
 'drug': ['azithromycin', 'dihydrotachysterol'],
 'effect': ['ototoxicity', 'increased calcium-release'],
 'indexes': [{'drug': {'start_char': [12], 'end_char': [24]},
   'effect': {'start_char': [33], 'end_char': [44]}},
  {'drug': {'start_char': [91], 'end_char': [109]},
   'effect': {'start_char': [143], 'end_char': [168]}}]}
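The `indexes` field appears to hold character offsets into `text`, which can be sanity-checked by slicing. The sketch below does this for the first record shown above, copied in as a plain dictionary so that it runs without downloading the dataset; `spans_match` is a hypothetical helper name.

```python
record = {
    "text": "Intravenous azithromycin-induced ototoxicity.",
    "drug": "azithromycin",
    "effect": "ototoxicity",
    "indexes": {
        "drug": {"start_char": [12], "end_char": [24]},
        "effect": {"start_char": [33], "end_char": [44]},
    },
}


def spans_match(record):
    """Return True if every offset pair slices out the annotated string."""
    text = record["text"]
    for field in ("drug", "effect"):
        offsets = record["indexes"][field]
        for start, end in zip(offsets["start_char"], offsets["end_char"]):
            if text[start:end] != record[field]:
                return False
    return True


print(spans_match(record))
# True
```

These offsets are relative to the sentence, unlike the offsets in `DRUG-AE.rel` above (43 to 54 for the same effect), which exceed the sentence length and so presumably index into the full abstract.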

In [None]:
nltk.download('stopwords')
stop_words = stopwords.words('english')
stop_words

In [14]:
builder = tfds.builder('huggingface:ade_corpus_v2/Ade_corpus_v2_drug_ade_relation')
builder.download_and_prepare()
ds = builder.as_dataset(split='train')
print(ds)

2023-03-03 16:10:37.860844: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".


Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /home/diana/tensorflow_datasets/ade_corpus_v2/ade_corpus_v2_drug_ade_relation/1.0.0...


To ignore verifications, you can pass `verification_mode='no_checks'` instead.


ValueError: 'full' is not a valid VerificationMode

In [15]:
ds = tfds.load('huggingface:ade_corpus_v2/Ade_corpus_v2_drug_ade_relation')

Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /home/diana/tensorflow_datasets/ade_corpus_v2/ade_corpus_v2_drug_ade_relation/1.0.0...


ValueError: 'full' is not a valid VerificationMode

GitHub repo

TF Hub

Entities: person, place
Relation between them

Recognize drug and effect entities.
What is the relationship between them? Find a dataset.

Example:
    Mark is born

NER
Convert to BIO format:
for each token, is it the beginning of an entity (B), inside an entity (I), or outside any entity (O)?

Run through the BioBERT tokenizer?

String matching?
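As a starting point for the BIO conversion noted above, the sketch below turns character-level drug and effect offsets into token-level BIO tags. It uses a simple regular-expression tokenizer rather than the BioBERT tokenizer, and the label names `DRUG` and `EFFECT` as well as the helper `to_bio` are assumptions made for illustration.

```python
import re


def to_bio(text, spans):
    """Tag tokens with B-/I-/O given (start_char, end_char, label) spans."""
    tokens, tags = [], []
    # Words and single punctuation marks, each with its character offsets.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


text = "Intravenous azithromycin-induced ototoxicity."
spans = [(12, 24, "DRUG"), (33, 44, "EFFECT")]  # offsets from the first record
tokens, tags = to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The resulting word-level tags could then be passed through a function like `tokenize_and_align_labels` once the label names are mapped to integer IDs.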