The Kaggle CLI needs the `kaggle.json` authentication file to be present at $HOME/.kaggle/kaggle.json before it can download the dataset.
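Before running the download cell, it can help to confirm that the credentials file is where the CLI expects it. The snippet below is a minimal sketch; `kaggle_credentials_path` is a hypothetical helper name, not part of the Kaggle CLI, and it only checks the default location mentioned above.

```python
import os
from pathlib import Path

def kaggle_credentials_path(home=None):
    """Build the default location of kaggle.json under the given home directory."""
    base = Path(home) if home is not None else Path(os.path.expanduser("~"))
    return base / ".kaggle" / "kaggle.json"

cred = kaggle_credentials_path()
print(f"{cred} exists: {cred.is_file()}")
```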

In [1]:
!mkdir -p data
!kaggle datasets download -d finalepoch/medical-ner -p data/ --unzip

Downloading medical-ner.zip to data
  0%|                                               | 0.00/26.2k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 26.2k/26.2k [00:00<00:00, 1.12MB/s]


In [3]:
json_file = './data/Corona2.json'

In [4]:
import json

In [37]:
with open(json_file, "r") as f:
    data = json.load(f)
data.keys()

dict_keys(['examples'])

In [38]:
data = data['examples']

In [40]:
sample_record = data[0]

In [42]:
sample_record.keys()

dict_keys(['id', 'content', 'metadata', 'annotations', 'classifications'])

In [52]:
print(f"id :{sample_record['id']}")

id :18c2f619-f102-452f-ab81-d26f7e283ffe


In [53]:
print(f"content :{sample_record['content']}")

content :While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]

Diosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.

Racecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]


In [54]:
print(f"metadata :{sample_record['metadata']}")

metadata :{}


In [56]:
print(f"classifications :{sample_record['classifications']}")

classifications :[]


In [60]:
for annotation in sample_record['annotations']:
    for key in annotation.keys():
        print(f"{key}: {annotation[key]}")
    break

id: 0825a1bf-6a6e-4fa2-be77-8d104701eaed
tag_id: c06bd022-6ded-44a5-8d90-f17685bb85a1
end: 371
start: 360
example_id: 18c2f619-f102-452f-ab81-d26f7e283ffe
tag_name: Medicine
value: Diosmectite
correct: None
human_annotations: [{'timestamp': '2020-03-21T00:24:32.098000Z', 'annotator_id': 1, 'tagged_token_id': '0825a1bf-6a6e-4fa2-be77-8d104701eaed', 'name': 'Ashpat123', 'reason': 'exploration'}]
model_annotations: []


In [62]:
sample_record['content'][360:371]

'Diosmectite'

Reference for diosmectite: https://www.drugs.com/international/diosmectite.html
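The slice above shows that `start` and `end` are character offsets into `content`. A quick consistency check is to compare each annotation's stored `value` with the text at those offsets. The following is only a sketch: `span_mismatches` is a hypothetical helper, and the toy record is made up to mirror the field names seen in the sample record.

```python
def span_mismatches(record):
    """Return annotations whose stored value differs from the text at [start:end]."""
    text = record["content"]
    mismatches = []
    for ann in record["annotations"]:
        span = text[ann["start"]:ann["end"]]
        # Compare after stripping, since spans may carry stray whitespace
        if span.strip() != ann["value"].strip():
            mismatches.append((ann["start"], ann["end"], span, ann["value"]))
    return mismatches

# Tiny made-up record, shaped like the dataset's entries
toy_record = {
    "content": "Loperamide treats diarrhea.",
    "annotations": [
        {"start": 0, "end": 10, "value": "Loperamide"},
        {"start": 18, "end": 26, "value": "diarrhea"},
        {"start": 11, "end": 17, "value": "diarrhea"},  # deliberately wrong offsets
    ],
}
print(span_mismatches(toy_record))
```

On the real data this could be called as `span_mismatches(sample_record)`; an empty list would mean every offset agrees with its stored value.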

## Checking all the annotations in the sample record

In [145]:
text = sample_record['content']

for annotation in sample_record['annotations']:
    start = annotation['start']
    end = annotation['end']
    entity = text[start:end].strip()
    if entity:
        print(f"{entity}, Number of words in entity is {len(entity.split())}, Tag is {annotation['tag_name']}")

Diosmectite, Number of words in entity is 1, Tag is Medicine
aluminomagnesium silicate, Number of words in entity is 2, Tag is Medicine
diarrhea, Number of words in entity is 1, Tag is MedicalCondition
kaopectate, Number of words in entity is 1, Tag is Medicine
bismuth compounds, Number of words in entity is 2, Tag is Medicine
Pepto-Bismol, Number of words in entity is 1, Tag is Medicine
diarrhea, Number of words in entity is 1, Tag is MedicalCondition
chemotherapy, Number of words in entity is 1, Tag is Medicine
constipation, Number of words in entity is 1, Tag is MedicalCondition
loperamide, Number of words in entity is 1, Tag is Medicine
diarrhea, Number of words in entity is 1, Tag is MedicalCondition
flatulence, Number of words in entity is 1, Tag is MedicalCondition
loperamide, Number of words in entity is 1, Tag is Medicine
diarrhea, Number of words in entity is 1, Tag is MedicalCondition
diarrhea, Number of words in entity is 1, Tag is MedicalCondition
Racecadotril, Number of w

The results above show that a group of words may represent a single entity, for example "aluminomagnesium silicate".

### Finding all the unique tags

In [138]:
tags = []
for record in data:
    for annotation in record['annotations']:
        curr_tag = annotation['tag_name']
        if curr_tag not in tags:
            tags.append(curr_tag)
tags

['Medicine', 'MedicalCondition', 'Pathogen']
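Besides the unique tag names, it may be useful to know how often each tag occurs. The sketch below uses `collections.Counter`; `tag_counts` is a hypothetical helper, and the toy records are invented to mirror the structure of `data`, so the printed counts say nothing about the real dataset.

```python
from collections import Counter

def tag_counts(records):
    """Count how many annotations carry each tag_name across all records."""
    return Counter(
        annotation["tag_name"]
        for record in records
        for annotation in record["annotations"]
    )

# Made-up records with the same nesting as the dataset
toy_records = [
    {"annotations": [{"tag_name": "Medicine"}, {"tag_name": "MedicalCondition"}]},
    {"annotations": [{"tag_name": "Medicine"}, {"tag_name": "Pathogen"}]},
    {"annotations": []},
]
toy_counts = tag_counts(toy_records)
print(toy_counts.most_common())
```

On the loaded data this would be `tag_counts(data)`, and `list(tag_counts(data))` gives the unique tags in order of first appearance.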

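### Transforming Data
Target format: one row per word, as (sentence_idx, word, tag).

Tag set (BIO scheme, where B- marks the first word of an entity, I- a following word, and O a word outside any entity):
tags = [B-MED, I-MED, B-MEDCOND, I-MEDCOND, B-PAT, I-PAT, O]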
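One way to get from character-offset annotations to (sentence_idx, word, tag) rows is sketched below. It is an illustration under stated assumptions, not the notebook's final transformation: words are split on whitespace (so punctuation stays attached), a word is tagged if it overlaps an annotated span, and the names `to_bio_rows`, `make_span`, and `TAG_ABBREVIATIONS` are hypothetical.

```python
import re

TAG_ABBREVIATIONS = {"Medicine": "MED", "MedicalCondition": "MEDCOND", "Pathogen": "PAT"}

def to_bio_rows(sentence_idx, text, annotations, tag_map=TAG_ABBREVIATIONS):
    """Turn character-span annotations into (sentence_idx, word, BIO tag) rows."""
    rows = []
    started = set()  # indices of annotations that already received a B- tag
    for match in re.finditer(r"\S+", text):
        label = "O"
        for i, ann in enumerate(annotations):
            # The word overlaps the annotated character span
            if match.start() < ann["end"] and match.end() > ann["start"]:
                prefix = "I" if i in started else "B"
                started.add(i)
                label = f"{prefix}-{tag_map[ann['tag_name']]}"
                break
        rows.append((sentence_idx, match.group(), label))
    return rows

def make_span(text, phrase, tag_name):
    """Build a toy annotation by locating the first occurrence of phrase in text."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "tag_name": tag_name}

toy_text = "Diosmectite, a natural aluminomagnesium silicate clay, treats diarrhea."
toy_annotations = [
    make_span(toy_text, "Diosmectite", "Medicine"),
    make_span(toy_text, "aluminomagnesium silicate", "Medicine"),
    make_span(toy_text, "diarrhea", "MedicalCondition"),
]
toy_rows = to_bio_rows(0, toy_text, toy_annotations)
for row in toy_rows:
    print(row)
```

Overlapping annotations are resolved by taking the first match in list order, which is a simplification; a stricter tokenizer would also separate trailing punctuation from the word.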

In [152]:
tags_dict = {
    "Medicine": "MED",
    "MedicalCondition": "MEDCOND",
    "Pathogen": "PAT"
}

# Look for a malformed annotation whose span starts with citation debris,
# and print the full text of the record that contains it.
for sentence_idx, sentence_dict in enumerate(data):
    sentence = sentence_dict['content']
    for annotation in sentence_dict['annotations']:
        start = annotation['start']
        end = annotation['end']
        word = sentence[start:end].strip()
        if word.startswith("8][5][93] A combined"):
            print(sentence)



The following drugs are considered as DMARDs: methotrexate, hydroxychloroquine, sulfasalazine, leflunomide, TNF-alpha inhibitors (certolizumab, infliximab and etanercept), abatacept, and anakinra. Rituximab and tocilizumab are monoclonal antibodies and are also DMARDs.[8] Use of tocilizumab is associated with a risk of increased cholesterol levels.[87]

Hydroxychloroquine, apart from its low toxicity profile, is considered effective in the moderate RA treatment.[88]

The most commonly used agent is methotrexate with other frequently used agents including sulfasalazine and leflunomide.[8] Leflunomide is effective when used from 6–12 months, with similar effectiveness to methotrexate when used for 2 years.[89] Sulfasalazine also appears to be most effective in the short-term treatment of rheumatoid arthritis.[90] Sodium aurothiomalate (gold) and cyclosporin are less commonly used due to more common adverse effects.[8] However, cyclosporin was found to be effective in the progressive RA w

## Downloading MACCROBAT_DATA

In [168]:
MACCROBAT_DATA_URL = "https://figshare.com/ndownloader/files/17493650"

In [169]:
import requests

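def download_url(url, save_path, chunk_size=128):
    """Stream the file at `url` to `save_path` in chunks of `chunk_size` bytes."""
    with requests.get(url, stream=True) as r:
        r.raise_for_status()  # fail early instead of saving an HTTP error page
        with open(save_path, 'wb') as fd:
            for chunk in r.iter_content(chunk_size=chunk_size):
                fd.write(chunk)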

In [170]:
download_url(MACCROBAT_DATA_URL, "./data/MACCROBAT2018.zip")
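After a download like this, it is worth confirming that the saved file really is a ZIP archive and not, say, an HTML error page. The sketch below uses only the standard-library `zipfile` module; `list_zip_members` is a hypothetical helper, and the demo runs on a throwaway archive with an invented member name, so it says nothing about the actual contents of the MACCROBAT archive.

```python
import tempfile
import zipfile
from pathlib import Path

def list_zip_members(zip_path):
    """Return member names if zip_path is a readable ZIP archive, else None."""
    if not zipfile.is_zipfile(zip_path):
        return None
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()

# Self-contained demo: one valid archive and one file that is not a ZIP
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("example.txt", "hello")
    demo_members = list_zip_members(demo_zip)

    not_a_zip = Path(tmp) / "page.html"
    not_a_zip.write_text("<html>error</html>")
    demo_invalid = list_zip_members(not_a_zip)

print(demo_members, demo_invalid)
```

In the notebook this could be called as `list_zip_members("./data/MACCROBAT2018.zip")`; a `None` result would suggest the download did not produce a valid archive.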