The kaggle.json authentication file must be present at $HOME/.kaggle/kaggle.json to download the dataset with the Kaggle CLI.
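
A quick pre-flight check for the credentials file could look like the sketch below. The helper names `kaggle_credentials_path` and `has_kaggle_credentials` are hypothetical, and the check only covers the default location named above.

```python
from pathlib import Path

def kaggle_credentials_path(home=None):
    """Return the default kaggle.json location under the given home directory."""
    home = Path(home) if home is not None else Path.home()
    return home / ".kaggle" / "kaggle.json"

def has_kaggle_credentials(home=None):
    """True if kaggle.json exists at the default location."""
    return kaggle_credentials_path(home).is_file()

if not has_kaggle_credentials():
    print(f"Missing credentials file: {kaggle_credentials_path()}")
```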

In [1]:
!mkdir -p data
!kaggle datasets download -d finalepoch/medical-ner -p data/ --unzip

Downloading medical-ner.zip to data
  0%|                                               | 0.00/26.2k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 26.2k/26.2k [00:00<00:00, 1.12MB/s]


In [3]:
json_file = './data/Corona2.json'

In [4]:
import json

In [37]:
with open(json_file, "r") as f:
    data = json.load(f)
data.keys()

dict_keys(['examples'])

In [38]:
data = data['examples']

In [40]:
sample_record = data[0]

In [42]:
sample_record.keys()

dict_keys(['id', 'content', 'metadata', 'annotations', 'classifications'])

In [52]:
print(f"id :{sample_record['id']}")

id :18c2f619-f102-452f-ab81-d26f7e283ffe


In [53]:
print(f"content :{sample_record['content']}")

content :While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]

Diosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.

Racecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]


In [54]:
print(f"metadata :{sample_record['metadata']}")

metadata :{}


In [56]:
print(f"classifications :{sample_record['classifications']}")

classifications :[]


In [60]:
for annotation in sample_record['annotations']:
    for key in annotation.keys():
        print(f"{key}: {annotation[key]}")
    break

id: 0825a1bf-6a6e-4fa2-be77-8d104701eaed
tag_id: c06bd022-6ded-44a5-8d90-f17685bb85a1
end: 371
start: 360
example_id: 18c2f619-f102-452f-ab81-d26f7e283ffe
tag_name: Medicine
value: Diosmectite
correct: None
human_annotations: [{'timestamp': '2020-03-21T00:24:32.098000Z', 'annotator_id': 1, 'tagged_token_id': '0825a1bf-6a6e-4fa2-be77-8d104701eaed', 'name': 'Ashpat123', 'reason': 'exploration'}]
model_annotations: []


In [62]:
sample_record['content'][360:371]

'Diosmectite'

https://www.drugs.com/international/diosmectite.html

## Checking all the annotations mentioned in the sample record

In [145]:
text = sample_record['content']

for annotation in sample_record['annotations']:
    start = annotation['start']
    end = annotation['end']
    entity = text[start:end].strip()
    if entity:
        print(f"{entity}, Number of words in entity is {len(entity.split())}, Tag is {annotation['tag_name']}")

Diosmectite, Number of words in entity is 1, Tag is Medicine
aluminomagnesium silicate, Number of words in entity is 2, Tag is Medicine
diarrhea, Number of words in entity is 1, Tag is MedicalCondition
kaopectate, Number of words in entity is 1, Tag is Medicine
bismuth compounds, Number of words in entity is 2, Tag is Medicine
Pepto-Bismol, Number of words in entity is 1, Tag is Medicine
diarrhea, Number of words in entity is 1, Tag is MedicalCondition
chemotherapy, Number of words in entity is 1, Tag is Medicine
constipation, Number of words in entity is 1, Tag is MedicalCondition
loperamide, Number of words in entity is 1, Tag is Medicine
diarrhea, Number of words in entity is 1, Tag is MedicalCondition
flatulence, Number of words in entity is 1, Tag is MedicalCondition
loperamide, Number of words in entity is 1, Tag is Medicine
diarrhea, Number of words in entity is 1, Tag is MedicalCondition
diarrhea, Number of words in entity is 1, Tag is MedicalCondition
Racecadotril, Number of w

The results above show that a group of words may represent a single entity, for example "aluminomagnesium silicate".

### Finding all the unique tags

In [138]:
tags = []
for record in data:
    for annotation in record['annotations']:
        curr_tag = annotation['tag_name']
        if curr_tag not in tags:
            tags.append(curr_tag)
tags

['Medicine', 'MedicalCondition', 'Pathogen']
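
As a possible alternative to the loop above, `collections.Counter` yields the same unique tags in first-seen order and also reports how often each tag occurs. This is a minimal sketch; `toy_data` is a hypothetical stand-in for `data` that mirrors only the fields used here.

```python
from collections import Counter

# Hypothetical stand-in for `data`: each record holds a list of annotations
toy_data = [
    {"annotations": [{"tag_name": "Medicine"}, {"tag_name": "MedicalCondition"}]},
    {"annotations": [{"tag_name": "Pathogen"}, {"tag_name": "Medicine"}]},
]

# Count every tag across all records
tag_counts = Counter(
    annotation["tag_name"]
    for record in toy_data
    for annotation in record["annotations"]
)

# Counter keeps keys in insertion order, so this matches first-seen order
unique_tags = list(tag_counts)
print(unique_tags)
print(tag_counts)
```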

### Transforming Data
Target row format: (sentence_idx, word, tag)

BIO tags = [B-MED, I-MED, B-MEDCOND, I-MEDCOND, B-PAT, I-PAT, O]
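
One way to produce rows in this format is sketched below. It assumes whitespace tokenization (so punctuation stays attached to words) and treats a token as part of an entity when its character span overlaps the annotation span. The helpers `spans_to_bio` and `make_annotation` and the toy sentence are hypothetical and are not part of the dataset or the notebook.

```python
import re

TAGS_DICT = {"Medicine": "MED", "MedicalCondition": "MEDCOND", "Pathogen": "PAT"}

def spans_to_bio(text, annotations, tags_dict=TAGS_DICT):
    """Return (word, tag) pairs for one record using B-/I-/O labels."""
    # Whitespace tokens together with their character offsets
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    labels = ["O"] * len(tokens)
    for ann in sorted(annotations, key=lambda a: a["start"]):
        short = tags_dict[ann["tag_name"]]
        first = True
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            overlaps = tok_start < ann["end"] and tok_end > ann["start"]
            # Tokens already labelled by an earlier annotation are left alone
            if overlaps and labels[i] == "O":
                labels[i] = ("B-" if first else "I-") + short
                first = False
    return [(tok, label) for (tok, _, _), label in zip(tokens, labels)]

def make_annotation(text, phrase, tag_name):
    """Build a toy annotation dict by locating `phrase` in `text`."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "tag_name": tag_name}

toy_text = "Diosmectite, a natural aluminomagnesium silicate clay, treats diarrhea."
toy_annotations = [
    make_annotation(toy_text, "Diosmectite", "Medicine"),
    make_annotation(toy_text, "aluminomagnesium silicate", "Medicine"),
    make_annotation(toy_text, "diarrhea", "MedicalCondition"),
]

sentence_idx = 0
rows = [(sentence_idx, word, tag) for word, tag in spans_to_bio(toy_text, toy_annotations)]
for row in rows:
    print(row)
```

A real pipeline would likely use a proper tokenizer so that trailing punctuation is split from the entity words.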

In [152]:
tags_dict = {
    "Medicine": "MED",
    "MedicalCondition": "MEDCOND",
    "Pathogen": "PAT"
}

# Debugging pass: find the record containing an apparently malformed
# annotation whose span starts with citation residue instead of an entity.
for sentence_idx, sentence_dict in enumerate(data):
    sentence = sentence_dict['content']
    for annotation in sentence_dict['annotations']:
        start = annotation['start']
        end = annotation['end']
        word = sentence[start:end].strip()
        if word.startswith("8][5][93] A combined"):
            print(sentence)



The following drugs are considered as DMARDs: methotrexate, hydroxychloroquine, sulfasalazine, leflunomide, TNF-alpha inhibitors (certolizumab, infliximab and etanercept), abatacept, and anakinra. Rituximab and tocilizumab are monoclonal antibodies and are also DMARDs.[8] Use of tocilizumab is associated with a risk of increased cholesterol levels.[87]

Hydroxychloroquine, apart from its low toxicity profile, is considered effective in the moderate RA treatment.[88]

The most commonly used agent is methotrexate with other frequently used agents including sulfasalazine and leflunomide.[8] Leflunomide is effective when used from 6–12 months, with similar effectiveness to methotrexate when used for 2 years.[89] Sulfasalazine also appears to be most effective in the short-term treatment of rheumatoid arthritis.[90] Sodium aurothiomalate (gold) and cyclosporin are less commonly used due to more common adverse effects.[8] However, cyclosporin was found to be effective in the progressive RA w
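
The cell above searches for one suspicious span by its literal text. A more general check is sketched below, assuming every annotation carries `start`, `end`, and `value` as in the sample record. The function name `find_suspicious_annotations`, the `max_words` threshold, and the toy record are hypothetical.

```python
def find_suspicious_annotations(record, max_words=5):
    """Flag spans that disagree with `value` or are implausibly long."""
    text = record["content"]
    flagged = []
    for ann in record["annotations"]:
        span_text = text[ann["start"]:ann["end"]].strip()
        mismatch = span_text != str(ann.get("value", "")).strip()
        too_long = len(span_text.split()) > max_words
        if mismatch or too_long:
            flagged.append((ann["start"], ann["end"], span_text))
    return flagged

# Toy record: one clean annotation and one that swallows citation residue
toy_content = "Loperamide reduces stools.[8] A combined therapy of zinc and rehydration salts helps."
good_start = toy_content.index("Loperamide")
bad_start = toy_content.index("8] A combined")
toy_record = {
    "content": toy_content,
    "annotations": [
        {"start": good_start, "end": good_start + len("Loperamide"), "value": "Loperamide"},
        {"start": bad_start, "end": len(toy_content), "value": toy_content[bad_start:]},
    ],
}

flagged = find_suspicious_annotations(toy_record)
for start, end, span_text in flagged:
    print(start, end, span_text)
```

The word-count threshold is a heuristic; it would need tuning against the real annotations before being used to drop anything.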

## Downloading MACCROBAT_DATA

In [210]:
MACCROBAT_DATA_URL = "https://figshare.com/ndownloader/files/17493650"

In [211]:
import requests

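def download_url(url, save_path, chunk_size=128):
    """Stream the file at `url` to `save_path` in chunks."""
    r = requests.get(url, stream=True)
    r.raise_for_status()  # fail early instead of saving an HTTP error page
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)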

In [212]:
data_zipfile = "./data/MACCROBAT2018.zip"
MACCROBAT_data_dir = "./data/MACCROBAT"

In [213]:
download_url(MACCROBAT_DATA_URL, data_zipfile)

In [214]:
import zipfile
with zipfile.ZipFile(data_zipfile, 'r') as zip_ref:
    zip_ref.extractall(MACCROBAT_data_dir)

In [215]:
import os
os.remove(data_zipfile)

In [216]:
os.listdir(MACCROBAT_data_dir)

['19860925.txt',
 '26361640.ann',
 '26228535.txt',
 '27773410.txt',
 '23678274.ann',
 '25853982.ann',
 '28103924.txt',
 '27064109.txt',
 '28154700.ann',
 '20146086.txt',
 '26656340.txt',
 '28353558.ann',
 '22515939.txt',
 '28353588.txt',
 '26309459.txt',
 '28272235.txt',
 '23242090.txt',
 '23312850.ann',
 '23124805.txt',
 '26106249.txt',
 '26313770.ann',
 '26285706.ann',
 '18416479.txt',
 '28353613.ann',
 '28151916.ann',
 '26175648.txt',
 '23468586.ann',
 '28216610.txt',
 '27059701.ann',
 '28121940.ann',
 '23077697.txt',
 '27741115.ann',
 '21067996.txt',
 '28100235.txt',
 '28151860.ann',
 '25884600.txt',
 '27904130.ann',
 '19214295.txt',
 '18787726.ann',
 '22719160.txt',
 '28422883.txt',
 '26675562.ann',
 '21477357.ann',
 '25139918.txt',
 '28353561.txt',
 '22791498.txt',
 '28538413.ann',
 '26457578.ann',
 '27842605.txt',
 '20671919.ann',
 '25155594.ann',
 '26469535.ann',
 '28353604.ann',
 '28403092.txt',
 '28239141.txt',
 '28202869.txt',
 '25024632.txt',
 '28403086.txt',
 '18666334.ann

In [217]:
len(os.listdir(MACCROBAT_data_dir))

400

### Checking that every text file has an annotation file

In [218]:
txt_file_ids = []
missing = []
files = os.listdir(MACCROBAT_data_dir)
for file in files:
    if file.endswith('.txt'):
        txt_file_ids.append(file.split(".")[0])
        if file.split(".")[0] + ".ann" not in files:
            missing.append(file)
print(f"There are {len(set(txt_file_ids))} unique txt files")

if not missing:
    print("All txt files have corresponding ann files")
else:
    print("The following txt files are missing ann files:")
    print(",".join(missing))

There are 200 unique txt files
All txt files have corresponding ann files
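
The check above only looks for txt files without an ann file. A symmetric variant using set differences is sketched here; `unpaired_files` is a hypothetical helper and the file names are toy values, not taken from the dataset.

```python
from pathlib import Path

def unpaired_files(filenames):
    """Return (txt ids without .ann, ann ids without .txt), both sorted."""
    txt_ids = {Path(name).stem for name in filenames if name.endswith(".txt")}
    ann_ids = {Path(name).stem for name in filenames if name.endswith(".ann")}
    return sorted(txt_ids - ann_ids), sorted(ann_ids - txt_ids)

# Toy listing: "2" lacks an annotation file, "3" lacks a text file
toy_files = ["1.txt", "1.ann", "2.txt", "3.ann"]
missing_ann, missing_txt = unpaired_files(toy_files)
print(f"txt without ann: {missing_ann}")
print(f"ann without txt: {missing_txt}")
```

On the real directory this could be called as `unpaired_files(os.listdir(MACCROBAT_data_dir))`.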


In [220]:
sample_record = files[0].split(".")[0]
sample_record

'19860925'

In [229]:
with open(f"{MACCROBAT_data_dir}/{sample_record}.txt") as f:
    text = f.read()

In [230]:
text

'Our 24-year-old non-smoking male patient presented with repeated hemoptysis in May 2008 with 4 days of concomitant right thoracic pain which intensified while breathing.\nDuring holidays in his home country, this Cuban patient suffered from a cold with fever and a strong cough.\nThe strong dry cough persisted after recovery from the cold.\nThe patient did not report any loss of weight.\nThe initial CT scan of the thorax showed a 12 × 4 cm solid mass paravertebral right in the lower thorax without any signs of metastases (Figure 1).\nThe bronchoscopy (Figure \u200b2) with non-bleeding biopsy revealed a mass of the lower right bronchus which histologically and immunohistologically provided evidence of a granular cell or Abrikossoff tumor [1].\nThe bronchial lavage which followed was negative for malignant cells.\nThe patient was discharged and surgical intervention was planned.\nFour days after discharge a spontaneous hemothorax developed.\nThe patient needed to be readmitted and the he

In [237]:
annotation = []
with open(f"{MACCROBAT_data_dir}/{sample_record}.ann") as f:
    for line in f.readlines():
        annotation.append(line.split("\t"))
        break
annotation

[['T1', 'Age 4 15', '24-year-old\n']]
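
The first line shows the tab-separated layout of a text-bound annotation: an id, then the entity type with its character offsets, then the covered text. A minimal parsing sketch follows, assuming the files use the brat standoff format, where discontinuous spans are written as `start end;start end`. The function `parse_ann_line` is hypothetical, and lines that are not text-bound (ids not starting with `T`) are skipped.

```python
def parse_ann_line(line):
    """Parse one text-bound ('T') annotation line; return None for other lines."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3 or not parts[0].startswith("T"):
        return None
    ann_id, type_and_span, surface = parts
    label, span_str = type_and_span.split(" ", 1)
    # Each fragment is "start end"; several fragments are separated by ";"
    spans = [tuple(int(x) for x in frag.split()) for frag in span_str.split(";")]
    return {"id": ann_id, "label": label, "spans": spans, "text": surface}

parsed = parse_ann_line("T1\tAge 4 15\t24-year-old\n")
print(parsed)

# The offsets index into the matching .txt content
sample_text = "Our 24-year-old non-smoking male patient"
start, end = parsed["spans"][0]
print(sample_text[start:end])
```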