### This notebook walks through downloading data from Hugging Face.

In [6]:
!unzip rinsepubmed.zip

Archive:  rinsepubmed.zip
  inflating: rinsepubmed/dataset_infos.json  
  inflating: rinsepubmed/rinsepubmed.py  


In [7]:
!pip install datasets 

The datasets package loads our custom rinsepubmed script. This lets us download the data in batches rather than pulling the full 79 GB at once.

In [9]:
from datasets import load_dataset
dataset = load_dataset(path="rinsepubmed", streaming=True)
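With streaming=True, records are produced lazily as the dataset is iterated, so it is cheap to look at a few records before running the full parse. The sketch below shows one way to do that with itertools.islice. The peek helper and the fake_stream generator are hypothetical stand-ins so the example runs without the real data; they are not part of the datasets package.

```python
from itertools import islice


def peek(stream, n=3):
    """Return the first n items of an iterable without reading the rest."""
    return list(islice(stream, n))


def fake_stream():
    """Hypothetical stand-in for dataset['train']; yields nested records lazily."""
    for pmid in range(1, 1_000_000):
        yield {"MedlineCitation": {"PMID": pmid}}


# Only two records are generated, even though the stream is very long
first_records = peek(fake_stream(), 2)
print(first_records)
```

In the notebook, the same helper could be called as peek(dataset['train']) to inspect the record structure before writing the parser.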

Parsing the XML data

In [10]:
def data_gen():
    """Yield one flat dict per citation, skipping articles that are not in English."""
    for i in dataset['train']:
        data_row = {}
        medlinecitation = i.get('MedlineCitation')
        if medlinecitation:
            data_row['PMID'] = medlinecitation.get('PMID')
            # Build a day/month/year string; keep the empty value when the date is missing
            dc = medlinecitation.get('DateCompleted')
            if dc:
                y = dc.get('Year')
                m = dc.get('Month')
                d = dc.get('Day')
                data_row['date_completed'] = f'{d}/{m}/{y}'
            else:
                data_row['date_completed'] = dc
            data_row['NumberOfReferences'] = medlinecitation.get('NumberOfReferences')
            article = medlinecitation.get('Article')
            if article:
                # Skip records whose article language is not English
                language = article.get('Language')
                if language != 'eng':
                    continue
                abstract = article.get('Abstract')
                if abstract:
                    data_row['AbstractText'] = abstract.get('AbstractText')
                data_row['ArticleTitle'] = article.get('ArticleTitle')
                data_row['AuthorList'] = article.get('AuthorList')
            # MeSH descriptors and qualifiers, when present
            mesh_heading_list = medlinecitation.get('MeshHeadingList')
            if mesh_heading_list:
                mesh_heading = mesh_heading_list.get('MeshHeading')
                if mesh_heading:
                    data_row['DescriptorName'] = mesh_heading.get('DescriptorName')
                    data_row['QualifierName'] = mesh_heading.get('QualifierName')
            yield data_row
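
To make the flattening concrete, here is a minimal sketch that applies the same chained .get() pattern to a single made-up record. The sample_record contents and the flatten_citation helper are hypothetical and only mirror the field names used in data_gen; real records may carry more fields or different value types.

```python
def flatten_citation(record):
    """Flatten one nested record into a flat dict, using the same .get() pattern as data_gen."""
    row = {}
    citation = record.get("MedlineCitation") or {}
    row["PMID"] = citation.get("PMID")
    dc = citation.get("DateCompleted")
    # Missing dates stay as None, present dates become day/month/year strings
    row["date_completed"] = f"{dc.get('Day')}/{dc.get('Month')}/{dc.get('Year')}" if dc else None
    article = citation.get("Article") or {}
    row["ArticleTitle"] = article.get("ArticleTitle")
    return row


# Hypothetical record, invented for illustration only
sample_record = {
    "MedlineCitation": {
        "PMID": 12345,
        "DateCompleted": {"Year": 1999, "Month": 3, "Day": 15},
        "Article": {"Language": "eng", "ArticleTitle": "An example title"},
    }
}

flat = flatten_citation(sample_record)
print(flat)
```

Because every lookup goes through .get(), a record with a missing section produces None values instead of raising a KeyError.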


Note: Each batch covers 10 files, and the rinsepubmed script must be edited to select the batch to extract. As an example, we extract the first batch here, 800-810.
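
The batch bookkeeping can be sketched as below. This assumes that a range such as 800-810 is end-exclusive (files 800 through 809), which would match the output file name used at the end of this notebook; that reading is an assumption, and batch_bounds and output_name are hypothetical helpers, not part of the rinsepubmed script.

```python
def batch_bounds(first_file, last_file, batch_size=10):
    """Yield (start, stop) file-index pairs, with stop exclusive."""
    for start in range(first_file, last_file + 1, batch_size):
        yield start, min(start + batch_size, last_file + 1)


def output_name(start, stop):
    """Build a CSV name from an end-exclusive (start, stop) pair."""
    return f"processed_batch_{start}_{stop - 1}.csv"


# Three batches of 10 files each, assuming files 800 through 829 are wanted
batches = list(batch_bounds(800, 829))
names = [output_name(start, stop) for start, stop in batches]
print(batches)
print(names)
```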

In [11]:
# Process the first batch (800-810)
processed_data = list(data_gen())

In [6]:
from collections import Counter
s = Counter([i['PMID'] for i in processed_data])
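
The Counter above can be used to see which PMIDs appear more than once. A minimal sketch on made-up rows follows; sample_rows is hypothetical data, not output from the real batch.

```python
from collections import Counter

# Hypothetical rows standing in for processed_data
sample_rows = [{"PMID": 1}, {"PMID": 2}, {"PMID": 1}, {"PMID": 3}]

counts = Counter(row["PMID"] for row in sample_rows)

# Keep only the PMIDs that occur more than once
duplicated = {pmid: n for pmid, n in counts.items() if n > 1}
print(duplicated)
```

If duplicated is empty for a real batch, the later deduplication step leaves the data unchanged.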

In [12]:
import pandas as pd
data = pd.DataFrame(processed_data)

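Replace the placeholder date 0/0/0 with 01/01/1777 so that the column can be converted to datetime. The placeholder keeps the same day/month/year order as the other values.

In [21]:
data['date_completed'] = data['date_completed'].replace('0/0/0', '01/01/1777')

In [23]:
# Dates were built as day/month/year in data_gen, so parse them with an explicit format
data['date_completed'] = pd.to_datetime(data['date_completed'], format='%d/%m/%Y')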
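
The conversion can be checked on a small made-up Series. This sketch assumes the strings are in day/month/year order, as built in data_gen, and therefore writes the placeholder in that same order and passes an explicit format to pd.to_datetime; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical values: a real date, the 0/0/0 placeholder, and a missing date
raw = pd.Series(["15/3/1999", "0/0/0", None])

# Swap the placeholder for a valid date in the same day/month/year order
cleaned = raw.replace("0/0/0", "01/01/1777")

# An explicit format avoids day/month ambiguity; missing values become NaT
parsed = pd.to_datetime(cleaned, format="%d/%m/%Y")
print(parsed)
```

The year 1777 falls inside the range that pandas nanosecond timestamps can represent, so the placeholder converts without an out-of-bounds error.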

Here we keep only the most recent record for each PMID, so that each PMID maps to a single paper.

In [27]:
data_final = data.sort_values('date_completed',ascending=False).groupby('PMID').head(1)
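
The effect of sorting and then taking head(1) per group can be seen on a tiny made-up frame. The sample data below is hypothetical; the drop_duplicates variant is shown only as an equivalent alternative, not as what this notebook uses.

```python
import pandas as pd

# Hypothetical frame: PMID 1 appears twice with different completion dates
sample = pd.DataFrame({
    "PMID": [1, 1, 2],
    "date_completed": pd.to_datetime(["2001-01-01", "2003-05-05", "1999-09-09"]),
    "ArticleTitle": ["older copy", "newer copy", "only copy"],
})

# Newest rows first, then keep the first row seen for each PMID
by_date = sample.sort_values("date_completed", ascending=False)
latest = by_date.groupby("PMID").head(1)

# Alternative with the same result on this sample
alt = by_date.drop_duplicates("PMID", keep="first")
print(latest)
```

Because sort_values places missing dates last by default, a row with a known date is preferred over a row without one for the same PMID.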

In [28]:
data_final.to_csv('processed_batch_800_809.csv',index=False)