### Overview
In this step, you will download a healthcare dataset to use in this workshop.
* The next block uses the Hugging Face Hub library (`huggingface_hub`) to download an open-source clinical note dataset, which will later be vectorized.
* This dataset contains a large collection of annotated clinical notes that will help our model learn to extract relevant information.
* Hugging Face is an open-source AI company that provides a wide range of NLP tools, language models, and datasets that developers can integrate into their own projects.


In [None]:
##############################################################
## This code is for academic and educational purposes only. ##
## Event: Global Summit 2024 Maryland USA                   ##
## InterSystems Corporation 2024 (C)                        ##
## Date: June 9th 2024                                      ##
##############################################################


## Use the huggingface_hub library to download the data
## (only snapshot_download is needed in this notebook)
from huggingface_hub import snapshot_download

#####
## Here is our dataset "tag"; tags use the <account>/<dataset-name> format
## Here is the direct link it points to:
# https://huggingface.co/datasets/AGBonnet/augmented-clinical-notes
healthcare_dataset = 'AGBonnet/augmented-clinical-notes'
#####

#####
# Here the snapshot download function downloads a copy of the data to a local directory
directory = snapshot_download(repo_id=healthcare_dataset, local_dir='./data/healthcare', repo_type="dataset")
#####
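
The load step later in the notebook depends on a specific file name inside the download directory, so it can help to check what actually landed on disk. The following is a minimal sketch, not part of the original workshop code: `list_data_files` is a hypothetical helper, and the demo runs against a throwaway directory so it works without the real download.

```python
import tempfile
from pathlib import Path


def list_data_files(directory, pattern="*.jsonl"):
    """Return sorted (relative path, size in bytes) pairs for matching files."""
    root = Path(directory)
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob(pattern)
        if p.is_file()
    )


# Demo on a temporary directory; the file names here are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.jsonl").write_text('{"note": "sample"}\n', encoding="utf-8")
    (Path(tmp) / "README.md").write_text("not a data file", encoding="utf-8")
    demo_files = list_data_files(tmp)
    print(demo_files)
```

In the notebook itself, calling `list_data_files(directory)` with the path returned by `snapshot_download` should list the `.jsonl` files available for the next step.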


In [None]:
#####
## Hugging Face provides a "datasets" library that makes loading datasets straightforward
## We'll use its load_dataset function
from datasets import load_dataset
#####

#### To load a dataset saved on disk, pass load_dataset the file format and the file path
## Here, we load the downloaded JSON Lines file with the "json" loader.
healthcare_notes = load_dataset("json", data_files='./data/healthcare/augmented_notes_30K.jsonl')
##
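
A JSON Lines file stores one JSON object per line, which is why the `"json"` loader can read it row by row. As a rough sketch of that format, the snippet below parses the first record of a file using only the standard library. `peek_jsonl` is a hypothetical helper, and the demo file and its `id` field are made up for illustration; the real dataset's columns may differ.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path


def peek_jsonl(path, n=1):
    """Parse and return the first n non-blank records of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        non_blank = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(non_blank, n)]


# Demo on a tiny stand-in file; the records and the "id" field are made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text(
        '{"id": 1, "note": "first placeholder note"}\n'
        '{"id": 2, "note": "second placeholder note"}\n',
        encoding="utf-8",
    )
    first_records = peek_jsonl(demo_path)
    print(sorted(first_records[0].keys()))
```

Pointing `peek_jsonl` at `./data/healthcare/augmented_notes_30K.jsonl` would show which fields each row carries without loading all 30,000 rows.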

Run the next block of code to trim this 30,000-row dataset down to 1,000 rows for this exercise. Only the `note` field of each row is kept, since the workshop works with the clinical notes themselves.

In [2]:
##### The full dataset has 30,000 rows. To see what it looks like, uncomment one of the lines at the bottom of this block.
## >> That's a lot of rows for this exercise.
## >> Let's trim it down to just the first 1,000 clinical notes:

notes = [{'note': note} for note in healthcare_notes['train']['note'][:1000]]

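#####
## >> Uncomment the first line to display the trimmed notes,
## >> or the second to display the first row of the full dataset.
# notes
# healthcare_notes['train'][0]
#####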
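
The trimming step does two things at once: it keeps only the first N rows and it keeps only the `note` field. The sketch below shows the same idea on plain Python dictionaries. `take_notes` is a hypothetical helper and the sample records are invented, so treat it as an illustration rather than a replacement for the cell above.

```python
from itertools import islice


def take_notes(records, limit):
    """Keep only the 'note' field from the first `limit` records."""
    return [{"note": record["note"]} for record in islice(records, limit)]


# Invented sample rows with an extra field that should be dropped.
sample = [{"note": f"note {i}", "other": i} for i in range(5)]
trimmed = take_notes(sample, 3)
print(trimmed)
```

Because `islice` stops at the end of the input, asking for more rows than exist simply returns every row instead of raising an error.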

Finally, run the next block of code to write the trimmed collection of data into a new JSON Lines file for use in the exercise.

In [3]:
import jsonlines

with jsonlines.open('./data/healthcare/augmented_notes_1000.jsonl', mode='w') as writer:
    writer.write_all(notes)
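
If the `jsonlines` package is not available, a similar file can be produced with the standard `json` module, and reading it back is a quick way to confirm the row count. This is a minimal sketch under that assumption: `write_jsonl` and `read_jsonl` are hypothetical helpers, and the demo writes to a temporary directory with placeholder notes rather than the workshop data.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of records, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round trip on placeholder notes to confirm nothing is lost.
demo_notes = [{"note": f"placeholder note {i}"} for i in range(3)]
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "demo_notes.jsonl"
    write_jsonl(out_path, demo_notes)
    round_trip = read_jsonl(out_path)
    print(len(round_trip))
```

Running `len(read_jsonl('./data/healthcare/augmented_notes_1000.jsonl'))` after the cell above should report 1,000 if the write succeeded.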
