### Overview
In this step, you will download a financial dataset for use in this workshop.
* The next block uses the Hugging Face Hub library to download an open-source dataset of tweets about financial topics, which will later be vectorized.
* Hugging Face is a popular open-source AI company that provides a wide range of NLP tools and datasets, making it easy for developers to integrate language models and datasets into their projects.
* Read the comments in the code snippets throughout this workshop for a deeper understanding of what each block of code does.


In [1]:
##############################################################
## This code is for academic and educational purposes only. ##
## Event: Global Summit 2024 Maryland USA                   ##
## InterSystems Corporation 2024 (C)                        ##
## Date: June 9th 2024                                      ##
##############################################################


## Use the Hugging Face Hub library to download the dataset files
## (only snapshot_download is needed in this notebook)
from huggingface_hub import snapshot_download

#####
## Here is our dataset "tag". Tags use the <account>/<dataset-name> format.
## This tag points to:
## https://huggingface.co/datasets/TimKoornstra/financial-tweets-sentiment
## Each tweet is labeled with a sentiment value: '1' is positive, '2' is negative, and '0' is neutral.
financial_dataset = 'TimKoornstra/financial-tweets-sentiment'
#####

# Do the download
directory = snapshot_download(repo_id=financial_dataset, local_dir='./data/financial', repo_type="dataset")

Fetching 3 files: 100%|██████████| 3/3 [00:00<00:00, 17.91it/s]
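
The download above fetched three files, and the next step needs the path of the Parquet file among them. As a minimal sketch (not part of the original workshop), the helper below lists every `.parquet` file under a folder using only the standard library. The name `find_parquet_files` and the temporary folder layout are hypothetical, built here only so the example runs without a download; only the Parquet file name is taken from this notebook.

```python
import tempfile
from pathlib import Path


def find_parquet_files(root):
    """Return sorted POSIX-style relative paths of all .parquet files under root."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.parquet"))


# Hypothetical stand-in for the download folder, created in a temp directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    data_dir = Path(tmp_dir) / "data"
    data_dir.mkdir()
    (data_dir / "train-00000-of-00001.parquet").touch()  # file name used in this notebook
    (Path(tmp_dir) / "README.md").touch()                # assumed non-Parquet file
    found = find_parquet_files(tmp_dir)

print(found)
```

In the notebook itself, calling `find_parquet_files(directory)` with the `directory` value returned by `snapshot_download` should list the Parquet file that the next cell loads.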


Next, load the dataset from the Parquet file located at ./data/financial/data/train-00000-of-00001.parquet. Here, the load_dataset function is called with two arguments:
* The first argument, "parquet", specifies the format of the dataset file. Parquet is a columnar storage format suited to large datasets.
* The second argument, data_files, specifies the path to the dataset file.

The loaded dataset is assigned to a variable named financial_tweets.

In [2]:
from datasets import load_dataset

financial_tweets = load_dataset("parquet", data_files='./data/financial/data/train-00000-of-00001.parquet')
# healthcare_notes = load_dataset("json", data_files='./data/healthcare/augmented_notes_30K.jsonl')



Generating train split: 38091 examples [00:00, 1075194.05 examples/s]


In [3]:
financial_tweets

DatasetDict({
    train: Dataset({
        features: ['tweet', 'sentiment', 'url'],
        num_rows: 38091
    })
})
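
The `sentiment` feature listed above stores integer codes rather than readable labels. The sketch below shows one way to translate and count them, assuming the encoding described in the download cell's comment (0 neutral, 1 positive, 2 negative). The names `SENTIMENT_LABELS` and `label_counts` are hypothetical, and the sample codes are made up for illustration rather than taken from the dataset.

```python
from collections import Counter

# Assumed encoding, copied from the comment in the download cell.
SENTIMENT_LABELS = {0: "neutral", 1: "positive", 2: "negative"}


def label_counts(sentiments):
    """Count how often each readable sentiment label occurs in a sequence of codes."""
    return Counter(SENTIMENT_LABELS[code] for code in sentiments)


# Made-up codes for illustration only; these are not real dataset values.
sample_codes = [1, 0, 2, 1, 1]
counts = label_counts(sample_codes)
print(counts)
```

In the notebook, passing `financial_tweets['train']['sentiment']` to `label_counts` should summarize the label balance of the full training split, provided the column holds only the three codes above.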

Run the next block of code to trim this 38,091-row dataset down to 1,000 rows for this exercise. This keeps the first 1,000 tweets in the financial dataset.

In [5]:
##### The dataset summary above shows 38091 rows
## >> That's a lot, so trim it down for this exercise
## >> Keep only the text of the first 1000 tweets
notes = [{'note': note} for note in financial_tweets['train']['tweet'][:1000]]
#urls = 

Run the next block to see your trimmed set of data.

In [6]:
notes[:2]

[{'note': '$BYND - JPMorgan reels in expectations on Beyond Meat https://t.co/bd0xbFGjkT'},
 {'note': '$CCL $RCL - Nomura points to bookings weakness at Carnival and Royal Caribbean https://t.co/yGjpT2ReD3'}]

Finally, run the next block of code to write the trimmed collection of data into a new JSON Lines file for use in the exercise.

In [10]:
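import jsonlines

# Select the first 1,000 rows once instead of re-reading whole columns on every iteration
first_rows = financial_tweets['train'].select(range(1000))

with jsonlines.open('./data/financial/tweets_all.jsonl', mode='w') as writer:
    for row in first_rows:
        # Each record keeps the tweet text (as 'note') and its sentiment label
        writer.write({'note': row['tweet'], 'sentiment': row['sentiment']})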
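
A JSON Lines file holds one JSON object per line, so the result can be checked with the standard library alone. The sketch below is an illustration rather than part of the original workshop: `read_jsonl` is a hypothetical helper, and the two sample records are placeholders written to a temporary file so the example runs on its own.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Return a list with one dict per non-empty line of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Placeholder records in the same shape the cell above writes ('note' and 'sentiment').
sample_records = [
    {"note": "placeholder tweet one", "sentiment": 1},
    {"note": "placeholder tweet two", "sentiment": 0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as handle:
        for record in sample_records:
            handle.write(json.dumps(record) + "\n")
    loaded = read_jsonl(sample_path)

print(len(loaded), loaded[0])
```

In the notebook, `read_jsonl('./data/financial/tweets_all.jsonl')` should return 1,000 records if the write step completed as intended.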