# Data-Centric Market Sentiment Training

<p align="center">
	<img src='images/market_sentiment.png' width='1000' title=''>
</p>

In this example, we show a market sentiment NLP implementation in Pachyderm. We use [transfer learning](https://en.wikipedia.org/wiki/Transfer_learning) to fine-tune a BERT language model to classify text for financial sentiment. The example shows how to combine inputs from separate sources and incorporates data labeling, model training, and data visualization.

## 0. Initial Setup
Download the financial phrase bank and unzip it to the `data` directory.

1. [Download](https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10) the Financial Phrase Bank dataset and place the archive at `data/FinancialPhraseBank-v1.0.zip`, the path the unzip command below expects. (Due to data access permissions, this step must be done manually.)
2. Unzip the dataset. 

In [9]:
!unzip data/FinancialPhraseBank-v1.0.zip -d data/

Archive:  data/FinancialPhraseBank-v1.0.zip
   creating: data/FinancialPhraseBank-v1.0/
  inflating: data/FinancialPhraseBank-v1.0/License.txt  
   creating: data/__MACOSX/
   creating: data/__MACOSX/FinancialPhraseBank-v1.0/
  inflating: data/__MACOSX/FinancialPhraseBank-v1.0/._License.txt  
  inflating: data/FinancialPhraseBank-v1.0/README.txt  
  inflating: data/__MACOSX/FinancialPhraseBank-v1.0/._README.txt  
  inflating: data/FinancialPhraseBank-v1.0/Sentences_50Agree.txt  
  inflating: data/FinancialPhraseBank-v1.0/Sentences_66Agree.txt  
  inflating: data/FinancialPhraseBank-v1.0/Sentences_75Agree.txt  
  inflating: data/FinancialPhraseBank-v1.0/Sentences_AllAgree.txt  


3. Download the pre-trained BERT language model

In [11]:
!chmod +x download_model.sh

In [12]:
!./download_model.sh

--2021-09-21 19:31:04--  https://huggingface.co/ProsusAI/finbert/resolve/main/README.md
Resolving huggingface.co (huggingface.co)... 107.23.77.87, 3.213.134.133, 34.195.144.223, ...
Connecting to huggingface.co (huggingface.co)|107.23.77.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1475 (1.4K) [text/markdown]
Saving to: ‘./models/finbertTCR2/README.md’


2021-09-21 19:31:05 (239 MB/s) - ‘./models/finbertTCR2/README.md’ saved [1475/1475]

--2021-09-21 19:31:05--  https://huggingface.co/ProsusAI/finbert/resolve/main/config.json
Resolving huggingface.co (huggingface.co)... 34.200.164.230, 34.195.144.223, 3.213.134.133, ...
Connecting to huggingface.co (huggingface.co)|34.200.164.230|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 758 [application/json]
Saving to: ‘./models/finbertTCR2/config.json’


2021-09-21 19:31:05 (127 MB/s) - ‘./models/finbertTCR2/config.json’ saved [758/758]

--2021-09-21 19:31:05--  https://huggingface.co/P

## 1. Set up data repos and push data

For simplicity, we'll be using the Pachyderm command line utility (CLI) to run our commands. 

The first thing we’ll do in Pachyderm is create a data repository to hold our sentiment dataset. 
Data repositories are the way that we organize our data in Pachyderm. 

Data repositories are similar to Git repositories in that they store data as commits. This means that I can upload a file, and if I ever overwrite it, I can go back to the previous version. However, unlike Git and most other versioning systems based on it, Pachyderm is built specifically for file storage, meaning that it can be used to track and version text data, images, audio, models, and anything else that can be treated as a file.

The `50Agree` in the file name means that I’m using the version of the dataset where at least 50% of the labelers agreed on each label. 
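For orientation, here is a minimal sketch of reading one of these files. It assumes the `sentence@label` line layout and ISO-8859-1 encoding that the public release is commonly read with; `parse_fpb_line` and `load_fpb` are hypothetical helper names, not part of this example's code, and the sample sentence is made up.

```python
from pathlib import Path

# Assumed label vocabulary of the phrase bank files.
LABELS = {"positive", "negative", "neutral"}


def parse_fpb_line(line):
    """Split one 'sentence@label' line; the label follows the last '@'."""
    sentence, sep, label = line.rstrip("\r\n").rpartition("@")
    label = label.strip()
    if not sep or label not in LABELS:
        raise ValueError(f"unexpected line format: {line!r}")
    return sentence.strip(), label


def load_fpb(path, encoding="ISO-8859-1"):
    """Read a phrase bank file into a list of (sentence, label) pairs."""
    text = Path(path).read_text(encoding=encoding)
    return [parse_fpb_line(line) for line in text.splitlines() if line.strip()]


# Made-up sample line, used only to show the assumed layout.
print(parse_fpb_line("Sales increased compared with last year .@positive"))
```

Splitting on the last `@` keeps sentences that themselves contain an `@` intact.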

In [5]:
# Upload the Financial Phrase Bank data
!pachctl create repo financial_phrase_bank
!pachctl put file financial_phrase_bank@master:/Sentences_50Agree.txt -f data/FinancialPhraseBank-v1.0/Sentences_50Agree.txt

data/FinancialPhraseBank-v1.0/Sentences_50Agree.txt 671.36 KB / 671.36 KB  0s 0…


Next, we upload the pre-trained language model to its own repository. We'll also set up a repository to hold labeled data for later in the demo and create an empty commit in it, which tells Pachyderm that the repo has no data yet but should still be processed. 

In [6]:
# Upload the language model to Pachyderm
!pachctl create repo language_model
!cd models/finbertTCR2/; pachctl put file -r language_model@master -f ./; cd ../../

config.json 0.00 b / 758.00 b [------------------------------------] 0s 0.00 b/s
pytorch_model.bin 0.00 b / 437.99 MB [-----------------------------] 0s 0.00 b/s
pytorch_model.bin 2.10 MB / 437.99 MB [----------------------------] 0s 0.00 b/s
pytorch_model.bin 10.49 MB / 437.99 MB [>--------------------------] 0s 0.00 b/s
pytorch_model.bin 27.26 MB / 437.99 MB [=>-----------------------] 4s 95.02 MB/s
pytorch_model.bin 41.94 MB / 437.99 MB [=>----------------------] 3s 102.40 MB/s
pytorch_model.bin 62.91 MB / 437.99 MB [==>---------------------] 3s 122.25 MB/s
pytorch_model.bin 73.40 MB / 437.99 MB [===>--------------------] 3s 117.49 MB/s
pytorch_model.bin 92.27 MB / 437.99 MB [====>-------------------] 2s 128.16 MB/s


In [8]:
# Set up labeled_data repo for labeling production data later
!pachctl create repo labeled_data
!pachctl start commit labeled_data@master; pachctl finish commit labeled_data@master

e493e1e8c2c443f392092b66e7f2807f


Note: Here we create an empty commit in the `labeled_data` repo as a placeholder. This allows our pipeline to run before any labeled production data exists.

## 2. Deploy Pachyderm Pipelines 

Now I can deploy my dataset creation and model training pipelines using `pachctl create pipeline`. The dataset pipeline combines my two data sources, creates training, testing, and validation sets, and writes them to its output repo. This repository is versioned as well, which means that I can revert to any dataset version that my pipeline created. 
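To make the dataset step concrete, the sketch below shows one simple way to cut a combined list of examples into training, validation, and test sets. The 80/10/10 proportions, the fixed seed, and the name `split_dataset` are assumptions for illustration; the actual `completions-dataset.py` may split the data differently.

```python
import random


def split_dataset(examples, train=0.8, val=0.1, seed=42):
    """Shuffle deterministically, then cut into train/val/test lists."""
    if not 0 < train < 1 or not 0 <= val < 1 or train + val >= 1:
        raise ValueError("train and val must leave room for a test split")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same split
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # remainder becomes the test set
    )


# Placeholder examples standing in for the two data sources.
fpb = [(f"fpb sentence {i}", "neutral") for i in range(8)]
labeled = [(f"labeled sentence {i}", "positive") for i in range(2)]
train_set, val_set, test_set = split_dataset(fpb + labeled)
print(len(train_set), len(val_set), len(test_set))
```

A fixed seed keeps the split reproducible, so a change in the output commit reflects a change in the input data rather than a different shuffle.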

In [9]:
# Deploy the dataset creation pipeline
!pachctl create pipeline -f pachyderm/dataset.json

# Deploy the training pipeline
!pachctl create pipeline -f pachyderm/train_model.json

Let's look at the pipeline to see what’s actually happening under the hood. 

In [10]:
!cat pachyderm/dataset.json

{
  "pipeline": {
    "name": "dataset"
  },
  "description": "Create an FPB formatted dataset for labeled text data.",
  "input": {
    "join": [
      {
          "pfs": {
              "glob": "/",
              "repo": "labeled_data",
              "outer_join": true
          }
      },
      {
          "pfs": {
              "glob": "/",
              "repo": "financial_phrase_bank",
              "outer_join": true
          }
      }
  ]
  },
  "transform": {
    "cmd": [
      "python", "completions-dataset.py",
      "--completions-dir", "/pfs/labeled_data/",
      "--fpb-dataset", "/pfs/financial_phrase_bank/",
      "--output-dir", "/pfs/out/"
    ],
    "image": "jimmywhitaker/market_sentiment:dev0.25"
  }
}

We won't go through all of the details of this pipeline, but the key components are: 
* Input - the data repositories that are mapped into the pipeline when it runs. They will be available through the file system at `/pfs/`.
* Transform - the Docker image and the Python command that will process our data. 

The output is written to `/pfs/out`, and Pachyderm automatically versions all the files there when the container finishes. So if I overwrite my dataset, I can still access the earlier version by viewing a previous commit. 
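As a rough sketch of how a transform script can consume these mounts, the skeleton below accepts the same three flags that appear in the `cmd` of `dataset.json` and writes a file into the output directory. Everything inside `main` is hypothetical (including the `manifest.txt` output); it is not the code in `completions-dataset.py`. Temporary directories stand in for `/pfs/...` so the sketch runs locally.

```python
import argparse
import tempfile
from pathlib import Path


def list_files(root):
    """Return all regular files under root, or [] if the mount is absent."""
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())


def main(argv=None):
    parser = argparse.ArgumentParser()
    # Flag names mirror the "cmd" in dataset.json; the body is illustrative.
    parser.add_argument("--completions-dir", type=Path, required=True)
    parser.add_argument("--fpb-dataset", type=Path, required=True)
    parser.add_argument("--output-dir", type=Path, required=True)
    args = parser.parse_args(argv)

    inputs = list_files(args.completions_dir) + list_files(args.fpb_dataset)
    args.output_dir.mkdir(parents=True, exist_ok=True)
    manifest = args.output_dir / "manifest.txt"  # hypothetical output file
    manifest.write_text("".join(f"{path}\n" for path in inputs))
    return manifest


# Local dry run: temporary directories play the role of the /pfs mounts.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "financial_phrase_bank").mkdir()
    (root / "financial_phrase_bank" / "Sentences_50Agree.txt").write_text("demo\n")
    (root / "labeled_data").mkdir()  # empty, like the empty commit
    out = main([
        "--completions-dir", str(root / "labeled_data"),
        "--fpb-dataset", str(root / "financial_phrase_bank"),
        "--output-dir", str(root / "out"),
    ])
    print(out.read_text(), end="")
```

Inside a pipeline container the script would be invoked with the `/pfs/...` paths from the spec, and calling `main()` without arguments would read them from the command line.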

Notice also that I just created my pipelines, but I never had to tell them to run. All pipelines in Pachyderm are data-driven, meaning that changes in the data repositories are what control the processing flow. 

This means that I can start experimenting with different versions of datasets or pipelines and never lose anything! 
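The data-driven idea can be illustrated with a toy model in which every new commit to a repository triggers the pipelines that subscribe to it. This is a conceptual sketch only, not how Pachyderm is implemented; `ToyRepo`, `ToyPipeline`, and the lambda transforms are hypothetical names invented for the illustration.

```python
class ToyRepo:
    """Ordered list of commits; each commit is a dict of path -> content."""

    def __init__(self, name):
        self.name = name
        self.commits = []
        self.subscribers = []

    def commit(self, files):
        self.commits.append(dict(files))
        # A new commit is the only thing that triggers downstream work.
        for callback in self.subscribers:
            callback(self)


class ToyPipeline:
    """Runs a transform on the newest input commit and commits the result."""

    def __init__(self, name, input_repo, transform):
        self.name = name
        self.transform = transform
        self.output = ToyRepo(name)  # every pipeline owns an output repo
        input_repo.subscribers.append(self._on_commit)

    def _on_commit(self, repo):
        self.output.commit(self.transform(repo.commits[-1]))


# Chain two toy pipelines, loosely mirroring dataset -> train_model.
source = ToyRepo("financial_phrase_bank")
dataset = ToyPipeline(
    "dataset", source, lambda files: {"/train.csv": "\n".join(sorted(files))}
)
train = ToyPipeline(
    "train_model", dataset.output, lambda files: {"/model.txt": files["/train.csv"]}
)

source.commit({"/Sentences_50Agree.txt": "placeholder"})
source.commit({"/Sentences_AllAgree.txt": "placeholder"})
print(len(train.output.commits))
```

Nothing in the sketch ever calls a "run" method on a pipeline; both downstream outputs are produced purely as a consequence of the two source commits.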

**Note**: 

[Glob patterns](https://docs.pachyderm.com/latest/concepts/pipeline-concepts/datum/glob-pattern/) on our inputs are also very powerful. You can tell a pipeline how to split your data up so that you can parallelize your pipeline with no extra code, which is very useful in data cleaning and preprocessing. 
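As a simplified illustration of how a glob pattern divides a repository into independently processed units (datums), the toy function below handles only the two simplest patterns: `/` treats the whole repository as one datum, and `/*` makes each top-level file or directory its own datum. `toy_datums` is a hypothetical helper and does not reproduce Pachyderm's actual matching logic.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def toy_datums(paths, glob):
    """Group absolute file paths into datums for the globs '/' and '/*'."""
    if glob == "/":
        return [sorted(paths)]  # everything in a single datum
    if glob == "/*":
        groups = defaultdict(list)
        for path in paths:
            # parts[0] is the root '/', parts[1] the top-level entry.
            top = PurePosixPath(path).parts[1]
            groups[top].append(path)
        return [sorted(groups[key]) for key in sorted(groups)]
    raise ValueError("this sketch only handles '/' and '/*'")


files = ["/Sentences_50Agree.txt", "/Sentences_AllAgree.txt"]
print(toy_datums(files, "/"))   # one datum containing both files
print(toy_datums(files, "/*"))  # one datum per top-level file
```

The pipelines in this example use `/`, so each job sees the full contents of its input repos at once.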

## 3. (Optional) Deploy Data Visualization Pipeline

In [11]:
# Use a sentiment word list to visualize the current dataset
!pachctl create repo sentiment_words
!pachctl put file sentiment_words@master:/LoughranMcDonald_SentimentWordLists_2018.csv -f resources/LoughranMcDonald_SentimentWordLists_2018.csv
!pachctl create pipeline -f pachyderm/visualizations.json

resources/LoughranMcDonald_SentimentWordLists_2018.csv 83.53 KB / 83.53 KB  0s …


## 4. Update Our Dataset (Data-Driven Pipelines)

Say I wanted to change the version of the Financial Phrase Bank. 

I originally uploaded the one where at least 50% of the labelers agreed on the label, but maybe that’s resulting in low model accuracy. Let’s change that to the one where all labelers agree. 

First, we'll tag the version of data that was used in this training job as `v1`. 

Note: Pachyderm will track all versions that are created, but this will make it easier to refer to this branch later on.

In [12]:
!pachctl create branch financial_phrase_bank@v1 --head master

In [13]:
!pachctl list commit financial_phrase_bank@v1
!pachctl list file financial_phrase_bank@v1

REPO                  BRANCH COMMIT                           FINISHED      SIZE     ORIGIN DESCRIPTION
financial_phrase_bank master 37ecb36189bb4817a65aa45405eac122 4 minutes ago 655.6KiB USER    
NAME                   TYPE SIZE     
/Sentences_50Agree.txt file 655.6KiB 


In [14]:
# Modify the version of Financial Phrase Bank dataset used
!pachctl start commit financial_phrase_bank@master
!pachctl delete file financial_phrase_bank@master:/Sentences_50Agree.txt
!pachctl put file financial_phrase_bank@master:/Sentences_AllAgree.txt -f data/FinancialPhraseBank-v1.0/Sentences_AllAgree.txt
!pachctl finish commit financial_phrase_bank@master

a3b82817b3c24633ad9b7a273e618fe4
data/FinancialPhraseBank-v1.0/Sentences_AllAgree.txt 299.64 KB / 299.64 KB  0s …


In [18]:
!pachctl list commit financial_phrase_bank@v1
!pachctl list file financial_phrase_bank@v1

REPO                  BRANCH COMMIT                           FINISHED      SIZE     ORIGIN DESCRIPTION
financial_phrase_bank master 37ecb36189bb4817a65aa45405eac122 5 minutes ago 655.6KiB USER    
NAME                   TYPE SIZE     
/Sentences_50Agree.txt file 655.6KiB 


In [17]:
!pachctl list commit financial_phrase_bank@master
!pachctl list file financial_phrase_bank@master

REPO                  BRANCH COMMIT                           FINISHED       SIZE     ORIGIN DESCRIPTION
financial_phrase_bank master a3b82817b3c24633ad9b7a273e618fe4 34 seconds ago 292.6KiB USER    
financial_phrase_bank master 37ecb36189bb4817a65aa45405eac122 5 minutes ago  655.6KiB USER    
NAME                    TYPE SIZE     
/Sentences_AllAgree.txt file 292.6KiB 


Now this is cool: I don’t have to re-run my pipelines. Pachyderm recognizes the changes to the pipeline’s input, and it automatically kicks off my new training job. 

In [19]:
# Version our new dataset
!pachctl create branch financial_phrase_bank@v2 --head master

This gives you iron-clad reproducibility and lineage for anything that is put into Pachyderm, ensuring that you never lose anything and know how everything is connected. 
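The branch behavior seen in the listings above can be pictured with a toy model in which a branch is just a named pointer to a commit. The class `ToyVersionedRepo` and its methods are hypothetical and only mimic the observable behavior (`v1` keeps pointing at the old commit while `master` moves on); they are not Pachyderm's implementation.

```python
import uuid


class ToyVersionedRepo:
    """Commits are immutable snapshots; branches are pointers to commits."""

    def __init__(self):
        self.commits = {}   # commit id -> {path: content}
        self.branches = {}  # branch name -> commit id

    def commit(self, branch, files):
        commit_id = uuid.uuid4().hex  # random id, assumed for the toy
        self.commits[commit_id] = dict(files)
        self.branches[branch] = commit_id  # the branch head moves forward
        return commit_id

    def create_branch(self, name, head):
        # The new branch points at whatever commit `head` points at now.
        self.branches[name] = self.branches[head]

    def list_files(self, branch):
        return sorted(self.commits[self.branches[branch]])


repo = ToyVersionedRepo()
repo.commit("master", {"/Sentences_50Agree.txt": "placeholder"})
repo.create_branch("v1", head="master")
repo.commit("master", {"/Sentences_AllAgree.txt": "placeholder"})
repo.create_branch("v2", head="master")

print(repo.list_files("v1"))
print(repo.list_files("master"))
```

Because the old snapshot is never modified, `v1` still lists the 50% agreement file after `master` has moved to the full-agreement file.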

In [28]:
!pachctl list job --pipeline=train_model

PIPELINE    ID                               STARTED        DURATION       RESTART PROGRESS  DL       UL       STATE   
train_model a3b82817b3c24633ad9b7a273e618fe4 12 minutes ago About a minute 0       1 + 0 / 1 418.2MiB 417.7MiB success 
train_model 42e4ec6252c64a55b049c6e45eb99a59 15 minutes ago 2 minutes      0       1 + 0 / 1 418.6MiB 417.7MiB success 


## 5. Add Labeled Production Data

First we'll create a data repository that will hold all of the raw data that comes from production. 

We will connect our labeling tool to this repo and write our labeled data to the `labeled_data` repo. 

In [30]:
!pachctl create repo raw_data
!pachctl put file raw_data@master:/example1.json -f data/round1/example1.json



Run the [Label Studio Integration](https://github.com/pachyderm/examples/tree/master/label-studio) to connect a labeling environment with data versioned in Pachyderm. 

<p align="center">
	<img src='images/label_studio_screenshot.png' width='1000' title=''>
</p>

As soon as we label an example, it will automatically kick off a retraining of our model. 


Alternatively, if you are running this notebook without Label Studio, you can place the data directly in the `labeled_data` repository by uncommenting and running the following cell. 

In [None]:
#!pachctl put file labeled_data@master:/example1_labeled.json -f data/round1/example1_labeled.json