# End-to-end training 

You will need the following libraries:
1. fasttext
    * pip3 install fasttext
2. tqdm
3. numpy
    * pip3 install numpy
4. pandas
    * pip3 install pandas
5. torch
6. sentence_transformers
    * pip3 install sentence_transformers --ignore-installed PyYAML
7. gzip
8. csv
9. spacy (you will also need to download the language model)
    * pip3 install spacy
    * python3 -m spacy download en_core_web_sm
10. importlib
11. json
12. tokenizers
13. matplotlib
    * pip3 install matplotlib
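
Before starting, it can help to confirm that the third-party libraries listed above are importable. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `missing_modules` helper are illustrative names, not part of the repo.

```python
import importlib.util

# Import names of the third-party libraries listed above
# (gzip, csv, importlib and json ship with Python itself).
REQUIRED = [
    "fasttext", "tqdm", "numpy", "pandas", "torch",
    "sentence_transformers", "spacy", "tokenizers", "matplotlib",
]

def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
print("missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can be installed with the matching `pip3 install` command from the list.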


## 1. Extract ngrams from unlabeled data
1. data_path: file(s) containing the snippets, one per line
2. LLMvocab_path: the language model's vocab.json file. This depends on which base model will be used for training, e.g. roberta-large or xlm-roberta-large
3. tokenizer: the tokenizer type of the base model. For roberta it's `wordpiece`, for xlm-roberta it's `sentencepiece`
    * This is important because the two tokenizers use different special characters, which need to be replaced: for xlm-roberta it's '▁', and for roberta it's 'Ġ'
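
To illustrate why the special character matters, here is a minimal sketch of stripping the marker from vocabulary tokens. How `get_ngrams.py` actually does this is not shown in this guide, so the `clean_vocab` helper, the `BOUNDARY_MARKERS` keys, and the tiny sample vocabulary are all assumptions for illustration.

```python
import json

# Special characters mentioned above; the dictionary keys are illustrative
# labels, not necessarily the values get_ngrams.py accepts.
BOUNDARY_MARKERS = {"roberta": "Ġ", "xlm-roberta": "▁"}

def clean_vocab(vocab, model_family):
    """Return the set of vocabulary tokens with the special character removed."""
    marker = BOUNDARY_MARKERS[model_family]
    cleaned = {token.replace(marker, "") for token in vocab}
    cleaned.discard("")  # a token that was only the marker becomes empty
    return cleaned

# Tiny in-memory stand-in for a real vocab.json (token -> id).
sample_vocab = json.loads('{"Ġmanager": 0, "Ġterritory": 1, "ing": 2, "Ġ": 3}')
print(sorted(clean_vocab(sample_vocab, "roberta")))
```

Using the wrong marker leaves the tokens unchanged, so vocabulary lookups against plain-text ngrams would silently miss.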

### data

In [3]:
# first 15 lines of the data:
!head -n 15 data/COSsample1000.csv

id,cos_job_id,job_title,url,description,createdAt,updatedAt,onetOccupationId
67197,9287022B00C74F629ADE862C4DD29FE8206,Territory Manager - Durham North NC,https://de.jobsyn.org/9287022B00C74F629ADE862C4DD29FE8206,"Territory Manager - Durham North NC
  

  
**Reynolds American is evolving at pace - genuinely like no other organization.**
  

  
**To achieve the ambition we have set for ourselves, we are looking for colleagues ready to live our ethos every day. Be a part of this journey!**
  

  
**REYNOLDS AMERICAN IS LOOKING FOR A TERRITORY MANAGER:**
  


In [29]:
# let's make a tiny sample of data just to run a systems test:
!head -n 10 ../data/english_audio_snippets_4.4.2022.csv > ../data/english_audio_snippets_4.4.2022_sample.csv

### vocab
The vocab.json file is located in the base model directory, e.g. https://huggingface.co/roberta-large/blob/main/vocab.json

Models that use the sentencepiece tokenizer, such as xlm-roberta, are a little more complicated because they have no vocab.json file. One can be generated using https://github.com/Neva-Labs/TAPT-n/blob/main/get-vocab.sh, but this takes a bloody long time. If you want the xlm-roberta vocab, I put it in S3: https://s3.console.aws.amazon.com/s3/object/kinzen-sts?region=eu-west-1&prefix=models/xlm-roberta-large/vocab.json


### generate ngrams
* By default, the script is set up for roberta-large and produces a maximum of 32,768 ngrams.
* For all other options (multiple languages, custom stopwords, etc.), see the script

Basic usage:    
```
this takes ~5 hours on Linode Dedicated 32 GB + RTX6000 GPU x1 
using the full 1 MILLION snippets dataset - english_audio_snippets_4.4.2022.csv
```

In [5]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting datasets==1.1.3
  Using cached datasets-1.1.3-py3-none-any.whl (153 kB)
Collecting importlib-metadata==2.0.0
  Using cached importlib_metadata-2.0.0-py2.py3-none-any.whl (31 kB)
Collecting matplotlib==3.3.2
  Downloading matplotlib-3.3.2-cp38-cp38-manylinux1_x86_64.whl (11.6 MB)
Collecting multiprocess==0.70.11.1
  Downloading multiprocess-0.70.11.1-py38-none-any.whl (126 kB)
Collecting nltk==3.5
  Using cached nltk-3.5.zip (1.4 MB)
  Preparing metadata (setup.py) ... done
Collecting packaging==20.4
  Using cached packaging-20.4-py2.py3-none-any.whl (37 kB)
Collecting protobuf==3.13.0
  Downloading protobuf-3.13.0-cp38-

In [7]:
!python3 -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p38/bin/python3 -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
%%bash
python get_ngrams.py \
--data_path=data/COSsample1000.csv \
--LLMvocab_path=vocab.json \
--output_path=output/ngrams

In [10]:
# output is a tab delimited file containing: ngram \t count
!head output/ngrams/en_ngrams_32768.tsv

+ 	1697
05 04	1308
and/or	964
sexual orientation	543
national origin	542
gender identity	506
Qualifications	497
veteran status	438
Employer	437
orientation gender	399
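
For downstream inspection, the ngram file can be read back with the standard library. This is a minimal sketch assuming only the `ngram \t count` layout shown above; `read_ngrams` is an illustrative helper, and the sample rows are copied from the output above.

```python
import csv
import io

def read_ngrams(handle):
    """Parse 'ngram<TAB>count' rows into a list of (ngram, count) tuples."""
    rows = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        ngram, count = row
        rows.append((ngram.strip(), int(count)))
    return rows

# In-memory sample; for a real run, pass
# open("output/ngrams/en_ngrams_32768.tsv", encoding="utf-8", newline="").
sample = "and/or\t964\nsexual orientation\t543\nnational origin\t542\n"
ngrams = read_ngrams(io.StringIO(sample))
print(ngrams[0], sum(count for _, count in ngrams))
```

`csv.QUOTE_NONE` is used so that quote characters inside an ngram are kept as literal text.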


## Alternatively...
### To extract n-grams for datasets, please run pmi_ngram.py with the following parameters:
```
--dataset: the path of training data file
--output_dir: the path of output directory
```

In [11]:
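# NOTE: this blob URL returns GitHub's HTML page rather than the script itself (see `[text/html]` in the output below).
# To download the script, use the raw URL: https://raw.githubusercontent.com/shizhediao/T-DNA/main/TDNA/pmi_ngram.py
!wget https://github.com/shizhediao/T-DNA/blob/main/TDNA/pmi_ngram.py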

--2022-07-16 13:12:12--  https://github.com/shizhediao/T-DNA/blob/main/TDNA/pmi_ngram.py
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘pmi_ngram.py’

    [ <=>                                   ] 218,294     --.-K/s   in 0.007s  

2022-07-16 13:12:12 (31.9 MB/s) - ‘pmi_ngram.py’ saved [218294]



In [None]:
%%bash
python TDNA/pmi_ngram.py \
--LLMvocab_path=vocab.json \
--dataset=data/cos-jobs-6.14.22.csv \
--output_dir=output/ngrams/ngrams_full_cos_data.txt

In [28]:
!wc -l output/ngrams/ngrams_full_cos_data.txt

419733 output/ngrams/ngrams_full_cos_data.txt


## 2. Make fasttext embeddings
Now that we have the ngrams, we need to train a fastText model to generate the embeddings that we will feed to the TAPT-n MLM training procedure in step 3 below.

* Make sure to specify the correct dimension. If your base model is roberta-base, the dimension should be 768. If your base model is roberta-large, it is 1024 (this is our default).

### train fasttext model
The following script takes the same data as above as input and outputs a model.bin file, which can then be used to generate the ngram embeddings.
fastText is optimized for CPU, so using a beefy CPU/Mem Linode will make it go nice and fast.
```

LINODE: 
50 CPU Cores
2500 GB Storage
128 GB RAM

Read 79M words
Number of words:  174011
Number of labels: 0

Progress: avg.loss:  1.455525 ETA:   ~20 minutes, 3 epochs

Progress: avg.loss:   0.415798 ETA:   ~2.5 hours, 20 epochs
```

In [75]:
%%bash
python train_fasttext.py \
--data_path=../data/english_audio_snippets_4.4.2022_sample.csv \
--dimension=1024 \
--output_path=../output/fasttext

saving to:  ../output/fasttext/english_audio_snippets_4.4.2022_sample.csv_fasttext.bin


Read 0M words
Number of words:  15
Number of labels: 0
Progress: 100.0% words/sec/thread:    1697 lr:  0.000000 avg.loss:  4.123557 ETA:   0h 0m 0s


### generate ngrams embeddings
The following script takes the ngrams generated above and the fasttext model.bin as input, and outputs a numpy array of ngram embeddings.

In [71]:
%%bash
python get_ngrams_embeddings.py \
--model_path=../TAPT-n/models/english_snippet_graph_matches_100k_fasttext.bin \
--ngrams_path=../data/english_snippet_graph_matches_100k_ngrams_32768.tsv \
--output_path=models

loading model
encoding ngrams: : 277it [00:00, 5914.75it/s]
saving to:  ../output/ngrams/en_ngrams_32768.npy


In [73]:
import numpy as np

# let's take a peek at the embeddings
ngrams_embeddings = np.load('../output/ngrams/en_ngrams_32768.npy')
print(len(ngrams_embeddings))
ngrams_embeddings[0]

277


array([ 2.80783333e-05,  1.38999807e-04,  1.61081189e-04, ...,
        1.03375409e-04, -1.00006815e-04,  5.47254858e-05], dtype=float32)
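
A quick sanity check before moving on: the embedding matrix should have one row per ngram and a column count equal to the base model's dimension. The sketch below works on the shape tuple (what `ngrams_embeddings.shape` returns), so it needs no extra libraries; `check_embeddings` and `HIDDEN_SIZE` are illustrative names, and the dimensions are the ones stated in this guide.

```python
# Dimensions stated in this guide for the two RoBERTa variants.
HIDDEN_SIZE = {"roberta-base": 768, "roberta-large": 1024}

def check_embeddings(shape, n_ngrams, base_model="roberta-large"):
    """Raise ValueError if an embedding matrix shape does not fit the setup.

    `shape` is what `ngrams_embeddings.shape` returns: (rows, columns).
    """
    expected_dim = HIDDEN_SIZE[base_model]
    if len(shape) != 2:
        raise ValueError(f"expected a 2-D array, got shape {shape}")
    rows, dim = shape
    if rows != n_ngrams:
        raise ValueError(f"{rows} embeddings for {n_ngrams} ngrams")
    if dim != expected_dim:
        raise ValueError(f"dimension {dim}, but {base_model} needs {expected_dim}")
    return True

# 277 ngrams were encoded in the run above; 1024 is the roberta-large default.
print(check_embeddings((277, 1024), n_ngrams=277))
```

Catching a dimension mismatch here is much cheaper than discovering it partway into the GPU training step.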

## 3. Task Adaptive Pre-Training w/ngrams (TAPT-n) via Masked Language Modeling (mlm)
This procedure takes as input:
1. the snippet data csv file
2. the ngrams tsv file
3. the ngrams embeddings numpy file
4. a base model such as roberta-large

It outputs an encoder model that is ready for fine-tuning on any downstream task (STS, classification, etc.).

__NOTE: You will need a GPU for this, so use a Linode GPU instance.__   
If you run into OOM errors on the GPU, try reducing the batch size (the last 2 options).

## IMPORTANT: 
Don't run this in the notebook. Run it in the terminal so you can keep a separate log file in case you lose the kernel:   
`bash train-mlm.sh >> log_file 2>&1`

In a separate shell, do:   
`tail -f log_file`

In [29]:
%%bash
python ./run_language_modeling.py \
--output_dir=models/TDNA_TAPT_test \
--model_type=roberta \
--block_size=128 \
--max_position_embeddings=128 \
--overwrite_output_dir \
--model_name_or_path=roberta-large \
--train_data_file=data/COSsample1000.csv \
--eval_data_file=data/COSsample1000.csv \
--mlm \
--line_by_line \
--Ngram_path=output/ngrams/ngrams_full_cos_fasttext_1024.txt \
--num_train_epochs 6.0 \
--fasttext_model_path=output/ngrams/ngrams_full_cos_fasttext_1024.npy \
--learning_rate 2e-5 \
--per_device_train_batch_size=16 \
--per_device_eval_batch_size=16 

args.device cuda:0


07/16/2022 16:20:33 - INFO - __main__ -   Training/evaluation parameters Namespace(Ngram_path='output/ngrams/ngrams_full_cos_data.txt', adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, add_tokens=True, block_size=128, cache_dir=None, config_name=None, dataloader_drop_last=False, dataloader_num_workers=0, debug=False, device=device(type='cuda', index=0), disable_tqdm=False, do_eval=True, do_predict=True, do_train=True, eval_batch_size=32, eval_data_file='data/COSsample1000.csv', eval_steps=None, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, fasttext_model_path='output/ngrams/ngrams_full_cos_fasttext_1024.npy', fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, label_names=None, learning_rate=2e-05, line_by_line=True, local_rank=-1, logging_dir='runs/Jul16_16-20-33_ip-172-16-60-108.ec2.internal', logging_first_step=False, logging_steps=500, max_grad_norm=1.0, max_position_embeddings=128, max_span_length=5, max_steps=-1, mlm=True, mlm

CalledProcessError: Command 'b'python ./run_language_modeling.py \\\n--output_dir=models/TDNA_TAPT_test \\\n--model_type=roberta \\\n--block_size=128 \\\n--max_position_embeddings=128 \\\n--overwrite_output_dir \\\n--model_name_or_path=roberta-large \\\n--train_data_file=data/COSsample1000.csv \\\n--eval_data_file=data/COSsample1000.csv \\\n--mlm \\\n--line_by_line \\\n--Ngram_path=output/ngrams/ngrams_full_cos_data.txt \\\n--num_train_epochs 6.0 \\\n--fasttext_model_path=output/ngrams/ngrams_full_cos_fasttext_1024.npy \\\n--learning_rate 2e-5 \\\n--per_device_train_batch_size=32 \\\n--per_device_eval_batch_size=32 \\\n--add_tokens=False\n'' returned non-zero exit status 1.

__NOTE: the eval file here is not an appropriate evaluation strategy. It's just a placeholder!__

## 4. Fine Tuning for STS (Semantic Textual Similarity)

The fine tuning script is located in https://github.com/Neva-Labs/data_exploration_tools/blob/master/ML-78/train_FT.py

### Data
Training data is tab-separated and should have the following columns (other columns will be ignored):
```
score - int values (-1,0,1)
sentence1 - string (snippet)
sentence2 - string (claim)
split - string values: (train|dev|test)
```
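
Before launching a long fine-tuning run, it can be worth validating the file against the column list above. This is a minimal sketch, not the loader used by `train_FT.py`; `load_sts_rows` is an illustrative helper and the sample rows are made up. Scores are parsed as floats here because the data sample in this guide shows values such as `0.0`.

```python
import csv
import io
from collections import Counter

REQUIRED_COLUMNS = {"score", "sentence1", "sentence2", "split"}
VALID_SPLITS = {"train", "dev", "test"}

def load_sts_rows(handle):
    """Read tab-separated rows, keeping only the columns listed above."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for line in reader:
        if line["split"] not in VALID_SPLITS:
            raise ValueError(f"unexpected split value: {line['split']!r}")
        rows.append({
            "score": float(line["score"]),
            "sentence1": line["sentence1"],
            "sentence2": line["sentence2"],
            "split": line["split"],
        })
    return rows

# Made-up rows for illustration only.
sample = (
    "id\tsentence1\tsentence2\tscore\tsplit\n"
    "1\tsome snippet text\tsome claim\t1\ttrain\n"
    "2\tanother snippet\tanother claim\t-1\tdev\n"
)
rows = load_sts_rows(io.StringIO(sample))
print(Counter(row["split"] for row in rows))
```

The per-split counts make it easy to spot an empty dev or test set before training starts.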

In [77]:
# a peek at some training data
!head -n 2 ../data/annotation_transcripts/all_stax_train_dev_test_set.tsv

id	sentence1	sentence2	score	old_score	split
45391.0	for live streaming tickets for the Boston shows. If those are going to happen. We're looking into that. Not sure if they have the capacity to livestream there. The Bell House. Like they did all that stuff during covid and don't know if the Wilbur has that but we're going to look into it and we will let you know. Between the 101 and the 5G, carbonyl oxidation new word for Sunset Lake 7-Day decarb your flowers and you can	COVID-19 is a 5G phenomenon	0.0	0.48054543	train


### Script options:
    
```
-d path to training data
-m path to model (this is the TAPT-n model trained above)
-e number of epochs (20 is recommended)
-b batch size (16 default)
-l loss (CosineSimilarityLoss default)

```

In [None]:
%%bash
python ../ML-78/train_FT.py \
-d ../data/annotation_transcripts/all_stax_train_dev_test_set.tsv \
-m ../models/TDNA_TAPT_3 \
-e 20

## 5. Continuous training

Both the TAPT-n model and the fine-tuned model can be used as a checkpoint for further training. Just keep in mind that a TAPT-n model requires fine-tuning before it can be used for downstream tasks. So if you want to train with more unlabeled data, run TAPT-n first and then fine-tune for STS or whatever other task you need. If you just want to continue fine-tuning for STS, you can use the fine-tuned model as the checkpoint.

There are currently no scripts for classification in my repos, as this is done via the existing Analysis classification training pipeline.