If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cells to mount Google Drive and install both libraries.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
! pip install datasets transformers

Collecting datasets
  Downloading datasets-2.1.0-py3-none-any.whl (325 kB)
Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.3.0-py3-none-any.whl (136 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting re

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!) then execute the following cell and input your username and password:

Then you need to install Git-LFS. Uncomment the following instructions:

Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [None]:
import transformers
from datasets import load_dataset
from sklearn.model_selection import train_test_split
# from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch 
import pandas
from torch.utils.data import DataLoader
import numpy as np
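The minimum-version requirement mentioned above (Transformers 4.11.0 or later) can be checked programmatically. The following is a minimal sketch using only the standard library; it assumes Python 3.8+ for `importlib.metadata`, and the helper names `numeric_version`, `meets_minimum` and `check_package` are hypothetical, not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version


def numeric_version(text):
    """Turn '4.18.0' or '4.19.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = numeric_version(installed), numeric_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name, minimum):
    """Report whether an installed package satisfies the minimum version."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        return f"{name}: not installed"
    status = "ok" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
    return f"{name} {installed}: {status}"


print(check_package("transformers", "4.11.0"))
```

Pre-release suffixes are ignored by this comparison, which is good enough for a quick sanity check but not a full implementation of Python's version ordering rules.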

You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/language-modeling).

## Preparing the dataset

You can replace the dataset above with any dataset hosted on [the hub](https://huggingface.co/datasets) or use your own files. Just uncomment the following cell and replace the paths with values that will lead to your files:

In [None]:
data_dir = '/content/drive/MyDrive/11785_Project/SEC_Filings_Data/TextData/1993-2002_0.05'
datasets = load_dataset("text", data_files={"train": f"{data_dir}/train_data.txt", "validation": f"{data_dir}/test_data.txt"})
datasets["train"][10]

Using custom data configuration default-383e328f29b2894b


Downloading and preparing dataset text/default to /root/.cache/huggingface/datasets/text/default-383e328f29b2894b/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8...


Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-383e328f29b2894b/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8. Subsequent calls will reuse this data.


{'text': 'The Company expects that Miramax will acquire and produce up to 20 films per year.'}

In [None]:
print(len(datasets["train"]), len(datasets["validation"]))

2093316 232587


## Common functions

In [None]:
from transformers import AutoTokenizer
from transformers import Trainer, TrainingArguments
from transformers import AutoModelForCausalLM

model_checkpoint = "distilgpt2"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
model = model.to(device)

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/distilgpt2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/f985248d2791fcff97732e4ee263617adec1edb5429a2b8421734c6d14e39bee.422318838d1ec4e061efb4ea29671cb2a044e244dc69229682bebd7cacc81631
Model config GPT2Config {
  "_name_or_path": "distilgpt2",
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 6,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_at

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

In [None]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
tokenized_datasets["train"][1]

      

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [464, 5466, 19954, 286, 19683, 11, 830, 287, 5003, 329, 262, 5834, 338, 9238, 286, 40773, 338, 9871, 4283, 290, 8646, 11, 830, 287, 5003, 290, 257, 3465, 286, 362, 11, 48768, 11, 49721, 329, 40773, 338, 2219, 4283, 13]}

## Causal language modeling

In [None]:
block_size = 128

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
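To see what `group_texts` does, here is a toy run on plain Python lists with a block size of 4 instead of 128. The function body mirrors the one above, except that `block_size` is passed as an argument so the snippet is self-contained; the name `group_texts_demo` and the toy token ids are made up for illustration.

```python
def group_texts_demo(examples, block_size):
    # Concatenate the per-sentence lists of every column into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    # Cut every column into consecutive blocks of block_size items.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM training the labels are a copy of the inputs;
    # the model shifts them internally when it computes the loss.
    result["labels"] = result["input_ids"].copy()
    return result


# Three "sentences" with 3, 2 and 5 tokens: 10 tokens in total.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}

grouped = group_texts_demo(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(grouped["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that blocks cross sentence boundaries (the first block mixes tokens from two sentences) and that the last two tokens are discarded, which is why the decoded training example below runs several filing sentences together.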

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

      

In [None]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

" licensed by the applicable licensing body for the specific jurisdiction involved, during the term of this Agreement.The Company also operates several catalog businesses primarily for the children's market.The Company has longterm gas delivery contracts with seven interstate pipeline companies.The goodwill was assigned proportionally to the separable asset groups acquired and the remaining goodwill in which the Company could not justifiably assign was writtenoff.The longterm debt balance at  August 31, 1993 was 1.3 billion compared to 1.5 billion at August 31, 1992 and 1.7 billion at August 31, 1991.The net periodic postretirement benefit cost for the year ended"

Now that the data has been cleaned, we're ready to instantiate our `Trainer`. The model was already loaded above, so we only need to define the `TrainingArguments`:

In [None]:
model_name = model_checkpoint.split("/")[-1]
n_epochs = 1

# These settings might need to be changed!
training_args = TrainingArguments(
    # f"{model_name}-toy",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=float(n_epochs),
    save_total_limit=2,
    output_dir= '/content/drive/MyDrive/11785_Project/Programming/Bhumika_Checkpoints',
    # Note: Trainer does not act on this field by itself; to actually resume,
    # call trainer.train(resume_from_checkpoint=True) once a checkpoint exists.
    resume_from_checkpoint=True,
    save_steps=15000,
    push_to_hub=False,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 567246
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 70906


Epoch,Training Loss,Validation Loss
1,3.0094,2.909663


Saving model checkpoint to /content/drive/MyDrive/11785_Project/Programming/Bhumika_Checkpoints/checkpoint-15000
Configuration saved in /content/drive/MyDrive/11785_Project/Programming/Bhumika_Checkpoints/checkpoint-15000/config.json
Model weights saved in /content/drive/MyDrive/11785_Project/Programming/Bhumika_Checkpoints/checkpoint-15000/pytorch_model.bin
Saving model checkpoint to /content/drive/MyDrive/11785_Project/Programming/Bhumika_Checkpoints/checkpoint-30000
Configuration saved in /content/drive/MyDrive/11785_Project/Programming/Bhumika_Checkpoints/checkpoint-30000/config.json
Model weights saved in /content/drive/MyDrive/11785_Project/Programming/Bhumika_Checkpoints/checkpoint-30000/pytorch_model.bin
Saving model checkpoint to /content/drive/MyDrive/11785_Project/Programming/Bhumika_Checkpoints/checkpoint-45000
Configuration saved in /content/drive/MyDrive/11785_Project/Programming/Bhumika_Checkpoints/checkpoint-45000/config.json
Model weights saved in /content/drive/MyDriv

TrainOutput(global_step=70906, training_loss=3.117425446382138, metrics={'train_runtime': 14507.1633, 'train_samples_per_second': 39.101, 'train_steps_per_second': 4.888, 'total_flos': 1.8527442073288704e+16, 'train_loss': 3.117425446382138, 'epoch': 1.0})
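The figures in the training log above can be cross-checked by hand. The sketch below recomputes the number of optimization steps from the example count and batch size, and converts the reported validation loss into a perplexity. It assumes a single device and that the loss is the mean cross-entropy in nats; the helper names are hypothetical and the step formula is a simplified reading of how `Trainer` counts updates.

```python
import math


def optimization_steps(num_examples, per_device_batch_size, num_epochs,
                       grad_accum=1, num_devices=1):
    """Number of optimizer updates for a full training run."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(updates_per_epoch * num_epochs)


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(mean_cross_entropy)


# Values taken from the log above: 567246 examples, batch size 8, 1 epoch.
steps = optimization_steps(567246, per_device_batch_size=8, num_epochs=1)
print(steps)  # 70906

# Validation loss reported after epoch 1.
print(f"validation perplexity: {perplexity(2.909663):.2f}")
```

Perplexity is often easier to compare across runs than the raw loss, since it can be read as the effective number of equally likely next tokens.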

## Generate embeddings for downstream task

### Phrasebank DataLoader

In [None]:
# For Downstream Data - Phrasebank
# -p: do not fail if the directories already exist
!mkdir -p downstream_data_zip
!mkdir -p downstream_data

!cp /content/drive/MyDrive/11785_Project/Programming/Data/phrasebank.zip downstream_data_zip

# -n: never overwrite files that were already extracted (avoids the interactive prompt)
!unzip -q -n downstream_data_zip/*.zip -d downstream_data

In [None]:
class LibriSamplesPhrasebank(torch.utils.data.Dataset):
    def __init__(self, csv_path:str, tokenizer, convert_label_to_int=True, return_labels=False, sentence_to_vec:dict=None):
        """
        csv_path:str
            This is the path to the all-data.csv in the Financial Phrase Bank
        
        convert_label_to_int:bool
            If this is true, then:
                 negative = -1
                 neutral  =  0
                 positive =  1
        
        sentence_to_vec:dict
            If this value is present, then dictionary of sentence to vector mappings.
        """

        self.tokenizer=tokenizer
        self.max_len = 512

        # Simply reading in the csv
        dataframe = pandas.read_csv(csv_path, encoding="ISO-8859-1", names=["label", "sentence"])
        self.X = dataframe["sentence"].to_numpy()
        # Convert the sentences to vectors if a dictionary is provided
        if sentence_to_vec is not None:
            new_X = []
            for sentence in self.X:
                vector = sentence_to_vec.get(sentence, None)
                # Use an identity check: `== None` on an array or tensor compares element-wise
                if vector is None:
                    raise Exception("The dictionary contains no vector for the sentence: \n{}".format(sentence))
                new_X.append(vector)
            self.X = np.array(new_X)

        # Check if we should be returning labels (not necessary for generating the embeddings)
        self.return_labels = return_labels
        if return_labels:
            self.Y = dataframe["label"].to_numpy()
            # Probably will want to convert the data into numeric form for easier handling
            if convert_label_to_int:
                new_Y = np.zeros(len(self.Y), dtype=np.int8)
                new_Y[self.Y == "negative"] = -1
                new_Y[self.Y == "positive"] = 1
                self.Y = new_Y

    def __len__(self):
        """
        Get the size of the data.
        """
        return len(self.X)
    
    def __getitem__(self, ind):
        """
        Returns the sentence, its token ids and its attention mask.
        If self.return_labels==True, the corresponding label is appended as a fourth element.
        """

        sentence = self.X[ind]

        encoding = self.tokenizer.encode_plus(
          sentence,
          max_length=self.max_len,
          truncation=True,
          # padding='max_length',
          return_tensors='pt',
        )
        item = (sentence, encoding['input_ids'].flatten(), encoding['attention_mask'].flatten())

        # self.Y only exists when labels were requested in __init__
        if self.return_labels:
            item += (torch.tensor(self.Y[ind], dtype=torch.long),)
        return item

In [None]:
sentence_hidden_states = {}
sentence_attention = {}
embeddings = {}

# Test the dataloader
csv_path = "/content/downstream_data/all-data.csv"
phrasebank_data = LibriSamplesPhrasebank(csv_path, tokenizer, convert_label_to_int=True, return_labels=True)
phrasebank_loader = DataLoader(phrasebank_data, batch_size=1, shuffle=False) #Shuffle is false just for demonstration purposes
for i, data in enumerate(phrasebank_loader):
    torch.cuda.empty_cache()

    sentence, input_ids, attention_mask, label = data
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)
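    # Inference only: disable autograd so no graph is built for backprop
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask, output_hidden_states=True, output_attentions=True)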

    # actual_seq_len = int(attention_mask.sum())
    # embedding = outputs.hidden_states[-1][:, :actual_seq_len, :].mean(dim=1).squeeze()
    sentence_hidden_states[sentence] = [h.detach().cpu() for h in outputs.hidden_states]
    sentence_attention[sentence] = [a.detach().cpu() for a in outputs.attentions]

    # embedding = outputs.hidden_states
    # embeddings[sentence] = embedding.detach().cpu()
    # if (i+1)%100 == 0: # Remove when generating embedding for entire dataset
      # break

In [None]:
embeddings1 = {}
embeddings2 = {}
for sentence, hidden_states in sentence_hidden_states.items():
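    # First-token state. GPT-2 has no [CLS] token, and under causal attention
    # position 0 only sees itself, so this vector does not summarize the sentence.
    embeddings1[sentence] = hidden_states[-1][:, 0, :].squeeze()
    embeddings2[sentence] = hidden_states[-1].mean(dim=1).squeeze() # average over all tokens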

In [None]:
embeddings1[sentence].shape, embeddings2[sentence].shape

(torch.Size([768]), torch.Size([768]))
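The two vectors above are both obtained by pooling per-token hidden states into one sentence vector. As a rough illustration of the options, the sketch below implements mean pooling and last-token pooling on plain Python lists. Last-token pooling is not used in this notebook; it is shown as an assumption-labeled alternative that is sometimes preferred for causal models, because only the final position has attended to the whole sentence. The function names and toy vectors are hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of all real (mask == 1) tokens."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def last_token_pool(token_vectors, attention_mask):
    """Return the vector of the last real token in the sequence."""
    real_positions = [i for i, keep in enumerate(attention_mask) if keep == 1]
    if not real_positions:
        raise ValueError("attention_mask selects no tokens")
    return list(token_vectors[real_positions[-1]])


# Toy sequence: three positions with 2-dimensional states, the last one is padding.
states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

print(mean_pool(states, mask))        # [2.0, 3.0]
print(last_token_pool(states, mask))  # [3.0, 4.0]
```

With a batch size of 1 and no padding, as in the loop above, the mask is all ones and mean pooling reduces to a plain average over the sequence dimension.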

In [None]:
import time
run_id = str(int(time.time()))
fname = 'CLM_retrained_{}_{}_{}.json'.format(model_checkpoint, 0, run_id)
print(fname)

CLM_retrained_distilgpt2_0_1651359440.json


In [None]:
# torch.save writes PyTorch's pickle-based format, so despite the .json suffix
# these files must be read back with torch.load, not with a JSON parser.
torch.save(embeddings1, '/content/drive/MyDrive/11785_Project/Programming/embeddings/{}'.format(fname))
torch.save(sentence_hidden_states, '/content/drive/MyDrive/11785_Project/Programming/embeddings/hidden_{}'.format(fname))
torch.save(sentence_attention, '/content/drive/MyDrive/11785_Project/Programming/embeddings/attention_{}'.format(fname))

In [None]:
torch.save(embeddings1, '/content/drive/MyDrive/11785_Project/Programming/embeddings/CLM_embeddings/CLS_{}'.format(fname))
torch.save(embeddings2, '/content/drive/MyDrive/11785_Project/Programming/embeddings/CLM_embeddings/avg_{}'.format(fname))

## Masked language modeling

For masked language modeling (MLM) we are going to use the same preprocessing as before for our dataset with one additional step: we will randomly mask some tokens (by replacing them with the tokenizer's mask token, `<mask>` for RoBERTa checkpoints and `[MASK]` for BERT ones) and the labels will be adjusted to only include the masked tokens (we don't have to predict the non-masked tokens).
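The masking step can be illustrated without any library. The sketch below is a simplified version of the idea: each token is selected with probability 0.15, selected tokens are replaced by a mask id, and every other label is set to -100, the default index ignored by PyTorch's cross-entropy loss. It is not the collator's actual implementation, which additionally leaves some selected tokens unchanged or swaps them for random tokens. The mask id used here is a made-up placeholder, not the id of any real tokenizer.

```python
import random

IGNORE_INDEX = -100             # label value skipped by the loss
HYPOTHETICAL_MASK_ID = 99999    # placeholder; a real tokenizer defines its own id


def mask_tokens(input_ids, mask_token_id, mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked_ids, labels = [], []
    for token_id in input_ids:
        if rng.random() < mlm_probability:
            # Selected: hide the token and ask the model to predict it.
            masked_ids.append(mask_token_id)
            labels.append(token_id)
        else:
            # Not selected: keep the token and exclude it from the loss.
            masked_ids.append(token_id)
            labels.append(IGNORE_INDEX)
    return masked_ids, labels


original = list(range(100, 140))  # 40 fake token ids
masked, labels = mask_tokens(original, HYPOTHETICAL_MASK_ID)

num_masked = sum(1 for label in labels if label != IGNORE_INDEX)
print(f"{num_masked} of {len(original)} tokens were masked")
```

Because the selection is random per token, the fraction actually masked in a short sequence can differ noticeably from 15%; it only converges to that value over many tokens.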

We will use the [`distilroberta-base`](https://huggingface.co/distilroberta-base) model for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=masked-lm) instead:

In [None]:
model_checkpoint = "distilroberta-base"
#'distilbert-base-uncased'
#

We can apply the same tokenization function as before; we just need to update our tokenizer to use the checkpoint we just picked:

In [None]:
from transformers import AutoTokenizer
from transformers import Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

 

And like before, we group texts together and chunk them in samples of length `block_size`. You can skip that step if your dataset is composed of individual sentences.

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

 

In [None]:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [None]:
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    
)

In [None]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

Then we just have to pass everything to `Trainer` and begin training:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 853
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 321


Epoch,Training Loss,Validation Loss
1,No log,2.372296
2,No log,2.178818
3,No log,2.168219


***** Running Evaluation *****
  Num examples = 853
  Batch size = 8
***** Running Evaluation *****
  Num examples = 853
  Batch size = 8
***** Running Evaluation *****
  Num examples = 853
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=321, training_loss=2.526373462142231, metrics={'train_runtime': 70.4917, 'train_samples_per_second': 36.302, 'train_steps_per_second': 4.554, 'total_flos': 84844800767232.0, 'train_loss': 2.526373462142231, 'epoch': 3.0})

### Phrasebank DataLoader

In [None]:
# For Downstream Data - Phrasebank
# -p: do not fail if the directories already exist
!mkdir -p downstream_data_zip
!mkdir -p downstream_data

!cp /content/drive/MyDrive/11785_Project/Programming/Data/phrasebank.zip downstream_data_zip

# -n: never overwrite files that were already extracted (avoids the interactive prompt)
!unzip -q -n downstream_data_zip/*.zip -d downstream_data


The csv is already very clean, so I'll write a quick dataloader.

In [None]:
import pandas
import torch
from torch.utils.data import DataLoader
import numpy as np

In [None]:
!pip install pytorch-pretrained-bert --quiet
!pip install -qq transformers

In [None]:
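from transformers import AutoModelForMaskedLM, AutoTokenizer
import pandas
import torch
from torch.utils.data import DataLoader
import numpy as np

model_checkpoint = "distilroberta-base"

# The tokenizer must come from the same checkpoint as the model: feeding
# bert-base-cased token ids to a RoBERTa model would index the wrong vocabulary.
# The variable name is kept so the cells below still work.
bert_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)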

In [None]:
class LibriSamplesPhrasebank(torch.utils.data.Dataset):
    def __init__(self, csv_path:str, tokenizer, convert_label_to_int=True, return_labels=False, sentence_to_vec:dict=None):
        """
        csv_path:str
            This is the path to the all-data.csv in the Financial Phrase Bank
        
        convert_label_to_int:bool
            If this is true, then:
                 negative = -1
                 neutral  =  0
                 positive =  1
        
        sentence_to_vec:dict
            If this value is present, then dictionary of sentence to vector mappings.
        """

        self.tokenizer=tokenizer
        self.max_len = 128

        # Simply reading in the csv
        dataframe = pandas.read_csv(csv_path, encoding="ISO-8859-1", names=["label", "sentence"])
        self.X = dataframe["sentence"].to_numpy()
        # Convert the sentences to vectors if a dictionary is provided
        if sentence_to_vec is not None:
            new_X = []
            for sentence in self.X:
                vector = sentence_to_vec.get(sentence, None)
                # Use an identity check: `== None` on an array or tensor compares element-wise
                if vector is None:
                    raise Exception("The dictionary contains no vector for the sentence: \n{}".format(sentence))
                new_X.append(vector)
            self.X = np.array(new_X)

        # Check if we should be returning labels (not necessary for generating the embeddings)
        self.return_labels = return_labels
        if return_labels:
            self.Y = dataframe["label"].to_numpy()
            # Probably will want to convert the data into numeric form for easier handling
            if convert_label_to_int:
                new_Y = np.zeros(len(self.Y), dtype=np.int8)
                new_Y[self.Y == "negative"] = -1
                new_Y[self.Y == "positive"] = 1
                self.Y = new_Y

    def __len__(self):
        """
        Get the size of the data.
        """
        return len(self.X)
    
    def __getitem__(self, ind):
        """
        Returns the sentence, its token ids and its attention mask.
        If self.return_labels==True, the corresponding label is appended as a fourth element.
        """

        sentence = self.X[ind]

        encoding = self.tokenizer.encode_plus(
          sentence,
          max_length=self.max_len,
          # Without truncation, sentences longer than max_len are not cut
          truncation=True,
          padding='max_length',
          return_tensors='pt',
        )
        item = (sentence, encoding['input_ids'].flatten(), encoding['attention_mask'].flatten())

        # self.Y only exists when labels were requested in __init__
        if self.return_labels:
            item += (torch.tensor(self.Y[ind], dtype=torch.long),)
        return item

In [None]:
device = "cuda"
model = model.to(device)

In [None]:
embeddings = {}

# Test the dataloader
csv_path = "/content/downstream_data/all-data.csv"
phrasebank_data = LibriSamplesPhrasebank(csv_path, bert_tokenizer, convert_label_to_int=True, return_labels=True)
phrasebank_loader = DataLoader(phrasebank_data, batch_size=1, shuffle=False) #Shuffle is false just for demonstration purposes
for i, data in enumerate(phrasebank_loader):
    torch.cuda.empty_cache()

    sentence, input_ids, attention_mask, label = data
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)
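    # Inference only: no_grad avoids building the autograd graph
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask, output_hidden_states=True)

    # Average the last-layer states over the real (non-padding) tokens only
    actual_seq_len = int(attention_mask.sum())
    embedding = outputs.hidden_states[-1][:, :actual_seq_len, :].mean(dim=1).squeeze()
    # Move to CPU so stored embeddings do not accumulate in GPU memory
    embeddings[sentence] = embedding.cpu()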
    
    if (i+1) % 100 == 0:
        torch.save(embeddings, '/content/drive/MyDrive/11785_Project/Programming/embeddings/last_layer_plain_bert/last_layer_plain_bert_{}.json'.format(i//100))
        embeddings = {}

In [None]:
torch.save(embeddings, '/content/drive/MyDrive/11785_Project/Programming/embeddings/last_layer_plain_bert.json')

In [None]:
len(phrasebank_loader)

4846
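With 4846 sentences and a flush every 100 iterations, the loop above writes 48 full shards and leaves 46 embeddings in memory for the final save. The sketch below reproduces that bookkeeping with plain Python so the shard count can be checked before launching a long run; `shard_sizes` and `iter_shards` are hypothetical helper names.

```python
def shard_sizes(num_items, shard_size):
    """Sizes of the shards produced when flushing every shard_size items."""
    full_shards, remainder = divmod(num_items, shard_size)
    sizes = [shard_size] * full_shards
    if remainder:
        sizes.append(remainder)  # the final, partially filled shard
    return sizes


def iter_shards(items, shard_size):
    """Yield consecutive slices of at most shard_size items."""
    for start in range(0, len(items), shard_size):
        yield items[start : start + shard_size]


sizes = shard_sizes(4846, 100)
print(len(sizes), sizes[-1])  # 49 46

# The same split applied to an actual list of (fake) sentence ids.
fake_sentences = list(range(4846))
shards = list(iter_shards(fake_sentences, 100))
print(len(shards), len(shards[-1]))  # 49 46
```

When loading the results back, all 48 numbered files plus the final unnumbered file have to be merged to recover the full set of embeddings.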

In [None]:
!pip install sentence_transformers

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim


model = SentenceTransformer(r"sentence-transformers/paraphrase-MiniLM-L6-v2")


embd_a = model.encode("What is your age?")
embd_b = model.encode("How old are you?")


sim_score = cos_sim(embd_a, embd_b)

print(sim_score)

Collecting sentence_transformers
  Downloading sentence-transformers-2.2.0.tar.gz (79 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for senten

tensor([[0.8648]])
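The score above is a cosine similarity: the dot product of the two embeddings divided by the product of their lengths. For reference, here is a dependency-free sketch of that formula on plain Python lists; the function name and the toy vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|), ranging from -1 to 1."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the vectors are normalized by their lengths, the score depends only on the angle between the embeddings, not on their magnitude.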


In [None]:
embd_a.shape

(384,)

In [None]:
embd_b.shape

(384,)