<a href="https://colab.research.google.com/github/pksvv/lang-transformers/blob/main/HuggingFace_FinBERT_TxtClx.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
pip install transformers

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers


In [6]:
model_name = "ProsusAI/finbert"

In [7]:
from transformers import BertForSequenceClassification

In [9]:
model = BertForSequenceClassification.from_pretrained(model_name)

In [10]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained(model_name)

1. Tokenize the input text.
2. Feed the token IDs >>> model.
3. The model outputs activations (logits), not probabilities, so we need a softmax to convert them: model activations >>> probabilities.
4. Take the argmax of those probabilities to get the predicted class index.
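
Steps 3 and 4 can be tried on their own, without the model. The sketch below is plain Python; the `softmax` helper is hand-written for illustration (it is not the PyTorch function used later), and the three logits are the values the model returns for the example text further down in this notebook.

```python
import math

def softmax(logits):
    """Illustrative helper: turn raw activations into probabilities that sum to 1."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.8200, 2.4484, 0.0216]               # activations copied from the model output in this notebook
probs = softmax(logits)                          # step 3: activations >>> probabilities
pred = max(range(len(probs)), key=probs.__getitem__)  # step 4: argmax

print([round(p, 4) for p in probs])              # [0.0127, 0.9072, 0.0801]
print(pred)                                      # 1
```

Softmax keeps the ordering of the logits, so the argmax of the probabilities is the same as the argmax of the raw activations; the softmax is only needed when the probabilities themselves are of interest.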

In [11]:
# this is our example text
txt = ("Given the recent downturn in stocks especially in tech which is likely to persist as yields keep going up, "
       "I thought it would be prudent to share the risks of investing in ARK ETFs, written up very nicely by "
       "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). The risks comes "
       "primarily from ARK's illiquid and very large holdings in small cap companies. ARK is forced to sell its "
       "holdings whenever its liquid ETF gets hit with outflows as is especially the case in market downturns. "
       "This could force very painful liquidations at unfavorable prices and the ensuing crash goes into a "
       "positive feedback loop leading into a death spiral enticing even more outflows and predatory shorts.")

In [12]:
tokens = tokenizer.encode_plus(txt, 
                               max_length=512,
                               truncation=True,
                               padding='max_length',
                               add_special_tokens=True,
                               return_tensors='pt')

**Special Tokens in BERT**

*   [CLS] = 101
*   [SEP] = 102
*   [MASK] = 103
*   [UNK] = 100
*   [PAD] = 0
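
To see where these IDs end up, here is a simplified sketch of what `add_special_tokens=True` together with `padding='max_length'` roughly does to a sequence. The helper `add_special_and_pad` is hypothetical and only mimics the layout; the real tokenizer also performs WordPiece splitting and returns `token_type_ids`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # special token IDs from the list above

def add_special_and_pad(token_ids, max_length):
    """Hypothetical helper: wrap in [CLS]/[SEP], truncate, pad, and build the attention mask."""
    body = token_ids[: max_length - 2]             # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)          # 1 = real token
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad  # 0 = padding

ids, mask = add_special_and_pad([2445, 1996, 3522], max_length=8)
print(ids)   # [101, 2445, 1996, 3522, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is what lets the model ignore the `[PAD]` positions, which is why padding to 512 does not change the prediction.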

In [13]:
tokens

{'input_ids': tensor([[  101,  2445,  1996,  3522,  2091, 22299,  1999, 15768,  2926,  1999,
          6627,  2029,  2003,  3497,  2000, 29486,  2004, 16189,  2562,  2183,
          2039,  1010,  1045,  2245,  2009,  2052,  2022, 10975, 12672,  3372,
          2000,  3745,  1996, 10831,  1997, 19920,  1999, 15745,  3802, 10343,
          1010,  2517,  2039,  2200, 19957,  2011,  1031,  1996,  4562,  5430,
          1033,  1006, 16770,  1024,  1013,  1013,  1996,  4783,  2906, 27454,
          1012,  4942,  9153,  3600,  1012,  4012,  1013,  1052,  1013,  2569,
          1011,  3179,  1011,  2097,  1011, 15745,  1011, 15697,  1011,  6271,
          1007,  1012,  1996, 10831,  3310,  3952,  2013, 15745,  1005,  1055,
          5665, 18515, 21272,  1998,  2200,  2312,  9583,  1999,  2235,  6178,
          3316,  1012, 15745,  2003,  3140,  2000,  5271,  2049,  9583,  7188,
          2049,  6381,  3802,  2546,  4152,  2718,  2007,  2041, 12314,  2015,
          2004,  2003,  2926,  1996,  

# Different ways of passing arguments to a function

## Without `**kwargs`
    random_func(var1="hello", var2="world")

## With `**kwargs`
    input_dict = {'var1': 'hello', 'var2': 'world'}
    random_func(**input_dict)
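
A runnable version of the two call styles, as a minimal sketch. `random_func` here is a stand-in defined only for this illustration.

```python
def random_func(var1, var2):
    """Stand-in function: joins its two keyword arguments."""
    return f"{var1} {var2}"

# Passing keyword arguments explicitly
direct = random_func(var1="hello", var2="world")

# Unpacking a dict: each key becomes a keyword argument name
input_dict = {"var1": "hello", "var2": "world"}
unpacked = random_func(**input_dict)

print(direct)              # hello world
print(direct == unpacked)  # True
```

The same mechanism applies to the tokenizer output: `model(**tokens)` passes each entry of `tokens` (for a BERT tokenizer, typically `input_ids`, `token_type_ids` and `attention_mask`) to the model as a keyword argument.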

In [15]:
output = model(**tokens)

output

SequenceClassifierOutput([('logits',
                           tensor([[-1.8200,  2.4484,  0.0216]], grad_fn=<AddmmBackward0>))])

In [16]:
# Extract the logits tensor (first element of the output, also available as output.logits)

output[0]

tensor([[-1.8200,  2.4484,  0.0216]], grad_fn=<AddmmBackward0>)

In [17]:
import torch.nn.functional as F

In [20]:
probs = F.softmax(output[0], dim=-1) # dim=-1 means final dimension
probs

tensor([[0.0127, 0.9072, 0.0801]], grad_fn=<SoftmaxBackward0>)

In [22]:
import torch

pred = torch.argmax(probs)  # index of the most probable class (argmax over the flattened tensor, fine for a single input)

In [23]:
pred.item()

1
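
The index `1` only becomes useful once it is mapped to a label name. Hugging Face models expose this mapping as `model.config.id2label`. The sketch below uses a hard-coded dictionary that is an assumption about what the `ProsusAI/finbert` config contains; check it against `model.config.id2label` before relying on it. The helper `label_for` is hypothetical.

```python
# Assumed mapping for ProsusAI/finbert; verify with model.config.id2label
ASSUMED_ID2LABEL = {0: "positive", 1: "negative", 2: "neutral"}

def label_for(class_index, id2label=ASSUMED_ID2LABEL):
    """Hypothetical helper: translate a predicted class index into its label name."""
    return id2label[class_index]

pred_index = 1                       # value of pred.item() from the cell above
print(label_for(pred_index))         # negative
```

In the notebook itself, `model.config.id2label[pred.item()]` would give the label without any hard-coded dictionary.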