# FinBERT-QA Inference Exploration

This is an attempt to build a minimal inference script that uses the fine-tuned models from FinBERT_QA.

Work in Progress.

## Note
- This notebook was run and tested in the root directory of the FinBERT_QA repo, after downloading all the models locally.
    - It will probably not run here as-is without adjusting paths, installing packages, etc.

The one function called from `FinBERT_QA(config).search()` that I couldn't carry over is

```
def get_top_k_search_hits(fiqa_index, k, query):
    from pyserini.search import pysearch
    searcher = pysearch.SimpleSearcher(fiqa_index)
    return searcher.search(query, k=50)
```

since I think we are replacing it with Jina(?). For now a dummy function that returns the first `k` docids takes its place, so the pipeline runs end to end, but the results are nonsense.
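
Until a real retriever is wired in, a slightly less arbitrary placeholder than "first `k` docids" would be a plain token-overlap ranking over `docid_to_text`. The block below is only a sketch under that assumption and does not reproduce pyserini or Jina behaviour: `SearchHit`, `tokenize`, and `overlap_top_k` are hypothetical names, and the three-entry `docs` dict is made-up sample data standing in for the real pickle.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class SearchHit:
    """Hypothetical stand-in for a search hit; only `docid` is used downstream."""
    docid: int
    score: float


def tokenize(text):
    # Lowercase and keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_top_k(docid_to_text, k, query):
    """Rank documents by how often they contain the query's tokens."""
    query_tokens = set(tokenize(query))
    hits = []
    for docid, text in docid_to_text.items():
        counts = Counter(tokenize(text))
        score = sum(counts[token] for token in query_tokens)
        if score > 0:
            hits.append(SearchHit(docid=docid, score=float(score)))
    # Highest score first; ties broken by docid so the order is deterministic.
    hits.sort(key=lambda hit: (-hit.score, hit.docid))
    return hits[:k]


# Made-up sample data, not taken from the FiQA corpus.
docs = {
    1: "A health FSA cannot be used for insurance premiums.",
    2: "Tesla stock rose after the company reported earnings.",
    3: "The company said the acquisition will close next quarter.",
}
hits = overlap_top_k(docs, k=2, query="Which company did Tesla acquire?")
print([hit.docid for hit in hits])
```

Because each hit exposes a `docid` attribute, the result could be consumed by the same loop in `search()` that reads `hits[i].docid`. Note that the sketch takes the docid-to-text mapping as its first argument rather than an index path.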

In [63]:
from pathlib import Path
from tqdm import tqdm

import numpy as np
import random
import torch
import json
import os
import sys

from torch.nn.functional import softmax

from transformers import BertTokenizer
from transformers import BertForSequenceClassification

Config structure:

- If `user_input` is `True`, the query is prompted for interactively on the command line and replaces `query`. The `FinBERT_QA` class in this notebook does not read this key.

In [64]:
config = {
    'user_input': False,
    'query': "Which company did Elon Musk acquire today?",
    'top_k': 5,
    'bert_model_name': 'bert_qa',
    'device': 'cpu',
    'max_seq_len': 128
}
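
The `user_input` switch described above could be handled by a small helper that decides which query to run. This is a sketch only: `resolve_query` and its `input_fn` parameter are hypothetical names, not part of finbert_qa.py, and the fallback to the configured query on empty input is an assumption.

```python
def resolve_query(config, input_fn=input):
    """Return the query to run, prompting only when `user_input` is truthy."""
    if config.get('user_input'):
        typed = input_fn("Enter your question: ").strip()
        # Assumption: fall back to the configured query if nothing was typed.
        return typed or config['query']
    return config['query']


example_config = {
    'user_input': False,
    'query': "Which company did Elon Musk acquire today?",
}
print(resolve_query(example_config))
```

Passing `input_fn` explicitly keeps the helper usable in a notebook or a test, where blocking on `input()` is inconvenient.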

In [65]:
# Essentially copied from finbert_qa.py

import pickle
path = str(Path.cwd())

with open(path + '/data/id_to_text/docid_to_text.pickle', 'rb') as f:
    docid_to_text = pickle.load(f)
    
with open(path + '/data/id_to_text/qid_to_text.pickle', 'rb') as f:
    qid_to_text = pickle.load(f)
    
with open(path + '/data/data_pickle/labels.pickle', 'rb') as f:
    labels = pickle.load(f)
    
fiqa_index = path + "/retriever/lucene-index-fiqa"
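
Since the cell above only works from the repo root, a loader that reports the missing path makes failures elsewhere easier to diagnose. The following is a minimal sketch: `load_pickle` is a hypothetical helper, and the demo writes made-up data to a temporary directory instead of touching the real `data/` folder.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(base_dir, relative_path):
    """Load one pickle under `base_dir`, failing with a readable message."""
    file_path = Path(base_dir) / relative_path
    if not file_path.is_file():
        raise FileNotFoundError(
            f"Expected {file_path}; run from the FinBERT_QA repo root "
            "after downloading the data."
        )
    with file_path.open('rb') as f:
        return pickle.load(f)


# Self-contained demo with a temporary directory and made-up data.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / 'docid_to_text.pickle'
    with demo_file.open('wb') as f:
        pickle.dump({1: 'made-up answer text'}, f)
    demo = load_pickle(tmp, 'docid_to_text.pickle')

print(demo)
```

In the notebook this would be called as, for example, `load_pickle(Path.cwd(), 'data/id_to_text/docid_to_text.pickle')`.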

In [66]:
# Verbatim from finbert_qa.py

class BERT_MODEL():
    def __init__(self, bert_model_name):
        self.bert_model_name = bert_model_name
        
    def get_model(self):
        if self.bert_model_name == "bert-base":
            model_path = "bert-base-uncased"
        elif self.bert_model_name == "finbert-domain":
            model_path = str(Path.cwd()/'model/finbert-domain')
        elif self.bert_model_name == "finbert-task":
            model_path = str(Path.cwd()/'model/finbert-task')
        else:
            model_path = Path.cwd()/'model/bert-qa'
            
        model = BertForSequenceClassification.from_pretrained(model_path, \
                                                              cache_dir=None, \
                                                              num_labels=2)
        
        return model
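
The name-to-path logic in `get_model` can also be written as a lookup table, which makes the fallback explicit. This is a sketch, not the repo's code: `LOCAL_MODEL_DIRS`, `DEFAULT_LOCAL_DIR`, and `resolve_model_path` are hypothetical names, and the `root` parameter is added only so the function can be exercised outside the repo.

```python
from pathlib import Path

# Local directories as they appear in the if/elif chain above.
LOCAL_MODEL_DIRS = {
    'finbert-domain': 'model/finbert-domain',
    'finbert-task': 'model/finbert-task',
}
DEFAULT_LOCAL_DIR = 'model/bert-qa'


def resolve_model_path(bert_model_name, root=None):
    """Return a string that could be handed to `from_pretrained`."""
    if bert_model_name == 'bert-base':
        # Model identifier rather than a local directory.
        return 'bert-base-uncased'
    root = Path.cwd() if root is None else Path(root)
    # Any unrecognised name falls back to the default directory.
    return str(root / LOCAL_MODEL_DIRS.get(bert_model_name, DEFAULT_LOCAL_DIR))


print(resolve_model_path('bert-base'))
print(resolve_model_path('bert_qa', root='/repo'))
```

The `'bert_qa'` value used in `config` matches none of the named branches, so it resolves to the `model/bert-qa` default here, just as it falls through to the `else` branch in `get_model`.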

In [67]:
class DummySearchResult:
    def __init__(self, docid):
        self.docid = docid

def dummy_get_top_k_search_hits(fiqa_index, k, query):
    first_k_keys = list(docid_to_text.keys())[:k]
    return  [DummySearchResult(i) for i in first_k_keys]


class FinBERT_QA():
    def __init__(self, config):
        self.config = config
        self.bert_model_name = self.config['bert_model_name']
        self.device          = torch.device('cuda' if config['device'] == 'gpu' else 'cpu')
        self.max_seq_len     = self.config['max_seq_len']
        self.k = self.config['top_k']
        self.query = self.config['query']
        
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
        
        self.model = BERT_MODEL(self.bert_model_name).get_model().to(self.device)
        
    def search(self):
        
        model_path = str(Path.cwd()) + "/model/trained/finbert-qa/" + "2_finbert-qa-50_512_16_3e6.pt"
        
        self.model.load_state_dict(torch.load(model_path, 
                                              map_location=self.device), strict=False)
        
        # Put model in evaluation mode.
        self.model.eval()
        
        # This function is implemented with pyserini. We use JINA here?
        hits = dummy_get_top_k_search_hits(fiqa_index, self.k, self.query)
        
        cands = []
        for i in range(len(hits)):
            cands.append(int(hits[i].docid))
            
        if len(cands) == 0:
            print("\nNo answers found.")
        else:
            print("\nRanking...\n")
            self.rank, self.scores = self.predict(self.model, self.query, cands)
            
            print("Question: \n\t{}\n".format(self.query))
            
            if len(cands) < self.k:
                self.k = len(cands)
            
            print("Top-{} Answers: \n".format(self.k))
            for i in range(self.k):
                print("{}.\t{}\n".format(i+1, docid_to_text[self.rank[i]]))
        
        
    def predict(self, model, q_text, cands):
        """Re-ranks the candidate answers for a question.

        Returns:
            ranked_ans: list of re-ranked candidate docids
            sorted_scores: list of relevancy scores of the answers
        -------------------
        Arguments:
            model - PyTorch model
            q_text - str - query
            cands - list of retrieved candidate docids
        """
        # Convert list to numpy array
        cands_id = np.array(cands)
        # Empty list for the probability scores of relevancy
        scores = []
        # For each answer in the candidates
        for docid in cands:
            # Map the docid to text
            ans_text = docid_to_text[docid]
            # Create inputs for the model
            encoded_seq = self.tokenizer.encode_plus(q_text, ans_text,
                                                max_length=self.max_seq_len,
                                                pad_to_max_length=True,
                                                return_token_type_ids=True,
                                                return_attention_mask = True)

            # Numericalized, padded, clipped seq with special tokens
            input_ids = torch.tensor([encoded_seq['input_ids']]).to(self.device)
            # Specify question seq and answer seq
            token_type_ids = torch.tensor([encoded_seq['token_type_ids']]).to(self.device)
            # Specify which positions are real tokens and which are padding
            att_mask = torch.tensor([encoded_seq['attention_mask']]).to(self.device)
            # Don't calculate gradients
            with torch.no_grad():
                # Forward pass, calculate logit predictions for each QA pair
                outputs = model(input_ids, token_type_ids=token_type_ids, attention_mask=att_mask)
            # Get the predictions
            logits = outputs[0]
            # Apply activation function
            pred = softmax(logits, dim=1)
            # Move the predictions to CPU
            pred = pred.detach().cpu().numpy()
            # Append relevant scores to list (where label = 1)
            scores.append(pred[:,1][0])

        # Sort once, after every candidate has been scored
        # Get the indices of the sorted similarity scores
        sorted_index = np.argsort(scores)[::-1]
        # Get the list of docid from the sorted indices
        ranked_ans = list(cands_id[sorted_index])
        sorted_scores = list(np.around(sorted(scores, reverse=True),decimals=3))
            
        return ranked_ans, sorted_scores
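
The scoring and sorting at the end of `predict` can be checked in isolation, without loading the model. The sketch below uses pure Python in place of torch and numpy: `rank_candidates` is a hypothetical helper, the local `softmax` is a stand-in for `torch.nn.functional.softmax`, and the docids and logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_candidates(cands, logits_per_cand):
    """Sort docids by the probability of label 1, highest first."""
    # One relevance score per candidate: P(label = 1).
    scores = [softmax(logits)[1] for logits in logits_per_cand]
    # Sort a single time, after all candidates have been scored.
    order = sorted(range(len(cands)), key=lambda i: scores[i], reverse=True)
    ranked_ans = [cands[i] for i in order]
    sorted_scores = [round(scores[i], 3) for i in order]
    return ranked_ans, sorted_scores


# Made-up docids and two-class logits.
ranked, scores = rank_candidates(
    [10, 20, 30],
    [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
)
print(ranked, scores)
```

Candidate 20 has the largest logit for label 1 and ranks first; candidate 30 has equal logits, so its score is exactly 0.5.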

In [68]:
fqa = FinBERT_QA(config)

fqa.search()

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.



Ranking...



Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Question: 
	Which company did Elon Musk acquire today?

Top-5 Answers: 

1.	You can never use a health FSA for individual health insurance premiums.  Moreover, FSA plan sponsors can limit what they are will to reimburse.  While you can't use a health FSA for premiums, you could previously use a 125 cafeteria plan to pay premiums, but it had to be a separate election from the health FSA. However, under N. 2013-54, even using a cafeteria plan to pay for indivdiual premiums is effectively prohibited.

2.	So nothing preventing false ratings besides additional scrutiny from the market/investors, but there are some newer controls in place to prevent institutions from using them. Under the DFA banks can no longer solely rely on credit ratings as due diligence to buy a financial instrument, so that's a plus. The intent being that if financial institutions do their own leg work then *maybe* they'll figure out that a certain CDO is garbage or not.  Edit: lead in

3.	Here are the SEC requirements