<a href="https://colab.research.google.com/github/JKrse/nlp_QA_QG_app/blob/master/notebooks/nlp_demo_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demo for Question Answering

Updated: 25th August 2020

**To get started, simply collapse "Demo for Question Answering" and press run. Then open "Main" and run the script.**

---

This is a prototype that uses various pre-trained NLP models for the question answering task.

A default text snippet is defined in "Text snippet - Data"; simply change that input to ask questions about a new topic.

---
The available Hugging Face models, with the name of each pre-trained model, are listed here: 
https://huggingface.co/transformers/pretrained_models.html

Adding an extra model is easy: simply follow the pattern in the code and add the model name to 'Config.models' and 'modelsConfig'.

The pre-trained AllenNLP models can be found through the "Usage" section of the demo page: 
https://demo.allennlp.org/reading-comprehension

## Installation and packages

### Pip installations

Install the following packages: 

In [None]:
!pip install allennlp_models
!pip install allennlp
!pip install transformers 
!pip install pandas

### Import the libraries:

Import packages needed for running the prototype:

In [None]:
# For AllenNLP pre-trained models implementations: 
from allennlp.predictors.predictor import Predictor
import allennlp_models.rc

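# For Hugging Face models implementations: 
import torch
import transformers

from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
# `pipeline` is used in modelsConfig to load the question-answering models.
from transformers import pipeline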

import pandas as pd

import nltk
nltk.download('punkt')


## Functions

## Main

In [None]:
def main():
    models = Config.models
    input_messege = Config.input_messege
    input_context = Config.input_context
    
    # Selecting the model:
    model, model_library, tokenizer = select_model(models)
    
    context_input = input(input_context)
    if context_input == "demo":
        txt_doc = txt_demo
    else:
        txt_doc = context_input

    print_setences(txt_doc)

    while True: 
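        question = input(input_messege)

        # Match commands case-insensitively on a normalized copy, so the
        # question itself reaches the model with its original casing.
        command = question.strip().lower()
        if command == "quit": 
            break
        elif command == "new":
            model, model_library, tokenizer = select_model(models)
            continue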
        
        if model_library == "allennlp":
            answer = predict_QnA_allennlp(question, txt_doc, model)["best_span_str"]
        elif model_library == "huggingface":
            answer = predict_QnA_huggingface(question, txt_doc, model, tokenizer)
        elif model_library == "huggingface_pipline":
            answer = model(question=question, context=txt_doc)["answer"]

        print(f"\nAnswer: {answer}\n")
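
The branch on `model_library` in the loop above comes down to pulling the answer string out of differently shaped results. A minimal sketch of that step as a stand-alone helper follows. `extract_answer` is a hypothetical name that is not used elsewhere in this notebook, and the dictionaries in the usage lines are hand-written stand-ins for real model output.

```python
def extract_answer(result, model_library):
    """Return the answer string from a prediction result.

    Hypothetical helper: the key names mirror the ones used in main().
    """
    if model_library == "allennlp":
        # The AllenNLP predictors used here return a dict.
        return result["best_span_str"]
    if model_library == "huggingface_pipline":
        # The Hugging Face question-answering pipeline returns a dict.
        return result["answer"]
    if model_library == "huggingface":
        # predict_QnA_huggingface already returns a plain string.
        return result
    raise ValueError(f"Unknown model library: {model_library}")


# Hand-written stand-ins for real model output:
print(extract_answer({"best_span_str": "80%"}, "allennlp"))
print(extract_answer({"answer": "80%", "score": 0.9}, "huggingface_pipline"))
```

Raising on an unknown library tag avoids the situation where `answer` is never assigned and the `print` call fails with a less helpful error.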

### NLP pre-trained model [config]

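'Config' is a class that serves as a lookup further on: its 'models' dictionary maps each menu number to the name of a pre-trained model. The 'modelsConfig' function takes one of these names and loads the corresponding model, which makes each pre-trained model easy to access.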

In [None]:
class Config: 
    models = {"1" : "ELMo-BiDAF (Trained on SQuAD)", # allennlp
              "2" : "BiDAF (Trained on SQuAD)", # allennlp
              "3" : "Transformer QA (Trained on SQuAD)", # allennlp
              "4" : "distilbert-base-cased-distilled-squad", # huggingface
              "5" : "bert-large-uncased-whole-word-masking-finetuned-squad" # huggingface
              }
    
    input_messege ="Please enter your question (write 'quit' to exit or 'new' for model selection):\nQuestion: "

    input_context = "Please insert the context (write 'demo' for default context)\nContext: "

def modelsConfig(model):
    tokenizer = []

    if model == "ELMo-BiDAF (Trained on SQuAD)":
        model_selected = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bidaf-elmo-model-2020.03.19.tar.gz")
        model_library = "allennlp"
    elif model == "BiDAF (Trained on SQuAD)":
        model_selected = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bidaf-model-2020.03.19.tar.gz")
        model_library = "allennlp"
    elif model == "Transformer QA (Trained on SQuAD)":
        model_selected = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/transformer-qa-2020-05-26.tar.gz")
        model_library = "allennlp"
    elif model == "distilbert-base-cased-distilled-squad":
        model_selected = pipeline("question-answering", model=f"{model}")
        model_library = "huggingface_pipline"
    elif model == "bert-large-uncased-whole-word-masking-finetuned-squad":
        model_selected = pipeline("question-answering", model=f"{model}")
        model_library = "huggingface_pipline"
        # model_selected = BertForQuestionAnswering.from_pretrained(model)
        # tokenizer = BertTokenizer.from_pretrained(model)
        # model_library = "huggingface"
    else:
        raise Exception(f"Invalid model name: {model}")

    return model_selected, model_library, tokenizer
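
The if/elif chain above grows by one branch per model. As an alternative, the same pattern could be sketched as a lookup table, so that adding a model means adding one entry. This is only an illustration under stated assumptions: `MODEL_REGISTRY`, `load_hf_pipeline`, and `load_model` are hypothetical names that the rest of this notebook does not use, and only the two Hugging Face models are listed.

```python
def load_hf_pipeline(model_name):
    """Load a Hugging Face question-answering pipeline by model name."""
    # Imported lazily so the registry can be inspected without transformers.
    from transformers import pipeline
    return pipeline("question-answering", model=model_name)


# Hypothetical registry: model name -> (library tag, loader function).
# Adding a model means adding one entry here and one in Config.models.
MODEL_REGISTRY = {
    "distilbert-base-cased-distilled-squad":
        ("huggingface_pipline", load_hf_pipeline),
    "bert-large-uncased-whole-word-masking-finetuned-squad":
        ("huggingface_pipline", load_hf_pipeline),
}


def load_model(model_name, registry=MODEL_REGISTRY):
    """Return (model, model_library, tokenizer), like modelsConfig does."""
    if model_name not in registry:
        raise ValueError(f"Invalid model name: {model_name}")
    model_library, loader = registry[model_name]
    # The pipeline bundles its own tokenizer, so none is returned separately.
    return loader(model_name), model_library, None
```

Nothing is downloaded until `load_model` is called with a registered name, which keeps the table itself cheap to define and inspect.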

### Helper functions

Helper functions and wrappers that handle repetitive tasks and keep the code clear. 

In [None]:
def predict_QnA_allennlp(question, passage, model): 
    ''' 
    Thin wrapper around the AllenNLP predictor that follows the input
    convention used in the Hugging Face implementation:
        [QUESTION : ANSWER_TEXT]
    Returns the full prediction dictionary.
    '''
    prediction = model.predict(passage=passage, question=question)
    return prediction


def read_txt(filename):
    # The context manager closes the file even if reading fails.
    with open(filename) as file:
        txt = file.read().replace("\n", " ")
    return txt


def user_input_screen(model_dict):
    print("Hello! :) \n")
    
    print("This is the model selection: \n")
    # Show the actual dictionary keys, since these are what the user must type.
    for key, val in model_dict.items():
        print(f"[{key}] {val}")
    
    user_input = input("\nPick a model [*]: ")
    
    while str(user_input) not in list(model_dict):
        print("\nNot a valid model. Try again\n")
        user_input = input("\nPick a model [*]: ")
    
    
    model_selected = model_dict[user_input]
    print("\nPerfect - you picked the model:" \
            f"\n-------------\n{model_selected}\n-------------\n" \
            "Just warming up - I'll be ready right away! \n")

    return model_selected


def select_model(model_dict):
    '''
    Wrapper function for 'user_input_screen' and 'modelsConfig'
    '''
    user_model_select = user_input_screen(model_dict)
    model, model_library, tokenizer  = modelsConfig(user_model_select)
    
    return model, model_library, tokenizer


def print_setences(text):
  sentences = nltk.tokenize.sent_tokenize(text)
  print("\n ------------- \n")
  print("Sentences in your input:")
  for sent in sentences:
    print(f"- {sent}")
  print("\n ------------- \n")

### Q&A implementation for Hugging Face transformers
***Not used; kept for a better understanding of how the framework works. The demo uses the pipeline API instead.***

Insert the text in which the model should look for an answer.

Note that models limit the size of the input text (e.g. BERT models have a maximum of 512 tokens).
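
One common way to stay within such a limit is to split a long token sequence into overlapping windows and query the model once per window. The sketch below is an assumption about how that could look, not something this notebook does. `split_into_windows`, `max_len`, and `stride` are hypothetical names, and whitespace-split words stand in for real tokenizer output. In practice `max_len` would also need to leave room for the question and the special tokens.

```python
def split_into_windows(tokens, max_len=512, stride=128):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens, so an answer that is
    shorter than the overlap is fully contained in at least one window.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        # Step forward, keeping `stride` tokens of overlap.
        start += max_len - stride
    return windows


# Whitespace-split words stand in for real tokenizer output here.
words = ("unstructured text accounts for most company data " * 200).split()
windows = split_into_windows(words, max_len=512, stride=128)
print(len(words), [len(w) for w in windows])
```

Each window would then be passed to the model as its own context, and the answer with the highest score kept.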

The necessary pre-processing for the Hugging Face pre-trained models is implemented as a function.
This is a simple, modified implementation (see source: https://colab.research.google.com/drive/1uSlWtJdZmLrI3FCNIlUHFxwAJiSu2J0-#scrollTo=MVNVGN5-gI06)

In [None]:
def predict_QnA_huggingface(question, answer_text, model, tokenizer):
    '''
    Takes a `question` string and an `answer_text` string (which contains the
    answer), and identifies the words within the `answer_text` that are the
    answer. Returns them as a single string.
    '''
    # ======== Tokenize ========
    # Apply the tokenizer to the input text, treating them as a text-pair.
    input_ids = tokenizer.encode(question, answer_text)

    # Report how long the input sequence is.
    # print(f'Query has {len(input_ids)} tokens.\n') 

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(tokenizer.sep_token_id)

    # The number of segment A tokens includes the [SEP] token itself.
    num_seg_a = sep_index + 1

    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of 0s and 1s.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    # ======== Evaluate ========
    # Run our example question through the model. Indexing the output works
    # both for the tuple returned by older transformers releases and for the
    # ModelOutput object returned by newer ones.
    with torch.no_grad():
        outputs = model(torch.tensor([input_ids]), # The tokens representing our input text.
                        token_type_ids=torch.tensor([segment_ids])) # The segment IDs to differentiate question from answer_text
    start_scores, end_scores = outputs[0], outputs[1]

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores.
    answer_start = int(torch.argmax(start_scores))
    answer_end = int(torch.argmax(end_scores))

    # Get the string versions of the input tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):
        
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]

    return answer
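
Two steps of the function above can be tried in isolation, without loading a model: building the segment IDs and recombining subword tokens. The sketch below restates them as small helpers. `build_segment_ids` and `join_wordpieces` are hypothetical names, and the token ids and tokens are toy values chosen for illustration.

```python
def build_segment_ids(input_ids, sep_token_id):
    """Mark tokens up to and including the first [SEP] as 0, the rest as 1."""
    num_seg_a = input_ids.index(sep_token_id) + 1
    return [0] * num_seg_a + [1] * (len(input_ids) - num_seg_a)


def join_wordpieces(tokens):
    """Join tokens, gluing '##' continuation pieces to the previous token."""
    answer = tokens[0]
    for token in tokens[1:]:
        if token.startswith("##"):
            # Subword piece: append without a space and drop the '##' marker.
            answer += token[2:]
        else:
            answer += " " + token
    return answer


# Toy ids: 102 plays the role of [SEP]; the other values are arbitrary.
toy_ids = [101, 7, 8, 102, 9, 10, 11, 102]
print(build_segment_ids(toy_ids, sep_token_id=102))  # [0, 0, 0, 0, 1, 1, 1, 1]
print(join_wordpieces(["text", "anal", "##ytic", "##s"]))  # text analytics
```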

## Text snippet - Data


In [None]:
txt_demo = "By most industry estimates, unstructured text accounts for 80% of the data available to companies. Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can be leveraged in various ways. Infosys Text Analytics Platform (ITAP) enables faster processing of retail and commercial loan processing through digital customer onboarding, instant KYC and instant credit decisioning. Costs associated with paper-based trails and back office involvement are reduced through digitizing loan forms and information extraction for underwriting, and OCR led zero manual entry approach, cross reference document validation for identity validation. It enables customer servicing requests to be handled on self-servicing platforms through interaction transcript analytics. The platform enhances operational resilience by detecting anomalies in financial transactions and enables ‘Customer 3600’ by representing information as a knowledge graph. It supports product hyper-personalization by providing deeper client insights to reduce churn and increase cross-sell. It assists with advanced portfolio modeling for robo-advisors to take next-best-action and improve advisor productivity. It helps with document reviews and improves compliance checks and also supports transaction monitoring & fraud detection, review of risk models and stress testing. Infosys Text Analytics Platform offers a suite of six ready to use solutions and API-based services ranging from semantic search, skill knowledge graph, sentiment or subjectivity analysis, rule extraction from legal documents, document classification or categorization, email or chat-based automation, log comparison, automated data labeling, and much more."

### Default text
The default text describes the Infosys Text Analytics Platform (ITAP): 
https://www.infosys.com/industries/financial-services/industry-offerings/text-analytics-platform.html

"By most industry estimates, unstructured text accounts for 80% of the data available to companies. Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can be leveraged in various ways.

Infosys Text Analytics Platform (ITAP) enables faster processing of retail and commercial loan processing through digital customer onboarding, instant KYC and instant credit decisioning. Costs associated with paper-based trails and back office involvement are reduced through digitizing loan forms and information extraction for underwriting, and OCR led zero manual entry approach, cross reference document validation for identity validation. It enables customer servicing requests to be handled on self-servicing platforms through interaction transcript analytics. The platform enhances operational resilience by detecting anomalies in financial transactions and enables ‘Customer 3600’ by representing information as a knowledge graph. It supports product hyper-personalization by providing deeper client insights to reduce churn and increase cross-sell. It assists with advanced portfolio modeling for robo-advisors to take next-best-action and improve advisor productivity. It helps with document reviews and improves compliance checks and also supports transaction monitoring & fraud detection, review of risk models and stress testing.

Infosys Text Analytics Platform offers a suite of six ready to use solutions and API-based services ranging from semantic search, skill knowledge graph, sentiment or subjectivity analysis, rule extraction from legal documents, document classification or categorization, email or chat-based automation, log comparison, automated data labeling, and much more."

# Main

In [None]:
if __name__ == "__main__":
  main()