NLP tasks with a simple interface

# Small Specific Model 
- Designed for a specific task
- Performance on that task comparable to a general-purpose LLM
- Cheaper and faster to run

## Distillation

- Distillation is an ML technique that transfers knowledge from a larger, more complex model to a smaller, simpler one.
- The goal is a more computationally efficient model that retains much of the larger model's predictive power.
- Knowledge distillation is especially useful when deploying large models is not feasible because of computational or memory constraints. Distilling the knowledge into a smaller model extends the benefits of deep learning to devices and platforms with limited resources.
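
One common way to express this transfer is a loss that pushes the student's output distribution toward the teacher's temperature-softened distribution. The sketch below is a minimal pure-Python illustration of that idea. The logits, the temperature value, and the function names are assumptions made for the example, and this is not necessarily how any particular distilled model in this notebook was trained.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over three classes for a single input
teacher = [4.0, 1.0, 0.2]
good_student = [3.8, 1.1, 0.3]
poor_student = [0.2, 1.0, 4.0]

print(distillation_loss(teacher, good_student))  # small: the student agrees with the teacher
print(distillation_loss(teacher, poor_student))  # large: the student disagrees
```

In practice this term is usually combined with the ordinary task loss on the true labels, and both are minimized while training the student.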


# 1. Building a text summarization app
- Here we use sshleifer/distilbart-cnn-12-6, a 306M parameter model distilled from facebook/bart-large-cnn (the concept of distillation is explained earlier). In this notebook it is loaded locally with the `transformers` pipeline rather than called through an Inference Endpoint.
- We chose this model because it is a well-established model for text summarization that is cheaper and faster to run than its larger teacher.

In [1]:
# Load the HF API key and relevant Python libraries

import os
import io
import base64
from IPython.display import Image, display, HTML
from PIL import Image as PILImage  # aliased so it does not shadow IPython's Image
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # read local .env file
# Read the key from the environment (.env); never hard-code secrets in a notebook
hf_api_key = os.environ.get('HF_API_KEY')


Here we use the Hugging Face `transformers` pipeline API to download the model and run it locally.

In [2]:
from transformers import pipeline

get_completion = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")


def summarize(input):
    output = get_completion(input)
    return output[0]['summary_text']



## Example 1

In [3]:
text = ('''The tower is 324 metres (1,063 ft) tall, about the same height
        as an 81-storey building, and the tallest structure in Paris. 
        Its base is square, measuring 125 metres (410 ft) on each side. 
        During its construction, the Eiffel Tower surpassed the Washington 
        Monument to become the tallest man-made structure in the world,
        a title it held for 41 years until the Chrysler Building
        in New York City was finished in 1930. It was the first structure 
        to reach a height of 300 metres. Due to the addition of a broadcasting 
        aerial at the top of the tower in 1957, it is now taller than the 
        Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the 
        Eiffel Tower is the second tallest free-standing structure in France 
        after the Millau Viaduct.''')

get_completion(text)

# If we just want the summarized text
# summarize(text)

[{'summary_text': ' The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building . It is the tallest structure in Paris and the second tallest free-standing structure in France after the Millau Viaduct . It was the first structure in the world to reach a height of 300 metres .'}]
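
The summary above comes back with a leading space and a space before each full stop. A minimal sketch of a clean-up step is shown here. `tidy_summary` is a hypothetical helper that is not part of the original notebook, and the regular expression only covers common punctuation marks.

```python
import re

def tidy_summary(summary):
    """Remove whitespace before punctuation and trim the ends of the string."""
    return re.sub(r"\s+([.,;:!?])", r"\1", summary).strip()

# Shortened version of the model output shown above
raw = (" The tower is 324 metres (1,063 ft) tall, about the same height as an "
       "81-storey building . It is the tallest structure in Paris .")
print(tidy_summary(raw))
```

The helper could be applied to `output[0]['summary_text']` inside `summarize` before the text is returned.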

## Create a UI demo to share the summarization app with others

In [4]:
import gradio as gr

port_number = 2000 # Custom port number

os.environ['PORT2'] = str(port_number)


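# Same helper as in the earlier cell, repeated so this cell runs on its own
def summarize(input):
    output = get_completion(input)
    return output[0]['summary_text']

# Close any Gradio servers left running by earlier cells
gr.close_all()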

demo = gr.Interface(fn=summarize, 
                    inputs=[gr.Textbox(label="Text to summarize", lines = 6)],
                    outputs=[gr.Textbox(label="Result", lines = 3)],
                    title="Text summarization with distilbart-cnn",
                    allow_flagging = 'never',
                    description="Summarize any text using the `sshleifer/distilbart-cnn-12-6` model under the hood!"
                   )
demo.launch(share=False, server_port=int(os.environ['PORT2']))

Running on local URL:  http://127.0.0.1:2000

To create a public link, set `share=True` in `launch()`.
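
Once the demo is shared, users may submit an empty or very short text box. A minimal sketch of a guard around the summarizer follows. `make_safe_summarizer`, `min_chars`, and `fake_summarizer` are hypothetical names introduced for this example, and the stand-in summarizer only imitates the output shape of the real pipeline so the sketch runs without downloading a model.

```python
def make_safe_summarizer(summarizer, min_chars=50):
    """Wrap a summarizer so that empty or very short inputs return a hint instead."""
    def safe_summarize(text):
        text = (text or "").strip()
        if len(text) < min_chars:  # threshold is an arbitrary choice for illustration
            return "Please enter a longer passage to summarize."
        return summarizer(text)[0]["summary_text"]
    return safe_summarize

# Stand-in with the same output shape as the summarization pipeline
def fake_summarizer(text):
    return [{"summary_text": text[:40]}]

safe_summarize = make_safe_summarizer(fake_summarizer)
print(safe_summarize(""))
print(safe_summarize("The tower is 324 metres tall, about the same height as an 81-storey building."))
```

With the real model, `make_safe_summarizer(get_completion)` could be passed as `fn` to `gr.Interface` in place of `summarize`.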




# 2. Building a Named Entity Recognition App

Named Entity Recognition (NER) is a subtask of information extraction that locates named entities in text and classifies them into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, and percentages.

We will be using BERT for this task.
- BERT is an ML model for NLP.
- When parsing text, it is useful to identify specific entities such as persons, companies, and places.
- We are using dslim/bert-base-NER, a 108M parameter fine-tuned BERT model that is ready to use for NER.
- bert-base-NER has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).
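
The model labels every token with one of these types plus a position prefix: `B-` for the first token of an entity, `I-` for a token that continues it, and `O` for tokens outside any entity. The sketch below shows how such labels can be read. The label set is assumed from the model card, and the word-level tags are hand-written for illustration rather than produced by the model.

```python
# Entity types as listed above; "O" marks tokens outside any entity
ENTITY_TYPES = {"PER": "person", "ORG": "organization", "LOC": "location", "MISC": "miscellaneous"}

def parse_label(label):
    """Split a label such as 'B-PER' into (position, entity_type); 'O' has no type."""
    if label == "O":
        return ("O", None)
    position, entity_type = label.split("-", 1)
    return (position, entity_type)

# Hypothetical word-level labels, written by hand for illustration
tagged = [("Wake", "B-ORG"), ("Forest", "I-ORG"), ("University", "I-ORG"),
          ("in", "O"), ("Viet", "B-LOC"), ("Nam", "I-LOC")]

for word, label in tagged:
    position, entity_type = parse_label(label)
    print(word, position, ENTITY_TYPES.get(entity_type, "-"))
```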

In [5]:
# Don't forget to import the pipeline first

ner_completion = pipeline("ner", model="dslim/bert-base-NER")

def ner(input):
    output = ner_completion(input)
    return {"text": input, "entities": output}

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


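The pipeline returns one entry per token, and BERT's tokenizer splits rare words into subword pieces (continuation pieces are prefixed with `##`), so a single entity can be spread over several entries. We therefore write a `merge_tokens` function that joins consecutive tokens belonging to the same entity.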

- Output structure: the 'gr.HighlightedText' component expects its input in a **specific structure**, a dictionary containing the original text and the entities that were found within it.
- The dictionary has 2 keys: the full text ('text') and the entities returned by the model ('entities').
- The component then overlays the entities on the text, which makes the result readable.
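
As a concrete illustration of that structure, the sketch below builds such a dictionary by hand and checks that each entity's character offsets point at the expected piece of text. The entity values and scores are hypothetical (in the app they come from the NER pipeline), and `spans` is a helper introduced only for this example.

```python
text = "My name is Nam, I am an international student from Viet Nam"

# Hypothetical entities; start/end are character offsets into `text`
entities = [
    {"entity": "B-PER", "word": "Nam", "start": 11, "end": 14, "score": 0.99},
    {"entity": "B-LOC", "word": "Viet", "start": 51, "end": 55, "score": 0.99},
    {"entity": "I-LOC", "word": "Nam", "start": 56, "end": 59, "score": 0.99},
]

# The two-key dictionary described above
payload = {"text": text, "entities": entities}

def spans(payload):
    """Return (label, covered text) for every entity, using the character offsets."""
    return [(e["entity"], payload["text"][e["start"]:e["end"]]) for e in payload["entities"]]

print(spans(payload))
```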

In [11]:
# set up port number 
port_number = 3000 # Custom port number

os.environ['PORT3'] = str(port_number)


def merge_tokens(tokens):
    merged_tokens = []
    for token in tokens:
        if merged_tokens and token['entity'].startswith('I-') and merged_tokens[-1]['entity'].endswith(token['entity'][2:]):
            # If current token continues the entity of the last one, merge them
            last_token = merged_tokens[-1]
            last_token['word'] += token['word'].replace('##', '')
            last_token['end'] = token['end']
            last_token['score'] = (last_token['score'] + token['score']) / 2
        else:
            # Otherwise, add the token to the list
            merged_tokens.append(token)

    return merged_tokens

def ner(input):
    output = ner_completion(input)
    merged_tokens = merge_tokens(output)
    return {"text": input, "entities": merged_tokens}

gr.close_all()

demo = gr.Interface(fn=ner,
                    inputs=[gr.Textbox(label="Text to find entities", lines=2)],
                    outputs=[gr.HighlightedText(label="Text with entities")],
                    title="NER with dslim/bert-base-NER",
                    description="Find entities using the `dslim/bert-base-NER` model under the hood!",
                    allow_flagging="never",
                    examples=["My name is Nam, I am an international student from Viet Nam", "I am pursuing a master degree in business analytics at Wake Forest University"])

demo.launch(share=False, server_port=int(os.environ['PORT3']))

Closing server running on port: 2000
Closing server running on port: 3000
Running on local URL:  http://127.0.0.1:3000

To create a public link, set `share=True` in `launch()`.
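
To see what `merge_tokens` does without loading the model, the sketch below runs the same logic on hand-written tokens. The function is restated so the block runs on its own, and the token values, offsets, and scores are hypothetical.

```python
def merge_tokens(tokens):
    """Join each I- token onto the preceding token when both carry the same entity type."""
    merged_tokens = []
    for token in tokens:
        if (merged_tokens and token["entity"].startswith("I-")
                and merged_tokens[-1]["entity"].endswith(token["entity"][2:])):
            last_token = merged_tokens[-1]
            last_token["word"] += token["word"].replace("##", "")
            last_token["end"] = token["end"]
            last_token["score"] = (last_token["score"] + token["score"]) / 2
        else:
            merged_tokens.append(token)
    return merged_tokens

text = "Nam studies at Wake Forest University"

# Hypothetical pipeline-style output: one entry per token
tokens = [
    {"entity": "B-PER", "word": "Nam", "start": 0, "end": 3, "score": 1.0},
    {"entity": "B-ORG", "word": "Wake", "start": 15, "end": 19, "score": 1.0},
    {"entity": "I-ORG", "word": "Forest", "start": 20, "end": 26, "score": 0.5},
    {"entity": "I-ORG", "word": "University", "start": 27, "end": 37, "score": 0.5},
]

merged = merge_tokens(tokens)
for m in merged:
    # The start/end offsets recover the original spacing even though `word` does not
    print(m["entity"], repr(text[m["start"]:m["end"]]), m["word"], m["score"])
```

Two details are worth noting. The merged `word` is concatenated without spaces, which does not affect the display because the highlighting relies on the `start` and `end` offsets. The score is a running pairwise average, so later tokens weigh more than earlier ones. As an alternative worth checking in the `transformers` documentation, the token-classification pipeline accepts an `aggregation_strategy` argument that groups tokens itself; this notebook does not use it.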




In case we forgot to shut down the servers on some ports, we can close all of them at once.

In [7]:
gr.close_all()

Closing server running on port: 2000
Closing server running on port: 3000
