<a href="https://colab.research.google.com/github/jocelynbaduria/cmpe-297_SOTA/blob/main/Bart_Text_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ReadMe:

References:

https://huggingface.co/facebook/bart-large-cnn

https://huggingface.co/google/pegasus-cnn_dailymail

https://towardsdatascience.com/building-nlp-web-apps-with-gradio-and-hugging-face-transformers-59ce8ab4a319

https://github.com/chuachinhon/gradio_nlp/blob/main/notebooks/2.0_gradio_parallel_summaries.ipynb

1. Import libraries and install the gradio, wandb, and transformers modules.

2. Define the text cleaning and summarization functions.

3. Initialize Weights & Biases and use the Hugging Face pipeline to load the pre-trained models facebook/bart-large-cnn and google/pegasus-cnn_dailymail.

4. Launch both the Facebook and Google text summarizers in a Gradio app for side-by-side comparison.


### 1. Import Libraries and Install the gradio, wandb, and transformers Modules

In [1]:
!pip install gradio -q
# Install wandb for experiment tracking
!pip install wandb --upgrade -q
# !pip install -q git+https://github.com/huggingface/transformers.git
!pip install --no-cache-dir transformers sentencepiece

Collecting transformers
  Downloading transformers-4.12.0-py3-none

In [2]:
import re
import warnings

import gradio as gr
from gradio.mix import Parallel
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    pipeline,
)

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

### 2. Define Text Cleaning and Summarization Functions

In [3]:
def clean_text(text):
  """Strip non-ASCII characters and normalize whitespace before summarization."""
  # Drop every non-ASCII character (accented letters, CJK text, emoji, etc.)
  text = text.encode("ascii", errors="ignore").decode("ascii")
  # Replace newlines and tabs with spaces
  text = re.sub(r"[\n\t]", " ", text)
  # Collapse runs of spaces into a single space and trim both ends
  text = re.sub(r" +", " ", text).strip()
  return text
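One side effect of the ASCII step is that accented letters disappear entirely ("Café" becomes "Caf"). The sketch below is one possible variant, not part of the original notebook: it decomposes characters with `unicodedata.normalize` first so that only the accent mark is dropped. The name `clean_text_keep_accents` is hypothetical, and characters with no ASCII base form (for example CJK text) are still removed.

```python
import re
import unicodedata


def clean_text_keep_accents(text):
    """Hypothetical variant of clean_text that keeps the base letter of accented characters."""
    # NFKD splits "é" into "e" plus a combining accent, so the ASCII step
    # below removes only the accent mark instead of the whole letter.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Replace newlines and tabs with spaces, then collapse repeated spaces
    text = re.sub(r"[\n\t]", " ", text)
    return re.sub(r" +", " ", text).strip()


sample = "Café prices\n\trose in Zürich."
print(clean_text_keep_accents(sample))  # Cafe prices rose in Zurich.
```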

### 3. Initialize Weights & Biases and Use the Hugging Face Pipeline to Load the Pre-trained Models facebook/bart-large-cnn and google/pegasus-cnn_dailymail

In [4]:
import torch
# Other imports 
import wandb
wandb.login()
# from wandb.keras import WandbCallback
from tqdm import tqdm

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [5]:
wandb.init(project="Bart", entity="jocelynbaduria", id="text_summarizer_fb_google")
# Save model inputs and hyperparameters in the run config
config = wandb.config
config.learning_rate = 0.01
# wandb.config = {
#   "learning_rate": 0.001,
#   "epochs": 2,
#   "batch_size": 64
# }
# ... Define a model

wandb: Currently logged in as: jocelynbaduria (use `wandb login --relogin` to force relogin)


In [6]:
pipeline_summ = pipeline(
    "summarization",
    model="facebook/bart-large-cnn", # you can switch to t5-small or other model
    tokenizer="facebook/bart-large-cnn",
    framework="pt",
)

# Facebook summarization
def fb_summarizer(text):
  input_text = clean_text(text)
  results = pipeline_summ(input_text)
  return results[0]["summary_text"]

# Add gradio app for testing
summary1 = gr.Interface(
    fn=fb_summarizer,
    inputs=gr.inputs.Textbox(),
    outputs=gr.outputs.Textbox(label="Summary by Facebook/Bart-large-CNN model")
)
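facebook/bart-large-cnn accepts at most 1024 input tokens, so very long articles may need to be split before they reach the pipeline. The following is a rough sketch of one way to do that, not something the original notebook does: `chunk_words` and `summarize_long` are hypothetical helpers, and the 400-word budget is an assumed, conservative stand-in for a real token count.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_long(text, summarizer, max_words=400):
    """Summarize each chunk with `summarizer` and join the partial summaries."""
    return " ".join(summarizer(chunk) for chunk in chunk_words(text, max_words))


# A stand-in summarizer (upper-casing) lets the sketch run without downloading a model
demo = summarize_long("one two three four five", lambda chunk: chunk.upper(), max_words=2)
print(demo)  # ONE TWO THREE FOUR FIVE
```

In this notebook, `fb_summarizer` could be passed as the `summarizer` argument. Splitting on word counts can cut sentences in half, so a sentence-aware splitter may give cleaner summaries.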


In [7]:
model_name = "google/pegasus-cnn_dailymail" # Pegasus has a few variations; switch out as required

# Load the tokenizer and model once, rather than on every call.
# use_fast=False avoids the error "You need to have sentencepiece installed
# to convert a slow tokenizer to a fast one."
tokenizer_pegasus = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Second of the two summarization functions
def google_summarizer(text):
    input_text = clean_text(text)
    # Tokenize as a batch of one, truncating to the model's maximum input length
    batch = tokenizer_pegasus(
        [input_text], truncation=True, padding="longest", return_tensors="pt"
    )
    summary_ids = model_pegasus.generate(**batch)
    pegasus_summary = tokenizer_pegasus.batch_decode(
        summary_ids, skip_special_tokens=True
    )
    return pegasus_summary[0]

# Second of 2 Gradio apps that we'll put in "parallel"
summary2 = gr.Interface(
    fn=google_summarizer,
    inputs=gr.inputs.Textbox(),
    outputs=gr.outputs.Textbox(label="Summary by Google/Pegasus-CNN-Dailymail"),
)
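A W&B run is initialized in this section, but no values are logged to it. As a minimal sketch of what could be tracked when comparing the two models, the hypothetical helper below computes simple length statistics for one summary. The metric names and the `model_label` argument are assumptions made for illustration.

```python
def summary_metrics(source_text, summary, model_label):
    """Return word counts and the summary-to-source length ratio for one model."""
    source_words = len(source_text.split())
    summary_words = len(summary.split())
    ratio = summary_words / source_words if source_words else 0.0
    return {
        f"{model_label}/source_words": source_words,
        f"{model_label}/summary_words": summary_words,
        f"{model_label}/compression_ratio": ratio,
    }


metrics = summary_metrics("a b c d e f g h", "a b", "bart")
print(metrics)
# With an active run from wandb.init(), the dict could be sent with:
# wandb.log(metrics)
```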

### 4. Launch Both the Facebook and Google Text Summarizers in a Gradio App for Side-by-Side Comparison

In [9]:
Parallel(
    summary1,
    summary2,
    title="Facebook vs Google Text Summarizer model",
    inputs=gr.inputs.Textbox(lines=20, label="Paste some English text here"),
).launch()

Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
This share link will expire in 72 hours. If you need a permanent link, visit: https://gradio.app/introducing-hosted
Running on External URL: https://59409.gradio.app


(<Flask 'gradio.networking'>,
 'http://127.0.0.1:7861/',
 'https://59409.gradio.app')
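Conceptually, `Parallel` sends the same input text to each interface and displays the outputs next to each other. The plain-Python sketch below mimics that idea for quick checks outside the app. It is not how Gradio implements `Parallel`, and `run_side_by_side` is a hypothetical helper.

```python
def run_side_by_side(text, summarizers):
    """Call every summarizer on the same text and return the results keyed by label."""
    return {label: fn(text) for label, fn in summarizers.items()}


# Stand-in functions keep the sketch runnable without loading any model
results = run_side_by_side(
    "Some long article text",
    {"first": lambda t: t[:4], "second": lambda t: t[-4:]},
)
print(results)  # {'first': 'Some', 'second': 'text'}
```

In the notebook, the dictionary could instead map labels such as "facebook" and "google" to `fb_summarizer` and `google_summarizer`.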