The purpose of this notebook is to experiment with Hugging Face NLP models.

Goals:
1) Download a Hugging Face model to your laptop and run it offline using your own textual data. 



About Hugging Face:
Hugging Face is a company that develops artificial intelligence models and tools for natural language processing (NLP). It is best known for its open-source library, Transformers, which offers pre-trained models and general-purpose architectures for NLP tasks such as text classification, summarization, and translation. The library works on top of popular deep learning frameworks like TensorFlow and PyTorch.

###########################################################################

Install the necessary libraries:
Make sure you have Python installed on your laptop. Then, install the Hugging Face Transformers library and the required deep learning framework (either TensorFlow or PyTorch) using pip:

In [4]:
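!pip install transformers
!pip install torch  # PyTorch backend (the one used in the rest of this notebook)
# Alternatively, install TensorFlow; only one of the two frameworks is required:
!pip install tensorflow  # TensorFlow backend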

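ERROR: Could not find a version that satisfies the requirement tensorflow (from versions: none)
ERROR: No matching distribution found for tensorflow

Note: the TensorFlow install failed in this environment. "from versions: none" typically means pip found no TensorFlow package built for this Python version or platform. The rest of the notebook runs on PyTorch, so this error does not block the steps below.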
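Since only one of the two frameworks has to be present, it can help to confirm which one is importable before going further. The following is a minimal sketch using only the standard library; the helper name `available_backends` is made up for this notebook and is not part of Transformers.

```python
import importlib.util


def available_backends() -> dict:
    """Report which deep learning frameworks can be imported in this environment."""
    names = ("torch", "tensorflow")
    # find_spec returns None when a top-level package is not installed.
    return {name: importlib.util.find_spec(name) is not None for name in names}


backends = available_backends()
print(backends)
if not any(backends.values()):
    print("No backend found: install torch or tensorflow before continuing.")
```

If `torch` shows `True`, the remaining cells should work even when the TensorFlow install fails.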

In a new cell (or in a Python script, e.g., summarization.py), import the necessary classes:

In [5]:
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer


Download the pre-trained model and tokenizer:

Choose a pre-trained text summarization model from the Hugging Face Model Hub (https://huggingface.co/models). For this example, we'll use the "t5-small" model:

In [6]:
model_name = 't5-small'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
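The cell above downloads the files into the Hugging Face cache. To meet the goal of running offline, one option is to save the model and tokenizer to a folder once, then load from that folder without network access. The sketch below assumes a hypothetical folder `models/t5-small`; the helper names are made up for this notebook. `save_pretrained`, `from_pretrained(..., local_files_only=True)`, and the `HF_HUB_OFFLINE` / `TRANSFORMERS_OFFLINE` environment variables are existing Hugging Face features.

```python
import os
from pathlib import Path

# Hypothetical local folder; any writable path works.
LOCAL_DIR = Path("models") / "t5-small"


def save_for_offline(model_name: str = "t5-small", target: Path = LOCAL_DIR) -> Path:
    """Download once (needs network) and write model and tokenizer files to disk."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer  # imported lazily

    target.mkdir(parents=True, exist_ok=True)
    AutoModelForSeq2SeqLM.from_pretrained(model_name).save_pretrained(target)
    AutoTokenizer.from_pretrained(model_name).save_pretrained(target)
    return target


def enable_offline_mode() -> None:
    """Ask the Hugging Face libraries not to contact the Hub.

    Most reliable when called before transformers is first imported.
    """
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"


def load_offline(source: Path = LOCAL_DIR):
    """Load model and tokenizer strictly from local files."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model = AutoModelForSeq2SeqLM.from_pretrained(source, local_files_only=True)
    tokenizer = AutoTokenizer.from_pretrained(source, local_files_only=True)
    return model, tokenizer


# Typical use (first line online, the rest offline):
#   save_for_offline()
#   enable_offline_mode()
#   model, tokenizer = load_offline()
```

Passing `local_files_only=True` makes the load fail fast if a file is missing, rather than silently trying to reach the network.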


Create a summarization pipeline using the downloaded model and tokenizer:


In [7]:
summarizer = pipeline('summarization', model=model, tokenizer=tokenizer)


Define a function to summarize text using the pipeline:


In [8]:
def summarize_text(text: str) -> str:
    summary = summarizer(text, max_length=150, min_length=40, do_sample=False)
    return summary[0]['summary_text']
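The function above takes a single string. To run it on your own text files, which may be longer than the 512-token input limit mentioned in the tokenizer warning, one simple approach is to split the text into word-based chunks and summarize each chunk. This is a rough sketch: the helper names are made up, and the default of 300 words per chunk is a guess meant to stay under the limit, not a measured value.

```python
from pathlib import Path
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_file(path: str, summarize: Callable[[str], str], max_words: int = 300) -> str:
    """Read a UTF-8 text file, summarize each chunk, and join the partial summaries."""
    text = Path(path).read_text(encoding="utf-8")
    parts = [summarize(chunk) for chunk in split_into_chunks(text, max_words)]
    return "\n".join(parts)


# Example with a hypothetical file name:
#   print(summarize_file("my_notes.txt", summarize_text))
```

Splitting on word counts ignores sentence boundaries, so a chunk can start mid-sentence; splitting on paragraphs first may give cleaner summaries.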


Use the function to summarize an example text:


In [9]:
example_text = "This is a long text that you want to summarize. It can have multiple sentences and paragraphs. The summarization model will help you generate a shorter version of this text, capturing the most important information."

summary = summarize_text(example_text)
print("Original Text:\n", example_text) 
print("\nSummarized Text:\n", summary)


Your max_length is set to 150, but you input_length is only 48. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)


Original Text:
 This is a long text that you want to summarize. It can have multiple sentences and paragraphs. The summarization model will help you generate a shorter version of this text, capturing the most important information.

Summarized Text:
 this text can have multiple sentences and paragraphs . the summarization model will help you generate a shorter version of this text, capturing the most important information . if you want to summarize the text, it will be able to take a few minutes to complete .
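The warning in the output above appears because `max_length=150` is larger than the 48-token input. One way to avoid it is to derive the length bounds from the input size. The sketch below is an assumption-laden example: the helper names and the 0.5 ratio are made up here, and the token count from the tokenizer may differ slightly from the `input_length` the pipeline reports, since the pipeline may add a task prefix.

```python
def length_bounds(input_length: int, ratio: float = 0.5, cap: int = 150) -> tuple:
    """Pick (min_length, max_length) proportional to the input length in tokens."""
    max_length = max(2, min(cap, int(input_length * ratio)))
    min_length = max(1, max_length // 2)
    return min_length, max_length


def summarize_adaptive(text: str, summarizer, tokenizer) -> str:
    """Summarize text with length bounds scaled to its token count."""
    input_length = len(tokenizer(text)["input_ids"])
    min_length, max_length = length_bounds(input_length)
    result = summarizer(text, max_length=max_length, min_length=min_length, do_sample=False)
    return result[0]["summary_text"]


print(length_bounds(48))  # (12, 24)
```

For the 48-token example this gives `max_length=24`, the same value the warning suggests. With the notebook's objects it would be called as `summarize_adaptive(example_text, summarizer, tokenizer)`.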
