<a href="https://colab.research.google.com/github/isys5002-itp/ISYS5002-2024-S2/blob/main/2023_text_summariser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a summariser

This section is based on the YouTube video [AI Text Summarization with Hugging Face Transformers in 4 Lines of Python](https://youtu.be/TsfLm5iiYb4).

As Information Systems professionals, we should be aware of advanced concepts and think about how they can meet organisational needs. Using *Hugging Face Transformers*, you can leverage a pre-trained summarisation pipeline to start summarising content. In this section, we will:
1. Install Hugging Face Transformers
2. Build a summarisation pipeline
3. Run the pipeline to summarise text
4. **Investigate ways to reuse the pipeline**

> [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) provides free, state-of-the-art pre-trained machine learning models for processing text, images, audio and video. See the project website for more information.

In [1]:
# Install Hugging Face Transformers and Dependencies
!pip install transformers



In [2]:
# Import the pipeline function from the transformers library
from transformers import pipeline

# Use it to create (load) a summarisation pipeline object
summary_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")

# Once the pipeline is created, it can be used to summarise text
# by passing a string of text to the summary_pipeline object.

# Let us copy and paste some text
article = """
Around the world, as regulators look to rein in Big Tech, like the ongoing digital platforms inquiry in Australia, online platforms will face a raft of new rules in the EU.
Known as the Digital Services Act, it’s a comprehensive set of regulations for digital services and content in the Eurozone.
Like GDPR, the Digital Services Act is expected to lead the way for other countries to provide some rules around how digital services function,
with everything from algorithms to online marketplaces, social networks, content-sharing platforms, app stores and online travel and accommodation platforms included.
The Digital Services Act sets out clear due diligence obligations for digital platforms and other online intermediaries with measures for cooperation with trusted flaggers and
competent authorities on content moderation, and measures to deter rogue traders from reaching consumers.
"""

# Run the summariser pipeline
summary = summary_pipeline(article, max_length=50, min_length=20)

# What does a summary look like?
print("summary is: ", summary)

# By inspection of output, 'summary' is a list.  The first element of the list is a dictionary.
# The key to the dictionary is 'summary_text'.

# Extract and display the summarised text
text = summary[0]['summary_text']  # get the first element, then extract the value for the key 'summary_text'
print("\nExtracted text: ", text)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



summary is:  [{'summary_text': 'The Digital Services Act is a comprehensive set of regulations for digital services and content in the Eurozone. It is expected to lead the way for other countries to provide some rules around how digital services function.'}]

Extracted text:  The Digital Services Act is a comprehensive set of regulations for digital services and content in the Eurozone. It is expected to lead the way for other countries to provide some rules around how digital services function.
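
The shape of the returned value can be explored without loading the model. The sketch below uses a hard-coded stand-in (`sample_output`, a hypothetical name) with the same list-of-dictionaries shape as the output printed above, and pulls out every `summary_text` value. It assumes each element carries a `summary_text` key, as seen in that output; `extract_summaries` is also a hypothetical helper, not part of the library.

```python
# Stand-in with the same shape as the printed pipeline output:
# a list containing one dictionary per summarised input.
sample_output = [
    {"summary_text": "The Digital Services Act is a comprehensive set of regulations."},
]


def extract_summaries(results):
    """Return the 'summary_text' value from each dictionary in the list."""
    return [item["summary_text"] for item in results]


texts = extract_summaries(sample_output)
print(texts[0])
```

Looping over the list rather than indexing `[0]` keeps the same code usable if the list ever holds more than one result.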


In [3]:
# splits the summarised text into a list of sentences using .split('.')
summary[0]['summary_text'].split('.')

['The Digital Services Act is a comprehensive set of regulations for digital services and content in the Eurozone',
 ' It is expected to lead the way for other countries to provide some rules around how digital services function',
 '']
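
Note the empty string at the end of the list: the text ends with a full stop, so `split('.')` leaves an empty piece after it, and the second sentence keeps its leading space. A minimal sketch of tidying the result is below. It assumes sentences end only with full stops, so abbreviations such as "U.S." or decimal numbers would be split incorrectly; `split_sentences` is a hypothetical helper name.

```python
def split_sentences(text):
    """Naively split text on full stops, trimming spaces and dropping empty pieces."""
    pieces = text.split('.')
    return [piece.strip() for piece in pieces if piece.strip()]


summary_text = (
    "The Digital Services Act is a comprehensive set of regulations for digital "
    "services and content in the Eurozone. It is expected to lead the way for "
    "other countries to provide some rules around how digital services function."
)

for sentence in split_sentences(summary_text):
    print(sentence)
```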

**Let's make it a function**

In [4]:
def summarise(article):
  summary_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")
  summary = summary_pipeline(article, max_length=50, min_length=20)
  text = summary[0]['summary_text']  # get the first element, then extract the value for the key 'summary_text'
  return text


**A quick test.**

In [5]:
some_text = '''
A lack of transparency and reporting standards in the scientifc community has led to increasing and widespread
concerns relating to reproduction
'''

print(summarise(some_text))

Your max_length is set to 50, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


A lack of transparency and reporting standards in the scientifc community has led to increasing and widespread concerns relating to reproduction.


Umm... it worked, but with a warning about `max_length`. We could reduce the maximum length, or add a check that the text has at least 50 words. Our reasoning (a design decision) is that it doesn't really make sense to summarise, say, one sentence. We could pick any minimum size, but 50 seems like a good number.

But first, how do I count words in a string? We could search the internet for some code snippets. We can use the string method `split()`.

In [None]:
help(str.split)

So `split()` returns a list of words.  The `len()` of the list will be the word count.  Let us try it.


In [6]:
some_text = '''
A lack of transparency and reporting standards in the scientifc community has led to increasing and widespread
concerns relating to reproduction
'''

count = len(some_text.split())
print(count)

21


**Let us update the function to include this (word length) check.**

We will also add a docstring. I have chosen to use an `assert` statement, but you could do something similar with an `if` statement.

In [12]:
def summarise(article):
  '''
  Returns a summary of a text.
  The text must contain at least 50 words.
  '''
  # Use >= so the check matches the "at least 50 words" message
  assert len(article.split()) >= 50, 'Please make sure your text has at least 50 words'

  summary_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")
  summary = summary_pipeline(article, max_length=70, min_length=20)
  text = summary[0]['summary_text']  # get the first element, then extract the value for the key 'summary_text'
  return text

In [8]:
some_text = '''A lack of transparency and reporting standards in the scientifc
community has led to increasing and widespread concerns relating to reproduction
'''

print(summarise(some_text))

AssertionError: Please make sure your text has at least 50 words

Great, the assertion worked.
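
An `if`-based version of the same check could look like the sketch below. `check_length` and `min_words` are hypothetical names, and raising `ValueError` is one reasonable choice rather than the only one. A practical difference is that `assert` statements are skipped when Python runs with the `-O` option, whereas an `if` check always runs.

```python
def check_length(article, min_words=50):
    """Raise ValueError if the article has fewer than min_words words."""
    word_count = len(article.split())
    if word_count < min_words:
        raise ValueError(
            f"Please make sure your text has at least {min_words} words "
            f"(it has {word_count})."
        )
    return word_count


short_text = "A lack of transparency and reporting standards"

try:
    check_length(short_text)
except ValueError as error:
    print("Rejected:", error)
```

Calling `check_length(article)` as the first line of `summarise` would play the same role as the `assert`.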

In [9]:
bigger_text='''
A lack of transparency and reporting standards in the scientifc community has led to increasing and widespread
concerns relating to reproduction and integrity of results. As an omics science, which generates vast amounts of data and
relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility. The
metabolomics community has made substantial eforts to align with FAIR data standards by promoting open data formats,
data repositories, online spectral libraries, and metabolite databases.
'''

print(summarise(bigger_text))

Metabolomics generates vast amounts of data andrelies heavily on data science for deriving biological meaning. Metabolomics community has made substantial eforts to align with FAIR data standards by promoting open data formats.


In [10]:
perth = '''
The city of Perth has its origins in 1829 when the Swan River Colony was established by the British Government. The area is also home to the Aboriginal Noongar people who have lived in the south-west region of Western Australia for more than 35,000 years. In the city precinct itself, the traditional owners are known as Whadjuk Noongar people.

The first colonial Governor, Capt James Stirling, named the new settlement after the Scottish birthplace and parliamentary constituency of the then British Secretary of State for the Colonies, Sir George Murray. When surveying the area previously, Captain Stirling was said to have been stunned by the beauty of the Swan River and the fertile land around it.

Since water transport was vital to communications in the new colony before roads were built, the meanderings of the Swan River determined the site of the first towns. Governor Stirling decided that the site for the colony’s capital would be sited on the river 18km from the sea port of Fremantle. On 12 August 1829 Mrs Helen Dance, wife of the commander of HMS Sulphur, drove an axe into a tree (near the current Perth Town Hall) to mark the colony’s foundation.

The city site was mid-way between the sea and the farming areas of the Upper Swan. However, the early years were difficult financially for the colony and in 1850 it was decided that convict labour would be beneficial in that regard. Between 1850 and 1868 almost 10,000 convicts were transported from Britain. Due to the influx of convicts, many public works were completed during the period from 1856-79, notably the Perth Town Hall. It was not until 1856 that Perth officially gained ‘city’ status when it was declared a Bishop’s See by Queen Victoria.

The first meeting of the Perth City Council was held on 10 December 1858. Rich gold discoveries in the Kalgoorlie region in the early 1890s brought a new era of prosperity for the city and many impressive buildings, some of which still grace the streets to this day. The city also experienced significant population growth. Representative government evolved in Western Australia in the second half of the 19th Century and in 1901 Western Australia federated with the other Australian States to form the Commonwealth of Australia. Perth experienced another mining boom in the 1960s and the wealth it generated could be evidenced by the city’s changing CBD skyline.

Perth became widely known as the City of Lights when U.S. astronaut John Glenn told the world he had seen the city’s lights during his historic orbit around the Earth in February 1962. There was also international attention on Perth later that year when the British Empire and Commonwealth Games were held in the city. Commonwealth leaders from around the world converged on Perth when it was the venue for the successful Commonwealth Heads of Government Meeting (CHOGM) in 28-30 October 2011.

The City of Perth is the fastest growing local government area with a population approaching 20,000. It has ranked consistently among the Top 10 most liveable cities in the world, as surveyed by the highly regarded The Economist Intelligence Unit. Once again, wealth generated by the State’s natural resources is driving development of the city, with the difference being that many companies and businesses are choosing to make Perth their home.

'''

In [13]:
print(summarise(perth))

The city of Perth has its origins in 1829 when the Swan River Colony was established by the British Government. It was not until 1856 that Perth officially gained ‘city’ status when it was declared a Bishop’s See by Queen Victoria. Rich gold discoveries in the Kalgoorlie region in the early 1890s


Okay, that is working well.
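
One detail worth noticing: `summarise` calls `pipeline(...)` each time it runs, so the model is loaded again on every call. A minimal sketch of one way to load it once and reuse it is below. The names `make_summariser`, `load_pipeline`, `fake_loader` and `summarise_cached` are hypothetical, and the word-count check is left out to keep the sketch short. `fake_loader` is a stand-in so the sketch runs without downloading the model; in the notebook you would pass a loader that calls `pipeline("summarization", model="facebook/bart-large-cnn")` instead.

```python
def make_summariser(load_pipeline, max_length=70, min_length=20):
    """Build a summarise function that loads the pipeline once and reuses it."""
    cache = {}

    def summarise_cached(article):
        if "pipe" not in cache:
            cache["pipe"] = load_pipeline()  # runs on the first call only
        result = cache["pipe"](article, max_length=max_length, min_length=min_length)
        return result[0]["summary_text"]

    return summarise_cached


load_calls = []  # records each time the (stand-in) model is loaded


def fake_loader():
    """Stand-in for the real loader; returns a callable shaped like a pipeline."""
    load_calls.append("loaded")

    def fake_pipeline(text, max_length, min_length):
        # Not a real summary: just echoes back up to max_length words.
        words = text.split()[:max_length]
        return [{"summary_text": " ".join(words)}]

    return fake_pipeline


summarise_cached = make_summariser(fake_loader)
summarise_cached("first article text")
summarise_cached("second article text")
print("Pipeline loaded", len(load_calls), "time(s)")
```

With the real loader, the first call still pays the loading cost; later calls reuse the pipeline already held in memory.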

Let us start putting our hard work to use:

*   Summarise a PDF text
*   Summarise a webpage text


