<a href="https://colab.research.google.com/github/mchoirul/genai-code/blob/main/notebook/googlenews_summarize_vertex_langchain-git.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarizing Google News Headlines Using Vertex AI PaLM API & LangChain

In this tutorial, we summarize news articles gathered from Google News using the following components:

- GNews API: Collect news titles and metadata from Google News
- LangChain's UnstructuredURLLoader: Retrieve news content
- Vertex AI PaLM API: Generate news summaries

The Vertex AI PaLM API provides access to Google's PaLM large language models (LLMs), which can be used for a variety of tasks, including text summarization. In this tutorial, we use the `text-bison@001` model from the PaLM API to summarize news content.

Reference and credit to the following resources:
- https://github.com/ranahaani/GNews
- https://alphasec.io/summarize-google-news-results-with-langchain-and-serper-api/

## Objectives:
- Learn how to use the GNews API, LangChain's UnstructuredURLLoader, and the Vertex AI PaLM API to perform news summarization
- Create a news summarization function that automates the process of generating news summaries
- Gain a better understanding of the steps involved in news summarization

## Installation & Preparation

In [None]:
# install all required packages
!pip install -q langchain
!pip install -q google-cloud-aiplatform
!pip install -q gnews
!pip install -q unstructured

In [3]:
# restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [2]:
# import required packages
from langchain.llms import VertexAI
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate
from langchain.document_loaders import UnstructuredURLLoader


In [3]:
# authenticate to google cloud account
from google.colab import auth as google_auth
google_auth.authenticate_user()

In [5]:
# Google Cloud project ID
# replace with your own project ID

import vertexai

PROJECT_ID = "my-project-id"  # @param {type:"string"}
vertexai.init(project=PROJECT_ID, location="us-central1")

## Calling GNews API to Get News Metadata
Limit the news period using the following time operators:
 - h = hours (e.g. 12h)
 - d = days (e.g. 7d)
 - m = months (e.g. 6m)
 - y = years (e.g. 1y)

Example:

`google_news.period = '3d'  # News from the last 3 days`
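
The period strings are simple enough to validate before handing them to GNews. The following is a minimal sketch, not part of the GNews library: `period_to_timedelta` is a hypothetical helper, and treating a month as 30 days and a year as 365 days is an assumption made only for illustration.

```python
import re
from datetime import timedelta

# Length of one unit; the month and year values are rough assumptions
# (30 and 365 days) used for illustration only.
_UNIT_LENGTH = {
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "m": timedelta(days=30),
    "y": timedelta(days=365),
}


def period_to_timedelta(period: str) -> timedelta:
    """Convert a period string such as '12h' or '7d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([hdmy])", period)
    if match is None:
        raise ValueError(f"Unsupported period: {period!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * _UNIT_LENGTH[unit]


print(period_to_timedelta("3d"))
```

A check like this can catch typos such as `'3 days'` early, before an empty result set makes the mistake harder to spot.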

In [8]:
from gnews import GNews

google_news = GNews()
google_news.period = '1d'  # news from the last 1 day
google_news.max_results = 5  # maximum number of results per query
google_news.country = 'ID'  # news from a specific country (Indonesia)
google_news.language = 'id'  # news in a specific language (Bahasa Indonesia)
google_news.exclude_websites = ['yahoo.com', 'cnn.com', 'msn.com']  # exclude news from specific websites, here yahoo.com, cnn.com, and msn.com

#use date range if required
#google_news.start_date = (2023, 1, 1) # Search from 1st Jan 2023
#google_news.end_date = (2023, 4, 1) # Search until 1st April 2023

#get by keyword
news_by_keyword = google_news.get_news('kereta cepat')

In [9]:
#check collected news metadata
news_by_keyword

[{'title': 'Tarif Kereta Cepat Jakarta-Kota Bandung Rp 350 Ribu, Mahal? - CNBC Indonesia',
  'description': 'Tarif Kereta Cepat Jakarta-Kota Bandung Rp 350 Ribu, Mahal?  CNBC Indonesia',
  'published date': 'Sat, 30 Sep 2023 09:45:00 GMT',
  'url': 'https://news.google.com/rss/articles/CBMidGh0dHBzOi8vd3d3LmNuYmNpbmRvbmVzaWEuY29tL25ld3MvMjAyMzA5MzAxMjQ0NTktNC00NzY2NzQvdGFyaWYta2VyZXRhLWNlcGF0LWpha2FydGEta290YS1iYW5kdW5nLXJwLTM1MC1yaWJ1LW1haGFs0gF4aHR0cHM6Ly93d3cuY25iY2luZG9uZXNpYS5jb20vbmV3cy8yMDIzMDkzMDEyNDQ1OS00LTQ3NjY3NC90YXJpZi1rZXJldGEtY2VwYXQtamFrYXJ0YS1rb3RhLWJhbmR1bmctcnAtMzUwLXJpYnUtbWFoYWwvYW1w?oc=5&hl=en-US&gl=US&ceid=US:en',
  'publisher': {'href': 'https://www.cnbcindonesia.com',
   'title': 'CNBC Indonesia'}},
 {'title': 'Siaran Pers: Kereta Cepat "Whoosh" Diharapkan Perkuat Capaian ... - Kemenparekraf',
  'description': 'Siaran Pers: Kereta Cepat "Whoosh" Diharapkan Perkuat Capaian ...  Kemenparekraf',
  'published date': 'Sat, 30 Sep 2023 05:53:06 GMT',
  'url': 'https:

In [10]:
#test another method
#instead of search by keyword, let's retrieve top news from google-news

#get top news from the last 7 days
google_news = GNews(language='id', country='ID', period='7d',
                    start_date=None, end_date=None, max_results=10)
top_news = google_news.get_top_news()

#check collected news metadata
top_news

[{'title': 'Putra Megawati Sopiri Ganjar dan Rombongan Melaju di Atas Karpet Merah Rakernas IV PDI-P - Kompas.com - Nasional Kompas.com',
  'description': 'Putra Megawati Sopiri Ganjar dan Rombongan Melaju di Atas Karpet Merah Rakernas IV PDI-P - Kompas.com  Nasional Kompas.comEffendi Gazali: Ganjar Berhasil Jadi Bintang Rakernas PDIP | Kanal Pemilu Tepercaya  CNN IndonesiaPuan Maharani Heran Minimnya Tepuk Tangan di Rakernas PDIP: Kayak Nonton Wayang  Nasional TempoMomen Ganjar dan Jokowi Gandeng Megawati di Rakernas, PDI-P: Jauhkan Berbagai Spekulasi - Kompas.com  Nasional Kompas.comPakar Sebut Jokowi Sudah Bayangkan Ganjar Dilantik Jadi Presiden RI  detikNewsLihat Liputan Lengkap di Google Berita',
  'published date': 'Sat, 30 Sep 2023 10:31:00 GMT',
  'url': 'https://news.google.com/rss/articles/CBMie2h0dHBzOi8vbmFzaW9uYWwua29tcGFzLmNvbS9yZWFkLzIwMjMvMDkvMzAvMTczMTU5MDEvcHV0cmEtbWVnYXdhdGktc29waXJpLWdhbmphci1kYW4tcm9tYm9uZ2FuLW1lbGFqdS1kaS1hdGFzLWthcnBldC1tZXJhaNIBf2h0dHBzOi8vYW1wL

In [14]:
#collect metadata by news topic
#Available topics: WORLD, NATION, BUSINESS, TECHNOLOGY, ENTERTAINMENT, SPORTS, SCIENCE, HEALTH

google_news = GNews(language='id', country='ID',
                    period='1d', start_date=None, end_date=None,
                    max_results=5, exclude_websites=['yahoo.com', 'msn.com'])
news_by_topic = google_news.get_news_by_topic('SPORTS')

#check collected news
news_by_topic


[{'title': 'Tottenham Vs Liverpool: Badan Wasit Ngaku Salah, Gol Luis Diaz Harusnya Sah - detikSport',
  'description': 'Tottenham Vs Liverpool: Badan Wasit Ngaku Salah, Gol Luis Diaz Harusnya Sah  detikSportJadwal Siaran Liga Inggris Malam Ini Live SCTV, Man United, M City dan Tottenham vs Liverpool  Tribun MataramanTottenham Vs Liverpool: Drama! Spurs Kalahkan 9 Pemain Si Merah  detikSportTottenham Kalahkan Sembilan Pemain Liverpool Berkat Gol Bunuh Diri Joel Matip Di Menit-Menit Akhir  Goal.comLihat Liputan Lengkap di Google Berita',
  'published date': 'Sat, 30 Sep 2023 23:00:00 GMT',
  'url': 'https://news.google.com/rss/articles/CBMiggFodHRwczovL3Nwb3J0LmRldGlrLmNvbS9zZXBha2JvbGEvbGlnYS1pbmdncmlzL2QtNjk1ODc3Ni90b3R0ZW5oYW0tdnMtbGl2ZXJwb29sLWJhZGFuLXdhc2l0LW5nYWt1LXNhbGFoLWdvbC1sdWlzLWRpYXotaGFydXNueWEtc2Fo0gGGAWh0dHBzOi8vc3BvcnQuZGV0aWsuY29tL3NlcGFrYm9sYS9saWdhLWluZ2dyaXMvZC02OTU4Nzc2L3RvdHRlbmhhbS12cy1saXZlcnBvb2wtYmFkYW4td2FzaXQtbmdha3Utc2FsYWgtZ29sLWx1aXMtZGlhei1oYXJ1c255YS1zY

## Extract news content
The `UnstructuredURLLoader` from the LangChain library is a useful tool for extracting the text content of an HTML page from a URL. The loader is essentially a wrapper around the HTML partitioning function of the [Unstructured](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-html) library. We use it as a news content extractor, taking the URLs collected in the previous steps as input.

In [15]:
# test extracting content from the first two URLs in news_by_topic

urls = [
    news_by_topic[0]['url'],
    news_by_topic[1]['url'],
]

loader = UnstructuredURLLoader(urls=urls)
content = loader.load()

#check news content
content

[Document(page_content='Tottenham Vs Liverpool: Badan Wasit Ngaku Salah, Gol Luis Diaz Harusnya Sah\n\n0 komentar\n\n\n                BAGIKAN \xa0\n                \n\nTautan telah disalin\n\nMENU\n\ndetikcom\n\nTerpopuler\n\nKirim Tulisan\n\nLive TV\n\ndetikPemilu NEW\n\nKategori Berita\n\ndetikNews\n\ndetikFinance\n\ndetikInet\n\ndetikHot\n\ndetikSport\n\nSepakbola\n\ndetikOto\n\ndetikTravel\n\ndetikFood\n\ndetikHealth\n\nWolipop\n\ndetikX\n\n20Detik\n\ndetikFoto\n\ndetikEdu\n\ndetikHikmah\n\ndetikProperti\n\nDaerah\n\ndetikJateng\n\ndetikJatim\n\ndetikJabar\n\ndetikSulsel\n\ndetikSumut\n\ndetikBali\n\ndetikSumbagsel\n\ndetikJogja NEW\n\nLayanan\n\nPasang Mata\n\nAds Smart\n\nForum\n\ndetikEvent\n\nTrans Snow World\n\nTrans Studio\n\nberbuatbaik.id\n\nDetik Network\n\nCNN Indonesia\n\nCNBC Indonesia\n\nHai Bunda\n\nInsertLive\n\nBeautynesia\n\nFemale Daily\n\nCXO Media\n\nHome\n\nLiga Inggris\n\nLiga Italia\n\nLiga Spanyol\n\nLiga Jerman\n\nLiga Indonesia\n\nUEFA\n\nDunia\n\nIndeks\
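
The extracted text above mixes the article with site navigation entries (menu labels such as "detikNews" or "Liga Inggris"). One possible way to reduce this noise before summarizing is to drop very short lines. This is a minimal sketch under the assumption that lines with fewer than four words are menu items; `strip_short_lines` and its `min_words` threshold are hypothetical and not part of LangChain or Unstructured, the sample text is made up for the demonstration, and the heuristic can also remove legitimate short lines.

```python
def strip_short_lines(text: str, min_words: int = 4) -> str:
    """Keep only lines that contain at least `min_words` words."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if len(line.split()) >= min_words:
            kept.append(line)
    return "\n".join(kept)


# Made-up sample that imitates the shape of the loader output.
sample = (
    "Gol Luis Diaz seharusnya sah\n\nMENU\n\ndetikNews\n\n"
    "Liga Inggris\n\nPGMOL mengakui adanya kesalahan wasit"
)
print(strip_short_lines(sample))
```

Applied to the loader output, this would be called as `strip_short_lines(content[0].page_content)`.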

## Summarize News with Vertex AI PaLM API

The next step is to call `text-bison@001` to generate the news summary. We need to supply a prompt that tells the model how to summarize the text.


**Prompting**

Careful prompting is essential for getting accurate results from an LLM. Set `prompt_template` to a prompt that instructs the model to generate a news summary under the following rules:
  * The summary has a maximum of 100 words
  * If the text cannot be found or an error occurs, return: "Content empty"
  * Use only material from the supplied text
  * Write the summary in Bahasa Indonesia
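
To see what the model receives, it can help to fill the template by hand and to check a returned summary against the word limit. The following is a small sketch in plain Python, assuming the `{text}` placeholder is substituted the way `str.format` does it; the shortened template, `render_prompt`, and `within_word_limit` are illustrative names and not part of LangChain.

```python
# Shortened stand-in for the tutorial's prompt; the wording is illustrative.
template = """Generate summary for the following text:

"{text}"
SUMMARY:"""


def render_prompt(template: str, text: str) -> str:
    """Fill the {text} placeholder of the prompt template."""
    return template.format(text=text)


def within_word_limit(summary: str, max_words: int = 100) -> bool:
    """Check a summary against the word limit stated in the prompt."""
    return len(summary.split()) <= max_words


print(render_prompt(template, "Kereta cepat mulai beroperasi."))
```

A model does not always respect a stated length limit, so a check such as `within_word_limit` is one way to flag summaries that need a retry.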



In [16]:
#prompting to perform news summary
prompt_template = """Generate summary for the following text, using the following steps:
                     1. summary consists of maximum 100 words
                     2. If the text cannot be found or error, return: "Content empty"
                     3. Use only materials from the text supplied
                     4. Create summary in Bahasa Indonesia

                    "{text}"
                    SUMMARY:"""

prompt = PromptTemplate.from_template(prompt_template)

# configure the LLM
llm = VertexAI(temperature=0.1,
               model_name='text-bison@001',
               top_k=40,
               top_p=0.8,
               max_output_tokens=512)

Wrap the summarization process in a function that loops over a collection of news items. The `generate_summary` function performs the following steps:
- Retrieve the news content from each URL
- Generate a summary for each article
- Print the output
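
The steps above can also be sketched as a function that returns its results instead of printing them and keeps going when one article fails to load. This is a hypothetical variant, not the tutorial's own function: `summarize_items`, `fetch_text`, and `summarize` are illustrative names. In the notebook, `fetch_text` would wrap `UnstructuredURLLoader` and `summarize` would wrap the summarization chain; here they are stand-ins so the sketch runs without network or model access.

```python
from typing import Callable, Dict, List


def summarize_items(
    items: List[dict],
    fetch_text: Callable[[str], str],
    summarize: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Return one record per news item instead of printing it."""
    results = []
    for item in items:
        try:
            text = fetch_text(item["url"])
            # Skip the model call when nothing was extracted.
            summary = summarize(text) if text.strip() else "Content empty"
        except Exception as exc:  # keep going if one article fails
            summary = f"Failed: {exc}"
        results.append({"title": item["title"], "summary": summary})
    return results


# Stand-in data and callables for demonstration only.
demo_items = [{"title": "Example headline", "url": "https://example.com/a"}]
demo = summarize_items(
    demo_items,
    fetch_text=lambda url: "Some article text.",
    summarize=lambda text: text.upper(),
)
print(demo)
```

Returning a list of records makes it easier to store the summaries or post-process them, while printing remains a one-line loop over the result.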

In [17]:
# generate a summary for each item in a list of news metadata:
# load the URL, extract the news content, and summarize it
def generate_summary(docnews):
    # the "stuff" chain type passes the whole document to the model in one prompt;
    # the chain is built once and reused for every article
    chain = load_summarize_chain(llm,
                                 chain_type="stuff",
                                 prompt=prompt)

    for item in docnews:
        # extract news content
        loader = UnstructuredURLLoader(urls=[item['url']])
        data = loader.load()

        summary = chain.run(data)

        # show the summary for each news headline
        print(item['title'])
        print(item['publisher']['title'], item['published date'])
        print(summary, '\n')

In [18]:
#call the function and generate summary for news by keyword
generate_summary(news_by_keyword)

Tarif Kereta Cepat Jakarta-Kota Bandung Rp 350 Ribu, Mahal? - CNBC Indonesia
CNBC Indonesia Sat, 30 Sep 2023 09:45:00 GMT
 Kereta Cepat Jakarta-Bandung akan diresmikan pada 2 Oktober 2023. Tarifnya diperkirakan sekitar Rp300.000-Rp350.000 untuk kelas ekonomi. Setelah uji coba gratis, tiket akan dikenakan biaya. Presiden Jokowi ingin harga tiket terjangkau dan bisa didiskon untuk menarik minat masyarakat. 

Siaran Pers: Kereta Cepat "Whoosh" Diharapkan Perkuat Capaian ... - Kemenparekraf
Kemenparekraf Sat, 30 Sep 2023 05:53:06 GMT
 Kereta Cepat Jakarta-Bandung atau "Whoosh" akan resmi beroperasi mulai 2 Oktober 2023. Kereta cepat ini diharapkan dapat memperkuat capaian target wisatawan nusantara dan mancanegara di tahun 2023. 

Kemenparekraf dikatakan Dessy senantiasa mendorong pelaku industri agar mulai membuat paket-paket perjalanan wisata dengan memasukkan kereta cepat sebagai salah satu daya tarik ataupun transportasi pilihan. 

Kereta Cepat Jakarta-Bandung "Whoosh" terbagi dalam ti

In [19]:
#call the function and generate summary for news by topics
generate_summary(news_by_topic)

Tottenham Vs Liverpool: Badan Wasit Ngaku Salah, Gol Luis Diaz Harusnya Sah - detikSport
detikSport Sat, 30 Sep 2023 23:00:00 GMT
 Tottenham Hotspur vs Liverpool: Badan Wasit Profesional Inggris (PGMOL) mengakui adanya kesalahan dalam pertandingan tersebut. Gol Luis Diaz yang dianulir seharusnya sah. PGMOL akan melakukan tinjauan penuh atas insiden tersebut. 

MU Vs Palace: Setan Merah Kalah! - detikSport
detikSport Sat, 30 Sep 2023 15:58:02 GMT
 Manchester United kalah 0-1 dari Crystal Palace di Old Trafford pada lanjutan Liga Inggris, Sabtu (30/9/2023) malam WIB. Gol tunggal kemenangan Palace dicetak oleh Joachim Andersen pada menit ke-25. MU gagal memanfaatkan sejumlah peluang dan kesulitan menembus pertahanan Palace yang disiplin. Kekalahan ini membuat MU tertahan di posisi 10 klasemen sementara dengan sembilan poin dari tujuh pertandingan. 

Rating Pemain Manchester City Versus Wolverhampton Wanderers: Erling Haaland Cuma Sekadar Kameo Dalam Kekalahan Pertama - Goal.com
Goal.com S