<a href="https://colab.research.google.com/github/mchoirul/genai-code/blob/main/notebook/googlenews_summarize_vertex_langchain-git.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarizing Google News Headlines Using Vertex AI PaLM API & LangChain

In this tutorial, we summarize news content gathered from Google News using the following components:

- GNews API: Collect news titles & metadata from Google News
- LangChain's UnstructuredURLLoader: Retrieve news content
- Vertex AI PaLM API: Generate news summary

The Vertex AI PaLM API provides access to large language models (LLMs) that can be used for a variety of tasks, including text summarization. In this tutorial, we will use the `text-bison@001` model from the PaLM API to summarize news content.

Reference and credit to the following resources:
- https://github.com/ranahaani/GNews
- https://alphasec.io/summarize-google-news-results-with-langchain-and-serper-api/

## Objectives:
- Learn how to use the GNews API, LangChain's UnstructuredURLLoader, and the Vertex AI PaLM API to perform news summarization
- Create a news summarization function that can be used to automate the process of generating news summaries
- Gain a better understanding of the different steps involved in news summarization

## Installation & Preparation

In [1]:
# install all required packages
!pip -q install langchain
!pip install google-cloud-aiplatform
!pip install gnews
!pip install unstructured

In [2]:
# restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [2]:
# import required packages
from langchain.llms import VertexAI
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate
from langchain.document_loaders import UnstructuredURLLoader


In [3]:
# authenticate to google cloud account
from google.colab import auth as google_auth
google_auth.authenticate_user()

In [5]:
#google cloud project name
#replace with your project name

import vertexai

PROJECT_ID = "genai-llm01"  # @param {type:"string"}
vertexai.init(project=PROJECT_ID, location="us-central1")

## Calling GNews API to Get News Metadata
Limit the news period with the following time operators:
 - h = hours (eg: 12h)
 - d = days (eg: 7d)
 - m = months (eg: 6m)
 - y = years (eg: 1y)

Example:

`google_news.period = '3d'  # News from last 3 days `
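
To make the period notation concrete, here is a minimal sketch that converts a period string into an approximate `timedelta`. The helper `period_to_timedelta` is hypothetical and not part of the `gnews` package; it also assumes a month is 30 days and a year is 365 days, which is only an approximation.

```python
import re
from datetime import timedelta

# Hypothetical helper (not part of gnews): illustrates how a period
# string such as '12h' or '7d' can be interpreted.
# Assumption: a month is 30 days and a year is 365 days.
_UNIT_HOURS = {"h": 1, "d": 24, "m": 24 * 30, "y": 24 * 365}


def period_to_timedelta(period: str) -> timedelta:
    """Convert a period string like '3d' into an approximate timedelta."""
    match = re.fullmatch(r"(\d+)([hdmy])", period.strip())
    if match is None:
        raise ValueError(f"Unsupported period string: {period!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(hours=amount * _UNIT_HOURS[unit])


print(period_to_timedelta("3d"))  # 3 days, 0:00:00
```

A check like this can be useful for validating user input before assigning it to `google_news.period`.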

In [6]:
from gnews import GNews

google_news = GNews()
google_news.period = '1d'  # News from the last 1 day
google_news.max_results = 5  # maximum number of results per query
google_news.country = 'ID'  # News from a specific country = Indonesia
google_news.language = 'id'  # News in a specific language = Bahasa Indonesia
google_news.exclude_websites = ['yahoo.com', 'cnn.com', 'msn.com']  # Exclude news from specific websites, i.e. yahoo.com, cnn.com and msn.com

#use date range if required
#google_news.start_date = (2023, 1, 1) # Search from 1st Jan 2023
#google_news.end_date = (2023, 4, 1) # Search until 1st April 2023

#get by keyword
news_by_keyword = google_news.get_news('kaesang')

In [7]:
#check collected news metadata
news_by_keyword

[{'title': 'Kaesang Jadi Ketum PSI, Pengamat: Jokowi Kirim Pesan Ancaman ... - Republika Online',
  'description': 'Kaesang Jadi Ketum PSI, Pengamat: Jokowi Kirim Pesan Ancaman ...  Republika Online',
  'published date': 'Fri, 29 Sep 2023 13:50:24 GMT',
  'url': 'https://news.google.com/rss/articles/CBMidGh0dHBzOi8vbmV3cy5yZXB1Ymxpa2EuY28uaWQvYmVyaXRhL3MxcjJnMDQ4NC9rYWVzYW5nLWphZGkta2V0dW0tcHNpLXBlbmdhbWF0LWpva293aS1raXJpbS1wZXNhbi1hbmNhbWFuLWtlLW1lZ2F3YXRp0gEA?oc=5&hl=en-US&gl=US&ceid=US:en',
  'publisher': {'href': 'https://news.republika.co.id',
   'title': 'Republika Online'}},
 {'title': 'PSI Sambut Baik Jika Jokowi Bergabung Jadi Kader Susul Kaesang - CNN Indonesia',
  'description': 'PSI Sambut Baik Jika Jokowi Bergabung Jadi Kader Susul Kaesang  CNN Indonesia',
  'published date': 'Fri, 29 Sep 2023 20:16:00 GMT',
  'url': 'https://news.google.com/rss/articles/CBMifmh0dHBzOi8vd3d3LmNubmluZG9uZXNpYS5jb20vbmFzaW9uYWwvMjAyMzA5MjkxMTI3MjEtMzItMTAwNTA3MC9wc2ktc2FtYnV0LWJhaWstamlrYS1q

In [8]:
#test another method
#instead of search by keyword, let's retrieve top news from google-news

#get top news from the last 7 days
google_news = GNews(language='id', country='ID', period='7d',
                    start_date=None, end_date=None, max_results=10)
top_news = google_news.get_top_news()

#check collected news metadata
top_news

[{'title': 'Rumah Dinas Digeledah KPK, SYL Ungkap Hilirisasi Jokowi di Spanyol - CNN Indonesia',
  'description': 'Rumah Dinas Digeledah KPK, SYL Ungkap Hilirisasi Jokowi di Spanyol  CNN IndonesiaDua Temuan KPK Usai Geledah Rumah Dinas Mentan Syahrul  Kompas.com5 Kontroversi Syahrul Yasin Limpo, dari Bantah Food Estate Gagal, Kalung Anti Corona hingga Angkat Lesti Kejora Duta Petani  Bisnis Tempo.coKPK Dikabarkan Tetapkan Mentan Syahrul Yasin Limpo Tersangka  CNN IndonesiaSurya Paloh Irit Bicara soal Temuan Uang Miliaran di Rumah Dinas Mentan Syahrul: Nanti Ya  TribunnewsLihat Liputan Lengkap di Google Berita',
  'published date': 'Sat, 30 Sep 2023 03:28:10 GMT',
  'url': 'https://news.google.com/rss/articles/CBMigAFodHRwczovL3d3dy5jbm5pbmRvbmVzaWEuY29tL2Vrb25vbWkvMjAyMzA5MzAxMDExMzEtOTItMTAwNTQ2OC9ydW1haC1kaW5hcy1kaWdlbGVkYWgta3BrLXN5bC11bmdrYXAtaGlsaXJpc2FzaS1qb2tvd2ktZGktc3BhbnlvbNIBhAFodHRwczovL3d3dy5jbm5pbmRvbmVzaWEuY29tL2Vrb25vbWkvMjAyMzA5MzAxMDExMzEtOTItMTAwNTQ2OC9ydW1haC1kaW5hc

In [9]:
#collect metadata by news topic
#Available topics: WORLD, NATION, BUSINESS, TECHNOLOGY, ENTERTAINMENT, SPORTS, SCIENCE, HEALTH

google_news = GNews(language='id', country='ID',
                    period='7d', start_date=None, end_date=None,
                    max_results=5, exclude_websites = ['yahoo.com', 'msn'] )
news_by_topic = google_news.get_news_by_topic('NATION')

#check collected news
news_by_topic


[{'title': 'Rumah Dinas Digeledah KPK, SYL Ungkap Hilirisasi Jokowi di Spanyol - CNN Indonesia',
  'description': 'Rumah Dinas Digeledah KPK, SYL Ungkap Hilirisasi Jokowi di Spanyol  CNN IndonesiaDua Temuan KPK Usai Geledah Rumah Dinas Mentan Syahrul  Kompas.com5 Kontroversi Syahrul Yasin Limpo, dari Bantah Food Estate Gagal, Kalung Anti Corona hingga Angkat Lesti Kejora Duta Petani  Bisnis Tempo.coKPK Dikabarkan Tetapkan Mentan Syahrul Yasin Limpo Tersangka  CNN IndonesiaSurya Paloh Irit Bicara soal Temuan Uang Miliaran di Rumah Dinas Mentan Syahrul: Nanti Ya  TribunnewsLihat Liputan Lengkap di Google Berita',
  'published date': 'Sat, 30 Sep 2023 03:28:10 GMT',
  'url': 'https://news.google.com/rss/articles/CBMigAFodHRwczovL3d3dy5jbm5pbmRvbmVzaWEuY29tL2Vrb25vbWkvMjAyMzA5MzAxMDExMzEtOTItMTAwNTQ2OC9ydW1haC1kaW5hcy1kaWdlbGVkYWgta3BrLXN5bC11bmdrYXAtaGlsaXJpc2FzaS1qb2tvd2ktZGktc3BhbnlvbNIBhAFodHRwczovL3d3dy5jbm5pbmRvbmVzaWEuY29tL2Vrb25vbWkvMjAyMzA5MzAxMDExMzEtOTItMTAwNTQ2OC9ydW1haC1kaW5hc

## Extract news content
The `UnstructuredURLLoader` from the LangChain library is a useful tool for retrieving the HTML content of a URL. It is a wrapper around the `partition_html` function of the [Unstructured](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-html) library. We will use it to extract the news content from the URLs collected in the previous steps.
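
Text extracted from a news page may include navigation menus and long runs of whitespace. As a minimal sketch of a cleanup step, the hypothetical helper `normalize_whitespace` below (plain Python, not part of LangChain or Unstructured) collapses whitespace inside each line and drops empty lines. The sample string is made up for illustration.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs inside each line and drop empty lines."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample resembling the menu text of a scraped page
raw = "Home\n\nNasional \n        \n      Politik   \n\n  Hukum &  Kriminal "
print(normalize_whitespace(raw))
```

A step like this could be applied to each document's `page_content` before summarization to reduce the amount of text sent to the model.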

In [18]:
# test extracting content from the first two URLs in news_by_topic

urls = [
    news_by_topic[0]['url'],
    news_by_topic[1]['url'],
]

loader = UnstructuredURLLoader(urls=urls)
content = loader.load()

#check news content
content

[Document(page_content='Advertisement\n\nCLOSE\n\nHome\n\nNasional \n                \n                    \n                        \n                                                                                                 Politik \n                                                                                                                                 Hukum & Kriminal \n                                                                                                                                 Peristiwa \n                                                                                                                                 Pemilu 2024 \n                                                                                    \n                        \n                            \n                                BERITA TERBARU\n                            \n                            \n                                                                    \n      

## Summarize News with Vertex AI PaLM API

The next step is to call `text-bison@001` to generate the news summary. We need to supply a prompt that tells the model how to summarize the text.


**Prompting**

An accurate prompt is essential for getting good results from an LLM. The `prompt_template` tells the model to generate a news summary that follows these rules:
  * The summary contains at most 100 words
  * If the text cannot be found or an error occurs, return: "Content empty"
  * Use only material from the supplied text
  * Write the summary in Bahasa Indonesia
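
The way the rules and the article text end up in one prompt string can be sketched in plain Python. The names `TEMPLATE`, `RULES` and `build_prompt` are hypothetical and used only for illustration. The notebook itself relies on LangChain's `PromptTemplate`, which fills the `{text}` placeholder in a similar way.

```python
# Illustrative only: plain str.format stands in for PromptTemplate here.
TEMPLATE = (
    "Generate summary for the following text, using the following steps:\n"
    "{rules}\n\n"
    '"{text}"\n'
    "SUMMARY:"
)

RULES = [
    "summary consists of maximum 100 words",
    'If the text cannot be found or error, return: "Content empty"',
    "Use only materials from the text supplied",
    "Create summary in Bahasa Indonesia",
]


def build_prompt(text: str) -> str:
    """Number the rules and insert the article text into the template."""
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(RULES, start=1))
    return TEMPLATE.format(rules=numbered, text=text)


print(build_prompt("Contoh isi berita."))
```

Keeping the rules in a list makes the numbering automatic, so adding or removing a rule cannot produce duplicate step numbers.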



In [19]:
# prompt used to generate the news summary
prompt_template = """Generate summary for the following text, using the following steps:
                     1. summary consists of maximum 100 words
                     2. If the text cannot be found or error, return: "Content empty"
                     3. Use only materials from the text supplied
                     4. Create summary in Bahasa Indonesia

                    "{text}"
                    SUMMARY:"""

prompt = PromptTemplate.from_template(prompt_template)

# declare the LLM
llm = VertexAI(temperature=0.1,
               model_name='text-bison@001',
               top_k=40,
               top_p=0.8,
               max_output_tokens=512)

Wrap the summarization process in a function that loops over a collection of news URLs. The `generate_summary` function does the following:
- Retrieves the news content from each URL
- Generates a summary of each article
- Prints the output
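
The same fetch, summarize and report loop can be sketched independently of any library, which makes it easier to see where error handling fits. The function `summarize_items` and the two stub callables are hypothetical. In the notebook, fetching corresponds to `UnstructuredURLLoader` and summarizing to the LangChain chain. Falling back to "Content empty" on failure is an assumption that mirrors the prompt rule, not something the original function does.

```python
from typing import Callable, Dict, List


def summarize_items(
    items: List[Dict[str, str]],
    fetch_text: Callable[[str], str],
    summarize: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Fetch and summarize each item; one failing URL does not stop the loop."""
    results = []
    for item in items:
        try:
            text = fetch_text(item["url"])
            summary = summarize(text) if text.strip() else "Content empty"
        except Exception as exc:  # assumption: treat any failure as empty content
            summary = f"Content empty ({type(exc).__name__})"
        results.append({"title": item["title"], "summary": summary})
    return results


# Stub callables standing in for the real loader and LLM chain
def fake_fetch(url: str) -> str:
    if "broken" in url:
        raise ConnectionError("unreachable")
    return "Isi berita contoh."


def fake_summarize(text: str) -> str:
    return text[:10]


demo_items = [
    {"title": "Berita A", "url": "https://example.com/a"},
    {"title": "Berita B", "url": "https://example.com/broken"},
]
for row in summarize_items(demo_items, fake_fetch, fake_summarize):
    print(row["title"], "->", row["summary"])
```

Returning a list of dictionaries instead of printing directly makes the results easier to store or post-process later.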

In [20]:
# create a function that generates news summaries from a list of news items
# for each item: load the URL, extract the news content, and summarize it
def generate_summary(docnews):
    """Print a summary for every news item in `docnews` (a list of GNews result dicts)."""
    # build the chain once; "stuff" sends the whole document in a single prompt
    chain = load_summarize_chain(llm,
                                 chain_type="stuff",
                                 prompt=prompt)

    for item in docnews:
        # extract news content
        loader = UnstructuredURLLoader(urls=[item['url']])
        data = loader.load()

        # summarize the extracted content
        summary = chain.run(data)

        # show the summary for each news headline
        print(item['title'])
        print(item['publisher']['title'], item['published date'])
        print(summary, '\n')

In [21]:
#call the function and generate summary for news by keyword
generate_summary(news_by_keyword)

Kaesang Jadi Ketum PSI, Pengamat: Jokowi Kirim Pesan Ancaman ... - Republika Online
Republika Online Fri, 29 Sep 2023 13:50:24 GMT
 Putra bungsu Presiden Joko Widodo, Kaesang Pangarep, baru saja terjun ke dunia politik dengan menjabat sebagai ketua umum Partai Solidaritas Indonesia (PSI). Pengamat menilai langkah ini sebagai bentuk perlawanan Kaesang terhadap ibunya, Megawati Soekarnoputri, yang merupakan ketua umum PDIP. Jokowi juga dianggap mengirimkan pesan ancaman kepada Megawati dan PDIP jika pendukungnya beralih ke PSI. 

PSI Sambut Baik Jika Jokowi Bergabung Jadi Kader Susul Kaesang - CNN Indonesia
CNN Indonesia Fri, 29 Sep 2023 20:16:00 GMT
 PSI menyatakan akan menyambut baik jika Presiden Joko Widodo bergabung menjadi kader partai seperti putranya, Kaesang Pangarep. 

PSI merupakan partai yang terinspirasi oleh Jokowi sejak awal berdiri. Jokowi dinilai telah menjadi pemimpin yang hadir membawa perubahan untuk masyarakat Indonesia. 

Kehadiran Kaesang di PSI menambah semangat p

In [22]:
#call the function and generate summary for news by topics
generate_summary(news_by_topic)

Rumah Dinas Digeledah KPK, SYL Ungkap Hilirisasi Jokowi di Spanyol - CNN Indonesia
CNN Indonesia Sat, 30 Sep 2023 03:28:10 GMT
 Menteri Pertanian Syahrul Yasin Limpo sedang berada di Spanyol saat rumahnya digeledah oleh KPK. Di Spanyol, ia menyampaikan kebijakan hilirisasi yang digalakkan pemerintahan Joko Widodo. Ia juga mendorong Pemerintah Spanyol untuk membuka akses pasar produk hortikultura asal Indonesia secara luas. Sementara itu, KPK menggeledah rumah dinas Syahrul di Jakarta Selatan terkait kasus dugaan korupsi. 

Ketua KPU hingga Firli Bakal Jadi Pembicara di Rakernas PDIP - detikNews
detikNews Sat, 30 Sep 2023 01:56:22 GMT
 Ketua KPU Hasyim Asy'ari dan Ketua Bawaslu RI Rahmat Bagja akan menjadi pembicara dalam sesi diskusi terkait Pemilu pada Rakernas PDIP. Ketua KPK Firli Bahuri juga akan menjadi pembicara menyampaikan materi terkait peran parpol dalam pengawasan dan pencegahan politik uang jelang Pemilu 2024. 

Ketua Umum PSI Kaesang Ungkap Kriteria Capres | Obrolan Malam 