# Summarize Google News Results with LangChain🦜🔗, Huggingface🤗 and Serper API

## Overview

Text summarization is the process of creating a shorter version of a text document while still preserving the most important information. This can be useful for a variety of purposes, such as quickly skimming a long document, getting the gist of an article, or sharing a summary with others. LLMs can be used to create summaries of news articles, research papers, technical documents, and other types of text.

<img src="images/miztiik_text_summarization_01.png" width="50%"/>

In this notebook, we will try fetch the latest Google news using server API and use AI-generated summaries with LangChain LLM framework or huggingface transformers.

<a href="https://colab.research.google.com/github/miztiik/llm-bootcamp/blob/main/chapters/text_summarization/news_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [9]:
%%capture
# Comment the above line to see the installation logs

# Install the dependencies
!pip install -qU python-dotenv
!pip install -qU langchain-core==0.1.23
!pip install -qU langchain==0.1.6
!pip install -qU langchain-community==0.0.19
!pip install -qU langchain-openai
!pip install --upgrade --quiet  transformers --quiet
!pip install -qU newspaper3k


# langchain==0.1.6
# langchain-community==0.0.19
# langchain-core==0.1.23

In [10]:
# Load environment variables
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True

In [11]:
# Not a good practice, but we will ignore warnings in this notebook, as tensor has deprecated some methods and will be removed in future versions.
# https://github.com/pytorch/pytorch/issues/97207#issuecomment-1494781560
import warnings

warnings.filterwarnings(
    "ignore", category=UserWarning, message="TypedStorage is deprecated"
)

Update your `API_KEY` in the `.env` file. You can get the API keys from the following links. _Note: Some of the services may require you to have an account and some may charge you for usage_
- [OpenAI API Key](https://platform.openai.com/account/api-keys)
- [Hugging Face API Key](https://huggingface.co/settings/tokens)
- [Serper API Key](https://serper.dev/api-key)

In [12]:
# os.environ["HF_TOKEN"] = ""
# os.environ["HUGGINGFACEHUB_API_TOKEN"] = ""
# os.environ["OPENAI_API_KEY"] = ""
# os.environ["SERPER_API_KEY"] = ""

In [13]:
from langchain_openai import OpenAI
from langchain_openai import ChatOpenAI

# To specify a particular model refer to the OpenAI documentation - https://platform.openai.com/docs/models
# Completions Model: https://platform.openai.com/docs/models/completions
# Chat Model: https://platform.openai.com/docs/models/completions

llm = OpenAI()
llm_chat = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0.3)

In [82]:
from langchain_community.utilities import GoogleSerperAPIWrapper
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import TextLoader
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import NewsURLLoader

**Serper API** - [Sign up](https://serper.dev/signup?ref=miztiik) for an account with Serper, or log in if you already have an account, and create an API key. Serper offers a generous free tier; as you consume the API, the dashboard will populate with the requests and remaining credits.

In [15]:
import os

search = GoogleSerperAPIWrapper(
    type="news", tbs="qdr:d1", serper_api_key=os.getenv("SERPER_API_KEY")
)

news_search_query = "indian economy trends"
news_results = search.results(news_search_query)

if news_results.get("news") is None:
    print("No results found")

In [16]:
print(len(news_results["news"]))

10


In [26]:
for i in news_results["news"][0]:
    print(f"{i}:{news_results['news'][0][i]}")

title:In defence of everything
link:https://economictimes.com/opinion/et-commentary/in-defence-of-everything/articleshow/107700106.cms
snippet:The Americans have proposed, once again, joint production of the Javelin anti-tank missile, an offer first made in 2013. But things got stuck over tech...
date:3 hours ago
source:The Economic Times
position:1
contents:[Document(page_content="On point and on track. The India-US defence partnership is going places and recording successes. The US is making new offers, India is listening, and the two bureaucracies are working to align their needs, while making startups a part of the process.The energy to do what once was unthinkable obviously comes from a common threat called China . Best to think of the US willingness to sell, transfer technology and otherwise enhance India's defence posture in terms of two concepts - deterrence and interoperability . The more effectively India can deter Chinese coercion along the border and across the Indian Ocean

In [106]:
# Limit how many news articles to process
num_results = 5
if len(news_results["news"]) < num_results:
    num_results = len(news_results["news"])


text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". "], chunk_size=8000, chunk_overlap=100
)


# For each news article, load the contents
for i in range(num_results):
    loader = NewsURLLoader(urls=[news_results["news"][i]["link"]])
    news_results["news"][i]["contents"] = loader.load()

    if news_results["news"][i]["contents"]:
        # Make the docs to fit model input size
        news_results["news"][i]["contents"] = text_splitter.create_documents(
            [news_results["news"][i]["contents"][0].page_content]
        )
    else:
        print(f"Failed to load {news_results['news'][i]['link']}")

Error fetching or processing https://www.forbes.com/uk/advisor/personal-finance/2024/02/14/inflation-rate-update/, exception: Article `download()` failed with 403 Client Error: Max restarts limit reached for url: https://www.forbes.com/uk/advisor/personal-finance/2024/02/14/inflation-rate-update/ on URL https://www.forbes.com/uk/advisor/personal-finance/2024/02/14/inflation-rate-update/


Failed to load https://www.forbes.com/uk/advisor/personal-finance/2024/02/14/inflation-rate-update/




## Summarization with Open AI Models

<img src="images/miztiik_text_summarization_02.png" width="50%"/>

If our document is short enough to fit within the context window of the model, we can use the `stuffing` method.

**Stuffing method** - The `stuffing` method is the easiest way to summarize text by feeding the entire document to a large language model (LLM) in a single call. This method has both pros and cons.

  - **Pros**:
    - Only required a single call to the model, which can be faster than other methods that require multiple calls
    - When summarizing text, the model has access to all the data at once, which can result in a better summary.
  - **Cons**:
    - Most models have a context length, and for large documents (or many documents) this will not work as it will result in a prompt larger than the context length.
    - This method only works on smaller pieces of data and not suitable to large documents most of the time.

### 

In [108]:
# 16k is the max input length for GPT-3.5
num_tokens_first_doc = llm.get_num_tokens(
    news_results["news"][1]["contents"][0].page_content
)

oai_chain = load_summarize_chain(llm=llm, chain_type="map_reduce", verbose=True)


oai_summary = oai_chain.invoke(news_results["news"][1]["contents"])



[1m> Entering new MapReduceDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mWrite a concise summary of the following:


"The global economy has been going through some level of turmoil for quite a while, no doubt. Businesses and consumers across the globe have had their share of ups and downs to face, especially when it comes to financial instability.

Jim Needell, Chief Client Officer, Local Markets, Ipsos agrees that the financial state of the global economy is one of the key factors driving brands to make decisions that can help them adapt to the changing consumer trends.

“The majority of markets now report that financial instability is the number one problem consumers are facing. So brands have to adapt their behaviours to how consumers are feeling about their financial situation, and that can come from anything, from pack size to formulation. Because the only way a lot of the big global brands are growing at the momen

In [109]:
print(
    f"Title: {news_results['news'][1]['title']}\n\nLink: {news_results['news'][1]['link']}\n\nSummary: \033[32m{oai_summary['output_text']}\033[0m"
)

Title: 'Brands are grappling with growing volume, amidst economic headwinds'

Link: https://www.exchange4media.com/digital-news/brands-are-grappling-with-growing-volume-amidst-economic-headwinds-132492.html

Summary: [32m

The global economy's volatility has posed difficulties for businesses and consumers, leading brands to adjust to changing consumer preferences, particularly sustainability. The key challenge for brands is achieving efficient growth and targeting their audience, requiring an understanding of the current global climate and consumer behavior, including the rise of technology and empathy. The Indian market is appealing for its large size, adaptable population, and expanding middle class. While digital advertising is gaining traction, the product and messaging remain crucial for success.[0m


### Observations

The summarization is not perfect, but it does a good job of capturing the main points of the article. The summary is coherent and reads well. As OpenAI continues to improve their models, we can expect the quality of the summaries to improve as well.

## Summarization with Open Source Models

<img src="images/miztiik_text_summarization_03.png" width="50%"/>

### Summarization with Hugging Face hosted models

In [110]:
from langchain.chains.summarize import load_summarize_chain
from langchain.llms.huggingface_hub import HuggingFaceHub

hf_hub_llm = HuggingFaceHub(
    repo_id="google/flan-t5-xxl", model_kwargs={"temperature": 0.3, "max_length": 256}
)

chain = load_summarize_chain(llm=hf_hub_llm, chain_type="map_reduce", verbose=True)

# run chain
map_reduce_summary = chain.invoke(news_results["news"][1]["contents"])



[1m> Entering new MapReduceDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mWrite a concise summary of the following:


"The global economy has been going through some level of turmoil for quite a while, no doubt. Businesses and consumers across the globe have had their share of ups and downs to face, especially when it comes to financial instability.

Jim Needell, Chief Client Officer, Local Markets, Ipsos agrees that the financial state of the global economy is one of the key factors driving brands to make decisions that can help them adapt to the changing consumer trends.

“The majority of markets now report that financial instability is the number one problem consumers are facing. So brands have to adapt their behaviours to how consumers are feeling about their financial situation, and that can come from anything, from pack size to formulation. Because the only way a lot of the big global brands are growing at the momen

HfHubHTTPError: 422 Client Error: Unprocessable Entity for url: https://api-inference.huggingface.co/models/google/flan-t5-xxl (Request ID: 8_OH2zIKtjhnvgeMHb4Rq)

Input validation error: `inputs` must have less than 1024 tokens. Given: 1159
Make sure 'text2text-generation' task is supported by the model.

In [41]:
print(map_reduce_summary["output_text"])

NameError: name 'map_reduce_summary' is not defined

In [42]:
def split_text(text, max_chunk_size=2048):
    chunks = []
    current_chunk = ""
    for sentence in text.split("."):
        if len(current_chunk) + len(sentence) < max_chunk_size:
            current_chunk += sentence + "."
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence + "."
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

In [85]:
def split_text_as_langchain_docs(text, max_chunk_size=2048):
    chunks = []
    current_chunk = ""
    for sentence in text.split("."):
        if len(current_chunk) + len(sentence) < max_chunk_size:
            current_chunk += sentence + "."
        else:
            chunks.append(Document(page_content=current_chunk.strip(), metadata={}))
            current_chunk = sentence + "."
    if current_chunk:
        chunks.append(Document(page_content=current_chunk.strip(), metadata={}))
    return chunks

In [83]:
chunked_news_txt = split_text(
    news_results["news"][1]["contents"][0].page_content)

print(len(chunked_news_txt))

3


In [104]:
chunked_news_txt_as_docs = split_text_as_langchain_docs(
    news_results["news"][1]["contents"][0].page_content
)

chunked_news_txt_as_docs

[Document(page_content='The global economy has been going through some level of turmoil for quite a while, no doubt. Businesses and consumers across the globe have had their share of ups and downs to face, especially when it comes to financial instability.\n\nJim Needell, Chief Client Officer, Local Markets, Ipsos agrees that the financial state of the global economy is one of the key factors driving brands to make decisions that can help them adapt to the changing consumer trends.\n\n“The majority of markets now report that financial instability is the number one problem consumers are facing. So brands have to adapt their behaviours to how consumers are feeling about their financial situation, and that can come from anything, from pack size to formulation. Because the only way a lot of the big global brands are growing at the moment is through price increases in many markets,” Needell told exchange4media in an interview.\n\nNeedell, along with Suresh Ramalingam, Chief Client Officer A

### Summarization with Open Source Local Models

In [44]:
from transformers import BartForConditionalGeneration, BartTokenizer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import pipeline

# Load the model and tokenizer

bart_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
bart_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

print(f"bart_model_max_model_length:{bart_tokenizer.model_max_length}")

bart_model_max_model_length:1024


In [45]:
bart_summarizer = pipeline(
    "summarization", model="facebook/bart-large-cnn", truncation=True
)

bart_summary = bart_summarizer(
    news_results["news"][1]["contents"][0].page_content, max_length=300, min_length=10
)

print(bart_summary)

[{'summary_text': 'Jim Needell, Chief Client Officer, Local Markets, Ipsos talks to exchange4media about the changing consumer trends. Needell says brands need to adapt their behaviours to how consumers are feeling about their financial situation. India is hugely attractive for the size of the market, and your ability to adapt and adopt new technologies.'}]


### Summarization with Open Source Local Smaller Models

In [46]:
hf_small_model_summarizer = pipeline(
    "summarization", model="sshleifer/distilbart-cnn-12-6", truncation=True
)

hf_small_model_summary = hf_small_model_summarizer(
    news_results["news"][1]["contents"][0].page_content, max_length=250, min_length=10
)

print(hf_small_model_summary)

[{'summary_text': ' Jim Needell, Chief Client Officer, Local Markets, Ipsos talks to exchange4media about the changing consumer trends in the Indian market . Needell and Suresh Ramalingam from APEC, share their views on the changing trends in purchase behaviours .'}]


## Using Huggingface Transformers Models with LangChain

Hugging Face models can be run locally through the HuggingFacePipeline class.

In [47]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

hf = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 10},
)

In [99]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


bart_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
bart_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
pipe = pipeline(
    "text2text-generation",
    model=bart_model,
    tokenizer=bart_tokenizer,
    # max_length=1000
)
hf_local_llm = HuggingFacePipeline(pipeline=pipe)

In [100]:
local_llm = hf_local_llm("What is Tamil?")

In [101]:
local_llm

'Tamil is a Tamil language. What is Tamil? It is a way of life in Tamil Nadu. Tamil is the language of the Tamil people. It is also known as Tamil-Tamil or Tamil-Thai. It means "people of Tamil" or "Tamil-speaking"'

In [111]:
hf_chain = load_summarize_chain(llm=hf_local_llm, chain_type="map_reduce", verbose=True)

# run chain
# map_reduce_summary = chain.invoke(chunked_news_txt_as_docs)
map_reduce_summary = chain.invoke(news_results["news"][1]["contents"])



[1m> Entering new MapReduceDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mWrite a concise summary of the following:


"The global economy has been going through some level of turmoil for quite a while, no doubt. Businesses and consumers across the globe have had their share of ups and downs to face, especially when it comes to financial instability.

Jim Needell, Chief Client Officer, Local Markets, Ipsos agrees that the financial state of the global economy is one of the key factors driving brands to make decisions that can help them adapt to the changing consumer trends.

“The majority of markets now report that financial instability is the number one problem consumers are facing. So brands have to adapt their behaviours to how consumers are feeling about their financial situation, and that can come from anything, from pack size to formulation. Because the only way a lot of the big global brands are growing at the momen

HfHubHTTPError: 422 Client Error: Unprocessable Entity for url: https://api-inference.huggingface.co/models/google/flan-t5-xxl (Request ID: dm3_N1KD1EHuN1zCo9FWW)

Input validation error: `inputs` must have less than 1024 tokens. Given: 1159
Make sure 'text2text-generation' task is supported by the model.

In [None]:
map_reduce_summary

{'input_documents': [Document(page_content='The global economy has been going through some level of turmoil for quite a while, no doubt. Businesses and consumers across the globe have had their share of ups and downs to face, especially when it comes to financial instability.\n\nJim Needell, Chief Client Officer, Local Markets, Ipsos agrees that the financial state of the global economy is one of the key factors driving brands to make decisions that can help them adapt to the changing consumer trends.\n\n“The majority of markets now report that financial instability is the number one problem consumers are facing. So brands have to adapt their behaviours to how consumers are feeling about their financial situation, and that can come from anything, from pack size to formulation. Because the only way a lot of the big global brands are growing at the moment is through price increases in many markets,” Needell told exchange4media in an interview.\n\nNeedell, along with Suresh Ramalingam, Ch

## Additional Reading

- [LLM Bootcamp](https://github.com/miztiik/llm-bootcamp)
- [Revolutionizing News Summarization](https://www.width.ai/post/revolutionizing-news-summarization-exploring-the-power-of-gpt-in-zero-shot-and-specialized-tasks)
- [Summarizer For Any Size Document](https://www.width.ai/post/gpt3-summarizer)
- [Langchain Summarization - Stuff & Map Reduce](https://python.langchain.com/docs/use_cases/summarization)
- [Langchain Google Serper](https://python.langchain.com/docs/integrations/tools/google_serper)
- [Hugging Face Local Pipelines](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines)