# Summarize Google News Results with OpenAI and Serper API

## Overview

Text summarization is the process of creating a shorter version of a text document while still preserving the most important information. This can be useful for a variety of purposes, such as quickly skimming a long document, getting the gist of an article, or sharing a summary with others. LLMs can be used to create summaries of news articles, research papers, technical documents, and other types of text.

<img src="images/miztiik_text_summarization_01.png" width="50%"/>


## Chunking Strategies for LLM Applications

- **Stuffing method** - The `stuffing` method is the simplest way to summarize text: it feeds the entire document to a large language model (LLM) in a single call. This method has both pros and cons.

  - **Pros**:
    - Only requires a single call to the model, which can be faster than methods that require multiple calls.
    - When summarizing text, the model has access to all the data at once, which can result in a better summary.
  - **Cons**:
    - Most models have a maximum context length; for large documents (or many documents), stuffing fails because the prompt exceeds that limit.
    - As a result, this method only works on smaller inputs and is usually not suitable for large documents.

- **MapReduce method** - This technique summarizes large texts in multiple stages: it first summarizes smaller chunks of the text and then combines those summaries into a single summary. In LangChain, `load_summarize_chain` builds a `MapReduceDocumentsChain` when you set the `chain_type` of your chain to `map_reduce`. Variants include:
  - MapReduce with Overlapping Chunks method
  - MapReduce with Rolling Summary method
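
The map-reduce idea with overlapping chunks can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: `split_with_overlap` and `map_reduce_summary` are hypothetical helper names, and `summarize` stands in for any function that calls an LLM.

```python
from typing import Callable, List


def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into character windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def map_reduce_summary(
    text: str,
    summarize: Callable[[str], str],  # assumption: wraps an LLM call
    chunk_size: int = 1000,
    overlap: int = 100,
) -> str:
    """Map: summarize each chunk. Reduce: summarize the joined partial summaries."""
    partial = [summarize(chunk) for chunk in split_with_overlap(text, chunk_size, overlap)]
    if len(partial) == 1:
        return partial[0]
    return summarize("\n".join(partial))


# Usage with a stand-in summarizer that keeps the first 40 characters
print(map_reduce_summary("some long article text " * 200, lambda s: s[:40]))
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks, at the cost of summarizing some text twice.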

![Miztiik Automation: Text Summarization](images/miztiik_automation_docs_copilot_using_llm_rag_02.png)

As we do not know the length of the articles in advance, we will use the `map_reduce` method to summarize the news articles.
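
If the document length were known, the choice between the two methods could be made per document. The sketch below assumes a rough heuristic of four characters per token and a 4096-token context window; both numbers are illustrative assumptions, and `choose_chain_type` is a hypothetical helper, not a LangChain function.

```python
def choose_chain_type(
    text: str,
    context_tokens: int = 4096,   # assumption: model context window
    reserved_tokens: int = 512,   # assumption: room for the prompt and the summary
    chars_per_token: int = 4,     # assumption: rough average for English text
) -> str:
    """Return "stuff" when the text likely fits in one call, else "map_reduce"."""
    estimated_tokens = len(text) // chars_per_token + 1
    budget = context_tokens - reserved_tokens
    return "stuff" if estimated_tokens <= budget else "map_reduce"


print(choose_chain_type("short article " * 10))
print(choose_chain_type("long article " * 10_000))
```

The returned string matches the values that `load_summarize_chain` accepts for `chain_type`. For an exact count, a tokenizer-based measure such as the `llm.get_num_tokens` call referenced later in this notebook is more reliable than a character heuristic.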

In this notebook, we fetch the latest Google News results using the Serper API and generate AI summaries with the LangChain LLM framework or Hugging Face Transformers.


<a href="https://colab.research.google.com/github/miztiik/llm-bootcamp/blob/main/chapters/text_summarization/news_summarization_with_oai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [9]:
%%capture
# Comment the above line to see the installation logs

# Install the dependencies
!pip install -qU python-dotenv
!pip install -qU langchain-core==0.1.23
!pip install -qU langchain==0.1.6
!pip install -qU langchain-community==0.0.19
!pip install -qU langchain-openai
!pip install -qU transformers
!pip install -qU newspaper3k



In [1]:
# Load environment variables
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True

In [2]:
# Not a good practice, but we suppress one specific warning in this notebook: PyTorch has deprecated TypedStorage and will remove it in a future version.
# https://github.com/pytorch/pytorch/issues/97207#issuecomment-1494781560
import warnings

# warnings.filterwarnings('ignore')
warnings.filterwarnings(
    "ignore", category=UserWarning, message="TypedStorage is deprecated"
)

Update your API keys in the `.env` file. You can get the keys from the following links. _Note: Some of the services require an account, and some may charge you for usage._
- [OpenAI API Key](https://platform.openai.com/account/api-keys)
- [Hugging Face API Key](https://huggingface.co/settings/tokens)
- [Serper API Key](https://serper.dev/api-key)
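
Before running the remaining cells, it can help to confirm that the keys were actually loaded. The sketch below is an optional check, not part of the original notebook. `missing_keys` is a hypothetical helper, and the key names are the ones used in the next cell.

```python
import os
from typing import List, Mapping, Optional

# Keys this notebook reads; HF_TOKEN is only needed for Hugging Face models
REQUIRED_KEYS = ["OPENAI_API_KEY", "SERPER_API_KEY"]


def missing_keys(required: List[str], env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys(REQUIRED_KEYS)
if missing:
    print(f"Missing keys, add them to your .env file: {', '.join(missing)}")
```
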

In [3]:
# os.environ["HF_TOKEN"] = ""
# os.environ["HUGGINGFACEHUB_API_TOKEN"] = ""
# os.environ["OPENAI_API_KEY"] = ""
# os.environ["SERPER_API_KEY"] = ""

In [4]:
from langchain_openai import OpenAI
from langchain_openai import ChatOpenAI

# To specify a particular model refer to the OpenAI documentation - https://platform.openai.com/docs/models
# Completions Model: https://platform.openai.com/docs/models/completions
# Chat Model: https://platform.openai.com/docs/models/completions

llm = OpenAI()
llm_chat = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0.3)

In [7]:
from langchain_community.utilities import GoogleSerperAPIWrapper
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import NewsURLLoader

**Serper API** - [Sign up](https://serper.dev/signup?ref=miztiik) for an account with Serper, or log in if you already have an account, and create an API key. Serper offers a generous free tier; as you consume the API, the dashboard will populate with the requests and remaining credits.

In [30]:
import os

search = GoogleSerperAPIWrapper(
    type="news", tbs="qdr:d1", serper_api_key=os.getenv("SERPER_API_KEY")
)

news_search_query = "india"
news_results = search.results(news_search_query)

if news_results.get("news") is None:
    print("No results found")

In [31]:
print(len(news_results["news"]))

10


In [32]:
for i in news_results["news"][1]:
    print(f"{i}:{news_results['news'][1][i]}")

title:India hikes windfall tax on petroleum crude, diesel
link:https://www.reuters.com/business/energy/india-hikes-windfall-tax-petroleum-crude-diesel-2024-02-15/
snippet:The Indian government is increasing the windfall tax on petroleum crude to 3300 rupees ($39.76) a metric ton from 3200 rupees with effect from Feb.
date:6 hours ago
source:Reuters
imageUrl:https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTBsjlsoQ0sl9Bx4NUQzQUce64F3CLTRKu8JPIKfM3D1Yd9ui1KpKcq7bExeQ&s
position:2
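
Each news entry is a plain dictionary with the fields printed above (`title`, `link`, `snippet`, `date`, `source`, `imageUrl`, `position`). As a minimal sketch, the results can be reduced to just the fields needed for summarization. `to_headlines` is a hypothetical helper, and the sample data below only mirrors the printed output.

```python
from typing import Dict, List


def to_headlines(results: Dict, limit: int = 5) -> List[Dict[str, str]]:
    """Keep title, link and source for the first `limit` entries that have a link."""
    items = results.get("news") or []
    return [
        {
            "title": item.get("title", ""),
            "link": item["link"],
            "source": item.get("source", ""),
        }
        for item in items[:limit]
        if item.get("link")
    ]


# Sample input shaped like the Serper response shown above
sample = {
    "news": [
        {
            "title": "India hikes windfall tax on petroleum crude, diesel",
            "link": "https://www.reuters.com/business/energy/india-hikes-windfall-tax-petroleum-crude-diesel-2024-02-15/",
            "source": "Reuters",
            "position": 2,
        }
    ]
}
for headline in to_headlines(sample):
    print(f"{headline['source']}: {headline['title']}")
```
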


In [33]:
# Limit how many news articles to process
num_results = min(5, len(news_results["news"]))

text_splitter = RecursiveCharacterTextSplitter(
    separators=[
        "\n\n",
        "\n",
    ],
    chunk_size=1000,
    chunk_overlap=100,
)

# For each news article, load the contents.
# Build a new list instead of popping while iterating, which would skip items.
loaded_news = []
for news_item in news_results["news"]:
    loader = NewsURLLoader(urls=[news_item.get("link")])
    contents = loader.load()
    if contents:
        news_item["article"] = contents
        # Split the article into chunks that fit the model input size
        news_item["split_article"] = text_splitter.create_documents(
            [contents[0].page_content]
        )
        loaded_news.append(news_item)
    else:
        print(f"Failed to load {news_item['link']}, removed from results.\n")

news_results["news"] = loaded_news
# Some articles may have been dropped, so recompute the limit
num_results = min(num_results, len(loaded_news))

Error fetching or processing https://www.reuters.com/business/energy/india-hikes-windfall-tax-petroleum-crude-diesel-2024-02-15/, exception: Article `download()` failed with 401 Client Error: HTTP Forbidden for url: https://www.reuters.com/business/energy/india-hikes-windfall-tax-petroleum-crude-diesel-2024-02-15/ on URL https://www.reuters.com/business/energy/india-hikes-windfall-tax-petroleum-crude-diesel-2024-02-15/


Failed to load https://www.reuters.com/business/energy/india-hikes-windfall-tax-petroleum-crude-diesel-2024-02-15/


In [None]:
for i in news_results["news"]:
    print(i)
    print(f"\033[32m-----\033[0m")

## Summarization with OpenAI Models

<img src="images/miztiik_text_summarization_02.png" width="50%"/>

In [38]:
# gpt-3.5-turbo-0125 accepts roughly 16k tokens of context; the default model behind `OpenAI()` may accept fewer.
# num_tokens_first_doc = llm.get_num_tokens(
#     news_results["news"][1]["article"][0].page_content
# )

oai_chain = load_summarize_chain(
    llm=llm,
    chain_type="map_reduce",
    # verbose=True
)

# Summarize the chunks stored under "split_article" by the loading cell
for i in range(num_results):
    docs = news_results["news"][i].get("split_article")
    if docs:
        news_results["news"][i]["oai_summary"] = oai_chain.invoke(docs)

Let us take a look at the summaries generated by the `map_reduce` method.

In [51]:
for i in news_results["news"][:num_results]:
    # Only print articles that were loaded and summarized
    if i.get("oai_summary"):
        print(
            f"\nTitle: {i['title']}\nLink: {i['link']}\nSummary: \033[32m{i['oai_summary']['output_text']}\033[0m"
        )


Title: India’s Supreme Court Strikes Down Contentious Election Fund-Raising Tool
Link: https://www.nytimes.com/2024/02/15/world/asia/india-political-finance-ruling.html
Summary: India's Supreme Court has banned the use of electoral bonds, a system that allowed anonymous donations to political parties and was seen as a benefit for the ruling party. This decision, while not affecting the upcoming election, is seen as a step towards more transparency and accountability in campaign finance. The ruling party has used this system to raise large sums of money, giving them an advantage in elections and suppressing opposition voices. The bonds could be bought from a government bank with no limit on the amount that could be purchased.

Title: India electoral bonds: Supreme Court strikes down controversial funding scheme
Link: https://apnews.com/article/india-election-funding-modi-8de67ea52eebfad104095b4db056e90e
Summary: India's top court has declared the electoral bond system

### Observations

The summaries read as though a person wrote them: they are coherent and capture the main points of each article. As OpenAI continues to improve its models, we can expect the quality of the summaries to improve as well.

## Additional Reading

1. [LLM Bootcamp](https://github.com/miztiik/llm-bootcamp)
1. [Revolutionizing News Summarization](https://www.width.ai/post/revolutionizing-news-summarization-exploring-the-power-of-gpt-in-zero-shot-and-specialized-tasks)
1. [Summarizer For Any Size Document](https://www.width.ai/post/gpt3-summarizer)
1. [Langchain Summarization 1. Stuff & Map Reduce](https://python.langchain.com/docs/use_cases/summarization)
1. [Langchain Google Serper](https://python.langchain.com/docs/integrations/tools/google_serper)
1. [Hugging Face Local Pipelines](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines)
1. [Chunking Strategies for LLM Applications](https://www.pinecone.io/learn/chunking-strategies/)
1. [Optimal Chunk-Size for Large Document Summarization](https://vectify.ai/blog/LargeDocumentSummarization)
1. [4 Powerful Long Text Summarization Methods With Real Examples](https://www.width.ai/post/4-long-text-summarization-methods)