# Summarize Google News Results with LangChain🦜🔗, Hugging Face🤗 and Serper API

## Overview

Text summarization is the process of creating a shorter version of a text document while still preserving the most important information. This can be useful for a variety of purposes, such as quickly skimming a long document, getting the gist of an article, or sharing a summary with others. LLMs can be used to create summaries of news articles, research papers, technical documents, and other types of text.

<img src="images/miztiik_text_summarization_01.png" width="50%"/>

In this notebook, we fetch the latest Google News results with the Serper API and generate AI summaries of them, using either the LangChain LLM framework or Hugging Face transformers.

<a href="https://colab.research.google.com/github/miztiik/llm-bootcamp/blob/main/chapters/text_summarization/news_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
# Comment the above line to see the installation logs

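# Install the dependencies
!pip install -qU python-dotenv
!pip install -qU langchain
!pip install -qU langchain-community
!pip install -qU langchain-openai
!pip install -qU newspaper3k
# Needed for the Hugging Face sections later in the notebook
!pip install -qU transformers huggingface_hub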

In [None]:
# Load environment variables
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

In [None]:
# Not good practice in general, but we silence this one warning in the notebook: PyTorch has deprecated TypedStorage and will remove it in a future version.
# https://github.com/pytorch/pytorch/issues/97207#issuecomment-1494781560
import warnings

warnings.filterwarnings(
    "ignore", category=UserWarning, message="TypedStorage is deprecated"
)
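
The `message` argument is a regular expression that must match the start of the warning text, so the filter silences only this warning and leaves others visible. A minimal sketch of that behavior follows; the warning strings are made up for illustration and are not real PyTorch output, and the helper name is hypothetical.

```python
import warnings


def count_visible_warnings(messages):
    """Emit each message as a UserWarning and return how many pass the filter."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning by default so nothing is hidden by earlier filters
        warnings.simplefilter("always")
        # Same filter as in the notebook cell; it is inserted ahead of "always"
        warnings.filterwarnings(
            "ignore", category=UserWarning, message="TypedStorage is deprecated"
        )
        for text in messages:
            warnings.warn(text, UserWarning)
        return len(caught)


# Hypothetical warning texts, for illustration only
sample_messages = [
    "TypedStorage is deprecated. It will be removed in the future.",
    "Some other warning that should still be shown",
]
visible = count_visible_warnings(sample_messages)
print(f"Warnings still visible: {visible}")
```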

Update your API keys in the `.env` file. You can get the keys from the following links. _Note: Some of these services require an account, and some may charge for usage._
- [OpenAI API Key](https://platform.openai.com/account/api-keys)
- [Hugging Face API Key](https://huggingface.co/settings/tokens)
- [Serper API Key](https://serper.dev/api-key)
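
Before calling any service, it can help to confirm that the keys were actually loaded. The sketch below assumes the variable names `OPENAI_API_KEY`, `HUGGINGFACEHUB_API_TOKEN`, and `SERPER_API_KEY`; only the last one is read explicitly in this notebook, while the other two are the names the LangChain integrations conventionally look for. The helper name is hypothetical.

```python
import os

# Assumed variable names; adjust them to match your own .env file
REQUIRED_KEYS = ("OPENAI_API_KEY", "HUGGINGFACEHUB_API_TOKEN", "SERPER_API_KEY")


def find_missing_keys(env, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


missing = find_missing_keys(os.environ)
if missing:
    print(f"Missing keys in .env: {', '.join(missing)}")
else:
    print("All API keys are set")
```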

In [None]:
from langchain_openai import OpenAI
from langchain_openai import ChatOpenAI

# To specify a particular model refer to the OpenAI documentation - https://platform.openai.com/docs/models
# Completions Model: https://platform.openai.com/docs/models/completions
# Chat Model: https://platform.openai.com/docs/models/completions

llm = OpenAI()
llm_chat = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0.3)

In [None]:
from langchain_community.utilities import GoogleSerperAPIWrapper
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import NewsURLLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter

**Serper API** - [Sign up](https://serper.dev/signup?ref=miztiik) for an account with Serper, or log in if you already have an account, and create an API key. Serper offers a generous free tier; as you consume the API, the dashboard will populate with the requests and remaining credits.

In [None]:
import os

search = GoogleSerperAPIWrapper(
    type="news", tbs="qdr:d1", serper_api_key=os.getenv("SERPER_API_KEY")
)

news_search_query = "indian economy trends"
news_results = search.results(news_search_query)

if news_results.get("news") is None:
    print("No results found")

In [None]:
print(len(news_results["news"]))

In [None]:
# Print every field of the first news result
for key, value in news_results["news"][0].items():
    print(f"{key}: {value}")
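
The cells above index into `news_results["news"]` directly, which fails if the search returned nothing. A defensive helper might look like the sketch below. The payload is made up; only the `news`, `title`, and `link` keys are taken from this notebook, and `pick_articles` is a hypothetical name.

```python
def pick_articles(results, limit=5):
    """Return up to `limit` (title, link) pairs, skipping entries without a link."""
    articles = []
    for item in results.get("news") or []:
        if len(articles) >= limit:
            break
        link = item.get("link")
        if not link:
            continue  # An article without a URL cannot be loaded later
        articles.append((item.get("title", "(untitled)"), link))
    return articles


# Made-up payload that mimics the keys used in this notebook
sample_results = {
    "news": [
        {"title": "Example headline A", "link": "https://example.com/a"},
        {"title": "Entry without a link"},
        {"title": "Example headline B", "link": "https://example.com/b"},
    ]
}

for title, link in pick_articles(sample_results, limit=2):
    print(f"{title} -> {link}")
```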

In [None]:
# Limit how many news articles to process
num_results = min(5, len(news_results["news"]))


text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". "], chunk_size=10000, chunk_overlap=100)


# For each news article, load the contents
for i in range(num_results):
    # loader = WebBaseLoader(news_results["news"][i]["link"])
    loader = NewsURLLoader(urls=[news_results["news"][i]["link"]])
    news_results["news"][i]["contents"] = loader.load()

    # Split the article text into chunks that fit the model's input size
    news_results["news"][i]["contents"] = text_splitter.create_documents(
        [news_results["news"][i]["contents"][0].page_content])
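
Here `chunk_size` and `chunk_overlap` are measured in characters. The sketch below is a deliberately simplified fixed-window splitter, not LangChain's recursive algorithm (which also tries the separators in order); it only illustrates how an overlap makes consecutive chunks share some text. The function name is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reaches the end of the text
    return chunks


overlap_demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(overlap_demo)  # Each chunk starts with the last character of the previous one
```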

## Summarization with OpenAI Models

<img src="images/miztiik_text_summarization_02.png" width="50%"/>

If our document is short enough to fit within the context window of the model, we can use the `stuffing` method.

**Stuffing method** - The `stuffing` method is the simplest way to summarize text: the entire document is fed to a large language model (LLM) in a single call. This method has both pros and cons.

  - **Pros**:
    - It requires only a single call to the model, which can be faster than methods that need multiple calls.
    - The model sees all the data at once, which can result in a better summary.
  - **Cons**:
    - Every model has a maximum context length, so for large documents (or many documents) the prompt will exceed it and the call will fail.
    - As a result, the method suits only smaller inputs and is not suitable for most large documents.
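
This trade-off can be turned into a simple check: count the tokens, leave room for the generated summary, and fall back to a multi-call strategy when the text does not fit. The sketch below is illustrative only. The four-characters-per-token estimate is a rough rule of thumb for English text, the function names and the default output reserve are assumptions, and the check ignores the tokens used by the prompt template itself. In this notebook, `llm.get_num_tokens` gives a more accurate count.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; use the model's tokenizer for real counts."""
    return -(-len(text) // chars_per_token)  # Ceiling division


def choose_chain_type(num_tokens, context_window, reserved_for_output=512):
    """Pick "stuff" when the text plus expected output fits, else "map_reduce"."""
    if num_tokens + reserved_for_output <= context_window:
        return "stuff"
    return "map_reduce"


short_article = "word " * 200     # about 1,000 characters
long_article = "word " * 20_000   # about 100,000 characters

print(choose_chain_type(estimate_tokens(short_article), context_window=4096))
print(choose_chain_type(estimate_tokens(long_article), context_window=4096))
```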


In [None]:
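# Count the tokens in the first chunk to check that it fits the model's context
# window (roughly 16k tokens for gpt-3.5-turbo-0125; other models may differ)
num_tokens_first_doc = llm.get_num_tokens(
    news_results["news"][1]["contents"][0].page_content
)
print(f"Tokens in first chunk: {num_tokens_first_doc}")

# map_reduce summarizes each chunk separately and then combines the partial
# summaries, so it also works when the article does not fit in a single prompt
oai_chain = load_summarize_chain(
    llm=llm,
    chain_type='map_reduce',
    verbose=True
)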


oai_summary = oai_chain.invoke(news_results["news"][1]["contents"])
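
With `chain_type='map_reduce'`, each chunk is summarized on its own and the partial summaries are then combined into one final summary. The sketch below shows only that control flow. The `summarize` argument is a toy stand-in that keeps the first sentence, not an LLM call, and both function names are hypothetical.

```python
def first_sentence(text):
    """Toy stand-in for an LLM summarizer: keep only the first sentence."""
    head, _, _ = text.strip().partition(". ")
    return head.rstrip(".") + "."


def map_reduce_summary(chunks, summarize):
    """Summarize every chunk (map), then summarize the joined results (reduce)."""
    if not chunks:
        return ""
    partial_summaries = [summarize(chunk) for chunk in chunks]
    if len(partial_summaries) == 1:
        return partial_summaries[0]  # Nothing to combine
    return summarize(" ".join(partial_summaries))


demo_sections = [
    "Growth picked up this quarter. Analysts had expected less.",
    "Inflation eased slightly. Food prices remain volatile.",
]
print(map_reduce_summary(demo_sections, first_sentence))
```

A real reduce step would ask the model to merge the partial summaries rather than pick one of them; the toy summarizer just makes the two stages easy to follow.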

In [None]:
print(
    f"Title: {news_results['news'][1]['title']}\n\nLink: {news_results['news'][1]['link']}\n\nSummary: \033[32m{oai_summary['output_text']}\033[0m"
)

### Observations

The summarization is not perfect, but it does a good job of capturing the main points of the article. The summary is coherent and reads well. As OpenAI continues to improve their models, we can expect the quality of the summaries to improve as well.

## Summarization with Open Source Models

<img src="images/miztiik_text_summarization_03.png" width="50%"/>

### Summarization with Hugging Face hosted models

In [None]:
from langchain.chains.summarize import load_summarize_chain
# HuggingFaceHub reads the token from the HUGGINGFACEHUB_API_TOKEN environment variable
from langchain_community.llms import HuggingFaceHub

hf_hub_llm = HuggingFaceHub(
    repo_id="google/flan-t5-xxl", model_kwargs={"temperature": 0.3, "max_length": 256}
)

chain = load_summarize_chain(
    llm=hf_hub_llm,
    chain_type='map_reduce',
    verbose=True
)

# run chain
map_reduce_summary = chain.invoke(news_results["news"][1]["contents"])

print(map_reduce_summary)

In [None]:
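def split_text(text, max_chunk_size=2048):
    """Split text into chunks of at most max_chunk_size characters, breaking at periods."""
    chunks = []
    current_chunk = ""
    for sentence in text.split("."):
        if not sentence.strip():
            continue  # Skip empty fragments, e.g. the one after the final period
        if len(current_chunk) + len(sentence) + 1 <= max_chunk_size:
            current_chunk += sentence + "."
        else:
            # Close the current chunk (if any) and start a new one
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = sentence + "."
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks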

In [None]:
chunked_news_txt = split_text(
    news_results["news"][1]["contents"][0].page_content)

# print(len(chunked_news_txt))

### Summarization with Open Source Local Models

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# Load the model and tokenizer

bart_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
bart_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

print(f"bart_model_max_model_length:{bart_tokenizer.model_max_length}")

In [None]:
# Reuse the model and tokenizer loaded above instead of loading them a second time
bart_summarizer = pipeline(
    "summarization", model=bart_model, tokenizer=bart_tokenizer, truncation=True
)

bart_summary = bart_summarizer(
    news_results["news"][1]["contents"][0].page_content,
    max_length=300,
    min_length=10
)

print(bart_summary)
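
The summarization pipeline returns a list with one dictionary per input text, and the summary sits under the `summary_text` key. A small helper to pull out just the text could look like the sketch below; the sample value is made up to mirror that shape, and the helper name is hypothetical.

```python
def extract_summaries(pipeline_output):
    """Collect the `summary_text` field from each result, stripped of whitespace."""
    return [item["summary_text"].strip() for item in pipeline_output]


# Made-up value shaped like the pipeline's return value
sample_output = [{"summary_text": " The economy grew faster than expected. "}]

for summary in extract_summaries(sample_output):
    print(summary)
```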

### Summarization with Open Source Local Smaller Models

In [None]:
hf_small_model_summarizer = pipeline(
    "summarization", model="sshleifer/distilbart-cnn-12-6", truncation=True)



hf_small_model_summary = hf_small_model_summarizer(
    news_results["news"][1]["contents"][0].page_content, max_length=250, min_length=10
)

print(hf_small_model_summary)

## Additional Reading

- [LLM Bootcamp](https://github.com/miztiik/llm-bootcamp)
- [Revolutionizing News Summarization](https://www.width.ai/post/revolutionizing-news-summarization-exploring-the-power-of-gpt-in-zero-shot-and-specialized-tasks)
- [Summarizer For Any Size Document](https://www.width.ai/post/gpt3-summarizer)
- [Langchain Summarization - Stuff & Map Reduce](https://python.langchain.com/docs/use_cases/summarization)
- [Langchain Google Serper](https://python.langchain.com/docs/integrations/tools/google_serper)