# Summarize Google News Results with LangChain🦜🔗, Huggingface🤗 and Serper API

## Overview

Text summarization is the process of creating a shorter version of a text document while still preserving the most important information. This can be useful for a variety of purposes, such as quickly skimming a long document, getting the gist of an article, or sharing a summary with others. LLMs can be used to create summaries of news articles, research papers, technical documents, and other types of text.

<img src="images/miztiik_text_summarization_01.png" width="50%"/>


## Chunking Strategies for LLM Applications

- **Stuffing method** - The `stuffing` method is the simplest way to summarize text: the entire document is fed to a large language model (LLM) in a single call. This method has both pros and cons.

  - **Pros**:
    - Only requires a single call to the model, which can be faster than methods that require multiple calls.
    - The model has access to all the data at once, which can result in a better summary.
  - **Cons**:
    - Most models have a maximum context length. For large documents (or many documents), the prompt will exceed that limit and the call will fail.
    - As a result, this method only works on smaller pieces of data and is usually not suitable for large documents.

- **MapReduce method** - A multi-stage technique for summarizing large pieces of text: each smaller chunk is summarized first (the map step), and those summaries are then combined into a single summary (the reduce step). In LangChain, `load_summarize_chain` builds a `MapReduceDocumentsChain` when you set `chain_type` to `map_reduce`. Variations include:
  - MapReduce with Overlapping Chunks method
  - MapReduce with Rolling Summary method

  <img src="images/miztiik_automation_docs_copilot_using_llm_rag_02.png" width="50%"/>
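The difference between the two strategies can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: `first_sentence` is a hypothetical stand-in for an LLM call, and `split_into_chunks` is an assumed fixed-width splitter used only to keep the example self-contained.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def stuff_summarize(text: str, summarize: Callable[[str], str]) -> str:
    """Stuffing: a single call over the whole document."""
    return summarize(text)


def map_reduce_summarize(
    text: str, summarize: Callable[[str], str], chunk_size: int = 1000
) -> str:
    """MapReduce: summarize each chunk, then summarize the joined summaries."""
    chunks = split_into_chunks(text, chunk_size)
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    return summarize("\n".join(partial_summaries))  # reduce step


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


if __name__ == "__main__":
    document = "Alpha is first. Beta follows."
    print(stuff_summarize(document, first_sentence))
    print(map_reduce_summarize(document, first_sentence, chunk_size=15))
```

With `n` chunks, the map-reduce variant makes `n + 1` calls to the model, which is the cost paid for staying under the context limit.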


Because the length of each article is not known in advance, we will use the `map_reduce` method to summarize the news articles.
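If you would rather pick the strategy per document, a rough length check is one option. The sketch below is an assumption-laden heuristic: the four-characters-per-token ratio, the `reserved_for_output` budget, and both function names are hypothetical and are not part of LangChain. A real tokenizer gives a more reliable count.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


def choose_chain_type(
    text: str, context_limit: int, reserved_for_output: int = 256
) -> str:
    """Return "stuff" when the text likely fits in one prompt, else "map_reduce"."""
    budget = context_limit - reserved_for_output
    return "stuff" if estimate_tokens(text) <= budget else "map_reduce"


if __name__ == "__main__":
    print(choose_chain_type("short article " * 10, context_limit=4096))
    print(choose_chain_type("long article " * 5000, context_limit=4096))
```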

In this notebook, we will fetch the latest Google News results using the Serper API and generate AI summaries with the LangChain LLM framework and Hugging Face transformers.


<a href="https://colab.research.google.com/github/miztiik/llm-bootcamp/blob/main/chapters/text_summarization/news_summarization_with_hf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
%%capture
# Comment the above line to see the installation logs

# Install the dependencies
!pip install -qU python-dotenv
!pip install -qU langchain-core==0.1.23
!pip install -qU langchain==0.1.6
!pip install -qU langchain-community==0.0.19
!pip install -qU langchain-openai
!pip install -qU transformers --quiet
!pip install -qU newspaper3k



In [4]:
# Load environment variables
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True

In [5]:
# Not a good practice, but we will ignore this warning in the notebook: PyTorch has deprecated TypedStorage and will remove it in a future version.
# https://github.com/pytorch/pytorch/issues/97207#issuecomment-1494781560
import warnings

# warnings.filterwarnings('ignore')
warnings.filterwarnings(
    "ignore", category=UserWarning, message="TypedStorage is deprecated"
)

Add your API keys to the `.env` file. You can get the API keys from the following links. _Note: Some of the services may require you to have an account and some may charge you for usage._
- [OpenAI API Key](https://platform.openai.com/account/api-keys)
- [Hugging Face API Key](https://huggingface.co/settings/tokens)
- [Serper API Key](https://serper.dev/api-key)

In [42]:
# os.environ["HF_TOKEN"] = ""
# os.environ["HUGGINGFACEHUB_API_TOKEN"] = ""
# os.environ["OPENAI_API_KEY"] = ""
# os.environ["SERPER_API_KEY"] = ""
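Before calling any of the services, it can help to confirm that the keys were actually loaded. The helper below is a small sketch, and `missing_keys` is a hypothetical name. The variable names are the ones used in this notebook.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(
    required: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    required = ["OPENAI_API_KEY", "SERPER_API_KEY", "HUGGINGFACEHUB_API_TOKEN"]
    missing = missing_keys(required)
    if missing:
        print(f"Missing environment variables: {', '.join(missing)}")
    else:
        print("All API keys are set")
```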

In [6]:
from langchain_openai import OpenAI
from langchain_openai import ChatOpenAI

# To specify a particular model refer to the OpenAI documentation - https://platform.openai.com/docs/models
# Completions Model: https://platform.openai.com/docs/models/completions
# Chat Model: https://platform.openai.com/docs/models/completions

llm = OpenAI()
llm_chat = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0.3)

In [7]:
from langchain_community.utilities import GoogleSerperAPIWrapper
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import NewsURLLoader
import textwrap

**Serper API** - [Sign up](https://serper.dev/signup?ref=miztiik) for an account with Serper, or log in if you already have an account, and create an API key. Serper offers a generous free tier; as you consume the API, the dashboard will populate with the requests and remaining credits.

In [45]:
import os

search = GoogleSerperAPIWrapper(
    type="news", tbs="qdr:d1", serper_api_key=os.getenv("SERPER_API_KEY")
)

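news_search_query = "india ai growth"
# NOTE: `num_results` did not limit the result count in this run (10 items came back).
# The number of articles to process is capped later with `num_results = min(5, ...)`.
news_results = search.results(news_search_query, num_results=5)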

if news_results.get("news") is None:
    print("No results found")

In [46]:
print(len(news_results["news"]))

10


In [47]:
for i in news_results["news"][0]:
    print(f"{i}:{news_results['news'][0][i]}")

title:India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027
link:https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475
snippet:India's artificial intelligence market is predicted to witness a meteoric rise, reaching a staggering $17 billion by 2027, according to a joint report by IT...
date:12 hours ago
source:NDTV
imageUrl:https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQmsEc7EyX4q5DWqQc3LIMUcobqIBH0TCrr94HxGK-ApRgaKbelkj0j-CDCdw&s
position:1
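Each entry in `news_results["news"]` is a plain dictionary with the keys printed above. As a small sketch, the fields needed downstream can be pulled out like this. `extract_headlines` is a hypothetical helper, and the sample data reuses the values from the output above.

```python
from typing import Any, Dict, List


def extract_headlines(results: Dict[str, Any], limit: int = 5) -> List[Dict[str, str]]:
    """Keep title, link and source for the first `limit` items that have a link."""
    items = results.get("news") or []
    return [
        {
            "title": item.get("title", ""),
            "link": item.get("link", ""),
            "source": item.get("source", ""),
        }
        for item in items[:limit]
        if item.get("link")
    ]


if __name__ == "__main__":
    sample = {
        "news": [
            {
                "title": "India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027",
                "link": "https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475",
                "date": "12 hours ago",
                "source": "NDTV",
                "position": 1,
            }
        ]
    }
    for headline in extract_headlines(sample):
        print(f"{headline['source']}: {headline['title']}")
```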


In [48]:
# Limit how many news articles to process
num_results = min(5, len(news_results["news"]))

text_splitter = RecursiveCharacterTextSplitter(
    separators=[
        "\n\n",
        "\n",
    ],
    chunk_size=1000,
    chunk_overlap=100,
)

# For each news article, load the contents and keep only the articles that loaded.
# A new list is built instead of popping from the list being iterated, which would skip items.
loaded_news = []
for news_item in news_results["news"]:
    loader = NewsURLLoader(urls=[news_item.get("link")])
    contents = loader.load()
    if contents:
        news_item["article"] = contents
        # Split the article into chunks that fit the model input size
        news_item["split_article"] = text_splitter.create_documents(
            [contents[0].page_content]
        )
        loaded_news.append(news_item)
    else:
        print(f"Failed to load {news_item['link']}, removed from results.\n")
news_results["news"] = loaded_news

Error fetching or processing https://www.businessworld.in/article/TWTW-Weekly-Wrap-Up-18-24-Feb/24-02-2024-510919, exception: Article `download()` failed with HTTPSConnectionPool(host='www.businessworld.in', port=443): Max retries exceeded with url: /article/TWTW-Weekly-Wrap-Up-18-24-Feb/24-02-2024-510919 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)'))) on URL https://www.businessworld.in/article/TWTW-Weekly-Wrap-Up-18-24-Feb/24-02-2024-510919


Failed to load https://www.businessworld.in/article/TWTW-Weekly-Wrap-Up-18-24-Feb/24-02-2024-510919, removed from results.
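To see what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` splits on the configured separators and then merges the pieces, whereas this hypothetical `sliding_window_chunks` function cuts at fixed character offsets.

```python
from typing import List


def sliding_window_chunks(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 100
) -> List[str]:
    """Cut text into fixed-width chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(chunk)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a boundary still appears in full in at least one chunk when it is shorter than the overlap.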



In [50]:
print(len(news_results["news"]))

9


In [52]:
for i in news_results["news"]:
    # print(i)
    print(i["link"])
    print("\033[32m-----\033[0m")

https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475
-----
https://indianexpress.com/article/business/brokers-cautious-manipulations-stock-market-sebi-9179209/
-----
https://www.ndtvprofit.com/technology/wipro-intel-foundry-collaborate-to-advance-chip-design-and-development
-----
https://inc42.com/startups/eyes-in-the-sky-india-drone-startups-looking-for-major-pie/
-----
https://timesofindia.indiatimes.com/gadgets-news/google-ceo-sundar-pichai-on-how-ai-can-help-prevent-hacking-threats/articleshow/107951712.cms
-----
https://www.punjabnewsexpress.com/news/news/quickly-addressed-the-issue-google-on-%E2%80%98illegal-ai-responses-to-questions-on-pm-modi-241382
-----
https://upstox.com/news/business-news/economy/a-look-at-india's-gdp-growth-in-the-years-ahead/
-----
http://www.msn.com/en-in/money/markets/nvidia-s-market-value-surpasses-bse-sensex-after-ai-fuelled-rally

## Summarization with OpenAI Models

<img src="images/miztiik_text_summarization_02.png" width="50%"/>


In [53]:
# gpt-3.5-turbo-0125 accepts roughly 16k input tokens; other models may have a smaller limit.
# num_tokens_first_doc = llm.get_num_tokens(
#     news_results["news"][1]["article"][0].page_content
# )

map_prompt = """ Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(
    template=map_prompt, input_variables=["text"])

combine_prompt = """ Write a concise summary of the following text delimited by triple backquotes.
```{text}```
CONCISE SUMMARY:
"""
combine_prompt_template = PromptTemplate(
    template=combine_prompt, input_variables=["text"]
)

oai_chain = load_summarize_chain(
    llm=llm,
    chain_type="map_reduce",
    map_prompt=map_prompt_template,
    combine_prompt=combine_prompt_template,
    # Uncomment verbose=True if you want to see the prompts being used
    # verbose=True
)

for news_item in news_results["news"][:num_results]:
    if news_item.get("article"):
        print(
            f"Summarizing article: {news_item['title']} - {news_item['link']}\n")
        news_item["oai_summary"] = oai_chain.invoke(news_item["split_article"])

Summarizing article: India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027 - https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475

Summarizing article: Brokers need to be cautious against manipulations in stock market: Sebi - https://indianexpress.com/article/business/brokers-cautious-manipulations-stock-market-sebi-9179209/

Summarizing article: Wipro, Intel Foundry Collaborate To Advance Chip Design And Development - https://www.ndtvprofit.com/technology/wipro-intel-foundry-collaborate-to-advance-chip-design-and-development

Summarizing article: Eyes In The Sky: 33 Indian Drone Startups Looking For A Major Pie - https://inc42.com/startups/eyes-in-the-sky-india-drone-startups-looking-for-major-pie/

Summarizing article: Google CEO Sundar Pichai on how AI can help prevent hacking threats - https://timesofindia.indiatimes.com/gadgets-news/google-ceo-sundar-pichai-on-how-ai-can-help-prevent-hacking-threats/

Let us take a look at the summaries generated by the `map_reduce` method.

In [54]:
for i in news_results["news"][:num_results]:
    if i.get("article"):
        print(
            f"\nTitle: {i['title']}\nLink: {i['link']}\nSummary: \033[32m{i['oai_summary']['output_text']}\033[0m"
        )


Title: India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027
Link: https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475
Summary:
India's AI market is expected to reach $17 billion by 2027 with a growth rate of 25-35%. This is due to increased tech spending, a large pool of AI professionals, and significant investments. Both private and public entities are investing in AI research and development, leading to innovation and growth. India's tech industry sees AI as crucial for its future and is expected to continue to see a demand for AI talent. With a strong talent pool, investments, and adoption, India is set to become a global leader in shaping the future of technology.

Title: Brokers need to be cautious against manipulations in stock market: Sebi
Link: https://indianexpress.com/article/business/brokers-cautious-manipulations-stock-market-sebi-9179209/
Summary: Sebi representative warns a

### Observations

The summaries read as if written by a person and do a good job of capturing the main points of each article. They are coherent and read well. As OpenAI continues to improve its models, we can expect the quality of the summaries to improve as well.

## Summarization with Huggingface Open Source Hosted Models with LangChain

<img src="images/miztiik_text_summarization_03.png" width="50%"/>

We will try a variety of models and see how they perform. We will use:
- `google/flan-t5-xxl`
- `facebook/bart-large-cnn`
- `sshleifer/distilbart-cnn-12-6`
- `Falconsai/text_summarization`

### Summarization with Hugging Face hosted models

In [99]:
from langchain.chains.summarize import load_summarize_chain
from langchain_community.llms import HuggingFaceHub
from langchain.chains.llm import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.text_splitter import CharacterTextSplitter

hf_flan_llm = HuggingFaceHub(
    repo_id="google/flan-t5-xxl",
    model_kwargs={"temperature": 0.3, "max_length": 1024},
    # repo_id="philschmid/bart-large-cnn-samsum", model_kwargs={"temperature": 0.3, "max_length": 256}
    # repo_id="mistralai/Mistral-7B-v0.1", model_kwargs={"temperature": 0.3, "max_length": 1024}
)

map_prompt = """Identify main themes to write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(
    template=map_prompt, input_variables=["text"])

combine_prompt = """Write a succinct summary of the following text delimited by triple backquotes.
```{text}```
succinct SUMMARY:
"""
combine_prompt_template = PromptTemplate(
    template=combine_prompt, input_variables=["text"]
)


hf_flan_chain = load_summarize_chain(
    llm=hf_flan_llm,
    chain_type="map_reduce",
    token_max=900,  # https://github.com/langchain-ai/langchain/discussions/10930
    map_prompt=map_prompt_template,
    combine_prompt=combine_prompt_template,
    verbose=True,
)

# textwrap.fill(output_summary, width=100)

for news_item in news_results["news"][:num_results]:
    if news_item.get("article"):
        print(
            f"Summarizing article: {news_item['title']} - {news_item['link']}\n")
        news_item["hf_flan_summary"] = hf_flan_chain.invoke(
            news_item["split_article"])

Summarizing article: India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027 - https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Identify main themes to write a concise summary of the following:
"The growth is fueled by tech spending, AI talent, and increased investments.

India's artificial intelligence market is predicted to witness a meteoric rise, reaching a staggering $17 billion by 2027, according to a joint report by IT industry body the National Association of Software and Service Companies (Nasscom) and consulting firm Boston Consulting Group (BCG). This translates to an impressive annualised growth rate of 25-35% between 2024 and 2027.

Fueling this AI boom are three key factors:

Surging Enterprise Tech Spending: Companies are increasingly investing in AI-

In [100]:
# Print the summaries
for i in news_results["news"][:num_results]:
    if i.get("article"):
        print(f"\nTitle: {i['title']}\nLink: {i['link']}")
        print(
            f"\noai_summary: \033[32m {i['oai_summary']['output_text']}\033[0m")
        print(
            f"\nhf_flan_summary: \033[32m{i['hf_flan_summary']['output_text']}\033[0m"
        )


Title: India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027
Link: https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475

oai_summary:
India's AI market is expected to reach $17 billion by 2027 with a growth rate of 25-35%. This is due to increased tech spending, a large pool of AI professionals, and significant investments. Both private and public entities are investing in AI research and development, leading to innovation and growth. India's tech industry sees AI as crucial for its future and is expected to continue to see a demand for AI talent. With a strong talent pool, investments, and adoption, India is set to become a global leader in shaping the future of technology.

hf_flan_summary: India's artificial intelligence market is predicted to witness a meteoric rise, reaching a staggering $17 billion by 2027, according to a joint report by IT industry body the National Association of 

### Summarization with Huggingface Open Source Local Models with LangChain

In [34]:
from transformers import BartForConditionalGeneration, BartTokenizer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import pipeline

# Load the model and tokenizer

bart_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
bart_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

print(f"bart_model_max_model_length:{bart_tokenizer.model_max_length}")

bart_model_max_model_length:1024


In [80]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

bart_summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    tokenizer="facebook/bart-large-cnn",  # facebook/bart-large-xsum
    # min_length=10,
    # max_length=512
)

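# NOTE: the run log below reports max_length=142, so the values passed through
# model_kwargs here do not appear to take effect at generation time.
hf_bart_llm = HuggingFacePipeline(
    pipeline=bart_summarizer,
    model_kwargs={"temperature": 0.3, "max_length": 1024},
)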


map_prompt = """Identify main themes to write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(
    template=map_prompt, input_variables=["text"])

combine_prompt = """Write a succinct summary of the following text delimited by triple backquotes.
```{text}```
succinct SUMMARY:
"""
combine_prompt_template = PromptTemplate(
    template=combine_prompt, input_variables=["text"]
)


hf_bart_chain = load_summarize_chain(
    llm=hf_bart_llm,
    chain_type="map_reduce",
    token_max=900,  # https://github.com/langchain-ai/langchain/discussions/10930
    map_prompt=map_prompt_template,
    combine_prompt=combine_prompt_template,
    verbose=True,
)


# Run the chain
for news_item in news_results["news"][:num_results]:
    if news_item.get("article"):
        print(
            f"Summarizing article: {news_item['title']} - {news_item['link']}\n")
        news_item["hf_bart_summary"] = hf_bart_chain.invoke(
            news_item["split_article"])
        print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")

Summarizing article: India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027 - https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Identify main themes to write a concise summary of the following:
"The growth is fueled by tech spending, AI talent, and increased investments.

India's artificial intelligence market is predicted to witness a meteoric rise, reaching a staggering $17 billion by 2027, according to a joint report by IT industry body the National Association of Software and Service Companies (Nasscom) and consulting firm Boston Consulting Group (BCG). This translates to an impressive annualised growth rate of 25-35% between 2024 and 2027.

Fueling this AI boom are three key factors:

Surging Enterprise Tech Spending: Companies are increasingly investing in AI-

Your max_length is set to 142, but your input_length is only 90. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=45)



> Finished chain.


> Entering new LLMChain chain...
Prompt after formatting:
Write a succinct summary of the following text delimited by triple backquotes.
```Sebi Whole Time Member Kamlesh Chandra Varshney on Saturday cautioned against manipulations in the capital market. He urged brokers to keep a tab and prevent such instances. The Securities and Exchange Board of India is using Artificial Intelligence (AI) for investigations. Total demat accounts rose to 13.9 crore at the end of December 2023, a growth of 20 per cent in nine months.

"Manipulations are going on and Sebi cannot intervene in all of them," Varshney said. Some brokers are involved and the broker community should keep an eye as “bad elements can come into the system”, he said. The market watchdog will be coming out with a cyber security and cyber resilience framework for the regulated entities.

"The framework will be mandatory for the regulated entities," Varshney said. ANMI President Vij

Your max_length is set to 142, but your input_length is only 88. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)



> Finished chain.


> Entering new LLMChain chain...
Prompt after formatting:
Write a succinct summary of the following text delimited by triple backquotes.
```The global artificial intelligence chip market is expected to grow at a compound annual growth rate of 38% annually from 2023 to 2032. Wipro's design services and Intel Foundry’s manufacturing capabilities will enable heavy industries to leverage generative AI-driven designs and foundry services.

Wipro said that its investment in the ai360 ecosystem, combined with this collaboration with Intel on AI-driven chip manufacturing, will help businesses achieve their growth goals. Wipro's IP expertise in areas such as DDR, HBM, PCIe, CXL, OPIO, RLINK/DP PHY and FIVR/LDO will help improve time-to-market.

Wipro will leverage Intel’s strong worldwide fabrication plants to ensure silicon availability. Wipro delivers a geo-diverse and resilient semiconductor supply chain that enables clients across businesses

Your max_length is set to 142, but your input_length is only 93. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=46)



> Finished chain.


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Write a succinct summary of the following text delimited by triple backquotes.
```The Indian drone market is expected to reach $13 Bn in size by 2030, growing at a CAGR of 21% between 2022 and 2030. Civil Aviation Minister Jyotiraditya Scindia said that India has the potential to become a global drone hub by 2030.

The Indian drone market is expected to reach $13 Bn in size by 2030, growing at a CAGR of 21% between 2022 and 2030. Inc42 has compiled a list of 33 Indian drone startups, detailing their journeys – from what they do to their plans as the deeptech segment booms.

Aereo (earlier Aarav Unmanned Systems) offers end-to-end drone solutions. It was also amongst the three companies that were shortlisted to map India’s 600,000 villages by the government. In July 2022, it signed an MoU with Tata Steel to develop and offer in

Your max_length is set to 142, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)



> Finished chain.

> Finished chain.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Summarizing article: Google CEO Sundar Pichai on how AI can help prevent hacking threats - https://timesofindia.indiatimes.com/gadgets-news/google-ceo-sundar-pichai-on-how-ai-can-help-prevent-hacking-threats/articleshow/107951712.cms



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Identify main themes to write a concise summary of the following:
"Growth of sophisticated cyberattacks

AI not foolproof solution yet"
CONCISE SUMMARY:

Prompt after formatting:
Identify main themes to write a concise summary of the following:
"
The chief executives of tech giants such as Google, Microsoft and ChatGPT maker OpenAI have advocated for the use of artificial intelligence (AI) technology in various industries, including against cybercrimes. As concerns over the use of AI technology in 

In [102]:
# Print the summaries
for i in news_results["news"][:num_results]:
    if i.get("article"):
        print(f"\nTitle: {i['title']}\nLink: {i['link']}")
        print(
            f"\noai_summary: \033[32m {i['oai_summary']['output_text']}\033[0m")
        print(
            f"\nhf_flan_summary: \033[32m{i['hf_flan_summary']['output_text']}\033[0m")
        print(
            f"\nhf_bart_summary: \033[32m{i['hf_bart_summary']['output_text']}\033[0m"
        )


Title: India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027
Link: https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475

oai_summary:
India's AI market is expected to reach $17 billion by 2027 with a growth rate of 25-35%. This is due to increased tech spending, a large pool of AI professionals, and significant investments. Both private and public entities are investing in AI research and development, leading to innovation and growth. India's tech industry sees AI as crucial for its future and is expected to continue to see a demand for AI talent. With a strong talent pool, investments, and adoption, India is set to become a global leader in shaping the future of technology.

hf_flan_summary: India's artificial intelligence market is predicted to witness a meteoric rise, reaching a staggering $17 billion by 2027, according to a joint report by IT industry body the National Association of 

### Summarization with Huggingface Open Source Smaller Local Models with LangChain


In [82]:
hf_distilbart_summarizer = pipeline(
    "summarization", model="sshleifer/distilbart-cnn-12-6"
)

hf_distilbart_llm = HuggingFacePipeline(
    pipeline=hf_distilbart_summarizer, model_kwargs={}
)

map_prompt = """Identify main themes to write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(
    template=map_prompt, input_variables=["text"])

combine_prompt = """Write a succinct summary of the following text delimited by triple backquotes.
```{text}```
succinct SUMMARY:
"""
combine_prompt_template = PromptTemplate(
    template=combine_prompt, input_variables=["text"]
)


hf_distilbart_chain = load_summarize_chain(
    llm=hf_distilbart_llm,
    chain_type="map_reduce",
    token_max=900,  # https://github.com/langchain-ai/langchain/discussions/10930
    map_prompt=map_prompt_template,
    combine_prompt=combine_prompt_template,
    # verbose=True
)


# run chain
for news_item in news_results["news"][:num_results]:
    if news_item.get("article"):
        print(
            f"Summarizing article: {news_item['title']} - {news_item['link']}\n")
        news_item["hf_distilbart_summary"] = hf_distilbart_chain.invoke(
            news_item["split_article"]
        )

Summarizing article: India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027 - https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475

Summarizing article: Brokers need to be cautious against manipulations in stock market: Sebi - https://indianexpress.com/article/business/brokers-cautious-manipulations-stock-market-sebi-9179209/



Your max_length is set to 142, but your input_length is only 90. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=45)


Summarizing article: Wipro, Intel Foundry Collaborate To Advance Chip Design And Development - https://www.ndtvprofit.com/technology/wipro-intel-foundry-collaborate-to-advance-chip-design-and-development



Your max_length is set to 142, but your input_length is only 88. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)


Summarizing article: Eyes In The Sky: 33 Indian Drone Startups Looking For A Major Pie - https://inc42.com/startups/eyes-in-the-sky-india-drone-startups-looking-for-major-pie/



Your max_length is set to 142, but your input_length is only 93. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=46)
Your max_length is set to 142, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)


Summarizing article: Google CEO Sundar Pichai on how AI can help prevent hacking threats - https://timesofindia.indiatimes.com/gadgets-news/google-ceo-sundar-pichai-on-how-ai-can-help-prevent-hacking-threats/articleshow/107951712.cms
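The `max_length` warnings in the log above appear when a chunk is shorter than the pipeline's default output limit (142 tokens in this run). The suggested value in each warning is half the input length, rounded down. The sketch below reproduces that arithmetic. `suggested_max_length` is a hypothetical helper, not part of `transformers`.

```python
def suggested_max_length(input_length: int, default_max_length: int = 142) -> int:
    """Mirror the warning's suggestion: half the input length for short inputs.

    Inputs at least as long as the default limit keep the default.
    """
    if input_length >= default_max_length:
        return default_max_length
    return max(1, input_length // 2)


if __name__ == "__main__":
    # Input lengths taken from the warnings logged above
    for input_length in (90, 88, 93, 41):
        print(input_length, suggested_max_length(input_length))
```

A value computed this way could be passed as `max_length` when calling the summarization pipeline directly, as the warning text itself suggests.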



In [104]:
# Print the summaries
for i in news_results["news"][:num_results]:
    if i.get("article"):
        print(f"\nTitle: {i['title']}\nLink: {i['link']}")
        print(
            f"oai_summary: \033[32m {i['oai_summary']['output_text']}\033[0m")
        print(
            f"hf_flan_summary: \033[32m{i['hf_flan_summary']['output_text']}\033[0m")
        print(
            f"hf_bart_summary: \033[32m{i['hf_bart_summary']['output_text']}\033[0m")
        print(
            f"hf_distilbart_summary: \033[32m{i['hf_distilbart_summary']['output_text']}\033[0m"
        )


Title: India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027
Link: https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475
oai_summary:
India's AI market is expected to reach $17 billion by 2027 with a growth rate of 25-35%. This is due to increased tech spending, a large pool of AI professionals, and significant investments. Both private and public entities are investing in AI research and development, leading to innovation and growth. India's tech industry sees AI as crucial for its future and is expected to continue to see a demand for AI talent. With a strong talent pool, investments, and adoption, India is set to become a global leader in shaping the future of technology.
hf_flan_summary: India's artificial intelligence market is predicted to witness a meteoric rise, reaching a staggering $17 billion by 2027, according to a joint report by IT industry body the National Association of So
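The comparison cell grows by one `print` per model and fails with a `KeyError` if any summary is missing. One possible alternative is a small formatter that takes the list of summary keys and wraps long lines with `textwrap`. `format_summaries` is a hypothetical helper. It assumes each summary is stored as a dictionary with an `output_text` entry, as in the cells above.

```python
import textwrap
from typing import Any, Dict, Iterable


def format_summaries(item: Dict[str, Any], keys: Iterable[str], width: int = 100) -> str:
    """Build a printable comparison of the summaries stored under `keys`."""
    lines = [f"Title: {item.get('title', '')}", f"Link: {item.get('link', '')}"]
    for key in keys:
        result = item.get(key)
        if not result:
            lines.append(f"{key}: <missing>")
            continue
        text = result.get("output_text", "").strip()
        lines.append(f"{key}:")
        lines.append(
            textwrap.fill(text, width=width, initial_indent="  ", subsequent_indent="  ")
        )
    return "\n".join(lines)


if __name__ == "__main__":
    example_item = {
        "title": "Example title",
        "link": "https://example.com/article",
        "oai_summary": {"output_text": "A short example summary."},
    }
    print(format_summaries(example_item, ["oai_summary", "hf_bart_summary"]))
```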

### Summarization with Huggingface Open Source Smaller Local Models with LangChain - Falconsai

In [85]:
hf_falconsai_summarizer = pipeline(
    "summarization", model="Falconsai/text_summarization"
)

hf_falconsai_llm = HuggingFacePipeline(
    pipeline=hf_falconsai_summarizer, model_kwargs={}
)

map_prompt = """Identify main themes to write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(
    template=map_prompt, input_variables=["text"])

combine_prompt = """Write a succinct summary of the following text delimited by triple backquotes.
```{text}```
succinct SUMMARY:
"""
combine_prompt_template = PromptTemplate(
    template=combine_prompt, input_variables=["text"]
)

hf_falconsai_chain = load_summarize_chain(
    llm=hf_falconsai_llm,
    chain_type="map_reduce",
    token_max=900,  # https://github.com/langchain-ai/langchain/discussions/10930
    map_prompt=map_prompt_template,
    combine_prompt=combine_prompt_template,
    # verbose=True
)


# run chain
for news_item in news_results["news"][:num_results]:
    if news_item.get("article"):
        print(
            f"Summarizing article: {news_item['title']} - {news_item['link']}\n")
        news_item["hf_falconsai_summary"] = hf_falconsai_chain.invoke(
            news_item["split_article"]
        )

Summarizing article: India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027 - https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475



Your max_length is set to 200, but your input_length is only 171. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=85)


Summarizing article: Brokers need to be cautious against manipulations in stock market: Sebi - https://indianexpress.com/article/business/brokers-cautious-manipulations-stock-market-sebi-9179209/



Your max_length is set to 200, but your input_length is only 97. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)
Your max_length is set to 200, but your input_length is only 182. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=91)


Summarizing article: Wipro, Intel Foundry Collaborate To Advance Chip Design And Development - https://www.ndtvprofit.com/technology/wipro-intel-foundry-collaborate-to-advance-chip-design-and-development



Your max_length is set to 200, but your input_length is only 91. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=45)


Summarizing article: Eyes In The Sky: 33 Indian Drone Startups Looking For A Major Pie - https://inc42.com/startups/eyes-in-the-sky-india-drone-startups-looking-for-major-pie/



Your max_length is set to 200, but your input_length is only 198. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=99)
Your max_length is set to 200, but your input_length is only 194. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=97)
Your max_length is set to 200, but your input_length is only 197. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=98)
Your max_length is set to 200, but your input_length is only 95. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=47)

Summarizing article: Google CEO Sundar Pichai on how AI can help prevent hacking threats - https://timesofindia.indiatimes.com/gadgets-news/google-ceo-sundar-pichai-on-how-ai-can-help-prevent-hacking-threats/articleshow/107951712.cms



Your max_length is set to 200, but your input_length is only 119. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=59)
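
The warnings above come from the summarization pipeline: `max_length` is fixed at 200 tokens, but several chunks are shorter than that. The values the warning suggests (85 for 171 input tokens, 48 for 97) are consistent with halving the input length. A minimal sketch of that rule follows; `suggest_max_length` is a hypothetical helper, not part of `transformers`, and the cap of 200 simply mirrors the limit quoted in the log.

```python
def suggest_max_length(input_length: int, cap: int = 200) -> int:
    """Return a summary budget of about half the input length, never above `cap`."""
    if input_length < 2:
        # Degenerate input: keep at least one token of budget.
        return 1
    return min(cap, input_length // 2)


# Input lengths taken from the warnings in the log above.
for n in (171, 97, 182):
    print(n, suggest_max_length(n))
```

A value computed this way could be passed as `max_length` when calling the summarizer, in the same way the warning's own example `summarizer('...', max_length=85)` does.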


In [105]:
# Print the summaries
for i in news_results["news"][:num_results]:
    if i.get("article"):
        print(f"\nTitle: {i['title']}\nLink: {i['link']}")
        print(
            f"oai_summary: \033[32m {i['oai_summary']['output_text']}\033[0m")
        print(
            f"hf_flan_summary: \033[32m{i['hf_flan_summary']['output_text']}\033[0m")
        print(
            f"hf_bart_summary: \033[32m{i['hf_bart_summary']['output_text']}\033[0m")
        print(
            f"hf_distilbart_summary: \033[32m{i['hf_distilbart_summary']['output_text']}\033[0m"
        )
        print(
            f"hf_falconsai_summary: \033[32m{i['hf_falconsai_summary']['output_text']}\033[0m"
        )


Title: India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027
Link: https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475
oai_summary: India's AI market is expected to reach $17 billion by 2027 with a growth rate of 25-35%. This is due to increased tech spending, a large pool of AI professionals, and significant investments. Both private and public entities are investing in AI research and development, leading to innovation and growth. India's tech industry sees AI as crucial for its future and is expected to continue to see a demand for AI talent. With a strong talent pool, investments, and adoption, India is set to become a global leader in shaping the future of technology.
hf_flan_summary: India's artificial intelligence market is predicted to witness a meteoric rise, reaching a staggering $17 billion by 2027, according to a joint report by IT industry body the National Association of So

### Summarization with Huggingface Hosted Models with LangChain - MistralAI (Mixtral)

In [107]:
from langchain.chains.summarize import load_summarize_chain
from langchain.llms.huggingface_hub import HuggingFaceHub

hf_mixtral_llm = HuggingFaceHub(
    repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
    model_kwargs={"temperature": 0.3, "max_length": 1024},
)


map_prompt = """Identify main themes to write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(
    template=map_prompt, input_variables=["text"])

combine_prompt = """Write a succinct summary of the following text delimited by triple backquotes.
```{text}```
succinct SUMMARY:
"""
combine_prompt_template = PromptTemplate(
    template=combine_prompt, input_variables=["text"]
)

hf_mixtral_chain = load_summarize_chain(
    llm=hf_mixtral_llm,
    chain_type="map_reduce",
    token_max=900,  # https://github.com/langchain-ai/langchain/discussions/10930
    map_prompt=map_prompt_template,
    combine_prompt=combine_prompt_template,
    verbose=True
)


# run chain
for news_item in news_results["news"][:num_results]:
    if news_item.get("article"):
        print(
            f"Summarizing article: {news_item['title']} - {news_item['link']}\n")
        news_item["hf_mixtral_summary"] = hf_mixtral_chain.invoke(
            news_item["split_article"]
        )

Summarizing article: India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027 - https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Identify main themes to write a concise summary of the following:
"The growth is fueled by tech spending, AI talent, and increased investments.

India's artificial intelligence market is predicted to witness a meteoric rise, reaching a staggering $17 billion by 2027, according to a joint report by IT industry body the National Association of Software and Service Companies (Nasscom) and consulting firm Boston Consulting Group (BCG). This translates to an impressive annualised growth rate of 25-35% between 2024 and 2027.

Fueling this AI boom are three key factors:

Surging Enterprise Tech Spending: Companies are increasingly investing in AI-

ValueError: A single document was longer than the context length, we cannot handle this.
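
The error appears to mean that a single document handed to the combine step is larger than the configured token budget (`token_max=900` in the chain above). One way to investigate is to count tokens per chunk before running the chain. The sketch below uses a crude whitespace counter as a stand-in for the model's tokenizer; `count_tokens` and `find_oversized` are hypothetical names, and a real tokenizer will usually report more tokens than a whitespace split.

```python
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def find_oversized(
    texts: List[str],
    token_max: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every text that exceeds `token_max`."""
    oversized = []
    for index, text in enumerate(texts):
        n_tokens = counter(text)
        if n_tokens > token_max:
            oversized.append((index, n_tokens))
    return oversized


# Made-up chunks: the second one is deliberately far too long.
chunks = ["a short chunk", "word " * 1200]
print(find_oversized(chunks, token_max=900))
```

If any chunk is reported, splitting the article into smaller pieces or raising `token_max` are the two obvious things to try.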

In [96]:
from transformers import AutoTokenizer

# Load only the tokenizer to inspect the maximum length it advertises for the model
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")


# # %%
# # Calculate the number of parameters
# num_params = sum(p.nelement() for p in model.parameters())

# # %%
# # Display the result
# print(f"The size of the '{tokenizer.config.name}' model is {num_params} parameters.")


# from transformers import BartForConditionalGeneration, BartTokenizer
# from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# from transformers import pipeline

# # Load the model and tokenizer

# bart_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
# bart_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

print(f"mixtral_model_max_model_length:{tokenizer.model_max_length}")

mixtral_model_max_model_length:1000000000000000019884624838656
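
The value printed above equals `int(1e30)`, which `transformers` uses as a placeholder when a tokenizer config does not declare a maximum length, so it says nothing about Mixtral's real context window. A small guard like the following avoids treating the placeholder as a usable limit. The threshold and the fallback of 32768 tokens are assumptions for illustration; take the actual limit from the model card.

```python
# Anything at or above this is treated as "no limit declared" (assumed threshold).
UNSET_THRESHOLD = 10**12


def effective_max_length(model_max_length: int, fallback: int) -> int:
    """Return the tokenizer's limit, or `fallback` when the limit is a placeholder."""
    if model_max_length >= UNSET_THRESHOLD:
        return fallback
    return model_max_length


# The first value is the one printed in the cell output above.
print(effective_max_length(1000000000000000019884624838656, fallback=32768))
print(effective_max_length(1024, fallback=32768))
```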


In [108]:
# Print the summaries
for i in news_results["news"][:num_results]:
    if i.get("article"):
        print(f"\nTitle: {i['title']}\nLink: {i['link']}")
        print(
            f"oai_summary: \033[32m {i['oai_summary']['output_text']}\033[0m")
        print(
            f"hf_flan_summary: \033[32m{i['hf_flan_summary']['output_text']}\033[0m")
        print(
            f"hf_bart_summary: \033[32m{i['hf_bart_summary']['output_text']}\033[0m")
        print(
            f"hf_distilbart_summary: \033[32m{i['hf_distilbart_summary']['output_text']}\033[0m"
        )
        print(
            f"hf_falconsai_summary: \033[32m{i['hf_falconsai_summary']['output_text']}\033[0m"
        )
        print(
            f"hf_mixtral_summary: \033[32m{i['hf_mixtral_summary']['output_text']}\033[0m"
        )


Title: India's AI Market Poised For Explosive Growth, Reaching $17 Billion By 2027
Link: https://www.ndtv.com/india-news/indias-ai-market-poised-for-explosive-growth-reaching-17-billion-by-2027-5117475
oai_summary: India's AI market is expected to reach $17 billion by 2027 with a growth rate of 25-35%. This is due to increased tech spending, a large pool of AI professionals, and significant investments. Both private and public entities are investing in AI research and development, leading to innovation and growth. India's tech industry sees AI as crucial for its future and is expected to continue to see a demand for AI talent. With a strong talent pool, investments, and adoption, India is set to become a global leader in shaping the future of technology.
hf_flan_summary: India's artificial intelligence market is predicted to witness a meteoric rise, reaching a staggering $17 billion by 2027, according to a joint report by IT industry body the National Association of So

KeyError: 'hf_mixtral_summary'
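
The `KeyError` is raised because the Mixtral chain failed earlier, so `hf_mixtral_summary` was never stored on the news items. Reading the keys defensively lets the remaining summaries print. This is a sketch: `format_summaries` is a hypothetical helper, and the sample item is made-up data shaped like the entries in `news_results["news"]`.

```python
SUMMARY_KEYS = [
    "oai_summary",
    "hf_flan_summary",
    "hf_bart_summary",
    "hf_distilbart_summary",
    "hf_falconsai_summary",
    "hf_mixtral_summary",
]


def format_summaries(item: dict, keys=SUMMARY_KEYS) -> str:
    """Build a printable report, tolerating summaries that were never produced."""
    lines = [f"Title: {item.get('title', '')}", f"Link: {item.get('link', '')}"]
    for key in keys:
        summary = item.get(key) or {}  # missing key or None -> empty dict
        text = summary.get("output_text", "<not available>")
        lines.append(f"{key}: {text.strip()}")
    return "\n".join(lines)


# Made-up item with only one summary present.
sample = {
    "title": "Example title",
    "link": "https://example.com",
    "oai_summary": {"output_text": " An example summary. "},
}
print(format_summaries(sample))
```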

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
from transformers import BitsAndBytesConfig
from langchain import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain


quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model_id = "mistralai/Mistral-7B-Instruct-v0.1"

model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Use a distinct name so the imported `pipeline` factory is not shadowed
mistral_pipeline = pipeline(
    "text-generation",
    model=model_4bit,
    tokenizer=tokenizer,
    use_cache=True,
    device_map="auto",
    max_length=500,
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

In [None]:
# Print the summaries
for i in news_results["news"][:num_results]:
    if i.get("article"):
        print(f"\nTitle: {i['title']}\nLink: {i['link']}")
        print(
            f"oai_summary: \033[32m {i['oai_summary']['output_text']}\033[0m")
        print(
            f"hf_flan_summary: \033[32m{i['hf_flan_summary']['output_text']}\033[0m")
        print(
            f"hf_bart_summary: \033[32m{i['hf_bart_summary']['output_text']}\033[0m")
        print(
            f"hf_distilbart_summary: \033[32m{i['hf_distilbart_summary']['output_text']}\033[0m"
        )
        print(
            f"hf_falconsai_summary: \033[32m{i['hf_falconsai_summary']['output_text']}\033[0m"
        )

In [8]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", device_map="auto")

# tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
# model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto")

input_text = "Write me a poem about Machine Learning."
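# Move the tokenized inputs to the same device the model was placed on
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# No generation length is set here, so the short default cuts the output off
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))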

<bos>Write me a poem about Machine Learning.

I’m not sure what you mean by “


## Additional Reading

1. [LLM Bootcamp](https://github.com/miztiik/llm-bootcamp)
1. [Revolutionizing News Summarization](https://www.width.ai/post/revolutionizing-news-summarization-exploring-the-power-of-gpt-in-zero-shot-and-specialized-tasks)
1. [Summarizer For Any Size Document](https://www.width.ai/post/gpt3-summarizer)
1. [Langchain Summarization 1. Stuff & Map Reduce](https://python.langchain.com/docs/use_cases/summarization)
1. [Langchain Google Serper](https://python.langchain.com/docs/integrations/tools/google_serper)
1. [Hugging Face Local Pipelines](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines)
1. [Chunking Strategies for LLM Applications](https://www.pinecone.io/learn/chunking-strategies/)
1. [Optimal Chunk-Size for Large Document Summarization](https://vectify.ai/blog/LargeDocumentSummarization)
1. [4 Powerful Long Text Summarization Methods With Real Examples](https://www.width.ai/post/4-long-text-summarization-methods)
1. [5 Levels Of Summarization: Novice to Expert](https://www.youtube.com/watch?v=qaPMdcCqtWk)
1. [Generating Summaries for Large Documents with Llama2 using Hugging Face and Langchain](https://medium.com/@ankit941208/generating-summaries-for-large-documents-with-llama2-using-hugging-face-and-langchain-f7de567339d2)
1. [Py-LangChain-PDF-Summary](https://github.com/dmitrimahayana/Py-LangChain-PDF-Summary/blob/master/02_RAG_GPT4ALL.py)
1. [Langchain Text Summarization with OpenAI](https://github.com/krishnaik06/Complete-Langchain-Tutorials/blob/main/Text%20summarization/summarization.ipynb)


In [None]:
prompt_template = """Write a concise summary of the following:


{text}


CONCISE SUMMARY:"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])
refine_template = (
    "Your job is to produce a final summary.\n"
    "We have provided an existing summary up to a certain point: {existing_answer}\n"
    "We have the opportunity to refine the existing summary "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{text}\n"
    "------------\n"
    "Given the new context, refine the original summary. Make sure to address the list of problems, list of solutions and any following action. "
    "If the context isn't useful, return the original summary."
)
refine_prompt = PromptTemplate(
    input_variables=["existing_answer", "text"],
    template=refine_template,
)
chain = load_summarize_chain(
    llm,
    chain_type="refine",
    return_intermediate_steps=True,
    question_prompt=PROMPT,
    refine_prompt=refine_prompt,
)
# `llm` and `docs` are assumed to be defined earlier (an LLM wrapper and a list of Documents)
chain({"input_documents": docs}, return_only_outputs=True)
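
The `refine` chain type used above summarizes the first chunk, then revisits the running summary once for each remaining chunk. The control flow can be sketched in plain Python. Here `fake_llm` is a stand-in that records each prompt and returns a counter-based string, so the example runs without a model, and `refine_summarize` is a hypothetical name rather than LangChain's implementation.

```python
from typing import Callable, List


def refine_summarize(
    chunks: List[str],
    llm: Callable[[str], str],
    question_template: str,
    refine_template: str,
) -> str:
    """Summarize the first chunk, then refine that summary with each later chunk."""
    if not chunks:
        return ""
    summary = llm(question_template.format(text=chunks[0]))
    for chunk in chunks[1:]:
        prompt = refine_template.format(existing_answer=summary, text=chunk)
        summary = llm(prompt)
    return summary


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a model: remembers the prompt and returns a placeholder."""
    calls.append(prompt)
    return f"summary after {len(calls)} call(s)"


result = refine_summarize(
    ["chunk one", "chunk two", "chunk three"],
    fake_llm,
    "Write a concise summary of the following:\n{text}\nCONCISE SUMMARY:",
    "Existing summary: {existing_answer}\nNew context: {text}\nRefined summary:",
)
print(result)
```

Because each call depends on the previous summary, the calls cannot run in parallel, which is the main trade-off against the `map_reduce` chain type.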