In [1]:
import os
import openai
from dotenv import load_dotenv

load_dotenv()

openai.api_key = os.environ["OPENAI_API_KEY"]

## Method 1: Using `stuff` chain

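This method inserts ("stuffs") all documents into a single prompt, so it is useful for a small number of documents whose combined text fits in the model's context window.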
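As a rough illustration of what "stuffing" means, the following minimal sketch joins every document into one string and substitutes it into a single prompt. It does not use LangChain; `stuff_prompt`, the separator, and the sample texts are hypothetical names and values chosen for this example only.

```python
def stuff_prompt(documents, template, separator="\n\n"):
    """Join all document texts and place them into one prompt (hypothetical helper)."""
    combined = separator.join(documents)
    # The whole corpus ends up in a single prompt, so it must fit the context window.
    return template.format(text=combined)


template = 'Write a concise summary of the following:\n"{text}"\nCONCISE SUMMARY:'
pages = [
    "The Algarve is the southernmost region of Portugal.",
    "Faro is the regional capital.",
]
prompt = stuff_prompt(pages, template)
print(prompt)
```

Because everything is sent in one request, this approach needs only one model call, but it stops working once the joined text no longer fits in the context window.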

In [2]:
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import DirectoryLoader
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains.combine_documents.stuff import StuffDocumentsChain

In [3]:
loader = DirectoryLoader('./data/wiki', glob="**/*.pdf", show_progress=True)
docs = loader.load()

100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.77s/it]


In [4]:
docs

[Document(page_content='Algarve\n\nEarth > Europe > Iberia > Portugal > Southern Portugal > Algarve\n\nThe Algarve is the southernmost region of Portugal, on the coast of the Atlantic Ocean.\n\nRegions\n\nBarlavento (Lagos, Silves, Portimão, Lagoa, Albufeira areas) Serra Algarvia (Monchique, Caldeirão) Sotavento (Faro, Loulé, São Brás de Alportel, Olhão, Tavira, Vila Real de Santo António areas)\n\nCities\n\nFaro — the regional capital Albufeira Lagos Sagres Monchique Paderne Portimão Silves — first capital of the Algarve which has a Moorish castle Tavira — a city near the Ria Formosa (lagoon).\n\nOther destinations\n\nCastro Marim Quarteira\n\nUnderstand\n\nThe Algarve is Portugal\'s most popular holiday destination due to the clean beaches (approximately 200 km of them), the cool, unpolluted water, and the fact that it is relatively cheap, very safe, and overall welcoming. English is spoken at most resorts.\n\nThe Algarve is rich in culture and diversity. If you are looking for fast-

In [5]:

# Define prompt
prompt_template = """Write a concise summary in Czech language of the following:
"{text}"
CONCISE SUMMARY:"""
prompt = PromptTemplate.from_template(prompt_template)

# Define LLM chain
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k")
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Define StuffDocumentsChain
stuff_chain = StuffDocumentsChain(
    llm_chain=llm_chain, document_variable_name="text"
)

docs = loader.load()
print(stuff_chain.run(docs))

100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.94s/it]


Algarve je nejjižnější region Portugalska na pobřeží Atlantského oceánu. Je oblíbenou turistickou destinací díky čistým plážím, teplému moři a příznivým cenám. Region je bohatý na kulturu a rozmanitost, nabízí různé aktivity jako turistiku, surfování a golf. V Algarve se nachází také několik historických měst, jako je Faro, Lagos a Tavira. Kromě toho je region známý svými výbornými vínem a mořskými plody.


## Method 2: Using `MapReduce` chain

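This method processes each document chunk independently (map) and then combines the partial results into a single answer (reduce), which is useful when the documents are too large to fit in one prompt.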
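The map and reduce steps can be sketched in plain Python as below. This is an illustration of the pattern only, not LangChain's implementation: `fake_llm` is a hypothetical stand-in for a model call that simply returns the first sentence of its input, and `map_reduce_summarize` is a name invented for this example.

```python
calls = []


def fake_llm(text):
    """Hypothetical stand-in for a model call: returns the first sentence."""
    calls.append(text)
    return text.split(".")[0].strip() + "."


def map_reduce_summarize(chunks, llm):
    # Map: each chunk is summarized on its own, without seeing the others.
    partial_summaries = [llm(chunk) for chunk in chunks]
    # Reduce: the partial summaries are joined and summarized once more.
    final_summary = llm(" ".join(partial_summaries))
    return partial_summaries, final_summary


chunks = [
    "The Algarve is in southern Portugal. It has many beaches.",
    "Faro is the regional capital. It has an airport.",
]
partials, final = map_reduce_summarize(chunks, fake_llm)
print(partials)
print(final)
```

Since the map calls do not depend on one another, they can in principle run in parallel; the cost is one model call per chunk plus at least one for the reduce step.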

In [6]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ReduceDocumentsChain, MapReduceDocumentsChain

llm = ChatOpenAI(temperature=0)

# Map
map_template = """The following is a set of documents
{docs}
Based on this list of docs, please identify the main themes 
Helpful Answer:"""
map_prompt = PromptTemplate.from_template(map_template)
map_chain = LLMChain(llm=llm, prompt=map_prompt)

In [7]:
# Reduce
reduce_template = """The following is a set of summaries:
{doc_summaries}
Take these and distill them into a final, consolidated summary of the main themes.
Helpful Answer:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Takes a list of documents, combines them into a single string, and passes this to an LLMChain
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="doc_summaries"
)

# Combines and iteratively reduces the mapped documents
reduce_documents_chain = ReduceDocumentsChain(
    # This is the final chain that is called.
    combine_documents_chain=combine_documents_chain,
    # Used to collapse documents first if they exceed the context for `StuffDocumentsChain`
    collapse_documents_chain=combine_documents_chain,
    # The maximum number of tokens to group documents into.
    token_max=4000,
)
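To make the role of `token_max` more concrete, here is a simplified sketch of how a list of summaries might be packed into groups that each stay within a token budget before being collapsed. It is an assumption-laden illustration, not the library's code: `group_by_token_budget` and `count_words` are hypothetical helpers, and whitespace-separated words stand in for real tokenizer counts.

```python
def count_words(text):
    """Hypothetical token counter: whitespace-separated words stand in for tokens."""
    return len(text.split())


def group_by_token_budget(texts, token_max, count_tokens):
    """Greedily pack texts into groups whose total size stays within token_max."""
    groups, current, current_size = [], [], 0
    for text in texts:
        size = count_tokens(text)
        if size > token_max:
            raise ValueError("A single text exceeds token_max and cannot be grouped.")
        # Start a new group when adding this text would exceed the budget.
        if current and current_size + size > token_max:
            groups.append(current)
            current, current_size = [], 0
        current.append(text)
        current_size += size
    if current:
        groups.append(current)
    return groups


summaries = ["one two three", "four five", "six seven eight nine", "ten"]
groups = group_by_token_budget(summaries, token_max=5, count_tokens=count_words)
print(groups)
```

Each group would then be passed to the collapse chain, and the process repeats until the remaining summaries fit into one final call.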

In [8]:
# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    # Map chain
    llm_chain=map_chain,
    # Reduce chain
    reduce_documents_chain=reduce_documents_chain,
    # The variable name in the llm_chain to put the documents in
    document_variable_name="docs",
    # Return the results of the map steps in the output
    return_intermediate_steps=False,
)

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=0
)
split_docs = text_splitter.split_documents(docs)

In [9]:
print(map_reduce_chain.run(split_docs))

The main themes that emerge from the provided summaries are:

1. Geography and Location: The Algarve is described as the southernmost region of Portugal, located on the coast of the Atlantic Ocean. It is further divided into three regions: Barlavento, Serra Algarvia, and Sotavento. The document also mentions specific cities and destinations within the Algarve.

2. Tourism and Beaches: The Algarve is highlighted as Portugal's most popular holiday destination, known for its clean beaches and cool, unpolluted water. It is described as a relatively cheap, safe, and welcoming destination for tourists. The document mentions the presence of over 100 different beaches, with 88 of them designated as blue flag beaches.

3. Nature and Outdoor Activities: The Algarve is portrayed as a region rich in nature and diversity. It mentions various natural attractions such as the Monchique and Caldeirão mountains, the Sagres cape, and over 30 hiking trails. The document also mentions the Ria Formosa, a la

## Method 3: `Refine` chain

This method loops over the documents and iteratively refines the answer to the summarization query.
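The loop described above can be sketched without LangChain as follows. `fake_llm` is a hypothetical stand-in for a model call that only returns a versioned placeholder string, and `refine_summarize` and the prompt wording are invented for this illustration.

```python
prompts_seen = []


def fake_llm(prompt):
    """Hypothetical stand-in for a model call: returns a versioned placeholder."""
    prompts_seen.append(prompt)
    return f"summary v{len(prompts_seen)}"


def refine_summarize(chunks, llm):
    if not chunks:
        raise ValueError("At least one chunk is required.")
    # The first chunk produces the initial summary.
    summary = llm(f"Write a concise summary of the following:\n{chunks[0]}")
    intermediate = [summary]
    # Every later chunk is shown together with the summary so far.
    for chunk in chunks[1:]:
        summary = llm(
            f"Existing summary: {summary}\n"
            f"New context: {chunk}\n"
            "Refine the existing summary if the new context is useful."
        )
        intermediate.append(summary)
    return summary, intermediate


final, steps = refine_summarize(["chunk A", "chunk B", "chunk C"], fake_llm)
print(final)
print(steps)
```

Each call depends on the previous result, so the calls run sequentially (one per chunk) and cannot be parallelized the way the map step of a map-reduce approach can.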

In [10]:
from langchain.chains.summarize import load_summarize_chain
chain = load_summarize_chain(llm, chain_type="refine")
chain.run(split_docs)

"Portugal, located on the south-western corner of Europe, is known for its pristine beaches and is often considered the surfing capital of Europe. The country offers a range of surfing spots, including Nazare, Peniche, Sagres, and Espinho. However, surfing during the winter months can be challenging due to the powerful Atlantic swell. Portugal also hosts various fairs and music festivals, particularly during the summer months. The country uses the euro as its currency, and ATMs accepting international cards can be found throughout. Credit cards are widely accepted, but some smaller establishments may only accept Portuguese cards. Haggling is possible in smaller shops, and tipping in restaurants is optional. Visitors can find a variety of souvenirs to buy, including cork products and locally produced alcoholic drinks. Portugal also has a thriving fashion industry, with independent designers such as Fátima Lopes and Maria Gambina. Handicrafts, including handmade leather goods and glass i

In [11]:
prompt_template = """Write a concise summary of the following:
{text}
CONCISE SUMMARY:"""
prompt = PromptTemplate.from_template(prompt_template)

refine_template = (
    "Your job is to produce a final summary\n"
    "We have provided an existing summary up to a certain point: {existing_answer}\n"
    "We have the opportunity to refine the existing summary "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{text}\n"
    "------------\n"
    "Given the new context, refine the original summary in Czech. "
    "If the context isn't useful, return the original summary."
)
refine_prompt = PromptTemplate.from_template(refine_template)
chain = load_summarize_chain(
    llm=llm,
    chain_type="refine",
    question_prompt=prompt,
    refine_prompt=refine_prompt,
    return_intermediate_steps=True,
    input_key="input_documents",
    output_key="output_text",
)
result = chain({"input_documents": split_docs}, return_only_outputs=True)

In [12]:
print(result["output_text"])

Algarve je nejjižnější region Portugalska, známý svými čistými plážemi a dostupnými cenami. Je to oblíbená turistická destinace s více než 100 různými plážemi, mnoha turistickými stezkami a různými kulturními atrakcemi. Region má bohatou historii s vlivy Féničanů, Římanů, Maurů a Portugalců. Významnou roli ve výzkumu a objevování regionu sehrál princ Jindřich Mořeplavec. V roce 1755 postihlo Algarve obrovské zemětřesení a následná tsunami, která zpustošila pobřežní oblasti. Poškození nebylo omezeno pouze na Algarve, britské námořní zprávy z té doby uvádějí příchod obrovské vlny do přístavu Lisabonu. Poškození Lisabonu bylo téměř úplné a po obrovských politických turbulencích byl zodpovědný za obnovu města markýz z Pombalu, tehdejší premiér. Algarve nabízí také mnoho dalších zajímavých míst k návštěvě, jako je historické město Lagos, Monchique Mountains s krásnými výhledy a Silves s červeným kamenným hradem. Region je také známý svou bohatou nabídkou aktivit, jako je horské cyklistiky, 