# WikipediaRetriever

## Overview
>[Wikipedia](https://wikipedia.org/) is a multilingual free online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system called MediaWiki. `Wikipedia` is the largest and most-read reference work in history.

This notebook shows how to retrieve wiki pages from `wikipedia.org` as `Document` objects, the format used by downstream components.

### Integration details

| Retriever | Namespace | Native async | Local |
| :--- | :--- | :---: | :---: |
| [WikipediaRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.wikipedia.WikipediaRetriever.html#langchain_community.retrievers.wikipedia.WikipediaRetriever) | langchain_community.retrievers | ❌ | ❌ |

## Setup
If you want automated tracing of individual runs, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting the lines below:

In [None]:
# import getpass
# import os

# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

## Installation

The integration lives in the `langchain-community` package. We also need to install the `wikipedia` Python package itself.

In [None]:
%pip install -qU langchain_community wikipedia

## Instantiation

Now we can instantiate our retriever.

`WikipediaRetriever` has these arguments:
- optional `lang`: default="en". Use it to search a specific language edition of Wikipedia.
- optional `load_max_docs`: default=100. Use it to limit the number of downloaded documents. Downloading all 100 documents takes time, so use a small number for experiments. There is currently a hard limit of 300.
- optional `load_all_available_meta`: default=False. By default only the most important fields are downloaded, such as `title` and `summary`. If True, the other available fields are downloaded as well.

`get_relevant_documents()` has one argument, `query`: free text that is used to find documents in Wikipedia. The usage example below passes the same kind of query string to `invoke()`.
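As a minimal sketch of how these arguments might be passed, the hypothetical helper `retriever_kwargs` below collects them into a dict and rejects values above the 300-document limit mentioned above. `build_retriever` then forwards them to `WikipediaRetriever`. Both helper names are illustrative and not part of the library, and how the library itself reacts to an over-limit value is not shown here. The import is deferred so the first helper works without the packages installed.

```python
def retriever_kwargs(lang="en", load_max_docs=2, load_all_available_meta=False):
    """Collect the three documented constructor arguments into a dict.

    Hypothetical helper for illustration; not part of langchain_community.
    """
    # The argument list above mentions a hard limit of 300 documents.
    if not 1 <= load_max_docs <= 300:
        raise ValueError("load_max_docs must be between 1 and 300")
    return {
        "lang": lang,
        "load_max_docs": load_max_docs,
        "load_all_available_meta": load_all_available_meta,
    }


def build_retriever(**overrides):
    """Instantiate WikipediaRetriever with the collected arguments."""
    # Imported lazily: needs the langchain_community and wikipedia packages.
    from langchain_community.retrievers import WikipediaRetriever

    return WikipediaRetriever(**retriever_kwargs(**overrides))


# A small load_max_docs keeps experiments fast.
print(retriever_kwargs(lang="de", load_max_docs=2))
```

With the packages installed, `build_retriever(lang="de", load_max_docs=2)` would return a retriever that searches the German edition and downloads at most two documents.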

## Usage

In [1]:
from langchain_community.retrievers import WikipediaRetriever

retriever = WikipediaRetriever()

retriever.invoke("TOKYO GHOUL")

[Document(metadata={'title': 'Tokyo Ghoul', 'summary': "Tokyo Ghoul (Japanese: 東京喰種（トーキョーグール）, Hepburn: Tōkyō Gūru) is a Japanese dark fantasy manga series written and illustrated by Sui Ishida. It was serialized in Shueisha's seinen manga magazine Weekly Young Jump from September 2011 to September 2014, with its chapters collected in 14 tankōbon volumes. The story is set in an alternate version of Tokyo where humans coexist with ghouls, beings who look like humans but can only survive by eating human flesh. Ken Kaneki is a college student who is transformed into a half-ghoul after an encounter with one of them. He must navigate the complex social and political dynamics between humans and ghouls while struggling to maintain his humanity.\nA prequel, titled Tokyo Ghoul [Jack], ran online on Jump Live in 2013, with its chapters collected in a single tankōbon volume. A sequel, titled Tokyo Ghoul:re, was serialized in Weekly Young Jump from October 2014 to July 2018, its chapters were coll
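
As the output above shows, each returned `Document` keeps its fields in `metadata`, alongside the page text in `page_content`. A minimal sketch for skimming results follows. `preview_docs` is a hypothetical helper, and the stand-in objects only mimic those two attributes so the snippet runs without network access.

```python
from types import SimpleNamespace


def preview_docs(docs, max_chars=80):
    """Return one short 'title: summary' line per document."""
    lines = []
    for doc in docs:
        title = doc.metadata.get("title", "<untitled>")
        summary = doc.metadata.get("summary", "")
        # Shorten long summaries so each result fits on one line.
        if len(summary) > max_chars:
            summary = summary[:max_chars].rstrip() + "..."
        lines.append(f"{title}: {summary}")
    return lines


# Stand-in for the list returned by retriever.invoke(...); not real output.
docs = [
    SimpleNamespace(
        page_content="Tokyo Ghoul is a Japanese dark fantasy manga series.",
        metadata={
            "title": "Tokyo Ghoul",
            "summary": "A Japanese dark fantasy manga series.",
        },
    )
]
for line in preview_docs(docs):
    print(line)
```

The same helper should work on real results, since it only reads `metadata`.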

## Use within a chain
We can easily combine this retriever into a chain.

In [4]:
from dotenv import load_dotenv

# Load environment variables such as OPENAI_API_KEY from a local .env file, if one exists.
load_dotenv()

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    """
    Answer the question based only on the context provided.
    Context: {context}
    Question: {question}
    """
)

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [7]:
chain.invoke("Who is the main character in `Tokyo Ghoul` and how he transforms into a ghoul?")

'The main character in Tokyo Ghoul is Ken Kaneki. He transforms into a half-ghoul after undergoing surgery that involved receiving some ghoul organs from Rize, a ghoul who was trying to kill him.'
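
To make the data flow in the chain concrete: the retriever's documents are joined by `format_docs` into one string, which fills `{context}` in the prompt, while the question passes through unchanged. The sketch below reproduces that step in plain Python. The `Doc` dataclass and the sample strings are stand-ins, not library types or real retrieval output.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in exposing only the attribute that format_docs reads."""

    page_content: str


def format_docs(docs):
    # Same helper as in the chain: separate documents with a blank line.
    return "\n\n".join(doc.page_content for doc in docs)


TEMPLATE = (
    "Answer the question based only on the context provided.\n"
    "Context: {context}\n"
    "Question: {question}"
)

docs = [Doc("First page text."), Doc("Second page text.")]
filled = TEMPLATE.format(
    context=format_docs(docs),
    question="Who is the main character?",
)
print(filled)
```

In the real chain, the filled prompt is what the model receives, so the answer can only draw on the retrieved page text.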

## Question Answering on facts

In [18]:
# get a token: https://platform.openai.com/account/api-keys

# from getpass import getpass

# OPENAI_API_KEY = getpass()


In [19]:
# import os

# os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [14]:
from langchain.chains import ConversationalRetrievalChain
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-3.5-turbo")  # swap in "gpt-4" if you prefer
retriever = WikipediaRetriever()
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

In [19]:
questions = [
    "What is Apify?",
    "What is Uncertainty principle?",
    "What is the Abhayagiri Vihāra?",
    # "How big is Wikipédia en français?",
]
chat_history = []

for question in questions:
    result = qa.invoke({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What is Apify? 

**Answer**: Apify is a web scraping and automation platform that provides tools for extracting data from websites, automating workflows, and creating APIs for web data access. It allows users to easily create web scraping tasks, schedule them, and manage the extracted data. Apify also offers a marketplace where users can find pre-built web scraping actors for various websites and use them to extract data without needing to write custom code. 

-> **Question**: What is Uncertainty principle? 

**Answer**: The uncertainty principle, also known as Heisenberg's indeterminacy principle, is a fundamental concept in quantum mechanics. It states that there is a limit to the precision with which certain pairs of physical properties, such as position and momentum, can be simultaneously known. In other words, the more accurately one property is measured, the less accurately the other property can be known. This principle was introduced in 1927 by German physicist
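
The loop above passes the accumulated `(question, answer)` pairs back in on every call, so each question after the first is asked with the earlier turns as history. Here is a sketch of that bookkeeping with a hypothetical stand-in, `StubQA`, which calls no model and only reports how many earlier turns it received:

```python
class StubQA:
    """Hypothetical stand-in for the chain; it calls no model."""

    def invoke(self, inputs):
        turns = len(inputs["chat_history"])
        return {"answer": f"stub answer after {turns} earlier turn(s)"}


def run_conversation(qa, questions):
    """Ask each question in order, feeding earlier turns back as history."""
    chat_history = []
    for question in questions:
        result = qa.invoke({"question": question, "chat_history": chat_history})
        # Store the turn as a (question, answer) tuple, as in the loop above.
        chat_history.append((question, result["answer"]))
    return chat_history


history = run_conversation(
    StubQA(), ["What is Apify?", "What is Uncertainty principle?"]
)
for question, answer in history:
    print(question, "->", answer)
```

Passing an empty list on every call instead would make each question independent of the previous ones.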

## API reference

For detailed documentation of all `WikipediaRetriever` features and configurations, head to the [API reference](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.wikipedia.WikipediaRetriever.html#langchain-community-retrievers-wikipedia-wikipediaretriever).