# Capstone Project
You've already learned a lot from the previous two hands-on exercises.

This capstone project puts all of those skills to the test.
We will use everything we've learned to create an agent that, when asked a question, finds the most relevant Wikipedia page, scrapes that page, and uses RAG to answer the question.
We're going to use the following concepts: API calls and web scraping, LangChain tools and function calling, text splitting, embedding and vector similarity search, and retrieval-based Q&A.

# Installations
Let's install all required packages for this notebook

In [None]:
# install required packages; this may take a few minutes. Dependency warnings can usually be ignored.
%pip install openai
%pip install langchain
%pip install langchain-openai
%pip install langchain-community
%pip install pypdf
%pip install tiktoken
%pip install chromadb
%pip install wikipedia
%pip install beautifulsoup4 requests  # needed for the web scraping section below

# Setup

Let's set up the OpenAI API key.

In [None]:
import os

openai_api_key = 'API_KEY'  # placeholder: replace with your own key

os.environ["OPENAI_API_KEY"] = openai_api_key
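If you'd rather not paste the key into the notebook itself, one option is to read it from the environment and only ask for it when it is missing. The snippet below is a minimal sketch of that idea; `resolve_api_key` is a hypothetical helper name, not part of any library.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the OpenAI key from the environment, asking the user only if it is missing.

    `resolve_api_key` is a hypothetical helper, not part of any library.
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        # getpass hides the typed characters, so the key does not show up in the cell output
        key = prompt("Enter your OpenAI API key: ")
    return key.strip()


# Usage in the notebook (uncomment to run interactively):
# os.environ["OPENAI_API_KEY"] = resolve_api_key()
```

This keeps the key out of the saved notebook, which matters if you share or commit the file.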

Let's also set up the GPT model, the embedding model, and the text splitter for later use.

In [None]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings  # both come from the langchain-openai package installed above
from langchain.text_splitter import RecursiveCharacterTextSplitter

model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=128)  # temperature=0 keeps the answers as repeatable as possible
embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)  # chunks of at most 100 characters, no overlap

# Langchain WikiWrapper
Let's start by using the LangChain Wikipedia API wrapper to figure out which Wikipedia page relates most closely to our question.
More can be read about the wrapper [here](https://api.python.langchain.com/en/latest/utilities/langchain_community.utilities.wikipedia.WikipediaAPIWrapper.html)


This works much like what we did with the structuring model: we create a class that describes what we're looking for and let an LLM do the heavy lifting.

In [None]:
from langchain_community.tools import WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper
from langchain_core.pydantic_v1 import BaseModel, Field


# first we use a class to describe the schema of the input we're looking for
class WikiInputs(BaseModel):
    """Inputs to the wikipedia tool."""

    query: str = Field(
        description="query to look up in Wikipedia, should be 3 or fewer words"
    )

api_wrapper = WikipediaAPIWrapper(top_k_results=3, doc_content_chars_max=100) # we create the langchain api_wrapper object
wiki_tool = WikipediaQueryRun( # then we use this class to create the tool we'll be using to make the LLM calls
    name="wiki-tool",
    description="look up things in wikipedia",
    args_schema=WikiInputs, # this uses the schema we just created
    api_wrapper=api_wrapper, # this uses the wrapper above
    return_direct=True,
)

# let's take a look at what the output looks like
wiki = wiki_tool.run("Which interface do people use with code to make requests to the internet?") # can also be used as wiki_tool.invoke
print(wiki)

The output here is a single string.
We could use regular expressions to extract the page title, or split on newlines to get each text separately, but let's use the structuring model to do that for us.
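For comparison, here is a rough sketch of the regular-expression route. It assumes the tool returns entries shaped like `Page: <title>` followed by a `Summary: <text>` line; the sample string below is made up for illustration, and `parse_wiki_output` is a hypothetical helper.

```python
import re

# Assumed sample of the tool output; the real text depends on the query and on Wikipedia.
sample_output = (
    "Page: API\n"
    "Summary: An application programming interface (API) is a connection between computers.\n"
    "\n"
    "Page: Web API\n"
    "Summary: A web API is an application programming interface for a web server or browser."
)

# Each entry starts with a "Page:" line that is directly followed by a "Summary:" line.
entry_pattern = re.compile(r"^Page: (?P<page>.+)\nSummary: (?P<summary>.*)$", re.MULTILINE)


def parse_wiki_output(text):
    """Return a list of (page, summary) tuples found in the tool output."""
    return [(m.group("page"), m.group("summary")) for m in entry_pattern.finditer(text)]


entries = parse_wiki_output(sample_output)
print(entries[0][0])  # title of the first entry: API
```

This only captures the first line of each summary and silently skips entries that were cut off before their `Summary:` line, which is one reason the structuring model is the more forgiving choice here.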

In [None]:
class StructureAPI(BaseModel):
    """Structure of the explanation"""
    page: str = Field(description="The page explained in the output")
    summary: str = Field(description="The summary explained by the output")

structured_model = model.with_structured_output(StructureAPI)

In [None]:
structured_output = structured_model.invoke(wiki)
print(structured_output.page, structured_output.summary)

Now let's create a chain that takes in the question and returns this structured output. Remember that a chain is built with the `|` operator, which passes the output of one step to the next.

In [None]:
wiki_chain = wiki_tool | structured_model
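If the `|` operator still feels a bit magical, the toy example below shows the basic idea of piping the output of one step into the next. `Step` is a made-up class for illustration only; it is not how LangChain implements its runnables.

```python
class Step:
    """Toy stand-in for a chain component; NOT LangChain's actual Runnable class."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that runs `a` first and feeds its result to `b`
        return Step(lambda value: other.invoke(self.invoke(value)))


add_one = Step(lambda x: x + 1)
double = Step(lambda x: x * 2)

pipeline = add_one | double
print(pipeline.invoke(3))  # (3 + 1) * 2 = 8
```

In the same way, `wiki_chain.invoke(question)` first runs the Wikipedia tool and then hands its string output to the structuring model.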

# Web Scraping

Now that we have a chain that identifies the Wikipedia page, we can use what we learned in the web scraping hands-on to scrape that particular page and return all of its text.

In [None]:
from bs4 import BeautifulSoup
import requests

def wikipedia_search(page: str):
  """Uses the page name to scrape all the text from the Wikipedia page of that title"""

  scrape_url = f'https://en.wikipedia.org/wiki/{page}' # the page is what we get from the chain

  response = requests.get(scrape_url) # send an HTTP GET request

  # Check the response status code.
  if response.status_code == 200:
    # Parse the HTML content of the webpage using Beautiful Soup
    soup = BeautifulSoup(response.content, 'html.parser')
    # Collect the text within the <p> tags, which usually contain the main content
    text_content = ''
    for paragraph in soup.find_all('p'):
        text_content += paragraph.get_text()
    # return the combined text of all paragraphs
    return text_content
  else:
    return "Couldn't find the page you're looking for"


In [None]:
wiki = wiki_chain.invoke("Which interface do people use with code to make requests to the internet?")
wiki_article = wikipedia_search(wiki.page)
print(wiki_article)
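One detail worth keeping in mind is that the page title goes straight into the URL. Titles can contain spaces or characters such as `+`, so a slightly more defensive way to build the URL could look like the sketch below. `build_wikipedia_url` is a hypothetical helper, and the underscore convention is how Wikipedia article URLs are normally written.

```python
from urllib.parse import quote

WIKI_BASE_URL = "https://en.wikipedia.org/wiki/"


def build_wikipedia_url(page_title):
    """Turn a page title into an article URL (hypothetical helper)."""
    # Wikipedia article URLs use underscores in place of spaces
    slug = page_title.strip().replace(" ", "_")
    # percent-encode characters that are not URL-safe, e.g. "+" or "?"
    return WIKI_BASE_URL + quote(slug)


print(build_wikipedia_url("Web API"))  # https://en.wikipedia.org/wiki/Web_API
```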

We could leave it at that, but we can do better.
Let's turn this function into a tool that we can add as a step in our chain.
That way we only need to call one chain instead of making the sequential calls manually.

In [None]:
def wikipedia_search_tool(structured_wiki: StructureAPI): # instead of the page as a string, we receive the output of the previous model, which is a StructureAPI object
  """Uses the page name to scrape all the text from the Wikipedia page of that title"""

  scrape_url = f'https://en.wikipedia.org/wiki/{structured_wiki.page}' # we get the whole structured object, so we read the page title from it

  response = requests.get(scrape_url) # send an HTTP GET request

  # Check the response status code.
  if response.status_code == 200:
    # Parse the HTML content of the webpage using Beautiful Soup
    soup = BeautifulSoup(response.content, 'html.parser')
    # Collect the text within the <p> tags, which usually contain the main content
    text_content = ''
    for paragraph in soup.find_all('p'):
        text_content += paragraph.get_text()
    # return the combined text of all paragraphs
    return text_content
  else:
    return "Couldn't find the page you're looking for"

# Tool Creation
Now we need to create a tool that wraps our function so it can be called as the next step in the chain.
LangChain already has all the classes we need.

In [None]:
from langchain.tools import Tool

# Define the tool wrapper
class WikipediaScrapeTool(Tool):
    def __init__(self, func):
      super().__init__(name="Wikipedia Scraper", func=func, description="A tool that scrapes text from Wikipedia given page information")
      self.func = func # func is the function we want to use as a tool (wikipedia_search_tool in our case)

    def invoke(self, tool_input, config=None, **kwargs): # the chain calls invoke, so we override it to pass the StructureAPI object straight to our function
      return self.func(tool_input)

wiki_scraper = WikipediaScrapeTool(wikipedia_search_tool) # then we create an instance of our tool.

# we could already test this with wiki_scraper.invoke(wiki), but let's go straight to creating a chain

In [None]:
scrapper_chain = wiki_chain | wiki_scraper

In [None]:
# let's now test out our new chain
wiki_article = scrapper_chain.invoke("Which interface do people use with code to make requests to the internet?")
wiki_article

# Splitting

Great. Now that we have a chain that scrapes Wikipedia for us, let's create the text chunks we'll need.

In [None]:
text_chunks = text_splitter.create_documents([wiki_article])

print(text_chunks)
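To get a feeling for what `chunk_size` and `chunk_overlap` do, here is a deliberately simplified splitter. It is only a sketch: unlike `RecursiveCharacterTextSplitter`, it cuts at fixed character positions and ignores paragraph and word boundaries. `split_with_overlap` is a hypothetical helper name.

```python
def split_with_overlap(text, chunk_size, chunk_overlap=0):
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters. Simplified sketch:
    it ignores paragraph and word boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=0))  # ['abcd', 'efgh', 'ij']
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With an overlap of 0, as in our setup, a sentence that straddles a chunk boundary is simply cut in two; a small overlap trades a bit of duplication for keeping more context together.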

# Embedding and storing

Now that we have the chunks, we need to vectorize them and store them in our vector database.

We will again use Chroma for this.

In [None]:
# let's create a fresh directory to store our data
import shutil

vectorstore_path = "chroma/"
shutil.rmtree(vectorstore_path, ignore_errors=True)  # remove old database files, if any
os.makedirs(vectorstore_path)


from langchain_community.vectorstores import Chroma  # provided by the langchain-community package installed above

vectordb = Chroma.from_documents(
    documents=text_chunks,
    embedding=embedding_model,
    persist_directory=vectorstore_path
)
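To make the idea of vector similarity search concrete, the sketch below ranks a few stored chunks against a query vector. The three-dimensional vectors are made up for illustration (real embeddings have far more dimensions), and cosine similarity is used as one common choice of metric; a vector store may be configured with a different one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(stored, key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up "embeddings" for illustration only.
stored_chunks = [
    ("An API is an interface between programs.", [0.9, 0.1, 0.0]),
    ("Web APIs are accessed over HTTP.", [0.8, 0.3, 0.1]),
    ("Bananas are rich in potassium.", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.2, 0.0]

print(top_k(query, stored_chunks, k=2))
```

The two API-related chunks point in roughly the same direction as the query, so they are returned first, while the unrelated chunk is left out.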

# Retrieval

Now let's create a retrieval chain that uses the vector store to look up the most relevant chunks and lets the LLM answer our question from them.

In [None]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    model,
    retriever=vectordb.as_retriever()
)

result = qa_chain.invoke({"query": "Which interface do people use with code to make requests to the internet?"}) # any question related to the topic we scraped here would also work
result["result"]
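Under the hood, a retrieval chain of this kind places the retrieved chunks into the prompt next to the question. The sketch below shows roughly what that looks like; the prompt wording is illustrative and not the library's exact template, and `build_stuffed_prompt` is a hypothetical helper.

```python
def build_stuffed_prompt(question, retrieved_chunks):
    """Combine the retrieved chunks and the question into a single prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Made-up chunks standing in for what the retriever would return.
example_chunks = [
    "An API is an interface between programs.",
    "Web APIs are accessed over HTTP.",
]
print(build_stuffed_prompt("What is an API?", example_chunks))
```

Because every retrieved chunk ends up in the prompt, the chunk size and the number of retrieved chunks directly affect how many tokens each question costs.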

# Conclusions

In this project you learned how to use LangChain to create an agent that answers questions using Wikipedia as its source.
By now you should have a good understanding of:
- API calling and web scraping
- LangChain tools and function calling
- LangChain chains and workflows
- Text splitting and embedding
- Retrieval-Augmented Generation


## Possible next steps
- Use a source other than Wikipedia. You'll need to create your own wrapper or check whether LangChain already has one (PS: LangChain has an optimized LLM for API calls).
- Create a function that takes in the input question and returns the retrieval answer in one go (this mostly means combining all the steps above into a single function).
- Optimize by not calling the initial chain if the question has been asked before, or by reusing the saved vector store if the answer comes from the same page (fewer calls = less money spent).
- Be creative. The limit is what you can come up with.
- Have fun!
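As a starting point for the caching idea, here is a minimal sketch that remembers answers to questions that were already asked. `make_cached_answerer` and `fake_pipeline` are hypothetical names; in the notebook you would pass in your own function that runs the full pipeline.

```python
def make_cached_answerer(answer_fn):
    """Wrap `answer_fn` so repeated questions are served from memory."""
    cache = {}

    def answer(question):
        key = " ".join(question.lower().split())  # normalise case and whitespace
        if key not in cache:
            cache[key] = answer_fn(question)  # only the first call reaches the pipeline
        return cache[key]

    return answer


# Stand-in for the full pipeline (scrape, split, embed, retrieve); it records every real call.
calls = []


def fake_pipeline(question):
    calls.append(question)
    return f"answer to: {question}"


cached_answer = make_cached_answerer(fake_pipeline)
cached_answer("What is an API?")
cached_answer("what is  an API?")  # same question after normalisation, served from the cache
print(len(calls))  # 1
```

This only catches questions that are identical after normalisation; recognising that two different questions lead to the same Wikipedia page would need an extra lookup keyed on the page title.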

More code and examples can be found in the following links:
- The Langchain Docs: https://python.langchain.com/v0.2/docs/introduction/
- Code from previous iterations: https://github.com/michaelnoi/venture_labs_build
- Code from this iteration: https://github.com/mostafa-elhaiany/AI-Practitioners-src