# Building a Research LLM AI

Inspiration taken from [here](https://github.com/assafelovic/gpt-researcher)

We want to define two agents: (1) one that derives objective questions (queries) from the user's question, and (2) one that searches the web for answers through URL requests. The URL searches run in parallel, and their results are aggregated and summarized into a report. See below:

<center><img src="https://camo.githubusercontent.com/a1db65299ca6ec3b203e42d47b225c46f36f54009246fe2cdae8fd74c68ab8d5/68747470733a2f2f636f7772697465722d696d616765732e73332e616d617a6f6e6177732e636f6d2f6172636869746563747572652e706e67" width=30%></center>
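
Before wiring up any real services, the flow in the diagram can be sketched in plain Python. This is only an illustration: `generate_queries` and `search_and_summarize` are hypothetical stand-ins for the two agents and return canned strings instead of calling an LLM or the web.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_queries(question: str) -> list[str]:
    # Stand-in for agent (1): a real version would ask an LLM for search queries.
    return [f"{question} overview", f"{question} comparison", f"{question} review"]


def search_and_summarize(query: str) -> str:
    # Stand-in for agent (2): a real version would search, scrape and summarize.
    return f"summary for: {query}"


def research(question: str) -> str:
    queries = generate_queries(question)
    # Handle each query in its own thread; pool.map keeps the input order.
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        summaries = list(pool.map(search_and_summarize, queries))
    # Aggregate the per-query summaries into one report.
    return "\n\n".join(summaries)


report = research("langsmith")
print(report)
```

The rest of the notebook builds the same shape with LangChain runnables instead of hand-written functions.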


In [2]:
import os
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough, RunnableLambda
from langchain.utilities import DuckDuckGoSearchAPIWrapper
import requests
from bs4 import BeautifulSoup
import json

In [3]:
os.environ["OPENAI_API_KEY"] = ""

In [4]:
template = """Summarize the following question based on the context:

Question: {question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_template(template)

In [5]:
print(prompt)

input_variables=['context', 'question'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template='Summarize the following question based on the context:\n\nQuestion: {question}\n\nContext:\n{context}\n'))]


In [5]:
## inspiration from https://gist.github.com/hwchase17/69a8cdef9b01760c244324339ab64f0c

def scrape_text(url: str):
    # Send a GET request to the webpage
    try:
        response = requests.get(url, timeout=10)  # avoid hanging forever on slow sites

        # Check if the request was successful
        if response.status_code == 200:
            # Parse the content of the request with BeautifulSoup
            soup = BeautifulSoup(response.text, "html.parser")

            # Extract all text from the webpage
            page_text = soup.get_text(separator=" ", strip=True)

            # Return the extracted text
            return page_text
        else:
            return f"Failed to retrieve the webpage: Status code {response.status_code}"
    except Exception as e:
        print(e)
        return f"Failed to retrieve the webpage: {e}"

In [6]:
url = "https://blog.langchain.dev/announcing-langsmith/" # example url to play with

page_content = scrape_text(url)

In [7]:
print(len(page_content))
print(page_content)

12520
Announcing LangSmith, a unified platform for debugging, testing, evaluating, and monitoring your LLM applications Skip to content LangChain Blog Home By LangChain Release Notes GitHub Docs Case Studies Sign in Subscribe Announcing LangSmith, a unified platform for debugging, testing, evaluating, and monitoring your LLM applications 8 min read Jul 18, 2023 LangChain exists to make it as easy as possible to develop LLM-powered applications. We started with an open-source Python package when the main blocker for building LLM-powered applications was getting a simple prototype working. We remember seeing Nat Friedman tweet in late 2022 that there was “ not enough tinkering happening .” The LangChain open-source packages are aimed at addressing this and we see lots of tinkering happening now ( Nat agrees )–people are building everything from chatbots over internal company documents to an AI dungeon master for a Dungeons and Dragons game. The blocker has now changed. While it’s easy to

Models from OpenAI can be found [here](https://platform.openai.com/docs/models/gpt-3-5)

In [7]:
model_name = "gpt-3.5-turbo-16k"

chain = prompt | ChatOpenAI(model=model_name) | StrOutputParser()

In [1]:
# print(chain)

In [8]:
chain.invoke(
    {
        "question":"what is langsmith?",
        "context": page_content[:10000]
    }
)

'The question is asking for an explanation of what LangSmith is.'

Now we'll create a summary template and prompt

In [10]:
summary_template = """{text}

Using the above text, answer in short the following question

> {question}

---
if the question cannot be answered using the text, simply summarize the text. Include all factual information, numbers, stats, etc.
"""

summary_prompt = ChatPromptTemplate.from_template(summary_template)


In [9]:
model_name = "gpt-3.5-turbo-16k"

chain = summary_prompt | ChatOpenAI(model=model_name) | StrOutputParser()

chain.invoke(
    {
        "question":"what is langsmith",
        "text":page_content
    }
)

'LangSmith is a unified platform designed to help developers close the gap between prototype and production of LLM-powered applications. It provides tools for debugging, testing, evaluating, and monitoring LLM applications, offering visibility into model inputs and outputs, dataset creation, performance tracking, and integration with open source evaluation modules. LangSmith has been tested and used by companies such as Snowflake, Boston Consulting Group, DeepLearningAI, and ambitious startups like Mendable, Multi-On, and Quivr. It aims to simplify the development process and provide a single, fully-integrated hub for managing LLM applications.'

We'll modify the chain so that it does more work (it now scrapes the page itself) while staying <b>streamlined</b>, using <b>[RunnablePassthrough](https://nanonets.com/blog/langchain/)</b> calls

In [37]:
# Recall that scrape_text is the function we created and we are passing it through to be executed within the chain

chain = RunnablePassthrough.assign(
    text = lambda x: scrape_text(x["url"])[:10000]
) | summary_prompt |  ChatOpenAI(model=model_name) | StrOutputParser()

In [36]:
chain.invoke(
    {
        "question":"what is langsmith",
        "url":url
    }
)

'LangSmith is a unified platform for debugging, testing, evaluating, and monitoring LLM applications. It helps developers close the gap between prototype and production by providing visibility into model inputs and outputs, tools for creating and running datasets, seamless integration with evaluation modules, and monitoring system-level performance. It has been tested and used by various companies and organizations to improve their LLM applications.'

Now we are going to use the DuckDuckGo search API to look up URLs for a query and pass them through the chain we built

In [11]:
results_per_search = 3

ddg_search = DuckDuckGoSearchAPIWrapper()

def web_search(query:str, num_results:int = results_per_search):
    results = ddg_search.results(query, num_results)

    return [r["link"] for r in results]

Examples of what DDG can do - Notice the return objects: snippet, title and link


In [41]:
ddg_search.results("what is the one piece?", 3)

[{'snippet': 'One Piece is a manga and anime series that follows Monkey D. Luffy, a young boy with a dream to become the greatest pirate in the world. As a child, he eats a mystical plant called a Devil...',
  'title': "What is One Piece? The series' mega-popularity, explained - Polygon",
  'link': 'https://www.polygon.com/entertainment/23845804/one-piece-explained-anime-manga-netflix-live-action'},
 {'snippet': 'Eiichiro Oda\'s One Piece debuted in "Weekly Shonen Jump" in 1997, with the anime adaptation of the series launching just two years later. Following the adventures of Luffy D Monkey and his pirate crew, the Straw Hats, they\'re out on the open seas hunting for the amassed treasure of Gol D Roger.',
  'title': 'What Is the One Piece? - CBR',
  'link': 'https://www.cbr.com/what-is-one-piece-treasure/'},
 {'snippet': "Anime One Piece: What Could The One Piece Be? As One Piece's ending looms in the manga, we take some guesses as to what the One Piece is. By Evan Valentine - August

In [44]:
ddg_search.results("what was the score between peru and venezuela last night?", 3)

[{'snippet': "Thanks to expanded qualifying with the first 48-team World Cup, Peru is still only four points off a qualification spot, but Juan Reynoso's team will be desperate for all three points against...",
  'title': 'Peru vs. Venezuela: How to watch 2026 World Cup qualifier - Pro Soccer Wire',
  'link': 'https://prosoccerwire.usatoday.com/2023/11/20/how-to-watch-peru-vs-venezuela-conmebol-2026-world-cup-qualifier-tv-and-streaming/'},
 {'snippet': '11/21/2023 TAJ 11/21/2023 11/21/2023 124 99 PAL Venezuela 0 - 0 Bolivia Venezuela vs Peru LIVE Updates: Score, Stream Info, Lineups and How to Watch World Cup Qualifiers 2026 Match',
  'title': 'Venezuela vs Peru LIVE Updates: Score, Stream Info, Lineups and How to ...',
  'link': 'https://www.vavel.com/en-us/soccer/2023/11/22/1163696-venezuela-vs-peru-live-updates-score-stream-info-lineups-and-how-to-watch-world-cup-qualifiers-2026-match.html'},
 {'snippet': 'Conmebol odds courtesy of Tipico Sportsbook. Odds last updated Tuesday at 7:2

Let's quickly explore what RunnablePassthrough outputs when combined with our DDG function

In [49]:
main_chain = RunnablePassthrough.assign(
    urls = lambda x: web_search(x["question"])
)

main_chain.invoke({"question": "what is the one piece?"}) 

{'question': 'what is the one piece?',
 'urls': ['https://www.polygon.com/entertainment/23845804/one-piece-explained-anime-manga-netflix-live-action',
  'https://www.cbr.com/what-is-one-piece-treasure/',
  'https://collider.com/one-piece-arcs-in-order/']}

It returns not only the computed output (urls) in dictionary format, but also the original input (question)
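
To make that behavior concrete, here is a minimal plain-Python sketch of what an `assign`-style step does. It is not LangChain's implementation; `assign` and `fake_web_search` are hypothetical names, and the URLs are placeholders.

```python
def assign(**new_fields):
    """Build a step that copies the input dict and adds computed keys."""
    def step(inputs: dict) -> dict:
        outputs = dict(inputs)  # keep every original key
        for key, fn in new_fields.items():
            outputs[key] = fn(inputs)  # each function sees the original input
        return outputs
    return step


def fake_web_search(question: str) -> list[str]:
    # Hypothetical stand-in for web_search; returns fixed placeholder URLs.
    return ["https://example.com/a", "https://example.com/b"]


step = assign(urls=lambda x: fake_web_search(x["question"]))
result = step({"question": "what is the one piece?"})
print(result)
```

Because the input keys survive, later steps in a chain can still read `question` alongside the newly added `urls`.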

Let's now formulate the entire chain, invoking it on each url. Notice how we are <b>ensuring to pass a dictionary to each RunnablePassthrough</b>

In [12]:
# From above
scrape_and_summary_chain = RunnablePassthrough.assign(
    text = lambda x: scrape_text(x["url"])[:10000]
) | summary_prompt |  ChatOpenAI(model=model_name) | StrOutputParser()

# New - Recall from above cell that a dict is returned from RunnablePassthrough.assign() {"question", "urls"}
web_url_chain = RunnablePassthrough.assign(
    urls = lambda x: web_search(x["question"])
) | (lambda x: [{"question":x["question"], "url":u} for u in x["urls"]]) # What we want is a list of {"question":..., "url":...}

# Put together - Map applies chain to every input element. It takes a list of dictionaries
question_to_summaries_chain = web_url_chain | scrape_and_summary_chain.map()

# Invoke chain
question_to_summaries_chain.invoke(
    {
        "question":"what is langsmith?"
    }
)

['LangSmith is a platform for building production-grade LLM (Language Learning Models) applications. It allows developers to debug, test, evaluate, and monitor chains and intelligent agents built on any LLM framework. LangSmith seamlessly integrates with LangChain, an open-source framework for building with LLMs. It provides tracing and debugging capabilities to help developers understand the inputs and outputs of LLM calls, identify issues, and optimize performance. LangSmith can be used with different programming languages, including Python, JavaScript, and Go. It offers features such as visualization of LLM calls, metadata analysis, and an interactive Playground for editing inputs and parameters.',
 'LangSmith is a product created by the team behind LangChain, which is a popular software tool for large language models (LLMs). LangSmith aims to tackle the challenges of getting LLM applications into production in a reliable and maintainable way. It focuses on providing features around
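
The data shapes in the cell above can be traced with a small plain-Python sketch. The helpers `fake_web_search` and `fake_scrape_and_summarize` are hypothetical stand-ins that return canned values, and the list comprehension runs sequentially, so it only mirrors the shapes that flow through `.map()`, not how LangChain executes it.

```python
def fake_web_search(question: str) -> list[str]:
    # Hypothetical stand-in for web_search.
    return [f"https://example.com/{i}" for i in range(3)]


def fake_scrape_and_summarize(item: dict) -> str:
    # Hypothetical stand-in for scrape_and_summary_chain; expects both keys.
    return f"summary of {item['url']} for '{item['question']}'"


def fan_out(x: dict) -> list[dict]:
    # One dict per URL, each carrying the question along with it.
    return [{"question": x["question"], "url": u} for u in x["urls"]]


def question_to_summaries(inputs: dict) -> list[str]:
    with_urls = {**inputs, "urls": fake_web_search(inputs["question"])}
    items = fan_out(with_urls)
    # Apply the per-item step to every element of the list.
    return [fake_scrape_and_summarize(item) for item in items]


summaries = question_to_summaries({"question": "what is langsmith?"})
print(summaries)
```

The key point is that each element handed to the mapped step is a dictionary with both `question` and `url`, which is what the summary prompt needs.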

What we are going to do now is add a chain on top that generates a list of sub-questions to feed the main chain

We need an additional prompt to generate such questions. Inspiration taken from [here](https://github.com/assafelovic/gpt-researcher/blob/master/gpt_researcher/master/prompts.py). See also the PromptTemplate guide [here](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/).

In [13]:
search_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an AI Research assistant. Your name is {agent_name}"),
        ("user",
         "Write 3 google search queries to search online that form an objective opinion from "
         "the following {question}\n"
         "You must respond with a list of strings in the following format:"
         '["query 1", "query 2", "query 3"].'
        )
    ]
)

search_questions_chain = search_prompt | ChatOpenAI(model=model_name) 

Test it to see what the output is so that you know how to parse it within the same chain

In [46]:
search_questions_chain.invoke(
    {
        "agent_name":"El Bryan",
        "question":"what is the one piece?"
    }
)

AIMessage(content='["What are the different theories about the meaning of \'One Piece\' in the One Piece anime/manga?", "What are the most popular fan interpretations of the \'One Piece\' in the One Piece series?", "What are the possible explanations for the \'One Piece\' mystery in the One Piece storyline?"]')

The output is an AIMessage, so we first add a StrOutputParser to get its content as a plain string. Eventually we need to output a list of dictionaries

In [15]:
search_questions_chain = search_prompt | ChatOpenAI(model=model_name) | StrOutputParser()

search_questions_chain.invoke(
    {
        "agent_name":"El Bryan",
        "question":"what is the one piece?"
    }
)

'["What is the plot of One Piece anime/manga?", "One Piece: reviews and critiques", "Is One Piece worth watching?"]'

We first need to convert the string output, which contains an embedded list, into an actual list. This can be achieved with [json.loads()](https://www.geeksforgeeks.org/python-difference-between-json-load-and-json-loads/)

Notice that in order to do so, the prompt instructions must put <b>"</b> around each item inside the embedded list, since JSON only accepts double-quoted strings; the <b>'</b> belongs only at the ends of the Python string. Otherwise you'll get an error

In [42]:
json.loads('["One Piece anime review", "One Piece manga vs anime comparison", "One Piece fan discussion"]')

['One Piece anime review',
 'One Piece manga vs anime comparison',
 'One Piece fan discussion']

This will produce an error, because the items are single-quoted

In [43]:
json.loads("['a', 'b']")

JSONDecodeError: Expecting value: line 1 column 2 (char 1)
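
Since the model may not always follow the quoting instruction, one option is a more forgiving parser. The sketch below is an assumption, not part of the original chain: `parse_query_list` is a hypothetical helper that tries `json.loads` first and falls back to `ast.literal_eval`, which accepts single-quoted Python literals.

```python
import ast
import json


def parse_query_list(raw: str) -> list[str]:
    """Parse a model reply that should contain a list of strings."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax, which accepts single quotes.
        parsed = ast.literal_eval(raw)
    if not isinstance(parsed, list) or not all(isinstance(q, str) for q in parsed):
        raise ValueError(f"expected a list of strings, got: {raw!r}")
    return parsed


print(parse_query_list('["query 1", "query 2"]'))
print(parse_query_list("['a', 'b']"))
```

A function like this could take the place of `json.loads` at the end of the chain, at the cost of accepting input that is not strictly JSON.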

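Now we can append json.loads to the chain to convert the output into a list of strings. Note that the call below leaves out <b>agent_name</b>, which the system prompt requires, so it fails with a KeyError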

In [13]:
search_questions_chain = search_prompt | ChatOpenAI(model=model_name) | StrOutputParser() | json.loads

search_questions_chain.invoke(
    {
        # "agent_name":"El Bryan",
        "question":"what is the difference between the UEFA Champions and Europe leagues?"
    }
)

KeyError: 'agent_name'
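
The error comes from a `{agent_name}` placeholder that received no value. As a rough way to catch this before invoking a chain, the sketch below lists the placeholders in a template using the standard library and compares them with the supplied inputs. `required_variables` and `missing_variables` are hypothetical helpers, and the user template is shortened for illustration.

```python
from string import Formatter


def required_variables(template: str) -> set[str]:
    # Formatter().parse yields (literal, field_name, format_spec, conversion).
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_variables(templates: list[str], inputs: dict) -> set[str]:
    needed = set().union(*(required_variables(t) for t in templates))
    return needed - set(inputs)


system_template = "You are an AI Research assistant. Your name is {agent_name}"
user_template = "Write 3 google search queries for the following {question}"  # shortened

print(missing_variables([system_template, user_template], {"question": "x"}))
```

An empty result means every placeholder has a value; here the set contains `agent_name`, matching the KeyError above.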

Recall that we <b>need a list of dictionaries to pass it along the chain</b> so that it can be "mapped." So we need to convert it to that.

The <b>web_url_chain</b> within the <b>question_to_summaries_chain</b> requires a <b>"question"</b> key

In [20]:
search_questions_chain = search_prompt | ChatOpenAI(model=model_name, temperature=0.3) | StrOutputParser() | json.loads | (
    lambda x : [{"question":i} for i in x]
)

search_questions_chain.invoke(
    {
        "agent_name":"El Bryan",
        "question":"what is the difference between langsmith and langchain?"
    }
)

[{'question': 'Langsmith vs Langchain comparison'},
 {'question': 'Features of Langsmith'},
 {'question': 'Features of Langchain'}]

Good. Now <b>pass it to the question_to_summaries_chain</b>. We need to map it, which in essence will invoke the question_to_summaries_chain for each dict input coming out of the search_questions_chain

In [18]:
main_chain = search_questions_chain | question_to_summaries_chain.map()

main_chain.invoke(
    {
        "agent_name":"El Bryan",
        "question":"what is the difference between langsmith and langchain?"
    }
)

[[],
 ['LangSmith is a framework built on top of LangChain. While LangChain is focused on developing LLM applications, LangSmith is a complementary platform that allows users to debug, monitor, test, evaluate, and collaborate on their LLM applications. It provides tools for debugging, testing, evaluating, and monitoring the inner workings of LLMs and AI agents. LangSmith also allows users to track and analyze traces, which are logs that show the text input and output of LLMs. It helps ensure the quality and reliability of AI outputs and has been instrumental in improving user experience. The author of the article states that they are putting a lot of trust in LangSmith and using it extensively for prototyping and debugging.',
  'The text mentions that LangSmith is a platform for building production-grade LLM applications, while LangChain is an open-source framework for building with LLMs. LangSmith seamlessly integrates with LangChain, and both are used in the development of OpenGPTs.'

This is a list of lists. Now that we have it, all we have to do is pass it to a final prompt to summarize everything
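
Before the nested result can go into a single prompt, it has to be flattened into one block of text. A minimal sketch, assuming a hypothetical helper `collapse_summaries` and shortened example summaries; note that it skips empty inner lists such as the first one in the output above.

```python
def collapse_summaries(list_of_lists: list[list[str]]) -> str:
    # Flatten the nested lists, dropping empty inner lists and blank strings.
    flat = [s.strip() for inner in list_of_lists for s in inner if s.strip()]
    # Separate summaries with a blank line so the final prompt stays readable.
    return "\n\n".join(flat)


nested = [
    [],
    ["LangSmith is a platform.", "LangChain is a framework."],
    ["They integrate."],
]
research_text = collapse_summaries(nested)
print(research_text)
```

The resulting string can then be passed as a single variable to the final report prompt.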