# Deep Search of Wikipedia

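Given a query, this notebook shows how to carry out a deep search of Wikipedia: search for candidate pages, rank them by relevance, extract the relevant text, synthesize an answer, and then repeat the process for follow-up questions.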

In [2]:
#!pip install --quiet pydantic_ai wikipedia

In [1]:
import os
from dotenv import load_dotenv
MODEL_ID = "gemini-2.0-flash"
load_dotenv("../keys.env")
# .get() ensures a missing key triggers the assertion message instead of a KeyError
assert os.environ.get("GEMINI_API_KEY", "")[:2] == "AI",\
       "Please specify the GEMINI_API_KEY access token in keys.env file"

In [3]:
from pydantic_ai import Agent

In [4]:
from dataclasses import dataclass
from typing import Optional
@dataclass
class WikipediaPage:
  title: str
  url: str
  relevant_text: Optional[str] = None  # filled in later by add_relevant_text

In [5]:
# prompt: write code to search wikipedia for a topic and return titles and urls of the pages found
def search_wikipedia(query: str):
  import wikipedia
  wikipedia.set_lang("en")
  results = wikipedia.search(query)
  pages = []
  for title in results:
    try:
      page = wikipedia.page(title)
      pages.append(WikipediaPage(title=page.title,
                                 url=page.url))
    except wikipedia.exceptions.DisambiguationError as e:
      print(f"Disambiguation error for '{title}': {e}")
      # Handle disambiguation, e.g., choose the first option or skip
      # Here, we're skipping the disambiguation pages
      continue
    except wikipedia.exceptions.PageError:
      print(f"Page not found for '{title}'")
      continue
  return pages

# Example usage
query = "What were the causes of the Liberian civil war?"
pages = search_wikipedia(query)
pages


Page not found for 'Burundian Civil War'


[WikipediaPage(title='First Liberian Civil War', url='https://en.wikipedia.org/wiki/First_Liberian_Civil_War', relevant_text=None),
 WikipediaPage(title='Liberians United for Reconciliation and Democracy', url='https://en.wikipedia.org/wiki/Liberians_United_for_Reconciliation_and_Democracy', relevant_text=None),
 WikipediaPage(title='Americo-Liberian people', url='https://en.wikipedia.org/wiki/Americo-Liberian_people', relevant_text=None),
 WikipediaPage(title='History of Liberia', url='https://en.wikipedia.org/wiki/History_of_Liberia', relevant_text=None),
 WikipediaPage(title='List of war crimes', url='https://en.wikipedia.org/wiki/List_of_war_crimes', relevant_text=None),
 WikipediaPage(title='Timeline of events leading to the American Civil War', url='https://en.wikipedia.org/wiki/Timeline_of_events_leading_to_the_American_Civil_War', relevant_text=None),
 WikipediaPage(title='World War I', url='https://en.wikipedia.org/wiki/World_War_I', relevant_text=None),
 WikipediaPage(title='
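
The search function above skips disambiguation pages, and its comment mentions choosing the first option as an alternative. A minimal sketch of that alternative follows. It assumes the candidate titles are available as a list of strings (in the `wikipedia` package they appear to be exposed as `e.options` on `DisambiguationError`; check this against your installed version). The helper name `pick_disambiguation_option` is hypothetical, and taking the first plausible title is a naive heuristic rather than a relevance judgment.

```python
from typing import Optional

def pick_disambiguation_option(options: list[str]) -> Optional[str]:
    """Return the first option that looks like an article title, else None."""
    for option in options:
        lowered = option.lower()
        # Skip index-style entries such as "All pages with titles containing ..."
        if lowered.startswith("all pages with titles"):
            continue
        # Skip nested disambiguation pages, which would raise the same error again
        if lowered.endswith("(disambiguation)"):
            continue
        return option
    return None

# Example with made-up options in the shape the library prints
sample_options = [
    "All pages with titles containing Mercury",
    "Mercury (disambiguation)",
    "Mercury (planet)",
    "Mercury (element)",
]
print(pick_disambiguation_option(sample_options))
```

The returned title could then be passed to `wikipedia.page(...)` inside the `except` branch, itself wrapped in a `try` block because the retry can also fail.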

In [6]:
import nest_asyncio
nest_asyncio.apply()

def rank_pages(query: str, pages: list[WikipediaPage]) -> list[WikipediaPage]:
  agent = Agent(model=MODEL_ID, result_type=list[WikipediaPage])
  prompt = f"""Rank these Wikipedia pages by relevance to the query: "{query}".
  Pages: {pages}"""
  response = agent.run_sync(prompt)
  return response.data

# Example usage
ranked_pages = rank_pages(query, pages)[:3] # top 3
ranked_pages

[WikipediaPage(title='First Liberian Civil War', url='https://en.wikipedia.org/wiki/First_Liberian_Civil_War', relevant_text=None),
 WikipediaPage(title='History of Liberia', url='https://en.wikipedia.org/wiki/History_of_Liberia', relevant_text=None),
 WikipediaPage(title='Americo-Liberian people', url='https://en.wikipedia.org/wiki/Americo-Liberian_people', relevant_text=None)]
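
Because the ranking is produced by a language model, the returned list could in principle contain pages that were never in the input, altered URLs, or duplicates. A small guard, sketched here under the assumption that the URL is a reliable identity for a page, keeps only ranked entries that match a candidate and preserves the model's order. The helper name `keep_known_pages` is hypothetical and the dataclass is repeated only to make the block self-contained.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WikipediaPage:
    title: str
    url: str
    relevant_text: Optional[str] = None

def keep_known_pages(ranked: list[WikipediaPage],
                     candidates: list[WikipediaPage]) -> list[WikipediaPage]:
    """Drop ranked pages whose URL was not among the candidates; remove duplicates."""
    known = {page.url: page for page in candidates}
    seen = set()
    result = []
    for page in ranked:
        if page.url in known and page.url not in seen:
            seen.add(page.url)
            result.append(known[page.url])  # reuse the original object
    return result

# Example: the second ranked entry is not one of the candidates
candidates = [
    WikipediaPage("A", "https://en.wikipedia.org/wiki/A"),
    WikipediaPage("B", "https://en.wikipedia.org/wiki/B"),
]
ranked = [
    WikipediaPage("B", "https://en.wikipedia.org/wiki/B"),
    WikipediaPage("C", "https://en.wikipedia.org/wiki/C"),
    WikipediaPage("A", "https://en.wikipedia.org/wiki/A"),
    WikipediaPage("B", "https://en.wikipedia.org/wiki/B"),
]
print([p.title for p in keep_known_pages(ranked, candidates)])
```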

In [7]:
def add_relevant_text(query: str, page: WikipediaPage):
  agent = Agent(model=MODEL_ID, result_type=str)
  prompt = f"""
  Read {page.url} and extract the text relevant to the following query.
  Return only the relevant text without any preamble.
  {query}
  """
  response = agent.run_sync(prompt)
  page.relevant_text = response.data

# Example usage
for page in ranked_pages:
  add_relevant_text(query, page)
ranked_pages

[WikipediaPage(title='First Liberian Civil War', url='https://en.wikipedia.org/wiki/First_Liberian_Civil_War', relevant_text="The causes of the First Liberian Civil War are complex and multi-faceted.\n\n**Historical Factors:**\n\n*   **Socio-economic disparities:** The Americo-Liberian elite, descendants of freed American slaves, had historically dominated the country's political and economic life, marginalizing the indigenous population. This created deep resentment and inequality.\n*   **Political exclusion:** The True Whig Party held power for over a century, effectively creating a one-party state that suppressed dissent and limited political participation for non-Americo-Liberians.\n*   **Samuel Doe's coup:** Samuel Doe, a member of the Krahn ethnic group, seized power in a 1980 coup, ending Americo-Liberian dominance. However, his regime became increasingly authoritarian and corrupt, favoring his own ethnic group and alienating others.\n\n**Immediate Triggers:**\n\n*   **Economic 
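
The prompt above passes only the page URL. Unless the model is configured with a browsing tool, it may answer from what it already knows rather than from the live page. One alternative, sketched here as an assumption and not as the notebook's method, is to place the article text directly in the prompt. The text could come from `wikipedia.page(title).content`; the helper name `build_extraction_prompt` and the `max_chars` limit of 20000 are hypothetical choices.

```python
def build_extraction_prompt(query: str, page_text: str, max_chars: int = 20000) -> str:
    """Build an extraction prompt that embeds a truncated copy of the article text."""
    # Truncate so very long articles do not exceed the model's context budget
    excerpt = page_text[:max_chars]
    return (
        "Extract the text relevant to the following query from the article below.\n"
        "Return only the relevant text without any preamble.\n\n"
        f"Query:\n{query}\n\n"
        f"Article:\n{excerpt}\n"
    )

# Example with a stand-in article
prompt = build_extraction_prompt(
    "What were the causes of the Liberian civil war?",
    "Stand-in article text. " * 3,
    max_chars=40,
)
print(prompt)
```

Simple truncation keeps the start of the article, which tends to hold the summary; splitting the article into chunks and extracting from each would cover later sections at the cost of more model calls.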

In [8]:
def synthesize_answer(query: str, pages: list[WikipediaPage]) -> str:
  agent = Agent(model=MODEL_ID, result_type=str)
  prompt = f"""
  Answer the following query based on the given information.
  Query:
  {query}

  Relevant information:
  {[page.relevant_text for page in pages]}
  """
  response = agent.run_sync(prompt)
  return response.data

# Example usage
answer = synthesize_answer(query, ranked_pages)
print(answer)

The causes of the First Liberian Civil War are complex and multi-faceted, stemming from historical factors, immediate triggers, and regional instability. Key causes include:

*   **Socio-economic disparities:** The Americo-Liberian elite historically dominated Liberia's political and economic life, marginalizing the indigenous population and creating deep resentment.
*   **Political exclusion:** The True Whig Party's long-standing monopoly on power, coupled with political repression and the suppression of dissent, fueled discontent and limited political participation for non-Americo-Liberians.
*   **Samuel Doe's coup and subsequent rule:** While initially welcomed by some, Doe's 1980 coup and subsequent authoritarian rule, favoritism towards his Krahn ethnic group, corruption, and persecution of other groups further exacerbated tensions.
*   **Charles Taylor's rebellion/invasion:** Charles Taylor's armed rebellion/invasion in 1989, backed by disgruntled Liberians and foreign mercenarie
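
In the synthesis prompt above, the f-string interpolates a Python list, so the model receives the list's `repr`: brackets, quotes, and newlines escaped as `\n`. Models often cope with this, but a plainer layout may be easier to read and to attribute. A sketch of one such layout, with the hypothetical helper name `format_sources` and the dataclass repeated for self-containment:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WikipediaPage:
    title: str
    url: str
    relevant_text: Optional[str] = None

def format_sources(pages: list[WikipediaPage]) -> str:
    """Render each page's extracted text as a numbered, titled block."""
    blocks = []
    for i, page in enumerate(pages, start=1):
        if not page.relevant_text:
            continue  # skip pages where extraction produced nothing
        blocks.append(f"[{i}] {page.title}\n{page.relevant_text.strip()}")
    return "\n\n".join(blocks)

# Example with placeholder text
demo_pages = [
    WikipediaPage("Page one", "https://example.org/1", "First line.\nSecond line.\n"),
    WikipediaPage("Page two", "https://example.org/2"),
    WikipediaPage("Page three", "https://example.org/3", "Only line."),
]
print(format_sources(demo_pages))
```

The result could replace the list expression under "Relevant information:" in the prompt.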

In [9]:
def identify_gaps_and_followups(query: str, answer: str) -> list[str]:
  agent = Agent(model=MODEL_ID, result_type=list[str])
  prompt = f"""
  You are provided a question and an answer.
  Suggest 2-3 follow-on questions that could help flesh out the answer
  or fill logical or information gaps in the answer.

  Query:
  {query}

  Answer:
  {answer}
  """
  response = agent.run_sync(prompt)
  follow_ups = response.data
  # questions = [query + " Focus your answer on: " + f for f in follow_ups]
  questions = follow_ups
  return questions


# Example usage
follow_ups = identify_gaps_and_followups(query, answer)
follow_ups

['1. How did the socio-economic disparities specifically manifest in terms of access to resources, education, and opportunities for different ethnic groups?',
 "2. In what specific ways did regional instability, particularly the conflicts in Sierra Leone and Côte d'Ivoire, contribute to the Liberian Civil War in terms of arms proliferation and fighter recruitment?",
 "3. Can you elaborate on the specific human rights abuses and acts of political repression committed by Samuel Doe's regime that fueled the rebellion?"]
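
The follow-up questions in the output above come back with list numbering ("1.", "2.", "3.") included in each string. Since each follow-up is later used as a Wikipedia search query and as a section heading, it may help to strip that prefix first. A minimal sketch, with the hypothetical helper name `clean_follow_ups`:

```python
import re

# Matches a leading "1." / "2)" style number or a "-", "*", "•" bullet
_LEADING_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")

def clean_follow_ups(follow_ups: list[str]) -> list[str]:
    """Strip list markers, drop empty strings, and remove exact duplicates."""
    cleaned = []
    for question in follow_ups:
        text = _LEADING_MARKER.sub("", question).strip()
        if text and text not in cleaned:
            cleaned.append(text)
    return cleaned

raw = ["1. How did X happen?", "2) Why did Y follow?", "- What about Z?", "How did X happen?"]
print(clean_follow_ups(raw))
```

Questions that merely begin with a number, such as a year, are left alone because the pattern requires a `.` or `)` directly after the digits.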

In [10]:
@dataclass
class Section:
  query: str
  answer: str
  sections: list['Section']

def create_section(query: str) -> Section:
  pages = search_wikipedia(query)
  ranked_pages = rank_pages(query, pages)[:3] # top 3
  for page in ranked_pages:
    add_relevant_text(query, page)
  answer = synthesize_answer(query, ranked_pages)
  section = Section(query=query, answer=answer, sections=list())
  return section

def add_subsections(parent: Section):
  # second and subsequent iterations with a thinking stage
  follow_ups = identify_gaps_and_followups(parent.query, parent.answer)
  for follow_up in follow_ups:
    section = create_section(follow_up)
    parent.sections.append(section)

def pretty_print(report, level=1):
  print(f"<h{level}>{report.query}</h{level}>")  # close the heading tag
  print(f"{report.answer}")
  for section in report.sections:
    pretty_print(section, level+1)

def deep_search(query: str, depth: int, report=None) -> Section:
  if report is None:
    report = create_section(query)
  add_subsections(report)
  if depth > 1:
    for section in report.sections:
      deep_search(section.query, depth-1, section)
  return report
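
To see how `depth` affects the size of the report, the recursion can be run with stubs in place of the search and model calls. The sketch below mirrors the control flow of `deep_search` and assumes exactly three follow-ups per section (the prompt asks for 2-3); the `stub_*` names and `count_sections` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    query: str
    answer: str
    sections: list["Section"] = field(default_factory=list)

def stub_create_section(query: str) -> Section:
    # Stands in for search + rank + extract + synthesize
    return Section(query=query, answer=f"answer to {query}")

def stub_follow_ups(query: str, k: int = 3) -> list[str]:
    # Stands in for the model-generated follow-up questions
    return [f"{query} / follow-up {i}" for i in range(1, k + 1)]

def deep_search_stub(query: str, depth: int, report=None) -> Section:
    if report is None:
        report = stub_create_section(query)
    for follow_up in stub_follow_ups(report.query):
        report.sections.append(stub_create_section(follow_up))
    if depth > 1:
        for section in report.sections:
            deep_search_stub(section.query, depth - 1, section)
    return report

def count_sections(report: Section) -> int:
    return 1 + sum(count_sections(s) for s in report.sections)

for depth in (1, 2, 3):
    print(depth, count_sections(deep_search_stub("root", depth)))
# prints: 1 4, then 2 13, then 3 40
```

The section count grows geometrically (1 + 3 + 9 + ...), and in the real pipeline every section involves one search plus several model calls, so each extra level of depth roughly triples the cost.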

In [11]:
report = deep_search(query="What were some of the famous victories of Napoleon Bonaparte?", depth=1)
pretty_print(report)

Page not found for 'Napoleon III'
Page not found for 'Napoleon (2023 film)'
Page not found for 'War and Peace'
Page not found for 'Russian Empire'




Disambiguation error for 'George B. McClellan': "George McClellan (disambiguation)" may refer to: 
George McClellan (physician)
George McClellan (New York politician)
George B. McClellan Jr.
George McClellan (police officer)
George McClellan (anatomy professor)
George B. McClellan (fireboat)
General George B. McClellan (Ellicott)
All pages with titles containing George McClellan
McClellan (disambiguation)
Page not found for 'Louis XIV'
<h1>What were some of the famous victories of Napoleon Bonaparte?<h1>
Based on the provided information, some of Napoleon Bonaparte's famous victories include:

*   Siege of Toulon (1793)
*   13 Vendémiaire (1795)
*   Battle of Montenotte (1796)
*   Battle of Lodi (1796)
*   Battle of Arcole (1796)
*   Battle of Rivoli (1797)
*   Battle of the Pyramids (1798)
*   Battle of Marengo (1800)
*   Battle of Ulm (1805)
*   Battle of Austerlitz (1805)
*   Battle of Jena–Auerstedt (1806)
*   Battle of Friedland (1807)
*   Battle of Wagram (1809)

<h2>What was the

In [13]:
report = deep_search(query="Why is Srinivasa Ramanujan considered one of the greatest mathematicians?", depth=2)
pretty_print(report)

Page not found for 'History of India'
Disambiguation error for 'India': "indian" may refer to: 
India
Indian people
Indian diaspora
Languages of India
Indian English
Indian cuisine
Indigenous peoples of the Americas
First Nations in Canada
Native Americans in the United States
Indigenous peoples of the Caribbean
Indigenous languages of the Americas
Indian, West Virginia
The Indians
Indian (film series)
Indian (1996 film)
Indian (2001 film)
Indians (musician)
unreleased song by Basshunter
"Indian" (song)
"Indians" (song)
The Link
Indian (card game)
Indian soap opera
Indians (play)
Indians (sculpture)
Akwesasne Indians
Cleveland Indians
Frölunda HC
Hannover Indians
Indianapolis Indians
Indios de Mayagüez
Springfield Indians
Indian Airlines
Indian Motorcycle
All pages with titles beginning with Indian
All pages with titles containing Indian
Hindustani (disambiguation)
India (disambiguation)
Indianism (disambiguation)
Indien (disambiguation)
Indo (disambiguation)
Indio (disambiguation)
Ind