# Wikibase Agent

This notebook demonstrates a simple Wikibase agent that answers questions by generating and running SPARQL queries.

## Preliminaries

### Install the latest langchain

In [1]:
!pip install /Users/i857913/Downloads/langchain-0.0.131-py3-none-any.whl

Looking in indexes: https://artifactory.concurtech.net/artifactory/api/pypi/pypi.python.org/simple, https://artifactory.concurtech.net/artifactory/api/pypi/ext-pypi-selfserve-local/simple
Processing /Users/i857913/Downloads/langchain-0.0.131-py3-none-any.whl
^C
ERROR: Operation cancelled by user

### Enable tracing if desired

In [1]:
import os
#os.environ["LANGCHAIN_HANDLER"] = "langchain"
#os.environ["LANGCHAIN_SESSION"] = "default" # Make sure this session actually exists. You can create a new session in the UI.

### API Keys

In [2]:
import configparser
config = configparser.ConfigParser()
config.read('../training/secrets.ini')
import os
import openai
os.environ.update({'OPENAI_API_KEY': config['OPENAI']['OPENAI_API_KEY']})
openai.api_key = os.getenv('OPENAI_API_KEY')
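
The cell above assumes a `secrets.ini` file with an `[OPENAI]` section that holds an `OPENAI_API_KEY` entry. A minimal sketch of that layout follows, parsed from an in-memory string so it runs without the file. The names `SAMPLE_INI` and `sample_config` and the key value are placeholders for illustration only.

```python
import configparser

# Hypothetical contents of secrets.ini; the key value is a placeholder, not a real credential.
SAMPLE_INI = """
[OPENAI]
OPENAI_API_KEY = sk-placeholder
"""

sample_config = configparser.ConfigParser()
# read_string parses INI text directly, so no file on disk is needed for this sketch.
sample_config.read_string(SAMPLE_INI)

api_key = sample_config["OPENAI"]["OPENAI_API_KEY"]
print(api_key)
```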

# Tools

## Vocabulary lookup
Vocabulary lookup uses an Elasticsearch-backed search endpoint. Not all Wikibase instances have one, but Wikidata does, and that's where we'll start.

In [3]:
from typing import Any, Dict, List, Optional, Union

def get_nested_value(nested_dict: Dict[str, Any], path: List[str]) -> Optional[Any]:
    current = nested_dict
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current
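
As a quick illustration of how the helper behaves, the sketch below repeats its definition so that it runs on its own and applies it to a small hand-written dictionary. The `sample` data is made up for the example and only mimics the shape of a search response.

```python
from typing import Any, Dict, List, Optional

def get_nested_value(nested_dict: Dict[str, Any], path: List[str]) -> Optional[Any]:
    # Walk the path one key at a time; give up with None as soon as a key is missing.
    current = nested_dict
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

# Hand-written sample data (an assumption, not a real API response).
sample = {"query": {"search": [{"title": "Q4180017"}]}}

found = get_nested_value(sample, ["query", "search"])
missing = get_nested_value(sample, ["query", "pages"])
print(found)
print(missing)
```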



In [4]:
import requests

def vocab_lookup(search: str, entity_type: str = "item",
                 item_tag: Optional[str] = None,
                 srqiprofile: str = "classic_noboostlinks") -> Optional[str]:
    
    if item_tag is not None:
        if item_tag.startswith("q-"):
            entity_type = "item"
        elif item_tag.startswith("p-"):
            entity_type = "property"   
    
    if entity_type == "item":
        srnamespace = 0
    elif entity_type == "property":
        srnamespace = 120
    else:
        raise ValueError("entity_type must be either 'property' or 'item'")      
        
    url = "https://www.wikidata.org/w/api.php"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": search,
        "srnamespace": srnamespace,
        "srlimit": 5,
        "srqiprofile": srqiprofile,
        "srwhat": 'text',
        "format": "json"
    }
    headers = {'Accept': 'application/json'}

    response = requests.get(url, headers=headers, params=params)
        
    if response.status_code == 200:
        results = get_nested_value(response.json(), ['query', 'search'])
        if results and len(results) > 0:
            # print(results)
            # properties are returned with a prefix 'property:'
            return results[0]['title'].split(':')[-1]
        else:
            print(f"I couldn't find any {entity_type} for '{search}'. Please rephrase your request and try again")
            return ""
    else:
        # print(f"Request failed with status code {response.status_code}")
        return "Sorry, I got an error. Please try again."

In [5]:
print(vocab_lookup("Malin 1"))

Q4180017
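
Two pieces of `vocab_lookup` can be checked without a network call: the mapping from entity type to search namespace, and the clean-up of the returned title. The sketch below isolates both. The helper names `namespace_for` and `strip_namespace_prefix` are hypothetical and are not part of the notebook.

```python
def namespace_for(entity_type: str) -> int:
    """Map an entity type to the search namespace used in vocab_lookup."""
    namespaces = {"item": 0, "property": 120}
    if entity_type not in namespaces:
        raise ValueError("entity_type must be either 'property' or 'item'")
    return namespaces[entity_type]

def strip_namespace_prefix(title: str) -> str:
    """Drop a leading 'Property:' style prefix; item titles pass through unchanged."""
    return title.split(":")[-1]

print(namespace_for("property"))
print(strip_namespace_prefix("Property:P40"))
print(strip_namespace_prefix("Q1339"))
```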


## Query runner 

This tool runs SPARQL queries; by default, Wikidata is used.

In [6]:
import requests
from typing import List, Dict, Any, Optional
import json

# Returns the result bindings as a JSON string, or a fallback message / None on failure.
def run_sparql(query: str, url='https://query.wikidata.org/sparql',
              llm_mode = True) -> Optional[str]:
    headers = {
        'Accept': 'application/json',
        'User-Agent': 'AskwikiBot/1.0 (https://www.wikidata.org/wiki/User:What_Tottles_Meant)'
    }

    response = requests.get(url, headers=headers, params={'query': query, 'format': 'json'})

    if response.status_code == 200:
        results = response.json().get('results', {}).get('bindings', [])
        return json.dumps(results)
    if llm_mode:
        return "That query failed. Perhaps you could try a different one?"
    return None
        

In [9]:
run_sparql("SELECT (COUNT(?children) as ?count) WHERE { wd:Q1339 wdt:P40 ?children . }")

'[{"count": {"datatype": "http://www.w3.org/2001/XMLSchema#integer", "type": "literal", "value": "20"}}]'
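
Because `run_sparql` returns the bindings as a JSON string, a caller has to decode it before using the values. A minimal sketch, using the output shown above as a literal so that no request is made:

```python
import json

# The string returned by the cell above, copied verbatim.
raw = '[{"count": {"datatype": "http://www.w3.org/2001/XMLSchema#integer", "type": "literal", "value": "20"}}]'

bindings = json.loads(raw)
# Each binding maps a variable name to a dict; the literal itself is under "value" as a string.
count = int(bindings[0]["count"]["value"])
print(count)
```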

## Find properties for a q item

This tool is intended to be used to find the properties of an object, so the AI can decide which properties are most useful for a question. The tool works, but it returns so many properties (many of them external identifiers) that the output is not yet usable.

In [10]:
def find_properties_for_qitem(qitem: str):
    query = f"""
SELECT DISTINCT ?aLabel ?propLabel 
WHERE
{{
  wd:{qitem}  ?a ?b.
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
  ?prop wikibase:directClaim ?a 
}}
"""
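    # llm_mode=False makes run_sparql return None on failure instead of a message string
    result = run_sparql(query, llm_mode=False)
    if result is None:
        return "I'm sorry, that didn't work. Please try again"
    # save tokens!
    # print(result)
    # parse the JSON string with json.loads rather than eval
    response = [(get_nested_value(row, ['aLabel', 'value']).split('/')[-1],
                 get_nested_value(row, ['propLabel', 'value'])) for row in json.loads(result)]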

    return json.dumps(response)


In [11]:
find_properties_for_qitem("Q1339")

'[["P27", "country of citizenship"], ["P25", "mother"], ["P53", "family"], ["P109", "signature"], ["P106", "occupation"], ["P135", "movement"], ["P136", "genre"], ["P39", "position held"], ["P40", "child"], ["P108", "employer"], ["P119", "place of burial"], ["P213", "ISNI"], ["P244", "Library of Congress authority ID"], ["P227", "GND ID"], ["P26", "spouse"], ["P103", "native language"], ["P69", "educated at"], ["P158", "seal image"], ["P31", "instance of"], ["P214", "VIAF ID"], ["P140", "religion or worldview"], ["P1017", "Vatican Library ID (former scheme)"], ["P1280", "CONOR.SI ID"], ["P1368", "National Library of Latvia ID"], ["P2342", "AGORHA person/institution ID"], ["P1299", "depicted by"], ["P1005", "Portuguese National Library author ID"], ["P989", "spoken text audio"], ["P1559", "name in native language"], ["P2163", "FAST ID"], ["P2089", "Library of Congress JukeBox ID (former scheme)"], ["P2188", "BiblioNet author ID"], ["P2338", "Musopen composer ID"], ["P2605", "\\u010cSFD 

It would be nice to use something like this, but this doesn't work for now. 
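
One possible way to cut the noise, offered only as a sketch and not as something the notebook does, is to drop properties whose label looks like an external identifier. The helper `drop_identifier_properties` and its " ID" heuristic are assumptions; the sample rows are taken from the output above.

```python
import json

# A few rows copied from the output above.
sample = json.dumps([
    ["P27", "country of citizenship"],
    ["P25", "mother"],
    ["P213", "ISNI"],
    ["P244", "Library of Congress authority ID"],
    ["P40", "child"],
    ["P1017", "Vatican Library ID (former scheme)"],
])

def drop_identifier_properties(properties_json: str) -> str:
    """Hypothetical filter: remove rows whose label contains ' ID'."""
    pairs = json.loads(properties_json)
    kept = [[pid, label] for pid, label in pairs if " ID" not in label]
    return json.dumps(kept)

filtered = drop_identifier_properties(sample)
print(filtered)
```

The heuristic is crude: identifiers whose label lacks " ID", such as "ISNI", still pass through.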


# Sparql Translation Agent

## Wrap the tools

In [26]:
from langchain.agents import Tool, AgentExecutor, LLMSingleActionAgent, AgentOutputParser
from langchain.prompts import StringPromptTemplate
from langchain import OpenAI, LLMChain
from typing import List, Union
from langchain.schema import AgentAction, AgentFinish
import re

In [62]:
# Define which tools the agent can use to answer user queries
tools = [
    Tool(
        name = "ItemLookup",
        func=(lambda x: vocab_lookup(x, entity_type="item")),
        description="useful for when you need to know the q-number for an item"
    ),
    Tool(
        name = "PropertyLookup",
        func=(lambda x: vocab_lookup(x, entity_type="property")),
        description="useful for when you need to know the p-number for a property"
    ),
    Tool(
        name = "SparqlQueryRunner",
        func=run_sparql,
        description="useful for getting results from a wikibase"
    )    
]

## Prompts

In [63]:
# Set up the base template
template = """
Answer the following questions by running a sparql query against a wikibase where the p and q items are 
completely unknown to you. You will need to discover the p and q items before you can generate the sparql.
Do not assume you know the p and q items for any concepts. Always use tools to find all p and q items.
After you generate the sparql, you should run it. The results will be returned in json. 
Summarize the json results in natural language.

You may assume the following prefixes:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>

You have access to the following tools:

{tools}

Use the following format:

Question: the input question for which you must provide a natural language answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Question: {input}
{agent_scratchpad}"""


In [64]:
# Set up a prompt template
class CustomPromptTemplate(StringPromptTemplate):
    # The template to use
    template: str
    # The list of tools available
    tools: List[Tool]
    
    def format(self, **kwargs) -> str:
        # Get the intermediate steps (AgentAction, Observation tuples)
        # Format them in a particular way
        intermediate_steps = kwargs.pop("intermediate_steps")
        thoughts = ""
        for action, observation in intermediate_steps:
            thoughts += action.log
            thoughts += f"\nObservation: {observation}\nThought: "
        # Set the agent_scratchpad variable to that value
        kwargs["agent_scratchpad"] = thoughts
        # Create a tools variable from the list of tools provided
        kwargs["tools"] = "\n".join([f"{tool.name}: {tool.description}" for tool in self.tools])
        # Create a list of tool names for the tools provided
        kwargs["tool_names"] = ", ".join([tool.name for tool in self.tools])
        return self.template.format(**kwargs)
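
The scratchpad logic in `format` can be tried without LangChain. The sketch below reproduces just the loop, taking plain `(log, observation)` string pairs instead of `AgentAction` objects. The function name `build_scratchpad` and the sample step are illustrative assumptions.

```python
from typing import List, Tuple

def build_scratchpad(steps: List[Tuple[str, str]]) -> str:
    """Concatenate each step's log with its observation, ending on an open 'Thought: '."""
    thoughts = ""
    for action_log, observation in steps:
        thoughts += action_log
        thoughts += f"\nObservation: {observation}\nThought: "
    return thoughts

# One hand-written step, modelled on the agent's ItemLookup call.
steps = [
    ("Thought: I need the q-number for J.S. Bach\nAction: ItemLookup\nAction Input: J.S. Bach", "Q1339"),
]
scratchpad = build_scratchpad(steps)
print(scratchpad)
```

Ending the scratchpad with `Thought: ` prompts the model to continue reasoning from the latest observation.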

In [65]:
prompt = CustomPromptTemplate(
    template=template,
    tools=tools,
    # This omits the `agent_scratchpad`, `tools`, and `tool_names` variables because those are generated dynamically
    # This includes the `intermediate_steps` variable because that is needed
    input_variables=["input", "intermediate_steps"]
)

## Output parser 
This is unchanged from the LangChain docs.

In [66]:
class CustomOutputParser(AgentOutputParser):
    
    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # Check if agent should finish
        if "Final Answer:" in llm_output:
            return AgentFinish(
                # Return values is generally always a dictionary with a single `output` key
                # It is not recommended to try anything else at the moment :)
                return_values={"output": llm_output.split("Final Answer:")[-1].strip()},
                log=llm_output,
            )
        # Parse out the action and action input
        regex = r"Action: (.*?)[\n]*Action Input:[\s]*(.*)"
        match = re.search(regex, llm_output, re.DOTALL)
        if not match:
            raise ValueError(f"Could not parse LLM output: `{llm_output}`")
        action = match.group(1).strip()
        action_input = match.group(2)
        # Return the action and action input
        return AgentAction(tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output)
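
To see what the regular expression extracts, the sketch below applies the same pattern and the same stripping to a hand-written model output, using only the standard library. The function `parse_action` is a hypothetical stand-in that returns a tuple rather than an `AgentAction`.

```python
import re
from typing import Optional, Tuple

# Same pattern as in CustomOutputParser.parse.
ACTION_REGEX = r"Action: (.*?)[\n]*Action Input:[\s]*(.*)"

def parse_action(llm_output: str) -> Optional[Tuple[str, str]]:
    """Return (tool name, tool input), or None when the text has no action."""
    match = re.search(ACTION_REGEX, llm_output, re.DOTALL)
    if not match:
        return None
    action = match.group(1).strip()
    action_input = match.group(2).strip(" ").strip('"')
    return action, action_input

sample_output = (
    "Thought: I need to find the q-number for J.S. Bach\n"
    "Action: ItemLookup\n"
    'Action Input: "J.S. Bach"'
)
parsed = parse_action(sample_output)
print(parsed)
```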

In [67]:
output_parser = CustomOutputParser()

## Specify the LLM model

In [68]:
#llm = ChatOpenAI(temperature=0)
llm = OpenAI(temperature=0)

## Agent and agent executor

In [69]:
# LLM chain consisting of the LLM and a prompt
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [70]:
tool_names = [tool.name for tool in tools]
agent = LLMSingleActionAgent(
    llm_chain=llm_chain, 
    output_parser=output_parser,
    stop=["\nObservation:"], 
    allowed_tools=tool_names
)

In [71]:
agent_executor = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, verbose=True)

## Run it!

In [53]:
# Double-secret verbose llm tracing!
# agent_executor.agent.llm_chain.verbose = False

In [75]:
agent_executor.run("How many children did J.S. Bach have?")



> Entering new AgentExecutor chain...
Thought: I need to find the q-number for J.S. Bach and the p-number for the property "number of children".
Action: ItemLookup
Action Input: J.S. Bach

Observation: Q1339
I need to find the p-number for the property "number of children".
Action: PropertyLookup
Action Input: number of children

Observation: P1971
I need to run a sparql query to get the answer.
Action: SparqlQueryRunner
Action Input: 
SELECT ?children WHERE {
  wd:Q1339 p:P1971 ?statement .
  ?statement ps:P1971 ?children .
}

Observation: [{"children": {"datatype": "http://www.w3.org/2001/XMLSchema#decimal", "type": "literal", "value": "20"}}]
I now know the final answer.
Final Answer: J.S. Bach had 20 children.

> Finished chain.


'J.S. Bach had 20 children.'

In [58]:
run_sparql(sparql)

'[{"children": {"datatype": "http://www.w3.org/2001/XMLSchema#decimal", "type": "literal", "value": "20"}}]'

In [59]:
sparql = agent_executor.run("What is the Basketball-Reference.com NBA player ID of Hakeem Olajuwon?")
sparql



> Entering new AgentExecutor chain...
Thought: I need to find the q-number for Hakeem Olajuwon
Action: ItemLookup
Action Input: Hakeem Olajuwon

Observation: Q273256
I need to find the p-number for Basketball-Reference.com NBA player ID
Action: PropertyLookup
Action Input: Basketball-Reference.com NBA player ID

Observation: P2685
I now know the final answer
Final Answer: SELECT ?playerID WHERE {wd:Q273256 wdt:P2685 ?playerID}

> Finished chain.


'SELECT ?playerID WHERE {wd:Q273256 wdt:P2685 ?playerID}'

In [60]:
run_sparql(sparql)

'[{"playerID": {"type": "literal", "value": "o/olajuha01"}}]'

# Q/A Agent

Add the SPARQL query runner to the toolset and adjust the prompt accordingly.

In [62]:
# Define which tools the agent can use to answer user queries
tools = [
    Tool(
        name = "ItemLookup",
        func=(lambda x: vocab_lookup(x, entity_type="item")),
        description="useful for when you need to know the q-number for an item"
    ),
    Tool(
        name = "PropertyLookup",
        func=(lambda x: vocab_lookup(x, entity_type="property")),
        description="useful for when you need to know the p-number for a property"
    ),
    Tool(
        name = "SparqlQueryRunner",
        func=run_sparql,
        description="useful for getting results from a wikibase"
    )    
]

In [63]:
# Set up the base template
template = """
Answer the following questions by running a sparql query against a wikibase where the p and q items are 
completely unknown to you. You will need to discover the p and q items before you can generate the sparql.
Do not assume you know the p and q items for any concepts. Always use tools to find all p and q items.
After you generate the sparql, you should run it. The results will be returned in json. 
Summarize the json results in natural language.

You may assume the following prefixes:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>

You have access to the following tools:

{tools}

Use the following format:

Question: the input question for which you must provide a natural language answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Question: {input}
{agent_scratchpad}"""


In [64]:
# Set up a prompt template
class CustomPromptTemplate(StringPromptTemplate):
    # The template to use
    template: str
    # The list of tools available
    tools: List[Tool]
    
    def format(self, **kwargs) -> str:
        # Get the intermediate steps (AgentAction, Observation tuples)
        # Format them in a particular way
        intermediate_steps = kwargs.pop("intermediate_steps")
        thoughts = ""
        for action, observation in intermediate_steps:
            thoughts += action.log
            thoughts += f"\nObservation: {observation}\nThought: "
        # Set the agent_scratchpad variable to that value
        kwargs["agent_scratchpad"] = thoughts
        # Create a tools variable from the list of tools provided
        kwargs["tools"] = "\n".join([f"{tool.name}: {tool.description}" for tool in self.tools])
        # Create a list of tool names for the tools provided
        kwargs["tool_names"] = ", ".join([tool.name for tool in self.tools])
        return self.template.format(**kwargs)

In [65]:
prompt = CustomPromptTemplate(
    template=template,
    tools=tools,
    # This omits the `agent_scratchpad`, `tools`, and `tool_names` variables because those are generated dynamically
    # This includes the `intermediate_steps` variable because that is needed
    input_variables=["input", "intermediate_steps"]
)

In [66]:
class CustomOutputParser(AgentOutputParser):
    
    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # Check if agent should finish
        if "Final Answer:" in llm_output:
            return AgentFinish(
                # Return values is generally always a dictionary with a single `output` key
                # It is not recommended to try anything else at the moment :)
                return_values={"output": llm_output.split("Final Answer:")[-1].strip()},
                log=llm_output,
            )
        # Parse out the action and action input
        regex = r"Action: (.*?)[\n]*Action Input:[\s]*(.*)"
        match = re.search(regex, llm_output, re.DOTALL)
        if not match:
            raise ValueError(f"Could not parse LLM output: `{llm_output}`")
        action = match.group(1).strip()
        action_input = match.group(2)
        # Return the action and action input
        return AgentAction(tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output)

In [67]:
output_parser = CustomOutputParser()

In [68]:
#llm = ChatOpenAI(temperature=0)
llm = OpenAI(temperature=0)

In [69]:
# LLM chain consisting of the LLM and a prompt
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [70]:
tool_names = [tool.name for tool in tools]
agent = LLMSingleActionAgent(
    llm_chain=llm_chain, 
    output_parser=output_parser,
    stop=["\nObservation:"], 
    allowed_tools=tool_names
)

In [71]:
agent_executor = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, verbose=True)

Note: the following cell sometimes fails, depending on which search terms the model chooses for the property lookup.

In [75]:
agent_executor.run("How many children did J.S. Bach have?")



> Entering new AgentExecutor chain...
Thought: I need to find the q-number for J.S. Bach and the p-number for the property "number of children".
Action: ItemLookup
Action Input: J.S. Bach

Observation: Q1339
I need to find the p-number for the property "number of children".
Action: PropertyLookup
Action Input: number of children

Observation: P1971
I need to run a sparql query to get the answer.
Action: SparqlQueryRunner
Action Input: 
SELECT ?children WHERE {
  wd:Q1339 p:P1971 ?statement .
  ?statement ps:P1971 ?children .
}

Observation: [{"children": {"datatype": "http://www.w3.org/2001/XMLSchema#decimal", "type": "literal", "value": "20"}}]
I now know the final answer.
Final Answer: J.S. Bach had 20 children.

> Finished chain.


'J.S. Bach had 20 children.'

Interesting to compare what the model itself thinks:

In [76]:
llm("How many children did J.S. Bach have?")


'\n\nJ.S. Bach had seven children.'