# April 6 AskWiki langchain demo

## Preliminaries

### Install the latest langchain

In [1]:
!pip install /Users/i857913/Downloads/langchain-0.0.131-py3-none-any.whl

Looking in indexes: https://artifactory.concurtech.net/artifactory/api/pypi/pypi.python.org/simple, https://artifactory.concurtech.net/artifactory/api/pypi/ext-pypi-selfserve-local/simple
Processing /Users/i857913/Downloads/langchain-0.0.131-py3-none-any.whl
langchain is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.


### Enable tracing if desired

In [1]:
import os
os.environ["LANGCHAIN_HANDLER"] = "langchain"
#os.environ["LANGCHAIN_SESSION"] = "Apr6Demo" # Make sure this session actually exists. You can create a new session in the UI.

### API Keys

In [2]:
import configparser
import os

import openai

# Read the OpenAI key from a local ini file so it never appears in the notebook
config = configparser.ConfigParser()
config.read('../training/secrets.ini')
os.environ.update({'OPENAI_API_KEY': config['OPENAI']['OPENAI_API_KEY']})
openai.api_key = os.getenv('OPENAI_API_KEY')

## Set up a simple chain (1-step)

### Prompt

In [3]:
system_prompt_template= """
You are an expert on sparql and wikibase. I have a private wikibase where the p and q items are completely unknown to you. 

If you are asked to provide a "query template json" for a user question, you should respond with a json object with these keys and values:
* 'query_template': a python-style template string. the query template should not use 'rdfs:label' to help you find items. Instead, just use a placeholder tag for each p and q item you need. Each tag should begin with "p-" or "q-" depending on whether you think you need a p or a q item
* 'vocabulary': a list of objects, one for each p or q item you need for a sparql query, each item should have these components:
** 'item_tag': a placeholder string that you make up
** 'item_label_quesses': a list of up to three different strings that you think might be used as the label for that p or q item
Remember: you do not know the actual q or p items in my wikibase, so make sure to list ALL p and q items in the 'vocabulary' section of the json.

You may assume the following prefixes:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>

Respond only with the json. Do not include any comments or explanations.

"""

### Few-shot examples

In [4]:
example1_human = """
generate a query template json for the question "Does malin 1 have a right ascension lower than 15.1398?"
"""

In [5]:
example1_ai = """
{{
  "query_template": "SELECT ?result WHERE {{ wd:q-Malin1 wdt:p-RightAscension ?right_ascension . FILTER(?right_ascension < 15.1398) BIND(xsd:boolean(?right_ascension < 15.1398) as ?result) }}",
  "vocabulary": [
    {{
      "item_tag": "q-Malin1",
      "item_label_quesses": ["Malin 1", "Malin-1", "Malin1"]
    }},
    {{
      "item_tag": "p-RightAscension",
      "item_label_quesses": ["Right ascension", "right_ascension", "RA"]
    }}
  ]
}}
"""

In [6]:
example2_human = """
How many children did J.S. Bach have?
"""

In [7]:
example2_ai = """
{{
  "query_template": "SELECT (COUNT(?children) as ?count) WHERE {{ wd:q-JSBach wdt:p-Child ?children . }}",
  "vocabulary": [
    {{
      "item_tag": "q-JSBach",
      "item_label_quesses": ["Johann Sebastian Bach", "Bach, Johann Sebastian", "J.S. Bach"]
    }},
    {{
      "item_tag": "p-Child",
      "item_label_quesses": ["Child", "Offspring", "Progeny"]
    }}
  ]
}}
"""
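
The few-shot answers above double every brace because the strings are later passed through a template formatter, which would otherwise read `{` as the start of a placeholder. The sketch below shows the same escaping with plain `str.format`, on the assumption that langchain's default template format follows the same rules; the `template` string here is a made-up example, not one of the prompts in this notebook.

```python
# Doubled braces survive formatting as literal braces; {text} is substituted.
template = '{{"question": "{text}"}}'
formatted = template.format(text="How many children did J.S. Bach have?")
print(formatted)

# Unescaped JSON braces are read as a placeholder and formatting fails.
unescaped_failed = False
try:
    '{"question": "x"}'.format()
except (KeyError, ValueError, IndexError):
    unescaped_failed = True
print(unescaped_failed)
```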

### Create the chain

In [8]:
from langchain.prompts import (
    ChatPromptTemplate,
    PromptTemplate,
    SystemMessagePromptTemplate,
    AIMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

In [9]:
template=system_prompt_template
system_message_prompt = SystemMessagePromptTemplate.from_template(template)

example_human_1 = HumanMessagePromptTemplate.from_template(example1_human)
example_ai_1 = AIMessagePromptTemplate.from_template(example1_ai)
example_human_2 = HumanMessagePromptTemplate.from_template(example2_human)
example_ai_2 = AIMessagePromptTemplate.from_template(example2_ai)

human_template="Please generate query template json for this question: {text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

### Specify the LLM model

In [10]:
from langchain.chat_models import ChatOpenAI
from langchain import OpenAI
chat = ChatOpenAI(temperature=0)
# chat = ChatOpenAI(temperature=0, model_name='gpt-4')

### Tie it all together and run it

In [11]:
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, example_human_1, example_ai_1, example_human_2, example_ai_2, 
                                                human_message_prompt])

In [12]:
from langchain import PromptTemplate, LLMChain
chain = LLMChain(llm=chat, prompt=chat_prompt)
result = chain.run('Which is the Basketball-Reference.com NBA player ID of Hakeem Olajuwon?')
print(result)

{
  "query_template": "SELECT ?BR_id WHERE { wd:q-HakeemOlajuwon wdt:p-1545 ?BR_id . }",
  "vocabulary": [
    {
      "item_tag": "q-HakeemOlajuwon",
      "item_label_quesses": ["Hakeem Olajuwon", "Olajuwon, Hakeem", "Akeem Olajuwon"]
    },
    {
      "item_tag": "p-1545",
      "item_label_quesses": ["Basketball-Reference.com NBA player ID", "BBR NBA player ID", "NBA.com player ID"]
    }
  ]
}


#### The result should be valid JSON...

In [3]:
import json

In [14]:
query_template = json.loads(result)
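
`json.loads` only confirms that the response is valid JSON, not that it has the shape the system prompt asks for. A minimal sketch of a structural check follows; `validate_query_template` is a hypothetical helper (not part of langchain), and `sample` is abbreviated from the output above.

```python
from typing import Any, Dict, List


def validate_query_template(t: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed query template (empty if none)."""
    problems = []
    query = t.get("query_template")
    if not isinstance(query, str):
        problems.append("missing or non-string 'query_template'")
        query = ""
    vocabulary = t.get("vocabulary")
    if not isinstance(vocabulary, list):
        problems.append("missing or non-list 'vocabulary'")
        return problems
    for entry in vocabulary:
        tag = entry.get("item_tag", "")
        if not tag.startswith(("p-", "q-")):
            problems.append(f"tag {tag!r} does not start with 'p-' or 'q-'")
        if not entry.get("item_label_quesses"):
            problems.append(f"tag {tag!r} has no label guesses")
        if tag and tag not in query:
            problems.append(f"tag {tag!r} is not used in the query template")
    return problems


sample = {
    "query_template": "SELECT ?BR_id WHERE { wd:q-HakeemOlajuwon wdt:p-1545 ?BR_id . }",
    "vocabulary": [
        {"item_tag": "q-HakeemOlajuwon", "item_label_quesses": ["Hakeem Olajuwon"]},
        {"item_tag": "p-1545", "item_label_quesses": ["Basketball-Reference.com NBA player ID"]},
    ],
}
print(validate_query_template(sample))  # prints []
```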

## Code to look up vocabulary and resolve the template

In [4]:
from typing import Any, Dict, List, Optional, Union

def get_nested_value(nested_dict: Dict[str, Any], path: List[str]) -> Optional[Any]:
    """
    Retrieves a value from a nested dictionary structure using a list of keys as a path.

    :param nested_dict: The nested dictionary structure to traverse.
    :param path: A list of strings representing the keys to traverse in order to access the desired value.
    :return: The value at the end of the path if it exists, or None if any key in the path is not found.
    """
    current = nested_dict
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

### Vocab lookup uses the Wikidata search API (Elasticsearch-backed)

In [5]:
import requests
from typing import Optional

def vocab_lookup(search: str, entity_type: str = "item",
                 item_tag: Optional[str] = None,
                 srqiprofile: str = "classic") -> Optional[str]:
    """Return the id of the best-matching item or property for a label."""
    # A tag prefix, when given, overrides entity_type
    if item_tag is not None:
        if item_tag.startswith("q-"):
            entity_type = "item"
        elif item_tag.startswith("p-"):
            entity_type = "property"

    # Items are searched in namespace 0, properties in namespace 120
    if entity_type == "item":
        srnamespace = 0
    elif entity_type == "property":
        srnamespace = 120
    else:
        raise ValueError("entity_type must be either 'property' or 'item'")

    url = "https://www.wikidata.org/w/api.php"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": search,
        "srnamespace": srnamespace,
        "srlimit": 5,
        "srqiprofile": srqiprofile,
        "format": "json"
    }
    headers = {'Accept': 'application/json'}

    response = requests.get(url, headers=headers, params=params)

    if response.status_code == 200:
        results = get_nested_value(response.json(), ['query', 'search'])
        if results:
            # properties are returned with a prefix 'Property:'
            return results[0]['title'].split(':')[-1]
        else:
            print(f"I couldn't find any {entity_type} for '{search}'. Please rephrase your request and try again")
            return ""
    else:
        # print(f"Request failed with status code {response.status_code}")
        return "Sorry, I got an error. Please try again."
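
To see what the lookup extracts without calling the network, the sketch below runs the same title handling on canned payloads. `first_entity_id` is a hypothetical helper, and the payloads keep only the fields that `vocab_lookup` reads (the real API returns more per hit).

```python
from typing import Any, Dict, Optional

# Canned payloads in the shape vocab_lookup reads: query -> search -> title.
property_response: Dict[str, Any] = {
    "query": {"search": [{"ns": 120, "title": "Property:P2685"}]}
}
item_response: Dict[str, Any] = {
    "query": {"search": [{"ns": 0, "title": "Q273256"}]}
}


def first_entity_id(payload: Dict[str, Any]) -> Optional[str]:
    """Return the bare P or Q id of the first search hit, or None if there are no hits."""
    hits = payload.get("query", {}).get("search", [])
    if not hits:
        return None
    # Property titles look like 'Property:P2685'; item titles are just 'Q273256'.
    return hits[0]["title"].split(":")[-1]


print(first_entity_id(property_response))  # prints P2685
print(first_entity_id(item_response))      # prints Q273256
print(first_entity_id({"query": {"search": []}}))  # prints None
```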

### Resolve the query template

In [6]:
def resolve_query_template(t):
    vocab_items = [(v["item_tag"], v["item_label_quesses"]) for v in t["vocabulary"]]
    # since this is not a proper template, we use "replace" below,
    # so we need to be careful: some tags may be prefixes of others.
    # Sorting by tag length, longest first, replaces a longer tag before its prefix
    vocab_items = sorted(vocab_items, key=lambda x: len(x[0]), reverse=True)
    # only the first label guess for each tag is looked up
    resolved_vocabulary = {item_tag: vocab_lookup(item_label_quesses[0], item_tag=item_tag) for item_tag, item_label_quesses in vocab_items}
    query = t['query_template']
    for item_tag, item_id in resolved_vocabulary.items():
        query = query.replace(item_tag, item_id)
    return query

In [18]:
resolve_query_template(query_template)

'SELECT ?BR_id WHERE { wd:Q273256 wdt:P2685 ?BR_id . }'
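
The prefix concern mentioned in the comments of `resolve_query_template` can be reproduced offline. The sketch below contrasts naive replacement in an unlucky order with a single-pass regex substitution; `substitute_tags` is a hypothetical alternative, and the ids `P1`, `P2`, `Q3` are made up for illustration.

```python
import re
from typing import Dict


def substitute_tags(query: str, resolved: Dict[str, str]) -> str:
    """Replace every placeholder tag in one pass, preferring the longest match."""
    if not resolved:
        return query
    # Longest tags first so that 'p-ChildOf' wins over its prefix 'p-Child'.
    tags = sorted(resolved, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(tag) for tag in tags))
    return pattern.sub(lambda m: resolved[m.group(0)], query)


# Made-up ids, for illustration only.
resolved = {"p-Child": "P1", "p-ChildOf": "P2", "q-JSBach": "Q3"}
template = "SELECT ?x WHERE { wd:q-JSBach wdt:p-ChildOf ?y . ?y wdt:p-Child ?x . }"

# Replacing the shorter tag first corrupts the longer one ('P1Of').
naive = template
for tag, entity_id in resolved.items():
    naive = naive.replace(tag, entity_id)
print(naive)

safe = substitute_tags(template, resolved)
print(safe)
```

Because the regex pass never rescans text it has already substituted, it also avoids the case where a resolved id happens to contain another tag.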

### The next steps would be to run the query, parse the results and summarize, but ...

## langchain agent uses the LLM to orchestrate the integration steps

### Query runner 

In [7]:
import json
import requests

def run_sparql(query: str) -> str:
    """Run a SPARQL query and return the result bindings as a JSON string."""
    url = 'https://query.wikidata.org/sparql'
    headers = {
        'Accept': 'application/json',
        'User-Agent': 'AskwikiBot/1.0 (https://www.wikidata.org/wiki/User:What_Tottles_Meant)'
    }

    response = requests.get(url, headers=headers, params={'query': query, 'format': 'json'})

    if response.status_code == 200:
        results = response.json().get('results', {}).get('bindings', [])
        # the agent consumes tool output as text, so serialize the bindings
        return json.dumps(results)
    else:
        # print(f"Request failed with status code {response.status_code}")
        return "That query failed. Perhaps you could try a different one?"
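
On success `run_sparql` returns the raw bindings as a JSON string, where each variable maps to an object with `type` and `value` fields. If plain rows are easier to work with, something like the following sketch could flatten them; `flatten_bindings` is a hypothetical helper, and `sample` is a hand-written string in the same shape.

```python
import json
from typing import Any, Dict, List


def flatten_bindings(bindings_json: str) -> List[Dict[str, Any]]:
    """Turn SPARQL JSON bindings into plain {variable: value} rows."""
    rows = []
    for binding in json.loads(bindings_json):
        rows.append({var: cell.get("value") for var, cell in binding.items()})
    return rows


sample = (
    '[{"children": {"datatype": "http://www.w3.org/2001/XMLSchema#decimal", '
    '"type": "literal", "value": "20"}}]'
)
rows = flatten_bindings(sample)
print(rows)  # prints [{'children': '20'}]
```

Values stay strings here; converting them according to `datatype` is left out of the sketch.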


### Wrap the tools

In [8]:
from langchain.agents import Tool, AgentExecutor, LLMSingleActionAgent, AgentOutputParser
from langchain.prompts import StringPromptTemplate
from langchain import OpenAI, SerpAPIWrapper, LLMChain
from typing import List, Union
from langchain.schema import AgentAction, AgentFinish
import re

In [9]:
# Define which tools the agent can use to answer user queries
tools = [
    Tool(
        name = "ItemLookup",
        func=(lambda x: vocab_lookup(x, entity_type="item")),
        description="useful for when you need to know the q-number for an item"
    ),
    Tool(
        name = "PropertyLookup",
        func=(lambda x: vocab_lookup(x, entity_type="property")),
        description="useful for when you need to know the p-number for a property"
    ),
    Tool(
        name = "SparqlQueryRunner",
        func=run_sparql,
        description="useful for getting results from a wikibase"
    )    
]

### Templates

#### This template is adapted from langchain docs

In [10]:
# Set up the base template
template = """
Answer the following questions by running a sparql query against a wikibase where the p and q items are 
completely unknown to you. You will need to discover the p and q items before you can generate the sparql.
Do not assume you know the p and q items for any concepts. Always use tools to find all p and q items.
After you generate the sparql, you should run it. The results will be returned in json. 
Summarize the json results in natural language.

You may assume the following prefixes:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>

You have access to the following tools:

{tools}

Use the following format:

Question: the input question for which you must provide a natural language answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Question: {input}
{agent_scratchpad}"""


In [11]:
# Set up a prompt template
class CustomPromptTemplate(StringPromptTemplate):
    # The template to use
    template: str
    # The list of tools available
    tools: List[Tool]
    
    def format(self, **kwargs) -> str:
        # Get the intermediate steps (AgentAction, Observation tuples)
        # Format them in a particular way
        intermediate_steps = kwargs.pop("intermediate_steps")
        thoughts = ""
        for action, observation in intermediate_steps:
            thoughts += action.log
            thoughts += f"\nObservation: {observation}\nThought: "
        # Set the agent_scratchpad variable to that value
        kwargs["agent_scratchpad"] = thoughts
        # Create a tools variable from the list of tools provided
        kwargs["tools"] = "\n".join([f"{tool.name}: {tool.description}" for tool in self.tools])
        # Create a list of tool names for the tools provided
        kwargs["tool_names"] = ", ".join([tool.name for tool in self.tools])
        return self.template.format(**kwargs)
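
The scratchpad loop in `format` can be tried without langchain. In this sketch `FakeAction` is a stand-in for `AgentAction` (only the `log` field is used) and the single step is hand-written.

```python
from collections import namedtuple

# Stand-in for langchain's AgentAction; only the `log` field matters here.
FakeAction = namedtuple("FakeAction", ["log"])

intermediate_steps = [
    (
        FakeAction(log="Thought: I need the q-number.\nAction: ItemLookup\nAction Input: J.S. Bach"),
        "Q1339",
    ),
]

# Same accumulation as CustomPromptTemplate.format: each step contributes the
# model's own text, the tool's observation, and a trailing "Thought: " cue.
thoughts = ""
for action, observation in intermediate_steps:
    thoughts += action.log
    thoughts += f"\nObservation: {observation}\nThought: "

print(thoughts)
```

The trailing `Thought: ` is what prompts the model to continue reasoning from the latest observation on the next call.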

In [12]:
prompt = CustomPromptTemplate(
    template=template,
    tools=tools,
    # This omits the `agent_scratchpad`, `tools`, and `tool_names` variables because those are generated dynamically
    # This includes the `intermediate_steps` variable because that is needed
    input_variables=["input", "intermediate_steps"]
)

### Output parser recognizes the "stop" sequence and parses action output

In [13]:
class CustomOutputParser(AgentOutputParser):
    
    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # Check if agent should finish
        if "Final Answer:" in llm_output:
            return AgentFinish(
                # Return values is generally always a dictionary with a single `output` key
                # It is not recommended to try anything else at the moment :)
                return_values={"output": llm_output.split("Final Answer:")[-1].strip()},
                log=llm_output,
            )
        # Parse out the action and action input
        regex = r"Action: (.*?)[\n]*Action Input:[\s]*(.*)"
        match = re.search(regex, llm_output, re.DOTALL)
        if not match:
            raise ValueError(f"Could not parse LLM output: `{llm_output}`")
        action = match.group(1).strip()
        action_input = match.group(2)
        # Return the action and action input
        return AgentAction(tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output)
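
To check what the parser extracts, the same regex and the same `Final Answer:` split can be run on hand-written model output, as in this sketch (plain `re`, no langchain needed; both sample strings are made up in the format the prompt requests).

```python
import re

regex = r"Action: (.*?)[\n]*Action Input:[\s]*(.*)"

# An intermediate step: the parser should pull out the tool name and its input.
llm_output = (
    "Thought: I need to find the q-number for J.S. Bach.\n"
    "Action: ItemLookup\n"
    'Action Input: "J.S. Bach"'
)
match = re.search(regex, llm_output, re.DOTALL)
tool = match.group(1).strip()
tool_input = match.group(2).strip(" ").strip('"')
print(tool)        # prints ItemLookup
print(tool_input)  # prints J.S. Bach

# A final step: everything after the last "Final Answer:" is the output.
final_output = "Thought: I now know the final answer\nFinal Answer: J.S. Bach had 20 children."
answer = final_output.split("Final Answer:")[-1].strip()
print(answer)      # prints J.S. Bach had 20 children.
```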

In [14]:
output_parser = CustomOutputParser()

### Specify the LLM model

In [15]:
# llm = ChatOpenAI(temperature=0)
llm = OpenAI(temperature=0)

### Tie it all together and run it

In [16]:
# LLM chain consisting of the LLM and a prompt
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [17]:
tool_names = [tool.name for tool in tools]
agent = LLMSingleActionAgent(
    llm_chain=llm_chain, 
    output_parser=output_parser,
    stop=["\nObservation:"], 
    allowed_tools=tool_names
)

In [18]:
agent_executor = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, verbose=True)

#### Note that the model summarizes the json output

In [19]:
agent_executor.run("How many children did J.S. Bach have?")

> Entering new AgentExecutor chain...
Thought: I need to find the q-number for J.S. Bach and the p-number for the property "number of children".
Action: ItemLookup
Action Input: J.S. Bach

Observation: Q1339
I need to find the p-number for the property "number of children".
Action: PropertyLookup
Action Input: number of children

Observation: P1971
I need to run a sparql query to get the answer.
Action: SparqlQueryRunner
Action Input: 
SELECT ?children WHERE {
  wd:Q1339 p:P1971 ?statement .
  ?statement ps:P1971 ?children .
}

Observation: [{"children": {"datatype": "http://www.w3.org/2001/XMLSchema#decimal", "type": "literal", "value": "20"}}]
I now know the final answer.
Final Answer: J.S. Bach had 20 children.

> Finished chain.


'J.S. Bach had 20 children.'

In [32]:
agent_executor.run("What is the Basketball-Reference.com NBA player ID of Hakeem Olajuwon?")

> Entering new AgentExecutor chain...
Thought: I need to find the p and q items for Basketball-Reference.com NBA player ID and Hakeem Olajuwon
Action: ItemLookup
Action Input: Hakeem Olajuwon

Observation: Q273256
I need to find the p and q items for Basketball-Reference.com NBA player ID
Action: PropertyLookup
Action Input: Basketball-Reference.com NBA player ID

Observation: P2685
I now have the p and q items for the query
Action: SparqlQueryRunner
Action Input: SELECT ?playerID WHERE { wd:Q273256 p:P2685 ?playerIDStatement. ?playerIDStatement ps:P2685 ?playerID. }

Observation: [{"playerID": {"type": "literal", "value": "o/olajuha01"}}]
I now know the final answer
Final Answer: The Basketball-Reference.com NBA player ID of Hakeem Olajuwon is o/olajuha01.

> Finished chain.


'The Basketball-Reference.com NBA player ID of Hakeem Olajuwon is o/olajuha01.'