# Fact checking with web search and LLMs for inference

Given a statement, verify whether it is true. We want a final answer, plus the supporting and contradicting evidence behind it, so that the result is explainable. When LLMs are asked for evidence, they tend to make up reasons and, especially, sources. It seems that the LLM decides on the answer first and then looks for things that resemble evidence. The great majority of URLs generated by both ChatGPT and PaLM for this purpose are non-existent or irrelevant. Fact checking with web search instead produces real text snippets that support or contradict the statement, together with their sources. This evidence can be easily verified.

This approach works by:
1. running a web search to find relevant web pages
2. comparing embeddings to find relevant passages in the web pages
3. using an LLM to verify whether the statement follows from the text
4. combining the inferences into a single result.

The main resources used are:
- Bing Search API for web search
- trafilatura for extracting useful text
- langchain for various utilities and pipelining
- OpenAI and Google PaLM for LLMs
- ChromaDB for the vector store

This approach depends on the answer being available on the web in chunks of text. However, that may not always be the case: sometimes you need to find multiple pieces of evidence and combine them to infer the statement (or its negation). This is handled in the next version.


In [1]:
import pprint
import os
import re
import math
import chromadb
from chromadb.config import Settings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.utilities import BingSearchAPIWrapper
from langchain.text_splitter import RecursiveCharacterTextSplitter as RCS
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.llms.google_palm import GooglePalm 
from langchain.chains import LLMChain
from trafilatura import fetch_url, extract, html2txt
from trafilatura.settings import use_config

## Web search

We use the Bing search utility from langchain to find pages, and [Trafilatura](https://pypi.org/project/trafilatura/) to extract the useful text from each web page.

In [None]:
os.environ["BING_SUBSCRIPTION_KEY"] = os.environ['BING_SEARCH_V7_SUBSCRIPTION_KEY']
# build the URL with string ops: os.path.join would use '\' as the separator on Windows
os.environ["BING_SEARCH_URL"] = os.environ['BING_SEARCH_V7_ENDPOINT'].rstrip('/') + '/v7.0/search'
bingSearch = BingSearchAPIWrapper()

trafConfig = use_config()
trafConfig.set('DEFAULT', 'EXTRACTION_TIMEOUT', '0')

embedder = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2",
                                model_kwargs={'device': 'mps'})
cclient = chromadb.Client(settings=Settings(allow_reset=True))

openaiLLM = OpenAI(temperature = 0)
PALM_API_KEY = os.environ['PALM_API_KEY']
palmLLM = GooglePalm(google_api_key = PALM_API_KEY, 
                     max_output_tokens = 4096,
                     model_name='models/text-bison-001',
                     temperature=0.0,
                     verbose=True)
DEFAULTLLM = openaiLLM

## LLMs

We use the same prompt for OpenAI and PaLM. Although LLMs are not very good at logic, they may be better at inference from a set of axioms than smaller inference models.
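The prompt in the next cell asks for a label and a confidence in free text, so the reply has to be parsed before it can be used. Here is a minimal sketch of one way to do that, assuming the model answers roughly in the requested shape. `parseResponse`, `LABELS` and `NUMBER` are hypothetical names for illustration and are not part of the notebook code.

```python
import re

LABELS = ("ENTAILS", "CONTRADICTS", "UNKNOWN")
# A decimal such as 0.9 or .9, or a plain integer such as 1.
NUMBER = re.compile(r"\d*\.\d+|\d+")

def parseResponse(text: str) -> tuple[str, float]:
    """Return (label, confidence) parsed from a free-text LLM reply."""
    upper = text.upper()
    # Check the labels in a fixed order; fall back to UNKNOWN if none is found.
    label = next((lab for lab in LABELS if lab in upper), "UNKNOWN")
    # Take the first number in the reply as the confidence, if there is one.
    m = NUMBER.search(text)
    conf = float(m.group()) if m else 0.0
    # Clamp to the 0.0 to 1.0 range the prompt asks for.
    conf = min(max(conf, 0.0), 1.0)
    return label.lower(), conf

print(parseResponse("<ENTAILS>\nYour confidence: 0.9"))
print(parseResponse("I cannot tell."))
```

Defaulting to `unknown` with confidence 0.0 means a malformed reply contributes nothing to the final score instead of raising an error.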

In [3]:
promptStr = """Consider the context and the statement below.
        If the statement follows from the context, respond with '<ENTAILS>'.
        If the opposite of the statement follows from the context, respond with '<CONTRADICTS>'.
        Otherwise respond with '<UNKNOWN>'.
        Also indicate your confidence in your answer with a number from 0.0 to 1.0.
        
        Context: {context}
        Statement: {stmt}
        Your response:
        Your confidence:
        """
promptTemplate = PromptTemplate.from_template(promptStr)
inferenceLLMChain = LLMChain(llm = DEFAULTLLM, prompt = promptTemplate)

MINURLTXTLEN = 100      # if the extracted page text is shorter than this, retry with looser settings
MINCHUNKLEN = 80        # keep only text chunks longer than this
NUMCHUNKMATCHES = 100   # default number of chunks matching the statement to retrieve
MAXDISTANCE = 0.5       # max cosine distance for statement-text match (filter currently disabled)

pp = pprint.PrettyPrinter(indent=4)

def doSearch(query: str, num: int = 10) -> list[dict]:
    """
    Params:
      query: the statement to verify
      num: number of urls to request
    Returns:
      the list of result dicts that bingSearch returns, each with its rank added as 'urlSeq'
    """
    results = bingSearch.results(query, num)
    for i, r in enumerate(results):
        r['urlSeq'] = i
    return results

def getText(urls: list[dict],
            withTables: bool = True) -> list[dict]:
    """
    Params:
      urls: list of dict of urls (from doSearch())
      withTables: whether to extract information from tables
    returns:
      the input urls with the text added
    """
    for url in urls:
        page = fetch_url(url['link'])
        webText = extract(page, include_comments=False, include_links=False,
                          include_images=False, include_tables=withTables,
                          include_formatting=False, deduplicate=False,
                          config=trafConfig, favor_recall=True)
        if webText is None: webText = ''
        if len(webText) < MINURLTXTLEN:
            webText = extract(page, include_comments=True, include_links=True,
                          include_images=False, include_tables=True,
                          include_formatting=False, deduplicate=False,
                          config=trafConfig, favor_recall=True)
            if webText is None: webText = ''
        url['text'] = webText
    return urls

## Splitting the web page

While langchain has a variety of text splitters, we simply split at the paragraphs that trafilatura generates.
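As a small illustration of paragraph-level splitting, the sketch below splits on newlines, drops short lines, and skips any paragraph already seen on an earlier page (for example repeated boilerplate). `splitParagraphs` and the sample pages are made up for this example, and the length threshold is lowered so the short sample text survives.

```python
MIN_LEN = 20  # illustrative threshold; the notebook uses MINCHUNKLEN = 80

def splitParagraphs(text: str, seen: set, minLen: int = MIN_LEN) -> list:
    """Split text on newlines, keeping long paragraphs not seen before."""
    chunks = []
    for para in text.split("\n"):
        if len(para) > minLen and para not in seen:
            seen.add(para)       # remember it so later pages skip duplicates
            chunks.append(para)
    return chunks

seen = set()
page1 = ("Short line\n"
         "Wolves regulate populations of other animals.\n"
         "Cookie notice shown on every page.")
page2 = ("Cookie notice shown on every page.\n"
         "Scavengers feed on what wolves leave behind.")
print(splitParagraphs(page1, seen))
print(splitParagraphs(page2, seen))
```

Sharing one `seen` set across pages is what makes the deduplication global rather than per page.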

In [4]:
def chunkText(urls: list[dict]) -> list[dict]:
    """
    Params:
      urls: list of dict with url info after getText()
    Returns:
      the same list with an additional 'chunks' key in the dicts
    """
    #splitter = RCS(separators=['\n', '.', '!', '?', ';', ' ', ''], chunk_overlap=0,
    #              add_start_index=True)
    chunks = {}
    for url in urls:
        url['chunks'] = []
        theseChunks = [x for x in url['text'].split('\n') if len(x) > MINCHUNKLEN]
        for chnk in theseChunks:
            if chnk not in chunks:
                url['chunks'].append(chnk)
                chunks[chnk] = 1           
        #url['chunks'] = splitter.split_text(url['text'])
        del(url['text'])
    return urls

def getTextsMetadata(urls):
    """
    Params:
      urls: list of dicts with urls, texts, chunks
    returns:
      texts: just the chunks, deduplicated
      metadata: metadata about the text
      ids: an id for each text
      cmetadata: metadata about the whole collection
    """
    texts = []
    metadata = []
    ids = []
    cmetadata = {'urls': []}
    for url in urls:
        cmetadata['urls'].append({'url': url['link'], 'urlSeq': str(url['urlSeq'])})
        for i, chunk in enumerate(url['chunks']):
            texts.append(chunk)
            metadata.append({'urlSeq': url['urlSeq'], 'chunkSeq': i,
                            'url': url['link']})
            ids.append(f"{url['urlSeq']}_{i}")
    # pp.pprint(cmetadata)
    # pp.pprint(texts[:4])
    # pp.pprint(ids[:10])
    # pp.pprint(metadata[:4])
    return texts, metadata, ids, cmetadata

## Finding relevant chunks

This uses cosine similarity between embedding vectors. ChromaDB is used for that.

An alternative approach would be to ask the LLM whether the text is relevant to determining the truth of the statement.
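To make the matching step concrete, here is a minimal pure-Python sketch of ranking chunks by cosine distance (1 minus cosine similarity) and dropping those beyond a cutoff. The two-dimensional vectors are toy values rather than real embeddings, `cosineDistance` and `rankChunks` are hypothetical helpers, and the 0.5 cutoff mirrors `MAXDISTANCE`, which the notebook defines but does not currently apply.

```python
import math

def cosineDistance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    normA = math.sqrt(sum(x * x for x in a))
    normB = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (normA * normB)

def rankChunks(queryVec, chunkVecs, maxDistance=0.5):
    """Return (distance, name) pairs under maxDistance, closest first."""
    scored = [(cosineDistance(queryVec, vec), name)
              for name, vec in chunkVecs.items()]
    return sorted((d, n) for d, n in scored if d < maxDistance)

toyChunks = {
    'a': [2.0, 0.0],   # same direction as the query
    'b': [1.0, 1.0],   # 45 degrees away
    'c': [0.0, 1.0],   # orthogonal, so distance 1.0
}
for dist, name in rankChunks([1.0, 0.0], toyChunks):
    print(name, round(dist, 3))
```

Because cosine distance ignores vector length, chunk `a` matches the query exactly even though its vector is twice as long.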


In [5]:
def storeVectors(urls):
    """
    Params:
      urls: the list of url dicts as above
    Return:
      vs: chromaDb vector store
      cmetadata: collection metadata
    """
    texts, metadata, ids, cmetadata = getTextsMetadata(urls)
    vs = cclient.create_collection(name='vs0',
                                  metadata={"hnsw:space": "cosine"},  # cos distance
                                  get_or_create = True)
    vs.add(embeddings = embedder.embed_documents(texts), 
                                 documents = texts,
                                 metadatas=metadata, ids=ids)
    # print('Count ', vs.count())
    return vs, cmetadata

def getChunksForInference(vs, query, numMatches = NUMCHUNKMATCHES):
    """
    Params:
      vs: vector store
      query: list of query strings (here, the statement to verify)
      numMatches: number of matches to return
    Return:
      candDocs: matching text chunks
      candMetadata: corresponding metadata
      candDistances: corresponding (query-chunk) distances
    """
    closestChunks = vs.query(query_embeddings=embedder.embed_documents(query),
                             n_results = numMatches)
    # maxidx = len([x for x in closestChunks['distances'] if x < MAXDISTANCE])
    candDocs = closestChunks['documents']
    candMetadata = closestChunks['metadatas']
    candDistances = closestChunks['distances']
    return candDocs, candMetadata, candDistances

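def getLLMpredictions(query, candDocs, llm=openaiLLM):
    """Run the inference prompt on each candidate chunk with the given llm."""
    chain = LLMChain(llm=llm, prompt=promptTemplate)
    # one prompt input per chunk: the chunk is the context for the statement
    inputs = [{'stmt': query, 'context': d} for d in candDocs]
    return chain.apply(inputs)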

## Combining evidence

Given the LLM predictions on the text chunks, we need to combine them to get a result. The heuristic approach used here assumes:
- the urls returned by web search are in order of importance. This is modeled here by a power-law weighting of the urls
- the confidence values reported by the LLMs have some validity, so we use them to weight the evidence

Using these two factors we come up with a result and display the supporting and contradictory evidence.
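A small worked example may help. The sketch below weights each piece of evidence by `(urlSeq + 1) ** power`, adds supporting confidences, subtracts contradicting ones, and normalises by the total weight. `combine` is a simplified stand-in written for illustration, and the evidence values are invented.

```python
import math

def combine(evidence, power=-1.0, threshold=0.3):
    """evidence: list of (status, conf, urlSeq) tuples, where status is
    'entails' or 'contradicts'. Returns (status, confidence)."""
    score = 0.0
    denom = 0.0
    for status, conf, urlSeq in evidence:
        weight = math.pow(urlSeq + 1, power)   # rank 0 gets weight 1.0
        sign = 1.0 if status == 'entails' else -1.0
        score += sign * conf * weight
        denom += weight
    result = score / denom if denom > 0 else 0.0
    if abs(result) < threshold:
        return 'unknown', 0.0
    if result > 0:
        return 'entails', result
    return 'contradicts', -result

# Support from the top url (weight 1.0) against a contradiction from the
# fourth url (weight 0.25): (0.9 * 1.0 - 0.8 * 0.25) / 1.25 = 0.56
status, conf = combine([('entails', 0.9, 0), ('contradicts', 0.8, 3)])
print(status, round(conf, 2))
```

With `power=0.0` every url gets weight 1.0, so the result reduces to a plain average of signed confidences.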

In [10]:
def getResults(candDocs, candMetadata, predictions, 
               power=-1.0,  
              threshold=0.3):  
    """
    Params:
      candDocs: texts providing evidence for or against the statement
      candMetadata: metadata about these texts
      predictions: LLM responses
      power: exponent of the power law applied to the url rank; 0.0 for no power law
      threshold: if abs(resultantScore) < threshold -> unknown
    Returns:
      status: entails|contradicts|unknown
      resultantConfidence
      supportSet
      contradictSet
        supportSet, contradictSet: list of {url, urlSeq, text, status, conf}
    """            
    support = []
    contra = []
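    # first number in the reply; a lone '.' (e.g. a full stop) must not match, or float() would fail
    cre = re.compile(r'.*?(\d*\.\d+|\d+)', re.DOTALL)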
    for dmp in zip(candDocs[0], candMetadata[0], predictions):
        status = 'UNKNOWN'
        conf = 0.0
        einfo = {'url': dmp[1]['url'], 'urlSeq': dmp[1]['urlSeq'], 'text': dmp[0]}
        response = dmp[2]['text']
        cmatch = cre.match(response)
        if cmatch is not None:
            conf = float(cmatch[1])
        einfo['conf'] = conf
        if 'entail' in response.lower():
            einfo['status'] = 'entails'
            support.append(einfo)
        elif 'contra' in response.lower():
            einfo['status'] = 'contradicts'
            contra.append(einfo)
        else: einfo['status'] = 'unknown'
    denom = 0
    sumSupport = sum([x['conf'] * math.pow(x['urlSeq'] + 1, power) for x in support])
    sumContra = sum([x['conf'] * math.pow(x['urlSeq'] + 1, power) for x in contra])
    denom += sum([math.pow(x['urlSeq'] + 1, power) for x in support])
    denom += sum([math.pow(x['urlSeq'] + 1, power) for x in contra])
    if denom > 0:
        resultConf = (sumSupport - sumContra) / denom
    else:
        resultConf = 0
    if abs(resultConf) < threshold:
        return 'unknown', 0.0, support, contra
    elif resultConf > 0:
        return 'entails', resultConf, support, contra
    else:
        return 'contradicts', -resultConf, support, contra

        
# top level:
def factcheck(query: str = '',
              kwargs: dict = None) -> None:
    # kwargs:
    #  - savedSearch
    urls = doSearch(query)
    # doc download and cleaning
    urls = getText(urls)
    # splitting
    urls = chunkText(urls)
    # adding to vectorstore
    vs, cmetadata = storeVectors(urls)
    # choosing chunks
    candChunks, candMetadata, candDistances = getChunksForInference(vs, 
                                                                [query],
                                                                numMatches = 20)
    # get inferences
    predictions = getLLMpredictions(query, candChunks[0])
    status, conf, supportSet, contraSet = getResults(candChunks, candMetadata, predictions)
    print(f"Status: {status}, confidence: {conf:.2f}")
    print('Support:')
    pp.pprint(supportSet)
    print('Contradicts:')
    pp.pprint(contraSet)
    cclient.reset()



## Examples

Here are some examples:
- "wolves are beneficial to the ecosystem" - a non-controversial statement
- "The Golden State Warriors are the best team." - this is more controversial.
- "The moon is closer to Earth than Europa is to Jupiter." - it gets that wrong, in part because of the text retrieved by web search and found relevant by cosine similarity.
- This is the output from Bard which has the correct distances, but gets the answer wrong:
~~~
  No, it is not true that the Moon is closer to Earth than Europa is to Jupiter. The Moon is approximately 238,900 miles (384,400 kilometers) from Earth, while Europa is approximately 417,000 miles (671,000 kilometers) from Jupiter. This means that Europa is approximately 178,100 miles (286,600 kilometers) further away from Jupiter than the Moon is from Earth.

Here is a table comparing the distances between the Earth and Moon, and Jupiter and Europa:

| Body | Distance |
|---|---|---|
| Earth to Moon | 238,900 miles (384,400 kilometers) |
| Jupiter to Europa | 417,000 miles (671,000 kilometers) |

I hope this information is helpful.
~~~
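The distances quoted in the Bard output already settle the question. A quick check, using only the figures from that output, shows that the statement is in fact true:

```python
moonToEarthKm = 384_400       # as quoted in the Bard output
europaToJupiterKm = 671_000   # as quoted in the Bard output

moonIsCloser = moonToEarthKm < europaToJupiterKm
differenceKm = europaToJupiterKm - moonToEarthKm
print(moonIsCloser, differenceKm)
```

The difference matches the 286,600 kilometers Bard reports, so the numbers are consistent and only the conclusion drawn from them is wrong.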

In [127]:
factcheck('Wolves are beneficial to the ecosystem')

Status: entails, confidence: 0.92
Support:
[   {   'conf': 0.9,
        'status': 'entails',
        'text': 'As a large predator, wolves play a key role in regulating '
                'populations of other animals. Without them the balance in '
                'those ecosystems is upset.',
        'url': 'https://greentumble.com/why-are-wolves-good-for-the-ecosystem',
        'urlSeq': 1},
    {   'conf': 0.9,
        'status': 'entails',
        'text': 'What is more, wolves benefit other animals, like scavengers. '
                'Because they hunt down their prey, wolves leave leftovers '
                'behind that are the key source of food for a number of other '
                'species such as ravens, magpies, bald eagles, golden eagles, '
                'weasels, mink, lynx, cougar, grizzly bear, chickadees, masked '
                'shrew, great gray owl, and more than 445 species of beetle '
                '[4].',
        'url': 'https://greentumble.com/why-are-wolves-

In [132]:
factcheck("The Golden State Warriors are the best team.")

Status: entails, confidence: 0.64
Support:
[   {   'conf': 1.0,
        'status': 'entails',
        'text': 'The Golden State Warriors have the best record in the '
                'National Basketball Association, but are they really the '
                'league’s best team?',
        'url': 'https://bluemanhoop.com/2021/11/08/golden-state-warriors-nba-best-team/',
        'urlSeq': 1},
    {   'conf': 0.9,
        'status': 'entails',
        'text': 'There’s one team with one loss in the entire NBA, and it’s '
                'the Golden State Warriors. With that in mind, based on purely '
                'record, the Warriors seem, on paper, to be the NBA’s best '
                'team. While their league-best record suggests it, are they '
                'actually the best of the best?',
        'url': 'https://bluemanhoop.com/2021/11/08/golden-state-warriors-nba-best-team/',
        'urlSeq': 1},
    {   'conf': 1.0,
        'status': 'entails',
        'text': 'The Dubs are t

In [13]:
factcheck("The moon is closer to Earth than Europa is to Jupiter.")

Status: contradicts, confidence: 0.66
Support:
[   {   'conf': 1.0,
        'status': 'entails',
        'text': 'Europa is smaller and colder than Earth. It’s slightly '
                'smaller in size than Earth’s Moon. It’s so cold because it’s '
                'a long way from the Sun—more than five times farther than the '
                'distance between the Sun and Earth.',
        'url': 'https://spaceplace.nasa.gov/europa/en/',
        'urlSeq': 3}]
Contradicts:
[   {   'conf': 0.9,
        'status': 'contradicts',
        'text': '- Europa is nearly the same size as Earth’s Moon. It is '
                'tidally locked to Jupiter in its orbit and spins faster on '
                'its axis than it orbits. Its orbit is nearly circular.',
        'url': 'https://space-facts.com/moons/europa/',
        'urlSeq': 5},
    {   'conf': 0.9,
        'status': 'contradicts',
        'text': 'Europa (Jupiter II), the second of the four Galilean moons, '
                'is the secon