

# Using PaLM for fact checking

An easy way to check whether statements are true is to just ask an LLM. There are 3 possible outcomes: True, False, Unknown.

There are 2 approaches below:
- just ask whether the statement is true or not
- in addition, ask it to present some evidence. There are 2 variations:
  - simply ask for evidence
  - ask for some number of pieces of evidence

It is hoped that asking for evidence is analogous to a chain of thought approach and will result in better performance.

## TLDR

1. The metrics are relatively low
2. PaLM is biased to finding that statements are true
3. Asking for evidence for its decisions exacerbates this bias
4. PaLM makes up sources (urls) and arguments when asked for evidence
5. PaLM does not seem to grasp the concept of evidence for or against a statement.

## Implementation

PaLM has some idiosyncrasies that need to be handled.
    

In [117]:
import google.generativeai as palm
import google.generativeai.types.safety_types as safety_types
import os
import time
import json
import pandas as pd
import random
import re
import logging

api_key = os.environ['PALM_API_KEY']
palm.configure(api_key=api_key)
MODEL = 'models/text-bison-001'
BASEDIR = os.getcwd()
RESULTDIR = os.path.join(BASEDIR, '../procData/factcheck/palm')
TESTDIR = os.path.join(BASEDIR, '../procData/fakeGen/chatgpt')

logging.basicConfig(filename=os.path.join(BASEDIR, 'factCheck_palm.log'), 
                    level=logging.DEBUG)

# without using this, a number of requests fail
SAFETY = [
    {'category': safety_types.HarmCategory.HARM_CATEGORY_DEROGATORY,
     'threshold': safety_types.HarmBlockThreshold.BLOCK_NONE,
    },
    {'category': safety_types.HarmCategory.HARM_CATEGORY_TOXICITY,
     'threshold': safety_types.HarmBlockThreshold.BLOCK_NONE,
    },
    {'category': safety_types.HarmCategory.HARM_CATEGORY_VIOLENCE,
     'threshold': safety_types.HarmBlockThreshold.BLOCK_NONE,
    },
    {'category': safety_types.HarmCategory.HARM_CATEGORY_SEXUAL,
     'threshold': safety_types.HarmBlockThreshold.BLOCK_NONE,
    },
    {'category': safety_types.HarmCategory.HARM_CATEGORY_MEDICAL,
     'threshold': safety_types.HarmBlockThreshold.BLOCK_NONE,
    },
    {'category': safety_types.HarmCategory.HARM_CATEGORY_DANGEROUS,
     'threshold': safety_types.HarmBlockThreshold.BLOCK_NONE,
    },
]

RETV = re.compile('<(.*?)>')
RENB = re.compile('(TRUE|FALSE|UNKNOWN)')

def verifyStatement(slist: list[str]):
    """
    Just ask if the statement is true, false or unknown
    """
    stmts = '\n'.join([f"Statement: {x}" for x in slist])
    query = f"""
Provide a list of '<TRUE>', '<FALSE>', or '<UNKNOWN>' for each statement in the list below.

Statement list:

{stmts}
"""
    try:
        palmResp = palm.generate_text(
            model=MODEL,
            prompt=query,
            temperature=0,
            candidate_count=1,
            safety_settings = SAFETY
        )
        logging.info(stmts)
        logging.debug(str(palmResp))
        logging.info('Result\n' + str(palmResp.result))
    except Exception as e:
        print('verify statement failed\n', e)
        logging.error('Palm failure: ' + str(e))
        raise e
    if len(palmResp.candidates) == 0:
        logging.error('No candidates from palm')
        raise ValueError('No candidates')
    preds = list(map(lambda x: getTV(x), RETV.findall(palmResp.result)))
    if len(preds) == 0:
        preds = list(map(lambda x: getTV(x), RENB.findall(palmResp.result)))
    return preds, palmResp


def verifyStatementWithNEvidence(slist: list[str], nEvid: int = None):
    """
    Ask about the truth of the statements, and ask for evidence for and against each one
    """
    stmts = '\n'.join([f"Statement: {x}\n" for x in slist])
    query = f"""For each statement in the list below, provide:
    1. {'A few' if nEvid is None else nEvid} pieces of evidence for it with their sources
    2. {'A few' if nEvid is None else nEvid} pieces of evidence against it with their sources
    3. <TRUE> if you assess the statement to be true, <FALSE> if you believe it is false, <UNKNOWN> if you cannot conclude either true or false.

Statement list: 
{stmts}
"""
    try:
        palmResp = palm.generate_text(
            model=MODEL,
            prompt=query,
            temperature=0,
            candidate_count=1,
            safety_settings = SAFETY
        )
        logging.info(stmts)
        logging.debug(str(palmResp))
        logging.info('Result\n' + str(palmResp.result))
    except Exception as e:
        logging.error('Palm failure: ' + str(e))
        raise e
    preds = list(map(lambda x: getTV(x), RETV.findall(palmResp.result)))
    if len(preds) == 0:
        preds = list(map(lambda x: getTV(x), RENB.findall(palmResp.result)))
    # A wrong number of answers is not treated as an error here: testFactCheck
    # compares len(preds) with the batch size and retries one statement at a time.
    return preds, palmResp

def getTV(stv):
    lstv = stv.lower()
    if lstv == 'true': return True
    elif lstv == 'false': return False
    else: return 'Unknown'
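
The parsing step can be tried without calling the API. The sketch below re-declares the two patterns and a `get_tv` helper that mirrors `getTV`; `parse_labels` and the sample responses are illustrative assumptions, not real PaLM output.

```python
import re

# Same patterns as above: angle-bracketed labels first, bare labels as fallback.
RETV = re.compile('<(.*?)>')
RENB = re.compile('(TRUE|FALSE|UNKNOWN)')

def get_tv(label: str):
    """Map a label string to True, False or 'Unknown' (mirrors getTV)."""
    low = label.lower()
    if low == 'true':
        return True
    if low == 'false':
        return False
    return 'Unknown'

def parse_labels(text: str) -> list:
    """Return one verdict per label found in the response text."""
    found = RETV.findall(text) or RENB.findall(text)
    return [get_tv(x) for x in found]

# Hypothetical responses, written by hand for illustration.
print(parse_labels("1. <TRUE>\n2. <FALSE>\n3. <UNKNOWN>"))
print(parse_labels("TRUE\nFALSE"))
# Any other angle-bracketed text is also captured and counted as 'Unknown'.
print(parse_labels("<https://example.org> <TRUE>"))
```

The last call shows one way the number of parsed answers can exceed the number of statements: the first pattern captures anything between angle brackets, not only the three labels.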
    

In [None]:
stmts = ['David Lloyd George was the next British Prime Minister after Arthur Balfour',
         "Moonwalk was the name of Michael Jackson\'s autobiography written in 1988.",
         "California was the last US state to reintroduce alcohol after prohibition.",
         "George Bush was the director of the CIA from 1976-81.",
        ]


#tvs, resp = verifyStatement(stmts)
tvs, resp = verifyStatementWithNEvidence(stmts, 3)
print(resp.result, '\n', tvs)


### Applying these to the data

This needs to handle instances where PaLM does not return the correct number of results.
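
The recovery strategy used below is: try the whole batch, and if the number of answers does not match, redo the statements one at a time and record 'Unknown' for any that still fail. A minimal sketch of that pattern, assuming a hypothetical `verify` callable that returns only the list of predictions (the real functions also return the PaLM response):

```python
from typing import Callable

def verify_with_fallback(batch: list[str],
                         verify: Callable[[list[str]], list]) -> list:
    """Verify a batch; on a count mismatch, retry one statement at a time."""
    try:
        preds = verify(batch)
    except Exception:
        preds = []
    if len(preds) == len(batch):
        return preds
    preds = []
    for s in batch:
        try:
            one = verify([s])
        except Exception:
            one = []
        # Anything other than exactly one answer is recorded as 'Unknown'.
        preds.append(one[0] if len(one) == 1 else 'Unknown')
    return preds

# Hypothetical stand-in for the model: it drops the last answer
# whenever it is given more than two statements.
def flaky_verify(stmts: list[str]) -> list:
    answers = [s.endswith('true.') for s in stmts]
    return answers[:-1] if len(stmts) > 2 else answers

batch = ['A is true.', 'B is false.', 'C is true.']
print(verify_with_fallback(batch, flaky_verify))
```

The fallback costs one request per statement, so it is only worth taking when the batched call fails.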


In [82]:


MAXSLEEP = 3
MINSLEEP = 0
BATCHSIZE = 4

from typing import Callable

def testFactCheck(inFile: str,          # json list of dicts with statement, fake_statement...
                  selectFile: str,      # json list of True, False
                  vmethod: Callable[..., tuple],  # approach to take, returns (preds, palmResp)
                  nEvidence: int,       # if evidence is wanted, how many pieces?
                  outFile: str,         # output
                  logFile: str = None,  # detailed output
                  batchSize: int = 64, 
                  startIdx: int = 0,
                  numProc: int = -1,
                 ):
    """
    The input file is a json list of dicts that contain a true statement and a false 
    statement based on the same trivia question.
    The select file is a list of true and false that indicates which statement to pick
    for each question
    If there is no select file, we verify all questions
    """
    results = []
    predVals = []
    nb = 0
    with open(inFile, 'r') as ix:
        testData = json.load(ix)
    selData = None
    if selectFile is not None:
        with open(selectFile, 'r') as sx:
            selData = json.load(sx)
    if logFile is not None:
        log = open(logFile, 'w')
    lb = startIdx
    ub = 0
    while lb < len(testData):  # '< len(testData) - 1' could skip the last item
        ub = min(lb + batchSize, len(testData))
        if numProc > -1:
            ub = min(ub, (numProc + startIdx))
            if lb >= numProc + startIdx: break
        try:
            batch = []
            stv = []
            for bx in range(lb, ub):
                if selData is not None:
                    batch.append(testData[bx]['statement'] if selData[bx] else testData[bx]['fake_statement'])
                    stv.append([batch[-1], selData[bx]])
                else:
                    batch.append(testData[bx]['statement'])
                    stv.append([batch[-1], True])
                    batch.append(testData[bx]['fake_statement'])
                    stv.append([batch[-1], False])
            try:
                preds, palmResp = doVerify(vmethod, nEvidence, batch)
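            except Exception as e:
                preds = []
                palmResp = None
                print(e)
            # palmResp is None when the call failed, so there is no text to log
            textResp = palmResp.result if palmResp is not None else ''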
            if len(preds) != len(batch):
                print(f"ERROR in batch {nb}: length of preds = {len(preds)}, batch = {len(batch)}. Redoing one at a time")
                preds = []
                textResp = ''
                for s in batch:
                    try:
                        px, ox = doVerify(vmethod, nEvidence, [s])
                        if len(px) != 1:
                            preds.append('Unknown')
                        else:
                            preds += px
                        textResp += ox.result
                    except Exception as e:
                        print(e)
                        preds.append('Unknown')
                        textResp += '--ERROR--\n'
            lb = ub
            nb += 1
            predVals.extend(preds)
            results.extend(stv)
            if logFile is not None:
                log.write('\n'.join(batch) + '\n')
                log.write(f"{textResp}\n\n")
            time.sleep(MINSLEEP + random.random() * (MAXSLEEP - MINSLEEP))
        except Exception as e:
            print('Error in \n', batch, '\n', e)
            lb = ub
            nb += 1
        if nb % 10 == 0:
            if logFile is not None: log.flush()
            print(f"{nb} batches done")
            df = pd.DataFrame(results, columns=['question', 'trueAnswer'])
            try:
                df['palmAnswer'] = predVals
                df['correct?'] = (df['palmAnswer'] == df['trueAnswer'])
            except Exception as e:
                print(e)
            df.to_csv(outFile, index=False, header=True)
    if logFile is not None: log.close()
    df = pd.DataFrame(results, columns=['question', 'trueAnswer'])
    try:
        df['palmAnswer'] = predVals
        df['correct?'] = (df['palmAnswer'] == df['trueAnswer'])
    except Exception as e:
        print(e)
        with open(os.path.join(RESULTDIR, 'predvals_buggy'), 'w') as px:
            json.dump(predVals, px)
            print('Dumped predVals')
    df.to_csv(outFile, index=False, header=True)

def doVerify(vmethod, nEvidence, batch):
    if vmethod.__name__ == 'verifyStatementWithNEvidence':
        preds, palmResp = vmethod(batch, nEvidence)
    else:
        preds, palmResp = vmethod(batch)
    return preds, palmResp

In [None]:
testFactCheck(inFile = os.path.join(TESTDIR, 'tf_qa-dev.json'),
              selectFile = os.path.join(TESTDIR, 'tf_qa-dev_select1.json'),
              vmethod = verifyStatement,
              nEvidence = 5,
              outFile = os.path.join(RESULTDIR, 'triviaQA_dev_noEvidence.csv'),
              logFile = os.path.join(RESULTDIR, 'triviaQA_dev_noEvidence.log'),
              batchSize = 4,
              startIdx = 0,
              numProc = -1,
             )
print('all done')

# Results

Metrics are below.

## Interesting examples

PaLM does not seem to have a robust notion of what evidence is and how it relates to the statement it is asked to verify. PaLM tends to generate either statements without a source, or URLs without an explanation of why they are evidence for or against the statement. So asking PaLM for evidence to help explain its decisions is not useful.

### URLs only as evidence.
~~~
Rudolf Hess was the last inmate of Spandau jail in Berlin

1. Evidence for:
- [https://en.wikipedia.org/wiki/Rudolf_Hess](https://en.wikipedia.org/wiki/Rudolf_Hess)
- [https://www.britannica.com/biography/Rudolf-Hess](https://www.britannica.com/biography/Rudolf-Hess)

2. Evidence against:
- [https://en.wikipedia.org/wiki/Spandau_Prison
(https://en.wikipedia.org/wiki/Spandau_Prison)

3. <TRUE>
~~~

- https://en.wikipedia.org/wiki/Rudolf_Hess does not seem to indicate that Hess was the last inmate.
- https://www.britannica.com/biography/Rudolf-Hess says that he was the sole inmate from 1966, which implies he was the last one.
- https://en.wikipedia.org/wiki/Spandau_Prison does imply that Hess was the last inmate, but it is listed as evidence against that.

### Statements without source as evidence.

~~~
Treasure Island is the famous novel that Men Against the Sea and Pitcairn's Island were two sequels to.

1. Evidence for:
- Treasure Island is a famous novel.
- Men Against the Sea and Pitcairn's Island are sequels to Treasure Island.

2. Evidence against:
- Treasure Island is not a famous novel.
- Men Against the Sea and Pitcairn's Island are not sequels to Treasure Island.

3. Conclusion: <TRUE>
~~~

Part of the evidence for consists of restating the statement. The evidence against is the negation of the evidence for. This pattern is common.

The conclusion is also incorrect.

### Making up evidence

~~~
Babe Ruth was the Georgia Peach.

1. Evidence for:
    - https://en.wikipedia.org/wiki/Babe_Ruth
    - https://www.baseball-reference.com/players/r/ruthba01.shtml
2. Evidence against:
    - None
3. <TRUE>
~~~

The statement is false, and neither of the URLs presented as evidence mentions 'the Georgia Peach'.

### Fake urls

~~~
Wendy Darling played Tinker Bell in Steven Spielberg's Hook.

1. Evidence for:
    - [IMDB](https://www.imdb.com/title/tt0103777/)
    - [Wikipedia](https://en.wikipedia.org/wiki/Hook_(film))
2. Evidence against:
    - None
3. <TRUE>
~~~

The url https://www.imdb.com/title/tt0103777 refers to "Le batteur du boléro" which is not related to Hook or Spielberg or Wendy Darling.
Also, the answer is incorrect. 

The numbers in urls that have them seem to be incorrect. Another example: https://www.youtube.com/watch?v=--5--3--5--5




# Metrics

Precision, recall, F1 and the confusion matrix are computed for the 3 versions: no evidence, some evidence and more evidence.


In [182]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
import numpy as np

def computeMetrics(inFile):
    df = pd.read_csv(inFile, index_col=None, dtype=str)
    preds = df['palmAnswer'].tolist()
    trues = df['trueAnswer'].tolist()
    labels = ['True', 'False', 'Unknown']  # strings, because the CSV is read with dtype=str
    p,r,f,s = precision_recall_fscore_support(trues, preds, labels=labels, average=None,
                                             zero_division=np.nan)
    prf = np.array([p[:2], r[:2], f[:2]])
    print('Precision, recall, F1 for True, False')
    print(prf)
    cm = confusion_matrix(trues, preds, labels=labels)
    print('Confusion matrix for True, False, Unknown')
    print(cm)


In [40]:
# Version not asking for evidence

computeMetrics(os.path.join(RESULTDIR, 'triviaQA_dev_noEvidence.csv'))

Precision, recall, F1 for True, False
[[0.61754386 0.8034188 ]
 [0.88       0.46305419]
 [0.7257732  0.5875    ]]
Confusion matrix for True, False, Unknown
[[176  23   1]
 [109  94   0]
 [  0   0   0]]


In [42]:
# asking for evidence
computeMetrics(os.path.join(RESULTDIR, 'triviaQA_dev_3Evidence.csv'))

Precision, recall, F1 for True, False
[[0.54938272 0.76190476]
 [0.89       0.2364532 ]
 [0.67938931 0.36090226]]
Confusion matrix for True, False, Unknown
[[178  15   7]
 [146  48   9]
 [  0   0   0]]


In [57]:
# asking for more evidence

computeMetrics(os.path.join(RESULTDIR, 'triviaQA_dev_5Evidence.csv'))

Precision, recall, F1 for True, False
[[0.54938272 0.76190476]
 [0.89       0.2364532 ]
 [0.67938931 0.36090226]]
Confusion matrix for True, False, Unknown
[[178  15   7]
 [146  48   9]
 [  0   0   0]]


## Summary

True statements
~~~
            no evidence   3 evidence   5 evidence
precision   0.62          0.55         0.55
recall      0.88          0.89         0.89
f1          0.73          0.68         0.68
~~~

False statements
~~~
            no evidence   3 evidence   5 evidence
precision   0.80          0.76         0.76
recall      0.46          0.24         0.24
f1          0.59          0.36         0.36
~~~
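
The table values can be re-derived from the confusion matrices printed earlier. The sketch below does this for the no-evidence run in plain Python; `prf_from_confusion` is a helper introduced here for illustration and is not part of the notebook.

```python
def prf_from_confusion(cm: list[list[int]], idx: int) -> tuple[float, float, float]:
    """Precision, recall and F1 for class `idx`.

    Rows of `cm` are true labels and columns are predictions. The class must
    have at least one true and one predicted instance, otherwise the division fails.
    """
    tp = cm[idx][idx]
    predicted = sum(row[idx] for row in cm)  # column sum
    actual = sum(cm[idx])                    # row sum
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# No-evidence confusion matrix reported above (order: True, False, Unknown).
cm_no_evidence = [[176, 23, 1],
                  [109, 94, 0],
                  [0, 0, 0]]

for name, idx in (('True', 0), ('False', 1)):
    p, r, f = prf_from_confusion(cm_no_evidence, idx)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

For true statements, precision is 176 / (176 + 109) and recall is 176 / 200, which is where the 0.62 and 0.88 in the first table come from.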

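- Metrics are modest: F1 is at most 0.73 for true statements and 0.59 for false statements.
- There is no difference (in counts) between 3 evidence and 5 evidence. This would suggest that PaLM ignores the number requested, but the two runs may also have used the same prompt: at the time of these runs the function-name check in `doVerify` did not match `verifyStatementWithNEvidence`, so `nEvidence` was probably never passed and both runs fell back to "A few".
- Asking for evidence makes PaLM assign true (and, to a lesser extent, unknown) to more false statements: 146 instead of 109 are labelled true, and 9 instead of 0 are labelled unknown.
- It also moves some true statements that were labelled false to true or unknown: 23 are labelled false without evidence, 15 with evidence.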

So:
1. PaLM has a bias towards finding that statements are true
2. Asking for evidence for its decision tends to increase that bias
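
The bias can be put into numbers by comparing how often PaLM answers true with how often the statement really is true. A small sketch using the confusion matrices reported above; the helper names are introduced here for illustration only.

```python
# Confusion matrices reported above.
# Rows: true label (True, False, Unknown); columns: predicted (True, False, Unknown).
RUNS = {
    'no evidence': [[176, 23, 1], [109, 94, 0], [0, 0, 0]],
    '3 evidence': [[178, 15, 7], [146, 48, 9], [0, 0, 0]],
}

def share_actually_true(cm: list[list[int]]) -> float:
    """Fraction of all statements whose true label is True."""
    return sum(cm[0]) / sum(sum(row) for row in cm)

def share_predicted_true(cm: list[list[int]]) -> float:
    """Fraction of all statements that PaLM labelled True."""
    return sum(row[0] for row in cm) / sum(sum(row) for row in cm)

def share_of_false_accepted(cm: list[list[int]]) -> float:
    """Fraction of false statements that PaLM labelled True."""
    return cm[1][0] / sum(cm[1])

for name, cm in RUNS.items():
    print(f"{name}: actually true {share_actually_true(cm):.1%}, "
          f"predicted true {share_predicted_true(cm):.1%}, "
          f"false statements accepted {share_of_false_accepted(cm):.1%}")
```

About half of the statements are true, yet roughly 71% are labelled true without evidence and roughly 80% with evidence.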


# Examples from each bin

The cells below look at examples from each bin, where a bin is a combination of the true label, the answer without evidence and the answer with evidence (for example: true, assigned true without evidence, assigned true with evidence).

Observations:
- some of the statements are not correctly generated, especially those in the True-False-False group
- the bias of PaLM towards assigning true to statements is apparent.

In [81]:

def mergeDfs(leftFile, rightFile, labels=('_left', '_right')):
    """Join two result files on the statement text, suffixing the shared columns."""
    dfLeft = pd.read_csv(leftFile, index_col=None, dtype=str)
    dfRight = pd.read_csv(rightFile, index_col=None, dtype=str)
    return pd.merge(dfLeft, dfRight, how='inner', on='question', suffixes=labels)

mergedDF = mergeDfs(os.path.join(RESULTDIR, 'triviaQA_dev_noEvidence.csv'),
                    os.path.join(RESULTDIR, 'triviaQA_dev_3Evidence.csv'))


In [None]:
def getCountAndSamples(df, vals = ['True', 'True', 'True'], nsamples=10):
    filtDf = df[(df['trueAnswer_right'] == vals[0]) & (df['palmAnswer_left'] == vals[1]) \
        & (df['palmAnswer_right'] == vals[2])]
    print(f"Number of {vals} = {len(filtDf)}")
    ns = min(nsamples, len(filtDf))
    sample = filtDf.sample(n=ns)
    for i, r in sample.iterrows():
        print(f"{i}: {r['question']}")

In [None]:
getCountAndSamples(mergedDF, ['True', 'True', 'True'])

In [None]:
getCountAndSamples(mergedDF, ['True', 'True', 'False'])

# etc.
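
Rather than querying one bin at a time, all combinations can be counted in one pass. A minimal sketch with `collections.Counter` on hand-made rows; the rows are hypothetical and only show the shape of the data, they are not taken from the result files.

```python
from collections import Counter

# Hypothetical rows: (trueAnswer, palmAnswer without evidence, palmAnswer with evidence).
rows = [
    ('True', 'True', 'True'),
    ('True', 'False', 'True'),
    ('False', 'True', 'True'),
    ('False', 'False', 'True'),
    ('False', 'True', 'True'),
]

# Count how many rows fall into each (true, no evidence, evidence) bin.
bins = Counter(rows)
for combo, n in bins.most_common():
    print(combo, n)
```

With the merged frame, `mergedDF.groupby(['trueAnswer_right', 'palmAnswer_left', 'palmAnswer_right']).size()` should give the same kind of tabulation.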

# Inconsistent responses

In the above, PaLM was presented with either a true statement or a false statement
of the same form as the true statement. Here, it is fed both the true statement and
the false statement. Only one statement of each pair can be true. The experiment here is to
find out whether PaLM's answers respect that.

The true and false statements of each pair are presented sequentially, which may make
it easier for PaLM. The order of all the statements can be randomized later if need be.
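
If the order is randomized, the predictions have to be mapped back to the original positions before the rows are paired up. A minimal sketch of one way to do that; `shuffle_with_index` and `restore_order` are hypothetical helpers, not part of the notebook, and the fixed seed is an assumption made for repeatability.

```python
import random

def shuffle_with_index(statements: list[str], seed: int = 0) -> tuple[list[str], list[int]]:
    """Return the statements in random order plus each one's original position."""
    order = list(range(len(statements)))
    random.Random(seed).shuffle(order)
    return [statements[i] for i in order], order

def restore_order(preds: list, order: list[int]) -> list:
    """Put predictions made on the shuffled list back into the original order."""
    restored = [None] * len(preds)
    for pos, original_idx in enumerate(order):
        restored[original_idx] = preds[pos]
    return restored

stmts = ['true stmt 1', 'fake stmt 1', 'true stmt 2', 'fake stmt 2']
shuffled, order = shuffle_with_index(stmts, seed=1)
# Stand-in for the model: label a statement True when it starts with 'true'.
preds = [s.startswith('true') for s in shuffled]
print(restore_order(preds, order))
```

After restoring the order, even rows are again the true statements and odd rows the fake ones, which is what the pairing step further down relies on.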


In [None]:
testFactCheck(inFile = os.path.join(TESTDIR, 'tf_qa-dev.json'),
              selectFile = None,   # no select file -> verify both true and fake stmts
              vmethod = verifyStatement,  # asking for evidence does not help
              nEvidence = 5,
              outFile = os.path.join(RESULTDIR, 'triviaQA_TandF.csv'),
              logFile = os.path.join(RESULTDIR, 'triviaQA_TandF.log'),
              batchSize = 2,
              startIdx = 0,
              numProc = -1,
             )
print('all done')

## Computing results

Given the results from above:
- make each pair of rows into one row so that each row has one true and one false statement on the same topic
- label each row as:
  - True if both PaLM responses are correct
  - False if the responses are reversed compared to the true labels
  - Inconsistent if both responses are true
  - Unknown otherwise (both responses are false, or at least one is unknown)
    

In [178]:

def evalPairs(inFile, outFile):
    dfr = pd.read_csv(inFile, index_col=None, dtype=str)
    df_t = dfr.iloc[::2, :]
    nameMap = {x: f"{x}_f" for x in dfr.columns}
    df_f = dfr.drop(df_t.index).reset_index(drop=True).rename(columns=nameMap)
    df_t = df_t.reset_index(drop=True)
    df_tf = pd.concat([df_t, df_f], axis=1, ignore_index=False)
    numBadRows = len(df_tf[~(df_tf['trueAnswer'] == 'True')]) + len(df_tf[(df_tf['trueAnswer_f'] == 'True')])
    if numBadRows > 0:
        print('rows mixed up, num bad rows: ', numBadRows)
        return
    df_tf['pstatus'] = df_tf.apply(lambda r: getRowStatus(r), axis=1)
    print('True: ', len(df_tf[df_tf['pstatus'] == 'True']))
    print('False: ', len(df_tf[df_tf['pstatus'] == 'False']))
    print('Inconsistent: ', len(df_tf[df_tf['pstatus'] == 'Inconsistent']))
    print('Unknown: ', len(df_tf[df_tf['pstatus'] == 'Unknown']))
    df_tf.to_csv(outFile, index=False) 

def getRowStatus(r):
    if r['palmAnswer_f'] == 'False' and (r['palmAnswer'] == 'True'):
        return 'True'
    elif r['palmAnswer_f'] == 'True' and (r['palmAnswer'] == 'False'):
        return 'False'
    elif r['palmAnswer_f'] == 'True' and (r['palmAnswer'] == 'True'):
        return 'Inconsistent'
    else:
        return 'Unknown'

evalPairs(os.path.join(RESULTDIR, 'triviaQA_TandF.csv'),
          os.path.join(RESULTDIR, 'triviaQA_Pairs.csv'))



True:  293
False:  62
Inconsistent:  16
Unknown:  31


These results are better than the results obtained earlier. The earlier computations were repeated multiple times with the same results.

Rerun the no-evidence fact check to see if the performance matches this one.

In [None]:
testFactCheck(inFile = os.path.join(TESTDIR, 'tf_qa-dev.json'),
              selectFile = os.path.join(TESTDIR, 'tf_qa-dev_select1.json'),
              vmethod = verifyStatement,
              nEvidence = 5,
              outFile = os.path.join(RESULTDIR, 'triviaQA_dev_noEvidence_Q2.csv'),
              logFile = os.path.join(RESULTDIR, 'triviaQA_dev_noEvidence_Q2.log'),
              batchSize = 4,
              startIdx = 0,
              numProc = -1,
             )
print('all done')

In [183]:
# Version not asking for evidence. run a few days later

computeMetrics(os.path.join(RESULTDIR, 'triviaQA_dev_noEvidence_Q2.csv'))

Precision, recall, F1 for True, False
[[0.60526316 0.72592593]
 [0.805      0.48275862]
 [0.69098712 0.57988166]]
Confusion matrix for True, False, Unknown
[[161  37   2]
 [105  98   0]
 [  0   0   0]]


In [None]:
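The previous metrics were better (F1 of 0.73 and 0.59 for true and false statements, against 0.69 and 0.58 here). There seems to be instability in the results, even though the call is the same.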
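
One way to measure this instability directly is to count how many individual answers changed between two runs over the same statements. A minimal sketch; `disagreement` is a hypothetical helper and the two lists are made up for illustration.

```python
def disagreement(run_a: list, run_b: list) -> tuple[int, float]:
    """Number and fraction of positions where two runs gave different answers."""
    if len(run_a) != len(run_b):
        raise ValueError('runs must cover the same statements')
    n_diff = sum(a != b for a, b in zip(run_a, run_b))
    return n_diff, n_diff / len(run_a)

# Made-up answers for four statements from two runs.
first_run = ['True', 'False', 'Unknown', 'True']
second_run = ['True', 'True', 'Unknown', 'False']
print(disagreement(first_run, second_run))
```

Applied to the `palmAnswer` columns of the two no-evidence result files, this would show whether the change in the metrics comes from a few statements or from many.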

## Conclusions

Fact checking is meant to be a way to verify whether a statement is true. In addition, providing evidence is meant to help people gain confidence in the judgement of the system.

1. The metrics are relatively low
2. PaLM is biased to finding that statements are true
3. Asking for evidence for its decisions exacerbates this bias
4. PaLM makes up sources (urls) and arguments when asked for evidence
5. PaLM does not seem to grasp the concept of evidence for or against a statement.

