# Lie Detector
## General Description
#### In this project, the data contains a statement in the first column and, in the last column, a label saying whether that statement is a lie. In the cleaning code below, a `veracity` value of 0 is mapped to Lie (True) and any other value to Not Lie (False). The goal is to estimate P( Lie | Query ).




## We are using a Naive Bayesian Network

                          Lie (Class)
                        /   |    |   \
                      W₁   W₂   W₃  ... Wₙ (Words)

Lie (root node): Binary class (True = lie, False = truth).

Words (child nodes): Each word W<sub>i</sub> depends only on the class Lie.

Edges: All edges point from Lie to W<sub>i</sub>.

In [2]:
import pandas as pd
import numpy as np
import re

# Cleaning Data
- Drop all columns except the sentence and the label saying whether it is a lie.
- Store each row as a tuple: (list of words, whether it is a lie).

In [39]:
data = pd.read_csv('politifact_clean_binarized.csv')

In [40]:
# Training set
training_data = []
for i in range(0,11188-100): # Train on rows 0..11087 (11,088 statements); the remaining rows are kept for validation
  statement, veracity = data[['statement','veracity']].iloc[i].tolist()
  # Strip punctuation, then split the statement into words
  words = re.sub(r'[^\w\s]', '', statement).split()
  # A veracity of 0 marks the statement as a lie
  if veracity == 0:
    lie_tuple = (words,True)
  else:
    lie_tuple = (words,False)
  training_data.append(lie_tuple)

In [41]:
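# Validation set
validation_data = []
# Note: row 11087 is also the last training row, so this range yields 101 statements, one of which was seen in training
for i in range(11087,11188):
  statement, veracity = data[['statement','veracity']].iloc[i].tolist()
  # Strip punctuation, then split the statement into words
  words = re.sub(r'[^\w\s]', '', statement).split()
  # A veracity of 0 marks the statement as a lie
  if veracity == 0:
    lie_tuple = (words,True)
  else:
    lie_tuple = (words,False)
  validation_data.append(lie_tuple)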

## Current Approach: Words are treated as a bag-of-words (order-agnostic).

### Reason:

- Naive Bayes assumes conditional independence between words given the class. Modeling word order would introduce dependencies between words, which adds complexity and breaks this assumption.

- Simplifies computation and reduces dimensionality.
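
As a small illustration of what "order-agnostic" means here, the sketch below applies the same punctuation-stripping regex as the cleaning cells and compares two sentences that differ only in word order. The helper name `to_bag` and the sample sentences are made up for this example; the notebook itself stores plain word lists and only checks whether a word is present.

```python
import re
from collections import Counter

def to_bag(sentence):
    """Strip punctuation, split on whitespace, and count each word."""
    words = re.sub(r'[^\w\s]', '', sentence).split()
    return Counter(words)

# Two sentences with the same words in a different order (invented examples)
bag_a = to_bag("money, I must sign!")
bag_b = to_bag("I must sign money")

# The bags are equal, so the model cannot tell the two sentences apart
same_bag = bag_a == bag_b
print(same_bag)
```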

# Step-by-Step Math Calculation

## 1. Priors: P( Lie ) and P( Not Lie )

- Calculate class probabilities from training data:

  P(Lie) = Number of lies/Total statements
  
  P(Not Lie) = 1 - P(Lie)
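
The prior computation can be sketched as follows, assuming the same `(words, is_lie)` tuple layout used for `training_data`. The function name `class_priors` and the toy data are illustrative only and do not come from the real dataset.

```python
def class_priors(training_data):
    """Return (P(Lie), P(Not Lie)) from a list of (words, is_lie) tuples."""
    num_lie = sum(1 for _, is_lie in training_data if is_lie)
    p_lie = num_lie / len(training_data)
    return p_lie, 1 - p_lie

# Toy data, invented for illustration only
toy_data = [
    (["must", "absolutely"], True),
    (["money", "work"], False),
    (["money", "contract"], True),
    (["work", "promise"], False),
    (["must"], True),
]

p_lie, p_not_lie = class_priors(toy_data)
print(p_lie, p_not_lie)
```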

## 2. Likelihoods: P( W<sub>i</sub> | Lie ) and P( W<sub>i</sub> | Not Lie )

For each word W<sub>i</sub> in the input query:

- **Lie Class**:

  P( W<sub>i</sub> | Lie ) = Count of lies containing W<sub>i</sub> / Total lies

- **Not Lie Class**:

  P( W<sub>i</sub> | Not Lie ) = Count of truths containing W<sub>i</sub> / Total truths
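
A minimal sketch of these two ratios, again assuming `(words, is_lie)` tuples. The name `word_likelihoods` and the toy data are hypothetical, and the sketch assumes both classes appear at least once (otherwise the division fails).

```python
def word_likelihoods(training_data, word):
    """Return (P(word | Lie), P(word | Not Lie)) as fractions of statements containing the word."""
    lies = [words for words, is_lie in training_data if is_lie]
    truths = [words for words, is_lie in training_data if not is_lie]
    # Each statement counts at most once, no matter how often the word repeats in it
    p_given_lie = sum(1 for words in lies if word in words) / len(lies)
    p_given_truth = sum(1 for words in truths if word in words) / len(truths)
    return p_given_lie, p_given_truth

# Toy data, invented for illustration only
toy_data = [
    (["must", "absolutely"], True),
    (["money", "work"], False),
    (["money", "contract"], True),
    (["work", "promise"], False),
    (["must"], True),
]

p_money_lie, p_money_truth = word_likelihoods(toy_data, "money")
print(p_money_lie, p_money_truth)
```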


## 3. Joint Probability of Query Given Class

- Assume independence between words:

  P( Query | Lie ) = product of P( W<sub>i</sub> | Lie ) over all words W<sub>i</sub> in the query

  P( Query | Not Lie ) = product of P( W<sub>i</sub> | Not Lie ) over all words W<sub>i</sub> in the query
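
Under the independence assumption, the query likelihood is just the product of the per-word likelihoods. The sketch below uses `math.prod` (Python 3.8+); the function name `query_likelihood` and the input values are placeholders chosen for illustration.

```python
import math

def query_likelihood(word_probs):
    """Multiply per-word likelihoods P(Wi | Class) into P(Query | Class)."""
    return math.prod(word_probs)

# Hypothetical per-word likelihoods for a two-word query
p_query_given_lie = query_likelihood([0.6, 0.5])
print(p_query_given_lie)
```

A single word with likelihood 0 makes the whole product 0, which is why a word that never appears in one class dominates the result.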


## 4. Posterior Probability (Bayes’ Theorem)

P ( Query ) =  P( Query | Lie ) * P( Lie ) + P( Query | Not Lie ) * P( Not Lie )

P(Lie|Query) = P( Query | Lie ) * P( Lie ) / P ( Query )
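
The two formulas above combine into a short function. This is a sketch, not the notebook's implementation: the name `posterior_lie` is hypothetical, and returning -1 when the evidence is zero simply mirrors the notebook's convention for "unknown".

```python
def posterior_lie(p_query_given_lie, p_query_given_truth, p_lie):
    """Apply Bayes' theorem to get P(Lie | Query)."""
    p_truth = 1 - p_lie
    # P(Query) by the law of total probability
    evidence = p_query_given_lie * p_lie + p_query_given_truth * p_truth
    if evidence == 0:
        # Neither class can explain the query, so no posterior is defined
        return -1
    return p_query_given_lie * p_lie / evidence

# Hypothetical inputs: the query is twice as likely under Lie, with equal priors
print(posterior_lie(0.2, 0.1, 0.5))
```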

# Example Calculation

From the code’s test case:  
**Query:** `["absolutely", "my"]`  

Assume training data has:  

- Total lies (`numLie`) = 5,000  
- Total truths (`numNotLie`) = 6,188  
- **absolutely** appears in 3,000 lies and 1,000 truths  
- **my** appears in 2,500 lies and 4,000 truths  

## Step 1: Priors  

P(Lie) = 5000 / 11188 = 0.447

P(Not Lie) = 1 - 0.447 = 0.553

## Step 2: Likelihoods  

P( absolutely | Lie ) = 3000 / 5000 = 0.6

P( absolutely | Not Lie) = 1000/6188 = 0.162

P( my | Lie) = 2500 / 5000 = 0.5

P( my | Not Lie) = 4000 / 6188 = 0.646


## Step 3: Joint Probabilities  

P(Query | Lie) = 0.6 * 0.5 = 0.3

P(Query | Not Lie) = 0.162 * 0.646 = 0.105


## Step 4: Posterior  

P(Lie | Query) = 0.3 * 0.447 / (0.3* 0.447 + 0.105 * 0.553) = 0.698
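
The arithmetic above can be checked with a few lines of Python using the same assumed counts. Without rounding the intermediate values, the posterior comes out at about 0.699 rather than 0.698; the small gap comes from the three-decimal rounding in the steps above.

```python
# Assumed counts from the worked example (not measured from the real dataset)
num_lie, num_truth = 5000, 6188
counts = {"absolutely": (3000, 1000), "my": (2500, 4000)}  # (lies, truths) containing the word

# Step 1: priors
p_lie = num_lie / (num_lie + num_truth)
p_truth = 1 - p_lie

# Steps 2 and 3: per-word likelihoods multiplied into the query likelihood
p_query_given_lie = 1.0
p_query_given_truth = 1.0
for in_lies, in_truths in counts.values():
    p_query_given_lie *= in_lies / num_lie
    p_query_given_truth *= in_truths / num_truth

# Step 4: posterior via Bayes' theorem
posterior = p_query_given_lie * p_lie / (
    p_query_given_lie * p_lie + p_query_given_truth * p_truth
)
print(f"{posterior:.3f}")
```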



In [63]:
def calculate_lie_probability(training_data, query):
    """
    Calculate the probability of an input statement being a lie.

    Args:
        training_data (list of tuples): Each tuple contains a list of words and a boolean indicating
                                        if the list is lie (True) or not lie (False).
        query (list of str): Words in the input statement to evaluate.

    Returns:
        float: Probability that the input statement is lie.
    """
    # Keep only the query words that occur in at least one training statement
    valid_query = []
    for i in query:
        for words, isLie in training_data:
            if i in words:
                valid_query.append(i)
                break
    # No known word at all: the model cannot make a prediction
    if not valid_query:
        return -1
    query = valid_query

    # The class counts do not depend on the query word, so compute them once
    numLie = sum(1 for words, isLie in training_data if isLie)
    numNotLie = len(training_data) - numLie

    # Calculate P(Wi|Lie) and P(Wi|notLie) for every word Wi in the query
    P_Wi_given_lie_list = []
    P_Wi_given_not_lie_list = []
    for i in query:
        numWgivenLie = 0
        numWgivenNotLie = 0
        for words, isLie in training_data:
            if i in words:
                if isLie:
                    numWgivenLie += 1
                else:
                    numWgivenNotLie += 1
        # Store P(Wi|Lie)
        P_Wi_given_lie_list.append(numWgivenLie/numLie)
        # Store P(Wi|notLie)
        P_Wi_given_not_lie_list.append(numWgivenNotLie/numNotLie)

    # Calculate the priors P(Lie) and P(notLie)
    PLie = numLie / len(training_data)
    PNotLie = 1 - PLie

    # Calculate P(W*|Lie)
    P_W_given_Lie = 1
    for i in P_Wi_given_lie_list:
        P_W_given_Lie *= i

    # Calculate P(W*|notLie)
    P_W_given_not_Lie = 1
    for i in P_Wi_given_not_lie_list:
        P_W_given_not_Lie *= i

    # Neither class has seen every query word, so no posterior can be computed
    if P_W_given_Lie == 0 and P_W_given_not_Lie == 0:
        return -1

    # Calculate P_lie_given_query
    P_lie_given_query = P_W_given_Lie * PLie / (P_W_given_Lie * PLie + P_W_given_not_Lie * PNotLie)

    return P_lie_given_query

## Handling Unknown Words
Words not present in the training data are removed from the query. If all words are unknown, the model returns -1 (unknown).
#### Motivation:

- Simplifies the problem by ignoring words that the model hasn’t learned to associate with lies/truths.


## OOV Problem & Short Statements
### Current Handling:

- Delete OOV (out-of-vocabulary) words. For short statements, this risks leaving only common function words (e.g., “the”, “is”), which are uninformative.

- Example: Query ["wn", "m"] returns -1 (printed as -100.00%) because it contains no valid words.

### Possible Better Approaches:

- Minimum Word Threshold: Require a minimum number of valid words to make predictions.

- New Token: Introduce a new token `<UNK>` that represents all invalid (unseen) words.
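
Both ideas can be sketched as small preprocessing helpers. Everything here is hypothetical: the function names, the default threshold of 2, and the toy vocabulary are assumptions for illustration, and the `<UNK>` variant would only be useful if rare training words were also mapped to `<UNK>` so that the token has likelihoods of its own.

```python
def filter_known_words(query, vocabulary, min_valid_words=2):
    """Keep known words; return None if fewer than min_valid_words remain."""
    known = [w for w in query if w in vocabulary]
    if len(known) < min_valid_words:
        return None  # too little evidence to make a prediction
    return known

def replace_unknown(query, vocabulary, unk_token="<UNK>"):
    """Map every out-of-vocabulary word to a shared placeholder token."""
    return [w if w in vocabulary else unk_token for w in query]

# Toy vocabulary, invented for illustration only
vocabulary = {"must", "absolutely", "money", "contract", "work"}

print(filter_known_words(["wn", "m"], vocabulary))
print(filter_known_words(["absolutely", "xyz", "money"], vocabulary))
print(replace_unknown(["absolutely", "xyz", "money"], vocabulary))
```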

## Rare Words & Generalization
### Problem:

Words appearing very few times in training data (e.g., 1–2 occurrences) lead to noisy estimates of
P( W<sub>i</sub> ∣ Class ).


- Example: If “absolutely” appears once in lies and never in truths, then
P( absolutely ∣ Not Lie ) = 0, so a query consisting of that word alone gets P( Lie ∣ Query ) = 1, which is unreliable.

### Possible Solution
- Word Embeddings:

  Replace bag-of-words with embeddings (e.g., Word2Vec) to capture semantic similarity.

  Advantage: Generalizes to synonyms (e.g., “money” ↔ “cash”).
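
To make the synonym idea concrete, the sketch below compares toy word vectors with cosine similarity. The three-dimensional vectors are invented for this example and are not real Word2Vec output; in practice the vectors would come from a trained embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only
toy_vectors = {
    "money": [0.9, 0.1, 0.3],
    "cash": [0.8, 0.2, 0.4],
    "promise": [0.1, 0.9, 0.2],
}

sim_money_cash = cosine_similarity(toy_vectors["money"], toy_vectors["cash"])
sim_money_promise = cosine_similarity(toy_vectors["money"], toy_vectors["promise"])
print(sim_money_cash > sim_money_promise)
```

With vectors like these, an unseen word such as “cash” could borrow evidence from its nearest known neighbour instead of being dropped.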


In [47]:
def determine_lie(training_data,query):
    """Print a verdict for the query based on P(Lie | Query)."""
    result = calculate_lie_probability(training_data, query)
    if result >= 0.5:
        print("Lie!")
    elif 0 <= result < 0.5:
        print("Truth")
    else:
        # result is -1: no known words, or both class likelihoods are zero
        print("Hmm, I have never seen anything like this before. I may need to upgrade my training data in the future. For now, try another sentence.")

In [48]:
def userInput():
    uInput = input("Please enter a sentence: ")
    arr = uInput.split()
    return arr

In [53]:
#Test user input function
#I will absolutely sign this contract for money
query = userInput()
while query != ["stop"]:
    result = determine_lie(training_data, query)
    query = userInput()


#### The cells below are only for testing the determine_lie function

In [65]:
#  Test with validation set
total_count = len(validation_data)
correct_count = 0
for sentence, veracity in validation_data:
  prob = calculate_lie_probability(training_data, sentence)
  if prob >= 0.5:
    if veracity:
      correct_count += 1
  elif prob >= 0 and prob < 0.5:
    if not veracity:
      correct_count += 1
  else:
    # prob is -1: the model could not score this statement, so exclude it from the total
    total_count -= 1
accuracy = correct_count / total_count
print(f"The accuracy obtained from the validation set is {accuracy * 100:.2f}% .")

The accuracy obtained from the validation set is 58.43% .


In [35]:
#  Test queries. Uncomment the toy training_data below to test against a small hand-made dataset instead of the real one.
#  training_data = [(['must', 'absolutely'], True),
#  (['must', 'work', 'promise', 'absolutely'], False),
#  (['money', 'contract'], True),
#  (['money', 'work'], False),
#  (['must'], True),
#  (['must', 'money', 'contract'], True),
#  (['must', 'money', 'contract', 'work'], True),
#  (['project', 'absolutely'], False),
#  (['work', 'absolutely'], False),
#  (['work', 'promise'], False)]

query = ["wn", "m"]
prob = calculate_lie_probability(training_data, query)
print(f"This sentence is {prob * 100:.2f}% likely to be a lie.")
result = determine_lie(training_data, query)

query = ["absolutely", "my"]
prob = calculate_lie_probability(training_data, query)
print(f"This sentence is {prob * 100:.2f}% likely to be a lie.")
result = determine_lie(training_data, query)

query = ['absolutely', 'work']
prob = calculate_lie_probability(training_data, query)
print(f"This sentence is {prob * 100:.2f}% likely to be a lie.")
result = determine_lie(training_data, query)

query = ['must', 'money']
prob = calculate_lie_probability(training_data, query)
print(f"This sentence is {prob * 100:.2f}% likely to be a lie.")
result = determine_lie(training_data, query)

query = ['money', 'must']
prob = calculate_lie_probability(training_data, query)
print(f"This sentence is {prob * 100:.2f}% likely to be a lie.")
result = determine_lie(training_data, query)

query = ["absolutely"]
prob = calculate_lie_probability(training_data, query)
print(f"This sentence is {prob * 100:.2f}% likely to be a lie.")
result = determine_lie(training_data, query)

query = ["contract", "money", "absolutely"]
prob = calculate_lie_probability(training_data, query)
print(f"This sentence is {prob * 100:.2f}% likely to be a lie.")
result = determine_lie(training_data, query)

query = ["I", "will", "absolutely","sign", "this", "contract","for", "money"]
prob = calculate_lie_probability(training_data, query)
print(f"This sentence is {prob * 100:.2f}% likely to be a lie.")
result = determine_lie(training_data, query)

This sentence is -100.00% likely to be a lie.
Hmm, I have never seen anything like this before. I may need to upgrade my training data in the future. For now, try another sentence.
This sentence is 59.86% likely to be a lie.
Lie!
This sentence is 68.17% likely to be a lie.
Lie!
This sentence is 61.46% likely to be a lie.
Lie!
This sentence is 61.46% likely to be a lie.
Lie!
This sentence is 60.00% likely to be a lie.
Lie!
This sentence is 60.87% likely to be a lie.
Lie!
This sentence is 74.68% likely to be a lie.
Lie!


## Benchmarking & Performance Analysis

On the validation set, the accuracy is 58.43%.

### Compare against baselines:

- Random Guessing: Accuracy ≈ 50% (balanced data).

- Always Predict Lie: Accuracy = % of lies in test data.
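
The "Always Predict Lie" baseline can be computed directly from the labels. The sketch below assumes a list of booleans such as `[is_lie for _, is_lie in validation_data]`; the function name and the toy labels are illustrative and are not the real validation labels.

```python
def constant_baseline_accuracy(labels, predict_lie=True):
    """Accuracy obtained by predicting the same class for every statement."""
    correct = sum(1 for is_lie in labels if bool(is_lie) == predict_lie)
    return correct / len(labels)

# Toy labels, invented for illustration only (True = lie)
toy_labels = [True, True, False, True, False]

always_lie_accuracy = constant_baseline_accuracy(toy_labels)
always_truth_accuracy = constant_baseline_accuracy(toy_labels, predict_lie=False)
print(always_lie_accuracy, always_truth_accuracy)
```

Comparing the model against the larger of these two numbers is a stricter test than comparing against random guessing.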

### The model does slightly better than the baselines