# 5 - Ground Truth Verification

For now I'm skipping the step of building a semantic vector space from Reddit comments and going straight to ground-truth verification against search results.

* [pistocop/subreddit-comments-dl: Download subreddit comments](https://github.com/pistocop/subreddit-comments-dl)
* [serpapi documentation on PyPI](https://pypi.org/project/serpapi/)
* [My SerpApi Dashboard](https://serpapi.com/searches)
* [Google Gemini](https://gemini.google.com/app/aa2876187db79f27) notes.

To run **all-MiniLM-L6-v2** locally: [StackOverflow](https://stackoverflow.com/questions/65419499/download-pre-trained-sentence-transformers-model-locally).

In [None]:
# IMPORTS, KEYS, MODELS
from serpapi import GoogleSearch
import json
from collections import Counter
import re
from sentence_transformers import SentenceTransformer, util

# Load API key (the context manager closes the file handle)
with open("../apikeys.json") as f:
    keys = json.load(f)
SERP_API_KEY = keys["SerpApi"]["key"]

# Load the model from a local copy (the directory name indicates all-mpnet-base-v2).
# all-MiniLM-L6-v2 is a smaller, faster alternative:
# model = SentenceTransformer('all-MiniLM-L6-v2')
model = SentenceTransformer('../models/BERT-all-mpnet-base-v2')

# SAVE MODEL (for offline use)
# model.save('../models/BERT-all-mpnet-base-v2')
# THEN TO LOAD IT AGAIN, PASS THE SAME PATH:
# model = SentenceTransformer('../models/BERT-all-mpnet-base-v2')

## Loading Results & Getting Rid of Duplicates

The first step reuses earlier code to load the saved query results and count how often each proverb repeats. After that, we can work on merging similar proverbs.

In [2]:
# LOAD FILE
with open('responses/di-5000-4.json', 'r') as f:
    data = json.load(f)

# EXTRACT JUST THE TEXT FROM RESULTS
texts = [entry['text'] for entry in data]

# COUNT OCCURRENCES
counts = Counter(texts)

# FIND PROVERBS THAT APPEARED MORE THAN ONCE
duplicates = {text: count for text, count in counts.items() if count > 1}
print(f"Unique Proverbs: {len(counts)}")
print(f"Total Repetitions: {sum(duplicates.values()) - len(duplicates)}")

# TO SEE REPETITIONS
# print("\nMost Frequent Repetitions:")
# for text, count in sorted(duplicates.items(), key=lambda x: x[1], reverse=True)[30:40]:
#     print(f"[{count}x] {text}")

Unique Proverbs: 1059
Total Repetitions: 3941


In [None]:
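# ANOTHER WAY TO DEDUPE (no counts are generated):
# deduped = list(set(texts))

# KEEP ONLY THE PROVERBS THAT APPEARED MORE THAN ONCE (from the Counter above)
deduped = list(duplicates.keys())

# SORT BY LENGTH SO THE SHORTEST VERSION IS OUR "ANCHOR"
# NOTE: this rebinds `data` from the raw JSON records to a list of strings
data = sorted(deduped, key=len)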


In [9]:
len(duplicates)

246

In [6]:
unique_proverbs = []
threshold = 0.60  # Aggressive grouping: lower values merge more variants

# Encode every candidate once instead of re-encoding the accepted list on each pass
embeddings = model.encode(data, convert_to_tensor=True)
unique_idx = []  # positions in `data` of the accepted proverbs

for i, current in enumerate(data):
    if unique_idx:
        # Compare the current sentence against our accepted unique list
        scores = util.cos_sim(embeddings[i], embeddings[unique_idx])[0]
        if scores.max().item() >= threshold:
            continue  # too similar to something we already have
    
    # Not similar to anything we already have, so add it
    unique_idx.append(i)
    unique_proverbs.append(current)

# Display results
print(f"--- Original Count: {len(data)} | Final Count: {len(unique_proverbs)} ---")
for p in unique_proverbs:
    print(f"✓ {p}")

--- Original Count: 246 | Final Count: 37 ---
✓ Don't read the comments.
✓ The internet is forever so be careful what you post.
✓ If you're not paying for the product, you are the product.
✓ The internet is forever and nothing is ever truly deleted.
✓ If you lurk on the internet long enough, eventually you see yourself in a post.
✓ If you lurk long enough on any online community, you'll eventually see someone get banned.
✓ If you lurk long enough on any online community, you'll eventually see someone mention their ex.
✓ If you lurk long enough on any online community, you'll eventually see someone mention their grandma.
✓ If you lurk long enough on any online community, you'll eventually see a variation of the same 10 arguments.
✓ If you lurk long enough on any online community, you'll eventually see someone compare themselves to Hitler.
✓ If you lurk on the internet long enough, eventually you'll see a variation of your own personality disorder.
✓ If you lurk l

In [None]:
### OLDER METHOD BELOW

# Process the proverbs to filter out similar meanings
# NOTE: `proverbs` is not defined anywhere in this notebook; this cell assumes
# a list of proverb strings left over from an earlier version.

uniques = []

# 0.75 is the stricter value considered for "same meaning";
# this run uses 0.50 to group more aggressively.
threshold = 0.50

for proverb in proverbs:
    if not uniques:
        uniques.append(proverb)
        continue
    
    # Encode current sentence and existing unique ones
    current_embedding = model.encode(proverb)
    unique_embeddings = model.encode(uniques)
    
    # Calculate similarities between current sentence and all saved unique ones
    cosine_scores = util.cos_sim(current_embedding, unique_embeddings)[0]
    
    # If the highest similarity score is below our threshold, it's a "new" proverb
    if max(cosine_scores) < threshold:
        uniques.append(proverb)

# Output the results

for proverb in uniques:
    print(f" - {proverb}")

## Search / Validate

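Having whittled down the responses from the LLM to a manageable number of unique proverbs, we can now use SerpApi to search for each proverb as an exact phrase and see how many results come back. If there are results, we can assume the proverb is in actual use; if there are almost none, it may be a coinage of the LLM, which is what the function below flags as a novel maxim.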

In [None]:
def verify_external_existence(phrase):
    """
    Checks the web for the phrase and looks for 'canonical' markers.
    """
    # 1. Total Hit Count Check (Exact Phrase)
    params = {
        "q": f'"{phrase}"',  # Quoted for exact match
        "engine": "google",
        "api_key": SERP_API_KEY
    }
    
    search = GoogleSearch(params)
    results = search.get_dict()
    
    # Extract total results (Google hit count)
    total_results = results.get("search_information", {}).get("total_results", 0)
    
    # 2. Targeted Site Check (Dictionary & Folklore sites)
    # We check if the phrase appears on known authority sites
    authority_sites = ["oxfordreference.com", "phrases.org.uk", "theidioms.com"]
    site_query = f'"{phrase}" site:' + " OR site:".join(authority_sites)
    
    site_params = {**params, "q": site_query}
    site_search = GoogleSearch(site_params)
    site_results = site_search.get_dict()
    
    authority_count = site_results.get("search_information", {}).get("total_results", 0)
    
    # 3. Novelty Logic
    # High LLM Stability + Low Search Hits = A Discovery
    if total_results < 1000 and authority_count == 0:
        return {
            "verdict": "üåü NOVEL MAXIM",
            "hits": total_results,
            "details": "High consensus in AI, but virtually zero footprint in human dictionaries."
        }
    elif authority_count > 0:
        return {
            "verdict": "üìö DOCUMENTED PROVERB",
            "hits": total_results,
            "details": f"Found on {authority_count} authority websites."
        }
    else:
        return {
            "verdict": "üåê COMMON IDIOM",
            "hits": total_results,
            "details": "Frequently used online but not officially documented as a proverb."
        }

# Example Test
# discovery = verify_external_existence("The data is the new oil of the digital soul")

In [None]:
# Named d1 rather than `display` to avoid shadowing IPython's built-in display()
d1 = verify_external_existence("The internet is forever and nothing is ever really deleted.")

In [None]:
d2 = verify_external_existence("If you lurk long enough on any online community, you'll eventually see yourself in a post.")

In [None]:
print(d1)
print(d2)
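
Checking proverbs one cell at a time spends two SerpApi searches per call, and rerunning a cell spends them again. A minimal sketch of a batch runner that caches verdicts on disk might look like the following. The helper name `verify_all` and the cache path `responses/verified.json` are hypothetical, not part of the notebook so far; the `verify` argument is meant to be `verify_external_existence` from above.

```python
import json
from pathlib import Path


def verify_all(phrases, verify, cache_path="responses/verified.json"):
    """Run `verify` once per phrase, reusing any verdicts already saved on disk.

    `verify` is any function that takes a phrase and returns a JSON-serializable
    dict. The cache path is an assumed location, not one the notebook defines.
    """
    cache_file = Path(cache_path)
    if cache_file.exists():
        cache = json.loads(cache_file.read_text(encoding="utf-8"))
    else:
        cache = {}

    for phrase in phrases:
        if phrase in cache:
            continue  # already verified in an earlier run: no new search
        cache[phrase] = verify(phrase)
        # Write after every phrase so an interrupted run keeps its progress
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(
            json.dumps(cache, indent=2, ensure_ascii=False), encoding="utf-8"
        )

    return {phrase: cache[phrase] for phrase in phrases}


# Intended use in this notebook (left commented out because it calls the API):
# results = verify_all(unique_proverbs, verify_external_existence)
```

Saving after each phrase rather than once at the end is a deliberate choice here: if the run stops partway through, the searches already paid for are not lost.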

The top 10 results from the run of 5000 queries on Llama-4[^1] really contain only three distinct proverbs. The most repeated one occupies two of the top 10 slots:

[1139x] If you're not paying for the product, you are the product.  
[219x] If you're not paying for the product, then you are the product.

So the question is whether there is a way to recognize two or more proverbs that are semantically the same but lexically different. By hand, I chose the shortest version of each of the next two proverbs in the top 10. Together, the three proverbs account for all of the top 10 results (each bracketed count below is one variant's slot in the top 10):


[62x][253x][558x][73x][70x][66x] The internet is forever and nothing is ever really deleted.  
[56x][54x] If you lurk long enough on any online community, you'll eventually see yourself in a post.

So the plan is to find all the proverbs that mean the same thing, choose the shortest version as the representative, and search on that.
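
The merging step can be sketched as follows: walk the proverbs from shortest to longest, fold each one into the first existing anchor it resembles, and add up the counts. This is only a sketch under stated assumptions. The function names and the 0.85 threshold are hypothetical, and the default similarity is a lexical stand-in from the standard library (`difflib`) so that the example runs without a model. To catch variants that are semantically the same but lexically different, the `similarity` argument would need to be swapped for an embedding-based cosine similarity. The first two counts come from the run above; the third is made up for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def lexical_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] based on shared character runs.

    This is NOT semantic; it is an assumption made so the sketch runs offline.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_variants(counts, similarity=lexical_similarity, threshold=0.85):
    """Fold each proverb's count into the shortest sufficiently similar variant."""
    merged = Counter()
    # Shortest first, so the shortest wording of each group becomes the anchor
    for text in sorted(counts, key=len):
        anchor = next(
            (a for a in merged if similarity(text, a) >= threshold), None
        )
        merged[text if anchor is None else anchor] += counts[text]
    return merged


counts = Counter({
    "If you're not paying for the product, you are the product.": 1139,
    "If you're not paying for the product, then you are the product.": 219,
    "Don't read the comments.": 5,  # illustrative count, not from the run
})

merged = merge_variants(counts)
for text, count in merged.most_common():
    print(f"[{count}x] {text}")
```

The keys of `merged` are the shortest variants, so they can be passed straight to the search step, and the summed counts give a fairer picture of how dominant each proverb really is.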

[^1]: The complete descriptor is Llama-4-Maverick-17B-128E-Instruct-FP8.