# Cosine Similarity as a consensus metric

Cosine similarity measures how closely two vectors point in the same direction, regardless of their magnitude. The idea is to represent the extracted triplets in a vector space and check whether the different models produce similar output.
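
As a minimal sketch of the measure itself: cosine similarity is the dot product of two vectors divided by the product of their norms. The helper name `cosine` and the toy vectors are illustrative and not part of the notebook; returning 0.0 when either vector has zero length is a convention assumed here.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors; 0.0 if either has zero length."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0
    return float(np.dot(u, v) / denom)

same_direction = cosine([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine([1, 0], [0, 1])            # perpendicular vectors
opposite = cosine([1, 0], [-1, 0])             # opposite directions
print(same_direction, orthogonal, opposite)
```

Scaling a vector does not change the result, which is why only direction matters for this metric.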

For the following discussion, consider a document that contains three pieces of knowledge that can be represented as triplets A, B, and C. We will use triplets D, E, F to represent hallucinations.

## Issues:
- Entities are often names.
- The same entity can be extracted under different names.
- A triplet has three components, so we can't take its cosine similarity directly.
- Not all models extract the same number of triplets. Some models might extract the same piece of information multiple times, while others could miss it entirely.
- We don't know which of the extracted triplets map onto which piece of knowledge.




In [None]:
!wget https://github.com/hishamad/SEBx-KG-construction/raw/main/output/gpt_triplets.json
!wget https://github.com/hishamad/SEBx-KG-construction/raw/main/output/llama_triplets.json
!wget https://github.com/hishamad/SEBx-KG-construction/raw/main/output/mixtral_triplets.json


In [None]:
!pip install gensim
from gensim.models import KeyedVectors
import numpy as np
import json
import math
from sklearn.metrics.pairwise import cosine_similarity





In [None]:
with open('gpt_triplets.json', 'r') as f:
    gpt_data = json.load(f)

with open('llama_triplets.json', 'r') as f:
    llama_data = json.load(f)

with open('mixtral_triplets.json', 'r') as f:
    mixtral_data = json.load(f)


# Number of documents processed by each model.
print(len(gpt_data))
print(len(llama_data))
print(len(mixtral_data))

# First document's triplets from each model.
print(gpt_data[0])
print(llama_data[0])
print(mixtral_data[0])

301
300
301
{'subject': ['CapGen Capital Group VI LP', 'CapGen Capital Group VI LP'], 'subject_type': ['ORG', 'ORG'], 'relationship': ['Invests_In', 'Has'], 'object': ['Union Bankshares Corp', 'Stake', 'Union Bankshares Corp'], 'object_type': ['COMP', 'FIN_INSTRUMENT']}
{'subject': ['CAPGEN CAPITAL GROUP VI LP', 'CAPGEN CAPITAL GROUP VI LP', 'CAPGEN CAPITAL GROUP', 'Union Bankshares Corp'], 'subject_type': ['ORG', 'ORG', 'ORG', 'COMP'], 'relationship': ['Has', 'Announce', 'Participates_In', 'Operate_In'], 'object': ['Union Bankshares Corp', '7.3 Pct Stake in Union Bankshares Corp', "Union Bankshares Corp's Management", 'Union Bankshares Corp'], 'object_type': ['COMP', 'FIN_INSTRUMENT', 'ORG', 'COMP']}
{'subject': ['Union Bankshares Corp', 'CAPGEN Capital Group VI LP', 'CAPGEN Capital Group VI LP', 'CAPGEN Capital Group VI LP', 'CAPGEN Capital Group VI LP', 'CAPGEN Capital Group VI LP', 'CAPGEN Capital Group VI LP'], 'subject_type': ['COMP', 'COMP', 'COMP', 'COMP', 'COMP', 'COMP', 'COMP

# The word2vec model
The pre-trained Google News word2vec model contains vectors for **3 million words and phrases**, each with **300 dimensions**. It was trained on approximately 100 billion words from Google News.

In [None]:
w2v_model = KeyedVectors.load_word2vec_format('/content/drive/MyDrive/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [None]:
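def prep_word_vec(model, word):
  """Return the mean word2vec vector of the whitespace-separated words in `word`.

  Words missing from the model's vocabulary are ignored. If no word is found,
  the result is the 300-dimensional zero vector.
  """
  # split() only breaks on whitespace, so an underscore-joined label such as
  # 'Invests_In' is looked up as the single token 'invests_in'.
  words = word.lower().split()
  output_vector = np.zeros(300)
  word_count = 0
  for w in words:
    if w in model:
      output_vector += model[w]
      word_count += 1

  if word_count == 0:
    return np.zeros(300)

  return output_vector/word_count


def process_triplet(model, subject, relation, obj):
  """Embed each component of a triplet separately and return the three vectors."""
  subject_vector = prep_word_vec(model, subject)
  relation_vector = prep_word_vec(model, relation)
  object_vector = prep_word_vec(model, obj)

  return subject_vector, relation_vector, object_vector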
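
To make the averaging concrete without loading the full model, here is a small sketch that uses a hypothetical three-dimensional vocabulary in place of word2vec. The names `toy_vectors`, `mean_vector` and `triplet_vector` are illustrative only. It shows two consequences of averaging: swapping subject and object yields the same triplet vector, and text with no in-vocabulary words maps to the zero vector. In this toy vocabulary `Invests_In` is a single unknown token; whether the same happens with the real model depends on its vocabulary.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for word2vec vectors.
toy_vectors = {
    "capgen": np.array([1.0, 0.0, 0.0]),
    "capital": np.array([0.0, 1.0, 0.0]),
    "has": np.array([0.0, 0.0, 1.0]),
    "stake": np.array([1.0, 1.0, 0.0]),
}

def mean_vector(vectors, text, dim=3):
    """Average the vectors of the in-vocabulary words of `text`."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

def triplet_vector(vectors, subject, relation, obj):
    """Average of the three component vectors of a triplet."""
    parts = [mean_vector(vectors, t) for t in (subject, relation, obj)]
    return np.mean(parts, axis=0)

forward = triplet_vector(toy_vectors, "CapGen Capital", "Has", "Stake")
swapped = triplet_vector(toy_vectors, "Stake", "Has", "CapGen Capital")
unknown = mean_vector(toy_vectors, "Invests_In")
print(forward, swapped, unknown)
```

Because the average ignores order, a triplet and its reverse are indistinguishable under this representation.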


In [None]:
# Running sum of the per-document scores; averaged after the loop.
avg_similarity = 0


for i in range(300):
  total_pairs = 0
  total_similarity_count = 0
  cos_score = 0

  skip_gpt = False
  skip_llama = False
  skip_mixtral = False

  gpt_triplets = gpt_data[i]
  llama_triplets = llama_data[i]
  mixtral_triplets = mixtral_data[i]

  gpt_complete_triplets = min(len(gpt_triplets['subject']), len(gpt_triplets['relationship']), len(gpt_triplets['object']))
  llama_complete_triplets = min(len(llama_triplets['subject']), len(llama_triplets['relationship']), len(llama_triplets['object']))
  mixtral_complete_triplets = min(len(mixtral_triplets['subject']), len(mixtral_triplets['relationship']), len(mixtral_triplets['object']))


  print (gpt_triplets)
  print (llama_triplets)
  print (mixtral_triplets)

  if gpt_complete_triplets == 0:
    skip_gpt = True

  if llama_complete_triplets == 0:
    skip_llama = True

  if mixtral_complete_triplets == 0:
    skip_mixtral = True


  if not skip_gpt:
    gpt_subject, gpt_relation, gpt_obj  = gpt_triplets['subject'], gpt_triplets['relationship'], gpt_triplets['object']

  if not skip_llama:
    llama_subject, llama_relation, llama_obj  = llama_triplets['subject'], llama_triplets['relationship'], llama_triplets['object']

  if not skip_mixtral:
    mixtral_subject, mixtral_relation, mixtral_obj  = mixtral_triplets['subject'], mixtral_triplets['relationship'], mixtral_triplets['object']

  if not skip_gpt:
    gpt_triplet_vectors = []
    for j in range(gpt_complete_triplets):
      subject_vector, relationship_vector, object_vector = process_triplet(w2v_model, gpt_subject[j], gpt_relation[j], gpt_obj[j])
      gpt_triplet_vectors.append((subject_vector + relationship_vector + object_vector)/3)

  if not skip_llama:
    llama_triplet_vectors = []
    for j in range(llama_complete_triplets):
      subject_vector, relationship_vector, object_vector = process_triplet(w2v_model, llama_subject[j], llama_relation[j], llama_obj[j])
      llama_triplet_vectors.append((subject_vector + relationship_vector + object_vector)/3)

  if not skip_mixtral:
    mixtral_triplet_vectors = []
    for j in range(mixtral_complete_triplets):
      subject_vector, relationship_vector, object_vector = process_triplet(w2v_model, mixtral_subject[j], mixtral_relation[j], mixtral_obj[j])
      mixtral_triplet_vectors.append((subject_vector + relationship_vector + object_vector)/3)

  # Sum the cosine similarity of every cross-model pair of triplet vectors.
  # cosine_similarity returns a matrix with one row per triplet of the first
  # model and one column per triplet of the second.
  if not skip_gpt and not skip_llama:
    gpt_llama_vector_pair_similarity = {}
    sims = cosine_similarity(gpt_triplet_vectors, llama_triplet_vectors)
    for j in range(gpt_complete_triplets):
      for q in range(llama_complete_triplets):
        gpt_llama_vector_pair_similarity[(j, q)] = sims[j][q]
        total_similarity_count += sims[j][q]
    total_pairs += gpt_complete_triplets * llama_complete_triplets


  if not skip_gpt and not skip_mixtral:
    gpt_mixtral_vector_pair_similarity = {}
    sims = cosine_similarity(gpt_triplet_vectors, mixtral_triplet_vectors)
    for j in range(gpt_complete_triplets):
      for q in range(mixtral_complete_triplets):
        gpt_mixtral_vector_pair_similarity[(j, q)] = sims[j][q]
        total_similarity_count += sims[j][q]
    total_pairs += gpt_complete_triplets * mixtral_complete_triplets

  if not skip_llama and not skip_mixtral:
    llama_mixtral_vector_pair_similarity = {}
    sims = cosine_similarity(llama_triplet_vectors, mixtral_triplet_vectors)
    for j in range(llama_complete_triplets):
      for q in range(mixtral_complete_triplets):
        llama_mixtral_vector_pair_similarity[(j, q)] = sims[j][q]
        total_similarity_count += sims[j][q]
    total_pairs += llama_complete_triplets * mixtral_complete_triplets


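  print("Doc #"+str(i)+": ")
  print("Total pairs: ", total_pairs)

  # Truncate to two decimals for display only; the score uses the full value.
  print("Total similarity count: ", math.trunc(total_similarity_count*100)/100)

  # With fewer than two models producing complete triplets there are no pairs
  # to compare; the document then adds nothing to the running sum.
  if total_pairs == 0:
    print("Cosine similarity 'score': n/a (no pairs)")
    print("----------------------------------------------------")
    continue

  cos_score = (total_similarity_count/total_pairs) * 100
  avg_similarity += cos_score

  print("Cosine similarity 'score': ", math.trunc(cos_score*100)/100)
  print("----------------------------------------------------")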


print("Average similarity: ", avg_similarity/300)




{'subject': ['CapGen Capital Group VI LP', 'CapGen Capital Group VI LP'], 'subject_type': ['ORG', 'ORG'], 'relationship': ['Invests_In', 'Has'], 'object': ['Union Bankshares Corp', 'Stake', 'Union Bankshares Corp'], 'object_type': ['COMP', 'FIN_INSTRUMENT']}
{'subject': ['CAPGEN CAPITAL GROUP VI LP', 'CAPGEN CAPITAL GROUP VI LP', 'CAPGEN CAPITAL GROUP', 'Union Bankshares Corp'], 'subject_type': ['ORG', 'ORG', 'ORG', 'COMP'], 'relationship': ['Has', 'Announce', 'Participates_In', 'Operate_In'], 'object': ['Union Bankshares Corp', '7.3 Pct Stake in Union Bankshares Corp', "Union Bankshares Corp's Management", 'Union Bankshares Corp'], 'object_type': ['COMP', 'FIN_INSTRUMENT', 'ORG', 'COMP']}
{'subject': ['Union Bankshares Corp', 'CAPGEN Capital Group VI LP', 'CAPGEN Capital Group VI LP', 'CAPGEN Capital Group VI LP', 'CAPGEN Capital Group VI LP', 'CAPGEN Capital Group VI LP', 'CAPGEN Capital Group VI LP'], 'subject_type': ['COMP', 'COMP', 'COMP', 'COMP', 'COMP', 'COMP', 'COMP'], 'relatio

# Remaining Issues:
- Redundant output inflates the score without reflecting any real agreement.
- We are comparing entire outputs instead of individual triplets.
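
The first issue can be reproduced with toy data. In this sketch, two-dimensional vectors stand in for triplet embeddings: `fact` plays the role of triplet A and `hallucination` the role of triplet D. The function `all_pairs_mean` is a hypothetical stand-in for the all-pairs average computed in the loop above, and it assumes every row is non-zero.

```python
import numpy as np

def all_pairs_mean(a, b):
    """Mean cosine similarity over every (row of a, row of b) pair.

    Assumes all rows are non-zero vectors.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a_unit @ b_unit.T).mean())

fact = [1.0, 0.0]           # stands in for a real piece of knowledge (A)
hallucination = [0.0, 1.0]  # stands in for a hallucinated triplet (D)

model_1 = [fact]
model_2 = [fact, hallucination]
model_2_redundant = [fact, fact, fact, hallucination]

baseline = all_pairs_mean(model_1, model_2)            # (1 + 0) / 2
inflated = all_pairs_mean(model_1, model_2_redundant)  # (1 + 1 + 1 + 0) / 4
print(baseline, inflated)
```

The second model carries the same information in both cases, yet repeating the fact raises the score from 0.5 to 0.75.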

# Solution:
- Within each model's output, remove near-duplicate triplets (pairs with very high similarity to each other).
- Between models, pick out the highest-similarity pairs and use only those in the cosine similarity calculation.
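
One possible reading of these two steps is sketched below; it is not an implementation taken from the notebook. The function names and the 0.95 threshold are assumptions chosen for illustration, inputs are assumed to be non-empty lists of equal-length vectors, and the greedy de-duplication is only one of several reasonable strategies.

```python
import numpy as np

def unit_rows(x):
    """L2-normalise rows; all-zero rows are left as zeros."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0
    return x / norms

def deduplicate(vectors, threshold=0.95):
    """Greedily drop rows whose cosine similarity to a kept row exceeds threshold."""
    units = unit_rows(vectors)
    kept = []
    for i in range(len(units)):
        if all(units[i] @ units[j] <= threshold for j in kept):
            kept.append(i)
    return units[kept]

def best_match_score(a, b, threshold=0.95):
    """Mean, over de-duplicated rows of a, of the best cosine match in b."""
    a_units = deduplicate(a, threshold)
    b_units = deduplicate(b, threshold)
    sims = a_units @ b_units.T
    return float(sims.max(axis=1).mean())

model_1 = [[1.0, 0.0]]
model_2 = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

n_unique = len(deduplicate(model_2))          # repeated rows collapse to one
score_1_to_2 = best_match_score(model_1, model_2)
score_2_to_1 = best_match_score(model_2, model_1)
print(n_unique, score_1_to_2, score_2_to_1)
```

The score is asymmetric: every triplet of `model_1` has a match in `model_2`, but the second, unmatched triplet of `model_2` pulls the reverse score down. Averaging both directions is one way to obtain a symmetric value.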



Look into how the stock market names (ticker symbols) of companies such as Apple and Microsoft compare to each other and to the full company names.

Look into sentence transformers, using the MTEB leaderboard to compare candidate models: https://huggingface.co/spaces/mteb/leaderboard