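In the end I went with MiniLM because its embeddings are small, but SimCSE-style models look like they summarize sentence meaning better (though their vectors have a lot more dimensions...).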
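
To make the size trade-off concrete, here is a rough sketch of how long the Base62 fingerprint can get for a given embedding width. It assumes 4 bits per dimension, as in the cells below. `max_base62_length` is a hypothetical helper, not part of the notebook, and 768 is assumed as the width of a BERT-base-sized SimCSE checkpoint (larger checkpoints would be longer still).

```python
def max_base62_length(dim, n_bits, base=62):
    # Largest integer that dim * n_bits bits can represent
    largest = (1 << (dim * n_bits)) - 1
    # Count how many base-62 digits that integer needs
    length = 0
    while largest > 0:
        largest //= base
        length += 1
    return length

# 384 dimensions: MiniLM, as used in this notebook
# 768 dimensions: assumed width of a BERT-base-sized SimCSE model
for dim in (384, 768):
    print(dim, "dims ->", max_base62_length(dim, 4), "characters at most")
```

Under these assumptions a 384-dimensional embedding fits in at most 258 characters, while a 768-dimensional one needs up to 516.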

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import string

# Initialize the SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

def generate_semantic_fingerprint(document):
    # Encode the document and return a 1D vector (embedding)
    return model.encode([document])[0]

def quantize_to_n_bits(vector, n_bits):
    # Calculate the min and max of the vector for scaling
    min_val, max_val = np.min(vector), np.max(vector)
    # Calculate the number of bins for n-bit quantization (2^n)
    n_bins = 2 ** n_bits
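    # Define n_bins evenly spaced edges from min to max; with this layout
    # only the maximum element ends up in the top bin (index n_bins - 1)
    bins = np.linspace(min_val, max_val, n_bins)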
    # Quantize the vector to integer indices (bin assignments)
    quantized_vector = np.digitize(vector, bins) - 1  # digitize starts from 1
    # Clip values to ensure they're within [0, n_bins-1]
    quantized_vector = np.clip(quantized_vector, 0, n_bins - 1)
    return quantized_vector

def base62_encode(num):
    # Characters for base62 encoding
    characters = string.digits + string.ascii_letters
    base = 62
    if num == 0:
        return characters[0]
    encoding = ''
    while num > 0:
        num, rem = divmod(num, base)
        encoding = characters[rem] + encoding
    return encoding

def quantized_vector_to_base62(quantized_vector, n_bits):
    # Convert quantized vector to a binary string with fixed width for each element
    binary_str = ''.join([format(qv, f'0{n_bits}b') for qv in quantized_vector])
    # Convert binary string to integer
    num = int(binary_str, 2)
    # Encode integer to base62
    return base62_encode(num)

# Example documents
document1="""
Harvard Study Confirms Fluoride 'Significantly Lowers' Children's IQ
"""

document2="""
The government put fluoride in our water and attacked anyone who questioned it.

Now - the NIH (after major pressure) has declared  it “reduces the IQ of children” and is “hazardous to human health” - and states are removing it from water.

This is under-covered news.
"""

document3="""
Fluoride, often present in dental products and water supplies, is finally being recognized as a neurotoxin. Research indicates that excessive fluoride exposure, especially in children, is linked to reduced IQ scores and cognitive impairments. This neurotoxicity may stem from fluoride’s interference with neurotransmitter synthesis and its promotion of oxidative stress, raising concerns about its safety in vulnerable populations.
"Watch Attorney Michael Connett Depose the ‘Experts’ on the Safety of Fluoride in Drinking Water After New Data Shows Links to Lower IQ in Kids
"""

document4="""
“The proponents of this practice will often say there’s thousands of studies that show fluoridation is safe. Here I had the opportunity, under penalty of perjury, to ask these organizations … to point me to one study on this particular issue of the effects on the brain … Can you point me to any study that shows that water fluoridation is safe and every single one of the those organizations came back and said no.”#
"""

# Generate the semantic fingerprints
fingerprint = generate_semantic_fingerprint(document1)
fingerprint2 = generate_semantic_fingerprint(document2)

# Apply 4-bit quantization to each fingerprint
quantized_fingerprint_4bit = quantize_to_n_bits(fingerprint, 4)
quantized_fingerprint2_4bit = quantize_to_n_bits(fingerprint2, 4)

# Calculate cosine similarity for original and quantized vectors
similarity_score = cosine_similarity([fingerprint], [fingerprint2])

# Normalize quantized vectors to compute cosine similarity
normalized_quantized_fingerprint_4bit = quantized_fingerprint_4bit / np.linalg.norm(quantized_fingerprint_4bit)
normalized_quantized_fingerprint2_4bit = quantized_fingerprint2_4bit / np.linalg.norm(quantized_fingerprint2_4bit)
similarity_score_4bit = cosine_similarity([normalized_quantized_fingerprint_4bit], [normalized_quantized_fingerprint2_4bit])

print("Cosine Similarity Score (Original):", similarity_score[0][0])
print("Cosine Similarity Score (4-bit Quantized):", similarity_score_4bit[0][0])

# Encode quantized vectors to Base62 strings
encoded_fingerprint = quantized_vector_to_base62(quantized_fingerprint_4bit, 4)
encoded_fingerprint2 = quantized_vector_to_base62(quantized_fingerprint2_4bit, 4)

print("Encoded Fingerprint:", encoded_fingerprint)
print("Encoded Fingerprint2:", encoded_fingerprint2)




Cosine Similarity Score (Original): 0.6349182
Cosine Similarity Score (4-bit Quantized): 0.9533956402920651
Encoded Fingerprint: zwt3ifcNrR5MFCgNU3vaVsmdrjZcbQivoPvs9TyGeoVAp1E7i6bFyKnXD6xjrS3cEHmQnPhZtxFbaiRi0Vm3ZnXk9RYsmlB1ho2ftirkeif6Wg1PKedfCPZdgOYcNBFS6DGceKapWK4yzw9hWlsz3ilYCsTNAZg7CeiguLvvfILnEA4j6Uth2Kzn5Jf7cYEMC4OpBas1fmvqYu0B4tnTpixRVGGNYMcVARvR70xnDZ2knGOIfOcdZes1f4RL6lXT6L
Encoded Fingerprint2: mwak1XymR0bWCsBj7IVP9jC8w8RZaGlNvNSziT1Iv2zLchQnIgFsLdGTH0wK8KeiXwV38rwOXxyIywA96qNd2H01X2UOtF1d51nKbT48qduoJ0jaRkxILCQKP1NHq3h76QbYwF5s60BxlaFcO9pdGr9HaN1iSWGHh6HJa2RBOaMvEw4jlWFtiwldnGMPZ6iwqAIpgZNawWn7Bb3fKcHHRfK9PLH3spXnQD6McdYMFwGdfAKxxOQAmqJdmjW4hHdCYO
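
The jump from 0.63 to 0.95 above is most likely an artifact of comparing the raw bin indices: they are all non-negative, so every fingerprint shares a large common offset that dominates the cosine. Below is a minimal sketch of one possible remedy, using synthetic vectors in place of real embeddings. It keeps each vector's minimum and step size alongside the codes (an assumption; the Base62 string above does not store them) and maps the codes back to approximate values before comparing. `quantize_with_scale`, `dequantize`, and `cosine` are hypothetical helper names.

```python
import numpy as np

def quantize_with_scale(vector, n_bits):
    # Same binning as quantize_to_n_bits, but also return what is needed to undo it
    n_bins = 2 ** n_bits
    min_val, max_val = float(np.min(vector)), float(np.max(vector))
    bins = np.linspace(min_val, max_val, n_bins)
    codes = np.clip(np.digitize(vector, bins) - 1, 0, n_bins - 1)
    step = (max_val - min_val) / (n_bins - 1)
    return codes, min_val, step

def dequantize(codes, min_val, step):
    # Map each code back to a value near the middle of its bin
    return min_val + (codes + 0.5) * step

def cosine(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-ins for two related 384-dimensional embeddings (assumption)
rng = np.random.default_rng(0)
a = rng.standard_normal(384)
b = 0.6 * a + 0.8 * rng.standard_normal(384)

codes_a, min_a, step_a = quantize_with_scale(a, 4)
codes_b, min_b, step_b = quantize_with_scale(b, 4)

raw_cos = cosine(a, b)                # similarity of the original vectors
code_cos = cosine(codes_a, codes_b)   # similarity of the raw codes, as in the cell above
deq_cos = cosine(dequantize(codes_a, min_a, step_a),
                 dequantize(codes_b, min_b, step_b))

print(f"original: {raw_cos:.3f}  raw codes: {code_cos:.3f}  dequantized: {deq_cos:.3f}")
```

With this synthetic data the raw-code cosine lands well above the original, while the dequantized one stays close to it. Whether the gap is equally large for real embeddings has not been checked here, and storing the two extra floats makes the fingerprint slightly longer.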


In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import string

# Initialize the SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

def generate_semantic_fingerprint(document):
    # Encode the document and return a 1D vector (embedding)
    return model.encode([document])[0]

def quantize_to_n_bits(vector, n_bits):
    # Calculate the min and max of the vector for scaling
    min_val, max_val = np.min(vector), np.max(vector)
    # Calculate the number of bins for n-bit quantization (2^n)
    n_bins = 2 ** n_bits
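    # Define n_bins evenly spaced edges from min to max; with this layout
    # only the maximum element ends up in the top bin (index n_bins - 1)
    bins = np.linspace(min_val, max_val, n_bins)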
    # Quantize the vector to integer indices (bin assignments)
    quantized_vector = np.digitize(vector, bins) - 1  # digitize starts from 1
    # Clip values to ensure they're within [0, n_bins-1]
    quantized_vector = np.clip(quantized_vector, 0, n_bins - 1)
    return quantized_vector

def base62_encode(num):
    # Characters for base62 encoding
    characters = string.digits + string.ascii_letters
    base = 62
    if num == 0:
        return characters[0]
    encoding = ''
    while num > 0:
        num, rem = divmod(num, base)
        encoding = characters[rem] + encoding
    return encoding


def quantized_vector_to_base62(quantized_vector, n_bits):
    # Convert quantized vector to a binary string with fixed width for each element
    binary_str = ''.join([format(qv, f'0{n_bits}b') for qv in quantized_vector])
    # Convert binary string to integer
    num = int(binary_str, 2)
    # Encode integer to base62
    return base62_encode(num)

def base62_decode(base62_str):
    # Base62 character set (must match base62_encode)
    characters = string.digits + string.ascii_letters
    base = 62
    # Decode the Base62 string into an integer
    num = 0
    for char in base62_str:
        num = num * base + characters.index(char)
    return num

def base62_to_quantized_vector(base62_str, n_bits, vector_length):
    # Decode the Base62 string into an integer
    num = base62_decode(base62_str)
    # Convert the integer to a bit string, zero-padded to the full width so that
    # leading zero codes dropped by the integer conversion are restored
    binary_str = format(num, f'0{n_bits * vector_length}b')
    # Split the bit string into n_bits-wide chunks to recover the quantized vector
    quantized_vector = [int(binary_str[i:i + n_bits], 2) for i in range(0, len(binary_str), n_bits)]
    return np.array(quantized_vector)

# Example documents
documents = [
    """
    Harvard Study Confirms Fluoride 'Significantly Lowers' Children's IQ
    """,
    """
    The government put fluoride in our water and attacked anyone who questioned it.
    Now - the NIH (after major pressure) has declared  it “reduces the IQ of children” and is “hazardous to human health” - and states are removing it from water.
    This is under-covered news.
    """,
    """
    Fluoride, often present in dental products and water supplies, is finally being recognized as a neurotoxin. Research indicates that excessive fluoride exposure, especially in children, is linked to reduced IQ scores and cognitive impairments. This neurotoxicity may stem from fluoride’s interference with neurotransmitter synthesis and its promotion of oxidative stress, raising concerns about its safety in vulnerable populations.
    "Watch Attorney Michael Connett Depose the ‘Experts’ on the Safety of Fluoride in Drinking Water After New Data Shows Links to Lower IQ in Kids
    """,
    """
    Fluoride in tap water lowers the IQ of those who drink it"""
]

# Generate the semantic fingerprints and apply 4-bit quantization for each document
fingerprints = [generate_semantic_fingerprint(doc) for doc in documents]
quantized_fingerprints_4bit = [quantize_to_n_bits(fp, 4) for fp in fingerprints]

# Normalize quantized vectors to compute cosine similarity
normalized_quantized_fingerprints_4bit = [
    qf / np.linalg.norm(qf) for qf in quantized_fingerprints_4bit
]

# Calculate and display cosine similarity for all combinations
num_docs = len(documents)
for i in range(num_docs):
    for j in range(i + 1, num_docs):
        similarity = cosine_similarity(
            [normalized_quantized_fingerprints_4bit[i]],
            [normalized_quantized_fingerprints_4bit[j]]
        )[0][0]
        print(f"Cosine Similarity (4-bit Quantized) between Document {i+1} and Document {j+1}: {similarity}")

# Step 1: Reuse the embeddings generated above (one per document);
# there is no need to run the model a second time

# Step 2: Calculate the average embedding (semantic centroid)
average_embedding = np.mean(fingerprints, axis=0)

quantized_average_fingerprints_4bit = quantize_to_n_bits(average_embedding, 4)
normalized_quantized_average_fingerprints_4bit = quantized_average_fingerprints_4bit / np.linalg.norm(quantized_average_fingerprints_4bit)
print(quantized_average_fingerprints_4bit)
print(quantized_fingerprints_4bit[3])

# Step 3: Calculate cosine similarity of each document embedding with the average embedding
for idx, fingerprint in enumerate(fingerprints):
    similarity = cosine_similarity([fingerprint], [average_embedding])[0][0]
    print(f"Cosine Similarity between Document {idx + 1} and Average Embedding: {similarity}")

print(cosine_similarity([normalized_quantized_average_fingerprints_4bit], [normalized_quantized_fingerprints_4bit[3]])[0][0])
print(quantized_vector_to_base62(quantized_average_fingerprints_4bit,4))
print(quantized_vector_to_base62(quantized_fingerprints_4bit[3],4))
print(len(quantized_average_fingerprints_4bit))

a = base62_to_quantized_vector('wjY7yyv2BKT4rLEfxgGFFeVDOQJhuImLTQOsqoA3yzFIuSa5au2s4b5OCt0nFp9C1kaJC79V1aazrvBPu1JFYwTNffHjqjzyyETjswiYwLCYDhISpKWaiLeV0UETv01tH213lzw8T57VA6YC1iCjl0ynFjIX7NK2pfgynMKhLteT3NLetA2ovnlSur7JRXPaKZItPJgkn0YQRYGq7fXsQzPaNIzhQRAubhbtirB021mVJ8mVsqe0VVmalVj9z7KasU',4,384)
b = base62_to_quantized_vector('sFoVZtdk7HzSGC9YEpoxtGo9w1OThlXhW3Q4hg6sdrIIPB52kM2eTSrF0O8npcEKB2UFDHUVDv83YXgB64oWSblBM9NdtWd0n064zfPgXwWgBJkrGhJVTPQItIcZckXzccQwiKQS4qSMVyfJkgLGlJ9XGkUVOToXhelcuwh4AuPUJtqsqfztYPkgjYFwLIipvBzxJ9S0nxFA9Yin79qv52iHDq8KExVzsbGy5N2nkntD0ytvTRBrSzbKMlvJi9O5ok',4,384)
a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)
print(cosine_similarity([a_norm],[b_norm]))




Cosine Similarity (4-bit Quantized) between Document 1 and Document 2: 0.9533956402920651
Cosine Similarity (4-bit Quantized) between Document 1 and Document 3: 0.9711414885395127
Cosine Similarity (4-bit Quantized) between Document 1 and Document 4: 0.9641934862209532
Cosine Similarity (4-bit Quantized) between Document 2 and Document 3: 0.9560595892495869
Cosine Similarity (4-bit Quantized) between Document 2 and Document 4: 0.9543078528479654
Cosine Similarity (4-bit Quantized) between Document 3 and Document 4: 0.9700528005847939
[ 9  7  7  8 10  5 10 11  5  8  6  6  5  9  2  4  6  5  3  5  5  6 10  7
  6 10  7  3  4  9 10  7  8  1  6  6  7  7  7  5  4  4 10  6  7  6  5  4
  7  9  3  8  7 10  7  7  5  3  2 15  5  6  7  7 10 10  3  3  6  5  8  6
 12  6  9  8  6  7 10  8  7  6 14  8 12  5  8  4  5  8  5  6  9  8  5  7
  4  6  3  8  6  5  5  8  8  6  8  6  8  7  3 11  6  9  6  7  8  6  6  6
  9  4  6  8  8  7  3  6  6  7 11  5  5  6  5  2  9  6  8  6  5  9  5  7
  2  6  6  5  9  5  4 