# Biography Comparison Module
## Introduction
This module provides the training and comparison functions for a biography module. It estimates the similarity between two users' bios while taking into account the rarity of certain embeddings, that is, topics or terms that can be expressed by various words.

## Architecture and Thought Process
While the most straightforward way of comparing multiple bios is to use [Sentence-BERT](https://towardsdatascience.com/an-intuitive-explanation-of-sentence-bert-1984d144a868), this does not take into account the rarity of certain topics. For example, the word "gym" is not very rare in bios, so its presence in both bios should not carry the same weight as the word "theater", which is a much more niche interest. We therefore need a different approach, one that accounts for the rarity of topics.
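
As a rough illustration of what "rarity" could mean numerically, the sketch below computes an inverse-document-frequency style weight per word. This is only a word-level analogy: the bios are made up, the helper name `inverse_document_frequency` is hypothetical, and the module itself aims to weight embeddings rather than raw words.

```python
import math

# Made-up bios, invented for this illustration.
bios = [
    "i go to the gym every day",
    "gym and coffee",
    "theater kid who also likes the gym",
    "hiking and gym",
]

def inverse_document_frequency(term, documents):
    """Return log(N / df): higher when fewer documents contain the term."""
    df = sum(1 for doc in documents if term in doc.split())
    if df == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / df)

gym_weight = inverse_document_frequency("gym", bios)          # appears in every bio
theater_weight = inverse_document_frequency("theater", bios)  # appears in one bio
print(gym_weight, theater_weight)
```

In this toy corpus "gym" appears in every bio and gets the lowest possible weight, while "theater" appears once and gets a much higher one.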

![Sentence-BERT Architecture](https://miro.medium.com/max/1124/1*6gjaA_TqojVTABHJPNRMng.jpeg)

^ This is the architecture of Sentence-BERT for reference.

We can use almost the same architecture but replace the `cosine-similarity` function with a weighted cosine similarity function in which the weight of each embedding is proportional to its rarity. A potential formula for the weighted cosine similarity is:
$$\frac{\sum_{i}{w_i u_i v_i}}{\sqrt{\sum_{i}w_i u_i^2}\sqrt{\sum_{i}w_i v_i^2}}$$
where $w_i$ is the weight of the $i^{th}$ embedding dimension, $u_i$ is the $i^{th}$ component of the first bio's embedding, and $v_i$ is the $i^{th}$ component of the second bio's embedding.
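
The formula can be sketched in plain Python as follows. This is a minimal illustration, not the module's implementation: the function name `weighted_cosine_similarity`, the toy vectors, and the weights are all assumptions made for the example.

```python
import math

def weighted_cosine_similarity(u, v, w=None):
    """Weighted cosine similarity between two equal-length vectors.

    w holds one non-negative weight per dimension; None means all weights are 1.
    Zero vectors are not handled and would cause a division by zero.
    """
    if w is None:
        w = [1.0] * len(u)
    if not (len(u) == len(v) == len(w)):
        raise ValueError("u, v and w must have the same length")
    numerator = sum(wi * ui * vi for wi, ui, vi in zip(w, u, v))
    norm_u = math.sqrt(sum(wi * ui * ui for wi, ui in zip(w, u)))
    norm_v = math.sqrt(sum(wi * vi * vi for wi, vi in zip(w, v)))
    return numerator / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (made-up values, not real model output).
u = [1.0, 0.0, 1.0]
v = [1.0, 1.0, 0.0]

# All dimensions count equally.
unweighted = weighted_cosine_similarity(u, v)
# Dimension 0, the only one the two vectors share, is treated as rare.
rare_dim_0 = weighted_cosine_similarity(u, v, [4.0, 1.0, 1.0])
print(unweighted, rare_dim_0)
```

Giving more weight to the shared dimension raises the similarity. Note that `scipy.spatial.distance.cosine(u, v, w)` computes one minus this quantity, i.e. a distance rather than a similarity.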

### Why not GloVe?
GloVe is a system for generating a list of vectors (embeddings) for a sentence, one per word. The following question could be raised: why use BERT to generate sentence embeddings if we are going to use our own comparison function instead of BERT's native one? GloVe segments sentences into traditional word-like tokens, while BERT uses a finer-grained segmentation and "[learns its custom word-piece embeddings jointly with the entire model](https://datascience.stackexchange.com/questions/73189/does-bert-use-glove#:~:text=BERT%20cannot%20use%20GloVe%20embeddings,subword%20units%20called%20word%2Dpieces.)". This could be useful for school-specific lingo that is not encoded in the GloVe model but could be trained into BERT (<span style="color:orange">fastText could be explored as an alternative since it has similar learning capabilities</span>).
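
To make the segmentation difference concrete, here is a toy sketch of greedy longest-match-first word-piece segmentation next to a plain word-level lookup. Both vocabularies and the example word are invented for the illustration; real GloVe and BERT vocabularies are far larger, and this is not how either library is actually invoked.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of one word into word-pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def word_level_tokenize(word, vocab, unk="[UNK]"):
    """Whole-word lookup: a word outside the vocabulary becomes unknown."""
    return [word] if word in vocab else [unk]

# Hypothetical vocabularies, invented for this illustration.
WORD_VOCAB = {"i", "love", "the", "gym"}
PIECE_VOCAB = {"i", "love", "the", "gym", "hack", "##athon", "##s"}

print(word_level_tokenize("hackathons", WORD_VOCAB))
print(wordpiece_tokenize("hackathons", PIECE_VOCAB))
```

The word-level lookup loses the unseen word entirely, while the word-piece segmentation still represents it through known sub-units.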

## Implementation

from collections import defaultdict
import numpy as np
from sentence_transformers import SentenceTransformer, util
from scipy.spatial import distance

BIOGRAPHIES = [
    'The cat sits outside',
    'A man is playing guitar',
    'I love pasta',
    'The new movie is awesome',
    'The cat plays in the garden',
    'A woman watches TV',
    'The new movie is so great',
    'Do you like pizza?',
]

SELECTED_TOP_PER_USER = 10

class BioClassifier:
    def __init__(self):
        # self.model = SentenceTransformer('all-mpnet-base-v2')  # Best-performing sentence embedding model
        self.model = SentenceTransformer('all-MiniLM-L6-v2')  # Almost as good as all-mpnet-base-v2, but 2x faster
        self.embedding_weights = None

    # Returns a tensor of shape (len(biographies), embedding_dim): one embedding per biography
    def encode(self, biographies):
        return self.model.encode(biographies, convert_to_tensor=True)

    def train(self, bio_embeddings):
        # TODO Calculate embedding weights (one per embedding dimension, i.e. bio_embeddings.shape[1] values).
        # Until then the weights stay None, so compare() falls back to the unweighted cosine.
        self.embedding_weights = None

    def encode_and_train(self, biographies):
        bio_embeddings = self.encode(biographies)
        self.train(bio_embeddings)
        return bio_embeddings

    def compare(self, bio1_embeddings, bio2_embeddings):
        # TODO Re-add training check
        # if self.embedding_weights is None:
        #     raise Exception('The classifier has not been trained yet')
        # NOTE: scipy's distance.cosine returns the (weighted) cosine *distance*, i.e. 1 - similarity,
        # so lower scores mean more similar bios. This is why its results differ from util.cos_sim,
        # which returns the similarity itself.
        return distance.cosine(bio1_embeddings, bio2_embeddings, self.embedding_weights)
        # return util.cos_sim([bio1_embeddings], [bio2_embeddings])[0][0]


def main():
    biographies = BIOGRAPHIES
    classifier = BioClassifier()
    biographies_embeddings = classifier.encode_and_train(biographies)

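    # Compute the score (cosine distance) for every pair of bios
    pairs = []
    for i in range(len(biographies_embeddings) - 1):
        for j in range(i + 1, len(biographies_embeddings)):
            score = classifier.compare(biographies_embeddings[i], biographies_embeddings[j])
            pairs.append({'index': (i, j), 'score': score})

    # Sort the pairs by score in decreasing order, then keep the first SELECTED_TOP_PER_USER per user.
    # NOTE: compare() returns a cosine distance, so this order lists the least similar bios first;
    # sort ascending (or have compare() return 1 - distance) to rank by similarity.
    selected_matches = defaultdict(list)
    pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)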
    for pair in pairs:
        i, j = pair['index']
        score = pair['score']
        if len(selected_matches[i]) < SELECTED_TOP_PER_USER:
            selected_matches[i].append({'user': j, 'score': score})
        if len(selected_matches[j]) < SELECTED_TOP_PER_USER:
            selected_matches[j].append({'user': i, 'score': score})

    for i, matches in sorted(selected_matches.items()):
        print(f'===== User: {i} =====')
        for match in matches:
            user = match['user']
            score = match['score']
            print(f'Matched with {user} with score {score}')

if __name__ == '__main__':
    main()

===== User: 0 =====
Matched with 3 with score 1.0246800854802132
Matched with 6 with score 1.0028615249320865
Matched with 2 with score 0.9918926283717155
Matched with 7 with score 0.9746066741645336
Matched with 1 with score 0.9636695198714733
Matched with 5 with score 0.8689927160739899
Matched with 4 with score 0.3212115168571472
===== User: 1 =====
Matched with 2 with score 1.0367830768227577
Matched with 5 with score 1.032725878059864
Matched with 6 with score 1.0136096188798547
Matched with 3 with score 0.9907048298045993
Matched with 7 with score 0.9883512947708368
Matched with 0 with score 0.9636695198714733
Matched with 4 with score 0.7895365059375763
===== User: 2 =====
Matched with 1 with score 1.0367830768227577
Matched with 0 with score 0.9918926283717155
Matched with 4 with score 0.9769552703946829
Matched with 5 with score 0.9641275219619274
Matched with 3 with score 0.7559625208377838
Matched with 6 with score 0.7439515888690948
Matched with 7 with score 0.4904498457908