***
# .IPYNB Similarity Scorer
***
## Motivation:
This program measures the similarity between two texts, two programs, or a mixture of the two. A systematic similarity check is useful because language plays a critical role in shaping market decisions, regulatory compliance, and financial communications. It can also help identify plagiarism and offer financial insight: highly similar texts could be a sign of reuse and stagnation, while texts that vary heavily could be a sign of a diversifying strategy.
## Model Explanation:
Two functions are implemented. The first, `extract`, converts an .ipynb file to plain text by joining the source of its markdown and code cells. The second, `similarity`, encodes two texts with an imported sentence-transformer encoder and measures the cosine similarity between the two embeddings. The result is a score tensor holding a single value, which we read as a scalar with the tensor's `.item()` method.
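
To make the scoring step concrete, the sketch below computes cosine similarity by hand on two small vectors. It is illustrative only: `cosine_similarity` is a hypothetical helper written for this example, and the vectors are made-up numbers rather than model output. The notebook itself relies on `util.cos_sim` applied to the encoder's embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (made-up numbers, not model output)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

A value close to 1 means the two vectors point in the same direction, and a value close to 0 means they are unrelated. The notebook multiplies this value by 100 to report it as a percentage.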
## NECESSARY:
### This notebook will only work if your current working directory is the folder containing the files you want to compare pairwise. Use os.chdir("path"), where path is the path to that folder, to change into it.
***
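
As a sketch of the setup step, the snippet below changes into a folder and confirms which notebooks it contains. The temporary folder and the file names `a.ipynb` and `b.ipynb` are placeholders created only so the example runs anywhere; substitute the path to your own folder.

```python
import os
import tempfile

# Build a throwaway folder with two empty placeholder notebooks and a checkpoint directory
folder = tempfile.mkdtemp()
for name in ("a.ipynb", "b.ipynb"):
    open(os.path.join(folder, name), "w").close()
os.mkdir(os.path.join(folder, ".ipynb_checkpoints"))

original_dir = os.getcwd()
os.chdir(folder)  # same call as os.chdir("path") with your own path
try:
    # Keep only notebook files; Jupyter's .ipynb_checkpoints directory is skipped
    notebooks = sorted(f for f in os.listdir() if f.endswith(".ipynb"))
finally:
    os.chdir(original_dir)  # restore the previous working directory
print(notebooks)
```

Filtering on the `.ipynb` suffix is one possible way to make sure that only notebook files enter the pairwise comparison.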

In [14]:
%%capture
!pip install sentence-transformers
import itertools
import json
import os

import pandas as pd
from sentence_transformers import SentenceTransformer, util
### ENSURE THE CURRENT WORKING DIRECTORY IS THE FOLDER CONTAINING THE FILES TO COMPARE

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [15]:
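#folderName = "folder1"
#os.chdir(folderName)
print(os.getcwd())  # make sure this is the folder you wish to inspect
print(os.listdir())  # this should print the files you wish to compare
# Keep only notebook files so that file names and extracted texts are paired
# from the same list (os.listdir() also returns .ipynb_checkpoints).
files = [f for f in os.listdir() if f.endswith(".ipynb")]
pairNames = list(itertools.combinations(files, 2))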

/home/jovyan/memes/folder1
['decoder.ipynb', 'survival.ipynb', 'Kevin_Berookhim_Homework3.ipynb', '191.ipynb', '.ipynb_checkpoints', 'Extra Credit - Exercise of NUMPY (Duration Analysis).ipynb']


In [16]:
#load the Hugging Face sentence-transformer model
model = SentenceTransformer('all-MiniLM-L6-v2') 

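#extract only the text for each file in files:
def extract(file):
    with open(file, 'r', encoding='utf-8') as f:
        notebook = json.load(f)
    # Combine text from markdown and code cells
    content = []
    for cell in notebook.get('cells', []):
        if cell.get('cell_type') in ['markdown', 'code']:
            source = cell.get('source', [])
            # 'source' may be a single string or a list of lines; joining a
            # plain string would insert a space between every character
            if isinstance(source, str):
                source = [source]
            content.append(' '.join(source))
    return ' '.join(content)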

#compute cosine similarity
def similarity(text1, text2):
    embedding1 = model.encode(text1, convert_to_tensor=True)
    embedding2 = model.encode(text2, convert_to_tensor=True)
    similarityScore = util.cos_sim(embedding1, embedding2)
    return str(similarityScore.item() * 100)[0:5] + "%" #keep the first five characters of the percentage (this truncates rather than rounds) and append a percent sign.
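
The text-extraction step can be checked without a real notebook. The sketch below writes a minimal, made-up notebook to a temporary file and runs a standalone variant of the extraction logic on it. The name `extract_text`, the sample cells, and the handling of a string-valued `source` are assumptions made for this illustration.

```python
import json
import os
import tempfile

def extract_text(path):
    """Join the source of every markdown and code cell into one string."""
    with open(path, "r", encoding="utf-8") as f:
        notebook = json.load(f)
    parts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") in ("markdown", "code"):
            source = cell.get("source", [])
            if isinstance(source, str):  # 'source' may be one string or a list of lines
                source = [source]
            parts.append(" ".join(source))
    return " ".join(parts)

# A made-up three-cell notebook; the raw cell should be ignored
sample = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["x = 1\n", "print(x)"]},
        {"cell_type": "raw", "source": ["ignored"]},
    ]
}
path = os.path.join(tempfile.mkdtemp(), "sample.ipynb")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

text = extract_text(path)
print(repr(text))
```

Only the markdown and code cells contribute to the returned string, so cell outputs and raw cells do not influence the similarity score.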

In [17]:
#TEST THAT THE MODEL WORKS:
x = "hello my friend, too, is a big boy"
y = "hello my friend too is a big boy"

g = similarity(x,y)
print(f"similarity score: {g}")

similarity score: 96.80%


In [18]:
texts = [extract(file) for file in files if file != ".ipynb_checkpoints"]
pairsList = list(itertools.combinations(texts, 2))

In [19]:
simDict = {}
# walk the name pairs and the text pairs together instead of tracking an index by hand
for names, pair in zip(pairNames, pairsList):
    simDict[names] = similarity(pair[0], pair[1])
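
The loop above relies on the name pairs and the text pairs lining up position by position. The sketch below uses made-up file names and placeholder strings in place of notebook text. It shows that `itertools.combinations` preserves input order, so the two lists line up only when both are built from the same filtered list of files.

```python
import itertools

# Hypothetical directory listing; the checkpoint folder is not a notebook
listing = ["a.ipynb", "b.ipynb", ".ipynb_checkpoints", "c.ipynb"]
notebooks = [name for name in listing if name.endswith(".ipynb")]

# Placeholder texts standing in for the extracted notebook contents
texts = {name: f"text of {name}" for name in notebooks}

name_pairs = list(itertools.combinations(notebooks, 2))
text_pairs = list(itertools.combinations([texts[n] for n in notebooks], 2))

# Both lists come from the same filtered names, so position i describes the same pair
for (n1, n2), (t1, t2) in zip(name_pairs, text_pairs):
    assert t1 == texts[n1] and t2 == texts[n2]

# Pairing names from the unfiltered listing yields more name pairs than text pairs
unfiltered_pairs = list(itertools.combinations(listing, 2))
print(len(name_pairs), len(text_pairs), len(unfiltered_pairs))
```

If the names come from the unfiltered listing while the texts come from the filtered one, the counts differ and scores end up attached to the wrong file names.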

In [21]:
df = pd.DataFrame(
    [(key[0], key[1], value) for key, value in simDict.items()],
    columns=['Text One', 'Text Two', 'Similarity Score']
)
df

Unnamed: 0,Text One,Text Two,Similarity Score
0,decoder.ipynb,survival.ipynb,11.18%
1,decoder.ipynb,Kevin_Berookhim_Homework3.ipynb,5.911%
2,decoder.ipynb,191.ipynb,26.59%
3,decoder.ipynb,Extra Credit - Exercise of NUMPY (Duration Ana...,19.96%
4,survival.ipynb,Kevin_Berookhim_Homework3.ipynb,16.42%
5,survival.ipynb,191.ipynb,26.27%
6,survival.ipynb,Extra Credit - Exercise of NUMPY (Duration Ana...,24.05%
7,Kevin_Berookhim_Homework3.ipynb,191.ipynb,32.90%
8,Kevin_Berookhim_Homework3.ipynb,Extra Credit - Exercise of NUMPY (Duration Ana...,37.08%
9,191.ipynb,Extra Credit - Exercise of NUMPY (Duration Ana...,29.16%
