## Setup

### Install required libraries

The libraries should already be installed if you ran
```
pip install -r requirements.txt
```
in the root directory. The cell below is provided for running the notebook in other environments.

In [1]:
%%capture
!pip install InstructorEmbedding==1.0.1
!pip install scikit-learn==1.6.1
!pip install pandas==2.2.3
!pip install numpy==2.2.6
!pip install sentence-transformers==2.2.2
!pip install requests==2.32.3
!pip install transformers==4.37.2
!pip install huggingface-hub==0.25.2

### Import required libraries

In [2]:
import pandas as pd
import numpy as np
from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity
import requests
from huggingface_hub import configure_http_backend
import urllib3

pd.set_option('display.max_colwidth', 100)



### Other configurations

In [3]:
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# [OPTIONAL] Use only if there are SSL certificate verification issues (this disables verification)
def backend_factory() -> requests.Session:
    session = requests.Session()
    session.verify = False
    return session


configure_http_backend(backend_factory=backend_factory)

### Load the Instructor model

In [4]:
model = INSTRUCTOR('hkunlp/instructor-large')

load INSTRUCTOR_Transformer
max_seq_length  512


### Load dataset

In [5]:
cv_csv_file = '../datasets/common_voice/cv-valid-dev.csv'
df = pd.read_csv(cv_csv_file)
df.head(10)

| Unnamed: 0 | filename | text | up_votes | down_votes | age | gender | accent | duration |
|---|---|---|---|---|---|---|---|---|
| 0 | cv-valid-dev/sample-000000.mp3 | be careful with your prognostications said the stranger | 1 | 0 | | | | |
| 1 | cv-valid-dev/sample-000001.mp3 | then why should they be surprised when they see one | 2 | 0 | | | | |
| 2 | cv-valid-dev/sample-000002.mp3 | a young arab also loaded down with baggage entered and greeted the englishman | 2 | 0 | | | | |
| 3 | cv-valid-dev/sample-000003.mp3 | i thought that everything i owned would be destroyed | 3 | 0 | | | | |
| 4 | cv-valid-dev/sample-000004.mp3 | he moved about invisible but everyone could hear him | 1 | 0 | fourties | female | england | |
| 5 | cv-valid-dev/sample-000005.mp3 | but everything had changed | 3 | 0 | teens | male | us | |
| 6 | cv-valid-dev/sample-000006.mp3 | are you sure this is claire | 2 | 0 | | | | |
| 7 | cv-valid-dev/sample-000007.mp3 | it had told him to dig where his tears fell | 1 | 0 | | | | |
| 8 | cv-valid-dev/sample-000008.mp3 | the shop folks were taking down their shutters and people were opening their bedroom windows | 1 | 0 | twenties | female | canada | |
| 9 | cv-valid-dev/sample-000009.mp3 | the teacher thought that he'd taught himself all he could | 1 | 0 | fifties | female | australia | |


### Create hot words and instruction

In [6]:
hot_words = ['be careful', 'destroy', 'stranger']

In [7]:
instruction = "Represent the warning concept:"

### Create embeddings for the hot words

In [8]:
hot_word_embeddings = model.encode([[instruction, hw] for hw in hot_words])

### Create embeddings for the text column

In [9]:
sentences = df['text'].tolist()
sentence_embeddings = model.encode([[instruction, s] for s in sentences])

### Test the Instructor model using samples

We use cosine similarity to compare the hot-word embeddings with the sentence embeddings.

We first check the results on a few sample sentences.
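As a reminder of what the score measures, here is a minimal sketch of cosine similarity for a single pair of vectors. The helper `cosine_sim` and the toy 2-D vectors are illustrative assumptions, not part of the notebook; the notebook itself uses `sklearn.metrics.pairwise.cosine_similarity` on the real embeddings.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 2-D vectors standing in for embeddings (illustrative values only).
same_direction = cosine_sim(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
diagonal = cosine_sim(np.array([1.0, 0.0]), np.array([1.0, 1.0]))

# Roughly 1.0 (same direction), 0.0 (orthogonal) and 0.707 (45 degrees apart).
print(same_direction, orthogonal, diagonal)
```

The score depends only on the angle between the vectors, not on their lengths, which is why scaling a vector leaves the similarity unchanged.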

In [10]:
sample_idx_list = [0, 6, 9, 16]

for sample_idx in sample_idx_list:
    print(f"Sentence with index {sample_idx}: {sentences[sample_idx]}")
    sims = cosine_similarity([sentence_embeddings[sample_idx]], hot_word_embeddings)[0]
    for hot_word, sim in zip(hot_words, sims):
        print(f"Cosine Similarity with '{hot_word}': {sim}")
    print("\n")

Sentence with index 0: be careful with your prognostications said the stranger
Cosine Similarity with 'be careful': 0.8775918483734131
Cosine Similarity with 'destroy': 0.7324599027633667
Cosine Similarity with 'stranger': 0.8865888714790344


Sentence with index 6: are you sure this is claire
Cosine Similarity with 'be careful': 0.8151886463165283
Cosine Similarity with 'destroy': 0.752497136592865
Cosine Similarity with 'stranger': 0.8010200262069702


Sentence with index 9: the teacher thought that he'd taught himself all he could
Cosine Similarity with 'be careful': 0.7652658224105835
Cosine Similarity with 'destroy': 0.7242734432220459
Cosine Similarity with 'stranger': 0.7414363622665405


Sentence with index 16: is that what you want me to tell vincent
Cosine Similarity with 'be careful': 0.819640040397644
Cosine Similarity with 'destroy': 0.7587750554084778
Cosine Similarity with 'stranger': 0.7860390543937683




Sample Sentence 1 ("be careful with your prognostications said the stranger"):
- Contains the "be careful" and "stranger" hot words verbatim
- Little or no reference to the "destroy" hot word
- Hence, cosine similarities are high for the "be careful" and "stranger" hot words (above 0.80) and lower for "destroy" (about 0.73)

Sample Sentence 2 ("are you sure this is claire"):
- Contains none of the hot words, although one can argue that the sentence implies "be careful" and "stranger"
- Hence, cosine similarities are higher for the "be careful" and "stranger" hot words (above 0.80, though only marginally for "stranger")

Sample Sentence 3 ("the teacher thought that he'd taught himself all he could"):
- Little or no reference to the hot words
- Hence, cosine similarities are low for all of the hot words (under 0.80)

Sample Sentence 4 ("is that what you want me to tell vincent"):
- Little reference to the hot words
- Cosine similarity with the "be careful" hot word is slightly higher even though the reference is vaguer
- Hence, only the "be careful" score is above 0.80

Using these scores as a reference, we set a threshold of *0.8*: a sentence is flagged as similar if its cosine similarity with at least one of the 3 hot words reaches the threshold.
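To make the effect of the threshold concrete, the sketch below applies it to the four sample sentences. The scores are copied from the sample output above and rounded to three decimals; the helper `matched_hot_words` is a hypothetical name introduced only for this illustration.

```python
# Scores copied from the sample output above, rounded to three decimals.
sample_scores = {
    0: {"be careful": 0.878, "destroy": 0.732, "stranger": 0.887},
    6: {"be careful": 0.815, "destroy": 0.752, "stranger": 0.801},
    9: {"be careful": 0.765, "destroy": 0.724, "stranger": 0.741},
    16: {"be careful": 0.820, "destroy": 0.759, "stranger": 0.786},
}

threshold = 0.80


def matched_hot_words(scores: dict, threshold: float) -> list:
    """Return the hot words whose score reaches the threshold."""
    return [hot_word for hot_word, score in scores.items() if score >= threshold]


matches = {idx: matched_hot_words(scores, threshold)
           for idx, scores in sample_scores.items()}

for idx, hot_word_list in matches.items():
    # A sentence is flagged as similar when at least one hot word matches.
    print(idx, bool(hot_word_list), hot_word_list)
```

With this threshold, sentences 0, 6 and 16 are flagged and sentence 9 is not. The "stranger" score for sentence 6 clears the threshold only narrowly, so the set of flagged sentences is likely to be sensitive to small changes in the cut-off.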

### Perform similarity check

In [12]:
threshold = 0.80
similarity_flags = []

for emb in sentence_embeddings:
    sims = cosine_similarity([emb], hot_word_embeddings)[0]
    is_similar = any(sim >= threshold for sim in sims)
    similarity_flags.append(is_similar)
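
The loop above computes one row of similarities at a time. As a possible alternative, the same flags can be computed in one matrix operation. The sketch below uses plain NumPy on toy 2-D vectors; `flag_similar` and the toy arrays are assumptions made for illustration, not part of the notebook. `cosine_similarity` from scikit-learn also accepts two full matrices and returns the complete sentence-by-hot-word score matrix, so it could replace the manual normalisation here.

```python
import numpy as np


def flag_similar(sentence_emb: np.ndarray, hot_word_emb: np.ndarray,
                 threshold: float) -> np.ndarray:
    """Flag each sentence whose cosine similarity with any hot word >= threshold."""
    # L2-normalise the rows so that a plain dot product equals cosine similarity.
    s = sentence_emb / np.linalg.norm(sentence_emb, axis=1, keepdims=True)
    h = hot_word_emb / np.linalg.norm(hot_word_emb, axis=1, keepdims=True)
    sims = s @ h.T  # shape: (n_sentences, n_hot_words)
    return (sims >= threshold).any(axis=1)


# Toy stand-ins for the real embeddings (illustrative values only).
toy_sentences = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_hot_words = np.array([[1.0, 0.1], [0.9, 0.2]])

flags = flag_similar(toy_sentences, toy_hot_words, threshold=0.80)
print(flags)
```

In the notebook, the call would take `sentence_embeddings` and `hot_word_embeddings` in place of the toy arrays, and the resulting boolean array could be assigned to `df['similarity']` directly.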

In [13]:
df['similarity'] = similarity_flags
print("No. of sentences similar to the hot words: ", df[df['similarity']].shape[0])
df[df['similarity']].head(10)

No. of sentences similar to the hot words:  1785


| Unnamed: 0 | filename | text | up_votes | down_votes | age | gender | accent | duration | similarity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | cv-valid-dev/sample-000000.mp3 | be careful with your prognostications said the stranger | 1 | 0 | | | | | True |
| 1 | cv-valid-dev/sample-000001.mp3 | then why should they be surprised when they see one | 2 | 0 | | | | | True |
| 2 | cv-valid-dev/sample-000002.mp3 | a young arab also loaded down with baggage entered and greeted the englishman | 2 | 0 | | | | | True |
| 3 | cv-valid-dev/sample-000003.mp3 | i thought that everything i owned would be destroyed | 3 | 0 | | | | | True |
| 6 | cv-valid-dev/sample-000006.mp3 | are you sure this is claire | 2 | 0 | | | | | True |
| 8 | cv-valid-dev/sample-000008.mp3 | the shop folks were taking down their shutters and people were opening their bedroom windows | 1 | 0 | twenties | female | canada | | True |
| 11 | cv-valid-dev/sample-000011.mp3 | you haven't seen anything yet | 2 | 0 | | | | | True |
| 12 | cv-valid-dev/sample-000012.mp3 | but i found it difficult to get to work because of the investigations | 2 | 0 | | | | | True |
| 16 | cv-valid-dev/sample-000016.mp3 | is that what you want me to tell vincent | 2 | 0 | | | | | True |
| 17 | cv-valid-dev/sample-000017.mp3 | one of us is going to jail | 1 | 0 | | | | | True |


### Save data

In [14]:
df.to_csv('cv-valid-dev-with-similarity.csv', index=False)