## Cosine Similarity

Each question in the data has been paired with all 411 possible contexts. The true pair is labeled 1 and every other pair is labeled 0.
This produces a total of 411 × 411 = 168,921 rows.

<br>
For each question-context pair, we compute the cosine similarity between their TF-IDF vectors and then keep the 50 most similar contexts for each question. The true context is always kept, even when it does not rank among the top 50.
The purpose is to give BERT closely resembling question-context pairs, which should be better data to train on.
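
As a minimal sketch of this selection step, the toy example below scores one question against three contexts and keeps the best `k`. The sentences, `true_idx`, and `k = 2` are made up for illustration (the notebook uses 411 contexts and keeps 50); only `TfidfVectorizer` and `cosine_similarity` are taken from the cells that follow.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical, not taken from the dataset)
question = "Can pets spread the virus?"
contexts = [
    "There is no evidence that pets spread the virus to people.",
    "Wash your hands often with soap and water.",
    "Vaccines are being tested in clinical trials.",
]
true_idx = 0   # position of the context labeled 1
k = 2          # the notebook keeps 50

# Fit TF-IDF on the question plus all contexts, then compare row 0 to the rest
tfidf = TfidfVectorizer().fit_transform([question] + contexts)
sims = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()

# Indices of the k most similar contexts, best first
top_k = np.argsort(-sims)[:k].tolist()

# Force the true context in if it did not make the cut
if true_idx not in top_k:
    top_k = [true_idx] + top_k[:k - 1]

print(top_k)
```

Only the first context shares any words with the question here, so it ranks first; the remaining slot goes to one of the zero-similarity contexts.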


In [None]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv('Desktop/Interactions/AnswerSelectionDataLong.csv', index_col = 0) #every question paired with all 411 contexts
data = pd.read_csv('Desktop/Interactions/formatted_questions.csv') #original data, one row per question


In [6]:
df.head()

Unnamed: 0,question,context,label
0,Am I at risk for COVID-19 from a package or pr...,There is still a lot that is unknown about the...,1
1,Am I at risk for COVID-19 from a package or pr...,Symptom severity can be influenced by many dif...,0
2,Am I at risk for COVID-19 from a package or pr...,No. According to the World Health Organization...,0
3,Am I at risk for COVID-19 from a package or pr...,Scientists are currently testing different typ...,0
4,Am I at risk for COVID-19 from a package or pr...,The existence of an S strain and an L strain r...,0


In [8]:
#preprocess
def newline_strip(col):   
    return col.replace("\n","")
df['context'] = df['context'].apply(newline_strip)


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english')  #stop-word variant, not used below
sklearn_tfidf = TfidfVectorizer()                                         #default settings, used for the similarities

In [10]:
cosine_array = np.empty((0,411),float)                           #one row of 411 similarities per question
document_list = df.context[0:411].tolist()                       #make all context documents a list
document_list.insert(0,df['question'][0])               #insert first question


df_len = len(df)
for row in range(0,df_len,411):                                 #visit the first row of each question block
    document_list[0] = df['question'][row]                      #swap in the current question
    tfidf_matrix = sklearn_tfidf.fit_transform(document_list)   # fit the question and all contexts into TFIDF vectors
    temp = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:])  #compare the question to every context
    cosine_array = np.append(cosine_array, temp, axis=0)
print(cosine_array)

[[0.06999983 0.0590073  0.01970762 ... 0.02458137 0.05146684 0.01594306]
 [0.03603349 0.17936848 0.01635591 ... 0.04812475 0.11092149 0.1361513 ]
 [0.05621662 0.05024126 0.34531113 ... 0.03066183 0.04260995 0.00675402]
 ...
 [0.0613523  0.03101297 0.04919409 ... 0.023455   0.04361931 0.00251824]
 [0.         0.         0.02864626 ... 0.09056522 0.03135509 0.        ]
 [0.047635   0.02054201 0.00919182 ... 0.04119002 0.03632653 0.14408178]]
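
The loop above refits the vectorizer once per question. A cheaper variation, which is not what this notebook does, is to fit TF-IDF on the contexts once and only transform the questions. The scores will differ somewhat from the ones printed above, because the vocabulary and IDF weights then come from the contexts alone. The sketch below uses made-up sentences to show the shape of that approach.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical, not taken from the dataset)
contexts = [
    "There is no evidence that pets spread the virus to people.",
    "Wash your hands often with soap and water.",
    "Vaccines are being tested in clinical trials.",
]
questions = [
    "Can pets spread the virus?",
    "How often should I wash my hands?",
]

# Fit once on the contexts, then project the questions into the same space
vectorizer = TfidfVectorizer()
context_vecs = vectorizer.fit_transform(contexts)
question_vecs = vectorizer.transform(questions)   # words unseen in the contexts are ignored

# One row per question, one column per context
sim_matrix = cosine_similarity(question_vecs, context_vecs)
print(sim_matrix.shape)
```

Each row of `sim_matrix` plays the same role as one row of `cosine_array`.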


In [12]:
start = 0
finish = len(data) -1  # .loc brackets are inclusive
increment = len(data)

df['cosine_sim'] = ""
for row in range(411):
    cos = cosine_array[row].tolist()
    df.loc[start:finish, 'cosine_sim']= cos
    start += increment
    finish += increment
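
Because the rows of `df` are grouped question by question, the block-wise assignment above amounts to flattening `cosine_array` row by row. A minimal sketch on toy stand-ins (`toy_df` and `toy_cosine` are hypothetical names, 2 questions by 2 contexts) shows the idea, assuming one row of similarities per question in the same order as the DataFrame rows.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical): 2 questions x 2 contexts, rows grouped by question
toy_df = pd.DataFrame({
    "question": ["q0", "q0", "q1", "q1"],
    "context":  ["c0", "c1", "c0", "c1"],
})
toy_cosine = np.array([[0.9, 0.1],
                       [0.2, 0.8]])

# Row-major flattening yields q0-c0, q0-c1, q1-c0, q1-c1, matching the row order
toy_df["cosine_sim"] = toy_cosine.ravel()
print(toy_df)
```

This also gives the column a float dtype from the start, so no later `astype(float)` would be needed.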

In [14]:
new_df = pd.DataFrame(columns =['question','context','label','cosine_sim'])


In [15]:
selected = []   #DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
for row in range(411):
    block = df[(row*411):((row+1)*411)]                       #all 411 rows for this question
    largest = block['cosine_sim'].astype(float).nlargest(50)  #get the largest 50 out of each question
    if (df.loc[largest.index, 'label'] == 1).any():           #ground truth is already among the largest
        selected.append(df.loc[largest.index])
    else:                                                     #otherwise add it and keep only the top 49
        selected.append(block[block['label'] == 1])
        selected.append(df.loc[largest.index[:49]])
new_df = pd.concat(selected)

In [18]:
ct = 0
for i in new_df['label']:
    if i == 1:
        ct+=1
        
print("Number of TRUE labels in Data",ct)     #make sure there are 411 true labels in data    

# there are now 596 instead of 411 due to the duplicate contexts

Number of TRUE labels in Data 596


In [None]:
new_df = new_df.reset_index(drop=True)
new_df.to_csv('AnswerSelectionDataCosine.csv')

In [None]:
train = new_df[:16449]
test = new_df[16449:]
train = train.drop(['cosine_sim'], axis = 1)
test = test.drop(['cosine_sim'], axis = 1)

In [None]:
train.to_csv('AnswerSelectionDataCosineTrain.tsv',index=False, sep="\t")
test.to_csv('AnswerSelectionDataCosineTest.tsv',index=False, sep="\t")