# Embeddings based on Cosine Similarity

The following notebook builds word embeddings on the premise that words appearing in similar contexts should have similar embeddings. The baseline tf-idf model does not take the context of the target words into consideration. To implement the embedding, I have used an article from the NewsQA dataset as the text corpus.

## Steps:

### 1. Data Preprocessing 
* An article is selected from the NewsQA dataset.
* The article is tokenized into words.
### 2. Making a co-occurrence matrix
* A co-occurrence matrix is created using a user-defined window size; each entry counts the number of times a given word appears together with another word inside that window.
* The co-occurrence matrix is then stored in a csv file for ease of access.
### 3. Cosine Similarity 
* A function `cosine_similarity` is defined, which takes the co-occurrence vectors of two words and computes their cosine similarity.
* The value computed for each word pair is appended to a list belonging to the word; this list finally acts as the embedding of the word.
### 4. Storing the embeddings
* The embeddings are then stored in a csv file.

     

##### 1. Importing Libraries

In [1]:
import nltk
import pandas as pd
import numpy as np
from collections import defaultdict
from nltk.tokenize import word_tokenize


##### 2. Reading data

In [2]:
df=pd.read_parquet("/home/naseeha/repos/IntelSIGRecTasks/Polyphasic/1)Embeddings/Embeddings_cosine/train-00000-of-00001-ec54fbe500fc3b5c.parquet")
df

Unnamed: 0,context,question,answers,key,labels
0,"NEW DELHI, India (CNN) -- A high court in nort...",What was the amount of children murdered?,[19],da0e6b66e04d439fa1ba23c32de07e50,"[{'end': [295], 'start': [294]}]"
1,"NEW DELHI, India (CNN) -- A high court in nort...",When was Pandher sentenced to death?,[February.],724f6eb9a2814e4fb2d7d8e4de846073,"[{'end': [269], 'start': [261]}]"
2,"NEW DELHI, India (CNN) -- A high court in nort...",The court aquitted Moninder Singh Pandher of w...,[rape and murder],d64cbb90e5134081acfa83d3e702408c,"[{'end': [638], 'start': [624]}]"
3,"NEW DELHI, India (CNN) -- A high court in nort...",who was acquitted,[Moninder Singh Pandher],fd7177ee6f1f4d62becd983a0305f503,"[{'end': [216], 'start': [195]}]"
4,"NEW DELHI, India (CNN) -- A high court in nort...",who was sentenced,[Moninder Singh Pandher],cd25c69f631349748ccdeccaace66463,"[{'end': [216], 'start': [195]}]"
...,...,...,...,...,...
74155,"OAKLAND, California (CNN) -- Fifth-grader Chri...",What happened to Christopher Rodriguez?,"[was hit by a stray bullet, paralyzing him for...",c0ac3ef6afb94dbe8e1666b8bfbf5237,"[{'end': [259], 'start': [209]}]"
74156,"OAKLAND, California (CNN) -- Fifth-grader Chri...",What did Christopher Rodriguez love?,[music.],683cfaf6ec1c47189172300b4aaa3f91,"[{'end': [985], 'start': [980]}]"
74157,"OAKLAND, California (CNN) -- Fifth-grader Chri...",WIll the boy walk again?,[paralyzed for life],0da315692fd04023b7205fab7aeeb26e,"[{'end': [701], 'start': [684]}]"
74158,"OAKLAND, California (CNN) -- Fifth-grader Chri...",What did the suspect allegedly rob?,[Chevron gas station],74a4fa2d8548463f8fdf9a370d7ea5ff,"[{'end': [423], 'start': [405]}]"


##### 3. Extracting the first article

In [3]:
text=df["context"][0]
len(text)

1235

##### 4. Splitting the article into sentences

In [4]:
sentences=text.split(".")
sentences

['NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed "the house of horrors',
 '"\n\n\n\nMoninder Singh Pandher was sentenced to death by a lower court in February',
 '\n\n\n\nThe teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years',
 '\n\n\n\nThe Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B',
 ' Kochar told CNN',
 '\n\n\n\nPandher and his domestic employee Surinder Koli were sentenced to death in February by a lower court for the rape and murder of the 14-year-old',
 "\n\n\n\nThe high court upheld Koli's death sentence, Kochar said",
 '\n\n\n\nThe two were arrested two years ago after body parts packed in plastic bags were found near their home in Noida, a New Delhi suburb',
 ' Their home was later dubbed a "house of horrors" by the Indian media',
 '\n\n\n\nPand

##### 5. Removing non-alphabetical characters

In [5]:
for i in range(len(sentences)):
    st=""
    for j in sentences[i]:
        if j.isalpha() or j==" ":
            st+=j.lower()
    sentences[i]=st
sentences

['new delhi india cnn  a high court in northern india on friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed the house of horrors',
 'moninder singh pandher was sentenced to death by a lower court in february',
 'the teen was one of  victims  children and young women  in one of the most gruesome serial killings in india in recent years',
 'the allahabad high court has acquitted moninder singh pandher his lawyer sikandar b',
 ' kochar told cnn',
 'pandher and his domestic employee surinder koli were sentenced to death in february by a lower court for the rape and murder of the yearold',
 'the high court upheld kolis death sentence kochar said',
 'the two were arrested two years ago after body parts packed in plastic bags were found near their home in noida a new delhi suburb',
 ' their home was later dubbed a house of horrors by the indian media',
 'pandher was not named a main suspect by investigators initially but was summoned as
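
The character loop above drops digits and hyphens outright, so "14-year-old" collapses to "yearold" in the output. A minimal alternative sketch, assuming it is acceptable to treat every non-letter as a word boundary and to keep only ASCII letters; `clean_sentence` is a hypothetical helper, not part of the notebook.

```python
import re

def clean_sentence(sentence):
    """Lowercase a sentence and keep only ASCII letters separated by single spaces."""
    # Replace every run of non-letter characters with one space ...
    letters_only = re.sub(r"[^A-Za-z]+", " ", sentence)
    # ... then trim the ends and lowercase.
    return letters_only.strip().lower()

print(clean_sentence("\n\nThe rape and murder of the 14-year-old"))
# the rape and murder of the year old
```

Unlike `str.isalpha()`, the character class here would also discard accented letters, so this variant only suits English text like the article used here.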

##### 6. Making a list of unique words

In [6]:
words=[]
for sentence in sentences:
    x = [i.lower() for i in word_tokenize(sentence) if i.isalpha()]
    for word in x:
        if word not in words:
            words.append(word)
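
The `word not in words` check scans the whole list for every token. For a single article that is fine; for a larger corpus, a sketch like the following gives the same first-appearance order in one pass. It assumes plain whitespace tokenization instead of `word_tokenize` so that it runs without NLTK data, and `unique_in_order` is a hypothetical name.

```python
def unique_in_order(tokens):
    """Return the distinct tokens, preserving first-appearance order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(tokens))

tokens = "the high court upheld the death sentence".split()
vocab = unique_in_order(tokens)
print(vocab)  # ['the', 'high', 'court', 'upheld', 'death', 'sentence']
```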

##### 7. Function to compute co-occurrence values

In [7]:
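def co_occurrence(sentences, window_size):
    # NOTE: `sentences` is not used; the loop walks the global list `words`,
    # which holds each unique word once, in order of first appearance.
    # Each count therefore records adjacency in that list, not corpus frequency.
    d=defaultdict(int)
    for i in range(len(words)):
        token = words[i]
        # The next `window_size` entries after position i form the window.
        next_token = words[i+1 : i+1+window_size]
        for t in next_token:
            # Sort the pair so that (a, b) and (b, a) share one key.
            key = tuple( sorted([t, token]) )
            d[key] += 1
    
    # Square word-by-word matrix, initialised to zero.
    df = pd.DataFrame(data=np.zeros((len(words), len(words)), dtype=np.int16),
                      index=words,
                      columns=words)
    for key, value in d.items():
        # Fill both halves so that the matrix is symmetric.
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    # index=False drops the row labels; the column headers keep the words.
    df.to_csv("co_occurence_matrix.csv",index=False)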

##### 8. Computing co-occurrence values of words within a window of size 3

In [8]:
co_occurrence(sentences,3)
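
Note that `co_occurrence` walks the de-duplicated `words` list, so every pair is counted at most once and the matrix only reflects which words first appeared near each other. A sketch of counting over the actual token sequence of each sentence is given below; it assumes whitespace tokenization and a forward-looking window, and `cooccurrence_counts` is a hypothetical name, not the function used to produce the outputs in this notebook.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

def cooccurrence_counts(sentences, window_size):
    """Count how often two words appear within window_size tokens of each other."""
    counts = defaultdict(int)
    vocab = {}  # dict used as an ordered set of words
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            vocab.setdefault(token, None)
            # Pair the current token with the next window_size tokens.
            for other in tokens[i + 1 : i + 1 + window_size]:
                counts[tuple(sorted((token, other)))] += 1

    vocab = list(vocab)
    matrix = pd.DataFrame(
        np.zeros((len(vocab), len(vocab)), dtype=np.int64),
        index=vocab,
        columns=vocab,
    )
    for (a, b), n in counts.items():
        matrix.at[a, b] = n
        matrix.at[b, a] = n  # keep the matrix symmetric
    return matrix

toy = ["the high court", "the lower court"]
m = cooccurrence_counts(toy, window_size=2)
print(m.loc["the", "court"])  # 2
```

Because windows are taken per sentence, words on either side of a sentence boundary are never paired here, which is a design choice rather than a requirement.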

##### 9. Reading co-occurrence data

In [9]:
df=pd.read_csv("/home/naseeha/repos/IntelSIGRecTasks/Polyphasic/1)Embeddings/Embeddings_cosine/co_occurence_matrix.csv")
df.reset_index()

Unnamed: 0,index,new,delhi,india,cnn,a,high,court,in,northern,...,australia,when,raped,killed,faces,remaining,could,remain,custody,attorney
0,0,0,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,1,1,0,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,1,1,1,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,0,1,1,1,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,102,0,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,1,1,1,0
103,103,0,0,0,0,0,0,0,0,0,...,0,0,0,1,1,1,0,1,1,1
104,104,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,1,0,1,1
105,105,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,1,0,1


##### 10. Function to calculate cosine similarity of co-occurrence vectors of two words

In [10]:
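def cosine_similarity(word1,word2):
    # The co-occurrence vector of a word is its column in the global dataframe `df`.
    arr1=np.array(df[word1])
    arr2=np.array(df[word2])
    
    dot_prod=np.dot(arr1,arr2)
    norm1 = np.linalg.norm(arr1)
    norm2 = np.linalg.norm(arr2)

    # An all-zero vector has no direction; return 0.0 instead of dividing by zero.
    if norm1 == 0 or norm2 == 0:
        return 0.0

    cosine=dot_prod/(norm1*norm2)

    return cosine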
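
To make the formula concrete, here is a small worked sketch on toy vectors. The values are hypothetical and `cosine` is a standalone helper that takes the vectors directly instead of looking them up in `df`.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Toy co-occurrence vectors (hypothetical values), rounded for display.
print(round(cosine([1, 1, 0], [1, 1, 0]), 3))  # identical contexts -> 1.0
print(round(cosine([1, 0, 0], [0, 1, 0]), 3))  # no shared contexts -> 0.0
print(round(cosine([1, 1, 0], [0, 1, 1]), 3))  # one shared context -> 0.5
```

Without rounding, floating-point arithmetic can leave a result a hair above or below the exact value, which is why a word's similarity with itself may print as `1.0000000000000002`.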
    

##### 11. Declaring a default dictionary to save the embeddings

In [11]:
dictionary_similarity=defaultdict(list)


##### 12. Storing each cosine similarity value to the assigned list for each word

In [12]:
for i in words:
    for j in words:
        dictionary_similarity[i].append(cosine_similarity(i,j))
dictionary_similarity

defaultdict(list,
            {'new': [1.0000000000000002,
              0.5773502691896258,
              0.5163977794943222,
              0.47140452079103173,
              0.7071067811865476,
              0.47140452079103173,
              0.23570226039551587,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
              0.0,
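
The nested loop calls `cosine_similarity` once per word pair. The same values can be obtained in one step by normalising the vectors and taking a matrix product. The sketch below assumes the co-occurrence matrix is available as a square, symmetric array (so rows and columns are interchangeable); `cosine_similarity_matrix` is a hypothetical name.

```python
import numpy as np

def cosine_similarity_matrix(counts):
    """Return the pairwise cosine similarities between the rows of counts."""
    counts = np.asarray(counts, dtype=float)
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    # Avoid dividing by zero for all-zero rows; such rows stay all-zero.
    norms[norms == 0] = 1.0
    unit = counts / norms
    return unit @ unit.T

# Toy symmetric co-occurrence matrix for three words (hypothetical values).
toy = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]])
sims = cosine_similarity_matrix(toy)
print(np.round(sims, 3))
```

Row `i` of the result corresponds to the list that the loop builds for the `i`-th word, that is, its embedding.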
            

##### 13. Storing the embeddings in a list

In [13]:
embeddings=[]
for i in dictionary_similarity:
    embeddings.append(dictionary_similarity[i])


##### 14. Making a dataframe with the words and corresponding embeddings

In [14]:
data={"words":words,"embeddings":embeddings}
new_df=pd.DataFrame(data)
new_df

Unnamed: 0,words,embeddings
0,new,"[1.0000000000000002, 0.5773502691896258, 0.516..."
1,delhi,"[0.5773502691896258, 1.0, 0.6708203932499369, ..."
2,india,"[0.5163977794943222, 0.6708203932499369, 0.999..."
3,cnn,"[0.47140452079103173, 0.6123724356957946, 0.73..."
4,a,"[0.7071067811865476, 0.4082482904638631, 0.547..."
...,...,...
102,remaining,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
103,could,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
104,remain,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
105,custody,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


##### 15. Saving the values to a csv file

In [15]:
new_df.to_csv("Embeddings_cosinesim.csv",index=False)
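
Since each embedding is a Python list, the csv file stores it as a string. A sketch of reading such a file back is shown below; it assumes the values were written as plain floats, as in the output above, and uses a tiny in-memory stand-in with hypothetical values in place of `Embeddings_cosinesim.csv`.

```python
import ast
import io

import pandas as pd

# A tiny stand-in for Embeddings_cosinesim.csv (hypothetical values).
csv_text = 'words,embeddings\nnew,"[1.0, 0.5]"\ndelhi,"[0.5, 1.0]"\n'

loaded = pd.read_csv(io.StringIO(csv_text))
# Each cell is a string such as "[1.0, 0.5]"; parse it back into a list.
loaded["embeddings"] = loaded["embeddings"].apply(ast.literal_eval)
print(loaded["embeddings"][0])  # [1.0, 0.5]
```

If the lists hold NumPy scalars, their text form may differ between NumPy versions, so converting each value with `float()` before saving is likely the safer choice.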