# MANDATE - 2 :  (MT2021069)

### In this notebook, I experiment with a few unsupervised techniques for a QA model, using the sentence embeddings generated in the previous notebook together with the training dataset.

### The unsupervised techniques used here (Euclidean distance and cosine similarity) are implemented only to show how sentence embeddings (vectorization) help solve the problem. They are not final models, just experiments.

## Importing Necessary Libraries and Packages :

In [1]:
import numpy as np, pandas as pd
import json
import ast 
from textblob import TextBlob
import nltk
import torch
import pickle
from scipy import spatial
import warnings
warnings.filterwarnings('ignore')
import spacy
from nltk import Tree
nlp = spacy.load("en_core_web_sm")
from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer

### Training Pandas DataFrame, generated in the previous notebook :

In [2]:
train = pd.read_csv(r"C:\Users\admin\Downloads\t11.csv")

In [3]:
train.shape

(87599, 4)

### Loading Embedding Dictionary (Generated in previous notebook) :

In [4]:
with open(r'C:\Users\admin\dataN\dict_embeddings1.pickle', 'rb') as handle:
    d1 = pickle.load(handle)

In [5]:
with open(r'C:\Users\admin\dataN\dict_embeddings2.pickle', 'rb') as handle:
    d2 = pickle.load(handle)

In [6]:
dict_emb = dict(d1)
dict_emb.update(d2)

In [7]:
len(dict_emb)

179857

In [8]:
del d1, d2

## Data Processing :

In [9]:
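def get_target(x):
    # Index of the sentence that contains the answer text (-1 if none does).
    # If several sentences contain it, the last match wins.
    idx = -1
    for i in range(len(x["sentences"])):
        if x["text"] in x["sentences"][i]: idx = i
    return idx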
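
As a quick illustration of how `get_target` behaves, here is a minimal sketch on a made-up row (a plain dict standing in for a DataFrame row; the sentences and answer texts are hypothetical, not from SQuAD). The function is repeated so the snippet runs on its own.

```python
def get_target(x):
    # Same logic as the notebook cell: last sentence containing the answer text.
    idx = -1
    for i in range(len(x["sentences"])):
        if x["text"] in x["sentences"][i]:
            idx = i
    return idx

# Hypothetical toy row, used only for illustration.
sentences = ["The sky is blue.", "Grass is green.", "Snow is white."]

found = get_target({"sentences": sentences, "text": "green"})     # answer in sentence 1
missing = get_target({"sentences": sentences, "text": "purple"})  # answer in no sentence
repeated = get_target({"sentences": sentences, "text": "is"})     # answer in every sentence

print(found, missing, repeated)
```

Rows where the answer text is not found in any single sentence get a target of -1, so no predicted sentence index can ever match them.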

In [10]:
train.head(3)

| | context | question | answer_start | text |
|---|---|---|---|---|
| 0 | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | 515 | Saint Bernadette Soubirous |
| 1 | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | 188 | a copper statue of Christ |
| 2 | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | 279 | the Main Building |


In [11]:
train.dropna(inplace=True)

In [12]:
train.shape

(87598, 4)

In [13]:
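def process_data(train):
    
    # Step 1: split each context into sentences.
    print("step 1")
    train['sentences'] = train['context'].apply(lambda x: [item.raw for item in TextBlob(x).sentences])
    
    # Step 2: index of the sentence that contains the answer text.
    print("step 2")
    train["target"] = train.apply(get_target, axis = 1)
    
    # Step 3: look up each sentence embedding (zero vector if the sentence is missing).
    print("step 3")
    train['sent_emb'] = train['sentences'].apply(
        lambda x: [dict_emb[item][0] if item in dict_emb else np.zeros(4096) for item in x])
    
    # Step 4: look up the question embedding. The fallback has shape (1, 4096) so that
    # x["quest_emb"][0], as used in cosine_sim, is always a vector.
    print("step 4")
    train['quest_emb'] = train['question'].apply(lambda x: dict_emb[x] if x in dict_emb else np.zeros((1, 4096)))
        
    return train  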
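
One thing to keep in mind about the zero-vector fallback in step 3: the cosine distance to an all-zero vector is undefined (0/0). The sketch below uses a plain-NumPy cosine distance and made-up 3-dimensional vectors (both are assumptions for illustration, not the notebook's data) to show that `np.argmin` would then select the NaN entry, while `np.nanargmin` skips it. Whether any row in the training set actually hits this case is not shown in this notebook.

```python
import numpy as np

def cosine_distance(u, v):
    # Plain-NumPy cosine distance: 1 - cos(angle between u and v).
    # A zero vector gives 0/0, which evaluates to NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

question = np.array([1.0, 0.0, 0.0])
sentences = [
    np.zeros(3),                 # stand-in for a sentence missing from the embedding dict
    np.array([1.0, 0.0, 0.0]),   # same direction as the question
    np.array([0.0, 1.0, 0.0]),   # orthogonal to the question
]
distances = np.array([cosine_distance(s, question) for s in sentences])

naive_pick = int(np.argmin(distances))     # the NaN entry is selected
safe_pick = int(np.nanargmin(distances))   # NaN entries are ignored

print(distances, naive_pick, safe_pick)
```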

In [14]:
train = process_data(train)

step 1
step 2
step 3
step 4


In [15]:
train.head(3)

| | context | question | answer_start | text | sentences | target | sent_emb | quest_emb |
|---|---|---|---|---|---|---|---|---|
| 0 | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | 515 | Saint Bernadette Soubirous | [Architecturally, the school has a Catholic ch... | 5 | [[-0.015412132, -0.054134972, 0.014491863, -0.... | [[0.0074688885, 0.024210272, 0.06961634, -0.01... |
| 1 | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | 188 | a copper statue of Christ | [Architecturally, the school has a Catholic ch... | 2 | [[-0.015412132, -0.054134972, 0.014491863, -0.... | [[0.0074688885, -0.033483382, 0.040545914, -0.... |
| 2 | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | 279 | the Main Building | [Architecturally, the school has a Catholic ch... | 3 | [[-0.015412132, -0.054134972, 0.014491863, -0.... | [[0.0074688885, -0.043944724, 0.1438594, -0.00... |


## Predicted Cosine and Euclidean Index :

In [16]:
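def cosine_sim(x):
    # Despite the name, this returns cosine *distances* (1 - cosine similarity),
    # one per sentence, so a smaller value means a closer match.
    li = []
    for item in x["sent_emb"]:
        li.append(spatial.distance.cosine(item,x["quest_emb"][0]))
    return li   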

In [17]:
def pred_idx(distances):
    return np.argmin(distances)   
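
`spatial.distance.cosine` returns the cosine distance, that is 1 minus the cosine similarity, which is why `pred_idx` takes the `argmin` rather than the `argmax`. The sketch below reproduces that quantity in plain NumPy on made-up 2-D vectors (the vectors and the helper name `cosine_distance` are hypothetical, chosen only for illustration).

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

question = np.array([1.0, 0.0])
sentences = [
    np.array([0.0, 1.0]),   # orthogonal to the question
    np.array([2.0, 0.0]),   # same direction as the question
    np.array([1.0, 1.0]),   # 45 degrees away from the question
]

distances = [cosine_distance(s, question) for s in sentences]
best = int(np.argmin(distances))   # index of the closest sentence

print(distances, best)
```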

In [18]:
def predictions(train):
    
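    # Cosine distance between the question and every sentence of its context.
    train["cosine_sim"] = train.apply(cosine_sim, axis = 1)
    # Squared Euclidean distance: the (1, 4096) question embedding broadcasts against
    # the sentence embeddings. The square root is skipped since it does not change the argmin.
    train["diff"] = (train["quest_emb"] - train["sent_emb"])**2
    train["euclidean_dis"] = train["diff"].apply(lambda x: list(np.sum(x, axis = 1)))
    del train["diff"]
    
    print("cosine start")
    
    # Predicted sentence = the one at the smallest distance from the question.
    train["pred_idx_cos"] = train["cosine_sim"].apply(lambda x: pred_idx(x))
    train["pred_idx_euc"] = train["euclidean_dis"].apply(lambda x: pred_idx(x))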
    
    return train
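
The Euclidean part of `predictions` relies on NumPy broadcasting: a question embedding of shape (1, d) is subtracted from a stack of n sentence embeddings of shape (n, d). A minimal sketch of that step for a single row, with made-up 3-dimensional vectors instead of the 4096-dimensional embeddings, could look like this. It also checks that dropping the square root leaves the selected index unchanged.

```python
import numpy as np

# Hypothetical toy embeddings standing in for one row of the DataFrame.
quest_emb = np.array([[1.0, 2.0, 3.0]])      # shape (1, d)
sent_emb = [
    np.array([1.0, 2.0, 3.0]),               # identical to the question
    np.array([0.0, 2.0, 3.0]),               # differs by 1 in one coordinate
    np.array([1.0, 0.0, 0.0]),               # differs in two coordinates
]

# (1, d) - (n, d) broadcasts to (n, d); summing over axis 1 gives one value per sentence.
diff = (quest_emb - np.array(sent_emb)) ** 2
sq_dist = diff.sum(axis=1)                   # squared Euclidean distances
dist = np.sqrt(sq_dist)                      # true Euclidean distances

pred_sq = int(np.argmin(sq_dist))
pred_true = int(np.argmin(dist))

print(sq_dist, pred_sq, pred_true)
```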
    

In [19]:
predicted = predictions(train)

cosine start


In [20]:
predicted.head(3)

| | context | question | answer_start | text | sentences | target | sent_emb | quest_emb | cosine_sim | euclidean_dis | pred_idx_cos | pred_idx_euc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | 515 | Saint Bernadette Soubirous | [Architecturally, the school has a Catholic ch... | 5 | [[-0.015412132, -0.054134972, 0.014491863, -0.... | [[0.0074688885, 0.024210272, 0.06961634, -0.01... | [0.7483252286911011, 0.6157153248786926, 0.616... | [8.731496, 7.767769, 8.243154, 7.9353786, 7.73... | 5 | 5 |
| 1 | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | 188 | a copper statue of Christ | [Architecturally, the school has a Catholic ch... | 2 | [[-0.015412132, -0.054134972, 0.014491863, -0.... | [[0.0074688885, -0.033483382, 0.040545914, -0.... | [0.7264537513256073, 0.5839367210865021, 0.654... | [7.0554705, 6.2977524, 7.563771, 4.9111557, 6.... | 3 | 3 |
| 2 | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | 279 | the Main Building | [Architecturally, the school has a Catholic ch... | 3 | [[-0.015412132, -0.054134972, 0.014491863, -0.... | [[0.0074688885, -0.043944724, 0.1438594, -0.00... | [0.6630303263664246, 0.5432414710521698, 0.575... | [6.4743433, 5.8934736, 6.712471, 4.1032743, 4.... | 3 | 3 |


In [21]:
predicted.to_csv(r'C:\Users\Admin\Downloads\predictedNLP1.csv', index=False)

## Accuracy :

In [24]:
def accuracy(target, predicted):
    
    acc = (target==predicted).sum()/len(target)
    
    return acc
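
A quick check of `accuracy` on made-up values (NumPy arrays here; the same expression works on pandas Series). The numbers are hypothetical and only illustrate that a target of -1, meaning the answer was not found in any sentence, always counts as a miss because `argmin` never returns -1.

```python
import numpy as np

def accuracy(target, predicted):
    # Fraction of rows where the predicted sentence index equals the target index.
    return (target == predicted).sum() / len(target)

# Hypothetical targets and predictions for four rows.
target = np.array([5, 2, 3, -1])
predicted = np.array([5, 3, 3, 0])

acc = accuracy(target, predicted)
print(acc)
```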

### Accuracy for Euclidean Distance :

In [25]:
print(accuracy(predicted["target"], predicted["pred_idx_euc"]))

0.49163222904632525


### Accuracy for Cosine Similarity :

In [26]:
print(accuracy(predicted["target"], predicted["pred_idx_cos"]))

0.6198657503595972


In [27]:
predicted.to_csv(r'C:\Users\Admin\Downloads\predEx.csv', index = None)

### Cosine similarity scores higher here (about 0.62 versus 0.49 for Euclidean distance). A likely reason is that cosine similarity depends only on the angle between the vectors and ignores their magnitudes, whereas Euclidean distance is also affected by the vector norms.

### In the data processing above, I did not apply stemming to the sentences and questions, since this was only an experiment; I plan to apply it when training more advanced models in the future.
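
To illustrate how the two measures can disagree, here is a small sketch with made-up 2-D vectors (not taken from the dataset). Sentence `a` points in exactly the same direction as the question but has a much larger norm; sentence `b` points elsewhere but lies closer in space. This only shows the mechanism and is not a claim about what happens in the SQuAD embeddings.

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity; depends only on the angle between u and v.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

q = np.array([1.0, 1.0])     # hypothetical question embedding
a = np.array([10.0, 10.0])   # same direction as q, ten times the norm
b = np.array([1.0, 0.0])     # different direction, but close to q in space

cos_pick = int(np.argmin([cosine_distance(q, a), cosine_distance(q, b)]))
euc_pick = int(np.argmin([np.linalg.norm(q - a), np.linalg.norm(q - b)]))

print(cos_pick, euc_pick)
```

Cosine distance picks `a` (index 0) and Euclidean distance picks `b` (index 1), because only the latter is sensitive to vector length.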

# Next Goals :

#### 1. Implement advanced supervised models and train them to get better accuracy on SQuAD.
#### 2. The "Bhagwad Gita" dataset is still being generated; once it is ready, I will apply to it all the methodologies tried here on the SQuAD dataset.