# Experiment 1

### **0.5* Sbert_cosine_sim + 0.3* W2v_cosine_sim + 0.2* Normalized_Score**

- The weights were chosen on the following basis:
- SBERT gave the best results of all the approaches tried, so it receives the highest weight (0.5).
- The custom Word2Vec model was trained on this data, so it should add some value; it receives a weight of 0.3.
- The higher a question's score, the more likely it is to have been answered or to have several answers, so the normalized score receives a weight of 0.2.
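
The weighting above can be illustrated with a minimal sketch. The similarity values and raw question scores below are made up, and the min-max scaling in `min_max_normalize` is only an assumption about how `Normalized_Score` might be derived; the notebook does not show that step.

```python
import numpy as np

# Weights taken from the formula in the heading above.
W_SBERT, W_W2V, W_SCORE = 0.5, 0.3, 0.2


def min_max_normalize(values):
    """Scale values to [0, 1]. Assumed normalization, not confirmed by the notebook."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span


def combined_score(sbert_sim, w2v_sim, normalized_score):
    """Weighted sum of the three per-question signals."""
    return (W_SBERT * np.asarray(sbert_sim, dtype=float)
            + W_W2V * np.asarray(w2v_sim, dtype=float)
            + W_SCORE * np.asarray(normalized_score, dtype=float))


# Toy values for three candidate questions (hypothetical).
sbert_sim = [0.9, 0.6, 0.4]
w2v_sim = [0.8, 0.7, 0.2]
raw_scores = [10, 250, 0]  # hypothetical question scores

final = combined_score(sbert_sim, w2v_sim, min_max_normalize(raw_scores))
print(final)
```

In this toy data the second question ends up ranked first even though its SBERT similarity is lower, because its high question score contributes the full 0.2. That is the trade-off the 0.2 weight introduces.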

### Importing libraries

In [1]:
import torch
import pandas as pd
import numpy as np
import re
import bs4
import swifter
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec
import joblib
from tqdm import tqdm 
tqdm.pandas()


### Preprocessing and embedding functions

In [2]:
# # https://stackoverflow.com/a/47091490/4084039
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    #phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase



def text_preprocessing(text):
    '''This function does text preprocessing 
       It includes removal of html tags,
       converting to lowercase, 
       decontraction and 
       removal of any non alphanumeric characters.
       
       Function takes one parameter - text
       returns - preprocessed text
    '''
    
    # Some titles (~42) start with '<' but have no closing '>',
    # e.g. text = '<asp: RegularExpressionValidator and RegexOptions.IgnorePatternWhitespace'
    # BeautifulSoup returns an empty string for such text, so remove '<' before removing html tags from titles.
    text = text.replace("<","")
    # Remove html tags from question corpus
    text = bs4.BeautifulSoup(text, 'lxml').get_text()
    # Convert each word to lowercase
    text = text.lower()
    # Text decontraction, e.g. "won't" to "will not", "can't" to "can not"
    text = decontracted(text)
    # Remove every character except letters, apostrophes, '.', '+', '#' and spaces (digits are removed too)
    #text = re.sub('\W', ' ',text).strip()
    text = re.sub("[^a-zA-Z'.+# ]+", '', text) # keeping + for c++, . for .net, vb.net etc, # for C#

    # Why lemmatization is chosen over stemming:
    #https://stackoverflow.com/questions/1787110/what-is-the-difference-between-lemmatization-vs-stemming
    # Lemmatization   
    lemmatizer = WordNetLemmatizer()
    text = " ".join([lemmatizer.lemmatize(word) for word in text.split()])
    text = text.strip()
    return text



def get_w2v_embedding(sentence):
    '''Get the 300-dim embedding of each word from the custom-trained w2v model
       and average the word embeddings to create a sentence embedding.
       
       Function accepts only one parameter - sentence (text input)
       returns - 300-dim sentence embedding (all zeros if no word is in the vocabulary)'''
    
    custom_w2v = []
    for word in sentence.split():
        if word not in final_stopwords:
            try:
                custom_w2v.append(loaded_model.wv[word])
            except KeyError:
                # word is not in the w2v vocabulary, skip it
                pass

    if not custom_w2v:
        # avoid the NaN that mean() returns for an empty array
        return np.zeros(loaded_model.wv.vector_size)
    avg_w2v = np.array(custom_w2v).mean(axis=0)
    return avg_w2v
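
As a rough illustration of the averaging step, the sketch below uses a tiny hand-made dictionary in place of the trained Word2Vec model. The names `toy_vectors`, `toy_stopwords` and `average_embedding` are hypothetical, and the zero-vector fallback for sentences with no known words is an assumption rather than something the notebook specifies.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for the 300-dimensional model.
toy_vectors = {
    "python": np.array([1.0, 0.0, 2.0]),
    "sort": np.array([3.0, 2.0, 0.0]),
}
toy_stopwords = {"a", "the"}


def average_embedding(sentence, vectors, stopwords, dim):
    """Mean of the vectors of known, non-stopword tokens; zeros if none match."""
    found = [vectors[word] for word in sentence.split()
             if word not in stopwords and word in vectors]
    if not found:
        # Assumed fallback: a zero vector instead of NaN.
        return np.zeros(dim)
    return np.mean(found, axis=0)


# "the" is a stopword and "dictionary" is out of vocabulary,
# so only the vectors for "sort" and "python" are averaged.
print(average_embedding("sort the python dictionary", toy_vectors, toy_stopwords, 3))
```

Out-of-vocabulary tokens contribute nothing to the average, which may partly explain weaker results for queries made up mostly of rare identifiers.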

In [3]:
final_stopwords = joblib.load('final_stopwords.pkl')

### Data Loading

In [4]:
%%time
df = joblib.load('cleaned_df.pkl')
print(df.shape)

(999348, 3)
CPU times: user 370 ms, sys: 215 ms, total: 586 ms
Wall time: 1.04 s


### Model Loading

In [5]:
%%time
loaded_model = Word2Vec.load("model/custom_trained_w2v/word2vec_v2.model")
sentence_embedder = joblib.load('sentence_embedder.pkl')

CPU times: user 37.7 s, sys: 3.53 s, total: 41.3 s
Wall time: 1min 12s


### Precomputed embedding loading

In [6]:
%%time
w2v_embeddings = joblib.load('w2v_embeddings.pkl')
sbert_embeddings = joblib.load('sbert_embeddings.pkl')

CPU times: user 759 ms, sys: 2.61 s, total: 3.37 s
Wall time: 22.7 s


### Function to retrieve semantically similar questions

In [22]:
def get_similar_questions(query):
    ''' Function to accept a user query and show the top 5 similar questions along with their custom scores.
        Function accepts one parameter: query (text input)
        Processing: Text preprocessing of query, 
                    compute sentence embedding with avg w2v and sbert method,
                    compute custom weighted score with formula 0.5*Sbert_cosine_sim + 0.3*W2v_cosine_sim + 0.2*Normalized_Score
        Returns: None, prints the similar questions' titles and the custom scores obtained.
    '''
    preprocessed_query = text_preprocessing(query)
    
    query_embedding_w2v = get_w2v_embedding(preprocessed_query)

    query_embedding_sbert = sentence_embedder.encode(preprocessed_query, convert_to_tensor=True)

    
    # We use cosine-similarity and torch.topk to find the highest 5 scores
    df['Sbert_cosine_sim'] = util.pytorch_cos_sim(query_embedding_sbert, sbert_embeddings)[0]
    df['W2v_cosine_sim'] = cosine_similarity(np.array(query_embedding_w2v).reshape(1, -1),np.array(w2v_embeddings)).T

    df['Final_Score'] = (0.5*df['Sbert_cosine_sim']) + (0.3*df['W2v_cosine_sim']) + (0.2*df['Normalized_Score'])
    # Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
    top_k = 5
    
    top_results = torch.topk(torch.tensor(df['Final_Score'].values), k=top_k)

    print("Query:", query)
    print("\nTop 5 most similar questions in corpus:")
    
    for i, (score, idx) in enumerate(zip(top_results[0], top_results[1]), start=1):
        print("{}) ".format(i), df['Title'].iloc[int(idx)], "(Score: {:.4f})".format(score))
        
#     Sorting with the code below took almost the same time; a difference of around 0.6 seconds was noticed

#     df.sort_values(by='Final_Score', ascending=False, inplace=True)
#     print(df[['Title','Final_Score']].head().values)
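
The top-k selection can also be sketched without torch. The example below swaps `torch.topk` for NumPy's `argpartition`, which avoids sorting the whole array; it is an alternative shown for illustration, not the method used in the function above, and `top_k_indices` is a hypothetical helper.

```python
import numpy as np


def top_k_indices(scores, k):
    """Indices of the k largest scores, best first, without a full sort."""
    scores = np.asarray(scores)
    k = min(k, scores.size)
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition moves the k largest values into the last k slots, unordered.
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates in descending order of score.
    return candidates[np.argsort(scores[candidates])[::-1]]


# Made-up final scores for six questions.
scores = np.array([0.31, 0.74, 0.12, 0.72, 0.69, 0.05])
idx = top_k_indices(scores, 3)
print(idx, scores[idx])
```

Whether this is faster than `torch.topk` or `sort_values` on the full corpus would need to be measured on the actual data.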


### Performance Evaluation

In [23]:
%%time
get_similar_questions('python sort dictionary')

Query: python sort dictionary

Top 5 most similar questions in corpus:
1)  Python: sort this dictionary (dict in dict) (Score: 0.7410)
2)  sort a dictionary according to their values in python (Score: 0.7297)
3)  Sort by key of dictionary inside a dictionary in Python (Score: 0.7258)
4)  Sorting dictionary keys in python (Score: 0.7206)
5)  Dictionary sort? (Score: 0.7203)
CPU times: user 3.17 s, sys: 981 ms, total: 4.15 s
Wall time: 2.35 s


In [24]:
%%time
get_similar_questions('CSS Performance')

Query: CSS Performance

Top 5 most similar questions in corpus:
1)  CSS Performance (Score: 0.8013)
2)  CSS Performance (Score: 0.8013)
3)  CSS Performance Question (Score: 0.7134)
4)  CSS Performance issues (Score: 0.7092)
5)  Performance, serve all CSS at once, or as its needed? (Score: 0.6370)
CPU times: user 3.23 s, sys: 1.1 s, total: 4.32 s
Wall time: 2.24 s


In [25]:
%%time
get_similar_questions('python convert date to datetime')

Query: python convert date to datetime

Top 5 most similar questions in corpus:
1)  Convert date to datetime in Python (Score: 0.7912)
2)  Convert date Python (Score: 0.7282)
3)  How can I convert the time in a datetime string from 24:00 to 00:00 in Python? (Score: 0.7097)
4)  Date converter in python (Score: 0.6901)
5)  How do I get datetime from date object python? (Score: 0.6845)
CPU times: user 2.95 s, sys: 935 ms, total: 3.88 s
Wall time: 2.21 s


In [26]:
%%time
get_similar_questions('how to create list of lists in python')

Query: how to create list of lists in python

Top 5 most similar questions in corpus:
1)  Creating lists of lists in a pythonic way (Score: 0.7286)
2)  How can I make a list in Python like (0,6,12, .. 144)? (Score: 0.7019)
3)  List of Lists in python? (Score: 0.7016)
4)  How to create nested lists in python? (Score: 0.7003)
5)  Creating a list of objects in Python (Score: 0.6929)
CPU times: user 3.08 s, sys: 1.13 s, total: 4.21 s
Wall time: 2.24 s


In [27]:
%%time
get_similar_questions('pd.melt() not working python')


Query: pd.melt() not working python

Top 5 most similar questions in corpus:
1)  I am getting an error when trying to use melt() on a dataframe containing Dates (Score: 0.4897)
2)  Running Panda3D on Python 2.6 (Score: 0.4865)
3)  Python .pth Files Aren't Working (Score: 0.4837)
4)  Python pdb not breaking in files properly? (Score: 0.4795)
5)  Python optparse not working for me (Score: 0.4749)
CPU times: user 2.98 s, sys: 1.07 s, total: 4.05 s
Wall time: 2.22 s


### Code queries

In [28]:
%%time
get_similar_questions('try: 22/0 except Exception as e:print("Error! Code: {c}, Message, {m}".format(c = e.code, m = str(e))')

Query: try: 22/0 except Exception as e:print("Error! Code: {c}, Message, {m}".format(c = e.code, m = str(e))

Top 5 most similar questions in corpus:
1)  Exception message (Python 2.6) (Score: 0.5633)
2)  Format Exception error (Score: 0.5628)
3)  Exception Error in the Code (Score: 0.5612)
4)  complus Exception code -532462766 (Score: 0.5557)
5)  uncatchable exception from unreachable code (Score: 0.5432)
CPU times: user 3.25 s, sys: 1.06 s, total: 4.31 s
Wall time: 2.23 s


In [29]:
%%time
get_similar_questions('def main() return {a:1, b:2} syntax error')

Query: def main() return {a:1, b:2} syntax error

Top 5 most similar questions in corpus:
1)  multiple return statements in python "def" causes syntax error (Score: 0.5843)
2)  Why use def main()? (Score: 0.5332)
3)  Why no compiler error for main() without a return at the end? (Score: 0.5297)
4)  Help calling def from class (Score: 0.4954)
5)  Why is "def InvalidArgsSpecified:" a syntax error? (Score: 0.4925)
CPU times: user 3.39 s, sys: 1.03 s, total: 4.42 s
Wall time: 2.26 s


In [36]:
%%time
get_similar_questions('import KNN \
                       knn= KNN(n=4) \
                       knn.fit(Xtrain, ytrain)')

Query: import KNN                        knn= KNN(n=4)                        knn.fit(Xtrain, ytrain)

Top 5 most similar questions in corpus:
1)  How to import * with __import__ (Score: 0.5479)
2)  implemention of imports (Score: 0.5137)
3)  What exactly does "import *" import? (Score: 0.4964)
4)  Python import mechanics (Score: 0.4852)
5)  import os to j2me (Score: 0.4848)
CPU times: user 3.02 s, sys: 990 ms, total: 4.01 s
Wall time: 2.29 s


### Inference:

- Results seem satisfactory for almost all queries.
- For code-related queries: <br>
    1) try-except query: the model picked up on the code, so the result set includes questions related to exceptions.<br>
    2) main function query: it captured the essence of the main function in a programming language.<br>
    3) KNN query: it did not fully capture that the code relates to the KNN algorithm, yet it yielded results related to the import statement.

