# CH5019: Mathematical Foundations of Data Science
## *Course Project*
## Group-6
In this notebook we examine the task of automatically grading answers to descriptive questions. Exams often include descriptive questions, but manually assessing and grading the answers is not trivial. The purpose of this exercise is to grade a student's answer by automatically retrieving the closest sample answer and grading accordingly.

**Our basic strategy is as follows**

We already have a template best answer for each question. From each template we have generated 5 sample answers and assigned each of them marks out of 10. For the question asked, we find the sample answer closest in meaning to the student's answer and award marks for that question based on the retrieved sample answer.

To compare two answers we need an efficient way of computing the semantic similarity between two sentences. To do this, we convert each sentence into a vector and then use the cosine similarity between the vectors as a measure of how similar the sentences are in meaning. This is an application of natural language processing (NLP).
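As a rough illustration of the cosine similarity mentioned above, the sketch below computes it for small hand-written vectors using only the standard library. The function name `cosine_sim` and the convention of returning 0.0 for a zero vector are assumptions made for this example; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumed convention for this sketch: a zero vector has similarity 0.0
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Vectors pointing in the same direction, regardless of their length
same_direction = cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Vectors at a right angle to each other
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

Because the measure depends only on direction, a long answer and a short answer that use the same mix of words can still score close to 1.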

In [180]:
# Loading Libraries
import pandas as pd
import numpy as np
import re
import gensim
from gensim.parsing.preprocessing import remove_stopwords
import gensim.downloader as api
import sklearn
from sklearn.metrics.pairwise import cosine_similarity

# Loading the Q&A data set
df = pd.read_csv("Data/DS_interview.csv")

# Renaming columns of data set
df.columns = ["Q.No", "questions", "answers"]

In [181]:
# Question & answer data set
print("--------------------------------------------Question & Answer pairs------------------------------------------------\n")
for i in range(1,11):
    Ques = df.loc[df["Q.No"]== i]
    print('\033[1m' + "Question",i,":",Ques["questions"].iloc[0] + "\033[0;0m")
    print('\033[1m' + "\nStandard Answer:",Ques["answers"].iloc[0] + "\033[0;0m")
    print("\n\n")

--------------------------------------------Question & Answer pairs------------------------------------------------

Question 1 : What is Data Science?

Standard Answer: Data Science is a combination of algorithms, tools, and machine learning techniques which helps you to find common hidden patterns from the given raw data.



Question 2 : What is Logistic Regression?

Standard Answer: Logistic Regression is a method to forecast the binary outcome from a linear combination of predictor variables using sigmoid function.



Question 3 : What is Bias?

Standard Answer: Bias is an error introduced in your model because of the oversimplification of a machine learning algorithm.



Question 4 : Define Boltzmann Machine?

Standard Answer: Boltzmann machines is a simple learning algorithm, which helps you to discover features that represent complex regularities in the training data.



Question 5 : What is the

In [182]:
# Function to choose a question-answer pair from the given data set
def Q_selection(Q):
    # Q is overwritten by the user's input; keep asking until it is an integer from 1 to 10
    while(True):
        Question_Number = input("Enter the question number you want to ask from above the list of questions:")

        # Non-numeric input would make int() raise ValueError, so treat it as out of range
        try:
            Q = int(Question_Number)
        except ValueError:
            Q = -1

        # Error message
        if Q > 10 or Q < 1:
            print('\033[1m' + "Error: Please enter a number from 1 to 10\n"+ "\033[0;0m")
        else:
            break

    print("\n---------------------You have selected following Q&A pair:-----------------\n")

    Ques = df.loc[df["Q.No"]== Q]
    print('\033[1m' + "Question:",Ques["questions"].iloc[0] + "\033[0;0m")
    print('\033[1m' + "\nStandard Answer:",Ques["answers"].iloc[0] + "\033[0;0m")

    return Q    # Return the chosen question number

In [183]:
# Choosing question-answer pair number
N = Q_selection(-1)

Enter the question number you want to ask from above the list of questions:3

---------------------You have selected following Q&A pair:-----------------

Question: What is Bias?

Standard Answer: Bias is an error introduced in your model because of the oversimplification of a machine learning algorithm.


In [184]:
# Loading the sample answers data set for the chosen Q&A pair, N
sample = pd.read_csv('Data/Sample_' + str(N) + '.csv')

# Renaming columns of sample answers data set
sample.columns = ["marks", "answers"]

# Preprocessing 

Most NLP tasks involve preprocessing. For this task, after converting the text to lower case, we perform the following preprocessing steps:
1. Removing all characters that are not alphanumeric.
2. Removing stopwords, that is, commonly used words such as 'a', 'to' and 'in' that do not contribute to the semantic similarity between two sentences.

We apply this to both the sample answers and the student's answer.
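The two preprocessing steps can be tried out without installing gensim using the sketch below. The names `toy_clean` and `TOY_STOPWORDS` are hypothetical, and the tiny stopword set is an assumption for illustration only; gensim's `remove_stopwords` uses its own, much larger list.

```python
import re

# Tiny illustrative stopword set (an assumption; not gensim's actual list)
TOY_STOPWORDS = {"a", "an", "the", "is", "to", "in", "of", "your"}

def toy_clean(sentence, stopwords=False):
    # Lower-case the text and trim surrounding whitespace
    sentence = sentence.lower().strip()

    # Drop every character that is not a letter, digit or whitespace
    sentence = re.sub(r"[^a-z0-9\s]", "", sentence)

    # Optionally drop words found in the toy stopword set
    if stopwords:
        sentence = " ".join(w for w in sentence.split() if w not in TOY_STOPWORDS)

    return sentence

cleaned = toy_clean("Bias is an error, introduced in your model!", stopwords=True)
print(cleaned)
```

One consequence of deleting punctuation rather than replacing it with a space is that hyphenated terms are fused into a single token, for example `toy_clean("A-B test")` gives `ab test`.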

In [185]:
# Cleaning single sentence
def clean_sentence(sentence, stopwords = False):
    # Converting all characters to lower case
    sentence = sentence.lower().strip()
    
    # Remove characters that are not alphanumeric or whitespace
    sentence = re.sub(r'[^a-z0-9\s]', '', sentence)

    # Removing stopwords if asked to do so
    if stopwords:
        sentence = remove_stopwords(sentence)
    
    # Returning cleaned sentence
    return sentence
            
# Cleaning the answers column of a data frame
def get_cleaned_sentences(df, stopwords = False):
    cleaned_sentences = []   # Initializing list to store cleaned sentences

    # Looping over rows of data frame
    for index, row in df.iterrows():
        # Cleaning each sentence in the answers column
        cleaned = clean_sentence(row["answers"], stopwords)

        # Appending cleaned sentence to the cleaned_sentences list
        cleaned_sentences.append(cleaned)

    # Returning list of cleaned sentences
    return cleaned_sentences

In [186]:
# Cleaning all sample answers of the chosen Q&A pair
cleaned_sample_ans = get_cleaned_sentences(sample, stopwords = True)

In [187]:
# Function to take the student's answer to the asked question as input
def student_answer():
    student_ans = input("---------Student should type his answer below and press enter---------\n")

    # Cleaning the student's answer
    student_ans = clean_sentence(student_ans, stopwords = True)
    
    # Returning cleaned student answer
    return student_ans

In [188]:
# Reading & cleaning the student's answer
cleaned_student_ans = student_answer()

---------Student should type his answer below and press enter---------
Bias is an error introduced in your model because of the oversimplification of a machine learning algorithm.


# GloVe Embeddings

GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. Detailed information about GloVe embeddings can be found at the link below.

https://nlp.stanford.edu/projects/glove/

In [189]:
# Loading the GloVe embedding model
glove_model = None   # Initialization

# Loading the model from disk if it was saved by an earlier run
try:
    glove_model = gensim.models.KeyedVectors.load("./glovemodel.mod")
    print("Loaded glove model")

# Downloading and saving the model if the local copy could not be loaded
except Exception:
    glove_model = api.load('glove-twitter-25')
    glove_model.save("./glovemodel.mod")
    print("Saved glove model")

Loaded glove model


**Finding Phrase Embeddings from Word Embeddings** 

The simplest way to convert word embeddings into a phrase embedding, and one that works with GloVe embeddings, is to sum the embeddings of the individual words in the phrase. It is implemented below.
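To make the summation concrete, here is a minimal sketch that uses a hand-written dictionary of 3-dimensional vectors in place of the GloVe model. The vectors, `TOY_VECTORS`, `DIM` and `toy_phrase_embedding` are all hypothetical and chosen only so the arithmetic is easy to follow; words missing from the dictionary contribute a zero vector, which is assumed to mirror how unknown words are handled in this notebook.

```python
# Hypothetical 3-dimensional word vectors (real GloVe vectors are learned, not hand-written)
TOY_VECTORS = {
    "bias": [1.0, 0.0, 2.0],
    "error": [0.5, 1.0, 0.0],
    "model": [0.0, 2.0, 1.0],
}
DIM = 3

def toy_phrase_embedding(phrase, vectors=TOY_VECTORS, dim=DIM):
    # Start from the zero vector
    total = [0.0] * dim

    for word in phrase.split():
        # Unknown words contribute a zero vector
        word_vec = vectors.get(word, [0.0] * dim)

        # Element-wise addition of the word vector to the running sum
        total = [t + w for t, w in zip(total, word_vec)]

    return total

embedding = toy_phrase_embedding("bias error unknownword model")
print(embedding)
```

Since addition is commutative, a plain sum ignores word order: "model bias error" maps to the same vector as "bias error model".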

In [190]:
# Function to look up the vector of a word in the given model
def getWordVec(word, model):
    # Looking up a common word to find the dimension of the model's vectors
    samp = model['computer']

    # Using the word's vector if the word is in the model's vocabulary
    try:
        vec = model[word]
    # Falling back to a zero vector for out-of-vocabulary words
    except KeyError:
        vec = [0]*len(samp)

    # Returning the word vector
    return vec

# Function to convert word embeddings to phrase embeddings
def getPhraseEmbedding(phrase, embeddingmodel):
    # Starting from a zero vector of the same dimension as the word vectors
    samp = getWordVec('computer', embeddingmodel)
    vec = np.zeros(len(samp))

    # Summing the vectors of all words in the given phrase
    for word in phrase.split():
        vec = vec + np.array(getWordVec(word, embeddingmodel))

    # Returning the phrase embedding as a single-row 2D array
    return vec.reshape(1, -1)

In [191]:
# Function to display cosine similarities & final marks obtained
def retrieveAndPrintmarks(student_embedding, sample_embedding, sample_df):
    # Note: N is the question number chosen earlier, read here as a global variable
    print("Cosine similarity between students answer & each sample answer for Q"+str(N)+" are as follows:\n")

    max_sim = -1     # Initializing maximum value of cosine similarity
    index_sim = -1   # Initializing index corresponding to maximum cosine similarity

    # Looping over all sample embeddings
    for index, samp_embedding in enumerate(sample_embedding):

        # Computing cosine similarity between student answer & sample answer
        sim = cosine_similarity(samp_embedding, student_embedding)[0][0]

        # Displaying cosine similarity for each sample answer
        # (the label assumes the sample answers are ordered with marks 10, 8, 6, 4, 2)
        print('\033[1m' + "Sample answer", index + 1, "(" + str(10 - index*2) + "/10):", sim,"\033[0;0m")

        # Updating the maximum cosine similarity & the corresponding index in sample_df
        if sim > max_sim:
            max_sim = sim
            index_sim = index

    # Computing & displaying student mark
    value = sample_df.iloc[index_sim,0] * max_sim
    print('\033[1m', "\nStudent Marks out of 10: ","{:.2f}".format(value),"\033[0;0m")

    # Displaying formula to compute student mark
    print("\nNote: Formula used for marks calculation is as follows")
    print("\tMarks = (Marks of retrieved sample answer)*(Corresponding cosine similairty)")

In [192]:
# Applying GloVe embedding model
sample_embeddings = []   # Initializing list of sample answer embeddings

# Looping over sample answers
for sent in cleaned_sample_ans:
    # Computing & appending each sample answer embedding to sample_embeddings using GloVe model
    sample_embeddings.append(getPhraseEmbedding(sent, glove_model))

# Computing student answer embedding using GloVe model
student_embedding = getPhraseEmbedding(cleaned_student_ans, glove_model)

# Displaying final results
retrieveAndPrintmarks(student_embedding, sample_embeddings, sample)

Cosine similarity between students answer & each sample answer for Q3 are as follows:

Sample answer 1 (10/10): 0.9999999999999999
Sample answer 2 (8/10): 0.9890437788074244
Sample answer 3 (6/10): 0.9768594868947094
Sample answer 4 (4/10): 0.9706787154068517
Sample answer 5 (2/10): 0.8837214519016757

Student Marks out of 10:  10.00

Note: Formula used for marks calculation is as follows
	Marks = (Marks of retrieved sample answer)*(Corresponding cosine similairty)
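
The marks formula can be checked by hand with the short sketch below, which reuses the similarities printed above. The helper name `score_answer` is hypothetical, and the marks list `[10, 8, 6, 4, 2]` is an assumption read off the printed labels rather than taken from the sample CSV file.

```python
def score_answer(similarities, sample_marks):
    """Return (index of the closest sample answer, marks awarded)."""
    # Index of the sample answer with the highest cosine similarity
    # (max returns the first one if there is a tie)
    best_index = max(range(len(similarities)), key=lambda i: similarities[i])

    # Marks = (marks of retrieved sample answer) * (its cosine similarity)
    marks = sample_marks[best_index] * similarities[best_index]
    return best_index, marks

# Similarities copied from the output above; marks assumed from the printed labels
similarities = [0.9999999999999999, 0.9890437788074244, 0.9768594868947094,
                0.9706787154068517, 0.8837214519016757]
sample_marks = [10, 8, 6, 4, 2]

best_index, marks = score_answer(similarities, sample_marks)
print("Closest sample answer:", best_index + 1)
print("Student marks out of 10: {:.2f}".format(marks))
```

With this formula a student can never score more than the marks of the retrieved sample answer, and the score is scaled down further as the similarity to that sample drops.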
