# Technical Assessment for Data Science Position

Business Problem

Short-answer questions require students to express their thoughts freely, as opposed to
multiple-choice questions. This type of assessment may be considered more accurate
because it reveals a student’s real understanding. Nevertheless, grading short-answer
questions is extremely challenging because manual grading by experts
becomes extraordinarily expensive to scale up. To resolve this scalability challenge, we expect
you to build an automated grading system.
For this purpose, you are provided an annotated dataset consisting of 900 students’ short
constructed answers and their correctness in the given context. Four qualitative levels of
correctness are defined: correct, correct-but-incomplete, contradictory, and incorrect.
We don’t expect a completed solution. Our focus is on your problem-solving logic, coding
skills, the connection of technical expertise to the problem and overall approach. We would
like to understand how comfortable you would be with cutting edge development.

Is this something that you are open to doing? If so, the instructions are below:

Goal

The goal of this project is to take a question, reference answers and a student’s response and
determine whether the student’s response is correct, correct-but-incomplete, contradictory,
or incorrect.

Metric

Please split your dataset into training and test sets in an 80-20 ratio. The project is
evaluated on the classification accuracy and F1 score of your predictions.

Additional Notes

The ideal response will be reproducible. Your trained model takes the instructor question,
reference answers and a student’s response and generates a result when we run it. Please
provide a requirements.txt file that contains the dependencies for your code. Our preferred
language choices are Python and R.

Acknowledgements

The data is presented by Banjade et al. and available here. However, the data is attached in
XML in the package sent to you.

Banjade, R., Maharjan, N., Niraula, N. B., Gautam, D., Samei, B., & Rus, V. Evaluation Dataset (DT-Grade) and Word
Weighting Approach towards Free Short Answers Assessment in Tutorial Dialogue Context. In Proceedings of the BEA
workshop at NAACL.

# Data Loading and Processing

I exploded the dataset by `ReferenceAnswers`: each of the 900 rows is split into n rows, where n is the number of reference answers for that row.

To keep the test set reliable, the model must never be trained on a student answer that also appears in the test set. We therefore first split the original dataset into `training` and `test` sets in an 80-20 ratio, and only then explode the `train` and `test` sets separately. At evaluation time, each `test` row (one `Question` and `Answer` pair) is exploded into one input per reference answer, the model predicts a class for each input, and the final label is the most frequently predicted class.
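The explode-then-vote flow can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the notebook's implementation: `explode_rows` and `majority_vote` are hypothetical helper names, the question and answer strings are shortened examples, and the per-row predictions are hard-coded stand-ins for model output.

```python
import re
from collections import Counter


def split_reference_answers(text):
    """Split a numbered block of reference answers into separate strings."""
    parts = re.split(r"\d+:\s+", text.replace("\n", " "))
    return [part.strip() for part in parts if part.strip()]


def explode_rows(question, answer, reference_answers):
    """Return one (reference, answer, question) row per reference answer."""
    return [(ref, answer, question) for ref in split_reference_answers(reference_answers)]


def majority_vote(predictions):
    """Most frequent class id; ties go to the smallest id, like np.bincount(...).argmax()."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


rows = explode_rows(
    "How do the forces compare?",
    "They are equal.",
    "\n1:  The forces are equal in magnitude.\n2:  The amounts of the forces are equal.\n",
)
# Stand-in predictions, one per exploded row (0 = correct, 3 = incorrect).
fake_predictions = [0, 0]
final_label = majority_vote(fake_predictions)
```

Because the split happens before the explode, all rows derived from one student answer end up on the same side of the train/test boundary.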

In [1]:
import re
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

import xml.etree.ElementTree as ET

from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical


class Preprocessing:
    def load(self, file_path):
        df = self.read(file_path)
        # Copy the column subset so the assignment below does not write to a slice
        df = df[['Question', 'Answer', 'ReferenceAnswers', 'Annotation']].copy()
        df['Annotation'] = df['Annotation'].apply(self.get_annotation)
        train, test, y_train, y_test = train_test_split(df, df['Annotation'], test_size = 0.20, stratify=df['Annotation'], random_state=777)
        train = self.process_data(train)
        original_test = test.copy()
        test = self.process_data(test)
        return train, y_train, test, y_test, original_test

    def process_data(self, dataframe):
        dataframe = self.clean_text(self.concat_text(self.expload_reference_answers(dataframe)))
        return dataframe

    @staticmethod
    def read(file_path):
        # Read the file that was passed in rather than a hard-coded path
        with open(file_path, 'r') as xml_file:
            xml_data = xml_file.read()
        root = ET.XML(xml_data)
        data = []
        for i, child in enumerate(root):
            row = {}
            for attribute in child:
                if attribute.tag == "ProblemDescription":
                    row[attribute.tag] = attribute.text
                elif attribute.tag == "Question":
                    row[attribute.tag] = attribute.text
                elif attribute.tag == "Answer":
                    row[attribute.tag] = attribute.text
                elif attribute.tag == "Annotation":
                    row[attribute.tag] = attribute.get("Label")
                    for subchild in attribute:
                        if subchild.tag == "AdditionalAnnotation":
                            row["ContextRequired"] = subchild.get("ContextRequired")
                            row["ExtraInfoInAnswer"] = subchild.get("ExtraInfoInAnswer")
                        elif subchild.tag == "Comments":
                            row[subchild.tag] = subchild.text
                            row["Watch"] = subchild.get("Watch")
                elif attribute.tag == "ReferenceAnswers":
                    row[attribute.tag] = attribute.text
            data.append(row)
        return pd.DataFrame(data)
    

    @staticmethod
    def get_annotation(text):
        values = text.split('|')
        annotation = [i for i, value in enumerate(values) if "(1)" in value][0]
        return annotation

    @staticmethod
    def split_reference_answers(text):
        # Raw string so that \d is a regex digit class, not a string escape
        return list(filter(None, re.split(r"\d+:  ", text.replace("\n", ""))))
    
    @staticmethod
    def clean_text(df, column='text'):
        df[column] = df[column].str.replace('\n',' ')
        df[column] = df[column].str.replace('\r',' ')
        df[column] = df[column].str.replace('\t',' ')
        
        # Remove digits and punctuation
        df[column] = df[column].apply(lambda x: re.sub(r'[0-9]',' ',x))
        df[column] = df[column].apply(lambda x: re.sub(r'[/(){}\[\]\|@,;.:-]',' ',x))

        # Convert upper case to lower case
        df[column] = df[column].apply(lambda s: s.lower() if isinstance(s, str) else s)

        # Collapse double spaces (single pass, so longer runs are only shortened)
        df[column] = df[column].str.replace('  ',' ')
        return df
    
    def expload_reference_answers(self, dataframe):
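        # Work on a copy so the caller's frame is untouched and the column
        # assignment below does not trigger SettingWithCopyWarning
        dataframe = dataframe.copy()
        dataframe['ReferenceAnswers'] = dataframe['ReferenceAnswers'].apply(self.split_reference_answers)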
        dataframe = dataframe.explode('ReferenceAnswers')
        dataframe.reset_index(drop=True, inplace=True)
        return dataframe
    
    @staticmethod
    def concat_text(dataframe):
        dataframe['text'] = (dataframe['ReferenceAnswers'] + ' ' + dataframe['Answer'] + ' ' + dataframe['Question']).apply(lambda row: row.strip())
        return dataframe
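The `get_annotation` helper above turns the `Label` attribute into a class index by finding the option flagged with `(1)`. The sketch below shows that parsing on its own. The example label string is an assumption inferred from the parsing logic and the class order used later in the notebook, not a value copied from the XML file, and the explicit error check is an addition for illustration.

```python
def get_annotation(label):
    """Return the index of the option flagged with '(1)' in a pipe-separated label."""
    options = label.split("|")
    flagged = [i for i, option in enumerate(options) if "(1)" in option]
    if len(flagged) != 1:
        raise ValueError(f"expected exactly one flagged option, got {len(flagged)}: {label!r}")
    return flagged[0]


# Class order assumed to match the notebook's class_dict.
CLASS_NAMES = ["correct", "correct_but_incomplete", "contradictory", "incorrect"]

# Hypothetical label string; the real attribute values come from the XML file.
example = "correct(0)|correct_but_incomplete(1)|contradictory(0)|incorrect(0)"
class_id = get_annotation(example)
class_name = CLASS_NAMES[class_id]
```

Raising on a label with zero or several flagged options makes malformed annotations visible at load time instead of surfacing as an `IndexError` or a silently wrong class.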

In [2]:
data_preprocessing = Preprocessing()

In [3]:
train, y_train, test, y_test, original_test = data_preprocessing.load('grade_data.xml')



# Build and Train Model

In this experiment, I used a BERT layer followed by two Dense layers, each followed by a Dropout layer to reduce overfitting, and a final softmax output layer.

In [4]:
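# Local tokenization.py module that provides FullTokenizer (presumably the
# BERT reference tokenizer); it must be importable from the notebook's directory
import tokenization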

import tensorflow as tf
import tensorflow.keras.backend as K
import tensorflow_hub as hub
from sklearn.utils import class_weight
from sklearn.metrics import confusion_matrix, classification_report


class Model:
    def __init__(self, data_preprocessing):
        self.max_len = 200
        self.class_dict = {0:'correct', 1:'correct_but_incomplete', 2:'contradictory', 3:'incorrect'}
        self.data_preprocessing = data_preprocessing
        module_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2'
        self.bert_layer = hub.KerasLayer(module_url, trainable=True)
        vocab_file = self.bert_layer.resolved_object.vocab_file.asset_path.numpy()
        do_lower_case = self.bert_layer.resolved_object.do_lower_case.numpy()
        self.tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
        self.model = self.build_model()
    
    def get_class_weights(self, labels):
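        # Keyword arguments: recent scikit-learn releases no longer accept these positionally
        weights = class_weight.compute_class_weight(class_weight='balanced',
                                                 classes=np.unique(labels),
                                                 y=labels)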
        class_weights = {}
        for w, c in zip(weights, np.unique(labels)):
            class_weights[c] = w
        return class_weights

    def bert_encode(self, texts):
        all_tokens = []
        all_masks = []
        all_segments = []
        
        for text in texts:
            text = self.tokenizer.tokenize(text)
            
            text = text[:self.max_len-2]
            input_sequence = ["[CLS]"] + text + ["[SEP]"]
            pad_len = self.max_len - len(input_sequence)
            
            tokens = self.tokenizer.convert_tokens_to_ids(input_sequence) + [0] * pad_len
            pad_masks = [1] * len(input_sequence) + [0] * pad_len
            segment_ids = [0] * self.max_len
            
            all_tokens.append(tokens)
            all_masks.append(pad_masks)
            all_segments.append(segment_ids)
        
        return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
    
    def encode_inputs(self, data):
        data_input = self.bert_encode(data.text.values)
        y_data_org = data.pop('Annotation')
        y_data = to_categorical(np.asarray(y_data_org))
        return data_input, y_data, y_data_org

    def recall_m(self, y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision_m(self, y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision

    def f1_m(self, y_true, y_pred):
        precision = self.precision_m(y_true, y_pred)
        recall = self.recall_m(y_true, y_pred)
        return 2*((precision*recall)/(precision+recall+K.epsilon()))
    
    def train(self, train_input):
        checkpoint = tf.keras.callbacks.ModelCheckpoint('new-dt-model.h5', monitor='val_accuracy', save_best_only=True, verbose=1)
        earlystopping = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=7, verbose=1)
        train_input, y_train, y_train_org = self.encode_inputs(train_input)
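        # Class weights are computed and printed but not passed to fit() below.
        # The local name differs from the imported sklearn class_weight module.
        class_weights = self.get_class_weights(y_train_org)
        print('class weights: ', class_weights)
        train_history = self.model.fit(
            train_input, y_train,
            validation_split=0.2,
            # class_weight=class_weights,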
            epochs=5,
            callbacks=[checkpoint, earlystopping],
            batch_size=5,
            verbose=1
        )

    def build_model(self):
        input_word_ids = tf.keras.Input(shape=(self.max_len,), dtype=tf.int32, name="input_word_ids")
        input_mask = tf.keras.Input(shape=(self.max_len,), dtype=tf.int32, name="input_mask")
        segment_ids = tf.keras.Input(shape=(self.max_len,), dtype=tf.int32, name="segment_ids")

        _, sequence_output = self.bert_layer([input_word_ids, input_mask, segment_ids])
        clf_output = sequence_output[:, 0, :]
        net = tf.keras.layers.Dense(16, activation='relu')(clf_output)
        net = tf.keras.layers.Dropout(0.2)(net)
        net = tf.keras.layers.Dense(8, activation='relu')(net)
        net = tf.keras.layers.Dropout(0.2)(net)
        out = tf.keras.layers.Dense(4, activation='softmax')(net)
        
        model = tf.keras.models.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
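        # learning_rate replaces the deprecated lr argument
        model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5), loss='categorical_crossentropy', metrics=['accuracy', self.f1_m, self.precision_m, self.recall_m])
        # summary() prints the table itself and returns None
        model.summary()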
        
        return model
    
    def load(self, model_path):
        self.model.load_weights(model_path)
    
    def test(self, test):
        t_labels = []
        p_labels = []
        for i, row in test.iterrows():
            p = self.vote_predict_df(row)
            t_labels.append(row.Annotation)
            p_labels.append(p)
        class_ids = list(np.unique(test.Annotation))
        class_names = [self.class_dict[cid] for cid in class_ids]
        clf_report = classification_report(t_labels, p_labels, target_names=class_names)
        print(clf_report)
    
    def vote_predict_df(self, row):
        row = pd.DataFrame({
            "Question": [row.Question],
            "Answer": [row.Answer],
            "ReferenceAnswers": [row.ReferenceAnswers]
        })
        row = self.data_preprocessing.process_data(row)
        row_input = self.bert_encode(row.text.values)
        preds = self.model.predict(row_input)
        preds = np.argmax(preds, axis=1)
        return np.bincount(preds).argmax()
    
    def vote_predict(self, question, answer, reference_answers):
        row = pd.DataFrame({
            "Question": [question],
            "Answer": [answer],
            "ReferenceAnswers": [reference_answers]
        })
        row = self.data_preprocessing.process_data(row)
        row_input = self.bert_encode(row.text.values)
        preds = self.model.predict(row_input)
        preds = np.argmax(preds, axis=1)
        return self.class_dict[np.bincount(preds).argmax()]
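The `bert_encode` method above builds three aligned arrays per input: token ids, an attention mask, and segment ids. The following sketch reproduces that layout with a toy vocabulary so the shapes are easy to inspect. The `encode` function, the whitespace tokenization, and `toy_vocab` are illustrative assumptions; the notebook uses the tokenizer and vocabulary shipped with the BERT module.

```python
def encode(tokens, vocab, max_len):
    """Build (token_ids, attention_mask, segment_ids) for one single-segment input."""
    tokens = tokens[: max_len - 2]  # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)

    # Unknown tokens map to [UNK]; padding positions use the [PAD] id.
    token_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in sequence]
    token_ids += [vocab["[PAD]"]] * pad_len

    # 1 marks a real token, 0 marks padding.
    mask = [1] * len(sequence) + [0] * pad_len

    # Everything is treated as one segment, as in the notebook.
    segment_ids = [0] * max_len
    return token_ids, mask, segment_ids


# Toy vocabulary for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "forces": 4, "are": 5, "equal": 6}
ids, mask, segments = encode("the forces are equal".split(), toy_vocab, max_len=8)
```

Since the reference answer, student answer, and question are concatenated into one string, all segment ids are zero. BERT also accepts sentence pairs separated by `[SEP]` with segment ids 0 and 1, which could be an alternative way to present the reference answer and the student answer; the notebook does not use that layout.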

In [5]:
model = Model(data_preprocessing)

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 200)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 200)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 200)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

In [None]:
#model.load('new-dt-model.h5')

model.test(original_test)
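`model.test` prints scikit-learn's `classification_report`, which covers the two metrics named in the assignment. As a sanity check, accuracy and macro-averaged F1 can also be computed by hand. The sketch below assumes integer class ids and uses made-up labels, not results from this model; whether the assignment intends macro or another F1 average is not stated, so macro is an assumption here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Made-up labels for illustration (0..3 follow the notebook's class ids).
y_true = [0, 0, 1, 3, 3, 2]
y_pred = [0, 1, 1, 3, 0, 2]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Macro averaging gives each of the four classes equal weight, which matters when some classes are much rarer than others.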

In [14]:
question = 'Can you articulate the definition or principle that helps us determine the forces?'
answer = 'the net forces are the sum of all the forces acting on the object'
reference_answers = '''
1:  An object at rest will stay at rest and an object moving with constant velocity in a straight line will continue moving with constant velocity in a straight line as long as the net force acting on the object is zero.
2:  When the object is in equilibrium or velocity is constant, the sum of all forces will equal 0.
3:  If the acceleration of a system is zero, the summation of the forces is also zero.
'''
model.vote_predict(question, answer, reference_answers)

'incorrect'

In [15]:
question = 'How do the magnitudes, or amounts, of the forces they exert on each other compare?'
answer = 'The truck exerts a large force on the smaller car. Since the truck is accelerating, the forces between them are not balanced. Gravity and the normal are balanced'
reference_answers = '''
1:  The force from the car on the truck and the force from the truck on the car are a Third Law pair. Thus, the forces are equal in magnitude and opposite in direction.
2:  The forces exerted by the truck and car are equal and opposite.
3:  The amount of forces from the truck and car are equal.
4:  The forces are equal in magnitude.
5:  The amounts of the forces are equal.
'''
model.vote_predict(question, answer, reference_answers)

'contradictory'