# NLP and the Web: Exercise 11

So far, you have learned how to vectorize text, how to train classifiers and how to do IR. In this exercise we look at a more practical use case: Community Question Answering (cQA). 

The web is full of rich ressources where humans can post questions and answers to specific topics (e.g. https://stackexchange.com/, https://quora.com or https://answers.yahoo.com/). In many cases the information need of a user is not entirely new, but a similar question has already been asked and answered in some form.


The data is a small sample of the SemEval2015 Task 3. The data comes from a Qatar Living Forum.  This subset comes as `\t` seperated file and includes multiple columns:
* **qid** is the unique ID for each question
* **cid** is the unique ID for each comment
* **question_category** is the category of the question (such as "Beauty and Style")
* **question_subject** is the subject associated with a question
* **question** is the textual question
* **comment** is a comment for this question
* **comment_gold** is the label, whether this comment is a "good" or "bad" answer to the question

Each question can come with multiple comments; in this case, a new row is used for each differing comment (but the same question).

In [1]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
import spacy

nlp = spacy.load('en_core_web_md')

## Task 1 - 2 Points

**a)** Load the provided train and dev set. Lowercase the fields `question_subject`, `question` and `comment` of both data splits. After lowercasing, output the first five elements (e.g. `head()`) of the training split.


In [None]:
train_data = pd.read_csv("data/train.tsv",sep="\t", header=0)
test_data = pd.read_csv("data/dev.tsv",sep="\t", header=0)

# Lowercase the field
train_data["question_subject"] = train_data["question_subject"].apply(lambda x: str(x).lower())
train_data["question"]= train_data["question"].apply(lambda x: str(x).lower())
train_data["comment"]= train_data["comment"].apply(lambda x: str(x).lower())

test_data["question_subject"] = test_data["question_subject"].apply(lambda x: str(x).lower())
test_data["question"]= test_data["question"].apply(lambda x: str(x).lower())
test_data["comment"]= test_data["comment"].apply(lambda x: str(x).lower())

In [None]:
train_data.head()

In [None]:
test_data.head()

**b)** Create three new columns in the training and development splits: `question_vectors`, `subject_vectors`, `comment_vectors`. Use spaCy to convert each token of the fields `question`, `question_subject`, `comment` into a dense word vector (`token.vector`) and store these word vectors in the new columns. 

In [None]:
def sent2vec(sentence):
    """
    params:
        sentence: list of words
    return:
        result: (nummer of words, 300) matrix of vectorized tokens
    """
    
    tokenizer = nlp.tokenizer
    tokens = tokenizer(sentence)
    
    # Initialize output matrix
    vectorized_sent = np.zeros(shape=(len(tokens),300))
    
    for i, tok in enumerate(tokens):
        vectorized_token = nlp(tok.text).vector
        vectorized_sent[i] = vectorized_token
        
    return vectorized_sent

In [None]:
# Lowercase the field
train_data["question_vectors"] = train_data["question_subject"].apply(sent2vec)
train_data["subject_vectors"] = train_data["question"].apply(sent2vec)
train_data["comment_vectors"] = train_data["comment"].apply(sent2vec)

test_data["question_vectors"] = test_data["question_subject"].apply(sent2vec)
test_data["subject_vectors"] = test_data["question"].apply(sent2vec)
test_data["comment_vectors"] = test_data["comment"].apply(sent2vec)

In [None]:
# Vectorizing word takes a long time. We will create a pickle file out of the vectors

train_data.to_pickle("data/train.pkl")
test_data.to_pickle("data/test.pkl")

In [2]:
# Load dataset vectors we vectorized from earlier
train_data_ = pd.read_pickle("data/train.pkl")
test_data_ = pd.read_pickle("data/test.pkl")

## Task 2 - 8 Points
You will train a classifier to predict whether a comment is a good answer or not, based on the word vectors. You will explore different strategies
* how to compute a single fixed-length vector out of the word vectors
* how to combine these single vectors from different fields (such as `question` and `comment`)
 
**a)** Implement the function `embedding_fn_max_pooling(word_vectors)`. It should compute a single vector (of the same dimensionality as a single word vector) via max pooling: In each $i$th dimension, the resulting representation must have the maximum value based on all $i$th dimension values from all words.

Example:

$w_0 = [0.1, 0.2, 0.3, 0.4]$ 

$w_1 = [1.0, -1.5, 2.0, -2.5]$

The resulting embedding should be $e = [max(0.1, 1.0), max(0.2, -1.5), max(0.3, 2.0), max(0.4, -2.5)] = [1.0, 0.2, 2.0, 0.4]$.

Output the dimensionality of the resulting embedding after applying this function on the word vectors of a single question.

In [3]:
def embedding_fn_max_pooling(word_vectors):
    """
    Converts an arbitrary number of d-dimensional word vectors 
    into a single d-dimensional embedding via max-pooling.
    :param word_vectors
        list of d-dimensional word vectors (one vector for each token)
    :returns
        d-dimensional embedding
    """
    sen_vec =  np.amax(word_vectors, axis=0)
    return sen_vec

In [4]:
arr2D = np.array([[11, 12, 13],
                     [14, 15, 16],
                     [17, 15, 11],
                     [12, 14, 15]])
print(embedding_fn_max_pooling(arr2D))

[17 15 16]


**b)** Implement the function `feature_fn_concatenate_question_comment(df, embedding_fn)`. It should
* use the `embedding_fn` to create fixed-length vectors from the fields `question_vectors` and `comment_vectors` individually. (For each sample: one vector for the question, one vector for the comments).
* Concatenate both vectors
* Return these concatenated vectors (the features) for all samples in the dataframe `df`.

You will use these vectors to train a new classifier. Execute this function using `embedding_fn=embedding_fn_max_pooling` on the train split and output the shape of the resulting matrix.

In [5]:
def feature_fn_concatenate_question_comment(df, embedding_fn):
    """
    Uses the embedding_fn to create d-dimensional embeddings from the tokens 
    of the questions and comments respectively, and concatenates these embeddings.
    :param df
        Dataframe consisting of multiple samples. Features will be computed for each sample individually
    :param embedding_fn
        As in 2a)
    :returns
        Matrix of shape (n, d) whereas 
        n is the number of samples in df and 
        d is the output dimensionality of the embedding_fn
    """
    vectorized_qc = np.zeros(shape=(df.shape[0], 300*2))
    
    for i, row in df.iterrows():
        q_v = embedding_fn(row["question_vectors"])
        c_v = embedding_fn(row["comment_vectors"])
        vectorized_qc[i] = np.concatenate([q_v,c_v])
    
    return vectorized_qc

**c)** Execute the (already implemented) function `train_and_evaluate` to train a new classifier based on the two functions you implemented. Use `embedding_fn=embedding_fn_max_pooling`, `feature_fn=feature_fn_concatenate_question_comment` and leave the classifier as the default parameter. The function will normalize the computed features and train and evaluate a support vector machine.

What is the advantage of using F1 macro (as opposed to accuracy or F1 micro/weighted) when we want to weight each label equally?

In [6]:
def train_and_evaluate(df_train, df_dev, embedding_fn, feature_fn, 
                      classifier=make_pipeline(StandardScaler(), SVC())):
    
    # 1) Compute vectors
    X_train = feature_fn(df_train, embedding_fn)
    X_dev = feature_fn(df_dev, embedding_fn)
    
    # 2) Train classifier
    classifier.fit(X_train, df_train['comment_gold'])
    
    # 3) Predict dev data
    predictions = classifier.predict(X_dev)
    
    # 4) Compute and output metrics
    print(classification_report(df_dev['comment_gold'], predictions))

    
# You probably have to update the names of your train and dev set
train_and_evaluate(train_data_, test_data_, embedding_fn_max_pooling, feature_fn_concatenate_question_comment)


              precision    recall  f1-score   support

         Bad       0.61      0.21      0.31        92
        Good       0.81      0.96      0.88       322

    accuracy                           0.79       414
   macro avg       0.71      0.58      0.59       414
weighted avg       0.77      0.79      0.75       414



With accuracy, each sample is equally contributing to the final metric. Hence, if one label is highly represented in the data, the prediction based on this label will largely dominate the final accuracy score. In f1-macro, every label is weighted equally, regardless on how many samples each label represents.

**d)** Create
* one new embedding function (as in 2a) that converts multiple word vectors into a single fixed-length vector (e.g. by averaging over all word embeddings)
* two new feature functions (as in 2b), that combine these vectors of different fields to generate the final features (e.g. add different columns, use fewer columns, use average instead of concatenating, ...).

Use self-explanatory function names or add a comment to describe what each function does.

Run a grid search (parts are already implemented) for all combinations of all two embedding functions and all three feature functions. Which combination yields in the highest F1 macro? Explain in up to three sentences how the task of *question similarity* comes into play, when you want to create a basic cQA system with this model.

__Answer:__

Highest f1-macro is yield from the combination of averaging embedding with concat-question-comment feature (68%)

We can use this system to give answers for new incoming questions. For example, we choose the a set of answers from the samw topic of the question. Then we perform the prediction using the model on the choosen set of answers and the given question to determine the labels of the combinations of question and answer. We then return the answer which is labeled as good.

In [7]:
def embedding_fn_average(word_vectors):
    """
    Converts an arbitrary number of d-dimensional word vectors 
    into a single d-dimensional embedding via averaging.
    :param word_vectors
        list of d-dimensional word vectors (one vector for each token)
    :returns
        d-dimensional embedding
    """
    sen_vec =  np.mean(word_vectors, axis=0)

    return sen_vec

In [27]:
def feature_fn_averaging_question_comment(df, embedding_fn):
    """
    Uses the embedding_fn to create d-dimensional embeddings from the tokens 
    of the questions and comments respectively, and averagibg these embeddings.
    :param df
        Dataframe consisting of multiple samples. Features will be computed for each sample individually
    :param embedding_fn
        As in 2a)
    :returns
        Matrix of shape (n, d) whereas 
        n is the number of samples in df and 
        d is the output dimensionality of the embedding_fn
    """
    vectorized_qc = np.zeros(shape=(df.shape[0], 300*2))
    
    for i, row in df.iterrows():
        q_v = embedding_fn(row["question_vectors"])
        c_v = embedding_fn(row["comment_vectors"])
        vectorized_qc[i] = np.mean([q_v,c_v])
    
    return vectorized_qc

In [34]:
def feature_fn_concatenate_question_comment_subject(df, embedding_fn):
    """
    Uses the embedding_fn to create d-dimensional embeddings from the tokens 
    of the questions, comments and subject respectively, and concatenates these embeddings.
    :param df
        Dataframe consisting of multiple samples. Features will be computed for each sample individually
    :param embedding_fn
        As in 2a)
    :returns
        Matrix of shape (n, d) whereas 
        n is the number of samples in df and 
        d is the output dimensionality of the embedding_fn
    """
    vectorized_qc = np.zeros(shape=(df.shape[0], 300*3))
    
    for i, row in df.iterrows():
        q_v = embedding_fn(row["question_vectors"])
        c_v = embedding_fn(row["comment_vectors"])
        s_v = embedding_fn(row["subject_vectors"])
        vectorized_qc[i] = np.concatenate((q_v,c_v,s_v))
    
    return vectorized_qc

In [25]:
def feature_fn_averaging_question_comment_subject(df, embedding_fn):
    """
    Uses the embedding_fn to create d-dimensional embeddings from the tokens 
    of the questions, comments and subject respectively, and averaging these embeddings.
    :param df
        Dataframe consisting of multiple samples. Features will be computed for each sample individually
    :param embedding_fn
        As in 2a)
    :returns
        Matrix of shape (n, d) whereas 
        n is the number of samples in df and 
        d is the output dimensionality of the embedding_fn
    """
    vectorized_qc = np.zeros(shape=(df.shape[0], 300*2))
    
    for i, row in df.iterrows():
        q_v = embedding_fn(row["question_vectors"])
        c_v = embedding_fn(row["comment_vectors"])
        s_v = embedding_fn(row["subject_vectors"])
        vectorized_qc[i] = np.mean([q_v,c_v,s_v])
    
    return vectorized_qc

In [35]:
embedding_functions = [
    ('max-pooling', embedding_fn_max_pooling), 
    ('averaging', embedding_fn_average)
]

feature_functions = [
    ('concat-question-comment', feature_fn_concatenate_question_comment),
    ("averaging-question-comment",feature_fn_averaging_question_comment),
    ("concatenate-question-comment-subject",feature_fn_concatenate_question_comment_subject),
    ("averaging-question-comment-subject", feature_fn_averaging_question_comment_subject)
]

for embedding_name, embedding_fn in embedding_functions:
    for feature_name, feature_fn in feature_functions:
        print('Embeddings:', embedding_name, '; Features:', feature_name)
        
        # Adjust the naming of df_train and df_dev to match your training and dev set:
        train_and_evaluate(train_data_, test_data_, embedding_fn, feature_fn)
        print()

Embeddings: max-pooling ; Features: concat-question-comment
              precision    recall  f1-score   support

         Bad       0.61      0.21      0.31        92
        Good       0.81      0.96      0.88       322

    accuracy                           0.79       414
   macro avg       0.71      0.58      0.59       414
weighted avg       0.77      0.79      0.75       414


Embeddings: max-pooling ; Features: averaging-question-comment
              precision    recall  f1-score   support

         Bad       0.33      0.04      0.08        92
        Good       0.78      0.98      0.87       322

    accuracy                           0.77       414
   macro avg       0.56      0.51      0.47       414
weighted avg       0.68      0.77      0.69       414


Embeddings: max-pooling ; Features: concatenate-question-comment-subject
              precision    recall  f1-score   support

         Bad       0.67      0.07      0.12        92
        Good       0.79      0.99      

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         Bad       0.00      0.00      0.00        92
        Good       0.78      1.00      0.88       322

    accuracy                           0.78       414
   macro avg       0.39      0.50      0.44       414
weighted avg       0.60      0.78      0.68       414


Embeddings: averaging ; Features: concatenate-question-comment-subject
              precision    recall  f1-score   support

         Bad       0.76      0.24      0.36        92
        Good       0.82      0.98      0.89       322

    accuracy                           0.81       414
   macro avg       0.79      0.61      0.63       414
weighted avg       0.80      0.81      0.77       414


Embeddings: averaging ; Features: averaging-question-comment-subject
              precision    recall  f1-score   support

         Bad       0.81      0.14      0.24        92
        Good       0.80      0.99      0.89       322

    accuracy                           0

In [None]:
# Your answer here