# Barshay Secondary Results

After struggling to turn my cosine similarity results into a classification model, I was at something of a crossroads. One idea mentioned in class, which I used for my first real attempt at classification, was to build one TF-IDF feature vector per candidate answer, where each vector contains the question text, the document text, and one of the possible long answers. So if a question has 50 possible long answers, there are 50 vectors for that question, and the only difference between them is the candidate long answer. The idea was to let the machine learning model see what the vectors for a question have in common, so that the differences contributed by each candidate long answer stand out. The hope was that this extra context would improve prediction accuracy.

If that sounds somewhat confusing, let me walk you through an example. I start by importing all my favorite data science tools.

*Much of the structure from this sample was borrowed from Dr. Paul Anderson.*

In [447]:
from bs4 import BeautifulSoup
import json
import pandas as pd
import sklearn
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from pandas.io.json import json_normalize
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re
from scipy.sparse import hstack
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score
import joblib
from sklearn.model_selection import cross_val_score


%matplotlib inline

pd.options.display.max_rows = 160

I initially loaded many more observations, but my computer couldn't handle it, so I had to cut down to roughly 100 questions, which is unfortunate. (Because the counter is checked after each append, the loop below actually reads 101 records.)

In [448]:
train_100 = open("/Users/mdbarshay/Desktop/DATA 301 Project/tensorflow2-question-answering/simplified-nq-train.jsonl")
records_train = [] 
c=0
for line in train_100:
    records_train.append(json.loads(line))
    if c > 99:
        break
    c+=1
train_df = pd.DataFrame(records_train)
train_100.close()
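If exactly n records are wanted, one option is `itertools.islice`, which stops reading after n lines without a manual counter. This is a minimal sketch, assuming a JSON Lines source; `read_first_records` and `fake_file` are hypothetical names, and the in-memory file stands in for the real training file.

```python
import io
import json
from itertools import islice

def read_first_records(handle, n):
    """Parse at most the first n lines of a JSON Lines file handle."""
    return [json.loads(line) for line in islice(handle, n)]

# Hypothetical stand-in for the real .jsonl file: 250 tiny records.
fake_file = io.StringIO("\n".join(json.dumps({"example_id": i}) for i in range(250)))

records = read_first_records(fake_file, 100)
print(len(records))  # 100
```

With a real file, the same helper could be called inside a `with open(...) as handle:` block so the file is closed automatically.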

We now define a function that processes the data to make it amenable to analysis.

This function processes each row and normalizes each json object within each row. In this case it is normalizing the long answer candidates.

In [449]:
def process(row):
    tdf = json_normalize(row['long_answer_candidates'])
    tdf["example_id"] = row['example_id'] 
    return tdf

Now we apply this function to our training data frame.

In [450]:
long_answer_candidates = train_df.apply(process,axis=1)
long_answer_candidates = pd.concat(long_answer_candidates.values) 
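For readers unfamiliar with this flattening pattern, here is a toy version with made-up records. It assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`, and it uses an explicit loop over rows instead of `apply`; `toy_df` and `flatten_candidates` are hypothetical names.

```python
import pandas as pd

# Made-up data: two questions, each with a list of candidate dicts.
toy_df = pd.DataFrame({
    "example_id": [11, 22],
    "long_answer_candidates": [
        [{"start_token": 0, "end_token": 5, "top_level": True},
         {"start_token": 5, "end_token": 9, "top_level": False}],
        [{"start_token": 2, "end_token": 7, "top_level": True}],
    ],
})

def flatten_candidates(row):
    """Turn one question's list of candidate dicts into a small data frame."""
    tdf = pd.json_normalize(row["long_answer_candidates"])
    tdf["example_id"] = row["example_id"]  # remember which question this came from
    return tdf

# One small frame per question, concatenated into one row per candidate.
frames = [flatten_candidates(row) for _, row in toy_df.iterrows()]
flat = pd.concat(frames, ignore_index=True)
print(flat)
```

The result has one row per candidate, with the question's example id repeated on each of its rows.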

This is similar to the strategies I have used before to pull out each long answer candidate on its own, but this one is especially nice because it keeps the example id, which I can use later to keep track of which question each candidate's tokens belong to. Let me confirm that the function worked as hoped by checking that we got all of the long answer candidates for example id 5655493461695504401. I do so below.

In [451]:
len(long_answer_candidates[long_answer_candidates["example_id"] == 5655493461695504401])

63

In [452]:
len(train_df.iloc[0]["long_answer_candidates"])

63

Now, just for reference, I will quickly calculate how many long answer candidates correspond to each question on average.

In [453]:
"The average number of long answer candidates for each question is " + str(len(long_answer_candidates) / 101) \
+ " candidates."

'The average number of long answer candidates for each question is 107.14851485148515 candidates.'

Now, to normalize the annotations in the same way, I define a similar process function.

In [454]:
def process(row):
    tdf = json_normalize(row['annotations'])
    tdf["example_id"] = row['example_id']  
    return tdf

annotations = train_df.apply(process,axis=1)
annotations = pd.concat(annotations.values)

Next I normalize the short answers in the same way, but this time I have to be careful to copy every other annotation column onto each short-answer row.

In [455]:
def process(row):
    tdf = json_normalize(row['short_answers'])
    tdf["example_id"] = row['example_id']
    for c in row.index:
        if c == "short_answers":
            continue
        tdf[c] = row.loc[c]
    return tdf

short_answers = annotations.apply(process,axis=1)
short_answers = pd.concat(short_answers.values,sort=False)

At this point it is time to put all of the pieces together and join everything back into one data frame, making sure to set the index to example_id on both sides so that the join matches each candidate to its question without creating duplicate observations.

In [456]:
joined_train_df = train_df.set_index("example_id").join(long_answer_candidates.set_index("example_id"))

Now I take the tokens of each candidate (from start_token to end_token in the document text), lowercase them, and remove everything found between "<" and ">", which effectively removes all HTML tags.

In [457]:
document_text_token = joined_train_df.apply(lambda row: re.sub("<.*?>", " "," ".join(str(row['document_text']).split()[row.start_token:row.end_token]).lower()),axis=1)
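To make this step concrete, here is a small sketch on a made-up document string (the text and token offsets are invented for illustration, and `candidate_text` is a hypothetical helper). It slices the whitespace-separated tokens of one candidate, lowercases them, and strips anything that looks like an HTML tag using the same regular expression.

```python
import re

def candidate_text(document_text, start_token, end_token):
    """Return the lowercased, tag-free text of one candidate span."""
    tokens = document_text.split()[start_token:end_token]
    return re.sub("<.*?>", " ", " ".join(tokens).lower())

# Made-up document; tags are separate whitespace-delimited tokens.
doc = "<H1> Email Marketing </H1> <P> Email marketing is direct marketing . </P>"

# Tokens 4 through 11 cover the <P> ... </P> paragraph.
text = candidate_text(doc, 4, 12)
print(text.split())
```

Each removed tag is replaced by a space, so the cleaned text can contain runs of spaces; the default TfidfVectorizer tokenizer ignores them.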

The following lines give us a data frame where the first column is the example id and the second column is the potential answer corresponding to one of the many potential answers for each question.

In [458]:
document_text_token = document_text_token.reset_index()
document_text_token.columns = ["example_id","document_text_token"]

It is now time to fit the tfidf vectorizer. I am going to fit the vectorizer on the document_text_token column of the above data frame. For this tfidf vectorizer each document is going to correspond to one potential long answer.

In [459]:
doc_text_vectorizer = TfidfVectorizer()
doc_text_vectorizer.fit(document_text_token["document_text_token"]) #only fit tfidf on the document text not the id
X1 = doc_text_vectorizer.transform(document_text_token["document_text_token"]) #save the results to X1

So now we have a sparse matrix whose rows are the potential long answers and whose columns are the vocabulary the vectorizer learned. Now it is time to fit another tfidf vectorizer, this time on the questions. We treat the questions slightly differently: each question is repeated once per candidate in the joined data frame, so I fit on the de-duplicated questions and then transform every row.

In [460]:
question_vectorizer = TfidfVectorizer()
question_vectorizer.fit(joined_train_df["question_text"].drop_duplicates())
X2 = question_vectorizer.transform(joined_train_df["question_text"])
display(X2)

<10822x464 sparse matrix of type '<class 'numpy.float64'>'
	with 90460 stored elements in Compressed Sparse Row format>
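A small sketch of the fit-on-unique, transform-on-all pattern, using made-up questions. De-duplicating before fitting does not change the learned vocabulary (although the IDF weights can differ, since duplicates change document frequencies), and transforming the full column still yields one row per candidate.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up questions, each repeated once per candidate as in the joined frame.
questions = pd.Series(
    ["who wrote hamlet"] * 3 + ["when was the eiffel tower built"] * 2
)

vec_unique = TfidfVectorizer().fit(questions.drop_duplicates())
vec_all = TfidfVectorizer().fit(questions)

# Same vocabulary either way: duplicates only add words already seen.
same_vocab = set(vec_unique.vocabulary_) == set(vec_all.vocabulary_)

# Transforming the full column gives one row per candidate.
X_q = vec_unique.transform(questions)
print(same_vocab, X_q.shape)
```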

The last piece of the puzzle is to include the document text in these vectors along with everything else. 

In [461]:
full_text = joined_train_df.document_text
full_text = full_text.apply(lambda row : re.sub("<.*?>", " "," ".join(str(row).split())))

In [463]:
full_text = full_text.reset_index()

In [464]:
full_vectorizer = TfidfVectorizer()
full_vectorizer.fit(full_text["document_text"].drop_duplicates())
X3 = full_vectorizer.transform(full_text["document_text"])

In [465]:
display(X1)
display(X2)
display(X3)

<10822x25720 sparse matrix of type '<class 'numpy.float64'>'
	with 269295 stored elements in Compressed Sparse Row format>

<10822x464 sparse matrix of type '<class 'numpy.float64'>'
	with 90460 stored elements in Compressed Sparse Row format>

<10822x35038 sparse matrix of type '<class 'numpy.float64'>'
	with 17007718 stored elements in Compressed Sparse Row format>

These three sparse matrices all have the same number of rows. As I expected, X1 has a smaller vocabulary than X3 (25,720 columns versus 35,038), and X3 has far more nonzero entries, since it vectorized the entire document rather than a single candidate passage. Unfortunately, each vectorizer learned its own vocabulary, so each block of columns is separate. This means the same word can show up as a separate column in each of the three groups.

I am now going to stack all of these matrices horizontally so that each row still corresponds to a single observation, but with the three blocks of features side by side. In this setup a row is a vector corresponding to one possible long answer. I define my X_train below.

In [466]:
X_train = hstack((X1,X2,X3))  

In [467]:
X_train

<10822x61222 sparse matrix of type '<class 'numpy.float64'>'
	with 17367473 stored elements in COOrdinate format>

Notice that the number of rows did not change, while the number of columns is now the sum of the three vocabularies: 25720 + 464 + 35038 = 61222.

In [468]:
X_train.shape

(10822, 61222)
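The stacking step can be illustrated on a toy example (all texts below are made up). Each block is vectorized separately and the blocks are placed side by side, so the column count is the sum of the three vocabulary sizes. Passing `format="csr"` to `hstack` is an optional choice I am assuming here; it returns a row-sliceable CSR matrix instead of the COO format shown above.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts: three candidates that share one question and one document.
candidates = ["paris is the capital", "france borders spain", "the seine flows"]
questions = ["what is the capital of france"] * 3
documents = ["paris is the capital france borders spain the seine flows"] * 3

blocks = []
for texts in (candidates, questions, documents):
    vec = TfidfVectorizer()
    # Fit on the unique texts, then transform every row.
    vec.fit(sorted(set(texts)))
    blocks.append(vec.transform(texts))

# Place the three feature blocks side by side, one row per candidate.
X_toy = hstack(blocks, format="csr")
print([b.shape for b in blocks], X_toy.shape)
```

Because each vectorizer has its own vocabulary, a word such as "capital" gets one column in every block in which it appears.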

It is now time to create the target data that is going to help us train a model to make predictions with.

Let's now go back to one of our previous data frames and record, for each row, whether or not that specific candidate was the correct answer.

To make things easier on ourselves, let's set the index of annotations to example_id. The next bit of code is something I discovered while trying to get the question indexes to line up correctly; there is no purpose for it other than that.

In [469]:
annotations.sort_values(by = "example_id", inplace = True)

In [470]:
annotations.set_index("example_id", inplace = True)

In [471]:
long_answer_candidates.set_index("example_id", inplace = True)

In [472]:
fixed_long_answer_candidates = long_answer_candidates.copy()
fixed_long_answer_candidates.reset_index(inplace = True)

In [473]:
fixed_long_answer_candidates = fixed_long_answer_candidates.sort_values(by = ["example_id", "start_token"])
fixed_long_answer_candidates.index = range(len(long_answer_candidates))

Here is the code I wrote to mark the correct long answer. It loops through the fixed long answer candidates and compares each candidate's start and end tokens to those of the annotated long answer for the same question. Afterwards I print the first 10 entries of the list. Surprisingly (but correctly, I checked), the first entry is a 1, meaning the first row in fixed_long_answer_candidates is a correct answer. If a question has no long answer, its annotation has -1 as the start token; since no candidate starts at -1, every candidate for that question gets a 0.

In [474]:
accuracy_assessment = []
for i in range(len(fixed_long_answer_candidates)):
    ex_id = fixed_long_answer_candidates.loc[i, "example_id"]
    correct_start = annotations.loc[ex_id, "long_answer.start_token"]
    correct_end = annotations.loc[ex_id, "long_answer.end_token"]
    if (fixed_long_answer_candidates.loc[i, "start_token"] == correct_start) & (fixed_long_answer_candidates.loc[i, "end_token"] == correct_end):
        accuracy_assessment.append(1)
    else:
        accuracy_assessment.append(0)

In [475]:
accuracy_assessment[:10]

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [476]:
y_train = accuracy_assessment.copy() # just making a copy so it is named y_train
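The same kind of labels can also be built without an explicit Python loop. The sketch below is one possible alternative, not the code used for the results in this write-up; it uses tiny made-up frames whose column names mirror the ones above, and it assumes exactly one annotation per question.

```python
import pandas as pd

# Made-up candidates for two questions.
cands = pd.DataFrame({
    "example_id": [1, 1, 1, 2, 2],
    "start_token": [0, 10, 25, 0, 8],
    "end_token": [10, 25, 40, 8, 30],
})

# Made-up annotations; -1 marks a question with no long answer.
annots = pd.DataFrame({
    "example_id": [1, 2],
    "long_answer.start_token": [10, -1],
    "long_answer.end_token": [25, -1],
})

# A left merge keeps the candidate rows in their original order.
merged = cands.merge(annots, on="example_id", how="left")

# A candidate is correct only if both its start and end tokens match.
labels = (
    (merged["start_token"] == merged["long_answer.start_token"])
    & (merged["end_token"] == merged["long_answer.end_token"])
).astype(int).tolist()

print(labels)  # [0, 1, 0, 0, 0]
```

Whichever approach is used, it is worth confirming that the label order matches the row order of the feature matrix, since the labels are only meaningful if row i of both refers to the same candidate.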

Now I am going to use cross_val_score to get some estimates of my test error. The first model that I am going to train is a KNeighborsClassifier. I chose the f1 score as my scoring metric since it incorporates both precision and recall, and I chose three folds since I know cross_val_score can be computationally expensive.

In [477]:
model = KNeighborsClassifier(n_neighbors=9)

In [478]:
model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=9, p=2,
                     weights='uniform')

In [479]:
res = cross_val_score(model, X_train, y_train, cv = 3, scoring = "f1")

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


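We see that all three folds had an f1 score of 0. The fragments printed above are the tail of scikit-learn's warning that precision is ill-defined when no positive samples are predicted, so the nearest-neighbors model never labeled a candidate as correct. This means that it is time to break out some more complicated models. I decided to first try CatBoost with balanced class weights.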

In [480]:
pd.Series(res).value_counts()

0.0    3
dtype: int64
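To see how an F1 score of 0 can arise, recall that F1 is the harmonic mean of precision and recall. A minimal sketch in plain Python, with made-up labels, shows that a classifier that never predicts the positive class gets an F1 of 0 no matter how accurate it is overall. Setting the undefined 0/0 cases to 0 is a convention I am adopting in this sketch, and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Treat undefined ratios (0/0) as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Made-up labels: 1 correct candidate out of 10, and a model that always says 0.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_always_zero = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_always_zero)) / len(y_true)
scores = precision_recall_f1(y_true, y_always_zero)
print(accuracy, scores)  # 0.9 (0.0, 0.0, 0.0)
```

With roughly one correct candidate per hundred, always predicting 0 is an easy way for a model to look accurate, which is why accuracy alone would be misleading here.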

In [481]:
from catboost import CatBoostClassifier

v = pd.Series(y_train).value_counts().values
class_weights = [1,v[0]/v[1]]

model = CatBoostClassifier(iterations=100,
                          learning_rate=1,
                          depth=8,
                          class_weights=class_weights,
                          verbose = False)
# with cross_val_score we do not have to fit the model at this stage since cross val actually fits a model at 
# each step of its cross-validation
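The class weights above give the rare positive class a weight equal to the ratio of negatives to positives, so that both classes carry about the same total weight during training. A small sketch with made-up counts (the names `y_toy` and `toy_class_weights` are hypothetical):

```python
from collections import Counter

# Made-up labels: 95 incorrect candidates, 5 correct ones.
y_toy = [0] * 95 + [1] * 5

counts = Counter(y_toy)
# Weight 1 for the majority class, (negatives / positives) for the minority class.
toy_class_weights = [1, counts[0] / counts[1]]

# Total weight carried by each class after reweighting.
weighted_totals = [counts[0] * toy_class_weights[0], counts[1] * toy_class_weights[1]]
print(toy_class_weights, weighted_totals)  # [1, 19.0] [95, 95.0]
```

Note that the cell above relies on value_counts() listing the most frequent class first, which works here because incorrect candidates vastly outnumber correct ones.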

In [482]:
res = cross_val_score(model, X_train, y_train, cv = 2, scoring = "precision")

In [483]:
"Our average level of precision was " +  str(res.mean()) + " this not a very good result."

'Our average level of precision was 0.020585880154811158 this not a very good result.'

In [484]:
resR = cross_val_score(model, X_train, y_train, cv = 2, scoring = "recall")

In [485]:
"Our average level of recall was " +  str(resR.mean()) + " this not a great result."

'Our average level of recall was 0.12698412698412698 this not a great result.'

However, this result is slightly better than what we achieved with precision.

In [496]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0, solver='lbfgs',class_weight="balanced")
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Logistic regression's recall and precision, cross-validated below, were very lackluster: the model was unable to predict any correct answers on this trial. I am starting to get the idea that CatBoost is likely the best that we can do.

In [497]:
res = cross_val_score(clf, X_train, y_train, cv = 3, scoring = "precision")

In [498]:
res.mean()

0.0

In [499]:
res = cross_val_score(clf, X_train, y_train, cv = 3, scoring = "recall")

In [500]:
res.mean()

0.0

I cannot explain why we get these values for logistic regression with this set of features, especially since logistic regression turns out to be possibly the best model with other sets of features; some things in data science are mysteries. It is not because of my choice of class_weight="balanced": we got the same result regardless of the class weight in this scenario.

The main problem with this choice of features is its computational expense. My computer gets warm and the fan gets loud... There must be a way to keep or improve these results while being more efficient, or at the very least smarter about what we are doing.