## Iteration - Grail QA

In the last analysis, we saw that a handful of terms appear very frequently in the dataset. Let's treat those terms as stop words and build a model with them removed.

In [3]:
stop_words = ['the', 'what', 'of', 'is', 'which', 'has', 'by', 'that', 'in', 'and', 'with', 'for', 'was', 'name', 'to', 'are', 'how', 'who', 'as', 'on', 'many', 'than', 'used', 'have', 'does', 'an']

In [4]:
import pandas as pd

pd.options.display.max_colwidth = 0

In [5]:
from src.data.utils import *

train, dev = make_grail_qa()

In [6]:
print(f'---Train Distribution---\n{train.domains.value_counts()}')
print(f'---Dev Distribution---\n{dev.domains.value_counts()}')

---Train Distribution---
technology    4967
healthcare    3250
Name: domains, dtype: int64
---Dev Distribution---
technology    408
healthcare    303
Name: domains, dtype: int64


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words=stop_words)
xt = tfidf.fit_transform(train.questions)
xd = tfidf.transform(dev.questions)
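
To make the effect of `stop_words` concrete, here is a minimal sketch on two made-up questions (they are not taken from Grail QA, and `toy_stop_words` is only an illustrative subset of the list above). It compares the vocabulary learned with and without the stop word list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy questions, not taken from Grail QA
questions = [
    "what is the name of the drug",
    "which compiler is used by the language",
]
# Illustrative subset of the stop word list defined earlier
toy_stop_words = ["the", "what", "of", "is", "which", "by", "name", "used"]

with_stops = TfidfVectorizer().fit(questions)
without_stops = TfidfVectorizer(stop_words=toy_stop_words).fit(questions)

vocab_with = sorted(with_stops.vocabulary_)
vocab_without = sorted(without_stops.vocabulary_)

print(vocab_with)      # still contains 'the', 'what', 'is', ...
print(vocab_without)   # only the content words remain
```

With the stop words dropped, only the content words are left as features, which is the situation the classifier below has to work with.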

In [8]:
import numpy as np

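def transform_labels(labels):
    # Map domain names to numeric labels and return a new array.
    # Assigning into `labels` in place could overwrite the DataFrame column,
    # because `.values` may be a view of the underlying data.
    mapping = {'healthcare': 0., 'technology': 1.}
    return np.array([mapping[label] for label in labels], dtype=np.float64)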

yt = transform_labels(train.domains.values)
yd = transform_labels(dev.domains.values)

In [9]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(xt, yt)

LogisticRegression()

In [13]:
from sklearn.metrics import classification_report

yh = clf.predict(xd)
print(classification_report(yd, yh))

              precision    recall  f1-score   support

         0.0       1.00      0.98      0.99       303
         1.0       0.99      1.00      0.99       408

    accuracy                           0.99       711
   macro avg       0.99      0.99      0.99       711
weighted avg       0.99      0.99      0.99       711



In [14]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(yd, yh))

[[298   5]
 [  0 408]]
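
As a cross-check, the per-class numbers in the classification report can be recomputed from this matrix. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with index 0 for healthcare and index 1 for technology.

```python
import numpy as np

# Confusion matrix printed above: rows = true labels, columns = predictions
cm = np.array([[298, 5],
               [0, 408]])

recall = cm.diagonal() / cm.sum(axis=1)      # correct / all true members of the class
precision = cm.diagonal() / cm.sum(axis=0)   # correct / all predictions of the class
accuracy = cm.diagonal().sum() / cm.sum()

print('recall   ', recall.round(2))
print('precision', precision.round(2))
print('accuracy ', round(float(accuracy), 2))
```

The five off-diagonal entries in the first row are healthcare questions predicted as technology, and no technology question was predicted as healthcare.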


In [15]:
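# Find the misclassified examples
mistakes_idxs = np.where(yd != yh)[0]
mistakes_lbls = yh[mistakes_idxs]
# inverse_transform maps each row back to its nonzero vocabulary terms
mistakes = tfidf.inverse_transform(xd[mistakes_idxs])

# Print the remaining terms next to the predicted label of the same example
for terms, predicted in zip(mistakes, mistakes_lbls):
    print(terms.tolist(), predicted)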

['deracoxib'] 1.0
['contraindications', 'temazepam'] 1.0
['contraindications', 'deracoxib', 'number'] 1.0
['contraindications', 'number', 'teriparatide'] 1.0
['contraindications', 'teriparatide'] 1.0


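Performance is slightly worse than before, but the model can no longer lean on stop words to guess the domain, so we can trust it a little more. We can also start to pinpoint where it fails: all five errors are healthcare questions predicted as technology, and the only in-vocabulary terms left in them are drug names (deracoxib, temazepam, teriparatide) plus "contraindications" and "number".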
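
One possible way to dig into such failures is to look at the weight the classifier learned for each surviving term. The sketch below is self-contained and runs on made-up data. `term_weights`, `toy_tfidf`, and `toy_clf` are hypothetical names introduced here, and the same helper could in principle be called with the `tfidf` and `clf` objects from this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for train.questions and yt
questions = [
    "contraindications of aspirin",
    "dosage of ibuprofen",
    "side effects of aspirin",
    "compiler for python",
    "license of linux kernel",
    "python web framework",
]
labels = np.array([0., 0., 0., 1., 1., 1.])  # 0 = healthcare, 1 = technology

toy_tfidf = TfidfVectorizer(stop_words=["of", "for"])
toy_clf = LogisticRegression().fit(toy_tfidf.fit_transform(questions), labels)

def term_weights(terms, vectorizer, model):
    """Return {term: coefficient} for terms in the vocabulary.

    For a binary LogisticRegression, a positive coefficient pushes the
    prediction toward class 1 and a negative one toward class 0.
    """
    vocab = vectorizer.vocabulary_
    return {t: float(model.coef_[0][vocab[t]]) for t in terms if t in vocab}

weights = term_weights(["aspirin", "python", "unseen"], toy_tfidf, toy_clf)
print(weights)
```

Terms that are missing from the vocabulary are skipped, and a term with a weight near zero or with the wrong sign is a natural candidate for explaining a misclassification.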