## Baseline - Grail QA

Since we already have labels, it makes sense to check what performance we can get from straightforward supervised classification. Topic modeling is fine for large datasets where we don't have the luxury of labeled data, but let's first see what a simple classifier can do as a baseline.

In [1]:
import pandas as pd

pd.options.display.max_colwidth = 0

In [2]:
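# Import only the helpers used below instead of a wildcard import.
from src.data.utils import get_domains_and_questions, filter_domains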

train = pd.DataFrame(get_domains_and_questions('train', 'grail_qa'))
dev   = pd.DataFrame(get_domains_and_questions('dev',   'grail_qa'))

In [3]:
domains = ['medicine', 'computer']
train = filter_domains(train, domains)
dev   = filter_domains(dev,   domains)

In [4]:
train.loc[train.domains == 'medicine'].sample(5)

| | domains | questions |
|---|---|---|
| 24900 | medicine | esomeprazole magnesium is in what drug class? |
| 37288 | medicine | name the drug mechanism of action for doconexent/icosapent/calcium/iron/ascorbic acid/pyridoxine/.alpha.-tocopherol, d-/folic acid. |
| 32763 | medicine | what is the drug formulation of calendula officinalis flowering top/bellis perennis/ledum palustre twig/arnica montana/phosphorus homeopathic preparation? |
| 4098 | medicine | which drug uses inhalation as it's route of administration? |
| 18277 | medicine | what medical trial has the health authority us fda as well as the efficacy study design? |


In [5]:
train.loc[train.domains == 'computer'].sample(5)

| | domains | questions |
|---|---|---|
| 21766 | computer | commodore vic-20 is a computer emulated by which computer emulator? |
| 25630 | computer | which is the earliest computer processor on record? |
| 17974 | computer | what model computer is compatible with the peripheral of interface? |
| 18675 | computer | what is the file format contained by eps? |
| 8067 | computer | who was the developer on the earliest released operation system that includes android? |


In [6]:
print(f'---Train Distribution---\n{train.domains.value_counts()}')
print(f'---Dev Distribution---\n{dev.domains.value_counts()}')

---Train Distribution---
medicine    2002
computer    1923
Name: domains, dtype: int64
---Dev Distribution---
computer    190
medicine    178
Name: domains, dtype: int64


With only these two subdomains, the dataset is fairly balanced: each class makes up roughly half of both the train and dev splits. We'll see how that changes when we incorporate others.
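
To put a number on "fairly balanced", the sketch below turns the counts printed above into per-split shares and a majority-class accuracy for dev. The counts are copied by hand from that output; on the notebook's own DataFrames, `train.domains.value_counts(normalize=True)` should give the same shares directly. The variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
train_counts = pd.Series({'medicine': 2002, 'computer': 1923})
dev_counts = pd.Series({'computer': 190, 'medicine': 178})

# Share of each domain within its split.
train_share = train_counts / train_counts.sum()
dev_share = dev_counts / dev_counts.sum()

# Accuracy of a trivial model that always predicts the most frequent dev domain.
majority_baseline = dev_share.max()

print(train_share.round(3))
print(dev_share.round(3))
print(f'majority-class accuracy on dev: {majority_baseline:.3f}')
```

The majority-class figure is the floor any classifier has to clear on this split.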

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
xt = tfidf.fit_transform(train.questions)
xd = tfidf.transform(dev.questions)

In [8]:
import numpy as np

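LABEL_TO_ID = {'medicine': 0., 'computer': 1.}

def transform_labels(labels):
    # Build a new float array instead of assigning into `labels` in place,
    # which would also overwrite the DataFrame column backing `.values`.
    return np.array([LABEL_TO_ID[label] for label in labels], dtype=np.float64)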

yt = transform_labels(train.domains.values)
yd = transform_labels(dev.domains.values)

In [9]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(xt, yt)

LogisticRegression()

In [10]:
from sklearn.metrics import classification_report

print(classification_report(yd, clf.predict(xd)))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       178
         1.0       1.00      1.00      1.00       190

    accuracy                           1.00       368
   macro avg       1.00      1.00      1.00       368
weighted avg       1.00      1.00      1.00       368

