## Baseline - Grail QA

Since the data is already labeled, it makes sense to first measure how well straightforward supervised classification performs. Topic modeling suits large datasets where labeled data is unavailable; here, let's see what a simple classifier can do.

In [1]:
import pandas as pd

# Show full question text; None disables column truncation in pandas >= 1.0
pd.options.display.max_colwidth = None

In [2]:
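# Import only the helpers used below instead of a wildcard import
from src.data.utils import get_domains_and_questions, set_domains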

train = pd.DataFrame(get_domains_and_questions('train', 'grail_qa'))
dev   = pd.DataFrame(get_domains_and_questions('dev',   'grail_qa'))

In [3]:
domains = ['medicine', 'computer']
train = set_domains(train, domains)
dev   = set_domains(dev,   domains)

In [6]:
train.loc[train.domains == 'medicine'].sample(5)

| | domains | questions |
|---|---|---|
| 10245 | medicine | ethacrynic acid has what symptoms as side effects? |
| 33466 | medicine | what fda otc drug monograph part regulates robafen dm max non drowsy 10/200 liquid) |
| 32658 | medicine | name a medical trial that uses the same type of medical trial as the cocaine effects in humans: physiology and behavior – 1 |
| 38885 | medicine | orange is the flavor of what manufactured drug? |
| 19342 | medicine | of indinavir sulfate, who is the medical trial sponsor? |


In [7]:
train.loc[train.domains == 'computer'].sample(5)

| | domains | questions |
|---|---|---|
| 13345 | computer | which type of software uses ssh file transfer protocol as protocol? |
| 27750 | computer | what is zx-pilot's emulator? |
| 35923 | computer | what file format can smartdraw read? |
| 4425 | computer | name the computers whose parent was the trs-80 color computer. |
| 2698 | computer | what software has a latest release date on 2007-02-09? |


In [14]:
print(f'TRAIN DISTRIBUTION\n{train.domains.value_counts()}')
print(f'DEV DISTRIBUTION\n{dev.domains.value_counts()}')

TRAIN DISTRIBUTION
medicine    2002
computer    1923
Name: domains, dtype: int64
DEV DISTRIBUTION
computer    190
medicine    178
Name: domains, dtype: int64


Restricted to these two `subdomains`, the dataset is fairly balanced (2002 vs. 1923 questions in train, 190 vs. 178 in dev). We'll see how that changes when we incorporate others.
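To put later classifier scores in context, the dev counts printed above imply a majority-class baseline. The following is a minimal sketch; `majority_baseline` is a hypothetical helper for illustration, not part of the project code.

```python
def majority_baseline(counts):
    """Return (label, accuracy) for always predicting the most frequent class."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total

# Dev counts copied from the value_counts() output above
dev_counts = {'computer': 190, 'medicine': 178}

label, acc = majority_baseline(dev_counts)
print(f'{label}: {acc:.3f}')  # computer: 0.516
```

Any classifier trained on these two domains should comfortably beat this figure to be considered useful.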

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

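tfidf = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training questions only
xt = tfidf.fit_transform(train.questions)
# Reuse the fitted vocabulary for dev so no dev information leaks into the features
xd = tfidf.transform(dev.questions)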

In [22]:
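import numpy as np

# Numeric class ids for the two domains
LABEL_TO_ID = {'medicine': 0., 'computer': 1.}

def transform_labels(labels):
    # Build a new float array instead of writing into `labels`,
    # which would overwrite the DataFrame's `domains` column in place
    return np.array([LABEL_TO_ID[label] for label in labels], dtype=np.float64)

yt = transform_labels(train.domains.values)
yd = transform_labels(dev.domains.values)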

In [24]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(xt, yt)

LogisticRegression()

In [26]:
from sklearn.metrics import classification_report

print(classification_report(yd, clf.predict(xd)))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       178
         1.0       1.00      1.00      1.00       190

    accuracy                           1.00       368
   macro avg       1.00      1.00      1.00       368
weighted avg       1.00      1.00      1.00       368
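
One way to sanity-check a score like this is to confirm that no dev question also appears verbatim in the training split. The following is a minimal sketch on toy strings; `find_overlap` is a hypothetical helper and the example questions are invented, not taken from the dataset.

```python
def find_overlap(train_questions, dev_questions):
    """Return dev questions that also occur in train, ignoring case and extra whitespace."""
    def normalize(question):
        return ' '.join(question.lower().split())

    seen = {normalize(q) for q in train_questions}
    return [q for q in dev_questions if normalize(q) in seen]

# Toy data for illustration only
toy_train = ['what is an emulator?', 'which drug treats fever?']
toy_dev = ['What is an  emulator?', 'which software reads pdf files?']

overlap = find_overlap(toy_train, toy_dev)
print(overlap)
```

With the frames in this notebook, the call would presumably be `find_overlap(train.questions, dev.questions)`; an empty list would mean the two splits share no identical questions.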

