## EDA 2 - Grail QA

Now that we have some strong baselines, let's take a closer look at our dataset to understand its quirks and identify any limiting conditions. Specifically, in this notebook we'll look at which terms appear in the most questions, using inverse document frequency (IDF) as the measure.

In [1]:
import pandas as pd

pd.options.display.max_colwidth = 0

In [2]:
from src.data.utils import make_grail_qa

train, dev = make_grail_qa()

In [3]:
print(f'---Train Distribution---\n{train.domains.value_counts()}')
print(f'---Dev Distribution---\n{dev.domains.value_counts()}')

---Train Distribution---
technology    4967
healthcare    3250
Name: domains, dtype: int64
---Dev Distribution---
technology    408
healthcare    303
Name: domains, dtype: int64


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
xt = tfidf.fit_transform(train.questions)
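
For reference, the sketch below shows how the fitted `idf_` values relate to document frequency. With the default `smooth_idf=True`, scikit-learn computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, so a term that appears in every document gets the minimum IDF of 1. The toy questions and the `toy_*` names are made up for illustration and are not drawn from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy questions (hypothetical, not taken from the dataset)
toy_questions = [
    "what is the name of the device",
    "which drug is used for the flu",
    "what is the dose of the drug",
]

toy_tfidf = TfidfVectorizer()
toy_xt = toy_tfidf.fit_transform(toy_questions)

# Document frequency: number of questions that contain each term.
# Each stored entry in the CSR matrix is one (question, term) pair.
n_docs, n_terms = toy_xt.shape
doc_freq = np.bincount(toy_xt.indices, minlength=n_terms)

# Smoothed IDF, matching scikit-learn's default smooth_idf=True
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_matches = np.allclose(manual_idf, toy_tfidf.idf_)
the_idf = toy_tfidf.idf_[toy_tfidf.vocabulary_["the"]]
print(idf_matches, the_idf)
```

Here "the" occurs in all three toy questions, so its IDF is the lowest in the vocabulary. Sorting `idf_` in ascending order therefore surfaces the terms that occur in the most questions.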

In [25]:
import altair as alt

# Invert the vocabulary so we can look up a term by its column index
vocab = {v: k for k, v in tfidf.vocabulary_.items()}

# Find the k terms with the lowest IDF (i.e. the terms found in the most questions)
k = 50
common_terms_idxs = tfidf.idf_.argsort()[:k]

# DataFrame for plotting
common_terms = pd.DataFrame({
    'terms': [vocab[i] for i in common_terms_idxs],
    'IDF': tfidf.idf_[common_terms_idxs],
})

In [26]:
# Plot the terms against their IDF, ordered by IDF
title = alt.TitleParams(f'Top {k} Terms by IDF', subtitle='(smaller IDF is more frequent)')
(
    alt.Chart(common_terms, title=title)
    .mark_line()
    .encode(y=alt.Y('terms', sort='x'), x='IDF')
    .configure_axisY(labelAlign='left', labelPadding=70)
)

Based on these results, let's compile a list of stop words specific to our dataset, which may help our models generalize:

In [None]:
stop_words = [
    'the', 'what', 'of', 'is', 'which', 'has', 'by', 'that', 'in', 'and',
    'with', 'for', 'was', 'name', 'to', 'are', 'how', 'who', 'as', 'on',
    'many', 'than', 'used', 'have', 'does', 'an',
]
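
One way this list could be put to use is by passing it to `TfidfVectorizer` through its `stop_words` parameter, so these terms are dropped from the vocabulary before any model sees them. The sketch below is a minimal illustration on made-up questions. `toy_questions` and `filtered_tfidf` are hypothetical names, and whether filtering actually improves the models would still need to be checked against the baselines.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same dataset-specific stop words as compiled above
stop_words = [
    'the', 'what', 'of', 'is', 'which', 'has', 'by', 'that', 'in', 'and',
    'with', 'for', 'was', 'name', 'to', 'are', 'how', 'who', 'as', 'on',
    'many', 'than', 'used', 'have', 'does', 'an',
]

# Toy questions (hypothetical, not taken from the dataset)
toy_questions = [
    "what is the name of the device that has a touchscreen",
    "which drug is used for the flu",
]

# Terms in stop_words are removed during tokenization
filtered_tfidf = TfidfVectorizer(stop_words=stop_words)
filtered_tfidf.fit(toy_questions)

remaining_terms = sorted(filtered_tfidf.vocabulary_)
print(remaining_terms)
# ['device', 'drug', 'flu', 'touchscreen']
```

Note that the single-letter token "a" is already discarded by the vectorizer's default token pattern, which keeps only tokens of two or more characters.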