# EDA on a Hugging Face dataset

Fine-tuning the best Large Language Models (LLMs) is not an option: they are proprietary, accessible only via apps and APIs. Fine-tuning an open-source LLM is possible but not always practical or appealing. Thus, In-Context Learning (ICL) is a popular alternative, for those with the right custom data.

Background: In-Context Learning (ICL), often called few-shot learning, augments the input at inference time instead of updating model weights. The sequence ingested by the model is the unlabeled input concatenated with context (labeled examples) that primes the model toward the desired output, albeit indirectly. As of 2023, ICL is a hot topic.
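A minimal sketch of that concatenation, assuming a plain "Question/Answer" template. The function name `build_icl_prompt`, the template, and the toy exemplars are illustrative assumptions, not part of any library or of the cited papers.

```python
def build_icl_prompt(exemplars, query, instruction="Answer the question."):
    """Concatenate labeled exemplars and one unlabeled input into a single prompt."""
    parts = [instruction]
    for ex in exemplars:
        # Each labeled exemplar contributes a complete question-answer pair.
        parts.append(f"Question: {ex['question']}\nAnswer: {ex['answer']}")
    # The unlabeled input goes last, with the answer slot left open for the model.
    parts.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(parts)


# Toy exemplars (hypothetical, not drawn from the dataset)
exemplars = [
    {"question": "What is the capital of France?", "answer": "Paris."},
    {"question": "How many legs does a spider have?", "answer": "Eight."},
]
prompt = build_icl_prompt(exemplars, "What gas do plants absorb from the air?")
print(prompt)
```

Which exemplars go into `exemplars`, and in what order, is exactly what the selection methods discussed next try to optimize.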

Challenge: Methods abound for selecting contextual exemplars per input, and the performance benefits of ICL vary dramatically across selection methods. New learning-based approaches show promise ([Ye et al.](https://arxiv.org/abs/2302.05698), [Xu et al.](https://arxiv.org/abs/2305.08848)), but require more labeled examples than do learning-free approaches. Other factors that moderate the influence of contextual subsets include relevance and diversity.
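To make the relevance/diversity tension concrete, here is a sketch of maximal marginal relevance (MMR), a generic learning-free heuristic. It is not the method of either paper cited above; the toy vectors stand in for real sentence embeddings, and `mmr_select` and `lam` are names chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidate_vecs, k, lam=0.5):
    """Pick k candidate indices, trading relevance to the query against redundancy.

    lam=1.0 ranks purely by relevance; lower values penalize candidates
    that resemble exemplars already selected.
    """
    selected = []
    remaining = list(range(len(candidate_vecs)))

    def score(i):
        relevance = cosine(query_vec, candidate_vecs[i])
        redundancy = max(
            (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
            default=0.0,
        )
        return lam * relevance - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-d "embeddings": candidates 0 and 1 are near-duplicates, 2 is different.
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
print(mmr_select(query, candidates, k=2, lam=1.0))
print(mmr_select(query, candidates, k=2, lam=0.3))
```

With `lam=1.0` the two near-duplicates are chosen together; with a lower `lam` the second pick shifts to the dissimilar candidate. The right setting is task-dependent, which is part of why selection methods differ so much in practice.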

Dataset: In 2015, [Zhang et al.](https://arxiv.org/abs/1509.01626) compiled several datasets, including one sourced from Yahoo that is available via Hugging Face. It includes 1.46 million question-answer pairs, labeled by topic. The content comes from Yahoo Answers, a crowdsourced QA site that shut down in 2021 (content is from pre-2007).

Goal: The present EDA will shed light on the suitability of this dataset for learning-based ICL methods, especially SuperICL and CEIL, in support of LLM-based generative QA as well as text classification.

In [None]:
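import datasets
import matplotlib.pyplot as plt
import pandas as pd

# Local helper module for this notebook (show_size, load_yahoo, q_start, ...)
from eda_funcs import *

# Show up to 200 characters per column so long questions and answers stay readable
pd.set_option('display.max_colwidth', 200)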

## Check size

In [None]:
builder = datasets.load_dataset_builder('yahoo_answers_topics')
show_size(builder)
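`show_size` comes from the local `eda_funcs` module, whose source is not shown here. Below is a sketch of what such a helper might do, assuming it reads the metadata on `builder.info` (`download_size`, `dataset_size`, and `splits` are attributes of `datasets.DatasetInfo`). The names `format_bytes` and `show_size_sketch` are hypothetical.

```python
def format_bytes(num_bytes):
    """Render a byte count in human-readable units (base 1024)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


def show_size_sketch(builder):
    """Print download size, prepared size, and rows per split, without downloading data."""
    info = builder.info
    # Either size may be None when the dataset metadata is incomplete.
    for label, value in (("download", info.download_size),
                         ("prepared", info.dataset_size)):
        shown = format_bytes(value) if value is not None else "unknown"
        print(f"{label}: {shown}")
    for name, split in (info.splits or {}).items():
        print(f"{name}: {split.num_examples:,} rows")
```

It would be called as `show_size_sketch(builder)` with the `builder` created in the cell above. Checking metadata first avoids committing to a large download before knowing its size.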

## Download, preview sample

In [None]:
ds = load_yahoo(split='train', num_shards=32, shard_index=0)
print(f"number of rows: {ds.num_rows}\n")
ds[:5]

In [None]:
ds.cleanup_cache_files()

### Check topic balance

In [None]:
ds[:].value_counts(['label', 'topic']).to_frame().sort_index()

### Check missing values

In [None]:
ds[:].query("answer == ''").groupby(['topic'])['id'].count().plot(
    kind='barh', figsize=(4,3), title='Missing answers')
plt.show()

### Inspect concise questions

In [None]:
mono_q = ds.filter(lambda x: len(x['question'].split())==1)[:10]['question']
duo_q = ds.filter(lambda x: len(x['question'].split())==2)[:10]['question']

pd.DataFrame({'1-word Questions': mono_q, 
              '2-word Questions': duo_q})

### Drop rows with blank answers, blank questions, or 1-word questions; Review topic balance

In [None]:
ds = ds.filter(lambda x: x['answer'] != '' and x['question'] != '' and len(x['question'].split()) > 1)
df = ds[:].value_counts(['label', 'topic']).to_frame().sort_index()
df.plot(kind='barh', figsize=(5, 3), title='Examples per Topic', legend=False)
plt.show()
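The lambda above compares against the empty string, so an answer made only of whitespace would pass the filter. A slightly stricter variant, offered as a sketch: it assumes the `question` and `answer` column names used in this notebook, and `is_usable` and `min_question_words` are hypothetical names.

```python
def is_usable(example, min_question_words=2):
    """Keep rows with a non-blank answer and a question of at least min_question_words words."""
    # Treat None and whitespace-only strings the same as empty strings.
    question = (example.get("question") or "").strip()
    answer = (example.get("answer") or "").strip()
    return bool(answer) and len(question.split()) >= min_question_words


# Toy rows (hypothetical) covering each rejection case
rows = [
    {"question": "Why is the sky blue?", "answer": "Rayleigh scattering."},
    {"question": "Help", "answer": "With what?"},            # 1-word question
    {"question": "Is this thing on?", "answer": "   "},      # whitespace-only answer
    {"question": "", "answer": "No question here."},         # blank question
]
kept = [row for row in rows if is_usable(row)]
print(len(kept))
```

A named predicate like this could be passed to `ds.filter` in place of the lambda, and it is easier to unit test.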

### Preliminary observations

- This shard of the training split is large and balanced. For comparison, Rubin et al. use 44k examples to train a retriever for ICL. The full Yahoo Answers training split on Hugging Face is roughly 32x that size.
- As stated [elsewhere](https://en.wikipedia.org/wiki/Yahoo!_Answers), at least some examples appear silly, inarticulate, or worse.
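The 32x comparison can be checked with quick arithmetic. This sketch assumes the train split holds 1,400,000 rows (the 1.46 million total minus a 60,000-row test split); confirm the figure against the dataset metadata before relying on it.

```python
TRAIN_ROWS = 1_400_000          # assumption: size of the full train split
NUM_SHARDS = 32                 # as passed to load_yahoo above
RUBIN_TRAIN_EXAMPLES = 44_000   # the "44k" quoted in the observation

# Rows in one shard, before any filtering
rows_per_shard = TRAIN_ROWS // NUM_SHARDS
# How many times larger the full split is than the 44k comparison set
ratio = TRAIN_ROWS / RUBIN_TRAIN_EXAMPLES

print(rows_per_shard)      # 43750
print(round(ratio, 1))     # 31.8
```

Under that assumption, a single unfiltered shard is about the size of the 44k comparison set, and the full split is about 32 times larger.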

___

## Check question quality

### How do questions begin?

In [None]:
ds.reset_format()
ds = ds.map(q_start)
ds.set_format('pandas')
plot_question_starters(ds)
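`q_start` is defined in `eda_funcs` and not shown here. A plausible per-example sketch, assuming it records the first word of each question in lowercase (the query on `'i'` in the next cell suggests lowercasing). The name `q_start_sketch` and the punctuation handling are assumptions.

```python
import string


def q_start_sketch(example):
    """Return a 'q_start' value: the question's first word, lowercased, edge punctuation removed."""
    words = example["question"].split()
    # Strip punctuation only at the edges, so "What's" becomes "what's" rather than "whats".
    first = words[0].strip(string.punctuation).lower() if words else ""
    return {"q_start": first}


print(q_start_sketch({"question": "I have a router that keeps dropping..."}))
print(q_start_sketch({"question": '"Why" do people say that?'}))
```

A function that returns a dict of new columns can be passed to `ds.map`, which merges the result into each row.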

In [None]:
ds[:].query("q_start == 'i'")[:3]

In [None]:
ds = ds.remove_columns(['q_start'])

### Observations

- *What?* is common.
- *Who?* is especially common in Sports, Entertainment & Music.
- *Why?* is especially common in Politics & Gov't, Society & Culture.
- *How?* is especially common in Computers & Internet.
- *I* is unexpectedly common across all topics, framing questions with first-person narrative.

Questions that start with *I* are indirect and long, requiring synthesis across sentences and interpretation. These are not the best candidates for ICL.

___

## Word counts

In [None]:
ds = word_counts(ds)
ds[:][['q_word_count', 'ans_word_count']].describe().astype(int)
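`word_counts` is another `eda_funcs` helper whose source is not shown. A sketch of the per-row logic it might apply, assuming whitespace tokenization and the two column names used above; `add_word_counts` is a hypothetical name.

```python
def add_word_counts(example):
    """Count whitespace-separated words in the question and in the answer."""
    return {
        "q_word_count": len(example["question"].split()),
        "ans_word_count": len(example["answer"].split()),
    }


row = {"question": "How do I reset my password?", "answer": "Click 'Forgot password'."}
print(add_word_counts(row))
```

Whitespace splitting is only a proxy for length: subword tokenizers usually produce more tokens than words, so these counts understate tokenized sequence length.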

In [None]:
ds.set_format('pandas')
plt.figure(figsize=(5, 3))
plt.title('Question lengths')
plt.xlabel('Number of words in question')
plt.ylabel('Frequency')
plt.hist(ds['q_word_count'], bins=40, range=(0, 40), histtype='bar', rwidth=2)
plt.show()

In [None]:
plt.figure(figsize=(10, 3))
plt.title('Answer lengths')
plt.xlabel('Number of words in answer')
plt.ylabel('Frequency')
plt.hist(ds[:]['ans_word_count'], bins=80, range=(0,400), histtype='bar', rwidth=2)
plt.show()


In [None]:
ds = ds.remove_columns(['q_word_count', 'ans_word_count'])

### Observations

- RoBERTa has a maximum sequence length of 512 tokens. The majority of these examples would fit.
- Proprietary LLMs offer context lengths of 4,096 tokens or more. Several of these examples could fit within a single context.
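A rough way to estimate how many exemplars fit in a context window. This sketch assumes about 1.3 tokens per word, a coarse rule of thumb rather than a measured value; a real check should count tokens with the target model's tokenizer. `pack_exemplars` and `reserve_for_output` are hypothetical names.

```python
TOKENS_PER_10_WORDS = 13  # assumption: ~1.3 tokens per word; measure with the real tokenizer


def estimate_tokens(text):
    """Estimate token count from word count, rounding up (integer arithmetic only)."""
    words = len(text.split())
    return (words * TOKENS_PER_10_WORDS + 9) // 10


def pack_exemplars(exemplars, query, budget=4096, reserve_for_output=256):
    """Keep exemplars in order until the next one would exceed the token budget."""
    used = estimate_tokens(query) + reserve_for_output
    kept = []
    for ex in exemplars:
        cost = estimate_tokens(ex["question"]) + estimate_tokens(ex["answer"])
        if used + cost > budget:
            break
        kept.append(ex)
        used += cost
    return kept, used


# Toy data: a 10-word question and a 90-word answer per exemplar
exemplar = {"question": " ".join(["q"] * 10), "answer": " ".join(["a"] * 90)}
kept, used = pack_exemplars([exemplar] * 5, " ".join(["q"] * 10), budget=512)
print(len(kept), used)
```

Reserving room for the model's output matters: a prompt that fills the window leaves no space to generate an answer.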

___

## Conclusions

EDA does not determine how a dataset for learning-based ICL would influence a model's output. Even indicators of similarity (between an unlabeled input and a labeled contextual exemplar) and diversity (within the concatenated exemplar set per input) do not determine precisely how well a dataset will work for ICL: the best trade-off between relevance and diversity differs across tasks. The same goes for the number of in-context examples per input. EDA is only a start.

Nonetheless, we can draw a few conclusions.
- The size and breadth of the Yahoo Answers dataset are its strengths, especially for learning-based ICL methods.
- The tokenized question-answer pairs in this dataset are an appropriate length for the in-context learning methods discussed at the outset, namely CEIL (optimizing the context retriever's subset selection) and SuperICL (fine-tuning RoBERTa in a cascade design).

[Rubin et al.](https://aclanthology.org/2022.naacl-main.191/) argue that the best context for ICL is identified by a scoring LLM, separate from the inference LLM. That may be the case. For crowdsourced data, Yahoo Answers provides a large, diverse set.

Personally, I would not twist anyone's arm to use this dataset.
- It is difficult to fact check.
- Other datasets exist -- synthetic or real.
- Starting with a small, high-quality dataset seems more reasonable than harvesting a huge, dubious one.
- Yahoo shut down the site. Of course, that was in 2021, long after this content was collected. Still, the decision raises questions.