# Open-domain QA

- 📺 **Video:** [https://youtu.be/P-j_zeS0Pa8](https://youtu.be/P-j_zeS0Pa8)

## Overview
- Retrieve evidence from large corpora and generate answers.
- Combine dense or sparse retrievers with reader models.

## Key ideas
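- **Retrieval step:** a sparse retriever (BM25) or a dense one (DPR, ColBERT) fetches candidate documents for the question.
- **Reader step:** extract answer spans or generate responses from the retrieved evidence.
- **Caching:** index the documents once, ahead of time, so each query needs only a fast lookup.
- **Evaluation:** exact match (EM) and token-level F1 on open-domain benchmarks such as Natural Questions.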
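
The evaluation bullet above can be made concrete with a minimal sketch of exact match and token-level F1. The normalization here (lowercasing, stripping punctuation and English articles) follows the common SQuAD-style convention and is an assumption; the function names are illustrative and not taken from any benchmark's official scorer.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Leonardo da Vinci", "leonardo da vinci."))
print(token_f1("da Vinci", "Leonardo da Vinci"))
```

When a question has several acceptable gold answers, scorers typically take the maximum score over them; that step is left out of this sketch.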

## Demo
Implement a toy open-domain QA system with a small knowledge base and TF-IDF retrieval, matching the lecture (https://youtu.be/Ai8uKXu95ZM).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny knowledge base: document id -> text.
kb = {
    'doc1': 'Leonardo da Vinci painted the Mona Lisa in the early 1500s.',
    'doc2': 'Vincent van Gogh painted Starry Night while in Saint-Remy.',
    'doc3': 'Claude Monet was a founder of French Impressionist painting.'
}
question = 'Who painted the Mona Lisa?'

# Index step: fit TF-IDF on the documents only, so the index can be reused.
doc_ids = list(kb.keys())
vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(kb.values())

# Retrieval step: embed the question with the same vocabulary and rank documents.
q_vec = vec.transform([question])
sims = cosine_similarity(doc_matrix, q_vec).ravel()
doc_id = doc_ids[sims.argmax()]
print('Retrieved doc:', doc_id)

# Reader step (toy heuristic): take the text before ' painted ' as the answer.
print('Answer candidate:', kb[doc_id].split(' painted ')[0])
```

Output:

```
Retrieved doc: doc1
Answer candidate: Leonardo da Vinci
```
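
The demo retrieves with TF-IDF; BM25, named under Key ideas, is a closely related sparse scorer. The following is a minimal pure-Python sketch on the same knowledge base, not a reference implementation. The `tokenize` helper, the function name `bm25_scores`, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration, and the IDF uses the common `log(1 + ...)` variant that keeps scores non-negative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (illustrative choice)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more hits to score the same.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


kb = {
    'doc1': 'Leonardo da Vinci painted the Mona Lisa in the early 1500s.',
    'doc2': 'Vincent van Gogh painted Starry Night while in Saint-Remy.',
    'doc3': 'Claude Monet was a founder of French Impressionist painting.'
}
scores = bm25_scores('Who painted the Mona Lisa?', list(kb.values()))
best_doc = list(kb.keys())[scores.index(max(scores))]
print('Retrieved doc:', best_doc)
```

On a corpus this small the ranking agrees with the TF-IDF demo; differences between the two scorers mostly show up when document lengths vary widely.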


## Try it
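- Modify the demo: change the question (for example, ask who painted Starry Night) or add documents to `kb`.
- Add a tiny dataset or counter-example, such as a document that lacks the word "painted", which breaks the split-based answer extraction.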
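
One possible counter-example uses the demo's own knowledge base: the reader heuristic `split(' painted ')[0]` returns the whole sentence when a document does not contain " painted ". The sketch below shows that failure and one hedged alternative that returns `None` instead of a wrong answer. The names `split_reader` and `pattern_reader` are hypothetical helpers introduced here for illustration.

```python
import re

kb = {
    'doc1': 'Leonardo da Vinci painted the Mona Lisa in the early 1500s.',
    'doc2': 'Vincent van Gogh painted Starry Night while in Saint-Remy.',
    'doc3': 'Claude Monet was a founder of French Impressionist painting.'
}


def split_reader(doc):
    """The demo's heuristic: everything before ' painted '."""
    return doc.split(' painted ')[0]


def pattern_reader(doc):
    """Same idea, but abstain (None) when the pattern is absent."""
    match = re.match(r'(.+?) painted ', doc)
    return match.group(1) if match else None


for doc_id, doc in kb.items():
    print(doc_id, '|', split_reader(doc), '|', pattern_reader(doc))
```

For `doc3`, `split_reader` returns the full sentence as the "answer", while `pattern_reader` abstains. Abstaining is a design choice: it lets a caller fall back to another document or report that no answer was found.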


## References
- [MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text](https://www.aclweb.org/anthology/D13-1020.pdf)
- [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://www.aclweb.org/anthology/D16-1264/)
- [Adversarial Examples for Evaluating Reading Comprehension Systems](https://www.aclweb.org/anthology/D17-1215/)
- [Reading Wikipedia to Answer Open-Domain Questions](https://arxiv.org/abs/1704.00051)
- [Latent Retrieval for Weakly Supervised Open Domain Question Answering](https://www.aclweb.org/anthology/P19-1612.pdf)
- [[Website] Natural Questions](https://ai.google.com/research/NaturalQuestions)
- [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/pdf/2005.11401.pdf)
- [WebGPT](https://arxiv.org/abs/2112.09332)
- [HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering](https://arxiv.org/abs/1809.09600)
- [Understanding Dataset Design Choices for Multi-hop Reasoning](https://www.aclweb.org/anthology/N19-1405/)
- [Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering](https://openreview.net/forum?id=SJgVHkrYDH)
- [QAMPARI](https://arxiv.org/abs/2205.12665)
- [Wizard of Wikipedia: Knowledge-Powered Conversational Agents](https://arxiv.org/pdf/1811.01241.pdf)
- [Task-Oriented Dialogue as Dataflow Synthesis](https://arxiv.org/abs/2009.11423)
- [A Neural Network Approach to Context-Sensitive Generation of Conversational Responses](https://arxiv.org/abs/1506.06714)
- [A Diversity-Promoting Objective Function for Neural Conversation Models](https://arxiv.org/abs/1510.03055)
- [Recipes for building an open-domain chatbot](https://arxiv.org/pdf/2004.13637.pdf)
- [BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage (Kurt Shuster et al.)](https://arxiv.org/abs/2208.03188)
- [character.ai](https://character.ai)


*Links only; we do not redistribute slides or papers.*