<a href="https://colab.research.google.com/github/nbroad1881/hp_wiki_scrapy/blob/master/HP_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval Augmented Generation for Harry Potter

The wiki dataset was created using scrapy. The code is available here: https://github.com/nbroad1881/hp_wiki_scrapy



Embedding the documents takes about 9 minutes on GPU.

You can select:
- whether to split each document every 5 sentences or every 100 spaces
- the books or the wiki as the dataset
- which saved text source to use (if skipping the embedding step)

In [None]:
SPLIT_STYLE = "sentence" #@param ["sentence", "whitespace"]
DATASET_TO_USE = "wiki" #@param ["books", "wiki"]

## Download wiki dataset from GitHub

All of the data has already been scraped using scrapy. The scraping code is in this repo: https://github.com/nbroad1881/hp_wiki_scrapy

In [None]:
wiki_filename = "hp_wiki.json"
!wget https://raw.githubusercontent.com/nbroad1881/hp_wiki_scrapy/master/wiki_data/hp_wiki.json -O $wiki_filename

--2021-09-22 00:00:38--  https://raw.githubusercontent.com/nbroad1881/hp_wiki_scrapy/master/wiki_data/hp_wiki.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16894220 (16M) [text/plain]
Saving to: ‘hp_wiki.json’


2021-09-22 00:00:39 (118 MB/s) - ‘hp_wiki.json’ saved [16894220/16894220]



## Split text every 5 sentences


Sentence splitting function found here: https://stackoverflow.com/a/31505798

Create a new chunked json file to load into the dataset object.

In [None]:
import re
import json

alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"  # raw string so \s is not treated as an invalid escape
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"
digits = "([0-9])"

def split_into_sentences(text, n_sents=5):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    if len(sentences[-1]) == 0:
      sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    final_sentences = []
    for i in range(0, len(sentences), n_sents):
      final_sentences.append(" ".join(sentences[i:i+n_sents]))
    return final_sentences


def split_text(text: str, n=100, character=" "):
    """Split the text every ``n``-th occurrence of ``character``"""
    text = text.split(character)
    return [character.join(text[i : i + n]).strip() for i in range(0, len(text), n)]


def make_chunked_file(chunked_filename, wiki_filename, split_func):
  """
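  Read the wiki json file line by line, split each article's text with
  ``split_func``, and write the passages to a new chunked json file
  to be loaded into the dataset object.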
  """
  with open(wiki_filename) as wiki_file:

    with open(chunked_filename, "w") as chunked_file:
        for line in wiki_file.readlines():
          jline = json.loads(line)

          for passage in split_func(jline["text"]):
            json.dump({
                "title": jline["title"],
                "text": passage,
                "path": jline["path"]
            }, chunked_file)
            chunked_file.write("\n")


# Use corresponding splitting function depending
# on whether the user chooses 'sentence' or 'whitespace'
split_funcs = {
    "sentence": split_into_sentences,
    "whitespace": split_text
}
wiki_chunked_filename = f"hp_wiki_chunked_{SPLIT_STYLE}.json"

make_chunked_file(
    chunked_filename=wiki_chunked_filename,
    wiki_filename=wiki_filename,
    split_func=split_funcs[SPLIT_STYLE]
)

## Sanity Check

In [None]:
text = "Sentence 1. Sentence 2. Sentence 3. Sentence 4. Sentence 5. Sentence 6. Sentence 7."
split_into_sentences(text)

['Sentence 1. Sentence 2. Sentence 3. Sentence 4. Sentence 5.',
 'Sentence 6. Sentence 7. ']

In [None]:
text = " ".join([f"{i}" for i in range(110)])
split_text(text)

['0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99',
 '100 101 102 103 104 105 106 107 108 109']
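
As a rough illustration of what `make_chunked_file` writes, the sketch below chunks a single made-up record and emits one JSON object per line. The record, the `chunk_words` helper, and the in-memory buffer are assumptions for illustration only; the notebook itself reads `hp_wiki.json` and writes to disk.

```python
import io
import json

def chunk_words(text, n=3):
    """Hypothetical stand-in for split_text: join every n space-separated words."""
    words = text.split(" ")
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]

# Made-up record using the keys that make_chunked_file reads
record = {"title": "Example - Summary", "text": "one two three four five", "path": "/wiki/Example"}

buffer = io.StringIO()  # stands in for the chunked output file
for passage in chunk_words(record["text"]):
    json.dump({"title": record["title"], "text": passage, "path": record["path"]}, buffer)
    buffer.write("\n")

chunked_lines = buffer.getvalue().splitlines()
for line in chunked_lines:
    print(line)
```

Each passage keeps the title and path of the article it came from, which is what later lets an answer be traced back to its source.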

## Create book dataset

Text files for each book can be downloaded below. They aren't perfect transcriptions (especially book 2), but they were the best I could find. If you find better text files, please share!


In [None]:
!wget "https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%201%20-%20Sorcerer's%20Stone.txt"  -O hp1.txt
!wget "https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%202%20-%20The%20Chamber%20Of%20Secrets.txt"  -O hp2.txt
!wget "https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%203%20-%20Prisoner%20of%20Azkaban.txt"  -O hp3.txt
!wget "https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%204%20-%20The%20Goblet%20of%20Fire.txt"  -O hp4.txt
!wget "https://raw.githubusercontent.com/bobdeng/owlreader/master/ERead/assets/books/Harry%20Potter%20and%20the%20Order%20of%20the%20Phoenix.txt"  -O hp5.txt
!wget "https://raw.githubusercontent.com/bobdeng/owlreader/master/ERead/assets/books/Harry%20Potter%20and%20The%20Half-Blood%20Prince.txt"  -O hp6.txt
!wget "https://raw.githubusercontent.com/neelk07/neelkothari/master/blog/static/data/text/Harry%20Potter%20and%20the%20Deathly%20Hallows.txt"  -O hp7.txt

--2021-09-22 00:00:43--  https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%201%20-%20Sorcerer's%20Stone.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 439742 (429K) [text/plain]
Saving to: ‘hp1.txt’


2021-09-22 00:00:43 (12.7 MB/s) - ‘hp1.txt’ saved [439742/439742]

--2021-09-22 00:00:43--  https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%202%20-%20The%20Chamber%20Of%20Secrets.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Lengt

## Process book text

This is a bit ugly because most of the files have different formats.

It breaks each file into chapters, except book 2, which is kept as a single block of text.

In [None]:
num2title = {
    1: "Harry Potter and the Sorcerer's Stone",
    2: "Harry Potter and the Chamber of Secrets",
    3: "Harry Potter and the Prisoner of Azkaban",
    4: "Harry Potter and the Goblet of Fire",
    5: "Harry Potter and the Order of the Phoenix",
    6: "Harry Potter and the Half-Blood Prince",
    7: "Harry Potter and the Deathly Hallows",
}

def break_into_chapters(text, book_num):
  if book_num == 2:
    return ([num2title[book_num]], [text])
  if book_num in [1,3,4]:
    split_pattern="CHAPTER"
  elif book_num == 5:
    split_pattern = "\n- CHAPTER"
  elif book_num in [6,7]:
    split_pattern = "Chapter"
    if book_num == 7:
        text = "filler\n"+text
  splits = text.split(split_pattern)
  chapter_names, chapter_texts = [], []
  for chapter_num, ch in enumerate(splits[1:], start=1):
    if book_num == 4:
      temp = ch[ch.index("- ")+2:].strip()
    elif book_num == 6:
      temp = ch[ch.index(": ")+2:].strip()
    else:
      temp = ch[ch.index('\n'):].strip()
    chapter_texts.append(temp[temp.index('\n'):].strip())
    if  "\n"  in temp:
      chapter_name = temp[:temp.index("\n")]
    chapter_names.append(f"{num2title[book_num]} - Chapter {chapter_num} - {chapter_name.title()}")
  if book_num == 7:
    chapter_names.append(f"{num2title[book_num]} - Epilogue - Nineteen Years Later")
    marker = "Epilogue\nNineteen Years Later"
    index = text.index(marker)+len(marker)
    chapter_texts.append(text[index:])
  return chapter_names, chapter_texts

def get_chapters():
  """
  Go through each text and break it into chapters.

  yields a tuple of (list of chapter names, list of chapter texts)
  """
  for i in range(1,8):
    encoding=None if i != 4 else "cp1252" # this one has a different encoding
    with open(f"hp{i}.txt", encoding=encoding) as f:
      text = f.read()
    splits = break_into_chapters(text, i)
    yield splits
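
The core idea behind `break_into_chapters` is easier to see on a toy string. This is a minimal sketch, assuming a made-up text where each chapter starts with a `CHAPTER` line followed by a title line; `split_toy_chapters` and `toy_book` are hypothetical and ignore the per-book quirks handled above.

```python
def split_toy_chapters(text, marker="CHAPTER"):
    """Split text on a chapter marker; return (names, bodies)."""
    names, bodies = [], []
    # The first piece is whatever precedes the first marker, so skip it
    for piece in text.split(marker)[1:]:
        # Drop the rest of the marker line (e.g. " ONE"), keep title + body
        rest = piece[piece.index("\n"):].strip()
        title, _, body = rest.partition("\n")
        names.append(title.title())
        bodies.append(body.strip())
    return names, bodies

toy_book = (
    "Front matter\n"
    "CHAPTER ONE\nTHE FIRST TITLE\nBody of chapter one.\n"
    "CHAPTER TWO\nTHE SECOND TITLE\nBody of chapter two."
)
names, bodies = split_toy_chapters(toy_book)
print(names)
print(bodies)
```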

### Checking that it broke the chapters up correctly

In [None]:
all_titles, all_texts = [], []

for titles, texts in get_chapters():
  all_titles.extend(titles)
  all_texts.extend(texts)

list(zip(all_titles[:5], all_texts[:5]))

[("Harry Potter and the Sorcerer's Stone - Chapter 1 - The Boy Who Lived",
  'Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last\npeople you\'d expect to be involved in anything strange or mysterious,\nbecause they just didn\'t hold with such nonsense.\n\nMr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the\nneighbors. The Dursleys had a small son called Dudley and in their\nopinion there was no finer boy anywhere.\n\nThe Dursleys had everything they wanted, but they also had a secret, and\ntheir greatest fear was that somebody would discover it. They didn\'t\nthink they could bear it if anyone found o

## Split text every 5 sentences or every 100 spaces

In [None]:
books_chunked_filename = f"hp_books_chunked_{SPLIT_STYLE}.json"

def make_chunked_file(chunked_filename, all_titles, all_texts, split_func):
  """
  Slightly different version than the one used for the wiki. Same end result 
  of a new chunked json file to be loaded into the dataset object.
  """
  with open(chunked_filename, "w") as chunked_file:
    for title, text in zip(all_titles, all_texts):

      for passage in split_func(text):
          json.dump({
              "title": title,
              "text": passage,
              "path": "", # filler to make consistent with wiki dataset
          }, chunked_file)
          chunked_file.write("\n")

make_chunked_file(
    chunked_filename=books_chunked_filename,
    all_titles=all_titles,
    all_texts=all_texts,
    split_func=split_funcs[SPLIT_STYLE]
)



if SPLIT_STYLE == "sentence":
  with open(books_chunked_filename, "w") as chunked_file:
    for title, text in zip(all_titles, all_texts):

      for passage in split_into_sentences(text, n_sents=5):
          json.dump({
              "title": title,
              "text": passage,
              "path": "", # filler to make consistent with wiki dataset
          }, chunked_file)
          chunked_file.write("\n")
elif SPLIT_STYLE == "whitespace":

  with open(books_chunked_filename, "w") as chunked_file:
    for title, text in zip(all_titles, all_texts):

      for passage in split_text(text):
          json.dump({
              "title": title,
              "text": passage,
              "path": "", # filler to make consistent with wiki dataset
          }, chunked_file)
          chunked_file.write("\n")

with open(books_chunked_filename) as f:
  import json
  for i, line in enumerate(f.readlines()):
    if i>5: break
    j = json.loads(line)
    print("Title:",j["title"])
    print("Text:",j["text"])
    print("Path:", j["path"], "\n")

Title: Harry Potter and the Sorcerer's Stone - Chapter 1 - The Boy Who Lived
Text: Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors.
Path:  

Title: Harry Potter and the Sorcerer's Stone - Chapter 1 - The Boy Who Lived
Text: The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere. The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that someb

## Install Necessary Packages

Installing faiss before faiss-gpu was the only way I could get it to work. If you know of a better way, let me know!

Now would also be a good time to switch to GPU runtime.

In [None]:
!pip install -U transformers datasets

!apt install libomp-dev
!pip install -U faiss
!pip install faiss-gpu

Collecting transformers
  Downloading transformers-4.10.2-py3-none-any.whl (2.8 MB)
Collecting datasets
  Downloading datasets-1.12.1-py3-none-any.whl (270 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting aiohttp
  D

## Create dataset object using either the books or the wiki chunked file

In [None]:
from datasets import load_dataset

dataset_filenames = {
    "books": books_chunked_filename,
    "wiki": wiki_chunked_filename
}

dataset = load_dataset(
    "json", 
    data_files=[dataset_filenames[DATASET_TO_USE]],
    split="train",
)

# If we sort by length, batching will be more efficient and padding will be minimized
def add_len(example):
  example["len"] = len(example["title"]+example["text"])
  return example

dataset = dataset.map(add_len).sort("len").remove_columns(["len"])

Using custom data configuration default-96d26e4e0fa5a074
Reusing dataset json (/root/.cache/huggingface/datasets/json/default-96d26e4e0fa5a074/0.0.0/d75ead8d5cfcbe67495df0f89bd262f0023257fbbbd94a730313295f3d756d50)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-96d26e4e0fa5a074/0.0.0/d75ead8d5cfcbe67495df0f89bd262f0023257fbbbd94a730313295f3d756d50/cache-f2a8d36a7f321275.arrow
Loading cached sorted indices for dataset at /root/.cache/huggingface/datasets/json/default-96d26e4e0fa5a074/0.0.0/d75ead8d5cfcbe67495df0f89bd262f0023257fbbbd94a730313295f3d756d50/cache-a481351fc727c358.arrow
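
To see why sorting by length helps, here is a small sketch that counts how many slots a batched tokenizer would fill when every batch is padded to its longest item. The lengths are random made-up numbers and `padded_size` is a hypothetical helper; the notebook sorts by character length, which is only a proxy for token count.

```python
import random

def padded_size(lengths, batch_size):
    """Total slots used when each batch is padded to its longest item."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total

random.seed(0)
lengths = [random.randint(10, 500) for _ in range(1000)]  # made-up passage lengths

unsorted_cost = padded_size(lengths, batch_size=32)
sorted_cost = padded_size(sorted(lengths), batch_size=32)
print("real content:   ", sum(lengths))
print("unsorted padded:", unsorted_cost)
print("sorted padded:  ", sorted_cost)
```

With sorted input, neighbouring passages have similar lengths, so each batch carries far less padding than a batch of randomly mixed lengths.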


## Load DPR Encoder and Embed Each Document

With GPU, this will take about 9 minutes.

In [None]:
%%time

from datasets import load_from_disk, Features, Value, Sequence
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
import torch
from functools import partial


device = "cuda" if torch.cuda.is_available() else "cpu"
def embed(documents: dict, ctx_encoder: DPRContextEncoder, ctx_tokenizer: DPRContextEncoderTokenizer) -> dict:
    """Compute the DPR embeddings of document passages"""
    input_ids = ctx_tokenizer(
        documents["title"], documents["text"], truncation=True, padding="longest", return_tensors="pt"
    )["input_ids"]
    embeddings = ctx_encoder(input_ids.to(device=device), return_dict=True).pooler_output
    return {"embeddings": embeddings.detach().cpu().numpy()}

torch.set_grad_enabled(False)


dpr_model_name = "facebook/dpr-ctx_encoder-multiset-base"
batch_size = 32

ctx_encoder = DPRContextEncoder.from_pretrained(dpr_model_name).to(device=device)
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(dpr_model_name)
new_features = Features({
      "text": Value("string"), 
      "title": Value("string"), 
      "embeddings": Sequence(Value("float32")), 
      "path": Value("string")
      })
dataset = dataset.map(
    partial(embed, ctx_encoder=ctx_encoder, ctx_tokenizer=ctx_tokenizer),
    batched=True,
    batch_size=batch_size,
    features=new_features,
)


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.


  0%|          | 0/1077 [00:00<?, ?ba/s]

## Build a faiss index over the embeddings and add it to the dataset


In [None]:
import faiss

faiss_num_dim = 768
faiss_num_links = 128 


index = faiss.IndexHNSWFlat(faiss_num_dim, faiss_num_links, faiss.METRIC_INNER_PRODUCT)
dataset.add_faiss_index("embeddings", custom_index=index)

  0%|          | 0/35 [00:00<?, ?it/s]

Dataset({
    features: ['text', 'title', 'embeddings', 'path'],
    num_rows: 34454
})
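
An HNSW index with `METRIC_INNER_PRODUCT` performs an approximate version of the search sketched below: score every stored vector by its inner product with the query and keep the highest-scoring ones. This is a brute-force illustration on made-up 3-dimensional vectors, not how faiss is implemented; `top_k_inner_product` is a hypothetical helper.

```python
def top_k_inner_product(query, vectors, k=2):
    """Return (index, score) pairs for the k vectors with the largest inner product."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]

# Made-up embeddings; the real ones have 768 dimensions
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.9, 0.2, 0.0]

hits = top_k_inner_product(toy_query, toy_embeddings, k=2)
print(hits)
```

Brute force compares the query against every passage, which gets slow as the collection grows; the HNSW graph trades a little accuracy for much faster lookups.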

# Time to test it out!

### Load RAG Retriever and Generator

This is a big model so it can take some time to download.

In [None]:
from transformers import (RagRetriever, 
                          RagSequenceForGeneration, 
                          RagTokenizer)
rag_model_name = "facebook/rag-sequence-nq"

retriever = RagRetriever.from_pretrained(
    rag_model_name, index_name="custom", indexed_dataset=dataset
)
model = RagSequenceForGeneration.from_pretrained(rag_model_name, retriever=retriever).to(device)
tokenizer = RagTokenizer.from_pretrained(rag_model_name)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizerFast'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called fr

In [None]:
def ask_question(question):
  """
  Ask a question to the model and get back top 5 answers along with the
  document title it came from and the path (if using wiki dataset)

  Args:
    question (str): Question to ask the model.

  Returns:
    dict: 
      answers generated by model, 
      title of article model referenced, 
      path to article

  """

  retriever_input_ids = model.retriever.question_encoder_tokenizer.batch_encode_plus(
      [question],
      return_tensors="pt",
      padding=True,
      truncation=True,
  )["input_ids"].to(device)

  question_enc_outputs = model.rag.question_encoder(retriever_input_ids)
  question_enc_pool_output = question_enc_outputs[0]

  result = model.retriever(
      retriever_input_ids,
      question_enc_pool_output.cpu().detach().to(torch.float32).numpy(),
      prefix=model.rag.generator.config.prefix,
      n_docs=model.config.n_docs,
      return_tensors="pt",
  )
  all_docs = model.retriever.index.get_doc_dicts(result.doc_ids)

  titles = []
  paths = []
  for docs in all_docs:
      titles.extend([title for title in docs["title"]])
      paths.extend([path for path in docs["path"]])
  
  # Occasionally it isn't able to return 5 answers
  # In that case, keep decreasing the number until it succeeds
  num_return = 5
  while num_return > 0:
    try:
      generated = model.generate(retriever_input_ids, num_beams=3, num_return_sequences=num_return)
    except RuntimeError:
      num_return -= 1
    else:
      break
  answers = tokenizer.batch_decode(generated, skip_special_tokens=True)
  return {
      "answers": answers, 
      "titles": titles,
      "paths": paths
  }


def zip_results(answer_dict):
  """
  Takes three lists (answers, titles, and paths) and 
  groups them into a list of triplet tuples.
  """
  return [(answer, title, path) for answer, title, path in zip(answer_dict["answers"], answer_dict["titles"], answer_dict["paths"])]
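
The retry loop in `ask_question` can be isolated as the pattern below. As written above, if every attempt raised a `RuntimeError`, `generated` would never be assigned and the decode step would fail; this sketch returns an empty list in that case. `generate_with_fallback`, `fake_generate`, and `always_fails` are hypothetical stand-ins, not part of transformers.

```python
def generate_with_fallback(generate, max_return=5):
    """Call generate(n) with decreasing n until it stops raising RuntimeError."""
    for num_return in range(max_return, 0, -1):
        try:
            return generate(num_return)
        except RuntimeError:
            continue  # try again with one fewer sequence
    return []  # nothing succeeded; return an empty list instead of failing

def fake_generate(num_return):
    """Hypothetical stand-in for model.generate that supports at most 3 sequences."""
    if num_return > 3:
        raise RuntimeError("too many sequences requested")
    return [f"answer {i}" for i in range(num_return)]

def always_fails(num_return):
    """Hypothetical stand-in that never succeeds."""
    raise RuntimeError("cannot generate")

print(generate_with_fallback(fake_generate))
print(generate_with_fallback(always_fails))
```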

### Here are some sample questions. Feel free to add your own!

Some questions are too vague and the model gives bad answers. Some questions are very specific and the model still gives bizarre answers.

In [None]:
hp_questions = [
      "Who gave Harry Potter his scar?",
      "What sport does Harry Potter play?",
      "Who is the headmaster at Hogwarts?",
      "What is Harry Potter's wand made of?",
      "What is an ingredient in Polyjuice potion?",
      "Who competes in the triwizard tournament in Harry Potter's fourth year?",
      "Who teaches potions in Harry Potter's first year?",
      "Who teaches defense against the dark arts in Harry Potter's third year?",
      "Who put Harry Potter's name in the goblet of fire?",
      "What is the name of Harry Potter's owl?",
      "Who does Harry Potter ask to the Yule Ball?",
      "Who impersonates Mad-Eye Moody?",
      "Who does Hagrid have romantic feelings for?",
      "What is Ron Weasley's sister's name?",
      "In what house does Harry Potter belong?",
      "What position does Harry Potter play on the Quidditch team?",
      "What does the Sorcerer's Stone do?",
      "Who is Fluffy?",
      "What does the dementor's kiss do?",
      "What does the Imperius Curse do?",
      "Who poses as Mad-Eye Moody, Harry Potter's 4th year Defense Against the Dark Arts professor?",
      "What is an Auror?",
      "What happened to Wormtail's hand in Little Hangleton?",
      "What is the name of the killing curse?",
      "What does crucio do?",
      "What school does Fleur Delacour go to?",
      "What school does Viktor Krum go to?",
      "Who goes to the ball with Viktor Krum?",
      "What is veritaserum?",
      "Where does Harry Potter talk to Myrtle about the golden egg?",
      "Hermione's patronus is what animal?",
      "What does gillyweed do?",
      "What creatures live at the bottom of the Hogwarts Lake?",
      "What is the name of Harry Potter's first broomstick?",
      "Who is the minister of magic in 1991?",
      "Who is Harry Potter's godfather?",
      "What are the names of Harry Potter's parents?",
      "Who wins the Quidditch World Cup in 1994?",
      "Who is Winky's master?",
      "What is the name of the summoning charm?",
      "Who is the minister of magic in 1995?",
      "In 1997, who is the minister of magic?",
      "What is the opposite of the Summoning Charm?",
      "What subject did Professor McGonagall teach at Hogwarts?",
      "What subject did Professor Trelawney teach at Hogwarts?",
      "What subject did Professor Flitwick teach at Hogwarts?",
      "Who killed Sirius Black?",
      "What is the name of Draco Malfoy's father?",
      'Who is Aragog?',
      "What animal is Ron Weasley afraid of?",
      "How did Professor Dumbledore die?",
      "How did Dobby die?",
      "How did Cedric Diggory die?",
      "What does Avada Kedavra do?",
      "What does the spell Avada Kedavra do in the book Harry Potter?",
      "What species is Dobby?",
      "Who is Dobby's master?",
      "What is the fastest broomstick?",
      "What broomstick does Ron Weasley ride?",
      "Who is the seeker for Gryffindor's Quidditch team?",
      "What is the name of the centaur divination teacher?",
      "Who summons the dark mark above Hogwarts?",
      "Who works in Gringotts?",
      "Who did Harry Potter live with when he was young?",
      "What was Harry Potter's address when he was young?",
      "What was the name of the broomstick Harry uses in the triwizard tournament?",
      "What family lives in The Burrow?",
      "What is the name of Aberforth Dumbledore's brother?",
      "What is the name of Albus Dumbledore's sister?",
      "What is the name of Tom Riddle's mother?",
      ]

In [None]:
# Loop through the questions and print out the top 5 answers and what article the model referenced to make that answer 

for q in hp_questions:
  results = ask_question(q)
  print(q)
  for num, (a, t, p) in enumerate(zip_results(results), start=1):
    if p:
      p = f"https://harrypotter.fandom.com{p}"
    print(f"{num}.) Answer: {a}")
    print(f"\t\tSection Title: {t}")
    print(f"\t\tPath: {p}\n")

Who gave Harry Potter his scar?
1.) Answer:  lord voldemort
		Section Title: Harry Potter's scars - Lightning-bolt scar Scarring
		Path: https://harrypotter.fandom.com/wiki/Harry_Potter%27s_scars

2.) Answer:  voldemort
		Section Title: Harry Potter - Physical appearance
		Path: https://harrypotter.fandom.com/wiki/Harry_Potter

3.) Answer:  harry
		Section Title: Harry Potter - Physical appearance
		Path: https://harrypotter.fandom.com/wiki/Harry_Potter

4.) Answer:  quirrell
		Section Title: Harry Potter - Physical appearance
		Path: https://harrypotter.fandom.com/wiki/Harry_Potter

5.) Answer:  his mother's loving sacrifice
		Section Title: Harry Potter - The Philosopher's Stone
		Path: https://harrypotter.fandom.com/wiki/Harry_Potter

What sport does Harry Potter play?
1.) Answer:  quidditch
		Section Title: Harry Potter: Quidditch World Cup - Summary
		Path: https://harrypotter.fandom.com/wiki/Harry_Potter:_Quidditch_World_Cup

2.) Answer:  quiddish
		Section Title: England - Games

### Phrasing matters!

In [None]:
phrasing_questions = [
                  "What family lives in the Burrow?",
                  "Who lives in the Burrow?",
                  "The family that lives in the Burrow is known as what?",
]

for q in phrasing_questions:
  results = ask_question(q)
  print(q)
  for num, (a, t, p) in enumerate(zip_results(results), start=1):
    if p:
      p = f"https://harrypotter.fandom.com{p}"
    print(f"{num}.) Answer: {a}")
    print(f"\t\tSection Title: {t}")
    print(f"\t\tPath: {p}\n")

What family lives in the Burrow?
1.) Answer:  arthur and molly weasley
		Section Title: The Burrow - Layout
		Path: https://harrypotter.fandom.com/wiki/The_Burrow

2.) Answer:  the lovegoods
		Section Title: The Burrow - Layout
		Path: https://harrypotter.fandom.com/wiki/The_Burrow

3.) Answer:  the diggorys
		Section Title: The Burrow - Summary
		Path: https://harrypotter.fandom.com/wiki/The_Burrow

4.) Answer:  weasley family
		Section Title: The Burrow - Weasley family home
		Path: https://harrypotter.fandom.com/wiki/The_Burrow

5.) Answer:  the weasley family
		Section Title: The Burrow - Attic
		Path: https://harrypotter.fandom.com/wiki/The_Burrow

Who lives in the Burrow?
1.) Answer:  arthur and molly weasley
		Section Title: The Burrow - Layout
		Path: https://harrypotter.fandom.com/wiki/The_Burrow

2.) Answer:  the weasley family
		Section Title: The Burrow - Layout
		Path: https://harrypotter.fandom.com/wiki/The_Burrow

3.) Answer:  the weasley family ghoul
		Section Title: Th

### General knowledge questions that are irrelevant to Harry Potter
These questions are meant to show what information is stored in the model itself; they have nothing to do with Harry Potter. It is also interesting to see which articles from the Harry Potter universe might have information relevant to the questions.

In [None]:
general_questions = [
                  "Who is the president of the United States in 2018?",
                  "How far away is the moon from Earth?",
                  "Where do dolphins live?",
                  "What do monkeys like to eat?",
                  "In what stadium does Manchester United play?",
]

for q in general_questions:
  results = ask_question(q)
  print(q)
  for num, (a, t, p) in enumerate(zip_results(results), start=1):
    if p:
      p = f"https://harrypotter.fandom.com{p}"
    print(f"{num}.) Answer: {a}")
    print(f"\t\tSection Title: {t}")
    print(f"\t\tPath: {p}\n")

Who is the president of the United States in 2018?
1.) Answer: Donald j. trump
		Section Title: President of the United States of America - Summary
		Path: https://harrypotter.fandom.com/wiki/President_of_the_United_States_of_America

2.) Answer:  donald j. trump
		Section Title: United States of America - Recent history
		Path: https://harrypotter.fandom.com/wiki/United_States_of_America

3.) Answer:  Donald j. trump
		Section Title: President of the United States of America - Summary
		Path: https://harrypotter.fandom.com/wiki/President_of_the_United_States_of_America

4.) Answer:  william j. trump
		Section Title: President of the Magical Congress of the United States of America - Modern times
		Path: https://harrypotter.fandom.com/wiki/President_of_the_Magical_Congress_of_the_United_States_of_America

5.) Answer: Donald joseph trump
		Section Title: Hermione Granger - Summary
		Path: https://harrypotter.fandom.com/wiki/Hermione_Granger

How far away is the moon from Earth?
1.) Answ

# Saving your dataset and embeddings

Saving the retriever will store the dataset and embeddings!


In [None]:
retriever.save_pretrained("HP_retriever")

## If you made it this far, I hope you found this interesting or fun.

If you did, please feel free to leave a comment or some claps!