In [1]:
import openai

from fp_dataset_artifacts.utils import init_openai
from datasets import list_datasets, load_dataset, list_metrics, load_metric

init_openai()

data = load_dataset('squad')
data

Reusing dataset squad (/home/x/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)



DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [2]:
data['train'][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
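
The `answers` field pairs each answer string with a character offset into `context`. A minimal sketch of how the two line up, using a short made-up record (the `toy_record` below and the helper name `answer_spans` are hypothetical, not taken from SQuAD or the `datasets` library):

```python
# Hypothetical record with the same shape as a SQuAD example.
toy_record = {
    "context": "The Grotto is a replica of the grotto at Lourdes, France.",
    "question": "Where is the original grotto?",
    "answers": {"text": ["Lourdes, France"], "answer_start": [41]},
}

def answer_spans(example):
    """Return the context slice addressed by each (text, answer_start) pair."""
    context = example["context"]
    answers = example["answers"]
    return [
        context[start:start + len(text)]
        for text, start in zip(answers["text"], answers["answer_start"])
    ]

spans = answer_spans(toy_record)
print(spans)
```

If the offsets are consistent, each slice equals the corresponding answer text, which is a cheap sanity check before building prompts from a record.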

In [3]:
# Prompt without any structure or introduction
def get_bare_prompt(x):
    return f"{x['context']}\n\n{x['question']}\n\n"

prompt = get_bare_prompt(data['train'][0])
print(prompt)

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?




In [4]:
response = openai.Completion.create(
  engine='curie',
  prompt=prompt,
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["\n"]
)
response

<OpenAIObject text_completion id=cmpl-4BU8BWV7dO9jJIL3hAU7cL29mvJpE at 0x7f94ef4f13b0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": ""
    }
  ],
  "created": 1638642779,
  "id": "cmpl-4BU8BWV7dO9jJIL3hAU7cL29mvJpE",
  "model": "curie:2020-05-03",
  "object": "text_completion"
}
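
The response above has `finish_reason` set to `"stop"` and an empty `text`, which suggests the model produced the stop sequence (or ended the text) before anything else. A small sketch for summarizing that at a glance; `describe_completion` and `sample_response` are hypothetical names, and the dict only copies the fields shown in the output:

```python
def describe_completion(response):
    """Summarize the first choice of a completion response (dict-like)."""
    choice = response["choices"][0]
    text = choice["text"]
    reason = choice["finish_reason"]
    if reason == "stop" and text == "":
        return "empty: generation stopped before any text was produced"
    if reason == "length":
        return "truncated: hit max_tokens without reaching a stop sequence"
    return f"finished ({reason}): {text!r}"

# The relevant part of the response shown above, as a plain dict.
sample_response = {
    "choices": [
        {"finish_reason": "stop", "index": 0, "logprobs": None, "text": ""}
    ]
}
summary = describe_completion(sample_response)
print(summary)
```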

In [5]:
# Prompt with Q&A structure, but no introduction
def get_QA_prompt(x):
    return f"{x['context']}\n\nQ: {x['question']}\n\nA: "

prompt = get_QA_prompt(data['train'][0])
print(prompt)

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

Q: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?

A: 


In [6]:
response = openai.Completion.create(
  engine='curie',
  prompt=prompt,
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["\n"]
)
response

<OpenAIObject text_completion id=cmpl-4BU9gHy90Kg9APkKrBMAnQhDAIceU at 0x7f954f65be50> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": ""
    }
  ],
  "created": 1638642872,
  "id": "cmpl-4BU9gHy90Kg9APkKrBMAnQhDAIceU",
  "model": "curie:2020-05-03",
  "object": "text_completion"
}
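
Every request in this notebook repeats the same sampling parameters. One way to keep them in a single place is a small helper that builds the keyword arguments; `completion_kwargs` is a hypothetical name, and the values simply mirror the calls above:

```python
def completion_kwargs(prompt, engine="curie", max_tokens=100):
    """Collect the request parameters used throughout this notebook."""
    return dict(
        engine=engine,
        prompt=prompt,
        temperature=0,          # greedy decoding, so reruns are comparable
        max_tokens=max_tokens,
        top_p=1,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        stop=["\n"],            # stop at the first newline
    )

request = completion_kwargs("Q: example question?\n\nA:")
print(request["engine"], request["stop"])

# Intended use (needs an API key and network access):
# response = openai.Completion.create(**completion_kwargs(prompt, engine="davinci"))
```

Switching engines then only changes one argument, which makes side-by-side comparisons less error-prone.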

In [8]:
# Prompt with explicit Context/Question/Answer labels, but still no introduction
def get_QA_prompt(x):
    return f"Context: {x['context']}\n\nQuestion: {x['question']}\n\nAnswer: "

prompt = get_QA_prompt(data['train'][0])
print(prompt)

Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?

Answer: 


In [9]:
response = openai.Completion.create(
  engine='curie',
  prompt=prompt,
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["\n"]
)
response

<OpenAIObject text_completion id=cmpl-4BUAc2lGCyenaUdETW20ScnkZNRfz at 0x7f94efe76950> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": ""
    }
  ],
  "created": 1638642930,
  "id": "cmpl-4BUAc2lGCyenaUdETW20ScnkZNRfz",
  "model": "curie:2020-05-03",
  "object": "text_completion"
}
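
All of the prompts so far end in whitespace: a blank line in the bare prompt and a space after the answer label in the others. OpenAI's guidance for completion prompts has warned that a trailing space can tokenize poorly, so one variant worth testing ends the prompt directly after the label. This is a sketch only; `get_trimmed_prompt` and `toy_example` are hypothetical, and it has not been run against the API here:

```python
def get_trimmed_prompt(x):
    """Labeled prompt that ends right after 'Answer:' with no trailing whitespace."""
    prompt = f"Context: {x['context']}\n\nQuestion: {x['question']}\n\nAnswer:"
    return prompt.rstrip()

# Hypothetical stand-in for a SQuAD record.
toy_example = {
    "context": "Fort Oscar is located on the far side of La Pointe.",
    "question": "Where is Fort Oscar located?",
}
trimmed = get_trimmed_prompt(toy_example)
print(repr(trimmed[-30:]))
```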

In [18]:
# Same labeled structure, now prefixed with the introduction from OpenAI's Q&A bot example.
# The sentence about answering "Unknown" is left out because SQuAD 1.1 has no unanswerable questions.
intro = 'I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer.'

def get_bot_prompt(x):
    return f"{intro}\n\nContext: {x['context']}\n\nQuestion: {x['question']}\n\nAnswer: "

prompt = get_bot_prompt(data['train'][0])
print(prompt)

I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer.

Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?

Answer: 


In [19]:
response = openai.Completion.create(
  engine='curie',
  prompt=prompt,
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["\n"]
)
response

<OpenAIObject text_completion id=cmpl-4BUDL0oIs66fbRBRPC66oZhjniuDV at 0x7f94efd3b270> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": "_________"
    }
  ],
  "created": 1638643099,
  "id": "cmpl-4BUDL0oIs66fbRBRPC66oZhjniuDV",
  "model": "curie:2020-05-03",
  "object": "text_completion"
}

In [20]:
# The model finally returns something, but it is only a run of underscores, not an answer.
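
Completions like this one are easy to miss when scanning many responses. A minimal sketch that flags output consisting only of filler characters; `is_placeholder` is a hypothetical helper, and the choice of filler characters is an assumption:

```python
import re

# Whitespace, underscores, question marks, dots, and dashes count as filler.
_FILLER = re.compile(r"[\s_?.\-]*")

def is_placeholder(text):
    """True when a completion is empty or made only of filler characters."""
    return _FILLER.fullmatch(text) is not None

for candidate in ["_________", "????", "", "Saint Bernadette Soubirous"]:
    print(repr(candidate), is_placeholder(candidate))
```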

In [21]:
# To rule out the long context as the cause, try the same prompt without it.
def get_bot_no_context_prompt(x):
    return f"{intro}\n\nQuestion: {x['question']}\n\nAnswer: "

prompt = get_bot_no_context_prompt(data['train'][0])
print(prompt)

I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer.

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?

Answer: 


In [22]:
response = openai.Completion.create(
  engine='curie',
  prompt=prompt,
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["\n"]
)
response

<OpenAIObject text_completion id=cmpl-4BUFpOZw2WR4FPqI0BGQgL0pZnuI2 at 0x7f954c5748b0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": "__________"
    }
  ],
  "created": 1638643253,
  "id": "cmpl-4BUFpOZw2WR4FPqI0BGQgL0pZnuI2",
  "model": "curie:2020-05-03",
  "object": "text_completion"
}

In [24]:
# Try an even more compact prompt: a one-sentence introduction and Q:/A: labels.
short_intro = 'I am a highly intelligent question answering bot.'

def get_bot_no_context_prompt(x):
    return f"{short_intro}\n\nQ: {x['question']}\n\nA: "

prompt = get_bot_no_context_prompt(data['train'][0])
print(prompt)

I am a highly intelligent question answering bot.

Q: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?

A: 


In [25]:
response = openai.Completion.create(
  engine='curie',
  prompt=prompt,
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["\n"]
)
response

<OpenAIObject text_completion id=cmpl-4BUHGQbFh0qiJCpCrezQC3eijo8aV at 0x7f94efd4d680> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": "__________"
    }
  ],
  "created": 1638643342,
  "id": "cmpl-4BUHGQbFh0qiJCpCrezQC3eijo8aV",
  "model": "curie:2020-05-03",
  "object": "text_completion"
}

In [31]:
# Let's try a few different questions
prompt = get_bot_no_context_prompt(data['train'].shuffle()[0])
print(prompt)

Loading cached shuffled indices for dataset at /home/x/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-2bd8e69c8e2e88c5.arrow


I am a highly intelligent question answering bot.

Q: What countries are recognized as Nuclear Weapons States?

A: 


In [32]:
response = openai.Completion.create(
  engine='curie',
  prompt=prompt,
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["\n"]
)
response

<OpenAIObject text_completion id=cmpl-4BUIveEkFHj2msuv4Nx0jAYzhrxTj at 0x7f94efd4def0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": ""
    }
  ],
  "created": 1638643445,
  "id": "cmpl-4BUIveEkFHj2msuv4Nx0jAYzhrxTj",
  "model": "curie:2020-05-03",
  "object": "text_completion"
}
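
Calling `.shuffle()` again for each attempt relies on the cached shuffle being reused. An alternative is to draw the sample indices once with a fixed seed and index the dataset with them. The sketch below uses only the standard library; `sample_indices` is a hypothetical helper, and 87599 is the training-set size printed earlier:

```python
import random

def sample_indices(num_rows, k, seed=0):
    """Pick k distinct row indices reproducibly."""
    rng = random.Random(seed)
    return rng.sample(range(num_rows), k)

indices = sample_indices(87599, 4)
print(indices)

# Intended use with the dataset loaded above:
# examples = [data['train'][i] for i in indices]
```

`Dataset.shuffle` also accepts a `seed` argument, which achieves the same reproducibility inside the `datasets` library.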

In [34]:
# Attempt 2
prompt = get_bot_no_context_prompt(data['train'].shuffle()[1])
print(prompt)
response = openai.Completion.create(
  engine='curie',
  prompt=prompt,
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["\n"]
)
print(response)

Loading cached shuffled indices for dataset at /home/x/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-2bd8e69c8e2e88c5.arrow


I am a highly intelligent question answering bot.

Q: Who compared transnational police information and intelligence sharing practices?

A: 
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": "????"
    }
  ],
  "created": 1638643499,
  "id": "cmpl-4BUJnhYRhNhLx0mVpKXmWK1t9ljVC",
  "model": "curie:2020-05-03",
  "object": "text_completion"
}


In [35]:
# Attempt 3
prompt = get_bot_no_context_prompt(data['train'].shuffle()[2])
print(prompt)
response = openai.Completion.create(
  engine='curie',
  prompt=prompt,
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["\n"]
)
print(response)

Loading cached shuffled indices for dataset at /home/x/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-2bd8e69c8e2e88c5.arrow


I am a highly intelligent question answering bot.

Q: What was the inspiration behind Kanye West's decision to call Rick Rubin?

A: 
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": "????"
    }
  ],
  "created": 1638643502,
  "id": "cmpl-4BUJqQ7VrrrL75F4rUtsTee7LskfG",
  "model": "curie:2020-05-03",
  "object": "text_completion"
}


In [36]:
# Attempt 4
prompt = get_bot_no_context_prompt(data['train'].shuffle()[3])
print(prompt)
response = openai.Completion.create(
  engine='curie',
  prompt=prompt,
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["\n"]
)
print(response)

Loading cached shuffled indices for dataset at /home/x/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-2bd8e69c8e2e88c5.arrow


I am a highly intelligent question answering bot.

Q: On what part of the island is Fort Oscar located on the far side of?

A: 
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": "????"
    }
  ],
  "created": 1638643518,
  "id": "cmpl-4BUK63QKDQfLtCTx5F4uTxMPd0jN0",
  "model": "curie:2020-05-03",
  "object": "text_completion"
}


In [37]:
# Attempt 5: the same question as attempt 4, but with the davinci engine
prompt = get_bot_no_context_prompt(data['train'].shuffle()[3])
print(prompt)
response = openai.Completion.create(
  engine='davinci',
  prompt=prompt,
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["\n"]
)
print(response)

Loading cached shuffled indices for dataset at /home/x/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-2bd8e69c8e2e88c5.arrow


I am a highly intelligent question answering bot.

Q: On what part of the island is Fort Oscar located on the far side of?

A: 
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": "\u0caa\u0ccd\u0cb0\u0cbf\u0caf\u0cae\u0ccd\u0cae\u0ca8\u0ccd \u0cae\u0ca8\u0cc6\u0caf\u0cbf\u0c82\u0ca6 \u0cae\u0cc1\u0c82\u0ca6\u0cc1\u0cb5\u0cb0\u0cc6\u0caf\u0cb2\u0ccd\u0cb2\u0cbf \u0c85\u0cb5"
    }
  ],
  "created": 1638643551,
  "id": "cmpl-4BUKdWtcH70lHKUKnrCaG8DOLDS7C",
  "model": "davinci:2020-05-03",
  "object": "text_completion"
}
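
Here `finish_reason` is `"length"`, so the model ran until `max_tokens` without emitting a newline, and the text is printed as `\u` escapes. A short sketch for checking which script those code points belong to, using the standard `unicodedata` module; `scripts_in` is a hypothetical helper and `first_word` copies the first word of the completion above:

```python
import unicodedata

# First word of the escaped completion shown above.
first_word = "\u0caa\u0ccd\u0cb0\u0cbf\u0caf\u0cae\u0ccd\u0cae\u0ca8\u0ccd"

def scripts_in(s):
    """Collect the leading word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in s if not ch.isspace()}

found = scripts_in(first_word)
print(found)
```

Every character name in that word starts with `KANNADA`, so the completion is Kannada-script text unrelated to the question.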


In [38]:
# There is no point in evaluating the entire dataset if we can't get a single example to work.
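
For checking a single example before scaling up, a completion can be compared against the gold answers with an exact-match test. The sketch below is modeled on the normalization used by the official SQuAD v1.1 evaluation script (lowercase, strip punctuation and English articles, collapse whitespace), but it is a simplified reimplementation, and the function names are assumptions:

```python
import re
import string

_PUNCTUATION = set(string.punctuation)

def normalize_answer(s):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in _PUNCTUATION)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)

gold = ["Saint Bernadette Soubirous"]
print(exact_match("Saint Bernadette Soubirous.", gold))
print(exact_match("_________", gold))
```

With this check, the underscore and empty completions seen so far all count as misses, while a correct answer with stray punctuation still counts as a match.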

In [41]:
# One last try combining everything: short introduction, context, and Q:/A: labels, on davinci
def get_bot_context_prompt(x):
    return f"{short_intro}\n\n{x['context']}\n\nQ: {x['question']}\n\nA: "

prompt = get_bot_context_prompt(data['train'].shuffle()[3])
print(prompt)

Loading cached shuffled indices for dataset at /home/x/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-2bd8e69c8e2e88c5.arrow


I am a highly intelligent question answering bot.

Among the notable structures in the town are the three forts built by the Swedes for defense purposes. One of these forts, known as Fort Oscar (formerly Gustav Adolph), which overlooks the sea is located on the far side of La Pointe. However, the ruins have been replaced by a modern military building which now houses the local gendarmerie. The other fort known as Fort Karl now presents a very few ruins. The third fort built by the Swedes is the Fort Gustav, which is also seen in ruins strewn around the weather station and the Light House. The fort built in 1787 over a hill slope has ruins of ramparts, guardhouse, munitions depot, wood-burning oven and so forth.

Q: On what part of the island is Fort Oscar located on the far side of?

A: 


In [42]:
response = openai.Completion.create(
  engine='davinci',
  prompt=prompt,
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["\n"]
)
print(response)

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": ""
    }
  ],
  "created": 1638644876,
  "id": "cmpl-4BUg0VB2WGLKlQLW0XbRpkGLZET7B",
  "model": "davinci:2020-05-03",
  "object": "text_completion"
}


In [43]:
# Same prompt as above, but with the full two-sentence introduction
def get_bot_context_prompt(x):
    return f"{intro}\n\n{x['context']}\n\nQ: {x['question']}\n\nA: "

prompt = get_bot_context_prompt(data['train'].shuffle()[3])
print(prompt)

Loading cached shuffled indices for dataset at /home/x/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-2bd8e69c8e2e88c5.arrow


I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer.

Among the notable structures in the town are the three forts built by the Swedes for defense purposes. One of these forts, known as Fort Oscar (formerly Gustav Adolph), which overlooks the sea is located on the far side of La Pointe. However, the ruins have been replaced by a modern military building which now houses the local gendarmerie. The other fort known as Fort Karl now presents a very few ruins. The third fort built by the Swedes is the Fort Gustav, which is also seen in ruins strewn around the weather station and the Light House. The fort built in 1787 over a hill slope has ruins of ramparts, guardhouse, munitions depot, wood-burning oven and so forth.

Q: On what part of the island is Fort Oscar located on the far side of?

A: 


In [44]:
response = openai.Completion.create(
  engine='davinci',
  prompt=prompt,
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["\n"]
)
print(response)

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": ""
    }
  ],
  "created": 1638644930,
  "id": "cmpl-4BUgspdd9dXR90oeXvMXDKojt8YPS",
  "model": "davinci:2020-05-03",
  "object": "text_completion"
}
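
Every prompt in this notebook is zero-shot: the model never sees a completed question and answer pair. A variant that was not tried here is to prepend a few solved examples so the expected format is visible in the prompt. The sketch below only builds such a prompt; `build_few_shot_prompt`, the demo record, and the single-newline layout are all assumptions, and whether it changes the results is untested:

```python
def build_few_shot_prompt(demos, target):
    """Prepend solved examples, then leave the target's answer open."""
    blocks = [
        f"Context: {d['context']}\nQuestion: {d['question']}\nAnswer: {d['answer']}"
        for d in demos
    ]
    # The final block ends right after the label, with no trailing whitespace.
    blocks.append(
        f"Context: {target['context']}\nQuestion: {target['question']}\nAnswer:"
    )
    return "\n\n".join(blocks)

# Hypothetical demonstration and target records.
demos = [{
    "context": "Next to the Main Building is the Basilica of the Sacred Heart.",
    "question": "What is next to the Main Building?",
    "answer": "the Basilica of the Sacred Heart",
}]
target = {
    "context": "Fort Oscar is located on the far side of La Pointe.",
    "question": "Where is Fort Oscar located?",
}
few_shot_prompt = build_few_shot_prompt(demos, target)
print(few_shot_prompt)
```

Because each answer sits on one line, the existing `stop=["\n"]` setting would still cut the completion after the first answer.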
