# Training a GPT-3 model for nl2sparql
OpenAI only supports fine-tuning on its GPT-3 base models: ada, babbage, curie and davinci. It does not yet support fine-tuning 
on GPT-3.5 (aka ChatGPT, aka "gpt-3.5-turbo"). This notebook examines the models for which fine-tuning is available.

In [1]:
!pip install wikibaseintegrator

Looking in indexes: https://artifactory.concurtech.net/artifactory/api/pypi/pypi.python.org/simple, https://artifactory.concurtech.net/artifactory/api/pypi/ext-pypi-selfserve-local/simple

In [2]:
# pip install these as needed
import openai
import pandas as pd

In [3]:
# these imports should not require installation
import json
import string
import random
import subprocess


### We will need the lcquad data as a DataFrame. You may need to change this file path.

In [4]:
lcquad_filename = '../../lcquad2.0.train.json'
lcquad_df = pd.read_json(lcquad_filename)

In [5]:
df_train_validate = lcquad_df.sample(n=1200, random_state=1)

In [5]:
df_train = df_train_validate.sample(n=1000, random_state=2)
len(df_train)

1000

In [6]:
df_validate = df_train_validate.drop(df_train.index)
len(df_validate)

200

Let's verify that the two samples are really disjoint: the union of their indexes should have 1200 entries.

In [7]:
s = set()
s.update(list(df_train.index))
s.update(list(df_validate.index))
len(s)

1200
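A more direct way to express the same check is to assert that the two index sets do not intersect. The sketch below assumes only that both indexes are iterable; `assert_disjoint` is a hypothetical helper name, not part of the original notebook.

```python
def assert_disjoint(train_index, validate_index):
    """Raise if any row label appears in both splits; return the combined row count."""
    train, validate = set(train_index), set(validate_index)
    overlap = train & validate
    if overlap:
        # Show a few offending labels to make the failure easy to debug
        raise ValueError(f"{len(overlap)} rows appear in both splits: {sorted(overlap)[:5]}")
    return len(train) + len(validate)
```

With the frames above, `assert_disjoint(df_train.index, df_validate.index)` should return 1200 if the split is clean.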

### Set up the openai api key

The API key is a secret and so should not be checked into GitHub. Keep it in a local `secrets.ini` file that looks like this:
```
[OPENAI]
OPENAI_API_KEY=<openai key here>
[WANDB]
WANDB_API_KEY=<wandb key here>
```
Add your own api key there, or ask Max for his.
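If the file or a key is missing, `config[...]` raises a fairly terse `KeyError`. A minimal sketch of a loader that fails with a clearer message is shown below; `load_secret` is a hypothetical helper, and the section and key names are the ones from the ini layout above.

```python
import configparser

def load_secret(path, section, key):
    """Return the value of `key` in `section` of an ini file, failing loudly if absent."""
    config = configparser.ConfigParser()
    if not config.read(path):  # read() returns the list of files it managed to parse
        raise FileNotFoundError(f"could not read {path}")
    if not config.has_option(section, key):
        raise KeyError(f"[{section}] {key} is missing from {path}")
    # raw=True avoids '%' interpolation in case a secret contains that character
    return config.get(section, key, raw=True)
```

For example, `load_secret('secrets.ini', 'OPENAI', 'OPENAI_API_KEY')` would return the OpenAI key from the file described above.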

In [8]:
import configparser
config = configparser.ConfigParser()
config.read('secrets.ini')

['secrets.ini']

In [9]:
import os
os.environ.update({'OPENAI_API_KEY': config['OPENAI']['OPENAI_API_KEY']})
openai.api_key = os.getenv('OPENAI_API_KEY')

### Sanity check: openai's tutorial example
This is here just to validate that the api setup is working

In [10]:
def generate_prompt(animal):
    return """Suggest three names for an animal that is a superhero.

Animal: Cat
Names: Captain Sharpclaw, Agent Fluffball, The Incredible Feline
Animal: Dog
Names: Ruff the Protector, Wonder Canine, Sir Barks-a-Lot
Animal: {}
Names:""".format(
        animal.capitalize()
    )


In [11]:
def run_prompt(prompt="", model="text-davinci-003", temperature=0.6, stop=None):
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=temperature,
        max_tokens=100,
        stop=stop
    )
    return response

In [12]:
response = run_prompt(generate_prompt('cow'))
response['choices'][0]['text']

' Mighty Moo-ver, Bovine Brawler, Supercow!'

### Sanity check: the openai CLI should be working as well
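Make the key available to the CLI. Note that each `!` line runs in its own subshell, so the `export` below does not persist into later cells; the CLI still finds the key because `os.environ` was updated above and child processes inherit it.

In [13]:
# Export the openai key in the shell environment (only for the duration of this line)
!eval `cat secrets.ini | grep OPENAI_API_KEY | sed 's/^/export /'`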
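Because an `export` inside a `!` line does not outlive that line, one option is to pass the key to the child process explicitly. The sketch below shows the mechanism with a child Python process standing in for the `openai` CLI; `run_with_env` is a hypothetical helper and the key value is a placeholder.

```python
import os
import subprocess
import sys

def run_with_env(argv, extra_env):
    """Run argv with extra_env layered on top of the current environment; return stdout."""
    env = {**os.environ, **extra_env}
    result = subprocess.run(argv, env=env, stdout=subprocess.PIPE, text=True, check=True)
    return result.stdout

# A child Python process stands in for the CLI and echoes the variable it received.
child = [sys.executable, "-c", "import os; print(os.environ['OPENAI_API_KEY'])"]
output = run_with_env(child, {"OPENAI_API_KEY": "sk-placeholder"})
```

The same helper could wrap a call such as `['openai', 'api', 'fine_tunes.list']`, with the real key taken from `secrets.ini`.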

In [14]:
!which openai

/opt/homebrew/anaconda3/bin/openai


In [15]:
# if the above fails, try this
!pip install openai

Looking in indexes: https://artifactory.concurtech.net/artifactory/api/pypi/pypi.python.org/simple, https://artifactory.concurtech.net/artifactory/api/pypi/ext-pypi-selfserve-local/simple
^C
ERROR: Operation cancelled by user

In [16]:
# this is a quick way to validate that the CLI is working
!openai api fine_tunes.list

{
  "data": [
    {
      "created_at": 1677959387,
      "fine_tuned_model": "ada:ft-personal-2023-03-04-20-06-30",
      "hyperparams": {
        "batch_size": 1,
        "learning_rate_multiplier": 0.1,
        "n_epochs": 4,
        "prompt_loss_weight": 0.01
      },
      "id": "ft-90O7QVVHQ86vwYsLsRe3lFbZ",
      "model": "ada",
      "object": "fine-tune",
      "organization_id": "org-6Tm9wvTU2DAyCUVamArcvPxV",
      "result_files": [
        {
          "bytes": 43024,
          "created_at": 1677960391,
          "filename": "compiled_results.csv",
          "id": "file-xZCujiVRhZnleRx5KowmvYt7",
          "object": "file",
          "purpose": "fine-tune-results",
          "status": "processed",
          "status_details": null
        }
      ],
      "status": "succeeded",
      "training_files": [
        {
          "bytes": 37497,
          "created_at": 1677959386,
          "filename": "openai_train.jsonl",
          "id": "file-PtdM8l5yCZybCnRknRIFCIiv",
          

## Fine-tuning
Choose a base model

In [17]:
#base_model = 'ada'
#base_model = 'curie'
base_model = 'davinci'

Does the model know anything about sparql already?

In [18]:
response = run_prompt("Please show me a sample sparql query", model=base_model)
response['choices'][0]['text']

' that can do this.\n\nThanks.'

In [19]:
response = run_prompt("What does a sparql query do?", model=base_model)
response['choices'][0]['text']

'\n\nThe query is a string that describes a set of triples. The query is run against a sparql endpoint and the results are returned.\n\nWhat is sparql?\n\nsparql is a query language for the semantic web.\n\nWhat is the semantic web?\n\nThe semantic web is a web of data that is linked, annotated, and defined by machine-readable standards. It is a web that can be read and understood by machines.\n\nWhat'

In [20]:
response = run_prompt("Please translate this question to sparql: 'What is Delta Air Line's periodical literature mouthpiece'", model=base_model)
response['choices'][0]['text']

' in your language.\n\nThe query was translated to the following languages:\n\nFrench: Delta Air Lines a publié une revue trimestrielle en anglais.\n\nItalian: Delta Air Lines ha pubblicato una rivista trimestrale in inglese.\n\nSpanish: Delta Air Lines editó una revista trimestral en inglés.\n\nGerman: Delta Air Lines publizierte eine hal'

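OpenAI expects fine-tuning data as JSONL, with one `{"prompt": ..., "completion": ...}` record per line. See https://platform.openai.com/docs/guides/fine-tuning/prepare-training-data.
This function prepares the lcquad data. Note that it prefers the "paraphrased" question and falls back to the original question when the paraphrase is empty or longer than 2048 characters. This is something we could play with.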

In [21]:
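def make_finetune_data(df, filename=None):
    """Build prompt/completion records from lcquad rows.

    Returns a list of JSON strings, or writes them to `filename` as JSONL
    (and returns None) when a filename is given.
    """
    training_json = []
    for index, row in df.iterrows():
        d = {}
        # prefer the paraphrased question; fall back if it is empty or too long
        question = row['paraphrased_question']
        if len(question) == 0 or len(question) > 2048:
            question = row['question']
        # ' ->' marks the end of the prompt; ' \n' is the stop sequence
        d['prompt'] = f"{question} ->"
        d['completion'] = f" {row['sparql_wikidata']} \n"
        training_json.append(json.dumps(d))
    if filename is None:
        return training_json
    with open(filename, 'w') as f:
        for l in training_json:
            f.write(l + '\n')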

Play with it a bit to see that it's working

In [26]:
training_json = make_finetune_data(df_train)

In [27]:
training_json[0:5]

['{"prompt": "let me know lord in Greek mythology title has the word thestrus in it ->", "completion": " SELECT DISTINCT ?sbj ?sbj_label WHERE { ?sbj wdt:P31 wd:Q24434794 . ?sbj rdfs:label ?sbj_label . FILTER(CONTAINS(lcase(?sbj_label), \'thestrus\')) . FILTER (lang(?sbj_label) = \'en\') } LIMIT 25  \\n"}',
 '{"prompt": "Who is the {protein} for {physically interatomic with} of {fentanyl} ->", "completion": "  select distinct ?sbj where { ?sbj wdt:P129 wd:Q407541 . ?sbj wdt:P31 wd:Q8054 }  \\n"}',
 '{"prompt": "Which is the office held by head of the organisation of Autonomous University of Madrid? ->", "completion": " select distinct ?answer where { wd:Q788091 wdt:P2388 ?answer} \\n"}',
 '{"prompt": "Is Usain Bolt\'s individual best rise to to 36.84? ->", "completion": " ASK WHERE { wd:Q1189 wdt:P2415 ?obj filter(?obj = 36.84) }  \\n"}',
 '{"prompt": "What is the unicameral legislative body in Azad Kashmir called? ->", "completion": "  select distinct ?obj where { wd:Q200130 wdt:P194 

Now create a training file. 

In [28]:
def random_train_file_name(N=5):
    random_s = ''.join(random.choices(string.ascii_uppercase + string.digits, k=N))    
    return f'openai_train_{random_s}.jsonl'

In [32]:
train_file_name = random_train_file_name()
make_finetune_data(df_train, filename=train_file_name)

In [33]:
validation_file_name = random_train_file_name()
make_finetune_data(df_validate, filename=validation_file_name)

We hope that this call simply reports that everything looks good; it should not prompt. If it does prompt, it will crash because 
it is reading input from /dev/null. If you see a crash, try running the command in a shell without the `< /dev/null` redirect.
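Before invoking the CLI it can be handy to run a quick local pre-check. The sketch below is only a rough approximation of two things the tool reports (the prompt and completion suffixes); it is not how `prepare_data` itself works, and `check_finetune_lines` is a hypothetical helper.

```python
import json

def check_finetune_lines(lines, prompt_suffix=" ->", completion_suffix=" \n"):
    """Return (line_number, problem) pairs for records that break the expected format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        prompt = record.get("prompt")
        completion = record.get("completion")
        if not isinstance(prompt, str) or not prompt.endswith(prompt_suffix):
            problems.append((number, "prompt lacks the separator suffix"))
        if not isinstance(completion, str) or not completion.endswith(completion_suffix):
            problems.append((number, "completion lacks the stop suffix"))
    return problems
```

An empty list from `check_finetune_lines(make_finetune_data(df_train))` suggests the file is shaped the way the fine-tuning tool expects.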

In [34]:
!openai tools fine_tunes.prepare_data -f {train_file_name} < /dev/null

Analyzing...

- Your file contains 1000 prompt-completion pairs
- All prompts end with suffix ` ->`
- All completions end with suffix ` \n`

No remediations found.

You can use your file for fine-tuning:
> openai api fine_tunes.create -t "openai_train_A7MUW.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[" \n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 16.18 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.


In [35]:
!openai tools fine_tunes.prepare_data -f {validation_file_name} < /dev/null

Analyzing...

- Your file contains 200 prompt-completion pairs
- All prompts end with suffix ` ->`
- All completions end with suffix ` \n`

No remediations found.

You can use your file for fine-tuning:
> openai api fine_tunes.create -t "openai_train_LVYWD.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[" \n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 5.19 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.


Now we create the training run. Note that many more hyperparameters can be specified. See https://platform.openai.com/docs/guides/fine-tuning/hyperparameters
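As a sketch of how hyperparameters might be passed, the helper below builds the argument list for the CLI call. The flag names (for example `--n_epochs` or `--learning_rate_multiplier`) are assumed to mirror the `hyperparams` keys shown in the `fine_tunes.list` output above; check `openai api fine_tunes.create --help` for your installed version before relying on them. `fine_tune_argv` is a hypothetical helper.

```python
def fine_tune_argv(train_file, base_model, validation_file=None, **hyperparams):
    """Build an argv list for the legacy `openai api fine_tunes.create` command."""
    argv = ["openai", "api", "fine_tunes.create", "-t", train_file, "-m", base_model]
    if validation_file is not None:
        argv += ["-v", validation_file]
    # Assumption: each hyperparameter maps to a long flag of the same name
    for name, value in sorted(hyperparams.items()):
        argv += [f"--{name}", str(value)]
    return argv
```

The resulting list can be handed to `subprocess.run`, which avoids shell quoting issues with file names.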

### Create the fine-tuning run
In principle, this command creates the job and streams back the messages. In practice, I always see `Stream interrupted (client disconnected)` when I run it in a Jupyter notebook.

In [36]:
# use this if you don't want validation - but who doesn't want validation?
#!openai api fine_tunes.create -t {train_file_name} -m {base_model} < /dev/null

In [38]:
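# use this if you do want validation
# (the run logged below passed train_file_name for -v too, which is why the same file is uploaded twice)

!openai api fine_tunes.create -t {train_file_name} -v {validation_file_name} -m {base_model} < /dev/null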


Upload progress: 100%|██████████████████████| 192k/192k [00:00<00:00, 89.9Mit/s]
Uploaded file from openai_train_A7MUW.jsonl: file-EOYCkdbjT5gSAFON3lOWvgtP
Found potentially duplicated files with name 'openai_train_A7MUW.jsonl', purpose 'fine-tune' and size 191887 bytes
file-EOYCkdbjT5gSAFON3lOWvgtP
Upload progress: 100%|██████████████████████| 192k/192k [00:00<00:00, 92.7Mit/s]is file anyway: 
Uploaded file from openai_train_A7MUW.jsonl: file-IAkxwi534qhwZk0JuHgHk6SA
Created fine-tune: ft-iVdTcCTmIoSHQ8VTVEkxa5aL
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-03-12 17:33:15] Created fine-tune: ft-iVdTcCTmIoSHQ8VTVEkxa5aL

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-iVdTcCTmIoSHQ8VTVEkxa5aL



In [39]:
def get_last_run_id():
    result = subprocess.run(['openai','api', 'fine_tunes.list'], stdout=subprocess.PIPE)
    runs = json.loads(result.stdout)
    last_run = runs['data'][-1]
    run_id = fine_tuned_model = None
    if 'id' in last_run:
        run_id = last_run['id']
    if 'fine_tuned_model' in last_run:
        fine_tuned_model = last_run['fine_tuned_model']  
    return run_id, fine_tuned_model

In [40]:
last_run, fine_tuned_model = get_last_run_id()
last_run, fine_tuned_model

('ft-iVdTcCTmIoSHQ8VTVEkxa5aL', None)
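`get_last_run_id` relies on the list being returned oldest-first. A slightly more defensive sketch picks the run with the largest `created_at` instead (a field visible in the `fine_tunes.list` output above). `latest_run` is a hypothetical helper that takes the already-parsed JSON.

```python
def latest_run(runs):
    """Return (id, fine_tuned_model) for the most recently created run, or (None, None)."""
    data = runs.get("data", [])
    if not data:
        return None, None
    # Pick by creation timestamp rather than by position in the list
    newest = max(data, key=lambda run: run.get("created_at", 0))
    return newest.get("id"), newest.get("fine_tuned_model")
```

It could be called as `latest_run(json.loads(result.stdout))` inside a wrapper like the one above.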

Again, this command in principle streams all messages until completion, but in practice it also times out.

So run this cell over and over until you see "Status: succeeded 🎉"
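Re-running by hand can also be replaced with a small polling loop. The sketch below is deliberately independent of the API: `get_status` is a caller-supplied function (for instance, one that reads the `status` field of the run from `fine_tunes.list`). The status `succeeded` appears in the output above; `failed` and `cancelled` are assumed names for the other terminal states.

```python
import time

def wait_for_fine_tune(get_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or max_polls is exhausted."""
    terminal = {"succeeded", "failed", "cancelled"}  # last two are assumptions
    status = None
    for _ in range(max_polls):
        status = get_status()
        if status in terminal:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"still {status!r} after {max_polls} polls")
```

Passing `sleep` as a parameter keeps the loop easy to test without actually waiting.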

In [47]:
!openai api fine_tunes.follow -i {last_run} 

[2023-03-12 17:33:15] Created fine-tune: ft-iVdTcCTmIoSHQ8VTVEkxa5aL
[2023-03-12 17:37:56] Fine-tune costs $7.09
[2023-03-12 17:37:57] Fine-tune enqueued. Queue number: 0
[2023-03-12 17:37:57] Fine-tune is in the queue. Queue number: 0
[2023-03-12 17:37:59] Fine-tune started
[2023-03-12 17:51:10] Completed epoch 1/4
[2023-03-12 18:01:06] Completed epoch 2/4
[2023-03-12 18:10:58] Completed epoch 3/4
[2023-03-12 18:20:53] Completed epoch 4/4
[2023-03-12 18:21:40] Uploaded model: davinci:ft-askwiki-2023-03-13-01-21-40
[2023-03-12 18:21:42] Uploaded result file: file-Nm0MnsOcT72zCTqgLdMBo1HU
[2023-03-12 18:21:42] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m davinci:ft-askwiki-2023-03-13-01-21-40 -p <YOUR_PROMPT>


Get the fine-tuned model name. Make sure it's not None.

In [48]:
last_run, fine_tuned_model = get_last_run_id()
last_run, fine_tuned_model

('ft-iVdTcCTmIoSHQ8VTVEkxa5aL', 'davinci:ft-askwiki-2023-03-13-01-21-40')

Take a look at a few questions

In [49]:
response = run_prompt("What is Delta Air Line's periodical literature mouthpiece? ->", 
                      model=fine_tuned_model, stop=[" \n"])
response['choices'][0]['text']

' SELECT ?answer WHERE { wd:Q122467 wdt:P166 ?X . ?X wdt:P1056 ?answer}'

In [167]:
# sample_size is not defined in this notebook as shown; it was set in an earlier session
lcquad_df.iloc[sample_size+5]['paraphrased_question']

'What grant was gotten Mary Tyler Moore ?'

In [168]:
response = run_prompt(f"{lcquad_df.iloc[sample_size+5]['paraphrased_question']} ->", 
                      model=fine_tuned_model, stop=[" \n"])
response['choices'][0]['text']
response

<OpenAIObject text_completion id=cmpl-6sy3clKE6EimrIaY3FX4sKJgb5FtV at 0x7f8af2ecc130> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": " SELECT ?value WHERE { wd:Q1680 p:P1082 ?s . ?s ps:P1082 ?x filter(contains(YEAR(?x),'1966')) . ?s pq:P585 ?value}"
    }
  ],
  "created": 1678558592,
  "id": "cmpl-6sy3clKE6EimrIaY3FX4sKJgb5FtV",
  "model": "curie:ft-personal-2023-03-11-18-13-24",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 54,
    "prompt_tokens": 9,
    "total_tokens": 63
  }
}

### Fine-tune further if you want
Repeat the steps above, but pass the fine-tuned model name as the base model (`-m`).

In [169]:
next_sample_size = 1000

In [170]:
next_train_file_name = random_train_file_name()

In [171]:
make_finetune_data(lcquad_df.iloc[sample_size:sample_size+next_sample_size], filename=next_train_file_name)

In [172]:
!openai tools fine_tunes.prepare_data -f {next_train_file_name} < /dev/null

Analyzing...

- Your file contains 1000 prompt-completion pairs
- All prompts end with suffix ` ->`
- All completions end with suffix ` \n`

No remediations found.

You can use your file for fine-tuning:
> openai api fine_tunes.create -t "openai_train_8YJQN.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[" \n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 16.18 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.


#### Note: we're using the fine_tuned_model here

In [173]:
# pass the file prepared above (next_train_file_name), not the original training file
!openai api fine_tunes.create -t {next_train_file_name} -m {fine_tuned_model} < /dev/null

Found potentially duplicated files with name 'openai_train_RN5E2.jsonl', purpose 'fine-tune' and size 38794 bytes
file-wqHZIMQlqYKdRnFS4Oj93nwl
Upload progress: 100%|████████████████████| 38.8k/38.8k [00:00<00:00, 44.3Mit/s]is file anyway: 
Uploaded file from openai_train_RN5E2.jsonl: file-UrT9q5sOdsOC9gWUDYNChsQz
Created fine-tune: ft-IR6C0jd8iTkdNRytnzutzuDd
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-03-11 11:16:50] Created fine-tune: ft-IR6C0jd8iTkdNRytnzutzuDd

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-IR6C0jd8iTkdNRytnzutzuDd



In [174]:
last_run, fine_tuned_model = get_last_run_id()
last_run, fine_tuned_model

('ft-IR6C0jd8iTkdNRytnzutzuDd', None)

Again, run this cell until you see the success message

In [176]:
!openai api fine_tunes.follow -i {last_run} < /dev/null

[2023-03-11 11:16:50] Created fine-tune: ft-IR6C0jd8iTkdNRytnzutzuDd
[2023-03-11 11:19:57] Fine-tune costs $0.15
[2023-03-11 11:19:57] Fine-tune enqueued. Queue number: 0
[2023-03-11 11:19:59] Fine-tune started
[2023-03-11 11:21:39] Completed epoch 1/4
[2023-03-11 11:22:17] Completed epoch 2/4
[2023-03-11 11:22:55] Completed epoch 3/4
[2023-03-11 11:23:33] Completed epoch 4/4
[2023-03-11 11:23:51] Uploaded model: curie:ft-personal-2023-03-11-18-23-51
[2023-03-11 11:23:52] Uploaded result file: file-Tdl2avROfPtZcHwjvmx3em9V
[2023-03-11 11:23:52] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m curie:ft-personal-2023-03-11-18-23-51 -p <YOUR_PROMPT>


Pick up the fine-tuned model name. Note: this model was trained from the previous fine-tuned model, but it has a different, new name.
Presumably, the previous fine-tuned model is still around. I'm not sure how to delete them.

In [177]:
last_run, fine_tuned_model = get_last_run_id()
last_run, fine_tuned_model

('ft-IR6C0jd8iTkdNRytnzutzuDd', 'curie:ft-personal-2023-03-11-18-23-51')

### Sync with wandb

You may need a paid openai account for this to work.

In [50]:
project_name = 'nl2sparql'

In [52]:
!pip install wandb

Looking in indexes: https://artifactory.concurtech.net/artifactory/api/pypi/pypi.python.org/simple, https://artifactory.concurtech.net/artifactory/api/pypi/ext-pypi-selfserve-local/simple
Collecting wandb
  Downloading https://artifactory.concurtech.net/artifactory/api/pypi/pypi.python.org/packages/packages/7f/7d/6be3f2da29b80ad650d624f2bb6b76814a25aafa17260d7efbe3b5e0ccd3/wandb-0.13.11-py3-none-any.whl (2.0 MB)
Collecting sentry-sdk>=1.0.0
  Downloading https://artifactory.concurtech.net/artifactory/api/pypi/pypi.python.org/packages/packages/e4/03/4a3e03619eb41b4d9028d377869cc29a5a2095975ca9e8d9b64cb87e1450/sentry_sdk-1.16.0-py2.py3-none-any.whl (184 kB)
Collecting GitPython!=3.1.29,>=1.0.0
  Downloading https://

In [53]:
!WANDB_API_KEY=`cat secrets.ini | grep WANDB_API_KEY | sed 's/^WANDB_API_KEY=//'` openai wandb sync --project {project_name} < /dev/null

wandb: Currently logged in as: daziff-berkeley (askwiki). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.13.11
wandb: Run data is saved locally in /Users/i857913/Documents/mids/210/kgqa-ucb-210/training/wandb/run-20230313_133004-ft-iVdTcCTmIoSHQ8VTVEkxa5aL
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ft-iVdTcCTmIoSHQ8VTVEkxa5aL
wandb: ⭐️ View project at https://wandb.ai/askwiki/nl2sparql
wandb: 🚀 View run at https://wandb.ai/askwiki/nl2sparql/runs/ft-iVdTcCTmIoSHQ8VTVEkxa5aL
wandb: Waiting for W&B process to finish... (success).
wandb: 
wandb: Run history:
wandb:             elapsed_examples ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:               elapsed_tokens ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄

## Check syntactic correctness

In [180]:
def generate_sparql(question, model=base_model, stop=None):
    response = run_prompt(f"{question} ->", model=model, stop=stop)
    # print(response)
    translation = response['choices'][0]['text']
    # print(translation)
    if translation is None or len(translation) == 0:
        return None
    # logging.info(f'sparql {translation}')
    return translation


In [181]:
generate_sparql("What is the name of Bill Gate's mother?", model=fine_tuned_model, stop=[" \n"])

' select distinct ?answer where { wd:Q974 wdt:P24 ?answer}'

In [182]:
def generate_lots_of_sparql(l, generator=None):
    result = []
    count = 1
    if generator is None:
        generator = lambda s: generate_sparql(s)
    for s in l:
        if count % 10 == 0:
            print(count)
        result.append(generator(s))
        count += 1
    print(count-1)
    return result


In [183]:
generator = lambda s: generate_sparql(s, model=fine_tuned_model, stop=[" \n"])

In [184]:
generator("What is the name of Bill Gate's mother?")

' select distinct ?answer where { wd:Q355444 wdt:P3822 ?answer}'

In [185]:
sparqls = generate_lots_of_sparql(lcquad_df[-10: -1]['question'], generator=generator)

9


In [186]:
sparqls

[" SELECT ?answer WHERE { wd:Q128056 wdt:P180 ?answer . ?answer wdt:P1279 ?x FILTER(contains(?x,'Al Green'))}",
 ' ASK WHERE { wd:Q252 wdt:P741 ?obj filter(?obj = 45.6) } ',
 ' SELECT ?answer WHERE { wd:Q171741 wdt:P19 ?X . ?X wdt:P19 ?answer}',
 ' SELECT ?answer WHERE { wd:Q1628086 wdt:P26 ?answer . ?answer wdt:P190 wd:Q133642}',
 " SELECT DISTINCT ?sbj ?sbj_label WHERE { ?sbj wdt:P31 wd:Q43229 . ?sbj rdfs:label ?sbj_label . FILTER(CONTAINS(lcase(?sbj_label), 's')) . FILTER (lang(?sbj_label) = 'en') } LIMIT 25 ",
 " SELECT DISTINCT ?sbj ?sbj_label WHERE { ?sbj wdt:P31 wd:Q109 . ?sbj rdfs:label ?sbj_label . FILTER(STRSTARTS(lcase(?sbj_label), 'H')) . FILTER (lang(?sbj_label) = 'en') } LIMIT 25 ",
 ' ASK WHERE { wd:Q742408 wdt:P2260 ?obj filter(?obj = 117.6) } ',
 ' SELECT ?answer WHERE { wd:Q65984 wdt:P108 ?X . ?X wdt:P103 ?answer}',
 ' SELECT ?obj WHERE { wd:Q1101377 p:P1082 ?s . ?s ps:P1082 ?obj . ?s pq:P459 wd:Q1344910 }']

In [187]:
from wikibaseintegrator import wbi_helpers
from wikibaseintegrator.wbi_config import config as wbi_config
import logging

In [188]:
wbi_config['USER_AGENT'] = 'AskwikiBot/1.0 (https://www.wikidata.org/wiki/User:What_Tottles_Meant)'
wbi_config['BACKOFF_MAX_TRIES'] = 1


In [189]:
from requests.exceptions import HTTPError
def run_sparql(query):
    try:
        results = wbi_helpers.execute_sparql_query(query)
    except HTTPError as he:
        logging.error(f'HTTPError {he}')
        print(f"failed query {query}")
        return None
    # print(results)
    if 'boolean' in results:
        return pd.DataFrame([{'Boolean': results['boolean'] }])
    jsonResult = [dict([(k, b[k]['value']) for k in b]) for b in results['results']['bindings']]
    df = pd.DataFrame.from_dict(jsonResult)
    return df
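To make the list comprehension in `run_sparql` easier to follow, here is a sketch of the same flattening applied to a hand-written response in the standard SPARQL JSON results shape. `flatten_bindings` is a hypothetical helper and the sample values are illustrative, not taken from a real query.

```python
def flatten_bindings(results):
    """Turn a SPARQL JSON result into a list of {variable: value} rows."""
    if "boolean" in results:  # ASK queries carry a single boolean instead of bindings
        return [{"Boolean": results["boolean"]}]
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]

# Illustrative SELECT response with one variable and one row
sample = {
    "head": {"vars": ["answer"]},
    "results": {"bindings": [
        {"answer": {"type": "uri", "value": "http://www.wikidata.org/entity/Q42"}}
    ]},
}
rows = flatten_bindings(sample)
```

Each row keeps only the `value` of every bound variable, which is what ends up as a DataFrame column in `run_sparql`.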


In [190]:
def validate_queries(qs):
    validation_results = []
    count = 0
    for q in qs:
        # print(q)
        df = run_sparql(q)
        result_count = 0
        if df is None:
            run_result = 'Fail'
            print(f"Failed query number {count}")
        else:
            run_result = 'Pass'
            result_count = len(df)
        validation_results.append((run_result, result_count))
        count += 1
    return validation_results 

In [191]:
validate_queries(sparqls)

[('Pass', 0),
 ('Pass', 1),
 ('Pass', 0),
 ('Pass', 0),
 ('Pass', 25),
 ('Pass', 0),
 ('Pass', 1),
 ('Pass', 0),
 ('Pass', 0)]

In [192]:
sparqls = generate_lots_of_sparql(lcquad_df[0: 10]['question'], generator=generator)

10
10


In [193]:
validate_queries(sparqls)

ERROR:backoff:Giving up execute_sparql_query(...) after 1 tries (requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://query.wikidata.org/sparql?query=%23Tool%3A+WikibaseIntegrator+wbi_functions.execute_sparql_query%0A+ASK+WHERE+%7B+wd%3AQ7424+wdt%3AP27+wd%3AQ56474+.+wd%3AQ7424+wdt%3AP27+wd%3AQ56474+not%28ber%3ASTARTING_VALUE151398%29+%7D+&format=json)
ERROR:root:HTTPError 400 Client Error: Bad Request for url: https://query.wikidata.org/sparql?query=%23Tool%3A+WikibaseIntegrator+wbi_functions.execute_sparql_query%0A+ASK+WHERE+%7B+wd%3AQ7424+wdt%3AP27+wd%3AQ56474+.+wd%3AQ7424+wdt%3AP27+wd%3AQ56474+not%28ber%3ASTARTING_VALUE151398%29+%7D+&format=json


failed query  ASK WHERE { wd:Q7424 wdt:P27 wd:Q56474 . wd:Q7424 wdt:P27 wd:Q56474 not(ber:STARTING_VALUE151398) } 
Failed query number 6


[('Pass', 0),
 ('Pass', 0),
 ('Pass', 1),
 ('Pass', 0),
 ('Pass', 0),
 ('Pass', 25),
 ('Fail', 0),
 ('Pass', 0),
 ('Pass', 1),
 ('Pass', 0)]
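Note that 'Pass' only means the endpoint accepted the query; many passing queries still return zero rows. A small sketch that summarizes both counts, using the results printed above as input, might look like this (`summarize_validation` is a hypothetical helper):

```python
def summarize_validation(validation_results):
    """Count queries that ran, and queries that ran and returned at least one row."""
    total = len(validation_results)
    passed = sum(1 for status, _ in validation_results if status == "Pass")
    non_empty = sum(
        1 for status, rows in validation_results if status == "Pass" and rows > 0
    )
    return {"total": total, "passed": passed, "non_empty": non_empty}

# The validation results shown above
results = [("Pass", 0), ("Pass", 0), ("Pass", 1), ("Pass", 0), ("Pass", 0),
           ("Pass", 25), ("Fail", 0), ("Pass", 0), ("Pass", 1), ("Pass", 0)]
summary = summarize_validation(results)
# summary == {"total": 10, "passed": 9, "non_empty": 3}
```

Syntactic validity is therefore a much weaker signal than returning the right answer.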

In [194]:
sparqls

[' SELECT ?answer WHERE { wd:Q168790 wdt:P172 ?answer . ?answer wdt:P571 wd:Q168790}',
 ' SELECT ?answer WHERE { wd:Q169794 wdt:P156 ?X . ?X wdt:P1346 ?answer}',
 ' ASK WHERE { wd:Q4092 wdt:P1337 wd:Q106697 . wd:Q4092 wdt:P1337 wd:Q152697 }',
 " SELECT ?answer WHERE { wd:Q162856 wdt:P186 ?answer . ?answer wdt:P2080 ?x FILTER(contains(?x,'phase_matter_of_Galinstan'))}",
 ' select distinct ?answer where { wd:Q32491 wdt:P2260 ?answer}',
 " SELECT DISTINCT ?sbj ?sbj_label WHERE { ?sbj wdt:P31 wd:Q23847174 . ?sbj rdfs:label ?sbj_label . FILTER(STRSTARTS(lcase(?sbj_label), 'p')) . FILTER (lang(?sbj_label) = 'en') } LIMIT 25 ",
 ' ASK WHERE { wd:Q7424 wdt:P27 wd:Q56474 . wd:Q7424 wdt:P27 wd:Q56474 not(ber:STARTING_VALUE151398) } ',
 '  select distinct ?obj where { wd:Q202729 wdt:P358 ?obj . ?obj wdt:P31 wd:Q273057 } ',
 ' select distinct ?answer where { wd:Q235975 wdt:P3171 ?answer}',
 ' SELECT ?answer WHERE { wd:Q1159316 wdt:P156 ?X . ?X wdt:P1346 ?answer}']

In [195]:
lcquad_df[0:10]['sparql_wikidata']

0     select distinct ?obj where { wd:Q188920 wdt:P...
1    SELECT ?answer WHERE { wd:Q169794 wdt:P26 ?X ....
2    ASK WHERE { wd:Q174843 wdt:P106 wd:Q1804811 . ...
3    SELECT ?answer WHERE { wd:Q675176 wdt:P515 ?X ...
4    select distinct ?answer where { wd:Q32491 wdt:...
5    SELECT DISTINCT ?sbj ?sbj_label WHERE { ?sbj w...
6    ASK WHERE { wd:Q4180017 wdt:P6257 ?obj filter(...
7     select distinct ?obj where { wd:Q202729 wdt:P...
8    select distinct ?answer where { wd:Q235975 wdt...
9    SELECT ?answer WHERE { wd:Q1356316 wdt:P156 ?X...
Name: sparql_wikidata, dtype: object

In [196]:
validate_queries(lcquad_df[0:10]['sparql_wikidata'])

[('Pass', 1),
 ('Pass', 1),
 ('Pass', 1),
 ('Pass', 0),
 ('Pass', 1),
 ('Pass', 0),
 ('Pass', 1),
 ('Pass', 0),
 ('Pass', 1),
 ('Pass', 1)]

In [197]:
lcquad_df.iloc[3]['paraphrased_question']

'What range are the papers at the Monique Genonceaux about?'

In [198]:
lcquad_df.iloc[3]['sparql_wikidata']

'SELECT ?answer WHERE { wd:Q675176 wdt:P515 ?X . ?X wdt:P156 ?answer}'

In [199]:
sparqls[3]

" SELECT ?answer WHERE { wd:Q162856 wdt:P186 ?answer . ?answer wdt:P2080 ?x FILTER(contains(?x,'phase_matter_of_Galinstan'))}"

In [200]:
lcquad_df.iloc[6]

NNQT_question           Does the {right ascension} of the {Malin 1} {l...
uid                                                                 18423
subgraph                                              boolean with filter
template_index                                                        441
question                Is the right ascension of malin 1 less than 15...
sparql_wikidata         ASK WHERE { wd:Q4180017 wdt:P6257 ?obj filter(...
sparql_dbpedia18        ASK { ?statement1 <http://www.w3.org/1999/02/2...
template                            ASK ?sbj ?pred ?obj filter ?obj = num
answer                                                                 []
template_id                                                             3
paraphrased_question    Does malin 1 have a right ascension lower than...
Name: 6, dtype: object

In [201]:
lcquad_df.iloc[6]['paraphrased_question']

'Does malin 1 have a right ascension lower than 15.1398?'