<a href="https://colab.research.google.com/github/donaldziff/kgqa-ucb-210/blob/main/training/summarization/AskWiki_NLG_using_OPENAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AskWiki problem definition
AskWiki must perform two tasks in sequence: first, construct a SPARQL query from the question; second, verbalize the query results as a natural-language answer.

Author: shrinivasbjoshi@berkeley.edu

# NL generation based on Wikidata triples
To generate a natural-language (NL) answer to a question, AskWiki feeds Wikidata triples into an NLG model, which generates and summarizes an English-language response.

We considered the T5 and OpenAI model families as candidates for NLG.


Intuition behind choosing T5 and OpenAI models:
1. T5 offers more room for training and fine-tuning.
2. OpenAI offers a wider selection of models and easier few-shot training approaches.


This notebook provides an overview of our NLG training and generation with a fine-tuned OpenAI `davinci` model.

Approach to NL generation using WebNLG 2020 challenge data

1. Training on the WebNLG 2020 dataset

 The WebNLG dataset provides ready-to-use RDF triples (similar to the triples AskWiki will generate) together with human-written English verbalizations of those triples.
 
2. AskWiki aims to answer a specific question in natural language, not merely to summarize a set of triples into a paragraph. For that reason we have not tuned the NLG model extensively in this notebook; the onus of producing an answer is on the earlier stages of the pipeline, not necessarily on the NLG model.

3. The model simply reacts to the input RDF triples. AskWiki did not have access to a question-answer dataset for fine-tuning, so it used the WebNLG dataset to train a language generator and summarizer, not an answering model.
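
As a rough illustration of the input format, the training rows later in this notebook represent each triple as `subject | predicate | object` and join multiple triples with ` && `. The helper below is a minimal sketch of that linearization; the name `linearize_triples` and the `Triple` alias are hypothetical and not part of the AskWiki pipeline.

```python
from typing import Iterable, Tuple

# Hypothetical alias for a (subject, predicate, object) triple
Triple = Tuple[str, str, str]

def linearize_triples(triples: Iterable[Triple]) -> str:
    """Join triples into a single model input string."""
    # Each triple becomes "subject | predicate | object";
    # triples are separated by " && ", as in the WebNLG-derived rows.
    return " && ".join(" | ".join(part.strip() for part in t) for t in triples)

example = [
    ("Walter_Baade", "doctoralStudent", "Halton_Arp"),
    ("Walter_Baade", "nationality", "Germany"),
]
linearized = linearize_triples(example)
print(linearized)
```

An upstream SPARQL step could pass its result rows through a helper like this before calling the NLG model.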

# Installing the required packages

In [None]:
# pip install these as needed
!pip install openai
!pip install rouge

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.4-py3-none-any.whl (70 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting yarl<2.0,>=1.0
  Downloading yarl-1.8.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (264 kB)
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.3.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (158 kB)

In [None]:
import openai
import pandas as pd
import os
import subprocess

# Set OpenAI details

In [None]:
openai.api_key = 'removed for security reason'

In [None]:
import os
# The real key was removed before publishing; set your own key here
os.environ['OPENAI_API_KEY'] = 'removed for security reason'
os.environ['OPENAI_ORGANIZATION'] = 'org-6Tm9wvTU2DAyCUVamArcvPxV'

In [None]:
openai.organization = "org-6Tm9wvTU2DAyCUVamArcvPxV"
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.Model.list()

<OpenAIObject list at 0x7fce6f0bf0e0> JSON: {
  "data": [
    {
      "created": 1649358449,
      "id": "babbage",
      "object": "model",
      "owned_by": "openai",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": true,
          "allow_sampling": true,
          "allow_search_indices": false,
          "allow_view": true,
          "created": 1669085501,
          "group": null,
          "id": "modelperm-49FUp5v084tBB49tC4z8LPH5",
          "is_blocking": false,
          "object": "model_permission",
          "organization": "*"
        }
      ],
      "root": "babbage"
    },
    {
      "created": 1649359874,
      "id": "davinci",
      "object": "model",
      "owned_by": "openai",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": true,
          "allow_sa

# Basic Functions

In [None]:
def run_prompt(prompt="", model="text-davinci-003", temperature=0.4, stop=None):
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=temperature,
        max_tokens=80,
        stop=stop
    )
    return response

In [None]:
import json

def get_last_run_id():
    """Return (run_id, fine_tuned_model) for the most recent fine-tune job."""
    # List fine-tune jobs through the OpenAI CLI and parse its JSON output
    result = subprocess.run(['openai', 'api', 'fine_tunes.list'], stdout=subprocess.PIPE)
    runs = json.loads(result.stdout)
    last_run = runs['data'][-1]
    # Either value is None when the job record does not contain the key
    run_id = last_run.get('id')
    fine_tuned_model = last_run.get('fine_tuned_model')
    return run_id, fine_tuned_model

# Training data 

In [None]:
import glob
import os
import re
import urllib.request
import xml.etree.ElementTree as ET
import zipfile

import pandas as pd

# Download and unpack the English training split of WebNLG release v3.0
url = 'https://gitlab.com/shimorina/webnlg-dataset/-/archive/master/webnlg-dataset-master.zip?path=release_v3.0/en/train'
urllib.request.urlretrieve(url, 'web.zip')
with zipfile.ZipFile('web.zip', 'r') as zip_ref:
    zip_ref.extractall('web')

# The absolute path assumes the notebook runs from /content, as on Colab
files = glob.glob("/content/web/webnlg-dataset-master-release_v3.0-en-train/release_v3.0/en/train/**/*.xml", recursive=True)
# The folder name (for example "3triples") gives the number of triples per entry
triple_re = re.compile(r'(\d)triples')
data_dct = {}
for file in files:
    root = ET.parse(file).getroot()
    triples_num = int(triple_re.findall(file)[0])
    for sub_root in root:
        for ss_root in sub_root:
            structured_master = []
            unstructured = []
            for entry in ss_root:
                # Element text holds a verbalization; child elements hold triples
                unstructured.append(entry.text)
                structured_master.extend(triple.text for triple in entry)
            # Drop whitespace-only texts that come from container elements
            unstructured = [i for i in unstructured if i.replace('\n', '').strip() != '']
            # Keep only the last triples_num triples collected for this entry
            structured_master = structured_master[-triples_num:]
            data_dct[' && '.join(structured_master)] = unstructured

# Build one row per (triple set, reference text) pair
mdata_dct = {"prefix": [], "input_text": [], "target_text": []}
for st, unst in data_dct.items():
    for i in unst:
        mdata_dct['prefix'].append('AskWiki NLG: ')
        mdata_dct['input_text'].append(st)
        mdata_dct['target_text'].append(i)


df = pd.DataFrame(mdata_dct)
df.to_csv('webNLG2020_train.csv')

In [None]:
df.count()

prefix         35203
input_text     35203
target_text    35203
dtype: int64

In [None]:
train_df=pd.read_csv('webNLG2020_train.csv',  index_col=[0])

In [None]:
train_df.count()

prefix         35203
input_text     35203
target_text    35203
dtype: int64

In [None]:
train_df

Unnamed: 0,prefix,input_text,target_text
0,AskWiki NLG:,Ajoblanco | country | Spain && Ajoblanco | mai...,Ajoblanco is from Andalusia inSpain. The main ...
1,AskWiki NLG:,Ajoblanco | country | Spain && Ajoblanco | mai...,"Ajoblanco is from Andalusia, Spain. It contain..."
2,AskWiki NLG:,Ajoblanco | country | Spain && Ajoblanco | mai...,Ajoblanco is from the Andalusia region of Spai...
3,AskWiki NLG:,Ajoblanco | country | Spain && Ajoblanco | mai...,Ajoblanco originates from the Andalusia region...
4,AskWiki NLG:,Ajoblanco | country | Spain && Ajoblanco | mai...,Ajoblanco is a Spanish dish found in Andalusia...
...,...,...,...
35196,AskWiki NLG:,Walter_Baade | doctoralStudent | Allan_Sandage,Allan Sandage was a doctoral student of Walter...
35197,AskWiki NLG:,Walter_Baade | doctoralStudent | Halton_Arp,Halton Arp was a doctoral student of Walter Ba...
35198,AskWiki NLG:,Walter_Baade | doctoralStudent | Halton_Arp,Walter Baade's doctoral student was Halton Arp.
35199,AskWiki NLG:,Walter_Baade | nationality | Germany,Walter Baade is a German national.


In [None]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(train_df, test_size=0.3)

In [None]:
# Reduce the training set further (1% sample) to keep the OpenAI fine-tuning run small
train_df=train_df.sample(frac=0.01)

In [None]:
train_df.count()

prefix         246
input_text     246
target_text    246
dtype: int64
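
The row counts reported in this notebook (35203 total, 10561 test, 246 sampled training rows, and 106 test rows after the same 1% sampling is applied to `test_df` further down) can be reproduced with simple arithmetic. The sketch below assumes, as far as I understand the libraries, that `train_test_split` rounds the test share up and assigns the remainder to train, and that `DataFrame.sample(frac=...)` keeps `round(frac * n)` rows; the variable names are illustrative only.

```python
import math

total_rows = 35203   # rows in webNLG2020_train.csv
test_size = 0.3      # share passed to train_test_split

# Assumed behavior: the test share is rounded up, train gets the rest
n_test = math.ceil(test_size * total_rows)
n_train = total_rows - n_test

# Assumed behavior: sample(frac=f) keeps round(f * number_of_rows) rows
frac = 0.01
n_train_sampled = round(frac * n_train)
n_test_sampled = round(frac * n_test)

print(n_train, n_test, n_train_sampled, n_test_sampled)
```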

## Make JSONL file for training

In [None]:
import json
def make_finetune_data(df, filename=None):
    training_json = []
    for index, row in df.iterrows():
        d = {}
        triples = row['input_text']
        d['prompt'] = f"{triples} ->"
        d['completion'] = f" {row['target_text']} \n"
        training_json.append(json.dumps(d))
    if filename is None:
        return training_json
    with open(filename, 'w') as f:
        for l in training_json:
            f.write(l + '\n')  
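
Each line written by `make_finetune_data` is a JSON object whose prompt ends with the ` ->` separator and whose completion starts with a space and ends with ` \n`. The following is a minimal sketch of a sanity check for that convention; `check_record` and the two suffix constants are hypothetical names introduced here, not part of the OpenAI tooling.

```python
import json

PROMPT_SUFFIX = " ->"       # separator appended to every prompt
COMPLETION_SUFFIX = " \n"   # end marker appended to every completion

def check_record(line: str) -> bool:
    """Return True if one JSONL line follows the prompt/completion convention."""
    record = json.loads(line)
    return (
        set(record) == {"prompt", "completion"}
        and record["prompt"].endswith(PROMPT_SUFFIX)
        and record["completion"].startswith(" ")
        and record["completion"].endswith(COMPLETION_SUFFIX)
    )

sample_line = json.dumps({
    "prompt": "Walter_Baade | nationality | Germany ->",
    "completion": " Walter Baade is a German national. \n",
})
is_valid = check_record(sample_line)
print(is_valid)
```

Running a check like this over every line before uploading can catch a malformed row early, before the CLI's own `prepare_data` step.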

In [None]:
training_json = make_finetune_data(train_df)

In [None]:
training_json[0:5]

['{"prompt": "Alpena_County_Regional_Airport | cityServed | Alpena,_Michigan ->", "completion": " Alpena County Regional Airport city served Alpena, Michigan. \\n"}',
 '{"prompt": "A.S._Roma | numberOfMembers | 70634 && A.S._Roma | ground | Rome ->", "completion": " A.S. Roma has 70634 members and have a ground in Rome. \\n"}',
 '{"prompt": "School of Business and Social Sciences at the Aarhus University | academicStaffSize | 737 && School of Business and Social Sciences at the Aarhus University | established | 1928 ->", "completion": " The School of Business and Social Sciences at Aarhus University was created in 1928 and has an academic staff size of 737 people. \\n"}',
 '{"prompt": "Abner_W._Sibal | office | \\"Member of the Connecticut Senate from the 26th District\\" && Abner_W._Sibal | party | Republican_Party_(United_States) && Abner_W._Sibal | successor | Donald_J._Irwin && Abner_W._Sibal | deathPlace | Alexandria,_Virginia && Abner_W._Sibal | birthPlace | Ridgewood,_Queens ->"

In [None]:
import random
import string
def random_train_file_name(N=5):
    random_s = ''.join(random.choices(string.ascii_uppercase + string.digits, k=N))    
    return f'openai_train_{random_s}.jsonl'

In [None]:
train_file_name = random_train_file_name()
make_finetune_data(train_df, filename=train_file_name)

In [None]:
test_df.count()

prefix         10561
input_text     10561
target_text    10561
dtype: int64

In [None]:
test_df=test_df.sample(frac=0.01)

In [None]:
test_df.count()

prefix         106
input_text     106
target_text    106
dtype: int64

In [None]:
train_df.count()

prefix         246
input_text     246
target_text    246
dtype: int64

In [None]:
test_file_name = random_train_file_name()
make_finetune_data(test_df, filename=test_file_name)

In [None]:
!openai tools fine_tunes.prepare_data -f {train_file_name} < /dev/null

Analyzing...

- Your file contains 246 prompt-completion pairs
- All prompts end with suffix ` ->`
- All completions end with suffix `. \n`

No remediations found.

You can use your file for fine-tuning:
> openai api fine_tunes.create -t "openai_train_DN8Q6.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[". \n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 5.82 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.


In [None]:
!openai tools fine_tunes.prepare_data -f {test_file_name} < /dev/null

Analyzing...

- Your file contains 106 prompt-completion pairs
- All prompts end with suffix ` ->`
- All completions end with suffix `. \n`

No remediations found.

You can use your file for fine-tuning:
> openai api fine_tunes.create -t "openai_train_QPZY6.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[". \n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 3.9 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.


# davinci

In [None]:
base_model = 'davinci'

In [None]:
# Train and validate.
# Remember that we use only a fraction of the data: 246 rows for training and 106 for testing (passed as the validation file).
!openai api fine_tunes.create -t {train_file_name} -v {test_file_name} -m {base_model} < /dev/null

Upload progress: 100% 73.0k/73.0k [00:00<00:00, 75.3Mit/s]
Uploaded file from openai_train_DN8Q6.jsonl: file-qrJTOUztvQHBghjNcaGxtTAG
Upload progress: 100% 30.6k/30.6k [00:00<00:00, 52.2Mit/s]
Uploaded file from openai_train_QPZY6.jsonl: file-fXljyFXwFjPIeqmE1BvGbuD0
Created fine-tune: ft-ahoTiyk1FATwjTQyJvuChCdZ
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-04-10 05:23:18] Created fine-tune: ft-ahoTiyk1FATwjTQyJvuChCdZ
[2023-04-10 05:23:27] Fine-tune costs $2.25
[2023-04-10 05:23:27] Fine-tune enqueued. Queue number: 0



In [None]:
last_run, fine_tuned_model = get_last_run_id()
last_run, fine_tuned_model

('ft-ahoTiyk1FATwjTQyJvuChCdZ', None)

In [None]:
!openai api fine_tunes.follow -i {last_run} 

[2023-04-10 05:23:18] Created fine-tune: ft-ahoTiyk1FATwjTQyJvuChCdZ
[2023-04-10 05:23:27] Fine-tune costs $2.25
[2023-04-10 05:23:27] Fine-tune enqueued. Queue number: 0
[2023-04-10 05:33:33] Fine-tune started
[2023-04-10 05:36:41] Completed epoch 1/4
[2023-04-10 05:38:01] Completed epoch 2/4
[2023-04-10 05:39:21] Completed epoch 3/4
[2023-04-10 05:40:42] Completed epoch 4/4
[2023-04-10 05:41:23] Uploaded model: davinci:ft-askwiki-2023-04-10-05-41-22
[2023-04-10 05:41:24] Uploaded result file: file-0e5hoBQdRS1auVkmMq6emjxA
[2023-04-10 05:41:24] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m davinci:ft-askwiki-2023-04-10-05-41-22 -p <YOUR_PROMPT>


In [None]:
#ask wiki org 
os.environ['OPENAI_ORGANIZATION'] ='org-6Tm9wvTU2DAyCUVamArcvPxV'

In [None]:
#get the model 
askwiki_davinci_fine_tune = 'davinci:ft-askwiki-2023-04-10-05-41-22'

In [None]:
def generate_response(triples, model=askwiki_davinci_fine_tune, stop=None):
    """Verbalize linearized triples with the fine-tuned model; return None if nothing is generated."""
    # The ' ->' separator is appended here, so callers pass the triples without it
    response = run_prompt(f"{triples} ->", model=model, stop=stop)
    translation = response['choices'][0]['text']
    if translation is None or len(translation) == 0:
        return None
    return translation

In [None]:
# generate_response appends the ' ->' separator itself, so the input omits it
generate_response('hepatitis A| BNCF Thesaurus |42006 && hepatitis A| MedlinePlus ID |000278 && hepatitis A| DiseasesDB |5757 && hepatitis A| ICD-9 |070.0 && hepatitis A| ICD-10 |B15 && hepatitis A| eMedicine |177484 && hepatitis A| Freebase ID |/m/01yjzm && hepatitis A| Patientplus ID |hepatitis-a-pro',  model=askwiki_davinci_fine_tune, stop=[" \n"])


' Hepatitis A is known by several names including 01yjzm in Freebase, m. It is also known by the names 01yjzm in Freebase, m and 01yjzm. DiseasesDB lists the disease as 5757 and it has the ICD-9 code of 070.0. The ICD-10 code for hepatitis A is B15'

In [None]:
# generate_response appends the ' ->' separator itself, so the input omits it
input = 'eye color | brown | Commons category | Brown eyes && eye color | brown | sRGB color hex triplet | 800000 && eye color | blue | Commons category | Blue eyes && eye color | blue | sRGB color hex triplet | 608090 && eye colour | green | Commons category | Green eyes && eye colour | green | sRGB color hex triplet | 707020'
 

In [None]:
generate_response(input,  model=askwiki_davinci_fine_tune, stop=[" \n"])

' Green eyes are represented in the sRGB colour space using the hex triplet 707020. Brown eyes are represented using the sRGB colour hex triplet 800000. Blue eyes are represented using the sRGB colour hex triplet 608090.'

# Model Inference and Testing

In [None]:
import os
import openai
openai.organization = "org-6Tm9wvTU2DAyCUVamArcvPxV"
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.Model.list()

<OpenAIObject list at 0x7fce6f061810> JSON: {
  "data": [
    {
      "created": 1649358449,
      "id": "babbage",
      "object": "model",
      "owned_by": "openai",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": true,
          "allow_sampling": true,
          "allow_search_indices": false,
          "allow_view": true,
          "created": 1669085501,
          "group": null,
          "id": "modelperm-49FUp5v084tBB49tC4z8LPH5",
          "is_blocking": false,
          "object": "model_permission",
          "organization": "*"
        }
      ],
      "root": "babbage"
    },
    {
      "created": 1649359874,
      "id": "davinci",
      "object": "model",
      "owned_by": "openai",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": true,
          "allow_sa

In [None]:
askwiki_davinci_fine_tune = 'davinci:ft-askwiki-2023-04-10-05-41-22'

In [None]:
test_df.count()

prefix         106
input_text     106
target_text    106
dtype: int64

In [None]:
# test_df was already reduced to a 1% sample (106 rows) above, so it is not sampled again here

In [None]:
testing_json = make_finetune_data(test_df)

In [None]:
test_file_name = random_train_file_name()
make_finetune_data(test_df, filename=test_file_name)

In [None]:
test_df

Unnamed: 0,prefix,input_text,target_text
17976,AskWiki NLG:,Agnes_Kant | birthPlace | Hessisch_Oldendorf &...,Agnes Kant was born in Hessisch Oldendorf and ...
15722,AskWiki NLG:,20_Fenchurch_Street | location | United_Kingdo...,The pound sterling is the currency of the Unit...
33920,AskWiki NLG:,Andra_(singer) | associatedBand/associatedMusi...,The singer Andra is associated with the band C...
26095,AskWiki NLG:,Afonso_Pena_International_Airport | runwayLeng...,Afonso Pena International Airport is located i...
19985,AskWiki NLG:,United_States | language | English_language &&...,"English is spoken in the United States, which ..."
...,...,...,...
26035,AskWiki NLG:,Adolfo_Suárez_Madrid–Barajas_Airport | locatio...,Adolfo Suárez Madrid–Barajas Airport is in Alc...
34680,AskWiki NLG:,"Lake_Placid,_New_York | isPartOf | New_York","Lake Placid, New York is part of New York."
14750,AskWiki NLG:,Akeem_Adams | club | Ferencvárosi_TC && Akeem_...,Akeem Adams plays for the Trinidad and Tobago ...
26192,AskWiki NLG:,Al_Asad_Airbase | operatingOrganisation | Unit...,The United States Air Force is the operating o...


In [None]:
!cat {test_file_name}

{"prompt": "Agnes_Kant | birthPlace | Hessisch_Oldendorf && Agnes_Kant | office | House_of_Representatives_(Netherlands) ->", "completion": " Agnes Kant was born in Hessisch Oldendorf and worked at the House of Representatives in the Netherlands. \n"}
{"prompt": "20_Fenchurch_Street | location | United_Kingdom && United_Kingdom | currency | Pound_sterling ->", "completion": " The pound sterling is the currency of the United Kingdom, which is also the location of 20 Fenchurch Street. \n"}
{"prompt": "Andra_(singer) | associatedBand/associatedMusicalArtist | CRBL ->", "completion": " The singer Andra is associated with the band CRBL. \n"}
{"prompt": "Afonso_Pena_International_Airport | runwayLength | 2215.0 && Afonso_Pena_International_Airport | elevationAboveTheSeaLevel | 911.0 && Afonso_Pena_International_Airport | location | S\u00e3o_Jos\u00e9_dos_Pinhais && Afonso_Pena_International_Airport | runwayName | \"15/33\" ->", "completion": " Afonso Pena International Airport is located in 

In [None]:
test_df['candidate_text']=test_df['input_text'].apply(lambda x: generate_response(x,model=askwiki_davinci_fine_tune, stop=[" \n"]))

In [None]:
test_df

Unnamed: 0,prefix,input_text,target_text,candidate_text,reading_score,words
17976,AskWiki NLG:,Agnes_Kant | birthPlace | Hessisch_Oldendorf &...,Agnes Kant was born in Hessisch Oldendorf and ...,"Agnes Kant, who was born in Hessisch Oldendor...",55.24,16
15722,AskWiki NLG:,20_Fenchurch_Street | location | United_Kingdo...,The pound sterling is the currency of the Unit...,20 Fenchurch Street is located in the United ...,72.16,16
33920,AskWiki NLG:,Andra_(singer) | associatedBand/associatedMusi...,The singer Andra is associated with the band C...,Andra (singer) is associated with CRBL.,31.55,6
26095,AskWiki NLG:,Afonso_Pena_International_Airport | runwayLeng...,Afonso Pena International Airport is located i...,"Afonso Pena International Airport, located in...",79.26,36
19985,AskWiki NLG:,United_States | language | English_language &&...,"English is spoken in the United States, which ...","Albuquerque, in the United States, is located...",50.84,12
...,...,...,...,...,...,...
26035,AskWiki NLG:,Adolfo_Suárez_Madrid–Barajas_Airport | locatio...,Adolfo Suárez Madrid–Barajas Airport is in Alc...,Adolfo Suarez Madrid-Barajas airport is locat...,47.45,14
34680,AskWiki NLG:,"Lake_Placid,_New_York | isPartOf | New_York","Lake Placid, New York is part of New York.",Lake Placid is in New York.,116.15,6
14750,AskWiki NLG:,Akeem_Adams | club | Ferencvárosi_TC && Akeem_...,Akeem Adams plays for the Trinidad and Tobago ...,Akeem Adams played for the Ferencvárosi TC cl...,67.08,21
26192,AskWiki NLG:,Al_Asad_Airbase | operatingOrganisation | Unit...,The United States Air Force is the operating o...,Al Asad Airbase is operated by the United Sta...,46.10,25


In [None]:
!pip install textstat

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting textstat
  Downloading textstat-0.7.3-py3-none-any.whl (105 kB)
Collecting pyphen
  Downloading pyphen-0.14.0-py3-none-any.whl (2.0 MB)
Installing collected packages: pyphen, textstat
Successfully installed pyphen-0.14.0 textstat-0.7.3


In [None]:
import textstat

In [None]:
# Flesch reading-ease score of each generated text (higher means easier to read)
test_df['reading_score'] = test_df.apply(lambda row: textstat.flesch_reading_ease(row['candidate_text']), axis=1)
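
For intuition about what the reading score measures, the sketch below applies the standard Flesch reading-ease formula, 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The syllable counter is a crude vowel-group heuristic assumed here for illustration, so its scores will generally differ from those of `textstat`; all function names are hypothetical.

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_formula(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch reading-ease formula on precomputed counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def approximate_score(text: str) -> float:
    """Approximate score for non-empty English text using the heuristic counter."""
    tokens = re.findall(r"[A-Za-z']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(t) for t in tokens)
    return flesch_formula(len(tokens), sentences, syllables)

score = approximate_score("Lake Placid is in New York.")
print(round(score, 2))
```

Short sentences built from short words push the score up, which is consistent with the brief single-sentence outputs in the table scoring high.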


In [None]:
import numpy as np

In [None]:
np.mean(test_df['reading_score'])

68.1072641509434

In [None]:
test_df['words']=test_df['candidate_text'].apply(lambda x: len(x.split()))

In [None]:
np.mean(test_df['words'])

16.858490566037737