<a href="https://colab.research.google.com/github/nealcaren/BIGSSS-LLM/blob/main/LLM_3_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Part III. Paying for it

This notebook presents a second way to use LLM models to code data focusing on OpenAI's ChatGPT model. This gives you access to a much more powerful model. No need to use your own GPU.

In [1]:
!pip install openai
!pip install tenacity

Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/73.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━[0m [32m71.7/73.6 kB[0m [31m5.0 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m73.6/73.6 kB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: openai
Successfully installed openai-0.27.8


After installing the module, you can need to authorize yourself with an OpenAI API token. First, apply as a [developer](https://platform.openai.com/signup), then you can your [tokens](https://platform.openai.com/account/api-keys). Since the process takes a few days, we can try *all* doing this on my account.

In [2]:
import openai
import pandas as pd
from tenacity import retry, wait_random_exponential, stop_after_attempt
from sklearn.metrics import accuracy_score

from google.colab import data_table
data_table.enable_dataframe_formatter()


api_key ='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'
openai.api_key = api_key

A sample review:

In [3]:
review = '''"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."'''

ChatGPT is famously a *chat* model. We aren't going to use the chat history aspect, but will take advantage of the fact that it is designed to answer your questions, rather than continue your text.

Users provide two text entry fields: `system` which gives the AI isn't larger role and `user` which is the more immediate response. The difference bewteen the two is somewhat vague and changes over time.

Note the `user` command includes the text of the review.

In [4]:
system = "You are a helpful assistant who evaluates the sentiment of movies."
user =   f"Is this a positive or negative  movie review? Only say 'positive' or 'negative'\n{review}"

In [5]:
response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        timeout = 20,
        messages=[
            {"role": "system", "content": system},
             {"role": "user", "content": user},
            ]
        )

Current pricing: \$0.0015 per 1,000 input tokens and \$0.002 per 1,000 output tokens. So the estimated cost would be....

In [6]:
input_tokens = response['usage']['prompt_tokens']
output_tokens = response['usage']['completion_tokens']

gpt3_cost = (input_tokens/1000)*0.0015 + (output_tokens/1000)*0.002
print(gpt3_cost)


0.0003275


That's why I'm letting you use my API tokens. GPT4 is the better  and more expensive model.

In [7]:

gpt4_cost = (input_tokens/1000)*0.03 + (output_tokens/1000)*0.06
print(gpt4_cost)

0.0065699999999999995


Still cheap, but relatively much, much more expensive:

In [8]:
print(gpt4_cost/gpt3_cost)

20.061068702290076


In [9]:
df = pd.read_csv('https://www.dropbox.com/s/72n7a8u47mxntp9/IMDB%20Dataset.csv?raw=1')

In [10]:
dfs20 = df.sample(20)

In [11]:
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(6))
def analyze_sentiment(review):
  system = "You are a helpful assistant who evaluates the sentiment of movies."
  user =   f"Is this a positive or negative  movie review? Respond only 'positive' or 'negative'\n{review}"

  response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        timeout = 20,
        messages=[
                {"role": "system", "content": system},
                 {"role": "user", "content": user},])

  sentiment = response['choices'][0]['message']['content']


  return sentiment

In [12]:
dfs20['sentiment_pred'] = dfs20['review'].apply(analyze_sentiment)
dfs20

Unnamed: 0,review,sentiment,sentiment_pred
49326,After learning that her sister Susan is contem...,positive,positive
12616,This person is a so-called entertainer who has...,negative,negative
2602,"Meatballs works its way into conversations, li...",positive,positive
21337,I realize the line on my summary is not too po...,negative,negative
1294,This movie was a masterpiece of human emotions...,positive,positive
9902,Here's how you do it: Believe in God and repen...,negative,negative
23093,I am a relative latecomer to the transcendent ...,negative,negative
7115,"It's New Year's eve, a cop-killer (in the form...",negative,negative
38676,"Sam O'Steen, the film editor on the superlativ...",negative,negative
13348,The United States of Leland was an amazing mov...,positive,positive


In [13]:
pd.crosstab(dfs20['sentiment'], dfs20['sentiment_pred'])

sentiment_pred,Negative,negative,positive
sentiment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
negative,0,8,0
positive,1,0,11


Now let's ask some much harder questions. For example, let's say we had the abstract of a sociology presentation:

> Despite the existence of a large body of research examining HIV-related stigma in the general population and among vulnerable groups in sub-Saharan Africa, there has been a paucity of consistent scholarship focusing on the behavioral consequences of holding stigmatizing attitudes towards people living with HIV. In particular, while a small but growing body of research suggests that holding stigmatizing beliefs is associated with high-risk sexual behaviors, it is unknown whether this relationship has changed over the past decade, a period in which Africans have experienced improvements in HIV outcomes. To fill this gap, we use data from Demographic and Health Surveys to test two social psychological hypotheses for changes in the association between HIV stigma beliefs and high-risk sexual behaviors in two West African countries: Ghana and Nigeria.

We might want to know:
* Is it qualitative or quantitative?
* What theories do they use?
* Do they test a hypothesis?

First, let's try the "google/flan-t5-base" model. You *don't* have to run these cells.

In [14]:
!pip install transformers

import torch
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

text_generator = pipeline("text2text-generation",
                          model="google/flan-t5-base",
                          device = device,
                           max_length=200)

Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.2/7.2 MB[0m [31m53.7 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m236.8/236.8 kB[0m [31m20.0 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m89.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m60.0 MB/s[0m eta [36m0:00:0

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

In [15]:
a = '''Despite the existence of a large body of research examining HIV-related stigma in the general population and among vulnerable groups in sub-Saharan Africa, there has been a paucity of consistent scholarship focusing on the behavioral consequences of holding stigmatizing attitudes towards people living with HIV. In particular, while a small but growing body of research suggests that holding stigmatizing beliefs is associated with high-risk sexual behaviors, it is unknown whether this relationship has changed over the past decade, a period in which Africans have experienced improvements in HIV outcomes. To fill this gap, we use data from Demographic and Health Surveys to test two social psychological hypotheses for changes in the association between HIV stigma beliefs and high-risk sexual behaviors in two West African countries: Ghana and Nigeria.'''
prompt = '''Study: {a} \n Is this study using quantitative or qualitative methods?'''
text_generator(prompt)

[{'generated_text': 'quantitative'}]

In [16]:
prompt = '''Study: {a} \n Does this study test a hypoethesis?'''
text_generator(prompt)

[{'generated_text': 'no'}]

In [17]:
prompt = '''Study: {a} \n In this study, how is the data collected?'''
text_generator(prompt)

[{'generated_text': 'a computer'}]

In [18]:
prompt = '''Study: {a} \n In this study, what theories are used?'''
text_generator(prompt)

[{'generated_text': 'a realism'}]

Now let's try ChatGPT:

In [19]:
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(6))
def analyze_abstract(text):
  system = "You are a helpful assistant."
  user =   f'''Answer the following questions based on this abstract: '

   Abstract:  {text}

   Questions:
   * How is the data collected?
   * What theories are used?
   * Does this study test a hypoethesis?
   * Is this study using quantitative or qualitative methods?

   '''

  response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        timeout = 20,
        messages=[
                {"role": "system", "content": system},
                 {"role": "user", "content": user},])

  summary = response['choices'][0]['message']['content']


  return summary

In [20]:
print(analyze_abstract(a))

Based on the information in the abstract:

1. The data is collected from Demographic and Health Surveys.
2. The study uses social psychological theories.
3. Yes, this study tests two social psychological hypotheses.
4. This study is most likely using quantitative methods, as it mentions using data from surveys.


In [21]:
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(6))
def analyze_abstract(text):
  system = "You are a helpful assistant who adds data to a database."
  json_sample = '''{'collection_method':'enthography', 'theories':'critical race', 'hypthosis_present': 'Yes', 'method':'qualitative'}'''
  user =   f'''Answer the following questions based on this abstract: '

   Abstract:  {text}

   Questions:
   * What data is used?
   * What theories are used?
   * Does this study test a hypoethesis?
   * Is this study using quantitative or qualitative methods?

   If the information is not in the abstract, say "Unclear". Response in valid JSON format, such
   as: {json_sample}

   '''

  response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        timeout = 20,
        messages=[
                {"role": "system", "content": system},
                 {"role": "user", "content": user},])

  summary = response['choices'][0]['message']['content']


  return summary

In [22]:
print(analyze_abstract(a))

{
  "data_used": "Demographic and Health Surveys",
  "theories_used": "Unclear",
  "hypothesis_tested": "Yes",
  "method": "Quantitative"
}


In [23]:
abstract_df = pd.read_json('https://www.dropbox.com/scl/fi/guarm4rfn9ze58oss5vuy/asa_2023_abtracts.json?dl=1&rlkey=3qtjacx7slc3vy4jzwzx0qorl')

In [24]:
abstract_df.sample(3)

Unnamed: 0,title,abstract,date,authors,event_x,url,session_id,roundtable,id,og:site_name,og:title,og:url,og:type,type,Organizer,sub2,event_y,og:description,section
663,A Totemic Issue? The Function of Abortion in E...,The Supreme Court's decision to overturn abort...,"Sat, August 19, 12:00 to 1:00pm, Pennsylvania ...","Joseph Charles Roso, Duke University",Sociology of Religion Roundtables,https://tinyurl.com/2y47og2s,2077952,2 - Sociology of Religion Roundtables - Religi...,abstracts/2066691,American Sociological Association,2 - Sociology of Religion Roundtables - Religi...,https://tinyurl.com/23a8edoa,website,Refereed Roundtable (60 min),Section on Sociology of Religion,Sociology of Religion Roundtables,Sociology of Religion Roundtables,,True
1481,Media Coverage and Partnerships: how discourse...,"During the past few decades, many wealthy coun...","Sat, August 19, 10:00 to 11:30am, Pennsylvania...",Patrizio Lodetti,"Marriage, En Vogue? Complicating the Discussio...",https://tinyurl.com/248hzge5,2075201,,abstracts/2066912,American Sociological Association,"Marriage, En Vogue? Complicating the Discussio...",https://tinyurl.com/26vchahl,website,Paper Session (90 min),"Marriage, Civil Unions, and Cohabitation",,,Is marriage so 90s? Is it as relevant in 2023?...,False
3039,Data Values: Digital Surveillance and the New ...,Companies and academic researchers have starte...,"Sat, August 19, 12:00 to 1:30pm, Marriott Phil...","Mira Vale, University of Michigan",Technology and the Body,https://tinyurl.com/2yopdkm9,2041685,,abstracts/2065872,American Sociological Association,Technology and the Body,https://tinyurl.com/23kgj4h3,website,Paper Session (90 min),"Section on Science, Knowledge, and Technology","Open Session on Issues Related to Science, Kno...",,Papers in this session consider the intersecti...,True


In [25]:
some_abstracts = abstract_df.sample(5)['abstract'].values

In [27]:
import json

@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(6))
def analyze_abstract(text):
  system = "You are a helpful assistant who adds data to a database."
  json_sample = '''{"data":"2020 Census", "theory":"critical race", "hypothesis": "Yes", "method":"qualitative"}'''
  user =   f'''Answer the following questions based on this abstract: '

   Abstract:  {text}

   Questions:
   * What data is used?
   * What theories are used?
   * Does this study test a hypoethesis?
   * Is this study using quantitative or qualitative methods?

   If the information is not in the abstract, say "Unclear". Response in valid JSON format, such
   as: {json_sample}

   '''
  response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        timeout = 20,
        messages=[
                {"role": "system", "content": system},
                 {"role": "user", "content": user},])

  summary = response['choices'][0]['message']['content']
  print(summary)
  json_summary = json.loads(summary) #!!!!!

  return json_summary

analyze_abstract(a)

{"data": "Demographic and Health Surveys", "theory": "social psychological", "hypothesis": "Yes", "method": "quantitative"}


{'data': 'Demographic and Health Surveys',
 'theory': 'social psychological',
 'hypothesis': 'Yes',
 'method': 'quantitative'}

We can analyze all the abstracts by applying it to each abstract using list comprehension.

In [28]:
abstracts_analyzed = [analyze_abstract(a) for a in some_abstracts]

{"data":"nationally representative county-level data from the U.S Census Bureau and Social Security Administration", "theory":"Unclear", "hypothesis": "Yes", "method":"mixed-method (in-depth interviews and analysis of county-level data)"}
{"data": "survey on creative workers", "theory": "occupation as distinct and exclusive communities of workers; polyoccupationalism", "hypothesis": "Yes", "method": "quantitative"}
{"data":"Original fieldwork with dual national Colombian-Venezuelan individuals who were arbitrarily deprived of their nationality in Colombia", "theory": "Unclear", "hypothesis": "Unclear", "method": "qualitative"}
{"data": "Panama Papers and other leaks", "theory": "social capital", "hypothesis": "Unclear", "method": "quantitative"}
{"data":"Survey results collected from affected residents", "theory":"Psychology and sociology", "hypothesis": "Yes", "method":"quantitative"}


We can put the results in a dataframe.

In [29]:
adf = pd.DataFrame(abstracts_analyzed)
adf['abstract'] = some_abstracts
adf

Unnamed: 0,data,theory,hypothesis,method,abstract
0,nationally representative county-level data fr...,Unclear,Yes,mixed-method (in-depth interviews and analysis...,While disability programs are primarily motiva...
1,survey on creative workers,occupation as distinct and exclusive communiti...,Yes,quantitative,Past research has posited that occupations are...
2,Original fieldwork with dual national Colombia...,Unclear,Unclear,qualitative,"Between 2021 and 2022, 40,000 mostly dual nati..."
3,Panama Papers and other leaks,social capital,Unclear,quantitative,Social capital refers to community connectedne...
4,Survey results collected from affected residents,Psychology and sociology,Yes,quantitative,There is growing concern that climate change a...


The good news is that it was able to mostly understand all the questions, it felt free to use "unclear" when it couldn't answer the question. It also returned a valid JSON object, which isn't always the case.

With some prompt engineering and variable renaming, this could work.

**Your turn** Rewrite the function so that you can  consistently extract two intresting pieces of information from an article abstract. You'll need to update both the `json_sample` and the `user` prompt.

In [30]:
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(6))
def analyze_abstract(text):
  system = "You are a helpful assistant who adds data to a database."
  json_sample = '''{"data":"2020 Census", "theory":"critical race", "hypothesis": "Yes", "method":"qualitative"}'''
  user =   f'''Answer the following questions based on this abstract: '

   Abstract:  {text}

   Questions:
   * What data is used?
   * What theories are used?
   * Does this study test a hypoethesis?
   * Is this study using quantitative or qualitative methods?

   If the information is not in the abstract, say "Unclear". Response in valid JSON format, such
   as: {json_sample}

   '''
  response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        timeout = 20,
        messages=[
                {"role": "system", "content": system},
                 {"role": "user", "content": user},])

  summary = response['choices'][0]['message']['content']
  print(summary)
  json_summary = json.loads(summary) #!!!!!

  return json_summary

analyze_abstract(a)

{"data": "Demographic and Health Surveys", "theory": "social psychological", "hypothesis": "Yes", "method": "quantitative"}


{'data': 'Demographic and Health Surveys',
 'theory': 'social psychological',
 'hypothesis': 'Yes',
 'method': 'quantitative'}

## But wait, there's more!

Here's an article from a newsaper:

> Advocates rallied for LGBTQ rights at the Kentucky Capitol on Wednesday, calling for lawmakers to stop advancing bills that would ban trans girls from participating in girls sports. Supporters also renewed their call for lawmakers to pass a statewide fairness ordinance—banning discrimination based on sexual orientation—and to approve a bill that would prohibit conversion therapy. Keturah Herron—a Democratic candidate for a legislative seat in Louisville who identifies as queer, masculine-presenting woman—said she experienced intolerance at the legislature during her time as a lobbyist for the American Civil Liberties Union. She said at one point she was told by a woman to use a different bathroom. “I understand I am supposed to be in this space. However, I understand that my mere presence makes some people uncomfortable. My advice to them: Get used to it,” Herron said. Herron is running in a special election on Feb. 22 for House District 42 against Republican Judy Martin Stallard. Over the last week, Republicans have advanced bills in both the House and Senate that would ban transgender girls from participating in girls sports, even though there haven’t been any complaints in Kentucky about such rare cases. As the Senate prepared to debate one of those bills on Wednesday, chants from the rally could be heard in the chamber, forcing lawmakers to keep the doors closed. Donzella Lee, a pastor and executive director of the Franklin-Simpson Human Rights Commission, encouraged rallygoers to make noise. “Today the Senate is planning to vote on one of these bills, and I want you to tell them as loud and proud as you can: Let kids play,” Lee shouted to the crowd. The annual event featured a long list of Democratic speakers, including Gov. Andy Beshear, who was the first governor to ever speak at the rally in 2020. “We cannot possibly reach our fullest potential unless every single one of our families, every single one of our people feels supported and valued to be themselves,” Beshear said. U.S. Senate candidate Charles Booker said Republican lawmakers were pushing for anti-trans legislation while ignoring other priorities. “They’d rather focus on banning trans youth from playing in sports than making sure you have health care. They would rather ban trans youth from playing in sports than making sure you have clean water,” Booker said. Rep. Attica Scott, a Democrat from Louisville, said the fight for LGBTQ rights goes hand-in-hand with the fight against racism. “We can’t all get wherever there is unless we get there together,” Scott said. Shortly after the rally, the Senate passed Senate Bill 83, banning trans girls from playing girls sports in middle school and high school, with a vote of 27-8. The House will now consider the measure.

I'm interested in extracting details about the protest. To do this, I'm going:
* use ChatGPT's relatively new "functions" option to force the output to be a pre-defined JSON;
* send the data directly from a Python dictionary;
* ask for two responses, which is one way to measure model uncertainity.

In [31]:
meta= {'title': 'LGBTQ advocates rally for fairness as Ky. lawmakers advance anti-trans bill – 89.3 WFPL News Louisville',
        'url': 'https://wfpl.org/lgbtq-advocates-rally-for-fairness-as-ky-lawmakers-advance-anti-trans-bill/',
        'text' : ''''Advocates rallied for LGBTQ rights at the Kentucky Capitol on Wednesday, calling for lawmakers to stop advancing bills that would ban trans girls from participating in girls sports.\n\nSupporters also renewed their call for lawmakers to pass a statewide fairness ordinance—banning discrimination based on sexual orientation—and to approve a bill that would prohibit conversion therapy.\n\nKeturah Herron—a Democratic candidate for a legislative seat in Louisville who identifies as queer, masculine-presenting woman—said she experienced intolerance at the legislature during her time as a lobbyist for the American Civil Liberties Union.\n\nShe said at one point she was told by a woman to use a different bathroom.\n\n“I understand I am supposed to be in this space. However, I understand that my mere presence makes some people uncomfortable. My advice to them: Get used to it,” Herron said.\n\nHerron is running in a special election on Feb. 22 for House District 42 against Republican Judy Martin Stallard.\n\nOver the last week, Republicans have advanced bills in both the House and Senate that would ban transgender girls from participating in girls sports, even though there haven’t been any complaints in Kentucky about such rare cases.\n\nAs the Senate prepared to debate one of those bills on Wednesday, chants from the rally could be heard in the chamber, forcing lawmakers to keep the doors closed.\n\nDonzella Lee, a pastor and executive director of the Franklin-Simpson Human Rights Commission, encouraged rallygoers to make noise.\n\n“Today the Senate is planning to vote on one of these bills, and I want you to tell them as loud and proud as you can: Let kids play,” Lee shouted to the crowd.\n\nThe annual event featured a long list of Democratic speakers, including Gov. Andy Beshear, who was the first governor to ever speak at the rally in 2020.\n\n“We cannot possibly reach our fullest potential unless every single one of our families, every single one of our people feels supported and valued to be themselves,” Beshear said.\n\nU.S. Senate candidate Charles Booker said Republican lawmakers were pushing for anti-trans legislation while ignoring other priorities.\n\n“They’d rather focus on banning trans youth from playing in sports than making sure you have health care. They would rather ban trans youth from playing in sports than making sure you have clean water,” Booker said.\n\nRep. Attica Scott, a Democrat from Louisville, said the fight for LGBTQ rights goes hand-in-hand with the fight against racism.\n\n“We can’t all get wherever there is unless we get there together,” Scott said.\n\nShortly after the rally, the Senate passed Senate Bill 83, banning trans girls from playing girls sports in middle school and high school, with a vote of 27-8. The House will now consider the measure.''',
        'publication_date' : '2022-02-17'
        }


The next cell creates a dictionary with schem of the JSON we want. This is where *all* the action happens.

In [32]:
extract_data =    {
    "name": "extract_data",
    "description": "Add details related to a protest  to the database.",
    "parameters": {
        "type": "object",
            "properties": {
                "protest": {
                    "type": "boolean",
                    "description": f'''Does the article describe a political protest, rally or demonstration?''',
                },
                "issue": {
                    "type": "array",
                    "description": "What broad political issues was the protest about?",
                    "items": { "type": "string"}
                },
                "city": {
                    "type": "string",
                    "description": f'''In what city did the protest occur?''',
                },
                "state": {
                    "type": "string",
                    "description": f'''In what state did the protest occur? Use the two letter abbreviation, like NJ or CA.''',
                },
                "day_of_week": {
                    "type": "string",
                    "description": f'''What day of the week did the protest occur?''',
                },
                "date": {
                    "type": "string",
                    "description": f'''What date did the protest happen? Use YYYY-MM-DD format.''',
                },
                "size_words": {
                    "type": "array",
                    "description": f'''What exact words are used to describe the size of crowd protesting?''',
                    "items": { "type": "string"}
                },
                "size_number": {
                    "type": ["number", "null"],
                    "description": f'''Using a number, how many people were at the protest?''',
                },
            },
        "required": ['protest', 'issue', 'city', 'state', 'day_of_week','date', 'size_words', 'size_number'],
    },
}


The call to the API only gives general instructions. the `n` parameter sets the number of respones and `extract_data` is passed

In [34]:
messages = [
    {"role": "system", "content": "You are a helpful assistant that extracts summaries of newspaper articles as a JSON for a database."},
    {"role": "user", "content": f'Extract a summary of the article: {meta}' }
    ]

response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo-0613',
    functions=[extract_data],
    n=2,
    messages=messages)


Let's print out each of the results:

In [35]:
for r in response.choices:
   print(r['message']['function_call']['arguments'])


{
  "protest": true,
  "issue": ["LGBTQ rights", "anti-trans legislation"],
  "city": "Frankfort",
  "state": "KY",
  "day_of_week": "Wednesday",
  "date": "2022-02-16",
  "size_words": ["rallygoers", "crowd"],
  "size_number": null
}
{
  "protest": true,
  "issue": ["LGBTQ rights", "anti-trans bill"],
  "city": "Frankfort",
  "state": "KY",
  "day_of_week": "Wednesday",
  "date": "2022-02-16",
  "size_words": ["large", "crowd", "rallygoers"],
  "size_number": null
}


The most impressive part here, is that it managed to figure out that the protest happened on Wednesday, 2022-02-16, as the article only provided that the protest took place on a Wednesday and the article was publisehd on 2022-02-17. It uses extra information here to infer the date, presumably as it was trained on data that shoed that 2022-02-17 was a Thursday.

## Big Finish

* gpt-3.5 is more powerful than LLMs that you can run yourself.
* gpt-4 is even more powerful, especially for latent concepts.
* There are other options for modifying model parameters or even fine-tuning a model to your use case (if you have enough data).
* With all models, worry about confabulation/hallucination, especially as they are likely to be biased.
* You *need* training data to compare the accuracy of the GPT results.