# Retrieve Information from Specific Data Corpus

In [2]:
!pip install openai

Collecting openai
  Downloading openai-0.27.2-py3-none-any.whl (70 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-win_amd64.whl (323 kB)
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.3.3-cp39-cp39-win_amd64.whl (34 kB)
Collecting multidict<7.0,>=4.5
  Downloading multidict-6.0.4-cp39-cp39-win_amd64.whl (28 kB)
Collecting yarl<2.0,>=1.0
  Downloading yarl-1.8.2-cp39-cp39-win_amd64.whl (56 kB)
Collecting async-timeout<5.0,>=4.0.0a3
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting aiosignal>=1.1.2
  Downloading aiosignal-1.3.1-py3-none-any.whl (7.6 kB)
Installing collected packages: multidict, frozenlist, async-timeout, yarl, aiosignal, aiohttp, openai
Successfully installed aiohttp-3.8.4 aiosignal-1.3.1 asy

## Set up Azure OpenAI

In [4]:
import os
import openai
from dotenv import load_dotenv


In [5]:
# Set up Azure OpenAI
# load_dotenv()  # uncomment to load settings from a .env file
openai.api_type = "azure"
# api_base is the 'Endpoint' shown in the Azure Portal for the Azure OpenAI resource.
# It looks like https://xxxxxx.openai.azure.com/
openai.api_base = "https://azureopenaiday.openai.azure.com/"
openai.api_version = "2022-12-01"
openai.api_key = "Your key"
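
Hard-coding the key in a notebook makes it easy to leak. A minimal sketch of an alternative, assuming the endpoint and key are stored in environment variables; the variable names `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_KEY`, and `AZURE_OPENAI_API_VERSION`, and the helper `read_azure_settings`, are hypothetical and not part of the original notebook:

```python
import os


def read_azure_settings(env=None):
    """Collect Azure OpenAI settings from a mapping of environment variables.

    The variable names used here are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    required = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "api_type": "azure",
        "api_base": env["AZURE_OPENAI_ENDPOINT"],
        # Fall back to the API version used in this notebook
        "api_version": env.get("AZURE_OPENAI_API_VERSION", "2022-12-01"),
        "api_key": env["AZURE_OPENAI_KEY"],
    }
```

The returned values could then be assigned to `openai.api_type`, `openai.api_base`, `openai.api_version`, and `openai.api_key`. Calling `load_dotenv()` first would populate the environment from a `.env` file.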


## Deploy a Language Model

In [6]:
# list existing deployments and look for one that serves the desired model
deployment_id = None
result = openai.Deployment.list()
desired_model = 'text-davinci-003'

# check if desired model is already deployed
for deployment in result.data:
    if deployment["status"] != "succeeded":
        continue
    
    if deployment['model'] != desired_model:
        continue
    
    deployment_id = deployment["id"]
    print(deployment_id)
    break

# if no matching deployment exists, create one
if not deployment_id:
    print('No deployment with status: succeeded found.')
    model = desired_model

    # Now let's create the deployment
    print(f'Creating a new deployment with model: {model}')
    result = openai.Deployment.create(model=model, scale_settings={"scale_type":"standard"})
    deployment_id = result["id"]
    print(f'Successfully created {model} with deployment_id {deployment_id}')
else:
    print(f'Found a succeeded deployment of {desired_model} with id: {deployment_id}.')

deployment-83fc5492512148a1bfd571a7c1055d90
Found a succeeded deployment of text-davinci-003 with id: deployment-83fc5492512148a1bfd571a7c1055d90.
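
The lookup loop above can also be expressed as a small function, which makes it easy to try out without calling the service. This is only a sketch: the helper name `find_deployment` and the sample records are illustrative, and plain dictionaries stand in for the entries returned by `openai.Deployment.list()`.

```python
def find_deployment(deployments, model, status="succeeded"):
    """Return the id of the first deployment matching model and status, else None."""
    for deployment in deployments:
        if deployment.get("status") == status and deployment.get("model") == model:
            return deployment.get("id")
    return None


# Illustrative records only; real entries come from openai.Deployment.list().data
sample = [
    {"id": "deployment-a", "model": "text-ada-001", "status": "succeeded"},
    {"id": "deployment-b", "model": "text-davinci-003", "status": "running"},
    {"id": "deployment-c", "model": "text-davinci-003", "status": "succeeded"},
]
print(find_deployment(sample, "text-davinci-003"))  # prints deployment-c
```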


## Load Data

In [7]:
import pandas as pd
fname = 'bbc-news-data.csv'
df_orig = pd.read_csv(fname, delimiter='\t', index_col=False)

In [8]:
import numpy as np

DEVELOPMENT = False  # Set this to True for development on small subset of data

if DEVELOPMENT:
    # Sub-sample for development
    df = df_orig.sample(n=20, replace=False, random_state=9).copy() # Set sample size
else:
    df = df_orig.copy()

df

|  | category | filename | title | content |
|---|---|---|---|---|
| 0 | business | 001.txt | Ad sales boost Time Warner profit | Quarterly profits at US media giant TimeWarne... |
| 1 | business | 002.txt | Dollar gains on Greenspan speech | The dollar has hit its highest level against ... |
| 2 | business | 003.txt | Yukos unit buyer faces loan claim | The owners of embattled Russian oil giant Yuk... |
| 3 | business | 004.txt | High fuel prices hit BA's profits | British Airways has blamed high fuel prices f... |
| 4 | business | 005.txt | Pernod takeover talk lifts Domecq | Shares in UK drinks and food firm Allied Dome... |
| ... | ... | ... | ... | ... |
| 2220 | tech | 397.txt | BT program to beat dialler scams | BT is introducing two initiatives to help bea... |
| 2221 | tech | 398.txt | Spam e-mails tempt net shoppers | Computer users across the world continue to i... |
| 2222 | tech | 399.txt | Be careful how you code | A new European directive could put software w... |
| 2223 | tech | 400.txt | US cyber security chief resigns | The man making sure US computer networks are ... |


In [15]:
df[df['category']=='entertainment']

|  | category | filename | title | content |
|---|---|---|---|---|
| 510 | entertainment | 001.txt | Gallery unveils interactive tree | A Christmas tree that can receive text messag... |
| 511 | entertainment | 002.txt | Jarre joins fairytale celebration | French musician Jean-Michel Jarre is to perfo... |
| 512 | entertainment | 003.txt | Musical treatment for Capra film | The classic film It's A Wonderful Life is to ... |
| 513 | entertainment | 004.txt | Richard and Judy choose top books | The 10 authors shortlisted for a Richard and ... |
| 514 | entertainment | 005.txt | Poppins musical gets flying start | The stage adaptation of children's film Mary ... |
| ... | ... | ... | ... | ... |
| 891 | entertainment | 382.txt | Last Star Wars 'not for children' | The sixth and final Star Wars movie may not b... |
| 892 | entertainment | 383.txt | French honour for director Parker | British film director Sir Alan Parker has bee... |
| 893 | entertainment | 384.txt | Robots march to US cinema summit | Animated movie Robots has opened at the top o... |
| 894 | entertainment | 385.txt | Hobbit picture 'four years away' | Lord of the Rings director Peter Jackson has ... |


## Unstructured Data to Structured Data

In [16]:
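def retrieve_structured_data(prompt):
    """Send the prompt to the deployed model, print the completion, and return it."""
    result = None
    try:
        # Request a completion from the deployed model
        response = openai.Completion.create(
            deployment_id=deployment_id,
            prompt=prompt,
            temperature=1,
            max_tokens=300,
            top_p=1.0,
            frequency_penalty=0.0,
            presence_penalty=1
        )

        # Extract and print the generated text
        result = response['choices'][0]['text']
        print(result)
    except Exception as err:
        print(f"Unexpected {err=}, {type(err)=}")

    # None is returned if the request failed
    return result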
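
The completion comes back as free text, so turning it into rows needs a parsing step. The sketch below assumes the model answers with a pipe-delimited table whose first line is the header; that format is not guaranteed, and the helper name `parse_pipe_table` is hypothetical.

```python
def parse_pipe_table(text):
    """Parse a pipe-delimited table into a list of row dicts.

    Assumes the first table line is the header. Separator rows made only
    of dashes and colons (e.g. |---|---|) are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if "|" not in line:
            continue  # ignore text outside the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # markdown separator row
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]
```

The resulting list of dictionaries can be passed to `pd.DataFrame(...)` if a DataFrame is preferred.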

## Sample Queries

In [17]:
idx = 513  # index of the selected text

# prompt postfix
prompt_postfix = """ 
  \n\n Extract author and books from the text above in a table. 
"""
# build prompt
prompt = df['content'].loc[idx] + prompt_postfix
print(prompt)

# query
retrieve_structured_data(prompt=prompt)

 The 10 authors shortlisted for a Richard and Judy book award in 2005 are hoping for a boost in sales following the success of this year's winner.  The TV couple's interest in the book world coined the term "the Richard & Judy effect" and created the top two best-selling paperbacks of 2004 so far. The finalists for 2005 include Andrew Taylor's The American Boy and Robbie Williams' autobiography Feel. This year's winner, Alice Sebold's The Lovely Bones, sold over one million. Joseph O'Connor's Star of the Sea came second and saw sales increase by 350%. The best read award, on Richard Madeley and Judy Finnigan's Channel 4 show, is part of the British Book Awards. David Mitchell's Booker-shortlisted novel, Cloud Atlas, makes it into this year's top 10 along with several lesser known works.  "There's no doubt that this year's selection of book club entries is the best yet. If anything, the choice is even wider than last time," said Madeley. "It was very hard to follow last year's extremely
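
Each query follows the same pattern: article text followed by an extraction instruction. A minimal sketch of a reusable helper, where the name `build_prompt` and the `max_chars` budget are assumptions introduced for illustration (a character count is only a rough stand-in for the model's token limit):

```python
def build_prompt(text, instruction, max_chars=6000):
    """Append an instruction to an article, trimming the article if it is very long.

    max_chars is a rough character budget chosen for this sketch, not a token count.
    """
    article = text.strip()
    if len(article) > max_chars:
        article = article[:max_chars].rstrip()
    return f"{article}\n\n{instruction.strip()}\n"
```

With this helper, a query could be written as `retrieve_structured_data(prompt=build_prompt(df['content'].loc[idx], "Extract author and books from the text above in a table."))`.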

In [18]:
idx = 891  # index of the selected text

# prompt postfix
prompt_postfix = """ 
  \n\n Extract Star Wars movie series and associated ratings from the text above into a table. 
"""
# build prompt
prompt = df['content'].loc[idx] + prompt_postfix
print(prompt)

# query
retrieve_structured_data(prompt=prompt)

 The sixth and final Star Wars movie may not be suitable for young children, film-maker George Lucas has said.  He told US TV show 60 Minutes that Revenge of the Sith would be the darkest and most violent of the series. "I don't think I would take a five or six-year-old to this," he told the CBS programme, to be aired on Sunday. Lucas predicted the film would get a US rating advising parents some scenes may be unsuitable for under-13s. It opens in the UK and US on 19 May. He said he expected the film would be classified PG-13 - roughly equivalent to a British 12A rating.  The five previous Star Wars films have all carried less restrictive PG - parental guidance - ratings in the US. In the UK, they have all been passed U - suitable for all - with the exception of Attack of The Clones, which got a PG rating in 2002. Revenge of the Sith - the third prequel to the original 1977 Star Wars film - chronicles the transformation of the heroic Anakin Skywalker into the evil Darth Vader as he tra

In [6]:
%load_ext autoreload
%autoreload 2

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [10]:
!pip3 install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.0
