## How to use Large Language Models (LLMs) to analyse text and create novel datasets

Last updated: 2024-05-16

See this blog post on Autonomous Econ for a detailed walkthrough of this notebook: https://autonomousecon.substack.com/p/how-anyone-can-use-ai-to-create-novel

### Install dependencies

In [None]:
pip install openai

In [None]:
pip install langchain

In [None]:
pip install -U langchain-openai

### Import libraries

In [7]:
from langchain_openai import ChatOpenAI
from langchain.document_loaders import WebBaseLoader
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser
from langchain.prompts import ChatPromptTemplate
import os
import json
import pandas as pd

### Set environment variables and model

In [8]:
# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = "<YOURAPIKEY>"
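Pasting a key directly into a cell makes it easy to leak when the notebook is shared. As an alternative, the key can be exported in the shell before starting Jupyter and only checked here. The sketch below assumes that setup; `require_api_key` is a hypothetical helper written for illustration, not part of `openai` or `langchain`.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Treat the notebook's placeholder value as "not set"
    if not key or key == "<YOURAPIKEY>":
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key


# Demonstration with a stand-in mapping instead of the real environment
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```

Calling `require_api_key()` with no arguments checks the real environment, so the notebook fails early with a readable message rather than at the first model call.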

In [9]:
# set model and model params
model="gpt-3.5-turbo-16k"
chat = ChatOpenAI(temperature=0.0, model_name=model)

### Choose the web links to load into your document list

In [10]:
# Loading multiple web links
urls = [
"https://www.federalreserve.gov/newsevents/pressreleases/monetary20240320a.htm",
"https://www.federalreserve.gov/newsevents/pressreleases/monetary20230201a.htm",
"https://www.federalreserve.gov/newsevents/pressreleases/monetary20220504a.htm",
"https://www.federalreserve.gov/newsevents/pressreleases/monetary20220316a.htm",
"https://www.federalreserve.gov/newsevents/pressreleases/monetary20200315a.htm",
]
loader = WebBaseLoader(web_path=urls)

docs = loader.load()
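Each loaded document exposes its text as `page_content`, and text scraped from a web page usually carries long runs of newlines and tabs that cost tokens without adding information. A minimal sketch of a clean-up step follows; `collapse_whitespace` is a hypothetical helper and the sample string is made up, not taken from a real press release.

```python
import re


def collapse_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample standing in for scraped page text
sample = "\n\n  Federal Reserve issues\n\tFOMC statement  \n"
print(collapse_whitespace(sample))
```

In the notebook this could be applied as `collapse_whitespace(doc.page_content)` before the text is placed into the prompt.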

### Create the prompt template

In [23]:

# create the instructions that will convert the LLM response into a JSON format

date_schema = ResponseSchema(name="fomc_date", description="date of fomc announcement")

target_range_schema = ResponseSchema(
    name="fed_funds_target_range", description="target range for the federal funds rate"
)

decision_schema = ResponseSchema(
    name="rate_decision", description="decision for the federal funds rate"
)

policy_stance_schema = ResponseSchema(
    name="policy_stance", description="policy stance of statement ranging from -1 (very dovish) to 1 (very hawkish)"
)


response_schemas = [
    date_schema,
    target_range_schema,
    decision_schema,
    policy_stance_schema,
]

output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

format_instructions = output_parser.get_format_instructions()

print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"fomc_date": string  // date of fomc announcement
	"fed_funds_target_range": string  // target range for the federal funds rate
	"rate_decision": string  // decision for the federal funds rate
	"policy_stance": string  // policy stance of statement ranging from -1 (very dovish) to 1 (very hawkish)
}
```
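The format instructions ask the model to wrap its answer in a fenced `json` snippet so that the reply can be parsed mechanically. To show the idea, here is a rough stand-in for that parsing step using only the standard library. It is not LangChain's actual implementation; `parse_fenced_json` is a hypothetical function and the reply string is an illustrative example.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable


def parse_fenced_json(reply, expected_keys):
    """Extract the first fenced json snippet from a reply and check its keys."""
    pattern = re.escape(FENCE) + r"json\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no fenced json snippet found in reply")
    data = json.loads(match.group(1))
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# Illustrative reply in the shape the format instructions request
reply = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"fomc_date": "20/03/2024", "rate_decision": "maintain"}\n'
    + FENCE
)
print(parse_fenced_json(reply, ["fomc_date", "rate_decision"]))
```

A reply without the fence, or with a key left out, raises a `ValueError` instead of silently producing an incomplete row.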


In [24]:
# Create the extraction instructions for the prompt template

extract_template = """\
For the following text, extract the following information:

fomc_date: date of the FOMC announcement. Use the following format: dd/mm/YYYY.

fed_funds_target_range: the target range for the federal funds rate that the Committee decided on. Record the range as values with 2 decimal places.

rate_decision: decision by the FOMC for the federal funds rate. Classify as either 'raise', 'maintain' or 'lower'.

policy_stance: classify the text with the following values depending on how dovish or hawkish the overall message was in the FOMC statement.
-1: Strongly expresses a belief that the economy may be growing too slowly and/or inflation is too low and may need stimulus through monetary policy.
-0.5: Overall message expresses a belief that the economy may be growing too slowly and/or inflation is too low and may need stimulus through monetary policy.
0: Expresses neither a hawkish nor dovish view and the Fed is on track to achieve its employment and inflation goals.
0.5: Overall message expresses a belief that the economy is growing too quickly and/or inflation is too high and may need to be slowed down through monetary policy.
1: Strongly expresses a belief that the economy is growing too quickly and/or inflation is too high and may need to be slowed down through monetary policy.

text: {text}

{format_instructions}
"""

In [25]:

prompt = ChatPromptTemplate.from_template(template=extract_template)


### Run your prompt over every document in your document list

In [26]:
# Generate a response for each document based on the prompt template and save it into a list
output_list = []
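for doc in docs:
    # Pass only the page text, not the whole Document object (which also carries metadata)
    messages = prompt.format_messages(text=doc.page_content, format_instructions=format_instructions)
    # invoke() replaces the deprecated chat(messages) call style
    response = chat.invoke(messages)
    parsed_response = output_parser.parse(response.content)
    output_list.append(parsed_response)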
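With more than a handful of documents, a single malformed reply would stop the whole loop. One way to guard against that is to collect failures separately and keep going. The sketch below shows the pattern with a stand-in extraction function so it runs without an API key; `extract_all` and `fake_extract` are hypothetical names, and in the notebook the second argument would wrap the prompt, model call, and parser.

```python
def extract_all(documents, extract_one):
    """Apply extract_one to each document, collecting results and failures separately."""
    results, failures = [], []
    for index, document in enumerate(documents):
        try:
            results.append(extract_one(document))
        except Exception as error:  # e.g. a reply the parser cannot read
            failures.append((index, str(error)))
    return results, failures


def fake_extract(text):
    """Stand-in for the real model call; fails on empty input."""
    if not text:
        raise ValueError("empty document")
    return {"length": len(text)}


results, failures = extract_all(["statement one", "", "statement three"], fake_extract)
print(results)
print(failures)
```

The `failures` list keeps the index of each document that could not be processed, which makes it straightforward to retry only those.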

In [27]:
# JSON list
output_list


[{'fomc_date': '20/03/2024',
  'fed_funds_target_range': '5.25-5.50',
  'rate_decision': 'maintain',
  'policy_stance': '0'},
 {'fomc_date': '01/02/2023',
  'fed_funds_target_range': '4.50-4.75',
  'rate_decision': 'raise',
  'policy_stance': '0.5'},
 {'fomc_date': '04/05/2022',
  'fed_funds_target_range': '0.75 - 1.00',
  'rate_decision': 'raise',
  'policy_stance': '0.5'},
 {'fomc_date': '16/03/2022',
  'fed_funds_target_range': '0.25 - 0.5',
  'rate_decision': 'raise',
  'policy_stance': '0.5'},
 {'fomc_date': '15/03/2020',
  'fed_funds_target_range': '0.00-0.25',
  'rate_decision': 'lower',
  'policy_stance': '-1'}]
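The model did not format the target range consistently: some entries have spaces around the hyphen and one has a single decimal place (`'0.25 - 0.5'`). A small post-processing step can bring them to one form. This is a minimal sketch, assuming every range contains exactly two plain decimal numbers; `normalize_range` is a hypothetical helper.

```python
import re


def normalize_range(raw):
    """Return a target range as 'low-high' with two decimal places each."""
    numbers = re.findall(r"\d+(?:\.\d+)?", raw)
    if len(numbers) != 2:
        raise ValueError(f"expected two numbers in {raw!r}")
    low, high = (float(number) for number in numbers)
    return f"{low:.2f}-{high:.2f}"


# Values copied from the output above
for raw in ["5.25-5.50", "0.75 - 1.00", "0.25 - 0.5"]:
    print(normalize_range(raw))
```

Ranges written in another style, such as fractions, would need extra handling and raise a `ValueError` here.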

In [28]:
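# Convert the list of parsed responses (dicts) into a pandas DataFrame
df = pd.DataFrame(output_list)
# The parser returns every value as a string, so convert the stance score to a number
df["policy_stance"] = pd.to_numeric(df["policy_stance"])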


In [29]:
# The final output
df

|   | fomc_date | fed_funds_target_range | rate_decision | policy_stance |
|---|---|---|---|---|
| 0 | 20/03/2024 | 5.25-5.50 | maintain | 0.0 |
| 1 | 01/02/2023 | 4.50-4.75 | raise | 0.5 |
| 2 | 04/05/2022 | 0.75 - 1.00 | raise | 0.5 |
| 3 | 16/03/2022 | 0.25 - 0.5 | raise | 0.5 |
| 4 | 15/03/2020 | 0.00-0.25 | lower | -1.0 |


In [20]:
# save the structured df to a csv file
df.to_csv("fomc_sample.csv", index=False)