# Using data in the prompt

## Ways of using ChatGPT with data

If we have data, there are several ways of working with ChatGPT. We have probably already talked about two:

- Asking ChatGPT to analyse a data file
- Asking ChatGPT to summarise an image of plotted data

Both of these are human-like ways of understanding and working with data. However, ChatGPT can also work with data by looking at the entire raw dataset as a gestalt. This is what happens when you pass raw data directly into the prompt, and it is what we'll experiment with in this session.

## Fighting the context window

- When we pass data into the prompt, the context window becomes a problem
- Today we will use `gpt-4-turbo`, which has a 128k-token context window
- You will find that this is not enough for many things you want to do
- Much of this session will be about deciding
  - What do I want to show ChatGPT?
  - In what format should I pass it?
  - Is there any way of reducing the data size so that I can show more?
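To get a feel for whether a piece of text might fit, here is a minimal sketch of a size check. The helpers `estimate_tokens` and `fits_context` are hypothetical, and the figure of four characters per token is only an assumed rule of thumb: numeric JSON may tokenise less efficiently, and a real tokenizer would give an exact count.

```python
import json
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; chars_per_token is an assumed rule of thumb."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(text: str, context_tokens: int = 128_000, reserve_tokens: int = 4_000) -> bool:
    """True if the estimate leaves `reserve_tokens` free for instructions and the reply."""
    return estimate_tokens(text) <= context_tokens - reserve_tokens


# Made-up sample data, only to exercise the helpers
sample = json.dumps({"entities": ["France"] * 1000, "years": list(range(1000))})
print(len(sample), estimate_tokens(sample), fits_context(sample))
```

The reserve is there because the data is never the whole prompt: your instructions and the model's answer share the same window.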

## Imports

In [None]:
import json
from dataclasses import dataclass
from typing import Dict, List, Optional
import requests
import sh
from ipywidgets import interact, Dropdown
import pandas as pd

In [None]:
import hashlib

In [None]:
import openai

In [None]:
import shelve

## Helpers

You can mostly ignore this code to start with. It provides the helper function `fetch_bundle()`, which fetches all the config, data and metadata for a single Grapher chart (or indicator).

In [None]:
@dataclass
class Indicator:
    data: dict
    metadata: dict

    def to_dict(self):
        return {"data": self.data, "metadata": self.metadata}

    def to_frame(self):
        # getting a data frame is easy
        df = pd.DataFrame.from_dict(self.data)

        # turning entity ids into entity names
        entities = pd.DataFrame.from_records(self.metadata['dimensions']['entities']['values'])
        id_to_name = entities.set_index('id').name.to_dict()
        df['entities'] = df.entities.apply(id_to_name.__getitem__)

        # make the "values" column more interestingly named
        df = df.rename(columns={'values': self.metadata.get('shortName', f'ind_{self.metadata["id"]}')})

        # order the columns better
        cols = ['entities', 'years'] + sorted([c for c in df.columns if c not in ['entities', 'years']])
        df = df[cols]

        return df


@dataclass
class GrapherBundle:
    config: Optional[dict]
    dimensions: Dict[int, Indicator]
    origins: List[dict]

    def to_json(self):
        return json.dumps(
            {
                "config": self.config,
                "dimensions": {k: i.to_dict() for k, i in self.dimensions.items()},
                "origins": self.origins,
            }
        )

    def size(self):
        return len(self.to_json())

    @property
    def indicators(self) -> List[Indicator]:
        return list(self.dimensions.values())

    def to_frame(self):
        df = None
        for i in self.indicators:
            to_merge = i.to_frame()
            if df is None:
                df = to_merge
            else:
                df = pd.merge(df, to_merge, how='outer', on=['entities', 'years'])
        return df

    def __repr__(self):
        return f'GrapherBundle(config={self.config}, dimensions=..., origins=...)'

def fetch_grapher_config(slug):
    resp = requests.get(f"https://ourworldindata.org/grapher/{slug}")
    resp.raise_for_status()
    return json.loads(resp.content.decode("utf-8").split("//EMBEDDED_JSON")[1])


def fetch_dimension(id: int) -> Indicator:
    data = requests.get(
        f"https://api.ourworldindata.org/v1/indicators/{id}.data.json"
    ).json()
    metadata = requests.get(
        f"https://api.ourworldindata.org/v1/indicators/{id}.metadata.json"
    ).json()
    return Indicator(data, metadata)


def fetch_bundle(
    slug: Optional[str] = None, indicator_id: Optional[int] = None
) -> GrapherBundle:
    if slug:
        config = fetch_grapher_config(slug)
        indicator_ids = [d["variableId"] for d in config["dimensions"]]
    else:
        print(f"Fetching indicator {indicator_id}")
        config = None
        indicator_ids = [indicator_id]
    dimensions = {
        indicator_id: fetch_dimension(indicator_id) for indicator_id in indicator_ids
    }
    origins = []
    for d in dimensions.values():
        if d.metadata.get("origins"):
            origins.append(d.metadata.pop("origins"))
    return GrapherBundle(config, dimensions, origins)

In [None]:
def fetch(slug=None, indicator_id=None):
    key = f'{slug}::{indicator_id}'
    with shelve.open('cache.db') as shelf:
        if key not in shelf:
            b = fetch_bundle(slug=slug, indicator_id=indicator_id)
            shelf[key] = b
        
        return shelf[key]

In [None]:
def to_clipboard(s):
    # pbcopy is a macOS command; on other systems, copy the prompt by hand
    sh.pbcopy(_in=s)

## Helpers: asking ChatGPT by API

In [None]:
client = openai.Client()

In [None]:
MODEL = 'gpt-4-turbo'

In [None]:
def gpt_response(message: str, model: str = MODEL) -> str:
    return client.chat.completions.create(
      model=model,
      messages=[{"role": "user", "content": message}],
    ).choices[0].message.content

In [None]:
print(gpt_response('Tell me a funny story in a single haiku with a surprising twist'))

In [None]:
def gpt_cached(message: str, model: str = MODEL) -> str:
    with shelve.open('cache.db') as shelf:
        key = hashlib.md5(f'{model}:::{message}'.encode('utf8')).hexdigest()
        if key in shelf:
            return shelf[key]

        resp = gpt_response(message, model)
        shelf[key] = resp
        return resp

## Test data fetching

Let's check to see how it works.

In [None]:
# fetch just one indicator
b = fetch(slug='gdp-per-capita-maddison')
b

#### If we passed everything as JSON to ChatGPT, how big would it be?

In [None]:
len(b.to_json())

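395k characters! Much bigger than our 128k budget. (The window is really counted in tokens, so treating it as a character limit is a conservative simplification.)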

#### What about just the data, no metadata?

In [None]:
b.to_frame()

In [None]:
len(b.to_frame().to_json())

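919k??? Eeek, it's worse: entity names are now strings repeated on every row, and the default `to_json()` layout also writes a row index key next to every value.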

#### What about stacked area charts?

In [None]:
# fetch a stacked chart that uses a bunch of indicators
b = fetch(slug='births-by-age-of-mother')

In [None]:
len(b.to_json())

2.8MB!!!

In [None]:
len(b.to_frame().to_json())

2.95MB!!!

So, we will need to think carefully about what we might use in a prompt.
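Part of the size comes from the serialisation layout rather than the data itself. The sketch below uses only the standard library and made-up rows (the `value` column and the three helper functions are illustrative, not part of the notebook) to compare a column-oriented JSON layout, a record-oriented one, and CSV.

```python
import csv
import io
import json

# Made-up rows in the same shape as the frames in this notebook
rows = [
    {"entities": entity, "years": year, "value": round(1000 + 1.5 * i, 1)}
    for entity in ["France", "Germany", "United Kingdom"]
    for i, year in enumerate(range(1950, 2020))
]


def as_column_json(rows):
    """{column: {row_index: value}}: every value carries its row index as a key."""
    columns = rows[0].keys()
    return json.dumps({c: {str(i): r[c] for i, r in enumerate(rows)} for c in columns})


def as_record_json(rows):
    """[{column: value, ...}, ...]: the column names repeat on every row."""
    return json.dumps(rows)


def as_csv(rows):
    """Header once, then one comma-separated line per row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


sizes = {f.__name__: len(f(rows)) for f in (as_column_json, as_record_json, as_csv)}
print(sizes)
```

For tabular data like this, CSV states each column name once, so it comes out smallest of the three here.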

## Part 1: making data to prompt on

See what makes the data bigger or smaller when it is passed to ChatGPT. Can you find a strategy that reduces this entire data file to something we can use in a prompt?

In [None]:
with open('slugs.json') as f:
    slug_whitelist = set(json.load(f))

In [None]:
last_slug = 'life-expectancy'
last_prompt = ''

In [None]:
@interact(slug=last_slug)
def find_data(slug=None):
    global last_slug, last_prompt
    
    if not slug:
        return

    last_slug = slug
    
    if slug not in slug_whitelist:
        matches = sorted([s for s in slug_whitelist if s.startswith(slug)])[:5]
        if matches:
            print('\n'.join(matches))
        else:
            print('(not found)')
        return
    
    b = fetch(slug=slug)
    df = b.to_frame()

    ### ------ YOU WORK HERE -------

    last_prompt = df.to_json()
    
    ###

    # let's see how we did
    size_k = len(last_prompt) // 1000
    emoji = '❌' if size_k > 128 else '✅'
    print(f'Length: {size_k}k {emoji}\n')
    print(last_prompt[:1000])

If you get a tick, you may copy your prompt to the clipboard.

In [None]:
to_clipboard(last_prompt)

Then go paste it into ChatGPT, with any instructions you like.

#### Strategies

- You could try varying the serialisation format
  - Look at what you're really passing to ChatGPT, is that the most compact thing?
- You could try showing only part of the data
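As one possible direction for showing only part of the data, here is a minimal sketch that keeps a few entities and thins out the years. The function `subset_rows` and the sample rows are hypothetical. Only the `entities` and `years` column names match the notebook's frames.

```python
def subset_rows(rows, entities, year_from, year_step=1):
    """Keep the chosen entities, from year_from onwards, every year_step years."""
    keep = set(entities)
    return [
        r for r in rows
        if r["entities"] in keep
        and r["years"] >= year_from
        and (r["years"] - year_from) % year_step == 0
    ]


# Made-up rows: three entities, one value per year from 1900 to 2020
rows = [
    {"entities": e, "years": y, "value": i}
    for e in ["France", "Germany", "Japan"]
    for i, y in enumerate(range(1900, 2021))
]

small = subset_rows(rows, entities=["France", "Japan"], year_from=1950, year_step=10)
print(len(rows), "->", len(small))
```

On the notebook's data frame, a similar filter could be written as `df[df.entities.isin(keep) & (df.years >= 1950)]`. Which entities and years to keep depends on the question you plan to ask.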

## Part 2: experimenting with prompts

Although we can use the ChatGPT API, you may enjoy continuing to use your clipboard and using ChatGPT more interactively.

Now we want to work out what we can say to ChatGPT that will give us the kind of output we want.

## Things to try

What types of prompts work well?
- Try giving a lot of guidance
- Try giving little to no guidance
- Try comparing a country to its peers, income group or neighbours (see: `peers.json`)
- Try asking it to think step by step, then give an answer after '---'
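If you do use the API for the step-by-step idea, the reasoning and the answer need to be separated afterwards. The sketch below is one way to do that. `build_prompt` and `final_answer` are hypothetical helpers, and the prompt wording is only an example.

```python
def build_prompt(question: str, data: str) -> str:
    """Combine a question, a step-by-step instruction and the data into one prompt."""
    return (
        f"{question}\n\n"
        "Think step by step, then write '---' on its own line, "
        "followed by your final answer.\n\n"
        f"Data:\n{data}"
    )


def final_answer(response: str, marker: str = "---") -> str:
    """Text after the last marker, or the whole response if the marker is missing."""
    _, found, tail = response.rpartition(marker)
    return tail.strip() if found else response.strip()


# A made-up response, standing in for what the model might return
example = "France starts lower.\nIt grows faster after 1950.\n---\nFrance grew fastest."
print(final_answer(example))
```

With the helpers defined earlier, this could be used as `final_answer(gpt_cached(build_prompt(question, last_prompt)))`. Be aware that a response containing a markdown table or horizontal rule may include `---` elsewhere, so a more distinctive marker may be safer.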