# 80 · Company Innovation Profiling — Generation Demo  
_Last updated: 2025-05-03_

We’ll ask GPT-4o-mini to infer a **rich innovation profile** for three well-known
firms. The output is constrained by a strict JSON schema so we can drop the
results straight into a database or BI tool.

Pipeline:

1. **Load** a mini company table (name + country).  
2. **Call** the model once per firm at temperature 0.7, a middle setting that keeps answers plausible while allowing some variation between runs.  
3. **Parse → DataFrame** and eyeball the results.  
4. **Extension ideas** for RAG, validation, and scaling.


## API key

* Reads `OPENAI_API_KEY` from the environment, or  
* Falls back to `key/openai_key.txt` (one-line file).  
* Raises an error if neither is present.


In [1]:
# %pip -q install --upgrade openai python-dotenv pandas

import json
import os
import pathlib

import pandas as pd
from openai import OpenAI

# Fallback: if the environment variable is unset or empty,
# read the key from a one-line file.
key_path = pathlib.Path("key/openai_key.txt")
if not os.getenv("OPENAI_API_KEY") and key_path.exists():
    os.environ["OPENAI_API_KEY"] = key_path.read_text().strip()

if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("Provide OPENAI_API_KEY or create key/openai_key.txt")

client = OpenAI()


## 1 · Sample company list


In [2]:
companies = pd.DataFrame({
    "NAME_INTERNAT": ["Tesla, Inc.", "Nestlé SA", "BYD Auto"],
    "Country": ["USA", "CHE", "CHN"]
})
companies["custom_id"] = [f"cmp{i+1}" for i in range(len(companies))]
companies


| | NAME_INTERNAT | Country | custom_id |
|---|---|---|---|
| 0 | Tesla, Inc. | USA | cmp1 |
| 1 | Nestlé SA | CHE | cmp2 |
| 2 | BYD Auto | CHN | cmp3 |


## 2 · Strict JSON schema (`company_innovation_v1`)

<details>
<summary>Click to view schema</summary>

```json
{
  "has_patents": "boolean",
  "number_of_patents": "enum: None (0) | Few (1-10) | Moderate (11-100) | High (101-500) | Very High (>500)",
  "...": "full schema defined in the next cell"
}
```

</details>


In [3]:
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "company_innovation_v1",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "has_patents": {"type": "boolean"},
                "number_of_patents": {
                    "type": "string",
                    "enum": ["None (0)", "Few (1-10)", "Moderate (11-100)",
                             "High (101-500)", "Very High (>500)"]
                },
                "name_patents_major": {"type": "array", "items": {"type": "string"}},
                "name_patents_others": {"type": "array", "items": {"type": "string"}},
                "new_or_improved_product": {"type": "boolean"},
                "num_new_products": {
                    "type": "string",
                    "enum": ["0", "1-2", "3-5", "More than 5"]
                },
                "names_new_products": {"type": "array", "items": {"type": "string"}},
                "new_or_improved_process": {"type": "boolean"},
                "process_innovation_area": {
                    "type": "string",
                    "enum": [
                        "production/manufacturing process",
                        "logistics or delivery methods",
                        "administrative/organizational processes",
                        "marketing or sales processes",
                        "service delivery processes",
                        "other"
                    ]
                },
                "details_process_improvements": {"type": "array", "items": {"type": "string"}}
            },
            "required": [
                "has_patents",
                "number_of_patents",
                "name_patents_major",
                "name_patents_others",
                "new_or_improved_product",
                "num_new_products",
                "names_new_products",
                "new_or_improved_process",
                "process_innovation_area",
                "details_process_improvements"
            ],
            "additionalProperties": False
        }
    }
}
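Strict mode is meant to guarantee schema-conformant output from the API, but records that arrive by another route (a cached file, a batch result, a hand edit) may be worth re-checking locally. Below is a minimal standard-library sketch that covers only the keywords this schema uses (`type`, `enum`, `items`, `required`, `additionalProperties`). `validate_record` and `demo_schema` are hypothetical names introduced for illustration; a full validator such as the `jsonschema` package would be the more robust choice.

```python
# Map JSON Schema type names to Python types (only those used in this schema).
_JSON_TYPES = {"boolean": bool, "string": str, "array": list, "object": dict}


def validate_record(record: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    props = schema.get("properties", {})

    for key in schema.get("required", []):
        if key not in record:
            problems.append(f"missing required field: {key}")

    if schema.get("additionalProperties") is False:
        for key in record:
            if key not in props:
                problems.append(f"unexpected field: {key}")

    for key, spec in props.items():
        if key not in record:
            continue
        value = record[key]
        if not isinstance(value, _JSON_TYPES[spec["type"]]):
            problems.append(f"{key}: expected {spec['type']}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key}: {value!r} not in enum")
        if spec["type"] == "array":
            item_type = _JSON_TYPES[spec["items"]["type"]]
            if not all(isinstance(v, item_type) for v in value):
                problems.append(f"{key}: items must be {spec['items']['type']}")
    return problems


# Trimmed-down stand-in for response_format["json_schema"]["schema"].
demo_schema = {
    "type": "object",
    "properties": {
        "has_patents": {"type": "boolean"},
        "num_new_products": {"type": "string",
                             "enum": ["0", "1-2", "3-5", "More than 5"]},
        "names_new_products": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["has_patents", "num_new_products", "names_new_products"],
    "additionalProperties": False,
}

# Made-up records for illustration, not model output.
good = {"has_patents": True, "num_new_products": "1-2",
        "names_new_products": ["Widget"]}
bad = {"has_patents": "yes", "num_new_products": "NA", "extra": 1}

print(validate_record(good, demo_schema))
print(validate_record(bad, demo_schema))
```

In the notebook, the second argument would be `response_format["json_schema"]["schema"]`. Because `additionalProperties` is `False`, the check should run before bookkeeping fields such as `custom_id` are attached to a record.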


## 3 · Helper to ask GPT-4o-mini once


In [4]:
def ask_innovation(name: str, country: str, temp: float = 0.7) -> dict:
    """Return a dict conforming to the schema."""
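    # Note: under the strict schema, 'NA' can only appear inside the string
    # arrays; boolean and enum fields must still take one of their schema values.
    sys_prompt = (
        "You are an analyst inferring a company's innovation attributes. "
        "Use best judgement; if unsure put 'NA'. "
        "Return JSON only."
    )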
    user_prompt = f"Company Name: {name}\nCountry: {country}"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": sys_prompt},
            {"role": "user",   "content": user_prompt}
        ],
        temperature=temp,
        max_tokens=16000,
        response_format=response_format
    )
    return json.loads(resp.choices[0].message.content)
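The helper makes a single network call, so a transient failure (a rate limit or a timeout, for example) would abort the whole loop. One common mitigation is to retry with exponential backoff. The sketch below is generic and does not touch the API: `with_retries` is a hypothetical helper, `flaky` is a stand-in for the real call, and the broad `except Exception` is for brevity only. The `OpenAI` client also accepts a `max_retries` argument, which may already be sufficient for many cases.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(); on failure wait base_delay * 2**i seconds, then try again."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:  # broad for brevity; narrow this in real code
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** i)
    raise ValueError("attempts must be at least 1")


# Stand-in for a flaky API call: fails twice, then succeeds.
calls = {"n": 0}


def flaky() -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return {"has_patents": True}


# Record the delays instead of actually sleeping, to keep the demo instant.
delays: list[float] = []
result = with_retries(flaky, attempts=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

With the notebook's helper, a call would look like `with_retries(lambda: ask_innovation(name, country))`.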


## 4 · Generate profiles for each firm


In [5]:
records = []
for _, row in companies.iterrows():
    info = ask_innovation(row.NAME_INTERNAT, row.Country, temp=0.7)
    info["custom_id"] = row.custom_id
    info["company"]   = row.NAME_INTERNAT
    records.append(info)

df_profiles = pd.json_normalize(records)
df_profiles.head()


| | has_patents | number_of_patents | name_patents_major | name_patents_others | new_or_improved_product | num_new_products | names_new_products | new_or_improved_process | process_innovation_area | details_process_improvements | custom_id | company |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | Very High (>500) | [Electric Vehicle Battery Technology, Autopilo... | [Vehicle Design Innovations, Manufacturing Tec... | True | More than 5 | [Model S, Model 3, Model X, Model Y, Cybertruc... | True | production/manufacturing process | [Gigafactory production techniques, Automation... | cmp1 | Tesla, Inc. |
| 1 | True | High (101-500) | [Food preservation techniques, Nutritional sup... | [Coffee brewing technology, Dairy product proc... | True | More than 5 | [Plant-based milk alternatives, Functional bev... | True | production/manufacturing process | [Automated food production lines, Sustainable ... | cmp2 | Nestlé SA |
| 2 | True | Very High (>500) | [Electric Vehicle Battery Technology, Electric... | [Solar Energy Solutions, Vehicle Design Innova... | True | More than 5 | [BYD Han EV, BYD Tang EV, BYD Dolphin, BYD Sea... | True | production/manufacturing process | [Automation in battery production, Streamlined... | cmp3 | BYD Auto |
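Several fields in each profile are lists, which a relational table cannot store directly. One option is to serialize them as JSON text on the way in. The sketch below uses an in-memory SQLite database from the standard library; the `profiles` table name, the `save_profiles` helper, and the sample record are all hypothetical and chosen for illustration.

```python
import json
import sqlite3


def save_profiles(conn: sqlite3.Connection, records: list[dict]) -> None:
    """Store one row per record; lists become JSON text, booleans become 0/1."""
    if not records:
        return
    columns = list(records[0].keys())
    # Untyped columns: SQLite then stores each value as given, without coercion.
    col_defs = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS profiles ({col_defs})")
    placeholders = ", ".join("?" for _ in columns)
    for rec in records:
        row = []
        for c in columns:
            value = rec.get(c)
            if isinstance(value, list):
                value = json.dumps(value, ensure_ascii=False)
            elif isinstance(value, bool):
                value = int(value)
            row.append(value)
        conn.execute(f"INSERT INTO profiles VALUES ({placeholders})", row)
    conn.commit()


# Made-up placeholder record, not model output.
demo = [{
    "custom_id": "cmp0",
    "company": "Example Corp",
    "has_patents": True,
    "names_new_products": ["Widget A", "Widget B"],
}]

conn = sqlite3.connect(":memory:")
save_profiles(conn, demo)
row = conn.execute(
    "SELECT has_patents, names_new_products FROM profiles"
).fetchone()
print(row[0], json.loads(row[1]))
```

In the notebook, `save_profiles(conn, records)` would take the `records` list built in the loop above. Keeping the lists as JSON text preserves them in a single row; a separate long-format table (one row per product or patent name) may suit BI tools better.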


## 5 · Sanity-check a few columns


In [6]:
cols = ["company", "has_patents", "number_of_patents"]
df_profiles[cols]


| | company | has_patents | number_of_patents |
|---|---|---|---|
| 0 | Tesla, Inc. | True | Very High (>500) |
| 1 | Nestlé SA | True | High (101-500) |
| 2 | BYD Auto | True | Very High (>500) |
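The two patent columns shown above encode overlapping information, so they can contradict each other: a firm flagged as having patents should not fall in the `"None (0)"` bucket, and a firm without patents should fall in no other bucket. With three rows this is easy to check by eye; for longer tables, a small check along the following lines could help. `find_inconsistent` is a hypothetical helper, and the sample rows are made up for illustration.

```python
def find_inconsistent(records: list[dict]) -> list[dict]:
    """Return records whose patent flag contradicts the patent-count bucket."""
    flagged = []
    for rec in records:
        has_patents = rec["has_patents"]
        zero_bucket = rec["number_of_patents"] == "None (0)"
        # A record is consistent only when exactly one of the two is true.
        if has_patents == zero_bucket:
            flagged.append(rec)
    return flagged


# Made-up rows, not model output.
sample = [
    {"custom_id": "a", "has_patents": True, "number_of_patents": "High (101-500)"},
    {"custom_id": "b", "has_patents": False, "number_of_patents": "None (0)"},
    {"custom_id": "c", "has_patents": False, "number_of_patents": "Few (1-10)"},
    {"custom_id": "d", "has_patents": True, "number_of_patents": "None (0)"},
]

suspect_ids = [rec["custom_id"] for rec in find_inconsistent(sample)]
print(suspect_ids)
```

Applied to the notebook's data, `find_inconsistent(records)` would return the rows to route for manual review.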




## 6 · Extension ideas

* **Batch endpoint**: profile 10 000 firms by writing one JSONL request line per
  company, uploading the file, and creating a batch at `/v1/batches` with
  `endpoint="/v1/chat/completions"` and `completion_window="24h"`.
* **Validation loop**: flag rows where `has_patents` is `false` but
  `number_of_patents` is anything other than `"None (0)"` (or the reverse), and
  route them for manual review.
* **Embedding similarity**: embed `names_new_products` and cluster the vectors to
  spot technological adjacency between firms.
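The batch idea in the list above can be sketched as follows. Each line of the input file is a self-contained request object with a `custom_id`, an HTTP `method`, the target `url`, and the request `body`. `build_batch_lines` is a hypothetical helper, and `demo_format` is a stand-in for the notebook's `response_format`. The upload and submission calls are shown only as comments because they need network access and a valid key.

```python
import json

SYSTEM_PROMPT = (
    "You are an analyst inferring a company's innovation attributes. "
    "Return JSON only."
)


def build_batch_lines(companies: list[dict], response_format: dict,
                      model: str = "gpt-4o-mini") -> list[str]:
    """Return one JSON string per company, ready to be written as JSONL."""
    lines = []
    for c in companies:
        request = {
            "custom_id": c["custom_id"],  # used to match results to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user",
                     "content": f"Company Name: {c['name']}\nCountry: {c['country']}"},
                ],
                "response_format": response_format,
            },
        }
        lines.append(json.dumps(request, ensure_ascii=False))
    return lines


demo_companies = [
    {"custom_id": "cmp1", "name": "Tesla, Inc.", "country": "USA"},
    {"custom_id": "cmp2", "name": "Nestlé SA", "country": "CHE"},
    {"custom_id": "cmp3", "name": "BYD Auto", "country": "CHN"},
]
demo_format = {"type": "json_object"}  # stand-in; pass the real schema in practice

lines = build_batch_lines(demo_companies, demo_format)
print(len(lines))

# Submission, assuming `client` is the OpenAI client created earlier:
# with open("batch.jsonl", "w", encoding="utf-8") as f:
#     f.write("\n".join(lines) + "\n")
# batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
# batch = client.batches.create(
#     input_file_id=batch_file.id,
#     endpoint="/v1/chat/completions",
#     completion_window="24h",
# )
```

Results arrive asynchronously as another JSONL file, in which each line carries the `custom_id` of the request it answers, so the output can be joined back to the company table.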


### Crediting
The paper behind this demo is still a work in progress.
If you use this code in the meantime, please cite one of the related papers below:
- Garg, P. and Fetzer, T., 2025. **Causal claims in economics**. arXiv preprint arXiv:2501.06873.
- Fetzer, T., Lambert, P.J., Feld, B. and Garg, P., 2024. **AI-generated production networks: Measurement and applications to global trade**.
- Garg, P. and Fetzer, T., 2025. **Political expression of academics on Twitter**. Nature Human Behaviour. DOI: 10.1038/s41562-025-02199-1
- Garg, P. and Fetzer, T., 2025. **Artificial Intelligence health advice accuracy varies across languages and contexts**. arXiv preprint arXiv:2504.18310.

