# 80 · Company Innovation Profiling — Generation Demo  
_Last updated: 2025-05-03_

We’ll ask GPT-4o-mini to infer a **rich innovation profile** for three well-known
firms. The output is constrained by a strict JSON schema so we can drop the
results straight into a database or BI tool.

Pipeline:

1. **Load** a mini company table (name + country).  
2. **Call** the model once per firm, using temperature 0.7 for a balanced guess.  
3. **Parse → DataFrame** and eyeball the results.  
4. **Extension ideas** for RAG, validation, and scaling.


## API key

* Reads `OPENAI_API_KEY` from the environment, or  
* Falls back to `key/openai_key.txt` (one-line file).  
* Raises an error if neither is present.


In [1]:
# %pip -q install --upgrade openai python-dotenv pandas

import os, pathlib, json, pandas as pd
from openai import OpenAI

# Prefer the OPENAI_API_KEY environment variable; fall back to the key file if it is unset
key_path = pathlib.Path("key/openai_key.txt")
if os.getenv("OPENAI_API_KEY") is None and key_path.exists():
    os.environ["OPENAI_API_KEY"] = key_path.read_text().strip()

if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("Provide OPENAI_API_KEY or create key/openai_key.txt")

client = OpenAI()


## 1 · Sample company list


In [2]:
companies = pd.DataFrame({
    "NAME_INTERNAT": ["Tesla, Inc.", "Nestlé SA", "BYD Auto"],
    "Country": ["USA", "CHE", "CHN"]
})
companies["custom_id"] = [f"cmp{i+1}" for i in range(len(companies))]
companies


Unnamed: 0,NAME_INTERNAT,Country,custom_id
0,"Tesla, Inc.",USA,cmp1
1,Nestlé SA,CHE,cmp2
2,BYD Auto,CHN,cmp3


## 2 · Strict JSON schema (`company_innovation_v1`)

<details>
<summary>Click to view schema</summary>

```json
{
  "has_patents": "boolean",
  "number_of_patents": "enum: None (0) | Few (1-10) | Moderate (11-100) | High (101-500) | Very High (>500)",
  "...": "full schema defined in the next cell"
}
```

</details>


In [3]:
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "company_innovation_v1",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "has_patents": {"type": "boolean"},
                "number_of_patents": {"type": "string",
                    "enum": ["None (0)", "Few (1-10)", "Moderate (11-100)",
                             "High (101-500)", "Very High (>500)"]},
                "name_patents_major":  {"type": "array", "items": {"type": "string"}},
                "name_patents_others": {"type": "array", "items": {"type": "string"}},
                "climate_mitigation_innovation": {"type": "boolean"},
                "climate_patents": {"type": "boolean"},
                "new_or_improved_product": {"type": "boolean"},
                "num_new_products": {"type": "string",
                    "enum": ["0", "1-2", "3-5", "More than 5"]},
                "novelty_product_innovation": {"type": "string",
                    "enum": ["New to the firm only",
                             "New to the local market/industry",
                             "New to the world"]},
                "names_new_products": {"type": "array", "items": {"type": "string"}},
                "new_or_improved_process": {"type": "boolean"},
                "process_innovation_area": {"type": "string",
                    "enum": ["production/manufacturing process",
                             "logistics or delivery methods",
                             "administrative/organizational processes",
                             "marketing or sales processes",
                             "service delivery processes", "other"]},
                "details_process_improvements": {"type": "array", "items": {"type": "string"}},
                "R&D_investment": {"type": "boolean"},
                "digital_technology_adoption": {"type": "boolean"},
                "eco_innovations": {"type": "boolean"},
                "green_innovation_example": {"type": "array", "items": {"type": "string"}},
                "innovation_strategy": {"type": "string",
                    "enum": ["mainly develops new innovations in-house",
                             "mostly adapts or adopts innovations from external sources",
                             "a balanced mix of both", "NA"]}
            },
            "required": [
                "has_patents","number_of_patents","name_patents_major","name_patents_others",
                "climate_mitigation_innovation","climate_patents",
                "new_or_improved_product","num_new_products","novelty_product_innovation",
                "names_new_products","new_or_improved_process","process_innovation_area",
                "details_process_improvements","R&D_investment",
                "digital_technology_adoption","eco_innovations","green_innovation_example",
                "innovation_strategy"
            ],
            "additionalProperties": False
        }
    }
}
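Strict mode constrains what the model returns, but a local check is still handy when records are reloaded from disk or edited by hand. The following is a minimal sketch, not part of the original notebook: `check_record` and `mini_schema` are hypothetical names, the schema is a trimmed two-field copy for illustration, and the check is shallow (required keys, unexpected keys, booleans, enum membership) rather than a full JSON Schema validator.

```python
# Hypothetical helper (an assumption, not from the notebook): a shallow check of
# required keys, unexpected keys, boolean types, and enum membership.
def check_record(record: dict, schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    props = schema["properties"]
    # Every key listed under "required" must be present.
    for key in schema.get("required", []):
        if key not in record:
            problems.append(f"missing key: {key}")
    for key, value in record.items():
        spec = props.get(key)
        if spec is None:
            # Unknown keys only count as errors when additionalProperties is False.
            if schema.get("additionalProperties") is False:
                problems.append(f"unexpected key: {key}")
            continue
        if spec.get("type") == "boolean" and not isinstance(value, bool):
            problems.append(f"{key}: expected a boolean, got {value!r}")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key}: {value!r} is not an allowed value")
    return problems


# Trimmed two-field copy of the schema, for illustration only.
mini_schema = {
    "type": "object",
    "properties": {
        "has_patents": {"type": "boolean"},
        "number_of_patents": {
            "type": "string",
            "enum": ["None (0)", "Few (1-10)", "Moderate (11-100)",
                     "High (101-500)", "Very High (>500)"],
        },
    },
    "required": ["has_patents", "number_of_patents"],
    "additionalProperties": False,
}

good = {"has_patents": True, "number_of_patents": "High (101-500)"}
bad = {"has_patents": "yes", "number_of_patents": "High"}  # made-up invalid record

print(check_record(good, mini_schema))
print(check_record(bad, mini_schema))
```

In the notebook itself, the same function could presumably be called with `response_format["json_schema"]["schema"]`, before extra keys such as `custom_id` and `company` are attached to the record.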


## 3 · Helper to ask GPT-4o-mini once


In [5]:
def ask_innovation(name: str, country: str, temp: float = 0.7) -> dict:
    """Return a dict conforming to the schema."""
    sys_prompt = (
        "You are an analyst inferring a company's innovation attributes. "
        "Use your best judgement; if unsure, choose 'NA' where a field allows it. "
        "Return JSON only."
    )
    user_prompt = f"Company Name: {name}\nCountry: {country}"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": sys_prompt},
            {"role": "user",   "content": user_prompt}
        ],
        temperature=temp,
        max_tokens=16000,
        response_format=response_format
    )
    return json.loads(resp.choices[0].message.content)


## 4 · Generate profiles for each firm


In [6]:
records = []
for _, row in companies.iterrows():
    info = ask_innovation(row.NAME_INTERNAT, row.Country, temp=0.7)
    info["custom_id"] = row.custom_id
    info["company"]   = row.NAME_INTERNAT
    records.append(info)

df_profiles = pd.json_normalize(records)
df_profiles.head()


Unnamed: 0,has_patents,number_of_patents,name_patents_major,name_patents_others,climate_mitigation_innovation,climate_patents,new_or_improved_product,num_new_products,novelty_product_innovation,names_new_products,new_or_improved_process,process_innovation_area,details_process_improvements,R&D_investment,digital_technology_adoption,eco_innovations,green_innovation_example,innovation_strategy,custom_id,company
0,True,Very High (>500),"[Battery technology patents, Autopilot and sel...","[Solar technology patents, Energy storage pate...",True,True,True,More than 5,New to the world,"[Model S, Model 3, Model X, Model Y, Cybertruck]",True,production/manufacturing process,[Gigafactory production efficiency improvement...,True,True,True,"[Solar Roof, Powerwall, Supercharger network]",mainly develops new innovations in-house,cmp1,"Tesla, Inc."
1,True,High (101-500),"[Food preservation methods, Nutritional supple...",[],True,True,True,More than 5,New to the world,"[Plant-based milk alternatives, Nutritional pr...",True,production/manufacturing process,"[Improved supply chain efficiency, Reduced wat...",True,True,True,"[Recyclable packaging, Water-saving agricultur...",mainly develops new innovations in-house,cmp2,Nestlé SA
2,True,High (101-500),"[Battery technology, Electric vehicle designs,...","[Charging infrastructure, Energy storage systems]",True,True,True,More than 5,New to the world,"[Electric buses, Electric trucks, Hybrid vehic...",True,production/manufacturing process,"[Automated assembly lines, Advanced battery re...",True,True,True,"[Battery recycling, Solar energy integration, ...",mainly develops new innovations in-house,cmp3,BYD Auto


## 5 · Sanity-check a few columns


In [7]:
cols = ["company", "has_patents", "number_of_patents",
        "climate_mitigation_innovation", "digital_technology_adoption",
        "innovation_strategy"]
df_profiles[cols]


Unnamed: 0,company,has_patents,number_of_patents,climate_mitigation_innovation,digital_technology_adoption,innovation_strategy
0,"Tesla, Inc.",True,Very High (>500),True,True,mainly develops new innovations in-house
1,Nestlé SA,True,High (101-500),True,True,mainly develops new innovations in-house
2,BYD Auto,True,High (101-500),True,True,mainly develops new innovations in-house
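Eyeballing works for three rows; a cross-field check can make the same inspection mechanical. The sketch below is an assumption rather than part of the notebook: `find_patent_conflicts` is a hypothetical helper that takes plain dicts (for example from `df_profiles.to_dict("records")`) and flags rows whose `has_patents` flag contradicts the `number_of_patents` bucket. The sample rows are made up for illustration.

```python
# Hypothetical helper (not from the notebook): flags rows whose two patent
# fields contradict each other.
def find_patent_conflicts(rows):
    """Return the custom_id of every row with inconsistent patent fields."""
    conflicts = []
    for row in rows:
        has_patents = bool(row["has_patents"])
        is_none_bucket = row["number_of_patents"] == "None (0)"
        # Inconsistent: patents claimed but the bucket says zero,
        # or no patents claimed but the bucket is non-zero.
        if has_patents == is_none_bucket:
            conflicts.append(row.get("custom_id"))
    return conflicts


# Made-up rows for illustration; they are not model output.
sample_rows = [
    {"custom_id": "x1", "has_patents": True,  "number_of_patents": "High (101-500)"},
    {"custom_id": "x2", "has_patents": False, "number_of_patents": "High (101-500)"},
    {"custom_id": "x3", "has_patents": True,  "number_of_patents": "None (0)"},
    {"custom_id": "x4", "has_patents": False, "number_of_patents": "None (0)"},
]

print(find_patent_conflicts(sample_rows))
```

Rows returned by the helper could then be set aside for manual review instead of being loaded downstream.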


In [8]:
df_profiles

Unnamed: 0,has_patents,number_of_patents,name_patents_major,name_patents_others,climate_mitigation_innovation,climate_patents,new_or_improved_product,num_new_products,novelty_product_innovation,names_new_products,new_or_improved_process,process_innovation_area,details_process_improvements,R&D_investment,digital_technology_adoption,eco_innovations,green_innovation_example,innovation_strategy,custom_id,company
0,True,Very High (>500),"[Battery technology patents, Autopilot and sel...","[Solar technology patents, Energy storage pate...",True,True,True,More than 5,New to the world,"[Model S, Model 3, Model X, Model Y, Cybertruck]",True,production/manufacturing process,[Gigafactory production efficiency improvement...,True,True,True,"[Solar Roof, Powerwall, Supercharger network]",mainly develops new innovations in-house,cmp1,"Tesla, Inc."
1,True,High (101-500),"[Food preservation methods, Nutritional supple...",[],True,True,True,More than 5,New to the world,"[Plant-based milk alternatives, Nutritional pr...",True,production/manufacturing process,"[Improved supply chain efficiency, Reduced wat...",True,True,True,"[Recyclable packaging, Water-saving agricultur...",mainly develops new innovations in-house,cmp2,Nestlé SA
2,True,High (101-500),"[Battery technology, Electric vehicle designs,...","[Charging infrastructure, Energy storage systems]",True,True,True,More than 5,New to the world,"[Electric buses, Electric trucks, Hybrid vehic...",True,production/manufacturing process,"[Automated assembly lines, Advanced battery re...",True,True,True,"[Battery recycling, Solar energy integration, ...",mainly develops new innovations in-house,cmp3,BYD Auto


## 6 · Extension ideas

* **Batch endpoint**: profile 10 000 firms by writing one JSONL request line per
  company, uploading the file, and creating a batch at `/v1/batches` with
  `endpoint="/v1/chat/completions"` and `completion_window="24h"`.
* **Validation loop**: flag rows where `has_patents` is `false` but
  `number_of_patents` is anything other than `"None (0)"` (for example
  `"High (101-500)"`), and route them for manual review.
* **Embedding similarity**: embed `names_new_products` and cluster to spot
  tech-adjacency between firms.
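The batch idea above could start from a sketch like the following, which builds one request line per company. Several things here are assumptions: `build_batch_lines` is a hypothetical helper, the system prompt is a shortened stand-in, and `{"type": "json_object"}` stands in for the notebook's full `response_format`. The upload and submission calls are left as comments because they need network access and an API key; check them against the current OpenAI Batch API documentation before relying on them.

```python
import json


# Hypothetical helper (not from the notebook): one JSONL request line per company.
def build_batch_lines(companies, response_format, model="gpt-4o-mini", temperature=0.7):
    """companies is an iterable of (custom_id, name, country) tuples."""
    # Shortened stand-in for the notebook's system prompt.
    sys_prompt = (
        "You are an analyst inferring a company's innovation attributes. "
        "Return JSON only."
    )
    lines = []
    for custom_id, name, country in companies:
        request = {
            "custom_id": custom_id,          # used to match results back to firms
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "system", "content": sys_prompt},
                    {"role": "user",
                     "content": f"Company Name: {name}\nCountry: {country}"},
                ],
                "temperature": temperature,
                "response_format": response_format,
            },
        }
        # ensure_ascii=False keeps names such as "Nestlé SA" readable in the file.
        lines.append(json.dumps(request, ensure_ascii=False))
    return lines


demo_lines = build_batch_lines(
    [("cmp1", "Tesla, Inc.", "USA"), ("cmp2", "Nestlé SA", "CHE")],
    response_format={"type": "json_object"},  # stand-in for the strict schema
)
print(len(demo_lines))

# Submission, sketched as comments (requires network access and a configured client):
# with open("batch_input.jsonl", "w", encoding="utf-8") as fh:
#     fh.write("\n".join(demo_lines) + "\n")
# batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
# batch = client.batches.create(
#     input_file_id=batch_file.id,
#     endpoint="/v1/chat/completions",
#     completion_window="24h",
# )
```

Keeping `custom_id` in every line matters because batch results are not guaranteed to come back in input order.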
