# Step-by-step guide to encoding data from the `materials_data.csv` file

## 0. Testing a simple GPT query from Python

In [2]:
from openai import OpenAI
import os

# Initialize client with your environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Test query
response = client.chat.completions.create(
    model="gpt-4o-mini",  # fast + cheap, good for prototyping
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me a triple <source, function, application> for denim fabric."}
    ]
)

print(response.choices[0].message.content)

Sure! Here’s a triple related to denim fabric:

**<source: cotton, function: durable clothing, application: jeans>** 

This indicates that denim is sourced from cotton, serves the function of being durable clothing, and is commonly used in the application of making jeans.


## 1. Imports, client and small helpers

In [36]:
# Cell 1: setup
import os, time, json, math, random
import pandas as pd
from openai import OpenAI

# Uses your OPENAI_API_KEY from the environment
client = OpenAI()

MODEL = "gpt-4o-mini"   # inexpensive, supports JSON mode
TEMP  = 0.3             # allow light inference while staying stable

def backoff_sleep(attempt, base=1.0, cap=30.0):
    # exponential backoff with jitter
    return min(cap, base * (2 ** attempt)) + random.random() * 0.5
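
To make the retry timing concrete, the sketch below repeats the helper and prints the delay for each attempt. The cap is applied before the jitter, so the longest possible wait is just under 30.5 seconds. The choice of five attempts mirrors the `max_retries=5` default used later; nothing here calls the API.

```python
import random

def backoff_sleep(attempt, base=1.0, cap=30.0):
    # Same helper as in the cell above: capped exponential delay plus up to 0.5 s of jitter
    return min(cap, base * (2 ** attempt)) + random.random() * 0.5

# Deterministic part of the delay for attempts 0..4
schedule = [min(30.0, 1.0 * (2 ** a)) for a in range(5)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0]

for attempt, floor in enumerate(schedule):
    delay = backoff_sleep(attempt)
    print(f"attempt {attempt}: {delay:.2f}s (between {floor:.1f}s and {floor + 0.5:.1f}s)")
```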


## 2. Build the system and user messages for one row

In [61]:
# Cell 2: message builder
SYSTEM_MSG = """
You are an information extractor that outputs a SINGLE JSON object per input row:
{
  "source": string,          // nouns only; a single material responsible for the functions
  "functions": string[],     // 1–6 items; each is verb-noun or adjective-noun
  "application": string,     // nouns only; a single, concrete artifact or use-case that solves a human design problem
  "notes": string            // brief rationale; state if external knowledge was used
}

Priorities:
1) Ground in the row (title, post text, category, tags, material properties).
2) You MAY use widely known outside knowledge to choose a specific application, but do not contradict the row.

Rules:
- “source” must be a material term (not a brand/company).
- Each “functions” item must be verb-noun or adjective-noun.
- “application” must be SPECIFIC and DESIGN-SOLVING (e.g., a particular product, component, device, structure, or engineered use-case), not a broad sector or vague category.
- Avoid generic words alone such as: packaging, construction, furniture, apparel, insulation, coating, composite, component, product, prototype, architecture, interior, automotive, consumer goods, decor. If such a word appears, qualify it into a concrete artifact or engineered use (2–8 words).
- Prefer the most specific application explicitly mentioned or strongly implied by the row’s text/tags/properties; if several are plausible, pick the single most defensible one.
- If the row explicitly rules an application out, do not use it.
- Do not mention the name of the material in the application field.
- Applications must be something that can be used or built today with current technology, not a futuristic or speculative concept or idea.
- If the text includes a specific application, use the general aspect of it rather than the specific trademarked name.
- Output valid JSON only, no extra text.
"""

def build_user_msg(row_dict):
    """
    Row dict can contain: name, url, category, code, country, brand, post_text, properties, tags
    We'll pass only what we have.
    """
    # Only include non-empty fields to keep the prompt compact
    payload = {k: v for k, v in row_dict.items() if pd.notna(v) and str(v).strip() != ""}
    # IMPORTANT: make properties a structured list if it is JSON-like; otherwise keep raw string
    return (
        "Extract <source, functions[], application> using this row.\n"
        "Row data (JSON):\n" + json.dumps(payload, ensure_ascii=False)
    )
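
`build_user_msg` sends every non-empty column as a flat key. If a nested layout is preferred, one possible approach is to group the property columns before serializing. The sketch below is an assumption, not part of the pipeline: `group_properties` and `is_missing` are hypothetical helpers, the prefixes are read off the column names of `materials_data.csv`, and the example row and its values are made up. It uses only the standard library, with `is_missing` standing in for `pd.notna`.

```python
import json
import math

# Prefixes taken from the property column names in materials_data.csv
PROPERTY_PREFIXES = ("Sensorial_", "Technical_")

def is_missing(value):
    # Treat None, NaN and blank strings as missing (stdlib stand-in for pd.notna)
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""

def group_properties(row_dict):
    """Hypothetical helper: nest prefixed property columns under 'properties'."""
    payload, properties = {}, {}
    for key, value in row_dict.items():
        if is_missing(value):
            continue
        for prefix in PROPERTY_PREFIXES:
            if key.startswith(prefix):
                group = prefix.rstrip("_").lower()  # "Sensorial_" -> "sensorial"
                properties.setdefault(group, {})[key[len(prefix):]] = value
                break
        else:
            # No prefix matched: keep the column at the top level
            payload[key] = value
    if properties:
        payload["properties"] = properties
    return payload

# Made-up row for illustration only
example_row = {
    "material_name": "Example material",
    "Brand": float("nan"),
    "tags": "",
    "Sensorial_Texture": "Medium",
    "Technical_Weight": "Light",
}
print(json.dumps(group_properties(example_row), ensure_ascii=False, indent=2))
```

The result could be passed to `json.dumps` in place of `payload`. Whether the nested form improves extraction quality is untested here.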


## 3. One-row encoder with retry + JSON enforcement

In [62]:
# Cell 3: single-row encode function
def encode_row_to_triple(row_dict, max_retries=5):
    user_msg = build_user_msg(row_dict)
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model=MODEL,
                temperature=TEMP,
                response_format={"type": "json_object"},  # enforce JSON
                messages=[
                    {"role": "system", "content": SYSTEM_MSG.strip()},
                    {"role": "user",   "content": user_msg}
                ],
                max_tokens=300,
                n=1,
            )
            content = resp.choices[0].message.content
            data = json.loads(content)

            # Minimal schema guard
            source = (data.get("source") or "").strip()
            functions = data.get("functions") or []
            application = (data.get("application") or "").strip()
            notes = (data.get("notes") or "").strip()

            # Normalize: collapse whitespace and deduplicate
            def norm_phrase(s):
                return " ".join(str(s).strip().split())

            source = norm_phrase(source)
            application = norm_phrase(application)
            functions = [norm_phrase(f) for f in functions if norm_phrase(f)]
            # dedupe functions while preserving order
            seen = set()
            functions = [f for f in functions if not (f in seen or seen.add(f))]

            usage = getattr(resp, "usage", None)
            prompt_toks = getattr(usage, "prompt_tokens", None) if usage else None
            comp_toks   = getattr(usage, "completion_tokens", None) if usage else None

            return {
                "source": source or "unknown",
                "functions": functions,
                "application": application or "unknown",
                "notes": notes,
                "model": MODEL,
                "prompt_tokens": prompt_toks,
                "completion_tokens": comp_toks
            }

        except Exception as e:
            msg = str(e)
            # Rate limiting / transient errors: backoff and retry
            if "rate" in msg.lower() or "temporar" in msg.lower() or "429" in msg:
                time.sleep(backoff_sleep(attempt))
                continue
            # Quota/billing/etc: just surface it
            return {
                "source": "unknown",
                "functions": [],
                "application": "unknown",
                "notes": f"error: {msg}",
                "model": MODEL,
                "prompt_tokens": None,
                "completion_tokens": None
            }
    # If we exhausted retries:
    return {
        "source": "unknown",
        "functions": [],
        "application": "unknown",
        "notes": "error: max retries exceeded",
        "model": MODEL,
        "prompt_tokens": None,
        "completion_tokens": None
    }
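
The schema guard above only fills in defaults; it does not check the rules stated in `SYSTEM_MSG`. A minimal sketch of such a check follows. `check_triple` is a hypothetical helper, the generic-word list is copied from the system prompt, and the two example triples are illustrative (the first reuses the first two functions of the vinyl row from the test output further below).

```python
# Generic words listed in SYSTEM_MSG that must not stand alone as an application
GENERIC_APPLICATIONS = {
    "packaging", "construction", "furniture", "apparel", "insulation",
    "coating", "composite", "component", "product", "prototype",
    "architecture", "interior", "automotive", "consumer goods", "decor",
}

def check_triple(triple):
    """Hypothetical checker: return a list of rule violations (empty if none)."""
    problems = []
    source = str(triple.get("source", "")).strip()
    functions = triple.get("functions", [])
    application = str(triple.get("application", "")).strip()

    if not source or source == "unknown":
        problems.append("missing source")
    if not 1 <= len(functions) <= 6:
        problems.append(f"expected 1-6 functions, got {len(functions)}")
    if application.lower() in GENERIC_APPLICATIONS:
        problems.append("application is a bare generic word")
    if source and source.lower() in application.lower():
        problems.append("application mentions the source material")
    return problems

good = {
    "source": "vinyl",
    "functions": ["reduce impact sounds", "absorb footsteps"],
    "application": "acoustic flooring for public spaces",
}
bad = {"source": "cotton", "functions": [], "application": "Packaging"}

print(check_triple(good))  # []
print(check_triple(bad))   # ['expected 1-6 functions, got 0', 'application is a bare generic word']
```

Rows with a non-empty result could be logged or re-sent to the model. The substring test for the source name is deliberately crude and may flag legitimate applications.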


## 4. Load materials data

In [63]:
# Cell 4: load the dataset
df = pd.read_csv("materials_data.csv")  # <-- change filename if needed

# Peek at the available columns
df.columns.tolist()

['url',
 'material_name',
 'Category',
 'Code',
 'Country',
 'Brand',
 'post_text',
 'tags',
 'Sensorial_Glossiness',
 'Sensorial_Translucence',
 'Sensorial_Structure',
 'Sensorial_Texture',
 'Sensorial_Hardness',
 'Sensorial_Temperature',
 'Sensorial_Acoustics',
 'Sensorial_Odour',
 'Technical_Fire Resistance',
 'Technical_UV Resistance',
 'Technical_Weather Resistance',
 'Technical_Scratch Resistance',
 'Technical_Weight',
 'Technical_Chemical Resistance',
 'Technical_Renewable']

## 5. Select a few rows and test

In [65]:
# Cell 5: run on the first 10 rows to validate output shape
sample_n = 10
rows = []
total_prompt = 0
total_completion = 0

for i in range(min(sample_n, len(df))):
    row = df.iloc[i].to_dict()
    out = encode_row_to_triple(row)
    total_prompt += (out.get("prompt_tokens") or 0)
    total_completion += (out.get("completion_tokens") or 0)

    rows.append({
        "name": row.get("name", ""),
        "url": row.get("url", ""),
        "source": out["source"],
        "functions": "; ".join(out["functions"]),
        "application": out["application"],
        "notes": out["notes"],
        "model": out["model"],
        "prompt_tokens": out["prompt_tokens"],
        "completion_tokens": out["completion_tokens"]
    })

test_df = pd.DataFrame(rows)
test_df


| | name | url | source | functions | application | notes | model | prompt_tokens | completion_tokens |
|---|---|---|---|---|---|---|---|---|---|
| 0 | | https://materialdistrict.com/material/013-denim/ | denim | recycle-yarn; weave-fabric; support-innovation | art installation and sustainable textile products | The application is derived from the use of the... | gpt-4o-mini | 1045 | 81 |
| 1 | | https://materialdistrict.com/material/100-bact... | bacterial dye | create color; provide sustainability; reduce w... | microbial color library for textiles | External knowledge about sustainable dyeing pr... | gpt-4o-mini | 778 | 67 |
| 2 | | https://materialdistrict.com/material/100-basa... | Basalt | reinforcement-fabric; prevent-algal growth; ex... | maritime reinforcement fabric | External knowledge about basalt properties and... | gpt-4o-mini | 762 | 80 |
| 3 | | https://materialdistrict.com/material/100-biob... | flax fibers | reduce-environmental impact; provide-structura... | interior wall panels | External knowledge about flax's properties and... | gpt-4o-mini | 869 | 75 |
| 4 | | https://materialdistrict.com/material/100-reje... | cotton | recycle-textiles; create-carpets; reduce-waste... | high-quality recycled carpets | External knowledge about recycling textiles an... | gpt-4o-mini | 900 | 69 |
| 5 | | https://materialdistrict.com/material/2-000-00... | waste flowers | provide raw materials; offer aesthetic value; ... | manufacturing of bio composites | External knowledge about material applications... | gpt-4o-mini | 821 | 60 |
| 6 | | https://materialdistrict.com/material/2-5d-print/ | ink | create-relief; dry-quickly; print-colourful-pa... | wall panels with raised graphics | External knowledge of printing techniques was ... | gpt-4o-mini | 987 | 66 |
| 7 | | https://materialdistrict.com/material/2000n-pr... | Colback | moulding-material; tufting-material; finishing... | collection of shoes | External knowledge about Colback's use in auto... | gpt-4o-mini | 951 | 65 |
| 8 | | https://materialdistrict.com/material/2tec2-hi... | vinyl | reduce impact sounds; absorb footsteps; provid... | acoustic flooring for public spaces | External knowledge about flooring applications... | gpt-4o-mini | 789 | 65 |
| 9 | | https://materialdistrict.com/material/2tec2-ma... | marble | provide-nature-inspired aesthetics; offer-dura... | high-quality floor covering | External knowledge about flooring applications... | gpt-4o-mini | 926 | 85 |
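
The test loop accumulates `total_prompt` and `total_completion` but never reports them. As a rough sketch, the per-row token counts from the output above can be summed and scaled to the full dataset before committing to the complete run. The row count of 3163 is taken from the progress log of the full run in section 6. The linear extrapolation assumes the first ten rows are representative, which may not hold. No prices are assumed here; multiply by the current per-token rates to get a cost estimate.

```python
# Token counts copied from the prompt_tokens / completion_tokens columns above
prompt_tokens = [1045, 778, 762, 869, 900, 821, 987, 951, 789, 926]
completion_tokens = [81, 67, 80, 75, 69, 60, 66, 65, 65, 85]

TOTAL_ROWS = 3163  # dataset size reported by the progress log of the full run

def extrapolate(sample, total_rows):
    """Scale the sample's mean token count up to total_rows rows."""
    return round(sum(sample) * total_rows / len(sample))

print(sum(prompt_tokens), sum(completion_tokens))  # 8828 713
print(extrapolate(prompt_tokens, TOTAL_ROWS))      # 2792296
print(extrapolate(completion_tokens, TOTAL_ROWS))  # 225522
```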


## 6. Encode all data

In [67]:
rows = []
total_prompt = 0
total_completion = 0

for i in range(len(df)):
    row = df.iloc[i].to_dict()
    out = encode_row_to_triple(row)
    total_prompt += (out.get("prompt_tokens") or 0)
    total_completion += (out.get("completion_tokens") or 0)

    rows.append({
        "name": row.get("name", ""),
        "url": row.get("url", ""),
        "source": out["source"],
        "functions": "; ".join(out["functions"]),
        "application": out["application"],
        "notes": out["notes"],
        "model": out["model"],
        "prompt_tokens": out["prompt_tokens"],
        "completion_tokens": out["completion_tokens"]
    })
    if (i + 1) % 50 == 0 or i == len(df) - 1:
        print(f"Processed {i + 1}/{len(df)} rows...")

out_df = pd.DataFrame(rows)
out_df.to_csv("materials_data_encoded.csv", index=False)

Processed 50/3163 rows...
Processed 100/3163 rows...
Processed 150/3163 rows...
Processed 200/3163 rows...
Processed 250/3163 rows...
Processed 300/3163 rows...
Processed 350/3163 rows...
Processed 400/3163 rows...
Processed 450/3163 rows...
Processed 500/3163 rows...
Processed 550/3163 rows...
Processed 600/3163 rows...
Processed 650/3163 rows...
Processed 700/3163 rows...
Processed 750/3163 rows...
Processed 800/3163 rows...
Processed 850/3163 rows...
Processed 900/3163 rows...
Processed 950/3163 rows...
Processed 1000/3163 rows...
Processed 1050/3163 rows...
Processed 1100/3163 rows...
Processed 1150/3163 rows...
Processed 1200/3163 rows...
Processed 1250/3163 rows...
Processed 1300/3163 rows...
Processed 1350/3163 rows...
Processed 1400/3163 rows...
Processed 1450/3163 rows...
Processed 1500/3163 rows...
Processed 1550/3163 rows...
Processed 1600/3163 rows...
Processed 1650/3163 rows...
Processed 1700/3163 rows...
Processed 1750/3163 rows...
Processed 1800/3163 rows...
Processed 18
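
The loop above keeps all results in memory and writes the CSV only at the end, so an interruption partway through loses every encoded row. One way to make the run resumable is to append each result to the file as it arrives and skip rows that are already present. The sketch below uses only the standard library and assumes that `url` uniquely identifies a row. `encode_with_checkpoint`, `load_done_urls` and the stub encoder are hypothetical names, and the stub replaces `encode_row_to_triple` so the demo makes no API call.

```python
import csv
import os
import tempfile

FIELDS = ["url", "source", "functions", "application", "notes"]

def load_done_urls(path):
    """Return the set of URLs already written to the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {r["url"] for r in csv.DictReader(f)}

def encode_with_checkpoint(rows, encode_fn, path):
    """Hypothetical driver: append one CSV line per row, skipping finished URLs."""
    done = load_done_urls(path)
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    written = 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for row in rows:
            if row["url"] in done:
                continue  # encoded in an earlier run
            out = encode_fn(row)
            writer.writerow({
                "url": row["url"],
                "source": out["source"],
                "functions": "; ".join(out["functions"]),
                "application": out["application"],
                "notes": out["notes"],
            })
            f.flush()  # hand the row to the OS before the next (slow) call
            written += 1
    return written

def stub_encode(row):
    # Stand-in for encode_row_to_triple; returns fixed values
    return {"source": "demo", "functions": ["do thing"], "application": "demo use", "notes": ""}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.csv")
    first = encode_with_checkpoint([{"url": "a"}, {"url": "b"}], stub_encode, path)
    second = encode_with_checkpoint([{"url": "a"}, {"url": "b"}, {"url": "c"}], stub_encode, path)
    print(first, second)  # 2 1
```

With the real data, `rows` could be `df.to_dict("records")` and `encode_fn` could be `encode_row_to_triple`.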

## 7. Verify and post-processing

In [5]:
import pandas as pd
out_df = pd.read_csv("materials_data_encoded.csv")

final_df = out_df.drop_duplicates(subset=["source"], keep="first")[['source', 'functions', 'application']]
final_df.to_csv("materials_data_final.csv", index=False)
final_df

| | source | functions | application |
|---|---|---|---|
| 0 | denim | recycle-yarn; weave-fabric; support-innovation... | large work of art presented to the Dutch royal... |
| 1 | bacterial dye | produce pigments; create sustainable alternati... | microbial colour library for dyeing textiles |
| 2 | Basalt | reinforce-fabric; prevent-algal growth; extend... | reinforcement fabric for maritime applications |
| 3 | flax fibers | provide structural support; reduce environment... | interior wall panels |
| 4 | cotton | recycle-textiles; create-carpets; reduce-waste... | high-quality recycled carpets |
| ... | ... | ... | ... |
| 3145 | clay mortar | moisture buffering; interior finish; colour va... | interior wall and ceiling finish |
| 3154 | Belgian blue stone | bonded slabs; processed limestone; finished pr... | kitchen and bathroom surfaces |
| 3155 | lignocellulosic natural fibres | produce-binders; engineer-composites; absorb-a... | biobased composite materials for construction |
| 3160 | Zinc | absorb-energy; increase-strength; create-foam;... | energy-absorbing structural component |
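
Before relying on the deduplicated file, it may help to count how many rows failed and how many sources repeat. `drop_duplicates(subset=["source"])` keeps only the first material per source and compares case-sensitively, so "Basalt" and "basalt" would survive as separate rows. The sketch below works on a plain list of dicts (for example `out_df.to_dict("records")`). `summarize_encoded` is a hypothetical helper and the three example rows are made up.

```python
from collections import Counter

def summarize_encoded(rows):
    """Hypothetical check: count failed rows and sources that occur more than once."""
    failed = sum(1 for r in rows if str(r.get("notes", "")).startswith("error:"))
    unknown = sum(1 for r in rows if r.get("source") == "unknown")
    # Compare sources case-insensitively and ignore the "unknown" placeholder
    counts = Counter(
        str(r.get("source", "")).lower() for r in rows if r.get("source") != "unknown"
    )
    repeated = {source: n for source, n in counts.items() if n > 1}
    return {"failed": failed, "unknown_source": unknown, "repeated_sources": repeated}

# Made-up rows for illustration only
example_rows = [
    {"source": "cotton", "notes": ""},
    {"source": "Cotton", "notes": ""},
    {"source": "unknown", "notes": "error: max retries exceeded"},
]
print(summarize_encoded(example_rows))
# {'failed': 1, 'unknown_source': 1, 'repeated_sources': {'cotton': 2}}
```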
