# Survey Transformation Basics

This notebook walks through a survey transformation in order: the input data, the structured outputs, and the practical benefits.


## 1. Setup

Configure the pandas extension and the responses model once.


In [1]:
import os

import pandas as pd
from pydantic import BaseModel, Field

from openaivec import pandas_ext

assert os.getenv("OPENAI_API_KEY") or os.getenv("AZURE_OPENAI_BASE_URL"), (
    "Set OPENAI_API_KEY or Azure OpenAI environment variables before running this notebook."
)

pandas_ext.set_responses_model("gpt-5.2")


## 2. Input: survey responses DataFrame

Each row contains one free-text survey response.


In [2]:
survey_df = pd.DataFrame(
    {
        "response_id": ["RESP_001", "RESP_002", "RESP_003", "RESP_004"],
        "response": [
            "I am a 24-year-old student in Seattle and enjoy gaming and anime.",
            "I am a 41-year-old manager in New York, interested in fitness and business books.",
            "I am a 33-year-old software engineer in Austin who likes hiking and coffee.",
            "I am retired in Denver and spend time gardening and local community events.",
        ],
    }
)

survey_df


| | response_id | response |
|---|---|---|
| 0 | RESP_001 | I am a 24-year-old student in Seattle and enjo... |
| 1 | RESP_002 | I am a 41-year-old manager in New York, intere... |
| 2 | RESP_003 | I am a 33-year-old software engineer in Austin... |
| 3 | RESP_004 | I am retired in Denver and spend time gardenin... |
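
Before sending text to the model, it can help to confirm that the input really has one usable response per row. The check below is a minimal sketch in plain pandas. The blank third row and the names `raw_df`, `is_blank`, and `clean_df` are hypothetical and not part of the notebook's data.

```python
import pandas as pd

# Hypothetical input with one blank response added for illustration.
raw_df = pd.DataFrame(
    {
        "response_id": ["RESP_001", "RESP_002", "RESP_003"],
        "response": [
            "I am a 24-year-old student in Seattle and enjoy gaming and anime.",
            "I am retired in Denver and spend time gardening and local community events.",
            "   ",
        ],
    }
)

# Every response should be identifiable by a unique ID.
ids_unique = raw_df["response_id"].is_unique

# Flag rows whose text is missing or only whitespace.
stripped = raw_df["response"].str.strip()
is_blank = stripped.isna() | stripped.eq("")

# Keep only rows that carry actual text.
clean_df = raw_df.loc[~is_blank].reset_index(drop=True)
```

Dropping blank rows up front avoids spending model calls on responses that cannot yield a profile.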


## 3. Output A: structured profile per response

Convert each free-text response into a typed profile.


In [3]:
class SurveyProfile(BaseModel):
    age_group: str = Field(description="One of: 18-25, 26-35, 36-45, 46-55, 56+")
    occupation_category: str = Field(description="Short category such as student, technology, business")
    location: str = Field(description="City or region")
    interests: list[str] = Field(description="Top interests")


profiles = survey_df["response"].ai.responses(
    instructions=(
        "Extract age group, occupation category, location, and interests "
        "from each survey response. Keep labels concise."
    ),
    response_format=SurveyProfile,
)

survey_df.assign(profile=profiles)[["response_id", "profile"]]


| | response_id | profile |
|---|---|---|
| 0 | RESP_001 | age_group='18-25' occupation_category='student... |
| 1 | RESP_002 | age_group='36-45' occupation_category='managem... |
| 2 | RESP_003 | age_group='26-35' occupation_category='technol... |
| 3 | RESP_004 | age_group='56+' occupation_category='retired' ... |
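
The `age_group` field in `SurveyProfile` is a plain `str`, so the five allowed buckets are enforced only by the field description. One possible tightening, sketched here as an assumption rather than as part of the original notebook, is to declare the buckets with `typing.Literal` so that pydantic rejects any other label. The names `AgeGroup` and `StrictSurveyProfile` are hypothetical, and whether `ai.responses` forwards the enumeration to the model has not been verified here.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

# Hypothetical alias listing the buckets named in the original description.
AgeGroup = Literal["18-25", "26-35", "36-45", "46-55", "56+"]


class StrictSurveyProfile(BaseModel):
    """Hypothetical stricter variant of SurveyProfile."""

    age_group: AgeGroup = Field(description="Age bucket")
    occupation_category: str = Field(description="Short category such as student, technology, business")
    location: str = Field(description="City or region")
    interests: list[str] = Field(description="Top interests")


# A value from the allowed list validates normally.
ok = StrictSurveyProfile(
    age_group="18-25",
    occupation_category="student",
    location="Seattle",
    interests=["gaming", "anime"],
)

# A label outside the list is rejected at validation time.
try:
    StrictSurveyProfile(
        age_group="twenties",
        occupation_category="student",
        location="Seattle",
        interests=["gaming"],
    )
    rejected = False
except ValidationError:
    rejected = True
```

With the constraint in the type, an off-list label surfaces as a validation error instead of a silent extra category in later aggregations.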


## 4. Output B: analysis-ready columns

Expand structured profiles into regular columns for aggregation.


In [4]:
analysis_df = survey_df[["response_id"]].join(profiles.rename("profile").ai.extract())

analysis_df


| | response_id | profile_age_group | profile_occupation_category | profile_location | profile_interests |
|---|---|---|---|---|---|
| 0 | RESP_001 | 18-25 | student | Seattle | [gaming, anime] |
| 1 | RESP_002 | 36-45 | management | New York | [fitness, business books] |
| 2 | RESP_003 | 26-35 | technology | Austin | [hiking, coffee] |
| 3 | RESP_004 | 56+ | retired | Denver | [gardening, community events] |
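
To make the expansion step less opaque, the sketch below reproduces a similar flattening with plain pandas on a Series of dictionaries. It is an approximation for illustration only, not the implementation of `ai.extract`. The sample values are copied from the output above, and the dictionaries stand in for the profile objects.

```python
import pandas as pd

# Dictionaries standing in for the structured profile objects.
profile_dicts = pd.Series(
    [
        {
            "age_group": "18-25",
            "occupation_category": "student",
            "location": "Seattle",
            "interests": ["gaming", "anime"],
        },
        {
            "age_group": "36-45",
            "occupation_category": "management",
            "location": "New York",
            "interests": ["fitness", "business books"],
        },
    ],
    name="profile",
)

# One column per field, prefixed with the Series name.
flat = pd.DataFrame(profile_dicts.tolist(), index=profile_dicts.index).add_prefix(
    f"{profile_dicts.name}_"
)
```

The prefix comes from the Series name, which is consistent with the notebook renaming `profiles` to `"profile"` before expanding it.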


In [5]:
print("Age group distribution")
print(analysis_df["profile_age_group"].value_counts())

print("\nOccupation category distribution")
print(analysis_df["profile_occupation_category"].value_counts())


Age group distribution
profile_age_group
18-25    1
36-45    1
26-35    1
56+      1
Name: count, dtype: int64

Occupation category distribution
profile_occupation_category
student       1
management    1
technology    1
retired       1
Name: count, dtype: int64
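
The `profile_interests` column holds a list per row, so it does not fit the `value_counts` pattern used for the scalar columns. A minimal sketch of one way to count individual interests is to `explode` the lists into one row per interest first. The data is copied from the `analysis_df` output above, and `interest_counts` is a name introduced here.

```python
import pandas as pd

# Rebuild the relevant columns from the output shown earlier.
analysis_df = pd.DataFrame(
    {
        "response_id": ["RESP_001", "RESP_002", "RESP_003", "RESP_004"],
        "profile_interests": [
            ["gaming", "anime"],
            ["fitness", "business books"],
            ["hiking", "coffee"],
            ["gardening", "community events"],
        ],
    }
)

# One row per (response, interest) pair, then count each interest.
interest_counts = analysis_df["profile_interests"].explode().value_counts()
print(interest_counts)
```

With only four responses every interest appears once. On a larger survey the same two calls would rank interests by frequency.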


## 5. Benefits

**Main inputs**
- Free-text survey responses (`response` column)
- Extraction schema (`SurveyProfile`)

**Main outputs**
- Structured profile objects, produced by `ai.responses`
- Flat analysis columns, produced by `ai.extract`

**Why this helps**
- Turns qualitative text into queryable fields
- Keeps transformation logic explicit and reproducible
- Makes downstream segmentation and BI reporting easier
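
As an illustration of what queryable fields allow, the sketch below selects a segment and builds a small cross-tabulation with standard pandas calls. The data is copied from the `analysis_df` output earlier in the notebook. The under-36 segment and the names `under_36` and `age_by_occupation` are hypothetical examples.

```python
import pandas as pd

# Rebuild the scalar columns from the output shown earlier.
analysis_df = pd.DataFrame(
    {
        "response_id": ["RESP_001", "RESP_002", "RESP_003", "RESP_004"],
        "profile_age_group": ["18-25", "36-45", "26-35", "56+"],
        "profile_occupation_category": ["student", "management", "technology", "retired"],
        "profile_location": ["Seattle", "New York", "Austin", "Denver"],
    }
)

# Segment: respondents in the two youngest age buckets.
under_36 = analysis_df[analysis_df["profile_age_group"].isin(["18-25", "26-35"])]

# Reporting view: response counts by age group and occupation category.
age_by_occupation = pd.crosstab(
    analysis_df["profile_age_group"],
    analysis_df["profile_occupation_category"],
)
```

Neither step needs another model call, since both operate on the already extracted columns.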
