# 05 - LangChain Output Parsers for Environmental Data
## SEEDS Nexus AI Agents Academy

This notebook demonstrates how to use LangChain output parsers to structure environmental data and sustainability information. You'll learn to transform AI-generated content into organized, usable formats.

### Learning Objectives
- Master 5 different LangChain output parsers
- Structure environmental data using different parsing techniques
- Validate sustainability information with Pydantic models
- Create robust data pipelines for environmental research

### Environmental Focus
We'll use examples related to climate data, renewable energy statistics, and sustainability metrics throughout this tutorial.

### Output Parsers We'll Cover:
1. **StrOutputParser** - Basic string output for simple text
2. **JsonOutputParser** - Structured data for environmental metrics
3. **CommaSeparatedListOutputParser** - Tabular data for climate statistics
4. **StructuredOutputParser** - Organized environmental reports
5. **PydanticOutputParser** - Validated sustainability models

In [None]:
# Setup Cell - Install Required Packages
# Run this cell first in Google Colab

!pip install langchain==0.1.0
!pip install langchain-openai==0.0.5
!pip install langchain-community==0.0.10
!pip install openai==1.12.0
!pip install pydantic==2.5.0
!pip install python-dotenv==1.0.0

## Setup: API Key Configuration

Before we start parsing environmental data, let's set up our OpenAI API key securely:

In [None]:
import os
import warnings
warnings.filterwarnings('ignore')

# Check if running in Colab
try:
    import google.colab
    IN_COLAB = True
    print("Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("Running locally")

# Set up OpenAI API key based on environment
if IN_COLAB:
    from google.colab import userdata
    from getpass import getpass

    try:
        openai_api_key = userdata.get('OPENAI_API_KEY')
        if openai_api_key:
            os.environ["OPENAI_API_KEY"] = openai_api_key
            print("✅ API key loaded from Google Colab secrets!")
        else:
            print("OpenAI API key not found in Colab secrets.")
            os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
            print("✅ API key set from input")
    except Exception as e:
        print(f"Note: {e}")
        os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
else:
    try:
        from dotenv import load_dotenv
        load_dotenv()
        api_key = os.getenv("OPENAI_API_KEY")
        if api_key:
            print("✅ API key loaded from .env file")
        else:
            print("⚠️ No API key found in .env file.")
            os.environ["OPENAI_API_KEY"] = "your-api-key-here"
    except ImportError:
        print("⚠️ python-dotenv not installed.")
        os.environ["OPENAI_API_KEY"] = "your-api-key-here"

if os.environ.get("OPENAI_API_KEY") in [None, "", "your-api-key-here"]:
    print("⚠️ WARNING: Please set your OpenAI API key before running the examples!")
else:
    print("✅ API key is set! Ready to proceed.\n")

In [None]:
# Import required libraries for environmental data parsing
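# Note: the classes are named JsonOutputParser and CommaSeparatedListOutputParser;
# LangChain has no JSONOutputParser or CSVOutputParser.
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain.output_parsers import (
    CommaSeparatedListOutputParser,
    StructuredOutputParser,
    ResponseSchema,
    PydanticOutputParser,
)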
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
from typing import List
import json

# Initialize the language model
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
print("✅ Libraries imported and LLM initialized!")

## 1. StrOutputParser - Simple Environmental Text

StrOutputParser is the most basic parser: it takes the model's reply and returns its text as a plain Python string, with no further structure. It is a good fit for simple environmental descriptions or summaries.

In [None]:
# Example: Generate simple environmental information
prompt = PromptTemplate.from_template(
    "In one sentence, describe the main environmental challenge facing {region}."
)
parser = StrOutputParser()
chain = prompt | llm | parser

# Test with different regions
regions = ["Arctic", "Amazon Rainforest", "Sahara Desert"]

for region in regions:
    result = chain.invoke({"region": region})
    print(f"{region}: {result}")
    print("-" * 50)

## 2. JsonOutputParser - Structured Environmental Data

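JsonOutputParser parses the model's JSON reply into a Python dictionary, which makes it a good fit for building structured environmental datasets. It checks that the reply is valid JSON but does not check which keys or value types are present.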

In [None]:
# Example: Generate structured renewable energy data
json_prompt = PromptTemplate.from_template(
    """Generate renewable energy data for a country in JSON format with these keys:
    - country: name of the country
    - solar_capacity: solar energy capacity in GW
    - wind_capacity: wind energy capacity in GW
    - renewable_percentage: percentage of total energy from renewables

    Country: {country}

    Return valid JSON only."""
)

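# The class is spelled JsonOutputParser and lives in langchain_core
from langchain_core.output_parsers import JsonOutputParser

parser = JsonOutputParser()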
chain = json_prompt | llm | parser

# Test with different countries
countries = ["Germany", "China", "Brazil"]

for country in countries:
    result = chain.invoke({"country": country})
    print(f"Renewable Energy Data for {country}:")
    print(json.dumps(result, indent=2))
    print("-" * 50)
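Chat models sometimes wrap JSON in a Markdown code fence or omit a requested key. The following is a minimal standard-library sketch of the kind of clean-up and key check you might add around a JSON reply. It is not LangChain's implementation; `extract_json`, `missing_keys`, and the sample reply (with placeholder values) are hypothetical names and data introduced only for illustration.

```python
import json
import re

REQUIRED_KEYS = {"country", "solar_capacity", "wind_capacity", "renewable_percentage"}


def extract_json(text: str) -> dict:
    """Parse a JSON object from a model reply, tolerating a Markdown code fence."""
    # Strip an optional ```json ... ``` wrapper around the payload.
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, flags=re.DOTALL)
    payload = match.group(1) if match else text.strip()
    return json.loads(payload)


def missing_keys(record: dict) -> set:
    """Return the required keys that are absent from the parsed record."""
    return REQUIRED_KEYS - set(record)


# Hypothetical model reply with placeholder values (not real statistics).
sample_reply = (
    "```json\n"
    '{"country": "Exampleland", "solar_capacity": 10.5, "wind_capacity": 7.0}\n'
    "```"
)

record = extract_json(sample_reply)
print("Parsed:", record)
print("Missing keys:", missing_keys(record))
```

A check like `missing_keys` is useful here because a JSON parser only guarantees well-formed JSON, not that every field you asked for is present.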

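## 3. CommaSeparatedListOutputParser - Climate Data Tables

LangChain's comma-separated parser is `CommaSeparatedListOutputParser`. It splits a comma-separated reply into a flat Python list of strings, so requesting one row per call lets you assemble tabular environmental data that can be easily imported into spreadsheets or data analysis tools.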

In [None]:
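# Example: Generate climate data rows as comma-separated values
from langchain.output_parsers import CommaSeparatedListOutputParser

# The parser returns one flat list of strings per reply, so we ask for
# a single row per call and assemble the table ourselves.
parser = CommaSeparatedListOutputParser()

csv_prompt = PromptTemplate(
    template="""Provide realistic climate data for {city} as a single line with exactly four values:
    City, Country, Annual_Temperature_C, Annual_Rainfall_mm

    {format_instructions}""",
    input_variables=["city"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

chain = csv_prompt | llm | parser

cities = ["Copenhagen", "Nairobi", "São Paulo", "Tokyo", "Sydney"]

print("Climate Data (CSV format):")
print("City, Country, Annual_Temperature_C, Annual_Rainfall_mm")
for city in cities:
    row = chain.invoke({"city": city})  # a list of strings
    print(", ".join(row))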
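When a reply contains several rows at once, Python's built-in `csv` module can turn the text into typed records. Below is a minimal sketch, assuming the reply has a header line followed by one city per line. `parse_climate_csv` is a hypothetical helper and the sample text uses placeholder values, not real measurements.

```python
import csv
import io


def parse_climate_csv(text: str) -> list[dict]:
    """Turn CSV text with a header row into a list of typed records."""
    # skipinitialspace drops the blank that follows each comma ("a, b" -> "a", "b").
    reader = csv.DictReader(io.StringIO(text.strip()), skipinitialspace=True)
    rows = []
    for raw in reader:
        rows.append({
            "city": raw["City"],
            "country": raw["Country"],
            # float() raises ValueError if the model returned non-numeric text.
            "annual_temperature_c": float(raw["Annual_Temperature_C"]),
            "annual_rainfall_mm": float(raw["Annual_Rainfall_mm"]),
        })
    return rows


# Placeholder data for illustration only.
sample_csv = """City, Country, Annual_Temperature_C, Annual_Rainfall_mm
City A, Country X, 12.5, 800
City B, Country Y, 25.0, 1500
"""

rows = parse_climate_csv(sample_csv)
for row in rows:
    print(row)
```

Converting the numeric columns at parse time means a malformed row fails early instead of surfacing later in an analysis step.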

## 4. StructuredOutputParser - Environmental Reports

StructuredOutputParser builds format instructions from a list of `ResponseSchema` fields and returns the model's reply as a dictionary with those keys. This makes it well suited to generating consistent environmental reports.

In [None]:
from langchain.output_parsers import ResponseSchema, StructuredOutputParser

# Define schema for carbon footprint analysis
carbon_schema = [
    ResponseSchema(
        name="sector",
        description="Economic sector (e.g., Transportation, Energy, Agriculture)"
    ),
    ResponseSchema(
        name="emissions_mt_co2",
        description="Annual CO2 emissions in million tonnes"
    ),
    ResponseSchema(
        name="reduction_potential",
        description="Potential reduction percentage with current technology"
    ),
    ResponseSchema(
        name="key_solutions",
        description="Main strategies for reducing emissions in this sector"
    )
]

parser = StructuredOutputParser.from_response_schemas(carbon_schema)
format_instructions = parser.get_format_instructions()

prompt = PromptTemplate(
    template="""
    Analyze carbon emissions for the {sector} sector in {country}.
    Provide realistic data and practical solutions.

    {format_instructions}
    """,
    input_variables=["sector", "country"],
    partial_variables={"format_instructions": format_instructions}
)

chain = prompt | llm | parser

# Test with different sectors
sectors = ["Transportation", "Energy Production", "Agriculture"]
country = "United States"

for sector in sectors:
    result = chain.invoke({"sector": sector, "country": country})
    print(f"\n=== {sector} Sector Analysis ===")
    for key, value in result.items():
        print(f"{key.replace('_', ' ').title()}: {value}")
    print("-" * 50)
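Because each `ResponseSchema` field defaults to a string type, numeric values such as emissions or percentages may come back as text like `"1,800"` or `"30%"`. Here is a minimal sketch of one way to coerce such values before doing arithmetic. `to_number` is a hypothetical helper, and the sample report uses placeholder values, not real emissions data.

```python
import re


def to_number(value) -> float:
    """Extract the first number from values such as '1,800', '30%' or 12."""
    if isinstance(value, (int, float)):
        return float(value)
    cleaned = str(value).replace(",", "")  # drop thousands separators
    match = re.search(r"-?\d+(?:\.\d+)?", cleaned)
    if match is None:
        raise ValueError(f"No number found in {value!r}")
    return float(match.group())


# Hypothetical parser output with placeholder values.
report = {
    "sector": "Transportation",
    "emissions_mt_co2": "1,800",
    "reduction_potential": "30%",
    "key_solutions": "Electrification, public transit",
}

emissions = to_number(report["emissions_mt_co2"])
reduction_pct = to_number(report["reduction_potential"])
print(f"Potential reduction: {emissions * reduction_pct / 100:.1f} Mt CO2")
```

Note that this helper keeps only the first number it finds, so a range such as `"30-40%"` would be read as 30. Treat it as a starting point, not a complete parser.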

## 5. PydanticOutputParser - Validated Environmental Models

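PydanticOutputParser validates the model's reply against a Pydantic model, checking field types and numeric bounds such as `ge` and `le`. It is the strictest of the five parsers and a good fit for production environmental data systems where accuracy is critical.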

In [None]:
# Define Pydantic models for environmental data validation
class ClimateData(BaseModel):
    location: str = Field(..., description="Geographic location")
    temperature_celsius: float = Field(..., description="Average annual temperature in Celsius", ge=-60, le=60)
    precipitation_mm: int = Field(..., description="Annual precipitation in millimeters", ge=0, le=12000)
    co2_concentration_ppm: float = Field(..., description="CO2 concentration in parts per million", ge=280, le=500)
    biodiversity_index: float = Field(..., description="Biodiversity index score", ge=0, le=1)

class RenewableEnergyProject(BaseModel):
    project_name: str = Field(..., description="Name of the renewable energy project")
    energy_type: str = Field(..., description="Type of renewable energy (solar, wind, hydro, etc.)")
    capacity_mw: int = Field(..., description="Energy capacity in megawatts", gt=0)
    annual_output_gwh: float = Field(..., description="Annual energy output in gigawatt-hours", gt=0)
    carbon_offset_tonnes: int = Field(..., description="Annual carbon offset in tonnes", gt=0)
    investment_usd: int = Field(..., description="Total investment in USD", gt=0)

# Test ClimateData parser
climate_parser = PydanticOutputParser(pydantic_object=ClimateData)
climate_instructions = climate_parser.get_format_instructions()

climate_prompt = PromptTemplate(
    template="""
    Provide realistic climate data for {location}.
    Include current environmental conditions and measurements.

    {format_instructions}
    """,
    input_variables=["location"],
    partial_variables={"format_instructions": climate_instructions}
)

climate_chain = climate_prompt | llm | climate_parser

# Test RenewableEnergyProject parser
energy_parser = PydanticOutputParser(pydantic_object=RenewableEnergyProject)
energy_instructions = energy_parser.get_format_instructions()

energy_prompt = PromptTemplate(
    template="""
    Create a realistic renewable energy project profile for a {energy_type} project in {country}.
    Use realistic values for capacity, output, and investment.

    {format_instructions}
    """,
    input_variables=["energy_type", "country"],
    partial_variables={"format_instructions": energy_instructions}
)

energy_chain = energy_prompt | llm | energy_parser

# Test both parsers
print("=== Climate Data Validation ===")
locations = ["Copenhagen, Denmark", "Nairobi, Kenya", "São Paulo, Brazil"]

for location in locations:
    try:
        result = climate_chain.invoke({"location": location})
        print(f"\n{location}:")
        print(f"  Temperature: {result.temperature_celsius}°C")
        print(f"  Precipitation: {result.precipitation_mm}mm")
        print(f"  CO2 Concentration: {result.co2_concentration_ppm}ppm")
        print(f"  Biodiversity Index: {result.biodiversity_index}")
    except Exception as e:
        print(f"Error parsing data for {location}: {e}")

print("\n" + "="*50)
print("=== Renewable Energy Project Validation ===")

energy_projects = [
    {"energy_type": "solar", "country": "India"},
    {"energy_type": "wind", "country": "Denmark"},
    {"energy_type": "hydro", "country": "Norway"}
]

for project in energy_projects:
    try:
        result = energy_chain.invoke(project)
        print(f"\n{result.project_name} ({result.energy_type.title()} in {project['country']}):")
        print(f"  Capacity: {result.capacity_mw} MW")
        print(f"  Annual Output: {result.annual_output_gwh} GWh")
        print(f"  Carbon Offset: {result.carbon_offset_tonnes:,} tonnes/year")
        print(f"  Investment: ${result.investment_usd:,}")
    except Exception as e:
        print(f"Error parsing {project['energy_type']} project: {e}")
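Field constraints such as `gt=0` check each value in isolation, so a project can pass validation while its numbers contradict each other. One cross-field sanity check is the capacity factor: annual output divided by the output the plant would deliver running at full capacity all year (8,760 hours). The sketch below uses plain functions with hypothetical names. The same logic could be moved into a custom validator on the model, which is left as an assumption here.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def capacity_factor(capacity_mw: float, annual_output_gwh: float) -> float:
    """Fraction of the theoretical maximum output that is actually delivered."""
    max_output_gwh = capacity_mw * HOURS_PER_YEAR / 1000  # MWh -> GWh
    return annual_output_gwh / max_output_gwh


def is_plausible(capacity_mw: float, annual_output_gwh: float) -> bool:
    """A plant cannot produce more than its capacity running all year."""
    return 0 < capacity_factor(capacity_mw, annual_output_gwh) <= 1


# Placeholder project: 100 MW producing 219 GWh per year.
cf = capacity_factor(100, 219)
print(f"Capacity factor: {cf:.2f}")
print("Plausible:", is_plausible(100, 219))
print("Plausible:", is_plausible(100, 1000))  # exceeds the 876 GWh maximum
```

With a parsed `RenewableEnergyProject`, you could call `is_plausible(result.capacity_mw, result.annual_output_gwh)` and flag or discard projects that fail.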

## Summary: Choosing the Right Parser for Environmental Data

### When to Use Each Parser:

1. **StrOutputParser**: 
   - Simple descriptions and summaries
   - Environmental narratives and explanations
   - Quick prototyping

2. **JsonOutputParser**: 
   - Structured environmental datasets
   - API responses for environmental services
   - Flexible data formats

3. **CommaSeparatedListOutputParser**: 
   - Tabular climate data
   - Data for spreadsheet analysis
   - Bulk data processing

4. **StructuredOutputParser**: 
   - Consistent environmental reports
   - Predefined data schemas
   - Lightweight validation

5. **PydanticOutputParser**: 
   - Production environmental systems
   - Strict data validation
   - Type safety for critical applications

### Best Practices for Environmental Data:
- Always validate numerical ranges (temperatures, emissions, etc.)
- Include units in field descriptions
- Use realistic data constraints
- Consider data quality and sources
- Plan for error handling in production systems
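As one illustration of the error-handling point, a small retry wrapper can re-ask for output when parsing fails. This is a minimal sketch under stated assumptions: `invoke_with_retry` is a hypothetical helper, the model call is replaced by a stand-in that fails once, and only `ValueError` is caught. Adjust the exception types to whatever your parser raises.

```python
import json
from typing import Callable, TypeVar

T = TypeVar("T")


def invoke_with_retry(
    call: Callable[[], str],
    parse: Callable[[str], T],
    max_attempts: int = 3,
) -> T:
    """Call the model and parse its reply, retrying when parsing fails."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return parse(call())
        except ValueError as error:  # json.JSONDecodeError subclasses ValueError
            last_error = error
            print(f"Attempt {attempt} failed: {error}")
    raise RuntimeError(f"No valid output after {max_attempts} attempts") from last_error


# Stand-in for a model call: the first reply is invalid, the second is valid JSON.
replies = iter(["not json", '{"pm25": 12.0}'])
result = invoke_with_retry(lambda: next(replies), json.loads)
print(result)
```

In the notebook, `call` could wrap `prompt | llm | StrOutputParser()` and `parse` could be a parser's `parse` method, so that a single malformed reply does not stop a whole batch.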

### Next Steps:
- Combine parsers with environmental APIs
- Create custom validation rules
- Build environmental data pipelines
- Integrate with climate databases

## Practice Exercise

Try creating your own environmental data parser! Choose a topic like:
- Ocean temperature and pH levels
- Air quality measurements
- Deforestation rates
- Species population data
- Energy consumption by country

Use the parser that best fits your data structure and validation needs.

In [None]:
# Your practice code here
# Example template:

# Define your Pydantic model
class YourEnvironmentalData(BaseModel):
    # Add your fields here
    pass

# Create parser and prompt
# parser = PydanticOutputParser(pydantic_object=YourEnvironmentalData)
# prompt = PromptTemplate(...)
# chain = prompt | llm | parser

# Test your parser
# result = chain.invoke({...})
# print(result)