# Quick Start - Semiconductor Wafer Clustering Agent

This notebook provides a quick introduction to using the Semiconductor Wafer Clustering Agent.

## Overview
- Initialize the agent with your OpenAI API key
- Load or generate wafer data
- Analyze using natural language queries
- Visualize results

## 1. Setup and Installation

In [None]:
# If running in Google Colab, install dependencies
import sys
if 'google.colab' in sys.modules:
    !pip install langchain langchain-openai gradio pandas numpy scikit-learn matplotlib seaborn -q
    print("✅ Dependencies installed for Colab")
else:
    print("📌 Make sure you have installed requirements: pip install -r requirements.txt")

In [None]:
# Import necessary libraries
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Add parent directory to path to import src modules
if 'google.colab' in sys.modules:
    # In Colab the src package is not on the path, so WaferClusteringAgent
    # must be defined or imported in this notebook before Section 2
    print("Running in Google Colab")
else:
    # For local development: notebooks have no __file__, so use the parent
    # of the current working directory (the project root) instead
    sys.path.append(os.path.dirname(os.getcwd()))
    from src.agent import WaferClusteringAgent

## 2. Initialize the Agent

In [None]:
# Set your OpenAI API key
# Option 1: Direct (not recommended for sharing)
# api_key = "sk-your-api-key-here"

# Option 2: From environment variable
api_key = os.getenv("OPENAI_API_KEY")

# Option 3: Interactive input
if not api_key:
    import getpass
    api_key = getpass.getpass("Enter your OpenAI API key: ")

# Initialize the agent
print("Initializing Wafer Clustering Agent...")
agent = WaferClusteringAgent(api_key)
print("✅ Agent ready!")

## 3. Load Data

You can either:
- Generate synthetic data for testing
- Load your own CSV file

In [None]:
# Option 1: Generate synthetic data
print("Generating synthetic wafer data...")
df = agent.generate_synthetic_data(n_wafers=500)
agent.load_data(df)

print(f"\n📊 Generated {len(df)} wafers with {len(df.columns)} features")
print(f"\nFeatures: {list(df.columns)}")
print("\nFirst 5 rows:")
df.head()

In [None]:
# Option 2: Load your own CSV (uncomment to use)
# df = pd.read_csv("your_wafer_data.csv")
# agent.load_data(df)
# print(f"Loaded {len(df)} wafers")
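
Before handing your own CSV to the agent, it can help to confirm that the columns used later in this notebook are present. The sketch below is a minimal, standalone check and not part of the agent's API: `check_wafer_csv` is a hypothetical helper, the `Temperature` column is made up for the demo, and the expected column names are simply the ones the export cell in Section 7 refers to.

```python
import io
import pandas as pd

# Column names used by later cells of this notebook; adjust to your own schema.
EXPECTED_COLUMNS = ["Wafer_ID", "Yield_%", "Defect_Density"]

def check_wafer_csv(source, expected=EXPECTED_COLUMNS):
    """Read a CSV and list which expected columns are missing (hypothetical helper)."""
    frame = pd.read_csv(source)
    missing = [col for col in expected if col not in frame.columns]
    return frame, missing

# Tiny in-memory CSV standing in for "your_wafer_data.csv" (illustrative values)
sample_csv = io.StringIO(
    "Wafer_ID,Yield_%,Temperature\n"
    "W001,92.5,350\n"
    "W002,88.1,355\n"
)
sample_df, missing_cols = check_wafer_csv(sample_csv)
print(f"Loaded {len(sample_df)} wafers; missing columns: {missing_cols}")
```

If the list of missing columns is not empty, the plots in Section 4 skip the corresponding panels, and any later cell that indexes those columns directly may raise a `KeyError`.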

## 4. Basic Data Exploration

In [None]:
# Quick data visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Yield distribution
if 'Yield_%' in df.columns:
    df['Yield_%'].hist(bins=30, ax=axes[0,0], edgecolor='black')
    axes[0,0].set_title('Yield Distribution')
    axes[0,0].set_xlabel('Yield %')

# Defect density distribution
if 'Defect_Density' in df.columns:
    df['Defect_Density'].hist(bins=30, ax=axes[0,1], edgecolor='black')
    axes[0,1].set_title('Defect Density Distribution')
    axes[0,1].set_xlabel('Defect Density')

# Yield vs Defects scatter
if 'Yield_%' in df.columns and 'Defect_Density' in df.columns:
    axes[1,0].scatter(df['Defect_Density'], df['Yield_%'], alpha=0.5)
    axes[1,0].set_xlabel('Defect Density')
    axes[1,0].set_ylabel('Yield %')
    axes[1,0].set_title('Yield vs Defects')

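# Feature correlation (first 5 numeric columns, labeled on both axes)
numeric_cols = df.select_dtypes(include=[np.number]).columns[:5]
corr = df[numeric_cols].corr()
im = axes[1,1].imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
axes[1,1].set_xticks(range(len(numeric_cols)))
axes[1,1].set_xticklabels(numeric_cols, rotation=45, ha='right')
axes[1,1].set_yticks(range(len(numeric_cols)))
axes[1,1].set_yticklabels(numeric_cols)
axes[1,1].set_title('Feature Correlation')
plt.colorbar(im, ax=axes[1,1])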

plt.tight_layout()
plt.show()

## 5. Natural Language Analysis

Now let's analyze the data using natural language queries:

In [None]:
# Query 1: Basic data overview
query = "What does my wafer data look like? Give me a summary of the key features."
print(f"🔍 Query: {query}\n")
response = agent.analyze(query)
print(response)

In [None]:
# Query 2: Find optimal clusters
query = "Find the optimal number of clusters for this wafer dataset"
print(f"🔍 Query: {query}\n")
response = agent.analyze(query)
print(response)
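
To make the idea of an "optimal number of clusters" concrete, here is a minimal sketch of silhouette-based selection with scikit-learn. It is an illustration of one common technique, not necessarily what the agent does internally. The data comes from `make_blobs` with three hand-picked centers, so it is a stand-in for real wafer features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 300 points around three well-separated centers
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=0.6,
    random_state=42,
)
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit k-means for each candidate k and record the mean silhouette score
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    silhouette_by_k[k] = silhouette_score(X_scaled, labels)

# The k with the highest silhouette score is the suggested cluster count
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Suggested number of clusters: {best_k}")
```

On real wafer data the peak is usually less sharp than on this toy example, so it is worth comparing the top two or three candidates rather than trusting a single score.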

In [None]:
# Query 3: Apply clustering
query = "Apply k-means clustering with 4 clusters and tell me about each cluster"
print(f"🔍 Query: {query}\n")
response = agent.analyze(query)
print(response)

In [None]:
# Query 4: Business insights
query = "Which cluster has the best yield? What makes it different from the worst cluster?"
print(f"🔍 Query: {query}\n")
response = agent.analyze(query)
print(response)

In [None]:
# Query 5: Outlier detection
query = "Are there any outlier wafers I should investigate?"
print(f"🔍 Query: {query}\n")
response = agent.analyze(query)
print(response)
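
As a cross-check on the agent's answer, a simple interquartile-range (IQR) rule can flag suspicious wafers directly in pandas. This is a sketch under assumptions: `flag_iqr_outliers` is a hypothetical helper, the yield values are made up, and the agent may use a different outlier method.

```python
import pandas as pd

def flag_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative data: one wafer with a clearly depressed yield
demo_df = pd.DataFrame({
    "Wafer_ID": [f"W{i:03d}" for i in range(1, 9)],
    "Yield_%": [91.0, 92.5, 90.8, 93.1, 92.0, 91.7, 60.2, 92.9],
})
demo_df["Is_Outlier"] = flag_iqr_outliers(demo_df["Yield_%"])
outlier_ids = demo_df.loc[demo_df["Is_Outlier"], "Wafer_ID"].tolist()
print(f"Wafers to investigate: {outlier_ids}")
```

The multiplier `k=1.5` is the conventional default. A larger value such as 3.0 flags only extreme wafers.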

## 6. Visualize Clustering Results

In [None]:
# Create PCA visualization
query = "Create a PCA visualization of the clustering results"
print(f"🔍 Query: {query}\n")
response = agent.analyze(query)
print(response)
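
For readers who want to see what a PCA projection involves, the following sketch reduces a feature matrix to two components with scikit-learn. The random matrix is a placeholder for numeric wafer features, and the agent's own visualization code may differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix: 100 wafers x 5 numeric features (random values)
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 5))

# Standardize first so that no single feature dominates the components
scaled = StandardScaler().fit_transform(features)

# Project onto the two directions of greatest variance
pca = PCA(n_components=2)
projected = pca.fit_transform(scaled)

print(f"Projected shape: {projected.shape}")
print(f"Variance explained by 2 components: {pca.explained_variance_ratio_.sum():.1%}")
```

The two columns of `projected` can be passed to `plt.scatter` and colored by cluster label to produce the usual 2D cluster plot.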

## 7. Export Results

In [None]:
# Get cluster labels if available
if agent.current_labels is not None:
    # Add cluster labels to dataframe
    results_df = df.copy()
    results_df['Cluster'] = agent.current_labels
    
    # Show cluster distribution
    print("Cluster Distribution:")
    print(results_df['Cluster'].value_counts().sort_index())
    
    # Save results
    results_df.to_csv('clustered_wafers.csv', index=False)
    print("\n✅ Results saved to 'clustered_wafers.csv'")
    
    # Show sample of each cluster (only the columns present in the data)
    print("\nSample wafers from each cluster:")
    preview_cols = [c for c in ['Wafer_ID', 'Yield_%', 'Defect_Density'] if c in results_df.columns]
    for cluster in sorted(results_df['Cluster'].unique()):
        if cluster != -1:  # Skip noise points (label -1, e.g. from DBSCAN)
            print(f"\nCluster {cluster}:")
            print(results_df.loc[results_df['Cluster'] == cluster, preview_cols].head(3))
else:
    print("No clustering results available yet. Run clustering analysis first.")
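
Once a `Cluster` column exists, per-cluster averages are often the quickest way to compare groups. The sketch below uses a small made-up frame in place of `results_df`. The column names follow this notebook, but the values are illustrative only.

```python
import pandas as pd

# Made-up stand-in for results_df; label -1 marks a noise point
clustered = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, -1],
    "Yield_%": [92.0, 94.0, 80.0, 82.0, 55.0],
    "Defect_Density": [0.10, 0.12, 0.30, 0.34, 0.90],
})

# Mean yield and defect density per cluster, ignoring noise points
summary = (
    clustered[clustered["Cluster"] != -1]
    .groupby("Cluster")[["Yield_%", "Defect_Density"]]
    .mean()
)
best_cluster = summary["Yield_%"].idxmax()

print(summary)
print(f"\nCluster with the highest mean yield: {best_cluster}")
```

The same `groupby` call applied to `results_df` gives a numeric baseline to compare against the agent's natural language description of each cluster.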

## 8. Interactive Analysis

Try your own queries:

In [None]:
# Interactive query cell - modify the query and run!
your_query = "What process parameters correlate most strongly with high yield?"

print(f"🔍 Your Query: {your_query}\n")
response = agent.analyze(your_query)
print(response)
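
A query like the one above can be sanity-checked with a plain correlation ranking in pandas. This is a sketch on synthetic data: the `Temperature` and `Pressure` columns and their relationship to yield are invented for the example and do not describe real process behavior.

```python
import numpy as np
import pandas as pd

# Synthetic parameters: yield is constructed to depend on temperature only
rng = np.random.default_rng(1)
n = 200
temperature = rng.normal(350.0, 5.0, size=n)
pressure = rng.normal(1.0, 0.1, size=n)
demo_params = pd.DataFrame({
    "Temperature": temperature,
    "Pressure": pressure,
    "Yield_%": 90.0 - 0.8 * (temperature - 350.0) + rng.normal(0.0, 1.0, size=n),
})

# Pearson correlation of every numeric column with yield, ranked by magnitude
numeric = demo_params.select_dtypes(include=[np.number])
corr_with_yield = numeric.corr()["Yield_%"].drop("Yield_%")
ranked = corr_with_yield.reindex(
    corr_with_yield.abs().sort_values(ascending=False).index
)
print(ranked)
```

Ranking by absolute value keeps strong negative relationships at the top, which matters when a parameter reduces yield as it increases.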

## 9. Launch Full UI (Optional)

For a complete interactive experience with more features:

In [None]:
# Launch Gradio UI (if not in Colab)
if 'google.colab' not in sys.modules:
    from src.ui import create_gradio_interface
    
    print("Launching Gradio UI...")
    demo = create_gradio_interface()
    # share=True creates a temporary public link; use share=False to keep the UI local
    demo.launch(share=True)
else:
    print("To use the full UI in Colab, run the main notebook: Wafer_Clustering_Demo.ipynb")

## Summary

You've learned how to:
1. ✅ Initialize the Wafer Clustering Agent
2. ✅ Load or generate wafer data
3. ✅ Analyze data using natural language
4. ✅ Apply k-means clustering
5. ✅ Get insights about clusters
6. ✅ Export results for further analysis

## Next Steps

- Try different clustering algorithms (DBSCAN, Hierarchical, GMM)
- Compare multiple algorithms to find the best one
- Analyze your own wafer data
- Use the full Gradio UI for more interactive analysis
- Check the [API Reference](../docs/api_reference.md) for advanced usage

## Example Queries to Try

- "Compare k-means and DBSCAN clustering on this data"
- "What's the optimal number of clusters using both elbow and silhouette methods?"
- "Which features are most important for distinguishing between clusters?"
- "Create a detailed report on cluster characteristics"
- "Identify the top 10 worst performing wafers"

Happy analyzing! 🚀