# Genie Deep Dive: Engineering AI Context

Learn to create high-quality Genie Spaces that provide accurate natural language data access.

## What You'll Learn

✅ Set up effective Genie Spaces  
✅ Create comprehensive knowledge stores  
✅ Implement query parameters  
✅ Define trusted assets  
✅ Measure and improve accuracy with benchmarks  
✅ Apply best practices for context engineering  

---

## What is Genie?

**Genie** is Databricks' natural language interface that lets users query data conversationally.

**Key Components:**
- **Genie Space**: Container for tables, context, and configuration
- **Knowledge Store**: Business terminology and sample Q&A
- **Trusted Assets**: Curated tables, views, and metric views
- **Benchmarks**: Test questions to measure accuracy

---

## Table of Contents

1. [Setting Up Genie Spaces](#setup)
2. [Knowledge Store Engineering](#knowledge)
3. [Query Parameters](#parameters)
4. [Trusted Assets](#assets)
5. [Benchmarks and Quality](#benchmarks)
6. [Best Practices](#best-practices)

---

**References:**
- [Genie Setup](https://docs.databricks.com/aws/en/genie/set-up)
- [Knowledge Store](https://docs.databricks.com/aws/en/genie/knowledge-store)
- [Query Parameters](https://docs.databricks.com/aws/en/genie/query-params)
- [Trusted Assets](https://docs.databricks.com/aws/en/genie/trusted-assets)
- [Best Practices](https://docs.databricks.com/aws/en/genie/best-practices)

## 1. Setting Up Genie Spaces <a id="setup"></a>

### Creating a Genie Space

**Step 1: Navigate to Genie**
1. Click **Genie** in the sidebar
2. Click **Create Space**
3. Name: "IoT Analytics Assistant"

**Step 2: Select Data Sources**

Choose tables relevant to your use case:
- `sensor_enriched` - Main fact table
- `dim_factories` - Factory details
- `dim_devices` - Device information
- `dim_models` - Model catalog
- `factory_kpis_gold` - Pre-aggregated metrics

**Best Practices for Table Selection:**

✅ **Include related tables** - For context and joins  
✅ **Use metric views** - Pre-defined business metrics  
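✅ **Limit scope** - Every extra table adds columns and join paths that Genie can pick wrongly  
✅ **Pre-aggregate** - Gold layer tables need simpler SQL and return results faster  
✅ **Document relationships** - Declared primary and foreign keys show Genie how tables join  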

### Configuration

**Space Settings:**
```
Name: IoT Analytics Assistant
Description: Query IoT sensor data, factory metrics, and device health
Default Time Range: Last 30 days
Primary Table: sensor_enriched
```

**Permissions:**
- Set who can access the Space
- Control query execution permissions
- Manage knowledge store editing rights

## 2. Knowledge Store Engineering <a id="knowledge"></a>

### What is the Knowledge Store?

The knowledge store teaches Genie about your business:
- **Terminology**: Domain-specific terms
- **Relationships**: How concepts connect
- **Sample Q&A**: Example questions and answers
- **Context**: Business rules and definitions

### Adding Business Terminology

**Example Terminology Entries:**

```
Term: Critical Reading
Definition: A sensor reading where temperature exceeds 85°F
SQL Column: CASE WHEN temperature > 85 THEN 'Critical' ELSE 'Normal' END

Term: Device Uptime
Definition: Percentage of time a device is actively reporting
Calculation: (active_hours / total_hours) * 100

Term: Anomaly Rate
Definition: Percentage of readings flagged as anomalous
SQL: (anomaly_count / total_readings) * 100

Term: Factory
Definition: Physical location where devices are deployed
Table: dim_factories
Key Column: factory_id

Term: Model Family
Definition: Product line grouping (e.g., SkyJet, CloudCruiser)
Table: dim_models
Column: model_family
```
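
The formulas in the entries above can be checked by hand before you publish them. The following is a minimal Python sketch of the three calculated terms. The function names are hypothetical, and returning `0.0` for an empty denominator is an assumption, since the entries do not say what should happen when there are no hours or readings.

```python
def percentage(part: float, whole: float) -> float:
    """Return part / whole as a percentage."""
    if whole == 0:
        # Assumption: an empty denominator yields 0.0 rather than an error.
        return 0.0
    return 100 * part / whole


def device_uptime(active_hours: float, total_hours: float) -> float:
    # Device Uptime: (active_hours / total_hours) * 100
    return percentage(active_hours, total_hours)


def anomaly_rate(anomaly_count: int, total_readings: int) -> float:
    # Anomaly Rate: (anomaly_count / total_readings) * 100
    return percentage(anomaly_count, total_readings)


def is_critical(temperature_f: float) -> bool:
    # Critical Reading: temperature strictly above 85°F
    return temperature_f > 85
```

Working a few values through functions like these is a quick way to confirm that the wording of a definition and its formula agree, for example that a reading of exactly 85°F is not critical.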

### Adding Sample Q&A

**Format**: Question → SQL → Natural Language Answer

**Example 1: Simple Aggregation**
```
Q: What's the average temperature across all factories?
SQL: SELECT AVG(temperature) FROM sensor_enriched
Expected Answer: "The average temperature is 67.8°F"
```

**Example 2: Filtered Query**
```
Q: How many devices are active in the West region?
SQL: SELECT COUNT(DISTINCT device_id) 
     FROM sensor_enriched 
     WHERE region = 'West'
Expected Answer: "There are 245 active devices in the West region"
```

**Example 3: Time-Based**
```
Q: Show me temperature trends over the last week
SQL: SELECT DATE(timestamp) as date, 
            AVG(temperature) as avg_temp
     FROM sensor_enriched
     WHERE timestamp >= CURRENT_DATE - 7
     GROUP BY DATE(timestamp)
     ORDER BY date
Expected Viz: Line chart
```

**Example 4: Complex Join**
```
Q: Which factory has the most critical readings?
SQL: SELECT f.factory_name, 
            COUNT(CASE WHEN s.temperature > 85 THEN 1 END) as critical_count
     FROM sensor_enriched s
     JOIN dim_factories f ON s.factory_id = f.factory_id
     GROUP BY f.factory_name
     ORDER BY critical_count DESC
     LIMIT 1
Expected Answer: "TechHub Seattle has 1,247 critical readings"
```
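
Before adding a sample query to the knowledge store, it can help to run it against a handful of rows where the answer is known. The sketch below runs the Example 4 query against an in-memory SQLite database. SQLite is only a stand-in for Databricks SQL here, and the rows, including the factory name `Plant Austin`, are made-up toy data. Queries that rely on Databricks date functions may not run unchanged in SQLite.

```python
import sqlite3

# Toy schema with only the columns Example 4 needs (an assumption about
# the real tables, which will have many more columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_factories (factory_id INTEGER PRIMARY KEY, factory_name TEXT);
    CREATE TABLE sensor_enriched (device_id TEXT, factory_id INTEGER, temperature REAL);
    INSERT INTO dim_factories VALUES (1, 'TechHub Seattle'), (2, 'Plant Austin');
    INSERT INTO sensor_enriched VALUES
        ('d1', 1, 90.0), ('d2', 1, 88.5), ('d3', 1, 70.0),
        ('d4', 2, 86.0), ('d5', 2, 60.0);
""")

EXAMPLE_4_SQL = """
    SELECT f.factory_name,
           COUNT(CASE WHEN s.temperature > 85 THEN 1 END) AS critical_count
    FROM sensor_enriched s
    JOIN dim_factories f ON s.factory_id = f.factory_id
    GROUP BY f.factory_name
    ORDER BY critical_count DESC
    LIMIT 1
"""

top_factory = conn.execute(EXAMPLE_4_SQL).fetchone()
print(top_factory)  # ('TechHub Seattle', 2)
```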

### Context Documentation

**Business Rules:**
```
- Operating hours: 6 AM - 10 PM PST
- Maintenance windows: Sundays 2-4 AM
- Critical threshold: Temperature > 85°F
- Warning threshold: Temperature > 75°F
- Normal range: 60-75°F
- Regions: West, East, Central, International
```

**Metric Definitions:**
```
Device Health Score:
  - Excellent: < 5% critical readings
  - Good: 5-10% critical readings
  - Fair: 10-20% critical readings
  - Poor: > 20% critical readings

Utilization Rate:
  - (Active devices / Total installed) * 100
  - Target: > 95%
```
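
The metric definitions above translate into a few lines of Python, which is useful for spot-checking what Genie returns. This is a sketch with hypothetical function names. The Good and Fair bands both list 10%, so placing exactly 10% in Fair is an assumption, as is raising an error when no devices are installed.

```python
def health_score(critical_pct: float) -> str:
    """Map the percentage of critical readings to a health band."""
    if critical_pct < 5:
        return "Excellent"
    if critical_pct < 10:
        # Assumption: exactly 10% counts as Fair, not Good.
        return "Good"
    if critical_pct <= 20:
        return "Fair"
    return "Poor"


def utilization_rate(active_devices: int, total_installed: int) -> float:
    """(Active devices / Total installed) * 100"""
    if total_installed <= 0:
        raise ValueError("total_installed must be positive")
    return 100 * active_devices / total_installed


def meets_utilization_target(rate: float, target: float = 95.0) -> bool:
    # Target: > 95%
    return rate > target
```

Whichever way you resolve the shared boundary, state it in the knowledge store so Genie and your reports agree.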

## 3. Query Parameters <a id="parameters"></a>

### Pre-Defined Parameters

Map common phrases to SQL filters so Genie applies them consistently:

**Time Parameters:**
```
yesterday: timestamp >= CURRENT_DATE - 1 AND timestamp < CURRENT_DATE
today: DATE(timestamp) = CURRENT_DATE
this week: timestamp >= DATE_TRUNC('week', CURRENT_DATE)
this month: timestamp >= DATE_TRUNC('month', CURRENT_DATE)
last 7 days: timestamp >= CURRENT_DATE - 7
last 30 days: timestamp >= CURRENT_DATE - 30
```
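
To verify that each time phrase covers the dates you expect, the filters above can be mirrored as half-open date ranges. The sketch below is an illustration, not part of Genie. The function name `time_range` is hypothetical, and it assumes `DATE_TRUNC('week', ...)` truncates to Monday, which is my understanding of Databricks SQL behavior.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def time_range(name: str, today: date) -> Tuple[date, Optional[date]]:
    """Return a half-open [start, end) range; end=None means no upper bound."""
    if name == "yesterday":
        return today - timedelta(days=1), today
    if name == "today":
        return today, today + timedelta(days=1)
    if name == "this week":
        # Assumption: weeks start on Monday, matching DATE_TRUNC('week', ...).
        return today - timedelta(days=today.weekday()), None
    if name == "this month":
        return today.replace(day=1), None
    if name == "last 7 days":
        return today - timedelta(days=7), None
    if name == "last 30 days":
        return today - timedelta(days=30), None
    raise KeyError(f"unknown time parameter: {name}")


# Example: Wednesday 2024-05-15
print(time_range("this week", date(2024, 5, 15)))
```

Passing `today` explicitly keeps the function deterministic, so the same ranges can be reused when checking benchmark answers.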

**Region Parameters:**
```
west: region = 'West'
east: region = 'East'
central: region = 'Central'
international: region = 'International'
domestic: region IN ('West', 'East', 'Central')
```

**Status Parameters:**
```
critical: temperature > 85
warning: temperature > 75 AND temperature <= 85
normal: temperature BETWEEN 60 AND 75
anomaly: is_anomaly = true
```
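
The three temperature filters are meant to be mutually exclusive, which is easy to confirm in code. This is a minimal sketch with a hypothetical function name. The status parameters do not label readings below 60°F, so returning `"unclassified"` for them is an assumption.

```python
def temperature_status(temperature_f: float) -> str:
    """Classify a reading using the status parameter thresholds."""
    if temperature_f > 85:
        return "critical"
    if temperature_f > 75:
        return "warning"   # 75 < t <= 85
    if temperature_f >= 60:
        return "normal"    # 60 <= t <= 75
    # Assumption: the parameters define no status below 60°F.
    return "unclassified"
```

Checking the boundary values 60, 75, and 85 against this function shows which side of each threshold a reading falls on.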

### Dynamic Date Ranges

```
Parameter: relative_date
Values (start date of the range):
  - "yesterday" → CURRENT_DATE - 1
  - "last week" → CURRENT_DATE - 7
  - "last month" → CURRENT_DATE - 30
  - "last quarter" → CURRENT_DATE - 90
```

---

## 4. Trusted Assets <a id="assets"></a>

### What are Trusted Assets?

**Trusted Assets** are tables and views that Genie should prefer when answering questions.

### Types of Trusted Assets

**1. Metric Views**
```
Asset: iot_sensor_metrics (metric view)
Use For: All metric-related questions
Reason: Pre-defined, consistent business metrics
Priority: High
```

**2. Gold Layer Tables**
```
Asset: factory_kpis_gold
Use For: Factory performance questions
Reason: Pre-aggregated, fast queries
Priority: High

Asset: device_health_gold
Use For: Device status and health questions
Priority: High
```

**3. Curated Views**
```
Asset: recent_sensor_data (view)
Definition: SELECT * FROM sensor_enriched 
           WHERE timestamp >= CURRENT_DATE - 30
Use For: Recent data queries
Reason: Performance optimization
```

### Configuring Trust Levels

```
High Priority:
  - Metric views
  - Gold layer aggregates
  - Curated business views

Medium Priority:
  - Silver layer tables
  - Dimension tables

Low Priority:
  - Bronze layer (raw data)
  - Historical archives
```
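
When planning a Space, the priority tiers above can be written down as a simple lookup so that candidate assets are reviewed in a consistent order. The sketch below is a planning aid only and does not call any Genie API. The kind labels and the table name `sensor_raw_bronze` are hypothetical.

```python
from typing import Dict, List

# Higher number = higher priority, following the tiers listed above.
PRIORITY_BY_KIND = {
    "metric_view": 3, "gold": 3, "curated_view": 3,
    "silver": 2, "dimension": 2,
    "bronze": 1, "archive": 1,
}


def rank_assets(assets: Dict[str, str]) -> List[str]:
    """Sort asset names by priority (high first), then alphabetically."""
    return sorted(assets, key=lambda name: (-PRIORITY_BY_KIND[assets[name]], name))


candidates = {
    "sensor_raw_bronze": "bronze",      # hypothetical raw table
    "factory_kpis_gold": "gold",
    "dim_factories": "dimension",
    "iot_sensor_metrics": "metric_view",
}
print(rank_assets(candidates))
```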

---

## 5. Benchmarks and Quality <a id="benchmarks"></a>

### Creating Benchmark Questions

**Purpose**: Measure Genie accuracy and track improvement over time.

**Benchmark Categories:**

**1. Basic Metrics** (Should be 100% accurate)
```
- What's the average temperature?
- How many devices are there?
- Show me total readings
```

**2. Filtered Queries** (Should be 95%+ accurate)
```
- What's the average temperature in the West region?
- How many critical readings yesterday?
- Show devices with high anomaly rates
```

**3. Time-Based** (Should be 90%+ accurate)
```
- Temperature trend over last week
- Daily device counts for this month
- Compare this week vs last week
```

**4. Complex Joins** (Should be 85%+ accurate)
```
- Which model has the highest temperature?
- Factory with most devices by region
- Anomaly rate by model family
```

**5. Business Logic** (Should be 80%+ accurate)
```
- Calculate device uptime percentage
- Rank factories by health score
- Identify underperforming models
```

### Measuring Accuracy

**Run Benchmark Suite:**
1. Submit all benchmark questions
2. Review generated SQL
3. Verify results accuracy
4. Note failures and patterns

**Scoring:**
- ✅ Correct SQL and answer: 1 point
- ⚠️ Correct answer, suboptimal SQL: 0.5 points
- ❌ Incorrect: 0 points

**Target Scores:**
- Basic: 100%
- Filtered: 95%+
- Time-Based: 90%+
- Complex: 85%+
- Business Logic: 80%+
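
The scoring rules and target scores above can be combined into a small script that reports which categories fall short. This is a sketch under the assumption that each reviewed question is recorded as a `(category, outcome)` pair. The outcome labels `correct`, `suboptimal`, and `incorrect` are hypothetical names for the three scoring rows.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

POINTS = {"correct": 1.0, "suboptimal": 0.5, "incorrect": 0.0}
TARGETS = {
    "Basic": 100.0, "Filtered": 95.0, "Time-Based": 90.0,
    "Complex": 85.0, "Business Logic": 80.0,
}


def score_benchmarks(results: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return the percentage score per category."""
    earned: Dict[str, float] = defaultdict(float)
    asked: Dict[str, int] = defaultdict(int)
    for category, outcome in results:
        earned[category] += POINTS[outcome]
        asked[category] += 1
    return {category: 100 * earned[category] / asked[category] for category in asked}


def below_target(scores: Dict[str, float]) -> List[str]:
    """List categories whose score is under the target, alphabetically."""
    return sorted(c for c, score in scores.items() if score < TARGETS[c])


results = [
    ("Basic", "correct"), ("Basic", "correct"),
    ("Filtered", "correct"), ("Filtered", "suboptimal"),
    ("Complex", "incorrect"), ("Complex", "correct"),
]
scores = score_benchmarks(results)
print(scores, below_target(scores))
```

Keeping the per-category breakdown, rather than a single overall number, points directly at the kind of sample Q&A or terminology that needs work.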

### Iterative Improvement

**If accuracy is low:**
1. **Add more sample Q&A** for failed patterns
2. **Document terminology** used in questions
3. **Create trusted assets** for common queries
4. **Refine parameter definitions**
5. **Simplify table schemas** if needed
6. **Add context** about relationships

**Track Over Time:**
```
Week 1: 75% overall accuracy
Week 2: 82% (added 20 sample Q&A)
Week 3: 88% (created metric views)
Week 4: 93% (refined terminology)
```

---

## 6. Best Practices <a id="best-practices"></a>

### Context Engineering

**Start Simple:**
- Begin with 5-10 tables
- Add 20-30 sample Q&A
- Define 10-15 key terms
- Create 2-3 trusted assets

**Iterate Based on Usage:**
- Monitor actual user questions
- Add Q&A for common patterns
- Document frequently used terms
- Create views for popular queries

### Documentation Best Practices

✅ **Be specific** with definitions  
✅ **Use examples** in explanations  
✅ **Include units** (°F, %, count)  
✅ **Define acronyms** fully  
✅ **Show calculations** for complex metrics  
✅ **Link related concepts**  

### Knowledge Store Tips

**Good Terminology Entry:**
```
Term: Device Uptime
Definition: The percentage of expected operating hours that a device 
was actively reporting sensor data. Calculated as 
(hours_with_readings / total_expected_hours) * 100.
Expected operating hours are 6 AM - 10 PM daily (16 hours).
Example: If a device reported for 14 out of 16 hours, 
uptime is 87.5%.
Related: Device Health, Active Device, Reporting Status
```

**Bad Terminology Entry:**
```
Term: Uptime
Definition: When device works
```
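
The difference between the two entries can be turned into a rough automated check. The sketch below flags entries that are too short, lack any number or worked example, or link no related terms. The function name, the dictionary keys, and the 12-word threshold are all assumptions chosen for illustration, not rules from Genie.

```python
import re
from typing import Dict, List


def lint_term(entry: Dict[str, str], min_words: int = 12) -> List[str]:
    """Return a list of problems found in a terminology entry."""
    problems = []
    definition = entry.get("definition", "")
    if len(definition.split()) < min_words:
        problems.append("definition is too short to be specific")
    # A digit anywhere suggests a formula, threshold, or worked example.
    if not re.search(r"\d", definition + " " + entry.get("example", "")):
        problems.append("no numbers, calculation, or worked example")
    if not entry.get("related"):
        problems.append("no related terms linked")
    return problems


good = {
    "term": "Device Uptime",
    "definition": (
        "The percentage of expected operating hours that a device was "
        "actively reporting sensor data. Calculated as "
        "(hours_with_readings / total_expected_hours) * 100."
    ),
    "example": "If a device reported for 14 out of 16 hours, uptime is 87.5%.",
    "related": "Device Health, Active Device, Reporting Status",
}
bad = {"term": "Uptime", "definition": "When device works"}

print(lint_term(good))  # []
print(lint_term(bad))
```

A check like this only catches missing structure. A reviewer still needs to confirm that the definition is correct.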

### Sample Q&A Tips

**Include Variety:**
- Simple counts
- Averages and aggregations
- Filters and conditions
- Time-based queries
- Joins across tables
- Rankings and top N
- Comparisons
- Trend analysis

**Use Natural Language:**
```
✅ "Show me factories with high anomaly rates"
✅ "What's the temperature trend this week?"
✅ "Which devices need maintenance?"

❌ "SELECT * FROM factories WHERE anomaly_rate > threshold"
❌ "Give temperature data filtered by date"
```

---

## Summary

In this notebook, you learned:

✅ **Genie Spaces** - Container for natural language queries  
✅ **Knowledge Store** - Teaching Genie your business  
✅ **Query Parameters** - Pre-defined filters and shortcuts  
✅ **Trusted Assets** - Preferred tables and views  
✅ **Benchmarks** - Measuring and improving accuracy  
✅ **Best Practices** - Effective context engineering  

### Key Takeaways:

1. **Start with good data** - Use metric views and gold layer tables
2. **Document thoroughly** - Terminology and sample Q&A are critical
3. **Iterate continuously** - Add content based on actual usage
4. **Measure quality** - Use benchmarks to track improvement
5. **Think like a user** - Sample questions should be natural

### Quality Checklist:

- [ ] 5-10 relevant tables selected
- [ ] 20+ sample Q&A added
- [ ] 10+ terms documented
- [ ] 2-3 trusted assets configured
- [ ] Query parameters defined
- [ ] Benchmark suite created
- [ ] 85%+ accuracy on benchmarks

### Next Steps:

- Create your first Genie Space
- Add sample Q&A from this notebook
- Run benchmark questions
- Monitor user queries and iterate
- Achieve 90%+ accuracy

---

**Additional Resources:**
- [Genie Documentation](https://docs.databricks.com/aws/en/genie/)
- [Knowledge Store Guide](https://docs.databricks.com/aws/en/genie/knowledge-store)
- [Best Practices](https://docs.databricks.com/aws/en/genie/best-practices)
- [Benchmarks](https://docs.databricks.com/aws/en/genie/benchmarks)