# 🚀 Getting Started with Airweave

Welcome to Airweave! This notebook will show you how to:
- 📦 Create a **collection** (a searchable group of data sources)
- 🔗 Connect **GitHub** and **Stripe** to sync your data
- ⏳ Monitor sync progress in real time
- 🔍 Search across all your connected data with one query

By the end of this tutorial, you'll understand how Airweave makes any app searchable for your AI agents!

## 📋 Prerequisites

Before we begin, make sure you have:
1. **Airweave API Key** - Get one at [app.airweave.ai](https://app.airweave.ai)
2. **GitHub Personal Access Token** - Create one at [github.com/settings/tokens](https://github.com/settings/tokens) with the `repo` scope
3. **Stripe API Key** - Get your test secret key (it starts with `sk_test_`) from [dashboard.stripe.com/apikeys](https://dashboard.stripe.com/apikeys)

In [None]:
# Install required packages
!pip install airweave-sdk

In [None]:
# Import required libraries
import os
import time
from datetime import datetime
from typing import Optional

from airweave import AirweaveSDK
from airweave.schemas import ResponseType

# For pretty printing
from IPython.display import display, Markdown, clear_output

## 🔐 Step 1: Configure Your Credentials

Let's set up your API keys. Replace the placeholder values with your actual keys.
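
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch (not part of the Airweave SDK), the hypothetical helpers `load_secret` and `is_placeholder` below read a key from an environment variable and fall back to the tutorial placeholder. The environment variable name is an assumption chosen for illustration.

```python
import os


def load_secret(env_var: str, placeholder: str) -> str:
    """Return the value of env_var, or the placeholder if it is unset or empty.

    Hypothetical helper for this notebook; not part of the Airweave SDK.
    """
    value = os.environ.get(env_var, "").strip()
    return value or placeholder


def is_placeholder(value: str) -> bool:
    """Heuristic: the placeholders used in this tutorial all end with '_here'."""
    return value.endswith("_here")


# The environment variable name is an assumption chosen for illustration.
airweave_key = load_secret("AIRWEAVE_API_KEY", "your_airweave_api_key_here")
if is_placeholder(airweave_key):
    print("⚠️  AIRWEAVE_API_KEY is not set; replace the placeholder before continuing.")
```

The same helper could be used for the GitHub token and the Stripe key, so that real values never need to be typed into a cell.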

In [None]:
# Configuration - Replace with your actual values
AIRWEAVE_API_KEY = "your_airweave_api_key_here"
GITHUB_TOKEN = "your_github_pat_here"
GITHUB_REPO = "owner/repo"  # e.g., "airweave-ai/airweave"
STRIPE_API_KEY = "sk_test_your_stripe_key_here"

# Initialize the Airweave client
client = AirweaveSDK(api_key=AIRWEAVE_API_KEY)

print("✅ Airweave client initialized!")

## 📦 Step 2: Create a Collection

**What is a Collection?**

A collection is a logical group of data sources that:
- 🗂️ Organizes related data together (e.g., "Engineering Data", "Customer Support")
- 🔍 Provides a single search endpoint for all its sources
- 🤖 Makes it easy for agents to query specific domains of knowledge

Think of it like a folder that contains data from multiple apps, all searchable together!

In [None]:
# Create a new collection
collection_name = f"My Project Data {datetime.now().strftime('%Y-%m-%d %H:%M')}"

print(f"🔄 Creating collection: {collection_name}...")
collection = client.collections.create_collection(
    name=collection_name
)

print("\n✅ Collection created successfully!")
print(f"📌 Collection ID: {collection.readable_id}")
print(f"📊 Status: {collection.status}")
print(f"🕐 Created at: {collection.created_at}")

# Store the collection ID for later use
COLLECTION_ID = collection.readable_id

print(f"\n💡 Note: Status is '{collection.status}' because we haven't added any data sources yet!")

## 🐙 Step 3: Connect GitHub Repository

Now let's add our first **source connection** to the collection. Source connections:
- 🔗 Link external data sources to your collection
- 🔄 Sync data automatically on a schedule or manually
- 🔐 Store credentials securely

We'll start with GitHub to sync code, issues, and documentation.
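
A malformed repository name is an easy mistake to make before connecting. The sketch below checks that a value has the `owner/repo` shape used in this notebook. `parse_repo` is a hypothetical helper, and the characters allowed by the pattern are a simplifying assumption rather than GitHub's exact naming rules.

```python
import re
from typing import Tuple

# Simplified pattern: one owner segment and one repo segment separated by "/".
# The allowed characters are an assumption, not GitHub's full naming rules.
_REPO_PATTERN = re.compile(r"^([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)$")


def parse_repo(full_name: str) -> Tuple[str, str]:
    """Split "owner/repo" into (owner, repo), raising ValueError if malformed."""
    match = _REPO_PATTERN.match(full_name.strip())
    if match is None:
        raise ValueError(f"Expected 'owner/repo', got {full_name!r}")
    return match.group(1), match.group(2)


owner, repo = parse_repo("airweave-ai/airweave")
print(f"Owner: {owner}, repository: {repo}")
```

Passing a full URL such as `https://github.com/owner/repo` raises `ValueError`, which surfaces the mistake before any sync is started.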

In [None]:
# Create GitHub source connection
print(f"🔄 Connecting GitHub repository: {GITHUB_REPO}...")

github_connection = client.source_connections.create_source_connection(
    name=f"GitHub - {GITHUB_REPO}",
    short_name="github",
    collection=COLLECTION_ID,
    auth_fields={
        "personal_access_token": GITHUB_TOKEN,
        "repo_name": GITHUB_REPO
    },
    config_fields={
        "branch": "feature"  # Optional: branch to sync; change it to one that exists in your repo
    },
    sync_immediately=True  # Start syncing data right away
)

print("\n✅ GitHub connected successfully!")
print(f"📌 Connection ID: {github_connection.id}")
print(f"📊 Status: {github_connection.status}")
print(f"🔄 Sync Job ID: {github_connection.latest_sync_job_id}")

## ⏳ Step 4: Monitor GitHub Sync Progress

Let's watch the sync progress in real time! Airweave is now:
- 📥 Fetching data from your GitHub repository
- 🔄 Processing and extracting entities (files, issues, PRs, etc.)
- 💾 Storing them in the vector database for semantic search
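
The monitoring cell that follows is an instance of a general pattern: poll a status until it reaches a terminal value or a deadline passes. Here is a minimal, SDK-independent sketch of that pattern. `poll_until_done` and its parameters are illustrative assumptions, and the terminal status names mirror the `COMPLETED` and `FAILED` values checked in this notebook.

```python
import time
from typing import Callable, Iterable, Optional


def poll_until_done(
    fetch_status: Callable[[], Optional[str]],
    terminal: Iterable[str] = ("COMPLETED", "FAILED"),
    interval_seconds: float = 2.0,
    max_wait_seconds: float = 300.0,
) -> Optional[str]:
    """Call fetch_status until it returns a terminal status or time runs out.

    Returns the terminal status in upper case, or None if the deadline passed first.
    """
    terminal_set = {s.upper() for s in terminal}
    deadline = time.monotonic() + max_wait_seconds
    while True:
        status = fetch_status()
        if status is not None and status.upper() in terminal_set:
            return status.upper()
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_seconds)


# Usage with a fake status source that completes on the third poll.
fake_statuses = iter([None, "in_progress", "completed"])
final = poll_until_done(lambda: next(fake_statuses), interval_seconds=0.01)
print(f"Final status: {final}")
```

Injecting `fetch_status` as a callable keeps the loop testable without a live sync, and `time.monotonic()` avoids surprises if the system clock changes while waiting.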

In [None]:
def monitor_sync_progress(connection_id: str, connection_name: str, max_wait_seconds: int = 300):
    """Monitor sync progress using the source connection's built-in sync job fields."""
    print(f"⏳ Monitoring sync for {connection_name}...\n")
    
    start_time = time.time()
    
    while True:
        # Get current connection with latest sync job info
        connection = client.source_connections.get_source_connection(
            source_connection_id=connection_id
        )
        
        # Use the built-in sync job fields from the source connection
        if connection.latest_sync_job_status:
            status = connection.latest_sync_job_status.upper()
            
            # Clear output for clean display
            clear_output(wait=True)
            
            # Display current status
            elapsed = int(time.time() - start_time)
            print(f"🔄 Sync Progress for {connection_name}")
            print(f"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
            print(f"📊 Status: {status}")
            print(f"⏱️  Elapsed: {elapsed}s")
            
            if connection.latest_sync_job_started_at:
                print(f"🕐 Started: {connection.latest_sync_job_started_at}")
            
            if status == "COMPLETED":
                print("\n✅ Sync completed successfully!")
                if connection.latest_sync_job_completed_at and connection.latest_sync_job_started_at:
                    duration = (connection.latest_sync_job_completed_at - connection.latest_sync_job_started_at).total_seconds()
                    print(f"⏱️  Total duration: {int(duration)}s")
                print(f"\n💡 Your {connection_name} data is now searchable!")
                break
            elif status == "FAILED":
                print("\n❌ Sync failed!")
                if connection.latest_sync_job_error:
                    print(f"Error: {connection.latest_sync_job_error}")
                break
            else:
                print("\n⏳ Sync in progress...")
                print(f"   Processing your {connection_name} data...")
        
        time.sleep(2)  # Poll every 2 seconds
        
        # Timeout check
        if time.time() - start_time > max_wait_seconds:
            print(f"\n⚠️  Sync taking longer than {max_wait_seconds}s - it may still complete in the background")
            break

# Monitor GitHub sync
monitor_sync_progress(github_connection.id, "GitHub")

## 💳 Step 5: Add Stripe Data Source

Let's add another data source - **Stripe** - to demonstrate the power of multi-source search.

With Stripe connected, you'll be able to search across:
- 💰 Customer payment data
- 🧾 Invoices and transactions
- 📊 Product and pricing information

Combined with GitHub, this creates a powerful knowledge base spanning code AND business data!
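
Since this tutorial is meant to run against Stripe test data, it can help to confirm that the key is a test-mode key before connecting. Stripe test-mode secret keys start with `sk_test_` and live-mode secret keys start with `sk_live_` (restricted keys use `rk_test_` and `rk_live_`). The helper name `stripe_key_mode` is hypothetical.

```python
def stripe_key_mode(api_key: str) -> str:
    """Classify a Stripe key as "test", "live", or "unknown" from its prefix."""
    if api_key.startswith(("sk_test_", "rk_test_")):
        return "test"
    if api_key.startswith(("sk_live_", "rk_live_")):
        return "live"
    return "unknown"


example_key = "sk_test_your_stripe_key_here"  # placeholder from this notebook
mode = stripe_key_mode(example_key)
if mode != "test":
    print(f"⚠️  Expected a test-mode key, got a {mode} key")
else:
    print("✅ Test-mode Stripe key detected")
```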

In [None]:
# Create Stripe source connection
print("🔄 Connecting Stripe account...")

stripe_connection = client.source_connections.create_source_connection(
    name="Stripe - Test Account",
    short_name="stripe",
    collection=COLLECTION_ID,  # Add to the same collection as GitHub
    auth_fields={
        "api_key": STRIPE_API_KEY
    },
    config_fields={},  # No additional config needed for Stripe
    sync_immediately=True  # Start syncing immediately
)

print("\n✅ Stripe connected successfully!")
print(f"📌 Connection ID: {stripe_connection.id}")
print(f"📊 Status: {stripe_connection.status}")
print(f"🔄 Sync Job ID: {stripe_connection.latest_sync_job_id}")

# Monitor Stripe sync
monitor_sync_progress(stripe_connection.id, "Stripe")

## 🔍 Step 6: Search Your Data

Now comes the magic! With both sources synced, you can:
- 🔍 Search across GitHub AND Stripe with one query
- 🤖 Get AI-powered summaries of your data
- 🎯 Find connections between code and business data

Let's try some searches!
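
When results come from several sources, grouping them by source makes the mix easier to read. The sketch below uses `SimpleNamespace` stand-ins with the `source`, `score`, and `content` attributes that the search cells in this notebook read. The sample values are made up, and `group_by_source` is a hypothetical helper rather than an SDK function.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_source(results):
    """Group result objects by .source, best score first within each group."""
    groups = defaultdict(list)
    for result in results:
        groups[result.source].append(result)
    for items in groups.values():
        items.sort(key=lambda r: r.score, reverse=True)
    return dict(groups)


# Made-up stand-ins for search results; real results come from the search call.
sample_results = [
    SimpleNamespace(source="github", score=0.71, content="README: project overview"),
    SimpleNamespace(source="stripe", score=0.64, content="Invoice for customer A"),
    SimpleNamespace(source="github", score=0.83, content="Architecture notes"),
]

for source, items in group_by_source(sample_results).items():
    print(f"{source}: {len(items)} result(s), top score {items[0].score:.2f}")
```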

In [None]:
# Search for code-related information
code_query = "What are the main features and architecture of this project?"

# Perform search with raw results
code_results = client.collections.search_collection(
    readable_id=COLLECTION_ID,
    query=code_query,
    response_type=ResponseType.RAW
)

# Display top results
for i, result in enumerate(code_results.results[:3], 1):  # Show top 3
    print(f"📄 Result {i}:")
    print(f"   Source: {result.source}")
    print(f"   Score: {result.score:.3f}")
    if hasattr(result, 'metadata') and result.metadata:
        if hasattr(result.metadata, 'title'):
            print(f"   Title: {result.metadata.title}")
        if hasattr(result.metadata, 'url'):
            print(f"   URL: {result.metadata.url}")
    print(f"   Content: {result.content[:200]}...\n")

In [None]:
# Search for payment-related information
payment_query = "Show me recent customer transactions or payment activity"

# Get AI completion response for better summarization
payment_results = client.collections.search_collection(
    readable_id=COLLECTION_ID,
    query=payment_query,
    response_type=ResponseType.COMPLETION  # AI will analyze and summarize the results
)

print("🤖 AI Summary:")
print("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
print(payment_results.completion)
print("\n📚 Based on sources:")
sources = set()
for result in payment_results.results:
    sources.add(result.source)
for source in sources:
    print(f"   • {source}")

## 🎉 Congratulations!

You've successfully:
- ✅ Created an Airweave **collection** to organize your data
- ✅ Connected **GitHub** and **Stripe** as data sources
- ✅ Synced data from multiple sources into a unified knowledge base
- ✅ Searched across all your data with semantic search

### 🤔 What Just Happened?

Airweave has:
1. **Extracted** entities from your GitHub repo (files, issues, PRs) and Stripe account (customers, payments)
2. **Processed** them into searchable chunks with metadata
3. **Stored** them in a vector database for semantic search
4. **Unified** them under one collection for cross-source queries
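
To make the four steps concrete, here is a toy, self-contained sketch of the same idea: split text into chunks, turn each chunk into a vector, and rank chunks from several sources against a query. It uses plain word counts and cosine similarity as a stand-in for real embeddings and a vector database, so it illustrates the concept only and is not how Airweave is implemented. All names and sample texts are invented.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 8) -> list:
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def vectorize(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented sample documents from two sources, indexed together.
documents = {
    "github": "Fix retry logic in the payment webhook handler",
    "stripe": "Invoice paid by customer for the annual plan",
}
index = [
    (source, piece, vectorize(piece))
    for source, text in documents.items()
    for piece in chunk(text)
]


def search(query: str, top_k: int = 1):
    """Return the top_k (score, source, chunk) tuples across all sources."""
    q = vectorize(query)
    scored = [(cosine(q, vec), source, piece) for source, piece, vec in index]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:top_k]


for score, source, piece in search("customer invoice"):
    print(f"{source} ({score:.2f}): {piece}")
```

Because every chunk keeps its source label in the shared index, a single query can rank GitHub and Stripe content side by side, which is the essence of a cross-source collection.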

Your AI agents can now search all this data through a single API!

### 🚀 What's Next?

Now that you have a searchable knowledge base, check out our other examples:

1. **[Building AI Agents with Function Calling](./02_ai_agent_with_function_calling.ipynb)** - Use OpenAI function calling to build agents that can search your Airweave data
2. **[Using Airweave MCP Server](./03_mcp_server_integration.ipynb)** - Integrate Airweave as an MCP tool for advanced AI workflows

### 💾 Save Your Collection ID

You'll need this collection ID for the next examples:

In [None]:
print(f"📌 Your Collection ID: {COLLECTION_ID}")
print("\n💡 Save this ID to use in the next examples!")

## 🧹 Optional: Cleanup

If you want to clean up the resources created in this tutorial:

In [None]:
# Uncomment the lines below to delete the collection and all its data
# print("🧹 Cleaning up...")
#
# # Delete the collection (this will also delete all source connections)
# client.collections.delete_collection(
#     readable_id=COLLECTION_ID,
#     delete_data=True  # Also delete data in vector/graph stores
# )
#
# print("✅ Cleanup complete!")