[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/introduction/02_Data_Ingestion.ipynb)

# Data Ingestion - Comprehensive Guide

## Overview

This notebook provides a comprehensive guide to Semantica's data ingestion capabilities. It covers all submodules, classes, and helper functions available in the `semantica.ingest` module.

**Documentation**: [Ingest API Reference](https://semantica.readthedocs.io/reference/ingest/)

### Table of Contents

1.  **Unified Ingestion**: `ingest` function
2.  **File Ingestion**: `FileIngestor`, `FileTypeDetector`, `CloudStorageIngestor`
3.  **Web Ingestion**: `WebIngestor`, `ContentExtractor`, `SitemapCrawler`, `RobotsChecker`
4.  **Feed Ingestion**: `FeedIngestor`, `FeedMonitor`
5.  **Stream Ingestion**: `StreamIngestor`, `StreamMonitor`
6.  **Repository Ingestion**: `RepoIngestor`, `CodeExtractor`, `GitAnalyzer`
7.  **Email Ingestion**: `EmailIngestor`, `AttachmentProcessor`
8.  **Database Ingestion**: `DBIngestor`, `DatabaseConnector`
9.  **MCP Ingestion**: `MCPIngestor`
10. **Configuration**: `IngestConfig`

## Installation

Install Semantica with all dependencies:

```bash
pip install "semantica[all]"
```

---

## 1. Unified Ingestion

The `ingest` function is the main entry point for quick data loading. It automatically detects the source type.
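
As a rough illustration of what automatic source detection can involve, the sketch below classifies a source string by its URL scheme and file extension. The helper `guess_source_type` and its category names are hypothetical and are not part of Semantica's API; the real `ingest` function may apply different rules.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical lookup tables for this sketch only (not taken from Semantica).
FEED_SUFFIXES = {".rss", ".atom"}
DB_SCHEMES = {"sqlite", "postgresql", "mysql"}


def guess_source_type(source: str) -> str:
    """Map a source string to a coarse category: db, repo, feed, web, or file."""
    parsed = urlparse(source)
    scheme = parsed.scheme.lower()
    if scheme in DB_SCHEMES:
        return "db"
    if scheme in {"http", "https"}:
        path = parsed.path.lower()
        if path.endswith(".git"):
            return "repo"
        if Path(path).suffix in FEED_SUFFIXES:
            return "feed"
        return "web"
    # Anything without a recognized scheme is treated as a local file path.
    return "file"


for src in [
    "report.pdf",
    "https://example.com/news.rss",
    "https://github.com/Hawksight-AI/semantica.git",
    "sqlite:///test.db",
]:
    print(src, "->", guess_source_type(src))
```

Checking the scheme before the extension keeps database URLs and remote resources from being mistaken for local files that happen to share a suffix.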


In [1]:
!pip install semantica






## 2. File Ingestion

Detailed control over file processing using `FileIngestor` and helper classes.


In [4]:
import os
import tempfile
from semantica.ingest import FileIngestor, FileTypeDetector, CloudStorageIngestor

# Ensure dependencies from previous cells are available
if 'temp_dir' not in locals():
    temp_dir = tempfile.mkdtemp()
    print(f"Created temporary directory: {temp_dir}")

if 'sample_file' not in locals():
    sample_file = os.path.join(temp_dir, "sample_large.txt")

if not os.path.exists(sample_file):
    # Create a sample file with a lot of info
    with open(sample_file, 'w') as f:
        f.write("# Semantica Data Ingestion Guide\n\n")
        f.write("Semantica is a powerful framework for semantic data processing.\n")
        # ... (more content) ...
    print(f"Created sample file: {sample_file}")

# --- FileTypeDetector ---
detector = FileTypeDetector()
detected_type = detector.detect_type(sample_file)
print(f"Detected Type: {detected_type}")
# ...

Detected Type: txt
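
Type detection is commonly done by inspecting a file's leading bytes and falling back to its extension. The following is a minimal standard-library sketch of that idea; `sniff_file_type` and `MAGIC_SIGNATURES` are hypothetical names, and this is not how `FileTypeDetector` is necessarily implemented.

```python
import os
import tempfile

# A few well-known file signatures ("magic bytes"); illustrative, not exhaustive.
MAGIC_SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
}


def sniff_file_type(path: str) -> str:
    """Return a type label from magic bytes, falling back to the extension."""
    with open(path, "rb") as f:
        head = f.read(16)
    for signature, label in MAGIC_SIGNATURES.items():
        if head.startswith(signature):
            return label
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return ext or "unknown"


with tempfile.TemporaryDirectory() as tmp:
    # A genuine text file: no signature matches, so the extension decides.
    txt_path = os.path.join(tmp, "notes.txt")
    with open(txt_path, "w") as f:
        f.write("plain text")
    # A PDF header saved under a misleading .txt name: the content decides.
    pdf_path = os.path.join(tmp, "mislabeled.txt")
    with open(pdf_path, "wb") as f:
        f.write(b"%PDF-1.7\n")
    txt_label = sniff_file_type(txt_path)
    pdf_label = sniff_file_type(pdf_path)

print(txt_label, pdf_label)
```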


## 3. Web Ingestion

Scraping and crawling with `WebIngestor`, `ContentExtractor`, and `SitemapCrawler`.


In [6]:
from semantica.ingest import WebIngestor, ContentExtractor, SitemapCrawler, RobotsChecker

# --- ContentExtractor ---
extractor = ContentExtractor()
html_content = "<html><body><h1>Hello World</h1><p>This is a test.</p><a href='/link'>Link</a></body></html>"
text = extractor.extract_text(html_content)
links = extractor.extract_links(html_content, base_url="https://example.com")
print(f"Extracted Text: {text}")
print(f"Extracted Links: {links}")

# --- RobotsChecker ---
checker = RobotsChecker()
# Only the URL is passed here: the recorded run of this cell failed with
# "TypeError: RobotsChecker.can_fetch() takes 2 positional arguments but 3 were given"
# when a user agent was supplied as a second positional argument.
can_fetch = checker.can_fetch("https://www.google.com/search")
print(f"Can fetch google search? {can_fetch}")

# --- WebIngestor ---
web_ingestor = WebIngestor(delay=1.0)
try:
    web_content = web_ingestor.ingest_url("https://example.com")
    print(f"Web Content Title: {web_content.title}")
except Exception as e:
    print(f"Web ingest failed: {e}")

# --- SitemapCrawler ---
crawler = SitemapCrawler()
try:
    urls = crawler.parse_sitemap("https://www.google.com/sitemap.xml")
    print(f"Found {len(urls)} URLs in sitemap")
except Exception as e:
    print(f"Sitemap crawl failed: {e}")

Extracted Text: Hello WorldThis is a test.Link
Extracted Links: ['https://example.com/link']

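
The robots.txt check itself can be reproduced offline with Python's standard library, which is useful for seeing what such a checker decides. This sketch uses `urllib.robotparser` on an inline rule set; the rules are made up for illustration, and it does not show how `RobotsChecker` works internally. Note that the standard-library `can_fetch` takes the user agent first and the URL second.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; a real crawler would download the site's /robots.txt.
robots_txt = """\
User-agent: *
Disallow: /search
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Rules are evaluated in order, so /search is blocked before "Allow: /" applies.
allowed_home = parser.can_fetch("MyBot", "https://example.com/")
allowed_search = parser.can_fetch("MyBot", "https://example.com/search?q=test")
print(f"Home allowed: {allowed_home}, search allowed: {allowed_search}")
```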

## 4. Feed Ingestion

Consuming RSS/Atom feeds with `FeedIngestor` and monitoring with `FeedMonitor`.


In [None]:
from semantica.ingest import FeedIngestor, FeedMonitor
import time

# --- FeedIngestor ---
feed_ingestor = FeedIngestor()
try:
    feed_data = feed_ingestor.ingest_feed("https://feeds.feedburner.com/oreilly/radar")
    print(f"Feed Title: {feed_data.title}")
except Exception as e:
    print(f"Feed ingest failed: {e}")

# --- FeedMonitor ---
def feed_callback(feed_data):
    print(f"Feed Updated: {feed_data.title} with {len(feed_data.items)} items")

monitor = FeedMonitor(check_interval=5)
try:
    monitor.monitor("https://feeds.feedburner.com/oreilly/radar", callback=feed_callback)
    time.sleep(2) # Let it run briefly
    monitor.stop()
except Exception as e:
    print(f"Feed monitor failed: {e}")
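
To make the shape of feed data concrete without a network call, the sketch below parses a small inline RSS 2.0 document with `xml.etree.ElementTree`. The feed content is invented for the example, and the resulting dictionaries only approximate the `title` and `items` attributes used above; they are not Semantica's feed data type.

```python
import xml.etree.ElementTree as ET

# A tiny RSS 2.0 document, made up for this example.
rss_xml = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""

channel = ET.fromstring(rss_xml).find("channel")
feed_title = channel.findtext("title")
# Each <item> becomes a plain dict with the fields this sketch cares about.
items = [
    {"title": item.findtext("title"), "link": item.findtext("link")}
    for item in channel.findall("item")
]
print(f"{feed_title}: {len(items)} items")
```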


## 5. Stream Ingestion

Real-time processing with `StreamIngestor` and `StreamMonitor`.


In [None]:
from semantica.ingest import StreamIngestor, StreamMonitor

stream_ingestor = StreamIngestor()

# These calls expect Kafka and RabbitMQ brokers on localhost, so failures are
# caught and reported like in the other network-dependent cells.
try:
    # --- Kafka Processor ---
    kafka_config = {"bootstrap_servers": ["localhost:9092"]}
    kafka_processor = stream_ingestor.ingest_kafka("my-topic", **kafka_config)

    # --- RabbitMQ Processor ---
    rabbitmq_processor = stream_ingestor.ingest_rabbitmq("my-queue", "amqp://guest:guest@localhost:5672/")

    # --- Stream Monitor ---
    monitor = stream_ingestor.monitor
    health = monitor.check_health()
    print(f"Stream Health: {health['overall']}")
    print(f"Processors: {list(health['processors'].keys())}")
except Exception as e:
    print(f"Stream ingest failed (Brokers required): {e}")


## 6. Repository Ingestion

Analyzing codebases with `RepoIngestor`, `CodeExtractor`, and `GitAnalyzer`.


In [None]:
from semantica.ingest import RepoIngestor, CodeExtractor, GitAnalyzer

# --- CodeExtractor ---
code_extractor = CodeExtractor()
py_code = "class MyClass:\n    def my_method(self):\n        pass"
structure = code_extractor.extract_structure(py_code, language="python")
print(f"Classes: {structure.get('classes')}")
print(f"Functions: {structure.get('functions')}")

# --- RepoIngestor ---
repo_ingestor = RepoIngestor()
try:
    repo_data = repo_ingestor.ingest_repository("https://github.com/Hawksight-AI/semantica.git")
    print(f"Repo Name: {repo_data['name']}")
except Exception as e:
    print(f"Repo ingest failed: {e}")

# --- GitAnalyzer ---
try:
    analyzer = GitAnalyzer(".")
    stats = analyzer.get_statistics()
    print(f"Commits in current repo: {stats.get('total_commits', 'N/A')}")
except Exception as e:
    print(f"Git analysis failed: {e}")
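
For Python sources, the kind of structure reported above (classes and functions) can be recovered with the standard `ast` module. This is a minimal sketch of the technique on an inline snippet; it is an assumption that `CodeExtractor` does something comparable, and the `helper` function is added here only to show a top-level definition.

```python
import ast

source = (
    "class MyClass:\n"
    "    def my_method(self):\n"
    "        pass\n"
    "\n"
    "def helper():\n"
    "    return 1\n"
)

tree = ast.parse(source)
# ast.walk visits every node, so methods nested in classes are included too.
classes = sorted(n.name for n in ast.walk(tree) if isinstance(n, ast.ClassDef))
functions = sorted(
    n.name
    for n in ast.walk(tree)
    if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
)
print(f"Classes: {classes}")
print(f"Functions: {functions}")
```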


## 7. Email Ingestion

Processing emails with `EmailIngestor` and `AttachmentProcessor`.


In [None]:
from semantica.ingest import EmailIngestor, AttachmentProcessor

# --- AttachmentProcessor ---
att_processor = AttachmentProcessor()
dummy_content = b"PDF Content"
saved_path = att_processor.save_attachment(dummy_content, "doc.pdf", temp_dir)
print(f"Saved attachment to: {saved_path}")

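# --- EmailIngestor ---
# Read credentials from the environment rather than hard-coding them.
# IMAP_USER and IMAP_PASSWORD are example variable names; the "user" and "pass"
# fallbacks are placeholders that will fail authentication.
email_ingestor = EmailIngestor()
try:
    email_ingestor.connect_imap(
        "imap.gmail.com",
        os.environ.get("IMAP_USER", "user"),
        os.environ.get("IMAP_PASSWORD", "pass"),
    )
    emails = email_ingestor.ingest_mailbox("INBOX", max_emails=5)
    print(f"Fetched {len(emails)} emails")
except Exception as e:
    print(f"Email ingest failed (Auth required): {e}")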
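
Because the IMAP example needs real credentials, it can help to see message and attachment handling on an in-memory email instead. The sketch below uses only the standard `email` package; the addresses and subject are invented, and it illustrates the general technique rather than the internals of `EmailIngestor` or `AttachmentProcessor`.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a message in memory so the example needs no mail server.
msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Quarterly report"
msg.set_content("Report attached.")
msg.add_attachment(
    b"PDF Content", maintype="application", subtype="pdf", filename="doc.pdf"
)

# Round-trip through bytes, as if the message had been fetched from a mailbox.
parsed = message_from_bytes(msg.as_bytes(), policy=policy.default)
body = parsed.get_body(preferencelist=("plain",)).get_content()
attachments = [
    (part.get_filename(), part.get_content()) for part in parsed.iter_attachments()
]
print(f"Subject: {parsed['Subject']}")
print(f"Attachments: {[name for name, _ in attachments]}")
```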


## 8. Database Ingestion

Connecting to SQL databases with `DBIngestor` and `DatabaseConnector`.


In [None]:
from semantica.ingest import DBIngestor, DatabaseConnector
import sqlite3

# Setup SQLite DB (drop the table first so the cell can be re-run safely)
db_path = os.path.join(temp_dir, "test.db")
conn = sqlite3.connect(db_path)
conn.execute("DROP TABLE IF EXISTS items")
conn.execute("CREATE TABLE items (id INT, name TEXT)")
conn.execute("INSERT INTO items VALUES (1, 'Item 1'), (2, 'Item 2')")
conn.commit()
conn.close()

# --- DatabaseConnector ---
connector = DatabaseConnector()
engine = connector.create_engine(f"sqlite:///{db_path}")
print(f"Connected to DB: {engine.name}")

# --- DBIngestor ---
db_ingestor = DBIngestor()
table_data = db_ingestor.ingest_database(f"sqlite:///{db_path}", table="items")
print(f"Table: {table_data.table_name}")
print(f"Rows: {table_data.row_count}")
for row in table_data.rows:
    print(f" - {row}")
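
For comparison, the same table can be listed and read with the standard `sqlite3` module alone. This sketch uses an in-memory database and returns rows as dictionaries; it is an illustration of table ingestion in general and makes no claim about how `DBIngestor` builds its result object.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can then be accessed by column name
conn.execute("CREATE TABLE items (id INT, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, "Item 1"), (2, "Item 2")])

# Discover tables from SQLite's schema catalog, then read one of them.
tables = [
    r["name"]
    for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
rows = [dict(r) for r in conn.execute("SELECT id, name FROM items ORDER BY id")]
conn.close()

print(f"Tables: {tables}")
print(f"Rows: {rows}")
```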


## 9. MCP Ingestion

Integrating with Model Context Protocol servers using `MCPIngestor`.


In [None]:
from semantica.ingest import MCPIngestor

mcp_ingestor = MCPIngestor()

try:
    # Connect
    mcp_ingestor.connect("weather_server", url="http://localhost:8000/mcp")

    # Ingest Resources
    resources = mcp_ingestor.ingest_resources("weather_server")
    print(f"Resources: {len(resources)}")

    # Call Tool
    result = mcp_ingestor.ingest_tool_output("weather_server", "get_forecast", {"city": "NYC"})
    print(f"Tool Result: {result.content}")
except Exception as e:
    print(f"MCP ingest failed (Server required): {e}")
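
Model Context Protocol messages are JSON-RPC 2.0 requests, so the two operations above correspond roughly to a `resources/list` request and a `tools/call` request. The sketch below only builds those payloads; the helper `jsonrpc_request` is hypothetical, the initialization handshake and transport are omitted, and the exact messages `MCPIngestor` sends are an assumption.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id per request


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request dict (hypothetical helper for this sketch)."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


list_resources = jsonrpc_request("resources/list")
call_tool = jsonrpc_request(
    "tools/call", {"name": "get_forecast", "arguments": {"city": "NYC"}}
)
print(json.dumps(list_resources))
print(json.dumps(call_tool, indent=2))
```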


## 10. Configuration

Managing ingestion settings with `IngestConfig`.


In [None]:
from semantica.ingest import IngestConfig, ingest_config

# Global config
print(f"Default Source Type: {ingest_config.get('default_source_type')}")

# Custom config instance
config = IngestConfig()
config.set("max_file_size", 1024 * 1024) # 1MB
print(f"Max File Size: {config.get('max_file_size')} bytes")
