# BrickByte - GitHub Example

This notebook demonstrates how to sync data from GitHub to Databricks using BrickByte.

## Prerequisites
- GitHub account
- GitHub Personal Access Token (generate at https://github.com/settings/tokens)
- Databricks workspace with Unity Catalog
- Databricks SQL warehouse (for `http_path`) and a Databricks Personal Access Token

## Note
For public repositories, a token with minimal scopes is sufficient. For private repositories, the token must be able to read those repositories (for a classic token, that means the `repo` scope).
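
Rather than pasting a token directly into the notebook, one option is to read it from the environment. The sketch below is only an illustration: the variable name `GITHUB_PAT` and the helper `read_token` are hypothetical and not part of BrickByte. On Databricks, `dbutils.secrets.get(scope=..., key=...)` is a common alternative.

```python
import os


def read_token(env_var: str) -> str:
    """Return the token stored in `env_var`, or raise if it is missing or empty."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return token


# Hypothetical variable name; the placeholder is set only so the example runs.
os.environ.setdefault("GITHUB_PAT", "ghp_example_placeholder")
github_pat = read_token("GITHUB_PAT")
```

The returned value could then be passed as `personal_access_token` in the source config instead of a string literal.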


In [None]:
%run ./_setup


In [None]:
from brickbyte import BrickByte

bb = BrickByte(
    sources=["source-github"],
    destination="destination-databricks",
    destination_install="git+https://github.com/park-peter/brickbyte.git#subdirectory=integrations/destination-databricks-py"
)
bb.setup()


In [None]:
import airbyte as ab

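# True re-syncs every stream from scratch; set to False to sync
# incrementally where the stream supports it.
FORCE_FULL_REFRESH = True
# Local cache that holds the records between the read and the write
cache = bb.get_or_create_cache()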

# Configure the GitHub source
# Documentation: https://docs.airbyte.com/integrations/sources/github
source = ab.get_source(
    "source-github",
    config={
        "credentials": {
            "option_title": "PAT Credentials",
            "personal_access_token": "",  # Your GitHub PAT
        },
        "repositories": ["airbytehq/airbyte"],  # format: "owner/repo"
        # "start_date": "2024-01-01T00:00:00Z",  # Optional
    },
    local_executable=bb.get_source_exec_path("source-github")
)
source.check()
source.select_all_streams()
print("Available streams:", source.get_available_streams())
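
The source config above expects repositories in `"owner/repo"` form and an optional `start_date` such as `"2024-01-01T00:00:00Z"`. As a minimal sketch, the hypothetical helpers below (not part of BrickByte or Airbyte) check the repository strings and build a `start_date` relative to now. The pattern is a simplification that covers only the plain `owner/repo` form.

```python
import re
from datetime import datetime, timedelta, timezone

# Simplified check for "owner/repo"; this is not GitHub's full naming rule set.
REPO_PATTERN = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


def invalid_repositories(repositories):
    """Return the entries that do not look like "owner/repo"."""
    return [repo for repo in repositories if not REPO_PATTERN.match(repo)]


def start_date_days_ago(days, now=None):
    """Format a UTC timestamp `days` before `now` as YYYY-MM-DDTHH:MM:SSZ."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")


print(invalid_repositories(["airbytehq/airbyte", "not-a-repo"]))
print(start_date_days_ago(30))
```

Running the check before `ab.get_source` surfaces a malformed repository name without waiting for the connector to start.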


In [None]:
# Configure the Databricks destination
destination = ab.get_destination(
    "destination-databricks",
    config={
        "server_hostname": "",  # e.g., "adb-xxx.azuredatabricks.net"
        "http_path": "",        # e.g., "/sql/1.0/warehouses/abc123"
        "token": "",            # Your Databricks PAT
        "catalog": "",          # Unity Catalog name
        "schema": "",           # Target schema
    },
    local_executable=bb.get_destination_exec_path()
)

write_result = destination.write(source, cache=cache, force_full_refresh=FORCE_FULL_REFRESH)
print("Sync completed!")
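
The destination config above ships with empty placeholders. A pre-flight check, run before calling `ab.get_destination`, can flag unfilled values early. This is a sketch only: `missing_fields` is a hypothetical helper, and the required keys are simply the ones shown in the config above.

```python
# Keys taken from the destination config shown above.
REQUIRED_FIELDS = ("server_hostname", "http_path", "token", "catalog", "schema")


def missing_fields(config, required=REQUIRED_FIELDS):
    """Return the required keys whose values are absent, None, or blank."""
    missing = []
    for key in required:
        value = config.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


# Example values are placeholders for illustration only.
example_config = {
    "server_hostname": "adb-xxx.azuredatabricks.net",
    "http_path": "/sql/1.0/warehouses/abc123",
    "token": "",
    "catalog": "main",
    "schema": "",
}
print(missing_fields(example_config))  # ['token', 'schema']
```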


In [None]:
# Cleanup virtual environments
bb.cleanup()


## Query Your Data

After the sync completes, you can query your GitHub data. Replace `your_catalog.your_schema` with the catalog and schema set in the destination config:

```sql
-- View recent commits
SELECT 
    _airbyte_emitted_at,
    _airbyte_data:sha AS commit_sha,
    _airbyte_data:commit.message AS message,
    _airbyte_data:commit.author.name AS author,
    _airbyte_data:commit.author.date AS commit_date
FROM your_catalog.your_schema._airbyte_raw_commits
ORDER BY _airbyte_data:commit.author.date DESC
LIMIT 20;

-- View open issues
SELECT 
    _airbyte_data:number AS issue_number,
    _airbyte_data:title AS title,
    _airbyte_data:state AS state,
    _airbyte_data:user.login AS author,
    _airbyte_data:created_at AS created_at
FROM your_catalog.your_schema._airbyte_raw_issues
WHERE _airbyte_data:state = 'open'
ORDER BY _airbyte_data:created_at DESC
LIMIT 20;
```
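
To avoid editing the placeholders by hand, the table name can be built from the same catalog and schema values used in the destination config. The sketch below assumes the `_airbyte_raw_<stream>` naming seen in the queries above. `raw_table` and `open_issues_query` are hypothetical helpers, and the identifier check is deliberately strict (letters, digits, and underscores only).

```python
def raw_table(catalog, schema, stream):
    """Build catalog.schema._airbyte_raw_<stream> from plain identifiers."""
    for part in (catalog, schema, stream):
        # Reject empty names and anything that would need quoting in SQL.
        if not part.replace("_", "").isalnum():
            raise ValueError(f"Unsupported identifier: {part!r}")
    return f"{catalog}.{schema}._airbyte_raw_{stream}"


def open_issues_query(catalog, schema, limit=20):
    """Return a trimmed version of the open-issues query above as a string."""
    return (
        "SELECT _airbyte_data:number AS issue_number, _airbyte_data:title AS title\n"
        f"FROM {raw_table(catalog, schema, 'issues')}\n"
        "WHERE _airbyte_data:state = 'open'\n"
        f"LIMIT {int(limit)}"
    )


# "main" and "github" are example names, not values from this notebook.
print(open_issues_query("main", "github"))
```

In a Databricks notebook, the resulting string could be passed to `spark.sql(...)`, where `spark` is the session the runtime provides.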
