SQL Server to PostgreSQL Migration Pipeline

An Apache Airflow pipeline for automated, full-refresh migrations from Microsoft SQL Server to PostgreSQL. Built with the Astronomer framework for reliable orchestration and easy deployment.

Features

  • Schema Discovery: Automatically extract table structures, columns, indexes, and foreign keys from SQL Server
  • Type Mapping: Convert 30+ SQL Server data types to their PostgreSQL equivalents
  • Streaming Data Transfer: Move data efficiently using server-side cursors, keyset pagination, and PostgreSQL's COPY protocol
  • Validation: Standalone validation DAG verifies migration success through row count comparisons
  • Parallelization: Transfer multiple tables concurrently using Airflow's dynamic task mapping
  • Large Table Partitioning: Automatically partitions tables >5M rows into parallel chunks by primary key range

Performance

Tested against the StackOverflow 2010 dataset:

| Metric | Value |
|---|---|
| Total Rows Migrated | 19.3 million |
| Tables | 9 |
| Migration Time | ~2.5 minutes |
| Throughput | ~125,000 rows/sec |
| Validation Success | 100% (9/9 tables) |

Performance Optimizations

  • 100k chunk size: batches 10x the 10,000-row default reduce per-chunk overhead
  • Parallel partitioning: large tables are split into 4 parallel partitions by primary key range (sketched below)
  • Connection pooling: PostgreSQL connections are reused across operations

See docs/PARALLEL_PARTITIONING.md for details on large table partitioning.
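
A minimal sketch of how PK-range chunking might be computed; the helper name and the fixed partition count are illustrative, not the pipeline's actual API:

def plan_partitions(min_id, max_id, num_partitions=4):
    """Split an integer PK range into contiguous, roughly equal sub-ranges."""
    span = max_id - min_id + 1
    step = -(-span // num_partitions)  # ceiling division
    bounds = []
    lower = min_id
    while lower <= max_id:
        upper = min(lower + step - 1, max_id)
        bounds.append((lower, upper))
        lower = upper + 1
    return bounds

# A table with IDs 1..10,000,000 yields four ~2.5M-row ranges:
# plan_partitions(1, 10_000_000)
# -> [(1, 2500000), (2500001, 5000000), (5000001, 7500000), (7500001, 10000000)]

Each range can then be transferred by an independent parallel task.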

Tables Migrated

| Table | Rows |
|---|---|
| Badges | 1,102,019 |
| Comments | 3,875,183 |
| Posts | 3,729,195 |
| Users | 299,398 |
| Votes | 10,143,364 |
| PostTypes | 8 |
| VoteTypes | 15 |
| PostLinks | 149,313 |
| LinkTypes | 3 |

How It Works

The pipeline executes as a single Airflow DAG with the following stages (a skeletal sketch of the wiring follows the list):

Extract Schema -> Create Target Schema -> Create Tables -> Transfer Data (parallel) -> Create Foreign Keys -> Validate -> Report
  1. Schema Extraction: Queries SQL Server system catalogs to discover all tables, columns, data types, indexes, and constraints
  2. DDL Generation: Converts SQL Server schemas to PostgreSQL-compatible DDL with proper type mappings
  3. Table Creation: Creates target tables in PostgreSQL (drops existing tables first)
  4. Data Transfer: Streams data using keyset pagination with direct pymssql connections and PostgreSQL COPY protocol
  5. Foreign Key Creation: Adds foreign key constraints after all data is loaded
  6. Validation: Triggers standalone validation DAG that compares source and target row counts
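
A skeletal sketch of that wiring using Airflow's TaskFlow API and dynamic task mapping; the task names mirror the stages above but are illustrative, and the bodies are elided:

from airflow.decorators import dag, task
import pendulum

@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def migration_sketch():
    @task
    def extract_schema():
        ...  # query SQL Server system catalogs; return one spec per table

    @task
    def create_tables(tables):
        ...  # generate PostgreSQL DDL and create the target tables

    @task
    def transfer_table(table):
        ...  # stream one table (or partition) via keyset pagination + COPY

    @task
    def create_foreign_keys():
        ...  # add FK constraints after all data is loaded

    tables = extract_schema()
    # Dynamic task mapping: one parallel task instance per table
    transfers = transfer_table.expand(table=tables)
    create_tables(tables) >> transfers >> create_foreign_keys()

migration_sketch()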

Architecture

Data Transfer Approach

The pipeline uses a streaming architecture optimized for large datasets (condensed sketch below):

  • Keyset Pagination: Uses primary key ordering instead of OFFSET/FETCH for efficient chunking of large tables
  • Direct Database Connections: Uses pymssql and psycopg2 directly to avoid Airflow hook limitations with large datasets
  • PostgreSQL COPY Protocol: Bulk loads data for maximum throughput
  • Server-Side Cursors: Streams rows without loading entire result sets into memory
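
A condensed sketch of one chunk transfer under this approach, assuming an integer primary key in the first column; function and parameter names are illustrative, and real code must also escape delimiters inside string values:

import io

def copy_chunk(mssql_conn, pg_conn, table, pk_col, last_pk, chunk_size):
    """Fetch the next chunk from SQL Server and COPY it into PostgreSQL.
    Returns the last PK seen, or None when the table is exhausted."""
    cur = mssql_conn.cursor()
    # Keyset pagination: seek past the last-seen key instead of OFFSET/FETCH
    cur.execute(
        f"SELECT TOP {int(chunk_size)} * FROM {table} "
        f"WHERE {pk_col} > %s ORDER BY {pk_col}",
        (last_pk,),
    )
    rows = cur.fetchall()
    cur.close()
    if not rows:
        return None

    # Serialize to COPY's text format: tab-delimited, \N for NULL
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join("\\N" if v is None else str(v) for v in row) + "\n")
    buf.seek(0)

    with pg_conn.cursor() as pg_cur:
        pg_cur.copy_expert(f"COPY {table} FROM STDIN", buf)
    pg_conn.commit()
    return rows[-1][0]  # assumes the PK is the first column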

Validation DAG

A standalone validate_migration_env DAG handles validation separately to avoid XCom serialization issues with large result sets. This DAG:

  • Uses direct database connections (psycopg2, pymssql)
  • Compares row counts between source and target for all tables
  • Can be triggered independently for ad-hoc validation (see the sketch below)
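
A minimal sketch of that row-count comparison, assuming already-open pymssql and psycopg2 connections and lowercased target table names; the function name is illustrative:

def compare_row_counts(mssql_conn, pg_conn, tables,
                       source_schema="dbo", target_schema="public"):
    """Return {table: (source_count, target_count, match)} for each table."""
    results = {}
    for table in tables:
        src_cur = mssql_conn.cursor()
        src_cur.execute(f"SELECT COUNT(*) FROM [{source_schema}].[{table}]")
        source_count = src_cur.fetchone()[0]
        src_cur.close()
        with pg_conn.cursor() as tgt_cur:
            tgt_cur.execute(f"SELECT COUNT(*) FROM {target_schema}.{table.lower()}")
            target_count = tgt_cur.fetchone()[0]
        results[table] = (source_count, target_count, source_count == target_count)
    return results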

Supported Data Types

| Category | SQL Server Types | PostgreSQL Mapping |
|---|---|---|
| Integer | bit, tinyint, smallint, int, bigint | BOOLEAN, SMALLINT, INTEGER, BIGINT |
| Decimal | decimal, numeric, money, smallmoney | DECIMAL, NUMERIC |
| Float | float, real | DOUBLE PRECISION, REAL |
| String | char, varchar, text, nchar, nvarchar, ntext | CHAR, VARCHAR, TEXT |
| Binary | binary, varbinary, image | BYTEA |
| Date/Time | date, time, datetime, datetime2, smalldatetime, datetimeoffset | DATE, TIME, TIMESTAMP |
| Other | uniqueidentifier, xml, geography, geometry | UUID, XML, GEOGRAPHY, GEOMETRY |
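
The mappings live in include/mssql_pg_migration/type_mapping.py; as an illustration of how such a lookup might be structured (the real module may differ in detail, and the NUMERIC precision shown is one plausible choice):

TYPE_MAP = {
    "bit": "BOOLEAN",
    "tinyint": "SMALLINT",     # PostgreSQL has no 1-byte integer type
    "int": "INTEGER",
    "bigint": "BIGINT",
    "money": "NUMERIC(19,4)",
    "float": "DOUBLE PRECISION",
    "nvarchar": "VARCHAR",     # length modifier applied separately
    "ntext": "TEXT",
    "image": "BYTEA",
    "datetime2": "TIMESTAMP",
    "uniqueidentifier": "UUID",
    "xml": "XML",
}

def map_type(mssql_type):
    # Conservative fallback: anything unrecognized becomes TEXT
    return TYPE_MAP.get(mssql_type.lower(), "TEXT")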

Quick Start

Migrate any SQL Server database to PostgreSQL in 3 steps:

# 1. Configure your databases
cp .env.example .env                              # Resource settings
cp airflow_settings.yaml.example airflow_settings.yaml  # Connections
# Edit airflow_settings.yaml with your database details

# 2. Start the stack
astro dev start                         # Connections loaded automatically

# 3. Run migration
astro dev run dags trigger mssql_to_postgres_migration

See Configuration for details on connecting to remote databases.

Getting Started

Prerequisites

  • Docker Desktop (4GB+ RAM recommended per database container)
  • Astronomer CLI (optional - can use standalone Docker)
  • Access to source SQL Server and target PostgreSQL databases

Installation

  1. Clone this repository:

    git clone <repository-url>
    cd mssql-to-postgres-pipeline
  2. Configure your environment:

    cp .env.example .env
    cp airflow_settings.yaml.example airflow_settings.yaml
    # Edit .env with resource settings for your machine (RAM presets)
    # Edit airflow_settings.yaml with your database connection details
  3. Start Airflow (creates Docker network):

    astro dev start
  4. (Optional) Start local database containers:

    If you need local SQL Server and PostgreSQL containers for testing:

    docker-compose up -d

    Note: Run astro dev start first - it creates the Docker network that database containers join.

    For remote databases (Azure SQL, AWS RDS, on-prem servers), skip this step and configure hosts in airflow_settings.yaml.

  5. Access the Airflow UI at http://localhost:8080

Configuration

Configuration is split between two files:

  • .env - Resource settings (memory limits, parallelism)
  • airflow_settings.yaml - Database connections

Database Connections

Edit airflow_settings.yaml to configure your source and target databases:

airflow:
  connections:
    # SOURCE: SQL Server
    - conn_id: mssql_source
      conn_type: mssql
      conn_host: mssql-server           # Docker container, IP, or hostname
      # conn_host: 192.168.1.100        # Remote server
      # conn_host: myserver.database.windows.net  # Azure SQL
      conn_port: 1433
      conn_login: sa
      conn_password: YourStrong@Passw0rd
      conn_schema: StackOverflow2013

    # TARGET: PostgreSQL
    - conn_id: postgres_target
      conn_type: postgres
      conn_host: postgres-target        # Docker container, IP, or hostname
      # conn_host: mydb.abc123.rds.amazonaws.com  # AWS RDS
      conn_port: 5432
      conn_login: postgres
      conn_password: PostgresPassword123
      conn_schema: stackoverflow

Connections are loaded automatically when you run astro dev start.
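
At runtime, a task can resolve these connection IDs into direct driver connections through Airflow's BaseHook; a hedged sketch with an illustrative helper name:

from airflow.hooks.base import BaseHook  # Airflow 2.x import path
import pymssql

def open_source_connection(conn_id="mssql_source"):
    # Look up the Airflow connection, then open a direct pymssql connection
    # (bypassing provider hooks for streaming transfers)
    conn = BaseHook.get_connection(conn_id)
    return pymssql.connect(
        server=conn.host,
        port=str(conn.port or 1433),
        user=conn.login,
        password=conn.password,
        database=conn.schema,
    )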

Resource Settings

The .env file contains resource presets for different machine sizes:

| RAM | Preset |
|---|---|
| 16GB | Conservative settings for laptops |
| 32GB | Default balanced settings |
| 64GB+ | High performance settings |

Uncomment the appropriate preset section in .env for your machine.

Applying Configuration Changes

After editing airflow_settings.yaml, restart Airflow to reload connections:

astro dev restart

Loading Test Data (Local SQL Server)

To restore a SQL Server backup file (.bak) into the local container:

# 1. Copy backup file to container
docker cp /path/to/YourDatabase.bak mssql-server:/tmp/

# 2. Restore the database
docker exec -it mssql-server /opt/mssql-tools18/bin/sqlcmd \
  -S localhost -U sa -P 'YourStrong@Passw0rd' -C \
  -Q "RESTORE DATABASE YourDatabase FROM DISK='/tmp/YourDatabase.bak' \
      WITH MOVE 'YourDatabase' TO '/var/opt/mssql/data/YourDatabase.mdf', \
           MOVE 'YourDatabase_log' TO '/var/opt/mssql/data/YourDatabase_log.ldf'"

# 3. Update airflow_settings.yaml with the database name
# conn_schema: YourDatabase

Note: Logical file names in the MOVE clause vary by backup. Use RESTORE FILELISTONLY to discover them.

Usage

Running the Migration

  1. Open the Airflow UI at http://localhost:8080
  2. Find the mssql_to_postgres_migration DAG
  3. Click the play button to trigger with default parameters, or use "Trigger DAG w/ config" to customize

Configuration Parameters

| Parameter | Default | Description |
|---|---|---|
| source_conn_id | mssql_source | Airflow connection ID for SQL Server |
| target_conn_id | postgres_target | Airflow connection ID for PostgreSQL |
| source_schema | dbo | Schema to migrate from SQL Server |
| target_schema | public | Target schema in PostgreSQL |
| chunk_size | 10000 | Rows per batch during transfer (100-100,000) |
| exclude_tables | [] | Table patterns to skip (supports wildcards) |
| validate_samples | false | Enable sample data validation (slower) |
| create_foreign_keys | true | Create foreign key constraints after transfer |
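
Wildcard matching for exclude_tables could be expressed with fnmatch-style patterns; a hedged sketch, since the pipeline's exact matching rules aren't documented here:

from fnmatch import fnmatch

def is_excluded(table, patterns):
    # Case-insensitive match against any pattern, e.g. "temp_*" or "*_audit"
    return any(fnmatch(table.lower(), pattern.lower()) for pattern in patterns)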

Example: Trigger via CLI

astro dev run dags trigger mssql_to_postgres_migration

Or with custom configuration:

astro dev run dags trigger mssql_to_postgres_migration \
  --conf '{"source_schema": "sales", "target_schema": "sales", "chunk_size": 50000}'

Monitoring

# View scheduler logs
astro dev logs -s -f

# List DAG runs
astro dev run dags list-runs mssql_to_postgres_migration

# Check validation results
astro dev run dags list-runs validate_migration_env

Project Structure

mssql-to-postgres-pipeline/
├── dags/
│   ├── mssql_to_postgres_migration.py   # Main migration DAG
│   └── validate_migration_env.py        # Standalone validation DAG
├── include/
│   └── mssql_pg_migration/
│       ├── schema_extractor.py          # SQL Server schema discovery
│       ├── type_mapping.py              # Data type conversion logic
│       ├── ddl_generator.py             # PostgreSQL DDL generation
│       ├── data_transfer.py             # Streaming data transfer with keyset pagination
│       └── validation.py                # Migration validation
├── tests/
│   └── dags/
│       └── test_dag_example.py          # DAG validation tests
├── docs/
│   ├── HANDOFF_NOTES.md                 # Session handoff documentation
│   └── *.md                             # Additional technical documentation
├── docker-compose.yml                   # Database containers (SQL Server, PostgreSQL)
├── Dockerfile                           # Astronomer Runtime image
├── requirements.txt                     # Python dependencies
└── airflow_settings.yaml                # Local connections/variables

Development

Validate DAGs

astro dev parse

Run Tests

# Initialize Airflow DB first (required for DagBag tests)
astro dev run airflow db init

# Run tests
astro dev pytest tests/

View Logs

astro dev logs

Stop Airflow

astro dev stop

Known Issues and Workarounds

TEXT Column NULL Handling

SQL Server databases may have NULL values in columns marked as NOT NULL (data integrity issues). The DDL generator skips NOT NULL constraints for TEXT columns to handle this:

# Skip NOT NULL for TEXT columns as source data may have integrity issues
if not column.get('is_nullable', True) and column['data_type'].upper() != 'TEXT':
    parts.append('NOT NULL')

XCom Serialization

Large validation results can cause XCom serialization issues. The pipeline uses a separate validation DAG with direct database connections to avoid this.

Dependencies

  • Astronomer Runtime 3.1
  • apache-airflow-providers-microsoft-mssql >= 3.8.0
  • apache-airflow-providers-postgres >= 5.12.0
  • pymssql >= 2.2.0
  • pg8000 >= 1.30.0

License

See LICENSE file for details.
