SQL Server to PostgreSQL Migration Pipeline

An Apache Airflow pipeline for automated, full-refresh migrations from Microsoft SQL Server to PostgreSQL. Built with the Astronomer framework for reliable orchestration and easy deployment.

Features

  • Schema Discovery: Automatically extract table structures, columns, indexes, and foreign keys from SQL Server
  • Type Mapping: Convert 30+ SQL Server data types to their PostgreSQL equivalents
  • Streaming Data Transfer: Move data efficiently using server-side cursors, keyset pagination, and PostgreSQL's COPY protocol
  • Validation: Standalone validation DAG verifies migration success through row count comparisons
  • Parallelization: Transfer multiple tables concurrently using Airflow's dynamic task mapping
  • Large Table Partitioning: Automatically partitions tables >5M rows into parallel chunks by primary key range

Performance

Tested against the StackOverflow 2010 dataset:

| Metric | Value |
|---|---|
| Total Rows Migrated | 19.3 million |
| Tables | 9 |
| Migration Time | ~2.5 minutes |
| Throughput | ~125,000 rows/sec |
| Validation Success | 100% (9/9 tables) |

Performance Optimizations

  • 100k chunk size: batches 10x the 10,000-row default reduce per-chunk overhead
  • Parallel partitioning: large tables are split into 4 parallel partitions by primary key range (sketched below)
  • Connection pooling: PostgreSQL connections are reused across operations

See docs/PARALLEL_PARTITIONING.md for details on large table partitioning.
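
A minimal sketch of how PK-range chunking might be computed; the helper name and the fixed partition count are illustrative, not the pipeline's actual API:

def plan_partitions(min_id, max_id, num_partitions=4):
    """Split an integer PK range into contiguous, roughly equal sub-ranges."""
    span = max_id - min_id + 1
    step = -(-span // num_partitions)  # ceiling division
    bounds = []
    lower = min_id
    while lower <= max_id:
        upper = min(lower + step - 1, max_id)
        bounds.append((lower, upper))
        lower = upper + 1
    return bounds

# A table with IDs 1..10,000,000 yields four ~2.5M-row ranges:
# plan_partitions(1, 10_000_000)
# -> [(1, 2500000), (2500001, 5000000), (5000001, 7500000), (7500001, 10000000)]

Each range can then be transferred by an independent parallel task.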

Tables Migrated

| Table | Rows |
|---|---|
| Badges | 1,102,019 |
| Comments | 3,875,183 |
| Posts | 3,729,195 |
| Users | 299,398 |
| Votes | 10,143,364 |
| PostTypes | 8 |
| VoteTypes | 15 |
| PostLinks | 149,313 |
| LinkTypes | 3 |

How It Works

The pipeline executes as a single Airflow DAG with the following stages (a skeletal sketch of the wiring follows the list):

Extract Schema -> Create Target Schema -> Create Tables -> Transfer Data (parallel) -> Create Foreign Keys -> Validate -> Report
  1. Schema Extraction: Queries SQL Server system catalogs to discover all tables, columns, data types, indexes, and constraints
  2. DDL Generation: Converts SQL Server schemas to PostgreSQL-compatible DDL with proper type mappings
  3. Table Creation: Creates target tables in PostgreSQL (drops existing tables first)
  4. Data Transfer: Streams data using keyset pagination with direct pymssql connections and PostgreSQL COPY protocol
  5. Foreign Key Creation: Adds foreign key constraints after all data is loaded
  6. Validation: Triggers standalone validation DAG that compares source and target row counts
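
A skeletal sketch of that wiring using Airflow's TaskFlow API and dynamic task mapping; the task names mirror the stages above but are illustrative, and the bodies are elided:

from airflow.decorators import dag, task
import pendulum

@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def migration_sketch():
    @task
    def extract_schema():
        ...  # query SQL Server system catalogs; return one spec per table

    @task
    def create_tables(tables):
        ...  # generate PostgreSQL DDL and create the target tables

    @task
    def transfer_table(table):
        ...  # stream one table (or partition) via keyset pagination + COPY

    @task
    def create_foreign_keys():
        ...  # add FK constraints after all data is loaded

    tables = extract_schema()
    # Dynamic task mapping: one parallel task instance per table
    transfers = transfer_table.expand(table=tables)
    create_tables(tables) >> transfers >> create_foreign_keys()

migration_sketch()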

Architecture

Data Transfer Approach

The pipeline uses a streaming architecture optimized for large datasets (condensed sketch below):

  • Keyset Pagination: Uses primary key ordering instead of OFFSET/FETCH for efficient chunking of large tables
  • Direct Database Connections: Uses pymssql and psycopg2 directly to avoid Airflow hook limitations with large datasets
  • PostgreSQL COPY Protocol: Bulk loads data for maximum throughput
  • Server-Side Cursors: Streams rows without loading entire result sets into memory
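
A condensed sketch of one chunk transfer under this approach, assuming an integer primary key in the first column; function and parameter names are illustrative, and real code must also escape delimiters inside string values:

import io

def copy_chunk(mssql_conn, pg_conn, table, pk_col, last_pk, chunk_size):
    """Fetch the next chunk from SQL Server and COPY it into PostgreSQL.
    Returns the last PK seen, or None when the table is exhausted."""
    cur = mssql_conn.cursor()
    # Keyset pagination: seek past the last-seen key instead of OFFSET/FETCH
    cur.execute(
        f"SELECT TOP {int(chunk_size)} * FROM {table} "
        f"WHERE {pk_col} > %s ORDER BY {pk_col}",
        (last_pk,),
    )
    rows = cur.fetchall()
    cur.close()
    if not rows:
        return None

    # Serialize to COPY's text format: tab-delimited, \N for NULL
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join("\\N" if v is None else str(v) for v in row) + "\n")
    buf.seek(0)

    with pg_conn.cursor() as pg_cur:
        pg_cur.copy_expert(f"COPY {table} FROM STDIN", buf)
    pg_conn.commit()
    return rows[-1][0]  # assumes the PK is the first column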

Validation DAG

A standalone validate_migration_env DAG handles validation separately to avoid XCom serialization issues with large result sets. This DAG:

  • Uses direct database connections (psycopg2, pymssql)
  • Compares row counts between source and target for all tables
  • Can be triggered independently for ad-hoc validation (see the sketch below)
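
A minimal sketch of that row-count comparison, assuming already-open pymssql and psycopg2 connections and lowercased target table names; the function name is illustrative:

def compare_row_counts(mssql_conn, pg_conn, tables,
                       source_schema="dbo", target_schema="public"):
    """Return {table: (source_count, target_count, match)} for each table."""
    results = {}
    for table in tables:
        src_cur = mssql_conn.cursor()
        src_cur.execute(f"SELECT COUNT(*) FROM [{source_schema}].[{table}]")
        source_count = src_cur.fetchone()[0]
        src_cur.close()
        with pg_conn.cursor() as tgt_cur:
            tgt_cur.execute(f"SELECT COUNT(*) FROM {target_schema}.{table.lower()}")
            target_count = tgt_cur.fetchone()[0]
        results[table] = (source_count, target_count, source_count == target_count)
    return results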

Supported Data Types

| Category | SQL Server Types | PostgreSQL Mapping |
|---|---|---|
| Integer | bit, tinyint, smallint, int, bigint | BOOLEAN, SMALLINT, INTEGER, BIGINT |
| Decimal | decimal, numeric, money, smallmoney | DECIMAL, NUMERIC |
| Float | float, real | DOUBLE PRECISION, REAL |
| String | char, varchar, text, nchar, nvarchar, ntext | CHAR, VARCHAR, TEXT |
| Binary | binary, varbinary, image | BYTEA |
| Date/Time | date, time, datetime, datetime2, smalldatetime, datetimeoffset | DATE, TIME, TIMESTAMP |
| Other | uniqueidentifier, xml, geography, geometry | UUID, XML, GEOGRAPHY, GEOMETRY |
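
The mappings live in include/mssql_pg_migration/type_mapping.py; as an illustration of how such a lookup might be structured (the real module may differ in detail, and the NUMERIC precision shown is one plausible choice):

TYPE_MAP = {
    "bit": "BOOLEAN",
    "tinyint": "SMALLINT",     # PostgreSQL has no 1-byte integer type
    "int": "INTEGER",
    "bigint": "BIGINT",
    "money": "NUMERIC(19,4)",
    "float": "DOUBLE PRECISION",
    "nvarchar": "VARCHAR",     # length modifier applied separately
    "ntext": "TEXT",
    "image": "BYTEA",
    "datetime2": "TIMESTAMP",
    "uniqueidentifier": "UUID",
    "xml": "XML",
}

def map_type(mssql_type):
    # Conservative fallback: anything unrecognized becomes TEXT
    return TYPE_MAP.get(mssql_type.lower(), "TEXT")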

Quick Start

Migrate any SQL Server database to PostgreSQL in 3 steps:

# 1. Configure your databases
cp .env.example .env                              # Resource settings
cp airflow_settings.yaml.example airflow_settings.yaml  # Connections
# Edit airflow_settings.yaml with your database details

# 2. Start the stack
astro dev start                         # Connections loaded automatically

# 3. Run migration
astro dev run dags trigger mssql_to_postgres_migration

See Configuration for details on connecting to remote databases.

Getting Started

Prerequisites

  • Docker Desktop (4GB+ RAM recommended per database container)
  • Astronomer CLI (optional - can use standalone Docker)
  • Access to source SQL Server and target PostgreSQL databases

Installation

  1. Clone this repository:

    git clone <repository-url>
    cd mssql-to-postgres-pipeline
  2. Configure your environment:

    cp .env.example .env
    cp airflow_settings.yaml.example airflow_settings.yaml
    # Edit .env with resource settings for your machine (RAM presets)
    # Edit airflow_settings.yaml with your database connection details
  3. Start Airflow (creates Docker network):

    astro dev start
  4. (Optional) Start local database containers:

    If you need local SQL Server and PostgreSQL containers for testing:

    docker-compose up -d

    Note: Run astro dev start first - it creates the Docker network that database containers join.

    For remote databases (Azure SQL, AWS RDS, on-prem servers), skip this step and configure hosts in airflow_settings.yaml.

  5. Access the Airflow UI at http://localhost:8080

Configuration

Configuration is split between two files:

  • .env - Resource settings (memory limits, parallelism)
  • airflow_settings.yaml - Database connections

Database Connections

Edit airflow_settings.yaml to configure your source and target databases:

airflow:
  connections:
    # SOURCE: SQL Server
    - conn_id: mssql_source
      conn_type: mssql
      conn_host: mssql-server           # Docker container, IP, or hostname
      # conn_host: 192.168.1.100        # Remote server
      # conn_host: myserver.database.windows.net  # Azure SQL
      conn_port: 1433
      conn_login: sa
      conn_password: YourStrong@Passw0rd
      conn_schema: StackOverflow2013

    # TARGET: PostgreSQL
    - conn_id: postgres_target
      conn_type: postgres
      conn_host: postgres-target        # Docker container, IP, or hostname
      # conn_host: mydb.abc123.rds.amazonaws.com  # AWS RDS
      conn_port: 5432
      conn_login: postgres
      conn_password: PostgresPassword123
      conn_schema: stackoverflow

Connections are loaded automatically when you run astro dev start.
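
At runtime, a task can resolve these connection IDs into direct driver connections through Airflow's BaseHook; a hedged sketch with an illustrative helper name:

from airflow.hooks.base import BaseHook  # Airflow 2.x import path
import pymssql

def open_source_connection(conn_id="mssql_source"):
    # Look up the Airflow connection, then open a direct pymssql connection
    # (bypassing provider hooks for streaming transfers)
    conn = BaseHook.get_connection(conn_id)
    return pymssql.connect(
        server=conn.host,
        port=str(conn.port or 1433),
        user=conn.login,
        password=conn.password,
        database=conn.schema,
    )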

Resource Settings

The .env file contains resource presets for different machine sizes:

| RAM | Preset |
|---|---|
| 16GB | Conservative settings for laptops |
| 32GB | Default balanced settings |
| 64GB+ | High performance settings |

Uncomment the appropriate preset section in .env for your machine.

Applying Configuration Changes

After editing airflow_settings.yaml, restart Airflow to reload connections:

astro dev restart

Loading Test Data (Local SQL Server)

To restore a SQL Server backup file (.bak) into the local container:

# 1. Copy backup file to container
docker cp /path/to/YourDatabase.bak mssql-server:/tmp/

# 2. Restore the database
docker exec -it mssql-server /opt/mssql-tools18/bin/sqlcmd \
  -S localhost -U sa -P 'YourStrong@Passw0rd' -C \
  -Q "RESTORE DATABASE YourDatabase FROM DISK='/tmp/YourDatabase.bak' \
      WITH MOVE 'YourDatabase' TO '/var/opt/mssql/data/YourDatabase.mdf', \
           MOVE 'YourDatabase_log' TO '/var/opt/mssql/data/YourDatabase_log.ldf'"

# 3. Update airflow_settings.yaml with the database name
# conn_schema: YourDatabase

Note: Logical file names in the MOVE clause vary by backup. Use RESTORE FILELISTONLY to discover them.

Usage

Running the Migration

  1. Open the Airflow UI at http://localhost:8080
  2. Find the mssql_to_postgres_migration DAG
  3. Click the play button to trigger with default parameters, or use "Trigger DAG w/ config" to customize

Configuration Parameters

| Parameter | Default | Description |
|---|---|---|
| source_conn_id | mssql_source | Airflow connection ID for SQL Server |
| target_conn_id | postgres_target | Airflow connection ID for PostgreSQL |
| source_schema | dbo | Schema to migrate from SQL Server |
| target_schema | public | Target schema in PostgreSQL |
| chunk_size | 10000 | Rows per batch during transfer (100-100,000) |
| exclude_tables | [] | Table patterns to skip (supports wildcards) |
| validate_samples | false | Enable sample data validation (slower) |
| create_foreign_keys | true | Create foreign key constraints after transfer |
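
Wildcard matching for exclude_tables could be expressed with fnmatch-style patterns; a hedged sketch, since the pipeline's exact matching rules aren't documented here:

from fnmatch import fnmatch

def is_excluded(table, patterns):
    # Case-insensitive match against any pattern, e.g. "temp_*" or "*_audit"
    return any(fnmatch(table.lower(), pattern.lower()) for pattern in patterns)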

Example: Trigger via CLI

astro dev run dags trigger mssql_to_postgres_migration

Or with custom configuration:

astro dev run dags trigger mssql_to_postgres_migration \
  --conf '{"source_schema": "sales", "target_schema": "sales", "chunk_size": 50000}'

Monitoring

# View scheduler logs
astro dev logs -s -f

# List DAG runs
astro dev run dags list-runs mssql_to_postgres_migration

# Check validation results
astro dev run dags list-runs validate_migration_env

Project Structure

mssql-to-postgres-pipeline/
├── dags/
│   ├── mssql_to_postgres_migration.py   # Main migration DAG
│   └── validate_migration_env.py        # Standalone validation DAG
├── include/
│   └── mssql_pg_migration/
│       ├── schema_extractor.py          # SQL Server schema discovery
│       ├── type_mapping.py              # Data type conversion logic
│       ├── ddl_generator.py             # PostgreSQL DDL generation
│       ├── data_transfer.py             # Streaming data transfer with keyset pagination
│       └── validation.py                # Migration validation
├── tests/
│   └── dags/
│       └── test_dag_example.py          # DAG validation tests
├── docs/
│   ├── HANDOFF_NOTES.md                 # Session handoff documentation
│   └── *.md                             # Additional technical documentation
├── docker-compose.yml                   # Database containers (SQL Server, PostgreSQL)
├── Dockerfile                           # Astronomer Runtime image
├── requirements.txt                     # Python dependencies
└── airflow_settings.yaml                # Local connections/variables

Development

Validate DAGs

astro dev parse

Run Tests

# Initialize Airflow DB first (required for DagBag tests)
astro dev run airflow db init

# Run tests
astro dev pytest tests/

View Logs

astro dev logs

Stop Airflow

astro dev stop

Known Issues and Workarounds

TEXT Column NULL Handling

SQL Server databases may have NULL values in columns marked as NOT NULL (data integrity issues). The DDL generator skips NOT NULL constraints for TEXT columns to handle this:

# Skip NOT NULL for TEXT columns as source data may have integrity issues
if not column.get('is_nullable', True) and column['data_type'].upper() != 'TEXT':
    parts.append('NOT NULL')

XCom Serialization

Large validation results can cause XCom serialization issues. The pipeline uses a separate validation DAG with direct database connections to avoid this.

Dependencies

  • Astronomer Runtime 3.1
  • apache-airflow-providers-microsoft-mssql >= 3.8.0
  • apache-airflow-providers-postgres >= 5.12.0
  • pymssql >= 2.2.0
  • pg8000 >= 1.30.0

License

See LICENSE file for details.
