An Apache Airflow pipeline for automated, full-refresh migrations from Microsoft SQL Server to PostgreSQL. Built with the Astronomer framework for reliable orchestration and easy deployment.
- Schema Discovery: Automatically extract table structures, columns, indexes, and foreign keys from SQL Server
- Type Mapping: Convert 30+ SQL Server data types to their PostgreSQL equivalents
- Streaming Data Transfer: Move data efficiently using server-side cursors, keyset pagination, and PostgreSQL's COPY protocol
- Validation: Standalone validation DAG verifies migration success through row count comparisons
- Parallelization: Transfer multiple tables concurrently using Airflow's dynamic task mapping
- Large Table Partitioning: Automatically partitions tables >5M rows into parallel chunks by primary key range
Tested against the StackOverflow 2010 dataset:
| Metric | Value |
|---|---|
| Total Rows Migrated | 19.3 million |
| Tables | 9 |
| Migration Time | ~2.5 minutes |
| Throughput | ~125,000 rows/sec |
| Validation Success | 100% (9/9 tables) |
- 100k chunk size: 10x larger batches reduce overhead
- Parallel partitioning: Large tables split into 4 parallel partitions by PK range
- Connection pooling: Reuses PostgreSQL connections across operations
See docs/PARALLEL_PARTITIONING.md for details on large table partitioning; a simplified sketch of the idea follows below.
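For intuition, splitting a large table by primary-key range can be sketched like this. It is a simplified illustration that assumes a dense integer primary key; the helper name and fixed partition count are not the pipeline's actual API.

```python
# Simplified sketch of primary-key range partitioning (illustrative only; the
# real logic lives in include/mssql_pg_migration/data_transfer.py).
def build_pk_partitions(min_pk: int, max_pk: int, num_partitions: int = 4):
    """Split [min_pk, max_pk] into contiguous ranges for parallel transfer."""
    span = max_pk - min_pk + 1
    step = -(-span // num_partitions)  # ceiling division
    partitions = []
    for i in range(num_partitions):
        lo = min_pk + i * step
        if lo > max_pk:
            break
        hi = min(lo + step - 1, max_pk)
        partitions.append((lo, hi))
    return partitions

# Example: a 10M-row table keyed 1..10_000_000 yields four ~2.5M-row ranges,
# each transferred by its own task with a WHERE pk BETWEEN lo AND hi predicate.
print(build_pk_partitions(1, 10_000_000))
```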
| Table | Rows |
|---|---|
| Badges | 1,102,019 |
| Comments | 3,875,183 |
| Posts | 3,729,195 |
| Users | 299,398 |
| Votes | 10,143,364 |
| PostTypes | 8 |
| VoteTypes | 15 |
| PostLinks | 149,313 |
| LinkTypes | 3 |
The pipeline executes as a single Airflow DAG with the following stages (a condensed outline sketch follows the stage descriptions below):
Extract Schema -> Create Target Schema -> Create Tables -> Transfer Data (parallel) -> Create Foreign Keys -> Validate -> Report
- Schema Extraction: Queries SQL Server system catalogs to discover all tables, columns, data types, indexes, and constraints
- DDL Generation: Converts SQL Server schemas to PostgreSQL-compatible DDL with proper type mappings
- Table Creation: Creates target tables in PostgreSQL (drops existing tables first)
- Data Transfer: Streams data using keyset pagination with direct pymssql connections and PostgreSQL COPY protocol
- Foreign Key Creation: Adds foreign key constraints after all data is loaded
- Validation: Triggers standalone validation DAG that compares source and target row counts
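For orientation, here is a heavily condensed outline of how such a DAG can be wired with Airflow's TaskFlow API and dynamic task mapping. The task names and signatures are illustrative assumptions, not the contents of dags/mssql_to_postgres_migration.py.

```python
# Condensed, illustrative DAG outline (names and signatures are assumptions).
from pendulum import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def migration_outline():

    @task
    def extract_schema() -> list[dict]:
        """Query SQL Server system catalogs; return one metadata dict per table."""
        ...

    @task
    def create_tables(tables: list[dict]) -> list[dict]:
        """Generate PostgreSQL DDL from the metadata and create target tables."""
        ...

    @task
    def transfer_table(table: dict) -> dict:
        """Stream one table using keyset pagination and COPY; return row counts."""
        ...

    @task
    def create_foreign_keys(results: list[dict]):
        """Add FK constraints only after every table has finished loading."""
        ...

    tables = create_tables(extract_schema())
    results = transfer_table.expand(table=tables)  # one mapped task per table
    create_foreign_keys(results)

migration_outline()
```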
The pipeline uses a streaming architecture optimized for large datasets (a transfer-loop sketch follows the list below):
- Keyset Pagination: Uses primary key ordering instead of OFFSET/FETCH for efficient chunking of large tables
- Direct Database Connections: Uses pymssql and psycopg2 directly to avoid Airflow hook limitations with large datasets
- PostgreSQL COPY Protocol: Bulk loads data for maximum throughput
- Server-Side Cursors: Streams rows without loading entire result sets into memory
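A stripped-down sketch of what one transfer loop can look like under these constraints is shown below. It assumes a single integer primary key and values that are safe to emit as tab-separated text; the real implementation in include/mssql_pg_migration/data_transfer.py handles type conversion, escaping, and partition ranges.

```python
# Illustrative keyset-pagination + COPY loop (not the pipeline's exact code).
# `mssql_conn` is a pymssql connection, `pg_conn` a psycopg2 connection.
import io

def transfer_table_sketch(mssql_conn, pg_conn, table, pk, columns, chunk_size=10_000):
    src, tgt = mssql_conn.cursor(), pg_conn.cursor()
    col_list = ", ".join(columns)
    last_pk = None
    while True:
        if last_pk is None:
            src.execute(f"SELECT TOP {chunk_size} {col_list} FROM {table} ORDER BY {pk}")
        else:
            src.execute(
                f"SELECT TOP {chunk_size} {col_list} FROM {table} "
                f"WHERE {pk} > %s ORDER BY {pk}",
                (last_pk,),
            )
        rows = src.fetchall()
        if not rows:
            break
        buf = io.StringIO()
        for row in rows:
            buf.write("\t".join(r"\N" if v is None else str(v) for v in row) + "\n")
        buf.seek(0)
        tgt.copy_expert(f"COPY {table} ({col_list}) FROM STDIN", buf)  # text format, tab-delimited
        last_pk = rows[-1][columns.index(pk)]
    pg_conn.commit()
```

Keyset pagination (`WHERE pk > last_pk ORDER BY pk`) keeps each chunk query cheap no matter how deep into the table it is, whereas OFFSET/FETCH has to rescan all the skipped rows on every batch.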
A standalone validate_migration_env DAG handles validation separately to avoid XCom serialization issues with large result sets (a minimal comparison sketch follows this list). This DAG:
- Uses direct database connections (psycopg2, pymssql)
- Compares row counts between source and target for all tables
- Can be triggered independently for ad-hoc validation
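Conceptually, the comparison reduces to something like the sketch below; connection setup is omitted, and the schema and table names are placeholders rather than the validation DAG's exact code.

```python
# Illustrative row-count comparison (not the validation DAG's exact code).
def compare_row_counts(mssql_conn, pg_conn, tables, source_schema="dbo", target_schema="public"):
    src, tgt = mssql_conn.cursor(), pg_conn.cursor()
    results = {}
    for table in tables:
        src.execute(f"SELECT COUNT(*) FROM [{source_schema}].[{table}]")
        source_count = src.fetchone()[0]
        tgt.execute(f'SELECT COUNT(*) FROM {target_schema}."{table.lower()}"')
        target_count = tgt.fetchone()[0]
        results[table] = {
            "source": source_count,
            "target": target_count,
            "match": source_count == target_count,
        }
    return results
```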
| Category | SQL Server Types | PostgreSQL Mapping |
|---|---|---|
| Integer | bit, tinyint, smallint, int, bigint | BOOLEAN, SMALLINT, INTEGER, BIGINT |
| Decimal | decimal, numeric, money, smallmoney | DECIMAL, NUMERIC |
| Float | float, real | DOUBLE PRECISION, REAL |
| String | char, varchar, text, nchar, nvarchar, ntext | CHAR, VARCHAR, TEXT |
| Binary | binary, varbinary, image | BYTEA |
| Date/Time | date, time, datetime, datetime2, smalldatetime, datetimeoffset | DATE, TIME, TIMESTAMP |
| Other | uniqueidentifier, xml, geography, geometry | UUID, XML, GEOGRAPHY, GEOMETRY |
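Internally this translation lives in include/mssql_pg_migration/type_mapping.py. A reduced sketch of what such a lookup can look like is below; the dictionary and function names are illustrative, and precision/length handling is simplified.

```python
# Reduced, illustrative SQL Server -> PostgreSQL type lookup (not the full mapping).
TYPE_MAP = {
    "bit": "BOOLEAN",
    "tinyint": "SMALLINT",   # PostgreSQL has no 1-byte integer type
    "smallint": "SMALLINT",
    "int": "INTEGER",
    "bigint": "BIGINT",
    "money": "NUMERIC(19,4)",
    "float": "DOUBLE PRECISION",
    "nvarchar": "VARCHAR",   # length qualifier added separately
    "ntext": "TEXT",
    "varbinary": "BYTEA",
    "datetime": "TIMESTAMP",
    "datetime2": "TIMESTAMP",
    "uniqueidentifier": "UUID",
    "xml": "XML",
}

def map_type(mssql_type: str, max_length: int = None) -> str:
    pg_type = TYPE_MAP.get(mssql_type.lower(), "TEXT")  # conservative fallback
    if pg_type == "VARCHAR" and max_length and max_length > 0:
        pg_type = f"VARCHAR({max_length})"
    return pg_type
```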
Migrate any SQL Server database to PostgreSQL in 3 steps:
# 1. Configure your databases
cp .env.example .env # Resource settings
cp airflow_settings.yaml.example airflow_settings.yaml # Connections
# Edit airflow_settings.yaml with your database details
# 2. Start the stack
astro dev start # Connections loaded automatically
# 3. Run migration
airflow dags trigger mssql_to_postgres_migration

See Configuration for details on connecting to remote databases.
- Docker Desktop (4GB+ RAM recommended per database container)
- Astronomer CLI (optional - can use standalone Docker)
- Access to source SQL Server and target PostgreSQL databases
1. Clone this repository:

   git clone <repository-url>
   cd mssql-to-postgres-pipeline

2. Configure your environment:

   cp .env.example .env
   cp airflow_settings.yaml.example airflow_settings.yaml
   # Edit .env with resource settings for your machine (RAM presets)
   # Edit airflow_settings.yaml with your database connection details

3. Start Airflow (creates Docker network):

   astro dev start

4. (Optional) Start local database containers:

   If you need local SQL Server and PostgreSQL containers for testing:

   docker-compose up -d

   Note: Run `astro dev start` first - it creates the Docker network that database containers join. For remote databases (Azure SQL, AWS RDS, on-prem servers), skip this step and configure hosts in `airflow_settings.yaml`.

5. Access the Airflow UI at http://localhost:8080
Configuration is split between two files:
- `.env` - Resource settings (memory limits, parallelism)
- `airflow_settings.yaml` - Database connections
Edit airflow_settings.yaml to configure your source and target databases:
airflow:
  connections:
    # SOURCE: SQL Server
    - conn_id: mssql_source
      conn_type: mssql
      conn_host: mssql-server                      # Docker container, IP, or hostname
      # conn_host: 192.168.1.100                   # Remote server
      # conn_host: myserver.database.windows.net   # Azure SQL
      conn_port: 1433
      conn_login: sa
      conn_password: YourStrong@Passw0rd
      conn_schema: StackOverflow2013

    # TARGET: PostgreSQL
    - conn_id: postgres_target
      conn_type: postgres
      conn_host: postgres-target                   # Docker container, IP, or hostname
      # conn_host: mydb.abc123.rds.amazonaws.com   # AWS RDS
      conn_port: 5432
      conn_login: postgres
      conn_password: PostgresPassword123
      conn_schema: stackoverflow

Connections are loaded automatically when you run `astro dev start`.
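Because the transfer code connects with pymssql and psycopg2 directly rather than through Airflow hooks, these entries still reach it via Airflow's connection registry. A plausible sketch of that handoff (`BaseHook.get_connection` is Airflow's connection lookup; the helper wrapping it is an assumption):

```python
# Sketch: turning Airflow connection entries into direct driver connections.
import pymssql
import psycopg2
from airflow.hooks.base import BaseHook

def open_direct_connections(source_conn_id="mssql_source", target_conn_id="postgres_target"):
    src = BaseHook.get_connection(source_conn_id)
    tgt = BaseHook.get_connection(target_conn_id)
    mssql_conn = pymssql.connect(
        server=src.host, port=src.port or 1433,
        user=src.login, password=src.password, database=src.schema,
    )
    pg_conn = psycopg2.connect(
        host=tgt.host, port=tgt.port or 5432,
        user=tgt.login, password=tgt.password, dbname=tgt.schema,
    )
    return mssql_conn, pg_conn
```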
The .env file contains resource presets for different machine sizes:
| RAM | Preset |
|---|---|
| 16GB | Conservative settings for laptops |
| 32GB | Default balanced settings |
| 64GB+ | High performance settings |
Uncomment the appropriate preset section in .env for your machine.
After editing airflow_settings.yaml, restart Airflow to reload connections:
astro dev restart

To restore a SQL Server backup file (.bak) into the local container:
# 1. Copy backup file to container
docker cp /path/to/YourDatabase.bak mssql-server:/tmp/
# 2. Restore the database
docker exec -it mssql-server /opt/mssql-tools18/bin/sqlcmd \
-S localhost -U sa -P 'YourStrong@Passw0rd' -C \
-Q "RESTORE DATABASE YourDatabase FROM DISK='/tmp/YourDatabase.bak' \
WITH MOVE 'YourDatabase' TO '/var/opt/mssql/data/YourDatabase.mdf', \
MOVE 'YourDatabase_log' TO '/var/opt/mssql/data/YourDatabase_log.ldf'"
# 3. Update airflow_settings.yaml with the database name
# conn_schema: YourDatabase

Note: Logical file names in the MOVE clause vary by backup. Use `RESTORE FILELISTONLY` to discover them.
- Open the Airflow UI at http://localhost:8080
- Find the `mssql_to_postgres_migration` DAG
- Click the play button to trigger with default parameters, or use "Trigger DAG w/ config" to customize
| Parameter | Default | Description |
|---|---|---|
| source_conn_id | mssql_source | Airflow connection ID for SQL Server |
| target_conn_id | postgres_target | Airflow connection ID for PostgreSQL |
| source_schema | dbo | Schema to migrate from SQL Server |
| target_schema | public | Target schema in PostgreSQL |
| chunk_size | 10000 | Rows per batch during transfer (100-100,000) |
| exclude_tables | [] | Table patterns to skip (supports wildcards) |
| validate_samples | false | Enable sample data validation (slower) |
| create_foreign_keys | true | Create foreign key constraints after transfer |
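For example, exclude_tables accepts shell-style wildcard patterns. A minimal sketch of how such filtering might be applied (using fnmatch, which is an implementation assumption, not necessarily what the DAG uses):

```python
# Illustrative wildcard filtering for exclude_tables (e.g. ["staging_*", "tmp*"]).
from fnmatch import fnmatch

def filter_tables(tables, exclude_patterns):
    return [
        t for t in tables
        if not any(fnmatch(t.lower(), p.lower()) for p in exclude_patterns)
    ]

# filter_tables(["Users", "Posts", "staging_Temp"], ["staging_*"]) -> ["Users", "Posts"]
```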
astro dev run dags trigger mssql_to_postgres_migration

Or with custom configuration:
astro dev run dags trigger mssql_to_postgres_migration \
--conf '{"source_schema": "sales", "target_schema": "sales", "chunk_size": 50000}'

# View scheduler logs
astro dev logs -s -f
# List DAG runs
astro dev run dags list-runs mssql_to_postgres_migration
# Check validation results
astro dev run dags list-runs validate_migration_env

mssql-to-postgres-pipeline/
├── dags/
│ ├── mssql_to_postgres_migration.py # Main migration DAG
│ └── validate_migration_env.py # Standalone validation DAG
├── include/
│ └── mssql_pg_migration/
│ ├── schema_extractor.py # SQL Server schema discovery
│ ├── type_mapping.py # Data type conversion logic
│ ├── ddl_generator.py # PostgreSQL DDL generation
│ ├── data_transfer.py # Streaming data transfer with keyset pagination
│ └── validation.py # Migration validation
├── tests/
│ └── dags/
│ └── test_dag_example.py # DAG validation tests
├── docs/
│ ├── HANDOFF_NOTES.md # Session handoff documentation
│ └── *.md # Additional technical documentation
├── docker-compose.yml # Database containers (SQL Server, PostgreSQL)
├── Dockerfile # Astronomer Runtime image
├── requirements.txt # Python dependencies
└── airflow_settings.yaml # Local connections/variables
astro dev parse

# Initialize Airflow DB first (required for DagBag tests)
astro dev run airflow db init
# Run tests
astro dev pytest tests/

astro dev logs

astro dev stop

SQL Server databases may have NULL values in columns marked as NOT NULL (data integrity issues). The DDL generator skips NOT NULL constraints for TEXT columns to handle this:
# Skip NOT NULL for TEXT columns as source data may have integrity issues
if not column.get('is_nullable', True) and column['data_type'].upper() != 'TEXT':
    parts.append('NOT NULL')

Large validation results can cause XCom serialization issues. The pipeline uses a separate validation DAG with direct database connections to avoid this.
- Astronomer Runtime 3.1
- apache-airflow-providers-microsoft-mssql >= 3.8.0
- apache-airflow-providers-postgres >= 5.12.0
- pymssql >= 2.2.0
- pg8000 >= 1.30.0
See LICENSE file for details.