# 📊 Data Ingestion Pipeline: L7 to OMOP CDM via Airflow


## 🔍 What is Apache Airflow?

Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. Think of it as a workflow orchestrator for your data pipelines.

### Benefits:
- **Modular**: Each task is defined in Python.
- **Scalable**: Handles complex workflows with many tasks, automatic retries, and explicit task dependencies.
- **Visual**: Provides a UI to track job execution and dependencies.
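
To make the retry behaviour concrete, here is a minimal plain-Python sketch of what a `retries` setting means: a failed task is attempted again up to that many extra times. The helper `run_with_retries` and the function `flaky_extract` are hypothetical names used only for illustration. This is not how Airflow implements retries internally; the scheduler also handles retry delays, logging, and task state.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], retries: int) -> T:
    """Call task(); on failure, try again up to `retries` more times (hypothetical helper)."""
    if retries < 0:
        raise ValueError("retries must be >= 0")
    last_error = None
    for _attempt in range(retries + 1):  # first attempt plus `retries` retries
        try:
            return task()
        except Exception as exc:  # a real orchestrator would also log and wait here
            last_error = exc
    raise last_error


# Hypothetical task that fails once and then succeeds.
calls = {"n": 0}


def flaky_extract() -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("temporary failure")
    return "ok"


result = run_with_retries(flaky_extract, retries=1)
print(result, calls["n"])
```

With `retries=1`, the task gets two attempts in total, which is enough for the simulated one-off failure to recover.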
    


## 🏗️ Use Case: Ingesting Clinical Data from L7 (Postgres)

The objective is to extract patient data from the L7 Postgres database, apply a lightweight transformation so that it conforms to the OMOP CDM, and load it into the OMOP-compliant database.
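
As a rough illustration of what conforming to the OMOP CDM can involve, the sketch below maps one source row to a few columns of the OMOP `person` table. The source column names (`patient_id`, `sex`, `birth_date`) and the function `to_omop_person` are assumptions, since the actual L7 schema is not shown here. Only a subset of `person` is covered; the real table has further required columns such as `race_concept_id` and `ethnicity_concept_id`.

```python
from datetime import date

# Standard OMOP gender concept IDs; 0 is the convention for "no matching concept".
GENDER_CONCEPTS = {"M": 8507, "F": 8532}


def to_omop_person(row: dict) -> dict:
    """Map one hypothetical L7 patient row to a subset of OMOP `person` columns."""
    birth = date.fromisoformat(row["birth_date"])  # assumes ISO dates, e.g. 1980-05-17
    sex = (row.get("sex") or "").strip().upper()
    return {
        "person_id": int(row["patient_id"]),
        "gender_concept_id": GENDER_CONCEPTS.get(sex, 0),
        "year_of_birth": birth.year,
        "month_of_birth": birth.month,
        "day_of_birth": birth.day,
        # Keep the original values so the mapping can be audited later.
        "person_source_value": str(row["patient_id"]),
        "gender_source_value": row.get("sex"),
    }


example = to_omop_person({"patient_id": "42", "sex": "f", "birth_date": "1980-05-17"})
print(example)
```

Keeping the source values alongside the mapped concept IDs follows the OMOP convention of `*_source_value` columns and makes it easier to trace a record back to L7.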
    


## 🧠 How the DAG Works

- **Start** ➡️ **Extract from L7** ➡️ **Transform Data** ➡️ **Load to OMOP DB** ➡️ **End**

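In this pipeline, each task wraps one Python function in a `PythonOperator`. The dependencies declared on the DAG ensure the tasks run in the order shown. **Start** and **End** mark the beginning and end of a run; they are not separate tasks in the code.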
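
The ordering guarantee comes from treating the tasks as a directed acyclic graph. The sketch below uses Python's standard-library `graphlib` rather than Airflow itself to show how declared dependencies determine an execution order. The task names match the `task_id` values used in this notebook; the `dependencies` dictionary is an illustrative stand-in for what Airflow derives from the `>>` operator.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a task; its value is the set of tasks that must finish first.
dependencies = {
    "extract_l7": set(),
    "transform_l7": {"extract_l7"},
    "load_omop": {"transform_l7"},
}

# A topological sort yields an order in which every task follows its prerequisites.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)  # ['extract_l7', 'transform_l7', 'load_omop']
```

Because this pipeline is a simple chain, there is only one valid order. In a DAG with branches, tasks that do not depend on each other could run in parallel.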
    


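## 🧾 Airflow DAG Code Example

The code cells at the end of this notebook define the DAG object, the three task functions, and the task dependencies.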


## 🛠️ Requirements

- Airflow (`pip install apache-airflow`)
- Python packages: `pandas`, `sqlalchemy`, and the PostgreSQL driver `psycopg2`
- Airflow running with DAG folder configured (`~/airflow/dags/`)
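
Before deploying the DAG, it can help to confirm that the packages above are importable in the environment Airflow runs in. The following is a small sketch using only the standard library; `missing_packages` is a hypothetical helper, and the list of names is taken from the imports used in this notebook.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Top-level import names used by the DAG code in this notebook.
REQUIRED = ["airflow", "pandas", "psycopg2", "sqlalchemy"]

missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Run it with the same Python interpreter that the Airflow scheduler and workers use; otherwise the result may not reflect the actual runtime environment.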

## 📚 Resources

- [Airflow Docs](https://airflow.apache.org/docs/apache-airflow/stable/)
- [OMOP CDM Info](https://ohdsi.github.io/CommonDataModel/)
- [Astronomer Academy](https://www.astronomer.io/learn/)
    

In [None]:

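from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
import pandas as pd
import psycopg2
import sqlalchemy

default_args = {
    'owner': 'harbinger',
    # Placeholder date: a fixed start_date replaces the deprecated days_ago(1).
    'start_date': datetime(2024, 1, 1),
    'retries': 1  # retry each failed task once
}

# Note: Airflow 2.4+ prefers `schedule`; `schedule_interval` was removed in Airflow 3.
dag = DAG(
    'l7_to_omop_ingestion',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False,  # do not backfill runs between start_date and today
    description='Ingest L7 clinical data from Postgres to OMOP CDM staging area'
)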
    

In [None]:

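def extract_data_from_l7():
    """Dump the L7 patient table to a CSV file for the transform task."""
    # Placeholder credentials; keep real ones in an Airflow Connection, not in code.
    conn = psycopg2.connect(
        host='your-l7-host',
        dbname='l7_database',
        user='your_user',
        password='your_password',
        port=5432
    )
    try:
        df = pd.read_sql("SELECT * FROM patient_table", conn)
    finally:
        conn.close()  # release the connection even if the query fails
    # Handing data over via /tmp assumes all tasks run on the same machine.
    df.to_csv('/tmp/l7_patient_data.csv', index=False)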
    

In [None]:

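def transform_data():
    """Read the extracted CSV, apply the transformation, and write a new CSV."""
    df = pd.read_csv('/tmp/l7_patient_data.csv')
    # Sample transformation only: lowercases column names, no OMOP mapping yet.
    df.columns = [col.lower() for col in df.columns]
    df.to_csv('/tmp/transformed_patient_data.csv', index=False)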
    

In [None]:

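def load_to_omop():
    """Append the transformed rows to the OMOP `person` table."""
    # Placeholder URL; keep real credentials in an Airflow Connection, not in code.
    engine = sqlalchemy.create_engine('postgresql://omop_user:password@omop-host:5432/omop_db')
    df = pd.read_csv('/tmp/transformed_patient_data.csv')
    # Appending only succeeds if the CSV columns match the columns of `person`.
    df.to_sql('person', engine, if_exists='append', index=False)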
    

In [None]:

extract_task = PythonOperator(
    task_id='extract_l7',
    python_callable=extract_data_from_l7,
    dag=dag
)

transform_task = PythonOperator(
    task_id='transform_l7',
    python_callable=transform_data,
    dag=dag
)

load_task = PythonOperator(
    task_id='load_omop',
    python_callable=load_to_omop,
    dag=dag
)

extract_task >> transform_task >> load_task
    