In [2]:
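from dotenv import load_dotenv
# Load environment variables from a local .env file; the openai client
# reads its API key from the OPENAI_API_KEY environment variable
load_dotenv()
import openai
import os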

In [3]:
schema_files = os.listdir('../schema')

In [4]:
all_schemas = {}

In [5]:
for file in schema_files:
    # Use a context manager so each file handle is closed after reading
    with open(os.path.join('../schema', file), 'r') as opened_file:
        all_schemas[file] = opened_file.read()
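
The loop above reads every entry that `os.listdir` returns, so a subdirectory or a stray non-SQL file in `../schema` would either raise an error or end up in `all_schemas`. A minimal sketch of a stricter loader follows; `load_schemas` is a hypothetical helper name, and the `*.sql` filter and UTF-8 encoding are assumptions about the schema folder rather than anything the notebook states.

```python
from pathlib import Path


def load_schemas(schema_dir):
    """Return {file name: file contents} for every .sql file in schema_dir.

    Hypothetical helper: assumes schemas are UTF-8 text files ending in .sql.
    """
    schemas = {}
    # sorted() gives a stable order; the glob pattern skips other file types
    for path in sorted(Path(schema_dir).glob("*.sql")):
        # Skip directories that happen to match the pattern
        if path.is_file():
            schemas[path.name] = path.read_text(encoding="utf-8")
    return schemas
```

Calling `load_schemas('../schema')` would produce a dictionary with the same keys that the later cells use, such as `'product.sql'`.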

In [6]:
system_prompt = """You are a data engineer looking to create documentation and example queries for your data sets"""

In [7]:
user_prompt = f"""Using cumulative table input schema {all_schemas['product.sql']}
                 Generate a pipeline documentation in markdown 
                    that shows how this is generated from 
                {all_schemas['product_scd_tbl.sql']}
                make sure to include example queries that use the season stats array
                make sure to document all columns with column comments
                make sure to document all created types as well
            """
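
Because the prompt is an indented triple-quoted f-string, the indentation is copied into the text sent to the model, and a missing schema file only surfaces as a bare `KeyError`. The sketch below shows one way to build the same prompt line by line with an early check; `build_user_prompt` and its parameter names are hypothetical, and the instruction wording is copied from the cell above.

```python
def build_user_prompt(schemas, target_file, source_file):
    """Build the user prompt from two loaded schema files.

    Hypothetical helper: `schemas` maps file names to their contents.
    """
    # Fail early with a readable message if a schema was not loaded
    missing = [name for name in (target_file, source_file) if name not in schemas]
    if missing:
        raise KeyError(f"Schema file(s) not loaded: {', '.join(missing)}")
    # Joining unindented lines keeps stray leading whitespace out of the prompt
    return "\n".join([
        f"Using cumulative table input schema {schemas[target_file]}",
        "Generate a pipeline documentation in markdown",
        "that shows how this is generated from",
        schemas[source_file],
        "make sure to include example queries that use the season stats array",
        "make sure to document all columns with column comments",
        "make sure to document all created types as well",
    ])
```

With the dictionary from the earlier cells, `build_user_prompt(all_schemas, 'product.sql', 'product_scd_tbl.sql')` would yield an equivalent prompt without the indentation.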

In [8]:
print(system_prompt)
print(user_prompt)

You are a data engineer looking to create documentation and example queries for your data sets
Using cumulative table input schema CREATE TABLE product (
  Product_ID INT PRIMARY KEY,
  Category_Name VARCHAR(50),
  Sub_Category_Name VARCHAR(50),
  Brand VARCHAR(50),
  Feature_Desc VARCHAR(100)
)

                 Generate a pipeline documentation in markdown 
                    that shows how this is generated from 
                create temp table #product_temp
select a.Product_ID as Product_ID_New,
case when a.Category_Name <> b.Category_Name then ‘-Category_Name’ else ‘’ end ||
case when a.Sub_Category_Name <> b.Sub_Category_Name then ‘-Sub_Category_Name’ else ‘’ end ||
case when a.Brand <> b.Brand then ‘-Brand’ else ‘’ end ||
case when a.Feature_Desc <> b.Feature_Desc then ‘-Feature_Desc’ else ‘’ end as CHANGED_COLUMN_NEW
from Dim_Product a join Stg_Product b
on a.Product_ID=b.Product_ID and a.current_flag=’Y’
where
a.Category_Name <> b.Category_Name or
a.Sub_Category_Name <> b.S

In [9]:
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
    temperature=0
)
answer = response.choices[0].message.content

In [10]:
print(answer)

# Pipeline Documentation

This pipeline is designed to generate a temporary table `#product_temp` that identifies the changes in the product details from the staging table `Stg_Product` to the dimension table `Dim_Product`.

## Input Schema

The input schema for this pipeline is the `product` table with the following columns:

- `Product_ID`: An integer that uniquely identifies each product. This is the primary key of the table.
- `Category_Name`: A string that represents the category of the product.
- `Category_Sub_Name`: A string that represents the sub-category of the product.
- `Brand`: A string that represents the brand of the product.
- `Feature_Desc`: A string that describes the features of the product.

## Pipeline Steps

1. The pipeline starts by creating a temporary table `#product_temp` that will store the new product ID and the changed columns.

2. It then performs a join operation between the `Dim_Product` and `Stg_Product` tables based on the `Product_ID` and the `current

In [11]:
# Create the output directory if it does not already exist
os.makedirs('output', exist_ok=True)
# Save the generated documentation as a markdown file
with open('output/documentation.md', 'w', encoding='utf-8') as file:
    file.write(answer)
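
When reviewing the saved documentation, it can help to have a reference for what the SQL printed in cell 8 computes. The sketch below mirrors its `CASE ... ||` chain in plain Python for a single pair of rows. `changed_columns` and `TRACKED_COLUMNS` are hypothetical names, the rows are assumed to be dictionaries keyed by column name, and only the four columns visible in the printed SQL are covered.

```python
# Columns compared by the CASE expressions in the printed SQL
TRACKED_COLUMNS = ("Category_Name", "Sub_Category_Name", "Brand", "Feature_Desc")


def changed_columns(dim_row, stg_row):
    """Concatenate '-<column>' for each tracked column whose value differs.

    Hypothetical helper: dim_row stands in for a Dim_Product row and
    stg_row for the matching Stg_Product row.
    """
    parts = []
    for col in TRACKED_COLUMNS:
        old, new = dim_row[col], stg_row[col]
        # In SQL, `<>` against NULL is not true, so the CASE falls through
        # to ''; treating None the same way keeps the two in step.
        if old is not None and new is not None and old != new:
            parts.append(f"-{col}")
    return "".join(parts)
```

For two rows that differ only in `Category_Name` and `Brand`, this returns `'-Category_Name-Brand'`, the value the SQL would store in `CHANGED_COLUMN_NEW`.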