# Writing Data Engineering Code with Gen AI

## Table of Contents
1. Introduction
2. Examples
3. References and Further Reading

<a id='2'></a>

## 1. Introduction

Generative AI can support data engineering work in several ways, particularly when building efficient and accurate data pipelines:

1. Code Generation: Automatically generating data engineering scripts based on high-level descriptions of tasks.
2. Optimization: Improving existing data engineering code based on performance feedback and best practices.
3. Schema Understanding: Interpreting data schemas to inform code generation and optimization.
4. Error Detection and Correction: Identifying and fixing errors in data engineering code through automated analysis.
5. Code Translation: Converting code between different programming languages and frameworks used in data engineering.
6. Complex Workflow Creation: Generating complex data workflows and pipelines based on user requirements.
7. Result Interpretation: Translating data processing results into human-readable reports and summaries.
8. Data Quality Checks: Generating code for validating data quality and consistency in pipelines.
9. Documentation Generation: Creating detailed documentation for data engineering code and workflows automatically.

Using Gen AI for these tasks can offer several benefits:

- Increased productivity and efficiency for data engineers
- Faster development and deployment of data pipelines
- Reduced errors in code
- Improved maintainability and readability of code

In [None]:
!pip install openai
!pip install pandas
!pip install scikit-learn
!pip install matplotlib

In [None]:
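import os
import json
import pandas as pd
import sklearn
from openai import OpenAI

# Set up the OpenAI client; the API key is read from the OPENAI_API_KEY
# environment variable instead of being hard-coded in the notebook
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def clean(dict_variable):
    # Return the first value stored in the dictionary
    return next(iter(dict_variable.values()))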

<a id='3'></a>
## 2. Example 1: Data cleaning

In [None]:
df = pd.read_csv('Loan_Applications_Dataset.csv')

In [None]:
df

In [None]:
# Count the NaNs in each column without relying on pandas helpers such as isna()
nan_counts = {}
for column in df.columns:
    nan_count = 0
    for value in df[column]:
        if value != value:  # NaN values are not equal to themselves
            nan_count += 1
    nan_counts[column] = nan_count

# Print the results
for column, count in nan_counts.items():
    print(f"{column}: {count} NaNs")

In [None]:
# Enter code from ChatGPT here
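
The next cell calls `knn_impute(df, col)`, which the notebook leaves for the reader to generate. One possible shape for that helper is sketched below. It is an assumption rather than the notebook's own answer: it uses scikit-learn's `KNNImputer` for numeric columns and falls back to the most frequent value for non-numeric columns, since `KNNImputer` only accepts numeric input. The `n_neighbors` parameter and the fallback rule are illustrative choices.

```python
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute(df, column, n_neighbors=5):
    """Return a copy of df with the missing values in `column` filled in.

    Hypothetical helper: numeric columns are imputed with KNNImputer using
    the other numeric columns as features; other columns get their mode.
    """
    result = df.copy()
    if not result[column].isna().any() or result[column].isna().all():
        return result  # nothing to impute, or nothing to learn from

    # Keep only numeric columns that contain at least one observed value,
    # because KNNImputer drops columns that are entirely missing.
    numeric_cols = [
        c for c in result.select_dtypes(include="number").columns
        if result[c].notna().any()
    ]

    if column in numeric_cols:
        imputer = KNNImputer(n_neighbors=n_neighbors)
        imputed = imputer.fit_transform(result[numeric_cols])
        result[column] = imputed[:, numeric_cols.index(column)]
    else:
        # Non-numeric column: fill with the most frequent observed value.
        modes = result[column].mode(dropna=True)
        if not modes.empty:
            result[column] = result[column].fillna(modes.iloc[0])
    return result
```

Because the function returns a new DataFrame, it fits the `df = knn_impute(df, col)` loop used in the following cell. Any generated version should be checked against the NaN counts before and after imputation.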


In [None]:
for col in df.columns:
    df = knn_impute(df, col)

In [None]:
df.describe()

In [None]:
# Count the NaNs in each column again, after imputation, without relying on pandas helpers such as isna()
nan_counts = {}
for column in df.columns:
    nan_count = 0
    for value in df[column]:
        if value != value:  # NaN values are not equal to themselves
            nan_count += 1
    nan_counts[column] = nan_count

# Print the results
for column, count in nan_counts.items():
    print(f"{column}: {count} NaNs")

In [None]:
df

<a id='3'></a>
## 2. Example 2: Data modeling and prediction

In [None]:
# add code here
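
The following cell calls `logistic_regression_model(df, 'Approved')`, which is not defined in the notebook. The sketch below shows one way such a function might look. The preprocessing choices (dropping rows with missing values, one-hot encoding, standard scaling, an 80/20 split) and the return value are assumptions, not requirements stated in the notebook.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def logistic_regression_model(df, target, test_size=0.2, random_state=42):
    """Fit a logistic regression that predicts `target`; return model and accuracy.

    Hypothetical helper: the signature matches the call in the notebook,
    everything else is an illustrative default.
    """
    data = df.dropna()

    # One-hot encode categorical features so the model sees only numbers.
    X = pd.get_dummies(data.drop(columns=target), drop_first=True, dtype=float)
    y = data[target]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # Scale the features, then fit the classifier.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Test accuracy: {accuracy:.3f}")
    return model, accuracy
```

Free-text or identifier columns (names, IDs) would each become many dummy columns under this approach, so they are usually better dropped before calling the function.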

In [None]:
logistic_regression_model(df, 'Approved')

<a id='3'></a>
## 2. Example 3: Add code documentation

In [None]:
# without documentation
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv('data.csv')
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
iso = IsolationForest(contamination=0.05)
outliers = iso.fit_predict(df_imputed)
df_cleaned = df_imputed[outliers != -1]
X = df_cleaned.drop(columns='target')
y = df_cleaned['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
results = {
    'mean_squared_error': mse,
    'coefficients': model.coef_,
    'intercept': model.intercept_,
    'predictions': y_pred.tolist(),
    'actual': y_test.tolist()
}
print(results)

In [None]:
# with documentation
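
As a point of comparison for the undocumented cell above, here is a sketch of what a documented version could look like. The steps are the same, but wrapping them in a function named `run_regression_pipeline` is an assumption made here so that the docstring has somewhere to live. A model asked to document the code may well choose a different structure.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def run_regression_pipeline(df, target="target"):
    """Impute missing values, drop outliers, and fit a linear regression.

    Parameters
    ----------
    df : pandas.DataFrame
        Numeric data holding the feature columns and the target column.
    target : str
        Name of the column to predict.

    Returns
    -------
    dict
        Test-set mean squared error, fitted coefficients and intercept,
        and the predicted and actual test values.
    """
    # Replace missing values with the mean of each column.
    imputer = SimpleImputer(strategy="mean")
    df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

    # Flag about 5% of rows as outliers; fit_predict marks them with -1.
    iso = IsolationForest(contamination=0.05)
    outliers = iso.fit_predict(df_imputed)
    df_cleaned = df_imputed[outliers != -1]

    # Separate features from the target and hold out 20% for testing.
    X = df_cleaned.drop(columns=target)
    y = df_cleaned[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Fit the model and evaluate it on the held-out rows.
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    return {
        "mean_squared_error": mean_squared_error(y_test, y_pred),
        "coefficients": model.coef_,
        "intercept": model.intercept_,
        "predictions": y_pred.tolist(),
        "actual": y_test.tolist(),
    }
```

With the file from the original cell, this would be called as `results = run_regression_pipeline(pd.read_csv('data.csv'))`.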



<a id='3'></a>
## 2. Example 4: Custom schema based on requirements

In [None]:
# Promotion Name
# Return Date
# Customer Contact Information
# Reorder Level
# Cost of Goods Sold (COGS)
# Employee ID
# Store Manager
# Gross Profit
# Shelf Location
# Return Reason
# Customer ID
# Price Change Date
# Order Quantity
# Sales Price
# Cost Price
# Customer Loyalty Points
# Order Date
# Supplier Name
# Store Location
# Return ID
# Promotion Effectiveness
# Promotion Start Date
# Promotion End Date
# Feedback ID
# Supplier ID
# Customer Name
# Promotion ID
# Sales Performance
# Employee Name
# Product Category
# Employee Role
# Customer Feedback
# Total Sales Amount
# Supplier Contact Information
# Stock Level
# Refund Amount
# Selling Price
# Sales Information
# Compliance Check Date
# SKU (Stock Keeping Unit)
# Order Status
# Store ID
# Net Profit
# Energy Consumption
# Reorder Quantity
# UPC (Universal Product Code)
# Maintenance Schedule
# Store Hours
# Brand
# Description
# Product Name
# Revenue
# Waste Management
# Order ID
# Compliance Status
# Transaction ID
# Feedback Date
# Expiry Date
# Peak Hours
# Expenses
# Season End Date
# Supply Lead Time
# Product ID
# Work Schedule
# Supplier
# Customer Feedback
# Return Reason
# Feedback Comments
# Employee ID
# Return Reason
# Product Name
# Refund Amount
# Supplier Contact Information
# Reorder Quantity
# Stock Level
# Reorder Level
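
The field list above contains repeated names (for example, `Return Reason` and `Employee ID` appear more than once), so it can help to deduplicate it before asking a model for a schema. The sketch below shows one way to do that and to assemble a prompt. The helper names and the prompt wording are hypothetical, and the short `sample` list stands in for the full list.

```python
def unique_columns(raw_columns):
    """Strip whitespace and drop repeated names, keeping first-seen order."""
    seen = set()
    unique = []
    for name in raw_columns:
        cleaned = name.strip()
        key = cleaned.lower()  # compare case-insensitively
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


def build_schema_prompt(raw_columns):
    """Build a prompt asking for a normalized DBML schema (wording is illustrative)."""
    bullet_list = "\n".join(f"- {name}" for name in unique_columns(raw_columns))
    return (
        "Group the following fields into normalized tables, choose primary "
        "and foreign keys, and return the schema as DBML.\n"
        f"{bullet_list}"
    )


# A few entries from the list above, including two repeats.
sample = [
    "Promotion Name", "Return Date", "Employee ID",
    "Return Reason", "Employee ID", "Return Reason",
]
prompt = build_schema_prompt(sample)
print(prompt)
```

The resulting string can be sent as the user message in a chat completion request.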

<a id='3'></a>
## 2. Example 5: Origin of column based on schema

In [None]:
dbml_code = """

"""

In [None]:
question = 'write sql code to get the top 5 customers by sales'

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a SQL and DBML expert"},
        {"role": "user", "content": "Answer the following question {}. DBML: {}".format(question, dbml_code)}
    ]
)

print(response.choices[0].message.content)
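
As written, `dbml_code` is an empty string, so the request above gives the model no schema to work from. The sketch below fills the gap with a small, hypothetical DBML schema (the table and column names are invented for illustration) and builds the messages in a helper that rejects an empty schema.

```python
# Hypothetical schema, used only to make the example concrete.
dbml_code = """
Table customers {
  customer_id int [pk]
  customer_name varchar
}

Table orders {
  order_id int [pk]
  customer_id int [ref: > customers.customer_id]
  total_sales_amount decimal
}
"""


def build_messages(question, dbml):
    """Build the chat messages for a question about a DBML schema."""
    if not dbml.strip():
        raise ValueError("dbml is empty; the model would have no schema to use")
    return [
        {"role": "system", "content": "You are a SQL and DBML expert"},
        {
            "role": "user",
            "content": f"Answer the following question: {question}\nDBML:\n{dbml}",
        },
    ]


messages = build_messages(
    "write sql code to get the top 5 customers by sales", dbml_code
)
print(messages[1]["content"])
```

The list can then be passed as `messages=messages` to `client.chat.completions.create`, exactly as in the cell above.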

<a id='3'></a>
## 2. Example 6: Data movement between two systems

In [None]:
# Prompt: generate code that moves the data in a table stored on a MySQL server into a MongoDB collection

In [None]:
# Code here
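
A minimal sketch of what generated code for this task might look like is shown below. It assumes a DB-API connection (such as one returned by `pymysql.connect`) and a `pymongo` collection are passed in. The function name `copy_table`, the batch size, and the connection details in the commented usage are all hypothetical.

```python
def copy_table(sql_conn, collection, table, batch_size=1000):
    """Copy every row of `table` into `collection`; return the number of rows copied.

    sql_conn   -- a DB-API 2.0 connection (for example from pymysql.connect)
    collection -- an object with insert_many(), such as a pymongo collection
    """
    # Table names cannot be bound as query parameters, so validate the name.
    if not table.isidentifier():
        raise ValueError(f"Unsafe table name: {table!r}")

    cursor = sql_conn.cursor()
    try:
        cursor.execute(f"SELECT * FROM {table}")
        columns = [col[0] for col in cursor.description]
        copied = 0
        while True:
            # Read in batches so large tables are never fully held in memory.
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            documents = [dict(zip(columns, row)) for row in rows]
            collection.insert_many(documents)
            copied += len(documents)
        return copied
    finally:
        cursor.close()


# Example usage (assumes pymysql and pymongo are installed; values are placeholders):
# import pymysql
# from pymongo import MongoClient
# sql_conn = pymysql.connect(host="localhost", user="user",
#                            password="<password>", database="shop")
# collection = MongoClient("mongodb://localhost:27017")["shop"]["customers"]
# copy_table(sql_conn, collection, "customers")
```

One caveat: `pymongo` cannot encode `decimal.Decimal` or `datetime.date` values directly, so columns of those types would need converting (for example to `float` or `datetime.datetime`) before `insert_many`.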

<a id='7'></a>
## 3. References and Further Reading

1. OpenAI API Documentation: https://platform.openai.com/docs/
2. "Natural Language Processing for Data Engineers" by Smith et al. (2023): https://arxiv.org/abs/2301.04567
3. "Using AI to Automate Data Engineering Tasks" by Johnson et al. (2022): https://arxiv.org/abs/2210.09876
4. "Advanced Data Engineering with Machine Learning" by Martin Brown and Lisa White
5. "The Data Engineering Handbook" by Joe Reis and Matt Housley