# Most Used Functions in Data Engineering

Data engineering involves various tasks such as data ingestion, transformation, validation, and storage. Here, we'll look at some of the most commonly used functions and techniques in data engineering using Python and its libraries.

## 1. Data Ingestion

Data ingestion is the process of obtaining and importing data, either for immediate use or for storage and later processing. Common sources include databases, CSV files, APIs, and streaming data.

In [None]:
# Example: Reading data from a CSV file using pandas
import pandas as pd

# 'example.csv' is a placeholder path; point it at your own file
data = pd.read_csv('example.csv')
print(data.head())
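
CSV files are only one of the sources listed above. As a rough sketch of two others, the cell below reads rows from a database and parses a JSON payload of the kind an API might return. It uses only the standard library. The in-memory SQLite database, the `events` table, and the JSON payload are made-up stand-ins for real source systems.

```python
import json
import sqlite3

# Hypothetical in-memory database standing in for a real source system
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "signup"), (2, "login")],
)

# Ingest from the database
db_rows = [dict(row) for row in conn.execute("SELECT id, name FROM events ORDER BY id")]
conn.close()

# Ingest from a JSON payload (hypothetical API response body)
payload = '{"results": [{"id": 3, "name": "logout"}]}'
api_rows = json.loads(payload)["results"]

# Combine both sources into one list of records
records = db_rows + api_rows
print(records)
```

A list of dictionaries such as `records` can be passed to `pd.DataFrame(records)` to continue working in pandas.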

## 2. Data Transformation

Data transformation involves converting data from one format or structure into another. This is essential for data cleaning, normalization, and enrichment.

In [None]:
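# Example: Data transformation using pandas
# Column names below are placeholders; replace them with columns from your data
# Adding a new column
data['new_column'] = data['existing_column'] * 2

# Renaming columns
data = data.rename(columns={'old_name': 'new_name'})

# Filtering data
threshold = 0  # placeholder cutoff; choose a value that suits your data
filtered_data = data[data['column'] > threshold]
print(filtered_data.head())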
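
To make cleaning, normalization, and enrichment more concrete, here is a small self-contained sketch on made-up data. The `orders` frame and its column names are assumptions chosen for illustration, and min-max scaling is just one of several possible normalization choices.

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only
orders = pd.DataFrame({
    "customer": [" alice ", "BOB", "alice"],
    "amount": [10.0, 50.0, 30.0],
})

# Cleaning: trim whitespace and standardize case
orders["customer"] = orders["customer"].str.strip().str.lower()

# Normalization: min-max scale amount into the range [0, 1]
amount_min = orders["amount"].min()
amount_max = orders["amount"].max()
orders["amount_scaled"] = (orders["amount"] - amount_min) / (amount_max - amount_min)

# Enrichment: attach each customer's total spend to every row
orders["customer_total"] = orders.groupby("customer")["amount"].transform("sum")

print(orders)
```

Note that the scaling step divides by zero when every value in the column is identical, so real pipelines usually guard against that case.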

## 3. Data Validation

Data validation ensures the accuracy and quality of data. This can include checking for missing values, ensuring data types are correct, and validating data against business rules.

In [None]:
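# Example: Data validation using pandas
# Checking for missing values
missing_values = data.isnull().sum()
print(missing_values)

# Ensuring correct data types
# astype(int) raises an error if the column contains missing values,
# so drop those rows first
data = data.dropna(subset=['column'])
data['column'] = data['column'].astype(int)

# Validating data against business rules
valid_data = data[data['column'] > 0]
print(valid_data.head())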
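
The individual checks above can also be collected into a single function that reports problems instead of silently filtering rows. The following is a minimal sketch under assumed names: the `validate` function, the `quantity` column, and the "must be positive" rule are all hypothetical.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list:
    """Return a list of readable problems; an empty list means all checks passed."""
    problems = []

    # Check for missing values in every column
    missing = df.isnull().sum()
    for column, count in missing[missing > 0].items():
        problems.append(f"{column}: {count} missing value(s)")

    # Check the data type of the (hypothetical) quantity column
    if not pd.api.types.is_numeric_dtype(df["quantity"]):
        problems.append("quantity: expected a numeric dtype")
    # Business rule (assumed for illustration): quantities must be positive
    elif (df["quantity"].dropna() <= 0).any():
        problems.append("quantity: values must be positive")

    return problems

sample = pd.DataFrame({"sku": ["a", "b", None], "quantity": [5, -1, 3]})
print(validate(sample))
```

Returning a list of messages lets the caller decide whether to log the problems, reject the batch, or quarantine the offending rows.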

## 4. Data Storage

Data storage involves saving data in a structured format for future use. Common storage options include relational databases, NoSQL databases, and data lakes.

In [None]:
# Example: Saving data to a database using SQLAlchemy
from sqlalchemy import create_engine

# Creating an engine and saving data to a SQL database
# if_exists='replace' drops any existing table with this name before writing
engine = create_engine('sqlite:///example.db')
data.to_sql('table_name', engine, if_exists='replace', index=False)
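
As a sketch of how the `if_exists` modes behave, the cell below writes two batches to a throwaway in-memory SQLite database and reads the result back. It passes a plain `sqlite3` connection to pandas rather than a SQLAlchemy engine; the `measurements` table and the batch contents are made up for the example.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway database for the example

# Hypothetical batches arriving at different times
batch_1 = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
batch_2 = pd.DataFrame({"id": [3], "value": ["c"]})

# 'replace' drops and recreates the table; 'append' adds rows to it
batch_1.to_sql("measurements", conn, if_exists="replace", index=False)
batch_2.to_sql("measurements", conn, if_exists="append", index=False)

# Read the stored rows back to confirm what was written
stored = pd.read_sql_query("SELECT id, value FROM measurements ORDER BY id", conn)
conn.close()
print(stored)
```

Using `'append'` for incremental loads keeps earlier batches in place, whereas `'replace'` suits full refreshes where the table is rebuilt each run.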

## 5. Data Pipeline Orchestration

Data pipeline orchestration involves scheduling and managing data workflows. Tools like Apache Airflow and Prefect are commonly used for this purpose.

In [None]:
# Example: Simple data pipeline using Airflow
from airflow import DAG
# Airflow 2.x import path; airflow.operators.python_operator is deprecated
from airflow.operators.python import PythonOperator
from datetime import datetime

def my_task():
    print('Task executed!')

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    'retries': 1,
}

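dag = DAG(
    'my_dag',
    default_args=default_args,
    schedule='@daily',  # Airflow 2.4+; earlier versions use schedule_interval
    catchup=False,  # do not backfill runs between start_date and today
)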

task = PythonOperator(
    task_id='my_task',
    python_callable=my_task,
    dag=dag,
)
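
At its core, an orchestrator runs tasks in an order that respects their dependencies. The sketch below illustrates only that idea, using the standard library's `graphlib` module (Python 3.9+). It is not how Airflow or Prefect are implemented, and the `extract`, `transform`, and `load` tasks are hypothetical.

```python
from graphlib import TopologicalSorter

log = []  # records the order in which tasks actually ran

def extract():
    log.append("extract")

def transform():
    log.append("transform")

def load():
    log.append("load")

tasks = {"extract": extract, "transform": transform, "load": load}

# Map each task to the set of tasks that must finish before it starts
dependencies = {
    "transform": {"extract"},
    "load": {"transform"},
}

# Run every task after all of its prerequisites
for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()

print(log)
```

In Airflow, the same kind of dependency chain is usually declared with the `>>` operator between tasks, and the scheduler takes care of ordering, retries, and timing.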