# What is Data Engineering?

It is essentially a more glamorous term for <code>ETL/ELT</code> practices. Traditional <code>ETL/ELT</code> was implemented predominantly on-prem. Modern Data Engineering spans on-prem deployments, cloud-hosted services, and serverless platforms.<br><br>


Data Engineering as a practice consists of the following steps. <br><br>

<div><font color = 'dodgerblue'><b>1. Data Collection<br>
2. Data Processing<br>
3. Data Storage<br>
4. Data Modeling<br>
5. Data Analysis<br>
6. Data Visualization<br>
7. Data Consumption</b></font></div>

<img src="https://i.ibb.co/YyRPbWy/Screenshot-2023-03-10-at-2-19-49-PM.png" height = "1200" width = "500">

A Data pipeline is a collective representation of all or some of the steps listed above. <br><br>

<font color = 'dodgerblue'><b>1. Data Collection</b></font><br>
The first step in a data-pipeline is data collection from various heterogeneous sources such as files, databases, APIs, and streams. It is important to understand how to collect the data securely and how to scale the system as the input volume increases. <br>

As a data engineer you should be familiar with parsing input schemas and file handling. The schema defines the structure of the data and type of fields being ingested. The input schema also consists of column and line delimiters.<br> 

For example, if you are reading a CSV file, the column delimiter is a comma and the line delimiter is a newline. <br><br>

<div><code><font color = 'indigo'>import csv<br>
#Define the input schema<br>
schema = {
    'name': str,
    'age': int,
    'gender': str,
    'city': str,
    'country': str
}<br>
#Open the CSV file
with open('data.csv', newline='') as csvfile:
    # Create a CSV reader object
    reader = csv.DictReader(csvfile)
    # Loop through each row in the CSV file
    for row in reader:
        # Parse the row using the input schema
        parsed_row = {key: schema[key](row[key]) for key in schema}
        # Process the parsed row
        print(parsed_row)</font></code></div><br>
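
Real input files rarely match the schema on every row. The following is a minimal sketch of how rows that fail type conversion could be routed to a reject list instead of stopping the whole run. The helper name `parse_rows` and the inline sample data are assumptions made for illustration; they stand in for `data.csv`.

```python
import csv
import io

# Same input schema as above: column name -> type used to parse the value
schema = {
    'name': str,
    'age': int,
    'gender': str,
    'city': str,
    'country': str,
}


def parse_rows(lines, schema):
    """Parse CSV lines with the schema; return (parsed_rows, rejected_rows)."""
    parsed, rejected = [], []
    for row in csv.DictReader(lines):
        try:
            parsed.append({key: cast(row[key]) for key, cast in schema.items()})
        except (KeyError, TypeError, ValueError) as err:
            # Keep the raw row and the reason so it can be inspected later
            rejected.append((row, repr(err)))
    return parsed, rejected


# Hypothetical sample standing in for data.csv
sample = io.StringIO(
    "name,age,gender,city,country\n"
    "Asha,34,F,Pune,India\n"
    "Ben,unknown,M,Leeds,UK\n"
)
parsed, rejected = parse_rows(sample, schema)
print(parsed)
print(len(rejected), "row(s) rejected")
```

Keeping the rejected rows together with the error makes it easier to report bad records back to the data producer.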


Input files can also be read by inferring the schema from the source, without having to specify it explicitly.<br><br>

<div><code><font color = 'indigo'>import csv<br>
#Open the CSV file
with open('data.csv', newline='') as csvfile:
    # Create a CSV reader object
    reader = csv.reader(csvfile)
    # Keep track of the header row
    header = next(reader)
    # Loop through each row in the CSV file
    for row in reader:
        # Create a dictionary to hold the parsed row
        parsed_row = {}
        # Loop through each field in the row
        for i in range(len(row)):
            # Check if the field is in the header
            if i < len(header):
                # If it is, use the field name as the key; parse digit-only values as int and keep everything else as a string
                parsed_row[header[i]] = int(row[i]) if row[i].isdecimal() else row[i]
            else:
                # If it's not, assume it's a new field and parse the value as a string
                parsed_row[f'new_field_{i - len(header)}'] = str(row[i])
        # Process the parsed row as needed
        print(parsed_row)</font></code></div><br>


Dynamic schema inference, however, can cause issues downstream if the downstream data-pipelines do not perform the same inference. It is usually best to process the data but also to send a notification both downstream and upstream about the change, so that the teams involved can decide whether it is an error or whether it needs change management to handle the new schema going forward.<br><br>
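
One way to notice such a change is to compare the header of each incoming file with the columns the pipeline expects. This is only a sketch: `EXPECTED_COLUMNS`, `detect_schema_drift`, and the sample file are hypothetical, and the `print` call stands in for whatever alerting channel you use.

```python
import csv
import io

# Columns the pipeline was built for (an assumption for this sketch)
EXPECTED_COLUMNS = ['name', 'age', 'gender', 'city', 'country']


def detect_schema_drift(header, expected):
    """Return columns that were added to or removed from the expected schema."""
    added = [col for col in header if col not in expected]
    missing = [col for col in expected if col not in header]
    return {'added': added, 'missing': missing}


# Hypothetical incoming file whose producer added one column and dropped another
incoming = io.StringIO(
    "name,age,city,country,email\n"
    "Asha,34,Pune,India,asha@example.com\n"
)
header = next(csv.reader(incoming))

drift = detect_schema_drift(header, EXPECTED_COLUMNS)
if drift['added'] or drift['missing']:
    # In a real pipeline this message would go to an alerting channel
    print(f"Schema drift detected: {drift}")
```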


<font color = 'dodgerblue'><b>2. Data processing</b></font> is the second step in the data engineering pipeline, after data collection. It involves cleaning, transforming, and enriching the data so that it can be used for analysis.

<b>a. Data Cleaning:</b> Data collected from different sources may contain errors, missing values, or outliers that need to be cleaned. Data cleaning involves identifying and correcting errors, filling in missing values, and removing outliers.

Let's assume we are reading a CSV file and the downstream consumers want 100% data availability. You can help set this up by agreeing on what should be put in place of missing values. In the example below, we assume the consumers want missing values replaced with the mean of the column.<br><br>

<div><code><font color = 'indigo'>import pandas as pd<br>
#Read in the data
df = pd.read_csv('data.csv')<br>
#Replace missing values in numeric columns with the mean of the column
df.fillna(df.mean(numeric_only=True), inplace=True)</font></code></div><br>
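
Cleaning also covers outliers. The sketch below uses only the standard library and the common 1.5 × IQR rule to drop extreme values. The rule, the threshold, and the sample readings are assumptions for illustration; the right definition of an outlier depends on your data and should be agreed with the consumers.

```python
import statistics

# Hypothetical sensor readings with one obviously extreme value
readings = [10, 12, 11, 13, 12, 95, 11, 12]

# Quartiles of the data; quantiles() with n=4 returns the three cut points
q1, _median, q3 = statistics.quantiles(readings, n=4)
iqr = q3 - q1

# Values further than 1.5 * IQR from the quartiles are treated as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [value for value in readings if lower <= value <= upper]
removed = [value for value in readings if not lower <= value <= upper]

print("kept:", cleaned)
print("removed:", removed)
```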
    
    
<b>b. Data Transformation:</b> Once the data is cleaned, it needs to be transformed into a format that is suitable for analysis. This can involve converting data types, aggregating data, or splitting data into multiple tables.<br>
    
Let's say I've set up a data-pipeline to read a date field, but the upstream source changes the field to a string. I then need to transform the data explicitly back into a date type.<br><br>
    
<div><code><font color = 'indigo'>#Convert date strings to datetime objects
df['date'] = pd.to_datetime(df['date'])</font></code></div><br><br>
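
When the upstream format is known, it can help to parse against that format explicitly and to decide what happens to values that do not match. Here is a small standard-library sketch; the helper `parse_date`, the `%Y-%m-%d` format, and the sample values are assumptions.

```python
from datetime import datetime


def parse_date(value, fmt='%Y-%m-%d'):
    """Return a date for a well-formed string, or None if it cannot be parsed."""
    try:
        return datetime.strptime(value, fmt).date()
    except (TypeError, ValueError):
        return None


# Hypothetical raw values: one valid, one in a different format, one empty
raw_dates = ['2023-03-10', '10/03/2023', '']
parsed_dates = [parse_date(value) for value in raw_dates]
print(parsed_dates)
```

With pandas, `pd.to_datetime(df['date'], errors='coerce')` has a similar effect: values that cannot be parsed become `NaT` instead of raising an error.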
    
<b>c. Data Enrichment:</b> In many cases, additional data from external sources can be used to enhance the data being processed. Data enrichment involves adding new data, such as demographic or geographic data, to existing data to create a more comprehensive dataset.<br><br>
    
<div><code><font color = 'indigo'>#Merge data from two different data sources
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
merged_df = pd.merge(df1, df2, on='id')</font></code></div><br><br>
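
Note that `pd.merge` performs an inner join by default, so records without a match in the second source are dropped. The sketch below shows the alternative idea in plain Python: keep every primary record and track which ones could not be enriched. The record layout and the names `customers` and `demographics` are hypothetical.

```python
# Hypothetical primary records and an external reference dataset keyed by id
customers = [
    {'id': 1, 'name': 'Asha'},
    {'id': 2, 'name': 'Ben'},
    {'id': 3, 'name': 'Chen'},
]
demographics = {
    1: {'age_band': '30-39', 'region': 'West'},
    2: {'age_band': '40-49', 'region': 'North'},
}

enriched, unmatched_ids = [], []
for record in customers:
    extra = demographics.get(record['id'])
    if extra is None:
        # Keep the record (left-join behavior) but remember that it had no match
        unmatched_ids.append(record['id'])
        extra = {'age_band': None, 'region': None}
    enriched.append({**record, **extra})

print(enriched)
print("ids without enrichment data:", unmatched_ids)
```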
    
Lastly,
    
<b>d. ETL (Extract, Transform, Load):</b> ETL is a common approach to data processing that involves extracting data from different sources, transforming it into a format suitable for analysis, and then loading it into a database or data warehouse.<br><br>

    
<div><code><font color = 'indigo'># Extract data from a CSV file
df = pd.read_csv('data.csv')<br>
# Transform the data by aggregating by group
grouped_df = df.groupby('category').sum(numeric_only=True)<br>
# Load the data into a database
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:password@localhost/mydatabase')
grouped_df.to_sql('summary_table', engine, if_exists='replace')</font></code></div><br><br>
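
If you want to try the same extract, transform, and load flow without a PostgreSQL server, here is a self-contained sketch that uses only the standard library. The in-memory SQLite database, the inline CSV sample, and the `amount` column are assumptions standing in for the real source and target.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Extract: read rows from a CSV source (inline sample standing in for data.csv)
source = io.StringIO("category,amount\nbooks,12.5\ngames,30\nbooks,7.5\n")
rows = list(csv.DictReader(source))

# Transform: total amount per category
totals = defaultdict(float)
for row in rows:
    totals[row['category']] += float(row['amount'])

# Load: rebuild the summary table in an in-memory SQLite database
conn = sqlite3.connect(':memory:')
conn.execute("DROP TABLE IF EXISTS summary_table")
conn.execute("CREATE TABLE summary_table (category TEXT PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO summary_table VALUES (?, ?)", totals.items())
conn.commit()

loaded = conn.execute(
    "SELECT category, amount FROM summary_table ORDER BY category"
).fetchall()
print(loaded)
```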

    
<font color = 'dodgerblue'><b>3. Data storage</b></font> is an essential component of data engineering. It involves designing, building, and maintaining systems that can store large volumes of data reliably and efficiently.<br><br>

There are several different types of data storage systems that data engineers use, depending on the nature of the data and the requirements of the project. Some common data storage systems include:<br><br>

<b>Relational databases:</b> These are traditional databases that store data in tables with a predefined schema. Relational databases are well-suited for structured data with well-defined relationships between entities.
Examples include <code>MySQL, PostgreSQL, Oracle, and Microsoft SQL Server.</code><br>
    
Relational databases are suitable for storing:
    
<i>Customer data:</i> Customer data, such as names, addresses, phone numbers, and email addresses, can be stored in a relational database. Each customer can be represented as a row in a table, and the relationship between customers and orders can be represented as a foreign key in a separate table.
    
<i>Sales data:</i> Sales data, such as product SKUs, prices, quantities, and order dates, can be stored in a relational database. Each sale can be represented as a row in a table, and the relationship between sales and customers can be represented as a foreign key in a separate table.
    
<i>Financial data:</i> Financial data, such as balance sheets, income statements, and cash flow statements, can be stored in a relational database. Each financial period can be represented as a row in a table, with columns for each account or category.
    
<i>Inventory data:</i> Inventory data, such as product SKUs, descriptions, quantities, and locations, can be stored in a relational database. Each inventory item can be represented as a row in a table, and the relationship between inventory and suppliers can be represented as a foreign key in a separate table.
    
    
Relational databases are well-suited for data that has a clear structure and well-defined relationships between entities. They are especially useful for data that needs to be queried and analyzed in a structured manner, such as for business intelligence and reporting purposes.
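
To make the foreign key idea concrete, here is a minimal sketch using SQLite from the standard library. The table and column names are hypothetical, and SQLite is used only because it needs no server; the same DDL idea applies to the databases listed above, with dialect differences.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# SQLite enforces foreign keys only when this setting is enabled
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        sku         TEXT NOT NULL,
        quantity    INTEGER NOT NULL
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'Asha', 'asha@example.com')")
conn.execute("INSERT INTO orders VALUES (100, 1, 'SKU-1', 2)")

# Join the two tables through the foreign key
order_rows = conn.execute("""
    SELECT c.name, o.sku, o.quantity
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
""").fetchall()
print(order_rows)

# An order for a customer that does not exist is rejected
try:
    conn.execute("INSERT INTO orders VALUES (101, 99, 'SKU-2', 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print("orphan order rejected:", orphan_rejected)
```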
    
<b>NoSQL databases:</b> These databases are designed to handle unstructured or semi-structured data that doesn't fit well into a rigid table structure. NoSQL databases are highly scalable and can handle large volumes of data with high velocity and variety.
Examples include <code>MongoDB, Cassandra, Couchbase, and Amazon DynamoDB.</code><br><br>
    
NoSQL databases are suitable for storing:
    
<i>Social media data:</i> NoSQL databases can store large volumes of social media data such as posts, comments, likes, and shares.
    
<i>Sensor data:</i> NoSQL databases are well-suited for handling sensor data generated by IoT devices such as temperature sensors, humidity sensors, and pressure sensors.
    
<i>Product catalogs:</i> NoSQL databases are ideal for storing product catalogs that contain a large number of items with varying attributes.
    
<i>Log data:</i> NoSQL databases are often used for storing log data generated by servers, applications, and network devices.
    
<i>Graph data:</i> Graph databases, one family of NoSQL databases, are designed to handle complex data relationships, making them a good fit for use cases such as social networks, recommendation engines, and fraud detection systems.
    
<i>Geospatial data:</i> NoSQL databases can store and retrieve geospatial data such as GPS coordinates, maps, and location-based data.
    
<i>Time-series data:</i> NoSQL databases are often used for storing and analyzing time-series data such as stock prices, website traffic, and weather data.
    
Overall, NoSQL databases are well-suited for handling large, complex, and unstructured data sets.<br><br>
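
The product catalog case shows why a flexible schema helps. The sketch below imitates a document store with plain Python dictionaries; it is not the API of MongoDB or any other product, and the SKUs and attributes are made up.

```python
import json

# Each document carries its own set of attributes; there is no shared table schema
catalog = {
    'sku-1': {'type': 'book', 'title': 'Data Pipelines', 'pages': 320},
    'sku-2': {'type': 'laptop', 'brand': 'Acme', 'ram_gb': 16, 'ports': ['usb-c', 'hdmi']},
}

# Adding a product with brand-new attributes needs no schema migration
catalog['sku-3'] = {'type': 'shirt', 'size': 'M', 'color': 'blue'}

# Lookup by key, a typical access pattern in a key-value or document store
print(json.dumps(catalog['sku-2'], indent=2))

# Query on an attribute that only some documents have
with_brand = [sku for sku, doc in catalog.items() if 'brand' in doc]
print(with_brand)
```

In a relational table, the same catalog would need either many nullable columns or a separate attribute table.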

    
<b>Data Lakes:</b> A data lake is a central repository that stores all the raw data in its original format. This allows data engineers to store large volumes of unstructured data, such as log files and social media posts, and then transform the data as needed for analysis.
Examples include <code>Amazon S3, Azure Data Lake Storage, and Google Cloud Storage.</code><br>
    
Data Lakes are suitable for storing:
    
<i>Web logs:</i> These are records of user activity on websites or applications, including information about pages visited, time spent on each page, and user interactions. Web logs can be stored in data lakes for analysis to gain insights into user behavior, identify trends, and optimize user experience.
    
<i>Social media data:</i> Data from social media platforms like Twitter, Facebook, and Instagram can be stored in data lakes for analysis. This can include information on user engagement, sentiment analysis, and demographic information.
    
<i>Sensor data:</i> Data generated from Internet of Things (IoT) devices and sensors can be stored in data lakes for real-time analysis. This can include data from smart homes, smart cities, and industrial sensors used in manufacturing and logistics.
    
<i>Clickstream data:</i> This includes data on how users interact with websites and online services, such as clicks, searches, and purchases. Clickstream data can be stored in data lakes for analysis to optimize user experience, identify trends, and improve marketing campaigns.
    
<i>Log files:</i> This includes system log files, application log files, and network log files, which record system events, errors, and user activity. Log files can be stored in data lakes for analysis to improve system performance, identify security issues, and troubleshoot problems.<br><br>
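
A data lake is often laid out as folders of raw files partitioned by date. The sketch below imitates that layout on the local disk: a temporary directory stands in for a bucket, and the `raw/clickstream/dt=YYYY-MM-DD` path convention and the sample events are assumptions, not a requirement of any particular service.

```python
import json
import tempfile
from pathlib import Path

# A temporary directory stands in for a cloud bucket (hypothetical layout)
lake_root = Path(tempfile.mkdtemp())

events = [
    {'ts': '2023-03-10T14:19:49', 'user': 'u1', 'action': 'click'},
    {'ts': '2023-03-10T14:20:03', 'user': 'u2', 'action': 'search'},
    {'ts': '2023-03-11T09:00:00', 'user': 'u1', 'action': 'purchase'},
]

# Write each event unchanged into a folder derived from its date
for event in events:
    day = event['ts'][:10]
    partition = lake_root / 'raw' / 'clickstream' / f'dt={day}'
    partition.mkdir(parents=True, exist_ok=True)
    with open(partition / 'events.jsonl', 'a', encoding='utf-8') as f:
        f.write(json.dumps(event) + '\n')

# A later job reads only the partition it needs
wanted = lake_root / 'raw' / 'clickstream' / 'dt=2023-03-10' / 'events.jsonl'
day_events = [
    json.loads(line)
    for line in wanted.read_text(encoding='utf-8').splitlines()
]
print(len(day_events), "events on 2023-03-10")
```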
    
<b>Data warehouses:</b> These are centralized repositories that store data from various sources in a structured format, optimized for querying and analysis. Data warehouses are typically used for business intelligence and analytics purposes.
Examples include <code>Amazon Redshift, Azure Synapse Analytics (formerly Azure SQL Data Warehouse), and Snowflake.</code><br>
    
Data warehouses are suitable for data that is high-volume, structured, historical, and relevant to business analysis and decision-making.<br><br>
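
Warehouse tables are commonly organized as fact tables joined to dimension tables and queried with aggregations. The sketch below shows that query shape with SQLite purely for convenience; SQLite is not a data warehouse, and the table names `fact_sales` and `dim_product` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (
        sale_date  TEXT,
        product_id INTEGER REFERENCES dim_product(product_id),
        revenue    REAL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales VALUES
        ('2023-01-15', 1, 20.0),
        ('2023-01-20', 2, 35.0),
        ('2023-02-02', 1, 15.0);
""")

# Typical analytical query: revenue per month and category
report = conn.execute("""
    SELECT substr(f.sale_date, 1, 7) AS month,
           d.category,
           SUM(f.revenue) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY month, d.category
    ORDER BY month, d.category
""").fetchall()
print(report)
```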

<b>Object Storage:</b> Object storage is a data storage architecture that manages data as objects rather than files or blocks. Object storage is highly scalable and can store large volumes of unstructured data, such as images and video.
Examples include <code>Amazon S3, Microsoft Azure Blob Storage, and Google Cloud Storage.</code><br>
    
Object Storage is suitable for:
    
<i>Media files:</i> Images, audio files, and videos are examples of media files that are ideal for object storage. These files can be of various sizes, and they are best accessed through a unique identifier.
    
<i>Backup data:</i> Backup data often includes large volumes of data that need to be stored securely and retrieved quickly when needed. Object storage provides efficient backup and disaster recovery solutions.
    
<i>Log files:</i> System logs, application logs, and access logs are examples of log files that are suitable for object storage. These files are typically generated in large volumes, and object storage can efficiently handle the storage and retrieval of these files.
    
<i>IoT data:</i> Internet of Things (IoT) devices generate vast amounts of unstructured data that require object storage solutions. Object storage allows for the easy management and analysis of this data.
    
<i>Cloud-native applications:</i> Cloud-native applications are designed to run in the cloud environment, and object storage is an essential component of these applications. Object storage enables cloud-native applications to store and retrieve data easily and efficiently.
    
Overall, object storage is an ideal solution for unstructured data that needs to be stored, managed, and accessed easily and efficiently.<br><br>
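
The core idea of object storage, a flat namespace of keys that map to a payload plus metadata, can be sketched in a few lines. The class `InMemoryObjectStore` below is a toy written for this explanation and does not mirror the API of Amazon S3 or any other service.

```python
import hashlib


class InMemoryObjectStore:
    """Toy object store: flat namespace of key -> (bytes, metadata)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        # Store the payload with user metadata and a checksum for integrity checks
        checksum = hashlib.sha256(data).hexdigest()
        self._objects[key] = (data, {**metadata, 'sha256': checksum, 'size': len(data)})
        return checksum

    def get(self, key):
        # Return the (data, metadata) pair stored under the key
        return self._objects[key]

    def list_keys(self, prefix=''):
        # Keys are flat strings; "folders" are only a naming convention
        return sorted(k for k in self._objects if k.startswith(prefix))


store = InMemoryObjectStore()
store.put('images/2023/logo.png', b'\x89PNG...', content_type='image/png')
store.put('backups/db.dump', b'dump-bytes', content_type='application/octet-stream')

data, meta = store.get('images/2023/logo.png')
print(meta['content_type'], meta['size'])
print(store.list_keys(prefix='images/'))
```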
    
    
Data engineers must choose the right storage system for their project based on several factors, including data type, data volume, access patterns, and performance requirements. They also need to ensure that the data is stored securely, with appropriate backups and disaster recovery measures in place. Additionally, data engineers must be able to design and implement efficient data retrieval and processing mechanisms that can work with the chosen storage system.