# What is Data Engineering?

Data engineering is, in essence, a more glamorous term for <code>ETL/ELT</code> practices. Traditional ETL/ELT pipelines were implemented predominantly on-prem. Modern data engineering spans on-prem deployments, cloud-hosted services, and serverless platforms.<br><br>


Data Engineering as a discipline consists of the following stages. <br><br>

<div><font color = 'dodgerblue'><b>1. Data Collection<br>
2. Data Processing<br>
3. Data Storage<br>
4. Data Modeling<br>
5. Data Analysis<br>
6. Data Visualization<br>
7. Data Consumption</b></font></div>


A data pipeline implements all or some of the steps listed above, chained so that the output of one step feeds the next. <br><br>
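
To make the idea of chained steps concrete, here is a minimal sketch of a pipeline as a sequence of functions. The stage names (`collect`, `process`, `store`), the `run_pipeline` helper, and the sample records are all hypothetical and only stand in for real sources and sinks.

```python
def collect():
    # Stand-in for reading from files, databases, or APIs (illustrative data)
    return [{"name": "Ana", "age": "31"}, {"name": "Raj", "age": ""}]

def process(records):
    # Drop records with a missing age and convert the remaining ages to int
    return [{**r, "age": int(r["age"])} for r in records if r["age"]]

def store(records):
    # Stand-in for writing to a database or data lake; here it just returns the records
    return records

def run_pipeline(source, stages):
    # Pull data from the source, then pass it through each stage in order
    data = source()
    for stage in stages:
        data = stage(data)
    return data

result = run_pipeline(collect, [process, store])
print(result)
```

Keeping each step as its own function makes it easy to run only some of the steps, or to swap one implementation for another.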

<font color = 'dodgerblue'><b>1. Data Collection</b></font><br>
The first step in a data pipeline is collecting data from heterogeneous sources such as files, databases, APIs, and streams. It is important to understand how to collect the data securely and how to scale the system as the input volume increases. <br>

As a data engineer, you should be familiar with parsing input schemas and handling files. The schema defines the structure of the data and the types of the fields being ingested. For delimited files, the input schema also includes the column and line delimiters.<br>

For example, in a CSV file the fields are separated by commas and the records by line breaks.

<div><code><font color = 'indigo'>import csv<br>
#Define the input schema<br>
schema = {
    'name': str,
    'age': int,
    'gender': str,
    'city': str,
    'country': str
}<br>
#Open the CSV file
with open('data.csv', newline='') as csvfile:
    # Create a CSV reader object
    reader = csv.DictReader(csvfile)
    # Loop through each row in the CSV file
    for row in reader:
        # Parse the row using the input schema
        parsed_row = {key: schema[key](row[key]) for key in schema}
        # Process the parsed row
        print(parsed_row)</font></code></div><br>
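
Since the delimiter is part of the input schema, the same approach works for files that do not use commas. The sketch below assumes a pipe-delimited input; the sample text and the `read_delimited` helper are hypothetical, and `io.StringIO` stands in for a real file so the example is self-contained.

```python
import csv
import io

# Illustrative pipe-delimited input, standing in for a real file
raw = "name|age|city\nAna|31|Lisbon\nRaj|45|Pune\n"

def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    return list(reader)

rows = read_delimited(raw, "|")
print(rows)
```

Note that every value is still a string at this point; type conversion is a separate step, as in the schema example above.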


Alternatively, you can read input files by inferring the source schema instead of specifying it explicitly.<br><br>

<div><code><font color = 'indigo'>import csv<br>
#Open the CSV file
with open('data.csv', newline='') as csvfile:
    # Create a CSV reader object
    reader = csv.reader(csvfile)
    # Keep track of the header row
    header = next(reader)
    # Loop through each row in the CSV file
    for row in reader:
        # Create a dictionary to hold the parsed row
        parsed_row = {}
        # Loop through each field in the row
        for i, value in enumerate(row):
            # Use the header name as the key; fields beyond the header get a generated name
            key = header[i] if i < len(header) else f'new_field_{i - len(header)}'
            # Infer the type: try int, then float, and fall back to the raw string
            try:
                parsed_row[key] = int(value)
            except ValueError:
                try:
                    parsed_row[key] = float(value)
                except ValueError:
                    parsed_row[key] = value
        # Process the parsed row as needed
        print(parsed_row)</font></code></div><br>


Inferring a dynamic schema, however, can cause issues downstream if the downstream pipelines do not perform the same schema inference. It is usually best to process the data but also to notify both upstream and downstream owners of the change, so they can decide whether it is an error or whether change management is needed to handle the new schema going forward.<br><br>
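
One way to detect such a change is to compare the incoming header with the columns the consumers expect. This is a minimal sketch: `EXPECTED_COLUMNS`, `detect_schema_drift`, and the sample input are hypothetical, and the `print` call stands in for whatever alerting channel your team uses.

```python
import csv
import io

# Columns the downstream consumers expect (hypothetical)
EXPECTED_COLUMNS = ["name", "age", "city"]

def detect_schema_drift(header, expected):
    """Return the columns added to, or missing from, the expected schema."""
    added = [col for col in header if col not in expected]
    missing = [col for col in expected if col not in header]
    return {"added": added, "missing": missing}

# Illustrative input with a new "email" column and no "city" column
raw = "name,age,email\nAna,31,ana@example.com\n"
header = next(csv.reader(io.StringIO(raw)))
drift = detect_schema_drift(header, EXPECTED_COLUMNS)

if drift["added"] or drift["missing"]:
    # In a real pipeline this message would be sent upstream and downstream
    print(f"Schema change detected: {drift}")
```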


<font color = 'dodgerblue'><b>2. Data Processing</b></font><br>
Data processing is the second step in the data engineering pipeline, after data collection. It involves cleaning, transforming, and enriching the data so that it can be used for analysis.

<b>1. Data Cleaning:</b> Data collected from different sources may contain errors, missing values, or outliers that need to be cleaned. Data cleaning involves identifying and correcting errors, filling in missing values, and removing outliers.

Let's assume you are reading a CSV file and the downstream consumers want no missing values in the data. Agree with them on what should replace the missing values. In the example below, we assume the consumers want missing values in numeric columns replaced with the column mean.<br><br>

<div><code><font color = 'indigo'>import pandas as pd<br>
#Read in the data
df = pd.read_csv('data.csv')<br>
#Replace missing values in numeric columns with the column mean
#(numeric_only=True skips text columns, which have no mean)
df.fillna(df.mean(numeric_only=True), inplace=True)</font></code></div><br>
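
Removing outliers, the other part of data cleaning mentioned above, can be sketched in a similar way. The example below assumes the common 1.5 × IQR rule, which is only one of several possible definitions of an outlier; the `age` column and its values are illustrative.

```python
import pandas as pd

# Illustrative data with one obviously wrong value
df = pd.DataFrame({"age": [29, 31, 33, 35, 30, 32, 250]})

# Compute the interquartile range (IQR) of the column
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1

# Keep only rows within 1.5 * IQR of the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = df[df["age"].between(lower, upper)]
print(cleaned)
```

As with missing values, the threshold should be agreed with the consumers, since a value that looks like an outlier may be legitimate.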
    
    
<b>2. Data Transformation:</b> Once the data is cleaned, it needs to be transformed into a format that is suitable for analysis. This can involve converting data types, aggregating data, or splitting data into multiple tables.
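
The three kinds of transformation named above can be sketched with pandas as follows. The `orders` table, its column names, and its values are hypothetical; a real pipeline would read them from the cleaned data.

```python
import pandas as pd

# Illustrative cleaned data in which every column was read as text
orders = pd.DataFrame({
    "order_id": ["1", "2", "3"],
    "customer": ["Ana", "Raj", "Ana"],
    "city": ["Lisbon", "Pune", "Lisbon"],
    "amount": ["10.5", "20.0", "4.5"],
})

# 1. Convert data types
orders["order_id"] = orders["order_id"].astype(int)
orders["amount"] = orders["amount"].astype(float)

# 2. Aggregate: total amount per customer
totals = orders.groupby("customer", as_index=False)["amount"].sum()

# 3. Split into multiple tables: one for customers, one for orders
customers = orders[["customer", "city"]].drop_duplicates().reset_index(drop=True)
order_facts = orders[["order_id", "customer", "amount"]]

print(totals)
print(customers)
```

Splitting the customer attributes into their own table avoids repeating the city on every order row.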