# What is Data Engineering?
Data engineering involves **designing, building, and maintaining** **data infrastructures and platforms**. These infrastructures include databases, big data repositories, and data pipelines for transforming and moving data between systems. A data engineer develops and optimizes these data systems to ensure data is available for analysis.





<center> "The constant increase in data processing speeds and bandwidth, the nonstop invention of new tools for creating, sharing, and consuming data, and the steady addition of new data creators and consumers ensure that data growth continues unabated." — Forbes, 2020 </center>

## Role of a Data Engineer
Data engineers act as the "plumbers of data," ensuring seamless data flow within an organization. Their work enables analysts and scientists to use data effectively by choosing the right databases, storage systems, and cloud architectures.

### Key Responsibilities of a Data Engineer
The overarching responsibility of a data engineer is to provide analytics-ready data to data consumers. Data is analytics-ready when it is:

* **Accurate** – The data is correct and reliable.
* **Efficiently accessible** – Consumers can get the data they need quickly.
* **Compliant with regulations**
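
As a rough illustration of the accuracy and compliance criteria, the sketch below checks a single record before it is released to consumers. The field names, the rules, and the idea that `email` and `phone` are regulated fields are all assumptions made for this example, not part of the course material.

```python
from datetime import date

# Hypothetical rules for an order record; the field names are illustrative only.
REQUIRED_FIELDS = ("order_id", "amount", "order_date")
RESTRICTED_FIELDS = ("email", "phone")  # assumed to be regulated personal data


def check_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Accuracy: required fields are present and values are plausible.
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            problems.append(f"missing {field}")
    amount = record.get("amount")
    if amount is not None and amount < 0:
        problems.append("negative amount")
    order_date = record.get("order_date")
    if order_date is not None and order_date > date.today():
        problems.append("order_date in the future")

    # Compliance: regulated fields should not reach analysts unmasked.
    for field in RESTRICTED_FIELDS:
        if field in record:
            problems.append(f"unmasked {field}")

    return problems


good = {"order_id": 1, "amount": 19.99, "order_date": date(2024, 1, 15)}
bad = {"order_id": 2, "amount": -5.0, "order_date": None, "email": "a@example.com"}

print(check_record(good))  # no problems found
print(check_record(bad))   # three problems found
```

Returning a list of problems rather than a single boolean makes it easier to report why a record was rejected.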

#### At a broad level, data engineers:

✔ **Extract** data from various sources (databases, APIs, logs, IoT devices, etc.)

✔ **Integrate** data from disparate systems into a unified format

✔ **Organize** & **Store** data in repositories like data warehouses, data lakes, or relational/non-relational databases

✔ **Build & Maintain Data Pipelines** to ensure data flows smoothly for analysis

Essentially, Data Engineers **create the infrastructure** that allows Data Analysts and Data Scientists to **access, analyze, and derive insights from data.**
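
The extract and integrate steps listed above can be sketched with the Python standard library alone. The two "sources" here are inline strings standing in for a CSV export and a JSON API response; the source names, field names, and unified schema are hypothetical.

```python
import csv
import io
import json

# Stand-ins for two source systems; real sources would be files, APIs, or databases.
CRM_CSV = "customer_id,full_name,country\n101,Ada Lovelace,UK\n102,Alan Turing,UK\n"
WEB_JSON = '[{"id": 103, "name": "Grace Hopper", "location": {"country": "US"}}]'


def extract_crm(text: str) -> list[dict]:
    """Extract rows from a CSV export."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_web(text: str) -> list[dict]:
    """Extract records from a JSON response."""
    return json.loads(text)


def integrate(crm_rows: list[dict], web_rows: list[dict]) -> list[dict]:
    """Map both source layouts onto one unified schema."""
    unified = []
    for row in crm_rows:
        unified.append({
            "customer_id": int(row["customer_id"]),  # CSV values arrive as strings
            "name": row["full_name"],
            "country": row["country"],
            "source": "crm",
        })
    for row in web_rows:
        unified.append({
            "customer_id": row["id"],
            "name": row["name"],
            "country": row["location"]["country"],  # flatten the nested field
            "source": "web",
        })
    return unified


customers = integrate(extract_crm(CRM_CSV), extract_web(WEB_JSON))
for customer in customers:
    print(customer)
```

Keeping a `source` column in the unified records is one way to preserve lineage once data from different systems is mixed together.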

### Key Skills for Data Engineers

#### Technical Skills

1. **Operating Systems**:
  - UNIX
  - Linux
  - Windows
  - System administration tools

2. **Infrastructure Components**:
  - Virtual machines
  - Networking
  - Application services
  - Cloud platforms (AWS, Google Cloud, IBM, Microsoft Azure)

3. **Databases and Data Warehouses**:
  - RDBMS: IBM Db2, MySQL, Oracle Database, PostgreSQL
  - NoSQL: Redis, MongoDB, Cassandra, Neo4j
  - Data Warehouses: Oracle Exadata, IBM Db2 Warehouse on Cloud, Amazon Redshift

4. **Data Pipelines**: Gather data from multiple sources, transform it into analytics-ready data, and make it available to data consumers.
  - Apache Beam
  - Apache Airflow
  - Google Cloud Dataflow

5. **ETL Tools**:
  - IBM InfoSphere
  - AWS Glue
  - Improvado

6. **Programming Languages**:

  - Query Languages: SQL, SQL-like NoSQL queries
  - General Purpose: Python, R, Java
  - Scripting: Unix/Linux Shell, PowerShell

7. **Big Data Processing**: Essential for handling vast amounts of structured and unstructured data.
  - Hadoop
  - Hive
  - Spark

#### Functional Skills

- Converting business requirements into technical specifications.

- Understanding the complete software development lifecycle.

- Knowledge of business applications of data.

- Awareness of risks in data management (quality, privacy, security, compliance).

#### Soft Skills

- Collaboration with data analysts, scientists, and business users.

- Strong interpersonal communication with both technical and non-technical stakeholders.

- Ability to work in a team-oriented environment.

### Data Roles Comparison
**Data Engineer**: Develops and maintains data architectures, pipelines, and repositories, ensuring data availability, consistency, security, and recoverability.

**Data Analyst**: Inspects, cleans, and analyzes data to derive insights.

**Data Scientist**: Builds predictive models using machine learning and statistical analysis.

**Business Analyst**: Uses data insights to inform business decisions.

**BI Analyst**: Focuses on market forces and external business influences.

## Example Data Engineering Project

Sarah Flinch, a Data Engineer at a multinational hair care company, worked on a project to track customer sentiment for a new shampoo launch.

* The business team wanted real-time insights from social media and eCommerce platforms.

* Data Scientists built a sentiment analysis dashboard prototype with dummy data.

* Sarah’s team collected product-related tweets, posts, and reviews using APIs and web scraping.

* She processed and stored the data in a database, which powered the dashboard.

* To ensure real-time updates, Sarah implemented a data pipeline that automated data extraction, transformation, and loading (ETL).

This system enabled business users to monitor brand perception instantly without manual intervention.
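
A pipeline of this kind could look roughly like the sketch below. It is not Sarah's actual implementation: the canned posts stand in for API calls and web scraping, the word lists are a toy replacement for a real sentiment model, and the table layout is an assumption. The point it illustrates is that repeated, overlapping runs can load data safely when each post has a stable key.

```python
import sqlite3

# Toy word lists; a real project would use a trained sentiment model.
POSITIVE = {"love", "great", "soft"}
NEGATIVE = {"hate", "dry", "awful"}


def extract(batch_number: int) -> list[dict]:
    """Hypothetical stand-in for API calls and web scraping."""
    batches = {
        1: [{"post_id": "t1", "text": "Love this shampoo, great smell"},
            {"post_id": "r1", "text": "Left my hair dry"}],
        2: [{"post_id": "r1", "text": "Left my hair dry"},  # already seen in batch 1
            {"post_id": "t2", "text": "Awful bottle, great shampoo"}],
    }
    return batches.get(batch_number, [])


def transform(post: dict) -> tuple[str, str, int]:
    """Score a post: +1 per distinct positive word, -1 per distinct negative word."""
    words = {w.strip(",.!?").lower() for w in post["text"].split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return post["post_id"], post["text"], score


def load(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Insert rows; the primary key makes overlapping batches harmless."""
    conn.executemany(
        "INSERT OR IGNORE INTO sentiment (post_id, text, score) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def run_pipeline(conn: sqlite3.Connection, batch_number: int) -> None:
    load(conn, [transform(post) for post in extract(batch_number)])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentiment (post_id TEXT PRIMARY KEY, text TEXT, score INTEGER)")
for batch in (1, 2):  # in practice a scheduler would trigger each run
    run_pipeline(conn, batch)

# The dashboard would read from this table.
total_posts, average_score = conn.execute(
    "SELECT COUNT(*), AVG(score) FROM sentiment"
).fetchone()
print(total_posts, average_score)
```

Because the second batch repeats post `r1`, the `INSERT OR IGNORE` clause keeps the table at three rows instead of four.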

## Data Sources

Data originates from **structured** and **unstructured** datasets such as:

* Text, images, videos, clickstreams, user conversations, and social media.

* IoT devices, real-time event streams, legacy databases, and third-party data providers.

## Stages of Data Processing

* **Data Acquisition**: Extracting and importing data from various sources while ensuring security and integrity.

* **Data Preparation**: Organizing, cleaning, and optimizing data for accessibility and compliance.

* **Data Storage**: Managing repositories with high availability, flexibility, and security.

* **Data Access & Consumption**: Making data available through APIs, reports, dashboards, and analytical tools.
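
The four stages can be read as four small functions chained together. The sketch below uses made-up sensor readings and an in-memory SQLite database; the function names, field names, and cleaning rules are assumptions chosen for illustration.

```python
import sqlite3


def acquire() -> list[dict]:
    """Acquisition: hypothetical raw readings as they might arrive from IoT devices."""
    return [
        {"device": " sensor-a ", "temp_c": "21.5"},
        {"device": "sensor-b", "temp_c": "n/a"},  # unusable value
        {"device": "SENSOR-A", "temp_c": "22.5"},
    ]


def prepare(raw: list[dict]) -> list[tuple[str, float]]:
    """Preparation: normalize device names, parse numbers, drop unusable rows."""
    clean = []
    for row in raw:
        try:
            temp = float(row["temp_c"])
        except ValueError:
            continue  # skip readings that cannot be parsed
        clean.append((row["device"].strip().lower(), temp))
    return clean


def store(conn: sqlite3.Connection, rows: list[tuple[str, float]]) -> None:
    """Storage: persist prepared rows in a repository."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (device TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings (device, temp_c) VALUES (?, ?)", rows)
    conn.commit()


def average_temperature(conn: sqlite3.Connection) -> dict[str, float]:
    """Access: expose an aggregate that a report or dashboard could consume."""
    query = "SELECT device, AVG(temp_c) FROM readings GROUP BY device ORDER BY device"
    return dict(conn.execute(query).fetchall())


conn = sqlite3.connect(":memory:")
store(conn, prepare(acquire()))
report = average_temperature(conn)
print(report)
```

Separating the stages this way lets each one be tested and replaced independently, for example swapping the in-memory database for a data warehouse without touching the preparation logic.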

# Module 1 Practice Quiz (1)




1.) Which emerging technology has made it possible for every enterprise to have access to limitless storage and high-performance computing?

<details>
  <summary>ANSWER</summary>
    
    ↪ Cloud Computing

Cloud technologies have made it possible for every enterprise, regardless of its size, to have access to limitless storage and high-performance computing at nominal costs.
</details>

2.) Which of the data roles is responsible for extracting, integrating, and organizing data into data repositories?

<details>
  <summary>ANSWER</summary>
    
    ↪ Data Engineers

Data Engineers are responsible for extracting, integrating, and organizing data into data repositories.
</details>

3.) The field of data engineering concerns itself with the mechanics for the flow and access of data. What captures the goal of data engineering?

<details>
  <summary>ANSWER</summary>
    
    ↪ Make quality data available for fact-finding and business decision-making

  Data engineering is the process of collecting raw data and converting it into analytics-ready data by cleaning, transforming, and preparing data so that it is reliable.
</details>

# Module 1 Practice Quiz (2)

1.) Which one of these skills is essential to the role of a Data Engineer?

<details>
  <summary>ANSWER</summary>
    
    ↪ To set up and manage the infrastructure required for the ingestion, processing, and storage of data.

  Data Engineers are responsible for setting up and managing the infrastructure required for ingesting raw data, processing it, and storing it so that it is available for analytics.
</details>

2.) What, according to Sarah Flinch, needs to be tracked and analyzed in order to keep business updated on the overall sentiment of the consumers?

<details>
  <summary>ANSWER</summary>
    
    ↪ Social media posts, customer reviews and ratings on eCommerce platforms, and product reviews on blogging sites.

 How a product gets talked about on social media, eCommerce platforms, and blogging sites has an immediate impact on sales numbers and brand perception.
</details>

# Module 1 Graded Quiz

1.) Data Engineers work within the data ecosystem to:

    A. Analyze data for actionable insights

    B. Analyze data for deriving insights

    C. Provide business intelligence solutions by monitoring data on different business functions

    D. Develop and maintain data architectures

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. Develop and maintain data architectures.

 One of the responsibilities of a Data Engineer in a data ecosystem is to develop and maintain data architectures so that data is available for business operations and analysis.
</details>

2.) The goal of data engineering is to make quality data available for fact-finding and decision-making. Which one of these statements captures the process of data engineering?

    A. Processing data and making it available to users securely
    
    B. Collecting, processing, and making data available to users securely
    
    C. Collecting, processing, and storing data
    
    D. Collecting, processing, storing, and making data available to users securely

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. Collecting, processing, storing, and making data available to users securely.

 Data engineering includes the collection of data from disparate sources, processing data so that it is usable, storing processed data, and making it available to users securely.
</details>

3.) Data extracted from disparate sources can be stored in:

    A. Data Lakes only
    
    B. Databases only
    
    C. Data Warehouses only
    
    D. Databases, data warehouses, data lakes, or any other type of data repository

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. Databases, data warehouses, data lakes, or any other type of data repository.

 Data extracted from multiple sources can be stored in any type of data repository, such as databases, data warehouses, and data lakes.
</details>

4.) From the provided list, select the three emerging technologies that are shaping today’s data ecosystem.

    A. Big Data, Internet of Things, and Dashboarding
    
    B. Cloud Computing, Internet of Things, and Dashboarding
    
    C. Machine Language, Cloud Computing, and Internet of Things
    
    D. Cloud Computing, Machine Learning, and Big Data

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. Cloud Computing, Machine Learning, and Big Data.

 Emerging technologies such as Cloud Computing, Machine Learning, and Big Data are shaping today’s data ecosystem and its possibilities.
</details>

5.) Oracle Exadata, IBM Db2 Warehouse on Cloud, IBM Netezza Performance Server, and Amazon Redshift are some of the popular __________________ in use today.

    A. Big Data Platforms
    
    B. ETL Tools
    
    C. NoSQL Databases
    
    D. Data Warehouses

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. Data Warehouses.

 These are some of the popularly used data warehouses.
</details>

6.) Data Engineers manage the infrastructure required for the ingestion, processing, and storage of data.

    A. True
    
    B. False

  <details>
  <summary>ANSWER</summary>
    
    ↪ A. True.

 This is one of the primary responsibilities of a Data Engineer.
</details>

7.) To ensure business stakeholders can see real-time data each time they log into the dashboard, Sarah decided to build _______________ to extract, transform, and load data on an ongoing basis.

    A. A sentiment analysis algorithm
    
    B. APIs
    
    C. A Python program
    
    D. A Data Pipeline

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. A Data Pipeline.

 Data pipelines cover the journey of data from source to destination systems which include extracting, transforming, and loading data.
</details>