# The Data Engineering Ecosystem






A data engineer’s ecosystem consists of infrastructure, tools, frameworks, and processes for managing the flow of data. This includes:

* **Data Extraction & Storage**: Collecting data from various sources (databases, APIs, web services, etc.) and storing it in appropriate repositories.

* **Data Categorization**: Data is classified as structured (organized in databases), semi-structured (emails, JSON), or unstructured (images, videos, social media content).

* **Data Repositories**:
  * **Transactional (OLTP)** – Handles real-time, high-volume operational data (e.g., banking, airline bookings).
  * **Analytical (OLAP)** – Designed for complex analysis (e.g., data warehouses, data lakes).
* **Data Integration & Pipelines**: Data is processed, cleansed, and transformed using Extract-Transform-Load (ETL) or Extract-Load-Transform (ELT) processes.

* **Programming & Querying**: SQL for querying, Python for application development, and shell scripting for automation.

* **BI & Reporting Tools**: These tools, used mainly by analysts, help visualize and present data via dashboards, but they are managed by data engineers.

* **Automation & Optimization**: Various frameworks and processes streamline data workflows, making data more accessible and efficient for business use.

Overall, a data engineer’s ecosystem is diverse and complex, supporting the seamless movement, transformation, and analysis of data.

## Types of Data

Data is unorganized information that gains meaning through processing. It can be categorized into three types based on its structure:

* **Structured Data**: Highly organized, follows a defined schema, and is stored in relational databases (SQL). Examples include spreadsheets, online forms, GPS data, and transaction logs. It can be easily analyzed using standard tools.

* **Semi-Structured Data**: Has some organization but lacks a fixed schema. It includes metadata and hierarchical structures like XML, JSON, emails, and zipped files. It is often used for data integration and exchange.

* **Unstructured Data**: Lacks a predefined format or schema, so it does not fit naturally into the rows and columns of a relational database. Examples include social media feeds, images, videos, PDFs, and web pages. It is often stored in files, document stores, or NoSQL databases and analyzed with specialized tools.

In summary, structured data is highly organized, semi-structured data relies on metadata, and unstructured data is freeform and complex.

## File Formats
As a data professional, understanding different file formats helps in choosing the best option for data storage and processing. Common formats include:

* **Delimited Text Files (CSV, TSV)**: Store data in plain text with values separated by delimiters (e.g., commas, tabs). They are widely supported and flexible.

* **XLSX (Excel Open XML Spreadsheet)**: An XML-based spreadsheet format that supports multiple worksheets. Unlike the macro-enabled `.xlsm` variant, an `.xlsx` file cannot contain VBA macros, which makes it a comparatively safe format to exchange.

* **XML (Extensible Markup Language)**: A markup language for encoding data, readable by both humans and machines. It facilitates platform-independent data sharing.

* **PDF (Portable Document Format)**: Developed by Adobe, it ensures documents appear the same across devices and is commonly used for legal and financial documents.

* **JSON (JavaScript Object Notation)**: A widely used, language-independent format for transmitting structured data, especially in APIs and web services.

Each format has its own strengths and limitations, making it crucial to select the right one based on data structure and performance needs.
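
As a rough illustration of how these formats differ in practice, the sketch below reads the same record from CSV, JSON, and XML using only the Python standard library. The record and its field names (`id`, `name`, `city`) are made up for this example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in three of the formats listed above.
csv_text = "id,name,city\n1,Ada,London\n"
json_text = '{"id": 1, "name": "Ada", "city": "London"}'
xml_text = "<customer><id>1</id><name>Ada</name><city>London</city></customer>"

# CSV: every value arrives as a string; types must be restored by the reader.
csv_record = next(csv.DictReader(io.StringIO(csv_text)))

# JSON: numbers, strings, and nesting are preserved by the format itself.
json_record = json.loads(json_text)

# XML: values are element text, so they are strings as well.
root = ET.fromstring(xml_text)
xml_record = {child.tag: child.text for child in root}

print(csv_record["id"], type(csv_record["id"]).__name__)
print(json_record["id"], type(json_record["id"]).__name__)
print(xml_record["id"], type(xml_record["id"]).__name__)
```

Note that only JSON keeps `id` as a number here; with CSV and XML the consumer has to know (or guess) the intended types.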

## Data Sources

Data professionals often work with a variety of data sources, including relational databases, NoSQL databases, big data repositories, and streaming data. While relational databases remain widely used due to their flexibility and durability, newer data types—such as logs, JSON, and XML—have driven the adoption of NoSQL databases like Cassandra and HBase, which are better suited for high-write applications like IoT and social media.

Data sources today are more dynamic and diverse than ever. Common sources include:

* **Relational Databases (SQL Server, MySQL, Oracle, IBM DB2)**: Used by organizations for structured data storage in applications like customer transactions, HR, and workflows.

* **Flat Files & XML Datasets**: Flat files (CSV, spreadsheets) store data in tabular formats, while XML supports hierarchical structures, often used for surveys and bank statements.

* **APIs & Web Services**: Enable data access from sources like social media (Twitter, Facebook), stock markets, and validation services. APIs return data in formats like JSON, XML, or plain text.

* **Web Scraping**: Extracts data from websites for price comparisons, sales leads, forum data, and machine learning datasets. Popular tools include BeautifulSoup, Scrapy, and Selenium.

* **Data Streams & Feeds**: Real-time data flows from IoT devices, GPS, stock tickers, retail transactions, and social media. Processing tools include Apache Kafka, Spark Streaming, and Storm.

* **RSS Feeds**: Capture continuously updated data from news sites and forums for real-time tracking.

These sources provide valuable data for analytics, decision-making, and business intelligence.
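
To make the web scraping source more concrete, here is a minimal sketch that pulls product names and prices out of an HTML fragment. It deliberately uses the standard library's `html.parser` rather than BeautifulSoup or Scrapy so that it runs without extra installs; the HTML, the CSS class names, and the `PriceParser` class are all invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this HTML first.
PAGE = """
<ul>
  <li class="product"><span class="name">Kettle</span> <span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span> <span class="price">31.50</span></li>
</ul>
"""

class PriceParser(HTMLParser):
    """Collects names and prices from <span class="name"> / <span class="price">."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # which span we are inside, if any
        self.names = []
        self.prices = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_data(self, data):
        if self.current_field == "name":
            self.names.append(data.strip())
        elif self.current_field == "price":
            self.prices.append(float(data))

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None

parser = PriceParser()
parser.feed(PAGE)
products = list(zip(parser.names, parser.prices))
print(products)
```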


#### Working with varied data formats presents unique challenges:

* **Log data** is unstructured and often requires custom tools for parsing.

* **XML** was widely used but is resource-intensive to parse and transmit because of its verbose tags. JSON, which is more compact, has become the preferred format for RESTful APIs.

* **Apache Avro** is gaining popularity for its efficient data storage and serialization.

* **Character encoding and delimiters** can cause issues when migrating data between databases. Finding suitable delimiters in datasets containing special characters can be difficult, requiring adaptable parsing strategies.
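
The delimiter problem is easy to reproduce. The sketch below (with made-up rows) shows a field that contains both a comma and quotation marks: naive splitting on the delimiter miscounts the fields, while Python's `csv` module, which quotes and unquotes fields, round-trips the data intact.

```python
import csv
import io

# Hypothetical rows: the second field itself contains a comma and quotes.
rows = [["id", "comment"], ["1", 'Late, but "fine"'], ["2", "On time"]]

# Writing with the csv module quotes fields that contain the delimiter.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
exported = buffer.getvalue()

# Naive splitting breaks the first data row into too many pieces...
naive = exported.splitlines()[1].split(",")

# ...while a quote-aware reader recovers the original two fields.
parsed = list(csv.reader(io.StringIO(exported)))

print(len(naive))  # field count from naive splitting
print(parsed[1])   # field values from csv.reader
```

This is one possible strategy; datasets exported by other systems may use different quoting or escaping rules, which is why adaptable parsing is needed.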

## Languages for Data Professionals

Languages essential for data professionals can be categorized into **query languages**, **programming languages**, and **shell scripting**.

* **SQL**: A query language primarily used for relational databases, enabling data retrieval, updates, and stored procedures. It is widely adopted due to its portability, simple syntax, and efficiency in handling large datasets. However, challenges arise when working across multiple database vendors, versions, and performance constraints. While moving data once may be straightforward, maintaining efficient, continuous data transfers requires flexibility and problem-solving.

* **Python**: A high-level, open-source programming language known for its readability and versatility. It offers extensive libraries for data analysis (NumPy, Pandas), visualization (Matplotlib, Seaborn), web scraping (BeautifulSoup, Scrapy), and machine learning. Python is widely used for handling large datasets, statistical analysis, and automation.

* **R**: A statistical computing language designed for data visualization and analysis, featuring powerful libraries like ggplot2 and Plotly. It supports both structured and unstructured data and is particularly useful for analytics, interactive reporting, and research applications.

* **Java**: A platform-independent, object-oriented language used in big data frameworks like Hadoop and Spark. It is well-suited for high-performance, scalable data solutions and large-scale processing tasks.

* **Shell Scripting (Unix/Linux Shell, PowerShell)**: Automates repetitive tasks such as file manipulation, system administration, and backups. PowerShell, a Microsoft tool, is particularly effective for handling structured data formats (JSON, CSV, XML) and integrating with APIs.

Each language serves a unique role in data management, analysis, and automation, making proficiency in at least one from each category crucial for data professionals.

## Data Repositories

A **data repository** is a system used to collect, organize, and store data for business operations, reporting, and analysis. It can range from small to large database infrastructures. Key types of data repositories include:

1. **Databases**: Collections of data for storage, search, retrieval, and modification. Managed by a **Database Management System (DBMS)**, which allows users to query and extract information (e.g., finding inactive customers). Databases are classified as:

   * **Relational Databases (RDBMS)**: Organize data in tables (rows and columns) with defined structure, optimized for querying large datasets using SQL.

   * **Non-Relational Databases (NoSQL)**: Designed for speed, flexibility, and scale, storing data without a strict schema, often used for big data and processing diverse data types.

2. **Data Warehouses**: Centralized repositories that consolidate data from various sources through the **ETL (Extract, Transform, Load)** process for business intelligence and analytics. This process helps clean and prepare data for analysis.

3. **Data Marts and Data Lakes**: Variants of data repositories, with data lakes storing raw, unprocessed data, and data marts focused on specific business areas.

4. **Big Data Stores**: Infrastructure for storing and processing large datasets, using distributed computational and storage systems.

Data repositories support efficient reporting, analytics, and archiving, ensuring data is easily accessible and well-organized.

## Relational Databases (RDBMS)

A **relational database** organizes data into tables with rows (records) and columns (attributes). These tables can be linked based on common data, allowing for efficient querying and retrieval of related information. For example, a **customer table** might be linked to a **transaction table** using a **Customer ID** to consolidate customer data and transactions.
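
A minimal sketch of the customer/transaction example, using Python's built-in `sqlite3` module with an in-memory database. The table layout, column names, and sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        customer_id    INTEGER NOT NULL REFERENCES customers(customer_id),
        amount         REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO transactions VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5);
""")

# Join the two tables on the shared Customer ID and total each customer's spend.
totals = conn.execute("""
    SELECT c.name, COUNT(t.transaction_id), SUM(t.amount)
    FROM customers AS c
    JOIN transactions AS t ON t.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(totals)
conn.close()
```

Each customer's name is stored once in `customers`; the transactions only carry the ID, which is what keeps redundancy low.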

Relational databases use **SQL** for querying and processing data, which makes them ideal for large datasets and minimizing redundancy. They allow for data consistency and integrity through structured relationships and data types. Unlike spreadsheets, relational databases support vast volumes of data and provide fast retrieval and processing capabilities.

Key benefits include:

* **Flexibility**: SQL allows changes to the structure while maintaining performance.

* **Reduced redundancy**: Data is stored in linked tables to minimize duplication.

* **Backup and recovery**: Easy export/import options and cloud-based continuous mirroring ensure quick recovery.

* **ACID compliance**: Ensures reliability and consistency of data transactions.
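
To illustrate what ACID compliance buys in practice, the sketch below shows atomicity with `sqlite3`: a transfer between two accounts either applies both updates or neither. The `accounts` table, the `transfer` helper, and the balances are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance REAL NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # CHECK constraint failed; the credit above was undone

ok = transfer(conn, "alice", "bob", 30.0)       # valid transfer
failed = transfer(conn, "alice", "bob", 500.0)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer, bob is credited first and the debit then violates the `CHECK` constraint; the rollback removes the credit as well, so the balances stay consistent.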

Relational databases are used for **Online Transaction Processing (OLTP), data warehousing**, and **IoT solutions**, though they struggle with **semi-structured** or **unstructured data**. Popular examples include **IBM DB2, MySQL, Oracle Database**, and cloud solutions like **Amazon RDS**.

Despite limitations, relational databases remain the primary solution for managing structured data, offering reliability, scalability, and well-established support.

## NoSQL

**NoSQL** (Not Only SQL) is a type of non-relational database that offers flexible schemas for storing and retrieving data. It has become popular due to its scalability, performance, and ease of use, especially in cloud, big data, and high-volume applications. Unlike relational databases, NoSQL databases don't rely on a traditional row/column/table structure with fixed schemas, allowing for schema-less or free-form data storage. They support structured, semi-structured, and unstructured data.

There are four main types of NoSQL databases:

1. **Key-value store**: Stores data as key-value pairs. It’s efficient for real-time recommendations and caching, but less suited for complex queries. Examples: Redis, DynamoDB.

2. **Document-based**: Stores data in documents (e.g., JSON), allowing flexible indexing and powerful ad hoc queries. Best for eCommerce and CRM platforms. Examples: MongoDB, CouchDB.

3. **Column-based**: Stores data in cells grouped into columns (and column families) rather than rows, which speeds up access to and searches over related columns. Ideal for systems with heavy write loads. Examples: Cassandra, HBase.

4. **Graph-based**: Represents data as nodes and relationships, excellent for analyzing connected data (e.g., social networks or fraud detection). Examples: Neo4j, CosmosDB.
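
The four models can be pictured with plain Python data structures. This is only a conceptual sketch: no database is involved, none of this reflects the actual APIs of Redis, MongoDB, Cassandra, or Neo4j, and all keys, fields, and the `followers_of` helper are invented for illustration.

```python
# Plain-Python stand-ins for the four models; no database is involved.

# 1. Key-value: an opaque value looked up by a single key.
key_value = {"session:42": "user=ada;cart=3"}

# 2. Document: a self-describing, nested record (JSON-like).
document = {"_id": 1, "name": "Ada", "orders": [{"sku": "K-1", "qty": 2}]}

# 3. Column-based: each row key maps to column families holding sparse columns.
wide_column = {
    "sensor-7": {"readings": {"t0": 21.5, "t1": 21.7}, "meta": {"site": "north"}},
}

# 4. Graph: nodes plus labeled edges between them.
edges = [("ada", "FOLLOWS", "grace"), ("grace", "FOLLOWS", "linus"),
         ("ada", "FOLLOWS", "linus")]

def followers_of(person):
    """Return everyone with a FOLLOWS edge pointing at `person`."""
    return sorted(src for src, rel, dst in edges
                  if rel == "FOLLOWS" and dst == person)

print(key_value["session:42"])
print(document["orders"][0]["qty"])
print(wide_column["sensor-7"]["readings"]["t1"])
print(followers_of("linus"))
```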

##### Advantages of NoSQL:

* Handles large volumes of diverse data types (structured, semi-structured, unstructured).
* Scalable across multiple data centers, leveraging cloud infrastructure.
* Cost-effective, with a scale-out architecture that adds capacity and performance by adding nodes.
* Agile and flexible design for quicker iteration.

##### Key differences between relational and non-relational databases:

* **RDBMS**: Uses rigid schemas, provides ACID compliance for reliability, and is typically more expensive to maintain.

* **NoSQL**: Schema-agnostic, supports various data types, and is optimized for low-cost hardware and distributed systems.

NoSQL is well-suited for modern, mission-critical applications and continues to grow in popularity despite being a newer technology compared to relational databases.

## Data Warehouses, Data Marts, and Data Lakes

Data mining repositories like **data warehouses**, **data marts**, and **data lakes** all aim to store data for reporting, analysis, and insight extraction, but they differ in purpose, data types, and access methods.

* **Data Warehouse**: A central repository integrating data from multiple sources, designed to store cleansed, conformed, and categorized data for analysis. Data warehouses typically have a three-tier architecture (database servers, OLAP server, client front-end) and are moving to the cloud for benefits like lower costs, scalable storage, and faster recovery. They are used for large volumes of operational data, with examples including Teradata and Snowflake.

* **Data Mart**: A subset of a data warehouse, focusing on a specific business function (e.g., sales or finance). There are dependent, independent, and hybrid data marts, each varying in data source and transformation processes. Data marts provide efficient access to relevant data and faster decision-making.

* **Data Lake**: A repository that stores large amounts of structured, semi-structured, and unstructured data in its raw, native format without needing to define a structure beforehand. It offers flexibility and scalability, allowing data to be transformed as needed for specific use cases. Data lakes can be built using cloud storage or distributed systems like Hadoop. They are beneficial for storing diverse data types and scaling from terabytes to petabytes, with vendors like Amazon and Microsoft providing relevant platforms.

Each repository type has specific benefits, and their selection depends on the organization's use case and technology infrastructure.
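
One way to picture a dependent data mart is as a filtered slice of the warehouse. The sketch below models that with a SQL view in an in-memory `sqlite3` database. Real data marts are usually separate physical tables or databases, so treat the view, the `warehouse_facts` table, and the sample rows as simplifying assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for a warehouse fact table covering every department.
    CREATE TABLE warehouse_facts (
        department TEXT NOT NULL,
        metric     TEXT NOT NULL,
        value      REAL NOT NULL
    );
    INSERT INTO warehouse_facts VALUES
        ('sales',   'revenue', 1200.0),
        ('sales',   'refunds',   80.0),
        ('finance', 'payroll',  900.0);

    -- A dependent data mart: only the sales slice, derived from the warehouse.
    CREATE VIEW sales_mart AS
        SELECT metric, value FROM warehouse_facts WHERE department = 'sales';
""")

sales_rows = conn.execute(
    "SELECT metric, value FROM sales_mart ORDER BY metric").fetchall()
print(sales_rows)
conn.close()
```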

## ETL, ELT, and Data Pipelines

ETL and ELT focus on data transformation and loading processes, while data pipelines manage the overall movement and processing of data across systems.

1. **ETL (Extract, Transform, Load)**:

   * **Extract**: Collect data from source systems using batch processing (large chunks at scheduled intervals) or stream processing (real-time data).
   * **Transform**: Clean and format data for analysis, such as standardizing units, removing duplicates, and applying business rules.
   * **Load**: Move the processed data to a repository, with options for initial loading, incremental updates, or full refreshes. Load verification ensures data integrity.
   * **ETL** is used for large-scale, batch workloads and is supported by tools like IBM InfoSphere, AWS Glue, and Informatica PowerCenter.

2. **ELT (Extract, Load, Transform)**:

   * In ELT, data is extracted and loaded into the destination system (usually a data lake) before transformations are applied.
   * ELT is suited for processing large, unstructured data and offers flexibility for data scientists, especially in Big Data environments. It is ideal for data lakes, allowing quick ingestion of raw data and later transformation for analysis.
   * Because raw data is loaded without waiting for transformation, ELT typically shortens the time between extraction and delivery and supports more flexible, exploratory analytics.

3. **Data Pipelines**:

   * Data pipelines refer to the complete journey of moving data between systems, which may include both batch and streaming data.
   * Data pipelines support real-time data processing, and destinations include data lakes, applications, or visualization tools. Popular solutions include Apache Beam, Airflow, and Dataflow.
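
The difference between ETL and ELT can be sketched in a few lines. In the example below, the ETL path transforms rows in application code before loading, while the ELT path loads the raw rows first and transforms them with SQL inside the destination. The source data, the function names, and the table names are hypothetical, and `sqlite3` stands in for a real warehouse or data lake.

```python
import csv
import io
import sqlite3

# Hypothetical source extract: mixed units and one duplicate row.
SOURCE = "order_id,weight,unit\n1,2.0,kg\n2,500,g\n2,500,g\n3,1.5,kg\n"

def extract(text):
    """Extract: read raw rows from the source (a CSV string here)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: standardize weights to kilograms and drop duplicates."""
    seen, clean = set(), []
    for row in rows:
        order_id = int(row["order_id"])
        if order_id in seen:
            continue
        seen.add(order_id)
        weight = float(row["weight"])
        if row["unit"] == "g":
            weight /= 1000.0
        clean.append((order_id, weight))
    return clean

def load(conn, rows):
    """Load: write processed rows, then verify the row count."""
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, weight_kg REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()
    (count,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
    assert count == len(rows), "load verification failed"

# ETL: transform in application code, then load the finished rows.
etl_db = sqlite3.connect(":memory:")
load(etl_db, transform(extract(SOURCE)))
etl_result = etl_db.execute("SELECT * FROM orders ORDER BY order_id").fetchall()

# ELT: load the raw rows untouched, then transform inside the destination.
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_orders (order_id TEXT, weight TEXT, unit TEXT)")
elt_db.executemany("INSERT INTO raw_orders VALUES (:order_id, :weight, :unit)",
                   extract(SOURCE))
elt_result = elt_db.execute("""
    SELECT DISTINCT CAST(order_id AS INTEGER) AS order_id,
           CASE unit WHEN 'g' THEN CAST(weight AS REAL) / 1000.0
                     ELSE CAST(weight AS REAL) END AS weight_kg
    FROM raw_orders
    ORDER BY order_id
""").fetchall()

print(etl_result)
print(elt_result)
```

Both paths end with the same cleaned rows; what differs is where the transformation runs and whether the raw data is kept in the destination for later reuse.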

## Data Integration Platforms: Key Insights
### Definition & Purpose
Data integration combines disparate data into a unified view for analytics, business operations, and decision-making.

It involves:
* Extracting, transforming, and merging data
* Ensuring data quality and governance
* Enabling seamless access and analysis
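
As a small illustration of a "unified view", the sketch below merges two sources that describe the same customers with different field names and units. The source names (`crm_rows`, `billing_rows`), their fields, and the `integrate` function are assumptions invented for this example, not part of any particular integration platform.

```python
# Two hypothetical sources describing the same customers in different shapes.
crm_rows = [
    {"CustomerID": "C1", "FullName": "Ada Lovelace"},
    {"CustomerID": "C2", "FullName": "Grace Hopper"},
]
billing_rows = [
    {"cust": "C1", "balance_cents": 1250},
    {"cust": "C3", "balance_cents": 400},
]

def integrate(crm, billing):
    """Merge both sources into one record per customer ID."""
    unified = {}
    for row in crm:
        unified[row["CustomerID"]] = {"name": row["FullName"], "balance": None}
    for row in billing:
        record = unified.setdefault(row["cust"], {"name": None, "balance": None})
        record["balance"] = row["balance_cents"] / 100  # normalize to currency units
    return unified

unified_view = integrate(crm_rows, billing_rows)
for customer_id in sorted(unified_view):
    print(customer_id, unified_view[customer_id])
```

Customers that appear in only one source keep `None` for the missing field, which makes gaps visible to data-quality checks instead of silently dropping records.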

### Relation to ETL & Data Pipelines
* **ETL** is a subset of data integration focused on structured transformation.
* **Data Pipelines** handle the entire journey of data movement and can include integration processes.

### Key Features of Modern Data Integration Platforms
* **Pre-built connectors** for databases, APIs, social media, CRM, and ERP systems
* **Support for batch and streaming data processing**
* **Integration with Big Data sources**
* **Cloud flexibility** (single cloud, multi-cloud, hybrid)
* **Data governance, security, and compliance features**

### Popular Data Integration Tools & Platforms
* **Enterprise Solutions**: IBM (Cloud Pak, DataStage), Talend (Data Fabric, Open Studio), SAP, Oracle, Microsoft, Qlik, SAS, TIBCO
* **Other Commercial Platforms**: Dell Boomi, Jitterbit, SnapLogic
* **Cloud-Based iPaaS (Integration Platform as a Service)**: Google Cloud, IBM Application Integration Suite, Informatica Integration Cloud

### Industry Trends
Data integration is evolving with new technologies, expanding data sources, and increased demand for real-time analytics. Businesses prioritize scalable, secure, and flexible solutions for effective decision-making.


---

## Key Takeaways on Data Engineering Tools & Technologies
### Databases & Storage
* Relational Databases (RDBMS): MySQL, PostgreSQL, IBM DB2, Microsoft SQL Server
* NoSQL Databases: MongoDB, Cassandra
* Graph Databases: Neo4j
* Cloud Storage & Data Warehousing: AWS S3 (Data Lake), AWS Redshift (Data Warehouse)

### Data Processing & Pipelines
* ETL & Data Movement: Talend, Apache NiFi, SSIS (SQL Server Integration Services)
* Data Orchestration: Apache Airflow
* Big Data Processing: Apache Spark, Hadoop
* Streaming & Messaging: Apache Kafka, WebSphere MQ

### Automation & Development Tools
* Version Control & CI/CD: GitHub, Jenkins
* Schema Management: Liquibase
* Programming & Scripting: Python (primary language), Shell, Perl, Java APIs

### Web Scraping
* Tools: BeautifulSoup, Scrapy

### Key Advice for Data Engineers
* Lifelong Learning: Data engineering is constantly evolving, requiring continuous skill development.
* Strong Fundamentals: A solid understanding of data concepts helps in quickly adapting to new tools and technologies.
* Open Source Contribution: Exploring and contributing to open-source projects (e.g., Apache Foundation tools) can enhance learning and career growth.



---




## Foundations of Big Data

Big Data refers to the large, dynamic volumes of data generated by people, tools, and machines, which require scalable technology to collect and analyze for real-time insights. The key elements of Big Data are:

* Velocity: The speed at which data accumulates and is processed, such as real-time streaming data.
* Volume: The scale of data, driven by increased data sources, higher-resolution sensors, and scalable infrastructure.
* Variety: The diversity of data, including both structured (e.g., databases) and unstructured data (e.g., social media posts, videos, and images).
* Veracity: The quality and accuracy of data, ensuring it is consistent, complete, and reliable.
* Value: The ability to derive meaningful insights from data, which can have social, medical, and business benefits.

### Examples include:

* Velocity: Hundreds of hours of video are uploaded to YouTube every minute.
* Volume: 2.5 quintillion bytes of data generated every day by the global population.
* Variety: Data from various sources like text, images, health devices, and the Internet of Things.
* Veracity: An estimated 80% of data is unstructured, so it requires careful categorization and analysis before it can be trusted.

To handle Big Data, tools like Apache Spark and Hadoop are used for distributed computing, allowing businesses to gain valuable insights and improve services.

## Big Data Processing Tools

Big Data processing technologies like Apache Hadoop, Apache Hive, and Apache Spark are essential for analyzing large datasets. Here’s a summary of each:

* Hadoop: An open-source framework for distributed storage and processing of large datasets. It uses the Hadoop Distributed File System (HDFS) to store data across multiple nodes, offering scalability, fault tolerance, and parallel computation. It supports structured, semi-structured, and unstructured data and is cost-effective for storing large amounts of data.

* Hive: A data warehouse built on top of Hadoop for managing and querying large datasets. It's suited for ETL, reporting, and data analysis but has high query latency, making it less ideal for real-time applications.

* Spark: A fast, in-memory data processing engine designed for real-time analytics, machine learning, and data integration. It supports multiple programming languages (Java, Scala, Python, R) and can run on top of Hadoop. Its ability to process streaming data and perform complex analytics makes it a key tool for Big Data analytics.

These tools help handle the complexities of Big Data by providing scalable storage, efficient processing, and easy access to insights.
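
Hadoop's classic processing model is MapReduce. The sketch below imitates its three phases (map, shuffle, reduce) for a word count in plain, single-process Python. It is meant only to show the idea of splitting work across chunks; it does not use Hadoop or Spark, and the input text and function names are made up.

```python
from collections import defaultdict

# Hypothetical input, pre-split into chunks as if stored on separate nodes.
chunks = ["big data big insights", "data lakes store raw data"]

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

# On a cluster each chunk would be mapped on a different node in parallel;
# here the chunks are simply processed one after another.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each chunk is mapped independently, adding more nodes lets more chunks be processed at once, which is the basis of the scalability described above.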

## Impact of Big Data on Data Engineering

Big Data, characterized by velocity, veracity, volume, and variety (with value often counted as a fifth V), has transformed the field of data engineering, creating both challenges and opportunities. Key points include:

* Impact on Data Engineering: The rapid growth of data has created a demand for professionals capable of handling large volumes and diverse types of data. This has led to the emergence of new technologies and tools designed for Big Data management and analysis.

* Evolution of Tools: Traditional relational databases (RDBMS) alone are often insufficient for the diverse, massive datasets organizations now collect. New technologies, such as Google Bigtable, Cassandra, Hadoop, and MapReduce, were developed to address these needs.

* Data Storage and Handling: Storing large amounts of data is no longer a major concern due to advancements in storage technology. However, unstructured data (e.g., from IoT devices or social media) requires specialized solutions like MongoDB.

* Growth of Big Data Technologies: As data sources and volumes continue to expand, the role of data engineers has evolved to include managing, processing, and analyzing massive datasets in real-time, utilizing specialized tools and infrastructure.









# Module 2 Practice Quiz (1)

1.) Automated tools, frameworks, and processes for all stages of the data analytics process are part of the Data Engineer’s ecosystem. What role do data integration tools play in this ecosystem?

    A. Cover the entire journey of data from source to destination
    
    B. Store high-volume day-to-day operational data in data repositories
    
    C. Conduct complex data analytics
    
    D. Combine data from multiple sources into a unified view that is accessed by data consumers to query and manipulate data

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. Combine data from multiple sources into a unified view that is accessed by data consumers to query and manipulate data

 Data Integration tools provide a unified view to data collected from disparate sources so that it can be accessed via a single interface for query and manipulation by data consumers.
</details>

2.) Which one of the provided file formats is commonly used by APIs and Web Services to return data?

    A. Delimited file
    
    B. XLS
    
    C. XML
    
    D. JSON

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. JSON

 JSON is the format that is most used by APIs and Web Services to return data.
</details>

3.) What is one example of the relational databases discussed in the video?

    A. XML
    
    B. Spreadsheet
    
    C. Flat files
    
    D. SQL Server

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. SQL Server

 SQL Server is one of the examples of relational databases shared in the video.
</details>

4.) Which of the following languages is one of the most popular querying languages in use today?

    A. R
    
    B. Java
    
    C. Python
    
    D. SQL

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. SQL

 SQL, or Structured Query Language, is one of the most popular querying languages in use today.
</details>

# Module 2 Practice Quiz (2)

1.) The term “data repositories” exclusively refers to RDBMSes and NoSQL databases that are used to collect, organize, and isolate data for analytics.

    A. True
    B. False

  <details>
  <summary>ANSWER</summary>
    
    ↪ B. False

 The term “data repositories” includes not just RDBMSes and NoSQL databases but also data warehouses, data marts, and data lakes.
</details>

2.) In use cases for RDBMS, what is one of the reasons that relational databases are so well suited for OLTP applications?

    A. Allow you to make changes in the database even while a query is being executed
    B. Offer easy backup and restore options
    C. Minimize data redundancy
    D. Support the ability to insert, update, or delete small amounts of data

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. Support the ability to insert, update, or delete small amounts of data

 This is one of the abilities of RDBMSes that make them very well suited for OLTP applications.
</details>

3.) Which NoSQL database type stores each record and its associated data within a single document and also works well with Analytics platforms?

    A. Key-value store
    B. Column-based
    C. Graph-based
    D. Document-based

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. Document-based

 Document-based NoSQL databases store each record and its associated data within a single document and work well with Analytics platforms.
</details>

4.) Which one of these statements explains what data integration is?

    A. Data Integration is the process of loading data into a data repository
    B. Data Integration is the process of extracting data
    C. Data Integration is the process of applying business logic to source data
    D. Data Integration includes extracting, transforming, merging, and delivering quality data for analytical purposes

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. Data Integration includes extracting, transforming, merging, and delivering quality data for analytical purposes

 Data Integration extracts and combines disparate source data into a unified view so that data consumers can query and analyze the integrated data.
</details>

# Module 2 Practice Quiz (3)

1.) What does the attribute “Velocity” imply in the context of Big Data?

    A. Quality and origin of data
    B. Scale of data
    C. Diversity of data
    D. The speed at which data accumulates

  <details>
  <summary>ANSWER</summary>
    
    ↪ D. The speed at which data accumulates

 Velocity, in the context of Big Data, is the speed at which data accumulates.
</details>

2.) Which of the Big Data processing tools provides distributed storage and processing of Big Data?

    A. ETL
    B. Spark
    C. Hadoop
    D. Hive

  <details>
  <summary>ANSWER</summary>
    
    ↪ C. Hadoop

 Hadoop, a Java-based open-source framework, allows distributed storage and processing of large datasets across clusters of computers.
</details>

# Module 2 Graded Quiz

1.) There are two main types of data repositories – Transactional and Analytical. For high-volume day-to-day operational data such as banking transactions, Transactional, or OLTP, systems are the ideal choice.

    A. TRUE  
    B. FALSE

<details>
  <summary>ANSWER</summary>
  ↪ A. TRUE
  
Transactional, or OLTP, systems are designed and optimized for handling high-volume transactions.
</details>

2.) Which of the following is an example of unstructured data?

    A. Spreadsheets  
    B. Zipped files  
    C. Video and Audio files  
    D. XML  

<details>
  <summary>ANSWER</summary>
  ↪ C. Video and Audio files

  Video and audio files are examples of unstructured data.
</details>

3.) Which one of these file formats is independent of software, hardware, and operating systems, and can be viewed the same way on any device?  

    A. Delimited text file  
    B. XLSX  
    C. PDF  
    D. XML  

<details>
  <summary>ANSWER</summary>
  ↪ C. PDF

  PDF format is independent of software, hardware, and operating systems, and can be viewed the same way on any device.
</details>

4.) Which data source can return data in plain text, XML, HTML, or JSON among others?  

    A. Delimited text file  
    B. PDF  
    C. APIs  
    D. XML  

<details>
  <summary>ANSWER</summary>
  ↪ C. APIs

  APIs can return data in a wide variety of formats such as plain text, XML, HTML, or JSON among others.
</details>

5.) In the data engineer’s ecosystem, languages are classified by type. What are shell and scripting languages most commonly used for?

    A. Manipulating data  
    B. Querying data  
    C. Automating repetitive operational tasks  
    D. Building apps  

<details>
  <summary>ANSWER</summary>
  ↪ C. Automating repetitive operational tasks

  Shell and scripting languages are commonly used for automating repetitive operational tasks.
</details>

6.) What is one of the most significant advantages of an RDBMS?  

    A. Enforces a limit on the length of data fields  
    B. Can store only structured data  
    C. Is ACID-Compliant  
    D. Requires source and destination tables to be identical for migrating data  

<details>
  <summary>ANSWER</summary>
  ↪ C. Is ACID-Compliant  
  
  ACID-Compliance is one of the significant advantages of an RDBMS.
</details>

7.) Which one of the NoSQL database types uses a graphical model to represent and store data, and is particularly useful for visualizing, analyzing, and finding connections between different pieces of data?

    A. Document-based  
    B. Column-based  
    C. Graph-based  
    D. Key value store  

<details>
  <summary>ANSWER</summary>
  ↪ C. Graph-based  
  
  Graph-based NoSQL databases use a graphical model to represent and store data and are used for visualizing, analyzing, and finding connections between different pieces of data.
</details>

8.) Which of the data repositories serves as a pool of raw data and stores large amounts of structured, semi-structured, and unstructured data in their native formats?

    A. Relational Databases  
    B. Data Marts  
    C. Data Lakes  
    D. Data Warehouses  

<details>
  <summary>ANSWER</summary>
  ↪ C. Data Lakes  
  
  A Data Lake can store large amounts of structured, semi-structured, and unstructured data in their native format, classified and tagged with metadata.
</details>

9.) While data integration combines disparate data into a unified view of the data, a data pipeline covers the entire data movement journey from source to destination systems, and ETL is a process within data integration.

    A. TRUE  
    B. FALSE  

<details>
  <summary>ANSWER</summary>
  ↪ A. TRUE  
  
  A data pipeline covers the entire journey of data from source to destination. Data integration is performed within a data pipeline, while ETL is a process within data integration.
</details>

10.) What does the attribute “Veracity” imply in the context of Big Data?

    A. The speed at which data accumulates  
    B. Scale of data  
    C. Diversity of the type and sources of data  
    D. Accuracy and conformity of data to facts  

<details>
  <summary>ANSWER</summary>
  ↪ D. Accuracy and conformity of data to facts  
  
  Veracity, in the context of Big Data, refers to the accuracy and conformity of data to facts.
</details>

11.) ______________, in the context of Big Data, is the speed at which data accumulates.

    A. Value  
    B. Volume  
    C. Variety  
    D. Velocity  

<details>
  <summary>ANSWER</summary>
  ↪ D. Velocity  
  
  Velocity refers to the speed at which data is generated, such as real-time streaming data.
</details>

12.) Apache Spark is a general-purpose data processing engine designed to extract and process Big Data for a wide range of applications. What is one of its key use cases?

    A. Consolidate data across the organization  
    B. Scalable and reliable Big Data storage  
    C. Fast recovery from hardware failures  
    D. Perform complex analytics in real-time  

<details>
  <summary>ANSWER</summary>
  ↪ D. Perform complex analytics in real-time  
  
  Spark is a general-purpose data processing engine used for performing complex data analytics in real-time.
</details>

13.) Which of the Big Data processing tools is used for reading, writing, and managing large data set files that are stored in either HDFS or Apache HBase?

    A. ETL  
    B. Spark  
    C. Hadoop  
    D. Hive  

<details>
  <summary>ANSWER</summary>
  ↪ D. Hive  
  
  Hive is an open-source data warehouse software for reading, writing, and managing large data sets stored in data storage systems such as HDFS and Apache HBase.
</details>