**Data Extraction in ETL**

**Question 1 : Describe different types of data sources used in ETL with suitable examples.**

**Ans:**

In ETL (Extract, Transform, Load), data extraction means collecting data from different kinds of source systems. The main types are:

1. Database sources
These are the most common ETL data sources. Data is extracted from relational or non-relational databases.
Example: Customer details stored in an Oracle or MySQL database, sales records in SQL Server, or product data in PostgreSQL.

2. Flat files
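Flat files are simple files that store records in a single table-like layout, usually as delimited or fixed-width text. They are widely used for data exchange.
Example: CSV files containing employee details, text files generated by legacy systems, or Excel files with monthly sales reports (spreadsheets are usually grouped with flat files in ETL, although an Excel workbook is not plain text).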

3. Application sources
Data can be extracted directly from business applications that manage daily operations.
Example: ERP systems storing finance and inventory data, CRM applications containing customer and lead information, or HR systems with payroll data.

4. Web services and APIs
Modern ETL processes often extract data from web-based services using APIs.
Example: Pulling user activity data from a REST API, payment data from a payment gateway API, or marketing data from social media platforms.

5. Cloud-based sources
With cloud adoption, many data sources now reside in cloud platforms.
Example: Data stored in cloud databases, cloud data warehouses, or SaaS tools like online accounting or analytics platforms.

6. Logs and machine-generated data
These sources produce large volumes of semi-structured or unstructured data.
Example: Server log files, application logs, sensor data from IoT devices, or clickstream data from websites.
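
As an illustration of the semi-structured sources in item 6, the sketch below turns one web server log line into named fields. It assumes the line follows the widely used common log format; the sample line and the `parse_log_line` helper are made up for this example, and real logs may need a different pattern.

```python
import re

# Pattern for an access-log line in the common log format (an assumption
# about the source; adjust it to the format your servers actually write).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match the pattern."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical sample line.
sample = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /products/42 HTTP/1.1" 200 2326'
print(parse_log_line(sample))
```

Lines that do not match return `None`, so an extraction job can count or log them instead of failing.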

**Question 2 : What is data extraction? Explain its role in the ETL pipeline.**

**Ans:**

Data extraction is the first step in the ETL (Extract, Transform, Load) process. It refers to the process of collecting or pulling raw data from different source systems such as databases, flat files, applications, APIs, or cloud platforms.

In the ETL pipeline, data extraction plays a critical role because it acts as the entry point of data into the system. This step ensures that data is collected in a timely, accurate, and consistent manner without disturbing the source systems. Extraction can be full (the entire dataset is pulled on every run) or incremental (only records added or changed since the last run are pulled), depending on business requirements.

The extracted data is then sent to the transformation stage, where it is cleaned, standardized, and validated. Any issues in extraction, such as missing or incorrect data, directly impact the quality of transformed data and the final output. Therefore, data extraction is essential for ensuring reliable data flow and building trustworthy reports and analytics in the ETL pipeline.

**Question 3 : Explain the difference between CSV and Excel in terms of extraction and ETL usage.**

**Ans:**

CSV and Excel are both used as data sources in ETL, but they are different in how data is extracted and used.

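A CSV file stores data in plain text, usually one record per line, with values separated by commas. In ETL, CSV files are easy to extract because they have a simple structure and no formatting. Most ETL tools can read them directly, and they are suitable for handling large amounts of data. Because every value is stored as text, data types such as numbers and dates must be assigned during extraction or transformation.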

An Excel file stores data in rows and columns but can also contain multiple sheets, formatting, and formulas. In ETL, extracting data from Excel files takes more effort because the tool has to select the correct sheet and handle extra features such as formulas and merged cells. Excel files are mostly used for small datasets or manual data sharing; a single worksheet is also limited to 1,048,576 rows.

In ETL processes, CSV files are preferred for automated data extraction, while Excel files are commonly used when data is shared by users in a readable format.
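
The CSV side of this comparison can be sketched with Python's standard `csv` module. The column names and sample rows below are hypothetical. The point of the example is that every CSV value arrives as a string, so the extraction code has to cast types itself. Reading an Excel workbook would instead need a third-party library (for example `openpyxl` or `pandas`) and an explicit choice of sheet, which is not shown here.

```python
import csv
import io

# Hypothetical CSV content; a real job would use open(path, newline="").
raw = "emp_id,name,salary\n101,Asha,52000.50\n102,Ravi,48000\n"

def extract_employees(handle):
    """Read CSV rows and cast each column, because CSV stores everything as text."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "emp_id": int(row["emp_id"]),
            "name": row["name"],
            "salary": float(row["salary"]),
        })
    return rows

employees = extract_employees(io.StringIO(raw))
print(employees)
```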

**Question 4 : Explain the steps involved in extracting data from a relational database.**

**Ans:**

Extracting data from a relational database is done in a few clear steps in the ETL process.

First, the ETL system establishes a connection to the relational database using required details such as database name, server, username, and password.

Next, the required data is identified. This includes selecting the tables, columns, and records that need to be extracted, often using SQL queries.

After that, the data is extracted by running SQL commands such as SELECT queries. The data can be extracted fully (the entire table) or incrementally (only new or updated records).

Then, the extracted data is moved to a staging area where it is temporarily stored. Transformations then run against the staged copy, so the source database is not loaded further.

Finally, basic checks are performed to ensure the extracted data is complete and correct before it is sent to the transformation stage.

These steps ensure that data is safely and accurately extracted from a relational database for further ETL processing.
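
The five steps above can be sketched as follows. This is a minimal illustration, not a production job: an in-memory SQLite database stands in for the real source, and the `orders` table, the `stg_orders` staging table, and the `updated_at` watermark column are all assumptions made for the example.

```python
import sqlite3

# Step 1: connect. In-memory SQLite stands in for the real source database;
# a production job would use its own driver, host, and credentials.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    INSERT INTO orders VALUES (1, 120.0, '2024-01-01 09:00:00');
    INSERT INTO orders VALUES (2, 75.5,  '2024-01-02 10:30:00');
    INSERT INTO orders VALUES (3, 310.0, '2024-01-03 08:15:00');
""")

staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE stg_orders (order_id INTEGER, amount REAL, updated_at TEXT)")

def extract_orders(last_watermark=None):
    """Steps 2-4: pick the needed columns, pull the rows, copy them to staging."""
    query = "SELECT order_id, amount, updated_at FROM orders"
    params = ()
    if last_watermark is not None:
        # Incremental extraction: only rows changed after the previous run.
        query += " WHERE updated_at > ?"
        params = (last_watermark,)
    rows = source.execute(query + " ORDER BY updated_at", params).fetchall()
    staging.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)
    staging.commit()
    return rows

previous_watermark = "2024-01-01 23:59:59"  # hypothetical value saved by the last run
rows = extract_orders(last_watermark=previous_watermark)

# Step 5: a basic completeness check comparing extracted and staged row counts.
staged_count = staging.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0]
assert staged_count == len(rows)

# Keep the newest timestamp as the watermark for the next incremental run.
new_watermark = rows[-1][2] if rows else previous_watermark
print(staged_count, new_watermark)
```

Calling `extract_orders()` with no argument performs a full extraction instead. The watermark comparison works here because the timestamps are stored as ISO-style text, which sorts in chronological order.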

**Question 5 : Explain three common challenges faced during data extraction.**

**Ans:**

Data extraction is a critical part of the ETL process, but it comes with several challenges.

One common challenge is data quality issues. Source data may contain missing values, duplicate records, or inconsistent formats. When such data is extracted, it affects the accuracy of the entire ETL process and requires additional effort during transformation.

Another challenge is performance impact on source systems. Extracting large volumes of data, especially during business hours, can slow down operational databases and affect normal business activities. Careful scheduling and efficient extraction methods are needed to avoid this problem.

A third challenge is data inconsistency and schema changes. Source systems may change over time, such as modifications in table structure, column names, or data types. These changes can break extraction logic and cause failures in ETL jobs if not properly managed.

These challenges must be addressed to ensure reliable and efficient data extraction in ETL processes.
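
One way to catch the third challenge early is to compare the columns a source actually exposes with the columns the job expects, before any rows are pulled. The sketch below assumes a hypothetical customer table; `EXPECTED_COLUMNS` and `check_schema` are illustrative names, not part of any ETL tool.

```python
# Columns the extraction job was written against (a hypothetical contract).
EXPECTED_COLUMNS = {"customer_id", "email", "signup_date"}

def check_schema(actual_columns):
    """Compare the columns found in the source with the columns the job expects."""
    actual = set(actual_columns)
    return {
        "missing": sorted(EXPECTED_COLUMNS - actual),
        "unexpected": sorted(actual - EXPECTED_COLUMNS),
    }

# Simulates a source where `email` was renamed to `email_address`.
drift = check_schema(["customer_id", "email_address", "signup_date"])
if drift["missing"]:
    print("Stopping extraction, missing columns:", drift["missing"])
```

Failing fast on a missing column usually produces a clearer error than a query that breaks halfway through a run.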

**Question 6 : What are APIs? Explain how APIs help in real-time data extraction.**

**Ans:**

APIs (Application Programming Interfaces) are interfaces that allow different software applications to communicate with each other. They define a set of rules and methods through which one system can request data from another system.

APIs help in real-time data extraction by allowing ETL systems to fetch data shortly after it is generated or updated in the source system. Instead of waiting for scheduled file transfers or batch database exports, the ETL system can pull data with frequent API calls, or the source can push changes to it as they happen. This is especially useful for applications that require up-to-date information.

For example, ETL tools can use APIs to extract real-time data from online applications such as payment systems, social media platforms, or weather services. This enables faster data processing and supports real-time reporting and analytics.
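
A pull-based API extraction often has to walk through several pages of results. The sketch below shows that loop under stated assumptions: the response fields `items` and `next_cursor` are hypothetical (real APIs name and structure pagination differently), and a fake fetch function replaces the HTTP request so the example runs on its own.

```python
import json

def extract_all(fetch_page):
    """Follow a cursor through a paginated API until no next cursor is returned."""
    records, cursor = [], None
    while True:
        page = fetch_page(cursor)
        records.extend(page["items"])
        cursor = page.get("next_cursor")
        if cursor is None:
            return records

# Stand-in for HTTP responses. A real fetch_page would send a GET request
# (for example with urllib.request.urlopen) and decode the JSON body.
FAKE_RESPONSES = {
    None: '{"items": [{"id": 1}, {"id": 2}], "next_cursor": "p2"}',
    "p2": '{"items": [{"id": 3}], "next_cursor": null}',
}

def fake_fetch_page(cursor):
    return json.loads(FAKE_RESPONSES[cursor])

payments = extract_all(fake_fetch_page)
print([p["id"] for p in payments])
```

Passing the fetch function in as a parameter keeps the paging logic separate from the network code, which also makes it easy to test.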

**Question 7 : Why are databases preferred for enterprise-level data extraction?**

**Ans:**

Databases are preferred for enterprise-level data extraction because they can handle large volumes of data efficiently and reliably. They are designed to store structured data in an organized manner, which makes data extraction faster and more accurate.

Databases also support powerful query languages like SQL, which allow organizations to extract only the required data instead of the entire dataset. This improves performance and reduces processing time.

In addition, databases provide security, data integrity, and backup features, which are very important at the enterprise level. Because of these advantages, databases are commonly used as primary data sources in large-scale ETL processes.

**Question 8 : What steps should an ETL developer take when extracting data from large CSV files (1GB+)?**

**Ans:**

When extracting data from large CSV files (1 GB or more), an ETL developer should follow certain steps to ensure efficient and reliable processing.

First, the developer should avoid loading the entire file into memory at once. Instead, the file should be read in chunks or batches to prevent memory issues.

Second, proper file validation should be done before extraction. This includes checking the file format, delimiter, headers, and encoding to avoid errors during processing.

Third, the data should be extracted incrementally by processing rows in batches and writing them to a staging area. This improves performance and allows recovery in case of failure.

Next, unnecessary columns should be ignored during extraction to reduce processing time and resource usage.

Finally, error handling and logging should be implemented to capture failed records and track progress, ensuring that the extraction process can be monitored and restarted if required.

These steps help ETL developers handle large CSV files efficiently without impacting system performance.
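
Several of these steps can be combined in a short sketch using only the standard library. It validates the header, reads rows in batches instead of loading the whole file, keeps only the needed columns, and logs progress. The column names, the batch size, and the `read_in_batches` helper are assumptions for illustration, and per-record error capture is left out for brevity.

```python
import csv
import io
import logging
from itertools import islice

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("csv_extract")

def read_in_batches(handle, wanted_columns, batch_size=50_000):
    """Yield lists of row dicts, keeping only wanted columns, one batch at a time."""
    reader = csv.DictReader(handle)
    # Validate the header before processing any rows.
    missing = set(wanted_columns) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV header is missing columns: {sorted(missing)}")
    while True:
        # islice pulls at most batch_size rows, so memory use stays bounded.
        batch = [
            {col: row[col] for col in wanted_columns}
            for row in islice(reader, batch_size)
        ]
        if not batch:
            return
        yield batch

# Small in-memory stand-in; a real job would use
# open(path, newline="", encoding="utf-8") on the large file.
sample = io.StringIO("id,name,notes\n1,Asha,x\n2,Ravi,y\n3,Meena,z\n")
batches = []
for number, batch in enumerate(read_in_batches(sample, ["id", "name"], batch_size=2), start=1):
    log.info("batch %d: %d rows", number, len(batch))
    batches.append(batch)  # a real job would write the batch to staging here
```

Because the function is a generator, only one batch is held in memory at a time, and the logged batch number gives a restart point if the job fails partway through.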

