<a href="https://colab.research.google.com/github/newfrogg/data_engineering/blob/what_is_ELT/data_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is ELT ?

## Data Engineering in short
Data engineering is the practice of designing and building systems for:
1. collecting,
2. storing, and
3. analyzing data at scale,

so that the data is in a highly usable state before it reaches data scientists and data analysts.

[Coursera-What is Data Engineering](https://www.coursera.org/articles/what-does-a-data-engineer-do-and-how-do-i-become-one)


## What is Data Engineering
"Data engineering is a set of operations aimed at creating ***interfaces and mechanisms for the flow and access of information***. It takes dedicated specialists—data engineers— to maintain data so that it remains available and usable by others. In short, data engineers set up and operate the organization’s data infrastructure, preparing it for further analysis by data analysts and scientists"

**From “Data Engineering and Its Main Concepts” by AltexSoft**

1. Data Engineering Defined

Data engineering is the development, implementation, and maintenance of systems
and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A ***data engineer manages the data engineering lifecycle***, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.

2. The Data Engineering Lifecycle

- The data engineering lifecycle focuses on the data itself and shifts the conversation away from technology.
- It has many stages, some above and some below the surface of the iceberg:
    - Generation, Storage, Serving: tasks of the data engineering workflow itself.
    - Security, DataOps, Software Engineering: tasks that interact with related fields.

![Data Engineering Life Cycle](https://github.com/newfrogg/data_engineering/blob/what_is_ELT/images/data_engineering_life_cycle.png?raw=1)

3. Evolution of the Data Engineer

*   The roots of data engineering can be traced back to the **data warehousing era (1980s-2000)**, pioneered by figures like Bill Inmon and Ralph Kimball. Early roles like BI engineers and ETL developers focused on building systems for scalable analytics using relational databases and MPP systems. The rise of the web introduced new data scale challenges that traditional systems struggled with.
*   **Contemporary data engineering emerged in the early 2000s** as companies faced exploding data growth. Innovations from Google (GFS, MapReduce), the open-source Hadoop ecosystem inspired by Google's work, and the advent of public clouds like AWS provided the **foundation for distributed computation and storage** on massive clusters. This marked the beginning of the "big data" era.
*   The **big data era (2000s-2010s)** saw the rise of the "big data engineer" proficient in software development and low-level infrastructure hacking to manage complex open-source tools like Hadoop and Spark. While powerful, these massive clusters were operationally burdensome and costly to manage, often diverting engineers from delivering business value. The term "big data" has since become a relic as the technology became more accessible.
*   In the **2020s, big data engineers are now simply called data engineers or data lifecycle engineers**, reflecting a shift towards managing the entire data engineering lifecycle rather than low-level infrastructure details. With greater abstraction and simplification of tools, the focus has moved to higher-value areas like security, data management, DataOps, data architecture, and orchestration. Data engineering has become a discipline of **connecting various modular technologies** to serve business goals.

4. Data Engineering and Data Science

- There are many opinions about the relationship between data engineering and data science. Here, we treat **data engineering as a separate discipline from data science and analytics**, although the two are complementary. Data engineering sits **upstream** from data science, meaning data engineers are responsible for providing the data inputs that data scientists need.

- Consider the **Data Science Hierarchy of Needs**, published in 2017 by Monica Rogati, which places AI and machine learning at the top and foundational tasks like data movement, storage, collection, cleansing, and infrastructure at the bottom. Data scientists are often estimated to spend **70% to 80% of their time on these lower-level tasks**, such as gathering, cleaning, and processing data. This occurs because data scientists are typically not trained to engineer production-grade data systems.

![Data Science Hierarchy](https://github.com/newfrogg/data_engineering/blob/what_is_ELT/images/data_science_hierarchy.png?raw=1)

- The core idea is that **data engineers build the solid data foundation** represented by the bottom layers of this hierarchy. By doing so, data engineers enable data scientists to focus their time more effectively on higher-value activities like analysis, experimentation, and machine learning (the top layers of the pyramid). Ultimately, data engineering bridges the gap between acquiring raw data and extracting value from it, playing a vital role in the success of data science in production environments.

[Fundamentals of Data Engineering](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/)

## What is ETL ?
ETL is a *data integration process* made up of three steps:
1. **Extract** data from source systems, such as legacy systems
2. **Transform** and/or clean the data to enhance data quality and improve consistency
3. **Load** the data into target databases

[IBM_ETL](https://www.ibm.com/think/topics/etl)
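
The three steps above can be sketched end to end with the Python standard library. This is a minimal illustration, not a production pipeline: the inline CSV text stands in for a legacy source, the `books` table and the in-memory SQLite database stand in for the target, and the function names `extract`, `transform`, and `load` are chosen here for illustration only.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (inline so the sketch is self-contained).
RAW_CSV = """title,price
 Book A ,$10.50
Book B,$5.25
,$9.99
"""

def extract(raw_text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: trim whitespace, drop rows without a title, parse prices."""
    cleaned = []
    for row in rows:
        title = row["title"].strip()
        if not title:
            continue  # a row without a title is treated as unusable
        price = float(row["price"].replace("$", ""))
        cleaned.append((title, price))
    return cleaned

def load(records, connection):
    """Load: write the cleaned records into the target table."""
    connection.execute("CREATE TABLE IF NOT EXISTS books (title TEXT, price REAL)")
    connection.executemany("INSERT INTO books VALUES (?, ?)", records)
    connection.commit()

# Assumption: an in-memory SQLite database plays the role of the target database.
connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
rows = connection.execute("SELECT title, price FROM books ORDER BY price").fetchall()
print(rows)
```

Keeping each step in its own function makes it possible to test the transformation logic without touching the source or the target.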



### Example: Amazon Web scraping

In [75]:
# Import Libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import json

In [76]:
# Support Function
# Function to extract Product Title
def get_title(soup):

    try:
        # Outer Tag Object
        title = soup.find("span", attrs={"id":'productTitle'})

        # Inner NavigableString Object
        title_value = title.text

        # Title as a string value
        title_string = title_value.strip()

    except AttributeError:
        title_string = ""

    return title_string

# Function to extract Product Price
def get_price(soup):

    try:
        price = soup.find("div",attrs={'class':'a-section aok-hidden twister-plus-buying-options-price-data'}).string.strip()
        price = json.loads(price)
        price = price["desktop_buybox_group_1"][0]["displayPrice"]
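    # Missing tag, empty tag, malformed JSON, or an unexpected JSON layout
    except (AttributeError, KeyError, IndexError, TypeError, ValueError):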
        price = ""

    return price

# Function to extract Product Rating
def get_rating(soup):

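    try:
        # This selector only matches the icon class used for 4.5-star ratings
        rating = soup.find("i", attrs={'class':'a-icon a-icon-star a-star-4-5'}).string.strip()

    except AttributeError:
        try:
            # Fallback: the alt text of the rating icon
            rating = soup.find("span", attrs={'class':'a-icon-alt'}).string.strip()
        except AttributeError:
            rating = ""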

    return rating

# Function to extract Number of User Reviews
def get_review_count(soup):
    try:
        review_count = soup.find("span", attrs={'id':'acrCustomerReviewText'}).string.strip()

    except AttributeError:
        review_count = ""

    return review_count

# Function to extract Availability Status
def get_availability(soup):
    try:
        available = soup.find("div", attrs={'id':'availability'})
        available = available.find("span").string.strip()

    except AttributeError:
        available = "Not Available"

    return available

# Function to extract the page count (work in progress: relies on a fixed list position)
def get_pages(soup):
    try:
        # Alternative selector tried earlier:
        # pages =  soup.find_all('div', class_='a-section a-spacing-none a-text-center rpi-attribute-value')
        # if pages:
        #     last_idx = pages[-1]
        #     pages = last_idx.find('span').string.strip()
        details = soup.find("ul", attrs={'class':'a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list'})
        # Assumes the page count is the fifth bullet of the product details list
        item = details.find_all("li")[4]
        pages = item.find("span").find_all("span")[1].string.strip()

    except (AttributeError, IndexError):
        # Raised when an element is missing or the list is shorter than expected
        pages = "Not Available"
    return pages

# Function to extract main author
def get_author(soup):
    try:
        author = soup.find("div", attrs={'id':'bylineInfo', 'class':'a-section a-spacing-micro bylineHidden feature'})
        author = author.find("span")
        author = author.find("a").string.strip()
    except AttributeError:
        author = "Not Available"
    return author


In [77]:
if __name__ == '__main__':

    # add your user agent
    HEADERS = ({'User-Agent':'', 'Accept-Language': 'en-US, en;q=0.5'})

    # The webpage URL
    URL = "https://www.amazon.com/s?k=data+engineering&i=stripbooks-intl-ship"

    # HTTP Request (the timeout keeps the script from hanging on a stalled connection)
    webpage = requests.get(URL, headers=HEADERS, timeout=10)

    # Soup Object containing all data
    soup = BeautifulSoup(webpage.content, "html.parser")

    # Fetch links as List of Tag Objects
    links = soup.find_all("a", attrs={'class':'a-link-normal s-no-outline'})

    # Store the links
    links_list = []

    # Loop for extracting links from Tag Objects (skip anchors without an href)
    for link in links:
        href = link.get('href')
        if href:
            links_list.append(href)

    d = {"title":[], "price":[], "rating":[], "reviews":[],"availability":[], "authors":[]}

    # Loop for extracting product details from each link
    for link in links_list:
        new_webpage = requests.get("https://www.amazon.com" + link, headers=HEADERS, timeout=10)

        new_soup = BeautifulSoup(new_webpage.content, "html.parser")

        # Function calls to display all necessary product information
        d['title'].append(get_title(new_soup))
        d['price'].append(get_price(new_soup))
        d['rating'].append(get_rating(new_soup))
        d['reviews'].append(get_review_count(new_soup))
        d['availability'].append(get_availability(new_soup))
        d['authors'].append(get_author(new_soup))
        # d['pages'].append(get_pages(new_soup))
        # break


    # Build the DataFrame and treat empty titles as missing values.
    # The result is assigned back to the column: calling
    # amazon_df['title'].replace('', np.nan, inplace=True) acts on an intermediate
    # copy, and pandas warns that this pattern will stop working in pandas 3.0.
    amazon_df = pd.DataFrame.from_dict(d)
    amazon_df['title'] = amazon_df['title'].replace('', np.nan)

    # Drop products without a title and save the result
    amazon_df = amazon_df.dropna(subset=['title'])
    amazon_df.to_csv("amazon_data.csv", header=True, index=False)


In [78]:
amazon_df

| | title | price | rating | reviews | availability | authors |
|---|---|---|---|---|---|---|
| 0 | Fundamentals of Data Engineering: Plan and Bui... | | 4.4 out of 5 stars | 700 ratings | Not Available | Joe Reis |
| 1 | Data Engineering with AWS: Acquire the skills ... | | 4.7 out of 5 stars | 50 ratings | Not Available | Gareth Eagar |
| 2 | Data Engineering Design Patterns: Recipes for ... | $67.13 | 4.4 out of 5 stars | 1 rating | In Stock | Bartosz Konieczny |
| 3 | Designing Data-Intensive Applications: The Big... | $45.67 | 4.4 out of 5 stars | 5,186 ratings | In Stock | Martin Kleppmann |
| 4 | AI Engineering: Building Applications with Fou... | $57.74 | 4.6 out of 5 stars | 185 ratings | In Stock | Chip Huyen |
| 5 | Data Pipelines Pocket Reference: Moving and Pr... | $16.93 | 4.4 out of 5 stars | 394 ratings | In Stock | James Densmore |
| 6 | Hands-On Data Engineering: From Zero to Produc... | | 4.7 out of 5 stars | | Not Available | Nitin Rane |
| 7 | Data Engineering Best Practices: Architect rob... | $35.99 | 4.5 out of 5 stars | 4 ratings | In Stock | Richard J. Schiller |
| 8 | Ace the Data Engineering Interview: Questions ... | $29.99 | 4.4 out of 5 stars | 3 ratings | In Stock | Sean Coyne |
| 9 | Databricks Certified Data Engineer Associate S... | $65.22 | 4.0 out of 5 stars | 6 ratings | In Stock | Derar Alhussein |
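
The `price`, `rating`, and `reviews` columns in this output are still raw strings, so a further Transform step would be needed before any numeric analysis. The sketch below shows one possible way to parse them with regular expressions. The helper names `parse_rating`, `parse_review_count`, and `parse_price` are hypothetical, and the patterns assume the string formats shown in the output above (for example `4.4 out of 5 stars`, `5,186 ratings`, `$45.67`); other formats would need different patterns.

```python
import re

def parse_rating(text):
    """'4.4 out of 5 stars' -> 4.4; returns None when the text does not match."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of", text or "")
    return float(match.group(1)) if match else None

def parse_review_count(text):
    """'5,186 ratings' -> 5186; returns None when the text does not match."""
    match = re.match(r"\s*([\d,]+)\s+rating", text or "")
    return int(match.group(1).replace(",", "")) if match else None

def parse_price(text):
    """'$45.67' -> 45.67; returns None when the text does not match."""
    match = re.match(r"\s*\$([\d,]+(?:\.\d+)?)", text or "")
    return float(match.group(1).replace(",", "")) if match else None

# Example values copied from the scraped output above
print(parse_rating("4.4 out of 5 stars"))
print(parse_review_count("5,186 ratings"))
print(parse_price("$45.67"))
```

With a DataFrame, each helper could be applied column by column, for example `amazon_df['rating'].map(parse_rating)`. Returning `None` for empty strings keeps missing values explicit instead of silently turning them into zeros.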
