# Getting started with ETL

**ETL** stands for **Extract, Transform, Load**. It is a data integration process that collects data from multiple sources, cleans and transforms it into a consistent format, and loads it into a target system such as a data warehouse or database.

## ETL Process Breakdown

1. **Extract**:
   The first step involves extracting data from various sources, such as databases, APIs, or files (e.g., CSV, JSON). The goal is to retrieve the raw data needed for analysis.

2. **Transform**:
   In this stage, the extracted data is cleaned, formatted, and transformed to ensure consistency and usability. Transformations might include:
   - Handling missing or null values
   - Standardizing data formats (e.g., date formats, currencies)
   - Aggregating data or performing calculations
   - Merging data from multiple sources

3. **Load**:
   The final step is loading the transformed data into a target system, such as a data warehouse (e.g., Amazon Redshift, Google BigQuery) or database. This makes the data ready for reporting, analytics, or use by machine learning models.
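
The transformations listed in step 2 can be sketched with pandas. This is a minimal illustration under stated assumptions, not a prescribed pipeline: the `orders` and `customers` frames, their column names, and the fixed `BRL_TO_USD` rate are all hypothetical.

```python
import pandas as pd

# Hypothetical raw extracts; every name and value here is illustrative only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "amount": [100.0, None, 50.0, 25.0],
    "currency": ["USD", "USD", "BRL", "USD"],
    "ordered_at": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 30],
    "country": ["US", "BR", None],
})

BRL_TO_USD = 0.20  # assumed fixed exchange rate, for illustration only

# Handle missing values: drop orders without an amount, label unknown countries.
orders = orders.dropna(subset=["amount"]).copy()
customers = customers.fillna({"country": "unknown"})

# Standardize formats: parse dates and express every amount in USD.
orders["ordered_at"] = pd.to_datetime(orders["ordered_at"], format="%Y-%m-%d")
is_brl = orders["currency"] == "BRL"
orders.loc[is_brl, "amount"] = orders.loc[is_brl, "amount"] * BRL_TO_USD
orders["currency"] = "USD"

# Merge data from multiple sources on the shared key.
merged = orders.merge(customers, on="customer_id", how="left")

# Aggregate: total revenue per country.
revenue_by_country = merged.groupby("country")["amount"].sum()
print(revenue_by_country)
```

Dropping rows with a missing amount is only one possible policy; filling with a default or flagging the row for review may suit other datasets better.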

## Why ETL?

ETL automates collecting, processing, and storing data, so that clean, consistent data is available for decision-making. It is essential for businesses that need to handle large volumes of data from multiple sources efficiently.

## Example

Imagine a retail company that wants to combine data from its online store, physical store, and customer service platform. ETL would:
1. **Extract** sales data from the online and physical store databases and customer feedback from the service platform.
2. **Transform** the data to standardize currencies, handle missing values, and ensure all timestamps use the same format.
3. **Load** the consolidated data into a centralized database, ready for analysis.

By automating ETL, the company keeps its data up to date and ready for reporting and analytics.
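
The retail scenario can be sketched end to end with only the Python standard library. Everything specific here is an assumption made for illustration: the two CSV extracts, the column names, the fixed rates in `RATES_TO_USD`, and the in-memory SQLite database standing in for the centralized database. Only sales data is covered; customer feedback is left out to keep the sketch short.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone

# Hypothetical extracts; in practice these would come from the store databases.
ONLINE_CSV = """order_id,amount,currency,sold_at
o-1,100.00,USD,2024-03-01T10:00:00+00:00
o-2,,USD,2024-03-01T11:30:00+00:00
"""
PHYSICAL_CSV = """order_id,amount,currency,sold_at
p-1,250.00,BRL,01/03/2024 09:15
"""
RATES_TO_USD = {"USD": 1.0, "BRL": 0.2}  # assumed fixed rates, illustration only


def extract(text):
    """Read CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_timestamp(value):
    """Accept ISO 8601 or day/month/year input; return an ISO 8601 UTC string."""
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        parsed = datetime.strptime(value, "%d/%m/%Y %H:%M")
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


def transform(rows, channel):
    """Drop rows without an amount, convert to USD, normalize timestamps."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip sales with a missing amount
        usd = round(float(row["amount"]) * RATES_TO_USD[row["currency"]], 2)
        cleaned.append((row["order_id"], channel, usd, parse_timestamp(row["sold_at"])))
    return cleaned


def load(conn, rows):
    """Insert the cleaned rows into one consolidated table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id TEXT PRIMARY KEY, channel TEXT, amount_usd REAL, sold_at TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the centralized database
load(conn, transform(extract(ONLINE_CSV), "online"))
load(conn, transform(extract(PHYSICAL_CSV), "physical"))
total_usd = conn.execute("SELECT SUM(amount_usd) FROM sales").fetchone()[0]
print(total_usd)
```

Keeping `extract`, `transform`, and `load` as separate functions mirrors the three stages, so a source or target can be swapped without touching the other steps.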

In [8]:
# Install the BigQuery client, its pandas integration, and .env support in one step
!pip install --upgrade google-cloud-bigquery pandas-gbq pandas python-dotenv

In [1]:
import os

import pandas as pd
from dotenv import load_dotenv  # provided by the python-dotenv package
from google.cloud import bigquery

In [2]:
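# Read variables from a local .env file into the process environment
load_dotenv()

# Despite its name, PROJECT_NAME is interpolated as the table reference in the
# query below, so it should hold a table ID such as project.dataset.table
PROJECT_NAME = os.getenv('PROJECT_NAME')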

In [3]:
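# Path to the service account key file (keep it out of version control)
credential = 'key.json'

# Authenticate the client with the service account
client = bigquery.Client.from_service_account_json(credential)

# PROJECT_NAME is used here as the table to read from
query = f"""
SELECT * FROM `{PROJECT_NAME}`
"""

# Run the query and materialize the result as a pandas DataFrame
result = client.query(query)
df = result.to_dataframe()

# Preview the first five rows
df.head()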



| Unnamed: 0 | id | created_at | first_name | last_name | email | cell_phone | country | state | street | number | additionals |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 2018-01-26 03:29:13+00:00 | Mariana | Góes | mariana@meu_email.com | 9 7324-4293 | Brasil | | | | Apto 25 |
| 1 | 73 | 2018-02-03 07:37:38+00:00 | Cristiano | Almeida | cristiano@usuario.com | 9 2630-9907 | Brasil | | | | Conjunto 24 |
| 2 | 84 | 2018-11-01 15:39:40+00:00 | Carol | Bueno | carol@meu_email.com | 9 3760-2211 | Brasil | | | | Conjunto 26 |
| 3 | 95 | 2018-10-01 11:35:59+00:00 | Mariana | Rosa | mariana@usuario.com | 9 3139-2145 | Brasil | | | | |
| 4 | 0 | 2017-11-01 14:45:41+00:00 | Marta | Jesus | | 9 9102-7834 | Brasil | Acre | | | Conjunto 16 |
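
The notebook stops after the extract step. As a hedged sketch of what a transform step could look like on this data, the snippet below rebuilds two rows from the output above (so it runs without BigQuery credentials) and applies a few cleaning rules. The rules are assumptions chosen for illustration, not the author's pipeline: treating `Unnamed: 0` as a leftover index column, reducing phone numbers to digits, and adding a hypothetical `has_email` flag.

```python
import pandas as pd

# Two rows copied from the output above, trimmed to the columns used here.
df = pd.DataFrame({
    "Unnamed: 0": [0, 4],
    "id": [36, 0],
    "created_at": ["2018-01-26 03:29:13+00:00", "2017-11-01 14:45:41+00:00"],
    "first_name": ["Mariana", "Marta"],
    "last_name": ["Góes", "Jesus"],
    "email": ["mariana@meu_email.com", None],
    "cell_phone": ["9 7324-4293", "9 9102-7834"],
    "country": ["Brasil", "Brasil"],
    "state": [None, "Acre"],
})

clean = (
    # Assumption: "Unnamed: 0" is a leftover index column with no meaning.
    df.drop(columns=["Unnamed: 0"])
    .assign(
        # Parse timestamps into timezone-aware UTC datetimes.
        created_at=lambda d: pd.to_datetime(d["created_at"], utc=True),
        # Keep digits only so phone numbers compare consistently.
        cell_phone=lambda d: d["cell_phone"].str.replace(r"\D", "", regex=True),
        # Hypothetical flag marking rows that have an email address.
        has_email=lambda d: d["email"].notna(),
    )
    .sort_values("created_at")
    .reset_index(drop=True)
)
print(clean[["first_name", "created_at", "cell_phone", "has_email"]])
```

Flagging missing emails rather than dropping the rows keeps every customer in the result; whether that is the right choice depends on how the data is used downstream.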
