# Import libraries 

As a data engineer, you would typically use libraries such as: 
- `pandas` : used to read data into a structured tabular format known as a `DataFrame`. It supports reading data from files, databases and APIs. It allows for operations to be performed on the `DataFrame` before then being written out to another file or database. 
- `requests` : used to send HTTP requests to APIs to fetch data.
- And many more, depending on what you need to do.

Go ahead and import these popular libraries into your notebook by running:

```python
import pandas as pd  # it is common to provide an alias (shortened name) for the library you're importing to make it easier to reference in your code 
```

If a library is not installed on your computer, you will see a `ModuleNotFoundError`. In that case, install it by running:

```
pip install pandas 
```


# Ingesting data 

As a data engineer, the types of data that you are ingesting will depend on the organisation or company you work for. For example: 
- Finance data
- Operational data 
- Market research data
- and many more! 

The data can exist in a myriad of different formats, for example:
- comma-separated values (CSV) files
- Excel files
- Parquet files
- JSON files
- REST APIs (JSON)
- SOAP APIs (XML)
- database tables (SQL)
- Kafka streams
- and many more!

Therefore, as a data engineer, you will need to learn how to ingest data from these disparate sources. To make our lives easier, we can group them into four main categories of data source:
1. Database tables
2. Web (REST API/JSON)
3. Files (CSV, Excel, JSON, Parquet, etc.)
4. Streaming data (Kafka streams)

If you are able to master each one, then you are going to be valuable in the eyes of your company. 
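To make three of these categories concrete, here is a minimal sketch that loads the same tiny dataset from a file, from a JSON payload, and from a database table. The sample rows and the `customers` table are made up for illustration, and an in-memory SQLite database stands in for a real database server. Streaming sources are left out because they need a running broker.

```python
import io
import json
import sqlite3

import pandas as pd

# Hypothetical sample data, used only for illustration.
csv_text = "customer_id,customer_city\nc1,sao paulo\nc2,curitiba\n"
json_text = (
    '[{"customer_id": "c1", "customer_city": "sao paulo"},'
    ' {"customer_id": "c2", "customer_city": "curitiba"}]'
)

# 1. File: read_csv accepts a file path or any file-like object.
df_from_file = pd.read_csv(io.StringIO(csv_text))

# 2. Web: a REST API usually returns JSON in the response body.
#    json.loads stands in here for parsing that response.
df_from_json = pd.DataFrame(json.loads(json_text))

# 3. Database table: an in-memory SQLite database stands in for a real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, customer_city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("c1", "sao paulo"), ("c2", "curitiba")],
)
df_from_db = pd.read_sql("SELECT * FROM customers", conn)
conn.close()

print(df_from_file)
print(df_from_json)
print(df_from_db)
```

Whatever the source, the result is a `DataFrame`, so the steps that follow ingestion look the same.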

# Let's ingest

For this instructor demo, we will be ingesting data from the following sources: 

1. Customers dataset: `olist_customers_dataset.csv`
2. Orders dataset: `olist_orders_dataset.csv` 
3. Order items dataset: `olist_order_items_dataset.csv` 
4. Products dataset: `olist_products_dataset.csv` 
5. Product category translation: `product_category_name_translation.csv`


In [None]:
# read in csv files into their respective DataFrames
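As a rough sketch of this step, the snippet below reads one of the files with `pd.read_csv` and previews it with `head()`. The `data_dir` folder and the `df_customers` name are assumptions. A tiny stand-in file is written to a temporary folder so the sketch runs on its own; in the real notebook, `data_dir` would point at wherever the Olist CSV files are stored.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Assumption: data_dir is the folder that holds the CSV files.
# A small stand-in file is created here so the example is self-contained.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "olist_customers_dataset.csv").write_text(
    "customer_id,customer_city,customer_state\n"
    "c1,sao paulo,SP\n"
    "c2,curitiba,PR\n"
)

# read_csv parses the file into a DataFrame, using the first row as column names.
df_customers = pd.read_csv(data_dir / "olist_customers_dataset.csv")

# head() returns the first 5 rows by default (fewer if the DataFrame is shorter).
print(df_customers.head())
```

The same two calls can be repeated for each of the other four files, with one DataFrame per file.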


In [None]:
# display the first 5 rows in each DataFrame 


In [None]:
# display the first 5 rows in each DataFrame 


In [None]:
# display the first 5 rows in each DataFrame 


In [None]:
# display the first 5 rows in each DataFrame 


In [None]:
# display the first 5 rows in each DataFrame 


Great work! Now that we have ingested the CSV files into DataFrames, what do we do next? 

# Transforming data 

Once we have ingested the raw data, the next step is to enrich the data so that it can be used by **Data Analysts** to answer business questions.

*How do we enrich the data?*

The step of enriching the data is known as `Transformation`. Transformation typically involves steps like: 
- removing records with missing values from the data
- renaming columns and values to make the data consistent and easier to use
- joining a dataset with another to make more data available to the Data Analyst
- creating new columns calculated from existing ones
- aggregating the data 
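The steps above can be sketched in pandas as follows. The `sales` and `products` DataFrames and all of their column names are hypothetical and exist only to show one call per step.

```python
import pandas as pd

# Hypothetical data for illustration only.
sales = pd.DataFrame({
    "prod": ["a", "b", "a", None],
    "quantity": [2, 1, 3, 1],
    "unit_price": [5.0, 10.0, 5.0, 4.0],
})
products = pd.DataFrame({
    "product_id": ["a", "b"],
    "category": ["toys", "books"],
})

# 1. Remove records with missing values in the key column.
sales = sales.dropna(subset=["prod"])

# 2. Rename a column so that it matches the other dataset.
sales = sales.rename(columns={"prod": "product_id"})

# 3. Join with another dataset to bring in more columns.
enriched = sales.merge(products, on="product_id", how="left")

# 4. Create a new column calculated from existing ones.
enriched["line_total"] = enriched["quantity"] * enriched["unit_price"]

# 5. Aggregate the data.
total_by_category = enriched.groupby("category", as_index=False)["line_total"].sum()
print(total_by_category)
```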

*What do we use to perform transformations?*

As a data engineer, you would typically perform transformations using the following tools: 

- Database SQL: first load your data into a database table. Then perform the transformation using `Structured Query Language (SQL)`.
- Pandas DataFrame: first load your data into a Pandas DataFrame. Then perform the transformation using pandas and/or Python functions.
- Spark DataFrame: Spark is a distributed computing framework that allows you to perform operations on very large datasets using many computers. First load your data into a Spark DataFrame. Then perform the transformation using Spark and/or user-defined functions.

For our exercise today, we will be using a Pandas DataFrame. 


# How do I transform? 

Deciding how to transform is where you will have to use your analytical skills, and also engage the Data Analyst or the end-user to determine what shape the data needs to be in order to answer business questions. For example, a business question could be: 
- "What are my most popular products?"
- "Which countries generate the highest revenue?"

For our exercise, we will refer to the database diagram below (also known as a database schema). The database diagram will tell us how the DataFrames relate to one another and therefore, how they should be combined together to answer business questions. 

<img src="../resources/ecommerce_database_schema.png" alt="schema" style="width:600px;"/>



In [None]:
# merge customers with orders and save result to a new DataFrame


In [None]:
# merge customers and orders with order items and save result to a new DataFrame


In [None]:
# merge customers, orders and order items with products and save result to a new DataFrame


In [None]:
# merge customers, orders, order items, and products with product category translation and save result to a new DataFrame
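One possible way to chain these four merges is sketched below with tiny stand-in DataFrames. The join keys (`customer_id`, `order_id`, `product_id`, `product_category_name`), the `product_category_name_english` column, the sample rows, and all DataFrame names are assumptions; check them against the schema diagram and your own variable names before reusing this.

```python
import pandas as pd

# Tiny hypothetical stand-ins for the five ingested DataFrames.
df_customers = pd.DataFrame({"customer_id": ["c1"], "customer_city": ["sao paulo"]})
df_orders = pd.DataFrame({"order_id": ["o1"], "customer_id": ["c1"]})
df_order_items = pd.DataFrame({
    "order_id": ["o1", "o1"],
    "product_id": ["p1", "p2"],
    "price": [10.0, 20.0],
})
df_products = pd.DataFrame({
    "product_id": ["p1", "p2"],
    "product_category_name": ["beleza_saude", "informatica_acessorios"],
})
df_translation = pd.DataFrame({
    "product_category_name": ["beleza_saude", "informatica_acessorios"],
    "product_category_name_english": ["health_beauty", "computers_accessories"],
})

# Each merge joins the running result with the next DataFrame on an assumed shared key.
df_customers_orders = pd.merge(df_customers, df_orders, on="customer_id", how="inner")
df_with_items = pd.merge(df_customers_orders, df_order_items, on="order_id", how="inner")
df_with_products = pd.merge(df_with_items, df_products, on="product_id", how="inner")
df_final = pd.merge(df_with_products, df_translation, on="product_category_name", how="inner")

# List every column in the final DataFrame.
print(df_final.columns.tolist())
```

An inner join keeps only rows that have a match on both sides. If unmatched rows should be kept, `how="left"` is the usual alternative.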


In [None]:
# view all the columns in our final DataFrame 


In [None]:
# keep the following columns: ["customer_id", "customer_city", "customer_state", "order_purchase_timestamp", "price", "freight_value", "product_category_name", "product_name_lenght", "product_description_lenght", "product_photos_qty"]


In [None]:
# add a new column called "total_value" which is the sum of the price and freight_value 
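A minimal sketch of the last two steps, selecting columns and adding `total_value`, is shown below. The `df_final` stand-in and its rows are made up, and only a subset of the columns listed above is used to keep the example short.

```python
import pandas as pd

# Hypothetical stand-in for the merged DataFrame.
df_final = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "order_id": ["o1", "o2"],
    "price": [10.0, 20.0],
    "freight_value": [2.5, 3.0],
})

# Keep only the columns of interest; copy() avoids modifying a view of df_final.
columns_to_keep = ["customer_id", "price", "freight_value"]
df_selected = df_final[columns_to_keep].copy()

# Column arithmetic is applied row by row.
df_selected["total_value"] = df_selected["price"] + df_selected["freight_value"]
print(df_selected)
```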


Great work! Your transformation steps are complete! The new DataFrame is ready to be saved to a CSV file for the Data Analyst to use for their data analysis. 

# Saving data 

After completing your transformation steps, you are ready to save your data so that others can consume the datasets you've produced. 

There are several locations you can save your data to: 
1. **Save dataset to a database table**: 
    - do this if the data analysts are only comfortable with SQL.
    - depending on the size of your dataset, you may choose a standard SQL database like MySQL, PostgreSQL, or Microsoft SQL Server. If your dataset is too large for a single database server, consider a Massively Parallel Processing (MPP) database such as Amazon Redshift, Snowflake, Google BigQuery, or Azure Synapse.


2. **Save dataset to a file**: 
    - do this if the data analysts are comfortable with accessing data with multiple languages like: SQL, Python, R, Scala. 
    - choose a file format that balances storage size (compression), query performance (how quickly the computer is able to read the file), and familiarity to your users. For example: Parquet, Avro, CSV, Delta Lake.
    - to make the data accessible to data analysts over the cloud, you will likely choose to store your data in a cloud file storage system, for example: AWS S3 buckets, Azure Data Lake, Google Cloud Storage.
    - files are now becoming more popular as the choice to store data because of new architectural paradigms such as the data lakehouse. 

To save a Pandas DataFrame to a CSV file, call `to_csv`:

```python
df.to_csv("your file path here", index=False) # index=False to remove the index column in the DataFrame when saving 
```
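Both save options can be tried locally with the sketch below. The `orders_enriched` file and table names and the sample rows are hypothetical, a temporary folder stands in for your output location, and an in-memory SQLite database stands in for a real database server.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical transformed data.
df = pd.DataFrame({"customer_id": ["c1", "c2"], "total_value": [12.5, 23.0]})

# Option 1: save to a file, then read it back to check the result.
output_path = Path(tempfile.mkdtemp()) / "orders_enriched.csv"
df.to_csv(output_path, index=False)
df_from_csv = pd.read_csv(output_path)

# Option 2: save to a database table (SQLite used as a stand-in).
conn = sqlite3.connect(":memory:")
df.to_sql("orders_enriched", conn, index=False, if_exists="replace")
df_from_db = pd.read_sql("SELECT * FROM orders_enriched", conn)
conn.close()

print(df_from_csv)
print(df_from_db)
```

Reading the data back after writing is a cheap way to confirm that the column names and row counts survived the round trip.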
