# Introduction to Preliminary Global Extraction, Transformation, and Loading (ETL) Process

This notebook starts the Extraction, Transformation, and Loading (ETL) process by extracting data from a drive provided by the client. The extracted data is the basis for the later analysis and manipulation steps that lead to the final product specified by the client. The extraction relies on the Google API client libraries that are installed and imported below.

We keep the code concise and simple by minimizing the number of action cells. Where a cell requires interaction with the code, detailed comments explain what it does.

The structure of our code follows a modular approach, reminiscent of the Model-View-Controller (MVC) pattern. The sections include:

1. **Define Data Extraction Function:** This section encapsulates the functionality responsible for extracting data from the client's drive and transporting it to the lakehouse.

2. **Load Utils File:** Here, we incorporate a dedicated Utils file containing essential functions for seamless library integration and specific actions required for our ETL process.

3. **View Cell:** The View Cell serves as the interface, facilitating the visualization and interpretation of our processed data.

We call this structure the Library-Action-View (LAV) paradigm.

___
## Install packages that are not available by default

In [5]:
pip install gdown google-api-python-client google-auth google-auth-oauthlib google-auth-httplib2

Collecting gdown
  Downloading gdown-5.1.0-py3-none-any.whl (17 kB)
Collecting google-api-python-client
  Downloading google_api_python_client-2.118.0-py2.py3-none-any.whl (12.1 MB)
Collecting google-auth-httplib2
  Downloading google_auth_httplib2-0.2.0-py2.py3-none-any.whl (9.3 kB)
Collecting httplib2<1.dev0,>=0.15.0 (from google-api-python-client)
  Downloading httplib2-0.22.0-py3-none-any.whl (96 kB)
Collecting google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5 (from google-api-python-client)
  Downloading google_api_core-2.17.1-py3-none-any.whl (137 kB)
Collecting uritemplate<5,>

## Import the libraries needed to extract data

In [6]:
import io
import os
import sys
import builtin.utils as ut
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from googleapiclient.errors import HttpError
from google.oauth2 import service_account
from google.auth.transport.requests import Request

In [7]:
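# Smoke test: extract a small test folder before using the client's data
drive_id = '1-yvBLKCJt8g_BZif-YLsx3YHKkdrj6BT'
test_folder = '/lakehouse/default/Files/otra_prueba'

# Download the contents of the Drive folder into the test path
ut.json_extract(drive_id, test_folder)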

Entering metadataSitiosPrueba
Entering reviewEstadosPrueba
Entering review-Washington / reviewEstadosPrueba
Entering review-Wyoming / reviewEstadosPrueba
Entering review-Texas / reviewEstadosPrueba
Entering yelpPrueba


We first run the extraction against a test folder ID and path to verify that the code works. <br>
Once the test succeeds, we run it on the data provided by the client.
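
One way to confirm that the test run produced files is to count what landed in the destination folder. The sketch below uses only the standard library; `summarize_folder` is a hypothetical helper and is not part of the utils file.

```python
import os


def summarize_folder(root):
    """Return {relative subfolder: number of files} for every folder under root."""
    counts = {}
    for current_dir, _subdirs, files in os.walk(root):
        # Key each folder by its path relative to root ('.' is root itself)
        relative = os.path.relpath(current_dir, root)
        counts[relative] = len(files)
    return counts
```

Calling `summarize_folder('/lakehouse/default/Files/otra_prueba')` after the test cell would show, per subfolder, how many files were written.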

___
## 1. Extract data

We have defined a function that extracts the data provided by the client from different folders in Google Drive.<br>
This function is hosted in the utils file, which we import from builtin.utils as ut.
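
The implementation of `json_extract` lives in the utils file and is not shown in this notebook. As a rough illustration only, a function of this kind could walk a Drive folder recursively as sketched below. The sketch assumes `service` is a Drive API v3 client (for example one created with `build('drive', 'v3', credentials=...)`); the names `list_children` and `walk_drive_folder` are hypothetical and do not come from the utils file.

```python
import os

# MIME type that the Drive API uses to mark folders
FOLDER_MIME = "application/vnd.google-apps.folder"


def list_children(service, folder_id):
    """Yield every item directly inside a Drive folder, following pagination."""
    page_token = None
    while True:
        response = service.files().list(
            q=f"'{folder_id}' in parents and trashed = false",
            fields="nextPageToken, files(id, name, mimeType)",
            pageToken=page_token,
        ).execute()
        yield from response.get("files", [])
        page_token = response.get("nextPageToken")
        if page_token is None:
            break


def walk_drive_folder(service, folder_id, destination):
    """Yield (file_id, local_path) for each non-folder item under folder_id."""
    for item in list_children(service, folder_id):
        target = os.path.join(destination, item["name"])
        if item["mimeType"] == FOLDER_MIME:
            # Recurse so the local layout mirrors the Drive folder tree
            yield from walk_drive_folder(service, item["id"], target)
        else:
            yield item["id"], target
```

Each yielded pair could then be handed to a download step, for example `service.files().get_media(fileId=file_id)` combined with `MediaIoBaseDownload`, which this notebook already imports.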

### 1.1 Extract data from Yelp folder

We store the Google Drive folder ID in a variable called drive_id. <br>
A second variable, main_folder_path, stores the path to the folder for the original files.

In [8]:
drive_id = '1TI-SsMnZsNP6t930olEEWbBQdo_yuIZF'
main_folder_path = '/lakehouse/default/Files/original/Yelp'

# Call the function with the Drive folder ID and the destination folder
ut.json_extract(drive_id, main_folder_path)

### 1.2 Extract data from the 'metadata-sitios' and 'reviews-estados' folders

As with the Yelp folder, we store the Google Drive folder ID and the path to the folder for the original data in variables.

In [9]:
drive_id = '1Wf7YkxA0aHI3GpoHc9Nh8_scf5BbD4DA'
main_folder_path = '/lakehouse/default/Files/original'

# Call the function with the Drive folder ID and the destination folder
ut.json_extract(drive_id, main_folder_path)

Entering metadata-sitios
Entering reviews-estados
Entering review-Wyoming / reviews-estados
Entering review-Virginia / reviews-estados
Entering review-South_Dakota / reviews-estados
Entering review-West_Virginia / reviews-estados
Entering review-Wisconsin / reviews-estados
Entering review-Vermont / reviews-estados
Entering review-Utah / reviews-estados
Entering review-Tennessee / reviews-estados
Entering review-South_Carolina / reviews-estados
Entering review-Texas / reviews-estados
Entering review-Washington / reviews-estados
Entering review-North_Carolina / reviews-estados
Entering review-Rhode_Island / reviews-estados
Entering review-North_Dakota / reviews-estados
Entering review-New_Jersey / reviews-estados
Entering review-Pennsylvania / reviews-estados
Entering review-Oregon / reviews-estados
Entering review-New_Mexico / reviews-estados
Entering review-Ohio / reviews-estados
Entering review-New_York / reviews-estados
Entering review-Oklahoma / reviews-estados
Entering review-Misso

Using the "json_extract" function located in utils, we extracted the raw data stored in the client's Google Drive folders and saved it to a local directory in the Microsoft Fabric lakehouse.
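
Both extraction cells above follow the same pattern: set a folder ID, set a destination, call the function. If more folders are added later, the pairs could be kept in one list and processed in a loop. The sketch below is only an illustration; `Source` and `run_extractions` are hypothetical names, and the extraction function is passed in as a parameter (in this notebook it would be `ut.json_extract`).

```python
from typing import NamedTuple


class Source(NamedTuple):
    drive_id: str      # Google Drive folder ID
    destination: str   # lakehouse path that receives the files


# The two folder/destination pairs used in sections 1.1 and 1.2
SOURCES = [
    Source("1TI-SsMnZsNP6t930olEEWbBQdo_yuIZF",
           "/lakehouse/default/Files/original/Yelp"),
    Source("1Wf7YkxA0aHI3GpoHc9Nh8_scf5BbD4DA",
           "/lakehouse/default/Files/original"),
]


def run_extractions(sources, extract):
    """Call extract(drive_id, destination) for each source; return how many ran."""
    completed = 0
    for source in sources:
        extract(source.drive_id, source.destination)
        completed += 1
    return completed
```

In the notebook this would be called as `run_extractions(SOURCES, ut.json_extract)`.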