# 04.1 Introduction to extraction processes

The goal of this laboratory session is for you to implement and orchestrate extraction processes for the AdventureWorks data warehouse.

First, let's recall the available data sources:

- AdventureWorks Core (MySQL)
- AdventureWorks HR Files (HTTP Server)
- AdventureWorks Reviews API (REST API)

Each source is of a different kind and requires its own approach to extract data from it.

## 1. Resources for extracting data from a MySQL database

Most databases require a dedicated driver (connector) to communicate with them. In this case, we installed the MySQL JDBC driver during the image build, in the Dockerfile (`aw-dwh/config/aio/dockerfile`):

```
# Download and install MySQL JDBC driver
RUN curl -L -o /opt/spark/jars/mysql-connector-j-${MYSQL_VERSION}.jar \
    "https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/${MYSQL_VERSION}/mysql-connector-j-${MYSQL_VERSION}.jar" 
```

With the driver installed, we defined a Dagster resource to represent a MySQL database. Remember from the previous lab that a Dagster resource should always represent an external service.

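```python
from abc import ABC, abstractmethod
from typing import Any

from dagster import ConfigurableResource
from pydantic import Field


class MySQLResource(ConfigurableResource, ABC):
    """Base resource holding the connection settings for a MySQL database."""

    host: str = Field(description="The hostname or IP address of the MySQL server.")
    port: str = Field(default="3306", description="The port on which the MySQL server is listening, defaults to 3306.")
    database: str = Field(description="The name of the MySQL database to connect to.")
    user: str = Field(description="The username for authenticating with the MySQL server.")
    password: str = Field(description="The password associated with the given username.")

    @abstractmethod
    def fetch_table(self, table_name: str) -> Any:
        """Extract a full table; each subclass decides how."""

    def _get_connection_string(self) -> str:
        # JDBC URL; user and password are passed separately by the caller
        return f"jdbc:mysql://{self.host}:{self.port}/{self.database}"
```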

The `MySQLResource` class receives the database credentials as configuration fields and is able to build a JDBC connection string for the database.
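To make the shape of the connection string concrete, here is a minimal standalone sketch of the same formatting logic, without Dagster. The `JdbcTarget` class and the host and database values are hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JdbcTarget:
    """Hypothetical stand-in for the connection fields of MySQLResource."""

    host: str
    port: str
    database: str

    def connection_string(self) -> str:
        # Same format as MySQLResource._get_connection_string
        return f"jdbc:mysql://{self.host}:{self.port}/{self.database}"


# Placeholder values, not the lab's real configuration
target = JdbcTarget(host="localhost", port="3306", database="example_db")
url = target.connection_string()
print(url)  # jdbc:mysql://localhost:3306/example_db
```

Note that the user and password are not part of the URL; in the lab code they are passed separately as connection properties.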

There is also an abstract method, `fetch_table`, that lets each subclass customize how data is extracted from a table. In this case, we used PySpark to extract the data:

```python
from dagster import ResourceDependency
from dagster_pyspark import PySparkResource  # assumed to come from dagster-pyspark
from pyspark.sql import DataFrame


class PySparkMySQLResource(MySQLResource):

    pyspark: ResourceDependency[PySparkResource]

    def fetch_table(self, table_name: str) -> DataFrame:
        # Read the whole table over JDBC into a Spark DataFrame
        df = self.pyspark.spark_session.read.jdbc(
            url=self._get_connection_string(),  # the connection string
            table=table_name,
            properties={
                "user": self.user,
                "password": self.password,
                "driver": "com.mysql.cj.jdbc.Driver",  # the JDBC driver class for the connection
                "fetchSize": "10000",  # how many rows to fetch at a time
            },
        )

        return df
```
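Because `fetch_table` is abstract, the extraction backend can be swapped without touching the code that calls it. The sketch below shows the same pattern in plain Python. `TableSource`, `InMemoryTableSource`, and `count_rows` are hypothetical names (not part of the lab code), and the in-memory variant is the kind of stand-in one might use in a unit test instead of a live MySQL server.

```python
from abc import ABC, abstractmethod
from typing import Any


class TableSource(ABC):
    """Hypothetical, simplified counterpart of MySQLResource."""

    @abstractmethod
    def fetch_table(self, table_name: str) -> Any:
        """Return the contents of the given table."""


class InMemoryTableSource(TableSource):
    """Test double that serves rows from a dict instead of a database."""

    def __init__(self, tables: dict[str, list[dict[str, Any]]]) -> None:
        self._tables = tables

    def fetch_table(self, table_name: str) -> list[dict[str, Any]]:
        if table_name not in self._tables:
            raise KeyError(f"Unknown table: {table_name}")
        # Return a copy so callers cannot mutate the stored list
        return list(self._tables[table_name])


def count_rows(source: TableSource, table_name: str) -> int:
    # Works with any TableSource implementation
    return len(source.fetch_table(table_name))


# Placeholder table name and rows, for illustration only
source = InMemoryTableSource({"example_table": [{"id": 1}, {"id": 2}]})
row_count = count_rows(source, "example_table")
```

Code that depends only on the base class, like `count_rows` here, does not need to know whether the rows come from PySpark, another client library, or a test fixture.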

## 2. Defining the assets

To hold the asset code, we created a `landing` folder inside the assets folder. Remember that this folder must be a Python package, so we also created an `__init__.py` file.

To define all the AdventureWorks Core (MySQL) assets, we created the `aw_core_mysql.py` file.

```python
import dagster as dg
from adventureworks_orchestration.resources.mysql_resource import MySQLResource
from adventureworks_orchestration.constants import *
from pyspark.sql import DataFrame


# You can also define your assets dynamically. Unlike a multi-asset,
# dynamically defined assets support native parallelism.

aw_core_tables = []  # ???


def get_landing_aw_core_table_asset(table: str):
    raise NotImplementedError()


def get_bronze_aw_core_assets():
    # Build one asset per table listed in aw_core_tables
    return [get_landing_aw_core_table_asset(table) for table in aw_core_tables]


ASSETS = get_bronze_aw_core_assets()
```

This file follows a different approach to declaring assets than the one we used in the previous lab. In this case, we defined a `get_landing_aw_core_table_asset` function that, given a table name, constructs the corresponding asset. We then build the assets for all tables and expose them as the `ASSETS` constant in the file.
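One reason to build each asset through a function is that every call gets its own `table` value. The plain-Python sketch below illustrates the pattern without Dagster. `make_extractor` and the table names are hypothetical placeholders, and the returned functions only produce a string where a real asset would read from the database.

```python
from typing import Callable


def make_extractor(table: str) -> Callable[[], str]:
    """Hypothetical factory: builds one function per table."""

    def extract() -> str:
        # `table` is bound when make_extractor is called, so each
        # returned function keeps its own table name
        return f"extracting {table}"

    # Give each function a distinct name so they can be told apart
    extract.__name__ = f"landing_{table}"
    return extract


# Placeholder table names, not the actual AdventureWorks list
tables = ["table_a", "table_b"]
extractors = [make_extractor(t) for t in tables]

names = [fn.__name__ for fn in extractors]
results = [fn() for fn in extractors]
```

In the lab file, `get_landing_aw_core_table_asset` plays the role of `make_extractor`, and the list comprehension corresponds to `get_bronze_aw_core_assets`.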

Don't forget to register your new assets in the `definitions.py` file!