# Extracting Data from a Source System

I have written notes on extracting data before; you can go back to those to see examples and know how and where to start!
- Kim

This module, though, covers getting data from:
- CSV files
- Parquet files
- JSON files
- SQL databases

We can also extract from:
- APIs
- Data lakes
- Data warehouses
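
CSV and JSON sources are listed above but have no example in these notes, so here is a minimal sketch. The in-memory `io.StringIO` buffers and the column names `order_id` and `price` are made up for illustration; with real files you would pass a path such as `'data.csv'` or `'data.json'` instead.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for files on disk (illustration only)
csv_source = io.StringIO("order_id,price\n1,9.99\n2,4.50\n")
json_source = io.StringIO(
    '[{"order_id": 1, "price": 9.99}, {"order_id": 2, "price": 4.5}]'
)

# read_csv parses delimited text; read_json parses a list of records here
csv_data = pd.read_csv(csv_source)
json_data = pd.read_json(json_source, orient="records")

print(csv_data.shape)
print(json_data.shape)
```

Both calls return a regular DataFrame, so everything downstream (filtering, type changes, loading) works the same regardless of the source format.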

## Reading Parquet files

```python
import pandas as pd

data = pd.read_parquet('data.parquet', engine='fastparquet')
```

## Reading from a SQL Database

To read data from a database, an engine (connection) object has to be created first. This is done using SQLAlchemy's `create_engine` function:
```python
import sqlalchemy
import pandas as pd

# Connection URI: schema_identifier://username:password@host:port/db
connection_uri = 'postgresql+psycopg2://repl:password@localhost:5432/name_of_db'
db_engine = sqlalchemy.create_engine(connection_uri)
```

Once the engine is created, you can use it to query:
```python
data = pd.read_sql("SELECT * FROM table LIMIT 10", db_engine)
```
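
The PostgreSQL example above needs a running server, so here is a sketch you can run anywhere. It swaps in an in-memory SQLite database through the standard-library `sqlite3` module (which `pd.read_sql` also accepts) instead of SQLAlchemy with PostgreSQL. The `sales` table and its columns are assumptions made up for the example.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database: a stand-in for a real database server
connection = sqlite3.connect(":memory:")

# Hypothetical table and rows, created only so there is something to query
connection.execute("CREATE TABLE sales (order_id INTEGER, quantity INTEGER)")
connection.executemany(
    "INSERT INTO sales VALUES (?, ?)", [(1, 1), (2, 3), (3, 1)]
)
connection.commit()

# Same pattern as above: pass a query and a connection to read_sql
data = pd.read_sql(
    "SELECT * FROM sales WHERE quantity = 1 ORDER BY order_id", connection
)
connection.close()

print(data)
```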

## Modularity
Create a separate function for each step of ETL (extract, transform, load):

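```python
import pandas as pd
import sqlalchemy

def extract_from_sql(connection_uri, query):
    # Create an engine for the database, then run the query into a DataFrame
    db_engine = sqlalchemy.create_engine(connection_uri)
    return pd.read_sql(query, db_engine)

# 'uri' and 'query' are placeholders: pass a real connection URI and SQL query
extract_from_sql('uri', 'query')
```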

# Transforming Data with Pandas

- You can filter data with `loc` and `iloc`
- You can also alter datatypes, for example with `pd.to_datetime`:
    - `cleaned['timestamps'] = pd.to_datetime(cleaned['timestamps'], format='%Y%m%d%H%M%S')` parses strings in the given format
    - `cleaned['timestamps'] = pd.to_datetime(cleaned['timestamps'], unit='ms')` treats numbers as milliseconds since the Unix epoch

When you transform data, **you must validate the transformation**!
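
As a rough sketch of what validating a transformation can look like: filter, convert the datatype, then check that the result has the dtype and values you expect. The sample rows and the `price` column are invented for the example, and the `assert` checks are just one possible way to validate.

```python
import pandas as pd

# Hypothetical raw data (illustration only)
raw = pd.DataFrame(
    {
        "timestamps": ["20240101093000", "20240102101500"],
        "price": [10.0, 250.0],
    }
)

# Transform: keep rows with price above 100, then parse the timestamp strings
cleaned = raw.loc[raw["price"] > 100, :].copy()
cleaned["timestamps"] = pd.to_datetime(
    cleaned["timestamps"], format="%Y%m%d%H%M%S"
)

# Validate: the column is now a datetime and the filter actually held
assert pd.api.types.is_datetime64_any_dtype(cleaned["timestamps"])
assert (cleaned["price"] > 100).all()

print(cleaned.shape)
print(cleaned.dtypes)
```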

## Example of Extracting and Transforming: Use Case

```python
import pandas as pd

def extract(file_path):
    # Read the Parquet file into a DataFrame
    raw_data = pd.read_parquet(file_path)
    return raw_data

raw_sales_data = extract("sales_data.parquet")

def transform(raw_data):
    # Filter rows (quantity of 1) and keep only the needed columns
    clean_data = raw_data.loc[raw_data['Quantity Ordered'] == 1, ['Order ID', 'Price Each', 'Quantity Ordered']]
    return clean_data

# Transform the raw_sales_data
clean_sales_data = transform(raw_sales_data)
```

# Loading Data

Persisting data to a file gives you a snapshot of the data at that point in time.

This is easy with the `to_csv()` method:

```python
import pandas as pd

# data extraction and transformation
raw_data = pd.read_csv('data.csv')

transformed_data = raw_data.loc[raw_data['col'] > 100, ['col2', 'col3']]

# load data
transformed_data.to_csv('loaded_data.csv')
```

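There are arguments to customize the loaded CSV files:
- `header`: whether to write the column names
- `index`: whether to write the row index
- `sep`: the delimiter placed between values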
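
A small sketch of these three arguments together. The DataFrame is invented, and the output is written to an in-memory buffer rather than a file so the result is easy to inspect; with a real pipeline you would pass a file path.

```python
import io

import pandas as pd

# Hypothetical transformed data (illustration only)
transformed_data = pd.DataFrame({"col2": [1, 2], "col3": ["a", "b"]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
transformed_data.to_csv(buffer, header=True, index=False, sep="|")

csv_text = buffer.getvalue()
print(csv_text)
```

With `index=False` the row labels are left out, which is usually what you want when the index is just the default 0, 1, 2, ...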

## Ensuring Data Persistence
How do we know that what we just loaded is correct?
- We can check that the file exists at the expected path using the `os` module

```python
# continuing from the above
import os

file_exists = os.path.exists('loaded_data.csv')
print(file_exists)
```
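
Checking that the file exists only proves something was written. One possible extra check (my own addition, not the only way) is to read the file back and compare it with what you meant to load. The DataFrame and the temporary directory here are assumptions made for the sketch.

```python
import os
import tempfile

import pandas as pd

# Hypothetical transformed data (illustration only)
transformed_data = pd.DataFrame({"col2": [1, 2], "col3": [3, 4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "loaded_data.csv")

    # Load: persist the data without the row index
    transformed_data.to_csv(file_path, index=False)

    # Check 1: the file is where we expect it
    file_exists = os.path.exists(file_path)

    # Check 2: reading it back gives the same data
    reloaded = pd.read_csv(file_path)
    round_trip_ok = reloaded.equals(transformed_data)

print(file_exists, round_trip_ok)
```

Note that `equals` also compares dtypes, so columns like datetimes need `parse_dates` on the way back in or the comparison will fail.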

# Monitor Pipelines
- Pipelines should be monitored because the data can change and failures can happen
- Sometimes a package can be deprecated

## Logging Data Pipeline Performance
- This is just documenting performance at execution time
- You can use the `logging` module

The following are the functions you can use from that module to log what your pipeline is doing:

**Functions**

| Functions | Description | Syntax |
|---|---|---|
| `logging.debug` | Logs messages that are typically used during development for detailed information. | `logging.debug(message)` |
| `logging.info` | Logs messages that provide general information about the pipeline's execution. | `logging.info(message)` |
| `logging.warning` | Logs messages about unexpected events that don't necessarily stop the pipeline. | `logging.warning(message)` |
| `logging.error` | Logs messages about errors that have occurred and might halt the pipeline. | `logging.error(message)` |

**Arguments**

| Arguments | Description | Syntax | Example Values |
|---|---|---|---|
| `message` | The string message to be logged. | `logging.debug(message)`, `logging.info(message)`, `logging.warning(message)`, `logging.error(message)` | `"Data dimensionality: (100, 5)"`, `"Starting data transformation"`, `"Unexpected number of rows: 95"`, `"KeyError: 'price_change'"` |
| `alias` | A name to refer to the caught exception within the `except` block. | `except SpecificError as alias:` | `e`, `err`, `key_error` |

## Sample Logs
Logs provide a starting point when something fails, as they tell data engineers what the pipeline was doing.

```python
# sample with debug and info
import logging

logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG)

# example value so the debug message has something to report
path = 'data.parquet'

# debug: detailed information, mostly useful during development
logging.debug(f"Variable has value {path}.")

# info: general progress updates for co-engineers
logging.info("Data has been transformed and will now be loaded.")
```

**Warnings and Errors** should also be captured using logs!

- **Warning** logs are used when something unexpected happened but an exception has not occurred (ex: unexpected number of rows)
- **Error** logs are used when an exception occurs that should halt the execution of the pipeline
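
A minimal sketch of the warning case, using the row-count example from the bullet above. `check_row_count` is a hypothetical helper I made up for illustration; it is not part of the `logging` module.

```python
import logging

logging.basicConfig(format="%(levelname)s: %(message)s", level=logging.DEBUG)


def check_row_count(row_count, expected_rows):
    """Hypothetical helper: warn when the row count is unexpected."""
    if row_count != expected_rows:
        # Unexpected, but no exception was raised, so this is a warning
        logging.warning(
            f"Unexpected number of rows: {row_count} (expected {expected_rows})."
        )
        return False

    logging.info(f"Row count is as expected: {row_count}.")
    return True


check_row_count(95, 100)
check_row_count(100, 100)
```

The pipeline keeps running after the warning; the log just leaves a trail to follow if something looks off later.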

### Handling exceptions with try-except
- When an error occurs inside the `try` block, the code in the `except` block runs instead of the program ending

```python
try:
    # code that might raise an exception goes here
    pass
except:
    # logging about failures that occurred
    # logic to execute upon exception
    pass
```

You already know this, Kim, Java has it too. If you know which exception to expect, name it right after the `except` keyword. For example:

```python
try:
    data = transform(data)
    logging.info("Successfully filtered DataFrame by ... ")

except KeyError as ke:
    # handle the error: create a new column, then transform again
    logging.warning(f"{ke}: Cannot filter DataFrame by ...")
    data['newCol'] = data['oldCol'] - data['oldCol1']
    data = transform(data)
```