### **Tutorial 9: Extracting and Transforming Data**

In this tutorial, we will walk through the process of **extracting** and **transforming** data from a SQL database using Python. We will:
- Connect to a SQL Server database using `SQLAlchemy` and `pyodbc`
- Extract data from the **Sales.Orders** table
- Transform the data to keep only the **date-related columns**

---

#### **Step 1: Install Required Libraries**
Before starting, ensure you have the required Python libraries installed. If not, install them using:

```bash
pip install pandas sqlalchemy pyodbc
```
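To confirm the installation before moving on, a quick check such as the following may help. It is only a sketch: the helper name `check_libraries` is hypothetical, and it reports whether each package can be found without importing it.

```python
import importlib.util

# Packages this tutorial relies on (taken from the pip command above)
REQUIRED = ["pandas", "sqlalchemy", "pyodbc"]


def check_libraries(names):
    """Return a dict mapping each package name to True if it is installed."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_libraries(REQUIRED)
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'MISSING'}")
```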

---

#### **Step 2: Import Required Libraries**
We begin by importing the necessary libraries:

```python
import urllib.parse  # used to URL-encode the ODBC connection string

import pandas as pd
from sqlalchemy import create_engine
```

---

#### **Step 3: Define the Database Connection**
We build an ODBC connection string, URL-encode it, and pass it to **SQLAlchemy**'s `create_engine()` to connect to a Microsoft SQL Server database.

```python
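# Define the ODBC connection parameters (placeholders: replace with your own values)
server = 'your_server'
database = 'your_database'
username = 'your_username'
password = 'your_password'
driver = 'ODBC Driver 18 for SQL Server'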

params = urllib.parse.quote_plus(
    f"DRIVER={{{driver}}};SERVER={server};DATABASE={database};"
    f"UID={username};PWD={password};ENCRYPT=yes;TrustServerCertificate=yes"
)

# Create the database engine
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}")
```

Replace:
- **`your_server`** with the actual SQL Server name.
- **`your_database`** with your database name.
- **`your_username`** and **`your_password`** with your credentials.
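The encoding step matters because credentials often contain characters such as `@` or spaces that are not safe inside a URL. The sketch below wraps the connection-string logic in a hypothetical helper, `build_odbc_params`, and reads the values from environment variables; the variable names (`SQL_SERVER`, `SQL_DATABASE`, `SQL_USERNAME`, `SQL_PASSWORD`) are assumptions chosen for illustration, not something the tutorial defines.

```python
import os
import urllib.parse


def build_odbc_params(server, database, username, password,
                      driver="ODBC Driver 18 for SQL Server"):
    """Build the ODBC connection string and URL-encode it for SQLAlchemy."""
    odbc_string = (
        f"DRIVER={{{driver}}};SERVER={server};DATABASE={database};"
        f"UID={username};PWD={password};ENCRYPT=yes;TrustServerCertificate=yes"
    )
    # quote_plus escapes characters such as '@', ';', '=' and spaces
    return urllib.parse.quote_plus(odbc_string)


# Hypothetical environment variable names; the fallbacks are the placeholders above
params = build_odbc_params(
    server=os.environ.get("SQL_SERVER", "your_server"),
    database=os.environ.get("SQL_DATABASE", "your_database"),
    username=os.environ.get("SQL_USERNAME", "your_username"),
    password=os.environ.get("SQL_PASSWORD", "your_password"),
)

# This is the URL that would be passed to create_engine()
url = f"mssql+pyodbc:///?odbc_connect={params}"
print(url)
```

Keeping credentials out of the notebook itself also avoids committing a password to version control.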

---

```python
import urllib.parse

import pandas as pd
import pyodbc  # ODBC driver used by the mssql+pyodbc dialect
from sqlalchemy import create_engine

# Avoid SettingWithCopyWarning
pd.options.mode.chained_assignment = None

# Database credentials and connection setup
server = "10.30.0.10"
database = "WideWorldImporters"
username = "STUDENT"
password = "stu@cmpt326"
driver = 'ODBC Driver 18 for SQL Server'

# URL-encode the ODBC connection string so it can be embedded in the engine URL
params = urllib.parse.quote_plus(
    f"DRIVER={{{driver}}};SERVER={server};DATABASE={database};"
    f"UID={username};PWD={password};ENCRYPT=yes;TrustServerCertificate=yes"
)

engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}")


def extract(engine):
    """Read every row and column of Sales.Orders into a DataFrame."""
    query = "SELECT * FROM Sales.Orders;"
    raw_data = pd.read_sql(query, engine)
    return raw_data


def transform(raw_data):
    """Keep the columns that pd.to_datetime can parse without raising."""
    clean_data = raw_data.copy()
    date_columns = []

    for col in clean_data.columns:
        try:
            # Note: integer and float columns also parse without error
            # (they are read as epoch offsets), so they are kept as well.
            pd.to_datetime(clean_data[col], errors='raise')
            date_columns.append(col)
        except (ValueError, TypeError):
            continue

    clean_data = clean_data[date_columns]
    return clean_data


# Execute ETL
raw_sales_data = extract(engine)
clean_data = transform(raw_sales_data)

# Display the result
print(clean_data.head())
```


Output (cut off in the source):

```text
   OrderID  CustomerID  SalespersonPersonID  PickedByPersonID  \
0        1         832                    2               NaN
1        2         803                    8               NaN
2        3         105                    7               NaN
3        4          57                   16               3.0
4        5         905                    3               NaN

   ContactPersonID  BackorderOrderID   OrderDate ExpectedDeliveryDate  \
0             3032              45.0  2013-01-01           2013-01-02
1             3003              46.0  2013-01-01           2013-01-02
2             1209              47.0  2013-01-01           2013-01-02
3             1113               NaN  2013-01-01           2013-01-02
4             3105              48.0  2013-01-01           2013-01-02

  Comments DeliveryInstructions InternalComments PickingCompletedWhen  \
0     None                 None             None  2013-01-01 12:00:00
1     None
```
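The output above still contains columns such as `OrderID`, because `pd.to_datetime` accepts numbers and therefore never raises for them. One possible stricter approach is to look at what each column actually holds instead of trying to parse it. The sketch below is not the tutorial's method: `find_date_columns` is a hypothetical helper, and it assumes the driver returns SQL `date` values as Python `datetime.date` objects and timestamp values as `datetime64` columns.

```python
import datetime as dt

import pandas as pd


def find_date_columns(df):
    """Return the names of columns that hold date or datetime values."""
    date_columns = []
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_datetime64_any_dtype(series):
            # Native pandas datetime column
            date_columns.append(col)
        elif series.dtype == object:
            # Object column: accept it only if every non-null value is a date
            non_null = series.dropna()
            if not non_null.empty and non_null.map(
                lambda value: isinstance(value, (dt.date, dt.datetime))
            ).all():
                date_columns.append(col)
    return date_columns


# Small made-up sample that mimics a few Sales.Orders columns
sample = pd.DataFrame({
    "OrderID": [1, 2],
    "OrderDate": [dt.date(2013, 1, 1), dt.date(2013, 1, 1)],
    "Comments": [None, None],
    "PickingCompletedWhen": pd.to_datetime(["2013-01-01 12:00:00", None]),
})

date_only = sample[find_date_columns(sample)]
print(date_only.columns.tolist())
```

With this check, integer identifiers and all-null text columns are skipped. Date-like strings are skipped too, which is a deliberate simplification.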


#### **Step 4: Extract Data from SQL Database**
**Explanation:**
- The `extract()` function executes an SQL query to retrieve all columns from the **Sales.Orders** table.
- `pd.read_sql()` fetches the query results into a **pandas DataFrame**.
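To experiment with the extract step without access to SQL Server, the same pattern can be tried against an in-memory SQLite database. This is a sketch under stated assumptions: the table is called `Orders` rather than `Sales.Orders` (SQLite has no schemas), the rows are made up, and the `query` and `date_columns` arguments are additions for illustration. It also shows the `parse_dates` argument of `pd.read_sql()`, which converts the named columns to datetime while loading.

```python
import sqlite3

import pandas as pd


def extract(con, query, date_columns=None):
    """Run a query and return the result as a DataFrame."""
    return pd.read_sql(query, con, parse_dates=date_columns)


# Build a tiny stand-in table with made-up rows
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Orders (OrderID INTEGER, OrderDate TEXT, Comments TEXT)")
con.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(1, "2013-01-01", None), (2, "2013-01-02", "rush")],
)
con.commit()

# Selecting only the needed columns keeps the transfer small
orders = extract(
    con,
    "SELECT OrderID, OrderDate FROM Orders",
    date_columns=["OrderDate"],
)
print(orders.dtypes)
con.close()
```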

---

#### **Step 5: Transform the Data**
Define a `transform()` function that keeps only the columns that **contain no null values**, then converts the remaining **date-related columns** to **datetime** objects.

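Example code:

```python
# Inside transform(raw_data): keep only the columns that contain no null values
no_null_columns = raw_data.columns[~raw_data.isnull().any()].to_list()
clean_data = raw_data[no_null_columns].copy()

# Convert the date-related columns to datetime
# (column names taken from the expected output below)
for col in ["OrderDate", "ExpectedDeliveryDate", "LastEditedWhen"]:
    clean_data[col] = pd.to_datetime(clean_data[col])
```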
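Put together, the two steps could look like the function below. It is a minimal sketch rather than the reference solution: the `date_columns` parameter is an assumption added so the caller names the columns to convert, and the sample rows are made up to show a column with nulls being dropped.

```python
import pandas as pd


def transform(raw_data, date_columns):
    """Drop columns containing nulls, then convert the named date columns."""
    # Columns where no value is null
    no_null_columns = raw_data.columns[~raw_data.isnull().any()].to_list()
    # Copy so the assignments below do not touch raw_data
    clean_data = raw_data[no_null_columns].copy()

    for col in date_columns:
        # Skip date columns that were dropped because they contained nulls
        if col in clean_data.columns:
            clean_data[col] = pd.to_datetime(clean_data[col])
    return clean_data


# Made-up sample: PickedByPersonID contains a null and should be removed
sample = pd.DataFrame({
    "OrderID": [1, 2],
    "PickedByPersonID": [None, 3.0],
    "OrderDate": ["2013-01-01", "2013-01-01"],
    "LastEditedWhen": ["2013-01-01 12:00:00", "2013-01-01 11:00:00"],
})

clean = transform(sample, ["OrderDate", "LastEditedWhen", "PickingCompletedWhen"])
print(clean.dtypes)
```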


#### **Expected Output**

| OrderID | CustomerID | SalespersonPersonID | ContactPersonID | OrderDate  | ExpectedDeliveryDate | CustomerPurchaseOrderNumber | IsUndersupplyBackordered | LastEditedBy | LastEditedWhen         |
|---------|-----------|---------------------|-----------------|------------|----------------------|----------------------------|--------------------------|--------------|------------------------|
| 1       | 832       | 2                   | 3032            | 2013-01-01 | 2013-01-02          | 12126                      | True                     | 7            | 2013-01-01 12:00:00    |
| 2       | 803       | 8                   | 3003            | 2013-01-01 | 2013-01-02          | 15342                      | True                     | 7            | 2013-01-01 12:00:00    |
| 3       | 105       | 7                   | 1209            | 2013-01-01 | 2013-01-02          | 12211                      | True                     | 7            | 2013-01-01 12:00:00    |
| 4       | 57        | 16                  | 1113            | 2013-01-01 | 2013-01-02          | 17129                      | True                     | 3            | 2013-01-01 11:00:00    |
| 5       | 905       | 3                   | 3105            | 2013-01-01 | 2013-01-02          | 10369                      | True                     | 7            | 2013-01-01 12:00:00    |
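A result can be compared against this table programmatically. The helper below, `check_output`, is hypothetical and only checks two properties: that the columns match the header above and that no null values remain. The sample row reuses the first row of the table.

```python
import pandas as pd

# Column names copied from the expected output table
EXPECTED_COLUMNS = [
    "OrderID", "CustomerID", "SalespersonPersonID", "ContactPersonID",
    "OrderDate", "ExpectedDeliveryDate", "CustomerPurchaseOrderNumber",
    "IsUndersupplyBackordered", "LastEditedBy", "LastEditedWhen",
]


def check_output(clean_data):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if list(clean_data.columns) != EXPECTED_COLUMNS:
        problems.append("columns differ from the expected output")
    if clean_data.isnull().any().any():
        problems.append("null values remain")
    return problems


# First row of the expected output table
first_row = pd.DataFrame([{
    "OrderID": 1, "CustomerID": 832, "SalespersonPersonID": 2,
    "ContactPersonID": 3032,
    "OrderDate": pd.Timestamp("2013-01-01"),
    "ExpectedDeliveryDate": pd.Timestamp("2013-01-02"),
    "CustomerPurchaseOrderNumber": "12126",
    "IsUndersupplyBackordered": True, "LastEditedBy": 7,
    "LastEditedWhen": pd.Timestamp("2013-01-01 12:00:00"),
}])

print(check_output(first_row))
```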

---