<a href="https://colab.research.google.com/github/matthewpecsok/data_engineering/blob/main/tutorials/de_data_transformations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import os
import sqlite3

import numpy as np  # needed later for np.nan
import pandas as pd

In [None]:
!wget -O northwind.db https://github.com/matthewpecsok/data_engineering/raw/main/data/northwind.db

In [None]:
conn = sqlite3.connect("northwind.db")

In [None]:
cur = conn.cursor()

# intentionally create some nulls

In [None]:
cur.execute('update products set unitprice = null where productid  in (1,4,15,22,30,35,38,40,55)')
conn.commit()

In [None]:
products_df = pd.read_sql("SELECT * FROM Products;", conn)
products_df.shape

In [None]:
products_df.head(6)

Notice that the `UnitPrice` non-null count is now 68, because the update above set nine prices to null.

In [None]:
products_df.info()
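
Instead of reading the non-null counts off `info()`, the nulls can also be counted directly. The following is a minimal sketch on a small made-up frame (`demo_df` and its values are illustrative, not taken from the Northwind data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for products_df with two missing prices
demo_df = pd.DataFrame({
    "ProductID": [1, 2, 3],
    "UnitPrice": [np.nan, 19.0, np.nan],
})

# isna() marks missing cells as True; sum() counts the True values per column
null_counts = demo_df.isna().sum()
print(null_counts)
```

The same call on `products_df` should report the nulls per column without any arithmetic on the row count.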

# dealing with nulls

## nulls option 1 - sql COALESCE

COALESCE returns the first non-null value in its argument list, which lets us substitute a default when a column is NULL. In this case we return 0.0 instead of NULL. COALESCE accepts more than two arguments, so several fallbacks can be tried in order.

https://www.w3schools.com/sql/func_sqlserver_coalesce.asp

This may at first seem uninteresting, but the arguments can themselves be scalar subqueries, which makes COALESCE quite versatile.
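
As a sketch of a query used as a COALESCE argument, the example below falls back to the average of the known prices and only then to 0.0. The table `demo_products`, its columns, and its rows are hypothetical and live in an in-memory database, so the Northwind file is not touched.

```python
import sqlite3

# Hypothetical in-memory table standing in for Products
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_products (name TEXT, unit_price REAL)")
demo_conn.executemany(
    "INSERT INTO demo_products VALUES (?, ?)",
    [("chai", 18.0), ("tofu", 22.0), ("mystery", None)],
)

# COALESCE returns the first non-NULL argument. The second argument is a
# scalar subquery; AVG() skips NULLs, so the fallback is (18 + 22) / 2.
coalesce_rows = demo_conn.execute(
    """
    SELECT name,
           COALESCE(unit_price,
                    (SELECT AVG(unit_price) FROM demo_products),
                    0.0) AS unit_price
    FROM demo_products
    ORDER BY name
    """
).fetchall()
print(coalesce_rows)
demo_conn.close()
```

Whether an average is a sensible default depends on the data; it is used here only to show that a subquery is allowed in that position.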


In [None]:
products_sql_coalesce_df = pd.read_sql("""SELECT ProductID,
ProductName,
SupplierID,
CategoryID,
QuantityPerUnit,
COALESCE(UnitPrice, 0.0) AS UnitPrice,
UnitsInStock,
UnitsOnOrder,
ReorderLevel,
Discontinued
FROM Products;""", conn)
products_sql_coalesce_df.info()


In [None]:
products_pandas_transform_df = products_df.copy()
products_pandas_transform_df.info()

# Transform with Pandas

We use the expression `products_pandas_transform_df['UnitPrice']` to access the UnitPrice column.

We then use the method `fillna(0)` to replace the missing values with 0. In pandas, a missing value usually shows up as `NaN` in a numeric column and as `None` (or `NaN`) in an object column.
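
A small sketch of the two kinds of missing value, assuming a made-up frame (`null_demo_df` and its columns are illustrative only). `isna()` treats both the same way, which is why `fillna` works on either column type.

```python
import numpy as np
import pandas as pd

null_demo_df = pd.DataFrame({
    "price": [1.5, np.nan, 3.0],   # numeric column: the gap is NaN
    "label": ["a", None, "c"],     # text column: the gap was given as None
})

# isna() flags NaN and None alike
missing_mask = null_demo_df.isna()
print(missing_mask)

# fillna(0) replaces only the flagged cells
filled_price = null_demo_df["price"].fillna(0)
print(filled_price)
```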

## NaN transform

In [None]:
products_pandas_transform_df['UnitPrice'] = products_pandas_transform_df['UnitPrice'].fillna(0)
products_pandas_transform_df.info()

## absolute value transform

Let's assume a column mistakenly has negative values in it.

In [None]:
products_pandas_transform_df.UnitPrice = products_pandas_transform_df.UnitPrice*-1
products_pandas_transform_df.head()

Use `abs()` to resolve this. SQL offers the same function as `ABS()`.

In [None]:
products_pandas_transform_df['UnitPrice'] = products_pandas_transform_df['UnitPrice'].abs()
products_pandas_transform_df.head()
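
The SQL counterpart mentioned above can be sketched as follows. The table `demo_prices` is hypothetical and lives in an in-memory SQLite database; note that `ABS()` leaves a NULL as NULL.

```python
import sqlite3

abs_conn = sqlite3.connect(":memory:")
abs_conn.execute("CREATE TABLE demo_prices (unit_price REAL)")
abs_conn.executemany(
    "INSERT INTO demo_prices VALUES (?)",
    [(-18.0,), (22.0,), (None,)],
)

# ABS() flips negative values to positive and passes NULL through unchanged
abs_rows = abs_conn.execute(
    "SELECT ABS(unit_price) FROM demo_prices ORDER BY rowid"
).fetchall()
print(abs_rows)
abs_conn.close()
```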

# creating new boolean columns

Beware of null values here; comparisons against them may not behave as you expect.

## sql boolean columns

SQL databases often represent False as 0 and True as 1. SQLite has no separate boolean type and stores booleans as the integers 0 and 1.

In [None]:
pd.read_sql("SELECT UnitPrice,IIF(UnitPrice>15,True,False) as UnitPrice_gt_15 FROM Products;", conn)
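
To make the null caveat concrete, here is a sketch on a hypothetical in-memory table (`demo_prices`). A bare comparison against NULL yields NULL, while a `CASE` expression with an `ELSE` branch (which is what `IIF` is shorthand for in SQLite) sends the NULL row to the false branch.

```python
import sqlite3

flag_conn = sqlite3.connect(":memory:")
flag_conn.execute("CREATE TABLE demo_prices (unit_price REAL)")
flag_conn.executemany(
    "INSERT INTO demo_prices VALUES (?)",
    [(10.0,), (20.0,), (None,)],
)

flag_rows = flag_conn.execute(
    """
    SELECT unit_price,
           unit_price > 15 AS raw_comparison,                    -- NULL stays NULL
           CASE WHEN unit_price > 15 THEN 1 ELSE 0 END AS flag   -- NULL becomes 0
    FROM demo_prices
    ORDER BY rowid
    """
).fetchall()
print(flag_rows)
flag_conn.close()
```

If "unknown" should stay distinguishable from "not greater than 15", the raw comparison is the safer column to keep.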

## pandas boolean columns

In [None]:
products_df['UnitPrice_gt_15'] = products_df['UnitPrice']>15
products_df.head()
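
In pandas the same caveat applies: comparing `NaN` with a number gives `False`, so a missing price silently becomes "not greater than 15". One possible way to keep the gap visible, sketched here on a made-up series (`prices` is illustrative), is to convert the result to the nullable `boolean` dtype and mask the rows where the input was missing.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, 20.0, np.nan])

# Plain comparison: the NaN row evaluates to False
plain_flag = prices > 15
print(plain_flag.tolist())

# Nullable variant: blank out the rows whose input was missing
nullable_flag = plain_flag.astype("boolean").mask(prices.isna())
print(nullable_flag)
```

Which behavior is right depends on what downstream code expects; the plain version matches what the SQL `IIF` query returns.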

### pandas sum boolean

In [None]:
products_df['UnitPrice_gt_15'].value_counts()

In [None]:
products_df['UnitPrice_gt_15'].sum()
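
Summing works because pandas counts `True` as 1 and `False` as 0. By the same logic `mean()` gives the share of `True` values, as this small sketch on a made-up series (`flags`) shows:

```python
import pandas as pd

flags = pd.Series([True, False, True, True])

true_count = int(flags.sum())     # number of True values
true_share = float(flags.mean())  # fraction of True values
print(true_count, true_share)
```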

# inserting with a select statement

We can use existing tables to build a result set that populates a new table without ever leaving the database. This can reduce network hops and latency, and it lets the database do the heavy lifting instead of Python.

In [None]:
cur.execute("""drop table if  exists customer_order_count """)
conn.commit()

In [None]:
cur.execute("""create table if not exists customer_order_count (
customerid int,
ordercount int
)"""
)
conn.commit()

In [None]:
cur.execute("""
delete from customer_order_count
""")
conn.commit()

In [None]:
cur.execute("""
insert into customer_order_count
SELECT c.CustomerID, COUNT(o.OrderID) AS TotalOrders
FROM Customers c
LEFT JOIN Orders o ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerID;
""")
conn.commit()


In [None]:
pd.read_sql("select * from customer_order_count;", conn)
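
The same pattern can be tried in isolation on an in-memory database. This is a minimal sketch with hypothetical tables (`demo_customers`, `demo_orders`, `demo_order_count`) and made-up rows; it also shows why the `LEFT JOIN` matters: a customer with no orders still gets a row, with a count of 0, because `COUNT(column)` ignores NULLs.

```python
import sqlite3

ins_conn = sqlite3.connect(":memory:")
ins_conn.executescript(
    """
    CREATE TABLE demo_customers (customer_id TEXT);
    CREATE TABLE demo_orders (order_id INTEGER, customer_id TEXT);
    INSERT INTO demo_customers VALUES ('A'), ('B'), ('C');
    INSERT INTO demo_orders VALUES (1, 'A'), (2, 'A'), (3, 'B');

    CREATE TABLE demo_order_count (customer_id TEXT, order_count INTEGER);

    -- The SELECT feeds the INSERT directly; no rows pass through Python
    INSERT INTO demo_order_count (customer_id, order_count)
    SELECT c.customer_id, COUNT(o.order_id)
    FROM demo_customers c
    LEFT JOIN demo_orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id;
    """
)

order_counts = ins_conn.execute(
    "SELECT customer_id, order_count FROM demo_order_count ORDER BY customer_id"
).fetchall()
print(order_counts)
ins_conn.close()
```

Naming the target columns in the `INSERT` is optional here, but it guards against the column order of the target table changing later.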

# create a dataframe with some null values

In [None]:
data = {'first_name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry', 'Isabella', 'Jack', np.nan, np.nan],
        'last_name': ['Smith', 'Jones', 'Williams', 'Brown', 'Davis', 'Miller', 'Wilson', 'Moore', 'Taylor', 'Anderson', np.nan, np.nan]}

df = pd.DataFrame(data)
print(df)


## fill the nulls with 'missing'

In [None]:
df['first_name'] = df['first_name'].fillna('missing')
df
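
The cell above fills only `first_name`. If several columns need different defaults, `fillna` also accepts a dict that maps column names to fill values. A sketch on a small made-up frame (`names_df` and the `'unknown'` default are illustrative):

```python
import numpy as np
import pandas as pd

names_df = pd.DataFrame({
    "first_name": ["Alice", np.nan],
    "last_name": ["Smith", np.nan],
})

# One call, one default per column; fillna returns a new frame
names_filled = names_df.fillna({"first_name": "missing", "last_name": "unknown"})
print(names_filled)
```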


# retrieve the customers and orders tables as dataframes

In [None]:
customers_df = pd.read_sql("SELECT * FROM Customers;", conn)
orders_df = pd.read_sql("SELECT * FROM Orders;", conn)

In [None]:
customers_df.head(2)

In [None]:
orders_df.head(2)

## left join in pandas

In [None]:
merged_df_left = pd.merge(customers_df, orders_df, on='CustomerID', how='left')
merged_df_left.head(2)


## inner join in pandas

In [None]:
merged_df_inner = pd.merge(customers_df, orders_df, on='CustomerID', how='inner')
merged_df_inner.head(2)
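
The difference between the two joins is easier to see on a tiny example. The frames below (`cust`, `ords`) are made up for illustration: customer `C` has no orders, so it survives the left join with empty order columns and disappears from the inner join. Passing `indicator=True` adds a `_merge` column that shows where each row came from.

```python
import pandas as pd

cust = pd.DataFrame({"CustomerID": ["A", "B", "C"]})
ords = pd.DataFrame({"OrderID": [1, 2, 3], "CustomerID": ["A", "A", "B"]})

# Left join keeps every customer; unmatched ones get NaN in the order columns
left = pd.merge(cust, ords, on="CustomerID", how="left", indicator=True)

# Inner join keeps only customers that have at least one order
inner = pd.merge(cust, ords, on="CustomerID", how="inner")

print(len(left), len(inner))
unmatched = left.loc[left["_merge"] == "left_only", "CustomerID"].tolist()
print(unmatched)
```

Comparing `len(merged_df_left)` with `len(merged_df_inner)` on the Northwind frames gives the same kind of check for customers without orders.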