<a href="https://colab.research.google.com/github/matthewpecsok/data_engineering/blob/main/tutorials/de_data_transformations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
import os
import sqlite3
import pandas as pd

In [5]:
!wget -O northwind.db https://github.com/matthewpecsok/data_engineering/raw/main/data/northwind.db

--2024-09-05 22:19:39--  https://github.com/matthewpecsok/data_engineering/raw/main/data/northwind.db
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/matthewpecsok/data_engineering/main/data/northwind.db [following]
--2024-09-05 22:19:39--  https://raw.githubusercontent.com/matthewpecsok/data_engineering/main/data/northwind.db
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 602112 (588K) [application/octet-stream]
Saving to: ‘northwind.db’


2024-09-05 22:19:39 (10.3 MB/s) - ‘northwind.db’ saved [602112/602112]



In [6]:
conn = sqlite3.connect("northwind.db")

In [7]:
cur = conn.cursor()

# intentionally create some nulls

In [8]:
cur.execute('update products set unitprice = null where productid  in (1,4,15,22,30,35,38,40,55)')
conn.commit()

In [9]:
products_df = pd.read_sql("SELECT * FROM Products;", conn)
products_df.shape

(77, 10)

In [10]:
products_df.head(6)

Unnamed: 0,ProductID,ProductName,SupplierID,CategoryID,QuantityPerUnit,UnitPrice,UnitsInStock,UnitsOnOrder,ReorderLevel,Discontinued
0,1,Chai,1,1,10 boxes x 20 bags,,39,0,10,0
1,2,Chang,1,1,24 - 12 oz bottles,19.0,17,40,25,0
2,3,Aniseed Syrup,1,2,12 - 550 ml bottles,10.0,13,70,25,0
3,4,Chef Anton's Cajun Seasoning,2,2,48 - 6 oz jars,,53,0,0,0
4,5,Chef Anton's Gumbo Mix,2,2,36 boxes,21.35,0,0,0,1
5,6,Grandma's Boysenberry Spread,3,2,12 - 8 oz jars,25.0,120,0,25,0


Notice that the UnitPrice non-null count is now 68 of 77, because the update above set 9 rows to NULL.

In [11]:
products_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ProductID        77 non-null     int64  
 1   ProductName      77 non-null     object 
 2   SupplierID       77 non-null     int64  
 3   CategoryID       77 non-null     int64  
 4   QuantityPerUnit  77 non-null     object 
 5   UnitPrice        68 non-null     float64
 6   UnitsInStock     77 non-null     int64  
 7   UnitsOnOrder     77 non-null     int64  
 8   ReorderLevel     77 non-null     int64  
 9   Discontinued     77 non-null     object 
dtypes: float64(1), int64(6), object(3)
memory usage: 6.1+ KB
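The same null count can be checked on the SQL side. The block below is a minimal sketch, assuming a throwaway in-memory table (`demo_conn` and its four rows are made up for illustration and are not the Northwind data). `COUNT(*)` counts every row while `COUNT(column)` skips nulls, so their difference is the number of nulls.

```python
import sqlite3

# Hypothetical in-memory table for illustration; not the Northwind database.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE products (productid INTEGER, unitprice REAL)")
demo_conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, None), (2, 19.0), (3, 10.0), (4, None)],
)

# COUNT(*) counts all rows; COUNT(unitprice) ignores rows where unitprice is NULL.
total_rows, non_null_prices = demo_conn.execute(
    "SELECT COUNT(*), COUNT(unitprice) FROM products"
).fetchone()
null_prices = total_rows - non_null_prices

print(total_rows, non_null_prices, null_prices)
demo_conn.close()
```

Against the real table, the same query shape should agree with the non-null count reported by `info()`.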


# dealing with nulls

## nulls option 1 - sql COALESCE

COALESCE returns the first non-null value in its argument list, which lets us force a default when a column is null. In this case we return 0.0 instead of NULL for UnitPrice. COALESCE accepts any number of arguments, so several fallbacks can be chained.

https://www.w3schools.com/sql/func_sqlserver_coalesce.asp

This may seem uninteresting at first, but each argument can itself be a scalar subquery, which makes COALESCE very versatile.
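As a sketch of chained fallbacks, the block below uses a made-up in-memory table (`demo_conn`, the `listprice` column, and the rows are assumptions for illustration, not part of Northwind). COALESCE first tries `unitprice`, then `listprice`, and finally a scalar subquery that computes the average known price.

```python
import sqlite3

# Hypothetical table; listprice is an invented column used only as a second fallback.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE products (productid INTEGER, unitprice REAL, listprice REAL)"
)
demo_conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, None, 12.0), (2, 19.0, 20.0), (3, None, None)],
)

# The first non-null argument wins: unitprice, then listprice, then the subquery.
# AVG ignores NULLs, so the subquery averages only the known unit prices.
coalesce_rows = demo_conn.execute(
    """
    SELECT productid,
           COALESCE(unitprice,
                    listprice,
                    (SELECT AVG(unitprice) FROM products)) AS price
    FROM products
    ORDER BY productid
    """
).fetchall()

print(coalesce_rows)
demo_conn.close()
```

Product 1 falls back to its list price, product 2 keeps its own unit price, and product 3 gets the average because both of its price columns are null.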


In [12]:
products_sql_coalesce_df = pd.read_sql("""SELECT ProductID,
ProductName,
SupplierID,
CategoryID,
QuantityPerUnit,
COALESCE(UnitPrice, 0.0) AS UnitPrice,
UnitsInStock,
UnitsOnOrder,
ReorderLevel,
Discontinued
FROM Products;""", conn)
products_sql_coalesce_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ProductID        77 non-null     int64  
 1   ProductName      77 non-null     object 
 2   SupplierID       77 non-null     int64  
 3   CategoryID       77 non-null     int64  
 4   QuantityPerUnit  77 non-null     object 
 5   UnitPrice        77 non-null     float64
 6   UnitsInStock     77 non-null     int64  
 7   UnitsOnOrder     77 non-null     int64  
 8   ReorderLevel     77 non-null     int64  
 9   Discontinued     77 non-null     object 
dtypes: float64(1), int64(6), object(3)
memory usage: 6.1+ KB


In [13]:
products_pandas_transform_df = products_df.copy()
products_pandas_transform_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ProductID        77 non-null     int64  
 1   ProductName      77 non-null     object 
 2   SupplierID       77 non-null     int64  
 3   CategoryID       77 non-null     int64  
 4   QuantityPerUnit  77 non-null     object 
 5   UnitPrice        68 non-null     float64
 6   UnitsInStock     77 non-null     int64  
 7   UnitsOnOrder     77 non-null     int64  
 8   ReorderLevel     77 non-null     int64  
 9   Discontinued     77 non-null     object 
dtypes: float64(1), int64(6), object(3)
memory usage: 6.1+ KB


# Transform with Pandas

We use the expression `products_pandas_transform_df['UnitPrice']` to access the UnitPrice column.

We then call the method `fillna(0)` to replace the missing values with 0. In pandas, nulls usually appear as NaN (Not a Number) in numeric columns and as None or NaN in object columns.
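To illustrate how pandas treats missing values in different column types, here is a small sketch on a made-up frame (`demo_df` and its values are assumptions for illustration). `isna()` flags the missing entry in both columns, and `fillna` accepts a dict so each column can get its own default.

```python
import pandas as pd

# Hypothetical data: one missing value in a numeric column and one in a text column.
demo_df = pd.DataFrame(
    {
        "price": [19.0, None, 10.0],  # None in a numeric column is stored as NaN
        "name": ["Chang", None, "Aniseed Syrup"],
    }
)

# isna() treats the missing entry in either column as null.
missing_per_column = demo_df.isna().sum().to_dict()

# A dict lets each column use a default that matches its type.
filled_df = demo_df.fillna({"price": 0.0, "name": "missing"})

print(missing_per_column)
print(filled_df)
```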

## NaN transform

In [14]:
products_pandas_transform_df['UnitPrice'] = products_pandas_transform_df['UnitPrice'].fillna(0)
products_pandas_transform_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ProductID        77 non-null     int64  
 1   ProductName      77 non-null     object 
 2   SupplierID       77 non-null     int64  
 3   CategoryID       77 non-null     int64  
 4   QuantityPerUnit  77 non-null     object 
 5   UnitPrice        77 non-null     float64
 6   UnitsInStock     77 non-null     int64  
 7   UnitsOnOrder     77 non-null     int64  
 8   ReorderLevel     77 non-null     int64  
 9   Discontinued     77 non-null     object 
dtypes: float64(1), int64(6), object(3)
memory usage: 6.1+ KB


## absolute value transform

Let's assume a column mistakenly has negative values in it.

In [None]:
# simulate the bad data by negating every price
products_pandas_transform_df['UnitPrice'] = products_pandas_transform_df['UnitPrice'] * -1
products_pandas_transform_df.head()

Unnamed: 0,ProductID,ProductName,SupplierID,CategoryID,QuantityPerUnit,UnitPrice,UnitsInStock,UnitsOnOrder,ReorderLevel,Discontinued
0,1,Chai,1,1,10 boxes x 20 bags,-0.0,39,0,10,0
1,2,Chang,1,1,24 - 12 oz bottles,-19.0,17,40,25,0
2,3,Aniseed Syrup,1,2,12 - 550 ml bottles,-10.0,13,70,25,0
3,4,Chef Anton's Cajun Seasoning,2,2,48 - 6 oz jars,-0.0,53,0,0,0
4,5,Chef Anton's Gumbo Mix,2,2,36 boxes,-21.35,0,0,0,1


Use abs() to resolve this. It's worth noting that SQL provides an equivalent ABS() function.

In [None]:
products_pandas_transform_df['UnitPrice'] = products_pandas_transform_df['UnitPrice'].abs()
products_pandas_transform_df.head()

Unnamed: 0,ProductID,ProductName,SupplierID,CategoryID,QuantityPerUnit,UnitPrice,UnitsInStock,UnitsOnOrder,ReorderLevel,Discontinued
0,1,Chai,1,1,10 boxes x 20 bags,0.0,39,0,10,0
1,2,Chang,1,1,24 - 12 oz bottles,19.0,17,40,25,0
2,3,Aniseed Syrup,1,2,12 - 550 ml bottles,10.0,13,70,25,0
3,4,Chef Anton's Cajun Seasoning,2,2,48 - 6 oz jars,0.0,53,0,0,0
4,5,Chef Anton's Gumbo Mix,2,2,36 boxes,21.35,0,0,0,1
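For comparison, the SQL version of the same fix can be sketched with SQLite's `ABS()` function. The in-memory table below (`demo_conn`, `prices`) is made up for illustration. Note that `ABS(NULL)` stays NULL, so nulls are neither fixed nor hidden by this transform.

```python
import sqlite3

# Hypothetical table containing a negative price, a valid price, and a null.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE prices (unitprice REAL)")
demo_conn.executemany(
    "INSERT INTO prices VALUES (?)",
    [(-19.0,), (10.0,), (None,)],
)

# ABS() flips negative values and passes NULL through unchanged.
abs_rows = demo_conn.execute(
    "SELECT unitprice, ABS(unitprice) AS abs_price FROM prices ORDER BY rowid"
).fetchall()

print(abs_rows)
demo_conn.close()
```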


# creating new boolean columns

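Beware of null values, which may not behave as you expect: a comparison against a null is never true, so in both SQL and pandas the rows with a missing UnitPrice are flagged as false rather than staying null.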

## sql boolean columns

SQL engines often treat 0 as False and 1 as True. SQLite has no separate boolean type, so the new column comes back as integers.

In [None]:
pd.read_sql("SELECT UnitPrice,IIF(UnitPrice>15,True,False) as UnitPrice_gt_15 FROM Products;", conn)

Unnamed: 0,UnitPrice,UnitPrice_gt_15
0,,0
1,19.00,1
2,10.00,0
3,,0
4,21.35,1
...,...,...
72,15.00,0
73,10.00,0
74,7.75,0
75,18.00,1
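If the distinction between "not above 15" and "price unknown" matters, the null can be kept explicitly. The following is a minimal sketch on a made-up in-memory table (`demo_conn` and `prices` are assumptions for illustration). It uses `CASE` rather than `IIF` so that it also runs on older SQLite versions.

```python
import sqlite3

# Hypothetical table: one null price, one above 15, one below 15.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE prices (unitprice REAL)")
demo_conn.executemany(
    "INSERT INTO prices VALUES (?)",
    [(None,), (19.0,), (10.0,)],
)

flag_rows = demo_conn.execute(
    """
    SELECT unitprice,
           unitprice > 15 AS raw_comparison,           -- NULL > 15 evaluates to NULL
           CASE WHEN unitprice > 15 THEN 1 ELSE 0 END
               AS null_becomes_false,                  -- NULL falls into the ELSE branch
           CASE WHEN unitprice IS NULL THEN NULL
                WHEN unitprice > 15 THEN 1
                ELSE 0 END AS null_preserved           -- NULL handled before comparing
    FROM prices
    ORDER BY rowid
    """
).fetchall()

for row in flag_rows:
    print(row)
demo_conn.close()
```

The first row shows the three behaviors side by side: the raw comparison is null, the two-branch `CASE` silently turns it into 0, and the three-branch `CASE` keeps it null.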


## pandas boolean columns

In [None]:
products_df['UnitPrice_gt_15'] = products_df['UnitPrice']>15
products_df.head()

Unnamed: 0,ProductID,ProductName,SupplierID,CategoryID,QuantityPerUnit,UnitPrice,UnitsInStock,UnitsOnOrder,ReorderLevel,Discontinued,UnitPrice_gt_15
0,1,Chai,1,1,10 boxes x 20 bags,,39,0,10,0,False
1,2,Chang,1,1,24 - 12 oz bottles,19.0,17,40,25,0,True
2,3,Aniseed Syrup,1,2,12 - 550 ml bottles,10.0,13,70,25,0,False
3,4,Chef Anton's Cajun Seasoning,2,2,48 - 6 oz jars,,53,0,0,0,False
4,5,Chef Anton's Gumbo Mix,2,2,36 boxes,21.35,0,0,0,1,True
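pandas can also keep the "unknown" state instead of folding it into False. This sketch assumes a small made-up Series (`prices` is illustrative, not the Northwind column) and uses the nullable `"boolean"` dtype, which can hold `pd.NA` alongside True and False.

```python
import pandas as pd

# Hypothetical prices; None is stored as NaN in a float64 Series.
prices = pd.Series([None, 19.0, 10.0], dtype="float64", name="UnitPrice")

# Plain comparison: NaN > 15 evaluates to False, so the missing price looks "cheap".
naive_flag = prices > 15

# The nullable boolean dtype lets us mark the missing rows as unknown instead.
nullable_flag = naive_flag.astype("boolean")
nullable_flag.loc[prices.isna()] = pd.NA

print(naive_flag.tolist())
print(nullable_flag.isna().tolist())
print(nullable_flag.sum())  # sum() skips the unknown rows by default
```

Whether unknown prices should count as False or stay missing depends on the downstream use, so this is a design choice rather than a correction.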


### pandas sum boolean

In [None]:
products_df['UnitPrice_gt_15'].value_counts()

Unnamed: 0_level_0,count
UnitPrice_gt_15,Unnamed: 1_level_1
True,42
False,35


In [None]:
products_df['UnitPrice_gt_15'].sum()

42

# inserting with a select statement

We can use existing tables to build a result set that populates a new table without the data ever leaving the database. This can reduce network hops and latency, and it lets the database do the heavy lifting instead of Python.

In [None]:
cur.execute("""drop table if  exists customer_order_count """)
conn.commit()

In [None]:
# CustomerID values are strings such as 'ALFKI', so declare the column as text
cur.execute("""create table if not exists customer_order_count (
customerid text,
ordercount int
)"""
)
conn.commit()

In [None]:
cur.execute("""
delete from customer_order_count
""")
conn.commit()

In [None]:
cur.execute("""
insert into customer_order_count
SELECT c.CustomerID, COUNT(o.OrderID) AS TotalOrders
FROM Customers c
LEFT JOIN Orders o ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerID;
""")
conn.commit()


In [None]:
pd.read_sql("select * from customer_order_count;", conn)

Unnamed: 0,customerid,ordercount
0,ALFKI,6
1,ANATR,4
2,ANTON,7
3,AROUT,13
4,BERGS,18
...,...,...
88,WARTH,15
89,WELLI,9
90,WHITC,14
91,WILMK,7
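The pattern above can be reproduced end to end on a tiny made-up dataset, which makes the result easy to check by eye. In this sketch, `demo_conn`, the three customers, and the three orders are all assumptions for illustration. It also shows why the query uses `LEFT JOIN` with `COUNT(o.orderid)`: a customer with no orders still gets a row, with a count of 0.

```python
import sqlite3

# Hypothetical miniature of the customers/orders tables.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(
    """
    CREATE TABLE customers (customerid TEXT);
    CREATE TABLE orders (orderid INTEGER, customerid TEXT);
    CREATE TABLE customer_order_count (customerid TEXT, ordercount INTEGER);

    INSERT INTO customers VALUES ('AAAAA'), ('BBBBB'), ('CCCCC');
    INSERT INTO orders VALUES (1, 'AAAAA'), (2, 'AAAAA'), (3, 'BBBBB');
    """
)

# The SELECT runs inside the database and feeds the INSERT directly.
# COUNT(o.orderid) ignores the NULL produced by the LEFT JOIN for 'CCCCC'.
demo_conn.execute(
    """
    INSERT INTO customer_order_count (customerid, ordercount)
    SELECT c.customerid, COUNT(o.orderid)
    FROM customers c
    LEFT JOIN orders o ON c.customerid = o.customerid
    GROUP BY c.customerid
    """
)
demo_conn.commit()

order_counts = demo_conn.execute(
    "SELECT customerid, ordercount FROM customer_order_count ORDER BY customerid"
).fetchall()

print(order_counts)
demo_conn.close()
```

Naming the target columns in the `INSERT` is optional here, but it keeps the statement correct if the table later gains columns.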


In [1]:
import pandas as pd
import numpy as np

data = {'first_name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry', 'Isabella', 'Jack', np.nan, np.nan],
        'last_name': ['Smith', 'Jones', 'Williams', 'Brown', 'Davis', 'Miller', 'Wilson', 'Moore', 'Taylor', 'Anderson', np.nan, np.nan]}

df = pd.DataFrame(data)
print(df)


   first_name last_name
0       Alice     Smith
1         Bob     Jones
2     Charlie  Williams
3       David     Brown
4         Eve     Davis
5       Frank    Miller
6       Grace    Wilson
7       Henry     Moore
8    Isabella    Taylor
9        Jack  Anderson
10        NaN       NaN
11        NaN       NaN


In [16]:
df['first_name'] = df['first_name'].fillna('missing')
df


Unnamed: 0,first_name,last_name
0,Alice,Smith
1,Bob,Jones
2,Charlie,Williams
3,David,Brown
4,Eve,Davis
5,Frank,Miller
6,Grace,Wilson
7,Henry,Moore
8,Isabella,Taylor
9,Jack,Anderson
