<a href="https://colab.research.google.com/github/p-tech/wbs-dm/blob/main/sqlite-exercise/SQLite_DB_Joins.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**STEP 1: Create the SQLite database:**


We need to import the sqlite3 module and create the database and tables. This follows the syntax we have used in previous weeks.


In [1]:
import sqlite3

# Create a connection called conn. The same connection is reused throughout so that every query runs against the same database.
conn = sqlite3.connect('ecommerce.db')
cursor = conn.cursor()

cursor.execute('''
CREATE TABLE olist_customers (
    customer_id VARCHAR(32) PRIMARY KEY,
    customer_unique_id VARCHAR(32),
    customer_zip_code_prefix INT,
    customer_city VARCHAR(255),
    customer_state VARCHAR(2)
);
''')

cursor.execute('''
CREATE TABLE olist_geolocation (
    geolocation_zip_code_prefix INT,
    geolocation_lat FLOAT,
    geolocation_lng FLOAT,
    geolocation_city VARCHAR(255),
    geolocation_state VARCHAR(2)
);
''')

cursor.execute('''
CREATE TABLE olist_order_items (
    order_id VARCHAR(32),
    order_item_id INT,
    product_id VARCHAR(32),
    seller_id VARCHAR(32),
    shipping_limit_date DATETIME,
    price FLOAT,
    freight_value FLOAT,
    PRIMARY KEY (order_id, order_item_id)
);
''')

cursor.execute('''
CREATE TABLE olist_order_payments (
    order_id VARCHAR(32),
    payment_sequential INT,
    payment_type VARCHAR(50),
    payment_installments INT,
    payment_value FLOAT,
    PRIMARY KEY (order_id, payment_sequential)
);
''')

cursor.execute('''
CREATE TABLE olist_order_reviews (
    review_id VARCHAR(32) PRIMARY KEY,
    order_id VARCHAR(32),
    review_score INT,
    review_comment_title TEXT,
    review_comment_message TEXT,
    review_creation_date DATETIME,
    review_answer_timestamp DATETIME
);
''')

cursor.execute('''
CREATE TABLE olist_orders (
    order_id VARCHAR(32) PRIMARY KEY,
    customer_id VARCHAR(32),
    order_status VARCHAR(50),
    order_purchase_timestamp DATETIME,
    order_approved_at DATETIME,
    order_delivered_carrier_date DATETIME,
    order_delivered_customer_date DATETIME,
    order_estimated_delivery_date DATETIME
);
''')

cursor.execute('''
CREATE TABLE olist_products (
    product_id VARCHAR(32) PRIMARY KEY,
    product_category_name VARCHAR(255),
    product_name_lenght FLOAT,
    product_description_lenght FLOAT,
    product_photos_qty FLOAT,
    product_weight_g FLOAT,
    product_length_cm FLOAT,
    product_height_cm FLOAT,
    product_width_cm FLOAT
);
''')

cursor.execute('''
CREATE TABLE olist_sellers (
    seller_id VARCHAR(32) PRIMARY KEY,
    seller_zip_code_prefix INT,
    seller_city VARCHAR(255),
    seller_state VARCHAR(2)
);
''')

cursor.execute('''
CREATE TABLE product_category_translation (
    product_category_name VARCHAR(255) PRIMARY KEY,
    product_category_name_english VARCHAR(255)
);
''')

# Commit so that any pending changes are saved to the database file. Changes made inside an open transaction are not permanent until they are committed.
conn.commit()

print("Database and tables created successfully!")


Database and tables created successfully!
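
Re-running the cell above against an existing `ecommerce.db` fails with an `sqlite3.OperationalError`, because the tables already exist. One possible way round this is `CREATE TABLE IF NOT EXISTS`. The sketch below is only an illustration: it assumes a throwaway in-memory database (`demo_conn` is a hypothetical name, not part of the notebook) and shows a single table for brevity.

```python
import sqlite3

# Assumption: an in-memory database stands in for ecommerce.db
demo_conn = sqlite3.connect(":memory:")

CREATE_SELLERS = """
CREATE TABLE IF NOT EXISTS olist_sellers (
    seller_id VARCHAR(32) PRIMARY KEY,
    seller_zip_code_prefix INT,
    seller_city VARCHAR(255),
    seller_state VARCHAR(2)
);
"""

# Running the statement twice is safe: the second call changes nothing
demo_conn.execute(CREATE_SELLERS)
demo_conn.execute(CREATE_SELLERS)

table_names = [row[0] for row in demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(table_names)  # ['olist_sellers']
```

The same `IF NOT EXISTS` clause could be added to each `CREATE TABLE` statement in STEP 1 if the box needs to be re-run.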


**STEP 2: Check Tables Created:**

Run the command to show the database tables created and the structure.

In [None]:

cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()

for table_name in tables:
    print(f"Table: {table_name[0]}")
    cursor.execute(f"PRAGMA table_info({table_name[0]});")
    columns = cursor.fetchall()
    for col in columns:
        print(f"  Column: {col[1]}, Type: {col[2]}, NotNull: {col[3]}, DefaultVal: {col[4]}, PrimaryKey: {col[5]}")
    print("-" * 20)




**STEP 3: Upload Files:**

Run this box multiple times to upload the relevant CSV files, or drag the files from your desktop into the Files window.

In [None]:


from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))


**STEP 4: Load CSV files into the database tables:**

This will populate the database tables with the data from the csv files.  No need to write INSERT statements.

You need to make sure the correct files are loaded into the corresponding tables.

In [14]:
import csv

def import_csv_to_table(csv_file, table_name):
    # Open the file read-only ('r') so the original CSV cannot be changed.
    with open(csv_file, 'r', newline='', encoding='utf-8') as file:
        csv_reader = csv.reader(file)
        next(csv_reader)  # Skip the header row (assumes every file has one)
        for row in csv_reader:
            # Build one '?' placeholder per CSV column, e.g. '?, ?, ?'. join turns the list into a single string.
            # Passing values through '?' placeholders reduces the risk of SQL injection.
            placeholders = ', '.join(['?' for _ in row])
            # Assumes the CSV and the table have the same columns in the same order.
            # If they differ, the column names must be listed explicitly in the INSERT.
            sql = f"INSERT INTO {table_name} VALUES ({placeholders})"
            cursor.execute(sql, row)

# Import each CSV file into its matching table. import_csv_to_table is the function defined above; it takes the file name and the table name.
try:
    import_csv_to_table('olist_customers_dataset.csv', 'olist_customers')
    #import_csv_to_table('olist_geolocation_dataset.csv', 'olist_geolocation')
    import_csv_to_table('olist_order_items_dataset.csv', 'olist_order_items')
    #import_csv_to_table('olist_order_payments_dataset.csv', 'olist_order_payments')
    import_csv_to_table('olist_orders_dataset.csv', 'olist_orders')
    #import_csv_to_table('olist_products_dataset.csv', 'olist_products')
    import_csv_to_table('olist_sellers_dataset.csv', 'olist_sellers')
    #import_csv_to_table('product_category_name_translation.csv', 'product_category_translation')
    conn.commit()
    print("Data imported successfully!")
except Exception as e:
    print(f"An error occurred: {e}")
    conn.rollback()  # Rollback changes if an error occurred



Data imported successfully!
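
Because the loader assumes that each CSV matches its table, a mismatched file can silently fill a table with the wrong data. A minimal sketch of one way to guard against this is to compare the CSV header with the table's column names before inserting. The function `import_csv_checked` and the sample rows are hypothetical, and the check assumes the header names are spelled exactly like the column names, which has not been verified for every file.

```python
import csv
import io
import sqlite3

def import_csv_checked(conn, csv_file, table_name):
    """Insert rows from an open CSV file after checking its header against the table."""
    # Column names as SQLite reports them, in declaration order
    table_cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table_name})")]
    reader = csv.reader(csv_file)
    header = next(reader)
    if header != table_cols:
        raise ValueError(
            f"{table_name}: CSV header {header} does not match columns {table_cols}")
    placeholders = ", ".join("?" for _ in table_cols)
    # executemany inserts every remaining row with one call
    cur = conn.executemany(
        f"INSERT INTO {table_name} VALUES ({placeholders})", reader)
    return cur.rowcount

# Demo on an in-memory database with made-up sample rows
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("""
CREATE TABLE olist_sellers (
    seller_id VARCHAR(32) PRIMARY KEY,
    seller_zip_code_prefix INT,
    seller_city VARCHAR(255),
    seller_state VARCHAR(2)
)""")

good_csv = io.StringIO(
    "seller_id,seller_zip_code_prefix,seller_city,seller_state\n"
    "s1,13023,campinas,SP\n"
    "s2,4195,sao paulo,SP\n")
inserted = import_csv_checked(demo_conn, good_csv, "olist_sellers")
print(f"Inserted {inserted} rows")

# A file meant for a different table is rejected before any insert happens
bad_csv = io.StringIO("order_id,customer_id\no1,c1\n")
try:
    import_csv_checked(demo_conn, bad_csv, "olist_sellers")
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print(f"Rejected: {err}")
```

With real files, the second argument would be an open file, for example `open('olist_sellers_dataset.csv', newline='', encoding='utf-8')`.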


**STEP 5: Check Data has loaded**

Query each database table, load the data into a DataFrame, and display the first 5 rows.
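
A minimal sketch of such a check, using only the sqlite3 module so that it runs anywhere. The helper `preview_tables` and the demo data are hypothetical; in the notebook the same queries could be passed to `pd.read_sql_query` with `conn` to get DataFrames instead.

```python
import sqlite3

def preview_tables(conn, limit=5):
    """Return {table_name: (row_count, first_rows)} for every table in the database."""
    names = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    summary = {}
    for name in names:
        row_count = conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
        first_rows = conn.execute(
            f"SELECT * FROM {name} LIMIT ?", (limit,)).fetchall()
        summary[name] = (row_count, first_rows)
    return summary

# Demo on a throwaway in-memory database: one loaded table, one empty table
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE olist_sellers (seller_id TEXT PRIMARY KEY, seller_state TEXT)")
demo_conn.execute("CREATE TABLE olist_orders (order_id TEXT PRIMARY KEY, customer_id TEXT)")
demo_conn.executemany("INSERT INTO olist_sellers VALUES (?, ?)",
                      [("s1", "SP"), ("s2", "RJ")])

summary = preview_tables(demo_conn)
for name, (row_count, first_rows) in summary.items():
    print(f"{name}: {row_count} rows")
    for row in first_rows:
        print("  ", row)
```

A table that reports 0 rows has not been loaded yet, for example because its import line in STEP 4 is commented out.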

**ONLY RUN IF YOU NEED TO DELETE THE DATA IN THE TABLES**

If you run it, go back to **STEP 4** and re-run from there.

In [13]:
# Only run this if you need to reset the tables without deleting the database and starting again, then re-run the previous box.
# Delete all data from the tables
cursor.execute("PRAGMA foreign_keys = OFF")
cursor.execute("DELETE FROM olist_customers")
cursor.execute("DELETE FROM olist_geolocation")
cursor.execute("DELETE FROM olist_order_items")
cursor.execute("DELETE FROM olist_order_payments")
cursor.execute("DELETE FROM olist_order_reviews")
cursor.execute("DELETE FROM olist_orders")
cursor.execute("DELETE FROM olist_products")
cursor.execute("DELETE FROM olist_sellers")
cursor.execute("DELETE FROM product_category_translation")
cursor.execute("PRAGMA foreign_keys = ON")

# Commit the changes
conn.commit()
print("Table data deleted - go back to STEP 4 to reload.")



Table data deleted - go back to STEP 4 to reload.




---


# **INNER JOIN (Only Matching Records)**
This retrieves only orders that have at least one item.

Explanation:

INNER JOIN only returns rows where there is a match in both tables.
If an order does not have any items, it will not appear in the results.

In [26]:
import pandas as pd

inner_join_df = pd.read_sql_query("""
SELECT o.order_id, o.customer_id, oi.product_id, oi.price
FROM olist_orders o
INNER JOIN olist_order_items oi ON o.order_id = oi.order_id
""", conn)

# Print first 10 rows
print(inner_join_df.head(10).to_string(index=False))

# Print total row count
print(f"Total number of rows: {inner_join_df.shape[0]}")


                        order_id                      customer_id                       product_id  price
00010242fe8c5a6d1ba2dd792cb16214 3ce436f183e68e07877b285a838db11a 4244733e06e7ecb4970a6e2683c13e61  58.90
00018f77f2f0320c557190d7a144bdd3 f6dd3ec061db4e3987629fe6b26e5cce e5f2d52b802189ee658865ca93d83a8f 239.90
000229ec398224ef6ca0657da4fc703e 6489ae5e4333f3693df5ad4372dab6d3 c777355d18b72b67abbeef9df44fd0fd 199.00
00024acbcdf0a6daa1e931b038114c75 d4eb9395c8c0431ee92fce09860c5a06 7634da152a4610f1595efa32f14722fc  12.99
00042b26cf59d7ce69dfabb4e55b4fd9 58dbd0b2d70206bf40e62cd34e84d795 ac6c3623068f30de03045865e4e10089 199.90
00048cc3ae777c65dbb7d2a0634bc1ea 816cbea969fe5b689b39cfc97a506742 ef92defde845ab8450f9d70c526ef70f  21.90
00054e8431b9d7675808bcb819fb4a32 32e2e6ab09e778d99bf2e0ecd4898718 8d4f2bb7e93e6710a28f34fa83ee7d28  19.90
000576fe39319847cbb9d288c5617fa6 9ed5e522dd9dd85b4af4a077526d8117 557d850972a7d6f792fd18ae1400d9b6 810.00
0005a1a1728c9d785b8e2b08b904576c 16150771dfd47



---


# **LEFT JOIN (All Orders, Even If No Items)**
This keeps all orders, even if they have no items.

Explanation:
LEFT JOIN returns every row from the left table (olist_orders) and adds the matching rows from olist_order_items where they exist.

If an order has no items, product_id and price will be NULL.
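
The NULL behaviour described above can be seen on a toy database. This is a sketch only: the two cut-down tables and their rows are made up for illustration and are not taken from the Olist data.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE olist_orders (order_id TEXT PRIMARY KEY, customer_id TEXT);
CREATE TABLE olist_order_items (
    order_id TEXT, order_item_id INT, product_id TEXT, price FLOAT,
    PRIMARY KEY (order_id, order_item_id));
-- o1 has one item, o2 has none
INSERT INTO olist_orders VALUES ('o1', 'c1'), ('o2', 'c2');
INSERT INTO olist_order_items VALUES ('o1', 1, 'p1', 10.0);
""")

left_rows = demo_conn.execute("""
SELECT o.order_id, o.customer_id, oi.product_id, oi.price
FROM olist_orders o
LEFT JOIN olist_order_items oi ON o.order_id = oi.order_id
ORDER BY o.order_id
""").fetchall()
# SQL NULL arrives in Python as None
print(left_rows)  # [('o1', 'c1', 'p1', 10.0), ('o2', 'c2', None, None)]

# Filtering on the NULLs lists the orders that have no items at all
orders_without_items = demo_conn.execute("""
SELECT o.order_id
FROM olist_orders o
LEFT JOIN olist_order_items oi ON o.order_id = oi.order_id
WHERE oi.order_id IS NULL
""").fetchall()
print(orders_without_items)  # [('o2',)]
```

The second query is a common way to find the rows that a LEFT JOIN keeps but an INNER JOIN would drop.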



In [27]:
left_join_df = pd.read_sql_query("""
SELECT o.order_id, o.customer_id, oi.product_id, oi.price
FROM olist_orders o
LEFT JOIN olist_order_items oi ON o.order_id = oi.order_id
""", conn)

# Print first 10 rows
print(left_join_df.head(10).to_string(index=False))

# Print total row count
print(f"Total number of rows: {left_join_df.shape[0]}")


                        order_id                      customer_id                       product_id  price
e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d 87285b34884572647811a353c7ac498a  29.99
53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef 595fac2a385ac33a80bd5114aec74eb8 118.70
47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089 aa4383b373c6aca5d8797843e5594415 159.90
949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82 d0b61bfb1de832b15ba9d266ca96e5b0  45.00
ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c 65266b2da20d04dbe00c5c2d3bb7859e  19.90
a4591c265e18cb1dcee52889e2d8acc3 503740e9ca751ccdda7ba28e9ab8f608 060cb19345d90064d1015407193c233d 147.90
136cce7faa42fdb2cefd53fdc79a6098 ed0271e0b7da060a393796590e7b737a a1804276d9941ac0733cfd409f5206eb  49.90
6514b8ad8028c9f2cc2374ded245783f 9bdf08b4b3b52b5526ff42d37d47f222 4520766ec412348b8d4caa5e8a18c464  59.99
76c6e866289321a7c93b82b54852dc33 f54a9f0e6b351



---


# **RIGHT JOIN (All Order Items, Even If No Matching Order)**

SQLite versions before 3.39.0 do not support RIGHT JOIN, but we can simulate it by swapping the table order in a LEFT JOIN. The swapped LEFT JOIN works on every version.

Explanation:
All order items are kept, even if there is no matching order in olist_orders.
If an item has no matching order, customer_id will be NULL.

In [28]:
right_join_df = pd.read_sql_query("""
SELECT oi.order_id, o.customer_id, oi.product_id, oi.price
FROM olist_order_items oi
LEFT JOIN olist_orders o ON oi.order_id = o.order_id
""", conn)

# Print first 10 rows
print(right_join_df.head(10).to_string(index=False))

# Print total row count
print(f"Total number of rows: {right_join_df.shape[0]}")



                        order_id                      customer_id                       product_id  price
00010242fe8c5a6d1ba2dd792cb16214 3ce436f183e68e07877b285a838db11a 4244733e06e7ecb4970a6e2683c13e61  58.90
00018f77f2f0320c557190d7a144bdd3 f6dd3ec061db4e3987629fe6b26e5cce e5f2d52b802189ee658865ca93d83a8f 239.90
000229ec398224ef6ca0657da4fc703e 6489ae5e4333f3693df5ad4372dab6d3 c777355d18b72b67abbeef9df44fd0fd 199.00
00024acbcdf0a6daa1e931b038114c75 d4eb9395c8c0431ee92fce09860c5a06 7634da152a4610f1595efa32f14722fc  12.99
00042b26cf59d7ce69dfabb4e55b4fd9 58dbd0b2d70206bf40e62cd34e84d795 ac6c3623068f30de03045865e4e10089 199.90
00048cc3ae777c65dbb7d2a0634bc1ea 816cbea969fe5b689b39cfc97a506742 ef92defde845ab8450f9d70c526ef70f  21.90
00054e8431b9d7675808bcb819fb4a32 32e2e6ab09e778d99bf2e0ecd4898718 8d4f2bb7e93e6710a28f34fa83ee7d28  19.90
000576fe39319847cbb9d288c5617fa6 9ed5e522dd9dd85b4af4a077526d8117 557d850972a7d6f792fd18ae1400d9b6 810.00
0005a1a1728c9d785b8e2b08b904576c 16150771dfd47
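
Whether a native RIGHT JOIN is available depends on the SQLite library bundled with Python, since native support arrived in SQLite 3.39.0. The sketch below checks the version and, where possible, compares a native RIGHT JOIN with the swapped LEFT JOIN. The toy tables and rows are made up for illustration.

```python
import sqlite3

# RIGHT JOIN and FULL OUTER JOIN were added in SQLite 3.39.0
native_right_join = sqlite3.sqlite_version_info >= (3, 39, 0)
print(f"SQLite {sqlite3.sqlite_version}, native RIGHT JOIN: {native_right_join}")

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE olist_orders (order_id TEXT PRIMARY KEY, customer_id TEXT);
CREATE TABLE olist_order_items (
    order_id TEXT, order_item_id INT, product_id TEXT, price FLOAT);
-- o9 is an item whose order is missing
INSERT INTO olist_orders VALUES ('o1', 'c1');
INSERT INTO olist_order_items VALUES ('o1', 1, 'p1', 10.0), ('o9', 1, 'p2', 5.0);
""")

# Swapped LEFT JOIN: works on every SQLite version
swapped_left = demo_conn.execute("""
SELECT oi.order_id, o.customer_id, oi.product_id, oi.price
FROM olist_order_items oi
LEFT JOIN olist_orders o ON oi.order_id = o.order_id
ORDER BY oi.order_id
""").fetchall()
print(swapped_left)

if native_right_join:
    native = demo_conn.execute("""
    SELECT oi.order_id, o.customer_id, oi.product_id, oi.price
    FROM olist_orders o
    RIGHT JOIN olist_order_items oi ON o.order_id = oi.order_id
    ORDER BY oi.order_id
    """).fetchall()
    print("Native RIGHT JOIN gives the same rows:", native == swapped_left)
```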



---

# **FULL OUTER JOIN (All Orders & All Order Items, Even If No Match)**
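SQLite versions before 3.39.0 do not support FULL OUTER JOIN, but we can simulate it using LEFT JOIN + UNION + a second LEFT JOIN with the tables swapped (which acts as a RIGHT JOIN).

Explanation:

First LEFT JOIN (Orders → Items): ensures all orders appear, even if they have no items.

Second LEFT JOIN (Items → Orders): ensures all order items appear, even if they have no order.

UNION merges both sets, simulating FULL OUTER JOIN. Note that UNION also removes duplicate rows, so rows that are identical in all four selected columns are returned only once.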

In [29]:
full_outer_join_df = pd.read_sql_query("""
SELECT o.order_id, o.customer_id, oi.product_id, oi.price
FROM olist_orders o
LEFT JOIN olist_order_items oi ON o.order_id = oi.order_id
UNION
SELECT oi.order_id, o.customer_id, oi.product_id, oi.price
FROM olist_order_items oi
LEFT JOIN olist_orders o ON oi.order_id = o.order_id
""", conn)

# Print first 10 rows
print(full_outer_join_df.head(10).to_string(index=False))

# Print total row count
print(f"Total number of rows: {full_outer_join_df.shape[0]}")



                        order_id                      customer_id                       product_id  price
00010242fe8c5a6d1ba2dd792cb16214 3ce436f183e68e07877b285a838db11a 4244733e06e7ecb4970a6e2683c13e61  58.90
00018f77f2f0320c557190d7a144bdd3 f6dd3ec061db4e3987629fe6b26e5cce e5f2d52b802189ee658865ca93d83a8f 239.90
000229ec398224ef6ca0657da4fc703e 6489ae5e4333f3693df5ad4372dab6d3 c777355d18b72b67abbeef9df44fd0fd 199.00
00024acbcdf0a6daa1e931b038114c75 d4eb9395c8c0431ee92fce09860c5a06 7634da152a4610f1595efa32f14722fc  12.99
00042b26cf59d7ce69dfabb4e55b4fd9 58dbd0b2d70206bf40e62cd34e84d795 ac6c3623068f30de03045865e4e10089 199.90
00048cc3ae777c65dbb7d2a0634bc1ea 816cbea969fe5b689b39cfc97a506742 ef92defde845ab8450f9d70c526ef70f  21.90
00054e8431b9d7675808bcb819fb4a32 32e2e6ab09e778d99bf2e0ecd4898718 8d4f2bb7e93e6710a28f34fa83ee7d28  19.90
000576fe39319847cbb9d288c5617fa6 9ed5e522dd9dd85b4af4a077526d8117 557d850972a7d6f792fd18ae1400d9b6 810.00
0005a1a1728c9d785b8e2b08b904576c 16150771dfd47

# **Rows Returned by JOIN Type**

| JOIN Type | Rows Returned | Notes |
|---|---|---|
| **INNER JOIN** | 112,650 | Only orders that have at least one item are included. Orders without items are excluded. |
| **LEFT JOIN** | 113,425 | Includes all orders, even if they have no items (the item columns appear as NULL). It returns 775 more rows than INNER JOIN because some orders have no items. |
| **RIGHT JOIN** | 112,650 | Includes all order items, even if the order is missing. In this case, it matches INNER JOIN, meaning every order item has a matching order. |
| **FULL OUTER JOIN** | 103,200 | Should include all orders and all items, even if they do not match, so it should return at least as many rows as LEFT JOIN. The count is lower because UNION removes duplicate rows: the query does not select order_item_id, so several items in one order with the same product and price collapse into a single row. |
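
One likely reason the UNION-based query returns fewer rows than the LEFT JOIN is that UNION removes duplicate rows. The sketch below reproduces the effect on made-up toy data and shows one common alternative: `UNION ALL` plus a filter that keeps only the unmatched rows from the second half. It is an illustration under those assumptions, not a replacement for the notebook's query.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE olist_orders (order_id TEXT PRIMARY KEY, customer_id TEXT);
CREATE TABLE olist_order_items (
    order_id TEXT, order_item_id INT, product_id TEXT, price FLOAT,
    PRIMARY KEY (order_id, order_item_id));
-- o1 holds two units of the same product, o2 has no items,
-- o9 is an item whose order is missing
INSERT INTO olist_orders VALUES ('o1', 'c1'), ('o2', 'c2');
INSERT INTO olist_order_items VALUES
    ('o1', 1, 'p1', 10.0), ('o1', 2, 'p1', 10.0), ('o9', 1, 'p2', 5.0);
""")

ORDERS_TO_ITEMS = """
SELECT o.order_id, o.customer_id, oi.product_id, oi.price
FROM olist_orders o
LEFT JOIN olist_order_items oi ON o.order_id = oi.order_id
"""
ITEMS_TO_ORDERS = """
SELECT oi.order_id, o.customer_id, oi.product_id, oi.price
FROM olist_order_items oi
LEFT JOIN olist_orders o ON oi.order_id = o.order_id
"""

# UNION deduplicates, so the two identical o1 rows collapse into one
union_rows = demo_conn.execute(
    ORDERS_TO_ITEMS + " UNION " + ITEMS_TO_ORDERS).fetchall()

# UNION ALL keeps duplicates; the WHERE clause adds only items with no order
union_all_rows = demo_conn.execute(
    ORDERS_TO_ITEMS + " UNION ALL " + ITEMS_TO_ORDERS
    + " WHERE o.order_id IS NULL").fetchall()

print(len(union_rows), len(union_all_rows))  # 3 4
```

A true full outer join of this toy data has 4 rows (two for o1, one for o2, one for o9), which matches the `UNION ALL` version.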