<a href="https://colab.research.google.com/github/p-tech/wbs-dm/blob/main/Normalisation/3NF_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Implementing 3rd Normal Form**

---



# **Task Description:**

---

You are given a dataset representing customer orders, where multiple products are stored in a single column as a comma-separated list.

Your task is to bring this data into First Normal Form (1NF) and then normalise it to Third Normal Form (3NF) using Python.



In [None]:
import pandas as pd

# Given dataset (product_purchased still holds comma-separated lists, so it is not yet in 1NF)
data = {
    "order_id": [1001, 1001, 1002, 1002, 1002, 1003, 1003, 1003, 1004, 1004, 1005, 1005],
    "first_name": ["Emma", "Emma", "Olivia", "Olivia", "Olivia", "Bob", "Bob", "Bob", "David", "David", "Noah", "Noah"],
    "last_name": ["Brown", "Brown", "Smith", "Smith", "Smith", "Moore", "Moore", "Moore", "Brown", "Brown", "Jones", "Jones"],
    "product_purchased": ["Tooth Brush, Hair Dryer", "Hair Dryer", "Tooth Brush, Computer", "TV", "Hair Dryer",
                           "Tooth Brush", "Computer, Hair Dryer, Tooth Brush", "Phone", "Phone, Computer", "Mouse, Phone", "Mouse", "Computer"],
    "total_spend": [512.28, 512.28, 692.99, 692.99, 692.99, 801.57, 801.57, 801.57, 575.67, 575.67, 685.19, 685.19],
    "payment_method": ["Google Pay", "Google Pay", "Google Pay", "Google Pay", "Google Pay",
                       "Google Pay", "Google Pay", "Google Pay", "Credit Card", "Credit Card", "Credit Card", "Credit Card"]
}

df_1NF = pd.DataFrame(data)
print(df_1NF)


In [None]:
# Split the 'product_purchased' column on commas and explode it so each row holds one product
df_1NF_exploded = df_1NF.assign(product_purchased=df_1NF['product_purchased'].str.split(',')).explode('product_purchased')

# Remove leading/trailing whitespace from the 'product_purchased' column
df_1NF_exploded['product_purchased'] = df_1NF_exploded['product_purchased'].str.strip()

# Reset the index
df_1NF_exploded = df_1NF_exploded.reset_index(drop=True)

df_1NF_exploded


# **Convert to 3NF**
Students must split the data into the following tables to remove transitive dependencies:

Customers Table (customer_id, first_name, last_name)

Orders Table (order_id, customer_id, payment_id)

Payment Methods Table (payment_id, payment_method)

Products Table (product_id, product_name)

Order_Items Table (order_id, product_id, total_spend)
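One way to see why the split is needed is to test which columns are determined by which. The sketch below is only illustrative: `fd_holds` is a hypothetical helper (not part of pandas), and `sample` is a small made-up extract rather than the full dataset. It checks whether each value of one column maps to exactly one value of another.

```python
import pandas as pd

def fd_holds(df, determinant, dependent):
    """Return True if every combination of `determinant` values
    maps to exactly one value of `dependent` in df."""
    counts = df.groupby(determinant)[dependent].nunique()
    return bool((counts <= 1).all())

# Small illustrative sample (an assumption, not the notebook's full data)
sample = pd.DataFrame({
    "order_id": [1001, 1001, 1002, 1004],
    "first_name": ["Emma", "Emma", "Olivia", "David"],
    "last_name": ["Brown", "Brown", "Smith", "Brown"],
    "payment_method": ["Google Pay", "Google Pay", "Google Pay", "Credit Card"],
    "product_purchased": ["Tooth Brush", "Hair Dryer", "TV", "Phone"],
})

# An order has one customer, so order_id determines first_name
order_determines_customer = fd_holds(sample, ["order_id"], "first_name")

# An order can contain several products, so order_id does not determine the product
order_determines_product = fd_holds(sample, ["order_id"], "product_purchased")

print(order_determines_customer, order_determines_product)
```

Columns that are determined by something other than the full key (here, customer names and payment method) are candidates for their own table.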


# **Step 1: Create the Customers Table**

Selects first_name and last_name from df_1NF_exploded and removes rows that duplicate both names.

reset_index(drop=True) discards the old row index and renumbers the remaining rows from 0.

customer_id is then added as a regular column (index + 1), so the IDs start at 1.

In a relational database, customer_id would be the primary key (unique) of this table.
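The effect of dropping duplicates and then resetting the index is easiest to see on a tiny example. The frame below is made up for illustration (the `names` and `unique_names` variables are not used elsewhere in this notebook).

```python
import pandas as pd

# Made-up sample: three rows for Emma Brown, one for David Brown
names = pd.DataFrame({
    "first_name": ["Emma", "Emma", "David", "Emma"],
    "last_name": ["Brown", "Brown", "Brown", "Brown"],
})

# drop_duplicates keeps the first occurrence of each (first_name, last_name) pair,
# but the surviving rows keep their original index labels, which leaves gaps
unique_names = names.drop_duplicates()
labels_before_reset = unique_names.index.tolist()

# reset_index(drop=True) throws the old labels away and renumbers from 0
unique_names = unique_names.reset_index(drop=True)

# The new index + 1 gives a simple surrogate key starting at 1
unique_names["customer_id"] = unique_names.index + 1

print(labels_before_reset)
print(unique_names)
```

Note that two customers sharing a last name (Emma Brown and David Brown) still get separate IDs, because duplicates are judged on both columns together.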

In [None]:
# Step 1: Extract unique customers
customers = df_1NF_exploded[['first_name', 'last_name']].drop_duplicates().reset_index(drop=True)
customers['customer_id'] = customers.index + 1  # Assign unique customer IDs

# Display Customers Table
display(customers)

# **Create the Orders Table:**

In [None]:

# Merge customer_id into the exploded data to maintain relationships
# (note: this overwrites df_1NF with the exploded, merged version)
df_1NF = df_1NF_exploded.merge(customers, on=['first_name', 'last_name'], how='left')

# Creating Orders Table
orders = df_1NF[['order_id', 'customer_id', 'payment_method']].drop_duplicates().reset_index(drop=True)

# Display the table
display(orders)


# **Create Payment Methods Table - Modify Orders Table**

In [None]:
# Extract unique payment methods
payment_methods = orders[['payment_method']].drop_duplicates().reset_index(drop=True)
payment_methods['payment_id'] = payment_methods.index + 1  # Assign unique payment IDs

# Merge payment_id into Orders Table
orders = orders.merge(payment_methods, on='payment_method', how='left').drop(columns=['payment_method'])

display(payment_methods)
display(orders)

# **Create the Products Table**

In [None]:
products = df_1NF_exploded[['product_purchased']].drop_duplicates().reset_index(drop=True)
products['product_id'] = products.index + 1  # Assign unique product IDs

display(products)

# **Create the Order Items Table**

In [None]:
order_items = df_1NF_exploded[['order_id', 'product_purchased', 'total_spend']]
order_items = order_items.merge(products, on='product_purchased', how='left').drop(columns=['product_purchased'])

display(order_items)
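A useful sanity check on a decomposition like this is to join the tables back together and confirm that the original rows come back. The sketch below does that on a small made-up table (`flat`) using the same table layout as this notebook. It is a self-contained illustration, not a check run against the notebook's own variables.

```python
import pandas as pd

# Made-up flat table, already one product per row
flat = pd.DataFrame({
    "order_id": [1, 1, 2],
    "first_name": ["Emma", "Emma", "Noah"],
    "last_name": ["Brown", "Brown", "Jones"],
    "payment_method": ["Google Pay", "Google Pay", "Credit Card"],
    "product_purchased": ["TV", "Phone", "TV"],
    "total_spend": [500.0, 500.0, 300.0],
})

# Build the lookup tables with surrogate keys starting at 1
customers = flat[["first_name", "last_name"]].drop_duplicates().reset_index(drop=True)
customers["customer_id"] = customers.index + 1
payment_methods = flat[["payment_method"]].drop_duplicates().reset_index(drop=True)
payment_methods["payment_id"] = payment_methods.index + 1
products = flat[["product_purchased"]].drop_duplicates().reset_index(drop=True)
products["product_id"] = products.index + 1

# Attach the keys, then split into orders and order_items
keyed = (flat.merge(customers, on=["first_name", "last_name"])
             .merge(payment_methods, on="payment_method")
             .merge(products, on="product_purchased"))
orders = keyed[["order_id", "customer_id", "payment_id"]].drop_duplicates()
order_items = keyed[["order_id", "product_id", "total_spend"]]

# Join everything back and restore the original column order
rebuilt = (order_items.merge(orders, on="order_id")
                      .merge(customers, on="customer_id")
                      .merge(payment_methods, on="payment_id")
                      .merge(products, on="product_id"))[flat.columns.tolist()]

# Compare after sorting, since row order is not meaningful in a relation
sort_cols = ["order_id", "product_purchased"]
rebuilt_sorted = rebuilt.sort_values(sort_cols).reset_index(drop=True)
flat_sorted = flat.sort_values(sort_cols).reset_index(drop=True)
round_trip_ok = rebuilt_sorted.equals(flat_sorted)
print(round_trip_ok)
```

If the round trip fails, a likely cause is a lookup table whose key is not unique, which makes a merge duplicate or drop rows.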
