<a href="https://colab.research.google.com/github/p-tech/wbs-dm/blob/main/Normalisation/3NF_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Implementing 2nd Normal Form**

---



# **Task Description:**

---

You are given a dataset of customer orders, where multiple products are stored in a single column as a comma-separated list.

Your task is to normalise this data to First Normal Form (1NF) and then to Second Normal Form (2NF) using Python.



In [None]:
import pandas as pd

# Given dataset (not yet in 1NF: products_purchased holds several values per row)
data = {
    "order_id": [1001, 1002, 1003, 1004, 1005],
    "first_name": ["Emma", "Olivia", "Bob", "David", "Noah"],
    "last_name": ["Brown", "Smith", "Moore", "Brown", "Jones"],
    "products_purchased": ["Tooth Brush, Hair Dryer", "Tooth Brush, TV, Hair Dryer",
                           "Tooth Brush, Computer, Phone", "Computer, Phone", "Mouse, Computer"],
    "total_spend": [1024.56, 2078.96, 2404.72, 1151.34, 1370.38],
    "payment_method": ["Google Pay", "Google Pay", "Google Pay", "Credit Card", "Credit Card"]
}

df_1NF = pd.DataFrame(data)
print(df_1NF)

   order_id first_name last_name            products_purchased  total_spend  \
0      1001       Emma     Brown       Tooth Brush, Hair Dryer      1024.56   
1      1002     Olivia     Smith   Tooth Brush, TV, Hair Dryer      2078.96   
2      1003        Bob     Moore  Tooth Brush, Computer, Phone      2404.72   
3      1004      David     Brown               Computer, Phone      1151.34   
4      1005       Noah     Jones               Mouse, Computer      1370.38   

  payment_method  
0     Google Pay  
1     Google Pay  
2     Google Pay  
3    Credit Card  
4    Credit Card  


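Splitting the products_purchased Column:

`df_1NF.assign(products_purchased=df_1NF['products_purchased'].str.split(','))`

The `products_purchased` column holds several products per row as a comma-separated string (e.g., "Laptop, Phone, Tablet").

`.str.split(',')` converts this string into a list (e.g., ["Laptop", " Phone", " Tablet"]). Because the split is on the comma alone, the space after each comma stays attached to the next item.

`assign()` returns a new DataFrame with this transformed column, leaving `df_1NF` unchanged.

`.explode('products_purchased')` then gives each list element its own row, repeating the values of the other columns.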
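
The split-then-explode pattern can be tried on a small frame first. This is a minimal sketch: the `toy` DataFrame and its values are illustrative assumptions, not part of the dataset above.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the pattern
toy = pd.DataFrame({
    "order_id": [1, 2],
    "products_purchased": ["Laptop, Phone", "Tablet"],
})

# str.split(',') turns each string into a list; note the leading space kept on " Phone"
as_lists = toy.assign(products_purchased=toy["products_purchased"].str.split(","))
print(as_lists["products_purchased"].tolist())  # [['Laptop', ' Phone'], ['Tablet']]

# explode() gives every list element its own row and repeats order_id
exploded = as_lists.explode("products_purchased")
print(exploded)
```

The stray space in `' Phone'` is why the whitespace is stripped in a later step.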

Removing Extra Spaces:

```
df_1NF_exploded['products_purchased'] = df_1NF_exploded['products_purchased'].str.strip()
```

`.str.strip()` removes any leading or trailing spaces from each product name.

This is needed here because splitting on ',' leaves a space in front of every item after the first (e.g., " Phone" → "Phone"). It also handles data with spaces on both sides of the commas (e.g., "Laptop , Phone , Tablet").

Resetting the Index:

```
df_1NF_exploded = df_1NF_exploded.reset_index(drop=True)
```

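`explode()` repeats the original row label for every new row it creates, so the index contains duplicates (0, 0, 1, 1, 1, ...).

`reset_index(drop=True)` replaces it with a fresh sequential index starting at 0 and discards the old labels instead of keeping them as a column.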
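
The effect on the index can be seen in a short sketch. The `toy` frame below is a hypothetical example that already contains lists, so only the index behaviour is shown.

```python
import pandas as pd

# Hypothetical toy data whose column already holds lists
toy = pd.DataFrame({
    "order_id": [1, 2],
    "products_purchased": [["Laptop", "Phone"], ["Tablet"]],
})

exploded = toy.explode("products_purchased")
print(exploded.index.tolist())  # [0, 0, 1]  (original row labels are repeated)

clean = exploded.reset_index(drop=True)
print(clean.index.tolist())     # [0, 1, 2]  (fresh sequential index)
```

In pandas 1.1.0 and later, `explode(..., ignore_index=True)` produces the sequential index in one call.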

In [None]:
# Explode the 'products_purchased' column
df_1NF_exploded = df_1NF.assign(products_purchased=df_1NF['products_purchased'].str.split(',')).explode('products_purchased')

# Remove leading/trailing whitespace from the 'products_purchased' column
df_1NF_exploded['products_purchased'] = df_1NF_exploded['products_purchased'].str.strip()

# Reset the index
df_1NF_exploded = df_1NF_exploded.reset_index(drop=True)

df_1NF_exploded


| | order_id | first_name | last_name | products_purchased | total_spend | payment_method |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1001 | Emma | Brown | Tooth Brush | 1024.56 | Google Pay |
| 1 | 1001 | Emma | Brown | Hair Dryer | 1024.56 | Google Pay |
| 2 | 1002 | Olivia | Smith | Tooth Brush | 2078.96 | Google Pay |
| 3 | 1002 | Olivia | Smith | TV | 2078.96 | Google Pay |
| 4 | 1002 | Olivia | Smith | Hair Dryer | 2078.96 | Google Pay |
| 5 | 1003 | Bob | Moore | Tooth Brush | 2404.72 | Google Pay |
| 6 | 1003 | Bob | Moore | Computer | 2404.72 | Google Pay |
| 7 | 1003 | Bob | Moore | Phone | 2404.72 | Google Pay |
| 8 | 1004 | David | Brown | Computer | 1151.34 | Credit Card |
| 9 | 1004 | David | Brown | Phone | 1151.34 | Credit Card |


# **Convert to 2NF**
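In the exploded table the key is (`order_id`, `products_purchased`), but the customer names and `payment_method` depend on `order_id` alone. To remove these partial dependencies, split the data into three tables:

Customers Table (customer_id, first_name, last_name)

Orders Table (order_id, customer_id, payment_method)

Order_Items Table (order_id, product, price)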
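
A partial dependency can be spotted in the data by checking whether a column takes a single value per `order_id`. The sketch below uses a small sample copied from two of the orders above; the helper name `depends_only_on` is a hypothetical one introduced for illustration.

```python
import pandas as pd

# Small sample shaped like the exploded 1NF table (orders 1001 and 1004)
rows = pd.DataFrame({
    "order_id": [1001, 1001, 1004, 1004],
    "products_purchased": ["Tooth Brush", "Hair Dryer", "Computer", "Phone"],
    "first_name": ["Emma", "Emma", "David", "David"],
    "payment_method": ["Google Pay", "Google Pay", "Credit Card", "Credit Card"],
})

def depends_only_on(df, determinant, column):
    """Return True if each value of `determinant` maps to exactly one value of `column`."""
    return bool((df.groupby(determinant)[column].nunique() <= 1).all())

print(depends_only_on(rows, "order_id", "first_name"))          # True
print(depends_only_on(rows, "order_id", "payment_method"))      # True
print(depends_only_on(rows, "order_id", "products_purchased"))  # False
```

A check like this can only refute a dependency on the rows at hand. Whether the dependency holds in general is a question about the business rules, not about one sample.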

# **Create the Customers Table**

Selects `first_name` and `last_name` from `df_1NF` and removes rows where both names are duplicated.

`reset_index(drop=True)` discards the old index and renumbers the rows from 0.

`customers.index + 1` is then stored in a new `customer_id` column, so the IDs start at 1.

`customer_id` is unique per customer, so it would serve as the primary key in a relational database. Note that this approach treats two customers with the same first and last name as one person.

In [None]:
# Generate unique customers
customers = df_1NF[['first_name', 'last_name']].drop_duplicates().reset_index(drop=True)
customers['customer_id'] = customers.index + 1  # Assign unique customer IDs

# Display the table
display(customers)


| | first_name | last_name | customer_id |
| --- | --- | --- | --- |
| 0 | Emma | Brown | 1 |
| 1 | Olivia | Smith | 2 |
| 2 | Bob | Moore | 3 |
| 3 | David | Brown | 4 |
| 4 | Noah | Jones | 5 |


# **Create the Orders Table**

In [None]:

# Merge with customers to assign customer_id to each order
df_1NF = df_1NF.merge(customers, on=['first_name', 'last_name'], how='left')

# Create Orders Table
orders = df_1NF[['order_id', 'customer_id', 'payment_method']]

# Display the table
display(orders)


| | order_id | customer_id | payment_method |
| --- | --- | --- | --- |
| 0 | 1001 | 1 | Google Pay |
| 1 | 1002 | 2 | Google Pay |
| 2 | 1003 | 3 | Google Pay |
| 3 | 1004 | 4 | Credit Card |
| 4 | 1005 | 5 | Credit Card |
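
Once the two tables exist, it is worth checking that the keys behave as keys. The sketch below is one way to do that, using small stand-in copies of the `customers` and `orders` tables (two rows each, taken from the outputs above) so that it runs on its own.

```python
import pandas as pd

# Stand-ins for the customers and orders tables built above
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "first_name": ["Emma", "Olivia"],
    "last_name": ["Brown", "Smith"],
})
orders = pd.DataFrame({
    "order_id": [1001, 1002],
    "customer_id": [1, 2],
    "payment_method": ["Google Pay", "Google Pay"],
})

# Primary keys must be unique
assert customers["customer_id"].is_unique
assert orders["order_id"].is_unique

# Every foreign key in orders must point at an existing customer
assert orders["customer_id"].isin(customers["customer_id"]).all()

# validate="many_to_one" makes merge raise MergeError if customer_id repeats in customers
joined = orders.merge(customers, on="customer_id", how="left", validate="many_to_one")
print(joined)
```

Joining back in this way also confirms that no customer information was lost when the tables were split.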


# **Create the Order Items Table**

In [None]:
# Step 1: Convert multi-valued 'products_purchased' column into separate rows
order_items = []

for _, row in df_1NF.iterrows():
    order_id = row["order_id"]
    products = row["products_purchased"].split(", ")  # Splitting products
    total_spend = row["total_spend"]

    # Split total spend evenly among products (a placeholder: per-item prices are not in the data)
    product_prices = [round(total_spend / len(products), 2) for _ in products]

    for product, price in zip(products, product_prices):
        order_items.append({"order_id": order_id, "product": product, "price": price})

# Create Order_Items DataFrame
df_order_items = pd.DataFrame(order_items)

# Display the table
display(df_order_items)


| | order_id | product | price |
| --- | --- | --- | --- |
| 0 | 1001 | Tooth Brush | 512.28 |
| 1 | 1001 | Hair Dryer | 512.28 |
| 2 | 1002 | Tooth Brush | 692.99 |
| 3 | 1002 | TV | 692.99 |
| 4 | 1002 | Hair Dryer | 692.99 |
| 5 | 1003 | Tooth Brush | 801.57 |
| 6 | 1003 | Computer | 801.57 |
| 7 | 1003 | Phone | 801.57 |
| 8 | 1004 | Computer | 575.67 |
| 9 | 1004 | Phone | 575.67 |
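
Because each share is rounded independently, the prices need not add back up to `total_spend`: for order 1002, 3 × 692.99 = 2078.97, not 2078.96. One possible fix, sketched below, is to do the split in whole cents and hand the leftover cents to the first few items. The function name `split_evenly` is a hypothetical helper, not part of the notebook.

```python
def split_evenly(total, n):
    """Split `total` into n amounts (2 decimal places) that sum exactly to `total`."""
    cents = round(total * 100)       # work in integer cents to avoid drift
    base, extra = divmod(cents, n)   # `extra` cents are left over after an even split
    # The first `extra` items each receive one additional cent
    return [(base + (1 if i < extra else 0)) / 100 for i in range(n)]

shares = split_evenly(2078.96, 3)
print(shares)  # [692.99, 692.99, 692.98]
```

The split is still a placeholder for real per-item prices, but the order total is now preserved.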
