
# **Implementing 2nd Normal Form**

---



# **Task Description:**

---

You are given a dataset representing student course registrations, where multiple courses are stored in a single column as a comma-separated list.

Your task is to normalise this data to First Normal Form (1NF) using Python.



In [1]:
import pandas as pd

# Given dataset: not yet in 1NF, because products_purchased holds several values per row
data = {
    "order_id": [1001, 1002, 1003, 1004, 1005],
    "first_name": ["Emma", "Olivia", "Bob", "David", "Noah"],
    "last_name": ["Brown", "Smith", "Moore", "Brown", "Jones"],
    "products_purchased": ["Tooth Brush, Hair Dryer", "Tooth Brush, TV, Hair Dryer",
                           "Tooth Brush, Computer, Phone", "Computer, Phone", "Mouse, Computer"],
    "total_spend": [1024.56, 2078.96, 2404.72, 1151.34, 1370.38],
    "payment_method": ["Google Pay", "Google Pay", "Google Pay", "Credit Card", "Credit Card"]
}

df_1NF = pd.DataFrame(data)
print(df_1NF)

   order_id first_name last_name            products_purchased  total_spend  \
0      1001       Emma     Brown       Tooth Brush, Hair Dryer      1024.56   
1      1002     Olivia     Smith   Tooth Brush, TV, Hair Dryer      2078.96   
2      1003        Bob     Moore  Tooth Brush, Computer, Phone      2404.72   
3      1004      David     Brown               Computer, Phone      1151.34   
4      1005       Noah     Jones               Mouse, Computer      1370.38   

  payment_method  
0     Google Pay  
1     Google Pay  
2     Google Pay  
3    Credit Card  
4    Credit Card  


Splitting the products_purchased Column:

`df_1NF.assign(products_purchased=df_1NF['products_purchased'].str.split(','))`

The products_purchased column contains multiple products in a single row as a comma-separated string (e.g., "Tooth Brush, Hair Dryer").

`.str.split(',')` converts this string into a list (e.g., ["Tooth Brush", " Hair Dryer"]). Every item after the first keeps its leading space, which is why the values are stripped in the next cell.

`assign()` returns a new DataFrame with this transformed column; the original DataFrame is not modified.
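
The intermediate result of the split can be inspected before exploding. The following is a minimal sketch on a hypothetical one-row frame (`orders` and `split_orders` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical one-row frame, used only to illustrate the intermediate step.
orders = pd.DataFrame({
    "order_id": [1001],
    "products_purchased": ["Tooth Brush, Hair Dryer"],
})

# assign() returns a new DataFrame; `orders` itself is left unchanged.
split_orders = orders.assign(
    products_purchased=orders["products_purchased"].str.split(",")
)

print(split_orders.loc[0, "products_purchased"])  # ['Tooth Brush', ' Hair Dryer']
print(orders.loc[0, "products_purchased"])        # Tooth Brush, Hair Dryer
```

The cell now holds a Python list, which is the shape `explode()` expects.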

In [3]:
# Explode the 'products_purchased' column
df_1NF_exploded = df_1NF.assign(products_purchased=df_1NF['products_purchased'].str.split(',')).explode('products_purchased')

# Remove leading/trailing whitespace from the 'products_purchased' column
df_1NF_exploded['products_purchased'] = df_1NF_exploded['products_purchased'].str.strip()

# Reset the index
df_1NF_exploded = df_1NF_exploded.reset_index(drop=True)

df_1NF_exploded


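| | order_id | first_name | last_name | products_purchased | total_spend | payment_method |
|---|---|---|---|---|---|---|
| 0 | 1001 | Emma | Brown | Tooth Brush | 1024.56 | Google Pay |
| 1 | 1001 | Emma | Brown | Hair Dryer | 1024.56 | Google Pay |
| 2 | 1002 | Olivia | Smith | Tooth Brush | 2078.96 | Google Pay |
| 3 | 1002 | Olivia | Smith | TV | 2078.96 | Google Pay |
| 4 | 1002 | Olivia | Smith | Hair Dryer | 2078.96 | Google Pay |
| 5 | 1003 | Bob | Moore | Tooth Brush | 2404.72 | Google Pay |
| 6 | 1003 | Bob | Moore | Computer | 2404.72 | Google Pay |
| 7 | 1003 | Bob | Moore | Phone | 2404.72 | Google Pay |
| 8 | 1004 | David | Brown | Computer | 1151.34 | Credit Card |
| 9 | 1004 | David | Brown | Phone | 1151.34 | Credit Card |
| 10 | 1005 | Noah | Jones | Mouse | 1370.38 | Credit Card |
| 11 | 1005 | Noah | Jones | Computer | 1370.38 | Credit Card |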
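
As a possible variation, the split and the whitespace clean-up can be combined by splitting on a regular expression. This sketch assumes pandas treats a multi-character pattern passed to `.str.split()` as a regex (its default behaviour); `orders`, `exploded` and `expected_rows` are illustrative names, and the two rows are copied from the dataset above.

```python
import pandas as pd

# Two rows copied from the dataset above, enough to show the pattern.
orders = pd.DataFrame({
    "order_id": [1004, 1005],
    "products_purchased": ["Computer, Phone", "Mouse, Computer"],
})

# The pattern consumes any whitespace around each comma, so no
# separate .str.strip() step is needed afterwards.
exploded = (
    orders.assign(products_purchased=orders["products_purchased"].str.split(r"\s*,\s*"))
          .explode("products_purchased")
          .reset_index(drop=True)
)

# Sanity check: one output row per product in the original strings.
expected_rows = orders["products_purchased"].str.count(",").add(1).sum()
assert len(exploded) == expected_rows

print(exploded)
```

The row-count check is a cheap way to confirm that no product was lost or duplicated during the transformation.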


# **Transform the Data to 1NF**

---

You need to eliminate multi-valued attributes (the courses column) and create a separate row for each course while preserving student details.

**Initialise an Empty List**
Creates an empty list to store the transformed rows in normalised form.

normalised_data = []

**Loop Through the DataFrame**
`df.iterrows()` steps through each row of the DataFrame, yielding an (index, row) pair each time.
The `_` is used because we don't need the row index, just the row values.

for _, row in df.iterrows():
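
To see what `iterrows()` hands back on each pass, here is a small sketch. The sample registrations are hypothetical (the notebook does not show the student dataset); only the column names follow the walkthrough.

```python
import pandas as pd

# Hypothetical registrations; the column names follow the walkthrough.
df = pd.DataFrame({
    "student_id": [1, 2],
    "student_name": ["Alice", "Bob"],
    "courses": ["Math, Science", "History"],
})

for index, row in df.iterrows():
    # `index` is the row label; `row` is a Series keyed by column name.
    print(index, type(row).__name__, row["student_name"], row["courses"])
```

Because each row arrives as a Series, individual fields are read with `row["column_name"]`, as the next step does.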

**Extract Data from Each Row**
Split the courses string on ", ": "Math, Science" -> ["Math", "Science"]

    student_id = row["student_id"]
    student_name = row["student_name"]
    courses = row["courses"].split(", ")  # Splitting courses
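
Splitting on `", "` only works when every comma is followed by exactly one space. The sketch below, using a hypothetical string with inconsistent spacing, compares it with splitting on `","` and stripping each piece:

```python
# Hypothetical input with inconsistent spacing after the commas.
raw = "Math,Science,  Art"

# Splitting on ", " only matches a comma followed by a space.
naive = raw.split(", ")

# Splitting on "," and stripping each piece tolerates any spacing.
robust = [course.strip() for course in raw.split(",")]

print(naive)   # ['Math,Science', ' Art']
print(robust)  # ['Math', 'Science', 'Art']
```

If the source data is known to be consistently formatted, `split(", ")` is fine; otherwise the second form is the safer choice.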

**Create Normalised Rows**
For each course, a new dictionary is created and appended to `normalised_data`.

    for course in courses:
        normalised_data.append({"student_id": student_id, "student_name": student_name, "course": course})

**Create a New DataFrame**
The normalised data is loaded into a DataFrame in which each course has its own row.

df_1NF = pd.DataFrame(normalised_data)



In [None]:
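# Step 1: Collect one record per (student, course) pair.
# Assumes `df` already holds "student_id", "student_name" and "courses" columns.
normalised_data = []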

for _, row in df.iterrows():
    student_id = row["student_id"]
    student_name = row["student_name"]
    courses = row["courses"].split(", ")  # Splitting courses

    for course in courses:
        normalised_data.append({"student_id": student_id, "student_name": student_name, "course": course})

# Step 2: Create new DataFrame
df_1NF = pd.DataFrame(normalised_data)

# Display the normalised DataFrame using the notebook's built-in display() function:
display(df_1NF)

# Alternatively, simply print the dataframe:
#print(df_1NF)
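
For comparison, the loop can also be written with the same `split` and `explode` approach used for the orders data. This is a self-contained sketch on hypothetical registrations (the sample values, `df_loop` and `df_vectorised` are assumptions made for illustration), and it checks that both routes produce the same rows:

```python
import pandas as pd

# Hypothetical registrations; the column names follow the cell above.
df = pd.DataFrame({
    "student_id": [1, 2],
    "student_name": ["Alice", "Bob"],
    "courses": ["Math, Science", "History, Math, Art"],
})

# Loop-based version, as in the cell above.
normalised_data = []
for _, row in df.iterrows():
    for course in row["courses"].split(", "):
        normalised_data.append({
            "student_id": row["student_id"],
            "student_name": row["student_name"],
            "course": course,
        })
df_loop = pd.DataFrame(normalised_data)

# Vectorised version: split into lists, then one row per list element.
df_vectorised = (
    df.assign(course=df["courses"].str.split(", "))
      .explode("course")
      .drop(columns="courses")
      .reset_index(drop=True)
)

# Both approaches should yield the same records in the same order.
assert df_loop.to_dict("records") == df_vectorised.to_dict("records")
print(df_vectorised)
```

The loop is easier to follow step by step, while the vectorised form is shorter and usually faster on larger DataFrames.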