# Fragrance Data Cleaning Pipeline


## Objectives
- Transform raw fragrance data into a clean, analysis-ready dataset
- Handle missing values and incorrect data types
- Normalize categorical variables
- Remove duplicate records
- Parse complex text fields

## Load Raw Data


In [98]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/raw/fra_perfumes.csv")


## Handle Missing Values


In [99]:
# Track missing ratings
df['rating_missing'] = df['Rating Value'].isna()

# Keep unrated rows, but fill their rating fields with 0
df['Rating Value'] = df['Rating Value'].fillna(0)
df['Rating Count'] = df['Rating Count'].fillna(0)

# Remove records lacking critical identity fields
df = df.dropna(subset=['Name', 'Gender', 'Description'])

# Verification
df[['Rating Value','Rating Count']].isna().sum()



Rating Value    0
Rating Count    0
dtype: int64

### Missing Data Strategy
- Kept records with missing ratings to avoid unnecessary data loss
- Filled missing rating values and counts with 0 to represent unrated fragrances, and recorded which rows had no rating value in `rating_missing`
- Removed records missing `Name`, `Gender`, or `Description`
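
One side effect of filling with 0 is that summary statistics on `Rating Value` are pulled down by unrated rows unless the `rating_missing` flag is used as a mask. The sketch below shows the difference on a made-up three-row frame; the `toy` data is illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook.
toy = pd.DataFrame({"Rating Value": [4.0, np.nan, 3.0]})

# Flag first, then fill, so the information about missingness survives.
toy["rating_missing"] = toy["Rating Value"].isna()
toy["Rating Value"] = toy["Rating Value"].fillna(0)

# The naive mean treats the filled 0 as a real rating.
naive_mean = toy["Rating Value"].mean()

# Masking with the flag restricts the mean to rated rows only.
rated_mean = toy.loc[~toy["rating_missing"], "Rating Value"].mean()

print(naive_mean, rated_mean)
```

Any aggregate over `Rating Value` computed before the unrated rows are dropped would need the same mask.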

## Fix Data Types



In [100]:
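# Clean Rating Count: drop thousands separators, then expand a trailing "k".
# Replacing "k" with "000" as text would turn a value such as "1.2k" into
# "1.2000", which parses as 1.2, so the suffix is handled numerically instead.
counts = (
    df['Rating Count']
    .astype(str)
    .str.strip()
    .str.replace(',', '', regex=False)
)
is_thousands = counts.str.endswith('k')
counts = pd.to_numeric(counts.str.rstrip('k'), errors='coerce')
counts = counts.where(~is_thousands, counts * 1000)

# Unparseable values become NaN above and are set to 0 here
df['Rating Count'] = counts.fillna(0).round().astype(int)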

# Verification
df[['Rating Value','Rating Count']].info()

<class 'pandas.core.frame.DataFrame'>
Index: 70100 entries, 0 to 70102
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rating Value  70100 non-null  float64
 1   Rating Count  70100 non-null  int64  
dtypes: float64(1), int64(1)
memory usage: 1.6 MB


### Data Type Corrections
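- Converted rating counts from strings to integers, removing thousands separators and expanding `k` suffixes
- Coerced values that still failed to parse to NaN (`errors='coerce'`) and set them to 0
- Confirmed both rating columns are numeric with no nulls: `Rating Value` as float64, `Rating Count` as int64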
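
Because coercion followed by `fillna(0)` is silent, it may be worth counting how many present-but-unparseable values it absorbs. The sketch below does this on made-up input; the `raw` values, including the `"n/a"` placeholder, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw values; "n/a" stands in for any malformed entry.
raw = pd.Series(["1,204", "87", "n/a", None])

was_present = raw.notna()
cleaned = raw.str.replace(",", "", regex=False)
numeric = pd.to_numeric(cleaned, errors="coerce")

# Values that existed in the raw data but could not be parsed.
coerced = was_present & numeric.isna()
print(raw[coerced].tolist())

# Only after the audit are the gaps filled and the dtype fixed.
counts = numeric.fillna(0).astype(int)
print(counts.tolist())
```

A non-empty list of coerced values would suggest a format the cleaning step does not yet handle.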

## Normalize Categories


In [101]:
# --- Normalize Gender ---

# Inspect original values
print(df['Gender'].value_counts(dropna=False))

# Clean casing and whitespace
df['Gender'] = df['Gender'].str.lower().str.strip()

# Map to final categories
gender_map = {
    'for women': 'Women',
    'for men': 'Men',
    'for women and men': 'Unisex'
}

df['Gender'] = df['Gender'].replace(gender_map)

# Verify
print(df['Gender'].value_counts(dropna=False))

# --- Clean Name field (after Gender is fixed) ---

# The gender phrase may be fused to the preceding word with no space, so
# match it with or without leading whitespace. The longest alternative comes
# first so that "for women and men" is not cut down to " and men".
gender_phrase = r'\s*for (?:women and men|women|men)'

# Remove gender phrases from Name in a single pass
df['Name'] = (
    df['Name']
    .str.replace(gender_phrase, '', case=False, regex=True)
    .str.strip()
)

# Final check
df[['Name','Gender']].head(10)

Gender
for women and men    29708
for women            28102
for men              12290
Name: count, dtype: int64
Gender
Unisex    29708
Women     28102
Men       12290
Name: count, dtype: int64


| | Name | Gender |
|---|---|---|
| 0 | 9am Afnan | Women |
| 1 | 9am Dive Afnan | Unisex |
| 2 | 9am pour Femme Afnan | Women |
| 3 | 9pm Afnan | Men |
| 4 | 9pm pour Femme Afnan | Women |
| 5 | Naseej Al Kiswah Afnan | Unisex |
| 6 | Naseej Al Oud Afnan | Unisex |
| 7 | Naseej Al Ward Afnan | Unisex |
| 8 | Naseej Al Zafaran Afnan | Unisex |
| 9 | Afzal Abeer Afnan | Women |


### Category Normalization
- Standardized gender labels into three categories: Women, Men, and Unisex
- Stripped the gender phrase from `Name`, since `Gender` now carries that information
- Made grouping and filtering by gender consistent
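
`Series.replace` leaves any label that is not in the mapping untouched, so an unexpected value would pass through unnoticed. The sketch below uses `Series.map` instead, which yields NaN for unmapped labels and so makes them easy to list. The `raw_gender` values, including `"for kids"`, are made up for illustration.

```python
import pandas as pd

# Hypothetical raw labels, including one the mapping does not cover.
raw_gender = pd.Series([" For Women", "for men", "for women and men", "for kids"])

gender_map = {
    "for women": "Women",
    "for men": "Men",
    "for women and men": "Unisex",
}

normalized = raw_gender.str.lower().str.strip()

# Unlike .replace, .map returns NaN for keys missing from the dict.
mapped = normalized.map(gender_map)
unmapped = sorted(normalized[mapped.isna()].unique())
print(unmapped)

# A categorical dtype with a fixed category list documents the allowed values.
gender = mapped.astype(pd.CategoricalDtype(["Women", "Men", "Unisex"]))
print(gender.value_counts(dropna=False))
```

In this dataset the value counts show only the three expected labels, so the check would come back empty.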

## Remove Duplicates


In [102]:
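# Count rows that repeat an earlier (Name, url) pair.
# Not displayed: a notebook cell echoes only its last expression.
df.duplicated(subset=['Name', 'url']).sum()

# Drop repeats of the (Name, url) key, keeping the first occurrence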
before = len(df)
df = df.drop_duplicates(subset=['Name', 'url'], keep='first')
after = len(df)
before, after, before - after


(70100, 69945, 155)

### Duplicate Handling
- Identified duplicates using `Name` and `url` together as the key
- Kept the first occurrence of each unique fragrance entry
- Removed 155 duplicate records, reducing the dataset from 70,100 to 69,945 rows
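
Keeping the first occurrence assumes the repeated rows agree on their other columns. One way to check this before dropping is to pull out every member of each duplicate group with `keep=False`. The `toy` rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; the second and third share the same (Name, url) key.
toy = pd.DataFrame({
    "Name": ["A", "B", "B", "B"],
    "url": ["u/a", "u/b", "u/b", "u/b2"],
    "Rating Count": [10, 5, 7, 1],
})

key = ["Name", "url"]

# keep=False marks every member of a duplicate group, not just the later ones,
# so conflicting values can be compared side by side before dropping.
dupe_groups = toy[toy.duplicated(subset=key, keep=False)].sort_values(key)
print(dupe_groups)

deduped = toy.drop_duplicates(subset=key, keep="first")
print(len(toy), len(deduped), len(toy) - len(deduped))
```

Here the two `B` rows disagree on `Rating Count`, which is the kind of conflict that `keep='first'` resolves arbitrarily by row order.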

## Parse Complex Fields


### Extract Top, Middle, and Base Notes


In [103]:
import re

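def extract_notes(text, note_type):
    """Return the note list that follows '<note_type> notes are' in a description.

    Matching is case-insensitive and stops at the first '.' or ';'.
    Returns NaN when the description is missing or has no such phrase.
    """
    if pd.isna(text):
        return np.nan
    text = text.lower()
    # Capture everything after "<note_type> notes are" up to the next '.' or ';'
    pattern = rf"{note_type} notes are ([^.;]+)"
    match = re.search(pattern, text)
    if match:
        return match.group(1).strip()
    return np.nan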

df['Top Notes'] = df['Description'].apply(lambda x: extract_notes(x, 'top'))
df['Middle Notes'] = df['Description'].apply(lambda x: extract_notes(x, 'middle'))
df['Base Notes'] = df['Description'].apply(lambda x: extract_notes(x, 'base'))

# Capitalize extracted notes for consistency
for col in ['Top Notes', 'Middle Notes', 'Base Notes']:
    df[col] = df[col].str.title()

# Verification
df[['Name','Top Notes','Middle Notes','Base Notes']].head(10)


| | Name | Top Notes | Middle Notes | Base Notes |
|---|---|---|---|---|
| 0 | 9am Afnan | Lemon, Mandarin Orange, Cardamom And Pink Pepper | Lavender, Green Apple, Orange Blossom And Rose | Musk, Moss, Cedar And Patchouli |
| 1 | 9am Dive Afnan | Lemon, Mint, Black Currant And Pink Pepper | Apple, Incense And Cedar | Ginger, Sandalwood, Patchouli And Jasmine |
| 2 | 9am pour Femme Afnan | Mandarin Orange, Grapefruit And Bergamot | Raspberry And Black Currant | Musk, Amber And Orange |
| 3 | 9pm Afnan | Apple, Cinnamon, Wild Lavender And Bergamot | Orange Blossom And Lily-Of-The-Valley | Vanilla, Tonka Bean, Amber And Patchouli |
| 4 | 9pm pour Femme Afnan | Raspberry, Violet, Apple And Orange | Rose, Iris, Peony And Jasmine | Cypress, Pine, Cedar And Amber |
| 5 | Naseej Al Kiswah Afnan | Patchouli, Tonka Bean And Amber | Woodsy Notes And Cedar | Amber, Leather And Agarwood (Oud) |
| 6 | Naseej Al Oud Afnan | Agarwood (Oud), Pink Pepper, Bergamot And Saffron | Woodsy Notes, Rose, Jasmine And Orris | Agarwood (Oud), Leather And Amber |
| 7 | Naseej Al Ward Afnan | Rose, Raspberry And Saffron | Agarwood (Oud) And Patchouli | Rose, Musk And Cedar |
| 8 | Naseej Al Zafaran Afnan | Saffron, Black Pepper And Cardamom | Cedar, Agarwood (Oud) And Rose | Vetiver And Sweet Notes |
| 9 | Afzal Abeer Afnan | NaN | NaN | NaN |
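
The extracted values are still single strings in which the last two notes are joined by "And". For note-level analysis they could be split into lists, as in the sketch below. `split_notes` is a hypothetical helper, not part of the notebook, and the sample strings are copied from the output above.

```python
import re

import numpy as np
import pandas as pd

def split_notes(text):
    """Split a string such as 'Rose, Musk And Cedar' into a list of notes."""
    if pd.isna(text):
        return []
    # Split on commas and on the standalone word "and" (any casing).
    # The word boundaries keep names such as "Sandalwood" intact.
    parts = re.split(r",|\band\b", text, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]

notes = pd.Series([
    "Rose, Musk And Cedar",
    "Orange Blossom And Lily-Of-The-Valley",
    np.nan,
])
note_lists = notes.apply(split_notes)
print(note_lists.tolist())
```

From there, `Series.explode()` would give one row per note, which is a convenient shape for counting how often each note appears.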


In [104]:
# Remove non-informative columns
df_final = df.drop(columns=['Perfumers'])


# Final dataset shape and missing values check
print("Final shape:", df_final.shape)
print("\nMissing values (top 10):")
print(df_final.isna().sum().sort_values(ascending=False).head(10))

# Export cleaned dataset
df_final.to_csv("../data/processed/cleaned_fragrances.csv", index=False)

Final shape: (69945, 11)

Missing values (top 10):
Top Notes       28647
Middle Notes    27983
Base Notes      27504
Rating Value        0
Gender              0
Name                0
Rating Count        0
url                 0
Description         0
Main Accords        0
dtype: int64


### Final Notes
- The `Perfumers` column was removed because it contained no populated values across the dataset.
- Brand information is sufficient for the analysis, so losing perfumer attribution is acceptable.
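
Whether a column holds no populated values can be checked directly rather than by inspection. The sketch below runs that check on a made-up two-row frame; the `toy` data is an assumption, and only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one fully empty column.
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Perfumers": [np.nan, np.nan],
})

# Columns in which every value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# Dropping by the computed list avoids hard-coding column names.
trimmed = toy.drop(columns=empty_cols)
print(trimmed.columns.tolist())
```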

## Final Validation & Export


In [105]:
# Reorder columns so 'url' is last
cols = [c for c in df.columns if c != 'url'] + ['url']
df = df[cols]

# Remove unrated fragrances (no user ratings)
before = len(df)
df = df[~df['rating_missing']]
after = len(df)

print(f"Removed {before - after} unrated rows")

# Remove non-informative / redundant columns
df_final = df.drop(columns=['rating_missing','Perfumers','Main Accords','Description'], errors='ignore')

# Final dataset shape and missing values check
print("Final shape:", df_final.shape)
print("\nMissing values (top 10):")
print(df_final.isna().sum().sort_values(ascending=False).head(10))

# Export cleaned dataset
df_final.to_csv("../data/processed/cleaned_fragrances.csv", index=False)


Removed 6174 unrated rows
Final shape: (63771, 8)

Missing values (top 10):
Top Notes       25277
Middle Notes    24642
Base Notes      24218
Name                0
Rating Count        0
Rating Value        0
Gender              0
url                 0
dtype: int64
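
Before exporting, a few assertions can guard the invariants that the earlier steps are meant to establish. This is a minimal sketch: `check_clean` is a hypothetical helper, the `toy` frame is made up, and the chosen checks are one possible selection rather than a complete validation suite.

```python
import pandas as pd

def check_clean(frame):
    """Hypothetical helper: raise AssertionError if a cleaning invariant is broken."""
    core = ["Name", "Gender", "Rating Value", "Rating Count"]
    assert not frame.duplicated(subset=["Name", "url"]).any(), "duplicate (Name, url) keys"
    assert frame[core].notna().all().all(), "missing core fields"
    assert set(frame["Gender"]) <= {"Women", "Men", "Unisex"}, "unexpected Gender label"
    assert (frame["Rating Count"] >= 0).all(), "negative rating count"
    return True

# Toy frame that satisfies every invariant.
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Gender": ["Women", "Unisex"],
    "Rating Value": [4.1, 3.8],
    "Rating Count": [120, 15],
    "url": ["u/a", "u/b"],
})
ok = check_clean(toy)
print(ok)
```

The note columns are deliberately left out of the null check, since missing notes are expected in the final dataset.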
