# Fragrance Data Cleaning Pipeline


## Objectives
- Transform raw fragrance data into a clean, analysis-ready dataset
- Handle missing values and incorrect data types
- Normalize categorical variables
- Remove duplicate records
- Parse complex text fields

## Load Raw Data


In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/raw/fra_perfumes.csv")


## Handle Missing Values


In [None]:
# Track missing ratings
df['rating_missing'] = df['Rating Value'].isna()

# Keep unrated records, but fill their missing ratings with 0
df['Rating Value'] = df['Rating Value'].fillna(0)
df['Rating Count'] = df['Rating Count'].fillna(0)

# Remove records lacking critical identity fields
df = df.dropna(subset=['Name', 'Gender', 'Description'])

# Verification
df[['Rating Value','Rating Count']].isna().sum()



Rating Value    0
Rating Count    0
dtype: int64

### Missing Data Strategy
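- Kept records with missing ratings to avoid unnecessary data loss
- Filled missing rating values and counts with 0 to represent unrated fragrances; the `rating_missing` flag distinguishes these placeholders from real ratings
- Removed records missing any of the identifying fields (`Name`, `Gender`, `Description`)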
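
One consequence of the zero-fill is that a plain average over `Rating Value` is pulled down by unrated fragrances. The sketch below uses a toy frame (the sample values are invented for illustration; only the column names come from the notebook) to show how the `rating_missing` flag can exclude the placeholders:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; column names follow the notebook.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Rating Value": [4.0, np.nan, 3.0],
})

# Flag first, then fill, mirroring the order used in the cell above.
toy["rating_missing"] = toy["Rating Value"].isna()
toy["Rating Value"] = toy["Rating Value"].fillna(0)

# The naive mean includes the 0 placeholder for the unrated row.
naive_mean = toy["Rating Value"].mean()

# Restricting to rows with a real rating removes that bias.
rated_mean = toy.loc[~toy["rating_missing"], "Rating Value"].mean()

print(naive_mean, rated_mean)
```

Any statistic computed on `Rating Value` later in the analysis would likely need the same filter.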

## Fix Data Types



In [9]:
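# Clean Rating Count: drop thousands separators and expand a trailing "k"
# (e.g. "1.5k" -> 1500; replacing "k" with "000" would give "1.5000")
counts = (
    df['Rating Count']
    .astype(str)
    .str.strip()
    .str.replace(',', '', regex=False)
)
has_k = counts.str.lower().str.endswith('k')
counts = pd.to_numeric(counts.str.rstrip('kK'), errors='coerce')
counts = counts.where(~has_k, counts * 1000)

# Unparseable values become NaN under coercion and are stored as 0
df['Rating Count'] = counts.fillna(0).round().astype(int)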

# Verification
df[['Rating Value','Rating Count']].info()

<class 'pandas.core.frame.DataFrame'>
Index: 70100 entries, 0 to 70102
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rating Value  70100 non-null  float64
 1   Rating Count  70100 non-null  int64  
dtypes: float64(1), int64(1)
memory usage: 1.6 MB


### Data Type Corrections
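- Converted `Rating Count` from strings to integers, handling thousands separators and `k` suffixes
- Coerced non-numeric or malformed values to NaN with `errors='coerce'`, then set them to 0
- Confirmed with `info()` that both rating columns are fully populated and numeric (`float64` and `int64`)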
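
Because `errors='coerce'` followed by `fillna(0)` silently turns every unparseable value into 0, it can help to count those values before filling. This is a minimal sketch on invented sample strings; the variable names are illustrative and not part of the notebook:

```python
import pandas as pd

# Sample strings invented for illustration.
raw = pd.Series(["1,234", "87", "n/a", "unknown"], name="Rating Count")

# Strip thousands separators, then coerce anything non-numeric to NaN.
cleaned = raw.str.replace(",", "", regex=False)
numeric = pd.to_numeric(cleaned, errors="coerce")

# Values that were present in the raw data but could not be parsed.
unparsed = raw[numeric.isna() & raw.notna()]
print(f"{len(unparsed)} of {len(raw)} values could not be parsed")
print(unparsed.tolist())

# Only after inspecting the failures, store them as 0.
counts = numeric.fillna(0).astype(int)
```

If the number of unparsed values is large, the affected strings probably point to a format the cleaning step does not yet handle.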

## Normalize Categories


## Remove Duplicates


## Parse Complex Fields


## Final Validation & Export
