# Data Cleaning Project – Orders Dataset

In this project, I practice core data cleaning techniques using a fictional orders dataset. The dataset contains common data quality issues like:

- Missing names and malformed emails  
- Inconsistent date formats  
- Wrong or missing price, quantity, and total values

This type of task is essential in real-world data projects, where raw data must be cleaned before analysis or modeling.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('orders.csv')

In [3]:
df.info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       6 non-null      int64  
 1   customer_name  5 non-null      object 
 2   email          5 non-null      object 
 3   order_date     6 non-null      object 
 4   product        6 non-null      object 
 5   price          5 non-null      object 
 6   quantity       5 non-null      float64
 7   total          4 non-null      float64
dtypes: float64(2), int64(1), object(5)
memory usage: 516.0+ bytes


Unnamed: 0,order_id,customer_name,email,order_date,product,price,quantity,total
0,1,João Silva,joao.silva@email.com,2024-03-15,Mouse,49.99,2.0,
1,2,Ana Souza,,15/03/2024,Monitor,abc,1.0,299.99
2,3,Carlos Lima,carlos@email.com,2024/03/16,Teclado,,3.0,
3,4,,maria@email.com,2024-03-16,Laptop,1999.90,,1999.9
4,5,Pedro Rocha,pedro@email,2024-03-17,Mouse,49.99,2.0,99.98
5,6,Fernanda Silva,fernanda@email.com,2024-03-18,Monitor,1299.99,1.0,1299.99


# Data Quality Issues Identified:

- `customer_name` has *missing values*
- `email` has *missing values* and *incorrect formatting*
- `order_date` has *inconsistent formatting*
- `price` has *missing values* and *non-numeric entries* (it is stored as `object`)
- `quantity` has *missing values* and is stored as *float* instead of *integer*
- `total` has *missing values*
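
The issues listed above can also be surfaced programmatically instead of by eye. The following is a minimal sketch, assuming a small `sample` frame that mirrors the rows shown earlier; the frame, the `email_pattern` regex, and the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative sample mirroring the orders shown above (not read from orders.csv)
sample = pd.DataFrame({
    'customer_name': ['João Silva', 'Ana Souza', 'Carlos Lima', None, 'Pedro Rocha', 'Fernanda Silva'],
    'email': ['joao.silva@email.com', None, 'carlos@email.com', 'maria@email.com', 'pedro@email', 'fernanda@email.com'],
    'price': ['49.99', 'abc', None, '1999.90', '49.99', '1299.99'],
    'quantity': [2.0, 1.0, 3.0, None, 2.0, 1.0],
    'total': [None, 299.99, None, 1999.9, 99.98, 1299.99],
})

# Count missing values per column
missing_counts = sample.isna().sum()

# Prices that are present but cannot be parsed as numbers
price_numeric = pd.to_numeric(sample['price'], errors='coerce')
bad_price = sample['price'].notna() & price_numeric.isna()

# Emails that are present but lack a "name@domain.tld" shape
# (simplified pattern chosen for this sketch, not a full email validator)
email_pattern = r'^[\w.+-]+@[\w-]+\.[\w.-]+$'
bad_email = sample['email'].notna() & ~sample['email'].str.match(email_pattern, na=False)

print(missing_counts)
print(sample.loc[bad_price, 'price'])
print(sample.loc[bad_email, 'email'])
```

On this sample, the checks flag the `abc` price and the `pedro@email` address, which matches the manual inspection.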

### `customer_name`

In [4]:
# Use .index[0] to get the index of the first null value in the 'customer_name' column
# and assign the new value 'Maria' to it.
# To change a different null value, use a different position in .index.

idx = df[df['customer_name'].isna()].index[0] # 0 selects the first null value.
df.loc[idx, 'customer_name'] = 'Maria'

df['customer_name']

0        João Silva
1         Ana Souza
2       Carlos Lima
3             Maria
4       Pedro Rocha
5    Fernanda Silva
Name: customer_name, dtype: object
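
Hardcoding a single name works for one row but does not scale to many missing names. One possible alternative, assuming the part of the email before `@` is an acceptable stand-in for the name (an assumption this dataset does not guarantee), is sketched below with illustrative data.

```python
import pandas as pd

# Illustrative rows; the fallback rule below is an assumption, not the notebook's method
names = pd.DataFrame({
    'customer_name': ['Ana Souza', None, 'Pedro Rocha'],
    'email': [None, 'maria@email.com', 'pedro@email'],
})

# Take the part before '@', turn dots into spaces, and title-case it
name_from_email = (
    names['email']
    .str.split('@').str[0]
    .str.replace('.', ' ', regex=False)
    .str.title()
)

# Only rows where customer_name is missing are filled
names['customer_name'] = names['customer_name'].fillna(name_from_email)
print(names['customer_name'].tolist())
# ['Ana Souza', 'Maria', 'Pedro Rocha']
```

Rows with neither a name nor an email would stay missing under this rule and would still need manual review.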

### `email`

In [5]:
df['email'] = df['email'].fillna('') # Filling missing emails with an empty string
df['email'] = df['email'].str.replace(r'(@\w+)$', r'\1.com', regex=True) # Adding '.com' to emails that do not have it at the end

df['email']

0    joao.silva@email.com
1                        
2        carlos@email.com
3         maria@email.com
4         pedro@email.com
5      fernanda@email.com
Name: email, dtype: object
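
To make the behaviour of the regex concrete, here is a small sketch on a few illustrative addresses (the `emails` series is made up for this example). The pattern only matches when the domain contains no dot, so addresses that already have a suffix and empty strings are left alone.

```python
import pandas as pd

emails = pd.Series(['pedro@email', 'joao.silva@email.com', '', 'ana@mail.org'])

# '(@\w+)$' matches only when nothing but word characters follow '@' up to the end,
# i.e. when the domain has no dot at all
fixed = emails.str.replace(r'(@\w+)$', r'\1.com', regex=True)
print(fixed.tolist())
# ['pedro@email.com', 'joao.silva@email.com', '', 'ana@mail.org']
```

Appending `.com` is itself a guess; with real data the correct suffix may be something else, so this kind of fix is best treated as a heuristic.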

### `order_date`

In [6]:
# Function to parse dates in different formats
def parse_date(date_str):
    try:
        # First, it tries to parse as year-month-day
        return pd.to_datetime(date_str, format='%Y-%m-%d')
    except ValueError:
        # If that fails, let pandas infer the format, reading ambiguous dates as day-first
        return pd.to_datetime(date_str, dayfirst=True)

df['order_date'] = df['order_date'].apply(parse_date)

df


Unnamed: 0,order_id,customer_name,email,order_date,product,price,quantity,total
0,1,João Silva,joao.silva@email.com,2024-03-15,Mouse,49.99,2.0,
1,2,Ana Souza,,2024-03-15,Monitor,abc,1.0,299.99
2,3,Carlos Lima,carlos@email.com,2024-03-16,Teclado,,3.0,
3,4,Maria,maria@email.com,2024-03-16,Laptop,1999.90,,1999.9
4,5,Pedro Rocha,pedro@email.com,2024-03-17,Mouse,49.99,2.0,99.98
5,6,Fernanda Silva,fernanda@email.com,2024-03-18,Monitor,1299.99,1.0,1299.99
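
An alternative to relying on format inference is to list the formats seen in the column and try each one explicitly. The sketch below assumes the three formats observed in `order_date`; `KNOWN_FORMATS` and `parse_date_explicit` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Formats observed in the order_date column above; extend the tuple if others appear
KNOWN_FORMATS = ('%Y-%m-%d', '%d/%m/%Y', '%Y/%m/%d')

def parse_date_explicit(date_str):
    """Try each known format in turn; return NaT if none matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except ValueError:
            continue
    return pd.NaT

raw_dates = pd.Series(['2024-03-15', '15/03/2024', '2024/03/16', 'not a date'])
parsed = raw_dates.apply(parse_date_explicit)
print(parsed)
```

With explicit formats, an unexpected value becomes `NaT` and can be reviewed, rather than being silently interpreted with a guessed day/month order.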


### `price`, `quantity` and `total`

In [7]:
# Convert the 'price' and 'total' columns to numeric types (invalid entries such as 'abc' become NaN),
# fill missing 'quantity' values with 1 (an assumption), then convert 'quantity' to integer

df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['total'] = pd.to_numeric(df['total'], errors='coerce')
df['quantity'] = df['quantity'].fillna(1).astype(int)

df

Unnamed: 0,order_id,customer_name,email,order_date,product,price,quantity,total
0,1,João Silva,joao.silva@email.com,2024-03-15,Mouse,49.99,2,
1,2,Ana Souza,,2024-03-15,Monitor,,1,299.99
2,3,Carlos Lima,carlos@email.com,2024-03-16,Teclado,,3,
3,4,Maria,maria@email.com,2024-03-16,Laptop,1999.9,1,1999.9
4,5,Pedro Rocha,pedro@email.com,2024-03-17,Mouse,49.99,2,99.98
5,6,Fernanda Silva,fernanda@email.com,2024-03-18,Monitor,1299.99,1,1299.99
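
Filling a missing `quantity` with 1 is a default rather than a derived value. Where both `price` and `total` are known, the quantity can instead be recovered as `total / price`. The sketch below assumes that `total` was originally computed as `price * quantity`; the `orders` frame is illustrative.

```python
import pandas as pd

# Illustrative rows: the second one has price and total but no quantity
orders = pd.DataFrame({
    'price': [49.99, 1999.90, None],
    'quantity': [2.0, None, None],
    'total': [99.98, 1999.90, None],
})

# Derive quantity where possible, rounding to absorb floating-point noise
derived_qty = (orders['total'] / orders['price']).round()

# Prefer the recorded quantity, then the derived one, then fall back to 1
orders['quantity'] = orders['quantity'].fillna(derived_qty).fillna(1).astype(int)
print(orders['quantity'].tolist())
# [2, 1, 1]
```

For the Laptop order in this dataset both approaches give 1, but the derived value would differ whenever the total covers more than one unit.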


In [8]:
# If 'price' is missing, we can calculate it from 'total' and 'quantity'

df['price'] = df['price'].fillna(df['total'] / df['quantity'])
df

Unnamed: 0,order_id,customer_name,email,order_date,product,price,quantity,total
0,1,João Silva,joao.silva@email.com,2024-03-15,Mouse,49.99,2,
1,2,Ana Souza,,2024-03-15,Monitor,299.99,1,299.99
2,3,Carlos Lima,carlos@email.com,2024-03-16,Teclado,,3,
3,4,Maria,maria@email.com,2024-03-16,Laptop,1999.9,1,1999.9
4,5,Pedro Rocha,pedro@email.com,2024-03-17,Mouse,49.99,2,99.98
5,6,Fernanda Silva,fernanda@email.com,2024-03-18,Monitor,1299.99,1,1299.99


In [9]:
# If 'total' is missing, we can calculate it from 'price' and 'quantity'

df['total'] = df['total'].fillna(df['price'] * df['quantity'])
df

Unnamed: 0,order_id,customer_name,email,order_date,product,price,quantity,total
0,1,João Silva,joao.silva@email.com,2024-03-15,Mouse,49.99,2,99.98
1,2,Ana Souza,,2024-03-15,Monitor,299.99,1,299.99
2,3,Carlos Lima,carlos@email.com,2024-03-16,Teclado,,3,
3,4,Maria,maria@email.com,2024-03-16,Laptop,1999.9,1,1999.9
4,5,Pedro Rocha,pedro@email.com,2024-03-17,Mouse,49.99,2,99.98
5,6,Fernanda Silva,fernanda@email.com,2024-03-18,Monitor,1299.99,1,1299.99


In [10]:
# For display only, show any remaining NaN values as empty strings.
# fillna returns a copy, so df itself keeps its NaN values and numeric dtypes.
df.fillna('')

Unnamed: 0,order_id,customer_name,email,order_date,product,price,quantity,total
0,1,João Silva,joao.silva@email.com,2024-03-15,Mouse,49.99,2,99.98
1,2,Ana Souza,,2024-03-15,Monitor,299.99,1,299.99
2,3,Carlos Lima,carlos@email.com,2024-03-16,Teclado,,3,
3,4,Maria,maria@email.com,2024-03-16,Laptop,1999.9,1,1999.9
4,5,Pedro Rocha,pedro@email.com,2024-03-17,Mouse,49.99,2,99.98
5,6,Fernanda Silva,fernanda@email.com,2024-03-18,Monitor,1299.99,1,1299.99
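
After these steps it can be useful to confirm that `price * quantity` agrees with `total` and to list the rows that are still incomplete. The following is a minimal sketch on illustrative rows; order 7 is a hypothetical, deliberately inconsistent row added only to show the check firing.

```python
import numpy as np
import pandas as pd

# Illustrative rows mirroring the cleaned table; order 7 is hypothetical
check = pd.DataFrame({
    'order_id': [1, 3, 5, 7],
    'price': [49.99, np.nan, 49.99, 10.00],
    'quantity': [2, 3, 2, 2],
    'total': [99.98, np.nan, 99.98, 25.00],
})

# Rows where price or total could not be recovered
incomplete = check[check[['price', 'total']].isna().any(axis=1)]

# Among complete rows, flag totals that differ from price * quantity beyond rounding error
complete = check.dropna(subset=['price', 'total'])
matches = np.isclose(complete['price'] * complete['quantity'], complete['total'])
inconsistent = complete[~matches]

print(incomplete['order_id'].tolist())    # [3]
print(inconsistent['order_id'].tolist())  # [7]
```

`np.isclose` is used instead of `==` because floating-point products such as `49.99 * 2` may not compare exactly equal to the stored total.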


# Summary of Cleaning

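- Cleaned `customer_name` (filled the missing name)  
- Cleaned `email` (replaced missing values with empty strings and added the missing `.com`)  
- Cleaned `order_date` (standardized its format)
- Cleaned `price`, `quantity` and `total` (converted to numeric types and recomputed missing values where possible; order 3 still lacks `price` and `total`)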

The cleaned dataset can now be saved and used in future analysis.

In [11]:
df.to_csv('orders_clean.csv', index=False)
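
CSV does not store column types, so some of the cleaning has to be reapplied when the file is read back. The sketch below uses an in-memory buffer in place of `orders_clean.csv` and an illustrative `clean` frame to show two effects: dates need `parse_dates` to come back as datetimes, and empty-string emails are read back as `NaN`.

```python
import io
import pandas as pd

# Illustrative cleaned rows; an in-memory buffer stands in for orders_clean.csv
clean = pd.DataFrame({
    'order_id': [1, 2],
    'email': ['joao.silva@email.com', ''],
    'order_date': pd.to_datetime(['2024-03-15', '2024-03-15']),
    'price': [49.99, 299.99],
})

buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype, which CSV itself does not store
reloaded = pd.read_csv(buffer, parse_dates=['order_date'])

print(reloaded.dtypes)
print(reloaded['email'].isna().sum())  # empty strings come back as NaN
```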