# Data Cleaning Template (v1.0)

**Author:** Julio Carneiro  
**Location:** Canada  
**Year:** 2025  
**GitHub:** github.com/juliocezarcarneiro/data-cleaning-template.git

## Project Overview

This notebook provides a complete workflow for cleaning and preparing datasets for analysis or modelling.
It includes:

- Data type correction
- Handling missing values
- Cleaning monetary fields
- Removing artifacts (footnotes, symbols, formatting issues)  
- Outlier detection  
- Final data validation  

This template can be reused across different projects to ensure consistent and professional data quality workflows.

## Importing Required Libraries

In this section, I import the Python libraries used throughout the notebook: `pandas` for data manipulation, `numpy` for numerical operations, `pathlib` for file paths, and `matplotlib` and `seaborn` for plotting.

In [133]:
# Import dependencies
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns

## Loading the Dataset

In this step, I load the raw CSV file into a pandas DataFrame. This dataset will be used as the starting point for all cleaning and preprocessing tasks. 

I also print the number of rows and columns to confirm that the file loaded correctly.


In [176]:
# Load the data
file_path = Path("~/Desktop/Projects/templates/data-cleaning-template/data/dirty-data-original.csv").expanduser()  # resolve "~" explicitly
df = pd.read_csv(file_path)

print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
df.head()

Shape: 20 rows × 11 columns


Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571",[1]
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3]
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294",[6]
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7]
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8]


## Initial Data Inspection

This section provides a quick overview of the raw dataset, including:

- First and last five records
- DataFrame shape
- Data types
- Missing value summary
- Duplicate rows
- Summary statistics

These checks help identify what cleaning actions are necessary.

In [183]:
# First 5 rows
print("First 5 rows")
display(df.head())

# Last 5 rows
print("\nLast 5 rows")
display(df.tail())

# Shape
print(f"\nShape: {df.shape[0]:,} rows × {df.shape[1]} columns")

# Data types
print("\nData types")
print(df.dtypes)

# Missing values
print("\nMissing values")
missing = df.isnull().sum()
percentage = (missing / len(df) * 100).round(1)

summary = pd.DataFrame({
    "Empty cells": missing,
    "% Empty": percentage
})

display(summary.sort_values("% Empty", ascending=False))

# Duplicate rows
print("Duplicate rows")
duplicates = df.duplicated().sum()
print(f"{duplicates:,} exact duplicate rows")

# Summary stats
print("\nNumbers summary")
display(df.describe().round(2))

# Basic stats (all columns)
print("\nBasic stats")
display(df.describe(include="all"))

First 5 rows


Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571",[1]
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3]
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294",[6]
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7]
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8]



Last 5 rows


Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.
15,16,,,"$184,000,000","$227,452,347",Pink,The Truth About Love Tour,2013–2014,142,"$1,295,775",[22]
16,17,,,"$170,000,000","$213,568,571",Lady Gaga,Born This Way Ball,2012–2013,98,"$1,734,694",[d]
17,18,,,"$169,800,000","$207,046,755",Madonna,Rebel Heart Tour,2015–2016,82,"$2,070,732",[4]
18,19,,,"$167,700,000[e]","$204,486,106",Adele,Adele Live 2016,2016–2017,121,"$1,385,950",[25]
19,20,,,"$150,000,000","$185,423,109",Taylor Swift,The Red Tour,2013–2014,86,"$1,744,186",[26]



Shape: 20 rows × 11 columns

Data types
Rank                                 int64
Peak                                object
All Time Peak                       object
Actual gross                        object
Adjusted gross (in 2022 dollars)    object
Artist                              object
Tour title                          object
Year(s)                             object
Shows                                int64
Average gross                       object
Ref.                                object
dtype: object

Missing values


Unnamed: 0,Empty cells,% Empty
All Time Peak,14,70.0
Peak,11,55.0
Rank,0,0.0
Actual gross,0,0.0
Adjusted gross (in 2022 dollars),0,0.0
Artist,0,0.0
Tour title,0,0.0
Year(s),0,0.0
Shows,0,0.0
Average gross,0,0.0


Duplicate rows
0 exact duplicate rows

Numbers summary


Unnamed: 0,Rank,Shows
count,20.0,20.0
mean,10.45,110.0
std,5.94,66.51
min,1.0,41.0
25%,5.75,59.0
50%,10.5,87.0
75%,15.25,134.5
max,20.0,325.0



Basic stats


Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.
count,20.0,9.0,6.0,20,20,20,20,20,20.0,20,20
unique,,7.0,6.0,20,20,9,20,16,,20,20
top,,1.0,2.0,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2013–2014,,"$13,928,571",[1]
freq,,2.0,1.0,1,1,4,1,3,,1,1
mean,10.45,,,,,,,,110.0,,
std,5.942488,,,,,,,,66.507617,,
min,1.0,,,,,,,,41.0,,
25%,5.75,,,,,,,,59.0,,
50%,10.5,,,,,,,,87.0,,
75%,15.25,,,,,,,,134.5,,


## Removing Duplicate Rows

Duplicate records can distort analysis and inflate summary statistics.  
In this step, I check for exact duplicate rows in the dataset and remove them using `drop_duplicates()`.

This ensures that each observation is unique and prevents biased results during aggregation or modelling. No duplicate rows were found.

In [190]:
# Remove duplicates
print("Duplicates before:", df.duplicated().sum())
df = df.drop_duplicates()
print("Duplicates after:", df.duplicated().sum())

Duplicates before: 0
Duplicates after: 0
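
Exact-match checks miss rows that repeat the same record with small differences. As a minimal sketch, assuming a tour is identified by `Artist` and `Tour_title` (an assumption, not something the dataset guarantees), the `subset` argument of `duplicated()` and `drop_duplicates()` restricts the comparison to those key columns. The toy data below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the first two rows describe the same tour
# but disagree on the number of shows
toy = pd.DataFrame({
    "Artist": ["Pink", "Pink", "Adele"],
    "Tour_title": ["Summer Carnival", "Summer Carnival", "Adele Live 2016"],
    "Shows": [41, 42, 121],
})

# Exact duplicates: every column must match
exact_dupes = int(toy.duplicated().sum())

# Key-based duplicates: only the listed columns must match
key_cols = ["Artist", "Tour_title"]
key_dupes = int(toy.duplicated(subset=key_cols).sum())

# Keep the first occurrence of each key and renumber the index
deduped = toy.drop_duplicates(subset=key_cols, keep="first").reset_index(drop=True)

print(f"Exact duplicates: {exact_dupes}, key-based duplicates: {key_dupes}")
```

Which row to keep (`keep="first"`, `"last"`, or `False` to drop all copies) depends on which source is more trustworthy, so it is worth reviewing key-based duplicates before removing them.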


## Standardizing Column Names

Consistent column names are essential for clean, readable code and reproducible analysis.

In this step:

- I rename columns to meaningful, standardized names  
- Names are concise and descriptive, with words joined by underscores and no spaces, parentheses, or punctuation (for example, `Adjusted_gross_2022`)  

This ensures clarity in all subsequent analysis steps.

In [195]:
# Standardize columns
# Rename columns to clean, descriptive names
df.columns = [
    "Rank", "Peak", "All_time_peak", "Actual_gross", "Adjusted_gross_2022",
    "Artist", "Tour_title", "Years", "Shows", "Average_gross", "Ref"
]

# Verify the changes
print("Columns forced to correct names:")
print(df.columns.tolist())

# Preview the first few rows
df.head()

Columns forced to correct names:
['Rank', 'Peak', 'All_time_peak', 'Actual_gross', 'Adjusted_gross_2022', 'Artist', 'Tour_title', 'Years', 'Shows', 'Average_gross', 'Ref']


Unnamed: 0,Rank,Peak,All_time_peak,Actual_gross,Adjusted_gross_2022,Artist,Tour_title,Years,Shows,Average_gross,Ref
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571",[1]
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3]
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294",[6]
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7]
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8]
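
For datasets with many columns, the manual list above could be replaced by a small helper. The sketch below uses a hypothetical `to_snake_case` function (not part of pandas) and a few of the raw column labels from this dataset. It produces fully lowercase names, so the result differs from the capitalized names used in the rest of this notebook.

```python
import re

def to_snake_case(name: str) -> str:
    """Lowercase a label and replace each run of non-alphanumeric characters with one underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()

# A few raw column labels from the original file
raw_columns = ["Rank", "All Time Peak", "Adjusted gross (in 2022 dollars)", "Year(s)", "Ref."]
clean_columns = [to_snake_case(c) for c in raw_columns]
print(clean_columns)
```

Applied to a DataFrame, this would be `df.columns = [to_snake_case(c) for c in df.columns]`. Unlike a positional list, it does not depend on the column order of the input file.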


## Cleaning Monetary Columns

Several columns contain monetary values with formatting symbols:

- `Actual_gross`, `Adjusted_gross_2022`, `Average_gross`  

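Steps applied:

1. Remove `$` and `,` symbols  
2. Strip extra whitespace  
3. Convert values to a numeric type, coercing anything that cannot be parsed to `NaN`  

This makes the monetary columns numeric and ready for calculations. Values that still carry a footnote marker, such as `$167,700,000[e]`, cannot be parsed and become `NaN`. This explains why `Actual_gross` ends up as `float64` below: it contains missing values after conversion, which show up in the next section.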

In [203]:
# Money columns: remove $ and commas
money_cols = ["Actual_gross", "Adjusted_gross_2022", "Average_gross"]

for col in money_cols:
    df[col] = (
        df[col]
        .astype(str)
        .str.replace(r"[$,]", "", regex=True)
        .str.strip()
    )
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Verify the changes
print(df[money_cols].dtypes)
df.head(10)


Actual_gross           float64
Adjusted_gross_2022      int64
Average_gross            int64
dtype: object


Unnamed: 0,Rank,Peak,All_time_peak,Actual_gross,Adjusted_gross_2022,Artist,Tour_title,Years,Shows,Average_gross,Ref
0,1,1,2,780000000.0,780000000,Taylor Swift,The Eras Tour †,2023–2024,56,13928571,[1]
1,2,1,7[2],579800000.0,579800000,Beyoncé,Renaissance World Tour,2023,56,10353571,[3]
2,3,1[4],2[5],411000000.0,560622615,Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,4835294,[6]
3,4,2[7],10[7],397300000.0,454751555,Pink,Beautiful Trauma World Tour,2018–2019,156,2546795,[7]
4,5,2[4],,345675146.0,402844849,Taylor Swift,Reputation Stadium Tour,2018,53,6522173,[8]
5,6,2[4],10[9],305158363.0,388978496,Madonna,The MDNA Tour,2012,88,3467709,[9]
6,7,2[10],,280000000.0,381932682,Celine Dion,Taking Chances World Tour,2008–2009,131,2137405,[11]
7,7,,,257600000.0,257600000,Pink,Summer Carnival †,2023–2024,41,6282927,[12]
8,9,,,256084556.0,312258401,Beyoncé,The Formation World Tour,2016,49,5226215,[13]
9,10,,,250400000.0,309141878,Taylor Swift,The 1989 World Tour,2015,85,2945882,[14]
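
To avoid losing footnoted amounts, the bracketed markers can be stripped before conversion. The following is a minimal sketch on sample values copied from the previews above; the footnote pattern `\[[^\]]*\]` is an assumption about how markers appear in this file (square brackets around a short label).

```python
import pandas as pd

# Sample values from the dataset preview, including one footnoted entry
raw = pd.Series(["$780,000,000", "$167,700,000[e]", "$150,000,000"])

cleaned = (
    raw.astype(str)
    .str.replace(r"\[[^\]]*\]", "", regex=True)  # drop bracketed footnote markers such as [e]
    .str.replace(r"[$,]", "", regex=True)         # drop the currency symbol and thousands separators
    .str.strip()
)

# errors="coerce" still guards against any other unparseable value
numeric = pd.to_numeric(cleaned, errors="coerce")
print(numeric.tolist())
print("Values lost to coercion:", int(numeric.isna().sum()))
```

Counting the `NaN` values right after conversion makes silent data loss visible at the step where it happens.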


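## Handling Missing Values
Before proceeding with analysis, I checked the dataset for missing values to understand which fields required imputation. The check found null values in `Peak` and `All_time_peak`, plus two in `Actual_gross` that were introduced when unparseable values were coerced during monetary cleaning.

To maintain consistency:

* Average_gross: Any missing values are replaced with 0. This column has no missing values in this dataset, so the fill acts as a safeguard. Zeros still count toward means and other aggregates, so this choice only fits cases where a non-reported amount should be treated as zero.

* Peak and All_time_peak: Missing values were replaced with a placeholder (999) to identify entries where ranking information was unavailable or incomplete. The placeholder must be excluded before computing statistics on these columns.

After applying these imputations, I re-ran the missing value check. The two missing values in `Actual_gross` remain, because that column is not imputed in this step.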

In [206]:
# Print missing values before
print("Missing values before:")
print(df.isnull().sum()[df.isnull().sum() > 0])

# Fill missing values (pandas 3.0 safe)
df["Average_gross"]   = df["Average_gross"].fillna(0)     # since values should not impact aggregate calculations
df["Peak"]            = df["Peak"].fillna(999)            # flag incomplete or unavailable rankings for future review
df["All_time_peak"]   = df["All_time_peak"].fillna(999)   # flag incomplete or unavailable rankings for future review

# Check again
print("\nMissing values after:")
print(df.isnull().sum()[df.isnull().sum() > 0])

print("\nTotal missing:", df.isnull().sum().sum())

Missing values before:
Peak             11
All_time_peak    14
Actual_gross      2
dtype: int64

Missing values after:
Actual_gross    2
dtype: int64

Total missing: 2
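
An alternative to the 999 placeholder is pandas' nullable integer dtype, which keeps missing rankings as `<NA>` so that they are skipped in statistics. This is a sketch under stated assumptions, not the approach used in this notebook: the sample values mimic the raw `Peak` format shown earlier, and the footnote pattern is the same assumed `[...]` form.

```python
import pandas as pd

# Sample Peak values in the raw format from the preview; None marks a missing rank
peak = pd.Series(["1", "1[4]", "2[7]", None])

# Remove footnote markers so that "1[4]" becomes "1"
peak_clean = peak.str.replace(r"\[[^\]]*\]", "", regex=True)

# Convert to numbers, then to the nullable integer dtype that can hold <NA>
peak_num = pd.to_numeric(peak_clean, errors="coerce").astype("Int64")

print(peak_num)
print("Missing:", int(peak_num.isna().sum()))
print("Mean of known ranks:", float(peak_num.mean()))  # <NA> entries are skipped
```

With a sentinel such as 999, the same mean would be pulled far upward unless the placeholder rows were filtered out first.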


## Standardizing Text Columns

To ensure consistency and prevent errors during analysis:

- Remove extra whitespace from text fields
- Standardize categorical/text columns
- Convert text to a consistent format if needed

This step improves data quality and ensures accurate grouping, filtering, and visualization.

In [213]:
# Column names are unchanged by this step; print them as a reference
print("Columns before standardization:")
print(df.columns.tolist())

# Strip leading/trailing whitespace from string columns
text_cols = ["Artist", "Tour_title", "Ref"]

for col in text_cols:
    df[col] = df[col].astype(str).str.strip()

# Title-case artist names only: str.title() would turn "The MDNA Tour" into
# "The Mdna Tour" and the reference marker "[d]" into "[D]"
df["Artist"] = df["Artist"].str.title()

# Verify changes
print("\nColumns after standardization:")
print(df.columns.tolist())

Columns before standardization:
['Rank', 'Peak', 'All_time_peak', 'Actual_gross', 'Adjusted_gross_2022', 'Artist', 'Tour_title', 'Years', 'Shows', 'Average_gross', 'Ref']

Columns after standardization:
['Rank', 'Peak', 'All_time_peak', 'Actual_gross', 'Adjusted_gross_2022', 'Artist', 'Tour_title', 'Years', 'Shows', 'Average_gross', 'Ref']
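
The overview lists artifact removal (footnotes and symbols), which the text columns still need: `Tour_title` contains values such as `Sticky & Sweet Tour ‡[4][a]`. A minimal sketch of that step follows, using titles from the previews above. The set of symbols (`†`, `‡`) and the `[...]` footnote pattern are assumptions based on the rows shown and may need extending for other files.

```python
import pandas as pd

# Tour titles from the dataset preview
titles = pd.Series(["The Eras Tour †", "Sticky & Sweet Tour ‡[4][a]", "The MDNA Tour"])

titles_clean = (
    titles
    .str.replace(r"\[[^\]]*\]", "", regex=True)  # bracketed footnotes: [4], [a]
    .str.replace(r"[†‡]", "", regex=True)         # dagger-style annotation symbols
    .str.replace(r"\s+", " ", regex=True)         # collapse repeated whitespace
    .str.strip()
)
print(titles_clean.tolist())
```

Printing a few cleaned values, rather than the column names, confirms that the text itself changed.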


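## Handling Outliers and Anomalies

In this step, I review the summary statistics and remove rows with negative values in `Average_gross`, `Peak`, or `All_time_peak`, since negative amounts and rankings are not valid.

In [131]: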
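# Handle outliers/anomalies
print(df.describe())

rows_before = df.shape[0]  # record the row count before filtering

# Peak columns may still hold text such as "1[4]", so compare on a numeric copy
for col in ["Average_gross", "Peak", "All_time_peak"]:
    values = pd.to_numeric(df[col], errors="coerce")
    df = df[values.isna() | (values >= 0)]  # drop only rows with negative values

print(f"Negative rows deleted: {rows_before - df.shape[0]:,}")
print(f"Rows we have now: {df.shape[0]:,}")

print("\nNumbers after cleaning:")
display(df.describe().round(0))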

            Rank       Shows
count  20.000000   20.000000
mean   10.450000  110.000000
std     5.942488   66.507617
min     1.000000   41.000000
25%     5.750000   59.000000
50%    10.500000   87.000000
75%    15.250000  134.500000
max    20.000000  325.000000


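
Beyond negative values, a common way to flag outliers is the interquartile range (IQR) rule: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are marked for review. Using the `Shows` quartiles reported earlier (Q1 = 59, Q3 = 134.5), the IQR is 75.5 and the upper fence is 134.5 + 1.5 × 75.5 = 247.75, so the maximum of 325 shows would be flagged. The sketch below is illustrative only: `iqr_bounds` is a hypothetical helper, and the sample is a subset of `Shows` values from the previews plus the reported maximum, not the full column.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the IQR rule for a numeric series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative subset of Shows values, including the reported maximum (325)
shows = pd.Series([41, 49, 53, 56, 56, 85, 85, 88, 131, 156, 325], name="Shows")

lower, upper = iqr_bounds(shows)
outliers = shows[(shows < lower) | (shows > upper)]

print(f"Fences: [{lower}, {upper}]")
print("Flagged values:", outliers.tolist())
```

Flagged values are candidates for review rather than automatic deletion; a tour with an unusually high number of shows can be a legitimate record.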

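## Creating Derived Columns

The examples in this cell are commented-out placeholders that show common derived-column patterns. Uncomment and adapt them to the columns in your dataset.

In [119]: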
# Create derived columns
# Example 1: Calculate age from birth year
#df['age'] = 2025 - df['birth_year']

# Example 2: Total sale amount
#df['total_sale'] = df['quantity'] * df['unit_price']

# Example 3: Full name
#df['full_name'] = df['first_name'] + " " + df['last_name']

# Example 4: Just the month from a date column
#df['month'] = pd.to_datetime(df['order_date']).dt.month

# Example 5: Yes/No if the customer is from Canada
#df['is_canada'] = df['country'] == 'Canada'

# Nothing is created until an example above is uncommented and adapted
print("Current column names:", df.columns.tolist())
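
For this particular dataset, derived columns could include the start and end year of each tour and the gross per show. The following is a sketch on three rows taken from the previews above; the new column names (`Start_year`, `End_year`, `Adjusted_gross_per_show`) are hypothetical, and it assumes `Years` is either a single year or two years separated by an en dash.

```python
import pandas as pd

EN_DASH = "\u2013"  # the separator used in values such as "2023–2024"

# Three rows from the dataset preview, with monetary values already numeric
tours = pd.DataFrame({
    "Years": ["2023–2024", "2023", "2008–2009"],
    "Adjusted_gross_2022": [780000000, 579800000, 560622615],
    "Shows": [56, 56, 85],
})

# Split on the en dash; single-year tours get the same start and end year
years = tours["Years"].str.split(EN_DASH, expand=True)
tours["Start_year"] = years[0].astype(int)
tours["End_year"] = years[1].fillna(years[0]).astype(int)

# Gross per show in 2022 dollars, rounded to whole dollars
tours["Adjusted_gross_per_show"] = (tours["Adjusted_gross_2022"] / tours["Shows"]).round(0)

print(tours)
```

If no value in `Years` contains a dash, the split produces only one column, so `years[1]` would need a guard before use.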

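## Final Validation and Quality Check

This cell prints the DataFrame structure and the total count of missing values as a final check.

In [49]: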
# Final validation and quality check
print(df.info())
print(df.isnull().sum().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Rank                 20 non-null     int64 
 1   Peak                 9 non-null      object
 2   All_time_peak        6 non-null      object
 3   Actual_gross         20 non-null     object
 4   Adjusted_gross_2022  20 non-null     object
 5   Artist               20 non-null     object
 6   Tour_title           20 non-null     object
 7   Years                20 non-null     object
 8   Shows                20 non-null     int64 
 9   Average_gross        20 non-null     object
 10  Ref                  20 non-null     object
dtypes: int64(2), object(9)
memory usage: 1.8+ KB
None
25
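
A visual check of `df.info()` can be complemented by explicit checks that report problems by name. The sketch below uses a hypothetical `validate` helper and toy data; the specific rules (required columns, numeric dtypes, no missing values, no duplicates) are assumptions about what a cleaned dataset should satisfy.

```python
import pandas as pd

def validate(frame: pd.DataFrame, numeric_cols: list[str], required_cols: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for col in required_cols:
        if col not in frame.columns:
            problems.append(f"missing column: {col}")
    for col in numeric_cols:
        if col in frame.columns and not pd.api.types.is_numeric_dtype(frame[col]):
            problems.append(f"{col} is not numeric ({frame[col].dtype})")
    n_missing = int(frame.isnull().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values remain")
    if frame.duplicated().any():
        problems.append("duplicate rows remain")
    return problems

# Toy frame with deliberate problems: text in a money column, a null, no Shows column
sample = pd.DataFrame({"Rank": [1, 2], "Actual_gross": ["$780,000,000", None]})
issues = validate(sample, numeric_cols=["Rank", "Actual_gross"], required_cols=["Rank", "Shows"])

for issue in issues:
    print("-", issue)
```

Returning a list instead of raising on the first failure shows every problem in one pass, which is convenient when re-running a notebook.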


## Exporting the Cleaned Dataset

The cleaned DataFrame is saved as both CSV and Excel files.

In [None]:
# Export the cleaned dataset (to_excel requires an Excel writer such as openpyxl)
df.to_csv("../data/cleaned_data.csv", index=False)
df.to_excel("../data/cleaned_data.xlsx", index=False)

print("Files saved: cleaned_data.csv, cleaned_data.xlsx")
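
To keep earlier exports from being overwritten, the file name can carry a version label and a date. This is a minimal sketch, not the notebook's method: `export_versioned` is a hypothetical helper, the naming scheme is an assumption, and a temporary directory stands in for the real output folder.

```python
from datetime import date
from pathlib import Path
import tempfile

import pandas as pd

def export_versioned(frame: pd.DataFrame, out_dir, stem: str = "cleaned_data", version: str = "v1.0") -> Path:
    """Write frame to '<stem>_<version>_<YYYYMMDD>.csv' inside out_dir and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it does not exist
    out_path = out_dir / f"{stem}_{version}_{date.today():%Y%m%d}.csv"
    frame.to_csv(out_path, index=False)
    return out_path

# Demonstration in a temporary directory with toy data
with tempfile.TemporaryDirectory() as tmp:
    saved = export_versioned(pd.DataFrame({"Rank": [1, 2]}), tmp)
    reloaded = pd.read_csv(saved)  # read back to confirm the export round-trips
    print("Saved:", saved.name)
```

Reading the file back immediately after writing is a cheap way to confirm that the export is usable.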