# Data Cleaning Project – Customer Dataset

In this project, I practice core data cleaning techniques using a fictional customer dataset. The dataset contains common data quality issues like:

- Misformatted names and emails  
- Invalid phone numbers   
- Inconsistent or invalid dates  

This type of task is essential in real-world data projects, where raw data must be cleaned before analysis or modeling.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('clientes.csv')

In [3]:
df.info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   full_name    5 non-null      object
 1   email        4 non-null      object
 2   phone        4 non-null      object
 3   address      5 non-null      object
 4   signup_date  5 non-null      object
dtypes: object(5)
memory usage: 332.0+ bytes


Unnamed: 0,full_name,email,phone,address,signup_date
0,ana paula,ANA.PAULA@GMAIL.COM,+55 (11) 91234-5678,"Av. Paulista, 1000 - SP",12-03-2024
1,joão da silva,joao.silva@@gmail,11999999999,"Rua das Laranjeiras, 45 - Rio de Janeiro",2024/03/14
2,Maria Oliveira,Maria.Oliveira@gmail.com,21-98888-7777,"Av Brasil 500,RJ",14.03.2024
3,LUCAS LIMA,lucas.lima@GMAIL.com,,Avenida das Américas 3000 - RJ,15/03/24
4,patricia gomes,,(11)98888-2222,"Rua A, SP","March 16, 2024"


# Data Quality Issues Identified:

- `full_name`: inconsistent capitalization
- `email`: inconsistent capitalization, a malformed address, and a missing value
- `phone`: inconsistent patterns and a missing value
- `signup_date`: inconsistent date formats
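
These issues can also be counted programmatically before any cleaning. The sketch below rebuilds the raw `email` and `phone` values from the table above in a small DataFrame (`raw` is an illustrative name, not part of the notebook) and counts missing values and emails that fail a simple pattern check.

```python
import pandas as pd

# Raw values copied from the table above; `raw` is an illustrative name
raw = pd.DataFrame({
    'email': ['ANA.PAULA@GMAIL.COM', 'joao.silva@@gmail', 'Maria.Oliveira@gmail.com',
              'lucas.lima@GMAIL.com', None],
    'phone': ['+55 (11) 91234-5678', '11999999999', '21-98888-7777',
              None, '(11)98888-2222'],
})

# Missing values per column
missing = raw.isna().sum()

# Emails that are present but do not look like name@domain.tld
email_pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
bad_email = raw['email'].notna() & ~raw['email'].str.contains(email_pattern, na=False)

print(missing)
print(raw.loc[bad_email, 'email'])
```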

### `full_name`

In [4]:
# First, I'll standardize the `full_name` column by converting it to title case
df['full_name'] = df['full_name'].str.title()
# Now, I'll check whether any names have leading or trailing whitespace
df[df['full_name'] != df['full_name'].str.strip()]

Unnamed: 0,full_name,email,phone,address,signup_date


The filter returns no rows, so no name has leading or trailing whitespace.
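
One caveat of `str.title()` is that it also capitalizes Portuguese name particles, so `joão da silva` becomes `João Da Silva`. A minimal sketch of a particle-aware alternative follows. The `PARTICLES` set and the `title_case_name` helper are illustrative assumptions, not part of the original notebook, and the particle list is not exhaustive.

```python
# Hypothetical helper: title-case a name but keep common Portuguese particles lowercase
PARTICLES = {'da', 'de', 'do', 'das', 'dos', 'e'}

def title_case_name(name: str) -> str:
    words = name.strip().lower().split()
    return ' '.join(
        # Keep a particle lowercase unless it is the first word of the name
        w if (w in PARTICLES and i > 0) else w.capitalize()
        for i, w in enumerate(words)
    )

print(title_case_name('joão da silva'))
print(title_case_name('LUCAS LIMA'))
```

It could be applied with `df['full_name'].apply(title_case_name)` in place of `.str.title()`.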

### `email`

In [5]:
# First, I'll standardize the `email` column by converting it to lowercase
df['email'] = df['email'].str.lower()

In [6]:
# Now, let's check if there are any missing values in the `email` column
display(df[df['email'].isnull()])

Unnamed: 0,full_name,email,phone,address,signup_date
4,Patricia Gomes,,(11)98888-2222,"Rua A, SP","March 16, 2024"


In [7]:
# I'll replace the NaN values in the `email` column with empty strings
df['email'] = df['email'].fillna('')
df

Unnamed: 0,full_name,email,phone,address,signup_date
0,Ana Paula,ana.paula@gmail.com,+55 (11) 91234-5678,"Av. Paulista, 1000 - SP",12-03-2024
1,João Da Silva,joao.silva@@gmail,11999999999,"Rua das Laranjeiras, 45 - Rio de Janeiro",2024/03/14
2,Maria Oliveira,maria.oliveira@gmail.com,21-98888-7777,"Av Brasil 500,RJ",14.03.2024
3,Lucas Lima,lucas.lima@gmail.com,,Avenida das Américas 3000 - RJ,15/03/24
4,Patricia Gomes,,(11)98888-2222,"Rua A, SP","March 16, 2024"


In [8]:
# Now, let's check if all emails are in the correct format
df[~df['email'].str.contains(r'^[\w\.-]+@[\w\.-]+\.\w+$', na=False)]

# One email has a doubled '@', let's fix it
df['email'] = df['email'].str.replace('@@', '@')
df

Unnamed: 0,full_name,email,phone,address,signup_date
0,Ana Paula,ana.paula@gmail.com,+55 (11) 91234-5678,"Av. Paulista, 1000 - SP",12-03-2024
1,João Da Silva,joao.silva@gmail,11999999999,"Rua das Laranjeiras, 45 - Rio de Janeiro",2024/03/14
2,Maria Oliveira,maria.oliveira@gmail.com,21-98888-7777,"Av Brasil 500,RJ",14.03.2024
3,Lucas Lima,lucas.lima@gmail.com,,Avenida das Américas 3000 - RJ,15/03/24
4,Patricia Gomes,,(11)98888-2222,"Rua A, SP","March 16, 2024"


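The doubled `@` is fixed, but `joao.silva@gmail` still has no top-level domain, so it would still fail the pattern check above. The intended domain is not recoverable from the data alone, so the value is left as is.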
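
Re-running the pattern check after a fix is a cheap way to confirm whether it worked. The sketch below applies the same pattern to the email values from the table above. The `is_valid_email` helper is an illustrative name, and the pattern is the simple check used in this notebook, not a full email validator.

```python
import re

EMAIL_PATTERN = re.compile(r'^[\w\.-]+@[\w\.-]+\.\w+$')

def is_valid_email(value: str) -> bool:
    """Return True if the value matches the simple name@domain.tld pattern."""
    return bool(EMAIL_PATTERN.match(value))

# Email values as they appear after the '@@' replacement
emails = ['ana.paula@gmail.com', 'joao.silva@gmail', 'maria.oliveira@gmail.com',
          'lucas.lima@gmail.com', '']

# Non-empty values that still fail the check
still_invalid = [e for e in emails if e and not is_valid_email(e)]
print(still_invalid)
```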

### `phone`

In [9]:
# The phone column mixes several patterns, let's fix it
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True) # Removes all non-digit characters
# I'll also replace the NaN values with empty strings
df['phone'] = df['phone'].fillna('')
df

Unnamed: 0,full_name,email,phone,address,signup_date
0,Ana Paula,ana.paula@gmail.com,5511912345678,"Av. Paulista, 1000 - SP",12-03-2024
1,João Da Silva,joao.silva@gmail,11999999999,"Rua das Laranjeiras, 45 - Rio de Janeiro",2024/03/14
2,Maria Oliveira,maria.oliveira@gmail.com,21988887777,"Av Brasil 500,RJ",14.03.2024
3,Lucas Lima,lucas.lima@gmail.com,,Avenida das Américas 3000 - RJ,15/03/24
4,Patricia Gomes,,11988882222,"Rua A, SP","March 16, 2024"


In [10]:
# Some numbers include the country code, let's remove it
# The lookahead strips a leading '55' only when a full 11-digit number follows,
# so an 11-digit number whose area code happens to be 55 is left untouched
df['phone'] = df['phone'].str.replace(r'^55(?=\d{11}$)', '', regex=True)

# Now let's check if all phone numbers are in the correct format
display((df['phone'][df['phone'] != ''].str.len() == 11).all()) # Checks that all non-empty phone numbers have 11 digits

np.True_

In [11]:
# Now let's format the phone numbers in the Brazilian format XX XXXXX-XXXX
df['phone'] = df['phone'].apply(lambda x: x[:2] + ' ' + x[2:7] + '-' + x[7:])

df['phone']

0    11 91234-5678
1    11 99999-9999
2    21 98888-7777
3                -
4    11 98888-2222
Name: phone, dtype: object

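The four numbers that were present now share one format. The missing number in row 3 was an empty string, so the lambda turned it into a stray ` -`; skipping empty strings in the lambda would avoid this.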
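
A formatter that validates its input avoids emitting partial output for missing or malformed numbers. The sketch below is one way to combine the phone steps into a single function. `format_br_phone` is a hypothetical helper, and it assumes every valid number is an 11-digit mobile number, optionally prefixed by the country code 55.

```python
import re

def format_br_phone(raw) -> str:
    """Format a Brazilian mobile number as 'XX XXXXX-XXXX'; return '' if it cannot be formatted."""
    if not isinstance(raw, str):
        return ''  # missing value (None or NaN)
    digits = re.sub(r'\D', '', raw)
    # Drop the country code only when it precedes a full 11-digit number
    if len(digits) == 13 and digits.startswith('55'):
        digits = digits[2:]
    if len(digits) != 11:
        return ''  # empty or unexpected length: leave blank instead of guessing
    return f'{digits[:2]} {digits[2:7]}-{digits[7:]}'

print(format_br_phone('+55 (11) 91234-5678'))
print(repr(format_br_phone('')))
```

It could be applied to the raw column with `df['phone'].apply(format_br_phone)`.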

### `signup_date`

In [12]:
# The signup_date column mixes several formats, so let's convert it to datetime.
# An earlier attempt (kept for reference) parsed each value individually:
# def parse_date(date_str):
#     try:
#         # Try the explicit day-month-year format first
#         return pd.to_datetime(date_str, format='%d-%m-%Y', errors='raise')
#     except ValueError:
#         # Otherwise let pandas infer the format, preferring day-first
#         return pd.to_datetime(date_str, dayfirst=True)
#
# df['signup_date'] = df['signup_date'].apply(parse_date) # Applying the function to the column

# format='mixed' (pandas 2.0+) infers the format of each value individually;
# dayfirst=True reads ambiguous dates such as 12-03-2024 as 12 March
df['signup_date'] = pd.to_datetime(df['signup_date'], dayfirst=True, format='mixed')

df

Unnamed: 0,full_name,email,phone,address,signup_date
0,Ana Paula,ana.paula@gmail.com,11 91234-5678,"Av. Paulista, 1000 - SP",2024-03-12
1,João Da Silva,joao.silva@gmail,11 99999-9999,"Rua das Laranjeiras, 45 - Rio de Janeiro",2024-03-14
2,Maria Oliveira,maria.oliveira@gmail.com,21 98888-7777,"Av Brasil 500,RJ",2024-03-14
3,Lucas Lima,lucas.lima@gmail.com,-,Avenida das Américas 3000 - RJ,2024-03-15
4,Patricia Gomes,,11 98888-2222,"Rua A, SP",2024-03-16
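
Format inference is convenient, but it can silently misread ambiguous dates. As a more explicit alternative, the sketch below tries a fixed list of formats in order using only the standard library. `DATE_FORMATS` and `parse_signup_date` are illustrative names, the list covers only the five formats seen in this dataset, and `%B` assumes an English locale.

```python
from datetime import datetime

# Formats observed in the raw signup_date column, tried in order
DATE_FORMATS = ['%d-%m-%Y', '%Y/%m/%d', '%d.%m.%Y', '%d/%m/%y', '%B %d, %Y']

def parse_signup_date(text: str):
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

raw_dates = ['12-03-2024', '2024/03/14', '14.03.2024', '15/03/24', 'March 16, 2024']
parsed = [parse_signup_date(d) for d in raw_dates]
print([d.isoformat() for d in parsed])
```

Values that match none of the formats come back as `None`, which makes unparsed dates easy to spot.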


# Summary of Cleaning

- Cleaned `full_name` (standardized capitalization and checked for stray spaces)  
- Cleaned `email` (lowercased, replaced the missing value with an empty string, and fixed the doubled `@`)  
- Cleaned `phone` (standardized its format)
- Cleaned `signup_date` (converted mixed formats to datetime)

The cleaned dataset can now be saved and used in future analysis.

In [None]:
df.to_csv('clientes_limpo.csv', index=False)
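
One thing to watch when the cleaned file is read back: by default `read_csv` infers types, so dates come back as plain strings and empty cells become NaN. The sketch below round-trips two cleaned rows through an in-memory buffer instead of a file (an assumption made so the example is self-contained) and uses `dtype`, `keep_default_na`, and `parse_dates` to restore the intended types.

```python
import io
import pandas as pd

# Two cleaned rows; `cleaned` stands in for the notebook's df
cleaned = pd.DataFrame({
    'full_name': ['Ana Paula', 'Lucas Lima'],
    'phone': ['11 91234-5678', ''],
    'signup_date': pd.to_datetime(['2024-03-12', '2024-03-15']),
})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(
    buffer,
    dtype={'full_name': str, 'phone': str},  # keep text columns as strings
    keep_default_na=False,                   # keep empty cells as '' instead of NaN
    parse_dates=['signup_date'],             # restore the datetime column
)
print(reloaded.dtypes)
```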