Internship Task 1 — Data Cleaning and Preprocessing using Pandas
This project is part of my Data Analyst Internship Task 1, where I performed data cleaning and preprocessing using Python (Pandas).
The goal was to clean a raw dataset containing missing values, duplicates, inconsistent formats, and mixed data types.
| File Name | Description |
|---|---|
| data_task1.csv | Raw dataset (with missing values, duplicates, and inconsistent formats) |
| task1.py | Python script used for data cleaning and preprocessing |
| cleaned_data_task1.csv | Final cleaned dataset ready for analysis |
- Identify and handle missing values
- Remove duplicate rows
- Standardize text values (Gender, Country)
- Convert date formats into a consistent format
- Ensure all columns have clean and uniform names
- Correct data types (e.g., age → int, revenue → float, date → datetime)
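
Before any cleaning, the issues listed above can be surfaced with a quick inspection pass. The snippet below is a minimal sketch, not the contents of task1.py: the raw column names and the three sample rows are invented for illustration and only stand in for data_task1.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for data_task1.csv; these rows and column names
# are assumptions made for illustration, not the real dataset.
raw_csv = io.StringIO(
    "CustomerID,Name,Gender,Country,Age,JoinDate,Revenue\n"
    "101,John,MALE,usa,25,2023-05-12,4500\n"
    "101,John,MALE,usa,25,2023-05-12,4500\n"
    "102,Sarah,female,USA,,2023-06-14,3260\n"
)
df = pd.read_csv(raw_csv)

missing_per_column = df.isna().sum()         # number of NaN values per column
duplicate_rows = int(df.duplicated().sum())  # rows identical to an earlier row
column_dtypes = df.dtypes                    # reveals columns loaded as object or float

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(column_dtypes)
```

A column such as Age that contains a missing value is loaded as float rather than int, which is one reason the data types need correcting after the gaps are filled.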
- Loaded the dataset using Pandas
- Handled missing values using the column mean (`fillna()`)
- Removed duplicates with `drop_duplicates()`
- Standardized text columns (converted gender and country values to consistent case)
- Fixed inconsistent date formats using `pd.to_datetime()`
- Renamed columns to lowercase and replaced spaces with underscores
- Converted data types to ensure numeric and datetime consistency
- Saved cleaned data into a new CSV file
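
The steps above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the actual task1.py: the raw column names and sample rows are invented, the helper name `clean` is hypothetical, and the choice of lowercase for gender and title case for country is a guess based on the sample output.

```python
import io
import pandas as pd

# Hypothetical raw rows standing in for data_task1.csv (invented for illustration).
raw_csv = io.StringIO(
    "CustomerID,Name,Gender,Country,Age,JoinDate,Revenue\n"
    "101,John,MALE,usa,25,2023-05-12,4500\n"
    "101,John,MALE,usa,25,2023-05-12,4500\n"
    "102,Sarah,female,USA,,2023-06-14,3260\n"
    "103,Ravi,Male,india,27,2022-03-05,\n"
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps in the order they are listed above."""
    df = df.copy()

    # 1. Fill missing numeric values with the column mean.
    for col in ["Age", "Revenue"]:
        df[col] = df[col].fillna(df[col].mean())

    # 2. Drop exact duplicate rows.
    df = df.drop_duplicates().reset_index(drop=True)

    # 3. Standardize text columns (assumed: lowercase gender, title-case country).
    df["Gender"] = df["Gender"].str.strip().str.lower()
    df["Country"] = df["Country"].str.strip().str.title()

    # 4. Parse dates; values that cannot be parsed become NaT.
    df["JoinDate"] = pd.to_datetime(df["JoinDate"], errors="coerce")

    # 5. Lowercase column names and replace spaces with underscores.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

    # 6. Enforce numeric types (a mean-filled age is rounded before casting).
    df["age"] = df["age"].round().astype(int)
    df["revenue"] = df["revenue"].astype(float)
    return df


cleaned = clean(pd.read_csv(raw_csv))

# With no path, to_csv returns the CSV text; passing a path such as
# "cleaned_data_task1.csv" would write the file to disk instead.
csv_text = cleaned.to_csv(index=False)
print(cleaned)
```

Because the mean is computed before duplicates are removed, repeated rows are counted more than once in the fill value; swapping steps 1 and 2 avoids that. If the date column mixes several formats, pandas 2.0 and later accept `format="mixed"` in `pd.to_datetime()` to parse each value individually.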
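| | customerid | name | gender | country | age | joindate | revenue |
|---|---|---|---|---|---|---|---|
| 0 | 101 | John | male | Usa | 25 | 2023-05-12 | 4500.0 |
| 1 | 102 | Sarah | female | Usa | 30 | 2023-06-14 | 3260.0 |
| 2 | 103 | John | male | Canada | 28 | 2023-07-14 | 3200.0 |
| 3 | 104 | Ravi | male | India | 27 | 2022-03-05 | 2800.0 |
| 4 | 105 | Sarah | female | Uk | 30 | 2023-06-05 | 3000.0 |
| 5 | 106 | Priya | female | India | 27 | 2023-03-10 | 2800.0 |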