# Practice Task – Data Cleaning

This notebook cleans a CSV file in four steps: it drops rows with missing values, removes duplicate rows, converts the `join_date` column to a datetime type, and min-max normalizes the numeric columns (`age` and `income`). The cleaned dataset is saved as `cleaned_data.csv`.


*Importing Libraries and Loading the dataset*

In [4]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("raw_data.csv")

df.head()


Unnamed: 0,name,age,income,join_date
0,Alice,25.0,50000.0,2021-01-10
1,Bob,30.0,54000.0,2021-03-15
2,Charlie,,60000.0,2021-06-01
3,Alice,25.0,50000.0,2021-01-10
4,Eve,22.0,,2021-07-22
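
Before dropping anything, it can help to count how many values are missing per column and how many rows are exact duplicates. The sketch below is only an illustration: `RAW_CSV` is a hypothetical stand-in for `raw_data.csv`, rebuilt from the rows displayed in this notebook, and the real file may contain more rows.

```python
import io

import pandas as pd

# Hypothetical stand-in for raw_data.csv, rebuilt from the rows shown in this notebook.
RAW_CSV = """name,age,income,join_date
Alice,25,50000,2021-01-10
Bob,30,54000,2021-03-15
Charlie,,60000,2021-06-01
Alice,25,50000,2021-01-10
Eve,22,,2021-07-22
Frank,29,62000,2021-08-30
Grace,31,58000,2021-10-05
"""

sample = pd.read_csv(io.StringIO(RAW_CSV))

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

On this sample, `age` and `income` each have one missing value and one row (the second Alice) is a duplicate.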


*Removing Missing and Duplicate rows*

In [5]:
# Dropping rows that contain missing values
df_cleaned = df.dropna()

# Removing duplicate rows
df_cleaned = df_cleaned.drop_duplicates()

# Checking shape after cleaning
print("Cleaned dataset shape:", df_cleaned.shape)
df_cleaned.head()


Cleaned dataset shape: (4, 4)


Unnamed: 0,name,age,income,join_date
0,Alice,25.0,50000.0,2021-01-10
1,Bob,30.0,54000.0,2021-03-15
5,Frank,29.0,62000.0,2021-08-30
6,Grace,31.0,58000.0,2021-10-05
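
The output above keeps the original row labels (0, 1, 5, 6) because `dropna()` and `drop_duplicates()` do not renumber the index. The following sketch shows the same two steps with their options spelled out, plus an optional `reset_index(drop=True)`. The `RAW_CSV` sample is a hypothetical reconstruction of the input from the rows shown in this notebook, and restricting `dropna` to `age` and `income` is an assumption made for illustration, not what the cell above does.

```python
import io

import pandas as pd

# Hypothetical stand-in for raw_data.csv, rebuilt from the rows shown in this notebook.
RAW_CSV = """name,age,income,join_date
Alice,25,50000,2021-01-10
Bob,30,54000,2021-03-15
Charlie,,60000,2021-06-01
Alice,25,50000,2021-01-10
Eve,22,,2021-07-22
Frank,29,62000,2021-08-30
Grace,31,58000,2021-10-05
"""

sample = pd.read_csv(io.StringIO(RAW_CSV))

# Drop rows only when one of the listed columns is missing (illustrative choice).
no_missing = sample.dropna(subset=["age", "income"])

# keep="first" is the default: the first occurrence of a repeated row survives.
deduplicated = no_missing.drop_duplicates(keep="first")
print(list(deduplicated.index))  # [0, 1, 5, 6]

# Optionally renumber the rows from zero after filtering.
renumbered = deduplicated.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1, 2, 3]
```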


*Type Conversion*

In [6]:
# Converting 'join_date' column to datetime format
df_cleaned['join_date'] = pd.to_datetime(df_cleaned['join_date'])

# Checking the type
df_cleaned.dtypes

Unnamed: 0,0
name,object
age,float64
income,float64
join_date,datetime64[ns]
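
The conversion above succeeds because every `join_date` value is a valid date. If a file might contain malformed dates, one possible variant is to pass `errors="coerce"` so that unparseable entries become `NaT` instead of raising an error. The series below is made-up illustration data, not taken from `raw_data.csv`.

```python
import pandas as pd

# Made-up values for illustration; the last entry is deliberately invalid.
dates = pd.Series(["2021-01-10", "2021-03-15", "not a date"])

# An explicit format plus errors="coerce" turns unparseable strings into NaT.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

print(parsed)
print("Unparseable entries:", int(parsed.isna().sum()))
```

Rows that end up as `NaT` can then be inspected or dropped with `dropna()`, in the same way as the other missing values.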


*Normalizing Numeric Columns*

In [7]:
# Selecting the numeric columns to normalize
numeric_cols = ['age', 'income']

# Initializing the scaler
scaler = MinMaxScaler()

# Normalization
df_cleaned[numeric_cols] = scaler.fit_transform(df_cleaned[numeric_cols])

# Checking normalized data
df_cleaned.head()


Unnamed: 0,name,age,income,join_date
0,Alice,0.0,0.0,2021-01-10
1,Bob,0.833333,0.333333,2021-03-15
5,Frank,0.666667,1.0,2021-08-30
6,Grace,1.0,0.666667,2021-10-05
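
With its default settings, `MinMaxScaler` rescales each column to the range [0, 1] using (x - min) / (max - min). As a cross-check, the short pure-Python sketch below applies that formula to the four remaining `age` and `income` values and reproduces the numbers in the table above. The helper name `min_max_scale` is hypothetical and is not part of scikit-learn.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are equal: return zeros to avoid dividing by zero
        # (a choice made for this sketch).
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Values of the four rows that remain after cleaning (Alice, Bob, Frank, Grace).
ages = [25.0, 30.0, 29.0, 31.0]
incomes = [50000.0, 54000.0, 62000.0, 58000.0]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)

print([round(value, 6) for value in scaled_ages])     # [0.0, 0.833333, 0.666667, 1.0]
print([round(value, 6) for value in scaled_incomes])  # [0.0, 0.333333, 1.0, 0.666667]
```

For example, Bob's age becomes (30 - 25) / (31 - 25) = 5/6 ≈ 0.833333.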


*Saving the cleaned data*

In [8]:
# Saving the cleaned data as cleaned_data.csv
df_cleaned.to_csv("cleaned_data.csv", index=False)
print("Cleaned data saved as cleaned_data.csv")

Cleaned data saved as cleaned_data.csv


*Loading and displaying the cleaned data CSV*

In [9]:
# Load the cleaned CSV
cleaned_df = pd.read_csv("cleaned_data.csv")

# Display the first few rows
cleaned_df.head()


Unnamed: 0,name,age,income,join_date
0,Alice,0.0,0.0,2021-01-10
1,Bob,0.833333,0.333333,2021-03-15
2,Frank,0.666667,1.0,2021-08-30
3,Grace,1.0,0.666667,2021-10-05
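
Two things change in this round trip. Because the file was written with `index=False`, the reloaded rows are renumbered 0 to 3. In addition, CSV stores dates as plain text, so `join_date` is no longer a datetime column unless it is parsed again on load. The sketch below illustrates the second point using an in-memory buffer in place of `cleaned_data.csv`; the two-row frame is a small illustrative subset.

```python
import io

import pandas as pd

# Small illustrative frame with a datetime column.
frame = pd.DataFrame(
    {
        "name": ["Alice", "Bob"],
        "join_date": pd.to_datetime(["2021-01-10", "2021-03-15"]),
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# A plain reload leaves the dates as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime type for the listed columns.
buffer.seek(0)
reparsed = pd.read_csv(buffer, parse_dates=["join_date"])

print(plain.dtypes)
print(reparsed.dtypes)
```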
