# Practice Task 3 – Data Cleaning

Load a CSV dataset with missing values and duplicates, then perform:

- Removal of missing and duplicate rows
- Type conversion of columns
- Normalization of numeric columns

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("articles_test.csv")

print("Raw shape:", df.shape)
df.head()

Raw shape: (30, 4)


|   | title | category | author | published_date |
|---|---|---|---|---|
| 0 | Article 1 | Science | Sarah L. | 2024-04-04 |
| 1 | Article 2 | Sports | Ian M. | 2024-01-29 |
| 2 | Article 3 | Environment | David K. | 2024-03-27 |
| 3 | Article 4 | NaN | David K. | 2024-03-16 |
| 4 | Article 5 | Health | Sarah L. | 2024-01-04 |


## Show Missing Values & Duplicates

In [2]:
print("Nulls per column:")
print(df.isnull().sum())

print("\nDuplicate rows:", df.duplicated().sum())

Nulls per column:
title             0
category          6
author            1
published_date    0
dtype: int64

Duplicate rows: 1
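
The counts above say how many rows are affected but not which ones. A minimal sketch of how the offending rows could be inspected before dropping them, using a small hypothetical frame (`sample` is made up for illustration and is not the contents of `articles_test.csv`):

```python
import pandas as pd

# Hypothetical sample data, not the real dataset.
sample = pd.DataFrame(
    {
        "title": ["Article 1", "Article 2", "Article 2", "Article 3"],
        "category": ["Science", "Sports", "Sports", None],
        "author": ["Sarah L.", "Ian M.", "Ian M.", "David K."],
    }
)

# keep=False flags every member of a duplicate group, not only the repeats.
duplicate_rows = sample[sample.duplicated(keep=False)]

# Rows that contain at least one missing value in any column.
rows_with_nulls = sample[sample.isnull().any(axis=1)]

print(duplicate_rows)
print(rows_with_nulls)
```

Note that `duplicated()` with its default `keep="first"` counts only the repeated copies, which is why a single duplicated pair is reported as one duplicate row.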


## Remove Missing Values & Duplicates

In [4]:
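# Drop rows with any missing value, then exact duplicate rows.
# copy() makes df_clean independent of df before new columns are assigned.
df_clean = df.dropna().drop_duplicates().copy()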
print("After cleaning:", df_clean.shape)

After cleaning: (22, 4)
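
Dropping every row with any missing value removed 8 of the 30 rows here. When that is too aggressive, both methods accept a `subset` argument. The following is a sketch on hypothetical data (the frame and the choice of key columns are assumptions, not part of the task):

```python
import pandas as pd

# Hypothetical sample data, not the real dataset.
sample = pd.DataFrame(
    {
        "title": ["A", "B", "B", "C"],
        "category": ["Science", None, None, "Health"],
        "author": ["X", "Y", "Y", None],
    }
)

# Drop a row only when 'category' is missing; other gaps are tolerated.
by_category = sample.dropna(subset=["category"])

# Treat rows as duplicates when title and author match; keep the first copy.
deduped = sample.drop_duplicates(subset=["title", "author"], keep="first")

# Optionally renumber the rows after filtering.
deduped = deduped.reset_index(drop=True)

print(by_category.shape, deduped.shape)
```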


## Convert Data Types
Convert the **published_date** column to `datetime`.

In [5]:
df_clean["published_date"] = pd.to_datetime(df_clean["published_date"])
df_clean.dtypes

title                     object
category                  object
author                    object
published_date    datetime64[ns]
dtype: object
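
The conversion above raises an error if any value cannot be parsed. A hedged sketch of a more defensive variant, on hypothetical data, that turns unparseable dates into `NaT` and also stores a repeated text column as a categorical (converting `category` is an illustrative extra, not something the task requires):

```python
import pandas as pd

# Hypothetical sample data with one malformed date.
sample = pd.DataFrame(
    {
        "category": ["Science", "Sports", "Science"],
        "published_date": ["2024-04-04", "not a date", "2024-01-29"],
    }
)

# errors="coerce" replaces unparseable strings with NaT instead of raising.
sample["published_date"] = pd.to_datetime(
    sample["published_date"], errors="coerce"
)

# A column with few distinct values can be stored as a categorical.
sample["category"] = sample["category"].astype("category")

print(sample.dtypes)
print("Unparseable dates:", sample["published_date"].isna().sum())
```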

## Normalize Numeric Columns (0-1)
Create a demo numeric field, **title_length** (the number of characters in the title), then scale it to the 0-1 range.

In [6]:
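# Demo numeric field: character count of each title
df_clean["title_length"] = df_clean["title"].str.len()

# Min-max scaling maps the smallest value to 0 and the largest to 1
scaler = MinMaxScaler()
# fit_transform returns an (n, 1) array; ravel() flattens it to one value per row
df_clean["title_length_norm"] = scaler.fit_transform(
    df_clean[["title_length"]]
).ravel()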

df_clean[["title", "title_length", "title_length_norm"]].head()


|   | title | title_length | title_length_norm |
|---|---|---|---|
| 0 | Article 1 | 9 | 0.0 |
| 1 | Article 2 | 9 | 0.0 |
| 2 | Article 3 | 9 | 0.0 |
| 4 | Article 5 | 9 | 0.0 |
| 11 | Article 12 | 10 | 1.0 |
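
Min-max scaling computes `(x - min) / (max - min)` for each value. As a sanity check, the same numbers can be reproduced by hand with plain pandas. This sketch reuses the five `title_length` values shown above; the zero-range guard is an added assumption for constant columns:

```python
import pandas as pd

# The five title lengths from the table above.
lengths = pd.Series([9, 9, 9, 9, 10], name="title_length")

value_range = lengths.max() - lengths.min()

if value_range == 0:
    # A constant column has no spread; avoid dividing by zero.
    normalized = pd.Series(0.0, index=lengths.index)
else:
    normalized = (lengths - lengths.min()) / value_range

print(normalized.tolist())
```

With only two distinct lengths (9 and 10), every value lands on exactly 0.0 or 1.0, which matches the scaler output.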


In [7]:
df_clean.to_csv("articles_test_cleaned.csv", index=False)
print("Saved cleaned file: articles_test_cleaned.csv")

Saved cleaned file: articles_test_cleaned.csv
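
CSV stores dates as plain text, so the `datetime` dtype is not preserved when the file is read back. A minimal sketch of the round trip, using an in-memory buffer and a hypothetical one-row frame in place of the real file:

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned frame.
cleaned = pd.DataFrame(
    {
        "title": ["Article 1"],
        "published_date": pd.to_datetime(["2024-04-04"]),
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Without parse_dates the column comes back as plain strings.
buffer.seek(0)
as_text = pd.read_csv(buffer)

# parse_dates restores the datetime dtype on load.
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["published_date"])

print(as_text.dtypes)
print(reloaded.dtypes)
```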
