# Data Cleaning

Real-world data is messy.

We must handle:
- missing values
- duplicates
- incorrect data types
- string inconsistencies

In [1]:
import pandas as pd

sales = pd.read_csv("../data/raw/sales.csv")
sales.head()

Unnamed: 0,order_id,customer_id,product,category,price,quantity,city,date
0,1001,C101,Laptop,Electronics,55000,1,Delhi,2024-01-05
1,1002,C102,Phone,Electronics,20000,2,Mumbai,2024-01-06
2,1003,C103,Shoes,Fashion,3000,1,Pune,2024-01-07
3,1004,C101,Headphones,Electronics,2000,3,Delhi,2024-01-07
4,1005,C104,Tshirt,Fashion,800,2,Bangalore,2024-01-08


## Checking Missing Values

In [2]:
sales.isnull()

Unnamed: 0,order_id,customer_id,product,category,price,quantity,city,date
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False


In [3]:
sales.isnull().sum()

order_id       0
customer_id    0
product        0
category       0
price          0
quantity       0
city           0
date           0
dtype: int64
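
The `sales` table has no gaps, so every count above is zero. As a minimal sketch of what the same checks report when values are missing, the toy frame below (hypothetical data, not taken from `sales.csv`) has one missing price and two missing cities. The names `toy`, `missing_per_column` and `rows_with_gaps` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, not taken from sales.csv
toy = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "price": [55000.0, None, 3000.0, 800.0],
        "city": ["Delhi", None, "Pune", None],
    }
)

# Count of missing cells in each column
missing_per_column = toy.isnull().sum()

# Rows that contain at least one missing cell
rows_with_gaps = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```

Looking at the affected rows, and not only the per-column counts, helps decide whether to fill or drop them.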

## Filling Missing Values

In [4]:
sales["city"] = sales["city"].fillna("Unknown")  # assign back; inplace on a column selection never updates sales (Copy-on-Write)
sales["city"]


0        Delhi
1       Mumbai
2         Pune
3        Delhi
4    Bangalore
5      Chennai
6       Mumbai
7        Delhi
8         Pune
9    Bangalore
Name: city, dtype: str
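
When several columns need different fill values, one option is a single `fillna` call on the whole DataFrame with a dict that maps column names to values. The sketch below uses hypothetical toy data. The fill values `"Unknown"` and `0` are assumptions chosen for illustration, not a recommendation for the real dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from sales.csv
toy = pd.DataFrame(
    {
        "city": ["Delhi", None, "Pune"],
        "quantity": [1.0, None, 3.0],
    }
)

# One call on the DataFrame, with a different fill value per column.
# The result is assigned back, so no inplace call on a column selection is needed.
toy = toy.fillna({"city": "Unknown", "quantity": 0})

print(toy)
```

Assigning the result back behaves the same with or without Copy-on-Write.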

## Dropping Missing Values

In [5]:
sales = sales.dropna()  # dropna returns a new DataFrame; assign it to keep the result
sales

Unnamed: 0,order_id,customer_id,product,category,price,quantity,city,date
0,1001,C101,Laptop,Electronics,55000,1,Delhi,2024-01-05
1,1002,C102,Phone,Electronics,20000,2,Mumbai,2024-01-06
2,1003,C103,Shoes,Fashion,3000,1,Pune,2024-01-07
3,1004,C101,Headphones,Electronics,2000,3,Delhi,2024-01-07
4,1005,C104,Tshirt,Fashion,800,2,Bangalore,2024-01-08
5,1006,C105,Watch,Accessories,2500,1,Chennai,2024-01-09
6,1007,C102,Laptop,Electronics,60000,1,Mumbai,2024-01-10
7,1008,C106,Backpack,Accessories,1500,2,Delhi,2024-01-10
8,1009,C103,Phone,Electronics,18000,1,Pune,2024-01-11
9,1010,C104,Shoes,Fashion,3500,1,Bangalore,2024-01-11
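
Because `sales` has no missing values, `dropna()` returns all ten rows. The sketch below, on hypothetical toy data, shows the difference between dropping rows with any missing cell and dropping rows only when a required column (here `price`, an assumption for illustration) is missing.

```python
import pandas as pd

# Hypothetical toy data, not taken from sales.csv
toy = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "price": [55000.0, None, 3000.0],
        "city": ["Delhi", "Mumbai", None],
    }
)

# Drop every row that has at least one missing cell
any_missing_dropped = toy.dropna()

# Drop a row only when its price is missing; a missing city is tolerated
price_required = toy.dropna(subset=["price"])

print(len(toy), len(any_missing_dropped), len(price_required))  # 3 1 2
```

`dropna` returns a new DataFrame and leaves `toy` unchanged, so the result has to be assigned to be kept.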


## Removing Duplicates

In [6]:
sales.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

In [7]:
sales.drop_duplicates(inplace=True)
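
The `sales` table contains no repeated rows, so the calls above change nothing. As a small sketch on hypothetical toy data, the example below shows what `duplicated` flags and how `subset` and `keep` control which rows survive. Treating `order_id` as the key is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from sales.csv
toy = pd.DataFrame(
    {
        "order_id": [1001, 1002, 1002, 1003],
        "product": ["Laptop", "Phone", "Phone", "Shoes"],
    }
)

# True for the second and later copies of a fully identical row
is_repeat = toy.duplicated()

# Keep the first copy of each identical row
deduped = toy.drop_duplicates()

# Keep one row per order_id, preferring the last occurrence
by_order = toy.drop_duplicates(subset=["order_id"], keep="last")

print(is_repeat.tolist())  # [False, False, True, False]
print(deduped)
print(by_order)
```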

## Fixing Data Types

In [8]:
sales["date"] = pd.to_datetime(sales["date"])
sales.dtypes

order_id                int64
customer_id               str
product                   str
category                  str
price                   int64
quantity                int64
city                      str
date           datetime64[us]
dtype: object
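
The conversion above succeeds because every `date` entry is well formed. When a column mixes valid and invalid entries, one common approach is `errors="coerce"`, which turns unparseable values into missing values instead of raising. The sketch below uses hypothetical toy data. The explicit `format` string assumes ISO-style dates like those in `sales.csv`.

```python
import pandas as pd

# Hypothetical toy data with one bad entry per column
toy = pd.DataFrame(
    {
        "price": ["55000", "20000", "n/a"],
        "date": ["2024-01-05", "2024-01-06", "not a date"],
    }
)

# Unparseable entries become NaN (numbers) or NaT (dates) instead of raising
toy["price"] = pd.to_numeric(toy["price"], errors="coerce")
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d", errors="coerce")

print(toy.dtypes)
print(toy.isnull().sum())
```

Counting the missing values after coercion shows how many entries failed to parse, and those rows can then be filled or dropped.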

## String Cleaning

In [9]:
sales["product"] = sales["product"].str.lower()
sales["city"] = sales["city"].str.strip()
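
To see why these steps matter, the sketch below uses hypothetical toy data in which the same city appears with stray spaces and different casing. Chaining `str.strip()` with a case method makes the variants compare equal. Using `str.title()` here is an illustrative choice. The cell above uses `str.lower()` for `product`.

```python
import pandas as pd

# Hypothetical toy data: one city written three different ways
toy = pd.DataFrame({"city": [" Delhi", "delhi ", "DELHI", "Mumbai"]})

distinct_before = toy["city"].nunique()  # 4 distinct spellings

# Strip surrounding whitespace, then normalise the casing
toy["city"] = toy["city"].str.strip().str.title()

distinct_after = toy["city"].nunique()   # 2 distinct cities

print(distinct_before, distinct_after)   # 4 2
```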

## Save Cleaned Dataset

In [10]:
sales.to_csv("../data/processed/cleaned_sales.csv", index=False)
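
CSV files store no type information, so the `datetime` conversion done earlier is not preserved when the file is read back. The sketch below round-trips hypothetical toy data through an in-memory buffer (a stand-in for the file on disk) and shows that `parse_dates` restores the date column on load.

```python
import io

import pandas as pd

# Hypothetical toy data, not taken from sales.csv
toy = pd.DataFrame(
    {
        "order_id": [1001, 1002],
        "date": pd.to_datetime(["2024-01-05", "2024-01-06"]),
    }
)

buffer = io.StringIO()           # in-memory stand-in for a file on disk
toy.to_csv(buffer, index=False)  # index=False keeps the row index out of the file

buffer.seek(0)
plain = pd.read_csv(buffer)                         # date comes back as text

buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])  # date is parsed again

print(plain.dtypes)
print(parsed.dtypes)
```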

## Conclusion

Clean data is necessary before transformation and analysis.