### Objective
This notebook focuses on cleaning the company's monthly sales data by:
- Identifying and handling missing or inconsistent data.
- Converting the `Date` column to a proper datetime format.
- Removing duplicates.


In [1]:
import pandas as pd

df = pd.read_csv(r"C:\Users\navya\OneDrive\Documents\Sales_Analysis_Project\Data\sample_sales_data.csv")
df.head()

| | Date | Customer_ID | Product_Category | Product_Name | Units_Sold | Unit_Price | Region | Sales_Rep | Revenue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-11-24 | CUST1070 | Stationery | Pen | 17 | 495 | North | Bob | 8415 |
| 1 | 2023-02-27 | CUST1021 | Clothing | Pen | 1 | 544 | South | Alice | 544 |
| 2 | 2023-01-13 | CUST1033 | Electronics | Laptop | 16 | 1356 | South | Alice | 21696 |
| 3 | 2023-05-21 | CUST1067 | Stationery | T-Shirt | 12 | 175 | West | Charlie | 2100 |
| 4 | 2023-05-06 | CUST1077 | Stationery | Phone | 19 | 364 | South | Ethan | 6916 |
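
The objective mentions inconsistent data, and one check the preview rows lend themselves to is whether `Revenue` matches `Units_Sold * Unit_Price`. The sketch below re-enters the five preview rows by hand and assumes that this relationship is intended, which the dataset itself does not state. The names `sample`, `expected_revenue`, and `mismatches` are illustrative.

```python
import pandas as pd

# The five rows shown by df.head(), re-entered by hand for illustration.
sample = pd.DataFrame({
    "Units_Sold": [17, 1, 16, 12, 19],
    "Unit_Price": [495, 544, 1356, 175, 364],
    "Revenue": [8415, 544, 21696, 2100, 6916],
})

# Assumption: Revenue is meant to equal Units_Sold * Unit_Price.
expected_revenue = sample["Units_Sold"] * sample["Unit_Price"]
mismatches = sample[sample["Revenue"] != expected_revenue]

print(f"Rows where Revenue differs from Units_Sold * Unit_Price: {len(mismatches)}")
```

On the full dataset the same comparison could be run on `df` directly; any rows it returns would need a decision on whether to correct or drop them.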


### Identifying and Handling missing values

In [2]:
print("Missing values per column:\n")
print(df.isnull().sum())

Missing values per column:

Date                0
Customer_ID         0
Product_Category    0
Product_Name        5
Units_Sold          0
Unit_Price          0
Region              0
Sales_Rep           0
Revenue             0
dtype: int64


In [3]:
df = df.dropna(subset=['Product_Name'])
print(f"Dataset shape after dropping rows with missing Product_Name: {df.shape}")

Dataset shape after dropping rows with missing Product_Name: (200, 9)


***Finding:*** 5 of the 205 rows (about 2.4%) had a missing `Product_Name`.  
***Decision & Action:*** Dropped these rows because they are few and imputing product names could introduce errors into product-level analysis.  
***Result:*** The dataset now has 200 rows and 9 columns.
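
Before dropping rows, it can help to look at the affected rows and at the share of the data they represent. The following is a minimal sketch on a hypothetical four-row frame (`demo` and its values are made up for illustration, not taken from the sales data), using the same `dropna(subset=...)` call as the cell above.

```python
import pandas as pd

# Hypothetical miniature frame; the values are illustrative only.
demo = pd.DataFrame({
    "Product_Name": ["Pen", None, "Laptop", None],
    "Units_Sold": [3, 5, 1, 2],
})

# Boolean mask of rows whose Product_Name is missing.
missing_mask = demo["Product_Name"].isna()

# The mean of a boolean mask is the fraction of rows affected.
missing_share = missing_mask.mean()

print(demo[missing_mask])
print(f"Share of rows with missing Product_Name: {missing_share:.1%}")

# Drop only the rows where Product_Name is missing.
cleaned = demo.dropna(subset=["Product_Name"])
print(cleaned.shape)
```

A small share supports dropping; a large share might instead call for tracing the names back to the source system.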

### Date and Time formatting ###

In [4]:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')


In [5]:
print("Invalid dates:", df['Date'].isna().sum())


Invalid dates: 0
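
With `errors='coerce'`, values that cannot be parsed become `NaT` instead of raising an error, which is why counting `NaT` afterwards works as a validity check. The sketch below illustrates this on a hypothetical series (`raw_dates` is invented for the example) and also separates values that failed to parse from values that were already missing.

```python
import pandas as pd

# Hypothetical raw values: two valid ISO dates and one unparseable string.
raw_dates = pd.Series(["2023-11-24", "not a date", "2023-02-27"])

# Unparseable entries are converted to NaT rather than raising an error.
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Values that were present in the input but became NaT after parsing.
invalid_mask = parsed.isna() & raw_dates.notna()

print(raw_dates[invalid_mask])
print("Invalid dates:", int(invalid_mask.sum()))
```

Keeping the original strings around until the check passes makes it possible to see exactly which inputs were rejected.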


### Identifying and Removing duplicate values ###

In [6]:
num_duplicated = df.duplicated().sum()
num_duplicated

5

In [7]:
df.drop_duplicates(inplace=True)
print(f"Dataset shape after removing duplicates: {df.shape}")


Dataset shape after removing duplicates: (195, 9)


**Finding:** 5 duplicate rows were present in the dataset.  
**Decision & Action:** Removed the repeated rows, keeping the first occurrence of each, to maintain data integrity.  
**Result:** The dataset now has 195 unique rows and 9 columns.
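
To see what is being removed, duplicates can be inspected before they are dropped. This is a minimal sketch on a hypothetical three-row frame (`demo` and its values are illustrative); it relies on `duplicated(keep=False)` to show every copy and on the default `keep='first'` behaviour of `drop_duplicates`.

```python
import pandas as pd

# Hypothetical miniature frame with one repeated row.
demo = pd.DataFrame({
    "Customer_ID": ["CUST1", "CUST2", "CUST1"],
    "Units_Sold": [4, 7, 4],
})

# keep=False flags every member of a duplicate group, not just the repeats.
all_copies = demo[demo.duplicated(keep=False)]
print(all_copies)

# The default keep="first" retains the first occurrence and drops later ones.
deduplicated = demo.drop_duplicates().reset_index(drop=True)
print(deduplicated.shape)
```

Resetting the index afterwards is optional; it only avoids gaps in the row labels left by the removed rows.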

### Saving cleaned Data ###

In [14]:

df.to_csv(r"C:\Users\navya\OneDrive\Documents\Sales_Analysis_Project\Data\sales_data_cleaned.csv", index=False)
print("Cleaned dataset saved successfully!")

Cleaned dataset saved successfully!
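
One thing to keep in mind when reloading the saved file is that CSV stores dates as text, so the datetime conversion is not preserved unless the column is parsed again on read. The sketch below demonstrates this with an in-memory buffer and a hypothetical two-row frame (`demo`, `plain`, and `parsed` are illustrative names); it does not touch the actual output file.

```python
import io

import pandas as pd

# Hypothetical miniature frame standing in for the cleaned data.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2023-11-24", "2023-02-27"]),
    "Revenue": [8415, 544],
})

# Write to an in-memory buffer so the sketch has no file path dependency.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Reading without parse_dates returns the Date column as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with parse_dates restores the datetime type.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Date"])

print(plain["Date"].dtype, parsed["Date"].dtype)
```

Any later notebook that loads `sales_data_cleaned.csv` would likely want `parse_dates=["Date"]` for the same reason.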
