# Data Cleaning and Preprocessing

## Objective
This notebook cleans and preprocesses the sales data from the `fact_sales` table before further analysis.

## Steps:
1. Load data from PostgreSQL.
2. Handle missing values.
3. Convert data types.
4. Remove duplicates.
5. Standardize column values.
6. Save the cleaned data to CSV.


In [3]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

In [5]:
# Database connection
DATABASE_URL = "postgresql://postgres:8484123@localhost:5432/sales_db"
engine = create_engine(DATABASE_URL)
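
The connection string above embeds the password directly in the notebook. A minimal sketch of one alternative, assuming the credentials are supplied through environment variables. The variable names (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`) and the helper `build_database_url` are illustrative assumptions, not part of the original notebook:

```python
import os
from urllib.parse import quote


def build_database_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables (hypothetical helper)."""
    user = env.get("PGUSER", "postgres")
    # Percent-encode the password so characters like '@' or '/' do not break the URL
    password = quote(env.get("PGPASSWORD", ""), safe="")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "sales_db")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# The result can be passed to create_engine() in place of the literal string
DATABASE_URL = build_database_url()
```

With this approach the password stays out of the notebook file, and the same code can run against a different database by changing the environment only.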


In [7]:
# Load data
query = "SELECT * FROM fact_sales"
df = pd.read_sql(query, engine)

In [9]:
# Handle missing values: treat a missing total_sales as 0
df.fillna({'total_sales': 0}, inplace=True)
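
Filling with 0 treats a missing `total_sales` as a sale of zero, which lowers any average computed afterwards. A small sketch on made-up data showing how one might count missing values per column before deciding how to fill them. The sample frame is illustrative only and does not come from `fact_sales`:

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the sales data
sample = pd.DataFrame({
    "total_sales": [100.0, np.nan, 250.0, np.nan],
    "region": ["north", "south", None, "east"],
})

# Count missing values per column before filling anything
missing_counts = sample.isna().sum()

# Compare the mean with and without the zero fill
mean_ignoring_missing = sample["total_sales"].mean()

# Fill only total_sales, leaving other columns untouched
filled = sample.fillna({"total_sales": 0})
mean_after_fill = filled["total_sales"].mean()
```

If the gap between the two means matters for the analysis, dropping the rows or filling with a different value may be a better fit than 0.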

In [13]:
# Convert data types: make date_id numeric; values that cannot be parsed become NaN
if 'date_id' in df.columns:
    df['date_id'] = pd.to_numeric(df['date_id'], errors='coerce')
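
Because `errors='coerce'` silently replaces unparseable values with `NaN`, it can be useful to count how many values were lost in the conversion. A minimal sketch on made-up `date_id` values (the sample data is an assumption for illustration):

```python
import pandas as pd

# Made-up date_id values, including one that cannot be parsed and one already missing
raw = pd.Series(["20240101", "20240102", "n/a", None])

converted = pd.to_numeric(raw, errors="coerce")

# Values that were present before conversion but are NaN afterwards
newly_missing = converted.isna() & raw.notna()
n_coerced = int(newly_missing.sum())
```

Here `n_coerced` counts only the values that the conversion itself invalidated, not those that were already missing.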

In [15]:
# Remove exact duplicate rows, keeping the first occurrence
df.drop_duplicates(inplace=True)

In [17]:
# Standardize column values: title-case region names (e.g. "north" -> "North")
if 'region' in df.columns:
    df['region'] = df['region'].str.title()
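
Title-casing can make rows identical that previously differed only in letter case, so duplicates may reappear after this step. A short sketch on made-up data suggesting that a duplicate check after standardization catches these rows:

```python
import pandas as pd

# Made-up rows: the first two differ only in the case of the region name
sample = pd.DataFrame({
    "region": ["north", "North", "SOUTH"],
    "total_sales": [100, 100, 50],
})

# Before standardization no rows are exact duplicates
before = sample.drop_duplicates()

# Title-case first, then drop duplicates
sample["region"] = sample["region"].str.title()
after = sample.drop_duplicates()
```

In this sample, `before` keeps all three rows while `after` keeps two, which suggests running `drop_duplicates` once more after the standardization step.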

In [21]:
# Save cleaned data
df.to_csv("cleaned_sales_data.csv", index=False)
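
CSV files do not store column types, so pandas infers them again when the file is read back. A quick round-trip check, sketched here on made-up data and a temporary directory rather than the real output file, can confirm that the saved file has the expected shape and columns:

```python
import os
import tempfile

import pandas as pd

# Made-up cleaned data standing in for the real DataFrame
sample = pd.DataFrame({
    "date_id": [20240101, 20240102],
    "region": ["North", "South"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_sales_data.csv")
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare the reloaded file against what was written
same_shape = reloaded.shape == sample.shape
same_columns = list(reloaded.columns) == list(sample.columns)
```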

In [23]:
print("Data cleaning completed.")

Data cleaning completed.
