In [1]:
import pandas as pd

### Step 1: Data Exploration & Data Cleaning
1. Load and Inspect the Data

In [2]:
df = pd.read_csv('data/retail_sales_dataset.csv')
df.head()

,Transaction ID,Date,Customer ID,Gender,Age,Product Category,Quantity,Price per Unit,Total Amount
0,1,2023-11-24,CUST001,Male,34,Beauty,3,50,150
1,2,2023-02-27,CUST002,Female,26,Clothing,2,500,1000
2,3,2023-01-13,CUST003,Male,50,Electronics,1,30,30
3,4,2023-05-21,CUST004,Male,37,Clothing,1,500,500
4,5,2023-05-06,CUST005,Male,30,Beauty,2,50,100


In [3]:
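# Check datatypes and null values
# Note: df.info() prints its own report and returns None,
# so wrapping it in print() also emits a trailing "None".
print(df.info())
print(df.isnull().sum())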

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    1000 non-null   int64 
 1   Date              1000 non-null   object
 2   Customer ID       1000 non-null   object
 3   Gender            1000 non-null   object
 4   Age               1000 non-null   int64 
 5   Product Category  1000 non-null   object
 6   Quantity          1000 non-null   int64 
 7   Price per Unit    1000 non-null   int64 
 8   Total Amount      1000 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 70.4+ KB
None
Transaction ID      0
Date                0
Customer ID         0
Gender              0
Age                 0
Product Category    0
Quantity            0
Price per Unit      0
Total Amount        0
dtype: int64


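The dataset has no null values. Some datatypes need modification: Date, Customer ID, Gender, and Product Category are all stored as object, and Date in particular should be converted to datetime.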
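One quick way to list the columns that may need conversion is to select everything that is not numeric. The sketch below runs on a small stand-in frame (`sample` is hypothetical, with values copied from the preview above) rather than the full file:

```python
import pandas as pd

# Stand-in for a few columns of the dataset (assumption: same names and value formats)
sample = pd.DataFrame({
    "Date": ["2023-11-24", "2023-02-27"],
    "Customer ID": ["CUST001", "CUST002"],
    "Gender": ["Male", "Female"],
    "Quantity": [3, 2],
})

# Non-numeric columns are the candidates for datetime or categorical conversion
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```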

2. Handling Missing Values

In [4]:
# No nulls were found above, so this is a no-op kept as a safeguard
df = df.dropna()
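Because `dropna()` silently removes rows, it can help to record how many rows it drops. A minimal sketch on a hypothetical frame (`toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
toy = pd.DataFrame({"Quantity": [3, np.nan, 1], "Total Amount": [150, 1000, 30]})

rows_before = len(toy)
toy_clean = toy.dropna()  # drops any row containing at least one NaN
rows_dropped = rows_before - len(toy_clean)
print(f"dropped {rows_dropped} of {rows_before} rows")
```

On the retail dataset this count should be zero, given the `isnull().sum()` output above.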

3. Remove Duplicates

In [5]:
df = df.drop_duplicates()
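`drop_duplicates()` with no arguments only removes rows that match in every column. If Transaction ID is meant to be unique (an assumption about this dataset), it may be worth checking repeated IDs separately. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical frame: row 2 repeats row 0 exactly;
# row 3 reuses a Transaction ID but with a different amount
toy = pd.DataFrame({
    "Transaction ID": [1, 2, 1, 2],
    "Total Amount": [150, 1000, 150, 500],
})

full_dupes = int(toy.duplicated().sum())                          # fully identical rows
id_dupes = int(toy.duplicated(subset=["Transaction ID"]).sum())   # repeated IDs
deduped = toy.drop_duplicates()                                   # removes only the identical row

print(full_dupes, id_dupes, len(deduped))
```

A gap between the two counts would point to conflicting records that `drop_duplicates()` alone does not resolve.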

4. Data Type Conversion

* Ensure Date is in datetime format. Quantity, Price per Unit, and Total Amount already appear as int64 in the `df.info()` output, so the numeric conversion is a safeguard rather than a required step.
* Convert Customer ID and Gender to categorical for efficient storage and easier analysis.

In [6]:
# Convert Date to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Convert columns to numeric; errors='coerce' turns unparseable values into NaN
df['Quantity'] = pd.to_numeric(df['Quantity'], errors='coerce')
df['Price per Unit'] = pd.to_numeric(df['Price per Unit'], errors='coerce')
df['Total Amount'] = pd.to_numeric(df['Total Amount'], errors='coerce')

# Convert categorical columns
df['Customer ID'] = df['Customer ID'].astype('category')
df['Gender'] = df['Gender'].astype('category')
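Note that `errors='coerce'` replaces unparseable entries with NaN instead of raising, and in this notebook the conversion runs after `dropna()`, so any NaN it introduces would stay in the data. A small sketch of the behaviour, using a hypothetical raw column (`raw` is not from the dataset):

```python
import pandas as pd

# Hypothetical raw column with one entry that cannot be parsed as a number
raw = pd.Series(["3", "2", "n/a", "1"])

converted = pd.to_numeric(raw, errors="coerce")  # "n/a" becomes NaN
n_bad = int(converted.isna().sum())
print(f"{n_bad} value(s) could not be converted")
```

Checking `isna().sum()` again after the conversion cell would confirm that nothing was coerced in the real data.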


In [9]:
# Saving cleaned data
df.to_pickle('cleaned_data.pkl')
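Pickle is used here presumably because it preserves the datetime and category dtypes, which a CSV round trip would lose. The sketch below checks that on a hypothetical frame written to a temporary directory (the file name mirrors the one above; `toy` is made up):

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with the same kinds of dtypes as the cleaned data
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2023-11-24", "2023-02-27"]),
    "Gender": pd.Series(["Male", "Female"], dtype="category"),
    "Total Amount": [150, 1000],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_data.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Compare dtypes column by column after the round trip
dtypes_match = list(restored.dtypes) == list(toy.dtypes)
print(dtypes_match)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.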

In [8]:
df.dtypes

Transaction ID               int64
Date                datetime64[ns]
Customer ID               category
Gender                    category
Age                          int64
Product Category            object
Quantity                     int64
Price per Unit               int64
Total Amount                 int64
dtype: object
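Whether the category dtype actually saves memory depends on how many distinct values a column holds. The following sketch compares a two-value column with an all-distinct one; both series are synthetic, and treating Customer ID as unique per row is an assumption based on the preview:

```python
import pandas as pd

n = 1000
gender = pd.Series(["Male", "Female"] * (n // 2))                 # 2 distinct values
customer = pd.Series([f"CUST{i:03d}" for i in range(1, n + 1)])   # all distinct (assumption)

def mem(s: pd.Series) -> int:
    """Memory footprint in bytes, including string contents."""
    return int(s.memory_usage(deep=True))

# Positive numbers mean the categorical version is smaller
gender_saving = mem(gender) - mem(gender.astype("category"))
customer_saving = mem(customer) - mem(customer.astype("category"))
print(gender_saving, customer_saving)
```

For a column like Gender the saving should be clear. For an identifier with mostly unique values, the categorical version stores every value once plus the integer codes, so the benefit may be small or negative.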