## Project Objective

- **Objective**: Optimize inventory management by understanding customer purchase behavior. The analysis focuses on three questions: which product categories are most popular, what seasonal trends appear in sales, and how frequently customers purchase.

## Data Cleaning

In [3]:
import pandas as pd
import numpy as np
from cleaning_functions import (
    convert_to_datetime,
    fill_missing_values,
    convert_to_int,
    lowercase_columns,
    save_cleaned_data,
)


# Load the CSV file
file_path = 'C:\\Users\\USER\\Documents\\GitHub\\project_final\\dataset_raw\\data_utf8.csv'
df = pd.read_csv(file_path)
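
The absolute Windows path above only resolves on the author's machine. As a minimal sketch, assuming the notebook is run from the repository root and that the raw file sits under `dataset_raw/` (both are assumptions), the load step could be wrapped in a small hypothetical helper, `load_raw`. It also reads the identifier columns as text, so that codes such as `85123A` and purely numeric ones such as `71053` are handled the same way:

```python
import io
from pathlib import Path

import pandas as pd

# Hypothetical relative location; adjust to wherever the raw file actually lives.
RAW_PATH = Path("dataset_raw") / "data_utf8.csv"


def load_raw(source=RAW_PATH):
    """Read the raw invoice export, keeping identifier columns as text."""
    return pd.read_csv(source, dtype={"InvoiceNo": str, "StockCode": str})


# Tiny in-memory stand-in for the real file, built from two rows of the dataset.
sample_csv = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom\n"
    "536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom\n"
)
sample_df = load_raw(sample_csv)
print(sample_df.dtypes)
```

`load_raw` accepts either a path or a file-like object, which makes it easy to try on a few rows before pointing it at the full export.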

In [4]:
df

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 12/9/2011 12:50 | 0.85 | 12680.0 | France |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 12/9/2011 12:50 | 2.10 | 12680.0 | France |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France |


- CustomerID: A unique identifier for each customer.
- StockCode (Product Category): Represents the product or product category.
- Description: A textual description of the product.
- Quantity: Number of units of a product purchased in each transaction.
- UnitPrice: The price per unit of the product.
- InvoiceDate: The date and time of the transaction.
- Country: The country where the customer is located.

**Numeric Fields**: Quantity, UnitPrice

**Categorical Fields**: StockCode, Country

**Datetime Fields**: InvoiceDate

In [5]:
# Check data types
df.dtypes

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object

In [6]:
df = convert_to_datetime(df, 'InvoiceDate')
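
`convert_to_datetime` comes from the project's `cleaning_functions` module, whose source is not shown here. A minimal sketch of what such a helper might look like, assuming the raw strings follow the month/day/year pattern seen in the data (`12/1/2010 8:26`). The function body and the `fmt` parameter are illustrative, not the author's implementation:

```python
import pandas as pd


def convert_to_datetime(df, column, fmt="%m/%d/%Y %H:%M"):
    """Return a copy of df with `column` parsed as datetime64.

    Assumed behaviour: values that do not match `fmt` become NaT
    (errors="coerce") so they can be counted instead of aborting the run.
    """
    out = df.copy()
    out[column] = pd.to_datetime(out[column], format=fmt, errors="coerce")
    return out


demo = pd.DataFrame({"InvoiceDate": ["12/1/2010 8:26", "12/9/2011 12:50"]})
demo = convert_to_datetime(demo, "InvoiceDate")
print(demo["InvoiceDate"].dt.month.tolist())
```

Passing an explicit format removes the ambiguity between day-first and month-first readings of a string like `12/1/2010`, which matters later when sales are grouped by month to look for seasonal trends.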

In [7]:
df.isna().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
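
Raw counts are easier to judge next to the share of rows they represent. A small sketch, using a hypothetical `missing_report` helper and toy data rather than the real frame:

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Summarise missing values per column as a count and a percentage."""
    report = pd.DataFrame({
        "missing": df.isna().sum(),
        "percent": (df.isna().mean() * 100).round(2),
    })
    # Show only columns that actually have gaps, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


toy = pd.DataFrame({
    "Description": ["LANTERN", None, "NAPKINS", "APRON"],
    "CustomerID": [17850.0, np.nan, np.nan, 12680.0],
    "Country": ["United Kingdom", "United Kingdom", "France", "France"],
})
print(missing_report(toy))
```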

In [8]:
df = fill_missing_values(df, 'CustomerID', -1)

In [9]:
df = fill_missing_values(df, 'Description', 'Unknown')
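
The two calls above replace missing `CustomerID` values with the placeholder `-1` and missing `Description` values with `'Unknown'`. The helper's source is not shown, so the following is only a sketch of how `fill_missing_values` might be written, assuming it returns a new frame and leaves its input untouched:

```python
import numpy as np
import pandas as pd


def fill_missing_values(df, column, value):
    """Return a copy of df with missing entries in `column` replaced by `value`."""
    out = df.copy()
    out[column] = out[column].fillna(value)
    return out


demo = pd.DataFrame({
    "CustomerID": [17850.0, np.nan],
    "Description": [None, "WHITE METAL LANTERN"],
})
demo = fill_missing_values(demo, "CustomerID", -1)
demo = fill_missing_values(demo, "Description", "Unknown")
print(demo)
```

The `-1` is a sentinel rather than a real customer, so any later per-customer statistic needs to exclude it explicitly.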

In [10]:
df = convert_to_int(df, 'CustomerID')
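
`CustomerID` was loaded as `float64` because a column containing `NaN` cannot be stored as a plain NumPy integer. Once the gaps are filled, the cast becomes possible. A sketch of how `convert_to_int` could work, with the guard against remaining missing values being an assumption added for illustration:

```python
import pandas as pd


def convert_to_int(df, column):
    """Return a copy of df with `column` cast to an integer dtype.

    Assumes missing values were already filled; the explicit check gives a
    clearer message than the error raised by the cast itself.
    """
    out = df.copy()
    if out[column].isna().any():
        raise ValueError(f"{column} still contains missing values")
    out[column] = out[column].astype(int)
    return out


demo = pd.DataFrame({"CustomerID": [17850.0, -1.0, 12680.0]})
demo = convert_to_int(demo, "CustomerID")
print(demo["CustomerID"].tolist())
```

The exact width of the resulting integer type depends on the platform and NumPy version, which is why this notebook's `df.dtypes` output reports `int32`. If keeping the gaps as true missing values is preferable to a sentinel, pandas' nullable `Int64` dtype is an alternative worth considering.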

In [11]:
df.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID              int32
Country                object
dtype: object

In [12]:
df.isna().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

In [13]:
# Convert column names to lower case
df = lowercase_columns(df)
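
For reference, a helper like `lowercase_columns` can be a thin wrapper around the string accessor on the column index. This is a sketch under the assumption that the helper returns a copy:

```python
import pandas as pd


def lowercase_columns(df):
    """Return a copy of df whose column labels are lower-cased."""
    out = df.copy()
    out.columns = out.columns.str.lower()
    return out


demo = pd.DataFrame(columns=["InvoiceNo", "StockCode", "UnitPrice", "CustomerID"])
print(lowercase_columns(demo).columns.tolist())
```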

In [14]:
df.sample(4)

| | invoiceno | stockcode | description | quantity | invoicedate | unitprice | customerid | country |
|---|---|---|---|---|---|---|---|---|
| 319229 | 564843 | 21741 | COSY SLIPPER SHOES LARGE GREEN | 1 | 2011-08-30 14:09:00 | 2.95 | 15866 | United Kingdom |
| 206902 | 554960 | 20711 | JUMBO BAG TOYS | 2 | 2011-05-27 15:18:00 | 4.13 | -1 | United Kingdom |
| 339745 | 566602 | 23184 | BULL DOG BOTTLE OPENER | 4 | 2011-09-13 16:05:00 | 10.79 | -1 | United Kingdom |
| 282449 | 561650 | 23320 | GIANT 50'S CHRISTMAS CRACKER | 12 | 2011-07-28 15:31:00 | 2.89 | 15329 | United Kingdom |
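
Two of the sampled rows carry the placeholder `customerid` of `-1`. Left in place, every anonymous purchase would be counted as belonging to one very active customer. A sketch of how purchase frequency could be computed while excluding the sentinel, using toy rows with illustrative values rather than the real frame:

```python
import pandas as pd

SENTINEL_ID = -1  # placeholder written by the cleaning step for unknown customers

# Toy rows shaped like the cleaned frame; values are illustrative.
toy = pd.DataFrame({
    "invoiceno": ["564843", "554960", "566602", "561650", "561651"],
    "customerid": [15866, -1, -1, 15329, 15329],
})

known = toy[toy["customerid"] != SENTINEL_ID]
# Purchase frequency = number of distinct invoices per identified customer.
frequency = known.groupby("customerid")["invoiceno"].nunique()
print(frequency)
```

Counting distinct invoice numbers rather than rows avoids inflating the figure for invoices that contain many line items.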


In [17]:
# Export the cleaned data to a new CSV file
save_cleaned_data(df, 'C:\\Users\\USER\\Documents\\GitHub\\project_final\\data_cleaning\\data_cleaned.csv')
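
As with the other helpers, the body of `save_cleaned_data` is not shown. A minimal sketch of one way to write it, assuming the row index should not be stored in the file and that missing output folders should be created:

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_cleaned_data(df, path):
    """Write df to `path` as CSV, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the row index out of the file, so reloading does not
    # produce an extra "Unnamed: 0" column.
    df.to_csv(path, index=False)
    return path


demo = pd.DataFrame({"invoiceno": ["536365"], "customerid": [17850]})
with tempfile.TemporaryDirectory() as tmp:
    target = save_cleaned_data(demo, Path(tmp) / "data_cleaning" / "data_cleaned.csv")
    reloaded = pd.read_csv(target, dtype={"invoiceno": str})
print(reloaded.columns.tolist())
```

CSV does not store column types, so a later notebook that reads the cleaned file would need to request them again, for example with `parse_dates=["invoicedate"]` in `pd.read_csv`.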