# Introduction

This analysis presents a complete customer segmentation and predictive analytics workflow built on transaction data.

The goal is to convert raw purchase data into actionable insights that inform marketing strategies, improve customer retention, and support revenue forecasting.


## 1. Dataset Source and Description

The dataset used in this project was obtained from the **UC Irvine Machine Learning Repository**:

**Online Retail II Dataset**  
Donated: September 20, 2019  
Source: UC Irvine Machine Learning Repository  

🔗 [https://archive.ics.uci.edu/dataset/502/online+retail+ii](https://archive.ics.uci.edu/dataset/502/online+retail+ii)

This dataset contains **two years of real online retail transactions** from a UK-based, non-store online retailer, covering the period **December 2009 to December 2011**. The company primarily sells unique, all-occasion gift products, and many of its customers are wholesalers.

The dataset is well-suited for **customer analytics and predictive modeling**, and supports tasks such as:

- Classification  
- Regression  
- Clustering  

It includes a mix of **transactional, temporal, and categorical features**, making it ideal for RFM analysis, customer segmentation, and CLV modeling.
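To make the RFM idea concrete, here is a minimal sketch of how recency, frequency, and monetary value could be derived per customer. The toy rows, the precomputed `Revenue` column, and the choice of snapshot date (one day after the last invoice) are assumptions made for illustration; they are not taken from the dataset or from this project's later steps.

```python
import pandas as pd

# Hypothetical toy transactions; column names follow the dataset description,
# "Revenue" is an assumed helper column (Quantity x UnitPrice).
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["1001", "1002", "1003", "1004", "1005", "1006"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-03-01", "2011-02-10",
        "2011-02-20", "2011-03-09", "2010-12-15",
    ]),
    "Revenue": [6.0, 10.0, 7.5, 6.0, 10.0, 5.0],
})

# Assumed reference date: the day after the most recent invoice
snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = toy.groupby("CustomerID").agg(
    recency_days=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    frequency=("InvoiceNo", "nunique"),                                 # number of distinct invoices
    monetary=("Revenue", "sum"),                                        # total spend
)
print(rfm)
```

Each row of `rfm` describes one customer, which is the shape typically fed into segmentation or clustering.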

### Variable Description

| Field       | Description |
|-------------|-------------|
| InvoiceNo   | Unique invoice number for each transaction. If it starts with "C", it indicates a cancellation. |
| StockCode   | Unique product identifier. |
| Description | Product name. |
| Quantity    | Number of items purchased in a transaction. |
| InvoiceDate | Date and time when the transaction occurred. |
| UnitPrice   | Price per unit in British Pounds (£). |
| CustomerID  | Unique customer identifier. |
| Country     | Customer's country of residence. |
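As a quick illustration of how these fields combine, the sketch below flags cancellations using the "C" prefix rule from the table and derives a per-line total from `Quantity` and `UnitPrice`. The three rows and the `IsCancellation` and `LineTotal` column names are hypothetical, chosen only for this example.

```python
import pandas as pd

# Hypothetical rows using the field names from the table above
toy = pd.DataFrame({
    "InvoiceNo": ["489434", "C489449", "489435"],
    "Quantity": [12, -1, 6],
    "UnitPrice": [6.95, 4.25, 2.10],
})

# Cast to str first: purely numeric invoice numbers may load as integers
toy["IsCancellation"] = toy["InvoiceNo"].astype(str).str.startswith("C")

# Value of each line in British Pounds
toy["LineTotal"] = toy["Quantity"] * toy["UnitPrice"]

# Keep only regular (non-cancelled) lines
sales = toy[~toy["IsCancellation"]]
print(sales)
```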


## 2. Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import yaml

## 3. Load Config

In [5]:
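# Read project settings from the config file one level above the notebook
with open("../config.yaml", "r") as file:
    config = yaml.safe_load(file)

# Location of the raw Excel file, as written in the config
data_path = config["paths"]["raw_data"]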

## 4. Load Data

In [6]:
# Load all sheets from the Excel file
dfs = pd.read_excel(data_path, sheet_name=None)  # None = load all sheets

# dfs is a dictionary: {sheet_name: DataFrame}
print("Sheets loaded:", list(dfs.keys()))

sheet_names = list(dfs)       # sheet names in workbook order
df1 = dfs[sheet_names[0]]     # first tab
df2 = dfs[sheet_names[1]]     # second tab

print(df1.head())
print(df2.head())

FileNotFoundError: [Errno 2] No such file or directory: 'data/raw/online_retail_II.xlsx'
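One plausible cause of this error: the config file is opened from `../config.yaml`, so the notebook appears to run one level below the project root, while the path stored in the config looks like it is written relative to the root. The sketch below resolves a configured path against the folder that holds the config file. The helper name `resolve_from_config` is hypothetical, and the assumption that config paths are root-relative is not confirmed by the source.

```python
from pathlib import Path


def resolve_from_config(config_file, relative_path):
    """Interpret relative_path relative to the folder holding config_file.

    Hypothetical helper; assumes paths in the config are written relative
    to the config file's directory. Absolute paths are returned unchanged.
    """
    config_dir = Path(config_file).resolve().parent
    return (config_dir / relative_path).resolve()


# The literal path below is copied from the error message; in the notebook
# it would come from config["paths"]["raw_data"].
data_file = resolve_from_config("../config.yaml", "data/raw/online_retail_II.xlsx")

if not data_file.exists():
    print(f"Data file not found at {data_file}")
```

If the file exists at the resolved location, passing `data_file` to `pd.read_excel` in place of `data_path` should avoid the error. If it still does not exist, the dataset may simply not have been downloaded to that folder yet.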