## 1. Data Overview

In [2]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.width', 320)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 30) 
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_rows', 100)

In [17]:
df = pd.read_csv('../data/raw_data.csv')

print(f"Dataset Shape : {df.shape[0]:,} Rows x {df.shape[1]} Columns")

Dataset Shape : 525,461 Rows x 8 Columns


In [4]:
df.head(10)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL ...,12,12/1/2009 7:45,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,12/1/2009 7:45,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,12/1/2009 7:45,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,12/1/2009 7:45,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET...,24,12/1/2009 7:45,1.25,13085.0,United Kingdom
5,489434,22064,PINK DOUGHNUT TRINKET POT,24,12/1/2009 7:45,1.65,13085.0,United Kingdom
6,489434,21871,SAVE THE PLANET MUG,24,12/1/2009 7:45,1.25,13085.0,United Kingdom
7,489434,21523,FANCY FONT HOME SWEET HOME...,10,12/1/2009 7:45,5.95,13085.0,United Kingdom
8,489435,22350,CAT BOWL,12,12/1/2009 7:46,2.55,13085.0,United Kingdom
9,489435,22349,"DOG BOWL , CHASING BALL DE...",12,12/1/2009 7:46,3.75,13085.0,United Kingdom


In [5]:
df.tail(10)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
525451,538171,22748,POPPY'S PLAYHOUSE KITCHEN,2,12/9/2010 20:01,2.1,17530.0,United Kingdom
525452,538171,22745,POPPY'S PLAYHOUSE BEDROOM,2,12/9/2010 20:01,2.1,17530.0,United Kingdom
525453,538171,22558,CLOTHES PEGS RETROSPOT PAC...,4,12/9/2010 20:01,1.49,17530.0,United Kingdom
525454,538171,21671,RED SPOT CERAMIC DRAWER KNOB,6,12/9/2010 20:01,1.25,17530.0,United Kingdom
525455,538171,20971,PINK BLUE FELT CRAFT TRINK...,2,12/9/2010 20:01,1.25,17530.0,United Kingdom
525456,538171,22271,FELTCRAFT DOLL ROSIE,2,12/9/2010 20:01,2.95,17530.0,United Kingdom
525457,538171,22750,FELTCRAFT PRINCESS LOLA DOLL,1,12/9/2010 20:01,3.75,17530.0,United Kingdom
525458,538171,22751,FELTCRAFT PRINCESS OLIVIA ...,1,12/9/2010 20:01,3.75,17530.0,United Kingdom
525459,538171,20970,PINK FLORAL FELTCRAFT SHOU...,2,12/9/2010 20:01,3.75,17530.0,United Kingdom
525460,538171,21931,JUMBO STORAGE BAG SUKI,2,12/9/2010 20:01,1.95,17530.0,United Kingdom


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525461 entries, 0 to 525460
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    525461 non-null  object 
 1   StockCode    525461 non-null  object 
 2   Description  522533 non-null  object 
 3   Quantity     525461 non-null  int64  
 4   InvoiceDate  525461 non-null  object 
 5   UnitPrice    525461 non-null  float64
 6   CustomerID   417534 non-null  float64
 7   Country      525461 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 32.1+ MB


In [7]:
print("COLUMN DESCRIPTIONS")
column_descriptions = {
    'InvoiceNo': 'Invoice number (6-digit, C prefix = cancellation)',
    'StockCode': 'Product/item code (5-digit)',
    'Description': 'Product name/description',
    'Quantity': 'Quantity of items per transaction',
    'InvoiceDate': 'Invoice date and time',
    'UnitPrice': 'Unit price per item (£)',
    'CustomerID': 'Unique customer identifier',
    'Country': 'Customer country name'
}

for col, desc in column_descriptions.items():
    print(f"• {col:15} : {desc}")

COLUMN DESCRIPTIONS
• InvoiceNo       : Invoice number (6-digit, C prefix = cancellation)
• StockCode       : Product/item code (5-digit)
• Description     : Product name/description
• Quantity        : Quantity of items per transaction
• InvoiceDate     : Invoice date and time
• UnitPrice       : Unit price per item (£)
• CustomerID      : Unique customer identifier
• Country         : Customer country name


In [8]:
df.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,525461.0,525461.0,417534.0
mean,10.34,4.69,15360.65
std,107.42,146.13,1680.81
min,-9600.0,-53594.36,12346.0
25%,1.0,1.25,13983.0
50%,3.0,2.1,15311.0
75%,10.0,4.21,16799.0
max,19152.0,25111.09,18287.0


In [9]:
df.dtypes

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object

In [10]:
df.dtypes.value_counts()

object     5
float64    2
int64      1
Name: count, dtype: int64

In [11]:
missing_df = pd.DataFrame({
    'Column' : df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df) * 100).round(2),
    'Data_Type':df.dtypes
})

missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

# Print the table only when some column has missing values; otherwise say so explicitly
if missing_df.empty:
    print('No missing values found')
else:
    print(missing_df.to_string(index=False))

     Column  Missing_Count  Missing_Percentage Data_Type
 CustomerID         107927               20.54   float64
Description           2928                0.56    object


In [12]:
unique_df = pd.DataFrame({
    'Column': df.columns,
    'Unique_Values': df.nunique(),
    'Sample_Values': [df[col].unique()[:3].tolist() for col in df.columns]
})

print(unique_df.to_string(index=False))

     Column  Unique_Values                                                                   Sample_Values
  InvoiceNo          28816                                                        [489434, 489435, 489436]
  StockCode           4632                                                         [85048, 79323P, 79323W]
Description           4681 [15CM CHRISTMAS GLASS BALL 20 LIGHTS, PINK CHERRY LIGHTS,  WHITE CHERRY LIGHTS]
   Quantity            825                                                                    [12, 48, 24]
InvoiceDate          25296                                [12/1/2009 7:45, 12/1/2009 7:46, 12/1/2009 9:06]
  UnitPrice           1606                                                               [6.95, 6.75, 2.1]
 CustomerID           4383                                                     [13085.0, 13078.0, 15362.0]
    Country             40                                                   [United Kingdom, France, USA]


In [13]:
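# Dates are month-first (e.g. 12/1/2009 7:45); an explicit format avoids day/month ambiguity
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%m/%d/%Y %H:%M')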

print(f"Date Range : {df['InvoiceDate'].min()} - {df['InvoiceDate'].max()} ")

Date Range : 2009-12-01 07:45:00 - 2010-12-09 20:01:00 


In [14]:
print(f"Total Invoices (including cancelled): {df['InvoiceNo'].nunique():,}")
print(f"Total Products:      {df['StockCode'].nunique():,}")
print(f"Total Customers:     {df['CustomerID'].nunique():,}")
print(f"Total Countries:     {df['Country'].nunique():,}")
print(f"Total Records:       {len(df):,}")

Total Invoices (including cancelled): 28,816
Total Products:      4,632
Total Customers:     4,383
Total Countries:     40
Total Records:       525,461


In [15]:
country_counts = df['Country'].value_counts()
print(country_counts)

Country
United Kingdom          485852
EIRE                      9670
Germany                   8129
France                    5772
Netherlands               2769
Spain                     1278
Switzerland               1187
Portugal                  1101
Belgium                   1054
Channel Islands            906
Sweden                     902
Italy                      731
Australia                  654
Cyprus                     554
Austria                    537
Greece                     517
United Arab Emirates       432
Denmark                    428
Norway                     369
Finland                    354
Unspecified                310
USA                        244
Japan                      224
Poland                     194
Malta                      172
Lithuania                  154
Singapore                  117
RSA                        111
Bahrain                    107
Canada                      77
Thailand                    76
Hong Kong                   76


In [16]:
print(f"Negative Quantities:  {(df['Quantity'] < 0).sum():,} records")
print(f"Zero Quantities:      {(df['Quantity'] == 0).sum():,} records")
print(f"Negative Prices:      {(df['UnitPrice'] < 0).sum():,} records")
print(f"Zero Prices:          {(df['UnitPrice'] == 0).sum():,} records")
print(f"Cancelled Invoices:   {df['InvoiceNo'].astype(str).str.startswith('C').sum():,} records")
print(f"Duplicate Rows:       {df.duplicated().sum():,} records")

Negative Quantities:  12,326 records
Zero Quantities:      0 records
Negative Prices:      3 records
Zero Prices:          3,687 records
Cancelled Invoices:   10,206 records
Duplicate Rows:       6,865 records


## Data Overview Report

### Dataset Summary

This analysis uses transactional data from a UK-based online retail company, covering **1 December 2009 to 9 December 2010** (just over 12 months).

**Dataset Dimensions:**
- **525,461 records (invoice line items)** across **28,816 invoices**
- **4,632 unique products**
- **4,383 unique customers**
- **40 distinct country values** (including an `Unspecified` placeholder)

---

### Column Definitions

| Column | Type | Description |
|--------|------|-------------|
| `InvoiceNo` | Object | Invoice number (6-digit; 'C' prefix = cancellation) |
| `StockCode` | Object | Product/item code (5 digits, sometimes with a letter suffix, e.g. `79323P`) |
| `Description` | Object | Product name/description |
| `Quantity` | Integer | Quantity of items per transaction |
| `InvoiceDate` | Object | Invoice date and time (needs conversion) |
| `UnitPrice` | Float | Unit price per item in GBP (£) |
| `CustomerID` | Float | Unique customer identifier |
| `Country` | Object | Customer's country |

---

### Initial Data Quality Assessment

#### Geographic Distribution
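- **92.5%** of records come from the **United Kingdom** (485,852 of 525,461)
- EIRE, Germany, France, and the Netherlands are the largest international markets (9,670, 8,129, 5,772, and 2,769 records)
- Long tail across 40 distinct `Country` values, one of which is the placeholder `Unspecified` (310 records)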
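
The shares quoted above can be derived directly from the `Country` column. The following is a minimal sketch on a toy frame; the rows and the helper name `country_share` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the real frame; these rows are illustrative only.
toy = pd.DataFrame({
    "Country": ["United Kingdom"] * 8 + ["EIRE", "Germany"],
})

def country_share(frame: pd.DataFrame, column: str = "Country") -> pd.DataFrame:
    """Return record counts and percentage share per country, largest first."""
    counts = frame[column].value_counts()
    share = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"Records": counts, "Share_Pct": share})

shares = country_share(toy)
print(shares)
```

On the real data, `country_share(df)` would give the share of records per country; counting invoices or customers per country instead may tell a different story, since one invoice spans many rows.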

#### Missing Data Issues
| Column | Missing Count | % Missing |
|--------|---------------|-----------|
| `CustomerID` | 107,927 | **20.54%** |
| `Description` | 2,928 | **0.56%** |

**Implications:**
- ~20.5% of records lack a `CustomerID` (possibly guest purchases or B2B orders; the data alone does not say which)
- Customer-level analysis has to drop these rows, while product and revenue analysis can keep them, so the two need separate datasets
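
One way to set up the two datasets is sketched below on a toy frame. The rows and the names `customer_df` and `product_df` are assumptions made for illustration; the actual split belongs to the cleaning section.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative): one record lacks a CustomerID, another lacks a Description.
toy = pd.DataFrame({
    "InvoiceNo": ["489434", "489434", "489435", "489436"],
    "StockCode": ["85048", "79323P", "22350", "22349"],
    "Description": ["GLASS BALL", "PINK CHERRY LIGHTS", "CAT BOWL", None],
    "CustomerID": [13085.0, 13085.0, np.nan, 13078.0],
})

# Customer-level work needs an ID on every row.
customer_df = toy.dropna(subset=["CustomerID"]).copy()

# Product-level work only needs a usable description; anonymous rows stay in.
product_df = toy.dropna(subset=["Description"]).copy()

print(len(customer_df), len(product_df))
```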

#### Data Quality Red Flags

| Issue | Count | Business Impact |
|-------|-------|-----------------|
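| **Negative Quantities** | 12,326 | Likely returns or cancellations; exceeds the cancelled-record count, so some rows lack a `C` prefix |
| **Cancelled Invoices** (C prefix) | 10,206 | Must be excluded from revenue analysis |
| **Zero Prices** | 3,687 | Free samples/promotions or errors; need investigation |
| **Negative Prices** | 3 | Probably data entry errors or manual adjustments; inspect individually |
| **Duplicate Rows** | 6,865 | Possibly system errors, though identical line items can legitimately repeat; review before removing |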
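
To see how negative quantities relate to cancelled invoices, the two flags can be cross-tabulated. This is a small sketch on made-up rows; the variable names are assumptions, and the counts it prints are not from the dataset.

```python
import pandas as pd

# Toy rows (illustrative): two cancelled lines, one negative quantity without a C prefix.
toy = pd.DataFrame({
    "InvoiceNo": ["489434", "C489449", "C489449", "489464", "489500"],
    "Quantity": [12, -1, -6, -5, 3],
})

is_cancel = toy["InvoiceNo"].astype(str).str.startswith("C")
is_negative = toy["Quantity"] < 0

# Rows: cancelled or not; columns: negative quantity or not.
overlap = pd.crosstab(is_cancel.rename("cancelled"), is_negative.rename("negative_qty"))
print(overlap)

# Negative quantities that are not part of a cancelled invoice deserve a manual look.
unflagged = toy[is_negative & ~is_cancel]
print(unflagged)
```

Applied to the real frame, the `unflagged` rows would be the ones to inspect by `Description` before deciding whether they are returns, stock adjustments, or errors.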

#### Statistical Overview

**Quantity:**
- Mean: 10.3 items per record (line item)
- Median: 3 items (suggests right-skewed distribution)
- Range: -9,600 to 19,152 (extreme values need investigation)

**Unit Price:**
- Mean: £4.69
- Median: £2.10
- Range: -£53,594 to £25,111 (clear outliers and errors)

**Customer Distribution:**
- 4,383 unique customers over roughly 12 months
- `CustomerID` values range from 12,346 to 18,287; as identifiers, their mean (15,361) carries no analytical meaning
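
The gap between mean and median, and the extreme ranges, can be made concrete by flagging values outside the interquartile fences. The sketch below uses a toy series and a hypothetical helper `iqr_bounds`; it flags candidates for review and is not a removal rule.

```python
import pandas as pd

# Toy quantities (illustrative) with one extreme value at each end.
toy_qty = pd.Series([1, 1, 2, 3, 3, 4, 10, 12, 24, 19152, -9600], name="Quantity")

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = iqr_bounds(toy_qty)
flagged = toy_qty[(toy_qty < low) | (toy_qty > high)]

print(f"mean={toy_qty.mean():.1f}, median={toy_qty.median():.1f}")
print(f"fences=({low}, {high})")
print(flagged)
```

A flagged value may still be a genuine bulk order, so each one needs a business explanation before it is dropped.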

---

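### Key Observations

1. **Data Type Issues:**
   - `InvoiceDate` stored as string → needs datetime conversion
   - `CustomerID` stored as float because of nulls → a nullable integer type would fit better

2. **Data Cleanliness:**
   - **~1.9% of records** (10,206) belong to cancelled invoices and must be excluded
   - **~1.3% duplicate records** (6,865) need review and deduplication
   - Negative quantities (12,326) outnumber cancelled records (10,206), so at least 2,120 negative rows carry no `C` prefix and may be unflagged returns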

3. **Business Context:**
   - UK-focused e-commerce with international reach
   - B2C and potential B2B operations (missing CustomerIDs)
   - Product range: 4,632 SKUs (medium-sized catalog)
   - Transaction frequency: ~2,300 invoices per month (28,816 invoices, cancellations included, over roughly 12.3 months)

4. **Analysis Requirements:**
   - Must separate revenue analysis from customer behavior analysis
   - Need to handle cancellations systematically
   - Outlier investigation required (bulk orders vs. errors)
   - Time-based features needed (month, day, hour)
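
The type fixes and time-based features listed above might look like the following. It is a sketch on three toy rows; the column names `Month`, `DayOfWeek`, and `Hour` are assumptions, and pandas' nullable `Int64` dtype is one possible answer to the float `CustomerID`.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative), using the same date format as the raw file.
toy = pd.DataFrame({
    "InvoiceNo": ["489434", "489435", "538171"],
    "InvoiceDate": ["12/1/2009 7:45", "12/1/2009 7:46", "12/9/2010 20:01"],
    "CustomerID": [13085.0, np.nan, 17530.0],
})

# Parse month-first timestamps and keep missing customer IDs as <NA>.
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], format="%m/%d/%Y %H:%M")
toy["CustomerID"] = toy["CustomerID"].astype("Int64")

# Time-based features for later seasonality and time-of-day analysis.
toy["Month"] = toy["InvoiceDate"].dt.to_period("M")
toy["DayOfWeek"] = toy["InvoiceDate"].dt.day_name()
toy["Hour"] = toy["InvoiceDate"].dt.hour

# Distinct invoices per calendar month.
invoices_per_month = toy.groupby("Month")["InvoiceNo"].nunique()
print(invoices_per_month)
```

On the full data, `invoices_per_month` gives the monthly invoice counts directly, which is a more reliable basis for a frequency figure than dividing the total by an approximate number of months.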

---

### Next Steps

The Data Cleaning section will systematically address:
1. Remove cancelled orders (C prefix invoices)
2. Convert data types (datetime, proper numerics)
3. Handle missing CustomerIDs appropriately
4. Remove invalid values (negatives, zeros, duplicates)
5. Create calculated fields (TotalPrice, date components)
6. Investigate and justify outlier treatment

**Cleaning must be justified by business logic, not just statistical rules.**
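
As a rough preview of how the steps above could fit together, here is a sketch of a cleaning function. The name `clean_transactions`, the toy rows, and the choice to keep rows without a `CustomerID` are all assumptions; the real rules should follow from the investigation, not from this outline.

```python
import numpy as np
import pandas as pd

def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the planned cleaning steps in order and return a new frame."""
    # 1. Remove cancelled orders (C-prefixed invoices).
    cancelled = raw["InvoiceNo"].astype(str).str.startswith("C")
    df = raw.loc[~cancelled].copy()

    # 2. Convert data types.
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], format="%m/%d/%Y %H:%M")
    df["CustomerID"] = df["CustomerID"].astype("Int64")

    # 3. Missing CustomerIDs are kept here (assumption); customer-level
    #    analysis would filter them out later.

    # 4. Remove non-positive quantities and prices, then exact duplicates.
    valid = (df["Quantity"] > 0) & (df["UnitPrice"] > 0)
    df = df.loc[valid].drop_duplicates()

    # 5. Create calculated fields.
    df = df.assign(
        TotalPrice=df["Quantity"] * df["UnitPrice"],
        Month=df["InvoiceDate"].dt.month,
    )
    return df.reset_index(drop=True)

# Toy rows (illustrative): a duplicate pair, a cancellation, an unflagged
# negative quantity, a zero price, and one valid anonymous purchase.
toy = pd.DataFrame({
    "InvoiceNo": ["489434", "489434", "C489449", "489464", "489500", "489501"],
    "Quantity": [12, 12, -1, -5, 3, 2],
    "InvoiceDate": ["12/1/2009 7:45"] * 2 + ["12/1/2009 10:33", "12/1/2009 10:52",
                                              "12/1/2009 11:00", "12/1/2009 11:05"],
    "UnitPrice": [6.95, 6.95, 2.95, 1.25, 0.0, 2.10],
    "CustomerID": [13085.0, 13085.0, 16321.0, np.nan, 13078.0, np.nan],
})

clean = clean_transactions(toy)
print(clean)
```

Outlier treatment (step 6) is deliberately left out, since it depends on distinguishing bulk orders from errors case by case.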