# Online Retail Dataset Analysis

This notebook imports and analyzes the Online Retail dataset from the UCI Machine Learning Repository.

**Dataset Information:**
- **Source:** UCI Machine Learning Repository (ID: 352)
- **Description:** Transactional data from a UK-based online retail store, covering 1 December 2010 to 9 December 2011
- **Instances:** 541,909 rows (one per invoice line item)
- **Features:** 8 variables
- **Tasks:** Classification, Clustering

**Variables:**
- InvoiceNo: Transaction ID (6-digit number, 'c' prefix indicates cancellation)
- StockCode: Product code (5-digit number)
- Description: Product name
- Quantity: Quantity of the product in the transaction
- InvoiceDate: Transaction date and time
- UnitPrice: Price per unit (sterling)
- CustomerID: Customer identifier (5-digit number)
- Country: Customer's country
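
The cancellation convention for `InvoiceNo` can be turned into a boolean flag. The following is a minimal sketch, assuming invoice numbers may arrive as strings or integers; the helper name `is_cancellation` and the sample values are illustrative and not taken from the dataset. The check is case-insensitive on the assumption that the prefix may appear as either 'c' or 'C'.

```python
import pandas as pd

def is_cancellation(invoice_no: pd.Series) -> pd.Series:
    """Return True where the invoice number starts with 'c' or 'C'."""
    # Cast to str first so purely numeric invoice numbers are handled too
    return invoice_no.astype(str).str.upper().str.startswith("C")

# Hypothetical sample values for illustration only
sample = pd.Series(["536365", "C536379", "c536383", 536367])
flags = is_cancellation(sample)
print(flags.tolist())  # [False, True, True, False]
```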

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ucimlrepo import fetch_ucirepo

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries imported successfully!")

Libraries imported successfully!


In [2]:
# Fetch the Online Retail dataset from UCI
print("Fetching Online Retail dataset from UCI...")
online_retail = fetch_ucirepo(id=352)

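# Get the feature columns; the identifier variables (InvoiceNo, StockCode)
# are not part of .features, so this frame has 6 columns rather than 8
df = online_retail.data.features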

print(f"Dataset shape: {df.shape}")
print(f"Dataset columns: {list(df.columns)}")
print("\nFirst few rows:")
df.head()

Fetching Online Retail dataset from UCI...
Dataset shape: (541909, 6)
Dataset columns: ['Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']

First few rows:


| | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|
| 0 | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
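
The loaded frame has 6 columns, while the variable list has 8: `InvoiceNo` and `StockCode` are absent from `data.features`. Below is a sketch of joining identifier columns back onto the features, assuming the identifiers are available as a separate, row-aligned frame (`online_retail.data.ids` is the assumed source and should be checked against the installed `ucimlrepo` version). The helper `attach_ids` and the toy frames are hypothetical.

```python
import pandas as pd

def attach_ids(ids: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """Place identifier columns in front of the feature columns, row by row."""
    # Guard against silently misaligned rows
    if not ids.index.equals(features.index):
        raise ValueError("ids and features must share the same index")
    return pd.concat([ids, features], axis=1)

# Toy stand-ins for the identifier and feature frames (illustrative values)
toy_ids = pd.DataFrame({"InvoiceNo": ["A1", "A1"], "StockCode": ["P10", "P11"]})
toy_features = pd.DataFrame({"Quantity": [6, 6], "UnitPrice": [2.55, 3.39]})

full = attach_ids(toy_ids, toy_features)
print(list(full.columns))
```

The index check is a deliberate choice: `pd.concat` along columns aligns on the index, so mismatched indexes would otherwise produce rows padded with missing values instead of an error.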


In [3]:
# Display dataset information
print("Dataset Info:")
print("=" * 50)
df.info()

print("\nDataset Description:")
print("=" * 50)
df.describe()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Description  540455 non-null  object 
 1   Quantity     541909 non-null  int64  
 2   InvoiceDate  541909 non-null  object 
 3   UnitPrice    541909 non-null  float64
 4   CustomerID   406829 non-null  float64
 5   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(3)
memory usage: 24.8+ MB

Dataset Description:


| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 541909.0 | 541909.0 | 406829.0 |
| mean | 9.55225 | 4.611114 | 15287.69057 |
| std | 218.081158 | 96.759853 | 1713.600303 |
| min | -80995.0 | -11062.06 | 12346.0 |
| 25% | 1.0 | 1.25 | 13953.0 |
| 50% | 3.0 | 2.08 | 15152.0 |
| 75% | 10.0 | 4.13 | 16791.0 |
| max | 80995.0 | 38970.0 | 18287.0 |
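
The output above shows `InvoiceDate` stored as `object`, `CustomerID` stored as `float64`, and negative minimum values for `Quantity` and `UnitPrice`. The following sketch shows one possible cleanup, assuming the timestamps follow the month/day/year pattern seen in the first rows. The helper `tidy_types` and the sample frame are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def tidy_types(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with parsed timestamps and nullable-integer customer IDs."""
    out = frame.copy()
    # Assumed format, e.g. "12/1/2010 8:26" (month/day/year hour:minute)
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"], format="%m/%d/%Y %H:%M")
    # "Int64" (capital I) is pandas' nullable integer dtype; missing IDs become <NA>
    out["CustomerID"] = out["CustomerID"].astype("Int64")
    return out

# Small hypothetical sample mirroring the column layout
sample = pd.DataFrame({
    "InvoiceDate": ["12/1/2010 8:26", "12/9/2011 12:50"],
    "CustomerID": [17850.0, np.nan],
    "Quantity": [6, -2],
})

clean = tidy_types(sample)
print(clean.dtypes)
print("Rows with negative Quantity:", int((clean["Quantity"] < 0).sum()))
```

Counting the negative rows before filtering them is a cautious first step, since such rows may be cancellations or adjustments rather than data errors.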


In [4]:
# Check for missing values
print("Missing Values:")
print("=" * 30)
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percentage': missing_percentage
})
missing_df = missing_df[missing_df['Missing Count'] > 0]

if len(missing_df) > 0:
    print(missing_df)
else:
    print("No missing values found!")

Missing Values:
             Missing Count  Missing Percentage
Description           1454            0.268311
CustomerID          135080           24.926694
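
How to handle these gaps depends on the analysis. Below is a sketch of one option, assuming customer-level work (such as the clustering task) needs a known `CustomerID`, while a missing `Description` can be replaced by a placeholder. The helper name, the `"UNKNOWN"` placeholder, and the sample frame are illustrative choices.

```python
import numpy as np
import pandas as pd

def prepare_for_customer_analysis(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a CustomerID and fill missing descriptions."""
    # Rows without a customer cannot be attributed to anyone, so drop them
    out = frame.dropna(subset=["CustomerID"]).copy()
    # Keep the remaining rows and mark absent product names explicitly
    out["Description"] = out["Description"].fillna("UNKNOWN")
    return out

# Hypothetical sample with one missing value in each column
sample = pd.DataFrame({
    "Description": ["WHITE METAL LANTERN", None, "CREAM CUPID HEARTS COAT HANGER"],
    "CustomerID": [17850.0, 17850.0, np.nan],
})

prepared = prepare_for_customer_analysis(sample)
print(f"Kept {len(prepared)} of {len(sample)} rows")
```

Dropping about a quarter of the rows is a significant loss, so for analyses that do not depend on customer identity (for example, product-level sales), keeping all rows may be the better choice.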
