# Online Retail Dataset

This dataset from the UCI Machine Learning Repository contains 541,909 transactions from a UK-based online retailer between December 2010 and December 2011, including invoice numbers, product details, prices, and customer data. It meets our assignment requirements: its size is appropriate (between 1,000 and 1,000,000 rows) and it has numeric columns (UnitPrice, Quantity), which makes it well suited to analyzing price distributions and transaction patterns. The columns are:

- `InvoiceNo`: Invoice number identifying each transaction (one invoice can span several rows, one per product)
- `StockCode`: Product code
- `Description`: Product description
- `Quantity`: The quantity of each product per transaction
- `InvoiceDate`: The day and time when the transaction was generated
- `UnitPrice`: Unit price in sterling
- `CustomerID`: A unique identifier for each customer
- `Country`: The country where the customer resides

## Dataset Source
- **Download Link**: [UCI Machine Learning Repository - Online Retail Data Set](https://archive.ics.uci.edu/ml/datasets/Online+Retail)

The raw data is an xlsx file; I converted it to a csv file using Excel.


In [None]:
import plotly.io as pio

pio.renderers.default = "vscode+jupyterlab+notebook_connected"


import pandas as pd
from datetime import datetime

## Data Calculation



### Calculate the stats using pandas:

In this section, we calculate the mean, median, and mode of `UnitPrice` using pandas to get a first picture of how prices are distributed.

The question we want to answer: at which prices do people buy most often?


In [30]:
# read the data
df = pd.read_csv('Online Retail.csv')

# keep only rows with a positive unit price (drops zero and negative prices)
df = df[df['UnitPrice'] > 0]

df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010/12/1 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010/12/1 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010/12/1 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010/12/1 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010/12/1 8:26,3.39,17850.0,United Kingdom


In [31]:
pandas_stats = {
    'mean': df['UnitPrice'].mean(),
    'median': df['UnitPrice'].median(),
    'mode': df['UnitPrice'].mode().iloc[0]
}

print("Pandas analysis result:")
print(f"Average price: £{pandas_stats['mean']:.2f}")
print(f"Median price: £{pandas_stats['median']:.2f}")
print(f"Mode price: £{pandas_stats['mode']:.2f}")

Pandas analysis result:
Average price: £4.67
Median price: £2.08
Mode price: £1.25
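
The mean (£4.67) is more than twice the median (£2.08), which suggests that a small number of expensive rows pull the average up. The following is a minimal sketch of that effect; the prices in it are made up for illustration and are not taken from the dataset.

```python
import statistics

# Hypothetical prices (not from the dataset): mostly cheap items.
typical = [1.25, 1.65, 2.08, 2.55, 2.95]
# The same list with a single expensive item added.
with_outlier = typical + [100.00]

mean_typical = statistics.mean(typical)
median_typical = statistics.median(typical)
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

# The outlier moves the mean a lot and the median only slightly.
print(f"Without outlier: mean £{mean_typical:.2f}, median £{median_typical:.2f}")
print(f"With outlier:    mean £{mean_outlier:.2f}, median £{median_outlier:.2f}")
```

One extreme value is enough to separate the two measures, which is why the median is usually the better summary of a "typical" price here.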


### Calculate the stats using standard library

We solve the same problem using only the Python standard library.


In [32]:
import csv

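# Read the CSV file and collect every positive unit price
prices = []
# newline='' is the setting the csv module documentation recommends for file objects
with open('Online Retail.csv', 'r', newline='') as file:
    csv_reader = csv.reader(file)
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        try:
            unit_price = float(row[5])  # UnitPrice is the sixth column
            if unit_price > 0:
                prices.append(unit_price)
        except (ValueError, IndexError):
            # skip rows with a missing or non-numeric price
            continue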

print(f"Total number of valid prices loaded: {len(prices)}")

Total number of valid prices loaded: 539392


In [33]:

# calculate the mean
total = 0
count = 0
for price in prices:
    total += price
    count += 1
mean_price = total / count



# calculate the median
prices_for_median = prices.copy()
prices_for_median.sort()

# find median
mid = len(prices_for_median) // 2
if len(prices_for_median) % 2 == 0:
    median_price = (prices_for_median[mid - 1] + prices_for_median[mid]) / 2
else:
    median_price = prices_for_median[mid]


# calculate the mode
price_counts = {}
max_count = 0
mode_price = None
for price in prices:
    if price not in price_counts:
        price_counts[price] = 1
    else:
        price_counts[price] += 1
    
    if price_counts[price] > max_count:
        max_count = price_counts[price]
        mode_price = price
        

print("\nPython standard library analysis result:")
print(f"Average price: £{mean_price:.2f}")
print(f"Median price: £{median_price:.2f}")
print(f"Mode price: £{mode_price:.2f}")


Python standard library analysis result:
Average price: £4.67
Median price: £2.08
Mode price: £1.25
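
As a cross-check on the hand-written loops, the standard library's `statistics` module computes the same three measures. The sketch below is not part of the original analysis; `sample_prices` is a hypothetical stand-in for the `prices` list loaded from the CSV.

```python
import statistics

# Hypothetical sample standing in for the `prices` list built from the CSV.
sample_prices = [1.25, 1.25, 2.08, 2.55, 3.39, 3.39, 1.25, 12.75]

mean_check = statistics.mean(sample_prices)
median_check = statistics.median(sample_prices)
mode_check = statistics.mode(sample_prices)
# multimode lists every value tied for the highest count,
# which is a quick way to confirm that the mode is unique.
tied_modes = statistics.multimode(sample_prices)

print(f"Average price: £{mean_check:.2f}")
print(f"Median price: £{median_check:.2f}")
print(f"Mode price: £{mode_check:.2f}")
print(f"Values tied for mode: {tied_modes}")
```

Passing the real `prices` list instead of the sample should reproduce the figures above if the manual calculation is correct.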


## Data Visualization and Analysis

We visualize the distribution of the prices.

In [38]:
def create_price_distribution_viz(prices, max_prics=-1, bins=10):
    """Print a text histogram of prices in [0, max_price), split into equal-width bins."""
    # -1 means "no cap": use the largest price in the data as the upper bound
    if max_prics == -1:
        max_price = max(prices)
    else:
        max_price = max_prics
    
    bin_size = max_price / bins
    
    # count the number of items in each interval;
    # prices >= max_price are skipped, so the maximum itself is not counted
    price_ranges = {}
    for price in prices:
        if price < max_price:
            bin_index = int(price / bin_size)
            price_ranges[bin_index] = price_ranges.get(bin_index, 0) + 1
    
    # find the max count to determine the scale factor
    max_count = max(price_ranges.values())
    scale_factor = 30 / max_count # the fullest bin is drawn with 30 stars
    
    print("\nPrice distribution chart:")
    for bin_num in range(bins):
        count = price_ranges.get(bin_num, 0)
        stars = int(count * scale_factor)
        price_range = f"£{bin_num*bin_size:.2f}-£{(bin_num+1)*bin_size:.2f}"
        print(f"{price_range:20} | {'*' * stars} ({count})")

create_price_distribution_viz(prices)



Price distribution chart:
£0.00-£3897.00       | ****************************** (539349)
£3897.00-£7794.00    |  (30)
£7794.00-£11691.00   |  (5)
£11691.00-£15588.00  |  (4)
£15588.00-£19485.00  |  (3)
£19485.00-£23382.00  |  (0)
£23382.00-£27279.00  |  (0)
£27279.00-£31176.00  |  (0)
£31176.00-£35073.00  |  (0)
£35073.00-£38970.00  |  (0)


The chart shows that the vast majority (99.99%) of rows have a unit price in the £0-£3897 range, so ten equal-width bins over the full price range reveal almost nothing about typical prices.
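
The 99.99% figure follows directly from the counts in the chart: 539,349 of the 539,392 valid rows fall in the first bin. The sketch below reproduces that arithmetic and adds a small helper, `share_below`, which is a hypothetical name introduced here for illustration and could be applied to the `prices` list with any cutoff.

```python
def share_below(prices, cutoff):
    """Return the fraction of prices strictly below cutoff."""
    if not prices:
        raise ValueError("prices must not be empty")
    below = sum(1 for price in prices if price < cutoff)
    return below / len(prices)


# Counts taken from the chart above: first bin vs. all valid rows.
first_bin_share = 539349 / 539392
print(f"{first_bin_share:.2%}")  # prints 99.99%

# Tiny made-up list to show the helper: two of four values are below 3.
print(share_below([1, 2, 3, 4], 3))  # prints 0.5
```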

In [39]:
create_price_distribution_viz(prices, max_prics=10)



Price distribution chart:
£0.00-£1.00          | *********************** (112153)
£1.00-£2.00          | ****************************** (140763)
£2.00-£3.00          | ********************* (102589)
£3.00-£4.00          | ******** (39613)
£4.00-£5.00          | *********** (56265)
£5.00-£6.00          | **** (19256)
£6.00-£7.00          | * (7160)
£7.00-£8.00          | ** (13062)
£8.00-£9.00          | *** (16587)
£9.00-£10.00         | * (6906)


Because of the long tail shown in the first chart, the price groups that matter could not be compared in detail, so we plot the £0-£10 range separately.

Most rows have a unit price between £0 and £3 (about 66% of all valid rows), and the £1-£2 range is the most common.