## Dataset Overview

This project uses multiple datasets including online retail transactions and
external holiday data. These datasets provide information on customer purchases,
product details, pricing, and temporal effects.

In [1]:
import pandas as pd

## Data Sources


In [2]:

online_retail = pd.read_csv("../data/raw/online_retail.csv")
public_holidays = pd.read_csv("../data/external/publicHolidays.csv")

## Shape and Size

In [20]:
print(online_retail.shape)
print(public_holidays.shape)

(541909, 8)
(69530, 7)


The online retail dataset contains 541909 rows and 8 columns representing individual
transaction records.

The public holidays dataset contains 69530 rows and 7 columns, capturing holiday
names, dates, and country or region information, which are later used to
incorporate seasonal and holiday effects into sales analysis.
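One possible way to attach holiday information to transactions is sketched below, assuming a match on country name and calendar date. The toy rows, the `is_holiday` column, and the choice of join keys are illustrative assumptions, not necessarily the approach used later in this project.

```python
import pandas as pd

# Toy transactions (made-up rows using the real column names).
sales = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2010-12-24 10:00", "2010-12-25 11:30", "2010-12-27 09:15"]
    ),
    "Country": ["United Kingdom"] * 3,
})

# Toy holiday table (made-up row using the real column names).
holidays = pd.DataFrame({
    "countryOrRegion": ["United Kingdom"],
    "date": pd.to_datetime(["2010-12-25"]),
})

# Strip the time of day so the timestamp can be matched against a calendar date.
sales["date"] = sales["InvoiceDate"].dt.normalize()

# One row per (country, date) so the left join cannot multiply sales rows.
unique_holidays = holidays.drop_duplicates(subset=["countryOrRegion", "date"])

flagged = sales.merge(
    unique_holidays,
    left_on=["Country", "date"],
    right_on=["countryOrRegion", "date"],
    how="left",
    indicator=True,
)

# "_merge" is "both" only where a holiday row matched.
flagged["is_holiday"] = flagged["_merge"] == "both"
print(flagged[["InvoiceDate", "Country", "is_holiday"]])
```

A join like this only works where the country names in both files are spelled the same way, which would need to be checked on the real data.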



## Column-Level Understanding

In [21]:
print(online_retail.columns)
print(online_retail.info())

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
None


### Online Retail Dataset Columns

- InvoiceNo: Unique invoice identifier
- StockCode: Product identifier
- Description: Product description
- Quantity: Number of units purchased
- InvoiceDate: Transaction timestamp
- UnitPrice: Price per unit
- CustomerID: Unique customer identifier
- Country: Country of purchase

In [23]:
print(public_holidays.columns)
print(public_holidays.info())

Index(['Unnamed: 0', 'countryOrRegion', 'holidayName', 'normalizeHolidayName',
       'isPaidTimeOff', 'countryRegionCode', 'date'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69530 entries, 0 to 69529
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Unnamed: 0            69530 non-null  int64 
 1   countryOrRegion       69530 non-null  object
 2   holidayName           69530 non-null  object
 3   normalizeHolidayName  69530 non-null  object
 4   isPaidTimeOff         3933 non-null   object
 5   countryRegionCode     64532 non-null  object
 6   date                  69530 non-null  object
dtypes: int64(1), object(6)
memory usage: 3.7+ MB
None


### Public Holidays Dataset Columns

- Unnamed: 0: Index column generated during data export; does not represent
  any business information.

- countryOrRegion: Country or region where the public holiday is observed.

- holidayName: Name of the public holiday.

- normalizeHolidayName: Standardized version of the holiday name used to
  maintain consistency across records.

- isPaidTimeOff: Indicates whether the holiday is officially recognized as a
  paid time-off day in the corresponding country or region.

- countryRegionCode: Short code representing the country or region.

- date: Calendar date on which the public holiday occurs.
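Because `Unnamed: 0` is only a leftover index, it can be absorbed at load time or dropped afterwards. The sketch below uses a small in-memory CSV with made-up rows in place of the real file; `holidays_indexed` and `holidays_dropped` are hypothetical names.

```python
import io
import pandas as pd

# Toy CSV mimicking an export that wrote the DataFrame index
# as an unnamed first column (made-up rows).
csv_text = """,countryOrRegion,holidayName,date
0,Norway,Christmas Day,2010-12-25
1,Sweden,Christmas Day,2010-12-25
"""

# Option 1: treat the unnamed first column as the index while reading.
holidays_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: read normally, then drop the leftover column.
holidays_dropped = pd.read_csv(io.StringIO(csv_text)).drop(columns=["Unnamed: 0"])

print(list(holidays_indexed.columns))
print(list(holidays_dropped.columns))
```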


## Data Types

In [24]:
print(online_retail.dtypes)


InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object


### Online Retail Column Data Types

**Numeric Columns**

- Quantity: Number of units purchased (integer)
- UnitPrice: Price per unit (float)

**Categorical Columns**

- InvoiceNo: Unique invoice identifier
- StockCode: Product identifier
- Description: Product description
- CustomerID: Unique customer identifier (stored as float64, most likely because of the missing values, but treated as categorical)
- Country: Country of purchase

**Datetime Columns**

- InvoiceDate: Transaction timestamp (loaded as object, so it needs converting to a datetime type before any time-based analysis)
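The loaded dtypes do not yet match this intended classification. A minimal sketch of the conversion on toy rows follows; the date string format shown is an assumption and should be checked against the real file.

```python
import pandas as pd

# Toy rows; the third date is deliberately invalid.
sample = pd.DataFrame({
    "InvoiceDate": ["2010-12-01 08:26:00", "2010-12-01 08:28:00", "not a date"],
    "CustomerID": [17850.0, None, 13047.0],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
sample["InvoiceDate"] = pd.to_datetime(sample["InvoiceDate"], errors="coerce")

# The nullable Int64 dtype removes the ".0" while keeping missing IDs as <NA>.
sample["CustomerID"] = sample["CustomerID"].astype("Int64")

print(sample.dtypes)
print(sample)
```

Counting the `NaT` values after conversion is a quick way to see how many timestamps failed to parse.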

In [25]:
print(public_holidays.dtypes)

Unnamed: 0               int64
countryOrRegion         object
holidayName             object
normalizeHolidayName    object
isPaidTimeOff           object
countryRegionCode       object
date                    object
dtype: object


### Public Holidays Column Data Types

**Numeric Columns**  
- Unnamed: 0: Index column (integer)

**Categorical Columns**  
- countryOrRegion: Country or region  
- holidayName: Name of the holiday  
- normalizeHolidayName: Standardized holiday name  
- isPaidTimeOff: Paid-time-off flag (Yes/No)  
- countryRegionCode: Country code  

**Datetime Columns**  
- date: Date of the holiday (loaded as object, so it needs converting to a datetime type)
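To make the dtypes reflect this grouping, the repeated text columns could be cast to `category` and `date` parsed as a datetime. This is a sketch on made-up rows, not a required step; `holidays_sample` is a hypothetical name.

```python
import pandas as pd

# Toy holiday rows (made-up values, real column names).
holidays_sample = pd.DataFrame({
    "countryOrRegion": ["Norway", "Norway", "Sweden", "Sweden"],
    "normalizeHolidayName": [
        "Christmas Day", "New Year's Day", "Christmas Day", "New Year's Day",
    ],
    "date": ["2010-12-25", "2011-01-01", "2010-12-25", "2011-01-01"],
})

# Parse the date strings into a datetime column.
holidays_sample["date"] = pd.to_datetime(holidays_sample["date"])

# Columns with few distinct values can be stored as categories.
for col in ["countryOrRegion", "normalizeHolidayName"]:
    holidays_sample[col] = holidays_sample[col].astype("category")

print(holidays_sample.dtypes)
```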


## Missing Values

In [13]:
online_retail.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

CustomerID is missing in 135,080 rows (about 24.9% of the data), which most likely correspond to anonymous transactions with no registered customer.

Description is missing in 1,454 rows (about 0.3%). These rows have a StockCode but no textual description.
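Raw counts are easier to judge next to the share of rows they represent. The sketch below builds such a summary on toy rows; the `missing_count` and `missing_pct` column names are illustrative.

```python
import pandas as pd

# Toy rows with gaps in Description and CustomerID (made-up values).
retail_sample = pd.DataFrame({
    "Description": ["WHITE METAL LANTERN", None, "GLASS STAR", "RED MUG"],
    "CustomerID": [17850.0, None, None, 13047.0],
    "Quantity": [6, 1, 8, 2],
})

# isnull().mean() gives the fraction of missing values per column.
missing = pd.DataFrame({
    "missing_count": retail_sample.isnull().sum(),
    "missing_pct": (retail_sample.isnull().mean() * 100).round(2),
})
print(missing)
```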

In [14]:
public_holidays.isnull().sum()

Unnamed: 0                  0
countryOrRegion             0
holidayName                 0
normalizeHolidayName        0
isPaidTimeOff           65597
countryRegionCode        4998
date                        0
dtype: int64

isPaidTimeOff is missing in 65,597 of 69,530 rows (about 94%), meaning no paid time-off information is available for those holidays or countries.

countryRegionCode is missing in 4,998 rows (about 7%). In these rows the country or region name is recorded but the code is not, which points to incomplete metadata.
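To see which countries or regions lack a code, the rows with a missing `countryRegionCode` can be grouped by name. The rows below are made up for illustration and say nothing about which regions are affected in the real file.

```python
import pandas as pd

# Toy rows: "Region A" has a name but no code (hypothetical values).
holidays_sample = pd.DataFrame({
    "countryOrRegion": ["Region A", "Region A", "Norway", "Sweden"],
    "countryRegionCode": [None, None, "NO", "SE"],
})

# Select rows without a code and count them per country or region.
missing_code = holidays_sample["countryRegionCode"].isnull()
regions_without_code = (
    holidays_sample.loc[missing_code, "countryOrRegion"].value_counts()
)
print(regions_without_code)
```

If only a handful of names appear in the result, a small manual mapping from name to code may be enough to fill the gap.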

## Duplicate Record Check

In [15]:
online_retail.duplicated().sum()

np.int64(5268)

In [16]:
public_holidays.duplicated().sum()

np.int64(0)
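The online retail data has 5,268 fully duplicated rows, while the public holidays data has none. Before deciding what to do with them it helps to look at the duplicates themselves. The sketch below works on toy rows; whether dropping is appropriate for the real data is a judgment call, since an exact repeat could also be a legitimate repeated line item.

```python
import pandas as pd

# Toy rows: the first two are exact duplicates (made-up values).
retail_sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366"],
    "StockCode": ["85123A", "85123A", "71053", "22633"],
    "Quantity": [6, 6, 6, 6],
})

# keep=False marks every member of a duplicate group, not just the repeats.
dupes = retail_sample[retail_sample.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group.
deduped = retail_sample.drop_duplicates().reset_index(drop=True)

print(len(dupes), len(deduped))
```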

## Basic Value Sanity Checks

In [17]:
online_retail.describe()

| Statistic | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 541909.0 | 541909.0 | 406829.0 |
| mean | 9.55225 | 4.611114 | 15287.69057 |
| std | 218.081158 | 96.759853 | 1713.600303 |
| min | -80995.0 | -11062.06 | 12346.0 |
| 25% | 1.0 | 1.25 | 13953.0 |
| 50% | 3.0 | 2.08 | 15152.0 |
| 75% | 10.0 | 4.13 | 16791.0 |
| max | 80995.0 | 38970.0 | 18287.0 |
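The summary shows negative minimums for both Quantity (-80995) and UnitPrice (-11062.06), which ordinary sales should not produce. A minimal sketch for isolating such rows follows; the toy values and the rule of treating any non-positive quantity or price as suspect are assumptions, and the real rows should be inspected before anything is removed.

```python
import pandas as pd

# Toy rows covering the cases of interest (made-up values).
retail_sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536379", "536400", "563185"],
    "Quantity": [6, -1, 3, 1],
    "UnitPrice": [2.55, 27.50, 0.0, -11062.06],
})

# Flag rows where either value is zero or negative.
non_positive_qty = retail_sample["Quantity"] <= 0
non_positive_price = retail_sample["UnitPrice"] <= 0
suspect_mask = non_positive_qty | non_positive_price

suspect = retail_sample[suspect_mask]
clean = retail_sample[~suspect_mask]

print(len(suspect), len(clean))
```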


In [18]:
public_holidays.describe()

| Statistic | Unnamed: 0 |
|---|---|
| count | 69530.0 |
| mean | 34764.5 |
| std | 20071.726445 |
| min | 0.0 |
| 25% | 17382.25 |
| 50% | 34764.5 |
| 75% | 52146.75 |
| max | 69529.0 |
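By default `describe()` summarizes only numeric columns, so for the holidays data it reports nothing but the leftover index. A sketch of a more informative check on toy rows is shown below; `text_summary` is a hypothetical name.

```python
import pandas as pd

# Toy holiday rows (made-up values, real column names).
holidays_sample = pd.DataFrame({
    "countryOrRegion": ["Norway", "Norway", "Sweden"],
    "holidayName": ["Christmas Day", "New Year's Day", "Christmas Day"],
    "date": ["2010-12-25", "2011-01-01", "2010-12-25"],
})

# Excluding numeric columns yields count, unique, top and freq for text columns.
text_summary = holidays_sample.describe(exclude="number")
print(text_summary)

# The date range shows which period the holiday table covers.
dates = pd.to_datetime(holidays_sample["date"])
print(dates.min(), dates.max())
```

Comparing the holiday date range with the range of InvoiceDate would show how much of the holiday table is relevant to the sales period.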
