# Analysis of an online retailer

This analysis uses the [Online Retail dataset](https://www.kaggle.com/datasets/tunguz/online-retail), which contains sales transactions. According to the dataset information:

*This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011*
*for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts.*
*Many customers of the company are wholesalers.*

This analysis consists of 5 major steps:

- A first exploration of the data.
- Cleaning the data.
- A deeper exploration of the data and gaining insights.
- Data analysis and visualisation.
- Drawing conclusions.

Load the dataset

In [None]:
import os
import pandas as pd

# Get the current working directory (assumed to be the notebook's folder)
os_path = os.getcwd()
# Build the path to the dataset file in a platform-independent way
dataset_path = os.path.join(os_path, 'datasets', 'Online_Retail.csv')
dataset_retail = pd.read_csv(dataset_path, encoding='ISO-8859-1')
dataset_retail

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/10 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/10 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/10 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/10 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/10 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/11 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/11 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/11 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/11 12:50,4.15,12680.0,France


Get information about the number of columns and the associated datatypes.

In [2]:
# Get the columns and the datatypes
dataset_retail.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
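The summary above lists **InvoiceDate** as `object`, meaning the timestamps are stored as plain strings. A minimal sketch of how they could be parsed is shown below. It assumes the month-first format suggested by the sample rows (`12/1/10 8:26`) and by the date range quoted in the dataset information, and it uses a small toy series rather than the real column.

```python
import pandas as pd

# Toy sample copied from the rows displayed above (not the full column)
invoice_dates = pd.Series(["12/1/10 8:26", "12/9/11 12:50"])

# Assumed format: month/day/two-digit year, then hour:minute
parsed = pd.to_datetime(invoice_dates, format="%m/%d/%y %H:%M")
print(parsed)
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first dates.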


## Data exploration and cleaning

### Data exploration

As a first step, identify null or empty values in the dataset.

In [3]:
# Find the total number of missing values per column
dataset_retail.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
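Raw counts are easier to judge as a share of all rows. From the figures above, roughly 24.9% of rows (135080 of 541909) lack a **CustomerID**, while about 0.27% lack a **Description**. The sketch below shows one way to compute such percentages. The frame `toy` is a hypothetical stand-in with illustrative values, not data from the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dataset_retail (illustrative values only)
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "NAPKINS", "APRON"],
    "CustomerID": [17850.0, np.nan, np.nan, 12680.0],
    "Country": ["United Kingdom", "United Kingdom", "France", "France"],
})

# isnull().mean() gives the fraction of missing values per column;
# multiplying by 100 turns it into a percentage
missing_share = toy.isnull().mean().mul(100).round(1)
print(missing_share)
```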

It is also important to know how many duplicated rows exist in the dataset.

In [4]:
dataset_retail.duplicated().sum()

np.int64(5268)
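With its default `keep='first'`, `duplicated()` leaves the first occurrence of each repeated row unflagged, so the figure above counts the extra copies rather than every row involved in a repetition. The sketch below illustrates the difference on a hypothetical toy frame.

```python
import pandas as pd

# Toy frame: the first three rows are identical (illustrative values only)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366"],
    "StockCode": ["71053", "71053", "71053", "22633"],
    "Quantity": [6, 6, 6, 2],
})

# Default keep='first': the first occurrence is not flagged
extra_copies = int(toy.duplicated().sum())
# keep=False flags every row that belongs to a duplicated group
rows_in_dup_groups = int(toy.duplicated(keep=False).sum())
print(extra_copies, rows_in_dup_groups)
```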

Find the unique values in the **CustomerID** and **Country** columns.

In [11]:
# Collect the unique values of the two columns of interest
unique_values = {col: dataset_retail[col].unique() for col in dataset_retail.columns if col in ('Country', 'CustomerID')}

# Print the unique values
for col, vals in unique_values.items():
    print(f'Column name: {col}')
    print(f'Number of unique values: {len(vals)}')
    print(f'First 20 unique values: {vals[:20]}')
    print(50*'=') # separator

Column name: CustomerID
Number of unique values: 4373
First 20 unique values: [17850. 13047. 12583. 13748. 15100. 15291. 14688. 17809. 15311. 14527.
 16098. 18074. 17420. 16029. 16250. 12431. 17511. 17548. 13705. 13747.]
Column name: Country
Number of unique values: 38
First 20 unique values: ['United Kingdom' 'France' 'Australia' 'Netherlands' 'Germany' 'Norway'
 'EIRE' 'Switzerland' 'Spain' 'Poland' 'Portugal' 'Italy' 'Belgium'
 'Lithuania' 'Japan' 'Iceland' 'Channel Islands' 'Denmark' 'Cyprus'
 'Sweden']
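Note that `unique()` treats `NaN` as a value in its own right, whereas `nunique()` ignores it by default. Because **CustomerID** still contains missing values at this stage, the count reported for it above includes the missing marker. A minimal sketch on a hypothetical toy series:

```python
import numpy as np
import pandas as pd

# Toy series with a repeated ID and two missing values (illustrative only)
customer_ids = pd.Series([17850.0, 13047.0, np.nan, 17850.0, np.nan])

with_nan = len(customer_ids.unique())   # NaN counted once as a value
without_nan = customer_ids.nunique()    # NaN excluded by default
print(with_nan, without_nan)
```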


### Data cleaning

The first cleaning step is to remove rows with empty values in the key columns.
- The **Description** column can contain empty values without affecting the analysis.
- The **CustomerID** column is important to keep track of all transactions and their details.

So, for this analysis, only the rows with a missing **CustomerID** will be removed.

In [12]:
# Remove the rows with an empty value in the column 'CustomerID'
dataset_retail_clean = dataset_retail.dropna(subset=['CustomerID'])
# Count the total number of empty cells per column
dataset_retail_clean.isnull().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

As can be seen, the rows that were removed also included all rows with an empty **Description**.
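This overlap can also be checked directly, before dropping anything, by testing whether any row lacks a **Description** while still having a **CustomerID**. The sketch below does this on a hypothetical toy frame. The variable names are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: every missing Description coincides with a missing CustomerID
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "NAPKINS", None],
    "CustomerID": [17850.0, np.nan, np.nan, np.nan],
})

missing_description = toy["Description"].isnull()
missing_customer = toy["CustomerID"].isnull()

# True when no row lacks a Description while still having a CustomerID
description_gaps_covered = bool((missing_description & ~missing_customer).sum() == 0)

# Dropping on CustomerID alone then leaves no empty Description behind
toy_clean = toy.dropna(subset=["CustomerID"])
remaining = int(toy_clean["Description"].isnull().sum())
print(description_gaps_covered, remaining)
```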

The next step is to remove the duplicated rows identified earlier.

In [13]:
# remove duplicates
dataset_retail_clean = dataset_retail_clean.drop_duplicates()
# Count the total duplicated rows after the operation
dataset_retail_clean.duplicated().sum()

np.int64(0)
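A quick sanity check after this step is to compare row counts before and after, which shows how many rows were actually dropped. The sketch below uses a hypothetical toy frame. Resetting the index afterwards is optional and is shown here as one possible choice, not as something the notebook does.

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative values only)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536366"],
    "StockCode": ["71053", "71053", "22633", "22632"],
    "Quantity": [6, 6, 2, 2],
})

rows_before = len(toy)
# drop_duplicates keeps the first occurrence of each repeated row;
# reset_index(drop=True) closes the gaps left in the index
toy_dedup = toy.drop_duplicates().reset_index(drop=True)
rows_removed = rows_before - len(toy_dedup)
print(rows_removed)
```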