# Exploratory Data Analysis

After cleaning the data, the next step is to explore it in depth and identify any trends or patterns it contains.

## Import Packages

In [9]:
# Module containing all dependencies used
import src.dependencies as dep

# Module containing custom functions
import src.functions as fn

## Load the dataset

The dataset is the transformed version of the Online Retail II dataset from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/502/online+retail+ii).

In [10]:
# Load data
df = dep.pd.read_csv('dataset/Transformed.csv')

# Confirm successful loading
df.head()

| | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085 | United Kingdom |


## Univariate Analysis

Each variable is examined individually to draw insights from the data.

### Invoice

**Guiding Question:** _How many invoices were raised in the given time period?_

To answer this question, count all unique invoice numbers.

In [11]:
# Count invoices
invoices = df['Invoice'].nunique()  # number of distinct invoice numbers (missing values are ignored)
print('There were {:,} invoices generated in the given time period'.format(invoices))

There were 44,876 invoices generated in the given time period
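
The unique-count step can be illustrated on a small, hypothetical frame (the invoice numbers below are made up and do not come from the dataset). This is only a sketch of how pandas treats duplicates and missing values when counting distinct entries: `Series.nunique()` and `len(Series.value_counts())` both ignore missing values by default.

```python
import pandas as pd

# Hypothetical toy data: three invoice lines share one invoice number,
# and one line has no invoice number at all.
toy = pd.DataFrame({'Invoice': ['A1', 'A1', 'A1', 'B2', None]})

# Distinct non-missing values
n_unique = toy['Invoice'].nunique()

# Same count via value_counts, which also drops missing values by default
n_via_counts = len(toy['Invoice'].value_counts())

# With dropna=False the missing value is counted as its own category
n_with_missing = toy['Invoice'].nunique(dropna=False)

print(n_unique, n_via_counts, n_with_missing)
```

If a column may contain missing identifiers, comparing the two `nunique` results is a quick way to notice it.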


### Products

**Guiding Question:** _How many unique items were sold in the given time period?_

Count all the unique products using the `StockCode` as the identifier.

In [12]:
# Count products
products = df['StockCode'].nunique()  # number of distinct stock codes
print('A total of {:,} unique products were sold'.format(products))

A total of 4,646 unique products were sold


**Guiding Question:** _What was the volume of sales in the time period under study?_

The sales volume refers to the total number of product units sold in the given time period.

In [13]:
# Sales volume
sales_volume = df['Quantity'].sum()
print('The volume of sales in the given time period is {:,}'.format(sales_volume))

The volume of sales in the given time period is 10,055,729


### Time Period

**Guiding Question:** _What is the time frame of the data under study?_

Get the date range from the `InvoiceDate` column.

In [14]:
# Ensure the invoice date column has a datetime dtype
df['InvoiceDate'] = dep.pd.to_datetime(df['InvoiceDate'])

# Date range
date_from = str(df['InvoiceDate'].dt.date.min())
date_to = str(df['InvoiceDate'].dt.date.max())
print('The data ranges from {} to {}'.format(date_from, date_to))

The data ranges from 2009-12-01 to 2011-12-09
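
The length of a period follows from its two endpoints. A minimal sketch on hypothetical timestamps (illustrative values, not read from the dataset) shows how subtracting the earliest from the latest timestamp gives the span in days:

```python
import pandas as pd

# Hypothetical invoice timestamps as strings, as they would arrive from a CSV
stamps = pd.Series([
    '2009-12-01 07:45:00',
    '2010-06-15 12:00:00',
    '2011-12-09 12:50:00',
])
dates = pd.to_datetime(stamps)

# Subtracting two timestamps yields a Timedelta; .days gives the whole days
span = dates.max() - dates.min()
print('The data spans {} days'.format(span.days))
```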


### Price

**Guiding Question:** _How do the prices vary?_

Look at the distribution of prices.

In [15]:
# Summary statistics
df[['Price']].describe()

| | Price |
|---|---|
| count | 797885.0 |
| mean | 3.702732 |
| std | 71.392549 |
| min | 0.0 |
| 25% | 1.25 |
| 50% | 1.95 |
| 75% | 3.75 |
| max | 38970.0 |


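Three quarters of the records are priced at or below _3.75_, with a median of _1.95_. However, the maximum price is _38,970_, and the standard deviation (_71.39_) is far larger than the mean (_3.70_), which points to a heavily right-skewed distribution with a few extreme values.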
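
One common rule of thumb for flagging unusually high values is Tukey's interquartile-range fence. The sketch below applies it to a hypothetical price series (made-up values, not the dataset). It is one possible way to follow up on the summary above, not necessarily the approach used later in this analysis.

```python
import pandas as pd

# Hypothetical prices: mostly small values plus one extreme entry
prices = pd.Series([0.85, 1.25, 1.65, 1.95, 2.55, 3.75, 4.95, 1000.0])

# Quartiles and interquartile range
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as high outliers
upper_fence = q3 + 1.5 * iqr
outliers = prices[prices > upper_fence]

print('Upper fence: {:.2f}'.format(upper_fence))
print('Flagged prices:', outliers.tolist())

# A positive skewness indicates a long right tail
print('Skewness: {:.2f}'.format(prices.skew()))
```

Flagged values are candidates for inspection rather than automatic removal, since a very high price can still be a legitimate record.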

### Customers

**Guiding Question:** _How many customers did we serve in the given time period?_

Count the unique customer IDs.

In [16]:
# Customers count
customers = df['Customer ID'].nunique()  # number of distinct customer IDs
print('A total of {:,} customers were served'.format(customers))

A total of 5,942 customers were served


### Countries

**Guiding Question:** _How many countries did we reach?_

Count the number of listed countries.

In [17]:
# Countries count
countries = df['Country'].nunique()  # number of distinct countries
print('A total of {:,} countries were reached'.format(countries))

A total of 41 countries were reached
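
Beyond the count itself, the distribution of a categorical variable can be summarised as the share of records per category. A minimal sketch on hypothetical country labels (not the dataset) using `value_counts(normalize=True)`:

```python
import pandas as pd

# Hypothetical country labels for six records
country = pd.Series([
    'United Kingdom', 'United Kingdom', 'United Kingdom',
    'France', 'France', 'Germany',
])

# Share of records per country, sorted from largest to smallest
shares = country.value_counts(normalize=True)
print(shares.round(2))
```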
