In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

# **Import data**

In [2]:
df = pd.read_csv("data/online_retail_II.csv")
df

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,01/12/2009 07:45,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,01/12/2009 07:45,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,01/12/2009 07:45,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,01/12/2009 07:45,2.10,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,01/12/2009 07:45,1.25,13085.0,United Kingdom
...,...,...,...,...,...,...,...,...
525456,538171,22271,FELTCRAFT DOLL ROSIE,2,09/12/2010 20:01,2.95,17530.0,United Kingdom
525457,538171,22750,FELTCRAFT PRINCESS LOLA DOLL,1,09/12/2010 20:01,3.75,17530.0,United Kingdom
525458,538171,22751,FELTCRAFT PRINCESS OLIVIA DOLL,1,09/12/2010 20:01,3.75,17530.0,United Kingdom
525459,538171,20970,PINK FLORAL FELTCRAFT SHOULDER BAG,2,09/12/2010 20:01,3.75,17530.0,United Kingdom


In [3]:
# Check preliminary info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525461 entries, 0 to 525460
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Invoice      525461 non-null  object 
 1   StockCode    525461 non-null  object 
 2   Description  522533 non-null  object 
 3   Quantity     525461 non-null  int64  
 4   InvoiceDate  525461 non-null  object 
 5   Price        525461 non-null  float64
 6   Customer ID  417534 non-null  float64
 7   Country      525461 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 32.1+ MB


# **Data cleaning**

* ## Fixing dtypes

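Column `InvoiceDate` should be in datetime format. The raw strings are day-first (`dd/mm/yyyy hh:mm`), so that is stated explicitly to keep day and month from being swapped.

In [4]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], dayfirst=True)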
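
The preview above shows raw strings such as `01/12/2009 07:45`, which can be read as 1 December or as 12 January. Here is a minimal sketch of how the two readings differ in pandas, using one string copied from the preview. That the file is day-first is an assumption here.

```python
import pandas as pd

raw = "01/12/2009 07:45"  # string copied from the preview above

# Default parsing reads an ambiguous date as month-first
month_first = pd.to_datetime(raw)

# dayfirst=True reads the same string as day/month/year
day_first = pd.to_datetime(raw, dayfirst=True)

print(month_first)
print(day_first)
```

Checking the minimum and maximum of the parsed column against the expected date range of the data is a quick way to spot a swapped day and month.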

* ## Filtering prices and quantities

Only entries with prices greater than 0 ***and*** quantities greater than 0 should be kept:

In [5]:
rows_before = df.shape[0]
df = df.query("(Price > 0) & (Quantity > 0)")
rows_after = df.shape[0]

print(f"Number of rows before dropping: {rows_before}\n\
Number of rows after dropping: {rows_after}\n\
% of data removed: {100*(1-rows_after/rows_before):.2f}")

Number of rows before dropping: 525461
Number of rows after dropping: 511566
% of data removed: 2.64
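
To see which of the two conditions is responsible for the dropped rows, the filter can be split into separate boolean masks. This is a sketch on a small made-up frame, not the retail file, and the names `toy`, `bad_price` and `bad_quantity` are hypothetical.

```python
import pandas as pd

# Toy data (made-up values), not the retail file
toy = pd.DataFrame({
    "Price":    [6.95, 0.0, 2.10, 1.25, -3.00],
    "Quantity": [12,   5,   -4,   24,   1],
})

bad_price = toy["Price"] <= 0
bad_quantity = toy["Quantity"] <= 0

# Rows failing each condition, and rows failing at least one of them
summary = pd.Series({
    "price <= 0": bad_price.sum(),
    "quantity <= 0": bad_quantity.sum(),
    "either": (bad_price | bad_quantity).sum(),
})
print(summary)

# Keeps the same rows as query("(Price > 0) & (Quantity > 0)")
# as long as neither column has missing values
kept = toy[~(bad_price | bad_quantity)]
```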


* ## Handling NAs

In [6]:
# Count NAs in each column
df.isna().sum()

Invoice             0
StockCode           0
Description         0
Quantity            0
InvoiceDate         0
Price               0
Customer ID    103902
Country             0
dtype: int64

In [7]:
# Inspect the rows where 'Customer ID' is missing
df[df['Customer ID'].isna()]

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
577,489525,85226C,BLUE PULL BACK RACING CAR,1,2009-12-01 11:49:00,0.55,,United Kingdom
578,489525,85227,SET/6 3D KIT CARDS FOR KIDS,1,2009-12-01 11:49:00,0.85,,United Kingdom
1055,489548,22271,FELTCRAFT DOLL ROSIE,1,2009-12-01 12:32:00,2.95,,United Kingdom
1056,489548,22254,FELT TOADSTOOL LARGE,12,2009-12-01 12:32:00,1.25,,United Kingdom
1057,489548,22273,FELTCRAFT DOLL MOLLY,3,2009-12-01 12:32:00,2.95,,United Kingdom
...,...,...,...,...,...,...,...,...
525143,538154,82599,FANNY'S REST STOPMETAL SIGN,1,2010-12-09 16:35:00,4.21,,United Kingdom
525144,538154,84029E,RED WOOLLY HOTTIE WHITE HEART.,5,2010-12-09 16:35:00,8.47,,United Kingdom
525145,538154,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,4,2010-12-09 16:35:00,8.47,,United Kingdom
525146,538154,85099B,JUMBO BAG RED RETROSPOT,1,2010-12-09 16:35:00,4.21,,United Kingdom


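These rows carry a valid invoice, item, price and quantity and only lack a customer ID. They make up about 20% of the remaining data (103,902 of 511,566 rows), so let's keep them for now.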
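
Keeping the rows does not rule out setting them aside for customer-level questions later. One way to do that is a boolean flag column, sketched here on a small made-up frame. The `HasCustomer` column is a hypothetical name and not something this notebook defines.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values), not the retail file
toy = pd.DataFrame({
    "Invoice": ["A1", "A1", "A2", "A3"],
    "Customer ID": [13085.0, 13085.0, np.nan, 17530.0],
})

# Share of rows without a customer ID
missing_share = toy["Customer ID"].isna().mean()
print(f"{100 * missing_share:.1f}% of rows have no customer ID")

# Hypothetical flag column: keeps every row but makes the anonymous ones easy to exclude
toy["HasCustomer"] = toy["Customer ID"].notna()

# Customer-level statistic computed only on rows with a known customer
per_customer = toy[toy["HasCustomer"]].groupby("Customer ID")["Invoice"].nunique()
print(per_customer)
```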

# **Data analysis**

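* ## Which countries had the most orders?

Each row is one line of an invoice, so the count below measures invoice lines per country rather than distinct orders.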

In [8]:
df.groupby('Country')['Quantity'].count().sort_values(ascending=False)[:5]

Country
United Kingdom    473379
EIRE                9459
Germany             7654
France              5532
Netherlands         2729
Name: Quantity, dtype: int64
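
To count distinct orders rather than rows, one option is `nunique()` on the `Invoice` column. The sketch below runs on a small made-up frame, not the retail file, and puts the two measures side by side.

```python
import pandas as pd

# Toy data (made-up values), not the retail file
toy = pd.DataFrame({
    "Invoice": ["A1", "A1", "A1", "B1", "B2", "C1"],
    "Country": ["United Kingdom"] * 3 + ["EIRE"] * 2 + ["Germany"],
    "Quantity": [12, 12, 48, 2, 1, 6],
})

# Rows per country: one per invoice line
lines = toy.groupby("Country")["Quantity"].count()

# Distinct invoices per country: one per order
orders = toy.groupby("Country")["Invoice"].nunique()

comparison = pd.DataFrame({"lines": lines, "orders": orders})
print(comparison.sort_values("orders", ascending=False))
```

In this toy frame the United Kingdom leads on lines but EIRE leads on orders, which shows that the two rankings need not agree.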

* ## Which invoices had the most items?

In [9]:
df.groupby('Invoice')['Quantity'].sum().sort_values(ascending=False)[:5]

Invoice
518505    87167
524174    87167
497946    83774
501534    63974
495194    63302
Name: Quantity, dtype: int64

* ## Which invoices were the most expensive?

Pssst! For that, we need a new column (`Revenue`) with the revenue of each invoice line, since the `Price` column holds unit prices.

In [10]:
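# assign() returns a new DataFrame, avoiding a SettingWithCopyWarning on the filtered frame
df = df.assign(Revenue=df['Price'] * df['Quantity'])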

In [11]:
df.groupby('Invoice')['Revenue'].sum().sort_values(ascending=False)[:5]

Invoice
533027    49844.99
531516    45332.97
493819    44051.60
524181    33167.80
526934    26007.08
Name: Revenue, dtype: float64
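
The item and revenue rankings per invoice can also be produced from a single grouped table. This is a sketch using named aggregation and `nlargest` on a small made-up frame, not the retail file. The output column name `Items` is a hypothetical choice.

```python
import pandas as pd

# Toy data (made-up values), not the retail file
toy = pd.DataFrame({
    "Invoice": ["A1", "A1", "B1", "C1", "C1"],
    "Price": [6.95, 2.10, 1.25, 3.75, 2.95],
    "Quantity": [12, 48, 24, 1, 2],
})

# Line revenue = unit price * quantity
toy = toy.assign(Revenue=toy["Price"] * toy["Quantity"])

# One row per invoice with both totals
per_invoice = toy.groupby("Invoice").agg(
    Items=("Quantity", "sum"),
    Revenue=("Revenue", "sum"),
)

# nlargest(n, column) returns the n rows with the largest values, in descending order
top_by_revenue = per_invoice.nlargest(2, "Revenue")
print(top_by_revenue)
```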

* ## Which are the most expensive items?

In [12]:
df[['Description', 'Price']].sort_values(by='Price', ascending=False)[:5]

Unnamed: 0,Description,Price
241827,Manual,25111.09
517955,AMAZON FEE,13541.33
135015,Manual,10953.5
135013,Manual,10953.5
372834,Manual,10468.8
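
The top rows above are `Manual` and `AMAZON FEE` entries, and `Manual` appears several times because the sort works on rows rather than on distinct items. Here is a sketch of one way to rank distinct products instead, on a small made-up frame. Treating those two descriptions as non-products is an assumption, and `non_products` is a hypothetical name.

```python
import pandas as pd

# Toy data (made-up values apart from the descriptions), not the retail file
toy = pd.DataFrame({
    "Description": ["Manual", "AMAZON FEE", "Manual", "FELTCRAFT DOLL ROSIE",
                    "PINK CHERRY LIGHTS", "FELTCRAFT DOLL ROSIE"],
    "Price": [25111.09, 13541.33, 10953.50, 2.95, 6.75, 3.10],
})

# Assumption: these descriptions mark adjustments or fees, not products
non_products = {"Manual", "AMAZON FEE"}
products = toy[~toy["Description"].isin(non_products)]

# Highest unit price seen per description, so each item appears once
top_items = products.groupby("Description")["Price"].max().nlargest(5)
print(top_items)
```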
