# **SOURCE**
https://www.kaggle.com/code/mgmarques/customer-segmentation-and-market-basket-analysis/notebook
- Customer segmentation: Customer segmentation is the problem of uncovering information about a firm's customer base, based on their interactions with the business. In most cases this interaction is in terms of their purchase behavior and patterns. We explore some of the ways in which this can be used.
- Market basket analysis: Market basket analysis is a method for gaining insight into the granular behavior of customers. It helps in devising strategies that draw on a deeper understanding of the purchase decisions customers make. This is interesting because customers are often unaware of such biases or trends in their own purchasing behavior.
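
As a minimal illustration of what market basket analysis measures, the sketch below computes support and confidence for one item pair over a handful of made-up baskets. The baskets, the item names, and the helper `support` are hypothetical and are not taken from the dataset or from the original notebook.

```python
# Toy baskets (hypothetical data, not from the dataset)
baskets = [
    {"tea towel", "oven glove", "cake tin"},
    {"tea towel", "oven glove"},
    {"cake tin"},
    {"tea towel", "cake tin"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Support of the pair, and confidence of the rule "tea towel -> oven glove"
pair_support = support({"tea towel", "oven glove"}, baskets)
confidence = pair_support / support({"tea towel"}, baskets)

print(f"support={pair_support:.2f}, confidence={confidence:.2f}")
```

Support is the share of baskets that contain the whole itemset, and confidence is the share of baskets with the first item that also contain the second.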

Let's see the description of each column:
- InvoiceNo: A unique identifier for the invoice. An invoice number shared across rows means that those transactions were performed in a single invoice (multiple purchases).
- StockCode: Identifier for items contained in an invoice.
- Description: Textual description of each stock item.
- Quantity: The quantity of the item purchased.
- InvoiceDate: Date and time of purchase.
- UnitPrice: Price per unit of the item.
- CustomerID: Identifier for customer making the purchase.
- Country: Country of customer.
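
To make the column roles concrete, the following sketch builds a tiny frame with the same column names and derives a per-invoice total. The three rows are the first rows of invoice 536365 as they appear in the `df.head()` output of this notebook. The `LineTotal` column and the aggregation are illustrative assumptions, not a step from the original analysis.

```python
import pandas as pd

# Three rows of invoice 536365, as shown in the df.head() output
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365"],
    "StockCode": ["85123A", "71053", "84406B"],
    "Quantity": [6, 6, 8],
    "UnitPrice": [2.55, 3.39, 2.75],
    "CustomerID": [17850.0, 17850.0, 17850.0],
})

# LineTotal is a hypothetical helper column: the value of each invoice line
sample["LineTotal"] = sample["Quantity"] * sample["UnitPrice"]

# Rows sharing an InvoiceNo belong to the same invoice, so group on it
invoice_total = sample.groupby("InvoiceNo")["LineTotal"].sum().round(2)
print(invoice_total)
```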

# **DATA UNDERSTANDING**

In [1]:
import numpy as np
import pandas as pd
import warnings
from IPython.display import display  # explicit import so dataProfile also runs outside a notebook

warnings.filterwarnings('ignore')
pd.options.mode.chained_assignment = None

path = './db/online-retail.xlsx'
df = pd.read_excel(path)

In [2]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [3]:
line = '========================'
def dataProfile(data):
  dimension = data.shape
  dtype = data.dtypes
  countOfNull = data.isnull().sum()
  nullRatio = round(countOfNull/len(data)*100,4)
  countOfDistinct = data.nunique()
  distinctValue = data.apply(lambda x: x.unique())
  output = pd.DataFrame(list(zip(dtype, countOfNull, nullRatio, countOfDistinct, distinctValue)),
                        index=data.columns, 
                        columns=['dtype', 'count_of_null', 'null_ratio', 'count_of_distinct', 'distinct_value'])
  # output = pd.concat([dtype, countOfNull, nullRatio, countOfDistinct, distinctValue], axis=1)
  # output.rename(columns=['dtype', 'count_of_null', 'null_ratio', 'count_of_distinct', 'distinct_value'])
  print(f'Dimensions\t: {dimension}')
  print(f'Data Size\t: {round(data.memory_usage(deep=True).sum()/1000000, 2)} MB')
  print(line)
  print(f'Duplicated Data\t: {len(data[data.duplicated()])}')
  display(data[data.duplicated()])
  print(line)
  print('REVIEW')
  display(output)
  print(line)
  print('Statistical Numerics')
  display(data.describe())
  print(line)
  print('Statistical Categorics')
  display(data.describe(include=['category', 'object']))
  print(line)
  print('PREVIEW head(3)')
  display(data.head(3))
  

In [4]:
dataProfile(df)

Dimensions	: (541909, 8)
Data Size	: 141.48 MB
Duplicated Data	: 5268


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
517,536409,21866,UNION JACK FLAG LUGGAGE TAG,1,2010-12-01 11:45:00,1.25,17908.0,United Kingdom
527,536409,22866,HAND WARMER SCOTTY DOG DESIGN,1,2010-12-01 11:45:00,2.10,17908.0,United Kingdom
537,536409,22900,SET 2 TEA TOWELS I LOVE LONDON,1,2010-12-01 11:45:00,2.95,17908.0,United Kingdom
539,536409,22111,SCOTTIE DOG HOT WATER BOTTLE,1,2010-12-01 11:45:00,4.95,17908.0,United Kingdom
555,536412,22327,ROUND SNACK BOXES SET OF 4 SKULLS,1,2010-12-01 11:49:00,2.95,17920.0,United Kingdom
...,...,...,...,...,...,...,...,...
541675,581538,22068,BLACK PIRATE TREASURE CHEST,1,2011-12-09 11:34:00,0.39,14446.0,United Kingdom
541689,581538,23318,BOX OF 6 MINI VINTAGE CRACKERS,1,2011-12-09 11:34:00,2.49,14446.0,United Kingdom
541692,581538,22992,REVOLVER WOODEN RULER,1,2011-12-09 11:34:00,1.95,14446.0,United Kingdom
541699,581538,22694,WICKER STAR,1,2011-12-09 11:34:00,2.10,14446.0,United Kingdom


REVIEW


Unnamed: 0,dtype,count_of_null,null_ratio,count_of_distinct,distinct_value
InvoiceNo,object,0,0.0,25900,"[536365, 536366, 536367, 536368, 536369, 53637..."
StockCode,object,0,0.0,4070,"[85123A, 71053, 84406B, 84029G, 84029E, 22752,..."
Description,object,1454,0.2683,4223,"[WHITE HANGING HEART T-LIGHT HOLDER, WHITE MET..."
Quantity,int64,0,0.0,722,"[6, 8, 2, 32, 3, 4, 24, 12, 48, 18, 20, 36, 80..."
InvoiceDate,datetime64[ns],0,0.0,23260,"[2010-12-01T08:26:00.000000000, 2010-12-01T08:..."
UnitPrice,float64,0,0.0,1630,"[2.55, 3.39, 2.75, 7.65, 4.25, 1.85, 1.69, 2.1..."
CustomerID,float64,135080,24.9267,4372,"[17850.0, 13047.0, 12583.0, 13748.0, 15100.0, ..."
Country,object,0,0.0,38,"[United Kingdom, France, Australia, Netherland..."


Statistical Numerics


Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


Statistical Categorics


Unnamed: 0,InvoiceNo,StockCode,Description,Country
count,541909,541909,540455,541909
unique,25900,4070,4223,38
top,573585,85123A,WHITE HANGING HEART T-LIGHT HOLDER,United Kingdom
freq,1114,2313,2369,495478


PREVIEW head(3)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom


From the output above we can see that Quantity and UnitPrice contain negative values, which suggests that the data includes return transactions. Since our goal is customer segmentation and market basket analysis, these records need to be removed. Before doing so, we check whether there are records where both values are negative, or where one is negative and the other is zero.
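
The check described above can be sketched with boolean masks. The frame below is made-up toy data that only shares the two column names with the dataset, and the mask names are hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical values) covering the sign combinations of interest
toy = pd.DataFrame({
    "Quantity":  [6, -2, -1, 3, -4],
    "UnitPrice": [2.55, 1.25, 0.0, -5.0, -1.0],
})

neg_qty = toy["Quantity"] < 0
neg_price = toy["UnitPrice"] < 0

# Records where both columns are negative
both_negative = toy[neg_qty & neg_price]

# Records where one column is negative while the other is exactly zero
neg_and_zero = toy[(neg_qty & (toy["UnitPrice"] == 0)) |
                   (neg_price & (toy["Quantity"] == 0))]

print(f"both negative: {len(both_negative)}")
print(f"one negative, other zero: {len(neg_and_zero)}")
```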

# **DATA CLEANSING**

## **Drop Duplicates**

In [5]:
def dropDuplicates(df):
  print(f'Dimensions before removing duplicates: {df.shape}')
  df = df.drop_duplicates()
  print(f'Dimensions after removing duplicates: {df.shape}')
  return df

In [6]:
data = df.sort_values('CustomerID').copy()
data = dropDuplicates(data)
data

Dimensions before removing duplicates: (541909, 8)
Dimensions after removing duplicates: (536641, 8)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
61619,541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215,2011-01-18 10:01:00,1.04,12346.0,United Kingdom
61624,C541433,23166,MEDIUM CERAMIC TOP STORAGE JAR,-74215,2011-01-18 10:17:00,1.04,12346.0,United Kingdom
286628,562032,21578,WOODLAND DESIGN COTTON TOTE BAG,6,2011-08-02 08:48:00,2.25,12347.0,Iceland
72263,542237,47559B,TEA TIME OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347.0,Iceland
72264,542237,21154,RED RETROSPOT OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347.0,Iceland
...,...,...,...,...,...,...,...,...
541536,581498,85099B,JUMBO BAG RED RETROSPOT,5,2011-12-09 10:26:00,4.13,,United Kingdom
541537,581498,85099C,JUMBO BAG BAROQUE BLACK WHITE,4,2011-12-09 10:26:00,4.13,,United Kingdom
541538,581498,85150,LADIES & GENTLEMEN METAL SIGN,1,2011-12-09 10:26:00,4.96,,United Kingdom
541539,581498,85174,S/4 CACTI CANDLES,1,2011-12-09 10:26:00,10.79,,United Kingdom


## **Drop Null CustomerID**

In [7]:
def dropNull(df, cols=None):
  print(f'Dimensions before removing nulls: {df.shape}')
  if cols is None:
    df = df.dropna()
  else:
    df = df.dropna(subset=cols, axis=0)
  print(f'Dimensions after removing nulls: {df.shape}')
  return df

In [8]:
data = dropNull(data, cols=['CustomerID'])
data

Dimensions before removing nulls: (536641, 8)
Dimensions after removing nulls: (401604, 8)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
61619,541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215,2011-01-18 10:01:00,1.04,12346.0,United Kingdom
61624,C541433,23166,MEDIUM CERAMIC TOP STORAGE JAR,-74215,2011-01-18 10:17:00,1.04,12346.0,United Kingdom
286628,562032,21578,WOODLAND DESIGN COTTON TOTE BAG,6,2011-08-02 08:48:00,2.25,12347.0,Iceland
72263,542237,47559B,TEA TIME OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347.0,Iceland
72264,542237,21154,RED RETROSPOT OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347.0,Iceland
...,...,...,...,...,...,...,...,...
392737,570715,23269,SET OF 2 CERAMIC CHRISTMAS TREES,36,2011-10-12 10:23:00,1.45,18287.0,United Kingdom
392736,570715,23223,CHRISTMAS TREE HANGING SILVER,48,2011-10-12 10:23:00,0.83,18287.0,United Kingdom
392735,570715,23378,PACK OF 12 50'S CHRISTMAS TISSUES,24,2011-10-12 10:23:00,0.39,18287.0,United Kingdom
423939,573167,23264,SET OF 3 WOODEN SLEIGH DECORATIONS,36,2011-10-28 09:29:00,1.25,18287.0,United Kingdom


## **Data Types**

In [9]:
dataProfile(data)

Dimensions	: (401604, 8)
Data Size	: 107.96 MB
Duplicated Data	: 0


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


REVIEW


Unnamed: 0,dtype,count_of_null,null_ratio,count_of_distinct,distinct_value
InvoiceNo,object,0,0.0,22190,"[541431, C541433, 562032, 542237, 573511, 5562..."
StockCode,object,0,0.0,3684,"[23166, 21578, 47559B, 21154, 21041, 21035, 22..."
Description,object,0,0.0,3896,"[MEDIUM CERAMIC TOP STORAGE JAR, WOODLAND DESI..."
Quantity,int64,0,0.0,436,"[74215, -74215, 6, 10, 3, 12, 4, 8, 24, 20, 2,..."
InvoiceDate,datetime64[ns],0,0.0,20460,"[2011-01-18T10:01:00.000000000, 2011-01-18T10:..."
UnitPrice,float64,0,0.0,620,"[1.04, 2.25, 1.25, 2.95, 12.75, 4.25, 0.42, 1...."
CustomerID,float64,0,0.0,4372,"[12346.0, 12347.0, 12348.0, 12349.0, 12350.0, ..."
Country,object,0,0.0,37,"[United Kingdom, Iceland, Finland, Italy, Norw..."


Statistical Numerics


Unnamed: 0,Quantity,UnitPrice,CustomerID
count,401604.0,401604.0,401604.0
mean,12.183273,3.474064,15281.160818
std,250.283037,69.764035,1714.006089
min,-80995.0,0.0,12346.0
25%,2.0,1.25,13939.0
50%,5.0,1.95,15145.0
75%,12.0,3.75,16784.0
max,80995.0,38970.0,18287.0


Statistical Categorics


Unnamed: 0,InvoiceNo,StockCode,Description,Country
count,401604,401604,401604,401604
unique,22190,3684,3896,37
top,576339,85123A,WHITE HANGING HEART T-LIGHT HOLDER,United Kingdom
freq,542,2065,2058,356728


PREVIEW head(3)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
61619,541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215,2011-01-18 10:01:00,1.04,12346.0,United Kingdom
61624,C541433,23166,MEDIUM CERAMIC TOP STORAGE JAR,-74215,2011-01-18 10:17:00,1.04,12346.0,United Kingdom
286628,562032,21578,WOODLAND DESIGN COTTON TOTE BAG,6,2011-08-02 08:48:00,2.25,12347.0,Iceland


In [10]:
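# CustomerID was read as float because of the missing values; cast it to a string
# and strip the trailing '.0' (e.g. '17850.0' -> '17850')
data.CustomerID = data.CustomerID.astype('str')
data.CustomerID = data.CustomerID.str.replace(r'\.0$', '', regex=True)
# Columns that keep their numeric/datetime dtype; every other column becomes a string
numericalColumns = ['Quantity', 'UnitPrice', 'InvoiceDate']
for value in data.columns:
  if value not in numericalColumns:
    data[value] = data[value].astype('str')
dataProfile(data)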

Dimensions	: (401604, 8)
Data Size	: 149.77 MB
Duplicated Data	: 0


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


REVIEW


Unnamed: 0,dtype,count_of_null,null_ratio,count_of_distinct,distinct_value
InvoiceNo,object,0,0.0,22190,"[541431, C541433, 562032, 542237, 573511, 5562..."
StockCode,object,0,0.0,3684,"[23166, 21578, 47559B, 21154, 21041, 21035, 22..."
Description,object,0,0.0,3896,"[MEDIUM CERAMIC TOP STORAGE JAR, WOODLAND DESI..."
Quantity,int64,0,0.0,436,"[74215, -74215, 6, 10, 3, 12, 4, 8, 24, 20, 2,..."
InvoiceDate,datetime64[ns],0,0.0,20460,"[2011-01-18T10:01:00.000000000, 2011-01-18T10:..."
UnitPrice,float64,0,0.0,620,"[1.04, 2.25, 1.25, 2.95, 12.75, 4.25, 0.42, 1...."
CustomerID,object,0,0.0,4372,"[12346, 12347, 12348, 12349, 12350, 12352, 123..."
Country,object,0,0.0,37,"[United Kingdom, Iceland, Finland, Italy, Norw..."


Statistical Numerics


Unnamed: 0,Quantity,UnitPrice
count,401604.0,401604.0
mean,12.183273,3.474064
std,250.283037,69.764035
min,-80995.0,0.0
25%,2.0,1.25
50%,5.0,1.95
75%,12.0,3.75
max,80995.0,38970.0


Statistical Categorics


Unnamed: 0,InvoiceNo,StockCode,Description,CustomerID,Country
count,401604,401604,401604,401604,401604
unique,22190,3684,3896,4372,37
top,576339,85123A,WHITE HANGING HEART T-LIGHT HOLDER,17841,United Kingdom
freq,542,2065,2058,7812,356728


PREVIEW head(3)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
61619,541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215,2011-01-18 10:01:00,1.04,12346,United Kingdom
61624,C541433,23166,MEDIUM CERAMIC TOP STORAGE JAR,-74215,2011-01-18 10:17:00,1.04,12346,United Kingdom
286628,562032,21578,WOODLAND DESIGN COTTON TOTE BAG,6,2011-08-02 08:48:00,2.25,12347,Iceland


## **Explore**

### **Duplicated and null values have been removed. Negative values in Quantity?**

In [11]:
print(f'negative quantity => refund?')
print(f'InvoiceNo startwith: {data[(data.Quantity<0)].InvoiceNo.apply(lambda x: str(x)[0]).unique()}\n{line}')
display(data[(data.Quantity<0)])
print(line)
print(f'zero unitprice => free/bug/error?')
print(f'length: {len(data[(data.UnitPrice==0)])}\n{line}')
display(data[(data.UnitPrice==0)])

negative quantity => refund?
InvoiceNo startwith: ['C']


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
61624,C541433,23166,MEDIUM CERAMIC TOP STORAGE JAR,-74215,2011-01-18 10:17:00,1.04,12346,United Kingdom
106397,C545330,M,Manual,-1,2011-03-01 15:49:00,376.50,12352,Norway
106395,C545329,M,Manual,-1,2011-03-01 15:47:00,183.75,12352,Norway
106394,C545329,M,Manual,-1,2011-03-01 15:47:00,280.05,12352,Norway
129743,C547388,21914,BLUE HARMONICA IN BOX,-12,2011-03-22 16:07:00,1.25,12352,Norway
...,...,...,...,...,...,...,...,...
488515,C577832,84988,SET OF 72 PINK HEART PAPER DOILIES,-12,2011-11-22 10:18:00,1.45,18274,United Kingdom
481908,C577386,23401,RUSTIC MIRROR WITH LACE HEART,-1,2011-11-18 16:54:00,6.25,18276,United Kingdom
481921,C577390,23401,RUSTIC MIRROR WITH LACE HEART,-1,2011-11-18 17:01:00,6.25,18276,United Kingdom
70604,C542086,22423,REGENCY CAKESTAND 3 TIER,-1,2011-01-25 12:34:00,12.75,18277,United Kingdom


zero unitprice => free/bug/error?
length: 40


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
436428,574138,23234,BISCUIT TIN VINTAGE CHRISTMAS,216,2011-11-03 11:26:00,0.0,12415,Australia
198383,554037,22619,SET OF 6 SOLDIER SKITTLES,80,2011-05-20 14:13:00,0.0,12415,Australia
439361,574469,22385,JUMBO BAG SPACEBOY DESIGN,12,2011-11-04 11:55:00,0.0,12431,Australia
436961,574252,M,Manual,1,2011-11-03 13:24:00,0.0,12437,France
480649,577314,23407,SET OF 2 TRAYS HOME SWEET HOME,2,2011-11-18 13:23:00,0.0,12444,Norway
395529,571035,M,Manual,1,2011-10-13 12:50:00,0.0,12446,RSA
157042,550188,22636,CHILDS BREAKFAST SET CIRCUS PARADE,1,2011-04-14 18:57:00,0.0,12457,Switzerland
282912,561669,22960,JAM MAKING SET WITH JARS,11,2011-07-28 17:09:00,0.0,12507,Spain
479546,577168,M,Manual,1,2011-11-18 10:42:00,0.0,12603,Germany
9302,537197,22841,ROUND CAKE TIN VINTAGE GREEN,1,2010-12-05 14:02:00,0.0,12647,Germany


In [12]:
zeroUP = data[data.UnitPrice==0][['StockCode', "Description"]]
priceZero = pd.merge(data, zeroUP, left_on=['StockCode', 'Description'], right_on=['StockCode', 'Description'], how='inner')
# priceZero
priceZero = priceZero.groupby(['StockCode', 'Description', 'UnitPrice'], as_index=False).agg(Count_=('UnitPrice', 'count')).reset_index(drop=True)
priceZero[priceZero.UnitPrice==0]

Unnamed: 0,StockCode,Description,UnitPrice,Count_
0,21208,PASTEL COLOUR HONEYCOMB FAN,0.0,1
4,21786,POLKADOT RAIN HAT,0.0,1
8,22055,MINI CAKE STAND HANGING STRAWBERY,0.0,1
12,22062,CERAMIC BOWL WITH LOVE HEART DESIGN,0.0,1
15,22065,CHRISTMAS PUDDING TRINKET POT,0.0,1
19,22089,PAPER BUNTING VINTAGE PAISLEY,0.0,1
22,22090,PAPER BUNTING RETROSPOT,0.0,1
26,22162,HEART GARLAND RUSTIC PADDED,0.0,1
28,22167,OVAL WALL MIRROR DIAMANTE,0.0,1
31,22168,ORGANISER WOOD ANTIQUE WHITE,0.0,1


### **Drop Zero UnitPrice**
Only 40 records have a UnitPrice of zero, so they can be removed to avoid data inconsistencies.
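
As a quick sanity check on how small this group is, the arithmetic below uses the two counts reported in the outputs above (40 zero-price rows out of 401,604). The variable names are illustrative only.

```python
# Counts taken from the notebook outputs above
zero_price_rows = 40
total_rows = 401604

# Share of rows affected, in percent
share = zero_price_rows / total_rows * 100
remaining_rows = total_rows - zero_price_rows

print(f"{share:.4f}% of rows have UnitPrice == 0")
print(f"rows left after the filter: {remaining_rows}")
```

The remaining row count should agree with the dimensions reported by the next `dataProfile` call.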

In [13]:
data = data[data.UnitPrice > 0]
data

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
61619,541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215,2011-01-18 10:01:00,1.04,12346,United Kingdom
61624,C541433,23166,MEDIUM CERAMIC TOP STORAGE JAR,-74215,2011-01-18 10:17:00,1.04,12346,United Kingdom
286628,562032,21578,WOODLAND DESIGN COTTON TOTE BAG,6,2011-08-02 08:48:00,2.25,12347,Iceland
72263,542237,47559B,TEA TIME OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland
72264,542237,21154,RED RETROSPOT OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland
...,...,...,...,...,...,...,...,...
392737,570715,23269,SET OF 2 CERAMIC CHRISTMAS TREES,36,2011-10-12 10:23:00,1.45,18287,United Kingdom
392736,570715,23223,CHRISTMAS TREE HANGING SILVER,48,2011-10-12 10:23:00,0.83,18287,United Kingdom
392735,570715,23378,PACK OF 12 50'S CHRISTMAS TISSUES,24,2011-10-12 10:23:00,0.39,18287,United Kingdom
423939,573167,23264,SET OF 3 WOODEN SLEIGH DECORATIONS,36,2011-10-28 09:29:00,1.25,18287,United Kingdom


In [14]:
dataProfile(data)

Dimensions	: (401564, 8)
Data Size	: 149.76 MB
Duplicated Data	: 0


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


REVIEW


Unnamed: 0,dtype,count_of_null,null_ratio,count_of_distinct,distinct_value
InvoiceNo,object,0,0.0,22186,"[541431, C541433, 562032, 542237, 573511, 5562..."
StockCode,object,0,0.0,3684,"[23166, 21578, 47559B, 21154, 21041, 21035, 22..."
Description,object,0,0.0,3896,"[MEDIUM CERAMIC TOP STORAGE JAR, WOODLAND DESI..."
Quantity,int64,0,0.0,435,"[74215, -74215, 6, 10, 3, 12, 4, 8, 24, 20, 2,..."
InvoiceDate,datetime64[ns],0,0.0,20456,"[2011-01-18T10:01:00.000000000, 2011-01-18T10:..."
UnitPrice,float64,0,0.0,619,"[1.04, 2.25, 1.25, 2.95, 12.75, 4.25, 0.42, 1...."
CustomerID,object,0,0.0,4371,"[12346, 12347, 12348, 12349, 12350, 12352, 123..."
Country,object,0,0.0,37,"[United Kingdom, Iceland, Finland, Italy, Norw..."


Statistical Numerics


Unnamed: 0,Quantity,UnitPrice
count,401564.0,401564.0
mean,12.149911,3.47441
std,249.512649,69.767501
min,-80995.0,0.001
25%,2.0,1.25
50%,5.0,1.95
75%,12.0,3.75
max,80995.0,38970.0


Statistical Categorics


Unnamed: 0,InvoiceNo,StockCode,Description,CustomerID,Country
count,401564,401564,401564,401564,401564
unique,22186,3684,3896,4371,37
top,576339,85123A,WHITE HANGING HEART T-LIGHT HOLDER,17841,United Kingdom
freq,542,2065,2058,7812,356704


PREVIEW head(3)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
61619,541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215,2011-01-18 10:01:00,1.04,12346,United Kingdom
61624,C541433,23166,MEDIUM CERAMIC TOP STORAGE JAR,-74215,2011-01-18 10:17:00,1.04,12346,United Kingdom
286628,562032,21578,WOODLAND DESIGN COTTON TOTE BAG,6,2011-08-02 08:48:00,2.25,12347,Iceland


### **Explore Returned/Canceled Transactions**

#### **By Transactions and Transaction Items**

In [15]:
cancel = data.groupby(['InvoiceNo', 'CustomerID'], as_index=False).Quantity.sum().sort_values('CustomerID').reset_index(drop=True)
cancel['IsCanceled'] = np.where(cancel.InvoiceNo.str.startswith('C', na=False), 1, 0)

print(f'Total transactions\t\t: {len(cancel)}')
print(f'Total completed transactions\t: {len(cancel)-cancel.IsCanceled.sum()} => {round(100-(cancel.IsCanceled.sum()/len(cancel)*100),2)}%')
print(f'Total canceled transactions\t: {cancel.IsCanceled.sum()} => {round((cancel.IsCanceled.sum()/len(cancel)*100),2)}%')
print(line)
cancel

Total transactions		: 22186
Total completed transactions	: 18532 => 83.53%
Total canceled transactions	: 3654 => 16.47%


Unnamed: 0,InvoiceNo,CustomerID,Quantity,IsCanceled
0,541431,12346,74215,0
1,C541433,12346,-74215,1
2,549222,12347,483,0
3,537626,12347,319,0
4,562032,12347,277,0
...,...,...,...,...
22181,578262,18283,241,0
22182,579673,18283,132,0
22183,570715,18287,990,0
22184,554065,18287,488,0


In [16]:
# canceled items
data[data['InvoiceNo'].str.startswith("C", na = False)].sort_values('CustomerID').reset_index(drop=True)
# same as data[data.Quantity<0]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,C541433,23166,MEDIUM CERAMIC TOP STORAGE JAR,-74215,2011-01-18 10:17:00,1.04,12346,United Kingdom
1,C547388,22784,LANTERN CREAM GAZEBO,-3,2011-03-22 16:07:00,4.95,12352,Norway
2,C547388,37448,CERAMIC CAKE DESIGN SPOTTED MUG,-12,2011-03-22 16:07:00,1.49,12352,Norway
3,C547388,22701,PINK DOG BOWL,-6,2011-03-22 16:07:00,2.95,12352,Norway
4,C547388,22645,CERAMIC HEART FAIRY CAKE MONEY BANK,-12,2011-03-22 16:07:00,1.45,12352,Norway
...,...,...,...,...,...,...,...,...
8867,C577832,84988,SET OF 72 PINK HEART PAPER DOILIES,-12,2011-11-22 10:18:00,1.45,18274,United Kingdom
8868,C577386,23401,RUSTIC MIRROR WITH LACE HEART,-1,2011-11-18 16:54:00,6.25,18276,United Kingdom
8869,C577390,23401,RUSTIC MIRROR WITH LACE HEART,-1,2011-11-18 17:01:00,6.25,18276,United Kingdom
8870,C542086,22423,REGENCY CAKESTAND 3 TIER,-1,2011-01-25 12:34:00,12.75,18277,United Kingdom


#### **Transactions Affected by Returns**

In [104]:
data.reset_index(drop=True, inplace=True)
dataIdx = data.copy()
dataIdx['idx'] = dataIdx.index
dataIdx

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,idx
0,541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215,2011-01-18 10:01:00,1.04,12346,United Kingdom,0
1,C541433,23166,MEDIUM CERAMIC TOP STORAGE JAR,-74215,2011-01-18 10:17:00,1.04,12346,United Kingdom,1
2,562032,21578,WOODLAND DESIGN COTTON TOTE BAG,6,2011-08-02 08:48:00,2.25,12347,Iceland,2
3,542237,47559B,TEA TIME OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland,3
4,542237,21154,RED RETROSPOT OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland,4
...,...,...,...,...,...,...,...,...,...
401559,570715,23269,SET OF 2 CERAMIC CHRISTMAS TREES,36,2011-10-12 10:23:00,1.45,18287,United Kingdom,401559
401560,570715,23223,CHRISTMAS TREE HANGING SILVER,48,2011-10-12 10:23:00,0.83,18287,United Kingdom,401560
401561,570715,23378,PACK OF 12 50'S CHRISTMAS TISSUES,24,2011-10-12 10:23:00,0.39,18287,United Kingdom,401561
401562,573167,23264,SET OF 3 WOODEN SLEIGH DECORATIONS,36,2011-10-28 09:29:00,1.25,18287,United Kingdom,401562


In [105]:
dataCompleted = dataIdx[dataIdx.Quantity>0]
dataCanceled = dataIdx[dataIdx.Quantity<0]
dataReturned = pd.merge(dataCompleted, dataCanceled, how='right',
                   on=['StockCode', 'Description', 'CustomerID', 'Country', 'UnitPrice'], 
                   suffixes=['_completed', '_canceled'])
dataReturned

Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
0,541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215.0,2011-01-18 10:01:00,1.04,12346,United Kingdom,0.0,C541433,-74215,2011-01-18 10:17:00,1
1,545332,M,Manual,1.0,2011-03-01 15:52:00,376.50,12352,Norway,317.0,C545330,-1,2011-03-01 15:49:00,318
2,545332,M,Manual,1.0,2011-03-01 15:52:00,183.75,12352,Norway,315.0,C545329,-1,2011-03-01 15:47:00,319
3,545332,M,Manual,1.0,2011-03-01 15:52:00,280.05,12352,Norway,316.0,C545329,-1,2011-03-01 15:47:00,320
4,546869,21914,BLUE HARMONICA IN BOX,12.0,2011-03-17 16:00:00,1.25,12352,Norway,354.0,C547388,-12,2011-03-22 16:07:00,345
...,...,...,...,...,...,...,...,...,...,...,...,...,...
21127,575485,84988,SET OF 72 PINK HEART PAPER DOILIES,12.0,2011-11-09 17:03:00,1.45,18274,United Kingdom,400700.0,C577832,-12,2011-11-22 10:18:00,400697
21128,572990,23401,RUSTIC MIRROR WITH LACE HEART,2.0,2011-10-27 10:54:00,6.25,18276,United Kingdom,400718.0,C577386,-1,2011-11-18 16:54:00,400713
21129,572990,23401,RUSTIC MIRROR WITH LACE HEART,2.0,2011-10-27 10:54:00,6.25,18276,United Kingdom,400718.0,C577390,-1,2011-11-18 17:01:00,400715
21130,,22423,REGENCY CAKESTAND 3 TIER,,NaT,12.75,18277,United Kingdom,,C542086,-1,2011-01-25 12:34:00,400727


In [106]:
dataReturnedQtyEQ = dataReturned[(dataReturned.Quantity_completed == np.abs(dataReturned.Quantity_canceled)) & 
                                 (dataReturned.InvoiceDate_completed < dataReturned.InvoiceDate_canceled)]
dataReturnedQtyUnknown = dataReturned[dataReturned.InvoiceNo_completed.isnull()]
print(line)
print(f'Transaction Items Affected by Returned => {len(dataReturned)}')
print(line)
print(f'Purchase Quantity == Return Quantity \t: {len(dataReturnedQtyEQ)}')
display(dataReturnedQtyEQ)
print(line)
print(f'Purchase Unknown & Return Quantity \t: {len(dataReturnedQtyUnknown)}')
display(dataReturnedQtyUnknown)

# dataReturnedQtyMT = dataReturned[dataReturned.Quantity_completed < np.abs(dataReturned.Quantity_canceled)]
# dataReturnedQtyLT = dataReturned[dataReturned.Quantity_completed > np.abs(dataReturned.Quantity_canceled)]
# print(f'Purchase Quantity > Return Quantity \t: {len(dataReturnedQtyLT)}')
# display(dataReturnedQtyLT)
# print(line)
# print(f'Purchase Quantity < Return Quantity \t: {len(dataReturnedQtyMT)}')
# display(dataReturnedQtyMT)

Transaction Items Affected by Returned => 21132
Purchase Quantity == Return Quantity 	: 3890


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
0,541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215.0,2011-01-18 10:01:00,1.04,12346,United Kingdom,0.0,C541433,-74215,2011-01-18 10:17:00,1
4,546869,21914,BLUE HARMONICA IN BOX,12.0,2011-03-17 16:00:00,1.25,12352,Norway,354.0,C547388,-12,2011-03-22 16:07:00,345
6,546869,22413,METAL SIGN TAKE IT OR LEAVE IT,6.0,2011-03-17 16:00:00,2.95,12352,Norway,340.0,C547388,-6,2011-03-22 16:07:00,346
8,546869,22645,CERAMIC HEART FAIRY CAKE MONEY BANK,12.0,2011-03-17 16:00:00,1.45,12352,Norway,352.0,C547388,-12,2011-03-22 16:07:00,347
11,546869,22701,PINK DOG BOWL,6.0,2011-03-17 16:00:00,2.95,12352,Norway,377.0,C547388,-6,2011-03-22 16:07:00,349
...,...,...,...,...,...,...,...,...,...,...,...,...,...
21123,575485,22989,SET 2 PANTRY DESIGN TEA TOWELS,6.0,2011-11-09 17:03:00,3.25,18274,United Kingdom,400704.0,C577832,-6,2011-11-22 10:18:00,400693
21124,575485,23243,SET OF TEA COFFEE SUGAR TINS PANTRY,4.0,2011-11-09 17:03:00,4.95,18274,United Kingdom,400703.0,C577832,-4,2011-11-22 10:18:00,400694
21125,575485,23245,SET OF 3 REGENCY CAKE TINS,4.0,2011-11-09 17:03:00,4.95,18274,United Kingdom,400701.0,C577832,-4,2011-11-22 10:18:00,400695
21126,575485,84509A,SET OF 4 ENGLISH ROSE PLACEMATS,4.0,2011-11-09 17:03:00,3.75,18274,United Kingdom,400707.0,C577832,-4,2011-11-22 10:18:00,400696


Purchase Unknown & Return Quantity 	: 1316


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
26,,22826,LOVE SEAT ANTIQUE WHITE METAL,,NaT,42.50,12359,Cyprus,,C580165,-1,2011-12-02 11:21:00,906
71,,20712,JUMBO BAG WOODLAND ANIMALS,,NaT,2.08,12408,Belgium,,C549253,-1,2011-04-07 12:20:00,3541
152,,POST,POSTAGE,,NaT,262.73,12415,Australia,,C574344,-1,2011-11-04 10:18:00,4394
171,,M,Manual,,NaT,0.77,12421,Spain,,C557300,-1,2011-06-19 14:05:00,4951
196,,21217,RED RETROSPOT ROUND CAKE TINS,,NaT,9.95,12434,Australia,,C538723,-1,2010-12-14 11:12:00,6461
...,...,...,...,...,...,...,...,...,...,...,...,...,...
21089,,23057,BEADED CHANDELIER T-LIGHT HOLDER,,NaT,4.95,18257,United Kingdom,,C555268,-1,2011-06-01 16:17:00,400102
21092,,POST,POSTAGE,,NaT,8.65,18257,United Kingdom,,C545740,-1,2011-03-07 11:47:00,400166
21108,,POST,POSTAGE,,NaT,5.95,18270,United Kingdom,,C549945,-1,2011-04-13 12:39:00,400508
21111,,20932,PINK POT PLANT CANDLE,,NaT,2.95,18272,United Kingdom,,C552720,-1,2011-05-11 09:49:00,400560


###### **Returned Qty == Buying Qty & Unknown Returns**
Some returns have no matching purchase invoice. Why?
The purchase may have been invoiced before the date range covered by the dataset. Since there are relatively few such items (1,316) and no further information about them, they can be ignored or removed. The same applies to transaction items whose return quantity equals the purchase quantity: nothing in the dataset explains how product returns are processed, so we assume that these items cancel each other out and remove both sides.
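
The rule assumed in this section can be illustrated on a toy frame. The rows, the `idx` labels, and the variable names below are made up for illustration, and the merge keys are reduced to `StockCode` and `CustomerID` to keep the sketch short. It is a simplified sketch, not the notebook's exact procedure.

```python
import pandas as pd

# Toy transactions (hypothetical): two purchases, one exactly matched return,
# and one return whose purchase is not in the data
toy = pd.DataFrame({
    "InvoiceNo":   ["100", "101", "C102", "C103"],
    "StockCode":   ["A", "B", "A", "Z"],
    "CustomerID":  ["1", "1", "1", "2"],
    "Quantity":    [5, 3, -5, -1],
    "InvoiceDate": pd.to_datetime(["2011-01-01", "2011-01-02",
                                   "2011-01-03", "2011-01-04"]),
})
toy["idx"] = toy.index

purchases = toy[toy.Quantity > 0]
returns = toy[toy.Quantity < 0]

# A right merge keeps every return, whether or not a purchase matches it
pairs = pd.merge(purchases, returns, how="right",
                 on=["StockCode", "CustomerID"],
                 suffixes=("_completed", "_canceled"))

# Returns with no purchase found
unknown = pairs[pairs.InvoiceNo_completed.isnull()]

# Returns that exactly offset an earlier purchase
equal = pairs[(pairs.Quantity_completed == pairs.Quantity_canceled.abs()) &
              (pairs.InvoiceDate_completed < pairs.InvoiceDate_canceled)]

# Drop the unknown returns and both sides of every exact pair
to_drop = pd.concat([unknown.idx_canceled,
                     equal.idx_completed,
                     equal.idx_canceled]).astype(int).unique()
cleaned = toy.drop(to_drop)

print(cleaned[["InvoiceNo", "StockCode", "Quantity"]])
```

Only the purchase that was never returned survives in `cleaned`. The cast to `int` is needed because the right merge turns `idx_completed` into a float column whenever some returns have no match.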

In [107]:
dataIdx.drop(dataReturnedQtyUnknown.idx_canceled.unique(), inplace=True)
dataIdx.drop(dataReturnedQtyEQ.idx_completed.unique(), inplace=True)
dataIdx.drop(dataReturnedQtyEQ.idx_canceled.unique(), inplace=True)
dataIdx

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,idx
2,562032,21578,WOODLAND DESIGN COTTON TOTE BAG,6,2011-08-02 08:48:00,2.25,12347,Iceland,2
3,542237,47559B,TEA TIME OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland,3
4,542237,21154,RED RETROSPOT OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland,4
5,542237,21041,RED RETROSPOT OVEN GLOVE DOUBLE,6,2011-01-26 14:30:00,2.95,12347,Iceland,5
6,542237,21035,SET/2 RED RETROSPOT TEA TOWELS,6,2011-01-26 14:30:00,2.95,12347,Iceland,6
...,...,...,...,...,...,...,...,...,...
401559,570715,23269,SET OF 2 CERAMIC CHRISTMAS TREES,36,2011-10-12 10:23:00,1.45,18287,United Kingdom,401559
401560,570715,23223,CHRISTMAS TREE HANGING SILVER,48,2011-10-12 10:23:00,0.83,18287,United Kingdom,401560
401561,570715,23378,PACK OF 12 50'S CHRISTMAS TISSUES,24,2011-10-12 10:23:00,0.39,18287,United Kingdom,401561
401562,573167,23264,SET OF 3 WOODEN SLEIGH DECORATIONS,36,2011-10-28 09:29:00,1.25,18287,United Kingdom,401562


###### **Purchase Qty > Returned Qty**

In [109]:
dataCompleted = dataIdx[dataIdx.Quantity>0]
dataCanceled = dataIdx[dataIdx.Quantity<0]
dataReturned = pd.merge(dataCompleted, dataCanceled, how='right',
                   on=['StockCode', 'Description', 'CustomerID', 'Country', 'UnitPrice'], 
                   suffixes=['_completed', '_canceled'])
# dataReturned
dataReturnedQtyLT = dataReturned[(dataReturned.Quantity_completed > np.abs(dataReturned.Quantity_canceled)) &
                                 (dataReturned.InvoiceDate_completed < dataReturned.InvoiceDate_canceled)].reset_index(drop=True)
dataReturnedQtyMT = dataReturned[(dataReturned.Quantity_completed < np.abs(dataReturned.Quantity_canceled)) &
                                 (dataReturned.InvoiceDate_completed < dataReturned.InvoiceDate_canceled)].reset_index(drop=True)
#
print(line)
print(f'Transaction Items Affected by Returned => {len(dataReturned)}')
print(line)
print(f'Purchase Quantity > Return Quantity \t: {len(dataReturnedQtyLT)}')
display(dataReturnedQtyLT)
# print(line)
# print(f'Purchase Quantity < Return Quantity \t: {len(dataReturnedQtyMT)}')
# display(dataReturnedQtyMT)

Transaction Items Affected by Returned => 12068
Purchase Quantity > Return Quantity 	: 7341


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
0,540946,22666,RECIPE BOX PANTRY YELLOW DESIGN,6.0,2011-01-12 12:43:00,2.95,12359,Cyprus,696.0,C549955,-2,2011-04-13 13:38:00,684
1,543370,22666,RECIPE BOX PANTRY YELLOW DESIGN,6.0,2011-02-07 14:51:00,2.95,12359,Cyprus,726.0,C549955,-2,2011-04-13 13:38:00,684
2,571034,23245,SET OF 3 REGENCY CAKE TINS,4.0,2011-10-13 12:47:00,4.95,12359,Cyprus,882.0,C580165,-2,2011-12-02 11:21:00,710
3,571034,22797,CHEST OF DRAWERS GINGHAM HEART,4.0,2011-10-13 12:47:00,16.95,12359,Cyprus,930.0,C580165,-2,2011-12-02 11:21:00,711
4,540946,22720,SET OF 3 CAKE TINS PANTRY DESIGN,3.0,2011-01-12 12:43:00,4.95,12359,Cyprus,698.0,C580165,-1,2011-12-02 11:21:00,903
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7336,549185,22969,HOMEMADE JAM SCENTED CANDLES,24.0,2011-04-07 09:35:00,1.45,18272,United Kingdom,400583.0,C552720,-2,2011-05-11 09:49:00,400561
7337,551507,22204,MILK PAN BLUE POLKADOT,4.0,2011-04-28 18:11:00,3.75,18272,United Kingdom,400604.0,C552720,-1,2011-05-11 09:49:00,400564
7338,572990,23401,RUSTIC MIRROR WITH LACE HEART,2.0,2011-10-27 10:54:00,6.25,18276,United Kingdom,400718.0,C577386,-1,2011-11-18 16:54:00,400713
7339,572990,23401,RUSTIC MIRROR WITH LACE HEART,2.0,2011-10-27 10:54:00,6.25,18276,United Kingdom,400718.0,C577390,-1,2011-11-18 17:01:00,400715
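
The matching step above can be tried on a few rows to see what the right join produces. This is only a minimal sketch: the `toy` frame and its values are made up for illustration, and it uses fewer key columns than the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows (not from the dataset): two purchases of item A,
# one return of item A, and one return of item B with no purchase.
toy = pd.DataFrame({
    'InvoiceNo':   ['1001', '1002', 'C2001', 'C2002'],
    'StockCode':   ['A', 'A', 'A', 'B'],
    'CustomerID':  [1, 1, 1, 1],
    'UnitPrice':   [2.0, 2.0, 2.0, 5.0],
    'Quantity':    [6, 4, -2, -1],
    'InvoiceDate': pd.to_datetime(['2011-01-01', '2011-02-01',
                                   '2011-03-01', '2011-03-01']),
})
toy['idx'] = toy.index

completed = toy[toy.Quantity > 0]
canceled = toy[toy.Quantity < 0]

# how='right' keeps every return row, even when no purchase matches it.
pairs = pd.merge(completed, canceled, how='right',
                 on=['StockCode', 'CustomerID', 'UnitPrice'],
                 suffixes=('_completed', '_canceled'))

# Partial returns: the purchase is larger and happened before the return.
partial = pairs[(pairs.Quantity_completed > np.abs(pairs.Quantity_canceled)) &
                (pairs.InvoiceDate_completed < pairs.InvoiceDate_canceled)]
print(len(pairs), len(partial))
```

The single return of item A pairs with both earlier purchases, which is why the real table lists the same `idx_canceled` more than once.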


In [111]:
# Because more than one purchase invoice can match a given return invoice, we assume that one of them is the invoice actually being returned and that the rest are purchase invoices with no return.
# dataReturnedQtyLT = dataReturnedQtyLT[(~dataReturnedQtyLT.idx_canceled.duplicated()) &
#                                       (~dataReturnedQtyLT.idx_completed.duplicated())]
dataReturnedQtyLT = dataReturnedQtyLT[(~dataReturnedQtyLT.idx_canceled.duplicated())]
print(line)
print(f'Transaction Items Affected by Returned => {len(dataReturned)}')
print(line)
print(f'Purchase Quantity > Return Quantity \t: {len(dataReturnedQtyLT)}')
display(dataReturnedQtyLT)

Transaction Items Affected by Returned => 12068
Purchase Quantity > Return Quantity 	: 4277


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
0,540946,22666,RECIPE BOX PANTRY YELLOW DESIGN,6.0,2011-01-12 12:43:00,2.95,12359,Cyprus,696.0,C549955,-2,2011-04-13 13:38:00,684
2,571034,23245,SET OF 3 REGENCY CAKE TINS,4.0,2011-10-13 12:47:00,4.95,12359,Cyprus,882.0,C580165,-2,2011-12-02 11:21:00,710
3,571034,22797,CHEST OF DRAWERS GINGHAM HEART,4.0,2011-10-13 12:47:00,16.95,12359,Cyprus,930.0,C580165,-2,2011-12-02 11:21:00,711
4,540946,22720,SET OF 3 CAKE TINS PANTRY DESIGN,3.0,2011-01-12 12:43:00,4.95,12359,Cyprus,698.0,C580165,-1,2011-12-02 11:21:00,903
6,544203,22629,SPACEBOY LUNCH BOX,12.0,2011-02-17 10:30:00,1.95,12362,Belgium,1125.0,C544902,-1,2011-02-24 13:05:00,1155
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7335,551507,22969,HOMEMADE JAM SCENTED CANDLES,12.0,2011-04-28 18:11:00,1.45,18272,United Kingdom,400519.0,C552720,-2,2011-05-11 09:49:00,400561
7337,551507,22204,MILK PAN BLUE POLKADOT,4.0,2011-04-28 18:11:00,3.75,18272,United Kingdom,400604.0,C552720,-1,2011-05-11 09:49:00,400564
7338,572990,23401,RUSTIC MIRROR WITH LACE HEART,2.0,2011-10-27 10:54:00,6.25,18276,United Kingdom,400718.0,C577386,-1,2011-11-18 16:54:00,400713
7339,572990,23401,RUSTIC MIRROR WITH LACE HEART,2.0,2011-10-27 10:54:00,6.25,18276,United Kingdom,400718.0,C577390,-1,2011-11-18 17:01:00,400715
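
The de-duplication on `idx_canceled` keeps the first candidate purchase for each return row. A small sketch of that behaviour, using index pairs taken from the first rows of the table before filtering:

```python
import pandas as pd

# Return row 684 matches two purchase rows (696 and 726); return row 710 matches one.
pairs = pd.DataFrame({
    'idx_completed': [696, 726, 882],
    'idx_canceled':  [684, 684, 710],
})

# duplicated() flags every occurrence after the first, so ~duplicated()
# keeps only the first purchase candidate for each return row.
one_per_return = pairs[~pairs.idx_canceled.duplicated()]
print(one_per_return.idx_completed.tolist())
```

Which purchase survives depends purely on row order. Sorting the pairs first (for example by `InvoiceDate_completed`) would be one way to make that choice explicit, though the notebook does not do this.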


In [112]:
print(f'Num of return invoice\t\t\t\t: {len(dataReturnedQtyLT.idx_canceled.unique())}')
print(f'Num of purchase invoice affected by return\t: {len(dataReturnedQtyLT.idx_completed.unique())}')

Num of return invoice				: 4277
Num of purchase invoice affected by return	: 3934


In [114]:
display(dataReturnedQtyLT[dataReturnedQtyLT.idx_completed.duplicated()].tail())
print(line)
print('SAMPLE')
display(dataReturnedQtyLT[dataReturnedQtyLT.idx_completed==397983])

Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
7231,546165,22720,SET OF 3 CAKE TINS PANTRY DESIGN,3.0,2011-03-10 10:08:00,4.95,18183,United Kingdom,396420.0,C546897,-1,2011-03-17 18:25:00,396445
7269,565413,21931,JUMBO STORAGE BAG SUKI,10.0,2011-09-04 11:49:00,2.08,18223,United Kingdom,397925.0,C566460,-1,2011-09-12 17:19:00,398032
7285,560577,22720,SET OF 3 CAKE TINS PANTRY DESIGN,12.0,2011-07-19 15:07:00,4.95,18223,United Kingdom,397983.0,C574954,-3,2011-11-08 09:52:00,398175
7325,562732,21314,SMALL GLASS HEART TRINKET POT,8.0,2011-08-09 10:19:00,2.1,18248,United Kingdom,399852.0,C563594,-5,2011-08-18 06:14:00,399847
7339,572990,23401,RUSTIC MIRROR WITH LACE HEART,2.0,2011-10-27 10:54:00,6.25,18276,United Kingdom,400718.0,C577390,-1,2011-11-18 17:01:00,400715


SAMPLE


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
7260,560577,22720,SET OF 3 CAKE TINS PANTRY DESIGN,12.0,2011-07-19 15:07:00,4.95,18223,United Kingdom,397983.0,C561604,-3,2011-07-28 12:08:00,398024
7285,560577,22720,SET OF 3 CAKE TINS PANTRY DESIGN,12.0,2011-07-19 15:07:00,4.95,18223,United Kingdom,397983.0,C574954,-3,2011-11-08 09:52:00,398175


It turns out that some purchase invoices were returned through different return invoices, i.e. more than once.
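
When one purchase row has several returns, the returned quantities need to be summed before subtracting them from the purchase. A minimal sketch using the quantities from the SAMPLE output above (purchase row 397983, 12 units, returned twice by 3 units):

```python
import pandas as pd

pairs = pd.DataFrame({
    'idx_completed':      [397983, 397983],
    'Quantity_completed': [12, 12],
    'Quantity_canceled':  [-3, -3],
})

# Sum all returns per purchase row; Quantity_canceled is negative,
# so adding it to the purchase quantity gives the net quantity kept.
net = pairs.groupby(['idx_completed', 'Quantity_completed'],
                    as_index=False).Quantity_canceled.sum()
net['Quantity'] = net.Quantity_completed + net.Quantity_canceled
print(net.Quantity.tolist())
```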

In [117]:
dataIdx[dataIdx.idx == 131490]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,idx
131490,541114,22423,REGENCY CAKESTAND 3 TIER,16,2011-01-13 15:19:00,10.95,14299,United Kingdom,131490


In [116]:
dataReturnedQtyLT[dataReturnedQtyLT.idx_canceled==131490]

Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled


In [128]:
rm = dataReturnedQtyLT.groupby(['InvoiceNo_completed', 'StockCode', 'Description', 'Quantity_completed', 'InvoiceDate_completed',
                                  'UnitPrice', 'CustomerID', 'Country', 'idx_completed'], as_index=False).Quantity_canceled.sum()
# rm
rm['Quantity'] = rm.Quantity_completed - np.abs(rm.Quantity_canceled)
newQty = pd.DataFrame(list(zip(rm.InvoiceNo_completed,
                      rm.StockCode,
                      rm.Description,
                      rm.Quantity,
                      rm.InvoiceDate_completed,
                      rm.UnitPrice,
                      rm.CustomerID,
                      rm.Country,
                      rm.idx_completed)), columns=dataIdx.columns)
newQty.Quantity = newQty.Quantity.astype(int)
newQty.idx = newQty.idx.astype(int)
#
print(line)
print(f'Num of Qty < 0\t: {len(newQty[newQty.Quantity<0])}')
print(f'Num of Qty == 0\t: {len(newQty[newQty.Quantity==0])}')
display(newQty)

Num of Qty < 0	: 19
Num of Qty == 0	: 41


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,idx
0,536374,21258,VICTORIAN SEWING BOX LARGE,20,2010-12-01 09:09:00,10.95,15100,United Kingdom,198523
1,536378,21212,PACK OF 72 RETROSPOT CAKE CASES,118,2010-12-01 09:37:00,0.42,14688,United Kingdom,163872
2,536378,21977,PACK OF 60 PINK PAISLEY CAKE CASES,23,2010-12-01 09:37:00,0.55,14688,United Kingdom,163768
3,536381,22719,GUMBALL MONOCHROME COAT RACK,33,2010-12-01 09:41:00,1.06,15311,United Kingdom,209216
4,536381,22778,GLASS CLOCHE SMALL,2,2010-12-01 09:41:00,3.95,15311,United Kingdom,209215
...,...,...,...,...,...,...,...,...,...
3929,580543,22909,SET OF 20 VINTAGE CHRISTMAS NAPKINS,11,2011-12-05 09:11:00,0.85,18223,United Kingdom,398014
3930,580598,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,6,2011-12-05 11:05:00,7.95,17526,United Kingdom,347698
3931,580719,84946,ANTIQUE SILVER T-LIGHT GLASS,69,2011-12-05 16:54:00,1.06,14739,United Kingdom,168865
3932,580978,22107,PIZZA PLATE IN BOX,7,2011-12-06 15:36:00,1.25,13078,United Kingdom,50545


The rows with Qty < 0 show that some items were returned although the matching purchase invoice was not recorded, or that the dataset would need to cover a much wider time interval. These rows can simply be removed.
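
Removing those rows comes down to a boolean filter on the net quantity. A hedged sketch with made-up values (`new_qty` here is a hypothetical stand-in for `newQty`):

```python
import pandas as pd

# Hypothetical net quantities after subtracting returns.
new_qty = pd.DataFrame({'idx': [1, 2, 3], 'Quantity': [4, 0, -2]})

# Drop rows whose net quantity is negative, as suggested in the note above.
kept = new_qty[new_qty.Quantity >= 0]
print(kept.idx.tolist())
```

Using `> 0` instead would also drop fully returned lines (net quantity 0); whether that is desirable is a modelling choice the note does not settle.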

In [129]:
dataIdx.drop(dataReturnedQtyLT.idx_canceled ,inplace=True)
dataIdx.drop(dataReturnedQtyLT.idx_completed ,inplace=True)
dataIdx

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,idx
2,562032,21578,WOODLAND DESIGN COTTON TOTE BAG,6,2011-08-02 08:48:00,2.25,12347,Iceland,2
3,542237,47559B,TEA TIME OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland,3
4,542237,21154,RED RETROSPOT OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland,4
5,542237,21041,RED RETROSPOT OVEN GLOVE DOUBLE,6,2011-01-26 14:30:00,2.95,12347,Iceland,5
6,542237,21035,SET/2 RED RETROSPOT TEA TOWELS,6,2011-01-26 14:30:00,2.95,12347,Iceland,6
...,...,...,...,...,...,...,...,...,...
401559,570715,23269,SET OF 2 CERAMIC CHRISTMAS TREES,36,2011-10-12 10:23:00,1.45,18287,United Kingdom,401559
401560,570715,23223,CHRISTMAS TREE HANGING SILVER,48,2011-10-12 10:23:00,0.83,18287,United Kingdom,401560
401561,570715,23378,PACK OF 12 50'S CHRISTMAS TISSUES,24,2011-10-12 10:23:00,0.39,18287,United Kingdom,401561
401562,573167,23264,SET OF 3 WOODEN SLEIGH DECORATIONS,36,2011-10-28 09:29:00,1.25,18287,United Kingdom,401562


In [130]:
dataIdx = pd.concat([dataIdx, newQty])
dataIdx

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,idx
2,562032,21578,WOODLAND DESIGN COTTON TOTE BAG,6,2011-08-02 08:48:00,2.25,12347,Iceland,2
3,542237,47559B,TEA TIME OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland,3
4,542237,21154,RED RETROSPOT OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland,4
5,542237,21041,RED RETROSPOT OVEN GLOVE DOUBLE,6,2011-01-26 14:30:00,2.95,12347,Iceland,5
6,542237,21035,SET/2 RED RETROSPOT TEA TOWELS,6,2011-01-26 14:30:00,2.95,12347,Iceland,6
...,...,...,...,...,...,...,...,...,...
3929,580543,22909,SET OF 20 VINTAGE CHRISTMAS NAPKINS,11,2011-12-05 09:11:00,0.85,18223,United Kingdom,398014
3930,580598,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,6,2011-12-05 11:05:00,7.95,17526,United Kingdom,347698
3931,580719,84946,ANTIQUE SILVER T-LIGHT GLASS,69,2011-12-05 16:54:00,1.06,14739,United Kingdom,168865
3932,580978,22107,PIZZA PLATE IN BOX,7,2011-12-06 15:36:00,1.25,13078,United Kingdom,50545


In [100]:
ee = dataIdx.sort_values('idx').copy()
ee.set_index('idx', inplace=True)
ee

Unnamed: 0_level_0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
idx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2,562032,21578,WOODLAND DESIGN COTTON TOTE BAG,6,2011-08-02 08:48:00,2.25,12347,Iceland
3,542237,47559B,TEA TIME OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland
4,542237,21154,RED RETROSPOT OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland
5,542237,21041,RED RETROSPOT OVEN GLOVE DOUBLE,6,2011-01-26 14:30:00,2.95,12347,Iceland
6,542237,21035,SET/2 RED RETROSPOT TEA TOWELS,6,2011-01-26 14:30:00,2.95,12347,Iceland
...,...,...,...,...,...,...,...,...
401559,570715,23269,SET OF 2 CERAMIC CHRISTMAS TREES,36,2011-10-12 10:23:00,1.45,18287,United Kingdom
401560,570715,23223,CHRISTMAS TREE HANGING SILVER,48,2011-10-12 10:23:00,0.83,18287,United Kingdom
401561,570715,23378,PACK OF 12 50'S CHRISTMAS TISSUES,24,2011-10-12 10:23:00,0.39,18287,United Kingdom
401562,573167,23264,SET OF 3 WOODEN SLEIGH DECORATIONS,36,2011-10-28 09:29:00,1.25,18287,United Kingdom


###### **ee**

In [131]:
dataCompleted = dataIdx[dataIdx.Quantity>0]
dataCanceled = dataIdx[dataIdx.Quantity<0]
dataReturned = pd.merge(dataCompleted, dataCanceled, how='right',
                   on=['StockCode', 'Description', 'CustomerID', 'Country', 'UnitPrice'], 
                   suffixes=['_completed', '_canceled'])
# dataReturned
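# Returns whose quantity exactly equals an earlier purchase.
dataReturnedQtyEQ = dataReturned[(dataReturned.Quantity_completed == np.abs(dataReturned.Quantity_canceled)) &
                                 (dataReturned.InvoiceDate_completed < dataReturned.InvoiceDate_canceled)].reset_index(drop=True)
# NOTE: this second assignment overwrites dataReturnedQtyEQ with the "purchase > return" rows.
# The target was presumably meant to be dataReturnedQtyLT, which therefore keeps its value from In [111].
dataReturnedQtyEQ = dataReturned[(dataReturned.Quantity_completed > np.abs(dataReturned.Quantity_canceled)) &
                                 (dataReturned.InvoiceDate_completed < dataReturned.InvoiceDate_canceled)].reset_index(drop=True)
# Returns larger than the matched earlier purchase.
dataReturnedQtyMT = dataReturned[(dataReturned.Quantity_completed < np.abs(dataReturned.Quantity_canceled)) &
                                 (dataReturned.InvoiceDate_completed < dataReturned.InvoiceDate_canceled)].reset_index(drop=True)
# Returns with no matching purchase row at all (the right join left the purchase columns empty).
dataReturnedQtyUnknown = dataReturned[dataReturned.InvoiceNo_completed.isnull()]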
#
print(line)
print(f'Transaction Items Affected by Returned => {len(dataReturned)}')
print(line)
print(f'Purchase Quantity > Return Quantity \t: {len(dataReturnedQtyLT)}')
display(dataReturnedQtyLT)
print(line)
print(f'Purchase Quantity < Return Quantity \t: {len(dataReturnedQtyMT)}')
display(dataReturnedQtyMT)
print(line)
print(f'Purchase Quantity < Return Quantity \t: {len(dataReturnedQtyEQ)}')
display(dataReturnedQtyEQ)
print(line)
print(f'Purchase Quantity < Return Quantity \t: {len(dataReturnedQtyUnknown)}')
display(dataReturnedQtyUnknown)

Transaction Items Affected by Returned => 1151
Purchase Quantity > Return Quantity 	: 4277


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
0,540946,22666,RECIPE BOX PANTRY YELLOW DESIGN,6.0,2011-01-12 12:43:00,2.95,12359,Cyprus,696.0,C549955,-2,2011-04-13 13:38:00,684
2,571034,23245,SET OF 3 REGENCY CAKE TINS,4.0,2011-10-13 12:47:00,4.95,12359,Cyprus,882.0,C580165,-2,2011-12-02 11:21:00,710
3,571034,22797,CHEST OF DRAWERS GINGHAM HEART,4.0,2011-10-13 12:47:00,16.95,12359,Cyprus,930.0,C580165,-2,2011-12-02 11:21:00,711
4,540946,22720,SET OF 3 CAKE TINS PANTRY DESIGN,3.0,2011-01-12 12:43:00,4.95,12359,Cyprus,698.0,C580165,-1,2011-12-02 11:21:00,903
6,544203,22629,SPACEBOY LUNCH BOX,12.0,2011-02-17 10:30:00,1.95,12362,Belgium,1125.0,C544902,-1,2011-02-24 13:05:00,1155
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7335,551507,22969,HOMEMADE JAM SCENTED CANDLES,12.0,2011-04-28 18:11:00,1.45,18272,United Kingdom,400519.0,C552720,-2,2011-05-11 09:49:00,400561
7337,551507,22204,MILK PAN BLUE POLKADOT,4.0,2011-04-28 18:11:00,3.75,18272,United Kingdom,400604.0,C552720,-1,2011-05-11 09:49:00,400564
7338,572990,23401,RUSTIC MIRROR WITH LACE HEART,2.0,2011-10-27 10:54:00,6.25,18276,United Kingdom,400718.0,C577386,-1,2011-11-18 16:54:00,400713
7339,572990,23401,RUSTIC MIRROR WITH LACE HEART,2.0,2011-10-27 10:54:00,6.25,18276,United Kingdom,400718.0,C577390,-1,2011-11-18 17:01:00,400715


Purchase Quantity < Return Quantity 	: 155


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
0,572061,22779,WOODEN OWLS LIGHT GARLAND,2.0,2011-10-20 12:53:00,4.25,12474,Germany,9928.0,C574061,-12,2011-11-02 14:18:00,9941
1,546365,22423,REGENCY CAKESTAND 3 TIER,1.0,2011-03-11 11:35:00,12.75,12520,Germany,13130.0,C546886,-2,2011-03-17 18:13:00,13149
2,573077,M,Manual,1.0,2011-10-27 14:13:00,4161.06,12536,France,14191.0,C573079,-2,2011-10-27 14:15:00,14215
3,574506,23085,ANTIQUE SILVER BAUBLE LAMP,3.0,2011-11-04 13:24:00,10.40,12577,France,16951.0,C574512,-6,2011-11-04 13:28:00,16957
4,570919,22847,BREAD BIN DINER STYLE IVORY,2.0,2011-10-13 10:57:00,16.95,12584,Italy,17595.0,C579785,-3,2011-11-30 15:29:00,17624
...,...,...,...,...,...,...,...,...,...,...,...,...,...
150,556218,22423,REGENCY CAKESTAND 3 TIER,2.0,2011-06-09 14:18:00,12.75,17731,United Kingdom,362662.0,C558110,-4,2011-06-26 15:47:00,362606
151,556219,22423,REGENCY CAKESTAND 3 TIER,2.0,2011-06-09 14:19:00,12.75,17731,United Kingdom,362677.0,C558110,-4,2011-06-26 15:47:00,362606
152,574034,22947,WOODEN ADVENT CALENDAR RED,1.0,2011-11-02 12:45:00,7.95,17841,United Kingdom,372935.0,C574524,-2,2011-11-04 13:53:00,372258
153,572306,22947,WOODEN ADVENT CALENDAR RED,1.0,2011-10-23 15:11:00,7.95,17841,United Kingdom,372997.0,C574524,-2,2011-11-04 13:53:00,372258


Purchase Quantity < Return Quantity 	: 13


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
0,537195,21258,VICTORIAN SEWING BOX LARGE,8.0,2010-12-05 13:55:00,10.95,15311,United Kingdom,210199.0,540157,-2,2011-01-05 11:41:00,209153
1,538076,21258,VICTORIAN SEWING BOX LARGE,7.0,2010-12-09 14:15:00,10.95,15311,United Kingdom,209711.0,540157,-2,2011-01-05 11:41:00,209153
2,540345,22423,REGENCY CAKESTAND 3 TIER,9.0,2011-01-06 13:19:00,10.95,14299,United Kingdom,131614.0,541114,-6,2011-01-13 15:19:00,131490
3,540802,22423,REGENCY CAKESTAND 3 TIER,16.0,2011-01-11 12:29:00,10.95,15189,United Kingdom,203414.0,542134,-10,2011-01-25 16:36:00,203340
4,537040,22423,REGENCY CAKESTAND 3 TIER,48.0,2010-12-05 10:27:00,10.95,13089,United Kingdom,52741.0,551181,-3,2011-04-27 08:17:00,52166
5,543976,22423,REGENCY CAKESTAND 3 TIER,67.0,2011-02-14 15:26:00,10.95,13089,United Kingdom,52259.0,551181,-3,2011-04-27 08:17:00,52166
6,552904,71477,COLOUR GLASS. STAR T-LIGHT HOLDER,428.0,2011-05-12 11:07:00,2.75,16013,United Kingdom,254237.0,556718,-120,2011-06-14 11:02:00,254128
7,536969,71477,COLOUR GLASS. STAR T-LIGHT HOLDER,192.0,2010-12-03 13:10:00,2.75,16013,United Kingdom,254238.0,556718,-120,2011-06-14 11:02:00,254128
8,543056,22423,REGENCY CAKESTAND 3 TIER,32.0,2011-02-03 10:47:00,10.95,12471,Germany,8739.0,559298,-4,2011-07-07 12:38:00,8484
9,541093,22423,REGENCY CAKESTAND 3 TIER,48.0,2011-01-13 13:21:00,10.95,12471,Germany,8770.0,559298,-4,2011-07-07 12:38:00,8484


Purchase Quantity < Return Quantity 	: 19


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
64,,M,Manual,,NaT,1241.98,12757,Portugal,,C554154,-1,2011-05-23 11:24:00,35027
138,,20725,LUNCH BAG RED RETROSPOT,,NaT,1.65,13113,United Kingdom,,C570221,-1,2011-10-09 12:56:00,56020
164,,22796,PHOTO FRAME 3 CLASSIC HANGING,,NaT,9.95,13148,United Kingdom,,C542604,-3,2011-01-30 12:35:00,58626
204,,M,Manual,,NaT,550.64,13564,United Kingdom,,C560408,-1,2011-07-18 14:24:00,81935
210,,79323P,PINK CHERRY LIGHTS,,NaT,6.75,13672,United Kingdom,,C540634,-4,2011-01-10 12:02:00,87621
411,,23155,KNICKERBOCKERGLORY MAGNET ASSORTED,,NaT,0.83,14339,United Kingdom,,C550168,-1,2011-04-14 16:41:00,133301
578,,90185C,BLACK DIAMANTE EXPANDABLE RING,,NaT,4.25,14911,EIRE,,C539221,-4,2010-12-16 12:56:00,183800
579,,90185B,AMETHYST DIAMANTE EXPANDABLE RING,,NaT,4.25,14911,EIRE,,C539221,-3,2010-12-16 12:56:00,183801
640,,22990,COTTON APRON PANTRY DESIGN,,NaT,4.95,15201,United Kingdom,,C562802,-2,2011-08-09 14:41:00,204080
641,,22990,COTTON APRON PANTRY DESIGN,,NaT,4.95,15201,United Kingdom,,C562848,-4,2011-08-10 09:35:00,204081
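
Returns without any matching purchase are found above by testing `InvoiceNo_completed` for nulls. As an alternative sketch (not what the notebook does), `pd.merge` can label the origin of each row itself through `indicator=True`. The frames below are made up for illustration:

```python
import pandas as pd

purchases = pd.DataFrame({'StockCode': ['A'], 'CustomerID': [1], 'Quantity': [5]})
returns = pd.DataFrame({'StockCode': ['A', 'B'], 'CustomerID': [1, 1],
                        'Quantity': [-1, -3]})

# indicator=True adds a '_merge' column: 'both' or 'right_only' for a right join.
pairs = pd.merge(purchases, returns, how='right',
                 on=['StockCode', 'CustomerID'],
                 suffixes=('_completed', '_canceled'), indicator=True)

# Returns that found no purchase row at all.
unmatched = pairs[pairs['_merge'] == 'right_only']
print(unmatched.StockCode.tolist())
```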


In [144]:
display(dataReturnedQtyEQ[dataReturnedQtyEQ.idx_canceled==9284])
display(dataReturnedQtyLT[dataReturnedQtyLT.idx_canceled==9284])
display(dataReturnedQtyMT[dataReturnedQtyMT.idx_canceled==9284])
display(dataReturnedQtyUnknown[dataReturnedQtyUnknown.idx_canceled==9284])
display(dataReturnedQtyEQ[dataReturnedQtyEQ.idx_completed==9284])
display(dataReturnedQtyLT[dataReturnedQtyLT.idx_completed==9284])
display(dataReturnedQtyMT[dataReturnedQtyMT.idx_completed==9284])
display(dataReturnedQtyUnknown[dataReturnedQtyUnknown.idx_completed==9284])

Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
608,537201,22467,GUMBALL COAT RACK,12.0,2010-12-05 14:19:00,2.55,12472,Germany,9027.0,C575064,-8,2011-11-08 12:39:00,9284
611,542215,22467,GUMBALL COAT RACK,18.0,2011-01-26 12:27:00,2.55,12472,Germany,9198.0,C575064,-8,2011-11-08 12:39:00,9284


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
609,561037,22467,GUMBALL COAT RACK,6.0,2011-07-24 11:55:00,2.55,12472,Germany,9048.0,C575064,-8,2011-11-08 12:39:00,9284
610,556578,22467,GUMBALL COAT RACK,6.0,2011-06-13 14:13:00,2.55,12472,Germany,9160.0,C575064,-8,2011-11-08 12:39:00,9284


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled


###### **Returned Qty < Purchase Qty**

In [79]:
dataReturnedQtyLT

Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
19,540946,22666,RECIPE BOX PANTRY YELLOW DESIGN,6.0,2011-01-12 12:43:00,2.95,12359,Cyprus,696.0,C549955,-2,2011-04-13 13:38:00,684
20,543370,22666,RECIPE BOX PANTRY YELLOW DESIGN,6.0,2011-02-07 14:51:00,2.95,12359,Cyprus,726.0,C549955,-2,2011-04-13 13:38:00,684
22,571034,23245,SET OF 3 REGENCY CAKE TINS,4.0,2011-10-13 12:47:00,4.95,12359,Cyprus,882.0,C580165,-2,2011-12-02 11:21:00,710
23,571034,22797,CHEST OF DRAWERS GINGHAM HEART,4.0,2011-10-13 12:47:00,16.95,12359,Cyprus,930.0,C580165,-2,2011-12-02 11:21:00,711
24,540946,22720,SET OF 3 CAKE TINS PANTRY DESIGN,3.0,2011-01-12 12:43:00,4.95,12359,Cyprus,698.0,C580165,-1,2011-12-02 11:21:00,903
...,...,...,...,...,...,...,...,...,...,...,...,...,...
21115,549185,22969,HOMEMADE JAM SCENTED CANDLES,24.0,2011-04-07 09:35:00,1.45,18272,United Kingdom,400583.0,C552720,-2,2011-05-11 09:49:00,400561
21116,551507,22204,MILK PAN BLUE POLKADOT,4.0,2011-04-28 18:11:00,3.75,18272,United Kingdom,400604.0,C552720,-1,2011-05-11 09:49:00,400564
21128,572990,23401,RUSTIC MIRROR WITH LACE HEART,2.0,2011-10-27 10:54:00,6.25,18276,United Kingdom,400718.0,C577386,-1,2011-11-18 16:54:00,400713
21129,572990,23401,RUSTIC MIRROR WITH LACE HEART,2.0,2011-10-27 10:54:00,6.25,18276,United Kingdom,400718.0,C577390,-1,2011-11-18 17:01:00,400715


In [90]:
dataReturnedQtyLT[dataReturnedQtyLT.idx_canceled==684]

Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,idx_completed,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled,idx_canceled
19,540946,22666,RECIPE BOX PANTRY YELLOW DESIGN,6.0,2011-01-12 12:43:00,2.95,12359,Cyprus,696.0,C549955,-2,2011-04-13 13:38:00,684
20,543370,22666,RECIPE BOX PANTRY YELLOW DESIGN,6.0,2011-02-07 14:51:00,2.95,12359,Cyprus,726.0,C549955,-2,2011-04-13 13:38:00,684


In [112]:
rm = dataReturnedQtyLT.copy()
rm['Quantity'] = rm.Quantity_completed - np.abs(rm.Quantity_canceled)
newQty = pd.DataFrame(list(zip(rm.InvoiceNo_completed,
                      rm.StockCode,
                      rm.Description,
                      rm.Quantity,
                      rm.InvoiceDate_completed,
                      rm.UnitPrice,
                      rm.CustomerID,
                      rm.Country,
                      rm.idx_completed)), columns=dataIdx.columns)
newQty.Quantity = newQty.Quantity.astype(int)
newQty.idx = newQty.idx.astype(int)
newQty

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,idx
0,540946,22666,RECIPE BOX PANTRY YELLOW DESIGN,4,2011-01-12 12:43:00,2.95,12359,Cyprus,696
1,543370,22666,RECIPE BOX PANTRY YELLOW DESIGN,4,2011-02-07 14:51:00,2.95,12359,Cyprus,726
2,571034,23245,SET OF 3 REGENCY CAKE TINS,2,2011-10-13 12:47:00,4.95,12359,Cyprus,882
3,571034,22797,CHEST OF DRAWERS GINGHAM HEART,2,2011-10-13 12:47:00,16.95,12359,Cyprus,930
4,540946,22720,SET OF 3 CAKE TINS PANTRY DESIGN,2,2011-01-12 12:43:00,4.95,12359,Cyprus,698
...,...,...,...,...,...,...,...,...,...
13232,549185,22969,HOMEMADE JAM SCENTED CANDLES,22,2011-04-07 09:35:00,1.45,18272,United Kingdom,400583
13233,551507,22204,MILK PAN BLUE POLKADOT,3,2011-04-28 18:11:00,3.75,18272,United Kingdom,400604
13234,572990,23401,RUSTIC MIRROR WITH LACE HEART,1,2011-10-27 10:54:00,6.25,18276,United Kingdom,400718
13235,572990,23401,RUSTIC MIRROR WITH LACE HEART,1,2011-10-27 10:54:00,6.25,18276,United Kingdom,400718


In [113]:
dataIdx.drop(rm.idx_canceled.unique(), inplace=True)
# dataIdx.drop(rm.idx_completed.unique(), inplace=True)
# dataIdx = pd.concat([dataIdx, newQty])
dataIdx

KeyError: '[4034, 4143, 4158, ..., 393221, 395577, 397257] not found in axis'
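
`DataFrame.drop` raises this `KeyError` when some of the labels are not in the index, presumably here because those return rows had already been dropped by an earlier cell. Two ways to make the call tolerant of missing labels, sketched on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'Quantity': [6, -2, 4]}, index=[10, 11, 12])
to_drop = [11, 99]  # 99 is not (or no longer) in the index

# Option 1: let drop() skip labels it cannot find.
a = df.drop(to_drop, errors='ignore')

# Option 2: drop only the labels that are still present.
b = df.drop(df.index.intersection(to_drop))

print(a.index.tolist(), b.index.tolist())
```

Either form leaves rows 10 and 12. Silently ignoring missing labels can also hide a real mismatch, so checking how many labels were absent beforehand may be worthwhile.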

In [85]:
dataProfile(dataIdx)

Dimensions	: (390204, 9)
Data Size	: 148.65 MB
Duplicated Data	: 1490


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,idx
110,543541,37449,CERAMIC CAKE STAND + HANGING CAKES,1,2011-02-09 14:44:00,9.95,12462,Spain,8191
111,577606,37449,CERAMIC CAKE STAND + HANGING CAKES,1,2011-11-21 09:11:00,9.95,12462,Spain,8220
116,543541,22063,CERAMIC BOWL WITH STRAWBERRY DESIGN,5,2011-02-09 14:44:00,2.95,12462,Spain,8164
117,577606,22063,CERAMIC BOWL WITH STRAWBERRY DESIGN,5,2011-11-21 09:11:00,2.95,12462,Spain,8219
269,546920,22649,STRAWBERRY FAIRY CAKE TEAPOT,7,2011-03-18 09:55:00,4.95,12471,Germany,8567
...,...,...,...,...,...,...,...,...,...
13187,567148,22180,RETROSPOT LAMP,5,2011-09-16 15:23:00,9.95,18225,United Kingdom,398296
13188,553915,22180,RETROSPOT LAMP,2,2011-05-19 19:51:00,9.95,18225,United Kingdom,398335
13189,563733,22180,RETROSPOT LAMP,1,2011-08-18 17:57:00,9.95,18225,United Kingdom,398402
13214,562732,21314,SMALL GLASS HEART TRINKET POT,3,2011-08-09 10:19:00,2.10,18248,United Kingdom,399852


REVIEW


Unnamed: 0,dtype,count_of_null,null_ratio,count_of_distinct,distinct_value
InvoiceNo,object,0,0.0,18269,"[562032, 542237, 573511, 556201, 549222, 53762..."
StockCode,object,0,0.0,3650,"[21578, 47559B, 21154, 21041, 21035, 22423, 84..."
Description,object,0,0.0,3862,"[WOODLAND DESIGN COTTON TOTE BAG, TEA TIME OV..."
Quantity,int64,0,0.0,345,"[6, 10, 3, 12, 4, 8, 24, 20, 2, 18, 36, 48, 16..."
InvoiceDate,datetime64[ns],0,0.0,17045,"[2011-08-02T08:48:00.000000000, 2011-01-26T14:..."
UnitPrice,float64,0,0.0,404,"[2.25, 1.25, 2.95, 12.75, 4.25, 0.42, 1.65, 3...."
CustomerID,object,0,0.0,4325,"[12347, 12348, 12349, 12350, 12352, 12353, 123..."
Country,object,0,0.0,37,"[Iceland, Finland, Italy, Norway, Bahrain, Spa..."
idx,int64,0,0.0,386997,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1..."


Statistical Numerics


Unnamed: 0,Quantity,UnitPrice,idx
count,390204.0,390204.0,390204.0
mean,12.57607,2.980172,200987.093897
std,41.101419,9.848664,115904.783307
min,1.0,0.001,2.0
25%,2.0,1.25,100242.75
50%,6.0,1.95,201193.5
75%,12.0,3.75,301456.25
max,4300.0,2500.0,401563.0


Statistical Categorics


Unnamed: 0,InvoiceNo,StockCode,Description,CustomerID,Country
count,390204,390204,390204,390204,390204
unique,18269,3650,3862,4325,37
top,576339,22423,REGENCY CAKESTAND 3 TIER,17841,United Kingdom
freq,542,2012,2012,7294,347151


PREVIEW head(3)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,idx
2,562032,21578,WOODLAND DESIGN COTTON TOTE BAG,6,2011-08-02 08:48:00,2.25,12347,Iceland,2
3,542237,47559B,TEA TIME OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland,3
4,542237,21154,RED RETROSPOT OVEN GLOVE,10,2011-01-26 14:30:00,1.25,12347,Iceland,4


In [None]:
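# Split rows into purchases (positive quantity) and cancellations (negative quantity)
dataCompleted = data[data.Quantity > 0]
dataCanceled = data[data.Quantity < 0]

# Pair each purchase with every cancellation of the same item, customer, country and price
dataReturned = pd.merge(dataCompleted, dataCanceled, how='inner',
                        on=['StockCode', 'Description', 'CustomerID', 'Country', 'UnitPrice'],
                        suffixes=('_completed', '_canceled'))

# "Before checkout": the cancellation is dated on or before the purchase
dataReturnedABeforeCheckout = dataReturned[dataReturned['InvoiceDate_completed'] >= dataReturned['InvoiceDate_canceled']]
# "After checkout": the cancellation is dated after the purchase
dataReturnedAAfterCheckout = dataReturned[dataReturned['InvoiceDate_completed'] < dataReturned['InvoiceDate_canceled']]

print(line)
print(f'Returned Transactions\t\t\t: {len(dataReturned)}')
print(line)
print(f'Returned Transactions Before Checkout\t: {len(dataReturnedABeforeCheckout)}')
print('Samples:')
# Sample at most 3 rows so that a result with fewer rows does not raise ValueError
display(dataReturnedABeforeCheckout.sample(min(3, len(dataReturnedABeforeCheckout))))
print(line)
print(f'Returned Transactions After Checkout\t: {len(dataReturnedAAfterCheckout)}')
print('Samples:')
display(dataReturnedAAfterCheckout.sample(min(3, len(dataReturnedAAfterCheckout))))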

Returned Transactions			: 19816
Returned Transactions Before Checkout	: 6629
Samples:


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled
10202,567899,21471,STRAWBERRY RAFFIA FOOD COVER,6,2011-09-22 16:26:00,3.75,14911,EIRE,C564759,-2,2011-08-30 10:40:00
10634,550186,22698,PINK REGENCY TEACUP AND SAUCER,12,2011-04-14 18:28:00,2.95,14051,United Kingdom,C548485,-2,2011-03-31 12:50:00
9500,580362,21314,SMALL GLASS HEART TRINKET POT,8,2011-12-02 16:30:00,2.1,13884,United Kingdom,C545823,-3,2011-03-07 12:54:00


Returned Transactions After Checkout	: 13187
Samples:


Unnamed: 0,InvoiceNo_completed,StockCode,Description,Quantity_completed,InvoiceDate_completed,UnitPrice,CustomerID,Country,InvoiceNo_canceled,Quantity_canceled,InvoiceDate_canceled
11784,549835,37340,MULTICOLOUR SPRING FLOWER MUG,48,2011-04-12 13:24:00,0.39,17511,United Kingdom,C559136,-1,2011-07-06 13:21:00
8686,544301,21067,VINTAGE RED TEATIME MUG,2,2011-02-17 12:59:00,1.25,14606,United Kingdom,C545836,-1,2011-03-07 13:19:00
17995,567874,23239,SET OF 4 KNICK KNACK TINS POPPIES,6,2011-09-22 14:26:00,4.15,13055,United Kingdom,C569970,-1,2011-10-06 18:57:00
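
The counts above come from an inner merge, which produces one row per (purchase, cancellation) pair, not one row per cancellation. A small sketch on made-up rows (not taken from the dataset) shows one cancellation matching two purchases of the same item and so landing in both groups. The helper `safe_sample` is hypothetical and only guards against sampling more rows than exist.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "1002", "C2001", "1003"],
    "StockCode": ["A", "A", "A", "B"],
    "Description": ["MUG", "MUG", "MUG", "LAMP"],
    "Quantity": [6, 2, -1, 5],
    "InvoiceDate": pd.to_datetime([
        "2011-02-01 10:00", "2011-04-01 10:00",
        "2011-03-01 10:00", "2011-05-01 10:00",
    ]),
    "UnitPrice": [1.25, 1.25, 1.25, 9.95],
    "CustomerID": ["14606", "14606", "14606", "18225"],
    "Country": ["United Kingdom"] * 4,
})

keys = ["StockCode", "Description", "CustomerID", "Country", "UnitPrice"]
completed = toy[toy.Quantity > 0]
canceled = toy[toy.Quantity < 0]

# One output row per matching (purchase, cancellation) pair.
pairs = pd.merge(completed, canceled, how="inner", on=keys,
                 suffixes=("_completed", "_canceled"))

# Purchase dated before the cancellation vs. on or after it.
canceled_later = pairs[pairs.InvoiceDate_completed < pairs.InvoiceDate_canceled]
canceled_earlier = pairs[pairs.InvoiceDate_completed >= pairs.InvoiceDate_canceled]

def safe_sample(frame, n=3):
    """Hypothetical helper: sample at most n rows, never more than exist."""
    return frame.sample(min(n, len(frame)))

print(len(pairs), len(canceled_later), len(canceled_earlier))
print(safe_sample(canceled_later))
```

Here the single cancellation `C2001` pairs with both purchases `1001` and `1002`, so it is counted once in each group. The totals reported by the notebook cell may therefore exceed the number of distinct cancellation rows.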
