Useful regular expressions:

* | or (: affect how the regular expressions around them are interpreted
* (*, +, ?, {m,n}): repetition qualifiers --> (?:a{6})* matches any multiple of six 'a' chars
* .: matches any character except a newline (by default)
* ^: matches the start of a string
* $: matches the end of a string, or just before the newline at the end of the string
* Ranges of characters can be added:
    - [a-z] match any lowercase ASCII letter
    - special characters lose their meaning inside sets: [(+*)] matches any of the literal characters (, +, * or )
* |: alternation, A|B matches either A or B

* @[A-Za-z0-9]: matches a literal @ followed by a single letter (regardless of case) or digit (A-Z are capital letters, a-z lower case letters, and 0-9 digits)
* [^0-9A-Za-z \t]: the leading ^ negates the set, so this matches any single character that is NOT a digit (0-9), an upper case letter (A-Z), a lower case letter (a-z), a space, or a tab; punctuation such as a period, a plus sign, or an underscore therefore matches
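
As a quick check of the notes above, the sketch below runs a few of these patterns through Python's `re` module. The sample strings are made up for illustration (one reuses a `sku_name` that appears in the data later on), and the `$` behaviour shown is Python's default, without `re.MULTILINE`.

```python
import re

# (?:a{6})* matches runs of 'a' whose length is a multiple of six
six_as = re.compile(r"(?:a{6})*")
twelve_ok = six_as.fullmatch("a" * 12) is not None
seven_ok = six_as.fullmatch("a" * 7) is not None

# $ also matches just before a newline at the very end of the string
dollar_ok = re.search(r"ml$", "500ml\n") is not None

# Inside a set, ( + * ) are ordinary literal characters
specials = re.findall(r"[(+*)]", "f(x) = a+b*c")

# A literal @ followed by exactly one letter or digit
handles = re.findall(r"@[A-Za-z0-9]", "ping @bob and @7up, not @_x")

# Negated set: anything that is not a letter, digit, space or tab
leftovers = re.findall(r"[^0-9A-Za-z \t]", "Hellmann's Real Squeezy Mayonnaise 430ml")

print(twelve_ok, seven_ok, dollar_ok)  # True False True
print(specials)                        # ['(', ')', '+', '*']
print(handles)                         # ['@b', '@7']
print(leftovers)                       # ["'"]
```

A negated set like the last one is handy for spotting product names that contain punctuation or symbols.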

In [1]:
import pandas as pd
import numpy as np

In [2]:
# reading raw data
df = pd.read_csv('UK_products.csv')

In [3]:
df = df.drop(['order_id', 'quantity'], axis=1)

In [4]:
# reorganizing df
df = df.loc[:, ['type', 'parent_chain_uuid', 'parent_chain_name', 'store_uuid', 'store_name', 
                'city_id', 'city_name', 'sku_uuid', 'sku_name']]

### Multiple Level Duplication

In [6]:
# duplicate rows
duplicate_cols = ['parent_chain_name', 'store_name', 'city_id', 'sku_name']
df_duplicate = df[df.duplicated(subset=duplicate_cols)]

In [7]:
# sort values so the duplicates are easier to inspect
df_duplicate = df_duplicate.sort_values(by=duplicate_cols)

In [8]:
df_duplicate[:2]

Unnamed: 0,type,parent_chain_uuid,parent_chain_name,store_uuid,store_name,city_id,city_name,sku_uuid,sku_name
72059,Convenience,228ee517-ab53-429b-b5fb-dac07e313a38,Booker,a665ecf6-7f3d-5124-9ab6-5e27fc2cd223,Londis Store and Post Office Burgess Hill,1459,Brighton and Sussex,699ece33-f915-45db-8d76-16f494342eef,Hardys Bin 161 Sauvignon Blanc 75cl
167822,Convenience,228ee517-ab53-429b-b5fb-dac07e313a38,Booker,a665ecf6-7f3d-5124-9ab6-5e27fc2cd223,Londis Store and Post Office Burgess Hill,1459,Brighton and Sussex,9098651e-15f0-465f-92a4-9f29e843b2aa,Jolly Rancher Slush
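
With the default `keep='first'`, `duplicated()` flags only the second and later occurrences of each key, so the first copy of every duplicated row is absent from `df_duplicate`. That would explain why the two rows shown above differ in `sku_name`: each is presumably a repeat of a row that was not flagged. A minimal sketch on a made-up frame (the `toy` data and its values are hypothetical, not taken from `UK_products.csv`) shows the difference between the default and `keep=False`:

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the keep= parameter
toy = pd.DataFrame({
    "store_name": ["A", "A", "A", "B"],
    "sku_name": ["Cola", "Cola", "Crisps", "Cola"],
    "sku_uuid": ["u1", "u2", "u3", "u4"],
})
cols = ["store_name", "sku_name"]

# Default keep='first': only the repeats are flagged, not the first copy
later_copies = toy[toy.duplicated(subset=cols)]

# keep=False: every row that shares its key with another row is flagged
all_copies = toy[toy.duplicated(subset=cols, keep=False)]

print(list(later_copies.index))  # [1]
print(list(all_copies.index))    # [0, 1]
```

Using `keep=False` before sorting places every copy of a duplicated row side by side, which can make the comparison easier.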


### Chain level duplication

In [9]:
df_chain = df.loc[:, ['type', 'parent_chain_uuid', 'parent_chain_name', 'sku_uuid', 'sku_name']]

In [10]:
# as we removed columns, we need to eliminate duplicated rows
df_chain = df_chain.drop_duplicates()

In [11]:
dup_chain = ['parent_chain_name', 'sku_name']
df_chain_dup = df_chain[df_chain.duplicated(subset=dup_chain)]
df_chain_dup = df_chain_dup.sort_values(by=dup_chain).reset_index(drop=True)

In [12]:
# Note: in the following dataframe the products appear duplicated because
# the same sku_name within a chain is listed under different sku_uuid values

In [13]:
df_chain_dup.head(10)

Unnamed: 0,type,parent_chain_uuid,parent_chain_name,sku_uuid,sku_name
0,Convenience,228ee517-ab53-429b-b5fb-dac07e313a32,Booker,11361d35-08e3-59a1-ab20-59290532b380,Andrex 4 Pack
1,Convenience,228ee517-ab53-429b-b5fb-dac07e313a38,Booker,848183f3-3510-4b80-a770-91c342b97071,Coca Cola 500ml Bottle
2,Convenience,228ee517-ab53-429b-b5fb-dac07e313a32,Booker,af5c1ef2-2cc7-5268-9717-091637b67485,Doritos Cool Original Tortilla Chips 150g
3,Convenience,228ee517-ab53-429b-b5fb-dac07e313a38,Booker,b4ca36ba-bfd1-401e-ac7c-2e06762e7d25,Doritos Cool Original Tortilla Chips 150g
4,Convenience,228ee517-ab53-429b-b5fb-dac07e313a38,Booker,94b91e7b-c0e5-4493-8509-d4fb4ae25fe0,Galaxy Smooth Milk Chocolate Bar 110g
5,Convenience,228ee517-ab53-429b-b5fb-dac07e313a32,Booker,15ba6382-7f44-52c8-aae7-48036512edfa,Hardys Bin 161 Chardonnay 75cl
6,Convenience,228ee517-ab53-429b-b5fb-dac07e313a32,Booker,3a18681e-04b0-55cc-ba93-fb211ae9de7f,Hardys Bin 161 Sauvignon Blanc 75cl
7,Convenience,228ee517-ab53-429b-b5fb-dac07e313a38,Booker,46c1a677-34ff-4f4a-88db-9dd68a1342cd,Hellmann's Real Squeezy Mayonnaise 430ml
8,Convenience,228ee517-ab53-429b-b5fb-dac07e313a38,Booker,d5020c17-a062-49a0-9e20-1830d7d1b40a,Kinder Surprise Egg 20g
9,Convenience,228ee517-ab53-429b-b5fb-dac07e313a32,Booker,9baa4693-4942-571f-976f-cf21d1a9ddc8,Malteasers 110g Box


In [14]:
df_chain_dup.tail(10)

Unnamed: 0,type,parent_chain_uuid,parent_chain_name,sku_uuid,sku_name
106917,Convenience,86ce49a5-b56c-4e09-b795-d2a8fb6aac00,valli forecourts (uk parent),6dd75af1-7f39-53ea-a6f1-bbc22d5d0f70,Yazoo Banana 1 Ltr
106918,Convenience,86ce49a5-b56c-4e09-b795-d2a8fb6aac00,valli forecourts (uk parent),8b0b6d0a-0f77-5c75-ae6e-e3a8133c892c,Yazoo Banana 1 Ltr
106919,Convenience,86ce49a5-b56c-4e09-b795-d2a8fb6aac00,valli forecourts (uk parent),e48bce1c-3359-528a-85fe-54f9c80568c2,Yazoo Chocolate 1 Ltr
106920,Convenience,86ce49a5-b56c-4e09-b795-d2a8fb6aac00,valli forecourts (uk parent),1b71850a-4947-5bef-9bb2-f4ea823bf495,Yazoo Strawberry 1 Ltr
106921,Convenience,86ce49a5-b56c-4e09-b795-d2a8fb6aac00,valli forecourts (uk parent),d56ebba7-07bb-5933-8cfa-a8c7d65aa3b4,Yop Raspberry Yogurt Drink 500ML
106922,Convenience,86ce49a5-b56c-4e09-b795-d2a8fb6aac00,valli forecourts (uk parent),04c666e5-1cb7-4c23-97ba-9680ece4262b,Yop Strawberry Yogurt Drink 500ml
106923,Convenience,86ce49a5-b56c-4e09-b795-d2a8fb6aac00,valli forecourts (uk parent),20588591-a8e3-5dcf-9fbf-d831c55b5486,🏏 Game On Snacks Bundle ⚽
106924,Convenience,86ce49a5-b56c-4e09-b795-d2a8fb6aac00,valli forecourts (uk parent),a8170514-6d76-554d-8fd6-85a33a01e8e9,🏏 Game On Snacks Bundle ⚽
106925,Convenience,86ce49a5-b56c-4e09-b795-d2a8fb6aac00,valli forecourts (uk parent),1493cea0-26fe-4ab4-9a21-82eed790753d,🥪 Meal Solutions Bundle 🥗
106926,Convenience,86ce49a5-b56c-4e09-b795-d2a8fb6aac00,valli forecourts (uk parent),4f01bcdd-5787-5c6b-aad4-e8870210046b,🥪 Meal Solutions Bundle 🥗
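
One way to quantify the observation that the same `sku_name` carries several `sku_uuid` values within a chain is to count distinct identifiers per (chain, product) pair. The sketch below does this on a small made-up frame; `toy_chain`, its values, and the `n_sku_uuid` column name are hypothetical, and the same pattern could be applied to `df_chain`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as df_chain
toy_chain = pd.DataFrame({
    "parent_chain_name": ["Booker", "Booker", "Booker"],
    "sku_name": ["Cola 500ml", "Cola 500ml", "Crisps 150g"],
    "sku_uuid": ["u1", "u2", "u3"],
})

# Count the distinct sku_uuid values for each (chain, product name) pair
uuid_counts = (
    toy_chain.groupby(["parent_chain_name", "sku_name"])["sku_uuid"]
    .nunique()
    .reset_index(name="n_sku_uuid")
)

# Products listed under more than one sku_uuid within the same chain
conflicts = uuid_counts[uuid_counts["n_sku_uuid"] > 1]
print(conflicts)
```

Sorting `conflicts` by `n_sku_uuid` in descending order would surface the product names with the most identifiers first.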
