### Market Basket Analysis - Record Linkage
### Finding similar products in a list of items offered on a website.

#### Introduction:
In a product catalog dataset, or in one scraped from a shopping site, it is very common to find near-identical items that share the same characteristics, manufacturer, or model and differ only slightly in their descriptions. In such cases we may need to group them for analysis, or simply identify them. In this exercise we will not group anything; we will identify the items and find which of them have similar descriptions.

##### This exercise uses a dataset of products sold on Amazon, available on Kaggle at: https://www.kaggle.com/promptcloud/amazon-product-dataset-2020

##### Step 1 - Unzip the file

In [1]:
import zipfile

with zipfile.ZipFile("marketing_sample_for_amazon_com-ecommerce__20200101_20200131__10k_data.csv.zip","r") as zip_ref:
    zip_ref.extractall("mba-dataset")

##### Step 2 - Load the data into a DataFrame with pandas

In [2]:
import pandas as pd

In [3]:
# display setting to show the full content of each cell
pd.set_option('display.max_colwidth', None)

# load the CSV file into the DataFrame (forward slash avoids the invalid "\m" escape and works on any OS)
df = pd.read_csv('mba-dataset/marketing_sample_for_amazon_com-ecommerce__20200101_20200131__10k_data.csv')

In [4]:
# number of rows and columns in the DataFrame
df.shape

(10002, 28)

In [5]:
# let's look at the columns in this dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10002 entries, 0 to 10001
Data columns (total 28 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Uniq Id                10002 non-null  object 
 1   Product Name           10002 non-null  object 
 2   Brand Name             0 non-null      float64
 3   Asin                   0 non-null      float64
 4   Category               9172 non-null   object 
 5   Upc Ean Code           34 non-null     object 
 6   List Price             0 non-null      float64
 7   Selling Price          9895 non-null   object 
 8   Quantity               0 non-null      float64
 9   Model Number           8232 non-null   object 
 10  About Product          9729 non-null   object 
 11  Product Specification  8370 non-null   object 
 12  Technical Details      9212 non-null   object 
 13  Shipping Weight        8864 non-null   object 
 14  Product Dimensions     479 non-null    object 
 15  Im

In [6]:
# now let's look at the first and last rows
df

Unnamed: 0,Uniq Id,Product Name,Brand Name,Asin,Category,Upc Ean Code,List Price,Selling Price,Quantity,Model Number,...,Product Url,Stock,Product Details,Dimensions,Color,Ingredients,Direction To Use,Is Amazon Seller,Size Quantity Variant,Product Description
0,4c69b61db1fc16e7013b43fc926e502d,"DB Longboards CoreFlex Crossbow 41"" Bamboo Fiberglass Longboard Complete",,,"Sports & Outdoors | Outdoor Recreation | Skates, Skateboards & Scooters | Skateboarding | Standard Skateboards & Longboards | Longboards",,,$237.68,,,...,https://www.amazon.com/DB-Longboards-CoreFlex-Fiberglass-Longboard/dp/B07KMVJJK7,,,,,,,Y,,
1,66d49bbed043f5be260fa9f7fbff5957,"Electronic Snap Circuits Mini Kits Classpack, FM Radio, Motion Detector, Music Box (Set of 5)",,,Toys & Games | Learning & Education | Science Kits & Toys,,,$99.95,,55324,...,https://www.amazon.com/Electronic-Circuits-Classpack-Motion-Detector/dp/B008AK6DAS,,,,,,,Y,,
2,2c55cae269aebf53838484b0d7dd931a,"3Doodler Create Flexy 3D Printing Filament Refill Bundle (X5 Pack, Over 1000'. of Extruded Plastics! - Innovate",,,Toys & Games | Arts & Crafts | Craft Kits,,,$34.99,,,...,https://www.amazon.com/3Doodler-Plastic-Innovate-Filament-Refills/dp/B07D36747F,,,,,,,Y,,
3,18018b6bc416dab347b1b7db79994afa,Guillow Airplane Design Studio with Travel Case Building Kit,,,Toys & Games | Hobbies | Models & Model Kits | Model Kits | Airplane & Jet Kits,,,$28.91,,142,...,https://www.amazon.com/Guillow-Airplane-Design-Studio-Building/dp/B076Y2SNHM,,,,,,,Y,,
4,e04b990e95bf73bbe6a3fa09785d7cd0,Woodstock- Collage 500 pc Puzzle,,,Toys & Games | Puzzles | Jigsaw Puzzles,,,$17.49,,62151,...,https://www.amazon.com/Woodstock-Collage-500-pc-Puzzle/dp/B07MX21WWX,,,,,,,Y,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9997,1a22f23576bfdfe5ed6c887dc117aab6,"Remedia Publications REM536B Money Activity Book, Grade: 3 to 4, 8.5"" Wide, 11"" Length, 0.4"" Height",,,Toys & Games | Learning & Education | Counting & Math Toys,,,$9.31,,REM536B,...,https://www.amazon.com/Remedia-Publications-REM536B-Money-Activity/dp/B000F8XIZ6,,,,,,,Y,,
9998,e11514dcf1f087887cd5ea0bd646d1fc,Trends International NFL La Chargers HG - Mobile Wallet,,,Toys & Games | Arts & Crafts,,,$6.99,,,...,https://www.amazon.com/Trends-International-NFL-Chargers-HG/dp/B07PJ181TC,,,,,,,Y,,
9999,c00301a38560da2abc89c1f86ce4b267,"NewPath Learning 10 Piece Science Owls and Owl Pellets Curriculum Mastery Flip Chart Set, Grade 5-9",,,Office Products | Office & School Supplies | Education & Crafts | Classroom Science Supplies,,,$37.95,,34-6015,...,https://www.amazon.com/NewPath-Learning-Science-Pellets-Curriculum/dp/B00DOG823Y,,,,,,,Y,,
10000,c2928dbf9796ceba44863a2736afb405,Disney Princess Do It Yourself Braid Set,,,Toys & Games | Arts & Crafts | Craft Kits,,,$3.58,,2888PRST,...,https://www.amazon.com/Disney-Princess-Yourself-Braid-Set/dp/B076D3P6SW,,,,,,,Y,,


Several columns have no values or are not relevant to our analysis, so we can simply leave them out for now.
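Empty columns can also be listed programmatically instead of being read off `df.info()`. The sketch below runs on a small made-up DataFrame (the `toy` frame and its values are illustrative assumptions, not rows from the dataset); the same two lines should apply to `df`.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Uniq Id": ["a1", "b2", "c3"],
    "Product Name": ["Puzzle 500 pc", "Puzzle 1000 pc", "Longboard"],
    "Brand Name": [None, None, None],      # entirely empty
    "Upc Ean Code": [None, "0123", None],  # mostly empty
})

# Fraction of missing values per column
missing_ratio = toy.isna().mean()

# Columns that contain no values at all
empty_columns = missing_ratio[missing_ratio == 1.0].index.tolist()
print(empty_columns)
```

Sorting `missing_ratio` in descending order gives a quick view of which columns are too sparse to be useful.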

##### Step 3 - Adjust the DataFrame

In [7]:
# select the columns needed for the analysis

columns = ['Uniq Id','Product Name','Category','Selling Price','Product Url']
df_limpo = df[columns]
df_limpo.head()

Unnamed: 0,Uniq Id,Product Name,Category,Selling Price,Product Url
0,4c69b61db1fc16e7013b43fc926e502d,"DB Longboards CoreFlex Crossbow 41"" Bamboo Fiberglass Longboard Complete","Sports & Outdoors | Outdoor Recreation | Skates, Skateboards & Scooters | Skateboarding | Standard Skateboards & Longboards | Longboards",$237.68,https://www.amazon.com/DB-Longboards-CoreFlex-Fiberglass-Longboard/dp/B07KMVJJK7
1,66d49bbed043f5be260fa9f7fbff5957,"Electronic Snap Circuits Mini Kits Classpack, FM Radio, Motion Detector, Music Box (Set of 5)",Toys & Games | Learning & Education | Science Kits & Toys,$99.95,https://www.amazon.com/Electronic-Circuits-Classpack-Motion-Detector/dp/B008AK6DAS
2,2c55cae269aebf53838484b0d7dd931a,"3Doodler Create Flexy 3D Printing Filament Refill Bundle (X5 Pack, Over 1000'. of Extruded Plastics! - Innovate",Toys & Games | Arts & Crafts | Craft Kits,$34.99,https://www.amazon.com/3Doodler-Plastic-Innovate-Filament-Refills/dp/B07D36747F
3,18018b6bc416dab347b1b7db79994afa,Guillow Airplane Design Studio with Travel Case Building Kit,Toys & Games | Hobbies | Models & Model Kits | Model Kits | Airplane & Jet Kits,$28.91,https://www.amazon.com/Guillow-Airplane-Design-Studio-Building/dp/B076Y2SNHM
4,e04b990e95bf73bbe6a3fa09785d7cd0,Woodstock- Collage 500 pc Puzzle,Toys & Games | Puzzles | Jigsaw Puzzles,$17.49,https://www.amazon.com/Woodstock-Collage-500-pc-Puzzle/dp/B07MX21WWX


##### Step 4 - Find items with similar descriptions

"recordlinkage" is the main library in this exercise; we will use its functions to run the comparisons that lead to our result.

In [8]:
import recordlinkage

First we create a copy of our cleaned DataFrame and call it df1.

In [9]:
df1 = df_limpo.copy()

We will use the recordlinkage.Index class with its block() and index() methods. The expected effect is that each product is compared only with other products in the same category, stored here in the "Category" column.

In [10]:
indexer = recordlinkage.Index()
indexer.block(left_on='Category', right_on='Category')
grupo = indexer.index(df1)
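To make the effect of blocking concrete, the sketch below reproduces the idea with plain pandas on a made-up five-row frame. It is an illustration of the concept, not the library's implementation, and the `toy` data and variable names are assumptions.

```python
import pandas as pd

# Made-up products, for illustration only
toy = pd.DataFrame({
    "Product Name": ["Puzzle A", "Puzzle B", "Scooter skin X", "Scooter skin Y", "Puzzle C"],
    "Category": ["Puzzles", "Puzzles", "Scooters", "Scooters", "Puzzles"],
})

# Self-join on Category, keeping the original row positions as columns
left = toy.reset_index().rename(columns={"index": "level_0"})
right = toy.reset_index().rename(columns={"index": "level_1"})
pairs = left.merge(right, on="Category")

# Keep each unordered pair once and drop pairs of a row with itself
pairs = pairs[pairs["level_0"] > pairs["level_1"]]

candidate_pairs = list(zip(pairs["level_0"], pairs["level_1"]))
n = len(toy)
print(len(candidate_pairs), "candidate pairs instead of", n * (n - 1) // 2)
```

Only rows that share a category are paired, so the number of comparisons grows with the size of each category rather than with the size of the whole dataset.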

The next cell performs the comparison. In this example we flag the products whose names have a similarity score of 0.90 (90%) or higher when compared with the name of another product.

In [11]:
comparacao = recordlinkage.Compare()
comparacao.exact('Category','Category',label='Category_Match')

# set threshold to 0.90 so that only names at least 90% similar count as a match
comparacao.string('Product Name','Product Name',threshold=0.90,label='Product_Name_Match')
features = comparacao.compute(grupo, df1)    
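What a similarity threshold of 0.90 means can be sketched with a normalized edit distance. This is a minimal pure-Python illustration under two assumptions: that the string comparison relies on a Levenshtein-style measure (the default in recent recordlinkage versions, as far as I know), and that the distance is rescaled by the length of the longer string. The function names are hypothetical and the library's exact scoring may differ.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


# Shortened versions of two product names from the dataset
name_a = "Razor A Kick Scooter - Color Bugs"
name_b = "Razor A Kick Scooter - Geo Tile"
score = similarity(name_a, name_b)
print(round(score, 2), score >= 0.90)
```

Long names that differ only in a short suffix, such as a color or pattern, score close to 1.0 under this measure, which is why product variants with long shared descriptions pass the threshold.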

In [12]:
# score 2 counts the pairs that match on both category and product name (similar items); score 1 counts the pairs that match on category only
features.sum(axis=1).value_counts().sort_index(ascending=False)

2.0       869
1.0    356609
dtype: int64
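The way the score is built can be seen on a tiny hand-made feature table. The pairs and values below are hypothetical; the point is only that each row is a candidate pair, each column is a 0/1 comparison result, and the row sum is the score.

```python
import pandas as pd

# Hypothetical candidate pairs (row positions of the two products)
index = pd.MultiIndex.from_tuples([(1, 0), (2, 0), (2, 1), (3, 2)])

toy_features = pd.DataFrame(
    {
        "Category_Match": [1, 1, 1, 1],                # always 1 after blocking on Category
        "Product_Name_Match": [1.0, 0.0, 0.0, 1.0],    # 1.0 when the names pass the threshold
    },
    index=index,
)

# Row-wise sum: 2.0 means both comparisons matched, 1.0 means category only
scores = toy_features.sum(axis=1)
score_counts = scores.value_counts().sort_index(ascending=False)
print(score_counts)
```

Because the candidate pairs were already restricted to a shared category, the category column adds the same constant to every row and the name comparison alone decides which pairs reach a score of 2.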

In [13]:
# level_0 and level_1 hold the index positions of the two items in each similar pair
potential_matches = features[features.sum(axis=1) > 1].reset_index()
potential_matches['SCORE'] = potential_matches.loc[:, 'Category_Match':'Product_Name_Match'].sum(axis=1)
potential_matches.head(2)

Unnamed: 0,level_0,level_1,Category_Match,Product_Name_Match,SCORE
0,3564,991,1,1.0,2.0
1,7213,5733,1,1.0,2.0


In [14]:
# join the matched pairs with the product data for both sides of each pair

potential_matches_lv0 = potential_matches.set_index('level_0')
df1_matches = pd.merge(df1, potential_matches_lv0, 
                          left_index=True, 
                          right_index=True)

potential_matches_lv1 = potential_matches
potential_matches_lv1['J_level_1'] = potential_matches['level_1']
potential_matches_lv1 = potential_matches.set_index('level_1')

matches = pd.merge(df1_matches,
                   df1, 
                   left_on='level_1', 
                   right_index=True,
                   how='left')
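The same pairing of left and right records can be sketched without two merges by looking up each level of the pair index directly. This is an alternative under stated assumptions, not the approach used in the cell above; `toy`, `pairs`, and `toy_matches` are hypothetical names and data.

```python
import pandas as pd

# Made-up products, for illustration only
toy = pd.DataFrame({"Product Name": ["Puzzle A", "Puzzle A!", "Longboard", "Puzzle  A"]})

# Hypothetical matched pairs, as (level_0, level_1) row positions
pairs = pd.MultiIndex.from_tuples([(1, 0), (3, 0), (3, 1)], names=["level_0", "level_1"])

# Look up the record on each side of every pair
left = toy.loc[pairs.get_level_values("level_0")].add_suffix("_x").reset_index(drop=True)
right = toy.loc[pairs.get_level_values("level_1")].add_suffix("_y").reset_index(drop=True)

# One row per pair: positions, left record, right record
toy_matches = pd.concat([pairs.to_frame(index=False), left, right], axis=1)
print(toy_matches)
```

The `_x` and `_y` suffixes mirror the ones `pd.merge` produces, so downstream filtering code can stay the same.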

In [15]:
# replace spaces with "_" in the column names to make filtering easier
matches.columns = matches.columns.str.replace(' ','_')

In [16]:
# check the size of the new dataset
matches.shape

(869, 14)

In [17]:
matches.head(2)

Unnamed: 0,Uniq_Id_x,Product_Name_x,Category_x,Selling_Price_x,Product_Url_x,level_1,Category_Match,Product_Name_Match,SCORE,Uniq_Id_y,Product_Name_y,Category_y,Selling_Price_y,Product_Url_y
828,58d2e7043725286b9d3cecc10ee7adc2,Ceaco Perfect Piece Count Puzzle - Thomas Kinkade Disney Dreams Collection - Beauty and the Beast,Toys & Games | Puzzles | Jigsaw Puzzles,$16.99,https://www.amazon.com/Ceaco-Perfect-Piece-Count-Puzzle/dp/B07FR9NLHW,205,1,1.0,2.0,0c5298272cf8b8c881bfba43f0f9821a,Ceaco Perfect Piece Count Puzzle - Thomas Kinkade Disney Dreams Collection - Beauty and the Beast,Toys & Games | Puzzles | Jigsaw Puzzles,$19.86,https://www.amazon.com/Ceaco-Perfect-Piece-Count-Puzzle/dp/B07FR9NJ7V
1128,a5b115b9d2ad3a3bee270d442b5155df,"MightySkins Skin Compatible with Razor A2 Kick Scooter - Rainbow Streaks | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA","Sports & Outdoors | Outdoor Recreation | Skates, Skateboards & Scooters | Scooters & Equipment | Accessories",,https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MJ7LJ3R,1037,1,1.0,2.0,97234d1266893b950ca56a7438d1f50b,"MightySkins Skin Compatible with Razor A Kick Scooter - Geo Tile | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA","Sports & Outdoors | Outdoor Recreation | Skates, Skateboards & Scooters | Scooters & Equipment | Accessories",,https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MF316QS


Now that we have a dataset indicating which products are similar, let's check which products have the most matches.

In [18]:
agrupamento = pd.DataFrame(matches.groupby(['Uniq_Id_x','Product_Name_x'])['level_1'].count()).reset_index()
agrupamento.sort_values(by=['level_1'], ascending=False)

Unnamed: 0,Uniq_Id_x,Product_Name_x,level_1
337,fc567bb32cc56b98811b39e56378cba0,"MightySkins Skin Compatible with Razor A Kick Scooter - Color Bugs | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",13
334,fa556f12e65d41ffefe903997caad25b,MightySkins Skin Compatible with Blade Chroma Battery Batteries (4 Pack) wrap Cover Sticker Skins Diamond Plate,12
125,638f86394e6c7e4fde07422b787899f3,MightySkins Skin Compatible with Blade Chroma Battery Batteries (4 Pack) wrap Cover Sticker Skins Drops,11
167,7ef14575db6cbf1709219b51a6a86b7a,"MightySkins Skin Compatible with Razor A5 Lux Kick Scooter - Ripped | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",11
41,1db87a20cde55df8437cdb1e0fd6ad10,"MightySkins Skin Compatible with Razor A2 Kick Scooter - Check | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",11
...,...,...,...
90,49d362caad6b1cf6e2b8f0608196346b,MightySkins Skin Compatible with Parrot Bebop Quadcopter Drone wrap Cover Sticker Skins Black Marble,1
216,a96dc2434ff485beaf25d2e6da570ad7,"AmazonBasics Easy Care Super Soft Microfiber Kid's Bed-in-a-Bag Bedding Set - Full / Queen, Multi-Color Racing Cars",1
217,aa549c9822dcb29ef81dfe2e3eaa56f4,"MightySkins Skin Compatible with Hover-1 H1 Hoverboard Scooter - Ink Hearts | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",1
218,ab792b48e348af772eb9717912c18f7e,"Little Kids Fubbles Light Up Bubble Blaster Blows tons of bubbles for Kids Includes Bubble Solution, Pink",1
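In the rows shown so far, level_0 is always greater than level_1, which suggests each pair is listed only once. If that holds, grouping by `Uniq_Id_x` counts a product only when it sits on the left side of a pair. A minimal sketch of counting both sides, on hypothetical IDs, could look like this:

```python
import pandas as pd

# Hypothetical matched pairs; each pair appears only once
toy_matches = pd.DataFrame({
    "Uniq_Id_x": ["c", "c", "b", "d"],
    "Uniq_Id_y": ["a", "b", "a", "a"],
})

# Stack both ID columns so every product is counted wherever it appears
both_sides = pd.concat([toy_matches["Uniq_Id_x"], toy_matches["Uniq_Id_y"]])
match_counts = both_sides.value_counts()
print(match_counts)
```

Here product "a" never appears on the left side, so a left-only count would miss it entirely even though it has the most matches.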


Let's display the similar products for the top item in our ranking.

In [20]:
matches[['Uniq_Id_x',
         'Uniq_Id_y',
         'Product_Name_x',
         'Product_Name_y',
         'Product_Url_x',
         'Product_Url_y']].query('Uniq_Id_x == "fc567bb32cc56b98811b39e56378cba0"')

Unnamed: 0,Uniq_Id_x,Uniq_Id_y,Product_Name_x,Product_Name_y,Product_Url_x,Product_Url_y
9698,fc567bb32cc56b98811b39e56378cba0,97234d1266893b950ca56a7438d1f50b,"MightySkins Skin Compatible with Razor A Kick Scooter - Color Bugs | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA","MightySkins Skin Compatible with Razor A Kick Scooter - Geo Tile | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MF3DTT3,https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MF316QS
9698,fc567bb32cc56b98811b39e56378cba0,a5b115b9d2ad3a3bee270d442b5155df,"MightySkins Skin Compatible with Razor A Kick Scooter - Color Bugs | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA","MightySkins Skin Compatible with Razor A2 Kick Scooter - Rainbow Streaks | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MF3DTT3,https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MJ7LJ3R
9698,fc567bb32cc56b98811b39e56378cba0,c07dc4cc0f61f940b9058af0be0bdffd,"MightySkins Skin Compatible with Razor A Kick Scooter - Color Bugs | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA","MightySkins Skin Compatible with Razor A2 Kick Scooter - Black Wall | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MF3DTT3,https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MC16RGB
9698,fc567bb32cc56b98811b39e56378cba0,3aeeb156ebe76c200d185eee72ada180,"MightySkins Skin Compatible with Razor A Kick Scooter - Color Bugs | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA","MightySkins Skin Compatible with Razor A2 Kick Scooter - Green Distortion | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MF3DTT3,https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MC15DSY
9698,fc567bb32cc56b98811b39e56378cba0,5ca683fc36fa5c971cbd6f19b1e1d6d8,"MightySkins Skin Compatible with Razor A Kick Scooter - Color Bugs | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA","MightySkins Skin Compatible with Razor A2 Kick Scooter - Black Leather | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MF3DTT3,https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MFM8R6N
9698,fc567bb32cc56b98811b39e56378cba0,12d8071b784b59278722236f9ae3dd5b,"MightySkins Skin Compatible with Razor A Kick Scooter - Color Bugs | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA","MightySkins Skin Compatible with Razor A5 Lux Kick Scooter - Splash of Color | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MF3DTT3,https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MC12R8N
9698,fc567bb32cc56b98811b39e56378cba0,71fc57fd48fcf9df9570ba761daeee01,"MightySkins Skin Compatible with Razor A Kick Scooter - Color Bugs | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA","MightySkins Skin Compatible with Razor A2 Kick Scooter - Scratched Up | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MF3DTT3,https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MC173DC
9698,fc567bb32cc56b98811b39e56378cba0,fa9a740b822b8b314da9efd7b12f2872,"MightySkins Skin Compatible with Razor A Kick Scooter - Color Bugs | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA","MightySkins Skin Compatible with Razor A2 Kick Scooter - Psychedelic | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MF3DTT3,https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MFM8PJF
9698,fc567bb32cc56b98811b39e56378cba0,a2e1dfe29ad3c515d5365f99d665bd32,"MightySkins Skin Compatible with Razor A Kick Scooter - Color Bugs | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA","MightySkins Skin Compatible with Razor A Kick Scooter - Dark Butterfly | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MF3DTT3,https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MF3Z8L6
9698,fc567bb32cc56b98811b39e56378cba0,3cc50f01eefa7a0debf44f3376a421c7,"MightySkins Skin Compatible with Razor A Kick Scooter - Color Bugs | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA","MightySkins Skin Compatible with Razor A2 Kick Scooter - Blue Swirls | Protective, Durable, and Unique Vinyl Decal wrap Cover | Easy to Apply, Remove, and Change Styles | Made in The USA",https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MF3DTT3,https://www.amazon.com/MightySkins-Skin-Compatible-Razor-Scooter/dp/B07MFM8J3V


#### Conclusion
We saw that this dataset contains 869 pairs of products in the same category whose names are at least 90% similar, and we were able to identify each of them.
This exercise showed one way to find similar items in a dataset using record linkage; many other methods exist, both within the recordlinkage library and in other libraries.