<h3 style="text-align: center;">BUILDING A RECOMMENDATION SYSTEM WITH COSINE SIMILARITY</h3>

### Problem description

A store that sells decoration products, toys, and accessories to several countries wants to find out which customers are most similar to one another and which products are most similar to one another.

The store intends to build a Recommendation System that identifies similar customers and similar products, so that it can run personalized promotions for its customers.

To that end, we will analyze a dataset of about 407 thousand rows containing orders placed by customers in different countries.

This kind of data work is interesting because it **demystifies** the common belief that only giant companies, such as Netflix and Amazon, can have a Recommendation System for their customers. In fact, **any company** with information about its customers' purchases **can have its own Recommendation System** and benefit commercially from using it.

The dataset was obtained from Kaggle:

Title: Online Retail Data Set

URL: [https://www.kaggle.com/datasets/vijayuv/onlineretail]

### Type of Recommendation System we will use

We will use **Collaborative Filtering**: this type of system identifies patterns in customers' purchases and makes recommendations based on similarities between their preferences. So, if two users have similar purchase histories, they are likely to enjoy the same kinds of products.

To build this Recommendation System, **we will use Cosine Similarity**. This measure compares two vectors based on the angle between them: the smaller the angle, the more similar the vectors. It is the cosine of that angle, which ranges from -1 to 1. Negative values can only arise when the vectors contain negative entries.

When computing the similarity between customers or between products, each customer or product is represented by its vector of purchased quantities, and a cosine close to 1 (100%) indicates high similarity.
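
As a minimal sketch of the measure described above, the cosine is the dot product of the two vectors divided by the product of their norms. The purchase vectors and the helper `cosseno` below are hypothetical, made up for illustration, and are not part of the dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical purchase vectors: quantities of three products per customer
cliente_a = np.array([2.0, 0.0, 4.0])
cliente_b = np.array([1.0, 0.0, 2.0])   # same proportions as cliente_a
cliente_c = np.array([0.0, 5.0, 0.0])   # no product in common with cliente_a

def cosseno(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosseno(cliente_a, cliente_b)   # parallel vectors, cosine close to 1
sim_ac = cosseno(cliente_a, cliente_c)   # orthogonal vectors, cosine equal to 0

# scikit-learn computes every pairwise value at once (one row per customer)
matriz = cosine_similarity(np.vstack([cliente_a, cliente_b, cliente_c]))
```

Note that `cliente_a` buys twice as much as `cliente_b` but in the same proportions, so the two count as highly similar: the measure looks at the direction of the vectors, not their magnitude.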

#### Importing the libraries

In [38]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

#### Importing the dataset

In [40]:
df = pd.read_excel('Online Retail Store.xlsx')

df.head()

Unnamed: 0,ID Pedido,ID Produto,Descrição,Quantidade,Data do Pedido,Preço Unitário,ID Cliente,País
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-01-12 08:26:00,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-01-12 08:26:00,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-01-12 08:26:00,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-01-12 08:26:00,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-01-12 08:26:00,3.39,17850,United Kingdom


In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406829 entries, 0 to 406828
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   ID Pedido       406829 non-null  object 
 1   ID Produto      406829 non-null  object 
 2   Descrição       406829 non-null  object 
 3   Quantidade      406829 non-null  int64  
 4   Data do Pedido  406829 non-null  object 
 5   Preço Unitário  406829 non-null  float64
 6   ID Cliente      406829 non-null  int64  
 7   País            406829 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 24.8+ MB


#### Grouping the total quantity by ID Cliente and ID Produto

For cosine similarity to make sense in this commercial context, we need a DataFrame with the total quantity of each product purchased by each customer.

In [44]:
df_agrupado = df.groupby(['ID Cliente', 'ID Produto'], as_index=False)['Quantidade'].sum()

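# Sort the DataFrame by ID Cliente in ascending order.
# sort_values returns a new DataFrame, so the result must be assigned back;
# the stable mergesort keeps the product order within each customer.
df_agrupado = df_agrupado.sort_values(by='ID Cliente', kind='mergesort')
df_agrupado.head(10)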

Unnamed: 0,ID Cliente,ID Produto,Quantidade
0,12346,23166,0
1,12347,16008,24
2,12347,17021,36
3,12347,20665,6
4,12347,20719,40
5,12347,20780,12
6,12347,20782,6
7,12347,20966,10
8,12347,21035,6
9,12347,21041,12


### Similarity between customers

#### Building the utility matrix for customers

In [46]:
matriz_utilidade = df_agrupado.pivot_table(index='ID Cliente',
                                  columns='ID Produto',
                                  values='Quantidade',
                                  fill_value=0)

matriz_utilidade.head(10)

ID Cliente / ID Produto,10002,10080,10120,10125,10133,10135,11001,15030,15034,15036,...,90214Y,90214Z,BANK CHARGES,C2,CRUK,D,DOT,M,PADS,POST
12346,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12348,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0
12349,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
12350,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
12352,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0
12353,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12354,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12355,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12356,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.0
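
To make the shape of the utility matrix concrete, here is a small sketch with hypothetical purchases (the rows of `exemplo` are made up; only the column names follow `df_agrupado`). Each row becomes a customer, each column a product, and `fill_value=0` fills the pairs with no purchase.

```python
import pandas as pd

# Hypothetical grouped purchases, with the same columns as df_agrupado
exemplo = pd.DataFrame({
    'ID Cliente': [1, 1, 2, 3],
    'ID Produto': ['A', 'B', 'A', 'C'],
    'Quantidade': [5, 2, 1, 7],
})

# One row per customer, one column per product, 0 where nothing was bought
matriz_exemplo = exemplo.pivot_table(index='ID Cliente',
                                     columns='ID Produto',
                                     values='Quantidade',
                                     fill_value=0)
```

By default `pivot_table` aggregates repeated (customer, product) pairs with the mean. Since the grouped data has one row per pair, that default leaves the quantities unchanged.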


#### Computing the cosine similarity between customers based on their purchase patterns

In [48]:
matriz_similaridade = cosine_similarity(matriz_utilidade)

#### Turning the similarity matrix into a DataFrame

In this DataFrame, both the rows and the columns are indexed by ID Cliente, so each cell gives the similarity between two customers.

In [51]:
df_similaridade = pd.DataFrame(matriz_similaridade, index=matriz_utilidade.index, columns=matriz_utilidade.index)

df_similaridade.head()

ID Cliente,12346,12347,12348,12349,12350,12352,12353,12354,12355,12356,...,18273,18274,18276,18277,18278,18280,18281,18282,18283,18287
12346,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347,0.0,1.0,0.148879,0.02075,0.014435,0.034833,0.0,0.023478,0.506252,0.186107,...,0.0,0.0,0.40706,-0.001245,0.015133,0.037236,0.0,0.011921,0.07451,0.108942
12348,0.0,0.148879,1.0,0.000169,0.000315,0.001578,0.0,0.010634,0.286226,0.226244,...,0.0,0.0,0.168758,0.0,0.0,0.0,0.0,0.0,0.17517,0.110096
12349,0.0,0.02075,0.000169,1.0,0.030121,0.136488,0.0,0.004931,0.00018,0.150819,...,0.0,0.0,0.0,-0.000344,0.01568,0.0,0.0,0.014689,0.065295,0.022576
12350,0.0,0.014435,0.000315,0.030121,1.0,0.001938,0.0,0.0,0.0,0.001179,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019385,0.0
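
The row for customer 12346 is zero everywhere, including the diagonal, which is consistent with the grouped data above, where that customer's only product has a total quantity of 0. The sketch below uses a hypothetical matrix to show how `cosine_similarity` treats such an all-zero row.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical utility matrix: the first customer has no net purchases
matriz_exemplo = np.array([
    [0.0, 0.0, 0.0],
    [3.0, 0.0, 1.0],
    [6.0, 0.0, 2.0],
])

sim = cosine_similarity(matriz_exemplo)

# sim[0] is all zeros: a zero vector has no direction to compare,
# so it is not even similar to itself.
# sim[1, 2] is 1 (up to rounding): same proportions, different volumes.
```

Customers like this cannot receive meaningful recommendations from this matrix, so it may be worth filtering them out before looking for similar customers.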


#### Adding an 'ID Cliente' column to the similarity DataFrame

In [53]:
df_similaridade['ID Cliente'] = df_similaridade.index

#### Reordering the columns so that 'ID Cliente' comes first

In [55]:
cols = ['ID Cliente'] + [col for col in df_similaridade if col != 'ID Cliente']
df_similaridade = df_similaridade[cols]
df_similaridade

ID Cliente,ID Cliente,12346,12347,12348,12349,12350,12352,12353,12354,12355,...,18273,18274,18276,18277,18278,18280,18281,18282,18283,18287
12346,12346,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
12347,12347,0.0,1.000000,0.148879,0.020750,0.014435,0.034833,0.0,0.023478,0.506252,...,0.0,0.0,0.407060,-0.001245,0.015133,0.037236,0.000000,0.011921,0.074510,0.108942
12348,12348,0.0,0.148879,1.000000,0.000169,0.000315,0.001578,0.0,0.010634,0.286226,...,0.0,0.0,0.168758,0.000000,0.000000,0.000000,0.000000,0.000000,0.175170,0.110096
12349,12349,0.0,0.020750,0.000169,1.000000,0.030121,0.136488,0.0,0.004931,0.000180,...,0.0,0.0,0.000000,-0.000344,0.015680,0.000000,0.000000,0.014689,0.065295,0.022576
12350,12350,0.0,0.014435,0.000315,0.030121,1.000000,0.001938,0.0,0.000000,0.000000,...,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.019385,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18280,18280,0.0,0.037236,0.000000,0.000000,0.000000,0.000000,0.0,0.002707,0.000000,...,0.0,0.0,0.000000,0.000000,0.043042,1.000000,0.098363,0.000000,0.000000,0.000000
18281,18281,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000435,0.000000,...,0.0,0.0,0.000000,0.000000,0.000000,0.098363,1.000000,0.000000,0.098201,0.000000
18282,18282,0.0,0.011921,0.000000,0.014689,0.000000,0.002966,0.0,0.000000,0.007169,...,0.0,0.0,0.000000,-0.001372,0.000000,0.000000,0.000000,1.000000,0.003776,0.000000
18283,18283,0.0,0.074510,0.175170,0.065295,0.019385,0.017238,0.0,0.104890,0.050042,...,0.0,0.0,0.032142,0.070999,0.000000,0.000000,0.098201,0.003776,1.000000,0.044445


### Similarity between products

#### Building the utility matrix for products

In [57]:
# Build a utility matrix with products as rows and customers as columns
matriz_utilidade_produto = df_agrupado.pivot_table(index='ID Produto',
                                  columns='ID Cliente',
                                  values='Quantidade',
                                  fill_value=0)

matriz_utilidade_produto.head(10)

ID Produto / ID Cliente,12346,12347,12348,12349,12350,12352,12353,12354,12355,12356,...,18273,18274,18276,18277,18278,18280,18281,18282,18283,18287
10002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10080,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10120,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10133,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10135,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15030,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15034,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15036,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
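
The product utility matrix holds the same numbers as the customer one, with rows and columns swapped. The sketch below checks this on hypothetical toy data (the rows of `exemplo` are made up); it illustrates the relationship and does not replace the pivot above.

```python
import numpy as np
import pandas as pd

# Hypothetical grouped purchases, with the same columns as df_agrupado
exemplo = pd.DataFrame({
    'ID Cliente': [1, 1, 2, 3],
    'ID Produto': ['A', 'B', 'A', 'C'],
    'Quantidade': [5, 2, 1, 7],
})

# Customers as rows
por_cliente = exemplo.pivot_table(index='ID Cliente', columns='ID Produto',
                                  values='Quantidade', fill_value=0)

# Products as rows
por_produto = exemplo.pivot_table(index='ID Produto', columns='ID Cliente',
                                  values='Quantidade', fill_value=0)

# On this toy data, the product matrix equals the transposed customer matrix
mesmos_valores = np.array_equal(por_produto.to_numpy(), por_cliente.T.to_numpy())
```

Because `cosine_similarity` compares rows, swapping the orientation is what changes the question from "which customers buy alike?" to "which products are bought alike?".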


#### Computing the cosine similarity between products based on purchase patterns

In [59]:
matriz_similaridade_produto = cosine_similarity(matriz_utilidade_produto)

#### Turning the similarity matrix into a DataFrame

In this DataFrame, both the rows and the columns are indexed by ID Produto, so each cell gives the similarity between two products.

In [62]:
df_similaridade_produto = pd.DataFrame(matriz_similaridade_produto, index=matriz_utilidade_produto.index, columns=matriz_utilidade_produto.index)

df_similaridade_produto

ID Produto,10002,10080,10120,10125,10133,10135,11001,15030,15034,15036,...,90214Y,90214Z,BANK CHARGES,C2,CRUK,D,DOT,M,PADS,POST
10002,1.000000,0.000000,0.001548,0.853890,0.052106,0.021922,0.004643,0.000244,0.000954,0.002452,...,0.000000,0.0,0.000000,0.038750,0.000000,-0.000051,0.000000,0.000258,0.000000,0.070843
10080,0.000000,1.000000,0.000000,0.004958,0.020655,0.011878,0.000000,0.000000,0.033336,0.000797,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000006,0.000000,0.000000
10120,0.001548,0.000000,1.000000,0.001600,0.042560,0.010420,0.014993,0.007238,0.008608,0.002157,...,0.000000,0.0,0.000000,0.000000,0.000000,-0.000021,0.000000,0.007154,0.000000,0.000000
10125,0.853890,0.004958,0.001600,1.000000,0.011634,0.006547,0.004229,0.000000,0.000655,0.000169,...,0.000000,0.0,0.000000,0.000000,0.000000,-0.000217,0.000000,0.000055,0.000000,0.035679
10133,0.052106,0.020655,0.042560,0.011634,1.000000,0.224246,0.011233,0.008205,0.029796,0.005039,...,0.000000,0.0,0.000000,0.064872,0.000000,-0.000074,0.000000,0.002713,0.028508,0.003338
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
D,-0.000051,0.000000,-0.000021,-0.000217,-0.000074,-0.000879,-0.000066,0.000000,-0.002012,-0.001130,...,0.000000,0.0,0.000000,-0.001276,0.000000,1.000000,0.000000,0.061591,0.000000,-0.000268
DOT,0.000000,0.000000,0.000000,0.000000,0.000000,0.153671,0.063144,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,-1.000000,0.000000,1.000000,-0.000207,0.000000,0.000000
M,0.000258,0.000006,0.007154,0.000055,0.002713,0.001672,-0.034300,0.005408,0.003957,-0.016884,...,0.000000,0.0,0.001343,-0.000032,0.000207,0.061591,-0.000207,1.000000,0.000000,-0.023027
PADS,0.000000,0.000000,0.000000,0.000000,0.028508,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000


#### Function that returns the customers most similar to a given customer

In [64]:
df_clientes = df[['ID Cliente', 'País']].drop_duplicates()

# Find similar customers and, for each of them, the 3 products they bought most that the given customer has not bought
def clientes_similares(codigo_cliente, df_similaridade, df_agrupado, df_clientes, pais, top_n=3):
    
    # Check that the given customer exists in the similarity DataFrame
    if codigo_cliente not in df_similaridade.index:
        raise ValueError(f"Customer code {codigo_cliente} is not present in the similarity DataFrame.")
    
    # Check that the given customer belongs to the specified country
    pais_cliente = df_clientes.loc[df_clientes['ID Cliente'] == codigo_cliente, 'País'].values[0]
    if pais != pais_cliente:
        raise ValueError(f"Customer {codigo_cliente} does not belong to country {pais}.")
    
    # Get the similarity between this customer and all customers
    similaridades = df_similaridade[codigo_cliente]
    
    # Sort the customers by similarity (excluding the customer itself)
    similares = similaridades.drop(labels=codigo_cliente).sort_values(ascending=False)
    
    # Keep only customers from the same country
    clientes_mesmo_pais = df_clientes[df_clientes['País'] == pais]['ID Cliente']
    similares_mesmo_pais = similares[similares.index.isin(clientes_mesmo_pais)]
    
    # Select the top_n most similar customers
    top_similares = similares_mesmo_pais.head(top_n).index
    
    # Get the products bought by the given customer
    produtos_cliente = set(df_agrupado[df_agrupado['ID Cliente'] == codigo_cliente]['ID Produto'])
    
    # Dictionary holding, for each similar customer, the selected products and their quantities
    produtos_similares = {}

    # For each similar customer, get the 3 most purchased products that the given customer has not bought
    for cliente in top_similares:
        produtos_cliente_similar = df_agrupado[df_agrupado['ID Cliente'] == cliente]
        
        # Keep the products the similar customer bought and the given customer did not
        produtos_unicos = produtos_cliente_similar[~produtos_cliente_similar['ID Produto'].isin(produtos_cliente)]
        
        # Sort by purchased quantity (largest first) and keep the top 3
        top_produtos = produtos_unicos.sort_values(by='Quantidade', ascending=False).head(3)
        
        # Store the products and their quantities
        produtos_similares[cliente] = top_produtos[['ID Produto', 'Quantidade']].values.tolist()
    
    # Build the result DataFrame, rounding each similarity to 2 decimal places
    df_resultado = pd.DataFrame({
        'Código_Cliente': top_similares,
        'Similaridade (%)': [round(similaridade * 100, 2) for similaridade in similares_mesmo_pais[top_similares].values],
        'Produtos_Similares': [produtos_similares[cliente] for cliente in top_similares]
    })
    
    return df_resultado

- The function 'clientes_similares' receives parameters describing a specific customer and returns a DataFrame with the customers most similar to that customer, the similarity as a percentage, and up to 3 products that each similar customer bought and the given customer did not. This output is useful for a sales team, since it points both to customers similar to the one in question and to new products that customer may want to buy.

- A DataFrame named 'df_clientes' is created. It contains the 'ID Cliente' and 'País' columns of the original DataFrame 'df'. Calling drop_duplicates() on it removes duplicated rows, so each customer and country pair appears only once.

- The function 'clientes_similares' has 6 parameters:

> codigo_cliente: The code of the customer for whom we want to find similar customers.

> df_similaridade: DataFrame holding the cosine similarity matrix between customers, where each cell is the degree of similarity between two customers.

> df_agrupado: DataFrame with the total quantity of each product bought by each customer.

> df_clientes: DataFrame with the unique combinations of ID Cliente and País.

> pais: The customer's country.

> top_n: Number of most similar customers to return (3 by default).

- Two checks are made with 'if' statements. The first verifies that the 'codigo_cliente' passed to the function is present in the index of the similarity matrix (df_similaridade). The second verifies that the customer belongs to the country passed in 'pais'. If either check fails, a ValueError is raised with 'raise'.

- The customer's country is looked up with df_clientes.loc[df_clientes['ID Cliente'] == codigo_cliente, 'País'].values[0]. The comparison df_clientes['ID Cliente'] == codigo_cliente builds a boolean Series that is True in the rows where 'ID Cliente' equals the given 'codigo_cliente'. The 'loc' accessor then selects the 'País' values of those rows, 'values' turns the result into an array, and values[0] takes its first element.

- A variable named 'similaridades' takes the column of the similarity matrix that corresponds to the given customer. This column holds the cosine similarities between the given customer and every customer.

- The customer's similarity with itself is removed with similaridades.drop(labels=codigo_cliente), and the remaining customers are sorted by decreasing similarity with sort_values(ascending=False).


- The variable 'clientes_mesmo_pais' filters the DataFrame 'df_clientes' down to the customers that belong to the country passed to the function, and keeps only the 'ID Cliente' column.

- The variable 'similares_mesmo_pais' restricts the similar customers to those that belong to the specified country, that is, those that also appear in 'clientes_mesmo_pais'. The expression 'similares.index' is the set of index labels (the customer codes) of the Series 'similares'. The isin() method checks which of those codes are also present in the Series 'clientes_mesmo_pais' and returns a boolean array that is True for customers of the specified country. Placing that expression inside square brackets filters the Series 'similares' so that only the similar customers from that country remain.

- The variable 'top_similares' selects the most similar customers that belong to the specified country. How many are selected is set by the 'top_n' parameter.

- The variable 'produtos_cliente' filters the DataFrame 'df_agrupado' to the rows whose ID Cliente is the given customer and keeps only the ID Produto column, giving the products bought by that customer. set() converts the result into a set, removing any duplicates, so each ID Produto appears only once regardless of how many times the customer bought it.

- An empty dictionary named produtos_similares is initialized.

- The 'for' loop does the following for each of the top similar customers:

Filter the products that the similar customer bought and that the given customer did not (stored in the variable 'produtos_unicos').

Sort those products by purchased quantity (largest first) and keep the top 3 (stored in the variable 'top_produtos').

Store the products and their quantities (in 'produtos_similares[cliente]').

- Finally, a DataFrame with 3 columns is built:

> the codes of the customers most similar to the given customer,

> the similarity percentage between each of those customers and the given customer, and

> up to 3 products that each similar customer bought and the given customer did not (with the quantities purchased).
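
The steps above can be sketched on toy data. Everything in this block is hypothetical (the similarity values, the countries, and the purchases are made up); only the column names and variable names follow the function.

```python
import pandas as pd

# Hypothetical similarity matrix between four customers
ids = [1, 2, 3, 4]
sim = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.7],
     [0.9, 1.0, 0.1, 0.3],
     [0.2, 0.1, 1.0, 0.4],
     [0.7, 0.3, 0.4, 1.0]],
    index=ids, columns=ids)

# Hypothetical customer/country table, with the same columns as df_clientes
clientes = pd.DataFrame({'ID Cliente': ids,
                         'País': ['France', 'Spain', 'France', 'France']})

# Hypothetical grouped purchases, with the same columns as df_agrupado
compras = pd.DataFrame({'ID Cliente': [1, 1, 4, 4, 4],
                        'ID Produto': ['A', 'B', 'A', 'C', 'D'],
                        'Quantidade': [3, 1, 2, 9, 4]})

codigo_cliente, pais, top_n = 1, 'France', 2

# Drop the customer itself and sort the others by decreasing similarity
similares = sim[codigo_cliente].drop(labels=codigo_cliente).sort_values(ascending=False)

# Keep only customers from the same country (customer 2 is dropped here)
clientes_mesmo_pais = clientes[clientes['País'] == pais]['ID Cliente']
similares_mesmo_pais = similares[similares.index.isin(clientes_mesmo_pais)]
top_similares = similares_mesmo_pais.head(top_n)

# Products of the most similar customer that the given customer has not bought
produtos_cliente = set(compras[compras['ID Cliente'] == codigo_cliente]['ID Produto'])
mais_similar = top_similares.index[0]
compras_similar = compras[compras['ID Cliente'] == mais_similar]
produtos_unicos = compras_similar[~compras_similar['ID Produto'].isin(produtos_cliente)]
top_produtos = produtos_unicos.sort_values(by='Quantidade', ascending=False).head(3)
```

In this toy case customer 2 has the highest raw similarity to customer 1 but is excluded by the country filter, so customer 4 becomes the top match.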

#### Example of using the function 'clientes_similares'

In [67]:
clientes_similares(12583, df_similaridade, df_agrupado, df_clientes, 'France', 5)

Unnamed: 0,Código_Cliente,Similaridade (%),Produtos_Similares
0,12571,45.58,"[[21977, 24], [21122, 24], [21124, 24]]"
1,12660,44.89,"[[22966, 12], [84375, 12], [22175, 6]]"
2,12694,42.46,"[[21088, 24], [21124, 24], [23309, 24]]"
3,12674,40.75,"[[23076, 120], [22418, 72], [22029, 36]]"
4,12637,39.13,"[[22382, 54], [84692, 50], [22531, 48]]"


- So, the customer most similar to customer 12583 is customer 12571.

- The products that customer 12571 bought and customer 12583 did not (for example, products 21977, 21122 and 21124) are therefore likely to appeal to customer 12583, which makes them good candidates for the store to recommend to that customer.

#### Function that returns the products most similar to a given product

In [93]:
import pandas as pd

def produtos_similares(codigo_produto, df_similaridade_produto, df_original, top_n=3):
    
    # Check that the given product exists in the similarity DataFrame
    if codigo_produto not in df_similaridade_produto.columns:
        raise ValueError(f"Product code {codigo_produto} is not present in the DataFrame.")
    
    # Get the similarity between this product and all products
    similaridades = df_similaridade_produto[codigo_produto]
    
    # Sort the products by similarity (excluding the product itself)
    similares = similaridades.drop(labels=codigo_produto).sort_values(ascending=False)
    
    # Select the top_n most similar products
    top_similares = similares.head(top_n)
    
    # Build a DataFrame with the results
    df_resultado = top_similares.reset_index()
    df_resultado.columns = ['ID_Produto', 'Similaridade (%)']
    
    # Convert the similarity to a percentage
    df_resultado['Similaridade (%)'] = df_resultado['Similaridade (%)'] * 100

    # Round the similarity to 2 decimal places
    df_resultado['Similaridade (%)'] = df_resultado['Similaridade (%)'].round(2)
    
    # Merge with the original DataFrame to add the descriptions.
    # Keeping one description per product avoids duplicated rows in the result
    # when the same ID Produto appears with more than one description.
    df_descricao = df_original[['ID Produto', 'Descrição']].drop_duplicates(subset='ID Produto')
    df_resultado = df_resultado.merge(df_descricao, left_on='ID_Produto', right_on='ID Produto', how='left')
    
    # Drop the redundant columns
    df_resultado = df_resultado[['ID_Produto', 'Descrição', 'Similaridade (%)']]
    
    return df_resultado

- The function 'produtos_similares' receives parameters describing a specific product and returns a DataFrame with the products most similar to it and their similarity to it as a percentage.

- The function 'produtos_similares' has 4 parameters:

> codigo_produto: Code of the product to be compared.

> df_similaridade_produto: DataFrame holding the cosine similarity matrix, where each cell is the degree of similarity between two products.

> df_original: DataFrame with the original product data, such as code and description.

> top_n: Number of similar products to return (3 by default).

- An 'if' statement checks that the 'codigo_produto' passed to the function is present in the DataFrame df_similaridade_produto, that is, that the product to be compared is in the similarity matrix. If it is not, a ValueError is raised with 'raise'.

- A variable named 'similaridades' takes the column of the similarity matrix that corresponds to the given product. This column holds the cosine similarities between the given product and every product.

- The reference product is removed from the similarities with drop(), so that it is not compared with itself, and the remaining products are sorted by decreasing similarity.

- The variable 'top_similares' selects the most similar products. How many are selected is set by the 'top_n' parameter.

- A DataFrame df_resultado is created with the most similar products and their similarities. reset_index() turns the index (the product codes) into a column. The columns of df_resultado are then renamed to ID_Produto (product code) and Similaridade (%).

- The variable 'df_descricao' extracts the ID Produto and Descrição columns from the original DataFrame (df) and removes duplicates with drop_duplicates().

- Next, df_resultado is joined (left merge) with df_descricao to add the description of each similar product. The join matches the ID_Produto column of df_resultado with the 'ID Produto' column of df_descricao.

- The function 'produtos_similares' finally returns df_resultado with only the columns 'ID_Produto', 'Descrição' and 'Similaridade (%)'.
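
The reset_index and merge steps can be sketched in isolation. The product codes, similarities, and descriptions below are hypothetical, made up for illustration; only the column names follow the function.

```python
import pandas as pd

# Hypothetical top similarities, indexed by product code
top_similares = pd.Series([0.8125, 0.5],
                          index=pd.Index(['P1', 'P2'], name='ID Produto'))

# Turn the index into a column and rename both columns
df_resultado = top_similares.reset_index()
df_resultado.columns = ['ID_Produto', 'Similaridade (%)']
df_resultado['Similaridade (%)'] = (df_resultado['Similaridade (%)'] * 100).round(2)

# Hypothetical order lines: P1 appears twice with the same description
df_original = pd.DataFrame({
    'ID Produto': ['P1', 'P1', 'P2', 'P3'],
    'Descrição': ['RED MUG', 'RED MUG', 'BLUE MUG', 'GREEN MUG'],
})

# One row per (product, description) pair, then a left merge keeps every
# product of df_resultado and attaches its description
df_descricao = df_original[['ID Produto', 'Descrição']].drop_duplicates()
df_resultado = df_resultado.merge(df_descricao, left_on='ID_Produto',
                                  right_on='ID Produto', how='left')
df_resultado = df_resultado[['ID_Produto', 'Descrição', 'Similaridade (%)']]
```

A left merge keeps the ranking order of the similar products and leaves the description empty (NaN) for any product code missing from the description table, instead of dropping the row.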

#### Example of using the function 'produtos_similares'

In [95]:
produtos_similares(22633,df_similaridade_produto,df,3)

Unnamed: 0,ID_Produto,Descrição,Similaridade (%)
0,23139,SINGLE WIRE HOOK PINK HEART,77.94
1,22381,TOY TIDY PINK POLKADOT,73.63
2,23138,SINGLE WIRE HOOK IVORY HEART,73.48


- Looking at the DataFrame above, if a customer bought product 22633, that customer is likely to be interested in product 23139 (SINGLE WIRE HOOK PINK HEART), since the two products have a similarity of 77.94%.

- It therefore makes sense for the store to recommend product 23139 to customers who have bought product 22633.