search engines were lexical: the search engine looked for literal matches of the query words, without understanding of the query’s meaning and only returning links that contained the exact query.By using regular keyword search, a document either contains the given word or not, and there is no middle ground

On the other hand, "Semantic Search" can simplify query building, becuase it is supported by automated natural language processing programs i.e. using Latent Semantic Indexing - a concept that search engines use to discover how a keyword and content work together to mean the same thing.

LSI adds an important step to the document indexing process. LSI examines a collection of documents to see which documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with less words in common to be less close.

In brief, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don't contain the keyword at all.




***
***
# Sumário

* # 1. [Análise exploratória](#analise_expl)
    * ## 1.1. [Carregando dados](#analise_expl)
    * ## 1.2. [Pré-preocessamento: limpar e preparar dados](#limpar_preparar)

<br>


* # 2. [Construção do Índice](#construcao_ind)
Keyword Search Vs Semantic Search
At first, search engines were lexical: the search engine looked for literal matches of the query words, without understanding of the query’s meaning and only returning links that contained the exact query.By using regular keyword search, a document either contains the given word or not, and there is no middle ground

On the other hand, "Semantic Search" can simplify query building, becuase it is supported by automated natural language processing programs i.e. using Latent Semantic Indexing - a concept that search engines use to discover how a keyword and content work together to mean the same thin
<br>


* # 3. [Ranking](#ranking)

<br>


* # 4. [Análise dos resultados](#analise_result)


***
***

In [1]:
#importar bibliotecas

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

plt.style.use('ggplot')

#processamento de linguagem natural
import nltk
import string
import spacy


#modelo
import tensorflow as tf
#metricas
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

2023-02-02 20:21:42.737930: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


<a id='analise_expl'></a>
#  1. Análise exploratória

Nesse estagio, vou me familiarizar com os dados para criar intuição sobre o problema e assim, ser capaz de começar a formular hipóteses testáveis.

Nesse estagio, irei utilizar de vizualizacao e estatistica.

## 1.1. Carregando dados

In [2]:
df_pairs = pd.read_csv("../dados/pairs.csv")#, delimiter=";")
df_products = pd.read_csv("../dados/products.csv")#, delimiter=";")

## 1.2.: Entendendo os dados
Vamos olhar para as caracteristicas basicas do banco de dados
* Dataframe shape
* head and tail
* dtypes
* estatisticas

Vamos começar vendo o que tem dentro de cada dataframe

In [19]:
# features da relação entre o par query e produto
print(df_pairs.info())
print(df_pairs.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89832 entries, 0 to 89831
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   pair_id            89832 non-null  int64 
 1   product_id         89832 non-null  int64 
 2   query              89832 non-null  object
 3   search_position    89832 non-null  int64 
 4   print_count_query  89832 non-null  int64 
 5   view_count_query   89832 non-null  int64 
 6   cart_count_query   89832 non-null  int64 
 7   order_count_query  89832 non-null  int64 
dtypes: int64(7), object(1)
memory usage: 5.5+ MB
None
(89832, 8)


In [22]:
df_pairs.head()

Unnamed: 0,pair_id,product_id,query,search_position,print_count_query,view_count_query,cart_count_query,order_count_query
0,8589934593,14817,Convite Padrinhos Batismo,319,2374,18,1,0
1,8589934636,14884,Decoracao De Casamento,254,388,1,0,0
2,8589934836,8589934668,Toalha De Lavabo,233,219,2,0,0
3,8589934727,17179884005,Calendario 2023 Editavel,40,4871,2,0,0
4,8589934934,25769803777,Ecobag,286,166,3,0,0


Coluna com texto:
* query

In [21]:
print(df_products.info())
print(df_products.shape)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76711 entries, 0 to 76769
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   product_id           76711 non-null  int64  
 1   title                76711 non-null  object 
 2   tags                 76711 non-null  object 
 3   creation_date        76711 non-null  object 
 4   price                76711 non-null  float64
 5   weight               76711 non-null  float64
 6   express_delivery     76711 non-null  int64  
 7   category             76711 non-null  object 
 8   minimum_quantity     76711 non-null  int64  
 9   print_count_product  76711 non-null  int64  
 10  view_count_product   76711 non-null  int64  
 11  cart_count_product   76711 non-null  int64  
 12  order_count_product  76711 non-null  int64  
dtypes: float64(2), int64(7), object(4)
memory usage: 8.2+ MB
None
(76711, 13)


In [23]:
#features de produtos
df_products.head()

Unnamed: 0,product_id,title,tags,creation_date,price,weight,express_delivery,category,minimum_quantity,print_count_product,view_count_product,cart_count_product,order_count_product
0,101,Jogo Banheiro de Crochê de 3 Peças,"['#jogobanheiro #croche #tapetes', 'decoração'...",2022-09-25 13:43:36,110.0,1.0,1,Técnicas de Artesanato,1,11,0,0,0
1,106,Guardanapos de Tecido - 100 unidades,"['guardanapos de tecido', 'guradanapo', 'festa...",2014-12-26 18:47:48,269.5,0.0,0,Casa,1,62,6,0,0
2,47,Toalha Papai Noel,"['natal', 'toalha de natal', 'toalha de mesa',...",2013-11-06 20:43:27,291.1,0.0,0,Casa,1,423,4,0,0
3,8589941942,Caixa para 1 bis feliz natal cliente como você...,"['lembrança', 'personalizados', 'festa', 'caix...",2021-11-22 15:02:30,45.0,0.0,0,Lembrancinhas,30,2746,93,6,2
4,17179869192,Árvore de Natal decorada em MDF,"['#madajoartesanato', '#decoraçaodenatal', '#e...",2020-12-18 18:52:35,100.0,0.0,0,Decoração,1,1010,4,0,0


colunas de texto
* title
* tags 
* category

### Estatisticas dos dados numéricos

In [28]:
df_products.drop('product_id', axis=1).describe()


Unnamed: 0,price,weight,express_delivery,minimum_quantity,print_count_product,view_count_product,cart_count_product,order_count_product
count,76711.0,76711.0,76711.0,76711.0,76711.0,76711.0,76711.0,76711.0
mean,205.439692,282.893223,0.17549,13.016477,2077.774856,53.246497,1.836425,0.363025
std,566.132438,1239.292652,0.380388,71.227281,12073.890932,134.071726,5.568,1.432096
min,9.9,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,45.0,0.0,0.0,1.0,64.0,5.0,0.0,0.0
50%,97.14,0.0,0.0,1.0,378.0,16.0,1.0,0.0
75%,199.0,180.0,0.0,10.0,1461.0,47.0,2.0,0.0
max,43150.0,50000.0,1.0,7000.0,685904.0,5162.0,446.0,143.0


<a id='limpar_preparar'></a>
#  1.2. Pré-processamento: limpar e preparar dados

Vamos preparar/limpar nossos dados para a análise exploratoria. Para decidir o que fazer nesse estagio, vamos relembrar qual é o nosso problema: queremos criar um modelo que seja capaz de, dada uma query, retornar produtos mais relevantes (baseados nas suas descricoes) e recomendar produtos diferentes.

Fazer keyword search para resol
Se utilizarmos semantic search para resolver o problema,  Semantic search attempts to apply user intent and the meaning (or semantics) of words and phrases to find the right content.

It goes beyond keyword matching by using information that might not be present immediately in the text (the keywords themselves) but is closely tied to what the searcher wants.


Portanto, no estagio de limpeza de dados é interessante:
* Que o algoritmo seja capaz de detectar palavras com errors ortograficos como iguais. Ex:
            guardanapos = gurdanapo
            aniversário = aniversario
            
* remover captalizacao, para que o algoritmo nao diferencie, por ex.: guardanapos de Guardanapos

* Queremos remover inflexoes das palavras -> lemmetizacao

* Remover palavras que nao acrescentam signifcado ao texto (stopwords)



Assim, seguiremos as seguintes etapas:

* 1. Remover Null values
* 2. Tokenization 
* 3. Remover pontuações
* 4. Remover stopwords
* 5. Converter todos os caracteres para letras minusculas
* 6. Lemmatization
* 7. Word embedding

## 1.2.1. Remover Null values
Vamos examinar se temos algum valor faltante nos dataframes. Dos outputs abaixo, parece que apenas as colunas **weight** e **category** do dataframe **df_products** tem valores faltantes



In [7]:
print("---- df_pairs ----")
for col in df_pairs.columns:
    print(col, df_pairs[col].isnull().sum())
print("\n")


print("---- df_products ----")
for col in df_products.columns:
    print(col, df_products[col].isnull().sum())
print("\n")

---- df_pairs ----
pair_id 0
product_id 0
query 0
search_position 0
print_count_query 0
view_count_query 0
cart_count_query 0
order_count_query 0


---- df_products ----
product_id 0
title 0
tags 0
creation_date 0
price 0
weight 51
express_delivery 0
category 8
minimum_quantity 0
print_count_product 0
view_count_product 0
cart_count_product 0
order_count_product 0




Removendo *linhas* com valores faltantes

In [8]:
df_products.dropna(inplace=True)

print("---- df_products ----")
for col in df_products.columns:
    print(col, df_products[col].isnull().sum())
print("\n")

df_products.info()
#removemos 59 linhas

---- df_products ----
product_id 0
title 0
tags 0
creation_date 0
price 0
weight 0
express_delivery 0
category 0
minimum_quantity 0
print_count_product 0
view_count_product 0
cart_count_product 0
order_count_product 0


<class 'pandas.core.frame.DataFrame'>
Int64Index: 76711 entries, 0 to 76769
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   product_id           76711 non-null  int64  
 1   title                76711 non-null  object 
 2   tags                 76711 non-null  object 
 3   creation_date        76711 non-null  object 
 4   price                76711 non-null  float64
 5   weight               76711 non-null  float64
 6   express_delivery     76711 non-null  int64  
 7   category             76711 non-null  object 
 8   minimum_quantity     76711 non-null  int64  
 9   print_count_product  76711 non-null  int64  
 10  view_count_product   76711 non-null  int64  
 11  cart_count_produ

<a id='pontuacao'></a>


## 1.2.2 Remover pontuação e tokenizar

Anteriormente, vimos que as colunas que contem dados em forma de texto sao:
* em df_pair:
    * query
    
* em df_products:
    * title
    * tags 
    * category    

Com os pré-processamentos, queremos reduzir o tamanho do vocabulário e simplificar algumas formas lexicais, garantindo, assim, que o algoritmo obtenha informações relevantes e que de fato representam o nosso os produtos e as queries

In [None]:
df2_pairs = df_pairs.copy()[:20]

In [147]:
#chamar modelo em portugues
nlp = spacy.load("pt_core_news_sm")

In [146]:
#criar uma lista com pontuações
punctuations = string.punctuation 
print(punctuations)

# criar uma lista de stop words em portugues
nltk.download('stopwords')

# para escolher as stopwords do português adicionamos a opçaõ de língua "portuguese"
stop_words = nltk.corpus.stopwords.words('portuguese')
print(stop_words[:10])

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['a', 'à', 'ao', 'aos', 'aquela', 'aquelas', 'aquele', 'aqueles', 'aquilo', 'as']


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gabriela/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [172]:
def tokenizer(sentence):
    
    #normalizar: converter todas as strings para letra minuscula
    sentence.lower

    #criar objeto token 
    tokens = nlp(sentence)
  
    #remover stopwords, pontuacao, simbolos especiais, palavras com menos de 2 caracteres
    #e lematizar e remover espacos depois de tokens
    
    tokens = [word.lemma_.strip() for word in tokens if str(word) not in stop_words\
                                     and str(word) not in punctuations\
                                     and len(word)>2]
    
    #replace extra spaces with single space
    sentence = re.sub(' +',' ',sentence)

    #remove unwanted lines starting from special characters
    sentence = re.sub(r'\n: \'\'.*','',sentence)
    sentence = re.sub(r'\n!.*','',sentence)
    sentence = re.sub(r'^:\'\'.*','',sentence)
    
    #remove non-breaking new line characters
    sentence = re.sub(r'\n',' ',sentence)
    
    #remove punctunations
    sentence = re.sub(r'[^\w\s]',' ',sentence)

    
    
    #return tokens
    return tokens

In [173]:
print ('Cleaning and Tokenizing...')
%time df2_products['query_tokenized'] = df2_products['query'].map(lambda x: tokenizer(x))

df2_pairs.head()

In [174]:
print ('Cleaning and Tokenizing...')
%time df2_pairs['query_tokenized'] = df2_pairs['query'].map(lambda x: tokenizer(x))

df2_pairs.head()

Cleaning and Tokenizing...
CPU times: user 88.6 ms, sys: 3.59 ms, total: 92.2 ms
Wall time: 89.6 ms


Unnamed: 0,pair_id,product_id,query,search_position,print_count_query,view_count_query,cart_count_query,order_count_query,query_tokenized
0,8589934593,14817,Convite Padrinhos Batismo,319,2374,18,1,0,"[Convite, Padrinhos, Batismo]"
1,8589934636,14884,Decoracao De Casamento,254,388,1,0,0,"[Decoracao, Casamento]"
2,8589934836,8589934668,Toalha De Lavabo,233,219,2,0,0,"[Toalha, Lavabo]"
3,8589934727,17179884005,Calendario 2023 Editavel,40,4871,2,0,0,"[Calendario, 2023, editavel]"
4,8589934934,25769803777,Ecobag,286,166,3,0,0,[Ecobag]


In [150]:
df2_products[['tags_tokenized', 'tags']].head()
#palavras duplicadas
#talvez seja melhor tokenizar frases curtas do que palavras individuais 

Unnamed: 0,tags_tokenized,tags
0,"[jogobanheiro, croche, tapete, decoração, cor,...","['#jogobanheiro #croche #tapetes', 'decoração'..."
1,"[guardanapo, tecido, guradanapo, festa, evento...","['guardanapos de tecido', 'guradanapo', 'festa..."
2,"[natal, toalha, natal, toalha, mesa, papai, no...","['natal', 'toalha de natal', 'toalha de mesa',..."
3,"[lembrança, personalizar, festa, caixa, caixin...","['lembrança', 'personalizados', 'festa', 'caix..."
4,"[madajoartesanato, decoraçaodenatal, enfeitede...","['#madajoartesanato', '#decoraçaodenatal', '#e..."


In [11]:
import re
#regular expressions biblioteca para detectar padroes em texto

In [48]:
nlp('A menina de azul "é" bonita. Ela anda de bicicleta e tem 1 laço')

A menina de azul "é" bonita.    Ela anda de bicicleta e tem 1 laço

In [16]:
def tokenizer(sentence):
 
    #remove distracting single quotes
    sentence = re.sub('\'','',sentence)

    #remove digits adnd words containing digits
    sentence = re.sub('\w*\d\w*','',sentence)

    #replace extra spaces with single space
    sentence = re.sub(' +',' ',sentence)

    #remove unwanted lines starting from special characters
    sentence = re.sub(r'\n: \'\'.*','',sentence)
    sentence = re.sub(r'\n!.*','',sentence)
    sentence = re.sub(r'^:\'\'.*','',sentence)
    
    #remove non-breaking new line characters
    sentence = re.sub(r'\n',' ',sentence)
    
    #remove punctunations
    sentence = re.sub(r'[^\w\s]',' ',sentence)
    
    #creating token object
    tokens = nlp(sentence)
    
    #lower, strip and lemmatize
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokens]
    
    #remove stopwords, and exclude words less than 2 characters
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations and len(word) > 2]
    
    #return tokens
    return tokens

In [47]:
print(re.sub('\'','','A menina de azul "é" bonita.     Ela anda de bicicleta e tem 1 laço'))
print(re.sub('\w*\d\w*','','A menina de azul "é" bonita.    Ela anda de bicicleta e tem 1 laço'))
print(re.sub(r'\n: \'\'.*','','A menina de azul "é" bonita.    Ela anda de bicicleta e tem 1 laço'))


A menina de azul "é" bonita.     Ela anda de bicicleta e tem 1 laço
A menina de azul "é" bonita.    Ela anda de bicicleta e tem  laço
A menina de azul "é" bonita.    Ela anda de bicicleta e tem 1 laço


Vamos aplicar a funcao de data-cleaning e pre-processamento nas colunas "query", "tags", "category", "title" column and store the cleaned, tokenized data into new column

In [154]:
norm = Normaliser(tokenizer='readable')

<enelvo.normaliser.Normaliser at 0x7ff0cca8feb0>

In [45]:
Normaliser(tokenizer)

SyntaxError: invalid syntax (2804356022.py, line 1)

In [36]:
print ('Cleaning and Tokenizing...')
%time a['tags_tokenized'] = a['tags'].map(lambda x: tokenizer(x))

a.info()

Cleaning and Tokenizing...
CPU times: user 94.9 ms, sys: 3.72 ms, total: 98.6 ms
Wall time: 96.6 ms
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tags            10 non-null     object
 1   tags_tokenized  10 non-null     object
dtypes: object(2)
memory usage: 240.0+ bytes


In [38]:
print(a['tags'][0])
print(a['tags_tokenized'][0])

['#jogobanheiro #croche #tapetes', 'decoração', 'nas cores chumbo e rosa bebê']
['jogobanheiro', 'croche', 'tapete', 'decoração', 'em o', 'cor', 'chumbo', 'roso', 'bebê']


<a id='tokenization'></a>


## 1.2.3 Tokenization

Vamos quebrar os textos bruto em pedaços menores - tokens. Lembrando que anteriormente, vimos que as colunas que contem dados em forma de texto sao:

* em df_pair:
    * query
    
* em df_products:
    * title
    * tags 
    * category
    
Vamos comecar, removendo 

In [7]:
from nltk.tokenize import word_tokenize

# Sistema de recomendação: Content-Based Filtering
A recomendação é feita baseada nas **features de usuário** e nas **features dos produtos**. O match entre usuario e produto é feito pelo produto escalar:

## $ y^{i,j} = \vec{v}_{u}^{i}\cdot \vec{v}_{p}^{j}$

onde $\vec{v}_{u}^{i}$ e $\vec{v}_{p}^{j}$ são vetores computados das features do usuario $i$ e do produto $j$, respetivamente.

Muitos algoritmos de content-based filtering estimam $\vec{v}_{u}^{i}$ e  $\vec{v}_{p}^{j}$ usando redes neurais com a seguinte funcao de de custo:
## $J = \sum_{i,j}\left( \vec{v}_{u}^{i}\cdot \vec{v}_{p}^{j} -  y^{i,j} \right)^2 + \text{termos de regularização}$

Esse algoritmo tambem pode ser utilizado para *encontrar produtos similares*. Para isso, calculamos a distancia entre as features
## $ ||\vec{v}_{u}^{i} - \vec{v}_{p}^{j}||^2$

# Recomendação em grandes catálogos
O banco de dados contem mais de 6 milhoes de produtos. Portanto, o algoritmo pode ser computacionalmente inviavel de rodar para tantos itens. Para remediar esse problema, podemos implementar o sistema de recomendação em duas etapas: **retrieval** e **ranking**.

## Retrieval
Gerar uma lista de produtos plausiveis de serem buscados para aquela query. Por exemplo:
1) 100 produtos mais vendidos, clicados e adicionados ao carrinho numa determinada categoria de produtos
2) ....


Retrieving mais itens resulta numa performance melhor a custo de recomendacoes mais lentas. Para analizar o trade-off entre performance e velocidade, podemos rodar experimentos para ver se recuperar itens adicionais resulta em recomendacoes mais relevantes para o usuario.

## Ranking

Dada a lista de itens gerada na etapa anterior, rankeie os melhores produtos usando o modelo de aprendizagem e mostre esses itens rankeados ao usuario