# FAQ


This notebook will:

    I. present the FAQ problem
    II. walk through an implementation

## I. Presentation

The FAQ problem is to match a user query against a set of known questions and return the answer of the question closest to the query. The underlying model is the same as the one used for text matching.
There is no single dedicated model for this problem; instead, a two-stage strategy produces the final result:
 * use the similarity model to encode all candidate questions and recall the several that are closest to the query
 * use the matching model to pick the best one among those candidates
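
The two stages above can be sketched on toy data. This is only an illustration in plain Python, not the notebook's implementation: `retrieve_then_rerank`, the 2-d vectors, and the re-ranker scores are all made up for the example, and a real pipeline would use BERT embeddings and a trained matching model instead.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    candidate_vecs: List[Sequence[float]],
    rerank_score: Callable[[int], float],
    top_k: int = 3,
) -> Tuple[int, List[int]]:
    """Stage 1: recall the top_k candidates by cosine similarity.
    Stage 2: return the recalled index that the re-ranker scores highest."""
    ranked = sorted(
        range(len(candidate_vecs)),
        key=lambda i: cosine(query_vec, candidate_vecs[i]),
        reverse=True,
    )
    recalled = ranked[:top_k]
    best = max(recalled, key=rerank_score)
    return best, recalled


# Toy data (hypothetical): four 2-d "question embeddings" and one query embedding.
toy_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
toy_query = [1.0, 0.05]
# Hypothetical re-ranker scores, one per candidate index.
toy_rerank = {0: 0.2, 1: 0.9, 2: 0.99, 3: 0.1}

best, recalled = retrieve_then_rerank(toy_query, toy_vecs, toy_rerank.get, top_k=2)
print(recalled, best)
```

Candidate 2 has the highest re-ranker score but is never considered, because it was not recalled in the first stage. The recall step therefore bounds the quality of the final answer.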

In [13]:
# install extra module

!python -m pip install faiss-cpu --break-system-packages

Defaulting to user installation because normal site-packages is not writeable
Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0


## II. Realization

In [1]:
# select which GPU to use
# I have 2 GPUs and only want to use one, so I set this.
# This cell must be run first, before anything initializes CUDA.
# Skip it if you don't need it.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [2]:
# define repos for data and model

# data

ckp_data = "akshatshah1103/retail-faq"

# model

ckp = "google-bert/bert-base-uncased"

### 1. import

In [3]:
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

2024-06-21 11:48:43.007063: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-21 11:48:43.007115: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-21 11:48:43.009259: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-21 11:48:43.020202: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


### 2. load data

In [4]:
data = load_dataset(ckp_data)
data

DatasetDict({
    train: Dataset({
        features: ['FAQ', 'Response'],
        num_rows: 112
    })
})

In [5]:
data["train"][0]

{'FAQ': "What are your store's operating hours?",
 'Response': 'Our store is open from 10:00 AM to 8:00 PM, Monday through Saturday, and from 11:00 AM to 6:00 PM on Sundays.'}

### 3. split data

In [7]:
split_data = data["train"].train_test_split(test_size=0.2)
split_data

DatasetDict({
    train: Dataset({
        features: ['FAQ', 'Response'],
        num_rows: 89
    })
    test: Dataset({
        features: ['FAQ', 'Response'],
        num_rows: 23
    })
})

### 4. load model

In [8]:
# this is the same model as for matching

from transformers import BertPreTrainedModel, BertModel
from typing import Optional
import torch
from torch.nn import CosineEmbeddingLoss, CosineSimilarity

class SimilarityModel(BertPreTrainedModel):

    def __init__(self, config):

        super().__init__(config)

        self.bert = BertModel(config)

        self.post_init()

    def forward(self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):
        
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        # inputs are assumed to have shape (batch, 2, seq_len): index 0 and 1 on dim 1 are the two sentences of each pair
        s1_input_ids, s2_input_ids = input_ids[:, 0], input_ids[:, 1]
        s1_attention_mask, s2_attention_mask = attention_mask[:, 0], attention_mask[:, 1]
        s1_token_type_ids, s2_token_type_ids = token_type_ids[:, 0], token_type_ids[:, 1]

        s1_outputs = self.bert(
            s1_input_ids,
            attention_mask=s1_attention_mask,
            token_type_ids=s1_token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        s1_pooled_output = s1_outputs[1]

        s2_outputs = self.bert(
            s2_input_ids,
            attention_mask=s2_attention_mask,
            token_type_ids=s2_token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        s2_pooled_output = s2_outputs[1]

        simi = CosineSimilarity()(s1_pooled_output, s2_pooled_output)

        loss = None

        if labels is not None:

            loss_fct = CosineEmbeddingLoss(0.3)
            loss = loss_fct(s1_pooled_output, s2_pooled_output, labels)

        output = (simi,)
        return ((loss,) + output) if loss is not None else output


model = SimilarityModel.from_pretrained(ckp)

model

SimilarityModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi
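
The model's `forward` uses `CosineEmbeddingLoss(0.3)`, that is, a margin of 0.3. For one pair of vectors, the PyTorch documentation gives the loss as `1 - cos` when the label is 1 and `max(0, cos - margin)` when the label is -1. The sketch below re-implements that per-pair formula in plain Python to show the values it produces; `cosine_embedding_loss` is a hypothetical helper written for this illustration, not the PyTorch class.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_embedding_loss(u, v, label: int, margin: float = 0.3) -> float:
    """Per-pair loss: label 1 pulls the pair together, label -1 pushes it apart
    until the similarity drops below the margin."""
    cos = cosine(u, v)
    if label == 1:
        return 1.0 - cos
    return max(0.0, cos - margin)


same = ([1.0, 0.0], [1.0, 0.0])        # identical direction, cos = 1
orthogonal = ([1.0, 0.0], [0.0, 1.0])  # unrelated direction, cos = 0

for name, (u, v) in [("same", same), ("orthogonal", orthogonal)]:
    print(name, cosine_embedding_loss(u, v, 1), cosine_embedding_loss(u, v, -1))
```

With this definition, labels must be 1 or -1 rather than 1 or 0. A negative pair stops contributing to the loss once its cosine similarity falls below 0.3.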

### 5. tokenization

In [9]:
tokenizer = AutoTokenizer.from_pretrained(ckp)
tokenizer

BertTokenizerFast(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

### 6. Encoding

In [10]:
# encode questions
from tqdm import tqdm

def encode_batch(data, batch_size):

    encodes = []

    with torch.inference_mode():

        for i in tqdm(range(0, len(data["train"]), batch_size)):

            batch_sens = [data["train"][ind+i]["FAQ"] for ind in range(batch_size) if ind+i < len(data["train"])]

            toks = tokenizer(batch_sens, max_length=128, padding=True, truncation=True, return_tensors="pt")

            vec = model.bert(**toks)[1] # encode the data into a vector of length of hidden size (768)

            encodes.append(vec)

    encodes = torch.cat(encodes, dim=0).cpu().numpy()

    return encodes

encodes = encode_batch(data, 16)
print(encodes.shape)

100%|██████████| 7/7 [00:01<00:00,  4.98it/s]

(112, 768)
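
The progress bar shows 7 iterations because the 112 questions are processed in batches of 16. The batching logic can be written with slice bounds instead of a per-index list comprehension; the sketch below is plain Python and `batch_slices` is a hypothetical helper, not something the notebook defines.

```python
import math
from typing import Iterator, Tuple


def batch_slices(n_items: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) bounds covering n_items in chunks of batch_size.
    The last chunk is shorter when n_items is not a multiple of batch_size."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


slices = list(batch_slices(112, 16))
print(len(slices), slices[0], slices[-1])

# A size that does not divide evenly: the final batch holds the remainder.
print(list(batch_slices(10, 4)))
```

Each `(start, end)` pair could then be used to slice a list of questions, for example `questions[start:end]`, before passing the batch to the tokenizer.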





### 7. create indexing

In [14]:
# indexing using faiss

import faiss

index = faiss.IndexFlatIP(768)
faiss.normalize_L2(encodes)

index.add(encodes)

index


<faiss.swigfaiss_avx2.IndexFlatIP; proxy of <Swig Object of type 'faiss::IndexFlatIP *' at 0x7fc5880b3b40> >
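
`IndexFlatIP` ranks by inner product, and `faiss.normalize_L2` rescales each vector in place to unit length. Combining the two makes the returned scores equal to cosine similarity. The sketch below checks that identity on two small vectors in plain Python; `l2_normalize` and `inner` are hypothetical helpers for the illustration and do not call faiss.

```python
import math
from typing import List, Sequence


def inner(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner (dot) product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Return a copy of v scaled to unit L2 norm (unchanged if v is all zeros)."""
    norm = math.sqrt(inner(v, v))
    return [x / norm for x in v] if norm else list(v)


a = [3.0, 4.0]
b = [4.0, 3.0]

# Cosine similarity computed directly from the raw vectors.
cos = inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))
# Inner product of the normalized vectors.
ip = inner(l2_normalize(a), l2_normalize(b))

print(round(cos, 6), round(ip, 6))
```

This is also why the query vector has to be normalized before searching: without it, the scores would be scaled by the query's norm and would no longer lie in [-1, 1].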

### 8. encode query

In [15]:
# encode the query

def encode(question):

    with torch.inference_mode():

        toks = tokenizer(question, max_length=128, padding=True, truncation=True, return_tensors="pt")

        vec = model.bert(**toks)[1]  # pooled output, shape (1, 768)

    return vec.cpu().numpy()

ques = "When it open"

vec = encode(ques)

print(vec.shape)

(1, 768)


### 9. Search

In [16]:
# search

faiss.normalize_L2(vec)
tops = index.search(vec, 20) # find the top 20 candidates

print(tops)

res = []
for score, ind in zip(tops[0][0], tops[1][0]):

    ind = int(ind)
    matched_ques = data["train"][ind]["FAQ"]
    matched_resp = data["train"][ind]["Response"]
    res.append([matched_ques, matched_resp])

candidates = {}
candidates["question"] = [i for i, j in res]
candidates["respond"] = [j for i, j in res]

print(candidates["question"])
print(type(candidates))

(array([[0.91554874, 0.91554874, 0.91554874, 0.91554874, 0.91554874,
        0.91554874, 0.91554874, 0.91554874, 0.91554874, 0.91554874,
        0.88546276, 0.88546276, 0.88546276, 0.88546276, 0.8487283 ,
        0.8466814 , 0.8257435 , 0.81071067, 0.7890881 , 0.75049907]],
      dtype=float32), array([[ 31,  30,  29,  28,  27,  26,  25,  24,  23,  22,  67,  66,  65,
         64, 110, 102,  92,  97, 100,  16]]))
['How can I track my order?', 'How can I track my order?', 'How can I track my order?', 'How can I track my order?', 'How can I track my order?', 'How can I track my order?', 'How can I track my order?', 'How can I track my order?', 'How can I track my order?', 'How can I track my order?', 'Can I return an item without a receipt?', 'Can I return an item without a receipt?', 'Can I return an item without a receipt?', 'Can I return an item without a receipt?', 'How do I track my order when I order through Dunzo or Blinkit?', 'How do I cancel a pre-order if I change my mind?', 'Do
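
The search output above returns the same question several times (for example, 'How can I track my order?' fills the first ten hits with an identical score), which suggests that the dataset contains duplicate rows. If so, the 20 recalled slots hold far fewer distinct candidates. A minimal sketch of one way to handle this, assuming it is acceptable to keep only the first (highest-scoring) hit per distinct question; `dedupe_candidates` is a hypothetical helper and the sample rows are invented for the example.

```python
from typing import List, Sequence, Tuple


def dedupe_candidates(
    questions: Sequence[str],
    responses: Sequence[str],
    scores: Sequence[float],
) -> List[Tuple[str, str, float]]:
    """Keep the first occurrence of each question, preserving the ranking order."""
    seen = set()
    kept = []
    for question, response, score in zip(questions, responses, scores):
        if question in seen:
            continue
        seen.add(question)
        kept.append((question, response, score))
    return kept


# Invented sample hits, already sorted by descending score.
sample_questions = ["track order?", "track order?", "return item?", "track order?"]
sample_responses = ["Use the link.", "Use the link.", "Bring a receipt.", "Use the link."]
sample_scores = [0.91, 0.91, 0.88, 0.80]

unique_hits = dedupe_candidates(sample_questions, sample_responses, sample_scores)
print(unique_hits)
```

Deduplicating the dataset itself before encoding would have the same effect and would also shrink the index.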

### 10. refine result

In [18]:
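# re-rank the recalled candidates with a cross-encoder
# note: the classification head loaded here is newly initialized (see the warning below),
# so its scores are only meaningful after fine-tuning on a matching task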

from transformers import BertForSequenceClassification

cross_model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased", num_labels=1)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
# we match the candidates to the query question to find the best one

questions = [ques] * len(candidates["question"])

toks = tokenizer(questions, candidates["question"], max_length=128, truncation=True, padding=True, return_tensors="pt")

toks = {k: v.to(cross_model.device) for k, v in toks.items()}

with torch.inference_mode():

    logits = cross_model(**toks).logits.squeeze()
    res = torch.argmax(logits, dim=-1)
    print(res)

res = int(res)

tensor(15)


In [20]:
matched_ques = candidates["question"][res]
matched_resp = candidates["respond"][res]
print([matched_ques, matched_resp])

['How do I cancel a pre-order if I change my mind?', "To cancel a pre-order, simply contact our customer support team at [customer support email or phone number]. They'll assist you in canceling your pre-order and processing any necessary refunds."]
