In [None]:
pip install transformers

Collecting transformers
  Downloading transformers-4.10.2-py3-none-any.whl (2.8 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: Py

In [None]:
import os
from transformers import AutoTokenizer  # Or BertTokenizer
from transformers import AutoModelForPreTraining , AutoConfig, AutoModel
from time import time
import huggingface_hub as hb
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import json
import pandas as pd
import torch
import warnings



In [None]:
warnings.filterwarnings('ignore')

## Introduction
The code below changes the working directory to FILE_PATH, a directory on my Google Drive where I have stored all the resources and the model that the code in this notebook needs. To make this work for you, create a directory, upload the 'Dataset AI.xlsx' Excel file into it, and set the FILE_PATH variable to the path of the directory you just created.


All the code needed in production has been carefully written and documented in this notebook. It will also be provided as a standalone Python file for your later use in production.


In [None]:
FILE_PATH = '/content/drive/MyDrive/Deep Learning/Text Similarity'
os.chdir(FILE_PATH)

Now, let's read the Excel file.

The code below downloads the Brazilian Portuguese BERT model to my local directory permanently. To replicate this download, change the FILE_PATH variable specified above to your own local directory path, as explained earlier. The model will then be downloaded and be available in your own local directory.

Also, uncomment the code below (remove the '#' and the extra spaces on the left side of each line) before running it.

In [None]:
# hb.snapshot_download('neuralmind/bert-base-portuguese-cased',cache_dir= FILE_PATH)
# for f in os.listdir():
#         if f.startswith('neuralmind__bert-base'):
#             os.rename(f,'bert-base-portuguese-cased')

Next, we create a model path to our newly downloaded model.

In [None]:
model_path = os.path.join(FILE_PATH,'bert-base-portuguese-cased')
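# Ask transformers to look offline (in the local directory) for the model's weights and dependencies.
# A bare Python variable named TRANSFORMERS_OFFLINE has no effect: the library reads the environment variable,
# typically when it is first imported, so the local_files_only=True arguments used below are what enforce offline loading here.
os.environ['TRANSFORMERS_OFFLINE'] = '1'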


Now, we load the model and its tokenizer into memory for computation.
The tokenizer is a utility that breaks the model's input (sentences) into smaller units (e.g. characters, subwords, or words) before it is passed to the model for downstream processing.
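
To make the idea of subword splitting concrete, here is a minimal sketch of the greedy longest-match-first strategy that BERT-style WordPiece tokenizers are generally described as using. The vocabulary below is a made-up toy, not the vocabulary shipped with the model, and the function name `wordpiece_tokenize` is only illustrative. The real tokenizer also handles normalization, punctuation, and special tokens, which this sketch leaves out.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate substring until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for illustration only).
toy_vocab = {"aquece", "##dor", "não", "funciona", "ar"}

sentence = "aquecedor não funciona"
tokens = [p for w in sentence.split() for p in wordpiece_tokenize(w, toy_vocab)]
print(tokens)  # ['aquece', '##dor', 'não', 'funciona']
```

With this toy vocabulary, 'aquecedor' is not a known word, so it is split into a known stem and a continuation piece, while the other two words are kept whole.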

In [None]:
model = AutoModel.from_pretrained( model_path, local_files_only = True, output_attentions=True)


In [None]:
# output_attentions is a model option, not a tokenizer option, so it is not passed here.
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)

## Algorithm Creation.
Everything here (code and functions) has been written according to the flowchart presented earlier.

All the code is commented and explained. The 'query_database' function has been left for your team to implement, depending on the database management system (SQL, Oracle, etc.) that will eventually be used in production.

Other functions include:
1. load_vectorizer: It loads the BERT model and its tokenizer used for vectorization of defect descriptions.

2. vectorize_all_defects: It vectorizes all the defect descriptions.

3. get_success_rate: It calculates the success rate of each solution retrieved from the database.

4. sort_score: It sorts the defects by their calculated similarity score, in descending order.

5. get_solution: This is the powerhouse of the algorithm. Given its required inputs, it retrieves all relevant solutions, groups them into levels (L1, L2, L3), and returns these with the order_id, work_order, order_type (whether customer or technician), and success rate of each solution.

A python file containing only this cell can be found in the same directory as this notebook.
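
As a reading aid for vectorize_all_defects, the following dependency-free sketch shows the masked mean pooling idea on one padded sentence: token vectors at padded positions are skipped, and the remaining vectors are averaged into a single sentence vector. The function name `masked_mean_pool` and the tiny 2-dimensional vectors are assumptions made for illustration; the notebook itself does this on torch tensors for a whole batch at once.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:  # 1 = real token, 0 = padding
            n_real += 1
            for j in range(dim):
                summed[j] += vector[j]
    # Guard against an all-padding row (same role as a clamp with min=1e-9).
    return [value / max(n_real, 1e-9) for value in summed]


# Made-up 2-dimensional "embeddings" for a sentence padded to 4 positions.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
attention_mask = [1, 1, 0, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
print(pooled)  # [2.0, 3.0]
```

The two padded vectors do not influence the result, which is why the attention mask is applied before averaging.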

In [None]:
def load_vectorizer(model_path=model_path):
  ''' This function loads the BERT model to memory for vectorization'''
  # Load the BERT model:
  model = AutoModel.from_pretrained( model_path, local_files_only = True, output_attentions=True)
  # Load the tokenizer:
  tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)

  # Return the model and tokenizer:
  return model, tokenizer


def vectorize_all_defects(all_defects, model,tokenizer,max_length=128, batch_size=1000):
  ''' This function vectorizes 'all_defects' in batches (using the provided model and tokenizer arguments) 
      and concatenates all the resulting embeddings into one tensor (array) container
  '''
  # First tokenize and vectorize the first batch from 'all_defects'
  # Tokenize first batch (only the first 'batch_size' descriptions):
  tokens = tokenizer.batch_encode_plus(list(all_defects[:batch_size]),max_length= max_length, padding='max_length',truncation=True,return_tensors='pt')
  attention_mask = tokens['attention_mask']

  # Vectorize first batch and store result in 'embeddings' variable.
  with torch.no_grad():
    outs = model(**tokens)
    # The vector embeddings are stored in the last_hidden_state of the model so we retrieve it.
    embeddings = outs.last_hidden_state 

  # Make attention mask have exactly the same size and shape as the vectorized embeddings:
  # This is done because not all defects have similar sequence length and so they were all padded to the same size.
  # Therefore, attention mask here is used to ignore all the paddings.
  attention_mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
  embeddings = embeddings * attention_mask

  # summed_embeddings = torch.sum(embeddings, 1)
  # summed_mask = torch.clamp(attention_mask.sum(1), min=1e-9)
  # The embeddings are computed for each word in each defect description. Therefore, calculating the mean of the word embeddings per defect description
  # gives a standard embedding (mean) of each defect description.
  mean_pooled_embeddings = torch.sum(embeddings, 1) / torch.clamp(attention_mask.sum(1), min=1e-9)

  # Now, apply the same process to the other batches and keep concatenating to 'mean_pooled_embeddings' till there are no defects left:
  for i in range(batch_size, len(all_defects), batch_size ):
    # Tokenize the current batch (descriptions i to i + batch_size):
    tokens = tokenizer.batch_encode_plus(list(all_defects[i:i + batch_size]),max_length=max_length, padding='max_length',truncation=True,return_tensors='pt')
    attention_mask = tokens['attention_mask']

    with torch.no_grad():
      outs = model(**tokens)
      embeddings = outs.last_hidden_state
    
    attention_mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
    embeddings = embeddings * attention_mask
    # Calculate the mean of all word embeddings in each defect description:
    embeddings = torch.sum(embeddings, 1) / torch.clamp(attention_mask.sum(1), min=1e-9)
    # Concatenate this calculated mean_embeddings to the mean_pooled_embeddings:
    mean_pooled_embeddings = torch.cat((mean_pooled_embeddings, embeddings),0)

  # Return all embeddings:
  return mean_pooled_embeddings.detach().numpy() 


def get_success_rate(solution_id, success_status):
  ''' 
    Function calculates the success_rate (probability) of all solutions retrieved from the database. A more sophisticated function version.
    returns : a list of float values that represent the success probability for each solution.
  '''
  # Get the unique ids:
  # Create a dictionary and use each id in unique id as a key:
  success_rate_dict = { id : 0 for id in set(solution_id)}
  # Create a success rate list to store the probability of success:
  success_rate = []

  #index = np.arange(len(solution_id))

  # Expand dimensions of the solution_id and success_status list objects so we can concatenate them for faster processing:
  solution_id = np.expand_dims(solution_id, axis=1)
  success_status = np.expand_dims(success_status, axis=1)

  # Concatenate them into a 2-dimensional numpy array:
  solution_matrix = np.concatenate([solution_id, success_status],axis=-1)

  # For each id, filter out all occurrences in the list and calculate the probability of success (mean):
  # This requires that the success status be represented as '1' for successful and '0' for unsuccessful.
  # Success probability is calculated here using the second column (to calculate the mean after filtering the occurrences of each id) in the 2-dimensional array:
  for id in success_rate_dict.keys():
    success_rate_dict[id] = solution_matrix[solution_matrix[:,0] == id][:,1].astype('int').mean()

  # Assign the appropriate success probability to the solution id using the success_rate_dict dictionary.
  # After expand_dims, solution_id has a single column, so the ids are in column 0:
  for id in solution_id[:,0]:
    success_rate.append(success_rate_dict[id])

  # Return success_rate
  return success_rate


def get_prob_success(solution_id, solution_counter):
  ''' 
    Function calculates the success_rate (probability) of all solutions retrieved from the database. This is the function used by the algorithm
    written below this function definition.
    returns : a list of float values that represent the success probability for each solution.
  '''
  # Convert solution_counter to a numpy array:
  solution_counter = np.array(solution_counter)
  # Calculate the total counts of all solutions:
  total_count = solution_counter.sum()
  # Calculate success_percentage and round it to 1 decimal place:
  success_percentage = np.round((solution_counter / total_count * 100), 1)
  
  return success_percentage
 

def query_database(top_defects_ids):
  ''' This function queries the database given the ids of the most similar defect descriptions and a condition (whether to return L1 solutions). 
      It has been specifically left to be created by your team depending on the database management software that is being used.
      It should return solution_ids , solution_level , solution_counter. Each of these columns should be a list.
      Function should return None, None, None for these columns if no solution was found in the database.'''
  pass

# Define a sort_score function to sort the similarity scores in descending order.
def sort_score(scores):
  '''
    The function sorts the similarity score in descending order and returns it.
  '''
  result = []
  for each in scores:
    result.append(each)
  # Sort the list according to the similarity scores:
  result.sort(key=lambda x: x[1],reverse=True)
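  # Keep a 2-column (id, score) shape even when 'scores' is empty, so that later column indexing still works:
  return np.array(result).reshape(-1, 2)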

def get_solution(work_order, all_defect_ids, all_defect_descriptions, tokenizer,
                 bert_model, level='customer', threshold_score=0.8, batch_size= 1000, top_k=-1, return_type='json'):
  '''This function using deep learning, retrieves and returns the solutions and respective success_rates of the most similar workorder to the work order
     or description provided as one of its input. It returns an empty dictionary if no viable solution was found in the database.
     Args:
      work_order: string, a description of the defect created by a customer/technician - customer defect
      all_defect_ids: Ids of all defects stored in the database
      all_defect_descriptions: Description of all defects stored in the database.
      tokenizer: BERT model tokenizer
      bert_model: BERT model used as vectorizer by this function.
      level: 'customer' or 'technician'. It describes the person who created the work_order
      threshold_score: float, default == 0.8. It determines the similarity threshold score to use when retrieving the top_K similar defects.
      batch_size : integer , default == 1000, the number of defect descriptions to process/vectorize at once using multiprocessing.
      top_k: integer, default == -1. The number of defects to consider after similarity score computation.
      return_type: string, default == 'json'. It determines whether the function should return a dictionary or json object.
  '''
  # Convert list of defect_ids to a numpy array:
  all_defect_ids = np.array(all_defect_ids)

  ## Pass customer work order/tech_defect_description and all_defect_descriptions to vectorizer: BERT model 
  # Put the 'work order' in front of the 'all_defect_descriptions' for computational convenience
  # (a new list is built so that the caller's list is not modified):
  all_defect_descriptions = [work_order] + list(all_defect_descriptions)

  # Compute vectorized embeddings:
  vectorized_embeddings  = vectorize_all_defects(all_defect_descriptions, bert_model, tokenizer, batch_size=batch_size)
  
  # Calculate the cosine similarity between customer work order and all other descriptions extracted from the database:
  similarity_scores = cosine_similarity([vectorized_embeddings[0]], vectorized_embeddings[1:])

  # Concatenate defect_ids to vectorized_defect_descriptions:
  # First expand the dimensions of both 'all_defect_ids' and 'similarity_scores' to a 2-dimensional array:
  all_defects_ids = np.expand_dims(all_defect_ids, 1)
  n_samples  = similarity_scores.shape[-1]
  # Reshape similarity scores:
  similarity_scores = similarity_scores.reshape(n_samples, 1)

  # Concatenate both into a single array:
  similarity_scores_id = np.concatenate((all_defects_ids, similarity_scores), axis= -1)

  # Filter the top k defects greater than threshold score:
  top_defects = similarity_scores_id[ similarity_scores_id[:,1] > threshold_score]

  # Sort similarity scores:
  top_defects  = sort_score(top_defects)

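  # Extract only the top_k ids of the descriptions with a high similarity score.
  # Note: with the default top_k=-1, this slice keeps every row except the last (lowest-scoring) one.
  top_defect_ids  = top_defects[: top_k,0]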

  # Query database with these ids here and return solution_ids, solution_levels, solution's success percentage:
  # If it is at customer level, return all types of solutions:
  # query function is to be defined based on the database management system used.
  # query function should return (None, None, None) if no solutions were retrieved.
  # (The query_database stub takes no level argument yet, so both branches currently issue the same call.)

  if  level == 'customer':
    solution_ids , solution_level , solution_counter = query_database(top_defect_ids)
  else: # Else return only level 2 and 3 solutions:
    solution_ids , solution_level , solution_counter = query_database(top_defect_ids)
  
  # If there are no solutions (we know this by checking whether solution_ids is None)
  if solution_ids is None:
    # Create an empty dictionary and return it as json object.
    result = {}
  else:
    # Calculate the success percentage:
    success_percentage = get_prob_success(solution_ids, solution_counter)

    # Group by solution level:
    # First, define a dictionary for each level solution.
    # Level 1 solutions are only returned for customer orders, but the dictionary is always created
    # so that an unexpected 'L1' row cannot raise a NameError:
    level_1 = {}

    level_2 = {}
    level_3 = {}

    # Group by solution level into the appropriate level dictionaries as created above:
    # Data structure used for each level dictionary: a dict of {solution_ids: success_percentage}
    for index, s_level in enumerate(solution_level):
      if s_level == 'L1':
        # if solution level for this index solution is level 1, then store solution_id and success percentage as key , value pairs in the level_1 dictionary.
        level_1[solution_ids[index]] = success_percentage[index]
      elif s_level == 'L2':
        # Else if level 2 , store in the level 2 dictionary.
        level_2[solution_ids[index]] = success_percentage[index]
      else:
        # Else store in the level 3 dictionary.
        level_3[solution_ids[index]] = success_percentage[index]

    # Result contains :  order_id , order_type (whether technician or customer), work order , +/- level 1 solutions, level 2 solutions, level 3 solutions.
    # If a customer solution (L1 solution) is enabled, return level 1 solution alongside. If not, return level 2 and 3 solution only.
    # ('level' is the function argument; comparing the built-in 'type' to a string is always False.)
    if level == 'customer':
      result = {'Level 1': level_1, 'Level 2': level_2, 'Level_3': level_3}
    else:
      result = {'Level 2': level_2, 'Level_3': level_3}

  # Return result as dictionary or json object:
  if return_type == 'json':
    result = json.dumps(result)
  
  return  result
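
Since query_database is left to be implemented, here is one possible sketch of it using Python's built-in sqlite3 module. Everything about the storage layout is an assumption made for illustration: the table name `solutions`, its columns (`defect_id`, `solution_id`, `solution_level`, `solution_counter`), the extra `connection` and `include_l1` parameters, and the function name `query_database_sketch`. Only the return contract (three lists, or three None values when nothing is found) follows the docstring above.

```python
import sqlite3


def query_database_sketch(connection, top_defect_ids, include_l1=True):
    """Return (solution_ids, solution_level, solution_counter) lists, or three Nones."""
    ids = [int(i) for i in top_defect_ids]  # ids may arrive as floats from numpy
    if not ids:
        return None, None, None
    placeholders = ",".join("?" for _ in ids)
    sql = (
        "SELECT solution_id, solution_level, solution_counter "
        "FROM solutions "
        f"WHERE defect_id IN ({placeholders})"
    )
    if not include_l1:
        sql += " AND solution_level != 'L1'"  # technician case: skip level 1
    sql += " ORDER BY solution_id"  # deterministic order, not similarity order
    rows = connection.execute(sql, ids).fetchall()
    if not rows:
        return None, None, None
    solution_ids, solution_level, solution_counter = (list(col) for col in zip(*rows))
    return solution_ids, solution_level, solution_counter


# In-memory demo with made-up rows (hypothetical schema and data).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE solutions (defect_id INTEGER, solution_id INTEGER, "
    "solution_level TEXT, solution_counter INTEGER)"
)
connection.executemany(
    "INSERT INTO solutions VALUES (?, ?, ?, ?)",
    [(3, 2, "L1", 2), (5, 4, "L2", 5), (6, 5, "L3", 3)],
)

print(query_database_sketch(connection, [5.0, 3.0]))
# ([2, 4], ['L1', 'L2'], [2, 5])
print(query_database_sketch(connection, [5.0, 3.0], include_l1=False))
# ([4], ['L2'], [5])
print(query_database_sketch(connection, [99]))
# (None, None, None)
```

The ids are passed as query parameters rather than formatted into the SQL string, which avoids quoting problems. A production version would need to match the real schema and the chosen database driver.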

## Algorithm Testing
We will test the algorithm with the Excel file dataset. For this, we will modify the main function to suit the Excel file (in production, a database management system will be used instead of an Excel file).

To test this algorithm, we will use the 'Customer defect' column and pick out selected rows so that the performance of the algorithm can be appreciated.

We will use rows 2, 9, 12, and 17. These rows were chosen because of the uniqueness of their defect descriptions.

In [None]:
customer_order = pd.read_excel('Customer order.xlsx')
customer_order.head()

Unnamed: 0.1,Unnamed: 0,OUTPUT,Unnamed: 2,Unnamed: 3,INPUT L1,INPUT L2,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,OUTPUT.1,INPUT,INPUT.1,Unnamed: 18,Unnamed: 19,Unnamed: 20
0,English,Solution ID,workorder #,Technician name,Customer defect,Customer defect corrected by technician,Executed service,Solution,Root cause,TAGS,LAST_UPDATED,Solution Level,Improve equipment,Parts,Asset Name,Solution counter,Modality,Model,CODOS,USER_ID,Column1
1,Portuguese,ID,OS_NO,TECNICO,DEFEITO_CLIENTE,DEFEITO_CLIENTE_AJUSTADO,SERVICO_EXECUTADO,SOLUCAO,CAUSA_RAIZ,TAGS,LAST_UPDATED,NIVEL_DE_SOLUCAO,MELHORAR_EQUIPAMENTO,PECAS,NOME_DO_ATIVO,CONT_SOLUCOES,MODALIDADE,MODELO,CODOS,USER_ID,Column1
2,,1,,,AC não aquece,não esquenta,,,,,,,,,,2,HVAC,Brisa 1,,,
3,,2,,,AC não esfria,não esfria,,,,,,,,,,2,HVAC,Coolmax,,,
4,,3,,,Ac não esfria nem esquenta,não liga,,,,,,,,,,3,HVAC,Brisa 2,,,


Let's make row 0 the name of each column. We also use the natural index (the first column, which has no name) as the ID value of each customer defect.

In [None]:
customer_order.columns = customer_order.loc[0]

In [None]:
customer_order.head()

Unnamed: 0,English,Solution ID,workorder #,Technician name,Customer defect,Customer defect corrected by technician,Executed service,Solution,Root cause,TAGS,LAST_UPDATED,Solution Level,Improve equipment,Parts,Asset Name,Solution counter,Modality,Model,CODOS,USER_ID,Column1
0,English,Solution ID,workorder #,Technician name,Customer defect,Customer defect corrected by technician,Executed service,Solution,Root cause,TAGS,LAST_UPDATED,Solution Level,Improve equipment,Parts,Asset Name,Solution counter,Modality,Model,CODOS,USER_ID,Column1
1,Portuguese,ID,OS_NO,TECNICO,DEFEITO_CLIENTE,DEFEITO_CLIENTE_AJUSTADO,SERVICO_EXECUTADO,SOLUCAO,CAUSA_RAIZ,TAGS,LAST_UPDATED,NIVEL_DE_SOLUCAO,MELHORAR_EQUIPAMENTO,PECAS,NOME_DO_ATIVO,CONT_SOLUCOES,MODALIDADE,MODELO,CODOS,USER_ID,Column1
2,,1,,,AC não aquece,não esquenta,,,,,,,,,,2,HVAC,Brisa 1,,,
3,,2,,,AC não esfria,não esfria,,,,,,,,,,2,HVAC,Coolmax,,,
4,,3,,,Ac não esfria nem esquenta,não liga,,,,,,,,,,3,HVAC,Brisa 2,,,


In [None]:
customer_order.columns

Index(['English', 'Solution ID', 'workorder #', 'Technician name',
       'Customer defect', 'Customer defect corrected by technician',
       'Executed service', 'Solution', 'Root cause', 'TAGS', 'LAST_UPDATED',
       'Solution Level', 'Improve equipment', 'Parts', 'Asset Name',
       'Solution counter', 'Modality', 'Model', 'CODOS', 'USER_ID', 'Column1'],
      dtype='object', name=0)

Let's define a customized version of the algorithm to suit the dataset provided.

In [None]:
def get_solution_test(order_id, work_order, all_defect_ids, all_defect_descriptions,
                 tokenizer, bert_model, threshold_score=0.6, batch_size= 1000, top_k=-1):
  '''This function using deep learning, retrieves and returns the solutions and respective success_rates of the most similar workorder to the work order
     or description provided as one of its input. It returns an empty dictionary if no viable solution was found in the database.
     Args:
      order_id : integer value, (id) of the new order created by a customer/ technician.
      work_order: string, a description of the defect created by a customer/technician.
      all_defect_ids: Ids of all defects stored in the database
      all_defect_descriptions: Description of all defects stored in the database.
      tokenizer: BERT model tokenizer
      bert_model: BERT model used as vectorizer by this function.
      threshold_score: float, default == 0.6. It determines the similarity threshold score to use when retrieving the top_K similar defects.
      batch_size : integer , default == 1000, the number of defect descriptions to process/vectorize at once using multiprocessing.
      top_k: integer, default == -1. The number of top defects to consider after similarity score computation.
  '''
  ## Pass customer work order/tech_defect_description and all_defect_descriptions to vectorizer: BERT model 
  # Add the 'work order' to the 'all_defect_descriptions' for computational convenience:
  all_defect_descriptions.insert(0, work_order)

  # Compute vectorized embeddings:
  vectorized_embeddings  = vectorize_all_defects(all_defect_descriptions, bert_model, tokenizer)
  
  # Calculate the cosine similarity between customer work order and all other descriptions extracted from the database:
  similarity_scores = cosine_similarity([vectorized_embeddings[0]], vectorized_embeddings[1:])

  # Concatenate defect_ids to vectorized_defect_descriptions:
  # First expand the dimensions of both 'all_defect_ids' and 'similarity_scores' to a 2-dimensional array:
  all_defects_ids = np.expand_dims(all_defect_ids, 1)
  n_samples  = similarity_scores.shape[-1]
  # Reshape similarity scores:
  similarity_scores = similarity_scores.reshape(n_samples, 1)

  # Concatenate both into a single array:
  similarity_scores_id = np.concatenate((all_defects_ids, similarity_scores), axis= -1)

  # Filter the top k defects greater than threshold score:
  top_defects = similarity_scores_id[ similarity_scores_id[:,1] > threshold_score]

  # Sort similarity scores:
  top_defects  = sort_score(top_defects)

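  # Extract only the top_k ids of the descriptions with a high similarity score.
  # Note: with the default top_k=-1, this slice keeps every row except the last (lowest-scoring) one.
  top_defect_ids  = top_defects[: top_k,0]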

  # If there are no similar defects above the threshold, report it and return no result:
  if len(top_defect_ids) == 0:
    print('There were no solutions found for this defect in the database')
    # Return a pair so that callers can still unpack 'work_order, result':
    return work_order, None

  # Query the excel file with these ids and retrieve the solution_ids and solution counters
  # (a copy is taken so that adding a column below does not touch 'customer_order'):
  c_order = customer_order.loc[top_defect_ids].copy()
  solution_ids , solution_counter = list(c_order['Solution ID']), list(c_order['Solution counter'])

  # Calculate the success percentage:
  success_percentage = get_prob_success(solution_ids, solution_counter)
  c_order['Success percentage (%)'] = success_percentage
  c_order = c_order[['Customer defect','Solution ID', 'Solution counter', 'Success percentage (%)']]
  return  work_order , c_order

Now, let's extract the customer orders we are going to be using for demonstration alongside their IDs (the natural index). We will also drop these rows from the dataset.

In [None]:
# order ID to be extracted:
order_id = [2, 9 ,12, 17]
# Customer_defect_orders to be extracted.
customer_defect_orders = customer_order['Customer defect'].loc[order_id]
customer_defect_orders

2                  AC não aquece
9         aquecedor não funciona
12    ar condicionado não aquece
17          compressor não parte
Name: Customer defect, dtype: object

In [None]:
# Drop these order_ids from the excel file for convenience:
rows = [i for i in range(len(customer_order)) if i not in order_id]
database = customer_order.loc[rows]
database.head()

Unnamed: 0,English,Solution ID,workorder #,Technician name,Customer defect,Customer defect corrected by technician,Executed service,Solution,Root cause,TAGS,LAST_UPDATED,Solution Level,Improve equipment,Parts,Asset Name,Solution counter,Modality,Model,CODOS,USER_ID,Column1
0,English,Solution ID,workorder #,Technician name,Customer defect,Customer defect corrected by technician,Executed service,Solution,Root cause,TAGS,LAST_UPDATED,Solution Level,Improve equipment,Parts,Asset Name,Solution counter,Modality,Model,CODOS,USER_ID,Column1
1,Portuguese,ID,OS_NO,TECNICO,DEFEITO_CLIENTE,DEFEITO_CLIENTE_AJUSTADO,SERVICO_EXECUTADO,SOLUCAO,CAUSA_RAIZ,TAGS,LAST_UPDATED,NIVEL_DE_SOLUCAO,MELHORAR_EQUIPAMENTO,PECAS,NOME_DO_ATIVO,CONT_SOLUCOES,MODALIDADE,MODELO,CODOS,USER_ID,Column1
3,,2,,,AC não esfria,não esfria,,,,,,,,,,2,HVAC,Coolmax,,,
4,,3,,,Ac não esfria nem esquenta,não liga,,,,,,,,,,3,HVAC,Brisa 2,,,
5,,4,,,AC não gela,não funciona,,,,,,,,,,5,HVAC,Brisa 1,,,


Now, let's test our algorithm against the first customer order, which was at row 2.

In [None]:
order_id[0], customer_defect_orders[2]

(2, 'AC não aquece')

The model must be loaded first before the algorithm is used so it can attend to all queries.

Since we have already loaded the model earlier, we just continue with our algorithm test.

In [None]:
work_order , result = get_solution_test(order_id[0], customer_defect_orders[2], np.array(database.index), 
                                   list(database['Customer defect']), tokenizer , model, top_k=4)
print(f'The customer work order was: {work_order}')
result

The customer work order was: AC não aquece


Unnamed: 0,Customer defect,Solution ID,Solution counter,Success percentage (%)
5.0,AC não gela,4,5,33.3
6.0,AC não gela nem aquece,5,3,20.0
3.0,AC não esfria,2,2,13.3
10.0,aquecedor não liga,9,5,33.3
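
The 'Success percentage (%)' column above can be reproduced by hand from the 'Solution counter' column. The sketch below mirrors what get_prob_success computes (each counter divided by the sum of the retrieved counters, times 100, rounded to one decimal place) in plain Python; the helper name `share_of_total` is only illustrative.

```python
def share_of_total(solution_counter):
    """Each counter as a percentage of the sum of all retrieved counters."""
    total = sum(solution_counter)
    return [round(count / total * 100, 1) for count in solution_counter]


# 'Solution counter' values of the four rows returned above.
counters = [5, 3, 2, 5]
percentages = share_of_total(counters)
print(percentages)  # [33.3, 20.0, 13.3, 33.3]
```

Because the denominator is the sum over the retrieved rows only, the same solution can show a different percentage when the threshold or top_k changes the set of retrieved rows.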


To get more or fewer similar results, the top_k and threshold_score arguments can be adjusted to suit your needs. However, I recommend tuning only the threshold_score in production and leaving top_k as it is. Here, the solution level was not returned because that column is empty in the Excel file.
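
To see how the two arguments interact, here is a small sketch of the filter, sort, and cap steps on made-up similarity scores. The scores, ids, and the helper name `select_top_defects` are hypothetical. Unlike the notebook, this sketch uses `top_k=None` to mean "no cap", because a Python slice such as `rows[:-1]` leaves out the last element.

```python
def select_top_defects(scored_ids, threshold_score, top_k=None):
    """Keep ids scoring above the threshold, best first, optionally capped at top_k."""
    kept = [(defect_id, score) for defect_id, score in scored_ids
            if score > threshold_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    if top_k is not None:
        kept = kept[:top_k]
    return [defect_id for defect_id, _ in kept]


# Made-up (defect_id, similarity_score) pairs, for illustration only.
scored_ids = [(3, 0.71), (5, 0.93), (6, 0.88), (10, 0.64), (35, 0.55)]

print(select_top_defects(scored_ids, 0.6))           # [5, 6, 3, 10]
print(select_top_defects(scored_ids, 0.8))           # [5, 6]
print(select_top_defects(scored_ids, 0.6, top_k=2))  # [5, 6]
```

Raising the threshold removes weak matches regardless of how many there are, whereas top_k cuts the list at a fixed length even if the remaining matches are strong.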

We repeat the same thing for other selected orders and we also play with the threshold and top_k values.

In [None]:
# We use a threshold_score of 0.8 here
work_order , result = get_solution_test(order_id[1], customer_defect_orders[9], np.array(database.index), 
                                   list(database['Customer defect']), tokenizer , model, threshold_score = 0.8)
print(f'The customer work order was: {work_order}')
result

The customer work order was: aquecedor não funciona


Unnamed: 0,Customer defect,Solution ID,Solution counter,Success percentage (%)
10.0,aquecedor não liga,9,5,100.0


In [None]:
# We use a top_k value of 5
work_order , result = get_solution_test(order_id[2], customer_defect_orders[12], np.array(database.index), 
                                   list(database['Customer defect']), tokenizer , model, top_k=5)
print(f'The customer work order was: {work_order}')
result

The customer work order was: ar condicionado não aquece


Unnamed: 0,Customer defect,Solution ID,Solution counter,Success percentage (%)
14.0,ar condicionado não esquenta,13,5,26.3
15.0,ar condicionado não gela,14,1,5.3
13.0,ar condicionado não esfria,12,5,26.3
6.0,AC não gela nem aquece,5,3,15.8
10.0,aquecedor não liga,9,5,26.3


In [None]:
# We use a threshold_score of 0.75
work_order , result = get_solution_test(order_id[2], customer_defect_orders[12], np.array(database.index), 
                                   list(database['Customer defect']), tokenizer , model, threshold_score= 0.75)
print(f'The customer work order was: {work_order}')
result

The customer work order was: ar condicionado não aquece


Unnamed: 0,Customer defect,Solution ID,Solution counter,Success percentage (%)
14.0,ar condicionado não esquenta,13,5,20.8
15.0,ar condicionado não gela,14,1,4.2
13.0,ar condicionado não esfria,12,5,20.8
6.0,AC não gela nem aquece,5,3,12.5
10.0,aquecedor não liga,9,5,20.8
35.0,ventilador não gira,34,5,20.8
