<a href="https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/train_FAQ_model_with_sentence_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



*   Author: [Haowen Jiang](https://howard-haowen.rohan.tw/)



# Download the dataset

In [None]:
!pip install -q datasets

In [2]:
from datasets import load_dataset

dataset = load_dataset("web_questions")


In [3]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['url', 'question', 'answers'],
        num_rows: 3778
    })
    test: Dataset({
        features: ['url', 'question', 'answers'],
        num_rows: 2032
    })
})


In [7]:
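def dataset2df(dataset, split='train'):
    # Convert one split to a DataFrame and keep only the question/answer columns
    df = dataset[split].to_pandas()
    df = df[['question', 'answers']].copy()  # copy so the assignment below modifies our own frame
    # Each 'answers' entry is an array of strings; join it into one comma-separated string
    df['answers'] = df['answers'].apply(lambda arr: ", ".join(arr))
    return df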

In [8]:
df = dataset2df(dataset)
df

index,question,answers
0,what is the name of justin bieber brother?,"Jazmyn Bieber, Jaxon Bieber"
1,what character did natalie portman play in sta...,Padmé Amidala
2,what state does selena gomez?,New York City
3,what country is the grand bahama island in?,Bahamas
4,what kind of money to take to bahamas?,Bahamian dollar
...,...,...
3773,where did sir donald bradman live?,Adelaide
3774,what are the holydays of obligation in the cat...,"Name day, Saint Patrick's Day, Maundy Thursday..."
3775,what is the name of the broncos mascot?,Miles
3776,what caused the russian financial crisis of 1998?,Allies of World War II


# Load a pretrained model

In [None]:
!pip install -q -U sentence-transformers

In [11]:
from sentence_transformers import SentenceTransformer, util
import torch

In [12]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']


# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))






Query: A man is eating pasta.

Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.7035)
A man is eating a piece of bread. (Score: 0.5272)
A man is riding a horse. (Score: 0.1889)
A man is riding a white horse on an enclosed ground. (Score: 0.1047)
A cheetah is running behind its prey. (Score: 0.0980)




Query: Someone in a gorilla costume is playing a set of drums.

Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.6433)
A woman is playing violin. (Score: 0.2564)
A man is riding a horse. (Score: 0.1389)
A man is riding a white horse on an enclosed ground. (Score: 0.1191)
A cheetah is running behind its prey. (Score: 0.1080)




Query: A cheetah chases prey on across a field.

Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.8253)
A man is eating food. (Score: 0.1399)
A monkey is playing drums. (Score: 0.1292)
A man is riding a white horse on an enclosed ground. (Score: 0.1097)
A man is riding a 
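
The ranking step above can be illustrated without any model. The following is a minimal sketch in plain Python of what cosine similarity followed by a top-k selection computes; the helper names `cosine_similarity` and `top_k_similar` and the 2-D toy vectors are hypothetical stand-ins for illustration, not part of `sentence_transformers` or `torch`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(query_vec, corpus_vecs, k):
    """Return (score, index) pairs for the k corpus vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Made-up 2-D vectors standing in for sentence embeddings (assumption for illustration)
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 0.0]

toy_hits = top_k_similar(toy_query, toy_corpus, k=2)
for score, idx in toy_hits:
    print(idx, "(Score: {:.4f})".format(score))
```

Because cosine similarity ignores vector length, the query `[2.0, 0.0]` matches `[1.0, 0.0]` perfectly even though the two vectors differ in magnitude.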

In [13]:
# Pass a plain list of strings to the encoder rather than a pandas Series
texts = df['question'].tolist()
embeddings = embedder.encode(texts)

In [14]:
len(embeddings)

3778

In [15]:
len(embeddings[0])

384

In [16]:
embeddings[0]

array([ 1.62790064e-02,  1.49354329e-02,  8.72291401e-02, -5.57135716e-02,
       -9.60655659e-02,  2.01688707e-02,  1.25532627e-01,  2.05324683e-02,
        2.49744337e-02, -6.57258257e-02, -2.15121564e-02, -6.15234636e-02,
        4.93846200e-02, -6.47809431e-02,  8.62696394e-03,  3.79998684e-02,
        1.63003653e-02,  1.04235485e-01, -1.26602547e-02, -4.08711284e-02,
       -1.21324286e-02, -3.62557843e-02,  6.34663040e-03, -2.23475173e-02,
       -1.65197887e-02,  5.43341087e-03,  3.40506844e-02, -2.60913707e-02,
        4.31539267e-02, -4.66088485e-03, -4.00603702e-03, -1.35461492e-02,
        3.70181166e-02, -1.81436874e-02,  5.86432056e-04, -1.17724854e-02,
        1.03982249e-02,  6.60857335e-02,  2.67094690e-02,  9.79965436e-04,
       -5.71829304e-02, -3.15324962e-02, -1.22866677e-02,  1.99792348e-02,
       -2.38678586e-02, -5.13001047e-02, -8.60858802e-03, -8.46880153e-02,
        2.89068706e-02,  5.33487126e-02, -3.39674503e-02, -8.92750919e-02,
        7.07146227e-02,  

# Embed the corpus and build an embedding index

In [17]:
!pip install -q faiss-cpu

In [18]:
import faiss
import numpy as np

def create_index_embeddings(embed_arrays: np.ndarray, index_arrays: np.ndarray):

    # Step 1: Change data types (Faiss expects float32 vectors and int64 IDs)
    embeddings = embed_arrays.astype("float32")
    ids = index_arrays.astype("int64")

    # Step 2: Instantiate a flat index that scores by inner product (IP), not L2 distance
    index = faiss.IndexFlatIP(embeddings.shape[1])

    # Step 3: Wrap the index in IndexIDMap so results carry our own document IDs
    index = faiss.IndexIDMap(index)

    # Step 4: Add vectors and their IDs
    index.add_with_ids(embeddings, ids)

    return index, embeddings

In [19]:
type(embeddings)

numpy.ndarray

In [20]:
doc_ids = df.index.to_numpy()
type(doc_ids)

numpy.ndarray

In [21]:
fs_index, fs_embeddings = create_index_embeddings(embeddings, doc_ids)
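
To make the behaviour of the index concrete, here is a rough pure-Python sketch of what a flat inner-product index with an ID map does: it scores the query against every stored vector and returns the best scores together with the caller-supplied IDs. `ToyFlatIPIndex`, `l2_normalize`, and the toy vectors are hypothetical names and values used only for illustration; this is not how Faiss is implemented internally. The sketch also shows that when vectors are normalized to unit length, the inner product is the same number as cosine similarity.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

class ToyFlatIPIndex:
    """Brute-force inner-product search over (id, vector) pairs. Illustration only."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add_with_ids(self, vectors, ids):
        # Store each vector next to the ID the caller chose for it
        self.vectors.extend(vectors)
        self.ids.extend(ids)

    def search(self, query, k):
        # Score every stored vector, then keep the k highest inner products
        scores = [sum(q * x for q, x in zip(query, vec)) for vec in self.vectors]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [scores[i] for i in order], [self.ids[i] for i in order]

# Made-up 2-D vectors and arbitrary IDs (assumptions for illustration)
toy_vectors = [l2_normalize(v) for v in ([3.0, 4.0], [1.0, 0.0], [-1.0, 0.0])]
toy_index = ToyFlatIPIndex()
toy_index.add_with_ids(toy_vectors, ids=[100, 200, 300])

toy_scores, toy_ids = toy_index.search(l2_normalize([1.0, 0.0]), k=2)
print(toy_ids, toy_scores)
```

The returned IDs are the ones passed to `add_with_ids`, not list positions, which is why the search results can later be used as labels to look up rows in a dataframe.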

# Search for similar texts with a user query

In [38]:
def search_by_user_query(query: str,  # User query text
                         embed_model=embedder,  # SentenceTransformer embedding model
                         index=fs_index,  # Faiss index
                         df=df,  # Corpus in dataframe
                         topK=10):  # Number of results to return

    # Embed the query with the same model used for the corpus
    # Note: The query is wrapped in a list so the result is a 2D array of shape (1, dim)
    query_embedding = embed_model.encode([query])

    # Convert the embedding to float32, the dtype Faiss expects
    query_embedding = np.asarray(query_embedding, dtype="float32")

    # Get the scores (D) and document IDs (I); each has one row per query
    D, I = index.search(query_embedding, k=topK)

    # Look up the matched rows; the IDs are the dataframe's index labels
    results_df = df.loc[I.flatten(), ['question', 'answers']]
    questions = results_df['question'].tolist()
    all_answers = results_df['answers'].tolist()

    # The best match is shown first, followed by the remaining candidates
    print(f"Answer in corpus: {all_answers[0]}")
    print(f"Question in corpus: {questions[0]}")
    print("="*30)
    print("Other questions you may be asking:\n")

    for q, a in zip(questions[1:], all_answers[1:]):
        print(f"Q: {q}")
        print(f"A: {a}\n")

In [39]:
query = "Michael Jordan went to which school?"
search_by_user_query(query)

Answer in corpus: University of North Carolina at Chapel Hill, Emsley A. Laney High School
Question in corpus: what school did michael jordan go to?
Other questions you may be asking:

Q: what school did michael jordan play for?
A: North Carolina Tar Heels men's basketball

Q: where did michael jordan play high school basketball?
A: Emsley A. Laney High School

Q: what college did michael jordan play basketball for?
A: University of North Carolina at Chapel Hill

Q: what was basketball player michael jordan's major in college?
A: Cultural geography

Q: what year did michael jordan started his career?
A: 1984 NBA Draft

Q: what year michael jordan came in the nba?
A: 1984 NBA Draft

Q: when michael jordan got drafted?
A: 1984 NBA Draft

Q: what year did michael jordan get drafted?
A: 1984 NBA Draft

Q: when did michael jordan started playing basketball in the nba?
A: 1984



In [43]:
query = "What language do people speak in Taiwan?"
search_by_user_query(query)

Answer in corpus: Standard Mandarin
Question in corpus: what language they speak in taiwan?
Other questions you may be asking:

Q: what does taiwan people speak?
A: Chinese, Hakka Language, Taiwanese Mandarin, Taiwanese, Mandarin Chinese

Q: what kind of language do china speak?
A: Nepali Language, Standard Tibetan, Chinese language, Zhuang languages, Korean Language, English Language, Esperanto Language, Mongolian language, Portuguese Language, Cantonese

Q: what language does people in thailand speak?
A: Thai Language

Q: what type of language does china speak?
A: Nepali Language, Standard Tibetan, Chinese language, Zhuang languages, Korean Language, English Language, Esperanto Language, Mongolian language, Portuguese Language, Cantonese

Q: what kind of language do they speak in china?
A: Standard Mandarin

Q: what language do they speak thailand?
A: Mon Language, Lao Language, Khmer language, Hmong language, Thai Language, Cham language, Mlabri language, Malay, Pattani Language, Ny

In [45]:
query = "What's the reason for Rome's destruction?"
search_by_user_query(query)

Answer in corpus: Capture of Rome
Question in corpus: what made ancient rome fall?
Other questions you may be asking:

Q: what does the name rome mean?
A: Romulus and Remus

Q: where did the name rome come from?
A: Romulus and Remus

Q: where did romans go?
A: Europe

Q: what was the colosseum in rome used for?
A: Gladiator

Q: what are some sights to see in rome?
A: Sistine Chapel, Piazza Navona, Roman Forum, Baths of Caracalla, Castel Sant'Angelo, Bocca della Verità, Pantheon, St. Peter's Basilica, Raphael Rooms, Colosseum

Q: what does the roman empire consist of?
A: Cilicia

Q: where was the roman empire centered?
A: Rome

Q: what places made up the western roman empire?
A: Europe, North Africa, Middle East

Q: where is rome italy located on a map?
A: Rome

