# Question answering using embeddings-based search

In [1]:
# imports
import ast  # for converting embeddings saved as strings back to arrays
import openai  # for calling the OpenAI API
import pandas as pd  # for storing text and embeddings data
from scipy import spatial  # for calculating vector similarities for search


# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

# read the API key from the environment rather than hard-coding it in the notebook
import os
openai.api_key = os.environ["OPENAI_API_KEY"]

In [2]:
embeddings_path = "/home/ishikaa2/quickans/ans_generator/data/ir_txt/ir_book_embeddings.csv"

df = pd.read_csv(embeddings_path)

In [3]:
# convert embeddings from CSV str type back to list type
df['embedding'] = df['embedding'].apply(ast.literal_eval)
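
The conversion above is needed because a CSV file stores each embedding list as its string representation. Here is a minimal sketch of that round trip. The three-element vector is made up for illustration; real embeddings are much longer.

```python
import ast

# Hypothetical short vector standing in for a real embedding.
original = [-0.0172, 0.0219, 0.0051]

# Writing to CSV stores the list as its string representation.
as_text = str(original)

# ast.literal_eval parses Python literals only, so unlike eval it will not
# execute arbitrary code found in the file.
restored = ast.literal_eval(as_text)
```

After the round trip, `restored` is a list of floats again and can be passed to a distance function.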

In [8]:
# the dataframe has two columns: "text" and "embedding"
df

Unnamed: 0,text,embedding
0,ChengXiang Zhai Sean Massung Text Data Managem...,"[-0.017203252762556076, 0.021963899955153465, ..."
1,This has led to an increasing demand for power...,"[-0.016908559948205948, 0.010532873682677746, ..."
2,Unlike data generated by a computer system or ...,"[-0.018540382385253906, 0.014797425828874111, ..."
3,As such text data are especially valuable for ...,"[-0.020149044692516327, 0.02085673063993454, 0..."
4,In contrast to structured data which conform t...,"[-0.02140810526907444, 0.02335185930132866, 0...."
...,...,...
8045,Please note that this α is different from αd i...,"[0.0068044946528971195, 0.035624921321868896, ..."
8046,Of course the next question is how to estimate...,"[0.007543814368546009, 0.032577402889728546, -..."
8047,where F d1 dk is the set of feedback document...,"[-0.016099639236927032, 0.021384460851550102, ..."
8048,Now given λ the feedback documents F and the ...,"[-0.014733465388417244, 0.021942878141999245, ..."


In [4]:
# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
):
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]
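
The ranking step in the search function can be looked at on its own, without calling the API. The sketch below is an illustrative variant, not the notebook's implementation: it computes cosine relatedness in plain Python and uses `heapq.nlargest` to keep only the best matches. The names `cosine_relatedness` and `top_n_related` and the toy 2-D vectors are assumptions made for the example.

```python
import heapq
import math


def cosine_relatedness(a, b):
    """Cosine similarity of two equal-length vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_related(query_embedding, texts, embeddings, top_n=2):
    """Return the top_n (text, relatedness) pairs, most related first."""
    scored = (
        (text, cosine_relatedness(query_embedding, emb))
        for text, emb in zip(texts, embeddings)
    )
    # nlargest keeps only top_n items instead of sorting the whole corpus.
    return heapq.nlargest(top_n, scored, key=lambda pair: pair[1])


# Toy 2-D vectors standing in for real embeddings (illustrative values only).
toy_texts = ["crawling", "indexing", "smoothing"]
toy_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
ranked = top_n_related([1.0, 0.1], toy_texts, toy_embeddings)
```

With these toy values, "crawling" ranks first and "indexing" second, because their vectors point closest to the query vector.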


In [10]:
# examples
strings, relatednesses = strings_ranked_by_relatedness("Web Crawling", df, top_n=5)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness}")
    display(string)

0.8846888413640148


'10. 1 Web Crawling The crawler is also called a spider or a software robot that crawls traverses parses and downloads pages on the web.  Building a toy crawler is relatively easy because you just need to start with a set of seed pages fetch pages from the web and parse these pages’ new links.  We then add them to a queue and then explore those page’s links in a breadthfirst search until we are satisfied.  Building a real crawler is quite tricky and there are some complicated issues that we inevitably deal with.'

0.8746059769842044


'We’ve already described all indexing steps except crawling in detail in Chapter 8.  After our crawling discussion we move onto the particular challenges of web indexing.  Then we discuss how we can take advantage of links between pages in link analysis.  The last technique we discuss is learning to rank which is a way to combine many different features for ranking.  10. 1 Web Crawling The crawler is also called a spider or a software robot that crawls traverses parses and downloads pages on the web.  Building a toy crawler is relatively easy because you just need to start with a set of seed pages fetch pages from the web and parse these pages’ new links.'

0.8735251396541552


'1 Web Crawling The crawler is also called a spider or a software robot that crawls traverses parses and downloads pages on the web.  Building a toy crawler is relatively easy because you just need to start with a set of seed pages fetch pages from the web and parse these pages’ new links.  We then add them to a queue and then explore those page’s links in a breadthfirst search until we are satisfied.  Building a real crawler is quite tricky and there are some complicated issues that we inevitably deal with.  One issue is robustness What if the server doesn’t respond or returns unparseable garbage What if there’s a trap that generates dynamically generated pages that attract your crawler to keep crawling the same site in circles Yet another issue is that we don’t want to overload one particular server with too many crawling requests.  Those may cause the site to experience a denial of service some sites will also block IP addresses that they believe to be crawling them or creating too 

0.8728600235603673


'In the next section we will discuss crawling.  We’ve already described all indexing steps except crawling in detail in Chapter 8.  After our crawling discussion we move onto the particular challenges of web indexing.  Then we discuss how we can take advantage of links between pages in link analysis.  The last technique we discuss is learning to rank which is a way to combine many different features for ranking.  10. 1 Web Crawling The crawler is also called a spider or a software robot that crawls traverses parses and downloads pages on the web.  Building a toy crawler is relatively easy because you just need to start with a set of seed pages fetch pages from the web and parse these pages’ new links.'

0.8725280804082206


'One interesting variation is called focused crawling.  Here we’re going to crawl some pages about a particular topic eg all pages about automobiles.  This is typically going to start with a query that you use to get some results.  Then you gradually crawl more.  An even more extreme version of focused crawling is for example downloading and indexing all forum posts on a particular forum.  In this case we might have a URL such as httpwww. textdatabookforum. comboardsid3 which refers to the third post on the forum.'

In [5]:
def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = "Suppose you are a teaching assistant for the course Advanced Information Retrieval and a student has posed the following question.\n"
    question = f"\n\nQuestion: {query}\n\n"
    end = 'How will you answer the question? \n Here are some snippets from the course textbook which may be useful.\n\n'
    book_info = ""
    
    preface = introduction + question + end
    for string in strings:
        next_article = f'\n{string}\n'
        # cap the message at 2500 characters (token_budget is currently unused)
        if len(preface + book_info + next_article) > 2500:
            break
        else:
            book_info += next_article
    return preface + book_info

def api_call(message, model: str = GPT_MODEL):
    messages = [
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message

def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    
    reply = api_call(message, model)
    return reply
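
`query_message` accepts a `token_budget` argument but limits the message with a fixed character count. One possible way to make the limit configurable is sketched below, assuming a hypothetical helper `build_context` that takes the size measure as a parameter. `len` counts characters; a tokenizer-based counter could be passed instead, and the word-count measure here is only for demonstration.

```python
def build_context(snippets, preface, budget, measure=len):
    """Append snippets (already sorted by relatedness) until the budget is hit.

    `measure` returns the size of a string; len counts characters, and a
    tokenizer-based counter could be passed in instead.
    """
    message = preface
    for snippet in snippets:
        candidate = f"\n{snippet}\n"
        if measure(message + candidate) > budget:
            break
        message += candidate
    return message


def count_words(text):
    """Crude size measure used only for this illustration."""
    return len(text.split())


# Illustrative call with a hypothetical budget of 8 words.
message = build_context(
    ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"],
    preface="Question: example?",
    budget=8,
    measure=count_words,
)
```

In this call the first two snippets fit within the budget and the third is dropped, which mirrors the `break` in `query_message`.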

In [27]:
query = 'When modeling queries, how is multi-bernoulli different from multinomial?'

api_call(query)
# ask('When modeling queries, how is multi-bernoulli different from multinomial?')

'Multi-Bernoulli and Multinomial are two different models used in probability theory and statistics to model queries. \n\nMulti-Bernoulli is a model used to represent a sequence of binary events, where each event can have one of two outcomes (success or failure). It is used to model queries where the data is binary, such as in object tracking or sensor fusion. In Multi-Bernoulli, the probability of success or failure is assumed to be constant across all events.\n\nOn the other hand, Multinomial is a model used to represent a sequence of events where each event can have one of several outcomes. It is used to model queries where the data is categorical, such as in natural language processing or image classification. In Multinomial, the probability of each outcome is assumed to be constant across all events.\n\nIn summary, Multi-Bernoulli is used for binary data, while Multinomial is used for categorical data.'

In [6]:
def compare_responses(query):
    emb_reply = ask(query)
    base_reply = api_call(query)
    

    print(f'Query:{query}\n\n')
    print("Embedding Method response\n")
    print(emb_reply+"\n\n")
    print('Base GPT response\n')
    print(base_reply)

    return emb_reply, base_reply

In [31]:
query = 'How do you define a reference language model?'

compare_responses(query)

Query:How do you define a reference language model?


Embedding Method response

A reference language model is a probability distribution over word sequences that is used to estimate the probability of an unseen word in a corpus. If a word is not observed in the corpus, its probability is assumed to be governed by the reference language model. In the case of retrieval, a natural choice for the reference language model would be the collection language model. The collection language model is used to adjust the probability of the maximum likelihood estimate of seen terms. The general English language model can be used as a background language model to filter out common words and obtain a probability ratio for each word. Language models are useful for quantifying the uncertainties associated with the use of natural language and can help answer many interesting questions related to text analysis and information retrieval. Neural language models can be represented as a neural network and can

In [32]:
query = 'How can we use language models for part of speech tagging?'

compare_responses(query)

Query:How can we use language models for part of speech tagging?


Embedding Method response

To use language models for part of speech tagging, we can assign a part of speech tag to each word based on its surrounding words. This allows us to count the most frequent nouns or determine what kind of nouns are associated with what kind of verbs. We can also consider n-grams of part of speech tags and mix n-grams of words and part of speech tags to create hybrid features that can be useful for sentiment analysis. Additionally, we can use language models to perform semantic analysis of word relations by finding what words are semantically associated with a given word. However, it is important to note that while part of speech tagging is a relatively easy task, full structure parsing and semantic analysis remain difficult and are only successful for some aspects of analysis. Overall, using language models for part of speech tagging can enrich the representation of text data and enable deeper

In [33]:
query = 'What happens to the Dirichlet smoothing when the document length goes to infinity'

compare_responses(query)

Query:What happens to the Dirichlet smoothing when the document length goes to infinity


Embedding Method response

When the document length goes to infinity, the effect on Dirichlet smoothing is that the coefficient αd tends to zero. This is because αd penalizes long documents, and as the document length increases, the number of observed words also increases, making it less necessary to do smoothing. Therefore, the Dirichlet smoothing becomes less effective as the document length increases towards infinity. However, it is important to note that this effect is not unique to Dirichlet smoothing and is a general property of smoothing methods that penalize long documents.


Base GPT response

When the document length goes to infinity, the effect of Dirichlet smoothing decreases. This is because as the document length increases, the probability estimates become more accurate and the need for smoothing decreases. In other words, as the amount of data increases, the impact of the prior dis

In [25]:
#['What order should you evaluate the kl divergence?'] #, 
queries = ['Why do we need a background language model?', 
           'How does a tokenizer work?', 
           'What are the components of an inverted index?', 
           'What is the document filtering problem?', 
           'What are the aspects of a search engine?', 
           'For computing the f1 score, why can\'t we take the mean of precision and recall?',
           'Why can’t we use IR models for ranking of webpages?',
           'How do you evaluate a filtering system?',
           'What are the components in context-based filtering?',
           'What is the exploration-exploitation tradeoff?',
           'What is beta-gamma threshold learning?',
           'What type of word relations are there?',
           'What is intrusion detection?',
           'What are some cluster similarity measures?']
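import time

for q in queries:
    print(q)
    # Uncomment to query the API for each question, pausing between calls:
    # emb, base = compare_responses(q)
    # print('\n\n-----------------------NEXT QUESTION----------------------')
    # time.sleep(60)
    #
    # Append ('a') rather than overwrite ('w') so every response is kept:
    # with open('responses.csv', 'a') as f:
    #     f.write(f'{q}, {emb}, {base}\n')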

Why do we need a background language model?
How does a tokenizer work?
What are the components of an inverted index?
What is the document filtering problem?
What are the aspects of a search engine?
For computing the f1 score, why can't we take the mean of precision and recall?
Why can’t we use IR models for ranking of webpages?
How do you evaluate a filtering system?
What are the components in context-based filtering?
What is the exploration-exploitation tradeoff?
What is beta-gamma threshold learning?
What type of word relations are there?
What is intrusion detection?
What are some cluster similarity measures?
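
Model replies usually contain commas and newlines, so joining fields with `', '` would produce a file that cannot be split back into columns reliably. A minimal sketch using the standard `csv` module follows. The function name `write_responses`, the column names, and the sample row are placeholders, not real model output.

```python
import csv
import io


def write_responses(rows, handle):
    """Write (query, embedding_reply, base_reply) rows as properly quoted CSV."""
    writer = csv.writer(handle)
    writer.writerow(["query", "embedding_reply", "base_reply"])
    writer.writerows(rows)


# Placeholder rows, not real model output.
sample_rows = [
    ("What is X?", "X is a, b, and c.", "X is\nsomething else."),
]
buffer = io.StringIO()  # stands in for open('responses.csv', 'w', newline='')
write_responses(sample_rows, buffer)

# Reading the text back recovers the fields despite the commas and newline.
recovered = list(csv.reader(io.StringIO(buffer.getvalue())))
```

The writer quotes any field that contains a comma or line break, so each reply stays in a single column when the file is read back.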


## Evaluation

In [10]:
with open('texts', 'r') as f:
    data = f.read()

data

['A background language model is necessary in information retrieval because it helps to normalize the language model and obtain a probability ratio for each word. This allows us to identify words that are semantically associated with a particular topic or query. Additionally, using a single model with a maximum likelihood estimate would force high probabilities for common words like "the," which is not ideal. By using a background language model, we can assign high probabilities to these common words and reduce their probability in the main model. Essentially, the background language model helps to explain and account for the occurrence of common words in a given context.ISHIKAA background language model is needed to improve the accuracy of natural language processing tasks such as speech recognition, machine translation, and text generation. It helps in predicting the probability of a sequence of words in a given language. This model is trained on a large corpus of text data and learn

In [14]:
lines = data.split('\n')
lines

['A background language model is necessary in information retrieval because it helps to normalize the language model and obtain a probability ratio for each word. This allows us to identify words that are semantically associated with a particular topic or query. Additionally, using a single model with a maximum likelihood estimate would force high probabilities for common words like "the," which is not ideal. By using a background language model, we can assign high probabilities to these common words and reduce their probability in the main model. Essentially, the background language model helps to explain and account for the occurrence of common words in a given context.ISHIKAA background language model is needed to improve the accuracy of natural language processing tasks such as speech recognition, machine translation, and text generation. It helps in predicting the probability of a sequence of words in a given language. This model is trained on a large corpus of text data and learn

In [20]:
split_lines = [line.split('ISHIKA') for line in lines]
ft = [sl[0] for sl in split_lines]
gpt = [sl[1] for sl in split_lines]
book = [sl[2] for sl in split_lines]
output = pd.DataFrame({'fine_tuned':ft, 'gpt':gpt, 'textbook':book})
output

Unnamed: 0,fine_tuned,gpt,textbook
0,A background language model is necessary in in...,A background language model is needed to impro...,"Clearly, we need to get rid of these stop word..."
1,A tokenizer is a tool used in information retr...,Tokenization is a fundamental step in many nat...,"Thus, in tokenization we will simply use the r..."
2,The components of an inverted index include th...,1. Terms: The terms or words that are indexed....,An inverted index is the main data structure u...
3,The document filtering problem refers to the t...,The document filtering problem is the task of ...,Another common task is only returning document...
4,There are three major aspects of a search engi...,1. Crawling: The process of discovering and in...,Effectiveness or accuracy. How accurate are th...
5,The reason we cannot take the mean of precisio...,We cannot take the mean of precision and recal...,"Why is this not as good as F1, i.e., what’s th..."
6,"We can use IR models for ranking of webpages, ...","As an AI language model, I don't have personal...","First, on the web we tend to have very differe..."
7,"To evaluate a filtering system, we need to mea...","As an AI language model, I do not have persona...",One common strategy is to use a utility functi...
8,Context-based filtering is a type of content-b...,Context-based filtering typically involves the...,The three basic components in content-based fi...
9,The exploration-exploitation tradeoff refers t...,The exploration-exploitation tradeoff is a dec...,This issue is called the exploration-exploitat...
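
Splitting on a delimiter and indexing `sl[1]` and `sl[2]` raises an `IndexError` for any line that does not contain exactly three fields, for example an empty trailing line left by `data.split('\n')`. A defensive variant might look like the sketch below. The helper `parse_line` and the sample lines are hypothetical.

```python
DELIMITER = "ISHIKA"  # the separator string used in the 'texts' file


def parse_line(line, expected_fields=3):
    """Split one line into its fields, or return None if it is malformed."""
    parts = line.split(DELIMITER)
    if len(parts) != expected_fields:
        return None
    return [part.strip() for part in parts]


# Placeholder lines, not the real file contents.
sample_lines = ["answer AISHIKAanswer BISHIKAanswer C", "", "only one field"]
parsed = [p for p in map(parse_line, sample_lines) if p is not None]
```

Only the first sample line survives the filter; the empty line and the single-field line are skipped instead of raising.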


In [13]:
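def jaccard(list1, list2):
    # Jaccard-style overlap between two token lists.
    # Note: the denominator uses list lengths, so repeated tokens are counted more
    # than once and the score can be lower than a strictly set-based Jaccard index.
    intersection = len(set(list1).intersection(list2))
    union = (len(list1) + len(list2)) - intersection
    return float(intersection) / union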
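
For comparison, the textbook Jaccard index is defined on sets: the size of the intersection divided by the size of the union. The `jaccard` function above computes its denominator from list lengths, so repeated words inflate it. A set-based sketch is shown below; the name `jaccard_set` is an assumption, and it is not the function used to produce the scores in this notebook.

```python
def jaccard_set(tokens_a, tokens_b):
    """Set-based Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty inputs
    return len(set_a & set_b) / len(union)


# Repeated tokens ("the" appears twice in each sentence) do not affect the score.
score = jaccard_set(
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
)
```

For these two sentences the set-based score is 3/7, whereas the list-length version gives 3/9 because each sentence has six tokens including the repeated "the".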

In [27]:
# not super efficient, but because we are dealing with a small dataframe, we can get away with it
for i, row in output.iterrows():
    print(f'------{queries[i]}------')
    list1 = row['textbook'].split(' ')

    #fine_tuned
    list2 = row['fine_tuned'].split(' ')
    sim = jaccard(list1, list2)
    print(f'textbook with fine_tuned: {sim}')

    #plain gpt
    list2 = row['gpt'].split(' ')
    sim = jaccard(list1, list2)
    print(f'textbook with gpt: {sim}')

    

------Why do we need a background language model?------
textbook with fine_tuned: 0.07633587786259542
textbook with gpt: 0.06896551724137931
------How does a tokenizer work?------
textbook with fine_tuned: 0.08298755186721991
textbook with gpt: 0.06593406593406594
------What are the components of an inverted index?------
textbook with fine_tuned: 0.27058823529411763
textbook with gpt: 0.0958904109589041
------What is the document filtering problem?------
textbook with fine_tuned: 0.13526570048309178
textbook with gpt: 0.07971014492753623
------What are the aspects of a search engine?------
textbook with fine_tuned: 0.17341040462427745
textbook with gpt: 0.045662100456621
------For computing the f1 score, why can't we take the mean of precision and recall?------
textbook with fine_tuned: 0.2247191011235955
textbook with gpt: 0.08670520231213873
------Why can’t we use IR models for ranking of webpages?------
textbook with fine_tuned: 0.1509433962264151
textbook with gpt: 0.08713692946058