# GRIT Question answering using Embeddings (Retrieval Augmented Generation)

An experiment in question answering based on content from the GRIT, the Sandia Heights Homeowners Association (SHHA) newsletter.

General procedure: first, embed chunks of the source text into the embedding space. Then embed the user query into the same space, find the text chunks nearest to the query, and append them to the query as context so the chatbot can give a better-grounded answer.
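As a toy illustration of this procedure, the sketch below uses hand-written 3-dimensional vectors in place of real embeddings. That is an assumption made so the example runs offline; real embeddings come from the API and have many more dimensions. The names `toy_chunks`, `cosine_similarity`, and `build_prompt` are hypothetical and do not appear elsewhere in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical text chunks with made-up 3-d "embeddings" (not real model output).
toy_chunks = {
    "Annual membership dues are collected each spring.": [0.9, 0.1, 0.0],
    "Coyotes were spotted near the trailhead.": [0.0, 0.2, 0.9],
    "The board approved the new budget.": [0.7, 0.6, 0.1],
}

def build_prompt(query, query_embedding, chunks, top_n=1):
    """Append the top_n most similar chunks to the query as context."""
    ranked = sorted(
        chunks,
        key=lambda text: cosine_similarity(query_embedding, chunks[text]),
        reverse=True,
    )
    context = "\n".join(ranked[:top_n])
    return f"Context:\n{context}\n\nQuestion: {query}"

# Made-up query embedding that points in the same direction as the "dues" chunk.
prompt = build_prompt("How much are the dues?", [1.0, 0.0, 0.0], toy_chunks)
print(prompt)
```

The rest of the notebook follows the same shape, with API embeddings instead of toy vectors and a chat completion call consuming the assembled prompt.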

* Inspiration: https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb
* Next step: https://cookbook.openai.com/examples/question_answering_using_a_search_api

In [1]:
import ast
import os
import re
from openai import OpenAI
import fitz
import shutil
from tqdm import tqdm
import pandas as pd
from scipy import spatial  # for calculating vector similarities for search
import tiktoken
from dotenv import load_dotenv

load_dotenv()  # load custom environment variables from a .env file in the same directory

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

### Test the client 

In [5]:
query = "What's the Sandia Heights Homeowners Association and how much are the fees?"

response = client.chat.completions.create(
    model=GPT_MODEL,
    temperature=0,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": query},
    ],
)

print(response.choices[0].message.content)

The Sandia Heights Homeowners Association is a community organization that helps maintain the neighborhood and enforce rules and regulations to ensure the well-being of the community. The fees for the association can vary depending on the specific neighborhood within Sandia Heights and the services provided. It is best to contact the association directly or visit their website for the most up-to-date information on fees and services.


## 1. Prepare search data

The search data is the archive of GRIT newsletters, stored as OCR text files with one `.txt` file per issue.

The cell below reads each file, splits it into chunks of at most 1,500 words, embeds the chunks in batches, and collects the file path, chunk text, and embedding of every chunk in a dataframe.
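The `chunk_text` helper in the next cell splits on whitespace, so a chunk boundary can fall mid-sentence and separate an answer from its context. One common mitigation is to let consecutive chunks overlap. The sketch below is an illustration only, not what this notebook ran; `chunk_words` and its `overlap` parameter are hypothetical names.

```python
def chunk_words(text, max_words=1500, overlap=100):
    """Yield word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + max_words])
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the text

# Small demonstration with ten placeholder words.
sample = " ".join(f"w{i}" for i in range(10))
chunks = list(chunk_words(sample, max_words=4, overlap=1))
print(chunks)
```

Overlap increases the number of chunks, and therefore the embedding cost, so the overlap size is a trade-off rather than a free improvement.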

In [15]:
def read_text_file(file_path):
    """Reads a text file and returns its content."""
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

def chunk_text(text, max_words=1500):
    """Chunks text into segments of at most max_words whitespace-separated words.

    Note: this counts words, not model tokens, so the token count per chunk is only approximate.
    """
    words = text.split()
    for i in range(0, len(words), max_words):
        yield ' '.join(words[i:i + max_words])

def get_embeddings_in_batch(texts):
    """Get embeddings for a batch of text strings."""
    response = client.embeddings.create(
        model=EMBEDDING_MODEL,  # same model the search step uses for the query
        input=texts
    )
    return [data.embedding for data in response.data]


def get_embedding(text):
    """Get the embedding for a single text string."""
    response = client.embeddings.create(
        input=text,
        model=EMBEDDING_MODEL  # must match the model used for the stored chunks
    )
    return response.data[0].embedding


def process_text_files(source_folder, batch_size=1000):
    """Process text files in a folder structure, return dataframe with embeddings."""
    records = []
    text_chunks = []

    for root, dirs, files in os.walk(source_folder):
        for file in files:
            if file.endswith('.txt'):
                file_path = os.path.join(root, file)
                print('processing: '+file)
                text = read_text_file(file_path)
                
                # Chunk text
                for chunk in chunk_text(text):
                    text_chunks.append({'file': file_path, 'chunk': chunk})
                    
                    # When batch size is reached, process the batch
                    if len(text_chunks) == batch_size:
                        embeddings = get_embeddings_in_batch([tc['chunk'] for tc in text_chunks])
                        for idx, chunk_data in enumerate(text_chunks):
                            records.append({
                                'file': chunk_data['file'],
                                'chunk': chunk_data['chunk'],
                                'embedding': embeddings[idx]
                            })
                        text_chunks = []  # Reset for next batch

    # Process remaining chunks
    if text_chunks:
        embeddings = get_embeddings_in_batch([tc['chunk'] for tc in text_chunks])
        for idx, chunk_data in enumerate(text_chunks):
            records.append({
                'file': chunk_data['file'],
                'chunk': chunk_data['chunk'],
                'embedding': embeddings[idx]
            })

    # Create a dataframe from the records
    df = pd.DataFrame(records)
    return df
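The batching logic in `process_text_files` appears twice: once when a batch fills up and once for the leftover chunks. A possible simplification, sketched here under the assumption that all chunks fit in memory first, is to collect every chunk and then slice the list into batches. `batched`, `embed_all`, and `fake_embed_batch` are hypothetical names; `fake_embed_batch` stands in for `get_embeddings_in_batch` so the sketch runs without the API.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(chunks, embed_batch, batch_size=1000):
    """Embed every chunk, calling `embed_batch` once per batch."""
    embeddings = []
    for batch in batched(chunks, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings

def fake_embed_batch(texts):
    """Stand-in embedder: maps each string to a one-element vector of its length."""
    return [[float(len(t))] for t in texts]

vectors = embed_all(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2)
print(vectors)
```

With this shape the final partial batch needs no special case, because slicing past the end of a list simply returns the remaining items.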

In [17]:
source_folder = '/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_OCRtext/'

df = process_text_files(source_folder, batch_size=1000)

processing: SHHA-GRIT-2013_08.txt
processing: SHHA-GRIT-2013_06.txt
processing: SHHA-GRIT-2013_12.txt
processing: SHHA-GRIT-2013_07.txt
processing: SHHA-GRIT-2013_11.txt
processing: SHHA-GRIT-2013_05.txt
processing: SHHA-GRIT-2013_04.txt
processing: SHHA-GRIT-2013_10.txt
processing: SHHA-GRIT-2013_01.txt
processing: SHHA-GRIT-2013_03.txt
processing: SHHA-GRIT-2013_02.txt
processing: SHHA-GRIT-2014_12.txt
processing: SHHA-GRIT-2014_06.txt
processing: SHHA-GRIT-2014_07.txt
processing: SHHA-GRIT-2014_05.txt
processing: SHHA-GRIT-2014_11.txt
processing: SHHA-GRIT-2014_10.txt
processing: SHHA-GRIT-2014_04.txt
processing: SHHA-GRIT-2014_01.txt
processing: SHHA-GRIT-2014_03.txt
processing: SHHA-GRIT-2014_02.txt
processing: SHHA-GRIT-2014_09.txt
processing: SHHA-GRIT-2014_08.txt
processing: SHHA-GRIT-2022_07.txt
processing: SHHA-GRIT-2022_12.txt
processing: SHHA-GRIT-2022_06.txt
processing: SHHA-GRIT-2022_10.txt
processing: SHHA-GRIT-2022_04.txt
processing: SHHA-GRIT-2022_05.txt
processing: SH

In [18]:
df

Unnamed: 0,file,chunk,embedding
0,/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_...,Parks & Safety….. 2 & 7 ACC article & projects...,"[0.010661140084266663, -0.016106758266687393, ..."
1,/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_...,a sock and hang it near the garbage to disguis...,"[0.01870499737560749, 0.019076073542237282, 0...."
2,/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_...,"Embrace container gardening. With large pots, ...","[0.017098443582654, 0.021761655807495117, 0.02..."
3,/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_...,Proposed Bylaws Amendment …..2 Parks & Safety ...,"[0.007502797059714794, 0.01210128515958786, 0...."
4,/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_...,services) is a subscription service separate f...,"[0.007772927638143301, 0.012730909511446953, -..."
...,...,...,...
1081,/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_...,Sandia Heights Homeowners Association Septembe...,"[0.01556930411607027, 0.010248411446809769, -0..."
1082,/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_...,"service, but can provide the required forms. T...","[0.009134347550570965, 0.011879567056894302, 0..."
1083,/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_...,Sandia Heights Homeowners Association August -...,"[-0.001219979370944202, 0.01629675179719925, 0..."
1084,/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_...,out the SHHA website at: Sandiahomeonwers.org ...,"[0.01114758849143982, 0.01634790003299713, 0.0..."


In [21]:
# Save the dataframe to a CSV
df.rename(columns={'chunk': 'text'}, inplace=True)
df.to_csv('embeddings.csv', index=False)

### Read from saved csv

In [21]:
# optionally, read
df = pd.read_csv('embeddings.csv')
df['embedding'] = df['embedding'].apply(ast.literal_eval)
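
Why the `ast.literal_eval` step is needed: a CSV file stores each embedding list as text, so it comes back as a string rather than a list. The sketch below shows the round trip with the standard-library `csv` module standing in for pandas (an assumption made to keep the example self-contained); the sample row is made up.

```python
import ast
import csv
import io

row = {"text": "example chunk", "embedding": [0.25, -0.5, 1.0]}

# Write one row to an in-memory CSV; the list is stored as its string form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embedding"])
writer.writeheader()
writer.writerow(row)

# Read it back: the embedding column is now a plain string.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["embedding"]).__name__)

# ast.literal_eval parses the string back into a list of floats
# without executing arbitrary code.
restored = ast.literal_eval(loaded["embedding"])
print(restored == row["embedding"])
```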

## 2. Search

Now we'll define a search function that:
- Takes a user query and a dataframe with text & embedding columns
- Embeds the user query with the OpenAI API
- Uses distance between query embedding and text embeddings to rank the texts
- Returns two lists:
    - The top N `(file, text)` pairs, ranked by relevance
    - Their corresponding relevance scores
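
The ranking step can be sketched without the API. The example below scores made-up 2-dimensional vectors with cosine similarity and uses `heapq.nlargest` to keep only the best matches instead of sorting every row. This is a variation for illustration, not the function defined in this notebook; `cosine_relatedness`, `top_n_related`, and the sample `rows` are hypothetical.

```python
import heapq
import math

def cosine_relatedness(a, b):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_related(query_embedding, rows, top_n=5):
    """Return the top_n (relatedness, file, text) tuples, most related first."""
    scored = (
        (cosine_relatedness(query_embedding, embedding), file, text)
        for file, text, embedding in rows
    )
    return heapq.nlargest(top_n, scored, key=lambda item: item[0])

# Made-up rows: (file, text, embedding).
rows = [
    ("a.txt", "budget report", [1.0, 0.0]),
    ("b.txt", "garden tips", [0.0, 1.0]),
    ("c.txt", "fee increase", [0.8, 0.6]),
]
results = top_n_related([1.0, 0.0], rows, top_n=2)
for score, file, text in results:
    print(f"{score:.3f} {file} {text}")
```

For a dataframe of about a thousand rows the difference from a full sort is negligible; the heap-based selection mainly matters for much larger collections.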

In [7]:
df.columns

Index(['file', 'text', 'embedding'], dtype='object')

In [23]:
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[tuple[str, str]], list[float]]:
    """Returns (file, text) tuples and their relatednesses, sorted from most related to least."""
    query_embedding_response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response.data[0].embedding
    # Score every row against the query embedding.
    strings_and_relatednesses = [
        ((row["file"], row["text"]), relatedness_fn(query_embedding, row["embedding"]))
        for _, row in df.iterrows()
    ]
    # Sort by score and keep only the top_n most related rows.
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    top = strings_and_relatednesses[:top_n]
    files_and_strings = [item for item, _ in top]
    relatednesses = [score for _, score in top]
    return files_and_strings, relatednesses


In [24]:
# examples
strings, relatednesses = strings_ranked_by_relatedness("Budget deficit", df, top_n=5)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

relatedness=0.792


('/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_OCRtext/1998/SHHA-GRIT-1998_03.txt',
 'bill passed both legislative chainbeis to reach the Governor for consideration. ~ · e sessions in even-numbered years are only 30 days Jong."\'!lrne is short: so while the drive-llp\' liquor window debate raged, precious time was norspenron the other important state mat- ters and issues; Drinking and driving is a priority of the highest level. however. not enough time was devoted~ the State\'s Annual Budget, which is traditionally the forus of 30 day legislative sessions. Thus the budget that passed the House and Senate thi5 year ha.s some seri- ous "bu~;,· as Miaosoft entrepreneur Bill Gates would it. Crime You\'ve asked us for more police officers to patrOl the state highways and we\'ve delivered. Governor Gary Johnson signed into law ler)slation ~ New Mexico\'s State Poore funding to hire so more officers thi5 year and 40 more officers for each of the next three years. The $13 million package will

relatedness=0.790


('/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_OCRtext/1996/SHHA-GRIT-1996_01.txt',
 'a reserve margin of 5%, or $140 million, and one major problem is the massive reduction in our reserve accounts. Although there is always both upside and downside potential to the revenue situation, the effects of the federal budget reductions combined with an extremely low level of reserves puts New Mexico in a risky fiscal position that must be constantly moni- tored and analyzed. 1n hindsight, the Governor\'s vetoes (over 200 bills) resulted in a "savings" to the State of over $50 million and helped the financial picture. With all 112 legislative seats up for election in 1996 (70 in the House, and 42 in the Senate), we expect a great deal of activity in the thirty-day Session (beginning January 16th) as the majority party spars with Governor Gary Johnson. Although the state constitution requires lawmakers to limit legislative activity to matters of finance and the State\'s next fiscal year budge~o

relatedness=0.768


('/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_OCRtext/1996/SHHA-GRIT-1996_09.txt',
 'goverrunent shouldn\'t borrow money to buy land or C0!1.50UC!ion equipment Kip Nicely State Rep/D istrict 31 296-9277 Tom Wray State Senator/D istrict 21 856-1450 Frank Bird State Rep/District 23 823-S770 Contractor Evaluation Pr~ram Needs Your Referrals by Erin Frinlcman - SHHA Administrative Assistant One of the benefits of SHHA membership is the use of the Contractor Evaluation Program. When a member needs some type of work done on their house, they may call the SHHA office for information on evaluations chat have been submitted by Sandia Heights residents on rontractors in chat area of expertise. However. we need to build up the number of completed forms in our tiles. If you have recently used a con- tractor and would like to inform your neighbors of the type of 5elVice you received, please fill in the form enclosed in chis issue of The Grit and return it to the SHHA oflic:e at the address given 

relatedness=0.758


('/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_OCRtext/1996/SHHA-GRIT-1996_09.txt',
 'The Albuquerque Public SChools Boundary Committee has submitted its recommendations for changes to existing school boundaries to Peter Horoschak, the APS Superintendent He will now review the Committee\'s work and either forward their recommendation to the School Board, revise it or submit his own recommendation. The next event in this lengthy process will be a Public Hearing on October 30th. As at previous meetings, SHHA will present resri- mony on behalf of the residents of Sandia Heights. Since all recommendations thus far have kept Sandia Heights within the boundaries for La Cueva High ToUIQ.- ,,_ - Sandia Heights North High Desert School and had Middle SchOOt sru- den ts attending either the new Mid School or the current Eisenhower School, our role has been to support the recorrunenda- tions and to convey the fact that our area is vitally interested in the outcome of the process. Although the pu

relatedness=0.756


('/Users/heidi/Documents/SHHA/GRIT/GRIT_archive_OCRtext/1983/SHHA-GRIT-1983_10.txt',
 'expenses this next year to cover: 1) playground maintenance, 2) covenant monitoring and resident information service, 3) snow removal, and 4) newsletters. The Tram Company has requested in writing to raise the playground maintenance fees to $1,800.00 per year. The Tram Company has also requested an increase in the fees for item (2) above from $3,000.00 to $3,600.00 a year. Since we expect to take over this work (see VI below), we do not know the exact expenses for \'83-\'84 in this category. Your Board cannot understand those who will not contribute the nominal $30.00 annual homeowners\' membership fee. We will need additional funds in \'83-\'84 to work on your behalf, so please respond as soon as possible. VI. SEARCH FOR BOARD EXECUTIVEYICE PRESIDEN\'I\'(s)- SALARIBD POSITION(s) Your Board has decided to take over many functions now performed for homeowners by the Tram Company. These functions inclu

## 3. Ask

With the search function above, we can now automatically retrieve relevant knowledge and insert it into messages to GPT.

Below, we define a function `ask` that:
- Takes a user query
- Searches for text relevant to the query
- Stuffs that text into a message for GPT
- Sends the message to GPT
- Returns GPT's answer
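
The "stuffing" step is a greedy fill against a token budget. The sketch below shows that logic in isolation, with a crude word count standing in for a real tokenizer (an assumption so the example runs offline; the notebook's own code counts tokens with `tiktoken`). `count_tokens_approx`, `pack_context`, and the sample `sections` are hypothetical.

```python
def count_tokens_approx(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def pack_context(introduction, question, sections, token_budget):
    """Add ranked sections until the next one would exceed the budget."""
    message = introduction
    for section in sections:
        candidate = f'\n\nArticle section:\n"""\n{section}\n"""'
        if count_tokens_approx(message + candidate + question) > token_budget:
            break  # stop at the first section that does not fit
        message += candidate
    return message + question

# Sections are assumed to be sorted from most to least relevant.
sections = ["one two three", "four five six seven", "eight"]
message = pack_context(
    "Use the articles below.",
    "\n\nQuestion: What?",
    sections,
    token_budget=14,
)
print(message)
```

Because the loop stops at the first section that does not fit, a shorter but less relevant section further down the ranking is never considered. That keeps the context in strict relevance order at the cost of sometimes leaving part of the budget unused.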

In [24]:
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    files_and_strings, _ = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below articles on the Sandia Heights Homeowners Association to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    # Each ranked result is a (file, text) tuple; only the text goes into the prompt.
    for _file, text in files_and_strings:
        next_article = f'\n\nArticle section:\n"""\n{text}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the Sandia Heights Homeowners Association."},
        {"role": "user", "content": message},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response.choices[0].message.content
    return response_message

### Example questions

In [26]:
ask('When did the SHHA oppose the County Line restaurant liquor license?', print_message=True)

Use the below articles on the Sandia Heights Homeowners Association to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."

Article section:
"""
October 18, 1983 HOMEOWNERS' A~CIATION NEWSLETTER Dear Sandia Heights Homeowners' Association Member, Enclosed you will find important information regarding a number of subjects which will affect all of us in coming years. 1. COUNTY LINE RESTAURANT SURVEY RESULTS Thank you very much for responding to our questionnaire regarding your opinion on the County Line Restaurant and thei1· liquor license application. Your Homeowners' Board of Directors decided in a July 26, 1983 meeting to inform the entire Sandia Heights Community about the status of the County Line's new liquor license application, and, at the same time, conduct a written survey. Mr. Randall Williams, newly elected Board Vice-President, was asked to be in charge of the survey, including an additional random sampling poll.

'The SHHA opposed the County Line restaurant liquor license on August 30, 1983.'