This notebook loads a prebuilt FAISS index and runs a few trial searches against it to check that the process works.

In [1]:
# The GPU version seems to be available only through Anaconda, and building it myself ran into various issues, so use the CPU version.
!pip install faiss-cpu



In [2]:
from faiss import write_index, read_index

#this is the FAISS index created originally with the faiss_indexer notebook
sentence_index = read_index("/mystuff/notebooks/faiss_index_512_flat_small.index")


In [3]:
!pwd

/mystuff/notebooks/to_be_uploaded


In [4]:
from sentence_transformers import SentenceTransformer, util

#embedding_model_path = "/mystuff/llm/gte-base"
#embedding_model_path = "/mystuff/llm/all-MiniLM-L12-v2"
embedding_model_path = "/mystuff/llm/bge-small-en"

embedding_model = SentenceTransformer(embedding_model_path, device='cuda')


Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.


# Example question for embeddings search:

In [5]:
q_embeddings = embedding_model.encode(["What is the definition of anarchism?"])


In [6]:
D, I = sentence_index.search(q_embeddings, 3)

In [7]:
# I = IDs of the documents found
I

array([[1054330, 1880579, 5809849]])

In [8]:
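# D = scores of the documents found. The values ascend with rank, which suggests they are distances (lower = closer) rather than similarities.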
D

array([[18.58313 , 19.57069 , 20.886288]], dtype=float32)
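
To make the shape of `D` and `I` concrete, here is a minimal pure-Python sketch of what an exhaustive ("flat") search does: compare the query to every stored vector and keep the `k` closest. It assumes an L2 metric, which the notebook does not state explicitly (the ascending values in `D` are consistent with it; FAISS flat L2 indexes report squared L2 distances). The function names and toy vectors are made up for illustration and are not part of FAISS.

```python
# Pure-Python sketch of an exhaustive (flat) nearest-neighbour search.
# Assumption: L2 metric; the toy vectors below are invented for illustration.

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, queries, k):
    """Return (D, I) with one row per query, closest match first."""
    all_distances, all_ids = [], []
    for query in queries:
        # Score every stored vector, then keep the k smallest distances.
        scored = sorted(
            (squared_l2(query, vec), idx) for idx, vec in enumerate(vectors)
        )[:k]
        all_distances.append([dist for dist, _ in scored])
        all_ids.append([idx for _, idx in scored])
    return all_distances, all_ids

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_D, toy_I = flat_l2_search(toy_vectors, [[0.9, 0.1]], k=3)
print(toy_I)  # → [[1, 0, 2]]
print(toy_D)  # distances in ascending order, like D above
```

As in the FAISS result, row `i` of each output belongs to query `i`, and position `j` in the ID row matches position `j` in the distance row.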

In [9]:
import pandas as pd

df_train = pd.read_csv("/mystuff/science-exam/train.csv")

In [10]:
def format_input(df, idx, options = ['A','B','C','D','E']):
    preamble = "You are a student taking an exam, and you need to work hard to answer "\
               "the multiple-choice questions correctly in order to get a higher score. "\
               "Please choose the most likely correct answer from the options provided below. "\
               "Your answer must be selected from the given options, and you can only respond with the letter of the choice."
    postamble = "Please respond with the letter of the option directly."

    prompt = df.loc[idx, 'prompt']
    answers = df.loc[idx,options].tolist()
    correct_answer = None
    if "answer" in df.columns:
        correct_answer = df.loc[idx, 'answer']

    options_text = ""
    for option, answer in zip(options, answers):
        options_text += (f"{option}) {answer}\n")
    
    input_text = f"{preamble}\n\nQuestion:\n{prompt}\n\nOptions:\n{options_text}\n{postamble}"
    return input_text, correct_answer

In [11]:
# Function to pick the prompt and its answer options from a dataframe row.
# Not that interesting but simplifies a little:
def collect_answer_options(df, idx, options = ['A','B','C','D','E']):
    prompt = df.loc[idx, 'prompt']
    answers = df.loc[idx,options].tolist()
    correct_answer = None
    if "answer" in df.columns:
        correct_answer = df.loc[idx, 'answer']
    
    return prompt, answers, correct_answer

In [13]:
input_text, correct_answer = format_input(df_train, 0)
print(input_text)
print(correct_answer)

You are a student taking an exam, and you need to work hard to answer the multiple-choice questions correctly in order to get a higher score. Please choose the most likely correct answer from the options provided below. Your answer must be selected from the given options, and you can only respond with the letter of the choice.

Question:
Which of the following statements accurately describes the impact of Modified Newtonian Dynamics (MOND) on the observed "missing baryonic mass" discrepancy in galaxy clusters?

Options:
A) MOND is a theory that reduces the observed missing baryonic mass in galaxy clusters by postulating the existence of a new form of matter called "fuzzy dark matter."
B) MOND is a theory that increases the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 20.
C) MOND is a theory that explains the missing baryonic mass in galaxy clusters that was previously cons

In [15]:
q_embeddings = embedding_model.encode([input_text])


In [16]:
D, I = sentence_index.search(q_embeddings, 3)

In [17]:
I

array([[2407532, 3164010, 4896391]])

# SQLite functions to fetch documents based on FAISS IDs

In [18]:
import sqlite3

def fetch_document_and_chunks_by_id(database_path, document_id):
    # Establish the database connection
    connection = sqlite3.connect(database_path)
    cursor = connection.cursor()

    # Prepare and execute the SQL query
    query = """
    SELECT 
        d.document_id, d.document_title,
        s.section_id, s.section_title,
        tc.chunk_id, tc.content
    FROM documents d
    LEFT JOIN sections s ON s.document_id = d.document_id
    LEFT JOIN text_chunks tc ON tc.document_id = d.document_id AND tc.section_id = s.section_id
    WHERE d.document_id = ?
    ORDER BY d.document_id, s.section_id, tc.chunk_id;
    """
    cursor.execute(query, (document_id,))

    # Initialize an empty dictionary to hold the document and its chunks
    # TODO: throw some error if this is not right
    document = {
        'id': document_id,
        'sections': {}
    }

    # Fetch and process the result
    for row in cursor:
        #print(row)
        _, document_title, section_id, section_title, chunk_id, content = row

        # Set the document title
        document['title'] = document_title

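        # Add section if not already present.
        # Compare against None explicitly so that an id of 0 is not skipped.
        if section_id is not None and section_id not in document['sections']:
            document['sections'][section_id] = {
                'title': section_title,
                'chunks': {}
            }

        # Add chunk (the LEFT JOINs yield NULLs when a section has no chunks)
        if section_id is not None and chunk_id is not None:
            document['sections'][section_id]['chunks'][chunk_id] = content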

    # Close the database connection
    connection.close()

    return document

# Path to the SQLite chunk database used by the functions below.
# Example ids to try: [1054330,       0, 1880579]
chunk_database_path = "../wikipedia_chunks_256.db"
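
The nesting that `fetch_document_and_chunks_by_id` builds can be tried without the real database. The sketch below runs the same query against an in-memory SQLite database. The table and column names come from the query above; the column types and the sample rows are assumptions made only for this example, and the real `wikipedia_chunks_256.db` schema may differ.

```python
import sqlite3

# Hypothetical minimal schema inferred from the query above; the column
# types and the sample rows are assumptions made for this sketch.
demo_connection = sqlite3.connect(":memory:")
demo_connection.executescript("""
CREATE TABLE documents (document_id INTEGER PRIMARY KEY, document_title TEXT);
CREATE TABLE sections (section_id INTEGER PRIMARY KEY, document_id INTEGER, section_title TEXT);
CREATE TABLE text_chunks (chunk_id INTEGER PRIMARY KEY, document_id INTEGER, section_id INTEGER, content TEXT);
INSERT INTO documents VALUES (1, 'Demo document');
INSERT INTO sections VALUES (10, 1, 'Intro'), (11, 1, 'Details');
INSERT INTO text_chunks VALUES (100, 1, 10, 'first chunk'), (101, 1, 10, 'second chunk'), (102, 1, 11, 'third chunk');
""")

demo_query = """
SELECT d.document_id, d.document_title, s.section_id, s.section_title, tc.chunk_id, tc.content
FROM documents d
LEFT JOIN sections s ON s.document_id = d.document_id
LEFT JOIN text_chunks tc ON tc.document_id = d.document_id AND tc.section_id = s.section_id
WHERE d.document_id = ?
ORDER BY d.document_id, s.section_id, tc.chunk_id;
"""

# Group the flat join rows into document -> sections -> chunks.
demo_document = {"id": 1, "sections": {}}
for _, title, section_id, section_title, chunk_id, content in demo_connection.execute(demo_query, (1,)):
    demo_document["title"] = title
    if section_id is not None:
        section = demo_document["sections"].setdefault(
            section_id, {"title": section_title, "chunks": {}}
        )
        if chunk_id is not None:
            section["chunks"][chunk_id] = content
demo_connection.close()

print(demo_document)
```

Each joined row repeats the document and section columns, so the loop only creates a section the first time its id appears and then files each chunk under it.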


In [19]:
def load_and_print_doc_by_id(document_id_to_query, print_all=False):
#    document_id_to_query = 1880580 # Replace with the document ID you're interested in
    document = fetch_document_and_chunks_by_id(chunk_database_path, document_id_to_query)
    #print(document)
    
    # Print the document and its chunks
    all_chunks = []
    if print_all:
        print(f"Document ID: {document['id']}, Title: {document['title']}")
    for sec_id, sec_data in document['sections'].items():
        #print(f"  Section ID: {sec_id}, Title: {sec_data['title']}")
        for chunk_id, content in sec_data['chunks'].items():
            if print_all:
                print(f"    Chunk ID: {chunk_id}, Content: {content[:50]}...")  # Printing first 50 characters of each chunk
                # this was used to estimate how well the search works, using the first question in the dataset
                # if "MOND" in content:
                #    print("FOUND MOND")
            all_chunks.append(content)
    return document, all_chunks


In [20]:
docs = []
doc_chunks = []
doc_chunks_flat = []
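for doc_id in I[0]:
    # The database document id is the FAISS id plus one
    # (presumably 0-based FAISS ids vs. 1-based database ids).
    print(doc_id+1)
    doc, chunks = load_and_print_doc_by_id(int(doc_id+1))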
    print()
    docs.append(doc)
    doc_chunks.append(chunks)
    doc_chunks_flat.extend(chunks)


2407533

3164011

4896392
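
The ID translation done in the loop above can be pulled into a small helper. This is only a sketch: `faiss_ids_to_document_ids` is a hypothetical name, and the offset of 1 is taken from the `doc_id+1` in the loop rather than from any documented property of the index. The helper also skips `-1` entries, which FAISS returns when it finds fewer than `k` neighbours.

```python
def faiss_ids_to_document_ids(faiss_ids, offset=1):
    """Map FAISS result ids to database document ids.

    offset=1 mirrors the doc_id+1 used in the notebook loop (an assumption
    about how the index was built). Negative ids mark missing results and
    are dropped.
    """
    return [int(faiss_id) + offset for faiss_id in faiss_ids if int(faiss_id) >= 0]

print(faiss_ids_to_document_ids([2407532, 3164010, 4896391]))
# → [2407533, 3164011, 4896392]
```

Converting with `int(...)` also turns NumPy integers into plain Python integers, which is what the loop above does before calling `load_and_print_doc_by_id`.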



In [21]:
print(doc_chunks[0][0])

'''Modified Newtonian dynamics''' ('''MOND''') is a hypothesis that proposes a modification of Newton's law of universal gravitation to account for observed properties of galaxies. It is an alternative to the hypothesis of dark matter in terms of explaining why galaxies do not appear to obey the currently understood laws of physics.


In [22]:
print(doc_chunks[1][0])

'''AQUAL''' is a theory of gravity based on Modified Newtonian Dynamics (MOND), but using a Lagrangian. It was developed by Jacob Bekenstein and Mordehai Milgrom in their 1984 paper, "Does the missing mass problem signal the breakdown of Newtonian gravity?". "AQUAL" stands for "A QUAdratic Lagrangian".

The gravitational force law obtained from MOND, 

:
has a serious defect: it violates Newton's third law of motion, and therefore fails to conserve momentum and energy. To see this, consider two objects with ; then we have:

:

but the third law gives  so we would get 

: 

even though  and  would therefore be constant, contrary to the MOND assumption that it is linear for small arguments.

This problem can be rectified by deriving the force law from a Lagrangian, at the cost of possibly modifying the general form of the force law. Then conservation laws could then be derived from the Lagrangian by the usual means.


In [23]:
print(doc_chunks[2][0])

In cosmology, the '''missing baryon problem''' is an observed discrepancy between the amount of baryonic matter detected from shortly after the Big Bang and from more recent epochs. Observations of the cosmic microwave background and Big Bang nucleosynthesis studies have set constraints on the abundance of baryons in the early universe, finding that baryonic matter accounts for approximately 4.8% of the energy contents of the Universe. At the same time, a census of baryons in the recent observable universe has found that observed baryonic matter accounts for less than half of that amount. This discrepancy is commonly known as the missing baryon problem. The missing baryon problem is different from the dark matter problem, which is non-baryonic in nature.
