# BM25 tokenization and vectorization
This notebook tokenizes a corpus, builds a BM25 index from the tokens, and saves the index to disk. 
<br>
Adapted from the bm25s.ipynb notebook, which is based on:<br>
BM25 Sparse<br>
https://bm25s.github.io/

## documentation
https://github.com/xhluca/bm25s

## citation
```
@misc{bm25s,
      title={BM25S: Orders of magnitude faster lexical search via eager sparse scoring}, 
      author={Xing Han Lù},
      year={2024},
      eprint={2407.03618},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2407.03618}, 
}
```

## requires:
pip3 install "bm25s[full]"<br>
pip3 install nltk

In [2]:
import bm25s
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import os
import pickle


## create the corpus and identifier for each document
This should be a separate function/module. <br>
It reads the .txt files in the data folder and builds two parallel lists, a corpus and an identifier list, each saved to its own file. <br>
The corpus is the input to BM25S; the identifier list is what the retriever returns. Each identifier holds only the DOI (with resolver) and the title, which may be easier for the UI. <br>
The corpus should be sent to the generator. 
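As a rough sketch of what that separate function could look like, the block below wraps the same steps in two helpers. The names `parse_record` and `build_corpus` are hypothetical, and the sketch assumes each file has the DOI on line 1 (prefixed with `DOI: `), the title on line 2, and the abstract on line 3.

```python
from pathlib import Path

DOI_PREFIX = "DOI: "

def parse_record(lines):
    """Return (doi, title, abstract) from the first three lines, or None if incomplete."""
    if len(lines) < 3:
        return None
    doi = lines[0].strip()
    if doi.startswith(DOI_PREFIX):
        doi = doi[len(DOI_PREFIX):]
    title = lines[1].strip()
    abstract = lines[2].strip()
    if not (doi and title and abstract):
        return None
    return doi, title, abstract

def build_corpus(input_dir):
    """Build two parallel lists (corpus, identifier) from every .txt file in input_dir."""
    corpus, identifier = [], []
    # sorted() keeps the document order stable between runs
    for path in sorted(Path(input_dir).glob("*.txt")):
        record = parse_record(path.read_text(encoding="utf-8").splitlines())
        if record is None:
            continue  # skip files that do not match the expected layout
        doi, title, abstract = record
        corpus.append(f"{doi} {title} {abstract}")
        identifier.append(f"URL: https://doi.org/{doi} for title: {title}")
    return corpus, identifier
```

Because both lists are appended to in the same step, `corpus[i]` and `identifier[i]` always refer to the same document.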

In [3]:
# Create your corpus here

input_dir = '/Users/poppyriddle/Documents/PhD/Research_proposal/Part_3/part_3_cohere/data'

#initialize lists
corpus = []
identifier = []

#read each file in input_dir
for file_name in os.listdir(input_dir):
    if file_name.endswith('.txt'):
        file_path = os.path.join(input_dir,file_name)

        with open(file_path, 'r') as file:
            content = file.readlines()

            # removeprefix() (Python 3.9+) drops only the literal "DOI: " prefix;
            # lstrip("DOI: ") would strip any leading D, O, I, colon, or space characters
            doi = content[0].strip().removeprefix("DOI: ")
            title = content[1].strip()
            abstract = content[2].strip()

            # Append to both lists under the same condition so that
            # corpus[i] and identifier[i] always describe the same document
            if doi and title and abstract:
                # full document string: indexed by BM25 and later sent to the generator
                document = f"{doi} {title} {abstract}"
                corpus.append(document)

                # shorter, user-facing entry returned by the retriever:
                # the DOI as a resolver URL plus the title
                resolver_doi = f"https://doi.org/{doi}"
                link = f"URL: {resolver_doi} for title: {title}"
                identifier.append(link)
                    


#export corpus and url lists for import later
with open('corpus.pkl', 'wb') as file:
    pickle.dump(corpus, file)

with open('identifier.pkl','wb') as file:
    pickle.dump(identifier, file)

print('complete')


complete


In [4]:
print(len(corpus))

43


# tokenization and saving the vectors
## options:
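- stemming: uncomment the stemmer line in the next cell to enable stemming. That line uses the PyStemmer English stemmer, which needs `import Stemmer`. For the NLTK Porter stemmer imported above, see the documentation here: https://www.nltk.org/howto/stem.html
- stopwords removal: can pass a list or choose a language-specific one, such as "en". Documentation here: 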
    - optional stopwords list can be passed:
        ```python
        #provide your own stopwords list if you don't like the default one
        stopwords = ["a", "the"]
        ```
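The NLTK `PorterStemmer` stems one word at a time, whereas, as I understand the bm25s documentation, the `stemmer` argument of `bm25s.tokenize` expects a callable that takes a list of strings and returns a list of strings. A small adapter could bridge the two. In the sketch below, `as_batch_stemmer` and `toy_stem` are hypothetical names, and `toy_stem` is only a stand-in so the example runs without NLTK.

```python
def as_batch_stemmer(stem_one):
    """Wrap a one-word stemming function so that it accepts a list of words."""
    def stem_many(words):
        return [stem_one(word) for word in words]
    return stem_many

def toy_stem(word):
    """Stand-in stemmer for illustration only: strips a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

stemmer = as_batch_stemmer(toy_stem)
print(stemmer(["vectors", "tokens", "index"]))
```

With NLTK, the equivalent would presumably be `as_batch_stemmer(PorterStemmer().stem)`. Whichever stemmer is used for the corpus should also be used for the query, otherwise the query tokens will not match the indexed vocabulary.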

In [5]:
#optional stemmer (PyStemmer): requires `import Stemmer`
#stemmer = Stemmer.Stemmer("english")

#Tokenize the corpus - removes English stopwords 
#you can also add a stemmer here as an arg: stemmer=stemmer
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", show_progress=True)

# Create the BM25 model and index the corpus
retriever = bm25s.BM25(corpus=corpus, method='lucene') # 'robertson', 'lucene', 'atire'
retriever.index(corpus_tokens)

# Save the index
# You can save the corpus along with the model - technically, this should go up in the previous cell after the retriever was created.
retriever.save("bm25/bm25", corpus=corpus)


Finding newlines for mmindex: 100%|██████████| 61.0k/61.0k [00:00<00:00, 119MB/s]
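
For intuition about what `method='lucene'` scores, here is a pure-Python sketch of the Lucene variant of BM25 as it is commonly stated. It is not the bm25s implementation, which precomputes sparse scores at index time. The function name `bm25_lucene_scores` is hypothetical, and `k1=1.5` and `b=0.75` are assumed parameter values.

```python
import math
from collections import Counter

def bm25_lucene_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with the Lucene BM25 variant."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        # longer-than-average documents are penalized through this term
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf / (tf + length_norm)
        scores.append(score)
    return scores

docs = [
    ["openalex", "metadata"],
    ["funder", "metadata", "dimensions"],
    ["open", "journal", "systems"],
]
scores = bm25_lucene_scores(["openalex", "metadata"], docs)
print(scores)
```

A document that shares no term with the query scores exactly zero, which is why an all-zero score row can be treated as "nothing found".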


In [37]:
type(retriever)

bm25s.BM25

# stop here. 
The code above only sets up the vectorized tokens. The code below tests that retrieval works with the document set. <br>
The code below will be integrated into the RAG pipeline.


## loading the index
The saved index holds the vectorized tokens. Only the corpus needs to be saved alongside it. <br>
The identifier list is what the user sees, in place of a numeric id assigned to each document. 
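
Since the user sees identifiers while the generator needs the full text, the retrieved identifiers have to be mapped back to their corpus entries at some point. A minimal sketch follows, assuming the two lists are index-aligned and the identifier strings are unique. The function name `documents_for_identifiers` is hypothetical.

```python
def documents_for_identifiers(hits, identifier_list, corpus_list):
    """Map retrieved identifier strings back to their full corpus documents."""
    if len(identifier_list) != len(corpus_list):
        raise ValueError("identifier_list and corpus_list must have the same length")
    # position i in one list describes the same document as position i in the other
    full_text = dict(zip(identifier_list, corpus_list))
    return [full_text[hit] for hit in hits]

ids = ["URL: https://doi.org/10.1/a for title: A", "URL: https://doi.org/10.1/b for title: B"]
texts = ["10.1/a A first abstract", "10.1/b B second abstract"]
print(documents_for_identifiers([ids[1]], ids, texts))
```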

In [38]:

# reimport corpus and url lists
with open('corpus.pkl', 'rb') as file:
    corpus_list = pickle.load(file)
print(f"length of corpus list: {len(corpus_list)}")
with open('identifier.pkl', 'rb') as file:
    identifier_list = pickle.load(file)
print(f"--------\nlength of identifier list: {len(identifier_list)}")

#retriever = bm25s.BM25(corpus=corpus_list) 
# ...and load the retriever model and corpus when you need them
retriever = bm25s.BM25.load("bm25/bm25", load_corpus=True, mmap=True)
# set load_corpus=False if you don't need the corpus


length of corpus list: 43
--------
length of identifier list: 43


In [45]:
#You can now search the corpus with a query
query = input("What is your query? ")
#you can also add a stemmer here as an arg: stemmer=stemmer (use the same one as at indexing)
#stopwords="en" matches the setting used when the corpus was tokenized
query_tokens = bm25s.tokenize(query,
                            stopwords="en",
                            lower=True)

#note: if you pass a new corpus here, it must have the same length as your indexed corpus
#in this case, I am passing the new list 'identifier_list' - it contains just the DOI and title
# you can also pass 'corpus', or 'corpus_list'
if len(corpus_list)==len(identifier_list):
    results, scores = retriever.retrieve(query_tokens, corpus=identifier_list, k=3, return_as="tuple")
    #loop through results; kept inside this branch because results and scores only exist if retrieval ran
    if all(score == 0.00 for score in scores[0]):
        print("Nothing found, please try another query.")
    else:
        for i in range(results.shape[1]):
            doc, score = results[0, i], scores[0, i]
            print(f"------\nRank {i+1} (score: {score:.2f}): {doc}")
else:
    print("The len of the corpus_list does not equal the identifier_list")
    print(f"length of corpus list: {len(corpus_list)}")
    print(f"length of identifier_list: {len(identifier_list)}")


                                                     

------
Rank 1 (score: 2.30): URL: https://doi.org/10.3145/epi.2023.mar.09
 for title: Title: Which of the metadata with relevance for bibliometrics are the same and which are different when switching from Microsoft Academic Graph to OpenAlex?
------
Rank 2 (score: 2.03): URL: https://doi.org/10.31274/b8136f97.ccc3dae4
 for title: Title: Comparing Funder Metadata in OpenAlex and Dimensions
------
Rank 3 (score: 1.89): URL: https://doi.org/10.1590/SciELOPreprints.11205
 for title: Title: On the Open Road to Universal Indexing: OpenAlex and OpenJournal Systems
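
When this is integrated into the RAG, the ranking and the "nothing found" check may be easier to reuse as a function that returns lines instead of printing them. The sketch below is one way to do that. The name `format_hits` and the `min_score` parameter are hypothetical. Unlike the cell above, it drops each zero-score hit individually rather than only when every score is zero.

```python
def format_hits(docs, scores, min_score=0.0):
    """Return printable lines for ranked hits, or one fallback line if nothing scored."""
    ranked = [(doc, score) for doc, score in zip(docs, scores) if score > min_score]
    if not ranked:
        return ["Nothing found, please try another query."]
    return [
        f"Rank {rank} (score: {score:.2f}): {doc}"
        for rank, (doc, score) in enumerate(ranked, start=1)
    ]

for line in format_hits(["first doc", "second doc"], [2.3, 0.0]):
    print(line)
```

With the retriever output above, this would be called as `format_hits(results[0], scores[0])`.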


