If you haven't installed lancedb or sentence_transformers, please uncomment the following lines and run them.

In [2]:
# !pip install lancedb

# !pip install sentence_transformers

In [1]:
import lancedb
from lancedb.embeddings import with_embeddings

import pandas as pd

To create/connect to a lancedb database, we use the `connect` function. Here we will create a db called `lancedb-ebiblecorpus`.

In [2]:
db = lancedb.connect('./lancedb-ebiblecorpus')

In the `ebible` corpus, each translation is stored one verse per line, and line *n* of every file corresponds to the verse reference on line *n* of `vref.txt`. We load `vref.txt` into a dataframe, which we will use as a lookup later.

In [3]:
vrefs = pd.read_csv('ebible/metadata/vref.txt', sep='\t', header=None, names=['vref'])
vrefs.head()

      vref
0  GEN 1:1
1  GEN 1:2
2  GEN 1:3
3  GEN 1:4
4  GEN 1:5
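Because line numbers and verse references line up one to one, the lookup can go in either direction. The following is a minimal sketch using only the standard library and a handful of hard-coded references; the names `refs` and `ref_to_line` are illustrative and not part of the corpus tooling.

```python
# Sample references standing in for the first lines of vref.txt.
refs = ["GEN 1:1", "GEN 1:2", "GEN 1:3", "GEN 1:4", "GEN 1:5"]

# Line index -> reference is plain list indexing.
third_ref = refs[2]

# Reference -> line index needs a dictionary, built once up front.
ref_to_line = {ref: index for index, ref in enumerate(refs)}
line_of_gen_1_4 = ref_to_line["GEN 1:4"]

print(third_ref, line_of_gen_1_4)
```

The dictionary is only worth building if you need to go from a reference back to a line; for the import below, indexing by line number is enough.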


LanceDB is a vector database, and we will use it to store our sentence embeddings. There are several ways to generate embeddings. One option is a hosted API such as OpenAI's embedding models, but this can get expensive for a corpus of this size; many comparable services exist.

We can also generate embeddings ourselves. HuggingFace provides one of the most popular methods of generating embeddings using their `sentence_transformers` library. Numerous `SentenceTransformer` models are available and can be found by searching models on <https://huggingface.co/>.

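Only languages that a model was trained on are likely to produce meaningful embeddings, so for this corpus it is important to use a multilingual model and to know which languages were in its training data. Note that the model loaded below, `all-MiniLM-L6-v2`, is a small model trained mainly on English text, so its embeddings for the other languages imported here are likely to be less reliable; substitute a multilingual `SentenceTransformer` model to improve them.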
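To make "meaningful embeddings" concrete: an embedding is useful when texts with similar meaning end up pointing in similar directions as vectors. One common way to measure this is cosine similarity. The sketch below uses made-up 3-dimensional vectors rather than real model output (real models emit hundreds of dimensions), so it only illustrates the arithmetic.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only.
verse_a = [0.9, 0.1, 0.0]
verse_b = [0.8, 0.2, 0.1]   # nearly the same direction as verse_a
verse_c = [0.0, 0.1, 0.9]   # a very different direction

# Similar vectors score close to 1, dissimilar ones close to 0.
print(cosine_similarity(verse_a, verse_b) > cosine_similarity(verse_a, verse_c))
```

When a model has not seen a language during training, verses with the same meaning may not land near each other, and comparisons like this one stop being informative.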

In [4]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cuda')

# LanceDB provides the `with_embeddings` convenience function, which takes an embedding function as an argument.
# That function receives a batch of strings and must return one vector per string.
def embed_func(batch):
    return [model.encode(sentence) for sentence in batch]

Because there are over 1000 languages in the ebible dataset, and we do not have a model trained on all of them, we will only import a few of them.

In [7]:
languages_for_import = [
    # Original Languages
    "grc-grc-tisch.txt",
    "grc-grcbrent.txt",
    "grc-grcbyz.txt",
    "grc-grcf35.txt",
    "grc-grcmt.txt",
    "grc-grcsr.txt",
    "grc-grctcgnt.txt",
    "grc-grctr.txt",
    "heb-heb.txt",
    # English Bibles
    "eng-eng-asv.txt",
    "eng-eng-Brenton.txt",
    "eng-eng-glw.txt",
    "eng-eng-kjv.txt",
    "eng-eng-kjv2006.txt",
    "eng-eng-lxx2012.txt",
    "eng-eng-rv.txt",
    "eng-eng-t4t.txt",
    "eng-eng-uk-lxx2012.txt",
    "eng-eng-web-c.txt",
    "eng-eng-web.txt",
    "eng-eng-webbe.txt",
    "eng-engaoi.txt",
    "eng-engasvbt.txt",
    "eng-engBBE.txt",
    "eng-engDBY.txt",
    "eng-engDRA.txt",
    "eng-engemtv.txt",
    "eng-engf35.txt",
    "eng-engfbv.txt",
    "eng-enggnv.txt",
    "eng-engjps.txt",
    "eng-engkjvcpb.txt",
    "eng-englee.txt",
    "eng-englsv.txt",
    "eng-englxxup.txt",
    "eng-engnna.txt",
    "eng-engnoy.txt",
    "eng-engoebcw.txt",
    "eng-engoebus.txt",
    "eng-engoke.txt",
    "eng-engourb.txt",
    "eng-engPEV.txt",
    "eng-engtcent.txt",
    "eng-engtnt.txt",
    "eng-engULB.txt",
    "eng-engwebp.txt",
    "eng-engwebpb.txt",
    "eng-engwebster.txt",
    "eng-engwmb.txt",
    "eng-engwmbb.txt",
    "eng-engwyc2017.txt",
    "eng-engwyc2018.txt",
    "eng-engWycliffe.txt",
    "eng-engylt.txt",
    # Languages of wider communication (LWCs)
    ## Arabic
    "arb-arb-vd.txt", 
    "arb-arbnav.txt",
    ## Spanish
    "spa-spabes.txt",
    "spa-spablm.txt",
    "spa-spapddpt.txt",
    "spa-spaRV1909.txt",
    "spa-sparvg.txt",
    "spa-spavbl.txt",
    ## German
    "deu-deu1912.txt",
    "deu-deu1951.txt",
    "deu-deuelo.txt",
    "deu-deutkw.txt",
    ## French
    "fra-fra_fob.txt",
    "fra-fraLSG.txt",
    "fra-francl.txt",
    "fra-frasbl.txt",
    # Other Languages
    ## Tok Pisin
    "tpi-tpi.txt",
    "tpi-tpiOTNT.txt",
]
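The filenames in the list above appear to follow the pattern `<language code>-<version>.txt`, so instead of listing every file by hand, one could select files by their language prefix. This is a sketch under that assumption; `language_code` and `select_by_language` are hypothetical helpers, and the sample list is a small subset of the names above.

```python
def language_code(filename):
    """Return the part of the filename before the first hyphen, e.g. 'eng'."""
    return filename.split("-", 1)[0]

def select_by_language(filenames, wanted_codes):
    """Keep only files whose language code is in wanted_codes, preserving order."""
    return [name for name in filenames if language_code(name) in wanted_codes]

sample_files = [
    "grc-grcbyz.txt",
    "heb-heb.txt",
    "eng-eng-kjv.txt",
    "tpi-tpi.txt",
    "spa-sparvg.txt",
]

original_language_files = select_by_language(sample_files, {"grc", "heb"})
print(original_language_files)
```

An explicit list, as used in this notebook, still has the advantage of letting you exclude individual translations within a language.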

In [8]:
verses = []

# To import all languages, import os and use (but be warned, there are 1000+ languages):
# for filename in os.listdir('ebible/corpus/'):
for filename in languages_for_import:
    print(filename)
    version = filename.split('.')[0]
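    # Read as UTF-8 explicitly so Greek, Hebrew and Arabic text decodes the same on every platform
    with open('ebible/corpus/' + filename, 'r', encoding='utf-8') as f: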
        for index, line in enumerate(f):
            text = line.strip()
            ref = vrefs.iloc[index]['vref']

            if text != '':
                verses.append([
                    version,
                    ref,
                    text
                ])

print(len(verses))

grc-grc-tisch.txt
grc-grcbrent.txt
grc-grcbyz.txt
grc-grcf35.txt
grc-grcmt.txt
grc-grcsr.txt
grc-grctcgnt.txt
grc-grctr.txt
heb-heb.txt
eng-eng-asv.txt
eng-eng-Brenton.txt
eng-eng-glw.txt
eng-eng-kjv.txt
eng-eng-kjv2006.txt
eng-eng-lxx2012.txt
eng-eng-rv.txt
eng-eng-t4t.txt
eng-eng-uk-lxx2012.txt
eng-eng-web-c.txt
eng-eng-web.txt
eng-eng-webbe.txt
eng-engaoi.txt
eng-engasvbt.txt
eng-engBBE.txt
eng-engDBY.txt
eng-engDRA.txt
eng-engemtv.txt
eng-engf35.txt
eng-engfbv.txt
eng-enggnv.txt
eng-engjps.txt
eng-engkjvcpb.txt
eng-englee.txt
eng-englsv.txt
eng-englxxup.txt
eng-engnna.txt
eng-engnoy.txt
eng-engoebcw.txt
eng-engoebus.txt
eng-engoke.txt
eng-engourb.txt
eng-engPEV.txt
eng-engtcent.txt
eng-engtnt.txt
eng-engULB.txt
eng-engwebp.txt
eng-engwebpb.txt
eng-engwebster.txt
eng-engwmb.txt
eng-engwmbb.txt
eng-engwyc2017.txt
eng-engwyc2018.txt
eng-engWycliffe.txt
eng-engylt.txt
arb-arb-vd.txt
arb-arbnav.txt
spa-spabes.txt
spa-spablm.txt
spa-spapddpt.txt
spa-spaRV1909.txt
spa-sparvg.txt
spa-spavb
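The loading loop above pairs each line of a corpus file with the reference at the same position and skips empty lines (verses that a translation does not contain). The same idea can be sketched with `zip`, which avoids a dataframe lookup per line. The function name `read_verses` and the in-memory sample data are illustrative assumptions, not real corpus content.

```python
import io

def read_verses(version, lines, refs):
    """Yield [version, ref, text] for every non-empty line, pairing line n with refs[n]."""
    for ref, line in zip(refs, lines):
        text = line.strip()
        if text:
            yield [version, ref, text]

# In-memory stand-ins for vref.txt and one corpus file.
sample_refs = ["GEN 1:1", "GEN 1:2", "GEN 1:3"]
sample_file = io.StringIO("first verse\n\nthird verse\n")

rows = list(read_verses("demo-version", sample_file, sample_refs))
print(rows)
```

Note that `zip` stops at the shorter of its inputs, so a file with more lines than there are references would be silently truncated rather than raising an error.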

In [10]:
verses_table = None

# If the table already exists, drop it
# We don't want to be appending the same data over and over
tables = db.table_names()
if 'verses' in tables:
    db.drop_table('verses')

BATCH_SIZE = 10000
i = 0
while i < len(verses):
    # Take the next BATCH_SIZE verses as a slice (the original list is left unchanged)
    v = verses[i:i+BATCH_SIZE]

    df_verses = pd.DataFrame(v, columns=['version', 'vref', 'text'])
    print(df_verses.head())
    df_embeddings = with_embeddings(embed_func, df_verses)

    if verses_table is None:
        verses_table = db.create_table('verses', df_embeddings)
    else:
        verses_table.add(df_embeddings)
    
    i += BATCH_SIZE
    # Report progress as the share of verses processed so far, capped at 100 %
    print(min(i, len(verses)) / len(verses) * 100, '%')
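The batching pattern in this cell can be isolated into a small generator, which keeps the slicing and the progress arithmetic separate from the embedding and database calls. This is a standard-library sketch with a list of integers standing in for the verses; `batches` is a hypothetical helper, not a LanceDB function.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sample = list(range(25))   # stand-in for the list of verses
done = 0
for batch in batches(sample, 10):
    # In the notebook, this is where the batch would be embedded and written to the table.
    done += len(batch)
    print(f"{done / len(sample) * 100:.0f} %")
```

Counting the items actually processed, rather than multiplying the offset by the batch size, keeps the reported percentage between 0 and 100 even when the last batch is short.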

         version     vref                                               text
0  grc-grc-tisch  MAT 1:1  Βίβλος γενέσεως Ἰησοῦ Χριστοῦ υἱοῦ Δαυεὶδ υἱοῦ...
1  grc-grc-tisch  MAT 1:2  Ἀβραὰμ ἐγέννησεν τὸν Ἰσαάκ, Ἰσαὰκ δὲ ἐγέννησεν...
2  grc-grc-tisch  MAT 1:3  Ἰούδας δὲ ἐγέννησεν τὸν Φάρες καὶ τὸν Ζάρα ἐκ ...
3  grc-grc-tisch  MAT 1:4  Ἀρὰμ δὲ ἐγέννησεν τὸν Ἀμιναδάβ, Ἀμιναδὰβ δὲ ἐγ...
4  grc-grc-tisch  MAT 1:5  Σαλμὼν δὲ ἐγέννησεν τὸν Βόες ἐκ τῆς Ῥαχάβ, Βόε...
1772246
        version       vref                                               text
0  grc-grcbrent  EXO 20:19  Καὶ εἶπαν πρὸς Μωυσῆν, λάλησον σὺ ἡμῖν, καὶ μὴ...
1  grc-grcbrent  EXO 20:20  Καὶ λέγει αὐτοῖς Μωυσῆς, θαρσεῖτε· ἕνεκεν γὰρ ...
2  grc-grcbrent  EXO 20:21  Εἱστήκει δὲ ὁ λαὸς μακρόθεν, Μωυσῆς δὲ εἰσῆλθε...
3  grc-grcbrent  EXO 20:22  Εἶπε δὲ Κύριος πρὸς Μωυσῆν, τάδε ἐρεῖς τῷ οἴκῳ...
4  grc-grcbrent  EXO 20:23  Οὐ ποιήσετε ὑμῖν αὐτοῖς θεοὺς ἀργυροῦς, καὶ θε...
1772246
        version      vref                             