<a href="https://colab.research.google.com/github/justinjunge/Convergent-Wisdom-Project/blob/main/MakeEmbeddings_Dec2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a modular notebook for generating vector embeddings from a source file. This version loads three sacred texts (the Bhagavad Gita, the Quran, and the Christian Bible) from CSV files hosted on GitHub in TheConvergentWisdomProject, then creates embeddings, saves them, and downloads them.

Embeddings made using this notebook in October 2024 appear in the Embeddings folder in TheConvergentWisdomProject, and they are also used in the SemanticAnalysis


In [None]:
#import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from scipy import stats
import random
import copy
import torch

In [None]:
# Install (or upgrade) the embedding library and its dependencies
!pip install -U sentence-transformers
!pip install torch torchvision torchaudio transformers
import torch
from sentence_transformers import SentenceTransformer, util

Collecting sentence-transformers
  Downloading sentence_transformers-3.3.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.3.1-py3-none-any.whl (268 kB)
Installing collected packages: sentence-transformers
  Attempting uninstall: sentence-transformers
    Found existing installation: sentence-transformers 3.2.1
    Uninstalling sentence-transformers-3.2.1:
      Successfully uninstalled sentence-transformers-3.2.1
Successfully installed sentence-transformers-3.3.1


In [None]:
!pip install nltk
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

This version (October 2024) uses the `all-MiniLM-L6-v2` model from Sentence Transformers, as described [here](https://sbert.net/examples/applications/computing-embeddings/README.html).

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
dfG = pd.read_csv ('https://raw.githubusercontent.com/justinjunge/Convergent-Wisdom-Project/main/CSV_files/Gita.csv')
print(dfG)

     Chapter  Verse                                               Text
0          1      1  Dhritarashtra said: O Sanjay, after gathering ...
1          1      2  Sanjay said: On observing the Pandava army sta...
2          1      3  Duryodhan said: Respected teacher!  Behold the...
3          1      4  Behold in their ranks are many powerful warrio...
4          1      5  There are also accomplished heroes like Dhrish...
..       ...    ...                                                ...
695       18     74  Sanjay said: Thus, have I heard this wonderful...
696       18     75  By the grace of Veda Vyas, I have heard this s...
697       18     76  As I repeatedly recall this astonishing and wo...
698       18     77  And remembering that most astonishing and wond...
699       18     78  Wherever there is Shree Krishna, the Lord of a...

[700 rows x 3 columns]


In [None]:
# Get the column of interest
columnG = dfG["Text"]

# Convert the column to a list of strings with quotes
quoted_listG = ['"' + str(x) + '"' for x in columnG]

# Print the list
print(quoted_listG)

['"Dhritarashtra said: O Sanjay, after gathering on the holy field of Kurukshetra, and desiring to fight, what did my sons and the sons of Pandu do?"', '"Sanjay said: On observing the Pandava army standing in military formation, King Duryodhan approached his teacher Dronacharya, and said the following words."', '"Duryodhan said: Respected teacher!  Behold the mighty army of the sons of Pandu, so expertly arrayed for battle by your own gifted disciple, the son of Drupad."', '"Behold in their ranks are many powerful warriors, like Yuyudhan, Virat, and Drupad, wielding mighty bows and equal in military prowess to Bheem and Arjun. "', '"There are also accomplished heroes like Dhrishtaketu, Chekitan, the gallant King of Kashi, Purujit, Kuntibhoj, and Shaibya, all the best of men."', '"In their ranks, they also have the courageous Yudhamanyu, the gallant Uttamauja, the son of Subhadra, and the sons of Draupadi, who are all great warrior chiefs."', '"O best of Brahmins, hear too about the pri
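One thing to keep in mind with the quoting step: `str(x)` converts every cell to text, including missing ones. pandas reads an empty CSV cell as a float NaN, so such a cell would be embedded as the literal text "nan". Nothing in this notebook shows whether the source files contain empty cells, so the sketch below is only a precautionary check on toy data. `missing_text_rows` is a hypothetical helper, not part of the notebook.

```python
import math

def missing_text_rows(values):
    """Return the positions of entries that are None or NaN."""
    return [
        i for i, value in enumerate(values)
        if value is None or (isinstance(value, float) and math.isnan(value))
    ]

# Toy stand-in for dfG["Text"]; the NaN mimics an empty CSV cell.
column = ["Dhritarashtra said: ...", float("nan"), "Sanjay said: ..."]

# Same comprehension as the cell above.
quoted = ['"' + str(x) + '"' for x in column]

print(quoted[1])                  # the missing cell has become quoted "nan" text
print(missing_text_rows(column))  # positions worth inspecting before encoding
```

Running a check like this on a column before calling `model.encode` keeps the row order intact, so the embeddings still line up with the original verses.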

In [None]:
body_tokenG = quoted_listG[0:]

text_embeddingG = model.encode(body_tokenG, convert_to_tensor=True)
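
Each row of the resulting tensor is one verse's embedding vector. Embeddings like these are usually compared with cosine similarity; the `util` module imported earlier provides `cos_sim` for tensors. The sketch below shows the same calculation in plain Python on toy three-dimensional vectors (real `all-MiniLM-L6-v2` embeddings have 384 dimensions). The function name and the vectors are illustrative assumptions, not part of this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings.
u = [1.0, 0.0, 1.0]
v = [1.0, 0.0, 1.0]   # same direction as u
w = [0.0, 1.0, 0.0]   # orthogonal to u

print(cosine_similarity(u, v))  # close to 1: same direction
print(cosine_similarity(u, w))  # 0: no overlap
```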


In [None]:
import torch

In [None]:
torch.save(text_embeddingG, "Gita_Embeddings.pt")

In [None]:
from google.colab import files

In [None]:
files.download('Gita_Embeddings.pt')


The following two cells make embeddings for the Bible and the Quran.
Each adds a step that converts the unit of analysis from verse to chapter. (For the Gita, the unit of analysis is the verse.)
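
The verse-to-chapter conversion can be sketched on toy data as follows. This is a minimal illustration, assuming rows arrive in reading order and that a verse numbered 1 always opens a new chapter. `group_verses_into_chapters` and its `separator` parameter are hypothetical names, not part of the notebook. With the default empty separator, verses are joined directly; passing `" "` inserts a space between them.

```python
def group_verses_into_chapters(rows, separator=""):
    """Concatenate verse texts into chapter texts.

    rows: iterable of (verse_number, text) pairs in reading order.
    A verse numbered 1 is assumed to open a new chapter.
    """
    chapters = []
    for verse_number, text in rows:
        if verse_number == 1 or not chapters:
            # Start a new chapter with this verse.
            chapters.append(str(text))
        else:
            # Append the verse to the chapter currently being built.
            chapters[-1] = chapters[-1] + separator + str(text)
    return chapters

toy_rows = [
    (1, "First verse of chapter one."),
    (2, "Second verse of chapter one."),
    (1, "First verse of chapter two."),
]

chapters = group_verses_into_chapters(toy_rows)
spaced = group_verses_into_chapters(toy_rows, separator=" ")
print(len(chapters))
print(chapters[0])
print(spaced[0])
```

Building a list and appending to its last element avoids having to know the number of chapters in advance.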

In [None]:
dfB = pd.read_csv('https://raw.githubusercontent.com/justinjunge/Convergent-Wisdom-Project/main/jj_ChristianBible.csv')

## The following code groups the Bible's verses into 1189 chapters
dfB2 = pd.DataFrame(np.empty((1189, 2), dtype=str))
dfB2.columns = ["chapter", "text"]
counter = 0
for i in range(len(dfB)):
  if dfB.loc[i, "verse"] == 1:
    # Verse 1 starts a new chapter
    dfB2.loc[counter, "text"] = dfB.loc[i, "text"]
    counter = counter + 1
  else:
    # Append every later verse to the current chapter.
    # .loc assigns in a single step, which avoids pandas' chained-assignment warning
    dfB2.loc[counter-1, "text"] = dfB2.loc[counter-1, "text"] + dfB.loc[i, "text"]

# The remainder of this cell repeats the steps used above for the Gita

# Get the column of interest
columnB = dfB2["text"]
quoted_listB = ['"' + str(x) + '"' for x in columnB]
body_tokenB = quoted_listB[0:]
text_embeddingB = model.encode(body_tokenB, convert_to_tensor=True)

torch.save(text_embeddingB, "Bible_Embeddings.pt")

In [None]:
dfQ = pd.read_csv('https://raw.githubusercontent.com/justinjunge/Convergent-Wisdom-Project/main/CSV_files/Quran.csv')

## The following code groups the Quran's verses (ayahs) into 114 chapters
dfQ2 = pd.DataFrame(np.empty((114, 2), dtype=str))
dfQ2.columns = ["chapter", "text"]
counter = 0
for i in range(len(dfQ)):
  if dfQ.loc[i, "Ayah"] == 1:
    # Ayah 1 starts a new chapter
    dfQ2.loc[counter, "text"] = dfQ.loc[i, "Text"]
    counter = counter + 1
  else:
    # Append every later ayah to the current chapter.
    # .loc assigns in a single step, which avoids pandas' chained-assignment warning
    dfQ2.loc[counter-1, "text"] = dfQ2.loc[counter-1, "text"] + dfQ.loc[i, "Text"]

# The remainder of this cell repeats the steps used above for the Gita

# Get the column of interest
columnQ = dfQ2["text"]
quoted_listQ = ['"' + str(x) + '"' for x in columnQ]
body_tokenQ = quoted_listQ[0:]
text_embeddingQ = model.encode(body_tokenQ, convert_to_tensor=True)

torch.save(text_embeddingQ, "Quran_Embeddings.pt")

# Inspect the chapter-level DataFrame
print(dfQ2)

    chapter                                               text
0            In the name of Allah, Most Gracious, Most Merc...
1            A. L. M.This is the Book; in it is guidance su...
2            A. L. M.Allah! There is no god but He,-the Liv...
3            O mankind! reverence your Guardian-Lord, who c...
4            O ye who believe! fulfil (all) obligations. La...
..      ...                                                ...
109          When comes the Help of Allah, and Victory,And ...
110          Perish the hands of the Father of Flame! Peris...
111          Say: He is Allah, the One and Only;Allah, the ...
112          Say: I seek refuge with the Lord of the DawnFr...
113          Say: I seek refuge with the Lord and Cherisher...

[114 rows x 2 columns]
