# Embedding for JSON File

##### Importing Libraries

In [1]:
from transformers import BertTokenizer, BertModel
import torch
import json

##### Loading the Dataset

In [2]:
json_file = 'entries .json'  # Provide the path to your JSON file
with open(json_file, 'r') as f:
    json_data = json.load(f)

##### Extracting text from the JSON File

In [3]:
def extract_text_from_json(json_data):
    text_data = []
    for entry in json_data:
        # Add text from each field that contains textual information
        for key, value in entry.items():
            if isinstance(value, str):
                text_data.append(value)
    return text_data
text_data = extract_text_from_json(json_data)
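
The function above collects only the top-level string values of each entry. If the entries contain nested objects or lists, a recursive walk can reach those strings as well. The following is a minimal sketch under that assumption; `extract_text_recursive` and the sample data are hypothetical and not part of the original notebook.

```python
def extract_text_recursive(node):
    """Collect every string value found anywhere inside a parsed JSON structure."""
    if isinstance(node, str):
        return [node]
    texts = []
    if isinstance(node, dict):
        # Dictionary keys are field names, so only the values are visited
        for value in node.values():
            texts.extend(extract_text_recursive(value))
    elif isinstance(node, list):
        for item in node:
            texts.extend(extract_text_recursive(item))
    # Numbers, booleans and None carry no text and are skipped
    return texts


# Hypothetical sample data, used only to illustrate the traversal
sample = [{"title": "First", "meta": {"tags": ["a", "b"], "views": 3}}]
nested_text = extract_text_recursive(sample)
```

Strings are returned in the order they are encountered, so the result can be passed to the embedding step in the same way as `text_data`.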

##### Load the Pretrained BERT Model

###### BERT MODEL
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer encoder pretrained on large amounts of text. Unlike earlier left-to-right language models, BERT looks at the words both before and after each word, so the representation of a word depends on its full context. This makes it well suited to tasks that depend on context and nuance, such as text classification, question answering, and named-entity recognition. The `bert-base-uncased` checkpoint used here lowercases its input and produces a 768-dimensional vector for each token.

In [4]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


##### Mapping Tokens with BERT for Embedding

In [5]:
def process_tokens_with_bert(text_data):
    embeddings = []
    model.eval()  # Inference mode: keeps dropout disabled so results are deterministic
    for text in text_data:
        # Tokenize the text, truncating to BERT's 512-token limit
        inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)

        # Forward pass through the BERT model; gradients are not needed for inference
        with torch.no_grad():
            outputs = model(**inputs)

        # Token-level embeddings, shape (1, sequence_length, 768)
        last_hidden_states = outputs.last_hidden_state

        # Average pooling over the token dimension to get a single embedding for the entire text
        avg_embedding = torch.mean(last_hidden_states, dim=1).squeeze(0).numpy()

        embeddings.append(avg_embedding)
    return embeddings
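
Average pooling reduces the per-token vectors to one vector per text by taking the mean over the token dimension. The function above embeds one text at a time, so no padding tokens are present; if several texts were batched together, the padded positions would need to be excluded from the mean. The arithmetic can be sketched with plain Python lists, assuming a mask that marks real tokens with 1 and padding with 0 (as the tokenizer's `attention_mask` does). `masked_mean_pool` is a hypothetical helper, not part of the notebook.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    # Mean of each coordinate across the kept tokens
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy 2-dimensional "embeddings": two real tokens followed by one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)  # the padding row does not affect the result
```

With tensors, the same idea would multiply the hidden states by the mask before summing and divide by the number of real tokens.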

##### Generate Embeddings

In [6]:
embeddings = process_tokens_with_bert(text_data)

# Print the embeddings
for embedding_vector in embeddings:
    print(embedding_vector)

[ 5.02123721e-02  1.12346485e-01  4.25063550e-01 -1.41579688e-01
  3.47938955e-01 -2.70838737e-01  1.26685128e-01  2.96716273e-01
 -8.64960477e-02 -1.71615183e-01  5.75192124e-02 -3.83529007e-01
 -3.96623284e-01  7.66541362e-01 -1.38765685e-02  6.87290788e-01
  1.28633648e-01 -2.31156182e-02 -1.09559722e-01  2.55929440e-01
  2.84802705e-01 -3.78664255e-01  8.23299773e-03  5.74061632e-01
  2.58468855e-02 -3.68808806e-02  9.83352587e-02 -8.51845294e-02
  1.22518808e-01 -2.86140833e-02  2.15316370e-01 -1.10999100e-01
 -2.59791277e-02 -1.94666922e-01 -8.02257285e-02  2.68163458e-02
 -2.44049400e-01 -3.69037658e-01  1.88233927e-02  2.67065376e-01
 -4.96291220e-01 -5.95294118e-01 -4.46089119e-01  6.50597736e-02
  1.68627240e-02  1.17836855e-01  7.63212144e-02  2.90247560e-01
  5.05711257e-01 -4.59208399e-01 -5.29490590e-01  1.39297023e-01
  3.83904669e-03 -1.38876662e-01  1.73704892e-01  8.23037803e-01
  7.20133632e-02 -4.24176872e-01 -5.85744344e-02 -2.54372448e-01
 -1.91707909e-01 -5.09258

[ 2.07495451e-01  7.66727775e-02  3.41922730e-01 -2.54636228e-01
  2.00911164e-01 -8.97395462e-02  1.18984260e-01  1.81595773e-01
 -1.18100464e-01 -2.75741518e-01  2.62762047e-02 -2.06164092e-01
 -2.20092058e-01  6.36040628e-01 -4.46902104e-02  5.88375628e-01
  3.55054319e-01  1.14122689e-01 -2.70664960e-01  3.38006526e-01
  3.89930695e-01 -2.33473390e-01  2.86001682e-01  5.89537680e-01
 -1.29856929e-01 -2.51366887e-02  2.33296975e-01 -6.08405992e-02
  2.08577365e-01 -2.83150841e-02  2.50484675e-01 -8.98556113e-02
 -1.96403503e-01 -2.26971403e-01 -7.75187388e-02  2.91192792e-02
 -3.20970595e-01 -2.85408825e-01  2.50763912e-02 -4.68609668e-02
 -3.59573066e-01 -6.85147941e-01 -3.55254501e-01  5.90188093e-02
 -1.67492226e-01 -9.75025538e-03 -1.03841126e-01 -1.07126366e-02
  2.90076435e-01 -1.86011121e-01 -3.34984869e-01  3.05138052e-01
 -1.07330285e-01 -7.06124455e-02  6.72761649e-02  8.65407169e-01
  3.96766067e-02 -5.18820941e-01  2.47646406e-01 -1.97554961e-01
 -1.80279985e-01 -5.85762

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
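
The rate-limit message above appears because the cell prints every vector in full. A lighter check is to print only a short summary of each vector. The sketch below assumes each embedding is a sequence of floats; `summarize_embeddings` and the toy vectors are hypothetical.

```python
def summarize_embeddings(embeddings, preview=3):
    """Return one short line per embedding: its index, length and first few values."""
    lines = []
    for index, vector in enumerate(embeddings):
        head = ", ".join(f"{float(x):.4f}" for x in list(vector)[:preview])
        lines.append(f"[{index}] dim={len(vector)} first={head}")
    return lines


# Toy vectors standing in for the real 768-dimensional embeddings
demo = [[0.05, 0.11, 0.42, -0.14], [0.21, 0.08, 0.34, -0.25]]
for line in summarize_embeddings(demo):
    print(line)
```

Calling `summarize_embeddings(embeddings)` on the notebook's list would produce one line per text instead of 768 numbers per text.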



##### Converting the Embedded Vectors to Lists to Save in a JSON-Compatible Form

In [9]:

embedded_vectors_json = [embedding.tolist() for embedding in embeddings]

json_output_file = 'JSON_OUT.json'
with open(json_output_file, 'w') as json_file:
    json.dump(embedded_vectors_json, json_file)

print("Embedded vectors saved to:", json_output_file)


Embedded vectors saved to: JSON_OUT.json
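
The saved file holds only the vectors, so the link between a vector and the text it came from is lost once the notebook is closed. One option, sketched here as an assumption rather than the notebook's method, is to store each text next to its vector. The helper `build_records`, the keys `text` and `embedding`, and the temporary file are all hypothetical.

```python
import json
import os
import tempfile


def build_records(texts, vectors):
    """Pair each text with its embedding so the file is self-describing."""
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must have the same length")
    # float(...) also converts NumPy scalars into JSON-serializable values
    return [{"text": t, "embedding": [float(x) for x in v]} for t, v in zip(texts, vectors)]


# Toy data and a temporary file, used only to illustrate the round trip
records = build_records(["hello", "world"], [[0.1, 0.2], [0.3, 0.4]])
path = os.path.join(tempfile.mkdtemp(), "records.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f)
with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)
```

In the notebook, `build_records(text_data, embeddings)` would play the role of `embedded_vectors_json`.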


##### Retrieving from the saved File

In [10]:
import json
import numpy as np

json_input_file = 'JSON_OUT.json'

with open(json_input_file, 'r') as json_file:
    embedded_vectors_json = json.load(json_file)

embedded_vectors = np.array(embedded_vectors_json)

print("Embedded vectors retrieved successfully.")
print(embedded_vectors.shape)  # Print the shape instead of the full list, which exceeds the notebook's IOPub data rate limit

