## Chunking and BERT

In [46]:
import warnings
warnings.filterwarnings('ignore')

In [47]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import time

In [2]:
data = pd.read_csv('video_subtitles.csv',nrows=5000)

In [3]:
data.shape

(5000, 3)

In [4]:
data.head()

Unnamed: 0,num,name,content_clean
0,9180533,the.message.(1976).eng.1cd,in the name of god the most gracious the...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,yumi s cells 2 episode extremely polite y...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,yumi s cells 2 episode 39 laptop firs...
4,9180600,broker.(2022).eng.1cd,if you re going to throw it away then don...


In [5]:
df = data

In [6]:
import re

def preprocess_text(text):
    # Collapse every run of two or more whitespace characters into a single space
    cleaned_text = re.sub(r'\s{2,}', ' ', text)
    return cleaned_text
df['content_clean'] = df['content_clean'].apply(preprocess_text)
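
A quick check of what the pattern above does on a small string. The sample text and the name `collapse_whitespace` are made up for illustration; note that `\s` also matches tabs and newlines, and that a single whitespace character is left as-is.

```python
import re

def collapse_whitespace(text):
    # Same pattern as preprocess_text: runs of 2+ whitespace chars become one space.
    return re.sub(r'\s{2,}', ' ', text)

# Hypothetical sample: triple space, a space/newline run, and a lone tab.
sample = "in the name   of god \n\n the most\tgracious"
cleaned = collapse_whitespace(sample)
print(repr(cleaned))
```

The lone tab survives because it is a run of length one; if tabs and newlines should also be normalised, `\s+` would be the pattern to try instead.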

In [7]:
import tensorflow as tf

In [11]:
from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer
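# Upper bound on chunk size, measured in bert-base-uncased tokens.
# Note: all-MiniLM-L6-v2 (used for the embeddings later in this notebook)
# truncates long inputs (256 word pieces by default, per its model card),
# so only the beginning of a 1500-token chunk contributes to its embedding.
MAX_TOKENS = 1500
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, capacity=MAX_TOKENS)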
counter = 0

def chunk_document(content):
    """Split one subtitle document into chunks and log how long it took."""
    global counter  # Running count of processed documents
    counter += 1
    start_time = time.perf_counter()
    chunks_with_model = splitter.chunks(content)
    execution_time = time.perf_counter() - start_time
    print(counter)
    print("Chunking operation completed in {:.2f} seconds ".format(execution_time))
    return chunks_with_model
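
To make the idea of a chunk capacity concrete, here is a deliberately simplified sketch that packs whitespace-separated words into fixed-size groups. It is not what `semantic_text_splitter` does internally: the real splitter counts tokenizer tokens and prefers semantic boundaries. The function name `chunk_by_capacity` and the sample sentence are assumptions for illustration only.

```python
def chunk_by_capacity(text, capacity):
    """Greedily pack whitespace-separated words into chunks of at most `capacity` words."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    words = text.split()
    # Step through the word list in strides of `capacity`.
    return [" ".join(words[i:i + capacity]) for i in range(0, len(words), capacity)]

chunks = chunk_by_capacity(
    "in the name of god the most gracious the most merciful", capacity=4
)
for i, chunk in enumerate(chunks, start=1):
    print(f"CHUNK {i}: {chunk}")
```

Every chunk except possibly the last is exactly `capacity` words long, which is why the number of chunks per document varies with document length.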


In [12]:
# Apply chunking to each document in the DataFrame.
# The splitter runs on the CPU, so a TensorFlow device scope has no effect here.
df['chunks'] = df['content_clean'].apply(chunk_document)

1
Chunking operation completed in 0.87 seconds 
2
Chunking operation completed in 0.01 seconds 
3
Chunking operation completed in 0.38 seconds 
4
Chunking operation completed in 0.52 seconds 
5
Chunking operation completed in 0.78 seconds 
6
Chunking operation completed in 0.25 seconds 
7
Chunking operation completed in 0.96 seconds 
8
Chunking operation completed in 0.01 seconds 
9
Chunking operation completed in 0.01 seconds 
10
Chunking operation completed in 0.01 seconds 
11
Chunking operation completed in 0.10 seconds 
12
Chunking operation completed in 0.01 seconds 
13
Chunking operation completed in 0.10 seconds 
14
Chunking operation completed in 0.01 seconds 
15
Chunking operation completed in 0.01 seconds 
16
Chunking operation completed in 0.37 seconds 
17
Chunking operation completed in 0.74 seconds 
18
Chunking operation completed in 0.31 seconds 
19
Chunking operation completed in 0.23 seconds 
20
Chunking operation completed in 0.23 seconds 
21
Chunking operation complet

In [13]:
for i, chunk in enumerate(df['chunks'][0]):
    print(f"CHUNK {i+1}: ", chunk )

CHUNK 1:  in the name of god the most gracious the most merciful from muhammad the messenger of god to heraclius the emperor of byzantium greetings to him who is the follower of righteous guidance i bid you to hear the divine call i am the messenger of god to the people accept islam for your salvation he speaks of a new prophet in arabia was it like this when john the baptist came to king herod out of the desert crying about salvation to muqawqis patriarch of alexandria kisra emperor of persia muhammad calls you with the call of god accept islam for your salvation embrace islam you come out of the desert smelling of camel and goat to tell persia where he should kneel muhammad messenger of god who gave him this authority god sent muhammad as a mercy to mankind the scholars and historians of islam the university of al azhar in cairo the high islamic congress of the shiat in lebanon the makers of this film honour the islamic tradition which holds that the impersonation of the prophet offe

In [14]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")



In [15]:
counter = 0

def embedding(content):
    """Encode a list of chunks; returns an array of shape (n_chunks, 384)."""
    global counter  # Running count of processed documents
    counter += 1
    start_time = time.perf_counter()
    chunk_embeddings = model.encode(content)
    execution_time = time.perf_counter() - start_time
    print(counter)
    print("Embedding operation completed in {:.2f} seconds".format(execution_time))
    return chunk_embeddings

In [16]:
# sentence-transformers runs on PyTorch and selects its own device,
# so a TensorFlow device scope is not needed here.
df['emb'] = df['chunks'].apply(embedding)

1
Embedding operation completed in 1.76 seconds
2
Embedding operation completed in 0.08 seconds
3
Embedding operation completed in 0.26 seconds
4
Embedding operation completed in 0.43 seconds
5
Embedding operation completed in 0.69 seconds
6
Embedding operation completed in 0.21 seconds
7
Embedding operation completed in 0.62 seconds
8
Embedding operation completed in 0.09 seconds
9
Embedding operation completed in 0.08 seconds
10
Embedding operation completed in 0.08 seconds
11
Embedding operation completed in 0.20 seconds
12
Embedding operation completed in 0.08 seconds
13
Embedding operation completed in 0.21 seconds
14
Embedding operation completed in 0.08 seconds
15
Embedding operation completed in 0.08 seconds
16
Embedding operation completed in 0.26 seconds
17
Embedding operation completed in 0.65 seconds
18
Embedding operation completed in 0.22 seconds
19
Embedding operation completed in 0.21 seconds
20
Embedding operation completed in 0.22 seconds
21
Embedding operation comple

In [17]:
df.head()

Unnamed: 0,num,name,content_clean,chunks,emb
0,9180533,the.message.(1976).eng.1cd,in the name of god the most gracious the most...,[in the name of god the most gracious the most...,"[[-0.038742643, 0.14162847, -0.070024535, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the b...,[ah there s princess dawn and terry with the b...,"[[-0.077799045, -0.019268993, 0.047039438, -0...."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,yumi s cells 2 episode extremely polite yumi ...,[yumi s cells 2 episode extremely polite yumi ...,"[[-0.14962062, -0.14706895, 0.055221133, -0.04..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,yumi s cells 2 episode 39 laptop first place ...,[yumi s cells 2 episode 39 laptop first place ...,"[[-0.11727133, -0.060495883, 0.06533917, -0.02..."
4,9180600,broker.(2022).eng.1cd,if you re going to throw it away then don t g...,[if you re going to throw it away then don t g...,"[[-0.065826565, 0.0031215963, 0.017494537, -0...."


In [18]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   num            5000 non-null   int64 
 1   name           5000 non-null   object
 2   content_clean  5000 non-null   object
 3   chunks         5000 non-null   object
 4   emb            5000 non-null   object
dtypes: int64(1), object(4)
memory usage: 195.4+ KB


In [19]:
df['emb'][5].shape

(3, 384)

In [20]:
len(df['emb'][0])

7

In [21]:
# Note: list and array columns are written to CSV as their string representations
df.to_csv('video_subtitles1.csv', index=False)

In [22]:
exploded_df = df.explode(['chunks', 'emb'])

In [23]:
exploded_df.shape

(22619, 5)
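
The jump from 5000 rows to 22619 comes from emitting one row per (chunk, embedding) pair. A plain-Python sketch of that step, using made-up rows and a hypothetical helper `explode_rows`; it mirrors the idea of the multi-column `explode` but does not reproduce every pandas behaviour (for example, how empty lists are handled).

```python
# Toy rows standing in for the DataFrame: two chunks for num=1, one for num=2.
rows = [
    {"num": 1, "chunks": ["a b", "c d"], "emb": [[0.1, 0.2], [0.3, 0.4]]},
    {"num": 2, "chunks": ["e f"], "emb": [[0.5, 0.6]]},
]

def explode_rows(rows, columns):
    """Yield one output row per list element, copying the other fields unchanged."""
    for row in rows:
        lengths = {len(row[col]) for col in columns}
        if len(lengths) != 1:
            raise ValueError("columns to explode must have matching lengths")
        # zip pairs the i-th chunk with the i-th embedding.
        for values in zip(*(row[col] for col in columns)):
            yield {**row, **dict(zip(columns, values))}

exploded = list(explode_rows(rows, ["chunks", "emb"]))
for row in exploded:
    print(row)
```

As in the notebook, the per-document fields (`num` here) are repeated on every exploded row, which is why the index values repeat in the `tail()` output until the index is reset.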

In [24]:
exploded_df.tail(20)

Unnamed: 0,num,name,content_clean,chunks,emb
4995,9203069,reign.s01.e16.monsters.(2014).eng.1cd,use the free code joinnow at www playships eu...,use the free code joinnow at www playships eu ...,"[-0.09329108, -0.035985764, 0.068635985, 0.007..."
4995,9203069,reign.s01.e16.monsters.(2014).eng.1cd,use the free code joinnow at www playships eu...,the same spot i m here on behalf of another a ...,"[0.017117752, -0.1072273, 0.052833226, 0.05586..."
4995,9203069,reign.s01.e16.monsters.(2014).eng.1cd,use the free code joinnow at www playships eu...,minute i have some royal customs and clothes t...,"[-0.0487563, -0.036066856, 0.023064941, -0.036..."
4995,9203069,reign.s01.e16.monsters.(2014).eng.1cd,use the free code joinnow at www playships eu...,you two getting along again i know just the th...,"[0.00065269304, -0.058537498, 0.019920824, -0...."
4996,9203070,reign.s01.e17.liege.lord.(2014).eng.1cd,advertise your product or brand here contact ...,advertise your product or brand here contact w...,"[-0.052993715, 0.0090676, 0.051016606, -0.0390..."
4996,9203070,reign.s01.e17.liege.lord.(2014).eng.1cd,advertise your product or brand here contact ...,t want her to know that we know we must find o...,"[-0.09753958, -0.04141836, -0.005782377, -0.03..."
4996,9203070,reign.s01.e17.liege.lord.(2014).eng.1cd,advertise your product or brand here contact ...,ask me that question when you have earned the ...,"[-0.091781765, -0.0076913694, 0.0060868817, -0..."
4996,9203070,reign.s01.e17.liege.lord.(2014).eng.1cd,advertise your product or brand here contact ...,dear they ll shoulder lost causes and take on ...,"[-0.04651844, -0.04769734, 0.036056582, -0.011..."
4997,9203071,reign.s01.e18.no.exit.(2014).eng.1cd,previously on reign it s not what we chose we...,previously on reign it s not what we chose we ...,"[-0.094702, -0.0059790076, 0.07485432, 0.00229..."
4997,9203071,reign.s01.e18.no.exit.(2014).eng.1cd,previously on reign it s not what we chose we...,some game to save france some embarrassment my...,"[-0.018079458, -0.079063475, 0.018396856, -0.1..."


In [25]:
exploded_df.to_csv('full_video_subtitles.csv',index=False)

In [26]:
exploded_df = exploded_df.reset_index(drop=True)

In [27]:
exploded_df.shape

(22619, 5)

In [28]:

embeddings = exploded_df['emb'].tolist()
ids = exploded_df.index.astype(str).tolist()  # Convert the integer index to string IDs

documents = exploded_df['chunks'].tolist()
metadata = exploded_df.drop(['content_clean','chunks','emb'], axis = 1).to_dict(orient = 'records')

In [29]:
documents[20000]

'father gave me fortitude when i was sick as a child he died soon after i married i m sorry if i alarmed you why do you do this sketch me it is my work and my habit you take care of people i draw them you have a husband at war no i m widowed it s been well quite a while now or do you have an amour forgive me i am too inquisitive painters are very aware of color for example your cheeks just turned pink it s complicated it s not a doctor oh no pauvre i once knew a doctor who could say precisely what i felt by the pulse of a single vein he sounds most observant he came to salpetriere to be a surgeon well they all come for that but more than the skill with his hands his greatest gift was his mind diagnostique seeing subtle signs of malady his treatment was very imaginative did you love him do you love yours miss mary i m sorry i ve been told to prepare you for travel travel no i i m to remain here major mcburney ordered it what no you you tell him i want to see him immediately i m not goin

In [30]:
embeddings[20000]

array([-1.08318720e-02,  1.14078652e-02,  5.36149777e-02, -8.29183217e-03,
        1.13421902e-02,  2.20190249e-02,  1.26442671e-01, -8.67985189e-03,
        5.33005362e-03, -7.39185065e-02, -6.24225661e-02, -1.85600519e-02,
        4.42798324e-02, -2.25373264e-02, -5.38464934e-02,  3.56786959e-02,
        1.58190634e-02,  1.67831406e-02, -9.34234411e-02,  7.03647286e-02,
       -3.92329367e-03,  7.99494609e-02, -1.36089167e-02, -6.37173057e-02,
       -9.04341862e-02,  2.03320328e-02, -4.13964018e-02, -2.92496048e-02,
       -4.26860899e-02, -4.52286331e-03,  3.01101338e-02,  2.75595449e-02,
        3.09035350e-02,  4.83868122e-02, -1.68869700e-02,  2.30935533e-02,
       -8.53673518e-02,  1.10261902e-01, -5.23688346e-02,  2.69253254e-02,
       -8.70506745e-03,  3.19940969e-02,  6.42745495e-02, -2.90992390e-02,
        4.41749021e-02, -8.30030143e-02, -4.56465743e-02, -2.34987512e-02,
        1.15208529e-01,  1.87002551e-02, -1.25130162e-01, -5.34767359e-02,
       -1.15279099e-02, -

In [31]:
# Convert each NumPy vector to a plain list; `emb` avoids shadowing the embedding() function
embeddings_as_lists = [emb.tolist() for emb in embeddings]


In [37]:
import chromadb

# Persist the collection on disk under /content
client = chromadb.PersistentClient(path="/content")


In [38]:
collection = client.create_collection(name="Search_Engine",
                                      metadata={"hnsw:space": "cosine"} # l2 is the default
)
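
With `hnsw:space` set to `cosine`, the distances returned by queries should be read as 1 minus the cosine similarity, so 0 means same direction and larger means less similar. A minimal sketch of that quantity in plain Python; the helper name `cosine_distance` and the toy vectors are assumptions for illustration.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction, different length
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # unrelated directions
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite directions
print(same, orthogonal, opposite)
```

Because the measure ignores vector length, it compares only the direction of two embeddings, which is usually what matters for sentence embeddings.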


In [39]:
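# Add the chunks in batches rather than one call per row.
# batch_size is an arbitrary choice made here, not a value from the original run.
batch_size = 5000
for start in range(0, len(ids), batch_size):
    end = start + batch_size
    collection.add(
        documents=documents[start:end],
        embeddings=embeddings_as_lists[start:end],
        ids=ids[start:end],
        metadatas=metadata[start:end],
    )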


In [40]:
# query_texts is embedded by the collection's own embedding function
results = collection.query(
    query_texts=["up when he has fallen   is described by them as upsetting social order   to this inhumanity has come a man   whom god chose   and in that we believe   you ve overcome  i beg you to collect yourself   i speak of the messenger of god   muhammad teaches us to worship one god   to speak truth   to love our neighbors as ourselves   to give charity even a smile can be charity   to protect women from misuse   to shelter orphans   and to turn away from gods of wood and stone   i cannot keep still and hear this blasphemy   we are an ancient civilization   to call our gods wood and stone is to speak ignorantly of them   the idol the form is not what we worship   but the spirit that resides within the form   i agree that idolatry is not always fully understood   thank you   now let me bring him back to the women   god made woman to be the proper companion of man   she is different but equal   equal   we buy them   feed them  clothe them   use them  discard them   women equal to us   god created man from one male and one female   amr  you must respect in all woman the womb that bore you   why are your  guards so tongue tied   while this only guard is eloquent   god has spoken to us before   through abraham  noah  moses and through jesus christ   why should we be so surprised that god speaks to us now through muhammad   who taught you those names   they are named in the quran   i knew muhammad when he was an orphan minding sheep   and we knew christ as a carpenter   what christ says and what your muhammad says   is like two raised from the same land   they are lying to you they deny christ   you worship three gods  they say father  son and holy ghost  they say   what do you say of christ   they say god cannot have a son   christ is not the son of god   speak to me of christ   we say of christ what our prophet has taught us   that god cast his holy spirit into the womb of a virgin named  mary   and that she conceived christ  the apostle of god   the apostle he says not the son  not the son   what does your miracle  your quran say of the birth of our dear lord jesus christ   may i relate the words   come closer to me   in the name of god most gracious  most merciful   relate in the book  the story of mary   how she withdrew from her family to a place in the east   how we sent to her our angel  gabriel  who said   i am a messenger from your god   to announce the birth of a holy son to you   she said   how shall i  mary  have a son when no man has touched me   and gabriel replied   for your lord says  it will happen   we appoint him as a sign onto man   and a mercy from us"],
    n_results=5
)



In [41]:
print(results)

{'ids': [['1', '6', '2', '13332', '21961']], 'distances': [[0.4356571435928345, 0.4643341898918152, 0.47441935539245605, 0.47961968183517456, 0.48448920249938965]], 'metadatas': [[{'name': 'the.message.(1976).eng.1cd', 'num': 9180533}, {'name': 'the.message.(1976).eng.1cd', 'num': 9180533}, {'name': 'the.message.(1976).eng.1cd', 'num': 9180533}, {'name': 'sodom.and.gomorrah.(1962).eng.1cd', 'num': 9194455}, {'name': '10000.bc.(2008).eng.1cd', 'num': 9202523}]], 'embeddings': None, 'documents': [['there jafar when god gave him these words dawn is coming up ammar you first then you jaafar ammar you kept your mother awake all night with worry i m sorry father where were you have you been with muhammad again what will happen now forgive him it was my fault i did it that god has helped us all our lives but it fell it could not even help itself what talk have you been listening to the real god is unseen he s not made of clay ammar we see the gods in the kaaba every day i m afraid for you you

In [42]:
# Connect to the ChromaDB client and collection
client = chromadb.PersistentClient(path="/content")

collection = client.get_collection("Search_Engine")

In [43]:
model = SentenceTransformer('all-MiniLM-L6-v2')
query_embedding = model.encode(["up when he has fallen   is described by them as upsetting social order   to this inhumanity has come a man   whom god chose   and in that we believe   you ve overcome  i beg you to collect yourself   i speak of the messenger of god   muhammad teaches us to worship one god   to speak truth   to love our neighbors as ourselves   to give charity even a smile can be charity   to protect women from misuse   to shelter orphans   and to turn away from gods of wood and stone   i cannot keep still and hear this blasphemy   we are an ancient civilization   to call our gods wood and stone is to speak ignorantly of them   the idol the form is not what we worship   but the spirit that resides within the form   i agree that idolatry is not always fully understood   thank you   now let me bring him back to the women   god made woman to be the proper companion of man   she is different but equal   equal   we buy them   feed them  clothe them   use them  discard them   women equal to us   god created man from one male and one female   amr  you must respect in all woman the womb that bore you   why are your  guards so tongue tied   while this only guard is eloquent   god has spoken to us before   through abraham  noah  moses and through jesus christ   why should we be so surprised that god speaks to us now through muhammad   who taught you those names   they are named in the quran   i knew muhammad when he was an orphan minding sheep   and we knew christ as a carpenter   what christ says and what your muhammad says   is like two raised from the same land   they are lying to you they deny christ   you worship three gods  they say father  son and holy ghost  they say   what do you say of christ   they say god cannot have a son   christ is not the son of god   speak to me of christ   we say of christ what our prophet has taught us   that god cast his holy spirit into the womb of a virgin named  mary   and that she conceived christ  the apostle of god   the apostle he says not the son  not the son   what does your miracle  your quran say of the birth of our dear lord jesus christ   may i relate the words   come closer to me   in the name of god most gracious  most merciful   relate in the book  the story of mary   how she withdrew from her family to a place in the east   how we sent to her our angel  gabriel  who said   i am a messenger from your god   to announce the birth of a holy son to you   she said   how shall i  mary  have a son when no man has touched me   and gabriel replied   for your lord says  it will happen   we appoint him as a sign onto man   and a mercy from us"])


In [44]:
results = collection.query(
                query_embeddings=query_embedding,
                n_results=5,
            )

In [45]:
results

{'ids': [['1', '6', '2', '13332', '21961']],
 'distances': [[0.435657262802124,
   0.46433424949645996,
   0.4744194746017456,
   0.4796196222305298,
   0.4844893217086792]],
 'metadatas': [[{'name': 'the.message.(1976).eng.1cd', 'num': 9180533},
   {'name': 'the.message.(1976).eng.1cd', 'num': 9180533},
   {'name': 'the.message.(1976).eng.1cd', 'num': 9180533},
   {'name': 'sodom.and.gomorrah.(1962).eng.1cd', 'num': 9194455},
   {'name': '10000.bc.(2008).eng.1cd', 'num': 9202523}]],
 'embeddings': None,
 'documents': [['there jafar when god gave him these words dawn is coming up ammar you first then you jaafar ammar you kept your mother awake all night with worry i m sorry father where were you have you been with muhammad again what will happen now forgive him it was my fault i did it that god has helped us all our lives but it fell it could not even help itself what talk have you been listening to the real god is unseen he s not made of clay ammar we see the gods in the kaaba every d
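
Since several hits can come from the same subtitle file, a search engine would typically collapse them to one entry per title. A minimal sketch of that post-processing step, assuming the nested result layout shown above; the distances are rounded copies of the output, and `rank_titles` is a hypothetical helper, not part of Chroma.

```python
# Rounded copy of the query output above (documents omitted for brevity).
results = {
    "ids": [["1", "6", "2", "13332", "21961"]],
    "distances": [[0.4357, 0.4643, 0.4744, 0.4796, 0.4845]],
    "metadatas": [[
        {"name": "the.message.(1976).eng.1cd", "num": 9180533},
        {"name": "the.message.(1976).eng.1cd", "num": 9180533},
        {"name": "the.message.(1976).eng.1cd", "num": 9180533},
        {"name": "sodom.and.gomorrah.(1962).eng.1cd", "num": 9194455},
        {"name": "10000.bc.(2008).eng.1cd", "num": 9202523},
    ]],
}

def rank_titles(results):
    """Return (name, best_distance) per subtitle file, closest first."""
    best = {}
    # Index 0 selects the hits for the first (and here only) query.
    for meta, dist in zip(results["metadatas"][0], results["distances"][0]):
        name = meta["name"]
        if name not in best or dist < best[name]:
            best[name] = dist
    return sorted(best.items(), key=lambda item: item[1])

ranked = rank_titles(results)
for name, dist in ranked:
    print(f"{dist:.4f}  {name}")
```

Keeping the smallest distance per title is one possible design; averaging over a title's hits, or counting them, would rank titles differently.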