## Team Members
1) Muzammil Lakdawala (C0872315)
2) Keerat Singh (C0851344)
3) Gurdaan Walia (C0872042)
4) Manuel Paredes (C0874185)

Deliverables for your Assignment:

1) Describe the MODEL that you are using in your code as the Model for your Embedding. Research and Discuss WHY you choose that model. How is it of particular value to your Project Business Domain.

2) Research and select a MODEL for your Embedding (and therefore later your Project), and support and defend your reasoning and decision making as to why you choose that MODEL for your Use Cases and Business Domain:

3) If you were doing this at work: What licensing and pricing considerations for using the APIs would factor into account?

1) **Model Description and Selection**: The model used in the code is `jinaai/jina-embeddings-v2-base-en`¹. It is an English, monolingual embedding model that supports sequences of up to 8192 tokens¹. It is based on a BERT architecture (JinaBERT) that uses the symmetric bidirectional variant of ALiBi to support longer sequences¹. The backbone, `jina-bert-v2-base-en`, is pretrained on the C4 dataset¹, and the model is further trained on Jina AI's collection of more than 400 million sentence pairs and hard negatives¹. These pairs come from various domains and were selected through a thorough cleaning process¹. We chose the model for its ability to handle long sequences, which makes it useful for tasks that involve long documents, including long-document retrieval, semantic textual similarity, text reranking, recommendation, RAG, and LLM-based generative search¹.

2) **Model Value for Business Domain**: The `jinaai/jina-embeddings-v2-base-en` model is of particular value to many business domains due to its extended context capabilities⁴. For instance, in the legal domain, it can capture and analyze intricate details in extensive legal texts effectively⁴. In the medical research domain, it can holistically embed scientific papers for advanced analytics and discoveries⁴. The model's ability to handle long sequences makes it especially useful when processing long documents is needed¹.

3) **Licensing and Pricing Considerations**: The `jinaai/jina-embeddings-v2-base-en` model is freely available under the Apache 2.0 license³, so there is no licensing fee, although self-hosting still incurs compute and infrastructure costs. The Apache 2.0 license permits commercial use, but before shipping the model in a commercial product you should review its terms (for example, the attribution and notice requirements) to ensure compliance. If the model is consumed through a hosted API instead, pricing depends on the specific provider and usage requirements. Relevant factors include the number of API calls needed, data transfer costs, and whether the provider offers a free tier or volume discounts. Always review the provider's pricing documentation for the most accurate information.
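
To make the "symmetric bidirectional variant of ALiBi" from the model description more concrete, here is a minimal NumPy sketch of the kind of attention bias it refers to. It assumes the geometric per-head slope schedule from the original ALiBi paper and a power-of-two head count. The function name `symmetric_alibi_bias` is hypothetical, and this is an illustration, not Jina AI's implementation.

```python
import numpy as np

def symmetric_alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Return a (num_heads, seq_len, seq_len) bias that is added to attention scores."""
    # Assumed slope schedule: 2^(-8/n), 2^(-16/n), ... with one slope per head.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    positions = np.arange(seq_len)
    # Symmetric variant: the penalty grows with |i - j| in both directions.
    distance = np.abs(positions[:, None] - positions[None, :])
    return -slopes[:, None, None] * distance[None, :, :]

bias = symmetric_alibi_bias(num_heads=8, seq_len=4)
print(bias.shape)  # (8, 4, 4)
print(bias[0])     # first head, slope 0.5
```

Because the bias depends only on the distance between positions, not on learned position embeddings, the same rule can be applied to sequences longer than those seen in training.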

(1) jinaai/jina-embeddings-v2-base-en · Hugging Face. https://huggingface.co/jinaai/jina-embeddings-v2-base-en.

(2) jina-embeddings-v2-base-en model | Clarifai - The World's AI. https://clarifai.com/jinaai/jina-embeddings/models/jina-embeddings-v2-base-en.

(3) Jina AI's Open-Source Embedding Model Outperforms OpenAI's Ada - InfoQ. https://www.infoq.com/news/2023/11/jina-ai-embeddings/.

(4) jinaai/jina-embeddings-v2-small-en · Hugging Face. https://huggingface.co/jinaai/jina-embeddings-v2-small-en.

(5) Embedding API - jinaai.cn. https://www.jinaai.cn/embeddings/.

(6) Jina AI’s jina-embeddings-v2: an open source text embedding model that .... https://www.baseten.co/blog/jina-embeddings-v2-open-source-text-embedding-that-matches-openai-ada-002/.

(7) Jina Embeddings - Finetuner documentation. https://finetuner.jina.ai/get-started/pretrained/.

# Before using a pretrained model from Hugging Face, let's create our own model

In [56]:

corpus = [
    "Who is Luke Skywalker's father?", "Darth Vader is Luke Skywalker's father.",
    "What is the name of Han Solo's ship?", "Han Solo's ship is called the Millennium Falcon.",
    "Who is the main antagonist in Star Wars: Episode IV - A New Hope?", "Darth Vader is the main antagonist in Episode IV.",
    "What is the Force?", "The Force is a mystical energy field in the Star Wars universe.",
    "Who trained Obi-Wan Kenobi in the ways of the Jedi?", "Obi-Wan Kenobi was trained by Qui-Gon Jinn.",
    "What is the home planet of Chewbacca?", "Chewbacca's home planet is Kashyyyk.",
    "Who is the Supreme Leader of the First Order in the sequel trilogy?", "Snoke is the Supreme Leader of the First Order.",
    "What is the name of Anakin Skywalker's lightsaber?", "Anakin Skywalker's lightsaber is called the Skywalker lightsaber.",
    "Who played Princess Leia in the original trilogy?", "Carrie Fisher played Princess Leia.",
    "What is the capital of the Galactic Republic?", "The capital of the Galactic Republic is Coruscant.",
]


In [57]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# Convert text to sequence of integers
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences for equal length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
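
The cell above turns every sentence into a set of growing prefixes and left-pads them to a common length, so that the last token of each row can serve as the prediction target. The following pure-Python sketch shows the same idea on a tiny example. The token ids are hypothetical, and `prefix_sequences` and `pad_pre` are illustrative helpers, not Keras functions.

```python
def prefix_sequences(token_ids):
    """Every prefix of length >= 2, mirroring the loop in the cell above."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]

def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen`.

    Unlike Keras `pad_sequences`, this sketch does not truncate longer sequences.
    """
    return [[value] * (maxlen - len(seq)) + seq for seq in sequences]

tokens = [5, 2, 9, 7]  # hypothetical ids for a four-word sentence
prefixes = prefix_sequences(tokens)
padded = pad_pre(prefixes, maxlen=len(tokens))

# Each row splits into (input tokens) -> (next word to predict).
for row in padded:
    print(row[:-1], "->", row[-1])
# [0, 0, 5] -> 2
# [0, 5, 2] -> 9
# [5, 2, 9] -> 7
```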

## Model Explanation

1. **Embedding Layer**: The first layer is an Embedding layer, which is used for word embeddings. Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. The layer is configured with the vocabulary size (`total_words`) and the input sequence length (`max_sequence_len-1`), and it maps each integer-encoded word to a dense vector of fixed size (10 in this case). In this model it is the first layer.

2. **LSTM Layer**: The next layer is an LSTM (Long Short-Term Memory) layer with 50 units. LSTM is a type of recurrent neural network (RNN) that can learn and remember over long sequences and is less prone to the vanishing gradient problem than traditional RNNs, for which it is a common issue. This makes LSTMs useful for processing and making predictions based on time series data or any data where the temporal dynamics are important.

3. **Dense Layer**: The final layer is a Dense layer, which is a regular densely-connected neural network layer. It implements the operation: `output = activation(dot(input, kernel) + bias)`. Here, `total_words` is the dimensionality of the output space and `softmax` is the activation function. The softmax function outputs a vector that represents the probability distribution of a list of potential outcomes.

4. **Compilation**: Finally, the model is compiled with the `adam` optimizer and the `categorical_crossentropy` loss function, which is suitable for multi-class classification problems. The model's performance is measured with the `accuracy` metric during training and testing.
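
As a small illustration of the last two points, the sketch below computes a softmax distribution and the categorical cross-entropy loss for a single example in NumPy. The logits and the three-word vocabulary are hypothetical, and the helper functions are written out by hand for clarity instead of using the Keras implementations.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return float(-np.sum(one_hot_target * np.log(probs)))

logits = np.array([2.0, 1.0, 0.1])  # hypothetical scores for a 3-word vocabulary
probs = softmax(logits)
target = np.array([1.0, 0.0, 0.0])  # the true next word is class 0

print(probs, probs.sum())
print(categorical_crossentropy(target, probs))
```

The loss shrinks toward zero as the probability assigned to the true next word approaches 1, which is what the optimizer drives during training.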

In [78]:
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(total_words, 10, input_length=max_sequence_len-1))  # Embedding layer
model.add(LSTM(50))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 12, 10)            690       
                                                                 
 lstm_7 (LSTM)               (None, 50)                12200     
                                                                 
 dense_4 (Dense)             (None, 69)                3519      
                                                                 
Total params: 16409 (64.10 KB)
Trainable params: 16409 (64.10 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
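
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers with the sizes shown above (vocabulary of 69, embedding size 10, 50 LSTM units); the helper function names are illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit.
    return input_dim * output_dim + output_dim

vocab_size, embed_dim, lstm_units = 69, 10, 50
counts = [
    embedding_params(vocab_size, embed_dim),
    lstm_params(embed_dim, lstm_units),
    dense_params(lstm_units, vocab_size),
]
print(counts, sum(counts))  # [690, 12200, 3519] 16409
```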


In [79]:
from tensorflow.keras.utils import to_categorical

# Splitting data into predictors and label
X = input_sequences[:,:-1]
y = input_sequences[:,-1]

# One-hot encoding the labels
y = to_categorical(y, num_classes=total_words)

# Training the model
model.fit(X, y, epochs=1000, verbose=1)

Epoch 1/1000
Epoch 2/1000
...
Epoch 72/1000
...

<keras.src.callbacks.History at 0x21b5f6e5510>
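
The predictor/label split and the one-hot encoding used in the training cell can be reproduced in plain NumPy, which makes the shapes easier to inspect. The padded array and the vocabulary size below are hypothetical stand-ins for `input_sequences` and `total_words`.

```python
import numpy as np

padded = np.array([[0, 0, 5, 2],
                   [0, 5, 2, 9],
                   [5, 2, 9, 7]])  # hypothetical padded prefixes
num_classes = 10                   # hypothetical vocabulary size

X = padded[:, :-1]      # all tokens except the last form the input
labels = padded[:, -1]  # the last token is the word to predict
y = np.eye(num_classes, dtype="float32")[labels]  # one one-hot row per label

print(X.shape, y.shape)   # (3, 3) (3, 10)
print(y.argmax(axis=1))   # [2 9 7]
```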

# Now we extract the embeddings from our trained embedding layer using the code shown below

In [80]:
embedding_layer = model.layers[0]
weights = embedding_layer.get_weights()[0]

# Create a dictionary to store the embeddings
word_embeddings = {}
for word, i in tokenizer.word_index.items():
    word_embeddings[word] = weights[i]

In [81]:
print(word_embeddings)

{'the': array([-0.89323634,  0.7945932 , -0.35135734,  0.79967153,  0.6161243 ,
        0.8542147 , -0.5984699 ,  0.7532137 ,  0.42572156, -0.1615027 ],
      dtype=float32), 'is': array([ 0.6353876 , -0.10452574, -1.0016447 ,  0.14203502,  0.72258884,
       -0.3668314 , -0.37684783,  0.1418532 ,  0.9317049 ,  0.5978924 ],
      dtype=float32), 'of': array([ 0.3407598 , -0.16956581, -0.8923172 , -0.3106645 ,  0.3644033 ,
       -0.2924137 ,  0.56571835, -0.8931208 ,  0.5035306 ,  0.55873203],
      dtype=float32), 'in': array([ 0.5310066 , -0.21680789, -0.928487  ,  0.54026145,  0.70341337,
       -0.33885482,  0.6776855 , -0.30425638,  0.6794363 ,  0.7236018 ],
      dtype=float32), 'who': array([-0.44685215, -0.47316214,  0.05553109, -0.42508668, -0.24375921,
       -0.34375715, -0.31000587,  0.20505954,  0.11338197, -0.01668107],
      dtype=float32), 'what': array([ 0.4385224 , -0.73268396,  0.6786778 ,  1.1386111 , -0.12788138,
       -0.49299604,  0.6300115 , -0.25826788, -0.534
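
One way to inspect a dictionary like `word_embeddings` is to rank words by cosine similarity to a query word. The sketch below does this on hypothetical 3-dimensional vectors standing in for the learned 10-dimensional ones; `most_similar` is an illustrative helper and not part of the notebook.

```python
import numpy as np

def most_similar(word, embeddings, top_k=2):
    """Rank the other words by cosine similarity to `word`."""
    query = embeddings[word]
    scores = {}
    for other, vec in embeddings.items():
        if other == word:
            continue
        scores[other] = float(query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical vectors; the real ones come from the trained Embedding layer.
toy_embeddings = {
    "vader": np.array([0.9, 0.1, 0.0]),
    "father": np.array([0.8, 0.2, 0.1]),
    "ship": np.array([0.0, 0.1, 0.9]),
}
print(most_similar("vader", toy_embeddings))
```

With a corpus as small as the one used here, such neighbours mostly reflect which words co-occur in the training sentences, so they should be read with caution.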

In [82]:
import numpy as np

def generate_response(model, tokenizer, max_sequence_len, input_text, num_words=1):
    for _ in range(num_words):
        # Tokenize input text
        token_list = tokenizer.texts_to_sequences([input_text])[0]
        
        # Pad sequences
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        
        # Predict next word
        predicted_word_index = np.argmax(model.predict(token_list), axis=-1)

        print(predicted_word_index)
        
        # Convert index to word (index 0 is reserved for padding and has no word)
        predicted_word = tokenizer.index_word.get(int(predicted_word_index[0]), "")
        
        # Update input text for the next iteration
        input_text += " " + predicted_word
    
    return input_text

# Example usage
user_input = "Who is Luke"
response = generate_response(model, tokenizer, max_sequence_len, user_input, num_words=10)
print(response)


[7]
[10]
[10]
[10]
[50]
[4]
[1]
[23]
[1]
[20]
Who is Luke skywalker's father father father field in the iv the star
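
The repeated words in the output above are typical of greedy decoding, where the loop always picks the single most likely next word. One common alternative, shown here only as a hedged sketch and not something the notebook does, is to sample the next index from the softmax output after temperature scaling. The probabilities below are hypothetical, and `sample_next_index` is an illustrative helper.

```python
import numpy as np

def sample_next_index(probs, temperature=1.0, rng=None):
    """Sample a word index from `probs` after temperature scaling."""
    if rng is None:
        rng = np.random.default_rng()
    # Work in log space; the small constant avoids log(0).
    logits = np.log(np.asarray(probs, dtype="float64") + 1e-12) / temperature
    scaled = np.exp(logits - logits.max())
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

probs = [0.05, 0.6, 0.3, 0.05]  # hypothetical softmax output for 4 words
rng = np.random.default_rng(0)
samples = [sample_next_index(probs, temperature=1.0, rng=rng) for _ in range(5)]
print(samples)
# A very low temperature concentrates the mass on the most likely index.
print(sample_next_index(probs, temperature=0.01, rng=rng))
```

Higher temperatures give more varied text at the cost of more unlikely words; very low temperatures behave like the greedy `argmax` used in `generate_response`.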


# Now let's use a model from Hugging Face

Installation: The `transformers` library, which provides pre-trained models for various text-related tasks, is installed with the command `!pip install transformers`.

Imports: The `AutoModel` class from the `transformers` library and the `norm` function from the `numpy.linalg` module are imported.

Cosine Similarity Function: A function named `cos_sim` is defined to calculate the cosine similarity between two vectors. Cosine similarity is the cosine of the angle between two non-zero vectors: values close to 1 mean the vectors point in nearly the same direction, and values close to 0 mean they are nearly orthogonal.

Model Loading: A pre-trained model, `jinaai/jina-embeddings-v2-base-en`, is loaded using the `AutoModel.from_pretrained` method. The `trust_remote_code=True` argument is required to use the `encode` method of the model.

Encoding and Similarity Calculation: Two sentences, "How is the weather today?" and "What is the current weather like today?", are encoded using the pre-trained model. The cosine similarity between the resulting embeddings can then be calculated using the `cos_sim` function.
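
As a quick sanity check of the cosine similarity formula, the sketch below applies it to small hand-made vectors whose geometry is obvious. It needs only NumPy; the vectors are made up for illustration and are not model embeddings.

```python
import numpy as np
from numpy.linalg import norm

def cos_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (norm(a) * norm(b)))

parallel = cos_sim(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
orthogonal = cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cos_sim(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))

print(parallel)    # approximately 1: same direction
print(orthogonal)  # 0.0: unrelated directions
print(opposite)    # -1.0: opposite directions
```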

In [7]:
# !pip install transformers
from transformers import AutoModel
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True) # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
# print(cos_sim(embeddings[0], embeddings[1]))





In [8]:
embeddings

array([[-0.34827104, -0.60091805,  0.6022362 , ..., -0.2523272 ,
         0.23249894, -0.7026478 ],
       [-0.11724894, -0.89896137,  0.4500913 , ..., -0.02847653,
        -0.22871459, -0.42282885]], dtype=float32)

# Now let's use data from Hugging Face to generate embeddings. We will fetch the data using the API

In [33]:
import requests

# Fetch the first 100 rows of the dataset from the Hugging Face datasets server
url = "https://datasets-server.huggingface.co/rows?dataset=benlehrburger%2Fcollege-text-corpus&config=default&split=train&offset=0&length=100"
response = requests.get(url)
response.raise_for_status()  # fail early on an HTTP error

# The data is returned as JSON; collect the text of each row
data = response.json()
sentences = [row['row']['text'] for row in data['rows']]
sentences

['The experiment that I outline in the following paper is designed to shed light on any relationship between information exposure and attentional cognition.',
 '\tThe past few decades have been so information-filled and information-dependent that they have been appropriately deemed the “Information Age.” This trend can be attributed to increases in the accessibility, interconnectedness, and potency of technology, specifically big data. There are now more than 2.5 quintillion bytes of data created each day (Lu et al., 2014), enough to fill twenty billion human brains (Marois et al., 2005). Since we cannot physically intake all this information, we must navigate the sea of data to identify what is pertinent to us. The problem of prioritizing and processing select information is nothing new; it is likely that our brains evolved attentional mechanisms that internalize only the most important bits of information necessary to form a complete conception. But now, I hypothesize that our attent
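
The request above fetches a single page of 100 rows. To read further pages, the same URL can be rebuilt with a different `offset`. The sketch below constructs such URLs and extracts the `text` column from a payload with the same shape as the response used above. `build_rows_url` and `extract_texts` are hypothetical helper names, the sample payload is made up, and no network call is made.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co/rows"

def build_rows_url(dataset, offset=0, length=100, config="default", split="train"):
    """Build a /rows URL like the one used in the cell above."""
    query = urlencode({
        "dataset": dataset,
        "config": config,
        "split": split,
        "offset": offset,
        "length": length,
    })
    return f"{BASE_URL}?{query}"

def extract_texts(payload, column="text"):
    """Pull one column out of the JSON structure returned by the endpoint."""
    return [item["row"][column] for item in payload["rows"]]

# Two consecutive pages of 100 rows each.
urls = [build_rows_url("benlehrburger/college-text-corpus", offset=o) for o in (0, 100)]
print(urls[0])

# Hypothetical payload with the same nesting as the real response.
sample = {"rows": [{"row": {"text": "first"}}, {"row": {"text": "second"}}]}
print(extract_texts(sample))  # ['first', 'second']
```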

In [14]:
from transformers import AutoModel
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True) # trust_remote_code is needed to use the encode method
embeddings = model.encode(sentences)

In [15]:
embeddings

array([[-0.24711892, -0.62824494,  0.91264004, ..., -0.11137763,
         0.0649934 , -0.50874877],
       [-0.6640367 , -0.79840165,  0.89068604, ...,  0.61323667,
        -0.28749797, -0.41671854],
       [-0.41613916, -1.185504  ,  0.685599  , ...,  0.5142365 ,
        -0.36112726, -0.49074647],
       ...,
       [-0.3899692 , -0.751353  ,  0.6884154 , ...,  0.24628246,
        -0.05247517, -0.34310815],
       [-0.7399178 , -0.8254181 ,  0.82034874, ...,  0.69983196,
        -0.21441585, -0.04659319],
       [-0.12853403, -0.18616784,  0.652913  , ...,  0.2621476 ,
         0.04487842, -0.6869442 ]], dtype=float32)
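
An embedding matrix like the one above is typically used for similarity search: normalise the rows, take dot products with a query vector, and keep the highest scores. The sketch below shows this with a random matrix as a stand-in for `embeddings`. The 768-dimensional width is an assumption about the model's output size, and `top_k_similar` is an illustrative helper.

```python
import numpy as np

def top_k_similar(query_vec, matrix, k=3):
    """Return indices and cosine scores of the k rows most similar to query_vec."""
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = matrix_unit @ query_unit
    order = np.argsort(-scores)[:k]  # highest scores first
    return order, scores[order]

rng = np.random.default_rng(0)
# Random stand-in for the real (100, embedding_dim) array produced by model.encode.
fake_embeddings = rng.normal(size=(100, 768)).astype("float32")

indices, scores = top_k_similar(fake_embeddings[0], fake_embeddings, k=3)
print(indices, scores)
```

The first hit is the query row itself with a score of about 1, which is a convenient check that the normalisation is correct; in practice that row would be skipped.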

## Now let's try to use a pretrained model for conversation about Star Wars

Reasons for choosing BlenderBot:

Conversational AI: BlenderBot is specifically trained for conversational tasks, making it suitable for projects that involve generating responses in a chat-like interface or dialogue systems.

Large-scale training data: BlenderBot has been trained on a diverse and extensive dataset, which helps it capture a wide range of language nuances and context.

Distilled version for efficiency: The use of the distilled version (blenderbot-400M-distill) allows for more efficient usage in terms of memory and inference time while still retaining a substantial amount of the original model's capabilities.

Pre-trained model: The model is pre-trained, which means it has already learned a significant amount from various dialogues. This is advantageous as it reduces the need for extensive training on custom datasets for your specific use case.

2) **Model Value for Business Domain**: BlenderBot's value to a business domain lies in open-domain dialogue rather than long-document processing. It fits chat-style interfaces such as customer-facing assistants or FAQ bots, where responses must read naturally. The extended-context benefits described earlier for `jinaai/jina-embeddings-v2-base-en` (for example, embedding long legal texts or scientific papers) do not carry over to BlenderBot, which generates replies instead of producing embeddings.

3) **Licensing and Pricing Considerations**: The `facebook/blenderbot-400M-distill` checkpoint is listed under the Apache 2.0 license on its Hugging Face model card, so there is no licensing fee. Running the model still incurs compute costs.

In [86]:
from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

mname = "facebook/blenderbot-400M-distill"
model = BlenderbotForConditionalGeneration.from_pretrained(mname)
tokenizer = BlenderbotTokenizer.from_pretrained(mname)
UTTERANCE = "Explain the plot of star wars."
inputs = tokenizer([UTTERANCE], return_tensors="pt")
reply_ids = model.generate(**inputs)
print(tokenizer.batch_decode(reply_ids))

['<s> Do you like Star Wars? It is an American science fiction epic space opera film directed by George R. R. Martin.</s>']
