<a href="https://colab.research.google.com/github/jrgosalvez/data255_DL/blob/main/HW12-Chatbot/Jorge_Gosalvez_DL255_HW12_rag_chatbot_Part_A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SJSU MSDS 255 DL, Spring 2024 - Building RAG Chatbots with LangChain
Homework 12 - Part A: Code Chatbot

Git: https://github.com/jrgosalvez/data255_DL

Sources:
* SJSU DL 255 RAG Chatbot with LangChain demo
* [OpenAI API key](https://platform.openai.com/account/api-keys) and [Pinecone API key](https://app.pinecone.io)
* [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings/use-cases)
* [RAGs with OpenAI](https://cookbook.openai.com/examples/parse_pdf_docs_for_rag)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/rag-chatbot.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/langchain/rag-chatbot.ipynb)

### Part A Goal

Build a code-understanding chatbot with LangChain, OpenAI, and the Pinecone vector database. The chatbot learns from external data using **R**etrieval **A**ugmented **G**eneration (RAG).

After uploading my previous code files to the knowledge base, I will be able to ask questions that use those files as context.

By the end, we will have a functioning chatbot and RAG pipeline that can hold a conversation and provide informative responses based on a knowledge base.

### Prerequisites

Install the following Python libraries:

- **langchain**: This is a library for GenAI. We'll use it to chain together different language models and components for our chatbot.
- **openai**: This is the official OpenAI Python client. We'll use it to interact with the OpenAI API and generate responses for our chatbot.
- **datasets**: This library provides a vast array of datasets for machine learning. We'll use it to load our knowledge base for the chatbot.
- **pinecone-client**: This is the official Pinecone Python client. We'll use it to interact with the Pinecone API and store our chatbot's knowledge base in a vector database.

**NOTE**: *OpenAI dataloaders do not load easily on local (on-prem) machines. To simplify the use of these loaders, an online notebook such as Colab is recommended.*

In [1]:
!pip install -qU \
    langchain==0.0.354 \
    openai==1.6.1 \
    datasets==2.10.1 \
    pinecone-client==3.1.0 \
    tiktoken==0.5.2

### BACKGROUND: Building a Chatbot (no RAG)

We will be relying heavily on the LangChain library to bring together the different components needed for our chatbot. To begin, we'll create a simple chatbot without any retrieval augmentation. We do this by initializing a `ChatOpenAI` object. For this we do need an [OpenAI API key](https://platform.openai.com/account/api-keys).

In [2]:
import os
from langchain.chat_models import ChatOpenAI
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OpenAI')

chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)
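
The cell above reads the key from Colab's `userdata` store, which only exists inside Colab. As a minimal sketch for running elsewhere, the helper below falls back to an environment variable when `google.colab` cannot be imported. The name `get_secret` and its parameters are hypothetical and not part of the original notebook.

```python
import os


def get_secret(name: str, env_var: str) -> str:
    """Return a secret from Colab's userdata store, or from an environment variable.

    `name` is the secret's name in Colab; `env_var` is the fallback variable.
    Both names are illustrative assumptions.
    """
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
    except ImportError:
        value = os.environ.get(env_var)
        if not value:
            raise RuntimeError(f"Set the {env_var} environment variable.")
        return value
    return userdata.get(name)
```

With this helper, the key could be loaded as `get_secret('OpenAI', 'OPENAI_API_KEY')` in either environment.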



Chats with OpenAI's `gpt-3.5-turbo` and `gpt-4` chat models are typically structured (in plain text) like this:

```
System: You are a helpful assistant.

User: Hi AI, how are you today?

Assistant: I'm great thank you. How can I help you?

User: I'd like to understand string theory.

Assistant:
```

The final `"Assistant:"` without a response is what would prompt the model to continue the conversation. In the official OpenAI `ChatCompletion` endpoint these would be passed to the model in a format like:

```python
[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi AI, how are you today?"},
    {"role": "assistant", "content": "I'm great thank you. How can I help you?"},
    {"role": "user", "content": "I'd like to understand string theory."}
]
```
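
To make the relationship between the two representations concrete, here is a small sketch that renders a list of role dictionaries as the plain-text transcript shown earlier, ending with the open `Assistant:` turn. The function name `render_transcript` is hypothetical; it only illustrates the structure and is not how the OpenAI API formats prompts internally.

```python
def render_transcript(messages):
    """Render role/content dicts as a plain-text chat transcript."""
    labels = {"system": "System", "user": "User", "assistant": "Assistant"}
    lines = [f"{labels[m['role']]}: {m['content']}" for m in messages]
    # The trailing empty assistant turn is what the model is asked to complete.
    lines.append("Assistant:")
    return "\n\n".join(lines)


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi AI, how are you today?"},
    {"role": "assistant", "content": "I'm great thank you. How can I help you?"},
    {"role": "user", "content": "I'd like to understand string theory."},
]
print(render_transcript(example))
```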

In LangChain there is a slightly different format. We use three _message_ objects like so:

In [3]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful data scientist python coder."),
    HumanMessage(content="Hi AI, how are you today?"),
    AIMessage(content="I'm great thank you. How can I help you?"),
    HumanMessage(content="I'd like to understand pytorch.")
]

The format is very similar: we've just swapped the `"system"` role for `SystemMessage`, the `"user"` role for `HumanMessage`, and the `"assistant"` role for `AIMessage`.

We generate the next response from the AI by passing these messages to the `ChatOpenAI` object.

In [4]:
res = chat(messages)
res



AIMessage(content='Sure! PyTorch is an open-source machine learning library for Python that provides a flexible and dynamic computational graph. It is widely used for deep learning tasks such as neural networks. PyTorch offers a wide range of tools and utilities for building, training, and deploying machine learning models. \n\nIf you have any specific questions about PyTorch or need help with a particular task, feel free to ask!')

In response we get another AI message object. We can print it more clearly like so:

In [5]:
print(res.content)

Sure! PyTorch is an open-source machine learning library for Python that provides a flexible and dynamic computational graph. It is widely used for deep learning tasks such as neural networks. PyTorch offers a wide range of tools and utilities for building, training, and deploying machine learning models. 

If you have any specific questions about PyTorch or need help with a particular task, feel free to ask!


### Stringing Messages for a Conversation
Because `res` is just another `AIMessage` object, we can append it to `messages`, add another `HumanMessage`, and generate the next response in the conversation.

In [6]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="Why do data scientists believe it can produce general artificial intelligence?"
)
# add to messages
messages.append(prompt)

# send to chat-gpt
res = chat(messages)

print(res.content)

PyTorch, like other deep learning frameworks, is a powerful tool for building complex neural networks and training them on large datasets. While PyTorch is widely used in the field of artificial intelligence, it is important to note that the development of general artificial intelligence (AGI) goes beyond just the choice of framework.

Data scientists believe that PyTorch can contribute to the development of AGI because of its flexibility, scalability, and ease of use in building sophisticated neural network architectures. By leveraging PyTorch's capabilities, researchers can experiment with advanced deep learning models and algorithms that may potentially lead to breakthroughs in AGI.

However, achieving AGI requires more than just a powerful framework - it involves interdisciplinary research in fields such as cognitive science, neuroscience, and philosophy, in addition to advancements in machine learning and computational technologies. While PyTorch is a valuable tool in the pursuit 
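
The append-then-ask pattern above can be wrapped in a small helper. The sketch below uses plain dictionaries and a stand-in `fake_chat` function instead of LangChain message objects and the `chat` model, so that it runs without an API key. Both `ask` and `fake_chat` are hypothetical names introduced for illustration.

```python
def ask(chat_fn, history, user_text):
    """Append a user turn, get a reply, and store the reply in the history."""
    history.append({"role": "user", "content": user_text})
    reply = chat_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_chat(history):
    """Stand-in for a real chat model: echoes the latest user message."""
    return f"(reply to: {history[-1]['content']})"


history = [{"role": "system", "content": "You are a helpful data scientist python coder."}]
ask(fake_chat, history, "I'd like to understand pytorch.")
ask(fake_chat, history, "Why is it popular?")
print(len(history))  # one system turn plus two user/assistant pairs
```

Because every turn is kept, the history grows with each question, which is what lets the model refer back to earlier answers.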

### Dealing with Hallucinations

We have our chatbot, but as mentioned — the knowledge of LLMs can be limited. The reason for this is that LLMs learn all they know during training. An LLM essentially compresses the "world" as seen in the training data into the internal parameters of the model. We call this knowledge the _parametric knowledge_ of the model.

By default, LLMs have no access to the external world.

The result of this is very clear when we ask LLMs about more recent information, like about the new (and very popular) Llama 2 LLM.

In [7]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="What is so special about Llama 2?"
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [8]:
print(res.content)

I'm not familiar with a specific technology or framework called "Llama 2" in the context of data science or artificial intelligence. It's possible that it may be a new or specialized tool that I'm not aware of.

If you can provide more context or details about Llama 2, I'd be happy to look into it further and provide more information or insights. Alternatively, if you meant something else or have a different question, feel free to clarify and I'll do my best to assist you.


Our chatbot can no longer help us because it doesn't contain the information we need to answer the question. It was very clear from this answer that the LLM doesn't know the information, but sometimes an LLM may respond as if it _does_ know the answer, and this can be very hard to detect.

The same gap appears when we ask about something more specialized, such as the `LLMChain` in LangChain:

In [9]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="Can you tell me about the LLMChain in LangChain?"
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [10]:
print(res.content)

I'm not aware of a specific technology called "LLMChain in LangChain." It's possible that these terms refer to specialized tools or concepts within a specific domain that I may not be familiar with.

If you can provide more context or details about LLMChain and LangChain, I'd be happy to try and help you understand them better. Alternatively, if you have a different question or topic in mind, feel free to let me know so I can assist you accordingly.


### Feed the LLM More Data Manually [Not scalable]
There is another way of feeding knowledge into LLMs. It is called _source knowledge_ and it refers to any information fed into the LLM via the prompt. We can try that with the LLMChain question. We can take a description of this object from the LangChain documentation.

In [11]:
llmchain_information = [
    "A LLMChain is the most common type of chain. It consists of a PromptTemplate, a model (either an LLM or a ChatModel), and an optional output parser. This chain takes multiple input variables, uses the PromptTemplate to format them into a prompt. It then passes that to the model. Finally, it uses the OutputParser (if provided) to parse the output of the LLM into a final format.",
    "Chains is an incredibly generic concept which returns to a sequence of modular components (or other chains) combined in a particular way to accomplish a common use case.",
    "LangChain is a framework for developing applications powered by language models. We believe that the most powerful and differentiated applications will not only call out to a language model via an api, but will also: (1) Be data-aware: connect a language model to other sources of data, (2) Be agentic: Allow a language model to interact with its environment. As such, the LangChain framework is designed with the objective in mind to enable those types of applications."
]

source_knowledge = "\n".join(llmchain_information)

We can feed this additional knowledge into our prompt with some instructions telling the LLM how we'd like it to use this information alongside our original query.

In [12]:
query = "Can you tell me about the LLMChain in LangChain?"

augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

Now we feed this into our chatbot as we were before.

In [13]:
# create a new user prompt
prompt = HumanMessage(
    content=augmented_prompt
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [14]:
print(res.content)

The LLMChain in LangChain is a common type of chain that is part of the LangChain framework for developing applications powered by language models. The LLMChain consists of a PromptTemplate, a model (either an LLM or a ChatModel), and an optional output parser. 

In the context of LangChain, a chain is a generic concept that refers to a sequence of modular components (or other chains) combined in a specific way to achieve a common use case. The LLMChain takes multiple input variables, formats them into a prompt using the PromptTemplate, passes that prompt to the model (LLM or ChatModel), and then uses the OutputParser (if provided) to parse the output of the language model into a final format.

LangChain aims to enable the development of applications that are data-aware and agentic, meaning they can connect a language model to other sources of data and allow the language model to interact with its environment. By leveraging the capabilities of the LLMChain within the LangChain framewor

The quality of this answer is phenomenal. This is made possible by augmenting our query with external knowledge (source knowledge). There's just one problem: how do we get this information in the first place?

Vector databases such as Pinecone can help us here. But first, we'll need a dataset.

### Importing the Data

In [15]:
# mount drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [16]:
!pip install pypdf



In [17]:
from langchain.document_loaders import PyPDFDirectoryLoader

#load pdf files
loader = PyPDFDirectoryLoader('/content/drive/MyDrive/MSDA/DATA255/codePDF')
data = loader.load()
print(data)



In [18]:
data

 Document(page_content='version data\n0 v2.0 {\'title\': \'Beyoncé\', \'paragraphs\': [{\'qas\': [{...\n1 v2.0 {\'title\': \'Frédéric_Chopin\', \'paragraphs\': [{\'...\n2 v2.0 {\'title\': \'Sino-T ibetan_relations_during_the_M...\n3 v2.0 {\'title\': \'IPod\', \'paragraphs\': [{\'qas\': [{\'qu...\n4 v2.0 {\'title\': \'The_Legend_of_Zelda:_T wilight_Princ...\n... ... ...\n437 v2.0 {\'title\': \'Infection\', \'paragraphs\': [{\'qas\': ...\n438 v2.0 {\'title\': \'Hunting\', \'paragraphs\': [{\'qas\': [{...\n439 v2.0 {\'title\': \'Kathmandu\', \'paragraphs\': [{\'qas\': ...\n440 v2.0 {\'title\': \'Myocardial_infarction\', \'paragraphs...\n441 v2.0 {\'title\': \'Matter\', \'paragraphs\': [{\'qas\': [{\'...\n442 rows × 2 columns\nOpen and pr eprocess (add special t okens) dataset per BER T format\nLoad pr eprocess (add special t okens) the SQU AD 2.0  dataset per BER T format. Get a minimum 20 QnA pairs. \ue313\n#\xa0Function\xa0to\xa0load\xa0SQuAD2\xa0data\xa0and\xa0add\xa0special\xa0tok ens

In [20]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split text data into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=20)
text_chunks = text_splitter.split_documents(data)
print(len(text_chunks))

166
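
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, which this version ignores, so it will not reproduce the 166 chunks above. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1024, chunk_overlap=20):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means the end of one chunk is repeated at the start of the next, so a statement that straddles a boundary is less likely to be cut off from its context.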


In [21]:
# check the chunks
text_chunks[2]

Document(page_content="version data\n0 v2.0 {'title': 'Beyoncé', 'paragraphs': [{'qas': [{...\n1 v2.0 {'title': 'Frédéric_Chopin', 'paragraphs': [{'...\n2 v2.0 {'title': 'Sino-T ibetan_relations_during_the_M...\n3 v2.0 {'title': 'IPod', 'paragraphs': [{'qas': [{'qu...\n4 v2.0 {'title': 'The_Legend_of_Zelda:_T wilight_Princ...\n... ... ...\n437 v2.0 {'title': 'Infection', 'paragraphs': [{'qas': ...\n438 v2.0 {'title': 'Hunting', 'paragraphs': [{'qas': [{...\n439 v2.0 {'title': 'Kathmandu', 'paragraphs': [{'qas': ...\n440 v2.0 {'title': 'Myocardial_infarction', 'paragraphs...\n441 v2.0 {'title': 'Matter', 'paragraphs': [{'qas': [{'...\n442 rows × 2 columns\nOpen and pr eprocess (add special t okens) dataset per BER T format\nLoad pr eprocess (add special t okens) the SQU AD 2.0  dataset per BER T format. Get a minimum 20 QnA pairs. \ue313\n#\xa0Function\xa0to\xa0load\xa0SQuAD2\xa0data\xa0and\xa0add\xa0special\xa0tok ens\xa0[CLS]\xa0and\xa0[SEP]\ndef\xa0load_squad_data (file_path ,\xa0num

In [22]:
# reformat chunks to improve vectorization; match 'jamescalam/llama-2-arxiv-papers-chunked' format sourced from Llama 2 ArXiv papers on huggingface
dataset = []

for i, chunk in enumerate(text_chunks):
    dataset.append({
        'doi': '',  # you can add a DOI here if available
        'chunk-id': str(i),
        'chunk': chunk,
        'id': '',  # you can add an ID here if available
        'title': '',  # you can add a title here if available
        'summary': '',  # you can add a summary here if available
        'source': '',  # you can add a source here if available
        'authors': [],  # you can add authors here if available
        'categories': [],  # you can add categories here if available
        'comment': '',  # you can add a comment here if available
        'journal_ref': None,  # you can add a journal reference here if available
        'primary_category': '',  # you can add a primary category here if available
        'published': '',  # you can add a published date here if available
        'updated': '',  # you can add an updated date here if available
        'references': []  # you can add references here if available
    })

print(dataset[3])

{'doi': '', 'chunk-id': '3', 'chunk': Document(page_content="data\xa0=\xa0 []\n\xa0\xa0\xa0\xa0for\xa0i\xa0in\xa0range(min(len(squad_data ['data']),\xa0num_samples )):\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0paragraphs\xa0=\xa0squad_data ['data'][i]['paragraphs' ]\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 for\xa0paragraph\xa0 in\xa0paragraphs :\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0context\xa0=\xa0paragraph ['context' ]\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0qas\xa0=\xa0paragraph ['qas']\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 for\xa0qa\xa0in\xa0qas:\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0question\xa0=\xa0qa ['question' ]\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0answers\xa0\xa0=\xa0qa ['answers' ]\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 if\xa0answers :\xa0\xa0#\xa0Check\xa0if\xa0answers\xa0are\xa0available\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0answer_te

#### Dataset Overview

The dataset consists of PDF exports of my (Jorge Gosalvez's) Deep Learning homework notebooks.

Because most **L**arge **L**anguage **M**odels (LLMs) only contain knowledge of the world as it was during training, they cannot answer our questions about Jorge's code without example data.

### Task 4: Building the Knowledge Base

We now have a dataset that can serve as our chatbot knowledge base. Our next task is to transform that dataset into the knowledge base that our chatbot can use. To do this we must use an embedding model and vector database.

We begin by initializing our connection to Pinecone, which requires a [free API key](https://app.pinecone.io).

In [23]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key= userdata.get('PineCone')

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [24]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

Then we initialize the index. We will be using OpenAI's `text-embedding-ada-002` model for creating the embeddings, so we set the `dimension` to `1536`.

In [25]:
import time

index_name = 'llama-2-rag'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
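
The readiness loop in the cell above polls without an upper bound. As a hedged sketch, a generic polling helper with a timeout could look like the following. `wait_until` is a hypothetical name, and the readiness check is passed in as a callable so the example runs without a Pinecone account.

```python
import time


def wait_until(is_ready, timeout_s=60.0, interval_s=1.0):
    """Poll `is_ready()` until it returns True or `timeout_s` seconds pass."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError("resource was not ready before the timeout")
        time.sleep(interval_s)


# Demo with a stand-in check that succeeds on the third call.
calls = {"n": 0}


def ready_on_third_call():
    calls["n"] += 1
    return calls["n"] >= 3


wait_until(ready_on_third_call, timeout_s=5.0, interval_s=0.0)
```

In the notebook, the check would be `lambda: pc.describe_index(index_name).status['ready']`, the same expression used in the loop above.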

Our index is now ready but it's empty. It is a vector index, so it needs vectors. As mentioned, to create these vector embeddings we will use OpenAI's `text-embedding-ada-002` model, which we can access via LangChain like so:

In [26]:
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")



Using this model we can create embeddings like so:

In [27]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed_model.embed_documents(texts)
len(res), len(res[0])

(2, 1536)

From this we get two (aligning to our two chunks of text) 1536-dimensional embeddings.
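
The index was created with `metric='dotproduct'`, so retrieval ranks stored vectors by their dot product with the query vector. The sketch below shows that ranking with toy 3-dimensional vectors and a brute-force scan, for illustration only. The names `dot` and `top_k` are hypothetical, and a real vector database does not search this way at scale.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, stored, k=3):
    """Return the ids of the k stored vectors with the highest dot product."""
    scored = [(dot(query_vec, vec), vec_id) for vec_id, vec in stored.items()]
    scored.sort(reverse=True)
    return [vec_id for _, vec_id in scored[:k]]


stored = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.9, 0.1, 0.0],
}
print(top_k([1.0, 0.0, 0.0], stored, k=2))
```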

We're now ready to embed and index all of our data! We do this by looping through our dataset, then embedding and inserting everything in batches.

**NOTE**: *ensure that chunks are strings and that they are correctly assigned to the metadata (use the `.page_content` attribute of each `Document`)*

In [28]:
import pandas as pd
from tqdm.auto import tqdm  # for progress bar

data = pd.DataFrame(dataset) # this makes it easier to iterate over the dataset

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate unique ids for each chunk
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    # get the raw text to embed (page_content, not the Document repr)
    texts = [x['chunk'].page_content for _, x in batch.iterrows()]

    # embed text
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'].page_content,
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))

  0%|          | 0/2 [00:00<?, ?it/s]

We can check that the vector index has been populated using `describe_index_stats` like before:

In [29]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 166}},
 'total_vector_count': 166}
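
The batching loop above can be sketched without pandas or network calls. In this illustration, `build_batches` and `fake_embed` are hypothetical names, the `chunk-N` id scheme is an assumption rather than the notebook's `doi`-based ids, and the embedding function is a stand-in that returns a one-number vector.

```python
def build_batches(chunks, embed_fn, batch_size=100):
    """Yield lists of (id, vector, metadata) tuples, one list per batch."""
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        ids = [f"chunk-{start + offset}" for offset in range(len(batch))]
        vectors = embed_fn(batch)
        metadata = [{"text": text} for text in batch]
        yield list(zip(ids, vectors, metadata))


def fake_embed(texts):
    """Stand-in for an embedding model: one float per text."""
    return [[float(len(t))] for t in texts]


chunks = [f"text {i}" for i in range(166)]
batches = list(build_batches(chunks, fake_embed))
print([len(b) for b in batches])
```

With 166 chunks and a batch size of 100, this produces two batches, which matches the two iterations shown by the progress bar.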

#### Retrieval Augmented Generation

We've built a fully-fledged knowledge base. Now it's time to connect that knowledge base to our chatbot. To do that we'll be diving back into LangChain and reusing our template prompt from earlier.

To use LangChain here we need to load the LangChain abstraction for a vector index, called a `vectorstore`. We pass in our vector `index` to initialize the object.

In [30]:
from langchain.vectorstores import Pinecone

text_field = "text"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)



Using this `vectorstore` we can already query the index and see if we have any relevant information given our question about Jorge's prior deep learning homeworks.

In [31]:
query = "Did Jorge Gosalvez code in python?"

vectorstore.similarity_search(query, k=3)

[Document(page_content='Homework 07: NLPSJSU MSDS 255 DL, Spring 2024\ue313\nGit: https:/ /github.com/jr gosalv ez/data255_DL\nimport\xa0numpy\xa0as\xa0np\nimport\xa0pandas\xa0 as\xa0pd\nfrom\xa0scipy\xa0import\xa0spatial\nfrom\xa0sklearn.metrics.pairwise\xa0 import\xa0cosine_similarity\nfrom\xa0sklearn.model_selection\xa0 import\xa0train_test_split\nfrom\xa0sklearn.naive_bayes\xa0 import\xa0MultinomialNB\nfrom\xa0sklearn.feature_extraction.text\xa0 import\xa0CountVectorizer\nfrom\xa0sklearn.metrics\xa0 import\xa0accuracy_score ,\xa0confusion_matrix ,\xa0classification_report\nimport\xa0scikitplot\xa0 as\xa0skplt\nimport\xa0matplotlib.pyplot\xa0 as\xa0plt\nfrom\xa0gensim.models\xa0 import\xa0KeyedVectors\nimport\xa0re\nimport\xa0nltk\nfrom\xa0nltk.corpus\xa0 import\xa0stopwords\nStep 1: Load the Wikipedia GLoVE W ord2Vec (glo ve.6B.50d.txt) \ue313\nDownload GLoV e pretrained models: https:/ /nlp.stanfor d.edu/pr ojects/glo ve/\nembeddings_dict\xa0=\xa0 {}\nwith\xa0open("glove.6B.50d.tx

We return a lot of text here and it's not that clear what we need or what is relevant. Fortunately, our LLM will be able to parse this information much faster than us. All we need is to connect the output from our `vectorstore` to our `chat` chatbot. To do that we can use the same logic as we used earlier.

In [32]:
def augment_prompt(query: str):
    # get top 3 results from knowledge base
    results = vectorstore.similarity_search(query, k=3)
    # get the text from the results
    source_knowledge = "\n".join([x.page_content for x in results])
    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {source_knowledge}

    Query: {query}"""
    return augmented_prompt

Using this we produce an augmented prompt:

In [33]:
print(augment_prompt(query))

Using the contexts below, answer the query.

    Contexts:
    Homework 07: NLPSJSU MSDS 255 DL, Spring 2024
Git: https:/ /github.com/jr gosalv ez/data255_DL
import numpy as np
import pandas  as pd
from scipy import spatial
from sklearn.metrics.pairwise  import cosine_similarity
from sklearn.model_selection  import train_test_split
from sklearn.naive_bayes  import MultinomialNB
from sklearn.feature_extraction.text  import CountVectorizer
from sklearn.metrics  import accuracy_score , confusion_matrix , classification_report
import scikitplot  as skplt
import matplotlib.pyplot  as plt
from gensim.models  import KeyedVectors
import re
import nltk
from nltk.corpus  import stopwords
Step 1: Load the Wikipedia GLoVE W ord2Vec (glo ve.6B.50d.txt) 
Download GLoV e pretrained models: https:/ /nlp.stanfor d.edu/pr ojects/glo ve/
embeddings_dict =  {}
with open("glove.6B.50d.txt" , 'r', encoding= "utf-8") as f:
    for line in f:
        values = line.split ()
        word   = values [0]
      

There is still a lot of text here, so let's pass it onto our chat model to see how it performs.

In [34]:
# create a new user prompt
prompt = HumanMessage(
    content=augment_prompt(query)
)
# add to messages
messages.append(prompt)

res = chat(messages)

print(res.content)

Based on the provided contexts, the code snippets and references indicate that Jorge Gosalvez worked on coding tasks in Python. The code snippets include Python libraries such as numpy, pandas, scikit-learn, gensim, and nltk, which are commonly used in Python programming for data analysis, machine learning, and natural language processing tasks. Additionally, the references to evaluating the performance of a GAN model in a Jupyter notebook on Google Colab further suggest that Python was used for coding purposes.

Therefore, based on the information available, it can be inferred that Jorge Gosalvez coded in Python for the tasks related to the NLPSJSU MSDS 255 DL course and the evaluation of the GAN model.
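
One detail of the augmented prompt printed earlier: `Contexts:` appears indented by four spaces, because the triple-quoted f-string inside `augment_prompt` keeps the indentation of the function body. A minimal sketch of one way to avoid that is to assemble the prompt from a list of lines. Here `build_augmented_prompt` is a hypothetical name, and the retrieved texts are passed in directly instead of coming from `vectorstore`.

```python
def build_augmented_prompt(query, contexts):
    """Combine retrieved context strings and the query into one prompt."""
    source_knowledge = "\n".join(contexts)
    return "\n".join([
        "Using the contexts below, answer the query.",
        "",
        "Contexts:",
        source_knowledge,
        "",
        f"Query: {query}",
    ])


print(build_augmented_prompt(
    "Did Jorge Gosalvez code in python?",
    ["import numpy as np", "import pandas as pd"],
))
```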


We can continue with more questions about Jorge's prior deep learning homeworks. Let's try _without_ RAG first:

In [35]:
prompt = HumanMessage(
    content="what model did Jorge Gosalvez code?"
)

res = chat(messages + [prompt])
print(res.content)

Jorge Gosalvez coded a GAN (Generative Adversarial Network) model in Python. The GAN model consists of a Generator (G.eval()) and a Discriminator (D.eval()), both of which are defined using a Sequential neural network architecture in PyTorch. The Generator comprises linear layers with ReLU activation functions and a final Tanh activation function, while the Discriminator also includes linear layers with ReLU activation functions.


The chatbot is able to respond about Jorge's prior deep learning homeworks thanks to its conversational history stored in `messages`.

In [36]:
prompt = HumanMessage(
    content=augment_prompt(
        "did he code other models?"
    )
)

res = chat(messages + [prompt])
print(res.content)

Based on the provided contexts, Jorge Gosalvez appears to have coded custom models using PyTorch for deep learning tasks such as transfer learning and object detection. Specifically, he implemented a custom model class called "CustomModel" which includes convolutional and pooling layers, as well as fully connected layers for classification tasks. Additionally, he worked on projects involving YOLOv8 for object detection on videos.

Therefore, based on the information provided, it seems that Jorge Gosalvez has coded custom models using PyTorch for various deep learning applications, including transfer learning and object detection tasks.


In [37]:
prompt = HumanMessage(
    content=augment_prompt(
        "Show the code that Jorge Gosalvez used to optimze his GAN."
    )
)

res = chat(messages + [prompt])
print(res.content)

Based on the provided context, the code snippet used by Jorge Gosalvez to optimize his GAN (Generative Adversarial Network) model is as follows:

```python
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define optimizer for the GAN model
G_optimizer = optim.Adam(G.parameters(), lr=learning_rate, betas=(0.5, 0.999))
D_optimizer = optim.Adam(D.parameters(), lr=learning_rate, betas=(0.5, 0.999))

# Training loop
for epoch in range(num_epochs):
    for i, (real_images, _) in enumerate(data_loader):
        real_images = real_images.to(device)
        batch_size = real_images.size(0)
        
        # Train Discriminator
        D.zero_grad()
        real_outputs = D(real_images)
        real_loss = criterion(real_outputs, torch.ones(batch_size, 1).to(device))
        
        fake_images = G(generate_noise(batch_size)).detach()
        fake_outputs = D(fake_images)
        fake_loss = criterion(fake_outputs, torch.zeros(batch_size, 1).to(de
```

In [38]:
prompt = HumanMessage(
    content=augment_prompt(
        "Did Jorge code a transformer model?"
    )
)

res = chat(messages + [prompt])
print(res.content)

Yes, based on the provided contexts, Jorge Gosalvez did code a Transformer model. The context mentions that Jorge created a Seq2Seq network using Transformer for German to English translation in Pytorch. Additionally, the context includes references to Jorge's GitHub repository where the code for the Transformer model implementation can be found.


In [39]:
prompt = HumanMessage(
    content=augment_prompt(
        "How many pre-trained models did Jorge use for tranfer learning?"
    )
)

res = chat(messages + [prompt])
print(res.content)

Jorge used two pre-trained models for transfer learning. The first pre-trained model is "models.inception_v3" which was loaded and adjusted to match the original model's output units. The second pre-trained model is "models.resnet18" which was also loaded and adjusted to match the desired output units. By using these pre-trained models and adjusting them accordingly, Jorge performed transfer learning for his task.


In [40]:
prompt = HumanMessage(
    content=augment_prompt(
        "Show the roboflow code snippet Jorge wrote to predict what objects existed in the boston_dog.jpeg image."
    )
)

res = chat(messages + [prompt])
print(res.content)

Based on the provided context, the code snippet that Jorge wrote using Roboflow to predict what objects existed in the "boston_dog.jpeg" image is as follows:

```python
!pip install roboflow
from roboflow import Roboflow

This image most likely belongs to frog with a 57.76% confidence.
This image most likely belongs to airplane with a 57.76% confidence.
This image most likely belongs to bird with a 57.76% confidence.
This image most likely belongs to airplane with a 57.76% confidence.
```

In this code snippet, Jorge is using Roboflow to predict the objects present in the "boston_dog.jpeg" image, with corresponding confidence levels for each predicted object.


In [41]:
prompt = HumanMessage(
    content=augment_prompt(
        "How did Jorge split the CNN dataset into training and validation sets?"
    )
)

res = chat(messages + [prompt])
print(res.content)

Based on the provided context, Jorge split the CNN dataset into training and validation sets using the following approach:

Jorge used the `Subset` class from PyTorch to split the dataset into training and validation sets. By creating a custom dataset with at least 100 images for each of the 3 categories from CIFAR-10, Jorge then utilized the `Subset` class to separate the dataset into training and validation subsets. This method allows for the creation of custom training and validation sets with specific criteria, such as the number of images per category.

If you need more detailed information or code snippets related to Jorge's specific implementation for splitting the dataset, please let me know.


With retrieval, the responses draw on details from the homework PDFs that the model could not have known from its parametric knowledge, such as the use of PyTorch's `Subset` class and the three CIFAR-10 categories.

**Observations and Limitations:**
* although RAG produced better-informed answers from text content, the LLM could not reliably return code examples from the embedded chunks
* notebooks exported to PDF include special characters (such as non-breaking spaces); removing these will likely improve the quality of responses
* the chunking format ensures data loading and ingestion occur properly
* appending prompts and responses to `messages` expands the context and enables the chatbot to 'converse'
* configuring system (assistant) parameters affects LLM results
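
As a hedged sketch of the clean-up suggested in the observations, the helper below replaces non-breaking spaces (`\xa0`) with ordinary spaces and drops private-use characters such as `\ue313`, both of which appear in the extracted chunks. The name `clean_pdf_text` is hypothetical, and this step was not part of the run shown in this notebook.

```python
import unicodedata


def clean_pdf_text(text):
    """Normalize PDF-extraction debris while keeping line breaks and indentation."""
    # Non-breaking spaces become regular spaces.
    text = text.replace("\xa0", " ")
    # Drop private-use code points (Unicode category "Co"), e.g. icon glyphs.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Co")


print(clean_pdf_text("import\xa0numpy\xa0as\xa0np\ue313"))
```

This could be applied to each chunk's `page_content` before embedding, so that the stored text and the embedded text stay identical.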

Delete the index to free resources and avoid being charged for an unused index:

In [42]:
pc.delete_index(index_name)

---