# PA06

# Author: Chandan Patel 

This notebook demonstrates the use of LangChain:
1. Calling an LLM to get an answer
2. Calling a chat model (ChatGPT) to get an answer
3. Using templates to manage the creation of prompts more easily
4. Using a template to create a prompt for a chat model (ChatGPT)
5. Chains
   1. Single LLM and prompt
   2. Single LLM (ChatGPT) and prompt
   3. Multiple LLMs and prompts
6. Create a vector database using Pinecone
   1. Chunk text
   2. Create embeddings
   3. Create a Pinecone index
   4. Query the index for similar text
7. Agents
   1. Use the PythonREPL agent to create and execute Python code to find an answer (notice that the agent has 'memory' of previous answers).

### Import necessary libraries

In [1]:
import openai
import langchain
from dotenv import load_dotenv,find_dotenv

from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain import PromptTemplate
from langchain.chains import LLMChain
from langchain.chains import SimpleSequentialChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.agents.agent_toolkits import create_python_agent
from langchain.tools.python.tool import PythonREPLTool
from langchain.python import PythonREPL



from tqdm.notebook import tqdm # need this to stop warning

import os
import pinecone 
from langchain.vectorstores import Pinecone

from pprint import pprint

# load environment variables
load_dotenv(find_dotenv())




True

First, we use the LangChain LLM wrapper to send a prompt to the OpenAI API. We ask it to "explain large language models in one sentence".

In [2]:
# Run basic query with OpenAI wrapper
llm = OpenAI(model_name="text-davinci-003")
pprint(llm("explain large language models in one sentence"))

('\n'
 '\n'
 'Large language models are deep neural networks that are trained on large '
 'amounts of text data and are used to generate predictions for natural '
 'language processing tasks.')


## What's the difference between "text-davinci-003" and "gpt-3.5-turbo"?

| Feature | DaVinci | Turbo |
|---------|---------|-------|
| Model Size | 175B parameters | Not publicly disclosed by OpenAI |
| Performance | Higher accuracy and ability to generate longer texts | Lower accuracy and limited ability to generate longer texts |
| Cost | More expensive | Less expensive |
| Use Cases | Better suited for complex language tasks such as content creation and language translation | Better suited for simple language tasks such as chatbots and basic language processing |
| Availability | Only available through OpenAI’s API access program | Available through OpenAI’s API access program and some third-party tools |
| Training Data | Trained on a larger and more diverse dataset | Trained on a smaller dataset |


Here we use GPT-3.5 Turbo because it is cheaper and faster. GPT-3.5 Turbo is also well suited to simple language tasks such as chatbots and basic language processing.

In [3]:
# import schema for chat messages and ChatOpenAI in order to query chatmodels GPT-3.5-turbo or GPT-4
from langchain.schema import AIMessage, HumanMessage, SystemMessage

The primary interface through which end users interact with LLMs is a chat interface. For this reason, some model providers expose an API that expects chat messages rather than a single text prompt. Each message has a content field (usually text) and is associated with a role. The supported roles are currently System, Human, and AI.

**SystemMessage:**
A chat message carrying instructions to the AI system. NOTE: The same effect can be achieved by including this information in the HumanMessage, for instance "Act like you are Albert Einstein and never leave the role".

**HumanMessage:**
A chat message representing information coming from a human interacting with the AI system. Typically this is text in the form of a question and/or instructions, but it can also be other information such as images or videos.

**AIMessage:**
A chat message representing information coming from the AI system.
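As a rough illustration of how these message types relate to the role-based format used by chat APIs, the sketch below maps each type to a role string. The `ROLE_BY_MESSAGE_TYPE` table and the `to_openai_messages` helper are hypothetical names introduced only for this example, and the messages are represented as plain tuples rather than LangChain objects; LangChain handles this conversion itself.

```python
# Hypothetical mapping from LangChain message class names to chat API roles.
ROLE_BY_MESSAGE_TYPE = {
    "SystemMessage": "system",
    "HumanMessage": "user",
    "AIMessage": "assistant",
}


def to_openai_messages(messages):
    """Convert (message_type, content) pairs into role/content dictionaries."""
    return [
        {"role": ROLE_BY_MESSAGE_TYPE[message_type], "content": content}
        for message_type, content in messages
    ]


payload = to_openai_messages([
    ("SystemMessage", "You are an expert data scientist"),
    ("HumanMessage", "Explain autoencoders in one sentence"),
])
print(payload)
```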

In [4]:
chat = ChatOpenAI(model_name="gpt-3.5-turbo",temperature=0.3)
messages = [
    SystemMessage(content="You are an expert data scientist"),
    HumanMessage(content="Write a Python script that trains a neural network on simulated data ")
]
response = chat(messages) # note that this returns back an AIMessage
response

AIMessage(content="Sure! Here's an example of a Python script that trains a neural network on simulated data using the Keras library:\n\n```python\nimport numpy as np\nfrom keras.models import Sequential\nfrom keras.layers import Dense\n\n# Generate simulated data\nnp.random.seed(0)\nX = np.random.rand(1000, 10)\ny = np.random.randint(2, size=(1000, 1))\n\n# Define the neural network model\nmodel = Sequential()\nmodel.add(Dense(32, input_dim=10, activation='relu'))\nmodel.add(Dense(1, activation='sigmoid'))\n\n# Compile the model\nmodel.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n\n# Train the model\nmodel.fit(X, y, epochs=10, batch_size=32)\n\n# Evaluate the model\nloss, accuracy = model.evaluate(X, y)\nprint(f'Loss: {loss}, Accuracy: {accuracy}')\n```\n\nIn this script, we first generate simulated data using `numpy`. We then define a simple neural network model using the `Sequential` class from Keras. The model consists of two dense layers, with the f

In [5]:
# here we extract the 'content' from the AIMessage object
pprint(response.content, indent=2)

("Sure! Here's an example of a Python script that trains a neural network on "
 'simulated data using the Keras library:\n'
 '\n'
 '```python\n'
 'import numpy as np\n'
 'from keras.models import Sequential\n'
 'from keras.layers import Dense\n'
 '\n'
 '# Generate simulated data\n'
 'np.random.seed(0)\n'
 'X = np.random.rand(1000, 10)\n'
 'y = np.random.randint(2, size=(1000, 1))\n'
 '\n'
 '# Define the neural network model\n'
 'model = Sequential()\n'
 "model.add(Dense(32, input_dim=10, activation='relu'))\n"
 "model.add(Dense(1, activation='sigmoid'))\n"
 '\n'
 '# Compile the model\n'
 "model.compile(loss='binary_crossentropy', optimizer='adam', "
 "metrics=['accuracy'])\n"
 '\n'
 '# Train the model\n'
 'model.fit(X, y, epochs=10, batch_size=32)\n'
 '\n'
 '# Evaluate the model\n'
 'loss, accuracy = model.evaluate(X, y)\n'
 "print(f'Loss: {loss}, Accuracy: {accuracy}')\n"
 '```\n'
 '\n'
 'In this script, we first generate simulated data using `numpy`. We then '
 'define a simple neura

### Templating

The templating system lets you define a reusable prompt with placeholders. The template is a string in which named variables in curly braces, such as `{concept}`, mark where values are filled in before the prompt is sent to the model. See https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/ for more information.
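Conceptually, filling in a template works much like Python's built-in string formatting. The sketch below is a plain-Python analogue, not LangChain's implementation; `format_prompt` is a hypothetical helper introduced for illustration.

```python
# Plain-Python analogue of a prompt template: placeholders in curly braces
# are replaced by keyword arguments before the prompt is sent to a model.
template = (
    "You are an expert data scientist with an expertise in building deep learning models.\n"
    "Explain the concept of {concept} in a couple of lines"
)


def format_prompt(template: str, **variables: str) -> str:
    """Fill the named placeholders in `template` (hypothetical helper)."""
    return template.format(**variables)


prompt_text = format_prompt(template, concept="autoencoder")
print(prompt_text)
```

A template object adds little beyond this for a single variable, but it keeps the list of expected variables explicit, which helps when prompts are reused across chains.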

In [6]:
template = """
You are an expert data scientist with an expertise in building deep learning models. 
Explain the concept of {concept} in a couple of lines
"""

prompt = PromptTemplate(
    input_variables=["concept"],
    template=template,
)

In [7]:
# Run LLM with PromptTemplate
llm(prompt.format(concept="autoencoder"))

'\nAn autoencoder is an unsupervised deep learning model that learns to reconstruct its input. It consists of two main components: an encoder that compresses the input data into a latent representation, and a decoder that reconstructs the input data from the latent representation. Autoencoders are mainly used for dimensionality reduction, denoising, and feature learning.'

### Chains

In LangChain, a chain links components together, most simply an LLM with a prompt template. Chains can also be composed so that the output of one LLM call is used as the input to the next. This is useful for a number of reasons. For instance, it allows you to use a more general LLM to generate a prompt for a more specific LLM. It also allows you to combine several LLM calls to produce a single answer.
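The idea of passing one step's output to the next can be sketched without any model at all. In the example below, `fake_llm`, `make_chain`, and `run_sequential` are hypothetical names, and `fake_llm` only echoes its prompt, so the sketch shows the data flow rather than real model behavior.

```python
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; it only echoes the prompt it received."""
    return f"<answer to: {prompt}>"


def make_chain(template: str) -> Callable[[str], str]:
    """Build a step that fills the single {input} placeholder, then calls the model."""
    def run(value: str) -> str:
        return fake_llm(template.format(input=value))
    return run


def run_sequential(chains: List[Callable[[str], str]], first_input: str) -> str:
    """Feed the output of each step into the next one."""
    value = first_input
    for chain in chains:
        value = chain(value)
    return value


explain = make_chain("Explain the concept of {input} in a couple of lines")
simplify = make_chain("Explain this to me like I'm five: {input}")
result = run_sequential([explain, simplify], "autoencoder")
print(result)
```

Because each step takes one string and returns one string, steps can be reordered or extended without changing the runner.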

In [8]:
# Define chain with language model and prompt as arguments.

chain = LLMChain(llm=llm, prompt=prompt) # one llm and one prompt

# Run the chain only specifying the input variable.
pprint(chain.run("autoencoder")) # we save a few lines of code by taking this approach instead of the previous one

('\n'
 'An autoencoder is a type of artificial neural network used to learn '
 'efficient data codings in an unsupervised manner by learning to reconstruct '
 'its own inputs. It consists of two parts: an encoder which maps the input '
 'data into a lower-dimensional representation, and a decoder which '
 'reconstructs the original input from the compressed representation.')


In [9]:
# Define a second prompt 

second_prompt = PromptTemplate(
    input_variables=["ml_concept"],
    template="Use the concept description of {ml_concept} and explain it to me like I'm five. Use about 500 words to describe this.",
)
chain_two = LLMChain(llm=llm, prompt=second_prompt) # create another chain with a new prompt

# Run the chain only specifying the input variable.
pprint(chain_two.run("autoencoder")) # we save a few lines of code by taking this approach instead of the previous one

('\n'
 '\n'
 'An autoencoder is a type of artificial intelligence program that can learn '
 "to copy its own input. It's like a copy machine that can take a picture of "
 "something and then make a copy of it. But it's not just a copy machine. It's "
 'like a smart copy machine that can make a better copy of the original.\n'
 '\n'
 'The autoencoder works by taking the input it is given and breaking it down '
 'into smaller parts. These parts are called “features." The autoencoder looks '
 'at each feature and tries to figure out how to best represent it in a '
 'simpler form. This simpler form is called an “encoding”. This encoding is '
 'like a language that the autoencoder can understand.\n'
 '\n'
 'Once the autoencoder has made a simplified version of the input, it uses '
 'that encoding to make a new version of the input. This new version is called '
 'the “reconstruction”. This reconstruction is like the autoencoder’s best '
 'guess at what the original input was.\n'
 '\n'
 'The a

In [10]:
# Define a sequential chain using the two chains above: the second chain takes the output of the first chain as input

overall_chain = SimpleSequentialChain(chains=[chain, chain_two], verbose=True) # previously we only used llm chain, now we use two chains

# Run the chain specifying only the input variable for the first chain.
explanation = overall_chain.run("autoencoder")
pprint(explanation)



> Entering new SimpleSequentialChain chain...

Autoencoders are neural networks that are trained to encode data into a lower-dimensional representation and then reconstruct the original input data from the encoded representation. Autoencoders can be used for dimensionality reduction, anomaly detection, and data denoising.


Autoencoders are like robots that help you learn things. They help you learn things by taking a lot of complex information and breaking it down into simpler pieces. It's like taking a puzzle and breaking it into smaller pieces that are easier to understand.

Autoencoders can help you reduce the amount of data you have to work with, which can make it easier to analyze and understand. For example, if you had a really big dataset of numbers and wanted to understand it better, an autoencoder could help you break it down into smaller, more manageable pieces. 

Autoencoders can also help you identify anomalies in your data, which me

## Create a Knowledge Base using Pinecone and LangChain

We will use Pinecone to create a vector database and LangChain to split the text, embed it, and load it into the index. The resulting knowledge base can then be queried for text that is similar to a question.

In [11]:
# Split the explanation given above into document chunks, using the text splitter imported earlier
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,   # maximum number of characters per chunk
    chunk_overlap=0,  # no characters shared between consecutive chunks
)

texts = text_splitter.create_documents([explanation])
print(texts)


[Document(page_content='Autoencoders are like robots that help you learn things. They help you learn things by taking a lot', metadata={}), Document(page_content="of complex information and breaking it down into simpler pieces. It's like taking a puzzle and", metadata={}), Document(page_content='breaking it into smaller pieces that are easier to understand.', metadata={}), Document(page_content='Autoencoders can help you reduce the amount of data you have to work with, which can make it easier', metadata={}), Document(page_content='to analyze and understand. For example, if you had a really big dataset of numbers and wanted to', metadata={}), Document(page_content='understand it better, an autoencoder could help you break it down into smaller, more manageable', metadata={}), Document(page_content='pieces.', metadata={}), Document(page_content='Autoencoders can also help you identify anomalies in your data, which means they can help you', metadata={}), Document(page_content='identify da

In [12]:
# Individual text chunks can be accessed with "page_content"

pprint(texts[0].page_content)

('Autoencoders are like robots that help you learn things. They help you learn '
 'things by taking a lot')
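To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts text into fixed-width character windows. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraph breaks and spaces, which is why the chunks above end on word boundaries. The `split_fixed` function is a hypothetical helper.

```python
from typing import List


def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


chunks = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

A non-zero overlap repeats the tail of one chunk at the head of the next, which can help a later similarity search when a relevant sentence straddles a chunk boundary.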


In [13]:
# Instantiate OpenAI embeddings

# Note: there are a few approaches that seem to achieve the same thing.
# It is unclear whether there is a difference between them, but the last one seems to be the most recent
# and doesn't generate any warnings.
#embeddings = OpenAIEmbeddings(model_name="ada")
#embeddings = OpenAIEmbeddings(engine="text-embedding-ada-002")
embeddings = OpenAIEmbeddings(deployment="text-embedding-ada-002")

In [14]:
# Turn the first text chunk into a vector with the embedding

query_result = embeddings.embed_query(texts[0].page_content)
print(query_result)

[-0.03675442777659658, -0.014334758655346952, 0.00732694981868598, -0.024613765584562814, -0.0024101810084608036, 0.015571431241262477, 0.003339347160298931, -0.02676796698040097, -0.0287758985693582, -0.028536541203024842, 0.011402648612893269, 0.01772563449974589, 0.005219288477283526, -0.008570271116852697, 0.0007243009011607405, -0.006273119147984753, 0.016316094463976794, 0.02511907144094377, -0.012725755389396895, -0.006060358493301414, -0.015371968011081517, 0.02922801472772964, -0.019839943622076906, -0.023603150146510377, 0.013802857018638603, -0.011562219569567088, 0.00743333014602765, -0.0445733878898064, 0.010797610210099201, -0.016515557694157756, 0.031594978640896394, -0.006176711889019869, 0.004388191529642423, -0.03773844464035374, -0.020518119722279324, -0.022885083635446078, 0.014707090266704111, -0.011422595680969469, 0.012486398954386167, -0.009122118890314616, 0.012353423778039737, 0.005338966229127576, 0.00567805381356747, -0.008284373696083638, -0.002807445290759

In [15]:
# Import and initialize Pinecone client
# You will need to create a Pinecone account and create an API key and environment variable, see www.pinecone.io

pinecone.init(
    api_key=os.getenv('PINECONE_API_KEY'),  
    environment=os.getenv('PINECONE_ENV')  
)

In [17]:
# create an index
# NOTE: The free tier allows only one index. To create a new index, delete the old one first.

pinecone.init(api_key=os.getenv('PINECONE_API_KEY'), environment=os.getenv('PINECONE_ENV'))

#pinecone.delete_index("langchain-demo-index")

pinecone.create_index("langchain-demo-index", dimension=1536) # 1536 is openai ada embedding dimension

In [18]:
# Upload vectors to Pinecone
# You first need to create an index in Pinecone, see www.pinecone.io - set it to 1536 dimensions

index_name = "langchain-demo-index"
search = Pinecone.from_documents(texts, embeddings, index_name=index_name) # load all text chunks into Pinecone with the associated embeddings

In [19]:
# Do a simple vector similarity search

query = "about an autoencoder?"
result = search.similarity_search(query)

pprint(result)

[Document(page_content='understand it better, an autoencoder could help you break it down into smaller, more manageable', metadata={}),
 Document(page_content='Autoencoders can help you reduce the amount of data you have to work with, which can make it easier', metadata={}),
 Document(page_content='one of them looks really different from the rest, an autoencoder can help you detect that.', metadata={}),
 Document(page_content='To use an autoencoder, you first need to give it', metadata={})]
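The search above ranks stored chunks by how close their embedding vectors are to the embedding of the query. A minimal sketch of that ranking step follows, assuming cosine similarity as the metric (dot product and Euclidean distance are other common choices). The three-dimensional vectors, the `toy_store` dictionary, and the `top_k` helper are made up for illustration; real ada embeddings have 1536 dimensions.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], store: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k stored items most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vector)) for name, vector in store.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy "index": chunk names mapped to made-up embedding vectors.
toy_store = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
ranked = top_k([1.0, 0.1, 0.0], toy_store, k=2)
print(ranked)
```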


## Agents

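Agents use an LLM to decide which tools to call and in what order. Here the agent creates and executes Python code to find an answer. Notice that the agent appears to have 'memory' of previous answers: the dataframe loaded in the first call is still available in later calls, most likely because the Python REPL tool keeps its variables between calls. Agents in LangChain can also be given explicit memory, whereas a zero-shot agent on its own holds no memory of earlier interactions. For more about memory, see: https://python.langchain.com/docs/modules/memory/agent_with_memory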

> NOTE: See Memory Store backed memory options here: https://python.langchain.com/docs/modules/memory/types/vectorstore_retriever_memory
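One way state can carry over between agent calls is through the tool itself: if a REPL-style tool executes every snippet in the same namespace, variables defined in one call remain visible in the next. The sketch below illustrates that mechanism with the standard library only. `TinyRepl` is a hypothetical stand-in, not LangChain's `PythonREPLTool`, and it runs arbitrary code with `exec`, so it should only be fed trusted input.

```python
import contextlib
import io


class TinyRepl:
    """Hypothetical stand-in for a Python REPL tool with a persistent namespace."""

    def __init__(self) -> None:
        # Every call to run() shares this dictionary of variables.
        self.namespace: dict = {}

    def run(self, code: str) -> str:
        """Execute `code` and return whatever it printed to stdout."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


repl = TinyRepl()
repl.run("rows = [(60.0, 18.4), (85.5, 16.8)]")  # first call defines a variable
output = repl.run("print(len(rows))")             # second call can still see it
print(output)
```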


In [72]:
# Instantiate Python agent

agent_executor = create_python_agent(
    llm=OpenAI(temperature=0, max_tokens=2000),
    tool=PythonREPLTool(), # Read-Eval-Print Loop (REPL) tool
    verbose=True
)

In [73]:
# Execute the Python agent

agent_executor.run("Load a csv file found at https://raw.githubusercontent.com/patelchandan0912/Data-Science-Programming/master/DSP-3-%20SVM%2C%20Logistic%20Regression/RidingMowers.csv into a dataframe, and tell me the number of rows and columns in the dataframe")



> Entering new AgentExecutor chain...
 I need to load the csv file into a dataframe and then get the number of rows and columns
Action: Python_REPL
Action Input: import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/patelchandan0912/Data-Science-Programming/master/DSP-3-%20SVM%2C%20Logistic%20Regression/RidingMowers.csv')
Observation: 
Thought: I now need to get the number of rows and columns
Action: Python_REPL
Action Input: print(df.shape)
Observation: (24, 3)
Thought: I now know the final answer
Final Answer: The dataframe has 24 rows and 3 columns.

> Finished chain.


'The dataframe has 24 rows and 3 columns.'

In [74]:
agent_executor.run("Describe the dataframe")



> Entering new AgentExecutor chain...
 I need to look at the dataframe
Action: Python_REPL
Action Input: print(df)
Observation:
    Income  Lot_Size Ownership
0     60.0      18.4     Owner
1     85.5      16.8     Owner
2     64.8      21.6     Owner
3     61.5      20.8     Owner
4     87.0      23.6     Owner
5    110.1      19.2     Owner
6    108.0      17.6     Owner
7     82.8      22.4     Owner
8     69.0      20.0     Owner
9     93.0      20.8     Owner
10    51.0      22.0     Owner
11    81.0      20.0     Owner
12    75.0      19.6  Nonowner
13    52.8      20.8  Nonowner
14    64.8      17.2  Nonowner
15    43.2      20.4  Nonowner
16    84.0      17.6  Nonowner
17    49.2      17.6  Nonowner
18    59.4      16.0  Nonowner
19    66.0      18.4  Nonowner
20    47.4      16.4  Nonowner
21    33.0      18.8  Nonowner
22    51.0      14.0  Nonowner
23    63.0      14.8  Nonowner
Thought: I now know the final answer
Fi

'The dataframe contains columns for Income, Lot_Size, and Ownership, with 24 rows of data.'

In [78]:
agent_executor.run(r"Conduct a train-test split of the dataframe using 80% for training and 20% for testing. Set 'Ownership' column as the target and then fit a decision tree model to the training data and report the accuracy, precision and recall on the test data")



> Entering new AgentExecutor chain...
 I need to split the dataframe, set the target, fit the model, and then report the accuracy, precision, and recall
Action: Python_REPL
Action Input: 
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = df.drop('Ownership', axis=1)
y = df['Ownership']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
Observation:
Thought: I now need to report the accuracy, precision, and recall
Action: Python_REPL
Action Input: 
from sklearn.metrics import accuracy_score, precision_score, recall_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
Observati

'The accuracy, precision, and recall of the decision tree model are 0.4, 0.5, and 0.3333333333333333 respectively.'
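The reported numbers can be sanity-checked with a little arithmetic. The sketch below assumes the 20% test split of 24 rows is rounded up to 5 rows, and uses hypothetical confusion-matrix counts that happen to reproduce the reported metrics. These counts are not recovered from the actual run, and the trace does not show which class was treated as positive.

```python
from math import ceil

n_rows = 24
# Assumption: the test share is rounded up, giving 5 test rows out of 24.
n_test = ceil(0.2 * n_rows)

# Hypothetical counts: true positives, false positives, false negatives, true negatives.
tp, fp, fn, tn = 1, 1, 2, 1

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found

print(f"Test rows: {n_test}")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
```

With so few test rows, changing a single prediction moves accuracy by 0.2, so these metrics are best read as a demonstration of the workflow rather than as an evaluation of the model.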

# Conclusion

In the example above, we used the Python REPL agent to read a CSV file stored in my GitHub repository into a dataframe. The agent inspected the dataframe and answered questions about it. The same approach can be applied to other dataframe tasks, including data cleaning, calculations, and generating new insights.

In the demonstration, the agent first described the dataframe's characteristics. We then instructed it to split the dataframe into 80% training data and 20% testing data, fit a decision tree model to the training data, and report the accuracy, precision, and recall of the model on the test data.

The agent carried out these instructions by writing the necessary Python code, running it against the dataframe, and reporting the output. With only 24 rows, the test set is very small, so the reported metrics (accuracy 0.4, precision 0.5, recall 0.33) say little about how well the model would generalize. The example nonetheless highlights the utility of agents, which combine an LLM with tools such as Google Search, a Python REPL, math calculators, weather APIs, and more.