# 03 - LangChain with vector store

In this lab, we will introduce [LangChain](https://python.langchain.com/docs/get_started/introduction), a framework for developing applications powered by language models, and ask questions about custom data using a vector store.

LangChain supports Python and JavaScript / TypeScript. For this lab, we will use Python.

## Setup

We'll use the `pip` tool to install the `langchain` Python package, the `qdrant-client` package, and a few supporting packages.

In [None]:
pip install langchain openai tiktoken qdrant-client python-dotenv --upgrade

We'll start by loading the environment variables that the `AzureOpenAI`-specific components of the `langchain` package need.
As with all the other labs, we'll need to provide our API key and endpoint details. We'll also provide the name (id) of the model deployment that we want to use.

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables
# API_KEY = "<YOUR API KEY>"
# RESOURCE_ENDPOINT = "<YOUR AZURE OPENAI ENDPOINT>" # For example https://<your azure open ai instance>.openai.azure.com/
# DEPLOYMENT_ID = "<YOUR DEPLOYMENT ID>" # For example "text-davinci-003"
load_dotenv()

# Set this to `azure`
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2023-03-15-preview"
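
The cells that follow fail with less obvious errors if a setting is missing, so it can help to check the environment right after `load_dotenv()`. The following is a minimal sketch, not part of the lab itself: `missing_settings` is a hypothetical helper, `DEPLOYMENT_ID` is the variable read later in this lab, and `OPENAI_API_KEY` and `OPENAI_API_BASE` are assumed names that you should adjust to match your `.env` file.

```python
import os

# Assumed variable names: DEPLOYMENT_ID is read later in this lab;
# OPENAI_API_KEY and OPENAI_API_BASE are assumptions about your .env file.
REQUIRED_SETTINGS = ("OPENAI_API_KEY", "OPENAI_API_BASE", "DEPLOYMENT_ID")


def missing_settings(names=REQUIRED_SETTINGS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_settings()
if missing:
    print("Missing settings: " + ", ".join(missing))
else:
    print("All required settings are present.")
```

Passing `env` explicitly makes the helper easy to try out with a plain dictionary instead of the real process environment.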

First, we will load the data from the CSV file using a `CSVLoader`.

In [None]:
from langchain.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='../../extra/data/movies/movies.csv', source_column='original_title', encoding='utf-8', csv_args={'delimiter':',', 'fieldnames': ['id', 'original_language', 'original_title', 'popularity', 'release_date', 'vote_average', 'vote_count', 'genre', 'overview', 'revenue', 'runtime', 'tagline']})
data = loader.load()
data = data[1:20] # skip row 0 (the CSV header, read as data because fieldnames is set) and keep 19 movies; widen the slice for more
print('Loaded %s movies' % len(data))
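
To make the loader's output more concrete, here is a rough standard-library sketch of how each CSV row can become one document with `key: value` lines as content and the `source_column` value as metadata. It approximates the idea and is not LangChain's actual implementation; the sample data and the `rows_to_documents` helper are hypothetical, and only a subset of the lab's column names is used.

```python
import csv
import io

# Hypothetical sample: a header line plus one movie, using a subset of
# the column names from the lab.
SAMPLE = "id,original_title,overview\n1,Alien,A crew meets a hostile life form.\n"
FIELDNAMES = ["id", "original_title", "overview"]


def rows_to_documents(text, fieldnames, source_column):
    """Turn each CSV row into a dict with page_content and metadata."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=fieldnames)
    documents = []
    for index, row in enumerate(reader):
        # One "key: value" line per column.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": row[source_column], "row": index},
            }
        )
    return documents


docs = rows_to_documents(SAMPLE, FIELDNAMES, "original_title")
# Because fieldnames is passed explicitly, the header line is read as data,
# so the first "document" is the header itself.
print(docs[0]["metadata"]["source"])
print(docs[1]["page_content"])
```

This is why the lab's slice starts at index 1: with explicit field names, the first record is the header row rather than a movie.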


We will use the OpenAI embedding model `text-embedding-ada-002` to embed the documents, and an `AzureOpenAI` LLM for completions.

In [None]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(deployment="text-embedding-ada-002", chunk_size=1) 

from langchain.llms import AzureOpenAI

llm = AzureOpenAI(
    deployment_name=os.environ["DEPLOYMENT_ID"],
    model_name="gpt-35-turbo",
)

Next, we'll run Qdrant in Docker and configure LangChain to use it as the vector store: we embed the loaded documents and store the embeddings in a collection. Depending on your rate limits, this might take a while.

```
docker run --name qdrant -p 6333:6333 -p 6334:6334 -v "$(pwd)/labs/extra/data/qdrantstorage:/qdrant/storage" qdrant/qdrant
```

In [None]:
from langchain.vectorstores import Qdrant
from qdrant_client import QdrantClient

url = "http://localhost:6333"
qdrant = Qdrant.from_documents(
    data,
    embeddings,
    url=url,
    prefer_grpc=False,
    collection_name="my_movies",
)


Now we'll test the vector store by running a similarity search.

In [None]:
vectorstore = qdrant

query = "What is the best 80s movie I should watch?"
found_docs = vectorstore.similarity_search(query)

print(found_docs[0].metadata['source'])
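
Conceptually, a similarity search embeds the query and returns the stored vectors closest to it. The sketch below illustrates that idea with cosine similarity over made-up three-dimensional vectors. The titles, the vectors, and the `top_k` helper are hypothetical, and it assumes cosine distance is the metric configured for the collection. Real embeddings have many more dimensions, and Qdrant performs the search server-side.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, vectors, k=4):
    """Return the names of the k vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in vectors.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Made-up "embeddings" purely for illustration.
TOY_VECTORS = {
    "Alien": [0.9, 0.1, 0.0],
    "Heat": [0.1, 0.9, 0.1],
    "Apollo 13": [0.8, 0.2, 0.1],
}

print(top_k([1.0, 0.0, 0.0], TOY_VECTORS, k=2))
```

With these toy vectors, the two entries pointing in roughly the same direction as the query rank first.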

Another option is a maximal marginal relevance (MMR) search, which also finds similar movies but returns more diverse results.

In [None]:
retriever = vectorstore.as_retriever(search_type="mmr")

query = "Which movies are about space travel?"
print(retriever.get_relevant_documents(query)[0].metadata['source'])
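
MMR picks results greedily: each step chooses the candidate that is relevant to the query but not too similar to what has already been selected. The following is a minimal sketch of that selection rule with made-up two-dimensional vectors. The titles and the `mmr` helper are hypothetical, and the `lambda_mult=0.5` default mirrors what LangChain is understood to use, which you should treat as an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query_vector, candidates, k=2, lambda_mult=0.5):
    """Greedy maximal marginal relevance over a name -> vector mapping."""
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < k:
        best_name, best_score = None, -math.inf
        for name, vector in remaining.items():
            relevance = cosine_similarity(query_vector, vector)
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (cosine_similarity(vector, candidates[s]) for s in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
        del remaining[best_name]
    return selected


# Made-up vectors: the first two are near-duplicates of each other.
TOY_VECTORS = {
    "Space Epic": [1.0, 0.8],
    "Space Epic II": [1.0, 0.78],
    "Moon Documentary": [0.6, 1.0],
}

print(mmr([1.0, 1.0], TOY_VECTORS, k=2))
print(mmr([1.0, 1.0], TOY_VECTORS, k=2, lambda_mult=1.0))
```

With `lambda_mult=1.0` the redundancy penalty disappears and the selection reduces to a plain similarity ranking. Lower values trade relevance for diversity.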

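Next, we build an index from the loader with `VectorstoreIndexCreator`. Note that this loads and embeds the full CSV file, not the reduced `data` subset from above.

In [None]: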
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

index_creator = VectorstoreIndexCreator(embedding=embeddings)
docsearch = index_creator.from_loaders([loader])

Now we'll use a QA chain to ask questions about the movies.

In [None]:
openai = OpenAI(deployment_id=os.environ["DEPLOYMENT_ID"])
chain = RetrievalQA.from_chain_type(llm=openai, chain_type="stuff", retriever=docsearch.vectorstore.as_retriever(), input_key="question", return_source_documents=True)
query = "Do you have a column called popularity?"
response = chain({"question": query})
print(response['result'])
print(response['source_documents'])

query = "What is the movie with the highest popularity?"
response = chain({"question": query})
print(response['result'])
print(response['source_documents'])
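
The `"stuff"` chain type places all retrieved documents into a single prompt together with the question. The sketch below shows the general shape of that step. The prompt wording, the sample documents, and the `build_stuff_prompt` helper are illustrative assumptions, not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved documents into one prompt string."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved documents in the "key: value" row format.
retrieved = [
    "original_title: Alien\npopularity: 50.1",
    "original_title: Heat\npopularity: 30.4",
]

prompt = build_stuff_prompt(
    "What is the movie with the highest popularity?", retrieved
)
print(prompt)
```

Because the model only sees the handful of documents that the retriever returns, questions that need the whole dataset (such as finding the overall maximum of a column) may be answered from an incomplete view, which is why inspecting `source_documents` is worthwhile.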

Reconnect to the existing `my_movies` collection in the running Qdrant instance, without embedding the documents again, and run a similarity search against it.

In [None]:
#del vectorstore

from langchain.vectorstores import Qdrant
from qdrant_client import QdrantClient

from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(deployment="text-embedding-ada-002", chunk_size=1) 

client = QdrantClient(url="http://localhost:6333", prefer_grpc=False)
qdrant = Qdrant(client=client, collection_name="my_movies", embeddings=embeddings)

query = "What is the best movie about space travel?"
found_docs = qdrant.similarity_search(query)

print(found_docs[0].metadata['source'])

Now let's create a retriever on top of the reloaded vector store and query it.

In [None]:
retriever = qdrant.as_retriever(search_type="mmr")

query = "What is the best movie about space travel?"
print(retriever.get_relevant_documents(query)[0].metadata['source'])
