##### Importing the GooglePalm LLM from LangChain Community and passing our own temporary API key to it.
##### Note: Temperature is an additional parameter that ranges from 0 to 1. Values closer to 1 make the responses more creative.

In [2]:
from langchain.llms import GooglePalm

api_key = "YOUR API KEY"

llm = GooglePalm(google_api_key=api_key, temperature=0.9)
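
To make the effect of temperature concrete, here is a minimal sketch of the usual temperature-scaled softmax. It is an illustration under assumptions, not GooglePalm's actual sampling code: the function name `softmax_with_temperature` and the three logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
low = softmax_with_temperature(logits, 0.1)   # sharply peaked on the top token
high = softmax_with_temperature(logits, 0.9)  # flatter, so sampling varies more
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

A low temperature concentrates almost all probability on the highest-scoring token, while a value near 1 leaves more probability on the alternatives, which is why the output reads as more creative.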

##### Testing the GooglePalm model and its creativity

In [11]:
description = llm("Write a 4 line description of Generative AI and LLM.")
print(description)

Generative AI (also known as LLM) uses AI to create human-like text, images, and other content. LLMs are trained on massive datasets of text and code, and can learn to generate new content that is indistinguishable from human-generated content. They have the potential to revolutionize the way we create and consume content, but there are also concerns about their potential for abuse.


##### Importing the CSV dataset file and cleaning it with the pandas library. Stripping whitespace from the column names is essential so that the LangChain CSV loader can find the source column by name.

In [4]:
import pandas as pd

df = pd.read_csv("dataset.csv", encoding="utf-8")
df.columns = df.columns.str.strip()
df.to_csv("cleaned_dataset.csv", index=False)

# Verify the column names before passing the file to the document loader.
print(df.columns)


Index(['question', 'answer'], dtype='object')
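
The sketch below shows why the column names need stripping, using only the standard library. It assumes the loader looks up the source column by its exact name, as Python's `csv.DictReader` does; the raw CSV text and the answer value are hypothetical.

```python
import csv
import io

# Hypothetical raw file whose header cells carry stray spaces.
raw = " question , answer\nWhat is supervised machine learning?,Learning from labeled data.\n"

rows = list(csv.DictReader(io.StringIO(raw)))
raw_columns = list(rows[0].keys())
print(raw_columns)               # header names still include the spaces
print("question" in rows[0])     # exact-name lookup fails on the raw header

# Strip every header name, mirroring df.columns.str.strip() in the cell above.
cleaned_rows = [{key.strip(): value for key, value in row.items()} for row in rows]
print(list(cleaned_rows[0].keys()))
print(cleaned_rows[0]["question"])
```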


###### Here we load the document using the CSVLoader class from LangChain. This lets us operate on the file contents (as a list of Document objects) using other LangChain classes and functions.

In [5]:
# This is a basic implementation of File Loader from Langchain using CSVLoader
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = CSVLoader(file_path="cleaned_dataset.csv", source_column="question", encoding="utf-8")
data = loader.load()

print(type(data)) # list of documents
print(type(data[0].page_content)) # complete row
print(data[0].metadata) # question(source column) and the row number

# Now we split the documents into chunks, which we will pass on to the LLM later.
# Each LLM can only process a limited number of tokens at once (e.g. OpenAI GPT-3.5 accepts 4,097 tokens).
# Note that chunk_size below is measured in characters, not tokens.


document_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
document_in_chunks = document_splitter.split_documents(data)
# Keeping the chunk overlap at 0 because each row is a self-contained question/answer pair, so adjacent chunks do not need to share context.


<class 'list'>
<class 'str'>
{'source': 'What is supervised machine learning?', 'row': 0}
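
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-width splitter. This is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on separators first); the helper `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into character windows; consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
no_overlap = split_with_overlap(text, chunk_size=10, chunk_overlap=0)
with_overlap = split_with_overlap(text, chunk_size=10, chunk_overlap=4)
print(no_overlap)
print(with_overlap)
```

With an overlap of 0 the chunks partition the text exactly; with a positive overlap each chunk repeats the tail of the previous one, which only helps when meaning spans chunk boundaries.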


###### Now that the CSV data is loaded, we create embeddings. Because OpenAI embeddings are a paid service, we use a Sentence Transformers model from Hugging Face instead.

###### After creating the embeddings for our database document, we also create a vector database using Chroma DB

In [6]:
import os

os.environ['token'] = "YOUR HUGGING FACE TOKEN"

In [12]:
# Only the embedding wrapper and the vector store are needed in this cell.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

model = "sentence-transformers/all-mpnet-base-v2"

# embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embedding_model = HuggingFaceEmbeddings(
    model_name=model,
    model_kwargs={"device": "cpu"},
)

vectordb = Chroma.from_documents(document_in_chunks, embedding_model)

print('done')

done
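
Conceptually, the vector database ranks stored chunks by how close their embeddings are to the query embedding. The sketch below shows that idea with cosine similarity over a toy word-count embedding. It assumes nothing about Chroma's internals; `toy_embed` is a hypothetical stand-in for the sentence-transformer model, and the second and third documents are made up for illustration.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: word counts instead of a neural sentence vector."""
    return Counter(text.lower().replace("?", "").split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, toy_embed(doc)),
        reverse=True,
    )
    return ranked[:k]

documents = [
    "What is supervised machine learning?",
    "How do gradient boosting trees rank results?",  # hypothetical row
    "What is a confusion matrix?",                   # hypothetical row
]
best = top_k("gradient boosting trees for ranking", documents, k=1)
print(best)
```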


In [19]:
from langchain.chains import RetrievalQA

retriever = vectordb.as_retriever()

# Older LangChain versions used VectorDBQA for this step; RetrievalQA replaces it:
# qa = VectorDBQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=vectordb)

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

In [21]:
query = "Do you know how to use gradient boosting trees for ranking? "
qa.run(query)

'Yes, I do know how to use gradient boosting trees for ranking. Gradient boosting trees can be used for ranking by using them to learn a model of the relationship between the features of a data point and its ranking. This model can then be used to predict the ranking of new data points. Gradient boosting trees are a powerful technique for ranking, and they can often outperform other methods such as linear regression and logistic regression.'
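
The `chain_type="stuff"` setting means the retrieved chunks are all placed ("stuffed") into a single prompt together with the question. A rough sketch of that idea follows. The prompt wording, the helper `build_stuff_prompt`, and the retrieved chunks are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Concatenate every retrieved chunk into one context block ahead of the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical chunks that a retriever might return for the query.
retrieved = [
    "question: How are gradient boosting trees used for ranking?\nanswer: (stored answer text)",
    "question: What is supervised machine learning?\nanswer: (stored answer text)",
]
question = "Do you know how to use gradient boosting trees for ranking?"
prompt = build_stuff_prompt(question, retrieved)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the combined chunks fit within the model's token limit.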