<a href="https://colab.research.google.com/github/psymed/AllureReport/blob/main/LangChain_Train_ChatGPT_with_your_own_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install Dependencies

In [None]:
!pip install langchain chromadb
!pip install openai
!pip install tiktoken
!pip install python-dotenv

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Downloading langchain-0.0.174-py3-none-any.whl (869 kB)
Collecting chromadb
  Downloading chromadb-0.3.23-py3-none-any.whl (71 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting async-timeout<5.0.0,>=4.0.0 (from langchain)
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.5.7-py3-none-any.whl 

In [None]:
import os
from langchain.indexes import VectorstoreIndexCreator
from dotenv import dotenv_values
os.environ['OPENAI_API_KEY'] = dotenv_values()['openai_api_key'] # set environment variable
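
The lookup `dotenv_values()['openai_api_key']` raises a bare `KeyError` when the `.env` file is missing or the entry is named differently. Below is a minimal sketch of a more defensive lookup. `require_key` is a hypothetical helper, and `example_values` is a placeholder standing in for the mapping that `dotenv_values()` returns.

```python
def require_key(values, name):
    """Return values[name], or raise a readable error if it is missing or empty."""
    value = values.get(name)
    if not value:
        raise RuntimeError(
            f"'{name}' not found; add a line '{name}=...' to your .env file"
        )
    return value

# Placeholder mapping for illustration only; in the notebook you would
# pass the result of dotenv_values() instead.
example_values = {"openai_api_key": "sk-placeholder"}
key = require_key(example_values, "openai_api_key")
print(len(key) > 0)
```

In the cell above, the result of `require_key(dotenv_values(), 'openai_api_key')` could then be assigned to `os.environ['OPENAI_API_KEY']`.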

In [None]:
from langchain.document_loaders import TextLoader

loader1 = TextLoader('sd_wiki.txt')
loader2 = TextLoader('midjourney_wiki.txt')

In [None]:
# build a vector index over both documents in one step
index = VectorstoreIndexCreator().from_loaders([loader1,loader2])



In [None]:
index.query('who authored the theoretical paper behind Stable Diffusion?')

' Patrick Esser of Runway and Robin Rombach of CompVis.'

In [None]:
index.query('what is midjourney?')

' Midjourney is a generative artificial intelligence program and service created and hosted by a San Francisco-based independent research lab Midjourney, Inc. Midjourney generates images from natural language descriptions, called "prompts", similar to OpenAI\'s DALL-E and Stable Diffusion.'

In [None]:
index.query('create a list of the top 5 pros and cons for each of Midjourney and Stable Diffusion and provide a comparison between the 2 tools.')

" I don't know."

In [None]:
index.query('What are the main capabilities of midjourney?')

' Midjourney is a generative artificial intelligence program and service that generates images from natural language descriptions. It is used by artists for rapid prototyping of artistic concepts, by the advertising industry to create original content and brainstorm ideas quickly, and by other industries for custom ads, special effects, and e-commerce advertising.'

# Showing that we can teach ChatGPT new data it didn't know

In [None]:
index.query('what is a controlnet?')

' A ControlNet is a neural network architecture designed to manage diffusion models by incorporating additional conditions. It duplicates the weights of neural network blocks into a "locked" copy and a "trainable" copy. The "trainable" copy learns the desired condition, while the "locked" copy preserves the original model. This approach ensures that training with small datasets of image pairs does not compromise the integrity of production-ready diffusion models.'

Using query_with_sources to see which document an answer came from

In [None]:
# useful when querying multiple documents
index.query_with_sources("Can you access midjourney via a web app ?")

{'question': 'Can you access midjourney via a web app ?',
 'answer': ' Yes, you can access Midjourney via a web app.\n',
 'sources': 'midjourney_wiki.txt'}
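
The `sources` field can be reported because every chunk keeps a record of the file it was loaded from. The sketch below illustrates that idea only. It ranks chunks by shared words as a toy stand-in for embedding similarity, the chunk texts are made up, and `Chunk` and `retrieve_with_sources` are hypothetical names, not LangChain classes.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str  # file the chunk was loaded from

def retrieve_with_sources(chunks, question, k=2):
    """Rank chunks by words shared with the question and collect their sources."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.text.lower().split())),
        reverse=True,
    )
    top = ranked[:k]
    sources = sorted({c.source for c in top})
    return top, sources

# Made-up chunk texts for illustration.
chunks = [
    Chunk("Midjourney is accessed through a Discord bot", "midjourney_wiki.txt"),
    Chunk("Midjourney is working on a web interface", "midjourney_wiki.txt"),
    Chunk("Stable Diffusion is a latent diffusion model", "sd_wiki.txt"),
]
top, sources = retrieve_with_sources(chunks, "can you access midjourney via a web app")
print(sources)
```

Only the files behind the retrieved chunks are listed, which is why a question about Midjourney reports a single source.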

# Now we'll use a different and more robust way to load documents into a db

# Step 1 - Load data & split into chunks

In [None]:
from langchain.document_loaders import TextLoader
loader1 = TextLoader('sd_wiki.txt')
loader2 = TextLoader('midjourney_wiki.txt')
documents = loader1.load()
documents += loader2.load() # put document objects inside a list

In [None]:
documents

[Document(page_content='Stable Diffusion\n\nArticle\nTalk\nRead\nEdit\nView history\n\nTools\nFrom Wikipedia, the free encyclopedia\nStable Diffusion\nA photograph of an astronaut riding a horse 2022-08-28.png\nAn image generated by Stable Diffusion based on the text prompt "a photograph of an astronaut riding a horse"\nOriginal author(s)\tRunway, CompVis, and Stability AI\nDeveloper(s)\tStability AI\nInitial release\tAugust 22, 2022\nStable release\t\n2.1 (model)[1] / December 7, 2022\nRepository\tgithub.com/Stability-AI/stablediffusion\nWritten in\tPython[2]\nOperating system\tAny that support CUDA kernels\nType\tText-to-image model\nLicense\tCreative ML OpenRAIL-M\nWebsite\tommer-lab.com/research/latent-diffusion-models/ \nStable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-imag

In [None]:
#split data
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=4000, chunk_overlap=0) #default 4000
texts = text_splitter.split_documents(documents)



In [None]:
len(texts)

12

In [None]:
len(texts[1].page_content)

3661
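
The chunk is 3661 characters rather than exactly 4000 because the splitter cuts on a separator and packs whole pieces into a chunk until the next piece would overflow `chunk_size`. The function below is a simplified sketch of that greedy packing, not the library's implementation. `split_on_separator` is a hypothetical name, and the `"\n\n"` separator is an assumption about the default.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece fits, or it is a single oversized piece kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "\n\n".join(["a" * 30, "b" * 30, "c" * 30])
lengths = [len(c) for c in split_on_separator(sample, chunk_size=70)]
print(lengths)  # prints [62, 30]
```

The first two pieces fit together (30 + 2 + 30 = 62), but adding the third would reach 94, so it starts a new chunk.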

In [None]:
# Print the first chunk's page content
texts[0].page_content

'Stable Diffusion\n\nArticle\nTalk\nRead\nEdit\nView history\n\nTools\nFrom Wikipedia, the free encyclopedia\nStable Diffusion\nA photograph of an astronaut riding a horse 2022-08-28.png\nAn image generated by Stable Diffusion based on the text prompt "a photograph of an astronaut riding a horse"\nOriginal author(s)\tRunway, CompVis, and Stability AI\nDeveloper(s)\tStability AI\nInitial release\tAugust 22, 2022\nStable release\t\n2.1 (model)[1] / December 7, 2022\nRepository\tgithub.com/Stability-AI/stablediffusion\nWritten in\tPython[2]\nOperating system\tAny that support CUDA kernels\nType\tText-to-image model\nLicense\tCreative ML OpenRAIL-M\nWebsite\tommer-lab.com/research/latent-diffusion-models/ \nStable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided b

# Step 2 - Create text embeddings & save into a vectorstore (database / index)

In [None]:
# create text embeddings & index
import os
from dotenv import dotenv_values
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

os.environ['OPENAI_API_KEY'] = dotenv_values()['openai_api_key']  # set environment variable
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(texts, embeddings)  # embed each chunk and store the vectors
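
Conceptually, a vectorstore keeps one vector per chunk and answers a query by returning the chunks whose vectors are closest to the query's vector. The sketch below shows that mechanism with word-count vectors and cosine similarity in place of OpenAI embeddings. `embed`, `cosine`, and `TinyStore` are hypothetical names, and this is not how Chroma is implemented internally.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real embedding model returns dense float vectors."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class TinyStore:
    def __init__(self, texts):
        # Embed every text once, at indexing time.
        self.items = [(t, embed(t)) for t in texts]

    def similarity_search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = TinyStore(["stable diffusion generates images", "midjourney runs on discord"])
best = store.similarity_search("where does midjourney run", k=1)
print(best)
```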





# Step 3 - Create a retriever from the db, create a chain & ask questions

In [None]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
retriever = db.as_retriever()
qa = RetrievalQA.from_chain_type(llm=OpenAI(model_name='gpt-3.5-turbo'),
                                  chain_type="stuff",
                                  retriever=retriever)
# available chain types:
# stuff
# map_reduce
# refine
# map_rerank
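
The chain type controls how the retrieved chunks reach the model. As a rough sketch, "stuff" places all chunks into one prompt, while "map_reduce" asks the question once per chunk and then combines the partial answers. `fake_llm` is a stand-in callable, and the prompt wording is illustrative rather than LangChain's actual template.

```python
def stuff(llm, chunks, question):
    """One call: every retrieved chunk is placed into a single prompt."""
    context = "\n\n".join(chunks)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")

def map_reduce(llm, chunks, question):
    """One call per chunk, then a final call that combines the partial answers."""
    partials = [llm(f"Context:\n{c}\n\nQuestion: {question}") for c in chunks]
    return llm("Combine these partial answers:\n" + "\n".join(partials))

calls = []

def fake_llm(prompt):
    """Stand-in for a model: records the prompt and returns a placeholder answer."""
    calls.append(prompt)
    return f"answer#{len(calls)}"

chunks = ["chunk one", "chunk two", "chunk three"]
stuff(fake_llm, chunks, "q")
stuff_calls = len(calls)
map_reduce(fake_llm, chunks, "q")
map_reduce_calls = len(calls) - stuff_calls
print(stuff_calls, map_reduce_calls)  # prints 1 4
```

"stuff" is the cheapest option but only works while all retrieved chunks fit in the model's context window. "map_reduce" trades extra calls for the ability to handle more text.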



In [None]:
qa.run('I want to use midjourney, how do i use it?')

'Midjourney is currently accessible through a Discord bot on their official Discord server. Users can generate images by using the /imagine command and typing in a prompt. The bot will then return a set of four images and users can choose which images they want to upscale. Midjourney is also working on a web interface, but currently, it is only accessible through the Discord bot. It is important to note that Midjourney is currently in open beta and has three subscription tiers.'

# Company Policy Test

In [None]:
from langchain.document_loaders import TextLoader
loader1 = TextLoader('./company_policies/Alpha.txt')
loader2 = TextLoader('./company_policies/Beta.txt')
loader3 = TextLoader('./company_policies/Gamma.txt')
documents = loader1.load()
documents += loader2.load() 
documents += loader3.load() 
# put document objects inside a list

#split data
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=4000, chunk_overlap=0) #default 4000
texts = text_splitter.split_documents(documents)

# create text embeddings & index
import os
from dotenv import dotenv_values
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

os.environ['OPENAI_API_KEY'] = dotenv_values()['openai_api_key']  # set environment variable
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(texts, embeddings)  # embed each chunk and store the vectors

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
retriever = db.as_retriever()
qa = RetrievalQA.from_chain_type(llm=OpenAI(model_name='gpt-3.5-turbo'),
                                  chain_type="stuff",
                                  retriever=retriever)
# available chain types:
# stuff
# map_reduce
# refine
# map_rerank




In [None]:
qa.run('What is the car policy for an employee of company Gamma?')

ERROR:root:Chroma collection langchain contains fewer than 4 elements.


'The Car policy for an employee of company Gamma is that the use of personal vehicles for business travel will be reimbursed at a rate lower than the current IRS mileage rate.'
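
The `fewer than 4 elements` log line is most likely harmless here. The retriever asks for the 4 nearest chunks by default, while this collection appears to hold fewer (presumably one chunk per policy file), so the store returns what it has. The sketch below shows that clamping. `effective_k` and `top_k` are hypothetical helpers, and the scores are made up.

```python
def effective_k(requested_k, collection_size):
    """Number of results a store can actually return for a top-k request."""
    return max(0, min(requested_k, collection_size))

def top_k(scored_items, k):
    """Return the k best (score, item) pairs, or all of them if fewer exist."""
    k = effective_k(k, len(scored_items))
    return sorted(scored_items, key=lambda pair: pair[0], reverse=True)[:k]

# Made-up similarity scores for illustration.
scores = [(0.40, "Alpha.txt"), (0.91, "Gamma.txt"), (0.38, "Beta.txt")]
results = top_k(scores, k=4)
print(len(results))  # prints 3
```

Passing `search_kwargs={"k": 3}` to `db.as_retriever(...)` is one way to request only as many results as the collection holds. Treat that as a suggestion to check against the installed LangChain version.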

In [None]:
qa.run('Compare air policies for companies Alpha & Beta')

ERROR:root:Chroma collection langchain contains fewer than 4 elements.


'Company Alpha allows employees to travel in business class for international flights and economy class for domestic flights, while Company Beta allows employees to travel in business class for both international and domestic flights.'

In [None]:
qa.run('Compare air policies for companies Alpha & Beta - Return comparison in a table format')

ERROR:root:Chroma collection langchain contains fewer than 4 elements.


'| Company | International Flights | Domestic Flights |\n| --- | --- | --- |\n| Alpha | Business Class | Economy Class |\n| Beta | Business Class | Business Class for employees |'
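
The table comes back as one string with embedded `\n` characters, so `print()` is enough to display it as rows. To work with the cells programmatically, the string can be split into a header and rows. The sketch below is a minimal parser for simple pipe-delimited tables like the answer above. `parse_markdown_table` is a hypothetical helper.

```python
def parse_markdown_table(text):
    """Split a pipe-delimited Markdown table into a header list and row lists."""
    rows = []
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if all(set(cell) <= set("-: ") for cell in cells):
            continue  # skip the | --- | --- | separator row
        rows.append(cells)
    return rows[0], rows[1:]

answer = (
    "| Company | International Flights | Domestic Flights |\n"
    "| --- | --- | --- |\n"
    "| Alpha | Business Class | Economy Class |\n"
    "| Beta | Business Class | Business Class for employees |"
)
header, rows = parse_markdown_table(answer)
print(header)
for row in rows:
    print(row)
```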

In [None]:
# Asking about data that is not in the documents
qa.run('Compare Space Booking policies for all companies - Return comparison in a table format')

ERROR:root:Chroma collection langchain contains fewer than 4 elements.


'Unfortunately, there is no information provided about the space booking policies for any of the companies mentioned, so a comparison in a table format cannot be made.'

In [None]:
qa.run('Act as a travel adviser for an Alpha company traveler. Create a booking template or guidelines that will comply with the company policy')

ERROR:root:Chroma collection langchain contains fewer than 4 elements.


'As a travel adviser for an Alpha company traveler, here are some guidelines to follow when booking travel:\n\n1. Enroll in frequent traveler programs to earn benefits for personal travel, but remember that personal travel expenses will not be reimbursed by the company.\n2. Book your flights at least two weeks in advance, and try to use the lowest logical airfare. Non-refundable fares are only acceptable if they are cheaper than the lowest available refundable fare.\n3. For international flights, you are eligible to travel in business class, while for domestic flights, only economy class is allowed.\n4. If you have a colleague from the same department who needs to travel, ensure that no more than two of you are booked on the same flight.\n5. Avoid late arrival guarantees when booking hotels, and ensure that the rates do not exceed the maximum limit set by the company.\n6. For short-distance travel, it is encouraged to use rail transportation. However, if you need to rent a car for busi

In [None]:
qa.run('Provide 2 booking requests for an Alpha traveler - 1 that fully complies with the company policy, and a second booking request that does not comply. Return results in bullets')

ERROR:root:Chroma collection langchain contains fewer than 4 elements.


"Booking Request 1 (Compliant with Company Alpha Policy)\n- Economy class flight booked at least two weeks in advance using the company's travel agency\n- Non-refundable fare chosen only because it is cheaper than the lowest refundable fare\n- Soft dollars balanced across different airlines\n- Hotel booked through the company's travel agency within the maximum limit set by the company\n\nBooking Request 2 (Non-Compliant with Company Alpha Policy)\n- Business class flight booked for a domestic flight\n- Flight booked less than two weeks in advance\n- Non-direct routing chosen even though it does not result in significant cost savings\n- Personal travel expenses, such as a frequent traveler membership, added onto the booking and requested to be reimbursed by the company"

In [None]:
# How to download files and folders from Colab:
# `zip -r` zips recursively; the first path is the archive to create, the second is the folder to zip.
# !zip -r /content/LangChain-Train_ChatGPT_With_Your_Own_Data.zip /content/
# Then download the archive. `files` must be imported first, otherwise the call raises a NameError:
# from google.colab import files
# files.download('/content/LangChain-Train_ChatGPT_With_Your_Own_Data.zip')


updating: content/company_policies/ (stored 0%)
updating: content/company_policies/Gamma.txt (deflated 47%)
updating: content/company_policies/Alpha.txt (deflated 57%)
updating: content/company_policies/Beta.txt (deflated 47%)
updating: content/company_policies/.ipynb_checkpoints/ (stored 0%)
updating: content/company_policies/untitled (stored 0%)
updating: content/midjourney_wiki.txt (deflated 56%)
updating: content/sample_data/ (stored 0%)
updating: content/sample_data/anscombe.json (deflated 83%)
updating: content/sample_data/README.md (deflated 42%)
updating: content/sample_data/california_housing_test.csv (deflated 76%)
updating: content/sample_data/mnist_test.csv (deflated 88%)
updating: content/sample_data/mnist_train_small.csv (deflated 88%)
updating: content/sample_data/california_housing_train.csv (deflated 79%)
updating: content/sample_data.zip (stored 0%)
updating: content/sd_wiki.txt (deflated 60%)
  adding: content/ (stored 0%)
  adding: content/.config/ (stored 0%)
  add

NameError: ignored
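
The `NameError` above is consistent with `files` never having been imported. In Colab it lives in `google.colab`. The archive step itself can also be done from Python with the standard library instead of `!zip`. The sketch below runs on a throwaway folder so it works anywhere. `zip_folder` is a hypothetical helper, and the paths are temporary placeholders.

```python
import os
import shutil
import tempfile
import zipfile

def zip_folder(folder, archive_base):
    """Zip `folder` recursively and return the path of the created .zip file."""
    return shutil.make_archive(archive_base, "zip", root_dir=folder)

# Demo on a temporary folder; in Colab `folder` could be a directory under /content.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "company_policies")
os.makedirs(src)
with open(os.path.join(src, "Alpha.txt"), "w") as fh:
    fh.write("example policy text")

# The archive is written outside `src`, so it does not end up inside itself.
archive = zip_folder(src, os.path.join(workdir, "backup"))
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
print(names)
```

In Colab, the resulting path could then be passed to `files.download(archive)` after `from google.colab import files`.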