In [1]:
import os
from dotenv import load_dotenv

load_dotenv(override=True)

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
ACTIVELOOP_TOKEN = os.environ.get('ACTIVELOOP_TOKEN')
GOOGLE_CSE_ID = os.environ.get('GOOGLE_CSE_ID')
GOOGLE_API_KEY = os.environ.get('GOOGLE_API_KEY')
GOOGLE_APPLICATION_CREDENTIALS = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')

## .TextLoader, Simple txt file
TextLoader loads a .txt file and returns a list of LangChain documents, each containing page_content and metadata.

You can pass an explicit encoding argument, or set autodetect_encoding=True to let the loader detect the file's encoding.


In [12]:
from langchain.document_loaders import TextLoader

# use TextLoader to load text from local file
# TextLoader converts the file into a LangChain document: the text (page_content) plus metadata
loader = TextLoader("data/soilsense_info.txt", autodetect_encoding=True)
docs_from_file = loader.load()

# print the first 50 characters of the first document
print('Page content (first 50 chars):', docs_from_file[0].page_content[0:50])
print('Meta data: ', docs_from_file[0].metadata)

Page content (first 50 chars): Take the guesswork out of irrigation
We offer a si
Meta data:  {'source': 'data/soilsense_info.txt'}
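
The encoding options matter when a file is not stored as UTF-8. Below is a minimal sketch of the fallback idea using only the standard library. The helper `read_text_with_fallback` and its list of candidate encodings are hypothetical, and this is not how TextLoader detects encodings internally; it only illustrates why trying more than one encoding helps.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order and return (text, encoding_used).

    Hypothetical helper for illustration; not part of LangChain.
    """
    for encoding in encodings:
        try:
            with open(path, encoding=encoding) as f:
                return f.read(), encoding
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file, try the next one
    raise ValueError(f"None of {encodings} could decode {path}")


# Write a small Latin-1 file; the byte for "å" is not valid UTF-8 here
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("Jordfugtighed måles i rodzonen".encode("latin-1"))
    sample_path = tmp.name

text, used = read_text_with_fallback(sample_path)
os.remove(sample_path)
print(used, "->", text)
```

Reading the same file with only `encoding="utf-8"` would raise a `UnicodeDecodeError`, which is the situation the autodetect option is meant to avoid.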


### .PyPDFLoader, PDF files
LangChain provides several loaders for PDF files. Two common ones are PyPDFLoader and PDFMinerLoader.

- PyPDFLoader wraps the pypdf library (the successor to PyPDF2); it is the most commonly used option and simple to get started with
- PDFMinerLoader offers more capable extraction than PyPDFLoader but is more difficult to set up
- More loaders are listed at https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf

A review of PyPDF, PDFMiner and two other tools: https://medium.com/social-impact-analytics/comparing-4-methods-for-pdf-text-extraction-in-python-fd34531034f

In [14]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# create a text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)


loader = PyPDFLoader("data/dagpengeguide-2023.pdf")
# load_and_split defaults to RecursiveCharacterTextSplitter anyway, but it is passed explicitly for clarity
pages = loader.load_and_split(text_splitter=text_splitter)

print(pages[0])

page_content='1\nEn guide \ntil dig,\nder modtager \ndagpenge' metadata={'source': 'data/dagpengeguide-2023.pdf', 'page': 0}
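
To make chunk_size and chunk_overlap concrete, here is a simplified sketch that cuts a string into fixed-size windows. It is an assumption-laden stand-in: RecursiveCharacterTextSplitter prefers to split on separators such as paragraph and line breaks, so its real chunks will differ. The function `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=200, chunk_overlap=20):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# 500 characters of dummy text
sample = "".join(str(i % 10) for i in range(500))
chunks = split_with_overlap(sample, chunk_size=200, chunk_overlap=20)

print([len(c) for c in chunks])           # [200, 200, 140]
print(chunks[0][-20:] == chunks[1][:20])  # True: neighbours share 20 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text.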


### WebBaseLoader (Web pages, HTML)

WebBaseLoader loads all text from HTML webpages: it downloads each page and then uses BeautifulSoup to extract the text.

In [16]:
from langchain.document_loaders import WebBaseLoader

soilsense_urls = ["https://soilsense.io/", "https://soilsense.io/sensors-en",
                  "https://soilsense.io/system-en","https://soilsense.io/dashboard-en",
                "https://soilsense.io/soil-moisture-sensors-en", "https://soilsense.io/smart-farming-en",
                "https://soilsense.io/cases", "https://soilsense.io/orchards","https://soilsense.io/greenhouses",
                "https://soilsense.io/digital-smart-irrometer-watermark","https://soilsense.io/about-en",
                "https://soilsense.io/blog/tpost/d0jafmpzt1-precision-irrigation-a-simple-guide-to-s",
                "https://soilsense.io/blog/tpost/otk772c4g1-matric-potential-and-volumetric-water-co",
                "https://soilsense.io/blog/tpost/ffvu68rr81-soil-moisture-sensors-or-satellite-data"]

loader = WebBaseLoader(soilsense_urls)
url_docs = loader.load()

In [17]:
print(url_docs[0].page_content[:1000])

 SoilSense - soil moisture sensors for the future of agriculture SoilSense ProductSensorsWireless SystemOnline DashboardDigital farmingSoil Moisture SensorsSmart FarmingUse CasesIrrometer DigitalizedAbout usBuy nowConsultationBlogLog inLanguageEnglishDanskPolskiEspañolSoilSense ProductSensorsWireless SystemOnline DashboardDigital farmingSoil Moisture SensorsSmart FarmingUse CasesIrrometer DigitalizedAbout usBuy nowConsultationBlogLog in Make the smartest irrigation decisionsSoilSense wireless soil moisture sensor system analyzes data in real-time to provide you with direct insights on how much water your crops needBook a free consultationRead moreTake the guesswork out of irrigation  We offer a simple, robust and affordable soil sensor system to help orchard managers, greenhouse growers and high-value field crop farmers manage and optimize irrigation                         Save waterAvoid over-irrigation and irrigate only when necessaryIncrease yieldIrrigate correctly throughout the 
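
Menu items in the output run together (for example "ProductSensorsWireless"), most likely because adjacent text nodes are joined without a separator. The sketch below uses the standard library's `html.parser` as a stand-in for BeautifulSoup to show the effect; `TextExtractor`, `html_to_text` and the sample HTML are hypothetical and are not what WebBaseLoader runs internally.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html, separator=" "):
    """Return the text nodes of an HTML string joined by separator."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.parts)


sample_html = """
<html><head><title>Demo page</title><script>var x = 1;</script></head>
<body><nav><a>Sensors</a><a>Dashboard</a></nav><p>Save water.</p></body></html>
"""
print(html_to_text(sample_html))                # Demo page Sensors Dashboard Save water.
print(html_to_text(sample_html, separator=""))  # Demo pageSensorsDashboardSave water.
```

Joining with a space keeps word boundaries intact, which tends to matter later when the text is split into chunks and embedded.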

### .SeleniumURLLoader (Websites w. Javascript)
**Issue with WebBaseLoader**: When you request a webpage using a library like requests or aiohttp, you get only the initial HTML of the page; any content that JavaScript loads afterwards is not included. That is why you might see template tags like {{(item.price)}} taka instead of the actual values. Those tags are placeholders that JavaScript fills in with actual data after the page loads.

https://github.com/langchain-ai/langchain/issues/4838
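
A quick way to spot this problem in already loaded text is to search for leftover placeholders. This is a minimal sketch that assumes the page uses double-curly template syntax; the helper `find_unrendered_placeholders` and both sample strings are made up for illustration.

```python
import re

# Matches double-curly placeholders such as {{ item.price }} or {{(item.price)}}
PLACEHOLDER_PATTERN = re.compile(r"\{\{.*?\}\}")


def find_unrendered_placeholders(text):
    """Return template placeholders that were left unrendered in the text."""
    return PLACEHOLDER_PATTERN.findall(text)


static_html_text = "Price: {{(item.price)}} taka"  # what a plain HTTP request may return
rendered_text = "Price: 450 taka"                   # what a browser would show (made-up value)

print(find_unrendered_placeholders(static_html_text))  # ['{{(item.price)}}']
print(find_unrendered_placeholders(rendered_text))     # []
```

If a loaded page returns any matches, it is a candidate for a browser-based loader rather than a plain HTTP one.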


**FAILS TO RUN DUE TO CHROMEDRIVER ISSUE**
https://stackoverflow.com/questions/49323099/webdriverexception-message-service-chromedriver-unexpectedly-exited-status-co

**Skip for now**

In [None]:
from langchain.document_loaders import SeleniumURLLoader

urls_javascript = [
    "https://www.youtube.com/watch?v=TFa539R09EQ&t=139s",
    "https://www.youtube.com/watch?v=6Zv6A_9urh4&t=112s"
]

# load the JavaScript-heavy pages defined above
loader = SeleniumURLLoader(urls=urls_javascript)
url_docs = loader.load()

print(url_docs[0].page_content)

### GoogleDriveLoader (Google Drive)


- Step 1: Enable the Google Drive API on Google Cloud Platform
- Step 2: Create an OAuth application of type Desktop
- Step 3: Download credentials_desktop.json
- Step 4: Manually create the directory for the token file by running mkdir -p ~/.credentials/ (possibly needed because of a WSL/LangChain issue)
- Step 5: Run the GoogleDriveLoader, then ctrl-click the printed link to open it in a browser and authenticate
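
Step 4 can also be done from Python instead of the shell. The sketch below is a hypothetical helper, `ensure_credentials_dir`, that mirrors `mkdir -p`; the demo runs against a temporary directory so it does not touch the real home folder.

```python
import tempfile
from pathlib import Path


def ensure_credentials_dir(home=None):
    """Create <home>/.credentials if it is missing and return its path."""
    base = Path(home) if home is not None else Path.home()
    credentials_dir = base / ".credentials"
    credentials_dir.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
    return credentials_dir


# Demo against a throwaway directory that stands in for the home folder
with tempfile.TemporaryDirectory() as fake_home:
    created = ensure_credentials_dir(fake_home)
    print(created.name, created.is_dir())  # .credentials True
```

Calling `ensure_credentials_dir()` without arguments would target the real home directory, and running it twice is harmless because of `exist_ok=True`.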

In [9]:
from langchain.document_loaders import GoogleDriveLoader

loader = GoogleDriveLoader(
    folder_id="11Vpbdd4mC6GxlPNwg-GJdg4ovvvCFatq",
    #token_path='/path/where/you/want/token/to/be/created/google_token.json'
    #file_types=["document", "sheet"],
    credentials_path=os.environ["GOOGLE_APPLICATION_CREDENTIALS"],
    recursive=False  # Optional: Fetch files from subfolders recursively. Defaults to False.
)

docs = loader.load()

In [13]:
#Docs = each page of the PDFs loaded
len(docs)

284

In [14]:
docs[3].page_content[:1000]

'42.  INSTALLATION\nThe gypsum block sensor has been fitted with a logger interface which is inside the\nsmall black box, in line with the sensor cable. This box is completely waterproof andcan be safely buried in the soil during installation if necessary.\nThis sensor is recommended for applications with low to moderately low saline soils\nof pH 4.5 or higher.\nPlace the gypsum block in a container of water for ten minutes to saturate the block.Use an auger slightly larger than the sensor (which has approximately a 25 mm\ndiameter) to drill a hole in the soil to the required depth. Make a small amount of soil/ water slurry using some earth from the hole, pour this into the hole just enough tocover the gypsum block.\nUsing a pole or broom handle push the saturated gypsum block down to the bottom of\nthe hole, taking care not to damage the signal wire or the interface box. A slot in thebase of the pole will help protect the wire during this process.\nRefill the hole with the original so

In [12]:
docs[3].metadata

{'source': 'https://drive.google.com/file/d/197YLPeWUY3HPQf0eQgteWM5g8DdMBtWd/view',
 'title': 'Gypsum Block with DataHog interface.pdf',
 'page': 3}
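
Because every page becomes its own document, it can help to see how the loaded pages are spread across files. A small sketch, assuming each document exposes a metadata dict with a 'title' key as in the output above; `pages_per_title` is a hypothetical helper and the sample objects are stand-ins, with one made-up title.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_title(documents):
    """Count how many loaded pages share each metadata['title']."""
    return Counter(doc.metadata.get("title", "<untitled>") for doc in documents)


# Stand-ins for loaded documents; only the metadata attribute is needed here
sample_docs = [
    SimpleNamespace(metadata={"title": "Gypsum Block with DataHog interface.pdf", "page": 3}),
    SimpleNamespace(metadata={"title": "Gypsum Block with DataHog interface.pdf", "page": 4}),
    SimpleNamespace(metadata={"title": "example-manual.pdf", "page": 0}),  # made-up title
]

counts = pages_per_title(sample_docs)
for title, n_pages in counts.most_common():
    print(f"{n_pages:>4}  {title}")
```

With the real data, `pages_per_title(docs)` should work unchanged as long as the loaded documents carry the same metadata keys.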

## Chat with PDF folder on Google Drive


In [15]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# create a text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# load_and_split defaults to RecursiveCharacterTextSplitter anyway, but it is passed explicitly for clarity
docs_split = loader.load_and_split(text_splitter=text_splitter)

In [17]:
# docs_split holds one document per chunk produced by the text splitter
print('no of docs after splitting, ' , len(docs_split))

no of docs after splitting,  707


In [18]:
from langchain.embeddings import HuggingFaceEmbeddings
sbert_name = 'multi-qa-MiniLM-L6-cos-v1'
sbert_embeddings = HuggingFaceEmbeddings(model_name=sbert_name)

In [19]:
from langchain.vectorstores import Chroma

persist_directory = './data/chroma/'

!rm -rf ./data/chroma  # remove old database files if any

vectordb = Chroma.from_documents(
    documents=docs_split,
    embedding=sbert_embeddings,
    persist_directory=persist_directory
)


In [20]:
retriever = vectordb.as_retriever()
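
Conceptually, the retriever embeds the query and returns the chunks whose vectors lie closest to it. The sketch below shows that idea with toy vectors and cosine similarity. Both are assumptions for illustration: the real embeddings come from the sentence-transformer model, and the distance metric and index Chroma uses are not shown in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=2):
    """Return the k (score, text) pairs whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed_chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions
indexed_chunks = [
    ("USE WITH OTHER DATALOGGERS", [0.9, 0.1, 0.0]),
    ("INSTALLATION", [0.2, 0.8, 0.1]),
    ("WIRING DETAILS", [0.6, 0.3, 0.4]),
]
query_vector = [1.0, 0.0, 0.0]

for score, text in top_k(query_vector, indexed_chunks):
    print(f"{score:.2f}  {text}")
```

A vector store does the same ranking, but with an index that avoids comparing the query against every stored vector.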

In [27]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# create a retrieval chain
# gpt-3.5-turbo-instruct replaced text-davinci-003
qa_chain = RetrievalQA.from_chain_type(
	llm=OpenAI(model="gpt-3.5-turbo-instruct", temperature=0.0),
	chain_type="stuff",
	retriever=retriever,
	return_source_documents=True
)
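
The "stuff" chain type places all retrieved chunks into a single prompt. A rough sketch of that idea follows; the prompt wording and the function `build_stuff_prompt` are hypothetical and are not LangChain's actual template, and the chunk texts are abridged from the manual loaded above.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Abridged stand-ins for retrieved chunks
retrieved_chunks = [
    "USE WITH OTHER DATALOGGERS\nIf it is to be used on other dataloggers, please ensure a 5V power supply.",
    "INSTALLATION\nPlace the gypsum block in a container of water for ten minutes to saturate the block.",
]
prompt = build_stuff_prompt("Which power supply does the interface need?", retrieved_chunks)
print(prompt)
```

Because everything goes into one prompt, the combined chunks have to fit within the model's context window, which is one reason the chunk size chosen during splitting matters.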

In [30]:
query = "Can the watermark sensor be used with other dataloggers?"
response = qa_chain({"query": query})

In [32]:
response['result']

" I don't know."

In [33]:
response['source_documents']

[Document(page_content='53. USE WITH OTHER DATALOGGERS\nThis gypsum block interface has been designed for use with the Skye DataHog or\nMiniMet datalogger, using the 5.000 volt regulated sensor excitation supply.\nIf it is to be used on other dataloggers, please ensure a 5V power supply else the\ncalibration data supplied in this manual will be incorrect.\nPlease also note that the interface is not protected against any power supply reversal.\nWIRING DETAILS FOR WIRE ENDED SENSORS\nRed Positive power supply (5V)\nBlue Sensor outputGrey (cable screen) Power supply & output ground', metadata={'page': 4, 'source': 'https://drive.google.com/file/d/197YLPeWUY3HPQf0eQgteWM5g8DdMBtWd/view', 'title': 'Gypsum Block with DataHog interface.pdf'}),
 Document(page_content='42.  INSTALLATION\nThe gypsum block sensor has been fitted with a logger interface which is inside the\nsmall black box, in line with the sensor cable. This box is completely waterproof andcan be safely buried in the soil during 

In [34]:
query = "What does it take to use the watermark sensor with another datalogger?"
response = qa_chain({"query": query})

In [35]:
response['result']

' It is recommended to use the gypsum block interface with the Skye DataHog or MiniMet datalogger, using the 5.000 volt regulated sensor excitation supply. If using it with other dataloggers, a 5V power supply is needed to ensure correct calibration data. It is also important to note that the interface is not protected against power supply reversal.'

In [36]:
response['source_documents']

[Document(page_content='53. USE WITH OTHER DATALOGGERS\nThis gypsum block interface has been designed for use with the Skye DataHog or\nMiniMet datalogger, using the 5.000 volt regulated sensor excitation supply.\nIf it is to be used on other dataloggers, please ensure a 5V power supply else the\ncalibration data supplied in this manual will be incorrect.\nPlease also note that the interface is not protected against any power supply reversal.\nWIRING DETAILS FOR WIRE ENDED SENSORS\nRed Positive power supply (5V)\nBlue Sensor outputGrey (cable screen) Power supply & output ground', metadata={'page': 4, 'source': 'https://drive.google.com/file/d/197YLPeWUY3HPQf0eQgteWM5g8DdMBtWd/view', 'title': 'Gypsum Block with DataHog interface.pdf'}),
 Document(page_content='42.  INSTALLATION\nThe gypsum block sensor has been fitted with a logger interface which is inside the\nsmall black box, in line with the sensor cable. This box is completely waterproof andcan be safely buried in the soil during 
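
The raw source_documents output is long, so a compact summary makes it easier to check which pages an answer was based on. A minimal sketch, assuming each source exposes page_content and metadata as in the outputs above; `summarize_sources` is a hypothetical helper and the sample object is a stand-in.

```python
from types import SimpleNamespace


def summarize_sources(source_documents, preview_chars=40):
    """Return one short line per source: title, page, and the start of the text."""
    lines = []
    for doc in source_documents:
        title = doc.metadata.get("title", "<untitled>")
        page = doc.metadata.get("page", "?")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{title} (page {page}): {preview}")
    return lines


# Stand-in object with the same two attributes the outputs above show
sample_sources = [
    SimpleNamespace(
        page_content="53. USE WITH OTHER DATALOGGERS\nThis gypsum block interface has been designed",
        metadata={"page": 4, "title": "Gypsum Block with DataHog interface.pdf"},
    ),
]

for line in summarize_sources(sample_sources):
    print(line)
```

With the real chain output, the call would be `summarize_sources(response['source_documents'])`.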