<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*6nKt2fNOIxv4HX_o9-AIVw.png" height=300 width=800>

# How to talk to your Documents on Google Colab
### Information Retrieval - LangChain
- Without Using OpenAI Embeddings
- Without OpenAI LLM
- With Hugging Face Inference API 

Three Applications:
- Text Documents
- Multiple PDF Files
- Web pages from URL(s)

### Additional documents
LangChain document loaders https://python.langchain.com/en/latest/modules/indexes/document_loaders.html

Answering Question About Custom Documents Using LangChain (and OpenAI) https://kleiber.me/blog/2023/02/25/question-answering-using-langchain/

LangChain CheatSheet  https://github.com/Tor101/LangChain-CheatSheet


Inspirational video on YouTube: https://youtu.be/wrD-fZvT6UI


In [1]:
!pip install langchain
!pip install huggingface_hub
!pip install sentence_transformers
!pip install faiss-cpu
!pip install unstructured
!pip install chromadb
!pip install Cython
!pip install tiktoken
!pip install unstructured[local-inference]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Downloading langchain-0.0.170-py3-none-any.whl (834 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting async-timeout<5.0.0,>=4.0.0 (from langchain)
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.5.7-py3-none-any.whl (25 kB)
Collecting openapi-schema-pydantic<2.0,>=1.2 (from langchain)
  Downloading openapi_schema_pydantic-1.2.4-py3-none-any.whl (90 kB)

### 🚸 restart runtime

### Set HUGGINGFACEHUB_API_TOKEN

In [29]:
import os
import requests
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "YOUR_HF_TOKEN"  # paste your own token here and keep it out of shared notebooks

from langchain.document_loaders import TextLoader  # load plain-text files
from langchain.text_splitter import CharacterTextSplitter  # split documents into chunks
from langchain.embeddings import HuggingFaceEmbeddings  # embeddings from Hugging Face models
# Vectorstore: https://python.langchain.com/en/latest/modules/indexes/vectorstores.html
from langchain.vectorstores import FAISS  # Facebook AI Similarity Search vector store
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub
from langchain.document_loaders import UnstructuredPDFLoader  # load PDF files
from langchain.indexes import VectorstoreIndexCreator  # build a vector index (backed by chromadb)
from langchain.chains import RetrievalQA
from langchain.document_loaders import UnstructuredURLLoader  # load web pages from URLs
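
The token is a secret, so one option is to read it from the environment, or ask for it at run time, rather than typing it into a cell. The helper below is only a sketch: `get_hf_token` and `TOKEN_VAR` are names made up for this example, and only the standard library is used.

```python
import os
from getpass import getpass

TOKEN_VAR = "HUGGINGFACEHUB_API_TOKEN"  # same variable name the notebook sets


def get_hf_token(env=None, prompt=getpass):
    """Return the token found in `env`, or ask for it interactively if it is missing."""
    env = os.environ if env is None else env
    token = env.get(TOKEN_VAR, "").strip()
    if not token:
        # getpass hides what is typed, so the token does not show up in the cell output
        token = prompt("Hugging Face token: ").strip()
    if not token:
        raise ValueError(f"{TOKEN_VAR} is empty")
    return token
```

In a notebook cell this could be used as `os.environ[TOKEN_VAR] = get_hf_token()`.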


# WORKING WITH TEXT FILES

### Download Text File

In [2]:
import requests
url2 = "https://github.com/fabiomatricardi/cdQnA/raw/main/KS-all-info_rev1.txt"
res = requests.get(url2)
with open("KS-all-info_rev1.txt", "w") as f:
  f.write(res.text)

In [3]:
!pwd
!ls -l 

/content
total 12
-rw-r--r-- 1 root root 4494 May 16 07:17 KS-all-info_rev1.txt
drwxr-xr-x 1 root root 4096 May 12 13:31 sample_data


In [4]:
# Document Loader
from langchain.document_loaders import TextLoader
loader = TextLoader('./KS-all-info_rev1.txt')
documents = loader.load()
import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]  # re-wrap each line to at most `width` characters

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text
#print(wrap_text_preserve_newlines(str(documents[0])))   

In [5]:
documents

[Document(page_content="WHAT IS HIERARCHY 4.0\nwhether you own build manage maintain or operate an oil plant inevitably issues arise that require immediate action and resolution.\nWith big data flowing in constantly from all sectors making sense of everything while troubleshooting\nissues without wasting time can be a huge challenge. \nSo what's the solution?\nintroducing hierarchy 4.0 and Innovative software solution for control Safety Systems \nHierarchy 4.0 presents an interactive diagram of the entire plant revealing cause and effect Behavior with readings provided in a hierarchical view allowing for a deep understanding of the system's strategy \nAll data is collected from multiple sources visualized as a diagram and optimized through a customized dashboard allowing users to run a logic simulation from live data or pick a moment from their history. \nYour simulation is based on actual safety Logics not just on a math model \nNow every users can prepare an RCA report 90 percent fas

In [6]:
# Text Splitter
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=10)
docs = text_splitter.split_documents(documents)
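
To get a feeling for what `chunk_size` and `chunk_overlap` do, here is a simplified fixed-window splitter. It is not LangChain's implementation (as far as I understand, `CharacterTextSplitter` first splits on a separator and then merges the pieces); `split_with_overlap` is a hypothetical helper that only shows how consecutive chunks share a few characters.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=10):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap at work.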



In [27]:
type(docs)

list

In [28]:
docs[0]

Document(page_content='CASE STUDY 1 - Restore a PSD 1 node without requiring a planned shutdown, all while the plant remains in normal operation.\nThis Case Study Challenge is to restore a rack connected to the PSD 1 node without requiring a planned shutdown, all while the plant remains in normal operation. \nWhat is the actual method: engineer and operator must use the instrument list the get the Rack 2 number of modules (9 AI modules) and verify how many channels they have assigned (8  channels for 9 cards for a total of 72 I/O). This is required to get all the Input/Output connected the Rack 2. All signals that have been identified must be verified in the logical configuration using an engineering workstation: 1)Each signal must be individually inspected in the system applications for its logic function. 2)If a signal is used in multiple controllers, all logics must be cross examined again for all the references. 3)The engineer must prepare a report containing the listed logics and 

### Embeddings

In [13]:
# Embeddings
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()

Downloading (…)a8e1d/.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)b20bca8e1d/README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading (…)0bca8e1d/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading (…)e1d/data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading (…)a8e1d/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

Downloading (…)8e1d/train_script.py:   0%|          | 0.00/13.1k [00:00<?, ?B/s]

Downloading (…)b20bca8e1d/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)bca8e1d/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

In [14]:
#Create the vectorized db
# Vectorstore: https://python.langchain.com/en/latest/modules/indexes/vectorstores.html
from langchain.vectorstores import FAISS
db = FAISS.from_documents(docs, embeddings)

### my questions
only with similarity search

In [15]:
query = "What is Hierarchy 4.0?"
docs = db.similarity_search(query)

In [16]:
len(docs)

3
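
Under the hood, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The toy version below ranks hand-made vectors by cosine similarity; this is an assumption made for illustration, since the FAISS index may use a different distance, and `cosine_similarity` and `top_k` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the `k` vectors most similar to `query_vec`, best first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))
# [0, 2]
```

Asking for more results than there are vectors simply returns all of them, which is one possible reason for getting a short list back from a small index.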

In [17]:
print(wrap_text_preserve_newlines(str(docs[0].page_content)))

WHAT IS HIERARCHY 4.0
whether you own build manage maintain or operate an oil plant inevitably issues arise that require immediate
action and resolution.
With big data flowing in constantly from all sectors making sense of everything while troubleshooting
issues without wasting time can be a huge challenge.
So what's the solution?
introducing hierarchy 4.0 and Innovative software solution for control Safety Systems
Hierarchy 4.0 presents an interactive diagram of the entire plant revealing cause and effect Behavior with
readings provided in a hierarchical view allowing for a deep understanding of the system's strategy
All data is collected from multiple sources visualized as a diagram and optimized through a customized
dashboard allowing users to run a logic simulation from live data or pick a moment from their history.
Your simulation is based on actual safety Logics not just on a math model
Now every users can prepare an RCA report 90 percent faster in just a few minutes.
Hierarchy c

### Create QA Chain with Flan-T5 XL

In [22]:
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub


In [23]:
llm=HuggingFaceHub(repo_id="google/flan-t5-xl", model_kwargs={"temperature":0, "max_length":512})
chain = load_qa_chain(llm, chain_type="stuff")

In [24]:
chain = load_qa_chain(llm, chain_type="stuff")
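
The "stuff" chain type means that all retrieved chunks are stuffed into a single prompt, followed by the question. The sketch below shows the idea in plain Python. The template wording is my approximation of the default prompt (part of it is echoed in one of the bad model outputs further down), so treat it as an assumption; `STUFF_TEMPLATE` and `build_stuff_prompt` are hypothetical names.

```python
# Approximate wording, assumed for illustration only
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\nQuestion: {question}\nHelpful Answer:"
)


def build_stuff_prompt(chunks, question):
    """Join every retrieved chunk into one context block and fill in the template."""
    context = "\n\n".join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)
```

Because everything goes into one prompt, the retrieved chunks plus the question have to fit in the model's input limit, which is worth keeping in mind with 1000-character chunks and small models.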



---

### my questions
Tested with google/flan-t5-xl

In [25]:
query = "What is the case study challenge"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

KeyboardInterrupt: ignored

In [None]:
query = "What is the Scenario about?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

In [None]:
query = "What the actual issues and drawbacks ?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)



---



## trying to use vicuna 13b 4bit or Alpaca
### NOTE: this inference works only with text2text models (Vicuna 4bit is not one of them, and in reality neither are any of the llama.cpp ones)
eachadea/legacy-ggml-vicuna-13b-4bit  (not text2text...)
<br>
Got this error:<br>
<img src="https://i.ibb.co/TBrXtX5/vicunatypeerror.png" width=950>
<br>
Browsing Hugging Face for text2text-generation models:<br>
https://huggingface.co/models?pipeline_tag=text2text-generation&sort=downloads
<br>
- [declare-lab/flan-alpaca-large](https://huggingface.co/declare-lab/flan-alpaca-large)

- you can also try Flan-ShareGPT-XL (3B; Flan, ShareGPT/Vicuna; 1x A6000), taken from HF declare-lab/flan-sharegpt-xl [at this link](https://huggingface.co/declare-lab/flan-sharegpt-xl)



In [30]:
llm1=HuggingFaceHub(repo_id="eachadea/legacy-ggml-vicuna-13b-4bit", model_kwargs={"temperature":0, "max_length":512})
chain = load_qa_chain(llm1, chain_type="stuff")

ValidationError: ignored

In [None]:
query = "What did the president say about the Supreme Court"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

In [None]:
query = "What did the president say about economy?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The president said that the economy has been "changing" for the better.'

### Test with `declare-lab/flan-sharegpt-xl`
The first run took very long and ended with a timeout error from HF<br>
<img src="https://i.ibb.co/xm58mbj/Screenshot-2023-04-30-alle-07-48-17.png" width=800>


In [31]:
llm2=HuggingFaceHub(repo_id="declare-lab/flan-sharegpt-xl", model_kwargs={"temperature":0, "max_length":512})

In [32]:
chain = load_qa_chain(llm2, chain_type="stuff")

In [33]:
query = "What did the president say about the Supreme Court"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

ValueError: ignored
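
When the first request to a model fails like this, one thing to try is simply calling again after a pause. The helper below is a generic sketch, not something from LangChain: `retry` is a hypothetical name, and whether a second attempt succeeds depends on why the API failed in the first place.

```python
import time


def retry(call, attempts=3, base_delay=2.0, sleep=time.sleep, retry_on=(Exception,)):
    """Run `call()` up to `attempts` times, doubling the pause after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the last error surface
            sleep(base_delay * 2 ** attempt)
```

In this notebook it could wrap a question like `retry(lambda: chain.run(input_documents=docs, question=query), retry_on=(ValueError,))`, since the failed cells report a `ValueError`.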

In [None]:
query = "What did the president say about economy?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The president said that the economy has been "changing" for the better.'

## ⭐⭐ Test with declare-lab/flan-alpaca-large
The first run went well.<br>
Good reply on the first question, not so good on the second one.

In [None]:
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub

In [None]:
llm2=HuggingFaceHub(repo_id="declare-lab/flan-alpaca-large", model_kwargs={"temperature":0, "max_length":512})
chain = load_qa_chain(llm2, chain_type="stuff")

### my questions
Tested with declare-lab/flan-alpaca-large

In [None]:
query = "What is the case study challenge"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The case study challenge is to restore a rack connected to the PSD 1 node without requiring a planned shutdown, all while the plant remains in normal operation.'

In [None]:
query = "What is the Scenario about?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The Scenario is about a successful maintenance of a plant.'

In [None]:
query = "What the actual issues and drawbacks ?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The actual method is time consuming due to the involvement of several specialists and other maintenance activities have been delayed as a result. The new method is more efficient and can be used to solve the issue in few simple steps.'

## try declare-lab/flan-alpaca-xxl
The first run took very long and ended with the same timeout error from HF as above, probably because the model is too large.<br>
<img src="https://i.ibb.co/xm58mbj/Screenshot-2023-04-30-alle-07-48-17.png" width=800>

In [None]:
llm3=HuggingFaceHub(repo_id="declare-lab/flan-alpaca-xxl", model_kwargs={"temperature":0, "max_length":512})

In [None]:
chain = load_qa_chain(llm3, chain_type="stuff")

In [None]:
query = "What did the president say about the Supreme Court"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

ValueError: ignored

In [None]:
query = "What did the president say about economy?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The president said that the economy has been "changing" for the better.'

## test with AlpacaAlice/t5-end2end-questions-generation
Got an error<br>
```python
ValueError: Error raised by inference API: Can't load tokenizer using from_pretrained, please update its configuration: 
Can't load tokenizer for 'AlpacaAlice/t5-end2end-questions-generation'. If you were trying to load it from 'https://huggingface.co/models', 
make sure you don't have a local directory with the same name. 
Otherwise, make sure 'AlpacaAlice/t5-end2end-questions-generation' is the correct 
path to a directory containing all relevant files for a T5TokenizerFast tokenizer.
```

In [None]:
llm4=HuggingFaceHub(repo_id="AlpacaAlice/t5-end2end-questions-generation", model_kwargs={"temperature":0, "max_length":512})

In [None]:
chain = load_qa_chain(llm4, chain_type="stuff")

In [None]:
query = "What did the president say about the Supreme Court"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

In [None]:
query = "What did the president say about economy?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The president said that the economy has been "changing" for the better.'

## test with GarciaLnk/flan-alpaca-base-squad2
---
It works, but with very bad results

In [None]:
llm5=HuggingFaceHub(repo_id="GarciaLnk/flan-alpaca-base-squad2", model_kwargs={"temperature":0, "max_length":512})

In [None]:
chain = load_qa_chain(llm5, chain_type="stuff")

In [None]:
query = "What did the president say about the Supreme Court"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'I will always have your back as your President, so you can be yourself and reach your God-given potential'

In [None]:
query = "What did the president say about economy?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

''

## test with areht/t5-small-finetuned-t5
---
very bad results

In [None]:
llm6=HuggingFaceHub(repo_id="areht/t5-small-finetuned-t5", model_kwargs={"temperature":0, "max_length":512})

In [None]:
chain = load_qa_chain(llm6, chain_type="stuff")

In [None]:
query = "What did the president say about the Supreme Court"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

": the question:,,,::, just say that you don't know, don't try to make up an answer. Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer . And tonight, I’m offering a Unity Agenda for the Nation. Four big things we can do together. First, beat the opioid epidemic. And I’m taking robust action to make sure the pain of our sanctions is targeted at Russia’s"

In [None]:
query = "What did the president say about economy?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

': the question: the question:, to pass the Bipartisan Innovation Act, which will make record investments in emerging technologies and American manufacturing. Let me give you one example of why it’s so important to pass it. Question: What did the president say about economy? Question: What did the president say about economy? Question: What did the president say about economy? Question: What did the president say about economy?'

In [None]:
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "YOUR_HF_TOKEN"  # paste your own token here and keep it out of shared notebooks

## ⭐ Test with MBZUAI/LaMini-Flan-T5-783M
---
I would say very good results for a 783M model

In [None]:
llm6=HuggingFaceHub(repo_id="MBZUAI/LaMini-Flan-T5-783M", model_kwargs={"temperature":0, "max_length":512})
chain = load_qa_chain(llm6, chain_type="stuff")

In [None]:
query = "What is the case study challenge"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The case study challenge is to restore a rack connected to the PSD 1 node without requiring a planned shutdown, all while the plant remains in normal operation.'

In [None]:
query = "What is the Scenario about?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The context does not provide information about what the Scenario is about.'

In [None]:
query = "What the actual issues and drawbacks ?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The actual issues and drawbacks of using the actual method are: 1) possibility of human error 2) incorrect impact analysis report 3) time consuming troubleshooting process 4) delayed maintenance activities 5) lack of a comprehensive overview of all signals allocated in the specified controller 6) lack of a user-friendly interface 7) lack of a comprehensive database of all the data 8) lack of a user-friendly interface 9) lack of a user-friendly interface 10) lack of a user-friendly interface'

## ⭐ Test with IAJw/declare-flan-alpaca-large-18378
---
A relatively small model, but the results are not bad, especially on the first question

In [None]:
llm7=HuggingFaceHub(repo_id="IAJw/declare-flan-alpaca-large-18378", model_kwargs={"temperature":0, "max_length":512})
chain = load_qa_chain(llm7, chain_type="stuff")

In [None]:
query = "What is the case study challenge"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

In [None]:
query = "What is the Scenario about?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

In [None]:
query = "What the actual issues and drawbacks ?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)
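
The sections above repeat the same three cells for every model. A small loop can run the same comparison and keep going when one model fails. This is only a sketch: `compare_models` is a hypothetical helper, and each entry in `answer_fns` is assumed to be a function that takes a question and returns the model's answer (for example a wrapper around the `similarity_search` plus `chain.run` cells).

```python
def compare_models(answer_fns, questions):
    """Ask every question to every model and record the answer or the error text."""
    results = {}
    for model_name, ask in answer_fns.items():
        results[model_name] = {}
        for question in questions:
            try:
                results[model_name][question] = ask(question)
            except Exception as exc:  # e.g. an API timeout
                results[model_name][question] = f"ERROR: {type(exc).__name__}: {exc}"
    return results
```

The result is a nested dictionary, `results[model][question]`, which is easy to print side by side or load into a table.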

## Now test flan-alpaca-xl (declare-lab/flan-alpaca-xl)
---
Bad: it takes too long, I guess it is still a timeout on the API request

In [None]:
llm9=HuggingFaceHub(repo_id="declare-lab/flan-alpaca-xl", model_kwargs={"temperature":0, "max_length":512})

In [None]:
chain = load_qa_chain(llm9, chain_type="stuff")

In [None]:
query = "What did the president say about the Supreme Court"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

In [None]:
query = "What did the president say about economy?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

': the question: the question:, to pass the Bipartisan Innovation Act, which will make record investments in emerging technologies and American manufacturing. Let me give you one example of why it’s so important to pass it. Question: What did the president say about economy? Question: What did the president say about economy? Question: What did the president say about economy? Question: What did the president say about economy?'

## test with declare-lab/flan-gpt4all-xl
---
not working... API timeout

In [None]:
llm10=HuggingFaceHub(repo_id="declare-lab/flan-gpt4all-xl", model_kwargs={"temperature":0, "max_length":512})

In [None]:
chain = load_qa_chain(llm10, chain_type="stuff")

In [None]:
query = "What did the president say about the Supreme Court"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

KeyboardInterrupt: ignored

In [None]:
query = "What did the president say about economy?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The president said that the economy has been "changing" for the better.'

In [None]:
query = input("your question: ")
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

your question: what the president say about Russia?


"The president said that Russia's economy is reeling and Putin alone is to blame."

# WORKING WITH PDF Files

In [34]:
!pip install unstructured
!pip install chromadb
!pip install Cython
!pip install tiktoken
!pip install unstructured[local-inference]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [35]:
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator

Download two PDF files from GitHub and copy them into the `pdfs` directory

In [36]:
!wget https://github.com/fabiomatricardi/cdQnA/raw/main/PLC_mediumArticle.pdf
!wget https://github.com/fabiomatricardi/cdQnA/raw/main/BridgingTheGaap_fromMedium.pdf
!mkdir -p pdfs
!cp *.pdf /content/pdfs

--2023-05-16 08:06:29--  https://github.com/fabiomatricardi/cdQnA/raw/main/PLC_mediumArticle.pdf
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/fabiomatricardi/cdQnA/main/PLC_mediumArticle.pdf [following]
--2023-05-16 08:06:29--  https://raw.githubusercontent.com/fabiomatricardi/cdQnA/main/PLC_mediumArticle.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9602910 (9.2M) [application/octet-stream]
Saving to: ‘PLC_mediumArticle.pdf’


2023-05-16 08:06:30 (352 MB/s) - ‘PLC_mediumArticle.pdf’ saved [9602910/9602910]

--2023-05-16 08:06:30--  https://github.com/fabiomatricardi/cdQnA/raw/main/

In [37]:
# connect your Google Drive
#from google.colab import drive
#drive.mount('/content/gdrive', force_remount=True)
import os
pdf_folder_path = '/content/pdfs'
os.listdir(pdf_folder_path)

['PLC_mediumArticle.pdf', 'BridgingTheGaap_fromMedium.pdf']

In [38]:
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn)) for fn in os.listdir(pdf_folder_path)]
loaders

[<langchain.document_loaders.pdf.UnstructuredPDFLoader at 0x7ff4dad5f190>,
 <langchain.document_loaders.pdf.UnstructuredPDFLoader at 0x7ff4dad5f280>]
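
The list comprehension above creates a loader for every entry in the folder, whatever its type. If the folder might contain other files, a small filter keeps only the PDFs. `list_pdf_paths` is a hypothetical helper for this sketch, built only on the standard library.

```python
from pathlib import Path


def list_pdf_paths(folder):
    """Return the PDF files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

The loaders could then be built as `[UnstructuredPDFLoader(str(p)) for p in list_pdf_paths(pdf_folder_path)]`.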

In [39]:
index = VectorstoreIndexCreator(
    embedding=HuggingFaceEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)).from_loaders(loaders)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [40]:
index

VectorStoreIndexWrapper(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7ff4dad5e7d0>)

In [41]:
#llm=HuggingFaceHub(repo_id="google/flan-t5-xl", model_kwargs={"temperature":0, "max_length":512})
llm2=HuggingFaceHub(repo_id="declare-lab/flan-alpaca-large", model_kwargs={"temperature":0, "max_length":512})

In [42]:
from langchain.chains import RetrievalQA
chain = RetrievalQA.from_chain_type(llm=llm2, 
                                    chain_type="stuff", 
                                    retriever=index.vectorstore.as_retriever(), 
                                    input_key="question")
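
Compared with the text-file section, where `similarity_search` and `chain.run` were called one after the other, `RetrievalQA` wraps both steps in a single call. The plain-Python sketch below shows that flow with stand-ins: `retrieve` and `generate` are hypothetical callables for the retriever and the LLM, and the prompt layout is simplified.

```python
def retrieval_qa(question, retrieve, generate, k=4):
    """Fetch up to `k` chunks for `question`, then ask the model with them as context."""
    chunks = retrieve(question)[:k]
    context = "\n\n".join(chunks)
    prompt = f"{context}\n\nQuestion: {question}\nHelpful Answer:"
    return generate(prompt)
```

With this shape, swapping the model (as done later with LaMini) only changes `generate`, while the index and the retriever stay the same.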

In [43]:
chain.run('What is the difference between a PLC and a PC?')

'PLCs are built to operate in industrial settings with varying temperatures, vibrations, and humidity levels, and are highly resistant to electrical noise.'

In [44]:
chain.run('What is a PLC?')

ValueError: ignored

In [None]:
chain.run('Where and why a PLC is used?')

'A PLC is used in industrial and manufacturing applications to control machinery and processes.'

In [None]:
ques = input('Your question: ')
chain.run(ques)

Your question: what is disruption of AI?


"The AI revolution is a shift in the way we think about technology and the way we use it. It is a shift from a focus on automation and automation to one that is focused on the development of AI and its potential to enhance people's lives and create a better future."

In [None]:
ques2 = input('Your question: ')
chain.run(ques2)

Your question: What is the new role of engineers?


'Engineers are becoming more responsible and ethical in their work.'



---

## test with MBZUAI/LaMini-Flan-T5-783M on pdf documents

In [None]:
llm6=HuggingFaceHub(repo_id="MBZUAI/LaMini-Flan-T5-783M", model_kwargs={"temperature":0, "max_length":512})

In [None]:
from langchain.chains import RetrievalQA
chain = RetrievalQA.from_chain_type(llm=llm6, 
                                    chain_type="stuff", 
                                    retriever=index.vectorstore.as_retriever(), 
                                    input_key="question")

In [None]:
chain.run('What is the difference between a PLC and a PC?')

In [None]:
chain.run('What is a PLC?')

'The CPU is responsible for executing the PLC program, performing data processing, and communicating with other devices.'

In [None]:
chain.run('Where and why a PLC is used?')

'The hardware components of a PLC are the CPU, I/O modules, power supply, communication ports, programming and monitoring interface, and chassis.'

In [None]:
ques = input('Your question: ')
chain.run(ques)

Your question: what is disruption of AI?


'The disruption of AI is the emergence of new technologies and products that can benefit society and benefit all of humanity.'

In [None]:
ques2 = input('Your question: ')
chain.run(ques2)

Your question: why engineers and philosophers?


'Engineers and philosophers can work together to develop solutions that are not only technically sound and ethically responsible but also commercially viable.'

# WORKING WITH URLS

In [None]:
from langchain.document_loaders import UnstructuredURLLoader
urls = [
    "https://basicplc.com/plc-programming/",
    "https://www.learnrobotics.org/blog/plc-programming-languages/"
]
loader2 = [UnstructuredURLLoader(urls=urls)]
#data = loader2.load()

In [None]:
index2 = VectorstoreIndexCreator(
    embedding=HuggingFaceEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)).from_loaders(loader2)



In [None]:
llm2=HuggingFaceHub(repo_id="declare-lab/flan-alpaca-large", model_kwargs={"temperature":0, "max_length":512})

In [None]:
from langchain.chains import RetrievalQA
chain = RetrievalQA.from_chain_type(llm=llm2, 
                                    chain_type="stuff", 
                                    retriever=index2.vectorstore.as_retriever(), 
                                    input_key="question")

In [None]:
chain.run('How do you program a PLC?')

'PLC Programming starts by identifying the problem, creating a sequence of operations based on binary logic, entering a program using a language, and simulating the program in your software.'

In [None]:
chain.run('What is the problem to identify?')

'The problem to identify is the need to control the flow of water into a tank.'

In [None]:
chain.run('What is ladder diagram?')

'Ladder Logic Programming is a PLC programming language that is used to create a diagram that shows the connections between inputs and outputs. It is derived from the Relay Logic Diagrams and uses almost the same context.'

In [None]:
ques = input('Your question: ')  # e.g. What is function block diagram?
chain.run(ques)

Your question: whaat is function block diagram?


'Functional Blocks is a simple way of PLC programming where there are “Function blocks” (hence the name) are available in the programming software.'

In [None]:
ques2 = input('Your question: ')
chain.run(ques2)

Your question: what are the fundamentals of logic?


'The fundamentals of logic are NOT, AND, OR, XOR, NAND, NOR, and XNOR.'



---

## test with MBZUAI/LaMini-Flan-T5-783M on URLs



In [None]:
llm6=HuggingFaceHub(repo_id="MBZUAI/LaMini-Flan-T5-783M", model_kwargs={"temperature":0, "max_length":512})

In [None]:
from langchain.chains import RetrievalQA
chain = RetrievalQA.from_chain_type(llm=llm6, 
                                    chain_type="stuff", 
                                    retriever=index2.vectorstore.as_retriever(), 
                                    input_key="question")

In [None]:
chain.run('How do you program a PLC?')

'The first step in programming a PLC is to identify the problem, create a sequence of operations based on binary logic, enter a program using a language, and simulate the program in your software.'

In [None]:
chain.run('What is the problem to identify?')

'The problem to identify is the difference between the AND and NAND conditions in PLC programming.'

In [None]:
chain.run('What is ladder diagram?')

"The ladder logic diagram is a visual representation of the system's operation to display the sequence of actions involved in the operation."

In [None]:
ques = input('Your question: ')  # e.g. What is function block diagram?
chain.run(ques)

Your question: what is function block diagram?


'The most commonly used PLC programming language is the Ladder Logic Diagram.'

In [None]:
ques2 = input('Your question: ')
chain.run(ques2)

Your question: what are the fundamentals of logic?


'The answer is not provided in the given context.'