## PDF Document Loaders
- Load various kinds of documents from the web and from local files.
- Apply an LLM to the documents for summarization and question answering.

### Project 1: Question Answering from PDF Document
- We will load a document from a local file and use an LLM to answer questions about it.
- Let's use a research paper on the misuse of health supplements for workouts.

rag-dataset: git@github.com:laxmimerit/rag-dataset.git

```bash
git clone git@github.com:laxmimerit/rag-dataset.git
```

In [9]:
from dotenv import load_dotenv

load_dotenv('../env')

True

In [10]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("rag-dataset/health supplements/1. dietary supplements - for whom.pdf")

docs = loader.load()

In [11]:
len(docs)

17

In [12]:
# docs[0].metadata
# print(docs[0].page_content)

In [13]:
### Read the list of PDFs in the dir
import os

pdfs = []
for root, dirs, files in os.walk("rag-dataset"):
    # print(root, dirs, files)
    for file in files:
        if file.endswith(".pdf"):
            pdfs.append(os.path.join(root, file))
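
The same directory walk can also be sketched with `pathlib`. `find_pdfs` is a hypothetical helper name, not part of the notebook. The sorting and the case-insensitive suffix check are additions that the `os.walk` loop does not perform.

```python
from pathlib import Path


def find_pdfs(root):
    """Return sorted paths (as strings) of every PDF file below `root`."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*")          # walk all sub-directories
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

With the dataset cloned as above, `pdfs = find_pdfs("rag-dataset")` should give the same files as the loop, in sorted order, plus any whose extension is written in upper case.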

In [14]:
docs = []
for pdf in pdfs:
    loader = PyMuPDFLoader(pdf)
    temp = loader.load()
    docs.extend(temp)

    # print(temp)
    # break

In [15]:
len(docs)

324

In [16]:
def format_docs(docs):
    return "\n\n".join([x.page_content for x in docs])


context = format_docs(docs)

In [17]:
docs[0]

Document(metadata={'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2024-10-30T20:21:42-07:00', 'source': 'rag-dataset/finance/facebook/META-Q3-2024-Earnings-Call-Transcript.pdf', 'file_path': 'rag-dataset/finance/facebook/META-Q3-2024-Earnings-Call-Transcript.pdf', 'total_pages': 19, 'format': 'PDF 1.7', 'title': 'META Q3 2024 Earnings Call Transcript', 'author': 'Kenneth Dorell;Jonathan Rong Li', 'subject': '', 'keywords': '', 'moddate': '2024-10-30T20:21:42-07:00', 'trapped': '', 'modDate': "D:20241030202142-07'00'", 'creationDate': "D:20241030202142-07'00'", 'page': 0}, page_content="1 \n \nMeta Platforms, Inc. (META) \nThird Quarter 2024 Results Conference Call \nOctober 30th, 2024 \n \nKenneth Dorell, Director, Investor Relations \n \n \nThank you. Good afternoon and welcome to Meta Platforms third quarter 2024 earnings \nconference call. Joining me today to discuss our results are Mark Zuckerberg, CEO and Susan Li,
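
The first document comes from the `finance` folder, so `docs` (and therefore `context`) mixes every topic in the dataset, not only the health-supplement papers. A minimal sketch of narrowing the list by the `source` metadata field follows. `filter_by_source` and the `Page` stand-in class are hypothetical names, and the sample file names are made up so the example runs without loading any PDFs.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Stand-in for a loaded document: only the two attributes used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_source(docs, keyword):
    """Keep documents whose 'source' metadata contains `keyword`."""
    return [d for d in docs if keyword in d.metadata.get("source", "")]


# Illustrative sample data; the file names are invented for this sketch.
sample = [
    Page("earnings call", {"source": "rag-dataset/finance/facebook/report.pdf"}),
    Page("supplement review", {"source": "rag-dataset/health supplements/paper.pdf"}),
]
health_only = filter_by_source(sample, "health supplements")
print(len(health_only))  # 1
```

Applied to the real list, `docs = filter_by_source(docs, "health supplements")` before calling `format_docs` would restrict the context to that folder.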

In [18]:
# print(context)

In [19]:
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4o-mini")


In [20]:
encoding.encode("congratulations"), encoding.encode("rqsqeft")

([542, 111291, 14571], [81, 31847, 80, 5276])

In [21]:
len(encoding.encode(docs[0].page_content))

627

In [22]:
len(encoding.encode(context))

230329
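
A context of 230,329 tokens is likely more than a small local model can read in a single prompt, so it can help to cap the text at a token budget before sending it. The sketch below assumes a hypothetical helper, `trim_to_token_budget`, that accepts any pair of encode/decode callables. A toy whitespace tokenizer is used so the example runs on its own; with the `encoding` object above one would pass `encoding.encode` and `encoding.decode` instead.

```python
def trim_to_token_budget(text, encode, decode, max_tokens):
    """Return `text` cut to at most `max_tokens` tokens.

    `encode` maps a string to a list of tokens and `decode` maps a token
    list back to a string.
    """
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text                      # already within budget
    return decode(tokens[:max_tokens])   # keep only the leading tokens


# Toy whitespace tokenizer (an assumption for illustration only).
def toy_encode(s):
    return s.split()


def toy_decode(tokens):
    return " ".join(tokens)


trimmed = trim_to_token_budget("one two three four five", toy_encode, toy_decode, 3)
print(trimmed)  # one two three
```

Truncation keeps only the beginning of the text, so anything past the budget is never seen by the model. This is a stopgap, not a substitute for selecting the relevant documents.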

In [23]:
969*64

62016

In [24]:
### Question Answering using LLM
from langchain_ollama import ChatOllama

from langchain_core.prompts import (SystemMessagePromptTemplate, HumanMessagePromptTemplate,
                                    ChatPromptTemplate)



from langchain_core.output_parsers import StrOutputParser

base_url = "http://localhost:11434"
model = 'llama3.2:3b'

llm = ChatOllama(base_url=base_url, model=model)


In [25]:
system = SystemMessagePromptTemplate.from_template("""You are helpful AI assistant who answer user question based on the provided context. 
                                                    Do not answer in more than {words} words""")

prompt = """Answer user question based on the provided context ONLY! If you do not know the answer, just say "I don't know".
            ### Context:
            {context}

            ### Question:
            {question}

            ### Answer:"""

prompt = HumanMessagePromptTemplate.from_template(prompt)

messages = [system, prompt]
template = ChatPromptTemplate(messages)

# template
# template.invoke({'context': context, 'question': "How to gain muscle mass?", 'words': 50})

qna_chain = template | llm | StrOutputParser()

In [26]:
qna_chain

ChatPromptTemplate(input_variables=['context', 'question', 'words'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['words'], input_types={}, partial_variables={}, template='You are helpful AI assistant who answer user question based on the provided context. \n                                                    Do not answer in more than {words} words'), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='Answer user question based on the provided context ONLY! If you do not know the answer, just say "I don\'t know".\n            ### Context:\n            {context}\n\n            ### Question:\n            {question}\n\n            ### Answer:'), additional_kwargs={})])
| ChatOllama(model='llama3.2:3b', base_url='http://localhost:11434')
| StrOutputParser()

In [None]:
response = qna_chain.invoke({'context': context, 'question': "How to gain muscle mass?", 'words': 50})
print(response)

In [None]:
response = qna_chain.invoke({'context': context, 'question': "How to reduce the weight?", 'words': 50})
print(response)

I don't see a question related to reducing weight in the text. The provided text appears to be about toxic mechanisms of action of botanical supplements and potential side effects, rather than weight loss. If you'd like to know how to reduce weight, I'd be happy to provide general information or help with a related topic.


In [None]:
response = qna_chain.invoke({'context': context, 'question': "How to do weight loss?", 'words': 50})
print(response)

I can help with that!

To answer your question, there is no single "best" way to lose weight, but here are some general tips that may be helpful:

1. **Eat a healthy and balanced diet**: Focus on whole, unprocessed foods like vegetables, fruits, whole grains, lean proteins, and healthy fats.
2. **Keep track of your calorie intake**: Use a food diary or an app to track your daily calorie consumption. Aim for a deficit of 500-1000 calories per day to promote weight loss.
3. **Stay hydrated**: Drink plenty of water throughout the day to help control hunger and boost metabolism.
4. **Exercise regularly**: Aim for at least 150 minutes of moderate-intensity aerobic exercise, or 75 minutes of vigorous-intensity aerobic exercise, or a combination of both, per week. You can also incorporate strength training and high-intensity interval training (HIIT) into your routine.
5. **Get enough sleep**: Aim for 7-9 hours of sleep per night to help regulate hunger hormones and support weight loss.
6. **B

In [None]:
response = qna_chain.invoke({'context': context, 'question': "How many planets are there outside of our solar system?", 'words': 50})
print(response)

There is no information about planets in the provided text. The text discusses botanical supplements, their active constituents, typical use and dosage, adverse effects, and potential herb-drug interactions. It does not mention planets or our solar system.


### Project 2: PDF Document Summarization

In [None]:
system = SystemMessagePromptTemplate.from_template("""You are helpful AI assistant who works as document summarizer. 
                                                   You must not hallucinate or provide any false information.""")

prompt = """Summarize the given context in {words}.
            ### Context:
            {context}

            ### Summary:"""

prompt = HumanMessagePromptTemplate.from_template(prompt)

messages = [system, prompt]
template = ChatPromptTemplate(messages)

summary_chain = template | llm | StrOutputParser()

In [None]:
summary_chain

ChatPromptTemplate(input_variables=['context', 'words'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], input_types={}, partial_variables={}, template='You are helpful AI assistant who works as document summarizer. \n                                                   You must not hallucinate or provide any false information.'), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'words'], input_types={}, partial_variables={}, template='Summarize the given context in {words}.\n            ### Context:\n            {context}\n\n            ### Summary:'), additional_kwargs={})])
| ChatOllama(model='llama3.2:3b', base_url='http://localhost:11434')
| StrOutputParser()

In [None]:
response = summary_chain.invoke({'context': context, 'words': 50})
print(response)

This article discusses the potential therapeutic and toxic mechanisms of action of botanical supplements, also known as herbal remedies. The authors review various botanicals, their primary active constituents, typical use, dosage, and reported adverse effects. They note that concurrent exposure to other compounds and the heterogeneity of herbal supplements can make it difficult to determine the specific mechanism of toxicity. Case reports have identified potential mechanisms for liver toxicity in some botanicals, such as black cohosh, kava kava, saw palmetto, and milk thistle. Other botanicals, such as yohimbe and ginkgo biloba, have been associated with non-hepatic symptoms like seizures and bleeding. The article also discusses potential herb-drug interactions, where the activation of metabolizing enzymes can affect the pharmacokinetics of drugs.

Some key points from the article include:

* Botanical supplements can have therapeutic and toxic mechanisms of action.
* Many botanicals 

In [None]:
response = summary_chain.invoke({'context': context, 'words': 500})
print(response)

This article discusses the potential toxicities and interactions of botanical supplements, including their active compounds, typical use, dosage, and reported adverse effects. The authors review various case reports and in vivo studies to highlight the mechanisms underlying these adverse effects, such as mitochondrial dysfunction, oxidative stress, and alteration of bile acid homeostasis. They also discuss herb-drug interactions, including induction or suppression of metabolizing enzymes, which can affect the pharmacokinetics of drugs. The article emphasizes the need for caution when using botanical supplements, particularly in combination with other compounds, and highlights the importance of monitoring patients for potential adverse effects.
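
Both summaries describe only botanical supplements, which suggests the model did not take in the whole 324-document context. One common workaround, sketched here under assumptions, is to group pages into batches that fit a token budget, summarize each batch, and then summarize the summaries. `batch_by_tokens` is a hypothetical helper. `count_tokens` is any callable that returns a token count, and the word-count stand-in below is for illustration only.

```python
def batch_by_tokens(texts, count_tokens, max_tokens):
    """Group consecutive texts so each batch stays within `max_tokens`.

    A single text larger than the budget is placed in a batch of its own.
    """
    batches, current, used = [], [], 0
    for text in texts:
        n = count_tokens(text)
        # Start a new batch when adding this text would exceed the budget.
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        batches.append(current)
    return batches


# Stand-in token counter (assumption): one token per whitespace-separated word.
def word_count(s):
    return len(s.split())


pages = ["a b c", "d e", "f g h i", "j"]
batches = batch_by_tokens(pages, word_count, max_tokens=5)
print(batches)  # [['a b c', 'd e'], ['f g h i', 'j']]
```

In this notebook, `texts` could be `[d.page_content for d in docs]` and `count_tokens` could be `lambda s: len(encoding.encode(s))`. Each batch joined with `"\n\n"` could then be passed to `summary_chain.invoke` as the `context`.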


### Project 3: Report Generation from PDF Document

Streamlit Tutorial: https://www.youtube.com/watch?v=hff2tHUzxJM&list=PLc2rvfiptPSSpZ99EnJbH5LjTJ_nOoSWW

In [None]:
response = qna_chain.invoke({'context': context, 
                             'question': "Provide a detailed report from the provided context. Write answer in Markdown.", 
                             'words': 2000})
print(response)

**Botanical Supplements and Potential Adverse Effects**

The use of botanical supplements has become increasingly popular, with many people turning to these products for their purported health benefits. However, like any other substance, botanicals can also pose potential risks and adverse effects.

**Commonly Used Botanical Supplements and Their Potential Risks**

1. **Black Cohosh (Cimicifuga racemosa)**
	* Associated with jaundice and liver failure in menopausal women
	* Possible mechanisms: mitochondrial dysfunction, oxidative stress, and alteration of bile acid homeostasis
2. **Kava Kava**
	* Linked to liver toxicity, sometimes requiring transplants
	* Possible mechanisms: depletion of glutathione, increasing oxidative stress, and inhibition of cyclooxygenases (mitochondrial dysfunction)
3. **Saw Palmetto**
	* Use has been associated with cholestatic hepatitis and pancreatitis
	* Possible mechanism: alterations in bile secretion
4. **Echinacea**
	* Cholestatic symptoms were seen i