# LLM Tutor

# Data Collection

## For research papers:

### 1. arXiv Full-Text Bulk Downloader (PDF):
- Initially, I experimented with [UnstructuredPDFLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.UnstructuredPDFLoader.html), but encountered some issues.
- ~~Subsequently, I transitioned to using BaseLoader and employed the PyMuPDF library to extract text from the PDFs.~~
- PyPDF worked out fine.
- The main challenge is that most papers rely heavily on figures rather than text, so the extracted text is less useful.
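
One way to cope with figure-heavy pages is to drop pages whose extracted text is too short to be useful before chunking. The following is a minimal sketch, not part of the original notebook: `filter_sparse_pages` and the `min_chars` threshold are hypothetical, and the input is assumed to be a list of plain page strings (for example, the `page_content` of each loaded page).

```python
def filter_sparse_pages(page_texts, min_chars=200):
    """Keep (page_index, text) pairs whose cleaned text has at least min_chars characters.

    min_chars is an illustrative threshold, not a tuned value.
    """
    kept = []
    for index, text in enumerate(page_texts):
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        if len(cleaned) >= min_chars:
            kept.append((index, cleaned))
    return kept


pages = [
    "Abstract " + "word " * 60,  # text-heavy page
    "Figure 3",                  # page that is mostly a figure
    "   \n\n  ",                 # empty page
]
useful = filter_sparse_pages(pages, min_chars=50)
print([index for index, _ in useful])
```

The threshold would need tuning per corpus; a page with a short but important caption would be dropped by this heuristic.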

In [54]:
from langchain.document_loaders import PyPDFLoader
# from langchain.document_loaders import UnstructuredPDFLoader

# Load the PDF file
pdf_path = '/Users/tusharchandra/Workspace/repos/LLM-Tutor/paper1.pdf'
# loader = UnstructuredPDFLoader(pdf_path)
loader = PyPDFLoader(pdf_path)

# Load and print the document
documents = loader.load()

documents[:5]

[Document(page_content='A Comprehensive Overview of Large Language Models\nHumza Naveeda, Asad Ullah Khana,∗, Shi Qiub,∗, Muhammad Saqibc,d,∗, Saeed Anware,f, Muhammad Usmane,f, Naveed Akhtarg,i,\nNick Barnesh, Ajmal Miani\naUniversity of Engineering and Technology (UET), Lahore, Pakistan\nbThe Chinese University of Hong Kong (CUHK), HKSAR, China\ncUniversity of Technology Sydney (UTS), Sydney, Australia\ndCommonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia\neKing Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia\nfSDAIA-KFUPM Joint Research Center for Artificial Intelligence (JRCAI), Dhahran, Saudi Arabia\ngThe University of Melbourne (UoM), Melbourne, Australia\nhAustralian National University (ANU), Canberra, Australia\niThe University of Western Australia (UWA), Perth, Australia\nAbstract\nLarge Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and\nbeyond. This s

In [55]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

document = text_splitter.split_documents(documents)

document[:5]

[Document(page_content='A Comprehensive Overview of Large Language Models\nHumza Naveeda, Asad Ullah Khana,∗, Shi Qiub,∗, Muhammad Saqibc,d,∗, Saeed Anware,f, Muhammad Usmane,f, Naveed Akhtarg,i,\nNick Barnesh, Ajmal Miani\naUniversity of Engineering and Technology (UET), Lahore, Pakistan\nbThe Chinese University of Hong Kong (CUHK), HKSAR, China\ncUniversity of Technology Sydney (UTS), Sydney, Australia\ndCommonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia\neKing Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia\nfSDAIA-KFUPM Joint Research Center for Artificial Intelligence (JRCAI), Dhahran, Saudi Arabia\ngThe University of Melbourne (UoM), Melbourne, Australia\nhAustralian National University (ANU), Canberra, Australia\niThe University of Western Australia (UWA), Perth, Australia\nAbstract\nLarge Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and', metadata={'so

In [58]:
# type(document)
type(document[0])

langchain_core.documents.base.Document
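
To make the `chunk_size=1000, chunk_overlap=200` settings concrete, here is a simplified sketch of overlapping character windows. It is an assumption-laden illustration only: `split_with_overlap` is a hypothetical helper, and unlike `RecursiveCharacterTextSplitter` it cuts at fixed character offsets instead of preferring separators such as paragraph breaks, newlines, or spaces.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])  # the tail of one chunk repeats at the head of the next
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.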

### 2. Using BaseLoader:

In [23]:
import fitz  # PyMuPDF
from langchain.document_loaders.base import BaseLoader
from langchain.docstore.document import Document

class PyMuPDFLoader(BaseLoader):
    def __init__(self, file_path: str):
        self.file_path = file_path

    def load(self):
        # Open the PDF file; the context manager closes it when done
        with fitz.open(self.file_path) as pdf:
            # Extract text from each page and join it into one string
            text = "".join(page.get_text() for page in pdf)

        # Create a LangChain Document object
        langchain_doc = Document(page_content=text)
        return [langchain_doc]

# Path to the PDF file
pdf_path = 'paper1.pdf'
loader = PyMuPDFLoader(pdf_path)

# Load the document
documents = loader.load()

# Save the content to a text file
with open('BaseLoaderOutput.txt', 'w', encoding='utf-8') as file:
    for doc in documents:
        file.write(doc.page_content)

print("PDF content has been saved to BaseLoaderOutput.txt")

PDF content has been saved to BaseLoaderOutput.txt
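
The loader above merges every page into a single `Document`, so page boundaries are lost. If page-level provenance matters for retrieval, one option is to keep one record per page. This is only a sketch under stated assumptions: `PageRecord` and `pages_to_records` are hypothetical names, and a plain dataclass stands in for LangChain's `Document` so the snippet runs without PyMuPDF or LangChain installed.

```python
from dataclasses import dataclass, field


@dataclass
class PageRecord:
    """Hypothetical stand-in for a document object: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_to_records(page_texts, source):
    """Build one record per non-empty page, tagged with its source and page index."""
    records = []
    for page_number, text in enumerate(page_texts):
        if text.strip():  # skip pages with no extractable text
            records.append(PageRecord(text, {"source": source, "page": page_number}))
    return records


records = pages_to_records(["Title page text", "", "Second page text"], "paper1.pdf")
for record in records:
    print(record.metadata["page"], record.page_content)
```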


## For web articles:

### 1. Web Article Downloader using BeautifulSoup (without langchain):

In [11]:
import requests
from bs4 import BeautifulSoup
import json

# URL to crawl
url = "https://aman.ai/primers/ai/LLM/"

# Function to extract and preprocess content from the webpage
def extract_content(url):
    # Send a GET request to the URL (the timeout avoids hanging indefinitely)
    response = requests.get(url, timeout=30)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Extract metadata and main content
        metadata = {
            "title": soup.title.string.strip() if soup.title else "",
            "url": url,
        }
        
        # Extract main content and preprocess it
        main_content = soup.get_text(separator=' ', strip=True)
        
        # Remove excessive blank spaces
        cleaned_content = ' '.join(main_content.split())
        
        return metadata, cleaned_content
    else:
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
        return None, None

# Extract content from the given URL
metadata, page_content = extract_content(url)

# Save metadata and content as JSON if extraction was successful
if metadata and page_content:
    data = {
        "metadata": metadata,
        "page_content": page_content
    }
    
    with open('LLM_Primer.json', 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False, indent=4)
    
    print("Content and metadata have been saved to LLM_Primer.json")
else:
    print("Content extraction failed.")

Content and metadata have been saved to LLM_Primer.json
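
Extracting text from the whole page also pulls in navigation text and, depending on the parser and library version, possibly script or style content. The sketch below shows the same extract-then-normalize idea using only the standard library while explicitly skipping `<script>` and `<style>`. It is not the notebook's method: `VisibleTextParser` and `visible_text` are hypothetical names, and the HTML string is made up for the demo.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Same whitespace normalization as the cell above
        return " ".join(" ".join(self._chunks).split())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><title>Demo</title><script>var x = 1;</script></head>"
    "<body><p>Hello   world</p><style>p { color: red; }</style></body></html>"
)
print(visible_text(page))
```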


### 2. Web Article Downloader using BeautifulSoup and WebBaseLoader (with langchain):

In [1]:
import bs4
from langchain.document_loaders import WebBaseLoader

url = "https://aman.ai/primers/ai/LLM/"

loader = WebBaseLoader(
    web_path=url,
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("page-content", "canonical"))),
)

text_docs = loader.load()



In [2]:
text_docs

[Document(page_content='\n\n\n\nPrimers • Overview of Large Language Models\n\n\n\nOverview\nEmbeddings \nContextualized vs. Non-Contextualized Embeddings\nUse-cases of Embeddings\nSimilarity Search with Embeddings \nDot Product Similarity \nGeometric intuition\n\n\nCosine Similarity\nCosine similarity vs. dot product similarity\n\n\n\n\nHow do LLMs work? \nLLM Training Steps\nReasoning\n\n\nRetrieval/Knowledge-Augmented Generation or RAG (i.e., Providing LLMs External Knowledge) \nProcess\nSummary\n\n\nVector Database Feature Matrix\nContext Length Extension \nChallenges with Context Scaling \nThe “Needle in a Haystack” Test\nStatus quo\nRAG vs. Ultra Long Context (1M+ Tokens)\n\n\nSolutions to Challenges \nPositional Interpolation (PI)\nRotary Positional Encoding (RoPE)\nALiBi (Attention with Linear Biases)\nSparse Attention\nFlash Attention\nMulti-Query Attention\n\n\nComparative Analysis\nDynamically Scaled RoPE \nApproach\nKey Benefits\nNTK-Aware Method Perspective\nSummary\

In [1]:
from langchain.document_loaders import JSONLoader

loader = JSONLoader(
        file_path="LLM_Primer.json",
        jq_schema=".metadata + {page_content: .page_content}",
        text_content=False  # Indicate that we are not expecting a single string as the page_content
    )

text_docs = loader.load()

text_docs



[Document(page_content='{"title": "Aman\'s AI Journal \\u2022 Primers \\u2022 Overview of Large Language Models", "url": "https://aman.ai/primers/ai/LLM/", "page_content": "Aman\'s AI Journal \\u2022 Primers \\u2022 Overview of Large Language Models Distilled AI Back to aman.ai Primers \\u2022 Overview of Large Language Models Overview Embeddings Contextualized vs. Non-Contextualized Embeddings Use-cases of Embeddings Similarity Search with Embeddings Dot Product Similarity Geometric intuition Cosine Similarity Cosine similarity vs. dot product similarity How do LLMs work? LLM Training Steps Reasoning Retrieval/Knowledge-Augmented Generation or RAG (i.e., Providing LLMs External Knowledge) Process Summary Vector Database Feature Matrix Context Length Extension Challenges with Context Scaling The \\u201cNeedle in a Haystack\\u201d Test Status quo RAG vs. Ultra Long Context (1M+ Tokens) Solutions to Challenges Positional Interpolation (PI) Rotary Positional Encoding (RoPE) ALiBi (Attenti
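
In the output above, `page_content` holds the whole serialized JSON object (title, URL, and body text with escaped `\u` sequences), so the metadata ends up inside the text that later gets chunked. As a hedged alternative, the sketch below reads a file with the same layout using only the `json` module and keeps text and metadata apart. `load_primer` is a hypothetical helper, and the demo writes its own small temporary file instead of relying on `LLM_Primer.json`.

```python
import json
import os
import tempfile


def load_primer(path):
    """Return (text, metadata) from a file shaped like {"metadata": {...}, "page_content": "..."}."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return data["page_content"], dict(data.get("metadata", {}))


# Demo file in the same layout as the JSON saved earlier (contents are made up)
example = {
    "metadata": {"title": "Example title", "url": "https://example.com/"},
    "page_content": "Example body text",
}
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(example, handle, ensure_ascii=False)
    text, metadata = load_primer(path)

print(text)
print(metadata)
```

With text and metadata separated, the splitter would operate on clean prose, and the title and URL could be attached to each chunk as metadata instead.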

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

documents = text_splitter.split_documents(text_docs)

documents

[Document(page_content='{"title": "Aman\'s AI Journal \\u2022 Primers \\u2022 Overview of Large Language Models", "url": "https://aman.ai/primers/ai/LLM/", "page_content": "Aman\'s AI Journal \\u2022 Primers \\u2022 Overview of Large Language Models Distilled AI Back to aman.ai Primers \\u2022 Overview of Large Language Models Overview Embeddings Contextualized vs. Non-Contextualized Embeddings Use-cases of Embeddings Similarity Search with Embeddings Dot Product Similarity Geometric intuition Cosine Similarity Cosine similarity vs. dot product similarity How do LLMs work? LLM Training Steps Reasoning Retrieval/Knowledge-Augmented Generation or RAG (i.e., Providing LLMs External Knowledge) Process Summary Vector Database Feature Matrix Context Length Extension Challenges with Context Scaling The \\u201cNeedle in a Haystack\\u201d Test Status quo RAG vs. Ultra Long Context (1M+ Tokens) Solutions to Challenges Positional Interpolation (PI) Rotary Positional Encoding (RoPE) ALiBi (Attenti

In [3]:
import os
from dotenv import load_dotenv
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings

load_dotenv()

True

In [4]:
embeddings = HuggingFaceInferenceAPIEmbeddings(api_key=os.getenv("HF_Token"), 
                                               model_name= "mixedbread-ai/mxbai-embed-large-v1")

In [5]:
len(documents)

376

KeyError: 0

In [9]:
# from langchain_community.embeddings import OpenAIEmbeddings
# from langchain_community.vectorstores import Chroma

# db = Chroma.from_documents(documents[:5], OpenAIEmbeddings(api_key=os.getenv("OPENAI_API_KEY")), persist_directory="db")
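
The commented-out cell indexes only the first five chunks. When embedding all 376, one option is to send the chunks in small batches so that a failed request affects only one batch. This is a sketch under assumptions: `batched` is a hypothetical helper, the batch size of 32 is arbitrary, and the embedding call itself is left out.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunks = [f"chunk {i}" for i in range(376)]  # same count as len(documents) above
batches = list(batched(chunks, batch_size=32))
print(len(batches), len(batches[-1]))
```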