# Directly calling LLMs

In this notebook, we will implement a simple approach to using LLMs by directly calling them.

We will use LangChain to abstract away the LLM integration.

Let's start by installing some dependencies

In [None]:
%pip install -r requirements.txt

To load the content from the pdf files, we will define a custom function named (`read_files`). We will also define a custom class (`FileContent`) that we will use to store, as the name implies, content from files.

We will use Pydantic for the class definition, as it comes with built-in validation, automatic data conversion and more. Our custom class will be very simple and only have 2 attributes:
- `path`: to store the full path to the file
- `content`: the content of the file

Pay attention to how we define the `FileContent` class. The class name, its attributes, and the descriptions will all be passed to the LLM, as it is needed to force a specific schema for its outputs. More information [here]().

The function `read_files` will receive a list of file paths and will return a list of `FileContent` objects.

In [None]:
import os
import asyncio
from langchain_community.document_loaders import PyPDFLoader
from typing import List
from pydantic import BaseModel, Field

# FileContent Definition
class FileContent(BaseModel):
    path: str = Field(
        description="Full path to the file"
    )
    content: str = Field(
        description="Raw content of the file"
    )

async def read_file_async(path: str) -> FileContent:
    """
    Asynchronously loads a PDF file using LangChain's PyPDFLoader and extracts text content.
    
    Args:
        path (str): The file path to process.

    Returns:
        FileContent: An object containing the extracted text content.
    """
    try:
        if not os.path.exists(path):
            print(f"Warning: File not found {path}")
            return FileContent(path=path, content="")
        
        loader = PyPDFLoader(path)
        pages = await asyncio.to_thread(loader.load)  # Run synchronous PDF loading in a thread
        # the line below works well for small files, but I would suggest a different approach for large file
        content = "\n".join(page.page_content for page in pages)
        return FileContent(path=path, content=content)
    except Exception as e:
        print(f"Error processing {path}: {str(e)}")
        return FileContent(path=path, content="")

async def read_files(file_list: List[str]) -> List[FileContent]:
    """
    Asynchronously loads multiple PDF files and extracts text content.
    
    Args:
        file_list (List[str]): The list of file paths to process.

    Returns:
        List[FileContent]: A list of FileContent objects containing extracted text.
    """
    tasks = [read_file_async(path) for path in file_list]
    return await asyncio.gather(*tasks)

Let's load some files.

I have a bunch of pdfs files in the `~/Desktop/pdfs` folder.

In [None]:
base_dir = os.path.expanduser('~/Desktop/pdfs')

input_files = \
    [
        os.path.join(base_dir, "1.pdf"),
        os.path.join(base_dir, "2.pdf"),
        os.path.join(base_dir, "3.pdf"),
        os.path.join(base_dir, "4.pdf"),
        os.path.join(base_dir, "5.pdf")
]

In [None]:
processed_files = await read_files(input_files)

Let's take a look at the content of the first 3 files

In [None]:
for content in processed_files[:3]:
    print(content.path)
    print(content.content)
    print("------")
    print("------")

## LLM invocation

Lets set aside the pdf files for a moment, while we set everything up to use an LLM.

We will use it to summarize the content of the files.

Before invoking the LLM, we will need to define some parameters, including defining which model we will use.

In [7]:
from utils import load_llm

llm = load_llm()

Loading LLM...
Parameters:
max_tokens: 8192, temperature: 0.1, top_p: 0.4
Using Anthropic. Model: claude-3-5-haiku-20241022


We now have everything we need to invoke the LLM. Let's do a quick test.

In [None]:
llm.invoke("Hello, how are you?")

We could use the code above to do the summarization, but as explained before, forcing the LLM output to follow a specific schema is a good idea.

As a recap, our use case is to parse random pdf files from newsletters, docs, etc. Each file will contain multiple discrete sections, and the LLM will be tasked with identifying and summarizing each section.

To use schema validation we can use `with_structured_output` from Langchain. More information [here](https://python.langchain.com/docs/how_to/structured_output/). To do so we will need to have a custom class to define the schema.

Therefore, we will create 2 classes:
- `SectionSchema`: will define the schema for each individual section.
- `FileSummary`: will contain all news from a file.

In [8]:
from typing import List, Iterator

class Section(BaseModel):
    title: str = Field(
        description="Title of the section",
        min_length = 5,
        max_length = 100
    )
    summary: str = Field(
        description="Comprehensive summary of the file content",
        min_length = 100,
        max_length = 1000
    )

class FileSummary(BaseModel):
    path: str = Field(
        description="Full path to the file"
    )
    sections: List[Section] = Field(
        description="List of sections extracted from the file",
        min_items=1
    )
    category: str = Field(
        description="Category for the field. It can be a newsletter, documentation page, etc."
    )

    # Iterator object
    def __iter__(self) -> Iterator[Section]:
        return iter(self.sections)
    
    # To allow splicing
    def __getitem__(self, index: int) -> Section:
        return self.sections[index]

In [9]:
structured_llm = llm.with_structured_output(FileSummary)

Let's build a prompt and do some basic input sanitization.

Note: Since I have direct control on the input files I am using, I wont spend too much time with input sanitization and will mostly just use Langchain's [PromptTemplates](https://python.langchain.com/docs/concepts/prompt_templates/).

We will build a simple prompt and define a placeholder for the content we want to summarize.

In [10]:
from langchain.prompts import PromptTemplate

template = """

<text>{content}</text>

You are an expert at analyzing and summarizing files.
Analyze the text contained within the <text> tags and identify:
- individual sections within the file
- an overall category for the file (newsletter, documentation page, blogpost, etc)
For each section, create a title and a summary.

"""
prompt = PromptTemplate(
    input_variables = ["content"],
    template=template
)

We have everything we need to invoke the LLM with the specified schema. Lets try it out with a single file!

In [11]:
test_response = structured_llm.invoke(prompt.format(content=processed_files[0].content))

Let's see the results!

In [12]:
for section in test_response:
    print(section.title)
    print(section.summary)
    print("---------------")


Introduction to LangGraph: Core Concepts
LangGraph is a library for modeling agent workflows as graphs. It uses three key components: State (a shared data structure), Nodes (Python functions encoding agent logic), and Edges (functions determining node execution flow). The library allows creating complex, looping workflows that evolve state over time, with nodes and edges being flexible Python functions that can contain LLMs or standard Python code.
---------------
Graph Execution and Message Passing
LangGraph uses a message-passing algorithm inspired by Google's Pregel system. Execution proceeds in discrete 'super-steps', where nodes become active when receiving messages. Nodes run their functions, send updates, and can run in parallel or sequentially. The graph execution terminates when all nodes are inactive and no messages are in transit.
---------------
StateGraph and Graph State Management
The StateGraph class is the main graph class, parameterized by a user-defined State object. 

It works!

### Expanding the logic to summarize all files

Now that we have a summary of a single file, lets do the same across all files.

It would be a good idea to build a function that does so.

Since we already have both the `FileContent` and `FileSummary` classes defined, we can use them both.

The new function will receive the list of `FileContent` we created before and return a list of `FileSummary` objects.

Note: I decided to not use concurrency on this function, as I am running a local LLM and my setup is better suited to process LLM request one at a time. If your setup does, I suggest you to implement concurrency :)

In [13]:
from typing import List
from langchain_core.language_models import BaseLanguageModel

def summarize_files(file_list: List[FileContent], llm: BaseLanguageModel) -> List[FileSummary]:
    """
    Summarize PDF files using an LLM and retuns a list of
    FileSummary objects with the summarize news content.
    
    Args:
        file_list (list): the list of files to summarize
        llm (BaseLanguageModel): an LLM to use for summarization

    Returns:
        List[FileSummary]: List of FileSummary objects with the summarized news
    """

    structured_llm = llm.with_structured_output(FileSummary)

    file_summary_list = []

    for file in file_list:
        print(f"processing file {file.path}")
        try:           
            file_summary = structured_llm.invoke(prompt.format(content=file.content))
            file_summary.path = file.path
            
        except Exception as e:
            print(f"Error generating title & summary {str(e)}")
            file_summary = FileSummary(path=file.path, news_items=[])


        # create new FileContent object with the extracted text
        file_summary_list.append(file_summary)

    return file_summary_list

Let's run the function!

In [None]:
summarized_files = summarize_files(file_list=processed_files, llm=llm)

Let's see some of the results

In [None]:
for sections in summarized_files[0]:
    print(f"title: {sections.title}\nsummary: {sections.summary}")
    print("--------")

## Classes & utils re-utilization

To make it easier in the following notebooks, we will save the class definitions and functions we created on this notebook into a sepparate file, `utils.py` that we can import.

I'll manually copy the classes and do some small modifications for this to work as an import.

Then we'll test the code works well.

In [16]:
import os

# Classes
from utils import FileContent, Section, FileSummary

# Functions
from utils import read_files, summarize_files, load_llm

llm = load_llm()

Loading LLM...
Parameters:
max_tokens: 8192, temperature: 0.1, top_p: 0.4
Using Anthropic. Model: claude-3-5-haiku-20241022


In [17]:
base_dir = os.path.expanduser('~/Desktop/pdfs')

input_files = \
    [
        os.path.join(base_dir, "1.pdf"),
        os.path.join(base_dir, "2.pdf"),
        os.path.join(base_dir, "3.pdf"),
        os.path.join(base_dir, "4.pdf"),
        os.path.join(base_dir, "5.pdf")
]
processed_files = await read_files(input_files)

In [None]:
summarized_files = summarize_files(file_list=processed_files, llm=llm)

In [20]:
for sections in summarized_files[0]:
    print(f"title: {sections.title}\nsummary: {sections.summary}")
    print("--------")

title: Introduction to LangGraph
summary: LangGraph is a library that models agent workflows as graphs. It uses three key components: State (a shared data structure), Nodes (Python functions encoding agent logic), and Edges (functions determining node execution flow). The core mechanism involves message passing between nodes in discrete 'super-steps', allowing complex, evolving workflows with flexible state management.
--------
title: Graph State and Reducers
summary: The graph state is defined using TypedDict or Pydantic BaseModel. Each state key can have an independent reducer function that determines how updates are applied. By default, updates override existing values, but custom reducers can implement more complex update strategies like list concatenation. The state can include message histories, with special handling for message tracking and deserialization.
--------
title: Nodes and Edges
summary: Nodes are Python functions that process the graph state, while edges define routin