https://medium.com/data-science-collective/the-complete-guide-to-building-your-first-ai-agent-its-easier-than-you-think-c87f376c84b2

What are some good cheap models for classification?

I want to classify text (YouTube transcripts or articles) into categories (finance, crypto, education, news).
- Then I want to summarize the text.
- Then I want to extract all personas and their opinions / sentiment.
- Then I want to extract all entities (organisations / companies) mentioned in the text.

In [1]:
import os
from typing import TypedDict, List
from langgraph.graph import StateGraph, END
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from dotenv import load_dotenv

load_dotenv() 

True

In [2]:
llm = ChatOpenAI(
  model="deepseek/deepseek-r1",
  temperature=0,
  openai_api_key=os.getenv("OPENROUTER_API_KEY"),
  base_url="https://openrouter.ai/api/v1",
)

In [3]:
response = llm.invoke("Hello! Are you working?") 
print(response.content)

Hello! Yes, I'm up and running. How can I assist you today? 😊


In [4]:
class State(TypedDict):
    text: str  # Stores the original input text
    classification: str  # Represents the classification result (e.g., category label)
    human_entities: List[str]  # Names of people mentioned in the text
    company_entities: List[str]  # Names of companies / organisations mentioned in the text
    summary: str  # Stores a summarized version of the text

In [5]:
def classification_node(state: State):
   """
   Classify the text into one of the predefined categories.
   
   Parameters:
       state (State): The current state dictionary containing the text to classify
       
   Returns:
       dict: A dictionary with the "classification" key containing the category label
       
   Categories:
       - news: Factual reporting of current events
       - podcast: Discussion between hosts about one or more subjects
       - finance: Discussion about finance, markets, economics, etc.
       - education: Discussion about education, learning, teaching, etc.
       - other: Content that doesn't fit the above categories
   """

   instructions = """
   You are a helpful assistant that classifies YouTube transcripts into one of the following categories:
   - news: Factual reporting of current events
   - podcast: Discussion between hosts about one or more subjects
   - finance: Discussion about finance, markets, economics, etc. Could be one or multiple hosts.
   - education: Discussion about education, learning, teaching, etc. Could be one or multiple hosts.
   - other: Content that doesn't fit the above categories

   Return exactly one of 'news', 'podcast', 'finance', 'education', or 'other', with no other words or characters.
   """

   # Define a prompt template that asks the model to classify the given text
   prompt = PromptTemplate(
       input_variables=["text"],
       template= instructions + "\n\nText: {text}\n\nCategory:"
   )

   # Format the prompt with the input text from the state
   message = HumanMessage(content=prompt.format(text=state["text"]))

   # Invoke the language model and normalize the reply to a lowercase label
   classification = llm.invoke([message]).content.strip().lower()

   # Return the classification result in a dictionary
   return {"classification": classification}
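
Models do not always return the bare label the prompt asks for; replies such as `Finance.` or `**news**` are possible. A minimal sketch of a post-processing step, assuming a hypothetical helper `normalize_label` (not part of the original notebook) that maps the raw reply onto the five allowed labels and falls back to `other`:

```python
import re

# The five labels the classification prompt asks the model to return
ALLOWED_LABELS = ("news", "podcast", "finance", "education", "other")


def normalize_label(raw: str, allowed=ALLOWED_LABELS, fallback: str = "other") -> str:
    """Map a raw model reply onto one of the allowed labels (hypothetical helper)."""
    # Lowercase and pull out alphabetic words, dropping punctuation and markdown
    words = re.findall(r"[a-z]+", raw.lower())
    # Return the first word that is a known label
    for word in words:
        if word in allowed:
            return word
    return fallback
```

Inside `classification_node`, this could wrap the model reply, for example `normalize_label(llm.invoke([message]).content)`, so downstream code only ever sees one of the five labels.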

In [6]:
def human_entity_extraction_node(state: State):

  instructions = """
  Extract all the Human entities from the following transcript. 
  Provide the result as a comma-separated list.

  The output should be a comma-separated list of human entities, no other text or characters.
  Parsable by a python script.
  """
  prompt = PromptTemplate(
      input_variables=["text"],
      template= instructions + "\n\nText: {text}\n\nHuman Entities:"
  )
  
  message = HumanMessage(content=prompt.format(text=state["text"]))
  
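  # Split on commas and strip whitespace so "A,B" and "A, B" parse the same way
  raw = llm.invoke([message]).content
  human_entities = [name.strip() for name in raw.split(",") if name.strip()]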
  
  # Return dictionary with entities list to be merged into agent state
  return {"human_entities": human_entities}

In [7]:
def company_entity_extraction_node(state: State):

  instructions = """
  Extract all the Company entities from the following transcript. 
  Provide the result as a comma-separated list.

  The output should be a comma-separated list of company entities, no other text or characters.
  Parsable by a python script.
  """
  prompt = PromptTemplate(
      input_variables=["text"],
      template= instructions + "\n\nText: {text}\n\nCompany Entities:"
  )
  
  message = HumanMessage(content=prompt.format(text=state["text"]))
  
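  # Split on commas and strip whitespace so "A,B" and "A, B" parse the same way
  raw = llm.invoke([message]).content
  company_entities = [name.strip() for name in raw.split(",") if name.strip()]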
  
  # Return dictionary with entities list to be merged into agent state
  return {"company_entities": company_entities}
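
Both extraction nodes rely on the model returning a clean comma-separated list, which is not guaranteed: replies may come back one per line, with bullets, or with duplicates. The sketch below shows one way to parse such replies more defensively. `parse_entity_list` is a hypothetical helper, not part of the original notebook, and the separators it accepts are an assumption about what the model might emit.

```python
import re


def parse_entity_list(raw: str) -> list[str]:
    """Turn a comma- or newline-separated model reply into a clean list (hypothetical helper)."""
    entities = []
    seen = set()
    # Accept commas, semicolons, and line breaks as separators
    for part in re.split(r"[,;\n]+", raw):
        # Strip whitespace, list bullets, and surrounding quotes
        name = part.strip().lstrip("-*• ").strip("\"'` ")
        key = name.lower()
        # Skip empty fragments and case-insensitive duplicates
        if name and key not in seen:
            seen.add(key)
            entities.append(name)
    return entities
```

One limitation: names that contain a comma themselves (for example "Apple, Inc.") would be split in two, so asking the model for JSON output may be the more robust choice if that matters.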

In [8]:
def summarize_node(state: State):
    # Create a template for the summarization prompt
    # This tells the model to produce a structured, six-section summary of the transcript
    summarization_prompt = PromptTemplate.from_template(
        """You are a helpful assistant that summarizes transcripts of youtube videos. 
        Summarize the transcript in a way that is easy to understand and contains the most important information. 
        - First section should be a high-level summary - aim for 3-5 concise bullet points.
        - Second section should list the speakers and the main points they make.
        - Third section should be a list of topics that are discussed.
        - Fourth section should flag if any stock tickers or cryptocurrencies are mentioned. If so, what was the sentiment on them?
        - Fifth section should be about market outlook. Are speakers bullish or bearish - short or long term? Only if the transcript is about markets or trading.
        - Sixth section should flag any other interesting information that is not covered in the other sections, and / or perhaps one or a few meaningful quotes from any of the speakers.
        Keep it short and concise; each section should be clearly separated and contain up to 10 bullet points.
        
        Text: {text}
        
        Summary:"""
    )
    
    # Create a chain by connecting the prompt template to the language model
    # The "|" operator pipes the output of the prompt into the model
    chain = summarization_prompt | llm
    
    # Execute the chain with the input text from the state dictionary
    # This passes the text to be summarized to the model
    response = chain.invoke({"text": state["text"]})
    
    # Return a dictionary with the summary extracted from the model's response
    # This will be merged into the agent's state
    return {"summary": response.content}

In [9]:
workflow = StateGraph(State)

# Add nodes to the graph
workflow.add_node("classification_node", classification_node)
workflow.add_node("human_entity_extraction", human_entity_extraction_node)
workflow.add_node("company_entity_extraction", company_entity_extraction_node)
workflow.add_node("summarization", summarize_node)

# Add edges to the graph
workflow.set_entry_point("classification_node") # Set the entry point of the graph
workflow.add_edge("classification_node", "human_entity_extraction")
workflow.add_edge("human_entity_extraction", "company_entity_extraction")
workflow.add_edge("company_entity_extraction", "summarization")
workflow.add_edge("summarization", END)

# Compile the graph
app = workflow.compile()
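
Each node above returns a partial dictionary, and the graph merges it into the shared state before the next node runs. The plain-Python sketch below only illustrates that merge behaviour for a linear chain; it is not how LangGraph is implemented, and `run_linear` and the stub nodes are hypothetical names that make no model calls.

```python
from typing import Callable, Dict, List

Node = Callable[[Dict], Dict]


def run_linear(nodes: List[Node], state: Dict) -> Dict:
    """Run nodes in order, merging each node's partial update into the state."""
    current = dict(state)  # copy so the caller's input is not mutated
    for node in nodes:
        update = node(current)  # each node sees all keys written so far
        current.update(update)  # a key returned again overwrites the earlier value
    return current


# Stub nodes standing in for the LLM-backed nodes (no model calls here)
def fake_classify(state: Dict) -> Dict:
    return {"classification": "finance" if "market" in state["text"] else "other"}


def fake_summarize(state: Dict) -> Dict:
    return {"summary": state["text"][:20]}


demo_result = run_linear(
    [fake_classify, fake_summarize],
    {"text": "The market rallied today."},
)
```

Because no node in this notebook reads another node's output, the order of the four nodes affects only latency, not the result.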

In [10]:

import json

# Load the sample transcript (a list of segments, each with a "text" field)
with open("./data/transcript.json", "r") as f:
    segments = json.load(f)

# Convert the transcript to a single newline-separated string
sample_text = "\n".join(item["text"] for item in segments)

# Create the initial state with our sample text
state_input = {"text": sample_text}

# Run the agent's full workflow on our sample text
result = app.invoke(state_input)

print("DONE!")

# Print each component of the result:
# - The classification category ('news', 'podcast', 'finance', 'education', or 'other')
# print("Classification:", result["classification"])

# - The extracted human entities:
# print("\nHuman Entities:", result["human_entities"])

# - The extracted company entities:
# print("\nCompany Entities:", result["company_entities"])

# - The generated summary of the text
# print("\nSummary:", result["summary"])


DONE!
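
The loading step assumes `transcript.json` holds a list of segments, each with a `text` field. The extra `start` and `duration` fields in the sample below are an assumption about the file, not something the notebook confirms. A small sketch of the join, with a hypothetical `transcript_to_text` helper that also skips empty segments:

```python
import json


def transcript_to_text(segments: list) -> str:
    """Join the "text" field of each segment into one newline-separated string."""
    lines = []
    for item in segments:
        # Use .get so a segment without a "text" key is skipped instead of raising
        text = (item.get("text") or "").strip()
        if text:
            lines.append(text)
    return "\n".join(lines)


# Made-up sample data in the assumed shape
sample_json = (
    '[{"text": "Welcome back.", "start": 0.0, "duration": 1.5},'
    ' {"text": "  "},'
    ' {"text": "Today: rates."}]'
)
joined = transcript_to_text(json.loads(sample_json))
```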


In [10]:
# Save result to file
with open("./data/result.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)