## Load OpenAI's LLM

In [132]:
from langchain_openai import ChatOpenAI
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()
llm = ChatOpenAI(model="gpt-4o")

## Indexing
### Load

In [133]:
# Load the PDF documents from ./glossary
from langchain_community.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader("./glossary")
docs = loader.load()
print(len(docs), "documents loaded")

126 documents loaded


### Split
Together, the 126 loaded documents are too long to fit in the context window of many models. Even models that could fit the full glossary in their context window can struggle to find information in very long inputs.

To handle this, we’ll split each Document into chunks for embedding and vector storage. This should help us retrieve only the most relevant parts of the glossary at run time.

In [134]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

370
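
To make `chunk_size`, `chunk_overlap`, and `add_start_index` more concrete, here is a minimal sketch of a fixed-size sliding window. This is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter first tries to break on separators such as paragraphs and lines. The helper name `split_with_overlap` and the sample string are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs produced by a fixed-size sliding window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for start, chunk in chunks:
    print(start, chunk)
```

Each chunk repeats the last four characters of the previous one, which is the role `chunk_overlap=200` plays in the cell above: text near a chunk boundary still appears with some surrounding context. The printed start offset corresponds to the `start_index` metadata that `add_start_index=True` records.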

### Store

Now we need to index our text chunks so that we can search over them at runtime. The most common way to do this is to **embed the contents of each document split** and **insert these embeddings into a vector database** (or vector store).

In [135]:
from langchain_chroma import Chroma # Chroma vector store
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
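
The idea behind "embed, then search" can be sketched without any external service. The toy example below replaces `OpenAIEmbeddings` with simple word counts and replaces Chroma with a sorted list, so it only illustrates the ranking principle (cosine similarity between a query vector and chunk vectors). The function names and the toy chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a learned embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, chunks: list[str], k: int) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "the glossary defines the term model minority",
    "chunk overlap keeps context across boundaries",
    "vector stores index embeddings for search",
]
best = top_k("what is model minority", toy_chunks, k=1)
print(best)
```

A real embedding model maps text to dense vectors that capture meaning rather than exact word overlap, but the retrieval step is conceptually the same: score every stored chunk against the query and keep the top `k`.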

## Retrieval and Generation

### Retrieval

In [136]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})
retrieved_docs = retriever.invoke("What is Model Minority?")

len(retrieved_docs)

5

### Generation

Read the example article

In [137]:
with open("./articles/example.txt", "r") as f:
    text = f.read()

example_article = ' '.join(text.split('\n\n')[4:])
print(example_article)

There’s an old saying about traditions: Abandon a tradition it tells us and you will soon learn why it became one. That’s what California cities are learning today as city councils around the state have acted on the urging of Black Lives Matter protesters and diverted portions of their former police budgets to social action causes in previously underserved neighborhoods. But the spate of anti-Asian violence now playing out in California communities as disparate as Sacramento San Jose Los Angeles San Leandro and Orange County dramatically shows this was folly. The logical police response to on-street attacks ought to be more foot and squad car patrols but that’s not happening in most places despite the rise in anti-Asian hate crimes. So they continue showing no signs of abating. Before the coronavirus pandemic this was not a big problem. Hate crimes against Asians were relatively few and far between. But a huge upswing began just about the time ex-President Donald Trump dubbed the coron

Read the concept tree

In [174]:
with open("./concept_tree.txt", "r") as f:
    concept_tree = f.read()

Citation Setup

In [144]:
from langchain_core.pydantic_v1 import BaseModel, Field
from operator import itemgetter
from typing import List


class CitedAnswer(BaseModel):
    """Answer the user question based only on the given sources, and cite the sources used."""

    answer: str = Field(
        ...,
        description="The analysis of the user's news article (include main topics and relevant concepts), which is based only on the given sources and the concept tree.",
    )
    citations: List[int] = Field(
        ...,
        description="The integer IDs of the SPECIFIC sources which justify the answer.",
    )

structured_llm = llm.with_structured_output(CitedAnswer)

In [176]:
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

def format_docs_with_id(docs):
    formatted = [
        f"Source ID: {i}\nArticle Title: {doc.metadata['source']}\nArticle Snippet: {doc.page_content}"
        for i, doc in enumerate(docs)
    ]
    return "\n\n" + "\n\n".join(formatted)

template = """
You are a helpful assistant who can analyze the news article provided by the user using the given context and concept tree. Use the following pieces of context and the concept tree to analyze the article at the end.
Context: {context}
Concept Tree: {concept_tree}

Your analysis should include:
- Relevant Concepts: Identify the concepts from the provided hierarchical concept tree that are mentioned or relevant to the article. List these concepts and briefly explain their connection to the article's content, quoting the related sentences. Ensure the title of each concept exactly matches the concept name in the tree. Additionally, show all parent concepts of the relevant concepts.

Also, here are some tips to help you structure your analysis:
- When identifying relevant concepts from the tree, consider both explicit mentions and implicit references. Include parent concepts if multiple child concepts are relevant.
- Provide brief explanations for why you've selected each concept to justify your choices.
- If there are ambiguities or multiple interpretations possible, mention these in your analysis.

Article: {input}

Now, begin your analysis of the news article:
"""
custom_rag_prompt = PromptTemplate.from_template(template)

rag_chain_from_docs = (
    RunnablePassthrough.assign(
        context=lambda x: format_docs_with_id(x["context"]),
        concept_tree=lambda x: concept_tree,
    )
    | custom_rag_prompt
    | structured_llm
)

retrieve_docs = (lambda x: x["input"]) | retriever

chain = RunnablePassthrough.assign(context=retrieve_docs).assign(
    answer=rag_chain_from_docs
)
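
To see what the model actually receives as `{context}`, the sketch below runs the same `format_docs_with_id` logic on two stand-in documents. `StubDocument` is a hypothetical replacement for LangChain's `Document` (only `page_content` and `metadata` are mimicked), and the file names and snippets are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in exposing the two attributes the formatter reads."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_id(docs):
    # Same formatting as in the cell above: each source gets its list index as ID.
    formatted = [
        f"Source ID: {i}\nArticle Title: {doc.metadata['source']}\nArticle Snippet: {doc.page_content}"
        for i, doc in enumerate(docs)
    ]
    return "\n\n" + "\n\n".join(formatted)


stub_docs = [
    StubDocument("First snippet.", {"source": "glossary/a.pdf"}),   # made-up file name
    StubDocument("Second snippet.", {"source": "glossary/b.pdf"}),  # made-up file name
]
context = format_docs_with_id(stub_docs)
print(context)
```

Because the ID is simply the position in the retrieved list, the integers in `CitedAnswer.citations` can be read as indexes into that same list.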

In [177]:
result = chain.invoke({"input": example_article})

In [178]:
result_dict = dict(result["answer"])
print(result_dict)

{'answer': '**Relevant Concepts:**\n\n1. **COVID-19/coronavirus/pandemic (Triggering anti-Asian hate)**\n   - The article mentions the dramatic increase in anti-Asian hate crimes coinciding with the COVID-19 pandemic, specifically noting that it began around the time ex-President Donald Trump referred to the coronavirus as the “Chinese virus.” This concept is directly connected to the context of the pandemic fueling discrimination against Asian Americans. \n   - Quote: “But a huge upswing began just about the time ex-President Donald Trump dubbed the coronavirus plague the ‘Chinese virus.’”\n\n2. **Types of Anti-Asian hate > Individual-level racism > Stereotype B (with hatred) > China/Chinese/Asian virus”/“Kung flu/plague/Ramen noodle flu**\n   - The derogatory term “Chinese virus” used by Donald Trump is an example of the stereotypes and hateful language mentioned in the article. This aligns with the concept of using terms like “China/Chinese/Asian virus” to incite hate.\n   - Quote: 
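
Besides `answer`, the structured output carries a `citations` list of integers. Assuming these are the `Source ID` values assigned by `format_docs_with_id`, they can be mapped back to the retrieved documents. The sketch below uses `SimpleNamespace` objects as hypothetical stand-ins for the retrieved `Document` objects; `resolve_citations` and the file names are not part of the original notebook.

```python
from types import SimpleNamespace


def resolve_citations(citations, context_docs):
    """Map cited integer IDs to (id, source) pairs, skipping IDs that are out of range."""
    resolved = []
    for idx in citations:
        if 0 <= idx < len(context_docs):
            source = context_docs[idx].metadata.get("source", "unknown")
            resolved.append((idx, source))
    return resolved


# Stand-ins for retrieved documents; only the `metadata` attribute is mimicked.
stub_context = [
    SimpleNamespace(metadata={"source": "glossary/a.pdf"}),  # made-up file name
    SimpleNamespace(metadata={"source": "glossary/b.pdf"}),  # made-up file name
    SimpleNamespace(metadata={"source": "glossary/c.pdf"}),  # made-up file name
]
resolved = resolve_citations([0, 2, 7], stub_context)
print(resolved)
```

In this notebook, a call along the lines of `resolve_citations(result_dict["citations"], result["context"])` should work, on the assumption that `result["context"]` still holds the retrieved documents in the order they were enumerated. The range check guards against the model citing an ID that was never shown to it.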

In [179]:
# Save the analysis text (the "answer" field) to a Markdown file
with open("./output.md", "w", encoding="utf-8") as f:
    f.write(result_dict["answer"])