In [1]:
# Import the required libraries
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain import hub
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

USER_AGENT environment variable not set, consider setting it to identify your requests.


## Load Documents

In [3]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.1.0-py3-none-any.whl (297 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.1.0


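### Homework 1: Rebuild the vector database from a different online document or offline file
The online document used here is the gaze-following paper "Where are they looking?": https://people.csail.mit.edu/khosla/papers/nips2015_recasens.pdf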

In [4]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(
    file_path = "https://people.csail.mit.edu/khosla/papers/nips2015_recasens.pdf"
)
docs = await loader.aload()

In [5]:
print(len(docs[0].page_content))

2640


In [6]:
print(docs[0].page_content[:100])

Where are they looking?
Adri`a Recasens∗ Aditya Khosla∗ Carl Vondrick Antonio Torralba
Massachusetts


## Split Texts

In [7]:
# Split the documents into chunks with RecursiveCharacterTextSplitter: up to 1000 characters per chunk, with a 200-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

In [8]:
print(len(all_splits))  # Number of chunks after splitting

42


In [9]:
print(len(all_splits[0].page_content))  # Character count of the first chunk

947


In [10]:
print(all_splits[0].page_content)  # Content of the first chunk

Where are they looking?
Adri`a Recasens∗ Aditya Khosla∗ Carl Vondrick Antonio Torralba
Massachusetts Institute of Technology
{recasens, khosla, vondrick, torralba}@csail.mit.edu
(* - indicates equal contribution)
Abstract
Humans have the remarkable ability to follow the gaze of other people to identify
what they are looking at. Following eye gaze, or gaze-following, is an important
ability that allows us to understand what other people are thinking, the actions they
are performing, and even predict what they might do next. Despite the importance
of this topic, this problem has only been studied in limited scenarios within the
computer vision community. In this paper, we propose a deep neural network-
based approach for gaze-following and a new benchmark dataset, GazeFollow, for
thorough evaluation. Given an image and the location of a head, our approach
follows the gaze of the person and identiﬁes the object being looked at. Our deep


In [11]:
print(all_splits[0].metadata)  # Metadata of the first chunk

{'source': 'https://people.csail.mit.edu/khosla/papers/nips2015_recasens.pdf', 'page': 0, 'start_index': 0}
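The interaction between `chunk_size` and `chunk_overlap` can be sketched with a plain sliding window. This is a simplified illustration, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break at separators such as newlines and spaces, which is why the first chunk above has 947 characters rather than exactly 1000. The helper name `sliding_window_chunks` and the synthetic sample text are assumptions made for this example; only the length 2640 is taken from the first page above.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Hypothetical helper: fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Synthetic text with the same length as the first page (2640 characters)
sample = "".join(str(i % 10) for i in range(2640))
chunks = sliding_window_chunks(sample)
for chunk in chunks:
    print(chunk["start_index"], len(chunk["text"]))
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. The `start_index` field plays the same role as the metadata key added by `add_start_index=True`.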


## Embed and Store in Vector Database

In [12]:
# Embed the document chunks with OpenAIEmbeddings and store them in a Chroma vector store
vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=OpenAIEmbeddings()
)

In [14]:
# Create a VectorStoreRetriever that returns the 6 chunks most relevant to a query
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [15]:
retrieved_docs = retriever.invoke("What is the main purpose of this article?")

In [16]:
# Inspect the retrieved documents
print(len(retrieved_docs))  # Number of retrieved documents

6


In [17]:
print(retrieved_docs[0].page_content)  # Content of the first retrieved document

http://gazefollow.csail.mit.edu.
1 Introduction
You step out of your house and notice a group of people looking up. You look up and realize they are
looking at an aeroplane in the sky. Despite the object being far away, humans have the remarkable
ability to precisely follow the gaze direction of another person, a task commonly referred to asgaze-
following (see [3] for a review). Such an ability is a key element to understanding what people are
doing in a scene and their intentions. Similarly, it is crucial for a computer vision system to have
this ability to better understand and interpret people. For instance, a person might be holding a book
but looking at the television, or a group of people might be looking at the same object which can
indicate that they are collaborating at some task, or they might be looking at different places which
Figure 1: Gaze-following: We present a model that learns to predict where people in images are
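What a `"similarity"` search with `k` results does can be sketched in plain Python: score every stored vector against the query vector and keep the best `k`. The toy three-dimensional vectors, the chunk IDs, and the helper names below are all assumptions for illustration. Cosine similarity is used here only as an example metric; the distance function Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_vecs, k):
    """Hypothetical helper: IDs of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in indexed_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings standing in for the real high-dimensional vectors
toy_index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [0.0, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]
print(top_k(query, toy_index, k=2))
```

A real vector store would typically use an approximate nearest-neighbor index instead of scoring every vector, but the ranking idea is the same.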


## Set LLM model

In [18]:
llm = ChatOpenAI(model="gpt-4o-mini")

## Create Prompt Template

### Homework 2: Redesign the RAG prompt template, or find a usable one on LangChain Hub, then test and compare the recall and generation quality of the two.
Prompt template used: https://smith.langchain.com/hub/krunal/more-crafted-rag-prompt

In [19]:
prompt = hub.pull("krunal/more-crafted-rag-prompt")



In [22]:
# Print the template
print(prompt.template)

# Your role
You are a brilliant expert at understanding the intent of the questioner and the crux of the question, and providing the most optimal answer to the questioner's needs from the documents you are given.


# Instruction
Your task is to answer the question using the following pieces of retrieved context delimited by XML tags.

<retrieved context>
Retrieved Context:
{context}
</retrieved context>


# Constraint
1. Think deeply and multiple times about the user's question\nUser's question:\n{question}\nYou must understand the intent of their question and provide the most appropriate answer.
- Ask yourself why to understand the context of the question and why the questioner asked it, reflect on it, and provide an appropriate response based on what you understand.
2. Choose the most relevant content(the key content that directly relates to the question) from the retrieved context and use it to generate an answer.
3. Generate a concise, logical answer. When generating the answer, Do

In [23]:
# Fill context and question with sample data and generate Messages usable by a ChatModel
example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()

In [24]:
# Inspect the filled-in prompt
print(example_messages[0].content)

# Your role
You are a brilliant expert at understanding the intent of the questioner and the crux of the question, and providing the most optimal answer to the questioner's needs from the documents you are given.


# Instruction
Your task is to answer the question using the following pieces of retrieved context delimited by XML tags.

<retrieved context>
Retrieved Context:
filler context
</retrieved context>


# Constraint
1. Think deeply and multiple times about the user's question\nUser's question:\nfiller question\nYou must understand the intent of their question and provide the most appropriate answer.
- Ask yourself why to understand the context of the question and why the questioner asked it, reflect on it, and provide an appropriate response based on what you understand.
2. Choose the most relevant content(the key content that directly relates to the question) from the retrieved context and use it to generate an answer.
3. Generate a concise, logical answer. When generating the 
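The substitution shown in the output above, where `{context}` and `{question}` are replaced by the supplied values, can be approximated with Python's built-in string formatting. This is only a rough sketch: the short `TEMPLATE` string and the `template_variables` helper are assumptions for this example, and the hub prompt object does more than plain `str.format` (for instance, it produces chat messages).

```python
from string import Formatter

# Hypothetical short template standing in for the much longer hub prompt
TEMPLATE = "Context:\n{context}\n\nQuestion: {question}"


def template_variables(template):
    """Hypothetical helper: sorted names of the placeholders a template expects."""
    return sorted(
        {name for _, name, _, _ in Formatter().parse(template) if name}
    )


print(template_variables(TEMPLATE))
filled = TEMPLATE.format(context="filler context", question="filler question")
print(filled)
```

Listing the placeholders first is a cheap way to confirm that a template pulled from elsewhere expects the same keys, here `context` and `question`, that the chain will supply.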

## Use LCEL to build RAG chain

In [25]:
# Define a function that joins the retrieved documents into a single context string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
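A quick way to see what `format_docs` produces is to call it on stand-in objects. `FakeDoc` below is a hypothetical minimal class that exposes only the `page_content` attribute the function reads; it is not the LangChain `Document` class.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in with just the attribute format_docs needs."""
    page_content: str


def format_docs(docs):
    # Join chunk texts with a blank line between them
    return "\n\n".join(doc.page_content for doc in docs)


context = format_docs([FakeDoc("first chunk"), FakeDoc("second chunk")])
print(repr(context))
```

The blank line between chunks gives the model a visible boundary between passages that may come from different pages.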

In [26]:
# Build the RAG chain with LCEL
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
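The data flow through this chain can be sketched with ordinary functions. The leading dict sends the same input question down two branches, one that retrieves and formats context and one that passes the question through unchanged, and the resulting dict then moves through the prompt, the model, and the parser in turn. Every function below is a hypothetical stand-in that makes no network calls; none of them is a LangChain API.

```python
def fake_retriever(question):
    # Stand-in for the vector store retriever: returns fixed chunk texts
    return ["chunk about gaze", "chunk about datasets"]


def format_chunks(chunks):
    return "\n\n".join(chunks)


def fake_prompt(inputs):
    # Stand-in for the prompt template: fills context and question into text
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt_text):
    # Stand-in for the chat model: returns a message-like dict
    return {"content": f"ANSWER based on {len(prompt_text)} prompt characters"}


def parse_output(message):
    # Stand-in for the string output parser: extracts the text content
    return message["content"]


def run_chain(question):
    # The dict step: both branches receive the same input question
    inputs = {
        "context": format_chunks(fake_retriever(question)),
        "question": question,  # the passthrough branch
    }
    # The remaining steps run left to right, like the `|` pipeline
    return parse_output(fake_llm(fake_prompt(inputs)))


print(run_chain("What is gaze-following?"))
```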

## Test

### Homework 1: Ask three related questions to test whether the RAG chain built with LCEL retrieves the relevant content.

In [27]:
for chunk in rag_chain.stream("What is the application on Gaze Tracking?"):
    print(chunk, end="", flush=True)

Gaze tracking has several important applications, particularly in robotics and human interaction interfaces, where understanding a person's object of interest is crucial. By accurately following a person's gaze, systems can predict future actions, as individuals often look at objects they intend to interact with before they actually do so. This capability enhances the interpretation of social interactions, allowing for better recognition of collaborative tasks among groups of people or understanding individual focus in varied contexts. Moreover, gaze tracking can improve action recognition in egocentric vision by identifying what a user is likely to engage with next. Overall, the ability to track gaze significantly contributes to the development of more intuitive and responsive technology that interacts seamlessly with human behavior.

In [28]:
for chunk in rag_chain.stream("Describe the model for Gaze Tracking?"):
    print(chunk, end="", flush=True)

The gaze tracking model described utilizes a deep neural network architecture that combines head orientation and location information with scene content to predict where a person is looking within an image. Inputting a picture along with the person's head position, the model outputs a distribution over potential gaze targets, effectively functioning as a saliency map from the selected person's perspective. This approach is designed to operate in natural settings without restrictive assumptions, allowing it to handle various scenarios, such as multiple people interacting or looking at each other. The model is trained using the GazeFollow dataset, which captures diverse fixation scenarios and variations in gaze behavior. Overall, this model aims to emulate human gaze-following abilities, enhancing our understanding of social interactions in visual contexts.

In [29]:
for chunk in rag_chain.stream("Evaluate the model."):
    print(chunk, end="", flush=True)

The model demonstrates strong performance in predicting where people are looking, achieving an AUC of 0.878 and a mean Euclidean error of 0.190, significantly outperforming various baseline models. Notably, the model utilizes components such as head position and gaze pathways effectively, suggesting these features are crucial for accurate gaze estimation. However, it still remains below human performance, where a single annotator achieved an AUC of 0.924 and a mean error of 0.096. Qualitative results indicate that while the model can distinguish between different subjects and identify salient objects, it struggles with depth perception, leading to some inaccurate predictions. Overall, the model shows promise but highlights areas for improvement, particularly in achieving human-level accuracy.

### Homework 2: Test and compare the recall and generation quality of the two prompt templates.

In [30]:
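# Pull the original RAG prompt template from the hub and build a second RAG chain
prompt2 = hub.pull("rlm/rag-prompt")
# Build the RAG chain with LCEL, passing prompt2 so the comparison uses the second template
rag_chain2 = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt2
    | llm
    | StrOutputParser()
)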



In [31]:
for chunk in rag_chain2.stream("What is the application on Gaze Tracking?"):
    print(chunk, end="", flush=True)

Gaze tracking has several important applications, particularly in the fields of robotics and human interaction interfaces. By understanding where individuals are looking, systems can better interpret human intentions and actions, which is crucial for enhancing interaction quality. For instance, gaze-following can help predict what a person might do next, such as identifying objects they plan to interact with. Additionally, it can facilitate collaborative tasks by recognizing when multiple individuals are focusing on the same object. Overall, effective gaze tracking enhances the ability of computer vision systems to understand and interact with human behavior in natural settings.

In [32]:
for chunk in rag_chain2.stream("Describe the model for Gaze Tracking?"):
    print(chunk, end="", flush=True)

The gaze-following model introduced in the context of GazeFollow leverages deep learning to predict where a person is looking within an image. It takes as input a picture and the location of a person's head, combining information about head orientation and location with scene content to output a distribution over potential gaze targets, effectively creating a saliency map from that person's perspective. This approach emulates human gaze-following behavior by first assessing the person's head and eyes to infer their line of sight, and then reasoning about salient objects within their view. The model is designed to handle various scenarios, including multiple people with joint attention or looking at each other, without relying on face detection or object-level annotations during training. The GazeFollow dataset serves as a benchmark for evaluating this model, providing a comprehensive resource for further research in gaze tracking.

In [33]:
for chunk in rag_chain2.stream("Evaluate the model."):
    print(chunk, end="", flush=True)

The model demonstrates strong performance in predicting gaze direction, achieving an AUC of 0.878 and a mean Euclidean error of 0.190, which outperforms all baseline methods significantly. Notably, the model's results show a minimum L2 distance of 0.113 to the nearest ground truth fixation, indicating precise predictions. An ablation study reveals that all input components—image, position, and head—contribute positively to model performance, with the gaze pathway being particularly effective at estimating gaze direction. However, while the model shows promise, it still falls short of human performance, which has an AUC of 0.924 and a mean Euclidean error of 0.096, suggesting room for improvement. Overall, the model effectively distinguishes different people's points of view in images but faces challenges due to a lack of 3D understanding, leading to occasional inaccuracies.
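Both chains share the same retriever, so the retrieved chunks, and therefore recall, do not depend on the prompt template; only the generated text can differ. One rough way to put a number on recall is to check how many expected keywords appear in the retrieved text. The sketch below assumes a hypothetical `keyword_recall` helper and toy strings; the keyword list is an illustrative choice, not an established benchmark.

```python
def keyword_recall(retrieved_texts, expected_keywords):
    """Hypothetical metric: fraction of expected keywords found in the retrieved text."""
    joined = " ".join(retrieved_texts).lower()
    hits = [kw for kw in expected_keywords if kw.lower() in joined]
    return len(hits) / len(expected_keywords)


# Toy strings standing in for the page_content of retrieved chunks
retrieved = [
    "The model achieves an AUC of 0.878",
    "GazeFollow is a benchmark dataset",
]
score = keyword_recall(retrieved, ["AUC", "GazeFollow", "ablation"])
print(round(score, 3))
```

In the notebook, `retrieved_texts` could be built from `retriever.invoke(question)` by collecting each document's `page_content`. Generation quality still needs a separate judgment, such as reading the two answers side by side.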