Source code: https://github.com/qianniuspace/llm_notebooks/tree/main/Recommendation-system-with-LLMs   

In [1]:
import pandas as pd

anime = pd.read_csv('anime_with_synopsis.csv')
anime.head()

Unnamed: 0,MAL_ID,Name,Score,Genres,sypnopsis
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ..."
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen","Vash the Stampede is the man with a $$60,000,0..."
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",ches are individuals with special powers like ...
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",It is the dark century and the people are suff...


See the Recommendation systems.ipynb file for the source.   
For details on the `apply` operation used below, see the `pandas_apply.ipynb` file.

In [2]:
anime = anime.dropna()  # drop rows that contain missing values
anime['combined_info'] = anime.apply(lambda row: f"Title: {row['Name']}. Overview: {row['sypnopsis']} Genres: {row['Genres']}", axis=1)
anime['combined_info'][0]

'Title: Cowboy Bebop. Overview: In the year 2071, humanity has colonized several of the planets and moons of the solar system leaving the now uninhabitable surface of planet Earth behind. The Inter Solar System Police attempts to keep peace in the galaxy, aided in part by outlaw bounty hunters, referred to as "Cowboys." The ragtag team aboard the spaceship Bebop are two such individuals. Mellow and carefree Spike Spiegel is balanced by his boisterous, pragmatic partner Jet Black as the pair makes a living chasing bounties and collecting rewards. Thrown off course by the addition of new members that they meet in their travels—Ein, a genetically engineered, highly intelligent Welsh Corgi; femme fatale Faye Valentine, an enigmatic trickster with memory loss; and the strange computer whiz kid Edward Wong—the crew embarks on thrilling adventures that unravel each member\'s dark and mysterious past little by little. Well-balanced with high density action and light-hearted comedy, Cowboy Bebo
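The row-wise `apply` used above can be tried on a small table first. The sketch below uses two made-up rows (hypothetical data, same column names as the anime table) to show how `axis=1` hands each row to the lambda and how `dropna()` removes the row with a missing synopsis.

```python
import pandas as pd

# Toy stand-in for the anime table (hypothetical rows, same column names)
toy = pd.DataFrame({
    "Name": ["Show A", "Show B"],
    "sypnopsis": ["A pilot chases bounties.", None],
    "Genres": ["Action, Sci-Fi", "Comedy"],
})

toy = toy.dropna()  # drops "Show B" because its synopsis is missing

# axis=1 passes each row (a Series) to the lambda, one row at a time
toy["combined_info"] = toy.apply(
    lambda row: f"Title: {row['Name']}. Overview: {row['sypnopsis']} Genres: {row['Genres']}",
    axis=1,
)
print(toy["combined_info"].iloc[0])
```

Note that `anime['combined_info'][0]` looks the row up by index label, so it only works while the row labelled 0 survives `dropna()`; `.iloc[0]` selects by position instead.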

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

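max_tokens = 512  # set this to the real maximum length of the model you embed with

# Count the tokens of every text, letting the tokenizer truncate at max_tokens.
# - anime.combined_info is the column holding the merged text (title, synopsis, genres).
# - apply(lambda x: ...) runs once for each text x in that column.
# - tokenizer.encode(x, truncation=True, max_length=max_tokens) converts x into a sequence of token IDs
#   and cuts the sequence off at max_tokens (512) if the text is longer.
# - len(...) is the length of that sequence, i.e. the token count of the text.
# The counts are stored in the new column n_tokens.
# Note: this tokenizer (BAAI/bge-m3) is not the one used by the embedding model further down, so the counts are approximate for that model.
anime["n_tokens"] = anime.combined_info.apply(lambda x: len(tokenizer.encode(x, truncation=True, max_length=max_tokens)))

# Because truncation=True already caps every count at max_tokens, this filter keeps every row;
# remove truncation=True above if over-long texts should really be dropped.
anime = anime[anime.n_tokens <= max_tokens]  # number of rows after filtering is checked below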
len(anime)

16206
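How truncation interacts with the length filter above can be seen with a toy example. The sketch below uses a hypothetical whitespace tokenizer (`count_tokens` is an illustration, not the Hugging Face tokenizer) and compares filtering on truncated counts with filtering on raw counts.

```python
def count_tokens(text, max_length=None):
    """Hypothetical whitespace tokenizer; truncates to max_length when given."""
    tokens = text.split()
    if max_length is not None:
        tokens = tokens[:max_length]
    return len(tokens)

texts = ["short text", "a much longer piece of text than the limit allows"]
max_tokens = 5

# Counts measured after truncation can never exceed max_tokens
truncated_counts = [count_tokens(t, max_length=max_tokens) for t in texts]
# Counts measured on the full text can
raw_counts = [count_tokens(t) for t in texts]

kept_with_truncation = [t for t, n in zip(texts, truncated_counts) if n <= max_tokens]
kept_without_truncation = [t for t, n in zip(texts, raw_counts) if n <= max_tokens]

print(truncated_counts, raw_counts)
print(len(kept_with_truncation), len(kept_without_truncation))
```

With truncated counts the filter keeps both texts; with raw counts it drops the long one. Which behaviour is wanted depends on whether over-long synopses should be shortened or excluded.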

The following cell was not executed, but it is a useful reference.

In [None]:
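# 1. Inspect the distribution of token counts
print(anime.n_tokens.describe())
# Example of the output format (illustrative numbers, not from this dataset):
# count 1000.0, mean 156.2, std 89.5, min 45.0, 25% 89.0, 50% 142.0, 75% 198.0, max 510.0

# 2. Record the sizes before filtering, so that the comparison is meaningful
n_before = len(anime)
n_dropped = len(anime[anime.n_tokens > max_tokens])

# 3. Filter out over-long texts (max_tokens = 512)
anime = anime[anime.n_tokens <= max_tokens]

# 4. Report how many rows the filter removed
print(f"Rows before filtering: {n_before}")
print(f"Rows after filtering: {len(anime)}")
print(f"Rows filtered out: {n_dropped}")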

## Using the sentence-transformers library

In [10]:
import torch
from sentence_transformers import SentenceTransformer

# Check whether a GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Device in use: {device}")

model = SentenceTransformer("all-MiniLM-L6-v2", device=device)  # sentence-embedding model all-MiniLM-L6-v2; runs on the GPU when one is available

texts = anime.combined_info.tolist()  # convert the merged anime text column to a list, the input format the model expects

# Generate the text embeddings
embeddings = model.encode(
    texts,
    batch_size=256 if device == 'cuda' else 128,  # number of texts per batch; lower it if you run out of memory
    show_progress_bar=True,     # show a progress bar
    normalize_embeddings=True,  # scale every vector to unit length, so that a dot product equals cosine similarity
)

anime["embedding"] = list(embeddings)  # list(embeddings) splits the 2-D array into one 1-D vector per row, stored in a new column

Device in use: cuda


Batches: 100%|██████████| 64/64 [00:18<00:00,  3.49it/s]
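The effect of `normalize_embeddings=True` can be checked numerically. The sketch below uses random vectors as stand-ins for embeddings (hypothetical data, not model output) and confirms that, once vectors have unit length, a plain dot product gives the same value as cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 4))  # three hypothetical 4-dimensional embeddings

# Normalize each row to unit length (what normalize_embeddings=True does for the model output)
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Cosine similarity computed the long way on the raw vectors
norms = np.linalg.norm(raw, axis=1)
cosine = (raw @ raw.T) / (norms[:, None] * norms[None, :])

# Plain dot product on the normalized vectors
dot_of_unit = unit @ unit.T

print(np.allclose(cosine, dot_of_unit))
```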


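```python
anime.to_pickle('anime.pkl')
```
`to_pickle()` is a pandas method that saves the prepared anime DataFrame as a Python pickle file (anime.pkl).   
The pickle format **preserves the structure and types of the data** (including the per-row embedding vectors and the text strings), so there is **no need to process the data again later: `pd.read_pickle('anime.pkl')` loads it back directly**.   
The point of saving a local file is to avoid repeating time-consuming steps such as tokenization and embedding generation; the file can be read back and reused when building the LanceDB table.   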
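A quick round trip illustrates the type preservation described above. This is a minimal sketch on a two-row toy table (hypothetical data) written to a temporary directory, so it does not touch `anime.pkl`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy table with a text column and one float32 vector per row (hypothetical values)
df = pd.DataFrame({"text": ["Title: Show A.", "Title: Show B."]})
vectors = np.array([[0.1, 0.2], [0.3, 0.4]], dtype=np.float32)
df["embedding"] = list(vectors)  # same pattern as in the notebook: one 1-D array per row

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    df.to_pickle(path)
    restored = pd.read_pickle(path)

first = restored["embedding"].iloc[0]
print(type(first), first.dtype)
```

The restored cells are still NumPy arrays with their original dtype. A CSV export, by contrast, would write each array out as text, which would then have to be parsed back into numbers.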

In [12]:
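from langchain_community.vectorstores import LanceDB  # vector store wrapper (previously imported from langchain.vectorstores)
anime.rename(columns={'embedding': 'vector'}, inplace=True)
# rename(columns={old_name: new_name}) renames DataFrame columns; inplace=True modifies the DataFrame itself instead of creating a copy.
# 'vector' and 'text' are the column names the LangChain LanceDB wrapper looks for by default.
anime.rename(columns={'combined_info': 'text'}, inplace=True)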
anime.to_pickle('anime.pkl')

In [13]:
anime = pd.read_pickle('anime.pkl')
anime.head(2)

Unnamed: 0,MAL_ID,Name,Score,Genres,sypnopsis,text,n_tokens,vector
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever...",Title: Cowboy Bebop. Overview: In the year 207...,294,"[-0.075829476, -0.038479, -0.037791666, -0.005..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ...",Title: Cowboy Bebop: Tengoku no Tobira. Overvi...,242,"[-0.091244884, 0.031044936, -0.019721977, 0.03..."


In [14]:
anime['text'][0]

'Title: Cowboy Bebop. Overview: In the year 2071, humanity has colonized several of the planets and moons of the solar system leaving the now uninhabitable surface of planet Earth behind. The Inter Solar System Police attempts to keep peace in the galaxy, aided in part by outlaw bounty hunters, referred to as "Cowboys." The ragtag team aboard the spaceship Bebop are two such individuals. Mellow and carefree Spike Spiegel is balanced by his boisterous, pragmatic partner Jet Black as the pair makes a living chasing bounties and collecting rewards. Thrown off course by the addition of new members that they meet in their travels—Ein, a genetically engineered, highly intelligent Welsh Corgi; femme fatale Faye Valentine, an enigmatic trickster with memory loss; and the strange computer whiz kid Edward Wong—the crew embarks on thrilling adventures that unravel each member\'s dark and mysterious past little by little. Well-balanced with high density action and light-hearted comedy, Cowboy Bebo

In [None]:
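import lancedb  # pip install lancedb

uri = "dataset/sample-anime-lancedb"
db = lancedb.connect(uri)
# lancedb.connect(uri) opens a connection to a LanceDB database:
# - if the path in uri does not exist, a new database is created there;
# - if the path already exists (created earlier), the existing database is opened.
table = db.create_table("anime", anime)  # raises an error if run twice; pass mode="overwrite" to replace an existing table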


In [22]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import LanceDB
from langchain.chains import RetrievalQA
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# 1. Connect to the LanceDB database
db = lancedb.connect(uri)  # replace uri with your database path

# 2. The table already exists, so open it by name.
# If it does not exist yet, create it first, typically with db.create_table, passing the data together with its embedding vectors.
table = db.open_table("anime")  # replace with your table name

# Create docsearch, the LangChain vector store wrapper around the LanceDB table
docsearch = LanceDB(
    connection=db,  # the LanceDB connection object returned by lancedb.connect(uri)
    table=table,  # the "anime" table, which stores the vector and text columns along with the other anime data
    embedding=embeddings)

In [24]:
query = "I'm looking for an action anime. What could you suggest to me?"
docs_with_scores = docsearch.similarity_search_with_score(query, k=3)
for doc, score in docs_with_scores:
    # The scores in the output increase down the list, which is consistent with a distance (smaller means closer) rather than a similarity
    print(f"Score: {score:.4f}")
    print(f"Content: {doc.page_content[:100]}...")
    print("---")

Score: 0.8462
Content: Title: Animation!. Overview: collection of short animations, from the revelations Tomoyoshi Joko and...
---
Score: 0.8509
Content: Title: Paper Film. Overview: short animation by Taku Furukawa. Genres: Comedy...
---
Score: 0.8831
Content: Title: Shinkansen Henkei Robo Shinkalion Z the Animation. Overview: No synopsis information has been...
---
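A similarity search ranks the stored vectors by how close they are to the query vector. The sketch below is a brute-force NumPy version of that idea, not LanceDB's implementation; `top_k_l2` and the 2-dimensional vectors are hypothetical. It also shows why, for unit-length vectors, a smaller L2 distance corresponds to a larger cosine similarity (d² = 2 − 2·cos).

```python
import numpy as np

def top_k_l2(query_vec, doc_vecs, k=3):
    """Return (index, distance) pairs for the k closest rows, nearest first."""
    distances = np.linalg.norm(doc_vecs - query_vec, axis=1)
    order = np.argsort(distances)[:k]
    return [(int(i), float(distances[i])) for i in order]

# Hypothetical unit-length document vectors and query vector
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
query = np.array([0.8, 0.6])

results = top_k_l2(query, docs, k=3)
for idx, dist in results:
    cosine = float(docs[idx] @ query)  # dot product equals cosine for unit vectors
    print(idx, round(dist, 4), round(cosine, 4))
```

Whether a backend reports the distance itself or its square is an implementation detail, so it is worth checking the ordering of the returned scores before labelling them as similarities.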


In [28]:
from langchain_openai import ChatOpenAI
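from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
# Streaming callback handler: prints the model's output as it is generated instead of waiting for the full answer
callbacks = [StreamingStdOutCallbackHandler()]

# RetrievalQA.from_chain_type builds the RAG pipeline by connecting a retriever to a large language model:
# - the retriever finds the anime entries in the vector database that are closest to the user's query;
# - the LLM turns the retrieved entries into a natural-language recommendation, with streaming output.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(
        model="glm-4.5",
        api_key="",  # fill in your API key
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
        streaming=True,  # enable streaming mode
        callbacks=callbacks,  # attach the callback handler
    ),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)
# chain_type="stuff": all retrieved documents are combined into a single prompt that is passed to the model
# return_source_documents=True: also return the retrieved documents, so you can see what the recommendation is based on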

query = "I'm looking for an action anime. What could you suggest to me?"
result = qa.invoke({"query": query})
result['result']

Based on the anime titles I have information about, I can suggest the following action anime:

1. "Shinkansen Henkei Robo Shinkalion The Animation Recap" - This is an action, sci-fi, mecha anime aimed at kids. It appears to be a recap version of the main series.

2. "Shinkansen Henkei Robo Shinkalion Z the Animation" - This is another action, sci-fi, mecha anime for kids, likely a sequel or continuation of the previous series.

Both of these titles fall under the action genre you're looking for, though I don't have detailed synopsis information available for either one. They appear to be mecha series involving transforming bullet train robots.

'Based on the anime titles I have information about, I can suggest the following action anime:\n\n1. "Shinkansen Henkei Robo Shinkalion The Animation Recap" - This is an action, sci-fi, mecha anime aimed at kids. It appears to be a recap version of the main series.\n\n2. "Shinkansen Henkei Robo Shinkalion Z the Animation" - This is another action, sci-fi, mecha anime for kids, likely a sequel or continuation of the previous series.\n\nBoth of these titles fall under the action genre you\'re looking for, though I don\'t have detailed synopsis information available for either one. They appear to be mecha series involving transforming bullet train robots.'
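The "stuff" strategy used above can be pictured as plain string assembly. The sketch below is an illustration only: `build_stuff_prompt` and its prompt wording are hypothetical and are not LangChain's actual template, and the two documents are made up.

```python
def build_stuff_prompt(question, documents):
    """Concatenate every retrieved document into one prompt (hypothetical template)."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved texts in the same "Title / Overview / Genres" layout
retrieved = [
    "Title: Show A. Overview: A pilot chases bounties. Genres: Action, Sci-Fi",
    "Title: Show B. Overview: Friends run a cafe. Genres: Comedy",
]
prompt = build_stuff_prompt("Suggest an action anime.", retrieved)
print(prompt)
```

Because every retrieved document ends up in the same prompt, the model can only recommend titles the retriever returned, which is why the answer above is limited to the Shinkalion entries.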

In [32]:
from langchain.chains import RetrievalQA

# Streaming is enabled, but no callback is attached, so nothing is printed to the console in real time
llm = ChatOpenAI(
    model="glm-4.5",
    api_key="",  # fill in your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1", 
    streaming=True,  # streaming stays enabled, just without a callback to display it
    temperature=0.7
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True
)

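# Run the query asynchronously
async def ask_question(question):
    # ainvoke is the awaitable counterpart of invoke, so the coroutine does not block while waiting
    result = await qa.ainvoke({"query": question})
    return result

# Example run (the query means "Recommend some action anime")
await ask_question("推荐一些动作动画")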

{'query': '推荐一些动作动画',
 'result': '根据您提供的信息，我推荐一部动作动画：\n\n《Wu Liuqi Zhi Xuanwu Guo Pian》（伍六七之玄武国篇）- 这是一部包含动作、冒险、喜剧、剧情、武术、神秘和超能力元素的作品。故事讲述了伍六七和他的伙伴大宝和小飞为了保护小鸡岛的居民和那里的和平生活，踏上前往玄武国的旅程，以寻找他身份的真相和拯救岛屿的方法。等待他们的是更多未知和冒险。\n\n如果您需要更多动作动画推荐，我很乐意在获得更多信息后为您提供。',
 'source_documents': [Document(metadata={}, page_content='Title: Wu Liuqi Zhi Xuanwu Guo Pian. Overview: In order to protect the residents of Xiaoji Island and the peaceful life here, Wu Liuqi and his partners Dabao and Xiaofei embark on a journey to the Xuanwu Kingdom to find out the truth about his identity and a way to save the island. Waiting for them is more unknowns and adventures. (Source: bilibili, edited) Genres: Action, Adventure, Comedy, Drama, Martial Arts, Mystery, Super Power'),
  Document(metadata={}, page_content="Title: Itsumo Kokoro ni Taiyou wo!. Overview: Follows the story of the manga of the same name where a a high school boy finds a room for a boarding house that only costs 30,000 yen a month (about 300 USD). But ther