---

title: "Understanding BERT, Embeddings, Vector Search, and APIs in One Article"
date: 2025-03-02
author: 郝鸿涛
slug: bert
draft: false
toc: true
tags: ML
---

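Suppose we have text descriptions of ten thousand movies. How do we recommend the most relevant ones for a user's search? For example, if the user types "NASA tries to rescue an astronaut stranded on Mars", we should surely return *The Martian*. How can we implement this?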

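The idea is this: suppose we have a magic tool, which you can picture as a net. Any string of text that passes through it becomes a vector in a high-dimensional space, that is, a coordinate in that space. We pass all ten thousand movie descriptions through the net and get ten thousand coordinates. When a user submits a query, we pass the query through the same net and get one more coordinate. Finally, we compute the cosine similarity between the query's coordinate and every movie's coordinate, and the movies with the highest scores are our search results.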
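The ranking step can be sketched in a few lines of plain Python. This is only an illustration: the three-dimensional vectors and the names `toy_movies`, `toy_query`, `cosine_similarity`, and `top_k` are made up here, whereas real embeddings have hundreds of dimensions and come from a model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, items, k=2):
    """Return the k (name, score) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings" (invented for illustration only).
toy_movies = {
    "space_rescue": [0.9, 0.1, 0.0],
    "romantic_comedy": [0.0, 0.2, 0.9],
    "mars_survival": [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

ranking = top_k(toy_query, toy_movies, k=2)
print(ranking)
```

The two space-themed toy vectors point in nearly the same direction as the query, so they rank ahead of the unrelated one.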

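This raises a question. If we treat each output of the net as a point (a coordinate), we can find similar points with [Euclidean distance](https://zh.wikipedia.org/zh-hans/%E6%AC%A7%E5%87%A0%E9%87%8C%E5%BE%97%E8%B7%9D%E7%A6%BB). If we treat each output as a vector, we use cosine similarity instead. Treating the outputs as vectors seems more appropriate here. Why? I cannot explain it clearly yet.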
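One fact that may help here: once the vectors are scaled to unit length (as the embeddings later in this post are), the two measures rank results identically. For vectors $a$ and $b$ with angle $\theta$ between them:

$$
\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b = 2 - 2\cos\theta
\quad \text{when } \lVert a \rVert = \lVert b \rVert = 1 .
$$

So for unit vectors, the smallest Euclidean distance corresponds to the largest cosine similarity. The two measures only disagree on unnormalized vectors, where Euclidean distance also reacts to vector length while cosine similarity looks at direction alone.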

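This magic net is an embedding model. The best-known one is BERT, but it is fairly large, so here we use Sentence Transformers instead. Even better, with a multilingual model such as `paraphrase-multilingual-mpnet-base-v2`, we could search in Chinese even though our movie data is in English. Multilingual models are also fairly large, though, so we skip them here and use the smaller `all-MiniLM-L6-v2`.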

We also use [FAISS](https://pypi.org/project/faiss-cpu/) (Facebook AI Similarity Search) to compute the cosine similarities, because it is fast.
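A detail worth spelling out: the index type used later in this post, `faiss.IndexFlatIP`, scores by inner product, and it yields cosine similarity only because the embeddings are L2-normalized first. The NumPy sketch below checks that equivalence on random toy vectors. The helper `l2_normalize` and the random data are assumptions for illustration, and unlike scikit-learn's `normalize`, this helper does not guard against all-zero rows.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 4)).astype("float32")   # 5 toy "movie" vectors
query = rng.normal(size=(1, 4)).astype("float32")  # 1 toy query vector

# Cosine similarity computed directly from its definition.
cosine = (docs @ query.T).ravel() / (
    np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
)

# Inner product of the L2-normalized vectors.
inner = (l2_normalize(docs) @ l2_normalize(query).T).ravel()

print(np.allclose(cosine, inner, atol=1e-5))
```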

## Raw data

Download the data here: [https://hongtaoh.com/files/tmdb-movies.csv](/../files/tmdb-movies.csv).

In [1]:
import pandas as pd 
import numpy as np 

In [2]:
df = pd.read_csv('../static/files/tmdb-movies.csv')
df.shape

(10866, 21)

The dataset is large, so I show only a few key columns:

In [3]:
df.iloc[0:5, [0, 1, 4, 5, 11]]

|  | id | imdb_id | revenue | original_title | overview |
| --- | --- | --- | --- | --- | --- |
| 0 | 135397 | tt0369610 | 1513528810 | Jurassic World | Twenty-two years after the events of Jurassic ... |
| 1 | 76341 | tt1392190 | 378436354 | Mad Max: Fury Road | An apocalyptic story set in the furthest reach... |
| 2 | 262500 | tt2908446 | 295238201 | Insurgent | Beatrice Prior must confront her inner demons ... |
| 3 | 140607 | tt2488496 | 2068178225 | Star Wars: The Force Awakens | Thirty years after defeating the Galactic Empi... |
| 4 | 168259 | tt2820852 | 1506249360 | Furious 7 | Deckard Shaw seeks revenge against Dominic Tor... |


Let's check whether the most important columns, `imdb_id`, `original_title`, and `overview`, have any missing values:

In [4]:
df_na = df[df[['imdb_id', 'original_title', 'overview']].isna().any(axis = 1)]
df_na.iloc[:, [0, 1, 4, 5, 11]]

|  | id | imdb_id | revenue | original_title | overview |
| --- | --- | --- | --- | --- | --- |
| 548 | 355131 |  | 0 | Sense8: Creating the World |  |
| 997 | 287663 |  | 0 | Star Wars Rebels: Spark of Rebellion | A Long Time Ago In A Galaxy Far, Far Away… A... |
| 1528 | 15257 |  | 0 | Hulk vs. Wolverine | Department H sends in Wolverine to track down ... |
| 1750 | 101907 |  | 0 | Hulk vs. Thor | For ages, Odin has protected his kingdom of As... |
| 2370 | 127717 | tt1525359 | 0 | Freshman Father |  |
| 2401 | 45644 |  | 0 | Opeth: In Live Concert At The Royal Albert Hall | As part of the ongoing celebration of their 20... |
| 3722 | 85993 | tt1680105 | 0 | Baciato dalla fortuna |  |
| 3794 | 58253 | tt1588335 | 0 | Toi, moi, les autres |  |
| 4797 | 369145 |  | 0 | Doctor Who: The Snowmen | Christmas Eve, 1892, and the falling snow is t... |
| 4872 | 269177 |  | 0 | Party Bercy | Florence Foresti is offered Bercy tribute to a... |


Drop these rows:

In [5]:
df.dropna(subset=['imdb_id', 'original_title', 'overview'], inplace=True)

In [6]:
# We only need these two columns
movie_ids = df.imdb_id.tolist()
movie_overviews = df.overview.tolist()

#create dictionaries
imbdId2Title = dict(zip(df.imdb_id, df.original_title))
imbdId2Overview = dict(zip(df.imdb_id, df.overview))

## Embeddings

In [7]:
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize
model = SentenceTransformer('all-MiniLM-L6-v2')
from typing import List, Dict, Tuple
import faiss 



In [8]:
def get_movie_embeddings(
    movie_ids:List[str], 
    movie_overviews:List[str], 
    model: SentenceTransformer,
    batch_size: int = 32, 
    ) -> Dict[str, np.ndarray]:
    """
    Generates normalized embeddings for movie overviews.

    Args:
        movie_ids: List of movie IDs.
        movie_overviews: List of movie overviews.
        model: Embedding model (e.g., SentenceTransformer).
        batch_size: Batch size for processing.

    Returns:
        Dictionary mapping movie IDs to normalized embeddings.
    """
    if len(movie_ids) != len(movie_overviews):
        raise ValueError("movie_ids and movie_overviews must have the same length.")

    all_embeddings = []
    for i in range(0, len(movie_ids), batch_size):
        batch_movies = movie_overviews[i:i+batch_size]
        try:
            batch_embeddings = model.encode(batch_movies)
            all_embeddings.extend(batch_embeddings)
        except Exception as e:
            print(f"Error encoding batch {i}: {e}")
            raise
    try:
        all_embeddings_np = np.array(all_embeddings) #added numpy array conversion.
        normalized_embeddings = normalize(all_embeddings_np, axis=1, norm='l2')
    except Exception as e:
        print(f"Error normalizing embeddings: {e}")
        raise

    movie_embeddings = dict(zip(movie_ids, normalized_embeddings))
    return movie_embeddings

In [9]:
movie_embeddings = get_movie_embeddings(movie_ids, movie_overviews, model)

## 搜索

In [10]:
def prepare_faiss_index(
    movie_embeddings: Dict[str, np.ndarray]) -> Tuple[faiss.Index, List[str]]:
    """
    Prepares a FAISS index from movie embeddings.
    
    Args:
        movie_embeddings: Dictionary of movie IDs to normalized embeddings.
    
    Returns:
        Tuple of (FAISS index, ordered list of movie IDs)
    """
    movie_ids = list(movie_embeddings.keys())
    embeddings = np.array([movie_embeddings[mid] for mid in movie_ids])

    #create and populate the index
    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)

    return index, movie_ids

def search_movies(
    faiss_index: faiss.Index,
    movie_ids: List[str],
    user_query: str,
    model: SentenceTransformer,
    k: int = 10,
) -> Tuple[List[str], List[float]]:
    """
    Searches movies based on a user query using Faiss and cosine similarity.

    Args:
        faiss_index: Pre-built FAISS index
        movie_ids: Ordered list of movie IDs that corresponds to the index
        user_query: The user's search query.
        model: SentenceTransformer model.
        k: Number of results to return.

    Returns:
        Tuple, top K movie titles and their associated cosine similarity scores
    """
    #embed and normalize the user query
    query_embedding = model.encode([user_query])
    normalized_query_embedding = normalize(query_embedding, axis = 1, norm = 'l2')

    #search the index
    similarities, indices = faiss_index.search(normalized_query_embedding, k)

    #retrieve movie ids and similarity scores
    top_k_ids = [movie_ids[i] for i in indices[0]]
    top_k_titles = [imbdId2Title[x] for x in top_k_ids]
    top_k_scores = similarities[0]
    return top_k_titles, top_k_scores

In [11]:
faiss_index, movie_ids = prepare_faiss_index(movie_embeddings)
user_query = "NASA tried to rescue an astronaut stranded on Mars."
results = search_movies(
    faiss_index,
    movie_ids,
    user_query,
    model
)

In [12]:
results 

(['Mission to Mars',
  'Red Planet',
  'Robinson Crusoe on Mars',
  'The Martian',
  'Capricorn One',
  'My Favorite Martian',
  'The Last Days on Mars',
  'Infini',
  '(T)Raumschiff Surprise - Periode 1',
  'My Stepmother is an Alien'],
 array([0.6767248 , 0.6620594 , 0.651513  , 0.55047977, 0.5192548 ,
        0.49691725, 0.4924257 , 0.46683484, 0.46024293, 0.44739127],
       dtype=float32))

The results look right: *The Martian* ranks fourth, behind three other films about missions to Mars.

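## Moving to the cloud

The problem now is: what about a user who knows nothing about programming and just wants to search for a movie? They won't download the data or run the code above, and that describes most users. The solution is to store the `movie_embeddings` we computed in the cloud.

The first approach I thought of was to store the raw movie data and the embeddings in MongoDB. A web app (essentially a website) would then download the data and embeddings and recommend the most similar movies for the user's query. This works, but it is slow.

A better approach is to use a database built specifically for vector search. These vector databases have their own storage and search methods, so we no longer need FAISS. We only need to store all the embeddings; at query time we pass in a vector, and the database runs an optimized search and returns results quickly.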

### Qdrant

I tried [Qdrant](https://qdrant.tech/) first.

The free tier is enough for now.

First, install the client:

```sh
pip install qdrant-client
```

In [13]:
import os 
# `pip install python-dotenv` first
from dotenv import load_dotenv
load_dotenv()
url = os.getenv("QDRANT_MOVIE")
api_key = os.getenv("QDRANT_API_KEY")

In [None]:
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance
import uuid

qdrant_client = QdrantClient(
    url=url,
    api_key=api_key
)

try:
    qdrant_client.delete_collection(collection_name='movies')
    print("Deleted existing collection")
except Exception as e:
    print(f"Collection might not exist yet: {e}")

# Create the collection first
vector_size = len(next(iter(movie_embeddings.values())))
qdrant_client.create_collection(
    collection_name="movies",
    vectors_config=VectorParams(
        size=vector_size,
        distance=Distance.COSINE
    )
)

points = []
for movie_id in movie_ids:
    # Generate a UUID for each movie ID
    point_id = str(uuid.uuid4()) #generates a random unique uuid.
    points.append({
        'id': point_id,
        'vector': movie_embeddings[movie_id].tolist(),
        'payload': {
            'title': imbdId2Title[movie_id],
            'overview': imbdId2Overview[movie_id],
            'imdb_id': movie_id,
        }
    })

batch_size = 1_000
for i in range(0, len(points), batch_size):
    batch = points[i:i+batch_size]
    qdrant_client.upsert(
        collection_name = 'movies',
        points = batch
    )
    print(f"Uploaded batch {i//batch_size + 1}/{(len(points)-1)//batch_size + 1}")

Deleted existing collection
Uploaded batch 1/11
Uploaded batch 2/11
Uploaded batch 3/11
Uploaded batch 4/11
Uploaded batch 5/11
Uploaded batch 6/11
Uploaded batch 7/11
Uploaded batch 8/11
Uploaded batch 9/11
Uploaded batch 10/11
Uploaded batch 11/11
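The upload above gives every point a random `uuid.uuid4()`. That is fine here because the collection is deleted and recreated first, but upserting the same movies into an existing collection would add duplicates under new IDs. A possible alternative, sketched under the assumption that stable IDs are wanted, is to derive the UUID from the IMDb ID with `uuid.uuid5`. The namespace URL and the helper name `point_id_for` are hypothetical.

```python
import uuid

# Hypothetical project namespace; any fixed UUID works as long as it never changes.
MOVIE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/movies")

def point_id_for(imdb_id: str) -> str:
    """Derive a stable UUID string from an IMDb ID."""
    return str(uuid.uuid5(MOVIE_NAMESPACE, imdb_id))

first = point_id_for("tt0183523")
second = point_id_for("tt0183523")
other = point_id_for("tt0199753")
print(first == second, first == other)
```

With IDs like these, re-running the upload would overwrite each movie's existing point instead of creating a second copy.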


In [15]:
def search_qdrant(
    query_text:str, model:SentenceTransformer, top_k:int = 5):
    query_embedding = model.encode([query_text])
    normalized_query = normalize(query_embedding, axis = 1, norm='l2')

    results = qdrant_client.search(
        collection_name = 'movies',
        query_vector=normalized_query[0].tolist(),
        limit=top_k
    )

    movies = []
    for result in results:
        movies.append({
            'title': result.payload['title'],
            'overview': result.payload['overview'],
            'imdb_id': result.payload['imdb_id'],
            'similarity_score': result.score
        })
    return movies 

In [16]:
user_query = "NASA tried to rescue an astronaut stranded on Mars."
search_qdrant(query_text=user_query, model=model)

[{'title': 'Mission to Mars',
  'overview': 'When contact is lost with the crew of the first Mars expedition, a rescue mission is launched to discover their fate.',
  'imdb_id': 'tt0183523',
  'similarity_score': 0.6767248},
 {'title': 'Red Planet',
  'overview': 'Astronauts search for solutions to save a dying Earth by searching on Mars, only to have the mission go terribly awry.',
  'imdb_id': 'tt0199753',
  'similarity_score': 0.6620594},
 {'title': 'Robinson Crusoe on Mars',
  'overview': 'Stranded on Mars with only a monkey as a companion, an astronaut must figure out how to find oxygen, water, and food on the lifeless planet.',
  'imdb_id': 'tt0058530',
  'similarity_score': 0.651513},
 {'title': 'The Martian',
  'overview': 'During a manned mission to Mars, Astronaut Mark Watney is presumed dead after a fierce storm and left behind by his crew. But Watney has survived and finds himself stranded and alone on the hostile planet. With only meager supplies, he must draw upon his ing

### Pinecone

[Pinecone](https://www.pinecone.io/) is another option.

First, install the client:

```bash
pip install pinecone
```

In [17]:
from pinecone import Pinecone, ServerlessSpec

pinecone_api_key = os.getenv("PINECONE_API_KEY")  # keep the Qdrant `api_key` variable untouched
pc = Pinecone(api_key=pinecone_api_key)

index_name = "movies"

# Get the list of index names from the dictionaries
existing_indexes = [index["name"] for index in pc.list_indexes()]

if index_name in existing_indexes:
    print(f"Deleting existing index: {index_name}!")
    pc.delete_index(index_name)
else:
    print(f"Index '{index_name}' does not exist.")

vector_size = len(next(iter(movie_embeddings.values())))

pc.create_index(
    name=index_name,
    dimension=vector_size, # Replace with your model dimensions
    metric="cosine", # Replace with your model metric
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    ) 
)

Deleting existing index: movies!


In [18]:
# Get the index object
index = pc.Index(index_name)

batch_size = 1_000
for i in range(0, len(movie_ids), batch_size):
    batch_ids = movie_ids[i:i+batch_size]
    batch_vectors = []
    for movie_id in batch_ids:
        batch_vectors.append({
            'id': str(movie_id),
            'values': movie_embeddings[movie_id].tolist(),
            'metadata':{
                'title': imbdId2Title.get(movie_id, ''),
                'overview': imbdId2Overview.get(movie_id, "")
            }
        })
    index.upsert(vectors = batch_vectors)
    print(f"Uploaded batch {i//batch_size + 1}/{(len(movie_ids)-1)//batch_size + 1}")


Uploaded batch 1/11
Uploaded batch 2/11
Uploaded batch 3/11
Uploaded batch 4/11
Uploaded batch 5/11
Uploaded batch 6/11
Uploaded batch 7/11
Uploaded batch 8/11
Uploaded batch 9/11
Uploaded batch 10/11
Uploaded batch 11/11
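Both upload loops, for Qdrant and for Pinecone, slice the data into batches of 1,000 and compute the batch count by hand. That shared logic could be factored into a small generator. This is a sketch only; the name `iter_batches` is hypothetical and not part of either client library.

```python
from typing import Iterator, List, Sequence, Tuple

def iter_batches(items: Sequence, batch_size: int) -> Iterator[Tuple[int, int, List]]:
    """Yield (batch_number, total_batches, batch) for consecutive slices of items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = (len(items) + batch_size - 1) // batch_size  # ceiling division
    for start in range(0, len(items), batch_size):
        yield start // batch_size + 1, total, list(items[start:start + batch_size])

# Example with placeholder data standing in for movie IDs.
for number, total, batch in iter_batches(list(range(2500)), 1000):
    print(f"Batch {number}/{total} has {len(batch)} items")
```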


In [19]:
def search_pinecone(
    query_text:str, 
    model:SentenceTransformer, 
    top_k:int = 5):
    query_embedding = model.encode([query_text])
    normalized_query = normalize(query_embedding, axis = 1, norm='l2')

    results = index.query(
        vector=normalized_query[0].tolist(),
        top_k=top_k,
        include_metadata=True
    )

    # Format results
    movies = []
    for match in results['matches']:
        movies.append({
            "title": match['metadata']['title'],
            "overview": match['metadata']['overview'],
            "imdb_id": match['id'],
            "similarity_score": match['score']
        })
    
    return movies

In [21]:
user_query = "NASA tried to rescue an astronaut stranded on Mars."
search_pinecone(query_text=user_query, model=model)

[{'title': 'Mission to Mars',
  'overview': 'When contact is lost with the crew of the first Mars expedition, a rescue mission is launched to discover their fate.',
  'imdb_id': 'tt0183523',
  'similarity_score': 0.676724792},
 {'title': 'Red Planet',
  'overview': 'Astronauts search for solutions to save a dying Earth by searching on Mars, only to have the mission go terribly awry.',
  'imdb_id': 'tt0199753',
  'similarity_score': 0.662059426},
 {'title': 'Robinson Crusoe on Mars',
  'overview': 'Stranded on Mars with only a monkey as a companion, an astronaut must figure out how to find oxygen, water, and food on the lifeless planet.',
  'imdb_id': 'tt0058530',
  'similarity_score': 0.651513},
 {'title': 'The Martian',
  'overview': 'During a manned mission to Mars, Astronaut Mark Watney is presumed dead after a fierce storm and left behind by his crew. But Watney has survived and finds himself stranded and alone on the hostile planet. With only meager supplies, he must draw upon his

### Other options

- [Chroma](https://www.trychroma.com/) looks good too, but it needs to be self-hosted, which is more work.
- [Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/vector-database)
- [Milvus](https://milvus.io/), together with [zilliz](https://zilliz.com/pricing).

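## Building an API

The remaining problem is that all of this still requires code, and most real users do not write code. How do we let them type a query and get results back? This is where an API comes in: embedding the query text, running the search, and returning the results all happen over the network.

I built an API with [FastAPI](/cn/2024/09/01/fastapi/). The code is as follows: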

In [None]:
#|code-fold:true

from fastapi import FastAPI, HTTPException, Query
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import List, Optional
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize
from qdrant_client import QdrantClient
import os
from dotenv import load_dotenv
import numpy as np

# Load environment variables
load_dotenv()

# Initialize FastAPI app
app = FastAPI(
    title="Movie Search API",
    description="Search for movies using semantic similarity",
    version="1.0.0"
)

# Add CORS middleware to allow cross-origin requests (important for web clients)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Allows all origins
    allow_credentials=True,
    allow_methods=["*"],  # Allows all methods
    allow_headers=["*"],  # Allows all headers
)

# Initialize Qdrant client
qdrant_url = os.getenv("QDRANT_MOVIE")
qdrant_api_key = os.getenv("QDRANT_API_KEY")

qdrant_client = QdrantClient(
    url=qdrant_url,
    api_key=qdrant_api_key
)

# Initialize the SentenceTransformer model (load it only once at startup)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Define response model for better documentation and type checking
class MovieSearchResult(BaseModel):
    title: str
    overview: str
    imdb_id: str
    similarity_score: float

class SearchResponse(BaseModel):
    movies: List[MovieSearchResult]
    query: str

@app.get("/")
async def root():
    return {
        "message": "Welcome to Movie Search API!",
        "docs": "/docs",
        "usage": "Send GET requests to /search?query=your search text"
    }

@app.get("/search", response_model=SearchResponse)
async def search_movies(
    query: str = Query(..., description="The search query to find similar movies"),
    top_k: int = Query(5, ge=1, le=50, description="Number of results to return")
):
    try:
        # Encode the query text
        query_embedding = model.encode([query])
        normalized_query = normalize(query_embedding, axis=1, norm='l2')
        
        # Search in Qdrant
        results = qdrant_client.search(
            collection_name='movies',
            query_vector=normalized_query[0].tolist(),
            limit=top_k
        )
        
        # Format the results
        movies = []
        for result in results:
            movies.append(MovieSearchResult(
                title=result.payload['title'],
                overview=result.payload['overview'],
                imdb_id=result.payload['imdb_id'],
                similarity_score=result.score
            ))
        
        return SearchResponse(
            movies=movies,
            query=query
        )
    
    except Exception as e:
        # Log the error (you might want to use a proper logging system)
        print(f"Error during search: {e}")
        raise HTTPException(status_code=500, detail=f"Search failed: {str(e)}")

@app.get("/health")
async def health_check():
    """Health check endpoint to verify the service is running"""
    return {"status": "healthy"}

It runs fine locally:

{{< figure src="/media/cnblog/movie_api_search1.png" title="Running FastAPI locally">}}

{{< figure src="/media/cnblog/movie_api_search2.png" title="Running FastAPI locally">}}
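Once the server is running, any HTTP client can query the `/search` endpoint. Below is a minimal client sketch using only the standard library. The base URL assumes a local server on port 8000 (for example one started with `uvicorn main:app`, where the module name `main` is a guess), and the helper names `build_search_url` and `search` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the API is served locally on port 8000; adjust to your host.
BASE_URL = "http://127.0.0.1:8000"

def build_search_url(query: str, top_k: int = 5, base_url: str = BASE_URL) -> str:
    """Build the GET URL for the /search endpoint, URL-encoding the query text."""
    return f"{base_url}/search?{urlencode({'query': query, 'top_k': top_k})}"

def search(query: str, top_k: int = 5) -> list:
    """Call the API and return the list stored under "movies" in the JSON response."""
    with urlopen(build_search_url(query, top_k)) as response:
        return json.load(response)["movies"]

# Building the URL needs no running server; calling search() does.
print(build_search_url("astronaut stranded on Mars", top_k=3))
```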

However, deploying it to Vercel failed:

{{< figure src="/media/cnblog/movie_api_search3.png" title="FastAPI deployment on Vercel failed">}}

This is because the app needs too many packages:

```txt
fastapi[all]
sentence_transformers
qdrant_client
python-dotenv==1.0.0
scikit-learn==1.3.2
```

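Together they exceed the size limit Vercel allows.

The best option is to find a deployment platform suited to ML/LLM projects. I tried the free tier of render.com, but it was too slow because the free tier puts idle services to sleep.

The first workaround is to use a hosted embedding API, such as those from DeepSeek, OpenAI, or Gemini, so that a package as large as `sentence_transformers` is no longer needed. In principle the app could then be deployed to Vercel. The stored movie vectors would have to be regenerated with that same embedding model, because query and movie vectors must come from the same model to be comparable.

The second workaround is to use a dedicated API hosting platform, such as Google Cloud Platform, Azure, or AWS, or a smaller provider such as fly.io, railway.app, or heroku.com.

I will try these later.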