---

title: "BERT Models, Embeddings, and Search"
date: 2025-03-02
author: 郝鸿涛
slug: bert
draft: false
toc: true
tags: ML
---

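Suppose we have text descriptions of ten thousand movies. How do we recommend the few most relevant ones for a user's search? For example, if the user types "NASA tries to rescue an astronaut stranded on Mars", we would want to return *The Martian*. How can we implement this?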

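The idea is this: imagine a magical tool, something like a net. Any string of text that passes through it becomes a vector in a high-dimensional space, that is, a coordinate in that space. We pass the ten thousand movie descriptions through the net and get ten thousand coordinates. We pass the user's query through the same net and get one more coordinate. Finally, we compute the cosine similarity between the query's coordinate and each movie's coordinate, and the movies with the highest scores are our search results.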
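
The ranking step described above can be sketched in a few lines of plain Python. The three-dimensional vectors and the labels attached to them are made-up toy values, not real model output, and `cosine_similarity` and `top_k` are helper names introduced only for this illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: List[float], items: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k items whose vectors are most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values chosen for illustration only).
toy_movies = {
    "space survival": [0.9, 0.1, 0.0],
    "romantic comedy": [0.0, 0.2, 0.9],
    "alien invasion": [0.7, 0.6, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

ranking = top_k(toy_query, toy_movies, k=2)
print(ranking)
```

The real pipeline does the same thing, except that the vectors come from an embedding model and the comparison against all movies is delegated to a similarity-search library.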

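This magical net is an embedding model. Here we use one from the BERT family. Even better, with a multilingual model such as `paraphrase-multilingual-mpnet-base-v2`, we can search in Chinese even though the movie descriptions are in English.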

Multilingual models are relatively large, so we do not use one here. We use the smaller `all-MiniLM-L6-v2` instead.

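We also use [FAISS](https://pypi.org/project/faiss-cpu/) (Facebook AI Similarity Search) for the similarity search because it is fast. The index used in the search code computes inner products, which equal cosine similarities because all embeddings are L2-normalized first.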
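
The search code later in this post feeds L2-normalized vectors to an inner-product index (`faiss.IndexFlatIP`). That works as a cosine-similarity search because, once two vectors are scaled to unit length, their dot product is exactly their cosine similarity. The sketch below checks this numerically in plain Python; `dot` and `l2_normalize` are helper names made up for this example, and the two vectors are arbitrary.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale v so that its Euclidean length is 1."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


a = [3.0, 4.0]
b = [4.0, 3.0]

# Cosine similarity computed directly from its definition.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Inner product after normalizing both vectors to unit length.
inner_product_of_unit_vectors = dot(l2_normalize(a), l2_normalize(b))

print(cosine, inner_product_of_unit_vectors)
```

Both values agree up to floating-point rounding, which is why normalizing once up front lets the index use the cheaper inner product.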

## Raw data

Download the data: [https://hongtaoh.com/files/tmdb-movies.csv](/../files/tmdb-movies.csv).

In [77]:
import pandas as pd 
import numpy as np 

In [78]:
df = pd.read_csv('../static/files/tmdb-movies.csv')
df.shape

(10866, 21)

In [80]:
imbdId2Title = dict(zip(df.imdb_id, df.original_title))

The dataset is large, so I show only a few key columns:

In [81]:
df.iloc[0:5, [0, 1, 4, 5, 11]]

| | id | imdb_id | revenue | original_title | overview |
|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 1513528810 | Jurassic World | Twenty-two years after the events of Jurassic ... |
| 1 | 76341 | tt1392190 | 378436354 | Mad Max: Fury Road | An apocalyptic story set in the furthest reach... |
| 2 | 262500 | tt2908446 | 295238201 | Insurgent | Beatrice Prior must confront her inner demons ... |
| 3 | 140607 | tt2488496 | 2068178225 | Star Wars: The Force Awakens | Thirty years after defeating the Galactic Empi... |
| 4 | 168259 | tt2820852 | 1506249360 | Furious 7 | Deckard Shaw seeks revenge against Dominic Tor... |


In [82]:
# We only need these two columns
movie_ids = df.imdb_id.tolist()
movie_overviews = df.overview.tolist()
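
One thing to watch for at this step: if the `overview` column has missing values, pandas represents them as float `NaN`, which is not a string and may not be accepted by a text encoder. Below is a minimal sketch of one way to guard against that, assuming missing overviews should simply be treated as empty text. The helper name `clean_overviews` is made up for this example, and the sample strings are not taken from the dataset.

```python
from typing import Any, List


def clean_overviews(overviews: List[Any]) -> List[str]:
    """Replace non-string entries (such as float NaN) with an empty string."""
    return [text if isinstance(text, str) else "" for text in overviews]


# float("nan") stands in for a missing value as pandas would return it from .tolist()
sample = ["A stranded astronaut must survive.", float("nan"), "Dinosaurs escape."]
cleaned = clean_overviews(sample)
print(cleaned)
```

With the notebook's data this would be `movie_overviews = clean_overviews(df.overview.tolist())`; calling `df.overview.fillna("")` before `.tolist()` is the pandas equivalent.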

## Embeddings

In [83]:
from typing import Dict, List, Tuple

import faiss
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize

# Load the embedding model once; it is reused for movies and queries.
model = SentenceTransformer('all-MiniLM-L6-v2')

In [84]:
def get_movie_embeddings(
    movie_ids:List[str], 
    movie_overviews:List[str], 
    model: SentenceTransformer,
    batch_size: int = 32, 
    ) -> Dict[str, np.ndarray]:
    """
    Generates normalized embeddings for movie overviews.

    Args:
        movie_ids: List of movie IDs.
        movie_overviews: List of movie overviews.
        model: Embedding model (e.g., SentenceTransformer).
        batch_size: Batch size for processing.

    Returns:
        Dictionary mapping movie IDs to normalized embeddings.
    """
    if len(movie_ids) != len(movie_overviews):
        raise ValueError("movie_ids and movie_overviews must have the same length.")

    all_embeddings = []
    for i in range(0, len(movie_ids), batch_size):
        batch_movies = movie_overviews[i:i+batch_size]
        try:
            batch_embeddings = model.encode(batch_movies)
            all_embeddings.extend(batch_embeddings)
        except Exception as e:
            print(f"Error encoding batch {i}: {e}")
            raise
    try:
        all_embeddings_np = np.array(all_embeddings)  # stack into a 2D array: (n_movies, dim)
        normalized_embeddings = normalize(all_embeddings_np, axis=1, norm='l2')
    except Exception as e:
        print(f"Error normalizing embeddings: {e}")
        raise

    movie_embeddings = dict(zip(movie_ids, normalized_embeddings))
    return movie_embeddings

In [85]:
movie_embeddings = get_movie_embeddings(movie_ids, movie_overviews, model)
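
As a possible simplification, `SentenceTransformer.encode` accepts `batch_size` and `normalize_embeddings` arguments, so the manual batching loop and the separate `sklearn` normalization step could be folded into one call. The sketch below is an alternative under that assumption, not the function used in this post. `embed_overviews` is a hypothetical name, and `model` is expected to be a loaded `SentenceTransformer`.

```python
from typing import Any, Dict, List


def embed_overviews(
    model: Any,  # expected: a loaded SentenceTransformer
    movie_ids: List[str],
    movie_overviews: List[str],
    batch_size: int = 32,
) -> Dict[str, Any]:
    """Map each movie ID to an L2-normalized embedding of its overview."""
    if len(movie_ids) != len(movie_overviews):
        raise ValueError("movie_ids and movie_overviews must have the same length.")
    # encode() batches internally; normalize_embeddings=True asks it to
    # return unit-length vectors, so no separate normalization is needed.
    embeddings = model.encode(
        movie_overviews,
        batch_size=batch_size,
        normalize_embeddings=True,
    )
    return dict(zip(movie_ids, embeddings))
```

Usage would mirror the call above: `movie_embeddings = embed_overviews(model, movie_ids, movie_overviews)`.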

## Search

In [86]:
def search_movies(
    movie_embeddings: Dict[str, np.ndarray],
    user_query: str,
    model: SentenceTransformer,
    k: int = 10,
) -> Tuple[List[str], np.ndarray]:
    """
    Searches movies based on a user query using Faiss and cosine similarity.

    Args:
        movie_embeddings: Dictionary of movie IDs to normalized embeddings.
        user_query: The user's search query.
        model: SentenceTransformer model.
        k: Number of results to return.

    Returns:
        Tuple of (titles, scores) for the top-k movies, ordered from most
        to least similar. Titles are looked up in the global imbdId2Title.
    """
    # Build the Faiss index: stack the embeddings into a 2D array of shape
    # (n_movies, dim) and add them to a flat inner-product index.
    embeddings = np.array(list(movie_embeddings.values()))
    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)

    #embed and normalize the user query
    query_embedding = model.encode([user_query])
    normalized_query_embedding = normalize(query_embedding, axis = 1, norm = 'l2')

    #search the index
    distances, indices = index.search(normalized_query_embedding, k)

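    # Map Faiss row indices back to movie IDs, then to titles
    movie_id_list = list(movie_embeddings.keys())
    top_k_ids = [movie_id_list[i] for i in indices[0]]
    top_k_titles = [imbdId2Title[x] for x in top_k_ids]
    # With normalized vectors, inner products are cosine similarities
    top_k_scores = distances[0]
    return top_k_titles, top_k_scores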

In [87]:
user_query = "NASA tried to rescue an astronaut stranded on Mars."
results = search_movies(
    movie_embeddings,
    user_query,
    model
)

In [88]:
results 

(['Mission to Mars',
  'Red Planet',
  'Robinson Crusoe on Mars',
  'The Martian',
  'Capricorn One',
  'My Favorite Martian',
  'The Last Days on Mars',
  'Infini',
  '(T)Raumschiff Surprise - Periode 1',
  'My Stepmother is an Alien'],
 array([0.6767248 , 0.6620594 , 0.651513  , 0.55047977, 0.5192548 ,
        0.49691725, 0.4924257 , 0.46683484, 0.46024293, 0.44739133],
       dtype=float32))

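The results look right: the top hits are overwhelmingly Mars- or space-themed, and *The Martian* appears in fourth place with a similarity score of about 0.55.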