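# JINA-CLIP-V2 Multimodal Model Capability Tests

This notebook tests the cross-modal understanding and embedding capabilities of the JINA-CLIP-V2 model. It covers the following areas:

1. Multilingual text embeddings and similarity
2. Image embeddings and similarity
3. Cross-modal (text-image) retrieval
4. Zero-shot classification
5. Multilingual understanding

These tests help us understand the model's practical performance and the scenarios it suits.

## 1. Environment Setup and Model Initialization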

In [None]:
# Install the required libraries
!pip install sentence-transformers requests tqdm matplotlib seaborn pillow scikit-learn -q

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import os
from PIL import Image
from tqdm import tqdm
from sklearn.metrics import accuracy_score
import matplotlib

# Configure fonts so that Chinese text renders in plots
matplotlib.rcParams['font.sans-serif'] = ['SimHei', 'Arial Unicode MS']
matplotlib.rcParams['axes.unicode_minus'] = False  # Render minus signs correctly

In [None]:
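# Model path
# Note: the original local folder was named "jina-colbert-v2", which is a different
# (text-only) model; this notebook needs jina-clip-v2 to encode images.
# model_path = r"C:\Users\k\Desktop\BaiduSyncdisk\baidu_sync_documents\hf_models\jina-colbert-v2"  # Local model path
model_path = "jinaai/jina-clip-v2"  # Hugging Face Hub model ID

# Initialize the model
model = SentenceTransformer(model_path, trust_remote_code=True)
print("Model loaded.")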

## 2. Helper Functions

In [None]:
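# Helper for plotting similarity heatmaps
def plot_similarity_heatmap(embeddings, labels, title='Embeddings Similarity Heatmap', figsize=(12, 10), rotation=45):
    """
    Plot a heatmap of pairwise similarities between embeddings.

    Args:
        embeddings: Array of embedding vectors, one per row
        labels: List of labels, one per embedding
        title: Heatmap title
        figsize: Figure size
        rotation: Rotation angle of the x-axis labels
    """
    # Compute the similarity matrix as pairwise dot products
    if not np.allclose(np.linalg.norm(embeddings, axis=1), 1.0, atol=1e-5):
        print("Warning: embeddings may not be normalized. The dot product equals cosine similarity only for unit-length vectors.")
    similarity_matrix = embeddings @ embeddings.T

    # Plot the heatmap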
    plt.figure(figsize=figsize)
    sns.heatmap(
        similarity_matrix, 
        annot=True, 
        cmap='viridis', 
        fmt=".2f", 
        xticklabels=labels, 
        yticklabels=labels
    )
    plt.title(title)
    plt.xticks(rotation=rotation, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()

def plot_cross_modal_similarity(img_embeds, text_embeds, img_labels, text_labels, title='Cross-Modal Similarity'):
    """
    Plot a heatmap of image-text similarities.

    Args:
        img_embeds: Image embedding vectors
        text_embeds: Text embedding vectors
        img_labels: List of image labels
        text_labels: List of text labels
        title: Heatmap title
    """
    # Compute the image-text similarity matrix
    similarity_matrix = img_embeds @ text_embeds.T

    # Plot the heatmap
    plt.figure(figsize=(14, len(img_labels) * 0.7))
    sns.heatmap(
        similarity_matrix, 
        annot=True, 
        cmap='viridis', 
        fmt=".2f", 
        xticklabels=text_labels, 
        yticklabels=img_labels
    )
    plt.title(title)
    plt.xlabel('Text')
    plt.ylabel('Image')
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()
    
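def download_image(url, save_path):
    """
    Download an image and save it to the given path.

    Args:
        url: Image URL
        save_path: Destination path

    Returns:
        The save path on success, None on failure
    """
    try:
        # A timeout keeps a stalled connection from hanging the notebook
        response = requests.get(url, stream=True, timeout=30)
        response.raise_for_status()
        # Fall back to the current directory when save_path has no folder part
        os.makedirs(os.path.dirname(save_path) or '.', exist_ok=True)
        with open(save_path, 'wb') as file:
            # Each iteration yields one chunk of up to 8192 bytes
            for chunk in tqdm(response.iter_content(chunk_size=8192),
                             desc=f"Downloading {os.path.basename(url)}",
                             unit='chunk'):
                file.write(chunk)
        return save_path
    except Exception as e:
        print(f"Error downloading {url}: {e}")
        return None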
    
def display_images(image_paths, titles=None, figsize=(15, 10), columns=3):
    """
    Display several images in a grid.

    Args:
        image_paths: List of image paths
        titles: List of titles
        figsize: Figure size
        columns: Number of columns
    """
    rows = (len(image_paths) + columns - 1) // columns
    fig = plt.figure(figsize=figsize)
    
    for i, image_path in enumerate(image_paths):
        img = Image.open(image_path)
        ax = fig.add_subplot(rows, columns, i + 1)
        
        if titles is not None and i < len(titles):
            ax.set_title(titles[i])
            
        ax.imshow(img)
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()
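
The heatmap helpers above rely on the fact that, for L2-normalized vectors, the dot product equals cosine similarity. The following is a minimal NumPy sketch of that relationship on hand-written toy vectors. The names `l2_normalize`, `cosine_matrix`, and `toy` are illustrative and not part of the notebook; real embeddings would come from `model.encode(...)`.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm (eps guards against division by zero)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    return l2_normalize(a) @ l2_normalize(b).T

# Toy vectors chosen by hand: rows 0 and 1 are parallel, row 2 is orthogonal to both
toy = np.array([[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]])
sims = cosine_matrix(toy, toy)
print(np.round(sims, 2))
```

Passing `normalize_embeddings=True` to `model.encode`, as the cells below do, makes the explicit normalization step unnecessary.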

## 3. Preparing Multilingual Text and Image Data

In [None]:
# Create the directory for downloaded images
save_dir = 'images'
os.makedirs(save_dir, exist_ok=True)

# A varied set of image URLs
image_urls = [
    'https://i.ibb.co/nQNGqL0/beach1.jpg',  # Beach sunset
    'https://i.ibb.co/r5w8hG8/beach2.jpg',  # Another beach sunset
    'https://i.ibb.co/Sx3mLpB/city-night.jpg',  # City at night
    'https://i.ibb.co/6WyNVHM/mountain.jpg',  # Mountains
    'https://i.ibb.co/KmpJGz2/cat.jpg',  # Cat
    'https://i.ibb.co/4RxpzWC/food.jpg'   # Food
]

# Download the images
image_paths = []
for url in image_urls:
    name = url.split('/')[-1]
    save_path = os.path.join(save_dir, name)
    result = download_image(url, save_path)
    if result:
        print(f"Downloaded: {result}")
        image_paths.append(result)
    else:
        print(f"Download failed: {url}")

# Use the file names (without extension) as image labels
image_labels = [os.path.splitext(os.path.basename(path))[0] for path in image_paths]

In [None]:
# Display the downloaded images
display_images(image_paths, image_labels)

In [None]:
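# Multilingual text data: the same beach-sunset sentence in 12 languages
# (dictionary keys are the language names in Chinese: Arabic, Chinese, English, French,
# German, Greek, Hindi, Italian, Japanese, Korean, Russian, Spanish)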
sunset_texts = {
    '阿拉伯语': 'غروب جميل على الشاطئ',
    '中文': '海滩上美丽的日落', 
    '英语': 'A beautiful sunset over the beach',
    '法语': 'Un beau coucher de soleil sur la plage', 
    '德语': 'Ein wunderschöner Sonnenuntergang am Strand', 
    '希腊语': 'Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία', 
    '印地语': 'समुद्र तट पर एक खूबसूरत सूर्यास्त', 
    '意大利语': 'Un bellissimo tramonto sulla spiaggia', 
    '日语': '浜辺に沈む美しい夕日', 
    '韩语': '해변 위로 아름다운 일몰',
    '俄语': 'Красивый закат над пляжем',
    '西班牙语': 'Una hermosa puesta de sol sobre la playa'
}

# Extract the labels and the sentences
sunset_labels = list(sunset_texts.keys())
sunset_sentences = list(sunset_texts.values())

In [None]:
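# Multilingual descriptions of each scene
# (outer keys: beach sunset, city night view, mountains, cat, food;
# inner keys: Chinese, English, French)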
scene_descriptions = {
    '海滩日落': {
        '中文': '海滩上美丽的日落',
        '英语': 'A beautiful sunset over the beach',
        '法语': 'Un beau coucher de soleil sur la plage'
    },
    '城市夜景': {
        '中文': '灯火辉煌的城市夜景',
        '英语': 'A brightly lit city skyline at night',
        '法语': 'Un paysage urbain brillamment illuminé la nuit'
    },
    '山脉': {
        '中文': '雄伟的山脉风景',
        '英语': 'Majestic mountain landscape',
        '法语': 'Paysage montagneux majestueux'
    },
    '猫': {
        '中文': '可爱的小猫',
        '英语': 'A cute cat',
        '法语': 'Un chat mignon'
    },
    '食物': {
        '中文': '美味的食物',
        '英语': 'Delicious food',
        '法语': 'Nourriture délicieuse'
    }
}

# Flatten all scene descriptions into parallel lists of texts and labels
all_scene_texts = []
scene_text_labels = []

for scene, languages in scene_descriptions.items():
    for lang, text in languages.items():
        all_scene_texts.append(text)
        scene_text_labels.append(f"{scene} ({lang})")

## 4. Multilingual Text Embeddings and Similarity Analysis

In [None]:
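# Encode the multilingual sunset sentences
sunset_embeddings = model.encode(sunset_sentences, normalize_embeddings=True)
print(f"Embedding shape: {sunset_embeddings.shape}")

# Plot the similarity heatmap for the multilingual sentences
plot_similarity_heatmap(
    sunset_embeddings, 
    sunset_labels, 
    title='Embedding similarity of the same sentence in different languages',
    figsize=(14, 12)
)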

In [None]:
# Chinese paraphrase test: variations on "a beautiful sunset on the beach"
chinese_similar_sunset = [
    '海滩上美丽的日落',
    '海滩上美丽的夕阳',
    '海滩上美丽的黄昏',
    '海滩上美丽的晚霞',
    '海边迷人的落日',
    '沙滩上绚丽的日落',
    '沿海壮观的夕阳'
]

# Encode the Chinese paraphrases
chinese_similar_sunset_embeddings = model.encode(chinese_similar_sunset, normalize_embeddings=True)

# Plot the similarity between the Chinese paraphrases
plot_similarity_heatmap(
    chinese_similar_sunset_embeddings, 
    chinese_similar_sunset, 
    title='Similarity of Chinese paraphrases',
    figsize=(12, 10)
)

In [None]:
# Chinese texts on unrelated topics
chinese_diverse_texts = [
    '海滩上美丽的日落',
    '今天天气真好',
    '我喜欢吃苹果',
    '北京是中国的首都',
    '人工智能正在改变世界',
    '熊猫是中国的国宝',
    '长城是世界奇迹之一',
    '电影院里人很多',
    '学校明天放假'
]

# Encode the Chinese texts on unrelated topics
chinese_diverse_embeddings = model.encode(chinese_diverse_texts, normalize_embeddings=True)

# Plot the similarity between the unrelated Chinese texts
plot_similarity_heatmap(
    chinese_diverse_embeddings, 
    chinese_diverse_texts, 
    title='Similarity of Chinese texts on different topics',
    figsize=(12, 10)
)

## 5. Image Embeddings and Similarity Analysis

In [None]:
# Encode the images
image_embeddings = model.encode(image_paths, normalize_embeddings=True)
print(f"Image embedding shape: {image_embeddings.shape}")

# Plot the image similarity heatmap
plot_similarity_heatmap(
    image_embeddings, 
    image_labels, 
    title='Image embedding similarity heatmap',
    figsize=(10, 8)
)

## 6. Cross-Modal Retrieval Tests

In [None]:
# Encode the scene description texts
scene_text_embeddings = model.encode(all_scene_texts, normalize_embeddings=True)

# Build and visualize the image-text similarity matrix
plot_cross_modal_similarity(
    image_embeddings, 
    scene_text_embeddings, 
    image_labels, 
    scene_text_labels, 
    title='Image-text cross-modal similarity heatmap'
)
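
A heatmap with this many columns can be hard to read at a glance. As a complementary view, the sketch below reports, for each row of a similarity matrix, the best-matching column and its margin over the runner-up. The matrix values, the labels, and the helper name `best_match_per_row` are made up for illustration; in the notebook the matrix would be `image_embeddings @ scene_text_embeddings.T`.

```python
import numpy as np

def best_match_per_row(similarity_matrix, column_labels):
    """For each row, return (best label, best score, margin over the runner-up)."""
    sim = np.asarray(similarity_matrix, dtype=float)
    order = np.argsort(-sim, axis=1)  # column indices per row, best first
    rows = np.arange(sim.shape[0])
    best = sim[rows, order[:, 0]]
    if sim.shape[1] > 1:
        runner_up = sim[rows, order[:, 1]]
    else:
        runner_up = np.full_like(best, np.nan)  # no runner-up with one column
    return [(column_labels[j], float(b), float(b - r))
            for j, b, r in zip(order[:, 0], best, runner_up)]

# Hypothetical similarity values: 2 images (rows) x 3 texts (columns)
toy_matrix = np.array([[0.31, 0.12, 0.08],
                       [0.10, 0.27, 0.25]])
toy_labels = ["beach text", "city text", "cat text"]
matches = best_match_per_row(toy_matrix, toy_labels)
for label, score, margin in matches:
    print(f"{label}: score={score:.2f}, margin={margin:.2f}")
```

A small margin, as in the second toy row, suggests the top match is not clearly separated from the next candidate.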

In [None]:
# Text-to-image retrieval
def text_to_image_retrieval(query_text, image_paths, image_embeddings, top_k=3):
    # Encode the query text
    query_embedding = model.encode(query_text, normalize_embeddings=True)

    # Compute the similarity to every image
    similarities = query_embedding @ image_embeddings.T

    # Take the top_k images with the highest similarity
    top_indices = np.argsort(similarities)[::-1][:top_k]
    top_similarities = similarities[top_indices]
    top_images = [image_paths[i] for i in top_indices]

    return top_images, top_similarities, top_indices

# Test cases: queries in different languages (keys are the language names in Chinese)
queries = {
    '英语': 'A beautiful sunset by the ocean',  # English
    '中文': '城市的夜景',  # Chinese: "city night view"
    '法语': 'Un chat mignon',  # French: "a cute cat"
    '西班牙语': 'Deliciosa comida en un plato',  # Spanish: "delicious food on a plate"
    '德语': 'Majestätische Berge mit Schnee'  # German: "majestic mountains with snow"
}

# Run each query
for language, query in queries.items():
    print(f"\nQuery ({language}): {query}")
    top_images, similarities, _ = text_to_image_retrieval(query, image_paths, image_embeddings)
    
    # Show the results
    titles = [f"{os.path.basename(img)} (similarity: {sim:.2f})" for img, sim in zip(top_images, similarities)]
    display_images(top_images, titles, figsize=(15, 5), columns=3)
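
The retrieval step itself needs no model once the embeddings exist: it is a matrix-vector product followed by a sort. The sketch below isolates that step on toy unit vectors so it can be checked by hand. `top_k_indices`, `candidates`, and `query` are illustrative names, and the vectors are made up rather than produced by the model.

```python
import numpy as np

def top_k_indices(query_vec, candidate_matrix, k=3):
    """Return indices and scores of the k candidates most similar to query_vec.

    Assumes query_vec has shape (d,) and candidate_matrix has shape (n, d),
    both already L2-normalized, so the dot product is cosine similarity.
    """
    scores = candidate_matrix @ query_vec
    k = min(k, scores.shape[0])  # never ask for more results than candidates
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return order, scores[order]

# Toy unit vectors standing in for image embeddings (hypothetical values)
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query = np.array([0.0, 1.0])
idx, top_scores = top_k_indices(query, candidates, k=2)
print(idx, top_scores)
```

For a handful of images a full `argsort` is fine; for large collections `np.argpartition` or an approximate nearest-neighbor index would likely be a better fit.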

In [None]:
# Image-to-text retrieval
def image_to_text_retrieval(query_image, texts, text_embeddings, top_k=5):
    # Encode the query image
    query_embedding = model.encode(query_image, normalize_embeddings=True)

    # Compute the similarity to every text
    similarities = query_embedding @ text_embeddings.T

    # Take the top_k texts with the highest similarity
    top_indices = np.argsort(similarities)[::-1][:top_k]
    top_similarities = similarities[top_indices]
    top_texts = [texts[i] for i in top_indices]

    return top_texts, top_similarities, top_indices

# Merge all texts into a single retrieval pool
all_texts = sunset_sentences + all_scene_texts + chinese_diverse_texts
all_text_labels = sunset_labels + scene_text_labels + chinese_diverse_texts
all_text_embeddings = model.encode(all_texts, normalize_embeddings=True)

# Pick a few images to test (assumes all six downloads succeeded)
test_images = [image_paths[0], image_paths[2], image_paths[4]]
test_image_labels = [image_labels[0], image_labels[2], image_labels[4]]

for img_path, img_label in zip(test_images, test_image_labels):
    print(f"\nImage query: {img_label}")
    display_images([img_path], [img_label], figsize=(5, 5))
    
    top_texts, similarities, indices = image_to_text_retrieval(img_path, all_texts, all_text_embeddings)
    
    print("Matched texts:")
    for i, (text, sim) in enumerate(zip(top_texts, similarities)):
        print(f"{i+1}. Text: '{text}' (similarity: {sim:.3f}) - label: {all_text_labels[indices[i]]}")
    print("-" * 80)

## 7. Zero-Shot Classification Tests

In [None]:
# Zero-shot image classification
def zero_shot_image_classification(image_paths, class_names):
    # Encode the images
    image_embeddings = model.encode(image_paths, normalize_embeddings=True)

    # Encode the class names
    class_embeddings = model.encode(class_names, normalize_embeddings=True)

    # Compute the similarity between every image and every class
    similarities = image_embeddings @ class_embeddings.T

    # For each image, pick the most similar class.
    # Note: 'confidence' is the raw cosine similarity, not a calibrated probability.
    predicted_classes = np.argmax(similarities, axis=1)
    predicted_class_names = [class_names[i] for i in predicted_classes]
    confidence_scores = np.max(similarities, axis=1)

    # Assemble the full results
    results = []
    for i, (img_path, pred_class, confidence) in enumerate(zip(image_paths, predicted_class_names, confidence_scores)):
        results.append({
            'image': img_path,
            'predicted_class': pred_class,
            'confidence': confidence,
            'all_scores': {class_name: float(similarities[i, j]) for j, class_name in enumerate(class_names)}
        })

    return results

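# Define the class names, one set in English and one in Chinese (same order)
class_names_en = ['sunset at beach', 'cityscape at night', 'mountains', 'cat', 'food']
class_names_zh = ['海滩日落', '城市夜景', '山脉', '猫', '食物']

# Run zero-shot classification with the English class names
results_en = zero_shot_image_classification(image_paths, class_names_en)

# Visualize the English classification results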
for result in results_en:
    img_path = result['image']
    pred_class = result['predicted_class']
    confidence = result['confidence']
    img_name = os.path.basename(img_path)
    
    plt.figure(figsize=(10, 6))
    
    # Show the image
    plt.subplot(1, 2, 1)
    img = Image.open(img_path)
    plt.imshow(img)
    plt.title(f"Image: {img_name}\nPredicted class: {pred_class} (confidence: {confidence:.2f})")
    plt.axis('off')
    
    # Show the scores for all classes
    plt.subplot(1, 2, 2)
    all_scores = result['all_scores']
    classes = list(all_scores.keys())
    scores = list(all_scores.values())
    
    y_pos = np.arange(len(classes))
    plt.barh(y_pos, scores, align='center')
    plt.yticks(y_pos, classes)
    plt.xlabel('Similarity score')
    plt.title('Scores per class')
    
    plt.tight_layout()
    plt.show()

# Run zero-shot classification with the Chinese class names
results_zh = zero_shot_image_classification(image_paths, class_names_zh)

# Compare the English and Chinese classification results
print("English vs. Chinese zero-shot classification results:")
print("-" * 80)
print(f"{'Image':<15} | {'EN prediction':<20} | {'EN conf.':<10} | {'ZH prediction':<20} | {'ZH conf.':<10} | {'Match':<10}")
print("-" * 80)

for en_result, zh_result in zip(results_en, results_zh):
    img_name = os.path.basename(en_result['image'])
    en_pred = en_result['predicted_class']
    en_conf = en_result['confidence']
    zh_pred = zh_result['predicted_class']
    zh_conf = zh_result['confidence']
    
    # The results agree when both predictions map to the same class index
    en_idx = class_names_en.index(en_pred)
    zh_idx = class_names_zh.index(zh_pred)
    consistent = "✓" if en_idx == zh_idx else "✗"
    
    print(f"{img_name:<15} | {en_pred:<20} | {en_conf:<10.4f} | {zh_pred:<20} | {zh_conf:<10.4f} | {consistent:<10}")

print("-" * 80)
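
The "confidence" values in this section are raw cosine similarities, so they do not sum to 1 across classes. One common way to obtain probability-like scores is a softmax over the per-class similarities scaled by a temperature. The sketch below assumes a temperature of 0.01 purely for illustration; it is not a value taken from the model, and `similarities_to_probs` and `toy_sims` are hypothetical names.

```python
import numpy as np

def similarities_to_probs(similarities, temperature=0.01):
    """Row-wise softmax over class similarities.

    temperature is a hypothetical scaling factor chosen for illustration;
    smaller values give sharper distributions.
    """
    logits = np.asarray(similarities, dtype=float) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up similarities: 2 images (rows) x 3 classes (columns)
toy_sims = np.array([[0.30, 0.10, 0.05],
                     [0.12, 0.28, 0.11]])
probs = similarities_to_probs(toy_sims)
print(np.round(probs, 4))
```

The predicted class is unchanged by this transformation because softmax preserves the ordering within each row; only the scale of the scores changes.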

## 8. Model Performance Analysis

In [None]:
# Analyze how consistently the model handles different languages
language_groups = {
    '西方语言': ['英语', '法语', '德语', '意大利语', '西班牙语'],  # Western: English, French, German, Italian, Spanish
    '东亚语言': ['中文', '日语', '韩语'],  # East Asian: Chinese, Japanese, Korean
    '其他语言': ['阿拉伯语', '希腊语', '印地语', '俄语']  # Other: Arabic, Greek, Hindi, Russian
}

# Compute the average similarity within each group
print("Average similarity within each language group:")
for group_name, languages in language_groups.items():
    # Collect the embeddings of the languages in this group
    indices = [sunset_labels.index(lang) for lang in languages if lang in sunset_labels]
    if not indices:
        continue
    
    group_embeddings = sunset_embeddings[indices]
    similarity_matrix = group_embeddings @ group_embeddings.T
    
    # Average the upper triangle of the matrix (excluding the diagonal)
    n = similarity_matrix.shape[0]
    upper_triangle_sum = 0
    upper_triangle_count = 0
    
    for i in range(n):
        for j in range(i+1, n):
            upper_triangle_sum += similarity_matrix[i, j]
            upper_triangle_count += 1
    
    if upper_triangle_count > 0:
        avg_similarity = upper_triangle_sum / upper_triangle_count
        print(f"{group_name}: {avg_similarity:.4f}")

# Compute the average similarity between groups
print("\nAverage similarity between language groups:")
for group1_name, languages1 in language_groups.items():
    for group2_name, languages2 in language_groups.items():
        if group1_name >= group2_name:  # Skip self-pairs and duplicate pairs
            continue
        
        # Collect the embeddings of both groups
        indices1 = [sunset_labels.index(lang) for lang in languages1 if lang in sunset_labels]
        indices2 = [sunset_labels.index(lang) for lang in languages2 if lang in sunset_labels]
        
        if not indices1 or not indices2:
            continue
        
        group1_embeddings = sunset_embeddings[indices1]
        group2_embeddings = sunset_embeddings[indices2]
        
        # Compute the cross-group similarity
        cross_similarity = group1_embeddings @ group2_embeddings.T
        avg_cross_similarity = np.mean(cross_similarity)
        
        print(f"{group1_name} vs {group2_name}: {avg_cross_similarity:.4f}")
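
The within-group average is the mean of the strict upper triangle of the similarity matrix, which excludes the diagonal of self-similarities. The nested loop in this cell can also be expressed with `np.triu_indices`, as in the sketch below. The helper name `mean_pairwise_similarity` and the toy vectors are illustrative, not part of the notebook.

```python
import numpy as np

def mean_pairwise_similarity(embeddings):
    """Mean of the strict upper triangle of the similarity matrix.

    Returns nan when fewer than two embeddings are given, since there
    are no pairs to average.
    """
    sim = embeddings @ embeddings.T
    rows, cols = np.triu_indices(sim.shape[0], k=1)  # k=1 skips the diagonal
    if rows.size == 0:
        return float("nan")
    return float(sim[rows, cols].mean())

# Toy unit vectors: pairwise similarities are 0.0, 0.6 and 0.8
toy = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
avg = mean_pairwise_similarity(toy)
print(f"{avg:.4f}")
```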

In [None]:
# Analyze cross-modal retrieval performance

# Ground truth: which image files belong to each scene
scene_to_image = {
    '海滩日落': ['beach1.jpg', 'beach2.jpg'],
    '城市夜景': ['city-night.jpg'],
    '山脉': ['mountain.jpg'],
    '猫': ['cat.jpg'],
    '食物': ['food.jpg']
}

# Flatten the scene descriptions (one entry per scene and language)
scene_descriptions_flat = []
true_image_labels = []

for scene, image_names in scene_to_image.items():
    if scene not in scene_descriptions:
        continue
        
    for lang, text in scene_descriptions[scene].items():
        scene_descriptions_flat.append(text)
        # For each description, record the images it should match
        true_image_labels.append([os.path.splitext(img)[0] for img in image_names])

# Evaluate text-to-image retrieval
text_to_image_results = []
for i, query_text in enumerate(scene_descriptions_flat):
    # Rank all images for this scene description
    _, _, top_indices = text_to_image_retrieval(query_text, image_paths, image_embeddings, top_k=len(image_paths))
    retrieved_image_labels = [image_labels[j] for j in top_indices]
    
    # Find the rank of each true image (inf if it was not retrieved)
    true_labels = true_image_labels[i]
    ranks = []
    for true_label in true_labels:
        if true_label in retrieved_image_labels:
            ranks.append(retrieved_image_labels.index(true_label) + 1)
        else:
            ranks.append(float('inf'))
    
    # Average rank over the true images that were actually retrieved;
    # this avoids a division by zero when none of them were found
    finite_ranks = [r for r in ranks if r < float('inf')]
    avg_rank = sum(finite_ranks) / len(finite_ranks) if finite_ranks else float('inf')
    
    # Precision@1: whether the top-ranked image is a correct match
    precision_at_1 = 1 if retrieved_image_labels[0] in true_labels else 0
    
    scene = next((s for s, langs in scene_descriptions.items() if query_text in langs.values()), "unknown")
    lang = next((l for l, text in scene_descriptions.get(scene, {}).items() if text == query_text), "unknown")
    
    text_to_image_results.append({
        'query': query_text,
        'scene': scene,
        'language': lang,
        'true_labels': true_labels,
        'retrieved_labels': retrieved_image_labels[:5],  # Keep only the top 5
        'average_rank': avg_rank,
        'precision_at_1': precision_at_1
    })

# Print the results summary
print("Text-to-image retrieval performance:")
print("-" * 100)
print(f"{'Scene':<15} | {'Language':<10} | {'Query text':<40} | {'Precision@1':<15} | {'Avg. rank':<10}")
print("-" * 100)

for result in text_to_image_results:
    print(f"{result['scene']:<15} | {result['language']:<10} | {result['query'][:37]+'...' if len(result['query'])>40 else result['query']:<40} | {result['precision_at_1']:<15} | {result['average_rank']:.2f}")

print("-" * 100)

# Average performance grouped by language (sorted for a stable print order)
languages = sorted(set(result['language'] for result in text_to_image_results))
print("\nAverage retrieval performance per language:")
print(f"{'Language':<10} | {'Mean Precision@1':<20} | {'Avg. rank':<10}")
print("-" * 50)

for lang in languages:
    lang_results = [r for r in text_to_image_results if r['language'] == lang]
    avg_precision = sum(r['precision_at_1'] for r in lang_results) / len(lang_results) if lang_results else 0
    avg_rank = sum(r['average_rank'] for r in lang_results) / len(lang_results) if lang_results else float('inf')
    
    print(f"{lang:<10} | {avg_precision:<20.2f} | {avg_rank:.2f}")
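
Precision@1 and average rank are only two of the usual retrieval metrics. When a query has several correct images, as the beach scene does, Recall@k and reciprocal rank are often reported as well. The sketch below is one plain-Python way to compute them; the ranking shown is made up for illustration and is not a result from the model.

```python
def reciprocal_rank(ranked_labels, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for position, label in enumerate(ranked_labels, start=1):
        if label in relevant:
            return 1.0 / position
    return 0.0

def recall_at_k(ranked_labels, relevant, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for label in ranked_labels[:k] if label in relevant)
    return hits / len(relevant)

# Hypothetical ranking for a beach-sunset query with two correct images
ranked = ["city-night", "beach1", "mountain", "beach2"]
relevant = {"beach1", "beach2"}
rr = reciprocal_rank(ranked, relevant)
r2 = recall_at_k(ranked, relevant, k=2)
r4 = recall_at_k(ranked, relevant, k=4)
print(rr, r2, r4)
```

Averaging `reciprocal_rank` over all queries gives the mean reciprocal rank (MRR).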

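## 9. Summary

### Summary of JINA-CLIP-V2 Multimodal Capabilities

In this notebook we evaluated the multimodal understanding capabilities of the JINA-CLIP-V2 model on a small test set of six images and a handful of short sentences. The main findings are:

1. **Multilingual understanding**
   - The model understands and relates texts that express the same concept in different languages
   - Paraphrases within the same language have high similarity
   - Behavior stays consistent across language groups (for example, East Asian vs. Western languages)

2. **Image representation**
   - The model captures the semantic content of images; images of similar scenes have higher embedding similarity
   - Images of different scenes are well separated in the embedding space

3. **Cross-modal retrieval**
   - Text-to-image retrieval: the model finds relevant images from a text description
   - Image-to-text retrieval: the model finds relevant text descriptions from an image
   - Cross-lingual retrieval: descriptions of the same scene in different languages retrieve the corresponding images

4. **Zero-shot classification**
   - The model shows good zero-shot image classification ability
   - Class names expressed in different languages yield consistent classification results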

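### Application Scenarios

Based on the test results, the JINA-CLIP-V2 model is suited to the following applications:

1. **Multilingual image retrieval systems**
2. **Cross-lingual content recommendation**
3. **Zero-shot image classification**
4. **Multilingual content understanding**
5. **Semantic search engines**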

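### Limitations and Directions for Improvement

1. Improve the ability to distinguish fine-grained scenes
2. Strengthen the understanding of domain-specific text and images
3. Improve the handling of long texts

These test results provide a foundation for subsequent application development and research.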