As in the other notebook, we first import the dataset:

In [1]:
import itertools
import random
from pprint import pprint
from sentence_transformers import SentenceTransformer, util

# Define the paths to the dataset files
english_dataset_path = r'./dataset/News-Commentary/News-Commentary.en-zh.en'
chinese_dataset_path = r'./dataset/News-Commentary/News-Commentary.en-zh.zh'

# Function to get random sentence pairs
def get_random_sentence_pairs(english_path, chinese_path, num_pairs=1):
    with open(english_path, 'r', encoding='utf-8') as eng_file, \
         open(chinese_path, 'r', encoding='utf-8') as zh_file:
        
        # Read all lines from both files
        english_lines = eng_file.readlines()
        chinese_lines = zh_file.readlines()
        
        # Ensure both files have the same number of lines
        if len(english_lines) != len(chinese_lines):
            print("Error: The files don't have the same number of lines.")
            return None

        # Generate random indices
        random_indices = random.sample(range(len(english_lines)), num_pairs)
        
        # Get the random sentence pairs
        sentence_pairs = [(english_lines[index].strip(), chinese_lines[index].strip()) for index in random_indices]
        
        return sentence_pairs

# Function to compute and print the cosine similarity scores
def print_similarity_scores(sentence_pairs):
    
    # Load the embedding model (all-MiniLM-L6-v2 is trained mainly on English text)
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Separate the sentence pairs
    sentences1, sentences2 = zip(*sentence_pairs)

    # Compute embedding for both lists
    embeddings1 = model.encode(sentences1, convert_to_tensor=True)
    embeddings2 = model.encode(sentences2, convert_to_tensor=True)

    # Compute the full cosine-similarity matrix; entry [i][i] is the score for pair i
    cosine_scores = util.cos_sim(embeddings1, embeddings2)

    # Output the pairs with their score
    for i in range(len(sentences1)):
        print(f"{sentences1[i]} \n {sentences2[i]} \n Score: {cosine_scores[i][i]:.4f}")




In [2]:
# Example usage: draw 5 random sentence pairs
num_of_gen = 5
sentence_pairs = get_random_sentence_pairs(english_dataset_path, chinese_dataset_path, num_of_gen)

# Compute and print the similarity scores (skipped if the files were misaligned)
if sentence_pairs:
    print_similarity_scores(sentence_pairs)

During the Cold War, America's approach to the Middle East was to foster stability in order to prevent the spread of Soviet influence, ensure the supply of oil, and provide security for Israel. The American strategy was management through autocratic leaders, and a "don't rock the boat" approach. 
 冷战时期，美国的中东策略是促进地区稳定，防止苏联影响的进一步蔓延，及确保石油供应和以色列的安全。 美国的战略是靠独裁领袖实现统治，始终抱着一种"维持现状"的态度。 
 Score: 0.1004
Countries considering agreements like the Trans-Pacific Partnership or bilateral “partnership” agreements with the US and Europe need to be aware that this is one of the hidden objectives. What are being sold as “free-trade agreements” include IP provisions that could stifle access to affordable medicines, with a potentially significant impact on economic growth and development. 
 从全球看，越来越多的人认识到需要更加平衡的知识产权机制。 但制药行业为了巩固自己的利益，一直在推行更强势、更不平衡的知识产权机制。 考虑和美国和欧洲构建跨太平洋合作伙伴关系协定或双边“合作伙伴”协定的国家必须清楚，这是隐含目标之一。 它们兜售的是“自由贸易协定 ” ， 其中的知识产权保护条款可能阻挠人民获得负担得起的药品，对经济增长和发展起着潜在重大影响。 
 Score: 0.1960
A large-scale program t

## The similarity score is surprisingly low.

Possible reasons:
1. Cross-lingual embeddings are hard: Measuring semantic similarity across languages is challenging, especially for structurally different languages such as English and Chinese. They differ not only in grammar and syntax but also in cultural context and expression.
2. Literal vs. contextual meaning: The model may be capturing word-level correspondences rather than the contextual meaning of the whole sentence. This can lower the scores when sentence structure or idiomatic expressions differ significantly between the two languages.
3. Choice of model: `all-MiniLM-L6-v2` was trained mainly on English text, so it is not designed to place a Chinese sentence near its English translation. Multilingual SentenceTransformer models trained on cross-lingual data, such as `paraphrase-multilingual-MiniLM-L12-v2` or `LaBSE`, are likely to be better suited to English-Chinese similarity.
4. Model's training data: If the model has not been trained on a sufficiently diverse dataset that includes sentence pairs similar to the ones being compared, its ability to measure similarity across languages may be limited.
5. Embedding space alignment: For cross-lingual similarity, the embedding spaces of the two languages must be well aligned. If they are not, even semantically equivalent sentences can end up far apart, resulting in low similarity scores.
6. Complex sentences: The sampled examples are relatively long, with multiple clauses and potentially nuanced meanings. The model may struggle more with these than with short, straightforward sentences.
7. Possible data errors: Check for errors in data preprocessing, such as incorrect sentence pairings or text-encoding issues, that could affect the embeddings. In the second sample above, the Chinese side contains sentences with no counterpart on the English side, which suggests that some pairs are not cleanly aligned.
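Point 7 can be checked cheaply before blaming the model. The sketch below is a rough heuristic, not part of the original notebook: it counts sentence-ending punctuation on each side of a pair and flags pairs whose counts differ by more than a threshold, or where one side is empty. The function names, the regular expressions, and the `max_count_diff` threshold are all illustrative assumptions; abbreviations such as "U.S." will be over-counted on the English side.

```python
import re

# Illustrative sentence-end patterns (assumptions, not a real sentence splitter).
EN_SENTENCE_END = re.compile(r"[.!?]+(?=\W*(?:\s|$))")
ZH_SENTENCE_END = re.compile(r"[。！？]+")


def count_sentences(text, pattern):
    """Approximate the number of sentences by counting end-of-sentence punctuation."""
    text = text.strip()
    if not text:
        return 0
    # Non-empty text with no terminal punctuation still counts as one sentence.
    return max(1, len(pattern.findall(text)))


def flag_suspect_pairs(sentence_pairs, max_count_diff=1):
    """Return (index, english_count, chinese_count) for pairs that look misaligned."""
    suspects = []
    for index, (english, chinese) in enumerate(sentence_pairs):
        en_count = count_sentences(english, EN_SENTENCE_END)
        zh_count = count_sentences(chinese, ZH_SENTENCE_END)
        one_side_empty = en_count == 0 or zh_count == 0
        if one_side_empty or abs(en_count - zh_count) > max_count_diff:
            suspects.append((index, en_count, zh_count))
    return suspects


if __name__ == "__main__":
    demo_pairs = [
        ("Hello world. How are you?", "你好，世界。你好吗？"),
        ("One sentence here.", "第一句。第二句。第三句。第四句。"),
    ]
    for index, en_count, zh_count in flag_suspect_pairs(demo_pairs):
        print(f"Pair {index}: {en_count} English vs {zh_count} Chinese sentence(s)")
```

Passing the list returned by `get_random_sentence_pairs` to `flag_suspect_pairs` gives a quick indication of how many sampled pairs may be misaligned. A flagged pair is only a candidate for manual inspection, since translators legitimately split and merge sentences.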