# Cross-lingual Information Retrieval with the Google Translate API & NLTK

In [1]:
!pip install googletrans==4.0.0-rc1


In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from googletrans import Translator

In [3]:
# Initialize NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Define a list of sample queries in English
queries = [
    "Cross-lingual information retrieval with NLTK",
    "Natural language processing techniques",
    "Machine learning applications in NLP"
]

# Define a list of French documents
documents = [
    {
        'language': 'fr',  # French
        'content': "La recherche d'informations interlingues avec NLTK est difficile."
    },
    {
        'language': 'fr',  # French
        'content': "Les techniques de traitement du langage naturel sont essentielles."
    },
    {
        'language': 'fr',  # French
        'content': "Les applications d'apprentissage automatique en TAL sont en croissance."
    }
]



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
# Initialize the googletrans client (an unofficial wrapper around Google Translate)
translator = Translator()



In [5]:
# Perform cross-lingual information retrieval for each query.
# Stop words must match the language of the translated query (French here, not English).
stop_words = set(stopwords.words('french'))

for query in queries:
    print(f"Query: {query}\n")

    # Translate the query once per distinct document language
    languages = {doc['language'] for doc in documents}
    translated_queries = {lang: translator.translate(query, src='en', dest=lang).text for lang in languages}

    for doc in documents:
        # Tokenize and preprocess the translated query
        translated_query = translated_queries[doc['language']]
        tokens_query = word_tokenize(translated_query.lower())
        tokens_query = [word for word in tokens_query if word.isalnum() and word not in stop_words]

        # You can implement a similarity measure (e.g., cosine similarity) here to rank the documents.
        # For simplicity, we'll just check if any query terms exist in the document.
        doc_tokens = word_tokenize(doc['content'].lower())
        if any(token in doc_tokens for token in tokens_query):
            print(f"Query '{query}' found in French document: {doc['content']}\n")

Query: Cross-lingual information retrieval with NLTK

Query 'Cross-lingual information retrieval with NLTK' found in French document: La recherche d'informations interlingues avec NLTK est difficile.

Query: Natural language processing techniques

Query 'Natural language processing techniques' found in French document: Les techniques de traitement du langage naturel sont essentielles.

Query: Machine learning applications in NLP

Query 'Machine learning applications in NLP' found in French document: Les applications d'apprentissage automatique en TAL sont en croissance.
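
The comment in the loop above suggests replacing the any-term-overlap check with a similarity measure. A minimal sketch of that idea, assuming bag-of-words term counts and plain cosine similarity, could look like the following. The helper names `cosine_bow` and `rank_documents` are hypothetical, and the inputs are assumed to be token lists such as those produced by `word_tokenize` after stop-word filtering.

```python
import math
from collections import Counter


def cosine_bow(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the terms the two lists share
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty token list matches nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_tokens, tokenized_docs):
    """Return (doc_index, score) pairs, best match first."""
    scores = [(i, cosine_bow(query_tokens, doc)) for i, doc in enumerate(tokenized_docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Illustrative, already-tokenized inputs (not produced by the translator)
query_tokens = ["techniques", "traitement", "langage", "naturel"]
tokenized_docs = [
    ["recherche", "interlingues", "nltk", "difficile"],
    ["techniques", "traitement", "langage", "naturel", "essentielles"],
    ["applications", "automatique", "tal", "croissance"],
]
ranking = rank_documents(query_tokens, tokenized_docs)
print(ranking)
```

Unlike the boolean check, this produces a score per document, so documents that share more terms with the translated query rank higher than those that share only one.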



# Cross-lingual Information Retrieval with Multilingual BERT

In [6]:
!pip install transformers
!pip install sentencepiece



In [7]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [8]:


def cross_lingual_information_retrieval(query, documents, model_name="bert-base-multilingual-cased"):
    # Load the pre-trained model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Encode the query and documents
    query_tokens = tokenizer(query, return_tensors="pt", padding=True, truncation=True)
    document_tokens = tokenizer(documents, return_tensors="pt", padding=True, truncation=True)

    # Get the embeddings for the query and documents
    with torch.no_grad():
        query_embeddings = model(**query_tokens).last_hidden_state.mean(dim=1)
        document_embeddings = model(**document_tokens).last_hidden_state.mean(dim=1)

    # Calculate cosine similarity between the query and documents
    similarity_scores = cosine_similarity(query_embeddings, document_embeddings)
    print("similarity_scores", similarity_scores)
    # Sort the documents by similarity score in descending order
    sorted_indices = np.argsort(similarity_scores[0])[::-1]
    sorted_documents = [documents[i] for i in sorted_indices]

    return sorted_documents, similarity_scores[0, sorted_indices]
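
One caveat about `last_hidden_state.mean(dim=1)`: because the documents are tokenized with `padding=True`, shorter documents are padded to the length of the longest one, and the plain mean averages over those padding positions too. A common alternative is to weight each token vector by the attention mask so that only real tokens contribute. The sketch below illustrates the arithmetic on plain Python lists rather than tensors; `masked_mean` is a hypothetical helper, not part of `transformers`, and the vectors are made-up values.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1.

    token_vectors: list of equal-length float lists, one per token position
    attention_mask: list of 0/1 flags, 1 for a real token, 0 for padding
    """
    real = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not real:
        raise ValueError("attention_mask selects no tokens")
    dim = len(real[0])
    return [sum(vec[d] for vec in real) / len(real) for d in range(dim)]


# Two real tokens followed by two padding positions (illustrative values)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
attention_mask = [1, 1, 0, 0]

plain = [sum(vec[d] for vec in token_vectors) / len(token_vectors) for d in range(2)]
masked = masked_mean(token_vectors, attention_mask)
print("plain mean: ", plain)   # padding vectors pull the average away
print("masked mean:", masked)
```

In the function above, the corresponding mask is available as `document_tokens["attention_mask"]`. Switching to masked pooling would likely change the similarity scores, most noticeably for short documents batched with a much longer one.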




In [9]:
if __name__ == "__main__":
    # Example query in English and documents in German
    query = """XLM-RoBERTa is a multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.

RoBERTa is a transformers model pretrained on a large corpus in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts."""
    documents = [
       """XLM-RoBERTa ist eine mehrsprachige Version von RoBERTa. Es ist vortrainiert auf 2,5 TB gefilterter CommonCrawl-Daten, die 100 Sprachen enthalten.

RoBERTa ist ein Transformers-Modell, das in einer selbstüberwachten Art und Weise auf einem großen Korpus vortrainiert wurde. Dies bedeutet, dass es nur auf den Rohdaten vortrainiert wurde, ohne dass Menschen sie in irgendeiner Weise kennzeichnen (deshalb kann es viele öffentlich verfügbare Daten verwenden) mit einem automatischen Prozess zur Generierung von Eingaben und Labels aus diesen Texten.""",
        "Die Sonne scheint hell am Himmel, während Vögel fröhlich zwitschern.",
        "In der Bibliothek finden sich zahlreiche Bücher über Geschichte, Kunst und Wissenschaft."
    ]

    # Perform cross-lingual information retrieval
    retrieved_documents, similarity_scores = cross_lingual_information_retrieval(query, documents)

    # Print the retrieved documents and their similarity scores
    for doc, score in zip(retrieved_documents, similarity_scores):
        print(f"Document: {doc}\nSimilarity Score: {score:.4f}\n")


similarity_scores [[0.7391072  0.24958709 0.29241443]]
Document: XLM-RoBERTa ist eine mehrsprachige Version von RoBERTa. Es ist vortrainiert auf 2,5 TB gefilterter CommonCrawl-Daten, die 100 Sprachen enthalten.

RoBERTa ist ein Transformers-Modell, das in einer selbstüberwachten Art und Weise auf einem großen Korpus vortrainiert wurde. Dies bedeutet, dass es nur auf den Rohdaten vortrainiert wurde, ohne dass Menschen sie in irgendeiner Weise kennzeichnen (deshalb kann es viele öffentlich verfügbare Daten verwenden) mit einem automatischen Prozess zur Generierung von Eingaben und Labels aus diesen Texten.
Similarity Score: 0.7391

Document: In der Bibliothek finden sich zahlreiche Bücher über Geschichte, Kunst und Wissenschaft.
Similarity Score: 0.2924

Document: Die Sonne scheint hell am Himmel, während Vögel fröhlich zwitschern.
Similarity Score: 0.2496

