# Introduction

This notebook shows how to use `XLM-T` models to encode tweets from a text file into embeddings, and how to use those embeddings to retrieve similar tweets by cosine similarity.

# Installs and imports

In [1]:
!pip install --upgrade pip
!pip install sentencepiece
!pip install transformers

Collecting pip
  Downloading pip-25.0.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-25.0.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.0.1


In [2]:
from transformers import AutoTokenizer, AutoModel, AutoConfig
from transformers import AutoModelForSequenceClassification
from torch.utils.data import DataLoader
import numpy as np

# Data

In [3]:
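def preprocess(corpus):
  """Replace user mentions with '@user' and links with 'http' in each text."""
  outcorpus = []
  for text in corpus:
    new_text = []
    for t in text.split(" "):
      # a lone '@' is kept as-is; any longer @-token is treated as a mention
      t = '@user' if t.startswith('@') and len(t) > 1 else t
      t = 'http' if t.startswith('http') else t
      new_text.append(t)
    new_text = " ".join(new_text)
    outcorpus.append(new_text)
  return outcorpus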
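
As a quick check of what this normalization does, the sketch below applies the same function to a made-up tweet. The sample text is an assumption for illustration only and is not taken from the dataset.

```python
def preprocess(corpus):
    """Replace user mentions with '@user' and links with 'http' in each text."""
    outcorpus = []
    for text in corpus:
        new_text = []
        for t in text.split(" "):
            t = '@user' if t.startswith('@') and len(t) > 1 else t
            t = 'http' if t.startswith('http') else t
            new_text.append(t)
        outcorpus.append(" ".join(new_text))
    return outcorpus

# hypothetical sample tweet, not from the dataset
sample = ["@alice check https://example.com now @ 5pm"]
cleaned = preprocess(sample)
print(cleaned)  # ['@user check http now @ 5pm']
```

Note that the function takes a list of texts, not a single string, and that a standalone `@` is left untouched.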

In [4]:
!wget https://raw.githubusercontent.com/cardiffnlp/xlm-t/main/data/sentiment/all/test_text.txt

--2025-04-14 22:20:57--  https://raw.githubusercontent.com/cardiffnlp/xlm-t/main/data/sentiment/all/test_text.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 654172 (639K) [text/plain]
Saving to: ‘test_text.txt’


2025-04-14 22:20:57 (15.4 MB/s) - ‘test_text.txt’ saved [654172/654172]



In [5]:
dataset_path = './test_text.txt'
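# read one tweet per line; if the file ends with a newline, the last entry is an empty string
with open(dataset_path, encoding='utf-8') as f:
  dataset = f.read().split('\n')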

In [6]:
# the dataset covers 8 languages; print one example from each
for example in [0,870,1740,2610,3480,4350,5220,6090]:
  print(dataset[example])

نوال الزغبي (الشاب خالد ليس عالمي) هههههههه أتفرجي على ها الفيديو يا مبتدئة http vía @user
Trying to have a conversation with my dad about vegetarianism is the most pointless infuriating thing ever #caveman 
Royal: le président n'aime pas les pauvres? "c'est n'importe quoi" http …
@user korrekt! Verstehe sowas nicht...
CONGRESS na ye party kabhi bani hoti na india ka partition hota nd na hi humari country itni khokhli hoti   @ 
@user @user Ma Ferrero? il compagno Ferrero? ma il suo partito esiste ancora? allora stiamo proprio frecati !!!
todos os meus favoritos na prova de eliminação #MasterChefBR
@user jajajaja dale, hacete la boluda vos jajaja igual a vos nunca se te puede tomar en serio te mando un abrazo desde Perú!
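
The indices used above step by 870, which suggests that the file stores the languages in consecutive blocks of 870 tweets. The sketch below turns that into a small lookup helper. The block size, the language order (read off the printed samples), and the name `language_range` are all assumptions, not something the dataset documents here.

```python
# Assumed layout: consecutive blocks of 870 tweets per language.
BLOCK = 870
# Assumed order, inferred from the sample tweets printed above.
LANGS = ["arabic", "english", "french", "german",
         "hindi", "italian", "portuguese", "spanish"]

def language_range(lang, block=BLOCK):
    """Return (start, end) indices of a language block, end exclusive."""
    k = LANGS.index(lang)
    return k * block, (k + 1) * block

start, end = language_range("english")
print(start, end)  # 870 1740
```

Under these assumptions, `dataset[start:end]` would select only the English tweets.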


# Model

In [7]:
CUDA = True # set to True if using a GPU (Runtime -> Change runtime type -> GPU), or to False to run on CPU
BATCH_SIZE = 32
MODEL = "cardiffnlp/twitter-xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
if CUDA:
  model = model.to('cuda')
_ = model.eval()




Some weights of XLMRobertaModel were not initialized from the model checkpoint at cardiffnlp/twitter-xlm-roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



## Encode

In [8]:
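import torch

def encode(text, cuda=True):
  # normalize user mentions and links before tokenizing
  text = preprocess(text)
  # max_length=512 is an assumed limit for this model; without it, truncation=True has no effect here
  encoded_input = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
  if cuda:
    encoded_input = encoded_input.to('cuda')
  with torch.no_grad():  # inference only, no gradients needed
    output = model(**encoded_input)
  # output[0] is the last hidden state, shape (batch, tokens, hidden)
  embeddings = output[0].cpu().numpy()

  # max-pool over the token axis (padded positions are included)
  embeddings = np.max(embeddings, axis=1)
  #embeddings = np.mean(embeddings, axis=1)
  return embeddings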
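
The pooling in `encode` runs over every token position in the batch, including the padding added to shorter tweets. One possible refinement is to restrict pooling to real tokens with the attention mask. The sketch below illustrates that idea on toy arrays. `masked_pool` is a hypothetical helper, not part of the notebook or of `transformers`; in the notebook the mask would presumably come from `encoded_input['attention_mask']`.

```python
import numpy as np

def masked_pool(hidden, mask, mode="max"):
    """Pool token vectors over real tokens only.

    hidden: (batch, tokens, dim) array of token embeddings
    mask:   (batch, tokens) array, 1 for real tokens and 0 for padding
    """
    keep = mask.astype(bool)[:, :, None]  # broadcast over the dim axis
    if mode == "max":
        # padded positions become -inf so they never win the max
        return np.where(keep, hidden, -np.inf).max(axis=1)
    # mean: sum real tokens and divide by their count
    summed = np.where(keep, hidden, 0.0).sum(axis=1)
    counts = np.maximum(keep.sum(axis=1), 1)  # avoid division by zero
    return summed / counts

# toy batch: one sequence whose third position is padding
hidden = np.array([[[1.0, -2.0], [3.0, -4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
print(masked_pool(hidden, mask, "max"))   # [[ 3. -2.]]
print(masked_pool(hidden, mask, "mean"))  # [[ 2. -3.]]
```

Without the mask, the padded vector `[9, 9]` would dominate the max in this toy example, which is the effect the mask is meant to prevent.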

In [9]:
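dl = DataLoader(dataset, batch_size=BATCH_SIZE)
# one row per tweet; hidden size is 768 for the base model
all_embeddings = np.zeros([len(dataset), config.hidden_size])
for idx,batch in enumerate(dl):
  print('Batch ',idx+1,' of ',len(dl))
  # encode() already applies preprocess(), so the raw batch is passed directly
  embeddings = encode(batch, cuda=CUDA)
  a = idx*BATCH_SIZE
  b = (idx+1)*BATCH_SIZE
  # the slice is clipped automatically for the final, shorter batch
  all_embeddings[a:b,:]=embeddings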

Batch  1  of  218
Batch  2  of  218
Batch  3  of  218
...
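
The loop above relies on the last batch being shorter than `BATCH_SIZE` whenever the dataset size is not a multiple of it. The sketch below makes the index arithmetic explicit on a small made-up size. `batch_bounds` is a hypothetical helper used only for illustration.

```python
import math

def batch_bounds(n_items, batch_size):
    """Return (start, end) index pairs covering n_items, end exclusive."""
    return [(a, min(a + batch_size, n_items))
            for a in range(0, n_items, batch_size)]

# toy size chosen for illustration, not the size of the dataset
bounds = batch_bounds(70, 32)
print(bounds)  # [(0, 32), (32, 64), (64, 70)]
print(len(bounds) == math.ceil(70 / 32))  # True
```

The number of batches is the ceiling of the item count divided by the batch size, and only the final pair is allowed to be shorter.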

## Cosine similarity and retrieval of all embeddings

In [10]:
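# L2-normalize each embedding so that a dot product equals cosine similarity
norms = np.linalg.norm(all_embeddings, axis=-1)
all_embeddings_unit = all_embeddings/norms[:,None]
# pairwise cosine similarities, shape (len(dataset), len(dataset))
all_embeddings_sim = np.dot(all_embeddings_unit, all_embeddings_unit.T)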
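
To see why normalizing first gives cosine similarity, the sketch below runs the same computation on a few 2-D vectors whose angles are easy to check by hand. The `eps` guard against all-zero rows is an added assumption and is not part of the cell above. `cosine_similarity_matrix` is a hypothetical name.

```python
import numpy as np

def cosine_similarity_matrix(x, eps=1e-12):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    unit = x / np.maximum(norms, eps)  # eps keeps zero rows from producing NaN
    return unit @ unit.T

x = np.array([[1.0, 0.0],   # same direction as the next row
              [2.0, 0.0],
              [0.0, 3.0],   # orthogonal to the first two
              [0.0, 0.0]])  # zero vector
sim = cosine_similarity_matrix(x)
print(sim[0, 1])  # 1.0
print(sim[0, 2])  # 0.0
```

Vectors pointing the same way score 1 regardless of length, orthogonal vectors score 0, and the zero row yields 0 instead of NaN thanks to the guard.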

In [11]:
def get_most_sim(sim):
  s = np.argsort(sim)
  s = s[::-1] # invert sort order
  return s
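
When the query itself is among the candidates, it always ranks first with a similarity of about 1. The sketch below shows one way to keep only the top `k` results and optionally skip a given index. `top_k_indices` and its `exclude` parameter are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

def top_k_indices(sim, k, exclude=None):
    """Indices of the k largest similarities, optionally skipping one index."""
    order = np.argsort(sim)[::-1]  # descending order, as in get_most_sim
    if exclude is not None:
        order = order[order != exclude]
    return order[:k]

# toy similarities; index 1 plays the role of the query itself
sim = np.array([0.2, 1.0, 0.9, 0.5])
print(top_k_indices(sim, 2, exclude=1))  # [2 3]
```

If the similarities are sliced from a larger matrix, the index to exclude has to be expressed relative to the start of the slice.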

In [12]:
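query = 1111  # index of the query tweet (inside the English block)
a = 870  # first index of the English block
b = 1740 # one past the last index of the English block
tmp_sim = all_embeddings_sim[a:b,query]  # similarity of each English tweet to the query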
tmp_data = dataset[a:b]
s = get_most_sim(tmp_sim)

In [13]:
print('QUERY: ', dataset[query])

QUERY:  This means they believe it to be a legitimate non-violent movement based on a concern for human rights in #Palestine. #queensu #ygk 


In [14]:
n_show = 12  # number of tweets printed from each end of the ranking

print(' ----- Most similar ----- ')
for i in s[:n_show]:
  print(tmp_sim[i], tmp_data[i])

print(' ----- Least similar ----- ')
for i in s[::-1][:n_show]:
  print(tmp_sim[i], tmp_data[i])

 ----- Most similar ----- 
0.9999999999999999 This means they believe it to be a legitimate non-violent movement based on a concern for human rights in #Palestine. #queensu #ygk 
0.9641096587044422 @user aint in support with Israel nor Palestine! Hope this fire is settled soon & there's no more massacre in #Palestine either... 
0.9612606945435014 Israel deems comatose Gaza man who needs treatment in West Bank  a security threat. #Palestine  via @user 
0.9593051127702781 #latestnews 4 #newmexico #politics + #nativeamerican + #Israel + #Palestine  -  Protesting Rise Of Alt-Right At... 
0.9588319138540209 UK Govt reject criticism on Libya saying its involvement saved lives-... wishing UK to enjoy post Gadafi Libya fate. #UK #libya 
0.958380367512331 @user Megyn, Please interview Halderman from the Univ of Michigan re:discrepancy in the results in counties with e-voting machines. 
0.9579723885682849 Saakashvili is pushing his own agenda here.The Ukrainian economy is growing, although corru