# Scoring system with BETO

## Summary of the approach in this notebook

**Choice of BETO:** We use BETO, a BERT model pre-trained on Spanish corpora, because our text data is in Spanish and we need a model that captures the semantic meaning of sentences. BERT models produce contextual representations: the vector for each word depends on the words around it. This lets BETO capture language features that simpler techniques like TF-IDF, which only count terms, cannot.

**Generating Embeddings:** We use BETO to generate embeddings for both the senators' initiative profiles and the user's input. These embeddings are vectors in a high-dimensional space that represent the semantic meaning of the text. By representing the text as vectors, we can calculate the distance (or similarity) between different pieces of text.

**Calculating Similarity:** To match the user's input to the senators' profiles, we calculate the cosine similarity between the user's input vector and each senator's vector. This gives us a measure of how similar the user's input is to each senator's profile, which we use as our scoring mechanism.

**Ranking Senators:** Finally, we rank the senators based on their similarity scores and return the top N senators. This gives us a list of senators whose initiative profiles best match the user's interests.

The core idea of this approach is to use a language model such as BETO to capture the semantic meaning of text, and to use that representation to match users with senators based on their interests.
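The similarity and ranking steps above can be sketched end to end with toy data. This is a minimal illustration, not the notebook's implementation: the three-dimensional vectors and the names `toy_profiles`, `toy_query` and `rank_by_cosine` are hypothetical stand-ins for real BETO embeddings, which come from the model rather than from hand-written lists.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query, profiles, n=2):
    """Return the n (name, score) pairs most similar to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical toy embeddings (hand-written, not produced by BETO)
toy_profiles = {
    "senator_a": [1.0, 0.0, 0.0],
    "senator_b": [0.0, 1.0, 0.0],
    "senator_c": [1.0, 1.0, 0.0],
}
toy_query = [1.0, 0.2, 0.0]

top = rank_by_cosine(toy_query, toy_profiles, n=2)
print(top)
```

Because cosine similarity depends only on the angle between vectors, a profile pointing in nearly the same direction as the query ranks first regardless of vector length.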

## Import Libraries

In [1]:
import pandas as pd
import os
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertModel, BertTokenizer
import torch
import numpy as np

## Initialize testing DF

This dataframe holds a small subset of the data, used to test the functions in this notebook.

In [2]:
SENATORS_TO_PROCESS = 3

current_path = os.getcwd()
parent_directory = os.path.dirname(current_path)
project_data_path = os.path.join(parent_directory, 'data')


senators_test_df = pd.read_csv(os.path.join(project_data_path, 'senators_data.csv')).head(SENATORS_TO_PROCESS)

## Download BETO model and tokenizer

In [3]:
tokenizer = BertTokenizer.from_pretrained('dccuchile/bert-base-spanish-wwm-cased')
model = BertModel.from_pretrained('dccuchile/bert-base-spanish-wwm-cased')

Some weights of the model checkpoint at dccuchile/bert-base-spanish-wwm-cased were not used when initializing BertModel: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.we

### Test BETO

In [4]:
# Tokenize the text
inputs = tokenizer("Arriba la democracia!", return_tensors="pt")

# Generate the embeddings
outputs = model(**inputs)

# `last_hidden_state` holds one embedding per token: shape (batch, tokens, hidden_size)
embeddings = outputs.last_hidden_state
embeddings

tensor([[[ 0.0733, -0.2156,  0.1107,  ..., -0.1891,  0.1999,  0.1254],
         [-0.0935,  0.1006, -0.4794,  ..., -0.1307,  0.3479,  0.3917],
         [-0.2780,  1.1009, -0.6256,  ...,  0.0978,  0.3362,  0.4117],
         [-0.5546, -0.0858, -0.3468,  ...,  0.1149,  0.3315, -0.0632],
         [-0.4995, -0.1995, -0.0398,  ..., -0.3834,  0.0416, -0.3520],
         [-0.6204, -0.1591, -0.4836,  ..., -0.8248,  0.9292, -0.4672]]],
       grad_fn=<NativeLayerNormBackward0>)

## Generate embeddings with BETO

Embeddings are a way to represent text (or other types of data) as vectors of numbers. The key idea behind embeddings is to represent words or sentences in a high-dimensional space in such a way that their location in this space captures some of the semantic meaning of the text.

For example, in a well-constructed embedding space, words or sentences with similar meanings will be located near each other, and their relative locations can capture some of the relationships between them. For instance, the vectors for "king" and "queen" might be located at similar positions in the embedding space, and the direction from "king" to "queen" might be the same as the direction from "man" to "woman", capturing the relationship of gender between these words.
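The "king"/"queen" relationship can be made concrete with a toy example. The two-dimensional vectors below are hand-picked for illustration and do not come from BETO or any other model; learned embeddings satisfy such relations only approximately, if at all.

```python
import numpy as np

# Hypothetical 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
}

# Direction from "man" to "woman" and from "king" to "queen"
gender_direction = toy_vectors["woman"] - toy_vectors["man"]
royal_direction = toy_vectors["queen"] - toy_vectors["king"]

# Applying the gender direction to "king" lands on "queen" in this toy space
analogy = toy_vectors["king"] + gender_direction

print(np.allclose(gender_direction, royal_direction))
print(np.allclose(analogy, toy_vectors["queen"]))
```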

We use embeddings in NLP (Natural Language Processing) because they provide a way to turn text into a form that machine learning algorithms can understand. Most machine learning algorithms require numerical input, and embeddings provide a way to turn text into numbers while preserving some of the semantic meaning of the text.

In the context of this project, we use embeddings to represent both the user's input and the senator's initiative profiles. By representing these texts as vectors in a high-dimensional space, we can calculate the distance (or similarity) between the user's input and each senator's profile. This allows us to rank the senators based on how similar their profile is to the user's input.

In [5]:
def generate_embeddings(text, tokenizer, model, max_length=512):
    # Split the text into chunks of at most `max_length` characters (not tokens)
    # so that long summaries are processed piece by piece
    text_chunks = [text[i:i+max_length] for i in range(0, len(text), max_length)]
    
    # Initialize an empty tensor to store the embeddings
    embeddings = torch.zeros((len(text_chunks), model.config.hidden_size))
    
    # Generate embeddings for each chunk
    for i, chunk in enumerate(text_chunks):
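        # truncation=True cuts any chunk that exceeds the tokenizer's maximum length
        inputs = tokenizer(chunk, return_tensors="pt", truncation=True, padding=True)
        outputs = model(**inputs)
        # Mean-pool the token vectors into a single vector for this chunk
        embeddings[i] = outputs.last_hidden_state.mean(dim=1)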
    
    # Average the embeddings of all the chunks
    embeddings = embeddings.mean(dim=0)
    
    return embeddings
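`generate_embeddings` slices the text every `max_length` characters, so a chunk boundary can fall in the middle of a word. One possible alternative, sketched here, is to split at whitespace using the standard library's `textwrap.wrap`. The helper name `chunk_at_word_boundaries` and the sample sentence are hypothetical, and this is not the chunking the notebook uses.

```python
import textwrap

def chunk_at_word_boundaries(text, max_chars=512):
    """Split text into chunks of at most max_chars characters, breaking at whitespace."""
    # textwrap.wrap breaks lines at whitespace and drops whitespace at chunk edges;
    # a single word longer than max_chars is still split
    return textwrap.wrap(text, width=max_chars)

# Made-up sample text; a small max_chars makes the chunking visible
sample = "Iniciativa para la protección de los animales y el medio ambiente"
chunks = chunk_at_word_boundaries(sample, max_chars=30)
print(chunks)
```

The resulting list could replace `text_chunks` inside `generate_embeddings`, leaving the rest of the function unchanged.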


### Test embeddings function

In [6]:
senators_test_df['initiatives_summary_dummy'].apply(lambda x: generate_embeddings(x, tokenizer, model))

0    [tensor(-0.2797, grad_fn=<UnbindBackward0>), t...
1    [tensor(-0.2587, grad_fn=<UnbindBackward0>), t...
2    [tensor(-0.2833, grad_fn=<UnbindBackward0>), t...
Name: initiatives_summary_dummy, dtype: object

### Why don't we remove stop words or lemmatize?

In traditional NLP tasks, lemmatization and removing stop words are common steps to reduce the dimensionality of the data and focus on the most informative words. However, BERT-like models are based on transformers that use the context of words in a sentence to understand their meaning. They can even leverage the information contained in stop words. So, for these models, we usually keep the original form of the words and do not remove stop words.

## Score senator profiles based on user input

### Function to get scores

In [7]:
from sklearn.metrics.pairwise import cosine_similarity

# Define a function to match senators based on the user's input
def match_senators(user_input, senators_embeddings, tokenizer, model):
    # Generate an embedding for the user's input
    user_embedding = generate_embeddings(user_input, tokenizer, model)

    # Initialize an empty array to store the similarity scores
    scores = np.zeros(len(senators_embeddings))

    # Calculate the cosine similarity between the user's input and each senator's profile
    for i, senator_embedding in enumerate(senators_embeddings):
        # cosine_similarity returns a 1x1 array; take its single element
        scores[i] = cosine_similarity(user_embedding.detach().numpy().reshape(1, -1), senator_embedding.detach().numpy().reshape(1, -1))[0, 0]

    return scores
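The loop in `match_senators` calls `cosine_similarity` once per senator. As a possible alternative, the profile vectors can be stacked into one matrix and scored in a single operation. The sketch below uses plain NumPy arrays with made-up values, and `cosine_scores` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def cosine_scores(query_vec, profile_matrix):
    """Cosine similarity between one query vector and each row of a matrix."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    row_norms = np.linalg.norm(profile_matrix, axis=1, keepdims=True)
    profiles_unit = profile_matrix / row_norms
    # The dot product of unit vectors equals their cosine similarity
    return profiles_unit @ query_unit

# Made-up 3-D vectors standing in for BETO embeddings
profiles = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [3.0, 3.0, 0.0],
])
query = np.array([2.0, 0.0, 0.0])

scores = cosine_scores(query, profiles)
print(scores)
```

With the notebook's tensors, the matrix could presumably be built with `torch.stack(list(senators_embeddings)).detach().numpy()`.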

### Test scoring function

In [8]:
# generate senators_embeddings
senators_embeddings = senators_test_df['initiatives_summary_dummy'].apply(lambda x: generate_embeddings(x, tokenizer, model))

# Get the user's input
user_input = "Quiero proteccion para los animales"

# Match the senators
similarity_scores = match_senators(user_input, senators_embeddings, tokenizer, model)
similarity_scores


array([0.84016222, 0.83779895, 0.83383113])

## Get top senators

### Function to get the senators with the best matching scores

In [9]:
def get_top_senators(scores, senators_df, N=5):
    # Get the indices of the senators sorted by their scores
    sorted_indices = np.argsort(scores)[::-1]
    
    # Get the top N senators
    top_senators = senators_df.iloc[sorted_indices[:N]]
    
    return top_senators
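`get_top_senators` returns the matching rows but not their scores, and looking scores up afterwards by index label only works while the dataframe keeps its default 0..n-1 index. A hedged variant, sketched below with a made-up dataframe, attaches the scores as a column before sorting. The names `rank_with_scores`, `toy_df` and the `score` column are hypothetical.

```python
import numpy as np
import pandas as pd

def rank_with_scores(scores, senators_df, n=5):
    """Return the n best-scoring rows with their score attached as a column."""
    # assign() returns a copy; the array is matched to rows by position
    ranked = senators_df.assign(score=np.asarray(scores))
    return ranked.sort_values("score", ascending=False).head(n)

# Made-up rows with a non-default index to show that labels do not matter
toy_df = pd.DataFrame(
    {"Nombre": ["Ana", "Luis", "Eva"]},
    index=[10, 20, 30],
)
toy_scores = [0.2, 0.9, 0.5]

top = rank_with_scores(toy_scores, toy_df, n=2)
print(top)
```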

### Test top senators function

In [10]:
# Get the top senators (N defaults to 5, but the test dataframe only has 3 rows)
top_senators = get_top_senators(similarity_scores, senators_test_df)

# Print the names and scores of the top senators
for i, row in top_senators.iterrows():
    print(f"Senator: {row['Nombre']} {row['Apellidos']}, Score: {similarity_scores[i]}")

Senator: José Alfredo Botello Montes, Score: 0.8401622176170349
Senator: Estrella Rojas Loreto, Score: 0.8377989530563354
Senator: Roberto Juan Moya Clemente, Score: 0.8338311314582825
