# Calculate Embeddings

Computes embeddings for the data collected from comments on politicians' social media accounts. (Gabrielly's spreadsheet)

Input: `LulaTotal Validado.xlsx` and `Bolsonaro Validado.xlsx`

Output: `Embeddings_(MODEL_NAME)_(SOCIAL_NETWORK).xlsx`
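
The output name follows the pattern above. A minimal sketch of building it from the model identifier, assuming the part after the last `/` of the Hugging Face model ID is the desired model name; `build_output_name` is a hypothetical helper, not part of the notebook:

```python
def build_output_name(model_id: str, social_network: str) -> str:
    """Return 'Embeddings_<model>_<network>.xlsx' (hypothetical helper)."""
    # Keep only the last path component so the '/' in the model ID
    # does not end up in the file name.
    model_name = model_id.split('/')[-1]
    return f'Embeddings_{model_name}_{social_network}.xlsx'


example_name = build_output_name('mixedbread-ai/mxbai-embed-large-v1', 'tiktok')
print(example_name)
```

Deriving the name this way keeps the file name in sync when the model is swapped.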

In [1]:
import pandas as pd
from sentence_transformers import SentenceTransformer

BASE_PATH = 'dados/'
DIRECTORY = BASE_PATH + 'preprocessed/embeddings/'
SOCIAL_NETWORK = 'tiktok'

In [2]:
column_types = {'ID' : str}

# 1. Reading Files and Initializing Variables

Initialize the variables and the column names used throughout the notebook.


In [None]:
# Get the sentences (in this case, from the filtered posts)
file_pathL = BASE_PATH + 'input/LulaTotal Validado.xlsx'
file_pathB = BASE_PATH + 'input/Bolsonaro Validado.xlsx'

EMBEDDING_MODEL = 'mixedbread-ai/mxbai-embed-large-v1'
#EMBEDDING_MODEL = 'BAAI/llm-embedder'

file_path_features = 'Embeddings_mxbai-embed-large-v1_'+ (SOCIAL_NETWORK) +'.xlsx'
# Best to edit this name by hand when changing models, since it can otherwise cause an error when saving

column_text = "Texto"
column_id = "ID"
column_likes = "Curtidas"
column_inicial_date = "DataColeta"
column_final_date = "DataPost"
column_cand = "Candidato"

In [4]:
# Convert the date columns to datetime
def adjust_dates(df_aux):
    df_aux['DataColeta'] = pd.to_datetime(df_aux['DataColeta'])
    df_aux['DataPost'] = pd.to_datetime(df_aux['DataPost'])

    try:
        # Turn this column into a string holding the day of the week
        df_aux['DiaDaSemana'] = pd.to_datetime(df_aux['DiaDaSemana'])
        df_aux['DiaDaSemana'] = df_aux['DiaDaSemana'].dt.strftime('%A')
    except Exception:
        print("Column DiaDaSemana appears to be already converted. No need to transform!")

In [5]:
# Read the files for both Lula and Bolsonaro
df_lula = pd.read_excel(file_pathL, dtype=column_types)
df_bolsonaro = pd.read_excel(file_pathB, dtype=column_types)

df_lula.shape, df_bolsonaro.shape


((308, 22), (269, 22))

Merge the spreadsheets and clean out blank values, which carry no useful information.

In [6]:
df_total = pd.concat([df_lula, df_bolsonaro], axis=0)
df_total.shape

(577, 22)

In [None]:
len(df_total['ID'].unique())

577
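
The row count after concatenation and the number of distinct IDs match, so no ID is repeated across the two spreadsheets. A sketch of the same check on toy data, assuming IDs are read as strings as in `column_types`; the frames and values below are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for df_lula and df_bolsonaro (values are illustrative only)
toy_a = pd.DataFrame({'ID': ['1', '2'], 'Perfil': ['lulaoficial'] * 2})
toy_b = pd.DataFrame({'ID': ['3', '4'], 'Perfil': ['bolsonaromessiasjair'] * 2})

toy_total = pd.concat([toy_a, toy_b], axis=0)

# Rows whose ID appears more than once; an empty frame means every ID is unique
duplicated_rows = toy_total[toy_total['ID'].duplicated(keep=False)]
ids_are_unique = toy_total['ID'].is_unique

print(len(toy_total), toy_total['ID'].nunique(), ids_are_unique)
```

If `duplicated_rows` were not empty, it would list the offending rows directly, which is easier to act on than comparing two counts.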

In [8]:
df_total = df_total.reset_index(drop=True)
df_total.head(5)

Unnamed: 0.1,Unnamed: 0,DataColeta,Perfil,DataPost,DiaDaSemana,Plays,Curtidas,Comentarios,Compart.,Texto,...,LinkPost,ID,Duracao,Retórica Aristotélica,Dispositivo Retórico,Tipo de conteúdo,Abordagem,Tonalidade,Main character,Texto / Hashtag
0,1,2022-10-02,lulaoficial,2022-06-30 00:00:00,1900-01-05 00:00:00,196800.0,11700,809,589.0,Alô alô geração tiktoker! Imagina só um #gover...,...,https://www.tiktok.com/@lulaoficial/video/7115...,7115033431473474822,17.3,Pathos,Political Statement,Political-Purposeful,Acclamation,Neutral,Self alone,Texto + Hashtag
1,2,2022-10-02,lulaoficial,2022-06-30 00:00:00,1900-01-05 00:00:00,522000.0,33600,3324,3973.0,Já imaginou um #governo feito pra que as pesso...,...,https://www.tiktok.com/@lulaoficial/video/7115...,7115174031162215686,60.16,Pathos,Political Statement,Political-Purposeful,Acclamation,Neutral,Self alone,Texto + Hashtag
2,3,2022-10-02,lulaoficial,2022-07-01 00:00:00,1900-01-06 00:00:00,427900.0,34600,2289,1752.0,O pai ta estourado!😎 marque nos comentários um...,...,https://www.tiktok.com/@lulaoficial/video/7115...,7115357413712153861,14.88,Ethos,Humor,Campaign Act,Acclamation,Positive,Self alone,Texto + Hashtag
3,4,2022-10-02,lulaoficial,2022-07-01 00:00:00,1900-01-06 00:00:00,882200.0,47500,4312,2257.0,A gente tem um um encontro marcado no dia 02 d...,...,https://www.tiktok.com/@lulaoficial/video/7115...,7115560675824422149,15.39,Pathos,Call to Action,Campaign Act,Acclamation,Positive,Self alone,Texto + Hashtag
4,5,2022-10-02,lulaoficial,2022-07-02 00:00:00,1900-01-07 00:00:00,262200.0,22400,2150,1438.0,#PepsiApplePieChallenge Estamos fazendo uma #c...,...,https://www.tiktok.com/@lulaoficial/video/7115...,7115793869152734470,60.93,Pathos,Commitment,Political-Ideological,Acclamation,Positive,Self + voters,Texto + Hashtag


# 2. Data Cleaning

## 2.1. Removing Missing Data

In [9]:
df_total[df_total.isna().sum(axis=1) > 0].groupby('Perfil').count()

Unnamed: 0_level_0,Unnamed: 0,DataColeta,DataPost,DiaDaSemana,Plays,Curtidas,Comentarios,Compart.,Texto,LinkFoto,...,LinkPost,ID,Duracao,Retórica Aristotélica,Dispositivo Retórico,Tipo de conteúdo,Abordagem,Tonalidade,Main character,Texto / Hashtag
Perfil,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
bolsonaromessiasjair,9,9,9,9,9,9,9,8,9,4,...,9,8,8,1,1,1,1,1,1,1


In [10]:
df_total.dropna(inplace=True)

Inspect what was removed and confirm that everything is correct:

In [11]:
df_total.shape

(568, 22)

In [12]:
df_total[df_total.isna().sum(axis=1) > 0].groupby('Perfil').count()

Unnamed: 0_level_0,Unnamed: 0,DataColeta,DataPost,DiaDaSemana,Plays,Curtidas,Comentarios,Compart.,Texto,LinkFoto,...,LinkPost,ID,Duracao,Retórica Aristotélica,Dispositivo Retórico,Tipo de conteúdo,Abordagem,Tonalidade,Main character,Texto / Hashtag
Perfil,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
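
Because `dropna` with no arguments discards any row with at least one missing value, it can help to see which columns are responsible before dropping. A sketch on toy data (made-up values), using only standard pandas calls:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Perfil': ['lulaoficial', 'bolsonaromessiasjair', 'bolsonaromessiasjair'],
    'Texto': ['some text', np.nan, 'another text'],
    'Curtidas': [10, 20, np.nan],
})

# Missing values per column, to see where the gaps are
missing_per_column = toy.isna().sum()

# Rows that dropna() would remove: those with at least one missing value
rows_with_gaps = toy[toy.isna().any(axis=1)]

cleaned = toy.dropna()
print(missing_per_column)
print(len(toy), '->', len(cleaned))
```

If only some columns matter downstream, `dropna(subset=[...])` would keep rows whose gaps fall in columns that are dropped later anyway.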


## 2.2. Removing Unused Columns

Drop the columns we will not use, such as `Duracao` or `Dispositivo Retórico`.

In [13]:
df_total.drop(columns=["Unnamed: 0", "Plays", "Comentarios", "Compart.", "Duracao", "LinkFoto", "LinkVideo", "LinkPost",
                       "Retórica Aristotélica", "Dispositivo Retórico", "Tipo de conteúdo",
                       "Abordagem", "Tonalidade", "Main character", "Texto / Hashtag"], inplace=True)
df_total.tail(5)

Unnamed: 0,DataColeta,Perfil,DataPost,DiaDaSemana,Curtidas,Texto,ID
572,2022-10-30,bolsonaromessiasjair,2022-10-29 09:14:06.702000,1900-01-07 00:00:00,95300,#bolsonaro #rock #brasil #good #vibes #jairbol...,7159766081593150726
573,2022-10-30,bolsonaromessiasjair,2022-10-29 09:14:01.295000,1900-01-07 00:00:00,240900,- Muito brigado a você que nos acompanhou até ...,7159773640030997766
574,2022-10-30,bolsonaromessiasjair,2022-10-29 09:13:55.570000,1900-01-07 00:00:00,109600,#bolsonaro #lula #empolgado #brasil #🇧🇷,7159891102143630597
575,2022-10-30,bolsonaromessiasjair,2022-10-29 12:13:51.619000,1900-01-07 00:00:00,52100,#vacina #presidente #jair #bolsonaro #comparti...,7159943895445441797
576,2022-10-30,bolsonaromessiasjair,2022-10-29 17:13:45.948000,1900-01-07 00:00:00,23500,- Belo Horizonte/MG. - Presidente Jair Bolsona...,7160017587160485125


# 3. New Columns

Convert the dates in order to compute the `Dias Decorridos` (elapsed days) column:

In [14]:
# Convert to datetime
adjust_dates(df_total)
df_total.tail(5)

Unnamed: 0,DataColeta,Perfil,DataPost,DiaDaSemana,Curtidas,Texto,ID
572,2022-10-30,bolsonaromessiasjair,2022-10-29 09:14:06.702,Sunday,95300,#bolsonaro #rock #brasil #good #vibes #jairbol...,7159766081593150726
573,2022-10-30,bolsonaromessiasjair,2022-10-29 09:14:01.295,Sunday,240900,- Muito brigado a você que nos acompanhou até ...,7159773640030997766
574,2022-10-30,bolsonaromessiasjair,2022-10-29 09:13:55.570,Sunday,109600,#bolsonaro #lula #empolgado #brasil #🇧🇷,7159891102143630597
575,2022-10-30,bolsonaromessiasjair,2022-10-29 12:13:51.619,Sunday,52100,#vacina #presidente #jair #bolsonaro #comparti...,7159943895445441797
576,2022-10-30,bolsonaromessiasjair,2022-10-29 17:13:45.948,Sunday,23500,- Belo Horizonte/MG. - Presidente Jair Bolsona...,7160017587160485125
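
In the table above, `DiaDaSemana` is parsed from the spreadsheet's own serialized value (for example `1900-01-07`), not from `DataPost`. As a cross-check, the weekday can also be derived directly from the post date with `Series.dt.day_name()`. A minimal sketch on toy dates; the column name `WeekdayFromPost` is hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    'DataPost': pd.to_datetime(['2022-06-30 00:00:00', '2022-10-29 09:14:06']),
})

# day_name() returns the English weekday name of each timestamp
toy['WeekdayFromPost'] = toy['DataPost'].dt.day_name()
print(toy)
```

For these two dates the result is Thursday and Saturday, one day earlier than the values parsed from the serialized column, so `DiaDaSemana` may deserve a check before it is used in any analysis.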


In [15]:
df_total.groupby(['DataColeta']).count()

Unnamed: 0_level_0,Perfil,DataPost,DiaDaSemana,Curtidas,Texto,ID
DataColeta,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2022-10-02,425,425,425,425,425,425
2022-10-30,143,143,143,143,143,143


In [None]:
df_total['Dias Decorridos'] = (df_total['DataColeta'] - df_total['DataPost']).dt.days
df_total['Candidato'] = df_total['Perfil'].apply(lambda x: 'Lula' if x == 'lulaoficial' else 'Bolsonaro')

df_total.head()

Unnamed: 0,DataColeta,Perfil,DataPost,DiaDaSemana,Curtidas,Texto,ID,Dias Decorridos,Candidato
0,2022-10-02,lulaoficial,2022-06-30,Friday,11700,Alô alô geração tiktoker! Imagina só um #gover...,7115033431473474822,94,Lula
1,2022-10-02,lulaoficial,2022-06-30,Friday,33600,Já imaginou um #governo feito pra que as pesso...,7115174031162215686,94,Lula
2,2022-10-02,lulaoficial,2022-07-01,Saturday,34600,O pai ta estourado!😎 marque nos comentários um...,7115357413712153861,93,Lula
3,2022-10-02,lulaoficial,2022-07-01,Saturday,47500,A gente tem um um encontro marcado no dia 02 d...,7115560675824422149,93,Lula
4,2022-10-02,lulaoficial,2022-07-02,Sunday,22400,#PepsiApplePieChallenge Estamos fazendo uma #c...,7115793869152734470,92,Lula
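
Subtracting two datetime columns yields a timedelta, and `.dt.days` keeps only the whole days, so a post collected less than 24 hours after publication counts as 0 days. A sketch on toy rows, reusing dates that appear in the tables above:

```python
import pandas as pd

toy = pd.DataFrame({
    'DataColeta': pd.to_datetime(['2022-10-02 00:00:00', '2022-10-30 00:00:00']),
    'DataPost': pd.to_datetime(['2022-06-30 00:00:00', '2022-10-29 09:14:06']),
})

# Timedelta between collection and posting, truncated to whole days
elapsed = (toy['DataColeta'] - toy['DataPost']).dt.days
print(elapsed.tolist())  # → [94, 0]
```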


Function to check whether the text contains only hashtags.

If the text consists solely of hashtags, the function returns `True`; otherwise it returns `False`.

In [17]:
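import re

def contains_only_hashtags(text):
    # Find every '#' followed by one or more non-whitespace characters
    hashtags = re.findall(r'#\S+', text)
    # True when there are as many hashtag matches as whitespace-separated tokens
    # (note: an empty string also yields True, since 0 == 0)
    return len(hashtags) == len(text.split())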

In [18]:
df_total['Only Hashtags'] = df_total[column_text].apply(contains_only_hashtags)
df_total['Only Hashtags'] = df_total['Only Hashtags'].fillna(False)  # To handle tabs that pandas does not recognize
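
The regex-based check compares the number of hashtag matches with the number of whitespace-separated tokens. A stricter variant, sketched below as an alternative rather than the notebook's method, requires every token to start with `#` and treats empty text as `False`; the name `is_hashtag_only` is hypothetical:

```python
def is_hashtag_only(text: str) -> bool:
    """Return True only if the text is non-empty and every token starts with '#'."""
    tokens = text.split()
    # bool(tokens) guards against empty or whitespace-only text
    return bool(tokens) and all(token.startswith('#') for token in tokens)


examples = {
    '#bolsonaro #lula #brasil': True,
    'Alô alô geração tiktoker! #governo': False,
    '': False,
}
for text, expected in examples.items():
    assert is_hashtag_only(text) == expected
```

It could be applied the same way, for instance `df_total[column_text].apply(is_hashtag_only)`.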

# 4. Compute Embeddings

Encode the texts into embeddings with the `sentence-transformers` library, then concatenate them with fields already present in the input spreadsheet, such as `ID`, `Candidato`, and `Curtidas`.

In [None]:
# Load the model to test
model = SentenceTransformer(EMBEDDING_MODEL, device='cpu')  # use 'cuda' if a GPU and a CUDA-enabled PyTorch build are available

In [21]:
# Compute the sentence embeddings
embeddings = model.encode(df_total[column_text].to_numpy(), show_progress_bar=True)

Batches:   0%|          | 0/18 [00:00<?, ?it/s]

In [None]:
# Build a DataFrame of embeddings using the same index (row labels) as the original DataFrame
df_embeddings = pd.DataFrame(embeddings, index=df_total.index) 

df_embeddings.columns = [f'x{i+1}' for i in range(df_embeddings.shape[1])]
df_embeddings.tail(5)

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,...,x759,x760,x761,x762,x763,x764,x765,x766,x767,x768
572,0.008577,0.008128,0.016376,0.019288,0.031601,-0.00635,0.031105,0.024234,-0.029309,-0.024002,...,0.030606,0.017753,-0.002498,-0.038253,0.041505,-0.028203,-0.016435,0.007059,0.052966,0.013647
573,-0.004758,-0.018209,-0.004426,0.026147,0.054161,0.003876,0.02264,0.015291,-0.05047,-0.024312,...,0.055812,0.001787,0.014966,-0.032032,-0.000353,-0.029847,0.000815,0.01124,0.030023,0.035728
574,0.014291,0.015281,0.014075,0.013016,0.045404,0.006554,0.00414,0.017683,-0.046188,-0.02651,...,0.045256,0.0034,-0.003894,-0.046124,0.022977,-0.042103,-0.015663,-0.00343,0.046219,0.040913
575,0.020424,0.003011,0.014616,0.02236,0.021892,0.004031,0.028285,0.014492,-0.044687,-0.053916,...,0.029414,0.02005,-0.026424,-0.047081,0.008602,-0.02826,0.001365,-0.001123,0.044278,0.042991
576,-0.022435,0.004387,-0.00197,0.008333,0.044339,0.011664,0.015723,0.027566,-0.050004,-0.027405,...,0.048884,-0.003383,-0.029516,-0.02155,0.031778,-0.027595,-0.025059,0.004741,0.043421,0.031835
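
Passing `index=df_total.index` matters because rows were dropped earlier, so the row labels have gaps. Without it, the embeddings would get a fresh 0..n-1 index, and a column-wise `pd.concat(..., axis=1)` would misalign rows and introduce NaNs. A sketch with random numbers standing in for real embeddings (the 4-dimensional size and the toy frame are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame whose index has a gap, as happens after dropna()
meta = pd.DataFrame({'ID': ['a', 'b', 'c']}, index=[0, 2, 3])

# Random stand-in for model.encode(...) output: 3 rows, 4 dimensions
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(3, 4))

aligned = pd.DataFrame(fake_embeddings, index=meta.index)
aligned.columns = [f'x{i+1}' for i in range(aligned.shape[1])]
misaligned = pd.DataFrame(fake_embeddings)  # default RangeIndex 0..2

good = pd.concat([meta, aligned], axis=1)
bad = pd.concat([meta, misaligned], axis=1)

print(good.shape, int(good.isna().sum().sum()))
print(bad.shape, int(bad.isna().sum().sum()))
```

The aligned version keeps three complete rows, while the misaligned one grows to four rows with missing cells on the labels that exist in only one of the frames.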


The spreadsheet keeps the important original columns (__ID__, __Candidato__, and __Curtidas__), the derived elapsed-days column, a flag indicating whether the text consists only of hashtags (__Only Hashtags__), and finally the features computed by the embedding model, __x1__, __x2__, __x3__ ... __xN__, where N is the number of dimensions of that model.

# 5. Concatenate and Save the File

In [None]:
df = df_total[[column_id, column_cand, column_likes, 'Dias Decorridos', 'Only Hashtags']].copy()
df.head(5)


Unnamed: 0,ID,Candidato,Curtidas,Dias Decorridos,Only Hashtags
0,7115033431473474822,Lula,11700,94,False
1,7115174031162215686,Lula,33600,94,False
2,7115357413712153861,Lula,34600,93,False
3,7115560675824422149,Lula,47500,93,False
4,7115793869152734470,Lula,22400,92,False


In [33]:
# Concatenate with df_embeddings
df = pd.concat([df, df_embeddings], axis=1)
df

Unnamed: 0,ID,Candidato,Curtidas,Dias Decorridos,Only Hashtags,x1,x2,x3,x4,x5,...,x759,x760,x761,x762,x763,x764,x765,x766,x767,x768
0,7115033431473474822,Lula,11700,94,False,0.021364,-0.016445,0.022641,0.020986,0.027848,...,0.013324,-0.014651,-0.004243,-0.061445,0.007757,-0.039489,0.013565,-0.012188,0.039852,0.033160
1,7115174031162215686,Lula,33600,94,False,-0.000527,-0.017740,0.011214,0.010682,0.043282,...,0.041989,-0.005031,-0.001357,-0.065066,0.019234,-0.050373,0.004409,-0.012008,0.057075,0.024026
2,7115357413712153861,Lula,34600,93,False,0.007890,-0.012661,0.029693,0.004084,0.038024,...,0.021196,-0.019906,0.004805,-0.055777,0.013048,-0.026539,0.016227,0.005020,0.046948,0.022732
3,7115560675824422149,Lula,47500,93,False,0.004239,-0.007715,0.020440,0.021936,0.033811,...,0.040014,-0.030766,0.009765,-0.060900,0.018824,-0.022950,0.019792,0.002503,0.035500,0.013818
4,7115793869152734470,Lula,22400,92,False,-0.034147,-0.018919,0.011031,0.031991,0.014927,...,0.014115,0.000676,-0.008564,-0.042402,0.031594,-0.041089,-0.009193,-0.006140,0.030862,0.029791
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
572,7159766081593150726,Bolsonaro,95300,0,True,0.008577,0.008128,0.016376,0.019288,0.031601,...,0.030606,0.017753,-0.002498,-0.038253,0.041505,-0.028203,-0.016435,0.007059,0.052966,0.013647
573,7159773640030997766,Bolsonaro,240900,0,False,-0.004758,-0.018209,-0.004426,0.026147,0.054161,...,0.055812,0.001787,0.014966,-0.032032,-0.000353,-0.029847,0.000815,0.011240,0.030023,0.035728
574,7159891102143630597,Bolsonaro,109600,0,True,0.014291,0.015281,0.014075,0.013016,0.045404,...,0.045256,0.003400,-0.003894,-0.046124,0.022977,-0.042103,-0.015663,-0.003430,0.046219,0.040913
575,7159943895445441797,Bolsonaro,52100,0,True,0.020424,0.003011,0.014616,0.022360,0.021892,...,0.029414,0.020050,-0.026424,-0.047081,0.008602,-0.028260,0.001365,-0.001123,0.044278,0.042991


## 5.1. Save the File

In [None]:
df.to_excel(DIRECTORY + file_path_features, index=False)

print("File saved:", file_path_features)

File saved: Embeddings_mxbai-embed-large-v1_tiktok.xlsx
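
`to_excel` does not create missing folders, so the write fails if the output directory does not exist yet. A minimal sketch of preparing the path first with `pathlib`; `prepare_output_path` is a hypothetical helper, and the commented lines show how it could be used with the notebook's variables:

```python
from pathlib import Path


def prepare_output_path(directory: str, filename: str) -> Path:
    """Create the directory if needed and return the full output path."""
    out_dir = Path(directory)
    # parents=True creates intermediate folders; exist_ok=True avoids an
    # error when the folder is already there
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / filename


# Possible usage with the notebook's variables:
# out_path = prepare_output_path(DIRECTORY, file_path_features)
# df.to_excel(out_path, index=False)
```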
