# Calculate Embeddings

Helps compute embeddings for the data collected from comments on politicians' social media.

Input: `Post-filtrado.xlsx`

Output: `Embeddings_(MODEL_NAME).xlsx`

In [178]:
import pandas as pd
import re
from sentence_transformers import SentenceTransformer

BASE_PATH = 'dados/'
DIRECTORY = BASE_PATH + 'preprocessed/embeddings/'

Initialization of the variables we will use and of the columns we will work with.


In [179]:
# 1. Load the model to test
TYPE_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'
model = SentenceTransformer(TYPE_MODEL)

# 2. Locate the sentences (in this case, in Post-filtrado)
file_path = BASE_PATH + 'Post-filtrado.xlsx'
file_path_features = 'Embeddings_all-MiniLM.xlsx'
# It is best to set this name manually, since it can otherwise cause an error when saving

column_text = "Texto"
column_id = "ID"
column_author = "Autor"
column_likes = "Curtidas"
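
The advice to name the output file by hand probably relates to model identifiers such as `sentence-transformers/all-MiniLM-L6-v2`, where the `/` would be read as a directory separator. A minimal sketch, assuming one prefers to derive the name from `TYPE_MODEL`; `features_filename` is a hypothetical helper that is not part of this notebook:

```python
import re

def features_filename(model_name: str) -> str:
    """Build an output file name from a model identifier (hypothetical helper)."""
    # Keep only the part after the last '/', e.g. 'all-MiniLM-L6-v2'
    short_name = model_name.rsplit('/', 1)[-1]
    # Replace anything that is not a letter, digit, dot, dash or underscore
    safe_name = re.sub(r'[^A-Za-z0-9._-]', '_', short_name)
    return f'Embeddings_{safe_name}.xlsx'

print(features_filename('sentence-transformers/all-MiniLM-L6-v2'))
# prints: Embeddings_all-MiniLM-L6-v2.xlsx
```

Note that this yields a slightly longer name than the hand-written `Embeddings_all-MiniLM.xlsx` used above.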

In [180]:
# 3. Read the file and drop rows whose text is NaN
rf = pd.read_excel(file_path)
rf = rf.dropna(subset=[column_text])

rf.head(3)

| | ID | Autor | Data | Texto | Link | Rede | Tipo | Curtidas | Comentários | Compart. | Hora |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7114971700365692165 | Jair Bolsonaro | 2022-06-30 | #jairbolsonaro #bolsonaro #palavras #president... | https://www.tiktok.com/@bolsonaromessiasjair/v... | tiktok | Video | 24400 | 1083 | 1566 | |
| 1 | 7115033431473474822 | Lula | 2022-06-30 | Alô alô geração tiktoker! Imagina só um #gover... | https://www.tiktok.com/@lulaoficial/video/7115... | tiktok | Video | 11700 | 809 | 589 | |
| 2 | 7115050482179050758 | Jair Bolsonaro | 2022-06-30 | #emprego #jair #bolsonaro #jairbolsonaro #pres... | https://www.tiktok.com/@bolsonaromessiasjair/v... | tiktok | Video | 9163 | 480 | 1140 | |


Function to check whether the text contains only hashtags

If the comment contains nothing but hashtags, the function returns `True`; otherwise it returns `False`.

In [181]:
import re

def contains_only_hashtags(text):
    """Return True when the number of hashtags equals the number of whitespace-separated tokens."""
    hashtags = re.findall(r'#\S+', text)
    return len(hashtags) == len(text.split())
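
The count-based check has a few edge cases worth knowing about: an empty string gives `0 == 0` and is flagged as hashtag-only, and a token such as `preço#1` counts as a hashtag because the pattern matches anywhere inside it. A sketch of a stricter, token-by-token variant follows; `only_hashtags_strict` is a hypothetical name and not what the notebook uses:

```python
import re

HASHTAG = re.compile(r'#\S+')

def only_hashtags_strict(text) -> bool:
    """Return True only if every whitespace-separated token is a hashtag (hypothetical variant)."""
    if not isinstance(text, str):
        return False  # NaN or numeric cells are not hashtag-only text
    tokens = text.split()
    # An empty or whitespace-only text has no tokens, so it is not hashtag-only
    return bool(tokens) and all(HASHTAG.fullmatch(tok) for tok in tokens)

print(only_hashtags_strict('#emprego #jair #bolsonaro'))  # True
print(only_hashtags_strict('Alô alô #governo'))           # False
print(only_hashtags_strict(''))                           # False
```

Whether empty texts should count as hashtag-only is a modelling decision; the variant above simply makes the choice explicit.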

In [182]:
ids = rf[column_id].tolist()
authors = rf[column_author].tolist()
sentences = rf[column_text].tolist()
likes = rf[column_likes].tolist()

Encode the sentences with the `sentence-transformers` library and concatenate the result with data already present in the input spreadsheet, namely `ID`, `Candidato` and `Curtidas`.

In [183]:
# 4. Compute the sentence embeddings
embeddings = model.encode(sentences)
df_embeddings = pd.DataFrame(embeddings) 
df_embeddings.columns = [f'x{i+1}' for i in range(df_embeddings.shape[1])]



The spreadsheet keeps the important original columns (__ID__, __Candidato__ and __Curtidas__), adds a flag showing whether the text consists only of hashtags (__Only Hashtags__), and ends with the features computed by the embedding, __x1__, __x2__, __x3__ ... __xN__, where N is the number of dimensions of the chosen model.
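
The layout described above can be written down as a list of expected column names, which is handy for a quick sanity check before saving. This is only a sketch; `expected_columns` is a hypothetical helper, and the width of 384 is the one that appears in this notebook's output for `all-MiniLM-L6-v2`:

```python
def expected_columns(n_dims: int) -> list[str]:
    """List the columns the output spreadsheet should have (hypothetical helper)."""
    meta = ['ID', 'Candidato', 'Curtidas', 'Only Hashtags']
    # Feature columns are numbered from 1, matching x1 ... xN
    features = [f'x{i + 1}' for i in range(n_dims)]
    return meta + features

cols = expected_columns(384)
print(len(cols))   # 388
print(cols[:6])    # ['ID', 'Candidato', 'Curtidas', 'Only Hashtags', 'x1', 'x2']
print(cols[-1])    # x384
```

Once the final table exists, a check such as `list(df_final.columns) == expected_columns(df_embeddings.shape[1])` would confirm the layout.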

# Concatenate and Save the File

In [184]:
df_final = pd.DataFrame({
    column_id: ids,
    "Candidato": authors,
    column_likes: likes,
})

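# Build the flag from the `sentences` list so it matches df_final row by row.
# rf keeps its original row labels after dropna, so assigning a Series taken from rf
# would align on those labels and could leave NaN (or shifted values) in df_final.
df_final['Only Hashtags'] = [contains_only_hashtags(text) for text in sentences]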


df_final = pd.concat([df_final, df_embeddings], axis=1)
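
One detail to keep in mind in this cell: `rf` keeps its original row labels after `dropna`, while `df_final` and `df_embeddings` are built from lists and get a fresh 0..n-1 index. pandas matches a Series to a DataFrame by label, not by position, so values taken from `rf` can end up as NaN or on the wrong row. A toy sketch with made-up data (none of these values come from the real spreadsheet):

```python
import pandas as pd

# Toy stand-in for the input sheet; the second row has no text
rf = pd.DataFrame({'ID': [10, 11, 12], 'Texto': ['#a #b', None, 'hello #c']})
rf = rf.dropna(subset=['Texto'])                     # remaining labels: 0 and 2

df_final = pd.DataFrame({'ID': rf['ID'].tolist()})   # fresh labels: 0 and 1

# Assignment aligns on labels: label 1 does not exist in rf, so row 1 becomes NaN
df_final['by_label'] = rf['Texto'].str.startswith('#')

# Dropping the old labels first restores row-by-row agreement
df_final['by_position'] = rf['Texto'].str.startswith('#').reset_index(drop=True)

print(df_final)
```

Calling `rf = rf.reset_index(drop=True)` right after the `dropna` would be another way to keep every later step aligned.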




In [185]:
df_final.head()

| | ID | Candidato | Curtidas | Only Hashtags | x1 | x2 | x3 | x4 | x5 | x6 | ... | x375 | x376 | x377 | x378 | x379 | x380 | x381 | x382 | x383 | x384 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7114971700365692165 | Jair Bolsonaro | 24400 | True | 0.026735 | -0.003081 | 0.019811 | -0.081651 | -0.035613 | 0.068314 | ... | -0.011839 | -0.037309 | 0.061741 | 0.068183 | 0.070977 | -0.030143 | 0.15576 | 0.021305 | 0.086035 | -0.026262 |
| 1 | 7115033431473474822 | Lula | 11700 | False | -0.009553 | 0.061238 | 0.019681 | -0.049951 | -0.063027 | -0.004563 | ... | 0.014746 | -0.040006 | 0.092884 | 0.073248 | 0.002842 | -0.00905 | 0.043108 | 0.078581 | 0.029602 | -0.012377 |
| 2 | 7115050482179050758 | Jair Bolsonaro | 9163 | True | 0.032516 | -0.005751 | 0.035155 | -0.092776 | -0.036708 | 0.049461 | ... | 0.0064 | -0.045546 | 0.033622 | 0.064346 | 0.075292 | -0.016813 | 0.159488 | 0.015305 | 0.095402 | 0.01487 |
| 3 | 7115120078982630661 | Jair Bolsonaro | 3485 | True | 0.045939 | -0.002861 | 0.021989 | -0.038963 | -0.02042 | 0.04267 | ... | 0.039002 | -0.059857 | 0.057749 | 0.066202 | 0.075185 | 0.022375 | 0.086455 | -0.025722 | 0.059293 | -0.028086 |
| 4 | 7115161088219565317 | Jair Bolsonaro | 22100 | True | 0.008706 | -0.012809 | -0.017359 | -0.073054 | 0.009759 | 0.068075 | ... | 0.021046 | -0.01211 | 0.016464 | 0.042351 | 0.050253 | -0.033603 | 0.129427 | -0.011694 | 0.069729 | 0.012653 |


# Save the file

In [186]:

df_final.to_excel((DIRECTORY + file_path_features), index=False)
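
The save step assumes that the `dados/preprocessed/embeddings/` folder already exists; writing into a missing folder will typically fail. A small sketch of creating it beforehand, where `ensure_output_path` is a hypothetical helper and the demonstration runs in a temporary folder so that nothing is written to the project:

```python
import tempfile
from pathlib import Path

def ensure_output_path(directory: str, filename: str) -> Path:
    """Create `directory` if needed and return the full output path (hypothetical helper)."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out_dir / filename

with tempfile.TemporaryDirectory() as tmp:
    target = ensure_output_path(tmp + '/preprocessed/embeddings/', 'Embeddings_all-MiniLM.xlsx')
    print(target.parent.is_dir())  # True
```

In the notebook this could be used as `df_final.to_excel(ensure_output_path(DIRECTORY, file_path_features), index=False)`.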