# Calculate Embeddings

Ajuda a calcular embeddings dos dados adquiridos dos comentários das redes sociais de políticos. (Planilha de Gabrielly)

Entrada: `LulaTotal Validado.xlsx` e `Bolsonaro Validado.xlsx`

Saída: `Embeddings_(NOME_DO_MODELO)_(REDE_SOCIAL).xlsx`

In [23]:
import pandas as pd
from sentence_transformers import SentenceTransformer

BASE_PATH = 'dados/'
DIRECTORY = BASE_PATH + 'preprocessed/embeddings/'
SOCIAL_NETWORK = 'tiktok'

In [24]:
column_types = {'ID' : str}

# 1. Leitura de Arquivos e inicialização de Variáveis

Iniciação das variáveis que iremos utilizar e das colunas na qual iremos trabalhar


In [25]:
# 1. Pegar o modelo para testar
TYPE_MODEL = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'
model = SentenceTransformer(TYPE_MODEL)

# 2. Pegar as sentenças (nesse caso, no Post-filtrado)
file_pathL = BASE_PATH + 'input/LulaTotal Validado.xlsx'
file_pathB = BASE_PATH + 'input/Bolsonaro Validado.xlsx'
file_path_features = 'Embeddings_paraphrase-multilingual-MiniLM_'+(SOCIAL_NETWORK) +'.xlsx'
#é bom modificar o nome manualmente, pois pode dar erro na questão do save

column_text = "Texto"
column_id = "ID"
column_author = "Perfil"
column_likes = "Curtidas"
column_inicial_date = "DataColeta"
column_final_date = "DataPost"
column_cand = "Candidato"

In [26]:
# converte para datetime 
def adjust_dates(df_aux):
    df_aux['DataColeta'] = pd.to_datetime(df_aux['DataColeta'])
    df_aux['DataPost'] = pd.to_datetime(df_aux['DataPost'])

    try:
        # converte esta coluna para conter uma string com o dia da semana
        df_aux['DiaDaSemana'] = pd.to_datetime(df_aux['DiaDaSemana'])
        df_aux['DiaDaSemana'] = df_aux['DiaDaSemana'].dt.strftime('%A')
    except:
        print("Coluna DiaDaSemana parece já estar guardado como date. Não precisa transformar!")

In [27]:
# 3. Ler os Arquivos tanto de Lula quanto de bolsonaro
rfL = pd.read_excel(file_pathL, dtype=column_types)
rfB = pd.read_excel(file_pathB, dtype=column_types)

rfL.shape, rfB.shape


((308, 22), (269, 22))

União das planilha e limpeza de variáveis que estão em branco, ou seja, não possuem utilidade.

In [28]:
rf_total = pd.concat([rfL, rfB], axis=0)
rf_total.shape

(577, 22)

In [29]:
len(rf_total['ID'].unique())


577

In [30]:
rf_total = rf_total.reset_index(drop=True)
rf_total.head(5)

Unnamed: 0.1,Unnamed: 0,DataColeta,Perfil,DataPost,DiaDaSemana,Plays,Curtidas,Comentarios,Compart.,Texto,...,LinkPost,ID,Duracao,Retórica Aristotélica,Dispositivo Retórico,Tipo de conteúdo,Abordagem,Tonalidade,Main character,Texto / Hashtag
0,1,2022-10-02,lulaoficial,2022-06-30 00:00:00,1900-01-05 00:00:00,196800.0,11700,809,589.0,Alô alô geração tiktoker! Imagina só um #gover...,...,https://www.tiktok.com/@lulaoficial/video/7115...,7115033431473474822,17.3,Pathos,Political Statement,Political-Purposeful,Acclamation,Neutral,Self alone,Texto + Hashtag
1,2,2022-10-02,lulaoficial,2022-06-30 00:00:00,1900-01-05 00:00:00,522000.0,33600,3324,3973.0,Já imaginou um #governo feito pra que as pesso...,...,https://www.tiktok.com/@lulaoficial/video/7115...,7115174031162215686,60.16,Pathos,Political Statement,Political-Purposeful,Acclamation,Neutral,Self alone,Texto + Hashtag
2,3,2022-10-02,lulaoficial,2022-07-01 00:00:00,1900-01-06 00:00:00,427900.0,34600,2289,1752.0,O pai ta estourado!😎 marque nos comentários um...,...,https://www.tiktok.com/@lulaoficial/video/7115...,7115357413712153861,14.88,Ethos,Humor,Campaign Act,Acclamation,Positive,Self alone,Texto + Hashtag
3,4,2022-10-02,lulaoficial,2022-07-01 00:00:00,1900-01-06 00:00:00,882200.0,47500,4312,2257.0,A gente tem um um encontro marcado no dia 02 d...,...,https://www.tiktok.com/@lulaoficial/video/7115...,7115560675824422149,15.39,Pathos,Call to Action,Campaign Act,Acclamation,Positive,Self alone,Texto + Hashtag
4,5,2022-10-02,lulaoficial,2022-07-02 00:00:00,1900-01-07 00:00:00,262200.0,22400,2150,1438.0,#PepsiApplePieChallenge Estamos fazendo uma #c...,...,https://www.tiktok.com/@lulaoficial/video/7115...,7115793869152734470,60.93,Pathos,Commitment,Political-Ideological,Acclamation,Positive,Self + voters,Texto + Hashtag


In [31]:
rf_total.dropna(inplace=True)

Observar o que foi eliminado e conferir se está tudo certo:

In [32]:
rf_total.shape

(568, 22)

In [33]:
rf_total[rf_total.isna().sum(axis=1) > 0].groupby('Perfil').count()

Unnamed: 0_level_0,Unnamed: 0,DataColeta,DataPost,DiaDaSemana,Plays,Curtidas,Comentarios,Compart.,Texto,LinkFoto,...,LinkPost,ID,Duracao,Retórica Aristotélica,Dispositivo Retórico,Tipo de conteúdo,Abordagem,Tonalidade,Main character,Texto / Hashtag
Perfil,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1


In [34]:
rf_total.loc[rf_total.isna().sum(axis=1) > 0].shape
rf_total.shape

(568, 22)

# 2. Configuração da planilha pré-processada
## 2.1 Limpeza de colunas 

Drop de colunas que não iremos utilizar como por exemplo `Duração` ou até `Dispositivo Retórico` etc.

In [35]:
rf_total.drop(columns=["Unnamed: 0","Plays","Comentarios","Compart.","Duracao", "LinkFoto", "LinkVideo", "LinkPost", 
                       "Retórica Aristotélica", "Dispositivo Retórico", "Tipo de conteúdo",
                  "Abordagem", "Tonalidade", "Main character", "Texto / Hashtag"], inplace=True)
rf_total.tail(5)

Unnamed: 0,DataColeta,Perfil,DataPost,DiaDaSemana,Curtidas,Texto,ID
572,2022-10-30,bolsonaromessiasjair,2022-10-29 09:14:06.702000,1900-01-07 00:00:00,95300,#bolsonaro #rock #brasil #good #vibes #jairbol...,7159766081593150726
573,2022-10-30,bolsonaromessiasjair,2022-10-29 09:14:01.295000,1900-01-07 00:00:00,240900,- Muito brigado a você que nos acompanhou até ...,7159773640030997766
574,2022-10-30,bolsonaromessiasjair,2022-10-29 09:13:55.570000,1900-01-07 00:00:00,109600,#bolsonaro #lula #empolgado #brasil #🇧🇷,7159891102143630597
575,2022-10-30,bolsonaromessiasjair,2022-10-29 12:13:51.619000,1900-01-07 00:00:00,52100,#vacina #presidente #jair #bolsonaro #comparti...,7159943895445441797
576,2022-10-30,bolsonaromessiasjair,2022-10-29 17:13:45.948000,1900-01-07 00:00:00,23500,- Belo Horizonte/MG. - Presidente Jair Bolsona...,7160017587160485125


In [36]:
rf_total.loc[rf_total.isna().sum(axis=1) > 0].shape
rf_total.dropna(inplace=True)
rf_total.shape

(568, 7)

Conversão de datas para calcular a coluna de `Dias Decorridos`:

In [37]:
# converte para datetime
adjust_dates(rf_total)
rf_total.tail(5)

Unnamed: 0,DataColeta,Perfil,DataPost,DiaDaSemana,Curtidas,Texto,ID
572,2022-10-30,bolsonaromessiasjair,2022-10-29 09:14:06.702,Sunday,95300,#bolsonaro #rock #brasil #good #vibes #jairbol...,7159766081593150726
573,2022-10-30,bolsonaromessiasjair,2022-10-29 09:14:01.295,Sunday,240900,- Muito brigado a você que nos acompanhou até ...,7159773640030997766
574,2022-10-30,bolsonaromessiasjair,2022-10-29 09:13:55.570,Sunday,109600,#bolsonaro #lula #empolgado #brasil #🇧🇷,7159891102143630597
575,2022-10-30,bolsonaromessiasjair,2022-10-29 12:13:51.619,Sunday,52100,#vacina #presidente #jair #bolsonaro #comparti...,7159943895445441797
576,2022-10-30,bolsonaromessiasjair,2022-10-29 17:13:45.948,Sunday,23500,- Belo Horizonte/MG. - Presidente Jair Bolsona...,7160017587160485125


In [38]:
rf_total.groupby(['DataColeta']).count()

Unnamed: 0_level_0,Perfil,DataPost,DiaDaSemana,Curtidas,Texto,ID
DataColeta,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2022-10-02,425,425,425,425,425,425
2022-10-30,143,143,143,143,143,143


In [39]:
rf_total['DiasDecorridos'] = (rf_total['DataColeta'] - rf_total['DataPost']).dt.days
rf_total

rf_total['Candidato'] = rf_total['Perfil'].apply(lambda x: 'Lula' if x == 'lulaoficial' else 'Bolsonaro')

## 2.2 Flag para análise de Hashtags

Função para verificar se o texto contém apenas hashtags

Caso o comentário só possua hashtags ele irá retornar  `true`, caso contrário a função retorna `false`

In [40]:
import re

def contains_only_hashtags(text):
    hashtags = re.findall(r'#\S+', text)
    return len(hashtags) == len(text.split())

In [41]:
ids = rf_total[column_id].tolist()
authors = rf_total[column_author].tolist()
sentences = rf_total[column_text].tolist()
likes = rf_total[column_likes].tolist()
days = rf_total['DiasDecorridos'].tolist()
candidatos = rf_total['Candidato'].tolist()

Encode dos embeddings utilizando a library do "sentence-transformers" e também a concatenação de dados previamente informados da planilha passada como `ID`,`Candidato` e `Curtidas`

In [42]:
# 4. Calcular os embeddings das sentenças
embeddings = model.encode(sentences)
df_embeddings = pd.DataFrame(embeddings) 
df_embeddings.columns = [f'x{i+1}' for i in range(df_embeddings.shape[1])]



In [43]:
df_embeddings.tail(5)

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,...,x375,x376,x377,x378,x379,x380,x381,x382,x383,x384
563,0.062468,-0.165707,-0.014988,-0.046828,-0.138289,0.048322,0.310527,0.314963,-0.1009,-0.065606,...,-0.002563,-0.106669,-0.268587,0.030322,-0.096264,-0.001911,0.384258,-0.307925,-0.035968,0.234999
564,-0.163468,0.077066,-0.156664,-0.085776,0.133453,0.225597,0.063915,-0.064023,0.044049,0.114685,...,0.317411,-0.00412,0.018524,-0.022853,-0.236356,0.061097,0.077973,0.205867,0.109763,0.176917
565,0.131149,-0.077428,-0.067303,-0.020109,0.218621,-0.06111,0.27506,0.250501,0.05785,0.129988,...,0.06279,-0.291914,-0.257054,-0.115412,0.018693,0.142378,0.403265,-0.229022,-0.246011,0.117278
566,0.011311,-0.0634,0.107073,-0.160402,0.371433,0.144655,0.10261,0.245211,0.122302,0.229537,...,0.262899,-0.199217,-0.148512,-0.104007,0.125126,0.068031,0.212707,-0.136398,-0.033402,0.186931
567,0.018853,0.130617,-0.012598,-0.090308,0.292937,0.103336,-0.02382,0.11903,0.077762,0.114861,...,0.281803,-0.258964,-0.153638,-0.012124,-0.048347,0.000193,0.230905,0.01098,-0.234178,0.090194


A planilha ficará com as colunas importantes originais (__ID__,__Candidato__ e __Curtidas__), e também estará com uma flag que mostra se há hashtags(__Only Hashtags__), e por fim, features calculadas pelo embeddinG   __x1__,__x2__,__x3__ ... __xN__ , onde N seria o tamanho de dimensões que aquele modelo possui

# 3. Concatenar e Salvar o Arquivo

In [44]:
df_final = pd.DataFrame({
    column_id: ids,
    'Candidato': candidatos,
    column_likes: likes,
    'Dias Decorridos': days
    
})

df_final['Only Hashtags'] = rf_total[column_text].apply(contains_only_hashtags)
df_final['Only Hashtags'] = df_final['Only Hashtags'].fillna(False)  # Para lidar com tabulações que o pandas não reconhece

# Concatena com df_embeddings
df_final = pd.concat([df_final, df_embeddings], axis=1)


  df_final['Only Hashtags'] = df_final['Only Hashtags'].fillna(False)  # Para lidar com tabulações que o pandas não reconhece


In [45]:
df_final.tail(5)


Unnamed: 0,ID,Candidato,Curtidas,Dias Decorridos,Only Hashtags,x1,x2,x3,x4,x5,...,x375,x376,x377,x378,x379,x380,x381,x382,x383,x384
563,7159766081593150726,Bolsonaro,95300,0,True,0.062468,-0.165707,-0.014988,-0.046828,-0.138289,...,-0.002563,-0.106669,-0.268587,0.030322,-0.096264,-0.001911,0.384258,-0.307925,-0.035968,0.234999
564,7159773640030997766,Bolsonaro,240900,0,True,-0.163468,0.077066,-0.156664,-0.085776,0.133453,...,0.317411,-0.00412,0.018524,-0.022853,-0.236356,0.061097,0.077973,0.205867,0.109763,0.176917
565,7159891102143630597,Bolsonaro,109600,0,True,0.131149,-0.077428,-0.067303,-0.020109,0.218621,...,0.06279,-0.291914,-0.257054,-0.115412,0.018693,0.142378,0.403265,-0.229022,-0.246011,0.117278
566,7159943895445441797,Bolsonaro,52100,0,True,0.011311,-0.0634,0.107073,-0.160402,0.371433,...,0.262899,-0.199217,-0.148512,-0.104007,0.125126,0.068031,0.212707,-0.136398,-0.033402,0.186931
567,7160017587160485125,Bolsonaro,23500,0,True,0.018853,0.130617,-0.012598,-0.090308,0.292937,...,0.281803,-0.258964,-0.153638,-0.012124,-0.048347,0.000193,0.230905,0.01098,-0.234178,0.090194


## 3.1 Salvar o arquivo

In [987]:

df_final.to_excel((DIRECTORY + file_path_features), index=False)