# Political Labelling

This script determines the political affiliation (left, center, right) of each user in our sample by analyzing the retweets they have made.

We use a list of political influencers previously categorized as left, center, or right by La Silla Vacia, a Colombian news outlet. For each user, we tally the retweets they have made of each influencer (excluding retweets with comments). From these tallies, we calculate the total number of retweets associated with each political category.
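
As a minimal sketch of the tally described above, the snippet below counts retweets per political category for each user. The influencer IDs, user IDs, and retweet pairs are made up for illustration; the real mapping comes from La Silla Vacia's categorization.

```python
from collections import Counter, defaultdict

# Hypothetical influencer-to-label map (IDs are illustrative, not real accounts).
influencer_label = {101: "Izquierda", 102: "Derecha", 103: "Centro"}

# Hypothetical retweets as (retweeting user ID, retweeted author ID) pairs.
retweet_pairs = [(1, 101), (1, 101), (1, 103), (2, 102), (2, 999)]

# Count, per user, how many retweets fall in each political category.
tally = defaultdict(Counter)
for user_id, author_id in retweet_pairs:
    label = influencer_label.get(author_id)
    if label is not None:  # retweets of accounts outside the list are ignored
        tally[user_id][label] += 1

print(dict(tally[1]))
print(dict(tally[2]))
```

Retweets of accounts that are not in the influencer list carry no label and therefore do not contribute to any category.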

This process is carried out on tweets from the "Paro Nacional" period and on tweets that are not from this period, across three sections:

1. Paro Nacional tweets
2. Tweets not related to the Paro Nacional
3. Outputs

In [2]:
import pickle
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
import scipy.sparse as sp
import os

# Load data on congress members and parties

In [3]:
partidos = pd.read_excel("~/clasificacion_partidos_v1.xlsx","Sheet1")
partidos = partidos.loc[:,['codigo_partido','ideologia']]

congresistas = pd.read_excel("~/Twitter Congresistas.xlsx","Sheet1")
congresistas = congresistas.loc[:,('Partido', 'Twitter')]

# We load the retweets_lite DataFrame for the analysis
retweets = pd.read_pickle('/mnt/disk2/Data/Tweets_DataFrames/retweets_lite.gzip', compression='gzip')

# We load the map that relates an ID to a political Label
with open("/mnt/disk2/Data/Pickle/User_Dicts/mapa.pkl", "rb") as file:
    mapa = pickle.load(file)

In [4]:
def partidos_dict(x):
    
    d = {
        "PartidoSocialdeUnidadNacional":20050002,#PartidodelaU
        "CentroDemocrático":20130001,#CentroDemocrático
        "ListadelaDecencia":20180008,#ListadelaDecencia
        "AlianzaVerde":20090002,#AlianzaVerde
        "PartidoConservadorColombiano":18490002,#PartidoConservadorColombiano
        "PartidoLiberalColombiano":18480001,#PartidoLiberalColombiano
        "ColombiaJustaLibres":20170001,#ColombiaJustaLibres
        "PoloDemocráticoAlternativo":20050001,#PoloDemocráticoAlternativo
        "PartidoCambioRadical":20030001,#PoloCambioRadical
        "ConsejoComunitariodeComunidadesNegrasPlayaRenaciente":20180014,#ConsejoComunitarioPlayaRenaciente
        "Comunes":20170003,#PartidoComunes
        "MovimientoAlternativoIndígenaySocial":20130002,#MAIS
        "ConsejoComunitarioLaMamuncia":20180029,#ConsejoComunitariolaMamuncia(wrongnameinexternaldatabase)
        "CoaliciónAlternativaSantandereana":20180042,#CoaliciónAlternativaSantandereana
        "MovimientoIndependientedeRenovaciónAbsoluta":20000036,#MIRA
        "OpciónCiudadana":20090001#Partido Opción Ciudadana
    }
    
    try:
        return d[x]
    except KeyError:
        return 0

def idelogias_dict(x):
    d = {
        1:'Izquierda',
        2:'Derecha',
        3:'Centro',
        4:'Sin Clasificar'
    }
    
    try:
        return d[x]
    except KeyError:
        return 'nada'
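
The two lookup helpers above return a fallback value when a key is missing. As a sketch of an equivalent, more compact form, `dict.get` accepts the fallback directly. The function names below are illustrative and not part of the notebook; the codes and labels are the ones used in `idelogias_dict`.

```python
# Same ideology codes as in the helper above.
ideologies = {1: "Izquierda", 2: "Derecha", 3: "Centro", 4: "Sin Clasificar"}

def lookup_with_try(code):
    # Explicit try/except, as written in the notebook.
    try:
        return ideologies[code]
    except KeyError:
        return "nada"

def lookup_with_get(code):
    # dict.get returns the fallback when the key is missing.
    return ideologies.get(code, "nada")

print(lookup_with_get(2), lookup_with_get(99))
```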

In [5]:
# Dictionary mapping each user ID to its Twitter username
user_name = (
    retweets[['Referenced Tweet Author ID', 'Referenced Tweet Author Name']]
    .drop_duplicates()
    .astype({'Referenced Tweet Author ID':int})
    .set_index('Referenced Tweet Author ID')
    .to_dict()['Referenced Tweet Author Name']
)

# Clean the party names so the party ID can be added
congresistas.Partido = congresistas.Partido.str.replace(' ','')
congresistas['Partido ID'] = congresistas['Partido'].apply(partidos_dict)
congresistas_new = congresistas.merge(partidos,left_on='Partido ID', right_on='codigo_partido', how = 'left')

# Get the ideology
congresistas_new['Afiliacion'] = congresistas_new.ideologia.apply(idelogias_dict)
mapa_2 = (
    congresistas_new.merge(retweets, left_on='Twitter', right_on='Referenced Tweet Author Name')
    .loc[:,('Referenced Tweet Author ID','Afiliacion')]
    .query("Afiliacion != 'nada'")
    .drop_duplicates()
    .astype({'Referenced Tweet Author ID':int})
    .set_index('Referenced Tweet Author ID')
    .to_dict()['Afiliacion']
)

In [6]:
common_keys = list(set(mapa.keys()).intersection(set(mapa_2.keys())))
print(f"There are {len(mapa)} users identified by La Silla")
print(f"There are {len(mapa_2)} congress members identified by their party")
print(f"There are {len(common_keys)} users identified by both sources")
print('')
for key in common_keys:
    if mapa[key] != mapa_2[key]:
        # Use La Silla Vacia's label when the two sources disagree
        print(f"{user_name[key]}: La Silla says {mapa[key]} but party is {mapa_2[key]}")
    mapa_2.pop(key)

# Merge both dictionaries
mapa_full = mapa_2 | mapa
print('')
print(f"Total dictionary entries: {len(mapa_full)}")

# Save
with open("/mnt/disk2/Data/Pickle/User_Dicts/mapa_full.pkl", "wb") as file:
    pickle.dump(mapa_full, file)

There are 93 users identified by La Silla
There are 228 congress members identified by their party
There are 15 users identified by both sources

JERobledo: La Silla says Centro but party is Izquierda
RoyBarreras: La Silla says Izquierda but party is Derecha
angelamrobledo: La Silla says Centro but party is Izquierda
AABenedetti: La Silla says Izquierda but party is Derecha
intiasprilla: La Silla says Izquierda but party is Centro

Total dictionary entries: 306
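
The merge above gives La Silla Vacia's label precedence over the party-based one. As a small sketch of why, the dict union operator (Python 3.9+) keeps the right operand's value on shared keys, so removing the common keys first is not strictly required for the result. The IDs and labels below are made up for illustration.

```python
# Hypothetical label maps keyed by user ID (values are illustrative).
silla = {1: "Centro", 2: "Izquierda"}
party = {2: "Derecha", 3: "Derecha"}

# With `left | right`, values from the right operand win on shared keys,
# so the La Silla Vacia label is kept for user 2.
merged = party | silla

# Users on which the two sources disagree.
conflicts = {k for k in silla.keys() & party.keys() if silla[k] != party[k]}

print(merged)
print(conflicts)
```

The size of the merged dictionary equals the sum of both sizes minus the number of shared keys, which matches the counts printed above (93 + 228 - 15 = 306).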


# CHECKPOINT: Load Paro Nacional data

Load all the pickle files we will need.

In [10]:
# We load the retweets DataFrame for the analysis
retweets = pd.read_pickle('/mnt/disk2/Data/Tweets_DataFrames/retweets.gzip', compression='gzip')

# We load the map that relates an ID to a political Label
with open("/mnt/disk2/Data/Pickle/User_Dicts/mapa_full.pkl", "rb") as file:
    mapa = pickle.load(file)

In [11]:
# Now we assign each RT a political label according to its influencer's label.
retweets["Party"] = retweets["Referenced Tweet Author ID"].map(mapa)

# We select all non-NA labeled RT.
retweets = retweets[retweets["Party"].notna()]
print(retweets["Party"].value_counts())
retweets.head()

Party
Izquierda         3902002
Derecha           1091764
Centro             479654
Sin Clasificar       5105
Name: count, dtype: int64


Unnamed: 0,Tweet ID,Author ID,Author Name,Referenced Tweet Author ID,Referenced Tweet Author Name,Date,Referenced Tweet,Party
20,1405513539427635200,788250746,Laura_Milena98,142092456,JulianRoman,2021/06/17 08:11:55,1.405305e+18,Izquierda
22,1405510167844765696,788250746,Laura_Milena98,237445795,subcantante,2021/06/17 07:58:31,1.405317e+18,Izquierda
27,1404183860339003392,788250746,Laura_Milena98,237445795,subcantante,2021/06/13 16:08:15,1.404126e+18,Izquierda
29,1403881554112360448,788250746,Laura_Milena98,939908874357870592,DonIzquierdo_,2021/06/12 20:07:00,1.403795e+18,Izquierda
34,1403527396733640704,788250746,Laura_Milena98,237445795,subcantante,2021/06/11 20:39:42,1.403517e+18,Izquierda


For every user in the community, we create a 3x1 vector of non-negative integers that records how many RTs the user has made for each political affiliation.

In [12]:
# We create lambda-functions that count the number of RTs for each political label.
a = lambda x: np.sum(x == "Derecha")
b = lambda x: np.sum(x == "Izquierda")
c = lambda x: np.sum(x == "Centro")

# We count the RTs given per political label for each user using the lambda-functions.
rts_usuario_paro = retweets.groupby("Author ID").agg({"Party": [a,b,c]})

rts_usuario_paro.columns = ["Retweets Derecha", 
                       "Retweets Izquierda", 
                       "Retweets Centro"]

# Total RTs...
rts_usuario_paro["Retweets Totales"] = rts_usuario_paro.sum(axis=1)

rts_usuario_paro.index = rts_usuario_paro.index.astype(int)

# We flag users with no RTs in any of the three labels as unclassified.
rts_usuario_paro["Sin Clasificar"] = (rts_usuario_paro["Retweets Totales"] == 0).astype('int32')
# sort_index returns a new DataFrame, so we assign the result back.
rts_usuario_paro = rts_usuario_paro.sort_index()
print('Vector Database size is: ',rts_usuario_paro.shape)
rts_usuario_paro.head()

Vector Database size is:  (33782, 5)


Unnamed: 0_level_0,Retweets Derecha,Retweets Izquierda,Retweets Centro,Retweets Totales,Sin Clasificar
Author ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
12996,2,386,126,514,0
777978,1,1,1,3,0
784125,0,71,5,76,0
1061601,0,258,7,265,0
1981631,0,41,4,45,0
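
The per-label counts above can also be built without custom lambdas. The following is a sketch of one alternative using `pd.crosstab`, not the notebook's method; the toy data is made up, and only the column names follow the notebook.

```python
import pandas as pd

# Toy labeled retweets; column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "Author ID": [1, 1, 1, 2, 2],
    "Party": ["Izquierda", "Izquierda", "Centro", "Derecha", "Sin Clasificar"],
})

labels = ["Derecha", "Izquierda", "Centro"]

# One row per user, one column per label; labels absent from the data become 0.
counts = (
    pd.crosstab(toy["Author ID"], toy["Party"])
    .reindex(columns=labels, fill_value=0)
)
counts.columns = [f"Retweets {label}" for label in labels]

# Total over the three labels only, as in the cell above.
counts["Retweets Totales"] = counts.sum(axis=1)
print(counts)
```

Retweets labeled `Sin Clasificar` are dropped by the `reindex` step, so a user whose only labeled retweets are unclassified ends up with a total of zero, mirroring the behavior of the lambda-based aggregation.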


In [14]:
# Now we determine the political affiliation by checking the index with the maximum.
rts_usuario_paro["Afiliacion"] = rts_usuario_paro[["Retweets Centro", 
                                         "Retweets Derecha", 
                                         "Retweets Izquierda", 
                                         "Sin Clasificar"]].idxmax(axis=1)

conditions = [
    (rts_usuario_paro['Afiliacion'] == 'Retweets Izquierda'),
    (rts_usuario_paro['Afiliacion'] == 'Retweets Derecha'),
    (rts_usuario_paro['Afiliacion'] == 'Retweets Centro'),
    (rts_usuario_paro['Afiliacion'] == 'Sin Clasificar')
]

choices = ['Izquierda', 'Derecha', 'Centro', 'Sin Clasificar']

rts_usuario_paro['Afiliacion'] = pd.Series(np.select(conditions, choices, default=''), index=rts_usuario_paro.index)

# We generate dummy variables for each political label...
rts_usuario_paro["Dummy Derecha"] = (rts_usuario_paro["Afiliacion"] == 'Derecha').astype('int32')
rts_usuario_paro["Dummy Izquierda"] = (rts_usuario_paro["Afiliacion"] == 'Izquierda').astype('int32')
rts_usuario_paro["Dummy Centro"] = (rts_usuario_paro["Afiliacion"] == 'Centro').astype('int32')
rts_usuario_paro["Sin Clasificar"] = (rts_usuario_paro["Afiliacion"] == 'Sin Clasificar').astype('int32')

# We see the sizes of our groups
print(rts_usuario_paro['Afiliacion'].value_counts())
rts_usuario_paro.head()

Afiliacion
Izquierda         23209
Derecha            7006
Centro             3562
Sin Clasificar        5
Name: count, dtype: int64


Unnamed: 0_level_0,Retweets Derecha,Retweets Izquierda,Retweets Centro,Retweets Totales,Sin Clasificar,Afiliacion,Dummy Derecha,Dummy Izquierda,Dummy Centro
Author ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
12996,2,386,126,514,0,Izquierda,0,1,0
777978,1,1,1,3,0,Centro,0,0,1
784125,0,71,5,76,0,Izquierda,0,1,0
1061601,0,258,7,265,0,Izquierda,0,1,0
1981631,0,41,4,45,0,Izquierda,0,1,0
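
In the preview above, user 777978 has one retweet in each category and is labeled Centro. This follows from how `idxmax` handles ties: it returns the first column that holds the row maximum, and `Retweets Centro` is listed first. The sketch below reproduces this with two rows copied from the preview; the `is_tie` flag is an illustrative addition and is not part of the notebook.

```python
import pandas as pd

# Two users taken from the preview: a clear majority and a three-way tie.
counts = pd.DataFrame(
    {
        "Retweets Centro": [126, 1],
        "Retweets Derecha": [2, 1],
        "Retweets Izquierda": [386, 1],
    },
    index=[12996, 777978],
)

# idxmax returns the first column holding the row maximum.
winner = counts.idxmax(axis=1)

# Flag rows where more than one column reaches the maximum.
is_tie = counts.eq(counts.max(axis=1), axis=0).sum(axis=1) > 1

print(winner)
print(is_tie)
```

A flag like this makes it possible to count how many affiliations depend on column order alone.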


In [15]:
# Finally, we create a dictionary which stores the affiliation for each user.
user_to_party_paro = {}

for index, row in rts_usuario_paro.iterrows():
    author_id = int(index)
    afiliacion = row['Afiliacion']
    
    # Adding the author ID and affiliation to the dictionary
    user_to_party_paro[author_id] = afiliacion

with open("/mnt/disk2/Data/Pickle/User_Dicts/user_to_party_paro.pkl", 'wb') as file:
    pickle.dump(user_to_party_paro,file)
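
The loop above builds the dictionary row by row. As a sketch of an equivalent shortcut, `Series.to_dict()` uses the index as keys and the column as values. The toy frame below is illustrative and assumes an integer index, as in the notebook.

```python
import pandas as pd

# Toy frame carrying the same kind of information as the notebook's table.
toy = pd.DataFrame(
    {"Afiliacion": ["Izquierda", "Centro"]},
    index=pd.Index([12996, 777978], name="Author ID"),
)

# Row-by-row construction, as in the cell above.
by_loop = {int(idx): row["Afiliacion"] for idx, row in toy.iterrows()}

# Equivalent shortcut: the index becomes the keys, the column the values.
by_series = toy["Afiliacion"].to_dict()

print(by_loop == by_series)
```

On tables with tens of thousands of rows, the shortcut avoids the per-row overhead of `iterrows`.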

We also export the retweet-count DataFrame to a pickle file for further use.

In [16]:
rts_usuario_paro.to_pickle('/mnt/disk2/Data/Pickle/User_Rts_Vector/rts_usuario_paro.pkl')

# CHECKPOINT: Tweets not related to the Paro Nacional

Load all the pickle files needed.

In [17]:
# We create an aux empty list to concatenate Tweets from January and October
aux = []

# We load January tweets
tweets_jan = pd.read_pickle('/mnt/disk2/Data/Tweets_DataFrames/tweets_jan21.gzip', compression='gzip')

# We load October tweets
tweets_oct = pd.read_pickle('/mnt/disk2/Data/Tweets_DataFrames/tweets_oct19.gzip', compression='gzip')

# Append both to the auxiliary list and concat them
aux.append(tweets_jan)
aux.append(tweets_oct)
tweets = pd.concat(aux)
print('October Shape: ', tweets_oct.shape)
print('January Shape: ', tweets_jan.shape)
print('Total Shape: ', tweets.shape)

October Shape:  (5424132, 25)
January Shape:  (5893802, 25)
Total Shape:  (11317934, 25)


In [18]:
# We load the map that relates an ID to a political Label
with open("/mnt/disk2/Data/Pickle/User_Dicts/mapa_full.pkl", "rb") as file:
    mapa = pickle.load(file)

In [19]:
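# Now we assign each RT a political label according to its influencer's label.
# Note: `retweets` is still the Paro Nacional DataFrame loaded in the previous
# checkpoint; the `tweets` DataFrame built above is not referenced in this cell.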
retweets["Party"] = retweets["Referenced Tweet Author ID"].map(mapa)

# We select all non-NA labeled RT.
retweets = retweets[retweets["Party"].notna()]
print(retweets["Party"].value_counts())
retweets.head()

Party
Izquierda         3902002
Derecha           1091764
Centro             479654
Sin Clasificar       5105
Name: count, dtype: int64


Unnamed: 0,Tweet ID,Author ID,Author Name,Referenced Tweet Author ID,Referenced Tweet Author Name,Date,Referenced Tweet,Party
20,1405513539427635200,788250746,Laura_Milena98,142092456,JulianRoman,2021/06/17 08:11:55,1.405305e+18,Izquierda
22,1405510167844765696,788250746,Laura_Milena98,237445795,subcantante,2021/06/17 07:58:31,1.405317e+18,Izquierda
27,1404183860339003392,788250746,Laura_Milena98,237445795,subcantante,2021/06/13 16:08:15,1.404126e+18,Izquierda
29,1403881554112360448,788250746,Laura_Milena98,939908874357870592,DonIzquierdo_,2021/06/12 20:07:00,1.403795e+18,Izquierda
34,1403527396733640704,788250746,Laura_Milena98,237445795,subcantante,2021/06/11 20:39:42,1.403517e+18,Izquierda
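
The preview above includes a `Date` column. A quick check along the lines of the following sketch can confirm which months a DataFrame covers before it is labeled. The toy rows and the `expected` set are assumptions made for illustration; only the date format is taken from the preview.

```python
import pandas as pd

# Toy rows using the date format shown in the preview; values are made up.
toy = pd.DataFrame({
    "Date": [
        "2021/01/05 10:00:00",
        "2019/10/20 18:30:00",
        "2021/06/17 08:11:55",
    ]
})

# Hypothetical set of expected (year, month) pairs for the non-Paro sample.
expected = {(2021, 1), (2019, 10)}

dates = pd.to_datetime(toy["Date"], format="%Y/%m/%d %H:%M:%S")
periods = list(zip(dates.dt.year, dates.dt.month))

# True for rows whose month falls outside the expected periods.
outside = [period not in expected for period in periods]
print(toy[outside])
```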


In [21]:
# We create lambda-functions that count the number of RTs for each political 
# label.
a = lambda x: np.sum(x == "Derecha")
b = lambda x: np.sum(x == "Izquierda")
c = lambda x: np.sum(x == "Centro")

# We count the RTs given per political label for each user using the lambda-functions.
rts_usuario_jan_oct = retweets.groupby("Author ID").agg({"Party": [a,b,c]})

rts_usuario_jan_oct.columns = ["Retweets Derecha", 
                       "Retweets Izquierda", 
                       "Retweets Centro"]

# Total RTs...
rts_usuario_jan_oct["Retweets Totales"] = rts_usuario_jan_oct.sum(axis=1)
# We generate dummy variables marking whether the user has at least one RT for each political label.
rts_usuario_jan_oct["Dummy Derecha"] = (rts_usuario_jan_oct["Retweets Derecha"] != 0).astype('int32')
rts_usuario_jan_oct["Dummy Izquierda"] = (rts_usuario_jan_oct["Retweets Izquierda"] != 0).astype('int32')
rts_usuario_jan_oct["Dummy Centro"] = (rts_usuario_jan_oct["Retweets Centro"] != 0).astype('int32')

# We flag users with no RTs in any of the three labels as unclassified.
rts_usuario_jan_oct["Sin Clasificar"] = (rts_usuario_jan_oct["Retweets Totales"] == 0).astype('int32')
print('Vector Database size is: ',rts_usuario_jan_oct.shape)
rts_usuario_jan_oct.head()

Vector Database size is:  (33782, 8)


Unnamed: 0_level_0,Retweets Derecha,Retweets Izquierda,Retweets Centro,Retweets Totales,Dummy Derecha,Dummy Izquierda,Dummy Centro,Sin Clasificar
Author ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
12996,2,386,126,514,1,1,1,0
777978,1,1,1,3,1,1,1,0
784125,0,71,5,76,0,1,1,0
1061601,0,258,7,265,0,1,1,0
1981631,0,41,4,45,0,1,1,0


In [22]:
# Now we determine the political affiliation by checking the index with the maximum.
rts_usuario_jan_oct["Afiliacion"] = rts_usuario_jan_oct[["Retweets Centro", 
                                         "Retweets Derecha", 
                                         "Retweets Izquierda", 
                                         "Sin Clasificar"]].idxmax(axis=1)

rts_usuario_jan_oct['Afiliacion'].value_counts()

Afiliacion
Retweets Izquierda    23209
Retweets Derecha       7006
Retweets Centro        3562
Sin Clasificar            5
Name: count, dtype: int64

In [23]:
# Finally, we create a dictionary which stores the affiliation for each user.
user_to_party_jan_oct = {}

for index, row in rts_usuario_jan_oct.iterrows():
    author_id = int(index)
    afiliacion = row['Afiliacion']
    
    # Adding the author ID and affiliation to the dictionary
    user_to_party_jan_oct[author_id] = afiliacion

with open("/mnt/disk2/Data/Pickle/user_to_party_jan_oct.pkl", 'wb') as file:
    pickle.dump(user_to_party_jan_oct,file)

In [24]:
rts_usuario_jan_oct.to_pickle('/mnt/disk2/Data/Pickle/User_Rts_Vector/rts_usuario_jan_oct.pkl')

## 3. Outputs

The outputs of this notebook are listed below:

- **user_to_party_paro**: A Python dictionary stored in a pickle file with the party affiliation of every user, based on their retweets during the Paro Nacional

- **user_to_party_jan_oct**: A Python dictionary stored in a pickle file with the party affiliation of every user, based on their retweets from January 2021 and October 2019

- **rts_usuario_paro**: DataFrame that contains the number of left-wing, right-wing, and center retweets for every user during the Paro Nacional

- **rts_usuario_jan_oct**: DataFrame that contains the number of left-wing, right-wing, and center retweets for every user from January 2021 and October 2019
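
As a sketch of how a downstream script might consume one of these dictionaries, the snippet below writes a small example dictionary to a pickle file and reads it back. The temporary path and the two entries are illustrative stand-ins for the notebook's files under `/mnt/disk2/`.

```python
import os
import pickle
import tempfile

# Illustrative dictionary; the real files hold one entry per user ID.
user_to_party = {12996: "Izquierda", 777978: "Centro"}

# A temporary path stands in for the notebook's storage locations.
path = os.path.join(tempfile.mkdtemp(), "user_to_party_example.pkl")

with open(path, "wb") as file:
    pickle.dump(user_to_party, file)

with open(path, "rb") as file:
    loaded = pickle.load(file)

# Look up a user, falling back to a default for IDs not in the dictionary.
print(loaded.get(12996, "Sin Clasificar"))
print(loaded.get(42, "Sin Clasificar"))
```

Using `get` with a default avoids a `KeyError` for users who never retweeted a labeled influencer.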
