# Political Labelling

This script determines the political affiliation (left, center, right) of each user in our sample by analyzing the retweets they have made.

We use a list of political influencers previously categorized as left, center, or right by La Silla Vacía, a Colombian news outlet. For each user, we tally the retweets they have made (excluding retweets with comments) of each influencer. From these tallies, we compute the total number of retweets associated with each political category.
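
The tallying step can be sketched on toy data as follows. The influencer IDs, user IDs, and retweet pairs are invented for illustration, and the notebook itself works on pandas DataFrames rather than plain lists, so this is only a minimal sketch of the idea.

```python
from collections import Counter, defaultdict

# Hypothetical influencer labels (influencer ID -> political category).
influencer_label = {101: "Izquierda", 102: "Derecha", 103: "Centro"}

# Hypothetical retweets as (user ID, retweeted author ID) pairs.
retweets = [(1, 101), (1, 101), (1, 102), (2, 103), (2, 999)]

# Tally, per user, how many retweets fall in each category.
tally = defaultdict(Counter)
for user_id, author_id in retweets:
    label = influencer_label.get(author_id)
    if label is not None:  # retweets of accounts outside the list are ignored
        tally[user_id][label] += 1

print(dict(tally[1]))  # {'Izquierda': 2, 'Derecha': 1}
print(dict(tally[2]))  # {'Centro': 1}
```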

This process is carried out first on tweets from the "Paro Nacional" period and then on tweets from outside this period. The notebook is organized in three sections:

1. Paro Nacional tweets
2. Tweets not related to the Paro Nacional
3. Outputs

In [28]:
import pickle
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
import scipy.sparse as sp
import os

## 1. Paro Nacional Tweets

Load all the pickle files we will need.

In [34]:
# We load the tweets_lite DataFrame for the analysis
tweets = pd.read_pickle('/mnt/disk2/Data/Tweets_DataFrames/tweets_lite.gzip', compression='gzip')

# We load the map that relates an ID to a political Label
with open("/mnt/disk2/Data/Pickle/User_Dicts/mapa_full.pkl", "rb") as file:
    mapa = pickle.load(file)

In [8]:
with open("/mnt/disk2/Data/Pickle/User_Dicts/mapa.pkl", "rb") as file:
    mapa_2 = pickle.load(file)

In [35]:
# Now we assign each RT a political label according to its influencer's label.
tweets.loc[tweets["Reference Type"] == "retweeted", "Party"] = tweets.loc[tweets["Reference Type"] == "retweeted",
                                                                         "Referenced Tweet Author ID"].map(mapa)

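# We select all non-NA labeled RT.
# Note: the result of this selection is not assigned, so `tweets` keeps all
# rows (labeled or not) in the steps below.
tweets[tweets["Party"].notna()]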
print(tweets["Party"].value_counts())
tweets.head()

Party
Izquierda         3973064
Derecha           1091764
Centro             570486
Sin Clasificar       5105
Name: count, dtype: int64


,Tweet ID,Author ID,Author Name,Referenced Tweet Author ID,Date,Reference Type,Referenced Tweet,Party
0,1.397298e+18,138377765.0,hmauriciojg,,2021/05/25 16:06:23,original tweet,,
1,1.394702e+18,138377765.0,hmauriciojg,,2021/05/18 12:08:44,original tweet,,
2,1.389576e+18,138377765.0,hmauriciojg,,2021/05/04 08:41:29,original tweet,,
3,1.389273e+18,138377765.0,hmauriciojg,,2021/05/03 12:35:56,original tweet,,
4,1.409909e+18,788250746.0,Laura_Milena98,,2021/06/29 11:16:36,original tweet,,
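
As a small illustration of the labelling step, `Series.map` with a dictionary leaves IDs that are missing from the dictionary as `NaN`, which is why only retweets of listed influencers end up with a `Party` value. The IDs and labels below are made up, and `toy_map` is a hypothetical stand-in for `mapa`.

```python
import pandas as pd

# Hypothetical mapping from influencer ID to label.
toy_map = {101: "Izquierda", 102: "Derecha"}

toy = pd.DataFrame({
    "Reference Type": ["retweeted", "retweeted", "original tweet"],
    "Referenced Tweet Author ID": [101.0, 555.0, float("nan")],
})

# Label only the retweets; every other row keeps NaN in "Party".
is_rt = toy["Reference Type"] == "retweeted"
toy.loc[is_rt, "Party"] = toy.loc[is_rt, "Referenced Tweet Author ID"].map(toy_map)

# Only the first row (a retweet of a listed influencer) is labelled.
print(toy["Party"].notna().tolist())  # [True, False, False]
```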


For every user in the community, we create a 3x1 vector of non-negative integers that records the number of RTs the user has made for each political affiliation.
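
For intuition, the same per-user count vector can be built on toy data with `pd.crosstab`. This is an alternative to the lambda-based aggregation used in this notebook, not the method the notebook applies, and the user IDs and labels are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "Author ID": [1, 1, 1, 2, 3],
    "Party": ["Izquierda", "Izquierda", "Derecha", "Centro", None],
})

labels = ["Derecha", "Izquierda", "Centro"]
all_users = sorted(toy["Author ID"].unique())

# Count labelled retweets per user and category. Users with no labelled
# retweets (user 3 here) are dropped by crosstab and restored with zeros
# by the reindex.
counts = pd.crosstab(toy["Author ID"], toy["Party"]).reindex(
    index=all_users, columns=labels, fill_value=0
)

print(counts.loc[1].tolist())  # [1, 2, 0]
print(counts.loc[3].tolist())  # [0, 0, 0]
```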

In [36]:
# We create lambda-functions that count the number of RTs for each political label.
a = lambda x: np.sum(x == "Derecha")
b = lambda x: np.sum(x == "Izquierda")
c = lambda x: np.sum(x == "Centro")

# We count the RTs given per political label for each user using the lambda-functions.
rts_usuario_paro = tweets.groupby("Author ID").agg({"Party": [a,b,c]})

rts_usuario_paro.columns = ["Retweets Derecha", 
                       "Retweets Izquierda", 
                       "Retweets Centro"]

# Total RTs...
rts_usuario_paro["Retweets Totales"] = rts_usuario_paro.sum(axis=1)

rts_usuario_paro.index = rts_usuario_paro.index.astype(int)

# We flag users with no labeled RTs as 'Sin Clasificar'.
rts_usuario_paro["Sin Clasificar"] = (rts_usuario_paro["Retweets Totales"] == 0).astype('int32')
# sort_index returns a new DataFrame, so we assign the result back.
rts_usuario_paro = rts_usuario_paro.sort_index()
print('Vector Database size is: ',rts_usuario_paro.shape)
rts_usuario_paro.head()

Vector Database size is:  (37237, 5)


Author ID,Retweets Derecha,Retweets Izquierda,Retweets Centro,Retweets Totales,Sin Clasificar
0,0,0,0,0,1
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,0,1
4,0,0,0,0,1


In [37]:
# Now we determine the political affiliation by checking the index with the maximum.
rts_usuario_paro["Afiliacion"] = rts_usuario_paro[["Retweets Centro", 
                                         "Retweets Derecha", 
                                         "Retweets Izquierda", 
                                         "Sin Clasificar"]].idxmax(axis=1)

conditions = [
    (rts_usuario_paro['Afiliacion'] == 'Retweets Izquierda'),
    (rts_usuario_paro['Afiliacion'] == 'Retweets Derecha'),
    (rts_usuario_paro['Afiliacion'] == 'Retweets Centro'),
    (rts_usuario_paro['Afiliacion'] == 'Sin Clasificar')
]

choices = ['Izquierda', 'Derecha', 'Centro', 'Sin Clasificar']

rts_usuario_paro['Afiliacion'] = pd.Series(np.select(conditions, choices, default=''), index=rts_usuario_paro.index)

# We generate dummy variables for each political label...
rts_usuario_paro["Dummy Derecha"] = (rts_usuario_paro["Afiliacion"] == 'Derecha').astype('int32')
rts_usuario_paro["Dummy Izquierda"] = (rts_usuario_paro["Afiliacion"] == 'Izquierda').astype('int32')
rts_usuario_paro["Dummy Centro"] = (rts_usuario_paro["Afiliacion"] == 'Centro').astype('int32')
rts_usuario_paro["Sin Clasificar"] = (rts_usuario_paro["Afiliacion"] == 'Sin Clasificar').astype('int32')

# We see the sizes of our groups
print(rts_usuario_paro['Afiliacion'].value_counts())
rts_usuario_paro.head()

Afiliacion
Izquierda         23139
Derecha            6981
Centro             3743
Sin Clasificar     3374
Name: count, dtype: int64


Author ID,Retweets Derecha,Retweets Izquierda,Retweets Centro,Retweets Totales,Sin Clasificar,Afiliacion,Dummy Derecha,Dummy Izquierda,Dummy Centro
0,0,0,0,0,1,Sin Clasificar,0,0,0
1,0,0,0,0,1,Sin Clasificar,0,0,0
2,0,0,0,0,1,Sin Clasificar,0,0,0
3,0,0,0,0,1,Sin Clasificar,0,0,0
4,0,0,0,0,1,Sin Clasificar,0,0,0
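
One property of `idxmax` is worth keeping in mind when reading these counts: when two categories tie, it returns the first one in column order. With the column ordering passed to `idxmax` in this cell, ties would therefore resolve to Centro first, then Derecha, then Izquierda. The toy rows below are invented to show this behaviour.

```python
import pandas as pd

cols = ["Retweets Centro", "Retweets Derecha", "Retweets Izquierda", "Sin Clasificar"]
toy = pd.DataFrame(
    [
        [2, 2, 1, 0],  # tie between Centro and Derecha
        [0, 1, 1, 0],  # tie between Derecha and Izquierda
        [0, 0, 0, 1],  # no labelled retweets
    ],
    columns=cols,
)

# idxmax returns the first column that holds the row maximum.
winners = toy.idxmax(axis=1).tolist()
print(winners)
# ['Retweets Centro', 'Retweets Derecha', 'Sin Clasificar']
```

If ties should be treated differently (for example, left unclassified), they would need to be detected explicitly before calling `idxmax`.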


In [33]:
# Finally, we create a dictionary which stores the affiliation for each user.
user_to_party_paro = {}

for index, row in rts_usuario_paro.iterrows():
    author_id = int(index)
    afiliacion = row['Afiliacion']
    
    # Adding the author ID and affiliation to the dictionary
    user_to_party_paro[author_id] = afiliacion

with open("/mnt/disk2/Data/Pickle/User_Dicts/user_to_party_paro.pkl", 'wb') as file:
    pickle.dump(user_to_party_paro,file)

We also export the retweet-count DataFrame to a pickle file for further use.

In [15]:
rts_usuario_paro.to_pickle('/mnt/disk2/Data/Pickle/User_Rts_Vector/rts_usuario_paro.pkl')
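
Downstream scripts can read the exported dictionary back with `pickle.load`. The sketch below round-trips a toy dictionary through a temporary file rather than the project's data directory; the file name and the contents of `toy_user_to_party` are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical affiliation dictionary (user ID -> label).
toy_user_to_party = {1: "Izquierda", 2: "Derecha", 3: "Sin Clasificar"}

# Write to a temporary location instead of the real data directory.
path = os.path.join(tempfile.mkdtemp(), "toy_user_to_party.pkl")
with open(path, "wb") as file:
    pickle.dump(toy_user_to_party, file)

# Read it back, as a downstream script would.
with open(path, "rb") as file:
    loaded = pickle.load(file)

print(loaded == toy_user_to_party)  # True
# Users missing from the dictionary can be given a default label on lookup.
print(loaded.get(99, "Sin Clasificar"))  # Sin Clasificar
```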

## 2. Tweets not related to the Paro Nacional

Load all the pickle files needed.

In [16]:
# We create an empty auxiliary list to concatenate the tweets from January and October
aux = []

In [17]:
# We load January tweets
tweets_jan = pd.read_pickle('/mnt/disk2/Data/Tweets_DataFrames/tweets_jan21.gzip', compression='gzip')

# We load October tweets
tweets_oct = pd.read_pickle('/mnt/disk2/Data/Tweets_DataFrames/tweets_oct19.gzip', compression='gzip')

# Append both to the auxiliary list and concat them
aux.append(tweets_jan)
aux.append(tweets_oct)
tweets = pd.concat(aux)
print('October Shape: ', tweets_oct.shape)
print('January Shape: ', tweets_jan.shape)
print('Total Shape: ', tweets.shape)


October Shape:  (5424132, 25)
January Shape:  (5893802, 25)
Total Shape:  (11317934, 25)


In [20]:
# We load the map that relates an ID to a political Label
with open("/mnt/disk2/Data/Pickle/User_Dicts/mapa_full.pkl", "rb") as file:
    mapa = pickle.load(file)

In [21]:
# Now we assign each RT a political label according to its influencer's label.
tweets.loc[tweets["Reference Type"] == "retweeted", "Party"] = tweets.loc[tweets["Reference Type"] == "retweeted",
                                                                         "Referenced Tweet Author ID"].map(mapa)

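# We select all non-NA labeled RT.
# Note: the result of this selection is not assigned, so `tweets` keeps all
# rows (labeled or not) in the steps below.
tweets[tweets["Party"].notna()]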

tweets["Party"].value_counts()

Party
Izquierda         611580
Derecha           236466
Centro            148150
Sin Clasificar      1177
Name: count, dtype: int64

In [22]:
# We create lambda-functions that count the number of RTs for each political 
# label.
a = lambda x: np.sum(x == "Derecha")
b = lambda x: np.sum(x == "Izquierda")
c = lambda x: np.sum(x == "Centro")

# We count the RTs given per political label for each user using the lambda-functions.
rts_usuario_jan_oct = tweets.groupby("Author ID").agg({"Party": [a,b,c]})

rts_usuario_jan_oct.columns = ["Retweets Derecha", 
                       "Retweets Izquierda", 
                       "Retweets Centro"]

# Total RTs...
rts_usuario_jan_oct["Retweets Totales"] = rts_usuario_jan_oct.sum(axis=1)
# We generate dummy variables that flag whether the user has at least one RT for each political label.
rts_usuario_jan_oct["Dummy Derecha"] = (rts_usuario_jan_oct["Retweets Derecha"] != 0).astype('int32')
rts_usuario_jan_oct["Dummy Izquierda"] = (rts_usuario_jan_oct["Retweets Izquierda"] != 0).astype('int32')
rts_usuario_jan_oct["Dummy Centro"] = (rts_usuario_jan_oct["Retweets Centro"] != 0).astype('int32')

# We flag users with no labeled RTs as 'Sin Clasificar'.
rts_usuario_jan_oct["Sin Clasificar"] = (rts_usuario_jan_oct["Retweets Totales"] == 0).astype('int32')
print('Vector Database size is: ',rts_usuario_jan_oct.shape)
rts_usuario_jan_oct.head()

Vector Database size is:  (34901, 8)


Author ID,Retweets Derecha,Retweets Izquierda,Retweets Centro,Retweets Totales,Dummy Derecha,Dummy Izquierda,Dummy Centro,Sin Clasificar
12996,0,10,8,18,0,1,1,0
777978,0,0,0,0,0,0,0,1
784125,0,35,4,39,0,1,1,0
1061601,0,16,1,17,0,1,1,0
1488031,0,0,0,0,0,0,0,1


In [23]:
# Now we determine the political affiliation by checking the column with the maximum.
# Note: the resulting labels are the column names themselves (e.g. 'Retweets Izquierda').
rts_usuario_jan_oct["Afiliacion"] = rts_usuario_jan_oct[["Retweets Centro", 
                                         "Retweets Derecha", 
                                         "Retweets Izquierda", 
                                         "Sin Clasificar"]].idxmax(axis=1)

rts_usuario_jan_oct['Afiliacion'].value_counts()

Afiliacion
Retweets Izquierda    14102
Sin Clasificar         9841
Retweets Centro        5564
Retweets Derecha       5394
Name: count, dtype: int64
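
Unlike Section 1, the affiliation labels here are still the column names (`Retweets Izquierda`, and so on). If the two periods need to be compared label by label, one option is to normalise them with a plain dictionary lookup. The sketch below uses invented user IDs, and `rename` is a hypothetical helper mapping, not part of the notebook.

```python
# Hypothetical affiliation dictionary with column-style labels.
raw = {1: "Retweets Izquierda", 2: "Sin Clasificar", 3: "Retweets Centro"}

# Map the column names to the plain labels used in Section 1.
rename = {
    "Retweets Izquierda": "Izquierda",
    "Retweets Derecha": "Derecha",
    "Retweets Centro": "Centro",
    "Sin Clasificar": "Sin Clasificar",
}

clean = {user_id: rename[label] for user_id, label in raw.items()}
print(clean)  # {1: 'Izquierda', 2: 'Sin Clasificar', 3: 'Centro'}
```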

In [24]:
# Finally, we create a dictionary which stores the affiliation for each user.
user_to_party_jan_oct = {}

for index, row in rts_usuario_jan_oct.iterrows():
    author_id = int(index)
    afiliacion = row['Afiliacion']
    
    # Adding the author ID and affiliation to the dictionary
    user_to_party_jan_oct[author_id] = afiliacion

with open("/mnt/disk2/Data/Pickle/user_to_party_jan_oct.pkl", 'wb') as file:
    pickle.dump(user_to_party_jan_oct,file)

In [26]:
rts_usuario_jan_oct.to_pickle('/mnt/disk2/Data/Pickle/User_Rts_Vector/rts_usuario_jan_oct.pkl')

## 3. Outputs

The outputs of this notebook are listed below:

- **user_to_party_paro**: A Python dictionary, stored in a pickle file, with the party affiliation of every user based on their retweets during the Paro Nacional

- **user_to_party_jan_oct**: A Python dictionary, stored in a pickle file, with the party affiliation of every user based on their retweets from January 2021 and October 2019

- **rts_usuario_paro**: DataFrame that contains the number of left-wing, right-wing, and center retweets for every user during the Paro Nacional

- **rts_usuario_jan_oct**: DataFrame that contains the number of left-wing, right-wing, and center retweets for every user in January 2021 and October 2019
