# Political Labelling

This script determines the political affiliation (left, center, right) of each user in our sample by analyzing the retweets they've made.

We use a list of political influencers that have been previously categorized as left, center, or right by La Silla Vacia, a Colombian news outlet. For each user, we tally the number of retweets (excluding those with comments) corresponding to each influencer. From this, we calculate the total number of retweets associated with each political category.
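The labeling step described above can be sketched on toy data. This is only an illustration: `toy_mapa`, `toy_tweets`, and all IDs are made up, and the `.where(is_rt)` form is a compact variant of the masked assignment used later in the notebook, not the notebook's own code.

```python
import pandas as pd

# Hypothetical influencer-to-label map (the real one is loaded from mapa.pkl).
toy_mapa = {101: "Izquierda", 102: "Derecha", 103: "Centro"}

# Hypothetical tweets: user 1 retweets and quotes, user 2 retweets twice, user 3 quotes.
toy_tweets = pd.DataFrame({
    "Author ID": [1, 1, 2, 2, 3],
    "Reference Type": ["retweeted", "quoted", "retweeted", "retweeted", "quoted"],
    "Referenced Tweet Author ID": [101, 102, 103, 999, 101],
})

# Only plain retweets receive a label; quotes and retweets of
# authors missing from the map end up as NaN.
is_rt = toy_tweets["Reference Type"] == "retweeted"
toy_tweets["Party"] = toy_tweets["Referenced Tweet Author ID"].map(toy_mapa).where(is_rt)

print(toy_tweets["Party"].value_counts())
```

Only two of the five toy rows end up labeled: the quoted tweets are ignored even when the quoted author is an influencer, and the retweet of the unmapped author `999` stays unlabeled.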

This process is run twice: once for the tweets posted during the "Paro Nacional" and once for the tweets that aren't from this period. The notebook is organized into 3 chapters.

1. Paro Nacional tweets
2. Tweets that aren't from the Paro Nacional
3. Conclusion

In [1]:
import pickle
import pandas as pd
import numpy as np

## 1. Paro Nacional Tweets

Load all the pickle files we will need

In [4]:
# Run time: approx. 1 minute.

# We load the tweets_lite DataFrame for the analysis
tweets = pd.read_pickle('/mnt/disk2/Data/Tweets_DataFrames/tweets_lite.pkl')

# We load the map that relates an ID to a political Label
with open("/mnt/disk2/Data/Pickle/mapa.pkl", "rb") as file:
    mapa = pickle.load(file)

In [5]:
# Now we assign each RT a political label according to its influencer's label.
tweets.loc[tweets["Reference Type"] == "retweeted", "Party"] = tweets.loc[tweets["Reference Type"] == "retweeted",
                                                                         "Referenced Tweet Author ID"].map(mapa)

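# We select all non-NA labeled RTs. The result is not assigned, so `tweets` itself
# is unchanged and users without labeled RTs are kept for the per-user counts.
tweets[tweets["Party"].notna()]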

tweets["Party"].value_counts()

Party
Izquierda    3443228
Derecha       814456
Centro        430559
Name: count, dtype: int64

For every user in the sample, we create a vector of three non-negative integers that records how many of the user's RTs fall under each political affiliation.

In [27]:
# We create lambda-functions that count the number of RTs for each political 
# label.
a = lambda x: np.sum(x == "Derecha")
b = lambda x: np.sum(x == "Izquierda")
c = lambda x: np.sum(x == "Centro")

# We count the RTs given per political label for each user using the lambda-functions.
rts_usuario_paro = tweets.groupby("Author ID").agg({"Party": [a,b,c]})

rts_usuario_paro.columns = ["Retweets Derecha", 
                       "Retweets Izquierda", 
                       "Retweets Centro"]

# Total RTs...
rts_usuario_paro["Retweets Totales"] = rts_usuario_paro.sum(axis=1)
# We generate dummy variables for each political label...
rts_usuario_paro["Dummy Derecha"] = (rts_usuario_paro["Retweets Derecha"] != 0).astype('int32')
rts_usuario_paro["Dummy Izquierda"] = (rts_usuario_paro["Retweets Izquierda"] != 0).astype('int32')
rts_usuario_paro["Dummy Centro"] = (rts_usuario_paro["Retweets Centro"] != 0).astype('int32')

# We flag the users that have no labeled RTs at all.
rts_usuario_paro["No Retweets"] = (rts_usuario_paro["Retweets Totales"] == 0).astype('int32')
print('Vector Database size is: ',rts_usuario_paro.shape)
rts_usuario_paro.head()

Vector Database size is:  (34901, 8)


| Author ID | Retweets Derecha | Retweets Izquierda | Retweets Centro | Retweets Totales | Dummy Derecha | Dummy Izquierda | Dummy Centro | No Retweets |
|---|---|---|---|---|---|---|---|---|
| 12996 | 0 | 10 | 7 | 17 | 0 | 1 | 1 | 0 |
| 777978 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 784125 | 0 | 30 | 3 | 33 | 0 | 1 | 1 | 0 |
| 1061601 | 0 | 15 | 1 | 16 | 0 | 1 | 1 | 0 |
| 1488031 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
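As a side note, the same per-user counts could also be obtained without lambda functions. The sketch below is an alternative on toy data, not the code used in this notebook; `toy` and its values are hypothetical. It groups by user and label, unstacks the labels into columns, and re-adds the users whose RTs are all unlabeled, since `groupby` drops missing keys by default.

```python
import pandas as pd

# Hypothetical labeled tweets: None marks a tweet without a political label.
toy = pd.DataFrame({
    "Author ID": [1, 1, 1, 2, 3, 3],
    "Party": ["Izquierda", "Izquierda", "Centro", None, "Derecha", None],
})

labels = ["Derecha", "Izquierda", "Centro"]

counts = (
    toy.groupby(["Author ID", "Party"]).size()   # rows with a missing Party are dropped here
    .unstack(fill_value=0)                        # one column per label
    .reindex(index=sorted(toy["Author ID"].unique()),  # bring back users with no labeled RTs
             columns=labels, fill_value=0)
)
counts.columns = ["Retweets " + label for label in labels]

counts["Retweets Totales"] = counts.sum(axis=1)
counts["No Retweets"] = (counts["Retweets Totales"] == 0).astype("int32")
print(counts)
```

User 2 only has an unlabeled tweet, so it comes back through `reindex` with zero counts and `No Retweets` equal to 1.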


In [7]:
# Now we determine the political affiliation by taking the column with the largest value.
rts_usuario_paro["Afiliacion"] = rts_usuario_paro[["Retweets Centro", 
                                         "Retweets Derecha", 
                                         "Retweets Izquierda", 
                                         "No Retweets"]].idxmax(axis=1)

rts_usuario_paro['Afiliacion'].value_counts()

Afiliacion
Retweets Izquierda    23138
Retweets Derecha       6812
No Retweets            3815
Retweets Centro        3543
Name: count, dtype: int64
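One detail of this step is how ties are resolved. When several columns share the maximum, `idxmax` returns the first of them in column order, so with the ordering used above a tie involving the center is labeled "Retweets Centro". The toy frame below (hypothetical values) illustrates this and shows one possible way to flag tied users.

```python
import pandas as pd

# Hypothetical counts: user 1 is tied between Centro and Izquierda.
toy_counts = pd.DataFrame(
    {
        "Retweets Centro": [2, 0, 0],
        "Retweets Derecha": [0, 5, 0],
        "Retweets Izquierda": [2, 1, 0],
        "No Retweets": [0, 0, 1],
    },
    index=pd.Index([1, 2, 3], name="Author ID"),
)

# First column holding the row maximum wins.
toy_afiliacion = toy_counts.idxmax(axis=1)

# A user is tied when more than one column equals the row maximum.
is_tie = toy_counts.eq(toy_counts.max(axis=1), axis=0).sum(axis=1) > 1

print(toy_afiliacion)
print(is_tie)
```

Whether tied users should keep the first label, be dropped, or get a separate category is a modelling decision that this sketch does not settle.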

In [9]:
# Finally, we create a dictionary which stores the affiliation for each user.
user_to_party = {}

for index, row in rts_usuario_paro.iterrows():
    author_id = int(index)
    afiliacion = row['Afiliacion']
    
    # Adding the author ID and affiliation to the dictionary
    user_to_party[author_id] = afiliacion

with open("/mnt/disk2/Data/Pickle/user_to_party.pkl", 'wb') as file:
    pickle.dump(user_to_party,file)

We export the DataFrame to a pickle file for further use

In [10]:
rts_usuario_paro.to_pickle('/mnt/disk2/Data/Pickle/rts_usuario_paro.pkl')

## 2. Tweets That Aren't from the Paro Nacional

Load all the pickle files needed

In [19]:
# We create an aux empty list to concatenate Tweets from January and October
aux = []

In [20]:
# We load January tweets
tweets_jan = pd.read_pickle('/mnt/disk2/Data/Tweets_DataFrames/tweets_jan21.gzip', compression='gzip')

# We load October tweets
tweets_oct = pd.read_pickle('/mnt/disk2/Data/Tweets_DataFrames/tweets_oct19.gzip', compression='gzip')

# Append both to the auxiliary list and concat them
aux.append(tweets_jan)
aux.append(tweets_oct)
tweets = pd.concat(aux)
print('October Shape: ', tweets_oct.shape)
print('January Shape: ', tweets_jan.shape)
print('Total Shape: ', tweets.shape)


October Shape:  (5424132, 25)
January Shape:  (5893802, 25)
Total Shape:  (11317934, 25)
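A small note on the concatenation: `pd.concat` keeps the original row labels of each frame unless `ignore_index=True` is passed. Whether the January and October frames share row labels is not stated here, so the sketch below only illustrates the behaviour on hypothetical toy frames.

```python
import pandas as pd

# Hypothetical stand-ins for the January and October frames.
toy_jan = pd.DataFrame({"Author ID": [1, 2]})
toy_oct = pd.DataFrame({"Author ID": [3, 4]})

# Default: each frame keeps its own labels, so 0 and 1 appear twice.
stacked = pd.concat([toy_jan, toy_oct])

# With ignore_index=True the rows are renumbered from 0.
renumbered = pd.concat([toy_jan, toy_oct], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

The per-user counts group by "Author ID", so they do not depend on the row labels; a unique index mainly matters for later label-based selection with `.loc`.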


In [21]:
# We load the map that relates an ID to a political Label
with open("/mnt/disk2/Data/Pickle/mapa.pkl", "rb") as file:
    mapa = pickle.load(file)

In [22]:
# Now we assign each RT a political label according to its influencer's label.
tweets.loc[tweets["Reference Type"] == "retweeted", "Party"] = tweets.loc[tweets["Reference Type"] == "retweeted",
                                                                         "Referenced Tweet Author ID"].map(mapa)

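# We select all non-NA labeled RTs. The result is not assigned, so `tweets` itself
# is unchanged and users without labeled RTs are kept for the per-user counts.
tweets[tweets["Party"].notna()]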

tweets["Party"].value_counts()

Party
Izquierda    548466
Derecha      173499
Centro       124886
Name: count, dtype: int64

In [29]:
# We create lambda-functions that count the number of RTs for each political 
# label.
a = lambda x: np.sum(x == "Derecha")
b = lambda x: np.sum(x == "Izquierda")
c = lambda x: np.sum(x == "Centro")

# We count the RTs given per political label for each user using the lambda-functions.
rts_usuario_jan_oct = tweets.groupby("Author ID").agg({"Party": [a,b,c]})

rts_usuario_jan_oct.columns = ["Retweets Derecha", 
                       "Retweets Izquierda", 
                       "Retweets Centro"]

# Total RTs...
rts_usuario_jan_oct["Retweets Totales"] = rts_usuario_jan_oct.sum(axis=1)
# We generate dummy variables for each political label...
rts_usuario_jan_oct["Dummy Derecha"] = (rts_usuario_jan_oct["Retweets Derecha"] != 0).astype('int32')
rts_usuario_jan_oct["Dummy Izquierda"] = (rts_usuario_jan_oct["Retweets Izquierda"] != 0).astype('int32')
rts_usuario_jan_oct["Dummy Centro"] = (rts_usuario_jan_oct["Retweets Centro"] != 0).astype('int32')

# We flag the users that have no labeled RTs at all.
rts_usuario_jan_oct["No Retweets"] = (rts_usuario_jan_oct["Retweets Totales"] == 0).astype('int32')
print('Vector Database size is: ',rts_usuario_jan_oct.shape)
rts_usuario_jan_oct.head()

Vector Database size is:  (34901, 8)


| Author ID | Retweets Derecha | Retweets Izquierda | Retweets Centro | Retweets Totales | Dummy Derecha | Dummy Izquierda | Dummy Centro | No Retweets |
|---|---|---|---|---|---|---|---|---|
| 12996 | 0 | 10 | 7 | 17 | 0 | 1 | 1 | 0 |
| 777978 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 784125 | 0 | 30 | 3 | 33 | 0 | 1 | 1 | 0 |
| 1061601 | 0 | 15 | 1 | 16 | 0 | 1 | 1 | 0 |
| 1488031 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |


In [33]:
# Now we determine the political affiliation by taking the column with the largest value.
rts_usuario_jan_oct["Afiliacion"] = rts_usuario_jan_oct[["Retweets Centro", 
                                         "Retweets Derecha", 
                                         "Retweets Izquierda", 
                                         "No Retweets"]].idxmax(axis=1)

rts_usuario_jan_oct['Afiliacion'].value_counts()

Afiliacion
Retweets Izquierda    13983
No Retweets           10378
Retweets Centro        5337
Retweets Derecha       5203
Name: count, dtype: int64

In [34]:
# Finally, we create a dictionary which stores the affiliation for each user.
user_to_party_jan_oct = {}

for index, row in rts_usuario_jan_oct.iterrows():
    author_id = int(index)
    afiliacion = row['Afiliacion']
    
    # Adding the author ID and affiliation to the dictionary
    user_to_party_jan_oct[author_id] = afiliacion

with open("/mnt/disk2/Data/Pickle/user_to_party_jan_oct.pkl", 'wb') as file:
    pickle.dump(user_to_party_jan_oct,file)

In [35]:
rts_usuario_jan_oct.to_pickle('/mnt/disk2/Data/Pickle/rts_usuario_jan_oct.pkl')

## 3. Conclusion

The outputs of this notebook are listed below:

- **user_to_party**: A Python dictionary stored in a pickle file with the political affiliation of every user, based on their retweets during the Paro Nacional

- **user_to_party_jan_oct**: A Python dictionary stored in a pickle file with the political affiliation of every user, based on their retweets from January 2021 and October 2019

- **rts_usuario_paro**: DataFrame that contains the number of left-wing, right-wing, and center retweets for every user during the Paro Nacional

- **rts_usuario_jan_oct**: DataFrame that contains the number of left-wing, right-wing, and center retweets for every user in January 2021 and October 2019
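As a usage sketch, the two dictionaries can be lined up per user once they are loaded back with `pickle.load`. The example below uses small hypothetical dictionaries (`toy_paro`, `toy_jan_oct`) in place of the real files, and the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for user_to_party and user_to_party_jan_oct.
toy_paro = {1: "Retweets Izquierda", 2: "No Retweets", 3: "Retweets Derecha"}
toy_jan_oct = {1: "Retweets Izquierda", 2: "Retweets Centro", 3: "No Retweets"}

# One row per user, one column per period (aligned on the user ID).
labels = pd.DataFrame({
    "Paro": pd.Series(toy_paro),
    "Jan/Oct": pd.Series(toy_jan_oct),
})
labels.index.name = "Author ID"

# True when a user receives the same label in both periods.
labels["Same label"] = labels["Paro"] == labels["Jan/Oct"]
print(labels)
```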
