# 13 Cross-Countries Retweets - Save Data
In this notebook we analyse the cross-border retweets, i.e., users from one country retweeting users from another country.

We do it in two ways:
- counting the absolute number of retweets between each pair of countries
- normalizing the adjacency matrix of these between-country retweets against a random baseline, i.e., comparing the observed retweets with those expected if every node kept the same strength but edges were placed at random
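
The second approach can be sketched as follows. This is a minimal illustration, not necessarily the exact null model used later in the analysis: it assumes the expected weight of the edge from country i to country j is `out_strength[i] * in_strength[j] / total`, a common strength-preserving baseline. The function name `normalize_by_random_baseline` and the toy matrix are hypothetical.

```python
import numpy as np

def normalize_by_random_baseline(adjacency):
    """Return (observed / expected, expected) under a strength-preserving baseline.

    Assumption: expected[i, j] = out_strength[i] * in_strength[j] / total.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    out_strength = adjacency.sum(axis=1)  # retweets made by each country
    in_strength = adjacency.sum(axis=0)   # retweets received by each country
    total = adjacency.sum()
    expected = np.outer(out_strength, in_strength) / total
    # leave the ratio at 0 where the expected weight is 0, to avoid dividing by zero
    ratio = np.divide(adjacency, expected,
                      out=np.zeros_like(adjacency), where=expected > 0)
    return ratio, expected

# toy 2x2 matrix: rows are retweeting countries, columns are retweeted countries
toy_adjacency = np.array([[0, 3],
                          [1, 0]])
toy_ratio, toy_expected = normalize_by_random_baseline(toy_adjacency)
```

Values above 1 in `toy_ratio` mark pairs of countries that retweet each other more than the baseline predicts. Note that this simple baseline also assigns expected weight to the diagonal, which is empty by construction in a cross-country matrix, so a stricter null model may be preferable.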

In [87]:
import pandas as pd
import numpy as np
from glob import glob
from matplotlib.colors import LogNorm, ListedColormap
import matplotlib.pyplot as plt
import seaborn as sns

In [17]:
#data on edges and users are stored in this folder
folder_DATA = "/data/public/jlenti/multilang-vax/DATA_clean_url"
folder = "/data/public/jlenti/multilang-vax/EuropeAmerica_RTCO"

#list of all countries (size ordered)
countries = ["US", "BR", "AR", "GB", "ES", "MX", "FR", "CA", "TR", "VE", "AU", "CO", "IT", "CL", "DE",
             "PT", "IE", "PY", "EC", "RU", "UY", "NZ", "PL", "NL", "PE", "CU", "PA", "GR"]
#sorted by language
lang_sort = ["US", "IE", "GB", "CA", "NZ", "AU", "FR", "IT", "PL", "NL", "DE", "RU", "TR", 
             "BR", "PT", "GR", "AR", "ES", "MX","VE", "CO", "CL",
             "PY", "EC", "UY", "PE", "CU", "PA"]
#named periods
periods = {"period1": ["201910","201911","201912"],
           "period2": ["202007","202008","202009"], 
           "period3": ["202010","202011","202012"], 
           "period4": ["202101","202102","202103"]}
periods_names = ["pre-COVID", "pre-vax", "vax development", "vax rollout"]

In [6]:
#these users are filtered out because they are incorrectly geolocated (users with more than one geolocation
#in the observation period, or users with abnormal interactions towards another country)
filtered_users = pd.concat([pd.read_csv("/data/public/jlenti/multilang-vax/Geolocation_Mismatches/more_countries_users_RT.csv"),
                            pd.read_csv("/data/public/jlenti/multilang-vax/Geolocation_Mismatches/misgeo_popular_user_countries_pairs.csv")])["user"].tolist()

## Cross-Countries Retweets
From folder_DATA, extract the retweets between users from different (non-null) countries.
Save one dataframe per period, with columns [user, country, user_RT, country_RT, lang], in folder (the EuropeAmerica_RTCO directory).

In [15]:
cross_RT = {}
for period in periods:
    print(period)
    cross_RT[period] = pd.concat([pd.read_csv(file, lineterminator = "\n", sep = "\t", low_memory = False,
                                              quoting = False, escapechar = None)[["user_screen_name", "user_country_code", "RT_user_screen_name",
                                                                                   "RT_user_country_code", "lang"]].dropna()
                                  .rename(columns = {"user_country_code": "country",
                                                     "user_screen_name": "user",
                                                     "RT_user_country_code": "country_RT",
                                                     "RT_user_screen_name": "user_RT"})
                                  #keep all the RT between different countries
                                  .query("(country != country_RT)&(country != ' ')&(country_RT != ' ')") \
                                  .query("(user not in @filtered_users)&(user_RT not in @filtered_users)")
                                  for month in periods[period]
                                  for file in sorted(glob(folder_DATA + "/*/{0}*".format(month)))])
    #cross_RT[period].to_csv("/".join([folder, period, "_".join([period, "crossedges_RT.csv.gz"])]),
    #               compression = "gzip", index = False)

period1
period2
period3
period4


In [16]:
cross_RT = {period: pd.read_csv(sorted(glob("/".join([folder, period, "*cross*RT*"])))[0])
            .query("(country in @countries)&(country_RT in @countries)") for period in periods}

In [18]:
cross_RT["period1"].head()

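|    | user | country | user_RT | country_RT | lang |
|----|------|---------|---------|------------|------|
| 13 | Longanlon | FR | Abe_Angele | US | bg |
| 32 | SteeliestLlama | GB | CoffeeShopRabbi | US | da |
| 33 | IntJewCon | GB | CoffeeShopRabbi | US | da |
| 34 | TomVargheseJr | US | jennybencardino | CO | da |
| 35 | DrIanWeissman | US | jennybencardino | CO | da |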


## Internal Retweets

To compare within-country and between-country interactions, I count the number of retweets between users in the same country, for each country and each period (aggregating all languages).
To do this, I read the data in folder_DATA, keep the retweets between users in the same country, count them per country for each day, and sum all the resulting pandas Series in the same period.
This yields, for each period, a pandas Series indexed by country whose values are the number of retweets within that country.
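
The count-then-sum step can be illustrated on toy data. This is a sketch under assumptions: the two small dataframes `day1` and `day2` are hypothetical stand-ins for the daily files, already renamed to the `country` / `country_RT` columns.

```python
import pandas as pd

# hypothetical stand-ins for two daily files
day1 = pd.DataFrame({"country":    ["US", "US", "BR", "FR"],
                     "country_RT": ["US", "US", "BR", "US"]})
day2 = pd.DataFrame({"country":    ["US", "AR"],
                     "country_RT": ["US", "AR"]})

# one Series of within-country counts per file
per_file_counts = [df.query("country == country_RT")["country"].value_counts()
                   for df in (day1, day2)]

# align the Series on the country index and sum across files;
# countries missing from a file contribute NaN, which sum() skips
period_total = pd.concat(per_file_counts, axis=1).sum(axis=1)
```

Because countries absent from some files introduce NaN during alignment, the summed values come out as floats, which is consistent with the `.0` suffixes in the table of internal retweets.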

In [15]:
internal_RT_dict = {}
for period in periods:
    print(period)
    internal_RT_dict[period] = pd.concat([pd.read_csv(file, lineterminator = "\n", sep = "\t", low_memory = False,
                                                      quoting = False, escapechar = None)[["user_screen_name", "user_country_code", "RT_user_screen_name",
                                                                                           "RT_user_country_code", "lang"]].dropna()
                                          .rename(columns = {"user_country_code": "country",
                                                             "user_screen_name": "user",
                                                             "RT_user_country_code": "country_RT",
                                                             "RT_user_screen_name": "user_RT"})
                                          #keep only the RT within the same country, for the selected countries
                                          .query("(country == country_RT)&(country in @countries)") \
                                          .query("(user not in @filtered_users)&(user_RT not in @filtered_users)") \
                                          ["country"].value_counts()
                                          for month in periods[period]
                                          for file in sorted(glob(folder_DATA + "/*/{0}*".format(month)))],
                                         axis = 1).sum(axis = 1)
#one column per period, indexed by country
internal_RT = pd.DataFrame(internal_RT_dict)
#internal_RT.to_csv("/home/jlenti/Files/internal_retweets_volume_2503.csv", index = False)

period1
period2
period3
period4


I create the corresponding (directed, weighted) country edgelist, where nodes are countries and each edge weight is the total number of retweets between users of the two countries.
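
A minimal sketch of how such an edgelist could be built from a cross-country retweet table is given below. The dataframe `toy_cross_RT` and the column name `weight` are hypothetical, and the edge direction (from the retweeting country to the retweeted country) is an assumption.

```python
import pandas as pd

# hypothetical retweets, with the same columns as the cross_RT dataframes
toy_cross_RT = pd.DataFrame({
    "user":       ["a", "b", "c", "d"],
    "country":    ["FR", "GB", "GB", "US"],
    "user_RT":    ["x", "y", "y", "z"],
    "country_RT": ["US", "US", "US", "CO"],
})

# one row per ordered pair of countries, weighted by the number of retweets
toy_edgelist = (toy_cross_RT
                .groupby(["country", "country_RT"])
                .size()
                .reset_index(name="weight"))

# the same information as an adjacency matrix (rows retweet columns)
toy_adjacency = (toy_edgelist
                 .pivot(index="country", columns="country_RT", values="weight")
                 .fillna(0))
```

Keeping both the long edgelist and the wide adjacency matrix is convenient: the first is easy to save and filter, the second is the natural input for a heatmap or for normalization.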

In [22]:
internal_RT

| country | period1 | period2 | period3 | period4 |
|---------|---------|---------|---------|---------|
| AR | 67664.0 | 304011.0 | 597933.0 | 1761522.0 |
| AU | 8823.0 | 64773.0 | 50435.0 | 279560.0 |
| BR | 97499.0 | 1309062.0 | 2335052.0 | 5132419.0 |
| CA | 14665.0 | 78407.0 | 297355.0 | 927729.0 |
| CL | 12889.0 | 51764.0 | 117216.0 | 368590.0 |
| CO | 7919.0 | 123190.0 | 164745.0 | 1089199.0 |
| CU | 5925.0 | 21814.0 | 15989.0 | 48721.0 |
| DE | 11654.0 | 32989.0 | 138942.0 | 431077.0 |
| EC | 9186.0 | 65920.0 | 90036.0 | 673901.0 |
| ES | 55096.0 | 246224.0 | 537968.0 | 1300624.0 |
