The question here is whether the intensity of certain types of GDELT events in year t (e.g., negative-sentiment events from Country A towards Country B) predicts an increase or decrease in migration flows from A to B in year t+1, controlling for other factors.
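
As a minimal sketch of that lagged setup, the snippet below pairs each GDELT value in year t with the migration flow in year t+1 for the same origin-destination pair. The column names follow the two CSV files loaded later in this notebook; the helper `build_lagged_panel` and the toy rows are hypothetical. It also assumes the country codes have already been harmonized, which does not appear to hold for the raw files (GDELT uses ZWE/ZMB where the migration file uses ZIM/ZAM).

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical values, real column names).
gdelt = pd.DataFrame({
    "Year": [2000, 2001, 2000],
    "Actor1Country": ["AAA", "AAA", "BBB"],
    "Actor2Country": ["BBB", "BBB", "AAA"],
    "weighted_sum_goldstein": [-2.0, 1.5, 0.5],
})
migration = pd.DataFrame({
    "iso_or": ["AAA", "AAA", "BBB"],
    "iso_des": ["BBB", "BBB", "AAA"],
    "year": [2001, 2002, 2001],
    "flow": [120.0, 80.0, 10.0],
})


def build_lagged_panel(gdelt, migration, lag=1):
    """Pair GDELT values from year t with migration flows from year t + lag."""
    g = gdelt.rename(columns={
        "Year": "year",
        "Actor1Country": "iso_or",
        "Actor2Country": "iso_des",
    }).copy()
    g["event_year"] = g["year"]
    # Shift the GDELT year forward so it lines up with the later flow year.
    g["year"] = g["year"] + lag
    return g.merge(migration, on=["iso_or", "iso_des", "year"], how="inner")


panel = build_lagged_panel(gdelt, migration)
print(panel[["iso_or", "iso_des", "event_year", "year", "weighted_sum_goldstein", "flow"]])
```

Each row of `panel` then holds a predictor from year t (`event_year`) next to an outcome from year t+1 (`year`), which is the shape a lagged regression or correlation would need.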

In [1]:
import pandas as pd
import networkx as nx
import numpy as np
from scipy.stats import pearsonr

In [2]:
import sys
sys.path.append('..')

In [3]:
from src.network import run_shuffle_test

In [4]:
gdelt_df = pd.read_csv("../data/gdelt_social.csv") 
migration_df = pd.read_csv("../data/migration_bilateral.csv")

In [5]:
gdelt_df

Unnamed: 0,Year,Actor1Country,Actor2Country,weighted_sum_avgtone,weighted_sum_goldstein,sum_nummentions
0,1979,ABW,NLD,5.000000,3.400000,4
1,1979,AFG,AFG,5.596083,0.501637,1161
2,1979,AFG,ARE,6.796117,1.900000,8
3,1979,AFG,BEL,4.966140,2.500000,18
4,1979,AFG,BGR,6.036373,3.080000,100
...,...,...,...,...,...,...
792260,2024,ZWE,WSM,-2.612389,3.457143,70
792261,2024,ZWE,WST,-3.356873,0.384615,13
792262,2024,ZWE,ZAF,-3.317623,0.377538,1763
792263,2024,ZWE,ZMB,-0.847431,1.864479,4465


In [6]:
migration_df = migration_df.drop(columns=["inflow", "outflow"])
migration_df

Unnamed: 0,iso_or,origin,iso_des,destination,year,stock,flow
0,AAB,Antigua and Barbuda,ABW,Aruba,1960,16,
1,AAB,Antigua and Barbuda,ABW,Aruba,1961,16,0.0
2,AAB,Antigua and Barbuda,ABW,Aruba,1962,15,-1.0
3,AAB,Antigua and Barbuda,ABW,Aruba,1963,15,0.0
4,AAB,Antigua and Barbuda,ABW,Aruba,1964,15,0.0
...,...,...,...,...,...,...,...
2889683,ZIM,Zimbabwe,ZAM,Zambia,2016,13239,150.0
2889684,ZIM,Zimbabwe,ZAM,Zambia,2017,13782,629.0
2889685,ZIM,Zimbabwe,ZAM,Zambia,2018,14670,976.0
2889686,ZIM,Zimbabwe,ZAM,Zambia,2019,15720,1142.0


In [10]:
gdelt_df["Year"].unique()

array([1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989,
       1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
       2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022,
       2023, 2024])

In [11]:
migration_df["year"].unique()

array([1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970,
       1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981,
       1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992,
       1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
       2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
       2015, 2016, 2017, 2018, 2019, 2020])

In [14]:
years = list(range(1979, 2019))  # 1979 through 2018 inclusive

In [15]:
results_df = run_shuffle_test(gdelt_df, migration_df, years)
results_df

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Batch computation too fast (0.05454301834106445s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done  74 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done  96 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 122 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done 148 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 178 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done 208 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done 242 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done 276 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]:

Unnamed: 0,year,observed_corr,p_value,null_mean,null_std
0,1979,-0.038202,0.101,0.001546,0.024162
1,1980,-0.043629,0.075,0.001859,0.02412
2,1981,-0.028809,0.238,0.000309,0.025281
3,1982,-0.024191,0.271,-0.000569,0.023085
4,1983,-0.024441,0.24,0.00029,0.022644
5,1984,-0.051594,0.039,-0.000385,0.023385
6,1985,-0.037989,0.084,0.000851,0.021333
7,1986,-0.026088,0.204,-4e-05,0.021233
8,1987,-0.028069,0.194,-0.000389,0.023134
9,1988,-0.03378,0.106,-0.000701,0.022215


In [16]:
results_df.to_dict()

{'year': {0: 1979,
  1: 1980,
  2: 1981,
  3: 1982,
  4: 1983,
  5: 1984,
  6: 1985,
  7: 1986,
  8: 1987,
  9: 1988,
  10: 1989,
  11: 1990,
  12: 1991,
  13: 1992,
  14: 1993,
  15: 1994,
  16: 1995,
  17: 1996,
  18: 1997,
  19: 1998,
  20: 1999,
  21: 2000,
  22: 2001,
  23: 2002,
  24: 2003,
  25: 2004,
  26: 2005,
  27: 2006,
  28: 2007,
  29: 2008,
  30: 2009,
  31: 2010,
  32: 2011,
  33: 2012,
  34: 2013,
  35: 2014,
  36: 2015,
  37: 2016,
  38: 2017,
  39: 2018},
 'observed_corr': {0: -0.0382023800209753,
  1: -0.04362854740577281,
  2: -0.028809327900096847,
  3: -0.02419058587523243,
  4: -0.0244410943194837,
  5: -0.051593604076756214,
  6: -0.03798861032940341,
  7: -0.02608820057834694,
  8: -0.028068557494453583,
  9: -0.033779959521905045,
  10: -0.03253252425743884,
  11: -0.005728938363686133,
  12: -0.004637325206608562,
  13: -0.01249145019996902,
  14: -0.003748668899155499,
  15: 0.000405714752253491,
  16: -0.008872759869002149,
  17: -0.011967318888159292,
  1

This is a series of annual tests comparing the observed correlations between GDELT-based measures and migration flows to a null distribution created by shuffling. Each row corresponds to one year, with the following key columns:

year: The year of the analysis.
observed_corr: The actual correlation observed in the data for that year.
p_value: The probability of observing a correlation at least as extreme as the observed one under the null hypothesis of no relationship.
null_mean & null_std: The mean and standard deviation of the correlation values obtained from the shuffled (null) datasets.
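
The source of `run_shuffle_test` in `src.network` is not shown here, so the following is only a sketch of how columns like these could be produced, not its actual implementation. It assumes a Pearson correlation, a two-sided p-value, and 1,000 shuffles; the function name `shuffle_test` and the synthetic data are hypothetical.

```python
import numpy as np


def shuffle_test(x, y, n_shuffles=1000, seed=0):
    """Permutation test for a Pearson correlation (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]

    # Build the null distribution by breaking the pairing between x and y.
    null = np.empty(n_shuffles)
    for i in range(n_shuffles):
        null[i] = np.corrcoef(x, rng.permutation(y))[0, 1]

    # Two-sided p-value: share of shuffled correlations at least as extreme.
    p_value = float(np.mean(np.abs(null) >= abs(observed)))
    return {
        "observed_corr": float(observed),
        "p_value": p_value,
        "null_mean": float(null.mean()),
        "null_std": float(null.std()),
    }


# Synthetic example with a strong built-in relationship.
demo_rng = np.random.default_rng(1)
demo_x = demo_rng.normal(size=200)
demo_y = demo_x + demo_rng.normal(scale=0.1, size=200)
demo_result = shuffle_test(demo_x, demo_y, n_shuffles=200)
print(demo_result)
```

With a finite number of shuffles the smallest p-value this estimator can report is 0, so some implementations add one to both the numerator and the denominator; whether `run_shuffle_test` does so is unknown.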

Overall patterns:

- nearly all the observed correlations are small and negative (ranging roughly between -0.05 and 0.00); at least one year (1994, about 0.0004) is marginally positive. This indicates that in the original data, the linear association between the GDELT events and migration flows is very weak and either negative or near zero.

- statistical significance (p_value): Most p-values are above 0.05, and only a few come close to or fall below that conventional threshold. For instance, 1984 has a p-value of about 0.039, which is marginally significant. Some other years (e.g., 1980, 2005, 2008) have p-values below 0.1 but not below 0.05.
For the vast majority of years, high p-values indicate that the observed correlations could easily have arisen by chance. In other words, there is no strong statistical evidence that GDELT events are systematically associated with migration flows in a given year.

- The null_mean values are very close to zero, as expected. This is how a permutation null should behave: when relationships are randomized, the average correlation tends to zero. The null_std values are generally small (mostly around 0.02), indicating that the shuffled correlations are tightly clustered around zero.

Interpretation:

- no consistent evidence of influence: The small, negative observed correlations combined with mostly high p-values suggest that there's no meaningful linear relationship between the variables tested and the observed patterns can generally be explained by random chance.

- one possibly significant year (1984): In 1984, the p-value dips slightly below 0.05, which might hint at a non-random pattern for that particular year. However, this single marginal finding could be a statistical fluke: with 40 yearly tests at the 0.05 level, about two would be expected to fall below the threshold by chance alone even if no relationship existed. Without a clear theoretical reason to single out that year, it is best interpreted cautiously.

Overall, these results do not provide compelling evidence that GDELT-derived edges (e.g., positive/negative event intensities between countries) have a significant influence on migration flows in these years. The patterns observed are generally indistinguishable from what would be expected by random chance, except for a single marginal case that warrants cautious interpretation.
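
The caution about the number of years tested can be made concrete with a short calculation. It assumes the 40 yearly tests are independent, which is unlikely to hold exactly for consecutive years of the same networks, so the figures are rough guides only.

```python
n_tests = 40   # years 1979 through 2018
alpha = 0.05   # conventional significance threshold

# Expected number of tests falling below alpha if every null hypothesis is true.
expected_false_positives = n_tests * alpha

# Probability that at least one of the tests falls below alpha by chance.
p_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni-corrected per-test threshold that keeps the family-wise rate at alpha.
bonferroni_threshold = alpha / n_tests

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {p_at_least_one:.3f}")
print(f"Bonferroni threshold: {bonferroni_threshold:.5f}")
```

The 1984 p-value of about 0.039 is well above the Bonferroni threshold, so it would not count as significant under this (conservative) correction.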

In [17]:
results_df = run_shuffle_test(gdelt_df, migration_df, years, weight_col_migration="stock")
results_df

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    1.8s
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.19957085908092073s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done  82 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done 108 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:    2.9s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:    3.1s
[Parallel(n_jobs=-1)]: Done 202 tasks      | elapsed:    3.4s
[Parallel(n_jobs=-1)]: Done 236 tasks      | elapsed:    3.7s
[Parallel(n_jobs=-1)]:

Unnamed: 0,year,observed_corr,p_value,null_mean,null_std
0,1979,-0.034828,0.167,-0.0007752532,0.025163
1,1980,-0.023457,0.319,6.842734e-05,0.024613
2,1981,-0.021264,0.346,0.001218337,0.02424
3,1982,-0.019342,0.383,0.0003155165,0.023255
4,1983,-0.031663,0.165,-0.0004146883,0.023429
5,1984,-0.044942,0.058,-0.00178909,0.023733
6,1985,-0.031007,0.146,-0.000331859,0.021088
7,1986,-0.039579,0.08,-1.484429e-05,0.022146
8,1987,-0.039119,0.08,0.0005512059,0.022351
9,1988,-0.048814,0.028,8.836579e-05,0.021267


In [18]:
results_df.to_dict()

{'year': {0: 1979,
  1: 1980,
  2: 1981,
  3: 1982,
  4: 1983,
  5: 1984,
  6: 1985,
  7: 1986,
  8: 1987,
  9: 1988,
  10: 1989,
  11: 1990,
  12: 1991,
  13: 1992,
  14: 1993,
  15: 1994,
  16: 1995,
  17: 1996,
  18: 1997,
  19: 1998,
  20: 1999,
  21: 2000,
  22: 2001,
  23: 2002,
  24: 2003,
  25: 2004,
  26: 2005,
  27: 2006,
  28: 2007,
  29: 2008,
  30: 2009,
  31: 2010,
  32: 2011,
  33: 2012,
  34: 2013,
  35: 2014,
  36: 2015,
  37: 2016,
  38: 2017,
  39: 2018},
 'observed_corr': {0: -0.03482843197951735,
  1: -0.02345709512456369,
  2: -0.021263856218299723,
  3: -0.019341968437623484,
  4: -0.03166298045152563,
  5: -0.044941823735416705,
  6: -0.031007250500782423,
  7: -0.0395789960851968,
  8: -0.03911916614570348,
  9: -0.048813840987423474,
  10: -0.04405219071672735,
  11: -0.04730929677828123,
  12: -0.013790053817851266,
  13: -0.028399440229945866,
  14: -0.026507595322399456,
  15: -0.02884419999464076,
  16: -0.027518604250089524,
  17: -0.025294887766623104,
 

This is the same series of annual tests, now comparing the observed correlations between GDELT-based measures and migration stock to a null distribution created by shuffling. Each row shows:

year: the year analyzed.
observed_corr: The correlation observed in the actual data for that year between the chosen GDELT-based measure and migration stock.
p_value: The probability that a correlation as extreme as the observed one could occur by random chance if there were no true relationship.
null_mean & null_std: The mean and standard deviation of the correlation values generated by shuffled (null) datasets.

Patterns:

- observed correlations: All observed correlations are negative and generally small in magnitude. This indicates that in each year, the raw data show a slight tendency for the variables to move in opposite directions, but the effect sizes are quite modest.

- statistical significance (p_value): Many years have p-values well above 0.05, indicating no statistically significant difference from what would be expected by chance.
Some years stand out with lower p-values:
For example, year 2009 has a p-value of 0.028, year 2018 has 0.022, and 1988 (0.028 in the table above) also falls below 0.05, while 1984 (0.058) and 2011 (0.071) are only marginal. Values below 0.05 suggest that the observed negative correlation that year is somewhat unlikely to have arisen by random chance.
These instances might imply that for those particular years, there is a statistically detectable pattern that diverges from randomness. However, the correlation magnitudes remain small, so even where the association is statistically significant, it is weak.

- The null means are close to zero, and the standard deviations are small (mostly around 0.02 or less). This indicates that under the null scenario (no real relationship), correlations center near zero and do not vary wildly. The p-value falls below 0.05 only when the observed correlation is more negative than almost all of the shuffled correlations, which here means roughly two null standard deviations below zero.
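
One way to read the null columns is to express each observed correlation as a distance from the null mean in units of the null standard deviation. The sketch below does this for two rows copied from the stock table above; the `z_score` column is an illustrative addition, not something `run_shuffle_test` is known to return.

```python
import pandas as pd

# Two rows copied from the stock results table above.
stock_rows = pd.DataFrame({
    "year": [1984, 1988],
    "observed_corr": [-0.044942, -0.048814],
    "p_value": [0.058, 0.028],
    "null_mean": [-0.00178909, 8.836579e-05],
    "null_std": [0.023733, 0.021267],
})

# Standardized distance of the observed correlation from the shuffled distribution.
stock_rows["z_score"] = (
    stock_rows["observed_corr"] - stock_rows["null_mean"]
) / stock_rows["null_std"]

print(stock_rows[["year", "observed_corr", "p_value", "z_score"]])
```

The 1984 value sits roughly 1.8 null standard deviations below the null mean and the 1988 value roughly 2.3, which matches the ordering of their p-values.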

Interpretation:

- For most years, the negative correlations are not statistically distinguishable from what you might get through random shuffling, suggesting no clear evidence of a genuine influence relationship.
- A few isolated years show a statistically significant negative correlation, meaning in those years the data deviate from chance in a consistent direction. Still, the correlations are quite small, so even in these cases, the practical significance is questionable.
- Because statistical significance can sometimes arise due to multiple comparisons or noise, and because effect sizes are small, it’s best to be cautious. Without a theoretical reason why certain years should differ, these occasional significant results might be anomalies rather than meaningful patterns.
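
The multiple-comparisons concern can be checked with a false discovery rate correction. The sketch below applies a hand-written Benjamini-Hochberg adjustment to the ten p-values visible in the stock table above (1979 to 1988); the function name is hypothetical, and a full analysis would pass all 40 yearly p-values instead.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


# p-values for 1979-1988 from the stock results table above.
visible_p = [0.167, 0.319, 0.346, 0.383, 0.165, 0.058, 0.146, 0.08, 0.08, 0.028]
adjusted_p = benjamini_hochberg(visible_p)
print(np.round(adjusted_p, 3))
```

For these ten years none of the adjusted p-values falls below 0.05, which is consistent with treating the isolated significant years as possible anomalies.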

Possible explanations for the significant or marginally significant years we encountered:
- 1984: Ethiopia experienced a catastrophic famine due to drought and political instability, leading to the displacement of millions to neighboring countries and beyond. The assassination of Indian Prime Minister Indira Gandhi in 1984 led to anti-Sikh riots, causing internal displacement and prompting some to seek asylum abroad.
- 2009: The aftermath of the 2008 financial crisis led to economic downturns worldwide. Economic hardships in various countries may have prompted individuals to migrate in search of better opportunities. The end of the civil war in Sri Lanka in 2009 also resulted in significant internal displacement.
- 2011: Syrian Civil War: beginning in 2011, the Syrian conflict led to one of the largest refugee crises in recent history, with millions fleeing to neighboring countries and Europe. Arab Spring: a series of uprisings across the Middle East and North Africa in 2011 resulted in political instability and conflicts, prompting increased migration from affected regions.
- 2018: Severe droughts in Central America's Dry Corridor, exacerbated by climate change, led to crop failures and food insecurity, driving migration towards the United States. Venezuelan economic crisis: by 2018, Venezuela's economic collapse had resulted in shortages of basic necessities, leading to a mass exodus of citizens to neighboring countries.

The data mostly show no strong or consistent evidence of a meaningful relationship. In a few isolated years, there appears to be a statistically significant negative association, but the effect sizes are small and could be due to chance or other unmodeled factors.