## Using the Shuffle Test

The goal is to determine whether the intensity of certain types of GDELT events in year t (negative- or positive-sentiment events directed from Country A towards Country B) predicts an increase or decrease in migration flows from A to B in year t+1, controlling for other factors.

In [7]:
import pandas as pd
import networkx as nx
import numpy as np
from scipy.stats import pearsonr

In [8]:
import sys
sys.path.append('..')

In [9]:
from src.network import run_shuffle_test

In [10]:
gdelt_df = pd.read_csv("../data/gdelt_social.csv") 
migration_df = pd.read_csv("../data/migration_bilateral.csv")

In [11]:
gdelt_df

Unnamed: 0,Year,Actor1Country,Actor2Country,weighted_sum_avgtone,weighted_sum_goldstein,sum_nummentions
0,1979,ABW,NLD,5.000000,3.400000,4
1,1979,AFG,AFG,5.596083,0.501637,1161
2,1979,AFG,ARE,6.796117,1.900000,8
3,1979,AFG,BEL,4.966140,2.500000,18
4,1979,AFG,BGR,6.036373,3.080000,100
...,...,...,...,...,...,...
792260,2024,ZWE,WSM,-2.612389,3.457143,70
792261,2024,ZWE,WST,-3.356873,0.384615,13
792262,2024,ZWE,ZAF,-3.317623,0.377538,1763
792263,2024,ZWE,ZMB,-0.847431,1.864479,4465


In [12]:
migration_df = migration_df.drop(columns=["inflow", "outflow"])
migration_df

Unnamed: 0,iso_or,origin,iso_des,destination,year,stock,flow
0,AAB,Antigua and Barbuda,ABW,Aruba,1960,16,
1,AAB,Antigua and Barbuda,ABW,Aruba,1961,16,0.0
2,AAB,Antigua and Barbuda,ABW,Aruba,1962,15,-1.0
3,AAB,Antigua and Barbuda,ABW,Aruba,1963,15,0.0
4,AAB,Antigua and Barbuda,ABW,Aruba,1964,15,0.0
...,...,...,...,...,...,...,...
2889683,ZIM,Zimbabwe,ZAM,Zambia,2016,13239,150.0
2889684,ZIM,Zimbabwe,ZAM,Zambia,2017,13782,629.0
2889685,ZIM,Zimbabwe,ZAM,Zambia,2018,14670,976.0
2889686,ZIM,Zimbabwe,ZAM,Zambia,2019,15720,1142.0
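
The two outputs above show different codes for the same country: the GDELT sample uses `ZWE` and `ZMB`, while the migration sample uses `ZIM` and `ZAM`. Whether `run_shuffle_test` reconciles the two coding schemes is not shown here, so a quick overlap check may be worth running before joining the data. The sketch below uses toy frames built only from codes visible above; the variable names are illustrative.

```python
import pandas as pd

# Toy frames built from country codes that appear in the outputs above.
gdelt_toy = pd.DataFrame({
    "Actor1Country": ["ZWE", "ZWE", "ABW"],
    "Actor2Country": ["ZMB", "ZAF", "NLD"],
})
migration_toy = pd.DataFrame({
    "iso_or": ["ZIM", "AAB"],
    "iso_des": ["ZAM", "ABW"],
})

# Collect every code used on either side of a country pair.
gdelt_codes = set(gdelt_toy["Actor1Country"]) | set(gdelt_toy["Actor2Country"])
migration_codes = set(migration_toy["iso_or"]) | set(migration_toy["iso_des"])

shared = sorted(gdelt_codes & migration_codes)
only_gdelt = sorted(gdelt_codes - migration_codes)
only_migration = sorted(migration_codes - gdelt_codes)

print("shared:", shared)
print("only in GDELT:", only_gdelt)
print("only in migration:", only_migration)
```

On the real frames, the same set operations would apply to `gdelt_df` and `migration_df` directly. A large "only in" list would suggest that a code mapping is needed before pairs can be matched.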


In [13]:
gdelt_df["Year"].unique()

array([1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989,
       1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
       2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022,
       2023, 2024])

In [14]:
migration_df["year"].unique()

array([1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970,
       1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981,
       1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992,
       1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
       2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
       2015, 2016, 2017, 2018, 2019, 2020])

In [15]:
years = list(range(1979, 2019))  # 1979 through 2018 inclusive
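
The question stated at the top pairs events in year t with migration in year t+1. A minimal sketch of how such lagged pairs could be built with pandas follows. It reuses the column names shown above, but the data values and the country codes `AAA` and `BBB` are invented, and nothing here is taken from `run_shuffle_test`, whose handling of the lag is not shown in this notebook.

```python
import pandas as pd

# Toy stand-ins that reuse the column names shown above; all values are invented.
gdelt_toy = pd.DataFrame({
    "Year": [2000, 2001, 2000],
    "Actor1Country": ["AAA", "AAA", "BBB"],
    "Actor2Country": ["BBB", "BBB", "AAA"],
    "weighted_sum_avgtone": [1.5, -0.5, 2.0],
})
migration_toy = pd.DataFrame({
    "iso_or": ["AAA", "AAA", "BBB"],
    "iso_des": ["BBB", "BBB", "AAA"],
    "year": [2001, 2002, 2001],
    "flow": [10.0, -3.0, 7.0],
})

# Rename the actor columns so both frames share the same key names.
lagged = gdelt_toy.rename(
    columns={"Actor1Country": "iso_or", "Actor2Country": "iso_des"}
)
# Shift the event year forward by one: events in year t match migration in t+1.
lagged["year"] = lagged["Year"] + 1

pairs = lagged.merge(migration_toy, on=["iso_or", "iso_des", "year"], how="inner")
print(pairs[["iso_or", "iso_des", "Year", "year", "weighted_sum_avgtone", "flow"]])
```

Here `Year` is the event year and `year` is the migration year, so each output row holds one (t, t+1) pair for a directed country pair.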

In [17]:
results_df = run_shuffle_test(gdelt_df, migration_df, years, weight_col_migration="stock")
results_df

Unnamed: 0,year,observed_corr,p_value,null_mean,null_std
0,1979,-0.034828,0.148,1.499403e-07,0.024037
1,1980,-0.023457,0.291,0.0004710115,0.023168
2,1981,-0.021264,0.36,0.0005226955,0.023664
3,1982,-0.019342,0.374,0.001582156,0.022734
4,1983,-0.031663,0.166,-0.001281499,0.022619
5,1984,-0.044942,0.056,-0.0003144856,0.023434
6,1985,-0.031007,0.157,-7.933544e-05,0.022639
7,1986,-0.039579,0.064,-0.0006374244,0.021614
8,1987,-0.039119,0.061,0.001759986,0.021428
9,1988,-0.048814,0.037,0.0001218039,0.021683


In [None]:
results_df.to_dict()

`run_shuffle_test` returns one row per year. Each row compares the observed correlation between the GDELT-based measure and migration stock with a null distribution created by shuffling. The columns are:

- year: the year analyzed.
- observed_corr: the correlation observed in the actual data for that year between the chosen GDELT-based measure and migration stock.
- p_value: the probability that a correlation at least as extreme as the observed one would occur by chance if there were no true relationship, estimated from the shuffled datasets.
- null_mean and null_std: the mean and standard deviation of the correlations computed on the shuffled (null) datasets.
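
The procedure described above can be sketched as follows. The implementation of `run_shuffle_test` in `src.network` is not shown in this notebook, so this is only an assumption about the general technique: a plain permutation test on a Pearson correlation, using a hypothetical helper `shuffle_test`, synthetic data, and a two-sided p-value. The real function may shuffle something else (for example network edges) and may define the p-value differently.

```python
import numpy as np

def shuffle_test(x, y, n_shuffles=1000, seed=0):
    """Permutation test for a Pearson correlation (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Correlation in the data as observed.
    observed = np.corrcoef(x, y)[0, 1]

    # Null distribution: shuffling y breaks any pairing between x and y.
    null = np.empty(n_shuffles)
    for i in range(n_shuffles):
        null[i] = np.corrcoef(x, rng.permutation(y))[0, 1]

    # Two-sided p-value: share of shuffles at least as extreme as observed.
    p_value = float(np.mean(np.abs(null) >= abs(observed)))

    return {
        "observed_corr": float(observed),
        "p_value": p_value,
        "null_mean": float(null.mean()),
        "null_std": float(null.std()),
    }

# Synthetic example with a built-in negative relationship.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = -0.5 * x + rng.normal(size=200)

result = shuffle_test(x, y)
print(result)
```

With 1,000 shuffles the p-value can only take multiples of 0.001, which would be consistent with the three-decimal p-values in the results table, although the actual number of shuffles used is not stated.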

Patterns:

- observed correlations: All observed correlations in the rows shown are negative and small in magnitude (between about -0.02 and -0.05). In each of these years the two variables show a slight tendency to move in opposite directions, but the effect sizes are modest.

- statistical significance (p_value): Many years have p-values well above 0.05, so the observed correlation is not statistically distinguishable from what shuffling produces by chance. A few years have lower p-values: 1988 has 0.037, 2009 has 0.028, and 2018 has 0.022, while 1984 (0.056) and 2011 (0.071) are marginal. A p-value below 0.05 means that a negative correlation of the observed size would be unlikely to arise by chance alone. For those years there may be a statistically detectable pattern that diverges from randomness. However, the correlations remain small, so a statistically significant result is still not a strong one.

- null distribution (null_mean, null_std): The null means are close to zero and the standard deviations are small (mostly around 0.02). Under the null scenario (no real relationship), correlations therefore center near zero and vary little. The p-value falls below 0.05 only when the observed correlation lies far enough below zero relative to this null spread.
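
One way to see how far an observed correlation sits from the null distribution is to express it in units of the null standard deviation. The sketch below does this for three rows copied from the results table above. The `z_score` column is an illustrative addition and is not part of the output of `run_shuffle_test`.

```python
import pandas as pd

# Three rows copied from the results table above.
rows = pd.DataFrame({
    "year": [1979, 1984, 1988],
    "observed_corr": [-0.034828, -0.044942, -0.048814],
    "p_value": [0.148, 0.056, 0.037],
    "null_mean": [1.499403e-07, -0.0003144856, 0.0001218039],
    "null_std": [0.024037, 0.023434, 0.021683],
})

# Distance of the observed correlation from the null mean, in null standard deviations.
rows["z_score"] = (rows["observed_corr"] - rows["null_mean"]) / rows["null_std"]

print(rows[["year", "z_score", "p_value"]].round(3))
```

Of these three rows, only 1988 lies more than two null standard deviations below the null mean, and it is also the only one with a p-value below 0.05.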

Interpretation:

- For most years, the negative correlations are not statistically distinguishable from what you might get through random shuffling, suggesting no clear evidence of a genuine influence relationship.
- A few isolated years show a statistically significant negative correlation, meaning in those years the data deviate from chance in a consistent direction. Still, the correlations are quite small, so even in these cases, the practical significance is questionable.
- Statistical significance can arise from multiple comparisons or noise: with 40 yearly tests (1979 to 2018) at the 0.05 level, about two significant results (40 × 0.05) would be expected by chance alone even with no real relationship. Combined with the small effect sizes, this calls for caution. Without a theoretical reason why certain years should differ, these occasional significant results may be anomalies rather than meaningful patterns.
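
The multiple-comparisons concern can be checked with a standard correction. The sketch below applies Bonferroni and Benjamini-Hochberg corrections, written out by hand in NumPy, to the ten p-values visible in the results table (1979 to 1988). It is an illustration only: a real check would use the p-values for all tested years, and neither correction is part of the original analysis.

```python
import numpy as np

# The ten p-values shown in the results table above (1979 to 1988).
p_values = np.array([0.148, 0.291, 0.36, 0.374, 0.166,
                     0.056, 0.157, 0.064, 0.061, 0.037])
alpha = 0.05
m = len(p_values)

# Bonferroni: compare every p-value against alpha / m.
bonferroni_reject = p_values < alpha / m

# Benjamini-Hochberg: find the largest rank k with p_(k) <= alpha * k / m,
# then reject the k smallest p-values.
order = np.argsort(p_values)
ranked = p_values[order]
thresholds = alpha * np.arange(1, m + 1) / m
passed = np.nonzero(ranked <= thresholds)[0]
bh_reject = np.zeros(m, dtype=bool)
if passed.size:
    bh_reject[order[: passed[-1] + 1]] = True

print("uncorrected rejections:", int((p_values < alpha).sum()))
print("Bonferroni rejections:", int(bonferroni_reject.sum()))
print("Benjamini-Hochberg rejections:", int(bh_reject.sum()))
```

For these ten rows, the single uncorrected rejection (1988) does not survive either correction.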

Possible explanations for some of the significant and marginal years:
- 1984: Ethiopia experienced a catastrophic famine driven by drought and political instability, displacing millions to neighbouring countries and beyond. The assassination of Indian Prime Minister Indira Gandhi in 1984 led to anti-Sikh riots, which caused internal displacement and prompted some people to seek asylum abroad.
- 2009: The aftermath of the 2008 financial crisis brought economic downturns worldwide, and economic hardship may have prompted people in various countries to migrate in search of better opportunities. The end of the civil war in Sri Lanka in 2009 also resulted in significant internal displacement.
- 2011: The Syrian civil war, which began in 2011, led to one of the largest refugee crises in recent history, with millions fleeing to neighbouring countries and Europe. The Arab Spring, a series of uprisings across the Middle East and North Africa, brought political instability and conflict that prompted increased migration from affected regions.
- 2018: Severe droughts in Central America's Dry Corridor, exacerbated by climate change, led to crop failures and food insecurity, driving migration towards the United States. By 2018, Venezuela's economic collapse had caused shortages of basic necessities, leading to a mass exodus of citizens to neighbouring countries.

The data mostly show no strong or consistent evidence of a meaningful relationship. In a few isolated years, there appears to be a statistically significant negative association, but the effect sizes are small and could be due to chance or other unmodeled factors.