# Country Level Attraction Model

This notebook builds a country-level attraction model for assessing the relative "attractiveness" of safe haven countries to refugees during a forced migration event such as a conflict or natural disaster.

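After testing various functional forms and features, the final model is:

```
refugee_share = b1*GDP_norm + b2*liberal_democracy
```

where `refugee_share` is the share of a conflict's refugees received by a haven country (`pct_tot` below), `b1` is the coefficient on GDP normalized within each conflict, and `b2` is the coefficient on the liberal democracy score (`v2x_libdem`). The model has no intercept term.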

In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from fuzzywuzzy import process
import statsmodels.api as sm

## Data Preparation

First we load the refugee dataset, which was curated from UNHCR and other sources. Russia is not considered a haven country, but it is only filtered out later, when the Ukraine prediction inputs are loaded:

In [2]:
data=pd.read_csv('refugee_data/merged_refugee_data.csv')

In [3]:
data.head()

| | Year | Country of origin | Country of origin (ISO) | Country of asylum | Country of asylum (ISO) | Refugees under UNHCR's mandate | is_bordering | Country Code | v2xeg_eqdr | v2x_libdem | GDP (current US$) | Population, total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2000 | Afghanistan | AFG | Afghanistan | AFG | 0 | False | AFG | NaN | NaN | NaN | 20779957.0 |
| 1 | 2000 | Afghanistan | AFG | Egypt | EGY | 60 | False | EGY | NaN | NaN | 99838540000.0 | 68831561.0 |
| 2 | 2000 | Afghanistan | AFG | Australia | AUS | 4358 | False | AUS | NaN | NaN | 415576200000.0 | 19153000.0 |
| 3 | 2000 | Afghanistan | AFG | Austria | AUT | 679 | False | AUT | NaN | NaN | 197289600000.0 | 8011566.0 |
| 4 | 2000 | Afghanistan | AFG | Azerbaijan | AZE | 172 | False | AZE | NaN | NaN | 5272798000.0 | 8048600.0 |


Now we keep only the rows where the country of asylum borders the country of origin:

In [4]:
data = data[data['is_bordering']==True]

## Featurization and Normalization

Next we calculate various features that will be relevant to our model training. These include:

- `pct_tot`: total percentage of people from `conflict A` who went to `country Z`
- `bilateral_migration_percap`: the amount of bilateral migration (from `conflict A` to `country Z`) per capita (with respect to `country Z` population)
- `gdp_per_cap`: per capita GDP for `country Z`
- `migrants_per_cap`: total migrants per capita for `country Z`

We then scale these using a [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) which normalizes them to between 0 and 1 on a **_per conflict basis_**. 

In [5]:
data['pct_tot'] = data["Refugees under UNHCR's mandate"] / data.groupby(['Year','Country of origin'])["Refugees under UNHCR's mandate"].transform('sum')
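
As a minimal illustration of the share calculation above, the sketch below applies the same `groupby(...).transform('sum')` pattern to a small made-up table. The country labels and refugee counts are invented and the column names are shortened for readability; only the pattern mirrors the notebook.

```python
import pandas as pd

# Toy data (invented values): two conflicts in the same year
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2000, 2000],
    "origin": ["A", "A", "A", "B", "B"],
    "asylum": ["X", "Y", "Z", "X", "W"],
    "refugees": [60, 30, 10, 25, 75],
})

# transform('sum') broadcasts each conflict's total back onto its rows,
# so the division yields each haven's share of that conflict's refugees
conflict_total = toy.groupby(["Year", "origin"])["refugees"].transform("sum")
toy["pct_tot"] = toy["refugees"] / conflict_total

print(toy)
```

Within each conflict the shares sum to 1. A conflict whose total is zero would produce `NaN` shares (0/0) rather than an error.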

Normalizing within each conflict is crucial: it makes haven countries directly comparable for a given conflict and keeps the model from comparing raw values across all conflicts.

For example, imagine two conflicts: one in Western Europe and one in Latin America. The potential haven countries in Western Europe will _all be wealthier_ than those in Latin America (generally speaking). Therefore, normalizing _across_ these conflicts would lead to very low normalized GDPs for all Latin American havens and reduce the model's explanatory power.

Instead, we perform `0 to 1` normalization for all possible haven countries with respect to each conflict so that Latin American countries (in this example) are only normalized relative to their neighbors.
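
The difference between global and per-conflict scaling can be sketched on a toy table. This is not the notebook's implementation (the next cell uses `MinMaxScaler`); it uses a plain pandas `groupby(...).transform` with a hypothetical helper, `minmax`, and the GDP figures are invented.

```python
import pandas as pd

# Toy data (invented GDP values): one wealthier-region conflict, one poorer-region conflict
toy = pd.DataFrame({
    "conflict": ["EU", "EU", "EU", "LATAM", "LATAM", "LATAM"],
    "haven": ["E1", "E2", "E3", "L1", "L2", "L3"],
    "gdp": [4000.0, 3000.0, 2000.0, 400.0, 300.0, 200.0],
})

def minmax(s: pd.Series) -> pd.Series:
    """Rescale a series to [0, 1]; a constant series maps to 0."""
    span = s.max() - s.min()
    if span == 0:
        return s * 0.0
    return (s - s.min()) / span

# Global scaling squeezes every LATAM haven toward the bottom of the range
toy["gdp_global"] = minmax(toy["gdp"])

# Per-conflict scaling compares each haven only with the other havens of its conflict
toy["gdp_within"] = toy.groupby("conflict")["gdp"].transform(minmax)

print(toy)
```

With global scaling the three LATAM havens are nearly indistinguishable, while per-conflict scaling spreads them over the full 0 to 1 range.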

In [6]:
cols_to_scale = ['GDP (current US$)', 'Population, total']
scaler = MinMaxScaler()
for col in cols_to_scale:
    print(f"Normalizing column: {col}")
    normed = []  # one frame of scaled values per conflict

    # Each (Year, Country of origin) group is one conflict; rescale within it
    for _, x in data.groupby(['Year','Country of origin']):
        norm_ = scaler.fit_transform(x[col].values.reshape(-1,1)).ravel()
        res = pd.DataFrame({'orig': x['Country of origin'].values,
                            'asy': x['Country of asylum'].values,
                            'Year': x['Year'].values,
                            f"{col}_norm": norm_})
        normed.append(res)
    # DataFrame.append was removed in pandas 2.0, so concatenate the pieces instead
    normed = pd.concat(normed, ignore_index=True)
    data = pd.merge(data, normed, left_on=['Country of origin','Country of asylum', 'Year'],
                    right_on=['orig','asy','Year'], how='right')
    # Drop the helper key columns so they do not collide on the next pass
    data = data.drop(columns=['orig','asy'])

Normalizing column: GDP (current US$)




Normalizing column: Population, total


In [7]:
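# numeric_only=True skips the text columns, which recent pandas versions no longer drop silently
data.corr(numeric_only=True)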

| | Year | Refugees under UNHCR's mandate | v2xeg_eqdr | v2x_libdem | GDP (current US$) | Population, total | pct_tot | GDP (current US$)_norm | Population, total_norm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Year | 1.0 | 0.040766 | -0.0801 | -0.028829 | 0.074239 | 0.019921 | -0.050953 | -0.005015 | 0.005362 |
| Refugees under UNHCR's mandate | 0.040766 | 1.0 | -0.074456 | -0.127302 | -0.002604 | 0.070098 | 0.096649 | 0.046648 | 0.050897 |
| v2xeg_eqdr | -0.0801 | -0.074456 | 1.0 | 0.539734 | 0.134863 | -0.089809 | 0.11332 | -0.018704 | -0.158824 |
| v2x_libdem | -0.028829 | -0.127302 | 0.539734 | 1.0 | 0.209089 | -0.021128 | 0.177241 | 0.010347 | -0.085 |
| GDP (current US$) | 0.074239 | -0.002604 | 0.134863 | 0.209089 | 1.0 | 0.499688 | 0.215787 | 0.077521 | 0.062354 |
| Population, total | 0.019921 | 0.070098 | -0.089809 | -0.021128 | 0.499688 | 1.0 | 0.109022 | 0.254814 | 0.280054 |
| pct_tot | -0.050953 | 0.096649 | 0.11332 | 0.177241 | 0.215787 | 0.109022 | 1.0 | 0.034247 | 0.023194 |
| GDP (current US$)_norm | -0.005015 | 0.046648 | -0.018704 | 0.010347 | 0.077521 | 0.254814 | 0.034247 | 1.0 | 0.769961 |
| Population, total_norm | 0.005362 | 0.050897 | -0.158824 | -0.085 | 0.062354 | 0.280054 | 0.023194 | 0.769961 | 1.0 |


## Preparing for modeling

Next, we remove the Ukraine conflict and its corresponding countries from the dataset so that we can exclude them from model training.

First we create a dataframe of just Ukraine conflict:

We define the dependent variable as `pct_tot`: the share of refugees from each conflict who went to each country. Using shares normalizes refugee counts _across_ conflicts, since at this point we only care about predicting refugee _shares_, not total numbers of refugees. We first drop rows with missing values:

In [8]:
data = data.dropna()

In [9]:
y=data['pct_tot']

Here we set the independent variables for modeling. We use only `GDP (current US$)_norm` (GDP normalized within each conflict) and `v2x_libdem` (which is not normalized since it is already an index).

Other features were tested; to try additional features with the model, simply add them to the list below.

In [10]:
features_cols = [
                    'GDP (current US$)_norm', 
                    'v2x_libdem',
                ]
features_normalized = data[features_cols]

## Modeling

Finally, we fit the model itself: a multiple linear regression (ordinary least squares, without an intercept) using Statsmodels:

In [11]:
results=sm.OLS(y,features_normalized.astype(float)).fit()
results.summary()

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | pct_tot | R-squared (uncentered): | 0.408 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.408 |
| Method: | Least Squares | F-statistic: | 1749.0 |
| Date: | Mon, 09 May 2022 | Prob (F-statistic): | 0.0 |
| Time: | 13:28:57 | Log-Likelihood: | -2730.5 |
| No. Observations: | 5073 | AIC: | 5465.0 |
| Df Residuals: | 5071 | BIC: | 5478.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| GDP (current US$)_norm | 0.1635 | 0.013 | 12.994 | 0.000 | 0.139 | 0.188 |
| v2x_libdem | 0.6667 | 0.016 | 42.488 | 0.000 | 0.636 | 0.697 |

| | | | |
| --- | --- | --- | --- |
| Omnibus: | 457.133 | Durbin-Watson: | 1.789 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 268.547 |
| Skew: | 0.427 | Prob(JB): | 4.85e-59 |
| Kurtosis: | 2.265 | Cond. No. | 1.86 |


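Both features are statistically significant (p < 0.001), and the model attains an uncentered R-squared of 0.408 for the dependent variable (refugee share). Because the model is fit without an intercept, this figure is not directly comparable to a conventional (centered) R-squared.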
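
To make the distinction between uncentered and centered R-squared concrete, the sketch below fits a no-intercept least squares model on synthetic data and computes both versions by hand. It uses plain NumPy rather than Statsmodels, and all values and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features in [0, 1] and a noisy response with a nonzero mean
X = rng.uniform(0, 1, size=(500, 2))
y = X @ np.array([0.2, 0.6]) + rng.normal(0, 0.2, size=500) + 0.3

# Least squares without an intercept column, as in the model above
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
ssr = resid @ resid

# Uncentered R^2 measures variation around zero; centered R^2 around the mean of y
r2_uncentered = 1 - ssr / (y @ y)
r2_centered = 1 - ssr / ((y - y.mean()) @ (y - y.mean()))

print(f"uncentered R^2: {r2_uncentered:.3f}")
print(f"centered R^2:   {r2_centered:.3f}")
```

The uncentered total sum of squares is never smaller than the centered one, so the uncentered figure is always at least as large. This is worth keeping in mind when reading the 0.408 reported in the summary.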

## Backcasting Ukraine

Now we are ready to produce predictions for Ukraine. We load the inputs for its bordering countries and exclude Russia and Belarus:

In [12]:
ukr = pd.read_csv('refugee_data/ukr_pred_inputs.csv')
ukr = ukr[~ukr['Country Code'].isin(['RUS','BLR'])].copy()  # copy so new columns can be added safely
display(ukr)

| | Country Code | Year | v2xeg_eqdr | v2x_libdem | GDP (current US$) | Population, total |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | POL | 2020 | 0.777 | 0.468 | 596624400000.0 | 37899070.0 |
| 3 | MDA | 2020 | 0.69 | 0.476 | 11915550000.0 | 2620495.0 |
| 4 | ROU | 2020 | 0.604 | 0.567 | 248715600000.0 | 19257520.0 |
| 5 | SVK | 2020 | 0.768 | 0.753 | 105172600000.0 | 5458827.0 |
| 6 | HUN | 2020 | 0.632 | 0.362 | 155808400000.0 | 9750149.0 |


In [13]:
scaler = MinMaxScaler()
ukr['GDP (current US$)_norm'] = [i[0] for i in scaler.fit_transform(ukr['GDP (current US$)'].values.reshape(-1,1))]

In [15]:
# get refugee shares prediction for Ukraine
features_to_predict=ukr[features_cols]
shares = results.predict(features_to_predict)

# add them to Ukraine dataframe
ukr['predicted_shares'] = shares
ukr_results = ukr[['Country Code','predicted_shares']]

In [16]:
ukr_results

| | Country Code | predicted_shares |
| --- | --- | --- |
| 1 | POL | 0.475492 |
| 3 | MDA | 0.317343 |
| 4 | ROU | 0.44422 |
| 5 | SVK | 0.52809 |
| 6 | HUN | 0.281573 |
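
As a sanity check, the predicted shares can be approximately reproduced by hand from the rounded coefficients in the regression summary (0.1635 for normalized GDP, 0.6667 for `v2x_libdem`) and the Ukraine inputs shown earlier. The sketch below hard-codes those printed values; because the coefficients are rounded, the results should only be expected to agree to roughly four decimal places.

```python
# Inputs copied from the Ukraine table: (GDP in current US$, v2x_libdem)
inputs = {
    "POL": (596624400000.0, 0.468),
    "MDA": (11915550000.0, 0.476),
    "ROU": (248715600000.0, 0.567),
    "SVK": (105172600000.0, 0.753),
    "HUN": (155808400000.0, 0.362),
}
# Rounded coefficients copied from the OLS summary
B_GDP, B_LIBDEM = 0.1635, 0.6667

gdps = [gdp for gdp, _ in inputs.values()]
gdp_min, gdp_max = min(gdps), max(gdps)

manual = {}
for code, (gdp, libdem) in inputs.items():
    # Min-max scaling over the five haven countries, as MinMaxScaler does
    gdp_norm = (gdp - gdp_min) / (gdp_max - gdp_min)
    manual[code] = B_GDP * gdp_norm + B_LIBDEM * libdem
    print(f"{code}: {manual[code]:.4f}")
```

Poland has the largest GDP and so gets a normalized GDP of 1, while Moldova has the smallest and gets 0, which makes its prediction depend on `v2x_libdem` alone.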


In [None]:

# save the results
ukr_results.to_csv('outputs/ukraine_model_results.csv',index=False)
ukr_results.head()

Finally, we can pickle the model for future use:

In [None]:
results.save("outputs/attraction_model.pickle")