# Country Level Attraction Model

This notebook generates a country-level attraction model that can be used to assess the relative "attractiveness" of safe haven countries to refugees in the case of a forced migration event such as a conflict or natural disaster.

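After testing various functional forms and features, the final model is:

```
refugee_share = b1*GDP + b2*liberal_democracy
```

where `b1` is the coefficient for (within-conflict normalized) GDP and `b2` is the coefficient for the liberal democracy score. The model is fit without an intercept, and its target is each country's share of a conflict's refugees (`pct_tot`) rather than a raw refugee count.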

In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from fuzzywuzzy import process
import statsmodels.api as sm

## Data Preparation

First we load the refugee dataset, which was curated from UNHCR and other sources. We can exclude data on Russia for the time being since it isn't considered a haven country:

In [2]:
data=pd.read_csv('refugee_data/refugee_conflict_5.20.22.csv')

conflicts = '\n\t - '.join(data.conflict.unique())
print(f"This dataset contains refugee data on the following conflicts:\n\t - {conflicts}")

This dataset contains refugee data on the following conflicts:
	 - Afghanistan
	 - Burundi
	 - Central African Republic
	 - Democratic Republic of the Congo
	 - Nigeria
	 - Somalia
	 - South Sudan
	 - Syria
	 - Ukraine
	 - Venezuela


This dataset includes the `country` where refugees went and `individualPerCountry`, which contains the total number of refugees who went to `country` from the `conflict`. It also has a number of features we can try to use to model drivers of refugee migration.

Next, we load and process the liberal democracy index and the access to justice for women index from [V-Dem](https://www.v-dem.net/).

A key step is looking up V-Dem values for each country during the year **_preceding_** the conflict since this will best represent the initial conditions under which refugees made decisions.
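
The preceding-year lookup can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual processing code: the `conflict_start_year` column and all numeric values are assumptions made for the example, and the V-Dem column names `country_name` and `year` are assumed to match the V-Dem export being used.

```python
import pandas as pd

# Toy refugee rows; `conflict_start_year` is a hypothetical column name.
refugees = pd.DataFrame({
    "conflict": ["Syria", "Syria", "Ukraine"],
    "country": ["Turkey", "Jordan", "Poland"],
    "conflict_start_year": [2011, 2011, 2022],
})

# Toy V-Dem extract with made-up scores (not real V-Dem values).
vdem = pd.DataFrame({
    "country_name": ["Turkey", "Turkey", "Jordan", "Poland", "Poland"],
    "year": [2010, 2011, 2010, 2021, 2022],
    "v2x_libdem": [0.50, 0.45, 0.25, 0.60, 0.58],
})

# Look up the year *preceding* the conflict for each row.
refugees["lookup_year"] = refugees["conflict_start_year"] - 1

merged = refugees.merge(
    vdem,
    how="left",  # keep refugee rows even when V-Dem has no match
    left_on=["country", "lookup_year"],
    right_on=["country_name", "year"],
).drop(columns=["country_name", "year"])

print(merged[["conflict", "country", "lookup_year", "v2x_libdem"]])
```

With a left join, any country whose name does not match V-Dem's spelling ends up with a missing `v2x_libdem`, which makes mismatches easy to spot before modeling.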

## Featurization and Normalization

Next we calculate various features that will be relevant to our model training. These include:

- `pct_tot`: total percentage of people from `conflict A` who went to `country Z`
- `bilateral_migration_percap`: the amount of bilateral migration (from `conflict A` to `country Z`) per capita (with respect to `country Z` population)
- `gpd_per_cap`: per capita GDP for `country Z`
- `migrants_per_cap`: total migrants per capita for `country Z`
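
As a rough sketch of how the first two features might be derived, the example below computes them on toy data. The `population` column name and all numbers are assumptions for illustration; only `conflict`, `country`, `individualPerCountry`, `pct_tot`, and `bilateral_migration_percap` come from the dataset description.

```python
import pandas as pd

# Toy data; `population` is a hypothetical column name and all
# numbers are made up for illustration.
df = pd.DataFrame({
    "conflict": ["A", "A", "B", "B"],
    "country": ["W", "X", "Y", "Z"],
    "individualPerCountry": [300, 100, 50, 150],
    "population": [1_000, 2_000, 500, 3_000],
})

# Share of each conflict's refugees received by each country.
conflict_totals = df.groupby("conflict")["individualPerCountry"].transform("sum")
df["pct_tot"] = df["individualPerCountry"] / conflict_totals

# Bilateral migration relative to the haven country's population.
df["bilateral_migration_percap"] = df["individualPerCountry"] / df["population"]

print(df)
```

Using `transform("sum")` keeps the per-conflict total aligned with the original rows, so the shares within each conflict sum to 1 by construction.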

We then scale these using a [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), which normalizes them to the range 0 to 1 on a **_per conflict basis_**.

Normalizing within each conflict is crucial since it facilitates comparison of haven countries for a given conflict and reduces the model's tendency to compare grossly across all conflicts.

For example, imagine two conflicts: one in Western Europe and one in Latin America. The potential haven countries in Western Europe will _all be wealthier_ than those in Latin America (generally speaking). Therefore, normalizing _across_ these conflicts would lead to very low normalized GDPs for all Latin American havens and reduce the model's explanatory power.

Instead, we perform `0 to 1` normalization for all possible haven countries with respect to each conflict so that Latin American countries (in this example) are only normalized relative to their neighbors.
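
The difference between the two approaches can be sketched on toy data. The example below applies the standard min-max formula, `(x - min) / (max - min)`, with plain pandas rather than calling `MinMaxScaler` itself; the `gdp` column name and all values are assumptions for illustration.

```python
import pandas as pd

# Toy data with made-up GDP values; `gdp` is a hypothetical column name.
df = pd.DataFrame({
    "conflict": ["Europe", "Europe", "Europe", "LatAm", "LatAm", "LatAm"],
    "country": ["A", "B", "C", "D", "E", "F"],
    "gdp": [400.0, 300.0, 200.0, 40.0, 30.0, 20.0],
})

def min_max(values: pd.Series) -> pd.Series:
    """Scale a series to [0, 1]: (x - min) / (max - min)."""
    return (values - values.min()) / (values.max() - values.min())

# Normalizing across all conflicts squeezes the poorer region toward 0.
df["gdp_norm_global"] = min_max(df["gdp"])

# Normalizing within each conflict gives every region the full 0-to-1 range.
df["gdp_norm"] = df.groupby("conflict")["gdp"].transform(min_max)

print(df)
```

In this toy case the Latin American havens all land near 0 under global normalization but span 0 to 1 when normalized within their own conflict. Note that this hand-written formula divides by zero for a conflict with a single haven country or identical values, so such groups would need special handling.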

## Preparing for modeling

Next, we remove the Ukraine conflict and its corresponding countries from the dataset so that we can exclude them from model training.

First we create a dataframe of just the Ukraine conflict:

In [4]:
ukr = data[data.conflict=='Ukraine'].copy(deep=True)

Next, we create a dataframe without Ukraine, restricted to bordering countries (`is_bordering == 1`), to train the model:

In [5]:
withoutUkrainData=data[(data.conflict != 'Ukraine') & (data['is_bordering']==1)]

We define the dependent variable as `pct_tot`: the share of refugees from the conflict who went to each country. This normalizes refugee counts _across_ conflicts, since at this point we only care about predicting refugee _shares_, not the total number of refugees:

In [6]:
y=withoutUkrainData['pct_tot']

Here, we set the independent variables for modeling. We choose only `GDP (current US$)_norm` (within-conflict normalized GDP) and `v2x_libdem` (which is not normalized since it is already an index).

Other features were tested; they can simply be added to the array below to try additional features with the model.

In [7]:
features_cols = [
                    'GDP (current US$)_norm', 
                    'v2x_libdem',
                ]
features_normalized = withoutUkrainData[features_cols]

## Modeling

Finally, we are able to run the model itself. We run a multiple regression using Statsmodels:

In [8]:
results=sm.OLS(y,features_normalized.astype(float)).fit()
results.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | pct_tot | R-squared (uncentered): | 0.592 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.571 |
| Method: | Least Squares | F-statistic: | 27.57 |
| Date: | Fri, 20 May 2022 | Prob (F-statistic): | 4.01e-08 |
| Time: | 21:29:53 | Log-Likelihood: | 12.201 |
| No. Observations: | 40 | AIC: | -20.4 |
| Df Residuals: | 38 | BIC: | -17.02 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| GDP (current US$)_norm | 0.2548 | 0.071 | 3.614 | 0.001 | 0.112 | 0.398 |
| v2x_libdem | 0.3878 | 0.139 | 2.781 | 0.008 | 0.105 | 0.670 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 3.06 | Durbin-Watson: | 2.339 |
| Prob(Omnibus): | 0.217 | Jarque-Bera (JB): | 2.159 |
| Skew: | -0.188 | Prob(JB): | 0.34 |
| Kurtosis: | 4.074 | Cond. No. | 2.84 |


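Here we can see that both features are statistically significant at the 1% level (p = 0.001 for GDP, p = 0.008 for liberal democracy) and that the model has an uncentered R-squared of 0.592 for the dependent variable (refugee share). Because the regression is fit without an intercept, this uncentered R-squared is not directly comparable to the centered R-squared of a model that includes a constant.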
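
To make the fitted model concrete, the sketch below applies the reported coefficients by hand. The two haven countries and their feature values are hypothetical; only the coefficients come from the summary above. The final rescaling step mirrors what is done for the Ukraine predictions, so that shares sum to 1.

```python
# Coefficients reported in the regression summary above.
B_GDP = 0.2548
B_LIBDEM = 0.3878

def raw_attraction(gdp_norm: float, libdem: float) -> float:
    """Unscaled predicted refugee share; the model has no intercept."""
    return B_GDP * gdp_norm + B_LIBDEM * libdem

# Hypothetical havens: (within-conflict normalized GDP, V-Dem score).
havens = {"rich_democracy": (1.0, 0.8), "poor_autocracy": (0.0, 0.1)}
raw = {name: raw_attraction(*feats) for name, feats in havens.items()}

# Rescale so the predicted shares sum to 1 across candidate havens.
total = sum(raw.values())
shares = {name: value / total for name, value in raw.items()}
print(shares)
```

Since both coefficients are positive and there is no intercept, a country's raw score can only rise with GDP or liberal democracy, and a country scoring 0 on both features would receive a predicted share of 0.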

## Backcasting Ukraine

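Now we apply the fitted model to the held-out Ukraine data. The raw predictions are rescaled so that the predicted shares sum to 1, which lets us compare them directly with the actual shares.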

In [12]:
# get refugee shares prediction for Ukraine
features_to_predict = ukr[features_cols]
shares = results.predict(features_to_predict)

# add them to Ukraine dataframe
ukr['predicted_shares'] = shares

# take an explicit copy so later assignments don't trigger SettingWithCopyWarning
ukr_results = ukr[['country', 'pct_tot', 'predicted_shares']].copy()

# rescale predictions so that the predicted shares sum to 1
ukr_results['predicted_shares'] = ukr_results['predicted_shares'] / ukr_results['predicted_shares'].sum()
ukr_results = ukr_results.rename(columns={'pct_tot': 'actual', 'predicted_shares': 'predicted'})
display(ukr_results.set_index('country'))


| country | actual | predicted |
|---|---|---|
| Hungary | 0.091808 | 0.107708 |
| Belarus | 0.004001 | 0.024912 |
| Moldova | 0.068523 | 0.120277 |
| Poland | 0.504044 | 0.184235 |
| Romania | 0.13818 | 0.169992 |
| Russian Federation | 0.130068 | 0.192083 |
| Slovakia | 0.063375 | 0.200793 |

In [13]:
mean_squared_error(ukr_results.actual, ukr_results.predicted)

0.018483978708818637
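
Because MSE averages squared errors, a single large miss can dominate it. The sketch below recomputes the metric with NumPy from the values in the results table and breaks the squared error down by country; it is an illustrative check rather than part of the original pipeline.

```python
import numpy as np

# Actual and predicted shares copied from the results table above.
countries = ["Hungary", "Belarus", "Moldova", "Poland", "Romania",
             "Russian Federation", "Slovakia"]
actual = np.array([0.091808, 0.004001, 0.068523, 0.504044,
                   0.13818, 0.130068, 0.063375])
predicted = np.array([0.107708, 0.024912, 0.120277, 0.184235,
                      0.169992, 0.192083, 0.200793])

squared_errors = (actual - predicted) ** 2
mse = squared_errors.mean()
rmse = np.sqrt(mse)  # same units as the shares themselves

# Fraction of the total squared error contributed by each country.
contribution = squared_errors / squared_errors.sum()
worst = countries[int(np.argmax(contribution))]
print(f"MSE={mse:.6f}, RMSE={rmse:.4f}, largest error: {worst}")
```

The root mean squared error is often easier to interpret here, since it is expressed in the same units as the refugee shares.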

In [15]:
# save the results
ukr_results.to_csv('outputs/ukraine_model_results.csv',index=False)
ukr_results.head()

| | country | actual | predicted |
|---|---|---|---|
| 46 | Hungary | 0.091808 | 0.107708 |
| 47 | Belarus | 0.004001 | 0.024912 |
| 48 | Moldova | 0.068523 | 0.120277 |
| 49 | Poland | 0.504044 | 0.184235 |
| 50 | Romania | 0.13818 | 0.169992 |


Finally, we can pickle the model for future use:

In [16]:
results.save("outputs/attraction_model.pickle")