## Normalized Edge Weights

Calculate the share of persons of concern relative to the population of their country of origin.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_population = pd.read_csv('../data/world_population_by_year.csv', skiprows=4)

df_country_list = pd.read_csv('../data/country_list.csv')

df_countries = pd.read_csv('../data/UNHCR_country_names.csv')
df_countries.columns = ["iso","unhcr","display_titles","display_article","notes"]

df_timeseries = pd.read_csv('../data/unhcr_popstats_export_time_series_all_data.csv', skiprows=3, encoding='latin-1', dtype={"Value": object})
df_timeseries = df_timeseries.replace(to_replace='*', value='2')
df_timeseries['Value'] = df_timeseries['Value'].astype(np.int64)
df_timeseries.columns = ["year","origin","destination","type","value"]
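The `*` handling above can be tried on a toy column. This is a minimal sketch, assuming `*` stands for a small redacted count and that 2 is an acceptable stand-in, as in the cell above. The `raw` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the raw "Value" column read as strings.
raw = pd.DataFrame({"Value": ["180000", "*", "55000"]})

# Swap the placeholder for the stand-in count, then cast to integers.
raw["Value"] = raw["Value"].replace("*", "2").astype(np.int64)

print(raw["Value"].tolist())
```

Reading the column as `object` first matters here: with a mixed column, the cast to `int64` only succeeds once every placeholder has been replaced.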

1. Add ISO codes
2. Get population from csv for each country / year
3. Calculate share

Rename countries so that they match the names used in `UNHCR_country_names.csv`.

In [3]:
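# Assign the result back: calling replace(..., inplace=True) on a selected column may not update the DataFrame in recent pandas versions.
df_timeseries['origin'] = df_timeseries['origin'].replace("Palestinian", "State of Palestine")
# NOTE: this relabels North Korea as "Rep. of Korea", which is normally South Korea's label; verify the resulting ISO code.
df_timeseries['origin'] = df_timeseries['origin'].replace("Dem. People's Rep. of Korea", "Rep. of Korea")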

In [4]:
# Use UNHCR codes from country_list.csv
for index, country in df_country_list.iterrows():
    df_timeseries.loc[df_timeseries['origin'] == country['name_en'], 'iso'] = country['country_code']
    
# Overwrite codes with UNHCR_country_names.csv when possible
for index, country in df_countries.iterrows():
    df_timeseries.loc[df_timeseries['origin'] == country['unhcr'], 'iso'] = country['iso']

# Manual fixes
df_timeseries['iso'] = df_timeseries['iso'].replace("WES", "WSM") # Samoa
df_timeseries['iso'] = df_timeseries['iso'].replace("SEY", "SYC") # Seychelles

df_timeseries.head()

Unnamed: 0,year,origin,destination,type,value,iso
0,1951,Australia,Various/Unknown,Refugees (incl. refugee-like situations),180000,AUS
1,1951,Austria,Various/Unknown,Refugees (incl. refugee-like situations),282000,AUT
2,1951,Belgium,Various/Unknown,Refugees (incl. refugee-like situations),55000,BEL
3,1951,Canada,Various/Unknown,Refugees (incl. refugee-like situations),168511,CAN
4,1951,"China, Hong Kong SAR",Various/Unknown,Refugees (incl. refugee-like situations),30000,HKG
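The two lookup loops above could also be written as a mapping. The following is a sketch only, using toy stand-ins for `df_timeseries`, `df_country_list` and `df_countries`. The sample rows are illustrative, and the sketch assumes each name appears only once in each lookup table.

```python
import pandas as pd

# Toy stand-ins for the three tables (values are illustrative).
edges = pd.DataFrame({"origin": ["Austria", "Samoa", "Various/Unknown"]})
country_list = pd.DataFrame({"name_en": ["Austria", "Samoa"], "country_code": ["AUT", "WES"]})
unhcr_names = pd.DataFrame({"unhcr": ["Samoa"], "iso": ["WSM"]})

# Build name -> code lookups; the second table takes precedence where it has an entry.
fallback = country_list.set_index("name_en")["country_code"]
preferred = unhcr_names.set_index("unhcr")["iso"]

# Use the preferred code when present, otherwise fall back to the first table.
edges["iso"] = edges["origin"].map(preferred).combine_first(edges["origin"].map(fallback))
print(edges)
```

Origins found in neither table, such as `Various/Unknown`, end up with a missing code, which matches what the loops produce.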


Check whether any origin other than `'Various/Unknown'` is still missing a code. The result is empty, so every country has a code.

In [5]:
df_timeseries.loc[df_timeseries['iso'].isnull() & (df_timeseries['origin'] != "Various/Unknown")]

Unnamed: 0,year,origin,destination,type,value,iso


For the following codes, the population file has no entry at all (`nan` belongs to the `'Various/Unknown'` rows). In these cases, we could fall back on a manually entered figure from another data source.

In [7]:
for iso in df_timeseries['iso'].unique():
    if not any(df_population['Country Code'] == iso):
        print(iso)

nan
GUF
ANT
MSR
AIA
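The same check can be expressed as a set difference, which also drops the `nan` entry. This is a minimal sketch on toy data; only the `Country Code` column name is taken from the notebook, the rows are invented.

```python
import pandas as pd

# Toy stand-ins; "Country Code" matches the population file's column name.
edges = pd.DataFrame({"iso": ["AUT", "GUF", None, "AIA", "AUT"]})
population = pd.DataFrame({"Country Code": ["AUT", "BEL"]})

# Codes used by the edges (ignoring missing values) minus codes the population table knows.
missing = sorted(set(edges["iso"].dropna()) - set(population["Country Code"]))
print(missing)
```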


Add a column for the share. It holds a fraction of the population (value / population), not a percentage.

In [9]:
df_timeseries['share'] = np.nan

This step runs only once to create the new CSV file, so performance is secondary.

In [10]:
for _, country in df_population.iterrows():
    matches = df_timeseries.loc[(df_timeseries['iso'] == country['Country Code']) & (df_timeseries['year'] >= 1960)]
    for index, edge in matches.iterrows():
        year = str(edge['year'])  # population columns are named by year
        df_timeseries.loc[index, 'share'] = edge['value'] / country[year]
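If the step ever needs to run faster, the share can be computed without row loops by reshaping the population table and joining on country and year. This is a sketch on toy data, not the notebook's method. It assumes the population table contains only `Country Code` plus one column per year, so the real file's other columns would have to be dropped first.

```python
import pandas as pd

# Toy wide population table: one column per year (values are made up).
population = pd.DataFrame({
    "Country Code": ["AUT", "BEL"],
    "1960": [7_000_000, 9_000_000],
    "1961": [7_100_000, 9_050_000],
})
edges = pd.DataFrame({
    "year": [1960, 1961, 1960],
    "iso": ["AUT", "BEL", "XXX"],
    "value": [7_000, 18_100, 5],
})

# Reshape to one row per (country, year) so it can be joined on both keys.
long_pop = population.melt(id_vars="Country Code", var_name="year", value_name="population")
long_pop = long_pop.rename(columns={"Country Code": "iso"})
long_pop["year"] = long_pop["year"].astype(int)

# A left join keeps every edge; edges without a population get a missing share.
merged = edges.merge(long_pop, how="left", on=["iso", "year"])
merged["share"] = merged["value"] / merged["population"]
print(merged[["year", "iso", "value", "share"]])
```

Edges whose code or year is absent from the population table keep a missing share, so the same follow-up check for null shares still applies.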

**Missing shares**:

* SRB: Data only from 1990 on
* GUF: No data
* KWT: Missing data for 1992 to 1994
* ANT: No data
* MSR: No data
* ERI: Missing data from 2012 on
* AIA: No data

In [11]:
df_timeseries.loc[df_timeseries['share'].isnull() & (df_timeseries['year'] >= 1960) & (df_timeseries['origin'] != "Various/Unknown"), 'iso'].unique()

array(['SRB', 'GUF', 'KWT', 'ANT', 'MSR', 'ERI', 'AIA'], dtype=object)

Fill in the missing shares using fixed fallback populations.

In [18]:
# Each case selects its rows with a boolean mask and fills them in one assignment.

# ANT
mask = (df_timeseries['iso'] == 'ANT') & (df_timeseries['year'] >= 1960)
df_timeseries.loc[mask, 'share'] = df_timeseries.loc[mask, 'value'] / 304759 # Source: https://en.wikipedia.org/wiki/Netherlands_Antilles (Accessed February 12, 2019)

# MSR
# TODO: 296711 is identical to the GUF figure below and is probably a copy-paste error; verify against the cited page.
mask = (df_timeseries['iso'] == 'MSR') & (df_timeseries['year'] >= 1960)
df_timeseries.loc[mask, 'share'] = df_timeseries.loc[mask, 'value'] / 296711 # Source: https://en.wikipedia.org/wiki/Montserrat (Accessed February 12, 2019)

# AIA
mask = (df_timeseries['iso'] == 'AIA') & (df_timeseries['year'] >= 1960)
df_timeseries.loc[mask, 'share'] = df_timeseries.loc[mask, 'value'] / 14764 # Source: https://en.wikipedia.org/wiki/Anguilla (Accessed February 12, 2019)

# GUF (French Guiana)
# TODO: the cited URL is the Netherlands Antilles article, not one about French Guiana; verify the source.
mask = (df_timeseries['iso'] == 'GUF') & (df_timeseries['year'] >= 1960)
df_timeseries.loc[mask, 'share'] = df_timeseries.loc[mask, 'value'] / 296711 # Source: https://en.wikipedia.org/wiki/Netherlands_Antilles (Accessed February 12, 2019)

# KWT
mask = (df_timeseries['iso'] == 'KWT') & (df_timeseries['year'] >= 1992) & (df_timeseries['year'] <= 1994)
df_timeseries.loc[mask, 'share'] = df_timeseries.loc[mask, 'value'] / ((2035661 + 1610651) / 2) # Average of 1991 and 1995 (world_population_by_year.csv)

# SRB
mask = (df_timeseries['iso'] == 'SRB') & (df_timeseries['year'] <= 1989)
df_timeseries.loc[mask, 'share'] = df_timeseries.loc[mask, 'value'] / 7586000 # From 1990 (world_population_by_year.csv)

# ERI
mask = (df_timeseries['iso'] == 'ERI') & (df_timeseries['year'] >= 2012)
df_timeseries.loc[mask, 'share'] = df_timeseries.loc[mask, 'value'] / 4474690 # From 2011 (world_population_by_year.csv)

In [19]:
df_timeseries.loc[df_timeseries['share'].isnull() & (df_timeseries['year'] >= 1960) & (df_timeseries['origin'] != "Various/Unknown"), 'iso'].unique()

array([], dtype=object)

**TODO:**
* ANT -> NLD? -> fix population
* Fix population for GUF, MSR, AIA
* KWT, ERI: Use data from other years
* How to handle various/unknown?
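For the "use data from other years" item, one possible approach is to fill gaps in a country's yearly population before computing shares. This is a sketch under assumptions: the series `pop` and its values are invented, and linear interpolation plus forward fill is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly populations with a gap, indexed by year (values are made up).
pop = pd.Series([100.0, np.nan, np.nan, np.nan, 140.0], index=[1991, 1992, 1993, 1994, 1995])

# Linear interpolation fills interior gaps (the KWT case);
# ffill would carry the last known value forward for trailing gaps (the ERI case).
filled = pop.interpolate(method="linear").ffill()
print(filled.tolist())
```

Unlike a single averaged figure, this gives each missing year its own estimate. Leading gaps such as SRB before 1990 would still need a separate decision, for example a backward fill.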

In [80]:
df_timeseries.to_csv('../data/unhcr_time_series_normalized.csv', encoding='utf-8', index=False)