# Data Preprocessing


Classes and functions can be imported from a script in the same folder by using the file's name without the `.py` extension. In this case the file is *data_cleaning_tools.py*, so we import it as shown below. It is also possible to set an alias (here `tls`) for ease of use.

In [1]:
import data_cleaning_tools as tls
import pandas as pd

The data cleaning functions are methods of the **data_cleaner** class. To use them, we create an instance of the class and call its methods as needed.

In [2]:
cleaner = tls.data_cleaner()
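The import-and-instantiate pattern used above can be tried in isolation. The sketch below writes a tiny stand-in module to a temporary folder and imports it by file name; `demo_tools`, `strip_names` and the sample names are hypothetical and only illustrate the mechanism, they are not part of `data_cleaning_tools`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in module; the real data_cleaning_tools.py is not shown here.
module_source = '''
class data_cleaner:
    def strip_names(self, names):
        """Trim surrounding whitespace from each country name."""
        return [name.strip() for name in names]
'''

with tempfile.TemporaryDirectory() as folder:
    Path(folder, "demo_tools.py").write_text(module_source)
    # Putting the folder on sys.path mimics running a notebook from that folder.
    sys.path.insert(0, folder)
    # Same effect as writing: import demo_tools as tls_demo
    tls_demo = importlib.import_module("demo_tools")
    sys.path.remove(folder)

# Create one cleaner object and call its methods when needed.
demo_cleaner = tls_demo.data_cleaner()
cleaned_names = demo_cleaner.strip_names([" Albania", "Angola "])
print(cleaned_names)
```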

## 1. Import and clean target and predictor variables

### 1.1 Climate factors

1. Read and clean the rainfall, temperature and water resources datasets
2. Replace country names with country codes
3. Convert monthly data to annual data:
    - Annual rainfall = sum of monthly rainfall
    - Annual temperature = weighted average of monthly temperatures, with weights equal to the number of days in each month
4. Average the annual data over the period from 2013 to 2017
5. Merge the datasets
6. Keep only the following columns:
    - **Rain:** Total Rainfall (mm)
    - **Temp:** Temperature (°C)
    - **IRWR:** Total Internal Renewable Water Resources
    - **ERWR:** Total External Renewable Water Resources
    - **TRWR:** Total Renewable Water Resources
    - **Dep_ratio:** Dependency ratio
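The annual aggregation in step 3 and the period average in step 4 can be sketched as follows. This is a minimal illustration, not the `climate_factors` method itself: the `monthly` frame, its `Year` and `Month` columns and all values are made up for the example.

```python
import calendar

import pandas as pd

# Hypothetical monthly records for one country in 2013 (values are made up).
monthly = pd.DataFrame({
    "Country": ["AAA"] * 12,
    "Year": [2013] * 12,
    "Month": list(range(1, 13)),
    "Rain": [10.0] * 12,               # mm per month
    "Temp": [0.0] * 6 + [20.0] * 6,    # monthly mean in °C
})

# Number of days in each month; monthrange accounts for leap years.
monthly["days"] = [
    calendar.monthrange(int(year), int(month))[1]
    for year, month in zip(monthly["Year"], monthly["Month"])
]
monthly["temp_x_days"] = monthly["Temp"] * monthly["days"]

# Annual rainfall is a plain sum; annual temperature is a day-weighted mean.
annual = monthly.groupby(["Country", "Year"], as_index=False).agg(
    Rain=("Rain", "sum"),
    temp_x_days=("temp_x_days", "sum"),
    days=("days", "sum"),
)
annual["Temp"] = annual["temp_x_days"] / annual["days"]

# Average the annual values over 2013 to 2017 (inclusive) for each country.
in_period = annual[annual["Year"].between(2013, 2017)]
period_mean = in_period.groupby("Country", as_index=False)[["Rain", "Temp"]].mean()
print(period_mean)
```

Weighting by days matters only slightly for temperature, but it keeps short months such as February from counting as much as 31-day months.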



In [3]:
df_climate_factors = cleaner.climate_factors('raw data/WORLDBANK_rainfall.csv', 'raw data/WORLDBANK_temperature.csv', 'raw data/AQUASTAT_water_resources.csv')

In [4]:
df_climate_factors = df_climate_factors[['Country','Temp','Rain','IRWR','ERWR','TRWR','Dep_ratio']]

In [5]:
df_climate_factors.to_csv('clean data/climate_factors.csv')
df_climate_factors

Unnamed: 0,Country,Temp,Rain,IRWR,ERWR,TRWR,Dep_ratio
0,AFG,14.074742,349.736945,47.1500,18.18,65.3300,0.278280
1,AGO,22.182196,960.024065,148.0000,0.40,148.4000,0.002695
2,ALB,12.754647,1079.459168,26.9000,3.30,30.2000,0.109272
3,AND,12.402212,760.241065,0.3156,0.00,0.3156,0.000000
4,ARE,28.010773,64.449765,0.1500,0.00,0.1500,0.000000
...,...,...,...,...,...,...,...
197,WSM,27.578074,3162.300825,0.0000,0.00,0.0000,
198,YEM,24.211854,161.796177,2.1000,0.00,2.1000,0.000000
199,ZAF,18.620716,403.002933,44.8000,6.55,51.3500,0.127556
200,ZMB,22.412762,895.648672,80.2000,24.60,104.8000,0.234733
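Note that `to_csv` writes the row index as a first, unnamed column unless `index=False` is passed, and pandas labels that column `Unnamed: 0` when the file is read back. The sketch below shows both variants on a two-row frame using in-memory CSV text instead of files; the frame name `frame` is only for illustration.

```python
import io

import pandas as pd

# Two rows copied from the climate table above.
frame = pd.DataFrame({
    "Country": ["AFG", "AGO"],
    "Temp": [14.074742, 22.182196],
})

# Default export: the index is written as an extra leading column.
csv_with_index = frame.to_csv()
reread_with_index = pd.read_csv(io.StringIO(csv_with_index))

# Export without the index: only the data columns are written.
csv_without_index = frame.to_csv(index=False)
reread_without_index = pd.read_csv(io.StringIO(csv_without_index))

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```

Alternatively, a file written with the index can be read back with `pd.read_csv(path, index_col=0)` to restore it as the index.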


### 1.2 Water stress indicators

1. Read and clean the water stress indicator dataset
2. Replace country names with country codes
3. Keep only the water stress indicators for the period from 2013 to 2017
4. Rename columns:
    - **WS_MDG:** MDG 7.5. Freshwater withdrawal as % of total renewable water resources
    - **WUE_SDG:** SDG 6.4.1. Water Use Efficiency
    - **WS_SDG:** SDG 6.4.2. Water Stress
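The filtering and renaming steps can be sketched as below. This assumes a long-format input with hypothetical `Variable Name`, `Year` and `Value` columns and averages repeated values within the period; the actual `water_stress` method may work differently, and all numbers are made up.

```python
import pandas as pd

# Hypothetical long-format records: one row per country, indicator and year.
long_df = pd.DataFrame({
    "Country": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Variable Name": [
        "SDG 6.4.2. Water Stress",
        "SDG 6.4.2. Water Stress",
        "SDG 6.4.1. Water Use Efficiency",
        "SDG 6.4.2. Water Stress",
        "SDG 6.4.1. Water Use Efficiency",
    ],
    "Year": [2012, 2017, 2017, 2017, 2017],
    "Value": [50.0, 54.0, 1.0, 10.0, 13.0],
})

# Keep only observations from 2013 to 2017 (inclusive).
in_period = long_df[long_df["Year"].between(2013, 2017)]

# One row per country, one column per indicator.
wide = in_period.pivot_table(
    index="Country", columns="Variable Name", values="Value", aggfunc="mean"
)

# Shorten the indicator names to the column names used in this notebook.
wide = wide.rename(columns={
    "SDG 6.4.1. Water Use Efficiency": "WUE_SDG",
    "SDG 6.4.2. Water Stress": "WS_SDG",
})
print(wide)
```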

In [6]:
df_waterstress = cleaner.water_stress('raw data/AQUASTAT_water_stress.csv')

In [7]:
df_waterstress.to_csv('clean data/water_stress.csv')
df_waterstress

Country,WS_MDG,WUE_SDG,WS_SDG
AFG,31.045462,0.923778,54.757019
AGO,0.475539,142.467836,1.871883
ALB,3.933775,6.656907,7.139423
ARE,1708.000000,92.773763,1708.000000
ARG,4.301333,13.616564,10.456664
...,...,...,...
VNM,9.259150,2.349448,18.130315
YEM,169.761905,5.219357,169.761905
ZAF,37.740993,14.659097,62.055716
ZMB,1.500000,12.764894,2.835498


### 1.3 Socio-economic factors

1. Read and clean the AQUASTAT and UNICEF datasets
2. Replace country names with country codes
3. Average the annual data over the period from 2013 to 2017
4. Merge the datasets
5. Keep only the following columns:
    - **HDI:** Human Development Index (HDI)
    - **rural_pop:** Rural population (1000 inhab)
    - **urban_pop:** Urban population (1000 inhab)
    - **rural_water:** Rural population with access to safe drinking-water (JMP) (%)
    - **urban_water:** Urban population with access to safe drinking-water (JMP) (%)
    - **r_u:** Ratio of rural to urban population
    - **r_u_access:** Ratio of rural to urban safe water access
    - **life_ex:** Life expectancy at birth, total (years)
    - **mort_rate:** Mortality rate, infant (per 1,000 live births)
    - **pop_growth:** Population growth (annual %)
    - **GDP_pcp:** GDP per capita, PPP (constant 2011 international $)
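The two ratio columns are simple quotients. The sketch below assumes `r_u` is `rural_pop / urban_pop`, which reproduces the AFG value in this section's output (2.960340), and that `r_u_access` is built the same way from `rural_water` and `urban_water`. The latter is an assumption, and the access percentages used here are placeholders rather than source data.

```python
import pandas as pd

socio = pd.DataFrame({
    "Country": ["AFG"],
    "rural_pop": [26558.609],   # 1000 inhabitants, AFG row of this section's output
    "urban_pop": [8971.472],    # 1000 inhabitants, AFG row of this section's output
    "rural_water": [50.0],      # % with access, placeholder value (not source data)
    "urban_water": [80.0],      # % with access, placeholder value (not source data)
})

# Ratio of rural to urban population.
socio["r_u"] = socio["rural_pop"] / socio["urban_pop"]

# Ratio of rural to urban access to safe drinking water (assumed definition).
socio["r_u_access"] = socio["rural_water"] / socio["urban_water"]

print(socio[["Country", "r_u", "r_u_access"]].round(6))
```

A value of `r_u_access` below 1 means that rural access lags behind urban access.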



In [8]:
df_socioec_factors = cleaner.socioecon_factors('raw data/aquastat_socio_economic.csv', 'raw data/unicef_socio_economic.csv')

In [9]:
df_socioec_factors = df_socioec_factors[['Country', 'rural_pop', 'urban_pop', 'HDI', 'r_u', 'r_u_access', 'pop_growth', 'mort_rate', 'GDP_pcp', 'life_ex']]

In [10]:
df_socioec_factors.to_csv('clean data/socioec_factors.csv')
df_socioec_factors

Unnamed: 0,Country,rural_pop,urban_pop,HDI,r_u,r_u_access,pop_growth,mort_rate,GDP_pcp,life_ex
0,AFG,26558.609,8971.472,0.493,2.960340,0.601023,3.06,53.2,2226.0,63.4
1,AGO,10472.554,19311.639,0.576,0.542292,0.374005,3.44,58.6,7859.4,59.2
2,ALB,1190.155,1740.032,0.789,0.683985,1.003161,-0.20,8.6,12227.4,78.0
3,ARE,1292.709,8107.436,0.864,0.159447,1.004016,0.74,7.0,64243.0,77.2
4,ARG,3652.804,40618.237,0.832,0.089930,1.010101,1.08,10.2,23732.2,76.0
...,...,...,...,...,...,...,...,...,...,...
136,VNM,61898.302,33642.498,0.69,1.839884,0.977800,1.04,17.2,6455.0,75.0
137,WSM,160.194,36.246,0.706,4.419633,1.018462,0.66,14.6,6120.0,72.8
138,YEM,18075.808,10174.612,0.463,1.776560,0.645833,2.58,43.0,,66.0
139,ZAF,19369.002,37348.154,0.704,0.518607,0.817269,1.52,28.6,12796.6,62.6


### 1.4 Merge Datasets

In [11]:
df = pd.merge(df_waterstress,df_climate_factors,on=['Country'],how='outer')
df = pd.merge(df, df_socioec_factors,on=['Country'],how='outer')
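An outer merge keeps every country that appears in either table and fills the columns that have no match with NaN. As a small illustration with made-up frames (`left`, `right` and their values are hypothetical), passing `indicator=True` adds a `_merge` column that shows where each row came from:

```python
import pandas as pd

# Made-up stand-ins for two of the cleaned tables.
left = pd.DataFrame({"Country": ["AAA", "BBB"], "WS_MDG": [1.0, 2.0]})
right = pd.DataFrame({"Country": ["BBB", "CCC"], "Temp": [10.0, 20.0]})

# Outer join on the country code; _merge records the origin of each row.
merged = pd.merge(left, right, on="Country", how="outer", indicator=True)

# Map each country to 'left_only', 'right_only' or 'both'.
origin = merged.set_index("Country")["_merge"].astype(str).to_dict()
print(merged)
print(origin)
```

Rows marked `left_only` or `right_only` are the ones that end up with missing values after the merge.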

### 1.5 Remove NaNs

In [12]:
df.isna().sum()

Country        0
WS_MDG        24
WUE_SDG       34
WS_SDG        24
Temp           7
Rain           7
IRWR           0
ERWR           0
TRWR           0
Dep_ratio     24
rural_pop     61
urban_pop     61
HDI           61
r_u           61
r_u_access    61
pop_growth    61
mort_rate     61
GDP_pcp       64
life_ex       62
dtype: int64

In [13]:
df.dropna(inplace=True)
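With default arguments, `dropna` removes every row that contains at least one missing value, so a country is lost if any single column is empty. Before dropping, it can be useful to record which countries will be removed. A minimal sketch with made-up data (`df_demo` and its values are hypothetical):

```python
import pandas as pd

# Made-up frame in which two of three countries have one missing value each.
df_demo = pd.DataFrame({
    "Country": ["AAA", "BBB", "CCC"],
    "Temp": [10.0, None, 20.0],
    "GDP_pcp": [1000.0, 2000.0, None],
})

rows_before = len(df_demo)

# Rows that contain no missing values at all.
complete = df_demo.dropna()

# Countries whose rows are removed by dropna.
dropped = df_demo.loc[~df_demo.index.isin(complete.index), "Country"].tolist()

print(f"{rows_before - len(complete)} of {rows_before} rows dropped: {dropped}")
```

If too many countries are lost, `dropna(subset=[...])` restricts the check to the columns that are strictly required.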

### 1.6 Export Final Dataset

In [14]:
df.reset_index(inplace=True, drop=True)
df.to_csv('clean data/final_data.csv')
df

Unnamed: 0,Country,WS_MDG,WUE_SDG,WS_SDG,Temp,Rain,IRWR,ERWR,TRWR,Dep_ratio,rural_pop,urban_pop,HDI,r_u,r_u_access,pop_growth,mort_rate,GDP_pcp,life_ex
0,AFG,31.045462,0.923778,54.757019,14.074742,349.736945,47.15,18.18,65.33,0.278280,26558.609,8971.472,0.493,2.960340,0.601023,3.06,53.2,2226.0,63.4
1,AGO,0.475539,142.467836,1.871883,22.182196,960.024065,148.00,0.40,148.40,0.002695,10472.554,19311.639,0.576,0.542292,0.374005,3.44,58.6,7859.4,59.2
2,ALB,3.933775,6.656907,7.139423,12.754647,1079.459168,26.90,3.30,30.20,0.109272,1190.155,1740.032,0.789,0.683985,1.003161,-0.20,8.6,12227.4,78.0
3,ARE,1708.000000,92.773763,1708.000000,28.010773,64.449765,0.15,0.00,0.15,0.000000,1292.709,8107.436,0.864,0.159447,1.004016,0.74,7.0,64243.0,77.2
4,ARG,4.301333,13.616564,10.456664,14.767043,598.510300,292.00,584.24,876.24,0.666758,3652.804,40618.237,0.832,0.089930,1.010101,1.08,10.2,23732.2,76.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117,USA,14.480160,42.378501,28.161984,8.218411,708.156050,2818.00,251.00,3069.00,0.081786,58215.947,266243.516,0.919,0.218657,0.987928,0.68,6.0,58237.6,79.0
118,UZB,120.523839,1.337755,168.913106,13.707806,234.213107,16.34,32.53,48.87,0.665644,15779.684,16130.957,0.707,0.978224,0.821320,1.68,20.4,6037.2,70.8
119,VNM,9.259150,2.349448,18.130315,24.860117,1879.263025,359.42,524.70,884.12,0.593471,61898.302,33642.498,0.69,1.839884,0.977800,1.04,17.2,6455.0,75.0
120,ZAF,37.740993,14.659097,62.055716,18.620716,403.002933,44.80,6.55,51.35,0.127556,19369.002,37348.154,0.704,0.518607,0.817269,1.52,28.6,12796.6,62.6


## Still to do: Projected temperatures - 2020-2039

In [15]:
df_temp_proj = cleaner.temp_proj('raw data/tas_2020_2039_mavg_rcp26_AFG_CAF.csv')

In [16]:
df_temp_proj.to_csv('clean data/projected_temp_2020-2039.csv')
df_temp_proj

Unnamed: 0,Country,Temperature (°C)
0,AFG,12.369804
1,AGO,20.311982
2,ALB,14.062129
3,AND,12.352881
4,ARE,24.896902
...,...,...
188,XRK,11.533050
189,YEM,22.247077
190,ZAF,15.588705
191,ZMB,19.578873
