# Between model
This model takes as input all static variables, i.e. the OSM, ESA Landcover, and WSF variables. In addition, it takes the cluster-level mean of each dynamic variable. The dynamic variables are Nightlights, NDVI, NDWI_Gao, and NDWI_McF.

The idea is that the between model captures variation between clusters. Its target variable is therefore the cluster mean $\bar{w}_c = \frac{1}{T_c}\sum_{t=1}^{T_c} w_{ct}$, where $T_c$ is the number of periods in which cluster $c$ is observed.
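
As a minimal sketch of how such a cluster-mean target can be computed, the snippet below averages a toy label column within each cluster and keeps one row per cluster. The column names `w` and `w_bar` and the toy values are assumptions for illustration only; the notebook builds its own averaged targets with `groupby('cluster_id')[...].transform('mean')` when loading the label data.

```python
import pandas as pd

# Toy panel: two clusters observed in several survey waves (illustrative values)
toy = pd.DataFrame({
    "cluster_id": ["a", "a", "a", "b", "b"],
    "w": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Cluster mean, broadcast back to every cluster-year row
toy["w_bar"] = toy.groupby("cluster_id")["w"].transform("mean")

# One row per cluster: the shape of table a between model would be trained on
between_targets = (
    toy[["cluster_id", "w_bar"]].drop_duplicates().reset_index(drop=True)
)
print(between_targets)
```

Using `transform` rather than `agg` keeps the mean aligned with the original rows, which is convenient when the same column is later subtracted to form the within target.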

# Within model
The goal of this model is to predict the deviations from the cluster mean for each year, i.e. the model should capture variation within each cluster. To do so, the target variable is $\tilde{w}_{ct} = w_{ct} - \bar{w}_{c}$. 

For cluster $c$ in time period $t$, the feature vector is defined as $\tilde{\boldsymbol{x}}_{ct} = \boldsymbol{x}_{ct} - \bar{\boldsymbol{x}}_{c}$, where $\bar{\boldsymbol{x}}_{c} \in \mathbb{R}^{k\times1}$. 

To predict $\tilde{w}_{ct}$, I rely on $\tilde{\boldsymbol{x}}_{ct}$. This allows me to interpret the performance metric as the within $R^2$, i.e. the share of the within-cluster variance that the model captures. 
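
The demeaning step and the within $R^2$ can be illustrated on toy data. This is only a sketch: `demean_by_cluster` is a hypothetical helper written for this example (it is not the `demean_df` function from `analysis_utils`), the column names and values are made up, and the "predictions" are a placeholder rather than the output of a trained model.

```python
import pandas as pd
from sklearn.metrics import r2_score

# Toy panel with one target (w) and one feature (x); values are illustrative
toy = pd.DataFrame({
    "cluster_id": ["a", "a", "a", "b", "b"],
    "w": [1.0, 2.0, 3.0, 4.0, 6.0],
    "x": [0.1, 0.2, 0.4, 1.0, 1.5],
})

def demean_by_cluster(frame, cols, group_col="cluster_id"):
    """Subtract the cluster mean from each listed column (hypothetical helper)."""
    out = frame.copy()
    out[cols] = frame[cols] - frame.groupby(group_col)[cols].transform("mean")
    return out

demeaned = demean_by_cluster(toy, ["w", "x"])

# Placeholder predictions of the demeaned target; a real model would
# produce these from the demeaned features
w_tilde_hat = 0.5 * demeaned["w"]

# R^2 on demeaned targets: share of within-cluster variance captured
within_r2 = r2_score(demeaned["w"], w_tilde_hat)
print(within_r2)
```

Because the demeaned target has mean zero within every cluster, the total sum of squares in this $R^2$ contains only within-cluster variation, which is what justifies the "within" interpretation.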


(This does not help at all, thus disregard.)
To augment the number of training observations, I train the model on deltas rather than on the demeaned variables. This substantially increases the number of training observations and covers a wider range of differences, making the training dataset more versatile and robust. Ideally, this helps the model learn from a wider range of differences and thus increases the out-of-sample performance when predicting $\tilde{w}_{ct}$.

In [1]:
import numpy as np  # used below via np.unique
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import pickle
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

In [2]:
# load the necessary functions from the analysis package

# load the variable names; this gives compact access to the variables in the feature data
from analysis_utils.variable_names import *

# load flagged ids 
from analysis_utils.flagged_uids import *

# load the functions to do spatial k-fold CV
from analysis_utils.spatial_CV import *

# load the helper functions
from analysis_utils.analysis_helpers import *

# load the random forest trainer and cross_validator
import analysis_utils.RandomForest as rf

# load the combined model
from analysis_utils.CombinedModel import CombinedModel

In [3]:
# set the global file paths
root_data_dir = "../../Data"

# the lsms data
lsms_pth = f"{root_data_dir}/lsms/processed/labels_cluster_v1.csv"

# the feature data
feat_data_pth = f"{root_data_dir}/feature_data/tabular_data.csv"

# set the random seed
random_seed = 423
spatial_cv_random_seed = 348

# set the number of folds for k-fold CV
n_folds = 5

In [4]:
# load the feature and the label data
lsms_df = pd.read_csv(lsms_pth)
# remove flagged ids from dataset
lsms_df = lsms_df[~lsms_df.unique_id.isin(flagged_uids)].reset_index()
lsms_df['avg_log_mean_pc_cons_usd_2017'] = lsms_df.groupby('cluster_id')['log_mean_pc_cons_usd_2017'].transform('mean')
lsms_df['avg_mean_asset_index_yeh'] = lsms_df.groupby('cluster_id')['mean_asset_index_yeh'].transform('mean')
feat_df = pd.read_csv(feat_data_pth)

# describe the training data broadly
print(f"Number of observations {len(lsms_df)}")
print(f"Number of clusters {len(np.unique(lsms_df.cluster_id))}")
print(f"Number of x vars {len(feat_df.columns)-2}")

Number of observations 6401
Number of clusters 2128
Number of x vars 113


In [5]:
# merge the label and the feature data to one dataset
lsms_vars = ['unique_id', 'n_households',           
             'log_mean_pc_cons_usd_2017', 'avg_log_mean_pc_cons_usd_2017',
             'mean_asset_index_yeh', 'avg_mean_asset_index_yeh']
df = pd.merge(lsms_df[lsms_vars], feat_df, on = 'unique_id', how = 'left')

# Run Training

In [6]:
# define the within and between x variables
avg_rs_vars = avg_ndvi_vars + avg_ndwi_gao_vars + avg_nl_vars
osm_vars = osm_dist_vars + osm_count_vars + osm_road_vars

between_x_vars = osm_vars + esa_lc_vars + wsf_vars + avg_rs_vars + avg_preciptiation

dyn_rs_vars = dyn_ndvi_vars + dyn_ndwi_gao_vars + dyn_nl_vars
within_x_vars = dyn_rs_vars + precipitation

### Target: Log per capita consumption

In [7]:
between_target_var = 'avg_log_mean_pc_cons_usd_2017'
cl_df = df[['cluster_id', between_target_var] + between_x_vars].drop_duplicates().reset_index(drop = True)

# normalise the feature data
cl_df_norm = standardise_df(cl_df, exclude_cols = [between_target_var])
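
For readers without access to `analysis_utils`, the normalisation step can be pictured as a column-wise z-score that leaves the target untouched. The sketch below is an assumption about what such a helper might do, not the actual implementation of `standardise_df`; the function name `zscore_columns`, the choice of the population standard deviation (`ddof=0`), and the toy data are all hypothetical.

```python
import pandas as pd

def zscore_columns(frame, exclude_cols=()):
    """Z-score every numeric column except those in exclude_cols (hypothetical helper)."""
    out = frame.copy()
    num_cols = [
        c for c in out.select_dtypes("number").columns if c not in exclude_cols
    ]
    # Centre each feature and scale it to unit (population) standard deviation
    out[num_cols] = (out[num_cols] - out[num_cols].mean()) / out[num_cols].std(ddof=0)
    return out

# Toy cluster-level table; values are illustrative
toy = pd.DataFrame({
    "cluster_id": ["a", "b", "c"],
    "target": [1.0, 2.0, 3.0],
    "x1": [10.0, 20.0, 30.0],
    "x2": [0.0, 1.0, 5.0],
})

toy_norm = zscore_columns(toy, exclude_cols=["target"])
print(toy_norm)
```

Excluding the target keeps predictions and error metrics on the original scale of the label.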

In [13]:
lsms_df.columns

Index(['index', 'country', 'start_day', 'start_month', 'start_year', 'end_day',
       'end_month', 'end_year', 'start_ts', 'end_ts', 'wave', 'series',
       'cluster_id', 'rural', 'unique_id', 'lsms_lat', 'lsms_lon',
       'mean_pc_cons_usd_2017', 'median_pc_cons_usd_2017',
       'mean_pc_cons_lcu_2017', 'median_pc_cons_lcu_2017',
       'mean_asset_index_nate', 'median_asset_index_nate',
       'mean_asset_index_yeh', 'median_asset_index_yeh', 'n_households',
       'extreme_poor', 'log_mean_pc_cons_usd_2017', 'country_series', 'lat',
       'lon', 'avg_log_mean_pc_cons_usd_2017', 'avg_mean_asset_index_yeh'],
      dtype='object')

In [8]:
# get the within dataframe
# define the within variables
within_target_var = 'log_mean_pc_cons_usd_2017'
within_df = df[['cluster_id','unique_id', within_target_var] + within_x_vars]

# demean the data and standardise the variables
demeaned_df = demean_df(within_df)
demeaned_df_norm = standardise_df(demeaned_df, exclude_cols = [within_target_var])

In [11]:
# divide the data into k different folds
fold_ids = split_lsms_spatial(lsms_df, n_folds = n_folds, random_seed = spatial_cv_random_seed)

# run the between training
print('Between training')
between_cv_trainer_cons = rf.CrossValidator(cl_df_norm, 
                                            fold_ids, 
                                            between_target_var, 
                                            between_x_vars, 
                                            id_var = 'cluster_id', 
                                            random_seed = random_seed)
between_cv_trainer_cons.run_cv_training(min_samples_leaf = 1)

# run the within training
print("\nWithin training")
within_cv_trainer_cons = rf.CrossValidator(demeaned_df_norm, 
                                           fold_ids, 
                                           within_target_var, 
                                           within_x_vars, 
                                           id_var = 'unique_id', 
                                           random_seed = random_seed)
within_cv_trainer_cons.run_cv_training(min_samples_leaf = 15)

# combine both models
combined_model_cons = CombinedModel(lsms_df, between_cv_trainer_cons, within_cv_trainer_cons)
combined_model_cons.evaluate()
combined_results = combined_model_cons.compute_overall_performance(use_fold_weights = True)

Fold 0, specified test ratio: 0.2 - Actual test ratio 0.20
Fold 1, specified test ratio: 0.2 - Actual test ratio 0.20
Fold 2, specified test ratio: 0.2 - Actual test ratio 0.21
Fold 3, specified test ratio: 0.2 - Actual test ratio 0.20
Fold 4, specified test ratio: 0.2 - Actual test ratio 0.19
Between training
Initialising training


  0%|          | 0/5 [00:00<?, ?it/s]

Finished training after 170 seconds

Within training
Initialising training


  0%|          | 0/5 [00:00<?, ?it/s]

Finished training after 193 seconds
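
Since $w_{ct} = \bar{w}_c + \tilde{w}_{ct}$ by construction, one natural way to combine the two models is to add the predicted cluster mean to the predicted deviation. The sketch below illustrates this on toy data under that assumption; it does not reproduce the internals of `CombinedModel`, and all column names (`w_bar_hat`, `w_tilde_hat`, `w_hat`) and values are hypothetical.

```python
import pandas as pd
from sklearn.metrics import r2_score

# Observed labels per cluster-year (illustrative values)
obs = pd.DataFrame({
    "cluster_id": ["a", "a", "b", "b"],
    "unique_id": ["a_1", "a_2", "b_1", "b_2"],
    "w": [1.0, 3.0, 4.0, 6.0],
})

# Hypothetical out-of-fold predictions of the two models
between_pred = pd.DataFrame({
    "cluster_id": ["a", "b"],
    "w_bar_hat": [2.5, 4.5],  # predicted cluster means
})
within_pred = pd.DataFrame({
    "unique_id": ["a_1", "a_2", "b_1", "b_2"],
    "w_tilde_hat": [-0.5, 0.5, -1.0, 1.0],  # predicted deviations
})

# Attach the cluster-level and observation-level predictions, then add them
combined = (
    obs.merge(between_pred, on="cluster_id", how="left")
       .merge(within_pred, on="unique_id", how="left")
)
combined["w_hat"] = combined["w_bar_hat"] + combined["w_tilde_hat"]

overall_r2 = r2_score(combined["w"], combined["w_hat"])
print(combined[["unique_id", "w", "w_hat"]])
print(overall_r2)
```

The between predictions are keyed on `cluster_id` and the within predictions on `unique_id`, mirroring the `id_var` arguments passed to the two cross-validators above.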


In [16]:
# save the predictions
combined_model_cons.pred_df.to_csv("results/baseline/exemplary_predictions.csv", index = False)