# Lagged raster point sampling

### Description of the datasets

We have a point dataset, in the form of a CSV file with lat/lon columns. 

Each row in the dataset represents a household as surveyed at a point in time, given by reference_month and reference_year. (In this instance the data are extracted from DHS household surveys.)

(Note that the reference month is the actual calendar month of survey for dates on or after the 15th of the month, and the previous calendar month for dates before the 15th. This is because we are interested in the effect of a variable in the period prior to survey, so if a household is surveyed on the 1st of the month we're not really interested in the data for that month but rather for the previous month. In any case this has been pre-processed into the data, and here we just use reference_month and reference_year as they are given.)
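The rule in the note above can be sketched as a small function. This is only an illustration, since the reference months are already pre-processed into the data; `reference_period` is a hypothetical helper name and not part of the original pipeline.

```python
from datetime import date

def reference_period(survey_date):
    """Return (reference_year, reference_month) for a survey date (hypothetical helper)."""
    if survey_date.day >= 15:
        # Surveyed on or after the 15th: use the survey's own calendar month
        return survey_date.year, survey_date.month
    # Surveyed before the 15th: step back one calendar month, wrapping over January
    if survey_date.month == 1:
        return survey_date.year - 1, 12
    return survey_date.year, survey_date.month - 1

print(reference_period(date(2009, 11, 20)))
print(reference_period(date(2009, 11, 1)))
print(reference_period(date(2010, 1, 14)))
```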

Each household has a location (lat/lon) but many households are at the same location (for anonymity); the location is known as a cluster and is unique by surveyid and cluster_number. 

Most but not necessarily all households at the same location (cluster) in a survey will be surveyed in the same month.

We have a folder of rasters which represent monthly data for a single variable of interest (i.e. one raster per month). These are named such that the filename contains the month in the form `*YYYY.MM*` i.e. 4 digits then a dot then 2 digits.
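As a rough sketch of how the `*YYYY.MM*` naming convention could be parsed, the snippet below pulls the year and month out of a filename with a regular expression. The function name `month_from_filename` is an assumption for illustration; the notebook itself builds the raster path directly from the year and month rather than parsing filenames.

```python
import re

# Four digits, a literal dot, then two digits. The lookarounds stop the
# pattern from matching inside a longer run of digits.
MONTH_PATTERN = re.compile(r"(?<!\d)(\d{4})\.(\d{2})(?!\d)")

def month_from_filename(filename):
    """Return (year, month) from the first YYYY.MM match in filename, or None."""
    match = MONTH_PATTERN.search(filename)
    if match is None:
        return None
    year, month = int(match.group(1)), int(match.group(2))
    # Reject matches whose second part is not a valid month number
    return (year, month) if 1 <= month <= 12 else None

print(month_from_filename("chirps-v2-0.2001.01.sum.5km.NN.tif"))
print(month_from_filename("temis_omi_no2.2009.11.tif"))
```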

### Description of the task

For each household we want to generate a time series of raster values for the 12 months prior to the reference month, and the reference month. That is, we need to sample 13 rasters, corresponding to the reference month/year, and each of the 12 preceding months, and eventually output a table that is like the input with 13 additional columns for the values at month 0, month -1, month -2 etc.


In [2]:
import pandas as pd
import numpy as np
import os

### Sample of the data
In these first few rows the households (hhid) within the cluster are all sampled in the same month but this isn't always true.
Note that the hhid is as taken from the source data and consists of numbers and spaces; the values should not be stripped of whitespace (e.g. with `.str.strip()`) as this could break joining downstream.

In [4]:
_folder = "/mnt/c/Users/harry/OneDrive - Nexus365/Informal_Cities/DHS_Data_And_Prep/Survey_Point_locations/Locations/all"
_fn = "all_cluster_locations.csv"

inp = os.path.join(_folder, _fn)

In [7]:
df = pd.read_csv(inp
                # ,usecols=['surveyid','cluster_number','hhid','reference_month','reference_year','lat','lon']
                 , usecols=['surveyid','dhsclust','dhsid','latnum','longnum']
                )
df.head()

Unnamed: 0,dhsid,surveyid,dhsclust,latnum,longnum
0,CM201100000001,337,1,10.34002,15.266488
1,CM201100000002,337,2,4.081516,9.762119
2,CM201100000003,337,3,5.958239,10.186587
3,CM201100000004,337,4,5.967302,10.15011
4,CM201100000005,337,5,5.155473,10.18257


In [8]:
df[df['surveyid']==317]

Unnamed: 0,dhsid,surveyid,dhsclust,latnum,longnum
26472,LS200900000001,317,1,-29.936197,27.520731
26473,LS200900000002,317,2,-30.282707,28.137836
26474,LS200900000003,317,3,-29.512251,27.716495
26475,LS200900000004,317,4,-29.580522,27.545744
26476,LS200900000005,317,5,-29.288247,27.575525
...,...,...,...,...,...
26870,LS200900000396,317,396,-29.048934,28.253653
26871,LS200900000397,317,397,-29.453294,27.724759
26872,LS200900000398,317,398,-30.311016,27.775404
26873,LS200900000399,317,399,-29.293161,28.472647


In [29]:
df = df[df.surveyid==317]

### Generating the sample dates for each row

For passing to the raster sampler we will ultimately need a long-format table (stacked), one row for each location / hh / lag combination, i.e. in the case of 12 months lag, 13 rows for each input row.

It seems strangely hard to define a way in pandas to apply a function that converts each row into multiple rows.

Instead we will add the lags to each row (wide format). 

In [9]:
from datetime import date
from dateutil.relativedelta import relativedelta

In [6]:
N_MONTHS = 12
ONE_MONTH = relativedelta(months=1)
        
def add_lags_to_row(row):
    # Adds lag_n{i}, lag_yr{i} and lag_m{i} for the reference month (i=0)
    # and for each of the N_MONTHS months before it
    yr = row['reference_year']
    mth = row['reference_month']
    feat_date = date(yr, mth, 1)
    for i in range(N_MONTHS + 1):
        lag_date = feat_date - i * ONE_MONTH
        row[f'lag_n{i}'] = i
        row[f'lag_yr{i}'] = lag_date.year
        row[f'lag_m{i}'] = lag_date.month
    return row

In [8]:
df_test = df.head(20) 
df_test

Unnamed: 0,surveyid,cluster_number,hhid,reference_month,reference_year,lat,lon
0,211,1,117,10,2001,10.84476,2.109562
1,211,1,1 1,10,2001,10.84476,2.109562
2,211,1,1 2,10,2001,10.84476,2.109562
3,211,1,1 3,10,2001,10.84476,2.109562
4,211,1,1 4,10,2001,10.84476,2.109562
5,211,1,1 5,10,2001,10.84476,2.109562
6,211,1,1 6,10,2001,10.84476,2.109562
7,211,1,1 7,10,2001,10.84476,2.109562
8,211,1,1 8,10,2001,10.84476,2.109562
9,211,1,1 9,10,2001,10.84476,2.109562


In [36]:
wide_df_test = df_test.apply(add_lags_to_row, axis=1)
wide_df_test['id'] = wide_df_test.index
wide_df_test.head()

Unnamed: 0,surveyid,cluster_number,hhid,reference_month,reference_year,lat,lon,lag_n0,lag_yr0,lag_m0,...,lag_n10,lag_yr10,lag_m10,lag_n11,lag_yr11,lag_m11,lag_n12,lag_yr12,lag_m12,id
146159,317,1,1106,11,2009,-29.936197,27.520731,0,2009,11,...,10,2009,1,11,2008,12,12,2008,11,146159
146160,317,1,1239,11,2009,-29.936197,27.520731,0,2009,11,...,10,2009,1,11,2008,12,12,2008,11,146160
146161,317,1,1232,11,2009,-29.936197,27.520731,0,2009,11,...,10,2009,1,11,2008,12,12,2008,11,146161
146162,317,1,1220,11,2009,-29.936197,27.520731,0,2009,11,...,10,2009,1,11,2008,12,12,2008,11,146162
146163,317,1,1207,11,2009,-29.936197,27.520731,0,2009,11,...,10,2009,1,11,2008,12,12,2008,11,146163


This is pretty slow due to all those lookups and python loops on each row. So we'll parallelise it. 
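For comparison, the same lag columns can probably be produced without any per-row Python call by doing the month arithmetic on whole columns. The sketch below is an alternative to the approach used in this notebook, not what was actually run; `add_lags_vectorised` is a hypothetical name, and the two-row `demo` frame is made up for illustration.

```python
import pandas as pd

N_MONTHS = 12

def add_lags_vectorised(df, n_months=N_MONTHS):
    """Add lag_n{i}, lag_yr{i}, lag_m{i} columns using column-wise arithmetic."""
    out = df.copy()
    # Months elapsed since year 0, so subtracting a lag is plain integer arithmetic
    month_index = out['reference_year'] * 12 + (out['reference_month'] - 1)
    for i in range(n_months + 1):
        lagged = month_index - i
        out[f'lag_n{i}'] = i
        out[f'lag_yr{i}'] = lagged // 12
        out[f'lag_m{i}'] = lagged % 12 + 1
    return out

demo = pd.DataFrame({'reference_year': [2009, 2001], 'reference_month': [11, 1]})
wide_demo = add_lags_vectorised(demo)
print(wide_demo[['lag_yr11', 'lag_m11', 'lag_yr12', 'lag_m12']])
```

The loop here runs once per lag (13 times) rather than once per row, which is why it should scale better than `apply(..., axis=1)`.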

Tried dask, which I've not used before; didn't quite get there. I think the mapped function would need to be redefined to expect a Series.


In [59]:
import dask.dataframe as dd
from dask.multiprocessing import get

In [None]:
ddata = dd.from_pandas(df, npartitions=100)
#df.compute(get=dask.threaded.get, num_workers=20)
def apply_lagger_to_DF(df): return df.apply((lambda row: add_lags_to_row(**row)), axis=1)
wide_df = ddata.map_partitions(apply_lagger_to_DF).compute(get=get, num_workers=25)

Pandarallel is much more straightforward for this case. 

If necessary, first install the Jupyter widgets extension for progress bars (see https://github.com/nalepae/pandarallel).

In [7]:
from pandarallel import pandarallel

Limit the number of workers (10 here) so as not to take over the server (72 cores); we can use the memory file system as the server has a large allocation.

In [8]:
pandarallel.initialize(nb_workers=10, progress_bar=True, use_memory_fs=True)

INFO: Pandarallel will run on 10 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


In [10]:
wide_df  = df.parallel_apply(add_lags_to_row, axis=1)
#wide_df = df.apply(add_lags_to_row, axis=1)

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=70538), Label(value='0 / 70538')))…

Still takes a while, so checkpoint:

In [12]:
wide_df.to_csv(os.path.join(_folder,'peak_urban_hh_lags_wide_all.csv'))

In [13]:
wide_df.to_parquet(os.path.join(_folder,'peak_urban_hh_lags_wide_all'), compression='GZIP')

In [14]:
wide_df['id'] = wide_df.index

In [15]:
wide_df.head()

Unnamed: 0,surveyid,cluster_number,hhid,reference_month,reference_year,lat,lon,lag_n0,lag_yr0,lag_m0,...,lag_n10,lag_yr10,lag_m10,lag_n11,lag_yr11,lag_m11,lag_n12,lag_yr12,lag_m12,id
0,211,1,117,10,2001,10.84476,2.109562,0,2001,10,...,10,2000,12,11,2000,11,12,2000,10,0
1,211,1,1 1,10,2001,10.84476,2.109562,0,2001,10,...,10,2000,12,11,2000,11,12,2000,10,1
2,211,1,1 2,10,2001,10.84476,2.109562,0,2001,10,...,10,2000,12,11,2000,11,12,2000,10,2
3,211,1,1 3,10,2001,10.84476,2.109562,0,2001,10,...,10,2000,12,11,2000,11,12,2000,10,3
4,211,1,1 4,10,2001,10.84476,2.109562,0,2001,10,...,10,2000,12,11,2000,11,12,2000,10,4


Use the super-handy wide_to_long function to stack / pseudo-normalise this output to have one row per unique combination of location, time (raster) and how many lag months this time is for this household.
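To make the reshaping concrete, here is a toy illustration of `pd.wide_to_long` on a two-household frame with only two lags. The `toy` data is made up for the example, and the `j` level is called `lag` here purely for readability.

```python
import pandas as pd

toy = pd.DataFrame({
    'id': [0, 1],
    'hhid': ['1 1', '1 2'],
    'lag_n0': [0, 0], 'lag_yr0': [2001, 2001], 'lag_m0': [10, 10],
    'lag_n1': [1, 1], 'lag_yr1': [2001, 2001], 'lag_m1': [9, 9],
})

# Each stub name collects the columns named stub + <number>; the number
# becomes the second index level (named by j)
toy_long = pd.wide_to_long(toy, stubnames=['lag_n', 'lag_yr', 'lag_m'], i='id', j='lag')
print(toy_long.sort_index())
```

Two input rows with two lags each give four output rows, indexed by `(id, lag)`, with `hhid` carried along unchanged.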



In [16]:
melted = pd.wide_to_long(wide_df, ['lag_n','lag_m', 'lag_yr'], i=['id'], j='thing')
melted.head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,lat,reference_year,lon,hhid,surveyid,cluster_number,reference_month,lag_n,lag_m,lag_yr
id,thing,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,0,10.84476,2001,2.109562,117,211,1,10,0,10,2001
1,0,10.84476,2001,2.109562,1 1,211,1,10,0,10,2001
2,0,10.84476,2001,2.109562,1 2,211,1,10,0,10,2001
3,0,10.84476,2001,2.109562,1 3,211,1,10,0,10,2001
4,0,10.84476,2001,2.109562,1 4,211,1,10,0,10,2001


This will have duplicates where multiple households in the same location (cluster) are interviewed in the same month (which is the norm).

We could use this df directly for the raster sampling, but it'll be more efficient to get rid of the duplicates in this dimension to cut down raster sampling (maybe).

In [17]:
extract_pts = melted.drop_duplicates(subset=['cluster_number', 'surveyid', 'lag_n', 'lag_m', 'lag_yr'])
extract_pts.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,lat,reference_year,lon,hhid,surveyid,cluster_number,reference_month,lag_n,lag_m,lag_yr
id,thing,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,0,10.84476,2001,2.109562,117,211,1,10,0,10,2001
17,0,10.68541,2001,1.074763,210,211,2,8,0,8,2001
18,0,10.68541,2001,1.074763,2 1,211,2,7,0,7,2001
34,0,10.613593,2001,1.273059,3 9,211,3,8,0,8,2001
46,0,10.512361,2001,0.94593,4 4,211,4,8,0,8,2001


We will use rasterio to handle the sampling

In [3]:
import rasterio as rio
from rasterio.errors import RasterioIOError

Define a function that takes a dataframe (or rather a subset of one) with a common month and year, and samples the matching raster at all of its point locations.
The month and year are taken from the first row of the group; we're not currently checking that they are the same across the passed data. The data path is hardcoded here for now.
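If the missing consistency check were wanted, one possible sketch is a small guard that could be called at the top of the sampling function. `single_year_month` is a hypothetical helper and is not used anywhere in this notebook.

```python
import pandas as pd

def single_year_month(grp):
    """Return the (year, month) shared by every row of grp, or raise ValueError."""
    pairs = grp[['lag_yr', 'lag_m']].drop_duplicates()
    if len(pairs) != 1:
        raise ValueError(f"expected one year/month per group, found {len(pairs)}")
    yr, m = pairs.iloc[0]
    return int(yr), int(m)

ok = pd.DataFrame({'lag_yr': [2001, 2001], 'lag_m': [10, 10]})
print(single_year_month(ok))
```

When the groups come from `groupby(['lag_m', 'lag_yr'])` the check can never fail, so it would only matter if the function were called on some other subset.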

In [22]:
def extract_raster_vals(grp):
    # Work on a copy so the group passed in is not modified in place
    grp = grp.copy()
    # rasterio's sample() expects (x, y) pairs, i.e. (lon, lat)
    coords = list(zip(grp['lon'], grp['lat']))
    firstrow = grp.iloc[0]
    req_yr = str(firstrow['lag_yr'])
    req_m = str(firstrow['lag_m']).zfill(2)
    #rastername = f'no2/temis_omi_no2.{req_yr}.{req_m}.tif'
    rastername = f'/mnt/d/InformalCities/Data/CHIRPS/chirps-v2-0.{req_yr}.{req_m}.sum.5km.NN.tif'
    print(f"Trying {rastername} for {grp.shape[0]} points")
    #print(coords)
    try:
        with rio.open(rastername) as src:
            # Take the first band's value at each point; nodata comes back masked
            grp['rasterval'] = [x[0] for x in src.sample(coords, masked=True)]
    except RasterioIOError:
        # Raster for this month is missing or unreadable: flag with an empty string
        grp['rasterval'] = ''
    return grp

Use groupby to call the function for each yr/mth subset of the data in turn

In [23]:
sampled = extract_pts.loc[pd.notnull(extract_pts.lat)].groupby(['lag_m','lag_yr']).apply(extract_raster_vals)

Trying /mnt/d/InformalCities/Data/CHIRPS/chirps-v2-0.2001.01.sum.5km.NN.tif for 276 points
Trying /mnt/d/InformalCities/Data/CHIRPS/chirps-v2-0.2005.01.sum.5km.NN.tif for 1145 points
Trying /mnt/d/InformalCities/Data/CHIRPS/chirps-v2-0.2006.01.sum.5km.NN.tif for 1113 points
Trying /mnt/d/InformalCities/Data/CHIRPS/chirps-v2-0.2007.01.sum.5km.NN.tif for 1112 points
Trying /mnt/d/InformalCities/Data/CHIRPS/chirps-v2-0.2008.01.sum.5km.NN.tif for 2524 points
Trying /mnt/d/InformalCities/Data/CHIRPS/chirps-v2-0.2009.01.sum.5km.NN.tif for 676 points
Trying /mnt/d/InformalCities/Data/CHIRPS/chirps-v2-0.2010.01.sum.5km.NN.tif for 3267 points
Trying /mnt/d/InformalCities/Data/CHIRPS/chirps-v2-0.2011.01.sum.5km.NN.tif for 2367 points
Trying /mnt/d/InformalCities/Data/CHIRPS/chirps-v2-0.2012.01.sum.5km.NN.tif for 1916 points
Trying /mnt/d/InformalCities/Data/CHIRPS/chirps-v2-0.2013.01.sum.5km.NN.tif for 2687 points
Trying /mnt/d/InformalCities/Data/CHIRPS/chirps-v2-0.2014.01.sum.5km.NN.tif for 44

In [25]:
sampled

Unnamed: 0_level_0,Unnamed: 1_level_0,lat,reference_year,lon,hhid,surveyid,cluster_number,reference_month,lag_n,lag_m,lag_yr,rasterval
id,thing,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,0,10.844760,2001,2.109562,117,211,1,10,0,10,2001,17.5411
17,0,10.685410,2001,1.074763,210,211,2,8,0,8,2001,179.567
18,0,10.685410,2001,1.074763,2 1,211,2,7,0,7,2001,195.308
34,0,10.613593,2001,1.273059,3 9,211,3,8,0,8,2001,210.201
46,0,10.512361,2001,0.945930,4 4,211,4,8,0,8,2001,206.537
...,...,...,...,...,...,...,...,...,...,...,...,...
705252,12,-15.416081,2018,28.370067,541 3,542,541,8,12,8,2017,0.175996
705277,12,-11.150453,2018,32.946955,542 15,542,542,9,12,9,2017,0.424578
705302,12,-15.769918,2018,28.299570,543 2,542,543,11,12,11,2017,111.414
705327,12,-12.876392,2018,30.039495,544 1,542,544,9,12,9,2017,0.881467


Merge (left join) the extracted data back onto the with-duplicates input and remove the duplicate columns caused by the merge (pandas keeps both copies of any non-key column that appears in both frames, adding a suffix to one of them, even when the values are equal).
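A toy version of this merge, with made-up frames and a reduced key list, shows what `indicator`, `validate` and the suffix clean-up each contribute. The `households` and `points` frames below are assumptions for illustration only.

```python
import pandas as pd

keys = ['surveyid', 'cluster_number', 'lag_n']

# With-duplicates side: two households share a cluster; one survey has no location
households = pd.DataFrame({
    'surveyid': [211, 211, 253],
    'cluster_number': [1, 1, 7],
    'lag_n': [0, 0, 0],
    'hhid': ['1 1', '1 2', '7 1'],
    'lat': [10.84476, 10.84476, None],
})
# De-duplicated side: one sampled value per cluster and lag
points = pd.DataFrame({
    'surveyid': [211],
    'cluster_number': [1],
    'lag_n': [0],
    'lat': [10.84476],
    'rasterval': [17.5411],
})

# validate='m:1' raises if the right-hand side has duplicate keys;
# indicator=True adds a _merge column recording where each row matched
joined = pd.merge(households, points, how='left', on=keys,
                  indicator=True, validate='m:1', suffixes=('', '_y'))
# 'lat' exists on both sides, so the right-hand copy arrives as 'lat_y'; drop it
joined = joined.drop(columns=joined.filter(regex='_y$').columns)
print(joined[['hhid', 'rasterval', '_merge']])
```

Rows with no match on the right keep their left-hand values, get a missing `rasterval`, and are marked `left_only` in `_merge`, which is what the check further down relies on.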

In [26]:
full_results_long = pd.merge(melted, sampled, how='left', on=['cluster_number', 'surveyid', 'lag_n', 'lag_m', 'lag_yr'], 
                             indicator=True, validate='m:1',
                            suffixes=('','_y'))

full_results_long.drop(full_results_long.filter(regex='_y$').columns.tolist(),axis=1, inplace=True)

In [27]:
full_results_long.head()

Unnamed: 0,lat,reference_year,lon,hhid,surveyid,cluster_number,reference_month,lag_n,lag_m,lag_yr,rasterval,_merge
0,10.84476,2001,2.109562,117,211,1,10,0,10,2001,17.5411,both
1,10.84476,2001,2.109562,1 1,211,1,10,0,10,2001,17.5411,both
2,10.84476,2001,2.109562,1 2,211,1,10,0,10,2001,17.5411,both
3,10.84476,2001,2.109562,1 3,211,1,10,0,10,2001,17.5411,both
4,10.84476,2001,2.109562,1 4,211,1,10,0,10,2001,17.5411,both


Check that any surveys which didn't get through the raster extraction are the ones we expect (the ones with a null lat column, which got dropped just before sampling).

In [28]:
full_results_long[full_results_long['_merge']!= 'both']['surveyid'].unique()

array([253])

In [29]:
np.unique(full_results_long[full_results_long['rasterval']==''][['surveyid', 'reference_year', 'reference_month']], axis=0)


array([], shape=(0, 3), dtype=int64)

In [30]:
full_results_test = full_results_long.head(250000).copy()
full_results_test

Unnamed: 0,lat,reference_year,lon,hhid,surveyid,cluster_number,reference_month,lag_n,lag_m,lag_yr,rasterval,_merge
0,10.844760,2001,2.109562,117,211,1,10,0,10,2001,17.5411,both
1,10.844760,2001,2.109562,1 1,211,1,10,0,10,2001,17.5411,both
2,10.844760,2001,2.109562,1 2,211,1,10,0,10,2001,17.5411,both
3,10.844760,2001,2.109562,1 3,211,1,10,0,10,2001,17.5411,both
4,10.844760,2001,2.109562,1 4,211,1,10,0,10,2001,17.5411,both
...,...,...,...,...,...,...,...,...,...,...,...,...
249995,-23.643455,2011,35.292552,42114,362,421,8,0,8,2011,27.6273,both
249996,-23.643455,2011,35.292552,42115,362,421,8,0,8,2011,27.6273,both
249997,-23.643455,2011,35.292552,42116,362,421,8,0,8,2011,27.6273,both
249998,-23.643455,2011,35.292552,42117,362,421,8,0,8,2011,27.6273,both


Now pivot it back to one set of 3 columns for each lag period i.e. (year,month,value) of extracted raster for each lag amount

https://stackoverflow.com/a/55252414/4150190

There isn't a long_to_wide like there is a wide_to_long!
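On a toy frame, the long-to-wide step can be sketched with `set_index` and `unstack`, followed by flattening the two-level column names. This is a simplified variant: it keys the columns on `lag_n` directly and assumes exactly one row per household and lag, whereas the cells that follow use `cumcount` and `pivot_table`. The `long_toy` data is made up for illustration.

```python
import pandas as pd

long_toy = pd.DataFrame({
    'hhid': ['1 1', '1 1', '1 2', '1 2'],
    'lag_n': [0, 1, 0, 1],
    'lag_m': [10, 9, 10, 9],
    'rasterval': [17.5411, 176.503, 17.5411, 176.503],
})

# Move lag_n from the row index into a second column level
wide_toy = long_toy.set_index(['hhid', 'lag_n']).unstack('lag_n')
# Group the columns by lag number, then flatten (name, lag) into 'name_lag'
wide_toy = wide_toy.sort_index(axis=1, level=1)
wide_toy.columns = [f'{name}_{lag}' for name, lag in wide_toy.columns]
wide_toy = wide_toy.reset_index()
print(wide_toy)
```

Note that `unstack` raises an error if a household has two rows for the same lag, which is one reason a `pivot_table` with an aggregation function can be the more forgiving choice.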

In [31]:
full_results_long['idx'] = full_results_long.groupby(['cluster_number','surveyid', 'lat', 'lon', 'hhid', 'reference_year','reference_month' ]).cumcount()

In [32]:
pivoted = full_results_long.pivot_table(index=['cluster_number','surveyid', 'lat', 'lon', 'hhid', 'reference_year','reference_month' ],
                                       columns='idx',
                                     values=['lag_n', 'lag_m', 'lag_yr', 'rasterval'],
                       aggfunc='first')
pivoted
#melted_test.pivot(index=['surveyid', 'cluster_number', 'hhid', 'reference_month','reference_year', 'lat', 'lon'], 
#                 columns=['lag_n'], values=['lag_yr', 'lag_m'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,lag_m,lag_m,lag_m,lag_m,lag_m,lag_m,lag_m,lag_m,lag_m,lag_m,...,rasterval,rasterval,rasterval,rasterval,rasterval,rasterval,rasterval,rasterval,rasterval,rasterval
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,idx,0,1,2,3,4,5,6,7,8,9,...,3,4,5,6,7,8,9,10,11,12
cluster_number,surveyid,lat,lon,hhid,reference_year,reference_month,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2
1,211,10.84476,2.109562,1 1,2001,10,10,9,8,7,6,5,4,3,2,1,...,252.868,154.647,122.535,18.9429,3.98381,3.44386,1.85766,2.19528,2.83618,45.9245
1,211,10.84476,2.109562,1 2,2001,10,10,9,8,7,6,5,4,3,2,1,...,252.868,154.647,122.535,18.9429,3.98381,3.44386,1.85766,2.19528,2.83618,45.9245
1,211,10.84476,2.109562,1 3,2001,10,10,9,8,7,6,5,4,3,2,1,...,252.868,154.647,122.535,18.9429,3.98381,3.44386,1.85766,2.19528,2.83618,45.9245
1,211,10.84476,2.109562,1 4,2001,10,10,9,8,7,6,5,4,3,2,1,...,252.868,154.647,122.535,18.9429,3.98381,3.44386,1.85766,2.19528,2.83618,45.9245
1,211,10.84476,2.109562,1 5,2001,10,10,9,8,7,6,5,4,3,2,1,...,252.868,154.647,122.535,18.9429,3.98381,3.44386,1.85766,2.19528,2.83618,45.9245
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2029,266,0.00000,0.000000,2029135,2006,9,9,8,7,6,5,4,3,2,1,12,...,--,--,--,--,--,--,--,--,--,--
2029,266,0.00000,0.000000,2029139,2006,9,9,8,7,6,5,4,3,2,1,12,...,--,--,--,--,--,--,--,--,--,--
2029,266,0.00000,0.000000,2029141,2006,9,9,8,7,6,5,4,3,2,1,12,...,--,--,--,--,--,--,--,--,--,--
2029,266,0.00000,0.000000,2029149,2006,9,9,8,7,6,5,4,3,2,1,12,...,--,--,--,--,--,--,--,--,--,--


In [33]:
pivoted = pivoted.sort_index(axis=1, level=1)
pivoted

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,lag_m,lag_n,lag_yr,rasterval,lag_m,lag_n,lag_yr,rasterval,lag_m,lag_n,...,lag_yr,rasterval,lag_m,lag_n,lag_yr,rasterval,lag_m,lag_n,lag_yr,rasterval
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,idx,0,0,0,0,1,1,1,1,2,2,...,10,10,11,11,11,11,12,12,12,12
cluster_number,surveyid,lat,lon,hhid,reference_year,reference_month,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2
1,211,10.84476,2.109562,1 1,2001,10,10,0,2001,17.5411,9,1,2001,176.503,8,2,...,2000,2.19528,11,11,2000,2.83618,10,12,2000,45.9245
1,211,10.84476,2.109562,1 2,2001,10,10,0,2001,17.5411,9,1,2001,176.503,8,2,...,2000,2.19528,11,11,2000,2.83618,10,12,2000,45.9245
1,211,10.84476,2.109562,1 3,2001,10,10,0,2001,17.5411,9,1,2001,176.503,8,2,...,2000,2.19528,11,11,2000,2.83618,10,12,2000,45.9245
1,211,10.84476,2.109562,1 4,2001,10,10,0,2001,17.5411,9,1,2001,176.503,8,2,...,2000,2.19528,11,11,2000,2.83618,10,12,2000,45.9245
1,211,10.84476,2.109562,1 5,2001,10,10,0,2001,17.5411,9,1,2001,176.503,8,2,...,2000,2.19528,11,11,2000,2.83618,10,12,2000,45.9245
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2029,266,0.00000,0.000000,2029135,2006,9,9,0,2006,--,8,1,2006,--,7,2,...,2005,--,10,11,2005,--,9,12,2005,--
2029,266,0.00000,0.000000,2029139,2006,9,9,0,2006,--,8,1,2006,--,7,2,...,2005,--,10,11,2005,--,9,12,2005,--
2029,266,0.00000,0.000000,2029141,2006,9,9,0,2006,--,8,1,2006,--,7,2,...,2005,--,10,11,2005,--,9,12,2005,--
2029,266,0.00000,0.000000,2029149,2006,9,9,0,2006,--,8,1,2006,--,7,2,...,2005,--,10,11,2005,--,9,12,2005,--


In [34]:
pivoted.columns

MultiIndex([(    'lag_m',  0),
            (    'lag_n',  0),
            (   'lag_yr',  0),
            ('rasterval',  0),
            (    'lag_m',  1),
            (    'lag_n',  1),
            (   'lag_yr',  1),
            ('rasterval',  1),
            (    'lag_m',  2),
            (    'lag_n',  2),
            (   'lag_yr',  2),
            ('rasterval',  2),
            (    'lag_m',  3),
            (    'lag_n',  3),
            (   'lag_yr',  3),
            ('rasterval',  3),
            (    'lag_m',  4),
            (    'lag_n',  4),
            (   'lag_yr',  4),
            ('rasterval',  4),
            (    'lag_m',  5),
            (    'lag_n',  5),
            (   'lag_yr',  5),
            ('rasterval',  5),
            (    'lag_m',  6),
            (    'lag_n',  6),
            (   'lag_yr',  6),
            ('rasterval',  6),
            (    'lag_m',  7),
            (    'lag_n',  7),
            (   'lag_yr',  7),
            ('rasterval',  7),
        

In [35]:
pivoted.columns = [f'{x}_{y}' for x,y in pivoted.columns]
pivoted = pivoted.reset_index()
pivoted

Unnamed: 0,cluster_number,surveyid,lat,lon,hhid,reference_year,reference_month,lag_m_0,lag_n_0,lag_yr_0,...,lag_yr_10,rasterval_10,lag_m_11,lag_n_11,lag_yr_11,rasterval_11,lag_m_12,lag_n_12,lag_yr_12,rasterval_12
0,1,211,10.84476,2.109562,1 1,2001,10,10,0,2001,...,2000,2.19528,11,11,2000,2.83618,10,12,2000,45.9245
1,1,211,10.84476,2.109562,1 2,2001,10,10,0,2001,...,2000,2.19528,11,11,2000,2.83618,10,12,2000,45.9245
2,1,211,10.84476,2.109562,1 3,2001,10,10,0,2001,...,2000,2.19528,11,11,2000,2.83618,10,12,2000,45.9245
3,1,211,10.84476,2.109562,1 4,2001,10,10,0,2001,...,2000,2.19528,11,11,2000,2.83618,10,12,2000,45.9245
4,1,211,10.84476,2.109562,1 5,2001,10,10,0,2001,...,2000,2.19528,11,11,2000,2.83618,10,12,2000,45.9245
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695060,2029,266,0.00000,0.000000,2029135,2006,9,9,0,2006,...,2005,--,10,11,2005,--,9,12,2005,--
695061,2029,266,0.00000,0.000000,2029139,2006,9,9,0,2006,...,2005,--,10,11,2005,--,9,12,2005,--
695062,2029,266,0.00000,0.000000,2029141,2006,9,9,0,2006,...,2005,--,10,11,2005,--,9,12,2005,--
695063,2029,266,0.00000,0.000000,2029149,2006,9,9,0,2006,...,2005,--,10,11,2005,--,9,12,2005,--


This gives our final output; save it to csv.
Optionally delete the `lag_n_*`, `lag_m_*`  and `lag_yr_*` columns as once we've checked the logic they are redundant

In [37]:
pivoted.to_csv(os.path.join(_folder,'chirps_output.csv'), index=False)

In [40]:
# Drop the lag_n_*, lag_m_* and lag_yr_* bookkeeping columns in one go
pivoted.drop(columns=pivoted.filter(regex=r'^lag_(n|m|yr)_').columns, inplace=True)

In [41]:
pivoted

Unnamed: 0,cluster_number,surveyid,lat,lon,hhid,reference_year,reference_month,rasterval_0,rasterval_1,rasterval_2,rasterval_3,rasterval_4,rasterval_5,rasterval_6,rasterval_7,rasterval_8,rasterval_9,rasterval_10,rasterval_11,rasterval_12
0,1,211,10.84476,2.109562,1 1,2001,10,17.5411,176.503,279.409,252.868,154.647,122.535,18.9429,3.98381,3.44386,1.85766,2.19528,2.83618,45.9245
1,1,211,10.84476,2.109562,1 2,2001,10,17.5411,176.503,279.409,252.868,154.647,122.535,18.9429,3.98381,3.44386,1.85766,2.19528,2.83618,45.9245
2,1,211,10.84476,2.109562,1 3,2001,10,17.5411,176.503,279.409,252.868,154.647,122.535,18.9429,3.98381,3.44386,1.85766,2.19528,2.83618,45.9245
3,1,211,10.84476,2.109562,1 4,2001,10,17.5411,176.503,279.409,252.868,154.647,122.535,18.9429,3.98381,3.44386,1.85766,2.19528,2.83618,45.9245
4,1,211,10.84476,2.109562,1 5,2001,10,17.5411,176.503,279.409,252.868,154.647,122.535,18.9429,3.98381,3.44386,1.85766,2.19528,2.83618,45.9245
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695060,2029,266,0.00000,0.000000,2029135,2006,9,--,--,--,--,--,--,--,--,--,--,--,--,--
695061,2029,266,0.00000,0.000000,2029139,2006,9,--,--,--,--,--,--,--,--,--,--,--,--,--
695062,2029,266,0.00000,0.000000,2029141,2006,9,--,--,--,--,--,--,--,--,--,--,--,--,--
695063,2029,266,0.00000,0.000000,2029149,2006,9,--,--,--,--,--,--,--,--,--,--,--,--,--


In [42]:
pivoted.to_csv(os.path.join(_folder,'chirps_output_clean.csv'), index=False)