# Annual raster point sampling

### Description of the datasets

We have a point dataset, in the form of a CSV file with lat/lon columns. 

Each row in the dataset represents a cluster location `dhsclust` from DHS surveys, as surveyed in a particular year, given by `dhsyear`, `latnum`, and `longnum`.

The data is unique by `surveyid` and `dhsclust`. 

We have a folder of rasters which represent annual data for a single variable of interest (i.e. one raster per year). These are named such that the filename contains the month in the form `*YYYY*` i.e. 4 digits.

### Description of the task

For each cluster we want to extract the raster value for the matching year. The rasters available only cover the years 2000-2015, and for points occurring in earlier years we should extract the value for 2000 and for points in later years we should extract the value for 2015. 


In [1]:
import pandas as pd
import numpy as np
import os

### Sample of the data


In [22]:
_infolder = "/mnt/c/Users/harry/OneDrive - Nexus365/Informal_Cities/DHS_Data_And_Prep/Survey_Point_locations/Locations/all"
_fn = "all_cluster_locations.csv "

inp = os.path.join(_infolder, _fn)

In [3]:
df = pd.read_csv(inp
                # ,usecols=['surveyid','cluster_number','hhid','reference_month','reference_year','lat','lon']
                 , usecols=['surveyid','dhsclust','dhsid','latnum','longnum', 'dhsyear']
                )
df.head()

Unnamed: 0,dhsid,surveyid,dhsyear,dhsclust,latnum,longnum
0,CM201100000001,337,2011,1,10.34002,15.266488
1,CM201100000002,337,2011,2,4.081516,9.762119
2,CM201100000003,337,2011,3,5.958239,10.186587
3,CM201100000004,337,2011,4,5.967302,10.15011
4,CM201100000005,337,2011,5,5.155473,10.18257


In [4]:
df[df['surveyid']==317]

Unnamed: 0,dhsid,surveyid,dhsyear,dhsclust,latnum,longnum
26472,LS200900000001,317,2009,1,-29.936197,27.520731
26473,LS200900000002,317,2009,2,-30.282707,28.137836
26474,LS200900000003,317,2009,3,-29.512251,27.716495
26475,LS200900000004,317,2009,4,-29.580522,27.545744
26476,LS200900000005,317,2009,5,-29.288247,27.575525
...,...,...,...,...,...,...
26870,LS200900000396,317,2009,396,-29.048934,28.253653
26871,LS200900000397,317,2009,397,-29.453294,27.724759
26872,LS200900000398,317,2009,398,-30.311016,27.775404
26873,LS200900000399,317,2009,399,-29.293161,28.472647


We will use rasterio to handle the sampling

In [5]:
import rasterio as rio
from rasterio import RasterioIOError

Define a function that will take a dataframe or rather sub-set of one with a common year for extraction, and extract the matching raster for all these point locations.
The year will be taken from the first row of the group and we're not currently checking they are the same across the passed data. Data path is hardcoded here for now

In [18]:
def extract_raster_vals(grp):
    coords = list(zip(grp['longnum'], grp['latnum']))
    firstrow = grp.iloc[0]
    req_yr = str(firstrow['dhsyear'])
    if int(req_yr)>2015:
        req_yr='2015'
        print(f"Substituting year 2015 for "+str(firstrow['dhsyear']))
    elif int(req_yr)<2015:
        req_yr='2000'
        print(f"Substituting year 2000 for "+str(firstrow['dhsyear']))
    
    rastername = f'/mnt/d/Large_Rasters/Sam_Lucy_Housing/SLUM_v2_mean_{req_yr}.tif'
    print(f"Trying {rastername} for {grp.shape[0]} points")
    #print(coords)
    try:
        with rio.open(rastername) as src:
            grp['rasterval'] = [x[0] for x in src.sample(coords)]
    except RasterioIOError:
        print(f" ...Error, raster not found")
        grp['rasterval'] = ''
    return grp

Use groupby to call the function for each yr/mth subset of the data in turn

In [20]:
sampled = df.loc[pd.notnull(df.latnum)].groupby(['dhsyear']).apply(extract_raster_vals)

Substituting year 2000 for 1986
Trying /mnt/d/Large_Rasters/Sam_Lucy_Housing/SLUM_v2_mean_2000.tif for 156 points
Substituting year 2000 for 1988
Trying /mnt/d/Large_Rasters/Sam_Lucy_Housing/SLUM_v2_mean_2000.tif for 153 points
Substituting year 2000 for 1990
Trying /mnt/d/Large_Rasters/Sam_Lucy_Housing/SLUM_v2_mean_2000.tif for 298 points
Substituting year 2000 for 1991
Trying /mnt/d/Large_Rasters/Sam_Lucy_Housing/SLUM_v2_mean_2000.tif for 149 points
Substituting year 2000 for 1992
Trying /mnt/d/Large_Rasters/Sam_Lucy_Housing/SLUM_v2_mean_2000.tif for 781 points
Substituting year 2000 for 1993
Trying /mnt/d/Large_Rasters/Sam_Lucy_Housing/SLUM_v2_mean_2000.tif for 888 points
Substituting year 2000 for 1994
Trying /mnt/d/Large_Rasters/Sam_Lucy_Housing/SLUM_v2_mean_2000.tif for 477 points
Substituting year 2000 for 1995
Trying /mnt/d/Large_Rasters/Sam_Lucy_Housing/SLUM_v2_mean_2000.tif for 934 points
Substituting year 2000 for 1996
Trying /mnt/d/Large_Rasters/Sam_Lucy_Housing/SLUM_v2_mea

In [21]:
sampled

Unnamed: 0,dhsid,surveyid,dhsyear,dhsclust,latnum,longnum,rasterval
0,CM201100000001,337,2011,1,10.340020,15.266488,3.785988e-02
1,CM201100000002,337,2011,2,4.081516,9.762119,5.483339e-01
2,CM201100000003,337,2011,3,5.958239,10.186587,4.288444e-01
3,CM201100000004,337,2011,4,5.967302,10.150110,4.618978e-01
4,CM201100000005,337,2011,5,5.155473,10.182570,3.228779e-01
...,...,...,...,...,...,...,...
127985,SN201900000210,581,2019,210,13.143196,-15.629210,8.568992e-02
127986,SN201900000211,581,2019,211,12.573376,-15.876414,2.198591e-01
127987,SN201900000212,581,2019,212,12.566986,-15.750106,1.165354e-01
127988,SN201900000213,581,2019,213,12.755641,-15.520960,-3.400000e+38


In [27]:
_outfolder = "/mnt/c/Users/harry/OneDrive - Nexus365/Informal_Cities/DHS_Data_And_Prep/Survey_Point_Locations/Extracted_Spatial_Covariates/Timeseries"
sampled.to_csv(os.path.join(_outfolder,'all_clusters_improved_housing.csv'), index=False)