# SOIL DATA - Troubles finding what to use?!

- SSURGO is more detail than I can work with in the timeframe I have
    - USDA NRCS 
        - https://sdmdataaccess.nrcs.usda.gov/Default.aspx
- STATSGO may be too limiting
    - USDA NRCS (shares tables with the data source above)
- How about The Global Soil Dataset for Earth System Modeling
    - Land-Atmosphere Interaction Research Group at Sun Yat-sen University
        - http://globalchange.bnu.edu.cn/research/soilwd.jsp

## Explore Soil Organic Carbon Density in The Global Soil Dataset

_I've already done quite a bit of noodling around the USDA data in other notebooks so I'll take a look at whether this one will fit my needs better (and my timeframe and level of pre-aggregation and simplification desired)._

### Load the NetCDF 

NetCDF (Network Common Data Form) is commonly used to store multidimensional geographic data, and is especially common for geographic time series. After downloading it from The Global Soil Dataset, I'll load the 5 arc-minute resolution version of the soil organic carbon density NetCDF file (SOCD5min.zip).

In [69]:
# if xarray is not yet installed, uncomment and run one of the following lines (either/or), 
# which only need to be run once
# !pip install xarray
# I probably should have used conda because my virtual environment is maintained with conda, so if I run into problems I will uninstall with pip and reinstall with conda
# !conda install xarray 

In [3]:
# import package dependencies for environment
import netCDF4 as nc
import xarray as xr
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import geopandas

In [4]:
# check working directory using Shell command in IPython syntax preceded by '!'
!pwd

/Users/kathrynhurchla/Documents/GitHub/sustain-our-soil-for-our-food/analysis


In [5]:
# list directory contents
!ls

soil_data_analysis.ipynb           soil_data_analysis_SSURGO.html
soil_data_analysis_SSURGO.R        soil_data_analysis_STATSGO.Rmd
soil_data_analysis_SSURGO.Rmd      soil_data_analysis_STATSGO.nb.html


In [6]:
# can I see my data folder in the root directory of my project 
# (i.e. in the parent of current analysis/notebooks folder working directory)?
#!echo ../*/ #alternately
!ls ..

LICENSE
README.md
Style_Tile_soil_health_and_climate_change.ai
analysis
data
docs
environment.yml
img
output
research


In [8]:
# now can I see the files in my data folder?
!ls ../data

Production_Crops_Livestock_E_All_Data_(Normalized)
Production_Crops_Livestock_E_All_Data_(Normalized).zip
SOCD5min.nc
SOCD5min.zip
Trade_CropsLivestock_E_All_Data_(Normalized)
Trade_CropsLivestock_E_All_Data_(Normalized).zip
USDA_Soil_Data_Access_Table_Relationships.csv
ipcc_efdb_cat3B2_CH4_CO2_N2O_output.xls
ipcc_efdb_cat3_CH4_CO2_N2O_output.xls


In [9]:
# Great! Now I've checked and copied the filename from right here in my Notebook!
# load the NetCDF .nc file using the netCDF4 package (note this can also be done using the gdal package)
fn = '../data/SOCD5min.nc' # relative path to netcdf file
ds = nc.Dataset(fn) # read as netcdf dataset
# view info about the variables
print(ds)
# print(ds.__dict__) #alternately print metadata as a Python dictionary

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF3_CLASSIC data model, file format NETCDF3):
    Conventions: CF-1.0
    dimensions(sizes): lon(4320), lat(1680), depth(8)
    variables(dimensions): float32 lon(lon), float32 lat(lat), float32 depth(depth), int16 SOCD(depth, lat, lon)
    groups: 
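
As a quick sanity check on the "5 minute" label, the longitude dimension size printed above can be turned into a nominal cell width. This is a small sketch that uses only the sizes reported by `print(ds)`; it assumes the grid spans the full 360 degrees of longitude, which the file metadata does not state explicitly.

```python
from fractions import Fraction

# dimension sizes copied from the print(ds) output above
n_lon, n_lat = 4320, 1680

# ASSUMPTION: the longitude cells tile the full 360 degrees
cell_deg = Fraction(360, n_lon)      # nominal cell width in degrees
cell_arcmin = cell_deg * 60          # 60 arc-minutes per degree
lat_span_deg = n_lat * cell_deg      # latitude covered at that cell size

print(f"cell width: {float(cell_deg):.5f} degrees = {cell_arcmin} arc-minutes")
print(f"latitude span at that cell size: {lat_span_deg} degrees")
```

Using `Fraction` keeps the arithmetic exact, so the result is not blurred by floating-point rounding.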


In [10]:
# access the metadata for the one variable that is not a dimension:
# SOCD is Soil Organic Carbon Density,
# measured and recorded in t/ha (tonnes per hectare)
print(ds['SOCD'])

<class 'netCDF4._netCDF4.Variable'>
int16 SOCD(depth, lat, lon)
    missing_value: -999
    units: t/ha
    long_name: soil organic carban density
unlimited dimensions: 
current shape = (8, 1680, 4320)
filling on, default _FillValue of -32767 used


In [11]:
# just print dimensions as a python dictionary
print(ds.dimensions)

{'lon': <class 'netCDF4._netCDF4.Dimension'>: name = 'lon', size = 4320, 'lat': <class 'netCDF4._netCDF4.Dimension'>: name = 'lat', size = 1680, 'depth': <class 'netCDF4._netCDF4.Dimension'>: name = 'depth', size = 8}


In [12]:
# access the data values just like a numpy array
socd = ds['SOCD'][:]
print(socd)

[[[-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  ...
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]]

 [[-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  ...
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]]

 [[-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  ...
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]]

 ...

 [[-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  ...
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]]

 [[-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  ...
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]]

 [[-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  ...
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]
  [-- -- -- ... -- -- --]]]


In [13]:
socd

masked_array(
  data=[[[--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         ...,
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --]],

        [[--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         ...,
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --]],

        [[--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         ...,
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --]],

        ...,

        [[--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         ...,
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --,

### Need to get this in a more workable format

I'll try to understand this data format more.
Ultimately I want to transform it into a Pandas dataframe.

In [14]:
# check the type of the new named variable socd
type(socd)

numpy.ma.core.MaskedArray

In [15]:
# see the shape of the array
socd.shape

(8, 1680, 4320)

In [16]:
# view an element about midway through
socd[0, 1000, 1000]

masked

In [17]:
# try another
socd[0, 0, 0]

masked
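
Both elements come back as `masked` because netCDF4 returns a NumPy masked array that hides the cells equal to the variable's `missing_value` (-999 in the metadata above). A minimal sketch of how such an array behaves, on a toy patch with made-up values:

```python
import numpy as np

# toy int16 grid standing in for a small patch of SOCD; -999 marks missing cells
raw = np.array([[-999, -999, 3],
                [-999, 12, 7]], dtype=np.int16)

# mask every cell equal to the missing-value sentinel
patch = np.ma.masked_equal(raw, -999)

print(patch)                          # masked cells print as --
print(patch.count())                  # number of cells that hold real values
print(np.ma.is_masked(patch[0, 0]))   # whether a single element is masked
print(patch.mean())                   # statistics skip the masked cells
```

So a corner or mid-grid cell showing `masked` only means that particular cell has no data; `count()` is a quicker way to confirm that real values exist somewhere in the array.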

In [18]:
# get the non-masked data, specifically by removing rows with all masked data
# the line below raised a SyntaxError because all() was written with square brackets;
# it needs parentheses, i.e. socd.mask.all(axis=1)
# socd_unmasked_all = socd[~socd.mask.all[axis=1]]
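
A minimal sketch of the idea behind the attempt above, on a toy 2-D layer rather than the full 3-D array. `all` is a method, so it takes parentheses, and `np.ma.getmaskarray` returns a full boolean mask even when nothing is masked. The toy values are made up for illustration.

```python
import numpy as np

# toy 2-D layer (lat x lon); rows 0 and 2 are entirely missing
layer = np.ma.masked_equal(
    np.array([[-999, -999, -999],
              [-999, 5, 8],
              [-999, -999, -999],
              [2, -999, 4]], dtype=np.int16),
    -999,
)

mask = np.ma.getmaskarray(layer)              # boolean array, same shape as layer
row_all_masked = mask.all(axis=1)             # True where every cell in the row is masked
kept_rows = np.flatnonzero(~row_all_masked)   # original row positions that survive
layer_trimmed = layer[~row_all_masked]        # drop the fully masked rows

print(kept_rows)
print(layer_trimmed)
```

Keeping `kept_rows` alongside the trimmed array preserves which latitude rows survived, which would otherwise be lost.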

In [19]:
# the compressed method will remove masked items, but flattens the result to a 1 dimensional array
# so I've lost the location dimensions that way
socd_compressed = socd.compressed()
print(socd_compressed)
socd_compressed.shape

[0 0 0 ... 7 6 0]


(16762449,)
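
One way to keep the location of each value while still dropping the masked cells is to pair the values with their array indices. This is a sketch on a toy array, not the approach this notebook ends up taking; the column names `depth_idx`, `lat_idx`, and `lon_idx` are made up for illustration.

```python
import numpy as np
import pandas as pd

# toy (depth, lat, lon) array with -999 as the missing-value sentinel
toy = np.ma.masked_equal(
    np.array([[[-999, 4], [6, -999]],
              [[-999, 1], [-999, -999]]], dtype=np.int16),
    -999,
)

valid = ~np.ma.getmaskarray(toy)                   # True where a real value exists
depth_idx, lat_idx, lon_idx = np.nonzero(valid)    # positions of the valid cells

# one row per valid cell: its three indices plus the value itself;
# compressed() walks the array in the same row-major order as nonzero()
toy_df = pd.DataFrame({
    "depth_idx": depth_idx,
    "lat_idx": lat_idx,
    "lon_idx": lon_idx,
    "SOCD": toy.compressed(),
})
print(toy_df)
```

The index columns could then be used to look up the matching coordinate values from the `lat`, `lon`, and `depth` variables of the dataset.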

In [20]:
# reshape the masked array to 2D, to try to make it into a dataframe
socd.reshape(-1, 1)

masked_array(
  data=[[--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
        [--],
      

### Halp!! 
Here's the point where I asked for help. Xarray to the rescue!
Thank you Dr. Larry Gray for your consultation that led me to this pivot!

In [21]:
# use the xarray package instead of netCDF4 to view and process the dataset from here
# added to packages import list at top of notebook
# read the data file in with xarray instead and assign it to ds variable
ds = xr.open_dataset("../data/SOCD5min.nc")
# transform it to a dataframe assigned to df variable
df = ds.to_dataframe()
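
Conceptually, `to_dataframe()` pairs every combination of the coordinates with the matching cell value, and missing cells become `NaN`. The sketch below mimics that with NumPy and pandas on a toy array. It illustrates the idea only and is not xarray's actual implementation; the toy coordinates and values are made up.

```python
import numpy as np
import pandas as pd

# toy coordinates and a (depth, lat, lon) array with -999 for missing cells
depth = [4.5, 9.1]
lat = [10.0, 0.0]
lon = [-1.0, 1.0]
raw = np.array([[[-999, 4], [6, -999]],
                [[-999, 1], [-999, -999]]], dtype=np.int16)

# replace the sentinel with NaN (this needs a float dtype)
values = np.where(raw == -999, np.nan, raw.astype("float32"))

# reorder the axes to (lon, lat, depth) so they match the index levels, then flatten
index = pd.MultiIndex.from_product([lon, lat, depth], names=["lon", "lat", "depth"])
toy_df = pd.DataFrame({"SOCD": values.transpose(2, 1, 0).ravel()}, index=index)

print(toy_df)
print(len(toy_df) == raw.size)   # one row per cell, missing or not
```

Because every cell gets a row, the resulting frame is as long as the full grid, which is why filtering out the `NaN` rows afterwards makes such a difference.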

### Yay, no more errors!

Now this is what I'm used to data looking like!

In [22]:
# view the dataframe
df

lon     lat    depth        SOCD
-179.5  83.5   4.500000     NaN
-179.5  83.5   9.100000     NaN
-179.5  83.5   16.600000    NaN
-179.5  83.5   28.900000    NaN
-179.5  83.5   49.299999    NaN
...     ...    ...          ...
179.5   -55.5  28.900000    NaN
179.5   -55.5  49.299999    NaN
179.5   -55.5  82.900002    NaN
179.5   -55.5  138.300003   NaN


In [23]:
# take a look at the SOCD column
# to better understand the index structure
df.SOCD

lon     lat    depth     
-179.5   83.5  4.500000     NaN
               9.100000     NaN
               16.600000    NaN
               28.900000    NaN
               49.299999    NaN
                             ..
 179.5  -55.5  28.900000    NaN
               49.299999    NaN
               82.900002    NaN
               138.300003   NaN
               229.600006   NaN
Name: SOCD, Length: 58060800, dtype: float32
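
The `Length` reported above can be cross-checked against the dimension sizes, and compared with the count of unmasked values that `compressed()` returned earlier. The sketch below only does arithmetic on numbers already printed in this notebook.

```python
# dimension sizes from print(ds) and the unmasked count from socd_compressed.shape
n_depth, n_lat, n_lon = 8, 1680, 4320
n_unmasked = 16_762_449

n_cells = n_depth * n_lat * n_lon          # every (depth, lat, lon) combination
share_with_values = n_unmasked / n_cells   # fraction of cells holding a real value

print(n_cells)
print(f"{share_with_values:.1%} of cells have a value")
```

So well under a third of the rows carry a value, which seems plausible for a global grid where ocean cells presumably have no soil data.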

In [24]:
# view all the column variable names in dataframe
df.columns

Index(['SOCD'], dtype='object')

In [25]:
# view the index
df.index

MultiIndex([(-179.5,                83.5,                4.5),
            (-179.5,                83.5,  9.100000381469727),
            (-179.5,                83.5, 16.600000381469727),
            (-179.5,                83.5, 28.899999618530273),
            (-179.5,                83.5,  49.29999923706055),
            (-179.5,                83.5,   82.9000015258789),
            (-179.5,                83.5,  138.3000030517578),
            (-179.5,                83.5, 229.60000610351562),
            (-179.5,    83.4172134399414,                4.5),
            (-179.5,    83.4172134399414,  9.100000381469727),
            ...
            ( 179.5, -55.417213439941406,  138.3000030517578),
            ( 179.5, -55.417213439941406, 229.60000610351562),
            ( 179.5,               -55.5,                4.5),
            ( 179.5,               -55.5,  9.100000381469727),
            ( 179.5,               -55.5, 16.600000381469727),
            ( 179.5,               -55.

In [26]:
# # commented out because it was unnecessary to remove the index in order to remove the NA values;
# # kept for reference in case I need a different structure to plot something with
# # reset the index of the dataframe to remove the hierarchical index structure
# df3 = df.reset_index()
# # keep only the records where SOCD is not NA, using '~' to negate isna()
# df3[~df3.SOCD.isna()]

In [27]:
# try the subset to remove na values without going through the flattening of the index first
# it works without resetting the index
df2 = df[~df.SOCD.isna()]
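
The same filter can also be written with `dropna`, which some readers find easier to scan. A small sketch on a toy frame (the values are made up) showing that the two forms give the same result:

```python
import numpy as np
import pandas as pd

# toy frame with the same three-level index layout as df
index = pd.MultiIndex.from_product(
    [[-179.5, 179.5], [71.0, 70.0], [4.5, 9.1]], names=["lon", "lat", "depth"]
)
toy = pd.DataFrame(
    {"SOCD": [2.0, 1.0, np.nan, np.nan, np.nan, 3.0, np.nan, np.nan]}, index=index
)

by_mask = toy[~toy.SOCD.isna()]           # boolean-mask form used in this notebook
by_dropna = toy.dropna(subset=["SOCD"])   # equivalent built-in form

print(by_dropna)
print(by_mask.equals(by_dropna))
```

Either form keeps the hierarchical index intact, so no `reset_index()` is needed beforehand.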

In [28]:
# view the dataframe now, which retains its original indexing, but has only records with values
df2

lon     lat         depth      SOCD
-179.5  71.247467   4.500000   2.0
-179.5  71.247467   9.100000   1.0
-179.5  71.247467   16.600000  3.0
-179.5  71.247467   28.900000  2.0
-179.5  71.247467   49.299999  2.0
...     ...         ...        ...
179.5   -18.411257  16.600000  1.0
179.5   -18.411257  28.900000  2.0
179.5   -18.411257  49.299999  2.0
179.5   -18.411257  82.900002  2.0


In [29]:
# take the sum of SOCD across all depths, for each location, i.e. when 
# grouped by the first and second level of the index (i.e. by lon then lat)
df2.groupby(level=[0,1]).sum()

lon     lat         SOCD
-179.5  71.247467   12.0
-179.5  71.164680   83.0
-179.5  71.081894   83.0
-179.5  70.999107   91.0
-179.5  70.916321   93.0
...     ...         ...
179.5   -16.589935  1.0
179.5   -16.672722  22.0
179.5   -16.755508  147.0
179.5   -18.328470  36.0
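
Grouping by level name instead of position reads a little more clearly and keeps working if the order of the index levels ever changes. A sketch on a toy frame with made-up values:

```python
import pandas as pd

# toy frame: two locations, with three and two depth layers respectively
index = pd.MultiIndex.from_tuples(
    [(-179.5, 71.0, 4.5), (-179.5, 71.0, 9.1), (-179.5, 71.0, 16.6),
     (179.5, -18.0, 4.5), (179.5, -18.0, 9.1)],
    names=["lon", "lat", "depth"],
)
toy = pd.DataFrame({"SOCD": [2.0, 1.0, 3.0, 1.0, 2.0]}, index=index)

# sum over depth for each (lon, lat) pair, grouping by level name
totals = toy.groupby(level=["lon", "lat"]).sum()

# flatten to ordinary columns for plotting or merging
totals_flat = totals.reset_index()
print(totals_flat)
```

The `depth` level disappears from the result because it is the dimension being summed over.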


In [30]:
# add a reset of the index to statement above, i.e.
# # take the sum of SOCD across all depths, for each location and index each result separately
# with an unnamed index sequentially
df2.groupby(level=[0,1]).sum().reset_index()

         lon     lat         SOCD
0        -179.5  71.247467   12.0
1        -179.5  71.164680   83.0
2        -179.5  71.081894   83.0
3        -179.5  70.999107   91.0
4        -179.5  70.916321   93.0
...      ...     ...         ...
2167138  179.5   -16.589935  1.0
2167139  179.5   -16.672722  22.0
2167140  179.5   -16.755508  147.0
2167141  179.5   -18.328470  36.0


## For mapping, translate to spatial geometry

I'll use the GeoPandas package to jump from longitude and latitude columns into a mappable format. In order to try to stay true to the original raw dataset as much as possible, I'll test on the dataframe that includes NA values first to see how GeoPandas handles and displays NA values.

In [38]:
# first try this with NA values left in, to stay true to the data source as much as possible;
# flatten the multiple index levels with reset_index() so that lon and lat become ordinary columns
# (a groupby(...).sum() here would silently replace the NA values with 0)
df = df.reset_index()
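
One thing worth checking when flattening with `groupby(...).sum()`: by default pandas skips `NaN` when summing, so a group that contains only `NaN` sums to 0 rather than staying `NaN`. A small sketch with made-up values shows the difference, along with two ways to keep the missing values:

```python
import numpy as np
import pandas as pd

# toy frame: one location with no data and one with a value
index = pd.MultiIndex.from_tuples(
    [(-179.5, 83.5, 4.5), (-179.5, 71.0, 4.5)], names=["lon", "lat", "depth"]
)
toy = pd.DataFrame({"SOCD": [np.nan, 2.0]}, index=index)

levels = ["lon", "lat", "depth"]
summed = toy.groupby(level=levels).sum().reset_index()            # NaN becomes 0.0
kept = toy.groupby(level=levels).sum(min_count=1).reset_index()   # NaN is preserved
plain = toy.reset_index()                                         # no aggregation at all

print(summed)
print(kept)
print(plain)
```

Since each (lon, lat, depth) combination appears only once in this dataset, the aggregation step adds nothing, and a plain `reset_index()` is the simplest way to keep the NA values as they are.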

In [39]:
# check this dataframe's columns; lon, lat, and depth have been added to columns with SOCD now
df.columns

Index(['lon', 'lat', 'depth', 'SOCD'], dtype='object')

In [40]:
# view the new index; it's now a sequential index
df.index

RangeIndex(start=0, stop=58060800, step=1)

In [42]:
# translate lon and lat columns into a spatial geometry variable in a GeoPandas dataframe
gdf = geopandas.GeoDataFrame(df, geometry=geopandas.points_from_xy(df.lon, df.lat))

  return GeometryArray(vectorized.points_from_xy(x, y, z), crs=crs)


In [None]:
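# for the dataframe with NA removed (reset the index first so that lon and lat are columns)
#df2_flat = df2.reset_index()
#gdf2 = geopandas.GeoDataFrame(df2_flat, geometry=geopandas.points_from_xy(df2_flat.lon, df2_flat.lat))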