# Data to Pandas table

**Purpose of script:**

Combine the microwave (mw) and optical (opt) data into one pandas DataFrame and create features, with one row per pixel.

- In: all opt and mw data files
- Out: one file (possibly several due to size constraints), CSV or Parquet, containing the feature table ready to be used in the model

## Data Load

In [2]:
# import rioxarray
import xarray
# import rasterio

# from os import listdir
# from os.path import isfile, join

import pandas as pd
from datetime import datetime
import numpy as np
from tqdm import tqdm

Relevant paths:

In [3]:
mw_path = r"../Data/microwave-rs/mw_interpolated/2019-07-01_mw.tif"
opt_path = r"../Data/optical-rs/2019-07-01_grain_diameter.tif"

In [4]:
out_path = r"../Data/combined/"

Load data:

In [5]:
# load all files
# put to one xarray?
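
One possible way to prepare the "load all files" step is to pair the microwave and optical files by date before opening them. The sketch below assumes every file name starts with a `YYYY-MM-DD` date, as the two example paths above do; `pair_files_by_date` and the file lists are hypothetical and only illustrate the idea.

```python
from pathlib import Path


def pair_files_by_date(mw_files, opt_files):
    """Pair microwave and optical files that share the same date prefix."""
    # Assumption: the first 10 characters of the file name are the date (YYYY-MM-DD)
    mw_by_date = {Path(f).name[:10]: f for f in mw_files}
    opt_by_date = {Path(f).name[:10]: f for f in opt_files}
    # Keep only dates for which both a microwave and an optical file exist
    common_dates = sorted(mw_by_date.keys() & opt_by_date.keys())
    return [(d, mw_by_date[d], opt_by_date[d]) for d in common_dates]


# Hypothetical file lists; in practice they could come from Path(...).glob("*.tif")
mw_files = ["mw_interpolated/2019-07-01_mw.tif", "mw_interpolated/2019-07-02_mw.tif"]
opt_files = ["optical-rs/2019-07-01_grain_diameter.tif"]
pairs = pair_files_by_date(mw_files, opt_files)
```

Each tuple in `pairs` could then drive one iteration of the per-date conversion below, with dates that lack one of the two files skipped.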

In [6]:
#TEMP Data load:
data_opt = xarray.open_dataarray(opt_path)
data_mw = xarray.open_dataarray(mw_path)

## To Pandas

In [7]:
# put all in for loop

### Array to Pandas Dataframe

In [8]:
# convert mw to pandas
df_mw = data_mw.to_dataframe()
# fix index
df_mw = df_mw.reset_index()
# remove columns: spatial_ref, band
df_mw = df_mw[['x', 'y', 'band_data']]
# rename (reassign instead of inplace to avoid SettingWithCopyWarning)
df_mw = df_mw.rename({'band_data': 'mw_value'}, axis=1)
# ----------------------
# convert opt to pandas
df_opt = data_opt.to_dataframe()
# fix index
df_opt = df_opt.reset_index()
# remove columns: spatial_ref, band
df_opt = df_opt[['x', 'y', 'band_data']]
# rename
df_opt = df_opt.rename({'band_data': 'opt_value'}, axis=1)
# fill NaN for masked opt data with the sentinel value -1
df_opt['opt_value'] = df_opt['opt_value'].fillna(-1)

*Baptiste's suggestions:*
- `open_mfdataset()`
- concatenate along a time dimension (to put everything into one file?)

### Neighbor features dataframe

In [9]:
def get_neighbors(mat, a, b):
    # 3x3 neighborhood of pixel (a, b) in row-major order; NaN where out of bounds
    n_rows, n_cols = len(mat), len(mat[0])
    return [mat[i][j] if (0 <= i < n_rows and 0 <= j < n_cols) else np.nan
            for i in range(a-1, a+2) for j in range(b-1, b+2)]
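
As an alternative to calling `get_neighbors` once per pixel, the same table could probably be built with array operations. The sketch below is not the notebook's method: `neighbor_table` is a hypothetical helper that pads the array with NaN and stacks nine shifted views, assuming a 2D numeric array and the same column order (`v1` to `v9`, then `row`, `col`).

```python
import numpy as np


def neighbor_table(data):
    """Return an (n_pixels, 11) array: 9 neighborhood values, then row and col."""
    n_rows, n_cols = data.shape
    # Pad with NaN so edge pixels get NaN for out-of-bounds neighbors
    padded = np.pad(data.astype(float), 1, mode="constant", constant_values=np.nan)
    # Nine shifted views, row offset in the outer loop and column offset in the
    # inner loop, which matches the ordering produced by get_neighbors
    shifted = [padded[di:di + n_rows, dj:dj + n_cols].ravel()
               for di in range(3) for dj in range(3)]
    # Row and column index of every pixel, flattened in row-major order
    rows, cols = np.indices((n_rows, n_cols))
    return np.column_stack(shifted + [rows.ravel(), cols.ravel()])
```

The result can be passed to `pd.DataFrame` with the same eleven column names. Note that `row` and `col` come out as floats here because they share one array with the NaN-containing values.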

In [10]:
index_list = [(i,j) for i in range(data_mw.shape[1]) for j in range(data_mw.shape[2])]
value_list = []
data = data_mw.values[0]

for i in tqdm(index_list):
    neighbor = get_neighbors(data, *i)
    neighbor += [i[0], i[1]]
    value_list.append(neighbor)

100%|██████████| 3893306/3893306 [00:34<00:00, 111451.67it/s]


In [11]:
df_neighbors = pd.DataFrame(value_list, columns = ['v1', 'v2', 'v3','v4', 'v5', 'v6','v7', 'v8', 'v9', 'row', 'col'])

### Merge dataframes

In [12]:
df_combined = pd.merge(df_mw, df_opt, how = 'left', on = ['y', 'x']) # left: mw (smaller), right: opt

Add row and col features

In [13]:
df_combined['col'] = df_combined.groupby("x").ngroup() # xshape 2663 
df_combined['row'] = df_combined.groupby("y").ngroup(ascending=False) # yshape 1462
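
To illustrate how `ngroup()` turns coordinates into array indices, here is a toy example on a made-up 2x3 grid. It assumes a north-up raster, where the largest `y` corresponds to array row 0, which is why the row numbering uses `ascending=False`.

```python
import pandas as pd

# Toy 2x3 grid with made-up coordinates; y decreases from top to bottom
df = pd.DataFrame({
    "x": [10.0, 20.0, 30.0, 10.0, 20.0, 30.0],
    "y": [200.0, 200.0, 200.0, 100.0, 100.0, 100.0],
})
# Groups are sorted by key, so the smallest x gets column index 0
df["col"] = df.groupby("x").ngroup()
# ascending=False reverses the numbering, so the largest y gets row index 0
df["row"] = df.groupby("y").ngroup(ascending=False)
```

If the raster were stored with `y` increasing instead, the row numbering would need `ascending=True` for the merge with the neighbor table to line up.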

In [14]:
df_comb = pd.merge(df_combined, df_neighbors, how = 'left', on = ['row', 'col'])


Remove water in mw


In [15]:
df_comb = df_comb.loc[df_comb['mw_value'] != -1]
# suppress warning?

Save masked data to separate df

In [16]:
# split off the masked pixels first; filtering df_comb first would leave df_masked empty
df_masked = df_comb.loc[df_comb['opt_value'] == -1]

df_comb = df_comb.loc[df_comb['opt_value'] != -1]

### Write to csv/parquet:

In [18]:
# write to csv
#df_comb.to_csv(out_path + 'melt_2019-07-01.csv', index= False)
# write to parquet, gzip-compressed to match the file extension
df_comb.to_parquet(out_path + 'melt_2019-07-01.parquet.gzip', index=False, compression='gzip')