# Exploring TorchData for Streaming Data from an AWS S3 Bucket

This notebook explores the [TorchData](https://pytorch.org/data/beta/index.html) package for building data pipelines that stream from cloud storage, in this case an Amazon S3 bucket.

In [1]:
from PIL import Image
from pathlib import Path
import torchdata.datapipes.iter as pipes
from torchdata.datapipes.iter import IterableWrapper

In [2]:
S3_URL = "s3://drivendata-competition-biomassters-public-us"
train_features_s3 = S3_URL + "/train_features/"
train_agbm_s3 = S3_URL + "/train_agbm/"
test_features_s3 = S3_URL + "/test_features/"

In [3]:
print(train_features_s3)
print(train_agbm_s3)

s3://drivendata-competition-biomassters-public-us/train_features/
s3://drivendata-competition-biomassters-public-us/train_agbm/


In [4]:
filename = Path("s3://drivendata-competition-biomassters-public-us/train_features/0003d2eb_S1_00.tif")
print(f"Name: {filename.name}")
print(f"Stem: {filename.stem}")
print(f"Suffix: {filename.suffix}")
print(f"Chip ID: {filename.stem.split('_')[0]}")

Name: 0003d2eb_S1_00.tif
Stem: 0003d2eb_S1_00
Suffix: .tif
Chip ID: 0003d2eb


In [5]:
filename.stem

'0003d2eb_S1_00'
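
One caveat with using `pathlib` on an S3 URL: it treats the URL as an ordinary path, so converting the path back to a string collapses `s3://` to `s3:/`. The name, stem, and suffix are unaffected. A minimal sketch, assuming only that file names follow the `<chip>_<satellite>_<month>.tif` pattern seen above; `parse_chip_name` is a hypothetical helper, not part of TorchData:

```python
from pathlib import PurePosixPath

URL = "s3://drivendata-competition-biomassters-public-us/train_features/0003d2eb_S1_00.tif"

def parse_chip_name(url):
    """Split '<chip>_<satellite>_<month>.tif' into its three parts (hypothetical helper)."""
    chip_id, satellite, month = PurePosixPath(url).stem.split("_")
    return chip_id, satellite, month

# pathlib normalises the double slash, so the round-tripped string
# is no longer a valid S3 URL.
print(str(PurePosixPath(URL)).startswith("s3://"))
print(parse_chip_name(URL))
```

For this reason it seems safer to keep the original URL string for I/O and use the path object only for parsing the file name.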

In [6]:
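def filter_img(filename, satellite='S1', month='00'):
    """Return True if filename is the image for the given satellite and month."""
    file_path = Path(filename)
    chip_id = file_path.stem.split("_")[0]

    # Rebuild the expected name and compare; avoids shadowing the function name.
    expected_name = f"{chip_id}_{satellite}_{month}.tif"
    return file_path.name == expected_name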

In [7]:
filename = Path("s3://drivendata-competition-biomassters-public-us/train_features/0003d2eb_S1_00.tif")
print(filter_img(filename))
filename = Path("s3://drivendata-competition-biomassters-public-us/train_features/0003d2eb_S1_01.tif")
print(filter_img(filename))

True
False
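
Because `filter_img` takes `satellite` and `month` as keyword arguments, other images can be selected without writing a new function. A minimal sketch using `functools.partial`; the `S2`/`03` combination and the file names are only assumed examples of the naming pattern, and `filter_img` is repeated so the snippet runs on its own:

```python
from functools import partial
from pathlib import PurePosixPath

def filter_img(filename, satellite="S1", month="00"):
    file_path = PurePosixPath(filename)
    chip_id = file_path.stem.split("_")[0]
    return file_path.name == f"{chip_id}_{satellite}_{month}.tif"

# Hypothetical selection: keep only the S2 image for month "03".
filter_s2_03 = partial(filter_img, satellite="S2", month="03")

names = ["0003d2eb_S1_00.tif", "0003d2eb_S2_03.tif", "0003d2eb_S2_04.tif"]
print([n for n in names if filter_s2_03(n)])
```

The resulting callable can be passed as `filter_fn` in the same way as `filter_img`.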


In [8]:
def agbm_target(filename):
    # Expects a bare file name such as "0003d2eb_S1_00.tif", not a full S3 URL
    agbm_filename = filename.split("_")[0] + "_agbm.tif"
    return agbm_filename
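
Note that `agbm_target` only works on a bare file name. Given a full S3 URL, `split("_")[0]` stops at the underscore in `train_features`, so the directory has to be handled separately. A minimal sketch, assuming targets live under `train_agbm/` with the `<chip>_agbm.tif` naming used in this bucket; `agbm_url_for` is a hypothetical helper:

```python
S3_URL = "s3://drivendata-competition-biomassters-public-us"

def agbm_url_for(feature_url, agbm_prefix=S3_URL + "/train_agbm/"):
    """Map a feature URL to the URL of its AGBM target (hypothetical helper)."""
    # Work on the file name only, so underscores in directory names are ignored.
    name = feature_url.rsplit("/", 1)[-1]
    chip_id = name.split("_")[0]
    return f"{agbm_prefix}{chip_id}_agbm.tif"

feature_url = S3_URL + "/train_features/0003d2eb_S1_00.tif"
print(feature_url.split("_")[0])  # the naive split cuts inside "train_features"
print(agbm_url_for(feature_url))
```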

In [9]:
features_dp = IterableWrapper([train_features_s3]).list_files_by_fsspec()
# Note: Using S3 specific functions leads to CURL errors with CA Certificates
# features_dp = IterableWrapper([train_features_s3]).list_files_by_s3()
features_dp = features_dp.filter(filter_fn=filter_img)

In [10]:
# Sanity check
it1 = iter(features_dp)
print(next(it1))
print(next(it1))

s3://drivendata-competition-biomassters-public-us/train_features/0003d2eb_S1_00.tif
s3://drivendata-competition-biomassters-public-us/train_features/000aa810_S1_00.tif


In [11]:
features_dp = features_dp.sharding_filter()
features_dp = features_dp.open_files_by_fsspec(mode="rb")

# Note: Here also, using S3 specific function results in an error
# TypeError: s3_read(): incompatible function arguments. The following argument types are supported:
#    1. (self: torchdata._torchdata.S3Handler, arg0: str) -> bytes

# Invoked with: <torchdata._torchdata.S3Handler object at 0x7fb30498f030>, ('s3://drivendata-competition-biomassters-public-us/train_features/0003d2eb_S1_00.tif', StreamWrapper<<File-like object S3FileSystem, drivendata-competition-biomassters-public-us/train_features/0003d2eb_S1_00.tif>>)
# This exception is thrown by __iter__ of S3FileLoaderIterDataPipe(source_datapipe=ShardingFilterIterDataPipe)

# features_dp = features_dp.load_files_by_s3()
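
`sharding_filter` marks the point where the pipeline is split across `DataLoader` workers: once sharding is applied, each worker keeps every n-th element. Placing it before the open and read steps means each worker only fetches its own share of the files. The idea can be sketched in plain Python; this is a conceptual illustration with a hypothetical `shard` helper and made-up file names, not TorchData's implementation:

```python
from itertools import islice

def shard(items, num_workers, worker_id):
    """Keep every num_workers-th item, starting at worker_id (conceptual sketch)."""
    return islice(items, worker_id, None, num_workers)

urls = [f"chip{i}_S1_00.tif" for i in range(6)]  # made-up file names
for worker_id in range(2):
    print(worker_id, list(shard(urls, num_workers=2, worker_id=worker_id)))
```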

### Using `rasterio` instead of `PIL`

Using `PIL` to read and display these images doesn't work well, because its support for the TIFF format is limited.

Instead, I will use the `rasterio` library to read the TIFF data. [Rasterio](https://rasterio.readthedocs.io/en/latest/index.html) is a package built specifically for geospatial data.

In [12]:
from rasterio import MemoryFile

In [23]:
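import warnings
from rasterio.errors import NotGeoreferencedWarning

def read_to_array(data):
    """Read every band of an opened TIFF into an array; returns (url, array)."""
    url, file_obj = data
    raw_bytes = file_obj.read()

    with MemoryFile(raw_bytes) as memfile, warnings.catch_warnings():
        # Missing georeferencing is reported as a warning, not an exception,
        # so it is silenced with a filter rather than caught with try/except.
        warnings.simplefilter("ignore", NotGeoreferencedWarning)
        with memfile.open() as dataset:
            # Band indexes are 1-based in rasterio.
            array = dataset.read(list(range(1, dataset.count + 1)))
    return (url, array)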
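
Rasterio reports missing georeferencing as a warning rather than an exception. Python warnings go through `warnings.warn` and are not raised by default, so an `except` clause naming a warning class normally never fires; silencing them takes a warnings filter. A minimal sketch using only the standard library, where `DemoWarning`, `noisy_read`, and `count_warnings` are hypothetical stand-ins for illustration:

```python
import warnings

class DemoWarning(UserWarning):
    """Hypothetical stand-in for rasterio.errors.NotGeoreferencedWarning."""

def noisy_read():
    # Stand-in for a read that emits a warning but still succeeds.
    warnings.warn("dataset has no geotransform", DemoWarning)
    return "pixels"

def read_with_except():
    # The except branch is not taken: the warning is emitted, not raised.
    try:
        return noisy_read()
    except DemoWarning:
        return None

def read_quietly():
    # A scoped filter suppresses the warning only inside this block.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DemoWarning)
        return noisy_read()

def count_warnings(fn):
    """Run fn and return (result, number of warnings it emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = fn()
    return result, len(caught)

print(count_warnings(read_with_except))
print(count_warnings(read_quietly))
```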

In [14]:
features_dp = features_dp.map(read_to_array)

In [15]:
agbm_dp = IterableWrapper([train_agbm_s3])
agbm_dp = agbm_dp.list_files_by_fsspec()

In [16]:
# Each iter() call starts a fresh iterator, so both lines print the first file
print(next(iter(agbm_dp)))
print(next(iter(agbm_dp)))

s3://drivendata-competition-biomassters-public-us/train_agbm/0003d2eb_agbm.tif
s3://drivendata-competition-biomassters-public-us/train_agbm/0003d2eb_agbm.tif


In [17]:
agbm_dp = agbm_dp.sharding_filter()
agbm_dp = agbm_dp.open_files_by_fsspec(mode="rb")
agbm_dp = agbm_dp.map(read_to_array)

In [18]:
# zip pairs items by position, so this assumes both pipes yield chips in the same order
input_dp = features_dp.zip(agbm_dp)

In [19]:
print(type(input_dp))

<class 'torch.utils.data.datapipes.iter.combining.ZipperIterDataPipe'>
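
Because `zip` pairs elements purely by position, features and targets only line up if both listings come back in the same order with no missing chips. A safer pattern is to join on the chip ID. A minimal sketch in plain Python with made-up file names; `chip_id` and `pair_by_chip` are hypothetical helpers. TorchData also provides a key-based zipper (`zip_with_iter`) that may be worth checking for the same purpose:

```python
def chip_id(url):
    """Chip ID is the part of the file name before the first underscore."""
    return url.rsplit("/", 1)[-1].split("_")[0]

def pair_by_chip(feature_urls, target_urls):
    """Yield (feature, target) pairs whose chip IDs match; unmatched features are skipped."""
    targets = {chip_id(u): u for u in target_urls}
    for feature in feature_urls:
        target = targets.get(chip_id(feature))
        if target is not None:
            yield feature, target

# Made-up names, deliberately listed in different orders.
features = ["train_features/bbb_S1_00.tif", "train_features/aaa_S1_00.tif"]
targets = ["train_agbm/aaa_agbm.tif", "train_agbm/bbb_agbm.tif", "train_agbm/ccc_agbm.tif"]
print(list(pair_by_chip(features, targets)))
```

This version holds all target names in memory, which is fine for a file listing but would not suit the image data itself.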


In [20]:
batch = next(iter(input_dp))
len(batch)

  return DatasetReader(mempath, driver=driver, sharing=sharing, **kwargs)
Exception ignored in: <generator object ZipperIterDataPipe.__iter__ at 0x7f5af8631e40>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/datapipes/iter/combining.py", line 546, in __iter__
    unused += list(iterator)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/datapipes/_hook_iterator.py", line 185, in wrap_generator
    response = gen.send(request)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/datapipes/iter/callable.py", line 123, in __iter__
    yield self._apply_fn(data)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/datapipes/iter/callable.py", line 88, in _apply_fn
    return self.fn(data)
  File "/tmp/ipykernel_16025/3389276255.py", line 3, in read_to_array
  File "/usr/local/lib/python3.9/dist-packages/fsspec/spec.py", line 1655, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "

2

In [21]:
feat, agbm = batch