# Example of Saving Image Data as a Feature Group in the Feature Store

Image data can often be fed as raw data to deep learning models and requires less feature engineering than other types of data. Thus, in many cases you do **not** need to store image data as a feature group in the feature store; instead, you can save it directly as a training dataset in, for example, .tfrecords format.

However, sometimes you want to join image features with other types of features, and you might also need feature engineering steps such as *data augmentation, image scaling, and image normalization*. This notebook shows how to save image data as a feature group in the feature store.

In [9]:
from hops import featurestore
from hops import hdfs
import json

## Step 1: Read in the Raw Image Data

You can read the image data from HopsFS using, for example, Spark or TensorFlow. In this example we use Spark to read a batch of images stored in the path `hdfs:///Projects/demo_featurestore_admin000/mnist/`

In [10]:
image_dir = "hdfs:///Projects/demo_featurestore_admin000/mnist/"

In [11]:
hdfs.ls(image_dir)

['hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/README.md', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_1.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_10.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_2.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_3.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_4.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_5.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_6.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_7.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_8.jpg', 'hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_9.jpg']

In [12]:
image_df = spark.read.format("image").load(image_dir)
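
The directory listing above contains a `README.md` next to the JPEG files. If you need the image paths on their own (for example to count the images or to load them with another library), a small filter over the listing is one option. The following is a minimal sketch using only the Python standard library; the helper name `keep_image_paths` and the set of extensions are assumptions for illustration, not part of the `hops` API.

```python
import posixpath

# Hypothetical set of extensions treated as images; adjust for your data.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def keep_image_paths(paths):
    """Return only the paths whose file extension looks like an image.

    The comparison is case-insensitive, so 'IMG_1.JPG' is kept as well.
    """
    return [
        p for p in paths
        if posixpath.splitext(p)[1].lower() in IMAGE_EXTENSIONS
    ]


# Example with paths shaped like the hdfs.ls() output above.
listing = [
    "hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/README.md",
    "hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_1.jpg",
    "hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/mnist/img_10.jpg",
]
image_paths = keep_image_paths(listing)
```

When reading with Spark directly, the image data source also documents a `dropInvalid` option for skipping files that are not valid images, which may make a manual filter unnecessary.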

## Step 2: Process The Images (Feature Engineering)

After reading the images with, for example, Spark or TensorFlow, you can apply whatever feature engineering you like before saving them to the feature store.

In [18]:
# Placeholder: apply your own feature engineering to image_df here, e.g.
# image_df = image_df.withColumn(...)
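
As one illustration of a feature engineering step, the sketch below implements a simple data augmentation: a horizontal flip applied to the raw pixel bytes of a single image. It assumes the bytes are laid out row by row with interleaved channels and one byte per channel value, which matches how Spark's image schema describes the `data` field for 8-bit images. The function name `flip_horizontal` is hypothetical, and wrapping it in a Spark UDF to apply it to `image_df` is left out.

```python
def flip_horizontal(data: bytes, height: int, width: int, n_channels: int) -> bytes:
    """Mirror an image left-to-right.

    Assumes `data` holds height * width * n_channels bytes in row-major
    order with channels interleaved per pixel (an assumption about the
    input layout, not something this function can verify beyond length).
    """
    row_stride = width * n_channels
    if len(data) != height * row_stride:
        raise ValueError("data length does not match height * width * n_channels")

    flipped = bytearray(len(data))
    for row in range(height):
        row_start = row * row_stride
        for col in range(width):
            src = row_start + col * n_channels
            dst = row_start + (width - 1 - col) * n_channels
            # Copy one whole pixel so the channel order inside it is preserved.
            flipped[dst:dst + n_channels] = data[src:src + n_channels]
    return bytes(flipped)


# A 2x3 single-channel image: each row is reversed independently.
flipped_gray = flip_horizontal(bytes([1, 2, 3, 4, 5, 6]), height=2, width=3, n_channels=1)
```

Flipping is only a sensible augmentation when the label does not depend on orientation; for handwritten digits a mirrored image is usually not a valid example, so treat this purely as a pattern for byte-level transformations.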

## Step 3: Saving The Processed Images to the Feature Store as a Feature Group

To save the images to the feature store as a feature group, you can store them in the format that Spark automatically uses to structure images:

```
root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)
```
Or you can set up your own custom format for storing the images (for example, flattening each image to a float array).
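
A custom flattened format could look like the following minimal sketch, which turns the fields of one Spark image struct into a plain record with a flat list of floats scaled to the range [0, 1]. It assumes 8-bit unsigned pixel values (one byte per channel value); the function name `image_to_float_record` and the output field names are hypothetical choices for illustration, not a format required by the feature store.

```python
def image_to_float_record(origin: str, height: int, width: int,
                          n_channels: int, data: bytes) -> dict:
    """Flatten one image into a record with a float array.

    Assumes each byte in `data` is one 8-bit pixel value, so dividing
    by 255.0 maps it to the range [0, 1].
    """
    expected = height * width * n_channels
    if len(data) != expected:
        raise ValueError(
            "expected %d bytes but got %d" % (expected, len(data))
        )
    return {
        "origin": origin,
        "height": height,
        "width": width,
        "n_channels": n_channels,
        # Iterating over bytes yields integers in 0..255.
        "pixels": [value / 255.0 for value in data],
    }


# A tiny 1x2 single-channel image: one black and one white pixel.
record = image_to_float_record("example.jpg", 1, 2, 1, bytes([0, 255]))
```

Keeping `height`, `width`, and `n_channels` next to the flat array means the original shape can be restored later, at the cost of a larger representation than the binary `data` column.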

In [19]:
featurestore.create_featuregroup(image_df, "mnist_images_featuregroup", 
                                 feature_correlation=False, 
                                 cluster_analysis=False)

computing descriptive statistics for : mnist_images_featuregroup
computing feature histograms for: mnist_images_featuregroup
Could not compute feature histograms for: mnist_images_featuregroup, set the optional argument feature_histograms=False to skip this step,
 error: Can not generate buckets with non-number in RDD
Running sql: use demo_featurestore_admin000_featurestore

In [20]:
image_fg = featurestore.get_featuregroup("mnist_images_featuregroup")

Running sql: use demo_featurestore_admin000_featurestore
Running sql: SELECT * FROM mnist_images_featuregroup_1

In [21]:
image_fg.show(5)

+--------------------+
|               image|
+--------------------+
|[hdfs://10.0.2.15...|
|[hdfs://10.0.2.15...|
|[hdfs://10.0.2.15...|
|[hdfs://10.0.2.15...|
|[hdfs://10.0.2.15...|
+--------------------+
only showing top 5 rows

In [22]:
image_fg.printSchema()

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)