Release/2.0 (#56)

* Update check_tfrecords to use new dataset load function.
* Add tfrecord_dir to create_tfrecords output.
* Restructure test image directory to match expected format.
* Feature/dataclass (#44)
* Added data classes for types.
* Checking in progress.
* Checking in more changes.
* Converted types to classes and refactored schema into OO pattern.
* Changed OrderedDict import to support py3.6.
* Changed OrderedDict import to support py3.6.
* Updated setup.py for version.
* fixing setup.py
* Patched requirements and setup.
* Addressed comments in code review.
* Addressed code comments round 2.
* refactored IMAGE_CSV_SCHEMA.
* Merged check_test.py from dev

Co-authored-by: Carlos Ezequiel <cezequiel@google.com>

* Feature/structured data tutorial (#45)
* Converted types to classes and refactored schema into OO pattern.
* Add tutorial on structured data conversion. This changes types.FloatInput to use tf.float32 for its feature_spec attribute to address potential incompatibility with using tf.float64 type in TensorFlow Transform.

Co-authored-by: Mike Bernico <mikebernico@google.com>

* Update structured data tutorial to use output dir.
* Clarify need for proper header when using create_tfrecords. Fixes #47.
* Clean up README and update image directory notebook.
* Feature/test image dir (#49)
* Restructure test image directory to match expected format.
* Clean up README and update image directory notebook.
* Fix minor issues
* Add an explicit error message for missing train split
* Configure automated tests for Jupyter notebooks.
* Add convert_and_load function. Also refactor create_tfrecords to convert.
* Refactor check and common modules to utils.
* Add test targets for py files and notebooks.
* Feature/convert and load (#55)
* Add convert_and_load function. Also refactor create_tfrecords to convert.
* Refactor check and common modules to utils.
* Add test targets for py files and notebooks.
* Update version in setup.py and release notes.
* Fix issues with GCS path parsing.

Co-authored-by: Mike Bernico <mikebernico@google.com>
Co-authored-by: Sergii Khomenko <khomenko@brainscode.com>
google · Nov 4, 2020 · f4650ca · f4650ca
1 parent 747dabe
commit f4650ca
Showing 38 changed files with 2,333 additions and 742 deletions.
diff --git a/.github/workflows/python-cicd.yml b/.github/workflows/python-cicd.yml
@@ -7,7 +7,6 @@ on: [push]
 
 jobs:
   build:
-
     runs-on: ubuntu-latest
     strategy:
       matrix:
@@ -23,10 +22,14 @@ jobs:
       run: |
         python -m pip install --upgrade pip
         if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+
+    - name: Run all tests
+      run: |
+        export PYTHONPATH="$GITHUB_WORKSPACE"
+        make test
+
     - name: Lint with pylint
       run: |
         make pylint
 
-    - name: Run tests
-      run: |
-        make test
+
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,6 @@
+build/
+dist/
+tfrecorder.egg-info
 .idea/
 .ipynb_checkpoints/
 .vscode/

diff --git a/Makefile b/Makefile
@@ -1,12 +1,17 @@
-all: init test pylint
+all: init test-nb test pylint
 
 init:
 	pip install -r requirements.txt
 
-test:
+test: test-nb test-py
+
+test-py:
 	nosetests --with-coverage -v --cover-package=tfrecorder
 
+test-nb:
+	ls -1 samples/*.ipynb | grep -v '^.*Dataflow.ipynb' | xargs py.test --nbval-lax -p no:python
+
 pylint:
-	pylint tfrecorder
+	pylint -j 0 tfrecorder
 
-.PHONY: all init test pylint 
+.PHONY: all init test pylint
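
The `test-nb` target above lists the sample notebooks, drops any whose name matches the `grep -v` pattern, and hands the rest to `py.test --nbval-lax`. A minimal Python sketch of just that selection step, for readers who want to check which notebooks would be exercised; the function name and the example file names are hypothetical and not part of the repository:

```python
import re

# The same regular expression the Makefile passes to `grep -v`.
# The unescaped '.' matches any character, exactly as it does in grep.
EXCLUDE_PATTERN = re.compile(r'^.*Dataflow.ipynb')


def select_notebooks(paths):
    """Returns the paths that `grep -v` would let through, in input order."""
    return [path for path in paths if not EXCLUDE_PATTERN.search(path)]


# Hypothetical notebook names, used only to illustrate the filter.
candidates = [
    'samples/Basic-Usage.ipynb',
    'samples/Convert-with-Dataflow.ipynb',
]
for notebook in select_notebooks(candidates):
    print(notebook)
```

Notebooks that need a Dataflow runner are skipped, presumably because they cannot run without cloud credentials in CI.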
diff --git a/README.md b/README.md
@@ -9,7 +9,7 @@ TFRecorder can convert any Pandas DataFrame or CSV file into TFRecords. If your
 [Release Notes](RELEASE.md)
 
 ## Why TFRecorder?
-Using the TFRecord storage format is important for optimal machine learning pipelines and getting the most from your hardware (in cloud or on prem). The TFRecorder project started inside [Google Cloud AI Services](https://cloud.google.com/consulting) when we realized we were writing TFRecord conversion code over and over again.  
+Using the TFRecord storage format is important for optimal machine learning pipelines and getting the most from your hardware (in cloud or on prem). The TFRecorder project started inside [Google Cloud AI Services](https://cloud.google.com/consulting) when we realized we were writing TFRecord conversion code over and over again.
 
 When to use TFRecords:
 * Your model is input bound (reading data is impacting training time).
@@ -71,7 +71,7 @@ df.tensorflow.to_tfr(output_dir='/my/output/path')
 
 Google Cloud Platform Dataflow workers need to be supplied with the tfrecorder
 package that you would like to run remotely.  To do so first download or build
-the package (a python wheel file) and then specify the path the the file when
+the package (a python wheel file) and then specify the path to the file when
 tfrecorder is called.
 
 Step 1: Download or create the wheel file.
@@ -109,7 +109,7 @@ Using Python interpreter:
 ```python
 import tfrecorder
 
-tfrecorder.create_tfrecords(
+tfrecorder.convert(
     source='/path/to/data.csv',
     output_dir='gs://my/bucket')
 ```
@@ -126,10 +126,9 @@ tfrecorder create-tfrecords \
 ```python
 import tfrecorder
 
-tfrecorder.create_tfrecords(
+tfrecorder.convert(
     source='/path/to/image_dir',
-    output_dir='gs://my/bucket',
-)
+    output_dir='gs://my/bucket')
 ```
 
 The image directory should have the following general structure:
@@ -159,7 +158,7 @@ images/
 
 ### Loading a TF Dataset from TFRecord files
 
-You can load a TensorFlow dataset from TFRecord files generated by TFRecorder 
+You can load a TensorFlow dataset from TFRecord files generated by TFRecorder
 on your local machine.
 
 ```python
@@ -175,8 +174,9 @@ Using Python interpreter:
 ```python
 import tfrecorder
 
-tfrecorder.check_tfrecords(
-    file_pattern='/path/to/tfrecords/train*.tfrecord.gz',
+tfrecorder.inspect(
+    tfrecord_dir='/path/to/tfrecords/',
+    split='TRAIN',
     num_records=5,
     output_dir='/tmp/output')
 ```
@@ -187,16 +187,17 @@ representing the images encoded into TFRecords.
 Using the command line:
 
 ```bash
-tfrecorder check-tfrecords \
-    --file_pattern=/path/to/tfrecords/train*.tfrecord.gz \
+tfrecorder inspect \
+    --tfrecord-dir=/path/to/tfrecords/ \
+    --split='TRAIN' \
     --num_records=5 \
     --output_dir=/tmp/output
 ```
 
 ## Default Schema
 
-If you don't specify an input schema, TFRecorder expects data to be in the same format as 
-[AutoML Vision input](https://cloud.google.com/vision/automl/docs/prepare).  
+If you don't specify an input schema, TFRecorder expects data to be in the same format as
+[AutoML Vision input](https://cloud.google.com/vision/automl/docs/prepare).
 This format looks like a Pandas DataFrame or CSV formatted as:
 
 | split | image_uri                 | label |
@@ -205,139 +206,139 @@ This format looks like a Pandas DataFrame or CSV formatted as:
 
 where:
 * `split` can take on the values TRAIN, VALIDATION, and TEST
-* `image_uri` specifies a local or Google Cloud Storage location for the image file. 
-* `label` can be either a text based label that will be integerized or integer
+* `image_uri` specifies a local or Google Cloud Storage location for the image file.
+* `label` can be either a text-based label that will be integerized or integer
 
 ## Flexible Schema
 
-TFRecorder's flexible schema system allows you to use any schema you want for your input data. To support any input data schema, provide a schema map to TFRecorder. A TFRecorder schema_map creates a mapping between your dataframe column names and their types in the resulting
-TFRecord.
+TFRecorder's flexible schema system allows you to use any schema you want for your input data.
 
-### Creating and using a schema map
-A schema map is a Python dictionary that maps DataFrame column names to [supported
-TFRecorder types.](#Supported-types)
+For example, the default image CSV schema input can be defined like this:
+```python
+import pandas as pd
+import tfrecorder
+from tfrecorder import input_schema
+from tfrecorder import types
 
-For example, the default image CSV input can be defined like this:
+image_csv_schema = input_schema.Schema({
+    'split': types.SplitKey,
+    'image_uri': types.ImageUri,
+    'label': types.StringLabel
+})
 
-```python
-from tfrecorder import schema
+# You can then pass the schema to `tfrecorder.convert`.
 
-image_csv_schema = {
-    'split': schema.split_key,
-    'image_uri': schema.image_uri,
-    'label': schema.string_label
-}
+df = pd.read_csv(...)
+df.tensorflow.to_tfr(
+    output_dir='gs://my/bucket',
+    schema=image_csv_schema,
+    runner='DataflowRunner',
+    project='my-project',
+    region='us-central1')
 ```
-Once created a schema_map can be sent to TFRecorder.
+
+### Flexible Schema Example
+
+Imagine that you have a dataset that you would like to convert to TFRecords that
+looks like this:
+
+| split | x     |   y  | label |
+|-------|-------|------|-------|
+| TRAIN | 0.32  | 42   |1      |
+
+You can use TFRecorder as shown below:
 
 ```python
 import pandas as pd
-from tfrecorder import schema
 import tfrecorder
+from tfrecorder import input_schema
+from tfrecorder import types
+
+# First create a schema map
+schema = input_schema.Schema({
+    'split': types.SplitKey,
+    'x': types.FloatInput,
+    'y': types.IntegerInput,
+    'label': types.IntegerLabel,
+})
+
+# Now call TFRecorder with the specified schema_map
 
 df = pd.read_csv(...)
 df.tensorflow.to_tfr(
     output_dir='gs://my/bucket',
-    schema_map=schema.image_csv_schema,
+    schema=schema,
     runner='DataflowRunner',
     project='my-project',
     region='us-central1')
 ```
+After calling TFRecorder's `to_tfr()` function, TFRecorder will create an Apache Beam pipeline, either locally or in this case
+using Google Cloud's Dataflow runner. This Beam pipeline will use the schema map to identify the types you've associated with
+each data column and process your data using [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) and TFRecorder's image processing functions to convert the data into TFRecords.
 
 ### Supported types
-TFRecorder's schema system supports several types, all listed below. You can use
-these types by referencing them in the schema map. Each type informs TFRecorder how 
-to treat your DataFrame columns.  For example, the schema mapping 
-`my_split_key: schema.SplitKeyType` tells TFRecorder to treat the column `my_split_key` as
-type `schema.SplitKeyType` and create dataset splits based on it's contents. 
 
-#### schema.ImageUriType
-* Specifies the path to an image. When specified, TFRecorder
-will load the specified image and store the image as a [base64 encoded](https://docs.python.org/3/library/base64.html)
- [tf.string](https://www.tensorflow.org/tutorials/load_data/unicode) in the key 'image' 
-along with the height, width, and image channels  as integers using they keys 'image_height', 'image_width', and 'image_channels'.
-* A schema can contain only one imageUriType
+TFRecorder's schema system supports several types.
+You can use these types by referencing them in the schema map.
+Each type informs TFRecorder how to treat your DataFrame columns.
+
+#### types.SplitKey
 
-#### schema.SplitKeyType
 * A split key is required for TFRecorder at this time.
 * Only one split key is allowed.
-* Specifies a split key that TFRecorder will use to partition the 
+* Specifies a split key that TFRecorder will use to partition the
 input dataset on.
 * Allowed values are 'TRAIN', 'VALIDATION', and 'TEST'
 
-Note: If you do not want your data to be partitioned please include a split_key and
-set all rows to TRAIN.
+Note: If you do not want your data to be partitioned, include a column with
+`types.SplitKey` and set all the elements to `TRAIN`.
+
+#### types.ImageUri
+
+* Specifies the path to an image. When specified, TFRecorder
+will load the specified image and store the image as a [base64 encoded](https://docs.python.org/3/library/base64.html)
+ [tf.string](https://www.tensorflow.org/tutorials/load_data/unicode) in the key 'image'
+along with the height, width, and image channels  as integers using the keys 'image_height', 'image_width', and 'image_channels'.
+* A schema can contain only one imageUri column
+
+#### types.IntegerInput
 
-#### schema.IntegerInputType
 * Specifies an int input.
 * Will be scaled to mean 0, variance 1.
 
-#### schema.FloatInputType
+#### types.FloatInput
+
 * Specifies a float input.
 * Will be scaled to mean 0, variance 1.
 
-#### schema.CategoricalInputType
+#### types.CategoricalInput
+
 * Specifies a string input.
 * Vocabulary computed and output integerized.
 
-#### schema.IntegerLabelType
+#### types.IntegerLabel
+
 * Specifies an integer target.
 * Not transformed.
 
-#### schema.StringLabelType
+#### types.StringLabel
+
 * Specifies a string target.
 * Vocabulary computed and *output integerized.*
 
-### Flexible Schema Example
-
-Imagine that you have a dataset that you would like to convert to TFRecords that 
-looks like this:
-
-| split | x     |   y  | label |
-|-------|-------|------|-------|
-| TRAIN | 0.32  | 42   |1      |
-
-You can use TFRecorder as shown below:
-
-```python
-import pandas as pd
-import tfrecorder
-from tfrecorder import schema
-
-# First create a schema map
-schema_map = {
-    'split':schema.SplitKeyType,
-    'x':schema.FloatInputType,
-    'y':schema.IntegerInputType,
-    'label':schema.IntegerLabelType
-}
-
-# Now call TFRecorder with the specified schema_map
-
-df = pd.read_csv(...)
-df.tensorflow.to_tfr(
-    output_dir='gs://my/bucket',
-    schema_map=schema_map,
-    runner='DataflowRunner',
-    project='my-project',
-    region='us-central1')
-```
-After calling TFRecorder's to_tfr() function, TFRecorder will create an Apache beam pipeline, either locally or in this case
-using Google Cloud's Dataflow runner. This beam pipeline will use the schema map to identify the types you've associated with
-each data column and process your data using [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) and TFRecorder's image processing functions to convert the data into into TFRecords.
-
 ## Contributing
 
-Pull requests are welcome. Please see our [code of conduct](docs/code-of-conduct.md) and [contributing guide](docs/contributing.md).
+Pull requests are welcome.
+Please see our [code of conduct](docs/code-of-conduct.md) and [contributing guide](docs/contributing.md).
 
 ## Why TFRecorder?
-Using the TFRecord storage format is important for optimal machine learning pipelines and getting the most from your hardware (in cloud or on prem). 
+
+Using the TFRecord storage format is important for optimal machine learning pipelines and getting the most from your hardware (in cloud or on prem).
 
 TFRecords help when:
 * Your model is input bound (reading data is impacting training time).
 * Anytime you want to use tf.Dataset
 * When your dataset can't fit into memory
 
-
-In our work at [Google Cloud AI Services](https://cloud.google.com/consulting) we wanted to help our users spend their time writing AI/ML applications, and spend less time converting data. 
-
+Need help with using AI in the cloud?
+Visit [Google Cloud AI Services](https://cloud.google.com/consulting).
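
The README changes above replace a plain dictionary of type objects with a `Schema` object built from type classes. The sketch below is one way such a wrapper could enforce the rules the README states (a split key is required, only one is allowed, and at most one image URI column). It is an illustration under those assumptions only: every class here is a hypothetical stand-in and not the actual `tfrecorder.input_schema` or `tfrecorder.types` implementation.

```python
import dataclasses
from typing import Dict, List, Type


class SupportedType:
    """Hypothetical base class standing in for the types in tfrecorder.types."""


class SplitKey(SupportedType):
    """Marks the column that holds TRAIN / VALIDATION / TEST."""


class ImageUri(SupportedType):
    """Marks the column that holds an image path."""


class FloatInput(SupportedType):
    """Marks a float feature column."""


class IntegerInput(SupportedType):
    """Marks an integer feature column."""


class IntegerLabel(SupportedType):
    """Marks an integer target column."""


@dataclasses.dataclass(frozen=True)
class Schema:
    """Maps column names to type classes and checks the documented rules."""
    schema_map: Dict[str, Type[SupportedType]]

    def __post_init__(self):
        kinds = list(self.schema_map.values())
        if kinds.count(SplitKey) != 1:
            raise ValueError('Schema needs exactly one SplitKey column.')
        if kinds.count(ImageUri) > 1:
            raise ValueError('Schema allows at most one ImageUri column.')

    def columns_of(self, kind: Type[SupportedType]) -> List[str]:
        """Returns the names of all columns declared with the given type."""
        return [name for name, k in self.schema_map.items() if k is kind]


schema = Schema({
    'split': SplitKey,
    'x': FloatInput,
    'y': IntegerInput,
    'label': IntegerLabel,
})
print(schema.columns_of(SplitKey))
```

Validating in `__post_init__` means a malformed schema fails when it is constructed, before any pipeline is launched.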
diff --git a/RELEASE.md b/RELEASE.md
@@ -1,3 +1,11 @@
+# Release 2.0
+
+* Changes `create_tfrecords` and `check_tfrecords` to `convert` and `inspect` respectively
+* Adds `convert_and_load` function
+* Changes flexible schema to use `dataclasses`
+* Adds automated testing for notebooks
+* Minor fixes and usability improvements
+
 # Hotfix 1.1.3
 
 * Adds note regarding DataFrame header specification in README.md.
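
With the rename to `inspect`, callers pass `tfrecord_dir` and `split` where `check_tfrecords` took a single `file_pattern`. The sketch below shows how the two could relate, assuming output files begin with the lowercase split name, as in the old README example `train*.tfrecord.gz`. That naming is confirmed by the source only for the train split, and the helper name is hypothetical, not a TFRecorder function.

```python
import posixpath

ALLOWED_SPLITS = ('TRAIN', 'VALIDATION', 'TEST')


def split_file_pattern(tfrecord_dir, split):
    """Builds a glob pattern for one split (assumed file naming, see above)."""
    if split not in ALLOWED_SPLITS:
        raise ValueError(
            f'Unknown split {split!r}; expected one of {ALLOWED_SPLITS}.')
    # posixpath keeps '/' separators, which also suits gs:// style paths.
    return posixpath.join(tfrecord_dir, f'{split.lower()}*.tfrecord.gz')


print(split_file_pattern('/path/to/tfrecords/', 'TRAIN'))
```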

diff --git a/requirements.txt b/requirements.txt
@@ -12,3 +12,6 @@ jupyter >= 1.0.0
 tensorflow >= 2.3.1
 pyarrow <0.18,>=0.17
 frozendict >= 1.2
+dataclasses >= 0.5;python_version<"3.7"
+nbval >= 0.9.6
+pytest >= 6.1.1
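
The `dataclasses >= 0.5;python_version<"3.7"` line installs the backport only on interpreters older than 3.7, where `dataclasses` is not yet in the standard library. A rough Python equivalent of that marker is sketched below; the function name is hypothetical. The comparison uses version tuples because a plain string comparison would wrongly sort `"3.10"` before `"3.7"`.

```python
import sys


def needs_dataclasses_backport(version_info=sys.version_info):
    """True when the interpreter predates the stdlib dataclasses module."""
    return tuple(version_info[:2]) < (3, 7)


if needs_dataclasses_backport():
    print('Python < 3.7: the dataclasses backport from PyPI is required.')
else:
    print('Python >= 3.7: dataclasses ships with the standard library.')
```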