This repository has been archived by the owner on Jul 31, 2023. It is now read-only.

Release/2.0 #56

Merged: 31 commits, Nov 4, 2020
Commits
64c95f1
Merge branch 'release/1.1.1' into dev
cfezequiel Oct 7, 2020
b0e2d53
Merge branch 'hotfix/update-contributing' into dev
cfezequiel Oct 8, 2020
7fae145
Update check_tfrecords to use new dataset load function.
cfezequiel Oct 9, 2020
7f244ba
Merge pull request #39 from google/feature/update-check
cfezequiel Oct 13, 2020
e3e5807
Add tfrecord_dir to create_tfrecords output.
cfezequiel Oct 15, 2020
8771abe
Restructure test image directory to match expected format.
cfezequiel Oct 15, 2020
a5fb1b3
Feature/dataclass (#44)
mbernico Oct 16, 2020
2af364d
Feature/structured data tutorial (#45)
cfezequiel Oct 16, 2020
2833a70
Merge branch 'dev' into feature/show-output-dir
cfezequiel Oct 19, 2020
8d0b2d9
Update structured data tutorial to use output dir.
cfezequiel Oct 19, 2020
1599ae5
Merge branch 'dev' into feature/test-image-dir
cfezequiel Oct 19, 2020
e276f8d
Merge pull request #46 from google/feature/show-output-dir
cfezequiel Oct 21, 2020
f7d7d8d
Clarify need for proper header when using create_tfrecords. Fixes #47.
cfezequiel Oct 21, 2020
bff529e
Merge branch 'dev' into feature/test-image-dir
cfezequiel Oct 21, 2020
d367aa8
Clean up README and update image directory notebook.
cfezequiel Oct 21, 2020
a142292
Feature/test image dir (#49)
cfezequiel Oct 22, 2020
65e5bdb
Merge branch 'feature/test-image-dir' into dev
cfezequiel Oct 22, 2020
baadc79
Merge branch 'hotfix/1.1.3' into dev
cfezequiel Oct 22, 2020
5eaa204
Fix minor issues
lc0 Oct 22, 2020
1c69fc4
Add an explicit error message for missing train split
lc0 Oct 22, 2020
36b7d6e
Merge pull request #53 from lc0/dev
cfezequiel Oct 23, 2020
d1d153b
Configure automated tests for Jupyter notebooks.
cfezequiel Oct 22, 2020
926d53c
Merge pull request #52 from google/feature/notebook-test
cfezequiel Oct 28, 2020
1db8664
Add convert_and_load function.
cfezequiel Oct 27, 2020
7f3f480
Refactor check and common modules to utils.
cfezequiel Oct 28, 2020
68d94e2
Add test targets for py files and notebooks.
cfezequiel Oct 30, 2020
1115d7a
Feature/convert and load (#55)
cfezequiel Oct 30, 2020
7b30a98
Merge branch 'feature/convert-and-load' into dev
cfezequiel Oct 30, 2020
3aebfe6
Merge branch 'master' into release/2.0
cfezequiel Oct 30, 2020
9e9e145
Update version in setup.py and release notes.
cfezequiel Nov 2, 2020
19d1a5c
Fix issues with GCS path parsing.
cfezequiel Nov 2, 2020
11 changes: 7 additions & 4 deletions .github/workflows/python-cicd.yml
@@ -7,7 +7,6 @@ on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
@@ -23,10 +22,14 @@ jobs:
        run: |
          python -m pip install --upgrade pip
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi

      - name: Run all tests
        run: |
          export PYTHONPATH="$GITHUB_WORKSPACE"
          make test

      - name: Lint with pylint
        run: |
          make pylint

      - name: Run tests
        run: |
          make test

3 changes: 3 additions & 0 deletions .gitignore
@@ -1,3 +1,6 @@
build/
dist/
tfrecorder.egg-info
.idea/
.ipynb_checkpoints/
.vscode/
13 changes: 9 additions & 4 deletions Makefile
@@ -1,12 +1,17 @@
all: init testnb test pylint

init:
pip install -r requirements.txt

test: test-nb test-py

test-py:
nosetests --with-coverage -v --cover-package=tfrecorder

test-nb:
ls -1 samples/*.ipynb | grep -v '^.*Dataflow.ipynb' | xargs py.test --nbval-lax -p no:python

pylint:
pylint -j 0 tfrecorder

.PHONY: all init test pylint
189 changes: 95 additions & 94 deletions README.md
@@ -9,7 +9,7 @@ TFRecorder can convert any Pandas DataFrame or CSV file into TFRecords. If your
[Release Notes](RELEASE.md)

## Why TFRecorder?
Using the TFRecord storage format is important for optimal machine learning pipelines and getting the most from your hardware (in cloud or on prem). The TFRecorder project started inside [Google Cloud AI Services](https://cloud.google.com/consulting) when we realized we were writing TFRecord conversion code over and over again.

When to use TFRecords:
* Your model is input bound (reading data is impacting training time).
@@ -71,7 +71,7 @@ df.tensorflow.to_tfr(output_dir='/my/output/path')

Google Cloud Platform Dataflow workers need to be supplied with the tfrecorder
package that you would like to run remotely. To do so, first download or build
the package (a python wheel file) and then specify the path to the file when
tfrecorder is called.

Step 1: Download or create the wheel file.
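
The remaining steps are collapsed in this diff view. As a rough sketch of where the wheel is eventually used (the `tfrecorder_wheel` keyword argument is an assumption and is not shown in the visible diff), the Dataflow invocation might look like:

```python
import pandas as pd
import tfrecorder  # registers the .tensorflow DataFrame accessor

df = pd.read_csv('/path/to/data.csv')
df.tensorflow.to_tfr(
    output_dir='gs://my/bucket',
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    # Assumed keyword: points Dataflow workers at the downloaded or built wheel.
    tfrecorder_wheel='/path/to/tfrecorder.whl')
```
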
@@ -109,7 +109,7 @@ Using Python interpreter:
```python
import tfrecorder

tfrecorder.convert(
    source='/path/to/data.csv',
    output_dir='gs://my/bucket')
```
@@ -126,10 +126,9 @@ tfrecorder create-tfrecords \
```python
import tfrecorder

tfrecorder.convert(
    source='/path/to/image_dir',
    output_dir='gs://my/bucket')
```

The image directory should have the following general structure:
@@ -159,7 +158,7 @@ images/
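
The directory tree itself is collapsed in this diff view. As an assumed illustration, with the nesting taken to be split, then label, then image files (directory and file names here are hypothetical):

```
images/
    TRAIN/
        cat/
            cat001.jpg
        dog/
            dog001.jpg
    VALIDATION/
        cat/
            cat002.jpg
    TEST/
        dog/
            dog002.jpg
```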

### Loading a TF Dataset from TFRecord files

You can load a TensorFlow dataset from TFRecord files generated by TFRecorder
on your local machine.

```python
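# The loading example is collapsed in this diff view. A minimal sketch follows,
# assuming v2.0 exposes a `load` function that returns one tf.data.Dataset per
# split found in the TFRecord directory (an assumed API, not visible here).
import tfrecorder

dataset_dict = tfrecorder.load('/path/to/tfrecord_dir')
train_dataset = dataset_dict['TRAIN']
```
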
@@ -175,8 +174,9 @@ Using Python interpreter:
```python
import tfrecorder

tfrecorder.inspect(
    tfrecord_dir='/path/to/tfrecords/',
    split='TRAIN',
    num_records=5,
    output_dir='/tmp/output')
```
Expand All @@ -187,16 +187,17 @@ representing the images encoded into TFRecords.
Using the command line:

```bash
tfrecorder inspect \
    --tfrecord-dir=/path/to/tfrecords/ \
    --split='TRAIN' \
    --num_records=5 \
    --output_dir=/tmp/output
```

## Default Schema

If you don't specify an input schema, TFRecorder expects data to be in the same format as
[AutoML Vision input](https://cloud.google.com/vision/automl/docs/prepare).
This format looks like a Pandas DataFrame or CSV formatted as:

| split | image_uri | label |
|-------|-----------|-------|

@@ -205,139 +206,139 @@ This format looks like a Pandas DataFrame or CSV formatted as:

where:
* `split` can take on the values TRAIN, VALIDATION, and TEST
* `image_uri` specifies a local or Google Cloud Storage location for the image file.
* `label` can be either a text-based label that will be integerized, or an integer
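
As an illustration (not taken from this diff; the values are made up), a DataFrame matching this default schema could be built like this:

```python
import pandas as pd

# Hypothetical data following the default AutoML-style schema.
df = pd.DataFrame({
    'split': ['TRAIN', 'VALIDATION', 'TEST'],
    'image_uri': [
        'gs://my-bucket/images/cat001.jpg',
        'gs://my-bucket/images/cat002.jpg',
        'gs://my-bucket/images/dog001.jpg',
    ],
    'label': ['cat', 'cat', 'dog'],
})
```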

## Flexible Schema

TFRecorder's flexible schema system allows you to use any schema you want for your input data.

### Creating and using a schema map
A schema map is a Python dictionary that maps DataFrame column names to [supported
TFRecorder types.](#Supported-types)
For example, the default image CSV schema input can be defined like this:
```python
import pandas as pd
import tfrecorder
from tfrecorder import input_schema
from tfrecorder import types

image_csv_schema = input_schema.Schema({
    'split': types.SplitKey,
    'image_uri': types.ImageUri,
    'label': types.StringLabel
})

# You can then pass the schema to `tfrecorder.create_tfrecords`.

df = pd.read_csv(...)
df.tensorflow.to_tfr(
    output_dir='gs://my/bucket',
    schema_map=image_csv_schema,
    runner='DataflowRunner',
    project='my-project',
    region='us-central1')
```

### Flexible Schema Example

Imagine that you have a dataset that you would like to convert to TFRecords that
looks like this:

| split | x | y | label |
|-------|-------|------|-------|
| TRAIN | 0.32 | 42 |1 |

You can use TFRecorder as shown below:

```python
import pandas as pd
import tfrecorder
from tfrecorder import input_schema
from tfrecorder import types

# First create a schema map
schema = input_schema.Schema({
    'split': types.SplitKey,
    'x': types.FloatInput,
    'y': types.IntegerInput,
    'label': types.IntegerLabel,
})

# Now call TFRecorder with the specified schema_map

df = pd.read_csv(...)
df.tensorflow.to_tfr(
    output_dir='gs://my/bucket',
    schema=schema,
    runner='DataflowRunner',
    project='my-project',
    region='us-central1')
```
After calling TFRecorder's `to_tfr()` function, TFRecorder will create an Apache Beam pipeline, either locally or in this case
using Google Cloud's Dataflow runner. This Beam pipeline will use the schema map to identify the types you've associated with
each data column and process your data using [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) and TFRecorder's image processing functions to convert the data into TFRecords.

### Supported types
TFRecorder's schema system supports several types.
You can use these types by referencing them in the schema map.
Each type informs TFRecorder how to treat your DataFrame columns.
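
For reference, a schema combining several of these types could be declared as follows. This is a hypothetical sketch that reuses the `input_schema.Schema` constructor and type names shown above; the column names are made up:

```python
from tfrecorder import input_schema
from tfrecorder import types

# Hypothetical schema mixing split, image, numeric, categorical, and label columns.
mixed_schema = input_schema.Schema({
    'split': types.SplitKey,
    'image_uri': types.ImageUri,
    'age': types.IntegerInput,
    'weight': types.FloatInput,
    'color': types.CategoricalInput,
    'label': types.StringLabel,
})
```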

#### types.SplitKey

* A split key is required for TFRecorder at this time.
* Only one split key is allowed.
* Specifies a split key that TFRecorder will use to partition the
input dataset on.
* Allowed values are 'TRAIN', 'VALIDATION', and 'TEST'

Note: If you do not want your data to be partitioned, include a column with
`types.SplitKey` and set all the elements to `TRAIN`.

#### types.ImageUri

* Specifies the path to an image. When specified, TFRecorder
will load the specified image and store the image as a [base64 encoded](https://docs.python.org/3/library/base64.html)
[tf.string](https://www.tensorflow.org/tutorials/load_data/unicode) in the key 'image'
along with the height, width, and image channels as integers using the keys 'image_height', 'image_width', and 'image_channels'.
* A schema can contain only one imageUri column

#### types.IntegerInput

* Specifies an int input.
* Will be scaled to mean 0, variance 1.

#### types.FloatInput

* Specifies a float input.
* Will be scaled to mean 0, variance 1.

#### types.CategoricalInput

* Specifies a string input.
* Vocabulary computed and output integerized.

#### types.IntegerLabel

* Specifies an integer target.
* Not transformed.

#### types.StringLabel

* Specifies a string target.
* Vocabulary computed and *output integerized.*


## Contributing

Pull requests are welcome.
Please see our [code of conduct](docs/code-of-conduct.md) and [contributing guide](docs/contributing.md).

## Why TFRecorder?

Using the TFRecord storage format is important for optimal machine learning pipelines and getting the most from your hardware (in cloud or on prem).

TFRecords help when:
* Your model is input bound (reading data is impacting training time).
* Anytime you want to use tf.Dataset
* When your dataset can't fit into memory


In our work at [Google Cloud AI Services](https://cloud.google.com/consulting) we wanted to help our users spend their time writing AI/ML applications, and spend less time converting data.

Need help with using AI in the cloud?
Visit [Google Cloud AI Services](https://cloud.google.com/consulting).
8 changes: 8 additions & 0 deletions RELEASE.md
@@ -1,3 +1,11 @@
# Release 2.0

* Changes `create_tfrecords` and `check_tfrecords` to `convert` and `inspect`, respectively
* Adds `convert_and_load` function
* Changes flexible schema to use `dataclasses`
* Adds automated testing for notebooks
* Minor fixes and usability improvements

# Hotfix 1.1.3

* Adds note regarding DataFrame header specification in README.md.
3 changes: 3 additions & 0 deletions requirements.txt
@@ -12,3 +12,6 @@ jupyter >= 1.0.0
tensorflow >= 2.3.1
pyarrow <0.18,>=0.17
frozendict >= 1.2
dataclasses >= 0.5;python_version<"3.7"
nbval >= 0.9.6
pytest >= 6.1.1