# TFRecord

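A ```TFRecord``` file stores a sequence of binary records. Each record is typically a serialized [Protobuf](https://developers.google.com/protocol-buffers) message; Protobuf is a binary serialization format for storing structured records and exchanging them over the network.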

> Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data

In TensorFlow, a record of features is represented as a ```tf.train.Example``` message, which is serialized to bytes and stored as one record in a ```TFRecord``` file.
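
For intuition about what "a sequence of binary records" means on disk, the sketch below frames and reads back records in pure Python. It assumes the uncompressed layout described in TensorFlow's TFRecord documentation (a little-endian uint64 length, a masked CRC-32C of the length, the payload, and a masked CRC-32C of the payload). The helper names `crc32c`, `masked_crc32c`, `frame_record`, and `read_records` are illustrative and not part of TensorFlow; in practice `tf.io.TFRecordWriter` does this for you.

```python
import struct

# CRC-32C (Castagnoli) lookup table, built from the reflected polynomial 0x82F63B78.
_CRC32C_TABLE = []
for _i in range(256):
    _crc = _i
    for _ in range(8):
        _crc = (_crc >> 1) ^ 0x82F63B78 if _crc & 1 else _crc >> 1
    _CRC32C_TABLE.append(_crc)


def crc32c(data: bytes) -> int:
    """Plain CRC-32C checksum of data."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data: bytes) -> int:
    """Masked CRC: rotate right by 15 bits, then add a constant (mod 2**32)."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload: bytes) -> bytes:
    """Wrap one payload as: length, CRC of length, payload, CRC of payload."""
    length = struct.pack("<Q", len(payload))
    return b"".join([
        length,
        struct.pack("<I", masked_crc32c(length)),
        payload,
        struct.pack("<I", masked_crc32c(payload)),
    ])


def read_records(buffer: bytes):
    """Yield each payload from a buffer of framed records, verifying both CRCs."""
    offset = 0
    while offset < len(buffer):
        (length,) = struct.unpack_from("<Q", buffer, offset)
        (length_crc,) = struct.unpack_from("<I", buffer, offset + 8)
        if length_crc != masked_crc32c(buffer[offset:offset + 8]):
            raise ValueError("corrupted length field")
        start = offset + 12
        payload = buffer[start:start + length]
        (payload_crc,) = struct.unpack_from("<I", buffer, start + length)
        if payload_crc != masked_crc32c(payload):
            raise ValueError("corrupted payload")
        yield payload
        offset = start + length + 4


framed = frame_record(b"first") + frame_record(b"second")
records = list(read_records(framed))
print(records)
```

The payloads here are arbitrary bytes; in the rest of this notebook each payload is a serialized ```tf.train.Example```.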


## Example Protobuf

[tf.train.Example](https://www.tensorflow.org/api_docs/python/tf/train/Example) is the Protobuf message that holds one record of a TFRecord file.

> An Example is a mostly-normalized data format for storing data for training and inference.
It contains a key-value store features where each key (string) maps to a tf.train.Feature message.

* [example.proto](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto)

```
message Example {
  Features features = 1;   // 1 is the field number
}
```

Note that a ```tf.train.Example``` is conceptually a **Dict[str, Feature]**: a map from feature name to ```tf.train.Feature```.
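
To make the dictionary view concrete, here is a minimal sketch that builds an ```Example``` with two features and reads them back through the map. The feature names and values are borrowed from the review data used later in this notebook; nothing here is specific to that dataset.

```python
import tensorflow as tf

example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "review_id": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[b"R1WFOQ3N9BO65I"])
            ),
            "star_rating": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[5])
            ),
        }
    )
)

# example.features.feature behaves like a mapping from str to Feature.
feature_map = example.features.feature
names = sorted(feature_map.keys())
rating = feature_map["star_rating"].int64_list.value[0]
review_id = feature_map["review_id"].bytes_list.value[0]

print(names)
print(rating, review_id)
```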


## Feature Protobuf

[tf.train.Feature](https://www.tensorflow.org/api_docs/python/tf/train/Feature)

> A Feature is a list which may hold zero or more values.


* [feature.proto](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto)

```
message BytesList {
  repeated bytes value = 1;
}
message FloatList {
  repeated float value = 1 [packed = true];
}
message Int64List {
  repeated int64 value = 1 [packed = true];
}

// Containers for non-sequential data.
message Feature {
  // Each feature can be exactly one kind.
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    Int64List int64_list = 3;
  }
}

message Features {
  // Map from feature name to feature.
  map<string, Feature> feature = 1;
}
```
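
Because of the `oneof kind` declaration, a ```Feature``` holds exactly one of the three list types. The sketch below picks the list type from the Python type of the values and then asks the message which kind is set. `to_feature` is a hypothetical convenience helper, not a TensorFlow API, and it assumes a non-empty list whose elements all share one type.

```python
import tensorflow as tf


def to_feature(values):
    """Wrap a non-empty, homogeneous list in a tf.train.Feature (illustrative helper)."""
    first = values[0]
    if isinstance(first, (bytes, str)):
        # BytesList stores raw bytes, so text is encoded as UTF-8 first.
        encoded = [v.encode("utf-8") if isinstance(v, str) else v for v in values]
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=encoded))
    if isinstance(first, float):
        return tf.train.Feature(float_list=tf.train.FloatList(value=values))
    if isinstance(first, int):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
    raise TypeError(f"unsupported value type: {type(first).__name__}")


kinds = [
    to_feature(values).WhichOneof("kind")
    for values in (["Digital_Video_Games"], [0.5, 1.5], [101, 12476, 102])
]
print(kinds)
```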

In [19]:
import pandas as pd
import tensorflow as tf
from tensorflow.train import (
    BytesList,
    FloatList,
    Int64List,
    Feature,
    Features,
    Example
)

# [Amazon product review dataset](https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt)

### DATA COLUMNS:

```
marketplace       - 2 letter country code of the marketplace where the review was written.
customer_id       - Random identifier that can be used to aggregate reviews written by a single author.
review_id         - The unique ID of the review.
product_id        - The unique Product ID the review pertains to. In the multilingual dataset the reviews
                    for the same product in different countries can be grouped by the same product_id.
product_parent    - Random identifier that can be used to aggregate reviews for the same product.
product_title     - Title of the product.
product_category  - Broad product category that can be used to group reviews 
                    (also used to group the dataset into coherent parts).
star_rating       - The 1-5 star rating of the review.
helpful_votes     - Number of helpful votes.
total_votes       - Number of total votes the review received.
vine              - Review was written as part of the Vine program.
verified_purchase - The review is on a verified purchase.
review_headline   - The title of the review.
review_body       - The review text.
review_date       - The date the review was written.
```

### DATA FORMAT
```
First line in each file is header; 1 line corresponds to 1 record.
```

In [62]:
df = pd.read_csv(
    "./data/amazon_product_review_sample.csv",
    header=0,
    usecols=['review_id', 'product_category', 'star_rating', 'review_body', 'review_date'],
    parse_dates=['review_date'],
    dtype={
        'product_category': 'category',
        'star_rating': 'int64'
    }
)

In [67]:
df.head(3)

Unnamed: 0,review_id,product_category,star_rating,review_body,review_date
0,RSH1OZ87OYK92,Digital_Video_Games,2,I keep buying madden every year hoping they ge...,2015-08-31
1,R1WFOQ3N9BO65I,Digital_Video_Games,5,Awesome,2015-08-31
2,R3YOOS71KM5M9,Digital_Video_Games,5,If you are prepping for the end of the world t...,2015-08-31


In [64]:
df.dtypes

review_id                   object
product_category          category
star_rating                  int64
review_body                 object
review_date         datetime64[ns]
dtype: object

# BERT Tokenizer  

The DistilBERT tokenizer converts the review text into integer token ids so that ML models can process it.

In [44]:
import os
import transformers
from transformers import (
    DistilBertTokenizerFast,
)

# --------------------------------------------------------------------------------
# Control log level (https://huggingface.co/transformers/main_classes/logging.html)
# --------------------------------------------------------------------------------
os.environ['TRANSFORMERS_VERBOSITY'] = "error"
transformers.logging.set_verbosity(transformers.logging.ERROR)
MAX_SEQUENCE_LENGTH = 256

In [162]:
tokenizer = DistilBertTokenizerFast.from_pretrained(
    'distilbert-base-uncased',
    do_lower_case=True
)


def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Huggingface tokenizer
    Args:
        sentences: String or list of strings to tokenize
        max_length: maximum token length that the tokenizer generates
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    Returns: encoding with 'input_ids' and 'attention_mask' as TensorFlow tensors
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )


def decode(input_ids):
    """Decode token ids back to a string, removing [PAD] tokens
    Args:
        input_ids: 1-D integer Tensor of token ids
    Returns: the decoded sentence as a single string
    """
    sentence = tokenizer.decode(input_ids.numpy().tolist())
    return sentence.replace('[PAD]', '')


In [160]:
tokens = tokenize(df['review_body'].values.tolist())
df['input_ids'] = tokens['input_ids'].numpy().tolist()
df['attention_mask'] = tokens['attention_mask'].numpy().tolist()
df.head(3)

Unnamed: 0,review_id,product_category,star_rating,review_body,review_date,input_ids,attention_mask
0,RSH1OZ87OYK92,Digital_Video_Games,2,I keep buying madden every year hoping they ge...,2015-08-31,"[101, 1045, 2562, 9343, 24890, 2296, 2095, 532...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,R1WFOQ3N9BO65I,Digital_Video_Games,5,Awesome,2015-08-31,"[101, 12476, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...","[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,R3YOOS71KM5M9,Digital_Video_Games,5,If you are prepping for the end of the world t...,2015-08-31,"[101, 2065, 2017, 2024, 17463, 4691, 2005, 199...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
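
In the output above, `attention_mask` is 1 for real tokens and 0 for padding. The pure-Python sketch below reproduces that relationship for the `Awesome` row. `pad_to_length` is a hypothetical helper for illustration only, not part of the tokenizer API; it assumes the pad id is 0 (as in the output above) and that the sequence already fits within `max_length`, so truncation is not covered.

```python
def pad_to_length(token_ids, max_length, pad_id=0):
    """Right-pad token ids to max_length and build the matching attention mask."""
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncation is not sketched here")
    padding = max_length - len(token_ids)
    input_ids = list(token_ids) + [pad_id] * padding
    attention_mask = [1] * len(token_ids) + [0] * padding
    return input_ids, attention_mask


# Token ids of the review "Awesome" from the table above: [CLS] awesome [SEP].
input_ids, attention_mask = pad_to_length([101, 12476, 102], max_length=8)
print(input_ids)       # [101, 12476, 102, 0, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 0, 0, 0, 0, 0]
```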


# Amazon product review to TFRecord

In [132]:
from collections import OrderedDict


def create_tf_record(row):
    """Convert one DataFrame row into a tf.train.Example."""
    record = OrderedDict({
        "review_id": Feature(bytes_list=BytesList(value=[row['review_id'].encode()])),
        "product_category": Feature(bytes_list=BytesList(value=[row['product_category'].encode()])),
        "star_rating": Feature(int64_list=Int64List(value=[row['star_rating']])),
        "review_tokens": Feature(int64_list=Int64List(value=row['input_ids'])),
        "review_attention_mask": Feature(int64_list=Int64List(value=row['attention_mask'])),
    })
    tf_record = tf.train.Example(features=Features(feature=record))
    return tf_record

tf_records = df.apply(create_tf_record, axis=1)

## Serialize into TFRecord file

In [133]:
tf_record_file_path = "amazon_product_review.tfrecord"
options = tf.io.TFRecordOptions(compression_type='GZIP')

with tf.io.TFRecordWriter(tf_record_file_path, options) as f:
    for record in tf_records:
        f.write(record.SerializeToString())
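
As a quick sanity check on the write path, the following sketch round-trips a single ```Example```: first through `SerializeToString` and `FromString` in memory, then through a GZIP-compressed TFRecord file. It writes to a temporary directory rather than the notebook's output file, and the single `star_rating` feature is only a stand-in for a full review record.

```python
import os
import tempfile

import tensorflow as tf

example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "star_rating": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[5])
            ),
        }
    )
)

# In-memory round trip: message -> bytes -> message.
serialized = example.SerializeToString()
restored = tf.train.Example.FromString(serialized)

# File round trip: the reader must use the same compression type as the writer.
path = os.path.join(tempfile.mkdtemp(), "sample.tfrecord")
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(path, options) as writer:
    writer.write(serialized)

dataset = tf.data.TFRecordDataset([path], compression_type="GZIP")
raw_records = [raw.numpy() for raw in dataset]
print(len(raw_records), raw_records[0] == serialized)
```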

# Load TFRecords from file(s)

In [165]:
record_feature_description = {
    "review_id": tf.io.FixedLenFeature([], tf.string),
    "product_category": tf.io.VarLenFeature(tf.string),
    "star_rating": tf.io.FixedLenFeature([], tf.int64),
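    # --------------------------------------------------------------------------------
    # FixedLenFeature([], tf.int64) expects a single scalar value, but these features
    # hold one value per token, so parsing fails with:
    # Invalid argument: Key: review_attention_mask.  Can't parse serialized Example.
    # Either declare the length, e.g. FixedLenFeature([MAX_SEQUENCE_LENGTH], tf.int64),
    # or use VarLenFeature, which returns a SparseTensor.
    # --------------------------------------------------------------------------------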
    # "review_tokens": tf.io.FixedLenFeature([], tf.int64),
    # "review_attention_mask": tf.io.FixedLenFeature([], tf.int64),
    "review_tokens": tf.io.VarLenFeature(tf.int64),
    "review_attention_mask": tf.io.VarLenFeature(tf.int64),
    # --------------------------------------------------------------------------------
}

ds = tf.data.TFRecordDataset(
    filenames=[tf_record_file_path],
    compression_type='GZIP'
)
for row in ds:
    tf_record = tf.io.parse_single_example(row, record_feature_description)
    # VarLenFeature yields a SparseTensor; convert it to a dense tensor before decoding
    tf_record['review_tokens'] = tf.sparse.to_dense(tf_record['review_tokens'])
    print("ID:{review_id} Rating:{rating} Review: {review}".format(
        review_id=tf_record['review_id'],
        rating=tf_record['star_rating'],
        review=decode(tf_record['review_tokens'])
    ))

ID:b'RSH1OZ87OYK92' Rating:2 Review: [CLS] i keep buying madden every year hoping they get back to football. this years version is a little better than last years - - but that's not saying much. the game looks great. the only thing wrong with the animation, is the way the players are always tripping on each other. < br / > < br / > the gameplay is still slowed down by the bloated pre - play controls. what used to take two buttons is now a giant pita to get done before an opponent snaps the ball or the play clock runs out. < br / > < br / > the turbo button is back, but the player movement is still slow and awkward. if you liked last years version, i'm guessing you'll like this too. i haven't had a chance to play anything other than training and a few online games, so i'm crossing my fingers and hoping the rest is better. < br / > < br / > the one thing i can recommend is not to buy the madden bundle. the game comes as a download. so if you hate it, there's no trading it in at gamestop.
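
The parse error noted in the feature description above comes from the declared shape, not from the data. The sketch below parses the same serialized ```Example``` three ways: with a fixed length that matches the number of values, with `VarLenFeature`, and with the scalar shape `[]` that fails. `SEQ_LEN = 4` is a shortened, hypothetical stand-in for `MAX_SEQUENCE_LENGTH` to keep the example small.

```python
import tensorflow as tf

SEQ_LEN = 4  # illustrative only; the notebook pads to MAX_SEQUENCE_LENGTH = 256

example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "review_tokens": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[101, 12476, 102, 0])
            ),
        }
    )
)
serialized = example.SerializeToString()

# 1. Fixed length matching the number of stored values: parses to a dense tensor.
fixed = tf.io.parse_single_example(
    serialized, {"review_tokens": tf.io.FixedLenFeature([SEQ_LEN], tf.int64)}
)

# 2. Variable length: parses to a SparseTensor, converted to dense afterwards.
var = tf.io.parse_single_example(
    serialized, {"review_tokens": tf.io.VarLenFeature(tf.int64)}
)
dense = tf.sparse.to_dense(var["review_tokens"])

# 3. Scalar shape []: fails because the feature holds four values, not one.
try:
    tf.io.parse_single_example(
        serialized, {"review_tokens": tf.io.FixedLenFeature([], tf.int64)}
    )
    scalar_parse_failed = False
except tf.errors.InvalidArgumentError:
    scalar_parse_failed = True

print(fixed["review_tokens"].numpy(), dense.numpy(), scalar_parse_failed)
```

Since every review in this notebook is padded to the same length, the fixed-length form would avoid the `tf.sparse.to_dense` step; `VarLenFeature` remains the safer choice when sequence lengths differ between records.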