# Create Dataset for Keras model.fit()

* [tf.data: Build TensorFlow input pipelines](https://www.tensorflow.org/guide/data) (MUST)
* [Load CSV data](https://www.tensorflow.org/tutorials/load_data/csv)
* [tf.data.experimental.make_csv_dataset](https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_csv_dataset)


## Input format for model.fit()


* [How to create the same structure of tf.data.experimental.make_csv_dataset from pandas
](https://stackoverflow.com/questions/69777802/how-to-create-the-same-structure-of-tf-data-experimental-make-csv-dataset-from-p/69778344#69778344)

> tensorflow.org/api_docs/python/tf/keras/Model#fit defines what should be passed to .fit() 

When passing a TF dataset, it should have the ```(features, labels)``` format, which can be created with ```tf.data.Dataset.from_tensor_slices((features, labels))```, where **features** can be any of:

> * A Numpy array (or array-like), or a list of arrays (in case the model has multiple inputs).
> * A TensorFlow tensor, or a list of tensors (in case the model has multiple inputs).
> * A dict mapping input names to the corresponding array/tensors, if the model has named inputs.
> * A generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample_weights).

* [tf.keras.Model - fit](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit)

> ```fit(x=None, y=None)```  
> ### x: Input data.  
> For a tf.data dataset, it should be **a tuple of either (inputs, targets) or (inputs, targets, sample_weights)**.
> ###  y: Target data  
> either Numpy array(s) or TensorFlow tensor(s). It should be consistent with x (you cannot have Numpy inputs and tensor targets, or inversely). 
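
As a minimal sketch of the rule quoted above, the snippet below builds a batched dataset of `(inputs, targets)` tuples and passes it to `fit()` as `x` only, with no `y`. The random data, the layer sizes, and the two-layer model are assumptions made for illustration and are not part of the Titanic example.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 8 samples with 3 float features and a binary target.
features = np.random.rand(8, 3).astype("float32")
labels = np.random.randint(0, 2, size=(8, 1)).astype("float32")

# Each dataset element is an (inputs, targets) tuple; batch before fitting.
ds = tf.data.Dataset.from_tensor_slices((features, labels)).batch(4)
print(ds.element_spec)

# Illustrative model only; the layer sizes are arbitrary.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# The dataset already carries the targets, so y is not passed.
history = model.fit(ds, epochs=1, verbose=0)
```

Because the targets travel inside the dataset, `fit()` reads both inputs and targets from `x`.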

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np

# Example CSV

The examples below use the Titanic training CSV.

In [2]:
titanic_file = tf.keras.utils.get_file("titanic_train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")

In [3]:
!head -n 5  ~/.keras/datasets/titanic_train.csv

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n


# Dictionary to Dataset

In [4]:
features = {
    "name": ["hoge", "tako"],
    "age": [20, 99]
}
labels = {
    "survived": [0, 1]
}
ds_from_dictionary = tf.data.Dataset.from_tensor_slices((features, labels))
for record in ds_from_dictionary:
    print(record)

({'name': <tf.Tensor: shape=(), dtype=string, numpy=b'hoge'>, 'age': <tf.Tensor: shape=(), dtype=int32, numpy=20>}, {'survived': <tf.Tensor: shape=(), dtype=int32, numpy=0>})
({'name': <tf.Tensor: shape=(), dtype=string, numpy=b'tako'>, 'age': <tf.Tensor: shape=(), dtype=int32, numpy=99>}, {'survived': <tf.Tensor: shape=(), dtype=int32, numpy=1>})




# Pandas to Dataset

* [How to create the same structure of tf.data.experimental.make_csv_dataset from pandas
](https://stackoverflow.com/questions/69777802/how-to-create-the-same-structure-of-tf-data-experimental-make-csv-dataset-from-p/69778344#69778344)

If the data fits in memory, the ```from_tensor_slices``` method works on dictionaries, so a DataFrame can be converted via a dictionary of its columns.

```Pandas -> Python dictionary -> from_tensor_slices -> Dataset```
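
The middle step of this pipeline can be sketched with pandas alone: calling `dict()` on a DataFrame yields a mapping from column name to `Series`, which is the dictionary shape that `from_tensor_slices` accepts. The two-row `df_small` frame is a hypothetical stand-in for the Titanic data.

```python
import pandas as pd

# Hypothetical two-row frame that mimics a few Titanic columns.
df_small = pd.DataFrame({
    "survived": [0, 1],
    "sex": ["male", "female"],
    "age": [22.0, 38.0],
})

# Drop the label column, then turn the remaining columns into a dictionary.
feature_columns = dict(df_small.loc[:, df_small.columns != "survived"])

for name, column in feature_columns.items():
    # Each value is a pandas Series holding one full column.
    print(name, type(column).__name__, column.tolist())
```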

In [5]:
df = pd.read_csv(titanic_file)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 627 entries, 0 to 626
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   survived            627 non-null    int64  
 1   sex                 627 non-null    object 
 2   age                 627 non-null    float64
 3   n_siblings_spouses  627 non-null    int64  
 4   parch               627 non-null    int64  
 5   fare                627 non-null    float64
 6   class               627 non-null    object 
 7   deck                627 non-null    object 
 8   embark_town         627 non-null    object 
 9   alone               627 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 49.1+ KB


In [15]:
# Create a dataset of format (features, labels) where
# - "features" is a dictionary, a numpy array, or a Tensor.
# - "labels" is the column of target values.
titanic_from_pandas = tf.data.Dataset.from_tensor_slices((
    dict(df.loc[:, df.columns != 'survived']),   # Multi-column features as dictionary or numpy array or Tensor
    df.loc[:, 'survived']                        # Labels format depends on the number of classes.
))
tf.data.experimental.get_structure(titanic_from_pandas)

({'sex': TensorSpec(shape=(), dtype=tf.string, name=None),
  'age': TensorSpec(shape=(), dtype=tf.float64, name=None),
  'n_siblings_spouses': TensorSpec(shape=(), dtype=tf.int64, name=None),
  'parch': TensorSpec(shape=(), dtype=tf.int64, name=None),
  'fare': TensorSpec(shape=(), dtype=tf.float64, name=None),
  'class': TensorSpec(shape=(), dtype=tf.string, name=None),
  'deck': TensorSpec(shape=(), dtype=tf.string, name=None),
  'embark_town': TensorSpec(shape=(), dtype=tf.string, name=None),
  'alone': TensorSpec(shape=(), dtype=tf.string, name=None)},
 TensorSpec(shape=(), dtype=tf.int64, name=None))

In [7]:
for row in titanic_from_pandas.batch(1).take(1):  # Take the first batch 
    features = row[0]        # Dictionary
    label = row[1]
    
    for feature, value in features.items():
        print(f"{feature:20s}: {value}")
    
    print(f"label/survived      : {label}")    

sex                 : [b'male']
age                 : [22.]
n_siblings_spouses  : [1]
parch               : [0]
fare                : [7.25]
class               : [b'Third']
deck                : [b'unknown']
embark_town         : [b'Southampton']
alone               : [b'n']
label/survived      : [0]


## Generalized function for pandas to TF dataset

In [2]:
def df_to_dataset(dataframe, label_column_name="label", shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop(label_column_name)
    ds = tf.data.Dataset.from_tensor_slices((
        dict(dataframe),   # <--- X: features
        labels             # <--- Y: labels
    ))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size).prefetch(batch_size)
    return ds

In [9]:
titanic_from_pandas = df_to_dataset(dataframe=df, label_column_name="survived", shuffle=False, batch_size=1)
tf.data.experimental.get_structure(titanic_from_pandas)

({'sex': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'age': TensorSpec(shape=(None,), dtype=tf.float64, name=None),
  'n_siblings_spouses': TensorSpec(shape=(None,), dtype=tf.int64, name=None),
  'parch': TensorSpec(shape=(None,), dtype=tf.int64, name=None),
  'fare': TensorSpec(shape=(None,), dtype=tf.float64, name=None),
  'class': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'deck': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'embark_town': TensorSpec(shape=(None,), dtype=tf.string, name=None),
  'alone': TensorSpec(shape=(None,), dtype=tf.string, name=None)},
 TensorSpec(shape=(None,), dtype=tf.int64, name=None))

In [10]:
for row in titanic_from_pandas.take(1):  # Take the first batch 
    features = row[0]        # Dictionary
    label = row[1]
    
    for feature, value in features.items():
        print(f"{feature:20s}: {value}")
    
    print(f"label/survived      : {label}")    

sex                 : [b'male']
age                 : [22.]
n_siblings_spouses  : [1]
parch               : [0]
fare                : [7.25]
class               : [b'Third']
deck                : [b'unknown']
embark_town         : [b'Southampton']
alone               : [b'n']
label/survived      : [0]
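
The `shape=(None,)` entries in the structure above reflect that the batch dimension is not fixed: the last batch can be smaller than `batch_size`. The sketch below repeats the `df_to_dataset` helper so that it is self-contained, and runs it on a hypothetical 10-row frame (`toy`) with numeric columns only.

```python
import pandas as pd
import tensorflow as tf

def df_to_dataset(dataframe, label_column_name="label", shuffle=True, batch_size=32):
    # Same helper as above, repeated so this cell runs on its own.
    dataframe = dataframe.copy()
    labels = dataframe.pop(label_column_name)
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    return ds.batch(batch_size).prefetch(batch_size)

# Hypothetical 10-row frame; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [float(20 + i) for i in range(10)],
    "fare": [7.25 * (i + 1) for i in range(10)],
    "survived": [i % 2 for i in range(10)],
})

ds = df_to_dataset(toy, label_column_name="survived", shuffle=False, batch_size=4)

# 10 rows with batch_size=4 give two full batches and one partial batch.
batch_sizes = [int(labels.shape[0]) for _, labels in ds]
print(batch_sizes)  # [4, 4, 2]

# The helper works on a copy, so the caller's frame keeps its label column.
print("survived" in toy.columns)  # True
```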


## tfdf.keras.pd_dataframe_to_tf_dataset

TensorFlow Decision Forests (TF-DF) provides its own implementation of the pandas to TF dataset conversion.

* [TensorFlow Decision Forests Installation](https://www.tensorflow.org/decision_forests/installation)
* [tfdf.keras.pd_dataframe_to_tf_dataset](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/pd_dataframe_to_tf_dataset)

In [None]:
!pip3 install tensorflow_decision_forests --upgrade

---
# Read CSV into Dataset

The ```make_csv_dataset``` method reads the CSV and splits each batch into ```(features, label)```, where ```features``` is a dictionary of ```(feature, value)``` pairs.

In [11]:
titanic = tf.data.experimental.make_csv_dataset(
    titanic_file,
    label_name="survived",
    batch_size=1,   # To compare with the head of CSV
    shuffle=False,  # To compare with the head of CSV
    header=True,
)

In [12]:
for row in titanic.take(1):  # Take the first batch 
    features = row[0]        # Dictionary
    label = row[1]
    
    for feature, value in features.items():
        print(f"{feature:20s}: {value}")
    
    print(f"label/survived      : {label}")    

sex                 : [b'male']
age                 : [22.]
n_siblings_spouses  : [1]
parch               : [0]
fare                : [7.25]
class               : [b'Third']
deck                : [b'unknown']
embark_town         : [b'Southampton']
alone               : [b'n']
label/survived      : [0]
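
One detail to be aware of: `make_csv_dataset` defaults to `num_epochs=None`, which repeats the data indefinitely, so a plain `for` loop over the dataset never ends unless `take()` or `num_epochs` bounds it. The sketch below writes a hypothetical three-row CSV to a temporary file and reads it for exactly one epoch. The file name `tiny.csv` and its contents are made up for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical three-row CSV with the same header style as the Titanic file.
csv_text = "survived,sex,age\n0,male,22.0\n1,female,38.0\n1,female,26.0\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny.csv")
    with open(path, "w") as f:
        f.write(csv_text)

    ds = tf.data.experimental.make_csv_dataset(
        path,
        batch_size=2,
        label_name="survived",
        shuffle=False,
        num_epochs=1,   # Read the file once instead of repeating forever
    )
    # Materialize the batches while the temporary file still exists.
    batches = [(dict(features), label) for features, label in ds]

for features, label in batches:
    print(features["sex"].numpy(), label.numpy())
```

With three rows and `batch_size=2`, the single epoch yields one batch of two rows and one batch of one row.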




# HuggingFace example

* [How to fine-tune HuggingFace BERT model for Text Classification](https://stackoverflow.com/questions/69025750/how-to-fine-tune-huggingface-bert-model-for-text-classification/69025751#69025751)
* [huggingface_fine_tuning.ipynb](https://nbviewer.org/github/omontasama/nlp-huggingface/blob/main/fine_tuning/huggingface_fine_tuning.ipynb)
* [Hugging Face Transformers: Fine-tuning DistilBERT for Binary Classification Tasks](https://towardsdatascience.com/hugging-face-transformers-fine-tuning-distilbert-for-binary-classification-tasks-490f1d192379)

In [13]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (
    DistilBertTokenizerFast,
    TFDistilBertForSequenceClassification,
)

MAX_SEQUENCE_LENGTH = 512

# --------------------------------------------------------------------------------
# Tokenizer
# --------------------------------------------------------------------------------
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Hugging Face tokenizer
    Args:
        sentences: String or list of strings to tokenize
        max_length: Maximum number of tokens; longer inputs are truncated
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )

sample_tokens = tokenize(
    [   # Two example sentences
        "i say hello", 
        "you say good bye",
    ],
    padding='longest'
)
print(f"generated {type(sample_tokens)} with content:\n{sample_tokens}")

generated <class 'transformers.tokenization_utils_base.BatchEncoding'> with content:
{'input_ids': <tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[ 101, 1045, 2360, 7592,  102,    0],
       [ 101, 2017, 2360, 2204, 9061,  102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1, 1]], dtype=int32)>}


In [14]:
DATA_COLUMN = 'comment_text'
LABEL_COLUMN = 'toxic'

raw_train = pd.read_csv("./data/train.csv")
train_data, validation_data, train_label, validation_label = train_test_split(
    raw_train[DATA_COLUMN].tolist(),
    raw_train[LABEL_COLUMN].tolist(),
    test_size=.2,
    shuffle=True
)

BATCH_SIZE = 16  # Illustrative value (assumption); tune to the available memory

X = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(train_data)),  # Convert BatchEncoding instance to dictionary
    train_label                  # List[int] to be converted into a tensor
)).batch(BATCH_SIZE).prefetch(1)

In [None]:
X.take(1)
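
The dataset construction can also be checked without downloading a tokenizer or the training CSV. In the sketch below, a plain dictionary of arrays stands in for `dict(tokenize(...))`, reusing the token IDs and attention masks printed for the two sample sentences above. The labels `[0, 1]` are hypothetical.

```python
import numpy as np
import tensorflow as tf

# Stand-in for dict(tokenize(...)): values copied from the sample output above.
encoded = {
    "input_ids": np.array(
        [[101, 1045, 2360, 7592, 102, 0],
         [101, 2017, 2360, 2204, 9061, 102]], dtype=np.int32),
    "attention_mask": np.array(
        [[1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 1]], dtype=np.int32),
}
labels = [0, 1]  # Hypothetical labels, one per sentence

# Same (features dictionary, labels) structure as the dataset X above.
ds = tf.data.Dataset.from_tensor_slices((encoded, labels)).batch(2).prefetch(1)

features, targets = next(iter(ds))
print(features["input_ids"].shape, targets.shape)  # (2, 6) (2,)
```

Each batch is a `(dictionary, labels)` tuple whose keys match the input names the model expects, which is the format `fit()` accepts for models with named inputs.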