## Prepare a Data Pipeline Using tf.data

The tf.data.Dataset API supports writing descriptive and efficient input pipelines. 

Dataset usage follows a common pattern:
  * Create a source dataset from your input data.
  * Apply dataset transformations to preprocess the data.
  * Iterate over the dataset and process the elements.

Iteration happens in a streaming fashion, so the full dataset does not need to fit into memory.
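The three steps above can be sketched on toy data. This is a minimal illustration, not the pipeline used later in this notebook: the file names, labels, and the `images/` prefix are made-up placeholders.

```python
import tensorflow as tf

# Hypothetical toy inputs standing in for real image paths and labels.
paths = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
labels = [0, 1, 1, 0]

# 1. Create a source dataset from in-memory data.
ds = tf.data.Dataset.from_tensor_slices((paths, labels))

# 2. Apply transformations: prepend a (placeholder) directory, then batch.
def add_prefix(path, label):
    return tf.strings.join(["images/", path]), label

ds = ds.map(add_prefix).batch(2).prefetch(tf.data.AUTOTUNE)

# 3. Iterate; elements are produced one batch at a time.
batches = [(p.numpy(), l.numpy()) for p, l in ds]
for batch_paths, batch_labels in batches:
    print(batch_paths, batch_labels)
```

Because each batch is produced on demand, only the elements currently in flight need to be in memory.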

<b>Pandas Category Type</b> 
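One of the main use cases for categorical data types is more efficient memory usage: a column that holds only a few distinct values can be stored as small integer codes instead of full-width values. The attribute columns loaded below contain only -1 and 1, so their category codes fit in int8 instead of int64.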
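The saving is easy to see on a small synthetic frame. The sketch below uses a hypothetical 1000 x 5 frame of random -1/1 values, not the real attribute file, and compares the per-column memory before and after taking category codes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 1000 rows, 5 columns, values drawn from {-1, 1}.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.choice([-1, 1], size=(1000, 5)).astype(np.int64))

before = toy.memory_usage(index=False).sum()  # 8 bytes per value

# Replace each column with its category codes (int8 when there are few categories).
codes = toy.apply(lambda s: s.astype("category").cat.codes)
after = codes.memory_usage(index=False).sum()  # 1 byte per value

print(before, after, codes.dtypes.unique())
```

The codes are assigned in sorted category order, so -1 maps to 0 and 1 maps to 1.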

In [2]:
import pathlib 

data_dir = pathlib.Path("../datasets/big_ds/img-001/")

In [49]:
import pandas as pd 
# sep=r"\s+" splits on whitespace; it replaces delim_whitespace=True, which is deprecated since pandas 2.2.
data = pd.read_csv("../datasets/attribute_set/list_attr_img.txt", sep=r"\s+", names=["paths"] + list(range(1000)))

In [50]:
data.head() 

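| | paths | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | img/Sheer_Pleated-Front_Blouse/img_00000001.jpg | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 1 | img/Sheer_Pleated-Front_Blouse/img_00000002.jpg | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 2 | img/Sheer_Pleated-Front_Blouse/img_00000003.jpg | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 3 | img/Sheer_Pleated-Front_Blouse/img_00000004.jpg | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 4 | img/Sheer_Pleated-Front_Blouse/img_00000005.jpg | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |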


In [73]:
# Collect every distinct value that appears in the attribute columns.
variables = set()
for col in data.columns[1:]:
    variables.update(data[col].unique())
variables

{-1, 1}

In [76]:
data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289222 entries, 0 to 289221
Columns: 1001 entries, paths to 999
dtypes: int64(1000), object(1)
memory usage: 2.2+ GB


In [87]:
# Category codes follow sorted order, so -1 becomes 0 and 1 stays 1; two categories fit in int8.
for col in data.columns[1:]:
    data[col] = data[col].astype('category').cat.codes

In [88]:
data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289222 entries, 0 to 289221
Columns: 1001 entries, paths to 999
dtypes: int8(1000), object(1)
memory usage: 278.0+ MB
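The two memory figures reported by `data.info()` can be checked by hand. The sketch below assumes a 64-bit build (8 bytes per pointer in the object column) and that pandas reports sizes in 1024-based units; the `+` in the output means the strings behind the `paths` column are not counted.

```python
rows, attr_cols = 289_222, 1_000  # row and attribute-column counts from data.info()
pointer_bytes = 8                 # assumption: 64-bit build, one pointer per `paths` entry

def frame_bytes(bytes_per_attr):
    # Attribute block plus the object column of pointers.
    # The RangeIndex adds only on the order of 100 bytes, so it is ignored here.
    return rows * attr_cols * bytes_per_attr + rows * pointer_bytes

int64_gib = frame_bytes(8) / 2**30  # before conversion
int8_mib = frame_bytes(1) / 2**20   # after conversion

print(f"int64: {int64_gib:.1f} GB, int8: {int8_mib:.1f} MB")
# prints "int64: 2.2 GB, int8: 278.0 MB"
```

The attribute block shrinks by a factor of 8, which accounts for almost all of the drop from 2.2 GB to 278 MB.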


In [90]:
data[0].unique() 

array([0, 1], dtype=int8)
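One plausible next step, not shown in this part of the notebook, is to hand the compact frame to `tf.data`. The sketch below is an assumption about how that could look: it uses a hypothetical three-row stand-in for `data` with two attribute columns, and builds a dataset of (path, attribute vector) pairs.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Hypothetical stand-in for `data`: a `paths` column plus int8 attribute columns.
toy = pd.DataFrame({
    "paths": ["img/a.jpg", "img/b.jpg", "img/c.jpg"],
    0: np.array([0, 1, 0], dtype="int8"),
    1: np.array([1, 1, 0], dtype="int8"),
})

paths = toy["paths"].tolist()                   # Python strings become tf.string
attrs = toy.drop(columns="paths").to_numpy()    # shape (3, 2), dtype int8

# Each element is one (path, attribute vector) pair.
ds = tf.data.Dataset.from_tensor_slices((paths, attrs))

first_path, first_attrs = next(iter(ds))
print(first_path.numpy(), first_attrs.numpy())
```

Keeping the attributes as int8 here means the in-memory tensor is as compact as the converted DataFrame; a cast to a float type can be deferred to a `map` step if a model needs it.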