<a href="https://colab.research.google.com/github/joepeskett/tree-pixels/blob/master/notebooks/TF_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using TF Datasets

There are some really good walkthroughs of how this can be achieved in the [TensorFlow `tf.data` API documentation](https://www.tensorflow.org/guide/data#consuming_text_data).

Download the dataset from Stanford:

In [0]:
! wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
! tar -xvf aclImdb_v1.tar.gz

In [0]:
import tensorflow_datasets as tfds
#dataset = tfds.load("imdb_reviews")

In [0]:
# We want to create a dataset similar to the one found in the tfds package.
# How would we efficiently load these files to create a useful dataset?

import tensorflow as tf



Notes about this data set - the rating for each movie is in the filename, as is the unique ID of the particular review. 

A good first step is to build records that each hold a unique ID, the review text and the rating.
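
As a minimal sketch of that first step, the snippet below pulls the ID and rating out of a file name. The helper name `parse_review_filename` and the example path are illustrative assumptions; the only thing taken from the dataset is the `<id>_<rating>.txt` naming pattern described above.

```python
import os

def parse_review_filename(path):
    """Split a path like 'aclImdb/train/pos/123_7.txt' into (review_id, rating)."""
    # Drop the directory and the '.txt' extension, leaving '<id>_<rating>'.
    stem = os.path.splitext(os.path.basename(path))[0]
    review_id, rating = stem.split("_")
    return int(review_id), int(rating)

# Hypothetical path that follows the '<id>_<rating>.txt' pattern.
example = parse_review_filename("aclImdb/train/pos/123_7.txt")
print(example)
```

The review text itself would then be read from the same path and stored alongside these two values.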

TODO:

* List the files within the positive and negative training directories. 
* Load these into an appropriate training data set
* Ensure each piece of text is labelled correctly
* Shuffle the dataset
* Create an appropriate representation for the text data, likely using the `TextVectorization` layer within Keras
* Add an embedding layer
* Train a model 
* Try to optimise the pipeline that we've prepared
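
The text-representation and embedding items in the list above could look roughly like the sketch below. It is not the notebook's final model: the toy reviews, the vocabulary size, sequence length and layer sizes are all assumptions chosen to keep the example small, and the model is only run forward, not trained.

```python
import tensorflow as tf

# Hypothetical toy reviews standing in for the real review text.
reviews = tf.constant(["a great film", "a dull film", "great acting", "dull and slow"])

# Map each review to a fixed-length sequence of integer token IDs.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=50, output_sequence_length=4)
vectorizer.adapt(reviews)
token_ids = vectorizer(reviews)

# Embed the token IDs, average over the sequence, and predict one score per review.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=50, output_dim=8),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
predictions = model(token_ids)
print(token_ids.shape, predictions.shape)
```

Training would then be a matter of compiling the model with a loss suited to the chosen label (for example a binary sentiment derived from the rating) and calling `fit` on the vectorised data.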

# First: reformat the dataset.

This is so we can make use of the `TextLineDataset` class in the `tf.data` API.

Rather than having the rating in the filename, we'll convert each txt file into an individual CSV file that holds the ID, rating and review text.

In [0]:
import os
DIRS = ["aclImdb/train/pos/", "aclImdb/train/neg/"]
def create_file_list(dirs = DIRS):
  files = []
  for d in dirs:
    for f in os.listdir(d):
      files.append(os.path.join(d, f))
  return files

In [0]:
# Check the distribution of ratings
import numpy as np

ratings = []
file_names = create_file_list(DIRS)
for f in file_names:
  # File names look like "<id>_<rating>.txt"
  review_id, rating = os.path.splitext(os.path.basename(f))[0].split("_")
  ratings.append(int(rating))
rate_array = np.array(ratings)
np.histogram(rate_array)

In [0]:
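# Simple function to grab the ID and rating from the file name,
# write them to a one-row CSV alongside the review text, and return the record
import pandas as pd
def dataset_creator(file_name):
  stem = os.path.splitext(os.path.basename(file_name))[0]
  review_id, rating = stem.split("_")
  with open(file_name, encoding="utf-8") as fh:
    review = fh.read()
  output =  {'id':int(review_id), 
             'rating':int(rating), 
             'review':review}
  output_file = pd.DataFrame(output, index = [0])
  os.makedirs('train', exist_ok=True)
  # Name the CSV after the full stem: IDs can repeat across pos/ and neg/,
  # so using the ID alone would let one file overwrite another
  output_file.to_csv('train/{}.csv'.format(stem))
  return output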

def create_files(dirs = DIRS):
  files = create_file_list(dirs)
  for f in files:
    dataset_creator(f)
  return "DONE!"

In [0]:
create_files(DIRS)

Now we can start looking at how we'd use the TensorFlow `tf.data` API.
The key reason to do this is to mimic a setup where we can't fit all our training data into memory, or where we want to train on a distributed system.

In this setup, it's not that easy to use the filename in the initial dataset creation, which is why we've done some initial preprocessing in the above section. 

In [0]:
import numpy as np
import tensorflow as tf

# Now we want to create a data set for each individual file
file_path_dataset = tf.data.Dataset.list_files("train/*.csv")

In [0]:
n_readers = 5
dataset = file_path_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length = n_readers
)

In [0]:
# What have we got so far?
for line in dataset.take(5):
  print(line)

# Preprocessing

At this point we need a preprocessing function that parses each CSV line into the review text and its rating.

In [0]:
def preprocess(line):
  # Default values (and hence dtypes) for each CSV column:
  # pandas index, id and rating are integers; the review is a required string
  record_defaults = [[1],[1],[1],tf.constant([], dtype=tf.string)]
  fields = tf.io.decode_csv(line, record_defaults=record_defaults)
  X = fields[3]  # review text
  y = fields[2]  # rating
  return X, y

In [0]:
for line in dataset.take(5):
  print(preprocess(line))
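
A parsing function like this is usually applied with `map` and then combined with shuffling, batching and prefetching. The sketch below shows one way that might look. To stay self-contained it uses a few made-up CSV rows in memory instead of the files in `train/`, and the `parse_line` name, buffer size and batch size are assumptions rather than tuned values.

```python
import tensorflow as tf

# Hypothetical rows in the same column order as the CSV files:
# pandas index, id, rating, review (quoted because it contains commas).
sample_lines = [
    '0,1,9,"Great film, would watch again."',
    '0,2,2,"Dull, slow and far too long."',
    '0,3,8,"A charming, well acted story."',
    '0,4,3,"Not for me."',
]

def parse_line(line):
    # Three integer columns followed by one required string column.
    record_defaults = [[0], [0], [0], tf.constant([], dtype=tf.string)]
    _, _, rating, review = tf.io.decode_csv(line, record_defaults=record_defaults)
    return review, rating

pipeline = (
    tf.data.Dataset.from_tensor_slices(sample_lines)
    .map(parse_line, num_parallel_calls=tf.data.AUTOTUNE)  # parse rows in parallel
    .shuffle(buffer_size=4, seed=0)                        # buffer covers all four rows
    .batch(2)
    .prefetch(tf.data.AUTOTUNE)                            # overlap input with training
)

batches = list(pipeline)
for reviews, ratings in batches:
    print(reviews.numpy(), ratings.numpy())
```

With the real data, the `from_tensor_slices` call would be replaced by the interleaved `TextLineDataset` built earlier, and the shuffle buffer would need to be much larger to mix reviews well.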

And now we need to do some shuffling around. 

In [0]:
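IDs = []
reviews = [] 
ratings = []
for filename in create_file_list(DIRS):
  # Parse "<id>_<rating>.txt" and read the review text directly
  review_id, rating = os.path.splitext(os.path.basename(filename))[0].split("_")
  with open(filename, encoding="utf-8") as fh:
    reviews.append(fh.read())
  IDs.append(int(review_id))
  ratings.append(int(rating))

dataset = tf.data.Dataset.from_tensor_slices((tf.constant(IDs), tf.constant(ratings), tf.constant(reviews)))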

In [0]:
print(dataset)