<a href="https://colab.research.google.com/github/joepeskett/tree-pixels/blob/master/notebooks/TF_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using TF Datasets

The [TensorFlow `tf.data` API documentation](https://www.tensorflow.org/guide/data#consuming_text_data) has some really good walkthroughs of how to consume text data.

Download the dataset from Stanford:

In [0]:
! wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
! tar -xvf aclImdb_v1.tar.gz

In [0]:
import tensorflow_datasets as tfds
#dataset = tfds.load("imdb_reviews")

In [0]:
# We want to create a dataset similar to the one found in the tfds package.
# How would we efficiently load these files to create a useful dataset?

import tensorflow as tf



Notes about this dataset: the rating attached to each review is in the filename, as is the unique ID of that review.

A good first step would be to set things up so that each example has a unique ID, the review text and the rating.
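
The filename convention can be unpacked with the standard library alone. The following is a minimal sketch: the helper name `parse_review_filename` and the example path are hypothetical, and it assumes filenames of the form `<id>_<rating>.txt`.

```python
import os

def parse_review_filename(path):
    """Split a path such as 'aclImdb/train/pos/123_8.txt' into (id, rating)."""
    # Drop the directory and the extension, leaving e.g. '123_8'
    stem = os.path.splitext(os.path.basename(path))[0]
    review_id, rating = stem.split("_")
    return int(review_id), int(rating)

# Hypothetical path, used only to illustrate the naming scheme
print(parse_review_filename("aclImdb/train/pos/123_8.txt"))  # (123, 8)
```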

TODO:

* List the files within the positive and negative training directories. DONE
* Load these into an appropriate training data set. DONE
* Ensure each piece of text is labelled correctly. DONE
* Shuffle the dataset. DONE
* Create an appropriate representation for the text data, likely using the `TextVectorization` layer within Keras. DOING
* Add an embedding layer. TODO
* Train a model. TODO
* Try to optimise the pipeline that we've prepared. TODO

# First: reformat the dataset.

This is so we can make use of `TextLineDataset` in the `tf.data` API.

Rather than having the rating in the filename, we'll convert each txt file into an individual CSV file that holds the ID, rating and review text.

In [0]:
import os
DIRS = ["aclImdb/train/pos/", "aclImdb/train/neg/"]
def create_file_list(dirs = DIRS):
  files = []
  for d in dirs:
    for f in os.listdir(d):
      files.append(os.path.join(d, f))
  return files

In [0]:
# Check the distribution of ratings across the training files
import numpy as np

ratings = []
file_names = create_file_list(DIRS)
for f in file_names:
  # File names look like "<id>_<rating>.txt"
  review_id, rating = os.path.splitext(os.path.basename(f))[0].split("_")
  ratings.append(int(rating))
rate_array = np.array(ratings)
np.histogram(rate_array)

In [0]:
# Simple function to grab the ID and rating from the file name
# and write them, together with the review text, to a one-row CSV file
import pandas as pd
def dataset_creator(file_name, dir_name = "train"):
  review_id, rating = os.path.splitext(os.path.basename(file_name))[0].split("_")
  with open(file_name) as f:
    review = f.read()
  output =  {'id':int(review_id), 
             'rating':int(rating), 
             'review':review}
  output_file = pd.DataFrame(output, index = [0])
  if not os.path.exists(dir_name):
    os.mkdir(dir_name)
  # One CSV per review, named after the original file, inside dir_name
  output_file.to_csv('{}/{}.csv'.format(dir_name, review_id+"_"+rating))

def create_files(dirs = DIRS, output_dir="train"):
  files = create_file_list(dirs)
  for f in files:
    dataset_creator(f, output_dir)
  return "DONE!"

In [0]:
create_files(DIRS)

In [0]:
file_names

In [0]:
len(os.listdir("aclImdb/train/pos/"))

In [0]:
len(os.listdir("aclImdb/train/neg/"))

In [0]:
len(os.listdir("train"))

Now we can start looking at how we'd use the `tf.data` API.
The key reason to do this is to mimic a setup where we can't fit all our training data into memory, or where we want to train on a distributed system.

In this setup it's not easy to use the filename during the initial dataset creation, which is why we did the preprocessing in the section above.
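
To make the "can't fit it all in memory" idea concrete, here is a plain-Python sketch of streaming lines lazily from many CSV files. It is only an analogy for what a file-based `tf.data` pipeline does, not how TensorFlow implements it. The function name `stream_data_lines` and the two demo files are made up for illustration.

```python
import glob
import os
import tempfile

def stream_data_lines(pattern):
    """Yield one data line at a time from every file matching `pattern`."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            next(handle, None)  # skip the CSV header row
            for line in handle:
                yield line.rstrip("\n")

# Tiny demonstration with two throwaway files (made-up contents)
with tempfile.TemporaryDirectory() as tmp:
    demo_files = [("1_9", ",id,rating,review\n0,1,9,Great film\n"),
                  ("2_3", ",id,rating,review\n0,2,3,Dull plot\n")]
    for name, content in demo_files:
        with open(os.path.join(tmp, name + ".csv"), "w", encoding="utf-8") as handle:
            handle.write(content)
    # list() is used here only to display the result; in real use,
    # iterate over the generator so that only one line is held at a time
    demo_lines = list(stream_data_lines(os.path.join(tmp, "*.csv")))

print(demo_lines)  # ['0,1,9,Great film', '0,2,3,Dull plot']
```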

In [0]:
import numpy as np
import tensorflow as tf

# Now we want to create a data set for each individual file
file_path_dataset = tf.data.Dataset.list_files("train/*.csv")

In [0]:
n_readers = 5
dataset = file_path_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length = n_readers
)

In [0]:
#What have we got so far?

In [0]:
for line in dataset.take(5):
  print(line)

# Preprocessing the raw data into Tensors

At this point we need a preprocessing function. 
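
Each data line has four comma-separated fields, because pandas `to_csv` writes the DataFrame index as the first column by default: index, id, rating, review. As a rough sketch of what the preprocessing has to do, here is the same parsing in plain Python with the `csv` module. The helper name `parse_line` and the sample line are hypothetical.

```python
import csv

def parse_line(line):
    """Return (review, rating) from one CSV data line."""
    # csv.reader copes with quoted fields that contain commas
    _index, _review_id, rating, review = next(csv.reader([line]))
    return review, int(rating)

# Made-up line in the layout: index, id, rating, review
print(parse_line('0,123,8,"Great cast, weak script"'))
# ('Great cast, weak script', 8)
```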

In [0]:
def process_raw_input(line):
  # One default per CSV column: pandas index, id, rating, review.
  # An empty string tensor marks the review column as required.
  record_defaults = [[1],[1],[1],tf.constant([], dtype=tf.string)]
  fields = tf.io.decode_csv(line, record_defaults=record_defaults)
  X = fields[3]  # review text
  y = fields[2]  # rating
  return X, y

In [0]:
for line in dataset.take(1):
  print(process_raw_input(line))

# Putting this together

In [0]:
def dataset_reader(file_path_pattern = "train/*.csv", n_readers = 5, 
                   n_read_threads = None,
                   n_parse_threads = 5,
                   shuffle_buffer_size = 5000,
                   batch_size = 32):
  file_path_dataset = tf.data.Dataset.list_files(file_path_pattern)
  dataset = file_path_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length = n_readers, num_parallel_calls = n_read_threads
  )
  dataset = dataset.map(process_raw_input, num_parallel_calls=n_parse_threads)
  dataset = dataset.shuffle(shuffle_buffer_size).repeat(2)
  return dataset.batch(batch_size).prefetch(1)


In [0]:
train_set = dataset_reader()
for ex in train_set.take(5):
  print(ex)

We'll also need validation and test sets. We'll get these by shuffling the files in the test directories and splitting them into 15k examples for validation and 10k examples for testing. The shuffle matters because the files are listed one sentiment directory at a time, so an unshuffled split would not mix the ratings.
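
A shuffle-then-slice split can be sketched as below. The helper name `split_files`, the `seed` parameter and the placeholder file names are assumptions added for illustration. Seeding the shuffle is one way to make the split repeatable between runs.

```python
import random

def split_files(paths, n_validation, seed=0):
    """Shuffle a copy of `paths` and cut it into (validation, test) lists."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    return shuffled[:n_validation], shuffled[n_validation:]

# Placeholder names standing in for the 25,000 test reviews
demo_paths = ["review_{}.txt".format(i) for i in range(25000)]
val_files, test_files = split_files(demo_paths, n_validation=15000)
print(len(val_files), len(test_files))  # 15000 10000
```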

In [0]:
import random
TEST_DIRS = ["aclImdb/test/neg/", "aclImdb/test/pos"]
test_file_list = create_file_list(TEST_DIRS)
random.shuffle(test_file_list)

In [0]:
import pandas as pd
def dataset_creator(file_name, dir_name = "train"):
  id, rating = os.path.splitext(os.path.basename(file_name))[0].split("_")
  review = open(file_name).read()
  output =  {'id':int(id), 
             'rating':int(rating), 
             'review':review}
  output_file = pd.DataFrame(output, index = [0])
  if not os.path.exists(dir_name):
    os.mkdir(dir_name)
  output_file.to_csv('{}/{}.csv'.format(dir_name,id+"_"+rating))

def create_val_test(dirs = TEST_DIRS):
  test_file_list = create_file_list(dirs)
  random.shuffle(test_file_list)
  validation_files = test_file_list[:15000]
  test_files = test_file_list[15000:]
  for f in validation_files:
    dataset_creator(f, "val")
  for f in test_files:
    dataset_creator(f, "test")
  return "COMPLETE!"

create_val_test()

We can now reuse `dataset_reader` to create the validation and test datasets.

In [0]:
val_dataset = dataset_reader(file_path_pattern="val/*.csv")
test_dataset = dataset_reader(file_path_pattern="test/*.csv")

# Preprocess the text

Now we need to do some standard text preprocessing. 

1. Do we want to use the entire review, or just the first/last 200 words?
2. Convert everything to lower case
3. Use some regular expressions to clean up some of the irritating characters
4. Split each string into individual words.
5. Remove punctuation?
6. Remove stop words? (Need to check what the standard approach to this is now.)
7. Do we want to do any lemmatisation?
8. Depending on the techniques that we'd like to use, we may want a rough PoS tag?

We can use the `TextVectorization` layer that comes with Keras to see how well this works. We always have the option of creating our own embedding layer too.
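
For comparison, steps 2 to 5 of the list can be sketched by hand in plain Python. This is only an illustration: the function name `basic_clean` is hypothetical, and it assumes the raw reviews contain HTML `<br />` tags that should be dropped.

```python
import re
import string

def basic_clean(text):
    """Lower-case, drop HTML line breaks and punctuation, then split on whitespace."""
    text = text.lower()
    # Assumption: line breaks appear in the raw text as <br /> tags
    text = re.sub(r"<br\s*/?>", " ", text)
    # Remove every ASCII punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

print(basic_clean("Great film!<br /><br />Loved it."))
# ['great', 'film', 'loved', 'it']
```

Note that stripping punctuation this way also removes apostrophes, so "don't" becomes "dont".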




In [0]:
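# In newer TensorFlow releases this layer is available as tf.keras.layers.TextVectorization
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
max_features = 3000
vectorise_layer = TextVectorization(
    max_tokens = max_features,
    output_mode = 'int'
)
# train_set is already batched and yields (text, rating) pairs,
# so adapt on the text only
vectorise_layer.adapt(train_set.map(lambda text, rating: text))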