# Basic text classification

This tutorial performs text classification starting from plain text files stored on disk (as opposed to the previous notebook where we used TensorFlow Hub).

In [1]:
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf

In [2]:
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [3]:
tf.__version__

'2.1.0'

## Sentiment analysis

Binary classification of movie reviews as positive or negative, based on the IMDB movie review dataset (similar to the previous notebook).

In [4]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')

In [5]:
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

In [6]:
os.listdir(dataset_dir)

['test', 'README', 'train', 'imdb.vocab', 'imdbEr.txt']

In [7]:
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['unsupBow.feat',
 'urls_neg.txt',
 'neg',
 'pos',
 'labeledBow.feat',
 'unsup',
 'urls_unsup.txt',
 'urls_pos.txt']

The `aclImdb/train/pos` and `aclImdb/train/neg` directories contain positive and negative examples of movie reviews. Let's take a look at one of them.

In [8]:
sample_file = os.path.join(train_dir, 'pos/1181_9.txt')
with open(sample_file) as f:
    print(f.read())

Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.


## Load the dataset

Use the `text_dataset_from_directory` utility to load the data from disk and prepare it in a suitable format. The utility expects one subdirectory per class, laid out as follows:

```
.
└── main_directory
    ├── class_a
    │   ├── a_text_1.txt
    │   ├── a_text_2.txt
    │   └── ...
    ├── class_b
    │   ├── b_text_1.txt
    │   ├── b_text_2.txt
    │   └── ...
    └── ...
```

The IMDB data comes with the following directory structure:

```
.
└── aclImdb
    ├── test
    │   ├── neg
    │   └── pos
    └── train
        ├── neg
        ├── pos
        └── unsup
```

Here the `neg` and `pos` folders play the role of `class_a` and `class_b`, so the training directory must contain only these two subdirectories. We therefore remove the additional `unsup` folder.

In [9]:
remove_dir = os.path.join(train_dir, 'unsup')
if os.path.exists(remove_dir):
    shutil.rmtree(remove_dir)
    print(remove_dir+' removed.')

./aclImdb/train/unsup removed.
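
As a quick sanity check, the subdirectories left under the training folder should be exactly the two class folders. The sketch below is illustrative only: `list_class_dirs` is a hypothetical helper, and the throwaway tree merely mimics the layout of `aclImdb/train` so the snippet runs without the dataset.

```python
import os
import tempfile


def list_class_dirs(root):
    """Return the sorted names of the immediate subdirectories of root."""
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )


# Build a toy tree that mimics aclImdb/train after 'unsup' has been removed.
with tempfile.TemporaryDirectory() as toy_root:
    for class_name in ("neg", "pos"):
        os.makedirs(os.path.join(toy_root, class_name))
    # A stray file such as 'urls_pos.txt' is skipped because it is not a directory.
    open(os.path.join(toy_root, "urls_pos.txt"), "w").close()
    toy_classes = list_class_dirs(toy_root)

print(toy_classes)
```

Calling `list_class_dirs(train_dir)` on the real data should then report only `neg` and `pos`.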


Now we use the `text_dataset_from_directory` utility to create a labelled `tf.data.Dataset`. `tf.data` is TensorFlow's collection of tools for building input pipelines.

We will hold out a validation set with an 80:20 split of the training data, using the `validation_split` argument.
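
To make the 80:20 idea concrete, here is a minimal sketch of a seeded split in plain Python. It is an assumption-laden illustration, not how Keras implements `validation_split` internally: `split_train_validation` and the file names are hypothetical. The point is that a fixed seed produces the same shuffle every time, so the training and validation subsets never overlap.

```python
import random


def split_train_validation(items, validation_split=0.2, seed=42):
    """Shuffle items deterministically and split off a validation fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_split)
    n_train = len(shuffled) - n_val
    # The first n_train items are used for training, the rest for validation.
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical file names standing in for the review files.
file_names = ["review_{}.txt".format(i) for i in range(100)]
train_files, val_files = split_train_validation(file_names)

print(len(train_files), len(val_files))
```

Because the shuffle depends only on the seed, calling the function twice with the same arguments returns identical subsets.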

In [10]:
batch_size = 32
seed = 42

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    train_dir, batch_size=batch_size, validation_split=0.2,
    subset='training', seed=seed)

AttributeError: module 'tensorflow_core.keras.preprocessing' has no attribute 'text_dataset_from_directory'

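`text_dataset_from_directory` is not available in the TensorFlow version installed here (2.1.0, as printed above); it was added to `tf.keras.preprocessing` in a later release (2.3, to the best of my knowledge). So I'll try a different tutorial on loading text.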
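
For reference, the core of what the utility does (read one text file per example and derive the label from the folder name) can be approximated with the standard library alone. The following is a minimal sketch, not the Keras implementation: `load_labeled_texts` is a hypothetical helper, the UTF-8 encoding is an assumption, and the tiny demo tree stands in for `train_dir`.

```python
import os
import tempfile


def load_labeled_texts(root, class_names=("neg", "pos")):
    """Read every .txt file under root/<class_name> and return (texts, labels)."""
    texts, labels = [], []
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(root, class_name)
        for file_name in sorted(os.listdir(class_dir)):
            if not file_name.endswith(".txt"):
                continue
            # Assumption: the files are UTF-8 encoded.
            with open(os.path.join(class_dir, file_name), encoding="utf-8") as f:
                texts.append(f.read())
            labels.append(label)
    return texts, labels


# Demonstrate on a tiny throwaway tree; for the real data, pass train_dir instead.
with tempfile.TemporaryDirectory() as toy_root:
    for class_name, review in (("neg", "Dull and slow."), ("pos", "A lovely film.")):
        os.makedirs(os.path.join(toy_root, class_name))
        path = os.path.join(toy_root, class_name, "0_1.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(review)
    toy_texts, toy_labels = load_labeled_texts(toy_root)

print(toy_texts, toy_labels)
```

With TensorFlow available, the two lists could then be wrapped with `tf.data.Dataset.from_tensor_slices((texts, labels))` and batched, which gives a labelled dataset similar in spirit to what the utility returns.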