# Basic Text Classification with Keras

Reference: https://www.tensorflow.org/tutorials/keras/text_classification

In [250]:
from pathlib import Path
import os
import shutil
import tensorflow as tf

## Binary Classification (Dataset: IMDB Reviews)

### Download and Extract the Dataset

In [251]:
dataset_url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

In [252]:
datasets_dir = Path(os.path.abspath('')).parent.joinpath('datasets')
datasets_dir.mkdir(parents=True, exist_ok=True)

In [254]:
# Download the archive into datasets_dir and extract it there
dataset = tf.keras.utils.get_file("aclImdb_v1", dataset_url,
                                  untar=True,
                                  cache_dir=datasets_dir,
                                  cache_subdir='')

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [276]:
# The archive extracts to 'aclImdb'; the exact case matters on case-sensitive filesystems
imdb_dataset_dir = datasets_dir.joinpath('aclImdb')

In [277]:
imdb_train_dataset = imdb_dataset_dir.joinpath('train')
imdb_test_dataset = imdb_dataset_dir.joinpath('test')

In [278]:
os.listdir(imdb_train_dataset)

['urls_unsup.txt',
 'neg',
 'urls_pos.txt',
 'urls_neg.txt',
 'pos',
 'unsupBow.feat',
 'labeledBow.feat']

In [279]:
# Positive IMDB reviews
imdb_train_dataset_pos = imdb_train_dataset.joinpath('pos')

In [280]:
# Negative IMDB reviews
imdb_train_dataset_neg = imdb_train_dataset.joinpath('neg')

In [281]:
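# Remove the unlabeled 'unsup' reviews so that only the 'neg' and 'pos'
# class folders remain (each subdirectory is treated as a class when loading)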
imdb_train_dataset_unsup = imdb_train_dataset.joinpath('unsup')
shutil.rmtree(imdb_train_dataset_unsup, ignore_errors=True)

### Loading the Dataset

In [282]:
batch_size = 32
seed = 42

#### Loading Training Dataset

In [283]:
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    imdb_train_dataset,
    batch_size=batch_size,
    seed=seed,
    validation_split=0.2,
    subset='training'
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


#### Inspecting the Training Dataset

In [284]:
# Class names, inferred from the subdirectory names
raw_train_ds.class_names

['neg', 'pos']

In [285]:
for text_batch, label_batch in raw_train_ds.take(2):
    for i in range(3):
        tokenized_label = label_batch.numpy()[i]
        corpus = text_batch.numpy()[i]
        class_name = raw_train_ds.class_names[tokenized_label]
        print(f"Label: {tokenized_label} ({class_name})")
        print(f"Review: {corpus}")
        print()

Label: 0 (neg)
Review: b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'

Label: 0 (neg)
Review: b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life 

#### Loading Validation Dataset

In [286]:
raw_validation_ds = tf.keras.preprocessing.text_dataset_from_directory(
    imdb_train_dataset,
    batch_size=batch_size,
    seed=seed,
    validation_split=0.2,
    subset='validation'
)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
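
The training and validation cells pass the same `seed` and `validation_split`, which is what keeps the two subsets from overlapping. The sketch below illustrates the idea in plain Python. It is not Keras's actual implementation: `split_files` and the `review_{i}.txt` file names are hypothetical, and the only assumption is that a deterministic shuffle followed by a fixed cut yields complementary subsets.

```python
import random


def split_files(files, validation_split, seed, subset):
    """Shuffle deterministically, then return one side of a fixed cut."""
    shuffled = sorted(files)
    # The same seed always produces the same order.
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_split)
    cut = len(shuffled) - n_val
    if subset == "training":
        return shuffled[:cut]
    return shuffled[cut:]


# Hypothetical file names standing in for the 25,000 labeled reviews.
files = [f"review_{i}.txt" for i in range(25000)]
train_files = split_files(files, 0.2, 42, "training")
validation_files = split_files(files, 0.2, 42, "validation")

print(len(train_files), len(validation_files))
print(set(train_files).isdisjoint(validation_files))
```

With a different seed in each call, the two shuffles would differ and some files could land in both subsets.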


#### Loading Test Dataset

In [287]:
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    imdb_test_dataset,
    batch_size=batch_size
)

Found 25000 files belonging to 2 classes.


### Dataset Preparation

#### Standardization
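
The reviews printed earlier contain mixed case, punctuation, and HTML line breaks such as `<br />`. A minimal plain-Python sketch of one way to normalize them follows. The `standardize` helper is hypothetical and is not taken from this notebook; a TensorFlow pipeline would typically express the same steps with `tf.strings` operations instead.

```python
import re
import string


def standardize(text: str) -> str:
    """Lowercase, replace HTML line breaks with spaces, strip ASCII punctuation."""
    lowered = text.lower()
    without_breaks = lowered.replace("<br />", " ")
    return re.sub(f"[{re.escape(string.punctuation)}]", "", without_breaks)


sample = "Great movie!<br /><br />Loved it."
standardized = standardize(sample)
print(repr(standardized))
```

Each `<br />` becomes a space, so consecutive breaks leave consecutive spaces behind; a whitespace-based tokenizer ignores them.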

#### Tokenization
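
Tokenization splits each review into units such as words. The sketch below assumes the simplest scheme, splitting on whitespace; the `tokenize` helper is hypothetical and only illustrates the step.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Split on runs of whitespace, discarding empty strings."""
    return text.split()


tokens = tokenize("great movie  loved it")
print(tokens)
```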

#### Vectorization
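
Vectorization maps each token to an integer so that reviews can be fed to a model. The plain-Python sketch below shows the idea under a few assumptions: index 0 is reserved for padding, index 1 for out-of-vocabulary tokens, and the vocabulary keeps the most frequent tokens. `build_vocabulary` and `vectorize` are hypothetical helpers; in Keras this step is typically handled by `tf.keras.layers.TextVectorization`.

```python
from collections import Counter
from typing import Dict, List

PAD_ID = 0  # assumed padding index
UNK_ID = 1  # assumed out-of-vocabulary index


def build_vocabulary(texts: List[str], max_tokens: int) -> Dict[str, int]:
    """Map the most frequent tokens to integer ids, reserving 0 and 1."""
    counts = Counter(token for text in texts for token in text.split())
    # Sort by descending frequency, then alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    vocabulary = {"": PAD_ID, "[UNK]": UNK_ID}
    for token, _ in ranked[: max_tokens - 2]:
        vocabulary[token] = len(vocabulary)
    return vocabulary


def vectorize(text: str, vocabulary: Dict[str, int], sequence_length: int) -> List[int]:
    """Convert text to a fixed-length list of ids, truncating or padding."""
    ids = [vocabulary.get(token, UNK_ID) for token in text.split()]
    ids = ids[:sequence_length]
    return ids + [PAD_ID] * (sequence_length - len(ids))


corpus = ["great movie loved it", "bad movie"]
vocabulary = build_vocabulary(corpus, max_tokens=6)
vectors = [vectorize(text, vocabulary, sequence_length=6) for text in corpus]
print(vocabulary)
print(vectors)
```

The vocabulary should be built from the training split only, so that nothing from the validation or test reviews leaks into the mapping.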