
# Load Text

We will use lower-level utilities such as `tf.data.TextLineDataset` to load text files and `tf.text` to preprocess the data, which gives finer-grained control than the high-level loaders.

In [1]:
!pip install tensorflow_text -q

In [2]:
import collections
import pathlib
import re
import string

import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_text as tf_text

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras import utils
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

tf.__version__, tfds.__version__

('2.3.0', '4.0.1')

## Predict the tag for a Stack Overflow question

In [3]:
data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
dataset = utils.get_file(
    'stack_overflow_16k.tar.gz',
    data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir=''
)
dataset_dir = pathlib.Path(dataset).parent
dataset_dir, pathlib.Path(dataset)

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz


(PosixPath('/tmp/.keras'), PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz'))

In [4]:
list(dataset_dir.iterdir())

[PosixPath('/tmp/.keras/README.md'),
 PosixPath('/tmp/.keras/train'),
 PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz.tar.gz'),
 PosixPath('/tmp/.keras/test')]

In [5]:
train_dir = dataset_dir/'train'
list(train_dir.iterdir())

[PosixPath('/tmp/.keras/train/javascript'),
 PosixPath('/tmp/.keras/train/python'),
 PosixPath('/tmp/.keras/train/csharp'),
 PosixPath('/tmp/.keras/train/java')]
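Each entry listed above is a class directory. A quick way to check how many question files each one holds is a small `pathlib` helper. The sketch below is only an illustration: it builds a tiny stand-in directory tree in a temporary folder (the class sizes and file contents are made up) so it runs without the download, and `count_files_per_class` is a hypothetical helper, not part of Keras.

```python
import pathlib
import tempfile


def count_files_per_class(root):
    """Return {class_dir_name: number of .txt files} for each subdirectory of root."""
    root = pathlib.Path(root)
    return {
        class_dir.name: sum(1 for _ in class_dir.glob('*.txt'))
        for class_dir in sorted(root.iterdir())
        if class_dir.is_dir()
    }


# Build a tiny stand-in for train/ (illustrative data, not the real dataset).
demo_root = pathlib.Path(tempfile.mkdtemp())
for class_name, n_files in [('python', 3), ('java', 2)]:
    class_dir = demo_root / class_name
    class_dir.mkdir()
    for i in range(n_files):
        (class_dir / f'{i}.txt').write_text('placeholder question text')

demo_counts = count_files_per_class(demo_root)
print(demo_counts)
```

On the real data, the same helper could be called as `count_files_per_class(train_dir)`.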

The `train/csharp`, `train/java`, `train/python`, and `train/javascript` directories contain many text files, each of which is a Stack Overflow question. Print a file and inspect the data.

In [8]:
sample_file = train_dir/'python/1757.txt'
with open(sample_file) as f:
  print(f.read())

"i want to return a specific list when i have a collection of lists i want to return a specific list, when i have a collection of lists, but i am not to sure how to do this. i tried this approach but it didn't work. any ideas..this_list1 = [2,3,4,5].this_list2 = [5,6,9,8]..x = input(""which list do you want"")..print this_list(x)"



## Load the dataset

Next, you will load the data from disk and prepare it in a format suitable for training. To do so, you will use the `text_dataset_from_directory` utility to create a labeled `tf.data.Dataset`.

`preprocessing.text_dataset_from_directory` expects a directory structure as follows:

```
train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt
```


When running a machine learning experiment, it is a best practice to divide your dataset into three splits: train, validation, and test. The Stack Overflow dataset has already been divided into train and test, but it lacks a validation set. Create a validation set using an 80:20 split of the training data by using the `validation_split` argument below.


In [9]:
batch_size = 32
seed = 42

raw_train_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed
)

Found 8000 files belonging to 4 classes.
Using 6400 files for training.


As you can see above, there are 8,000 examples in the training folder, of which you will use 80% (or 6,400) for training. As you will see in a moment, you can train a model by passing a `tf.data.Dataset` directly to `model.fit`. First, iterate over the dataset and print out a few examples, to get a feel for the data.

Note: To increase the difficulty of the classification problem, the dataset author replaced occurrences of the words *Python*, *CSharp*, *JavaScript*, or *Java* in the programming question with the word *blank*.

In [13]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print('Question:', text_batch.numpy()[i][:100], '...')
    print('Label:', label_batch.numpy()[i])

Question: b'"set blank to quit on exception? i\'m using blank 3..i\'ve been looking around for an answer to this, ' ...
Label: 3
Question: b'"how do i convert a binary image into an in-memory data structure in blank? context: ...i am using b' ...
Label: 3
Question: b'"adding an array of div ids to a var so i have animate multiple divs across site i found some js on ' ...
Label: 2
Question: b'"fails on encountering null this code is part of a shopify sync utility. never has failed, until we ' ...
Label: 0
Question: b'"blank...help me understand the include of _db.users.include(""something"") a newbie question, i kno' ...
Label: 0
Question: b'"insert table into docx file using docx project i create a word document using docx project...i need' ...
Label: 0
Question: b'"printing by prototype not working in js i am trying to make a constructor function. then i am tryin' ...
Label: 2
Question: b'"calculate percentage if percentage sign is in the text my project has a textbox field which is n
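The `blank` placeholders visible in the output above come from the masking described in the note. The notebook does not show how the dataset author performed it, so the following is only a guess at what such a step could look like: a case-insensitive, whole-word regex substitution. The pattern and the `mask_language_names` helper are assumptions made for illustration.

```python
import re

# Whole-word, case-insensitive match on the four language names (assumed pattern).
LANGUAGE_WORDS = re.compile(r'\b(python|csharp|javascript|java)\b', re.IGNORECASE)


def mask_language_names(text):
    """Replace each language name with the word 'blank'."""
    return LANGUAGE_WORDS.sub('blank', text)


masked = mask_language_names('How do I sort a list in Python or JavaScript?')
print(masked)
```

The word boundaries keep `java` from matching inside `javascript`, so each name is replaced as a whole word.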

The labels are `0`, `1`, `2` or `3`. To see which of these correspond to which string label, you can check the `class_names` property on the dataset.

In [14]:
for i, label in enumerate(raw_train_ds.class_names):
  print('Label', i, 'corresponds to', label)

Label 0 corresponds to csharp
Label 1 corresponds to java
Label 2 corresponds to javascript
Label 3 corresponds to python


In [15]:
raw_train_ds.class_names

['csharp', 'java', 'javascript', 'python']
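The mapping above is consistent with class names being taken from the subdirectory names in sorted order, rather than in the order `iterdir()` happened to list them. A minimal plain-Python sketch of that idea follows; it is an illustration, not Keras's actual code, and `infer_class_indices` is a hypothetical name.

```python
def infer_class_indices(directory_names):
    """Map each class directory name to an integer label, in sorted order."""
    return {name: index for index, name in enumerate(sorted(directory_names))}


# Directory names in the order the earlier train_dir listing returned them.
listed = ['javascript', 'python', 'csharp', 'java']
label_of = infer_class_indices(listed)
print(label_of)
```

Because the labels depend only on the sorted names, the training, validation, and test datasets get the same label for the same class as long as the directory names match.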

In [16]:
raw_val_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed
)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
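Both calls pass the same `seed` and `validation_split`, which is what keeps the training and validation subsets complementary. The idea can be sketched in plain Python as a seeded shuffle followed by a cut. This is an illustration under that assumption, not Keras's implementation, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_files, validation_split, seed, subset):
    """Shuffle range(n_files) with a fixed seed and return one side of the split."""
    indices = list(range(n_files))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every call
    n_train = n_files - int(validation_split * n_files)
    if subset == 'training':
        return indices[:n_train]
    if subset == 'validation':
        return indices[n_train:]
    raise ValueError("subset must be 'training' or 'validation'")


train_idx = split_indices(8000, 0.2, seed=42, subset='training')
val_idx = split_indices(8000, 0.2, seed=42, subset='validation')
print(len(train_idx), len(val_idx))
```

With different seeds in the two calls, the shuffled orders would differ and some files could land in both subsets or in neither.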


In [17]:
test_dir = dataset_dir/'test'
raw_test_dir = preprocessing.text_dataset_from_directory(
    test_dir,
    batch_size=batch_size
)

Found 8000 files belonging to 4 classes.
