# ODSC West Workshop: Before we get started ...

This notebook gives a brief introduction to TensorFlow's Ragged Tensors and to TensorFlow Text.

<table class="tfo-notebook-buttons" width="100%">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/hanneshapke/ODSC-BERT-TFX-Pipelines/blob/main/ODSC_West_Workshop_Before_we_get_started.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/hanneshapke/ODSC-BERT-TFX-Pipelines/blob/main/ODSC_West_Workshop_Before_we_get_started.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>


In [1]:
!pip install -qU tensorflow-text

[?25l[K     |▏                               | 10kB 11.7MB/s eta 0:00:01[K     |▎                               | 20kB 1.6MB/s eta 0:00:02[K     |▍                               | 30kB 1.8MB/s eta 0:00:02[K     |▌                               | 40kB 2.2MB/s eta 0:00:02[K     |▋                               | 51kB 2.0MB/s eta 0:00:02[K     |▊                               | 61kB 2.0MB/s eta 0:00:02[K     |▉                               | 71kB 2.2MB/s eta 0:00:02[K     |█                               | 81kB 2.5MB/s eta 0:00:02[K     |█▏                              | 92kB 2.6MB/s eta 0:00:01[K     |█▎                              | 102kB 2.8MB/s eta 0:00:01[K     |█▍                              | 112kB 2.8MB/s eta 0:00:01[K     |█▌                              | 122kB 2.8MB/s eta 0:00:01[K     |█▋                              | 133kB 2.8MB/s eta 0:00:01[K     |█▊                              | 143kB 2.8MB/s eta 0:00:01[K     |█▉                        

In [7]:
import tensorflow as tf
import tensorflow_text as text

## Example of a Ragged Tensor

In [5]:
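# Rows may have different lengths: 3 tokens in the first row, 5 in the second.
tokens = tf.ragged.constant([["Hi", "ODSC", "audience"], ["thanks", "for", "attending", "this", "workshop"]])
print(tokens)
# String ops apply element-wise: keep the first 2 bytes of every token.
print(tf.strings.substr(tokens, 0, 2))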

<tf.RaggedTensor [[b'Hi', b'ODSC', b'audience'], [b'thanks', b'for', b'attending', b'this', b'workshop']]>
<tf.RaggedTensor [[b'Hi', b'OD', b'au'], [b'th', b'fo', b'at', b'th', b'wo']]>
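One way to picture a ragged tensor is as a flat list of values plus the positions where each row starts and ends; `tf.RaggedTensor` exposes this view through its `values` and `row_splits` properties. The plain-Python sketch below rebuilds the rows of the example above from that encoding. It needs no TensorFlow, and the helper name `rows_from_splits` is hypothetical, chosen only for illustration.

```python
# Plain-Python sketch of the "flat values + row splits" encoding.
# `rows_from_splits` is a hypothetical helper, not a TensorFlow API.

def rows_from_splits(values, row_splits):
    """Slice a flat list into rows; row i spans row_splits[i]:row_splits[i + 1]."""
    return [values[start:end] for start, end in zip(row_splits[:-1], row_splits[1:])]


flat_values = ["Hi", "ODSC", "audience", "thanks", "for", "attending", "this", "workshop"]
row_splits = [0, 3, 8]  # row 0 holds 3 tokens, row 1 holds 5

rows = rows_from_splits(flat_values, row_splits)
print(rows)
```

Storing one flat list avoids padding entirely, which is why rows of different lengths can live in the same tensor.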


In [6]:
# Convert to a dense tensor; shorter rows are padded with empty strings.
tokens.to_tensor()

<tf.Tensor: shape=(2, 5), dtype=string, numpy=
array([[b'Hi', b'ODSC', b'audience', b'', b''],
       [b'thanks', b'for', b'attending', b'this', b'workshop']],
      dtype=object)>
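The dense result above pads the shorter row with empty strings until every row has the same length. As a rough illustration of that padding step (not TensorFlow's implementation), here is a plain-Python sketch; `pad_rows` and its `default` parameter are hypothetical names, loosely mirroring the `default_value` argument that `to_tensor()` accepts.

```python
# Plain-Python sketch of padding ragged rows to a rectangular shape.
# `pad_rows` is a hypothetical helper, not part of TensorFlow.

def pad_rows(rows, default=""):
    """Right-pad every row with `default` up to the length of the longest row."""
    width = max((len(row) for row in rows), default=0)
    return [row + [default] * (width - len(row)) for row in rows]


ragged_rows = [["Hi", "ODSC", "audience"],
               ["thanks", "for", "attending", "this", "workshop"]]

padded = pad_rows(ragged_rows)
for row in padded:
    print(row)
```

Padding makes the data usable by ops that expect dense tensors, at the cost of carrying filler values that later steps usually have to mask out.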

## TensorFlow Text examples

### Tokenization

In [8]:
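# Split each input string on whitespace; the result is a ragged tensor
# because the two sentences contain different numbers of tokens.
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['This is a little example', '☜ is quite useful'.encode('UTF-8')])
print(tokens.to_list())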

[[b'This', b'is', b'a', b'little', b'example'], [b'\xe2\x98\x9c', b'is', b'quite', b'useful']]
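The tokens in the output above are UTF-8 encoded byte strings, which is why the `☜` character appears as `b'\xe2\x98\x9c'`. A small plain-Python sketch of turning such a nested token list back into readable text follows; the variable names are illustrative, and the list is copied by hand from the output rather than produced by TensorFlow.

```python
# Decode UTF-8 byte tokens (as returned by `to_list()`) into Python strings.
tokens_as_bytes = [
    [b"This", b"is", b"a", b"little", b"example"],
    [b"\xe2\x98\x9c", b"is", b"quite", b"useful"],
]

tokens_as_text = [[tok.decode("utf-8") for tok in row] for row in tokens_as_bytes]
print(tokens_as_text)
```

Note that the symbol occupies three bytes but is a single character once decoded, so byte-based and character-based lengths differ for non-ASCII tokens.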


### n-grams

In [11]:
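# Build 1-, 2-, and 3-grams; STRING_JOIN concatenates the tokens in each
# window, using a single space as the default separator.
for i in [1, 2, 3]:
    n_gram = text.ngrams(tokens, i, reduction_type=text.Reduction.STRING_JOIN)
    print("{}-gram: {}".format(i, n_gram.to_list()))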

1-gram: [[b'This', b'is', b'a', b'little', b'example'], [b'\xe2\x98\x9c', b'is', b'quite', b'useful']]
2-gram: [[b'This is', b'is a', b'a little', b'little example'], [b'\xe2\x98\x9c is', b'is quite', b'quite useful']]
3-gram: [[b'This is a', b'is a little', b'a little example'], [b'\xe2\x98\x9c is quite', b'is quite useful']]
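The n-gram output above corresponds to sliding a window of fixed width over each row and joining the tokens inside it. The following plain-Python sketch shows that idea for a single row; `sliding_ngrams` is a hypothetical helper written for illustration and is not how `text.ngrams` is implemented.

```python
# Plain-Python sketch of sliding-window n-grams over one row of tokens.
# `sliding_ngrams` is a hypothetical helper, not a TensorFlow Text API.

def sliding_ngrams(tokens, width, separator=" "):
    """Join every run of `width` consecutive tokens with `separator`."""
    return [
        separator.join(tokens[i:i + width])
        for i in range(len(tokens) - width + 1)
    ]


row = ["This", "is", "a", "little", "example"]
for width in (1, 2, 3):
    print(width, sliding_ngrams(row, width))
```

A row with `n` tokens yields `n - width + 1` n-grams, so rows shorter than the window produce none; this is also why the n-gram result stays ragged.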
