# Welcome to this Colab for creating a comment spam classifier!
In this Colab, you'll step through code that trains a classifier on data we've prepared
from thousands of YouTube comments. We've tidied up the text
and put it in a CSV file so it's ready to go. Training takes just a few minutes,
and a simple library abstracts away many of the difficult
details of building Natural Language Processing models.
We've also provided guidance on how to fine-tune the model should you want to do so.

The first two cells here are hidden code. You can unhide them to see what they do, or you can just run them! 

In [None]:
#@title Run to install (It is safe to ignore a tensorflowjs 3.2.0 error if you see it)
# Install Model maker
!pip install -q tflite-model-maker

In [None]:
#@title Run to import required libraries
# Imports and check that we are using TF2.x
import numpy as np
import os

from tflite_model_maker import configs
from tflite_model_maker import ExportFormat
from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker import TextClassifierDataLoader

import tensorflow as tf
assert tf.__version__.startswith('2')
tf.get_logger().setLevel('ERROR')

In [None]:
# This will download the training data from the given URL.
# You can use any CSV file for your data as long as it has one column
# with the text that you want to train on, and another with the label.
# The column names should be defined in the first line of the CSV file.
training_data = tf.keras.utils.get_file(fname='cleaned_youtube.csv', origin='https://storage.googleapis.com/laurencemoroney-blog.appspot.com/cleaned_youtube.csv', extract=False)

In [None]:
# This takes the downloaded CSV file and builds the path to the local
# copy so we can train with it later
training_data = os.path.join(os.path.dirname(training_data), 'cleaned_youtube.csv')

In [None]:
# Use a model spec from Model Maker. Options are 'mobilebert_classifier', 'bert_classifier' and 'average_word_vec'.
# The first two use BERT-based models, which are accurate, but larger and slower to train.
# Average Word Vec is a much smaller model: it learns a vector for each word in the
# vocabulary built from your training data and averages those vectors for each sentence.
spec = model_spec.get('average_word_vec')

# The num_words parameter is the size of the vocabulary you want to train with.
# Here that is the 2000 most frequent words across all the text.
# A nice way to fine-tune your model is to tweak this. Too big, and the model will be trained
# using words that almost never show up. Too small, and the model will miss out on some important
# words. Feel free to experiment with this value.
spec.num_words = 2000

# When training a natural language processing model, the underlying engine converts
# words into tokens. Sentences are then sequences of tokens. Every sentence fed
# into the model must be a sequence of tokens, and all sequences must be the same
# length. Here we set it to 20, so sentences of more than 20 words will be truncated, and
# shorter sentences will be padded. Choose this number carefully based on the length of
# sentences in your training data, and the expected length of sentences sent by users.
spec.seq_len = 20

# As you study NLP, you'll learn about word embeddings, which are vectors that denote a 'direction'
# for a word. These vectors are used to establish sentiment. In a simple sense, think about 'true' pointing left
# and 'false' pointing right: from the direction alone, you can establish sentiment. When you have
# thousands of words, you need more dimensions. A common rule of thumb is that a good starting point
# for the number of dimensions is the fourth root of the number of words. We are using
# 2000 words, the fourth root of which is about 6.69, so it is rounded up to 7 here.
spec.wordvec_dim = 7
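
The three settings above can be illustrated without Model Maker. The sketch below is plain Python and is not the library's implementation: the helper names (`build_vocab`, `encode`, `suggest_wordvec_dim`), the whitespace tokenizer, and the reserved ids 0 (padding) and 1 (unknown word) are assumptions made purely for illustration.

```python
import math
from collections import Counter

PAD_ID = 0  # assumed id used to pad short sequences
UNK_ID = 1  # assumed id for words outside the vocabulary


def build_vocab(texts, num_words):
    """Map the num_words most frequent words to integer ids (hypothetical helper)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(num_words)]
    # Ids 0 and 1 are reserved, so real words start at 2.
    return {word: index + 2 for index, word in enumerate(most_common)}


def encode(text, vocab, seq_len):
    """Turn a sentence into exactly seq_len token ids, truncating or padding."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:seq_len]                           # truncate long sentences
    return ids + [PAD_ID] * (seq_len - len(ids))  # pad short ones


def suggest_wordvec_dim(num_words):
    """Fourth-root rule of thumb for the embedding size, rounded up."""
    return math.ceil(num_words ** 0.25)


# Made-up comments standing in for the real training text.
comments = ["check out my channel", "great video", "check out this great song"]
vocab = build_vocab(comments, num_words=4)
print(encode("check out my new video", vocab, seq_len=8))
print(suggest_wordvec_dim(2000))  # prints 7
```

Words that fall outside the top `num_words` all collapse to the same unknown id, which is why a vocabulary that is too small loses information, and why a `seq_len` that is too short silently drops the end of long comments.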


In [None]:
# Load the CSV using TextClassifierDataLoader.from_csv to create the training data.
# The important things to note are the text_column and the label_column.
# These names should be present in the first row of your file.
# So, your CSV should look something like this:
# commenttext,spam
# text for the comment,true
# text for another comment,true
# text for yet another comment,false
train_data = TextClassifierDataLoader.from_csv(
      filename=training_data,
      text_column='commenttext', #For Toxicity use " value_of_text" (note the leading space)
      label_column='spam', #For Toxicity also use "label"
      model_spec=spec,
      delimiter=',',
      is_training=True)
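
If you bring your own CSV, it can help to confirm the header names before handing the file to the loader, because a stray space after a comma becomes part of the column name. The following is a minimal sketch using only the standard library; `check_columns` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io

# A tiny stand-in for the real file; the rows are invented for illustration.
sample_csv = io.StringIO(
    "commenttext,spam\n"
    "subscribe to my channel for free gift cards,true\n"
    "this song never gets old,false\n"
)


def check_columns(csv_file, text_column, label_column):
    """Return the distinct label values, or raise if a column is missing."""
    reader = csv.DictReader(csv_file)
    missing = {text_column, label_column} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV header is missing columns: {sorted(missing)}")
    return sorted({row[label_column] for row in reader})


labels = check_columns(sample_csv, text_column="commenttext", label_column="spam")
print(labels)
```

For a file on disk, the same helper could be called as `check_columns(open(training_data, newline=''), 'commenttext', 'spam')`.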

In [None]:
# Build the model
# All of the model architecture is abstracted within this
# With 100 epochs it trains to 97%+ accuracy. You can tweak this value
# to speed up the training, but it's already quite fast!
model = text_classifier.create(train_data, model_spec=spec, epochs=100)

In [None]:
# Export the model as a TFLITE file to use in Android or iOS
model.export(export_dir='/mm_spam')

# If you are an iOS developer, you'll also need the labels and vocab, so you can export them like this
model.export(export_dir='/mm_spam/', export_format=[ExportFormat.LABEL, ExportFormat.VOCAB])

# Downloading the model
In the previous code section, you exported the model to a directory called /mm_spam.

On the left side of Colab, there are four icons: a list, a magnifying glass, brackets, and a folder. Click the folder.

This gives you access to the directory structure of the virtual machine running your code in the cloud.

Click the folder with the up arrow to go to the root directory.

You'll get a listing of all directories, including the mm_spam directory.

Open it, and you'll see a file called 'model.tflite'. Click on it to download it for use in an Android or iOS app. If you also exported the labels and vocab files, you'll see labels.txt and vocab. You'll need to download these, too.
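
If you prefer to check the export from code rather than the file browser, a small sketch like the one below reports which of the files named above are missing. `missing_exports` is a hypothetical helper, and the demonstration runs against a throwaway directory rather than the real /mm_spam.

```python
from pathlib import Path
import tempfile

# File names mentioned in the text above.
EXPECTED_FILES = ["model.tflite", "labels.txt", "vocab"]


def missing_exports(export_dir, expected=EXPECTED_FILES):
    """Return the expected export files that are not present in export_dir."""
    directory = Path(export_dir)
    return [name for name in expected if not (directory / name).is_file()]


# Demonstration on a temporary directory containing only an empty model file.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "model.tflite").write_bytes(b"")
    print(missing_exports(demo_dir))
```

In the notebook itself you would call `missing_exports('/mm_spam')` after the export cell has run; an empty list means all three files are there.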

In [None]:
# Optionally you can shrink and quantize the model prior to exporting.
# This makes the model smaller and usually faster, but possibly slightly less accurate.
config = configs.QuantizationConfig.create_dynamic_range_quantization(optimizations=[tf.lite.Optimize.OPTIMIZE_FOR_LATENCY])
config.experimental_new_quantizer = True
model.export(export_dir='/mm_spam/', quantization_config=config)
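
To see why quantization trades a little accuracy for size, here is a deliberately simplified sketch of 8-bit quantization in plain Python. It assumes a symmetric scheme with one shared scale for all weights; it is an illustration of the idea, not the algorithm TensorFlow Lite actually runs, and the weights are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.9]  # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding step is where precision is lost: each weight can move
# by at most half of one quantization step.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)
print(worst_error <= scale / 2 + 1e-12)
```

Each stored value needs one byte instead of the four used by a 32-bit float, which is where the size reduction comes from. Small weights such as 0.003 round to zero, which is the kind of loss that can nudge accuracy down.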

In [None]:
# At this point you're done! But if you want to explore the neural network
# architecture of the model, you could do this:
model.summary()
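
As the name 'average word vec' suggests, the core of this architecture is averaging the vectors of the words in a sentence into a single vector, which the later layers then classify. The sketch below shows only that averaging step in plain Python. The embedding values are made up, and treating padding as an ordinary position in the average is an assumption of this sketch, not a statement about the library.

```python
def average_word_vectors(token_ids, embedding_table):
    """Average the embedding vectors of a token sequence into one vector."""
    vectors = [embedding_table[token_id] for token_id in token_ids]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Made-up 2-dimensional embeddings for a 4-token vocabulary.
embedding_table = [
    [0.0, 0.0],   # id 0: padding
    [1.0, -1.0],  # id 1
    [3.0, 1.0],   # id 2
    [-1.0, 3.0],  # id 3
]

sentence_vector = average_word_vectors([1, 2, 3, 0], embedding_table)
print(sentence_vector)  # prints [0.75, 0.75]
```

Because the result has the same size regardless of sentence length, the model stays small, at the cost of ignoring word order.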

In [None]:
# And if you want to export a JSON-formatted model that is easier to load
# into TensorFlow.js, you can do this
model.export(export_dir="/mm_js/", export_format=[ExportFormat.TFJS, ExportFormat.LABEL, ExportFormat.VOCAB])