<a href="https://colab.research.google.com/github/ketakee/ketakee.github.io/blob/master/Comment_toxicity_predictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal of this tutorial is to build an in-browser text classifier that works offline as well as online.

To keep deployment easy, we will host it on GitHub Pages.

This notebook uses code borrowed from: https://github.com/tensorflow/tfjs-examples/blob/master/sentiment/index.js and from a previous Deep Learning Assignment about text classification in browser

At the end of the tutorial, you will have a webpage whose background colour changes with the toxicity of the text typed into it. This could easily be ported to your website's comment section, for example to change the colour of a comment as it is being entered according to its toxicity.


###Dataset
We will use data from the Kaggle Toxic Comment Classification Challenge.
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/rules

Go ahead and download the training data (train.csv) from the website, then upload the file to Colab.

Once you've done that, let's set up the environment.

In [0]:
!pip install tensorflow==2.0.0-alpha0
!pip install tensorflowjs==1.0.1

Let us now set up the GitHub parameters so that this notebook can talk to GitHub directly, instead of us moving files around manually.


Set up github pages as instructed in :


Also, get your token from https://github.com/settings/tokens


Remember to make permissions public.


In [0]:
# your github username and email
USER_NAME = "abc" 
USER_EMAIL = "abc@gmail.com" 

#Do not publish this token anywhere
TOKEN=""
# for example, if your user_name is "foo", then this notebook will create
# a site at "https://foo.github.io"
SITE_NAME = "https://foo.github.io"
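
Since the token should not be published, one option is to keep it out of the notebook text entirely and read it at run time. The sketch below is only an illustration under assumptions: the helper `read_token` and the environment variable name `GITHUB_TOKEN` are hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass

def read_token(env_var="GITHUB_TOKEN", prompt=getpass):
    """Return a token from the environment, or ask for it without echoing.

    `env_var` and this helper are illustrative names, not part of the notebook.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # getpass hides the typed characters, so the token never appears in output
        token = prompt("GitHub token: ").strip()
    return token

# Possible usage in the cell above (uncomment to try it):
# TOKEN = read_token()
```

With this approach the saved notebook contains no secret, at the cost of typing the token once per session.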

In [0]:
#Let's configure git (the email and name must go to their matching settings)
!git config --global user.email "{USER_EMAIL}"
!git config --global user.name  "{USER_NAME}"

In [0]:
#Cloning our GitHub repo
import os
!git clone https://{USER_NAME}:{TOKEN}@github.com/{USER_NAME}/{USER_NAME}.github.io

In [0]:
repo_path = USER_NAME + '.github.io'
os.chdir(repo_path)
!git pull


In [0]:
# #Create a folder for your site
# project_path = os.path.join(os.getcwd())
# if not os.path.exists(project_path): 
#   os.mkdir(project_path)
# os.chdir(project_path)

In [0]:
# DO NOT MODIFY
# project_path is the repository root (the current directory after os.chdir above)
project_path = os.getcwd()
MODEL_DIR = os.path.join(project_path, "model_js")
os.makedirs(MODEL_DIR, exist_ok=True)

#Model
Now that we are done with housekeeping, we can train our model


In [0]:
#Load the data
import pandas as pd
import numpy as np


In [0]:
df=pd.read_csv("/content/train.csv")

In [0]:
#Let's get the toxic and non toxic comments
toxic=df[df["toxic"]==1]["comment_text"]
non_toxic=df[df["toxic"]==0]["comment_text"]

#Converting them to lists so that they can be shuffled and sliced
toxic=list(toxic)
non_toxic=list(non_toxic)

In [0]:
#Shuffle each class, then take 7000 comments from each for training and the next 3000 from each for testing
from random import shuffle
shuffle(toxic)
shuffle(non_toxic)

toxic_train=toxic[:7000]
non_toxic_train = non_toxic[:7000]
toxic_test=toxic[7000:10000]
non_toxic_test = non_toxic[7000:10000]
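
Because `shuffle` uses the global random state, the split above changes on every run. If a repeatable split is wanted, a minimal sketch is shown below; the helper name `split_class` and the `seed` parameter are assumptions for illustration, not part of the original notebook.

```python
import random

def split_class(texts, n_train=7000, n_test=3000, seed=0):
    """Shuffle a copy of `texts` with a fixed seed and cut train/test slices."""
    if len(texts) < n_train + n_test:
        raise ValueError("not enough examples for the requested split")
    rng = random.Random(seed)   # private RNG, leaves the global one untouched
    shuffled = list(texts)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

# Tiny demonstration with made-up strings instead of real comments.
demo = ["comment %d" % i for i in range(10)]
train, test = split_class(demo, n_train=7, n_test=3, seed=42)
print(len(train), len(test))   # 7 3
```

In the notebook this could be called once per class, for example `toxic_train, toxic_test = split_class(toxic)`.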

In [0]:
#Let's create training and testing datasets
#Labels: 0 = toxic, 1 = non-toxic

x_train=toxic_train + non_toxic_train
y_train=[0]*len(toxic_train) + [1]*len(non_toxic_train)
x_test=toxic_test + non_toxic_test
y_test=[0]*len(toxic_test) + [1]*len(non_toxic_test)

#Text Pre-processing
Now that we have our text data, let's set up the text pre-processing steps



In [0]:
len_vec = [len(elem) for elem in x_train]  # comment lengths in characters, for inspection only
max_len = 40        # number of tokens kept per comment
num_words = 10000   # vocabulary size used by the tokenizer and the embedding layer
from tensorflow.keras.preprocessing.text import Tokenizer
# Fit the tokenizer on both the training and the test texts
t = Tokenizer(num_words=num_words,  filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
t.fit_on_texts(x_train+x_test)

In [0]:
#Let's take a look at the dictionary
print(t.word_index)
print(len(t.word_index.keys()))

In [0]:
#The metadata will help load things when we port the model to the browser and load it using JS.
#As a rule of thumb, anything you will need to maintain or share, such as word mappings, should be stored in the metadata. This can be as small or as big as you like
metadata = {
  'word_index': t.word_index,
  'max_len': max_len,
  'vocabulary_size': num_words,
}
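
The Keras `Tokenizer` keeps every word it has seen in `word_index`, even though only indices below `num_words` are used when text is converted to sequences. If `metadata.json` turns out to be large, one option is to store only the entries the model can actually use. The helper name `trim_word_index` below is hypothetical, and the vocabulary is made up for the demonstration.

```python
def trim_word_index(word_index, num_words):
    """Keep only words whose index is below `num_words`.

    Indices start at 1, so this keeps at most num_words - 1 entries.
    """
    return {word: idx for word, idx in word_index.items() if idx < num_words}

# Made-up vocabulary for illustration; real indices come from t.word_index.
demo_index = {"the": 1, "you": 2, "article": 3, "rare": 4, "rarer": 5}
print(trim_word_index(demo_index, num_words=4))
# {'the': 1, 'you': 2, 'article': 3}
```

If used, the trimmed dictionary would take the place of `t.word_index` under the `'word_index'` key.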

#Model

In [0]:
embedding_size = 16
n_classes = 2
epochs = 20
import tensorflow as tf

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(num_words, embedding_size, input_shape=(max_len,)))
# model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dropout(0.6))
model.add(tf.keras.layers.Dense(2, activation='softmax'))
model.compile('adam', 'sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
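
The parameter counts that `model.summary()` prints can be checked by hand. The arithmetic below assumes the layer sizes defined in the cell above; note that a `Dense` layer applied to the 3-D embedding output acts on the last axis only, so its weights are shared across all 40 positions.

```python
num_words, embedding_size, max_len = 10000, 16, 40

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

embedding = num_words * embedding_size          # one vector per word index
dense_64 = dense_params(embedding_size, 64)     # shared across the 40 positions
dense_16 = dense_params(64, 16)
flattened = max_len * 16                        # Flatten and Dropout have no weights
output = dense_params(flattened, 2)

total = embedding + dense_64 + dense_16 + output
print(embedding, dense_64, dense_16, output, total)
# 160000 1088 1040 1282 163410
```

Almost all of the parameters sit in the embedding table, which is why `num_words` and `embedding_size` dominate the size of the model downloaded by the browser.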

In [0]:
#Preparing the training data
from tensorflow.keras.preprocessing.sequence import pad_sequences
x_train = t.texts_to_sequences(x_train)
x_train = pad_sequences(x_train, maxlen=max_len, padding='post')
print(x_train)
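
`pad_sequences` is called here with `padding='post'` but with its default `truncating='pre'`, so short comments get zeros at the end while long comments keep their last `max_len` tokens. The pure-Python function below is a simplified stand-in written for illustration, not the Keras implementation.

```python
def pad_post_truncate_pre(seq, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post') with its default truncating='pre'."""
    kept = seq[-maxlen:]                            # drop tokens from the front
    return kept + [value] * (maxlen - len(kept))    # zeros go at the end

print(pad_post_truncate_pre([5, 9, 2], maxlen=5))           # [5, 9, 2, 0, 0]
print(pad_post_truncate_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

Any code that prepares input for this model elsewhere, including in the browser, needs to follow the same convention to see the same inputs as during training.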

In [0]:
model.fit(x_train, np.array(y_train), epochs=epochs)

Let's test it

In [0]:
#test_example = "Left Munich at 8:35 P. M., on 1st May, arriving at Vienna early next morning."
test_example = non_toxic[18000]
x_test_ex = t.texts_to_sequences([test_example])
x_test_ex = pad_sequences(x_test_ex, maxlen=max_len, padding='post')
print(x_test_ex)

In [0]:
print(test_example)

In [0]:
#0 - toxic, 1 - non-toxic
preds = model.predict(x_test_ex)
print(preds)
print(np.argmax(preds))
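
To make the printed index easier to read, the softmax output can be mapped back to a label name. This is a small sketch; the `LABELS` list and the `describe` helper are illustrative names, with the order taken from how `y_train` was built (0 for toxic, 1 for non-toxic).

```python
LABELS = ["toxic", "non-toxic"]   # order follows y_train: 0 = toxic, 1 = non-toxic

def describe(probabilities):
    """Turn one row of softmax output into a readable label and confidence."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return LABELS[best], float(probabilities[best])

# Made-up scores standing in for one row of model.predict(...)
print(describe([0.12, 0.88]))   # ('non-toxic', 0.88)
```

In the notebook, `describe(preds[0])` would report the label for the single test example.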

In [0]:
x_test = t.texts_to_sequences(x_test)
x_test = pad_sequences(x_test, maxlen=max_len, padding='post')


In [0]:
model.evaluate(x_test,np.array(y_test))
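
`model.evaluate` reports the loss and overall accuracy, which does not show which kind of mistake the model makes. A minimal sketch for counting the error types is given below, using made-up labels; the helper name `confusion_counts` is an assumption for illustration.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

# Made-up labels for illustration; 0 = toxic, 1 = non-toxic.
y_true_demo = [0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 0]
counts = confusion_counts(y_true_demo, y_pred_demo)
missed_toxic = counts[(0, 1)]          # toxic comments predicted as non-toxic
accuracy = (counts[(0, 0)] + counts[(1, 1)]) / len(y_true_demo)
print(missed_toxic, round(accuracy, 3))   # 1 0.667
```

For the real test set, the predicted labels could be obtained with `np.argmax(model.predict(x_test), axis=1)` and compared against `y_test`.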

#Convert the model

Now that we have our complete model, we are ready to port it to TensorFlow.js.

In [0]:
import json
import tensorflowjs as tfjs

#Let's write the metadata to a json file
metadata_json_path = os.path.join(MODEL_DIR, 'metadata.json')
with open(metadata_json_path, 'wt') as f:
  json.dump(metadata, f)

#Converting the model and saving it
tfjs.converters.save_keras_model(model, MODEL_DIR)
print('\nSaved model artifacts in directory: %s' % MODEL_DIR)
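
Before pushing, it can be worth checking what landed in the export directory, since the browser downloads every file in it. The sketch below lists file sizes and confirms that `model.json` is present and parses; the helper name `summarize_export` is hypothetical.

```python
import json
import os

def summarize_export(model_dir):
    """List exported files with sizes and check that model.json parses as JSON."""
    sizes = {name: os.path.getsize(os.path.join(model_dir, name))
             for name in sorted(os.listdir(model_dir))}
    model_json = os.path.join(model_dir, "model.json")
    has_model = os.path.isfile(model_json)
    if has_model:
        with open(model_json) as f:
            json.load(f)          # raises ValueError if the file is corrupt
    return sizes, has_model

# Possible usage in the notebook (uncomment to try it):
# sizes, ok = summarize_export(MODEL_DIR)
# print(ok, sizes)
```

The directory is expected to hold `model.json`, `metadata.json`, and one or more binary weight files written by the converter.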

In [0]:
index_html = """
<!doctype html>

<body>
  <style>
    #text-entry {
      font-size: 120%;
      width: 60%;
      height: 200px;
    }
  </style>
  <h1>
    Title
  </h1>
  <hr>
  <div class="create-model">
    <button id="load-model" style="display:none">Load model</button>
  </div>
  <div>
    <div>
      <span>Vocabulary size: </span>
      <span id="vocabularySize"></span>
    </div>
    <div>
      <span>Max length: </span>
      <span id="maxLen"></span>
    </div>
  </div>
  <hr>
  <div>
    <textarea id="text-entry"></textarea>
  </div>
  <hr>
  <div>
    <span id="status">Standing by.</span>
  </div>

  <script src='https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js'></script>
  <script src='index.js'></script>
</body>
"""

In [0]:
index_js = """

const HOSTED_URLS = {
  model:
      'model_js/model.json',
  metadata:
      'model_js/metadata.json'
};

function status(statusText) {
  console.log(statusText);
  document.getElementById('status').textContent = statusText;
}

function showMetadata(metadataJSON) {
  document.getElementById('vocabularySize').textContent =
      metadataJSON['vocabulary_size'];
  document.getElementById('maxLen').textContent =
      metadataJSON['max_len'];
}

function settextField(text, predict) {
  const textField = document.getElementById('text-entry');
  textField.value = text;
  doPredict(predict);
}

function setPredictFunction(predict) {
  const textField = document.getElementById('text-entry');
  textField.addEventListener('input', () => doPredict(predict));
}

function disableLoadModelButtons() {
  document.getElementById('load-model').style.display = 'none';
}

function doPredict(predict) {
  const textField = document.getElementById('text-entry');
  const result = predict(textField.value);
  console.log(result);
  let scoreString = 'Class scores: ';
  // Class 0 is toxic (red channel), class 1 is non-toxic (green channel).
  const red = Math.round(result.score[0] * 255);
  const green = Math.round(result.score[1] * 255);
  document.body.style.background = 'rgb(' + red + ',' + green + ',0)';
  for (let i = 0; i < result.score.length; ++i) {
    scoreString += i + ' : ' + result.score[i].toFixed(3) + ', ';
  }
  status(scoreString + '(elapsed: ' + result.elapsed.toFixed(3) + ' ms)');
}

function prepUI(predict) {
  // The page has no example selector, so just wire up live prediction
  // and score whatever text is already in the box.
  setPredictFunction(predict);
  doPredict(predict);
}

async function urlExists(url) {
  status('Testing url ' + url);
  try {
    const response = await fetch(url, {method: 'HEAD'});
    return response.ok;
  } catch (err) {
    return false;
  }
}

async function loadHostedPretrainedModel(url) {
  status('Loading pretrained model from ' + url);
  try {
    const model = await tf.loadLayersModel(url);
    status('Done loading pretrained model.');
    disableLoadModelButtons();
    return model;
  } catch (err) {
    console.error(err);
    status('Loading pretrained model failed.');
  }
}

async function loadHostedMetadata(url) {
  status('Loading metadata from ' + url);
  try {
    const metadataJson = await fetch(url);
    const metadata = await metadataJson.json();
    status('Done loading metadata.');
    return metadata;
  } catch (err) {
    console.error(err);
    status('Loading metadata failed.');
  }
}

class Classifier {

  async init(urls) {
    this.urls = urls;
    this.model = await loadHostedPretrainedModel(urls.model);
    await this.loadMetadata();
    return this;
  }

  async loadMetadata() {
    const metadata =
        await loadHostedMetadata(this.urls.metadata);
    showMetadata(metadata);
    this.maxLen = metadata['max_len'];
    console.log('maxLen = ' + this.maxLen);
    this.vocabularySize = metadata['vocabulary_size'];
    this.wordIndex = metadata['word_index'];
  }

  predict(text) {
    // Lower-case, replace common punctuation with spaces, and split into words.
    // This covers only a subset of the characters the Keras Tokenizer filters.
    const inputText = text.trim().toLowerCase()
        .replace(/[.,!?;:()]/g, ' ').split(' ').filter(w => w.length > 0);
    // Look up word indices, skipping words the model cannot use: the Keras
    // Tokenizer drops them too, and indices >= vocabularySize would fall
    // outside the embedding table.
    const indices = inputText.map(w => this.wordIndex[w])
        .filter(i => i !== undefined && i < this.vocabularySize);
    // pad_sequences truncates from the front by default, so keep the last maxLen.
    const kept = indices.slice(-this.maxLen);
    const inputBuffer = tf.buffer([1, this.maxLen], 'float32');
    for (let i = 0; i < kept.length; ++i) {
      inputBuffer.set(kept[i], 0, i);
    }
    const input = inputBuffer.toTensor();

    status('Running inference');
    const beginMs = performance.now();
    const predictOut = this.model.predict(input);
    const score = predictOut.dataSync();
    predictOut.dispose();
    input.dispose();
    const endMs = performance.now();

    return {score: score, elapsed: (endMs - beginMs)};
  }
};

async function setup() {
  if (await urlExists(HOSTED_URLS.model)) {
    status('Model available: ' + HOSTED_URLS.model);
    const button = document.getElementById('load-model');
    button.addEventListener('click', async () => {
      const predictor = await new Classifier().init(HOSTED_URLS);
      prepUI(x => predictor.predict(x));
    });
    button.style.display = 'inline-block';
  }

  status('Standing by.');
}

setup();

"""
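
The background colour set in `doPredict` is a direct mapping from the two class scores to the red and green channels. A Python rendering of the same idea is sketched below so the mapping can be checked in the notebook; the function name `toxicity_colour` is hypothetical, and the sketch assumes 255 as the maximum channel value.

```python
def toxicity_colour(scores):
    """Map [p_toxic, p_non_toxic] to a CSS rgb() string: red = toxic, green = non-toxic."""
    red = round(scores[0] * 255)
    green = round(scores[1] * 255)
    return "rgb(%d,%d,0)" % (red, green)

print(toxicity_colour([1.0, 0.0]))   # rgb(255,0,0)
print(toxicity_colour([0.0, 1.0]))   # rgb(0,255,0)
```

Scores in between blend the two channels, so an uncertain prediction shows up as a darker yellow-brown tone instead of clear red or green.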

In [0]:
with open('index.html','w') as f:
  f.write(index_html)
  
with open('index.js','w') as f:
  f.write(index_js)

In [0]:
!git add . 
!git commit -m "colab -> github"
!git push https://{USER_NAME}:{TOKEN}@github.com/{USER_NAME}/{USER_NAME}.github.io/ master

In [0]:
print("Now, visit https://%s.github.io/" % (USER_NAME))