<a href="https://colab.research.google.com/github/hamletbatista/sel/blob/master/How_to_Evaluate_Content_Quality_with_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#How to Evalute Content Quality with BERT

1. Build a predictive model to classify grammatically correct sentences
2. Fetch a target page and extract the text.
3. Split it into sentences.
4. Predict each sentence as gramatically correct or not.
5. Calculate and report the percentage of gramatically correct sentences

We will use Ludwig to train a BERT text classification model on the Corpus of Linguistic Acceptability (CoLA) dataset 

In [0]:
url="https://searchengineland.com/the-dangers-of-misplaced-third-party-scripts-327329" #@param {type:"string"}
selector="p > a" #@param {type:"string"}


##Build Predictive Model
Sourced from https://colab.research.google.com/drive/13ErkLg5FZHIbnUGZRkKlL-9WNCNQPIow#scrollTo=RYZgdpzpwY6w

In [0]:
!wget https://nyu-mll.github.io/CoLA/cola_public_1.1.zip

--2020-06-06 18:28:13--  https://nyu-mll.github.io/CoLA/cola_public_1.1.zip
Resolving nyu-mll.github.io (nyu-mll.github.io)... 185.199.108.153, 185.199.110.153, 185.199.109.153, ...
Connecting to nyu-mll.github.io (nyu-mll.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 255330 (249K) [application/zip]
Saving to: ‘cola_public_1.1.zip’


2020-06-06 18:28:13 (7.29 MB/s) - ‘cola_public_1.1.zip’ saved [255330/255330]



In [0]:
!unzip cola_public_1.1.zip

Archive:  cola_public_1.1.zip
   creating: cola_public/
  inflating: cola_public/README      
   creating: cola_public/tokenized/
  inflating: cola_public/tokenized/in_domain_dev.tsv  
  inflating: cola_public/tokenized/in_domain_train.tsv  
  inflating: cola_public/tokenized/out_of_domain_dev.tsv  
   creating: cola_public/raw/
  inflating: cola_public/raw/in_domain_dev.tsv  
  inflating: cola_public/raw/in_domain_train.tsv  
  inflating: cola_public/raw/out_of_domain_dev.tsv  


In [0]:
import pandas as pd

# Load the dataset into a pandas dataframe.
df = pd.read_csv("./cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

# Display 10 random rows from the data.
df.sample(10)

Number of training sentences: 8,551



Unnamed: 0,sentence_source,label,label_notes,sentence
2669,l-93,1,,Nora sent the book from Paris to London.
4883,ks08,1,,This is the student pictures of whom appeared ...
694,bc01,0,*,"Clearly, John perfectly will immediately learn..."
5186,kl93,1,,Perhaps some dry socks would help?
6964,m_02,1,,Emma and Harriet were attacked by those bandits.
326,bc01,1,,"Louise likes not being happy, doesn't she?"
1847,r-67,0,*,"They can't stand each other, them."
4094,ks08,1,,The picture on the wall reminded him of his co...
4396,ks08,1,,Mary did not avoid Bill.
4878,ks08,1,,The president Fred voted for has resigned.


In [0]:
#save to CSV
df.to_csv("cola_dataset.csv")

### Create Ludwig Model Definition

Sourced from https://gist.github.com/hamletbatista/f5993ee38d14643f0df71ae2303f5dfa#file-bert_model_definition-py



In [0]:
import tensorflow as tf; print(tf.__version__)

2.2.0


In [0]:
#https://github.com/uber/ludwig/blob/master/requirements.txt
#requires tensorflow 1.15.3

!pip install tensorflow-gpu==1.15.3

Collecting tensorflow-gpu==1.15.3
  Using cached https://files.pythonhosted.org/packages/98/ab/19aba3629427c2d96790f73838639136ce02b6e7e1c4f2dd60149174c794/tensorflow_gpu-1.15.3-cp36-cp36m-manylinux2010_x86_64.whl
Installing collected packages: tensorflow-gpu
  Found existing installation: tensorflow-gpu 2.2.0
    Uninstalling tensorflow-gpu-2.2.0:
      Successfully uninstalled tensorflow-gpu-2.2.0
Successfully installed tensorflow-gpu-1.15.3


In [0]:
%tensorflow_version 1.x
import tensorflow as tf; print(tf.__version__)

TensorFlow 1.x selected.
1.15.3


In [0]:
!pip install ludwig



In [0]:
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip

--2020-06-06 18:47:51--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.20.128, 2607:f8b0:400e:c08::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.20.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 407727028 (389M) [application/zip]
Saving to: ‘uncased_L-12_H-768_A-12.zip’


2020-06-06 18:47:54 (127 MB/s) - ‘uncased_L-12_H-768_A-12.zip’ saved [407727028/407727028]



In [0]:
!unzip uncased_L-12_H-768_A-12.zip

Archive:  uncased_L-12_H-768_A-12.zip
   creating: uncased_L-12_H-768_A-12/
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.meta  
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001  
  inflating: uncased_L-12_H-768_A-12/vocab.txt  
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.index  
  inflating: uncased_L-12_H-768_A-12/bert_config.json  


### Get Appropiate Hyperparameters
Sourced from https://app.wandb.ai/cayush/bert-finetuning/reports/Sentence-classification-with-Huggingface-BERT-and-W%26B--Vmlldzo4MDMwNA

In [0]:
#https://uber.github.io/ludwig/user_guide/#bert-encoder

template="""
input_features:
    -
        name: sentence
        type: text
        encoder: bert
        config_path: uncased_L-12_H-768_A-12/bert_config.json
        checkpoint_path: uncased_L-12_H-768_A-12/bert_model.ckpt
        preprocessing:
          word_tokenizer: bert
          word_vocab_file: uncased_L-12_H-768_A-12/vocab.txt
          padding_symbol: '[PAD]'
          unknown_symbol: '[UNK]'

output_features:
    -
        name: label
        type: category
text:
        word_sequence_length_limit: 128
training:
        batch_size: 16
        learning_rate: 0.00003
        epochs: 3

"""

with open("model_definition.yaml", "w") as f:
  f.write(template)

In [0]:
!ls

cola_public	     model_definition.yaml  uncased_L-12_H-768_A-12
cola_public_1.1.zip  sample_data	    uncased_L-12_H-768_A-12.zip


In [0]:
!pip install bert-tensorflow

Collecting bert-tensorflow
[?25l  Downloading https://files.pythonhosted.org/packages/a6/66/7eb4e8b6ea35b7cc54c322c816f976167a43019750279a8473d355800a93/bert_tensorflow-1.0.1-py2.py3-none-any.whl (67kB)
[K     |████▉                           | 10kB 28.2MB/s eta 0:00:01[K     |█████████▊                      | 20kB 2.8MB/s eta 0:00:01[K     |██████████████▋                 | 30kB 3.8MB/s eta 0:00:01[K     |███████████████████▍            | 40kB 4.1MB/s eta 0:00:01[K     |████████████████████████▎       | 51kB 3.3MB/s eta 0:00:01[K     |█████████████████████████████▏  | 61kB 3.8MB/s eta 0:00:01[K     |████████████████████████████████| 71kB 3.0MB/s 
Installing collected packages: bert-tensorflow
Successfully installed bert-tensorflow-1.0.1


### Train the predictive model

Sourced from https://ludwig-ai.github.io/ludwig-docs/examples/#text-classification

In [0]:
!ludwig experiment --data_csv cola_dataset.csv --model_definition_file model_definition.yaml

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

███████████████████████
█ █ █ █  ▜█ █ █ █ █   █
█ █ █ █ █ █ █ █ █ █ ███
█ █   █ █ █ █ █ █ █ ▌ █
█ █████ █ █ █ █ █ █ █ █
█     █  ▟█     █ █   █
███████████████████████
ludwig v0.2.2.7 - Experiment

Experiment name: experiment
Model name: run
Output path: results/experiment_run


ludwig_version: '0.2.2.7'
command: ('/usr/local/bin/ludwig experiment --data_csv cola_dataset.csv '
 '--model_definition_file model_definition.yaml')
random_seed: 42
input_data: 'cola_dataset.csv'
model_definition: {   'combiner': {'type': 'concat'},
    'input_features': [   {   'checkpoint_path': 'uncased_L-12_H-768_A-12/bert_model.ckpt',
             

### Evaluate the model

Sourced from https://ludwig-ai.github.io/ludwig-docs/getting_started/#programmatic-api

Using cola_public/raw/out_of_domain_dev.tsv  

In [0]:
from ludwig.api import LudwigModel

model = LudwigModel.load("results/experiment_run/model")

test_df = pd.read_csv("./cola_public/raw/out_of_domain_dev.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

#we rename Query to Questions to match what the model expects
predictions = model.predict(test_df)



The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.





Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Instructions for updating:
Use keras.layers.Dense instead.
Instructions for updating:
Please use `layer.__call__` method instead.


Instructions for updating:
Use keras.layers.dropout instead.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
INFO:tensorflow:Restoring parameters from results/experiment_run/model/model_weights


In [0]:
test_df.join(predictions)[["sentence", "label_predictions"]]

Unnamed: 0,sentence,label_predictions
0,Somebody just left - guess who.,1
1,"They claimed they had settled on something, bu...",1
2,"If Sam was going, Sally would know where.",1
3,"They're going to serve the guests something, b...",1
4,She's reading. I can't imagine what.,1
...,...,...
511,John considers Bill silly.,1
512,John considers Bill to be silly.,1
513,John bought a dog for himself to play with.,1
514,John arranged for himself to get the prize.,1


In [0]:
pred_df = test_df.join(predictions)[["sentence", "label_predictions"]]

In [0]:
pred_df.groupby("label_predictions").count()

Unnamed: 0_level_0,sentence
label_predictions,Unnamed: 1_level_1
0,92
1,424


In [0]:
#pred_df[pred_df.label_predictions != 0]
pred_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 516 entries, 0 to 515
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   sentence           516 non-null    object
 1   label_predictions  516 non-null    object
dtypes: object(2)
memory usage: 8.2+ KB


In [0]:
pred_df[pd.to_numeric(pred_df.label_predictions) == 0]


Unnamed: 0,sentence,label_predictions
11,She knew French for Tom.,0
53,"She was dancing with somebody, but I don't kno...",0
77,"I think Agnes said that Bill would speak, but ...",0
81,Who did they see someone?,0
86,The book was by John written.,0
...,...,...
489,That John is reluctant seems.,0
493,It is to give up to leave.,0
504,I presented Bill with it to read.,0
505,I gave a book to Bill to read.,0


## Convert the Web Page into Sentences to Predict

Source: https://requests.readthedocs.io/projects/requests-html/en/latest/

In [0]:
!pip install requests-html

Collecting requests-html
  Downloading https://files.pythonhosted.org/packages/24/bc/a4380f09bab3a776182578ce6b2771e57259d0d4dbce178205779abdc347/requests_html-0.10.0-py3-none-any.whl
Collecting pyppeteer>=0.0.14
[?25l  Downloading https://files.pythonhosted.org/packages/5d/4b/3c2aabdd1b91fa52aa9de6cde33b488b0592b4d48efb0ad9efbf71c49f5b/pyppeteer-0.2.2-py3-none-any.whl (145kB)
[K     |████████████████████████████████| 153kB 8.7MB/s 
Collecting fake-useragent
  Downloading https://files.pythonhosted.org/packages/d1/79/af647635d6968e2deb57a208d309f6069d31cb138066d7e821e575112a80/fake-useragent-0.1.11.tar.gz
Collecting w3lib
  Downloading https://files.pythonhosted.org/packages/a3/59/b6b14521090e7f42669cafdb84b0ab89301a42f1f1a82fcf5856661ea3a7/w3lib-1.22.0-py2.py3-none-any.whl
Collecting parse
  Downloading https://files.pythonhosted.org/packages/f4/65/220bb4075fddb09d5b3ea2c1c1fa66c1c72be9361ec187aab50fa161e576/parse-1.15.0.tar.gz
Collecting pyquery
  Downloading https://files.pythonho

In [0]:
#Please type: 

from requests_html import HTMLSession
session = HTMLSession()

#url = "https://searchengineland.com/the-dangers-of-misplaced-third-party-scripts-327329"
#now a parameter in the form above

#selector="p > a"
#now a parameter in the form above

with session.get(url) as r:

  post = r.html.find(selector)

  text = post.text



In [0]:
text

'I was recently helping one of my team members diagnose a new prospective customer site to find some low hanging fruit to share with them.\nWhen I checked their home page with our Chrome extension, I found a misplaced canonical tag. We added this type of detection a long time ago when I first encountered the issue.\nWhat is a misplaced SEO tag, you might ask?\nMost SEO tags like the title, meta description, canonical, etc. belong in the HTML HEAD. If they get placed in the HTML BODY, Google and other search engines will ignore them.\nIf you go to the Elements tab, you will find the SEO tags inside the <BODY> tag. But, these tags are supposed to be in the <HEAD>!\nWhy does something like this happen?\nIf we check the page using VIEW SOURCE, the canonical tag is placed correctly inside the HTML HEAD (line 56, while the <BODY> is in line 139.).\nWhat is happening here?!\nIs this an issue with Google Chrome?\nThe canonical is also placed in the BODY in Firefox.\nWe have the same issue with

###Splitting into Sentences
Source: https://stackoverflow.com/questions/4576077/how-can-i-split-a-text-into-sentences

In [0]:
!pip install nltk



In [0]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [0]:
print ('\n-----\n'.join(tokenizer.tokenize(text)))


I was recently helping one of my team members diagnose a new prospective customer site to find some low hanging fruit to share with them.
-----
When I checked their home page with our Chrome extension, I found a misplaced canonical tag.
-----
We added this type of detection a long time ago when I first encountered the issue.
-----
What is a misplaced SEO tag, you might ask?
-----
Most SEO tags like the title, meta description, canonical, etc.
-----
belong in the HTML HEAD.
-----
If they get placed in the HTML BODY, Google and other search engines will ignore them.
-----
If you go to the Elements tab, you will find the SEO tags inside the <BODY> tag.
-----
But, these tags are supposed to be in the <HEAD>!
-----
Why does something like this happen?
-----
If we check the page using VIEW SOURCE, the canonical tag is placed correctly inside the HTML HEAD (line 56, while the <BODY> is in line 139.).
-----
What is happening here?!
-----
Is this an issue with Google Chrome?
-----
The canonical

In [0]:
sentences = tokenizer.tokenize(text)

In [0]:
sel_df = pd.DataFrame(sentences, columns=["sentence"])

In [0]:
sel_df.head()

Unnamed: 0,sentence
0,I was recently helping one of my team members ...
1,When I checked their home page with our Chrome...
2,We added this type of detection a long time ag...
3,"What is a misplaced SEO tag, you might ask?"
4,"Most SEO tags like the title, meta description..."


### Evaluating their Grammar (Quality)

In [0]:
predictions = model.predict(sel_df)


In [0]:
pred_df = sel_df.join(predictions)[["sentence", "label_predictions"]]

In [0]:
pred_df.groupby("label_predictions").count()

Unnamed: 0_level_0,sentence
label_predictions,Unnamed: 1_level_1
0,4
1,85


In [0]:
pred_df[pd.to_numeric(pred_df.label_predictions) == 0]


Unnamed: 0,sentence,label_predictions
4,"Most SEO tags like the title, meta description...",0
39,I tested by moving the script to the BODY but ...,0
45,"In the first example, I commented out the open...",0
46,This removes it.,0
