# Naturalness Evaluator for logic

## Training the Formulaicness Model

Let's start by installing and importing the [sbert](https://www.sbert.net/) library and some other supporting packages.

In [None]:
!pip install -U sentence-transformers

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer
import pickle

### Dataset

Let's download and import the dataset, which was generated by the [dataset generator](./dataset_generator.ipynb). Rows with class 0 are dropped, leaving classes 1, 2, and 3.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/jlwbr/naturalness_eval/main/data/train.csv")
df = df[df["Class"] != 0]
df.head()

| Unnamed: 0 | Formula | Translation | Translation_Type | Class | Difference | length |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | @ x ( ~ ( Small ( x ) & Cube ( x ) ) \| RightOf... | For all x, it is not the case that x is small ... | Baseline | 1 | False | 25 |
| 1 | @ x / y ( ( Dodec ( x ) & ~ RightOf ( y , x ) ... | For all x, there is an element y such that if ... | Baseline | 1 | False | 38 |
| 2 | @ y ( / x ( Large ( x ) & Cube ( x ) & RightOf... | For all y, if there is an element x such that ... | Ranta | 2 | False | 30 |
| 5 | @ x @ y @ z ( Dodec ( x ) $ ( ~ BackOf ( y , x... | For all x, for all y, for all z, if x is a dod... | Baseline | 1 | False | 40 |
| 7 | @ x @ y / z ( BackOf ( x , y ) $ Larger ( x , ... | For all x, for all y, there is an element z su... | Baseline | 2 | True | 26 |


### Training

Now we can embed the translations, train a logistic regression classifier on the embeddings, and save the model.

In [None]:
docs = df['Translation'].tolist()
y = df['Class'].tolist()

embedding_model = SentenceTransformer('all-MiniLM-L12-v2')
embeddings = embedding_model.encode(docs)

clf = LogisticRegression(verbose=True, max_iter=1000)
clf.fit(embeddings, y)

with open('formulaicness.pkl', 'wb') as f:
    pickle.dump(clf, f)
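
Before relying on the saved file, it can be worth confirming that a pickled classifier reloads and predicts identically. The following is a minimal sketch of such a round-trip check. It substitutes random 384-dimensional vectors (the output size of `all-MiniLM-L12-v2`) for real sentence embeddings; the labels, the variable names, and the in-memory buffer are illustrative assumptions and not part of the notebook.

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumption: random vectors stand in for sentence embeddings (384 dims,
# matching all-MiniLM-L12-v2); the labels are made up for illustration.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(60, 384))
fake_labels = np.repeat([1, 2, 3], 20)

clf_sketch = LogisticRegression(max_iter=1000)
clf_sketch.fit(fake_embeddings, fake_labels)

# Serialize to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
pickle.dump(clf_sketch, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

# The restored model should reproduce the original probabilities.
same_output = np.array_equal(
    clf_sketch.predict_proba(fake_embeddings),
    restored.predict_proba(fake_embeddings),
)
print("round trip preserved predictions:", same_output)
```

Note that a pickled scikit-learn model is generally only safe to load with the same scikit-learn version that produced it, so the training and evaluation steps should run in the same environment.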

## Building the evaluator

Now let's start building the evaluator.

### Setup

First, let's download the CoLA model, which is used by GRUEN, and import all the necessary libraries:

In [None]:
!pip install --upgrade --no-cache-dir gdown
!gdown 1Hw5na_Iy4-kGEoX60bD8vXYeJDQrzyj6

!unzip cola_model.zip

!python -m spacy download en_core_web_md

In [None]:
import math
from torch.utils.data import SequentialSampler
from tqdm import tqdm
from transformers.data.processors.utils import InputExample
from transformers import glue_convert_examples_to_features
from sklearn.preprocessing import minmax_scale
import torch
from transformers import BertConfig, BertForMaskedLM, BertForSequenceClassification, BertTokenizer
import numpy as np
from typing import List
import sys
from scipy.stats import norm

### Formulaicness Model

Now we can use the model trained in the earlier step. For every sentence under evaluation, it computes the probability of each class, and the function below returns the probability of the target class.

In [None]:
# Load the classifier saved during training; the context manager closes the file.
with open("formulaicness.pkl", "rb") as f:
    model = pickle.load(f)
embedding_model = SentenceTransformer('all-MiniLM-L12-v2')

def get_bertopic_score(text, target_class = 1):
    embeddings = embedding_model.encode(text)

    preds = model.predict_proba(embeddings)

    return preds[:, target_class - 1].tolist()
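
`get_bertopic_score` selects column `target_class - 1`, which works as long as the training labels are exactly 1, 2, and 3. scikit-learn orders the columns of `predict_proba` by the classifier's `classes_` attribute, so looking the column up through `classes_` avoids that assumption. A minimal sketch, using toy two-dimensional features in place of sentence embeddings; the helper name `proba_for_class` and the toy data are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 2-D features standing in for sentence embeddings (assumption).
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0],
              [1.1, 0.9], [2.0, 2.0], [2.1, 1.9]])
y = np.array([1, 1, 2, 2, 3, 3])

toy_clf = LogisticRegression(max_iter=1000).fit(X, y)

def proba_for_class(clf, features, target_class):
    """Return P(target_class) per row, locating the column via clf.classes_."""
    column = list(clf.classes_).index(target_class)
    return clf.predict_proba(features)[:, column].tolist()

probs = proba_for_class(toy_clf, X, target_class=3)
print(probs)
```

With labels 1, 2, 3 this selects the same column as `target_class - 1`, but it would keep working if a class were missing from the training data or the labels were renumbered.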

### GRUEN Model

This is an implementation of the grammaticality score from [GRUEN](https://github.com/WanzhengZhu/GRUEN/blob/master/main.py). It is based on the original authors' implementation, with some slight differences that make the code easier to embed within the evaluator.

In [None]:
bert_base_model = BertForMaskedLM.from_pretrained("bert-base-cased")

if not isinstance(bert_base_model, BertForMaskedLM):
    raise Exception("Model is not a BertForMaskedLM")

bert_base_model.eval()

def get_lm_score(text: List[str]):
    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

    probs = []

    for sentence in tqdm(text):
        tokenized_text = tokenizer.tokenize(sentence)
        input_ids = torch.tensor(tokenizer.encode(tokenized_text)).unsqueeze(0)
        with torch.no_grad():
            loss = bert_base_model(input_ids, labels=input_ids)[0]

        probs.append(math.exp(loss))

    return probs

def load_pretrained_cola_model(model_name, saved_pretrained_CoLA_model_dir):
    config_class = BertConfig
    model_class = BertForSequenceClassification
    tokenizer_class = BertTokenizer

    config = config_class.from_pretrained(saved_pretrained_CoLA_model_dir,
                                          num_labels=2,
                                          finetuning_task='CoLA')

    tokenizer = tokenizer_class.from_pretrained(saved_pretrained_CoLA_model_dir, do_lower_case=0)

    model = model_class.from_pretrained(
        saved_pretrained_CoLA_model_dir,
        from_tf=bool('.ckpt' in model_name),
        config=config
    )

    if not isinstance(model, model_class):
        raise Exception("Model is not a BertForSequenceClassification")

    model.eval()
    return tokenizer, model

cola_tokenizer, cola_model = load_pretrained_cola_model("bert-base-cased", "cola_model/bert-base-cased")

def get_cola_score(text: List[str]):

    def load_and_cache_examples(candidates, tokenizer):
        max_length = 128
        examples = [
            InputExample(guid=str(i), text_a=x)
            for i, x in enumerate(candidates)
        ]
        features = glue_convert_examples_to_features(
            examples,
            tokenizer,
            label_list=["0", "1"],
            max_length=max_length,
            output_mode="classification")

        # Convert to Tensors and build dataset
        all_input_ids = torch.tensor([f.input_ids for f in features],
                                     dtype=torch.long)
        all_attention_mask = torch.tensor(
            [f.attention_mask for f in features], dtype=torch.long)
        all_labels = torch.tensor([0 for _ in features], dtype=torch.long)
        all_token_type_ids = torch.tensor([[0.0] * max_length
            for _ in features],
                                          dtype=torch.long)
        dataset = torch.utils.data.TensorDataset(all_input_ids,
                                                 all_attention_mask,
                                                 all_token_type_ids,
                                                 all_labels)
        return dataset

    dataset = load_and_cache_examples(text, cola_tokenizer)
    dataloader = torch.utils.data.DataLoader(
        dataset,
        sampler=SequentialSampler(dataset),
        batch_size=1
    )

    preds = None
    for batch in tqdm(dataloader):
        inputs = {
            "input_ids": batch[0],
            "attention_mask": batch[1],
            "labels": batch[3]
        }
        outputs = cola_model(**inputs)
        _, logits = outputs[:2]

        if preds is None:
            preds = logits.detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)

    if preds is None:
        return []

    return preds[:, 1].tolist()

def get_gramaticality_score(text: List[str]):
    lm_score = get_lm_score(text)
    cola_score = get_cola_score(text)

    grammaticality_score = [
        1.0 * math.exp(-0.5 * x) + 1.0 * y
        for x, y in zip(lm_score, cola_score)
    ]

    grammaticality_score = [
        max(0, x / 8.0 + 0.5) for x in grammaticality_score
    ]  # re-scale

    return grammaticality_score

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
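
To make the combination step in `get_gramaticality_score` concrete, the sketch below applies the same two formulas to hand-picked numbers. The inputs are made up for illustration; in the notebook they come from `get_lm_score` (the exponentiated BERT loss) and `get_cola_score` (the CoLA logit at index 1). The function name `combine` is hypothetical.

```python
import math

def combine(lm_score, cola_score):
    """Mirror the two steps in get_gramaticality_score for a single sentence."""
    raw = math.exp(-0.5 * lm_score) + cola_score  # LM term lies in (0, 1]
    return max(0, raw / 8.0 + 0.5)                # re-scale, floor at 0

# Hypothetical inputs: (exp(loss) from the LM, CoLA logit).
for lm, cola in [(1.0, 3.0), (20.0, 0.0), (50.0, -6.0)]:
    print(lm, cola, round(combine(lm, cola), 4))
```

Because the language-model term is bounded by 1, the CoLA logit dominates the result. The re-scaling has a floor at 0 but no ceiling, so a raw sum above 4 yields a value above 1.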


### Scoring function

This function takes a list of utterances and a target class, and calculates a naturalness score for each utterance. The score is the mean of the grammaticality score and the probability that the formulaicness (bertopic) model assigns to the target class, clipped to the range [0, 1].

In [None]:
def get_score(text: List[str], target_class=1, verbose=False):
    bertopic_score = get_bertopic_score(text, target_class)
    grammaticality_score = get_gramaticality_score(text)

    if verbose:
        print("text: ", text)
        print("target_class: ", target_class)
        print("")
        print("grammaticality_score: ", grammaticality_score)
        print("bertopic_score: ", bertopic_score)

    score = [
        min(1, max(0, (x + y) / 2))
        for x, y in zip(grammaticality_score, bertopic_score)
    ]

    return score

This is simple test code: enter a sentence and get a naturalness score back.

In [None]:
target_class = 3

text = input("Enter a sentence: ")

score = get_score([text], int(target_class), verbose=True)
print("score: ", score[0])

Enter a sentence: Forall x, if x is blue, x is a ball.


100%|██████████| 1/1 [00:00<00:00,  6.17it/s]
100%|██████████| 1/1 [00:00<00:00,  1.63it/s]

text:  ['Forall x, if x is blue, x is a ball.']
target_class:  3

grammaticality_score:  [0.6887161618191097]
bertopic_score:  [0.33093103845203486]
score:  0.5098236001355723
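
The sample run above can be checked by hand, since the final score is just the clipped mean of the two printed component scores. A small sketch of that arithmetic, with the hypothetical helper name `combine_scores`:

```python
def combine_scores(grammaticality, formulaicness):
    """Average the two scores and clip the result to [0, 1], as get_score does."""
    return min(1, max(0, (grammaticality + formulaicness) / 2))

# Component scores printed in the sample run above.
score = combine_scores(0.6887161618191097, 0.33093103845203486)
print(score)
```

The result agrees with the `score` line in the output, which confirms that both components carry equal weight.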





## Testing

First, let's import the test set.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/jlwbr/naturalness_eval/main/data/test.csv")
df = df[df["Class"] != 0]

X_test = df["Translation"].to_numpy()
y_test = df["Class"].to_numpy()

df.head()

| Unnamed: 0 | Formula | Translation | Translation_Type | Class | Difference | length |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | @ y ( Cube ( y ) $ ( ~ SameCol ( a , y ) & ~ S... | For all cubes y, it is not the case that a or ... | LoLa | 2 | False | 20 |
| 1 | ∀x (Snack(x) → (((Salty(x) ∧ ¬Sweet(x)) ∨ (¬Sa... | A snack is either salty or sweet, but not both... | MALLS | 3 | False | 16 |
| 2 | @ z ( Dodec ( z ) $ ~ / y BackOf ( y , z ) ) | For all z, if z is a dodecahedron, then it is ... | Baseline | 1 | False | 28 |
| 3 | ∀x (Person(x) ∧ EducatesStudents(x) → Teacher(x)) | A person is considered a teacher if they educa... | MALLS | 3 | False | 10 |
| 4 | @ x ( ( Tet ( x ) & ~ / y FrontOf ( y , x ) ) ... | For all x, if nothing is in front of x, then x... | LoLa | 2 | False | 22 |


### Bertopic model

Using scikit-learn's [classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report) function, we can check the main classification metrics of the formulaicness model on the test set.

In [None]:
from sklearn.metrics import classification_report

embeddings = embedding_model.encode(X_test)
y_pred = model.predict(embeddings)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.92      0.92      0.92       200
           2       0.92      0.93      0.92       200
           3       1.00      0.99      1.00       200

    accuracy                           0.95       600
   macro avg       0.95      0.95      0.95       600
weighted avg       0.95      0.95      0.95       600
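
For readers unfamiliar with the columns in the report, the sketch below computes per-class precision, recall, and F1 from scratch on a tiny label set. The labels are made up for illustration and are unrelated to the test set; the function name `per_class_metrics` is hypothetical, and the code is a simplified stand-in for what `classification_report` reports per class.

```python
def per_class_metrics(y_true, y_pred, label):
    """Compute precision, recall, and F1 for one class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels for illustration only (not taken from the test set).
y_true_toy = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred_toy = [1, 1, 2, 2, 2, 1, 3, 3, 3]

for label in (1, 2, 3):
    p, r, f = per_class_metrics(y_true_toy, y_pred_toy, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy data, classes 1 and 2 are confused with each other while class 3 is always predicted correctly, which mirrors the pattern in the report above.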

