# Correction of confused adjectives and adverbs

## Introduction

Non-native speakers often find it difficult *(or is it difficultly?)* to learn the proper usage of adverbs and adjectives in English:
* _Do I speak English **fluent** or **fluently**?_
* _Why do I look **nice** but talk **nicely**?_
* _Why is it that my car both is **fast** and goes **fast**?_
* _Why can you both **remote control** and **remotely control** something?_

In this project, we will develop a simple classifier that decides whether an adjective or an adverb is needed in a certain context.

## How do we change adjectives to adverbs and vice versa?

In English, most adverbs are formed from adjectives by adding "-ly": free => free**ly**.

However, there are exceptions, both in spelling and in meaning:
- _responsib**le** => responsib**ly**_
- _angr**y** => angr**ily**_
- _idiot**ic** => idiot**ically**_
- _full => full**y**_
- _ugly => in an ugly way?_
- _**good** => **well**_
- _**hard** => **hard**; **hardly** has a different meaning_
- _**state-of-the-art** => **?**_

In [1]:
import time
import random
import json
import en_core_web_md
from spacy import displacy
from spacy.tokens import Doc
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [2]:
# Read all adjectives and adverbs that are present in English Wiktionary
# https://en.wiktionary.org/wiki/Wiktionary:Main_Page

with open("data/adjectives.txt", "r") as f:
    ADJ = set(line.strip() for line in f.readlines())

with open("data/adverbs.txt", "r") as f:
    ADV = set(line.strip() for line in f.readlines())
    
print("Total number of English adjectives:", len(ADJ))
print("Total number of English adverbs:", len(ADV))

Total number of English adjectives: 61768
Total number of English adverbs: 10791


In [3]:
# Learn to transform adjectives to adverbs

def transform_adj_to_adv(adjective):
    """
    Convert an adjective to the corresponding adverb.
    :param adjective: string (adjective)
    :return: string (adverb) or None
    """

    # the adjective already ends in -ly (friendly, ugly, monthly)
    # OR the -ly form has a different meaning (hard => hardly)
    if adjective.endswith("ly") or adjective in ["hard", "bare", "on"]:
        return None

    # exceptions
    elif adjective == "good":
        return "well"
    elif adjective in ["whole", "true"]:
        return adjective[:-1] + "ly"

    # responsible => responsibly
    elif adjective.endswith("le") and adjective != "sole":
        adverb = adjective[:-1] + "y"
    # angry => angrily
    elif adjective.endswith("y") and adjective != "shy":
        adverb = adjective[:-1] + "ily"
    # idiotic => idiotically
    elif adjective.endswith("ic"):
        adverb = adjective + "ally"
    # full => fully
    elif adjective.endswith("ll"):
        adverb = adjective + "y"
    # free => freely
    else:
        adverb = adjective + "ly"

    # check for validity
    return adverb if adverb in ADV else None

In [4]:
# random.sample() no longer accepts a set (Python 3.11+), so sample from a sorted list
for word in random.sample(sorted(ADJ), 20):
    print("{:18} => {}".format(word, transform_adj_to_adv(word)))

cobaltiferous      => None
touched            => None
semicylindrical    => None
carcinomatous      => None
antigraft          => None
autocritical       => None
claimable          => None
crotchless         => None
frigid             => frigidly
podzolic           => None
choreographed      => None
pandeistic         => pandeistically
nonlaying          => None
tubuloalveolar     => None
shutterless        => None
boardlike          => None
angleless          => None
annotinous         => None
tarnishable        => None
extrathoracic      => None


In [5]:
# Create dictionaries for adjective-adverb transformation

adj_to_adv, adv_to_adj = dict(), dict()

for adj in ADJ:
    adv = transform_adj_to_adv(adj)
    if adv and adv != adj:
        adj_to_adv[adj] = adv
        adv_to_adj[adv] = adj

print("Number of adjectives that can be transformed to adverbs (and vice versa):",
      len(adj_to_adv))

Number of adjectives that can be transformed to adverbs (and vice versa): 8463


## What features distinguish adjectives from adverbs?

Hypotheses:
- left and right context
- type of relation to the head
- dependents (if there are any)
- the word itself
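
As a warm-up, the first and last hypotheses can be sketched without a parser. The snippet below is a minimal illustration on a plain list of tokens; the function name `context_features` and the `"word"` key are assumptions made for this example and are not part of the notebook. The head relation and the dependents need a dependency parse, so they are left out here.

```python
def context_features(tokens, ind):
    """Collect the left neighbour, right neighbour and the word itself
    for tokens[ind]. Sentence boundaries are marked with <S> and </S>."""
    return {
        "w-1": tokens[ind - 1] if ind > 0 else "<S>",
        "w+1": tokens[ind + 1] if ind < len(tokens) - 1 else "</S>",
        "word": tokens[ind].lower(),
    }


tokens = ["You", "speak", "fluent", "."]
feats = context_features(tokens, 2)
print(feats)
```

The boundary markers keep the feature dictionary the same shape for the first and last token of a sentence.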

In [6]:
# Load spaCy models

start = time.time()
nlp = en_core_web_md.load(disable=['ner'])
print("Models loaded in", round(time.time() - start), "seconds.")

Models loaded in 26 seconds.


In [7]:
# Parse sentences with adjective and adverb

# sentence = nlp("The soup smells good.")
# sentence = nlp("He smells the soup carefully.")
sentence = nlp("She was completely natural and unaffected by the attention.")

print("Parts of speech:")
print(" ".join("{}_{}".format(token.text, token.tag_) for token in sentence))
displacy.render(sentence, style='dep', options={"collapse_punct": False, "distance": 110}, jupyter=True)

Parts of speech:
She_PRP was_VBD completely_RB natural_JJ and_CC unaffected_JJ by_IN the_DT attention_NN ._.


In [8]:
# Collect features

def feature_extractor(sentence, ind):
    """
    Collect features for the INDth token in SENTENCE.
    
    :param sentence: Doc, a parsed sentence
    :param ind: the index of the token
    :return: a feature dictionary
    """
    token = sentence[ind]
    features = dict()
    # context
    features["w-1"] = sentence[ind-1].text if ind > 0 else "<S>"
    features["w+1"] = sentence[ind+1].text if ind < (len(sentence) - 1) else "</S>"
    # children
    for child in token.children:
        features[child.dep_] = child.text
    # if we collect features for an adjective
    if token.tag_ == "JJ" and token.text in adj_to_adv:
        features["adj"] = token.text
        features["adv"] = adj_to_adv[token.text]
        features["adj_head"] = token.dep_ + "_" + token.head.lemma_
        alt_sentence = nlp(" ".join([t.text for t in sentence[:ind]]
                                    + [features["adv"]] +
                                    [t.text for t in sentence[ind + 1:]]))
        features["adv_head"] = alt_sentence[ind].dep_ + "_" + \
                               alt_sentence[ind].head.lemma_
    # if we collect features for an adverb
    elif token.tag_ == "RB" and token.text in adv_to_adj:
        features["adv"] = token.text
        features["adj"] = adv_to_adj[token.text]
        features["adv_head"] = token.dep_ + "_" + token.head.lemma_
        alt_sentence = nlp(" ".join([t.text for t in sentence[:ind]]
                                    + [features["adj"]] +
                                    [t.text for t in sentence[ind + 1:]]))
        features["adj_head"] = alt_sentence[ind].dep_ + "_" + \
                               alt_sentence[ind].head.lemma_
    else:
        # the input data may be noisy
        return None
    return features


In [9]:
# Collect features for sample sentences

corpus = ["The soup smells good.",
          "He smells the soup carefully.",
          "She was completely natural and unaffected by the attention."]

data, labels = [], []
for sentence in corpus:
    sentence = nlp(sentence)
    for token in sentence:
        if token.tag_ in ["JJ", "RB"] and token.head.tag_.startswith("VB"):
            features = feature_extractor(sentence, token.i)
            data.append(features)
            labels.append(token.pos_)
            print("Word in question:", token.text)
            print(features)
            print("Label:", token.pos_)
            print("")

Word in question: good
{'w-1': 'smells', 'w+1': '.', 'adj': 'good', 'adv': 'well', 'adj_head': 'acomp_smell', 'adv_head': 'advmod_smell'}
Label: ADJ

Word in question: carefully
{'w-1': 'soup', 'w+1': '.', 'adv': 'carefully', 'adj': 'careful', 'adv_head': 'advmod_smell', 'adj_head': 'oprd_smell'}
Label: ADV

Word in question: natural
{'w-1': 'completely', 'w+1': 'and', 'advmod': 'completely', 'cc': 'and', 'conj': 'unaffected', 'adj': 'natural', 'adv': 'naturally', 'adj_head': 'acomp_be', 'adv_head': 'advmod_be'}
Label: ADJ



In [10]:
# Vectorize features for sample sentences

vec = DictVectorizer()
vectorized_feats = vec.fit_transform(data)

In [11]:
# The full feature set
# (scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there)

print("All features:")
print(vec.get_feature_names())
print("\nTotal number of features: ", len(vec.get_feature_names()))

All features:
['adj=careful', 'adj=good', 'adj=natural', 'adj_head=acomp_be', 'adj_head=acomp_smell', 'adj_head=oprd_smell', 'adv=carefully', 'adv=naturally', 'adv=well', 'adv_head=advmod_be', 'adv_head=advmod_smell', 'advmod=completely', 'cc=and', 'conj=unaffected', 'w+1=.', 'w+1=and', 'w-1=completely', 'w-1=smells', 'w-1=soup']

Total number of features:  19


In [12]:
# The resulting sparse matrix

print("The resulting sparse matrix:")
print(vectorized_feats.toarray())

The resulting sparse matrix:
[[0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1.]
 [0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 0.]]
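
To see where the columns come from, here is a rough pure-Python sketch of one-hot encoding for string-valued features. It mimics the `feature=value` naming and the sorted column order shown above, but it is an illustration only, not scikit-learn's implementation; the helper `one_hot` is a hypothetical name.

```python
def one_hot(dicts):
    """Turn a list of {feature: string_value} dicts into column names
    and dense 0/1 rows, one column per distinct feature=value pair."""
    names = sorted({"{}={}".format(k, v) for d in dicts for k, v in d.items()})
    index = {name: i for i, name in enumerate(names)}
    rows = []
    for d in dicts:
        row = [0.0] * len(names)
        for k, v in d.items():
            row[index["{}={}".format(k, v)]] = 1.0
        rows.append(row)
    return names, rows


samples = [{"w-1": "smells", "w+1": "."},
           {"w-1": "soup", "w+1": "."}]
names, rows = one_hot(samples)
print(names)
for row in rows:
    print(row)
```

Both samples share the `w+1=.` column, while each gets its own `w-1=...` column, which is why the matrix grows with the vocabulary.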


## Where do we get data?

Possible sources of data are:
- learner corpora (e.g., [lang-8](http://cl.naist.jp/nldata/lang-8/), [NUCLE](http://www.comp.nus.edu.sg/~nlp/conll14st.html))
- data annotated through a crowdsourcing platform (e.g., [MTurk](https://www.mturk.com/)) or by linguists (e.g., [Appen](https://appen.com/))
- grammatically correct data

Suppose we don't have any money or time (which we don't :trollface:). We will therefore use a corpus of allegedly correct English: [The Blog Authorship Corpus](http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm). Each adjective or adverb found there is treated as a correct example of its class.

In [13]:
with open("data/adj_vs_adv_data.json", "r") as f:
    data = json.load(f)

print(data[20])

{'sentence': ['that', 'was', 'huge', 'and', 'really', 'pretty', '.'], 'label': 'ADJ', 'ind': 2}
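
For intuition, samples in this format could be produced from tagged text roughly as follows. This is a hypothetical sketch, not the actual extraction script: the function `make_samples` and the hand-assigned tags are assumptions, and the sketch ignores any syntactic filtering (such as requiring a verbal head).

```python
def make_samples(tagged_sentence):
    """Build one sample per adjective (JJ) or adverb (RB) in a sentence
    given as a list of (word, Penn Treebank tag) pairs."""
    label_for_tag = {"JJ": "ADJ", "RB": "ADV"}
    words = [word for word, _ in tagged_sentence]
    return [{"sentence": words, "label": label_for_tag[tag], "ind": i}
            for i, (_, tag) in enumerate(tagged_sentence)
            if tag in label_for_tag]


# Tags below are assigned by hand for illustration.
tagged = [("that", "DT"), ("was", "VBD"), ("huge", "JJ"), ("and", "CC"),
          ("really", "RB"), ("pretty", "JJ"), (".", ".")]
samples = make_samples(tagged)
for sample in samples:
    print(sample["ind"], sample["sentence"][sample["ind"]], sample["label"])
```

Because the text is assumed to be correct, the observed part of speech doubles as the gold label and no manual annotation is needed.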


In [14]:
print("There are {} samples in the data set.".format(len(data)))
print("{} for ADJ and {} for ADV.".format(len([i for i in data if i["label"] == "ADJ"]),
                                          len([i for i in data if i["label"] == "ADV"])))

There are 20000 samples in the data set.
10000 for ADJ and 10000 for ADV.


In [15]:
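# Since the data is already tokenized, it's better to use a custom whitespace tokenizer

class WordTokenizer(object):
    """
    Custom tokenizer that, by default, splits the text on single spaces,
    so token indices in the parsed Doc match the indices stored in the data set.
    """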
    def __init__(self, vocab=nlp.vocab, tokenizer=None, return_doc=True):
        self.vocab = vocab
        self._word_tokenizer = tokenizer
        self.return_doc = return_doc

    def __call__(self, text):
        if self._word_tokenizer:
            words = self._word_tokenizer.tokenize(text)
        else:
            words = text.split(' ')
        if self.return_doc:
            spaces = [True] * len(words)
            return Doc(self.vocab, words=words, spaces=spaces)
        else:
            return words

nlp.tokenizer = WordTokenizer(nlp.vocab)
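
The reason this matters is index alignment: each sample stores the position of the target word, so re-tokenizing the joined sentence must give back exactly the same tokens. The sketch below illustrates the problem in plain Python. The regex splitter is a deliberately crude, hypothetical stand-in for "a tokenizer with its own rules"; it is not how spaCy tokenizes.

```python
import re

tokens = ["I", "do", "n't", "want", "TeamViewer", "to",
          "remote", "control", "my", "computer", "."]
ind = 6  # position of "remote" in the pre-tokenized sentence

text = " ".join(tokens)

# Splitting on single spaces restores the original tokens exactly.
space_split = text.split(" ")
print(space_split[ind])

# Stand-in tokenizer (an assumption for illustration): it breaks "n't"
# into three pieces, so every later index shifts by two.
regex_split = re.findall(r"\w+|[^\w\s]", text)
print(regex_split[ind])
```

With the space-based split the stored index still points at "remote"; with the stand-in tokenizer the same index lands on a different word.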

In [16]:
# Collect features and labels from our data set

start = time.time()

x, y = [], []
for sample in data:
    sentence = nlp(" ".join(sample["sentence"]))
    features = feature_extractor(sentence, sample["ind"])
    if features:
        x.append(features)
        y.append(sample["label"])

print("Features extracted in", round((time.time() - start) / 60), "minutes.")

Features extracted in 7 minutes.


In [17]:
# Split the data

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
                                                    random_state=42)

In [18]:
# Check the balance

train_adj = len([i for i in y_train if i == "ADJ"])
test_adj = len([i for i in y_test if i == "ADJ"])
print("The ratio of ADJ to ADV in the train data is {}.".format(round(train_adj / (len(y_train) - train_adj), 2)))
print("The ratio of ADJ to ADV in the test data is {}.".format(round(test_adj / (len(y_test) - test_adj), 2)))

The ratio of ADJ to ADV in the train data is 1.0.
The ratio of ADJ to ADV in the test data is 0.97.


In [19]:
# Vectorize the data

vectorizer = DictVectorizer()
x_train_vect = vectorizer.fit_transform(x_train)
print("\nTotal number of features: ", len(vectorizer.get_feature_names()))


Total number of features:  13685


In [20]:
# Train a classifier

lrc = LogisticRegression(random_state=42, max_iter=500, solver='saga')
lrc.fit(x_train_vect, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=500, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solver='saga',
          tol=0.0001, verbose=0, warm_start=False)

In [21]:
predicted = lrc.predict(vectorizer.transform(x_test))
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

         ADJ       0.96      0.95      0.95      1968
         ADV       0.95      0.96      0.95      2019

   micro avg       0.95      0.95      0.95      3987
   macro avg       0.95      0.95      0.95      3987
weighted avg       0.95      0.95      0.95      3987
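
As a reminder of what the columns mean, precision, recall and F1 for one class can be computed by hand. The sketch below uses a made-up toy prediction list, not the notebook's data, and the helper name `precision_recall_f1` is an assumption for this example.

```python
def precision_recall_f1(gold, predicted, label):
    """Per-class precision, recall and F1 from two parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels for illustration only.
gold = ["ADJ", "ADJ", "ADJ", "ADV", "ADV", "ADV"]
pred = ["ADJ", "ADJ", "ADV", "ADV", "ADV", "ADJ"]
p, r, f = precision_recall_f1(gold, pred, "ADJ")
print(round(p, 2), round(r, 2), round(f, 2))
```

Here two of the three predicted ADJ labels are right (precision 2/3) and two of the three gold ADJ labels are found (recall 2/3).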



In [22]:
# Use the classifier to detect incorrect usage of adjectives with verbs

def is_adj_correct(raw_sentence):
    sentence = nlp(raw_sentence)
    print("Input:", raw_sentence)
    for ind in range(len(sentence)):
        token = sentence[ind]
        if token.tag_ == "JJ" and token.head.tag_.startswith("VB"):
            features = feature_extractor(sentence, ind)
            if not features:
                print("No errors found.")
                return
            # predict() returns an array with one label per input row; take the only one
            predicted_pos = lrc.predict(vectorizer.transform(features))[0]
            if predicted_pos == "ADJ":
                print("No errors found.")
                return
            else:
                print(" ".join([t.text for t in sentence[:ind]]
                                + ["{" + sentence[ind].text + "=>" + adj_to_adv[sentence[ind].text] + "}"] +
                                [t.text for t in sentence[ind+1:]]))
                return
    print("No errors found.")
    return

In [23]:
is_adj_correct("You talk nice .")
print()
is_adj_correct("You look nice .")
print()
is_adj_correct("You speak fluent .")
print()
is_adj_correct("I do n't want TeamViewer to remote control my computer .")
print()
is_adj_correct("You have successful completed the project .")
print()
is_adj_correct("I am busy talking to my friend .")
print()
is_adj_correct("I am emotional talking to my friend .")
print()
is_adj_correct("The soup smells good .")
print()
is_adj_correct("He smells the soup careful .")
print()
is_adj_correct("She was completely natural and unaffected by the attention .")
print()
is_adj_correct("She was complete natural and unaffected by the attention .")

Input: You talk nice .
You talk {nice=>nicely} .

Input: You look nice .
No errors found.

Input: You speak fluent .
You speak {fluent=>fluently} .

Input: I do n't want TeamViewer to remote control my computer .
I do n't want TeamViewer to {remote=>remotely} control my computer .

Input: You have successful completed the project .
You have {successful=>successfully} completed the project .

Input: I am busy talking to my friend .
No errors found.

Input: I am emotional talking to my friend .
I am {emotional=>emotionally} talking to my friend .

Input: The soup smells good .
No errors found.

Input: He smells the soup careful .
He smells the soup {careful=>carefully} .

Input: She was completely natural and unaffected by the attention .
No errors found.

Input: She was complete natural and unaffected by the attention .
She was {complete=>completely} natural and unaffected by the attention .


## Check top features

In [24]:
import numpy as np

feature_names = vectorizer.get_feature_names()
top_features = np.argsort(lrc.coef_[0])

print("Features that correlate with ADJ:")
print(" ".join(feature_names[i] for i in top_features[:25]))
print()

print("Features that correlate with ADV:")
print(" ".join(feature_names[i] for i in top_features[::-1][:25]))

Features that correlate with ADJ:
det=the adv_head=acomp_be prep=about nsubj=it adj_head=acomp_become w+1=! prep=with adj_head=acomp_be adj_head=oprd_keep adj_head=oprd_seem prep=in w+1=or prep=of punct=, prep=to prep=for w-1=The adj_head=ccomp_make w+1=to w+1=as adj_head=amod_look w+1=and adj_head=acomp_sound prep=at w-1=make

Features that correlate with ADV:
adj=actual adv=actually adj=probable adv=probably adj=real adv=really adv=finally adj=final w+1=a adj=definite adv=definitely adv=especially adj=especial adv=currently adj=current w-1=I adj=exact adv=exactly adj=recent adv=recently adj=basic adv=basically adv=certifiably adj=certifiable adv=slightly
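
The ranking relies on the sign convention of a binary logistic regression: positive weights push towards the second class in alphabetical order (here ADV), negative weights towards the first (ADJ). A small sketch with invented weights (the numbers are made up for illustration and are not the trained coefficients):

```python
# Hypothetical feature weights, NOT taken from the trained model.
weights = {
    "det=the": -2.4,
    "adj_head=acomp_be": -1.8,
    "w+1=a": 0.9,
    "adv=really": 2.1,
}

# Sort feature names from most negative to most positive weight.
ranked = sorted(weights, key=weights.get)

top_adj = ranked[:2]        # most negative weights point to ADJ
top_adv = ranked[::-1][:2]  # most positive weights point to ADV
print("ADJ:", top_adj)
print("ADV:", top_adv)
```

This mirrors the `np.argsort` call above: the head of the sorted list holds the strongest ADJ indicators and the tail holds the strongest ADV indicators.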


### Your turn
* Split the data into train/dev/test and analyze the examples where the classifier fails. Could you add or change features to improve the quality?
* Add a function to correct adverbs to adjectives.
* Experiment with other classifiers and parameters.
* Check if using a larger spaCy model (en_core_web_lg) improves the parse quality and the quality of the classifier.
* Test the solution on a corpus of correct texts to measure the false-positive (FP) rate.
* Test the classifier on one of the available error correction corpora (e.g., NUCLE) to measure precision and recall.
* Tweak the [data extraction script](aux/prepare_data.py) to collect adjectives and adverbs in other syntactic contexts and build a better solution. Note the size and balance of data.
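
As a possible starting point for the first exercise, a three-way split can be sketched in plain Python. The function name, the 80/10/10 proportions and the seed are assumptions for illustration; a stratified split may be preferable if the classes are unbalanced.

```python
import random


def train_dev_test_split(items, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle the items reproducibly and cut them into train/dev/test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_dev = int(len(items) * dev_frac)
    n_test = int(len(items) * test_frac)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test


train, dev, test = train_dev_test_split(range(100))
print(len(train), len(dev), len(test))
```

Tune features and hyperparameters on the dev part only, and keep the test part for the final evaluation.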