# Correction of confused adjectives and adverbs

## Introduction

Non-native speakers often find it difficult *(or is it difficultly?)* to learn the proper usage of adverbs and adjectives in English:
* _Do I speak English **fluent** or **fluently**?_
* _Why do I look **nice** but talk **nicely**?_
* _Why is it that my car both is **fast** and goes **fast**?_
* _Why can you both **remote control** and **remotely control** something?_

In this project, we will develop a simple classifier that decides whether an adjective or an adverb is needed in a certain context.

## How do we change adjectives to adverbs and vice versa?

In English, most adverbs are formed from adjectives by adding "-ly": free => free**ly**.

However, the rule has exceptions:
- _responsib**le** => responsib**ly**_
- _angr**y** => angr**ily**_
- _idiot**ic** => idiot**ically**_
- _full => full**y**_
- _ugly => in an ugly way?_
- _**good** => **well**_
- _**hard** => **hard**; **hardly** has a different meaning_
- _**state-of-the-art** => **?**_

In [1]:
import time
import random
import json
import en_core_web_md
from spacy import displacy
from spacy.tokens import Doc
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

In [2]:
# Read all adjectives and adverbs that are present in English Wiktionary
# https://en.wiktionary.org/wiki/Wiktionary:Main_Page

with open("data/adjectives.txt", "r") as f:
    ADJ = set(line.strip() for line in f.readlines())

with open("data/adverbs.txt", "r") as f:
    ADV = set(line.strip() for line in f.readlines())
    
print("Total number of adjectives:", len(ADJ))
print("Total number of adverbs:", len(ADV))

Total number of adjectives: 61768
Total number of adverbs: 10791


In [21]:
# Learn to transform adjectives to adverbs

def transform_adj_to_adv(adjective):
    """
    Convert an adjective to the corresponding adverb.
    :param adjective: string (adjective)
    :return: string (adverb) or None
    """

    # friendly
    if adjective.endswith("ly"):
        return None
    # hard
    if adjective in ADV:
        return adjective

    # exceptions
    elif adjective == "good":
        return "well"
    elif adjective in ["whole", "true"]:
        return adjective[:-1] + "ly"

    # responsible => responsibly
    elif adjective.endswith("le") and adjective != "sole":
        adverb = adjective[:-1] + "y"
    # angry => angrily
    elif adjective.endswith("y") and adjective != "shy":
        adverb = adjective[:-1] + "ily"
    # idiotic => idiotically
    elif adjective.endswith("ic"):
        adverb = adjective + "ally"
    # full => fully
    elif adjective.endswith("ll"):
        adverb = adjective + "y"
    # free => freely
    else:
        adverb = adjective + "ly"

    # check for validity
    return adverb if adverb in ADV else None

# random.sample() rejects sets in Python 3.11+, so draw from a sorted list
for word in random.sample(sorted(ADJ), 20):
    print("{:18} => {}".format(word, transform_adj_to_adv(word)))

sheep-headed       => None
bugproof           => None
exigible           => None
zygomycotic        => None
hysteretical       => hysteretically
slovenian          => None
representative     => representatively
underfurnished     => None
rampartlike        => None
geyserlike         => None
boilerplate        => None
ekphrastic         => None
canescent          => None
nonfungicidal      => None
orbitosphenoid     => None
inefficacious      => inefficaciously
herolike           => None
immunodominant     => None
ununitable         => None
multicritical      => None
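
Random samples mostly hit rare adjectives with no attested adverb, so the rules themselves are hard to see in the output above. The sketch below applies the same suffix rules to the exceptions listed earlier. It is a simplified stand-in for `transform_adj_to_adv`, not a replacement: `TOY_ADV` is a hypothetical hand-picked set used instead of the Wiktionary list, `to_adverb`, `IRREGULAR` and `SUFFIX_RULES` are illustrative names, and the special cases for "sole" and "shy" are left out for brevity.

```python
# Toy stand-in for the Wiktionary adverb list (an assumption for illustration).
TOY_ADV = {"freely", "responsibly", "angrily", "idiotically", "fully",
           "well", "hard", "hardly", "truly", "wholly"}

# Irregular pairs that no suffix rule covers.
IRREGULAR = {"good": "well", "whole": "wholly", "true": "truly"}

# (suffix, number of characters to strip, ending to add), checked in order.
SUFFIX_RULES = [
    ("le", 1, "y"),     # responsible => responsibly
    ("y", 1, "ily"),    # angry => angrily
    ("ic", 0, "ally"),  # idiotic => idiotically
    ("ll", 0, "y"),     # full => fully
]


def to_adverb(adjective, adverbs=TOY_ADV):
    """Return the adverb for ADJECTIVE, or None if none is attested."""
    if adjective.endswith("ly"):   # ugly, friendly: no one-word adverb
        return None
    if adjective in adverbs:       # hard: same form for both
        return adjective
    if adjective in IRREGULAR:
        return IRREGULAR[adjective]
    candidate = adjective + "ly"   # default rule: free => freely
    for suffix, strip, ending in SUFFIX_RULES:
        if adjective.endswith(suffix):
            candidate = adjective[:len(adjective) - strip] + ending
            break
    # keep only candidates that are attested adverbs
    return candidate if candidate in adverbs else None


for word in ["free", "responsible", "angry", "idiotic", "full",
             "good", "hard", "ugly", "state-of-the-art"]:
    print("{:18} => {}".format(word, to_adverb(word)))
```

Keeping the rules in a table makes their order explicit: the first matching suffix wins, and anything that matches no rule falls back to plain "-ly".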


In [4]:
# Create dictionaries for adjective-adverb transformation

adj_to_adv, adv_to_adj = dict(), dict()

for adj in ADJ:
    adv = transform_adj_to_adv(adj)
    if adv and adv != adj:
        adj_to_adv[adj] = adv
        adv_to_adj[adv] = adj

print("Total number of adjectives:", len(ADJ))
print("Total number of adverbs:", len(ADV))
print("Number of adjectives that can be transformed to adverbs:",
      len(adj_to_adv))

Total number of adjectives: 61768
Total number of adverbs: 10791
Number of adjectives that can be transformed to adverbs: 8330


## What features distinguish adjectives from adverbs?

Hypotheses:
- left and right context
- type of relation to the head
- dependents (if there are any)
- the word itself

In [5]:
# Load spaCy models

start = time.time()
nlp = en_core_web_md.load(disable=['ner'])
print("Models loaded in", round(time.time() - start), "seconds.")

Models loaded in 18 seconds.


In [25]:
# Parse sentences with adjective and adverb

# sentence = nlp("The soup smells good.")
# print("Parts of speech:")
# print(" ".join("{}_{}".format(token.text, token.tag_) for token in sentence))
# displacy.render(sentence, style='dep', options={"collapse_punct": False, "distance": 110}, jupyter=True)

# sentence = nlp("He smells the hot soup carefully.")
# print("Parts of speech:")
# print(" ".join("{}_{}".format(token.text, token.tag_) for token in sentence))
# displacy.render(sentence, style='dep', options={"collapse_punct": False, "distance": 110}, jupyter=True)

# sentence = nlp("Mary naturally and quickly became part of our family.")
# print("Parts of speech:")
# print(" ".join("{}_{}".format(token.text, token.tag_) for token in sentence))
# displacy.render(sentence, style='dep', options={"collapse_punct": False, "distance": 110}, jupyter=True)

sentence = nlp("She was completely natural and unaffected by the attention.")
print("Parts of speech:")
print(" ".join("{}_{}".format(token.text, token.tag_) for token in sentence))
displacy.render(sentence, style='dep', options={"collapse_punct": False, "distance": 110}, jupyter=True)

Parts of speech:
She_PRP was_VBD completely_RB natural_JJ and_CC unaffected_JJ by_IN the_DT attention_NN ._.


In [9]:
# Collect features

def feature_extractor(sentence, ind):
    """
    Collect features for the INDth token in SENTENCE.
    
    :param sentence: Doc, a parsed sentence
    :param ind: the index of the token
    :return: a feature dictionary
    """
    token = sentence[ind]
    features = dict()
    # context
    features["w-1"] = sentence[ind-1].text if ind > 0 else "NONE"
    features["w+1"] = sentence[ind+1].text if ind < (len(sentence) - 1) else "NONE"
    # children
    for child in token.children:
        features[child.dep_] = child.text
    # if we collect features for an adjective
    if token.tag_ == "JJ" and token.text in adj_to_adv:
        features["adj"] = token.text
        features["adv"] = adj_to_adv[token.text]
        features["adj_head"] = token.dep_ + "_" + token.head.lemma_
        alt_sentence = nlp(" ".join([t.text for t in sentence[:ind]]
                                    + [features["adv"]] +
                                    [t.text for t in sentence[ind + 1:]]))
        features["adv_head"] = alt_sentence[ind].dep_ + "_" + \
                               alt_sentence[ind].head.lemma_
    # if we collect features for an adverb
    elif token.tag_ == "RB" and token.text in adv_to_adj:
        features["adv"] = token.text
        features["adj"] = adv_to_adj[token.text]
        features["adv_head"] = token.dep_ + "_" + token.head.lemma_
        alt_sentence = nlp(" ".join([t.text for t in sentence[:ind]]
                                    + [features["adj"]] +
                                    [t.text for t in sentence[ind + 1:]]))
        features["adj_head"] = alt_sentence[ind].dep_ + "_" + \
                               alt_sentence[ind].head.lemma_
    else:
        return None
    return features


In [10]:
# Collect features for sample sentences

corpus = ["The soup smells good.",
          "He smells the hot soup carefully.",
          "Mary naturally and quickly became part of our family.",
          "She was completely natural and unaffected by the attention."]
data, labels = [], []
for sentence in corpus:
    sentence = nlp(sentence)
    for token in sentence:
        if token.tag_ in ["JJ", "RB"] and token.head.tag_.startswith("VB"):
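            features = feature_extractor(sentence, token.i)
            # skip words without a known adjective/adverb counterpart
            if features is None:
                continue
            data.append(features)
            # store the string label (token.pos is an integer ID)
            labels.append(token.pos_)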
            print("Word in question:", token.text)
            print(features)
            print("Label:", token.pos_)
            print("")

Word in question: good
{'w-1': 'smells', 'w+1': '.', 'adj': 'good', 'adv': 'well', 'adj_head': 'acomp_smell', 'adv_head': 'advmod_smell'}
Label: ADJ

Word in question: carefully
{'w-1': 'soup', 'w+1': '.', 'adv': 'carefully', 'adj': 'careful', 'adv_head': 'advmod_smell', 'adj_head': 'advcl_smell'}
Label: ADV

Word in question: naturally
{'w-1': 'Mary', 'w+1': 'and', 'cc': 'and', 'conj': 'quickly', 'adv': 'naturally', 'adj': 'natural', 'adv_head': 'advmod_become', 'adj_head': 'advmod_become'}
Label: ADV

Word in question: natural
{'w-1': 'completely', 'w+1': 'and', 'advmod': 'completely', 'cc': 'and', 'conj': 'unaffected', 'adj': 'natural', 'adv': 'naturally', 'adj_head': 'acomp_be', 'adv_head': 'advmod_be'}
Label: ADJ



In [11]:
# Vectorize features for sample sentences

vec = DictVectorizer()
x = vec.fit_transform(data)

In [12]:
# The full feature set

print("All features:")
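# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vec.get_feature_names_out().tolist()
print(feature_names)
print("\nTotal number of features: ", len(feature_names))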

All features:
['adj=careful', 'adj=good', 'adj=natural', 'adj_head=acomp_be', 'adj_head=acomp_smell', 'adj_head=advcl_smell', 'adj_head=advmod_become', 'adv=carefully', 'adv=naturally', 'adv=well', 'adv_head=advmod_be', 'adv_head=advmod_become', 'adv_head=advmod_smell', 'advmod=completely', 'cc=and', 'conj=quickly', 'conj=unaffected', 'w+1=.', 'w+1=and', 'w-1=Mary', 'w-1=completely', 'w-1=smells', 'w-1=soup']

Total number of features:  23


In [13]:
# The resulting sparse matrix

print("The resulting sparse matrix:")
print(x.toarray())

The resulting sparse matrix:
[[0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0.]
 [0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0.]]
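
To make the mapping from feature dictionaries to matrix rows explicit, here is a minimal pure-Python sketch of one-hot encoding for string-valued features. It reproduces what the output above shows (one `key=value` column per distinct pair, columns in sorted order), but it is not scikit-learn's implementation, and `one_hot_encode` is a hypothetical helper name.

```python
def one_hot_encode(samples):
    """
    One-hot encode a list of dictionaries with string values.

    :param samples: list of feature dictionaries
    :return: (sorted column names, list of rows with 0/1 entries)
    """
    # every distinct "key=value" pair becomes one column
    columns = sorted({"{}={}".format(k, v)
                      for sample in samples for k, v in sample.items()})
    index = {name: i for i, name in enumerate(columns)}
    rows = []
    for sample in samples:
        row = [0] * len(columns)
        for k, v in sample.items():
            row[index["{}={}".format(k, v)]] = 1
        rows.append(row)
    return columns, rows


samples = [{"w-1": "smells", "w+1": ".", "adj": "good", "adv": "well"},
           {"w-1": "soup", "w+1": ".", "adj": "careful", "adv": "carefully"}]
columns, rows = one_hot_encode(samples)
print(columns)
for row in rows:
    print(row)
```

Each row has exactly one 1 per feature key, which is why the matrix is sparse: the number of columns grows with the vocabulary, while the number of non-zero entries per row stays small.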


## Where do we get data?

Possible sources of data are:
- learner corpora (e.g., [lang-8](http://cl.naist.jp/nldata/lang-8/), [NUCLE](http://www.comp.nus.edu.sg/~nlp/conll14st.html))
- grammatically correct text
- annotations from a crowdsourcing platform (e.g., [MTurk](https://www.mturk.com/) or [CrowdFlower](https://www.figure-eight.com/)) or from linguists (e.g., [Appen](https://appen.com/))

But suppose we don't have any money or time :trollface: So we will use a corpus of allegedly correct English: [The Blog Authorship Corpus](http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm).

In [14]:
with open("data/adj_vs_adv_data.json", "r") as f:
    data = json.load(f)

for k, v in data[100].items():
    print(k + ":", v)

sentence: ['Wow', ',', 'that', 'really', 'does', 'make', 'me', 'sad', '.', ' ']
ind: 7
label: ADJ


In [15]:
# Collect features from our data set

x_features, y = [], []
for sample in data:
    sentence = nlp(" ".join(sample["sentence"]))
    features = feature_extractor(sentence, sample["ind"])
    if features:
        x_features.append(features)
        y.append(sample["label"])

print(len(x_features), len(y))

17571 17571


In [16]:
# Vectorize data

print(x_features[100], y[100])

vectorizer = DictVectorizer()
x = vectorizer.fit_transform(x_features)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("\nTotal number of features: ", len(vectorizer.get_feature_names_out()))

{'w-1': 'completely', 'w+1': ',', 'advmod': 'completely', 'adj': 'boring', 'adv': 'boringly', 'adj_head': 'acomp_be', 'adv_head': 'acomp_be'} ADJ

Total number of features:  14196


In [17]:
# Split data

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
                                                    random_state=42)

In [18]:
# Train a classifier

lrc = LogisticRegression(random_state=42)
lrc.fit(x_train, y_train)
predicted = lrc.predict(x_test)
prec, rec, fscore, sup = precision_recall_fscore_support(y_test, predicted, labels=['ADJ', 'ADV'])
print("Precision:", [round(p, 2) for p in prec])
print("Recall:", [round(r, 2) for r in rec])
print("F-score:", [round(f, 2) for f in fscore])

Precision: [0.95, 0.94]
Recall: [0.94, 0.96]
F-score: [0.94, 0.95]
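
Each list above holds one score per class, in the order given by `labels=['ADJ', 'ADV']`. As a reminder of what those numbers mean, the sketch below computes per-class precision, recall and F1 from two label lists in plain Python. The helper name `per_class_scores` is hypothetical, and the toy labels are made up for illustration; they are not taken from the data set.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, F1) with LABEL as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # guard against division by zero when a class is never predicted or seen
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up toy labels, not taken from the data set above.
y_true = ["ADJ", "ADJ", "ADJ", "ADV", "ADV", "ADV"]
y_pred = ["ADJ", "ADJ", "ADV", "ADV", "ADV", "ADJ"]
for label in ["ADJ", "ADV"]:
    scores = per_class_scores(y_true, y_pred, label)
    print(label, [round(s, 2) for s in scores])
```

Reporting the scores per class matters here because the two error types differ: low ADJ recall means correct adjectives get flagged as errors, while low ADV recall means misused adjectives slip through.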


In [42]:
def is_adj_correct(raw_sentence):
    """
    Print RAW_SENTENCE and, if an adjective should be an adverb,
    print the sentence with the suggested correction in curly braces.
    """
    sentence = nlp(raw_sentence)
    print("Input:", raw_sentence)
    for ind, token in enumerate(sentence):
        if token.tag_ == "JJ" and token.head.tag_.startswith("VB"):
            features = feature_extractor(sentence, ind)
            # skip adjectives that have no known adverb counterpart
            if features is None:
                continue
            predicted_pos = lrc.predict(vectorizer.transform([features]))[0]
            # report the first adjective that should be an adverb;
            # keep checking the remaining tokens otherwise
            if predicted_pos == "ADV":
                print(" ".join([t.text for t in sentence[:ind]]
                               + ["{" + token.text + "=>" + adj_to_adv[token.text] + "}"]
                               + [t.text for t in sentence[ind+1:]]))
                return
    print("No errors found.")

is_adj_correct("You have successful completed the project .")
print("")
is_adj_correct("I am busy talking to my friend.")
print("")
is_adj_correct("I am emotional talking to my friend.")
print("")
is_adj_correct("The soup smells good .")
print("")
is_adj_correct("He smells the hot soup careful .")
print("")
is_adj_correct("Mary natural and quickly became part of our family.")
print("")
is_adj_correct("She was completely natural and unaffected by the attention.")

Input: You have successful completed the project .
You have {successful=>successfully} completed the project .

Input: I am busy talking to my friend.
No errors found.

Input: I am emotional talking to my friend.
I am {emotional=>emotionally} talking to my friend .

Input: The soup smells good .
No errors found.

Input: He smells the hot soup careful .
He smells the hot soup {careful=>carefully} .

Input: Mary natural and quickly became part of our family.
Mary {natural=>naturally} and quickly became part of our family .

Input: She was completely natural and unaffected by the attention.
No errors found.
