## Imports

In [1]:
import spacy
import pandas as pd
import numpy as np
from nltk import ngrams
from collections import defaultdict, Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from gensim.models import Word2Vec
import gensim.downloader as api
from pattern.en import parse, Sentence, parse, mood, modality

## Get data

In [2]:
path = "C:/Users/Léo/Desktop/semeval-2020-task-5/data/"
train1 = pd.read_csv(path + 'train/subtask1.csv')
train2 = pd.read_csv(path + 'train/data_train_subtask2.csv')
train2 = train2[["sentence"]]
train2.insert(1, "gold_label", "1") # add a column 'gold_label' with value 1 (data for subtask2 only contains counterfactuals)

# concatenate data and remove duplicates
df = pd.concat([train1, train2])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
df = df.reindex(idx)
df['gold_label'] = df['gold_label'].apply(str)

print('*Examples of non-counterfactual sentences (0):')
for i in df.loc[df['gold_label'] == '0', 'sentence'][:3]:
    print('- ' + i)
    
print('')
print('*Examples of counterfactual sentences (1):')
for i in df.loc[df['gold_label'] == '1', 'sentence'][:3]:
    print('- ' + i)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  


*Examples of non-counterfactual sentences (0):
- The new request, if approved, would keep the military forces on the border through Jan.
- Companies in financial difficulty can currently only negotiate down wages and conditions to below those established by the collective bargaining procedure if they have the approval of unions, which is rarely given.
- If needed, I would like to have the right to try.

*Examples of counterfactual sentences (1):
- Goodfellow's theory has been questioned, however, because the plane made two other sharp turns that would've been impossible if the pilots were unconscious.
- However, both campaigners and pro-People's Vote MPs say that this number would be grow significantly if there were no other viable means of avoiding leaving the EU without a Withdrawal Agreement.
- Things could have been even better if the whole chip industry wasn't constrained by chip foundries such as Taiwan Semiconductor Manufacturing (NYSE: TSM) and United Microelectronics getting s

## Text preprocessing

Experimenting with different preprocessing steps showed that cleaning the sentences was best kept to a minimum. We only keep alphanumeric tokens, lowercase them and merge them with their corresponding part-of-speech tag.

In [3]:
nlp = spacy.load("en_core_web_sm")

# set custom stop words list
stopwords = {'could','are','be','been','can','cannot','could','did','do','does','done','get','had','has',
             'have','if','is','made','might','must','should','were','will','would'}
nlp.Defaults.stop_words -= stopwords

token_pos = []
pos = []
for i in df['sentence']:
    sent = nlp(i)
    tokens_tags = [token.lower_ + '_' + token.tag_ for token in sent if not token.is_punct and not token.is_space] 
    tags = [token.tag_ for token in sent if not token.is_punct and not token.is_space]
    token_pos.append(' '.join(tokens_tags))
    pos.append(' '.join(tags))
    
df['sent_tok'] = token_pos
df['sent_pos'] = pos

print(df['sent_tok'].iloc[0])
print(df['sent_pos'].iloc[0])

goodfellow_NNP 's_POS theory_NN has_VBZ been_VBN questioned_VBN however_RB because_IN the_DT plane_NN made_VBD two_CD other_JJ sharp_JJ turns_NNS that_WDT would_MD 've_VB been_VBN impossible_JJ if_IN the_DT pilots_NNS were_VBD unconscious_JJ
NNP POS NN VBZ VBN VBN RB IN DT NN VBD CD JJ JJ NNS WDT MD VB VBN JJ IN DT NNS VBD JJ


## Adding lexical information

Some ngrams have been selected based on a chi-squared test that was previously conducted (see keyword2.py). It was noted that specifically adding them as one-hot encoded features did not affect the results. Their presence in a sentence will be encapsulated by the vectorizer. 

In [10]:
ngrams_list = ['would', 'say', 'I', "'s", 'may', 'Mr.', 'even', 'year', 'get', 'take', 'Trump', 'company', 'see', 'market', 'use', 'also', 'wish', 'Mr', 'need', 'ask', 'new', 'come', 'much', 'want', 'government', 'patient', 'risk', 'bank', 'change',
'I would', 'Mr. Trump', 'I wish', 'I could', 'year ago', 'I know', 'wish I', 'last year', 'I I', 'euro zone', 'central bank', 'probably would',
'I wish I', 'health care provider', 'I I would', 'wish I could', 'think I would', 'I probably would', 'year ago I']

# One-hot encoding statistically relevant ngrams
for gram in ngrams_list:
    df[gram] = 0
    for idx, sentence in enumerate(df['sentence']):
        if gram in sentence:
            df[gram].iloc[idx] = 1 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


## Adding syntactic information

Since the construction of counterfactuals is intricately tied with the use of specific verb forms, frequent sequences of POS tags containing at least one verb are considered as an additional feature.

In [11]:
trigram0 = {}
trigram1 = {}
fourgram0 = {}
fourgram1 = {}
count0 = 0
count1 = 0

# Counting POS tags trigrams and fourgrams that contain a verb
for idx, sent in enumerate(df['sent_pos']):
    tri = ngrams(sent.split(), 3)  
    four = ngrams(sent.split(), 4)
    if df['gold_label'].iloc[idx] == '0':
        count0 += 1  
        for i in tri:
            if 'V' in ' '.join(i):
                trigram0[i] = trigram0.get(i, 0) + 1
        for i in four:
            if 'V' in ' '.join(i):
                fourgram0[i] = fourgram0.get(i, 0) + 1
    else: 
        count1 += 1
        for i in tri:
            if 'V' in ' '.join(i):
                trigram1[i] = trigram1.get(i, 0) + 1
        for i in four:
            if 'V' in ' '.join(i):
                fourgram1[i] = fourgram1.get(i, 0) + 1

# Scaling these numbers down for easier comparison
trigram0 = {k: round((v/count0), 2) for k,v in trigram0.items()}
trigram1 = {k: round((v/count1), 2) for k,v in trigram1.items()}
fourgram0 = {k: round((v/count0), 2) for k,v in fourgram0.items()}
fourgram1 = {k: round((v/count1), 2) for k,v in fourgram1.items()}

# Using Counter objects to retrieve the most common values of a dict
trigram0 = Counter(trigram0) 
trigram1 = Counter(trigram1) 
fourgram0 = Counter(fourgram0) 
fourgram1 = Counter(fourgram1) 
high_tri0 = trigram0.most_common(15)                  
high_tri1 = trigram1.most_common(15)  
high_four0 = fourgram0.most_common(10)  
high_four1 = fourgram1.most_common(10)  

def encode_common_seq(top_seqs, ngram_range):
    # One-hot encoding the presence of the most common POS tags sequences
    for seq, count in top_seqs:
        df[seq[0]] = 0 
        for idx, sent in enumerate(df['sent_pos']):
            grams = list(ngrams(sent.split(), ngram_range))  
            if seq in grams:
                df[seq[0]].iloc[idx] = 1
            
encode_common_seq(high_tri0, 3)
encode_common_seq(high_tri1, 3)
encode_common_seq(high_four0, 4)
encode_common_seq(high_four1, 4)
                    
print('Most common trigrams: (0,1)')
for i,j  in zip(high_tri0, high_tri1): 
    print(i[0]," :",i[1]," | ", j[0]," :",j[1])

print('Most common fourgrams: (0,1)')
for i,j  in zip(high_four0, high_four1): 
    print(i[0]," :",i[1]," | ", j[0]," :",j[1])

Most common trigrams: (0,1)
('PRP', 'MD', 'VB')  : 0.38  |  ('MD', 'VB', 'VBN')  : 0.65
('MD', 'VB', 'VBN')  : 0.28  |  ('PRP', 'MD', 'VB')  : 0.44
('NN', 'MD', 'VB')  : 0.25  |  ('IN', 'PRP', 'VBD')  : 0.3
('MD', 'RB', 'VB')  : 0.23  |  ('MD', 'RB', 'VB')  : 0.2
('VB', 'DT', 'NN')  : 0.22  |  ('NN', 'MD', 'VB')  : 0.17
('MD', 'VB', 'DT')  : 0.18  |  ('VB', 'VBN', 'DT')  : 0.16
('NNS', 'MD', 'VB')  : 0.17  |  ('VBN', 'IN', 'DT')  : 0.15
('VBN', 'IN', 'DT')  : 0.15  |  ('VBN', 'DT', 'NN')  : 0.13
('TO', 'VB', 'DT')  : 0.14  |  ('VB', 'VBN', 'IN')  : 0.13
('MD', 'VB', 'IN')  : 0.14  |  ('PRP', 'VBD', 'VBN')  : 0.12
('IN', 'PRP', 'VBD')  : 0.13  |  ('NNS', 'MD', 'VB')  : 0.11
('NNP', 'NNP', 'VBD')  : 0.13  |  ('VBD', 'DT', 'NN')  : 0.11
('IN', 'PRP', 'VBP')  : 0.12  |  ('RB', 'VB', 'VBN')  : 0.11
('NN', 'TO', 'VB')  : 0.12  |  ('PRP', 'VBD', 'RB')  : 0.11
('NNP', 'MD', 'VB')  : 0.11  |  ('DT', 'NN', 'VBD')  : 0.1
Most common fourgrams: (0,1)
('VB', 'DT', 'NN', 'IN')  : 0.09  |  ('PRP', 'M

## Adding semantic information

### Custom embeddings
Using the library Gensim, we are able to train our custom word embeddings on the dataset using Word2Vec, since the corpus at hand is relatively small. Words with similar meanings appear close to each other in the vector space (see below the words related to ‘if’). In order to produce a single vector for one training example (i.e. a sentence), we can sum all the words and average them, thereby obtaining a vector representation of the sentence.

In [12]:
vector_length = 300
model = Word2Vec(sentences=[sent.lower().split() for sent in df['sentence']], 
                 sg=1, 
                 size=vector_length,  
                 workers=1)

def vectorize(sentences, model):
    # Sum all word vectors and average them to obtain a representation of the sentence
    all_vect = []
    for sentence in sentences:
        words = sentence.split()
        vectors = []
        for w in words: 
            if w not in model.wv.vocab:
                vectors.append(np.zeros(vector_length))
            else:
                vectors.append(model.wv[w])                    
        avg = sum(vectors)/len(vectors)
        all_vect.append(avg)
    return all_vect

vectorized_sentences = vectorize(df['sentence'], model)
custom_embeddings = {}
for i in range(vector_length):
    custom_embeddings[i] = [val[i] for val in vectorized_sentences]
    
model.wv.most_similar('if', topn=6)

[('though', 0.75510573387146),
 ('"if', 0.7416023015975952),
 ('up,', 0.7197611331939697),
 ('unless', 0.7177684307098389),
 ('assuming', 0.7159310579299927),
 ('once', 0.6934325695037842)]

### Pretrained embeddings

Using Gensim, we can also access models that have been pre-trained on various datasets. The model ‘word2vec-google-news-300’ is used to create embeddings for sentences in the same manner as the previous step. 
While this served as an interesting investigation of word embeddings, using the embeddings from these two methods as features yielded worsened results. 

In [None]:
model_word2vec_news = api.load("word2vec-google-news-300") 

vectorized_sentences = vectorize(df['sentence'], model_word2vec_news)
pretrained_embeddings = {}
for i in range(vector_length):
    pretrained_embeddings[i] = [val[i] for val in vectorized_sentences]

model_word2vec_news.wv.most_similar('if', topn=6)

### Mood and modality

The Pattern library provides a simple way to extract these two features from sentences. Grammatical mood refers to the use of auxiliary verbs (e.g., could, would) and adverbs (e.g., definitely, maybe) to express uncertainty. The mood() function returns either INDICATIVE, IMPERATIVE, CONDITIONAL or SUBJUNCTIVE for a given parsed sentence. For this specific task, establishing whether a sentence’s mood is conditional could be useful to classify it as counterfactual. The modality() function returns the degree of certainty as a value between -1.0 and +1.0, where values > +0.5 represent facts. Adding modality as a feature slightly improved the model’s accuracy. While mood did seem like a promising feature to include as well (see distributions below), it did not improve the results further.  

In [19]:
count_0 = 0
count_1 = 0
counts_0 = {}
counts_1 = {}
mod0 = 0
mod1 = 0
for i, j in zip(df['sentence'], df['gold_label']):
    s = parse(i)
    s = Sentence(s)
    if j == '0':
        count_0 += 1
        mod0 += modality(s)
        counts_0[mood(s)] = counts_0.get(mood(s), 0) + 1        
    elif j == '1':
        count_1 += 1
        mod1 += modality(s)
        counts_1[mood(s)] = counts_1.get(mood(s), 0) + 1

d0 = {k: round((v/count_0)*100, 2) for k, v in counts_0.items()}
d1 = {k: round((v/count_1)*100, 2) for k, v in counts_1.items()}

print("Mean modality score (0,1):")
print(round(mod0/count_0, 2), round(mod1/count_1, 2))
print("Overall predicted moods (0,1):")
for i, j in zip(sorted(d0.items()), sorted(d1.items())):
    print(i, j)

Mean modality score (0,1):
0.14 -0.02
Overall predicted moods (0,1):
('conditional', 84.25) ('conditional', 88.33)
('imperative', 0.85) ('imperative', 1.0)
('indicative', 10.86) ('indicative', 2.0)
('subjunctive', 4.04) ('subjunctive', 8.67)


In [20]:
# One-hot encode mood as a feature
df['conditional'] = 0
df['indicative'] = 0
df['subjunctive'] = 0

for idx, s in enumerate(df['sentence']):
    s = parse(s)
    s = Sentence(s)
    if mood(s) == 'conditional':
        df['conditional'].iloc[idx] = 1
    elif mood(s) == 'indicative':
        df['indicative'].iloc[idx] = 1
    elif mood(s) == 'subjunctive':
        df['subjunctive'].iloc[idx] = 1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


## Vectorization

To perform machine learning on our textual data, we need to turn each sentence into a numerical feature vector. Here are two methods we can use:

- Bag-of-Words representation (BoW): This consists in counting the occurence of each token with no regard to syntactic structure. We collect a vocabulary of all occurring tokens, and then we attribute a score to each token, using its count or its frequency.   
- TF-IDF: The problem with the BoW model is that it may attribute more importance to frequent words.  Rarer words specific to a domain are neglected while they could potentially contain useful information. To deal with this issue, a common method consists in scaling word frequency by how often they appear across all sentences, so that more frequent words are penalized. This method is called Term Frequency – Inverse Document Frequency. 


In [21]:
Tfidf_vect = TfidfVectorizer(ngram_range=(1,3), max_features=3000, smooth_idf=True) 
Tfidf_vect = Tfidf_vect.fit_transform(df['sent_tok'])
Tfidf_df = pd.DataFrame(Tfidf_vect.toarray())
df = pd.concat([df, Tfidf_df.set_index(df.index)], axis=1)

## Classification

The following classifiers available in scikit-learn were tested: RandomForestClassifier, svm.SVC, MultinomialNB, SGDClassifier. Among these estimators, SGDClassifier performed the best.

In [27]:
Train_X, Test_X, Train_Y, Test_Y = train_test_split(df.iloc[:,5:], df['gold_label'],test_size=0.3)
text_clf =  SGDClassifier(loss="hinge", alpha=0.0001, penalty="elasticnet", max_iter=1000).fit(Train_X, Train_Y) 
predicted = text_clf.predict(Test_X)

print(metrics.classification_report(Test_Y, predicted, target_names=None))
print(pd.crosstab(Test_Y, predicted, rownames=['True'], colnames=['Predicted'], margins=True))

              precision    recall  f1-score   support

           0       0.92      0.94      0.93      3485
           1       0.85      0.82      0.83      1481

    accuracy                           0.90      4966
   macro avg       0.89      0.88      0.88      4966
weighted avg       0.90      0.90      0.90      4966

Predicted     0     1   All
True                       
0          3269   216  3485
1           272  1209  1481
All        3541  1425  4966


We can inspect the weights attributed to each keword by the classifier.

In [32]:
ngrams_coef = text_clf.coef_[0][-(len(ngrams_list)):]
for i in zip(ngrams_list[:10], ngrams_coef[:10]):
    print(i)

('would', 1.2868696196246325)
('say', 1.0187493868810535)
('I', 0.88067557532815)
("'s", -0.39464025192397323)
('may', 1.678167043264575)
('Mr.', 0.19819411231247994)
('even', -1.110390249471778)
('year', -0.8314548069469684)
('get', -0.6399906420151661)
('take', -0.25161422223071106)


## Hyperparameter optimization

The final step to make the predictions a tad more accurate is to tune the model’s hyperparameters. To do so, an exhaustive grid search was performed using GridSearchCV. GridSearchCV samples from a set of pre-defined hyperparameters and fits various models with different combinations of paramaters using k-fold cross-validation. The most accurate model can then be retrieved.

In [None]:
sgd = SGDClassifier()
sgd_param_grid = {
    'penalty': ['l2', 'l1', 'elasticnet'],
    'alpha': [0.001, 0.0001, 0.00001],
    'max_iter': [500, 1000] 
    }

clf = GridSearchCV(sgd, sgd_param_grid)
clf.fit(train_X, Train_Y)

## Results

|                              | Precision      | Recall    | F1-score  |   |
|------------------------------|----------------|-----------|-----------|---|
| Raw text  - Count vectorizer | 0.83           | 0.68      | 0.70      |   |
| **Raw text - TF-IDF**            | 0.82           | 0.68      | 0.73      |   |
| **Adding subtask2 data**         | 0.85           | 0.82      | 0.84      |   |
| *Preprocessing*                |                |           |           |   |
| Lemmatization                | 0.83           | 0.80      | 0.82      |   |
| **Tokenization**                 | 0.85           | 0.82      | 0.84      |   |
| *Syntactic information*        |                |           |           |   |
| **Merging Token_POS tag**        | 0.86           | 0.83      | 0.84      |   |
| **Ngram range(1,3)**             | 0.88           | 0.85      | 0.86      |   |
| **POS tags sequences**           | 0.88           | 0.87      | 0.87      |   |
| *Semantic information*         |                |           |           |   |
| Custom embeddings            | 0.85           | 0.79      | 0.80      |   |
| Pretrained embeddings        | 0.86           | 0.80      | 0.81      |   |
| **Modality**                     | 0.89           | 0.87      | 0.88      |   |
| Mood                         | 0.87           | 0.87      | 0.87      |   |