# Efficient data augmentation for production-ready NLP tasks [1/3]

As algorithms grow more complex, they also grow hungrier for data in order to learn the precise meaning of a sentence. In a text classification task, for example, a wider variety of words seen during training makes for a much more resilient algorithm, especially in production, where it gets a taste of real-world data.

Just as in computer vision, where a biased model may detect a seagull from the presence of a beach rather than from the bird itself, words in NLU can be wrongly associated with a class. This becomes a threat in real-world use, as examples get harder to discriminate.
If automatically generating new data helps simply by **increasing the number of training examples**, how does it perform against using **more training data from the same dataset**, and **are there ways to generate more data efficiently**?

This article is inspired by the paper *Learning the Difference that Makes a Difference with Counterfactually-Augmented Data*, available on [arXiv](https://arxiv.org/abs/1909.12434).

In this study, the authors point out how hard it is for machine learning models to generalize the classification rules they learn: their decision rules, described as 'spurious patterns', often miss the key elements that most affect the sentiment of a text. They thus decided to reduce this confusion factor by changing the label of an example while changing as few words as possible, so that those **key words** would be much easier for the model to spot.

We'll go through the details of the paper for a text classification task, and study:
1. The impact of counterfactually-augmented data
2. The efficiency and cost of this data generation technique

We'll use the data from the study, based on the IMDB sentiment analysis dataset and publicly available [here](https://github.com/acmi-lab/counterfactually-augmented-data).
The dataset consists of 50k movie reviews, and the task is to classify each review as a positive or negative opinion about a movie.

## Counterfactual data augmentation

In the article *Learning the Difference that Makes a Difference with Counterfactually-Augmented Data*, the authors created a new labelling task based on the IMDB dataset. On a subset of reviews, annotators were asked to flip the class of the review by changing as few words as possible. For example, the sentence:
*Long, boring, blasphemous. Never have I been so glad to see ending credits roll.*
became: *Long, fascinating, soulful. Never have I been so sad to see ending credits roll.*
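
To make the idea of a minimal edit concrete, here is a small sketch that lists the words an annotator changed between the two versions of the review above. It relies on Python's standard `difflib` module; the helper names `tokenize` and `counterfactual_edits` are ours and do not come from the paper.

```python
import difflib
import re


def tokenize(text):
    """Lowercase a text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def counterfactual_edits(original, revised):
    """Return the (removed_words, added_words) pairs that separate two texts."""
    a, b = tokenize(original), tokenize(revised)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original = "Long, boring, blasphemous. Never have I been so glad to see ending credits roll."
revised = "Long, fascinating, soulful. Never have I been so sad to see ending credits roll."

edits = counterfactual_edits(original, revised)
for removed, added in edits:
    print(" ".join(removed), "->", " ".join(added))
# boring blasphemous -> fascinating soulful
# glad -> sad

changed = sum(len(removed) for removed, _ in edits)
print(f"{changed} of {len(tokenize(original))} words changed")
# 3 of 14 words changed
```

Only three words out of fourteen carry the change of label, which is exactly the signal we would like a classifier to pick up.
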

In the following, we will use these four datasets:
- the *original* dataset, a subset of the IMDB dataset
- the *revised* dataset, created in the study from variations of the original dataset
- the *combined* dataset, combining the two previous datasets
- the *originalDouble* dataset, enlarging the original dataset with more reviews from the IMDB dataset.
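
As a rough sketch of how these four training sets relate to each other, the snippet below assembles them from toy `(text, label)` pairs. The reviews are invented placeholders, `build_training_sets` is a hypothetical helper, and we assume here that *originalDouble* is sized to match *combined* so that the two can be compared at equal training size.

```python
import random

# Toy stand-ins for the real files: each example is a (text, label) pair.
original = [
    ("Long, boring, blasphemous.", "negative"),
    ("A warm and funny classic.", "positive"),
]
revised = [
    ("Long, fascinating, soulful.", "positive"),
    ("A cold and dull classic.", "negative"),
]
# Hypothetical pool of extra reviews drawn from the rest of IMDB.
extra_imdb = [
    ("One of the worst scripts of the year.", "negative"),
    ("Something truly moving from start to finish.", "positive"),
    ("Flat acting and a lazy plot.", "negative"),
]


def build_training_sets(original, revised, extra_pool, seed=0):
    """Assemble the four training sets used in the comparison."""
    rng = random.Random(seed)
    # Assumption: add as many extra reviews as there are revised ones,
    # so that originalDouble and combined have the same size.
    extra = rng.sample(extra_pool, k=len(revised))
    return {
        "original": list(original),
        "revised": list(revised),
        "combined": original + revised,
        "originalDouble": original + extra,
    }


sets = build_training_sets(original, revised, extra_imdb)
for name, examples in sets.items():
    print(name, len(examples))
```
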

In this study, we use several machine learning models and analyse their performance on the original data, to understand how much counterfactual data can improve performance on this text classification task.

No pre-processing is applied to the text data: we only use a TF-IDF bag of words and a classifier, with a grid search on a dedicated validation dataset. All results are reported on an out-of-sample test set, and every dataset is balanced, so accuracy is a suitable metric.
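
The protocol (fit on the training set, pick hyperparameters on the validation set, report accuracy on the test set) can be sketched as follows. To stay dependency-free, this sketch swaps the TF-IDF features for raw word counts and uses a tiny multinomial Naive Bayes as the only classifier; the smoothing grid, the toy reviews and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(examples, alpha):
    """Fit a multinomial Naive Bayes model with additive smoothing `alpha`."""
    word_counts, doc_counts, vocabulary = {}, Counter(), set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary, alpha


def predict(model, text):
    word_counts, doc_counts, vocabulary, alpha = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        denominator = sum(counts.values()) + alpha * len(vocabulary)
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            if token in vocabulary:  # words never seen in training are ignored
                score += math.log((counts[token] + alpha) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, examples):
    return sum(predict(model, text) == label for text, label in examples) / len(examples)


def grid_search(train, validation, alphas):
    """Keep the smoothing value with the best accuracy on the validation set."""
    best = max(alphas, key=lambda a: accuracy(train_naive_bayes(train, a), validation))
    return best, train_naive_bayes(train, best)


train_set = [("great fun", "positive"), ("boring mess", "negative"),
             ("great cast", "positive"), ("boring plot", "negative")]
validation_set = [("great plot", "positive"), ("boring cast", "negative")]
test_set = [("fun cast", "positive"), ("boring boring mess", "negative")]

alphas = [0.1, 1.0, 10.0]
best_alpha, model = grid_search(train_set, validation_set, alphas)
print("selected alpha:", best_alpha)
print("test accuracy:", accuracy(model, test_set))
```

The test set is only touched once, after the hyperparameter has been chosen, which is what keeps the reported accuracy out-of-sample.
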

### Building a more resilient dataset

The main interest of this technique is to help the model get a better sense of the words that really capture the meaning of the sentence. As a matter of fact, a model trained on the original data fails to score better than a random guess on counterfactual out-of-sample data.
The opposite is also true: a model trained on counterfactual data scores very poorly on original out-of-sample data, no better than a random guess.


![dataset](./img/counterfactual_original_only.png)

This resilience effect is well illustrated by the SVM and LogisticRegression classifiers, since it is possible to compute the importance of each word for the task. The contrast between the **key words** of the model trained on the original dataset and those of the model trained on the combined dataset is striking.

With the original dataset, the words that contribute most to the classification of a review are often not the most meaningful ones (words like `classic`, `one of`, `romantic`, or `something`, for example).

On the contrary, a model trained on the *combined* dataset (with **both the original and the revised dataset**) has much more coherent words as important features, which increases its ability to classify out-of-sample examples correctly.

![features](./img/features_original.png)
![features_comb](./img/features_comb.png)
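
The mechanism behind this change can be illustrated on a toy example. The article reads word importances from the coefficients of the linear models; the sketch below swaps in smoothed log-odds as a dependency-free proxy, and the reviews and helper names are invented for illustration. Words that the annotators leave untouched appear in both classes of the combined data, so their weight collapses, while the edited sentiment words keep a strong weight.

```python
import math
from collections import Counter


def log_odds_weights(examples):
    """Smoothed log-odds of each word: > 0 leans positive, < 0 leans negative."""
    counts = {"positive": Counter(), "negative": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    vocabulary = set(counts["positive"]) | set(counts["negative"])
    totals = {label: sum(c.values()) + len(vocabulary) for label, c in counts.items()}
    return {
        word: math.log((counts["positive"][word] + 1) / totals["positive"])
        - math.log((counts["negative"][word] + 1) / totals["negative"])
        for word in vocabulary
    }


def top_words(weights, k):
    # Sort by decreasing magnitude, breaking ties alphabetically for stability.
    return sorted(weights, key=lambda w: (-abs(weights[w]), w))[:k]


# Toy data: "romantic" and "long" are spuriously tied to one class.
original = [
    ("romantic and wonderful", "positive"),
    ("romantic and fascinating", "positive"),
    ("long and boring", "negative"),
    ("long and awful", "negative"),
]
# Counterfactual revisions: only the sentiment word changes, and the label flips.
revised = [
    ("romantic and terrible", "negative"),
    ("romantic and boring", "negative"),
    ("long and fascinating", "positive"),
    ("long and wonderful", "positive"),
]

weights_original = log_odds_weights(original)
weights_combined = log_odds_weights(original + revised)

print(top_words(weights_original, k=2))  # ['long', 'romantic']
print(top_words(weights_combined, k=3))  # ['boring', 'fascinating', 'wonderful']
print(round(abs(weights_combined["romantic"]), 3))  # 0.0
```
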

Furthermore, the **performance of this last model is much better**, on both the *revised* and the *original* datasets.

### Counterfactual augmentation 

For a fairer comparison, we compare models trained on the same number of examples. The results across multiple models on the original test data are gathered below.

![3](./img/3,4k_reviews.png)

Unsurprisingly, **models trained on more data generally perform better on the test data, across all model types**. Even so, the model trained on the larger original dataset does not always score best: the Naive Bayes and K-Nearest Neighbors classifiers, for example, perform better with the combined training dataset, which gathers counterfactual and original data.

### Robustify a model with counterfactual data

Another benefit of counterfactual data is that, even with a fully tuned model and a large training set, adding it brings a steady performance gain of around 1% for almost all models tested here.

Furthermore, as shown previously, the resulting model is now resilient to counterfactual data, even when only a small subset of counterfactual data is used.

![3](./img/20,7k_reviews.png)

## Cost / efficiency analysis 

To further assess the value of the counterfactual annotation process, we evaluated the difference in cost between collecting more data and creating counterfactual data.
We thus reproduced the task of writing movie reviews, and present below the results of this labelling task, per 100 words of review produced.

**graph to complete**
- exhaustion effect
- a factor of 2 to 3 (time per 100 words for each task)

## Conclusion

There seem to be two main reasons to use counterfactual data augmentation.

1. Its impact on text classification performance can be as good as, and sometimes better than, that of using more original data. Furthermore, counterfactual data makes the model resilient to data that can be considered adversarial.

2. Because the generation process is so precise and systematic, counterfactual data is **far easier to produce than new original data**. Gathering original data can be really tedious and suffers from exhaustion, because it often relies on synthetic data, for example in chatbot use cases. In the beginning especially, we generate data synthetically to onboard the first users, but imagination has its limits. For real-world use cases, the counterfactual alternative brings a gain of a factor of 2 to 3 in cost for the same final performance, as well as a gain in data quality, since it does not suffer from exhaustion.


We studied the case of text classification here, but the same could be done for many other fields of NLU, for example named entity recognition or intent classification. One of the primary uses of this technique could thus be the development of chatbots, in which all those tasks are of primary concern.

Intro on text classification

1. Counterfactual and results
- More data is better
- Counterfactual data is better
- Counterfactual data is debiased
- Even with a lot of data, it is more robust to production data

We do all of this on sentiment analysis, but it can be extended to other tasks.

2. Cost/efficiency report: counterfactual versus more data

Graph on 100 annotated sentences, counterfactual versus normal annotation
- exhaustion effect to show
- idea of an ideal factor of 2 to 3

Conclusions:
- Hand-generated data is of lower quality (exhaustion effect)
- Works on other domains: for example named entities, intent classification (chatbot example)