<img align="right" width="400" src="https://www.fhnw.ch/de/++theme++web16theme/assets/media/img/fachhochschule-nordwestschweiz-fhnw-logo.svg" alt="FHNW Logo">


# Data Augmentation with Word Embeddings

by Fabian Märki

## Summary
The aim of this notebook is to show how word embeddings can be used for data augmentation in NLP (similar to [image augmentation](https://github.com/aleju/imgaug)). Data augmentation refers to techniques that increase the amount of data by adding slightly modified copies of existing data. It acts as a regularizer and helps to reduce [overfitting](https://en.wikipedia.org/wiki/Overfitting) when training a machine learning model, which in turn can improve model performance. It is closely related to [oversampling](https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis) in data analysis.

A simple technique to augment a text (i.e. generate a new text from an existing one with, hopefully, the same meaning) is to replace one or more selected words with a synonym. Word embeddings are one possible *synonym provider*. More advanced NLP data augmentation techniques include *backtranslation* (e.g. translating a text to English and back to German), text summarization, and text generation.

### Sources
- [Data Augmentation in NLP: Best Practices](https://neptune.ai/blog/data-augmentation-nlp)
- [Data Augmentation in NLP: Introduction to Text Augmentation](https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28)
- [Data Augmentation Library for Text](https://towardsdatascience.com/data-augmentation-library-for-text-9661736b13ff)

### Libraries
- [EDA](https://github.com/jasonwei20/eda_nlp): Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
- [Snorkel for Data Augmentation](https://www.snorkel.org/use-cases/02-spam-data-augmentation-tutorial) (Snorkel can be useful for much more!)

This notebook contains assignments: <font color='red'>Questions are written in red.</font>

<a href="https://colab.research.google.com/github/markif/2024_HS_DAS_NLP_Notebooks/blob/master/03_a_Data_Augmentation_with_Word_Embeddings.ipynb">
  <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [1]:
%%capture

!pip install 'fhnw-nlp-utils>=0.8.0,<0.9.0'

import pandas as pd
import numpy as np

In [2]:
from fhnw.nlp.utils.system import set_log_level
from fhnw.nlp.utils.system import system_info

set_log_level()
print(system_info())

OS name: posix
Platform name: Linux
Platform release: 5.15.0-46-generic
Python version: 3.6.9
CPU cores: 6
RAM: 31.12GB total and 17.29GB available
Tensorflow version: 2.5.1
GPU is available
GPU is a NVIDIA GeForce RTX 2070 with Max-Q Design with 8192MiB


Word similarities and analogies with fasttext

In [3]:
%%capture

!pip install fasttext

import fasttext
import fasttext.util
from fhnw.nlp.utils.colab import runs_on_colab

if runs_on_colab():
    from fhnw.nlp.utils.storage import download
    # Colab has problems handling such large files
    model_name = "cc.de.50.bin"
    download("https://drive.switch.ch/index.php/s/fncH84BgISMlT3v/download", model_name)
else:
    model_name = "cc.de.300.bin"
    fasttext.util.download_model('de', if_exists='ignore')
    
ft = fasttext.load_model(model_name)

In [4]:
ft.get_nearest_neighbors("Arzt", k=20)

[(0.8161019086837769, 'Hausarzt'),
 (0.7625795602798462, 'Kinderarzt'),
 (0.7411035299301147, 'Ärztin'),
 (0.738987147808075, 'Augenarzt'),
 (0.7332271337509155, 'Lungenarzt'),
 (0.7182382941246033, 'Frauenarzt'),
 (0.7172082662582397, 'Arztin'),
 (0.7090520858764648, 'Psychiater'),
 (0.7051564455032349, 'Zahnarzt'),
 (0.70427405834198, 'Mediziner'),
 (0.7000358700752258, 'Röntgenarzt'),
 (0.6979174613952637, 'Tierarzt'),
 (0.6966368556022644, 'arzt'),
 (0.6895619034767151, 'Familienarzt'),
 (0.687816858291626, 'Chirurg'),
 (0.6830509305000305, 'Vertretungsarzt'),
 (0.6816859245300293, 'Neurologe'),
 (0.680378258228302, 'HNO-Arzt'),
 (0.6794061660766602, 'Allgemeinmediziner'),
 (0.6781229376792908, 'Krankenhausarzt')]

In [5]:
ft.get_nearest_neighbors("König", k=20)

[(0.7587202191352844, 'Königs'),
 (0.7204199433326721, 'Könige'),
 (0.7051979899406433, 'Kaiser'),
 (0.677097499370575, 'Königin'),
 (0.6656137704849243, 'Prinz'),
 (0.6584029793739319, 'Königen'),
 (0.6460000872612, 'Herrscher'),
 (0.6457017660140991, 'Exkönig'),
 (0.6453478932380676, 'Herzog'),
 (0.643559992313385, 'Kronprinz'),
 (0.6374893188476562, 'Prinzen'),
 (0.6359966397285461, 'Königssohn'),
 (0.632456362247467, 'Oberkönig'),
 (0.6273627281188965, 'Fürst'),
 (0.6244688630104065, 'Marionettenkönig'),
 (0.6177127957344055, 'Kindkönig'),
 (0.6165410280227661, 'könig'),
 (0.6133463978767395, 'Ex-König'),
 (0.610389232635498, 'Vize-König'),
 (0.6092323660850525, 'Sachsenkönig')]
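
The scores in the neighbor lists above behave like cosine similarities between word vectors. As a minimal sketch of the idea (not fasttext's actual implementation), the following ranks words by cosine similarity. The vectors in `toy_vectors` are made-up 3-dimensional numbers, and `cosine_similarity` and `nearest_neighbors` are hypothetical helper names used only for illustration.

```python
import numpy as np

# Made-up 3-dimensional toy embeddings (assumption for illustration only;
# the fasttext models used in this notebook have 50 or 300 dimensions).
toy_vectors = {
    "König": np.array([0.9, 0.8, 0.1]),
    "Kaiser": np.array([0.85, 0.75, 0.2]),
    "Königin": np.array([0.9, 0.1, 0.8]),
    "Apfel": np.array([-0.7, 0.1, -0.6]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest_neighbors(word, vectors, k=2):
    """Return the k most similar words as (score, word) tuples, best first."""
    query = vectors[word]
    scored = [
        (cosine_similarity(query, vector), other)
        for other, vector in vectors.items()
        if other != word  # the query word itself is not a neighbor
    ]
    return sorted(scored, reverse=True)[:k]


for score, neighbor in nearest_neighbors("König", toy_vectors):
    print(f"{neighbor}: {score:.3f}")
```

The result has the same `(score, word)` shape as the output of `get_nearest_neighbors`, which makes it easy to compare the toy ranking with the real one.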

In [6]:
# König - Mann + Frau = ?
# König is to Mann what ? is to Frau
ft.get_analogies("König", "Mann", "Frau", k=5)

[(0.6669625639915466, 'Königin'),
 (0.6069499254226685, 'Königs'),
 (0.5830625891685486, 'Elisabeth'),
 (0.5826849937438965, 'Prinzessin'),
 (0.5708263516426086, 'Beatrix')]

In [7]:
# Bern - Schweiz + Deutschland = ?
# Bern is to Schweiz what ? is to Deutschland
ft.get_analogies("Bern", "Schweiz", "Deutschland", k=5)

[(0.6908500790596008, 'Berlin'),
 (0.6577669382095337, 'Köln'),
 (0.6426689028739929, 'München'),
 (0.6407095789909363, 'Hamburg'),
 (0.6280649304389954, 'Frankfurt')]

In [8]:
# Obama - USA + Deutschland = ?
# Obama is to USA what ? is to Deutschland
ft.get_analogies("Obama", "USA", "Deutschland", k=5)

[(0.6156734824180603, 'Merkel'),
 (0.5905285477638245, 'Gauck'),
 (0.5554210543632507, 'Kanzlerin'),
 (0.5380834937095642, 'Westerwelle'),
 (0.5376471281051636, 'Bundeskanzlerin')]
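
The vector arithmetic noted in the comments above (`König - Mann + Frau = ?`) can be sketched with hand-made toy vectors. This is only an illustration of the principle: the numbers are invented, `analogy` is a hypothetical helper, and fasttext's internals may differ in details such as vector normalization.

```python
import numpy as np

# Hand-made toy vectors (assumption); the three axes loosely stand for
# "royal", "male" and "female".
toy_vectors = {
    "König": np.array([1.0, 1.0, 0.0]),
    "Königin": np.array([1.0, 0.0, 1.0]),
    "Mann": np.array([0.0, 1.0, 0.0]),
    "Frau": np.array([0.0, 0.0, 1.0]),
    "Prinz": np.array([0.7, 1.0, 0.0]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(word_a, word_b, word_c, vectors, k=1):
    """Rank words by similarity to vec(word_a) - vec(word_b) + vec(word_c)."""
    query = vectors[word_a] - vectors[word_b] + vectors[word_c]
    excluded = {word_a, word_b, word_c}  # input words are not valid answers
    scored = [
        (cosine_similarity(query, vector), word)
        for word, vector in vectors.items()
        if word not in excluded
    ]
    return sorted(scored, reverse=True)[:k]


print(analogy("König", "Mann", "Frau", toy_vectors))
```

Subtracting `Mann` removes the "male" component from `König`, and adding `Frau` contributes the "female" component, so the query vector lands on `Königin` in this toy space.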

In [9]:
# get the vector
ft.get_word_vector("König")

array([-7.30943680e-02,  1.31165683e-01, -3.24210897e-02, -1.80250704e-02,
        2.38739066e-02, -4.41337973e-02, -6.20340370e-02, -8.23560171e-03,
        1.73524451e-02, -3.84017229e-02,  1.07646901e-02,  4.14406508e-02,
       -3.67401876e-02, -5.90456091e-03, -4.32579815e-02,  5.14640380e-03,
        6.35619089e-02, -8.11023265e-02, -6.33341633e-03, -6.03447482e-03,
       -3.18285897e-02, -4.17994037e-02, -1.36623690e-02,  2.55087372e-02,
       -9.64077748e-03,  7.87197277e-02,  7.53838345e-02,  3.98097150e-02,
       -1.94040798e-02,  4.53237370e-02, -7.12714046e-02,  5.16645126e-02,
       -7.53148273e-02,  9.91312880e-03, -2.51716301e-02, -4.70333844e-02,
        7.94803165e-03,  9.97908413e-03,  4.16203849e-02, -5.36185205e-02,
        9.62086581e-03, -1.10661499e-01, -4.43915427e-02, -1.02308542e-01,
       -3.96078527e-02,  1.55621711e-02,  4.00714986e-02, -3.04852035e-02,
       -2.95244269e-02,  1.11928508e-01,  4.01469804e-02, -5.12187593e-02,
        8.58442485e-02,  

Let's build a synonym provider using word embeddings (i.e. fasttext).

<font color='red'>**TASK: Implement `synonym_provider` using fasttext's [`get_nearest_neighbors`](https://fasttext.cc/docs/en/unsupervised-tutorial.html#nearest-neighbor-queries) function (as shown above) and provide the functionality as described in the function documentation.**</font>

In [10]:
def synonym_provider(word):
    """Provides a list of synonyms for the given word

    Parameters
    ----------
    word : str
        The word to look up synonyms for
        
    Returns
    -------
    list
        A list of synonyms for the given word
    """
    
    # TODO: !!! place your code here !!!
    ####################################
    # !!! this needs rework !!!
    synonyms = [word]

    ###################
    # TODO: !!! end !!!
    
    return synonyms

In [11]:
synonym_provider("Arzt")

['Hausarzt',
 'Kinderarzt',
 'Ärztin',
 'Augenarzt',
 'Lungenarzt',
 'Frauenarzt',
 'Arztin',
 'Psychiater',
 'Zahnarzt',
 'Mediziner',
 'Röntgenarzt',
 'Tierarzt',
 'arzt',
 'Familienarzt',
 'Chirurg',
 'Vertretungsarzt',
 'Neurologe',
 'HNO-Arzt',
 'Allgemeinmediziner',
 'Krankenhausarzt']
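
The list above also contains entries such as `arzt`, which differs from the query only in case, and lower-scoring neighbors that drift away from the original meaning. One optional post-processing step is sketched below. It assumes neighbors arrive as `(score, word)` tuples like those printed earlier; `filter_neighbors` and the threshold `min_score` are hypothetical and not part of fasttext.

```python
def filter_neighbors(query, neighbors, min_score=0.69):
    """Keep neighbor words that score high enough and are not case variants.

    neighbors: list of (score, word) tuples, e.g. as printed above.
    min_score: hypothetical cut-off, to be tuned on real data.
    """
    kept = []
    for score, word in neighbors:
        if score < min_score:
            continue  # too dissimilar to be a safe replacement
        if word.lower() == query.lower():
            continue  # same word, only the capitalization differs
        kept.append(word)
    return kept


# A few (score, word) pairs copied from the "Arzt" output above (rounded).
sample_neighbors = [
    (0.8161, "Hausarzt"),
    (0.7411, "Ärztin"),
    (0.6966, "arzt"),
    (0.6781, "Krankenhausarzt"),
]

print(filter_neighbors("Arzt", sample_neighbors))
```

Whether such filtering helps depends on the downstream task: a stricter threshold yields fewer but safer replacements.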

<font color='red'>**Question: Why is it a good idea to use fasttext to find synonyms for German words (i.e. why would word2vec not work that well)? Are there alternatives we could use?**</font>

<font color='green'>Your answer...</font>

In [12]:
import random
random.seed(1)

def synonym_replacement(words, unique_words, n, synonym_provider):
    """Generates new texts by replacing one randomly chosen word with its synonyms

    Parameters
    ----------
    words : list
        The word tokens
    unique_words : set
        The set of unique words that are candidates for replacement
    n : int
        The maximum number of texts to generate (one per synonym)
    synonym_provider : function
        The function to provide a list of synonyms for a specific word
        
    Returns
    -------
    list
        The new text sequences 
    """
    
    # pick one random candidate word (relies on the `random` import at the top of this cell)
    random_words = list(unique_words)
    random.shuffle(random_words)
    random_word = random_words[0]
    synonyms = synonym_provider(random_word)
    #random.shuffle(synonyms)
    sentences = []
    
    # build one new text per synonym, up to n texts
    for i in range(min(n, len(synonyms))):
        synonym = synonyms[i]
        new_words = [synonym if word == random_word else word for word in words]
        sentences.append(' '.join(new_words))

    return sentences


def augment_text(text, stopwords, synonym_provider, num_aug=8):
    """The main augmentation function

    Parameters
    ----------
    text : str
        The text
    stopwords : set
        The set of stopwords
    synonym_provider : function
        The function to provide a synonym for a specific word
    num_aug : int
        The number of generated augmented sentences per original sentence
        
    Returns
    -------
    list
        The new generated text sequences 
    """
    
    from fhnw.nlp.utils.processing import is_iterable

    if isinstance(text, str):
        from fhnw.nlp.utils.defaults import default_tokenizer
        words = default_tokenizer()(text)
    elif is_iterable(text):
        words = text
    else:
        raise TypeError("Only string or iterable is supported. Received a "+ str(type(text)))
    
    unique_words = {word for word in words if word not in stopwords}
    if len(unique_words) == 0:
        # stop here
        return []
        
    return synonym_replacement(words, unique_words, num_aug, synonym_provider)

In [14]:
stopwords = set(["ich", "bin", "mit", "diesem"])

augment_text("ich bin sehr unzufrieden mit diesem Arzt", stopwords, synonym_provider, num_aug=5)

['ich bin äußerst unzufrieden mit diesem Arzt',
 'ich bin extrem unzufrieden mit diesem Arzt',
 'ich bin überaus unzufrieden mit diesem Arzt',
 'ich bin außerordentlich unzufrieden mit diesem Arzt',
 'ich bin ziemlich unzufrieden mit diesem Arzt']
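
To use such augmented texts for training, each new text needs the label of the text it was derived from. The following is a minimal sketch under stated assumptions: `augment_dataset` is a hypothetical helper, and `toy_augment` is a stand-in for a call such as `augment_text(text, stopwords, synonym_provider)` so that the example runs without a fasttext model.

```python
def augment_dataset(samples, augment_fn):
    """Return the original (text, label) pairs plus augmented copies.

    Each augmented text inherits the label of its source text.
    """
    augmented = list(samples)
    for text, label in samples:
        augmented.extend((new_text, label) for new_text in augment_fn(text))
    return augmented


def toy_augment(text):
    """Stand-in augmenter (assumption): swaps "sehr" for fixed alternatives."""
    if "sehr" not in text.split():
        return []
    return [text.replace("sehr", alternative) for alternative in ("äußerst", "extrem")]


samples = [
    ("ich bin sehr unzufrieden mit diesem Arzt", "negative"),
    ("alles gut", "positive"),
]

for text, label in augment_dataset(samples, toy_augment):
    print(label, "|", text)
```

Augmentation is usually applied to the training split only, so that validation and test data still reflect the original distribution.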