# Application: fake news and deceptive language detection

In this notebook, we will look at how we can use hybrid embeddings in the context of NLP tasks. In particular, we will see how to use and adapt deep learning architectures to take into account hybrid knowledge sources to classify documents. 

## Basic document classification using deep learning
First, we will introduce a basic pipeline for training a deep learning model to perform text classification.

### Dataset: deceptive language (fake hotel reviews)
As a first dataset, we will use the [deceptive opinion spam](http://myleott.com/op-spam.html) dataset (see the exercises below for a couple of more challenging datasets on fake news detection).

This corpus contains:
  * 400 truthful positive reviews from TripAdvisor
  * 400 deceptive positive reviews from Mechanical Turk
  * 400 truthful negative reviews from Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor and Yelp
  * 400 deceptive negative reviews from Mechanical Turk
  
The dataset is described in more detail in the following papers:
  
  [M. Ott, Y. Choi, C. Cardie, and J.T. Hancock. 2011. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.](http://arxiv.org/abs/1107.4557)
  
 [M. Ott, C. Cardie, and J.T. Hancock. 2013. Negative Deceptive Opinion Spam. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.](http://www.aclweb.org/anthology/N13-1053)
 
 For convenience, we have included the dataset as part of our GitHub tutorial repository.

In [0]:
%ls

In [0]:
%cd /content
!git clone https://github.com/HybridNLP2018/tutorial
!head -n2 /content/tutorial/datasamples/deceptive-opinion.csv

The last two lines show that the dataset is distributed as a comma-separated-value file with various fields. For our purposes, we are only interested in fields:
 * `deceptive`: this can be either *truthful* or *deceptive*
 * `text`: the plain text of the review
 
The other fields: `hotel` (name), `polarity` (positive or negative) and `source` (where the review comes from) are not relevant for us in this notebook.

Let's first load the dataset in a format that is easier to feed into a text classification model. What we need is an object with fields:
  * `texts`: an array of texts
  * `categories`:  an array of textual tags (e.g. *truthful* or *deceptive*) 
  * `tags`: an array of integer tags (the categories)
  * `id2tag`: a map from the integer identifier to the textual identifier for the tag
  
The following cell produces such an object:

In [0]:
import pandas as pd # for handling tables as DataFrames
import tutorial.scripts.classification as clsion # library for text classification

In [0]:
hotel_df = pd.read_csv('/content/tutorial/datasamples/deceptive-opinion.csv',
                   names=["deceptive", "hotel", "polarity", "source", "text"])
hotel_df = hotel_df[1:].reset_index() # first row is the header, so remove
hotel_wnscd_df = pd.read_csv('/content/tutorial/datasamples/deceptive-opinion.tlgs_wnscd',
                            names=['text_tlgs_wnscd'])
hotel_df = pd.concat([hotel_df, hotel_wnscd_df], axis=1)
raw_hotel_ds = clsion.read_classification_corpus(hotel_df, text_fields=['text'], tag_field='deceptive')
raw_hotel_wnscd_ds = clsion.read_classification_corpus(hotel_df, text_fields=['text_tlgs_wnscd'], tag_field='deceptive')

The previous cell has actually loaded two versions of the dataset: 
  * `raw_hotel_ds` contains the actual texts as originally published
  * `raw_hotel_wnscd_ds` provides the WordNet disambiguated `tlgs` tokenization (see notebook 03 on Vecsigrafo for more details about this format). This is needed because we don't have a Python method to automatically disambiguate a text using WordNet, so we provide this disambiguated version as part of the GitHub repo for this tutorial.
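
To make the target structure concrete, here is a minimal sketch of how an object with the four fields described above could be built from two parallel lists. The helper `build_corpus` and the toy reviews are hypothetical; the actual `read_classification_corpus` implementation may differ (for example in how it assigns integer ids).

```python
# Hypothetical helper (not part of `clsion`): builds the four fields
# described above from parallel lists of texts and textual categories.
def build_corpus(texts, categories):
    # Sort the distinct categories so the integer ids are deterministic.
    id2tag = dict(enumerate(sorted(set(categories))))
    tag2id = {tag: i for i, tag in id2tag.items()}
    return {
        'texts': list(texts),
        'categories': list(categories),
        'tags': [tag2id[c] for c in categories],
        'id2tag': id2tag,
    }

# Made-up example reviews, for illustration only.
toy_ds = build_corpus(
    ['The room was spotless.', 'Best hotel ever, truly amazing!'],
    ['truthful', 'deceptive'])
print(toy_ds['tags'], toy_ds['id2tag'])
```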

In [0]:
hotel_df[:5]

We can print a couple of examples from both datasets. 

In [0]:
clsion.sanity_check(raw_hotel_ds)

In [0]:
clsion.sanity_check(raw_hotel_wnscd_ds)

Cleaning the raw text often produces better results; we can do this as follows:

In [0]:
cl_hotel_ds = clsion.clean_ds_texts(raw_hotel_ds)
clsion.sanity_check(cl_hotel_ds)

### Tokenize and index the dataset

As we said above, the raw datasets consist of `texts`, `categories` and `tags`. There are different ways to process the texts before passing them to a deep learning architecture, but typically they involve:
 * **tokenization**: how to split each document into basic units which can be represented as vectors. In this notebook we will use tokenizations which result in words and synsets, but there are also architectures that accept characters or character n-grams.
 * **indexing** of the text: in this step, the tokenized text is compared to a **vocabulary**, a list of words (if no vocabulary is provided, the tokenized text can be used to create one), so that each token is assigned a unique integer identifier. Tokens will later be represented as embeddings, i.e. row vectors in a matrix, and the identifier tells you which row of the matrix corresponds to which token in the vocabulary.
 
The `clsion` library, included in the tutorial GitHub repo, already provides various indexing methods for text classification datasets. In the next cell we apply *simple indexing*, which uses white-space tokenization and creates a vocabulary based on the input dataset.

In [0]:
csim_hotel_ds = clsion.simple_index_ds(cl_hotel_ds)

Since the vocabulary was created based on the dataset, all tokens in the dataset are also in the vocabulary. In the next sections, we will see examples where embeddings are provided during indexing. 
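
As an illustration of the idea, white-space tokenization and vocabulary-based indexing can be sketched in a few lines. This is not the `clsion` implementation: the function names are hypothetical, and details that a real indexer usually handles (padding ids, unknown tokens, text cleaning) are left out.

```python
def build_vocab(texts):
    """Assign a fresh integer id to each new white-space token."""
    w2i = {}
    for text in texts:
        for token in text.split():
            if token not in w2i:
                w2i[token] = len(w2i)
    return w2i

def index_texts(texts, w2i):
    """Replace each token by its integer id in the vocabulary."""
    return [[w2i[token] for token in text.split()] for text in texts]

toy_texts = ['the room was clean', 'the staff was rude']
toy_w2i = build_vocab(toy_texts)
toy_ids = index_texts(toy_texts, toy_w2i)
print(toy_w2i)
print(toy_ids)
```

Because the vocabulary is built from the same texts that are indexed, the lookup `w2i[token]` can never fail here; with an external vocabulary, missing tokens would need explicit handling.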

The following cell prints a couple of characteristics of the indexed dataset.

In [0]:
print(
    'vocab size:', len(csim_hotel_ds['vocab_embedding']['w2i']),
    'dim:', csim_hotel_ds['vocab_embedding']['dim'],
    'vectors:', csim_hotel_ds['vocab_embedding']['vecs'])

As we can see, the vocabulary is quite small (about 11K words). By default, it specifies that the vocabulary embeddings should be of dimension 150, but no vectors are specified. This means the model can assign random embeddings to the 11K words.
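
To make "random embeddings" concrete, the sketch below creates a randomly initialised embedding matrix with one row per vocabulary entry. The sizes are rounded from the values reported above, and the small uniform range is an assumption (a common default), not necessarily what `clsion` uses.

```python
import numpy as np

vocab_size, dim = 11000, 150  # approximate sizes reported above
rng = np.random.default_rng(seed=0)

# One row per token id; small uniform values are an assumed initialisation.
random_vecs = rng.uniform(-0.05, 0.05, size=(vocab_size, dim))

token_id = 42  # arbitrary id, for illustration
print(random_vecs.shape, random_vecs[token_id].shape)
```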

### Define the experiment to run
The `clsion` library allows us to specify experiments to run: given an indexed dataset, we can execute a text classification experiment by specifying various hyper-parameters as follows:


  

In [0]:
experiment1 = {
    'hotel_csim': {
        'indexed_dataset': csim_hotel_ds,
        'executor': clsion.execute_experiment,
        'hparams': clsion.merge_hparams([
            clsion.common_hparams, clsion.biLSTM_hparams, 
            clsion.calc_hparams(csim_hotel_ds), 
            {    
                'epochs': 20
            }
        ])
    }
}
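
The experiment definition combines several hyper-parameter dictionaries into one. We have not inspected `merge_hparams` here, so the following is only a sketch under the assumption that later dictionaries override earlier ones (which is why `'epochs': 20` is listed last). The function `merge_dicts` and all values are made up for illustration.

```python
def merge_dicts(dicts):
    """Merge dictionaries left to right; later values win (assumed behaviour)."""
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged

toy_common = {'epochs': 10, 'batch_size': 32}      # made-up defaults
toy_model = {'model': 'biLSTM', 'lstm_units': 64}  # made-up values
toy_hparams = merge_dicts([toy_common, toy_model, {'epochs': 20}])
print(toy_hparams)
```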

Under the hood, the library creates a bidirectional LSTM model as requested (the library can also create other model architectures such as convolutional NNs).

Since our dataset is fairly small, we don't need a very deep model. A fairly simple bidirectional LSTM should be sufficient. The generated model will consist of the following layers:
  * The **input layer** is a tensor of shape $(l, )$, where $l$ is the number of tokens in each document. The empty second parameter lets us pass the model a varying number of input documents, as long as they all have the same number of tokens.
  * The **embedding layer** converts each input document (a sequence of word ids) into a sequence of embeddings. Since we are not yet using pre-computed embeddings, these will be generated at random and trained with the rest of the parameters in the model.
  * The **LSTM layers**: one or more bidirectional LSTMs. Explaining these in detail is out of the scope of this tutorial. Suffice it to say, each layer goes through each embedding in the sequence and produces a new embedding taking into account the preceding and following embeddings. The final layer only produces a single embedding, which represents the full document.
  * The **dense layer** is a fully connected neural network that maps the output embedding of the final layer to a vector of 2 dimensions, which can be compared to the manually labelled tag.
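
The shapes flowing through these layers can be traced with a small NumPy sketch. Note that this is not the model built by `clsion`: a plain tanh recurrence stands in for the LSTM cell, the weights are random and untrained, the final softmax is an assumption, and all sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, l, vocab_size, emb_dim, hidden, n_classes = 4, 12, 50, 150, 8, 2

# Input layer: a batch of documents, each a sequence of l token ids.
token_ids = rng.integers(0, vocab_size, size=(batch, l))

# Embedding layer: look up one row of the embedding matrix per token id.
emb_matrix = rng.normal(size=(vocab_size, emb_dim))
embedded = emb_matrix[token_ids]  # shape (batch, l, emb_dim)

def simple_rnn_last_state(seq, w_in, w_rec):
    """Run a plain tanh RNN (LSTM stand-in) and return the final state."""
    state = np.zeros((seq.shape[0], w_rec.shape[0]))
    for t in range(seq.shape[1]):
        state = np.tanh(seq[:, t, :] @ w_in + state @ w_rec)
    return state

# Bidirectional layer: one left-to-right pass, one right-to-left pass.
w_in_f = rng.normal(size=(emb_dim, hidden)) * 0.1
w_rec_f = rng.normal(size=(hidden, hidden)) * 0.1
w_in_b = rng.normal(size=(emb_dim, hidden)) * 0.1
w_rec_b = rng.normal(size=(hidden, hidden)) * 0.1
forward = simple_rnn_last_state(embedded, w_in_f, w_rec_f)
backward = simple_rnn_last_state(embedded[:, ::-1, :], w_in_b, w_rec_b)
doc_vec = np.concatenate([forward, backward], axis=1)  # (batch, 2 * hidden)

# Dense layer: map each document vector to 2 scores, normalised by softmax.
w_out = rng.normal(size=(2 * hidden, n_classes)) * 0.1
logits = doc_vec @ w_out
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(embedded.shape, doc_vec.shape, probs.shape)
```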
  
Finally, we can run our experiment using the `n_cross_val` method. Unless your environment has a GPU, this can be a bit slow, so we only train a model once. (In practice, model results may vary due to random initializations, so it's usually a good idea to run the same model several times to get an average evaluation metric and an idea of how stable the model is.)

In [0]:
ex1_df, ex1_best_run = clsion.n_cross_val(experiment1, n=1)

The first element of the result is a DataFrame containing test results and a record of the used parameters.

In [0]:
ex1_df
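
If you do run the same experiment several times, the test accuracies can be summarised by their mean and standard deviation. The sketch below uses made-up accuracies from five hypothetical runs; it does not rely on the column names of the DataFrame returned by `n_cross_val`, which we do not assume here.

```python
import statistics

# Made-up test accuracies from five hypothetical runs of one experiment.
run_accuracies = [0.84, 0.86, 0.83, 0.87, 0.85]

mean_acc = statistics.mean(run_accuracies)
std_acc = statistics.stdev(run_accuracies)  # sample standard deviation
print(f'mean acc: {mean_acc:.4f}, std acc: {std_acc:.4f}')
```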

### Discussion

Bidirectional LSTMs are really good at learning patterns in text. However, this way of training a model will tend to overfit the training dataset. Since our dataset is fairly small and narrow (it only contains hotel reviews), we should not expect this model to be able to detect fake reviews about other products or services. Similarly, we should not expect this model to be applicable to detecting other types of deceptive texts such as fake news.

The reason why such a model is very tied to the training dataset is that even the vocabulary is derived from the dataset: it will be biased towards words (and senses of those words) related to hotel reviews. Vocabulary about other products, services and topics cannot be learned from the input dataset.
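
One way to see this limitation is to measure how many tokens of an out-of-domain text are missing from a dataset-derived vocabulary. The following sketch uses a tiny made-up vocabulary and sentence; `oov_rate` is a hypothetical helper, not part of `clsion`.

```python
def oov_rate(text, vocab):
    """Return the fraction of tokens missing from vocab, and those tokens."""
    tokens = text.lower().split()
    missing = [t for t in tokens if t not in vocab]
    return len(missing) / len(tokens), missing

# Made-up vocabulary, as it might be derived from hotel reviews.
toy_vocab = {'the', 'room', 'was', 'clean', 'staff', 'friendly', 'hotel'}
rate, missing = oov_rate('The senator was indicted yesterday', toy_vocab)
print(rate, missing)
```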

Furthermore, since no pre-trained embeddings were used, the model had to learn the embedding weights from scratch based on the signal provided by the 'deceptive' tags. It did not have an opportunity to learn more generic relations between words from a wider corpus.

For these reasons it is a good idea to use pre-trained embeddings as we show in the following sections.

## Using HolE embeddings

In this section we use embeddings learned using `HolE` and trained on WordNet 3.0. As we have seen in previous notebooks, such embeddings capture the relations specified in the WordNet knowledge graph. As such, synset embeddings tend to encode useful knowledge. However, lemma embeddings tend to be of poorer quality.

### Download the embeddings

Execute the following cell to download and unpack the embeddings. If you recently executed previous notebooks as part of this tutorial, you may still have these in your environment.

In [0]:
!mkdir /content/vec/
%cd /content/vec/
!wget https://zenodo.org/record/1446214/files/wn-en-3.0-HolE-500e-150d.tar.gz
!tar -xzf wn-en-3.0-HolE-500e-150d.tar.gz

In [0]:
%ls /content/vec/

### Load the embeddings and convert to the format expected by `clsion`

The provided embeddings are in `swivel`'s binary + vocab format. However, the `clsion` library expects a different Python data structure. Furthermore, it will be easier to match the lemmas in the dataset to plain text rather than the `lem_<lemma_word>` format used to encode the HolE vocabulary, hence we need to do some cleaning of the vocabulary. This occurs in the following cells:

In [0]:
import tutorial.scripts.swivel.vecs as vecs
vocab_file = '/content/vec/wn-en-3.1-HolE-500e.vocab.txt'
holE_voc_file = '/content/vec/wn-en-3.1-HolE-500e.clean.vocab.txt'
with open(holE_voc_file, 'w', encoding='utf_8') as wf:
  with open(vocab_file, 'r', encoding='utf_8') as f:
    for word in f.readlines():
      word = word.strip()
      if not word:
        continue
      if word.startswith('lem_'):
        word = word.replace('lem_', '').replace('_', ' ')
      print(word, file=wf)
vecbin = '/content/vec/wn-en-3.1-HolE-500e.tsv.bin'
wnHolE = vecs.Vecs(holE_voc_file, vecbin)

In [0]:
import array
import tutorial.scripts.swivel.vecs as vecs

def load_swivel_bin_vocab_embeddings(bin_file, vocab_file):
    vectors = vecs.Vecs(vocab_file, bin_file)
    vecarr = array.array(str('d'))
    for idx in range(len(vectors.vocab)):
        vec = vectors.vecs[idx].tolist()[0]
        vecarr.extend(float(x) for x in vec)
    return {'itos': vectors.vocab,
            'stoi': vectors.word_to_idx,
            'vecs': vecarr,
            'source': 'swivel' + bin_file,
            'dim': vectors.vecs.shape[1]}
wnHolE_emb=load_swivel_bin_vocab_embeddings(vecbin, holE_voc_file)

Now that we have the WordNet HolE embedding in the right format, we can explore some of the 'words' in the vocabulary:

In [0]:
wnHolE_emb['itos'][150000] # integer to string

### Tokenize and index the dataset
As in the previous case, we need to tokenize the raw dataset. However, since we now have access to the WordNet HolE embeddings, it makes sense to use the WordNet disambiguated version of the text (i.e. `raw_hotel_wnscd_ds`). The `clsion` library already provides a method `index_ds_wnet` to perform tokenization and indexing using the expected WordNet encoding for synsets.

In [0]:
wn_hotel_ds = clsion.index_ds_wnet(raw_hotel_wnscd_ds, wnHolE_emb)

In [0]:
print(
    'vocab size:', len(wn_hotel_ds['vocab_embedding']['w2i']),
    'dim:', wn_hotel_ds['vocab_embedding']['dim'])

The above produces an `ls` tokenization of the input text, which means that each original token is mapped to both a lemma and a synset. The model will then use both of these to map each token to the concatenation of the lemma and synset embedding. Since the WordNet HolE has 150 dimensions, each token will be represented by a 300 dimensional embedding (the concatenation of the lemma and synset embedding).
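
The concatenation step can be illustrated as follows. The vectors are random stand-ins for the pre-trained ones, and the lemma and synset keys (including the exact synset identifier format) are made up for illustration.

```python
import numpy as np

dim = 150  # dimension of the WordNet HolE embeddings
rng = np.random.default_rng(0)

# Random stand-ins for pre-trained vectors; the keys are illustrative only.
lemma_vecs = {'hotel': rng.normal(size=dim)}
synset_vecs = {'wn31_hotel.n.01': rng.normal(size=dim)}

def ls_vector(lemma, synset):
    """Represent a token by its lemma vector followed by its synset vector."""
    return np.concatenate([lemma_vecs[lemma], synset_vecs[synset]])

token_vec = ls_vector('hotel', 'wn31_hotel.n.01')
print(token_vec.shape)
```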

### Define the experiment and run
We define the experiment using this new dataset as follows. The main change is that we do not want the embedding layer to be trainable, since we want to maintain the knowledge learned via HolE from WordNet. The model should only train the LSTM and dense layers to predict whether the input text is deceptive or not.

In [0]:
experiment2 = {
    'hotel_wn_holE': {
        'indexed_dataset': wn_hotel_ds,
        'executor': clsion.execute_experiment,
        'hparams': clsion.merge_hparams([
            clsion.common_hparams, clsion.biLSTM_hparams, 
            clsion.calc_hparams(wn_hotel_ds), 
            {    
                'epochs': 20,
                'emb_trainable': False
            }
        ])
    }
}

In [0]:
ex2_df, ex2_best_run = clsion.n_cross_val(experiment2, n=1)

In [0]:
ex2_df

### Discussion
Although the model performs worse than the `csim` version, we can expect the model to be applicable to closely related domains. The hope is that, even if words did not appear in the training dataset, the model will be able to exploit embedding similarities learned from WordNet to generalise the 'deceptive' classification.
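
The generalisation argument rests on related words having similar vectors, which is usually measured with cosine similarity. The sketch below uses tiny made-up 3-dimensional vectors rather than the actual HolE embeddings, so the numbers only illustrate the mechanism.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors: 'motel' is placed close to 'hotel', 'banana' far away.
toy_vectors = {
    'hotel': np.array([0.9, 0.1, 0.0]),
    'motel': np.array([0.8, 0.2, 0.1]),
    'banana': np.array([0.0, 0.1, 0.9]),
}
sim_related = cosine(toy_vectors['hotel'], toy_vectors['motel'])
sim_unrelated = cosine(toy_vectors['hotel'], toy_vectors['banana'])
print(sim_related, sim_unrelated)
```

With frozen pre-trained embeddings, a word that never occurred in the training reviews still enters the LSTM as a vector close to those of related words that did occur.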

## Using Vecsigrafo UMBC WNet embeddings


### Download the embeddings

If you executed previous notebooks, you may already have the embedding in your environment.

In [0]:
%mkdir /content/umbc
%mkdir /content/umbc/vec
full_precomp_url = 'https://zenodo.org/record/1446214/files/vecsigrafo_umbc_tlgs_ls_f_6e_160d_row_embedding.tar.gz'
full_precomp_targz = '/content/umbc/vec/tlgs_wnscd_ls_f_6e_160d_row_embedding.tar.gz'
!wget {full_precomp_url} -O {full_precomp_targz}

In [0]:
!tar -xzf {full_precomp_targz} -C /content/umbc/vec/
full_precomp_vec_path = '/content/umbc/vec/vecsi_tlgs_wnscd_ls_f_6e_160d'

In [0]:
%ls /content/umbc/vec/vecsi_tlgs_wnscd_ls_f_6e_160d/

Since the embeddings were distributed as `tsv` files, we can use the `load_tsv_embeddings` method. Training models with all 1.4M vocab elements requires a lot of RAM, so we limit ourselves to only the first 250K vocab elements (these are the most frequent lemmas and synsets in UMBC).

In [0]:
def simple_lemmas(word):
  if word.startswith('lem_'):
    return word.replace('lem_', '').replace('_', ' ')
  else:
    return word
    
wn_vecsi_umbc_emb = clsion.load_tsv_embeddings(full_precomp_vec_path + '/row_embedding.tsv', 
                                               max_words=250000,
                                               word_map_fn=simple_lemmas
                                              )
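
For readers curious about what such a loader might do, here is a minimal sketch that reads at most `max_words` rows and applies a word-mapping function. It assumes each line holds a word followed by tab-separated vector components, which may not match the actual file layout, and `load_tsv_sketch` is a hypothetical stand-in rather than the `clsion` implementation.

```python
import io

def load_tsv_sketch(lines, max_words=None, word_map_fn=None):
    """Read 'word<TAB>v1<TAB>v2...' lines, stopping after max_words rows."""
    itos, stoi, vecs = [], {}, []
    for line in lines:
        if max_words is not None and len(itos) >= max_words:
            break
        word, *values = line.rstrip('\n').split('\t')
        if word_map_fn is not None:
            word = word_map_fn(word)
        stoi[word] = len(itos)
        itos.append(word)
        vecs.append([float(v) for v in values])
    dim = len(vecs[0]) if vecs else 0
    return {'itos': itos, 'stoi': stoi, 'vecs': vecs, 'dim': dim}

def strip_lemma_prefix(word):
    """Turn 'lem_front_desk' into 'front desk'; leave other entries as-is."""
    if word.startswith('lem_'):
        return word[len('lem_'):].replace('_', ' ')
    return word

# Made-up three-row file; only the first two rows are kept.
toy_tsv = io.StringIO(
    'lem_front_desk\t0.1\t0.2\nwn31_hotel\t0.3\t0.4\nlem_rare\t0.5\t0.6\n')
toy_tsv_emb = load_tsv_sketch(toy_tsv, max_words=2,
                              word_map_fn=strip_lemma_prefix)
print(toy_tsv_emb['itos'], toy_tsv_emb['dim'])
```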

### Tokenize and index dataset

In [0]:
wn_v_umbc_hotel_ds = clsion.index_ds_wnet(raw_hotel_wnscd_ds, wn_vecsi_umbc_emb)

In [0]:
print(
    'vocab size:', len(wn_v_umbc_hotel_ds['vocab_embedding']['w2i']),
    'dim:', wn_v_umbc_hotel_ds['vocab_embedding']['dim'])

### Define the experiment and run

In [0]:
experiment3 = {
    'hotel_wn_vecsi_umbc': {
        'indexed_dataset': wn_v_umbc_hotel_ds,
        'executor': clsion.execute_experiment,
        'hparams': clsion.merge_hparams([
            clsion.common_hparams, clsion.biLSTM_hparams, 
            clsion.calc_hparams(wn_v_umbc_hotel_ds), 
            {    
                'epochs': 20,
                'emb_trainable': False
            }
        ])
    }
}

In [0]:
ex3_df, ex3_best_run = clsion.n_cross_val(experiment3, n=1)

In [0]:
ex3_df

## Combine HolE and UMBC embeddings
One of the advantages of embeddings as a knowledge representation device is that they are trivial to combine. In the previous experiments we have tried to use lemma and synset embeddings derived from:
  * WordNet via HolE: these embeddings *encode* the knowledge derived from the structure of the WordNet Knowledge Graph
  * the Shallow Connectivity disambiguation of the UMBC corpus: these embeddings *encode* the knowledge derived from trying to predict the lemmas and synsets from their contexts.
  
Since the embeddings encode different types of knowledge, it can be useful to use both embeddings at the same time when passing them to the deep learning model, as shown in this section.

### Combine the embeddings
We use the `concat_embs` method, which will go through the vocabularies of both input embeddings and concatenate them. Missing embeddings from one vocabulary will be mapped to the zero vector. Note that since `wnHolE_emb` has dimension 150 and `wn_vecsi_umbc_emb` has dimension 160, the resulting embedding will have dimension 310. (Besides concatenation, you could also experiment with other merging operations such as summation, subtraction or averaging of the embeddings.)

In [0]:
wn_vh_emb = clsion.concat_embs(wn_vecsi_umbc_emb, wnHolE_emb)
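
The idea behind this kind of concatenation, including the zero-vector fallback for words missing from one vocabulary, can be sketched with plain dictionaries. This is an illustration with made-up 3- and 2-dimensional vectors, not the `concat_embs` implementation, which operates on the embedding format shown earlier.

```python
import numpy as np

def concat_sketch(emb_a, emb_b, dim_a, dim_b):
    """Concatenate two word->vector maps; missing vectors become zeros."""
    words = list(emb_a) + [w for w in emb_b if w not in emb_a]
    combined = {}
    for w in words:
        vec_a = emb_a.get(w, np.zeros(dim_a))
        vec_b = emb_b.get(w, np.zeros(dim_b))
        combined[w] = np.concatenate([vec_a, vec_b])
    return combined

# Made-up embeddings: 'room' only exists in toy_a, 'lobby' only in toy_b.
toy_a = {'hotel': np.ones(3), 'room': np.ones(3)}
toy_b = {'hotel': np.full(2, 2.0), 'lobby': np.full(2, 2.0)}
toy_ab = concat_sketch(toy_a, toy_b, dim_a=3, dim_b=2)
print(toy_ab['lobby'])
```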

In [0]:
synsets =  [w for w in wn_vh_emb['itos'] if w.startswith('wn31_')]
print('vocab has', len(wn_vh_emb['itos']), '"words",', len(synsets), 'of which are synsets')

In [0]:
wn_vh_hotel_ds = clsion.index_ds_wnet(raw_hotel_wnscd_ds, wn_vh_emb)

In [0]:
experiment4 = {
    'hotel_wn_vecsi_umbc_holE': {
        'indexed_dataset': wn_vh_hotel_ds,
        'executor': clsion.execute_experiment,
        'hparams': clsion.merge_hparams([
            clsion.common_hparams, clsion.biLSTM_hparams, 
            clsion.calc_hparams(wn_vh_hotel_ds), 
            {    
                'epochs': 20,
                'emb_trainable': False
            }
        ])
    }
}

In [0]:
ex4_df, _ = clsion.n_cross_val(experiment4, n=1)

## Discussion and results
In this notebook we have shown how to use different types of embeddings as part of a deep learning text classification pipeline. We have not performed detailed experiments on the WordNet-based embeddings used in this notebook and, because the dataset is fairly small, the results can have quite a bit of variance depending on the initialization parameters. However, we have performed studies based on Cogito-based embeddings. The tables below show some of our results.

The first set of results corresponds to experiment 1 above. We trained the embeddings but explored various tokenization strategies.


 | code      | $\mu$ acc | $\sigma$ acc | tok         | vocab | emb                 | trainable             |
 | -------   | --------- | ------------ | ----------- | ----- | ------------------- | --------------------- |
 | sim       |  0.8200   | 0.023        | ws          | ds    | random              | y                     | 
 | tok       |  0.8325   | 0.029        | keras       | ds    | random              | y                     | 
 | csim      |  0.8513   | 0.014        | clean ws    | ds    | random              | y                     | 
 | ctok      |  0.8475   | 0.026        | clean keras | ds    | random              | y                     | 

As discussed above, this approach produces the best test results, but the trained models are very specific to the training dataset. The current practice is therefore to use pre-trained word-embeddings. FastText embeddings tend to yield the best performance. We got the following results.

 | code      | $\mu$ acc | $\sigma$ acc | tok         | vocab | emb                 | trainable             |
 | -------   | --------- | ------------ | ----------- | ----- | ------------------- | --------------------- |
| ft-wiki   |  0.7356   | 0.042        | ws          | 250K  | `wiki-en.vec`       | n                     |
 | ft-wiki   |  0.7775   | 0.044        | clean ws    | 250K  | `wiki-en.vec`       | n                     |
 
Next, we tried using HolE embeddings trained on Sensigrafo 14.2, which gave very poor results:
 
 | code      | $\mu$ acc | $\sigma$ acc | tok         | vocab | emb                 | trainable             |
 | -------   | --------- | ------------ | ----------- | ----- | ------------------- | --------------------- | 
 | HolE_sensi   |  0.6512   | 0.044        | cogito `s`  | 250K  | `HolE-en.14.2_500e` | n                    |

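Next, we tried Vecsigrafo embeddings trained on both Wikipedia and UMBC, using only lemmas, only syncons, or both lemmas and syncons. In the table below, the differences between these variants are small relative to the standard deviations, and using both lemmas and syncons is not consistently better: for UMBC, the syncon-only variant (`v_umbc_s`) has the highest mean accuracy.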

 | code      | $\mu$ acc | $\sigma$ acc | tok         | vocab | emb                 | trainable             |
 | -------   | --------- | ------------ | ----------- | ----- | ------------------- | --------------------- | 
| v_wiki_l  |  0.7450   | 0.050        | cogito `l`  | 250K  | `tlgs_ls_f_6e_160d` | n                     |
 | v_wiki_s  |  0.7363   | 0.039        | cogito `s`  | 250K  | `tlgs_ls_f_6e_160d` | n                     |
 | v_wiki_ls |  0.7450   | 0.032        | cogito `ls` | 250K  | `tlgs_ls_f_6e_160d` | n                     |
 | v_umbc_ls |  0.7413   | 0.038        | cogito `ls` | 250K  | `tlgs_ls_6e_160d`   | n                     |
 | v_umbc_l  |  0.7350   | 0.041        | cogito `l`  | 250K  | `tlgs_ls_6e_160d`   | n                     |
 | v_umbc_s  |  0.7606   | 0.032        | cogito `s`  | 250K  | `tlgs_ls_6e_160d`   | n                     |


Finally, as in experiment 4 above, we concatenated Vecsigrafo embeddings (both lemmas and syncons) with HolE embeddings (only syncons, since lemmas tend to be of poor quality). This produced the best results among the pre-trained embeddings, with a mean test accuracy of 79.31%. This is still lower than `csim`, but we expect this model to be more generic and applicable to other domains besides hotel reviews.

 | code      | $\mu$ acc | $\sigma$ acc | tok         | vocab | emb                 | trainable             |
 | -------   | --------- | ------------ | ----------- | ----- | ------------------- | --------------------- | 
| vw_H_s    |  0.7413   | 0.033        | cogito `s`  | 304K  | `tlgs_lsf`, `HolE`  | n                     |
 | vw_H_ls   |  0.7213   | 0.067        | cogito `ls` | 250K  | `tlgs_lsf`, `HolE`  | n                     |
 | vw_ls_H_s |  0.7275   | 0.041        | cogito `ls` | 250K  | `tlgs_lsf`, `HolE`  | n                     |
 | vu_H_s    |  0.7669   | 0.043        | cogito `s`  | 309K  | `tlgs_ls`, `HolE`   | n                     |
 | vu_ls_H_s |  0.7188   | 0.043        | cogito `ls` | 250K  | `tlgs_ls`, `HolE`   | n                     |
 | vu_ls_H_s |  0.7225   | 0.033        | cogito `l`  | 250K  | `tlgs_ls`, `HolE`   | n                     |
 | vu_ls_H_s |  0.7788   | 0.033        | cogito `s`  | 250K  | `tlgs_ls`, `HolE`   | n                     |
 | vu_ls_H_s |  0.7800   | 0.035        | cl cog `s`  | 250K  | `tlgs_ls`, `HolE`   | n                     |
 | vu_ls_H_s |  0.7644   | 0.044        | cl cog `l`  | 250K  | `tlgs_ls`, `HolE`   | n                     |
 | vu_ls_H_s |**0.7931** | 0.045        | cl cog `ls` | 250K  | `tlgs_ls`, `HolE`   | n                     |
 | vu_ls_H_s |  0.7838   | 0.028        | cl cog `s`  | 500K  | `tlgs_ls`, `HolE`   | n                     |
 | vu_ls_H_s |  ?        |  ?           | cl cog `l`  | 500K  | `tlgs_ls`, `HolE`   | n                     |
 | vu_ls_H_s |  0.7819   | 0.035        | cl cog `ls` | 500K  | `tlgs_ls`, `HolE`   | n                     |
 
Finally, we also experimented with contextual embeddings, a newer type of embedding described in [Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations.](http://arxiv.org/abs/1802.05365) However, we did not manage to obtain good results with this approach.
 
| code      | $\mu$ acc | $\sigma$ acc | tok         | vocab | emb                 | trainable             |
 | -------   | --------- | ------------ | ----------- | ----- | ------------------- | --------------------- |  
 | elmo      |  0.7250   | 0.039        | nltk sent   | $\infty$ | `elmo-5.5B`      | n (0.1 dropout)       |
 | elmo      |  0.7269   | 0.038        | nltk sent   | $\infty$ | `elmo-5.5B`      | n (0.5 dropout, 20ep) |
 

# Further Exercises

## Use the `fake_news` dataset from UMichigan
