# Direct comparison of methods of text comprehension
- using the [__Amazon Fine Food Reviews__](https://www.kaggle.com/snap/amazon-fine-food-reviews) dataset
- the dataset is a ```.csv``` file; the columns of interest are _Text_ and _Summary_

In [1]:
from main_ntb_backend import dataset, statsKeeper
from text_classifiers.text_rank_classifier import keywords_review as textrank

Number of data: 992


In [2]:
dataset._data['Text'][0]

'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'

In [3]:
dataset._data['Summary'][0]

'Good Quality Dog Food'

- the aim is to present different methods of text comprehension, namely:
    - ```tf-idf```
    - ```textrank```
    - abstractive methods based on recurrent neural networks
    
- the main result is the importance of text preprocessing: ```tf-idf``` fails significantly without correct preprocessing (shown in ```TF-IDFNotebook.ipynb```), and the RNNs achieve better results with it

## Term Frequency - Inverse Document Frequency
- each review is treated as a separate document
- the document frequency of a word is the number of reviews that contain it, counted in one pass over all the reviews
- term frequency is a statistic of a single review, calculated as $\frac{\text{occurrences of word}}{\text{number of words in the review}}$
- inverse document frequency is the negative logarithm of the word's relative document frequency, with 1 added to the document frequency: $\ln\left(\frac{\text{number of documents}}{\text{document frequency of word} + 1}\right)$

- for each review we keep the 5 words with the highest ```tf-idf``` score (term frequency times inverse document frequency) as a measure of the most important part of the document

- all the statistics are kept in ```statsKeeper```

In [4]:
statsKeeper._documents["0"].sorted_tf_idfs[:5]

[('better', 0.28648286272145923),
 ('product', 0.24837373829266315),
 ('vitality', 0.09399881643554253),
 ('labrador', 0.09399881643554253),
 ('appreciates', 0.09399881643554253)]
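
The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the ```statsKeeper``` implementation: the function name ```tf_idf_top_words```, the whitespace tokenisation and the toy reviews are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_top_words(reviews, top_k=5):
    """Return the top_k (word, tf-idf) pairs for each review."""
    # Assumed tokenisation: lowercase and split on whitespace.
    tokenized = [review.lower().split() for review in reviews]

    # Document frequency: number of reviews that contain each word.
    document_frequency = Counter()
    for words in tokenized:
        document_frequency.update(set(words))

    n_documents = len(tokenized)
    results = []
    for words in tokenized:
        counts = Counter(words)
        scores = {}
        for word, count in counts.items():
            tf = count / len(words)
            # Same formula as in the text: ln(N / (df + 1)).
            idf = math.log(n_documents / (document_frequency[word] + 1))
            scores[word] = tf * idf
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        results.append(ranked[:top_k])
    return results


# Toy reviews, invented for illustration only.
reviews = [
    "good dog food",
    "bad cat food",
    "good cat toy",
    "tasty green tea",
]
top_words = tf_idf_top_words(reviews, top_k=2)
print(top_words[0])
```

Note that with the ```+ 1``` in the denominator, a word that occurs in every review gets a slightly negative inverse document frequency, so it ranks below all words that are rarer.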

## TextRank algorithm
- from the paper ["TextRank: Bringing Order into Text"](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)
- ```TextRank``` is based on ```PageRank```; it evaluates the importance of a word in a document by the number of ```links``` to that word from other parts of the document
- the algorithm builds a graph of the words appearing in a single document and draws an edge between two words whenever they occur in the same window (meaning they are at most ```N``` words apart)
- the ```PageRank``` algorithm is then run on the graph
- ```PageRank``` computes the importance of a word as the probability of 'browsing' to its node, based on the number of incoming edges and the importance of the nodes at the other end of those edges

In [5]:
textrank(dataset._data['cleaned_text'][0])

0.2118     1  several vitality canned dog food products
[several vitality canned dog food products]
0.1950     1  good quality product
[good quality product]
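
The graph construction and the ranking step can be sketched as follows. This is a simplified illustration that ranks single words, whereas the ```keywords_review``` call above returns whole phrases, so it should not be read as the notebook's backend. The function name ```textrank_keywords```, the window size, the damping factor of 0.85 and the fixed number of iterations are assumptions for the example; the update rule is the unnormalised PageRank formula used in the TextRank paper.

```python
from collections import defaultdict


def textrank_keywords(words, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank on an undirected co-occurrence graph."""
    # Connect two different words when they are at most `window` positions apart.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    # Start from a uniform score and repeat the PageRank update.
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[word])
            for word in neighbours
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy input, invented for illustration only.
words = "dog food good quality dog food product dog likes food".split()
ranked = textrank_keywords(words, window=2)
for word, score in ranked:
    print(f"{score:.4f}  {word}")
```

Words that co-occur with many different words, and with other well-connected words, end up at the top of the ranking.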


## Recurrent neural networks
- 3 different architectures with 2 different modes of text preprocessing are used
- the main component of the neural network is an ```LSTM``` cell

### 1. Architecture
- encoder-decoder architecture: 3 stacked ```LSTM``` layers in the encoder, whose last state is passed to the single ```LSTM``` layer of the decoder, with a softmax activation on the output of each time step
- since this is the simplest architecture, the summaries it generates are expected to be of the lowest quality

### 2. Architecture
- encoder-decoder architecture with the same encoder and the same decoder as before, plus attention on top of the encoder and decoder
- good-quality summaries are expected from this architecture
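
What attention adds to the decoder can be illustrated for a single time step. The text does not say which attention variant is used, so the sketch below assumes plain dot-product scoring; the function name ```dot_product_attention``` and the toy vectors are hypothetical, and a real model would compute this on the ```LSTM``` outputs inside the network.

```python
import math


def dot_product_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoder time step."""
    # Alignment score of the decoder state with every encoder state.
    scores = [
        sum(d * e for d, e in zip(decoder_state, state)) for state in encoder_states
    ]
    # Softmax over the encoder time steps (shifted by the max for stability).
    max_score = max(scores)
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]
    # Context vector: weighted average of the encoder states.
    context = [
        sum(w * state[k] for w, state in zip(weights, encoder_states))
        for k in range(len(encoder_states[0]))
    ]
    return weights, context


# Toy states, invented for illustration only.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = dot_product_attention(decoder_state, encoder_states)
print(weights, context)
```

Instead of relying only on the last encoder state, the decoder receives a context vector that weights every input position by its relevance to the current output step.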

### 3. Architecture
- encoder-decoder architecture whose encoder uses bidirectional ```LSTM``` cells with the two directions concatenated; the decoder uses a unidirectional ```LSTM``` cell, with attention on top of the encoder and decoder
- training is expected to take more than a week (since only CPUs are used), so the results (which are expected to be of the best quality) will be added later

### 1. Preprocessing
- only lowercasing

### 2. Preprocessing
- lowercasing, deleting links, expanding contractions (```contraction_mapping```), deleting all special characters and short words

- with the second preprocessing the results will hopefully be of better quality, because the neural network has to learn simpler relationships between the words
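
The steps of the second preprocessing mode could look roughly like this. It is a sketch under stated assumptions, not the notebook's code: the three-entry ```CONTRACTION_MAPPING``` is a hypothetical stand-in for the real ```contraction_mapping```, and the regular expressions and the minimum word length of 3 are guesses made for illustration.

```python
import re

# Hypothetical, heavily shortened stand-in for the notebook's contraction_mapping.
CONTRACTION_MAPPING = {"don't": "do not", "it's": "it is", "can't": "cannot"}


def preprocess(text, min_word_length=3):
    """Apply the steps of the second preprocessing mode to one review."""
    text = text.lower()
    # Delete links.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Expand contractions word by word.
    words = [CONTRACTION_MAPPING.get(word, word) for word in text.split()]
    text = " ".join(words)
    # Delete everything except lowercase letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Delete short words (assumed threshold: fewer than 3 characters).
    return " ".join(word for word in text.split() if len(word) >= min_word_length)


cleaned = preprocess("It's GREAT, my dog loves it! See http://example.com :)")
print(cleaned)  # great dog loves see
```

The order matters: contractions are expanded before the special characters are deleted, because removing the apostrophe first would leave fragments such as "it s" that no longer match the mapping.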