# Direct comparison of methods of text comprehension
- uses the [__Amazon Fine Food Reviews__](https://www.kaggle.com/snap/amazon-fine-food-reviews) dataset
- the dataset is a ```.csv``` file; the columns of interest are _Text_ and _Summary_

In [1]:
from main_ntb_backend import dataset, statsKeeper
from text_classifiers.text_rank_classifier import keywords_review as textrank

Number of data: 992


In [2]:
dataset._data['Text'][0]

'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'

In [3]:
dataset._data['Summary'][0]

'Good Quality Dog Food'

- the aim is to present different methods of text comprehension, namely:
    - ```tf-idf```
    - ```textrank```
    - abstractive methods based on recurrent neural networks
    
- the main result is the importance of text preprocessing: ```tf-idf``` fails significantly without correct preprocessing (shown in ```TF-IDFNotebook.ipynb```), and RNNs achieve better results with preprocessing

## Term Frequency - Inverse Document Frequency
- each review is treated as a separate document
- after a pass over all the reviews, the document frequency of a word is the number of reviews that contain this word
- term frequency is a statistic of a single review, calculated as $\frac{\text{occurrences of the word}}{\text{number of words in the review}}$
- inverse document frequency is the negative logarithm of the (smoothed) relative document frequency, i.e. $\ln\left(\frac{\text{number of documents}}{\text{document frequency of the word} + 1}\right)$

- we keep the 5 highest-scoring words of each review as a measure of the most important part of the document

- all the statistics are kept in ```statsKeeper```

In [4]:
statsKeeper._documents["0"].sorted_tf_idfs[:5]

[('better', 0.28648286272145923),
 ('product', 0.24837373829266315),
 ('vitality', 0.09399881643554253),
 ('labrador', 0.09399881643554253),
 ('appreciates', 0.09399881643554253)]
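
The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration and not the code behind ```statsKeeper```: the function name `tf_idf_top_words` and the whitespace tokenization are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_top_words(documents, top_n=5):
    """Return the top_n (word, tf-idf score) pairs for every document."""
    # Assumption: plain lowercasing and whitespace splitting as tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain the word.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        scores = {}
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)                       # term frequency
            idf = math.log(n_docs / (doc_freq[word] + 1))  # smoothed idf
            scores[word] = tf * idf
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        results.append(ranked[:top_n])
    return results


docs = ["good dog food", "good cat food", "good tea", "bad tea"]
top_words = tf_idf_top_words(docs)
```

With the `+ 1` in the denominator, a word that occurs in every document gets a slightly negative idf, so very common words are pushed to the bottom of the ranking.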

## Textrank algorithm
- from the paper ["TextRank: Bringing Order into Text"](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)
- ```TextRank``` is based on ```PageRank```; it evaluates the importance of a word in a document by the number of ```links``` to the word from the other parts of the document
- the algorithm creates a graph of the words appearing in a single document and draws an edge between any two words whenever they occur in the same window (meaning they are at most ```N``` words apart)
- then the ```PageRank``` algorithm is run on the graph
- ```PageRank``` computes the importance of a word as the probability of 'browsing' to the current node, based on the number of incoming edges and the importance of the nodes at their other ends

In [5]:
textrank(dataset._data['cleaned_text'][0])

0.2118     1  several vitality canned dog food products
[several vitality canned dog food products]
0.1950     1  good quality product
[good quality product]
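
The graph construction and ranking described above can be sketched as follows. This is a simplified illustration, not the implementation behind `keywords_review`: it ranks single words only (the output above contains whole phrases), and the function name `textrank_keywords`, the damping factor 0.85 and the fixed number of iterations are assumptions.

```python
from collections import defaultdict


def textrank_keywords(words, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank on an undirected co-occurrence graph."""
    # Link two different words when they are at most `window` positions apart.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    vocab = sorted(set(words))
    scores = {w: 1.0 / len(vocab) for w in vocab}
    for _ in range(iterations):
        new_scores = {}
        for w in vocab:
            # Every neighbour passes on an equal share of its own score.
            incoming = sum(scores[n] / len(neighbours[n]) for n in neighbours[w])
            new_scores[w] = (1 - damping) / len(vocab) + damping * incoming
        scores = new_scores
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = textrank_keywords("good quality dog food good dog good product".split())
```

A word that shares a window with many different words collects score from all of them, which is why frequent, well-connected words end up at the top.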


## Recurrent neural networks
- 3 different architectures with 2 different modes of text preprocessing are used
- the main component of the neural network is an ```LSTM``` cell
- in every architecture, we use a ```softmax``` layer as the last layer and ```sparse categorical crossentropy``` as the loss function
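
To make the last bullet concrete, here is a small sketch of what a softmax output combined with sparse categorical crossentropy computes. It is written in plain Python for illustration only; the function names are assumptions and this is not the training code of the project.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    largest = max(logits)  # subtracted for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(target_ids, logits_per_step):
    """Mean negative log-probability assigned to the target word ids.

    'Sparse' means the targets are integer word ids, not one-hot vectors.
    """
    losses = []
    for target, logits in zip(target_ids, logits_per_step):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# A model that is undecided between 4 words has a loss of ln(4).
uniform_loss = sparse_categorical_crossentropy([2], [[0.0, 0.0, 0.0, 0.0]])
```

If a reported loss is a mean per-token crossentropy with the natural logarithm (an assumption about how the losses were measured), a value of `x` means that the model assigns the correct word a probability of `exp(-x)` on geometric average.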

### 1. Architecture
- encoder-decoder architecture with 3 stacked ```LSTM``` layers in the encoder, whose last state is passed to the single layer of ```LSTM``` cells in the decoder, with softmax activation on the outputs of each time step
- since this is the simplest architecture, the summaries it produces are expected to be of the lowest quality

### 2. Architecture
- encoder-decoder architecture with the same encoder and the same decoder as before, plus an attention layer on top of the encoder and decoder
- good quality summaries are expected from this architecture

### 3. Architecture
- encoder-decoder architecture whose encoder uses bidirectional ```LSTM``` cells with concatenated directions and whose decoder uses a unidirectional ```LSTM``` cell, with an attention layer on top of the encoder and decoder
- the training is expected to take more than a week (since only CPUs are used), therefore the results (which are expected to be of the best quality) will be added later

### 1. Preprocessing
- only lowercasing

### 2. Preprocessing
- lowercasing, deleting links, expanding contractions (```contraction_mapping```), deleting all special characters and short words

- with the second preprocessing the results will hopefully be of better quality, because the neural network will have to learn simpler connections between the words
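
The two preprocessing modes can be sketched as below. This is an illustration under stated assumptions, not the project's preprocessing code: the tiny `CONTRACTION_MAPPING` table, the minimum word length of 3, and the choice to drop digits together with special characters are all hypothetical.

```python
import re

# Hypothetical, deliberately tiny contraction table used only for this example.
CONTRACTION_MAPPING = {
    "won't": "will not",
    "they're": "they are",
    "it's": "it is",
}


def preprocess_minimal(text):
    """1. preprocessing: only lowercasing."""
    return text.lower()


def preprocess_full(text, min_word_length=3):
    """2. preprocessing: lowercase, drop links, expand contractions,
    drop special characters and short words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # delete links
    words = [CONTRACTION_MAPPING.get(w, w) for w in text.split()]
    text = " ".join(words)
    text = re.sub(r"[^a-z\s]", " ", text)  # assumption: digits are dropped too
    return " ".join(w for w in text.split() if len(w) >= min_word_length)


cleaned = preprocess_full("It's GREAT, see http://example.com now!!")
```

In this sketch contractions are expanded before the special characters are removed, because the apostrophe is needed to find them in the table.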

## Attention mechanism
- based on the paper [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- in the standard encoder-decoder architecture, the encoder learns the context of the input sequence and returns it as a single fixed-size vector, so the decoder has to learn to translate a fixed-size vector into human-readable output
- in the attention-driven encoder-decoder, the encoder outputs the learned context at each time step; the final human-readable output is then not the pure output of the decoder but the output of the attention model on top of the encoder-decoder architecture
- the attention model adds each time step of the encoder to the current time step of the decoder, applies softmax to the resulting scores, and uses them to create a new context vector
- the final output is the weighted sum of the context vector and the decoder outputs, with softmax activation
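
One decoder step of additive attention, in the spirit of the paper linked above, can be sketched in plain Python as follows. The weight matrices `w_enc`, `w_dec` and the vector `v` are learned during training in a real model; here they are ordinary arguments, and all names are assumptions made for the example.

```python
import math


def softmax(scores):
    largest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def additive_attention(encoder_states, decoder_state, w_enc, w_dec, v):
    """Return (attention weights, context vector) for one decoder time step."""
    projected_dec = matvec(w_dec, decoder_state)
    scores = []
    for h in encoder_states:
        projected_enc = matvec(w_enc, h)
        # Add the projected encoder and decoder states, squash, then score.
        hidden = [math.tanh(a + b) for a, b in zip(projected_enc, projected_dec)]
        scores.append(dot(v, hidden))
    weights = softmax(scores)  # one weight per encoder time step
    # Context vector: attention-weighted sum of the encoder states.
    size = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(size)
    ]
    return weights, context


identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(
    [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], identity, identity, [1.0, 0.0]
)
```

In the usage example the first encoder state scores higher than the second one, so it receives the larger attention weight and dominates the context vector.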

## Future of this project
- once all the models are trained, the next step is to generate the output using [_beam search_](https://en.wikipedia.org/wiki/Beam_search)
- fine-tuning a BERT model and applying it to the dataset
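
As a rough idea of how beam search decoding could look, here is a minimal sketch over a hand-written toy model. It is not part of the project yet: `beam_search`, `toy_model`, the `<s>`/`</s>` tokens and the probability table are hypothetical, and a trained decoder would take the place of `toy_model`.

```python
import math


def beam_search(next_log_probs, start_token, end_token, beam_width=3, max_len=10):
    """Return the highest-scoring token sequence found by beam search.

    next_log_probs(sequence) must return {token: log-probability} for the
    token that follows `sequence`.
    """
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for sequence, score in beams:
            for token, log_prob in next_log_probs(sequence).items():
                candidates.append((sequence + [token], score + log_prob))
        if not candidates:
            break
        # Keep only the beam_width best partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for sequence, score in candidates[:beam_width]:
            if sequence[-1] == end_token:
                finished.append((sequence, score))
            else:
                beams.append((sequence, score))
        if not beams:
            break
    finished.extend(beams)  # unfinished hypotheses, if max_len was reached
    return max(finished, key=lambda c: c[1])[0]


# Hypothetical next-word probabilities, keyed by the previous word only.
TOY_TABLE = {
    "<s>": {"good": 0.6, "best": 0.4},
    "good": {"tea": 0.55, "food": 0.45},
    "best": {"tea": 0.9, "food": 0.1},
    "tea": {"</s>": 1.0},
    "food": {"</s>": 1.0},
}


def toy_model(sequence):
    return {tok: math.log(p) for tok, p in TOY_TABLE[sequence[-1]].items()}


greedy = beam_search(toy_model, "<s>", "</s>", beam_width=1)
wider = beam_search(toy_model, "<s>", "</s>", beam_width=2)
```

With a beam width of 1 the search is greedy and commits to `good` because it is the most likely first word; with a width of 2 it also keeps `best` alive and finds the sequence with the higher overall probability.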

## Results
______
## 1. Architecture, 1. Preprocessing
- the first architecture: a 3-layer ```LSTM``` encoder and a 1-layer ```LSTM``` decoder, with a simple dense layer with softmax activation on top

- the first preprocessing: just basic lowercasing of the text and removing parentheses

- the results are unsatisfactory: the neural network couldn't catch the meaning of the input sequence and the summaries are below expectations; many times the predicted summary is just ```great product```

- the examples shown aren't meant to show whether the model predicted exactly the same output as the original, but whether the model showed some understanding of the review

- possible explanation of the results:
    - the lack of preprocessing meant that the network needed to learn more complex relationships than it was able to
    - without an attention layer, the preprocessing deficit became even more visible

_example of unsatisfactory output 1:_
```
Review: not only will these cookies satisfy a sweet tooth attack but they're healthy and will calm nausea and or upset stomach strong ginger flavor not sugary so although adults really like them a lot of kids probably won't be crazy about them

Original summary: great cookies for adults

Predicted summary:  good but not
```

_example of network showing at least some understanding 1:_
```
Review: these cookies have just the right amount of sweetness and a lovely chewy texture the cinnamon balances out the added sugar and the sweetness from the chocolate chips it's great that these contain chocolate chips and not the usual raisins because it's unexpected and it's a bonus for chocoholics looking for something more

Original summary: very good cookies

Predicted summary:  great snack
```

_example of network showing at least some understanding 2:_
```
Review: this is the best hot cocoa i have tried for the keurig rich chocolate flavor i used the 6oz setting and it is not watery if you like hot cocoa and the convenience of a k cup this is the one to get

Original summary: awesome

Predicted summary:  great flavor
```

- resulting loss on the train dataset: 2.0597
- resulting loss on the validation dataset: 2.3272

## 2. Architecture, 1. Preprocessing
- a 3-layer ```LSTM``` encoder and a 1-layer ```LSTM``` decoder, with an attention layer on top of the encoder-decoder

- the first preprocessing: just basic lowercasing of the text and removing parentheses

- compared with the first architecture, this one succeeded a bit more: there aren't as many meaningless summaries and the network showed some nice understanding of the topic; however, on the same review-summary examples as above, the network tried but ultimately failed

_example of unsatisfactory output 1:_
```
Review: not only will these cookies satisfy a sweet tooth attack but they're healthy and will calm nausea and or upset stomach strong ginger flavor not sugary so although adults really like them a lot of kids probably won't be crazy about them

Original summary: great cookies for adults

Predicted summary:  not a fan of
```

_example of network showing at least some understanding 1:_
```
Review: these cookies have just the right amount of sweetness and a lovely chewy texture the cinnamon balances out the added sugar and the sweetness from the chocolate chips it's great that these contain chocolate chips and not the usual raisins because it's unexpected and it's a bonus for chocoholics looking for something more

Original summary: very good cookies

Predicted summary:  great tasting
```

_example of network showing at least some understanding 2:_
```
Review: this is the best hot cocoa i have tried for the keurig rich chocolate flavor i used the 6oz setting and it is not watery if you like hot cocoa and the convenience of a k cup this is the one to get 

Original summary: awesome 

Predicted summary:  best hot cocoa
```

- resulting loss on the train dataset: 1.9971
- resulting loss on the validation dataset: 2.2523

## 1. Architecture, 2. Preprocessing
- a 3-layer ```LSTM``` encoder and a 1-layer ```LSTM``` decoder, with a simple ```Dense``` layer with ```softmax``` activation on top

- the results were superior to those of the 1. architecture with only the 1. preprocessing; the model could usually remember one or two words from the review and append an adjective (e.g. ```best tea ever```)

- however, there are still many examples of total misunderstanding of the text

_positive example 1:_
```
Review: tried many peach teas either bitter pale tasting hands favorite peach tea makes amazing iced tea summer robust enough handle served iced sweet enough require minimal sweetners refreshing tea

Original summary: best peach tea ever

Predicted summary:  best tea ever
```

_positive example 2:_
```
Review: bought whim wow consider lucky shot dark turned one best hot sauces ever tasted good spicy heat excellent flavor well unexpected mix heat hint sweet robust mix spices one thing really love flavors whatever put overwhelm heat hot sure linger long flavor original went back amazon bought whole case

Original summary: could drink stuff

Predicted summary:  best hot sauce
```

_negative example 1:_
```
Review: mini schnauzer easily started splintering bone matter minutes ever hungry dog swallowed pieces seeing took bone away ate third less hrs later time breakfast refused eat first time ever refused break overnight fast minutes hunger must gotten better gave ate almost hour later threw four rounds puking could see small softened yet still relatively firm pieces healthy edible bone vomit shame went thinking would great alternative feeding dog indigestible nylon

Original summary: made dog throw bad

Predicted summary:  dog likes
```

- resulting loss on the train dataset: 1.7513
- resulting loss on the validation dataset: 1.8970

## 2. Architecture, 2. Preprocessing
- a 3-layer ```LSTM``` encoder and a 1-layer ```LSTM``` decoder, with an attention layer on top of the encoder-decoder

- expectations were high, since the preprocessing was much more thorough and the difference between the models with and without attention was quite large with the first preprocessing

- the actual results weren't quite as good: the model often misunderstood the sentiment of the review (negative review with a positive summary and vice versa)

- the results are nevertheless better than with the previous architectures

_positive example 1:_
```
Review: consider foodie always trying find best everything tried thomas popcorn one oprah favorite things yes good local costco store popcornopolis people store days impressed zebra really great cheddar like none ever tasted wow good says popcorn good back bag explanation two thumbs perfectly sized individually sealed cones eat one cone one sitting stop good limit setter

Original summary: best cheddar popcorn ever

Predicted summary:  best popcorn
```

_positive example 2:_
```
Review: product great taste quite regular milk taste taste like normal powdered milk unique pleasant taste however heavenly unique taste mix glass milk add pappy sassafras add favorite sweetener drank truly great

Original summary: tastes great quite milk

Predicted summary:  great taste
```

_negative, rather funny example:_
```
Review: think reviewers must got bad batch maybe read directions found quite easy prepare topping heated seconds found flavor good compare good chinese take quick lunch time listed grams protein grams fiber sodium somewhat lower competing brands overall best brand found usually add something like fresh apple microwave meal worried environment steamer tray saved reused preparing instant rice meals home alternative meal healthy choice sweet sour chicken also tried

Original summary: quick microwave lunch

Predicted summary:  tasty healthy
```

- resulting loss on the train dataset: not collected
- resulting loss on the validation dataset: not collected