### What are language models?
A language model learns to assign a probability to a sequence of words. For our problem statement, this matters because a model that can predict the next word can also be used to generate text.
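
As a minimal sketch of this idea, the toy bigram model below estimates the probability of a sequence with the chain rule and then reuses the same probabilities to generate text. The corpus, the `<s>`/`</s>` boundary markers, and all function names are made up for illustration; GPT-2 and ULMFit are neural models and do not work by counting bigrams.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count bigrams and return P(next | previous) as nested dicts."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {word: c / sum(nxts.values()) for word, c in nxts.items()}
        for prev, nxts in counts.items()
    }


def sequence_probability(model, sentence):
    """Chain rule: multiply the conditional probability of each token."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob


def generate(model, max_tokens=20):
    """Greedy generation: always pick the most likely next token."""
    token, output = "<s>", []
    for _ in range(max_tokens):
        token = max(model[token], key=model[token].get)
        if token == "</s>":
            break
        output.append(token)
    return " ".join(output)


# Hypothetical toy corpus, not taken from the Amazon reviews dataset.
corpus = ["great product works well", "great product", "works well for me"]
model = train_bigram(corpus)
print(sequence_probability(model, "great product"))  # 2/3 * 1 * 1/2
print(sequence_probability(model, "product great"))  # unseen bigram, so 0.0
print(generate(model))
```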

### GPT-2

OpenAI's GPT-2 model was proposed in [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It is a causal (unidirectional) transformer pretrained with a language modeling objective on a very large corpus of ~40 GB of text data.

OpenAI describes the model as follows:  
GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.

Tips:
1. GPT-2 is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

2. GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. This lets GPT-2 generate syntactically coherent text, as can be seen in the `run_generation.py` example script.

3. The PyTorch models can take the past as input, which is the previously computed key/value attention pairs. Using this past value prevents the model from re-computing pre-computed values in the context of text generation.
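
To illustrate the first tip, here is a small sketch in plain Python (not the `transformers` API) of why the padding side matters when positions are absolute. The token ids, the `pad_id` value, and the helper names are hypothetical. With right padding, the real tokens keep positions starting at 0; with left padding and no position adjustment, they are shifted to later positions.

```python
def pad_batch(sequences, pad_id, side="right"):
    """Pad token-id lists to equal length; return ids, attention mask, position ids."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "position_ids": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        if side == "right":
            ids = seq + [pad_id] * n_pad
            mask = [1] * len(seq) + [0] * n_pad
        else:
            ids = [pad_id] * n_pad + seq
            mask = [0] * n_pad + [1] * len(seq)
        batch["input_ids"].append(ids)
        batch["attention_mask"].append(mask)
        # Absolute positions are simply the column index 0..max_len-1.
        batch["position_ids"].append(list(range(max_len)))
    return batch


def real_token_positions(batch, row):
    """Positions assigned to the non-padding tokens of one row."""
    pairs = zip(batch["position_ids"][row], batch["attention_mask"][row])
    return [pos for pos, keep in pairs if keep]


# Hypothetical token ids; 0 is used as the padding id for this sketch only.
seqs = [[11, 12, 13, 14], [21, 22]]
right = pad_batch(seqs, pad_id=0, side="right")
left = pad_batch(seqs, pad_id=0, side="left")
print(real_token_positions(right, 1))  # [0, 1]
print(real_token_positions(left, 1))   # [2, 3]
```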

[Source](https://huggingface.co/transformers/model_doc/gpt2.html)


### ULMFit

The abstract of the paper (Howard and Ruder, 2018) reads:   
Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100 times more data. We open-source our pretrained models and code.

[Source](https://www.aclweb.org/anthology/P18-1031/)

Put simply, ULMFit uses an [LSTM](https://yashuseth.blog/2018/09/12/awd-lstm-explanation-understanding-language-model/) (AWD-LSTM) as its backbone, while GPT-2 is a transformer-based model.

### Dataset
[Source](https://nijianmo.github.io/amazon/index.html)   
As suggested, we used the [k-core](https://nijianmo.github.io/amazon/index.html#subsets) subsets for experimentation.

### Approach-1
#### Fine-tuning ULMFit and GPT-2 on a mixed sample of Amazon reviews

##### GPT-2
Categories consumed:   
All_Beauty_5, Automotive_5, Gift_Cards_5, Magazine_Subscriptions_5, AMAZON_FASHION_5, CDs_and_Vinyl_5, Grocery_and_Gourmet_Food_5, Musical_Instruments_5, Appliances_5, Cell_Phones_and_Accessories_5, Industrial_and_Scientific_5, Office_Products_5, Arts_Crafts_and_Sewing_5, Digital_Music_5, Luxury_Beauty_5, Patio_Lawn_and_Garden_5   
Number of training samples: 40k  
Number of validation samples: 10k  

##### Fine-tuned GPT-2 Performance on mixed sample of Amazon reviews

|  epoch  | train_loss  |  valid_loss  |  perplexity |  time    |
|  -----  | ----------  |  ----------  |  ---------- | ----     |
|    0    |   3.772770  |  3.672615    |  39.354694  |   1:28:52|


##### ULMFit
Categories consumed:   
All_Beauty_5, Automotive_5, Gift_Cards_5, Magazine_Subscriptions_5, AMAZON_FASHION_5, CDs_and_Vinyl_5, Grocery_and_Gourmet_Food_5, Musical_Instruments_5, Appliances_5, Cell_Phones_and_Accessories_5, Industrial_and_Scientific_5, Office_Products_5, Arts_Crafts_and_Sewing_5, Digital_Music_5, Luxury_Beauty_5, Patio_Lawn_and_Garden_5   
Number of training samples: 144k  
Number of validation samples: 16k 

##### Fine-tuned ULMFit Performance on mixed sample of Amazon reviews

|  epoch   |  train_loss   |  valid_loss   | accuracy    |   perplexity    |      time    |
| ------   | ------------- | ------------- | ----------- | --------------  | ------------ |
|    0     |  4.091950     |  3.965241     | 0.294048    |    52.732975    |  1:27:53     |
|    1     |  4.138435     |  4.004159     | 0.290700    |    54.825722    |  1:50:32     |
|    2     |  4.075377     |  3.994715     | 0.291748    |    54.310360    |  2:00:35     |
|    3     |  4.099702     |  3.969203     | 0.294860    |    52.942307    |  1:47:00     |
|    4     |  3.998508     |  3.940119     | 0.298114    |    51.424698    |  1:47:01     |
|    5     |  4.016881     |  3.898405     | 0.302949    |    49.323696    |  1:47:03     |
|    6     |  3.918993     |  3.859989     | 0.307038    |    47.464836    |  1:47:01     |
|    7     |  3.890229     |  3.825045     | 0.311247    |    45.834869    |  1:47:17     |
|    8     |  3.837510     |  3.801839     | 0.314293    |    44.783470    |  1:52:52     |
|    9     |  3.823071     |  3.796320     | 0.315118    |    44.537006    |  1:48:25     |

The performance metrics above show that GPT-2 reaches a lower perplexity than ULMFit (39.35 after one epoch versus 44.54 after ten).
Even though ULMFit was trained on a larger sample than GPT-2, the fine-tuned GPT-2 gave better results both in terms of perplexity and on manual review.
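
The perplexity column in the tables above appears to be the exponential of the validation loss, which is the usual definition when the loss is a cross-entropy measured in nats. The sketch below checks that assumption against the final `valid_loss` values reported above; the function name is ours.

```python
import math


def perplexity(cross_entropy_loss):
    """Perplexity of a language model, assuming the loss is in nats."""
    return math.exp(cross_entropy_loss)


gpt2 = perplexity(3.672615)    # GPT-2 valid_loss, epoch 0
ulmfit = perplexity(3.796320)  # ULMFit valid_loss, epoch 9
print(round(gpt2, 2), round(ulmfit, 2))  # 39.35 44.54
```

Because perplexity is a monotonic function of the loss, comparing perplexities gives the same ranking as comparing validation losses.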

##### Generated Examples GPT-2 trained on mixed data


##### Generated Examples ULMFit trained on mixed data

### Approach-2
#### Fine-tuning ULMFit and GPT-2 on samples of individual categories from the Amazon reviews dataset
Number of training samples: 40k  
Number of validation samples: 10k

##### Fine-tuned GPT-2 model Performance on individual categories

category|epoch |train\_loss |valid\_loss |perplexity |time
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Toys\_and\_Games\_5|0|3.496102|3.411566|30.312685|01:00:41
Tools\_and\_Home\_Improvement\_5|0|3.597291|3.543385|34.583778|01:16:50
Pet\_Supplies\_5|0|3.499881|3.418969|30.537912|01:17:12
Kindle\_Store\_5|0|3.515485|3.368167|29.02528|02:16:09
Sports\_and\_Outdoors\_5|0|3.67044|3.56395|35.302368|01:11:18
Movies\_and\_TV\_5|0|3.659058|3.624714|37.513996|02:04:32
Electronics\_5|0|3.533853|3.461887|31.87706|01:39:43
Home\_and\_Kitchen\_5|0|3.538355|3.420884|30.596447|01:02:50
Clothing\_Shoes\_and\_Jewelry\_5|0|3.461012|3.390964|29.694557|45:26:00

##### Fine-tuned ULMFit model Performance on individual categories

###### Toys_and_Games_5

epoch |train\_loss |valid\_loss |accuracy |perplexity |time
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
0|3.802606|3.668654|0.306636|39.199108|38:23
1|3.652814|3.595799|0.315577|36.444805|28:24
2|3.604318|3.538344|0.321844|34.409893|18:11
3|3.443948|3.492189|0.328077|32.857792|17:52
4|3.342302|3.485666|0.329396|32.644154|17:51

###### Tools_and_Home_Improvement_5

epoch |train\_loss |valid\_loss |accuracy |perplexity |time
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
0|3.949425|3.833902|0.288379|46.242607|21:04
1|3.870751|3.753686|0.295918|42.678101|23:21
2|3.761494|3.691247|0.303549|40.094803|20:52
3|3.621727|3.638863|0.309276|38.048546|21:42
4|3.510433|3.628412|0.310847|37.652977|21:24

###### Pet_Supplies_5

epoch |train\_loss |valid\_loss |accuracy |perplexity |time
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
0|3.817931|3.683267|0.293778|39.776115|20:29
1|3.714879|3.591212|0.304431|36.278027|20:16
2|3.610683|3.526938|0.311106|34.019653|20:04
3|3.500048|3.479194|0.316886|32.433567|20:07
4|3.379736|3.468448|0.318994|32.086910|45:55


The comparisons above show that GPT-2 reaches a lower perplexity than ULMFit on each of the three categories where both models were fine-tuned, so we chose GPT-2 as the final generator. Perplexity is also lower when training on individual category data than when training on the mixed-category sample. In conclusion, fine-tuning GPT-2 on individual categories works better.  


##### Generated Examples GPT-2 trained on individual category data
[GPT-2 Generated Examples](https://github.com/c-shekhar/AmazingReviews/blob/main/nbs/4_GPT2_GeneratedTextExamples.ipynb)


##### Generated Examples ULMFit trained on individual category data
[ULMFit Generated Examples](https://github.com/c-shekhar/AmazingReviews/blob/main/nbs/3_ULMFit_GeneratedTextExamples.ipynb)


### EDA before generating sentences
[Understanding data distributions](https://github.com/c-shekhar/AmazingReviews/blob/main/nbs/CategoriesRatingsSentenceLengthAnalysis.ipynb)

### Generating the final computer-generated reviews
##### Sampling strategy
1. Take a 5% sample per category.
2. Fix the number of initial (prompt) words at 5.
3. Fix the total number of reviews to generate at 2000 for that category.
4. Make discrete buckets at 1-word intervals over the range [10, 350] for individual sentence lengths.
5. For each bucket (10, 11, 12, and so on), take the initial 5 words of a review in that bucket and generate a review of that bucket's length n. This holds for n from 20 up to 350. For n between 10 and 20, I generate 20-word sentences instead, because post-processing cuts them down to around 10-12 words. I observed from samples that short sentences need more post-processing, while longer ones need much less.
6. Keep the proportions of the original length distribution in the sample. For example, if length 50 makes up 0.5% of the original sample, then 0.005 × 2000 = 10 of the generated reviews will also have length 50.
7. Finally, generate the samples.
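
The sampling strategy above can be sketched roughly as follows. This is an illustration, not the code used in the notebooks: the function names are hypothetical, lengths are assumed to be word counts, and the largest-remainder rounding is our own choice for making the per-length counts add up to the total exactly.

```python
from collections import Counter


def plan_generation(review_lengths, total=2000, min_len=10, max_len=350):
    """Return {length bucket: number of reviews to generate} for one category."""
    in_range = [n for n in review_lengths if min_len <= n <= max_len]
    if not in_range:
        return {}
    observed = Counter(in_range)
    # Ideal (fractional) share of the total for every observed length.
    ideal = {n: total * c / len(in_range) for n, c in observed.items()}
    counts = {n: int(share) for n, share in ideal.items()}
    # Largest-remainder rounding (an assumption) so the counts sum to `total`.
    leftover = total - sum(counts.values())
    by_remainder = sorted(ideal, key=lambda n: ideal[n] - counts[n], reverse=True)
    for n in by_remainder[:leftover]:
        counts[n] += 1
    return counts


def generation_length(bucket_len, floor_len=20):
    """Buckets shorter than 20 words are generated at 20 words and trimmed later."""
    return max(bucket_len, floor_len)


def make_prompt(review, n_words=5):
    """Use the first `n_words` words of an original review as the prompt."""
    return " ".join(review.split()[:n_words])


# Hypothetical sample in which length 50 makes up 0.5% of the reviews.
lengths = [50] * 1 + [100] * 199
plan = plan_generation(lengths)
print(plan[50], plan[100])  # 10 1990
print(generation_length(12), generation_length(30))  # 20 30
```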