# Training an LSTM in order to generate reviews automatically

* LINK TO THE DATA
* The sections below walk through the entire process of generating random beer reviews, starting from a dataset of 2,924,163 reviews.
* Since I am training on the CPU of my computer (MacBook Pro, 2.5 GHz Intel Core i7), training time was a real constraint. I explain why in more detail later, but for the moment I had to limit the training dataset fed to the LSTM to 100,000 reviews.

### Prerequisites

* All the prerequisites you need are explained on the linked [GitHub page](https://github.com/jcjohnson/torch-rnn). If you are using OS X, you can find a tutorial written by Jeff Thompson [here](http://www.jeffreythompson.org/blog/2016/03/25/torch-rnn-mac-install/).
* You should clone the [repository](https://github.com/jcjohnson/torch-rnn) inside this folder.

### Preprocessing

* First of all, from this [beer dataset](http://jmcauley.ucsd.edu/cse255/data/beer/) I used the file `Ratebeer.txt.gz`, which contains almost 3 million beer reviews.
* A sample review is shown below:<br>
`beer/name: John Harvards Simcoe IPA
beer/beerId: 63836
beer/brewerId: 8481
beer/ABV: 5.4
beer/style: India Pale Ale (IPA)
review/appearance: 4/5
review/aroma: 6/10
review/palate: 3/5
review/taste: 6/10
review/overall: 13/20
review/time: 1157587200
review/profileName: hopdog
review/text: On tap at the Springfield, PA location. Poured a deep and cloudy orange (almost a copper) color with a small sized off white head. Aromas or oranges and all around citric. Tastes of oranges, light caramel and a very light grapefruit finish. I too would not believe the 80+ IBUs - I found this one to have a very light bitterness with a medium sweetness to it. Light lacing left on the glass.`
* Since we are interested only in the review text itself, and not in the various scores, the code below extracts just that text and saves it to a file.
* You can change the number of reviews to save, and the name of the file they are saved to, through the two variables `max_reviews` and `file_name`.

In [1]:
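# Total reviews = 2,924,163

max_reviews = 100000             # number of reviews to extract
file_name = 'reviews_small.txt'  # output file, one review per line
file_length = 0                  # reviews written so far

# Read bytes and decode line by line, so a badly encoded line can be skipped.
# The output is opened once, in write mode, so a rerun does not append duplicates.
with open('ratebeer.txt', 'rb') as f, open(file_name, 'w') as new_file:
    for raw_line in f:  # iterate lazily instead of loading the whole file
        try:
            line = raw_line.decode('utf-8')
        except UnicodeDecodeError:
            continue

        # Each record line has the form "key: value"; skip lines without a colon.
        key, sep, value = line.partition(':')
        if not sep:
            continue

        if key == 'review/text':
            new_file.write(value[1:])  # drop the space after the colon
            file_length += 1

        if file_length >= max_reviews:
            break

file_length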

100000

* After this step, you have a file (`reviews_small.txt`) that contains only the review texts, one per line.
* I will be using the [LSTM network](https://github.com/jcjohnson/torch-rnn) implemented by [Justin Johnson](http://cs.stanford.edu/people/jcjohns/). It is an efficient Torch implementation of an RNN/LSTM. You can find all the statistics about the model on the GitHub page linked above.
* As explained by Justin, in order to work with his network we have to preprocess the data with his preprocessing script (`preprocess.py`). This script creates an HDF5 file and a JSON file containing a preprocessed version of the data. It also splits the data into training, validation and test sets; by default the validation and test sets each take a fraction of 0.1. I decided to use the default sizes.

In [12]:
%run -i "torch-rnn-master/scripts/preprocess.py" --input_txt reviews_small.txt --output_h5 reviews_small.h5 --output_json reviews_small.json

Total vocabulary size: 100
Total tokens in file: 30653625
  Training size: 24522901
  Val size: 3065362
  Test size: 3065362
Using dtype  <class 'numpy.uint8'>
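
The split sizes printed above can be reproduced from the total token count. The following is a minimal sketch of that arithmetic, assuming that the validation and test sizes are rounded down and that the training set takes the remainder (the printed numbers are consistent with this). `split_sizes` is a hypothetical helper written for illustration, not a function from torch-rnn.

```python
def split_sizes(total_tokens, val_frac=0.1, test_frac=0.1):
    """Return (train, val, test) sizes for the given fractions.

    Assumption: val and test are rounded down, train gets what is left.
    """
    val_size = int(val_frac * total_tokens)
    test_size = int(test_frac * total_tokens)
    train_size = total_tokens - val_size - test_size
    return train_size, val_size, test_size


# Total tokens reported by preprocess.py for reviews_small.txt
train_size, val_size, test_size = split_sizes(30653625)
print(train_size, val_size, test_size)
```

With the default fractions this yields the same three sizes as the preprocessing output, so roughly 80% of the characters are used for training.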


### Training

* Following the procedure described by Justin Johnson, I trained an LSTM network in several configurations.
* NOTE: I ran into an error in the `hdf5` library. After launching the train command I received: `/torch/install/share/lua/5.1/hdf5/ffi.lua:56: ')' expected near '_close' at line 1436`.<br>
<br>
If you have the same problem, you can try the following fix: edit line 44 of `ffi.lua` in the `hdf5` package, changing<br>
`local process = io.popen("gcc -E " .. headerPath) -- TODO pass -I`<br>
to<br>
`local process = io.popen("gcc -D '_Nullable=' -E " .. headerPath) -- TODO pass -I`

#### 10,000 samples

* First I trained the LSTM network on 10,000 reviews for only 10 epochs. To do so, type the following commands in the shell:<br>
`cd torch-rnn-master`<br>
`th train.lua -input_h5 reviews_very_small.h5 -input_json reviews_very_small.json -max_epochs 10 -gpu -1`<br><br>

* `-gpu -1` means that training runs on the CPU instead of the GPU. By default, `-model_type` is LSTM, `-wordvec_size` is 64, `-rnn_size` (number of hidden units in the RNN/LSTM) is 128, `-num_layers` is 2, and `-learning_rate` is 2e-3.<br><br>

* The final validation loss was: 1.2175781618465.<br><br>

* Now that the model is trained we can try to make a sample review:<br>
`th sample.lua -checkpoint cv/reviews_very_small/checkpoint_9710.t7 -gpu -1 -length 500`<br><br>

* This is the sample output:<br>
`Ahats black copper, esperis - mid-dark aromas wet lingering the glass. Malty and lest, rettheas.	The body is medium fruity is black pepper. Aroma of sweet aremilen rough peal yeast tite warm fair beins. Dry like all a strong gives has many tongue.  I love the (pan yet is skors an clear spicy brown with an away bittering but leaves over that can beautiful lonced for. Nose of taste except.
330ml bottle bottle. An aroma of hops you cann: Ive definet about on the Chimay a tongue.
Ontent at reddish a` (1)<br><br>
* We can observe that the LSTM has learnt how to separate words and how to use punctuation. It opens one bracket and never closes it. Some words do not exist, and no sentence makes clear sense. By default, the `-temperature` argument is set to 1: this means that the LSTM produces a noisier sample, trying to use "difficult" (less probable) words.<br><br>
* Reducing the `-temperature` to 0.7 gives the following output:<br><br>
`th sample.lua -checkpoint cv/reviews_very_small/checkpoint_9710.t7 -gpu -1 -length 500 -temperature 0.7`<br><br>
`:2A, 2007 As a detected was off-white head.  Reddish brown color, but not solid carbonation in the flavor.
Chewish as the pale in the head time Monks this one.
Bottle @ Wallen 07/10,0755. 9 start beer diminished something to have a Brunks like this one beers as well.  Pours a bit finish, wanth an average off white head that being the mouth and alcohol.
UPDATED: MAY 18, 2008 Taste is a nice creamy that rich toasty nose and bready flavors, finish is smooth.  Its some clove and a nice bitter slight` (2)<br><br>
* The sentences make slightly more sense and there are fewer invented words, even though the review starts with `:2A,`. The model seems to be learning the subject + verb + rest-of-sentence pattern, but most of the sentences still lack a verb.<br><br>
* Before moving to a bigger dataset (whose training is going to take a long time, ~40 hours), I would like to see how much the output changes with a "better" model: `-num_layers 3`, `-rnn_size 256`, `-max_epochs 30`.<br><br>
`th train.lua -input_h5 reviews_very_small.h5 -input_json reviews_very_small.json -num_layers 3 -rnn_size 256 -gpu -1 -max_epochs 30`<br><br>
* The validation loss is: 1.106896893052<br><br>
* This is how to generate a new sample:<br><br>
`th sample.lua -checkpoint cv/reviews_very_small_better/checkpoint_29130.t7 -gpu -1 -length 500`<br><br>
* Output for temperature = 1:<br><br>
`st..	a little boring... this suse apple flavors, slight alcohol - the hops togething of fruity, simble herb.  Flavor remains fresh, its somewhat flavors.  Id immediately enjoy.
Cloudy reddish amber with lots of frothing tan head. Yeast with a floral palate (deep amber brown. Fraisy and smooy malt and fruit.  Spice and caramel level.
Nothing too sweet dark and fruity than papais and though the whole point, this food
Draft at brewpub?  Full, beige cocoa, with hurty off-white head retrocal lacing` (3)<br><br>
Not exactly what I was hoping for, but still slightly better. The sentences seem to make more sense, but the model still tends to avoid verbs and does not know how to open and close brackets.<br><br>
* Output for temperature = 0.7:<br><br>
`ad 2006.  The aroma on the tongue leaving a strong mouth feel that lingers on the palate. Very steaky straw, but a bit lively carbonation.		The flavor starts sweet w/ some malt and spices, bitter and spicy. Its well carbonated than I expected.  A bit cloudy and cloudy brown with a big head is off white colored bubbles. Moderately sweet with some hay, grassy hops and banana sugar.  Taste of chocolate, and chocolate. Medium bodied and slightly bitter finish.  Medium bodied with a sharp hop bitter	` (4)<br><br>
With a lower temperature we notice repetitions in the review, but it makes much more sense than before. Except for one sentence that clearly makes no sense (`A bit cloudy and cloudy brown with a big head is off white colored bubbles`), this is an improvement over output (2).

#### 100,000 samples

* Now I'll repeat the same process for 100,000 reviews. With the default architecture and `-max_epochs 30`, training took ~15 hours. With `-num_layers 3`, `-rnn_size 256`, `-gpu -1` and `-max_epochs 20`, it took ~40 hours.<br><br>
* So, first I ran:
`th train.lua -input_h5 reviews_small.h5 -input_json reviews_small.json -gpu -1 -max_epochs 30`<br><br>
The validation loss is 1.036688855382.<br>
The sample output for `-temperature 1` is:<br><br>
`th sample.lua -checkpoint cv/reviews_small_30epochs/checkpoint_294270.t7 -gpu -1 -length 500`<br><br>
`K@8{_yl, Hazy golden color with a good beer.  White head. Dusty and spicy head with small head.  The aroma is like a sweetish with a little to fine persistent hop comy of alcohol, and is predominate a drinking me.  Strong hints of - the oily after a touch apricot.  The flavor is sweet, dry finish. Some rich, bitter coffee, grain aroma. Good malt.  average sweet bite. Good sign-spicy alcohol and peaches. Its very balanced product outly in the same, as it quickler beer. This Russian. And it is, a`(5)<br><br>
We can clearly see more complex sentences, even if they are still not very correct. The beginning of the review makes no sense (`K@8{_yl,`). Let's see the differences with `-temperature 0.7`.<br><br>
`th sample.lua -checkpoint cv/reviews_small_30epochs/checkpoint_294270.t7 -gpu -1 -length 500 -temperature 0.7`<br><br>
`/11. A sweet chocolate sweet aroma with some earthy hops.
Bottle at the Beer Festival at the Grand iPcome on Total tasting & London. Sweet aroma of malt and yeast and some strong spices and caramel.  Taste is slightly lighter sweet and clean aftertaste.  Some light spices, but not soft finish. Not complex malts and some bitterness on the back a lot of drinkability. I like the style.  Malt malt sweetness.  Decent brew
Light amber color with a thick large beige head. Sweet malt aroma with a dry h`(6)<br><br>
This is the first review that makes sense overall. Some words do not exist, like `iPcome`. Others are repeated several times: `sweet`, `malt` and `spices`, for example. Still, with 10 times more data the LSTM works much better. Note that the validation loss is very similar to the one obtained with 10,000 samples but more hidden units and layers. So we should get an even better result by using the "better" model on the 100,000 reviews.<br><br>
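
The iteration numbers in the checkpoint file names can be related to the size of the training set. The sketch below assumes that each iteration consumes `batch_size * seq_length` characters and that both values are 50; these settings are not stated in this write-up, so treat them as an assumption. `iterations_per_epoch` is a hypothetical helper, not part of torch-rnn.

```python
def iterations_per_epoch(train_tokens, batch_size=50, seq_length=50):
    """Number of full batches that fit in the training set.

    Assumption: batch_size and seq_length are both 50 (not confirmed here).
    """
    return train_tokens // (batch_size * seq_length)


train_tokens = 24522901  # "Training size" reported by preprocess.py
per_epoch = iterations_per_epoch(train_tokens)

# Iteration reached after 30 and after 20 epochs
print(per_epoch, per_epoch * 30, per_epoch * 20)

# Rough time estimate for the bigger model: ~40 hours for 20 epochs
hours_per_epoch = 40 / 20
print(hours_per_epoch * 30)  # estimated hours for 30 epochs
```

Under these assumptions the result for 30 epochs matches `checkpoint_294270.t7`, the result for 20 epochs matches `checkpoint_196180.t7`, and the time estimate agrees with the ~60 hours quoted for a 30-epoch run of the bigger model.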

* Last run (for the moment). I used 20 epochs instead of 30 because 30 epochs would have taken 60 hours on my computer:<br>
`th train.lua -input_h5 reviews_small.h5 -input_json reviews_small.json -num_layers 3 -rnn_size 256 -gpu -1 -max_epochs 20 -checkpoint_every 10000`<br><br>
The validation loss is 0.91980762436962 (an improvement over the previous 1.037).<br>
Let's see the output for temperature 1:<br><br>
`th sample.lua -checkpoint cv/reviews_small_better/checkpoint_196180.t7 -gpu -1 -length 500`<br><br>
`NB). Great ale. #8-X11/09+ version of a pilsener, but it is definitely a very nice ale. No aroma, slightly sweet and nutty with some coffee. Theres no looking finish that fades to mind and chewy earthy on the pils finish. Palate really work down which offered after Dug, the Had up, but whiskey nicely schools. Muenentableons.
On tap at DBST but thats awesome.		Pours a clear dark brown with light head. Hoppy aroma with fter bitterness and a tad of a taste, with a honey backbone. Bitter finish, yea`<br><br>
As always, setting the temperature to 1 doesn't give good results, although the number and variety of words used is higher than before. Let's see the output with `-temperature 0.7`:<br><br>
`th sample.lua -checkpoint cv/reviews_small_better/checkpoint_196180.t7 -gpu -1 -length 500 -temperature 0.7`<br><br>
`3o bottle the colour is a cloudy golden body with a small white head.  Sweet fruity aroma, light body, a fine toasted and grainy flavor with a little bit of fruit that I thought it was so well still as bitterness.  Wow.  I have been pudded with a stouts but I have had to get a nice complex beer to me.
Bottle. Pours a clear golden yellow with a thick tan head. Aroma is light malty with hints of banana, and some hops, malty and caramel. Medium bodied with very soft carbonation. Flavour is moderate`<br><br>
Here is another example:<br><br>
`< Small body.  Aroma is sweet malt and toffee, with a bit of soft citrus hops and some bourbon.  Very smooth on the palate and too much carbonation.  It is very tasty, but I was expecting a mix of sweet malt in the finish.  I can quite expect from a good beer and easily a great beer.
Bottle: Pours a pale gold color with a large bubbly beige head. Nose of spices, peaches, coffee, and some caramel. Flavor is very dry with metal texture, with a bit of spicy hops. Medium body with soft carbonation.`<br><br>
OK, these may not be perfect reviews, but we are getting closer and closer to our goal. The model has started to use more complex constructions, such as the present perfect, and longer sentences that make more sense than before.<br><br>
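
To compare the four runs on one scale, the validation losses can be converted to perplexities. This is a minimal sketch that assumes the reported loss is the average per-character cross-entropy in nats, which this write-up does not state explicitly. The run labels and the `perplexity` helper are my own naming.

```python
import math


def perplexity(loss):
    """Perplexity for an average cross-entropy loss given in nats (assumed)."""
    return math.exp(loss)


# Validation losses reported in the sections above
val_losses = {
    "10,000 reviews, default model, 10 epochs": 1.2175781618465,
    "10,000 reviews, 3 layers x 256 units, 30 epochs": 1.106896893052,
    "100,000 reviews, default model, 30 epochs": 1.036688855382,
    "100,000 reviews, 3 layers x 256 units, 20 epochs": 0.91980762436962,
}

for run, loss in val_losses.items():
    print(f"{run}: loss = {loss:.3f}, perplexity = {perplexity(loss):.2f}")
```

A lower perplexity means the model is, on average, choosing among fewer equally likely next characters, so the ranking of the runs is the same as with the raw loss.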

* For completeness, here are the results with the temperature set to lower values. <br><br>
    1) temperature 0.5: <p>`Ws courtesy of Dickinson and beers and had more abv in the face to find.  Its still not as good as the white expected the style, but I would be a great beer for a pale ale, but I would have liked to be as good as I would drink.
Bottle.  Clear amber color with a small white head.  The aroma is sweet with hints of caramel and pine.  Taste is sweet and smooth with a nice hop finish.  Very good beer.
Bottle from Seeble Market Room bottle from Billian Strong Ale Festival of Ales, Street and Local Ale`</p>
    2) temperature 0.1: <p>`Y bottle from the brewery.  Pours a clear amber with a small white head.  Aroma is sweet malts, caramel, and a touch of caramel.  Flavor is sweet and sweet with a light bitterness.  A bit of a bitter finish.  The finish is sweet and slightly bitter.  The finish is sweet and slightly sweet.  The finish is sweet and slightly bitter.  The finish is slightly bitter and slightly bitter.  The finish is sweet and slightly bitter.  Not bad.
Bottle.  Pours a clear amber with a small white head.  Aroma is	`</p>

* We can see that the lower the temperature, the more repetitive the sentences become. With a lower temperature the output is less noisy: the model uses only the terms and sentence constructions it is most "sure" of.
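
The effect of the temperature can be illustrated with a small sketch. It assumes the common formulation in which the model's raw scores are divided by the temperature before a softmax turns them into probabilities; the scores below are made-up values for three candidate next characters, and `softmax_with_temperature` is a hypothetical helper, not the sampling code used above.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next characters
scores = [2.0, 1.0, 0.5]

for temperature in (1.0, 0.7, 0.5, 0.1):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature decreases, more of the probability mass moves to the highest-scoring candidate, which is why low-temperature samples keep falling back on the same safe phrases.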

### TODO
* Train the LSTM on at least 1 million reviews using a GPU.