### Load previously pickled wine_msds object

In [40]:
# load libraries
import pickle
import numpy as np
import pandas as pd

In [41]:
with open('/home/ec2-user/SageMaker/MSDS696/wine_msds.pkl', 'rb') as f:
    wine_msds = pickle.load(f)

In [42]:
wine_msds.head()

Unnamed: 0,id,country,description,designation,points,price,province,region,taster_name,title,variety,winery,vintage
16,16,Argentina,"Baked plum, molasses, balsamic vinegar and che...",Felix,87,30.0,Other,Cafayate,Michael Schachner,Felix Lavaque 2010 Felix Malbec (Cafayate),Malbec,Felix Lavaque,2010
17,17,Argentina,Raw black-cherry aromas are direct and simple ...,Winemaker Selection,87,13.0,Mendoza Province,Mendoza,Michael Schachner,Gaucho Andino 2011 Winemaker Selection Malbec ...,Malbec,Gaucho Andino,2011
183,183,Argentina,With attractive melon and other tropical aroma...,Salta,88,12.0,Other,Salta,Michael Schachner,Alamos 2007 Torrontés (Salta),Torrontés,Alamos,2007
224,224,Argentina,Blackberry and road-tar aromas are dark and st...,Lunta,90,22.0,Mendoza Province,Luján de Cuyo,Michael Schachner,Mendel 2014 Lunta Malbec (Luján de Cuyo),Malbec,Mendel,2014
231,231,Argentina,"Meaty and rubbery, but that's young Bonarda. T...",Mendoza,85,10.0,Mendoza Province,Mendoza,Michael Schachner,Andean Sky 2007 Bonarda (Mendoza),Bonarda,Andean Sky,2007


### Preprocess wine varieties for training
To generate variety names from scratch, there has to be a system that generates short texts quickly. These texts should have a unique style and could actually serve as names for new types of wines. 

In [43]:
variety_df = wine_msds[['variety']].drop_duplicates()
len(variety_df)

# great, now I am down to 708 unique varieties for wines.

708

In [44]:
variety_df.tail(3)

Unnamed: 0,variety
105458,Babosa Negro
108491,Parraleta
119357,Bobal-Cabernet Sauvignon


In [45]:
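# Keep only the first word of each variety: take the text before the first space,
# then the text before the first hyphen.
# .str[0] selects the first piece; assigning an expand=True result (a two-column
# DataFrame) to a single column can raise an error in recent pandas versions.
variety_df["variety"] = variety_df["variety"].str.split(" ", n=1).str[0]
variety_df["variety"] = variety_df["variety"].str.split("-", n=1).str[0]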

In [46]:
variety_df.tail(3)

Unnamed: 0,variety
105458,Babosa
108491,Parraleta
119357,Bobal
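The same first-word rule can be sketched in plain Python, which makes it easy to check on the three varieties shown above. The helper name `first_token` is hypothetical and not part of the notebook.

```python
def first_token(variety):
    """Keep the text before the first space, then before the first hyphen."""
    return variety.split(" ", 1)[0].split("-", 1)[0]

examples = ["Babosa Negro", "Parraleta", "Bobal-Cabernet Sauvignon"]
print([first_token(v) for v in examples])
# ['Babosa', 'Parraleta', 'Bobal']
```

One side effect worth keeping in mind: duplicates were dropped before the split, so different varieties that share a first word (for example two Cabernet blends) end up as repeated training names.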


In [47]:
# Prepend a tab (the start token) to every name
variety_df['input'] = variety_df['variety'].apply(lambda x : '\t' + x)

# The target is the same name shifted one step: drop the leading tab (index 0)
# and append a newline (the end token)
variety_df['target'] = variety_df['input'].apply(lambda x : x[1:] + '\n')

# drop variety column
variety_df = variety_df.drop(columns=['variety'])

In [48]:
variety_df.head()

Unnamed: 0,input,target
16,\tMalbec,Malbec\n
183,\tTorrontés,Torrontés\n
231,\tBonarda,Bonarda\n
245,\tChardonnay,Chardonnay\n
261,\tRed,Red\n


In [49]:
# A helper function get_vocabulary() takes a list of words as input
# and returns the vocabulary: the set of all characters that appear in the dataset.

# Get vocabulary of the names dataset
def get_vocabulary(names):  
    # Define vocabulary to be set
    all_chars=set()
    
    # Add the start and end token to the vocabulary
    all_chars.add('\t')
    all_chars.add('\n')  
    
    # Iterate for each name
    for name in names:

        # Iterate for each character of the name
        for c in name:

            # A set ignores duplicates, so the character can be added directly
            all_chars.add(c)

    # Return the vocabulary
    return all_chars

In [50]:
# Get the vocabulary
vocabulary = get_vocabulary(variety_df['input'])

# Sort the vocabulary
vocabulary_sorted = sorted(vocabulary)

# Create a dictionary char_to_idx mapping each character to its index in the sorted vocabulary
# (the mapping of vocabulary chars to integers)
char_to_idx = { char : idx for idx, char in enumerate(vocabulary_sorted) }

# Create a dictionary idx_to_char mapping each index to its character in the sorted vocabulary
# (the mapping of integers to vocabulary chars)
idx_to_char = { idx : char for idx, char in enumerate(vocabulary_sorted) }

# Print the dictionaries
print(char_to_idx)
print(idx_to_char)


# char_to_idx: Sort the vocabulary and assign numbers in order. Character \t is mapped to 0, \n to 1, ',' to 2, '.' to 3, 'A' to 4, etc.
# idx_to_char: Integer to character mapping. Integer 0 maps to \t, 1 to \n, 2 to ',', 3 to '.', 4 to 'A', etc.

{'\t': 0, '\n': 1, ',': 2, '.': 3, 'A': 4, 'B': 5, 'C': 6, 'D': 7, 'E': 8, 'F': 9, 'G': 10, 'H': 11, 'I': 12, 'J': 13, 'K': 14, 'L': 15, 'M': 16, 'N': 17, 'O': 18, 'P': 19, 'R': 20, 'S': 21, 'T': 22, 'U': 23, 'V': 24, 'W': 25, 'X': 26, 'Y': 27, 'Z': 28, 'a': 29, 'b': 30, 'c': 31, 'd': 32, 'e': 33, 'f': 34, 'g': 35, 'h': 36, 'i': 37, 'j': 38, 'k': 39, 'l': 40, 'm': 41, 'n': 42, 'o': 43, 'p': 44, 'q': 45, 'r': 46, 's': 47, 't': 48, 'u': 49, 'v': 50, 'w': 51, 'x': 52, 'y': 53, 'z': 54, 'Ç': 55, 'à': 56, 'á': 57, 'â': 58, 'ã': 59, 'ä': 60, 'è': 61, 'é': 62, 'ê': 63, 'í': 64, 'ï': 65, 'ñ': 66, 'ó': 67, 'ô': 68, 'ü': 69, 'ă': 70, 'ć': 71, 'ğ': 72, 'ı': 73, 'š': 74, 'Ž': 75, 'ǎ': 76}
{0: '\t', 1: '\n', 2: ',', 3: '.', 4: 'A', 5: 'B', 6: 'C', 7: 'D', 8: 'E', 9: 'F', 10: 'G', 11: 'H', 12: 'I', 13: 'J', 14: 'K', 15: 'L', 16: 'M', 17: 'N', 18: 'O', 19: 'P', 20: 'R', 21: 'S', 22: 'T', 23: 'U', 24: 'V', 25: 'W', 26: 'X', 27: 'Y', 28: 'Z', 29: 'a', 30: 'b', 31: 'c', 32: 'd', 33: 'e', 34: 'f', 35: 'g
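To see how the two dictionaries work together, here is a small round-trip sketch on a toy vocabulary. The `toy_` names and the two-name dataset are made up for illustration; the real notebook uses the 77-character vocabulary printed above.

```python
# Toy dataset: two names that already carry the leading start token
toy_names = ["\tMalbec", "\tBobal"]

# Vocabulary: every character in the names plus the start and end tokens
toy_vocab = sorted(set("".join(toy_names)) | {"\t", "\n"})

toy_char_to_idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx_to_char = {i: ch for ch, i in toy_char_to_idx.items()}

# Encode a name as integers, then decode it back to text
encoded = [toy_char_to_idx[ch] for ch in "\tMalbec"]
decoded = "".join(toy_idx_to_char[i] for i in encoded)

print(encoded)
print(repr(decoded))
```

Because the vocabulary is sorted, the control characters come first, then uppercase letters, then lowercase letters, which is the same ordering visible in the printed dictionaries.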

### Using RNN - Recurrent Neural Network

### Recurrent because it performs the same computation for every element in the sequence. Inputs, outputs and states are represented by vectors. 
- generates the next char given the current one
- keeps track of the history so far
- for example, to generate the variety Malbec:
- sequence: \t, M, a, l, b, e, c, \n
- time-step 1: input \t, output M.
- time-step 2: input M, output a.
- the state remembers the \t and M seen so far, and this continues until the end of the sequence
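The time-steps listed above can be written out for the whole name. This is a plain-Python illustration of how the input and target sequences line up, not code from the notebook.

```python
name = "Malbec"

# The input starts with the start token; the target ends with the end token
input_seq = "\t" + name
target_seq = name + "\n"

# At each time-step the network reads one input char and should predict the next one
pairs = list(zip(input_seq, target_seq))
for step, (x, y) in enumerate(pairs, start=1):
    print(f"time-step {step}: input {x!r} -> target {y!r}")
```

The last pair maps `c` to the newline, which is how the network learns when a name should end.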

In [51]:
def get_max_len(names):
    """
    Function to return length of the longest name.
    Input: list of names
    Output: length of the longest name
    """

    # create a list to contain all the name lengths
    length_list=[]

    # Iterate over all names and save each name length in the list
    for l in names:
        length_list.append(len(l))

    # Find the maximum length
    max_len = np.max(length_list)

    # return maximum length
    return max_len

In [52]:
# Find the length of longest name
max_len = get_max_len(variety_df['input'])

# Each name as a sequence of length max_len

In [53]:
max_len
# Length of the longest input (this count includes the leading tab)

17

In [54]:
len(vocabulary)

77

### Initialize the input and target tensors: create zero-filled tensors of the appropriate shape

In [55]:
# Initialize the input vector. Create 3D zero vector of required shape
input_data = np.zeros((len(variety_df['input']), max_len+1, len(vocabulary)), dtype='float32')

In [56]:
# Initialize the target vector. Create 3-D zero vector of required shape for target. 
target_data = np.zeros((len(variety_df['target']), max_len+1, len(vocabulary)), dtype='float32')

### Fill the input and target tensors with actual values. The input and target tensors contain all the names in the dataset. Each name is a sequence padded to max_len+1 time-steps (one more than the longest input), and each character in each name is a one-hot encoded vector whose size equals the vocabulary size.

In [57]:
# Iterate for each name in the dataset
for n_idx, name in enumerate(variety_df['input']):
    # Iterate over each character and convert it to a one-hot encoded vector
    for c_idx, char in enumerate(name): 
        input_data[n_idx, c_idx, char_to_idx[char]] = 1
        
# The tensors can be filled-in as follows: 
# input_data[n_idx, p_idx, char_to_idx[char]] will be set to 1 
# whenever the index of the name in the dataset is n_idx and it contains 
# the character char in position p_idx.
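The same filling scheme on a toy example makes the tensor layout easier to inspect. The four-character vocabulary and the two names below are made up for illustration.

```python
import numpy as np

# Toy vocabulary and names (hypothetical, much smaller than the real data)
toy_vocab = sorted({"\t", "\n", "a", "b"})
toy_char_to_idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_names = ["\tab", "\tb"]

# One extra time-step beyond the longest name, as in the notebook
seq_len = max(len(n) for n in toy_names) + 1
one_hot = np.zeros((len(toy_names), seq_len, len(toy_vocab)), dtype="float32")

for n_idx, name in enumerate(toy_names):
    for c_idx, ch in enumerate(name):
        one_hot[n_idx, c_idx, toy_char_to_idx[ch]] = 1

print(one_hot.shape)         # (names, time-steps, vocabulary size)
print(one_hot.sum(axis=2))   # 1 where a character is present, 0 where padded
```

Rows that stay all-zero are padding: shorter names simply leave the remaining time-steps empty.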

In [58]:
# Iterate for each name in the dataset
for n_idx, name in enumerate(variety_df['target']):
    # Iterate over each character and convert it to a one-hot encoded vector
    for c_idx, char in enumerate(name):
        target_data[n_idx, c_idx,  char_to_idx[char]] = 1

# The input and target tensors now have the appropriate shape and values
# and can be used to train the recurrent neural network.

In [59]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from keras.models import Sequential
from keras.layers import Dense, SimpleRNN, Activation, TimeDistributed, LSTM
tf.compat.v1.estimator.inputs  # make sure to add this because there are many deprecated methods

<module 'tensorflow_estimator.python.estimator.api._v1.estimator.inputs' from '/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/api/_v1/estimator/inputs/__init__.py'>

In [60]:
# Build and compile recurrent neural network 
model = Sequential()

model.add(SimpleRNN(500, input_shape=(max_len+1, len(vocabulary)), return_sequences=True))
# Add a SimpleRNN layer of 500 units (an earlier, smaller architecture used 50)
# return_sequences=True makes the RNN layer return the full sequence and not just a single vector

# Add a TimeDistributed Dense layer of the same size as the vocabulary
model.add(TimeDistributed(Dense(len(vocabulary), activation='softmax')))
# softmax activation predicts probability values for each char in the vocabulary

model.compile(loss='categorical_crossentropy', optimizer='adam')  # compile model
# categorical_crossentropy compares the predicted distribution with the one-hot target at each time-step

In [61]:
model.summary()

# The recurrent neural network model is built and compiled; it can be trained now.


# The difference between 50 and 500 units is obviously significant. The SimpleRNN parameter count
# increased from 6,400 (in the earlier 50-unit model) to 289,000. Significant processing is going to be needed.

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_5 (SimpleRNN)     (None, 18, 500)           289000    
_________________________________________________________________
time_distributed_5 (TimeDist (None, 18, 77)            38577     
Total params: 327,577
Trainable params: 327,577
Non-trainable params: 0
_________________________________________________________________
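The parameter counts in the summary can be reproduced by hand. A SimpleRNN layer has one weight per input feature and one per recurrent connection for every unit, plus a bias; a Dense layer has one weight per input per output, plus a bias. The helper names below are hypothetical and only wrap that arithmetic.

```python
def simple_rnn_params(units, input_dim):
    """Input weights + recurrent weights + biases for a SimpleRNN layer."""
    return units * (input_dim + units + 1)

def dense_params(inputs, outputs):
    """Weights + biases for a fully connected layer."""
    return inputs * outputs + outputs

vocab_size = 77

print(simple_rnn_params(500, vocab_size))   # 289000
print(dense_params(500, vocab_size))        # 38577
print(simple_rnn_params(50, vocab_size))    # 6400, the earlier 50-unit layer
```

The recurrent term grows with the square of the unit count, which is why going from 50 to 500 units multiplies the layer's parameters by roughly 45 rather than 10.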


### Inference using Recurrent Neural Network

A neural network can be treated as a black box: given an input, it produces an output.
Input-target pair (x, y): y is the ideal output for input x.
For input x, the network produces an actual output, say z.
Goal: reduce the difference between the actual output z and the ideal output y.
Training: adjust the internal parameters to achieve this goal. After training, the actual output is more similar to the ideal output.
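The "difference between z and y" is what the `categorical_crossentropy` loss measures in this notebook. Below is a minimal plain-Python sketch for a single time-step with a three-character vocabulary; the probability values are invented to show the effect of training.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Negative log-probability that the prediction assigns to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y = [0, 1, 0]                   # ideal output: one-hot vector for the correct char
z_before = [0.4, 0.3, 0.3]      # hypothetical prediction before training
z_after = [0.05, 0.9, 0.05]     # hypothetical prediction after training

loss_before = categorical_crossentropy(y, z_before)
loss_after = categorical_crossentropy(y, z_after)
print(round(loss_before, 3), round(loss_after, 3))
```

The loss falls as the predicted probability of the correct character approaches 1, so minimizing it pushes z toward y.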


### Train recurrent network

In [62]:
model.fit(input_data, target_data, batch_size=128, epochs=1000) 

# Batch size: the number of samples processed before the parameters are adjusted;
# the parameters do not need to be adjusted after every sample.
# Epochs: the number of times the full data set is iterated over.

# I did 3 main runs: one with 200 epochs, then another with 2,000 epochs.
# A change of an order of magnitude seemed like a good idea.
# Finally I also ran 10K epochs. I think I reached a limit with 10K, and it looks like
# from that point on the returns are diminishing. For these reasons, and
# to keep the wait reasonable, I will split the difference and run at 5K.


# The number of epochs is not that significant. More important are the validation and training errors.
# As long as they keep dropping, training should continue. For instance, if the validation error starts
# increasing, that might be an indication of overfitting. You should set the number of epochs as high as
# possible and terminate training based on the error rates. Just to be clear, an epoch is one learning
# cycle in which the learner sees the whole training data set. If you have two batches, the learner needs
# to go through two iterations for one epoch.
# Initially I thought it stopped improving around 1,000 epochs and that ~1,000 would be enough;
# however, it started improving again, so I set it to 10,000 at that point.


# Wow, with 500 RNN units I reached a loss of 0.32 at 1,000 epochs. Can I go under 0.3? If so, we are training all night. lol.

# Nope, it looks like diminishing returns set in at 0.32. The additional RNN units made it possible to reach this loss.
# If you decide to throw maximum processing at this, let me know how it turns out in the comments.
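The idea of stopping on the error rate instead of a fixed epoch count can be sketched without any framework. The function below is a hypothetical helper, and the loss values are invented. Note that the `model.fit` call above does not hold out validation data, so using this for real would first require a validation set (Keras offers a `validation_split` argument on `fit` and an `EarlyStopping` callback for this purpose).

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch at which training would stop, or None if it never stops.

    Training stops once the validation loss has failed to improve on its
    best value by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

# Invented validation losses: improvement stalls after epoch 3
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.9]
print(early_stopping_epoch(losses, patience=3))  # 6
```

With a rule like this, the epoch count can be set generously high and the run ends on its own once the validation loss stops improving.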


Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
Epoch 12/1000
Epoch 13/1000
Epoch 14/1000
Epoch 15/1000
Epoch 16/1000
Epoch 17/1000
Epoch 18/1000
Epoch 19/1000
Epoch 20/1000
Epoch 21/1000
Epoch 22/1000
Epoch 23/1000
Epoch 24/1000
Epoch 25/1000
Epoch 26/1000
Epoch 27/1000
Epoch 28/1000
Epoch 29/1000
Epoch 30/1000
Epoch 31/1000
Epoch 32/1000
Epoch 33/1000
Epoch 34/1000
Epoch 35/1000
Epoch 36/1000
Epoch 37/1000
Epoch 38/1000
Epoch 39/1000
Epoch 40/1000
Epoch 41/1000
Epoch 42/1000
Epoch 43/1000
Epoch 44/1000
Epoch 45/1000
Epoch 46/1000
Epoch 47/1000
Epoch 48/1000
Epoch 49/1000
Epoch 50/1000
Epoch 51/1000
Epoch 52/1000
Epoch 53/1000
Epoch 54/1000
Epoch 55/1000
Epoch 56/1000
Epoch 57/1000
Epoch 58/1000
Epoch 59/1000
Epoch 60/1000
Epoch 61/1000
Epoch 62/1000
Epoch 63/1000
Epoch 64/1000
Epoch 65/1000
Epoch 66/1000
Epoch 67/1000
Epoch 68/1000
Epoch 69/1000
Epoch 70/1000
Epoch 71/1000
Epoch 72/1000
E

<keras.callbacks.History at 0x7fae1a166fd0>

### Predict first character

In [63]:
# Initialize the first character of the sequence.
output_seq = np.zeros((1, max_len+1, len(vocabulary)))  # create 3-dimensional zero tensor
output_seq[0, 0, char_to_idx['\t']] = 1  # position 0 holds the tab char (start token)

# Probability distribution for the first character: the output at time-step 0,
# where the input is the start token (this matches counter-1 in generate_wine_names).
# model.predict returns the softmax probabilities, the same values as the older predict_proba.
probs = model.predict(output_seq, verbose=0)[:,0,:]

# Sample the vocabulary using the probability distribution to get the first char
first_char = np.random.choice(sorted(list(vocabulary)), replace=False, 
                              p=probs.reshape(len(vocabulary)))  # one probability per vocabulary char (77)

# Print the generated character
print(first_char)

# This made me think I needed to get rid of the non-English chars (many times during the runs I thought
# I had to remove them, but it worked out in the end)

u
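Sampling from the distribution, instead of always taking the most probable character, is what lets the generator produce different names on each run. A small NumPy sketch with a made-up three-character distribution shows the difference; the characters and probabilities are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

chars = ["M", "P", "S"]
probs = np.array([0.6, 0.3, 0.1])  # hypothetical softmax output for one time-step

# Greedy choice: always the most probable character
greedy = chars[int(np.argmax(probs))]

# Sampling: each character is drawn in proportion to its probability
samples = rng.choice(chars, size=1000, p=probs)
counts = {c: int((samples == c).sum()) for c in chars}

print(greedy)
print(counts)
```

Greedy decoding would yield the same name every time, while sampling occasionally picks less likely characters and so produces variety.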


### Predict second char using the first

In [64]:
# Insert the first char into the sequence:
# update the tensor so that position 1 contains the first character
output_seq[0, 1, char_to_idx[first_char]] = 1

### Sample from probability distribution

In [65]:
# Probability distribution for the second character: the output at time-step 1,
# where the input is the first character
probs = model.predict(output_seq, verbose=0)[:,1,:]

# Sample vocabulary to get second character
second_char = np.random.choice(sorted(list(vocabulary)), replace=False,
                              p=probs.reshape(len(vocabulary)))

In [66]:
second_char

'k'

In [67]:
# Function to generate wine names
def generate_wine_names(n):
    
    # Repeat for each name to be generated
    for i in range(n):

        # Flag to indicate when to stop generating characters
        stop = False

        # Position at which the next generated character will be written
        # (position 0 holds the start token)
        counter = 1

        # Define a zero tensor to contain the output sequence
        output_seq = np.zeros((1, max_len+1, len(vocabulary)))

        # Initialize the first character of output sequence as the start token
        output_seq[0, 0, char_to_idx['\t']] = 1

        # Variable to contain the name
        name = ''

        # Repeat until the end token is generated or the character limit is reached.
        # Note: counter < 10 caps each name at 9 characters, so longer names are cut off.
        while not stop and counter < 10:

            # Get probabilities for the next character in sequence
            # (model.predict returns the same softmax probabilities as the older predict_proba)
            probs = model.predict(output_seq, verbose=0)[:,counter-1,:]
            
            # Sample the vocabulary according to the probability distribution
            c = np.random.choice(sorted(list(vocabulary)), replace=False, p=probs.reshape(len(vocabulary)))
            
            if c == '\n':
                # Stop if end token is encountered, else append to existing sequence
                stop = True
            else:
                # Append this character to the name generated so far
                name = name + c

                # Append this character to existing sequence for prediction of next characters
                output_seq[0, counter, char_to_idx[c]] = 1
                
                # Increment the number of characters generated
                counter = counter + 1

        # Output generated sequence or name
        print(name)

In [73]:
generate_wine_names(23)

# 23 because I play 23 on roulette table. Give chance a chance!

Lemberger
Boğazkere
Rhône
Früburgun
Tocai
Feteascǎ
Molinara
Moscato
Portugues
Tinta
Cabernet
Posip
Durella
Irsai
Tempranil
Malvar
Marsanne
Albanello
Moscato
Shiraz
Pinot
Colombard
Vidal


# Final thoughts on generating short text:

### Typically a larger training set should produce better results, but in this case the amount of data I have seems sufficient. I think leaving the non-English chars in was the better idea (the authentic black-box feel), since without them the names sound too homogeneous.  
### A second typical way to improve the performance of the model is to train for more epochs. I ran 200, 2K and 10K initially and decided 5K would be ideal for my needs. It runs fairly quickly and still produces decent and plausible results, and the loss did not change significantly after 5K. 
### Another way could be to increase the number of hidden units. Currently using 250 but also experimented with 50, 100 and 150. It seems that as I increase the number of hidden units, better results are produced. 
### I think the results could easily be given to new varieties of wine. Some of my favorites were: Masy, Siovasie, Srigaz, Graüy, Charaussa. And then there were hilarious ones such as Chardonna, Mellot, and Pilot. Perhaps we can even combine the names into bigrams; I would be interested in a glass of Mellot Pilot. 

### During my initial runs, about one in every 50 generated names was an actual name. I wanted to randomize the final product a bit more, and after tinkering around a bit my best run used 500 RNN units and 1000 epochs. In my opinion this produced the most authentic results, without any repetition or signs of overfitting, and in the shortest amount of time.
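One way to put a number on how many generated names are actual names is to compare the generated list against the training names. This is a minimal sketch: the helper `novelty_report` is hypothetical, and the short training list below is an illustrative stand-in, not the notebook's real data.

```python
def novelty_report(generated, training_names):
    """Return the generated names not found in the training set, and their share."""
    known = set(training_names)
    novel = [name for name in generated if name not in known]
    share = len(novel) / len(generated) if generated else 0.0
    return novel, share

# Illustrative stand-ins: a few generated names and a made-up training list
generated = ["Lemberger", "Früburgun", "Tocai", "Tempranil"]
training = ["Lemberger", "Tocai", "Malbec"]

novel, share = novelty_report(generated, training)
print(novel, share)
```

A high share of known names suggests the model is memorizing the training set, while a very low share with implausible spellings suggests it is under-trained.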