## Text Generation (SPAM SMS)

In [208]:
import torch 
import torch.nn as nn
import random
import torch.nn.functional as F
import numpy as np

In this notebook we are going to use PyTorch and a recurrent neural network to generate some spam SMS messages. The model works at the character level because SMS data contains a lot of abbreviations and slang that we want to keep. You can see in the sample text shown below that the word *tkts* is used to refer to tickets. A word-level model with a standard vocabulary would typically discard such non-standard words, and we would not be able to learn the dataset properly.
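To make the trade-off concrete, here is a minimal sketch (not part of the original notebook) that compares a word-level vocabulary with a frequency cutoff against a character-level one. The three sample messages, the `min_count` cutoff and the `<unk>` placeholder are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical mini-corpus; "tkts" appears only once.
messages = [
    "win cup final tkts now",
    "win cash now",
    "final chance to win cash",
]

# Word-level vocabulary that keeps only words seen at least min_count times.
min_count = 2
word_counts = Counter(word for m in messages for word in m.split())
word_vocab = {w for w, c in word_counts.items() if c >= min_count}

def word_encode(message):
    """Replace out-of-vocabulary words with a placeholder token."""
    return [w if w in word_vocab else "<unk>" for w in message.split()]

# Character-level vocabulary: every character that occurs is kept.
char_vocab = set("".join(messages))

print(word_encode("win cup final tkts now"))
print(all(c in char_vocab for c in "tkts"))
```

In this toy setting the rare word *tkts* is lost at the word level, while every one of its characters is still representable at the character level.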

The data that we are using can be downloaded from [here](http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).
First, let's load the spam data from a text file.

In [14]:
text=[line.replace('spam\t',' ').strip() for line in open('data/spam.txt').readlines() if line.startswith('spam')]
text=' '.join(text)
text[:300]

"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to "

The next step is to build a dictionary of the characters that occur in our text.

In [166]:
characters=list(set(text))
index2char=dict(enumerate(characters))
char2index={c:i for i,c in enumerate(characters)}
print('Number of unique characters:',len(characters))

Number of unique characters: 94
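As a quick sanity check on the two dictionaries, the sketch below encodes a short string to indexes and decodes it back. The `sample` string and the names `vocab`, `c2i` and `i2c` are hypothetical stand-ins for `text`, `characters`, `char2index` and `index2char`. It also sorts the characters, which is an optional variation on the cell above: a plain `set` of strings has no guaranteed order, so the index assigned to each character can differ between Python sessions.

```python
sample = "Free tkts 2 win"  # hypothetical stand-in for the full text

# sorted() makes the index assignment reproducible between runs.
vocab = sorted(set(sample))
c2i = {c: i for i, c in enumerate(vocab)}
i2c = dict(enumerate(vocab))

# Encode to indexes, then decode back to characters.
encoded = [c2i[c] for c in sample]
decoded = "".join(i2c[i] for i in encoded)

print(len(vocab), encoded[:5])
print(decoded == sample)
```

A stable ordering mostly matters if we want to save a trained model and reuse it in a later session, because the model's outputs are tied to these indexes.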


### Sampling training data from text
Next we are going to write a small function that randomly selects a small part of the text and returns it to us. We pass the desired length and get back a piece of text of that size.


In [47]:
def random_sample(text, sample_length=20):
    start_index=random.randint(0,len(text)-sample_length)
    return text[start_index:start_index+sample_length]
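The bounds in `random_sample` are worth a second look, because `random.randint` includes both endpoints. The sketch below repeats the same logic on a toy alphabet (the `window` helper and `toy_text` are hypothetical names) and checks that every sample has exactly the requested length, even when the window ends on the last character.

```python
import random

def window(text, sample_length):
    # random.randint includes both endpoints, so the last valid start
    # index is len(text) - sample_length.
    start = random.randint(0, len(text) - sample_length)
    return text[start:start + sample_length]

random.seed(0)  # seed only so the sketch is repeatable
toy_text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
samples = [window(toy_text, 10) for _ in range(1000)]

# Every sample is a contiguous slice of the requested length.
print(all(len(s) == 10 for s in samples))
print(all(s in toy_text for s in samples))
```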

In [49]:
random_sample(text,sample_length=300)

"your friend 1/1 For ur chance to win £250 cash every wk TXT: PLAY to 83370. T's&C's www.music-trivia.net custcare 08715705022, 1x150p/wk. Final Chance! Claim ur £150 worth of discount vouchers today! Text YES to 85023 now! SavaMob, member offers mobile! T Cs SavaMob POBOX84, M263UZ. £3.00 Subs 16 Sp"

### One hot encoding
Now we need to one-hot encode the sample text to be able to feed it into our model. Here is a small utility function that receives a sample text and performs a one-hot encoding of its characters.

In [93]:
def one_hot_encode(sample_text):
    one_hot=torch.zeros(len(sample_text),len(characters))
    for i,character in enumerate(sample_text):
        one_hot[i][char2index[character]]=1
    return one_hot    
        

In [94]:
one_hot_encode('abc')

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
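With 94 characters the rows above are hard to read, so here is the same idea on a tiny scale. This is only an illustrative sketch: the four-character `alphabet` and the `one_hot_rows` helper are assumptions, and plain Python lists stand in for the tensor.

```python
alphabet = ["a", "b", "c", "d"]  # hypothetical 4-character vocabulary
index_of = {c: i for i, c in enumerate(alphabet)}

def one_hot_rows(sample_text):
    """Return one row per character, with a single 1.0 at that character's index."""
    rows = []
    for character in sample_text:
        row = [0.0] * len(alphabet)
        row[index_of[character]] = 1.0
        rows.append(row)
    return rows

for row in one_hot_rows("cab"):
    print(row)
# [0.0, 0.0, 1.0, 0.0]
# [1.0, 0.0, 0.0, 0.0]
# [0.0, 1.0, 0.0, 0.0]
```

Each row has the length of the vocabulary and sums to one, which is exactly the shape `(len(sample_text), len(characters))` that `one_hot_encode` produces.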

### Creating input and target sets
For each character in the sample text, we use that character as the input and the next character as the target. Essentially, the model receives one character as input and predicts one character as output. To avoid looping through the sample text one character at a time, we build all the (character, next character) pairs for the sample and pass the inputs to the network in one call to get the predictions for all of them. We will then use these predictions to calculate the loss.
For example:

Text = **ABCDEFGHIJKLMNOPQRSTUVWXYZ**

sample_text = **GHIJKLMNOP**                 #length=10

sample_text[:-1]  ->  inputs  =  **GHIJKLMNO**

sample_text[1: ]  ->  targets =   **HIJKLMNOP**

We can create our input and target sets by simply shifting the sample text: omitting the last character gives the inputs, and omitting the first character gives the targets.

In [121]:
def random_training_data(text,sample_length=20):
    """returns the inputs and targets for a sample text"""
    sample_text=random_sample(text,sample_length)
    inputs=sample_text[:-1]
    targets=sample_text[1:]
    one_hot_inputs=one_hot_encode(inputs)
    target_indexes=torch.LongTensor([char2index[character] for character in targets])
    return one_hot_inputs,target_indexes
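The shifting trick is easy to verify on the alphabet example from above. This small sketch prints each (input, target) pair and shows that a sample of length n yields n - 1 training pairs.

```python
sample_text = "GHIJKLMNOP"

inputs = sample_text[:-1]   # drop the last character
targets = sample_text[1:]   # drop the first character

# Each input character is paired with the character that follows it.
for current_char, next_char in zip(inputs, targets):
    print(f"{current_char} -> {next_char}")

# A sample of length n therefore yields n - 1 training pairs.
print(len(sample_text), len(inputs), len(targets))
```

This is why `random_training_data` with `sample_length=20` returns 19 one-hot rows and 19 target indexes.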

### Model
Here we define a very simple network. It has one GRU layer that encodes the relationship between the input characters and one linear layer that projects the hidden state to the size of our character set.

In [104]:
class SpamGRU(nn.Module):
    def __init__(self,character_size,hidden_size):
        super(SpamGRU,self).__init__()
        self.hidden_size=hidden_size
        self.gru=nn.GRU(character_size,hidden_size,batch_first=True)
        self.charMapper=nn.Linear(hidden_size,character_size)
    def forward(self,inputs,hidden):
        out,hidden=self.gru(inputs,hidden)
        logits=self.charMapper(out)
        return logits, hidden
    def init_hidden(self):
        return torch.zeros(1,1,self.hidden_size)

In [105]:
model=SpamGRU(len(characters),128)
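To get a feeling for how small this model is, we can count its parameters by hand. The sketch below assumes PyTorch's documented single-layer GRU layout (three gates, each with an input-to-hidden matrix, a hidden-to-hidden matrix and two bias vectors); the two helper functions are hypothetical names used only here.

```python
def gru_parameter_count(input_size, hidden_size):
    # Three gates (reset, update, new), each with an input-to-hidden matrix,
    # a hidden-to-hidden matrix, and two bias vectors.
    per_gate = hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size
    return 3 * per_gate

def linear_parameter_count(in_features, out_features):
    # One weight matrix plus one bias vector.
    return in_features * out_features + out_features

character_size, hidden_size = 94, 128  # values used in this notebook
gru_params = gru_parameter_count(character_size, hidden_size)
linear_params = linear_parameter_count(hidden_size, character_size)
print(gru_params, linear_params, gru_params + linear_params)
# 86016 12126 98142
```

If the `model` from the cell above is available, `sum(p.numel() for p in model.parameters())` should report the same total.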

### Training

In [201]:
def train(model,text,training_steps=5000,sample_length=50,lr=0.005):
    optimizer=torch.optim.Adam(model.parameters(),lr=lr)
    for step in range(training_steps+1):
        optimizer.zero_grad() 
        inputs ,targets=random_training_data(text,sample_length)
        hidden=model.init_hidden()
        outputs,_=model(inputs.unsqueeze(0),hidden)
        loss=F.cross_entropy(outputs.squeeze(0),targets)
        loss.backward()
        optimizer.step()
        if step%200==0:
            print(f'Step:{step} ({step*100/training_steps:.1f}%)  Loss:{loss.item()}')

In [202]:
train(model,text)

Step:0 (0.0%)  Loss:1.3637193441390991
Step:200 (4.0%)  Loss:1.407169222831726
Step:400 (8.0%)  Loss:1.8686273097991943
Step:600 (12.0%)  Loss:0.8712431192398071
Step:800 (16.0%)  Loss:1.2440128326416016
Step:1000 (20.0%)  Loss:2.254335403442383
Step:1200 (24.0%)  Loss:1.6136444807052612
Step:1400 (28.0%)  Loss:1.9385374784469604
Step:1600 (32.0%)  Loss:0.9153338074684143
Step:1800 (36.0%)  Loss:1.354489803314209
Step:2000 (40.0%)  Loss:1.1856731176376343
Step:2200 (44.0%)  Loss:1.6169809103012085
Step:2400 (48.0%)  Loss:1.6817787885665894
Step:2600 (52.0%)  Loss:1.7164816856384277
Step:2800 (56.0%)  Loss:1.7623902559280396
Step:3000 (60.0%)  Loss:1.613673210144043
Step:3200 (64.0%)  Loss:1.054235816001892
Step:3400 (68.0%)  Loss:1.1988729238510132
Step:3600 (72.0%)  Loss:1.80231773853302
Step:3800 (76.0%)  Loss:1.6044071912765503
Step:4000 (80.0%)  Loss:2.095244884490967
Step:4200 (84.0%)  Loss:1.912640929222107
Step:4400 (88.0%)  Loss:1.733980417251587
Step:4600 (92.0%)  Loss:1.90055
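The training cell relies on `F.cross_entropy`, which takes raw logits and target indexes. As a rough pure-Python sketch of what that call computes with its default mean reduction, the hypothetical `cross_entropy` helper below averages the negative log-probability of the correct character over all steps. It also gives a useful baseline: a model that guesses uniformly over 94 characters has a loss of log(94), about 4.54, so the logged values above are well below chance level.

```python
import math

def cross_entropy(logits_per_step, target_indexes):
    """Mean negative log-probability of the target index at each step."""
    total = 0.0
    for logits, target in zip(logits_per_step, target_indexes):
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        highest = max(logits)
        log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in logits))
        total += log_sum_exp - logits[target]
    return total / len(target_indexes)

# With all-equal logits the model is guessing uniformly, so the loss is
# log(number of characters).
uniform_logits = [[0.0] * 94 for _ in range(3)]
print(cross_entropy(uniform_logits, [5, 17, 42]))
print(math.log(94))

# A confident, correct prediction gives a loss close to zero.
confident = [[10.0 if i == 7 else 0.0 for i in range(94)]]
print(cross_entropy(confident, [7]))
```

Each logged loss comes from a single random sample of 50 characters, which is one reason the values jump around from step to step rather than decreasing smoothly.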

### Generating Text
We do not simply choose the index with the highest probability, because we want more variety in the predicted characters. Instead, we use NumPy's `np.random.choice` function to select a character index according to the probabilities predicted by the network. We could also use `torch.multinomial()` for the sampling, which essentially does the same thing.
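The `generate` function below also divides the logits by a `temperature` before the softmax. The following pure-Python sketch, with made-up logits for four characters and a hypothetical `softmax_with_temperature` helper, illustrates how the temperature reshapes the distribution and how sampling differs from always taking the maximum.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical scores for four characters

# Lower temperatures sharpen the distribution, higher ones flatten it.
for temperature in (0.3, 0.8, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Greedy decoding always returns the same index ...
greedy_index = max(range(len(logits)), key=lambda i: logits[i])

# ... while sampling draws an index in proportion to its probability.
random.seed(0)  # seed only so the sketch is repeatable
probs = softmax_with_temperature(logits, 0.8)
sampled = random.choices(range(len(logits)), weights=probs, k=10)
print(greedy_index, sampled)
```

A low temperature such as 0.3 keeps the output close to the most likely characters, while values nearer to 1 or above give more varied but also more error-prone text.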

Another thing to note is that a hidden state of zeros does not give the model a helpful initial context for text generation. We can give the network a better initial hidden state by feeding it a few starting characters and using the hidden state they produce.

In [303]:
def generate(model,starting_text='',desired_length=20, temperature=0.8):
    if not starting_text:
        raise ValueError('starting_text must contain at least one character')
    #Creating initial hidden state by feeding the starting text one character at a time
    inputs=one_hot_encode(starting_text)
    hidden=model.init_hidden()
    generated_characters=''
    with torch.no_grad():   #no gradients are needed for generation
        for i in range(inputs.size(0)):
            outs, hidden=model(inputs[i].view(1,1,-1),hidden)
        #generating characters: sample from the latest logits, then feed the sampled character back in
        for i in range(desired_length):
            prob=F.softmax(outs/temperature,dim=2)
            character_index=np.random.choice(len(characters),1,p=prob.flatten().numpy())
            character=index2char[character_index.item()]
            generated_characters+=character
            outs,hidden=model(one_hot_encode(character).unsqueeze(0),hidden)
    return starting_text+' '+generated_characters
        

In [297]:
generated_text=generate(model,starting_text='get a free', temperature=0.3, desired_length=200)
generated_text

'get a free nd of a £100 prize Games accout the latest from a £2000 prize. To claim call 08000930705 from land line or of Colour from 2004, MUST GO to 8007 Get to receive a £100 prize GUARANTEED Call 090663622066'

In [302]:
generate(model,starting_text='get a free', temperature=0.3, desired_length=200)

'get a free nd call 08000839402 or call 0906636220116+ Gr8 from a charged 4. Customer service reply to receive a £500 prize. To claim your mobile number service reply to receive a £500 prize. Gement for your free'

As you can see, sampling from the predicted distribution, instead of always choosing the character with the highest output value, produces variety in the generated text.