<a href="https://colab.research.google.com/github/gopal2812/mlblr/blob/master/python_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformer based model to translate English text to Python code

The goal is to  write a transformer-based model that can translats English text to python code(with proper whitespace indentations)

The training dataset contains around 4600+ examples of English text to python code. 
- must use transformers with self-attention, multi-head, and scaled-dot product attention in the model
- There is no limit on the number of training epochs or total number of parameters in the model
- should have trained a separate embedding layer for python keywords and paid special attention to whitespaces, colon and other things (like comma etc)
- model should to do proper indentation
- model should to use newline properly
- model should understand how to use colon (:)
- model should generate proper python code that can run on a Python interpreter and produce proper results


Some preprocessing checks on the dataset should be carried out like:
 - the dataset provided is divided into English and "python-code" pairs properly
the dataset does not have anomalies w.r.t. indentations (like a mixed-use of tabs and spaces, or use of either 4 or 3 spaces, it should be 4 spaces only). Either use tabs only or 4 spaces only, not both
- the length of the "python-code" generated is not out of your model's capacity


In [4]:
!pip install -U torchtext==0.8.0


Requirement already up-to-date: torchtext==0.8.0 in /usr/local/lib/python3.7/dist-packages (0.8.0)


In [5]:
import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math
import pandas as pd

from torchtext.data import Field, BucketIterator, LabelField, TabularDataset


USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
!ls -l drive/MyDrive/

total 3228465
-rw------- 1 root root      83296 Apr 21  2014  13033148255_BVRxxxxx9Q_A1.zip
drwx------ 2 root root       4096 Jul 26  2014 '13033148255_BVRxxxxx9Q_A1.zip (Unzipped Files)'
-rw------- 1 root root        151 Sep  3  2013 '20130901  po_paymentdetails members 33 to 40.gsheet'
-rw------- 1 root root      31354 Jul  9  2017  20170707_OrderTrades_25068.pdf
-rw------- 1 root root      54690 Dec 16  2013 'A82542 dec.pdf'
-rw------- 1 root root      54576 Dec 16  2013 'A82542 jan.pdf'
-rw------- 1 root root        151 Sep  2  2013 'additional information on demant letters.gsheet'
-rw------- 1 root root     534516 Dec 16  2013 'Aircel P 2 Offer Letter.pdf'
-rw------- 1 root root     771981 Dec 16  2013 'Aircel page 1.pdf'
-rw------- 1 root root        151 Jul 22  2012  Amit_Cv_sample.doc.gdoc
-rw------- 1 root root     313338 Apr  4  2017  aonla2.jpg
-rw------- 1 root root        151 Mar  3  2009 '_AVG certification_.gdoc'
-rw------- 1 root root        151 Nov  9  2019 'Booting_AR

In [8]:
datasets = [[]]
file_name = '/content/drive/MyDrive/clean_data_ex_1.txt'

with open(file_name) as f:
  #my_dict = {"description":[],"code":[]}
  for line in f:
    if line.startswith('#'):
      comment = line.split('\n#')
      if datasets[-1] != []:
        # we are in a new block
        datasets.append(comment)
    else:
      stripped_line = line#.strip()
      if stripped_line:
        datasets[-1].append(stripped_line)
# datasets[0].insert(0,'# write a python program to add two numbers ')        

In [9]:
raw_data = {'Description' : [re.sub(r"^#(\d)*",'',x[0]).strip() for x in datasets], 'Code': [''.join(x[1:]) for x in datasets]}
df = pd.DataFrame(raw_data, columns=["Description", "Code"])

In [10]:
df['Description'][0] = " write a python program to add two numbers"

In [11]:
df.head()

Unnamed: 0,Description,Code
0,write a python program to add two numbers,\nimport string\nfrom itertools import permuta...
1,write a python function to add two user provid...,"\n\ndef add_two_numbers(num1, num2):\n sum ..."
2,write a program to find and print the largest ...,\nnum1 = 10\nnum2 = 12\nnum3 = 14\nif (num1 >=...
3,write a program to find and print the smallest...,\nnum1 = 10\nnum2 = 12\nnum3 = 14\nif (num1 <=...
4,Write a python function to merge two given lis...,"\n\ndef merge_lists(l1, l2):\n return l1 + ..."


In [12]:
df['Code'].replace("", float("NaN"), inplace=True)

In [13]:
df[df.isna().any(axis=1)]

Unnamed: 0,Description,Code


In [14]:
df.dropna(subset = ["Code"], inplace=True)


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1018 entries, 0 to 1017
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Description  1018 non-null   object
 1   Code         1018 non-null   object
dtypes: object(2)
memory usage: 23.9+ KB


In [16]:

# Dividing the data into train and validation dataset

train_df = df.sample(frac = 0.80) 
  
# Creating dataframe with rest of the 20% values 
valid_df = df.drop(train_df.index)

In [17]:
print(f'train df {train_df}')
print(f'Valid df {valid_df}')

train_df.to_csv('train.csv', index=False)
valid_df.to_csv('valid.csv', index=False)

train df                                            Description                                               Code
387  write a program to find the frequency of words...  \ntest_str = 'times of india times new india e...
45          Write a lambda function to add two numbers         \n\ndef add(a, b):\n    return a + b\n\n\n
206  Write a function to return the sum of the root...  def sum_of_roots(a: float, c: float):\n    if ...
143  write a python program to replace blank space ...  def f12(x):\n    yield x + 1\n    print("test"...
541  Write a python function that Capitalize the Fi...  \n\ndef capitalize(fname):\n    with open(fnam...
..                                                 ...                                                ...
757                                     usage of break  for i in range(5):\n    if i == 1:\n        br...
357  Write a function that returns derivative deriv...  def derivative_relu(x: float) -> float:\n    x...
800  write the python program to gene

In [18]:
# import io
# from io import BytesIO
# from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP, tok_name

# def tokenize_code(text):
#     result = []
#     for tok in tokenize(io.BytesIO(text.encode('utf-8')).readline):
#         if tok_name[tok.exact_type] == 'NAME':
#             result.append(tok.string)
#         else:
#             result.append(tok_name[tok.exact_type])
#     return result

In [19]:
# tokenize_code(df['Code'][1])

In [20]:
'''
ENDMARKER = 0
NAME = 1
NUMBER = 2
STRING = 3
NEWLINE = 4
INDENT = 5
DEDENT = 6
LPAR = 7
RPAR = 8
LSQB = 9
RSQB = 10
COLON = 11
COMMA = 12
SEMI = 13
PLUS = 14
MINUS = 15
STAR = 16
SLASH = 17
VBAR = 18
AMPER = 19
LESS = 20
GREATER = 21
EQUAL = 22
DOT = 23
PERCENT = 24
LBRACE = 25
RBRACE = 26
EQEQUAL = 27
NOTEQUAL = 28
LESSEQUAL = 29
GREATEREQUAL = 30
TILDE = 31
CIRCUMFLEX = 32
LEFTSHIFT = 33
RIGHTSHIFT = 34
DOUBLESTAR = 35
PLUSEQUAL = 36
MINEQUAL = 37
STAREQUAL = 38
SLASHEQUAL = 39
PERCENTEQUAL = 40
AMPEREQUAL = 41
VBAREQUAL = 42
CIRCUMFLEXEQUAL = 43
LEFTSHIFTEQUAL = 44
RIGHTSHIFTEQUAL = 45
DOUBLESTAREQUAL = 46
DOUBLESLASH = 47
DOUBLESLASHEQUAL = 48
AT = 49
ATEQUAL = 50
RARROW = 51
ELLIPSIS = 52
COLONEQUAL = 53
OP = 54
AWAIT = 55
ASYNC = 56
TYPE_IGNORE = 57
TYPE_COMMENT = 58
# These aren't used by the C tokenizer but are needed for tokenize.py
ERRORTOKEN = 59
COMMENT = 60
NL = 61
ENCODING = 62
N_TOKENS = 63
# Special definitions for cooperation with parser
NT_OFFSET = 256
'''

"\nENDMARKER = 0\nNAME = 1\nNUMBER = 2\nSTRING = 3\nNEWLINE = 4\nINDENT = 5\nDEDENT = 6\nLPAR = 7\nRPAR = 8\nLSQB = 9\nRSQB = 10\nCOLON = 11\nCOMMA = 12\nSEMI = 13\nPLUS = 14\nMINUS = 15\nSTAR = 16\nSLASH = 17\nVBAR = 18\nAMPER = 19\nLESS = 20\nGREATER = 21\nEQUAL = 22\nDOT = 23\nPERCENT = 24\nLBRACE = 25\nRBRACE = 26\nEQEQUAL = 27\nNOTEQUAL = 28\nLESSEQUAL = 29\nGREATEREQUAL = 30\nTILDE = 31\nCIRCUMFLEX = 32\nLEFTSHIFT = 33\nRIGHTSHIFT = 34\nDOUBLESTAR = 35\nPLUSEQUAL = 36\nMINEQUAL = 37\nSTAREQUAL = 38\nSLASHEQUAL = 39\nPERCENTEQUAL = 40\nAMPEREQUAL = 41\nVBAREQUAL = 42\nCIRCUMFLEXEQUAL = 43\nLEFTSHIFTEQUAL = 44\nRIGHTSHIFTEQUAL = 45\nDOUBLESTAREQUAL = 46\nDOUBLESLASH = 47\nDOUBLESLASHEQUAL = 48\nAT = 49\nATEQUAL = 50\nRARROW = 51\nELLIPSIS = 52\nCOLONEQUAL = 53\nOP = 54\nAWAIT = 55\nASYNC = 56\nTYPE_IGNORE = 57\nTYPE_COMMENT = 58\n# These aren't used by the C tokenizer but are needed for tokenize.py\nERRORTOKEN = 59\nCOMMENT = 60\nNL = 61\nENCODING = 62\nN_TOKENS = 63\n# Special def

In [21]:
import io
from io import BytesIO
from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP, tok_name
#https://docs.python.org/3/library/tokenize.html
def tokenize_python(code_snippet):
    tokens = tokenize(io.BytesIO(code_snippet.encode('utf-8')).readline)
    parsed = []
    for token in tokens:
        if token.type not in [0,59,60,61,62,63,256]:
            parsed.append(token.string)
    return parsed

In [22]:
tokenize_python(df['Code'][1])

['utf-8',
 '\n',
 '\n',
 'def',
 'add_two_numbers',
 '(',
 'num1',
 ',',
 'num2',
 ')',
 ':',
 '\n',
 '    ',
 'sum',
 '=',
 'num1',
 '+',
 'num2',
 '\n',
 'return',
 'sum',
 '\n',
 '\n',
 '\n',
 '']

In [23]:
import spacy
spacy_en = spacy.load('en')

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC = Field(tokenize= tokenize_en, 
            init_token='<sos>', 
            eos_token='<eos>', 
            lower=True,
            batch_first=True)

TRG = Field(tokenize = tokenize_python, 
            init_token='<sos>', 
            eos_token='<eos>', 
            lower=False,
            batch_first=True)



In [24]:
fields = [('Description', SRC),('Code',TRG)]


In [25]:
# Using tabular dataset to process the text

train_data, test_data = TabularDataset.splits(
                                path = '',   
                                train = './train.csv',
                                test = './valid.csv',
                                format = 'csv',
                                fields = fields)



In [26]:
BATCH_SIZE = 16
device = "cuda" if torch.cuda.is_available() else "cpu"

In [27]:
SRC.build_vocab(train_data, min_freq = 3,max_size= 10000)
TRG.build_vocab(test_data, min_freq = 3,max_size= 10000)

In [28]:
train_iterator, test_iterator = BucketIterator.splits(
    (train_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.Description),
    device = device)



In [29]:
class Seq2Seq(nn.Module):
    def __init__(self, 
                 encoder, 
                 decoder, 
                 src_pad_idx, 
                 trg_pad_idx, 
                 device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device
        
    def make_src_mask(self, src):
        
        #src = [batch size, src len]
        
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)

        #src_mask = [batch size, 1, 1, src len]

        return src_mask
    
    def make_trg_mask(self, trg):
        
        #trg = [batch size, trg len]
        
        trg_pad_mask = (trg != self.trg_pad_idx).unsqueeze(1).unsqueeze(2)
        
        #trg_pad_mask = [batch size, 1, 1, trg len]
        
        trg_len = trg.shape[1]
        
        trg_sub_mask = torch.tril(torch.ones((trg_len, trg_len), device = self.device)).bool()
        
        #trg_sub_mask = [trg len, trg len]
            
        trg_mask = trg_pad_mask & trg_sub_mask
        
        #trg_mask = [batch size, 1, trg len, trg len]
        
        return trg_mask

    def forward(self, src, trg):
        
        #src = [batch size, src len]
        #trg = [batch size, trg len]
                
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        
        #src_mask = [batch size, 1, 1, src len]
        #trg_mask = [batch size, 1, trg len, trg len]


        enc_src = self.encoder(src, src_mask)
        
        #enc_src = [batch size, src len, hid dim]
                
        output, attention = self.decoder(trg, enc_src, trg_mask, src_mask)
        
        #output = [batch size, trg len, output dim]
        #attention = [batch size, n heads, trg len, src len]        
        return output, attention

In [30]:
class Encoder(nn.Module):
    def __init__(self, 
                 input_dim, 
                 hid_dim, 
                 n_layers, 
                 n_heads, 
                 pf_dim,
                 dropout, 
                 device,
                 max_length = 2000):
        super().__init__()

        self.device = device
        
        self.tok_embedding = nn.Embedding(input_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        
        self.layers = nn.ModuleList([EncoderLayer(hid_dim, 
                                                  n_heads, 
                                                  pf_dim,
                                                  dropout, 
                                                  device) 
                                     for _ in range(n_layers)])
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
        
    def forward(self, src, src_mask):
        
        #src = [batch size, src len]
        #src_mask = [batch size, 1, 1, src len]
        
        batch_size = src.shape[0]
        src_len = src.shape[1]
        
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        
        #pos = [batch size, src len]

        src = self.dropout((self.tok_embedding(src) * self.scale) + self.pos_embedding(pos))

        
        #src = [batch size, src len, hid dim]
        
        for layer in self.layers:
            src = layer(src, src_mask)
            
        #src = [batch size, src len, hid dim]
 
            
        return src

In [31]:
class EncoderLayer(nn.Module):
    def __init__(self, 
                 hid_dim, 
                 n_heads, 
                 pf_dim,  
                 dropout, 
                 device):
        super().__init__()
        
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, 
                                                                     pf_dim, 
                                                                     dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src, src_mask):
        
        #src = [batch size, src len, hid dim]
        #src_mask = [batch size, 1, 1, src len] 
                
        #self attention
        _src, _ = self.self_attention(src, src, src, src_mask)
        
        #dropout, residual connection and layer norm
        src = self.self_attn_layer_norm(src + self.dropout(_src))
        
        #src = [batch size, src len, hid dim]
        
        #positionwise feedforward
        _src = self.positionwise_feedforward(src)
        
        #dropout, residual and layer norm
        src = self.ff_layer_norm(src + self.dropout(_src))
        
        #src = [batch size, src len, hid dim]
        
        return src

In [32]:
class MultiHeadAttentionLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout, device):
        super().__init__()
        
        assert hid_dim % n_heads == 0
        
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        self.head_dim = hid_dim // n_heads
        
        self.fc_q = nn.Linear(hid_dim, hid_dim)
        self.fc_k = nn.Linear(hid_dim, hid_dim)
        self.fc_v = nn.Linear(hid_dim, hid_dim)
        
        self.fc_o = nn.Linear(hid_dim, hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)
        
    def forward(self, query, key, value, mask = None):
        
        batch_size = query.shape[0]
        
        #query = [batch size, query len, hid dim]
        #key = [batch size, key len, hid dim]
        #value = [batch size, value len, hid dim]
                
        Q = self.fc_q(query)
        K = self.fc_k(key)
        V = self.fc_v(value)
        
        #Q = [batch size, query len, hid dim]
        #K = [batch size, key len, hid dim]
        #V = [batch size, value len, hid dim]
                
        Q = Q.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        K = K.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        V = V.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        
        #Q = [batch size, n heads, query len, head dim]
        #K = [batch size, n heads, key len, head dim]
        #V = [batch size, n heads, value len, head dim]
                
        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        
        #energy = [batch size, n heads, query len, key len]
        
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)
        
        attention = torch.softmax(energy, dim = -1)
                
        #attention = [batch size, n heads, query len, key len]
                
        x = torch.matmul(self.dropout(attention), V)
        
        #x = [batch size, n heads, query len, head dim]
        
        x = x.permute(0, 2, 1, 3).contiguous()
        
        #x = [batch size, query len, n heads, head dim]
        
        x = x.view(batch_size, -1, self.hid_dim)
        
        #x = [batch size, query len, hid dim]
        
        x = self.fc_o(x)
        
        #x = [batch size, query len, hid dim]
        
        return x, attention

In [33]:
class PositionwiseFeedforwardLayer(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()
        
        self.fc_1 = nn.Linear(hid_dim, pf_dim)
        self.fc_2 = nn.Linear(pf_dim, hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        
        #x = [batch size, seq len, hid dim]
        
        x = self.dropout(torch.relu(self.fc_1(x)))
        
        #x = [batch size, seq len, pf dim]
        
        x = self.fc_2(x)
        
        #x = [batch size, seq len, hid dim]
        
        return x

In [34]:
class Decoder(nn.Module):
    def __init__(self, 
                 output_dim, 
                 hid_dim, 
                 n_layers, 
                 n_heads, 
                 pf_dim, 
                 dropout, 
                 device,
                 max_length = 2000):
        super().__init__()
        
        self.device = device
        
        self.tok_embedding = nn.Embedding(output_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        
        self.layers = nn.ModuleList([DecoderLayer(hid_dim, 
                                                  n_heads, 
                                                  pf_dim, 
                                                  dropout, 
                                                  device)
                                     for _ in range(n_layers)])
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
        
    def forward(self, trg, enc_src, trg_mask, src_mask):
        
        #trg = [batch size, trg len]
        #enc_src = [batch size, src len, hid dim]
        #trg_mask = [batch size, 1, trg len, trg len]
        #src_mask = [batch size, 1, 1, src len]
                
        batch_size = trg.shape[0]
        trg_len = trg.shape[1]
        
        pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
                            
        #pos = [batch size, trg len]
            
        trg = self.dropout((self.tok_embedding(trg) * self.scale) + self.pos_embedding(pos))
                
        #trg = [batch size, trg len, hid dim]
        
        for layer in self.layers:
            trg, attention = layer(trg, enc_src, trg_mask, src_mask)
        
        #trg = [batch size, trg len, hid dim]
        #attention = [batch size, n heads, trg len, src len]
        
        output = self.fc_out(trg)
        
        #output = [batch size, trg len, output dim]
            
        return output, attention

In [35]:
class DecoderLayer(nn.Module):
    def __init__(self, 
                 hid_dim, 
                 n_heads, 
                 pf_dim, 
                 dropout, 
                 device):
        super().__init__()
        
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.enc_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.encoder_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, 
                                                                     pf_dim, 
                                                                     dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, trg, enc_src, trg_mask, src_mask):
        
        #trg = [batch size, trg len, hid dim]
        #enc_src = [batch size, src len, hid dim]
        #trg_mask = [batch size, 1, trg len, trg len]
        #src_mask = [batch size, 1, 1, src len]
        
        #self attention
        _trg, _ = self.self_attention(trg, trg, trg, trg_mask)
        
        #dropout, residual connection and layer norm
        trg = self.self_attn_layer_norm(trg + self.dropout(_trg))
            
        #trg = [batch size, trg len, hid dim]
            
        #encoder attention
        _trg, attention = self.encoder_attention(trg, enc_src, enc_src, src_mask)
        # query, key, value
        
        #dropout, residual connection and layer norm
        trg = self.enc_attn_layer_norm(trg + self.dropout(_trg))
                    
        #trg = [batch size, trg len, hid dim]
        
        #positionwise feedforward
        _trg = self.positionwise_feedforward(trg)
        
        #dropout, residual and layer norm
        trg = self.ff_layer_norm(trg + self.dropout(_trg))
        
        #trg = [batch size, trg len, hid dim]
        #attention = [batch size, n heads, trg len, src len]
        
        return trg, attention

In [36]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
HID_DIM = 256
ENC_LAYERS = 2
DEC_LAYERS = 2
ENC_HEADS = 8
DEC_HEADS = 8
ENC_PF_DIM = 512
DEC_PF_DIM = 512
ENC_DROPOUT = 0.1
DEC_DROPOUT = 0.1

enc = Encoder(INPUT_DIM, 
              HID_DIM, 
              ENC_LAYERS, 
              ENC_HEADS, 
              ENC_PF_DIM, 
              ENC_DROPOUT, 
              device)


dec = Decoder(OUTPUT_DIM, 
              HID_DIM, 
              DEC_LAYERS, 
              DEC_HEADS, 
              DEC_PF_DIM, 
              DEC_DROPOUT,
              device)

In [37]:
SRC_PAD_IDX = SRC.vocab.stoi[SRC.pad_token]
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

model = Seq2Seq(enc, dec, SRC_PAD_IDX, TRG_PAD_IDX, device).to(device)

In [38]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 3,902,743 trainable parameters


In [39]:
def initialize_weights(m):
    if hasattr(m, 'weight') and m.weight.dim() > 1:
        nn.init.xavier_uniform_(m.weight.data)

In [40]:
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)


In [41]:
model.apply(initialize_weights);


In [42]:

LEARNING_RATE = 0.0005
optimizer = torch.optim.Adam(model.parameters(), lr = LEARNING_RATE)

In [43]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.Description
        trg = batch.Code
        
        optimizer.zero_grad()
        
        output, _ = model(src, trg[:,:-1])

                
        #output = [batch size, trg len - 1, output dim]
        #trg = [batch size, trg len]
            
        output_dim = output.shape[-1]

            
        output = output.contiguous().view(-1, output_dim)
        trg = trg[:,1:].contiguous().view(-1)
                
        #output = [batch size * trg len - 1, output dim]
        #trg = [batch size * trg len - 1]
            
        loss = criterion(output, trg)
        #loss = maskNLLLoss(output, trg,model.trg_pad_idx)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [44]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.Description
            trg = batch.Code

            output, _ = model(src, trg[:,:-1])
            
            #output = [batch size, trg len - 1, output dim]
            #trg = [batch size, trg len]
            
            output_dim = output.shape[-1]
           
            
            output = output.contiguous().view(-1, output_dim)
            trg = trg[:,1:].contiguous().view(-1)
            
            #output = [batch size * trg len - 1, output dim]
            #trg = [batch size * trg len - 1]
            
            
            loss = criterion(output, trg)
            #loss = maskNLLLoss(output, trg,model.trg_pad_idx)

            #loss,_ = maskNLLLoss(output, trg, mask)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [45]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [46]:
import time
N_EPOCHS = 20
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, test_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut6-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')



Epoch: 01 | Time: 0m 2s
	Train Loss: 3.157 | Train PPL:  23.493
	 Val. Loss: 2.758 |  Val. PPL:  15.772
Epoch: 02 | Time: 0m 1s
	Train Loss: 2.246 | Train PPL:   9.448
	 Val. Loss: 2.376 |  Val. PPL:  10.757
Epoch: 03 | Time: 0m 1s
	Train Loss: 1.967 | Train PPL:   7.151
	 Val. Loss: 2.194 |  Val. PPL:   8.973
Epoch: 04 | Time: 0m 1s
	Train Loss: 1.777 | Train PPL:   5.914
	 Val. Loss: 2.097 |  Val. PPL:   8.140
Epoch: 05 | Time: 0m 1s
	Train Loss: 1.647 | Train PPL:   5.189
	 Val. Loss: 2.013 |  Val. PPL:   7.486
Epoch: 06 | Time: 0m 1s
	Train Loss: 1.535 | Train PPL:   4.641
	 Val. Loss: 1.962 |  Val. PPL:   7.115
Epoch: 07 | Time: 0m 1s
	Train Loss: 1.446 | Train PPL:   4.246
	 Val. Loss: 1.937 |  Val. PPL:   6.936
Epoch: 08 | Time: 0m 1s
	Train Loss: 1.373 | Train PPL:   3.946
	 Val. Loss: 1.891 |  Val. PPL:   6.624
Epoch: 09 | Time: 0m 1s
	Train Loss: 1.306 | Train PPL:   3.690
	 Val. Loss: 1.920 |  Val. PPL:   6.819
Epoch: 10 | Time: 0m 1s
	Train Loss: 1.246 | Train PPL:   3.475


In [47]:
def translate_sentence(sentence, src_field, trg_field, model, device, max_len = 100):

    model.eval()
        
    if isinstance(sentence, str):
        nlp = spacy.load('en')
        tokens = [token.text.lower() for token in nlp(sentence)]
    else:
        tokens = [token.lower() for token in sentence]

    tokens = [src_field.init_token] + tokens + [src_field.eos_token]
        
    src_indexes = [src_field.vocab.stoi[token] for token in tokens]

    src_tensor = torch.LongTensor(src_indexes).unsqueeze(0).to(device)
    src_mask = model.make_src_mask(src_tensor)

    with torch.no_grad():
        enc_src = model.encoder(src_tensor,src_mask)

    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]

    for i in range(max_len):

        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)
        trg_mask = model.make_trg_mask(trg_tensor)
        with torch.no_grad():
            output, attention = model.decoder(trg_tensor, enc_src, trg_mask, src_mask)
        
        pred_token = output.argmax(2)[:,-1].item()
        
        trg_indexes.append(pred_token)

        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
    
    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]
    
    return trg_tokens[1:]#, attention

In [48]:
sentence = "write a program to find and print the largest among three numbers"
code = translate_sentence(sentence, SRC, TRG, model, device)
print(f'predicted trg = {code}')
#print(f'predicted trg = {" ".join(code)}')


predicted trg = ['utf-8', '\n', 'a', '=', 'float', '(', 'input', '(', '<unk>', ')', ')', '\n', 'b', '=', 'float', '(', 'input', '(', '<unk>', ')', ')', '\n', 'b', '=', 'float', '(', '<unk>', ')', '\n', 'c', '=', 'float', '(', 'input', '(', '<unk>', ')', ')', '\n', '\n', 'if', '(', 'a', '>', 'b', 'and', 'c', '>', 'c', ')', ':', '\n', '    ', 'print', '(', 'a', 'and', 'c', ')', '\n', '', 'else', ':', '\n', '    ', 'print', '(', '<unk>', ',', 'c', ')', '', '', '<eos>']


In [49]:
print("".join(code))

utf-8
a=float(input(<unk>))
b=float(input(<unk>))
b=float(<unk>)
c=float(input(<unk>))

if(a>bandc>c):
    print(aandc)
else:
    print(<unk>,c)<eos>


In [50]:
sentence = "write a program to add two numbers"
code = translate_sentence(sentence, SRC, TRG, model, device)
print(f'predicted trg = {code}')
print(f'predicted trg = {" ".join(code)}')


predicted trg = ['utf-8', 'num1', '=', '<unk>', '\n', 'num2', '=', '<unk>', '\n', 'sum', '=', 'num1', '+', 'num2', '\n', 'print', '(', 'num1', ',', 'num2', ')', '', '<eos>']
predicted trg = utf-8 num1 = <unk> 
 num2 = <unk> 
 sum = num1 + num2 
 print ( num1 , num2 )  <eos>


In [51]:
sentence = "write a program to multiply two numbers"
code = translate_sentence(sentence, SRC, TRG, model, device)
print(f'predicted trg = {code}\n')
print(" ".join(code))

predicted trg = ['utf-8', '\n', '\n', 'def', '<unk>', '(', 'a', ')', ':', '\n', '    ', 'sum1', '=', '<unk>', '\n', 'for', 'i', 'in', 'range', '(', '1', ',', '2', ')', ':', '\n', '        ', 'if', '(', 'a', '%', 'i', '==', '0', ')', ':', '\n', '            ', '<unk>', '+=', '1', '\n', '', '', 'return', 'False', '', '<eos>']

utf-8 
 
 def <unk> ( a ) : 
      sum1 = <unk> 
 for i in range ( 1 , 2 ) : 
          if ( a % i == 0 ) : 
              <unk> += 1 
   return False  <eos>


In [52]:
sentence = "write a program to print factorial of a number"
code = translate_sentence(sentence, SRC, TRG, model, device)
print(f'predicted trg = {code}\n')
print(" ".join(code))

predicted trg = ['utf-8', '\n', 'num', '=', 'int', '(', 'input', '(', '<unk>', ')', ')', '\n', 'for', 'i', 'in', 'range', '(', '1', ',', '<unk>', ')', ':', '\n', '    ', 'print', '(', 'num', ',', '<unk>', ')', '', '', '<eos>']

utf-8 
 num = int ( input ( <unk> ) ) 
 for i in range ( 1 , <unk> ) : 
      print ( num , <unk> )   <eos>
