<a href="https://colab.research.google.com/github/ll3091/ANLY-580-01-NLP-Project/blob/master/LongTextGenerationModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Project: Text Generation Model Training

In [1]:
# textgenrnn source: https://github.com/minimaxir/textgenrnn
! pip install textgenrnn



In [2]:
from textgenrnn import textgenrnn
from google.colab import drive

Using TensorFlow backend.


In [0]:
# mount Google Drive so the notebook can read the data and save trained models
drive.mount('/content/gdrive/')

In [0]:
! ls

In [0]:
# confirm the project folder and its data files are visible
! ls gdrive/'My Drive'/NLPProject

## Character-Level RNNs

In [0]:
# model configuration
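model_cfg = {
    'rnn_size': 128,            # LSTM cells per layer
    'rnn_layers': 4,            # number of stacked LSTM layers
    'rnn_bidirectional': True,  # read each sequence forwards and backwards
    'max_length': 40,           # tokens (characters here) of context used to predict the next one
    'max_words': 300,           # maximum vocabulary size
    'dim_embeddings': 100,      # size of the token embedding vectors
    'word_level': False,        # False = character-level model
}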

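# training configuration
train_cfg = {
    'line_delimited': False,    # False = treat the file as one long text
    'num_epochs': 10,           # passes over the training data
    'gen_epochs': 2,            # print sample text every 2 epochs
    'batch_size': 512,          # sequences per gradient update
    'train_size': 0.8,          # fraction of sequences used for training
    'dropout': 0.25,            # fraction of input tokens randomly dropped during training
    'max_gen_length': 150,      # maximum tokens per generated sample
    'validation': True,         # evaluate on the held-out sequences each epoch
    'is_csv': False             # input is plain text, not a CSV
}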
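
As a rough illustration of what `max_length` controls in a character-level model: each training example pairs up to `max_length` preceding characters with the single character that follows. The sketch below is not textgenrnn's internal code. `make_char_windows` is a hypothetical helper written only for this illustration, and the sample string is made up.

```python
def make_char_windows(text, max_length):
    """Return (context, next_char) pairs from a sliding window over `text`.

    Illustrative only: the context holds at most `max_length` characters,
    and is shorter near the start of the text.
    """
    pairs = []
    for end in range(1, len(text)):
        start = max(0, end - max_length)
        pairs.append((text[start:end], text[end]))
    return pairs


pairs = make_char_windows("neural nets", max_length=4)
print(len(pairs))   # one pair per character after the first
print(pairs[:3])    # short contexts at the start of the text
print(pairs[-1])    # a full 4-character context
```

With `max_length` set to 40, a long text yields roughly one training sequence per character, which is why a corpus of about a million characters produces about a million sequences.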

### Conference Paper Abstracts

In [0]:
file = './gdrive/My Drive/NLPProject/abstracts.txt'

In [0]:
# ~2.5 hours to train
model_name = 'char_abstracts'
textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file

train_function(
    file_path=file,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    batch_size=train_cfg['batch_size'],
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    max_gen_length=train_cfg['max_gen_length'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=model_cfg['dim_embeddings'],
    max_words=model_cfg['max_words'],
    word_level=model_cfg['word_level'])

Training new model w/ 4-layer, 128-cell Bidirectional LSTMs
Training on 1,123,185 character sequences.
Epoch 1/10
Epoch 2/10
####################
Temperature: 0.2
####################
ional and approximate approaches are a single inference and state of the art algorithm for a set of the proposed approach is a general problem is a no

 and analytical analysis of the surrogate model and a set of data and a set of the problem of the problem of the computational computational analysis.

model and a complexity of the computational and experimental results on the noise and approximate model are a graphical problem in the computed at a p

####################
Temperature: 0.5
####################
state-of-the-art methods for the problem as a problem of data is a single problems.

Representation Estimators for Animing Learning from Learning for 

antial to the general analysis of a general complexity of the noisy information and proposed are state-of-the-art algorithm in a general datasets and
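
The samples above become less repetitive as the temperature rises from 0.2 to 0.5. The sketch below shows the usual way temperature reshapes a next-token distribution before sampling: divide the log-probabilities by the temperature and renormalize. It is a generic illustration, not textgenrnn's own sampling code. `apply_temperature`, `sample_index`, and the three-token distribution are assumptions made for this example.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a probability distribution; lower temperature sharpens it."""
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, temperature, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


probs = [0.5, 0.3, 0.2]  # made-up next-token probabilities
for temperature in (0.2, 0.5, 1.0):
    scaled = apply_temperature(probs, temperature)
    print(temperature, [round(p, 3) for p in scaled])
```

At low temperature nearly all of the probability mass moves to the most likely token, which is consistent with the looping phrases in the 0.2 samples.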

In [0]:
# copy the trained model's config, vocabulary, and weights to Google Drive
! cp char_abstracts_config.json ./gdrive/'My Drive'/NLPProject/
! cp char_abstracts_vocab.json ./gdrive/'My Drive'/NLPProject/
! cp char_abstracts_weights.hdf5 ./gdrive/'My Drive'/NLPProject/
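
A minimal sketch of reloading the saved files in a later session, assuming the three files were copied to the Drive folder as above and that the drive is mounted at the same path. `artifact_paths` is a hypothetical helper, not part of textgenrnn. The `weights_path`, `vocab_path`, and `config_path` arguments and the `generate` call follow the textgenrnn README, but the generation settings here are arbitrary choices.

```python
import os


def artifact_paths(model_name, directory):
    """Build the paths of the three files saved for one trained model."""
    return {
        'weights_path': os.path.join(directory, f'{model_name}_weights.hdf5'),
        'vocab_path': os.path.join(directory, f'{model_name}_vocab.json'),
        'config_path': os.path.join(directory, f'{model_name}_config.json'),
    }


paths = artifact_paths('char_abstracts', './gdrive/My Drive/NLPProject')

# Only try to load the model when all three files are actually present.
if all(os.path.exists(path) for path in paths.values()):
    from textgenrnn import textgenrnn

    textgen = textgenrnn(name='char_abstracts', **paths)
    textgen.generate(n=3, temperature=0.5, max_gen_length=150)
else:
    print('Saved model files not found; run the training cell first.')
```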

### Conference Papers

In [0]:
file = './gdrive/My Drive/NLPProject/papers.txt'

In [0]:
# ~x hours to train
model_name = 'char_papers'
textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file

train_function(
    file_path=file,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    batch_size=train_cfg['batch_size'],
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    max_gen_length=train_cfg['max_gen_length'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=model_cfg['dim_embeddings'],
    max_words=model_cfg['max_words'],
    word_level=model_cfg['word_level'])

In [0]:
# copy the trained model's config, vocabulary, and weights to Google Drive
! cp char_papers_config.json ./gdrive/'My Drive'/NLPProject/
! cp char_papers_vocab.json ./gdrive/'My Drive'/NLPProject/
! cp char_papers_weights.hdf5 ./gdrive/'My Drive'/NLPProject/
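
The training cell is repeated verbatim for every dataset, with only `file` and `model_name` changing. One possible way to reduce the repetition is a small wrapper like the one below. `train_model` is a hypothetical helper, not part of textgenrnn, and it forwards every key of `model_cfg` (so `max_words` is included) and every key of `train_cfg` except `line_delimited`.

```python
def train_model(textgen, file_path, model_cfg, train_cfg):
    """Train `textgen` on `file_path` using the two config dictionaries."""
    train_function = (textgen.train_from_file
                      if train_cfg['line_delimited']
                      else textgen.train_from_largetext_file)
    # 'line_delimited' only selects the method; it is not a training argument.
    train_kwargs = {key: value for key, value in train_cfg.items()
                    if key != 'line_delimited'}
    return train_function(file_path=file_path, new_model=True,
                          **train_kwargs, **model_cfg)


# Usage in this notebook (requires textgenrnn and the mounted drive):
# textgen = textgenrnn(name='char_papers')
# train_model(textgen, file, model_cfg, train_cfg)
```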

## Word-Level RNNs

In [0]:
# model configuration
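model_cfg = {
    'rnn_size': 128,            # LSTM cells per layer
    'rnn_layers': 4,            # number of stacked LSTM layers
    'rnn_bidirectional': True,  # read each sequence forwards and backwards
    'max_length': 10,           # tokens (words here) of context used to predict the next one
    'max_words': 10000,         # maximum vocabulary size; rarer words are ignored
    'dim_embeddings': 100,      # size of the token embedding vectors
    'word_level': True,         # True = word-level model
}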

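# training configuration
train_cfg = {
    'line_delimited': False,    # False = treat the file as one long text
    'num_epochs': 50,           # passes over the training data
    'gen_epochs': 5,            # print sample text every 5 epochs
    'batch_size': 512,          # sequences per gradient update
    'train_size': 0.8,          # fraction of sequences used for training
    'dropout': 0.25,            # fraction of input tokens randomly dropped during training
    'max_gen_length': 80,       # maximum tokens per generated sample
    'validation': True,         # evaluate on the held-out sequences each epoch
    'is_csv': False             # input is plain text, not a CSV
}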
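
To make the difference between the two settings concrete, the sketch below builds a vocabulary at character level and at word level and caps it at the `max_words` most frequent tokens. It is a simplified illustration, not textgenrnn's tokenizer: `build_vocab` is a hypothetical helper, the sample sentence is made up, and words are split on whitespace only.

```python
from collections import Counter


def build_vocab(text, word_level, max_words):
    """Return the `max_words` most frequent tokens, most frequent first."""
    tokens = text.lower().split() if word_level else list(text)
    counts = Counter(tokens)
    return [token for token, _ in counts.most_common(max_words)]


sample = "the model and the data and the loss"
print(build_vocab(sample, word_level=True, max_words=2))
print(build_vocab(sample, word_level=False, max_words=5))
```

A word-level vocabulary grows with the corpus, so a cap such as 10,000 matters there, while a character-level vocabulary is usually small enough that the cap is rarely reached.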

### Conference Paper Abstracts

In [0]:
file = './gdrive/My Drive/NLPProject/abstracts.txt'

In [0]:
# ~1 hour to train
model_name = 'word_abstracts'
textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file

train_function(
    file_path=file,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    batch_size=train_cfg['batch_size'],
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    max_gen_length=train_cfg['max_gen_length'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=model_cfg['dim_embeddings'],
    max_words=model_cfg['max_words'],
    word_level=model_cfg['word_level'])

In [0]:
# copy the trained model's config, vocabulary, and weights to Google Drive
! cp word_abstracts_config.json ./gdrive/'My Drive'/NLPProject/
! cp word_abstracts_vocab.json ./gdrive/'My Drive'/NLPProject/
! cp word_abstracts_weights.hdf5 ./gdrive/'My Drive'/NLPProject/

### Conference Papers

In [0]:
file = './gdrive/My Drive/NLPProject/papers.txt'

In [0]:
# ~1 hour to train
model_name = 'word_papers'
textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file

train_function(
    file_path=file,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    batch_size=train_cfg['batch_size'],
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    max_gen_length=train_cfg['max_gen_length'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=model_cfg['dim_embeddings'],
    max_words=model_cfg['max_words'],
    word_level=model_cfg['word_level'])

In [0]:
# copy the trained model's config, vocabulary, and weights to Google Drive
! cp word_papers_config.json ./gdrive/'My Drive'/NLPProject/
! cp word_papers_vocab.json ./gdrive/'My Drive'/NLPProject/
! cp word_papers_weights.hdf5 ./gdrive/'My Drive'/NLPProject/