<a href="https://colab.research.google.com/github/ll3091/ANLY-580-01-NLP-Project/blob/master/TextGenerationModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Project: Text Generation Training

In [1]:
# source https://github.com/minimaxir/textgenrnn
! pip install textgenrnn

Collecting textgenrnn
  Downloading https://files.pythonhosted.org/packages/ad/f8/f1968b2078a9076f481916fba5d98affa019943e4f5764224ffaeb57b7c7/textgenrnn-1.4.1.tar.gz (1.7MB)
Building wheels for collected packages: textgenrnn
  Running setup.py bdist_wheel for textgenrnn ... done
  Stored in directory: /root/.cache/pip/wheels/30/96/f7/bc7042ea671bc79455c244af21050a7a32d604fe2f7a44e322
Successfully built textgenrnn
Installing collected packages: textgenrnn
Successfully installed textgenrnn-1.4.1


In [2]:
from textgenrnn import textgenrnn
from google.colab import drive

Using TensorFlow backend.


In [0]:
# connect to Google drive
drive.mount('/content/gdrive/')

In [4]:
! ls

gdrive	sample_data


In [5]:
! ls gdrive/'My Drive'/NLPProject

DataExploration.ipynb  motivational_quotes.txt	trump_tweets.txt
jokes.txt	       TextGeneration.ipynb


## Character RNNs

In [0]:
# model configuration
model_cfg = {
    'rnn_size': 128,
    'rnn_layers': 4,
    'rnn_bidirectional': True,
    'max_length': 40,
    'max_words': 300,
    'dim_embeddings': 100,
    'word_level': False,
}

train_cfg = {
    'line_delimited': False,
    'num_epochs': 10,
    'gen_epochs': 2,
    'batch_size': 512,
    'train_size': 0.8,
    'dropout': 0.25,
    'max_gen_length': 300,
    'validation': True,
    'is_csv': False
}
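
The training cells in this notebook forward most of these settings one keyword at a time. As a possible shortcut, the two dictionaries could be merged into a single set of keyword arguments. The sketch below is only an illustration: `build_train_kwargs` and `NOT_FORWARDED` are hypothetical names, and the skipped keys are simply the ones the notebook's own training call does not pass along.

```python
# Same settings as the configuration cell, repeated so the sketch is self-contained.
model_cfg = {
    'rnn_size': 128, 'rnn_layers': 4, 'rnn_bidirectional': True,
    'max_length': 40, 'max_words': 300, 'dim_embeddings': 100,
    'word_level': False,
}
train_cfg = {
    'line_delimited': False, 'num_epochs': 10, 'gen_epochs': 2,
    'batch_size': 512, 'train_size': 0.8, 'dropout': 0.25,
    'max_gen_length': 300, 'validation': True, 'is_csv': False,
}

# Keys that the notebook's training call does not forward (assumption based on that call).
NOT_FORWARDED = {'line_delimited', 'max_words'}


def build_train_kwargs(model_cfg, train_cfg, skip=NOT_FORWARDED):
    """Merge both config dicts and drop the keys listed in `skip`."""
    merged = {**train_cfg, **model_cfg}
    return {key: value for key, value in merged.items() if key not in skip}


train_kwargs = build_train_kwargs(model_cfg, train_cfg)
print(sorted(train_kwargs))

# Intended use in a training cell (not run here):
# train_function(file_path=file, new_model=True, **train_kwargs)
```

Keeping the explicit keyword list, as the notebook does, is equally valid and makes each forwarded setting visible at the call site.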

### Trump Tweets

In [0]:
file = './gdrive/My Drive/NLPProject/trump_tweets.txt'

model_name = 'char_trump_tweets'
textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file

In [12]:
# ~50 mins to train
train_function(
    file_path=file,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    batch_size=train_cfg['batch_size'],
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    max_gen_length=train_cfg['max_gen_length'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=model_cfg['dim_embeddings'],
    word_level=model_cfg['word_level'])

Training new model w/ 4-layer, 128-cell Bidirectional LSTMs
Training on 647,155 character sequences.
Epoch 1/10
Epoch 2/10
####################
Temperature: 0.2
####################
 RT @realDonaldTrump: The Failiness is no totally see the fired to the Fake News Media will be the country and the Democrats and the people of the people of the Senate is no big the most the most be the FBI is no totally and the FBI and the FBI is no long and the people to the people of the Republic

 I will be the people of the press country and the people with the FBI and the Failines to the FBI is no going to the fired to the people and the media is no long to the people with the Failines and the FBI is no going to the people of the United States of State of Security and the FBI is no done an

ets: The Failines and the Senate is no confidence of the Democrats and the total confidence of the Failines and the Failines and the @WhiteHouse of the Election of the FBI and the United States of Justice and the U

"Temperature. We can also play with the temperature of the Softmax during sampling. Decreasing the temperature from 1 to some lower number (e.g. 0.5) makes the RNN more confident, but also more conservative in its samples. Conversely, higher temperatures will give more diversity but at cost of more mistakes (e.g. spelling mistakes, etc)."

*http://karpathy.github.io/2015/05/21/rnn-effectiveness/*
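
To make the quoted idea concrete, here is a minimal sketch of a temperature-scaled softmax in plain Python. The function name `softmax_with_temperature` and the example scores are made up for illustration, and textgenrnn's own sampling code may differ in detail.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax probabilities of `logits` after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate characters.
logits = [2.0, 1.0, 0.1]
for temperature in (0.2, 0.5, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring character, which is consistent with the repetitive samples printed at temperature 0.2 above.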

In [0]:
! cp char_trump_tweets_config.json ./gdrive/'My Drive'/NLPProject/
! cp char_trump_tweets_vocab.json ./gdrive/'My Drive'/NLPProject/
! cp char_trump_tweets_weights.hdf5 ./gdrive/'My Drive'/NLPProject/
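
The three `cp` commands could also be written as one Python helper that copies every file sharing the model-name prefix. This is a sketch only: `copy_model_files` is a hypothetical helper, and it assumes the saved files follow the `<model_name>_*` naming seen in the commands above.

```python
import shutil
from pathlib import Path


def copy_model_files(model_name, src_dir, dest_dir):
    """Copy every file named '<model_name>_*' from src_dir into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(src_dir).glob(f"{model_name}_*")):
        if path.is_file():
            shutil.copy2(path, dest / path.name)
            copied.append(path.name)
    return copied


# Intended use in this notebook (not run here):
# copy_model_files('char_trump_tweets', '.', './gdrive/My Drive/NLPProject')
```

The function returns the names of the files it copied, so an empty list signals that no matching files were found in the source directory.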

### Motivational Quotes

In [14]:
!ls

1				reddit_jokes.json	       songs25000.txt
char_trump_tweets_config.json	sample_data		       songs5000.txt
char_trump_tweets_vocab.json	since-20170120-processed.json  stupidstuff.json
char_trump_tweets_weights.hdf5	songdata.csv		       trump_tweets.txt
gdrive				songlyrics.zip		       wocka.json
jokes.txt			songs10000.txt
motivate			songs1000.txt


In [0]:
file = './gdrive/My Drive/NLPProject/motivational_quotes.txt'

# use a distinct name so the Trump model files are not overwritten
model_name = 'char_motivational_quotes'
textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file

In [0]:
# ~50 mins to train
train_function(
    file_path=file,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    batch_size=train_cfg['batch_size'],
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    max_gen_length=train_cfg['max_gen_length'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=model_cfg['dim_embeddings'],
    word_level=model_cfg['word_level'])

## Generating Text from Models

In [0]:
# load the files saved by a training run above (named '<model_name>_*')
textgen = textgenrnn(weights_path='char_trump_tweets_weights.hdf5',
                     vocab_path='char_trump_tweets_vocab.json',
                     config_path='char_trump_tweets_config.json')

textgen.generate_samples(max_gen_length=1000)
textgen.generate_to_file('textgenrnn_texts.txt', max_gen_length=1000)
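
Because each trained model is saved as three files that share a name prefix, the constructor arguments can be derived from the model name. The sketch below assumes the `<model_name>_weights.hdf5`, `<model_name>_vocab.json`, and `<model_name>_config.json` naming seen in the copy step earlier; `model_paths` is a hypothetical helper, not part of textgenrnn.

```python
from pathlib import Path


def model_paths(model_name, directory='.'):
    """Build the weights/vocab/config paths for a saved model name."""
    base = Path(directory)
    return {
        'weights_path': str(base / f'{model_name}_weights.hdf5'),
        'vocab_path': str(base / f'{model_name}_vocab.json'),
        'config_path': str(base / f'{model_name}_config.json'),
    }


paths = model_paths('char_trump_tweets', './gdrive/My Drive/NLPProject')

# Check that the files are present before trying to load them.
missing = [p for p in paths.values() if not Path(p).exists()]
print('missing files:', missing)

# Intended use once nothing is missing (not run here):
# textgen = textgenrnn(**paths)
# textgen.generate_samples(max_gen_length=1000)
```

Checking for missing files first gives a clearer message than a load failure when Drive is not mounted or the model name is mistyped.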