<a href="https://colab.research.google.com/github/ll3091/ANLY-580-01-NLP-Project/blob/master/TextGenerationModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Project: Text Generation Model Training

In [4]:
# source https://github.com/minimaxir/textgenrnn
! pip install textgenrnn

Collecting textgenrnn
  Downloading https://files.pythonhosted.org/packages/ad/f8/f1968b2078a9076f481916fba5d98affa019943e4f5764224ffaeb57b7c7/textgenrnn-1.4.1.tar.gz (1.7MB)
Building wheels for collected packages: textgenrnn
  Running setup.py bdist_wheel for textgenrnn ... done
  Stored in directory: /root/.cache/pip/wheels/30/96/f7/bc7042ea671bc79455c244af21050a7a32d604fe2f7a44e322
Successfully built textgenrnn
Installing collected packages: textgenrnn
Successfully installed textgenrnn-1.4.1


In [5]:
from textgenrnn import textgenrnn
from google.colab import drive

Using TensorFlow backend.


In [0]:
# mount Google Drive so the datasets and saved models are reachable
drive.mount('/content/gdrive/')

In [7]:
! ls

gdrive	sample_data


In [8]:
! ls gdrive/'My Drive'/NLPProject

char_jokes_config.json		       songs10000.txt
char_jokes_vocab.json		       songs1000.txt
char_jokes_weights.hdf5		       songs25000.txt
char_motivational_quotes_config.json   songs5000.txt
char_motivational_quotes_vocab.json    TextGenerationModels.ipynb
char_motivational_quotes_weights.hdf5  trump_tweets.txt
char_trump_tweets_config.json	       word_motivational_quotes_config.json
char_trump_tweets_vocab.json	       word_motivational_quotes_vocab.json
char_trump_tweets_weights.hdf5	       word_motivational_quotes_weights.hdf5
DataExploration.ipynb		       word_trump_tweets_config.json
jokes.txt			       word_trump_tweets_vocab.json
motivational_quotes.txt		       word_trump_tweets_weights.hdf5
songdata.csv


## Character-Level RNNs

In [0]:
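# model configuration
model_cfg = {
    'rnn_size': 128,            # LSTM cells per layer
    'rnn_layers': 4,            # number of stacked LSTM layers
    'rnn_bidirectional': True,  # read the context both forwards and backwards
    'max_length': 40,           # tokens of context used to predict the next token
    'max_words': 300,           # vocabulary cap; not passed to the training calls below
    'dim_embeddings': 100,      # size of the token embedding vectors
    'word_level': False,        # False trains on characters, True on words
}

train_cfg = {
    'line_delimited': False,    # True if each line of the file is a separate text
    'num_epochs': 10,           # passes over the training data
    'gen_epochs': 2,            # print sample text every 2 epochs
    'batch_size': 512,
    'train_size': 0.8,          # fraction of the data used for training
    'dropout': 0.25,            # fraction of input tokens randomly ignored each epoch
    'max_gen_length': 300,      # cap on the length of each generated sample
    'validation': True,         # evaluate on the held-out 20%
    'is_csv': False             # True if the file is a CSV export
}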
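
With `word_level` set to `False`, `max_length` is counted in characters: the model sees at most 40 preceding characters when predicting the next one. The sketch below illustrates that kind of windowing on a toy string. It is an illustration of the idea only; `char_windows` is a hypothetical helper, not textgenrnn's code, and the library's actual sequence generation may differ in detail.

```python
def char_windows(text, max_length):
    """Yield (context, next_char) pairs.

    Each context holds at most `max_length` characters that precede
    the character to be predicted. Hypothetical helper for illustration.
    """
    for i in range(1, len(text)):
        start = max(0, i - max_length)
        yield text[start:i], text[i]


# Toy example with a short window so the pairs are easy to read.
pairs = list(char_windows("fake news", max_length=4))
for context, target in pairs:
    print(repr(context), "->", repr(target))
```

A longer window gives the model more context per prediction at the cost of more computation per sequence.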

### Trump Tweets

In [0]:
file = './gdrive/My Drive/NLPProject/trump_tweets.txt'

model_name = 'char_trump_tweets'
textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file
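
The training cells in this notebook pass each setting to `train_function` by name. As a possible shortcut, the two dictionaries could be merged into keyword arguments. This is a minimal sketch, not part of the original notebook: `build_train_kwargs` is a hypothetical helper, and it assumes `line_delimited` and `max_words` should stay out of the call because the cells here never pass them.

```python
def build_train_kwargs(model_cfg, train_cfg, skip=("line_delimited", "max_words")):
    """Merge both config dicts into one kwargs dict, dropping the keys in `skip`.

    Hypothetical helper; the default `skip` mirrors the keys that the
    notebook's training calls leave out.
    """
    merged = {**train_cfg, **model_cfg}
    return {key: value for key, value in merged.items() if key not in skip}


# Small stand-in dictionaries so the sketch runs on its own.
demo_model_cfg = {"rnn_size": 128, "rnn_layers": 4, "max_words": 300}
demo_train_cfg = {"line_delimited": False, "num_epochs": 10, "dropout": 0.25}

kwargs = build_train_kwargs(demo_model_cfg, demo_train_cfg)
print(sorted(kwargs))

# In the notebook this could then be used as (untested against textgenrnn here):
# train_function(file_path=file, new_model=True,
#                **build_train_kwargs(model_cfg, train_cfg))
```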

In [0]:
# ~50 mins to train
train_function(
    file_path=file,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    batch_size=train_cfg['batch_size'],
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    max_gen_length=train_cfg['max_gen_length'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=model_cfg['dim_embeddings'],
    word_level=model_cfg['word_level'])

Training new model w/ 4-layer, 128-cell Bidirectional LSTMs
Training on 647,155 character sequences.
Epoch 1/10
Epoch 2/10
####################
Temperature: 0.2
####################
 RT @realDonaldTrump: The Failiness is no totally see the fired to the Fake News Media will be the country and the Democrats and the people of the people of the Senate is no big the most the most be the FBI is no totally and the FBI and the FBI is no long and the people to the people of the Republic

 I will be the people of the press country and the people with the FBI and the Failines to the FBI is no going to the fired to the people and the media is no long to the people with the Failines and the FBI is no going to the people of the United States of State of Security and the FBI is no done an

ets: The Failines and the Senate is no confidence of the Democrats and the total confidence of the Failines and the Failines and the @WhiteHouse of the Election of the FBI and the United States of Justice and the U

"Temperature. We can also play with the temperature of the Softmax during sampling. Decreasing the temperature from 1 to some lower number (e.g. 0.5) makes the RNN more confident, but also more conservative in its samples. Conversely, higher temperatures will give more diversity but at cost of more mistakes (e.g. spelling mistakes, etc)."

*http://karpathy.github.io/2015/05/21/rnn-effectiveness/*
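
To make the quoted effect concrete, here is a small sketch of temperature sampling over a toy next-token distribution. It uses only the standard library; `apply_temperature` and `sample_index` are hypothetical names, and textgenrnn's own sampling code may be implemented differently.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a sampling temperature.

    Assumes every probability is strictly positive and temperature > 0.
    """
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, temperature, rng):
    """Draw one index from the temperature-adjusted distribution."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


probs = [0.6, 0.3, 0.1]                # toy next-token probabilities
cold = apply_temperature(probs, 0.2)   # sharper: mass moves to the top token
hot = apply_temperature(probs, 2.0)    # flatter: rarer tokens become likelier
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
print(sample_index(probs, 0.2, random.Random(0)))
```

At a temperature of 0.2 nearly all of the probability sits on the most likely token, which is consistent with the repetitive phrasing in the low-temperature samples above.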

In [0]:
! cp char_trump_tweets_config.json ./gdrive/'My Drive'/NLPProject/
! cp char_trump_tweets_vocab.json ./gdrive/'My Drive'/NLPProject/
! cp char_trump_tweets_weights.hdf5 ./gdrive/'My Drive'/NLPProject/
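
The three `cp` commands can also be written in Python, which makes it easier to fail early if a file is missing. A minimal sketch, assuming the three file suffixes seen in the Drive listing above; `save_model_artifacts` is a hypothetical helper, not a textgenrnn function.

```python
import shutil
from pathlib import Path

# Suffixes of the files saved for each model, as seen in the Drive listing.
ARTIFACT_SUFFIXES = ("_config.json", "_vocab.json", "_weights.hdf5")


def save_model_artifacts(model_name, src_dir, dest_dir):
    """Copy the config, vocab, and weights files for `model_name`.

    Raises FileNotFoundError before copying anything if a file is missing.
    Returns the list of destination paths.
    """
    sources = [Path(src_dir) / (model_name + suffix) for suffix in ARTIFACT_SUFFIXES]
    missing = [str(path) for path in sources if not path.exists()]
    if missing:
        raise FileNotFoundError("missing model files: " + ", ".join(missing))

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(path, dest / path.name)) for path in sources]


# Example call for this notebook's layout (paths taken from the cells above):
# save_model_artifacts('char_trump_tweets', '.', './gdrive/My Drive/NLPProject/')
```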

### Motivational Quotes

In [0]:
file = './gdrive/My Drive/NLPProject/motivational_quotes.txt'

model_name = 'char_motivational_quotes'
textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file

In [0]:
# ~30 mins to train
train_function(
    file_path=file,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    batch_size=train_cfg['batch_size'],
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    max_gen_length=train_cfg['max_gen_length'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=model_cfg['dim_embeddings'],
    word_level=model_cfg['word_level'])

Training new model w/ 4-layer, 128-cell Bidirectional LSTMs
Training on 387,804 character sequences.
Epoch 1/10
Epoch 2/10
####################
Temperature: 0.2
####################
uccess is the more the more the more the saction the more the more the secret to be the start the destrue to be the more the start of the secret of the world it will success is a strugge the more to be the more the said to success is the one the secret the best the world and the more the more the mo

s the success is the more to be the one who do an a strugge to the more the secret to the secret of the start to be a success of the more the speach will the secret to success of the secret to be a look the secret of the secret to success is the saction is the world of the secret of the secret to do

 of success is the more the world to be the secret of the success is the spect to success is the world to the control the more the saction will see to be a success is the world to be a success is the more the secre

In [0]:
! cp char_motivational_quotes_config.json ./gdrive/'My Drive'/NLPProject/
! cp char_motivational_quotes_vocab.json ./gdrive/'My Drive'/NLPProject/
! cp char_motivational_quotes_weights.hdf5 ./gdrive/'My Drive'/NLPProject/

### Jokes

In [0]:
file = './gdrive/My Drive/NLPProject/jokes.txt'

model_name = 'char_jokes'
textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file

In [0]:
# ~30 mins to train
train_function(
    file_path=file,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    batch_size=train_cfg['batch_size'],
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    max_gen_length=train_cfg['max_gen_length'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=model_cfg['dim_embeddings'],
    word_level=model_cfg['word_level'])

Training new model w/ 4-layer, 128-cell Bidirectional LSTMs
Training on 1,957,718 character sequences.
Epoch 1/10
Epoch 2/10
####################
Temperature: 0.2
####################
et of the country and asked him the man and said, "I have to go to the man, and the problem was a priest with a store and the man said "I have a section of the street street to her family to the street and the third was from the forest of the secret to the bartender who was standing at his wife and 

roblem is the trip to the bartender and said "I was a paid of the secret of the street with a street with a priest of the man who was a priest on the street to the street with a store of the street with a britch of the store and said to the man was a door and said "I have to come to go to the proble

oor was a store to the street and said to his wife was a paid of the house. 
	
	The problem was a priest with a strand to the secret with a bar. The best day the blonde was started to the tree and said, "I have t

In [0]:
! cp char_jokes_config.json ./gdrive/'My Drive'/NLPProject/
! cp char_jokes_vocab.json ./gdrive/'My Drive'/NLPProject/
! cp char_jokes_weights.hdf5 ./gdrive/'My Drive'/NLPProject/

## Word-Level RNNs

In [0]:
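# model configuration for word-level training: shorter context (max_length) and more epochs than the character-level runs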
model_cfg = {
    'rnn_size': 128,
    'rnn_layers': 4,
    'rnn_bidirectional': True,
    'max_length': 10,
    'max_words': 10000,
    'dim_embeddings': 100,
    'word_level': True,
}

train_cfg = {
    'line_delimited': False,
    'num_epochs': 50,
    'gen_epochs': 5,
    'batch_size': 512,
    'train_size': 0.8,
    'dropout': 0.25,
    'max_gen_length': 150,
    'validation': True,
    'is_csv': False
}
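
The word-level samples in this notebook are lowercased and have punctuation split into separate tokens (for example `https : / / t . co`). The sketch below reproduces that style of tokenization and shows how a `max_words`-style cap would keep only the most frequent tokens. The regular expression and both helper names are assumptions made for illustration, not textgenrnn's actual preprocessing.

```python
import re
from collections import Counter

# Assumption: treat any character that is neither a word character
# nor whitespace as its own token.
PUNCT = re.compile(r"([^\w\s])")


def word_tokens(text):
    """Lowercase the text and split punctuation into separate tokens."""
    return PUNCT.sub(r" \1 ", text.lower()).split()


def build_vocab(tokens, max_words):
    """Map the `max_words` most frequent tokens to integer ids starting at 1."""
    counts = Counter(tokens)
    return {word: idx + 1 for idx, (word, _) in enumerate(counts.most_common(max_words))}


tokens = word_tokens("RT @realDonaldTrump: https://t.co")
print(tokens)

vocab = build_vocab(tokens, max_words=3)
print(vocab)
```

Tokens outside the cap have no id in this sketch, so a real pipeline would need to decide how to handle them.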

### Trump Tweets

In [0]:
file = './gdrive/My Drive/NLPProject/trump_tweets.txt'

model_name = 'word_trump_tweets'
textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file

In [11]:
# ~30 mins to train
train_function(
    file_path=file,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    batch_size=train_cfg['batch_size'],
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    max_gen_length=train_cfg['max_gen_length'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=model_cfg['dim_embeddings'],
    word_level=model_cfg['word_level'])

Training new model w/ 4-layer, 128-cell Bidirectional LSTMs
Training on 148,415 word sequences.
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
####################
Temperature: 0.2
####################
united states is a great job and productive mattis , and has my full support . https : / / t . co / gxror3alce

 trump tweets : the fake news media is desperate to me about the enemy of our country . i want to protect our military , and we need our military , and we will have the right to make america great again !

 trump tweets : rt @ realdonaldtrump : the fake news is working on russia . they are a big lie . this is a great job . i will be a great governor !

 trump tweets : the fake news media is desperate to me about the enemy of alabama . i will be very good !

 trump tweets : rt @ realdonaldtrump : the trump post is false " when president obama ,

united states has agreed to be able to me by me for the people . without come up for the wall .

 trump tweets : rt @ realdonal

In [0]:
! cp word_trump_tweets_config.json ./gdrive/'My Drive'/NLPProject/
! cp word_trump_tweets_vocab.json ./gdrive/'My Drive'/NLPProject/
! cp word_trump_tweets_weights.hdf5 ./gdrive/'My Drive'/NLPProject/

### Motivational Quotes

In [0]:
file = './gdrive/My Drive/NLPProject/motivational_quotes.txt'

model_name = 'word_motivational_quotes'
textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file

In [14]:
# ~55 mins to train
train_function(
    file_path=file,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    batch_size=train_cfg['batch_size'],
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    max_gen_length=train_cfg['max_gen_length'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=model_cfg['dim_embeddings'],
    word_level=model_cfg['word_level'])

Training new model w/ 4-layer, 128-cell Bidirectional LSTMs
Training on 90,046 word sequences.
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
####################
Temperature: 0.2
####################
the result of taking the misstep in the right direction .

 luke bryan : i ' m not an addiction , i ' m not a product of my circumstances .

 napoleon hill : the more tranquil a leader and the will have to the same as you are capable .

 james e . givens : the most important thing is to quit talking and begin again .

 robin sharma : be yourself . you are not the same .

 james allen : what you do not stop your goals , you will become your dog .

 robert collier : the greatest obstacle is the success in your life .

 james e . e . e . kennedy : the greatest obstacle is in the prints of the end of the right time .



makes a mistake , you will be able to be a success .

 james e . e . cummings : i believe that i could not failed .

 henry ford : if you ’ re not going through hell ,

In [0]:
! cp word_motivational_quotes_config.json ./gdrive/'My Drive'/NLPProject/
! cp word_motivational_quotes_vocab.json ./gdrive/'My Drive'/NLPProject/
! cp word_motivational_quotes_weights.hdf5 ./gdrive/'My Drive'/NLPProject/

### Jokes

In [0]:
file = './gdrive/My Drive/NLPProject/jokes.txt'

model_name = 'word_jokes'
textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file

In [0]:
# ~x mins to train
train_function(
    file_path=file,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    batch_size=train_cfg['batch_size'],
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    max_gen_length=train_cfg['max_gen_length'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=model_cfg['dim_embeddings'],
    word_level=model_cfg['word_level'])

Training new model w/ 4-layer, 128-cell Bidirectional LSTMs
Training on 536,637 word sequences.
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
####################
Temperature: 0.2
####################
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
 	

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

####################
Temperature: 0

In [0]:
! cp word_jokes_config.json ./gdrive/'My Drive'/NLPProject/
! cp word_jokes_vocab.json ./gdrive/'My Drive'/NLPProject/
! cp word_jokes_weights.hdf5 ./gdrive/'My Drive'/NLPProject/

# Generating Text from Models

In [0]:
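# load a trained model from Drive; change model_name to use any model saved above
model_dir = './gdrive/My Drive/NLPProject/'
model_name = 'word_trump_tweets'
textgen = textgenrnn(weights_path=model_dir + model_name + '_weights.hdf5',
                     vocab_path=model_dir + model_name + '_vocab.json',
                     config_path=model_dir + model_name + '_config.json')

# print samples in the notebook, then write samples to a text file
textgen.generate_samples(max_gen_length=1000)
textgen.generate_to_file('textgenrnn_texts.txt', max_gen_length=1000)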