In [1]:
import numpy as np
import json
import random

from itertools import combinations_with_replacement as combo

### Create the dictionaries we need for training and the bopbot pipeline

This notebook takes the saved model and the cluster labels from the full training set (run on 53k clusters) and creates the data for the GPT2 model. We need to translate keypoint arrays into fake words and back, which takes several steps. The steps need to be run in batches; running them immediately after the model finished training crashed the kernel.

- use the cluster labels for our train_all data to calculate the mean pose for each cluster.
  - save a dictionary with cluster label as key and mean pose as value `label_to_meanPose_50k.json`
- Assign a word to each mean pose, aka, each cluster label. (Some clusters will be empty and will not need an assigned word.)
  - first step is to create a set of 5-letter words - get this from previous notebook
  - save two dictionaries because we need to go both directions:
    - use the words as keys and cluster labels as values: `word_to_label_50k.json`
    - use the cluster labels as keys and words as values: `label_to_word_50k.json`
- pipeline of data needed to train gpt model; here, we just need the entire train_all dataset as a text file with one word for each pose and the `[EOS]` token replacing the zeros; research the format used for the gpt2-tf2 version
  - use the cluster labels (`train_all_52000_cluster_labels.npy`) as keys to get the text word
  - save word to list
  - replace zeros cluster with `[EOS]` 
  - save to text file: `train_all_50k.txt`
- pipeline from user input array to output array returned to user
  - keypoint array -> cluster label (done with predict using saved model)
  - cluster label -> mean pose (used to output rendered video of vocab poses); these are saved in `label_to_mean_poses_50k.npz`
  - cluster label -> text word (these words get run through model)
  - text word -> cluster label -> mean pose (this is the output of the model translated to mean pose arrays ready for rendering; unless we create a third dict, this will require two dicts to perform)
- randomly test a few videos
  - search for zeros arrays; keep the index of these (save for future use); sequential indices in this list will be the beginning and ending index of a single video
  - render the original and the vocab video (do this for a handful of random frames, too)
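
The first step in the list above (mean pose per cluster) could be sketched roughly as follows. This is only an illustration on made-up toy data: `mean_pose_per_cluster`, `toy_poses` and `toy_labels` are hypothetical names, and the loop is not tuned for arrays the size of the real training set.

```python
import numpy as np

def mean_pose_per_cluster(poses, labels):
    """Return {cluster label: mean pose as nested lists} for non-empty clusters."""
    poses = np.asarray(poses, dtype=float)
    labels = np.asarray(labels).reshape(-1)   # accept shape (1, n) or (n,)
    label_to_mean_pose = {}
    for label in np.unique(labels):           # empty clusters never appear here
        members = poses[labels == label]
        # .tolist() makes the value JSON-serializable
        label_to_mean_pose[int(label)] = members.mean(axis=0).tolist()
    return label_to_mean_pose

# Toy data (made up): 4 poses of 2 keypoints each, falling into 2 clusters
toy_poses = np.array([
    [[0.0, 0.0], [2.0, 2.0]],
    [[2.0, 2.0], [4.0, 4.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
])
toy_labels = np.array([[0, 0, 1, 1]])         # shape (1, n), like the saved label arrays
toy_means = mean_pose_per_cluster(toy_poses, toy_labels)
print(toy_means)
```

Converting each mean pose to a nested list keeps the dictionary ready for `json.dump`; for rendering, `np.array(toy_means[label])` turns it back into an array.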
  

#### Upload the data

In [2]:
#Load the label array.
ballet_labels = np.load('ballet_8000_cluster_labels.npy')
ballet_labels.shape

(1, 929303)

In [4]:
#Load the train_ballet array.
train_ballet = np.load('../../data_preproc/train_ballet.npy')
train_ballet.shape

(929303, 14, 2)

#### Create the text vocab

In [5]:
#Create a bunch of 5-letter "words" (letters chosen with replacement).
vocab_5letter = [''.join(letters) for letters in combo('abcdefghijklmnopqrstuvwxyz', 5)]
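
As a sanity check on the size of this candidate vocab, the number of 5-letter combinations with replacement from 26 letters is C(26 + 5 - 1, 5) = C(30, 5). The sketch below recomputes it independently; `alphabet`, `words`, `expected` and `all_sorted` are names introduced here for illustration, and `math.comb` assumes Python 3.8 or later.

```python
from itertools import combinations_with_replacement
from math import comb

alphabet = 'abcdefghijklmnopqrstuvwxyz'
words = [''.join(letters) for letters in combinations_with_replacement(alphabet, 5)]

# Multiset count: C(n + k - 1, k) with n = 26 letters and k = 5 positions
expected = comb(len(alphabet) + 5 - 1, 5)
print(len(words), expected)

# combinations_with_replacement emits letters in input order, so every
# "word" has its letters in non-decreasing alphabetical order.
all_sorted = all(list(word) == sorted(word) for word in words)
print(words[:3], all_sorted)
```

This leaves far more candidates than the clusters need, even after removing real GPT2 tokens and words already used elsewhere. It also explains why every sampled word further down has its letters in alphabetical order.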

In [7]:
#Get the real words from the encoder. (This is the list of words from the tf1 GPT2 model.)
with open('../GPT2/gpt-2/models/124M/encoder.json', 'r') as f:
    gpt_vocab = json.load(f)


In [9]:
#Get the words used in our 52k vocab.
with open('../../data_preproc/word_to_label_50k.json', 'r') as f:
    word_to_label_50k = json.load(f)


In [11]:
#Create a list of "words" that aren't in the real words from the encoder and
# that aren't used in our full 52k vocab.
new_vocab_5letter = [word for word in vocab_5letter
                     if word not in gpt_vocab and word not in word_to_label_50k]

In [12]:
#Shuffle the "words" just in case this might make a difference for the tokenizer.
random.shuffle(new_vocab_5letter)
new_vocab_5letter[:10]

['dgtvy',
 'deegw',
 'fjjnx',
 'irttx',
 'bhlmv',
 'aegjr',
 'eoprx',
 'ccdez',
 'lqtvv',
 'mprtu']

#### Create dictionaries with text words

In [13]:
ballet_labels[0][100]

7222

In [14]:
#Reshape to 2 dims: flatten each (14, 2) pose into a row of 28 values.
train_ballet = train_ballet.reshape(-1,28)
train_ballet.shape

(929303, 28)

In [15]:
index = 0
#Find the index of the first all-zeros pose.
for i in range(train_ballet.shape[0]):
    if np.all(train_ballet[i] == np.zeros(28)):
        index = i
        break
index

562

In [17]:
#Find the cluster label for zeros.
ballet_labels[0][562]

4
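
The loop above stops at the first all-zeros row. For the video-boundary search mentioned in the plan, a vectorized version could look like the sketch below. It runs on a small made-up array rather than `train_ballet`, and `frames`, `zero_rows`, `starts` and `video_spans` are hypothetical names.

```python
import numpy as np

# Toy stand-in for train_ballet after reshape(-1, 28); values are made up
frames = np.ones((7, 28))
frames[[2, 6]] = 0.0                      # an all-zeros row marks the end of a video

# Indices of every row whose 28 values are all zero
zero_rows = np.flatnonzero(~frames.any(axis=1))

# Each video starts right after the previous zeros row and ends at its own
starts = np.concatenate(([0], zero_rows[:-1] + 1))
video_spans = list(zip(starts.tolist(), zero_rows.tolist()))
print(zero_rows, video_spans)
```

Saving `zero_rows` once would give the start and end index of every video without rescanning the array.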

In [20]:
#Create a dictionary of fake 5-letter words with keys that are the cluster labels.
#Label 4 is the zeros cluster found above, so it gets '[EOS]' instead of a fake word.
label_to_word_ballet = {}
count = 0
for label in ballet_labels[0]:
    if label not in label_to_word_ballet.keys():
        if label == 4:
            label_to_word_ballet[int(label)] = '[EOS]'
        else:
            label_to_word_ballet[int(label)] = new_vocab_5letter[count]
            count += 1
    
len(label_to_word_ballet)

7993

In [21]:
#Create the inverse dictionary with words as keys and labels as values.
word_to_label_ballet = {}
for key in label_to_word_ballet:
    new_key = label_to_word_ballet[key]
    word_to_label_ballet[new_key] = int(key)
len(word_to_label_ballet)

7993

In [22]:
label_to_word_ballet[2000]

'coopq'

In [23]:
word_to_label_ballet['coopq']

2000

In [24]:
word_to_label_ballet['[EOS]']

4

Save the two dictionaries.

In [25]:
with open('label_to_word_ballet.json', 'w') as f:
    json.dump(label_to_word_ballet, f)
with open('word_to_label_ballet.json', 'w') as f:
    json.dump(word_to_label_ballet, f)
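
One thing to watch when these files are loaded again: JSON object keys are always strings, so the integer cluster labels in `label_to_word_ballet` come back as `'4'`, `'2000'`, and so on. A minimal sketch of the round trip, using two entries shown in the cells above (`label_to_word`, `reloaded` and `restored` are names made up for this example):

```python
import json

label_to_word = {4: '[EOS]', 2000: 'coopq'}   # two entries from the cells above

# Same conversion json.dump / json.load perform through a file
reloaded = json.loads(json.dumps(label_to_word))
print(list(reloaded))                          # keys are now strings

# Restore integer keys after loading
restored = {int(label): word for label, word in reloaded.items()}
print(restored == label_to_word)
```

`word_to_label_ballet` is not affected, because its keys are already strings and its integer values survive the round trip.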


#### Create the ballet8000 text file

Based on the input format used for this model, we want each sentence on a single line, with extra newlines at the end of paragraphs. I will treat each video as one sentence and make each paragraph just one sentence long.

To run this, we need the `ballet_labels` array (loaded from `ballet_8000_cluster_labels.npy`) and the `label_to_word_ballet` dictionary. [Both are already in memory, so there is no need to reload them.]

In [2]:
# ##Upload the label array.
# train_all_labels = np.load('train_all_52000_cluster_labels.npy').reshape(-1)
# train_all_labels.shape

(35258430,)

In [23]:
# #Upload the label to word dictionary.
# with open('label_to_word_50k.json', 'r') as f:
#     label_to_word_50k = json.load(f)
# len(label_to_word_50k)

51779

In [28]:
ballet_labels = ballet_labels.reshape(-1,1)
ballet_labels.shape

(929303, 1)

In [32]:
ballet_labels[0][0]

2302

In [33]:
#We want to scroll through the label array and add
# the word for each label to new_sentence until we reach "[EOS]" (label 4, the zeros cluster).
#Then we append the sentence to the all_text list.
#Final data will be a list of strings to save to txt file with each new string
#starting on a new line.

all_text = []
new_sentence = ''
for i in range(ballet_labels.shape[0]):
    new_sentence += label_to_word_ballet[ballet_labels[i][0]]+' '
    if ballet_labels[i][0] == 4:
        all_text.append(new_sentence)
        new_sentence = ''
print('number of videos in text:',len(all_text))
print('one video sentence:',all_text[100])

number of videos in text: 1396
one video sentence: bikmm qstvv mnstw blmmt dlosw dlosw iijkn iijkn iijkn cccjt iijkn goruu goruu goruu goruu dmopv dmopv kknop kknop hnvvz bcctt bbdnv bbdnv bbjuz bbjuz bglwz bglwz eimsz eimsz myzzz myzzz myzzz myzzz evwwz evwwz evwwz evwwz evwwz evwwz evwwz evwwz evwwz hqqst hqqst fmpwy jnprx jnprx hlprw hlprw hlprw hikpv hikpv hikpv bbjoq bbjoq bbjoq bbjoq bghqs bghqs dlryz dlryz dlryz dlryz dlryz bbjoq hptty bbjoq bbjoq bbjoq bbjoq efglw ggtux ggtux ggtux ggtux bnsuw bnsuw bnsuw bnsuw bnsuw bnsuw bnsuw bnsuw djxxx djxxx djxxx djxxx ccnuz flntx flntx flntx flntx giktv giktv eeryz ciqtt ciqtt ciqtt ciqtt ciqtt dgmtx dgmtx dgmtx dgmtx dgmtx dgmtx dgmtx dgmtx morss ffgqr morss morss morss morss morss morss morss morss morss morss morss morss morss morss morss morss morss morss morss fmmmu fmmmu fmmmu fmmmu fmmmu fmmmu llpwy ilooq hhhsw cimuz cimuz achir asyyy aehir bgppt ghkww ghkww ghkww iijkn iijkn iijkn iijkn goruu goruu goruu dmopv dmopv kknop kknop e

In [34]:
filename = 'train_ballet_8k.txt'
with open(filename, 'w') as f:
    for item in all_text:
        f.write('%s\n' % item)
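
Going the other way (text word -> cluster label) is a lookup per word. The sketch below shows one way to decode a line of this file, or a line of model output, back into labels. `sentence_to_labels` is a hypothetical helper, and the label for `'bikmm'` is made up for the example; only `'coopq'` and `'[EOS]'` use values shown above.

```python
def sentence_to_labels(sentence, word_to_label):
    """Map a space-separated sentence of fake words back to cluster labels."""
    # str.split() with no arguments ignores the trailing space and newline
    return [word_to_label[word] for word in sentence.split()]

# 'coopq' and '[EOS]' match the notebook; the 'bikmm' label is invented
word_to_label = {'[EOS]': 4, 'coopq': 2000, 'bikmm': 17}

line = 'coopq coopq bikmm [EOS] \n'
labels = sentence_to_labels(line, word_to_label)
print(labels)
```

A word that is missing from the dictionary raises `KeyError` here, which is probably the safer default for generated text, since the model is not guaranteed to emit only vocab words.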

This is a sample of text from the current GPT2 input:

    Governor Cuomo was angling for more federal funds way back in July of 2015:

    Israel News reports:

    New rail tunnels under the Hudson River are needed to reduce delays, but the expensive project won’t work without a greater financial commitment from the federal government, Gov. Andrew Cuomo said Wednesday.

    The Democrat said Washington’s proposal to cover $3 billion of the estimated $14 billion tunnel project isn’t enough. He and New Jersey Gov. Chris Christie are expected to meet soon with federal officials to discuss funding for the stalled plan, which comes after a series of delays that underscored the age and condition of the area’s transportation infrastructure.