# Observations about prototype initializations

In its original version, ProtoryNet uses k-medoids to initialize its prototypes. 
More specifically, if the model uses $k$ prototypes, k-medoids clustering is applied to a sample of sentences from the training set with $\text{number of clusters} = k.$
The resulting $k$ medoids are used as the initial prototypes. 
Note that not all sentences in the training set are used in the clustering, because beyond approximately 30000 sentences the k-medoids algorithm runs into memory issues. 
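
A rough back-of-the-envelope estimate helps explain the memory limit. The sketch below assumes that the clustering step materializes a dense $n \times n$ distance matrix of 64-bit floats; that assumption, and the helper name `pairwise_matrix_gib`, are ours and not taken from the ProtoryNet code. The real peak usage may well be higher.

```python
def pairwise_matrix_gib(n_points: int, bytes_per_value: int = 8) -> float:
    """Memory in GiB of a dense n x n matrix (assumed float64 by default)."""
    return n_points ** 2 * bytes_per_value / 2 ** 30


# Memory grows quadratically with the number of sentences being clustered.
for n in (10_000, 20_000, 30_000):
    print(f"{n:>6} sentences -> {pairwise_matrix_gib(n):.2f} GiB")
```

Under this assumption, 30000 sentences already need about 6.7 GiB for the distance matrix alone, which is consistent with clustering only a sample of the training sentences.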

In this notebook we illustrate that, when using the k-medoids initialization on our dataset, some prototypes are redundant.
In other words, some of the initial prototypes are repeated: they are the same sentence, or differ only in punctuation.
In previous experiments we observed that prototypes sometimes get "stuck" (they don't change) and lack diversity.
These two problems could be feeding one another.
The lack of diversity persists after training for many epochs, which might indicate that, when starting with repeated prototypes, it is hard for the model to acquire prototype heterogeneity.  
 
We propose, as an alternative, to use a random initialization of the prototypes. 
This comes with the risk of starting from a "bad random selection" of prototypes that prevents the model from learning properly.
As with other algorithms that are sensitive to initialization, we can run the model several times with different seeds and see how the accuracy measures behave. 

We leave a study of the impact of different initializations for the future.  
For the moment we focus on ensuring diversity in the initialization of the prototypes. 
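
One way to ensure diversity in a random initialization is to sample only among sentences that are distinct after a light normalization. The following is a minimal sketch under that assumption, not necessarily what `train_protorynet.py` does with `--type_init=random`; the helpers `normalize` and `sample_distinct_prototypes` are hypothetical names introduced for illustration.

```python
import random
import string


def normalize(sentence):
    """Lowercase and drop punctuation so 'Oh, God!' and 'Oh God.' compare equal."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(sentence.lower().translate(table).split())


def sample_distinct_prototypes(sentences, k, seed):
    """Draw k sentences at random whose normalized forms are all different."""
    rng = random.Random(seed)
    # Keep the first sentence seen for each normalized form.
    unique = {}
    for sentence in sentences:
        unique.setdefault(normalize(sentence), sentence)
    candidates = list(unique.values())
    if k > len(candidates):
        raise ValueError("not enough distinct sentences to draw k prototypes")
    # Sampling without replacement guarantees k different candidates.
    return rng.sample(candidates, k)


sentences = ['Oh, God!', 'Oh God.', 'Oh dear.', 'Oh, dear.', 'Oh no.', 'Aw, come on.']
print(sample_distinct_prototypes(sentences, k=3, seed=16))
```

Fixing the seed keeps a given run reproducible, while changing it gives the different starting points needed to study sensitivity to initialization.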


# Importing packages

In [1]:
import time
import argparse
import tensorflow as tf
import tensorflow_hub as hub
from sklearn_extra.cluster import KMedoids
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import pickle
import os
import sys
import myfunctions
import nltk
nltk.download('punkt')
sys.path.append('../src/protoryNet/')
from protoryNet import ProtoryNet

[nltk_data] Downloading package punkt to
[nltk_data]     /nfshome/students/cm007951/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
%cd ..

/nfshome/students/cm007951/text-models


# Import datasets and data preprocessing

In [4]:
cornell_prepro_characters = pd.read_csv('datasets/cornell_corpus/cornell_prepro_characters.csv')
cornell_prepro_characters

| Unnamed: 0.1 | Unnamed: 0 | characterID | movieID | character_name | gender | movie_title | movie_year | movieGroup | text_with_punctuation | text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | u0 | m0 | BIANCA | F | 10 things i hate about you | 1999 | 1 | They do not! I hope so. Let's go. Okay you're ... | They do not I hope so Lets go Okay youre gonna... |
| 1 | 1 | u100 | m6 | AMY | F | 8mm | 1999 | 1 | She died in her sleep three days ago. It was i... | She died in her sleep three days ago It was in... |
| 2 | 2 | u1001 | m65 | PETE | M | from dusk till dawn | 1996 | 5 | Six-fifty. Knock yourself out. That's all that... | Sixfifty Knock yourself out Thats all thats be... |
| 3 | 3 | u1007 | m66 | BLONDELL | F | g.i. jane | 1997 | 1 | Wow Uh don't see it. There's no signature. But... | Wow Uh dont see it Theres no signature But han... |
| 4 | 4 | u1008 | m66 | C.O. | M | g.i. jane | 1997 | 1 | Of course, but there's more Uh, V.I.P. securit... | Of course but theres more Uh VIP security arra... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2399 | 2399 | u983 | m64 | ALICE | F | friday the 13th | 2009 | 3 | Maybe we should wait for Mr. Christy. The kill... | Maybe we should wait for Mr Christy The killer... |
| 2400 | 2400 | u985 | m64 | BILL | M | friday the 13th | 2009 | 3 | It's over twenty miles to the crossroads. Stev... | Its over twenty miles to the crossroads Stevel... |
| 2401 | 2401 | u989 | m64 | MARCIE | F | friday the 13th | 2009 | 3 | Gotta pee. You're lying on my bladder. Like wa... | Gotta pee Youre lying on my bladder Like waves... |
| 2402 | 2402 | u993 | m64 | STEVE | M | friday the 13th | 2009 | 3 | I've got to go to town and pick up the trailer... | Ive got to go to town and pick up the trailer ... |


In [5]:
# Split data
X = cornell_prepro_characters['text_with_punctuation']
y = np.array(cornell_prepro_characters['gender'] == 'F').astype(int)

x_train, x_val, x_test, y_train, y_val, y_test = myfunctions.balanced_split_train_val_test(X, y, train_split = 0.7, val_split = 0.2, test_split = 0.1, random_seed = 32)
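
The helper `myfunctions.balanced_split_train_val_test` is not shown in this notebook. As a rough illustration only, the sketch below assumes that "balanced" means a stratified split, where each part keeps the class proportions of the full data; the real helper may instead resample the classes or behave differently. The function name `stratified_split` is hypothetical, and the sketch indexes `X` by position, so it expects lists (or a pandas Series with a default integer index).

```python
import random
from collections import defaultdict


def stratified_split(X, y, fractions=(0.7, 0.2, 0.1), seed=32):
    """Split X and y into train/val/test parts, keeping class proportions in each."""
    rng = random.Random(seed)

    # Group the positions of the examples by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    parts = ([], [], [])  # positions for train, val, test
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        parts[0].extend(indices[:n_train])
        parts[1].extend(indices[n_train:n_train + n_val])
        parts[2].extend(indices[n_train + n_val:])  # remainder goes to test

    xs = [[X[i] for i in part] for part in parts]
    ys = [[y[i] for i in part] for part in parts]
    # Same return order as the call above: x_train, x_val, x_test, y_train, y_val, y_test
    return (*xs, *ys)
```

With 50 examples per class, this yields 35/10/5 examples of each class in the train, validation and test parts respectively.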

In [7]:
# Saving to pickle format
# The models will use the datasets we save here
directory =  'datasets/cornell_corpus/cornell_prepro_characters_70train_20val_10test/'

splits = {'x_train': x_train, 'x_val': x_val, 'x_test': x_test,
          'y_train': y_train, 'y_val': y_val, 'y_test': y_test}

# Write each split to its own pickle file inside `directory`
for name, data in splits.items():
    with open(directory + name, 'wb') as f:
        pickle.dump(data, f)

In [6]:
# Preprocessing needed to work with the ProtoryNet model's results

# In this section there is no need to use the validation set, since we want to evaluate the predictions and accuracy on the test set.
# The train set is necessary because we need to map the prototypes to sentences (which belong to the train set only)

# Guarantee target variable is integer
y_train = [int(y) for y in y_train]
y_test = [int(y) for y in y_test]

# Split text into lists of sentences 
x_train = myfunctions.split_sentences(x_train)
x_test = myfunctions.split_sentences(x_test)

# Make a list of sentences (only for training set)
train_sentences = []
for p in x_train:
    train_sentences.extend(p)

# We remove very short or very long sentences (measured in characters), since they behave as outliers.
train_sentences = [s for s in train_sentences if 5 < len(s) < 100]

# Train model using kmedoids initialization

In [8]:
# !python scripts_and_notebooks/train_protorynet.py --dataset_path=datasets/cornell_corpus/cornell_prepro_characters_70train_20val_10test/ --results_path=results/protorynet_models/ --results_prefix=cornell_prepro_characters_70train_20val_10test --epochs=2 --number_prototypes=10 --type_init=kmedoids --sample_size_sentences=20000 --init_prototypes_seed=16


Evaluate on valid set:  0.501039501039501
i =   1250
i =   1300
i =   1350
i =   1400
Evaluate on valid set:  0.501039501039501
i =   1450
i =   1500
i =   1550
i =   1600
Evaluate on valid set:  0.5031185031185031
This is the best eval res, saving the model...
saving model now = 2022-07-28 15:16:20.019727
just saved
i =   1650
Epoch  1
i =   0
Evaluate on valid set:  0.501039501039501
i =   50
i =   100
i =   150
i =   200
Evaluate on valid set:  0.501039501039501
i =   250
i =   300
i =   350
i =   400
Evaluate on valid set:  0.501039501039501
i =   450
i =   500
i =   550
i =   600
Evaluate on valid set:  0.501039501039501
i =   650
i =   700
i =   750
i =   800
Evaluate on valid set:  0.501039501039501
i =   850
i =   900
i =   950
i =   1000
Evaluate on valid set:  0.5114345114345115
This is the best eval res, saving the model...
saving model now = 2022-07-28 15:28:39.069133
just saved
i =   1050
i =   1100
i =   1150
i =   1200
Evaluate on valid set:  0.501039501039501
i =   1250

# Model results

Besides saving the model weights, the training process saves a pickle file with some model information, including the initial prototypes chosen by k-medoids.

In [10]:
results_path = 'results/protorynet_models/'

In [14]:
# Load results from the model
model_name = 'cornell_prepro_characters_70train_20val_10test__2epochs__10prototypes__kmedoidstype_init__20000sample_size_sentences__16init_prototypes_seed'
with open(results_path + model_name + '.pickle', 'rb') as f:
    model_kmedoids_results = pickle.load(f)

# Extract number of prototypes
number_prototypes = model_kmedoids_results['args'].number_prototypes

We see that the initial prototypes chosen by k-medoids contain repetitions: several differ only in punctuation (for example, prototypes 0 and 2, or 7 and 8). 
Later in training the prototypes "don't move": they get stuck, which might be due to the poor initialization. 

In [24]:
model_kmedoids_results['initial_prototypes']

{0: 'Oh, God!',
 1: "I couldn't believe it he just left!",
 2: 'Oh God.',
 3: 'Aw, come on.',
 4: 'You come up with that yourself?',
 5: 'Oh, come on.',
 6: 'Oh my God!',
 7: 'Oh dear.',
 8: 'Oh, dear.',
 9: 'Oh no.'}
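
The redundancy in this output can be checked programmatically. The sketch below groups prototypes whose sentences are identical once case and punctuation are ignored; the helper `near_duplicate_groups` is a hypothetical name, and this normalization is only one possible notion of "repeated".

```python
import string
from collections import defaultdict

# Initial prototypes copied from the output above.
initial_prototypes = {
    0: 'Oh, God!',
    1: "I couldn't believe it he just left!",
    2: 'Oh God.',
    3: 'Aw, come on.',
    4: 'You come up with that yourself?',
    5: 'Oh, come on.',
    6: 'Oh my God!',
    7: 'Oh dear.',
    8: 'Oh, dear.',
    9: 'Oh no.',
}


def near_duplicate_groups(prototypes):
    """Map each normalized sentence shared by several prototypes to their indices."""
    table = str.maketrans("", "", string.punctuation)
    groups = defaultdict(list)
    for idx, sentence in prototypes.items():
        key = " ".join(sentence.lower().translate(table).split())
        groups[key].append(idx)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


print(near_duplicate_groups(initial_prototypes))
# → {'oh god': [0, 2], 'oh dear': [7, 8]}
```

Under this criterion only 8 of the 10 initial prototypes are distinct, and several of the remaining ones ('Oh my God!', 'Oh, come on.', 'Aw, come on.') are still very close in meaning.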

Now, we load the model to observe the prototypes after a couple of epochs. 

In [15]:
# Load model
model_path = results_path + model_name + '.h5'
pNet_kmedoids = ProtoryNet()
model = pNet_kmedoids.createModel(np.zeros((number_prototypes, 512)), number_prototypes)
model.load_weights(model_path)

[db] model.input =  KerasTensor(type_spec=TensorSpec(shape=(None,), dtype=tf.string, name='input_1'), name='input_1', description="created by layer 'input_1'")
[db] protoLayerName =  proto_layer
[db] protoLayer =  <protoryNet.ProtoryNet.createModel.<locals>.prototypeLayer object at 0x7feb28706f50>
[db] protoLayer.output =  (<KerasTensor: shape=(1, None, 10) dtype=float32 (created by layer 'proto_layer')>, <KerasTensor: shape=(10, 512) dtype=float32 (created by layer 'proto_layer')>)
[db] distanceLayer.output =  KerasTensor(type_spec=TensorSpec(shape=(1, None, 10), dtype=tf.float32, name=None), name='distance_layer/PartitionedCall:0', description="created by layer 'distance_layer'")
Model: "custom_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None,)]                 0         
                                                                 
 keras_layer (KerasLaye

In [18]:
# Sentence embedding using the fine-tuned embedder in the model
train_sentences_embedded = pNet_kmedoids.embed(train_sentences)

In [21]:
# Evaluate the model on testing data
preds_test, accuracy_test = pNet_kmedoids.evaluate(x_test, y_test)

In [26]:
accuracy_test

0.5062240663900415

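After a couple of epochs the prototypes have barely moved: most map to the same sentences as before, up to punctuation, and some are now exact duplicates (prototypes 0 and 2, and 7 and 8). 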

In [22]:
# Extract final prototypes of the model
final_prototypes = pNet_kmedoids.showPrototypes(train_sentences, train_sentences_embedded, number_prototypes, printOutput=False, return_prototypes = True)
final_prototypes

{0: 'Oh God!',
 1: "I couldn't believe it he just left!",
 2: 'Oh God!',
 3: 'Aw, come on.',
 4: 'You come up with that yourself?',
 5: 'Oh, come on.',
 6: 'Oh my God!',
 7: 'Oh, dear.',
 8: 'Oh, dear.',
 9: 'No, no!'}

# Train model using random initialization

In [7]:
# !python scripts_and_notebooks/train_protorynet.py --dataset_path=datasets/cornell_corpus/cornell_prepro_characters_70train_20val_10test/ --results_path=results/protorynet_models/ --results_prefix=cornell_prepro_characters_70train_20val_10test --epochs=2 --number_prototypes=10 --type_init=random --sample_size_sentences=20000 --init_prototypes_seed=16


Evaluate on valid set:  0.498960498960499
i =   1250
i =   1300
i =   1350
i =   1400
Evaluate on valid set:  0.498960498960499
i =   1450
i =   1500
i =   1550
i =   1600
Evaluate on valid set:  0.498960498960499
i =   1650
Epoch  1
i =   0
Evaluate on valid set:  0.498960498960499
i =   50
i =   100
i =   150
i =   200
Evaluate on valid set:  0.498960498960499
i =   250
i =   300
i =   350
i =   400
Evaluate on valid set:  0.498960498960499
i =   450
i =   500
i =   550
i =   600
Evaluate on valid set:  0.498960498960499
i =   650
i =   700
i =   750
i =   800
Evaluate on valid set:  0.498960498960499
i =   850
i =   900
i =   950
i =   1000
Evaluate on valid set:  0.5239085239085239
This is the best eval res, saving the model...
saving model now = 2022-07-28 16:50:03.293843
just saved
i =   1050
i =   1100
i =   1150
i =   1200
Evaluate on valid set:  0.5426195426195426
This is the best eval res, saving the model...
saving model now = 2022-07-28 16:52:44.518793
just saved
i =   1250

# Model results

In [8]:
results_path = 'results/protorynet_models/'

In [9]:
# Load results from the model
model_name = 'cornell_prepro_characters_70train_20val_10test__2epochs__10prototypes__randomtype_init__20000sample_size_sentences__16init_prototypes_seed'
with open(results_path + model_name + '.pickle', 'rb') as f:
    model_random_results = pickle.load(f)

# Extract number of prototypes
number_prototypes = model_random_results['args'].number_prototypes

With random initialization, the initial prototypes are more likely to be heterogeneous. 

In [10]:
model_random_results['initial_prototypes']

{0: 'I called your house like four times.',
 1: "I don't think people even noticed.",
 2: "I'm the victim.",
 3: "But I can't just drop everything and leave.",
 4: 'Oh, some of the scripts were so spirited!',
 5: "How'd you do on the science test?",
 6: "I'm going to look around.",
 7: 'No de-fense.',
 8: 'Baxter?',
 9: 'Poor girl how could you do a thing like that?'}

Now, we load the model to observe the prototypes after a couple of epochs. 

In [11]:
# Load model
model_path = results_path + model_name + '.h5'
pNet_random = ProtoryNet()
model = pNet_random.createModel(np.zeros((number_prototypes, 512)), number_prototypes)
model.load_weights(model_path)

[db] model.input =  KerasTensor(type_spec=TensorSpec(shape=(None,), dtype=tf.string, name='input_1'), name='input_1', description="created by layer 'input_1'")
[db] protoLayerName =  proto_layer
[db] protoLayer =  <protoryNet.ProtoryNet.createModel.<locals>.prototypeLayer object at 0x7f7c71741350>
[db] protoLayer.output =  (<KerasTensor: shape=(1, None, 10) dtype=float32 (created by layer 'proto_layer')>, <KerasTensor: shape=(10, 512) dtype=float32 (created by layer 'proto_layer')>)
[db] distanceLayer.output =  KerasTensor(type_spec=TensorSpec(shape=(1, None, 10), dtype=tf.float32, name=None), name='distance_layer/PartitionedCall:0', description="created by layer 'distance_layer'")
Model: "custom_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None,)]                 0         
                                                                 
 keras_layer (KerasLaye

In [12]:
# Sentence embedding using the fine-tuned embedder in the model
train_sentences_embedded = pNet_random.embed(train_sentences)

In [13]:
# Evaluate the model on testing data
preds_test, accuracy_test = pNet_random.evaluate(x_test, y_test)

In [14]:
accuracy_test

0.6099585062240664

After a couple of epochs some prototypes have started to change (prototypes 4, 6, 7 and 8), and the test accuracy is considerably higher than with k-medoids (0.610 versus 0.506).

In [15]:
# Extract final prototypes of the model
final_prototypes = pNet_random.showPrototypes(train_sentences, train_sentences_embedded, number_prototypes, printOutput=False, return_prototypes = True)
final_prototypes

{0: 'I called your house like four times.',
 1: "I don't think people even noticed.",
 2: "I'm the victim.",
 3: "But I can't just drop everything and leave.",
 4: "He's a very proper actor.",
 5: "How'd you do on the science test?",
 6: 'What are you looking at?',
 7: 'No, and yes.',
 8: '<u>No</u> mention of the Girlscout.',
 9: 'Poor girl how could you do a thing like that?'}
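
To compare how much the prototypes moved in the two runs, one can list the indices whose sentence differs between the initial and final dictionaries. This is a small sketch using the outputs shown in this notebook; `changed_prototypes` and `normalize` are hypothetical helpers. It compares only the displayed sentences, not the underlying prototype vectors.

```python
import string


def normalize(sentence):
    """Lowercase and drop punctuation before comparing sentences."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(sentence.lower().translate(table).split())


def changed_prototypes(initial, final, key=lambda s: s):
    """Indices whose sentence differs between two snapshots, after applying `key`."""
    return [i for i in sorted(initial) if key(initial[i]) != key(final[i])]


# Prototypes copied from the k-medoids and random runs above.
kmedoids_initial = {0: 'Oh, God!', 1: "I couldn't believe it he just left!",
                    2: 'Oh God.', 3: 'Aw, come on.',
                    4: 'You come up with that yourself?', 5: 'Oh, come on.',
                    6: 'Oh my God!', 7: 'Oh dear.', 8: 'Oh, dear.', 9: 'Oh no.'}
kmedoids_final = {0: 'Oh God!', 1: "I couldn't believe it he just left!",
                  2: 'Oh God!', 3: 'Aw, come on.',
                  4: 'You come up with that yourself?', 5: 'Oh, come on.',
                  6: 'Oh my God!', 7: 'Oh, dear.', 8: 'Oh, dear.', 9: 'No, no!'}
random_initial = {0: 'I called your house like four times.',
                  1: "I don't think people even noticed.", 2: "I'm the victim.",
                  3: "But I can't just drop everything and leave.",
                  4: 'Oh, some of the scripts were so spirited!',
                  5: "How'd you do on the science test?",
                  6: "I'm going to look around.", 7: 'No de-fense.', 8: 'Baxter?',
                  9: 'Poor girl how could you do a thing like that?'}
random_final = {0: 'I called your house like four times.',
                1: "I don't think people even noticed.", 2: "I'm the victim.",
                3: "But I can't just drop everything and leave.",
                4: "He's a very proper actor.",
                5: "How'd you do on the science test?",
                6: 'What are you looking at?', 7: 'No, and yes.',
                8: '<u>No</u> mention of the Girlscout.',
                9: 'Poor girl how could you do a thing like that?'}

print(changed_prototypes(random_initial, random_final))                     # → [4, 6, 7, 8]
print(changed_prototypes(kmedoids_initial, kmedoids_final))                 # → [0, 2, 7, 9]
print(changed_prototypes(kmedoids_initial, kmedoids_final, key=normalize))  # → [9]
```

With an exact string comparison both runs show four changed prototypes, but in the k-medoids run three of those changes are punctuation only; ignoring punctuation, a single prototype (index 9) moved to a different sentence.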