## Experiment 1
#### Document length and classification accuracy
<hr>

###### Part 1: Define experiment 
1. define space of classifiers (and hypers) to consider 
2. write function(s) to train arbitrary clfs and predict 
3. write model evaluation function 
4. write performance curve plotting function 

###### Part 2: Prepare data 
1. load and split data into length subsets 
2. set aside 30% of data for evaluation (stratified)
3. write function(s) to preprocess text (two options)

###### Part 3: Conduct experiment 
1. for each train subset, get 5-fold crossval F1 for each clf
2. for each fit, generate and save preds on evaluation set 


###### Part 4: Evaluate results
1. plot the crossval F1 scores across subsets and clfs
2. plot performance on evaluation set for each clf and subset 


<br>
#### Part 1: Define experiment
<hr>

##### 1.1 Define space of classifiers 

We will consider the following set of classification strategies:

- `Type A:` Non-NN: 
    - Multinomial Naive Bayes
    - Support Vector Machine
- `Type B:` Feed-forward NN classifiers: 
    - Multilayer Perceptron
    - Convolutional NN
- `Type C:` Recurrent NN classifiers: 
    - LSTM Network 
    - Bi-directional RNN

> *Note:* `Type C` algorithms take sequence data as input (padded token sequences), whereas `Type A` and `Type B` take DTM-type structures as input (vectorized sequences).  
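
To make the two input formats concrete, here is a minimal standard-library sketch. The toy vocabulary, the `max_len` value, and the helper names `to_count_vector` and `to_padded_sequence` are illustrative assumptions, not part of the experiment code. Padding here goes on the right, which may differ from the default of whichever padding utility ends up being used.

```python
from collections import Counter

# toy vocabulary mapping tokens to integer ids (0 is reserved for padding)
vocab = {'good': 1, 'bad': 2, 'movie': 3, 'plot': 4}

def to_count_vector(tokens, vocab):
  '''DTM-style row: one slot per vocab id, holding that token's count'''
  counts = Counter(vocab[tok] for tok in tokens if tok in vocab)
  return [counts.get(idx, 0) for idx in range(1, len(vocab) + 1)]

def to_padded_sequence(tokens, vocab, max_len):
  '''sequence-style row: token ids in order, truncated or right-padded with 0'''
  ids = [vocab[tok] for tok in tokens if tok in vocab][:max_len]
  return ids + [0] * (max_len - len(ids))

doc = 'good movie good plot'.split()
print(to_count_vector(doc, vocab))        # word order is lost
print(to_padded_sequence(doc, vocab, 6))  # word order is kept
```

The count vector has fixed width (the vocab size) regardless of document length, while the padded sequence has fixed width `max_len` and preserves token order.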

##### 1.2 Write wrapper class for each model type 

Define classes `TypeA`, `TypeB`, and `TypeC`

###### 1.2.1 Define class for `TypeA` classifiers

In [None]:
class TypeA():
  '''Wrapper class for `sklearn` binary text classifiers 
  
  on init:
    - instantiate `Classifier` class, with params given by `**kwargs`
    - store `Classifier.__name__` and param key-vals in .clf_info attr 
  methods:
    - .train(train_dtm, train_labels): call .clf.fit() on dtm and labels
    - .predict(test_dtm): generate predictions over unseen input data 
  
  attributes:
    - .clf: `sklearn.*.Classifier` instance
    - .clf_info: dict, stores classifier name and param key-value pairs 
    - .train_dtm, .train_labels: input data and labels fed to .train()
  
  usage example: 
    ```
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import f1_score
    
    skl_clf = MultinomialNB
    skl_clf_params = {'alpha': .9, 'fit_prior': True}
    
    # suppose `dat` is a df with columns `subset` (train, test), `text`, and `label`
    train = dat[dat.subset=='train']
    test = dat[dat.subset=='test']
    
    train_text, train_labels = train.text, train.label
    test_text, test_labels = test.text, test.label
    
    vectorizer = CountVectorizer()
    train_dtm = vectorizer.fit_transform(train_text)
    test_dtm = vectorizer.transform(test_text)
    
    classifier = TypeA(skl_clf, **skl_clf_params)
    classifier.train(train_dtm, train_labels)
    # print(classifier.clf_info)
    
    preds = classifier.predict(test_dtm)
    # print(sum([pred==obs for pred,obs in zip(preds,test_labels)]))
    print('f1 score on test set:', round(f1_score(test_labels, preds), 3))
    ```
  
  TODO:
    - want to save weight matrix as an attr??
    - check handling of **kwargs 
    - ... 
  '''
  def __init__(self, Classifier, **kwargs):
    self.clf = Classifier(**kwargs)
    self.clf_info = dict({'clf' : Classifier.__name__}, **kwargs)
  
  def train(self, train_dtm, train_labels):
    self.train_dtm, self.train_labels = train_dtm, train_labels
    self.clf.fit(self.train_dtm, self.train_labels)
  
  def predict(self, test_dtm):
    test_preds = self.clf.predict(test_dtm)
    return test_preds


###### 1.2.2 Define class for `TypeB` classifiers

In [None]:
class TypeB():
  '''Wrapper class for managing `keras.models.Sequential()` models
  
  on init: 
    - instantiates keras model 
    - adds supplied layers (if any)
    - sets a few attrs used during train/compile/predict 
  
  methods:
    - add_layers(layers_list, kwargs_list)
    - compile_model(optimizer, loss, metrics)
    - train(train_X, train_y, valset_prop, epochs, batch_size)
    - predict(test_X, pred_postprocessor)
  
  attributes:
    - .model: instance of KerasModel class passed on init 
    - .layers_info: list of dicts with params and type of each layer 
    - .is_compiled, .layers_added: boolean, for tracking model state 
    - .train_X, .train_y: train data and labels, appropriately preprocessed
    - .history: a keras History object with info about training history  
  
  usage example:
    ```
    import keras
    from keras.layers import Dense
    
    vocab_n = 10000
    layers = [Dense, Dense, Dense]
    kwargss = [dict(units=16, activation='relu', input_shape=(vocab_n,)), 
               dict(units=16, activation='relu'), 
               dict(units=1, activation='sigmoid')]
    neural_net = TypeB(keras.models.Sequential, layers, kwargss)
    neural_net.compile_model(optimizer='rmsprop', 
                             loss='binary_crossentropy', 
                             metrics = ['accuracy'])
    # with appropriately preprocessed `train_X` and `train_y`
    neural_net.train(train_X,train_y, valset_prop=.3,epochs=7,batch_size=50)
    
    preds = neural_net.predict(test_X, pred_postprocessor=lambda x: x > .5)
    sum([pred==true for pred, true in zip(preds, test_y)]) / len(test_y)
    ```
  
  TODO: 
    - can have multiple histories?? (if so, maybe append to list instead)
    - need to pass anything to KerasModel?!
    - track layer indices in .layers_info?! 
    - add print and/or repr and/or display method?!?! 
    - abstract over **kwargs for .compile_model()
    - abstract over **kwargs for .train()  
    - ... 
  '''
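  def __init__(self, KerasModel, layers_list=None, kwargs_list=None):
    # None defaults avoid sharing mutable default lists across instances
    layers_list = layers_list or []
    kwargs_list = kwargs_list or []
    self.model = KerasModel()
    self.layers_info = []
    self.is_compiled = False
    self.layers_added = False
    
    assert len(layers_list) == len(kwargs_list)
    if len(layers_list) > 0: self.add_layers(layers_list, kwargs_list)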
  
  def add_layers(self, layers_list, kwargs_list):
    for layer, kwargs in zip(layers_list, kwargs_list):
      # unpack the kwargs dict into keyword arguments for the layer
      self.model.add(layer(**kwargs))
      # build a new dict (as in `TypeA`), since `dict.update()` returns None
      self.layers_info.append(dict({'layer': layer.__name__}, **kwargs))
    self.layers_added = True
  
  def compile_model(self, optimizer, loss, metrics):
    assert self.layers_added, 'must add layers to model before compiling!'
    self.model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
    self.is_compiled = True
  
  def train(self, train_X, train_y, valset_prop, epochs, batch_size):
    self.train_X, self.train_y = train_X, train_y
    assert len(self.train_X) == len(self.train_y)
    assert self.is_compiled, 'must compile model before training!'
    valset_size = round(valset_prop * len(self.train_y))
    trn_X, val_X = self.train_X[valset_size:], self.train_X[:valset_size]
    trn_y, val_y = self.train_y[valset_size:], self.train_y[:valset_size]
    self.history = self.model.fit(trn_X, trn_y, 
                                  epochs=epochs, batch_size=batch_size, 
                                  validation_data=(val_X, val_y))
  
  def predict(self, test_X, pred_postprocessor=lambda x: x):
    test_probs = self.model.predict(test_X)
    test_preds = [pred_postprocessor(prob) for prob in test_probs]
    return test_preds





###### 1.2.3 Define class for `TypeC` classifiers

In [None]:
# maybe not even necessary?! 
# maybe `TypeC` interface same as `TypeB` since both are keras?! 
# class TypeC(): pass

##### 1.3 Write function to evaluate performance 

In [None]:
# utility func to convert a probability into a binary prediction 
def prob_to_binary(prob, threshold=.5, ret_type=bool):
  assert 0 <= prob <= 1
  assert 0 <= threshold <= 1
  return ret_type(prob > threshold)


def evaluate_model(trained_model, test_data, test_labels, metric):
  test_preds = trained_model.predict(test_data)
  return metric(test_labels, test_preds)
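
A quick sanity check of `evaluate_model`, as a sketch only: `ThresholdStub` and `f1_binary` are hypothetical stand-ins (for a trained classifier and for `sklearn.metrics.f1_score`, respectively), and the scores and labels are made-up toy values. `evaluate_model` is repeated so the snippet runs on its own.

```python
def f1_binary(y_true, y_pred):
  '''F1 for the positive class (label 1); returns 0.0 if undefined'''
  tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
  fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
  fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
  denom = 2 * tp + fp + fn
  return 2 * tp / denom if denom else 0.0

class ThresholdStub:
  '''hypothetical stand-in for a trained model: thresholds a raw score'''
  def predict(self, test_data):
    return [int(score > .5) for score in test_data]

# same definition as above, repeated so the snippet runs on its own
def evaluate_model(trained_model, test_data, test_labels, metric):
  test_preds = trained_model.predict(test_data)
  return metric(test_labels, test_preds)

toy_scores = [.9, .2, .7, .4, .6]
toy_labels = [1, 0, 0, 1, 1]
toy_f1 = evaluate_model(ThresholdStub(), toy_scores, toy_labels, f1_binary)
print(round(toy_f1, 3))
```

Any object with a `.predict()` method and any `metric(y_true, y_pred)` callable can be plugged in the same way, which is what lets one function cover all three classifier types.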

##### 1.4 Write function to plot performance curves 

In [None]:
# TODO 

<br>
#### Part 2: Prepare data
<hr>

##### 2.1 Load and split data into length subsets 
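
One possible way to form length subsets, sketched with the standard library only. The function name `assign_length_bins`, the whitespace tokenization, and the choice of three equal-sized bins are all assumptions; the commented-out check in section 3.0 suggests `dat` may already carry a `length_bin` column, in which case this step would be unnecessary.

```python
def assign_length_bins(texts, n_bins=3):
  '''label each text with a bin index (0 = shortest) by token-count rank'''
  lengths = [len(text.split()) for text in texts]
  # indices of the texts, ordered from shortest to longest
  order = sorted(range(len(texts)), key=lambda i: lengths[i])
  bins = [0] * len(texts)
  for rank, idx in enumerate(order):
    # integer division deals the ranks into n_bins roughly equal groups
    bins[idx] = rank * n_bins // len(texts)
  return bins

toy_texts = ['a b', 'a b c d e f', 'a', 'a b c', 'a b c d', 'a b c d e']
toy_bins = assign_length_bins(toy_texts)
print(toy_bins)
```

Rank-based binning keeps the subsets the same size, so differences in scores across bins are not confounded with differences in training-set size.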

##### 2.2 Set aside stratified evaluation data 
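
A minimal sketch of a stratified 70/30 holdout, assuming stratification is by class label only (the outline does not say whether length bin should also be a stratum). `stratified_split` is a hypothetical helper; in practice `sklearn.model_selection.train_test_split` with its `stratify` argument would likely do the same job.

```python
import random
from collections import defaultdict

def stratified_split(labels, eval_prop=.3, seed=0):
  '''return (train_idx, eval_idx), holding out `eval_prop` of each class'''
  rng = random.Random(seed)
  by_label = defaultdict(list)
  for idx, label in enumerate(labels):
    by_label[label].append(idx)
  train_idx, eval_idx = [], []
  for label in sorted(by_label):
    idxs = by_label[label]
    rng.shuffle(idxs)
    # hold out the same proportion from every class
    n_eval = round(eval_prop * len(idxs))
    eval_idx.extend(idxs[:n_eval])
    train_idx.extend(idxs[n_eval:])
  return sorted(train_idx), sorted(eval_idx)

toy_labels = [0] * 10 + [1] * 10
train_idx, eval_idx = stratified_split(toy_labels)
print(len(train_idx), len(eval_idx))
```

Fixing `seed` keeps the evaluation set identical across classifiers, so their scores are comparable.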

##### 2.3 Write functions to preprocess text 

<br>
#### Part 3: Conduct experiment
<hr>

##### 3.0 Check that classes work as designed

In [None]:
import pandas as pd

data_file = 'data/imdb_decoded.csv'
dat = pd.read_csv(data_file)


display(dat.head())
print(f'shape of full data: {dat.shape[0]}x{dat.shape[1]}')
# print('label distro:\n', dat.label.value_counts())
# print('length bin sizes:\n', dat.length_bin.value_counts())

In [None]:
### TypeA example 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

skl_clf = MultinomialNB
skl_clf_params = {'alpha': .9, 'fit_prior': True}

# suppose `dat` is a df with columns `subset` (train, test), `text`, and `label`
train = dat[dat.subset=='train']
test = dat[dat.subset=='test']

train_text, train_labels = train.text, train.label
test_text, test_labels = test.text, test.label

vectorizer = CountVectorizer()
train_dtm = vectorizer.fit_transform(train_text)
test_dtm = vectorizer.transform(test_text)

classifier = TypeA(skl_clf, **skl_clf_params)
classifier.train(train_dtm, train_labels)
# print(classifier.clf_info)

preds = classifier.predict(test_dtm)
# print(sum([pred==obs for pred,obs in zip(preds,test_labels)]))
print('f1 score on test set:', round(f1_score(test_labels, preds), 3))

In [None]:
### TypeB example
from keras import models
from keras import layers

from keras.preprocessing.text import Tokenizer

vocab_n = 10000

tokenizer = Tokenizer(num_words=vocab_n)
tokenizer.fit_on_texts(train_text)

train_encoded = tokenizer.texts_to_sequences(train_text)
train_dtm = tokenizer.sequences_to_matrix(train_encoded, mode='count')

test_encoded = tokenizer.texts_to_sequences(test_text)
test_dtm = tokenizer.sequences_to_matrix(test_encoded, mode='count')

print(train_dtm.shape)
print(test_dtm.shape)

In [None]:
layer_list = [layers.Dense, layers.Dense, layers.Dense]

kwargs_list = [dict(units=16, activation='relu', input_shape=(vocab_n,)), 
               dict(units=16, activation='relu'), 
               dict(units=1, activation='sigmoid')]

# TODO: START HERE!!! (IN PROCESS OF DEBUGGING `TypeB` CLASS!!! )
neural_net = TypeB(models.Sequential, layer_list, kwargs_list)


neural_net.compile_model(optimizer='rmsprop', 
                         loss='binary_crossentropy', 
                         metrics = ['accuracy'])

# with appropriately preprocessed `train_X` and `train_y`
neural_net.train(train_dtm, train_labels,
                 valset_prop=.3, epochs=3, batch_size=50)

# preds = neural_net.predict(test_X, pred_postprocessor=lambda x: x > .5)
# sum([pred==true for pred, true in zip(preds, test_y)]) / len(test_y)

In [None]:
### NOTE: `expt1_util` module not ready to be used 
# # this works fine but then there is awkward train-test split 
# from expt1_util import docs_to_dtm
# alltext_dtm = docs_to_dtm(docs=dat.text, mode='count', num_words=vocab_n)
# print(len(alltext_dtm))
# print(alltext_dtm.shape)

##### 3.1 Get 5-fold crossval F1 for each clf across train subsets
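
A rough sketch of the cross-validation loop, assuming every classifier exposes the `.train()` / `.predict()` interface of the wrapper classes above. `MajorityStub`, `kfold_indices`, `crossval_f1`, and `f1_binary` are hypothetical names, the folds here are shuffled but not stratified, and `X` is a plain list (a sparse DTM would need row indexing instead of the list comprehensions).

```python
import random

def f1_binary(y_true, y_pred):
  '''F1 for the positive class (label 1); returns 0.0 if undefined'''
  tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
  fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
  fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
  denom = 2 * tp + fp + fn
  return 2 * tp / denom if denom else 0.0

def kfold_indices(n, k=5, seed=0):
  '''shuffle range(n) and deal the indices into k roughly equal folds'''
  idxs = list(range(n))
  random.Random(seed).shuffle(idxs)
  return [idxs[i::k] for i in range(k)]

class MajorityStub:
  '''hypothetical classifier with the same .train/.predict interface'''
  def train(self, X, y):
    self.majority = int(sum(y) * 2 >= len(y))
  def predict(self, X):
    return [self.majority for _ in X]

def crossval_f1(make_clf, X, y, k=5, seed=0):
  '''train a fresh classifier on k-1 folds, score F1 on the held-out fold'''
  scores = []
  for held_out in kfold_indices(len(y), k, seed):
    held = set(held_out)
    trn = [i for i in range(len(y)) if i not in held]
    clf = make_clf()  # fresh instance per fold, so no state leaks across folds
    clf.train([X[i] for i in trn], [y[i] for i in trn])
    preds = clf.predict([X[i] for i in held_out])
    scores.append(f1_binary([y[i] for i in held_out], preds))
  return scores

toy_X = list(range(20))
toy_y = [1] * 12 + [0] * 8
cv_scores = crossval_f1(MajorityStub, toy_X, toy_y)
print([round(s, 3) for s in cv_scores])
```

Passing a factory (`make_clf`) rather than an instance matters most for the keras-based types, where reusing one model across folds would carry trained weights from one fold into the next.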


##### 3.2 Generate predictions on evaluation set

##### 3.3 Train on random 70% subset, evaluate across subsets on remaining 30%

<br>
#### Part 4: Evaluate results
<hr>

##### 4.1 Plot CV scores across subsets and clfs

##### 4.2 Plot performance on holdout set

<br><br><br><br><br>
<hr><hr>
### Scratch area

In [None]:
##############################################################################

In [None]:
import numpy as np

def boosh(x, **kwargs):
  print(f'x is {x}...')
  return np.random.normal(**kwargs)

def boosh2(f_list, kwargs_list):
  assert len(f_list) == len(kwargs_list)
  for f, kwargs in zip(f_list, kwargs_list):
    val = f(**kwargs)
    print(f'{f.__name__} applied to kwargs: {val}')


# kw = dict(loc=1, scale=2, size=3)
# np.random.normal(**kw)
# boosh(3, **kw)

# np.random.exponential(scale=1.0, size=5)
# np.random.chisquare(df, size=1)


fs = [np.random.normal, np.random.exponential, np.random.chisquare]
kwargss = [dict(loc=1, scale=2, size=3), 
           dict(scale=2.0, size=2), 
           dict(df=2, size=1)]
np.random.seed(6933)
boosh2(fs, kwargss)

In [None]:
import keras
# note: `keras.History` does not exist; the History class is `keras.callbacks.History`

In [None]:
l1 = [1,2,3]
l2 = ['a','b','c']

for idx, (x, y) in enumerate(zip(l1, l2)):
  print(idx, x, y)