# MELEE: A Multi-Embedding LSTM Ensemble of Ensembles

**Team Members:**
1. **Ivan Hernandez** [https://computationalorganizationalresearch.com](https://computationalorganizationalresearch.com)
2. **Joe Meyer** [https://www.linkedin.com/in/meyerjoe152](https://www.linkedin.com/in/meyerjoe152)
3. **Weiwen Nie** [https://www.linkedin.com/in/weiwen-nie-255693a7](https://www.linkedin.com/in/weiwen-nie-255693a7)
4. **Andrew Cutler** [https://www.linkedin.com/in/andrew-cutler-66791781](https://www.linkedin.com/in/andrew-cutler-66791781)


**MELEE: Conceptual Overview**

Our solution can be summarized as a multi-embedding LSTM ensemble of ensembles, referred to as **"MELEE"** for simplicity. The key idea behind MELEE is to get different perspectives on what people said, combine the best of those perspectives, and then combine the specific combinations that work best. Here is the process.

1. Split Data into Train, Holdout, Dev, and Test Sets. Because there is a large amount of ensembling, we need to set aside data used only for ensembling: the holdout set, made from 20% of the train set. The remaining train set is used to train the LSTM models. The dev set is needed to evaluate the performance of different ensembles. The test set is used to make our final predictions.

2. Train different LSTM models using the sequences of embedded exercise responses as the inputs, using different embedding models and LSTM architectures. Sentence embeddings offer a way to quantify what a text describes along different dimensions, but a single embedding model is attuned to only a subset of specific aspects. So we embed each of the in-basket responses with a single embedding model and send that sequence of embeddings to a Long Short Term Memory (LSTM) model, which is well suited to learning how a sequence of vectors combines to form an outcome. We repeat this process using different embedding models, each of which picks up on things that the other embedding models do not.

3. Now that we have different LSTM models trained, each with its own perspective, some of those perspectives will capture only a piece of the rules for how the exercises were collectively scored. We'll ensemble the predictions of the different LSTMs to learn how to weight each perspective. Because some perspectives might combine better than others and in different ways, we'll vary the LSTMs that we combine as well as the type of ensembling method used.

4. Now that we have different ensembles trained, we can further optimize the predictions by combining the ensembles that work best for specific outcomes. By examining the dev performance of each ensemble, we can tell which ensembles are better suited to one outcome than another. We'll have the best set of ensembles for a given outcome make predictions on the test set, and then we'll average those predictions together.





# Step 0: Install Necessary Libraries

Before implementing any data processing, modeling, or prediction, we'll install two key libraries not available on Colab by default.

1. Sentence-Transformers: Create embeddings for each exercise
2. Keras-tuner: Try different LSTM architectures

This notebook also assumes that keras, numpy, sklearn, and pandas are installed.

In [None]:
!pip install sentence-transformers --quiet
!pip install keras-tuner --upgrade --quiet

# Step 1: Load in Train Data and Split into Train and Holdout Data

We need to load in all of the datasets at the same time.

This includes:
1. Train data: The largest of the datasets, with both exercise responses and scored outcomes. We will use this dataset to train different possible models and ensembles of those models
2. Dev data: The data used to submit to the development leaderboard. Performance on the leaderboard will be used to identify which ensembles work best
3. Test data: The data used for the final submission that determines the winner. We will ensemble the best ensembles and submit those meta-ensemble predictions to the test leaderboard.

## Step 1a: Load in the Datasets into their Own Dataframe

In [None]:
import pandas as pd #import the pandas library to work in reading, manipulating, and writing datasets
df_train = pd.read_csv("train_pub.csv") #read the train csv
df_dev = pd.read_csv("dev_pub.csv") #read the dev csv
df_test = pd.read_csv("test_pub.csv") #read the test csv

## Step 1b: Impute Any Missing Outcome Data

Some of the outcomes are missing for some of the cases.
To keep the information those cases provide for their non-missing outcomes, we'll impute the missing values rather than drop the cases.

Only the train data has labeled outcomes, so we only need to impute that dataset.

The code below uses scikit-learn's KNNImputer: each missing value is filled with the average of that outcome across the 4 most similar cases, where similarity is measured on the outcomes that both cases have observed. A model-based alternative is an iterative imputer, which is similar to the MICE package in R. Stochastic model-based imputation of that kind gives unbiased parameter estimates (mean, standard deviation, covariance) when the missingness mechanism is either MCAR or MAR.

In [None]:
from sklearn.impute import KNNImputer #load the imputer that uses k-nearest neighbors to impute
imputer = KNNImputer(n_neighbors=4, weights="uniform")

outcome_columns = ['rating_chooses_appropriate_action',
       'rating_commits_to_action', 'rating_gathers_information',
       'rating_identifies_issues_opportunities',
       'rating_interprets_information', 'rating_involves_others',
       'rating_decision_making_final_score']

df_y = df_train[outcome_columns]
df_y = pd.DataFrame(imputer.fit_transform(df_y))
df_y.columns = outcome_columns
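
To see what the KNN imputer does, here is a toy example with made-up numbers (not competition data). It leaves one rating missing and fills it with the mean of that column across the two nearest rows. This is only an illustrative sketch; the notebook itself uses `n_neighbors=4` on the seven outcome columns.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy outcome matrix (made-up values): row 0 is missing its third rating
toy_y = np.array([
    [1.0, 2.0, np.nan],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 9.0],
])

# Distances are computed on the columns that both rows have observed,
# so rows 1 and 2 are the two nearest neighbors of row 0
toy_imputer = KNNImputer(n_neighbors=2, weights="uniform")
toy_filled = toy_imputer.fit_transform(toy_y)

print(toy_filled[0])  # the third value is the mean of 3.0 and 5.0
```

Observed values are left unchanged; only the missing cell is filled.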

## Step 1c: Isolate the Predictor Text

The predictors in this problem are the responses to the exercises.

Those begin at the 9th column, which is index 8 because Python indexing starts at 0, so for each of the datasets we extract that column and every column after it.

In [None]:
X = df_train.iloc[:,8:]
X_dev = df_dev.iloc[:,8:]
X_test = df_test.iloc[:,8:]

## Step 1d: Create a Train-Test Split

Because we are ensembling, we create a consistent holdout dataset from the larger train dataset that can be used to fit the ensemble models and choose their parameters.

We also create y_all, which stacks the train outcomes on top of the holdout outcomes so that the LSTMs can later be refit on both splits combined.

In [None]:
from sklearn.model_selection import train_test_split
import numpy as np
X_train, X_holdout,y_train, y_holdout = train_test_split(X, df_y, test_size=0.2, random_state=152)
y_all = np.vstack((y_train,y_holdout))
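
The split and the stacking order can be checked on toy arrays. The sketch below uses made-up data (10 participants) with the same `test_size` and `random_state` as above. The point is that `y_all` is stacked train first, then holdout, which is the same order later used to stack the embedded sequences.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins (made-up): 10 participants, 3 text columns, 7 outcomes
toy_X = np.arange(30).reshape(10, 3)
toy_y = np.arange(70).reshape(10, 7)

toy_X_train, toy_X_holdout, toy_y_train, toy_y_holdout = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=152
)

# Stack train first, then holdout, the same order used for y_all above
toy_y_all = np.vstack((toy_y_train, toy_y_holdout))

print(toy_X_train.shape, toy_X_holdout.shape, toy_y_all.shape)
```

With 10 rows and `test_size=0.2`, 8 rows go to train and 2 to holdout, and the stacked outcomes contain every participant exactly once.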

# Step 2: Train Multiple LSTM Models using Different Embeddings

This part of the code is where the main machine learning / modeling occurs.

Specifically, Step 2 covers the "MEL" part of MELEE, where we use multiple different sentence embedding models to embed the sequence of exercise responses. For each sequence of embeddings, we train different types of Long Short Term Memory models to predict the outcome values associated with that person's sequence.

## Step 2a. Create list of Embedding Models

In [None]:
embedding_models = ["all-MiniLM-L6-v2",
                    "paraphrase-albert-small-v2",
                    "all-mpnet-base-v2",
                    "all-roberta-large-v1",
                    "paraphrase-multilingual-mpnet-base-v2",
                    "facebook/contriever-msmarco",
                    "msmarco-bert-co-condensor",
                    "gtr-t5-xxl",
                    "LaBSE",
                    "sentence-t5-xxl",
                    "facebook/bart-large-cnn",
                    "pszemraj/bart-large-instructiongen-w-inputs",
                    "Davlan/distilbert-base-multilingual-cased-ner-hrl",
                    "intfloat/e5-large-v2",
                    "google/electra-base-discriminator",
                    "j-hartmann/emotion-english-distilroberta-base",
                    "google/flan-t5-base",
                    "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli",
                    "pszemraj/opt-125m-email-generation",
                    "Muennighoff/SGPT-2.7B-weightedmean-msmarco-specb-bitfit",
                    "bigscience/sgpt-bloom-7b1-msmarco",
                    "symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli",
                    "google/t5-3b-ssm-nq",
                    "snrspeaks/t5-one-line-summary",
                    "princeton-nlp/unsup-simcse-bert-base-uncased",
                    "xlm-roberta-large"]


## Step 2b. Create a Function to Encode The Exercise Responses as a Sequence of Embeddings

This code defines a function called embed_responses that takes two parameters: embeddingmodel and df. The function is used to encode exercise responses as a sequence of embeddings.


The function starts by specifying a list of column names representing the text responses for different exercises. It then initializes an empty list called embedding_sequences, which will store the encoded sequences of embeddings for each participant's exercise responses.


The function then iterates over each row in the input DataFrame (df). For each participant, it extracts the text responses for the exercises from the specified columns. If a response is missing (indicated by "nan"), the function replaces it with the placeholder word "nothing" and records the position of the missing response. Otherwise it keeps only the text before the first "--" in the response. The function then encodes the participant's text responses using the embeddingmodel and stores the resulting embeddings in the participant_scores variable. The embeddings at the missing positions are then set to all zeros. Finally, the encoded sequence of embeddings for each participant is added to the embedding_sequences list.


The function returns a NumPy array containing all the encoded sequences of embeddings for the exercise responses of each participant.

In [None]:
def embed_responses(embeddingmodel,df):
  columns = ['text_exercise_4',
        'text_exercise_5', 'text_exercise_6', 'text_exercise_7',
        'text_exercise_8', 'text_exercise_9', 'text_exercise_10',
        'text_exercise_11', 'text_exercise_12', 'text_exercise_13',
        'text_exercise_14', 'text_exercise_15', 'text_exercise_16',
        'text_exercise_17', 'text_exercise_18', 'text_exercise_19',
        'text_exercise_final']

  embedding_sequences = []

  for idx in range(len(df)):

    participant_text = []
    participant_scores = []
    participant_mask = []
    for idx2,column in enumerate(columns):
      text = df.iloc[idx][column]

      if str(text) == "nan":
        participant_text.append("nothing")
        participant_mask.append(idx2)
      else:
        participant_text.append(text.split("--")[0])

    participant_scores = embeddingmodel.encode(participant_text)
    participant_scores[participant_mask,:] = 0

    embedding_sequences.append(participant_scores)

  return np.array(embedding_sequences)
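
The masking logic for one participant can be illustrated without downloading an embedding model. In the sketch below, `FakeEncoder` is a hypothetical stand-in for a SentenceTransformer (it is not part of any library), and the three responses are made up. It only mirrors the missing-response handling of embed_responses.

```python
import numpy as np

class FakeEncoder:
    """Hypothetical stand-in for an embedding model: returns fixed-size vectors."""
    def __init__(self, dim=4):
        self.dim = dim

    def encode(self, texts):
        # One deterministic, non-zero vector per text (based on its length only)
        return np.array([[len(t) + 1.0] * self.dim for t in texts])

# Made-up responses for one participant; the second one is missing
responses = ["Call the vendor -- sent 9am", float("nan"), "Escalate to the manager"]

texts, missing = [], []
for position, text in enumerate(responses):
    if str(text) == "nan":
        texts.append("nothing")            # placeholder so the encoder has input
        missing.append(position)           # remember which position was missing
    else:
        texts.append(text.split("--")[0])  # keep only the text before the first "--"

vectors = FakeEncoder(dim=4).encode(texts)
vectors[missing, :] = 0  # overwrite the placeholder embedding with zeros

print(vectors.shape)  # (3, 4): one row per exercise, one column per dimension
```

In the real function there are 17 exercises per participant, so each participant contributes a 17 x dim matrix to the returned array.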

## Step 2c. Create Custom Loss Function that Optimizes the MSE and the Reproduction of the Observed Correlation Matrix

If we train a multi-output LSTM to minimize only the MSE, its predictions for the different outcomes will all be highly correlated with one another. That is not useful for ensembling: predictions that are not independent carry less information.

To increase the informativeness of the models' predictions, we will reward models whose correlations between predicted outcomes come close to the actual correlations between the outcome variables.

Specifically, the loss function will be the MSE + the average absolute difference between the observed and predicted inter-item correlation matrices.

The code below:
1. Calculates the observed correlation between the 7 outcomes
2. Creates a function to calculate the correlation between two variables
3. Creates a custom metric to monitor the weighted correlation using the weighting criteria of the competition (0.1 for each of the six ratings and 0.4 for the final score)
4. Creates a custom loss function in Keras that builds an inter-item correlation matrix between all of the predicted outcomes. It takes the absolute difference between the observed and predicted inter-item correlations, averaged over the 21 distinct pairs of outcomes, and sets the objective to minimize the MSE plus that average error in reconstructing the correlation matrix

In [None]:
import itertools
import tensorflow as tf
import numpy as np

y_corr = np.corrcoef(df_y.values.T)


def correlation_coefficient(x, y):    
    mx = tf.math.reduce_mean(x)
    my = tf.math.reduce_mean(y)
    xm, ym = x-mx, y-my
    r_num = tf.math.reduce_mean(tf.multiply(xm,ym))        
    r_den = tf.math.reduce_std(xm) * tf.math.reduce_std(ym)
    return r_num / r_den

def weightedcorrelation(y_true,y_pred):
  c1 = correlation_coefficient(y_true[:,0],y_pred[:,0])
  c2 = correlation_coefficient(y_true[:,1],y_pred[:,1])
  c3 = correlation_coefficient(y_true[:,2],y_pred[:,2])
  c4 = correlation_coefficient(y_true[:,3],y_pred[:,3])
  c5 = correlation_coefficient(y_true[:,4],y_pred[:,4])
  c6 = correlation_coefficient(y_true[:,5],y_pred[:,5])
  c7 = correlation_coefficient(y_true[:,6],y_pred[:,6])
  weighted_error = (c1*.1)+(c2*.1)+(c3*.1) + (c4*.1) + (c5*.1) + (c6*.1)  + (c7*.4) 
  return weighted_error

import itertools
import tensorflow as tf
mse = tf.keras.losses.MeanSquaredError()
def customLoss2(y_true,y_pred):
    c_true = 0
    for i,j, in itertools.combinations(range(y_corr.shape[0]),2):
        c = correlation_coefficient(y_pred[:,i],y_pred[:,j])
        c_y= y_corr[i,j]
        error =  tf.math.abs(tf.subtract(c,c_y))
        c_true += error
    return mse(y_true,y_pred) + tf.math.divide(c_true,21)
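
As a sanity check on the loss definition, the same quantity can be restated in NumPy and evaluated on made-up data. This is a sketch, not the Keras loss itself: `numpy_custom_loss` is a hypothetical helper, and the random outcomes are only for illustration.

```python
import itertools
import numpy as np

def numpy_custom_loss(y_true, y_pred, target_corr):
    """NumPy restatement of the loss: MSE plus mean absolute correlation error."""
    n_outcomes = y_true.shape[1]
    pairs = list(itertools.combinations(range(n_outcomes), 2))
    pred_corr = np.corrcoef(y_pred.T)  # correlations between predicted outcomes
    corr_error = sum(abs(pred_corr[i, j] - target_corr[i, j]) for i, j in pairs)
    mse = np.mean((y_true - y_pred) ** 2)
    return mse + corr_error / len(pairs), len(pairs)

rng = np.random.default_rng(0)
toy_true = rng.normal(size=(50, 7))   # made-up outcomes
toy_corr = np.corrcoef(toy_true.T)    # plays the role of y_corr

perfect_loss, n_pairs = numpy_custom_loss(toy_true, toy_true, toy_corr)
noisy_pred = toy_true + rng.normal(size=(50, 7))
noisy_loss, _ = numpy_custom_loss(toy_true, noisy_pred, toy_corr)

print(n_pairs)  # 21 distinct pairs for 7 outcomes
```

Perfect predictions give a loss of essentially zero, and noisy predictions give a larger loss. The divisor of 21 in customLoss2 is the number of distinct outcome pairs.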

## Step 2d. Create Functions that Provide Newly Initialized Shallow, Bidirectional, and Tuned LSTM Models

This code defines three functions to create LSTM models for different purposes. 

1. The first function, get_shallow_model, creates a shallow LSTM model that takes a sequence of input data and predicts an output. 

2. The second function, get_bidirectional_model, creates a bidirectional LSTM model, which means it processes the input sequence in both forward and backward directions to capture more context. 

3. The third function, get_tunable_model, creates a tunable LSTM model that allows the user to choose between a bidirectional LSTM and a regular LSTM, as well as different activation functions and dropout rates. 

Each model is compiled with the custom loss function from Step 2c, the MSE and weighted-correlation metrics, and the RMSprop optimizer. Here they predict the 7 outcomes from each participant's sequence of 17 response embeddings.

In [None]:
from tensorflow.keras.layers import Input, Dense,LSTM,Flatten,concatenate, Dropout,BatchNormalization,Masking,Bidirectional
from tensorflow.keras import Model
from tensorflow.keras.optimizers import RMSprop
import tensorflow as tf

earlystopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    min_delta=0,
    patience=50,
    verbose=0,
    mode="auto",
    baseline=None,
    restore_best_weights=False,
)


def get_shallow_model(dim):
    first_input = Input(shape=(17,dim))
    lstm_out = LSTM(64)(first_input)
    flattened_out = Flatten()(lstm_out)
    flattened_out = BatchNormalization()(flattened_out)
    fc1 = Dense(16,activation="tanh")(flattened_out)
    fc1 = Dropout(.5)(fc1)
    fc1 = BatchNormalization()(fc1)
    output = Dense(7,activation="linear")(fc1)
    model = Model(inputs=first_input, outputs=output)
    model.compile(loss=customLoss2,metrics=["mse",weightedcorrelation],optimizer=RMSprop(learning_rate=1e-2))
    return model

def get_bidirectional_model(dim):
    first_input = Input(shape=(17,dim))
    lstm_out = Bidirectional(LSTM(64))(first_input)
    flattened_out = Flatten()(lstm_out)
    flattened_out = BatchNormalization()(flattened_out)
    fc1 = Dense(16,activation="tanh")(flattened_out)
    fc1 = Dropout(.5)(fc1)
    fc1 = BatchNormalization()(fc1)
    output = Dense(7,activation="linear")(fc1)
    model = Model(inputs=first_input, outputs=output)
    model.compile(loss=customLoss2,metrics=["mse",weightedcorrelation],optimizer=RMSprop(learning_rate=1e-2))
    return model

def get_tunable_model(hp):
    model = tf.keras.Sequential()
    model.add(Masking(mask_value=0.))
    # Choose between a Bidirectional LSTM and a regular LSTM
    if hp.Choice('use_bidirectional', values=[True, False]):
        model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
            units=hp.Int('LSTMUnits', min_value=4, max_value=128, step=4))))
        
    else:
        model.add(tf.keras.layers.LSTM(
            units=hp.Int('LSTMIUnits', min_value=4, max_value=128, step=4)
        ))
    model.add(Flatten())
    model.add(BatchNormalization())
    if hp.Choice('use_tanh', values=[True, False]):
        model.add(tf.keras.layers.Dense(hp.Int('DenseUnits', min_value=4, max_value=64, step=4),activation="tanh"))
    else:
        model.add(tf.keras.layers.Dense(hp.Int('DenseUnits', min_value=4, max_value=64, step=4),activation="relu"))
    model.add(tf.keras.layers.Dropout(rate=hp.Float('dropout', min_value=0.0, max_value=0.7, step=0.1)))
    model.add(BatchNormalization())
    model.add(tf.keras.layers.Dense(units=7, activation='linear'))
    
    model.compile(
        loss=customLoss2,
        metrics=["mse",weightedcorrelation],
        optimizer=tf.keras.optimizers.RMSprop(hp.Choice('learning_rate', values=[1e-1,5e-1,1e-2, 1e-3]))
    )
    
    return model
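
The tunable model starts with a `Masking(mask_value=0.)` layer, which is where the zeroed rows from Step 2b matter: Keras masks a timestep when every feature in it equals the mask value. The NumPy sketch below reproduces that rule on a made-up sequence. It mimics the rule rather than calling Keras, and the shallow and bidirectional models above do not include this layer.

```python
import numpy as np

# Made-up sequence for one participant: 4 exercises, 3 embedding dimensions
toy_sequence = np.array([
    [0.2, -0.1, 0.7],
    [0.0,  0.0, 0.0],   # missing response, zeroed in Step 2b
    [0.0,  0.3, 0.0],   # contains zeros but is not all-zero, so it is kept
    [0.5,  0.5, 0.5],
])

mask_value = 0.0
# True where the timestep is kept, False where it would be skipped
keep_timestep = np.any(toy_sequence != mask_value, axis=-1)

print(keep_timestep)  # [ True False  True  True]
```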

## Step 2e. Create folders to hold the LSTM predictions

The code creates empty directories to store the predictions each trained LSTM makes on the holdout, dev, and test datasets, plus a directory for the trained LSTM weights.

In [None]:
!mkdir LSTM_predictions_holdout
!mkdir trained_LSTM
!mkdir LSTM_predictions_dev
!mkdir LSTM_predictions_test

## Step 2f. Train a Shallow, Bidirectional, and Tuned LSTM Model for Each Embedding 

In [None]:
from sentence_transformers import SentenceTransformer
from keras_tuner import BayesianOptimization

for embedding_model_name in embedding_models:
  embeddingmodel = SentenceTransformer(embedding_model_name) #load the model
  embedding_model_name = embedding_model_name.split("/")[-1] #get the name of the model
  print(embedding_model_name)
  embeddingmodel.cuda() #run the model on the gpu
  embedding_sequence_train = embed_responses(embeddingmodel,X_train) #encode the train responses
  embedding_sequence_holdout = embed_responses(embeddingmodel,X_holdout) #encode the holdout responses
  embedding_sequence_dev = embed_responses(embeddingmodel,X_dev) #encode the dev responses
  embedding_sequence_test = embed_responses(embeddingmodel,X_test) #encode the test responses
  embedding_sequence_all = np.vstack((embedding_sequence_train,embedding_sequence_holdout)) #we combine the train and the holdout to train an even more performant model after we have identified the ideal hyperparameters
  

  dim = embedding_sequence_train.shape[-1]

  ### Train Shallow LSTMs
  idxs = []
  for i in range(10): #We don't know a good stopping point in advance, so we train 10 versions of each model and record the epoch at which the validation MSE is lowest
      model = get_shallow_model(dim)
      history = model.fit(embedding_sequence_train, y_train,shuffle=True,verbose=0,batch_size=1600,epochs=500,validation_batch_size=500,validation_split=.2,callbacks=[earlystopping])
      best_idx = np.argmin(history.history['val_mse'])
      idxs.append(best_idx)

  model = get_shallow_model(dim) #We make a new shallow LSTM model
  best_overall_idx = int(np.mean(idxs)) #We calculate what was the average of the best epochs
  model.fit(embedding_sequence_train, y_train,shuffle=True,verbose=0,batch_size=1600,epochs=best_overall_idx) #We fit a model on the train data
  holdout_pred = model.predict(embedding_sequence_holdout,verbose=0) #We make predictions on the holdout data
  np.save(f"LSTM_predictions_holdout/shallow_{embedding_model_name}.npy",holdout_pred)
  
  model = get_shallow_model(dim) #We make a new shallow LSTM model
  model.fit(embedding_sequence_all, y_all,shuffle=True,verbose=0,batch_size=1600,epochs=best_overall_idx,validation_batch_size=500) #We fit a model on the train and holdout data to get a model that is even better performing
  model.save_weights(f"trained_LSTM/shallow_{embedding_model_name}.h5") # we save the weights on the model
  
  dev_pred = model.predict(embedding_sequence_dev,verbose=None) #We make predictions on the dev data with the combined data model
  test_pred = model.predict(embedding_sequence_test,verbose=None) #We make predictions on the test data with the combined data model
  
  np.save(f"LSTM_predictions_dev/shallow_{embedding_model_name}.npy",dev_pred)
  np.save(f"LSTM_predictions_test/shallow_{embedding_model_name}.npy",test_pred)

  ### Train Bidirectional LSTMs
  idxs = []
  for i in range(5):
      model = get_bidirectional_model(dim) #use the bidirectional architecture when estimating its stopping epoch
      history = model.fit(embedding_sequence_train, y_train,shuffle=True,verbose=0,batch_size=1600,epochs=500,validation_batch_size=500,validation_split=.2,callbacks=[earlystopping])
      best_idx = np.argmin(history.history['val_mse'])
      idxs.append(best_idx)

  model = get_bidirectional_model(dim)
  best_overall_idx = int(np.mean(idxs))
  model.fit(embedding_sequence_train, y_train,shuffle=True,verbose=0,batch_size=1600,epochs=best_overall_idx)
  holdout_pred = model.predict(embedding_sequence_holdout,verbose=None)
  np.save(f"LSTM_predictions_holdout/bidirectional_{embedding_model_name}.npy",holdout_pred)
  
  model = get_bidirectional_model(dim)
  model.fit(embedding_sequence_all, y_all,shuffle=True,verbose=0,batch_size=1600,epochs=best_overall_idx,validation_batch_size=500)
  model.save_weights(f"trained_LSTM/bidirectional_{embedding_model_name}.h5")
  
  dev_pred = model.predict(embedding_sequence_dev,verbose=None)
  test_pred = model.predict(embedding_sequence_test,verbose=None)
  
  np.save(f"LSTM_predictions_dev/bidirectional_{embedding_model_name}.npy",dev_pred)
  np.save(f"LSTM_predictions_test/bidirectional_{embedding_model_name}.npy",test_pred)


  ### Train Tuned LSTMs

  tuner = BayesianOptimization(
          get_tunable_model,
          objective='val_loss',
          max_trials=5,
          executions_per_trial=3,
          directory='my_dir',
          max_consecutive_failed_trials=15,
          overwrite=True,
          project_name='lstm_tuning'
      )

  tuner.search(embedding_sequence_train, y_train,verbose=0,
          epochs=300,batch_size=1600,validation_batch_size=500,shuffle=True,
          validation_split=.2)
  best_hp = tuner.get_best_hyperparameters()[0]
  model = get_tunable_model(best_hp)
  model.fit(embedding_sequence_train, y_train,shuffle=True,verbose=0,batch_size=1600,epochs=300,validation_batch_size=500)
  holdout_pred = model.predict(embedding_sequence_holdout,verbose=None)
  np.save(f"LSTM_predictions_holdout/tuned_{embedding_model_name}.npy",holdout_pred)
  
  model.fit(embedding_sequence_all, y_all,shuffle=True,verbose=0,batch_size=1600,epochs=300,validation_batch_size=500)
  model.save_weights(f"trained_LSTM/tuned_{embedding_model_name}.h5")
  
  dev_pred = model.predict(embedding_sequence_dev,verbose=None)
  test_pred = model.predict(embedding_sequence_test,verbose=None)
  
  np.save(f"LSTM_predictions_dev/tuned_{embedding_model_name}.npy",dev_pred)
  np.save(f"LSTM_predictions_test/tuned_{embedding_model_name}.npy",test_pred)
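
The stopping-point heuristic used for the shallow and bidirectional models can be shown in isolation. The sketch below uses made-up validation curves in place of `history.history['val_mse']`. Note that `np.argmin` is 0-based, so the averaged value is a position in the curve rather than a 1-based epoch count.

```python
import numpy as np

# Made-up validation-MSE curves from three repeated training runs
toy_histories = [
    [0.90, 0.60, 0.45, 0.50, 0.55],
    [0.95, 0.70, 0.52, 0.48, 0.51],
    [0.88, 0.58, 0.47, 0.49, 0.53],
]

# Position of the lowest validation MSE in each run (0-based)
best_epochs = [int(np.argmin(curve)) for curve in toy_histories]

# Average the positions and truncate, as the training loop above does
epochs_to_train = int(np.mean(best_epochs))

print(best_epochs, epochs_to_train)  # [2, 3, 2] 2
```

Averaging over several runs smooths out the run-to-run variation that comes from random weight initialization and the random validation split.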


# Step 3. Ensemble the Different LSTMs

## Step 3a. Get the Predictions of Each LSTM model from the prior section

The code below loads the predictions of each LSTM model that were computed in the prior section and saved as numpy (.npy) files. 

It uses the glob module to find the prediction files for the holdout, dev, and test data. For each filename ending in .npy in those folders, the code loads the array with np.load() and stores it in a dictionary (holdout_lstm_preds, dev_lstm_preds, or test_lstm_preds). The keys of each dictionary are the file names, which identify the model, and the values are the loaded predictions.

The code displays how many files were located within each folder.

In [None]:
import glob
import numpy as np
holdout_lstm_preds = {}

fnames = sorted(glob.glob("LSTM_predictions_holdout/*.npy"))
for fname in fnames:
  data = np.load(fname)
  holdout_lstm_preds[fname.split("/")[-1]] = data
len(holdout_lstm_preds)

In [None]:
import glob
import numpy as np
dev_lstm_preds = {}

fnames = sorted(glob.glob("LSTM_predictions_dev/*.npy"))
for fname in fnames:
  data = np.load(fname)
  dev_lstm_preds[fname.split("/")[-1]] = data
len(dev_lstm_preds)

In [None]:
import numpy as np
import glob
test_lstm_preds = {}
fnames = sorted(glob.glob("LSTM_predictions_test/*.npy"))
for fname in fnames:
  data = np.load(fname)
  test_lstm_preds[fname.split("/")[-1]] = data
len(test_lstm_preds)

## Step 3b. Make Folders to Hold the Predictions of the Ensemble Models

In [None]:
!mkdir ensemble_predictions_dev
!mkdir ensemble_predictions_test

## Step 3c. Create a Function to Ensemble a Random Sample of Those Predictions

The code below defines a function that ensembles a random sample of the predictions on the 7 outcomes made by the different LSTM models trained in the previous section.

The function:
1. Takes the predictions from the different LSTM models as input, along with the parameters k and modeltype.
2. Selects a random subset of k embedding names. For each selected embedding, it then randomly picks either the shallow, bidirectional, or tuned version of the LSTM model.
3. Horizontally stacks the predictions from the selected models into a single predictor matrix. For example, if there are 15 models, there will be 15 x 7 = 105 columns in the predictor matrix.
4. Creates the type of ensemble model (Ridge, Lasso, or Random Forest) specified by the 'modeltype' parameter, one model per outcome.
5. Fits each ensemble model on the stacked predictions of the selected LSTM models for the holdout dataset.
6. Uses the fitted ensemble models to make predictions for the development and test datasets.
7. Rescales the predicted scores with min-max scaling (1 to 4 for the six ratings, 1 to 7 for the final score) and stores them in separate CSV files for the development and test datasets, named by an md5 hash of the selected models and ensemble type.
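
Steps 2 and 3 of the list above can be sketched on made-up predictions. The embedding names (`emb_a.npy` and so on), the sizes, and the random values below are hypothetical; only the selection and stacking logic mirrors the function.

```python
import random
import numpy as np

random.seed(0)
n_holdout = 5  # made-up number of holdout participants

# Made-up prediction dictionary keyed like the saved files: <variant>_<embedding>.npy
variants = ["shallow_", "bidirectional_", "tuned_"]
embeddings = ["emb_a.npy", "emb_b.npy", "emb_c.npy", "emb_d.npy"]
toy_preds = {v + e: np.random.rand(n_holdout, 7) for v in variants for e in embeddings}

# Recover the embedding names by stripping the variant prefix
stems = sorted({
    name.replace("shallow_", "").replace("bidirectional_", "").replace("tuned_", "")
    for name in toy_preds
})

k = 3
chosen_stems = random.sample(stems, k)  # k distinct embeddings
chosen = [random.choice(variants) + stem for stem in chosen_stems]  # one variant each

stacked = np.hstack([toy_preds[name] for name in chosen])
print(stacked.shape)  # (5, 21): k models x 7 outcome predictions per model
```

Because the sample is drawn over embedding names, an ensemble never contains two variants of the same embedding.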








In [None]:
import random
import pandas as pd
from sklearn.linear_model import RidgeCV,LassoCV
from sklearn.ensemble import RandomForestRegressor
import hashlib
def ensemble_preds(holdout_lstm_preds,dev_lstm_preds,test_lstm_preds,k=15,modeltype='ridge'):
  
  possible_models = list(holdout_lstm_preds.keys())
  model_stems = list(set([x.replace("shallow_","").replace("bidirectional_","").replace("tuned_","") for x in possible_models]))
  selected_model_stems = random.sample(model_stems,k)
  selected_models = [random.choice(["shallow_","bidirectional_","tuned_"]) + x for x in selected_model_stems]
  holdout_lstm_X = np.hstack([holdout_lstm_preds[modelname] for modelname in selected_models])

  dev_lstm_X = np.hstack([dev_lstm_preds[modelname] for modelname in selected_models]) 
  test_lstm_X = np.hstack([test_lstm_preds[modelname] for modelname in selected_models])  

  def make_ensemble_model():
    #return a fresh, unfitted ensembling model of the requested type
    if modeltype == 'ridge':
      return RidgeCV([.0001,.001,.01,.1,1,10,25,50,75,100,200,300,400,500,600])
    if modeltype == "lasso":
      return LassoCV(cv=10, max_iter=10000,random_state=0)
    if modeltype == "randomforest":
      return RandomForestRegressor(n_estimators=1000,max_depth=5,max_features = .667)
    raise ValueError(f"Unknown modeltype: {modeltype}")

  df_dev = pd.read_csv("dev_pub.csv")
  df_test = pd.read_csv("test_pub.csv")

  for outcome_idx in range(7): #fit one ensembling model per outcome
    ensemblemodel = make_ensemble_model()
    ensemblemodel.fit(holdout_lstm_X,y_holdout.iloc[:,outcome_idx])
    dev_scores = ensemblemodel.predict(dev_lstm_X)
    test_scores = ensemblemodel.predict(test_lstm_X)

    #min-max rescale the predictions: the first six outcomes to the 1-4 range, the final score to the 1-7 range
    score_range = 6 if outcome_idx == 6 else 3
    dev_scores = ((dev_scores - dev_scores.min())/(dev_scores.max()- dev_scores.min())) * score_range + 1
    test_scores = ((test_scores - test_scores.min())/(test_scores.max()- test_scores.min())) * score_range + 1

    #the outcome columns are columns 1-7 of the dev and test files
    df_dev.iloc[:,outcome_idx+1] = dev_scores
    df_test.iloc[:,outcome_idx+1] = test_scores

  combined_model_names = " ".join(selected_models)+ " "+modeltype
  hashedname = hashlib.md5(combined_model_names.encode()).hexdigest()
  df_dev.to_csv("ensemble_predictions_dev/"+hashedname+'.csv',index=False)
  df_test.to_csv("ensemble_predictions_test/"+hashedname+'.csv',index=False)
  return
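
The rescaling lines above all apply the same min-max transform: the first six outcomes are mapped onto the range 1 to 4 and the final score onto the range 1 to 7. A minimal sketch of that transform as a reusable helper is shown here. The function name `rescale_to_range` and the guard for constant inputs are assumptions added for illustration and are not part of the original code.

```python
import numpy as np

def rescale_to_range(scores, low, high):
    """Min-max rescale so the smallest value maps to `low` and the largest to `high`."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        # Assumption: constant predictions are placed at the midpoint of the range
        # instead of dividing by zero.
        return np.full_like(scores, (low + high) / 2)
    return (scores - scores.min()) / span * (high - low) + low

raw = np.array([0.2, 0.5, 0.8])
sub_scores = rescale_to_range(raw, 1, 4)   # same as (x - min) / (max - min) * 3 + 1
final_score = rescale_to_range(raw, 1, 7)  # same as (x - min) / (max - min) * 6 + 1
```

Note that min-max rescaling forces every set of predictions to span the full range, so the lowest prediction always becomes 1 and the highest always becomes 4 (or 7).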

## Step 3d. Train the Ensemble Models

The following loop runs 10 times to create 10 different ensemble models for each combination of ensembling model type (ridge, lasso, random forest) and embedding sample size k (15 or 25).

Each iteration makes six calls to ensemble_preds, so the loop produces up to 60 ensembles that differ in the combination of LSTM models, the value of k, and the ensemble model type.
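
Each call saves its predictions under a file name derived from the md5 hash of the selected model names and the ensemble model type, as done at the end of ensemble_preds. The sketch below isolates that naming step. The helper name `ensemble_file_stem` and the model names are hypothetical and serve only to illustrate the behavior.

```python
import hashlib

def ensemble_file_stem(selected_models, modeltype):
    """Mirror the naming at the end of ensemble_preds: md5 of the joined names."""
    combined_model_names = " ".join(selected_models) + " " + modeltype
    return hashlib.md5(combined_model_names.encode()).hexdigest()

# Hypothetical model names, used only to illustrate the naming behavior.
models = ["embedding_a_lstm", "embedding_b_lstm"]
ridge_stem = ensemble_file_stem(models, "ridge")
lasso_stem = ensemble_file_stem(models, "lasso")
reordered_stem = ensemble_file_stem(list(reversed(models)), "ridge")
```

Because the hash depends only on the model names, their order, and the ensemble type, two calls that select the same models in the same order with the same type write to the same file.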

In [None]:
for i in range(10):
  ensemble_preds(holdout_lstm_preds,dev_lstm_preds,test_lstm_preds,k=15,modeltype='ridge')
  ensemble_preds(holdout_lstm_preds,dev_lstm_preds,test_lstm_preds,k=15,modeltype='lasso')
  ensemble_preds(holdout_lstm_preds,dev_lstm_preds,test_lstm_preds,k=15,modeltype='randomforest')

  ensemble_preds(holdout_lstm_preds,dev_lstm_preds,test_lstm_preds,k=25,modeltype='ridge')
  ensemble_preds(holdout_lstm_preds,dev_lstm_preds,test_lstm_preds,k=25,modeltype='lasso')
  ensemble_preds(holdout_lstm_preds,dev_lstm_preds,test_lstm_preds,k=25,modeltype='randomforest')

# Step 4. Ensemble the Best Ensembles

Before running the following code, submit all of the predictions in the "ensemble_predictions_dev" folder, and name each submission using the md5 hash in its file name.

After submitting the predictions to the dev leaderboard, download the results and examine which models performed best for each outcome.

Then write the md5 hashes of the 10 best models for each outcome variable in the relevant list below.
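
If the downloaded results are available in machine-readable form, the selection can be scripted instead of done by hand. The sketch below assumes the results have been loaded into a dictionary that maps each submission hash to its dev score per outcome, and that a higher score is better. The structure, the function name `top_hashes`, and the placeholder hashes and scores are all assumptions for illustration.

```python
def top_hashes(results, outcome, n=10, higher_is_better=True):
    """Return the n submission hashes with the best score on one outcome."""
    ranked = sorted(results, key=lambda h: results[h][outcome],
                    reverse=higher_is_better)
    return ranked[:n]

# Placeholder leaderboard results: {submission hash: {outcome name: dev score}}.
results = {
    "hash_a": {"gathers_information": 0.41, "involves_others": 0.30},
    "hash_b": {"gathers_information": 0.38, "involves_others": 0.35},
    "hash_c": {"gathers_information": 0.44, "involves_others": 0.28},
}

best_gathers = top_hashes(results, "gathers_information", n=2)
best_involves = top_hashes(results, "involves_others", n=2)
```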

## Step 4a. List Best Performing Models for Each Outcome Variable Based on Dev Board Performance

In [55]:
best_chooses_appropriate_action_ensembles = ['18d313d9701d33f36af52c4696ffa0d7',
                                            '99ac4d4d737ff049249c28efadd08b46',
                                            '7cce2110db3f57621e45a2770229af5b',
                                            'fd8703c72aee0cf909998b4fbdd95f57',
                                            '8a1243ff626a877e7d1b8e47d9e56bcb',
                                            '2404a7b64262ce46d39faba52e790e6e',
                                            'f2eb1c0d1f998d5500e2db2dd4644e53',
                                            '458bed338d1a4eacf59929e32c063ad5',
                                            '09f4005738b2b4485defe59c2818ba46',
                                            '22a7bab35b9b58d53ef1c96dae116654']

best_commits_to_action_ensembles = ['7061187643bea2b7fb5bf618e6ac9256',
                                    '2448f7f260a743534b96322af4fe12a7',
                                    'e10f49808db12fa6c7a81a4449298887',
                                    '173df03c061161f3196c465a68437611',
                                    'b2d0fa9b1c9dcb46953607a65fe7fd32',
                                    'f885f933e9ba1ca54055a7842a8f3ac1',
                                    'd705740e6fe7587ca582e51964db6bf3',
                                    'd0dbe54028612bca03adbd90dfa369c3',
                                    '8b5a8cc5805d3269958e7216fb433551',
                                    'dcf9925c0303b5b38a030d2b947c9354']
                                            
best_gathers_information_ensembles = ['416413f39c1bb2ddbb73b0bbcc7072da',
                                      '372a9928bf14e75e6646ddfa31de86f4',
                                      '25a14971428bc7d0b14d20c0ff362384',
                                      '756551bdc094b50505e1eaba12897d84',
                                      '7aef9245cc5487aed03479543318aaa5',
                                      '5867b534927e189131d3ccc221756c68',
                                      '9da88596b319cd4fc3f9fdbd60296dc6',
                                      'dc47eadf48a7d30a28eb3eb588f6a07e',
                                      '0732b98609c24f00504f776021a9c045',
                                      '1272372d18c2efa48d3f8908ade220ec']

best_identifies_issues_opportunities_ensembles = ['38287cd668eac4ac15b1f7f6b44c53f6',
                                                  'ef20d80c091dd003e8670b89418c3a73',
                                                  '2ccf5765888c29194ba0c4c9b82f9c76',
                                                  '0e9ab116ffaec1eb4518313d6b6c1fd9',
                                                  '55ee36a03a2c08b81300915bd85888da',
                                                  '5eb3f58fbfd3a9cf145c9a03aced28ef',
                                                  'ac89ab7ca35d534565a2cd59df2022df',
                                                  'aba5cc03d41d0c7fb0a163abb2f26259',
                                                  'a0593e08e8c4906257ceba43136678df',
                                                  '1e1409f7fce67981ae234eb255c8cd9a']

best_interprets_information_ensembles = ['c3181e37947baceae95c0d6b41877905',
                                            'fc8d1ee4fa2e69fc9c060a397c280636',
                                            '797c91d603349d1171a09712b3ca78b0',
                                            '955d2c37c120b8a138ce66cd6445973d',
                                            'e2df78fe5a4bb40871e4203a323e0d92',
                                            '951b951987f3df5043a21db7f8f874fe',
                                            'c2ef42db0f8102dda7fa92249d083100',
                                            'f1ab9b101618e55bae2532ca9a3b978b',
                                            'f45e1aca90d4358e276bdf6b923a8018',
                                            '802d0b124ef97afee838e3445384faaf']

best_involves_others_ensembles = ['0ecd197fb7b102ff902c172d8b705702',
                                  '584f1c6dd21c414e0f4576be72c5955b',
                                  'cda0dc0aa51c649bbb79a835a0f37407',
                                  'd96d8aa445cd10f6035fb58af5d4e1b9',
                                  'f74220eabeeb7654f4b9dd0d109196ed',
                                  'b231da56a4d6f3346723f1f725368156',
                                  '9aa6ded4699a5ebb54c27b2925937b46',
                                  '3c5387cbbf0be4345884069014c6048a',
                                  '87493c3c2d8a57724664ba2306b12091',
                                  '024529bb7b0cca130ef1ff1faebe94f2']

best_decision_making_final_score_ensembles = ['c3181e37947baceae95c0d6b41877905',
                                            'fc8d1ee4fa2e69fc9c060a397c280636',
                                            '797c91d603349d1171a09712b3ca78b0',
                                            '955d2c37c120b8a138ce66cd6445973d',
                                            'e2df78fe5a4bb40871e4203a323e0d92',
                                            '951b951987f3df5043a21db7f8f874fe',
                                            'c2ef42db0f8102dda7fa92249d083100',
                                            'f1ab9b101618e55bae2532ca9a3b978b',
                                            'f45e1aca90d4358e276bdf6b923a8018',
                                            '802d0b124ef97afee838e3445384faaf']

## Step 4b. Average the Best Ensembles' Predictions

The code below averages the test-set predictions of the best ensembles for one outcome.

1. It starts by initializing an empty list called best_array.
2. Then, for each model in the list corresponding to the outcome of interest, it reads the corresponding CSV file, which contains that model's predictions for the test dataset.
3. The code selects columns 1 to 7 from the CSV file, discarding the first column (responseID). These columns hold the model's predictions for the seven target variables.
4. The selected predictions are stored in a temporary variable called temp and appended to best_array.
5. Finally, the predictions from all models are averaged using the np.mean() function along axis 0, which means the average is taken across models: each entry of the result is the mean of the corresponding entries in every model's prediction array.

The resulting average for the first outcome is stored in the variable ensembled_chooses_appropriate_action_ensembles. The same steps are repeated for each of the remaining outcomes, using that outcome's list of best models.
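
To make the effect of axis 0 concrete, here is a small sketch with three hypothetical prediction arrays standing in for the CSV contents. Each array has 4 rows (test responses) and 7 columns (outcomes), and the values are placeholders.

```python
import numpy as np

# Three hypothetical ensembles, each predicting 7 outcomes for 4 test responses.
preds_a = np.full((4, 7), 1.0)
preds_b = np.full((4, 7), 2.0)
preds_c = np.full((4, 7), 3.0)

best_array = [preds_a, preds_b, preds_c]
averaged = np.mean(best_array, axis=0)  # element-wise mean across the 3 ensembles

print(averaged.shape)  # (4, 7)
print(averaged[0, 0])  # 2.0
```

The list of arrays is treated as a stack of shape (3, 4, 7), so averaging along axis 0 collapses the model dimension and keeps one row per response and one column per outcome.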

In [None]:
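# Collect the test predictions of each top ensemble for this outcome
best_array = []
for modelname in best_chooses_appropriate_action_ensembles:
  # Keep columns 1-7 (the seven outcome predictions) and drop the ID column
  temp = pd.read_csv("ensemble_predictions_test/"+modelname+".csv").values[:,1:8]
  best_array.append(temp)
# Element-wise mean across the selected ensembles
ensembled_chooses_appropriate_action_ensembles = np.mean(best_array,axis=0)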

In [None]:
best_array = []
for modelname in best_commits_to_action_ensembles:
  temp = pd.read_csv("ensemble_predictions_test/"+modelname+".csv").values[:,1:8]
  best_array.append(temp)
ensembled_commits_to_action_ensembles = np.mean(best_array,axis=0)

In [None]:
best_array = []
for modelname in best_gathers_information_ensembles:
  temp = pd.read_csv("ensemble_predictions_test/"+modelname+".csv").values[:,1:8]
  best_array.append(temp)
ensembled_gathers_information_ensembles = np.mean(best_array,axis=0)

In [None]:
best_array = []
for modelname in best_identifies_issues_opportunities_ensembles:
  temp = pd.read_csv("ensemble_predictions_test/"+modelname+".csv").values[:,1:8]
  best_array.append(temp)
ensembled_identifies_issues_opportunities_ensembles= np.mean(best_array,axis=0)

In [None]:
best_array = []
for modelname in best_interprets_information_ensembles:
  temp = pd.read_csv("ensemble_predictions_test/"+modelname+".csv").values[:,1:8]
  best_array.append(temp)
ensembled_interprets_information_ensembles = np.mean(best_array,axis=0)

In [None]:
best_array = []
for modelname in best_involves_others_ensembles:
  temp = pd.read_csv("ensemble_predictions_test/"+modelname+".csv").values[:,1:8]
  best_array.append(temp)
ensembled_involves_others_ensembles = np.mean(best_array,axis=0)

In [None]:
best_array = []
for modelname in best_decision_making_final_score_ensembles:
  temp = pd.read_csv("ensemble_predictions_test/"+modelname+".csv").values[:,1:8]
  best_array.append(temp)
ensembled_decision_making_final_score_ensembles= np.mean(best_array,axis=0)

## Step 4c. Combine Output of Models Together


The following code creates the final prediction dataset. It reads "test_pub.csv" to use as a template. Then, for each outcome, it takes that outcome's column from the averaged ensemble built for that outcome and assigns it to the corresponding column in the dataset.

The assigned columns are related to different aspects such as choosing appropriate action, committing to action, gathering information, identifying issues/opportunities, interpreting information, involving others, and decision-making.

Once the columns are filled in, the updated dataset is saved as a new file called "final_test_predictions.csv".
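
The column selection can also be written as a loop: the j-th averaged array contributes only its j-th column. The sketch below shows this with hypothetical stand-in arrays rather than the real predictions. The helper name `assemble_final` is an assumption for illustration.

```python
import numpy as np

def assemble_final(ensembled_by_outcome):
    """Take column j from the j-th averaged array and stack them side by side."""
    return np.column_stack(
        [averaged[:, j] for j, averaged in enumerate(ensembled_by_outcome)]
    )

# Hypothetical stand-ins: the j-th array is filled with the value j + 1,
# so it is easy to see which array each output column came from.
ensembled_by_outcome = [np.full((4, 7), float(j + 1)) for j in range(7)]
final = assemble_final(ensembled_by_outcome)
```

In the notebook, the list would hold the seven averaged arrays in outcome order, and the result would fill columns 1 through 7 of the test dataframe.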

In [None]:
# Start from the public test file, then fill each outcome column from its own averaged ensemble
df_test = pd.read_csv("test_pub.csv")
df_test.iloc[:,1] = ensembled_chooses_appropriate_action_ensembles[:,0]
df_test.iloc[:,2] = ensembled_commits_to_action_ensembles[:,1]
df_test.iloc[:,3] = ensembled_gathers_information_ensembles[:,2]
df_test.iloc[:,4] = ensembled_identifies_issues_opportunities_ensembles[:,3]
df_test.iloc[:,5] = ensembled_interprets_information_ensembles[:,4]
df_test.iloc[:,6] = ensembled_involves_others_ensembles[:,5]
df_test.iloc[:,7] = ensembled_decision_making_final_score_ensembles[:,6]
df_test.to_csv("final_test_predictions.csv",index=False)