# Overview

## Initial NeuralNet Approach
The initial investigation into a neural-net Ranker used the [multilayer perceptron classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) to create a model that maps a 1200-dimensional question-solution vector to the probability that the particular solution will be linked to the question. With a dataset of the 5000 most recent queries in the vSphere 5.5 domain, this network achieved a PCT@5 score of `0.3752`, correctly including a linked solution within the top 5 ranked solutions for 460 out of 1226 questions in the test set. For comparison, the current production version of Ranker for the same domain has a PCT@5 score of `0.433`, while the baseline has a PCT@5 score of `0.1955`. Two things worth noting are that the PCT@5 score for the initial RankerNet came from a single trial and that it was an unoptimized prototype. So while there certainly isn't enough evidence to conclude that RankerNet is production-worthy, it's exciting that the early results are promising, and further optimization should improve on them.
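
To make the metric concrete, here is a minimal sketch of how a PCT@5-style score could be computed: the fraction of questions for which at least one linked solution appears among the top 5 ranked solutions. The function name `pct_at_k` and the two dictionary structures are assumptions made for illustration, not the Ranker's actual evaluation code.

```python
def pct_at_k(ranked_solutions, linked_solutions, k=5):
    """Fraction of questions with at least one linked solution in the top k.

    ranked_solutions: dict of question id -> list of solution ids, best first
    linked_solutions: dict of question id -> set of solution ids actually linked
    (both structures are hypothetical, chosen for illustration)
    """
    hits = sum(
        1
        for question_id, ranking in ranked_solutions.items()
        if set(ranking[:k]) & linked_solutions[question_id]
    )
    return hits / len(ranked_solutions)


# Toy data: q1 has its linked solution at rank 2 (a hit),
# q2 has it at rank 6 (a miss when k=5).
ranked = {
    "q1": ["s3", "s1", "s7", "s2", "s9", "s4"],
    "q2": ["s5", "s6", "s8", "s2", "s3", "s1"],
}
linked = {"q1": {"s1"}, "q2": {"s1"}}
toy_score = pct_at_k(ranked, linked)

# Sanity check on the figure quoted above: 460 hits out of 1226 questions.
reported_score = round(460 / 1226, 4)
```

On the toy data one of the two questions is a hit, and the arithmetic check reproduces the `0.3752` quoted above.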

## Early Problems
Unfortunately, scikit-learn is not a library optimized for GPU computing. We now have an [Amazon P3](https://aws.amazon.com/ec2/instance-types/p3/) instance with a powerful GPU, and we certainly want to leverage that to speed up model training and prediction.

## Switching to TensorFlow
Instead of scikit-learn, this notebook switches over to [TensorFlow](https://www.tensorflow.org/tutorials/) for the implementation of the neural network. In addition to GPU compatibility, TensorFlow gives us finer control over the neural network, as we can construct it layer by layer.

# Data Collection
The data collection will remain the same as before, so we'll need to get all of the functions and reuse them here. Rather than defining them within this notebook, I've moved them over to their own files so they can be neatly imported.

In [29]:
from helpers.data_helpers import pull_training_data

queries, doc_id2body, question_id2body, skipped = pull_training_data(limit=100)

Pulling 3567 docs.


IntProgress(value=0, bar_style='info', description='Pulling docs:', max=3567)

Pulled 3567 docs
Found 115649 queries. Query limit: 100


IntProgress(value=0, bar_style='info', description='Pulling queries:')

Reached limit
Pulled 100 queries, 120 skipped
Skipped reasons: Counter({'Linked to docs not included in model': 107, 'Not long enough': 12, 'Non-english': 1})


# Word2Vec
Just as with the data collection, the vectorization functions have been moved to an external file.


In [26]:
# Load the autoreload extension so edits to the helper modules are picked up automatically.
%load_ext autoreload
%autoreload 2

In [23]:
from helpers.word2vec import get_model, vectorize_body

In [17]:
W2V_MODEL = get_model()

In [24]:
vectorize_body("Hello there, how are you?", W2V_MODEL).shape

(600,)

# Creating Train/Test Dataframes
This section will utilize the helper functions to create a dataframe of positive/negative question-solution pairs.

In [134]:
from helpers.data_helpers import get_id2vector, prune, split_train_test, create_dataframe

In [30]:
queries, doc_id2body, question_id2body, skipped = pull_training_data(limit=5000)

Pulling 3567 docs.


IntProgress(value=0, bar_style='info', description='Pulling docs:', max=3567)

Pulled 3567 docs
Found 115649 queries. Query limit: 5000


IntProgress(value=0, bar_style='info', description='Pulling queries:', max=5000)

Reached limit
Pulled 5000 queries, 7091 skipped
Skipped reasons: Counter({'Linked to docs not included in model': 6829, 'Not long enough': 255, 'Non-english': 7})


In [41]:
id2vector, skipped_docs, skipped_questions = get_id2vector(doc_id2body, question_id2body, W2V_MODEL)

IntProgress(value=0, bar_style='info', description='Vectorizing:', max=8566)

  out=out, **kwargs)
  ret = ret.dtype.type(ret / rcount)


In [42]:
pruned_queries = prune(queries, skipped_questions)

In [135]:
date_pivot, train, test = split_train_test(pruned_queries, date_split_proportion=0.75)
print("%d queries in train set" % len(train))
print("%d queries in test set" % len(test))
print("Date pivot point:", date_pivot)

train_df = create_dataframe(train, n=200)
test_df = create_dataframe(test, n=200)

3670 queries in train set
1224 queries in test set
Date pivot point: 2018-09-21T00:26:49Z


IntProgress(value=0, bar_style='info', description='Adding queries:', max=3670)

IntProgress(value=0, bar_style='info', description='Adding queries:', max=1224)
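
The split above is made on a date pivot rather than at random, so the test set contains only queries newer than everything in the training set. A minimal sketch of that idea follows; it is not the actual `split_train_test` helper. The `chronological_split` name, the `created` field, and the list-of-dicts layout are all assumptions for illustration.

```python
def chronological_split(queries, train_proportion=0.75):
    """Split queries into older (train) and newer (test) portions.

    `queries` is assumed to be a list of dicts with an ISO 8601 `created`
    timestamp (a hypothetical field name); ISO 8601 strings in the same
    timezone sort correctly as plain strings.
    """
    ordered = sorted(queries, key=lambda q: q["created"])
    pivot_index = int(len(ordered) * train_proportion)
    train, test = ordered[:pivot_index], ordered[pivot_index:]
    # The pivot date is the timestamp of the first (oldest) test query.
    date_pivot = test[0]["created"] if test else None
    return date_pivot, train, test


# Toy data: eight queries created on different days of the same month.
toy_queries = [
    {"id": i, "created": f"2018-09-{day:02d}T00:00:00Z"}
    for i, day in enumerate([21, 3, 15, 28, 9, 1, 12, 25])
]
pivot, train_part, test_part = chronological_split(toy_queries)
```

Splitting by date approximates how the model would be used in practice, where it is trained on past queries and asked to rank solutions for future ones.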

# Vectorization
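In order to feed the question-solution pairs into the network, each pair needs to be converted into a single 1200-dimensional vector: the 600-dimensional question vector concatenated with the 600-dimensional solution vector.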

## Speed/Memory Profiling of Adding Vectors
The initial approach was to iterate across each row in the dataframe, create the concatenated vector, and then add that as a column to the original dataframe. This section investigates the use of `iterrows()` vs. `apply()` for creating the vectors. __Spoiler alert__: I end up changing the approach entirely and saving the vectors as their own dataframe, where each element has its own column. See the end of this section for the speed and memory benefits of that approach.

In [66]:
import numpy as np
import pandas as pd

from ipywidgets import IntProgress
from IPython.display import display

from helpers.data_helpers import concatenate_vectors


def add_vectors_apply(dataframe, id2vector):
    print(f"Adding {len(dataframe)} vectors.")
    dataframe["vector"] = dataframe.apply(lambda row: concatenate_vectors(id2vector[row["question_id"]], id2vector[row["solution_id"]]), axis=1)
    return dataframe  

def add_vectors_iterrows(dataframe, id2vector):
    progress_bar = IntProgress(min=0, max=(len(dataframe) - 1), description='Adding vectors:', bar_style='info')
    display(progress_bar)
    print(f"Adding {len(dataframe)} vectors.")
    vectors = []
    for count, (index, row) in enumerate(dataframe.iterrows()):
        if count % 10000 == 0:
            progress_bar.value = count
        vectors.append(concatenate_vectors(id2vector[row["question_id"]], id2vector[row["solution_id"]]))
            
            
    progress_bar.value = len(dataframe) - 1
    dataframe["vector"] = pd.Series(vectors, index=dataframe.index)
    return dataframe   

In [49]:
%%timeit
add_vectors_apply(test_df, id2vector)

Adding 245132 vectors.
48.9 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [54]:
%%timeit
add_vectors_iterrows(test_df, id2vector)

IntProgress(value=0, bar_style='info', description='Adding vectors:', max=245131)

Adding 245132 vectors.


48.7 s ± 356 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


I would have expected the `apply()` approach to be at least slightly faster than `iterrows()`. Either way, I'll try a new approach where instead of creating a function that takes a single question vector and a single solution vector, the function will take in a list of question vectors and a list of solution vectors. This bypasses the loop across all of the rows of the dataframe, as instead we can just pass the entire columns of the `question_id`s and `solution_id`s.

In [103]:
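from operator import itemgetter

def add_vectors(dataframe, id2vector):
    # Look up all question vectors and all solution vectors in one call each,
    # then join them side by side into an (n_rows, 1200) array.
    return pd.Series(np.concatenate([itemgetter(*dataframe["question_id"].tolist())(id2vector),
                                     itemgetter(*dataframe["solution_id"].tolist())(id2vector)],
                                    axis=1).tolist(), 
                     index=dataframe.index)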

In [90]:
%%timeit
add_vectors(test_df, id2vector)

13.7 s ± 81.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


We have a winner! Looping over all of the rows is quite expensive, so writing the function in a way that can take in an array of all of the values, rather than a single row's value, provides a _considerable_ boost in performance. However, we won't be using any of those preceding functions. If the vector is stored in a Dataframe as a single value in a column, it has a type of `object`, whereas if we create a new dataframe for the vectors where each element has its own column, then they can be stored more efficiently as `float32`s. 

To get a concrete example, I'll create a DataFrame where each vector is stored in a single column as an array.

In [138]:
exp_df = pd.DataFrame(add_vectors(test_df, id2vector),
                      index=np.arange(len(test_df)),
                      columns=["vector"])

Let's take a look at how much memory it requires:

In [140]:
exp_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 245132 entries, 0 to 245131
Data columns (total 1 columns):
vector    245132 non-null object
dtypes: object(1)
memory usage: 2.2 GB


As previously mentioned, when the vectors are left as arrays, they're stored as `object`s, and this dataframe in particular takes up 2.2 GB of memory.

## Creating the Vector Dataframe
Now let's do the alternative approach of creating a new dataframe for the vectors where each element has its own column.

In [113]:
def get_vector_dataframe(dataframe, id2vector):
    return pd.DataFrame(np.concatenate([itemgetter(*dataframe["question_id"].tolist())(id2vector),
                                         itemgetter(*dataframe["solution_id"].tolist())(id2vector)],
                                        axis=1),
                         index=np.arange(len(dataframe)),
                         columns=np.arange(1200))

In [120]:
test_vector_df = get_vector_dataframe(test_df, id2vector)

In [121]:
test_vector_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 245132 entries, 0 to 245131
Columns: 1200 entries, 0 to 1199
dtypes: float32(1200)
memory usage: 1.1 GB


By removing the overhead of saving them as `object`s and explicitly storing them as `float32`s, the size of the vectors decreased by 50%. Not bad! Additionally, it took 1.84 seconds to create the dataframe, while the `iterrows()` and `apply()` approaches took ~50 seconds. We can still easily retrieve the vector as an array with the `.values` attribute.

In [143]:
test_vector_df.iloc[0].values

array([0.00908015, 0.00347431, 0.02504319, ..., 0.3046875 , 0.40820312,
       0.578125  ], dtype=float32)
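
As a sanity check on the 1.1 GB figure reported for the `float32` dataframe, the expected size can be worked out by hand. This is a back-of-the-envelope sketch that assumes the reported figure uses binary (1024-based) units and ignores the small overhead of the index.

```python
n_rows = 245132        # question-solution pairs in the test dataframe
n_cols = 1200          # 600 question dimensions + 600 solution dimensions
bytes_per_float32 = 4  # a float32 occupies 4 bytes

total_bytes = n_rows * n_cols * bytes_per_float32
total_gib = total_bytes / 1024 ** 3  # convert bytes to binary gigabytes
```

The result rounds to 1.1, which suggests the dense `float32` layout adds essentially nothing on top of the raw values.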

## Creating the Train/Test Vector Dataframes
To conclude this section, I'll vectorize the training and testing dataframes, creating a `test_vector_df` and `train_vector_df`.

In [144]:
test_vector_df = get_vector_dataframe(test_df, id2vector)

In [145]:
train_vector_df = get_vector_dataframe(train_df, id2vector)

# TenseRankerFlowNet
With the data loaded in and vectorized, it's time to leverage Tensorflow!

In [None]:
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(1200, activation=tf.nn.relu),
    keras.layers.Dense(300, activation=tf.nn.relu),
    keras.layers.Dense(150, activation=tf.nn.relu),
    keras.layers.Dense(50, activation=tf.nn.relu),
    # Single sigmoid unit: probability that the solution is linked to the question.
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

# tf.train.AdamOptimizer is the TensorFlow 1.x API; on TensorFlow 2.x, pass optimizer='adam' instead.
model.compile(optimizer=tf.train.AdamOptimizer(), 
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
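# Assumes X_train (e.g. train_vector_df.values) and y_train (the 0/1 link labels) were built beforehand.
model.fit(X_train, y_train, epochs=5)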

In [None]:
predictions = model.predict(X_test)
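
To turn the flat array of predicted probabilities into a ranking, the scores can be attached back to the question-solution pairs and the top solutions kept per question. The sketch below is one possible way to do this with pandas, not the notebook's actual evaluation step. It assumes the predictions are in the same row order as the pairs dataframe, and the `top_k_per_question` helper and `score` column are hypothetical names.

```python
import numpy as np
import pandas as pd


def top_k_per_question(pairs_df, scores, k=5):
    """Return the k highest-scoring solutions for each question.

    pairs_df: dataframe with `question_id` and `solution_id` columns
    scores:   array of predicted probabilities in the same row order as
              pairs_df; a (n, 1) array, as produced by a single-unit output
              layer, is flattened first
    """
    scored = pairs_df.assign(score=np.asarray(scores).ravel())
    ranked = scored.sort_values(["question_id", "score"], ascending=[True, False])
    # head(k) keeps the first k rows of each group, preserving the sorted order.
    return ranked.groupby("question_id").head(k)


# Toy data: two questions, three candidate solutions each.
toy_pairs = pd.DataFrame({
    "question_id": ["q1"] * 3 + ["q2"] * 3,
    "solution_id": ["s1", "s2", "s3", "s1", "s2", "s3"],
})
toy_scores = np.array([[0.1], [0.9], [0.5], [0.7], [0.2], [0.4]])
top2 = top_k_per_question(toy_pairs, toy_scores, k=2)
```

With `k=5`, the resulting per-question lists are what a PCT@5 calculation would check against the linked solutions.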