# Spotting Patterns in Founding Team Formations Using an LSTM Neural Network

The process through which entrepreneurial founding teams 'reach the starting line' has become a subject of academic study and interest in recent years. Building upon existing academic literature on venture team formation, Professors David Clough and Balagopal Vissa developed a theoretical model which provides novel insights on how a set of cofounders come together to form the initial founding team [1].

In short, the work developed by the authors is grounded on the following pillars:

1) Disaggregating the cofounder searching and evaluation process into separate entities to analyze in more depth how and when task competence and interpersonal fit affect the founding team formation process.

2) Conceiving founding team formation as a dynamic choice process that takes a middle stance between the more recent "agentic view" (i.e. entrepreneurs acting strategically as "players" within their networks) and the more traditional structuralist approach (i.e. entrepreneurs as passive occupants of social structural positions).

3) Using a process-focused perspective to depict different founding team formation pathways, distilling the factors that drive both successful attempts in founding team formation and the subsequent churn when cofounder entry/exit dynamics emerge in the founding team.

To construct an experiment on which the model's key assumptions and propositions could be tested, we decided to develop an LSTM Neural Network that uses a Natural Language Processing (NLP) model to process testimonies shared by entrepreneurs about their experiences of founding team formation. Several paragraphs were scraped from open forums such as Quora and entrepreneurial websites such as Y Combinator, in order to build a training set that resembles how entrepreneurs express themselves in real-life settings when asked how they met their co-founders.

Here's an example of an answer provided by a Quora user to the question "How did you meet your co-founder?":

<img src="images/quora.png">

To make the text "machine readable", the testimonies were broken down into individual sentences and then turned into vectors of word indices by a Python function. Once the sentences are in a form the LSTM Neural Network can take as input, a layer of pre-trained word vectors is loaded into the model using the 50-dimensional GloVe vectors developed by the Stanford NLP group (https://nlp.stanford.edu/projects/glove/).
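
As a rough illustration of this preprocessing step, the sketch below splits a testimony into sentences and maps each sentence to a zero-padded vector of word indices. The regular-expression splitter, the helper names `split_into_sentences` and `to_index_vector`, and the toy vocabulary are assumptions made for illustration; the notebook itself relies on the GloVe vocabulary and the `sentences_to_indices` function defined further down.

```python
import re

import numpy as np


def split_into_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def to_index_vector(sentence, word_to_index, max_len):
    """Map a sentence to a zero-padded vector of word indices."""
    words = re.findall(r"[a-z0-9'-]+", sentence.lower())
    vector = np.zeros(max_len, dtype=int)
    for j, word in enumerate(words[:max_len]):
        vector[j] = word_to_index.get(word, 0)  # 0 doubles as padding/unknown here
    return vector


# Toy vocabulary (hypothetical); index 0 is reserved for padding.
toy_vocab = {"we": 1, "met": 2, "at": 3, "university": 4, "trust": 5, "was": 6, "key": 7}

testimony = "We met at university. Trust was key!"
sentences = split_into_sentences(testimony)
vectors = np.stack([to_index_vector(s, toy_vocab, max_len=6) for s in sentences])
print(sentences)
print(vectors)
```

Each row of `vectors` corresponds to one sentence, with trailing zeros filling the unused positions.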

Finally, a 2-layer LSTM Neural Network followed by a fully connected layer with a Softmax activation function is configured using the Keras framework to process the inputs and classify the sentences according to the class definitions below:

<b>Class [0]:</b> The entrepreneur gives (or has given) more weight to elements like chemistry, trust and personal bonding when evaluating a prospective cofounder.

<b>Class [1]:</b> The entrepreneur gives (or has given) more weight to elements like technical competences, past experience and execution capabilities when evaluating a prospective cofounder.

<b>Class [2]:</b> The entrepreneur looks (or has looked) for connections with whom she has strong ties when looking out for a potential cofounder (Serendipitous FOCIs).

<b>Class [3]:</b> The entrepreneur actively looks (or has looked) within her professional networks and/or engages in a dedicated search when looking out for a potential cofounder (Goal-directed FOCIs).

These classes were developed taking into account propositions #1, #2, #3 and #4 as described in the research paper, narrowing down our focus to the process followed by founders when they "search for and mutually evaluate" one another. Each sentence in the training set was manually labelled with the class that fits it best. Given that some sentences could be assigned to two different classes depending on the annotator's subjective judgment (e.g. the sentence "Evaluate whether you are able to work well together" may be classified as both [0] and [1]), we estimate a human error level of ≈ 10%.

So far, the model correctly classifies ≈ 97% of the examples in the training set (n = 938) and ≈ 77% of the examples in the test set (n = 44, data the model has never "seen"); the notebook run reproduced below scored ≈ 73% on the same test set. The gap between training and test accuracy indicates overfitting that still needs to be reduced.
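
With only 44 test sentences, the test accuracy is itself a noisy estimate. The following sketch, which is not part of the original notebook, computes a 95% Wilson score interval for a proportion. The count of 34 correct predictions is an assumption obtained by rounding 77% of 44.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """95% Wilson score interval for a binomial proportion (z = 1.96)."""
    p_hat = correct / total
    denom = 1 + z ** 2 / total
    centre = (p_hat + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


# Assumed count: 34 of 44 is the closest whole number to 77% of the test set.
low, high = wilson_interval(correct=34, total=44)
print(f"test accuracy = {34 / 44:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval runs from roughly 0.63 to 0.87, which suggests that a larger test set would be needed before drawing firm conclusions from small changes in test accuracy.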

This model has the potential to provide a reliable framework on which large-scale experiments can be conducted to test the principal hypotheses and propositions of the conceptual framework. Concretely, the model could help advance the theoretical research developed by Prof. Clough and Prof. Vissa by using an NLP model to systematically web-scrape founder testimonies, classify them, spot patterns among them and evaluate them against the stated propositions.

<b>References:</b>

[1] Clough, David and Vissa, Balagopal, How Do Founding Teams Form? Towards a Behavioral Theory of Founding Team Formation (July 2, 2018). INSEAD Working Paper No. 2018/26/EFE. Available at SSRN: https://ssrn.com/abstract=3206701 or http://dx.doi.org/10.2139/ssrn.3206701

# Structure of the LSTM Neural Network

In [16]:
# Loading key Python libraries and Deep Learning Frameworks (Keras):

import numpy as np
from keras import backend as K
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation, Embedding
from keras.preprocessing import sequence
from keras.initializers import glorot_uniform
from utils_v1 import *

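# Loading the data files: glove.6B.50d.txt holds the 50-dimensional GloVe vectors
# (400,000-word uncased vocabulary trained on Wikipedia 2014 + Gigaword 5); the CSV files hold the labelled sentences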

word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

X, Y = read_csv('data/sentences_data_v5.csv')
X_train, Y_train = read_csv('data/sentences_data_train_v5.csv')
X_test, Y_test = read_csv('data/sentences_data_test_v5.csv')

In [19]:
# Data cleansing process - eliminating characters that may cause a KeyError message below:

X[0] = X[0].replace('\ufeff', '')
X_train[0] = X_train[0].replace('\ufeff', '')
X_test[0] = X_test[0].replace('\ufeff', '')

# Finding the maximum sentence length in the dataset (counted in words) and storing it in a variable called maxLen

maxLen = max(len(sentence.split()) for sentence in X)
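
The padding length has to be measured in words rather than characters, and the two can disagree: the sentence with the most characters is not necessarily the one with the most words. A small illustration with made-up sentences:

```python
sentences = [
    "entrepreneurship notwithstanding",   # 2 words, 32 characters
    "we met at a bar in town",            # 7 words, 23 characters
]

# Picking the longest string first and then counting its words...
longest_by_chars = max(sentences, key=len)
max_len_from_chars = len(longest_by_chars.split())

# ...versus counting the words of every sentence and taking the maximum.
max_len_from_words = max(len(s.split()) for s in sentences)

print(max_len_from_chars)
print(max_len_from_words)
```

Only the second value is safe to use as the padding length, since a sentence with more words than `max_len` would not fit in the index matrix.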

In [20]:
def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.
    The output shape should be such that it can be given to `Embedding()`
    
    Arguments:
    X -- array of sentences (strings), of shape (m,)
    word_to_index -- a dictionary containing all words mapped to their corresponding indices
    max_len -- maximum number of words in a sentence. Gets plugged as maxLen into the function.
    
    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """
    
    m = X.shape[0]                                   # number of training examples
    # Initializes X_indices as a numpy matrix of zeros and the correct shape
    
    X_indices = np.zeros((m, max_len))
    
    for i in range(m):                               # Creates a loop over training examples
        # Converts the ith training sentence to lower case and splits it into words, producing a list of words.
        
        sentence_words = [word.lower().replace('\t', '') for word in X[i].split(' ') if word.replace('\t', '') != '']
        # Initializes j to 0
        
        j = 0
        # Loop over the words of sentence_words
        
        for w in sentence_words:
            
            # Set the (i,j)th entry of X_indices to the index of the correct word.
            X_indices[i, j] = word_to_index[w]
            
            # Increment j to j + 1
            j = j + 1
    
    return X_indices
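
The function above raises a `KeyError` for any word missing from the GloVe vocabulary and an `IndexError` when a sentence has more than `max_len` words. A more defensive variant might look like the sketch below. The name `sentences_to_indices_safe`, the `unk_index` parameter and the toy vocabulary are assumptions for illustration and are not part of `utils_v1`.

```python
import numpy as np


def sentences_to_indices_safe(X, word_to_index, max_len, unk_index=0):
    """Like sentences_to_indices, but tolerant of unknown words and long sentences."""
    X_indices = np.zeros((len(X), max_len))
    for i, sentence in enumerate(X):
        words = sentence.lower().split()          # splits on any whitespace, tabs included
        for j, w in enumerate(words[:max_len]):   # truncate instead of overflowing
            X_indices[i, j] = word_to_index.get(w, unk_index)
    return X_indices


toy_word_to_index = {"trust": 1, "is": 2, "key": 3}   # hypothetical vocabulary
demo = sentences_to_indices_safe(
    np.array(["Trust is key", "Trust is absolutely key for us"]),
    toy_word_to_index,
    max_len=4,
)
print(demo)
```

Mapping unknown words to index 0 treats them like padding, which is a simplification; a dedicated unknown-word index would be another option.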

In [21]:
indices = sentences_to_indices(X, word_to_index, 18)
print(indices)

[[185457. 170714. 209675. ...      0.      0.      0.]
 [368321. 268046. 254258. ...      0.      0.      0.]
 [ 56041. 103640.  54718. ...      0.      0.      0.]
 ...
 [366254. 239792. 335202. ...      0.      0.      0.]
 [224573. 259594. 141855. ...      0.      0.      0.]
 [357640.  58997. 233708. ...      0.      0.      0.]]


In [22]:
def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.
    
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    
    vocab_len = len(word_to_index) + 1                    # adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map["co-founder"].shape[0]      # defines dimensionality of GloVe word vectors (= 50)
    
    # Initializes the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))
    
    # Sets each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Defines the Keras embedding layer with the correct output/input sizes
    embedding_layer = Embedding(vocab_len, emb_dim, trainable = True)

    # Builds the embedding layer, which is required before setting the weights of the embedding layer.
    embedding_layer.build((None,))
    
    # Sets the weights of the embedding layer to the embedding matrix. The layer should now be pretrained.
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer
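
To see what the embedding layer does once its weights are set, the NumPy sketch below builds the same kind of matrix from a toy 3-dimensional word-vector map (an assumption; the notebook uses 50-dimensional GloVe vectors) and looks up a padded index sequence by plain row indexing.

```python
import numpy as np

# Toy stand-ins for word_to_index / word_to_vec_map (hypothetical values).
toy_word_to_index = {"cofounder": 1, "trust": 2}
toy_word_to_vec_map = {
    "cofounder": np.array([0.1, 0.2, 0.3]),
    "trust": np.array([0.4, 0.5, 0.6]),
}

vocab_len = len(toy_word_to_index) + 1            # row 0 is left as zeros for padding
emb_dim = toy_word_to_vec_map["trust"].shape[0]

emb_matrix = np.zeros((vocab_len, emb_dim))
for word, index in toy_word_to_index.items():
    emb_matrix[index, :] = toy_word_to_vec_map[word]

# An embedding lookup is row selection: (max_len,) indices -> (max_len, emb_dim) vectors.
sentence_indices = np.array([2, 1, 0, 0])         # "trust cofounder" + two padding slots
embedded = emb_matrix[sentence_indices]
print(embedded.shape)
print(embedded)
```

The Keras `Embedding` layer performs this lookup for a whole batch at once, producing the `(None, 18, 50)` output shape reported in the model summary.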

In [47]:
def NLP_model_v1(input_shape, word_to_vec_map, word_to_index):
    
    """
    Function creating the NLP model's graph.
    
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    model -- a model instance in Keras
    """
    
    # Defines sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices)
    sentence_indices = Input(shape = input_shape, dtype = 'int32')
    
    # Creates the embedding layer pretrained with GloVe Vectors
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    # Propagates sentence_indices through the embedding layer, getting back the embeddings
    embeddings = embedding_layer(sentence_indices)  
    
    # Propagates the embeddings through an LSTM layer with a 60-dimensional hidden state, setting return_sequences = True to get back a batch of sequences
    X = LSTM(60, return_sequences = True)(embeddings)
    
    # Adds dropout with a probability of 0.5 to regularize the network
    X = Dropout(0.5)(X)
    
    # Propagates X through another LSTM layer with a 60-dimensional hidden state, this time returning only the final hidden state rather than a batch of sequences
    X = LSTM(60, return_sequences = False)(X)
    
    # Adds dropout with a probability of 0.5 to regularize the network
    X = Dropout(0.5)(X)
    
    # Propagates X through a Dense layer with 4 units to get back a batch of 4-dimensional vectors
    X = Dense(4, activation = None)(X)
    
    # Adds a softmax activation to estimate the probability of each class
    X = Activation('softmax')(X)
    
    # Creates the Model instance which converts sentence_indices into X.
    model_v1 = Model(sentence_indices, X)
    
    return model_v1

In [48]:
model_v1 = NLP_model_v1((18,), word_to_vec_map, word_to_index)
model_v1.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         (None, 18)                0         
_________________________________________________________________
embedding_7 (Embedding)      (None, 18, 50)            20000050  
_________________________________________________________________
lstm_13 (LSTM)               (None, 18, 60)            26640     
_________________________________________________________________
dropout_13 (Dropout)         (None, 18, 60)            0         
_________________________________________________________________
lstm_14 (LSTM)               (None, 60)                29040     
_________________________________________________________________
dropout_14 (Dropout)         (None, 60)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 4)                 244       
__________
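
The parameter counts in the summary can be checked by hand. The sketch below uses the standard LSTM parameter count, 4 × (input_dim × units + units × units + units), with one set of weights per gate. The helper names are illustrative.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    """Weight matrix plus bias vector."""
    return input_dim * units + units


embedding = 400_001 * 50          # vocab_len x emb_dim
lstm_1 = lstm_params(50, 60)      # GloVe vectors in, 60 hidden units
lstm_2 = lstm_params(60, 60)      # first LSTM's outputs in, 60 hidden units
dense = dense_params(60, 4)       # 60 hidden units in, 4 classes out

print(embedding, lstm_1, lstm_2, dense)   # 20000050 26640 29040 244
```

Almost all of the parameters sit in the embedding matrix, which is worth keeping in mind given that the training set has fewer than a thousand sentences.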

In [49]:
# Compiles the model using loss = 'categorical_crossentropy', as we are dealing with a multi-class classification problem. Sets 'adam' as the optimizer.
model_v1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [50]:
# Returns X_indices from the sentences_to_indices function and saves it in X_train_indices
X_train_indices = sentences_to_indices(X_train, word_to_index, 18)

# Applies one-hot encoding to the training labels Y_train and stores the result in Y_train_oh. We use C = 4 because the model outputs a probability distribution over 4 classes.
Y_train_oh = convert_to_one_hot(Y_train, C = 4)
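
`convert_to_one_hot` comes from `utils_v1`, whose source is not shown here. A common way to write such a helper is to index an identity matrix with the labels; the sketch below is an assumption about its behavior, not the actual implementation.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Return an (m, num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int).reshape(-1)
    return np.eye(num_classes)[labels]


Y_demo = np.array([0, 3, 1, 2])
Y_demo_oh = one_hot(Y_demo, num_classes=4)
print(Y_demo_oh)
```

Each row has the same length as the softmax output, which is the format `categorical_crossentropy` expects for its targets.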

In [51]:
# Fits the Neural Network model using Keras
model_v1.fit(X_train_indices, Y_train_oh, epochs = 25, batch_size = 32, shuffle = True)

Epoch 1/25
...
Epoch 25/25


<keras.callbacks.History at 0x1a2f3a6da0>

In [52]:
# Evaluates the model accuracy on a test set with 44 examples

X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = 18)
Y_test_oh = convert_to_one_hot(Y_test, C = 4)
loss_v1, acc_v1 = model_v1.evaluate(X_test_indices, Y_test_oh)
print()
print("Test_v1 accuracy = ", acc_v1)


Test_v1 accuracy =  0.7272727326913313
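
A single accuracy figure hides which classes are being confused with one another. A per-class breakdown could be computed along the following lines. The label arrays here are made up for illustration; in the notebook they would be `Y_test` and the argmax of `model_v1.predict(X_test_indices)`.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix


# Made-up labels, only to show the mechanics.
y_true = np.array([0, 0, 1, 1, 2, 3, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 3, 2, 3])

cm = confusion_matrix(y_true, y_pred, num_classes=4)
accuracy = np.trace(cm) / cm.sum()
per_class_recall = np.diag(cm) / cm.sum(axis=1)
print(cm)
print(accuracy, per_class_recall)
```

Off-diagonal cells between classes [0] and [1], for instance, would show how often interpersonal-fit sentences are mistaken for task-competence ones.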


In [53]:
# Tests the model with sentences that were not included in the Training and Test sets.

x_test = np.array(['Chemistry for me is a key attribute'])
X_test_indices = sentences_to_indices(x_test, word_to_index, 18)
test_v1a = np.argmax(model_v1.predict(X_test_indices), axis = 1)
test_v1b = model_v1.predict(X_test_indices)
print(test_v1a)
print(test_v1b)

[0]
[[9.9787736e-01 1.5828023e-03 4.3174106e-04 1.0811740e-04]]


In [54]:
x_test = np.array(['I joined an accelerator program looking for a cofounder'])
X_test_indices = sentences_to_indices(x_test, word_to_index, 18)
test_v1a = np.argmax(model_v1.predict(X_test_indices), axis = 1)
test_v1b = model_v1.predict(X_test_indices)
print(test_v1a)
print(test_v1b)

[3]
[[9.9803416e-05 1.0860464e-04 8.0783782e-04 9.9898368e-01]]


In [55]:
x_test = np.array(['I think she will make a good work'])
X_test_indices = sentences_to_indices(x_test, word_to_index, 18)
test_v1a = np.argmax(model_v1.predict(X_test_indices), axis = 1)
test_v1b = model_v1.predict(X_test_indices)
print(test_v1a)
print(test_v1b)

[0]
[[9.9689329e-01 2.2917457e-03 6.5543782e-04 1.5956673e-04]]


In [56]:
x_test = np.array(['My cofounder should be able to execute the companys vision'])
X_test_indices = sentences_to_indices(x_test, word_to_index, 18)
test_v1a = np.argmax(model_v1.predict(X_test_indices), axis = 1)
test_v1b = model_v1.predict(X_test_indices)
print(test_v1a)
print(test_v1b)

[1]
[[9.4671600e-04 9.9900025e-01 8.9602026e-06 4.4115735e-05]]


In [57]:
x_test = np.array(['My spouse and I founded a startup together'])
X_test_indices = sentences_to_indices(x_test, word_to_index, 18)
test_v1a = np.argmax(model_v1.predict(X_test_indices), axis = 1)
test_v1b = model_v1.predict(X_test_indices)
print(test_v1a)
print(test_v1b)

[2]
[[7.4918424e-03 1.1428070e-03 9.9079996e-01 5.6543708e-04]]
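
The five cells above repeat the same steps and print a bare class index. For readability, the index can be turned into a class name with a small lookup. The short labels in `CLASS_NAMES` are paraphrases of the class definitions at the top of this document, and the probability vector is copied from the first example's output.

```python
import numpy as np

# Short labels paraphrasing the four class definitions (wording is illustrative).
CLASS_NAMES = {
    0: "interpersonal fit",
    1: "task competence",
    2: "strong ties (serendipitous)",
    3: "dedicated search (goal-directed)",
}


def describe_prediction(probabilities):
    """Return (class index, class name, probability) for the most likely class."""
    probabilities = np.asarray(probabilities).reshape(-1)
    best = int(np.argmax(probabilities))
    return best, CLASS_NAMES[best], float(probabilities[best])


# Softmax output printed for 'Chemistry for me is a key attribute'.
probs = [[9.9787736e-01, 1.5828023e-03, 4.3174106e-04, 1.0811740e-04]]
print(describe_prediction(probs))
```

In the notebook, `probs` would be the result of a single `model_v1.predict(X_test_indices)` call, which also avoids running the prediction twice per sentence.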
