<h3> Imports </h3> 

In [None]:
import pandas as pd
from datetime import datetime, timedelta
import matplotlib.pyplot as plt 
import numpy as np 
from math import log
import json 
import time
from random import randint

# keras imports for neural network
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout
from keras import optimizers

<h2> Data Preparation for training and testing </h2>

I see two datasets available in the task: <br>
One seems to be from 2016, without currency conversion to USD. <br> 
The other is from 2018, with some newly added records.

In [None]:
kick_start_2016 = pd.read_csv('../input/ks-projects-201612.csv', encoding = 'ISO-8859-1')
kick_start_2016.head()

In [None]:
kick_start_2018 = pd.read_csv('../input/ks-projects-201801.csv')
kick_start_2018.head()

In [None]:
print(len(kick_start_2018) - len(kick_start_2016))

<h3> So, there are 54911 more records in 2018 data than 2016 data </h3> 

These records may be ideal for testing. <br> 
I will now extract only those records from the 2018 data that also exist in the 2016 data, using the ID field.<br>
But first, the column labels of the 2016 data need to be renamed (removing trailing spaces etc.) to bring them to the same format as the 2018 data.

In [None]:
kick_start_2016.columns

In [None]:
kick_start_2018.columns

In [None]:
# renaming the 2016 data columns
kick_start_2016.columns = ['ID', 'name', 'category', 'main_category', 'currency', 'deadline',
       'goal', 'launched', 'pledged', 'state', 'backers', 'country',
       'usd_pledged', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15',
       'Unnamed: 16']
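
As an alternative to retyping every label, the stray whitespace could also be stripped programmatically. This is only a minimal sketch on a toy frame: the `raw` DataFrame and its column labels are made up for illustration, and the real file may need more cleanup than this.

```python
import pandas as pd

# toy frame imitating column labels with trailing spaces (hypothetical data)
raw = pd.DataFrame({'ID ': [1, 2], 'name ': ['a', 'b'], 'usd pledged ': [10.0, 0.0]})

# strip surrounding whitespace, then replace inner spaces with underscores
raw.columns = raw.columns.str.strip().str.replace(' ', '_', regex=False)

print(list(raw.columns))
```

Hardcoding the list, as done above, has the advantage of also naming the unnamed trailing columns explicitly.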

<h3> Extracting what is common to both datasets, from the 2018 data. <br>
This will be used to train the decision tree and the neural network. <br>
I describe how to combine these two models for both training and testing in the overall strategy further below.</h3>

In [None]:
kick_start_common = kick_start_2018.loc[kick_start_2018.ID.isin(kick_start_2016.ID)].reset_index(drop=True)
len(kick_start_common)

In [None]:
kick_start_common.head()

<h3> Extracting records unique to 2018 data </h3>
<h3> I will use these as my test dataset towards the end to get the final accuracy of the combined model. </h3>

In [None]:
# `~` negates the boolean mask returned by isin()
kick_start_unique = kick_start_2018.loc[~kick_start_2018.ID.isin(kick_start_2016.ID)].reset_index(drop=True)
len(kick_start_unique)

In [None]:
kick_start_unique.head()
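
The common/unique split boils down to one boolean mask and its inverse. Here is a small sketch of that idea; the frames `old` and `new` are hypothetical stand-ins for the two snapshots, not the real data.

```python
import pandas as pd

# hypothetical miniature versions of the two snapshots
old = pd.DataFrame({'ID': [1, 2, 3]})
new = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'state': ['failed', 'live', 'successful', 'failed', 'canceled']})

in_old = new['ID'].isin(old['ID'])            # boolean mask, one entry per row of `new`
common = new[in_old].reset_index(drop=True)   # rows also present in the old snapshot
unique = new[~in_old].reset_index(drop=True)  # inverted mask: rows only in `new`

print(len(common), len(unique))
```

Because the two masks are complements, the two parts always add up to the full frame, which is a cheap sanity check on the split.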

<h2> Let's focus on the training dataset for now and visualize some of its aspects. </h2>

In [None]:
kick_start = kick_start_common

In [None]:
kick_start["state"].unique()

<b> Checking the distribution of the target field </b> 

In [None]:
kick_start['state'].value_counts().plot.pie()

<h3> Analyses, assumptions and visualizations </h3> 

<b>My first assumption is that the "state" will correlate well with the number of days between "launched" and "deadline".</b> <br>
So, I convert those dates into a "No. of days" format.

In [None]:
x = pd.to_datetime(kick_start["deadline"])
y = pd.to_datetime(kick_start["launched"])
z = x - y  

print (z.min())
print (z.mean())
print (z.max())

In [None]:
num_days = z.apply(lambda x: str(x.days))
num_days.value_counts().plot.pie()

<b> Most projects seem to have a deadline of 29 days (~1 month), followed by 59 days (~2 months) etc. </b> 
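
One plausible reason the peaks sit at 29 and 59 rather than 30 and 60 is that `.days` keeps only the whole-day part of a time difference, so a campaign launched at midday loses its partial day. The sketch below illustrates this with two made-up projects; the `toy` frame is an assumption for illustration only.

```python
import pandas as pd

# two made-up projects: one launched at midday, one at midnight
toy = pd.DataFrame({
    'launched': ['2016-01-01 12:00:00', '2016-03-01 00:00:00'],
    'deadline': ['2016-01-31', '2016-04-30'],
})

duration = pd.to_datetime(toy['deadline']) - pd.to_datetime(toy['launched'])

# .dt.days keeps only the whole-day part of each Timedelta
toy['num_days'] = duration.dt.days
print(toy['num_days'].tolist())
```

The first project spans 29 days and 12 hours and is therefore counted as 29 days.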

<h3> Overall strategy for training</h3>

I will select certain fields (mostly categorical) to feed to a Decision Tree model to reduce entropy in the data. This will also include the "No. of days to deadline" computed above. 

If the decision tree returns some good results and is able to reduce entropy by looking at just the categorical data (<b>without the $amounts, backers etc. </b>), I will output the probability distribution of states as the "partial" predictions from this tree. 

Next, I will concatenate these partial predictions with the rest of the columns ($amounts, backers etc.) to create a "feature vector". <br>
This feature vector would then go as inputs to a neural network for further training. <br>

So, even while testing, predictions will be broken down into two parts: first receiving the partial predictions from the decision tree, and then getting the final predictions from the neural network.

<b> The intuition is to use the Decision Tree to handle entropy in the categorical data and reduce the non-linearity / dimensions for the neural network to achieve better learning. </b>

Since I need the probability distributions from the decision tree, I am implementing a decision tree based on the ID3 algorithm for this purpose. 

<h3> Fields selected for decision tree : </h3> 

<ul> 
    <li> category </li>
    <li> main_category </li>
    <li> currency </li>
    <li> country </li> 
    <li> Number of days to deadline </li> 
</ul> 

<b> The "id" and "name" fields will be omitted as I don't think there is much correlation between these fields and the "state" </b> 
    

In [None]:
# .copy() so that adding a column does not write into a slice of kick_start
dt_kstart = kick_start[['category', 'main_category', 'currency', 'country', 'state']].copy()
dt_kstart['num_days'] = num_days

In [None]:
dt_kstart.state.value_counts()

In [None]:
# this function calculates the probability distribution of the unique items for a specified column in a dataframe
def get_probabilities(df, column):
    freqs = df[column].value_counts()
    summation = sum(freqs)
    probabilities = freqs / summation
    return (probabilities) 

# function to return log base 2
def log2(x):
    return log(x)/log(2)

# to return the entropy (in bits) given the probability distribution    
def get_entropy(probabilities):
    return sum( probabilities * probabilities.apply(log2)) * -1

In [None]:
# Initial entropy of the 'state' column in the training data
get_entropy(get_probabilities(dt_kstart, 'state'))
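
To get a feel for the scale of this number, here is a small stand-alone sketch of Shannon entropy in bits on a few hand-picked distributions. The helper `entropy_bits` is a hypothetical name used only in this example; it is not the function used in the rest of the notebook.

```python
import numpy as np

def entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy_bits([0.5, 0.5]))   # two equally likely states
print(entropy_bits([1.0]))        # a pure node: only one state
print(entropy_bits([1/6] * 6))    # six equally likely states
```

With six possible states the entropy can never exceed log2(6), roughly 2.58 bits, and a pure node has an entropy of 0.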

In [None]:
dt_kstart.head()

In [None]:
# splitting a data-frame, on an index/column and value 
# returns the new dataframe after the split
def split_data(df, column, value):
    return df[df[column] == value]

# to get the best feature based on information gain 
def get_best_feature(df, target):
    initial_entropy = get_entropy(get_probabilities(df, target))
    best_gain = 0.0
    best_feature = None
    feature_list = list(df.columns)
    feature_list.remove(target)
    for feature in feature_list:
        uniques = df[feature].unique()
        new_entropy = 0 
        for value in uniques:
            subset =  split_data (df, feature, value) 
            probability = len(subset) / len(df)
            new_entropy += probability * get_entropy(get_probabilities(subset, target))
        info_gain = initial_entropy - new_entropy
        # print (info_gain, feature)
        if info_gain > best_gain:
            best_gain = info_gain
            best_feature = feature
            
    return best_feature

In [None]:
get_best_feature(dt_kstart, 'state')
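
Information gain is the drop in entropy after splitting on a feature. The toy example below is a sketch with made-up data and hypothetical helper names (`entropy_bits`, `information_gain`); it contrasts a feature that separates the states perfectly with one that tells us nothing.

```python
import numpy as np
import pandas as pd

def entropy_bits(series):
    # entropy (in bits) of the value distribution of a Series
    p = series.value_counts(normalize=True).to_numpy()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(df, feature, target):
    # weighted average of the target entropy inside each group of `feature`
    weights = df[feature].value_counts(normalize=True)
    remainder = sum(w * entropy_bits(df.loc[df[feature] == v, target])
                    for v, w in weights.items())
    return entropy_bits(df[target]) - remainder

toy = pd.DataFrame({
    'country': ['US', 'US', 'GB', 'GB'],   # separates the states perfectly
    'coin':    ['h', 't', 'h', 't'],       # unrelated to the state
    'state':   ['failed', 'failed', 'successful', 'successful'],
})

print(information_gain(toy, 'country', 'state'))
print(information_gain(toy, 'coin', 'state'))
```

The feature with the larger gain is the one a greedy, ID3-style split would pick first.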

<h2> Looks like my assumption that num_days would highly correlate with state is wrong! </h2> 

Nonetheless, there is some correlation and some information gain. <br>
We now proceed to build the Decision Tree <br>
Though the output of this decision tree is a probability distribution, I still label it as 'state' so that it is easier to retrieve predictions from this tree later. 

In [None]:
# returns true if there is only one label in the target field 
def is_pure(df, target):
    return len(df[target].unique()) == 1
        
def create_tree(df, target):
    # condition for pure data (when there is only one possible 'state')
    if is_pure(df, target):
        return {'state' : dict(get_probabilities(df, target))}
    
    #condition for leaf nodes
    if len(df.columns) <= 2:
        features = list(df.columns)
        features.remove(target)
        feature = features[0]
        leaf_node = {feature:{}}
        uniques = df[feature].unique()
        for value in uniques:
            subset = split_data(df, feature, value)
            leaf_node[feature][value] = {'state' : dict(get_probabilities(subset, target))}
        return leaf_node
    
    # recursive call to create the nested tree/dictionary
    best_feature = get_best_feature(df, target)
    if best_feature:
        my_tree = {best_feature:{}}
        uniques = df[best_feature].unique()
        for value in uniques:
            subset = split_data(df, best_feature, value)
            subset = subset.drop(best_feature, axis=1)
            my_tree[best_feature][value] = create_tree(subset, target)
    else: 
        my_tree = {'state' : dict(get_probabilities(df, target))}
        
    return my_tree
            

In [None]:
# start time
start = time.perf_counter()

# creating the tree
d_tree = create_tree(dt_kstart, 'state')

# saving the dictionary
filename = '/kaggle/working/decision_tree.txt'
with open(filename, 'w') as f:
    json.dump(d_tree, f)
    
# end time
stop = time.perf_counter()

print('Creating the Decision Tree took close to ' + str((stop-start)/60.0) + ' minutes')

In [None]:
# loading the tree
def load_tree(filename):
    with open(filename, 'r') as f:
        return json.load(f)

d_tree = load_tree('/kaggle/working/decision_tree.txt')
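
One detail worth keeping in mind when saving the tree this way: JSON object keys are always strings. That is one reason it is convenient that `num_days` is stored as a string, because the keys in the reloaded tree then still compare equal to the feature values. A quick sketch with a hypothetical two-level tree:

```python
import json

# a hypothetical two-level tree keyed by an integer day count
tree = {'num_days': {29: {'state': {'failed': 0.6, 'successful': 0.4}}}}

# round trip through JSON, as happens when saving and reloading the tree
restored = json.loads(json.dumps(tree))

print(list(tree['num_days'].keys()))      # integer key before the round trip
print(list(restored['num_days'].keys()))  # string key afterwards
```

A lookup with the integer 29 would fail on the restored tree, while the string '29' works.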

In [None]:
# to predict a single instance using the decision tree
# inputs: the tree; features as a dict (or pandas.Series) keyed by column name
# returns: the probability distribution of the states as a dictionary
#          (empty if the instance has a feature value the tree never saw)
def partial_predict(tree, features):
    node = next(iter(tree))
    branches = tree[node]
    # a 'state' node is a leaf holding the distribution itself
    if node == 'state':
        return branches
    # otherwise follow the branch that matches this instance's value
    feat_value = features[node]
    if feat_value in branches:
        return partial_predict(branches[feat_value], features)
    return {}
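
To make the nested-dictionary layout more concrete, here is a sketch that walks a tiny hand-built tree. The tree values and the helper name `walk` are made up for illustration; the real tree is much deeper.

```python
# hand-built tree in the same nested-dict layout (values are made up)
toy_tree = {
    'main_category': {
        'Food': {'num_days': {'29': {'state': {'failed': 0.7, 'successful': 0.3}},
                              '59': {'state': {'failed': 1.0}}}},
        'Music': {'state': {'successful': 1.0}},
    }
}

def walk(tree, features):
    """Follow one branch per level until a 'state' leaf is reached."""
    node = next(iter(tree))
    if node == 'state':
        return tree[node]
    branch = tree[node].get(features[node])
    return walk(branch, features) if branch is not None else {}

print(walk(toy_tree, {'main_category': 'Food', 'num_days': '29'}))
print(walk(toy_tree, {'main_category': 'Music', 'num_days': '10'}))
print(walk(toy_tree, {'main_category': 'Games', 'num_days': '29'}))
```

The last call uses a category the toy tree has never seen, so the lookup comes back empty. Unseen values in the test data would behave the same way.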

<h3> Unit testing the tree on a single instance/row </h3>

In [None]:
labels = dict(dt_kstart.loc[89])
labels

In [None]:
partial_predict(d_tree, labels)

In [None]:
# retrieving data from the tree
d_tree['category']['Restaurants']['num_days']['59']['country']['US']

<h3> Translate probability distributions from a dictionary to a numpy array </h3>

In [None]:
# to translate a predicted distribution (a dictionary) to its corresponding numpy version
def translate(distribution):
    array = np.empty([6])
    # to hardcode positions in the numpy array 
    positions = {'failed':0, 
                 'successful':1, 
                 'canceled':2, 
                 'undefined':3, 
                 'live':4, 
                 'suspended':5} 
    for key in positions:
        if key in distribution.keys():
            array[positions[key]] = distribution[key]
        else:
            array[positions[key]] = 0 
    return array

In [None]:
x = partial_predict(d_tree, labels) #same example as above
print(x)
y = translate(x)
print (y) # this is now translated to a numpy version

In [None]:
# since this is a probability distribution, all entries must sum to 1 
y.sum()
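
The translation step is essentially a lookup against a fixed ordering of the states. A compact sketch of the same idea follows; `STATE_ORDER` and `to_vector` are hypothetical names for this example only.

```python
import numpy as np

# fixed position of every state in the output vector
STATE_ORDER = ['failed', 'successful', 'canceled', 'undefined', 'live', 'suspended']

def to_vector(distribution):
    # states missing from the dictionary default to probability 0
    return np.array([distribution.get(state, 0.0) for state in STATE_ORDER])

vec = to_vector({'successful': 0.25, 'failed': 0.75})
print(vec, vec.sum())

# an empty distribution becomes an all-zero vector, which does NOT sum to 1
print(to_vector({}).sum())
```

So the "sums to 1" check only holds when the tree actually returned a distribution for the row.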

<h3> Generating probability distributions from categorical data for the entire training dataset using the tree </h3>

In [None]:
to_predict = dt_kstart

In [None]:
# to get predictions for an entire dataframe
def get_partial_predictions(tree, inputs):
    partial_predictions = []
    for index,row in inputs.iterrows():
        features = dict(row)
        probabs = partial_predict(tree, features)
        arr = translate(probabs)
        partial_predictions.append(arr)
    return np.array(partial_predictions)

In [None]:
start = time.perf_counter()
part_predict = get_partial_predictions(d_tree, to_predict)
stop = time.perf_counter()

print(part_predict.shape)
print('This process took ' + str(stop-start) + ' seconds')

 <h2> 'part_predict' now contains the partial predictions from the decision tree for the entire dataset </h2>

<h2> We now prepare the training dataset for the neural network </h2> 

As was mentioned earlier, the neural network needs the probability distributions from the tree and the remaining numerical columns. 

So the inputs to the neural network would be: 

<ul>
    <li> Probability Distributions from the decision tree (length = 6) </li>
    <li> backers </li>
    <li> usd_pledged_real </li>
    <li> usd_goal_real </li> 
</ul> 

I am omitting the other columns for "pledged" and "goal" as they may be in different currencies, and the currency has already been handled by the tree.  

<h2> We first concatenate the partial prediction to the remaining (numeric) fields in the dataset </h2> 

In [None]:
nn_train_part1 = part_predict
nn_train_part2 = np.array(kick_start[['backers', 'usd_pledged_real', 'usd_goal_real']])

print (nn_train_part1.shape)
print (nn_train_part2.shape)

In [None]:
nn_inputs = np.concatenate((nn_train_part1, nn_train_part2), axis=1)
print(nn_inputs.shape)
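
The shape arithmetic here is simple but easy to get wrong, so here is a sketch with four made-up rows: 6 tree probabilities plus 3 numeric columns give 9 features per row. All values are invented for illustration.

```python
import numpy as np

# 4 made-up rows: 6 tree probabilities each, plus 3 numeric columns each
tree_part = np.full((4, 6), 1 / 6)
numeric_part = np.array([[10, 500.0, 1000.0],
                         [0, 0.0, 5000.0],
                         [42, 2600.0, 2500.0],
                         [3, 75.0, 300.0]])

# axis=1 glues the columns side by side, keeping one row per project
features = np.concatenate((tree_part, numeric_part), axis=1)
print(features.shape)  # (4, 9)
```

The second dimension is the number of inputs the first layer of the network has to accept.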

In [None]:
#save the training_inputs for the neural network
np.save('/kaggle/working/nn_inputs', nn_inputs)

In [None]:
# for the targets of the neural network
states = np.array(kick_start['state'])
states.shape

In [None]:
# to translate the state into integers for one-hot encoding
def translate_states(states):
    array = np.empty([len(states)], dtype = 'int8')
    positions = {'failed':0, 
                 'successful':1, 
                 'canceled':2, 
                 'undefined':3, 
                 'live':4, 
                 'suspended':5} 
    for i, state in enumerate(list(states)):
        array[i] = int(positions[state])
    return array

In [None]:
translated = translate_states(states)
translated[:10]

In [None]:
# One hot encoding
nb_classes = 6
one_hot_targets = np.eye(nb_classes)[translated]

In [None]:
one_hot_targets[:10]
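
The `np.eye` trick works because row i of the identity matrix is exactly the one-hot vector for class i, so indexing the matrix with an array of labels picks one row per label. A small sketch with three made-up labels:

```python
import numpy as np

labels = np.array([1, 0, 5])   # e.g. successful, failed, suspended
one_hot = np.eye(6)[labels]    # picks rows 1, 0 and 5 of the 6x6 identity matrix

print(one_hot)
print(one_hot.argmax(axis=1))  # recovers the integer labels
```

Taking the `argmax` of each row reverses the encoding, which is also how a softmax output can be turned back into a state.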

In [None]:
# saving the targets for the neural network
np.save('/kaggle/working/one_hot_targets', one_hot_targets)

<h2> We now have our inputs in "nn_inputs" and our targets in "one_hot_targets". </h2>

<b> Training the Neural Network </b> 

In [None]:
inputs = np.load('/kaggle/working/nn_inputs.npy')
targets = np.load('/kaggle/working/one_hot_targets.npy')

In [None]:
model = Sequential()
model.add(Dense(9, input_dim=9, activation='sigmoid'))
model.add(Dense(20, activation='sigmoid'))
model.add(Dense(20, activation='sigmoid'))
model.add(Dense(10, activation='sigmoid'))
model.add(Dense(6, activation = 'softmax')) 

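# categorical cross-entropy matches the 6-way softmax output and the one-hot targets
model.compile(loss='categorical_crossentropy', optimizer='nadam', metrics=['accuracy'])

# train on the arrays loaded from disk above
model.fit(inputs, targets, validation_split = 0.1, epochs = 6, batch_size=120)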

In [None]:
model.save('/kaggle/working/nn_model.h5')

In [None]:
# helper function to convert dates to num_days
def convert_dates(features):
    x = pd.to_datetime(features["deadline"])
    y = pd.to_datetime(features["launched"])
    z = x - y 
    num_days = str(z.days)
    return num_days
    
# for final predictions using both the decision tree and the neural network 
# inputs: a pandas.Series called features (one row of the dataset), and
#         the trained decision tree and neural network
# outputs: the predicted 'state' for the provided features as a string
def predict(features, d_tree, model):
    reverse_hot = {0:'failed', 
                   1:'successful', 
                   2:'canceled', 
                   3:'undefined', 
                   4:'live', 
                   5:'suspended'}
    num_days = convert_dates(features)
    features = dict(features)
    features['num_days'] = num_days
    # part 1: probability distribution from the decision tree
    part_preds = partial_predict(d_tree, features)
    part1 = translate(part_preds)
    # part 2: the numeric columns
    part2 = np.array([features['backers'], 
                      features['usd_pledged_real'], 
                      features['usd_goal_real']])
    # a batch of one row, shape (1, 9), as expected by model.predict
    to_predict = np.array([np.concatenate((part1, part2))])
    predicted_numpy = model.predict(to_predict)
    # index of the largest softmax output -> state label
    return reverse_hot[int(predicted_numpy.argmax())]


<h3> Testing on some instances </h3>

In [None]:
# test an i'th row in the dataset
def get_prediction(i):
    return predict(kick_start.loc[i], d_tree, model)

print (get_prediction(randint(0,1000)))
print (get_prediction(randint(0,5000)))
print (get_prediction(randint(0,1000)))
print (get_prediction(randint(0,5000)))

<h2> Testing with the entire training dataset </h2> 

Loading the models

In [None]:
model = load_model('/kaggle/working/nn_model.h5')
d_tree = load_tree('/kaggle/working/decision_tree.txt')

<h3> We will drop the 'state' labels from the training data and fill it with 'None' </h3> 

In [None]:
kick_start_test = kick_start.drop(['state'], axis=1)
kick_start_test['state'] = None

In [None]:
kick_start_test.head()

In [None]:
expected_outputs = kick_start['state']
expected_outputs[:10]

In [None]:
start = time.perf_counter()

results = {'predicted':[], 'expected': list(expected_outputs)}
for index,row in kick_start_test.iterrows():
    results['predicted'].append(predict(row, d_tree, model))  

end = time.perf_counter()
print ('Getting predictions on the training dataset took ' + 
       str((end - start) / 60.0) + ' minutes')

In [None]:
results_df = pd.DataFrame(results)
results_df.to_csv('/kaggle/working/results_training_data.csv')

In [None]:
def display_results_data(result_df):
    matches = result_df.loc[(result_df['predicted'] == result_df['expected'])]
    match_percentage = len(matches)/len(result_df) * 100
    errors =  result_df.loc[(result_df['predicted'] != result_df['expected'])]
    error_percentage = len(errors)/len(result_df) * 100
    
    # these are overall correct predictions (accuracy), not per-class true positives
    print ('\nCorrect predictions = ' + str(len(matches)) + 
           '\t\t' + 'Accuracy (%) = ' + 
           str(match_percentage))
    print ('\nErrors = ' + str(len(errors)) +
           '\t\t\t' + 'Error Percentage = ' + 
           str(error_percentage))


In [None]:
display_results_data(results_df)
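
Since the overall percentage lumps all states together, a per-state breakdown might be informative as well. The following is only a sketch on a made-up results table; `toy_results` is hypothetical and merely mirrors the 'predicted' / 'expected' layout used above.

```python
import pandas as pd

# made-up results in the same two-column layout
toy_results = pd.DataFrame({
    'predicted': ['failed', 'successful', 'failed', 'live'],
    'expected':  ['failed', 'successful', 'canceled', 'live'],
})

# fraction of rows where the prediction matches the label
accuracy = (toy_results['predicted'] == toy_results['expected']).mean()
print(accuracy)

# rows = expected state, columns = predicted state
confusion = pd.crosstab(toy_results['expected'], toy_results['predicted'])
print(confusion)
```

Off-diagonal entries in the table show which states get mistaken for which, something a single accuracy figure cannot reveal.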

<h1> TESTING ON UNSEEN DATA </h1> 
<h3> I will now use the 'kick_start_unique' dataframe set aside for testing which has data neither model was trained on </h3> 

In [None]:
model = load_model('/kaggle/working/nn_model.h5')
d_tree = load_tree('/kaggle/working/decision_tree.txt')

<h3> Recalling the unique data from 2018 not present in 2016 for this test. <br>
These records are new to both the models as only 2016 records were considered for the training </h3>

In [None]:
kick_start_unique.head()

 Initializing the "expected" outputs and dropping the 'state' column from test_data

In [None]:
expected_outputs = kick_start_unique['state']
expected_outputs[:10]

In [None]:
test_data = kick_start_unique.drop(['state'], axis=1)
test_data['state'] = None
test_data.head()

<h2> Final Model Predictions </h2>

In [None]:
start = time.perf_counter()

results = {'predicted':[], 'expected': list(expected_outputs)}
for index,row in test_data.iterrows():
    results['predicted'].append(predict(row, d_tree, model))  

end = time.perf_counter()
print ('Getting predictions on the Test dataset took ' + 
       str((end - start) / 60.0) + ' minutes')

In [None]:
results_df = pd.DataFrame(results)
results_df.to_csv('/kaggle/working/unseen_test_predictions.csv')

In [None]:
display_results_data(results_df)

<h1> So, the accuracy of the combined, symbiotic models is still better than 80%, even on unseen data </h1>