### Project Background

Our Company, Vis-soft, produces data visualization software that competes with Power BI and Tableau.  Because it has less brand recognition than many of its competitors, Vis-soft offers a 90-day free trial of its platform.  The Company's CEO would like to predict which trial customers will convert to a paid subscription and which will not.  These predictions will help the Company with near-term financial modeling and can also be used to offer incentives to customers who are not expected to adopt a paid subscription at the end of the trial period.

### Models and Benchmark

A data science consultant has told us that this conversion model is a binary classification problem.  That is to say, we can use attributes observed during the trial period to make a binary prediction of conversion (to a paid subscriber) or non-conversion.  Similarly, the consulting team told us that common forms of binary classification models include random forests, METHOD 2, METHOD 3, and METHOD 4.  We test each of these models to determine which best predicts the conversion of a trial user to a paid subscriber.

Based on our own research, we conclude that appropriate benchmarks for the predictive power of classification models are accuracy, precision, recall, and F1 score.  Each model will be evaluated on these four metrics.
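
All four metrics can be derived from the counts of true and false positives and negatives. The sketch below is a minimal illustration, assuming labels are coded 1 for conversion and 0 for non-conversion; the function name `classification_metrics` and the sample labels are hypothetical and are not part of the project data.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = conversion)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted conversion, converted
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted no conversion, did not convert
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted conversion, did not convert
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted no conversion, converted

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never observed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for eight trial users
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these sample labels there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics come out to 0.75.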

### Data Attributes

Our product team has indicated to us that they have identified 20 Vis-soft features that they believe differentiate our product from its competitors.  Furthermore, they believe that users who use these features more during the trial period are more likely to convert to paid subscribers.  The product team believes two of these features are major differentiators and that the other 18, while perhaps useful in a conversion analysis, are less important to customers.  These two major differentiators are described below:

(1) Free publication - Vis-soft users can make their reports and dashboards available to anyone else via a cloud sharing platform.  A Vis-soft user can publish a report and grant access to specific email addresses, and anyone registered under one of those email addresses can access the report.  While other software packages allow for publishing, that sharing feature comes at an additional cost on other platforms.

(2) AI modules - Vis-soft has native AI modules that allow users to garner deeper insights from their data.  These modules are not yet available on competitors' platforms.

### Monte Carlo Simulation

Our product team tells us that the number of published reports/dashboards and the number of AI modules used are not independent.  First, both are correlated with hours of usage during the trial period.  They find that no users publish reports if they use the platform for less than 20 hours (as it generally takes about 20 hours for new users to get comfortable with the software), but beyond that point the number of reports published follows a normal distribution with a mean of 10 and a standard deviation of 1.  Our team finds that the number of AI modules used is also normally distributed, with a standard deviation of 1 and a mean that varies with prior BI experience.  Users with 0 years of experience never use AI modules; for users with 1 year of experience the mean number of modules used is 5; and for users with two or more years of experience the mean is 15.

Finally, they tell us that usage time is uniformly distributed between 0 and 500 hours over the 90-day trial period.

In [1]:
import numpy as np
import pandas as pd

In [2]:
#setting up number of simulated users and core customer attribute variables (usage time and prior BI experience)
n = 200000
prior_bi_experience_values = [0,1,2]
prior_bi_experience_prob = [.03,.05,.92]
prior_bi_experience = np.random.choice(prior_bi_experience_values, n, p=prior_bi_experience_prob)
usage_hours = np.random.uniform(0,500,n)
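
Because the draws above are unseeded, each run of the notebook produces a different simulated population. One possible way to make a run repeatable is NumPy's seeded `Generator` API, sketched below under stated assumptions: the seed value 42 and the `demo_` and `repeat_` names are illustrative and do not appear in the original notebook.

```python
import numpy as np

# Assumption: the seed value 42 is arbitrary and is not part of the original notebook
rng = np.random.default_rng(42)

n_demo = 5  # small sample so the draws are easy to inspect
demo_experience = rng.choice([0, 1, 2], size=n_demo, p=[0.03, 0.05, 0.92])
demo_usage_hours = rng.uniform(0, 500, size=n_demo)

# A second generator built from the same seed reproduces the same draws
rng_repeat = np.random.default_rng(42)
repeat_experience = rng_repeat.choice([0, 1, 2], size=n_demo, p=[0.03, 0.05, 0.92])
repeat_usage_hours = rng_repeat.uniform(0, 500, size=n_demo)

print(np.array_equal(demo_experience, repeat_experience))
print(np.array_equal(demo_usage_hours, repeat_usage_hours))
```

The same generator object would need to be used for every draw in the notebook for the full dataset to be reproducible.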

In [3]:
#simulating number of publishes per user

#first assume all users exceed 20 usage hours
publishes_naive = np.random.normal(10,1,n)

#zero out observations with less than 20 usage hours
publications = np.where(usage_hours < 20, 0, publishes_naive)

In [4]:
#simulating number of AI modules used
ai_1yr_exp = np.random.normal(5,1,n)
ai_2yr_more_exp = np.random.normal(15,1,n)

#selecting the appropriate distribution based on years of experience
#0 years -> no AI modules, 1 year -> mean of 5, 2 or more years -> mean of 15
ai = np.where(prior_bi_experience == 0, 0,
              np.where(prior_bi_experience == 1, ai_1yr_exp, ai_2yr_more_exp))

In [5]:
#simple attribute add.  we should probably add several of these
#note: if you add an attribute that allows for negative values, we may have to decide whether that is appropriate or not

at3 = np.random.choice([0,1,2,3], n, p=[.3,.4,.1,.2])
at4 = np.random.normal(4,1,n)
at5 = np.random.uniform(0,10,n)

### Adding Attributes to a DataFrame

In [6]:
#you will need to add any attributes here
d = {"prior_exp": prior_bi_experience, 
     "usage_hours": usage_hours, 
     "publishes": publications, 
     "ai": ai,
     "at3": at3,
     "at4": at4,
     "at5": at5
    }

In [7]:
df = pd.DataFrame(d)
df.describe()

| | prior_exp | usage_hours | publishes | ai | at3 | at4 | at5 |
|---|---|---|---|---|---|---|---|
| count | 200000.0 | 200000.0 | 200000.0 | 200000.0 | 200000.0 | 200000.0 | 200000.0 |
| mean | 1.89044 | 250.175218 | 9.599912 | 14.051599 | 1.197865 | 3.99886 | 5.004053 |
| std | 0.396343 | 144.626707 | 2.198206 | 3.433038 | 1.076411 | 1.000608 | 2.887509 |
| min | 0.0 | 0.001127 | 0.0 | 0.0 | 0.0 | -0.446405 | 0.00014 |
| 25% | 2.0 | 124.657227 | 9.229995 | 14.102369 | 0.0 | 3.325622 | 2.498452 |
| 50% | 2.0 | 250.366247 | 9.949248 | 14.889738 | 1.0 | 3.996904 | 5.004031 |
| 75% | 2.0 | 375.300415 | 10.643674 | 15.606132 | 2.0 | 4.673544 | 7.505106 |
| max | 2.0 | 499.998308 | 14.683496 | 19.357789 | 3.0 | 8.738615 | 9.99999 |
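
As a quick sanity check, the simulated means of `publishes` and `ai` can be compared with the values implied by the simulation parameters. The sketch below is plain arithmetic on the parameters set in the cells above; the variable names are illustrative.

```python
# Share of users who reach the 20-hour threshold when usage is uniform on [0, 500]
share_publishing = (500 - 20) / 500

# Users below the threshold publish nothing; the rest average 10 reports
expected_publishes = share_publishing * 10

# Mixture over prior BI experience: 0 years -> 0 modules, 1 year -> 5, 2+ years -> 15
experience_prob = [0.03, 0.05, 0.92]
mean_ai_by_experience = [0, 5, 15]
expected_ai = sum(p * m for p, m in zip(experience_prob, mean_ai_by_experience))

print(round(expected_publishes, 2), round(expected_ai, 2))
```

The implied means of roughly 9.6 and 14.05 are close to the simulated means of 9.599912 and 14.051599 reported by `df.describe()`.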


### Utility Function

In [8]:
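#if you add attributes you need to add them here
def utility(a1, a2, a3, a4, a5):
    #weighted sum of attributes: publishes (a1) and ai (a2) count fully, at3 to at5 are down-weighted
    return a1 + a2 + 0.5*a3 + 0.5*a4 + 0.25*a5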

In [9]:
#if you add attributes you also need to add them here
df['utility'] = utility(df['publishes'],df['ai'],df['at3'],df['at4'],df['at5'])

In [10]:
df['conversion'] = np.where(df['utility'] > 29, 1, 0)
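
To make the conversion rule concrete, the sketch below applies the same utility weights to two hypothetical users and compares each result with the threshold of 29. The attribute values are invented for illustration and are not drawn from the simulated data.

```python
def utility(a1, a2, a3, a4, a5):
    # Same weights as the notebook's utility function
    return a1 + a2 + 0.5 * a3 + 0.5 * a4 + 0.25 * a5


# Hypothetical user close to the simulated medians
typical_user = utility(10, 15, 1, 4, 5)      # 10 + 15 + 0.5 + 2 + 1.25 = 28.75
# Hypothetical user with somewhat heavier feature usage
heavy_user = utility(10.5, 15.5, 2, 4, 5)    # 10.5 + 15.5 + 1 + 2 + 1.25 = 30.25

threshold = 29
converts = [int(u > threshold) for u in (typical_user, heavy_user)]
print(typical_user, heavy_user, converts)
```

The first user falls just short of the threshold and is labeled a non-conversion (0), while the second exceeds it and is labeled a conversion (1).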

### Exporting the DataFrame to JSON

In [11]:
df.to_json("sim_data.json")