## Take Home Challenge 2: Relax Inc. Data Science Challenge

The goal of this notebook is to analyze data from the Relax Inc. Data Science Challenge to determine what types of users of this website become adopted users, that is, users who have logged in on three different days within a seven-day period at least once in their user history.  Further information describing the available data can be found in the prompt.

This analysis is intended to be brief rather than exhaustive, in order to get quick and actionable insight into the data.

### Load the Data

Change the directory commands below as necessary.
Copies of the .csv files are available in my repository.

In [1]:
cd ~

/Users/nick


In [2]:
cd Desktop/Springboard/takehome2/relax_challenge

/Users/nick/Desktop/Springboard/takehome2/relax_challenge


In [3]:
# Import necessary packages
import pandas as pd
import datetime
from collections import defaultdict
from collections import Counter
from scipy.stats import chisquare

In [4]:
# Load the data into dataframes
engagement=pd.read_csv('takehome_user_engagement.csv')
# and convert the login time stamp to a datetime data type for future manipulation
engagement.time_stamp=pd.to_datetime(engagement.time_stamp)

users=pd.read_csv('takehome_users.csv',encoding='latin-1')

### Initial Inspection

In [5]:
engagement.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [6]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [7]:
print(engagement.user_id.nunique())
print(users.object_id.nunique())
print(engagement.user_id.nunique()/users.object_id.nunique())

8823
12000
0.73525


Notice that only 8,823 of the 12,000 unique users who have signed up (74%) appear in the engagement file at all.  This is rather odd: even the user with user/object id 1 appears in both of the dataframe previews above, although her only engagement came at the same time as her account creation.  This gap alone could be enough to make much of the following analysis wrong, particularly if that 74% is not representative of all registered users.  More likely, the remaining 26% was removed to act as a hold-out set for machine learning predictions in a competition.  Since that cannot be known for certain, we will proceed assuming that the missing 26% of users did not become adopted users.

### Create New Features

In [8]:
# We want to find users who are highly active within a 7 day window
delta=datetime.timedelta(days=7)

# Create a default dict to store an adopted user indicator
adopted_user=defaultdict(int)
# For each unique user,
for user in engagement.user_id.unique():
    # isolate his/her data,
    subset=engagement[engagement.user_id==user]
    # and iterate through each of his/her logins,
    for i in range(len(subset)):
        # subsetting to only the logins that occur within 7 days of that one.
        window=subset[subset.time_stamp.between(subset.iloc[i,0],subset.iloc[i,0]+delta,inclusive='both')]
        # If there are at least 3 such logins,
        if len(window)>=3:
            # label the user as an adopted user
            adopted_user[user]=1
            # and stop looking for instances.
            break

# Iterate through all users.
for user in users.object_id:
    # If a user is not yet in the dictionary,
    if user not in adopted_user:
        # he/she must not be an adopted user.
        adopted_user[user] = 0

# Join the adopted user data onto the users dataframe
users['adopted']=users['object_id'].map(adopted_user)

# Create a new feature on the users dataframe that indicates
# whether they were invited by another user.
users['invited_by_user']=users.invited_by_user_id.notnull().astype('int')

In [9]:
print(Counter(list(adopted_user.values())))

Counter({0: 10344, 1: 1656})


In [10]:
average=list(adopted_user.values()).count(1)/users.object_id.nunique()
print(average)

0.138


Under the above assumption, only about 14% (1,656 of 12,000) of all registered users become adopted users.
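
The labeling loop above counts logins whose time stamps fall within seven days of a given login. A stricter reading of the definition (three different calendar days inside one seven-day window) can be checked per user with a short sliding comparison. The following is a minimal sketch under that assumed reading; the function name `is_adopted` and the sample dates are hypothetical, and it is not the method used to produce the counts in this notebook.

```python
from datetime import datetime, timedelta

def is_adopted(timestamps, window_days=7, min_days=3):
    """Return True if the logins cover at least `min_days` distinct
    calendar days that all fall inside one `window_days`-day window."""
    # Collapse repeated logins on the same day and sort chronologically.
    days = sorted({ts.date() for ts in timestamps})
    span = timedelta(days=window_days)
    # Pair each day with the day (min_days - 1) positions later; if the two
    # are fewer than `window_days` days apart, the whole run fits one window.
    for first, last in zip(days, days[min_days - 1:]):
        if last - first < span:
            return True
    return False

# Hypothetical logins on Jan 1, Jan 3, and Jan 7 (all within one week).
logins = [datetime(2014, 1, 1, 9), datetime(2014, 1, 3, 9), datetime(2014, 1, 7, 9)]
print(is_adopted(logins))
```

Under this reading, two logins on the same day count once, and logins on days 1, 4, and 8 do not qualify because they span eight calendar days.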

### Adopted User Analysis

To analyze which features most strongly predict conversion to an adopted user, I will examine count and proportion cross-tabulations between various categorical features and whether the users did or did not adopt the system.  Throughout the analysis, 0 always means an event did not happen, while 1 means it did.

I will then compute the chi-squared statistic for each category against the null hypothesis that the categorical feature is independent of whether a user adopts.  Namely, under the null hypothesis the split within each category should equal the split for the whole population, with 13.8% adopting and 86.2% not adopting.  Assume a significance level of $\alpha=.05$.
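
For a two-outcome split, the statistic is the sum of (observed - expected)^2 / expected over the two cells, with one degree of freedom. A minimal pure-Python sketch of that calculation follows, using the GUEST_INVITE counts (1,794 not adopted, 369 adopted) and the 13.8% overall rate from this notebook. The helper name `chi_square_two_cells` is hypothetical, and the p-value relies on the one-degree-of-freedom identity p = erfc(sqrt(statistic / 2)).

```python
import math

def chi_square_two_cells(observed, expected):
    """Chi-squared goodness-of-fit for two cells (1 degree of freedom)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

overall_rate = 0.138            # overall adoption rate from the notebook
observed = [1794, 369]          # GUEST_INVITE: not adopted, adopted
total = sum(observed)
expected = [(1 - overall_rate) * total, overall_rate * total]

stat, p_value = chi_square_two_cells(observed, expected)
print(stat, p_value)
```

This mirrors what `scipy.stats.chisquare` does with an `f_exp` argument for two categories, so the statistic should agree with the GUEST_INVITE value printed in the cells that follow (about 19.32).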

In [11]:
ct=pd.crosstab(users.creation_source,users.adopted)
ct['% Not Adopted']=ct[0]/(ct[0]+ct[1])
ct['% Adopted']=ct[1]/(ct[0]+ct[1])
print(ct)

adopted                0    1  % Not Adopted  % Adopted
creation_source                                        
GUEST_INVITE        1794  369       0.829404   0.170596
ORG_INVITE          3680  574       0.865068   0.134932
PERSONAL_PROJECTS   1939  172       0.918522   0.081478
SIGNUP              1785  302       0.855295   0.144705
SIGNUP_GOOGLE_AUTH  1146  239       0.827437   0.172563


In [12]:
for value in users.creation_source.unique():
    total=ct.loc[value,0]+ct.loc[value,1]
    print(str(value)+':')
    print(chisquare([ct.loc[value,0],ct.loc[value,1]],f_exp=[(1-average)*total,average*total]))

GUEST_INVITE:
Power_divergenceResult(statistic=19.320096070207466, pvalue=1.1053682903762694e-05)
ORG_INVITE:
Power_divergenceResult(statistic=0.33664354575292615, pvalue=0.561773067991706)
SIGNUP:
Power_divergenceResult(statistic=0.7888145956227056, pvalue=0.3744588065715626)
PERSONAL_PROJECTS:
Power_divergenceResult(statistic=56.694032030982854, pvalue=5.09178474324349e-14)
SIGNUP_GOOGLE_AUTH:
Power_divergenceResult(statistic=13.908834173798185, pvalue=0.00019189441460436033)


Creating an account through a guest invitation to an organization (GUEST_INVITE), through an invitation to join another user's personal workspace (PERSONAL_PROJECTS), or through Google authentication (SIGNUP_GOOGLE_AUTH) all give statistically significant results.  The proportion of users who become adopted users in each of these groups differs from the rate for the overall population.  This test alone does not confirm in which direction, positive or negative, each group differs, but the rates in the crosstab are suggestive.
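
One hedged way to recover the direction is a signed z-score for each group's adopted count against the overall rate. For a two-cell test its square equals the chi-squared statistic, so it carries the same evidence plus a sign. The sketch below assumes the counts from the crosstab above and the 13.8% overall rate; `signed_z` is a hypothetical helper.

```python
import math

def signed_z(adopted, total, rate):
    """z-score of the adopted count against a binomial(total, rate) null."""
    expected = rate * total
    std_dev = math.sqrt(total * rate * (1 - rate))
    return (adopted - expected) / std_dev

overall_rate = 0.138
# (not adopted, adopted) counts copied from the creation_source crosstab
counts = {
    'GUEST_INVITE': (1794, 369),
    'ORG_INVITE': (3680, 574),
    'PERSONAL_PROJECTS': (1939, 172),
    'SIGNUP': (1785, 302),
    'SIGNUP_GOOGLE_AUTH': (1146, 239),
}
z_scores = {}
for source, (not_adopted, adopted) in counts.items():
    z_scores[source] = signed_z(adopted, not_adopted + adopted, overall_rate)
    print(source, round(z_scores[source], 2))
```

A positive score means the group adopts more often than the overall rate, and a negative score means it adopts less often.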

In [13]:
ct=pd.crosstab(users.opted_in_to_mailing_list,users.adopted)
ct['% Not Adopted']=ct[0]/(ct[0]+ct[1])
ct['% Adopted']=ct[1]/(ct[0]+ct[1])
print(ct)

adopted                      0     1  % Not Adopted  % Adopted
opted_in_to_mailing_list                                      
0                         7779  1227       0.863757   0.136243
1                         2565   429       0.856713   0.143287


In [14]:
for value in users.opted_in_to_mailing_list.unique():
    total=ct.loc[value,0]+ct.loc[value,1]
    print(str(value)+':')
    print(chisquare([ct.loc[value,0],ct.loc[value,1]],f_exp=[(1-average)*total,average*total]))

1:
Power_divergenceResult(statistic=0.7034187410430643, pvalue=0.40163732663966367)
0:
Power_divergenceResult(statistic=0.23384806914090644, pvalue=0.6286851070985109)


Mailing list status does not seem to affect the adoption rate.

In [15]:
ct=pd.crosstab(users.enabled_for_marketing_drip,users.adopted)
ct['% Not Adopted']=ct[0]/(ct[0]+ct[1])
ct['% Adopted']=ct[1]/(ct[0]+ct[1])
print(ct)

adopted                        0     1  % Not Adopted  % Adopted
enabled_for_marketing_drip                                      
0                           8809  1399       0.862951   0.137049
1                           1535   257       0.856585   0.143415


In [16]:
for value in users.enabled_for_marketing_drip.unique():
    total=ct.loc[value,0]+ct.loc[value,1]
    print(str(value)+':')
    print(chisquare([ct.loc[value,0],ct.loc[value,1]],f_exp=[(1-average)*total,average*total]))

0:
Power_divergenceResult(statistic=0.07754870719038472, pvalue=0.7806472276171589)
1:
Power_divergenceResult(statistic=0.44175067131664303, pvalue=0.5062786889962375)


The marketing drip also seems irrelevant.  To be completely sure, in the future a chi-squared test could be run on the users with the marketing drip, using the distribution of those without the marketing drip as the expected distribution.
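
A closely related direct comparison is a chi-squared test of independence on the 2x2 table of drip status versus adoption. The sketch below computes it by hand without a continuity correction, using the counts from the crosstab above; `chi_square_2x2` is a hypothetical helper. `scipy.stats.chi2_contingency` would be the library route, though it applies Yates' correction to 2x2 tables by default, so its numbers would differ slightly.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table
    (no continuity correction, 1 degree of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if drip status and adoption were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: drip disabled, drip enabled; columns: not adopted, adopted.
drip_table = [[8809, 1399], [1535, 257]]
stat, p_value = chi_square_2x2(drip_table)
print(stat, p_value)
```

Because this compares the two groups with each other rather than each group with the pooled rate, it answers the drip question in a single test.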

In [17]:
ct=pd.crosstab(users.invited_by_user,users.adopted)
ct['% Not Adopted']=ct[0]/(ct[0]+ct[1])
ct['% Adopted']=ct[1]/(ct[0]+ct[1])
print(ct)
# These numbers are the same as GUEST_INVITE+ORG_INVITE

adopted             0    1  % Not Adopted  % Adopted
invited_by_user                                     
0                4870  713       0.872291   0.127709
1                5474  943       0.853047   0.146953


In [18]:
for value in users.invited_by_user.unique():
    total=ct.loc[value,0]+ct.loc[value,1]
    print(str(value)+':')
    print(chisquare([ct.loc[value,0],ct.loc[value,1]],f_exp=[(1-average)*total,average*total]))

1:
Power_divergenceResult(statistic=4.3243630577662335, pvalue=0.03757047275246717)
0:
Power_divergenceResult(statistic=4.970345287781841, pvalue=0.025785499380062776)


Interestingly, the total number of users invited by another user (6,417) is equal to the total number of users who created their accounts through organizational guest or full-member invitations (GUEST_INVITE plus ORG_INVITE).  This relationship was not apparent at first and could warrant further exploration.  Meanwhile, the adoption rates for invited and non-invited users (14.7% and 12.8%) do differ from the overall adoption rate of 13.8% by a statistically significant amount, even though the differences are small.
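
Statistical significance and practical size are separate questions, so it can help to look at the gap between the two adoption rates directly. The sketch below computes that difference with an approximate 95% Wald confidence interval from the counts in the crosstab above. The helper name `rate_difference_ci` is hypothetical, and the normal approximation is an assumption that seems reasonable at these sample sizes.

```python
import math

def rate_difference_ci(success_a, total_a, success_b, total_b, z=1.96):
    """Difference in proportions (a minus b) with a Wald interval."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    diff = rate_a - rate_b
    std_err = math.sqrt(rate_a * (1 - rate_a) / total_a
                        + rate_b * (1 - rate_b) / total_b)
    return diff, diff - z * std_err, diff + z * std_err

# Invited users: 943 adopted of 6417; not invited: 713 adopted of 5583.
diff, low, high = rate_difference_ci(943, 5474 + 943, 713, 4870 + 713)
print(diff, low, high)
```

The interval describes the gap between invited and non-invited users, which is a different comparison from the goodness-of-fit tests against the pooled rate used above.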

Further studies could examine whether certain organizations are more likely to have adopted users, or whether users invited by adopted users are more likely to become adopted users.  Does it matter if the invite comes from someone who is not yet an adopted user but becomes one in the future?  Are some inviters more persuasive than others in convincing their friends to become adopted users?  Does it matter what day of the week or month of the year a user signs up?

### Conclusion

Recall that this analysis rests heavily on the assumption that all users not appearing in the engagement file did not become adopted users.  With that in mind, the strongest predictive statistic was creating an account via an invitation to join another individual's workspace.  This saw the adoption rate drop all the way to 8.1%.  On the other hand, organizational guest invitations and Google account creations both experienced higher conversion rates.  Relax Inc. should focus its efforts on obtaining new customers via existing customer invitations and further target larger organizations.

Also consider that qualifying a user as adopted for using the system on at least three different days within at least one single week is a very light criterion.  It only indicates active use over a short term and is not an indicator of long-term loyalty.