The data is available as two attached CSV files:
takehome_user_engagement.csv
takehome_users.csv
The data has the following two tables:
1] A user table ("takehome_users") with data on 12,000 users who signed up for the product in the last two years. This table includes:
● name: the user's name
● object_id: the user's id
● email: email address
● creation_source: how their account was created. This takes on one of 5 values:
    ○ PERSONAL_PROJECTS: invited to join another user's personal workspace
    ○ GUEST_INVITE: invited to an organization as a guest (limited permissions)
    ○ ORG_INVITE: invited to an organization (as a full member)
    ○ SIGNUP: signed up via the website
    ○ SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id)
● creation_time: when they created their account
● last_session_creation_time: unix timestamp of last login
● opted_in_to_mailing_list: whether they have opted into receiving marketing emails
● enabled_for_marketing_drip: whether they are on the regular marketing email drip
● org_id: the organization (group of users) they belong to
● invited_by_user_id: which user invited them to join (if applicable).
2] A usage summary table ("takehome_user_engagement") that has a row for each day that a user logged into the product.
Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.
We suggest spending 1-2 hours on this, but you're welcome to spend more or less.
Please send us a brief writeup of your findings (the more concise, the better -- no more than one page), along with any summary tables, graphs, code, or queries that can help us understand your approach. Please note any factors you considered or investigation you did, even if they did not pan out. Feel free to identify any further research or data you think would be valuable.
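The adopted-user definition above can be read concretely as follows. This is a minimal sketch, not part of the original prompt: the helper name `is_adopted` and the sample dates are hypothetical, and it assumes a "seven-day period" means seven consecutive calendar days, so the first and last qualifying login days are at most six days apart.

```python
from datetime import date

def is_adopted(login_days, window=7, required=3):
    """Return True if `required` distinct login days fall inside any `window`-day period."""
    days = sorted(set(login_days))  # distinct calendar days, ascending
    # Pair each day with the day (required - 1) positions later in the sorted list
    for first, last in zip(days, days[required - 1:]):
        if (last - first).days < window:  # e.g. day 1 and day 7 are 6 days apart
            return True
    return False

# Hypothetical login histories
print(is_adopted([date(2014, 1, 1), date(2014, 1, 3), date(2014, 1, 7)]))  # three days inside one week
print(is_adopted([date(2014, 1, 1), date(2014, 1, 5), date(2014, 1, 9)]))  # spread over nine days
print(is_adopted([date(2014, 1, 1), date(2014, 1, 1), date(2014, 1, 2)]))  # only two distinct days
```

Repeated logins on the same day are collapsed before counting, since the definition asks for separate days.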

In [37]:
# Problem statement: identify which factor(s) predict future user adoption

In [38]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import matplotlib.pyplot as plt
import time
# Read the data from the dataset files given
activity = pd.read_csv('takehome_user_engagement.csv')
users = pd.read_csv('takehome_users.csv', encoding='latin-1')
#users

In [39]:
# Converting timestamp
activity['time_stampnew'] = pd.to_datetime(activity['time_stamp'])
#activity


In [40]:
# Keep only users who logged in on at least 3 days (each row is one login day)
def keep_repeat_users(df, visited=3):
	new_df = df.groupby('user_id').filter(lambda x: len(x) >= visited)
	return new_df

repeated_users_df = keep_repeat_users(activity)
#repeated_users_df

In [41]:
# now, we shall split the above df data by user_id
grouped_users_df = repeated_users_df.groupby('user_id')
#grouped_users_df

In [42]:
# Check whether a user logged in on `days_logged` separate days within any `period`-day window

def active_users(period, days_logged, user):
	# Distinct calendar days with a login, sorted ascending
	days = user['time_stampnew'].dt.normalize().drop_duplicates().sort_values().reset_index(drop=True)
	# Gap between each login day and the one (days_logged - 1) positions later
	span = days.shift(-(days_logged - 1)) - days
	# Days 1 and 7 of a 7-day window are 6 days apart, so the gap must be under `period` days
	return bool((span < pd.Timedelta(days=period)).any())

def keep_active_users(df):
	# `df` is a GroupBy object; keep only the groups (users) that meet the definition
	active_rows = df.filter(lambda x: active_users(period=7, days_logged=3, user=x))

	unique_active_users = DataFrame({'user_id': active_rows['user_id'].unique()})

	return unique_active_users


unique_active_users_df = keep_active_users(grouped_users_df)
#print(len(unique_active_users_df)) # number of adopted users

# Creating an indicator variable if they are an adopted user or not
unique_active_users_df['adopted_user'] = 1

In [43]:
# Merging adopted user dataframe with that of the original
adopted_user_info = pd.merge(unique_active_users_df, users, how='outer',
                  left_on='user_id', right_on='object_id')

# Filling non-adopted users in the above df with 0
adopted_user_info['adopted_user'] = adopted_user_info['adopted_user'].fillna(0)

len(adopted_user_info) # result of this should be 12000


12000

In [44]:
temp = adopted_user_info
temp['creation_time_utc'] = pd.to_datetime(temp['creation_time'], utc=True)
# Seconds since the Unix epoch, to match last_session_creation_time
epoch = pd.Timestamp('1970-01-01', tz='UTC')
temp['creation_time_unix'] = (temp['creation_time_utc'] - epoch) // pd.Timedelta(seconds=1)
# Time from account creation to last login
temp['creation_delta'] = temp['last_session_creation_time'] - temp['creation_time_unix']
# Use a single reference time so both "to today" deltas are consistent
now = int(time.time())
# Time from account creation to today
temp['lifespan_delta'] = now - temp['creation_time_unix']
# Time from last login to today
temp['last_login_delta'] = now - temp['last_session_creation_time']
temp.to_csv('adopted_users.csv')



In [None]:
df=pd.read_csv('adopted_users.csv')
df.describe()

In this analysis, I analyzed 12,000 users, of whom 1,656 (13.8%) became adopted users. The remaining, non-adopted users fall into two groups: those who never visited (26.5% of all users) and those who visited but did not adopt (59.7%). For users who never visited, the goal should be to get them to visit for the first time; for users who visited but did not adopt, the goal should be to improve their experience so that adoption increases.
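The three-way split described above could be computed along these lines. This is a sketch on made-up rows, not the original analysis: it assumes that a missing `last_session_creation_time` marks a user who never logged in, and the `segment` column name is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged user table (values are made up)
toy = pd.DataFrame({
    'object_id': [1, 2, 3, 4, 5],
    'last_session_creation_time': [1.39e9, np.nan, 1.40e9, np.nan, 1.38e9],
    'adopted_user': [1, 0, 0, 0, 0],
})

# Assumption: a missing last_session_creation_time means the user never logged in
never_visited = toy['last_session_creation_time'].isna()
adopted = toy['adopted_user'] == 1

# The first matching condition wins; everyone else visited without adopting
toy['segment'] = np.select(
    [adopted, never_visited],
    ['adopted', 'never visited'],
    default='visited, not adopted',
)

# Share of users in each segment, as a percentage
segment_pct = toy['segment'].value_counts(normalize=True).mul(100).round(1)
print(segment_pct)
```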

While preprocessing the data, I engineered 5 new features to predict adoption: email_domain (the domain of the email address), adopted_refer (whether the user was referred by an adopted user), same_org (whether the user and the person who referred them are in the same organization), org_size (the size of the organization), and org_adopt_pct (the percentage of people in the organization who are adopted users).
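The writeup does not show how these five features were derived, so the following is only one plausible construction on made-up rows. It assumes `adopted_refer` and `same_org` are 0 for users with no inviter, and that `org_adopt_pct` is the mean adoption flag over every user in the organization, including the user themselves.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged user table (values are made up)
toy = pd.DataFrame({
    'object_id': [1, 2, 3, 4, 5],
    'email': ['a@gmail.com', 'b@yahoo.com', 'c@gmail.com', 'd@corp.io', 'e@gmail.com'],
    'org_id': [10, 10, 20, 20, 20],
    'invited_by_user_id': [np.nan, 1, np.nan, 3, 1],
    'adopted_user': [1, 0, 0, 1, 0],
})

# email_domain: everything after the '@'
toy['email_domain'] = toy['email'].str.split('@').str[-1]

# Look up the inviter's adoption flag and organization by user id
by_id = toy.set_index('object_id')
inviter_adopted = toy['invited_by_user_id'].map(by_id['adopted_user'])
inviter_org = toy['invited_by_user_id'].map(by_id['org_id'])

# adopted_refer: 1 if invited by an adopted user, else 0 (including uninvited users)
toy['adopted_refer'] = inviter_adopted.fillna(0).astype(int)
# same_org: 1 if the inviter belongs to the same organization as the user
toy['same_org'] = (inviter_org == toy['org_id']).astype(int)

# org_size and org_adopt_pct: per-organization aggregates broadcast back to each user
org = toy.groupby('org_id')['adopted_user']
toy['org_size'] = org.transform('size')
toy['org_adopt_pct'] = org.transform('mean')

print(toy[['email_domain', 'adopted_refer', 'same_org', 'org_size', 'org_adopt_pct']])
```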

The data set is highly imbalanced: there are far more non-adopted users than adopted users. If the imbalanced data were used directly, the model would tend to predict "non-adopted" to achieve higher accuracy and would lose the ability to identify potential adopted users. I therefore used undersampling to counter the imbalance and built the prediction model with a state-of-the-art machine learning method. Because of this goal and the imbalanced data, the evaluation metric is AUC rather than accuracy.
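To make the undersampling and AUC evaluation concrete, here is a minimal sketch on synthetic data. The writeup does not name the model, so scikit-learn's `LogisticRegression` is used purely as a stand-in, and the `signal` and `noise` features are invented. The test split is taken before undersampling so that AUC is measured on the original class balance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: roughly 14% positives, one informative feature
rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.14).astype(int)
X = pd.DataFrame({
    'signal': rng.normal(loc=y * 1.5, scale=1.0),  # shifted upward for positives
    'noise': rng.normal(size=n),
})

# Hold out a test set first so it keeps the original class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Random undersampling: keep all positives and an equal number of negatives
pos_idx = np.flatnonzero(y_train == 1)
neg_idx = np.flatnonzero(y_train == 0)
neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_keep])

# Stand-in model (an assumption, not the model used in the writeup)
model = LogisticRegression()
model.fit(X_train.iloc[keep], y_train[keep])

# AUC is computed from predicted probabilities on the untouched test set
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f'Test AUC: {auc:.3f}')
```

AUC depends only on how the model ranks users, so it is not inflated by always predicting the majority class the way accuracy is.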

First, the organization's adoption rate plays a critical role in adoption. Second, the size of the organization is also very important: smaller organizations tend to have higher adoption rates than larger ones. Third, for referrals, a user referred by an adopted user is more likely to become an adopted user.
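Claims like these can be checked directly with grouped adoption rates. The sketch below uses made-up rows and arbitrary size cut points (10 and 100 users), neither of which comes from the original analysis; it only illustrates the kind of summary table that would support the findings.

```python
import pandas as pd

# Toy stand-in for the engineered feature table (values are made up)
toy = pd.DataFrame({
    'org_size': [5, 8, 12, 40, 55, 60, 150, 200],
    'adopted_refer': [1, 1, 0, 0, 1, 1, 0, 0],
    'adopted_user': [1, 1, 0, 0, 1, 0, 0, 0],
})

# Bucket organizations by size; the cut points are arbitrary illustrations
toy['org_size_bucket'] = pd.cut(
    toy['org_size'],
    bins=[0, 10, 100, float('inf')],
    labels=['small', 'medium', 'large'],
)

# Mean of a 0/1 flag within each group is that group's adoption rate
rate_by_size = toy.groupby('org_size_bucket', observed=True)['adopted_user'].mean()
rate_by_refer = toy.groupby('adopted_refer')['adopted_user'].mean()

print(rate_by_size)
print(rate_by_refer)
```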

In conclusion, most adopted users come from relatively small organizations with a high adoption rate and were referred by an adopted user; meanwhile, users from certain email domains have lower adoption rates than others. Overall, the adoption rate within the organization is the most important factor in determining whether a new user will adopt.