Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.
We suggest spending 1-2 hours on this, but you're welcome to spend more or less. Please send us a brief writeup of your findings (the more concise, the better; no more than one page), along with any summary tables, graphs, code, or queries that can help us understand your approach. Please note any factors you considered or investigation you did, even if they did not pan out. Feel free to identify any further research or data you think would be valuable.

In [1]:
# Importing useful libraries
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


In [2]:
df=pd.read_csv('takehome_users.csv', encoding='latin-1',parse_dates=True)
dfengage=pd.read_csv('takehome_user_engagement.csv', parse_dates=True)

In [3]:
df.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [4]:
dfengage.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [5]:
# Time features should be converted into datetime object
dfengage.time_stamp=pd.to_datetime(dfengage.time_stamp)

In [6]:
type(dfengage.time_stamp)

pandas.core.series.Series
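
Note that `type(...)` on a column always reports `Series`, so it does not confirm that the conversion worked. A minimal sketch of checking the dtype instead, using a small made-up frame (the `demo` name and its values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical two-row stand-in for the engagement table.
demo = pd.DataFrame({"time_stamp": ["2014-04-22 03:53:30", "2013-11-15 03:45:04"]})

before = demo["time_stamp"].dtype  # a string-like dtype before parsing
demo["time_stamp"] = pd.to_datetime(demo["time_stamp"])
after = demo["time_stamp"].dtype   # a datetime64 dtype after parsing

# The dtype, not type(), shows whether the column now holds datetimes.
is_datetime = pd.api.types.is_datetime64_any_dtype(demo["time_stamp"])
print(before, after, is_datetime)
```
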

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


The last_session_creation_time and invited_by_user_id columns have null values that need to be dealt with.

In [8]:
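# Fill missing values and drop columns that are not needed

# Users with no inviter get 0 as a placeholder id
df.invited_by_user_id = df.invited_by_user_id.fillna(0)
# Users with no recorded session fall back to their creation time
df.last_session_creation_time.fillna(df.creation_time, inplace=True)
df=df.drop('name', axis=1)


# Convert the time columns to datetime
# NOTE: last_session_creation_time holds Unix epoch seconds, so calling
# pd.to_datetime without unit='s' produces the 1970 dates seen in the output below.
# The column is dropped before modeling, so the results are unaffected.
df['creation_time'] = pd.to_datetime(df['creation_time'])
df['last_session_creation_time'] = pd.to_datetime(df['last_session_creation_time'])

# Keep only the email domain (the part after '@')
df['email'] = df['email'].str.split('@').str[1]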



df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 9 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null datetime64[ns]
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    12000 non-null datetime64[ns]
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            12000 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(4), object(2)
memory usage: 843.9+ KB


In [9]:
df.head()

Unnamed: 0,object_id,creation_time,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,yahoo.com,GUEST_INVITE,1970-01-01 00:00:01.398138810,1,0,11,10803.0
1,2,2013-11-15 03:45:04,gustr.com,ORG_INVITE,1970-01-01 00:00:01.396237504,0,0,1,316.0
2,3,2013-03-19 23:14:52,gustr.com,ORG_INVITE,1970-01-01 00:00:01.363734892,0,0,94,1525.0
3,4,2013-05-21 08:09:28,yahoo.com,GUEST_INVITE,1970-01-01 00:00:01.369210168,0,0,1,5151.0
4,5,2013-01-17 10:14:20,yahoo.com,GUEST_INVITE,1970-01-01 00:00:01.358849660,0,0,193,5240.0
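
The 1970 dates above are what pandas produces when epoch seconds are read as nanoseconds. A hedged sketch of parsing such values with `unit='s'` and falling back to the creation time when no session exists follows. The `sample` frame is made up for illustration: its first row reuses the values of user 1 shown above, and its second row is entirely hypothetical.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "creation_time": ["2014-04-22 03:53:30", "2013-01-01 00:00:00"],
    # Epoch seconds; NaN stands for a user with no recorded session.
    "last_session_creation_time": [1398138810.0, np.nan],
})

sample["creation_time"] = pd.to_datetime(sample["creation_time"])
# unit='s' tells pandas the numbers are seconds since 1970-01-01.
sample["last_session_creation_time"] = pd.to_datetime(
    sample["last_session_creation_time"], unit="s"
)
# Fill missing sessions only after both columns are datetimes.
sample["last_session_creation_time"] = sample["last_session_creation_time"].fillna(
    sample["creation_time"]
)
print(sample)
```

Because the notebook drops this column before modeling, the reported results do not depend on this step.
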


In [10]:
# Total number of recorded visits per user
engage=dfengage.groupby('user_id').sum()
engage.head()

user_id,visited
1,1
2,14
3,1
4,1
5,1


In [11]:
dfengage.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [12]:
# Finding the adopted users

def visit_count(grp, freq):
    return grp.rolling(freq, on='time_stamp')['user_id'].count()

dfengage['7_day_visit'] = dfengage.groupby('user_id', as_index=False, group_keys=False).apply(visit_count, '7D')

# Creating a data frame with adopted_user status against the user_id
df_adopted = dfengage.groupby('user_id')['7_day_visit'].max().to_frame().reset_index()
df_adopted['adopted_user'] = (df_adopted['7_day_visit']>2)
df_adopted.head()

Unnamed: 0,user_id,7_day_visit,adopted_user
0,1,1.0,False
1,2,3.0,True
2,3,1.0,False
3,4,1.0,False
4,5,1.0,False
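
The rolling count above counts login rows, which matches the definition only if each user has at most one row per day. As a cross-check, here is a minimal sketch in plain Python that counts distinct calendar days explicitly. `is_adopted` is a hypothetical helper, the login dates are made up, and reading "seven-day period" as seven consecutive calendar days is an assumption.

```python
from datetime import datetime, timedelta

def is_adopted(timestamps, min_days=3, window_days=7):
    """Return True if logins fall on >= min_days distinct days within one window."""
    days = sorted({ts.date() for ts in timestamps})  # distinct calendar days
    for i in range(len(days) - min_days + 1):
        # Span from the i-th distinct day to the (i + min_days - 1)-th one.
        if days[i + min_days - 1] - days[i] < timedelta(days=window_days):
            return True
    return False

# Made-up examples: three logins within a week, and three spread over a month.
within_week = [datetime(2014, 1, 1, 9), datetime(2014, 1, 3, 9), datetime(2014, 1, 7, 9)]
spread_out = [datetime(2014, 1, 1, 9), datetime(2014, 1, 15, 9), datetime(2014, 1, 30, 9)]
print(is_adopted(within_week), is_adopted(spread_out))
```

Repeated logins on the same day collapse into one entry in `days`, so they cannot satisfy the "three separate days" requirement on their own.
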


In [13]:
# merge the dataframes
dfmerged = pd.merge(df,df_adopted,how='outer',left_on='object_id',right_on='user_id').drop(['user_id','7_day_visit'],axis=1)

dfmerged.head()

Unnamed: 0,object_id,creation_time,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted_user
0,1,2014-04-22 03:53:30,yahoo.com,GUEST_INVITE,1970-01-01 00:00:01.398138810,1,0,11,10803.0,False
1,2,2013-11-15 03:45:04,gustr.com,ORG_INVITE,1970-01-01 00:00:01.396237504,0,0,1,316.0,True
2,3,2013-03-19 23:14:52,gustr.com,ORG_INVITE,1970-01-01 00:00:01.363734892,0,0,94,1525.0,False
3,4,2013-05-21 08:09:28,yahoo.com,GUEST_INVITE,1970-01-01 00:00:01.369210168,0,0,1,5151.0,False
4,5,2013-01-17 10:14:20,yahoo.com,GUEST_INVITE,1970-01-01 00:00:01.358849660,0,0,193,5240.0,False


In [14]:
# Users with no engagement records never logged in, so they are not adopted
dfmerged['adopted_user'] = dfmerged['adopted_user'].fillna(False)

In [15]:
dfmerged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null datetime64[ns]
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    12000 non-null datetime64[ns]
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            12000 non-null float64
adopted_user                  12000 non-null bool
dtypes: bool(1), datetime64[ns](2), float64(1), int64(4), object(2)
memory usage: 949.2+ KB


In [16]:
# Drop the identifier and timestamp columns that are not used as features
dfmerged.drop(['object_id','creation_time','last_session_creation_time'],axis=1,inplace=True)



In [17]:
dfmerged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 0 to 11999
Data columns (total 7 columns):
email                         12000 non-null object
creation_source               12000 non-null object
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            12000 non-null float64
adopted_user                  12000 non-null bool
dtypes: bool(1), float64(1), int64(3), object(2)
memory usage: 668.0+ KB


In [18]:
# One-hot encode the categorical columns
dfmerged= pd.get_dummies(data=dfmerged,columns=['creation_source','email'])
dfmerged.head()

Unnamed: 0,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted_user,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH,...,email_zkcdj.com,email_zkcep.com,email_zkdih.com,email_zpbkw.com,email_zpcop.com,email_zpcpu.com,email_zsrfb.com,email_zsrgb.com,email_zssin.com,email_zwmry.com
0,1,0,11,10803.0,False,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,316.0,True,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,94,1525.0,False,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,5151.0,False,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,193,5240.0,False,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
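
One-hot encoding every email domain creates one column per domain, and many of those columns end up with zero importance. A possible alternative, sketched here on made-up data, is to bucket rare domains into a single `other` category before encoding. The `bucket_rare` helper and the `min_count` threshold are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def bucket_rare(values: pd.Series, min_count: int = 2, other: str = "other") -> pd.Series:
    """Replace categories seen fewer than min_count times with a shared label."""
    counts = values.map(values.value_counts())  # frequency of each row's category
    return values.where(counts >= min_count, other)

# Made-up domains: two common ones and one that appears only once.
emails = pd.Series(["yahoo.com", "gustr.com", "gustr.com", "yahoo.com", "zkcdj.com"])
bucketed = bucket_rare(emails)
dummies = pd.get_dummies(bucketed, prefix="email")
print(sorted(dummies.columns))
```
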


### Model and Analysis

In [19]:
from sklearn.model_selection import train_test_split

#set up data by separating out the labels, then split
data = dfmerged.drop('adopted_user', axis=1)
labels = dfmerged.adopted_user

# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.33, random_state=42)

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import classification_report

#train and test classifier
rf = RandomForestClassifier(class_weight='balanced_subsample')

# fit on the training split, then report accuracy on the held-out split
rf.fit(X_train, y_train)

rf.score(X_test, y_test)



0.8186868686868687
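
Accuracy on its own is hard to interpret when one class dominates, which the use of `class_weight='balanced_subsample'` hints at here. A minimal sketch, on made-up labels, of comparing accuracy with the majority-class baseline and computing recall for the adopted class follows. The `summarize` helper and the toy labels are assumptions for illustration only.

```python
def summarize(y_true, y_pred):
    """Return accuracy, majority-class baseline accuracy, and recall for True."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    positives = sum(y_true)
    baseline = max(positives, n - positives) / n  # always predict the larger class
    true_pos = sum(t and p for t, p in zip(y_true, y_pred))
    recall = true_pos / positives if positives else float("nan")
    return accuracy, baseline, recall

# Toy labels: 2 adopted users out of 10, and the model finds one of them.
y_true_toy = [True, True] + [False] * 8
y_pred_toy = [True, False] + [False] * 8
print(summarize(y_true_toy, y_pred_toy))
```

The same comparison could be run on the held-out labels and the classifier's predictions to see how far the reported score sits above the baseline.
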

In [21]:
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index = X_train.columns,
                                    columns=['importance']).sort_values('importance',ascending=False)
feature_importances

Unnamed: 0,importance
org_id,0.563547
invited_by_user_id,0.235525
enabled_for_marketing_drip,0.017668
opted_in_to_mailing_list,0.016390
creation_source_PERSONAL_PROJECTS,0.012585
...,...
email_mrxqj.com,0.000000
email_mryst.com,0.000000
email_mrytw.com,0.000000
email_mulxe.com,0.000000


### Conclusion

The two most important features are org_id (importance 0.56) and invited_by_user_id (0.24); every other feature scores below 0.02. This suggests that adoption is strongly tied to group interactions: users who are part of a group appear more likely to adopt, and invited users likewise appear more likely to adopt.
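
Feature importances show which columns the forest splits on, not the direction of the effect. One quick way to check direction, sketched here on made-up rows, is to compare adoption rates across groups. The `toy` frame and its values are hypothetical and say nothing about the real data; only the column names follow the notebook.

```python
import pandas as pd

# Made-up rows; 0 marks users with no inviter, following the fillna(0) step.
toy = pd.DataFrame({
    "invited_by_user_id": [101.0, 102.0, 0.0, 0.0, 103.0, 0.0],
    "adopted_user": [False, True, False, False, True, False],
})
toy["invite_status"] = (toy["invited_by_user_id"] != 0).map(
    {True: "invited", False: "not_invited"}
)
# Mean of a boolean column is the share of adopted users in each group.
rates = toy.groupby("invite_status")["adopted_user"].mean()
print(rates)
```

The same grouping could be applied to creation_source or to org_id to see which groups adopt at higher rates.
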