## Relax Challenge

***Prompt:***   
Define an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period.   
Identify which factors predict future user adoption.
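
To make the definition concrete, here is a minimal sketch of how adopted users could be flagged from a login table. It assumes columns named `time_stamp` and `user_id` (as in the engagement CSV) and reads "seven-day period" as seven consecutive calendar days, so the first and third login days may be at most six days apart. The function name `adopted_user_ids` and the toy data are hypothetical and are not part of the original analysis.

```python
import pandas as pd

def adopted_user_ids(engagement, window_days=7, min_days=3):
    """Return user_ids with at least `min_days` distinct login days
    falling inside some `window_days`-day period."""
    df = engagement.copy()
    # Reduce each login to its calendar day and keep one row per user per day.
    df["day"] = pd.to_datetime(df["time_stamp"]).dt.normalize()
    df = df.drop_duplicates(["user_id", "day"]).sort_values(["user_id", "day"])
    # For each login day, look up the same user's login day (min_days - 1) rows later.
    later = df.groupby("user_id")["day"].shift(-(min_days - 1))
    # The user qualifies if those days span less than `window_days` days.
    # Rows without a later day compare as False (NaT).
    hit = (later - df["day"]) < pd.Timedelta(days=window_days)
    return set(df.loc[hit, "user_id"].tolist())

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "time_stamp": [
        "2014-01-01 08:00:00", "2014-01-03 09:00:00", "2014-01-07 10:00:00",  # user 1
        "2014-01-01 08:00:00", "2014-01-01 20:00:00", "2014-01-02 09:00:00",  # user 2
        "2014-01-01 08:00:00", "2014-01-05 09:00:00", "2014-01-08 10:00:00",  # user 3
    ],
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})
print(adopted_user_ids(toy))
```

User 2 logs in twice on the same day, so only two distinct days count, and user 3's three days span eight calendar days, so neither qualifies under this reading.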

### Data Cleaning

In [99]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import requests
import time

In [100]:
engagement_df = pd.read_csv('takehome_user_engagement.csv')
users_df = pd.read_csv('takehome_users.csv')

#### User Data

In [101]:
users_df.head()

,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [102]:
users_df.dtypes

object_id                       int64
creation_time                  object
name                           object
email                          object
creation_source                object
last_session_creation_time    float64
opted_in_to_mailing_list        int64
enabled_for_marketing_drip      int64
org_id                          int64
invited_by_user_id            float64
dtype: object

Looks like there are some data types to clean up and an index to set.  

In [103]:
users_df = users_df.set_index('object_id')

In [104]:
users_df['creation_time'] = pd.to_datetime(users_df['creation_time'])
users_df['last_session_creation_time'] = pd.to_datetime(users_df['last_session_creation_time'])

In [105]:
users_df['name'] = users_df['name'].astype(str)
users_df['email'] = users_df['email'].astype(str)

In [106]:
#set no invite to ID 0
users_df['invited_by_user_id'] = users_df['invited_by_user_id'].fillna(0)
users_df['invited_by_user_id'] = users_df['invited_by_user_id'].astype(int)

In [107]:
users_df['invited_by_user_id'].value_counts().sum()

12000

In [108]:
users_df['creation_source'].value_counts()

ORG_INVITE            4254
GUEST_INVITE          2163
PERSONAL_PROJECTS     2111
SIGNUP                2087
SIGNUP_GOOGLE_AUTH    1385
Name: creation_source, dtype: int64

In [109]:
users_df['creation_source'].value_counts().sum()

12000

In [110]:
#will assume 1 is a yes
users_df['opted_in_to_mailing_list'].value_counts()

0    9006
1    2994
Name: opted_in_to_mailing_list, dtype: int64

In [111]:
#will assume 1 is a yes
users_df['enabled_for_marketing_drip'].value_counts()

0    10208
1     1792
Name: enabled_for_marketing_drip, dtype: int64

In [112]:
users_df.dtypes

creation_time                 datetime64[ns]
name                                  object
email                                 object
creation_source                       object
last_session_creation_time    datetime64[ns]
opted_in_to_mailing_list               int64
enabled_for_marketing_drip             int64
org_id                                 int64
invited_by_user_id                     int64
dtype: object

In [113]:
users_df['last_session_creation_time'].max()

Timestamp('1970-01-01 00:00:01.402066730')

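There must have been an issue with the datetime conversion on the `last_session_creation_time` column: the values are Unix timestamps in seconds, but `pd.to_datetime` treats numeric input as nanoseconds by default.  
Will reimport the column from the CSV and convert it with `unit='s'`.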
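
As a quick illustration of the unit issue, the sketch below converts a single value both ways. The number is taken from the digits of the `Timestamp` printed above; the variable names are hypothetical.

```python
import pandas as pd

# Digits taken from the Timestamp shown above, read as a Unix time in seconds.
epoch_seconds = 1402066730.0

# With no unit, numeric input is interpreted as nanoseconds since the epoch.
as_nanoseconds = pd.to_datetime(epoch_seconds)

# With unit='s', the same number is interpreted as seconds since the epoch.
as_seconds = pd.to_datetime(epoch_seconds, unit="s")

print(as_nanoseconds)  # lands in 1970, about 1.4 seconds after the epoch
print(as_seconds)      # lands in 2014
```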

In [114]:
re_users_df = pd.read_csv('takehome_users.csv')
users_df['last_session_creation_time'] = pd.to_datetime(re_users_df['last_session_creation_time'], unit='s')

In [115]:
users_df['last_session_creation_time']

object_id
1       2014-03-31 03:45:04
2       2013-03-19 23:14:52
3       2013-05-22 08:09:28
4       2013-01-22 10:14:20
5       2013-12-19 03:37:06
                ...        
11996   2013-01-15 18:28:37
11997   2014-04-27 12:45:16
11998   2012-06-02 11:55:59
11999   2014-01-26 08:57:12
12000                   NaT
Name: last_session_creation_time, Length: 12000, dtype: datetime64[ns]
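
One thing worth checking in the output above: `object_id` 12000 is `NaT`, and `object_id` 1 shows 2014-03-31, which is earlier than that user's `creation_time` and matches the raw value printed for `object_id` 2 in the first `head()`. That pattern is consistent with an index mismatch, since `re_users_df` has a default 0-based index while `users_df` is indexed by `object_id`, and pandas aligns on index labels when assigning a Series to a column. The sketch below reproduces the effect on hypothetical three-row data and shows one possible way to keep the rows aligned; it is an illustration, not a rerun of the notebook.

```python
import pandas as pd

# Hypothetical three-row stand-in for the raw CSV (default index 0..2).
raw = pd.DataFrame({
    "object_id": [1, 2, 3],
    "last_session_creation_time": [1398139000.0, 1396238000.0, 1363735000.0],
})

# Stand-in for users_df, which is indexed by object_id (1..3).
indexed = pd.DataFrame(index=pd.Index([1, 2, 3], name="object_id"))

# Label-aligned assignment: label 1 on the left receives the value stored
# under label 1 on the right (the second row), and label 3 finds no match.
indexed["misaligned"] = pd.to_datetime(raw["last_session_creation_time"], unit="s")

# Giving the source the same index first keeps each value with its own user.
indexed["aligned"] = pd.to_datetime(
    raw.set_index("object_id")["last_session_creation_time"], unit="s"
)

print(indexed)
```

If the same mismatch applies to the real data, the reimport step would need the `set_index('object_id')` call (or an equivalent) before the assignment.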

In [116]:
#store null last_session_creation_time as creation_time
#logic being that account creation is the only time they have engaged with the app/service
users_df['last_session_creation_time'] = users_df['last_session_creation_time'].fillna(users_df['creation_time'])


In [117]:
users_df.head()

object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-03-31 03:45:04,1,0,11,10803
2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,1,316
3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-05-22 08:09:28,0,0,94,1525
4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,1,5151
5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-12-19 03:37:06,0,0,193,5240


In [118]:
#check for any null values
users_df[users_df.isna().any(axis=1)]

object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id


All clean here.  
Now just need to create some dummy columns for the creation source.

In [121]:
source = pd.get_dummies(users_df['creation_source'], prefix='creation_source', drop_first=True)
source

object_id,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
1,0,0,0,0
2,1,0,0,0
3,1,0,0,0
4,0,0,0,0
5,0,0,0,0
...,...,...,...,...
11996,1,0,0,0
11997,0,0,0,1
11998,0,0,0,0
11999,0,1,0,0


In [125]:
users_df = pd.concat([users_df, source], axis=1)

In [127]:
#all four dummy columns equal to 0 means creation_source was 'GUEST_INVITE' (the category dropped by drop_first)
users_df = users_df.drop('creation_source', axis=1)
users_df.head()

object_id,creation_time,name,email,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,2014-03-31 03:45:04,1,0,11,10803,0,0,0,0
2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,2013-03-19 23:14:52,0,0,1,316,1,0,0,0
3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,2013-05-22 08:09:28,0,0,94,1525,1,0,0,0
4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,2013-01-22 10:14:20,0,0,1,5151,0,0,0,0
5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,2013-12-19 03:37:06,0,0,193,5240,0,0,0,0
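
To see why an all-zero row stands for `GUEST_INVITE`, here is a small sketch on hypothetical data. With `drop_first=True`, `pd.get_dummies` drops the first category in sorted order, so that category becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical sample of creation sources, for illustration only.
sources = pd.Series(
    ["GUEST_INVITE", "ORG_INVITE", "SIGNUP", "GUEST_INVITE"],
    name="creation_source",
)

# drop_first=True removes the first category in sorted order.
dummies = pd.get_dummies(sources, prefix="creation_source", drop_first=True)

# Rows where every dummy is 0 belong to the dropped (baseline) category.
baseline_rows = dummies.sum(axis=1) == 0

print(dummies.columns.tolist())
print(sources[baseline_rows].unique())
```

Dropping one category avoids perfectly collinear dummy columns, which matters mainly for linear models; tree-based models generally tolerate keeping all five.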


In [128]:
users_df.describe()

,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
count,12000.0,12000.0,12000.0,12000.0,12000.0,12000.0,12000.0,12000.0
mean,0.2495,0.149333,141.884583,3188.691333,0.3545,0.175917,0.173917,0.115417
std,0.432742,0.356432,124.056723,3869.027693,0.478381,0.380765,0.379054,0.319537
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,29.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,108.0,875.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,238.25,6317.0,1.0,0.0,0.0,0.0
max,1.0,1.0,416.0,11999.0,1.0,1.0,1.0,1.0


In [129]:
users_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 1 to 12000
Data columns (total 12 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   creation_time                       12000 non-null  datetime64[ns]
 1   name                                12000 non-null  object        
 2   email                               12000 non-null  object        
 3   last_session_creation_time          12000 non-null  datetime64[ns]
 4   opted_in_to_mailing_list            12000 non-null  int64         
 5   enabled_for_marketing_drip          12000 non-null  int64         
 6   org_id                              12000 non-null  int64         
 7   invited_by_user_id                  12000 non-null  int64         
 8   creation_source_ORG_INVITE          12000 non-null  uint8         
 9   creation_source_PERSONAL_PROJECTS   12000 non-null  uint8         
 10  creation_source_SIGNUP

User data looking good.   
Time to clean up the engagement DF before feature engineering.

#### Engagement Data

In [130]:
engagement_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB
