# Interview Challenge - Relax

The data is available as two attached CSV files: takehome_user_engagement.csv, takehome_users.csv

**1]** A user table ("takehome_users") with data on 12,000 users who signed up for the product in the last two years. 

This table includes:

name: the user's name

object_id: the user's id

email: email address

creation_source: how their account was created. This takes on one of 5 values:
○ PERSONAL_PROJECTS: invited to join another user's personal workspace
○ GUEST_INVITE: invited to an organization as a guest (limited permissions)
○ ORG_INVITE: invited to an organization (as a full member)
○ SIGNUP: signed up via the website
○ SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id)

creation_time: when they created their account

last_session_creation_time: unix timestamp of last login

opted_in_to_mailing_list: whether they have opted into receiving marketing emails

enabled_for_marketing_drip: whether they are on the regular marketing email drip

org_id: the organization (group of users) they belong to

invited_by_user_id: which user invited them to join (if applicable).

**2]** A usage summary table ("takehome_user_engagement") that has a row for each day that a user logged into the product.


**Goal**

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

In [317]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import os
import datetime

In [318]:
cwd = os.getcwd()

In [319]:
# import csvs
users = pd.read_csv(f"{cwd}/data/takehome_users.csv", encoding='latin-1')
user_engagement = pd.read_csv(f"{cwd}/data/takehome_user_engagement.csv")

In [320]:
users.head(5)

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [321]:
# explore and clean data

# null last_session_creation_time likely means still active? or was never active after creation?
# check rows where creation_time and last_session_creation_time are same

# are non-null invited_by_user_id more likely to remain?
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   object_id                   12000 non-null  int64  
 1   creation_time               12000 non-null  object 
 2   name                        12000 non-null  object 
 3   email                       12000 non-null  object 
 4   creation_source             12000 non-null  object 
 5   last_session_creation_time  8823 non-null   float64
 6   opted_in_to_mailing_list    12000 non-null  int64  
 7   enabled_for_marketing_drip  12000 non-null  int64  
 8   org_id                      12000 non-null  int64  
 9   invited_by_user_id          6417 non-null   float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


In [322]:
# convert unix time 'last_session_creation_time' (seconds) to datetime
# pd.to_datetime is vectorized, maps null to NaT, and does not depend on the local timezone
users['new_last_session_creation_time'] = pd.to_datetime(users['last_session_creation_time'], unit='s')

In [323]:
user_engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [324]:
user_engagement.head(5)

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [325]:
user_engagement['time_stamp'] = pd.to_datetime(user_engagement['time_stamp'])

In [326]:
user_engagement['date'] = user_engagement['time_stamp'].dt.date

In [309]:
user_engagement['user_id'].nunique()

8823
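
The 8823 unique `user_id` values match the 8823 non-null `last_session_creation_time` entries reported by `users.info()`, which suggests that a null last session means the user never logged in. A minimal sketch of how that could be checked is below. The two frames are tiny made-up stand-ins for the real tables, and the names `no_session`, `no_logins` and `same_users` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two tables (made-up values, not the real data).
users = pd.DataFrame({
    "object_id": [1, 2, 3, 4],
    "last_session_creation_time": [1398138810.0, np.nan, 1363734892.0, np.nan],
})
user_engagement = pd.DataFrame({"user_id": [1, 3, 3], "visited": [1, 1, 1]})

# Users with no recorded last session.
no_session = users["last_session_creation_time"].isna()
# Users that never appear in the engagement log.
no_logins = ~users["object_id"].isin(user_engagement["user_id"])

# True only when both masks pick out exactly the same users.
same_users = bool((no_session == no_logins).all())
print(same_users)
```

If `same_users` comes out true on the real data, the null rows can be treated as users with zero logins rather than as missing values to impute.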

In [329]:
user_engagement

Unnamed: 0,time_stamp,user_id,visited,date
0,2014-04-22 03:53:30,1,1,2014-04-22
1,2013-11-15 03:45:04,2,1,2013-11-15
2,2013-11-29 03:45:04,2,1,2013-11-29
3,2013-12-09 03:45:04,2,1,2013-12-09
4,2013-12-25 03:45:04,2,1,2013-12-25
...,...,...,...,...
207912,2013-09-06 06:14:15,11996,1,2013-09-06
207913,2013-01-15 18:28:37,11997,1,2013-01-15
207914,2014-04-27 12:45:16,11998,1,2014-04-27
207915,2012-06-02 11:55:59,11999,1,2012-06-02


**pick back up here**

In [335]:
new = user_engagement[['user_id', 'date', 'visited']].groupby(['user_id'])
new[['user_id', 'date']].value_counts()

user_id  date      
1        2014-04-22    1
2        2013-11-15    1
         2013-11-29    1
         2013-12-09    1
         2013-12-25    1
                      ..
11996    2013-09-06    1
11997    2013-01-15    1
11998    2014-04-27    1
11999    2012-06-02    1
12000    2014-01-26    1
Length: 207917, dtype: int64
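
The length of this output equals the number of rows in `user_engagement`, which is consistent with each user/day pair appearing only once. A small sketch of an explicit check, plus a de-duplication step in case repeats exist, might look like this. The toy log and the names `log`, `dup_mask` and `daily` are assumptions for illustration.

```python
import pandas as pd

# Toy engagement log (made-up rows); the last row repeats a user/day pair.
log = pd.DataFrame({
    "user_id": [2, 2, 2, 2],
    "time_stamp": pd.to_datetime([
        "2013-11-15 03:45:04", "2013-11-29 03:45:04",
        "2013-12-09 03:45:04", "2013-12-09 18:00:00",
    ]),
})
log["date"] = log["time_stamp"].dt.date

# Flag every row whose user/day pair occurs more than once.
dup_mask = log.duplicated(subset=["user_id", "date"], keep=False)
n_duplicate_rows = int(dup_mask.sum())

# Keep one row per user per day so later counts are counts of distinct days.
daily = log.drop_duplicates(subset=["user_id", "date"])
```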

In [310]:
# user_engagement has 8823 unique user_id, so null last_session_creation_time likely means not active user?
d = user_engagement['user_id'].value_counts()

In [311]:
d = pd.DataFrame(d)

In [312]:
d = d[d['user_id'] >= 3]
d = d.rename(columns={'user_id':'visits'})
d.head(5)
# index is user_id

Unnamed: 0,visits
3623,606
906,600
1811,593
7590,590
8068,585


In [313]:
d.reset_index()

Unnamed: 0,index,visits
0,3623,606
1,906,600
2,1811,593
3,7590,590
4,8068,585
...,...,...
2243,11778,3
2244,241,3
2245,4187,3
2246,8109,3


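In [315]:
# d.reset_index() above was not assigned back, so d has no 'index' column (d['index'] raises a KeyError)
# the user ids are still in d.index; isin() builds a row mask for users with >= 3 visits
user_engagement['user_id'].isin(d.index)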

In [274]:
unique_id = user_engagement['user_id'].unique()

In [289]:
# index is user_id, visits is total sum of visited
d

Unnamed: 0,visits
3623,606
906,600
1811,593
7590,590
8068,585
...,...
11778,3
241,3
4187,3
8109,3


In [277]:
# 8823 unique IDs, 2248 of them with >= 3 visits. Those visits still need to fall within one 7-day period
uh = set(unique_id).intersection(d.index)

In [293]:
uh = list(uh)

In [296]:
maybe_adopted_idx

Unnamed: 0,time_stamp,visited,date
3623,2012-06-27 14:34:33,1,2012-06-27
3623,2012-07-01 14:34:33,1,2012-07-01
3623,2012-07-02 14:34:33,1,2012-07-02
3623,2012-07-04 14:34:33,1,2012-07-04
3623,2012-07-08 14:34:33,1,2012-07-08
...,...,...,...
8109,2014-04-10 10:05:29,1,2014-04-10
8109,2014-04-28 10:05:29,1,2014-04-28
8564,2013-12-17 14:17:38,1,2013-12-17
8564,2014-01-06 14:17:38,1,2014-01-06
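
The cell that defines `maybe_adopted_idx` is not shown. Judging only by the output (rows indexed by `user_id`, restricted to users with at least three visits), one way such a frame could be built is sketched below. This is an assumption about the construction, not the original code, and the toy log and the name `candidate_ids` are made up for illustration.

```python
import pandas as pd

# Toy engagement log (made-up rows, not the real data).
user_engagement = pd.DataFrame({
    "time_stamp": pd.to_datetime([
        "2012-06-27 14:34:33", "2012-07-01 14:34:33", "2012-07-02 14:34:33",
        "2014-04-22 03:53:30",
    ]),
    "user_id": [3623, 3623, 3623, 1],
    "visited": [1, 1, 1, 1],
})
user_engagement["date"] = user_engagement["time_stamp"].dt.date

# Users with at least three logged visits overall (candidates only).
visits = user_engagement["user_id"].value_counts()
candidate_ids = visits[visits >= 3].index

# Keep the candidates' rows and index them by user_id.
maybe_adopted_idx = (
    user_engagement[user_engagement["user_id"].isin(candidate_ids)]
    .set_index("user_id")
    .sort_values("time_stamp")
)
```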


In [None]:
# were they invited by a user (see creation_source / invited_by_user_id)? is the inviting user also a very active or retained user?
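
A possible way to turn the inviter question into features is sketched here. It assumes a set of adopted user ids is already available. `adopted_ids`, `was_invited` and `inviter_adopted` are hypothetical names, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy users table (made-up values); NaN means the user was not invited.
users = pd.DataFrame({
    "object_id": [1, 2, 3, 4],
    "invited_by_user_id": [np.nan, 1.0, 1.0, 3.0],
})
# Hypothetical set of adopted user ids, e.g. produced by the adoption step.
adopted_ids = {1, 2}

# Was the user invited by anyone?
users["was_invited"] = users["invited_by_user_id"].notna()
# Is the inviter an adopted user? (False for users who were not invited.)
users["inviter_adopted"] = users["invited_by_user_id"].isin(adopted_ids)
# Label column, for comparing the groups.
users["adopted"] = users["object_id"].isin(adopted_ids)

# Adoption rate split by whether the inviter was adopted.
rate_by_inviter = users.groupby("inviter_adopted")["adopted"].mean()
```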


# Find adopted users

*Users who have logged in 3 separate DAYS in one 7 day period*
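
One way to apply this definition is sketched below on made-up data. It assumes that "7 day period" means a window of seven consecutive calendar days, so a user's first and third login days may be at most 6 days apart. The names `days`, `third_day` and `adopted_ids` are introduced for illustration.

```python
import pandas as pd

# Toy engagement log (made-up rows, not the real data).
user_engagement = pd.DataFrame({
    "time_stamp": pd.to_datetime([
        "2014-01-01 09:00", "2014-01-03 09:00", "2014-01-06 09:00",  # user 1
        "2014-01-01 09:00", "2014-01-10 09:00", "2014-01-20 09:00",  # user 2
        "2014-02-01 08:00", "2014-02-01 20:00", "2014-02-02 08:00",  # user 3
    ]),
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

# One row per user per calendar day, sorted so shifts compare neighbours.
days = (
    user_engagement
    .assign(day=user_engagement["time_stamp"].dt.normalize())
    .drop_duplicates(subset=["user_id", "day"])
    .sort_values(["user_id", "day"])
)

# For each login day, look up the same user's login day two rows later.
third_day = days.groupby("user_id")["day"].shift(-2)

# Three distinct days fit in one 7-day window when the first and third
# are at most 6 days apart (assumed reading of the definition).
in_window = (third_day - days["day"]) <= pd.Timedelta(days=6)

adopted_ids = set(days.loc[in_window, "user_id"])
```

In this toy log only user 1 qualifies: user 2's logins are too far apart, and user 3 logged in on only two distinct days. On the real tables the label could then be attached with something like `users['adopted'] = users['object_id'].isin(adopted_ids)`.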

In [None]:
# model feature importance
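
Before fitting any model, a first look at which factors move with adoption can be had from pandas alone. The sketch below is not a model-based feature importance. It shows per-category adoption rates and simple correlations on a made-up table, with `adopted` as a hypothetical label column.

```python
import pandas as pd

# Toy modelling table (made-up values); "adopted" is the hypothetical label.
df = pd.DataFrame({
    "creation_source": ["ORG_INVITE", "ORG_INVITE", "SIGNUP", "SIGNUP",
                        "GUEST_INVITE", "GUEST_INVITE"],
    "opted_in_to_mailing_list": [1, 0, 0, 1, 1, 0],
    "adopted": [0, 0, 0, 1, 1, 1],
})

# Adoption rate per creation source, highest first.
rate_by_source = (
    df.groupby("creation_source")["adopted"].mean().sort_values(ascending=False)
)

# One-hot encode the categorical column so every feature is numeric.
features = pd.get_dummies(
    df.drop(columns="adopted"), columns=["creation_source"], dtype=int
)

# Correlation of each feature with the label, as a rough first ranking.
corr_with_label = features.corrwith(df["adopted"]).sort_values(
    key=abs, ascending=False
)
```

The ranking from `corr_with_label` only captures linear, one-feature-at-a-time relationships, so it is best read as a sanity check alongside whatever model is fitted later.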