## Springboard take-home challenge 2: Relax Inc.

<div class="span5 alert alert-info">
**DESCRIPTION OF THE DATA:**  
**takehome_users.csv:**  
A user table with data on 12,000 users who signed up for the product in the last two years. This table includes:  
● name:  the user's name  
● object_id:   the user's id  
● email:  email address  
● creation_source:   how their account was created.  This takes on one of 5 values:  
○ PERSONAL_PROJECTS:  invited to join another user's personal workspace  
○ GUEST_INVITE:  invited to an organization as a guest (limited permissions)  
○ ORG_INVITE:  invited to an organization (as a full member)  
○ SIGNUP:  signed up via the website  
○ SIGNUP_GOOGLE_AUTH:  signed up using Google Authentication (using a Google email account for their login id)  
● creation_time:  when they created their account  
● last_session_creation_time:  unix timestamp of last login  
● opted_in_to_mailing_list:  whether they have opted into receiving marketing emails  
● enabled_for_marketing_drip:  whether they are on the regular marketing email drip  
● org_id:   the organization (group of users) they belong to  
● invited_by_user_id:  which user invited them to join (if applicable).  

**takehome_user_engagement.csv:**  
A usage summary table that has a row for each day that a user logged into the product.  
</div>

<div class="span5 alert alert-info">
**THE ASSIGNMENT:**  
Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.
We suggest spending 1 or 2 hours on this, but you're welcome to spend more or less. Please send us a brief writeup of your findings (the more concise, the better; no more than one page), along with any summary tables, graphs, code, or queries that can help us understand your approach. Please note any factors you considered or investigation you did, even if they did not pan out. Feel free to identify any further research or data you think would be valuable.
</div>

In [84]:
# import packages (json and json_normalize were imported but never used, so they are dropped)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Note: 'takehome_users.csv' is not UTF-8 encoded, so `read_csv` needs an explicit encoding to open it. Stack Overflow offers several solutions; passing `encoding='latin1'` was the easiest and seems to work.
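One way to make that choice explicit is to try UTF-8 first and fall back only when decoding fails. The sketch below is illustrative: the helper name `read_csv_with_fallback` and its `encodings` parameter are assumptions, not part of the original notebook.

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in turn; return the DataFrame and the encoding that worked."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            # remember the failure and move on to the next candidate encoding
            last_error = err
    raise last_error
```

Because latin1 maps every byte to some character, the fallback never raises; it can still mis-decode text that was written in a different encoding, so a quick look at a few names after loading is worthwhile.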

In [85]:
# load the users table (CSV) into a data frame
# read_csv raised an error until an encoding was specified
filepath1 = 'takehome_users.csv'
df1= pd.read_csv(filepath1, encoding='latin1')
print('type:', type(df1))
print('shape', df1.shape)
print('')
print('info',df1.info())
print('')
print('columns:',df1.columns)
users = df1
users.head()

type: <class 'pandas.core.frame.DataFrame'>
shape (12000, 10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB
info None

columns: Index(['object_id', 'creation_time', 'name', 'email', 'creation_source',
       'last_session_creation_time', 'opted_in_to_mailing_list',
       'enabled_for_marketing_drip', 'org_id', 'invited_by_user_id'],
      dtype='object')

| | object_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 1398139000.0 | 1 | 0 | 11 | 10803.0 |
| 1 | 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | ORG_INVITE | 1396238000.0 | 0 | 0 | 1 | 316.0 |
| 2 | 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | ORG_INVITE | 1363735000.0 | 0 | 0 | 94 | 1525.0 |
| 3 | 4 | 2013-05-21 08:09:28 | Clausen Nicklas | NicklasSClausen@yahoo.com | GUEST_INVITE | 1369210000.0 | 0 | 0 | 1 | 5151.0 |
| 4 | 5 | 2013-01-17 10:14:20 | Raw Grace | GraceRaw@yahoo.com | GUEST_INVITE | 1358850000.0 | 0 | 0 | 193 | 5240.0 |


In [86]:
filepath2 = 'takehome_user_engagement.csv'
df2= pd.read_csv(filepath2, encoding='latin1')
print('type:', type(df2))
print('shape', df2.shape)
print('')
print('info',df2.info())
print('')
print('columns:',df2.columns)
engagement = df2
engagement.head()

type: <class 'pandas.core.frame.DataFrame'>
shape (207917, 3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null object
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.8+ MB
info None

columns: Index(['time_stamp', 'user_id', 'visited'], dtype='object')


| | time_stamp | user_id | visited |
|---|---|---|---|
| 0 | 2014-04-22 03:53:30 | 1 | 1 |
| 1 | 2013-11-15 03:45:04 | 2 | 1 |
| 2 | 2013-11-29 03:45:04 | 2 | 1 |
| 3 | 2013-12-09 03:45:04 | 2 | 1 |
| 4 | 2013-12-25 03:45:04 | 2 | 1 |


In [90]:
#Convert time column to datetime
engagement.time_stamp = pd.to_datetime(engagement['time_stamp'])
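The user table has two time columns as well: `creation_time` is a date string, while `last_session_creation_time` is described as a unix timestamp. A minimal sketch of converting both, using a toy frame named `sample` (an assumption for illustration) and assuming the timestamp is in seconds:

```python
import pandas as pd

# Toy stand-in for the users table; values mirror the first rows printed above
sample = pd.DataFrame({
    "creation_time": ["2014-04-22 03:53:30", "2013-11-15 03:45:04"],
    "last_session_creation_time": [1398139000.0, None],  # None = never logged in
})

# Parse the date strings
sample["creation_time"] = pd.to_datetime(sample["creation_time"])

# Interpret the floats as seconds since the epoch; missing values become NaT
sample["last_session_creation_time"] = pd.to_datetime(
    sample["last_session_creation_time"], unit="s"
)

print(sample.dtypes)
```

Keeping missing logins as `NaT` rather than filling them with 0 avoids turning "never logged in" into a 1970 date.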

In [88]:
print(len(users),'users signed up for the product')
print(len(engagement.user_id.unique()), 'users logged in')
print(len(users)-len(engagement.user_id.unique()), 'never logged in')
print(len(users[users['last_session_creation_time'].isnull()]), 'have never logged in')

visits = engagement['user_id'].value_counts()
visits3 = visits[visits >= 3]
print(len(visits3), 'have logged in at least 3 times')

12000 users signed up for the product
8823 users logged in
3177 never logged in
3177 have never logged in
2248 have logged in at least 3 times


In [142]:
# Count users who logged in 3x in < 7 days
import datetime
seven_days = datetime.timedelta(7)

# frequent_users = ids of users who logged in at least 3 times ever
# (named separately so the `users` data frame is not overwritten)
visits = engagement['user_id'].value_counts()
frequent_users = visits[visits >= 3].index

adoption = []
for user in frequent_users:
    # make a data frame of all engagement for the user
    df = engagement[engagement.user_id == user].reset_index().sort_values(by='time_stamp')
    for i in range(0, len(df)-2):
        # time between a login and the login two rows later
        time_diff = df.time_stamp[i+2] - df.time_stamp[i]
        status = time_diff < seven_days
    adoption.append(status)

user_list = pd.DataFrame({'user_id': frequent_users, 'adopted': adoption})
adopted_list = user_list[user_list.adopted]
print(len(adopted_list), 'users adopted the product (used it 3 times in < a week)')

1043 users adopted the product (used it 3 times in < a week)


In [None]:
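# CLEAN UP THE DATA
# start with null values: per info() above, only invited_by_user_id and
# last_session_creation_time contain nulls
 
df1['invited_by_user_id'] = df1['invited_by_user_id'].fillna(0)                  # 0 = not invited
df1['last_session_creation_time'] = df1['last_session_creation_time'].fillna(0)  # 0 = never logged in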
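In this table a null carries meaning: a missing `invited_by_user_id` means the user was not invited, and a missing `last_session_creation_time` means the user never logged in. One option, sketched below on a toy frame, is to record that in indicator columns before filling. The names `sample`, `was_invited`, and `ever_logged_in` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the users table (illustrative values only)
sample = pd.DataFrame({
    "object_id": [1, 2, 3],
    "invited_by_user_id": [10803.0, None, 316.0],
    "last_session_creation_time": [1398139000.0, None, 1396238000.0],
})

# Record what the nulls mean before overwriting them
sample["was_invited"] = sample["invited_by_user_id"].notna()
sample["ever_logged_in"] = sample["last_session_creation_time"].notna()

# With the indicators in place, the id column can be filled and stored as integers
sample["invited_by_user_id"] = sample["invited_by_user_id"].fillna(0).astype(int)

print(sample)
```

The indicator columns can then be used directly as candidate predictors, while the filled id column stays free of missing values.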