## Relax Data Analysis Challenge
Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

### Part 1: Metadata Collection + Anomalies Analysis
We collect basic information about the size of each data set and check whether either one contains null values. From there, we can decide whether to fill the null values, ignore them, or drop the feature altogether.

In [1]:
import pandas as pd

In [2]:
engagement_csv = pd.read_csv('takehome_user_engagement.csv')

In [3]:
users_csv = pd.read_csv('takehome_users.csv')

In [4]:
engagement_csv.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [5]:
users_csv.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [6]:
engagement_csv.shape

(207917, 3)

In [7]:
users_csv.shape

(12000, 10)

In [8]:
engagement_csv.isnull().sum()

time_stamp    0
user_id       0
visited       0
dtype: int64

In [9]:
users_csv.isnull().sum()

object_id                        0
creation_time                    0
name                             0
email                            0
creation_source                  0
last_session_creation_time    3177
opted_in_to_mailing_list         0
enabled_for_marketing_drip       0
org_id                           0
invited_by_user_id            5583
dtype: int64

The engagement file has no nulls. The users file, however, has nulls in last_session_creation_time (last login; 3,177 of 12,000 rows) and in invited_by_user_id (the id of the user who invited them to join; 5,583 of 12,000 rows).
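
The null counts are easier to judge next to the share of rows they affect. The following is a minimal sketch on a toy frame; the rows are made up for illustration, and only the column names are taken from users_csv.

```python
import pandas as pd

# Toy frame standing in for users_csv; the values are made up for illustration.
toy_users = pd.DataFrame({
    "object_id": [1, 2, 3, 4],
    "last_session_creation_time": [1398138810.0, None, 1363734892.0, None],
    "invited_by_user_id": [10803.0, None, None, None],
})

# Count of nulls and share of rows that are null, per column.
null_summary = pd.DataFrame({
    "null_count": toy_users.isnull().sum(),
    "null_share": toy_users.isnull().mean(),
})
print(null_summary)
```

Running the same two expressions on users_csv would give the share of missing values per column, which helps when deciding between filling a column and dropping it.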

The nulls in the last login field could indicate that a user never logged in. This may mean that a user signed up (and that sign-up did not count as an official login) but never used the account. We can check this hypothesis by seeing whether any creation_time value equals a last_session_creation_time value. If none do, that supports the idea that creating an account does not count as a login, and that a null in the login field means the user never logged in after creating the account.

The nulls in the invitation field most likely indicate that the user was not invited by another user and instead found the site on their own. This can be verified by checking whether any row with a null invitation has an invite-based creation_source (ORG_INVITE or GUEST_INVITE). If any do, the nulls must have a different cause.

In [10]:
null_login = users_csv[users_csv['last_session_creation_time'].isnull()]
null_login.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
7,8,2013-07-31 05:34:02,Hamilton Danielle,DanielleHamilton@yahoo.com,PERSONAL_PROJECTS,,1,1,74,
8,9,2013-11-05 04:04:24,Amsel Paul,PaulAmsel@hotmail.com,PERSONAL_PROJECTS,,0,0,302,
11,12,2014-04-17 23:48:38,Mathiesen Lærke,LaerkeLMathiesen@cuvox.de,ORG_INVITE,,0,0,130,9270.0
14,15,2013-07-16 21:33:54,Theiss Ralf,RalfTheiss@hotmail.com,PERSONAL_PROJECTS,,0,0,175,
15,16,2013-02-11 10:09:50,Engel René,ReneEngel@hotmail.com,PERSONAL_PROJECTS,,0,0,211,


In [11]:
# Take an explicit copy so that later changes do not act on a slice of users_csv.
last_login = users_csv[users_csv['last_session_creation_time'].notnull()].copy()
last_login = last_login.fillna(0)
last_login.head()


Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [12]:
len(users_csv[users_csv['last_session_creation_time'] == 0])

0

In [13]:
set(last_login['creation_time']).intersection(set(last_login['last_session_creation_time']))

set()

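The intersection is empty, but this check cannot confirm the hypothesis: creation_time holds date strings, while last_session_creation_time appears to hold Unix timestamps in seconds, so the two sets can never share a value. The values displayed above actually point the other way. For user 1, the rounded last session value 1398139000 is consistent with the creation time 2014-04-22 03:53:30 (epoch 1398138810), which suggests that account creation may itself be recorded as a session. The two columns should be converted to a common type before they are compared, and the meaning of a null last login should be confirmed with the provider of the data set.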
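
Since creation_time is stored as text and last_session_creation_time as a number, one way to compare them is to parse both into datetimes first. The following is a minimal sketch on two toy rows, assuming the numeric column holds Unix seconds; the exact epoch values are illustrative, because the notebook output shows them only in rounded form, and the column names same_as_creation and days_to_last_session are hypothetical.

```python
import pandas as pd

# Toy rows; in the notebook this would be the non-null subset of users_csv.
toy = pd.DataFrame({
    "creation_time": ["2014-04-22 03:53:30", "2013-11-15 03:45:04"],
    "last_session_creation_time": [1398138810.0, 1396237504.0],
})

# Parse both columns to datetimes.
# Assumption: the float column holds Unix seconds (no nulls in this subset).
created = pd.to_datetime(toy["creation_time"])
last_session = pd.to_datetime(
    toy["last_session_creation_time"].astype("int64"), unit="s"
)

# Row-wise comparison: did the last session start at the moment of creation?
toy["same_as_creation"] = created == last_session
# Whole days between account creation and the last recorded session.
toy["days_to_last_session"] = (last_session - created).dt.days
print(toy)
```

A row-wise comparison like this also avoids matching one user's creation time against a different user's last session, which a set intersection would allow.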

In [14]:
null_invitation = users_csv[users_csv['invited_by_user_id'].isnull()]
null_invitation.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
6,7,2012-12-16 13:24:32,Sewell Tyler,TylerSewell@jourrapide.com,SIGNUP,1356010000.0,0,1,37,
7,8,2013-07-31 05:34:02,Hamilton Danielle,DanielleHamilton@yahoo.com,PERSONAL_PROJECTS,,1,1,74,
8,9,2013-11-05 04:04:24,Amsel Paul,PaulAmsel@hotmail.com,PERSONAL_PROJECTS,,0,0,302,
10,11,2013-12-26 03:55:54,Paulsen Malthe,MaltheAPaulsen@gustr.com,SIGNUP,1388117000.0,0,0,69,
13,14,2012-10-11 16:14:33,Rivera Bret,BretKRivera@gmail.com,SIGNUP,1350058000.0,0,0,0,


In [15]:
len(users_csv[users_csv['object_id'] == 0])

0

In [16]:
set(null_invitation['creation_source'])

{'PERSONAL_PROJECTS', 'SIGNUP', 'SIGNUP_GOOGLE_AUTH'}

None of the users with a null invited_by_user_id has an invite-based creation source (ORG_INVITE or GUEST_INVITE); they all came through PERSONAL_PROJECTS, SIGNUP, or SIGNUP_GOOGLE_AUTH. This means that they were not invited by an existing user, which is why the field is null.
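
The same check can be read off a single table that counts missing and present inviter ids per creation source. This is a minimal sketch on toy rows with made-up values; the labels inviter_missing and inviter_present are hypothetical names chosen for the example.

```python
import pandas as pd

# Toy rows with made-up values; the column names match users_csv.
toy = pd.DataFrame({
    "creation_source": ["GUEST_INVITE", "ORG_INVITE", "SIGNUP",
                        "PERSONAL_PROJECTS", "SIGNUP_GOOGLE_AUTH", "ORG_INVITE"],
    "invited_by_user_id": [10803.0, 316.0, None, None, None, 1525.0],
})

# Label each row by whether the inviter id is missing.
inviter_status = toy["invited_by_user_id"].isnull().map(
    {True: "inviter_missing", False: "inviter_present"}
).rename("inviter_status")

# Rows: creation source. Columns: inviter id missing or present.
missing_by_source = pd.crosstab(toy["creation_source"], inviter_status)
print(missing_by_source)
```

If the nulls are explained entirely by the creation source, every source should have a zero in one of the two columns.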

To clean our data set, we are going to fill all null fields. In the last_session_creation_time field, 0 will indicate that no logins were recorded. We can use 0 because no other rows in our data set contain 0 as a value for this field, and it makes intuitive sense. 

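We fill invited_by_user_id with 9999 so that the placeholder stands out to outside users. Note that this value is not guaranteed to be unused: object_id represents the user id, the users file has 12,000 rows, and inviter ids above 9999 (for example 10803) appear in the data, so 9999 may also be a real user id. A value outside the id range, such as 0 (no object_id equals 0) or -1, would avoid that ambiguity.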
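
Before settling on a placeholder, it can help to test whether the value already occurs as an id. The following is a minimal sketch on toy rows with made-up values; the was_invited indicator column and the -1 fill value are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy rows with made-up values; 9999 is deliberately included as a real user id.
toy = pd.DataFrame({
    "object_id": [9998, 9999, 10000],
    "invited_by_user_id": [None, 316.0, 9999.0],
})

sentinel = 9999

# Does the sentinel collide with an existing user id or inviter id?
collides = bool(
    (toy["object_id"] == sentinel).any()
    or (toy["invited_by_user_id"] == sentinel).any()
)

# Record the missingness explicitly (hypothetical column name),
# then fill with a value that cannot be a user id.
toy["was_invited"] = toy["invited_by_user_id"].notnull().astype(int)
toy["invited_by_user_id"] = toy["invited_by_user_id"].fillna(-1)
print(collides)
print(toy)
```

Keeping a separate indicator column means a model does not have to learn that one particular id value means "no inviter".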

In [17]:
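# Assign the filled columns back instead of calling fillna(inplace=True) on a column selection.
users_csv['last_session_creation_time'] = users_csv['last_session_creation_time'].fillna(0)
users_csv['invited_by_user_id'] = users_csv['invited_by_user_id'].fillna(9999)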
users_csv.isnull().sum()

object_id                     0
creation_time                 0
name                          0
email                         0
creation_source               0
last_session_creation_time    0
opted_in_to_mailing_list      0
enabled_for_marketing_drip    0
org_id                        0
invited_by_user_id            0
dtype: int64

### Part 2: Exploratory Data Analysis
We create the target column based on the definition of an 'adopted user' given above. We then check whether any variables are strongly linearly correlated with the target or with each other, and decide which algorithms to use for prediction and for feature importance.

In [18]:
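# Parse the timestamps and keep only the columns needed for the adoption label.
timestamp = pd.to_datetime(engagement_csv['time_stamp'])
login_df = pd.DataFrame(timestamp)
login_df['user_id'] = engagement_csv['user_id']

# Whole-day gap between each login and the previous row (0 for the first row).
# Note: the gap is taken between consecutive rows regardless of user_id, so the
# first row of each user holds the gap to the previous user's last login.
login_df['delta'] = login_df['time_stamp'].diff().dt.days.fillna(0).astype(int)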

In [19]:
login_df.head()

Unnamed: 0,time_stamp,user_id,delta
0,2014-04-22 03:53:30,1,0
1,2013-11-15 03:45:04,2,-159
2,2013-11-29 03:45:04,2,14
3,2013-12-09 03:45:04,2,10
4,2013-12-25 03:45:04,2,16
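
The -159 in row 1 is the gap between user 1's only login and user 2's first login, not a gap between two logins of the same user. One way to keep the gaps within each user is a grouped difference. This is a minimal sketch on the first four rows shown above; the column name delta_per_user is hypothetical.

```python
import pandas as pd

# First four rows of the login table shown above.
toy_logins = pd.DataFrame({
    "time_stamp": pd.to_datetime([
        "2014-04-22 03:53:30",
        "2013-11-15 03:45:04",
        "2013-11-29 03:45:04",
        "2013-12-09 03:45:04",
    ]),
    "user_id": [1, 2, 2, 2],
})

# Sort within each user, then take the gap to that user's previous login.
# The first login of every user has no predecessor and stays NaN.
toy_logins = toy_logins.sort_values(["user_id", "time_stamp"])
toy_logins["delta_per_user"] = (
    toy_logins.groupby("user_id")["time_stamp"].diff().dt.days
)
print(toy_logins)
```

Leaving the first gap of each user as NaN, rather than 0, keeps it from being mistaken for two logins on the same day.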


In [20]:
login_df[abs(login_df['delta']) <= 5].head()

Unnamed: 0,time_stamp,user_id,delta
0,2014-04-22 03:53:30,1,0
8,2014-02-08 03:45:04,2,5
9,2014-02-09 03:45:04,2,1
10,2014-02-13 03:45:04,2,4
11,2014-02-16 03:45:04,2,3


In [21]:
# Flag row i with 1 when the day gaps stored at rows i, i+1 and i+2 are small:
# gap[i] <= 5, gap[i] + gap[i+1] < 7, and the sum of all three gaps <= 7.
# Note: the gaps come from consecutive rows, so a window can span two users.
adopted_count = []

for i in range(len(login_df) - 2):
    gap0 = abs(login_df['delta'][i])
    gap1 = abs(login_df['delta'][i + 1])
    gap2 = abs(login_df['delta'][i + 2])
    is_adopted = gap0 <= 5 and gap0 + gap1 < 7 and gap0 + gap1 + gap2 <= 7
    adopted_count.append(int(is_adopted))

# The last two rows have no full look-ahead window, so they are labelled 0.
adopted_count.extend([0, 0])

In [22]:
login_df['adopted'] = adopted_count

In [23]:
set(login_df['adopted'])

{0, 1}
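
The loop above sums three consecutive row gaps, which is stricter than the stated definition and does not separate users. For example, rows 8 to 10 in the In [20] output show user 2 logging in on 2014-02-08, 2014-02-09 and 2014-02-13, three separate days within a six-day span. The following is a minimal sketch of a per-user check that works on distinct calendar days. It assumes that "a seven-day period" means the third login day falls at most six days after the first; the function name is_adopted and its parameters are hypothetical.

```python
import pandas as pd

def is_adopted(timestamps, window_days=7, min_days=3):
    """Return True if the logins cover `min_days` distinct calendar days
    inside some period of `window_days` consecutive days."""
    # Distinct calendar days, sorted ascending.
    days = (
        pd.Series(pd.to_datetime(timestamps))
        .dt.normalize()
        .drop_duplicates()
        .sort_values()
    )
    # Days between each login day and the one (min_days - 1) positions earlier.
    span = (days - days.shift(min_days - 1)).dt.days
    # A span of at most window_days - 1 fits inside a window_days-long period.
    return bool((span <= window_days - 1).any())

# Toy log built from rows shown in the notebook output above.
toy_logins = pd.DataFrame({
    "user_id": [1, 2, 2, 2, 2, 2],
    "time_stamp": [
        "2014-04-22 03:53:30",
        "2013-11-15 03:45:04",
        "2013-11-29 03:45:04",
        "2014-02-08 03:45:04",
        "2014-02-09 03:45:04",
        "2014-02-13 03:45:04",
    ],
})

# One boolean per user: True if the user meets the adoption definition.
adopted = toy_logins.groupby("user_id")["time_stamp"].apply(is_adopted)
print(adopted)
```

Because the function returns one value per user, the result can be joined to the users table on object_id as the target column. If the data provider treats the seven-day boundary differently, only window_days needs to change.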

In [24]:
dict(login_df.groupby('user_id')['adopted'].sum())

{1: 0,
 2: 0,
 3: 0,
 4: 0,
 5: 0,
 6: 0,
 7: 0,
 10: 234,
 11: 0,
 13: 0,
 14: 0,
 17: 0,
 19: 0,
 20: 0,
 21: 0,
 22: 0,
 23: 0,
 24: 0,
 25: 0,
 27: 0,
 28: 0,
 29: 0,
 30: 0,
 31: 0,
 33: 0,
 36: 0,
 37: 0,
 41: 0,
 42: 308,
 43: 2,
 44: 0,
 45: 0,
 46: 0,
 47: 0,
 48: 0,
 49: 0,
 50: 0,
 51: 0,
 53: 2,
 54: 0,
 55: 0,
 56: 0,
 57: 0,
 58: 0,
 59: 0,
 60: 0,
 61: 0,
 63: 329,
 64: 0,
 65: 0,
 66: 0,
 67: 0,
 68: 0,
 69: 487,
 72: 0,
 73: 0,
 74: 22,
 75: 0,
 76: 0,
 77: 0,
 78: 0,
 80: 0,
 81: 41,
 82: 145,
 83: 0,
 84: 0,
 85: 0,
 86: 0,
 87: 70,
 88: 0,
 89: 0,
 90: 0,
 91: 0,
 92: 0,
 94: 0,
 95: 0,
 96: 0,
 97: 0,
 98: 0,
 99: 0,
 100: 0,
 101: 0,
 103: 0,
 105: 0,
 106: 0,
 107: 0,
 109: 0,
 110: 0,
 111: 0,
 112: 0,
 113: 0,
 114: 0,
 115: 0,
 116: 0,
 117: 0,
 119: 0,
 121: 0,
 123: 0,
 124: 0,
 125: 0,
 126: 0,
 127: 0,
 128: 0,
 132: 0,
 133: 32,
 135: 4,
 136: 0,
 138: 0,
 139: 0,
 140: 0,
 141: 2,
 142: 0,
 143: 0,
 144: 0,
 146: 48,
 147: 0,
 150: 0,
 151: 0,
 153: 117,

In [25]:
login_df

Unnamed: 0,time_stamp,user_id,delta,adopted
0,2014-04-22 03:53:30,1,0,0
1,2013-11-15 03:45:04,2,-159,0
2,2013-11-29 03:45:04,2,14,0
3,2013-12-09 03:45:04,2,10,0
4,2013-12-25 03:45:04,2,16,0
5,2013-12-31 03:45:04,2,6,0
6,2014-01-08 03:45:04,2,8,0
7,2014-02-03 03:45:04,2,26,0
8,2014-02-08 03:45:04,2,5,0
9,2014-02-09 03:45:04,2,1,0
