**1. Preprocessing**
- EDA
    - explore intended features
- Explain general setup of feature engineering
- Use of scientific literature supporting the setup
    - clustering + model training per cluster
- Rationale for choice of final attributes
- Preprocess data
    - fix for NaN values
    - variables -> columns

**2. Learn using the dataset**
- Feature Engineered model
    - train/val/test split
- Temporal Model
- Benchmark Model
- Evaluation (validation/test)
- Illustrate performance with graphs

**3. Evaluate and reflect on results**
- Analyse results using statistics
- Analyse results by interpretation
- Pros and cons of different approaches

In [16]:
import pandas as pd
import numpy as np
import datetime

# to be able to use .head(100) to see more rows of df
pd.set_option("display.max_rows", 100, "display.max_columns", None)

In [17]:
df = pd.read_csv('dataset_mood_smartphone.csv', index_col=0)
df['time'] = pd.to_datetime(df['time'])
df['date'] = df['time'].dt.date
df

,id,time,variable,value,date
1,AS14.01,2014-02-26 13:00:00.000,mood,6.000,2014-02-26
2,AS14.01,2014-02-26 15:00:00.000,mood,6.000,2014-02-26
3,AS14.01,2014-02-26 18:00:00.000,mood,6.000,2014-02-26
4,AS14.01,2014-02-26 21:00:00.000,mood,7.000,2014-02-26
5,AS14.01,2014-02-27 09:00:00.000,mood,6.000,2014-02-27
...,...,...,...,...,...
2770399,AS14.30,2014-04-11 07:51:16.948,appCat.weather,8.032,2014-04-11
2772465,AS14.30,2014-04-19 11:00:32.747,appCat.weather,3.008,2014-04-19
2774026,AS14.30,2014-04-26 10:19:07.434,appCat.weather,7.026,2014-04-26
2774133,AS14.30,2014-04-27 00:44:48.450,appCat.weather,23.033,2014-04-27


In [27]:
# date to predict: the day after the date with the most entries
ydate = df.groupby('date')['time'].count().sort_values().index[-1] + datetime.timedelta(days=1)

# keep the 7 days before the date to predict, plus that date itself
df_dates = df[(df['date'] >= (ydate - datetime.timedelta(days=7))) & (df['date'] <= ydate)]
print(f'date to predict: {ydate}, dates in df: {df_dates["date"].unique()}')

date to predict: 2014-04-23, dates in df: [datetime.date(2014, 4, 16) datetime.date(2014, 4, 17)
 datetime.date(2014, 4, 18) datetime.date(2014, 4, 19)
 datetime.date(2014, 4, 20) datetime.date(2014, 4, 21)
 datetime.date(2014, 4, 22) datetime.date(2014, 4, 23)]
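
The cell above anchors the prediction date on the single busiest day. A minimal sketch of the alternative noted in the steps (picking the 7-day window with the most data) could look like the following. The helper name `best_window_end` and the toy dates are illustrative assumptions, not part of the notebook.

```python
import datetime
import pandas as pd

def best_window_end(dates: pd.Series, window_days: int = 7) -> datetime.date:
    """Return the last date of the `window_days`-day window holding the most rows."""
    # rows per calendar day, with missing days filled in as zero
    counts = pd.to_datetime(dates).dt.normalize().value_counts().sort_index()
    counts = counts.asfreq("D", fill_value=0)
    # total rows in each trailing window; incomplete windows are dropped
    window_totals = counts.rolling(window_days).sum().dropna()
    return window_totals.idxmax().date()

# toy data (made up): 10 days, with 3 to 9 April carrying the most rows
toy_dates = pd.Series(
    [datetime.date(2014, 4, d)
     for d in (1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10)]
)
window_end = best_window_end(toy_dates)
# the day to predict is the day after the window
ydate_sketch = window_end + datetime.timedelta(days=1)
```

Applied to the real data, this would be called as `best_window_end(df['date'])`; whether row count is the right measure of "most data" (rather than, say, the number of users with a mood entry) is left open here.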


Steps:
- define day of prediction - to be improved
    - get 7 days with most data - TO DO
    - day after that is the day of prediction
- remove data from other dates - done
- calculate hours of sleep per day per user - done
- remove column 'time' - done
- choose which 'variables' we want as columns - done (changeable)
- transform column 'variable' into separate columns - done
    - drop column 'variable' - done
    - drop duplicate rows - done
- remove outliers from columns - TO DO
- aggregate (mean) score columns per user per day - done
- aggregate (sum) time/amount columns - done
- check dates for weekend or not - TO DO
- create feature based on mood in past dates - TO DO
- create 'social' score - TO DO
    - based on call/sms/appCat.communication
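
Two of the open steps (the weekend check and a feature based on mood in past dates) can be sketched on a per-user-per-day frame. This is only one possible setup: the column names `is_weekend`, `mood_prev` and `mood_mean_prev3`, the 3-day window and the toy values are assumptions.

```python
import pandas as pd

# toy per-user-per-day frame (values made up), shaped like the aggregated data
toy = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2014-04-18", "2014-04-19", "2014-04-20", "2014-04-19", "2014-04-21"]
    ),
    "mood": [7.0, 8.0, 6.5, 5.0, 6.0],
})
toy = toy.sort_values(["id", "date"]).reset_index(drop=True)

# weekend flag: dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday
toy["is_weekend"] = toy["date"].dt.dayofweek >= 5

# mood on the previous recorded day of the same user (NaN for a user's first row)
toy["mood_prev"] = toy.groupby("id")["mood"].shift(1)

# mean mood over up to 3 earlier recorded days, excluding the current day
toy["mood_mean_prev3"] = toy.groupby("id")["mood"].transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)
```

Note that `shift(1)` looks at the previous recorded row, not the previous calendar day, so gaps in a user's data would need separate handling. In the notebook the `date` column holds `datetime.date` objects and would need `pd.to_datetime` before `.dt` can be used.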

In [28]:
# calculate hours of sleep per day per user
sleep_per_user = []

# iterate over unique users
for user in df_dates['id'].unique():
    
    # gather user's data
    df_user = df_dates[df_dates['id'] == user]
    
    # iterate over unique dates
    for day in df_user['date'].unique():
        
        # extract all the times of measurement on this day
        # (in frame order, which is not necessarily chronological)
        times = df_user[df_user['date'] == day]['time'].values
        
        # calculate sleep as the largest interval (in whole hours) between consecutive rows, if more than one measurement
        if len(times) > 1:
            sleep = int(max(abs(x - y) for (x, y) in zip(times[1:], times[:-1])) / np.timedelta64(1, 'h'))
        
        # otherwise, set sleep to 8 (changeable)
        else:
            sleep = 8
            
        sleep_per_user.append({'id':user, 'date':day, 'variable':'sleep', 'value': sleep})
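
The gaps above are taken between rows in frame order. The raw data appears to be grouped by variable (the first rows are all `mood`, the last all `appCat.weather`), so neighbouring rows need not be neighbours in time, which would be consistent with the sleep values of 21 to 23 hours in the outputs. A minimal sketch of the same idea on chronologically sorted timestamps, with a hypothetical helper `longest_gap_hours` and made-up times:

```python
import numpy as np
import pandas as pd

def longest_gap_hours(times: pd.Series, default: float = 8.0) -> float:
    """Longest gap in hours between chronologically consecutive timestamps."""
    ordered = np.sort(pd.to_datetime(times).to_numpy())
    if len(ordered) < 2:
        # same fallback as the notebook: assume 8 hours
        return default
    gaps = np.diff(ordered)
    return float(gaps.max() / np.timedelta64(1, "h"))

# toy timestamps for one user on one day, deliberately out of order
toy_times = pd.Series(pd.to_datetime([
    "2014-04-16 08:00", "2014-04-16 00:15", "2014-04-16 13:00",
    "2014-04-16 18:00", "2014-04-16 23:45", "2014-04-16 10:30",
]))

gap_sorted = longest_gap_hours(toy_times)

# for comparison: the largest gap between rows in their original order
raw = toy_times.to_numpy()
gap_unsorted = float(np.abs(np.diff(raw)).max() / np.timedelta64(1, "h"))
```

On the toy data the sorted gap is the 00:15 to 08:00 stretch, while the unsorted gap is inflated by jumps between out-of-order rows. Gaps are still measured within a single calendar day, so sleep that spans midnight is not captured by either version.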

In [29]:
# add sleep per user to the df and remove column time
df_dates = pd.concat([df_dates, pd.DataFrame(sleep_per_user)])
df_dates = df_dates.drop('time', axis=1)
df_dates = df_dates.sort_values(['id', 'date'])
df_dates.head(100)

,id,variable,value,date
131,AS14.01,mood,7.0,2014-04-16
132,AS14.01,mood,8.0,2014-04-16
133,AS14.01,mood,7.0,2014-04-16
134,AS14.01,mood,7.0,2014-04-16
135,AS14.01,mood,7.0,2014-04-16
5772,AS14.01,circumplex.arousal,0.0,2014-04-16
5773,AS14.01,circumplex.arousal,,2014-04-16
5774,AS14.01,circumplex.arousal,1.0,2014-04-16
5775,AS14.01,circumplex.arousal,1.0,2014-04-16
5776,AS14.01,circumplex.arousal,1.0,2014-04-16


In [30]:
# check if 1 sleep value per id/date pair
len(df_dates[df_dates['variable'] == 'sleep']), df_dates[df_dates['variable'] == 'sleep'][['id', 'date']].groupby('id').count().sum()

(216,
 date    216
 dtype: int64)
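
The check above compares the number of sleep rows with a per-id count of those same rows, so the two numbers agree by construction. A more direct test, sketched here on made-up data, counts repeated (id, date) pairs:

```python
import pandas as pd

# toy long-format frame (values made up); user B has two sleep rows on one date
toy_long = pd.DataFrame({
    "id": ["A", "A", "B", "B"],
    "date": ["2014-04-16", "2014-04-17", "2014-04-16", "2014-04-16"],
    "variable": ["sleep", "sleep", "sleep", "sleep"],
    "value": [7, 8, 6, 9],
})

sleep_rows = toy_long[toy_long["variable"] == "sleep"]
# rows whose (id, date) pair already occurred earlier in the frame
n_duplicates = int(sleep_rows.duplicated(subset=["id", "date"]).sum())
```

A result of 0 on `df_dates` would confirm exactly one sleep value per id/date pair.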

In [32]:
# turn 'variable' into separate columns, taking sum of values per day per user
df_sum = df_dates.groupby(['id', 'date', 'variable'])['value'].sum().unstack()
df_sum['date'] = df_sum.index.get_level_values(1)
df_sum = df_sum.droplevel(1)

# turn 'variable' into separate columns, taking mean of values per day per user
df_mean = df_dates.groupby(['id', 'date', 'variable'])['value'].mean().unstack()
df_mean['date'] = df_mean.index.get_level_values(1)
df_mean = df_mean.droplevel(1)

In [33]:
# take the 'score' values from df_mean and the 'time' values from df_sum
# (copy, so that df_sum itself is not overwritten)
df_combi = df_sum.copy()
df_combi['mood'] = df_mean['mood']
df_combi['circumplex.arousal'] = df_mean['circumplex.arousal']
df_combi['circumplex.valence'] = df_mean['circumplex.valence']
df_combi['activity'] = df_mean['activity']
df_combi.columns.name = None
df_combi

id,activity,appCat.builtin,appCat.communication,appCat.entertainment,appCat.finance,appCat.game,appCat.office,appCat.other,appCat.social,appCat.travel,appCat.unknown,appCat.utilities,appCat.weather,call,circumplex.arousal,circumplex.valence,mood,screen,sleep,sms,date
AS14.01,0.116110,923.613,5175.016,426.734,,,,40.784,1912.852,,,223.587,,,0.75,0.50,7.200000,12522.474999,23.0,1.0,2014-04-16
AS14.01,0.056918,552.201,11853.339,142.241,,,,53.763,2107.146,,,49.718,,2.0,-0.40,0.60,6.600000,15018.480001,23.0,,2014-04-17
AS14.01,0.094033,1860.898,5341.657,774.354,119.163,,,47.789,1993.922,63.874,,88.680,,1.0,-0.80,0.80,6.800000,13490.729000,23.0,,2014-04-18
AS14.01,0.099432,946.190,2728.260,469.274,95.629,29.084,,31.482,396.672,128.890,,305.796,,1.0,-0.20,1.00,7.800000,13293.248001,21.0,,2014-04-19
AS14.01,0.074816,897.086,4984.724,727.492,10.606,,,34.202,2237.742,,,121.035,,2.0,-0.25,0.75,7.250000,8343.518000,22.0,2.0,2014-04-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AS14.33,0.259306,2879.461,3989.499,438.444,,,,55.455,3960.064,,,83.939,,2.0,0.00,1.00,7.666667,10002.624000,22.0,,2014-04-19
AS14.33,0.030758,669.192,1610.627,1526.508,,,,60.848,7490.158,,,3.010,,,0.00,1.20,7.400000,10896.482999,22.0,,2014-04-20
AS14.33,0.039461,621.054,1390.445,962.340,,,,63.275,9143.261,3.019,,4.208,,,0.00,0.60,7.200000,10135.543000,21.0,,2014-04-21
AS14.33,0.085497,3783.640,1313.671,1181.991,,,,78.571,7711.711,52.436,,90.132,,7.0,-0.60,-0.20,6.200000,20483.757999,22.0,,2014-04-22
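
The mean/sum combination above can also be done in one pass by choosing the aggregation per variable. The sketch below assumes the same split as the notebook (mean for the score variables, sum for everything else); `MEAN_VARS`, `aggregate` and the toy values are illustrative.

```python
import pandas as pd

# variables averaged per day; all others are summed (same split as above)
MEAN_VARS = {"mood", "circumplex.arousal", "circumplex.valence", "activity"}

# toy long-format data for one user and one day (values made up)
toy_long = pd.DataFrame({
    "id": ["A"] * 5,
    "date": ["2014-04-16"] * 5,
    "variable": ["mood", "mood", "screen", "screen", "sms"],
    "value": [6.0, 8.0, 100.0, 50.0, 1.0],
})

def aggregate(group: pd.Series) -> float:
    # group.name is the (id, date, variable) key of the current group
    variable = group.name[2]
    return group.mean() if variable in MEAN_VARS else group.sum()

wide = (
    toy_long.groupby(["id", "date", "variable"])["value"]
    .apply(aggregate)
    .unstack("variable")   # one column per variable
    .reset_index("date")   # keep id as the index, date as a column
)
```

Unlike the cells above, `reset_index` places `date` as the first column rather than the last.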


In [34]:
# drop irrelevant columns (assign the result, otherwise df_combi keeps them)
df_combi = df_combi.drop(['appCat.weather', 'appCat.utilities', 'appCat.unknown', 'appCat.travel', 'appCat.other', 'appCat.office', 'appCat.game', 'appCat.finance', 'appCat.entertainment', 'appCat.builtin'], axis=1)
df_combi

id,activity,appCat.communication,appCat.social,call,circumplex.arousal,circumplex.valence,mood,screen,sleep,sms,date
AS14.01,0.116110,5175.016,1912.852,,0.75,0.50,7.200000,12522.474999,23.0,1.0,2014-04-16
AS14.01,0.056918,11853.339,2107.146,2.0,-0.40,0.60,6.600000,15018.480001,23.0,,2014-04-17
AS14.01,0.094033,5341.657,1993.922,1.0,-0.80,0.80,6.800000,13490.729000,23.0,,2014-04-18
AS14.01,0.099432,2728.260,396.672,1.0,-0.20,1.00,7.800000,13293.248001,21.0,,2014-04-19
AS14.01,0.074816,4984.724,2237.742,2.0,-0.25,0.75,7.250000,8343.518000,22.0,2.0,2014-04-20
...,...,...,...,...,...,...,...,...,...,...,...
AS14.33,0.259306,3989.499,3960.064,2.0,0.00,1.00,7.666667,10002.624000,22.0,,2014-04-19
AS14.33,0.030758,1610.627,7490.158,,0.00,1.20,7.400000,10896.482999,22.0,,2014-04-20
AS14.33,0.039461,1390.445,9143.261,,0.00,0.60,7.200000,10135.543000,21.0,,2014-04-21
AS14.33,0.085497,1313.671,7711.711,7.0,-0.60,-0.20,6.200000,20483.757999,22.0,,2014-04-22
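
The 'social' score from the steps list is still open. One possible construction, sketched under loud assumptions, is to standardise `call`, `sms` and `appCat.communication` and average them per row. The equal weighting, the z-scoring and reading a missing count as zero are all assumptions, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# toy per-user-per-day rows (values made up)
toy = pd.DataFrame({
    "call": [2.0, np.nan, 1.0, 7.0],
    "sms": [1.0, np.nan, np.nan, 2.0],
    "appCat.communication": [5175.0, 11853.0, 5341.0, 1313.0],
})
SOCIAL_COLS = ["call", "sms", "appCat.communication"]

# assumption: a missing call/sms value means "none that day"
parts = toy[SOCIAL_COLS].fillna(0.0)

# put counts and durations on a comparable scale before averaging
z = (parts - parts.mean()) / parts.std(ddof=0)

# assumption: the three components are weighted equally
toy["social"] = z.mean(axis=1)
```

A column with zero variance would produce NaN here and would need to be excluded first. Standardising per user instead of over all rows is an equally plausible choice.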
