# Data cleaning

### Loading data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Import
data = pd.read_csv('dataset_mood_smartphone.csv')

#Convert time column to date time format
data['time']= pd.to_datetime(data['time']) 

### Changing to multi index with time and ID

In [2]:
data = data.set_index(['id', 'time'])

We want to reformat the data so that we have a number of observations for each patient, with each observation covering some time period, e.g. one week. Each observation will have a measurement for each attribute (e.g. average time spent on the weather app over the period), with the average mood over the period as the dependent variable.

In [3]:
#reshaping the dataframe so that each column is a feature, indexed first by patient then time
data2 = pd.pivot_table(data, index=['id', 'time'], columns='variable', values='value')
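
To make the reshaping step concrete, here is a small sketch on toy data. The patient ids and values are made up; only the column layout (`id`, `time`, `variable`, `value`) is assumed to match the CSV.

```python
import pandas as pd

# Toy long-format records (hypothetical values, same column layout as the CSV)
long_df = pd.DataFrame({
    'id': ['P1', 'P1', 'P1', 'P2'],
    'time': pd.to_datetime(['2014-03-01 09:00', '2014-03-01 09:00',
                            '2014-03-01 12:00', '2014-03-01 09:00']),
    'variable': ['mood', 'screen', 'mood', 'mood'],
    'value': [7.0, 120.5, 6.0, 8.0],
})

# One row per (id, time), one column per variable
wide_df = pd.pivot_table(long_df, index=['id', 'time'],
                         columns='variable', values='value')
print(wide_df)
```

Variables that were not recorded at a given timestamp come out as NaN, which is why most cells of the wide frame are empty before aggregation.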

### Aggregating the data into time windows 

For all of the apps, sms and calls we take the sum over the windows, since they are either times spent or number of calls etc. For arousal, mood and valence we take the mean.

In [None]:
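prev_index = 0
#iterate over the rows (iterating over the dataframe itself only yields column names)
for idx, row in data2.iterrows():
    #only rows that carry a mood measurement would close a window
    if row['mood'] > 0:
        pass  #window logic still to be written
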

In [4]:
#This is currently grouping observations into 2 day windows, can make a decision on this later

data2 = data2.astype(float)
data2 = data2.groupby([pd.Grouper(level='id'), pd.Grouper(freq='2D', level='time') 
                             ]).agg({'activity': 'sum', 'appCat.builtin':'sum',
                                                           'appCat.communication':'sum', 'appCat.entertainment':'sum',
                                                           'appCat.finance':'sum', 'appCat.game':'sum', 'appCat.office':'sum',
                                                           'appCat.other':'sum', 'appCat.social':'sum', 'appCat.travel':'sum',
                                                           'appCat.unknown':'sum', 'appCat.utilities':'sum', 'appCat.weather':'sum',
                                                           'call':'sum', 'screen':'sum', 'sms':'sum','circumplex.valence':'mean',
                                                           'circumplex.arousal':'mean', 'mood':'mean'})
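
One thing to be aware of with `'sum'`: pandas returns 0 for a window in which a column has no observations at all, so "no data" and "zero usage" become indistinguishable. The sketch below shows the difference on toy data. The values and the helper `sum_keep_nan` are hypothetical and not part of the pipeline above.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['P1', 'P1', 'P1'],
     pd.to_datetime(['2014-03-01 09:00', '2014-03-01 15:00', '2014-03-03 10:00'])],
    names=['id', 'time'])
toy = pd.DataFrame({'screen': [100.0, 50.0, np.nan],
                    'mood': [7.0, 6.0, 8.0]}, index=idx)

def sum_keep_nan(s):
    # Hypothetical helper: sum that stays NaN when the window has no values
    return s.sum(min_count=1)

groupers = [pd.Grouper(level='id'), pd.Grouper(freq='2D', level='time')]

plain = toy.groupby(groupers).agg({'screen': 'sum', 'mood': 'mean'})
strict = toy.groupby(groupers).agg({'screen': sum_keep_nan, 'mood': 'mean'})

print(plain)   # the second window reports screen = 0.0
print(strict)  # the second window reports screen = NaN
```

Whether 0 or NaN is the better choice depends on whether a missing record really means the app was not used.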

### Removing observations/days without mood measurements

In [5]:
data2 = data2.dropna(subset=['mood'])

### Dropping columns with little data

In [6]:
data2 = data2.drop(columns=['appCat.finance','appCat.game', 'appCat.unknown', 'appCat.weather'])

### Remove outliers

It is clear from the plots that some outliers are nonsensical. No values should be negative apart from those for arousal and valence, which have a minimum value of -2, so a good starting point is to remove any value below -2 anywhere in the dataframe. The cell below uses a blunter rule for now: it keeps only the rows in which every column lies within 4 standard deviations of its column mean.

In [7]:
from scipy import stats

data2 = data2.astype(float)
data2 = data2[(np.abs(stats.zscore(data2)) < 4).all(axis=1)]
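
The range-based rule described above could look roughly like the following. This is a sketch on toy data with made-up values, and it blanks out offending values instead of dropping whole rows, which is one possible reading of "remove".

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'screen': [120.0, -1.0, 300.0],
    'circumplex.valence': [-2.0, 1.0, -7.0],
    'mood': [6.0, 7.0, 8.0],
})

# Step 1: nothing in the frame may be below -2
cleaned = toy.where(toy >= -2)

# Step 2: columns other than arousal/valence may not be negative
signed_cols = ['circumplex.valence', 'circumplex.arousal']
non_negative = [c for c in cleaned.columns if c not in signed_cols]
cleaned[non_negative] = cleaned[non_negative].where(cleaned[non_negative] >= 0)

print(cleaned)
```

Values that fail a check become NaN, so they can be handled together with the other missing values later on.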

### Instead aggregating data into windows up to each mood measurement 

In [None]:
test = data2.groupby(['id', 'time', 'mood'])

test.head()

In [None]:
#This will give us more observations than grouping by day

data2 = data2.groupby([pd.Grouper(freq='1D', level='time'), 
                             pd.Grouper(level='id')]).agg({'activity': 'sum', 'appCat.builtin':'sum',
                                                           'appCat.communication':'sum', 'appCat.entertainment':'sum',
                                                           'appCat.finance':'sum', 'appCat.game':'sum', 'appCat.office':'sum',
                                                           'appCat.other':'sum', 'appCat.social':'sum', 'appCat.travel':'sum',
                                                           'appCat.unknown':'sum', 'appCat.utilities':'sum', 'appCat.weather':'sum',
                                                           'call':'sum', 'screen':'sum', 'sms':'sum','circumplex.valence':'mean',
                                                           'circumplex.arousal':'mean', 'mood':'mean'})

# Exploration

### Mood distribution 

In [None]:
#We can now look at the distributions for each variable
data2.hist('mood')

### Observation counts for each day

In [None]:
#This tells us the number of individuals for which we have mood data for each day in the dataset
data2.groupby(level=0)['mood'].count()

It is clear that there is only a narrow time period over which we have data for all patients simultaneously. We cannot do much with days for which we have little or no mood data, so we could consider discarding data in the early and late periods.
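
A possible way to trim the early and late periods is to keep only the span between the first and last day on which enough patients reported a mood. The sketch below uses toy data; the threshold `min_patients` is a hypothetical choice that would need tuning on the real counts.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(pd.Timestamp('2014-03-01'), 'P1'),
     (pd.Timestamp('2014-03-02'), 'P1'),
     (pd.Timestamp('2014-03-02'), 'P2'),
     (pd.Timestamp('2014-03-03'), 'P2')],
    names=['time', 'id'])
toy = pd.DataFrame({'mood': [7.0, 6.0, 8.0, 5.0]}, index=idx)

min_patients = 2  # hypothetical threshold

# Number of patients with a mood value on each day
per_day = toy.groupby(level='time')['mood'].count()
good_days = per_day[per_day >= min_patients].index

# Keep everything between the first and the last well-covered day
start, end = good_days.min(), good_days.max()
times = toy.index.get_level_values('time')
trimmed = toy[(times >= start) & (times <= end)]

print(trimmed)
```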

### Observation counts for each patient 

In [None]:
#This tells us the number of mood observations for each individual
data2.groupby(level=1)['mood'].count()

### Counts for each variable, for each patient

In [None]:
#Number of available instances for each variable, for each patient
data2.groupby(level=1).count()

There are a number of attributes for which we have few or no measurements for many of the patients, namely finance, game, office, unknown and weather. Some of these may be entirely useless, especially if they happen to be highly correlated with other variables. We could consider taking binary indicators for some of them, with the intuition that patients who check finance or office apps have assets or a job, those who check the weather app go outside, those who play mobile games procrastinate, and so on.
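
The binary indicators could be built along these lines. This is a sketch on toy data; the `.used` suffix is just an illustrative naming choice. On the real data this would have to run before the sparse columns are dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    'appCat.finance': [0.0, 35.2, 0.0],
    'appCat.weather': [12.0, 0.0, 0.0],
    'mood': [7.0, 6.5, 8.0],
})

sparse_cols = ['appCat.finance', 'appCat.weather']
for col in sparse_cols:
    # 1 if the app category was used at all in the window, else 0
    toy[col + '.used'] = (toy[col] > 0).astype(int)

print(toy)
```

Note that a missing value compares as not greater than zero, so NaN would be encoded as 0 (not used).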

### Pairwise plots

Looking at the number of non-zero observations in each column after the above transformations

In [None]:
import seaborn as sns

sns.pairplot(data2.loc[:,data2.dtypes == 'float64'])

### Correlation between predictors

In [None]:
corr_matrix = data2.corr()
corr_matrix['mood']

# Feature engineering

Here we need to create new variables to improve our predictive power

Ideas so far:
- Days of week, month of recording
- Mood swing in last week
- Morning/evening
- Binary indicators for some omitted apps

In the paper on this dataset they say that basically none of the apps have any predictive power. Once we show this ourselves, we could simplify the model immensely by using principal components or indicators for "uses phone a lot or not".

### Day of week indicators

In [8]:
names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

for i, x in enumerate(names):
    data2[x] = (data2.index.get_level_values(1).weekday == i).astype(int)
    
data2.head()


| id | time | activity | appCat.builtin | appCat.communication | appCat.entertainment | appCat.office | appCat.other | appCat.social | appCat.travel | appCat.utilities | call | ... | circumplex.valence | circumplex.arousal | mood | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AS14.01 | 2014-02-25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 0.75 | -0.25 | 6.25 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| AS14.01 | 2014-02-27 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 0.333333 | 0.0 | 6.333333 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| AS14.01 | 2014-03-21 | 6.873236 | 3870.647 | 11243.808 | 1100.78 | 172.206 | 337.894 | 4948.132 | 952.75 | 716.375 | 9.0 | ... | 0.333333 | 0.4 | 6.3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| AS14.01 | 2014-03-23 | 4.036182 | 2153.202 | 14507.948 | 1071.317 | 3.01 | 139.381 | 4124.465 | 419.805 | 208.818 | 10.0 | ... | 0.4 | 0.5 | 6.4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| AS14.01 | 2014-03-25 | 4.792378 | 2200.265 | 19265.504 | 978.685 | 0.0 | 276.317 | 6511.53 | 0.0 | 256.258 | 0.0 | ... | 0.555556 | 0.111111 | 6.666667 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |


### Month indicators

In [9]:
names = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

for i, x in enumerate(names):
    #months are numbered from 1, while enumerate starts at 0
    data2[x] = (data2.index.get_level_values(1).month == i+1).astype(int)
    
data2.head()

| id | time | activity | appCat.builtin | appCat.communication | appCat.entertainment | appCat.office | appCat.other | appCat.social | appCat.travel | appCat.utilities | call | ... | March | April | May | June | July | August | September | October | November | December |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AS14.01 | 2014-02-25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AS14.01 | 2014-02-27 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AS14.01 | 2014-03-21 | 6.873236 | 3870.647 | 11243.808 | 1100.78 | 172.206 | 337.894 | 4948.132 | 952.75 | 716.375 | 9.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AS14.01 | 2014-03-23 | 4.036182 | 2153.202 | 14507.948 | 1071.317 | 3.01 | 139.381 | 4124.465 | 419.805 | 208.818 | 10.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AS14.01 | 2014-03-25 | 4.792378 | 2200.265 | 19265.504 | 978.685 | 0.0 | 276.317 | 6511.53 | 0.0 | 256.258 | 0.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### Mood swing indicator
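
One way to express "mood swing in the last week" is the standard deviation of a patient's mood over the 7 days before each observation. The sketch below is only one possible definition: the toy values and the column name `mood_swing_7d` are made up, and `closed='left'` keeps the current observation out of its own window.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['P1', 'P1', 'P1', 'P1'],
     pd.to_datetime(['2014-03-01', '2014-03-03', '2014-03-05', '2014-03-20'])],
    names=['id', 'time'])
toy = pd.DataFrame({'mood': [6.0, 8.0, 7.0, 5.0]}, index=idx)

# Time-based rolling windows need a sorted datetime index
by_time = toy.reset_index(level='id').sort_index()

# Per patient: std of the mood values in the 7 days before each row
swing = by_time.groupby('id')['mood'].rolling('7D', closed='left').std()

# The result is indexed by (id, time), so it aligns with the original frame
toy['mood_swing_7d'] = swing

print(toy)
```

Rows with fewer than two earlier observations in the window get NaN, which would need to be filled or dropped before modelling.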

# OLS as first test of feature importance

In [None]:
olsdata = data2.groupby([pd.Grouper(freq='6M', level='time'), 
                             pd.Grouper(level='id')]).agg({'activity': 'sum', 'appCat.builtin':'sum',
                                                           'appCat.communication':'sum', 'appCat.entertainment':'sum',
                                                           'appCat.finance':'sum', 'appCat.game':'sum', 'appCat.office':'sum',
                                                           'appCat.other':'sum', 'appCat.social':'sum', 'appCat.travel':'sum',
                                                           'appCat.unknown':'sum', 'appCat.utilities':'sum', 'appCat.weather':'sum',
                                                           'call':'sum', 'screen':'sum', 'sms':'sum','circumplex.valence':'mean',
                                                           'circumplex.arousal':'mean', 'mood':'mean'})
olsdata.head()
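
The regression itself is not written yet. A minimal sketch of what a first OLS pass might look like is given below. It uses synthetic data (the target is constructed to depend on valence only) and plain `numpy.linalg.lstsq` in place of a statistics package, so it gives coefficients but no p-values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'screen': rng.normal(size=50),
    'circumplex.valence': rng.normal(size=50),
})
# Synthetic target: by construction it depends on valence only
toy['mood'] = 7 + 0.8 * toy['circumplex.valence'] + rng.normal(scale=0.1, size=50)

predictors = ['screen', 'circumplex.valence']
X = toy[predictors].to_numpy()
X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardise so coefficients are comparable
X = np.column_stack([np.ones(len(X)), X])   # add an intercept column

coef, *_ = np.linalg.lstsq(X, toy['mood'].to_numpy(), rcond=None)
coefs = pd.Series(coef, index=['intercept'] + predictors)

print(coefs)
```

With standardised predictors, the absolute size of each coefficient gives a rough first ranking of feature importance.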

# Principal component analysis

It's very probable that many of the apps serve similar purposes for the user and therefore have a similar effect. A PCA could possibly decompose these features into a smaller set of components representing e.g. a need for socializing (messenger apps) or boredom (news, finance).

In [None]:
from sklearn.preprocessing import StandardScaler

#Before we can run PCA we need to standardize all of the features
#(finance, game, unknown and weather are left out because they were dropped above)

features = ['activity', 'appCat.builtin', 'appCat.communication', 'appCat.entertainment', 
            'appCat.office', 'appCat.other', 'appCat.social', 'appCat.travel', 'appCat.utilities', 
            'call', 'screen', 'sms', 'circumplex.valence', 'circumplex.arousal']
x = data2.loc[:, features].values
x = StandardScaler().fit_transform(x)

y = data2.loc[:,['mood']].values

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2', 'principal component 3'])

principalDf.head()
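
To judge whether three components are enough, we can look at the share of variance each component explains. The sketch below uses synthetic data in which two of three hypothetical features are noisy copies of the same signal. On the real data the same attribute, `explained_variance_ratio_`, would be read from the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))

# Three hypothetical features: the first two are noisy copies of one signal
X = np.hstack([base + 0.1 * rng.normal(size=(100, 1)),
               base + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])
X = StandardScaler().fit_transform(X)

pca_toy = PCA(n_components=3).fit(X)
ratios = pca_toy.explained_variance_ratio_

print(ratios)           # share of variance per component
print(ratios.cumsum())  # cumulative share
```

If the first few components already capture most of the variance, the remaining ones can probably be dropped without losing much.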

In [None]:
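#principalDf has a fresh 0..n-1 index, so drop data2's (id, time) index before joining side by side
finalDf = pd.concat([principalDf, data2[['mood']].reset_index(drop=True)], axis = 1)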

# Mean model

# Linear regression for aggregated data

# Decision Tree/Random forest

# Individual models (ARIMA?)