<a id='top'></a>

In [15]:
import warnings
warnings.filterwarnings('ignore')

# Imputing Missing IEQ Data for Mood Prediction
Filling in missing indoor environmental quality (IEQ) measurements so that the mood prediction can draw on more observations.

In [16]:
import sys
sys.path.append('../')
%load_ext autoreload
%autoreload 2
from src.analysis import mood_prediction

import pandas as pd
pd.set_option('display.max_columns', 200)
import numpy as np
from scipy import stats

from datetime import datetime, timedelta

import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
import seaborn as sns
import matplotlib.dates as mdates



In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Table of Contents
1. [Data Import](#data_import)
2. [Inspection](#inspection)
3. [Pre-Processing](#pre_processing)
4. [Evaluation](#evaluation)

---

<a id='data_import'></a>

# Data Import
We have two datasets to consider:
1. **IEQ Data When Home**: IEQ measurements for periods when participants were at home. These are the observed values from which we will impute the missing ones.
2. **Mood and Activity**: The mood and activity observations, including those that lack IEQ measurements and therefore need imputation.

We use the `ImportProcessing` class from `mood_prediction` to load the relevant data.

In [20]:
data = mood_prediction.ImportProcessing(data_dir="../data/")
data.mood_and_activity = data.remove_participant(data.mood_and_activity,"oxcpr7e3")

In [21]:
data.mood_and_activity.head()

Unnamed: 0,beiwe,content_e,stress_e,lonely_e,sad_e,energy_e,redcap,date,content_m,stress_m,lonely_m,sad_m,energy_m,tst,sol,naw,restful,steps,distance,discontent_m,discontent_e
2,rvhdl2la,2.0,1.0,0.0,0.0,1.0,29,2020-05-13,1.0,1.0,2.0,3.0,0.0,5.3,5.0,2.0,2.0,4722.0,1.853799,2.0,1.0
4,xdbdrk6e,2.0,1.0,2.0,1.0,3.0,23,2020-05-13,2.0,1.0,2.0,1.0,2.0,8.0,20.0,3.0,2.0,4199.0,1.720204,1.0,1.0
6,qh34m4r9,3.0,1.0,0.0,0.0,4.0,68,2020-05-13,3.0,0.0,0.0,0.0,1.0,8.0,20.0,2.0,3.0,11632.0,5.302906,0.0,0.0
8,tmexej5v,2.0,1.0,1.0,0.0,3.0,42,2020-05-13,2.0,1.0,1.0,0.0,3.0,6.0,15.0,0.0,2.0,521.0,0.202008,1.0,1.0
9,vpy1a985,2.0,1.0,2.0,1.0,2.0,50,2020-05-13,2.0,1.0,2.0,1.0,2.0,7.0,10.0,1.0,2.0,553.0,0.237737,1.0,1.0


In [22]:
data.ieq_at_home.head()

Unnamed: 0,timestamp,beiwe,content,stress,lonely,sad,energy,redcap,beacon,time_at_home,tvoc_mean,co2_mean,pm2p5_mass_mean,temperature_c_mean,rh_mean,tvoc_max,co2_max,pm2p5_mass_max,temperature_c_max,rh_max,tvoc_sum,co2_sum,pm2p5_mass_sum,temperature_c_sum,rh_sum,tvoc_range,co2_range,pm2p5_mass_range,temperature_c_range,rh_range,tvoc_delta,co2_delta,pm2p5_mass_delta,temperature_c_delta,rh_delta,tvoc_0.9,co2_0.9,pm2p5_mass_0.9,temperature_c_0.9,rh_0.9,discontent
0,2020-05-15 09:25:04,vr9j5rry,2.0,0.0,0.0,0.0,3.0,34,25.0,29405.0,284.368862,625.903783,20.179947,23.856529,44.88832,719.359263,701.48585,37.798653,25.199962,45.593333,133368.996076,293548.874083,9464.395245,11188.71228,21052.622222,715.335972,184.589704,21.918727,2.170306,1.385,211.027254,68.154688,17.144862,-1.467429,0.289444,530.33353,679.8791,22.449095,24.533201,45.358556,1.0
1,2020-05-17 11:24:02,vr9j5rry,3.0,0.0,0.0,0.0,3.0,34,25.0,13735.0,452.023265,677.984445,18.563323,23.009077,43.579227,1016.818626,765.040923,35.445383,24.778236,46.411667,181713.352495,288143.388998,7889.412101,9778.85765,18521.171667,888.260546,206.188492,21.578579,2.500352,5.498333,154.749369,72.051767,12.049113,-1.291849,1.078333,716.564921,744.616396,24.420434,24.604545,45.523333,0.0
2,2020-05-22 09:01:57,vr9j5rry,2.0,0.0,1.0,0.0,3.0,34,25.0,8736.0,126.825004,576.785137,40.574454,25.356578,46.578796,175.930567,631.137869,43.499281,25.778377,48.331111,34876.87623,158615.912763,11157.974923,6973.05894,12809.168889,84.294494,117.924809,6.654736,1.21017,2.914444,2.477067,-117.924809,-3.296227,0.496737,-2.914444,142.132758,624.386797,42.322003,25.778377,47.847,1.0
3,2020-05-29 09:01:24,vr9j5rry,3.0,0.0,0.0,0.0,3.0,34,25.0,38008.0,254.904878,575.247848,24.570087,24.542439,44.865419,793.995839,669.95035,33.474761,26.491811,48.035,201884.663245,455596.296002,19459.509146,19437.611505,35533.411667,759.276949,166.002859,15.754364,3.24379,5.045,116.95177,25.937998,1.388178,0.966803,1.158333,482.226226,643.45325,29.31797,26.218439,47.5,0.0
4,2020-05-31 11:20:21,vr9j5rry,3.0,1.0,0.0,0.0,3.0,34,25.0,18771.0,284.657043,616.03987,22.977562,24.042582,43.053737,1016.015551,697.470348,28.735318,25.278306,45.023333,285511.01442,617887.989296,23046.494244,24114.709965,43182.898333,938.663185,175.802342,13.572454,2.201977,3.02,-268.693182,-1.094114,2.672467,0.533408,2.985,540.139142,679.69457,26.337786,24.886584,44.521667,0.0


---

<a id='inspection'></a>

# Inspection
A quick look at how many observations and participants each dataset contains.

In [31]:
def get_summary(total_df,ieq_df):
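    """
    Prints summary information for the two datasets

    Parameters
    ----------
    total_df : DataFrame
        all mood and activity observations; must include a "beiwe" column
    ieq_df : DataFrame
        observations with IEQ measurements; must include a "beiwe" column

    Returns
    -------
    None
    """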
    # observations
    print("Number of Total Observations:\t\t",len(total_df))
    print("Number of Available IEQ Observations:\t", len(ieq_df))
    # participants
    print("Participants in Total:\t\t",len(total_df["beiwe"].unique()))
    print("Participants in IEQ Data:\t",len(ieq_df["beiwe"].unique()))

In [32]:
get_summary(data.mood_and_activity,data.ieq_at_home)

Number of Total Observations:		 1200
Number of Available IEQ Observations:	 316
Participants in Total:		 50
Participants in IEQ Data:	 20


<div class="alert alert-block alert-info">
    
Only 20 of the 50 participants have IEQ data, so we start by restricting the mood and activity dataset to those participants.
    
</div>

---

<a id='pre_processing'></a>

# Pre-Processing

## Filtering by Participant
Keeping only the participants who appear in both datasets.

In [34]:
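ieq = data.ieq_at_home.copy()
# keep only mood observations from participants who have IEQ data
mood = data.mood_and_activity[data.mood_and_activity["beiwe"].isin(ieq["beiwe"].unique())].copy()
# and only IEQ observations from participants who remain in the mood data
ieq = ieq[ieq["beiwe"].isin(mood["beiwe"].unique())]
get_summary(mood,ieq)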

Number of Total Observations:		 572
Number of Available IEQ Observations:	 299
Participants in Total:		 19
Participants in IEQ Data:	 19


<div class="alert alert-block alert-danger">
    
Only 299 IEQ observations are available against 572 mood observations, so we have to impute IEQ values for roughly half of the dataset, and even then we have only 572 observations for the prediction.
    
</div>
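
To check how many mood observations actually lack a matching IEQ record, one option is a left merge with an indicator column. The sketch below uses made-up toy frames rather than the notebook's `mood` and `ieq`, and it assumes that records match when the participant (`beiwe`) and the calendar date agree. The matching rule used by this analysis is not shown here, so treat that as an assumption.

```python
import pandas as pd

# Hypothetical toy frames; values are illustrative only.
mood_toy = pd.DataFrame({
    "beiwe": ["a", "a", "b", "b"],
    "date": pd.to_datetime(["2020-05-13", "2020-05-14",
                            "2020-05-13", "2020-05-15"]),
})
ieq_toy = pd.DataFrame({
    "beiwe": ["a", "b"],
    "timestamp": pd.to_datetime(["2020-05-13 09:25:04",
                                 "2020-05-15 11:20:21"]),
    "co2_mean": [625.9, 616.0],
})

# Assumption: an IEQ row belongs to the mood row with the same
# participant and the same calendar date.
ieq_toy["date"] = ieq_toy["timestamp"].dt.normalize()

# A left merge keeps every mood row; `_merge` flags rows with no IEQ match.
merged = mood_toy.merge(ieq_toy.drop(columns="timestamp"),
                        on=["beiwe", "date"], how="left", indicator=True)

n_missing = int((merged["_merge"] == "left_only").sum())
frac_missing = n_missing / len(merged)
print(f"{n_missing} of {len(merged)} rows need imputation ({frac_missing:.0%})")
```

Rows flagged `left_only` have `NaN` in every IEQ column and are the ones an imputation step would need to fill.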

## Combining and Cleaning

[Back to Pre-Processing](#pre_processing)

---

[Back to Top](#top)