# Introduction

This tutorial shows how to use Forest to analyze Beiwe data. We first download the data using mano, then process it with Forest, and finally create time series plots from the generated summary statistics. There are four parts to this tutorial.

1. Check Python version and download Forest.
2. Download data for your study from the server.
3. Process data using Forest.
4. Create time series plots.

## Check Python Version and Download Forest

Before we begin, we need to check which version of Python is installed. Note that Forest is built using Python 3.8.

In [None]:
from platform import python_version
import sys

- Print the python version and the path to the Python interpreter. 

In [None]:
print(python_version()) ## Prints your version of python
print(sys.executable) ## Prints your current python installation

*The output should display two lines.* 

1. The installed Python version. Make sure you are not using a version of Python earlier than 3.8.
2. The path to where Python is currently installed
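The version requirement can also be checked programmatically. The following is a minimal sketch using only the standard library; the helper name `meets_minimum_version` is hypothetical and is not part of Forest or mano.

```python
import sys

# Hypothetical helper (not part of Forest): compare a version tuple
# against the minimum version named in this tutorial.
def meets_minimum_version(version_info, minimum=(3, 8)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

if meets_minimum_version(sys.version_info):
    print("Python version is new enough for this tutorial.")
else:
    print("Please switch to Python 3.8 or later before continuing.")
```

Comparing tuples rather than version strings avoids mistakes such as treating "3.10" as older than "3.8".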

- You may need to install git, pip, mano, and Forest. To do so, enter the lines below in a command-line shell. If they are already installed, you can skip to the next step.

`conda install git pip`

`pip install mano`

`pip install https://github.com/onnela-lab/forest/tarball/develop`

## Download Beiwe Data


In this notebook, we will download data from a Beiwe study. Edit the cell below to match the parameters of your study.

- For **study_id**, enter the study ID, found in the top right corner of the study page.
- For **dest_dir**, enter the path to the folder you want raw data stored in.
- For **server**, enter the server where data is located. If your Beiwe website URL starts with studies.beiwe.org, enter "studies".
- For **time_start**, enter the earliest date you want to download data for, in YYYY-MM-DD format.
- For **time_end**, enter the latest date you want to download data for, in YYYY-MM-DD format. If this is None, mano will download all data available (up until today at midnight). 
- For **data_streams**, enter a list of data streams you want to download. Forest currently analyzes `gps`, `survey_timings`, `calls`, and `texts` data streams. A full list of data types can be found under the "Download Data" tab of the Beiwe website. If this is None, all possible data streams will be downloaded. 
- For **beiwe_ids**, enter a list of Beiwe IDs you want to download data for. If you leave this as an empty list, mano will attempt to download data for all user IDs.

In [None]:
study_id = ""
dest_dir = "raw_data"
server = "studies"
time_start = "2008-01-01"
time_end = None
data_streams = ["gps", "survey_timings", "survey_answers", "calls", "texts"]
beiwe_ids = []
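Because the date parameters are plain strings, a typo only surfaces once the download starts. The sketch below is one way to check them first, using only the standard library. The helper name `is_valid_date` is hypothetical and is not part of mano or Forest, and `None` is accepted because **time_end** may be left as `None`.

```python
from datetime import datetime

# Hypothetical helper (not part of mano or Forest): check that a value
# parses as a year-month-day date, allowing None for "no end date".
def is_valid_date(date_str):
    if date_str is None:
        return True
    try:
        datetime.strptime(date_str, "%Y-%m-%d")
        return True
    except ValueError:
        return False

print(is_valid_date("2008-01-01"))  # True
print(is_valid_date("01/01/2008"))  # False
print(is_valid_date(None))          # True
```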

In this next cell, we will define a function that iterates through Beiwe IDs and downloads desired data for each ID. This function also retries downloading data when a network failure occurs. 

In [None]:
import os
import sys
import mano
import mano.sync as msync
import requests
from datetime import datetime
def download_data(keyring,  study_id, download_folder, users = [], time_start = "2008-01-01", 
                      time_end = None, data_streams = None):
    '''
    Downloads all data for specified users, time frame, and data streams. 
    
    This function downloads all data for selected users, time frame, and data streams, and writes them to an 
    output folder, with one subfolder for each user, and subfolders inside the user's folder for each data stream. 
    If a server failure happens, the function re-attempts the download. 
    
    Args: 
        keyring: a keyring generated by mano.keyring
    
        users(iterable): A list of users to download data for. If none are entered, it attempts to download data for all users
        
        study_id(str): The id of a study
        
        download_folder(str): path to a folder to download data
        
        time_start(str): The initial date to download data (Formatted in YYYY-MM-DD). Default is 2008-01-01, which is 
            before any Beiwe data existed.
        
        time_end(str): The date to end downloads. The default is today at midnight.
        
        data_streams(iterable): A list of all data streams to download. The default (None) is all possible data streams. 
        
    '''
    if study_id == "":
        print("Error: Study ID is blank")
        return
        
    if (keyring['USERNAME'] == "" or keyring['PASSWORD'] == "" 
        or keyring["ACCESS_KEY"] == "" or keyring["SECRET_KEY"] == ""):
        print("Error: Did you set up the keyring_studies.py file?")
        return
    
    if not os.path.isdir(download_folder):
        os.mkdir(download_folder)
    
    if time_end is None:
        time_end = datetime.today().strftime("%Y-%m-%d")+"T23:59:00"
        
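    if users == []:
        print('Obtaining list of users...')
        num_tries = 1
        while num_tries < 5:
            try:
                users = [u for u in mano.users(keyring, study_id)]
                num_tries = 6
            except KeyboardInterrupt:
                print("Someone closed the program")
                sys.exit()
            except Exception:
                num_tries = num_tries + 1
        # stop here rather than silently downloading nothing
        if users == []:
            print("Error: no users found, or the list of users could not be obtained")
            return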
    
    for u in users:
        zf = None
        download_success = False
        num_tries = 0
        while not download_success:
            try:
                print(f'Downloading data for {u}')
                zf = msync.download(keyring, study_id, u, data_streams, time_start = time_start, time_end = time_end)
                if zf is not None:
                    zf.extractall(download_folder)
                download_success = True
            except requests.exceptions.ChunkedEncodingError:
                print(f'Network failed in download of {u}, try {num_tries}')
                num_tries = num_tries + 1
            except KeyboardInterrupt:
                print("Someone closed the program")
                sys.exit()
            except: 
                print(f'Network failed in download of {u}, try {num_tries}')
                num_tries = num_tries + 1
            if num_tries > 5:
                download_success = True
                print(f"Too many failures; skipping user {u}")
        if zf is None:
            print(f'No data for {u}; nothing written')

In this next cell, we will import our keyring_studies.py file which includes download credentials. If you haven't already done this, open the keyring_studies.py file and paste your credentials inside. 

If your keyring_studies.py file is in a different directory than the one which includes this notebook, replace `sys.path.insert(0, '')` with `sys.path.insert(0, 'path/to/dir/containing/file/')`.

In [None]:
# import .py file located in another directory if needed
sys.path.insert(0, '')

import keyring_studies
kr = mano.keyring(None)

This next cell will download your data. Downloading your data will probably be the most time-consuming part of the whole process, so if you've already downloaded the data, you will save time by not running this cell.

In [None]:
download_data(kr, study_id, dest_dir, beiwe_ids, time_start, time_end, data_streams)

## Process Data using Forest 
- Using the Forest library developed by the Onnela lab, we compute daily GPS and communication summary statistics.

First, we generate the GPS-related summary statistics by using the **gps_stats_main** function in **traj2stats.py**, in the Jasmine tree of Forest. This code will take between 15 minutes and 12 hours to run, depending on your machine and the quantity of data downloaded. To make sure that everything is working right, change the `beiwe_ids` argument from `None` to a list with just a couple of the Beiwe IDs in your study.

- For **data_dir**, enter the "path to the data file directory". 
- For **output_dir**, enter the "path to the file directory where output is to be stored". 
- For **tz_str**, enter the time zone where the study was conducted. Here, it's **"America/New_York"**. You can use `pytz.all_timezones` to check all options.
- For **option**, choose 'daily', 'hourly', or 'both' as the temporal resolution of the summary statistics. Here, we chose **"daily"**.
- For **save_traj**, enter `True` if you want to save the trajectories as a csv file, or `False` if you don't (default: `False`). Here, we chose **True**.
- For **beiwe_ids**, enter the list of Beiwe IDs to run Forest on. If this is `None`, jasmine will run on all users in the data_dir directory.
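For the quick test run suggested above, one way to pick a couple of IDs is to list the participant folders in the raw data directory. This sketch assumes, as described for the download function, that each user has one subfolder directly under the download folder. The helper name `first_participant_ids` is hypothetical and is not part of Forest.

```python
import os

# Hypothetical helper (not part of Forest): list the first few
# subfolder names of data_dir, which are assumed to be Beiwe IDs.
def first_participant_ids(data_dir, n=2):
    """Return up to n subfolder names of data_dir, in sorted order."""
    if not os.path.isdir(data_dir):
        return []
    ids = sorted(
        name for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    )
    return ids[:n]

# Example: choose two IDs from the download folder used in this tutorial
test_ids = first_participant_ids("raw_data", n=2)
print(test_ids)
```

The resulting list could then be assigned to `beiwe_ids` in place of `None` for a trial run.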

In [None]:
import forest.jasmine.traj2stats

data_dir = dest_dir
output_dir = "gps_output"
tz_str = "America/New_York"
option = "daily"
save_traj = True 
beiwe_ids = None


forest.jasmine.traj2stats.gps_stats_main(
    data_dir, output_dir, tz_str, option, save_traj, participant_ids = beiwe_ids
)

*The output should describe how the data is being processed. If this is working correctly, you will see something like:*
    
><i>User: tcqrulfj  
Read in the csv files ...  
Collapse data within 10 second intervals ...  
Extract flights and pauses ...  
Infer unclassified windows ...  
Merge consecutive pauses and bridge gaps ...  
Selecting basis vectors ...  
Imputing missing trajectories ...  
Tidying up the trajectories...  
Calculating the daily summary stats...</i>

Second, we compute the call and text-based summary statistics by using the **log_stats_main** function under the **log_stats.py** in the Willow tree of Forest. This should run a lot faster than `forest.jasmine.traj2stats.gps_stats_main`. 


- For **data_dir**, enter the "path to the data file directory". 
- For **output_dir**, enter the "path to the file directory where output is to be stored". 
- For **tz_str**, enter the time zone where the study was conducted. Here, it's **"America/New_York"**.
- For **option**, choose 'daily', 'hourly', or 'both' as the temporal resolution of the summary statistics. Here, we chose **"daily"**.
- For **beiwe_ids**, enter the list of Beiwe IDs to run Forest on. If this is `None`, willow will run on all users in the data_dir directory.
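A misspelled time zone string is easy to miss, so it can help to check **tz_str** before running Forest. The GPS section of this tutorial points to `pytz.all_timezones` for the list of options. The sketch below assumes pytz is installed and falls back to the standard-library `zoneinfo` module (Python 3.9+) if it is not. The helper name `is_known_timezone` is hypothetical.

```python
# Assumes pytz is installed; otherwise falls back to zoneinfo (Python 3.9+).
try:
    import pytz
    KNOWN_TIMEZONES = set(pytz.all_timezones)
except ImportError:
    from zoneinfo import available_timezones
    KNOWN_TIMEZONES = available_timezones()

# Hypothetical helper (not part of Forest): exact-match lookup of a
# time zone name in the set of known names.
def is_known_timezone(tz_str):
    return tz_str in KNOWN_TIMEZONES

print(is_known_timezone("America/New_York"))
print(is_known_timezone("America/New_York."))  # trailing period: not a valid name
```

The lookup is an exact string match, so stray punctuation or whitespace in the name is rejected.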

In [None]:
import forest.willow.log_stats
data_dir = dest_dir
output_dir = "comm_output"
tz_str = "America/New_York"
option = "daily"
beiwe_ids = None



forest.willow.log_stats.log_stats_main(
    data_dir, output_dir, tz_str, option, beiwe_id = beiwe_ids
)

*The output should describe how the data is being processed (e.g., read, collapse, extracted...imputing, tidying, and calculating daily summary stats).*

>*Note: calls and texts data are only collected on Android phones. If you only enrolled users with iPhones in your study, you will not have any output here.*

The outputs of **gps_stats_main** and **log_stats_main** are generated per subject in the study folder (one csv file per subject). For further use, it is often convenient to concatenate these csv files into one file containing data for all users in the study.

- The following code is used to concatenate these files into a single csv for the **GPS summaries**.

In [None]:
import numpy as np
import pandas as pd
import os
import sys
from pathlib import Path
from datetime import datetime
from datetime import timedelta  
import math
from functools import reduce

# Path to subdirectory
direc = os.getcwd()
data_dir = os.path.join(direc,"gps_output")

# initialize dataframe list
df_list = []

# loop through the top level of data_dir only
for subdir, dirs, files in os.walk(data_dir):
    # do not descend into subfolders (e.g., saved trajectories, if present),
    # since the per-subject summary files sit directly in data_dir
    dirs.clear()
    
    # loop through files in list
    for file in files:
        if file[-4:] == ".csv": # only read in csv files
            file_dir = os.path.join(subdir, file)
            # obtain subject study_id from the file name (drop ".csv")
            subject_id = file[:-4]
            temp_df = pd.read_csv(file_dir)
            temp_df.insert(loc=0, column="Date", value=pd.to_datetime(temp_df[['day', 'month', 'year']]))
            temp_df.insert(loc=0, column='Beiwe_ID', value=subject_id)
            df_list.append(temp_df)
            
            
if len(df_list) > 0:
                
    # concatenate dataframes within list --> Final Data for trajectories
    response_data = pd.concat(df_list, axis=0).reset_index()
    response_data = response_data.drop(['index','day', 'month', 'year'], axis=1)

    # print the first few observations
    print(response_data.head())

    # Write results to CSV 
    response_filename = 'gps_summary.csv'

    path_resp = os.path.join(direc, response_filename)    

    # write to csv
    response_data.to_csv(path_resp, index=False)
else:
    print("Error: No data found")

*The output should show the data for the first five observations in the concatenated dataset.*
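To see what the `Date` column construction does, here is a small self-contained sketch on toy data. The values are made up for illustration; only the column names (`day`, `month`, `year`, `dist_traveled`) come from this tutorial.

```python
import pandas as pd

# Toy data with made-up values; the column names day, month, year
# match those used in the concatenation code above.
toy = pd.DataFrame({
    "year": [2021, 2021],
    "month": [3, 3],
    "day": [1, 2],
    "dist_traveled": [1.5, 2.0],
})

# pd.to_datetime assembles one datetime column from year/month/day columns
toy.insert(loc=0, column="Date", value=pd.to_datetime(toy[["day", "month", "year"]]))

# the separate day/month/year columns are no longer needed
toy = toy.drop(["day", "month", "year"], axis=1)
print(toy)
```

The order in which the three columns are listed does not matter, because `pd.to_datetime` matches them by name.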

- The following code is used to concatenate these files into a single csv for the **communication summaries**.

In [None]:
# (use study_id and timestamp)
# Path to subdirectory
direc = os.getcwd()
data_dir = os.path.join(direc,"comm_output")


# initialize dataframe list
df_list = []

# loop through the top level of data_dir only
for subdir, dirs, files in os.walk(data_dir):
    # do not descend into subfolders; the per-subject summary files
    # sit directly in data_dir
    dirs.clear()
    
    # loop through files in list
    for file in files:
        if file[-4:] == ".csv": # only read in csv files
            file_dir = os.path.join(subdir, file)
            print(file_dir)
            # obtain subject study_id from the file name (drop ".csv")
            subject_id = file[:-4]
            temp_df = pd.read_csv(file_dir)
            temp_df.insert(loc=0, column="Date", value=pd.to_datetime(temp_df[['day', 'month', 'year']]))
            temp_df.insert(loc=0, column='Beiwe_ID', value=subject_id)
            df_list.append(temp_df)
                
# concatenate dataframes within list --> Final Data for communication summaries
if len(df_list) > 0:
    response_data = pd.concat(df_list, axis=0).reset_index()
    response_data = response_data.drop(['index','day', 'month', 'year'], axis=1)

    # print the first few observations
    print(response_data.head())

    # Write results to CSV 
    response_filename = 'comm_summary.csv'

    path_resp = os.path.join(direc, response_filename)    

    # write to csv
    response_data.to_csv(path_resp, index=False)
else:
    print("Error: No data found")

*The output should show the data for the first five observations in the concatenated dataset.*

Next, we summarize survey information using the **survey_stats_main** function in **base.py**, in the Sycamore tree of Forest. This will take between 5 minutes and 2 hours to run, depending on how many surveys were administered during your study.


- For **data_dir**, enter the "path to the data file directory". 
- For **output_dir**, enter the "path to the file directory where output is to be stored". 
- For **tz_str**, enter the time zone where the study was conducted. Here, it's **"America/New_York"**.
- For **beiwe_ids**, enter the list of Beiwe IDs to run Forest on. If this is `None`, sycamore will run on all users in the data_dir directory.
- For **config_path**, enter the filepath to your downloaded survey config file. This can be downloaded by clicking "edit study" on your study page, and clicking "Export study settings JSON file" under "Export/Import study settings". If this is None, Sycamore will still run, but fewer outputs will be produced.
- For **interventions_filepath**, enter the filepath to your downloaded interventions timing file. This can be downloaded by clicking "edit study" on your study page, and clicking "Download Interventions" next to "Intervention Data". If this is None, Sycamore will still run, but fewer outputs will be produced. 

In [None]:
from forest.sycamore.base import survey_stats_main

data_dir = dest_dir
output_dir = "survey_output"
tz_str = "America/New_York"
beiwe_ids = None
config_path = None
interventions_filepath = None

survey_stats_main(
    data_dir, output_dir, tz_str, beiwe_ids, time_start, time_end,
    config_path, interventions_filepath
)




## Plot Data

Now, we will generate some time series plots using the generated summary statistics.
- To read the file, we need to define **response_filename** with the concatenated dataset. Here, we are using 'gps_summary.csv'.

In [None]:
import matplotlib.pyplot as plt
import os
import pandas as pd

direc = os.getcwd()
response_filename = 'gps_summary.csv'
path_resp = os.path.join(direc, response_filename)    

# read data
response_data = pd.read_csv(path_resp)


The data needs to be sorted by date. The following code sorts the data and defines a plotting function that places 4 evenly spaced tick marks on the x-axis by default.

In [None]:
## Make sure dates are parsed as dates and the data is sorted by date
response_data['Date'] = pd.to_datetime(response_data['Date'])
response_data.sort_values('Date', inplace = True)
response_data.reset_index(drop = True, inplace = True)

def time_series_plot(var_to_plot, ylab = '', xlab = 'Date', num_x_ticks = 4):
    ## plot one line per participant, using only that participant's rows
    for key, grp in response_data.groupby('Beiwe_ID'):
        plt.plot(grp.Date, grp[var_to_plot], label=key)
    
    title = f"Time Series Plot of {var_to_plot}"
    plt.title(title)
    plt.xlabel(xlab)
    plt.ylabel(ylab)
    plt.legend(title = 'Beiwe_ID')
    
    ## get evenly spaced indices into the sorted unique dates
    unique_dates = response_data.Date.unique()
    tick_indices = [(i * (len(unique_dates) - 1)) // (num_x_ticks - 1) for i in range(num_x_ticks)]
    
    plt.xticks(unique_dates[tick_indices])
    plt.show()
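The tick placement in `time_series_plot` relies on integer arithmetic to spread a fixed number of indices from the first to the last unique date. The sketch below isolates that formula so it can be tried on its own. The function name `even_tick_indices` is hypothetical, and the formula requires at least 2 ticks.

```python
# Hypothetical helper: the same index formula used in time_series_plot,
# extracted for illustration. Requires num_ticks >= 2.
def even_tick_indices(n_values, num_ticks=4):
    """Indices spread evenly from 0 to n_values - 1."""
    return [(i * (n_values - 1)) // (num_ticks - 1) for i in range(num_ticks)]

print(even_tick_indices(10))      # [0, 3, 6, 9]
print(even_tick_indices(100, 5))  # [0, 24, 49, 74, 99]
```

The first and last indices always land on the first and last dates, and floor division keeps every index a valid integer position.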

- You can now create time series plots using **time_series_plot('variable')**.

In [None]:
time_series_plot('dist_traveled', ylab = "km")

*The output displays a time series plot for the variable "dist_traveled".*

In [None]:
time_series_plot('sd_flight_length', ylab = "km")

*The output displays a time series plot for the variable "sd_flight_length".*