# Introduction

This tutorial shows how to use Forest to analyze Beiwe data. We first download the data using mano, then process it with Forest, and finally create time series plots from the generated summary statistics. There are five parts to this tutorial.

1. Check the Python version and download Forest.
2. Download data for your study from the server.
3. Explore the file structure of your data.
4. Process data using Forest.
5. Create time series plots.

## Check Python Version and Download Forest

Before we begin, we need to check which version of Python is in use. Note that Forest is built using Python 3.11.

In [None]:
from platform import python_version
import sys

- Print the python version and the path to the Python interpreter. 

In [None]:
print(python_version()) ## Prints your version of Python
print(sys.executable) ## Prints the path to your current Python interpreter

*The output should display two lines.* 

1. The installed Python version. Make sure you are not using a version of Python earlier than 3.11.
2. The path to where Python is currently installed
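If you would rather have the notebook check the version for you than read it off the output, a minimal sketch along these lines may help. The helper name `python_is_supported` is made up for this example, and the 3.11 minimum is taken from the note above.

```python
import sys

# Minimum version stated in this tutorial; adjust if Forest's requirements change.
MIN_PYTHON = (3, 11)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


current = sys.version.split()[0]
if python_is_supported():
    print("Python version OK:", current)
else:
    print("Python", current, "is older than 3.11; consider upgrading.")
```

Comparing tuples avoids the pitfalls of comparing version strings, where "3.9" would sort after "3.11".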

- You may need to install git, pip, mano, and Forest. To do so, either run the cell below (the one with lines starting with "!") or enter the lines further below (those not starting with "!") in a command-line shell. If you already have mano and Forest installed, you can skip to the next step.

In [None]:
#run this chunk to install mano and forest
!pip install mano 
!pip install --upgrade https://github.com/onnela-lab/forest/tarball/develop

`# Or, copy and paste the below lines into a command-line shell` 

`pip install mano`

`pip install https://github.com/onnela-lab/forest/tarball/develop`

Note: In this notebook, you will install the develop branch of forest. This branch has all of the most recent features (including location type information), but function names are slightly different than in the main branch, so they may not match what is on the website. To find documentation specific to the develop branch, look at the current version's docstring by typing a function name and holding shift+tab.
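Outside of Jupyter, where shift+tab is not available, the same information can be read with the standard `inspect` module. The sketch below is only an illustration: `describe` is a hypothetical helper, and `example` is a stand-in function that is not part of Forest.

```python
import inspect


def describe(func):
    """Return the signature and docstring of `func` as one string."""
    signature = inspect.signature(func)
    doc = inspect.getdoc(func) or "(no docstring)"
    return f"{func.__name__}{signature}\n{doc}"


# Stand-in function used only for illustration; it is not part of Forest.
def example(data_dir, output_dir, tz_str="UTC"):
    """Hypothetical function with a Forest-like signature."""


print(describe(example))
```

Once Forest is installed, passing an imported Forest function to `describe` should show the argument names used by the develop branch.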

## Download Beiwe Data


In this notebook, we will download data from a Beiwe study. Edit the cell below to match the parameters of your study.

- For **study_id**, enter the study ID, found in the top right corner of the study page.
- For **direc**, the current working directory will be used. If you want data to be stored in another directory, change this variable to a string with the desired filepath.
- For **dest_folder_name**, enter the name of the folder you want raw data stored in.
- For **server**, enter the server where the data is located. If your Beiwe website URL starts with studies.beiwe.org, enter "studies".
- For **time_start**, enter the earliest date you want to download data for, in YYYY-MM-DD format.
- For **time_end**, enter the latest date you want to download data for, in YYYY-MM-DD format. If this is None, mano will download all data available (up until today at midnight).
- For **data_streams**, enter a list of data streams you want to download. Forest currently analyzes the `gps`, `survey_timings`, `calls`, and `texts` data streams. A full list of data types can be found under the "Download Data" tab of the Beiwe website. If this is None, all possible data streams will be downloaded.
- For **beiwe_ids**, enter a list of Beiwe IDs you want to download data for. If you leave this as an empty list, mano will attempt to download data for all user IDs.

In [None]:
import os
study_id = ""
direc = os.getcwd() # current working directory
dest_folder_name = "raw_data"
server = "studies"
time_start = "2008-01-01"
time_end = None
data_streams = ["gps", "survey_timings", "survey_answers", "audio_recordings", "calls", "texts", "accelerometer"]
beiwe_ids = []

dest_dir = os.path.join(direc, dest_folder_name)
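Because a long download can fail late on a malformed date, it may be worth checking `time_start` and `time_end` before continuing. The following is a minimal sketch, assuming the dates are plain strings as in the cell above; `is_iso_date` is a hypothetical helper and not part of mano or Forest.

```python
from datetime import datetime


def is_iso_date(value):
    """Return True if `value` is a zero-padded YYYY-MM-DD date string."""
    if not isinstance(value, str):
        return False
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime also accepts "2008-1-1", so compare against the padded form.
    return parsed.strftime("%Y-%m-%d") == value


# Example values mirroring the cell above.
time_start = "2008-01-01"
time_end = None

print(is_iso_date(time_start))
print(time_end is None or is_iso_date(time_end))
```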

In this next cell, we will import our keyring_studies.py file which includes download credentials. If you haven't already done this, open the keyring_studies.py file and paste your credentials inside. 

If your keyring_studies.py file is in a different directory than the one which includes this notebook, replace `sys.path.insert(0, '')` with `sys.path.insert(0, 'path/to/dir/containing/file/')`.

In [None]:
# import .py file located in another directory if needed
import mano
import sys
sys.path.insert(0, '')

import keyring_studies
kr = mano.keyring(None)

This next cell will download your data. Downloading your data will probably be the most time-consuming part of the whole process, so if you've already downloaded the data, you will save time by not running this cell.

In [None]:
import os

from helper_functions import download_data
download_data(kr, study_id, dest_dir, beiwe_ids, time_start, time_end, data_streams)

Next, we can explore the structure of the Beiwe data that we've just downloaded.

At the top level of the data directory (`dest_dir` above), each participant's data is stored in its own subdirectory, named according to the participant's assigned Beiwe ID. In the sample data, there are six subdirectories, one for each study participant.

In [None]:
from helper_functions import tree
from pathlib import Path
import pandas as pd

tree(dest_dir, level=1, limit_to_directories=True)
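If you only need the list of participant folders rather than a printed tree, a short `pathlib` sketch like the one below could be used. It assumes the layout described above (one subdirectory per Beiwe ID); `list_participant_dirs` is a hypothetical helper, not part of `helper_functions`, and the `raw_data` folder name is the default from the download cell.

```python
from pathlib import Path


def list_participant_dirs(data_dir):
    """Return the sorted names of the immediate subdirectories of `data_dir`."""
    return sorted(p.name for p in Path(data_dir).iterdir() if p.is_dir())


# Only runs if the default download folder from this tutorial exists.
if Path("raw_data").is_dir():
    ids = list_participant_dirs("raw_data")
    print(f"{len(ids)} participant folders found:", ids)
```

The returned names can be passed as the `beiwe_ids` argument in later cells when you want to test on a subset of participants.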

## Process Data using Forest 
- Using the Forest library developed by the Onnela Lab, we compute daily GPS and communication summary statistics.

First, we generate the GPS-related summary statistics by using the **gps_stats_main** function in **traj2stats.py** in the Jasmine tree of Forest. This code will take between 15 minutes and 12 hours to run, depending on your machine and the quantity of data downloaded. To make sure that everything is working correctly, first change the `beiwe_ids` argument from `None` to a list with just a couple of the Beiwe IDs in your study.

- For **data_dir**, enter the path to the data file directory. This will be the same directory you downloaded data into.
- For **output_dir**, enter the path to the directory where output is to be stored.
- For **tz_str**, enter the time zone where the study was conducted. Here, it's `"America/New_York"`. You can use `pytz.all_timezones` to check all options.
- For **frequency**, choose the temporal resolution of the summary statistics: daily, hourly, or both. Currently, this must be passed as a member of the `Frequency` class (imported in the cell below), for example `Frequency.HOURLY` or `Frequency.DAILY`.
- For **save_traj**, enter `True` if you want to save the trajectories as a csv file, or `False` if you don't (default: `False`). Here, we chose `True`.
- For **beiwe_ids**, enter the list of Beiwe IDs to run Forest on. If this is `None`, Jasmine will run on all users in the data_dir directory.
- For **places_of_interest**, enter a list of places of interest. This list must contain keywords from [openstreetmaps](https://wiki.openstreetmap.org/wiki/OpenStreetBrowser/Category_list).

There are also more optional arguments that can be passed to the function, which are located in the Hyperparameters class in the traj2stats.py file. These include:
- For **log_threshold**, enter the number of minutes that must be spent at a place for it to be counted.
- For **save_osm_log**, enter whether you want to save the log associated with places of interest.

Others can be seen in the class definition.
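A misspelled `tz_str` (for example, one with a stray trailing period) is easy to miss, so it can help to check the name before starting a long run. The sketch below uses the standard library's `zoneinfo` module rather than `pytz`; this is a substitution made for the example, not necessarily what Forest uses internally, and `is_valid_timezone` is a hypothetical helper.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def is_valid_timezone(name):
    """Return True if `name` is a time zone key available on this system."""
    try:
        ZoneInfo(name)
    except (ZoneInfoNotFoundError, ValueError, OSError):
        return False
    return True


# The trailing period makes this an invalid key.
print(is_valid_timezone("America/New_York."))
# The result for a correct key depends on the time zone data installed locally.
print(is_valid_timezone("America/New_York"))
```

Note that `zoneinfo` relies on the system time zone database (or the `tzdata` package), so a valid name can still be reported as unavailable on a machine without that data.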

In [None]:
from forest.jasmine.traj2stats import gps_stats_main, Hyperparameters
from forest.constants import Frequency

data_dir = dest_dir
gps_output_dir = "gps_output"
tz_str = "America/New_York"
freq = Frequency.DAILY
save_traj = True 
beiwe_ids = None
places_of_interest = None

# if you are not interested in more specific hyperparameters, you can use the default ones
# by setting parameters = None or not passing in the parameters argument
parameters = Hyperparameters()
parameters.save_osm_log = False
parameters.log_threshold = 60

gps_stats_main(
    data_dir, gps_output_dir, tz_str, freq, save_traj, places_of_interest = places_of_interest, 
    participant_ids = beiwe_ids, parameters = parameters
)


*The output should describe how the data is being processed. If this is working correctly, you will see something like:*
    
><i>User: tcqrulfj  
Read in the csv files ...  
Collapse data within 10 second intervals ...  
Extract flights and pauses ...  
Infer unclassified windows ...  
Merge consecutive pauses and bridge gaps ...  
Selecting basis vectors ...  
Imputing missing trajectories ...  
Tidying up the trajectories...  
Calculating the daily summary stats...</i>

We will now concatenate the GPS summaries into one file.

In [None]:
from helper_functions import concatenate_summaries


concatenate_summaries(dir_path = os.path.join(direc, gps_output_dir), 
                      output_filename = os.path.join(direc,"gps_summaries.csv"))
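To make the concatenation step less opaque, here is a rough sketch of what such a helper might do. It is not the implementation in `helper_functions`: it assumes one csv file per participant, named after the Beiwe ID, and it records that ID in a `Beiwe_ID` column. The function name `concatenate_csvs` is made up for this example.

```python
import csv
from pathlib import Path


def concatenate_csvs(dir_path, output_filename, id_column="Beiwe_ID"):
    """Stack every csv in `dir_path` into one file, tagging rows with the file stem."""
    rows, fieldnames = [], [id_column]
    for csv_path in sorted(Path(dir_path).glob("*.csv")):
        with open(csv_path, newline="") as handle:
            reader = csv.DictReader(handle)
            # Collect column names in first-seen order across all files.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                row[id_column] = csv_path.stem  # file name without ".csv"
                rows.append(row)
    with open(output_filename, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Writing the combined file outside the input directory, as the cell above does, keeps it from being picked up as an input the next time the step is run.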



Second, we compute the call and text-based summary statistics by using the **log_stats_main** function under the **log_stats.py** in the Willow tree of Forest. This should run a lot faster than `forest.jasmine.traj2stats.gps_stats_main`. 


- For **data_dir**, enter the path to the data file directory.
- For **output_dir**, enter the path to the directory where output is to be stored.
- For **tz_str**, enter the time zone where the study was conducted. Here, it's `"America/New_York"`.
- For **option**, choose a Frequency value corresponding to the temporal resolution you would like data to be aggregated to.
- For **beiwe_ids**, enter the list of Beiwe IDs to run Forest on. If this is `None`, Willow will run on all users in the data_dir directory.

In [None]:
import forest.willow.log_stats
data_dir = dest_dir
comm_output_dir = "comm_output"
tz_str = "America/New_York"
option = Frequency.DAILY
beiwe_ids = None



forest.willow.log_stats.log_stats_main(
    data_dir, comm_output_dir, tz_str, option, beiwe_ids = beiwe_ids
)

*The output should describe how the data is being processed (e.g., read, collapse, extracted...imputing, tidying, and calculating daily summary stats).*

>*Note: calls and texts data are only collected on Android phones. If you only enrolled users with iPhones in your study, you will not have any output here.*

- The following code is used to concatenate these files into a single csv for the **communication summaries**.

In [None]:
from helper_functions import concatenate_summaries

concatenate_summaries(dir_path = os.path.join(direc,comm_output_dir), 
                      output_filename = os.path.join(direc,"comm_summaries.csv"))


*The output should show the data for the first five observations in the concatenated dataset.*

Next, we summarize survey information using the **compute_survey_stats** function in **base.py** in the Sycamore tree of Forest. This will take between 5 minutes and 2 hours to run, depending on how many surveys were administered during your study.


- For **data_dir**, enter the path to the data file directory.
- For **output_dir**, enter the path to the directory where output is to be stored.
- For **tz_str**, enter the time zone where the study was conducted. Here, it's `"America/New_York"`.
- For **beiwe_ids**, enter the list of Beiwe IDs to run Forest on. If this is `None`, Sycamore will run on all users in the data_dir directory.
- For **config_path**, enter the filepath to your downloaded survey config file. This can be downloaded by clicking "edit study" on your study page, and clicking "Export study settings JSON file" under "Export/Import study settings". If this is None, Sycamore will still run, but fewer outputs will be produced.
- For **interventions_filepath**, enter the filepath to your downloaded interventions timing file. This can be downloaded by clicking "edit study" on your study page, and clicking "Download Interventions" next to "Intervention Data". If this is None, Sycamore will still run, but fewer outputs will be produced. (Note: this doesn't apply if you are using the main version of Sycamore.)

In [None]:
from forest.sycamore.base import compute_survey_stats

data_dir = dest_dir
survey_output_dir = "survey_output"
tz_str = "America/New_York"
beiwe_ids = None
config_path = None
interventions_filepath = None

compute_survey_stats(
    study_folder = data_dir, output_folder = survey_output_dir,
    config_path = config_path, tz_str = tz_str, users = beiwe_ids,
    start_date = time_start, end_date = time_end, 
    interventions_filepath = interventions_filepath)

Now, we summarize accelerometer data using the **run** function in **base.py** in the Oak tree of Forest. This tree is in beta testing, so don't be surprised if you encounter errors running this function.

- For **data_dir**, enter the path to the data file directory.
- For **accelerometer_output_dir**, enter the path to the directory where output is to be stored.
- For **tz_str**, enter the time zone where the study was conducted. Here, it's `"America/New_York"`.
- For **frequency**, choose a Frequency value, as was done for Jasmine.
- For **beiwe_ids**, enter the list of Beiwe IDs to run Forest on. If this is `None`, Oak will run on all users in the data_dir directory.

In [None]:
from forest.oak.base import run

data_dir = dest_dir
accelerometer_output_dir = "accel_output"
tz_str = "America/New_York"
frequency = Frequency.DAILY
beiwe_ids = None

run(data_dir, accelerometer_output_dir, 
    tz_str, frequency, users = beiwe_ids)

In [None]:
from helper_functions import concatenate_summaries


concatenate_summaries(dir_path = os.path.join(direc, accelerometer_output_dir), 
                      output_filename = os.path.join(direc,"accel_summaries.csv"))

## Plot Data

Now, we will generate some time series plots using the generated summary statistics.
- To read the file, we need to set **response_filename** to the name of the concatenated dataset. Here, we are using 'gps_summaries.csv', the file written by the GPS concatenation step above.

In [None]:
import matplotlib.pyplot as plt
import os
import pandas as pd

direc = os.getcwd()
response_filename = 'gps_summaries.csv'
path_resp = os.path.join(direc, response_filename)    

# read data
response_data = pd.read_csv(path_resp)


The data needs to be sorted according to date. The following code sorts the data and defines a plotting function that draws one line per participant and places 4 evenly spaced ticks on the date axis.

In [None]:
## Make sure the data is sorted according to date
## (assumes the Date column holds date strings that pandas can parse)
response_data['Date'] = pd.to_datetime(response_data['Date'])
response_data.sort_values('Date', inplace = True)
response_data.reset_index(drop = True, inplace = True)

def time_series_plot(var_to_plot, ylab = '', xlab = 'Date', num_x_ticks = 4):
    ## Plot one line per participant, using only that participant's rows
    for key, grp in response_data.groupby('Beiwe_ID'):
        plt.plot(grp.Date, grp[var_to_plot], label = key)

    title = f"Time Series Plot of {var_to_plot}"
    plt.title(title)
    plt.xlabel(xlab)
    plt.ylabel(ylab)
    plt.legend()  ## show which line belongs to which Beiwe ID

    ## Get evenly spaced indices into the sorted unique dates
    unique_dates = response_data.Date.unique()
    tick_indices = [(i * (len(unique_dates) - 1)) // (num_x_ticks - 1) for i in range(num_x_ticks)]

    plt.xticks(unique_dates[tick_indices])
    plt.show()
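The tick placement in the function above relies on a small piece of integer arithmetic, which is easier to follow in isolation. The sketch below restates it as a standalone function; the name `even_tick_indices` is made up for this example.

```python
def even_tick_indices(n_points, num_ticks=4):
    """Return `num_ticks` indices spread evenly from 0 to n_points - 1."""
    if n_points < 1 or num_ticks < 2:
        raise ValueError("need at least one point and at least two ticks")
    # Scale i from 0..num_ticks-1 onto 0..n_points-1, rounding down.
    return [(i * (n_points - 1)) // (num_ticks - 1) for i in range(num_ticks)]


print(even_tick_indices(10))  # → [0, 3, 6, 9]
print(even_tick_indices(7))   # → [0, 2, 4, 6]
```

The first and last indices always land on the first and last dates. When there are fewer dates than ticks, some indices repeat, so a few tick labels will coincide.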

- You can now create time series plots using **time_series_plot('variable')**.

In [None]:
time_series_plot('dist_traveled', ylab = "km")

*The output displays a time series plot for the variable, "dist_traveled."*

In [None]:
time_series_plot('sd_flight_length', ylab = "km")

*The output displays a time series plot for the variable, "sd_flight_length."*