# Processing Raw Data From iTrace for Experiments Containing Edits

This procedure requires that you recorded your experiment session with [FLUORITE](http://www.cs.cmu.edu/~fluorite/) in addition to iTrace.

In iTrace v0.0.1 (Alpha), editing is not supported. *If you used a later version of iTrace that explicitly supports editing, this issue does not apply to your experiment, and you do not need the code in this example.* The work-around presented here is very time-consuming and should be avoided if at all possible.

Broadly, this repository combines the capabilities of iTrace and FLUORITE to support editing during an experiment. The `analyzer.py` script is an example of how to use this repository to build a post-processing workflow. This notebook explains that script in enough detail for you to write one of your own.

## Contents:
* [General Procedure](#General-Procedure)
* [Ingesting the Raw Data](#Ingesting-the-Raw-Data)
* [Partitioning Raw Data](#Partitioning-Raw-Data)
* [Running gaze2src](#Running-gaze2src)
* [Additional Analyses](#Additional-Analyses)
* [Accumulating Segmented Data](#Accumulating-Segmented-Data)
* [Appendix: Tracing Code Regions](#Appendix:-Tracing-Code-Regions)

## General Procedure
<img src="../img/chart.png" alt="chart" width="500"/>

Three (3) files are required to run iTrace-post:
* A log from iTrace-Core of the form `core*.xml`
* A log from the iTrace plugin, of the form `eclipse*.xml`
* A log from the FLUORITE plugin, of the form `log*.xml`

All of these must be present to use the functionality described below.
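
A quick pre-flight check can confirm that all three logs are present before any processing starts. The sketch below assumes the logs sit together in one directory and follow the filename patterns listed above. The helper name `find_required_logs` is hypothetical and is not part of this repository.

```python
import glob
import os

# Filename patterns taken from the list above. The FLUORITE pattern accepts
# either "log" or "Log" because glob matching is case-sensitive on most
# non-Windows systems and the sample file is named "Log-sample.xml".
REQUIRED_PATTERNS = {
    "core": "core*.xml",
    "plugin": "eclipse*.xml",
    "fluorite": "[Ll]og*.xml",
}


def find_required_logs(log_dir):
    """Return {role: path} for the three logs, or raise if any is missing."""
    found = {}
    missing = []
    for role, pattern in REQUIRED_PATTERNS.items():
        matches = sorted(glob.glob(os.path.join(log_dir, pattern)))
        if matches:
            # If several files match, this sketch simply takes the first.
            found[role] = matches[0]
        else:
            missing.append(pattern)
    if missing:
        raise FileNotFoundError("Missing log file(s): " + ", ".join(missing))
    return found
```

For the sample data, `find_required_logs("sample-data/log-files")` would be the corresponding call.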

In [1]:
import os
import glob
import pandas as pd
import subprocess
from subprocess import DEVNULL
from fluorite import ProjectHistory, GazeDataPartition
from itrace_post import post_to_aoi, create_combined_archive

## Ingesting the Raw Data

The log from FLUORITE should first be parsed. This creates a `fluorite.ProjectHistory` object that stores information about edits, and can retrieve the content of any file at any time on demand. To use this functionality by itself, please refer to the [relevant notebook](fluorite.ipynb).
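
To make "the content of any file at any time" concrete, here is a minimal sketch of the general idea of replaying timestamped edits. It is not the implementation of `fluorite.ProjectHistory` and does not reflect the FLUORITE log format: the `Edit` record and the `content_at` function are hypothetical names invented for illustration.

```python
from collections import namedtuple

# Hypothetical edit record (an assumption, not the FLUORITE schema):
# at `timestamp` (ms), `deleted` characters starting at `offset` are
# removed and the string `inserted` is put in their place.
Edit = namedtuple("Edit", ["timestamp", "offset", "deleted", "inserted"])


def content_at(initial_text, edits, when):
    """Replay every edit whose timestamp is <= `when`, in time order."""
    text = initial_text
    for edit in sorted(edits, key=lambda e: e.timestamp):
        if edit.timestamp > when:
            break
        text = (
            text[:edit.offset]
            + edit.inserted
            + text[edit.offset + edit.deleted:]
        )
    return text
```

Because the state of a file only changes at edit timestamps, the content is constant between two consecutive edits, which is what makes the partitioning described later possible.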

In [2]:
# A log file from FLUORITE is parsed
fluorite_log_location = "sample-data/log-files/Log-sample.xml"
project_history = ProjectHistory(fluorite_log_location)

An aside: for your analysis, you may wish to trace the locations of functions and other important sections of code so that fixations can be mapped to these regions. This is the mode used by `analyzer.py`. This feature is further explained in the [Appendix](#Appendix:-Tracing-Code-Regions).

Next, ingest the iTrace log file from Eclipse. Note that iTrace may not record times in UTC, but FLUORITE does. One way to correct this is to ensure that your machine's system time is set to UTC when recording, but it may be better to deduce the correct offset (most likely an integer number of hours) and specify it.
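
One way to deduce the offset is to find a single moment that appears in both logs (for example, the start of the recording) and round the difference to a whole number of hours. The sketch below assumes the offset is the amount added to iTrace times to match FLUORITE times, and that both timestamps are in milliseconds. The function name `deduce_offset_ms` is hypothetical.

```python
MS_PER_HOUR = 3600 * 1000


def deduce_offset_ms(itrace_time_ms, fluorite_time_ms):
    """Round the gap between two timestamps of the same moment to whole hours.

    Assumes the true offset is an integer number of hours; time zones with
    half-hour offsets would need a finer rounding step.
    """
    difference = fluorite_time_ms - itrace_time_ms
    hours = round(difference / MS_PER_HOUR)
    return hours * MS_PER_HOUR
```

Rounding absorbs the small gap between when the two tools actually logged the event, so the two timestamps do not have to match exactly.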


In [3]:
# In our timezone at least, the time iTrace records is four hours 
#    behind that of FLUORITE.
time_offset = 4*3600*1000
eclipse_log = "sample-data/log-files/eclipse_log.xml"
data_partition = GazeDataPartition(eclipse_log, time_offset)

## Partitioning Raw Data

We can now create a timeline of project files on the same range of times found in the iTrace plugin log. Then we can save chunks of the plugin log to the same location.
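
Conceptually, partitioning means assigning each gaze sample to the time period that contains its timestamp. The sketch below illustrates that idea under stated assumptions: periods are sorted, non-overlapping `(start, end)` pairs in milliseconds, with an inclusive start and an exclusive end. The actual structure of `time_periods` and the behavior of `GazeDataPartition` may differ, and `assign_to_periods` is a hypothetical name.

```python
from bisect import bisect_right


def assign_to_periods(timestamps, periods):
    """Group timestamps by the index of the (start, end) period containing them.

    Timestamps that fall in a gap between periods, or outside all of them,
    are dropped.
    """
    starts = [start for start, _ in periods]
    chunks = {i: [] for i in range(len(periods))}
    for t in timestamps:
        # Index of the last period whose start is <= t.
        i = bisect_right(starts, t) - 1
        if i >= 0 and t < periods[i][1]:
            chunks[i].append(t)
    return chunks
```

A period that receives no samples ends up with an empty list, which corresponds to the segments without a plugin log that are skipped in the next section.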

In [4]:
# Save a file timeline. This creates files in the specified directory, 
#    and returns a list of time ranges.
output_dir = "sample-output"

time_periods = project_history.save_timeline(
    output_dir,         # Directory in which to create timeline
    granularity="finest",    # Granularity (can also be set to create uniform time periods)
    first_time=data_partition.first_time,
    last_time=data_partition.last_time
)

# Write chunks of the plugin log to this directory
data_partition.create_partition(time_periods=time_periods)
data_partition.save_partition(output_dir)

## Running gaze2src

Now we have all our raw data partitioned such that each segment represents a period with no changes. This means we're almost ready to run the `gaze2src` program that ships with v0.0.1 of iTrace. First, however, the code files of each segment must be packed into a gzipped tarball and converted to a srcML archive.

This is the time-consuming part. You'll need to wait a while for this to terminate.

For convenience, the sample here has been truncated to just a single segment.

In [5]:
# Parameters for gaze2src:
FILTER = "ivt"
FILTER_ARGS = ["-v 30", "-u 60"]

# Location of log file from iTrace-core
core_log = "sample-data/log-files/core_log.xml"

# Iterate over all segments
for sub_dir in sorted(os.listdir(output_dir), key=lambda x: x.split("_")[0]):
    prefix = output_dir + "/" + sub_dir
    
    # Skip segments where there is no plugin log.
    #    This can happen if edits are made very close together,
    #    or if the plugin log terminates before the FLUORITE log.
    if not os.path.exists(prefix+"/plugin_log.xml"):
        continue
        
    print("\t"+sub_dir)
        
    # Create a tarball of code files
    subprocess.run(["tar", "-czf", prefix+"/src.tar.gz", prefix+"/code_files"],
                   stdout=DEVNULL, stderr=DEVNULL)

    # Run srcml
    subprocess.run(["srcml", prefix+"/src.tar.gz", "-o", prefix+"/src.xml"],
                   stdout=DEVNULL, stderr=DEVNULL)

    # Run gaze2src
    subprocess.run(["gaze2src", core_log, prefix+"/plugin_log.xml", prefix+"/src.xml",
                    "-f", FILTER]+FILTER_ARGS+["-o", prefix+"/gaze2src"],
                   stdout=DEVNULL, stderr=DEVNULL)

	0_1562859140262-1562859424707
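
The loop above discards all output and ignores exit codes, so a missing tool or a failed step goes unnoticed until a later cell finds no results. A minimal sketch of a stricter wrapper follows. The name `run_checked` is hypothetical, and it relies only on the standard `subprocess.run` interface (Python 3.7 or later for `capture_output`).

```python
import subprocess
import sys


def run_checked(command):
    """Run `command`; return True on success, otherwise report and return False."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        # The executable itself is not installed or not on PATH.
        print("Command not found: " + command[0], file=sys.stderr)
        return False
    if result.returncode != 0:
        print(
            "Command failed (%d): %s" % (result.returncode, " ".join(command)),
            file=sys.stderr,
        )
        print(result.stderr, file=sys.stderr)
        return False
    return True
```

In the loop, each `subprocess.run([...], stdout=DEVNULL, stderr=DEVNULL)` call could become `if not run_checked([...]): continue`, so that a segment is skipped as soon as one of its steps fails.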


## Additional Analyses

In addition to the first-level analysis from `gaze2src`, you may generate AOIs for each segment. These are inferred from the gaze data in each segment, and are calculated using Gaussian smoothing and thresholding.
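
To illustrate what "Gaussian smoothing and thresholding" means, the sketch below applies the idea in one dimension, over fixation counts per source line. It is a simplification and not the implementation of `post_to_aoi`, which may operate differently and may interpret its smoothing and threshold parameters in another way. Here `sigma` is measured in lines and `threshold` is a fraction of the smoothed peak; both function names are hypothetical.

```python
import math


def smooth_counts(counts, sigma):
    """Convolve per-line fixation counts with a normalized Gaussian kernel."""
    radius = max(1, int(math.ceil(3 * sigma)))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    smoothed = []
    for i in range(len(counts)):
        acc = 0.0
        for offset, weight in zip(offsets, kernel):
            j = i + offset
            if 0 <= j < len(counts):
                acc += weight * counts[j]
        smoothed.append(acc)
    return smoothed


def label_aois(counts, sigma, threshold):
    """Label each line: 0 for background, 1..n for contiguous dense regions."""
    smoothed = smooth_counts(counts, sigma)
    peak = max(smoothed, default=0.0)
    labels = []
    current = 0
    inside = False
    for value in smoothed:
        if peak > 0 and value >= threshold * peak:
            if not inside:
                # Entering a new region above the threshold.
                current += 1
                inside = True
            labels.append(current)
        else:
            inside = False
            labels.append(0)
    return labels
```

A larger `sigma` merges nearby clusters of fixations into a single AOI, while a higher `threshold` shrinks each AOI toward its densest lines.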

If you wish to use this analysis on an experiment where iTrace was not used, or with a later version of iTrace, please refer to the [relevant notebook](aoi_generation.ipynb).

In [6]:
# Iterate over all segments
for sub_dir in sorted(os.listdir(output_dir), key=lambda x: x.split("_")[0]):
    prefix = output_dir + "/" + sub_dir
    itrace_prefix = prefix + "/gaze2src"
    
    # Check for the TSV and database files from gaze2src;
    #    skip this segment if either one is missing.
    try:
        fixations_tsv = glob.glob(itrace_prefix + "/fixations*.tsv")[0]
        # The values stored in the database are also of interest.
        fixations_db = glob.glob(itrace_prefix + "/rawgazes*.db3")[0]
    except IndexError:
        continue
    
    post_to_aoi(
        fixations_db,             # Path to database file
        fixations_tsv,            # Path to TSV file
        prefix+"/code_files",     # Directory containing code files for this segment
        prefix+"/post2aoi",       # Output directory
        5.0,                      # Smoothing parameter
        0.01,                     # Threshold parameter
        time_offset=time_offset,  # Time offset
        compute_aois=True         # Directive to compute AOIs 
                                  #     (if unset, this function just 
                                  #      converts data to a CSV)
        )

## Accumulating Segmented Data

Now that all the data has been processed, we can accumulate it into a single file. In this example, we have created several files of the form `*.java_AOI.csv` that list the fixations on a particular code file and give each one's AOI number. Files of the form `*.java_AOI.json` map each AOI number to a position. Because we are combining data from different segments, and different segments have different AOIs, information about AOIs will be lost unless you assign each row a segment number. That is not done in this example, because only one segment is generated and analyzed.
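
If you do analyze several segments, the segment number can be recovered from the path of each CSV file. The sketch below assumes the directory layout used in this notebook, `<output_dir>/<index>_<start>-<end>/post2aoi/<file>.java_AOI.csv`, where the leading integer of the segment directory is its index. The helper name `segment_number` is hypothetical.

```python
import os


def segment_number(csv_path):
    """Return the leading integer of the segment directory that holds csv_path."""
    post2aoi_dir = os.path.dirname(os.path.normpath(csv_path))
    segment_dir = os.path.basename(os.path.dirname(post2aoi_dir))
    return int(segment_dir.split("_")[0])
```

With pandas, the value could be attached to each file's rows before concatenation, for example `pd.read_csv(csv_file).assign(segment=segment_number(csv_file))`.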

In [7]:
# Find all CSV files
all_csvs = glob.glob(output_dir + "/*/post2aoi/*.java_AOI.csv")

# Parse all CSV files
archives = [pd.read_csv(
                csv_file, parse_dates=["fix_time"], index_col=["fix_time"]
            ) for csv_file in all_csvs]

# Concatenate data
all_data = pd.concat(archives, ignore_index=False)
all_data = all_data.sort_index()

# Save data
all_data.to_csv(output_dir + "/sample_accumulated.csv")

## Appendix: Tracing Code Regions

You may also wish to trace the positions of particular code entities, such as functions and enums, as they are edited. This will allow you to map fixations to these regions as well as AOIs. 

One or more files that provide information about the starting locations of these entities are required:

* A "function index", which is a JSON file of the following form:

    ```
    {
        "(Name of file without extension)": 
        {
           "(Same Name.Function)": 
               [
                   (Starting line),
                   (Ending line)
               ],
           ...
        },
        ...
    }
    ```
    
    This allows a fixation to be mapped to a function or other similar structure. Note that there is no requirement that the keys are functions. This file can represent any area of code.
    
    
    
* An "entity index", which is a similar JSON file:
    
    ```
    {
        "(Name of file without extension)":
        {
            "(Entity Type)":
            {
                "(Entity Name)":
                [
                    (Starting line),
                    (Ending line)
                ],
                ...
            },
            ...
        },
        ...
    }
    ```
    
    This allows a fixation to be mapped to a particular type of code structure. The 'Entity Name' will not appear in data files, but its presence may make it easier to compose this file, as you may write a comment that indicates what text is contained in the specified range.
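
To show how the two indexes could be used, the sketch below maps a fixation's line number to a function name and to an entity type. It assumes that the starting and ending lines are both inclusive, which the formats above do not state. The lookup functions and the file, function, and entity names in the sample data are hypothetical and are not taken from this repository.

```python
# Illustrative indexes following the two JSON layouts described above,
# written here as Python dicts (json.load would produce the same shape).
function_index = {
    "Main": {
        "Main.run": [10, 25],
        "Main.helper": [27, 40],
    }
}
entity_index = {
    "Main": {
        "enum": {"Color": [3, 8]},
        "function": {"run": [10, 25], "helper": [27, 40]},
    }
}


def lookup_function(index, file_stem, line):
    """Return the function-index key whose line range contains `line`, or None."""
    for name, (start, end) in index.get(file_stem, {}).items():
        if start <= line <= end:
            return name
    return None


def lookup_entity_type(index, file_stem, line):
    """Return the entity type whose line range contains `line`, or None."""
    for entity_type, entities in index.get(file_stem, {}).items():
        for start, end in entities.values():
            if start <= line <= end:
                return entity_type
    return None
```

If ranges overlap, both functions return the first match in file order, so a real index would need either disjoint ranges or an explicit rule for nested regions.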

The `analyzer.py` script is a good example of how to use this functionality. The paths to a function index and entity index are given when constructing a `ProjectHistory` object, which allows the object to compute the positions of the given regions as they change over the course of the project timeline. Saving the timeline causes copies of these files to be written alongside the code files, so that they can be referenced when calling `post_to_aoi`. This results in the classifications being appended to each row of the resulting CSVs.