Skip to content

This system generates natural language summaries for time-series data pertaining to the personal health domain

License

Notifications You must be signed in to change notification settings

harrij15/TemporalSummaries

Repository files navigation

Temporal Summaries

Defined Terms

  • Protoform (P): A sentence prototype (or template) that can be used to generate a natural language summary once it is filled in with the necessary information
  • Summarizer (S): A conclusive phrase for the summary
  • Quantifier (Q): A word or phrase that specifies how often the summarizer is true, given the object of interest
  • Attribute (A): A variable of interest in the database
  • Time window (TW): A time window of interest
  • Sub-time window (sTW): A time window at a smaller granularity than the specified time window
  • Qualifier (R): A word or phrase that adds specificity to the summary

Framework

Our framework automatically generates a diverse set of natural language summaries of time-series data, where the linguistic structure of the summaries conform to a set of protoforms. A protoform is a template that can be used to generate a natural language statement once it is filled in with the necessary information [4,5,6,7]. As an example, a simple protoform is:

"On (quantifier) (sub-time window) in the past (time window), your (attribute) was (summarizer)."

where summarizer is a conclusive phrase for the summary (e.g., "high", "low", etc.), and quantifier is a word or phrase that specifies how often the summarizer is true (e.g., "most", "all", etc.), given an attribute of interest. A concrete summary following the above protoform is:

“On most days in the past week, your sugar level was very high.”

Our system relies on SAX representations of time-series data [1,3] and temporal/sequence pattern discovery via the SPADE algorithm [8]. SAX representations allow us to convert raw time-series data into symbolic strings containing letters of the alphabet. These representations make it easier for time-series analysis methods to find interesting patterns and anomalies efficiently in the data. Using the SPADE algorithm, we are able to discover frequent sequences, or patterns, in the data. These patterns are considered "frequent" if they are above the specified minimum support threshold, and summaries for these patterns are generated if they are above the specified minimum confidence threshold.

We provide sample data from the Alpha Vantage REST API and the National Centers For Environmental Information (NCEI). Alpha Vantage [2] provides free APIs that allow users to receive real-time and historical financial data. NCEI [9] provides average temperature and average wind speed data tracked at the daily and hourly granularity by the weather station at the Huntsville International Airport in Huntsville, Alabama.

We are unable to provide data for the running example but this can be accessed by following this link: https://larc.smu.edu.sg/myfitnesspal-food-diary-dataset.

Implemented Protoforms

Below is a table of the current list of protoforms our system uses to generate summaries:

Summary Type Protoform
Standard Evaluation (TW) In the past full TW, your A1 has been S1,..., and your An has been Sn.
Standard Evaluation (sTW) On Q sTW in the past TW, your A1 was S1,..., and your An was Sn.
Standard Evaluation (sTW w/ qualifier) On Q sTW in the past TW R, your A1 was S1,..., and your An was Sn.
Comparison Your A1 was S1,..., and your An was Sn on TW1 N1 than they were on TW2 N2.
Goal Comparison You did S1 overall with keeping your A1 G1,..., and you did Sn overall with keeping your An Gn in TW1 N1 than you did in TW2 N2.
Goal Evaluation On Q sTW in the past TW, you S1 your goal to keep your A1 G1,..., and you Sn your goal to keep your An Gn.
Standard Trends Q time, your A1 S1,..., and your An Sn from one sTW to the next.
Cluster-Based Pattern During Q TW similar to TW N, your A1 S1,..., and your An Sn the next TW.
Standard Pattern The last time you had a TW similar to TW N, your A1 S1,..., and your An Sn the next TW.
If-Then Pattern There is C confidence that, when your A1 is S11, then S21,..., then Sm1,..., and your An is S1n, then S2n,..., then Smn, your A1 tends to be S(m+1)1,..., and your An tends to be S(m+1)n the next TW.
Day If-Then Pattern There is C confidence that, when your A1 is S11 on a D11, then S21 on a D21,..., then Sm1 on a Dm1,..., and your An is S1n on a D1n, then S2n on a D2n,..., then Smn on a Dmn, your A1 tends to be S(m+1)1 on a D(m+1)1,..., and your An tends to be S(m+1)n on a D(m+1)n the next TW.
General If-Then Pattern In general, if your A1 is S1,..., and your An is Sn, then your An+1 is Sn+1,..., and your An+m is Sn+m.
Day-Based Pattern Your A1 tends to be S1,..., and your An tends to be Sn on D.
Goal Assistance In order to better follow the G, you should S1 your A1, S2 your A2, ..., and Sn your An.
Population Evaluation Q1 users in this study had a S1 A1, a S2 A2, ..., and a Sn An P.

where S denotes a summarizer, Q is a quantifier, R is a qualifier, A is an attribute, G is a goal, D is a day of the week, C is a confidence value, TW is a time window, sTW is a sub-time window, N is a time window index, and P is a sub-protoform.

Setup

This system was implemented in Python 3.

  1. Install required Python packages using pip install
  • numpy
  • saxpy
  • squeezer
  1. Set value of attr_index as index for the desired attribute (available attributes stored in attributes list)
  2. Set age, activity level, alpha_size (alphabet size for SAX representation). and tw (time window size)
  3. Set min_conf (minimum confidence) and min_sup (minimum support) thresholds
  4. Set path to store pattern data for cSPADE
  5. Set cygwin_path for path to Cygwin (or equivalent) to run C++ commands
  6. Run python proto.py

Sample Run

The chart above is a snippet of stock market ticker data for Apple and Aetna that spans 100 days. Using a time window of seven days and an alphabet size of 5, our system produces 287 summaries (both univariate and multivariate) using seven different protoforms with a minimum confidence threshold of 80% and a minimum support threshold of 20%. Our approach generates a diverse set of summaries for the ticker data.

A subset of multivariate summaries for Apple and Aetna are shown below. Please note that not all protoforms are appropriate for stock market ticker data.

Summary Type Summary
Standard Evaluation (weekly granularity) In the past full week, the AAPL close value has been very high and the AET close value has been very high.
Standard Evaluation (daily granularity) On all of the days in the past week, the AAPL close value has been very high and the AET close value has been very high.
Standard Evaluation (daily granularity w/ qualifier) On all of the days in the past week, when the AAPL close value was very high, the AET close value was very high.
Comparison The AAPL close value was about the same and the AET close value was about the same in week 13 as they were in week 12.
Standard Trends Some of the time, the AAPL close value increases and the AET close value increases from one day to the next.
If-Then Pattern There is 100% confidence that, when your AAPL close value follows the pattern of being high, your AET close value tends to be high, then high the next day.
Day-Based Pattern The AAPL close value tends to be very high and the AET close value tends to be very high on Thursdays.

We are also able to automatically display the patterns we find in the data in generated time-series charts. The horizontal bars in the following charts represent vertical range corresponding to different summarizers. Starting from the bottom, the ranges determine whether a data point is either "very low" (blue), "low" (yellow), "moderate" (green), "high" (red), or "very high" (purple). For example, if a data point is within the blue range, it will be described as "very low."

We can display the AAPL and AET close value data as:

We display the pattern found for the 'Standard Evaluation (daily granularity)' summary below:

References

  1. Jessica Lin, Eamonn J. Keogh, Li Wei, and Stefano Lonardi. 2007. Experiencing SAX: A Novel Symbolic Representation of Time Series. Data Min. Knowl. Discov. 15 (08 2007), 107-144
  2. Romel Torres. 2019. Alpha Vantage. https://github.com/RomelTorres/alpha_vantage.
  3. Senin, P., Lin, J., Wang, X., Oates, T., Gandhi, S., Boedihardjo, A.P., Chen, C., Frankenstein, S., Lerner, M., GrammarViz 2.0: a tool for grammar-based pattern discovery in time series, ECML/PKDD Conference, 2014.
  4. Ronald R. Yager. 1982. A new approach to the summarization of data. Information Sciences 28, 1 (1982), 69 – 86.
  5. Lotfi A. Zadeh. 1975. The concept of a linguistic variable and its application to approximate reasoning–I. Information Sciences 8, 3 (1975), 199 – 249.
  6. Lotfi A. Zadeh. 1983. A computational approach to fuzzy quantifiers in natural languages. Computers & Mathematics with Applications 9, 1 (1983), 149 – 184.
  7. Lotfi A. Zadeh. 2002. A prototype-centered approach to adding deduction capability to search engines-the concept of protoform. In International IEEE Symposium on Intelligent Systems.
  8. Mohammed J. Zaki. 2001. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning 42, 1 (01 Jan 2001), 31-60.
  9. Matthew J. Menne, Imke Durre, Bryant Korzeniewski, Shelley McNeal, Kristy Thomas, Xungang Yin, Steven Anthony, Ron Ray, Russell S.Vose, Byron E.Gleason, and Tamara G. Houston. 2020. Global Historical Climatology Network - Daily (GHCN-Daily), Version 3. https://www.ncei.noaa.gov/.

About

This system generates natural language summaries for time-series data pertaining to the personal health domain

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages