author | title | date |
---|---|---|
Mark Koester |
A Self Across Time |
PYCON MALAYSIA, August 25, 2019 |
Mark Koester | www.markwk.com
PYCON MALAYSIA, August 24, 2019
Slides and Code: github.com/markwk/ts4health
github.com/markwk/ts4health
A time series is a sequence of observations taken sequentially in time.
George Box and Gwilym Jenkins, Time Series Analysis (2015, orig 1970)
::: notes In short, a time series is a sequence of measurements or observations that varies in time. We define this in more detail but this is a good starting point and classic definition. :::
The objective of time series analysis is to decompose a time series into its constituent characteristics and develop a mathematical model for each.
Pal, D. A., & Prakash, D. P. K. S. (2017). Practical Time Series Analysis
::: notes Time Series Analysis describes various techniques used to prepare, process, explore and often visualize time-oriented data. TS strives to extract meaningful statistics and other characteristics of the data. Time series analysis aims to understand and classify the inherent nature of a (time) series and eventually try to formulate a model for the series. :::
- Financial markets: stock prices and daily closing value of the Dow Jones Industrial Average.
- Natural Phenomenon: temperature trends, climate change, ocean tides, sunspots.
- Economics and Price fluctations: GDP, Job Market, supply and demand, i.e. buying and selling drugs.
- Health, Psychology and Behavioral Sciences: Drug or intervention analysis, investigation of natural, biological cycles
Source: Box, 2015.
- Health: N-of-1 trials in Medical Research and Practice, esp intervention studies.
- Self: Within-individual variability for Self-Trackers, Quantified Self and Biohackers.
::: notes Less work has been done and fewer examples are available about time series analysis in the realm of health and wellness.
Most of the related discussions on this are from medical research using what are called n=1 or n-of-1 trials. N-of-1 trials are studies involving a single patient. More on n-of-1 trials will be explored at the end of the talk.
With the rise of quantified self and various self-tracking technologies, we see similar interest in trying to understand and model within-individual (or intra-individual) variations using time series data. This might be personal data like fitness, mood, energy level, biomarker data like blood tests or even productivity and creativity. :::
measuring or documenting something about your self to gain meaning or make improvements
Related: Self-tracking, Biohacking, Data-driven life...
::: notes
- 2007: Quantified Self is neulogism created by Gary Wolf and Kevin Kelly, two writers at Wired Magazine.
- 2008: Wolf and Kelly founded the company Quantified Self with the aim “to help people get meaning out of their personal data”
- Movement
- “Quantified Self (QS) is an emerging area of technology that allows consumers to use a variety of digital tools to collect data and learn about their behaviors and habits of everyday life.” (Rocket Fuel Survey, 2014)
- "The Quantified Self (QS) refers to a movement in which its participants track the biological, physical, behavioural, and/or environmental aspects of their everyday lives" (Eiben, 2015) :::
Source: https://github.com/markwk/qs_mind_map
How to understand human health across time
or an individual self over a lifetime?
::: notes While the gold standard of health research are randomized clinical trials, increasingly as we move towards individualized health and personalized medicine the question becomes not just about how to understand health as such but how to understand the health of a single case, a single person. For this, among other things (like biological variation), we have to look at the role of time. We need time series data analysis to make our personal data useful and usable or analysis.
Health Examples:
- Global Health trends
- Individual health changes in a lifetime :::
- deeply interesting in how both humans and technologies work
- part social scientist, part technologist
- software developer / builder: Int3c.com, github.com/markwk
- writer / thinker: www.markwk.com, DataDrivenYou.com
::: notes My name is Mark Koester. I describe myself as part social scientist, part technologist. I'm deeply interesting in how both humans and technologies work.
Professionally, I'm a software developer and writer. I run a web, app and data development company.
I also write and build stuff too... including apps, websites, open source data products and more, which I can share briefly more about at the end.
In the past, I have also run various innovation, entrepreneurship and startup programs, like accelerators, workshops, startup weekends, etc. n particular during my time at Techstars and with a few other organizations around the world. :::
to transform science and data into better self-understanding and empowered self-improvement
Intersection of data technologies AND human health and optimization
- Doctors and companies to improve health and happiness
- Quantified Self and Biohackers
- Build data-driven tools with Machine Learning and AI (QS Ledger, PhotoStats.io)
::: notes
My main focus nowadays is at the intersection of data technologies AND human health and optimization. For example, among other things, I worked with doctors and companies on how to use wearable and biomarker data to improve employee health and happiness; I've created an open source platform for personal data aggregation and visualization (QS Ledger); and I've built apps using machine learning and AI (like PhotoStats.io and Datadrivenyou.com). I also write a lot on this topics too on my blog www.markwk.com, and I'm currently working on an online course and a book.
:::
Time Series Data Analysis
with Python
- a conceptual understanding of time series analysis
- starter code for visualizing, testing and modeling for time series effects
- Time and Temporal Structures of Time Series Data
- Our Sample Dataset (Sleep and Exercise / Activity) and Practical Questions
- Time Series Visualization, EDA, and Processing
- Tests and Techniques for Time Series Data
- Modeling Time Series Data
- Time Series Data Analysis Applied to Health and Self
- Conclusions, Next Steps and Future Directions
REFERENCES and APPENDIX
Focus will be on univariate, linear, discrete time series
(instead of multivariate, nonlinear, or continuous)
and assume our data/process follows a stochastic model.
What, then, is time? If no one asks me, I know what it is. If I wish to explain it to him who asks, I do not know.
Saint Augustine (AD 354-430, The Confessions)
::: notes
Time has a pecular quality such that we think we intuitive know and understand it. But once we are asked, as Saint Augustine asked 1600 years ago, it seems rather hard to explain what we mean. In fact, some of the difficulty with defining time comes with the different perspectives we might take on the question itself. The nature of time has been a long-standing question for humans throughout history. Various philosophers, thinkers, scientists, physists and artists have all weighed in, often using their conceptual domain as the starting point.
Interestingly, while we often think of time as an innate and universal concept, when you look at history, literature and religion, you notice that our modern sense of time is a rather modern one. In fact, many earlier cultures like the Mayans did not view time as linear, as we now do, but viewed it as cyclical. According to Whitrow in various books on time, key scientific figures like Kant and Descare led to the origin of the idea of time as we know it now.
TODO: History and Evolution of Measuring Time :::
- Much of philosophy and the history of philosophy can be thought of in terms of questions regarding what is timeless and what is not.
::: notes Much of philosophy and the history of philosophy can be thought of in terms questions regarding what is timeless and what is not. Metaphysics would be the central domain in which these questions arised, but in fact, question of what is timeless seeps into nearly all thinkers. Plato, for example, looked to the realm of the Forms to define many of his fundamental concepts. Numerous other philosophers, like Kant, depended on a timeless, immortal realm used to define ethics and reason. Hegal may have added a dynamic quality but the dialetical process transpired largely in the timeless realm of thought and outside of the empirical. Even the question of time ironically becomes whether it's an entity and debated on the terms of metaphysics, including whether it's real or not or just a psychological phenomenon. In short, the timeless haunts even the timed. ::::
- time is unidirectional (arrow of time goes forward)
- time gives order to events
::: notes
Interestingly, while we often think of time as an innate and universal concept, when you look at history, literature and religion, you notice that our modern sense of time is a rather modern one. In fact, many earlier cultures like the Mayans did not view time as linear, as we now do, but viewed it as cyclical. According to Whitrow in various books on time, key scientific figures like Kant and Descare led to the origin of the idea of time as we know it now.
Regardless of the philosophical quandries and the historical conceptions, two defining features of time stand out: 1. time is unidirectional (arrow of time goes forward) and 2. time gives order to events (Aigner, 2011). In short, "the fundamental experience of time for people seems to be events that happen one after the other" (Frank, 1998) and we often use metaphors like a river or journey to reflect this conceptualization [[201904092126_metaphor]].
:::
- Physics of Time
- Our Experience, Conception and Consciousness of Time
- How do we link "models of time to people's basic experiences"? (Frank, 1998).
::: notes Modeling and visualizing time and the time domain presents a number of considerations, including those of actual experience, design, metaphors, and how we link "models of time to people's basic experiences" (Frank, 1998). We also shouldn't forget that while we might think of time as a naturlistic object, there is long cultural history with different views on what is time and our current ideas on time evolved from how we measure time and sciences thoughts on time. :::
Fortunately, we don't need to deal with these general time problems as such, because we only need to deal with the time challenges in our data!
- How to understand the time component in a data set?
- How to isolate out time...
- ...so we can stationalize...
- ...model and forecast the data?
::: notes Time series analysis is generally considered an advanced topic for machine learning and deep learning.
What are the Data Science Challenges with Time Series?
Because trends and seasonality in time results in non-stationary data, we cannot leverage typical statistical methods. So in order to understand events that happen over time or model and forecast time series phenonomon we first need to deal with time or time index. :::
- A time series is a sequence of measurements or observations that varies in time.
::: notes Time-oriented or time-series data is data that is somehow connected to time and have a natural temporal ordering (in contrast to Cross-Sectional Data). Time series data is a collected sequence of observations, often recorded at regular intervals. :::
Part of what happens in your data is because of the effects of time, time's order, cycles, patterns, etc.
::: notes Confusion and challenge with thinking about time series data is that we need to first think about the time index. The time index is just a generic way of thinking that part of what happens in your data is because of the effects of time, time's order, cycles, patterns, etc.
What we are trying to do is isolate out the various aspect of the time index.
:::
::: notes The problem with Non-Stationary Data is that it can be challenging to model due to mixing of effects of time and other variables. Let's look at this visually some. :::
Examples of Stationary vs. Non-Stationary
(aka effects of the time index)
- trend (general direction of data)
- seasonality (weekly, monthly, seasons, etc.)
- cyclical movements (seasonality on a longer scale, often non-calendar)
- serial correlation (lag, meaning previous observation(s) affect current one)
- unexpected variations (noise, randomness)
::: notes Time Series Data has several interesting internal structure that require special techniques (and often mathematical treatment) for its analysis. These internal structures include...
Most time series data displays one or several of these structures. :::
(or stationary time series or stationary process)
- Stationarity means that the statistical properties of the process do not change over time.
::: notes
:::
- Generating stationary data is important in order to understand, model, and forecast time series data.
- In healthcare and biology, we need to understand if an effect is part of a trend or caused by an intervention (or other factors).
::: notes The problem with Non-Stationary Data is that it can challenging to model due to mixing of effects of time and other variables.
The main idea is that we come up with a model that makes our data regular and then once the data is regular we can use linear regression to make forecasts and see correlations outside of simply occurring along with general trends or seasonality. Practically what this means is we come up with a way to normalize the data BEFORE we model and predict. Then after during our modeling and forecasting we can then reapply the time index model to get data back into real numbers that still follow with the time index. So it's like this 1. check if data has a temporal pattern, 2. time series model and transform the data so that temporal pattern is neutralized, 3. run actual analysis and additional modeling, 4. then apply back ts model and transformation so data is back to previous patterns. :::
the transformation process of decomposing and detrending ts data so non-stationary becomes stationary.
::: notes Generating stationary data is important because by taking out trends, seasonal and cyclical components, we'd only be left with irregular fluctations which cannot be modeling using just the time series as an explanatory variabe. So when we do forecasting the irregularity is assumed to be independent and identically distributed (iid) observations and modeled by linear regression on the variables other than the time index
- autocorrelation
- stationarity (or need to convert non-stationary data to stationary) :::
to Time Series Data
Source: Pal, 2017.
- Wearable Devices and individuals: Fitbit (2), Oura (1), Apple Watch (1)
- Target Data:
- Sleep
- Exercise / Activity Level, inc Steps
Our Focus: Within-Individual Variablity
::: notes
While obviously a larger data set would be better, since our goal was to understand within-individual variablity, a sample set of a few individuals was enough for our current investigative purposes. It allowed us to look at differences between individuals and compare the modeling and other tests.
:::
- Are there and what are the time patterns for sleep and activity levels?
- Can we model and forecast these variables?
FUTURE: Does sleep correlate or affect activity level? Or vis-versa?
- Data collection was done using a variation of QS Ledger, an open source Python project for collecting and visualization of self-tracking data (Fitbit, Apple Health, Oura, etc).
- Each data set was then processed and aggregated into a standardized format.
SEE: Previous Speech Python For Self-Trackers.
Source: Fitbit User 1
Source: Fitbit User 2
::: notes Description of dataset here. :::
Exploratory Data Analysis
TS Data Processing
Why do we use data visualization for initial time series analysis?
- check for trends, seasonality and cyclic patterns
- Line Charts
- Histogram and Density Plots
- Box and Whisker Plots by Interval
- Heat Maps
- Time Series Lag Scatter Plots
- Check for lag effects and scatterplot
- Moving Averages (i.e. rolling mean)
- Exponentially-weighted moving average (EWMA)
# assume zero is nan
series.replace(0, np.nan, inplace=True)
series.isna().sum()
# fill in missing values if we can
if series.isna().sum() > 0:
series = series.interpolate(method='time') # time or linear
# drops any we can't fill-in like at start of time series
if series.isna().sum() > 0:
series.dropna(inplace=True)
- Easy to calculate
- Good way to compare between measurements and between individuals when scales may vary (like mood, activity, steps, sleep, etc.)
::: notes In statistics the standard score or z-score is a way of translating raw scores and observations into its relationship with the mean. Put more technically, it's a fractional number according to which the number is above or below the mean. Numbers above are positive, while those below are negative.
The z-score is often used in the z-test in standardized testing as a way to express something like your percentile rank.
Overall, z-scores are a kind of normalization. :::
for Detecting Temporal Effects
- are used check for autocorrelation and if there are other time index effects.
- Examples: Autocorrelation Function (ACF), Partial Autocorrelation Function (PACF), Dickey-Fuller Test
::: notes Autocorrelation is generic form of serial correlation where variable or data is correlated to the time series index and displays some kind of lag.
Autocorrelation reflects the degree of linear dependency between the time series at index t and the time series at indices t-h or t+h. :::
- ACF tells you the correlation between points separated by various time lags.
NOTE: The prefix "auto" refers to "self" (rather than automatic)
::: notes Time-series data is ordered, meaning there is information in your sample data where you can look for useful temporal patterns. The autocorrelation function (ACF) is one tool used to find those temporal patterns in data. Specifically, ACF tells you the correlation between points separated by various time lags.
Based on how many time steps they are separated by, the ACF plot displays how correlated those datapoints are. So, at zero point the correlation is 100%. A value of zero indicates no correlation. :::
- shows you the relationship between an observation in a time series with observations at prior time steps but, unlike ACF, with the relationships of intervening observations removed.
Data_Visualization_Health_and_Self_Time_Series.ipynb
For background see, Time_Series_Data_Visualization_with_Python.ipynb
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(test_ts)
# plot_pacf(test_ts)
plt.show()
- The Augmented Dickey-Fuller is a test used to check a univariate process or dataset for the presence of serial correlation.
- The test statistic is expected to be negative. So if it is less than or more negative than the critical value we can conclude that the data is stationary.
from statsmodels.tsa.stattools import adfuller
#Perform Dickey-Fuller test:
print('Results of Dickey-Fuller Test:')
dftest = adfuller(timeseries, autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
Test Statistic 0.815369
p-value 0.991880
#Lags Used 13.000000
Number of Observations Used 130.000000
Critical Value (1%) -3.481682
Critical Value (5%) -2.884042
Critical Value (10%) -2.578770
What Does this Mean? This data is not stationary!
Apple Watch 01 Sleep
fitbit_02 Sleep
fitbit_02 Sleep
Test Statistic -4.625906
p-value 0.000116
#Lags Used 6.000000
Number of Observations Used 357.000000
Critical Value (1%) -3.448801
Critical Value (5%) -2.869670
Critical Value (10%) -2.571101
fitbit_02 steps without outlier tweaking
Test Statistic -7.57
p-value 2.70
#Lags Used 2.00
Number of Observations Used 3.61
Critical Value (1%) -3.44
Critical Value (5%) -2.86
Critical Value (10%) -2.57
fitbit_02 steps with outlier tweaking
apple_watch_01:
Without Averaging Out Outliers: -1.23
With Averaging Out Outliers: -7.22
fitbit_01:
Without Averaging Out Outliers: -3.329097
With Averaging Out Outliers: -5.436158
fitbit_02:
Without Averaging Out Outliers: -4.62
With Averaging Out Outliers: -7.57
apple_watch_01:
Without Averaging Out Outliers: -1.557651e+01
With Averaging Out Outliers: -19.654423
fitbit_01:
Without Averaging Out Outliers: -5.407617
With Averaging Out Outliers: -4.959970
fitbit_02:
Without Averaging Out Outliers: -1.097
With Averaging Out Outliers: -4.925933
Our health data is generally stationary.
- Individual differences exist, meaning some people's sleep or exercise numbers display more temporal patterns, like a lag effect with sleep or exercise.
- Outliers can result in our data appearing more non-stationary.
- By removing or averaging out outliers, we can make our data more stationary
for Dealing with Time Series Data
::: notes There are several methods for transforming non-stationary ts into stationary data. :::
- Smoothing: Moving average
- Exponentially Weighted Moving Average
- Differencing
- Decomposition
Tests_and_Techniques_Health_and_Self_Time_Series.ipynb
for Time Series Data
ARIMA = Auto-Regressive Integrated Moving Average
- AR for Autoregression: A model that uses the dependent relationship between an observation and some number of lagged observations.
- I for Integrated: A preprocessing procedure to “stationarize” time series if needed, for example using differencing of raw observations
- MA for Moving Average: A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.
::: notes AR and MA are two widely used linear models that work on stationary time series. :::
Standard notation: ARIMA(p, d, q)
- p The number of lag observations included in the model, also called the lag order.
- d The number of times that the raw observations are differenced, also called the degree of differencing.
- q The size of the moving average window, also called the order of moving average.
::: notes Parameters are substituted with integer values to indicate the specific ARIMA model being used. :::
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(ts_log, order=(2, 1, 0)) # set parameters here
results_AR = model.fit(disp=-1)
plt.plot(ts_log_diff)
plt.plot(results_AR.fittedvalues, color='red')
TS_Statistical_Modeling_Health_and_Self_Time_Series.ipynb
- Pmdarima operates by wrapping statsmodels.tsa.ARIMA and statsmodels.tsa.statespace.SARIMAX into one estimator class and creating a more user-friendly estimator interface for programmers familiar with scikit-learn.
Ref: https://pypi.org/project/pmdarima/
model = pm.auto_arima(train, start_p=1, start_q=1,
test='adf', # use adftest to find optimal 'd'
max_p=3, max_q=3, # maximum p and q
m=1, # frequency of series
d=None, # let model determine 'd'
seasonal=False, # No Seasonality
start_P=0,
D=0,
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True)
print(model.summary())
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
# pip install fbprophet
# convert to format for prophet
df.columns = ["ds", "y"]
# create and fit prophet model
m = Prophet()
m.fit(df)
::: notes Usage is based on and quite similar to other sci-kit learn models. :::
# create empty future df
future = m.make_future_dataframe(periods=365)
# predict
forecast = m.predict(future)
# plot forecast
fig1 = m.plot(forecast)
# plot componets
fig2 = m.plot_components(forecast)
Plotting Predictions
Components Breakdown
Applied to Health and Self
- Not obvious if there are patterns or not but there are trends and up's and downs.
- Some seasonality effects (esp weekly for some individuals).
- Different people sleep and move in different patterns (compared b/t self and b/t others)
- Data appears to be largely stationary but varies from person to person.
- Some people have more autocollection (i.e. lag than others)
CODE: Tests_and_Techniques_Health_and_Self_Time_Series.ipynb
Can we model the data with ARIMA?
CODE: TS_Statistical_Modeling_Health_and_Self_Time_Series.ipynb
- Overly sensitive to certain outliers and trends
- Best model appears to be no model?
Can we model health data better with Prophet?
Code: Health_TS_with_Prophet.ipynb
- Better at dealing with outliers
- Weekly breakdown components is an interesting pattern insight
- Results appear to largely be flat like better ARIMA models too
- A bit of a black box model?
Why TS Matters, Next Steps and Future Research
- Temporal effects matter because unless we check for underlying trends we can't be sure if health changes are just a temporal pattern OR caused by targetted change (like lifestyle or treatment).
- Reliable statistical tests and visualizations exist to check if data is stationary.
::: notes What we looked at:
- Time and Temporal Structures of Time Series Data
- Time Series Visualization, EDA, and Processing
- Tests and Techniques for Time Series Data
- Modeling Time Series Data :::
- Health data can be powerful but health patterns differ from person to person and within an individual too (and our models need to account for these).
- Unfortunately numerous challanges remain for data-driven health care for doctors...
- ...as well as for quantified self and biohackers and little open source code exists for health and personal data analysis and modeling.
::: notes Currently looking to expand these techniques to be more reliable for various health and self data sets. Exploring ways to incorporate scientific biological patterns and models as well as apply health reference ranges and risk factors too. We are working on combining biometric data with other health tracking and test results data, like glucose monitoring and standard blood tests. Obviously we want to build data-driven health products one day too. :::
- We live in a world of devices, tracking tech and data.
- It's time to build a world of healthier selves with it.
Slides and Code: github.com/markwk/ts4health
www.markwk.com
datadrivenyou.com
"In God we trust, all others bring data." (W. Edwards Deming)
- Box & Jenkins. (2015). Time Series Analysis. John Wiley & Sons. (esp Ch 1-4)
- Pal. (2017). Practical Time Series Analysis. Packt Publishing Ltd. (esp Ch. 1-4)
- Downey, A. (2015). Think Stats (2nd Edition). O’Reilly Media, Inc. (esp Ch 12)
- Velicer. (2012). Time series analysis for psychological research. Handbook of Psychology, Second Edition. (Thorough introduction to ts for social scientists)
- Aigner (2011). Visualization of Time-Oriented Data. Springer Science & Business Media.
- Time Series Data Visualization with Python - Code and example of data visualization for times series
- A comprehensive beginner’s guide to create a Time Series Forecast - nice walkthrough of techniques for time series analysis and transformations
- ARIMA Model – Complete Guide to Time Series Forecasting in Python - good example of ARIMA modeling with step-by-step code from analysis and parameter setting in model to forecasting and model accuracy metrics.
- Time Series Analysis with Pandas - uses Open Power Systems Data with some good examples
- Pandas Time Series - pandas example using sunspots data
- Playing with time series data in python - focuses on energy trends data and more deep learning methods
- Working with Time Series from Python Data Science Handbook
Find me online at www.markwk.com!