PRACTICAL TIME SERIES ANALYSIS
Notes from the O'Reilly Publication
“Practical Time Series Analysis by Aileen Nielsen (O’Reilly). Copyright 2020 Aileen Nielsen, 978-1-492-04165-8.”
aileen.a.nielsen@gmail.com
***

"For better or worse, sensors and tracking mechanisms are everywhere, and as a result there are unprecedented amounts of high-quality time series data available. Time series are uniquely interesting because they can address questions of causality, trends, and the likelihood of future outcomes."

Henceforth, unless noted otherwise, treat everything as if it were in quotes: it is copied straight from the source.

Time series is an important aspect of data analysis but one that is not found in the standard data science toolkit. This is unfortunate both because time series data is increasingly available and also because it answers questions that cross-sectional data cannot. An analyst who does not know fundamental time series analysis is not making the most of their data.

Time series is an interesting topic with quirky data concerns. Problems associated with data leakage, lookahead, and causality are particularly fun from a time series perspective, as are many techniques that apply uniquely to data ordered along some kind of time axis.

***

# Chapter 1: Time Series: An Overview and a Quick History

As continuous monitoring and data collection become more common, the need for competent time series analysis with both statistical and machine learning techniques will increase. Indeed, the most promising new models combine both of these methodologies.

Time series analysis is the endeavor of extracting meaningful summary and statistical information from points arranged in chronological order. It is done to diagnose past behavior as well as to predict future behavior.

Time series analysis often comes down to the question of causality: how did the past influence the future? At times, such questions (and their answers) are treated strictly within their discipline rather than as part of the general discipline of time series analysis.

Likewise, individualized healthcare using time series analysis remains a young and challenging field because it can be quite difficult to create data sets that are consistent over time. Even for small case-study-based research, maintaining both contact with and participation from a group of individuals is excruciatingly difficult and expensive. When such studies are conducted for long periods of time, they tend to become canonical in their fields, and are repeatedly or even excessively researched, because their data can address important questions despite the challenges of funding and management.

Indicators of production and efficiency in markets have long provided interesting data to study from a time series perspective. Most interesting and urgent has been the question of forecasting future economic states based on the past. Such forecasts aren’t merely useful for making money; they also help promote prosperity and avert social catastrophes.

Nowadays, the United States and most other nations have thousands of government researchers and recordkeepers whose jobs are to record data as accurately as possible and make it available to the public (see Figure 1-4). This practice has proven invaluable to economic growth and the avoidance of economic catastrophe and painful __boom and bust cycles__.

One of the pioneers of mechanical trading, or time series forecasting via algorithm, was Richard Dennis. Dennis was a self-made millionaire who famously turned ordinary people, called the Turtles, into star traders by teaching them a few select rules about how and when to trade. These rules were developed in the 1970s and 1980s and mirrored the “AI” thinking of the 1980s, in which heuristics still strongly ruled the paradigm of how to build intelligent machines to work in the real world.

Since then many “mechanical” traders have adapted these rules, which as a result have become less profitable in a crowded automated market. Mechanical traders continue to grow in number and wealth, and they are continually in search of the next best thing because there is so much competition.

#### Time Series Analysis Takes Off

George Box, a pioneering statistician who helped develop a popular time series model, was a great pragmatist. He famously said, “All models are wrong, but some are useful.”

Box made this statement in response to a common attitude that proper time series modeling was a matter of finding the best model to fit the data. As he explained, it is very unlikely that any model can describe the real world. Box made this pronouncement in 1978, which seems bizarrely late into the history of a field as important as time series analysis, but in fact the formal discipline was surprisingly young.

More recently, practical uses for time series analysis and machine learning emerged as early as the 1980s, and included a wide variety of scenarios:

- Computer security specialists proposed anomaly detection as a method of identifying hackers/intrusions.
- Dynamic time warping, one of the dominant methods for “measuring” the similarity of time series, came into use because the computing power would finally allow reasonably fast computation of “distances,” say between different audio recordings.
- Recursive neural networks were invented and shown to be useful for extracting patterns from corrupted data.

Time series analysis and forecasting have yet to reach their golden period, and, to date, time series analysis remains dominated by traditional statistical methods as well as simpler machine learning techniques, such as ensembles of trees and linear fits. We are still waiting for a great leap forward for predicting the future.

***
# Chapter 2. Finding and Wrangling Time Series Data

In this chapter we discuss problems that might arise while you are preprocessing time series data. Some of these problems will be familiar to experienced data analysts, but there are specific difficulties posed by timestamps. __As with any data analysis task, cleaning and properly processing data is often the most important step of a timestamp pipeline. Fancy techniques can’t fix messy data.__

Most data analysts will need to find, align, scrub, and smooth their own data either to learn time series analysis or to do meaningful work in their organizations. As you prepare data, you’ll need to do a variety of tasks, from joining disparate columns to resampling irregular or missing data to aligning time series with different time axes. This chapter helps you along the path to an interesting and properly prepared time series data set.

### Prepared Data Sets

The best way to learn an analytical or modeling technique is to run through it on a variety of data sets and see both how to apply it and whether it helps you reach a concrete goal.

Notice that for the purpose of thinking about signs as time series, it doesn’t matter what the unit of time is; the point is the sequencing rather than the exact time. In that case, all you would care about is the ordering of the event, and whether you could assume or confirm from reading the data description that the measurements were taken at regular intervals.

#### UNIVARIATE VERSUS MULTIVARIATE TIME SERIES

The data sets we have looked at so far are univariate time series; that is, they have just one variable measured against time.

Multivariate time series are series with multiple variables measured at each timestamp. They are particularly rich for analysis because often the measured variables are interrelated and show temporal dependencies between one another. We will encounter multivariate time series data later.

It is great to work on difficult problems, but it is not a good idea to learn on such problems.

### Retrofitting a Time Series Data Collection from a Collection of Tables

The quintessential example of a found time series is one extracted from state-type and event-type data stored in a SQL database. This is also the most relevant example because so much data continues to be stored in traditional structured SQL databases.

#### WHAT IS A LOOKAHEAD?

The term lookahead is used in time series analysis to denote any knowledge of the future. You shouldn’t have such knowledge when designing, training, or evaluating a model. A lookahead is a way, through data, to find out something about the future earlier than you ought to know it.

A lookahead is any way that information about what will happen in the future might propagate back in time in your modeling and affect how your model behaves earlier in time. For example, when choosing hyperparameters for a model, you might test the model at various times in your data set, then choose the best model and start at the beginning of your data to test this model. This is problematic because you chose the model for one time knowing things that would happen at a subsequent time—a lookahead.

Unfortunately, there is no automated code or statistical test for a lookahead, so it is something you must be vigilant and thoughtful about.
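
One habit that guards against the hyperparameter lookahead described above is to split the data strictly by time and to choose settings using only data that precedes the final test period. The sketch below is not from the book: the series, the split points, and the `naive_forecast_error` helper are all hypothetical, and the "model" is just a moving-average forecast chosen to keep the example small. It does not detect lookahead; it only illustrates one way to avoid introducing it.

```python
import pandas as pd

# Hypothetical daily series; the values are placeholders.
series = pd.Series(
    range(100), index=pd.date_range("2020-01-01", periods=100, freq="D")
)

# Split strictly by time: each segment ends before the next one begins.
train = series.iloc[:60]
validation = series.iloc[60:80]  # used to choose hyperparameters
test = series.iloc[80:]          # touched once, after all choices are made


def naive_forecast_error(history: pd.Series, future: pd.Series, window: int) -> float:
    """Mean absolute error when every future value is predicted with the
    mean of the last `window` values of the history."""
    prediction = history.iloc[-window:].mean()
    return float((future - prediction).abs().mean())


# Choose the window using only data that precedes the test period.
candidate_windows = [3, 7, 14]
best_window = min(
    candidate_windows, key=lambda w: naive_forecast_error(train, validation, w)
)

# Final evaluation: the history is everything before the test period.
test_error = naive_forecast_error(series.iloc[:80], test, best_window)
```

The point of the design is the ordering: nothing from `test` influences `best_window`, so the reported `test_error` does not benefit from knowledge of the future.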

You may be surprised that we need 26 instead of 25 given the subtraction we just performed, but that was an incomplete calculation. When you work with time series data, one thing you should always ask yourself after doing this kind of subtraction is whether you should add 1 to account for the offset at the end. In other words, did you subtract the positions you wanted to count?

Consider this example. Let’s say I have information for April 7th, 14th, 21st, and 28th. I want to know how many data points I should have in total. Subtracting 7 from 28 and dividing by 7 yields 21/7 or 3. However, I should obviously have four data points. I subtracted out April 7th and need to put it back in, so the proper calculation is the difference between the first and last days divided by 7, plus 1 to account for the subtracted start date.
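
The same arithmetic as a small sketch. The function name `expected_points` and the year 2020 are assumptions added for illustration; only the April dates and the 7-day spacing come from the text.

```python
from datetime import date


def expected_points(first: date, last: date, step_days: int) -> int:
    """Count evenly spaced observations from first to last, inclusive."""
    span_days = (last - first).days
    if span_days < 0 or span_days % step_days != 0:
        raise ValueError("last must fall on the grid that starts at first")
    # The division counts the gaps between observations; adding 1 counts
    # the observations themselves (it puts the start date back in).
    return span_days // step_days + 1


n_points = expected_points(date(2020, 4, 7), date(2020, 4, 28), 7)
print(n_points)  # 4
```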

It’s a lot easier to fill in all missing weeks for all members by exploiting Pandas’ indexing functionality, rather than writing our own solution. We can generate a MultiIndex for a Pandas data frame, which will create all combinations of weeks and members, that is, a __Cartesian product__.
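
A minimal sketch of the Cartesian-product idea, not the book's own code. The `emails` table, its column names, and the sample values are hypothetical; `pd.MultiIndex.from_product` and `DataFrame.reindex` are the pandas calls doing the work.

```python
import pandas as pd

# Hypothetical data: weekly counts per member. Weeks in which a member
# did nothing are simply absent from the table.
emails = pd.DataFrame(
    {
        "member": ["a", "a", "b"],
        "week": pd.to_datetime(["2020-01-06", "2020-01-20", "2020-01-13"]),
        "emails_opened": [3, 1, 2],
    }
)

# Every week in the observed range, whether or not anything happened.
all_weeks = pd.date_range(emails["week"].min(), emails["week"].max(), freq="7D")

# Cartesian product of members and weeks.
full_index = pd.MultiIndex.from_product(
    [sorted(emails["member"].unique()), all_weeks], names=["member", "week"]
)

# Reindex onto the full grid; missing (member, week) pairs become zero counts.
complete = (
    emails.set_index(["member", "week"])
    .reindex(full_index, fill_value=0)
    .reset_index()
)
print(complete)
```

Whether a zero is the right fill for every missing pair depends on the data. For example, weeks before a member joined may need to be dropped rather than counted as zero.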

#### PYTHON’S PANDAS

Pandas is a data frame analysis package in Python that is used widely in the data science community. Its very name indicates its suitability for time series analysis: “Pandas” refers to “panel data,” which is what social scientists call time series data.

Pandas is based on tables of data with row and column indices. It has SQL-like operations built in, such as group by, row selection, and key indexing. It also has time series–specific functionality, such as indexing by time period, downsampling, and time-based grouping operations.
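
A short sketch of the three time series features just named (indexing by time period, downsampling, and time-based grouping). The `events` table and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical event log: one row per event with an exact timestamp.
events = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2020-03-02 09:15",
                "2020-03-02 17:40",
                "2020-03-04 08:05",
                "2020-03-11 12:30",
            ]
        ),
        "amount": [10.0, 5.0, 20.0, 7.5],
    }
).set_index("timestamp")

# Indexing by time period: a partial date string selects the whole month.
march = events.loc["2020-03"]

# Downsampling: collapse exact timestamps to daily totals.
# Days with no events appear with a total of 0.
daily = events["amount"].resample("D").sum()

# Time-based grouping: total per calendar week (weeks ending on Sunday).
weekly = events.groupby(pd.Grouper(freq="W"))["amount"].sum()
```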

If you are unfamiliar with Pandas, I strongly recommend looking at a brief overview, such as that provided in the official documentation.

To recap, these are the time-series-specific techniques we used to restructure the data:

- Recalibrate the resolution of our data to suit our question. Often data comes with more specific time information than we need.
- Understand how to avoid lookahead by not using data at timestamps earlier than the point at which that data actually became available.
- Record all relevant time periods even if “nothing happened.” A zero count is just as informative as any other count.
- Avoid lookahead by not using data for timestamps that produce information we shouldn’t yet know about.

The better you understand your data pipeline, the less likely you are to ask the wrong questions because your timestamps don’t really mean what you think they do. You bear the ultimate responsibility for understanding the data. People who work upstream in the pipeline don’t know what you have in mind for analysis. Try to be as hands-on as possible in assessing how timestamps are generated. So, if you are analyzing data from a mobile app pipeline, download the app, trigger an event in a variety of scenarios, and see what your own data looks like. You’re likely to be surprised about how your actions were recorded after speaking to those who manage the data pipeline. It’s hard to track multiple clocks and contingencies, so most data sets will flatten the temporal realities. You need to know exactly how they do so.

- Reading the data as we did in the previous example, you can generate initial hypotheses about what the timestamps mean. In the preceding case, look at data for multiple users to see whether the same pattern (multiple rows with identical timestamps and improbable single meal contents) held or whether this was an anomaly.
- Using aggregate-level analyses, you can test hypotheses about what timestamps mean or probably mean. For the preceding data, there are a couple of open questions:
  - Is the timestamp local or universal time?
  - Does the time reflect a user action or some external constraint, such as connectivity?

### Local or Universal Time?

Most timestamps are stored in universal (UTC) time or in a single time zone, depending on the server’s location but independent of the user’s location. It is quite unusual to store data according to local time. However, we should consider both possibilities, because both are found in “the wild.”

We form the hypothesis that if the time is a local timestamp (local to each user), we should see daily trends in the data reflecting daytime and nighttime behavior. More specifically, we should expect not to see much activity during the night when our users are sleeping.
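
One way to check this hypothesis is to count events by hour of day and look for a quiet stretch overnight. The sketch below uses a made-up `log` table, and the 00:00 to 05:59 "sleeping window" is an assumption chosen for illustration, not a threshold from the book.

```python
import pandas as pd

# Hypothetical log: one row per recorded user action.
log = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2020-05-01 08:10", "2020-05-01 12:30", "2020-05-01 19:45",
                "2020-05-02 08:20", "2020-05-02 13:05", "2020-05-02 23:50",
            ]
        )
    }
)

# Count events per hour of day, keeping hours with no activity as zeros.
hourly_counts = (
    log["timestamp"].dt.hour.value_counts().reindex(range(24), fill_value=0)
)

# Share of all activity that falls in the assumed sleeping window (hours 0-5).
night_share = hourly_counts.loc[0:5].sum() / hourly_counts.sum()
```

A clear overnight dip would be consistent with local timestamps. A roughly flat profile across users in many time zones would point toward universal time or a single server time zone instead.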


##### PSYCHOLOGICAL TIME DISCOUNTING

Time discounting is a manifestation of a phenomenon known as psychological distance, which names our tendency to be more optimistic (and less realistic) when making estimates or assessments that are more “distant” from us. Time discounting predicts that data reported from further in the past will be biased systematically compared to data reported from more recent memory. This is distinct from the more general problem of forgetting and implies a nonrandom error. You should keep this in mind whenever you are looking at human-generated data that was entered manually but not contemporaneously with the event recorded.
