# Time Series with `pandas`

**Credit:** Notebook created by Eni Mustafaraj, loosely based on Chapter 10 of "Python for Data Analysis" by Wes McKinney.


**Table of Contents**  
1. [Time series basics](#sec2)
2. [Indexing and Selection](#sec3)
3. [Resampling and Frequency Conversion](#sec4)
4. [Wikipedia Revisions Timeseries](#sec5)
5. [Task for you](#sec6)

**Introduction**

What is a time series? Anything that is **observed or measured** at many points in time.  
There are:

1. fixed-frequency time series (data points occur at regular intervals)
2. irregular time series (no fixed offset between units)
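
As a small illustration of the two kinds, the sketch below builds one index of each type and asks pandas which frequency it can infer. The dates are made up for this example and are not part of the Wikipedia data.

```python
import pandas as pd

# Fixed-frequency: one timestamp per day, generated by pandas.
regular = pd.date_range("2021-01-01", periods=5, freq="D")

# Irregular: hand-picked dates with gaps of different lengths.
irregular = pd.DatetimeIndex(
    ["2021-01-02", "2021-01-05", "2021-01-07", "2021-01-08"]
)

print(regular.inferred_freq)    # a frequency string for the regular index
print(irregular.inferred_freq)  # None, because the gaps are not constant
```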

**RUNNING Example:** The example time series in this notebook is the one that shows the history of revisions made in the page of the actress [Rose McGowan](https://en.wikipedia.org/wiki/Rose_McGowan). Ms. McGowan was one of the names mentioned in the context of the Harvey Weinstein sexual misconduct allegations in October 2017. Our data contains a list of usernames and timestamps (as strings), stored as a JSON file (which is in the folder of this notebook).

In [1]:
import json
with open('mcgowan_timestamps.json', 'r') as inputFile:
    usersAndDates = json.load(inputFile)
    
print("length of revisions:", len(usersAndDates))

length of revisions: 3268


In [2]:
# look at a few elements
usersAndDates[:5]

[['RichardBond', '2017-10-21 09:32:02'],
 ['Eamontopleez', '2017-10-20 11:58:49'],
 ['Drmies', '2017-10-18 17:50:47'],
 ['User No. 99', '2017-10-18 01:13:04'],
 ['Gene2010', '2017-10-18 00:00:14']]

<a id="sec1"></a>

In [3]:
import pandas as pd
from pandas import Series, DataFrame

<a id="sec2"></a>
## 1. Time series basics

The most basic kind of time series in pandas is a `Series` indexed by timestamps. One can use as index a list of `datetime` objects created in Python, for example:

In [4]:
from datetime import datetime
# create a list of 6 date objects
dates = [datetime(2021, 1, 2), datetime(2021, 1, 5), datetime(2021, 1, 7),
         datetime(2021, 1, 8), datetime(2021, 1, 10), datetime(2021, 1, 12)]

We can supply the created list for the `index` parameter:

In [5]:
import numpy as np
ts = Series(np.random.randn(6), index=dates)
ts

2021-01-02    0.293949
2021-01-05   -0.076591
2021-01-07   -0.924543
2021-01-08   -0.162554
2021-01-10    0.607913
2021-01-12   -0.409017
dtype: float64

`pandas` creates a dedicated type for the index column, called `DatetimeIndex`:

In [6]:
type(ts.index)

pandas.core.indexes.datetimes.DatetimeIndex

To see the difference, let's create a simple series object that gets its index automatically from pandas:

In [7]:
s = Series(np.random.randn(6))
s

0    0.813665
1    1.071885
2    1.345403
3    1.833630
4   -0.593156
5    0.018098
dtype: float64

In [8]:
type(s.index)

pandas.core.indexes.range.RangeIndex

### Converting the Wiki data into a time series
Let's look now at how to create a `Series` where time is an index for the Wikipedia data we loaded at the start of the notebook. 

We will use the function `zip` to create two separate columns: one for the timestamps and one for the usernames.
The built-in function `zip` can be used in two ways: to zip two sequences into one, and to unzip a sequence into two or more sequences. Below are some examples.

In [9]:
# The unzipping feature uses the operator *

pairs = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
numbers, letters = zip(*pairs)
print(numbers)
print(letters)

(1, 2, 3, 4)
('a', 'b', 'c', 'd')


In [10]:
# we can now zip these two sequences again, but in the reverse order

list(zip(letters, numbers))

[('a', 1), ('b', 2), ('c', 3), ('d', 4)]

Now that we know how the `zip` function works, we can use it to unzip the given data into two separate lists:

In [11]:
usernames, timestamps = zip(*usersAndDates)
usernames[:3], timestamps[:3]

(('RichardBond', 'Eamontopleez', 'Drmies'),
 ('2017-10-21 09:32:02', '2017-10-20 11:58:49', '2017-10-18 17:50:47'))

The timestamps are stored as strings, whereas our earlier example for creating a time series used `datetime` objects. It turns out that pandas has its own function, `to_datetime`, which takes a list of strings and turns it into a `DatetimeIndex` object.

In [12]:
pd.to_datetime(['2017-10-21 09:32:02'])

DatetimeIndex(['2017-10-21 09:32:02'], dtype='datetime64[ns]', freq=None)
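
Real-world timestamp strings are not always clean. As a hedged sketch, the example below passes an explicit `format` and uses `errors="coerce"` so that an unparseable entry becomes `NaT` (pandas' "not a time" marker) instead of raising an error. The second string is invented for illustration; the Wikipedia file is assumed to be well formed.

```python
import pandas as pd

# One valid timestamp and one deliberately broken entry (made up).
raw = ["2017-10-21 09:32:02", "not a timestamp"]

# An explicit format documents what we expect; errors="coerce"
# replaces anything that does not match with NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")
print(parsed)

# Count how many entries failed to parse.
print(parsed.isna().sum())
```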

Now that we know this, we can create our timeseries of Wikipedia revisions:

In [13]:
# time series for revisions
tsRevWiki = Series(usernames, index=pd.to_datetime(timestamps))
tsRevWiki.head(10)

2017-10-21 09:32:02            RichardBond
2017-10-20 11:58:49           Eamontopleez
2017-10-18 17:50:47                 Drmies
2017-10-18 01:13:04            User No. 99
2017-10-18 00:00:14               Gene2010
2017-10-17 14:32:36            50.1.85.241
2017-10-17 11:20:34         24.177.155.226
2017-10-17 11:14:41         24.177.155.226
2017-10-17 09:32:53           79.76.177.38
2017-10-17 00:12:51    All Hallow's Wraith
dtype: object

As we saw above in the made-up example with random values, the timestamp column is converted into a `DatetimeIndex` object by `pandas`:

In [14]:
tsRevWiki.index

DatetimeIndex(['2017-10-21 09:32:02', '2017-10-20 11:58:49',
               '2017-10-18 17:50:47', '2017-10-18 01:13:04',
               '2017-10-18 00:00:14', '2017-10-17 14:32:36',
               '2017-10-17 11:20:34', '2017-10-17 11:14:41',
               '2017-10-17 09:32:53', '2017-10-17 00:12:51',
               ...
               '2003-12-02 06:54:51', '2003-10-20 10:45:02',
               '2003-10-20 10:42:45', '2003-10-03 00:52:36',
               '2003-08-14 00:16:56', '2003-07-07 18:12:48',
               '2003-07-06 22:42:34', '2003-07-03 19:10:21',
               '2003-07-03 17:27:25', '2003-07-03 17:26:34'],
              dtype='datetime64[ns]', length=3268, freq=None)

However, the values inside this index are `Timestamp` instances:

In [15]:
tsRevWiki.index[0]

Timestamp('2017-10-21 09:32:02')

A `Timestamp` instance has many more methods (useful for analysis) than `datetime` instances:

In [16]:
print(dir(tsRevWiki.index[0]))

['__add__', '__array_priority__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__pyx_vtable__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__rsub__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__weakref__', '_date_attributes', '_date_repr', '_get_date_name_field', '_get_start_end_field', '_has_time_component', '_repr_base', '_round', '_short_repr', '_time_repr', 'asm8', 'astimezone', 'ceil', 'combine', 'ctime', 'date', 'day', 'day_name', 'dayofweek', 'dayofyear', 'days_in_month', 'daysinmonth', 'dst', 'floor', 'fold', 'freq', 'freqstr', 'fromisoformat', 'fromordinal', 'fromtimestamp', 'hour', 'is_leap_year', 'is_month_end', 'is_month_start', 'is_quarter_end', 'is_quarter_start', 'is_year_end', 'is_year_start', 'isocalendar', 'isoformat', 'isoweekday'

For example, notice properties such as `is_month_start`, `weekofyear`, etc.
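
To make this concrete, here is a minimal sketch that reads a few of these attributes from a single `Timestamp`. The value is the first timestamp of the revision series shown above.

```python
import pandas as pd

stamp = pd.Timestamp("2017-10-21 09:32:02")

print(stamp.year, stamp.month, stamp.day)  # calendar components
print(stamp.day_name())                    # weekday name in English
print(stamp.dayofweek)                     # Monday is 0, Sunday is 6
print(stamp.is_month_start)                # True only on the 1st of a month
```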

<a id="sec3"></a>
## 2. Indexing and Selection
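In a `Series`, integer positions 0 to n-1 select the values (the column of the Series), not the index. Recent pandas versions deprecate plain `[]` for positional access when the index is not made of integers, so we use `.iloc`:

In [17]:
tsRevWiki.iloc[2]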

'Drmies'

Thus, to access the "index" value, we use the `index` attribute:

In [18]:
tsRevWiki.index[2]

Timestamp('2017-10-18 17:50:47')

If this index value is stored in a variable, it can be used for indexing instead of numbers 0 to n-1.  
That is, instead of accessing the values in the column through numbers 0 to n-1, we can access them through their index value:

In [19]:
moment = tsRevWiki.index[2]
tsRevWiki[moment]

'Drmies'
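
The two access styles can be written explicitly with `.iloc` (by position) and `.loc` (by index label). The sketch below rebuilds a three-row slice of the revision series so that it runs on its own; the variable names are chosen for this example.

```python
import pandas as pd

index = pd.to_datetime(
    ["2017-10-21 09:32:02", "2017-10-20 11:58:49", "2017-10-18 17:50:47"]
)
revisions = pd.Series(["RichardBond", "Eamontopleez", "Drmies"], index=index)

# Position-based access: third row, whatever its timestamp is.
by_position = revisions.iloc[2]

# Label-based access: the row whose index equals this timestamp.
by_label = revisions.loc[pd.Timestamp("2017-10-18 17:50:47")]

print(by_position, by_label)
```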

### Accessing values via dates

A major benefit of a `DatetimeIndex` is that we can also use datetime strings to access data from the series. Several string formats are valid. Let's start with a date (year, month, day):

In [20]:
tsRevWiki['2017-10-17']

2017-10-17 14:32:36            50.1.85.241
2017-10-17 11:20:34         24.177.155.226
2017-10-17 11:14:41         24.177.155.226
2017-10-17 09:32:53           79.76.177.38
2017-10-17 00:12:51    All Hallow's Wraith
dtype: object

This works with incomplete dates as well. Here is an example with only year and month:

In [21]:
tsRevWiki['2017-10'].head(10) # showing only the first 10 values, because too many

2017-10-21 09:32:02            RichardBond
2017-10-20 11:58:49           Eamontopleez
2017-10-18 17:50:47                 Drmies
2017-10-18 01:13:04            User No. 99
2017-10-18 00:00:14               Gene2010
2017-10-17 14:32:36            50.1.85.241
2017-10-17 11:20:34         24.177.155.226
2017-10-17 11:14:41         24.177.155.226
2017-10-17 09:32:53           79.76.177.38
2017-10-17 00:12:51    All Hallow's Wraith
dtype: object

The full result is long, so we can check the size of the subseries instead:

In [22]:
tsRevWiki['2017-10'].count()

92

What about the entire year of 2017?

In [23]:
tsRevWiki['2017'].count()

159

**Conclusion:** pandas provides a powerful way to query a time series through string date values.
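
The sketch below collects these query styles in one place, using `.loc` and a small made-up series (the letters stand in for usernames). It also shows a date-range slice, which is not used elsewhere in this notebook; the series is sorted first, on the assumption that slicing is most predictable on a sorted index.

```python
import pandas as pd

index = pd.to_datetime([
    "2017-10-21 09:32:02", "2017-10-18 17:50:47", "2017-10-17 14:32:36",
    "2017-10-17 00:12:51", "2017-09-30 08:00:00", "2016-05-01 12:00:00",
])
# Toy values; sort_index() puts the timestamps in increasing order.
revisions = pd.Series(["a", "b", "c", "d", "e", "f"], index=index).sort_index()

one_day = revisions.loc["2017-10-17"]      # all rows on that day
one_month = revisions.loc["2017-10"]       # all rows in October 2017
one_year = revisions.loc["2017"]           # all rows in 2017
# Both endpoints are included, up to the end of the last day.
date_range_slice = revisions.loc["2017-10-17":"2017-10-18"]

print(len(one_day), len(one_month), len(one_year), len(date_range_slice))
```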

### Exercise: How to find the number of edits by year? 
We will learn a better method later in this notebook, but here is one that you can do too.  
**TIP:** Try to unpack the expressions to see the role of each method.

In [24]:
# find the first year and last year of revisions
minR = tsRevWiki.index.min().year      # find min value, get its year
maxR = tsRevWiki.index.max().year      # find max value, get its year

print(minR, maxR)

2003 2017


In [25]:
# loop over the range of years and use each year (as a string) to select from the Series

for year in range(minR, maxR+1):
    print(year, tsRevWiki[str(year)].count())

2003 10
2004 24
2005 55
2006 436
2007 638
2008 481
2009 371
2010 301
2011 168
2012 134
2013 138
2014 106
2015 152
2016 95
2017 159


<a id="sec4"></a>

## 3. Resampling and Frequency Conversion
_Resampling_ refers to the process of converting a time series from one frequency to another. Aggregating higher frequency data to lower frequency is called _downsampling_, while converting lower frequency to higher frequency is called _upsampling_.
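
Here is a minimal sketch of both directions on toy data. The numbers and dates are invented, and forward-filling is only one of several possible ways to fill the new rows created by upsampling.

```python
import pandas as pd

# Downsampling: six daily values are aggregated into 2-day bins.
daily = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03",
                          "2021-01-04", "2021-01-05", "2021-01-06"]),
)
down = daily.resample("2D").sum()

# Upsampling: values recorded every other day are expanded to daily rows;
# each new row copies the most recent known value (forward fill).
sparse = pd.Series(
    [10, 20, 30],
    index=pd.to_datetime(["2021-01-01", "2021-01-03", "2021-01-05"]),
)
up = sparse.resample("D").ffill()

print(down)
print(up)
```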

For regular (or fixed-frequency) data series, the frequency is the interval of time between two measurements. For example: measuring the temperature every 6 hours; the blood pressure every week; the stock price at the end of the day, and so on. 

Irregular data series, like the Wikipedia revisions, have no fixed frequency: edits happen at the will of the editors (though not entirely at random, since they depend on events in the real world). To study such a time series, we may still want to choose a frequency unit: a day, a week, a month, a year.

Before learning how to do that, let's talk about `date_range` and frequency syntax.

### The `date_range` function

The statements below create a timeseries of random numbers, one number for each day (that is what `freq='D'` means), for a period of 100 days in total.

In [None]:
from numpy.random import randn
drange = pd.date_range('1/1/2000', periods=100, freq='D')
ts = Series(randn(len(drange)), index=drange)
ts.head(10) 

We can indeed check that we have 100 data points:

In [None]:
ts.count()

The parameter `freq` can take many values. Here are some more examples:

In [None]:
# frequency is 3 days
pd.date_range("01/01/17", periods=31, freq="3D")

As we can see, 31 dates were created, each 3 days apart.

In [None]:
# make the "frequency step" to be 90 minutes
pd.date_range("01/01/17 00:00", periods=10, freq="90min")

If we don't provide a value for the `freq` parameter, it defaults to 'D' (one day). The `date_range` function can also take a start and an end date, as below:

In [None]:
pd.date_range('10/10/2019', '10/20/2019')
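
To check what these calls produce without reading the full output, the following sketch inspects the last element and the length of each range. The date strings are written in ISO format here, which is a choice made for this example.

```python
import pandas as pd

# 31 dates, each 3 days apart, starting on 1 January 2017.
every_third_day = pd.date_range("2017-01-01", periods=31, freq="3D")

# 10 timestamps, each 90 minutes apart.
every_90_min = pd.date_range("2017-01-01 00:00", periods=10, freq="90min")

# Start and end date with the default daily frequency; both ends are included.
start_to_end = pd.date_range("2019-10-10", "2019-10-20")

print(every_third_day[-1])  # 30 steps of 3 days after the start
print(every_90_min[-1])     # 9 steps of 90 minutes after midnight
print(len(start_to_end))
```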

Here is a table that contains all values that can be passed to the `freq` parameter:

<img src="frequency.png" width=800>

### A resampling example

For the timeseries `ts` we created above with random numbers, let's see what resampling does.

Because the values were "recorded" daily, we will ask for a resampling based on a month:

In [None]:
ts.resample("M").mean()

Since we are changing the data to go from daily values to a monthly value, once we resample, we have to specify what kind of value we want (mean, sum, max, min, some other function, etc.)

In [None]:
# let's find the sum for the weekly sample (all data collected in one week)
ts.resample("W").sum()
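
Because `ts` holds random numbers, its resampled values change on every run. The sketch below uses a deterministic series instead (the integers 0 to 13 over two full weeks, made up for this example), so that the effect of the aggregation function is easy to verify by hand.

```python
import pandas as pd

# 14 consecutive days starting on Monday, 3 January 2000.
days = pd.date_range("2000-01-03", periods=14, freq="D")
values = pd.Series(range(14), index=days)

# "W" bins end on Sunday; each bin is labeled with that Sunday.
weekly_sum = values.resample("W").sum()
weekly_max = values.resample("W").max()

print(weekly_sum)
print(weekly_max)
```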

<a id="sec5"></a>

## 4. Wikipedia Revisions Timeseries

As a reminder, we created a series `tsRevWiki` earlier in this notebook. Let's look at it again:

In [None]:
tsRevWiki.head(10)

How big is this series?

In [None]:
tsRevWiki.shape

Let's create a timeseries that shows the number of edits by year. We will resample using the frequency 'A' (see the table above; pandas 2.2 and later deprecate this alias in favor of 'YE', and 'M' in favor of 'ME'). Remember that we did this step with a `for` loop earlier in the notebook and got these results:

```
2003 10
2004 24
2005 55
2006 436
2007 638
2008 481
2009 371
2010 301
2011 168
2012 134
2013 138
2014 106
2015 152
2016 95
2017 159
```

In [None]:
tsRevWiki.resample('A').count()

### Visualizing a timeseries

pandas knows how to plot timeseries automatically, but let's get matplotlib in the namespace, given that it's needed to show the plots.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

Let's store the result into a new series, so that we can call the plot method on it. It's as simple as that:

In [None]:
revByYear = tsRevWiki.resample('A').count()
revByYear.plot(figsize=(8,5), title="Number of revisions by year")
plt.show()

In [None]:
revByMonth = tsRevWiki.resample('M').count()
revByMonth.plot(figsize=(8,5), title="Number of revisions by month")
plt.show()

We can always **smooth** the data by making the interval bigger, for example, 3 months:

In [None]:
revByThreeMonths = tsRevWiki.resample('3M').count()
revByThreeMonths.plot(figsize=(8,5), title="Number of revisions quarterly")
plt.show()

We can focus on a single year, for example, 2007:

In [None]:
revsIn2007 = tsRevWiki['2007']
revsIn2007Months = revsIn2007.resample('M').count()
revsIn2007Months.plot(title="Number of revisions in 2007", # not using figsize
                      x_compat=True # turns off pandas' automatic tick adjustment for time series
                     ) 
plt.show()

In the same way, we can zoom in on the month of October 2017:

In [None]:
revsOct17 = tsRevWiki['2017-10']
revsOct17Day = revsOct17.resample('D').count()
revsOct17Day.plot(figsize=(8,5), 
                  title="Number of revisions in 10/2017",
                  x_compat=True)
plt.show()

### Find most active editing days

Given that we can create a series based on daily edit counts, we can find the day with most edits:

In [None]:
revByDays = tsRevWiki.resample('D').count()
revByDays.sort_values(ascending=False).head(10)

We can see how 2017/10/13, one of the days in the evolving Harvey Weinstein story, has the second largest number of edits.
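
The same steps can be traced on a handful of made-up timestamps. Note that `resample('D')` also creates rows for days without any edits (with a count of 0), and that `idxmax` returns the single busiest day directly. The usernames `u1`, `u2`, `u3` are placeholders.

```python
import pandas as pd

stamps = pd.to_datetime([
    "2017-10-13 08:00:00", "2017-10-13 09:30:00", "2017-10-13 21:15:00",
    "2017-10-15 10:00:00", "2017-10-15 18:45:00",
    "2017-10-16 07:20:00",
])
edits = pd.Series(["u1", "u2", "u1", "u3", "u1", "u2"], index=stamps)

# One row per calendar day between the first and last edit.
per_day = edits.resample("D").count()

busiest_day = per_day.idxmax()                            # label of the largest count
top_days = per_day.sort_values(ascending=False).head(2)  # two busiest days

print(per_day)
print(busiest_day)
```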

### Find most active editors

To find the most active editors, we will create a dataframe from the series, so that we can use the method `groupby`, which works better on dataframes.

Notice that we can create the dataframe from the values and the index of the timeseries `tsRevWiki`.

In [None]:
dfWiki = DataFrame({'editors': tsRevWiki}, 
                     index=tsRevWiki.index)
dfWiki.head()

In [None]:
dfWiki.shape

Now we can use the method `groupby` that will group together values in the column 'editors', find how often each editor occurs and show that value in a new column, "total":

In [None]:
dfWikiT = dfWiki.groupby('editors').size().reset_index(name="total")
dfWikiT.head(10)

In [None]:
dfWikiT.shape

We can see that the number of rows in this new dataframe is smaller, since some editors have more than one edit.

By the way, editor names such as 108.2.173.243 are the IP addresses of editors. If someone edits a Wiki page without using an account, the system automatically captures their IP address.

Let's sort to find the most prolific editors:

In [None]:
dfWikiT.sort_values('total', ascending=False)[:10]
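
The sketch below repeats the group, count, and sort steps on a tiny made-up frame, and compares the result with `value_counts`, a shortcut that counts and sorts in one call. The editor names are hypothetical.

```python
import pandas as pd

# Hypothetical editors; each row stands for one revision.
df = pd.DataFrame(
    {"editors": ["alice", "bob", "alice", "10.0.0.1", "alice", "bob"]}
)

# Same steps as in the notebook: group, count, name the column, sort.
totals = (
    df.groupby("editors")
      .size()
      .reset_index(name="total")
      .sort_values("total", ascending=False)
)

# Shortcut: a Series of counts, already sorted from most to least frequent.
shortcut = df["editors"].value_counts()

print(totals)
print(shortcut)
```

The `groupby` version returns a dataframe with two named columns, which is convenient for further filtering; `value_counts` returns a Series indexed by editor name.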

### Focus on one editor

Let's look at the behavior of a single editor, for example, Nymf (third most active):

In [None]:
oneUser = dfWiki[dfWiki['editors']=='Nymf'] # select from the frame the rows that fulfill the query
oneUser.head()

We can resample the events by year and plot the timeseries:

In [None]:
oneUser.resample('A').count().plot(legend=False, title="One user's Wiki editing activity")
plt.show()

Similarly, we can do this with grouping by year:

In [None]:
oneUser.groupby(oneUser.index.year).count()

This shows that this user was active in 7 different years.
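
One difference between the two approaches is worth a sketch: grouping by `index.year` only lists years that contain at least one row, whereas resampling creates a row for every period between the first and last timestamp. The example below uses invented dates and fills the missing years by hand with `reindex`.

```python
import pandas as pd

stamps = pd.to_datetime(["2007-03-01", "2007-06-15", "2009-01-10", "2012-12-31"])
user = pd.DataFrame({"editors": ["someone"] * 4}, index=stamps)

# Only years with at least one edit appear in the result.
per_year = user.groupby(user.index.year).size()
active_years = len(per_year)

# Add the missing years explicitly, with a count of 0.
first, last = int(per_year.index.min()), int(per_year.index.max())
all_years = per_year.reindex(range(first, last + 1), fill_value=0)

print(per_year)
print(all_years)
```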

Let's look at the most prolific user, who has only an IP address: 68.190.48.20

In [None]:
topUser = dfWiki[dfWiki['editors']=='68.190.48.20'] # select from the frame the rows that fulfill the query
topUser.head()

In [None]:
topUser.groupby(topUser.index.year).count()

It looks like this user did all of their editing in one year. We can look at this particular year:

In [None]:
topUser.loc['2007'].resample('M').count()  # .loc is needed to select dataframe rows by date string

This user seems to have done most of their editing in one month:

In [None]:
topUser.loc['2007/03'].resample('D').count().plot(legend=False, title="One user in one month")
plt.show()

**Note:** Look at the x ticks. If we were to use `x_compat=True`, they would be displayed differently.

### Find page creator

Who created the page of Rose McGowan? We can find this info from the time series.

In [None]:
dfWiki.tail(5)

In [None]:
creator = dfWiki['editors'].iloc[-1]  # last row by position = oldest revision
creator

In [None]:
dfWiki[dfWiki['editors']==creator]

This user only created the page and then never returned to make edits to it. Or, they created a username and then used that to do their edits.
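
Taking the last row works because the revisions happen to be stored from newest to oldest. A sketch that does not rely on that ordering sorts the index first and then takes the first row. The data and usernames below are made up.

```python
import pandas as pd

# Rows deliberately out of chronological order.
stamps = pd.to_datetime(
    ["2005-01-01 10:00:00", "2003-07-03 17:26:34", "2004-02-02 09:00:00"]
)
df = pd.DataFrame({"editors": ["user_b", "user_a", "user_c"]}, index=stamps)

# Sort by time, then take the first row: the earliest revision.
oldest_first = df.sort_index()
creator = oldest_first["editors"].iloc[0]
created_at = oldest_first.index[0]

print(creator, created_at)
```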

<a id="sec6"></a>

## 5. Task for you

To complete this task, you need to have completed the last task in the "Getting data from Wikipedia" notebook, which creates a file titled `wc_revisions.json`. If you have that file, you can continue with this task.

1. Load the file `wc_revisions.json`
2. Create a timeseries for this data, similarly to what you did for Rose McGowan's page
3. Create the visualization of the timeseries with different frequencies (yearly, quarterly, monthly). 
