# 09. Handling Time Series Data in Pandas
- Time series data, also called time-stamped data, is a sequence of data points indexed in time order.
- It is data collected at different points in time, including in real time.
- Examples: stock prices over time, weather data, annual retail sales.

📖 **Further reading:** Understanding time series data: https://www.influxdata.com/what-is-time-series-data/

In [1]:
import pandas as pd
df = pd.read_csv("./data/HistoricalQuotes.csv")
df

Unnamed: 0,date,close,volume,open,high,low
0,16:00,200.99,24619446,201.230,202.760,199.29
1,2019/08/09,200.99,24619750.0000,201.300,202.760,199.29
2,2019/08/08,203.43,27009520.0000,200.200,203.530,199.39
3,2019/08/07,199.04,33364400.0000,195.410,199.560,193.82
4,2019/08/06,197.00,35824790.0000,196.310,198.067,194.04
...,...,...,...,...,...,...
248,2018/08/15,210.24,28595230.0000,209.220,210.740,208.33
249,2018/08/14,209.75,20679270.0000,210.155,210.560,208.26
250,2018/08/13,208.87,25864510.0000,207.700,210.952,207.70
251,2018/08/10,207.53,24592460.0000,207.360,209.100,206.67


In [2]:
df.head()

Unnamed: 0,date,close,volume,open,high,low
0,16:00,200.99,24619446.0,201.23,202.76,199.29
1,2019/08/09,200.99,24619750.0,201.3,202.76,199.29
2,2019/08/08,203.43,27009520.0,200.2,203.53,199.39
3,2019/08/07,199.04,33364400.0,195.41,199.56,193.82
4,2019/08/06,197.0,35824790.0,196.31,198.067,194.04


In [3]:
type(df.date[1])

str
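Because the `date` column was read as plain text, date-aware operations are not available on it yet. A minimal sketch of converting such strings by hand, using a hypothetical three-row sample (not the real file) in the same `YYYY/MM/DD` format:

```python
import pandas as pd

# Hypothetical sample that mimics the "date" column format shown above.
sample = pd.DataFrame({
    "date": ["2019/08/09", "2019/08/08", "2019/08/07"],
    "close": [200.99, 203.43, 199.04],
})

# Before conversion, each value is an ordinary Python string.
print(type(sample["date"][0]))

# Passing an explicit format avoids any guessing about day/month order.
sample["date"] = pd.to_datetime(sample["date"], format="%Y/%m/%d")

# Using the dates as the index enables date-based selection later on.
sample = sample.set_index("date")
print(type(sample.index))
```

The next section reaches the same result at load time with `parse_dates` and `index_col`.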

## 9.1. Load and Process Time Data

In [4]:
df = pd.read_csv("./data/HistoricalQuotes.csv", parse_dates=["date"], index_col="date")
df

Unnamed: 0_level_0,close,volume,open,high,low
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2022-08-03 16:00:00,200.99,24619446,201.230,202.760,199.29
2019-08-09 00:00:00,200.99,24619750.0000,201.300,202.760,199.29
2019-08-08 00:00:00,203.43,27009520.0000,200.200,203.530,199.39
2019-08-07 00:00:00,199.04,33364400.0000,195.410,199.560,193.82
2019-08-06 00:00:00,197.00,35824790.0000,196.310,198.067,194.04
...,...,...,...,...,...
2018-08-15 00:00:00,210.24,28595230.0000,209.220,210.740,208.33
2018-08-14 00:00:00,209.75,20679270.0000,210.155,210.560,208.26
2018-08-13 00:00:00,208.87,25864510.0000,207.700,210.952,207.70
2018-08-10 00:00:00,207.53,24592460.0000,207.360,209.100,206.67


In [5]:
df.describe()

Unnamed: 0,close,open,high,low
count,253.0,253.0,253.0,253.0
mean,193.091858,193.031711,195.073753,191.073962
std,21.750344,21.77206,21.914049,21.70014
min,142.19,143.98,145.72,142.0
25%,174.87,174.94,176.0,173.94
50%,197.0,196.45,199.26,194.04
75%,208.88,209.22,210.64,207.29
max,232.07,230.78,233.47,229.78


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 253 entries, 2022-08-03 16:00:00 to 2018-08-09 00:00:00
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   close   253 non-null    float64
 1   volume  253 non-null    object 
 2   open    253 non-null    float64
 3   high    253 non-null    float64
 4   low     253 non-null    float64
dtypes: float64(4), object(1)
memory usage: 11.9+ KB
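The `info()` output above reports `volume` as `object` rather than a numeric dtype, which is also why `describe()` omits it. One possible cleanup, sketched on a hypothetical sample and assuming the values are numeric strings with the occasional unparseable entry:

```python
import pandas as pd

# Hypothetical values: two numeric strings and one entry that cannot be parsed.
volume = pd.Series(["24619446", "24619750.0000", "n/a"], dtype=object)

# errors="coerce" turns unparseable entries into NaN instead of raising.
cleaned = pd.to_numeric(volume, errors="coerce")

print(cleaned.dtype)
print(cleaned.isna().sum())
```

Whether coercing bad values to `NaN` is acceptable depends on the data; inspecting the offending rows first is usually worthwhile.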


### Access the rows for August 2019

In [7]:
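# Selecting rows with df['2019-08'] is deprecated in newer pandas releases;
# df.loc['2019-08'] (next cell) is the supported form.
df['2019-08']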


Unnamed: 0_level_0,close,volume,open,high,low
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-08-09,200.99,24619750.0,201.3,202.76,199.29
2019-08-08,203.43,27009520.0,200.2,203.53,199.39
2019-08-07,199.04,33364400.0,195.41,199.56,193.82
2019-08-06,197.0,35824790.0,196.31,198.067,194.04
2019-08-05,193.34,52392970.0,197.99,198.649,192.58
2019-08-02,204.02,40862120.0,205.53,206.43,201.63
2019-08-01,208.43,54017920.0,213.9,218.03,206.7435


In [8]:
df.loc['2019-08']

Unnamed: 0_level_0,close,volume,open,high,low
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-08-09,200.99,24619750.0,201.3,202.76,199.29
2019-08-08,203.43,27009520.0,200.2,203.53,199.39
2019-08-07,199.04,33364400.0,195.41,199.56,193.82
2019-08-06,197.0,35824790.0,196.31,198.067,194.04
2019-08-05,193.34,52392970.0,197.99,198.649,192.58
2019-08-02,204.02,40862120.0,205.53,206.43,201.63
2019-08-01,208.43,54017920.0,213.9,218.03,206.7435


In [9]:
df.loc['2019-08'].close.mean()

200.89285714285717
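The month selection above relies on partial string indexing: a string such as `'2019-08'` matches every row whose timestamp falls inside that month. A small self-contained sketch on synthetic data (the frame and its values are made up for illustration):

```python
import pandas as pd

# Synthetic daily index spanning the end of July and the start of August 2019.
idx = pd.date_range("2019-07-29", "2019-08-09", freq="D")
prices = pd.DataFrame({"close": range(len(idx))}, index=idx)

# A year-month string selects every row in that month.
august = prices.loc["2019-08"]
print(len(august))
print(august["close"].mean())

# The same selection written as an explicit boolean mask on the index.
mask = (prices.index.year == 2019) & (prices.index.month == 8)
august_by_mask = prices[mask]
```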

In [10]:
# date range
df.loc['2019-08-09':'2019-01-01']

Unnamed: 0_level_0,close,volume,open,high,low
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-08-09,200.99,24619750.0000,201.30,202.7600,199.29
2019-08-08,203.43,27009520.0000,200.20,203.5300,199.39
2019-08-07,199.04,33364400.0000,195.41,199.5600,193.82
2019-08-06,197.00,35824790.0000,196.31,198.0670,194.04
2019-08-05,193.34,52392970.0000,197.99,198.6490,192.58
...,...,...,...,...,...
2019-01-08,150.75,40622910.0000,149.56,151.8200,148.52
2019-01-07,147.93,54571440.0000,148.70,148.8300,145.90
2019-01-04,148.26,57423650.0000,144.53,148.5499,143.80
2019-01-03,142.19,91106840.0000,143.98,145.7200,142.00
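The file stores rows newest-first, so the slice above runs from the later date to the earlier one. One cautious alternative, sketched here on synthetic data, is to sort the index in ascending order first and then slice from the earlier date to the later one; both endpoints are included:

```python
import pandas as pd

# Synthetic newest-first frame, mimicking the ordering of the quotes file.
idx = pd.date_range("2019-01-01", "2019-01-10", freq="D")
desc = pd.DataFrame({"close": range(10)}, index=idx[::-1])

# Sort ascending so label slices read naturally from earlier to later.
asc = desc.sort_index()

# Label-based slices on a DatetimeIndex include both endpoints.
window = asc.loc["2019-01-03":"2019-01-05"]
print(window)
```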


## 9.2. Parsing Unix timestamps

- It's not obvious how to deal with Unix timestamps in pandas.
- The file we're using here is a popularity-contest file I found on my system at `/var/log/popularity-contest`.
- `<atime>` and `<ctime>` are the access time and creation time of the program.
- `<package-name>` is the name of the Debian package that contains `<mru-program>`.
- `<mru-program>` is the most recently used program, static library, or header (.h) file in the package.

Here's an [explanation of how this file works](http://popcon.ubuntu.com/README).

In [11]:
# Read the file and drop the last row
# pop_con is short for "popularity contest"
pop_con = pd.read_csv('./data/popularity-contest.txt', sep=' ')[:-1]
pop_con.columns = ['atime', 'ctime', 'package-name', 'mru-program', 'tag']

In [12]:
pop_con.head()

Unnamed: 0,atime,ctime,package-name,mru-program,tag
0,1387295797,1367633260,perl-base,/usr/bin/perl,
1,1387295796,1354370480,login,/bin/su,
2,1387295743,1354341275,libtalloc2,/usr/lib/x86_64-linux-gnu/libtalloc.so.2.0.7,
3,1387295743,1387224204,libwbclient0,/usr/lib/x86_64-linux-gnu/libwbclient.so.0,<RECENT-CTIME>
4,1387295742,1354341253,libselinux1,/lib/x86_64-linux-gnu/libselinux.so.1,


In [13]:
type(pop_con.atime[0])

str

The convenient part about parsing timestamps in pandas is that numpy datetimes are already stored as integer counts since the Unix epoch (1970-01-01 UTC). So all we need to do is tell pandas which unit these integers are in; no text parsing is involved, only a rescaling to the internal resolution.

We need to convert these to ints to start:

In [14]:
pop_con['atime'] = pop_con['atime'].astype(int)
pop_con['ctime'] = pop_con['ctime'].astype(int)

Every numpy array and pandas series has a dtype, usually `int64`, `float64`, or `object`. Some of the time types available are `datetime64[s]`, `datetime64[ms]`, `datetime64[us]`, and `datetime64[ns]`. There are matching `timedelta64` types for durations.

We can use the `pd.to_datetime` function to convert our integer timestamps into datetimes. This is a cheap, vectorized operation: the integers are reinterpreted as seconds since the epoch and rescaled to the resolution pandas uses internally, rather than parsed value by value.

In [15]:
pop_con['atime'] = pd.to_datetime(pop_con['atime'], unit='s')
pop_con['ctime'] = pd.to_datetime(pop_con['ctime'], unit='s')
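As a quick check of what `unit='s'` does, here is a small sketch using the two integer values from the first row of the table above. The round trip back to integers is an illustration added here, not part of the original notebook:

```python
import pandas as pd

# The atime and ctime integers from the first row shown earlier.
raw = pd.Series([1387295797, 1367633260])

# unit="s" tells pandas the integers count seconds since 1970-01-01 UTC.
as_dates = pd.to_datetime(raw, unit="s")
print(as_dates[0])  # 2013-12-17 15:56:37
print(as_dates[1])  # 2013-05-04 02:07:40

# Going back: subtract the epoch and divide by one second.
back = (as_dates - pd.Timestamp(0)) // pd.Timedelta(seconds=1)
print(back.tolist())
```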

If we look at the dtype now, it's `<M8[ns]`. That is numpy's short type code for `datetime64[ns]`: `<` means little-endian, `M` is the datetime kind, `8` is the size in bytes, and `ns` is the unit.

In [16]:
pop_con['atime'].dtype

dtype('<M8[ns]')
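The pieces of that dtype code can be inspected directly in numpy. A minimal sketch; the byte-order character in `dt.str` depends on the machine, so it is printed rather than asserted:

```python
import numpy as np

dt = np.dtype("datetime64[ns]")

# "M" is numpy's kind code for datetimes; each value occupies 8 bytes.
print(dt.kind)
print(dt.itemsize)

# The short code (for example '<M8[ns]' on little-endian machines).
print(dt.str)

# "M8[ns]" and "datetime64[ns]" name the same dtype.
print(np.dtype("M8[ns]") == dt)

# Durations use the lowercase kind code "m".
td = np.dtype("timedelta64[s]")
print(td.kind)
```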

In [23]:
type(pop_con.atime[1])

pandas._libs.tslibs.timestamps.Timestamp

So now we can look at our `atime` and `ctime` as dates!

In [17]:
pop_con.head()

Unnamed: 0,atime,ctime,package-name,mru-program,tag
0,2013-12-17 15:56:37,2013-05-04 02:07:40,perl-base,/usr/bin/perl,
1,2013-12-17 15:56:36,2012-12-01 14:01:20,login,/bin/su,
2,2013-12-17 15:55:43,2012-12-01 05:54:35,libtalloc2,/usr/lib/x86_64-linux-gnu/libtalloc.so.2.0.7,
3,2013-12-17 15:55:43,2013-12-16 20:03:24,libwbclient0,/usr/lib/x86_64-linux-gnu/libwbclient.so.0,<RECENT-CTIME>
4,2013-12-17 15:55:42,2012-12-01 05:54:13,libselinux1,/lib/x86_64-linux-gnu/libselinux.so.1,


Now suppose we want to look at all packages that aren't libraries.

First, I want to get rid of everything with timestamp 0. Notice how we can just use a string in this comparison, even though the column holds timestamps? pandas parses the string into a timestamp before comparing.

In [18]:
pop_con = pop_con[pop_con['atime'] > '1970-01-01']
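To make the string comparison concrete, here is a small sketch on two made-up values, one of which is timestamp 0. Comparing against the string gives the same mask as comparing against an explicit `pd.Timestamp`:

```python
import pandas as pd

# Two illustrative values: the epoch itself (timestamp 0) and a real access time.
times = pd.Series(pd.to_datetime([0, 1387295797], unit="s"))

# The string is parsed into a timestamp before the comparison happens.
mask_str = times > "1970-01-01"

# Equivalent comparison with an explicit Timestamp object.
mask_ts = times > pd.Timestamp("1970-01-01")

print(mask_str.tolist())
```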

Now we can use pandas' string methods to look only at rows where the package name doesn't contain 'lib'.

In [19]:
nonlibraries = pop_con[~pop_con['package-name'].str.contains('lib')]
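Note that a substring test is only a heuristic: it also drops any package whose name merely contains "lib". A sketch on hypothetical package names, comparing `str.contains` with the stricter `str.startswith` and showing how `na=False` handles missing names:

```python
import pandas as pd

# Hypothetical names; "calibre" contains "lib" but is not a library package.
names = pd.Series(["perl-base", "libtalloc2", "calibre", None])

# regex=False treats the pattern as a literal substring;
# na=False makes missing names count as "no match" instead of NaN.
contains_lib = names.str.contains("lib", regex=False, na=False)

# A stricter variant: only names that begin with "lib".
starts_with_lib = names.str.startswith("lib", na=False)

print(names[~contains_lib].tolist())
print(names[~starts_with_lib].tolist())
```

Which test is appropriate depends on how the packages of interest are named.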

In [20]:
nonlibraries.sort_values('ctime', ascending=False).head(10)

Unnamed: 0,atime,ctime,package-name,mru-program,tag
57,2013-12-17 04:55:39,2013-12-17 04:55:42,ddd,/usr/bin/ddd,<RECENT-CTIME>
450,2013-12-16 20:03:20,2013-12-16 20:05:13,nodejs,/usr/bin/npm,<RECENT-CTIME>
454,2013-12-16 20:03:20,2013-12-16 20:05:04,switchboard-plug-keyboard,/usr/lib/plugs/pantheon/keyboard/options.txt,<RECENT-CTIME>
445,2013-12-16 20:03:20,2013-12-16 20:05:04,thunderbird-locale-en,/usr/lib/thunderbird-addons/extensions/langpac...,<RECENT-CTIME>
396,2013-12-16 20:08:27,2013-12-16 20:05:03,software-center,/usr/sbin/update-software-center,<RECENT-CTIME>
449,2013-12-16 20:03:20,2013-12-16 20:05:00,samba-common-bin,/usr/bin/net.samba3,<RECENT-CTIME>
397,2013-12-16 20:08:25,2013-12-16 20:04:59,postgresql-client-9.1,/usr/lib/postgresql/9.1/bin/psql,<RECENT-CTIME>
398,2013-12-16 20:08:23,2013-12-16 20:04:58,postgresql-9.1,/usr/lib/postgresql/9.1/bin/postmaster,<RECENT-CTIME>
452,2013-12-16 20:03:20,2013-12-16 20:04:55,php5-dev,/usr/include/php5/main/snprintf.h,<RECENT-CTIME>
440,2013-12-16 20:03:20,2013-12-16 20:04:54,php-pear,/usr/share/php/XML/Util.php,<RECENT-CTIME>


Okay, cool, it says that I installed ddd recently. And postgresql! I remember installing those things. Neat.

The whole message here is that if you have a timestamp in seconds or milliseconds or nanoseconds, then you can just "cast" it to a `'datetime64[the-right-thing]'` and pandas/numpy will take care of the rest.
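A minimal sketch of that idea at the numpy level, using the first `atime` value from earlier. The millisecond variant is made up by multiplying the same value by 1000:

```python
import numpy as np
import pandas as pd

seconds = np.array([1387295797], dtype="int64")

# Reinterpret the integers as seconds since the epoch.
as_s = seconds.astype("datetime64[s]")

# The same instant expressed in milliseconds needs the matching unit.
as_ms = (seconds * 1000).astype("datetime64[ms]")

print(as_s[0], as_ms[0])

# pandas equivalent for a single millisecond timestamp.
ts = pd.to_datetime(1387295797000, unit="ms")
print(ts)
```

Choosing the wrong unit does not raise an error; it silently yields dates that are far off, so it is worth checking one known value.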


## Credits
- https://github.com/sarincr/Time-series-analysis-using-Python
- https://github.com/jvns/pandas-cookbook