# Data Preparation: Load & Explore Time Series Data

## Import libraries

In [1]:
import numpy as np
import pandas as pd

## Load data

- header: ‘0’ tells pandas that the first row of the file holds the column names.
- parse_dates: ‘True’ tells pandas to parse the index column as dates. When the dates are in a format pandas cannot infer, state the format explicitly instead, either with the date_format argument (pandas 2.0 and later) or with the date_parser argument (deprecated since pandas 2.0).
- index_col: ‘0’ tells pandas to use the first column, which holds the dates, as the index.
- squeeze: ‘True’ tells pandas to return a Series rather than a DataFrame when the data has only one column. This argument was deprecated in pandas 1.4 and removed in 2.0; on newer versions, call .squeeze("columns") on the result of read_csv() instead.

In [2]:
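# download the series; .squeeze("columns") turns the one-column DataFrame into a Series
# (the squeeze=True argument of read_csv was removed in pandas 2.0)
url = "https://raw.githubusercontent.com/Kanbc/ar-model-python/master/data/daily-minimum-temperatures.csv"
series = pd.read_csv(url,
                     header=0,
                     index_col=0,
                     parse_dates=True).squeeze("columns")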
print(type(series))

<class 'pandas.core.series.Series'>
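When the dates are in a format that pandas does not parse as intended, one option is to read the column as text and convert it with an explicit format string. The sketch below assumes a small, made-up CSV with day/month/year dates; the data, the format string, and the name demo are illustrative and not part of the dataset above.

```python
import io
import pandas as pd

# Hypothetical CSV with day/month/year dates, standing in for a file
# whose dates pandas would not parse the way we want by default.
raw = io.StringIO(
    "Date,Temp\n"
    "01/02/1981,20.7\n"
    "02/02/1981,17.9\n"
    "03/02/1981,18.8\n"
)

# Read the date column as plain text first, then convert it with an explicit format.
df = pd.read_csv(raw, header=0)
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Use the parsed dates as the index and keep the single value column as a Series.
demo = df.set_index("Date")["Temp"]
print(demo)
```

Converting after loading works the same way across pandas versions, which is why it is used here rather than a version-specific read_csv() argument.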


## Explore data

- Use the head() function to look at the first five records; pass n to view the first n records instead.
- Check the number of observations in the series against the number you expect, so that missing or duplicated records are caught early.
- Slice the time series by querying different time intervals. For example, let’s take a look at all the observations from January 1981.
- Calculating and reviewing summary statistics is also an important step in time series exploration: it gives us an idea of the distribution and spread of the values. The describe() function calculates these statistics.


In [3]:
# first glance at data
print(series.head())

Date
1981-01-01    20.7
1981-01-02    17.9
1981-01-03    18.8
1981-01-04    14.6
1981-01-05    15.8
Name: Temp, dtype: float64


In [4]:
# number of observations
print(series.size)

3650
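A raw count is easier to judge when compared with the number of dates the series should cover. The following is a minimal sketch on a small, made-up daily series with one day deliberately left out; the values and the names demo, expected, and missing_dates are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical daily series with one day (1981-01-03) deliberately left out.
idx = pd.to_datetime(["1981-01-01", "1981-01-02", "1981-01-04", "1981-01-05"])
demo = pd.Series([20.7, 17.9, 14.6, 15.8], index=idx, name="Temp")

# Every calendar day between the first and last timestamp.
expected = pd.date_range(demo.index.min(), demo.index.max(), freq="D")

# Dates that should be present but are not.
missing_dates = expected.difference(demo.index)

print(len(demo), "observations,", len(expected), "expected")
print("missing dates:", list(missing_dates))
print("missing values:", int(demo.isna().sum()))
```

The same comparison could be run on series itself to see whether its count matches a complete daily calendar.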


In [5]:
# querying by time
print(series["1981-01"])

Date
1981-01-01    20.7
1981-01-02    17.9
1981-01-03    18.8
1981-01-04    14.6
1981-01-05    15.8
1981-01-06    15.8
1981-01-07    15.8
1981-01-08    17.4
1981-01-09    21.8
1981-01-10    20.0
1981-01-11    16.2
1981-01-12    13.3
1981-01-13    16.7
1981-01-14    21.5
1981-01-15    25.0
1981-01-16    20.7
1981-01-17    20.6
1981-01-18    24.8
1981-01-19    17.7
1981-01-20    15.5
1981-01-21    18.2
1981-01-22    12.1
1981-01-23    14.4
1981-01-24    16.0
1981-01-25    16.5
1981-01-26    18.7
1981-01-27    19.4
1981-01-28    17.2
1981-01-29    15.5
1981-01-30    15.1
1981-01-31    15.4
Name: Temp, dtype: float64
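Beyond selecting a whole month, a series with a datetime index can also be queried by date range or by a condition on the index. The sketch below uses a made-up two-month series (the values and the name demo are assumptions) to show three common forms.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering January and February 1981.
idx = pd.date_range("1981-01-01", "1981-02-28", freq="D")
demo = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="Temp")

# A whole month by partial date string; .loc makes the label-based selection explicit.
january = demo.loc["1981-01"]

# A date range; both endpoints are included when slicing by label.
first_week = demo.loc["1981-01-01":"1981-01-07"]

# A boolean mask on the index, here every observation that falls on a Monday.
mondays = demo[demo.index.dayofweek == 0]

print(len(january), len(first_week), len(mondays))
```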


In [6]:
# summary statistics of data
print(series.describe())

count    3650.000000
mean       11.177753
std         4.071837
min         0.000000
25%         8.300000
50%        11.000000
75%        14.000000
max        26.300000
Name: Temp, dtype: float64
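If the default quartiles are not the cut points of interest, describe() accepts a list of percentiles, and individual statistics can be computed directly. This is a small sketch on made-up values; the name demo and the chosen percentiles are assumptions.

```python
import pandas as pd

# Hypothetical sample of temperatures.
demo = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8, 15.8, 15.8, 17.4], name="Temp")

# Ask describe() for the 5th, 50th and 95th percentiles instead of the quartiles.
stats = demo.describe(percentiles=[0.05, 0.5, 0.95])
print(stats)

# Individual statistics are also available directly, e.g. the interquartile range.
iqr = demo.quantile(0.75) - demo.quantile(0.25)
print("interquartile range:", iqr)
```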
