# Loading time series data into Pandas

This notebook demonstrates how to load a time series into a `pandas.DataFrame`. There are a few extra steps we need to take when dealing with time series:

* Set the index of the `pandas.DataFrame` to be a `DatetimeIndex`.
* Set the frequency of the `DatetimeIndex`, e.g. daily or monthly.
* Parse dates and watch out for US versus UK date format issues (i.e. mm/dd/yyyy vs. dd/mm/yyyy).

#### References
* [Pandas Timeseries](https://pandas.pydata.org/docs/user_guide/timeseries.html)
* [Pandas `read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

# Imports

Let's import pandas as usual and check the version number.

In [1]:
import pandas as pd
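The cell above only performs the import. A minimal way to do the version check, assuming only that pandas is installed, might look like this:

```python
import pandas as pd

# pd.__version__ is a plain string; its value depends on your installation
print(pd.__version__)
```

Date parsing behaviour has changed between pandas releases, so it is worth knowing which version you are running.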

### Alcohol sales dataset

Let's download and open an example time series. This one is monthly alcohol sales (it looks like the figure below) and covers a long period (1992 to 2019). It is an interesting series because it contains both trend and seasonality, and the size of the seasonal variation increases over time (aka "multiplicative seasonality").

![image](images/alcohol_ts.png)
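To make "multiplicative seasonality" concrete, here is a small sketch on synthetic data (not the alcohol sales series). The trend and seasonal factor are made-up values chosen for illustration; the point is only that multiplying them makes the seasonal swing grow as the level rises.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (NOT the alcohol sales data): 10 years of month starts
idx = pd.date_range("2000-01-01", periods=120, freq="MS")
t = np.arange(len(idx))

trend = 100 + 2 * t                              # steadily rising level
seasonal = 1 + 0.2 * np.sin(2 * np.pi * t / 12)  # repeating 12-month factor

# Multiplicative: the seasonal factor scales the trend,
# so the size of the swings grows with the level
synthetic = pd.Series(trend * seasonal, index=idx, name="synthetic")

# Compare the within-year range (max - min) of the first and last year
first_year_range = synthetic.loc["2000"].max() - synthetic.loc["2000"].min()
last_year_range = synthetic.loc["2009"].max() - synthetic.loc["2009"].min()
print(first_year_range, last_year_range)
```

The last year's range is larger than the first year's, which is the same pattern visible in the figure above.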


### Reading a CSV

To load a time series into `pandas`, use the `read_csv()` function. Some additional steps are needed:

* Set `parse_dates=True`. This tells pandas to parse the index as dates rather than leaving it as strings.
* Set `index_col` to the name of the date column. In this example the column is named `DATE`.
* Before you load the dataset, check the date format. If it is in UK day-first format, set `dayfirst=True`.
* After you have loaded the data, set the frequency of the datetime index:
    * Daily frequency = `D`
    * Monthly frequency = `MS` (month start)
    * See full list of offset aliases [HERE](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases)
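
The effect of `dayfirst` is easiest to see on an ambiguous date. The sketch below uses a hypothetical two-row CSV held in memory (not the alcohol sales file) and reads it twice, once with the default and once with `dayfirst=True`.

```python
import io
import pandas as pd

# Hypothetical CSV with ambiguous dates (not the alcohol sales data)
csv_text = "DATE,sales\n01/02/2020,10\n02/02/2020,12\n"

# Default: ambiguous dates are read month first (US style)
us = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col="DATE")

# dayfirst=True: the same strings are read day first (UK style)
uk = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col="DATE",
                 dayfirst=True)

print(us.index[0])  # first row read as 2 January 2020
print(uk.index[0])  # first row read as 1 February 2020
```

Neither read raises an error, so a wrong choice can go unnoticed. Checking a few rows against the raw file is a cheap safeguard.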


In [2]:
# read in data
url = 'https://raw.githubusercontent.com/hsma5/9a_introduction_to_forecasting/main/' \
        + 'data/Alcohol_Sales.csv'
ts = pd.read_csv(url, 
                 parse_dates=True, 
                 index_col='DATE')

In [3]:
# let's take a look at the data
ts.head()

| DATE       | sales |
|------------|-------|
| 1992-01-01 | 3459  |
| 1992-02-01 | 3458  |
| 1992-03-01 | 4002  |
| 1992-04-01 | 4564  |
| 1992-05-01 | 4221  |


In [4]:
# let's take a look at the info
ts.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 336 entries, 1992-01-01 to 2019-12-01
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   sales   336 non-null    int64
dtypes: int64(1)
memory usage: 5.2 KB


In [5]:
# let's take a look at the index
# datetime64[ns] has nanosecond resolution, should that level
# of granularity be required.
# NOTE that freq=None
ts.index

DatetimeIndex(['1992-01-01', '1992-02-01', '1992-03-01', '1992-04-01',
               '1992-05-01', '1992-06-01', '1992-07-01', '1992-08-01',
               '1992-09-01', '1992-10-01',
               ...
               '2019-03-01', '2019-04-01', '2019-05-01', '2019-06-01',
               '2019-07-01', '2019-08-01', '2019-09-01', '2019-10-01',
               '2019-11-01', '2019-12-01'],
              dtype='datetime64[ns]', name='DATE', length=336, freq=None)
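
The `freq=None` above means pandas has not been told the spacing of the index. One way to check which offset alias fits before assigning it is `pd.infer_freq`. The sketch below uses a short stand-in index rather than the alcohol sales data.

```python
import pandas as pd

# Stand-in index built from explicit month-start dates, so freq starts as None
idx = pd.DatetimeIndex(["1992-01-01", "1992-02-01", "1992-03-01", "1992-04-01"])
print(idx.freq)      # None

# infer_freq inspects the spacing and returns an offset alias string
inferred = pd.infer_freq(idx)
print(inferred)      # MS
```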

In [6]:
# set the frequency of the datetime index (monthly data)
ts.index.freq = 'MS'

In [7]:
# let's now take another look at the index
ts.index

DatetimeIndex(['1992-01-01', '1992-02-01', '1992-03-01', '1992-04-01',
               '1992-05-01', '1992-06-01', '1992-07-01', '1992-08-01',
               '1992-09-01', '1992-10-01',
               ...
               '2019-03-01', '2019-04-01', '2019-05-01', '2019-06-01',
               '2019-07-01', '2019-08-01', '2019-09-01', '2019-10-01',
               '2019-11-01', '2019-12-01'],
              dtype='datetime64[ns]', name='DATE', length=336, freq='MS')

In [8]:
# let's now take another look at the info
# NOTE the new line: Freq: MS
ts.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 336 entries, 1992-01-01 to 2019-12-01
Freq: MS
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   sales   336 non-null    int64
dtypes: int64(1)
memory usage: 5.2 KB
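
Assigning `freq` directly works here because every month is present. As a hedged aside, the sketch below shows what happens on a stand-in frame with one month missing: pandas rejects the assignment, and `DataFrame.asfreq` is one way to put the data on a regular monthly grid instead. The frame `gappy` is hypothetical and is not part of the alcohol sales data.

```python
import pandas as pd

# Stand-in monthly data with March missing (not the alcohol sales data)
idx = pd.DatetimeIndex(["1992-01-01", "1992-02-01", "1992-04-01"], name="DATE")
gappy = pd.DataFrame({"sales": [3459, 3458, 4564]}, index=idx)

# Direct assignment is validated against the actual dates, so a gap is rejected
try:
    gappy.index.freq = "MS"
    assigned = True
except ValueError:
    assigned = False
print(assigned)

# asfreq reindexes to a regular month-start grid; the missing month becomes NaN
regular = gappy.asfreq("MS")
print(regular)
```

How to fill the resulting `NaN` (if at all) depends on the analysis and is left open here.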


In [11]:
# read in data - concise
url = 'https://raw.githubusercontent.com/hsma5/9a_introduction_to_forecasting/main/' \
        + 'data/Alcohol_Sales.csv'
ts = pd.read_csv(url, 
                 parse_dates=True, 
                 index_col='DATE')
# set the frequency of the datetime index (monthly data)
ts.index.freq = 'MS'

ts.head()


| DATE       | sales |
|------------|-------|
| 1992-01-01 | 3459  |
| 1992-02-01 | 3458  |
| 1992-03-01 | 4002  |
| 1992-04-01 | 4564  |
| 1992-05-01 | 4221  |
