# Introduction to Time Series Data

We will explore two main kinds of tabular data in this workshop. The first is time series data, which is *indexed* by date and time. We'll look a bit more closely at that soon, but for now just think of it as each row being labelled with a date or time, rather than a row number.

## Pandas Refresh

One of the most popular packages in Python for working with tabular data is called Pandas, which was introduced in a supplementary notebook in DSA.

The first thing we'll do is `import` the `pandas` package. Convention has us use a shortform name - `pd` - because we'll be using the package so often.

In [1]:
import pandas as pd

And below we'll use pandas' `read_csv()` to load the data into a `DataFrame`. DataFrames are the main data structure in pandas for tabular data, and lots of other programming languages use the concept of a DataFrame too! By convention, you'll often see `df` used as a variable name.

In [2]:
# Load the data
df = pd.read_csv("https://raw.githubusercontent.com/ImperialCollegeLondon/efds-ta-python/refs/heads/main/data/AAPL_2023.csv")
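As a small offline illustration of what `read_csv()` does, here is a minimal sketch that parses three rows copied from the table in this notebook out of an in-memory string instead of a URL. The names `csv_text` and `sample` are made up for this example.

```python
import io

import pandas as pd

# Three rows copied from the AAPL table shown in this notebook
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2023-01-03,130.279999,130.899994,124.169998,125.070000,124.216301,112117500
2023-01-04,126.889999,128.660004,125.080002,126.360001,125.497498,89113600
2023-01-05,127.129997,127.769997,124.760002,125.019997,124.166641,80962700
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (rows, columns)
print(sample.columns.tolist())  # column names taken from the header line
```

The first line of the text becomes the column names, and each following line becomes one row.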

Before we do anything else, it's a good idea to take a look at the DataFrame. Some methods will let us take a closer look at parts of our data. 

In [None]:
df   # display the DataFrame; open it in a data viewer such as Data Wrangler for more operations


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2023-01-03,130.279999,130.899994,124.169998,125.070000,124.216301,112117500
1,2023-01-04,126.889999,128.660004,125.080002,126.360001,125.497498,89113600
2,2023-01-05,127.129997,127.769997,124.760002,125.019997,124.166641,80962700
3,2023-01-06,126.010002,130.289993,124.889999,129.619995,128.735229,87754700
4,2023-01-09,130.470001,133.410004,129.889999,130.149994,129.261627,70790800
...,...,...,...,...,...,...,...
245,2023-12-22,195.179993,195.410004,192.970001,193.600006,193.353287,37122800
246,2023-12-26,193.610001,193.889999,192.830002,193.050003,192.803986,28919300
247,2023-12-27,192.490005,193.500000,191.089996,193.149994,192.903839,48087700
248,2023-12-28,194.139999,194.660004,193.169998,193.580002,193.333298,34049900


Other methods and attributes give us an overview of the data as a whole. The `shape` attribute (no parentheses) holds the number of rows and columns in our DataFrame, while the `info()` method summarises each column, including its data type (`dtype`). `head()` and `tail()` show the first and last five rows.

You'll notice the types are slightly different from the usual Python types - this is because they belong to the `numpy` package, which sits under the hood of `pandas`. We'll look more at `numpy` tomorrow, but for now here is a word about each of the types in our data frame.

- `float64` - 64-bit floating point (number with a decimal point)
- `int64` - 64-bit integer (whole number)
- `object` - other Python data types (strings in this case)
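To see the numeric types on a small scale, here is a minimal sketch on a hypothetical two-row frame named `sample`, built from values in the table above. It also shows that `shape` is an attribute, so it is read without parentheses.

```python
import pandas as pd

# Two rows copied from the AAPL table; "sample" is an illustrative name
sample = pd.DataFrame({
    "Date": ["2023-01-03", "2023-01-04"],
    "Close": [125.070000, 126.360001],
    "Volume": [112117500, 89113600],
})

print(sample.shape)   # attribute: a (rows, columns) tuple
print(sample.dtypes)  # one dtype per column

close_dtype = str(sample["Close"].dtype)
volume_dtype = str(sample["Volume"].dtype)
```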

In [7]:
df.shape

df.info()

df.head()

df.tail() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       250 non-null    object 
 1   Open       250 non-null    float64
 2   High       250 non-null    float64
 3   Low        250 non-null    float64
 4   Close      250 non-null    float64
 5   Adj Close  250 non-null    float64
 6   Volume     250 non-null    int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 13.8+ KB


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
245,2023-12-22,195.179993,195.410004,192.970001,193.600006,193.353287,37122800
246,2023-12-26,193.610001,193.889999,192.830002,193.050003,192.803986,28919300
247,2023-12-27,192.490005,193.5,191.089996,193.149994,192.903839,48087700
248,2023-12-28,194.139999,194.660004,193.169998,193.580002,193.333298,34049900
249,2023-12-29,193.899994,194.399994,191.729996,192.529999,192.284637,42628800


For a look at some actual data within the data frame, we can use square bracket notation and `iloc` to access columns and rows. The `i` in `iloc` refers to **integer-based indexing**, so looking at a row or column *number*.

In [None]:
# index the DataFrame like a dictionary

df["Close"]  # index Close column 

df.Close  # dot access - same as indexing - short cut with some risk (e.g., no spaces, no reserved keywords)

df["New column"] = 1  # create a new column and fill it with 1s 

df[["Close", "Open"]]  # use a list to index multiple columns 

df[df.Close > 150]   # keep only the rows where the close price is above 150

df[df.Close > 150]["Date"]  # get the dates when close price > 150, or df[df.Close > 150].Date 

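df[df.Date == "2023-12-01"]  # select the 2023-12-01 row (Date is still an ordinary column here, so df.loc["2023-12-01"] would raise a KeyError)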

df.iloc[6]  # get the row at integer position 6 (the seventh row), often used in loops

df[0:6]   # get the first 6 rows using a slice - for a single row, e.g. the sixth row, use df[5:6]

21     2023-02-02
22     2023-02-03
23     2023-02-06
24     2023-02-07
25     2023-02-08
          ...    
245    2023-12-22
246    2023-12-26
247    2023-12-27
248    2023-12-28
249    2023-12-29
Name: Date, Length: 220, dtype: object
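The filtering lines in the cell above work in two steps: the comparison builds a Boolean Series, and that Series then selects rows. A minimal sketch on a hypothetical three-row frame (values copied from rows 0, 21 and 42 of the data) makes the intermediate step visible. Using `loc` with a mask and a column name does the row and column selection in one step.

```python
import pandas as pd

# Three rows copied from the AAPL data; "sample" is an illustrative name
sample = pd.DataFrame({
    "Date": ["2023-01-03", "2023-02-02", "2023-03-06"],
    "Close": [125.070000, 150.820007, 153.830002],
})

# Step 1: the comparison yields one True/False value per row
mask = sample["Close"] > 150
print(mask.tolist())

# Step 2: the mask picks the rows, and "Date" picks the column
dates_above = sample.loc[mask, "Date"]
print(dates_above.tolist())
```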

### Exercise: End of Year

Display the data for the entire month of December 2023.

In [14]:
df[(df.Date >= "2023-12-01") & (df.Date <= "2023-12-31")]

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,New column
230,2023-12-01,190.330002,191.559998,189.229996,191.240005,190.996292,45679300,1
231,2023-12-04,189.979996,190.050003,187.449997,189.429993,189.188583,43389500,1
232,2023-12-05,190.210007,194.399994,190.179993,193.419998,193.173508,66628400,1
233,2023-12-06,194.449997,194.759995,192.110001,192.320007,192.074921,41089700,1
234,2023-12-07,193.630005,195.0,193.589996,194.270004,194.02243,47477700,1
235,2023-12-08,194.199997,195.990005,193.669998,195.710007,195.460587,53377300,1
236,2023-12-11,193.110001,193.490005,191.419998,193.179993,192.933807,60943700,1
237,2023-12-12,193.080002,194.720001,191.720001,194.710007,194.461868,52696900,1
238,2023-12-13,195.089996,198.0,194.850006,197.960007,197.707718,70404200,1
239,2023-12-14,198.020004,199.619995,196.160004,198.110001,197.857529,66831600,1
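The comparison above works on plain strings because dates written as `YYYY-MM-DD` sort alphabetically in the same order as chronologically. As a hedged alternative, the sketch below shows two other ways to express the same filter on a hypothetical three-row frame: `Series.between()`, which includes both endpoints by default, and a string prefix test.

```python
import pandas as pd

# Three rows copied from the AAPL data; "sample" is an illustrative name
sample = pd.DataFrame({
    "Date": ["2023-10-04", "2023-12-01", "2023-12-29"],
    "Close": [173.660004, 191.240005, 192.529999],
})

# between() keeps values from the lower bound to the upper bound, inclusive
december = sample[sample["Date"].between("2023-12-01", "2023-12-31")]

# A prefix test also works while Date is still a column of strings
also_december = sample[sample["Date"].str.startswith("2023-12")]

print(december)
```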


### Exercise: The First Fifty

Display the data for the first fifty (50) days of trading in the period.

In [None]:
df.iloc[0:50]

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,New column
0,2023-01-03,130.279999,130.899994,124.169998,125.07,124.216301,112117500,1
1,2023-01-04,126.889999,128.660004,125.080002,126.360001,125.497498,89113600,1
2,2023-01-05,127.129997,127.769997,124.760002,125.019997,124.166641,80962700,1
3,2023-01-06,126.010002,130.289993,124.889999,129.619995,128.735229,87754700,1
4,2023-01-09,130.470001,133.410004,129.889999,130.149994,129.261627,70790800,1
5,2023-01-10,130.259995,131.259995,128.119995,130.729996,129.837677,63896200,1
6,2023-01-11,131.25,133.509995,130.460007,133.490005,132.578842,69458900,1
7,2023-01-12,133.880005,134.259995,131.440002,133.410004,132.49939,71379600,1
8,2023-01-13,132.029999,134.919998,131.660004,134.759995,133.840149,57809700,1
9,2023-01-17,134.830002,137.289993,134.130005,135.940002,135.0121,63646600,1


### Exercise: Roughly Monthly

Display the data at each 21-day interval, over the entire period.

In [21]:
df[::21]

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,New column
0,2023-01-03,130.279999,130.899994,124.169998,125.07,124.216301,112117500,1
21,2023-02-02,148.899994,151.179993,148.169998,150.820007,149.790543,118339000,1
42,2023-03-06,153.789993,156.300003,153.460007,153.830002,153.013275,87558000,1
63,2023-04-04,166.600006,166.839996,165.110001,165.630005,164.75061,46278300,1
84,2023-05-04,164.889999,167.039993,164.309998,165.789993,164.90976,81235400,1
105,2023-06-05,182.630005,184.949997,178.039993,179.580002,178.873611,121946500,1
126,2023-07-06,189.839996,192.020004,189.199997,191.809998,191.055511,45094300,1
147,2023-08-04,185.520004,187.380005,181.919998,181.990005,181.274155,115799700,1
168,2023-09-05,188.279999,189.979996,187.610001,189.699997,189.208969,45280000,1
189,2023-10-04,171.089996,174.210007,170.970001,173.660004,173.210495,53020300,1
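The slice `[::21]` means "from the start to the end, in steps of 21". A minimal sketch on a hypothetical 250-row frame (one row per trading day, holding only a made-up `day` counter) shows which positions that selects. Here `iloc` is used to make it explicit that the slice refers to row positions.

```python
import pandas as pd

# Hypothetical stand-in for the 250 trading days: a single counter column
demo = pd.DataFrame({"day": range(250)})

# start:stop:step with start and stop omitted covers the whole frame
every_21st = demo.iloc[::21]

print(every_21st["day"].tolist())
```

With 250 rows this keeps positions 0, 21, 42 and so on up to 231, which is 12 rows in total.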


## Setting the Index

In a DataFrame, each row has an index label. By default, this is just a number (starting at 0). When it makes sense, we can choose one of the columns to be the index instead. For time series data, where each row represents a different point in time, we'll set our `Date` column as the index. This will make it easier for us to work with the data, and can speed up other operations later on.


In [None]:
df.Date = pd.to_datetime(df.Date)

df.set_index("Date", inplace=True)  # don't run this cell twice - once Date is the index, df.Date raises an AttributeError; if that happens, reload the data and rerun
df

Date,Open,High,Low,Close,Adj Close,Volume,New column
2023-01-03,130.279999,130.899994,124.169998,125.070000,124.216301,112117500,1
2023-01-04,126.889999,128.660004,125.080002,126.360001,125.497498,89113600,1
2023-01-05,127.129997,127.769997,124.760002,125.019997,124.166641,80962700,1
2023-01-06,126.010002,130.289993,124.889999,129.619995,128.735229,87754700,1
2023-01-09,130.470001,133.410004,129.889999,130.149994,129.261627,70790800,1
...,...,...,...,...,...,...,...
2023-12-22,195.179993,195.410004,192.970001,193.600006,193.353287,37122800,1
2023-12-26,193.610001,193.889999,192.830002,193.050003,192.803986,28919300,1
2023-12-27,192.490005,193.500000,191.089996,193.149994,192.903839,48087700,1
2023-12-28,194.139999,194.660004,193.169998,193.580002,193.333298,34049900,1


We convert the `Date` column to datetime values because pandas can then treat each entry as a point in time rather than a plain string, which enables date-aware selection and slicing. We set the `Date` column as the index because in time series data like ours, most operations are time-based.
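One way to avoid errors when a cell is run more than once is to build a new DataFrame instead of modifying the existing one in place. The sketch below is an alternative under that assumption, not the approach used in this notebook; `raw` and `indexed` are illustrative names.

```python
import pandas as pd

# Two rows copied from the AAPL data, with Date still stored as strings
raw = pd.DataFrame({
    "Date": ["2023-01-03", "2023-01-04"],
    "Close": [125.070000, 126.360001],
})

# assign() and set_index() both return new frames, so raw is left unchanged
# and this cell can be rerun safely
indexed = raw.assign(Date=pd.to_datetime(raw["Date"])).set_index("Date")

print(type(indexed.index).__name__)  # the index now holds datetime values
print(indexed)
```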

With the index set, we can now use it to access different portions of our data a little bit more easily. Because our indices are labeled, we can use `loc` for **label-based indexing**.

In [27]:
df.loc["2023-08-18"]

df.loc["2023-03":"2023-05"]

df.loc["2023-12", "Close"]

Date
2023-12-01    191.240005
2023-12-04    189.429993
2023-12-05    193.419998
2023-12-06    192.320007
2023-12-07    194.270004
2023-12-08    195.710007
2023-12-11    193.179993
2023-12-12    194.710007
2023-12-13    197.960007
2023-12-14    198.110001
2023-12-15    197.570007
2023-12-18    195.889999
2023-12-19    196.940002
2023-12-20    194.830002
2023-12-21    194.679993
2023-12-22    193.600006
2023-12-26    193.050003
2023-12-27    193.149994
2023-12-28    193.580002
2023-12-29    192.529999
Name: Close, dtype: float64
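Selections such as `"2023-12"` work because a datetime index accepts partial date strings: a year and month matches every row in that month. A minimal sketch on a hypothetical four-row Series (close prices copied from rows 42, 63, 84 and 105 of the data) shows this, and also that label slices with `loc` include both endpoints.

```python
import pandas as pd

# Four close prices copied from the AAPL data, one per month
dates = pd.to_datetime(["2023-03-06", "2023-04-04", "2023-05-04", "2023-06-05"])
close = pd.Series([153.830002, 165.630005, 165.789993, 179.580002],
                  index=dates, name="Close")

# A year-month string selects every row that falls in that month
april = close.loc["2023-04"]

# Label slices include both ends, so this covers March, April and May
spring = close.loc["2023-03":"2023-05"]

print(april)
print(spring)
```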

## Basic Operations
There are also many basic operations we can do with pandas, such as calculating the mean of a column, the maximum of a column, and so on.


In [31]:
df.Close.mean()

df.Close.max()

df.Close.idxmax()  # index of maximum value 

df.Close.min()

df.Close.idxmin()  # index of minimum value 

df[["Low", "High"]].mean()  # average of each column (one value per column)

df[["Low", "High"]].mean(axis=1)  # row-wise average of Low and High (one value per date)

Date
2023-01-03    127.534996
2023-01-04    126.870003
2023-01-05    126.264999
2023-01-06    127.589996
2023-01-09    131.650002
                 ...    
2023-12-22    194.190002
2023-12-26    193.360001
2023-12-27    192.294998
2023-12-28    193.915001
2023-12-29    193.064995
Length: 250, dtype: float64
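The `axis` argument decides the direction of the calculation. A minimal sketch on a hypothetical two-row frame (the first two rows of the data) shows both directions side by side.

```python
import pandas as pd

# Low and High for the first two trading days in the data
sample = pd.DataFrame(
    {"Low": [124.169998, 125.080002], "High": [130.899994, 128.660004]},
    index=pd.to_datetime(["2023-01-03", "2023-01-04"]),
)

per_column = sample.mean()     # default axis=0: one average per column
per_row = sample.mean(axis=1)  # axis=1: one average per row (per date)

print(per_column)
print(per_row)
```

The row-wise result for 2023-01-03 is the midpoint of that day's low and high, about 127.534996, which matches the first line of the output above.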

### Exercise: Apple Quarters

Compare AAPL's *median* **high** in Q1 and Q2 of 2023. In which quarter was it higher? Use the cell below to show your work.

In [40]:
Q1 = df.loc["2023-01":"2023-03", "High"].median()

Q2 = df.loc["2023-04":"2023-06", "High"].median()

print(Q1, Q2)
if Q1 > Q2:
    print("Quarter 1 has a higher median high")
else:
    print("Quarter 2 has a higher median high")

151.23999786376953 173.96499633789062
Quarter 2 has a higher median high
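As a hedged alternative to slicing each quarter by hand, the quarter number of each date can be used as a grouping key. The sketch below runs on a hypothetical six-row sample (highs copied from rows 0, 21, 42, 63, 84 and 105), so its medians differ from the full-data figures above.

```python
import pandas as pd

# Six daily highs copied from the AAPL data: three in Q1 and three in Q2
dates = pd.to_datetime(["2023-01-03", "2023-02-02", "2023-03-06",
                        "2023-04-04", "2023-05-04", "2023-06-05"])
high = pd.Series([130.899994, 151.179993, 156.300003,
                  166.839996, 167.039993, 184.949997],
                 index=dates, name="High")

# index.quarter gives 1 to 4 for each date; groupby computes one median per quarter
median_by_quarter = high.groupby(high.index.quarter).median()

print(median_by_quarter)
print("Quarter with the higher median high:", median_by_quarter.idxmax())
```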


### Exercise: March Madness

Looking only at the month of March, print the following information:

* First opening price of the period
* Last close price of the period
* Total volume traded over the period

In [39]:
print("First opening price of March is ", df.loc["2023-03-01", "Open"])
print("Last close price of March is ", df.loc["2023-03-31", "Close"])
print("Total volume traded over March is ", df.loc["2023-03", "Volume"].sum())

First opening price of March is  146.8300018310547
Last close price of March is  164.89999389648438
Total volume traded over March is  1520266600
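The solution above looks up exact dates, which only works when those dates are trading days. A more general sketch selects the month first and then takes the first and last rows by position. It is shown here on a hypothetical three-row January sample, since 1 January was not a trading day and an exact lookup on that date would raise a `KeyError`.

```python
import pandas as pd

# The first three trading days of January, copied from the AAPL data
sample = pd.DataFrame(
    {
        "Open": [130.279999, 126.889999, 127.129997],
        "Close": [125.070000, 126.360001, 125.019997],
        "Volume": [112117500, 89113600, 80962700],
    },
    index=pd.to_datetime(["2023-01-03", "2023-01-04", "2023-01-05"]),
)

# The first calendar day of the month is not in the index
new_year_in_index = pd.Timestamp("2023-01-01") in sample.index

jan = sample.loc["2023-01"]          # every row in January
first_open = jan["Open"].iloc[0]     # first row by position
last_close = jan["Close"].iloc[-1]   # last row by position
total_volume = jan["Volume"].sum()

print(first_open, last_close, total_volume)
```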
