# Introduction to Time Series Data

We will explore two main kinds of tabular data in this workshop. The first is time series data. Time series data is *indexed* by a date and time. We'll look a bit more closely at that soon, but for now just think of it as each row being identified by a date or time, rather than a row number.

## Pandas Refresh

One of the most popular packages in Python for working with tabular data is called Pandas, which was introduced in a supplementary notebook in DSA.

The first thing we'll do is `import` the `pandas` package. Convention has us use a shortform name - `pd` - because we'll be using the package so often.

In [27]:
import pandas as pd

And below we'll use pandas' `read_csv()` to load the data into a `DataFrame`. DataFrames are the main data structure in pandas for tabular data, and lots of other programming languages use the concept of a DataFrame too! By convention, you'll often see `df` used as a variable name.

In [None]:
# Load the data
df = pd.read_csv("https://raw.githubusercontent.com/ImperialCollegeLondon/efds-ta-python/refs/heads/main/data/AAPL_2023.csv")
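
As a minimal sketch of what `read_csv()` does, the example below parses a few lines of CSV text held in memory instead of downloading a file. The values and the name `sample_df` are made up for illustration and are not the AAPL data.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file (NOT the AAPL data)
csv_text = """Date,Open,Close,Volume
2023-01-03,10.0,10.5,1000
2023-01-04,10.5,10.2,1200
2023-01-05,10.2,10.8,900
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

# The first line of the text becomes the column names
print(sample_df)
print(type(sample_df))
```

Each later line of the text becomes one row, and pandas works out a type for each column from the values it finds.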

Before we do anything else, it's a good idea to take a look at the DataFrame. The `head()` and `tail()` methods show the first and last rows of the data (five by default, or as many as we ask for).

In [29]:
# Print the first five rows
print(df.head())

# Print the last 15 rows
print(df.tail(15))


         Date        Open        High         Low       Close   Adj Close  \
0  2023-01-03  130.279999  130.899994  124.169998  125.070000  124.216301   
1  2023-01-04  126.889999  128.660004  125.080002  126.360001  125.497498   
2  2023-01-05  127.129997  127.769997  124.760002  125.019997  124.166641   
3  2023-01-06  126.010002  130.289993  124.889999  129.619995  128.735229   
4  2023-01-09  130.470001  133.410004  129.889999  130.149994  129.261627   

      Volume  
0  112117500  
1   89113600  
2   80962700  
3   87754700  
4   70790800  
           Date        Open        High         Low       Close   Adj Close  \
235  2023-12-08  194.199997  195.990005  193.669998  195.710007  195.460587   
236  2023-12-11  193.110001  193.490005  191.419998  193.179993  192.933807   
237  2023-12-12  193.080002  194.720001  191.720001  194.710007  194.461868   
238  2023-12-13  195.089996  198.000000  194.850006  197.960007  197.707718   
239  2023-12-14  198.020004  199.619995  196.160004 

Other methods and attributes give us an overview of the data as a whole. The `shape` attribute (no parentheses) tells us the number of rows and columns in our DataFrame, while the `info()` method summarises the data type (`dtype`) and non-null count of each column.

You'll notice the types are slightly different from the usual Python types - this is because they belong to the `numpy` package, which sits under the hood of `pandas`. We'll look more at `numpy` tomorrow, but for now here is a word about each of the types in our data frame.

- `float64` - 64-bit floating point (number with a decimal point)
- `int64` - 64-bit integer (whole number)
- `object` - other Python data types (strings in this case)

In [30]:
# Print rows and columns
print("🚣 Rows and columns: ", df.shape, "\n")

# Print summary info
print("ℹ Info:")
print(df.info())


🚣 Rows and columns:  (250, 7) 

ℹ Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       250 non-null    object 
 1   Open       250 non-null    float64
 2   High       250 non-null    float64
 3   Low        250 non-null    float64
 4   Close      250 non-null    float64
 5   Adj Close  250 non-null    float64
 6   Volume     250 non-null    int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 13.8+ KB
None
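
The sketch below repeats these overview tools on a tiny made-up DataFrame (the values and the name `sample_df` are illustrative only). It also shows why the output above ends with `None`: `info()` prints its summary itself and returns `None`, so wrapping it in `print()` prints that return value too.

```python
import pandas as pd

# Made-up values for illustration only
sample_df = pd.DataFrame({
    "Date": ["2023-01-03", "2023-01-04", "2023-01-05"],
    "Close": [10.5, 10.2, 10.8],
    "Volume": [1000, 1200, 900],
})

# shape is an attribute, not a method: a (rows, columns) tuple
n_rows, n_cols = sample_df.shape
print(n_rows, n_cols)

# dtypes lists the data type of each column
print(sample_df.dtypes)

# info() prints the summary as a side effect and returns None
result = sample_df.info()
print(result is None)
```

Calling `df.info()` on its own, without `print()`, gives the same summary without the trailing `None`.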


To look at actual values within the DataFrame, we can use square bracket notation to select columns and `iloc` to select rows and cells by position. The `i` in `iloc` refers to **integer-based indexing**: we pick out a row or column by its *number*, counting from 0.

In [31]:
# Access a column
print(df["Close"])

# Access multiple columns
print(df[["Open", "Close"]])

# Access a row by filtering on a column value
print(df[df["Date"] == "2023-08-18"])

# Access the first row
print(df.iloc[0])

# Access the tenth row and the third column - a specific cell
print(df.iloc[9, 2])

# However, it is preferable to use .iat for single values
print(df.iat[9, 2])


0      125.070000
1      126.360001
2      125.019997
3      129.619995
4      130.149994
          ...    
245    193.600006
246    193.050003
247    193.149994
248    193.580002
249    192.529999
Name: Close, Length: 250, dtype: float64
           Open       Close
0    130.279999  125.070000
1    126.889999  126.360001
2    127.129997  125.019997
3    126.010002  129.619995
4    130.470001  130.149994
..          ...         ...
245  195.179993  193.600006
246  193.610001  193.050003
247  192.490005  193.149994
248  194.139999  193.580002
249  193.899994  192.529999

[250 rows x 2 columns]
           Date        Open        High         Low       Close   Adj Close  \
157  2023-08-18  172.300003  175.100006  171.960007  174.490005  174.038345   

       Volume  
157  61114200  
Date         2023-01-03
Open         130.279999
High         130.899994
Low          124.169998
Close            125.07
Adj Close    124.216301
Volume        112117500
Name: 0, dtype: object
137.2899932861328
137.2899932861328
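
To unpack the row filter used above, here is a small sketch on made-up data (`sample_df` and its values are hypothetical). The comparison inside the brackets builds a Series of `True`/`False` values, and indexing with that Series keeps only the `True` rows. The last lines show that `iloc` and `iat` read the same cell when given the same zero-based positions.

```python
import pandas as pd

# Made-up values for illustration only
sample_df = pd.DataFrame({
    "Date": ["2023-01-03", "2023-01-04", "2023-01-05"],
    "Open": [10.0, 10.5, 10.2],
    "Close": [10.5, 10.2, 10.8],
})

# The comparison gives one True/False value per row
mask = sample_df["Date"] == "2023-01-04"
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True
matching_rows = sample_df[mask]
print(matching_rows)

# Row position 1, column position 2 (both counted from 0)
via_iloc = sample_df.iloc[1, 2]
via_iat = sample_df.iat[1, 2]
print(via_iloc, via_iat)
```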

### Exercise: End of Year

Display the data for the entire month of December 2023.

In [32]:
df[(df["Date"] >= "2023-12-01") & (df["Date"] <= "2023-12-31")]

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
230,2023-12-01,190.330002,191.559998,189.229996,191.240005,190.996292,45679300
231,2023-12-04,189.979996,190.050003,187.449997,189.429993,189.188583,43389500
232,2023-12-05,190.210007,194.399994,190.179993,193.419998,193.173508,66628400
233,2023-12-06,194.449997,194.759995,192.110001,192.320007,192.074921,41089700
234,2023-12-07,193.630005,195.0,193.589996,194.270004,194.02243,47477700
235,2023-12-08,194.199997,195.990005,193.669998,195.710007,195.460587,53377300
236,2023-12-11,193.110001,193.490005,191.419998,193.179993,192.933807,60943700
237,2023-12-12,193.080002,194.720001,191.720001,194.710007,194.461868,52696900
238,2023-12-13,195.089996,198.0,194.850006,197.960007,197.707718,70404200
239,2023-12-14,198.020004,199.619995,196.160004,198.110001,197.857529,66831600
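
At this point the `Date` column still holds plain strings, so the filter above compares text rather than dates. That works here because the dates are written as `YYYY-MM-DD`, where alphabetical order matches chronological order. The sketch below illustrates the idea in plain Python with made-up dates, and shows that a day-first format would not behave the same way.

```python
# Made-up dates for illustration only

# Year-month-day strings sort in chronological order
iso_dates = ["2023-12-01", "2023-02-15", "2023-11-30"]
sorted_iso = sorted(iso_dates)
print(sorted_iso)

# So comparing them as text gives the chronological answer
print("2023-12-01" >= "2023-11-30")

# Day-first strings sort by day number, not by actual date
day_first = ["01/12/2023", "15/02/2023", "30/11/2023"]
sorted_day_first = sorted(day_first)
print(sorted_day_first)
```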


### Exercise: The First Fifty

Display the data for the first fifty (50) days of trading in the period.

In [33]:
df[:50] # Just like list slicing in Python

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2023-01-03,130.279999,130.899994,124.169998,125.07,124.216301,112117500
1,2023-01-04,126.889999,128.660004,125.080002,126.360001,125.497498,89113600
2,2023-01-05,127.129997,127.769997,124.760002,125.019997,124.166641,80962700
3,2023-01-06,126.010002,130.289993,124.889999,129.619995,128.735229,87754700
4,2023-01-09,130.470001,133.410004,129.889999,130.149994,129.261627,70790800
5,2023-01-10,130.259995,131.259995,128.119995,130.729996,129.837677,63896200
6,2023-01-11,131.25,133.509995,130.460007,133.490005,132.578842,69458900
7,2023-01-12,133.880005,134.259995,131.440002,133.410004,132.49939,71379600
8,2023-01-13,132.029999,134.919998,131.660004,134.759995,133.840149,57809700
9,2023-01-17,134.830002,137.289993,134.130005,135.940002,135.0121,63646600


### Exercise: Roughly Monthly

Display the data at each 21-day interval, over the entire period.

In [34]:
df[::21] # Again, like slicing a list. No start or end means we cover the whole period.

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2023-01-03,130.279999,130.899994,124.169998,125.07,124.216301,112117500
21,2023-02-02,148.899994,151.179993,148.169998,150.820007,149.790543,118339000
42,2023-03-06,153.789993,156.300003,153.460007,153.830002,153.013275,87558000
63,2023-04-04,166.600006,166.839996,165.110001,165.630005,164.75061,46278300
84,2023-05-04,164.889999,167.039993,164.309998,165.789993,164.90976,81235400
105,2023-06-05,182.630005,184.949997,178.039993,179.580002,178.873611,121946500
126,2023-07-06,189.839996,192.020004,189.199997,191.809998,191.055511,45094300
147,2023-08-04,185.520004,187.380005,181.919998,181.990005,181.274155,115799700
168,2023-09-05,188.279999,189.979996,187.610001,189.699997,189.208969,45280000
189,2023-10-04,171.089996,174.210007,170.970001,173.660004,173.210495,53020300


## Setting the Index

In a DataFrame, each row is assigned a unique index value. By default, this is just a number (starting at 0). When it makes sense, we can choose one of the other columns to be an index. For time series data, where each row represents a different point in time, we'll set our `Date` column as the index. This will make it easier for us to work with the data, and can speed up other operations later on.


In [35]:
# Convert the 'Date' column to a datetime object
df['Date'] = pd.to_datetime(df['Date'])

# Set the 'Date' column as the index
df.set_index('Date', inplace=True)

We convert the `Date` column to datetimes so that pandas treats the values as points in time rather than plain text, which allows date-based selection and slicing. We set the `Date` column as the index because in time series data like ours, most operations are time-based.

With the index set, we can now use it to access different portions of our data a little bit more easily. Because our indices are labeled, we can use `loc` for **label-based indexing**.

In [36]:
# Access a row
print(df.loc['2023-08-18'])

# Access a specific cell - we can use .loc
print(df.loc['2023-08-18', 'Close'])

# But it is preferable to use .at for a single cell
print(df.at['2023-08-18', 'Close'])

# Access a range
print(df.loc["2023-08-18":"2023-08-22"])


Open         1.723000e+02
High         1.751000e+02
Low          1.719600e+02
Close        1.744900e+02
Adj Close    1.740383e+02
Volume       6.111420e+07
Name: 2023-08-18 00:00:00, dtype: float64
174.49000549316406
174.49000549316406
                  Open        High         Low       Close   Adj Close  \
Date                                                                     
2023-08-18  172.300003  175.100006  171.960007  174.490005  174.038345   
2023-08-21  175.070007  176.130005  173.740005  175.839996  175.384842   
2023-08-22  177.059998  177.679993  176.250000  177.229996  176.771240   

              Volume  
Date                  
2023-08-18  61114200  
2023-08-21  46311900  
2023-08-22  42084200  
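
One detail worth noticing in the range above is that both end dates appear in the result. The sketch below, on made-up data with the hypothetical name `sample_df`, contrasts a label-based slice with `loc` against a position-based slice with `iloc`.

```python
import pandas as pd

# Made-up values for illustration only
sample_df = pd.DataFrame({
    "Date": ["2023-08-17", "2023-08-18", "2023-08-21", "2023-08-22", "2023-08-23"],
    "Close": [10.0, 10.5, 10.2, 10.8, 11.0],
})

# Same two steps as in the cell above: convert, then set as index
sample_df["Date"] = pd.to_datetime(sample_df["Date"])
sample_df = sample_df.set_index("Date")

# Label-based slice: both endpoints are included (18th, 21st, 22nd)
by_label = sample_df.loc["2023-08-18":"2023-08-22"]
print(len(by_label))

# Position-based slice: the end position is excluded (positions 1 and 2)
by_position = sample_df.iloc[1:3]
print(len(by_position))
```

This sketch uses `sample_df = sample_df.set_index("Date")` rather than `inplace=True`. Both approaches leave `sample_df` with `Date` as its index.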


## Basic Operations
pandas also provides many basic operations, such as calculating the mean or the maximum of a column.


In [37]:
# Calculate the mean of 'Close' prices
print("Mean close", df['Close'].mean())

# Find the maximum volume traded
print("Max volume", df['Volume'].max())

# Find the day that had the max volume traded
print("Max volume day", df['Volume'].idxmax())

# Be careful when using these operations on multiple columns
# We can calculate the mean of the high and low column like so
print("High/low COLUMN average")
print(df[["Low", "High"]].mean())

# Or we can calculate the mean of the low and high for each row
print("High/low ROW average")
print(df[["Low", "High"]].mean(axis=1))

Mean close 172.54900030517578
Max volume 154357300
Max volume day 2023-02-03 00:00:00
High/low COLUMN average
Low     170.98188
High    173.85752
dtype: float64
High/low ROW average
Date
2023-01-03    127.534996
2023-01-04    126.870003
2023-01-05    126.264999
2023-01-06    127.589996
2023-01-09    131.650002
                 ...    
2023-12-22    194.190002
2023-12-26    193.360001
2023-12-27    192.294998
2023-12-28    193.915001
2023-12-29    193.064995
Length: 250, dtype: float64
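
The difference between the two averages comes down to the `axis` argument. Here is a small sketch with made-up numbers (`sample_df` is a hypothetical name) chosen so the results can be checked by hand.

```python
import pandas as pd

# Made-up values for illustration only
sample_df = pd.DataFrame({
    "Low": [10.0, 20.0, 30.0],
    "High": [12.0, 24.0, 36.0],
})

# Default (axis=0): one mean per column, averaging down the rows
column_means = sample_df.mean()
print(column_means.tolist())  # [20.0, 24.0]

# axis=1: one mean per row, averaging across the columns
row_means = sample_df.mean(axis=1)
print(row_means.tolist())  # [11.0, 22.0, 33.0]

# idxmax returns the index label of the largest value, not the value itself
max_high_label = sample_df["High"].idxmax()
print(max_high_label)  # 2
```

With a date index, as in the cell above, `idxmax()` returns a date because the labels are dates.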


### Exercise: Apple Quarters

Compare AAPL's *median* **high** in Q1 and Q2 of 2023. In which quarter was it higher? Use the cell below to show your work.

In [38]:
# It may be helpful to break the problem down into two steps
# First, we get the subset
q1 = df.loc["2023-01":"2023-03"]
q2 = df.loc["2023-04":"2023-06"]

# Then we find and print the medians for a visual comparison
print("Q1", q1["High"].median()) 
print("Q2", q2["High"].median()) # HIGHER

Q1 151.23999786376953
Q2 173.96499633789062
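
The subsets above use month strings such as `"2023-01"` rather than full dates. With a datetime index, `loc` treats such a string as covering the whole month. The sketch below illustrates this on made-up data (the values and the name `sample_df` are hypothetical).

```python
import pandas as pd

# Made-up values for illustration only
dates = pd.to_datetime([
    "2023-02-27", "2023-02-28", "2023-03-01",
    "2023-03-02", "2023-03-31", "2023-04-03",
])
sample_df = pd.DataFrame(
    {
        "Open": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "Close": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5],
        "Volume": [100, 200, 300, 400, 500, 600],
    },
    index=dates,
)

# A single month string selects every row in that month
march = sample_df.loc["2023-03"]
print(len(march))  # 3

# A slice of month strings runs from the start of the first month
# to the end of the last month
feb_to_mar = sample_df.loc["2023-02":"2023-03"]
print(len(feb_to_mar))  # 5

# First and last values by position within the subset, and a total
first_open = march["Open"].iloc[0]
last_close = march["Close"].iloc[-1]
total_volume = march["Volume"].sum()
print(first_open, last_close, total_volume)
```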


### Exercise: March Madness

Looking only at the month of March, print the following information:

* First opening price of the period
* Last close price of the period
* Total volume traded over the period

In [39]:
# First we get the subset
march_data = df.loc["2023-03"]

# Then print the relevant information
print("Initial opening", march_data["Open"].iloc[0])
print("Closing", march_data["Close"].iloc[-1])
print("Total Volume", march_data["Volume"].sum())

Initial opening 146.8300018310547
Closing 164.89999389648438
Total Volume 1520266600
