# Introduction to Time Series Data

Data can come in many different formats, and many different shapes and sizes. You may have heard of tabular data, a format you might recognise from working in something like Excel.

We will explore two main kinds of tabular data in this module. The first is time series data. Time series data will be *indexed* with a date and time. We'll look a bit more closely at that soon, but for now just think of it as each row having a date or time, rather than a row number.

## Loading Data

One of the most popular packages in Python for working with tabular data is called Pandas. Today we'll get acquainted with Pandas.

The first thing we'll do is `import` the `pandas` package. By convention it is given the short alias `pd`, because we'll be using the package so often.

In [1]:
import pandas as pd

And below we'll use pandas' `read_csv()` to load the data into a `DataFrame`. DataFrames are the main data structure in pandas for tabular data, and lots of other programming languages use the concept of a DataFrame too! By convention, you'll often see `df` used as a variable name.

In [2]:
# Load the data
url = "https://raw.githubusercontent.com/ImperialCollegeLondon/efds-ta-python/main/data/AAPL_2020.csv"
df = pd.read_csv(url)
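
If the URL cannot be reached, the same call can be tried on a few lines of text held in memory. The sketch below is only an illustration: the rows are made up, `csv_text` and `toy` are hypothetical names, and `io.StringIO` from the standard library stands in for a file.

```python
import io

import pandas as pd

# Two made-up rows in CSV form; the first line holds the column names
csv_text = """Date,Open,Close
2020-01-02,10.0,10.5
2020-01-03,10.5,10.2
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
print(toy.columns.tolist())
```

The header line of the CSV becomes the column names, and each remaining line becomes one row.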

Before we do anything else, it's a good idea to take a look at the DataFrame. Some methods will let us take a closer look at parts of our data. 

In [3]:
# Print the first five rows
print(df.head())

# Print the last 15 rows
print(df.tail(15))


         Date       Open       High        Low      Close  Adj Close  \
0  2020-01-02  74.059998  75.150002  73.797501  75.087502  73.347923   
1  2020-01-03  74.287498  75.144997  74.125000  74.357498  72.634850   
2  2020-01-06  73.447502  74.989998  73.187500  74.949997  73.213615   
3  2020-01-07  74.959999  75.224998  74.370003  74.597504  72.869278   
4  2020-01-08  74.290001  76.110001  74.290001  75.797501  74.041489   

      Volume  
0  135480400  
1  146322800  
2  118387200  
3  108872000  
4  132079200  
           Date        Open        High         Low       Close   Adj Close  \
237  2020-12-09  124.529999  125.949997  121.000000  121.779999  119.986038   
238  2020-12-10  120.500000  123.870003  120.150002  123.239998  121.424522   
239  2020-12-11  122.430000  122.760002  120.550003  122.410004  120.606758   
240  2020-12-14  122.599998  123.349998  121.540001  121.779999  119.986038   
241  2020-12-15  124.339996  127.900002  124.129997  127.879997  125.996178   
242

Other methods and attributes give us an overview of the data as a whole. `shape` (an attribute, so no brackets) tells us the number of rows and columns in our data frame, while `info()` summarises the data type (`dtype`) of each column.

You'll notice the types are slightly different from the usual Python types - this is because they belong to the `numpy` package, which sits under the hood of `pandas`. We'll look more at `numpy` tomorrow, but for now here is a word about each of the types in our data frame.

- `float64` - 64-bit floating point (number with a decimal point)
- `int64` - 64-bit integer (whole number)
- `object` - other Python data types (strings in this case)
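
As a small sketch of how these types arise, the example below builds a toy DataFrame with one column of each kind and prints the `dtypes` attribute. The column names and values are made up for illustration. The exact label shown for the text column can depend on the pandas version, so the comments do not promise a specific one.

```python
import pandas as pd

# Made-up values: one decimal column, one whole-number column, one text column
toy = pd.DataFrame({
    "price": [1.5, 2.5],
    "volume": [10, 20],
    "day": ["2020-01-02", "2020-01-03"],
})

# dtypes holds one data type per column
print(toy.dtypes)
```

Note that the dates in `day` are stored as text here, just like the `Date` column in our data.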

In [4]:
# Print rows and columns
print("Rows and columns: ", df.shape)

# Print summary info
# info() prints its report directly and returns None, so it needs no print()
print("Info")
df.info()


Rows and columns:  (252, 7)
Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       252 non-null    object 
 1   Open       252 non-null    float64
 2   High       252 non-null    float64
 3   Low        252 non-null    float64
 4   Close      252 non-null    float64
 5   Adj Close  252 non-null    float64
 6   Volume     252 non-null    int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 13.9+ KB


For a look at some actual data within the data frame, we can use square bracket notation and `iloc` to access columns and rows.

In [5]:
# Access a column
print(df["Close"])

# Access multiple columns
print(df[["Open", "Close"]])

# Access the row(s) where the Date column matches a value
print(df[df["Date"] == "2020-08-18"])

# Access the first row
print(df.iloc[0])

# Access the tenth row and the third column
print(df.iloc[9, 2])


0       75.087502
1       74.357498
2       74.949997
3       74.597504
4       75.797501
          ...    
247    130.960007
248    131.970001
249    136.690002
250    134.869995
251    133.720001
Name: Close, Length: 252, dtype: float64
           Open       Close
0     74.059998   75.087502
1     74.287498   74.357498
2     73.447502   74.949997
3     74.959999   74.597504
4     74.290001   75.797501
..          ...         ...
247  132.160004  130.960007
248  131.320007  131.970001
249  133.990005  136.690002
250  138.050003  134.869995
251  135.580002  133.720001

[252 rows x 2 columns]
           Date        Open   High       Low     Close   Adj Close     Volume
158  2020-08-18  114.352501  116.0  114.0075  115.5625  113.664024  105633600
Date         2020-01-02
Open          74.059998
High          75.150002
Low           73.797501
Close         75.087502
Adj Close     73.347923
Volume        135480400
Name: 0, dtype: object
78.875
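
`iloc` also accepts slices and negative positions, much like a Python list. The sketch below shows a few common patterns on a toy DataFrame. The values and the names `toy`, `first_two`, `last_row` and `second_col` are made up for illustration.

```python
import pandas as pd

# Made-up values, purely for illustration
toy = pd.DataFrame({
    "Open": [10.0, 11.0, 12.0, 13.0],
    "Close": [10.5, 11.5, 12.5, 13.5],
})

# First two rows (the end position is excluded, as with Python lists)
first_two = toy.iloc[0:2]

# Last row, using a negative position
last_row = toy.iloc[-1]

# Every row of the second column
second_col = toy.iloc[:, 1]

print(first_two)
print(last_row)
print(second_col)
```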


## Setting the Index

In a DataFrame, each row has an index label. By default, this is just a number (starting at 0). When it makes sense, we can choose one of the other columns to be the index. For time series data, where each row represents a different point in time, we'll set our `Date` column as the index. This will make it easier for us to work with the data, and can speed up other operations later on.


In [6]:
# Convert the 'Date' column to a datetime object
df['Date'] = pd.to_datetime(df['Date'])

# Set the 'Date' column as the index
df.set_index('Date', inplace=True)

We convert the `Date` column to a datetime type because pandas can recognise dates in that form and work with them efficiently. We set the `Date` column as the index because in time series data like ours, most operations are time-based.
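
To see what the conversion buys us, here is a minimal sketch on a toy DataFrame. The values are made up, and `toy` is a hypothetical name. Once the index holds real dates, pandas can answer date questions, such as which weekday each row falls on.

```python
import pandas as pd

# Made-up values; the dates start out as plain text
toy = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03"],
    "Close": [75.0, 74.0],
})
print(toy["Date"].dtype)

# Convert the text to datetimes, then use the column as the index
toy["Date"] = pd.to_datetime(toy["Date"])
toy = toy.set_index("Date")
print(type(toy.index))

# Date-aware operations now work on the index
weekdays = toy.index.day_name()
print(list(weekdays))
```

This sketch assigns the result of `set_index()` back to `toy` instead of using `inplace=True`. Both approaches leave `toy` indexed by date.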

With the index set, we can now use it to access different portions of our data a little bit more easily.

In [7]:
# Access a row
print(df.loc['2020-08-18'])

# Access a specific cell
print(df.loc['2020-08-18', 'Close'])

# Access a range
print(df.loc["2020-08-18":"2020-08-20"])


Open         1.143525e+02
High         1.160000e+02
Low          1.140075e+02
Close        1.155625e+02
Adj Close    1.136640e+02
Volume       1.056336e+08
Name: 2020-08-18 00:00:00, dtype: float64
115.5625
                  Open        High         Low       Close   Adj Close  \
Date                                                                     
2020-08-18  114.352501  116.000000  114.007500  115.562500  113.664024   
2020-08-19  115.982498  117.162498  115.610001  115.707497  113.806641   
2020-08-20  115.750000  118.392502  115.732498  118.275002  116.331970   

               Volume  
Date                   
2020-08-18  105633600  
2020-08-19  145538000  
2020-08-20  126907200  
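
A datetime index also understands partial dates. The sketch below, on made-up daily values, selects a whole month with a single year-month string and shows that a label range includes its end point. The names `toy`, `february` and `jan30_to_feb1` are hypothetical.

```python
import pandas as pd

# Made-up daily values running from 30 January to 3 February 2020
dates = pd.date_range("2020-01-30", periods=5, freq="D")
toy = pd.DataFrame({"Close": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=dates)

# A year-month string selects every row in that month
february = toy.loc["2020-02"]

# Unlike position-based slicing, a label range includes its end point
jan30_to_feb1 = toy.loc["2020-01-30":"2020-02-01"]

print(february)
print(jan30_to_feb1)
```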


## Basic Operations
There are also many basic operations we can do with pandas, such as calculating the mean of a column, the maximum of a column, and so on.


In [8]:
# Calculate the mean of 'Close' prices
print(df['Close'].mean())

# Find the maximum volume traded
print(df['Volume'].max())

# Find the day that had the max volume traded
print(df['Volume'].idxmax())

95.19888865758502
426510000
2020-02-28 00:00:00
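
The same pattern extends to other summaries. The sketch below uses made-up values to show `min()`, `idxmin()` and `sum()`, followed by `describe()`, which reports several statistics at once. The names `toy`, `lowest_close`, `quietest_day` and `total_volume` are hypothetical.

```python
import pandas as pd

# Made-up values, purely for illustration
dates = pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"])
toy = pd.DataFrame(
    {"Close": [10.0, 12.0, 11.0], "Volume": [300, 100, 200]},
    index=dates,
)

# Smallest value in a column
lowest_close = toy["Close"].min()

# Index label (here, the date) of the smallest value
quietest_day = toy["Volume"].idxmin()

# Sum of a column
total_volume = toy["Volume"].sum()

print(lowest_close)
print(quietest_day)
print(total_volume)

# Count, mean, spread and quartiles for every numeric column
print(toy.describe())
```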


### Exercise 1

Compare AAPL's *median* **high** in Q1 and Q2 of 2020. In which quarter was it higher? Use the cell below to show your work.

In [9]:
## YOUR CODE GOES HERE

### Exercise 2

Looking only at the first 100 days of trading in the AAPL dataset, print the following information:

* First opening price of the period
* Last close price of the period
* Total volume traded over the period

In [10]:
## YOUR CODE GOES HERE

### (Optional) Exercise 3

Run the cell below to create a new DataFrame with hourly trading info from a single day.

Extend the code to compare the mean close price and trading volume in
- the morning (up to and including 11:00)
- the afternoon (from 12:00 onwards)

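**HINT** Instead of square brackets, use `between_time()` to slice. It only works on a DataFrame with a datetime index, so convert the `Time` column and set it as the index first.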

In [11]:
data = {
    'Time': ['2023-06-01 00:00:00', '2023-06-01 01:00:00', '2023-06-01 02:00:00', 
             '2023-06-01 03:00:00', '2023-06-01 04:00:00', '2023-06-01 05:00:00', 
             '2023-06-01 06:00:00', '2023-06-01 07:00:00', '2023-06-01 08:00:00', 
             '2023-06-01 09:00:00', '2023-06-01 10:00:00', '2023-06-01 11:00:00',
             '2023-06-01 12:00:00', '2023-06-01 13:00:00', '2023-06-01 14:00:00',
             '2023-06-01 15:00:00', '2023-06-01 16:00:00', '2023-06-01 17:00:00',
             '2023-06-01 18:00:00', '2023-06-01 19:00:00', '2023-06-01 20:00:00',
             '2023-06-01 21:00:00', '2023-06-01 22:00:00', '2023-06-01 23:00:00'],
    'Close': [120, 121, 119, 119, 118, 119, 120, 121, 122, 123, 124, 125,
              125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136],
    'Volume': [1000, 1050, 1075, 1100, 1125, 1150, 1200, 1250, 1300, 1350, 
               1400, 1450, 1500, 1550, 1600, 1650, 1700, 1750, 1800, 1850, 
               1900, 1950, 2000, 2050]
}

trading = pd.DataFrame(data) # Create a DataFrame "literal"

## YOUR CODE GOES HERE