# <span style="color:red"> Lecture 18 - Time Data </span>

<font size = "4">

In this class we will ...

- Process time series data in Python and Pandas
- Introduce a new datatype for time
- Plot multiple series
- Compute growth rates

First, let's import the libraries we'll need and load the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


df_financial = pd.read_csv("data_raw/financial.csv")

<font size = "4">

The data can be downloaded from [FRED (Federal Reserve Bank of St. Louis)](https://fred.stlouisfed.org/categories/32255)

Let's inspect the data, and its dtypes:

In [None]:
display(df_financial.head())
print()
display(df_financial.dtypes)

<font size = "4">

The columns represent:

- First column: date (month, day, year)
- Second column: the S&P 500 stock market index
- Third column: the Dow Jones Industrial Average stock market index
- Columns 4, 5, 6: other representations of the date. We will discuss them in the next lecture

<font size = "4">

Notice that the data is ordered from oldest to newest:

In [None]:
print(df_financial["date_str"].iloc[0])
print(df_financial["date_str"].iloc[1])
print("...")
print(df_financial["date_str"].iloc[-2])
print(df_financial["date_str"].iloc[-1])

<font size = "4">

**Question:** What if I want the data arranged in the other order (newest first)? How can I use ``.sort_values()`` to re-order the data?

In [None]:
# Use .sort_values() to sort data newest to oldest



<font size = "4">

If you try to sort dates that are represented by *strings*, they will be sorted "lexicographically" (alphabetically).

The strings representing the dates must be converted to a new datatype, the ``datetime`` format.

This can be done with the Pandas function ``to_datetime``
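
To see why the conversion matters, here is a minimal sketch on a toy DataFrame. The three dates, the zero-padded month/day/year layout, and the name `df_toy` are made up for illustration and are not taken from `financial.csv`. Sorting the strings compares them character by character, so the month decides the order; sorting the converted column follows the calendar.

```python
import pandas as pd

# Toy data: dates stored as month/day/year strings (assumed layout)
df_toy = pd.DataFrame({"date_str": ["12/31/2019", "01/02/2020", "02/15/2018"]})

# Sorting strings is lexicographic: the leading month characters decide,
# and the year is only looked at last
sorted_as_text = df_toy.sort_values(by="date_str", ascending=False)

# Convert to datetime; an explicit format avoids month/day ambiguity
df_toy["date"] = pd.to_datetime(df_toy["date_str"], format="%m/%d/%Y")

# Sorting datetimes is chronological; ascending=False puts newest first
sorted_as_dates = df_toy.sort_values(by="date", ascending=False)

print(sorted_as_text["date_str"].tolist())
print(sorted_as_dates["date_str"].tolist())
```

The first list starts with the December 2019 date, even though January 2020 is more recent; the second list is in true newest-to-oldest order.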

In [None]:
# Use pd.to_datetime to convert the column to datetime format
# We'll add a new column to the DataFrame



<font size = "4">

Let's check the DataTypes again

In [None]:
display(df_financial.dtypes)

<font size = "4">

Compare with the types of the column elements:

In [None]:
print(type(df_financial['date_str'].iloc[0]))
print(type(df_financial['sp500'].iloc[0]))
print(type(df_financial['date'].iloc[0])) 

<font size = "4">

- So the "date" column has the datatype ``datetime64[ns]``, but its elements are Pandas ``Timestamp`` objects.

- Why the mismatch?

- There is a difference between how Pandas stores data "internally" and when you extract parts of the data.

- We've actually seen this before. Note that the "date_str" column has the datatype ``object``, but its elements are ``str`` (strings).
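
The column-versus-element distinction can be checked on a small made-up Series. This is only a sketch: the values are invented, and the exact dtype names printed may differ between Pandas versions.

```python
import pandas as pd

# Toy column of date strings (made-up values)
s_str = pd.Series(["2020-01-02", "2020-01-03"])
s_date = pd.to_datetime(s_str)

# Column-level dtype: how Pandas stores the whole column
print(s_str.dtype)    # a generic/string dtype, depending on the Pandas version
print(s_date.dtype)   # a datetime64 dtype

# Element-level type: what you get when you extract a single value
print(type(s_str.iloc[0]))    # a Python str
print(type(s_date.iloc[0]))   # a Pandas Timestamp
```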

Regardless, now that we have converted the dates to an appropriate datatype, we can sort them sensibly!

In [None]:
# sort df_financial by the "date" column



<font size = "4">

How can I plot time vs. the S&P 500 index?

In [None]:
# Goal: plot date (x-axis) vs. S&P 500 (y-axis)

x_vals = ...
y_vals = ...
plt.plot(x_vals, y_vals)
# label x-axis

# label y-axis

# title



<font size = "4">

So there's another advantage of converting to ``datetime`` objects...plots look much better!

**Note**: ``matplotlib.pyplot.plot`` also works directly with Pandas DataFrames. Pass the column names as strings and the DataFrame through the ``data`` argument:
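
As a rough sketch of the ``data`` argument, the example below uses a toy DataFrame with invented index values. The name `df_toy` is hypothetical, and the `matplotlib.use("Agg")` line is an assumption that lets the code run without a display; it is not needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data with made-up index values
df_toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
    "sp500": [3200.0, 3250.0, 3230.0],
})

# Column names are given as strings; matplotlib looks them up in `data`
lines = plt.plot("date", "sp500", data=df_toy)
plt.xlabel("Time")
plt.ylabel("S&P 500 Index")
plt.title("Toy S&P 500 series")
# In a notebook, finish the cell with plt.show()
```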

In [None]:
# Another way of using plt.plot with the "data" argument



<font size = "4">

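**Important:** Whatever variable you put on the x-axis, the data had better be sorted by that variable! ``plt.plot`` connects the points in row order, so unsorted rows produce a line that zigzags back and forth.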
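
One way to guard against this, sketched here on made-up rows, is to check whether the x-variable is already increasing and sort it if not. The DataFrame and its values are hypothetical.

```python
import pandas as pd

# Toy data whose rows are NOT in date order (made-up values)
df_toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-06", "2020-01-02", "2020-01-03"]),
    "sp500": [3230.0, 3200.0, 3250.0],
})

# Check the x-variable before plotting
print(df_toy["date"].is_monotonic_increasing)   # False

# Sort by the x-variable, then plot df_plot instead of df_toy
df_plot = df_toy.sort_values(by="date")
print(df_plot["date"].is_monotonic_increasing)  # True
```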

In [None]:
df_sorted = df_financial.sort_values(by = "sp500")
display(df_sorted)

# try plotting date vs. sp500 for df_sorted



<font size = "4">

**Exercise:**

How would I generate a plot of time vs. Dow Jones Industrial Average?

In [None]:
# Write your own code




<font size = "4">

- How do we plot multiple columns of the dataset? 
- For instance, let's plot both time vs. S&P 500 and time vs. Dow Jones
- Each DataFrame has its own ``.plot`` method.
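
The chain described above can be sketched on a toy DataFrame as follows. The values are invented, `df_toy` is a hypothetical name, and the `Agg` backend line is only an assumption so the code runs without a display. The sketch also shows one possible way to override the default axis label and legend entries through the returned Axes object.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this in a notebook
import pandas as pd

# Toy data with made-up index values
df_toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
    "sp500": [3200.0, 3250.0, 3230.0],
    "djia": [28800.0, 28600.0, 28700.0],
})

# Select columns, make "date" the index, then plot every remaining column
ax = df_toy[["date", "sp500", "djia"]].set_index("date").plot()

# .plot() returns a matplotlib Axes, which we can relabel
ax.set_xlabel("Time")
ax.set_ylabel("Index")
ax.set_title("Toy S&P 500 and Dow Jones series")
ax.legend(["S&P 500", "Dow Jones"])
```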

In [None]:
# Step 1 of chain:
# Grab columns we want to plot



In [None]:
# Step 2 of chain:
# We will plot the date on the x-axis.
# So we will make "date" the index column



In [None]:
# Step 3 of chain:
# We will use the DataFrame's .plot method.
# This will plot index vs. column 1 AND
# will plot index vs. column 2

df_financial[   ["date","sp500","djia"]   ].set_index("date").plot()

In [None]:
# .plot() used column names for the legend and x-axis.
# Here's how to adjust them ourselves



<font size = "4">

The S&P 500 and Dow Jones have different units. We should either:

- Convert one to the other's units
- Have a left and right y-axis, one for each variable (more on this in a future lecture).
- Plot a "unitless" or "non-dimensional" measure of both.

The **percentage growth** is a non-dimensional quantity we can calculate for both:

$$\textrm{percentage growth} = \frac{(\textrm{today's index}) - (\textrm{yesterday's index})}{\textrm{yesterday's index}} \times 100\ \%$$
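
As a small worked example of this formula, take three made-up index levels: 100, 110, and 99. The second day's growth is (110 - 100) / 100 × 100 = 10%, and the third day's is (99 - 110) / 110 × 100 = -10%. The sketch below reproduces this with ``.diff()`` and ``.shift()`` and, as a cross-check, with the Pandas shortcut ``.pct_change()``.

```python
import pandas as pd

s = pd.Series([100.0, 110.0, 99.0])   # made-up index levels

numerator = s.diff()        # today's index minus yesterday's index
denominator = s.shift(1)    # yesterday's index (the "lag")
growth = numerator / denominator * 100

# Cross-check: pct_change() computes the same ratio as a fraction
growth_check = s.pct_change() * 100

# The first entry is missing because day 1 has no "yesterday"
print(growth.round(6).tolist())
```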

In [None]:
# Let's calculate the numerator: the difference between today's index 
# and yesterday's index.



<font size = "4">

To easily divide by "yesterday's index", we will shift the "sp500" column down, and make it a new column

In [None]:
# Let's make the denominator: "yesterday's index"
# Also known as the "lag"

# ".shift(1)" computes a new column with the value of "sp500"
# one period before. By convention the first row is assigned
# a missing value



In [None]:
# Now we combine ".diff()" and ".shift()" to compute growth rates



<font size = "4">

Now, we plot the growth rate:

In [None]:
plt.plot("date", "growth_sp500",
          data = df_financial)
plt.xlabel("Time")
plt.ylabel("Daily percentage change ")
plt.title("Change in the S&P 500 Index")
plt.show()

<font size = "4" >

**Exercise**

- Compute a column with the growth of the Dow Jones
- Plot the growth of the S&P 500 and Dow Jones in a <br>
single plot

In [None]:
# Write your own code

df_financial["growth_djia"] = (df_financial["djia"].diff()
                               / df_financial["djia"].shift(1)) * 100

plt.plot("date", "growth_sp500",
          data = df_financial)
plt.plot("date", "growth_djia",
          data = df_financial,alpha = 0.75)
plt.xlabel("Time")
plt.ylabel("Daily percentage change ")
plt.title("Change in the stock market indices")
plt.legend(["S&P 500", "DJIA"])
plt.show()

# <span style="color:red"> III. Subsets of time-series data </span>

<font size = "4" >

Like other DataFrames, we can use ``.query()`` to extract subsets of time-series data. Since we have converted to the "Datetime" datatype, logical conditions can be used in a straightforward way. 

In [None]:
# Since the "date" column has a datetime format, Pandas
# will interpret "2019-01-01" as a date inside the query command

subset_before  = df_financial.query('date <= "2019-01-01" ')
subset_after   = df_financial.query('date >= "2019-01-01" ')

display(subset_after)
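
<font size = "4">

Note that both conditions above are inclusive, so a row dated exactly 2019-01-01 would land in both subsets. A minimal sketch on made-up rows (the name `df_toy` and the values are hypothetical) shows the difference between `>=` and `>`:

```python
import pandas as pd

# Toy data with made-up index values
df_toy = pd.DataFrame({
    "date": pd.to_datetime(["2018-12-31", "2019-01-01", "2019-01-02"]),
    "sp500": [2500.0, 2505.0, 2510.0],
})

# Inclusive comparisons: the boundary date appears in both subsets
subset_before = df_toy.query('date <= "2019-01-01"')
subset_after = df_toy.query('date >= "2019-01-01"')

# A strict inequality drops the boundary date
strictly_after = df_toy.query('date > "2019-01-01"')

print(len(subset_before), len(subset_after), len(strictly_after))
```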

<font size = "4">

Here are some other subsets we might be interested in:

In [None]:
# Beginning of Covid pandemic (independent of data)
subset_between = df_financial.query("'2020-03-01' <= date <= '2020-05-01'")

# large changes in percentage growth (positive or negative)
subset_large_change = df_financial.query("growth_sp500 > 5 or growth_sp500 < -5")
display(subset_large_change)
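
<font size = "4">

The same two subsets can also be built without ``.query()``. The sketch below is one possible approach, using made-up rows and a hypothetical `df_toy`: a Boolean mask from ``.between()`` for the date window, and ``.abs()`` for the large-change condition.

```python
import pandas as pd

# Toy data with made-up growth rates
df_toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-28", "2020-03-12",
                            "2020-03-13", "2020-06-01"]),
    "growth_sp500": [-0.8, -9.5, 9.3, 0.4],
})

# .between() returns a Boolean Series; both endpoints are included by default
in_window = df_toy["date"].between("2020-03-01", "2020-05-01")
subset_between = df_toy[in_window]

# |x| > 5 is equivalent to (x > 5 or x < -5)
subset_large_change = df_toy[df_toy["growth_sp500"].abs() > 5]
subset_large_query = df_toy.query("growth_sp500 > 5 or growth_sp500 < -5")

print(subset_large_change.equals(subset_large_query))
```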

In [None]:
# alternate way. 
# A Pandas Series has a ".between()" method



In [None]:
# alternate way. 
# (x > 5 or x < -5) is equivalent to |x| > 5



<font size = "4">

Once we have identified an interesting subset of the data, we might want to visualize it within the context of the original time-series. For example, we might want to **highlight** these regions after plotting the entire series.

We can do this using the ``fill_between`` function from the ``matplotlib.pyplot`` library

In [None]:
# Create a line plot
plt.plot("date", "growth_sp500", data = df_financial)
plt.xlabel("Time")
plt.ylabel("Daily percentage change ")
plt.title("The S&P 500 during the start of COVID")

# Add a shaded region with a rectangle
# "x" is the x-coordinates, "y1" and "y2" are the lower
# and upper bounds of the rectangle. We can set this
# to be the minimum and maximum of the outcome.
# we use "where" to test a logical condition

x_vals = df_financial["date"]
y_vals = df_financial["growth_sp500"]
condition = df_financial["date"].between("2020-03-01","2020-05-01")

plt.fill_between(x = x_vals,
                 y1 = y_vals.min(),
                 y2 = y_vals.max(),
                 where = condition,
                 alpha = 0.2,color = "red")


plt.show()

<font size = "4">

- If we want to repeatedly refer to this region of the data, it might be a good idea to add a column to the DataFrame which will indicate which rows are part of the range.

- We can "flag" the data, and add a column of Boolean type to the DataFrame
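
To illustrate why a flag column is convenient, here is a sketch on made-up rows (the values and the name `df_toy` are hypothetical). Once the Boolean column exists, it can be reused to filter rows or to compare the flagged period against the rest of the sample.

```python
import pandas as pd

# Toy data with made-up growth rates
df_toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-28", "2020-03-12",
                            "2020-03-13", "2020-06-01"]),
    "growth_sp500": [-0.8, -9.5, 9.3, 0.4],
})
df_toy["covid_period"] = df_toy["date"].between("2020-03-01", "2020-05-01")

# Reuse 1: filter rows with the flag (two equivalent ways)
subset_covid = df_toy[df_toy["covid_period"]]
subset_query = df_toy.query("covid_period")

# Reuse 2: average size of daily moves inside vs. outside the period
mean_abs_change = (df_toy["growth_sp500"].abs()
                   .groupby(df_toy["covid_period"]).mean())
print(mean_abs_change)
```

The same column could also be passed as the ``where`` argument of ``plt.fill_between`` instead of recomputing the condition each time.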

In [None]:
# Add a column called "covid_period" of Boolean type.
# "True" if date is between March 1st, 2020 and May 1st, 2020
df_financial["covid_period"]  = df_financial["date"].between("2020-03-01","2020-05-01")

display(df_financial.head())
display(df_financial.dtypes)

<font size = "4">

**Exercise**

- Generate a plot of the percentage growth rate of the Dow Jones 
- Highlight regions where there was growth higher than 3\%
or below -3\%

In [None]:
# Create a line plot
plt.plot(...)



# Add shaded region(s) with plt.fill_between




# show the plot
plt.show()