<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Date-formats:-pd.to_datetime()-and-dt.to_period()" data-toc-modified-id="Date-formats:-pd.to_datetime()-and-dt.to_period()-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Date formats: <code>pd.to_datetime()</code> and <code>dt.to_period()</code></a></span></li><li><span><a href="#Sorting" data-toc-modified-id="Sorting-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Sorting</a></span><ul class="toc-item"><li><span><a href="#.sort_values()" data-toc-modified-id=".sort_values()-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span><code>.sort_values()</code></a></span></li><li><span><a href="#.sort_index()" data-toc-modified-id=".sort_index()-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span><code>.sort_index()</code></a></span></li></ul></li><li><span><a href="#Lagging-and-leading" data-toc-modified-id="Lagging-and-leading-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Lagging and leading</a></span><ul class="toc-item"><li><span><a href="#Lagging-and-leading-using-.shift()" data-toc-modified-id="Lagging-and-leading-using-.shift()-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Lagging and leading using <code>.shift()</code></a></span></li><li><span><a href="#Leading-and-lagging-with-.merge()" data-toc-modified-id="Leading-and-lagging-with-.merge()-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Leading and lagging with <code>.merge()</code></a></span></li></ul></li></ul></div>

```python
import pandas as pd
import numpy as np
```

# Date formats: ``pd.to_datetime()`` and ``dt.to_period()``

Pandas offers a lot of flexibility for manipulating dates and timestamps. Much of this functionality can only be used on columns that have the Pandas "datetime" data type. We can convert dates to this data type using the ``pd.to_datetime()`` function.

First, note that the ``date`` column in our dataframe is of type "object":
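
The dataframe used in this section is not shown here, so the sketch below builds a small hypothetical stand-in for ``df`` (the column names ``firmid``, ``date`` and ``return`` follow the text; the values are made up). Depending on your pandas version, a column of date strings is reported as "object" or as a string type, but in either case it is not a datetime column:

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; the values are invented for illustration
df = pd.DataFrame({
    "firmid": [1, 1, 1, 3, 3, 3],
    "date": ["2010-09-30", "2010-11-30", "2010-12-31",
             "2010-11-30", "2010-12-15", "2010-12-31"],
    "return": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
})

# Show the data type of every column; `date` holds plain strings, not datetimes
print(df.dtypes)
```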

We'll create a copy of the ``df`` dataframe to avoid changing the original data in ``df``:
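
A minimal sketch of this step, assuming a hypothetical ``df`` with a string ``date`` column (the name ``df2`` for the copy is also an assumption):

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; the values are invented for illustration
df = pd.DataFrame({
    "firmid": [1, 1, 1, 3, 3, 3],
    "date": ["2010-09-30", "2010-11-30", "2010-12-31",
             "2010-11-30", "2010-12-15", "2010-12-31"],
    "return": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
})

# Work on a copy so that `df` itself is left untouched
df2 = df.copy()

# Convert the string dates to the pandas datetime type
df2["date"] = pd.to_datetime(df2["date"])
print(df2.dtypes)
```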

Now we can apply many useful date functions (they are accessed through the ``.dt`` accessor) to this datetime variable. For example, we can extract information about specific components of the date (year, quarter, month, day, etc.):
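
For instance, assuming a hypothetical ``df2`` whose ``date`` column has already been converted to datetime, the date components could be extracted like this:

```python
import pandas as pd

# Hypothetical data; `date` is converted to datetime right away
df2 = pd.DataFrame({
    "firmid": [1, 1, 1, 3, 3, 3],
    "date": pd.to_datetime(["2010-09-30", "2010-11-30", "2010-12-31",
                            "2010-11-30", "2010-12-15", "2010-12-31"]),
})

# Each attribute of the .dt accessor returns one component of the date
df2["year"] = df2["date"].dt.year
df2["quarter"] = df2["date"].dt.quarter
df2["month"] = df2["date"].dt.month
df2["day"] = df2["date"].dt.day
print(df2)
```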

Another common use of the ``pd.to_datetime()`` function is to construct a datetime variable from date components:
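
As a sketch, ``pd.to_datetime()`` accepts a dataframe whose columns are named ``year``, ``month`` and ``day`` and assembles them into dates. The dataframe ``parts`` below is hypothetical:

```python
import pandas as pd

# Hypothetical date components stored in separate columns
parts = pd.DataFrame({
    "year": [2010, 2010, 2010],
    "month": [9, 11, 12],
    "day": [30, 30, 31],
})

# The columns must be named year, month and day for this to work
parts["date"] = pd.to_datetime(parts[["year", "month", "day"]])
print(parts)
```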

The other, very commonly used type for date data is the Pandas ``period`` format. This is used to specify that your data has a particular frequency, and can be done by applying the ``.dt.to_period()`` function to a datetime variable (e.g. use 'Y' for yearly frequency data, 'M' for monthly, and 'Q' for quarterly):
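
A short sketch of the three frequencies mentioned above, applied to a hypothetical series of dates (the variable names are assumptions):

```python
import pandas as pd

# Hypothetical datetime series
dates = pd.to_datetime(pd.Series(["2010-09-30", "2010-11-30", "2010-12-31"]))

# The same dates expressed at yearly, monthly and quarterly frequency
ydate = dates.dt.to_period("Y")
mdate = dates.dt.to_period("M")
qdate = dates.dt.to_period("Q")

print(pd.DataFrame({"date": dates, "ydate": ydate, "mdate": mdate, "qdate": qdate}))
```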

These types of ``period`` dates are useful for many operations on the data, the most important one being that Pandas understands what you mean if you want to add or subtract some number of periods to/from a given date. For example:
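
A minimal illustration with a hypothetical monthly period series: adding or subtracting an integer moves each date by that many months, including across year boundaries.

```python
import pandas as pd

# Hypothetical monthly period dates
mdate = pd.to_datetime(pd.Series(["2010-11-30", "2010-12-31"])).dt.to_period("M")

# Integers are interpreted as a number of periods (months here)
next_month = mdate + 1
two_months_ago = mdate - 2

print(pd.DataFrame({"mdate": mdate,
                    "next_month": next_month,
                    "two_months_ago": two_months_ago}))
```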

# Sorting

We can sort a dataframe based on the values in a particular column using the ``.sort_values()`` function. To sort based on the values in an index, we use the ``.sort_index()`` function. 

## ``.sort_values()``
Abbreviated syntax:
```python
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False)
```

Remember, functions that have an ``inplace`` parameter do not actually change the original dataset unless we set that parameter to ``True``:
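
The sketch below uses a hypothetical ``df`` to contrast the two behaviours. Without ``inplace=True`` the sorted data comes back as a new dataframe. With ``inplace=True`` the dataframe itself is modified and the call returns ``None``.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; the values are invented for illustration
df = pd.DataFrame({
    "firmid": [1, 1, 1, 3, 3, 3],
    "return": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
})

# Default (inplace=False): a sorted copy is returned and `df` keeps its row order
sorted_df = df.sort_values(by="return", ascending=False)

# inplace=True: the dataframe is sorted directly and nothing is returned
df_inplace = df.copy()
result = df_inplace.sort_values(by="return", ascending=False, inplace=True)

print(df)          # original order
print(sorted_df)   # largest return first
print(result)      # None
```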

## ``.sort_index()``
Abbreviated syntax:
```python
DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False)
```
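
As an illustration, the hypothetical dataframe below has dates as its index, and ``.sort_index()`` orders the rows by those dates rather than by any column:

```python
import pandas as pd

# Hypothetical returns indexed by date, deliberately out of order
returns = pd.DataFrame(
    {"return": [0.03, 0.01, 0.02]},
    index=pd.to_datetime(["2010-12-31", "2010-09-30", "2010-11-30"]),
)

# Sort by the index values, ascending by default
oldest_first = returns.sort_index()
newest_first = returns.sort_index(ascending=False)

print(oldest_first)
print(newest_first)
```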

# Lagging and leading

In most financial datasets that contain time-indexed information, lagging a particular variable (column) means obtaining the values of that variable from a prior point in time (e.g. lagging "twice" in a dataset with monthly frequency means obtaining data from "two months ago"). Leading a variable means obtaining values from a future point in time.

If you research how to lead and lag variables in pandas dataframes, most sources (including the official Pandas user guide) claim that you can do this using the ``.shift()`` function. In this section I first show you how to use this method, and then show you that it may run into problems when lagging is done based on dates. I then show you how you can create leads and lags in a more robust way (using ``period`` dates and the ``.merge()`` function).


## Lagging and leading using ``.shift()``

Syntax:
```python
DataFrame.shift(periods=1, freq=None, axis=0, fill_value=NoDefault.no_default)
```

Suppose we want to create a new column ``lag_return`` which tells us, for each firm, its returns from **the prior month**.
The general advice you'll see is that you should first sort your dataframe by firm identifier and by date:
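
A sketch of the sorting step on a hypothetical ``df`` whose rows start out in arbitrary order (the data and the name ``df2`` are assumptions):

```python
import pandas as pd

# Hypothetical data, deliberately unsorted
df = pd.DataFrame({
    "firmid": [3, 1, 3, 1, 3, 1],
    "date": ["2010-12-31", "2010-11-30", "2010-11-30",
             "2010-09-30", "2010-12-15", "2010-12-31"],
    "return": [0.06, 0.02, 0.04, 0.01, 0.05, 0.03],
})

df2 = df.copy()
df2["date"] = pd.to_datetime(df2["date"])

# Sort by firm first, then chronologically within each firm
df2 = df2.sort_values(["firmid", "date"]).reset_index(drop=True)
print(df2)
```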

And then use the ``.shift()`` function, after you tell Python that your dates are grouped at the firm level (each firm identifier has its own set of dates). You can do this with the ``.groupby()`` function (more on this function later):
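
The following sketch applies this to a hypothetical dataset in which October 2010 is missing for firm 1 and December 2010 appears twice for firm 3. The data is invented to reproduce the kind of gaps and duplicates discussed in this section.

```python
import pandas as pd

# Hypothetical data: no October 2010 row for firm 1, two December 2010 rows for firm 3
df = pd.DataFrame({
    "firmid": [1, 1, 1, 3, 3, 3],
    "date": ["2010-09-30", "2010-11-30", "2010-12-31",
             "2010-11-30", "2010-12-15", "2010-12-31"],
    "return": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
})

df2 = df.copy()
df2["date"] = pd.to_datetime(df2["date"])
df2 = df2.sort_values(["firmid", "date"]).reset_index(drop=True)

# Within each firm, take the value of `return` from the row above
df2["lag_return"] = df2.groupby("firmid")["return"].shift(1)
print(df2)
```

In this made-up example, the November 2010 row of firm 1 picks up the September return, and the last December 2010 row of firm 3 picks up the other December return, because ``.shift()`` counts rows, not calendar months.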

Note that the entries for ``lag_return`` on the second row and the last row are not exactly what we want: they do not tell us what the return of the firm was in the prior month. This happens because

1. Our data has gaps in coverage (October 2010 is missing for firmid==1)
2. Our data has duplicates (there are two entries for December 2010 for firmid==3)

Both of these issues disappear if we first remove duplicates and then interpret "lagging X times" to mean "data from X rows above" rather than "data from X time periods ago", and "leading X times" to mean "data from X rows below" rather than "data from X time periods in the future". To keep things simple, **this is the approach we will take in this course**.
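
A minimal sketch of this row-based approach on hypothetical data. Which duplicate to keep depends on the data at hand; keeping the last entry of each firm-month (``keep="last"``) is purely an assumption made for illustration, as is the helper column ``mdate``.

```python
import pandas as pd

# Hypothetical data: no October 2010 row for firm 1, two December 2010 rows for firm 3
df = pd.DataFrame({
    "firmid": [1, 1, 1, 3, 3, 3],
    "date": ["2010-09-30", "2010-11-30", "2010-12-31",
             "2010-11-30", "2010-12-15", "2010-12-31"],
    "return": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
})

df2 = df.copy()
df2["date"] = pd.to_datetime(df2["date"])
df2["mdate"] = df2["date"].dt.to_period("M")   # month of each observation
df2 = df2.sort_values(["firmid", "date"])

# Assumption: keep only the last observation for each firm-month
df2 = df2.drop_duplicates(subset=["firmid", "mdate"], keep="last")

# "Lagging once" now simply means "the row above, within the same firm"
df2["lag_return"] = df2.groupby("firmid")["return"].shift(1)
print(df2)
```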

However, if this interpretation of lagging and leading is not exactly what you need for your application and you need to lag and lead in terms of calendar periods, you should follow the approach below:

## Leading and lagging with ``.merge()``

First, create a copy of the original dataset again:

We will first create a new date variable that tells Python the frequency of our dates:
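
These two steps might look as follows on a hypothetical ``df`` (the name ``df3`` for the copy is an assumption; ``mdate`` is the monthly period date referred to in the text):

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; the values are invented for illustration
df = pd.DataFrame({
    "firmid": [1, 1, 1, 3, 3, 3],
    "date": ["2010-09-30", "2010-11-30", "2010-12-31",
             "2010-11-30", "2010-12-15", "2010-12-31"],
    "return": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
})

# Fresh copy of the original data
df3 = df.copy()

# Monthly period date: tells pandas that the data has monthly frequency
df3["mdate"] = pd.to_datetime(df3["date"]).dt.to_period("M")
print(df3)
```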

Now create a new dataframe containing the firm identifiers (``firmid``), the period date (``mdate``) and the variable we want to lag (``return``):

Now add the number of periods by which you want to lag the return variable (1 in our example) to the period date (subtract instead to create leads):

And rename ``return`` to ``lag_return``:
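
The three steps above can be sketched as follows, starting from a hypothetical ``df3`` that already contains the monthly period date ``mdate`` (the name ``lagged`` is an assumption):

```python
import pandas as pd

# Hypothetical data with a monthly period date already attached
df3 = pd.DataFrame({
    "firmid": [1, 1, 1, 3, 3, 3],
    "date": ["2010-09-30", "2010-11-30", "2010-12-31",
             "2010-11-30", "2010-12-15", "2010-12-31"],
    "return": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
})
df3["mdate"] = pd.to_datetime(df3["date"]).dt.to_period("M")

# Step 1: keep only the identifiers and the variable to be lagged
lagged = df3[["firmid", "mdate", "return"]].copy()

# Step 2: move every date one month forward, so each return is
# stamped with the month in which it counts as "last month's return"
lagged["mdate"] = lagged["mdate"] + 1

# Step 3: rename the variable to make clear it is a lag
lagged = lagged.rename(columns={"return": "lag_return"})
print(lagged)
```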

Finally, merge this lagged data into the original dataset:
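
Putting the whole procedure together on hypothetical data gives the sketch below. The left merge on ``firmid`` and ``mdate`` is an assumption about how the merge is set up, and the names ``df3``, ``lagged`` and ``merged`` are illustrative.

```python
import pandas as pd

# Hypothetical data: no October 2010 row for firm 1, two December 2010 rows for firm 3
df = pd.DataFrame({
    "firmid": [1, 1, 1, 3, 3, 3],
    "date": ["2010-09-30", "2010-11-30", "2010-12-31",
             "2010-11-30", "2010-12-15", "2010-12-31"],
    "return": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
})

df3 = df.copy()
df3["mdate"] = pd.to_datetime(df3["date"]).dt.to_period("M")

# Lagged copy: dates moved one month forward, variable renamed
lagged = df3[["firmid", "mdate", "return"]].copy()
lagged["mdate"] = lagged["mdate"] + 1
lagged = lagged.rename(columns={"return": "lag_return"})

# Left merge keeps every original row; rows with no data for the
# prior calendar month get a missing value in `lag_return`
merged = df3.merge(lagged, how="left", on=["firmid", "mdate"])
print(merged)
```

In this made-up example, the November 2010 row of firm 1 has a missing ``lag_return`` because there is no October 2010 observation, and both December 2010 rows of firm 3 receive the November 2010 return.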

Note that now the second and last entries in ``lag_return`` are correct (they tell us the returns from the prior month). 

You still have to decide how to handle the duplicate entries for ``return`` in December 2010 for ``firmid==3``. Duplicates need to be addressed on a case-by-case basis, depending on the particulars of the data you are using.