# 3.6 Date Formatting

Working with dates in Pandas can be slightly tricky at first, but it is often essential in data analysis. Pandas often reads date fields as strings, but once a column is assigned the *datetime* data type, it gains access to additional methods that can improve analysis. For example, the analyst can extract the month number or the number of days since a given date, which can show how the data changes over time.

### About the data
Since the *Titanic* data set does not contain date fields, it is not used in this notebook. Instead, this notebook uses data showing earthquake occurrences in Greece.

In [2]:
import pandas as pd
df = pd.read_csv("./data/earthquakes.csv")

Notice that there is a field "DATETIME" in this dataset. However, Pandas does not recognize this column as a datetime column but instead thinks that it's an `object` (string).

In [3]:
df.head()

| | DATETIME | LAT | LONG | DEPTH | MAGNITUDE |
|---|---|---|---|---|---|
| 0 | 1/7/1965 10:22 | 36.5 | 26.5 | 10 | 5.3 |
| 1 | 1/10/1965 8:02 | 39.25 | 22.25 | 10 | 4.9 |
| 2 | 1/12/1965 17:26 | 37.0 | 22.0 | 10 | 4.0 |
| 3 | 1/15/1965 14:56 | 36.75 | 21.75 | 10 | 4.5 |
| 4 | 3/9/1965 19:16 | 39.0 | 24.0 | 10 | 4.2 |


Pandas treats `DATETIME` as a string rather than a date (Pandas reports strings as dtype `object`). We can check the data type of every column by using the `.dtypes` property on the dataframe.

In [4]:
df.dtypes

DATETIME      object
LAT          float64
LONG         float64
DEPTH          int64
MAGNITUDE    float64
dtype: object

The difference between the data type `object` and `datetime` is very important. Strings indicate that the data is textual and have their own methods for data analysis in Pandas. For example, we can find the length of a string, see if another string is contained in a string, and split up a string on a specific set of characters to create additional dimensions for the data set.

Datetimes have their own set of methods. When a date is converted from a string into a datetime, special Pandas datetime methods can be applied to it. These datetime methods make aggregation by date much easier than would otherwise be possible, and make it especially easy to analyze data where the date or time is important.

For example, without converting the `DATETIME` column above into a `datetime` column, we could still probably use string methods to split the data up into columns that represent the year, month, day, hour, minute, and second. However, adding new columns could decrease processing efficiency for a large dataset and would require a lot more code to create. You would also have to convert each of those created columns to integer fields before grouping them together.
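
To make the comparison concrete, here is a minimal sketch of the string-splitting route next to the datetime route, using only the month. The two sample values are copied from the output above; the column names `MONTH_STR` and `MONTH_DT` are made up for illustration, and `pd.to_datetime()` is explained in the next part of this notebook.

```python
import pandas as pd

# Two sample values in the same "month/day/year hour:minute" layout as the data set
sample = pd.DataFrame({"DATETIME": ["1/7/1965 10:22", "3/9/1965 19:16"]})

# String route: cut off the time, split the date on "/", keep the first piece, cast to int
date_part = sample["DATETIME"].str.split(" ").str[0]
sample["MONTH_STR"] = date_part.str.split("/").str[0].astype(int)

# Datetime route: convert once, then read the month directly
as_dates = pd.to_datetime(sample["DATETIME"], format="%m/%d/%Y %H:%M")
sample["MONTH_DT"] = as_dates.dt.month

print(sample)
```

Both routes give the same month numbers, but the string route needs a separate split-and-cast step for every date part we want.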

The other option is to simply convert the column to data type `datetime`. Below we will explore how to do this.

### Casting a column to a datetime type
We can use the `to_datetime()` function to cast all of the values in a column to a datetime type. However, Pandas needs to know how each of the numbers in the column corresponds to a date part. In other words, is it "day/month/year" or "month/day/year"?

The `to_datetime()` function is a **Pandas function** (not a dataframe method) that can accept a Series object.

We can pass in a [Python format code](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) to tell Pandas how to interpret the date. The formatting codes aren't something you need to memorize, but you should keep them handy for future reference. Each code represents a part of the date. For example, `%B` indicates a full month name (e.g. January) whereas `%Y` indicates a full four-digit year (e.g. 2023).
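
As a small illustration of why the format codes matter, the sketch below reads the same string twice with Python's built-in `datetime.strptime()`, once as month/day/year and once as day/month/year. The variable names are made up for this example.

```python
from datetime import datetime

raw = "1/7/1965 10:22"

# Read as month/day/year: January 7, 1965
month_first = datetime.strptime(raw, "%m/%d/%Y %H:%M")

# Read as day/month/year: July 1, 1965
day_first = datetime.strptime(raw, "%d/%m/%Y %H:%M")

print(month_first)  # 1965-01-07 10:22:00
print(day_first)    # 1965-07-01 10:22:00

# strftime goes the other way and builds a string from format codes
print(month_first.strftime("%Y-%m-%d"))  # 1965-01-07
```

The same text produces two different dates, which is why it helps to state the format explicitly whenever the layout is ambiguous.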

Often, especially when using CSV and Excel files, Pandas is able to automatically interpret the datetime field and a Python format code isn't necessary. In those cases, you only need to pass in the Series that you want to convert to a datetime.
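
A minimal sketch of this automatic interpretation, assuming two sample strings in the same layout as the earthquake data. When no format is given, Pandas assumes by default that the month comes before the day.

```python
import pandas as pd

raw = pd.Series(["1/7/1965 10:22", "1/10/1965 8:02"])

# No format given: pandas works out the layout itself (month first by default)
inferred = pd.to_datetime(raw)

# Same strings with an explicit format, for comparison
explicit = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")

print(inferred)
print((inferred == explicit).all())
```

If the data were actually day/month/year, the automatic result would be silently wrong, so an explicit format is the safer choice when there is any doubt.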

In the code below, we call the Pandas function `to_datetime()` and pass in two arguments. The first one is a Series object that contains the strings that we want to turn into dates, and the second is a string that contains Python format codes. The `%m` denotes the month number, `%d` the day of the month, `%Y` a four-digit year, `%H` the hour (24-hour clock), and `%M` the minute. These codes are usually described as two-digit, zero-padded values, but values without a leading zero, such as the `1` in `1/7/1965`, are parsed as well.

We then save the Series created by the `to_datetime()` function back to the `DATETIME` column.

In [5]:
df['DATETIME'] = pd.to_datetime(df['DATETIME'], format="%m/%d/%Y %H:%M")

Now we can see that Pandas recognizes that the `DATETIME` column has a dtype of `datetime64`.

In [6]:
df.dtypes

DATETIME     datetime64[ns]
LAT                 float64
LONG                float64
DEPTH                 int64
MAGNITUDE           float64
dtype: object

### Using datetime methods

After creating or converting a column to a datetime type, you can use the `.dt` accessor object to get different properties of the Series object. For example, you could get the specific hour, day of week, minute, year, or day name. The `.dt` accessor object offers both methods (which end in parentheses) and properties (which do not use parentheses), and these can only be used on `Series` objects of `dtype=datetime`.

To use the following methods and properties on a `Series` object, you must get a `Series` of `dtype=datetime`, use the accessor object `.dt`, and then use the desired method or property in the format `df['datetime_column'].dt.method()`.

Each of the examples below shows the return value of the first row of the data set.

| Method/Property | Description | Example |
|:---------------:|-------------|---------|
| `.second`       | Gets the second within the minute (i.e. from 0 to 59) | `df['DATETIME'].dt.second` # returns 0 |
| `.minute`       | Gets the minute within the hour (i.e. from 0 to 59) | `df['DATETIME'].dt.minute` # returns 22 |
| `.hour`         | Gets the hour within the day (i.e. from 0 to 23) | `df['DATETIME'].dt.hour` # returns 10 |
| `.time`         | Gets only the time of the datetime in format hh:mm:ss | `df['DATETIME'].dt.time` # returns 10:22:00 |
| `.day`          | Gets the day of the month (i.e. from 1 to 31) | `df['DATETIME'].dt.day` # returns 7 |
| `.day_of_year`  | Gets the day within the year (i.e. from 1 to 365, or 366 in a leap year) | `df['DATETIME'].dt.day_of_year` # returns 7 |
| `.day_of_week`  | Gets the day within the week as a number (i.e. from 0 for Monday to 6 for Sunday) | `df['DATETIME'].dt.day_of_week` # returns 3 |
| `.day_name()`   | Returns the name of the day of the week (e.g. Monday, Tuesday) | `df['DATETIME'].dt.day_name()` # returns 'Thursday' |
| `.month_name()` | Returns the month name (e.g. January, February) | `df['DATETIME'].dt.month_name()` # returns 'January' |
| `.normalize()`  | Changes the time component of datetime fields to midnight (i.e. 00:00:00). This is useful if time data is unnecessary and data needs to be grouped without regard to time. | `df['DATETIME'].dt.normalize()` # returns 1965-01-07 (time set to midnight) |
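
A few of the entries in the table can be tried on a one-row Series. This is a small sketch that rebuilds the first value of the data set by hand; the variable name `s` is just for illustration.

```python
import pandas as pd

# One sample value: the first row of the data set shown earlier
s = pd.to_datetime(pd.Series(["1/7/1965 10:22"]), format="%m/%d/%Y %H:%M")

print(s.dt.hour[0])         # 10
print(s.dt.day_of_week[0])  # 3 (Monday is 0, so Thursday is 3)
print(s.dt.day_name()[0])   # Thursday
print(s.dt.normalize()[0])  # 1965-01-07 00:00:00
```

Properties such as `.hour` are written without parentheses, while methods such as `.day_name()` need them.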

For example, we can see the day of the month of each earthquake below.

In [7]:
df['DATETIME'].dt.day

0          7
1         10
2         12
3         15
4          9
          ..
251258    31
251259    31
251260    31
251261    31
251262    31
Name: DATETIME, Length: 251263, dtype: int64

### Filtering a dataframe by date

Much in the same way that normal filters are used, datetime fields can be used to filter data by year, month, day, day of week, or any other level of detail pertaining to datetime. We can get rows with dates occurring before, on, and after specific dates with the `<`, `==`, and `>` operators, just like in a regular `if` statement. However, a first attempt that compares the column against a date string returns no rows:

In [8]:
filt = df['DATETIME'] == '1965-12-31'
df.loc[ filt ] # No output: Pandas reads the string as 1965-12-31 00:00:00, and no row has exactly that timestamp

| | DATETIME | LAT | LONG | DEPTH | MAGNITUDE |
|---|---|---|---|---|---|


A string like this leaves it to Pandas to work out which date and time we mean. To make the comparison value explicit, we can build a datetime object ourselves. Datetimes are represented as strings when printed out but are actually a different type on the inside.

To create a datetime, we first need to import the `datetime` module from Python's standard library. There are many tools in the `datetime` module that we could use for datetime comparisons, but in this course, we will only import the `datetime` class from the `datetime` module (yes, both the module and the class share the same name).

In [9]:
from datetime import datetime

Using `datetime()`, we can create datetimes and use them in a filter. It accepts several parameters depending on how precise the date needs to be, but the only three required parameters are year, month, and day, passed as integers. By passing in a year, a month, and a day to the `datetime()` class that we imported, we get a datetime object back that can be used to filter our dataframe.

If only a year, month, and a day are provided, the hour, minute, and second values of the datetime default to 00:00:00.
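
The sketch below shows the default time next to a datetime with an explicit hour and minute. The variable names are made up for illustration.

```python
from datetime import datetime

# Only year, month, and day: the time defaults to midnight
midnight = datetime(1965, 12, 31)

# Optional hour and minute: 22:43 on the same day
evening = datetime(1965, 12, 31, 22, 43)

print(midnight)             # 1965-12-31 00:00:00
print(evening)              # 1965-12-31 22:43:00
print(midnight == evening)  # False
```

Two datetimes on the same calendar day are only equal if their times match as well.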

In [10]:
date = datetime(1965, 12, 31)
type(date)

datetime.datetime

And when we print out the new datetime, we can see the year, month, and day followed by a couple of zeroes that represent the hour and minute (which we did not specify). Remember that although Python is showing us the datetime with letters, numbers, and symbols, the datetime is **not** a string. This means that we cannot even compare this datetime object to the string `"datetime.datetime(1965, 12, 31, 0, 0)"`. The datetime is simply represented as a string below so that we can understand it.

In [11]:
date

datetime.datetime(1965, 12, 31, 0, 0)

Now we can pass the newly created `date` variable to a filter. However, notice that the filter still doesn't work!

In [10]:
filt = df['DATETIME'] == date
df.loc[ filt ]

| | DATETIME | LAT | LONG | DEPTH | MAGNITUDE |
|---|---|---|---|---|---|


The filter above didn't work because the `datetime()` function defaults to a time of 00:00:00 when no time variables are provided. That means that the filter is only getting rows that occurred exactly on the date 12-31-1965 and at the time 00:00:00. That's a lot that was specified; no wonder nothing was returned!

To get around this, we can use the `.normalize()` method that we saw in the table above to turn the times of all the dates in the `DATETIME` column into 00:00:00. Then, the filter works! Now we get all of the rows that occurred on that date regardless of the time that they occurred.

In [11]:
filt = df['DATETIME'].dt.normalize() == date
df.loc[ filt ]

| | DATETIME | LAT | LONG | DEPTH | MAGNITUDE |
|---|---|---|---|---|---|
| 67 | 1965-12-31 22:43:00 | 39.1 | 20.9 | 10 | 4.1 |
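
Another way to get every row on a given day, sketched here as an alternative to `.normalize()`, is to filter on a range that starts at midnight and stops just before the next midnight. The sample rows are hypothetical except for the 22:43 row, which is taken from the output above.

```python
import pandas as pd
from datetime import datetime, timedelta

# Hypothetical sample rows; only the 22:43 row comes from the output above
quakes = pd.DataFrame({
    "DATETIME": pd.to_datetime(["1965-12-30 09:15", "1965-12-31 22:43", "1966-01-01 03:20"]),
    "MAGNITUDE": [4.4, 4.1, 4.0],
})

start = datetime(1965, 12, 31)
end = start + timedelta(days=1)  # midnight at the start of the next day

# Keep rows from the start of Dec 31 up to, but not including, Jan 1
filt = (quakes["DATETIME"] >= start) & (quakes["DATETIME"] < end)
print(quakes.loc[filt])
```

The same pattern works for any span of time, such as a week or a year, by changing `start` and `end`.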


### Finding the difference between dates

Conveniently, datetimes can be subtracted from each other using the standard subtraction `-` operator. Subtracting two datetimes returns the time between them as a `timedelta` object. A `timedelta` stores the difference as whole days plus any leftover seconds and microseconds, so its `.days` property holds only the whole-day part. The two dates in the example below both fall at midnight, so the difference is a whole number of days and we can get the difference in hours by multiplying the days by 24 (hours per day).

You can access the days from the `timedelta` as an integer through its `.days` property, in the format `timedelta.days`.

In [12]:
date_1 = datetime(2020, 2, 29)
date_2 = datetime(2022, 5, 15)
difference_between_dates = date_2 - date_1

Notice below that printing out the `timedelta` in the variable `difference_between_dates` does not return a number, but rather returns a `timedelta` object. We can see in the parentheses that this object has a property `days` that equals 806.

In [14]:
# Notice that printing out difference_between_dates does not return a number!
difference_between_dates

datetime.timedelta(days=806)

We can get the `days` property out of the `timedelta` object by adding on `.days` to the end of it.

In [15]:
# We can get the number out of the timedelta by adding `.days`
difference_between_dates.days

806

We can then multiply by 24 to get the number of hours between these two dates, or even divide by 7 to get the number of weeks.

In [17]:
# We can then convert this figure to hours by multiplying by 24
difference_in_hours = difference_between_dates.days * 24
difference_in_hours

19344
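
Multiplying `.days` by 24 is exact here because both dates fall at midnight. When times of day are involved, `.days` keeps only the whole days, and the `timedelta` method `.total_seconds()` can be used to keep the remainder. A minimal sketch with made-up times:

```python
from datetime import datetime

start = datetime(2020, 2, 29, 6, 0)   # 06:00
end = datetime(2020, 3, 1, 18, 30)    # 18:30 the next day

gap = end - start

print(gap.days)                    # 1 (whole days only)
print(gap.total_seconds() / 3600)  # 36.5 (hours, including the partial day)
```

Using `gap.days * 24` here would give 24 hours and drop the remaining 12.5 hours.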

#### How is this useful?
You can see that finding the difference between dates didn't use Pandas at all. However, we can use the ability to subtract dates on our dataframes. For example, we can select the first datetime in our dataset and the last datetime in our dataset and then subtract them to see how many days the dataset spans.

Note that the `.max()` and `.min()` methods return the most recent and least recent dates from the `DATETIME` column, respectively.

In [16]:
first_date = df['DATETIME'].min()
last_date = df['DATETIME'].max()

last_date - first_date

Timedelta('20812 days 13:14:00')
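
The same subtraction also works between a whole column and a single timestamp, which gives the number of days since a chosen date for every row. The sketch below uses three hypothetical sample dates; `.dt.days` and `.total_seconds()` are the Pandas counterparts of the `timedelta` tools used above.

```python
import pandas as pd

# Hypothetical sample dates for illustration
dates = pd.Series(pd.to_datetime(["1965-01-07 10:22", "1965-01-10 08:02", "1965-03-09 19:16"]))

# Subtracting one timestamp from a whole column gives a Series of Timedeltas
since_first = dates - dates.min()
print(since_first.dt.days.tolist())  # [0, 2, 61]

# A single Timedelta also exposes .days and .total_seconds()
span = dates.max() - dates.min()
print(span.days)                    # 61
print(span.total_seconds() / 3600)  # 1472.9
```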

### Aggregating by datetime
Time series data is any data that is collected through repeated measurements over time. For example, the stock exchange collects data every second about the price of stocks in the market. This repeated collection of the same data over time allows for insights to be gained regarding trends and forecasts of future observations.

This "Earthquakes in Greece" data set probably isn't the best example of time series data because the data isn't recorded at specific intervals of time (every hour, for example), but rather is only recorded when an earthquake happens. However, we can still use it to show the principles of using time series data.

Time series data allows us to aggregate across specific intervals of time. In this dataset, that means that we can look at the number of earthquakes per month, the highest magnitude earthquakes per week, and average magnitudes per year.

To aggregate across dates, we could simply add a new column describing the unit of time that we want to aggregate by. In other words, if we want to see the biggest earthquake per year, we could do the following steps:
1. Make a new column called "YEAR" that just contains the year of the "DATETIME" column.
2. Group by "YEAR" and aggregate by the "max".

In [16]:
# make a new column "YEAR" which is the year of the "DATETIME" column
df['YEAR'] = df['DATETIME'].dt.year

In [17]:
# group by YEAR and aggregate MAGNITUDE by the maximum.
df.groupby("YEAR").agg({"MAGNITUDE": "max"}).head()

| YEAR | MAGNITUDE |
|---|---|
| 1965 | 5.9 |
| 1966 | 6.0 |
| 1967 | 5.3 |
| 1968 | 6.7 |
| 1969 | 5.6 |


This worked in this case. However, what if we also want to aggregate across months, week numbers, or hours? We wouldn't be able to do this as easily. For example, the `.month` property just returns a month number *without* the year, meaning that all earthquakes that occurred in January **in any year** are grouped together.

In [23]:
df['MONTH'] = df['DATETIME'].dt.month # make a new column "MONTH"
df.groupby('MONTH').agg({'MAGNITUDE': 'max'})

| MONTH | MAGNITUDE |
|---|---|
| 1 | 6.4 |
| 2 | 6.7 |
| 3 | 6.0 |
| 4 | 6.2 |
| 5 | 6.3 |
| 6 | 6.5 |
| 7 | 6.2 |
| 8 | 6.6 |
| 9 | 6.0 |
| 10 | 6.7 |


To fix this problem, we could group by both year and month. However, what if we then want to group by *day* as well? We'd have to make another column, and eventually we could have an extra 10 columns.
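
Grouping by both year and month could look like the sketch below. The sample rows are hypothetical, and the grouping keys are built on the fly with `.rename()` rather than stored as new columns, which is one way to limit the column clutter described above.

```python
import pandas as pd

# Hypothetical sample rows for illustration
quakes = pd.DataFrame({
    "DATETIME": pd.to_datetime(["1965-01-07", "1965-01-10", "1966-01-03", "1966-02-20"]),
    "MAGNITUDE": [5.3, 4.9, 4.2, 6.0],
})

# Build the grouping keys on the fly instead of storing them as columns
year = quakes["DATETIME"].dt.year.rename("YEAR")
month = quakes["DATETIME"].dt.month.rename("MONTH")

# January 1965 and January 1966 now end up in separate groups
monthly_max = quakes.groupby([year, month])["MAGNITUDE"].max()
print(monthly_max)
```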

One easier solution is to change the named index of the dataframe to the datetime column. That means that instead of each row having a named index of 0, 1, 2, and so on, the column `DATETIME` will actually become the named index of each row!

We can set the index of our dataframe by using the `.set_index()` method on the dataframe and passing in a column name to be used as the index. In the example below, I'm actually also creating a new dataframe `date_df` just in case I ever need the original dataframe again.

In [24]:
date_df = df.set_index('DATETIME')

I can see that the index has changed by printing out the dataframe. Notice that the index (the leftmost column) has now been replaced with the `DATETIME` column, whose values are now **bolded**. Compare this to the original dataframe.

In [28]:
date_df.head()

| DATETIME | LAT | LONG | DEPTH | MAGNITUDE | YEAR | MONTH |
|---|---|---|---|---|---|---|
| 1965-01-07 10:22:00 | 36.5 | 26.5 | 10 | 5.3 | 1965 | 1 |
| 1965-01-10 08:02:00 | 39.25 | 22.25 | 10 | 4.9 | 1965 | 1 |
| 1965-01-12 17:26:00 | 37.0 | 22.0 | 10 | 4.0 | 1965 | 1 |
| 1965-01-15 14:56:00 | 36.75 | 21.75 | 10 | 4.5 | 1965 | 1 |
| 1965-03-09 19:16:00 | 39.0 | 24.0 | 10 | 4.2 | 1965 | 3 |
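
One side benefit of a datetime index, sketched here with hypothetical sample rows, is that `.loc` accepts a date string and returns every row that falls within it. This gives another way to select all rows on a given day.

```python
import pandas as pd

# Hypothetical sample rows; only the 22:43 row comes from the earlier output
quakes = pd.DataFrame({
    "DATETIME": pd.to_datetime(["1965-12-30 09:15", "1965-12-31 22:43", "1966-01-01 03:20"]),
    "MAGNITUDE": [4.4, 4.1, 4.0],
}).set_index("DATETIME")

# A date string passed to .loc selects every row that falls on that day
one_day = quakes.loc["1965-12-31"]
print(one_day)

# A coarser string selects a whole month
one_month = quakes.loc["1965-12"]
print(one_month)
```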


Setting the index of the dataframe to a column with `dtype=datetime` means that we can use the `.resample()` method on any column that we want to aggregate. This method allows us to aggregate across years, months, days, or whatever specificity of time we want without having to make new columns!

All we have to do is retrieve the column we want to aggregate, call the `.resample()` method while passing in a string that indicates the level to group by, and then call an aggregation method such as `.max()`.

For example, the code below aggregates the `MAGNITUDE` column across each month in the data set, returning its maximum value.

In [29]:
date_df['MAGNITUDE'].resample("M").max()

DATETIME
1965-01-31    5.3
1965-02-28    NaN
1965-03-31    5.1
1965-04-30    4.6
1965-05-31    4.0
             ... 
2021-08-31    5.4
2021-09-30    5.8
2021-10-31    4.5
2021-11-30    5.0
2021-12-31    5.4
Freq: M, Name: MAGNITUDE, Length: 684, dtype: float64
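
Other levels of time work the same way. The sketch below uses three hypothetical rows and the `"D"` (calendar day) alias, and also shows `.agg()` with more than one aggregation at a time.

```python
import pandas as pd

# Hypothetical sample rows for illustration
quakes = pd.DataFrame({
    "DATETIME": pd.to_datetime(["1965-01-07 10:22", "1965-01-07 23:00", "1965-01-09 08:00"]),
    "MAGNITUDE": [5.3, 4.0, 4.9],
}).set_index("DATETIME")

# "D" groups by calendar day; days with no rows show up as NaN
daily_max = quakes["MAGNITUDE"].resample("D").max()

# .agg() accepts several aggregations at once
daily_stats = quakes["MAGNITUDE"].resample("D").agg(["max", "count"])

print(daily_max)
print(daily_stats)
```

Just like February 1965 in the monthly output above, January 8 appears in the result even though no rows fall on it. Note that newer Pandas releases (2.2 and later) spell the month-end alias `"ME"` and show a warning when `"M"` is used.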

In this notebook we learned how to convert columns to have a type of `datetime`. We also learned how to filter by datetime, how to subtract datetimes, and how to set the datetime field as the named row index in order to aggregate across time series more easily.