# 📦 0. Prepare data

In [1]:
import pandas as pd
df = pd.DataFrame({'flight_start': ['2019-02-04 13:15:00', 
                                    '2020-01-01 21:30:00', 
                                    '2021-10-28 02:00:00'],
                   'flight_length': [7, 21.5, 30],
                   'return_start': ['01/03/2019 10:00:00', 
                                    '11/01/2020 20:50:00', 
                                    '05/11/2021 08:20:00']})
df

|   | flight_start | flight_length | return_start |
| :-: | :-: | :-: | :-: |
| 0 | 2019-02-04 13:15:00 | 7.0 | 01/03/2019 10:00:00 |
| 1 | 2020-01-01 21:30:00 | 21.5 | 11/01/2020 20:50:00 |
| 2 | 2021-10-28 02:00:00 | 30.0 | 05/11/2021 08:20:00 |


In [2]:
df.dtypes

flight_start      object
flight_length    float64
return_start      object
dtype: object

## 📍 1. Transform to datetime or timedelta dtypes

We will transform `flight_start` and `return_start` to *datetime* with `pd.to_datetime()` and `flight_length` to *timedelta* with `pd.to_timedelta()`.

In [3]:
df['flight_start'] = pd.to_datetime(df['flight_start'])
df['flight_length'] = pd.to_timedelta(df['flight_length'], 'h')
df['return_start'] = pd.to_datetime(
    df['return_start'], format='%d/%m/%Y %H:%M:%S'
)
df

|   | flight_start | flight_length | return_start |
| :-: | :-: | :-: | :-: |
| 0 | 2019-02-04 13:15:00 | 0 days 07:00:00 | 2019-03-01 10:00:00 |
| 1 | 2020-01-01 21:30:00 | 0 days 21:30:00 | 2020-01-11 20:50:00 |
| 2 | 2021-10-28 02:00:00 | 1 days 06:00:00 | 2021-11-05 08:20:00 |


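In the `return_start` conversion, we specified the datetime format using the `format` argument. It matters here: `return_start` is written day-first, and for an ambiguous string such as `01/03/2019` pandas assumes month-first by default. Without `format` (or `dayfirst=True`), the first value would be read as 3 January 2019 instead of 1 March 2019. Knowing how to specify the format is therefore useful whenever the data does not follow the default convention.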
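As a small sketch of how the `format` string controls the interpretation of an ambiguous date, the snippet below parses the same string twice. The series name `raw` is only illustrative and not part of the dataset above.

```python
import pandas as pd

# An ambiguous string: 1 March or 3 January?
raw = pd.Series(['01/03/2019 10:00:00'])

# Day-first interpretation, matching the format used for return_start
day_first = pd.to_datetime(raw, format='%d/%m/%Y %H:%M:%S')

# Month-first interpretation of the very same string
month_first = pd.to_datetime(raw, format='%m/%d/%Y %H:%M:%S')

print(day_first[0])    # 2019-03-01 10:00:00
print(month_first[0])  # 2019-01-03 10:00:00
```

Both calls succeed without any warning, so a wrong format can go unnoticed unless the parsed values are checked against the source.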

In [4]:
df.dtypes

flight_start      datetime64[ns]
flight_length    timedelta64[ns]
return_start      datetime64[ns]
dtype: object

Before we move on, here is a useful tip for converting dirty data. Sometimes we work with bad values such as `202Y-01-01 21:30:00`, which contain stray characters or are otherwise invalid datetimes. Trying to convert such data like below raises an error:

In [5]:
# pd.to_datetime(pd.Series(['2019-02-04 13:15:00', 
#                           '202Y-01-01 21:30:00']))

# This gives an error

In this case, we can pass `errors='coerce'` to the function to convert the valid values and replace the invalid ones with missing values (`NaT`):

In [6]:
pd.to_datetime(pd.Series(['2019-02-04 13:15:00', 
                          '202Y-01-01 21:30:00']), errors='coerce')

0   2019-02-04 13:15:00
1                   NaT
dtype: datetime64[ns]
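Coercing silently hides the bad values, so it can help to list which inputs failed. The following is one possible sketch, assuming we keep the raw strings around; the names `raw`, `parsed` and `failed` are illustrative.

```python
import pandas as pd

raw = pd.Series(['2019-02-04 13:15:00',
                 '202Y-01-01 21:30:00',
                 None])  # a genuinely missing value
parsed = pd.to_datetime(raw, errors='coerce')

# Rows that were present in the input but could not be parsed
failed = raw[parsed.isna() & raw.notna()]
print(failed.tolist())  # ['202Y-01-01 21:30:00']
```

Comparing against `raw.notna()` separates values that were missing to begin with from values that were turned into `NaT` by the coercion.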

Similarly, we will get an error when converting dirty data to timedelta:

In [7]:
# pd.to_timedelta(pd.Series([7, 'T']), unit='h')

# This gives an error

We can use `errors='coerce'` here as well:

In [8]:
pd.to_timedelta(pd.Series([7, 'T']), unit='h', errors='coerce')

0   0 days 07:00:00
1               NaT
dtype: timedelta64[ns]
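An alternative sketch, which avoids passing strings to `pd.to_timedelta()` together with a unit, is to coerce the values to numbers first. This is just one possible approach, and the names `raw`, `hours` and `lengths` are illustrative.

```python
import pandas as pd

raw = pd.Series([7, 'T'])

# Step 1: non-numeric entries such as 'T' become NaN
hours = pd.to_numeric(raw, errors='coerce')

# Step 2: numbers are read as hours; NaN becomes NaT
lengths = pd.to_timedelta(hours, unit='h')
print(lengths)
```

Splitting the conversion into two steps also leaves an intermediate numeric series that can be inspected for unexpected values.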

## 📍 2. Extract datetime parts

Extracting datetime parts with pandas datetime is easy through the `.dt` accessor. For instance, here’s how we could extract a date using the accessor:

In [9]:
df['flight_start'].dt.date

0    2019-02-04
1    2020-01-01
2    2021-10-28
Name: flight_start, dtype: object

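Here’s the summary of the commonly used attributes, with example values taken from the first record of `flight_start`:

| ATTRIBUTE | DESCRIPTION | EXAMPLE | 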
| :-------: | :---------: | :-----: | 
| `.dt.date` | Date | 2019-02-04 |
| `.dt.year` | Year | 2019 |
| `.dt.isocalendar().year` | ISO year | 2019 |
| `.dt.quarter` | Quarter number | 1 | 
| `.dt.isocalendar().week` | ISO week number | 6 | 
| `.dt.month` | Month number | 2 |
| `.dt.month_name()` | Month name | February | 
| `.dt.dayofyear` or `.dt.day_of_year` | Day of the year | 35 | 
| `.dt.day` | Day of the month |  4 |
| `.dt.weekday` or `.dt.dayofweek` or `.dt.day_of_week` | Day of the week as a number from 0 (Monday) to 6 (Sunday) | 0 | 
| `.dt.isocalendar().day` | ISO day of the week as a number from 1 (Monday) to 7 (Sunday) | 1 | 
| `.dt.day_name()` | Name of day of the week | Monday | 
| `.dt.daysinmonth` or `.dt.days_in_month` | Number of days in the month | 28 | 
| `.dt.time` | Time | 13:15:00 | 
| `.dt.hour` | Hour | 13 | 
| `.dt.minute` | Minute | 15 | 
| `.dt.second` | Second | 0 | 
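Several of these attributes can be collected into one frame in a single pass. The sketch below rebuilds the `flight_start` values so that it runs on its own; the `parts` frame and its column names are illustrative choices.

```python
import pandas as pd

flight_start = pd.to_datetime(pd.Series(['2019-02-04 13:15:00',
                                         '2020-01-01 21:30:00',
                                         '2021-10-28 02:00:00']))

# isocalendar() returns a DataFrame with year, week and day columns
iso = flight_start.dt.isocalendar()

parts = pd.DataFrame({
    'year': flight_start.dt.year,
    'month_name': flight_start.dt.month_name(),
    'day_name': flight_start.dt.day_name(),
    'weekday': flight_start.dt.weekday,   # Monday is 0
    'iso_week': iso['week'],
    'iso_day': iso['day'],                # Monday is 1
})
print(parts)
```

Note that `month_name()` and `day_name()` are methods and need parentheses, while most of the other parts are plain attributes.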

## 📍 3. Find datetime differences

Let’s see how long it has been since `return_start`. In pandas, we can get the current local time with `pd.Timestamp.now()`:

In [10]:
pd.Timestamp.now() - df['return_start']

0   1414 days 07:09:08.628022
1   1097 days 20:19:08.628022
2    434 days 08:49:08.628022
Name: return_start, dtype: timedelta64[ns]
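Because `pd.Timestamp.now()` changes on every run, the output above is not reproducible. A minimal sketch with a fixed reference point is shown below; the `reference` timestamp is an arbitrary, hypothetical value chosen for illustration.

```python
import pandas as pd

return_start = pd.to_datetime(pd.Series(['2019-03-01 10:00:00',
                                         '2020-01-11 20:50:00',
                                         '2021-11-05 08:20:00']))

# Hypothetical fixed reference instead of the moving "now"
reference = pd.Timestamp('2023-01-13 17:00:00')

# The scalar is subtracted element-wise from every row
elapsed = reference - return_start
print(elapsed)
```

Pinning the reference this way is mainly useful in tests and reports, where the same input should always give the same result.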

The resulting series already has the `timedelta` data type. The operation works between two objects of different lengths (a scalar and a Series) because of *broadcasting*. Now, let’s calculate the duration between the two flight start datetimes and save it as `duration`:

In [11]:
df['duration'] = df['return_start'] - df['flight_start']
df

|   | flight_start | flight_length | return_start | duration |
| :-: | :-: | :-: | :-: | :-: |
| 0 | 2019-02-04 13:15:00 | 0 days 07:00:00 | 2019-03-01 10:00:00 | 24 days 20:45:00 |
| 1 | 2020-01-01 21:30:00 | 0 days 21:30:00 | 2020-01-11 20:50:00 | 9 days 23:20:00 |
| 2 | 2021-10-28 02:00:00 | 1 days 06:00:00 | 2021-11-05 08:20:00 | 8 days 06:20:00 |


Like before, we can extract timedelta parts using the `.dt` accessor. We can extract days like below:

In [12]:
df['duration'].dt.days

0    24
1     9
2     8
Name: duration, dtype: int64

Here’s the summary of the commonly used attributes:

| ATTRIBUTE | DESCRIPTION | EXAMPLE | 
| :-------: | :---------: | :-----: | 
| `.dt.days` | Number of days | 24 |
| `.dt.seconds` | Number of seconds left over after whole days are removed (from 0 to 86399) | 74700 (=20\*3600 + 45\*60) |

If we want a more precise duration in days, we can calculate it like below:

In [13]:
df['duration'].dt.days + df['duration'].dt.seconds/(24*60*60)

0    24.864583
1     9.972222
2     8.263889
Name: duration, dtype: float64
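The same fractional days can also be obtained more directly. The sketch below shows two equivalent options: dividing by a one-day `pd.Timedelta`, or using `.dt.total_seconds()`. The `duration` series is rebuilt here so the snippet is self-contained.

```python
import pandas as pd

duration = pd.Series(pd.to_timedelta(['24 days 20:45:00',
                                      '9 days 23:20:00',
                                      '8 days 06:20:00']))

# Option 1: dividing two timedeltas gives a plain float ratio
days_a = duration / pd.Timedelta(days=1)

# Option 2: total seconds (days included) divided by seconds per day
days_b = duration.dt.total_seconds() / (24 * 60 * 60)

print(days_a.round(6).tolist())  # [24.864583, 9.972222, 8.263889]
```

Unlike `.dt.seconds`, `.dt.total_seconds()` includes the whole days, so no separate addition of `.dt.days` is needed.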

Here are other ways to calculate the difference between two datetimes:

In [14]:
df['return_start'].dt.dayofyear - df['flight_start'].dt.dayofyear

0    25
1    10
2     8
dtype: int64

In [15]:
df['return_start'].dt.date - df['flight_start'].dt.date

0   25 days
1   10 days
2    8 days
dtype: timedelta64[ns]

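These approaches give slightly different answers. Subtracting the full datetimes measures elapsed time, and `.dt.days` truncates it to whole days (24, 9, 8), whereas the last two approaches count the calendar-day boundaries crossed (25, 10, 8). The `dayofyear` difference is also only meaningful when both dates fall in the same year. The correct one depends on the application and intention of the calculation.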
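To make the difference concrete, here is a small sketch with a made-up trip that crosses a year boundary. The `start` and `end` values are hypothetical and not part of the dataset above; `.dt.normalize()` is used as one way to drop the time of day while keeping a datetime dtype.

```python
import pandas as pd

# Hypothetical trip crossing New Year
start = pd.to_datetime(pd.Series(['2019-12-30 23:00:00']))
end = pd.to_datetime(pd.Series(['2020-01-02 01:00:00']))

# Elapsed time between the two timestamps
exact = end - start

# Calendar days crossed: set both times to midnight first
calendar_days = end.dt.normalize() - start.dt.normalize()

# Day-of-year difference ignores the year entirely
doy_diff = end.dt.dayofyear - start.dt.dayofyear

print(exact[0])          # 2 days 02:00:00
print(calendar_days[0])  # 3 days 00:00:00
print(doy_diff[0])       # -362
```

Here the elapsed time is just over two days, three calendar dates are crossed, and the day-of-year difference is negative because it compares day 2 of one year with day 364 of the previous one.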

## 📍 4. Derive datetimes from datetimes and timedeltas

In [16]:
df['flight_end'] = df['flight_start'] + df['flight_length']
df[['flight_start', 'flight_length', 'flight_end']]

|   | flight_start | flight_length | flight_end |
| :-: | :-: | :-: | :-: |
| 0 | 2019-02-04 13:15:00 | 0 days 07:00:00 | 2019-02-04 20:15:00 |
| 1 | 2020-01-01 21:30:00 | 0 days 21:30:00 | 2020-01-02 19:00:00 |
| 2 | 2021-10-28 02:00:00 | 1 days 06:00:00 | 2021-10-29 08:00:00 |
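
The same arithmetic works with a single scalar timedelta, which is broadcast across the column. As a hedged illustration, the sketch below derives a check-in time by subtracting a fixed lead time; the two-hour `check_in_lead` is a hypothetical value, not something from the dataset.

```python
import pandas as pd

flight_start = pd.to_datetime(pd.Series(['2019-02-04 13:15:00',
                                         '2020-01-01 21:30:00']))

# Hypothetical rule: check-in opens two hours before departure
check_in_lead = pd.Timedelta(hours=2)

# datetime minus timedelta gives a datetime
check_in = flight_start - check_in_lead
print(check_in)
```

Adding or subtracting a timedelta always yields a datetime, while subtracting two datetimes yields a timedelta, as in the previous section.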
