# Helper Functions Volume 2A: Date Time Processing

Author: Koh Kok Bin  
Date: 10 Dec 2021

This notebook is an overall guide for users who work with date/time data. A good grasp of date/time manipulation greatly reduces the time needed to write scripts, especially for data cleaning.  

I will try my best to make this easy to understand for Python users of all levels. Most, if not all, of the code used here can be applied to other areas of your work.

Have fun copying!


### <a id = "VOL2A_toc"> Table of contents:  </a>

- [Reference Materials -- This needs to be run so that Parts 1 to n can work.](#VOL2A_refmat)
- [Help on codes](#VOL2A_help)
- [Part 1: Selecting and Indexing data](#VOL2A_selectdata)
- [Data Processing](#VOL2A_dataclean)  
- [Exploratory Data Analysis](#VOL2A_EDA)

Each dataset has its own date/time granularity. Yours may be in days (2020-01-01), months (2020-01), or years (2020). Some may be in hours (2020-01-01 00H), minutes (2020-01-01 00-00), or seconds (2020-01-01 000000).  

Note the different symbols used (-, H). Because there is likely a human behind the process of recording the dates (data entry from physical forms, or an officer who drew up a Google Forms or form.sg form), the date format may differ from one dataset to another. Python doesn't know that. Hence, we need to tell it how to break the text down into the individual time components.

Pandas can usually infer the date from the format that was passed in. We can also specify our own formats. The format codes are listed at:

- https://strftime.org  

Some common format codes are listed here. You are unlikely to need the others unless the date and time data is quite messy.

- %d (0-padded day of month)  
- %m (0-padded month)  
- %Y (4-digit year)  
- %H (24 Hour clock, 0-padded)  
- %M (0-padded minutes)  
- %S (0-padded seconds)  
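
As a minimal sketch of specifying a format yourself, the snippet below parses two of the sample layouts mentioned earlier with `pd.to_datetime`. The sample strings and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample strings in two of the layouts described above.
hourly = pd.Series(["2020-01-01 00H", "2020-01-02 13H"])
per_second = pd.Series(["2020-01-01 000000", "2020-01-01 235959"])

# Literal characters such as "-" and "H" are written into the format as-is;
# the % codes mark where each time component sits.
hourly_parsed = pd.to_datetime(hourly, format="%Y-%m-%d %HH")
per_second_parsed = pd.to_datetime(per_second, format="%Y-%m-%d %H%M%S")

print(hourly_parsed.dt.hour.tolist())
print(per_second_parsed.dt.second.tolist())
```

If a string does not match the format, `pd.to_datetime` raises an error by default, which is a useful early warning that the data contains more than one layout.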

Most date/time functions follow or make use of Python's built-in "datetime" module to parse dates and times. The most popular functions here are:  

- datetime.datetime.strftime  
- datetime.datetime.strptime  

There's no need to understand what the individual words mean. Just know that these 2 functions are the most commonly used ones for handling datetime. The difference between *strftime* and *strptime* is that the first formats a datetime for output as a string, while the second parses (reads) a string into a datetime.

strftime (in my own words): string format time  
strptime (in my own words): string parse time  

When you want to output a datetime as a string, use strftime. When you want to read a datetime in from a string, use strptime.
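
A small round trip may make the distinction clearer. This is only a sketch; the input string and the two formats are arbitrary choices for illustration.

```python
import datetime

raw = "19-01-2022 13:01:01"  # hypothetical input string

# strptime: string -> datetime (parse)
parsed = datetime.datetime.strptime(raw, "%d-%m-%Y %H:%M:%S")

# strftime: datetime -> string (format)
formatted = parsed.strftime("%Y/%m/%d %H%M")

print(parsed)     # 2022-01-19 13:01:01
print(formatted)  # 2022/01/19 1301
```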

In [21]:
import datetime
import os
import pandas as pd
import numpy as np
import random

In [3]:
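# Get directory name of this file. Helpful to specify the directory of the file,
# so you can also interact with the files in the same location via relative paths.
# Note: "_dh" (directory history) is defined by IPython/Jupyter, so this line only works inside a notebook.

dirname = globals()["_dh"][0]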

In [53]:
# Let's start with a simple exercise
start = datetime.datetime(2022, 1, 19, 13, 1, 1)
end = datetime.datetime(2022, 1, 21, 6, 56, 1)
print("Full date = {}".format(start))
print("Date with specific components separate: Year: {}  Month: {}  Day: {}  Hours: {}  Minutes: {}  Seconds: {}".format(
    start.year, start.month, start.day, start.hour, start.minute, start.second))
print("Representing it in another format = {}".format(datetime.datetime.strftime(start, "%d-%b-%y %I:%M:%S %p")))
print("Time between 2 dates = {}".format(end - start))
print("Time between 2 dates: Days = {}, Hours = {}, in Minutes = {}".format(
    (end - start).days, (end - start).total_seconds() / 3600, (end - start).total_seconds() / 60))

# See the difference?

Full date = 2022-01-19 13:01:01
Date with specific components separate: Year: 2022  Month: 1  Day: 19  Hours: 13  Minutes: 1  Seconds: 1
Representing it in another format = 19-Jan-22 01:01:01 PM
Time between 2 dates = 1 day, 17:55:00
Time between 2 dates: Days = 1, Hours = 41.916666666666664, in Minutes = 2515.0


In [54]:
start = datetime.datetime(2022, 1, 19, 13, 0, 0)
end = datetime.datetime(2022, 1, 26, 13, 0, 0)
print("Time between 2 dates: Days = {}, Hours = {}, in Minutes = {}".format((end - start).days, (end - start).total_seconds() / 3600, (end - start).total_seconds() / 60))
print((end-start).seconds)

# Why is there a difference? If you want to calculate the total time elapsed, use the total_seconds() method.
# The seconds attribute holds only the seconds left over after the whole days are removed (0 to 86399), which is 0 here.

Time between 2 dates: Days = 7, Hours = 168.0, in Minutes = 10080.0
0


In [None]:
help(start)

The code above can already go a long way in helping you deal with datetime data. If you're stuck, Google probably has the solution one search away.  

If you run the cell above, "help(start)", you'll see that start is a datetime object. You can also run "help(end-start)". This shows that the result is a timedelta object.  

Since they are different objects, we use them differently. They have different attributes and methods (functions) that help manipulate their values, as shown in the cells above.  

Timedelta is the delta (difference) in time between 2 datetime objects. Internally it is stored as days, seconds and microseconds, and the total duration in seconds is retrievable via total_seconds(). To get the equivalent in hours, divide by the number of seconds in an hour (3600). Whilst they can seem difficult to manage, datetime and timedelta objects are pretty straightforward. Let's try another exercise.
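
To make the relationship between the timedelta attributes concrete, here is a short sketch that reuses the dates from the first exercise. The variable names `rebuilt`, `hours` and `hours_alt` are my own.

```python
import datetime

start = datetime.datetime(2022, 1, 19, 13, 1, 1)
end = datetime.datetime(2022, 1, 21, 6, 56, 1)
delta = end - start

print(delta.days)             # 1
print(delta.seconds)          # 64500, i.e. the 17 h 55 min left after whole days
print(delta.total_seconds())  # 150900.0

# total_seconds() is the sum of the stored components.
rebuilt = delta.days * 86400 + delta.seconds + delta.microseconds / 1e6

# Two equivalent ways to express the duration in hours.
hours = delta.total_seconds() / 3600
hours_alt = delta / datetime.timedelta(hours=1)
```

Dividing one timedelta by another returns a plain number, which can read more clearly than dividing by 3600 when the unit matters.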

In [131]:
start = datetime.datetime(2022, 1, 19, 13, 0, 0)
end = datetime.datetime(2022, 1, 26, 13, 0, 0)

# Create 100 random data points spread between the start and end dates.
# Every run of this cell generates 100 different dates because random() is not seeded.
# You may not get the same result as I did here.
arr = dict()
arr["DATETIME"] = [random.random() * (end-start) + start for _ in range(100)]
arr["COUNT"] = [random.randint(1,5) for _ in range(100)]

arr_df = pd.DataFrame(arr).sort_values(by = "DATETIME").reset_index(drop = True)
arr_df.head()

| | DATETIME | COUNT |
|---|---|---|
| 0 | 2022-01-19 13:38:38.110223 | 1 |
| 1 | 2022-01-19 15:56:22.390005 | 3 |
| 2 | 2022-01-19 20:47:37.499262 | 4 |
| 3 | 2022-01-19 20:54:32.640059 | 1 |
| 4 | 2022-01-19 20:57:44.840250 | 3 |
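
The random dates generated above change on every run. If you need the same dates each time (for example, to compare results with a colleague), one option is to seed the generator. The sketch below is an assumption on my part, not part of the original exercise; the helper name `make_dates` and the seed value 42 are arbitrary.

```python
import datetime
import random

start = datetime.datetime(2022, 1, 19, 13, 0, 0)
end = datetime.datetime(2022, 1, 26, 13, 0, 0)

def make_dates(seed, n=5):
    """Return n random datetimes between start and end for a given seed."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    return [start + rng.random() * (end - start) for _ in range(n)]

first = make_dates(seed=42)
second = make_dates(seed=42)
print(first == second)  # True: the same seed gives the same dates
```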


In [132]:
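# Round every timestamp down to the whole second. Lowercase "s" is used because the uppercase "S" alias is deprecated in recent pandas versions.
arr_df["DATETIME"] = pd.DatetimeIndex(arr_df["DATETIME"]).floor("s")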
arr_df.head()

| | DATETIME | COUNT |
|---|---|---|
| 0 | 2022-01-19 13:38:38 | 1 |
| 1 | 2022-01-19 15:56:22 | 3 |
| 2 | 2022-01-19 20:47:37 | 4 |
| 3 | 2022-01-19 20:54:32 | 1 |
| 4 | 2022-01-19 20:57:44 | 3 |


Let's assume this dataset records individuals arriving in Singapore by car at a certain time. Each row is one arrival, and COUNT is the number of individuals in that arrival. Use this dataset to answer the questions below:    

1) How many individuals travelled into Singapore on each day?  
2) How many individuals travelled on weekdays vs weekends?  
3) What is the average number of individuals per arrival, and the average arrival rate per hour for weekdays and weekends?  
4) How many individuals travelled at the different times of day? (12AM to 8AM, 8AM to 4PM, 4PM to 12AM)

In [133]:
# 1) How many individuals travelled into Singapore for each day?

arr_df.groupby(by = arr_df["DATETIME"].dt.date)[["COUNT"]].count()

# Note: .count() counts the rows (arrival records) per day. To total the individuals per day, use .sum() instead.
# .dt is a pandas accessor that exposes datetime properties on pandas objects such as a Series (a dataframe column).
# Within .dt there are other attributes like .day, .hour, .date. Check out dir(arr_df["DATETIME"].dt) for more information.

| DATETIME | COUNT |
|---|---|
| 2022-01-19 | 6 |
| 2022-01-20 | 14 |
| 2022-01-21 | 13 |
| 2022-01-22 | 20 |
| 2022-01-23 | 8 |
| 2022-01-24 | 17 |
| 2022-01-25 | 14 |
| 2022-01-26 | 8 |
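
The difference between counting arrival records and totalling individuals is easy to miss, so here is a minimal sketch on a tiny, made-up dataframe (`toy` is hypothetical and not part of the exercise data).

```python
import pandas as pd

# Hypothetical toy data: three arrival records over two days.
toy = pd.DataFrame({
    "DATETIME": pd.to_datetime([
        "2022-01-19 13:38:38",
        "2022-01-19 15:56:22",
        "2022-01-20 08:00:00",
    ]),
    "COUNT": [1, 3, 4],
})

by_day = toy.groupby(toy["DATETIME"].dt.date)["COUNT"]

arrivals = by_day.count()    # number of rows (arrival records) per day
individuals = by_day.sum()   # number of individuals per day

print(arrivals.tolist())     # [2, 1]
print(individuals.tolist())  # [4, 4]
```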


In [136]:
# 2) How many individuals travelled on weekdays vs weekends?

test = arr_df["DATETIME"].dt.weekday
answer = list()

for no in test:
    if no < 5:
        answer.append("WEEKDAY")
    else:
        answer.append("WEEKEND")

arr_df["WEEKNO"] = answer
# As in question 1, .count() counts arrival records; use .sum() to total the individuals.
arr_df.groupby(by = ["WEEKNO"])[["COUNT"]].count()

| WEEKNO | COUNT |
|---|---|
| WEEKDAY | 72 |
| WEEKEND | 28 |
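
The loop above works, but the same labelling can be done in one line with `np.where`. This is an alternative sketch on made-up data, not a replacement for the original cell; `toy` and `totals` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering a Friday, a Saturday, a Sunday and a Monday.
toy = pd.DataFrame({
    "DATETIME": pd.to_datetime([
        "2022-01-21 09:00:00",  # Friday
        "2022-01-22 09:00:00",  # Saturday
        "2022-01-23 09:00:00",  # Sunday
        "2022-01-24 09:00:00",  # Monday
    ]),
    "COUNT": [2, 1, 5, 3],
})

# .dt.weekday runs from 0 (Monday) to 6 (Sunday), so 5 and 6 are the weekend.
toy["WEEKNO"] = np.where(toy["DATETIME"].dt.weekday < 5, "WEEKDAY", "WEEKEND")

totals = toy.groupby("WEEKNO")["COUNT"].sum()
print(totals.to_dict())  # {'WEEKDAY': 5, 'WEEKEND': 6}
```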


In [None]:
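# 3) What is the average individual per arrival, average arrival rate per hour for weekdays and weekends?

# Select COUNT explicitly: calling .mean() on the whole dataframe fails on the text column in recent pandas versions.
avg_passengers = arr_df["COUNT"].mean()
print(avg_passengers)

arr_df.groupby(by = ["WEEKNO", arr_df["DATETIME"].dt.hour])[["COUNT"]].mean()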
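
"Arrival rate per hour" can be read in more than one way. The sketch below takes one possible reading: count the arrival records in every clock hour between the first and last record (empty hours count as 0), then average those hourly counts for weekdays and weekends. The data and the names `toy`, `per_hour` and `rate` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three arrivals on a Friday, two on a Saturday.
toy = pd.DataFrame({
    "DATETIME": pd.to_datetime([
        "2022-01-21 09:10:00", "2022-01-21 09:40:00", "2022-01-21 11:05:00",
        "2022-01-22 10:15:00", "2022-01-22 11:30:00",
    ]),
    "COUNT": [2, 1, 5, 3, 1],
})

# Average number of individuals per arrival record.
avg_per_arrival = toy["COUNT"].mean()

# Number of arrival records in each clock hour, including hours with none.
per_hour = toy.set_index("DATETIME").resample("h").size()

# Label each hourly bin as weekday or weekend, then average the hourly counts.
day_type = np.where(per_hour.index.weekday < 5, "WEEKDAY", "WEEKEND")
rate = per_hour.groupby(day_type).mean()

print(avg_per_arrival)
print(rate)
```

Note that the hourly bins only span the first to the last record, so the first and last days are partial days. If you need full days, reindex `per_hour` over the full range before averaging.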

In [148]:
# 4) How many individuals travelled in the different times of day? (12AM to 8AM, 8AM to 4PM, 4PM to 12AM)

morning = list(range(0, 8))
afternoon = list(range(8, 16))
night = list(range(16, 24))

answer = list()
for hour in list(arr_df["DATETIME"].dt.hour):
    
    if hour in morning:
        answer.append("Morning")
    elif hour in afternoon:
        answer.append("Afternoon")
    else:
        answer.append("Night")

arr_df["TIME_DAY"] = answer

arr_df.groupby(by = ["TIME_DAY", arr_df["DATETIME"].dt.hour])[["COUNT"]].count()

Unnamed: 0_level_0,Unnamed: 1_level_0,COUNT
TIME_DAY,DATETIME,Unnamed: 2_level_1
Afternoon,8,5
Afternoon,9,3
Afternoon,10,4
Afternoon,11,6
Afternoon,12,5
Afternoon,13,4
Afternoon,14,3
Afternoon,15,5
Morning,0,1
Morning,1,6
