### Date and Time Manipulation Using Pandas

Python ships with the `datetime` module for working with date and time objects. But why do we need to manipulate datetime values at all, and why use the `datetime` module to do it?

Can't we just store dates as strings and operate on those?

We can store dates as strings, but that limits functionality and forces us to write extra code for things that the **datetime** module or a pandas **datetime** object can do in fewer lines.
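To make the limitation concrete, here is a minimal sketch that sorts the same dates as text and as `datetime` objects. The sample values and the day/month/year format are assumptions chosen for illustration.

```python
from datetime import datetime

# Dates stored as day/month/year strings (illustrative values)
date_strings = ["12/01/2024", "06/02/2024", "25/12/2023"]

# Sorting the raw strings compares characters, not calendar order
sorted_as_text = sorted(date_strings)

# Parsing first gives real datetime objects that sort chronologically
parsed = [datetime.strptime(s, "%d/%m/%Y") for s in date_strings]
sorted_as_dates = sorted(parsed)

print(sorted_as_text)
print([d.strftime("%d/%m/%Y") for d in sorted_as_dates])
```

The text sort puts February 2024 first because `"06"` sorts before `"12"`, while the parsed dates come out in true chronological order.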

In [1]:
import pandas as pd
import numpy as np

In [2]:
from datetime import datetime

In [3]:
myday = 12
mymonth = 1
myyear = 2024

In [5]:
datetime(year=myyear, month=mymonth, day=myday)

datetime.datetime(2024, 1, 12, 0, 0)

Using this `datetime` object, we can easily access the year, month, and day. We can also compute the gap between two dates, either as elapsed time or component by component (year, month, day).

In [6]:
# get the year from datetime object
dt = datetime(year=myyear, month=mymonth, day=myday)

In [7]:
dt.year

2024

In [8]:
dt.month

1

In [9]:
dt.day

12

In [10]:
today_dt = datetime(year=2024, month=2, day=6)

In [11]:
today_dt

datetime.datetime(2024, 2, 6, 0, 0)

In [12]:
# subtracting two datetime objects gives the elapsed time as a timedelta
today_dt - dt

datetime.timedelta(days=25)
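The `timedelta` returned by the subtraction can be inspected and reused in further arithmetic. The sketch below repeats the two dates from the cells above; the variable names `gap_days`, `gap_hours`, and `one_week_later` are only illustrative.

```python
from datetime import datetime, timedelta

dt = datetime(year=2024, month=1, day=12)
today_dt = datetime(year=2024, month=2, day=6)

gap = today_dt - dt                     # timedelta object
gap_days = gap.days                     # whole days in the gap
gap_hours = gap.total_seconds() / 3600  # the same gap expressed in hours

# A timedelta can also be added to a datetime to shift it
one_week_later = dt + timedelta(weeks=1)

print(gap_days, gap_hours, one_week_later)
```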

In [14]:
# component-wise difference: compares only the month fields, not the elapsed time
month_diff = today_dt.month - dt.month

In [15]:
month_diff

1

In [16]:
# component-wise difference: compares only the day-of-month fields, so it can be negative
day_diff = today_dt.day - dt.day

In [17]:
day_diff

-6
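Subtracting individual fields can mislead: the month fields differ by 1 and the day fields by -6, even though only 25 days have elapsed. One possible way to count whole calendar months is sketched below. `months_between` is a hypothetical helper written for this example, not part of `datetime` or pandas, and it assumes `end` is not earlier than `start`.

```python
from datetime import datetime

def months_between(start: datetime, end: datetime) -> int:
    """Whole calendar months from start to end (hypothetical helper)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # If the end day has not reached the start day, the last month is incomplete
    if end.day < start.day:
        months -= 1
    return months

dt = datetime(2024, 1, 12)
today_dt = datetime(2024, 2, 6)

field_diff = today_dt.month - dt.month       # difference of the month fields
whole_months = months_between(dt, today_dt)  # completed months only

print(field_diff, whole_months)
```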

In [19]:
df = pd.read_csv(filepath_or_buffer='../datasets/hotel_booking_data.csv')

In [20]:
df.sample(5)

|  | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | name | email | phone-number | credit_card |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 111994 | City Hotel | 0 | 31 | 2017 | May | 20 | 14 | 2 | 3 | 1 | ... | Transient-Party | 130.0 | 0 | 0 | Check-Out | 2017-05-19 | Justin Barnes | Justin_Barnes66@gmail.com | 159-119-1181 | ************8885 |
| 24961 | Resort Hotel | 0 | 24 | 2016 | June | 24 | 7 | 0 | 4 | 2 | ... | Transient | 131.0 | 0 | 1 | Check-Out | 2016-06-11 | Jeremy Smith | Smith_Jeremy@aol.com | 608-278-7209 | ************3956 |
| 86300 | City Hotel | 0 | 2 | 2016 | March | 14 | 27 | 2 | 2 | 3 | ... | Transient | 147.0 | 0 | 2 | Check-Out | 2016-03-31 | Haley Paul | Haley.Paul46@comcast.net | 837-129-2748 | ************9234 |
| 67491 | City Hotel | 1 | 143 | 2017 | May | 18 | 4 | 0 | 2 | 2 | ... | Transient | 120.0 | 0 | 0 | Canceled | 2016-12-12 | Michelle Richards | MRichards@yahoo.com | 423-644-3366 | ************2685 |
| 24027 | Resort Hotel | 0 | 98 | 2016 | May | 19 | 4 | 2 | 4 | 2 | ... | Transient | 43.43 | 1 | 1 | Check-Out | 2016-05-10 | Lisa Humphrey | Humphrey.Lisa@yahoo.com | 836-294-5065 | ************9749 |


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 36 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

In [22]:
# inspect reservation_status_date before converting it to datetime
df['reservation_status_date']

0         2015-07-01
1         2015-07-01
2         2015-07-02
3         2015-07-02
4         2015-07-03
             ...    
119385    2017-09-06
119386    2017-09-07
119387    2017-09-07
119388    2017-09-07
119389    2017-09-07
Name: reservation_status_date, Length: 119390, dtype: object

In [23]:
df['reservation_status_date'].dtype

dtype('O')

In [24]:
type(df['reservation_status_date'])

pandas.core.series.Series

In [26]:
pd.to_datetime(df['reservation_status_date'])

0        2015-07-01
1        2015-07-01
2        2015-07-02
3        2015-07-02
4        2015-07-03
            ...    
119385   2017-09-06
119386   2017-09-07
119387   2017-09-07
119388   2017-09-07
119389   2017-09-07
Name: reservation_status_date, Length: 119390, dtype: datetime64[ns]

In [27]:
pd.to_datetime(df['reservation_status_date']).dtype

dtype('<M8[ns]')

In [32]:
df = df.astype({'reservation_status_date': 'datetime64[ns]'})
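Real columns often contain entries that are not valid dates. As a minimal sketch, `pd.to_datetime` accepts `errors="coerce"`, which turns unparseable entries into `NaT` instead of raising. The small series below is made up for illustration and is not taken from the hotel dataset.

```python
import pandas as pd

# Illustrative values: two valid dates and one bad entry
raw = pd.Series(["2015-07-01", "2015-07-02", "not a date"])

# errors="coerce" replaces anything that cannot be parsed with NaT
converted = pd.to_datetime(raw, errors="coerce")

# Count how many entries failed to parse
n_bad = int(converted.isna().sum())

print(converted)
print(n_bad)
```

Checking `n_bad` right after the conversion is a cheap way to see how many rows were silently turned into missing values.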

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 36 columns):
 #   Column                          Non-Null Count   Dtype         
---  ------                          --------------   -----         
 0   hotel                           119390 non-null  object        
 1   is_canceled                     119390 non-null  int64         
 2   lead_time                       119390 non-null  int64         
 3   arrival_date_year               119390 non-null  int64         
 4   arrival_date_month              119390 non-null  object        
 5   arrival_date_week_number        119390 non-null  int64         
 6   arrival_date_day_of_month       119390 non-null  int64         
 7   stays_in_weekend_nights         119390 non-null  int64         
 8   stays_in_week_nights            119390 non-null  int64         
 9   adults                          119390 non-null  int64         
 10  children                        119386 non-null  float64

In [36]:
df['reservation_status_date'].dt.year

0         2015
1         2015
2         2015
3         2015
4         2015
          ... 
119385    2017
119386    2017
119387    2017
119388    2017
119389    2017
Name: reservation_status_date, Length: 119390, dtype: int32

In [37]:
df['reservation_status_date'].dt.month

0         7
1         7
2         7
3         7
4         7
         ..
119385    9
119386    9
119387    9
119388    9
119389    9
Name: reservation_status_date, Length: 119390, dtype: int32

In [39]:
df['reservation_status_date'].dt.day

0         1
1         1
2         2
3         2
4         3
         ..
119385    6
119386    7
119387    7
119388    7
119389    7
Name: reservation_status_date, Length: 119390, dtype: int32
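The `.dt` accessor exposes more than year, month, and day. The sketch below collects a few other commonly used fields into one frame; the three dates are picked from values that appear in the outputs above, and the column names are arbitrary.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2015-07-01", "2016-03-31", "2017-09-07"]))

parts = pd.DataFrame({
    "year": dates.dt.year,
    "month_name": dates.dt.month_name(),   # e.g. "July"
    "day_name": dates.dt.day_name(),       # e.g. "Wednesday"
    "quarter": dates.dt.quarter,           # 1 to 4
    "is_weekend": dates.dt.dayofweek >= 5, # Monday=0 ... Sunday=6
})

print(parts)
```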

In [40]:
# pd.to_datetime can also parse many free-form date strings into a Timestamp
custom_date = "12th of Jan 2023"
pd.to_datetime(custom_date)

Timestamp('2023-01-12 00:00:00')

In [49]:
style_date = '12--Jan--2020'

In [50]:
pd.to_datetime(style_date)

DateParseError: Unknown datetime string format, unable to parse: 12--Jan--2020, at position 0

In [51]:
pd.to_datetime(style_date, format='%d--%b--%Y')

Timestamp('2020-01-12 00:00:00')
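The same `format` argument works on a whole column, and `.dt.strftime` goes the other way, from datetime back to text. This is a small sketch with made-up values in the `12--Jan--2020` style; it assumes an English locale for the `%b` month abbreviations.

```python
import pandas as pd

# Illustrative strings in the custom day--Mon--year style
styled = pd.Series(["12--Jan--2020", "03--Feb--2021", "28--Dec--2019"])

# Parse the whole column with an explicit format
parsed = pd.to_datetime(styled, format="%d--%b--%Y")

# Render the datetimes back to text in ISO style
iso_text = parsed.dt.strftime("%Y-%m-%d")

print(parsed)
print(iso_text.tolist())
```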

In [53]:
df = pd.read_csv('../datasets/RetailSales_BeerWineLiquor.csv')

In [54]:
df.sample(4)

|  | DATE | MRTSSM4453USN |
|---|---|---|
| 331 | 2019-08-01 | 5270 |
| 264 | 2014-01-01 | 3381 |
| 279 | 2015-04-01 | 3867 |
| 160 | 2005-05-01 | 2709 |


In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   DATE           340 non-null    object
 1   MRTSSM4453USN  340 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 5.4+ KB


In [57]:
df['DATE'] = pd.to_datetime(df['DATE'])

In [59]:
df.sample(4)

|  | DATE | MRTSSM4453USN |
|---|---|---|
| 122 | 2002-03-01 | 2366 |
| 214 | 2009-11-01 | 3356 |
| 224 | 2010-09-01 | 3365 |
| 139 | 2003-08-01 | 2660 |


In [61]:
df['DATE'].dtype

dtype('<M8[ns]')

In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   DATE           340 non-null    datetime64[ns]
 1   MRTSSM4453USN  340 non-null    int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 5.4 KB


In [63]:
df['DATE'].dt.year

0      1992
1      1992
2      1992
3      1992
4      1992
       ... 
335    2019
336    2020
337    2020
338    2020
339    2020
Name: DATE, Length: 340, dtype: int32

In [64]:
# we can also parse date columns while reading the file, using parse_dates
df = pd.read_csv('../datasets/RetailSales_BeerWineLiquor.csv', parse_dates=['DATE'])

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   DATE           340 non-null    datetime64[ns]
 1   MRTSSM4453USN  340 non-null    int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 5.4 KB
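`parse_dates` can be combined with `index_col` so the file is read straight into a frame with a `DatetimeIndex`. The sketch below uses an in-memory CSV built from the first three rows shown earlier, so it runs without the dataset file.

```python
import io
import pandas as pd

# Small in-memory stand-in for the CSV file
csv_text = (
    "DATE,MRTSSM4453USN\n"
    "1992-01-01,1509\n"
    "1992-02-01,1541\n"
    "1992-03-01,1597\n"
)

# Parse DATE as datetime and use it as the index in one step
df_demo = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["DATE"],
    index_col="DATE",
)

print(df_demo.index)
```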


#### Using `resample()` when the date is the index

In [68]:
df.set_index(keys=['DATE'])

| DATE | MRTSSM4453USN |
|---|---|
| 1992-01-01 | 1509 |
| 1992-02-01 | 1541 |
| 1992-03-01 | 1597 |
| 1992-04-01 | 1675 |
| 1992-05-01 | 1822 |
| ... | ... |
| 2019-12-01 | 6630 |
| 2020-01-01 | 4388 |
| 2020-02-01 | 4533 |
| 2020-03-01 | 5562 |


In [69]:
df_dt = df.set_index(keys=['DATE'])

In [71]:
df_dt.sample(4)

| DATE | MRTSSM4453USN |
|---|---|
| 2001-12-01 | 3542 |
| 2009-08-01 | 3399 |
| 2013-08-01 | 4100 |
| 2008-01-01 | 2675 |


In [75]:
# resample works like a group by over time bins; 'YE' means year-end frequency
df_dt.resample(rule='YE')

<pandas.core.resample.DatetimeIndexResampler object at 0x16851ba90>

In [76]:
df_dt.resample(rule='YE').mean()

| DATE | MRTSSM4453USN |
|---|---|
| 1992-12-31 | 1807.25 |
| 1993-12-31 | 1794.833333 |
| 1994-12-31 | 1841.75 |
| 1995-12-31 | 1833.916667 |
| 1996-12-31 | 1929.75 |
| 1997-12-31 | 2006.75 |
| 1998-12-31 | 2115.166667 |
| 1999-12-31 | 2206.333333 |
| 2000-12-31 | 2375.583333 |
| 2001-12-31 | 2468.416667 |
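A resampler accepts the usual aggregation methods, including several at once through `agg`. The sketch below uses a synthetic 24-month series (the `units` values are invented) and the year-start alias `"YS"`, which labels each bin by its first day rather than its last. `"YS"` is chosen here only because, as far as we know, the `"YE"` alias is recognised only by recent pandas releases.

```python
import pandas as pd

# Synthetic monthly data: 24 month-start dates with values 1..24
idx = pd.date_range(start="2018-01-01", periods=24, freq="MS")
sales = pd.Series(range(1, 25), index=idx, name="units")

# One row per year, with several aggregations side by side
yearly = sales.resample("YS").agg(["mean", "sum", "count"])

print(yearly)
```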


In [77]:
df.groupby('DATE')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1691eaf90>

In [78]:
df.groupby('DATE').mean()

| DATE | MRTSSM4453USN |
|---|---|
| 1992-01-01 | 1509.0 |
| 1992-02-01 | 1541.0 |
| 1992-03-01 | 1597.0 |
| 1992-04-01 | 1675.0 |
| 1992-05-01 | 1822.0 |
| ... | ... |
| 2019-12-01 | 6630.0 |
| 2020-01-01 | 4388.0 |
| 2020-02-01 | 4533.0 |
| 2020-03-01 | 5562.0 |


In [79]:
df.groupby('DATE').count()

| DATE | MRTSSM4453USN |
|---|---|
| 1992-01-01 | 1 |
| 1992-02-01 | 1 |
| 1992-03-01 | 1 |
| 1992-04-01 | 1 |
| 1992-05-01 | 1 |
| ... | ... |
| 2019-12-01 | 1 |
| 2020-01-01 | 1 |
| 2020-02-01 | 1 |
| 2020-03-01 | 1 |
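Grouping on the raw `DATE` column yields one group per row because every date is unique. To get yearly groups comparable to `resample`, one option is to group on the year extracted with `.dt.year`. The sketch below uses synthetic data (24 months, invented `units` values) and the year-start alias `"YS"` on the resample side.

```python
import pandas as pd

# Synthetic monthly data, with DATE kept as a regular column
demo = pd.DataFrame({
    "DATE": pd.date_range("2018-01-01", periods=24, freq="MS"),
    "units": range(1, 25),
})

# groupby on the extracted year: the index holds integer years
by_year = demo.groupby(demo["DATE"].dt.year)["units"].mean()

# resample on a DatetimeIndex: the index holds Timestamp labels
by_resample = demo.set_index("DATE")["units"].resample("YS").mean()

print(by_year)
print(by_resample)
```

Both approaches produce the same yearly means; the main difference is the index, which holds integer years for `groupby` and timestamps for `resample`.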
