# Introduction to operations on time columns

Time columns are one of the core data types you encounter in data analysis.

In [1]:
__author__ = 'zhenhang.sun@gmail.com'

In [2]:
pwd

'D:\\github\\pandas-tutorial'

In [3]:
import time

In [4]:
import pandas as pd

# 1. The datetime type

`datetime` is the core data type for working with time series in pandas.

## 1.1 Creating datetime values
#### `pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=True)`

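The most commonly used parameters are:
- arg: usually the column to parse as datetime. The raw values can be strings or integer timestamps (in s or ms).
- format: when arg holds strings, describes the layout of the time string.
- unit: when arg holds integers, states whether they are in s or ms.
- utc: whether the values should be treated as UTC (offset 0). Set it to True if so; otherwise leave it unset. We are usually in UTC+8.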
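As a small illustration of the `utc` parameter described above, the sketch below parses second-level timestamps as UTC and then shifts them to UTC+8 for display. The sample values and the `Asia/Shanghai` zone name are assumptions chosen for this example. It also shows `errors="coerce"` from the signature, which replaces unparseable input with `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical sample: epoch timestamps in seconds
s = pd.Series([1642244343, 1642244443])

# utc=True yields timezone-aware values in UTC
utc_times = pd.to_datetime(s, unit="s", utc=True)

# Convert to UTC+8 for display; "Asia/Shanghai" is an assumed zone name
local_times = utc_times.dt.tz_convert("Asia/Shanghai")

# errors="coerce" turns unparseable strings into NaT instead of raising
raw = pd.Series(["2022-01-10 12:34:07", "not a date"])
parsed = pd.to_datetime(raw, errors="coerce")

print(utc_times)
print(local_times)
print(parsed)
```

Keeping the stored values in UTC and converting only for display tends to avoid ambiguity when data comes from several time zones.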

In [5]:
df = pd.DataFrame({'str_time':['2020-01-01 12:03:05','2022-01-10 12:34:07','2022-01-31 12:05:09'],
                   'str_time2':['20220112 12:03:05','20220112 12:04:07','20220112 12:05:09'],
                   's_time': [int(time.time()), int(time.time()+100), int(time.time()+200)],
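                   # NOTE: the last entry is scaled by 100 rather than 1000, so it is not a valid
                   # millisecond timestamp; the outputs below (e.g. the 1975 date) reflect this.
                   'ms_time': [1000*int(time.time()), 1000*int(time.time()+100), 100*int(time.time()+200)],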
                  })
df

| | str_time | str_time2 | s_time | ms_time |
|---|---|---|---|---|
| 0 | 2020-01-01 12:03:05 | 20220112 12:03:05 | 1642244343 | 1642244343000 |
| 1 | 2022-01-10 12:34:07 | 20220112 12:04:07 | 1642244443 | 1642244443000 |
| 2 | 2022-01-31 12:05:09 | 20220112 12:05:09 | 1642244543 | 164224454300 |


In [6]:
# Strings in the default time format
df['utctime1'] = pd.to_datetime(df['str_time'])
df

| | str_time | str_time2 | s_time | ms_time | utctime1 |
|---|---|---|---|---|---|
| 0 | 2020-01-01 12:03:05 | 20220112 12:03:05 | 1642244343 | 1642244343000 | 2020-01-01 12:03:05 |
| 1 | 2022-01-10 12:34:07 | 20220112 12:04:07 | 1642244443 | 1642244443000 | 2022-01-10 12:34:07 |
| 2 | 2022-01-31 12:05:09 | 20220112 12:05:09 | 1642244543 | 164224454300 | 2022-01-31 12:05:09 |


In [7]:
# Strings in a non-default time format: format must be given
df['utctime2'] = pd.to_datetime(df['str_time2'], format="%Y%m%d %H:%M:%S")
df

| | str_time | str_time2 | s_time | ms_time | utctime1 | utctime2 |
|---|---|---|---|---|---|---|
| 0 | 2020-01-01 12:03:05 | 20220112 12:03:05 | 1642244343 | 1642244343000 | 2020-01-01 12:03:05 | 2022-01-12 12:03:05 |
| 1 | 2022-01-10 12:34:07 | 20220112 12:04:07 | 1642244443 | 1642244443000 | 2022-01-10 12:34:07 | 2022-01-12 12:04:07 |
| 2 | 2022-01-31 12:05:09 | 20220112 12:05:09 | 1642244543 | 164224454300 | 2022-01-31 12:05:09 | 2022-01-12 12:05:09 |


In [8]:
# Second-level timestamps
df['utctime3'] = pd.to_datetime(df['s_time'], unit='s')
df

| | str_time | str_time2 | s_time | ms_time | utctime1 | utctime2 | utctime3 |
|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 12:03:05 | 20220112 12:03:05 | 1642244343 | 1642244343000 | 2020-01-01 12:03:05 | 2022-01-12 12:03:05 | 2022-01-15 10:59:03 |
| 1 | 2022-01-10 12:34:07 | 20220112 12:04:07 | 1642244443 | 1642244443000 | 2022-01-10 12:34:07 | 2022-01-12 12:04:07 | 2022-01-15 11:00:43 |
| 2 | 2022-01-31 12:05:09 | 20220112 12:05:09 | 1642244543 | 164224454300 | 2022-01-31 12:05:09 | 2022-01-12 12:05:09 | 2022-01-15 11:02:23 |


In [9]:
# Millisecond-level timestamps; note the fractional seconds in the result
df['utctime4'] = pd.to_datetime(df['ms_time'], unit='ms')
df

| | str_time | str_time2 | s_time | ms_time | utctime1 | utctime2 | utctime3 | utctime4 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 12:03:05 | 20220112 12:03:05 | 1642244343 | 1642244343000 | 2020-01-01 12:03:05 | 2022-01-12 12:03:05 | 2022-01-15 10:59:03 | 2022-01-15 10:59:03.000 |
| 1 | 2022-01-10 12:34:07 | 20220112 12:04:07 | 1642244443 | 1642244443000 | 2022-01-10 12:34:07 | 2022-01-12 12:04:07 | 2022-01-15 11:00:43 | 2022-01-15 11:00:43.000 |
| 2 | 2022-01-31 12:05:09 | 20220112 12:05:09 | 1642244543 | 164224454300 | 2022-01-31 12:05:09 | 2022-01-12 12:05:09 | 2022-01-15 11:02:23 | 1975-03-16 17:54:14.300 |


In [10]:
df.dtypes

str_time             object
str_time2            object
s_time                int64
ms_time               int64
utctime1     datetime64[ns]
utctime2     datetime64[ns]
utctime3     datetime64[ns]
utctime4     datetime64[ns]
dtype: object

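# 2. Operations on time columns

Appending the `dt` accessor to a datetime column exposes the datetime operations.

## 2.1 Extracting datetime components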

In [11]:
# The attribute names are self-explanatory
df['utctime1'].dt.year
# df['utctime1'].dt.month
# df['utctime1'].dt.day
# df['utctime1'].dt.hour
# df['utctime1'].dt.minute
# df['utctime1'].dt.second
# df['utctime1'].dt.microsecond 

# df['utctime1'].dt.date  
# df['utctime1'].dt.time

0    2020
1    2022
2    2022
Name: utctime1, dtype: int64
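A common use of these components is to derive several feature columns at once. The sketch below rebuilds the three timestamps from `utctime1` as a standalone series (the names `ts` and `parts` are chosen for this example) and also uses `dt.dayofweek` and `dt.day_name()`, which are not shown above.

```python
import pandas as pd

# The same three timestamps as df['utctime1'], rebuilt so the block runs on its own
ts = pd.to_datetime(pd.Series([
    "2020-01-01 12:03:05",
    "2022-01-10 12:34:07",
    "2022-01-31 12:05:09",
]))

# Collect several components into one frame
parts = pd.DataFrame({
    "year": ts.dt.year,
    "month": ts.dt.month,
    "weekday": ts.dt.dayofweek,        # Monday=0 ... Sunday=6
    "weekday_name": ts.dt.day_name(),  # English names by default
    "date": ts.dt.date,                # Python date objects (object dtype)
})
print(parts)
```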

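## 2.2 Boolean checks

In [12]:
df['utctime1'].dt.is_leap_year  # whether the year is a leap year
# df['utctime1'].dt.is_month_start  # whether the date is the first day of the month
# df['utctime1'].dt.is_month_end  # whether the date is the last day of the month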

0     True
1    False
2    False
Name: utctime1, dtype: bool
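Because these checks return boolean series, they can be used directly as row filters. A minimal sketch on the same three timestamps, rebuilt here as `ts` so the block is self-contained:

```python
import pandas as pd

ts = pd.to_datetime(pd.Series([
    "2020-01-01 12:03:05",
    "2022-01-10 12:34:07",
    "2022-01-31 12:05:09",
]))

month_start = ts.dt.is_month_start  # True only for 2020-01-01
month_end = ts.dt.is_month_end      # True only for 2022-01-31

# Use the boolean series as a mask to keep matching rows
print(ts[month_start | month_end])
```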

## 2.3 Aligning to a frequency

Supported `freq` aliases: https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases

In [13]:
df['utctime1']

0   2020-01-01 12:03:05
1   2022-01-10 12:34:07
2   2022-01-31 12:05:09
Name: utctime1, dtype: datetime64[ns]

In [14]:
# Round up
df['utctime1'].dt.ceil(freq="H")

0   2020-01-01 13:00:00
1   2022-01-10 13:00:00
2   2022-01-31 13:00:00
Name: utctime1, dtype: datetime64[ns]

In [15]:
# Round down
df['utctime1'].dt.floor(freq="H")

0   2020-01-01 12:00:00
1   2022-01-10 12:00:00
2   2022-01-31 12:00:00
Name: utctime1, dtype: datetime64[ns]

In [16]:
# Round to the nearest
df['utctime1'].dt.round(freq="H")

0   2020-01-01 12:00:00
1   2022-01-10 13:00:00
2   2022-01-31 12:00:00
Name: utctime1, dtype: datetime64[ns]
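The same three methods accept other frequency aliases, including multiples. The sketch below, on a standalone copy of the timestamps, aligns to 15-minute buckets and to whole days; `"15min"` and `"D"` are taken from the offset alias list linked above.

```python
import pandas as pd

ts = pd.to_datetime(pd.Series([
    "2020-01-01 12:03:05",
    "2022-01-10 12:34:07",
    "2022-01-31 12:05:09",
]))

floor_15min = ts.dt.floor("15min")  # start of the enclosing 15-minute bucket
floor_day = ts.dt.floor("D")        # midnight of the same day
ceil_day = ts.dt.ceil("D")          # midnight of the following day

print(floor_15min)
print(floor_day)
print(ceil_day)
```

Flooring to a bucket is a convenient way to build a grouping key, for example before a `groupby`.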

# 3. Type conversion

In [17]:
# Convert to strings
df['utctime1'].dt.strftime('%m/%d/%Y %H:%M:%S')

0    01/01/2020 12:03:05
1    01/10/2022 12:34:07
2    01/31/2022 12:05:09
Name: utctime1, dtype: object

In [18]:
# Convert to integer timestamps at nanosecond resolution
df['utctime1'].values.astype('int64')

array([1577880185000000000, 1641818047000000000, 1643630709000000000],
      dtype=int64)
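Nanosecond integers are often more precision than downstream systems expect. A minimal sketch of reducing them to second-level timestamps and converting back, assuming the column has `datetime64[ns]` resolution (made explicit below so the arithmetic holds regardless of how the strings were parsed):

```python
import pandas as pd

# Make the nanosecond unit explicit before reading the raw integers
ts = pd.to_datetime(pd.Series(["2020-01-01 12:03:05"])).astype("datetime64[ns]")

ns = ts.values.astype("int64")  # nanoseconds since the Unix epoch
seconds = ns // 10**9           # integer division keeps the values exact

# Round trip: integer seconds back to datetime
restored = pd.to_datetime(seconds, unit="s")

print(seconds)
print(restored)
```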

# 4. The timedelta type

## 4.1 Creating timedeltas

##### Creating a single value
`class pandas.Timedelta(value=<object object>, unit=None, **kwargs)`
- kwargs: the usual way to build one. Available keywords: {days, seconds, microseconds, milliseconds, minutes, hours, weeks}.

##### Creating a list
`pd.to_timedelta(arg)`
- arg: a list of timedeltas
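Besides a list of `Timedelta` objects, `pd.to_timedelta` also accepts strings, and numbers together with a `unit`. The sketch below shows both forms; the sample values are made up for this example.

```python
import pandas as pd

# From strings: a day count and an HH:MM:SS duration
from_strings = pd.to_timedelta(["1 days", "02:30:00"])

# From numbers: unit states what each number means (here, hours)
from_numbers = pd.to_timedelta([1, 2, 3], unit="h")

print(from_strings)
print(from_numbers)
```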

In [19]:
td = pd.Timedelta(hours=3, minutes=5)
td

Timedelta('0 days 03:05:00')

In [20]:
df['utctime1']

0   2020-01-01 12:03:05
1   2022-01-10 12:34:07
2   2022-01-31 12:05:09
Name: utctime1, dtype: datetime64[ns]

In [21]:
df['utctime1'] + td

0   2020-01-01 15:08:05
1   2022-01-10 15:39:07
2   2022-01-31 15:10:09
Name: utctime1, dtype: datetime64[ns]

In [22]:
df['utctime1'] - td

0   2020-01-01 08:58:05
1   2022-01-10 09:29:07
2   2022-01-31 09:00:09
Name: utctime1, dtype: datetime64[ns]

#### The built-in `timedelta` can be used instead

In [23]:
from datetime import timedelta
df['utctime1'] + timedelta(hours=3, minutes=5)

0   2020-01-01 15:08:05
1   2022-01-10 15:39:07
2   2022-01-31 15:10:09
Name: utctime1, dtype: datetime64[ns]

In [24]:
tds = pd.to_timedelta([pd.Timedelta(hours=1),pd.Timedelta(minutes=1),pd.Timedelta(seconds=30)])
tds

TimedeltaIndex(['0 days 01:00:00', '0 days 00:01:00', '0 days 00:00:30'], dtype='timedelta64[ns]', freq=None)

In [25]:
df['utctime1']

0   2020-01-01 12:03:05
1   2022-01-10 12:34:07
2   2022-01-31 12:05:09
Name: utctime1, dtype: datetime64[ns]

In [26]:
df['utctime1'] + tds

0   2020-01-01 13:03:05
1   2022-01-10 12:35:07
2   2022-01-31 12:05:39
dtype: datetime64[ns]

## 4.2 Inspecting components and converting to seconds

In [27]:
# Inspect the components
td.components

Components(days=0, hours=3, minutes=5, seconds=0, milliseconds=0, microseconds=0, nanoseconds=0)

In [28]:
# Inspect the components
tds.components

| | days | hours | minutes | seconds | milliseconds | microseconds | nanoseconds |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 30 | 0 | 0 | 0 |


In [29]:
# Convert to seconds
td.total_seconds()

11100.0

In [30]:
tds.total_seconds()

Float64Index([3600.0, 60.00000000000001, 30.000000000000004], dtype='float64')
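The float results above carry small rounding noise (for example `60.00000000000001`). When whole units are enough, dividing by a one-unit `Timedelta` is one way to sidestep this. The sketch below rebuilds `td` and `tds` so it runs on its own and uses floor division to get integer seconds.

```python
import pandas as pd

td = pd.Timedelta(hours=3, minutes=5)
tds = pd.to_timedelta([
    pd.Timedelta(hours=1),
    pd.Timedelta(minutes=1),
    pd.Timedelta(seconds=30),
])

# Floor division by one second yields whole seconds as integers
whole_seconds = tds // pd.Timedelta(seconds=1)

# True division by another unit converts a single value, here to minutes
minutes = td / pd.Timedelta(minutes=1)

print(whole_seconds)
print(minutes)
```

Floor division discards any sub-second remainder, so it only suits cases where that loss is acceptable.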