- The goal is to extract only the year from a column that holds datetime values.
- As in the sample data below,  
  the 'date' column may contain a missing value or an error value (such as 0).  
  In that case there are two ways to extract the year,  
  depending on whether the affected rows can be dropped.

In [1]:
import pandas as pd
import numpy as np
import datetime
import re
# Silence SettingWithCopyWarning for the column assignments below
pd.set_option('mode.chained_assignment', None)

In [2]:
df = pd.read_excel("data/sample_data_date.xlsx")
print(df.shape)
df

(18, 2)


Unnamed: 0,date,price
0,1975-10-04 00:00:00,300
1,0,300
2,0,300
3,2003-09-06 00:00:00,500
4,2010-02-04 00:00:00,500
5,2010-02-04 00:00:00,400
6,2004-03-12 00:00:00,300
7,2003-11-07 00:00:00,400
8,2005-01-18 00:00:00,300
9,2011-06-29 00:00:00,300


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    18 non-null     object
 1   price   18 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 416.0+ bytes
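
The `object` dtype reported above suggests that the 'date' column mixes datetime values with something else. As a minimal sketch, assuming a small hypothetical frame `sample` that stands in for the Excel file (its values are illustrative, not the real data), the entries that are not dates could be counted like this:

```python
import datetime

import pandas as pd

# Hypothetical stand-in for the Excel sample: datetimes mixed with the error value 0.
sample = pd.DataFrame({
    "date": [datetime.datetime(1975, 10, 4), 0, 0, datetime.datetime(2003, 9, 6)],
    "price": [300, 300, 300, 500],
})

# True where the cell holds a datetime (pd.Timestamp is a datetime subclass as well).
is_date = sample["date"].map(lambda v: isinstance(v, datetime.datetime))

# Number of rows that would fail a datetime conversion.
n_bad = int((~is_date).sum())
print(n_bad)
```

Checking this count first makes it easier to decide whether dropping the affected rows (Method 1) is acceptable or whether they need to be kept (Method 2).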


### Method 1. Use the "datetime" package

- First, remove rows with error values (such as 0, which cannot be converted to datetime) from the 'date' column.
- Convert the 'date' column (object type) to datetime type and store it in a new 'date_1' column.
- Extract the year from the 'date_1' column.
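
The three steps above can be sketched on a small in-memory frame. This is only an illustration under stated assumptions: `sample` is a hypothetical stand-in for the Excel data, and `pd.to_datetime` is used here in place of a row-wise `strptime` call, which should give the same result when every remaining value is already a valid datetime.

```python
import datetime

import pandas as pd

# Hypothetical stand-in for the Excel sample: datetimes mixed with the error value 0.
sample = pd.DataFrame({
    "date": [datetime.datetime(1975, 10, 4), 0, 0, datetime.datetime(2003, 9, 6)],
    "price": [300, 300, 300, 500],
})

# Step 1: keep only rows whose 'date' is not the error value 0.
# .copy() returns an independent frame, so adding columns to it is unambiguous.
valid = sample[sample["date"] != 0].copy()

# Step 2: convert the remaining object values to a datetime64 column.
valid["date_1"] = pd.to_datetime(valid["date"])

# Step 3: read the year through the .dt accessor.
valid["year"] = valid["date_1"].dt.year
print(valid)
```

Note that the filtered frame keeps the original index labels, so the dropped rows show up as gaps in the index.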

In [4]:
# Drop rows with the error value 0; copy so new columns can be added safely
df1 = df[df['date'] != 0].copy()

print(df.shape)
print(df1.shape)
df1.head()

(18, 2)
(15, 2)


Unnamed: 0,date,price
0,1975-10-04 00:00:00,300
3,2003-09-06 00:00:00,500
4,2010-02-04 00:00:00,500
5,2010-02-04 00:00:00,400
6,2004-03-12 00:00:00,300


In [5]:
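# Parse the string form of each value into a datetime object
df1['date_1'] = df1['date'].apply(lambda x: datetime.datetime.strptime(str(x), "%Y-%m-%d %H:%M:%S"))
# Extract the year from the datetime column
df1['year'] = df1['date_1'].dt.year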

In [6]:
print(df1.shape)
df1

(15, 4)


Unnamed: 0,date,price,date_1,year
0,1975-10-04 00:00:00,300,1975-10-04,1975
3,2003-09-06 00:00:00,500,2003-09-06,2003
4,2010-02-04 00:00:00,500,2010-02-04,2010
5,2010-02-04 00:00:00,400,2010-02-04,2010
6,2004-03-12 00:00:00,300,2004-03-12,2004
7,2003-11-07 00:00:00,400,2003-11-07,2003
8,2005-01-18 00:00:00,300,2005-01-18,2005
9,2011-06-29 00:00:00,300,2011-06-29,2011
10,2011-06-29 00:00:00,500,2011-06-29,2011
11,2020-07-19 00:00:00,500,2020-07-19,2020


### Method 2. Use the "re" package

- Use this when rows with an error value in the 'date' column cannot be deleted,  
  because the information in the 'price' column must be preserved.
- Using the "re" package, keep only the digits of each 'date' value  
  and store the result in the 'date_1' column.
- Create a 'year' column from the first 4 digits of the 'date_1' column.
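
The steps above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: `sample` is a hypothetical stand-in for the Excel data, `digits_only` is a helper name introduced only for this example, and the final conversion to the nullable `Int64` dtype is an optional extra that keeps the year an integer even where the date is missing.

```python
import datetime
import re

import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel sample: datetimes mixed with the error value 0.
sample = pd.DataFrame({
    "date": [datetime.datetime(1975, 10, 4), 0, 0, datetime.datetime(2003, 9, 6)],
    "price": [300, 300, 300, 500],
})


def digits_only(value):
    """Return the digits of `value` as a string, or NaN for the error value 0."""
    if isinstance(value, (int, float)) and value == 0:
        return np.nan
    return re.sub(r"[^0-9]", "", str(value))


# Work on a copy so every row, including the error rows, is preserved.
kept = sample.copy()
kept["date_1"] = kept["date"].apply(digits_only)

# The first four digits are the year; error rows stay missing.
year_text = kept["date_1"].str[:4]

# Nullable Int64 stores integers alongside missing values.
kept["year"] = pd.to_numeric(year_text).astype("Int64")
print(kept)
```

All rows of `sample` survive, so the 'price' values of the error rows remain available for later analysis.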

In [7]:
df2 = df.copy()

print(df.shape)
print(df2.shape)
df2.head()

(18, 2)
(18, 2)


Unnamed: 0,date,price
0,1975-10-04 00:00:00,300
1,0,300
2,0,300
3,2003-09-06 00:00:00,500
4,2010-02-04 00:00:00,500


In [8]:
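# Keep only the digits of each valid date; the error value 0 becomes NaN
df2['date_1'] = np.where(df2['date'] != 0, df2['date'].apply(lambda x: re.sub(r'[^0-9]', '', str(x))), np.nan)
# Take the first 4 digits as the year; rows without a date stay NaN
df2['year'] = np.where(df2['date_1'].notnull(), df2['date_1'].str[0:4], np.nan)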

In [9]:
print(df2.shape)
df2

(18, 4)


Unnamed: 0,date,price,date_1,year
0,1975-10-04 00:00:00,300,19751004000000.0,1975.0
1,0,300,,
2,0,300,,
3,2003-09-06 00:00:00,500,20030906000000.0,2003.0
4,2010-02-04 00:00:00,500,20100204000000.0,2010.0
5,2010-02-04 00:00:00,400,20100204000000.0,2010.0
6,2004-03-12 00:00:00,300,20040312000000.0,2004.0
7,2003-11-07 00:00:00,400,20031107000000.0,2003.0
8,2005-01-18 00:00:00,300,20050118000000.0,2005.0
9,2011-06-29 00:00:00,300,20110629000000.0,2011.0
