---

<center><h1>📍 📍 Preprocessing Timeseries Data 📍 📍</h1></center>

---


**`Pandas`** contains extensive capabilities and features for working with time series data for all domains. Using the NumPy datetime64 and timedelta64 dtypes, pandas has consolidated a large number of features from other Python libraries like scikits.timeseries as well as created a tremendous amount of new functionality for manipulating time series data.

---

In [33]:
# importing the pandas library and datetime module of Python Standard Library
import pandas as pd
import datetime as dt

In [2]:
# read the data set
data = pd.read_csv('Time series Data/time_series.csv')

In [3]:
# view the top rows
data.head(5)

Unnamed: 0,ID,Datetime,Count
0,0,25-08-2012 00:00,8
1,1,25-08-2012 01:00,2
2,2,25-08-2012 02:00,6
3,3,25-08-2012 03:00,2
4,4,25-08-2012 04:00,2


---

***Data Types of columns***

---

In [4]:
data.dtypes

ID           int64
Datetime    object
Count        int64
dtype: object

***By Default: All datetime based columns are considered as strings.***

---

#### `CHANGE THE DATA TYPE TO DATETIME`

---

In [5]:
# change type to datetime
data.Datetime = pd.to_datetime(data.Datetime)

In [6]:
# check the data types again
data.dtypes

ID                   int64
Datetime    datetime64[ns]
Count                int64
dtype: object

In [9]:
data

Unnamed: 0,ID,Datetime,Count
0,0,2012-08-25 00:00:00,8
1,1,2012-08-25 01:00:00,2
2,2,2012-08-25 02:00:00,6
3,3,2012-08-25 03:00:00,2
4,4,2012-08-25 04:00:00,2
...,...,...,...
18283,18283,2014-09-25 19:00:00,868
18284,18284,2014-09-25 20:00:00,732
18285,18285,2014-09-25 21:00:00,702
18286,18286,2014-09-25 22:00:00,580


In [11]:
data.Datetime.apply(lambda x: x.month_name())

0           August
1           August
2           August
3           August
4           August
           ...    
18283    September
18284    September
18285    September
18286    September
18287    September
Name: Datetime, Length: 18288, dtype: object

In [None]:
data.Datetime.apply(lambda x: x.month_name())

---

***Let's see another data file with the time series data.***

---

In [12]:
data_2 = pd.read_csv('Time series Data/time_series_2.csv')

In [13]:
# view the top rows of the data
data_2

Unnamed: 0,ID,Datetime,Count
0,0,25 Aug 2012,8
1,1,25 Aug 2012,2
2,2,25 Aug 2012,6
3,3,25 Aug 2012,2
4,4,25 Aug 2012,2
...,...,...,...
18283,18283,25 Sep 2014,868
18284,18284,25 Sep 2014,732
18285,18285,25 Sep 2014,702
18286,18286,25 Sep 2014,580


---

***We can read the date time by specifying the format.***

***Here are some of the common used directives.***

----

| **Directive** | **Meaning**                                            |
| ---           | ---                                                    |
|  **%a**       | Weekday as locale’s abbreviated name.                  |
|  **%A**       | Weekday as locale’s full name.                         |  
|  **%d**       | Day of the month as a zero-padded decimal number.      |
|  **%b**       | Month as locale’s abbreviated name.	                 |
|  **%B**       | Month as locale’s full name.	                         |
|  **%m**       | Month as a zero-padded decimal number.                 |
|  **%y**       | Year without century as a zero-padded decimal number.  |
|  **%Y**       | Year with century as a decimal number.                 |
|  **%H**       | Hour (24-hour clock) as a zero-padded decimal number.  |

---

***You can read more about the other directives and datetime library here: https://docs.python.org/3/library/datetime.html***

---

In [21]:
# convert to datetime by specifying the data time format
data_2.Datetime = pd.to_datetime(data_2.Datetime, format="%d %b %Y ")

In [22]:
data_2.head()

Unnamed: 0,ID,Datetime,Count
0,0,2012-08-25,8
1,1,2012-08-25,2
2,2,2012-08-25,6
3,3,2012-08-25,2
4,4,2012-08-25,2


---

#### `TIME BASED FEATURES`

---


---

***Create features like `month` and `month_name` from the data.***

---

In [23]:
data

Unnamed: 0,ID,Datetime,Count,day_name,month,day_of_week,day_of_year
0,0,2012-08-25 00:00:00,8,Saturday,8,5,238
1,1,2012-08-25 01:00:00,2,Saturday,8,5,238
2,2,2012-08-25 02:00:00,6,Saturday,8,5,238
3,3,2012-08-25 03:00:00,2,Saturday,8,5,238
4,4,2012-08-25 04:00:00,2,Saturday,8,5,238
...,...,...,...,...,...,...,...
18283,18283,2014-09-25 19:00:00,868,Thursday,9,3,268
18284,18284,2014-09-25 20:00:00,732,Thursday,9,3,268
18285,18285,2014-09-25 21:00:00,702,Thursday,9,3,268
18286,18286,2014-09-25 22:00:00,580,Thursday,9,3,268


In [24]:
# create month and month_name 
data['month'] = data.Datetime.dt.month
data['month_name'] = data.Datetime.dt.month_name()


In [25]:
# view the data
data

Unnamed: 0,ID,Datetime,Count,day_name,month,day_of_week,day_of_year,month_name
0,0,2012-08-25 00:00:00,8,Saturday,8,5,238,August
1,1,2012-08-25 01:00:00,2,Saturday,8,5,238,August
2,2,2012-08-25 02:00:00,6,Saturday,8,5,238,August
3,3,2012-08-25 03:00:00,2,Saturday,8,5,238,August
4,4,2012-08-25 04:00:00,2,Saturday,8,5,238,August
...,...,...,...,...,...,...,...,...
18283,18283,2014-09-25 19:00:00,868,Thursday,9,3,268,September
18284,18284,2014-09-25 20:00:00,732,Thursday,9,3,268,September
18285,18285,2014-09-25 21:00:00,702,Thursday,9,3,268,September
18286,18286,2014-09-25 22:00:00,580,Thursday,9,3,268,September


---

***Create `day_name`, `day_of_week`, & `day_of_year`***

---

In [57]:
# create features
data['day_name'] = data.Datetime.dt.day_name()
data['month'] = data.Datetime.dt.month
data['day_of_week'] = data.Datetime.dt.dayofweek
data['day_of_year'] = data.Datetime.dt.dayofyear

In [58]:
# view the data
data.head()

Unnamed: 0,ID,Datetime,Count,day_name,month,day_of_week,day_of_year,month_name,today,day_difference,asia_timezone,utc_timezone
0,0,2012-08-25 00:00:00,8,Saturday,8,5,238,August,2021-07-18,3249,2012-08-25 00:00:00+05:30,2012-08-24 18:30:00+00:00
1,1,2012-08-25 01:00:00,2,Saturday,8,5,238,August,2021-07-18,3248,2012-08-25 01:00:00+05:30,2012-08-24 19:30:00+00:00
2,2,2012-08-25 02:00:00,6,Saturday,8,5,238,August,2021-07-18,3248,2012-08-25 02:00:00+05:30,2012-08-24 20:30:00+00:00
3,3,2012-08-25 03:00:00,2,Saturday,8,5,238,August,2021-07-18,3248,2012-08-25 03:00:00+05:30,2012-08-24 21:30:00+00:00
4,4,2012-08-25 04:00:00,2,Saturday,8,5,238,August,2021-07-18,3248,2012-08-25 04:00:00+05:30,2012-08-24 22:30:00+00:00




---

#### `DIFFERENCE BETWEEN 2 DATES`

---


***Add the current date in the new column***

---

In [28]:
data.head()

Unnamed: 0,ID,Datetime,Count,day_name,month,day_of_week,day_of_year,month_name
0,0,2012-08-25 00:00:00,8,Saturday,8,5,238,August
1,1,2012-08-25 01:00:00,2,Saturday,8,5,238,August
2,2,2012-08-25 02:00:00,6,Saturday,8,5,238,August
3,3,2012-08-25 03:00:00,2,Saturday,8,5,238,August
4,4,2012-08-25 04:00:00,2,Saturday,8,5,238,August


In [35]:
data['today'] = pd.to_datetime(dt.date.today())

In [36]:
data.head()

Unnamed: 0,ID,Datetime,Count,day_name,month,day_of_week,day_of_year,month_name,today
0,0,2012-08-25 00:00:00,8,Saturday,8,5,238,August,2021-07-18
1,1,2012-08-25 01:00:00,2,Saturday,8,5,238,August,2021-07-18
2,2,2012-08-25 02:00:00,6,Saturday,8,5,238,August,2021-07-18
3,3,2012-08-25 03:00:00,2,Saturday,8,5,238,August,2021-07-18
4,4,2012-08-25 04:00:00,2,Saturday,8,5,238,August,2021-07-18


In [37]:
difference_of_dates = data['today'] - data['Datetime']

In [38]:
difference_of_dates

0       3249 days 00:00:00
1       3248 days 23:00:00
2       3248 days 22:00:00
3       3248 days 21:00:00
4       3248 days 20:00:00
               ...        
18283   2487 days 05:00:00
18284   2487 days 04:00:00
18285   2487 days 03:00:00
18286   2487 days 02:00:00
18287   2487 days 01:00:00
Length: 18288, dtype: timedelta64[ns]

In [39]:
difference_of_dates.apply(lambda x: x.days)

0        3249
1        3248
2        3248
3        3248
4        3248
         ... 
18283    2487
18284    2487
18285    2487
18286    2487
18287    2487
Length: 18288, dtype: int64

In [40]:
data['day_difference'] = difference_of_dates.apply(lambda x: x.days)

In [41]:
data.head()

Unnamed: 0,ID,Datetime,Count,day_name,month,day_of_week,day_of_year,month_name,today,day_difference
0,0,2012-08-25 00:00:00,8,Saturday,8,5,238,August,2021-07-18,3249
1,1,2012-08-25 01:00:00,2,Saturday,8,5,238,August,2021-07-18,3248
2,2,2012-08-25 02:00:00,6,Saturday,8,5,238,August,2021-07-18,3248
3,3,2012-08-25 03:00:00,2,Saturday,8,5,238,August,2021-07-18,3248
4,4,2012-08-25 04:00:00,2,Saturday,8,5,238,August,2021-07-18,3248


---

### `CHALLENGES WITH TIME DATA`

---

#### DEALING WITH TIME ZONES
 
- If you have the dataset of a specific time zone. You can tell pandas about the local time zone and later you can convert it into different time zones.
- Use function `dt.tz_localize` to set the local time zone.
 
 
---

In [42]:
# set the current time as of Asia
data['asia_timezone'] = data.Datetime.dt.tz_localize('Asia/Calcutta')

In [43]:
# view the data
data.head()

Unnamed: 0,ID,Datetime,Count,day_name,month,day_of_week,day_of_year,month_name,today,day_difference,asia_timezone
0,0,2012-08-25 00:00:00,8,Saturday,8,5,238,August,2021-07-18,3249,2012-08-25 00:00:00+05:30
1,1,2012-08-25 01:00:00,2,Saturday,8,5,238,August,2021-07-18,3248,2012-08-25 01:00:00+05:30
2,2,2012-08-25 02:00:00,6,Saturday,8,5,238,August,2021-07-18,3248,2012-08-25 02:00:00+05:30
3,3,2012-08-25 03:00:00,2,Saturday,8,5,238,August,2021-07-18,3248,2012-08-25 03:00:00+05:30
4,4,2012-08-25 04:00:00,2,Saturday,8,5,238,August,2021-07-18,3248,2012-08-25 04:00:00+05:30


---

- Use the column asia_timezone and convert it into the UTC timezone. 
- Use the function `tz_convert` to convert the timezone.

---

In [44]:
# change the asia time zone to UTC
data['utc_timezone'] = data.asia_timezone.dt.tz_convert('UTC')

In [45]:
data.head()

Unnamed: 0,ID,Datetime,Count,day_name,month,day_of_week,day_of_year,month_name,today,day_difference,asia_timezone,utc_timezone
0,0,2012-08-25 00:00:00,8,Saturday,8,5,238,August,2021-07-18,3249,2012-08-25 00:00:00+05:30,2012-08-24 18:30:00+00:00
1,1,2012-08-25 01:00:00,2,Saturday,8,5,238,August,2021-07-18,3248,2012-08-25 01:00:00+05:30,2012-08-24 19:30:00+00:00
2,2,2012-08-25 02:00:00,6,Saturday,8,5,238,August,2021-07-18,3248,2012-08-25 02:00:00+05:30,2012-08-24 20:30:00+00:00
3,3,2012-08-25 03:00:00,2,Saturday,8,5,238,August,2021-07-18,3248,2012-08-25 03:00:00+05:30,2012-08-24 21:30:00+00:00
4,4,2012-08-25 04:00:00,2,Saturday,8,5,238,August,2021-07-18,3248,2012-08-25 04:00:00+05:30,2012-08-24 22:30:00+00:00


In [46]:
data[['asia_timezone', 'utc_timezone']]

Unnamed: 0,asia_timezone,utc_timezone
0,2012-08-25 00:00:00+05:30,2012-08-24 18:30:00+00:00
1,2012-08-25 01:00:00+05:30,2012-08-24 19:30:00+00:00
2,2012-08-25 02:00:00+05:30,2012-08-24 20:30:00+00:00
3,2012-08-25 03:00:00+05:30,2012-08-24 21:30:00+00:00
4,2012-08-25 04:00:00+05:30,2012-08-24 22:30:00+00:00
...,...,...
18283,2014-09-25 19:00:00+05:30,2014-09-25 13:30:00+00:00
18284,2014-09-25 20:00:00+05:30,2014-09-25 14:30:00+00:00
18285,2014-09-25 21:00:00+05:30,2014-09-25 15:30:00+00:00
18286,2014-09-25 22:00:00+05:30,2014-09-25 16:30:00+00:00


---

***Select a random date***

---

In [47]:
data['asia_timezone'][18287]

Timestamp('2014-09-25 23:00:00+0530', tz='Asia/Calcutta')

In [48]:
data['utc_timezone'][18287]

Timestamp('2014-09-25 17:30:00+0000', tz='UTC')

***You can see that time difference is 5 hours 30 minutes.***

---

---

#### `READING DATA WITH UNIX TIMESTAMP.`

- A UNIX timestamp is a way of storing a specific date and time. The timestamp is a ten digit number which represents the number of seconds that have passed since midnight on the 1st January 1970, UTC time.

---

In [51]:
# read data
data_with_unix_ts = pd.read_csv('Time series Data/data_with_timestamp.csv')

In [52]:
# view the data
data_with_unix_ts.head()

Unnamed: 0,ID,timestamp,Count
0,0,1345852800,8
1,1,1345856400,2
2,2,1345860000,6
3,3,1345863600,2
4,4,1345867200,2


In [55]:
# convert the unix timestamp to datetime.
data_with_unix_ts.timestamp = pd.to_datetime(data_with_unix_ts.timestamp, unit='s')

In [56]:
# view the top rows
data_with_unix_ts.head()

Unnamed: 0,ID,timestamp,Count
0,0,2012-08-25 00:00:00,8
1,1,2012-08-25 01:00:00,2
2,2,2012-08-25 02:00:00,6
3,3,2012-08-25 03:00:00,2
4,4,2012-08-25 04:00:00,2
