# **Introduction to Dates & Time with pandas**

This Jupyter notebook can be found on my GitHub account: https://github.com/mbonnemaison/Learning-Python/tree/master/Learning_pandas
### **pandas** is a Python library that facilitates the analysis of tabular data.

### Sources:
- Installation instructions, an introduction to pandas, and the user guide: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html
- Python for Data Analysis by Wes McKinney (2nd edition used here) - Chapter 5 (Introduction), Chapter 11 (Time Series)

## **Project presentation**
A detector placed at the kitchen entrance has been recording 6 parameters since December 1st:
- Humidity
- Brightness
- Temperature
- Movement (called Presence)
- UV
- Vibration (called Sabotage)

More information on this project here: https://github.com/mbonnemaison/adelego
### **Reading data from a csv file using pandas**

In [1]:
import pandas as pd

In [3]:
data = pd.read_csv("24h_2021-03-14.csv",  sep = '\t')

In [4]:
data

Unnamed: 0,Date,Equipment,Parameter,Value,Unit
0,2021-03-14 00:10:00,5MultiSensor 6 (ZW100),HUMIDITY,21000000000,%
1,2021-03-14 01:10:00,5MultiSensor 6 (ZW100),HUMIDITY,20750000000,%
2,2021-03-14 03:10:00,5MultiSensor 6 (ZW100),HUMIDITY,20,%
3,2021-03-14 03:25:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%
4,2021-03-14 03:40:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%
...,...,...,...,...,...
431,2021-03-14 22:55:00,5MultiSensor 6 (ZW100),UV,0,
432,2021-03-14 23:10:00,5MultiSensor 6 (ZW100),UV,0,
433,2021-03-14 23:25:00,5MultiSensor 6 (ZW100),UV,0,
434,2021-03-14 23:40:00,5MultiSensor 6 (ZW100),UV,0,


Link to user guide for **pd.read_csv()**: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv
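
As a minimal sketch of the `sep` and `parse_dates` parameters of `pd.read_csv()`, the example below reads a small tab-separated table from an in-memory string instead of the project file. The sample rows and the name `df_sample` are illustrative assumptions, not part of the project data.

```python
import io

import pandas as pd

# Illustrative tab-separated sample (not the project's CSV file)
sample = (
    "Date\tParameter\tValue\n"
    "2021-03-14 03:10:00\tHUMIDITY\t20\n"
    "2021-03-14 03:25:00\tHUMIDITY\t21\n"
)

# sep="\t" splits on tabs; parse_dates converts the 'Date' column while reading
df_sample = pd.read_csv(io.StringIO(sample), sep="\t", parse_dates=["Date"])
print(df_sample.dtypes)
```

With `parse_dates`, the "Date" column is read directly as a datetime column, so the later conversion with `pd.to_datetime` would not be needed for it.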

In [6]:
data.head()

Unnamed: 0,Date,Equipment,Parameter,Value,Unit
0,2021-03-14 00:10:00,5MultiSensor 6 (ZW100),HUMIDITY,21000000000,%
1,2021-03-14 01:10:00,5MultiSensor 6 (ZW100),HUMIDITY,20750000000,%
2,2021-03-14 03:10:00,5MultiSensor 6 (ZW100),HUMIDITY,20,%
3,2021-03-14 03:25:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%
4,2021-03-14 03:40:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 436 entries, 0 to 435
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Date       436 non-null    object
 1   Equipment  436 non-null    object
 2   Parameter  436 non-null    object
 3   Value      436 non-null    object
 4   Unit       258 non-null    object
dtypes: object(5)
memory usage: 17.2+ KB


### **Data manipulations in pandas**
**Select columns**

In [8]:
data['Date']
#The output is a Series, i.e. a 1-column table

0      2021-03-14 00:10:00
1      2021-03-14 01:10:00
2      2021-03-14 03:10:00
3      2021-03-14 03:25:00
4      2021-03-14 03:40:00
              ...         
431    2021-03-14 22:55:00
432    2021-03-14 23:10:00
433    2021-03-14 23:25:00
434    2021-03-14 23:40:00
435    2021-03-14 23:55:00
Name: Date, Length: 436, dtype: object

In [9]:
data[['Date', 'Value']]

Unnamed: 0,Date,Value
0,2021-03-14 00:10:00,21000000000
1,2021-03-14 01:10:00,20750000000
2,2021-03-14 03:10:00,20
3,2021-03-14 03:25:00,21
4,2021-03-14 03:40:00,21
...,...,...
431,2021-03-14 22:55:00,0
432,2021-03-14 23:10:00,0
433,2021-03-14 23:25:00,0
434,2021-03-14 23:40:00,0


**Select rows**

In [10]:
data.iloc[1:10]

Unnamed: 0,Date,Equipment,Parameter,Value,Unit
1,2021-03-14 01:10:00,5MultiSensor 6 (ZW100),HUMIDITY,20750000000,%
2,2021-03-14 03:10:00,5MultiSensor 6 (ZW100),HUMIDITY,20,%
3,2021-03-14 03:25:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%
4,2021-03-14 03:40:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%
5,2021-03-14 03:55:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%
6,2021-03-14 04:10:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%
7,2021-03-14 04:25:00,5MultiSensor 6 (ZW100),HUMIDITY,20,%
8,2021-03-14 04:40:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%
9,2021-03-14 04:55:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%


In [11]:
data.loc[(data['Date'] > '2021-03-14 15:00:00') & (data['Parameter'] == 'PRESENCE')]

Unnamed: 0,Date,Equipment,Parameter,Value,Unit
200,2021-03-14 15:58:46,5MultiSensor 6 (ZW100),PRESENCE,1,
201,2021-03-14 16:10:36,5MultiSensor 6 (ZW100),PRESENCE,0,
202,2021-03-14 16:14:21,5MultiSensor 6 (ZW100),PRESENCE,1,
203,2021-03-14 16:33:26,5MultiSensor 6 (ZW100),PRESENCE,0,
204,2021-03-14 18:00:18,5MultiSensor 6 (ZW100),PRESENCE,1,
205,2021-03-14 18:08:29,5MultiSensor 6 (ZW100),PRESENCE,0,
206,2021-03-14 18:13:13,5MultiSensor 6 (ZW100),PRESENCE,1,
207,2021-03-14 18:18:49,5MultiSensor 6 (ZW100),PRESENCE,0,
208,2021-03-14 18:34:28,5MultiSensor 6 (ZW100),PRESENCE,1,
209,2021-03-14 18:53:19,5MultiSensor 6 (ZW100),PRESENCE,0,
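
The row filter above combines two boolean masks with `&`. A minimal sketch of the same idea on a small, made-up table (`df_demo` and its rows are assumptions for illustration):

```python
import pandas as pd

# Small illustrative table, not the sensor data
df_demo = pd.DataFrame({
    "Date": ["2021-03-14 14:00:00", "2021-03-14 15:58:46", "2021-03-14 16:10:36"],
    "Parameter": ["PRESENCE", "PRESENCE", "UV"],
})

# Each comparison yields a boolean Series; & keeps rows where both are True.
# When written inline, each comparison needs its own parentheses because
# & binds more tightly than > and ==.
late = df_demo["Date"] > "2021-03-14 15:00:00"
presence = df_demo["Parameter"] == "PRESENCE"
selected = df_demo.loc[late & presence]
print(selected)
```

Comparing the dates as strings works here only because the "YYYY-MM-DD hh:mm:ss" layout sorts alphabetically in chronological order.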


**Select cells**

In [12]:
data['Date'][0:5]

0    2021-03-14 00:10:00
1    2021-03-14 01:10:00
2    2021-03-14 03:10:00
3    2021-03-14 03:25:00
4    2021-03-14 03:40:00
Name: Date, dtype: object

**Sort values**

In [None]:
data.sort_values(by = ["Parameter"], ascending=True)

**Count the different values in the column 'Parameter'**

In [13]:
data['Parameter'].value_counts()

HUMIDITY       86
TEMPERATURE    86
UV             86
BRIGHTNESS     86
PRESENCE       46
SABOTAGE       46
Name: Parameter, dtype: int64

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 436 entries, 0 to 435
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Date       436 non-null    object
 1   Equipment  436 non-null    object
 2   Parameter  436 non-null    object
 3   Value      436 non-null    object
 4   Unit       258 non-null    object
dtypes: object(5)
memory usage: 17.2+ KB


## **Introduction to Time & Dates**
Some of the elementary data structures for working with date & time data are:

- **Timestamp**: a specific instant in time
- **Timedelta**: a duration, i.e. the difference between two dates or times

### **Timestamp**

***Timestamp*** is the pandas equivalent of Python's datetime.datetime object and is interchangeable with it in most cases.
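
A short sketch of what this interchangeability looks like in practice (the variable names `ts_demo` and `py_dt` are illustrative):

```python
from datetime import datetime

import pandas as pd

ts_demo = pd.Timestamp("2021-03-14 03:25:00")

# Timestamp is a subclass of datetime.datetime
print(isinstance(ts_demo, datetime))

# Convert explicitly to a plain datetime.datetime when needed
py_dt = ts_demo.to_pydatetime()
print(py_dt)

# The two objects compare equal
print(ts_demo == datetime(2021, 3, 14, 3, 25))
```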

### **Convert strings to timestamps**
Strings can be converted to dates using **pd.to_datetime**.

Note: Information on format codes can be found here: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
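
As a hedged example of why the format codes matter, the string below is ambiguous (2 September or 9 February?) until a format is given. The date itself is made up for illustration.

```python
import pandas as pd

# Day-first string: an explicit format removes the ambiguity between
# 2 September and 9 February
ts_fmt = pd.to_datetime("02/09/2021 14:05", format="%d/%m/%Y %H:%M")
print(ts_fmt)
```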

In [15]:
mytimestamp = '2021/10/23 4:34:2'

In [16]:
mytimestamp

'2021/10/23 4:34:2'

In [17]:
mytimestamp_real = pd.to_datetime(mytimestamp)

In [18]:
mytimestamp_real

Timestamp('2021-10-23 04:34:02')

In [21]:
pd.to_datetime('2021-02-19 22:45:56', format = '%Y-%m-%d %H:%M:%S')

Timestamp('2021-02-19 22:45:56')

In [24]:
pd.to_datetime('20210223232323')

Timestamp('2021-02-23 23:23:23')

### **Convert a list of dates from string to Timestamp**

In [25]:
date_list_str = ['2021-03-14', '2020-12-25', '2025-02-19']

In [26]:
date_list_str

['2021-03-14', '2020-12-25', '2025-02-19']

In [29]:
pd.to_datetime(date_list_str)

### **Dealing with missing values**

In [31]:
date_list_str2 = ['2021-03-14', '2020-12-25', '2025-02-19', '2021-04-14', None]

In [32]:
date_list_str2

['2021-03-14', '2020-12-25', '2025-02-19', '2021-04-14', None]

In [33]:
pd.to_datetime(date_list_str2)

DatetimeIndex(['2021-03-14', '2020-12-25', '2025-02-19', '2021-04-14', 'NaT'], dtype='datetime64[ns]', freq=None)

**NaT** means "Not a Time"; it is the missing-value marker for datetime data.
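
A minimal sketch of how NaT values can be detected, dropped, or produced on purpose. The sample values are illustrative; `errors="coerce"` is an optional parameter of `pd.to_datetime` that turns unparsable entries into NaT instead of raising an error.

```python
import pandas as pd

dates_demo = pd.Series(pd.to_datetime(["2021-03-14", None, "2021-04-14"]))

# isna() flags NaT entries, just as it flags NaN in numeric data
print(dates_demo.isna())

# dropna() removes them
clean = dates_demo.dropna()
print(clean)

# errors="coerce" replaces values that cannot be parsed with NaT
coerced = pd.to_datetime(["2021-03-14", "not a date"], errors="coerce")
print(coerced)
```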

### **Convert values in the "Date" column from string to Timestamp**

In [34]:
data.head(10)

Unnamed: 0,Date,Equipment,Parameter,Value,Unit
0,2021-03-14 00:10:00,5MultiSensor 6 (ZW100),HUMIDITY,21000000000,%
1,2021-03-14 01:10:00,5MultiSensor 6 (ZW100),HUMIDITY,20750000000,%
2,2021-03-14 03:10:00,5MultiSensor 6 (ZW100),HUMIDITY,20,%
3,2021-03-14 03:25:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%
4,2021-03-14 03:40:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%
5,2021-03-14 03:55:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%
6,2021-03-14 04:10:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%
7,2021-03-14 04:25:00,5MultiSensor 6 (ZW100),HUMIDITY,20,%
8,2021-03-14 04:40:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%
9,2021-03-14 04:55:00,5MultiSensor 6 (ZW100),HUMIDITY,21,%


In [35]:
data['Date']

0      2021-03-14 00:10:00
1      2021-03-14 01:10:00
2      2021-03-14 03:10:00
3      2021-03-14 03:25:00
4      2021-03-14 03:40:00
              ...         
431    2021-03-14 22:55:00
432    2021-03-14 23:10:00
433    2021-03-14 23:25:00
434    2021-03-14 23:40:00
435    2021-03-14 23:55:00
Name: Date, Length: 436, dtype: object

In [36]:
data['Date'] = pd.to_datetime(data["Date"])

In [37]:
data["Date"]

0     2021-03-14 00:10:00
1     2021-03-14 01:10:00
2     2021-03-14 03:10:00
3     2021-03-14 03:25:00
4     2021-03-14 03:40:00
              ...        
431   2021-03-14 22:55:00
432   2021-03-14 23:10:00
433   2021-03-14 23:25:00
434   2021-03-14 23:40:00
435   2021-03-14 23:55:00
Name: Date, Length: 436, dtype: datetime64[ns]

In [38]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 436 entries, 0 to 435
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       436 non-null    datetime64[ns]
 1   Equipment  436 non-null    object        
 2   Parameter  436 non-null    object        
 3   Value      436 non-null    object        
 4   Unit       258 non-null    object        
dtypes: datetime64[ns](1), object(4)
memory usage: 17.2+ KB
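
Once a column has a datetime dtype, the `.dt` accessor exposes date and time components for the whole column. A small sketch on made-up rows (`df_dt` is an illustrative name, not the sensor table):

```python
import pandas as pd

df_dt = pd.DataFrame(
    {"Date": pd.to_datetime(["2021-03-14 03:10:00", "2021-03-14 22:55:00"])}
)

# .dt gives access to datetime components for every row at once
df_dt["Hour"] = df_dt["Date"].dt.hour
df_dt["Weekday"] = df_dt["Date"].dt.day_name()
print(df_dt)
```

This is not available while the column is still stored as strings (dtype `object`), which is one practical reason for the conversion.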


***Missing values in DataFrame...***

In [None]:
dataNaT = pd.read_csv("24h_2021-03-14_NaT.csv", sep = '\t')

In [None]:
dataNaT.head(10)

In [None]:
dataNaT.info()

In [None]:
dataNaT["Date"] = pd.to_datetime(dataNaT["Date"])

In [None]:
dataNaT.info()

In [None]:
dataNaT.head(10)

In [None]:
dataNaT["Date"][33]

### **Generate Timestamps at fixed frequency**
A *fixed-frequency* series consists of data points that occur at regular intervals, for example every 5 minutes.

In [45]:
tsff = pd.date_range(start = '1/1/2021', periods = 50, freq = '4h')

In [46]:
tsff

DatetimeIndex(['2021-01-01 00:00:00', '2021-01-01 04:00:00',
               '2021-01-01 08:00:00', '2021-01-01 12:00:00',
               '2021-01-01 16:00:00', '2021-01-01 20:00:00',
               '2021-01-02 00:00:00', '2021-01-02 04:00:00',
               '2021-01-02 08:00:00', '2021-01-02 12:00:00',
               '2021-01-02 16:00:00', '2021-01-02 20:00:00',
               '2021-01-03 00:00:00', '2021-01-03 04:00:00',
               '2021-01-03 08:00:00', '2021-01-03 12:00:00',
               '2021-01-03 16:00:00', '2021-01-03 20:00:00',
               '2021-01-04 00:00:00', '2021-01-04 04:00:00',
               '2021-01-04 08:00:00', '2021-01-04 12:00:00',
               '2021-01-04 16:00:00', '2021-01-04 20:00:00',
               '2021-01-05 00:00:00', '2021-01-05 04:00:00',
               '2021-01-05 08:00:00', '2021-01-05 12:00:00',
               '2021-01-05 16:00:00', '2021-01-05 20:00:00',
               '2021-01-06 00:00:00', '2021-01-06 04:00:00',
               '2021-01-

## **Timedeltas**
Timedelta represents the temporal difference between two datetime objects.

In [39]:
pd.Timedelta(weeks = 1, days = 4, hours = 5)

Timedelta('11 days 05:00:00')

### **Timedelta operations**
**Add time to Timestamps**

In [44]:
ts = pd.to_datetime('2021/3/23 3:20:00') + pd.Timedelta(days=3, hours = 7)

In [45]:
ts

Timestamp('2021-03-26 10:20:00')

**Difference between Timestamps generates a Timedelta**

In [46]:
delta = pd.to_datetime('2021/3/23 23:20:00') - pd.to_datetime('2021/3/20 2:34:14')

In [47]:
delta

Timedelta('3 days 20:45:46')

**Adding Timedeltas**

In [48]:
td1 = pd.Timedelta(weeks = 3, days = 3, hours = 3)
td2 = pd.Timedelta(weeks = 1, days = 1, hours = 1)

In [49]:
td1+td2

Timedelta('32 days 04:00:00')

### **Convert strings to Timedelta**

In [50]:
pd.to_timedelta('45:53:23')

Timedelta('1 days 21:53:23')
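
A brief sketch of how a Timedelta can be read back as numbers, and of building one from a number plus a unit (`td_demo` is an illustrative name):

```python
import pandas as pd

td_demo = pd.to_timedelta("45:53:23")

# Whole days contained in the duration
print(td_demo.days)

# Total duration expressed in seconds
print(td_demo.total_seconds())

# to_timedelta also accepts a number together with a unit
ninety_min = pd.to_timedelta(90, unit="min")
print(ninety_min)
```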

## **Going further**
### ***Time periods*** 

A *Period* represents a span of time with a fixed frequency, such as a month or a year; it can be thought of as a special case of an interval.

Example of periods: the month of March 2021 or the year 2020
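
The month of March 2021 can be sketched as a monthly Period as follows (the name `march` is illustrative):

```python
import pandas as pd

march = pd.Period("2021-03", freq="M")

# A Period covers a whole span, with a start and an end
print(march.start_time)
print(march.end_time)

# Arithmetic moves in units of the period's frequency (here, months)
next_month = march + 1
print(next_month)
```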

### **Generate Time Periods**

In [86]:
tp = pd.Period(2020, freq='A-OCT')
#A-OCT is an annual frequency whose year ends in October: this period runs from 11/1/2019 to 10/31/2020.

In [87]:
tp

Period('2020', 'A-OCT')

### **Generate Time Periods at fixed frequency**

In [84]:
tp2 = pd.period_range(start='2000-01-01', end='2020-01-01', freq='A-OCT')

In [85]:
tp2

PeriodIndex(['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
             '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015',
             '2016', '2017', '2018', '2019', '2020'],
            dtype='period[A-OCT]', freq='A-OCT')

## **Practice**

In [51]:
us_cities = pd.read_csv('top12.csv')

In [52]:
us_cities

Unnamed: 0,Cities,State,Population,Density(/sq mi),Incorporated
0,Los Angeles,California,3979576,8484,4/4/1850
1,Chicago,Illinois,2693976,11900,3/4/1837
2,Houston,Texas,2320268,3613,6/5/1837
3,Phoenix,Arizona,1680992,3120,2/25/1881
4,Philadelphia,Pennsylvania,1584064,11683,10/25/1701
5,San Antonio,Texas,1547253,3238,6/5/1837
6,San Diego,California,1423851,4325,3/27/1850
7,Dallas,Texas,1343573,3866,2/2/1856
8,San Jose,California,1021795,5777,3/27/1850
9,Austin,Texas,978908,3031,12/27/1839


In [53]:
us_cities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Cities           11 non-null     object
 1   State            11 non-null     object
 2   Population       11 non-null     int64 
 3   Density(/sq mi)  11 non-null     int64 
 4   Incorporated     11 non-null     object
dtypes: int64(2), object(3)
memory usage: 568.0+ bytes


**Question 1**: How would you convert the Incorporated date from string to Timestamp?

In [54]:
us_cities['Incorporated'] = pd.to_datetime(us_cities['Incorporated'], format= '%m/%d/%Y')
us_cities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Cities           11 non-null     object        
 1   State            11 non-null     object        
 2   Population       11 non-null     int64         
 3   Density(/sq mi)  11 non-null     int64         
 4   Incorporated     11 non-null     datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 568.0+ bytes


**Question 2**: How many days separate the incorporation dates of Philadelphia and Dallas?

In [55]:
us_cities

Unnamed: 0,Cities,State,Population,Density(/sq mi),Incorporated
0,Los Angeles,California,3979576,8484,1850-04-04
1,Chicago,Illinois,2693976,11900,1837-03-04
2,Houston,Texas,2320268,3613,1837-06-05
3,Phoenix,Arizona,1680992,3120,1881-02-25
4,Philadelphia,Pennsylvania,1584064,11683,1701-10-25
5,San Antonio,Texas,1547253,3238,1837-06-05
6,San Diego,California,1423851,4325,1850-03-27
7,Dallas,Texas,1343573,3866,1856-02-02
8,San Jose,California,1021795,5777,1850-03-27
9,Austin,Texas,978908,3031,1839-12-27


In [56]:
us_cities['Incorporated'][7] - us_cities['Incorporated'][4]

Timedelta('56347 days 00:00:00')
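
As an alternative sketch, the dates can be looked up by city name instead of row position, and the result reduced to a plain integer with `.days`. The small Series `inc` is built here for illustration from the two dates in the table above.

```python
import pandas as pd

# Incorporation dates indexed by city name (values taken from the table above)
inc = pd.Series(
    pd.to_datetime(["1701-10-25", "1856-02-02"]),
    index=["Philadelphia", "Dallas"],
)

gap = inc["Dallas"] - inc["Philadelphia"]

# .days gives the whole number of days as a plain integer
print(gap.days)
```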

## **Problems**
### **Problem 1: Timestamp limitation**
New York City was incorporated on September 2nd 1664. Convert this date into a Timestamp.

In [57]:
NYC = pd.to_datetime('9-2-1664')

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1664-09-02 00:00:00

Timestamp limitations: https://pandas-docs.github.io/pandas-docs-travis/user_guide/timeseries.html#timeseries-timestamp-limits
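
A hedged sketch of the limits and of one possible workaround that stays within pandas: a `Period` with daily frequency is not bound by the nanosecond Timestamp range. The name `nyc_period` is illustrative.

```python
import pandas as pd

# Bounds of the nanosecond-resolution Timestamp
print(pd.Timestamp.min)
print(pd.Timestamp.max)

# A daily Period can represent a date outside those bounds
nyc_period = pd.Period("1664-09-02", freq="D")
print(nyc_period)
```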

#### Python ***datetime*** module
Python provides date and time functionality in the **datetime** module, which contains three commonly used classes:

- **date**: to work with dates (year, month, day)
- **time**: to work with times (hours, minutes, seconds, microseconds)
- **datetime**: to work with both date and time components
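
A minimal sketch showing how the three classes relate to each other (the values and the names `d`, `t`, `dt` are illustrative):

```python
from datetime import date, datetime, time

d = date(1664, 9, 2)   # calendar date only
t = time(14, 57, 0)    # time of day only

# combine() builds a datetime from a date and a time
dt = datetime.combine(d, t)
print(dt)

# The parts can be recovered from the datetime
print(dt.date(), dt.time())
```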

In [68]:
from datetime import datetime
NYC2 = datetime(1664,9,2)

In [None]:
NYC2

***Convert strings to datetime.datetime objects***

In [None]:
NYC3 = datetime.strptime('2/9/1664', '%d/%m/%Y')

In [None]:
NYC3

***Working with a list of dates***

In [None]:
date_list_str = ['2021-03-14', '2020-12-25', '2025-02-19']

In [None]:
[datetime.strptime(x, '%Y-%m-%d') for x in date_list_str]

***Convert Incorporated dates into datetime.datetime objects***

In [70]:
us_cities = pd.read_csv('top12.csv')

In [None]:
us_cities.info()

In [71]:
[datetime.strptime(x, '%m/%d/%Y') for x in us_cities['Incorporated']]

[datetime.datetime(1850, 4, 4, 0, 0),
 datetime.datetime(1837, 3, 4, 0, 0),
 datetime.datetime(1837, 6, 5, 0, 0),
 datetime.datetime(1881, 2, 25, 0, 0),
 datetime.datetime(1701, 10, 25, 0, 0),
 datetime.datetime(1837, 6, 5, 0, 0),
 datetime.datetime(1850, 3, 27, 0, 0),
 datetime.datetime(1856, 2, 2, 0, 0),
 datetime.datetime(1850, 3, 27, 0, 0),
 datetime.datetime(1839, 12, 27, 0, 0),
 datetime.datetime(1832, 2, 9, 0, 0)]

### **Problem 2: Time zone**
What time is it now?

In [65]:
now = pd.to_datetime('now')

In [66]:
now

Timestamp('2021-04-09 18:53:47.048442')

In [None]:
now_utc = now.tz_localize('UTC')

In [None]:
now_utc

In [63]:
now_est = now_utc.tz_convert('US/Eastern')

In [64]:
now_est

Timestamp('2021-04-09 14:48:46.600075-0400', tz='US/Eastern')
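
Because "now" changes on every run, the same two steps are sketched below on a fixed timestamp. The zone name "America/New_York" is an assumption used here in place of 'US/Eastern'; both refer to US Eastern time.

```python
import pandas as pd

naive = pd.Timestamp("2021-04-09 18:53:47")

# tz_localize attaches a time zone without changing the clock reading
utc = naive.tz_localize("UTC")

# tz_convert expresses the same instant in another time zone
eastern = utc.tz_convert("America/New_York")
print(eastern)
```

The converted value shows a different clock time, yet `eastern == utc` holds because both describe the same instant.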

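The `now` value above is time-zone naive, so it must first be given a zone with `tz_localize` before `tz_convert` can express it in another zone. By comparison, some databases such as MySQL convert TIMESTAMP values from the current time zone to UTC for storage, and back from UTC to the current time zone for retrieval, where the current time zone of each connection defaults to the server's time. This does not occur for other types such as DATETIME.

#### Python **datetime** module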

In [72]:
now = datetime.now()

In [73]:
now

datetime.datetime(2021, 4, 9, 14, 57, 0, 722080)

In [74]:
now.date()

datetime.date(2021, 4, 9)

In [75]:
now.time()

datetime.time(14, 57, 0, 722080)

In [77]:
now.hour

14