
## Lab 7: Working with Dates and Times 



**Due Date: Friday March 27, 2020 at 11:59 AM.**

`pandas` supports many methods for manipulating dates and times. For Project 1, we will need to understand timestamps of data scraped from the internet. While we could work with dates and times as strings, it will be helpful for us to understand:

* Objects for storing dates / times 
* Methods for accessing and modifying dates / times
* Switching between timezones

We will use a financial dataset consisting of information about stock movements.

In [None]:
from datetime import datetime, timedelta
import pytz

import numpy as np
import pandas as pd

import sys
from IPython.display import Image

In [None]:
# TEST 

assert 'pandas' in sys.modules and "pd" in locals()
assert 'numpy' in sys.modules and "np" in locals()
assert 'pytz' in sys.modules 
assert 'datetime' in sys.modules 

### 1. Date and Time Data Types

Before studying dates and times in the `pandas` package, we will try to understand objects for storing dates and times. We will use the `datetime` module from the Python standard library.

Each `datetime` object consists of year, month and day in the [Gregorian calendar](https://en.wikipedia.org/wiki/Gregorian_calendar).  

In [None]:
now = datetime.now()
print(now.year, now.month, now.day)


Moreover, `datetime` objects can store the time of day as hours, minutes, seconds, and microseconds.

In [None]:
print(now.hour, now.minute, now.second, now.microsecond)

The `datetime` classes [overload](https://en.wikipedia.org/wiki/Operator_overloading) arithmetic operators, so we can apply operations like `+` and `-` directly to dates and times.

For example, we can determine the elapsed time between two dates by applying subtraction.

In [None]:
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
print("Days ", delta.days)
print("Seconds ", delta.seconds)

Note that the difference between two dates is a `timedelta` object.

In [None]:
type(delta)
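
One detail that is easy to miss: the `seconds` attribute of a `timedelta` holds only the part of the difference left over after the whole days have been counted. If we want the entire difference expressed in seconds, we can call `total_seconds()`. The following is a small sketch using the same two dates as above.

```python
from datetime import datetime

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

# `seconds` is the remainder after removing whole days (always less than 86400)
leftover_seconds = delta.seconds

# `total_seconds()` expresses the entire difference in seconds
total_seconds = delta.total_seconds()

print("Days", delta.days)
print("Leftover seconds", leftover_seconds)
print("Total seconds", total_seconds)
```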

We can incorporate `timedelta` objects into arithmetic operations.

In [None]:
start = datetime(2011, 1, 7)
finish = start + 2 * timedelta(days = 4)
print(finish.year, finish.month, finish.day)

### 1.1 Converting between string and datetime

We can cast a string to a `datetime` object or cast a `datetime` object to a string. For example, if we print a `datetime` object then we get a representation as a string.  

In [None]:
stamp = datetime(2011, 1, 3)
str(stamp)

We can adjust the formatting with the following format codes:

%Y 4-digit year   
%y 2-digit year  
%m 2-digit month [01, 12]  
%d 2-digit day [01, 31]  
%H Hour (24-hour clock) [00, 23]  
%I Hour (12-hour clock) [01, 12]  
%M 2-digit minute [00, 59]  
%S Second [00, 61] (seconds 60, 61 account for leap seconds)  
%w Weekday as integer [0 (Sunday), 6]  
%z UTC time zone offset as +HHMM or -HHMM, empty if time zone naive  


In [None]:
print("Format: (4 digit year) - (2 digit month) - (2 digit day)\n",stamp.strftime('%Y-%m-%d'))

If we know the format of a string, then we can convert it to a `datetime` object.

In [None]:
str_date = '201101/03'
datetime.strptime(str_date, '%Y%m/%d')
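
The same format string works in both directions, so parsing a string with `strptime` and formatting the result with `strftime` should return the original text. The sketch below uses a made-up timestamp string (the value of `raw` is only an illustration) and then prints the same moment with a 2-digit year and a 12-hour clock.

```python
from datetime import datetime

# Hypothetical timestamp: day/month/year followed by a 24-hour time
raw = '03/01/2011 14:05:09'
fmt = '%d/%m/%Y %H:%M:%S'

# String -> datetime
parsed = datetime.strptime(raw, fmt)

# datetime -> string with the same format recovers the original text
round_trip = parsed.strftime(fmt)

# A different format: 2-digit year and 12-hour clock
short = parsed.strftime('%y-%m-%d %I:%M')

print(parsed)
print(round_trip == raw)
print(short)
```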

### 1.2 Using datetime with pandas

Often we want to use `datetime` objects in the index of a `pandas` series or dataframe.

In [None]:
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]

ts = pd.Series(np.random.randn(6), index=dates)
ts

Note that the index is a `DatetimeIndex`, meaning a `pandas` data structure that supports operations like subtraction.

In [None]:
print("Index type", type(ts.index))

Alternatively, we can skip the `datetime` package and convert from strings using the `pd.to_datetime` function.

In [None]:
datestrs = ['7/6/2011', '8/6/2011']
pd.to_datetime(datestrs)
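
A string like `'7/6/2011'` is ambiguous: it could mean July 6 or June 7. By default `pd.to_datetime` reads the month first. If we know the strings put the day first, we can say so explicitly with the `format` argument. A minimal sketch, reusing the strings above:

```python
import pandas as pd

datestrs = ['7/6/2011', '8/6/2011']

# Default behavior: the first number is read as the month
month_first = pd.to_datetime(datestrs)

# Explicit format: the first number is read as the day
day_first = pd.to_datetime(datestrs, format='%d/%m/%Y')

print(month_first[0], day_first[0])
```

Stating the format explicitly also protects us from silent misreadings when a file mixes values like `'13/6/2011'` and `'7/6/2011'`.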

Note that we can handle missing dates like missing numbers. Instead of `NaN` we have `NaT`.

In [None]:
date_index = pd.to_datetime(datestrs + [None])
date_index

We can check for missing values with `isna`.

In [None]:
pd.isna(date_index)
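
Scraped data sometimes contains entries that are not dates at all. As a sketch of one way to handle this, we can pass `errors='coerce'` so that entries that fail to parse become `NaT`, then count or drop them. The list `raw` below is a made-up example.

```python
import pandas as pd

# Hypothetical input with one entry that is not a date
raw = ['7/6/2011', 'not a date', '8/6/2011']

# Entries that do not match the format become NaT instead of raising an error
parsed = pd.to_datetime(raw, format='%m/%d/%Y', errors='coerce')

n_missing = pd.isna(parsed).sum()
cleaned = parsed.dropna()

print(parsed)
print("Missing:", n_missing)
print(cleaned)
```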

### 2. Date Ranges

Often we want to generate many dates according to a pattern. 

#### 2.1 Ranges

If we have a starting date and ending date, then we can fill in intermediate dates according to a frequency. 

In [None]:
date_index = pd.date_range('4/1/2012', '6/1/2012', freq="D")
date_index[:3]

If we have a starting date, then we can add a certain number of periods according to a frequency. 

In [None]:
pd.date_range(start='4/1/2012', periods=20, freq="M")[:3]
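
The `freq` argument accepts many aliases besides `"D"` and `"M"`. As a small illustration, `"MS"` generates the first day of each month and `"B"` generates business days (Monday through Friday). Note that April 1, 2012 fell on a Sunday, so the business-day range begins on April 2.

```python
import pandas as pd

# First day of each month
month_starts = pd.date_range(start='4/1/2012', periods=3, freq='MS')

# Business days only; weekends are skipped
business_days = pd.date_range(start='4/1/2012', periods=3, freq='B')

print(month_starts)
print(business_days)
```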

#### 2.2 Access

We have different ways to access the dates. For example, we could specify either a string or a `datetime` object corresponding to a particular entry.

In [None]:
ts = pd.Series(np.random.randn(1000),
                   index=pd.date_range('1/1/2000', periods=1000))
ts.head() 

In [None]:
ts['2000-01-10']

In [None]:
ts[datetime(2000, 1, 7)]

If we want a collection of dates, then we could specify less information.

In [None]:
ts['2001'].head()

In [None]:
ts['2001-05'].head()  

Or we could specify a slice of dates like a slice of numbers.

In [None]:
ts[datetime(2001, 1, 7):].head()

In [None]:
ts['1/6/2001':'2/10/2001'].head()
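
Unlike slices of a Python list, slices of dates include both endpoints. The sketch below checks this with `.loc` on a series of the same shape as above, filled with consecutive integers (instead of random numbers) so that the results are predictable.

```python
import numpy as np
import pandas as pd

# Same index as above, but with predictable values 0, 1, 2, ...
ts_demo = pd.Series(np.arange(1000),
                    index=pd.date_range('1/1/2000', periods=1000))

# Both endpoints are included: January 6, 7 and 8
three_days = ts_demo.loc['2001-01-06':'2001-01-08']

# Partial strings select every entry in that month or year
may_2001 = ts_demo.loc['2001-05']
year_2001 = ts_demo.loc['2001']

print(len(three_days), len(may_2001), len(year_2001))
```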

### 3. Time Zones

We will use the `pytz` package to deal with timezones. The package recognizes timezones according to strings.

In [None]:
pytz.common_timezones[-5:]

In [None]:
est_tz = pytz.timezone('US/Eastern')
est_tz
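
A timezone object can also be attached to a single `datetime` object. With `pytz`, we do this through the timezone's `localize` method. The sketch below, with dates chosen only for illustration, shows that the offset from UTC for `US/Eastern` depends on the time of year because of daylight saving time.

```python
from datetime import datetime
import pytz

eastern = pytz.timezone('US/Eastern')

# Attach the timezone to naive datetimes in winter and in summer
winter = eastern.localize(datetime(2012, 1, 15, 9, 30))
summer = eastern.localize(datetime(2012, 7, 15, 9, 30))

# Offsets from UTC in hours: standard time versus daylight saving time
winter_offset = winter.utcoffset().total_seconds() / 3600
summer_offset = summer.utcoffset().total_seconds() / 3600

# The same winter moment expressed in UTC
winter_utc = winter.astimezone(pytz.utc)

print(winter_offset, summer_offset)
print(winter_utc)
```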

By default a `DatetimeIndex` does not have an associated timezone. We can check by accessing the `tz` attribute of the index. 

In [None]:
rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
print(rng.tz)

In [None]:
rng = pd.date_range('3/9/2012 9:30', periods=10, freq='D', tz='UTC')
print(rng.tz)

We can add timezones to an existing `DatetimeIndex` using the `tz_localize` method.

In [None]:
rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
rng_utc = rng.tz_localize('UTC')
rng_utc[:3]

If we have included the timezone, then we can convert using the `tz_convert` method. 

In [None]:
rng_utc.tz_convert('US/Eastern')[:3]
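
It helps to keep the two methods apart: `tz_localize` attaches a timezone to times that had none, while `tz_convert` re-expresses the same moments on a different clock. The following sketch rebuilds the range above and checks that converting changes the displayed hour but not the underlying moments. The shift from 04:30 to 05:30 happens because daylight saving time began in the US on March 11, 2012.

```python
import pandas as pd

rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')

# Attach UTC to the naive times; the clock readings stay at 9:30
rng_utc = rng.tz_localize('UTC')

# Re-express the same moments on the US/Eastern clock
rng_eastern = rng_utc.tz_convert('US/Eastern')

# The clock readings differ, but the moments in time are identical
same_moments = (rng_utc == rng_eastern).all()

print(rng_eastern[:3])
print(same_moments)
```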

## Questions 

We have stock market data in ```raw_data.csv```.

In [None]:
df_raw = pd.read_csv("raw_data.csv")

df_raw.head(3)

Note that the entries of the column ```times_of_trade``` are strings representing dates and times in the form ```dd-mm-yyyy hh:mm:ss```. Although the trades took place in the `PST` timezone, the times are recorded in the `GMT` timezone.

Generate another ```pd.DataFrame``` called ```df``` from ```df_raw``` through the following operations:

 - Rename the column ```times_of_trade``` to ```Time```
 - Use ```pd.to_datetime``` to convert each string in ```Time```. Remember that the format is ```dd-mm-yyyy hh:mm:ss```
 - Add the timezone `UTC`, then convert the timezone to `US/Pacific`
 - Apply ```sort_values``` to sort by the entries in ```Time```
 - Invoke ```set_index``` to set the index to be ```Time```

In [None]:
Image("table.PNG")

In [None]:
df = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# TEST 

assert set(df.columns) == {'Volume', "Price"}


In [None]:
# TEST 

assert df.index[0] < df.index[1]


In [None]:
# TEST 

assert isinstance(df.index, pd.DatetimeIndex)
