In [39]:
import pandas as pd
from datetime import timedelta

%pylab inline

df_goog = pd.read_csv('../../assets/datasets/goog.csv')

Populating the interactive namespace from numpy and matplotlib


Take a high-level look at the data. Describe it. What are we looking at? Hint: We can use our `plot` function to provide a good visual.

In [40]:
print(df_goog.describe())
print(df_goog.head())
print(df_goog.shape)

             Open        High         Low       Close        Volume  \
count   22.000000   22.000000   22.000000   22.000000  2.200000e+01   
mean   575.890686  609.268155  552.366753  584.801935  2.019245e+06   
std     56.597440   71.429837   64.162213   69.206444  6.682940e+05   
min    524.729980  541.412415  487.562195  520.510010  2.530000e+04   
25%    538.548111  565.495086  516.023072  538.463135  1.685675e+06   
50%    560.617554  581.727631  534.417419  559.487549  1.856900e+06   
75%    577.745132  639.383209  565.040634  600.655640  2.387900e+06   
max    747.109985  775.955017  745.630005  762.369995  3.290800e+06   

        Adj Close  
count   22.000000  
mean   584.801935  
std     69.206444  
min    520.510010  
25%    538.463135  
50%    559.487549  
75%    600.655640  
max    762.369995  
         Date        Open        High         Low       Close   Volume  \
0  2015-12-01  747.109985  775.955017  745.630005  762.369995  2519600   
1  2015-11-02  711.059998  762.7

Looking a little deeper, let's gauge the integrity of our data. Is there any cleaning we can do? 

In [41]:
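print(len(df_goog.isnull()))
# Note: len() of the isnull() mask is just the row count (22), not a count of missing values.
# describe() above reports count == 22 for every numeric column, so nothing appears to be missing;
# df_goog.isnull().sum() would report the number of nulls per column directly.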
df_goog['Date'] = pd.to_datetime(df_goog['Date'])
#let's look at its datatype
for col in list(df_goog):
    print("{} is of type {}".format(col, df_goog[col].dtype))


22
Date is of type datetime64[ns]
Open is of type float64
High is of type float64
Low is of type float64
Close is of type float64
Volume is of type int64
Adj Close is of type float64


Let's examine the Date column. We should make it the index of our DataFrame, since we need to order the data by time. Doing this leaves us with six Series objects (one per remaining column), each indexed by datetime: literal time series!

In [None]:
df_goog.set_index('Date', inplace=True)


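For the index to behave as a DatetimeIndex, the date strings must be converted to datetime objects. Pandas has a built-in function for this, `pd.to_datetime`, which we applied to the Date column earlier. Easy peasy. We should also ensure that the dates are sorted.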
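As a minimal sketch of the sorting check, the snippet below uses a small toy frame (the name `toy` and its rounded values are stand-ins for illustration, not the real `df_goog`) to show one way of testing whether a DatetimeIndex is in chronological order and sorting it if not.

```python
import pandas as pd

# Toy frame standing in for df_goog; values are rounded for illustration only
toy = pd.DataFrame(
    {"Close": [762.37, 742.60, 710.81]},
    index=pd.to_datetime(["2015-12-01", "2015-11-02", "2015-10-01"]),
)

# The newest row comes first here, so the index is not in ascending order
print(toy.index.is_monotonic_increasing)

# sort_index() returns a new frame ordered chronologically
toy_sorted = toy.sort_index()
print(toy_sorted.index.is_monotonic_increasing)
```

Sorting up front tends to make later date-based slicing behave more predictably.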

In [47]:
df_goog.head()

Date,Open,High,Low,Close,Volume,Adj Close
2015-12-01,747.109985,775.955017,745.630005,762.369995,2519600,762.369995
2015-11-02,711.059998,762.708008,705.849976,742.599976,1795300,742.599976
2015-10-01,608.369995,730.0,599.849976,710.809998,2337100,710.809998
2015-09-01,602.359985,650.900024,589.380005,608.419983,2398400,608.419983
2015-08-03,625.340027,674.900024,565.049988,618.25,2661600,618.25


Let's add some more columns with useful data extracted from the DateTime index.

In [59]:
df_goog['Year'] = df_goog.index.year

df_goog['Month'] = df_goog.index.month

df_goog['Day'] = df_goog.index.day

df_goog.sort_index(inplace=True)
df_goog

Date,Open,High,Low,Close,Volume,Adj Close,Year,Closed_Up,Month,Day
2014-03-27,568.002563,568.002563,552.922546,556.972473,25300,556.972473,2014,False,3,27
2014-04-01,558.712585,604.832764,502.802277,526.662415,3290800,526.662415,2014,False,4,1
2014-05-01,527.112366,567.84259,503.302277,559.892578,1828500,559.892578,2014,True,5,1
2014-06-02,560.702576,582.452637,538.752441,575.282593,1872200,575.282593,2014,True,6,2
2014-07-01,578.322632,599.65271,565.012573,571.602539,1668800,571.602539,2014,False,7,1
2014-08-01,570.402588,587.342651,560.002563,571.602539,1368800,571.602539,2014,True,8,1
2014-09-02,571.852539,596.482666,568.212646,577.36261,1673200,577.36261,2014,True,9,2
2014-10-01,576.012634,581.002625,508.102295,559.08252,2356400,559.08252,2014,False,10,1
2014-11-03,555.502502,557.902527,530.082397,541.832458,1561200,541.832458,2014,False,11,3
2014-12-01,538.902466,541.412415,489.002228,526.402405,2146700,526.402405,2014,False,12,1


Let's walk through adding a dummy variable to flag rows where the Close price was higher than the Open price.

In [58]:
df_goog["Closed_Up"] = (df_goog['Close']-df_goog['Open']) > 0
df_goog.head()

Date,Open,High,Low,Close,Volume,Adj Close,Year,Closed_Up
2014-03-27,568.002563,568.002563,552.922546,556.972473,25300,556.972473,2014,False
2014-04-01,558.712585,604.832764,502.802277,526.662415,3290800,526.662415,2014,False
2014-05-01,527.112366,567.84259,503.302277,559.892578,1828500,559.892578,2014,True
2014-06-02,560.702576,582.452637,538.752441,575.282593,1872200,575.282593,2014,True
2014-07-01,578.322632,599.65271,565.012573,571.602539,1668800,571.602539,2014,False


We can use the DatetimeIndex to access different cuts of the data through its date attributes. For example, to get all of the rows from 2015, we can filter on the index's `year` attribute.
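A minimal sketch of selecting the 2015 rows, using a toy frame rather than the real `df_goog` (the name `toy` and the values are made up for illustration). It shows two common approaches: a boolean mask on the index's `year` attribute, and partial string indexing with `.loc`.

```python
import pandas as pd

# Toy frame with a DatetimeIndex; values are made up for illustration
toy = pd.DataFrame(
    {"Close": [526.40, 534.52, 762.37]},
    index=pd.to_datetime(["2014-12-01", "2015-01-02", "2015-12-01"]),
)
toy.index.name = "Date"

# Option 1: boolean mask built from the index's year attribute
rows_2015 = toy[toy.index.year == 2015]

# Option 2: partial string indexing with .loc (requires a DatetimeIndex)
rows_2015_loc = toy.loc["2015"]

print(rows_2015)
```

Both options should return the same rows here; the `.loc` form is shorter, while the mask form extends naturally to `month` or `day`.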

Let's recall the `timedelta` object, which we imported from `datetime` at the top of the notebook. We can use it to shift our entire index by a given offset.
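As a sketch of the idea, the snippet below shifts a toy DatetimeIndex by one day in each direction (the frame `toy` and its values are illustrative stand-ins for `df_goog`).

```python
from datetime import timedelta

import pandas as pd

# Toy frame standing in for df_goog; values are rounded for illustration
toy = pd.DataFrame(
    {"Close": [556.97, 526.66]},
    index=pd.to_datetime(["2014-03-27", "2014-04-01"]),
)

# Adding or subtracting a timedelta moves every timestamp in the index
shifted_forward = toy.index + timedelta(days=1)
shifted_backward = toy.index - timedelta(days=1)

print(shifted_forward)
print(shifted_backward)
```

Note that `timedelta` accepts units such as `days`, `hours`, and `seconds` but has no `years` argument; for calendar-based offsets, `pd.DateOffset(years=...)` is one option.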

On your own, try to shift the entire time series **both** forwards and backwards by the following intervals:
- 1 hour
- 3 days
- 12 years, 1 hour, and 43 seconds

## Discussion: Date ranges and Frequencies

Note that `asfreq` gives us a `method` keyword argument. Pad, or ffill, will propagate the last valid observation forward. In other words, it will use the value preceding a range of unknown indices to fill in the unknowns. Inversely, backfill, or bfill, will use the first value succeeding a range of unknown indices to fill in the unknowns.
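A minimal sketch of the `method` argument on a toy series (the series `s` and its values are made up for illustration): two observations three days apart are converted to daily frequency, leaving a gap on January 2 and 3 that each method fills differently.

```python
import pandas as pd

# Two observations with a two-day gap between them
s = pd.Series(
    [1.0, 2.0],
    index=pd.to_datetime(["2015-01-01", "2015-01-04"]),
)

# No method: the newly created dates are left as NaN
daily_nan = s.asfreq("D")

# ffill / pad: carry the preceding value (1.0) forward into the gap
daily_ffill = s.asfreq("D", method="ffill")

# bfill / backfill: pull the following value (2.0) back into the gap
daily_bfill = s.asfreq("D", method="bfill")

print(daily_ffill)
print(daily_bfill)
```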

Now, let's discuss the following points:
- What does `asfreq` do?
- What does `resample` do?
- What is the difference?
- When would we want to use each?
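To seed the discussion, here is a small sketch contrasting the two on a toy daily series (the series `s` and the `2D` frequency are arbitrary choices for illustration).

```python
import pandas as pd

# Four consecutive daily observations
s = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.date_range("2015-01-01", periods=4, freq="D"),
)

# asfreq: keep only the values that sit exactly on the new timestamps
every_other = s.asfreq("2D")

# resample: group observations into 2-day bins, then aggregate each bin
two_day_mean = s.resample("2D").mean()

print(every_other)
print(two_day_mean)
```

In this sketch `asfreq` simply selects existing values, while `resample` combines every value that falls inside each bin.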

We can also create our own date ranges using a built-in function, `date_range`. The `periods` and `freq` keyword arguments give the user fine-grained control over the resulting values. To snap the start and end dates to midnight before the range is generated, pass `normalize=True`.

**NOTE:** See Reference B in the lesson notes for all of the available offset aliases.
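A short sketch of `date_range` with a few of these options (the dates are arbitrary; `"D"` and `"B"` are the calendar-day and business-day offset aliases):

```python
import pandas as pd

# Five consecutive calendar days starting from a given date
days = pd.date_range("2015-01-01", periods=5, freq="D")

# Business days between two endpoints (weekends are skipped)
bdays = pd.date_range("2015-01-01", "2015-01-07", freq="B")

# normalize=True snaps the 15:30 start time to midnight before generating dates
normalized = pd.date_range("2015-01-01 15:30", periods=3, freq="D", normalize=True)

print(days)
print(bdays)
print(normalized)
```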

Pandas also provides a `Period` object, which represents a span of time rather than a single instant. A `Period` exposes a start time and an end time, and can be created by providing a date and a frequency.
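A minimal sketch of a monthly `Period` (the date is an arbitrary example):

```python
import pandas as pd

# A Period covering the whole of December 2015
p = pd.Period("2015-12", freq="M")

# The boundaries of the interval are available as Timestamps
print(p.start_time)
print(p.end_time)

# Integer arithmetic moves by whole periods (here, one month ahead)
next_p = p + 1
print(next_p)
```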

Each of these objects can be used to alter and access data from our DataFrames. We'll try those out in our Independent Practice in a moment.