### Pandas Lab: Time Shifts & Multi Level Indexing

This lab introduces you to working with time in a more granular way and to building features when your data has hierarchies or panels, i.e., when you have repeated observations for the same objects.  This is an important concept because many statistical methods don't explicitly account for values that might naturally be correlated with one another over time.  

But lots of data **is** highly correlated over time!  

By the time you're done with this lab, you'll have built 9 columns that capture a variety of information about how an observed value is changing with respect to itself.

**Question 1:** Create columns in your dataset that capture the following aspects of each timestamp:

  - What quarter it's in
  - What month it's in
  - What year it's in
  - The number of days elapsed since the earliest date in the `visit_date` column

If you want to try adding different pandas date parts, you can find them here:  https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components
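As a small illustration of the `.dt` accessor, the sketch below pulls a few date parts from a made-up series of timestamps. The `dates` series and the `days_since_first` name are hypothetical and only stand in for the `visit_date` column.

```python
import pandas as pd

# hypothetical timestamps standing in for the visit_date column
dates = pd.Series(pd.to_datetime(['2016-01-15', '2016-07-01', '2017-04-21']))

parts = pd.DataFrame({
    'quarter': dates.dt.quarter,
    'month': dates.dt.month,
    'year': dates.dt.year,
    'day_of_week': dates.dt.dayofweek,  # Monday=0 ... Sunday=6
    # days elapsed since the earliest timestamp in the series
    'days_since_first': (dates - dates.min()).dt.days,
})
print(parts)
```

Any other attribute listed in the linked guide can be added to the dictionary in the same way.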

In [2]:
# your answer here
import pandas as pd
import numpy as np

# you might have to change the file path; the raw string (r'...') keeps the '\r'
# in '\restaurant_data' from being read as a carriage return
df = pd.read_csv(r'Users\mcs275\dat-class-repo\data\restaurant_data\master.csv', parse_dates=['visit_date', 'calendar_date'])
# sort by restaurant and date first so that row-based shifts run in time order
df.sort_values(by=['id', 'visit_date'], inplace=True)


In [33]:
df['quarter'] = df['visit_date'].dt.quarter
df['month']   = df['visit_date'].dt.month
df['year']    = df['visit_date'].dt.year
df['time']    = (df['visit_date'] - df['visit_date'].min()).dt.days

**Question 2:** Time Series Embedding

Often, if you're trying to predict the value of something tomorrow, the most important piece of information is what its value is today, what it was yesterday, and so on.

However, your data won't really "know" about those values unless they can be observed alongside the current observation.

To that end, make three columns that capture the value of the following:

 - The attendance recorded for the previous observation
 - The attendance from two observations ago
 - The attendance from 7 observations ago (i.e., week over week)

In [34]:
# your answer here
df['yesterday']    = df.groupby('id')['visitors'].shift()
df['two_days_ago'] = df.groupby('id')['visitors'].shift(2)
df['one_week_ago'] = df.groupby('id')['visitors'].shift(7)
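To see why the shift is done per `id`, here is a minimal sketch on a made-up two-restaurant panel. The `toy` frame and the `lag_1` / `naive_lag_1` column names are hypothetical.

```python
import pandas as pd

# hypothetical panel: two restaurants, three observations each, already sorted
toy = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b', 'b'],
    'visitors': [10, 20, 30, 1, 2, 3],
})

# lag within each id, so 'b' never inherits a value from 'a'
toy['lag_1'] = toy.groupby('id')['visitors'].shift(1)
# lag across the whole column, ignoring the grouping
toy['naive_lag_1'] = toy['visitors'].shift(1)
print(toy)
```

In the naive column, the first row of restaurant `b` picks up the last value of restaurant `a`; the grouped version leaves it missing instead.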

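**Bonus Answer:** The same lags, built from calendar offsets rather than row counts, so a missing date yields a missing value instead of borrowing from an earlier day.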

In [35]:
# create the date offsets
one_day_ago  = pd.DateOffset(days=1)
two_days_ago = pd.DateOffset(days=2)
one_week_ago = pd.DateOffset(weeks=1)

In [36]:
# and the groupings
one_day_shift  = df.set_index('visit_date').groupby('id')[['visitors']].shift(freq=one_day_ago).rename({'visitors': 'one_day_ago'}, axis=1)
two_days_shift = df.set_index('visit_date').groupby('id')[['visitors']].shift(freq=two_days_ago).rename({'visitors': 'two_days_ago'}, axis=1)
one_week_shift = df.set_index('visit_date').groupby('id')[['visitors']].shift(freq=one_week_ago).rename({'visitors': 'one_week_ago_'}, axis=1)

In [37]:
# merge them back in 
df = df.merge(one_day_shift, left_on=['id', 'visit_date'], right_index=True, how='left')
df = df.merge(two_days_shift, left_on=['id', 'visit_date'], right_index=True, how='left')
df = df.merge(one_week_shift, left_on=['id', 'visit_date'], right_index=True, how='left')
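The difference between a row-based shift and a calendar-based shift only shows up when dates are missing. The sketch below uses a made-up single series with a one-day gap; the names `s`, `row_lag`, and `calendar_lag` are hypothetical, and the `reindex` step is a simplified stand-in for the merge used in this lab.

```python
import pandas as pd

# hypothetical daily series with a gap: 2020-01-03 is missing
s = pd.Series(
    [10, 20, 30],
    index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04']),
)

# row-based lag: always the previous row, however long ago it was
row_lag = s.shift(1)

# calendar-based lag: move each timestamp forward one day, then line the
# result back up with the original dates
calendar_lag = s.shift(freq='D').reindex(s.index)

comparison = pd.DataFrame({'visitors': s, 'row_lag': row_lag, 'calendar_lag': calendar_lag})
print(comparison)
```

For 2020-01-04 the row-based lag reports the value from two days earlier, while the calendar-based lag is missing because nothing was recorded on 2020-01-03.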

In [38]:
# the last three columns are the new ones; the merge added _x/_y suffixes to the duplicated two_days_ago name, so you might want to rename for clarity
df

| | id | visit_date | visitors | calendar_date | day_of_week | holiday | genre | area | latitude | longitude | ... | quarter | month | year | time | yesterday | two_days_ago_x | one_week_ago | one_day_ago | two_days_ago_y | one_week_ago_ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 166836 | air_00a91d42b08b08d9 | 2016-07-01 | 35 | 2016-07-01 | Friday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | ... | 3 | 7 | 2016 | 182 | NaN | NaN | NaN | NaN | NaN | NaN |
| 166837 | air_00a91d42b08b08d9 | 2016-07-02 | 9 | 2016-07-02 | Saturday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | ... | 3 | 7 | 2016 | 183 | 35.0 | NaN | NaN | 35.0 | NaN | NaN |
| 166838 | air_00a91d42b08b08d9 | 2016-07-04 | 20 | 2016-07-04 | Monday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | ... | 3 | 7 | 2016 | 185 | 9.0 | 35.0 | NaN | NaN | 9.0 | NaN |
| 166839 | air_00a91d42b08b08d9 | 2016-07-05 | 25 | 2016-07-05 | Tuesday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | ... | 3 | 7 | 2016 | 186 | 20.0 | 9.0 | NaN | 20.0 | NaN | NaN |
| 166840 | air_00a91d42b08b08d9 | 2016-07-06 | 29 | 2016-07-06 | Wednesday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | ... | 3 | 7 | 2016 | 187 | 25.0 | 20.0 | NaN | 25.0 | 20.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 216643 | air_fff68b929994bfbd | 2017-04-18 | 6 | 2017-04-18 | Tuesday | 0 | Bar/Cocktail | Tōkyō-to Nakano-ku Nakano | 35.708146 | 139.666288 | ... | 2 | 4 | 2017 | 473 | 3.0 | 7.0 | 1.0 | 3.0 | 7.0 | 1.0 |
| 216644 | air_fff68b929994bfbd | 2017-04-19 | 2 | 2017-04-19 | Wednesday | 0 | Bar/Cocktail | Tōkyō-to Nakano-ku Nakano | 35.708146 | 139.666288 | ... | 2 | 4 | 2017 | 474 | 6.0 | 3.0 | 6.0 | 6.0 | 3.0 | 6.0 |
| 216645 | air_fff68b929994bfbd | 2017-04-20 | 2 | 2017-04-20 | Thursday | 0 | Bar/Cocktail | Tōkyō-to Nakano-ku Nakano | 35.708146 | 139.666288 | ... | 2 | 4 | 2017 | 475 | 2.0 | 6.0 | 1.0 | 2.0 | 6.0 | 1.0 |
| 216646 | air_fff68b929994bfbd | 2017-04-21 | 4 | 2017-04-21 | Friday | 0 | Bar/Cocktail | Tōkyō-to Nakano-ku Nakano | 35.708146 | 139.666288 | ... | 2 | 4 | 2017 | 476 | 2.0 | 2.0 | 5.0 | 2.0 | 2.0 | 5.0 |


**Question 3:** Window Statistics

Often, we want to capture some idea of momentum, or how a value compares with what's usually observed.

For example, if we had 48 purchases in a store today, how does that number compare to what's happened in the last 14 days?  Are things trending up or trending down?  

This also allows us to get a clearer picture of general trends in values, even if there are irregular daily spikes.

To handle these sorts of issues, pandas has a family of window statistics available through the `rolling` method.  It works like this:

In [39]:
# I'll create a sample dataframe with 36 days' worth of values
import numpy as np
index = pd.date_range(start='01/01/2020', end='02/05/2020')
sample_df = pd.DataFrame(np.random.randn(36), index=index, columns=['Value'])
# and here's what it looks like
sample_df.head()

| | Value |
|---|---|
| 2020-01-01 | -0.968186 |
| 2020-01-02 | -0.160958 |
| 2020-01-03 | 0.934989 |
| 2020-01-04 | 0.98636 |
| 2020-01-05 | 0.494931 |


In [40]:
# and now we'll see rolling 10-day averages (first 10 rows shown)
sample_df.rolling(10).mean().head(10)

| | Value |
|---|---|
| 2020-01-01 | NaN |
| 2020-01-02 | NaN |
| 2020-01-03 | NaN |
| 2020-01-04 | NaN |
| 2020-01-05 | NaN |
| 2020-01-06 | NaN |
| 2020-01-07 | NaN |
| 2020-01-08 | NaN |
| 2020-01-09 | NaN |
| 2020-01-10 | 0.166719 |


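You can specify the number of observations to include in each window, and choose your aggregator: `mean()`, `min()`, `sum()`, etc., although `mean()` is the most common.  Until a window has enough observations, the result is missing, which is why the first nine rows above are empty.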
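Here is a small sketch of how the choice of aggregator changes the result, using made-up values that are easy to check by hand. The `values` series and the column names are hypothetical; `min_periods` is a real `rolling` parameter that is not otherwise used in this lab.

```python
import pandas as pd

# hypothetical values chosen so the windows are easy to check by hand
values = pd.Series([1, 2, 3, 4, 5, 6])

windows = pd.DataFrame({
    'value': values,
    'mean_3': values.rolling(3).mean(),
    'sum_3': values.rolling(3).sum(),
    'min_3': values.rolling(3).min(),
    # min_periods=1 fills the warm-up rows using however many values exist so far
    'mean_3_partial': values.rolling(3, min_periods=1).mean(),
})
print(windows)
```

The first two rows of the three-observation windows are missing because a full window is not yet available; the `min_periods=1` column trades that strictness for earlier, noisier values.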

**Your Turn:** Calculate the rolling 7-, 25-, and 60-day moving averages of visits for each restaurant in the dataset.

Be mindful to compute these at the appropriate level of your dataset: each window should stay within a single restaurant.

In [41]:
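# your answer here
# shift within each restaurant first, so every window uses only earlier visits from the same id
df['rolling_mean_7']  = df.groupby('id')['visitors'].transform(lambda s: s.shift().rolling(7).mean())
df['rolling_mean_25'] = df.groupby('id')['visitors'].transform(lambda s: s.shift().rolling(25).mean())
df['rolling_mean_60'] = df.groupby('id')['visitors'].transform(lambda s: s.shift().rolling(60).mean())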

In [42]:
# our final dataset
df

| | id | visit_date | visitors | calendar_date | day_of_week | holiday | genre | area | latitude | longitude | ... | time | yesterday | two_days_ago_x | one_week_ago | one_day_ago | two_days_ago_y | one_week_ago_ | rolling_mean_7 | rolling_mean_25 | rolling_mean_60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 166836 | air_00a91d42b08b08d9 | 2016-07-01 | 35 | 2016-07-01 | Friday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | ... | 182 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 166837 | air_00a91d42b08b08d9 | 2016-07-02 | 9 | 2016-07-02 | Saturday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | ... | 183 | 35.0 | NaN | NaN | 35.0 | NaN | NaN | NaN | NaN | NaN |
| 166838 | air_00a91d42b08b08d9 | 2016-07-04 | 20 | 2016-07-04 | Monday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | ... | 185 | 9.0 | 35.0 | NaN | NaN | 9.0 | NaN | NaN | NaN | NaN |
| 166839 | air_00a91d42b08b08d9 | 2016-07-05 | 25 | 2016-07-05 | Tuesday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | ... | 186 | 20.0 | 9.0 | NaN | 20.0 | NaN | NaN | NaN | NaN | NaN |
| 166840 | air_00a91d42b08b08d9 | 2016-07-06 | 29 | 2016-07-06 | Wednesday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | ... | 187 | 25.0 | 20.0 | NaN | 25.0 | 20.0 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 216643 | air_fff68b929994bfbd | 2017-04-18 | 6 | 2017-04-18 | Tuesday | 0 | Bar/Cocktail | Tōkyō-to Nakano-ku Nakano | 35.708146 | 139.666288 | ... | 473 | 3.0 | 7.0 | 1.0 | 3.0 | 7.0 | 1.0 | 4.285714 | 5.04 | 5.000000 |
| 216644 | air_fff68b929994bfbd | 2017-04-19 | 2 | 2017-04-19 | Wednesday | 0 | Bar/Cocktail | Tōkyō-to Nakano-ku Nakano | 35.708146 | 139.666288 | ... | 474 | 6.0 | 3.0 | 6.0 | 6.0 | 3.0 | 6.0 | 5.000000 | 4.96 | 4.966667 |
| 216645 | air_fff68b929994bfbd | 2017-04-20 | 2 | 2017-04-20 | Thursday | 0 | Bar/Cocktail | Tōkyō-to Nakano-ku Nakano | 35.708146 | 139.666288 | ... | 475 | 2.0 | 6.0 | 1.0 | 2.0 | 6.0 | 1.0 | 4.428571 | 4.76 | 4.883333 |
| 216646 | air_fff68b929994bfbd | 2017-04-21 | 4 | 2017-04-21 | Friday | 0 | Bar/Cocktail | Tōkyō-to Nakano-ku Nakano | 35.708146 | 139.666288 | ... | 476 | 2.0 | 2.0 | 5.0 | 2.0 | 2.0 | 5.0 | 4.571429 | 4.72 | 4.833333 |


One additional note:  for a calculation such as this, it is best to shift the values by one observation, so that each row's window ends on the previous observation rather than the current one. Why might this be the case?
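One possible answer, sketched on a made-up panel: without the shift, each window includes the very value you are trying to predict. The `toy` frame and its column names are hypothetical, and the two-observation window is only there to keep the numbers easy to follow.

```python
import numpy as np
import pandas as pd

# hypothetical panel, sorted by id and then by time
toy = pd.DataFrame({
    'id': ['a'] * 4 + ['b'] * 4,
    'visitors': [1, 2, 3, 4, 10, 20, 30, 40],
})

grouped = toy.groupby('id')['visitors']

# window ends on the current row, so the feature contains the value being predicted
toy['mean_2_with_today'] = grouped.transform(lambda s: s.rolling(2).mean())
# shifting first means the window ends on the previous row of the same id
toy['mean_2_before_today'] = grouped.transform(lambda s: s.shift().rolling(2).mean())
print(toy)
```

A model trained on `mean_2_with_today` would be given part of its own target, which is not available at prediction time; `mean_2_before_today` only uses information that would already be known.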