# Time Series Data Munging
## Lagging Variables that are Distributed Across Multiple Groups

1. Lag one or more variables across one group — using shift method
2. Lag one variable across multiple groups — using unstack method
3. Lag multiple variables across multiple groups — with groupby



See the complete tutorial [here](https://towardsdatascience.com/timeseries-data-munging-lagging-variables-that-are-distributed-across-multiple-groups-86e0a038460c).

In [0]:
import pandas as pd
import numpy as np

np.random.seed(0) # ensures the same set of random numbers are generated
date = ['2019-01-01']*3 + ['2019-01-02']*3 + ['2019-01-03']*3
var1, var2 = np.random.randn(9), np.random.randn(9)*20 
group = ["group1", "group2", "group3"]*3 # to assign the groups for the multiple group case

df_manygrp = pd.DataFrame({"date": date, "group":group, "var1": var1}) # one var, many groups
df_combo = pd.DataFrame({"date": date, "group":group, "var1": var1, "var2": var2}) # many vars, many groups
df_onegrp = df_manygrp[df_manygrp["group"]=="group1"].copy() # one var, one group; .copy() avoids a SettingWithCopyWarning when the date column is converted

In [5]:
for d in [df_onegrp, df_manygrp, df_combo]: # loop to apply the change to all three dfs
    d["date"] = pd.to_datetime(d['date'])
    print("Column changed to: ", d.date.dtype.name)

Column changed to:  datetime64[ns]
Column changed to:  datetime64[ns]
Column changed to:  datetime64[ns]



In [6]:
df_manygrp.head()


,date,group,var1
0,2019-01-01,group1,1.764052
1,2019-01-01,group2,0.400157
2,2019-01-01,group3,0.978738
3,2019-01-02,group1,2.240893
4,2019-01-02,group2,1.867558


In [7]:
df_combo.head()

,date,group,var1,var2
0,2019-01-01,group1,1.764052,8.21197
1,2019-01-01,group2,0.400157,2.880871
2,2019-01-01,group3,0.978738,29.08547
3,2019-01-02,group1,2.240893,15.220755
4,2019-01-02,group2,1.867558,2.4335


In [8]:
df_onegrp.head()

,date,group,var1
0,2019-01-01,group1,1.764052
3,2019-01-02,group1,2.240893
6,2019-01-03,group1,0.950088


## Step 1. Lag one or more variables across one group/category
### Using “shift” method

In [9]:
df_onegrp.set_index(["date"]).shift(1)

date,group,var1
2019-01-01,,
2019-01-02,group1,1.764052
2019-01-03,group1,2.240893
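
The cell above replaces every column with its lagged values. In practice it is often more convenient to keep the original values and add the lags as new columns. The following is a minimal sketch of that idea, assuming a small stand-in frame (`one_group`) and illustrative column names (`var1_lag1`, `var1_lag2`) that are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical single-group frame, rebuilt here so the sketch is self-contained.
one_group = pd.DataFrame(
    {
        "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
        "var1": [1.0, 2.0, 3.0],
    }
).set_index("date").sort_index()

# Add one new column per lag step instead of overwriting the original values.
lagged = one_group.assign(
    var1_lag1=one_group["var1"].shift(1),
    var1_lag2=one_group["var1"].shift(2),
)
print(lagged)
```

The first `n` rows of an `n`-step lag are `NaN` because there is no earlier observation to pull from.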


## Step 2. Lag one variable across multiple groups
### Using “unstack” method

In [0]:
df = df_manygrp.set_index(["date", "group"]) # index by date and group

In [18]:
# pull out the groups, shift with lag step=1
df = df.unstack().shift(1)
df.head()

,var1,var1,var1
group,group1,group2,group3
date,,,
2019-01-01,,,
2019-01-02,1.764052,0.400157,0.978738
2019-01-03,2.240893,1.867558,-0.977278


In [0]:
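# stack the groups back, keep the missing values
# (assumption: on pandas 2.1 or later the old stack implementation is deprecated;
#  df.stack(future_stack=True) is the replacement and also keeps the NaN rows)
df = df.stack(dropna=False)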

In [0]:
# move date and group back to columns, then order rows by group and date
df = df.reset_index().sort_values(["group", "date"]).reset_index(drop=True)

In [21]:
df.head(20)

,date,group,var1
0,2019-01-01,group1,
1,2019-01-02,group1,1.764052
2,2019-01-03,group1,2.240893
3,2019-01-01,group2,
4,2019-01-02,group2,0.400157
5,2019-01-03,group2,1.867558
6,2019-01-01,group3,
7,2019-01-02,group3,0.978738
8,2019-01-03,group3,-0.977278
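
The result above contains only the lagged values. To place the lag next to the original observation, the wide-format shift can be merged back onto the long-format data. The sketch below is one way to do that under stated assumptions: it uses `pivot` and `melt` in place of `unstack` and `stack` (they perform the same reshaping here), and the frame `long_df` and column name `var1_lag1` are illustrative, not from the original notebook.

```python
import pandas as pd

# Hypothetical long-format data: one variable, two groups, three dates.
long_df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"] * 2),
        "group": ["group1"] * 3 + ["group2"] * 3,
        "var1": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    }
)

# Wide format: one row per date, one column per group.
wide = long_df.pivot(index="date", columns="group", values="var1").sort_index()

# Shift every group column down by one date, then return to long format.
lag_long = (
    wide.shift(1)
    .reset_index()
    .melt(id_vars="date", var_name="group", value_name="var1_lag1")
)

# Attach the lag to the original rows by matching on date and group.
merged = long_df.merge(lag_long, on=["date", "group"], how="left")
print(merged)
```

Note that this reshaping requires each (date, group) pair to appear only once; duplicate pairs would make `pivot` raise an error.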


## Step 3. Lag multiple variables distributed across multiple groups, simultaneously
### Using “groupby” method


Assign: `assign` generates a new column, assigns values to it, and returns a copy of the data rather than modifying the original.

Method chaining: the function wraps its return expression in parentheses so that the chained method calls can span multiple lines.
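
As a small illustration of these two ideas, the sketch below chains `assign`, `sort_values`, `set_index`, and `shift` inside one pair of parentheses. The frame `raw` is a made-up example, not data from this notebook.

```python
import pandas as pd

# Hypothetical two-row frame with dates stored as strings, deliberately out of order.
raw = pd.DataFrame({"date": ["2019-01-02", "2019-01-01"], "var1": [2.0, 1.0]})

# Each step returns a new DataFrame, so the steps can be chained inside parentheses.
result = (
    raw.assign(date=pd.to_datetime(raw["date"]))  # returns a copy with a converted column
    .sort_values("date")
    .set_index("date")
    .shift(1)
)

print(result)
print(raw)  # the original frame is untouched: dates are still strings
```

Because `assign` works on a copy, `raw` keeps its string dates and original row order after the chain runs.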

In [0]:
grouped_df = df_combo.groupby("group") # group by a single key so get_group can be called with the plain group name

In [0]:
def lag_by_group(key, value_df):
    # assign returns a copy of the df, with the group column set to the key value
    df = value_df.assign(group=key)
    # the parentheses let you chain methods across lines and avoid intermediate variables
    return (df.sort_values(by=["date"], ascending=True)
              .set_index(["date"])
              .shift(1))  # note: shift also lags the group column, so each group's first row has no group label

In [0]:
dflist = [lag_by_group(g, grouped_df.get_group(g)) for g in grouped_df.groups.keys()]

In [43]:
pd.concat(dflist, axis=0).reset_index()

,date,group,var1,var2
0,2019-01-01,,,
1,2019-01-02,group1,1.764052,8.21197
2,2019-01-03,group1,2.240893,15.220755
3,2019-01-01,,,
4,2019-01-02,group2,0.400157,2.880871
5,2019-01-03,group2,1.867558,2.4335
6,2019-01-01,,,
7,2019-01-02,group3,0.978738,29.08547
8,2019-01-03,group3,-0.977278,8.877265
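
In the output above, the first row of each group has an empty `group` value because `shift` lags that column along with the variables. One possible alternative, sketched here under the assumption that only the value columns should be lagged, is to call `shift` directly on the grouped value columns. The frame `combo` and the `_lag1` suffix are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Hypothetical data: two variables, two groups, three dates.
combo = pd.DataFrame(
    {
        "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"] * 2),
        "group": ["group1"] * 3 + ["group2"] * 3,
        "var1": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
        "var2": [5.0, 6.0, 7.0, 50.0, 60.0, 70.0],
    }
)

value_cols = ["var1", "var2"]

# Sort so that shifting moves values forward in time within each group.
combo = combo.sort_values(["group", "date"]).reset_index(drop=True)

# GroupBy.shift lags each group separately and returns rows aligned to combo's index.
lags = combo.groupby("group")[value_cols].shift(1).add_suffix("_lag1")

lagged_combo = pd.concat([combo, lags], axis=1)
print(lagged_combo)
```

This keeps the `date` and `group` labels intact on every row and avoids building and concatenating a list of per-group frames.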
