In [1]:
import pandas as pd

Create a DataFrame, with the values as a list of lists and the column names as a list:

In [2]:
df = pd.DataFrame([[123,'xt23',20],[123,'q45',2],[123,'a89',25],[77,'q45',3],[77,'a89',30],[92,'xt23',24],[92,'m33',60],[92,'a89',28]], columns=['userid','product','price'])
df

Unnamed: 0,userid,product,price
0,123,xt23,20
1,123,q45,2
2,123,a89,25
3,77,q45,3
4,77,a89,30
5,92,xt23,24
6,92,m33,60
7,92,a89,28


If we want the maximum price anyone paid, we just do this:

In [3]:
df['price'].max()

60

If we want the max price per user, we'll do a groupby. When we do that, it does the aggregation on each column separately. So the value we get in the price column might not belong to the product we get in the product column.

In [4]:
df.groupby('userid').max()

userid,product,price
77,q45,30
92,xt23,60
123,xt23,25


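Just like max, we can do sum, etc. Older pandas versions silently left out columns for which the aggregation has no meaning. Recent versions (2.0 and later) concatenate the strings in the product column instead, so we pass numeric_only=True to sum only the numeric columns.

In [5]:
df.groupby('userid').sum(numeric_only=True)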

userid,price
77,33
92,112
123,47


Diff is another routine. Within each group, it subtracts the value in the previous row from the current one, so the first row of each group has no value (NaN).

In [None]:
df[['userid','price']].groupby(['userid']).diff()
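
The cell above was not run, so here is a minimal sketch of what a per-user diff returns on the same purchase data. It selects the price column as a Series before calling diff, which is a small variation on the cell above; the name `price_diff` is only illustrative.

```python
import pandas as pd

# Same purchase data as above, restricted to the two columns used here.
df = pd.DataFrame(
    [[123, 20], [123, 2], [123, 25], [77, 3], [77, 30], [92, 24], [92, 60], [92, 28]],
    columns=['userid', 'price'],
)

# Difference between each price and the previous price of the same user.
# The first row of every user has no predecessor, so it becomes NaN.
price_diff = df.groupby('userid')['price'].diff()

# Show the differences next to the original rows.
print(df.assign(price_diff=price_diff))
```

The result keeps the original row order and index, so it can be placed next to the original columns directly.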

We can sort rows by one or more columns this way:

In [7]:
df.sort_values(by=['userid','price'])

Unnamed: 0,userid,product,price
3,77,q45,3
4,77,a89,30
5,92,xt23,24
7,92,a89,28
6,92,m33,60
1,123,q45,2
0,123,xt23,20
2,123,a89,25


We can sort the rows and then select a subset of columns this way:

In [10]:
df.sort_values(by=['userid','product'])[['userid','price']]

Unnamed: 0,userid,price
4,77,30
3,77,3
7,92,28
6,92,60
5,92,24
2,123,25
1,123,2
0,123,20


If we want the maximum price each user paid and the product associated with that price, we sort, group, and then take the first row of each group. Groupby preserves the order of rows within each group.
*(For SQL users: in SQL, you group by and then sort, but in pandas it's easier to do it the other way around.)*

In [5]:
df.sort_values(by=['userid','price'],ascending=False).groupby('userid').head(1)

Unnamed: 0,userid,product,price
2,123,a89,25
6,92,m33,60
4,77,a89,30
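
An alternative that avoids sorting is to ask for the index label of each user's most expensive purchase and look those rows up. This is a sketch of that idea, not a replacement for the approach above; the names `top_idx` and `top_purchases` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[123, 'xt23', 20], [123, 'q45', 2], [123, 'a89', 25], [77, 'q45', 3],
     [77, 'a89', 30], [92, 'xt23', 24], [92, 'm33', 60], [92, 'a89', 28]],
    columns=['userid', 'product', 'price'],
)

# For each user, idxmax gives the index label of the row with the highest price.
top_idx = df.groupby('userid')['price'].idxmax()

# Use those labels to pull the complete rows out of the original frame.
top_purchases = df.loc[top_idx]
print(top_purchases)
```

If a user has two purchases tied for the highest price, idxmax returns only the first one it encounters.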


Adding a new column is easy:

In [6]:
df['website']=['Amazon','Amazon','NewEgg','NewEgg','NewEgg','Amazon','Amazon','Amazon']
df

Unnamed: 0,userid,product,price,website
0,123,xt23,20,Amazon
1,123,q45,2,Amazon
2,123,a89,25,NewEgg
3,77,q45,3,NewEgg
4,77,a89,30,NewEgg
5,92,xt23,24,Amazon
6,92,m33,60,Amazon
7,92,a89,28,Amazon


In [7]:
# numeric_only=True keeps the string column 'product' out of the sum
df.groupby(['userid','website']).sum(numeric_only=True)

userid,website,price
77,NewEgg,33
92,Amazon,112
123,Amazon,22
123,NewEgg,25


Below, we do the same groupby as above, but with as_index=False we get a flat table instead of the nested index.

In [15]:
df3=df.groupby(['userid','website'],as_index=False).sum(numeric_only=True)
df3

Unnamed: 0,userid,website,price
0,77,NewEgg,33
1,92,Amazon,112
2,123,Amazon,22
3,123,NewEgg,25
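
The flat table can also be obtained by grouping normally and then calling reset_index, which moves the group keys back into ordinary columns. The sketch below compares the two routes on the price column; `flat` and `nested` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'userid': [123, 123, 123, 77, 77, 92, 92, 92],
    'price': [20, 2, 25, 3, 30, 24, 60, 28],
    'website': ['Amazon', 'Amazon', 'NewEgg', 'NewEgg', 'NewEgg',
                'Amazon', 'Amazon', 'Amazon'],
})

# Route 1: as_index=False keeps the group keys as regular columns.
flat = df.groupby(['userid', 'website'], as_index=False)['price'].sum()

# Route 2: the default result carries the keys in a nested (multi-level) index...
nested = df.groupby(['userid', 'website'])['price'].sum()

# ...and reset_index turns that index back into columns.
print(flat)
print(nested.reset_index())
```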


Let's create a second table:

In [16]:
df2 = pd.DataFrame([[123,'USA'],[77,'Canada'],[92,'USA']], columns=['userid','country'])
df2

Unnamed: 0,userid,country
0,123,USA
1,77,Canada
2,92,USA


We can combine the two tables with the merge function. Conceptually, it compares every row in table 1 with every row in table 2 and, wherever the "on" columns match, creates a single row with the columns from both matched rows.

A merge of two tables with 5 rows each can give as few as 0 rows and as many as 25 rows, depending on the key values:

    [1,2,3,4,5] merged with [6,7,8,9,10] will give 0 rows
    [1,2,3,4,5] merged with [1,2,3,4,5] will give 5 rows
    [1,1,1,1,1] merged with [1,1,1,1,1] will give 25 rows
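
The three cases can be checked directly. The sketch below uses a small helper, `merged_rows`, which is a hypothetical name introduced only for this illustration; it builds two single-column tables and counts the rows of their merge.

```python
import pandas as pd

def merged_rows(left_keys, right_keys):
    """Count the rows produced by merging two one-column tables on 'key'."""
    left = pd.DataFrame({'key': left_keys})
    right = pd.DataFrame({'key': right_keys})
    return len(pd.merge(left, right, on='key'))

# No key appears in both tables.
no_overlap = merged_rows([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])

# Every key appears exactly once on each side.
one_to_one = merged_rows([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])

# The same key appears 5 times on each side, so every pair of rows matches.
all_pairs = merged_rows([1, 1, 1, 1, 1], [1, 1, 1, 1, 1])

print(no_overlap, one_to_one, all_pairs)
```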

In [26]:
pd.merge(df,df2,on='userid')

Unnamed: 0,userid,product,price,website,country
0,123,xt23,20,Amazon,USA
1,123,q45,2,Amazon,USA
2,123,a89,25,NewEgg,USA
3,77,q45,3,NewEgg,Canada
4,77,a89,30,NewEgg,Canada
5,92,xt23,24,Amazon,USA
6,92,m33,60,Amazon,USA
7,92,a89,28,Amazon,USA


We can merge and then groupby to get what we want (Money spent on each website per country)

In [25]:
pd.merge(df,df2,on='userid').groupby(['country','website']).sum(numeric_only=True)

country,website,userid,price
Canada,NewEgg,154,33
USA,Amazon,522,134
USA,NewEgg,123,25
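
The userid column in the output above is a sum of user ids, which has no meaning. One way to avoid it, sketched here under the assumption that only the money spent is of interest, is to select the price column before aggregating. The name `spend` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'userid': [123, 123, 123, 77, 77, 92, 92, 92],
    'price': [20, 2, 25, 3, 30, 24, 60, 28],
    'website': ['Amazon', 'Amazon', 'NewEgg', 'NewEgg', 'NewEgg',
                'Amazon', 'Amazon', 'Amazon'],
})
df2 = pd.DataFrame([[123, 'USA'], [77, 'Canada'], [92, 'USA']],
                   columns=['userid', 'country'])

# Merge first, then sum only the price column within each country/website pair.
spend = pd.merge(df, df2, on='userid').groupby(['country', 'website'])['price'].sum()
print(spend)
```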


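We can also start from a previously aggregated table. Below we use df3 instead of df (scroll up to see what df3 is). The price totals are the same as in the previous box. Only the userid sums differ, because df3 has one row per user and website instead of one row per purchase, and those sums have no meaning anyway.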

In [18]:
pd.merge(df3,df2,on='userid').groupby(['country','website']).sum()

country,website,userid,price
Canada,NewEgg,77,33
USA,Amazon,215,134
USA,NewEgg,123,25


Let's add another column: purchase date

In [19]:
df['date']=['2015-01-12','2015-01-08','2015-01-06','2015-01-03','2015-01-05','2015-01-04','2015-01-07','2015-01-02']
df

Unnamed: 0,userid,product,price,website,date
0,123,xt23,20,Amazon,2015-01-12
1,123,q45,2,Amazon,2015-01-08
2,123,a89,25,NewEgg,2015-01-06
3,77,q45,3,NewEgg,2015-01-03
4,77,a89,30,NewEgg,2015-01-05
5,92,xt23,24,Amazon,2015-01-04
6,92,m33,60,Amazon,2015-01-07
7,92,a89,28,Amazon,2015-01-02


Here is a tricky task. For each row, I want the average purchase price for that user prior to that purchase. One option is to do some loops. But another solution is to just do a merge on itself and filter.

But first, let's review what a merge (or 'join' if you come from SQL) does. Say you merge two dataframes with 3 rows each, how many rows would you end up with? The answer is anywhere between 0 and 9.

Consider the following examples, where table x has users and the movies they like, and table y has users and the wines they like. Let's do a merge to come up with possible movie and wine pairings for any user. In case A we get 0 rows, in case B we get 3 rows, and in case C we get 9 rows.

In [20]:
dfx = pd.DataFrame([[1,'Godfather'],[2,'Amelie'],[3,'Chicago']],columns=['userid','movies'])
dfy = pd.DataFrame([[4,'red'],[5,'white'],[6,'pink']],columns=['userid','wines'])
dfm1=pd.merge(dfx,dfy,on='userid')
dfm1

Unnamed: 0,userid,movies,wines


In [21]:
dfx = pd.DataFrame([[1,'Godfather'],[2,'Amelie'],[3,'Chicago']],columns=['userid','movies'])
dfy = pd.DataFrame([[1,'red'],[2,'white'],[3,'pink']],columns=['userid','wines'])
dfm1=pd.merge(dfx,dfy,on='userid')
dfm1

Unnamed: 0,userid,movies,wines
0,1,Godfather,red
1,2,Amelie,white
2,3,Chicago,pink


In [24]:
dfx = pd.DataFrame([[1,'Godfather'],[1,'Amelie'],[1,'Chicago']],columns=['userid','movies'])
dfy = pd.DataFrame([[1,'red'],[1,'white'],[1,'pink']],columns=['userid','wines'])
dfm1=pd.merge(dfx,dfy,on='userid')
dfm1

Unnamed: 0,userid,movies,wines
0,1,Godfather,red
1,1,Godfather,white
2,1,Godfather,pink
3,1,Amelie,red
4,1,Amelie,white
5,1,Amelie,pink
6,1,Chicago,red
7,1,Chicago,white
8,1,Chicago,pink


Now let's return to the original question: For each row, I want the average purchase price for that user prior to that purchase. Let's do a merge on itself and filter.

If we join a table on itself, then for each row we get every purchase that user made, including the purchase in that row itself.

In [31]:
df4=pd.merge(df[['userid','date']],df[['userid','price','date']],on='userid')
df4

Unnamed: 0,userid,date_x,price,date_y
0,123,2015-01-12,20,2015-01-12
1,123,2015-01-12,2,2015-01-08
2,123,2015-01-12,25,2015-01-06
3,123,2015-01-08,20,2015-01-12
4,123,2015-01-08,2,2015-01-08
5,123,2015-01-08,25,2015-01-06
6,123,2015-01-06,20,2015-01-12
7,123,2015-01-06,2,2015-01-08
8,123,2015-01-06,25,2015-01-06
9,77,2015-01-03,3,2015-01-03


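Then we can filter out the purchases that are not prior to the current purchase. The dates are strings in YYYY-MM-DD format, so comparing them as strings gives chronological order.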

In [32]:
df4=df4[df4['date_x']>df4['date_y']]
df4

Unnamed: 0,userid,date_x,price,date_y
1,123,2015-01-12,2,2015-01-08
2,123,2015-01-12,25,2015-01-06
5,123,2015-01-08,25,2015-01-06
11,77,2015-01-05,3,2015-01-03
15,92,2015-01-04,28,2015-01-02
16,92,2015-01-07,24,2015-01-04
18,92,2015-01-07,28,2015-01-02
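
The filter above compares date strings directly. That works for the zero-padded YYYY-MM-DD format used here, but not for date formats in general. The sketch below shows both situations; the second pair of dates in month/day/year format is made up for illustration and does not come from the tables above.

```python
import pandas as pd

dates = pd.DataFrame({
    'date_x': ['2015-01-12', '2015-01-08', '2015-01-06'],
    'date_y': ['2015-01-08', '2015-01-08', '2015-01-12'],
})

# With YYYY-MM-DD strings, string order and chronological order agree.
string_mask = dates['date_x'] > dates['date_y']
datetime_mask = pd.to_datetime(dates['date_x']) > pd.to_datetime(dates['date_y'])

# Hypothetical month/day/year strings: January 12 vs January 8.
us_x = pd.Series(['1/12/2015'])
us_y = pd.Series(['1/8/2015'])

# Compared as strings, '1/12/2015' sorts before '1/8/2015', which is wrong.
as_strings = bool((us_x > us_y).iloc[0])

# Parsed as dates, January 12 correctly comes after January 8.
as_dates = bool((pd.to_datetime(us_x, format='%m/%d/%Y')
                 > pd.to_datetime(us_y, format='%m/%d/%Y')).iloc[0])

print(string_mask.tolist(), datetime_mask.tolist(), as_strings, as_dates)
```

Converting the column with pd.to_datetime before filtering is the safer choice when the date format is not guaranteed.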


Then we can group by to get the average price that we wanted

In [33]:
# Select 'price' explicitly: date_y holds strings, which cannot be averaged
df5 = df4.groupby(['userid','date_x'])[['price']].mean()
df5.rename(columns={'price': 'avg_price_prior'}, inplace=True)
df5

userid,date_x,avg_price_prior
77,2015-01-05,3.0
92,2015-01-04,28.0
92,2015-01-07,26.0
123,2015-01-08,25.0
123,2015-01-12,13.5


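Finally, we merge the result back into the original dataframe. A left merge keeps every purchase; rows with no earlier purchase by the same user get NaN (shown as blank below).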

In [34]:
df6 = df.merge(df5, left_on=['userid', 'date'], right_index=True, how='left')
df6

Unnamed: 0,userid,product,price,website,date,avg_price_prior
0,123,xt23,20,Amazon,2015-01-12,13.5
1,123,q45,2,Amazon,2015-01-08,25.0
2,123,a89,25,NewEgg,2015-01-06,
3,77,q45,3,NewEgg,2015-01-03,
4,77,a89,30,NewEgg,2015-01-05,3.0
5,92,xt23,24,Amazon,2015-01-04,28.0
6,92,m33,60,Amazon,2015-01-07,26.0
7,92,a89,28,Amazon,2015-01-02,
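
For comparison, the same column can be computed without a self-merge. This is a sketch of a different technique (a per-user expanding mean, shifted by one row), not the method used above, and it assumes each user has at most one purchase per date. The names `ordered` and `prior_mean` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'userid': [123, 123, 123, 77, 77, 92, 92, 92],
    'price': [20, 2, 25, 3, 30, 24, 60, 28],
    'date': ['2015-01-12', '2015-01-08', '2015-01-06', '2015-01-03',
             '2015-01-05', '2015-01-04', '2015-01-07', '2015-01-02'],
})

# Put the purchases in chronological order; the original index labels are kept.
ordered = df.sort_values('date')

# Within each user: running mean of the prices so far, shifted down one row
# so that each purchase only sees the purchases before it.
prior_mean = ordered.groupby('userid')['price'].transform(
    lambda s: s.expanding().mean().shift(1)
)

# Assignment aligns on the index labels, so the values land on the right rows.
df['avg_price_prior'] = prior_mean
print(df)
```

The self-merge grows with the square of the number of purchases per user, while this version does a single pass per user, which may matter on larger tables.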


In [37]:
df6[df6.userid.isin([123,77])]

Unnamed: 0,userid,product,price,website,date,avg_price_prior
0,123,xt23,20,Amazon,2015-01-12,13.5
1,123,q45,2,Amazon,2015-01-08,25.0
2,123,a89,25,NewEgg,2015-01-06,
3,77,q45,3,NewEgg,2015-01-03,
4,77,a89,30,NewEgg,2015-01-05,3.0
