In general, pandas covers most of the split-apply-combine workflow of everyday data science through method chaining. Explicit piping is available, but it is often unnecessary because most operations are already chainable methods. I will go back and forth between the R and pandas versions of each example, but likely won't translate every example from the tidyverse chapter.

For a quick reference, see the pandas documentation's [comparison with R](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html).
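
As a rough illustration of what piping looks like in pandas, the sketch below mixes `DataFrame.pipe` with ordinary chained methods. The toy data frame and the `add_points_per_game` helper are made up for this illustration and are not part of the bball data.

```python
import pandas as pd

# Hypothetical toy data, standing in for a real table
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    'PTS': [100, 250, 72],
    'G': [10, 20, 9],
})

def add_points_per_game(df):
    """Return a copy of df with a points-per-game column (illustrative helper)."""
    return df.assign(PPG=df['PTS'] / df['G'])

result = (toy
          .pipe(add_points_per_game)   # roughly `toy %>% add_points_per_game()`
          .sort_values(by='PPG', ascending=False)
          .head(2)
)
print(result)
```

`pipe` mainly helps with user-defined functions; built-in operations such as `sort_values` and `head` chain without it.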


### Preliminaries

In [None]:
import pandas as pd
import numpy as np

# note that heavy use of R (via rpy2) in Anaconda notebooks tends to fail at some point
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()

In [None]:
## ----load_bball----------------------------------------------------------
# load('data/bball.RData')
# glimpse(bball[,1:5])

robjects.r['load']('../data/bball.RData')
bball = robjects.r.bball
# bball = pd.read_csv('../data/bball.csv')
bball.iloc[:, :5].info()  # first five columns; R's 1:5 is 1-based and inclusive

In [None]:
## ----select1-------------------------------------------------------------
# bball %>% 
#   select(Player, Tm, Pos) %>% 
#   head

(bball
 .loc[:,['Player', 'Tm', 'Pos']]
 .head()
)

# or
(bball[['Player', 'Tm', 'Pos']]
 .head()
)


In [None]:
## ----select2-------------------------------------------------------------
# bball %>%     
#   select(-Player, -Tm, -Pos)  %>% 
#   head

(bball
 .drop(columns=['Player', 'Tm', 'Pos'])
 .head()
)

The following example uses the tidyverse selection helpers `contains` and `ends_with`. Python offers the same operations as basic string methods (e.g. `str.contains`, `str.endswith`), but I haven't found a way to use them as cleanly in the pandaverse (e.g. with `filter` or `query`). The closest match is a regular expression passed to `filter`.

In [None]:
## ----select3-------------------------------------------------------------
# bball %>% 
#   select(Player, contains("3P"), ends_with("RB")) %>% 
#   arrange(desc(TRB)) %>% 
#   head

(bball
 .filter(regex = '^Player$|3P|RB$', axis = 'columns')  # Player, contains 3P, ends with RB; columns is the default
 .sort_values(by = 'TRB', ascending = False)
 .head()
)

# looks funny because we haven't filtered out the repeated headers yet
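
For comparison, here is a minimal sketch of the same selection using plain Python string methods instead of a regular expression. The one-row `toy` frame only borrows column names from the bball data; its values are made up.

```python
import pandas as pd

# Hypothetical one-row frame with bball-style column names
toy = pd.DataFrame(
    [['A', 1, 2, 3, 4, 7, 20]],
    columns=['Player', 'X3P', 'X3PA', 'ORB', 'DRB', 'TRB', 'PTS'],
)

# Build the list of columns to keep with ordinary string tests
keep = [c for c in toy.columns
        if c == 'Player' or '3P' in c or c.endswith('RB')]

selected = toy.loc[:, keep]
print(list(selected.columns))
```

This is more verbose than `select(Player, contains("3P"), ends_with("RB"))`, but each condition stays readable without regex syntax.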

### Filtering Rows

In [None]:
## ----filter0-------------------------------------------------------------
# bball = bball %>% 
#   filter(Rk != "Rk")

bball = (bball
         .query('Rk != "Rk"')
         .apply(pd.to_numeric, errors = 'ignore')
        )

# redo previous
(bball
 .filter(regex = '^Player$|3P|RB$', axis = 'columns')  # Player, contains 3P, ends with RB; columns is the default
 .sort_values(by = 'TRB', ascending = False)
 .head()
)
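
Recent pandas releases (2.2 and later) deprecate `errors='ignore'` in `pd.to_numeric`. One possible replacement, sketched here on made-up data, is a small helper that catches the conversion error and returns the column unchanged. `to_numeric_if_possible` is a hypothetical name, not a pandas function.

```python
import pandas as pd

def to_numeric_if_possible(col):
    """Convert a column to numeric, or return it unchanged if that fails."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

# Hypothetical data with a repeated header row, as in the raw bball table
toy = pd.DataFrame({
    'Rk': ['1', '2', 'Rk', '3'],
    'Player': ['A', 'B', 'Player', 'C'],
    'Age': ['36', '25', 'Age', '31'],
})

clean = (toy
         .query('Rk != "Rk"')            # drop the repeated header row
         .apply(to_numeric_if_possible)  # convert column by column
)
print(clean.dtypes)
```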

In [None]:
## ----filter1-------------------------------------------------------------
# bball %>% 
#   filter(Age > 35, Pos == "SF" | Pos == "PF") %>% 
#   distinct(Player, Pos, Age)     

(bball
 .query('Age > 35 & (Pos == "SF" | Pos == "PF")')
 .loc[:, ['Player', 'Pos', 'Age']]   # distinct() returns only the named columns
 .drop_duplicates()
)
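
The same filter can also be written with a boolean mask and `isin` rather than a query string. This is only a sketch on made-up rows, not the bball data.

```python
import pandas as pd

# Hypothetical rows; player 'A' is duplicated on purpose
toy = pd.DataFrame({
    'Player': ['A', 'A', 'B', 'C', 'D'],
    'Pos': ['SF', 'SF', 'PF', 'C', 'SF'],
    'Age': [36, 36, 38, 40, 30],
})

# Row condition: older than 35 and a forward
mask = (toy['Age'] > 35) & toy['Pos'].isin(['SF', 'PF'])

older_forwards = (toy
                  .loc[mask, ['Player', 'Pos', 'Age']]
                  .drop_duplicates()
)
print(older_forwards)
```

The mask version is handy when the condition uses Python variables or column names that are awkward inside a query string.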

In [None]:
## ----filter2-------------------------------------------------------------
# bball %>% 
#   slice(1:10)


bball.iloc[:10]

In [None]:
## ----uniteFilterArrange--------------------------------------------------
# bball %>% 
#   unite("posTeam", Pos, Tm) %>%         # create a new variable
#   filter(posTeam == "PF_SAS") %>%       # use it for filtering
#   select(Player, posTeam, Age) %>%      # use it for selection
#   arrange(desc(Age))                    # order 

(bball
 .assign(posTeam = bball.Pos + '_' + bball.Tm)
 .query('posTeam == "PF_SAS"')
 .loc[:, ['Player', 'posTeam', 'Age']]
 .sort_values(by = 'Age', ascending = False)
)

### Generating New Data

In [None]:
## ----mutateAt------------------------------------------------------------
# bball = bball %>% 
#   mutate_at(vars(-Player, -Pos, -Tm), funs(as.numeric))   

# glimpse(bball[,1:7])

# we already did this in the first 'filtering rows' example


In [None]:
## ----mutate--------------------------------------------------------------
# bball = bball %>% 
#   mutate(trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
#          effectiveFG = (FG + (.5 * X3P)) / FGA, 
#          shootingDif = trueShooting - FG.)

# summary(select(bball, shootingDif))  # select and others don't have to be piped to use

# the dot in 'FG.' rules out attribute access, so that column needs bracket indexing.
# callables passed to assign can use columns created earlier in the same call
# (pandas >= 0.23 with Python >= 3.6); see https://github.com/pandas-dev/pandas/issues/14207
# see also https://stackoverflow.com/questions/42496102/how-to-use-created-variable-in-same-assign-function-with-pandas
bball = (bball
         .assign(
             trueShooting = lambda d: d.PTS / (2 * (d.FGA + (.44 * d.FTA))),
             effectiveFG  = lambda d: (d.FG + .5*d.X3P) / d.FGA,
             shootingDif  = lambda d: d.trueShooting - d.loc[:, 'FG.'])
        )

bball.shootingDif.describe()

### Groupby

In [None]:
## ----groupby-------------------------------------------------------------
# bball %>%   
#   mutate(trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
#          effectiveFG = (FG + (.5 * X3P)) / FGA, 
#          shootingDif = trueShooting - FG.) %>%  
#   select(Player, Tm, Pos, MP, trueShooting, effectiveFG, PTS) %>% 
#   group_by(Pos) %>%                                                 
#   summarize(meanTrueShooting = mean(trueShooting, na.rm = TRUE)) 

(bball
 .assign(
     trueShooting = lambda d: d.PTS / (2 * (d.FGA + (.44 * d.FTA))),
     effectiveFG  = lambda d: (d.FG + .5*d.X3P) / d.FGA,
     shootingDif  = lambda d: d.trueShooting - d.loc[:, 'FG.']
 )
 .loc[:, ['Player', 'Tm', 'Pos', 'MP', 'trueShooting', 'effectiveFG', 'PTS']]
 .groupby('Pos')
 .agg({'trueShooting': 'mean'})   # mean skips NaN by default, like na.rm = TRUE
 .rename(columns={'trueShooting': 'meanTrueShooting'})
)
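
Named aggregation (pandas 0.25 and later) sets the output column name in the same step as the summary, which is closer to how `summarize` reads. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical values, including one missing entry
toy = pd.DataFrame({
    'Pos': ['C', 'C', 'PF', 'PF', 'SF'],
    'trueShooting': [0.50, 0.60, 0.55, np.nan, 0.58],
})

summary = (toy
           .groupby('Pos')
           .agg(meanTrueShooting=('trueShooting', 'mean'))  # NaN is skipped
)
print(summary)
```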

In [None]:
# pandas has no function named do; the closest analogue is groupby(...).apply(...),
# which runs an arbitrary function on each group.
# I also find do somewhat awkward in the R implementation and rarely useful compared
# to other approaches.

## ----do------------------------------------------------------------------
# bball %>% 
#   mutate(Pos = if_else(Pos=='PF-C', 'C', Pos)) %>% 
#   group_by(Pos) %>%     
#   do(FgFt_Corr=cor(.$FG., .$FT., use='complete')) %>% 
#   unnest(FgFt_Corr)

## ----do2-----------------------------------------------------------------
# library(nycflights13)
# carriers = group_by(flights, carrier)
# group_size(carriers)

# mods = do(carriers, model = lm(arr_delay ~ dep_time, data = .)) # reminder that data frames are lists
# mods %>% 
#   summarize(rsq = summary(model)$r.squared) %>% 

#   head
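
The first `do` example could be approximated with `groupby(...).apply(...)`, which runs a function on each group. The sketch below uses made-up shooting percentages rather than the bball data, and `fgft_corr` is just an illustrative name.

```python
import pandas as pd

# Hypothetical percentages; within each position FT. is a linear function of FG.
toy = pd.DataFrame({
    'Pos': ['C', 'C', 'C', 'PF-C', 'PF', 'PF', 'PF'],
    'FG.': [0.50, 0.55, 0.60, 0.52, 0.45, 0.50, 0.40],
    'FT.': [0.60, 0.65, 0.70, 0.62, 0.80, 0.75, 0.85],
})

fgft_corr = (toy
             .assign(Pos=lambda d: d['Pos'].replace({'PF-C': 'C'}))  # recode as in the R version
             .groupby('Pos')[['FG.', 'FT.']]
             .apply(lambda g: g['FG.'].corr(g['FT.']))  # one correlation per group
             .rename('FgFt_Corr')
             .reset_index()
)
print(fgft_corr)
```

`Series.corr` drops incomplete pairs before computing the Pearson correlation, which plays the role of `use='complete'` here.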

### Merge by id

In [None]:
## ----merge_demo
# band_members = data_frame(Name = c('Seth', 'Francis', 'Bubba'),
#                           Band = c('Com Truise', 'Pixies', 'The New Year'))
# band_instruments = data_frame(Name = c('Seth', 'Francis', 'Bubba'),
#                               Instrument = c('Synthesizer', 'Guitar', 'Guitar'))

# band_members
# band_instruments

# left_join(band_members, band_instruments)

band_members = pd.DataFrame({'Name' : ['Seth', 'Francis', 'Bubba'],
                             'Band' : ['Com Truise', 'Pixies', 'The New Year']
                            })
band_instruments = pd.DataFrame({'Name' : ['Seth', 'Francis', 'Bubba'],
                                 'Instrument' : ['Synthesizer', 'Guitar', 'Guitar']
                                })

band_members
band_instruments


band_members.merge(band_instruments, how = 'left', on = 'Name')  # merge defaults to an inner join

# alternative
# band_members = pd.DataFrame({'Band' : ['Com Truise', 'Pixies', 'The New Year']
#                             }, index = ['Seth', 'Francis', 'Bubba'])
# band_instruments = pd.DataFrame({'Instrument' : ['Synthesizer', 'Guitar', 'Guitar']}, 
#                                 index = ['Seth', 'Francis', 'Bubba'])
# band_members.join(band_instruments, how='left')
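
In the demo every name appears in both tables, so the join type makes no visible difference. The sketch below drops one row from a made-up instruments table to show how the default inner join and a left join diverge. The shortened table is an assumption for illustration only.

```python
import pandas as pd

members = pd.DataFrame({'Name': ['Seth', 'Francis', 'Bubba'],
                        'Band': ['Com Truise', 'Pixies', 'The New Year']})

# Hypothetical variant: no instrument recorded for Bubba
instruments = pd.DataFrame({'Name': ['Seth', 'Francis'],
                            'Instrument': ['Synthesizer', 'Guitar']})

inner = members.merge(instruments, on='Name')              # default how='inner'
left = members.merge(instruments, on='Name', how='left')   # like left_join

print(inner)
print(left)
```

The left join keeps all three members and fills the missing instrument with `NaN`, while the inner join silently drops the unmatched row.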


In [None]:
## ----gather_spread-------------------------------------------------------
# library(tidyr)
# stocks <- data.frame( time = as.Date('2009-01-01') + 0:9,
#                       X = rnorm(10, 0, 1),
#                       Y = rnorm(10, 0, 2),
#                       Z = rnorm(10, 0, 4) )
# stocks %>% head
# stocks %>% 
#   gather(stock, price, -time) %>% 
#   head

## ----tidyrSpread---------------------------------------------------------
# bball %>% 
#   separate(Player, into=c('firstName', 'lastName'), sep=' ') %>% 
#   select(1:5) %>% 
#   head

stocks = pd.DataFrame({
    'time' : pd.date_range('2009-01-01', periods=10),
    'X' : np.random.randn(10),
    'Y' : np.random.normal(0, 2, 10),
    'Z' : np.random.normal(0, 4, 10)
})

stocks.head()

In [None]:
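# name the key and value columns as in gather(stock, price, -time)
stocks_melt = stocks.melt(id_vars='time', var_name='stock', value_name='price')

stocks_melt.head()

In [None]:
# spread back to wide format; passing values avoids a MultiIndex on the columns
stocks_melt.pivot(index='time', columns='stock', values='price')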
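
The `separate` example above has no pandas counterpart in this notebook. One way to approximate it is `str.split` with `expand=True`, sketched here on made-up names rather than the bball data.

```python
import pandas as pd

# Hypothetical players
toy = pd.DataFrame({
    'Player': ['Ann Lee', 'Bob Van Stone', 'Cy Young'],
    'Pos': ['SF', 'C', 'PG'],
})

# Split on the first space only, so multi-word last names stay together
names = toy['Player'].str.split(' ', n=1, expand=True)
names.columns = ['firstName', 'lastName']

separated = pd.concat([names, toy.drop(columns='Player')], axis=1)
print(separated.head())
```

Limiting the split with `n=1` is a design choice of this sketch. It keeps everything after the first space in `lastName` instead of spreading it over extra columns.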