This document is a Python exploration of this R-based document: http://m-clark.github.io/data-processing-and-visualization/tidyverse.html. The code is written for learning and is not optimized for anything else. All of the explanatory content lives in the main document rather than here, so many sections are not included; I focus only on reproducing the code chunks.

In general, pandas has plenty to offer for the split-apply-combine workflow common in data science. Pipe-style method chaining is possible, but it is not always useful. I will bounce back and forth between styles to demonstrate the examples, but likely won't demo every example from the tidyverse chapter.

Refer to the [pandas doc](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html) comparison for quick reference.


## Preliminaries

In [1]:
import pandas as pd
import numpy as np

# note that doing much with R in anaconda notebooks will fail at some point
# import rpy2.robjects as robjects
# from rpy2.robjects.packages import importr
# from rpy2.robjects import r, pandas2ri
# pandas2ri.activate()

## Running Example

In [2]:
## ----load_bball----------------------------------------------------------
# load('data/bball.RData')
# glimpse(bball[,1:5])

#robjects.r['load']('../data/bball.RData')
#bball = robjects.r.bball
bball = pd.read_csv('../data/bball.csv')
bball.iloc[:, 1:5].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 619 entries, 0 to 618
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Player  619 non-null    object
 1   Pos     619 non-null    object
 2   Age     619 non-null    object
 3   Tm      619 non-null    object
dtypes: object(4)
memory usage: 19.5+ KB


**In the following examples, I'll show the R code in comments first, followed by the pandas approach.**

## Selecting Columns

Selecting columns works the same way as in the section on indexing. Here we will do it in a pipe-oriented fashion.

In [4]:
# bball %>% 
#   select(Player, Tm, Pos) %>% 
#   head

(
    bball
    .loc[:,['Player', 'Tm', 'Pos']]
    .head()
)

# or
(
    bball[['Player', 'Tm', 'Pos']]
    .head()
)


Unnamed: 0,Player,Tm,Pos
0,Alex Abrines,OKC,SG
1,Quincy Acy,TOT,PF
2,Quincy Acy,DAL,PF
3,Quincy Acy,BRK,PF
4,Steven Adams,OKC,C


In [6]:
# bball %>%     
#   select(-Player, -Tm, -Pos)  %>% 
#   head

(
    bball
    .drop(columns=['Player', 'Tm', 'Pos'])
    .head()
)

Unnamed: 0,Rk,Age,G,GS,MP,FG,FGA,FG.,X3P,X3PA,...,FT.,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,23,68,6,1055,134,341,0.393,94,247,...,0.898,18,68,86,40,37,8,33,114,406
1,2,26,38,1,558,70,170,0.412,37,90,...,0.75,20,95,115,18,14,15,21,67,222
2,2,26,6,0,48,5,17,0.294,1,7,...,0.667,2,6,8,0,0,0,2,9,13
3,2,26,32,1,510,65,153,0.425,36,83,...,0.754,18,89,107,18,14,15,19,58,209
4,3,23,80,80,2389,374,655,0.571,0,1,...,0.611,281,332,613,86,89,78,146,195,905


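The following example uses tidyverse helper functions. Equivalents are available as basic string functions in Python (e.g. `str.contains`), but I haven't found a way to use them as cleanly in the pandaverse (e.g. with `filter` or `query`). Here a regular expression passed to `filter` stands in for `contains` and `ends_with`; note that, unlike the R code, it does not keep `Player`.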

In [5]:
# bball %>% 
#   select(Player, contains("3P"), ends_with("RB")) %>% 
#   arrange(desc(TRB)) %>% 
#   head

(
    bball
    .filter(regex = '3P|RB$', axis = 'columns')  # columns is the default
    .sort_values(by = 'TRB', ascending = False)
    .head()
)

# looks funny because we haven't filtered out the repeated headers yet

Unnamed: 0,X3P,X3PA,X3P.,ORB,DRB,TRB
583,3P,3PA,3P%,ORB,DRB,TRB
507,3P,3PA,3P%,ORB,DRB,TRB
353,3P,3PA,3P%,ORB,DRB,TRB
47,3P,3PA,3P%,ORB,DRB,TRB
76,3P,3PA,3P%,ORB,DRB,TRB
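
For comparison, the tidyselect helpers can also be approximated with the string methods on the column index. The following is a minimal sketch on a made-up toy data frame (`toy` and its values are hypothetical stand-ins for `bball`). It combines boolean masks so that `Player` is kept alongside the `contains` and `ends_with` matches.

```python
import pandas as pd

# Hypothetical toy data standing in for bball; the values are made up
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    'X3P': [1, 5, 3],
    'X3PA': [4, 9, 8],
    'ORB': [10, 2, 7],
    'TRB': [30, 12, 21],
    'PTS': [100, 200, 150],
})

cols = toy.columns

# One boolean mask per helper: exact name, contains("3P"), ends_with("RB")
keep = (
    (cols == 'Player')
    | cols.str.contains('3P', regex=False)
    | cols.str.endswith('RB')
)

selected = (
    toy
    .loc[:, keep]
    .sort_values(by='TRB', ascending=False)
    .head()
)
```

This is wordier than the regex passed to `filter`, but each mask maps onto one R helper, which may be easier to read back later.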


### Filtering Rows

There are repeated header rows in this data, so we need to drop them. This is also why everything was a character string when we first scraped it: having any character strings in a column coerces the entire column to be character, since all elements of a vector need to be of the same type. Character string is chosen over other types because anything can be converted to a string, but not everything can be a number.

Filtering by rows requires the basic indexing knowledge we talked about before, especially Boolean indexing. In the following, Rk, or rank, is for all intents and purposes just a row id, but if it equals the actual text ‘Rk’ instead of something else, we know we’re dealing with a header row, so we’ll drop it.

In [8]:
# bball = bball %>% 
#   filter(Rk != "Rk")

bball = (
    bball
    .query('Rk != "Rk"')
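    # note: errors = 'ignore' leaves non-numeric columns as they are; it is deprecated as of pandas 2.2
    .apply(pd.to_numeric, errors = 'ignore')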
)

# redo previous
(
    bball
    .filter(regex = '3P|RB$', axis = 'columns')  # columns is the default
    .sort_values(by = 'TRB', ascending = False)
    .head()
)

Unnamed: 0,X3P,X3PA,X3P.,ORB,DRB,TRB
142,2,7,0.286,345,770,1115
304,0,2,0.0,298,816,1114
584,0,0,,293,795,1088
196,0,1,0.0,314,721,1035
550,101,275,0.367,296,711,1007
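
Since `errors = 'ignore'` is deprecated in recent pandas releases, here is a minimal sketch of the same cleanup without it. The helper name `to_numeric_if_possible` and the toy data are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data with a repeated header row, as in the scraped bball data
raw = pd.DataFrame({
    'Rk': ['1', '2', 'Rk', '3'],
    'Player': ['A', 'B', 'Player', 'C'],
    'PTS': ['10', '20', 'PTS', '30'],
})

def to_numeric_if_possible(col):
    """Return a numeric version of col, or col unchanged if conversion fails."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

clean = (
    raw
    .loc[raw['Rk'] != 'Rk']          # drop the repeated header rows
    .apply(to_numeric_if_possible)   # convert each column where possible
)
```

Columns such as `Player` that cannot be parsed as numbers come back untouched, which is the behavior the deprecated option provided.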


Say we want to look at forwards (SF or PF) over the age of 35. The following will do this, and since some players play on multiple teams, we’ll want only the unique information on the variables of interest. The `drop_duplicates` method allows us to do this.

In [None]:
# bball %>% 
#   filter(Age > 35, Pos == "SF" | Pos == "PF") %>% 
#   distinct(Player, Pos, Age)     

(
    bball
    .query('Age > 35 & (Pos == "SF"| Pos == "PF")')
    .drop_duplicates(subset = ['Player', 'Pos', 'Age'])
)
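
One difference worth knowing: `distinct(Player, Pos, Age)` in dplyr returns only those three columns, while `drop_duplicates(subset = ...)` keeps every column. A minimal sketch of a closer match, using boolean indexing and `isin` on made-up data (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; 'Vet One' appears twice because of a mid-season trade
toy = pd.DataFrame({
    'Player': ['Vet One', 'Vet One', 'Young Two', 'Vet Three'],
    'Pos': ['PF', 'PF', 'SF', 'C'],
    'Age': [38, 38, 24, 39],
    'Tm': ['AAA', 'BBB', 'CCC', 'DDD'],
})

is_old_forward = (toy['Age'] > 35) & toy['Pos'].isin(['SF', 'PF'])

old_forwards = (
    toy
    .loc[is_old_forward, ['Player', 'Pos', 'Age']]  # keep only the variables of interest
    .drop_duplicates()                              # then drop repeated rows
)
```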

In [9]:
# bball %>% 
#   slice(1:10)


bball.iloc[:10]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT.,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Alex Abrines,SG,23,OKC,68,6,1055,134,341,...,0.898,18,68,86,40,37,8,33,114,406
1,2,Quincy Acy,PF,26,TOT,38,1,558,70,170,...,0.75,20,95,115,18,14,15,21,67,222
2,2,Quincy Acy,PF,26,DAL,6,0,48,5,17,...,0.667,2,6,8,0,0,0,2,9,13
3,2,Quincy Acy,PF,26,BRK,32,1,510,65,153,...,0.754,18,89,107,18,14,15,19,58,209
4,3,Steven Adams,C,23,OKC,80,80,2389,374,655,...,0.611,281,332,613,86,89,78,146,195,905
5,4,Arron Afflalo,SG,31,SAC,61,45,1580,185,420,...,0.892,9,116,125,78,21,6,42,104,515
6,5,Alexis Ajinca,C,28,NOP,39,15,584,89,178,...,0.725,46,131,177,12,20,22,31,77,207
7,6,Cole Aldrich,C,28,MIN,62,0,531,45,86,...,0.682,51,107,158,25,25,23,17,85,105
8,7,LaMarcus Aldridge,PF,31,SAS,72,72,2335,500,1049,...,0.812,172,351,523,139,46,88,98,158,1243
9,8,Lavoy Allen,PF,27,IND,61,5,871,77,168,...,0.697,105,114,219,57,18,24,29,78,177


We can use filtering even with variables just created, but this isn't quite as clean as with R.

In [10]:
# bball %>% 
#   unite("posTeam", Pos, Tm) %>%         # create a new variable
#   filter(posTeam == "PF_SAS") %>%       # use it for filtering
#   select(Player, posTeam, Age) %>%      # use it for selection
#   arrange(desc(Age))                    # order 

(
    bball
    .assign(posTeam = bball.Pos + '_' + bball.Tm)
    .query('posTeam == "PF_SAS"')
    .loc[:, ['Player', 'posTeam', 'Age']]
    .sort_values(by = 'Age', ascending = False)
)

Unnamed: 0,Player,posTeam,Age
328,David Lee,PF_SAS,33
8,LaMarcus Aldridge,PF_SAS,31
51,Davis Bertans,PF_SAS,24


### Generating New Data

One of the most common data processing tasks is generating new variables.  So let's see how we can go about that with pandas.

In [None]:
# bball = bball %>% 
#   mutate_at(vars(-Player, -Pos, -Tm), funs(as.numeric))   

# glimpse(bball[,1:7])

# we already did this in the first 'filtering rows' example


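Within a single `assign` call, a column that was just created can only be referenced through a callable such as a `lambda`, because an expression like `bball.trueShooting` is evaluated against the original data frame, which does not have the new column yet. Using a second `assign` call for the dependent variable keeps things readable.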

In [18]:
# bball = bball %>% 
#   mutate(trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
#          effectiveFG = (FG + (.5 * X3P)) / FGA, 
#          shootingDif = trueShooting - FG.)

# summary(select(bball, shootingDif))  # select and others don't have to be piped to use

# see also https://stackoverflow.com/questions/42496102/how-to-use-created-variable-in-same-assign-function-with-pandas
bball = (
    bball
    .assign(
        trueShooting = bball.PTS / (2 * (bball.FGA + (.44 * bball.FTA))),
        effectiveFG = (bball.FG + .5*bball.X3P) / bball.FGA
    )
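    # the lambda receives the data frame with the columns created above;
    # bball.trueShooting would fail here because bball itself has no such column yet.
    # FG. needs bracket indexing because of the dot in its name.
    .assign(shootingDif  = lambda d: d.trueShooting - d['FG.'])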
)

bball.shootingDif.describe()

count    593.000000
mean       0.085550
std        0.056424
min       -0.468085
25%        0.052019
50%        0.090717
75%        0.117596
max        0.397872
Name: shootingDif, dtype: float64
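
As far as I know, since pandas 0.23 the keyword arguments to `assign` are evaluated in order, so a callable can refer to a column created earlier in the same call. A minimal sketch on made-up numbers (the `toy` frame is hypothetical, not real player data):

```python
import pandas as pd

# Hypothetical toy values for illustration only
toy = pd.DataFrame({
    'PTS': [406.0, 222.0],
    'FGA': [341.0, 170.0],
    'FTA': [49.0, 60.0],
    'FG.': [0.393, 0.412],
})

out = toy.assign(
    # each lambda receives the data frame as it stands at that point
    trueShooting=lambda d: d['PTS'] / (2 * (d['FGA'] + 0.44 * d['FTA'])),
    # ...so this one can already see trueShooting
    shootingDif=lambda d: d['trueShooting'] - d['FG.'],
)
```

Whether this reads better than two separate `assign` calls is a matter of taste.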

### Groupby

In [None]:
## ----groupby-------------------------------------------------------------
# bball %>%   
#   mutate(trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
#          effectiveFG = (FG + (.5 * X3P)) / FGA, 
#          shootingDif = trueShooting - FG.) %>%  
#   select(Player, Tm, Pos, MP, trueShooting, effectiveFG, PTS) %>% 
#   group_by(Pos) %>%                                                 
#   summarize(meanTrueShooting = mean(trueShooting, na.rm = TRUE)) 

(bball
 .assign(
     trueShooting = bball.PTS / (2 * (bball.FGA + (.44 * bball.FTA))),
     effectiveFG  = (bball.FG + .5*bball.X3P) / bball.FGA,
     # lambda so that trueShooting from this same assign call is visible
     shootingDif  = lambda d: d.trueShooting - d.loc[:, 'FG.']
 )
 .loc[:, ['Player', 'Tm', 'Pos', 'MP', 'trueShooting', 'effectiveFG', 'PTS']]
 .groupby('Pos')
 .agg({'trueShooting': 'mean'})
 .rename(columns={'trueShooting': 'trueShooting_mean'})
)
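
The `agg` plus `rename` steps can also be collapsed with named aggregation, which to my knowledge is available from pandas 0.25 onward. A minimal sketch on made-up data (the `toy` frame is hypothetical); like `na.rm = TRUE` in R, `mean` skips missing values by default.

```python
import pandas as pd

# Hypothetical toy data; one missing value to show that mean skips NaN
toy = pd.DataFrame({
    'Pos': ['C', 'C', 'PF', 'PF'],
    'trueShooting': [0.6, 0.5, 0.52, float('nan')],
})

summary = (
    toy
    .groupby('Pos', as_index=False)
    .agg(meanTrueShooting=('trueShooting', 'mean'))  # new_name=(column, function)
)
```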

In [None]:
# The closest pandas analogue to do that I know of is groupby followed by apply, which
# runs an arbitrary function on each group. I find do somewhat awkward in the R
# implementation and rarely useful compared to other approaches.

## ----do------------------------------------------------------------------
# bball %>% 
#   mutate(Pos = if_else(Pos=='PF-C', 'C', Pos)) %>% 
#   group_by(Pos) %>%     
#   do(FgFt_Corr=cor(.$FG., .$FT., use='complete')) %>% 
#   unnest(FgFt_Corr)

## ----do2-----------------------------------------------------------------
# library(nycflights13)
# carriers = group_by(flights, carrier)
# group_size(carriers)

# mods = do(carriers, model = lm(arr_delay ~ dep_time, data = .)) # reminder that data frames are lists
# mods %>% 
#   summarize(rsq = summary(model)$r.squared) %>% 
#   head
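
A rough pandas counterpart to the first `do` chunk is `groupby` followed by `apply`. The following is a minimal sketch on made-up numbers (the `toy` frame is hypothetical and constructed so the correlations are easy to check), not a full port of the R code.

```python
import pandas as pd

# Hypothetical toy data: within C the two columns rise together,
# within PF they move in opposite directions
toy = pd.DataFrame({
    'Pos': ['C', 'C', 'PF-C', 'PF', 'PF', 'PF'],
    'FG.': [0.5, 0.6, 0.7, 0.4, 0.5, 0.6],
    'FT.': [0.6, 0.7, 0.8, 0.9, 0.8, 0.7],
})

fg_ft_corr = (
    toy
    .assign(Pos=lambda d: d['Pos'].replace('PF-C', 'C'))  # like if_else(Pos == 'PF-C', 'C', Pos)
    .groupby('Pos')[['FG.', 'FT.']]                       # only the columns the function needs
    .apply(lambda g: g['FG.'].corr(g['FT.']))             # one correlation per group
)
```

The result is a Series indexed by `Pos`, so no `unnest` step is needed.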

### Merge by id

In [None]:
## ----merge_demo
# band_members = data_frame(Name = c('Seth', 'Francis', 'Bubba'),
#                           Band = c('Com Truise', 'Pixies', 'The New Year'))
# band_instruments = data_frame(Name = c('Seth', 'Francis', 'Bubba'),
#                               Instrument = c('Synthesizer', 'Guitar', 'Guitar'))

# band_members
# band_instruments

# left_join(band_members, band_instruments)

band_members = pd.DataFrame({'Name' : ['Seth', 'Francis', 'Bubba'],
                             'Band' : ['Com Truise', 'Pixies', 'The New Year']
                            })
band_instruments = pd.DataFrame({'Name' : ['Seth', 'Francis', 'Bubba'],
                               'Instrument' : ['Synthesizer', 'Guitar', 'Guitar']
                            })

band_members
band_instruments


# merge defaults to an inner join, so how = 'left' is needed to match left_join
band_members.merge(band_instruments, how = 'left')

# alternative
# band_members = pd.DataFrame({'Band' : ['Com Truise', 'Pixies', 'The New Year']
#                             }, index = ['Seth', 'Francis', 'Bubba'])
# band_instruments = pd.DataFrame({'Instrument' : ['Synthesizer', 'Guitar', 'Guitar']}, 
#                                 index = ['Seth', 'Francis', 'Bubba'])
# band_members.join(band_instruments, how='left')
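
Because every name matches in the data above, inner and left joins give the same result there. A minimal sketch of the difference, with one extra member added purely for illustration (the fourth row is a hypothetical addition to the original example):

```python
import pandas as pd

members = pd.DataFrame({
    'Name': ['Seth', 'Francis', 'Bubba', 'Kim'],  # 'Kim' is a hypothetical extra row
    'Band': ['Com Truise', 'Pixies', 'The New Year', 'Pixies'],
})
instruments = pd.DataFrame({
    'Name': ['Seth', 'Francis', 'Bubba'],
    'Instrument': ['Synthesizer', 'Guitar', 'Guitar'],
})

inner = members.merge(instruments)             # default how='inner' drops the unmatched row
left = members.merge(instruments, how='left')  # keeps it, with a missing Instrument
```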


In [None]:
## ----gather_spread-------------------------------------------------------
# library(tidyr)
# stocks <- data.frame( time = as.Date('2009-01-01') + 0:9,
#                       X = rnorm(10, 0, 1),
#                       Y = rnorm(10, 0, 2),
#                       Z = rnorm(10, 0, 4) )
# stocks %>% head
# stocks %>% 
#   gather(stock, price, -time) %>% 
#   head

## ----tidyrSpread---------------------------------------------------------
# bball %>% 
#   separate(Player, into=c('firstName', 'lastName'), sep=' ') %>% 
#   select(1:5) %>% 
#   head

stocks = pd.DataFrame({
    'time' : pd.date_range('2009-01-01', periods=10),
    'X' : np.random.randn(10),
    'Y' : np.random.normal(0, 2, 10),
    'Z' : np.random.normal(0, 4, 10)
})

stocks.head()
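
The `separate` call in the R comments above has no pandas counterpart in that cell. A minimal sketch using `str.split` with `expand = True`, on three player names taken from the output shown earlier. The `n = 1` choice is mine: it puts everything after the first space into `lastName`, which is not quite what `separate` does with extra pieces.

```python
import pandas as pd

players = pd.DataFrame({
    'Player': ['Alex Abrines', 'Quincy Acy', 'Steven Adams'],
    'Pos': ['SG', 'PF', 'C'],
})

# Split on the first space only; expand=True returns one column per piece
names = players['Player'].str.split(' ', n=1, expand=True)
names.columns = ['firstName', 'lastName']

separated = pd.concat([names, players.drop(columns='Player')], axis=1)
```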

In [None]:
stocks_melt = stocks.melt(id_vars='time') 

stocks_melt

In [None]:
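# values= picks the column to spread and avoids a MultiIndex on the resulting columns
stocks_melt.pivot(index='time', columns='variable', values='value')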
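
To mirror `gather(stock, price, -time)` more closely, `melt` accepts `var_name` and `value_name`. The following is a minimal sketch of the full round trip, with a seeded generator so that it is reproducible (the seed and the names `long` and `wide` are my own choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the example is reproducible

stocks = pd.DataFrame({
    'time': pd.date_range('2009-01-01', periods=10),
    'X': rng.normal(0, 1, 10),
    'Y': rng.normal(0, 2, 10),
    'Z': rng.normal(0, 4, 10),
})

# wide -> long, naming the key and value columns as in the R call
long = stocks.melt(id_vars='time', var_name='stock', value_name='price')

# long -> wide again, then tidy up the index and the column-index name
wide = (
    long
    .pivot(index='time', columns='stock', values='price')
    .reset_index()
)
wide.columns.name = None
```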