In [5]:
# Be sure you installed dfply
!pip install dfply



## <font color="red"> With the current version of anaconda3 [2022.05], this won't work on newer M1 Macs.  If this describes you, please skip over all R code/installations.</font>

In [8]:
#First, make sure you have R installed ... this could take a while ;P
!brew install R

We do not provide support for this pre-release version.
You will encounter build failures with some formulae.
Please create pull requests instead of asking for help on Homebrew's GitHub,
Twitter or any other official channels. You are responsible for resolving
any issues you experience while you are running this
pre-release version.

Error: gcc: the bottle needs the Apple Command Line Tools to be installed.
  You can install them, if desired, with:
    xcode-select --install

You can try to install from source with:
  brew install --build-from-source gcc
Please note building from source is unsupported. You will encounter build
failures with some formulae. If you experience any issues please create pull
requests instead of asking for help on Homebrew's GitHub, Twitter or any other
official channels.


In [10]:
# Next, we install rpy2 to allow running R code in a notebook
!pip install rpy2



In [11]:
#Load rpy2 and R magic commands
import rpy2
%load_ext rpy2.ipython

ModuleNotFoundError: No module named 'rpy2'

# Select, Filter, and Mutate

In this lecture, we will look at three important actions used to process data frames.  While each framework uses different names for these functions, we will use the names from the `R` library `dplyr`, namely `select`, `mutate`, and `filter`.  The most important takeaway will be that, regardless of framework or scale, we can process data frames in the same way by applying the same sequence of data verbs.

## R and Python can interact!

In [12]:
import warnings
warnings.filterwarnings('ignore')

In [14]:
%%R
rnorm(5, 2, 3)

UsageError: Cell magic `%%R` not found.


## We love dplyr!

In [15]:
%%R 
library(dplyr)
artists <- read.csv('./data/Artists.csv')

(artists %>%
  select(BeginDate, 
         DisplayName, 
         Nationality) %>%
  filter(BeginDate > 0) %>%
  head) -> output
output

UsageError: Cell magic `%%R` not found.


## What makes `dplyr` so great?

* Focus on data verbs
* Pipes lead to code that is
    * More readable
    * Easy to compose and debug

## Set up

Let's read in a data set with `pandas` and import the `dfply` verbs.

In [8]:
import pandas as pd
from dfply import *
heroes = pd.read_csv('./data/heroes_information.csv')
heroes.head()

| | Unnamed: 0 | name | Gender | Eye color | Race | Hair color | Height | Publisher | Skin color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A-Bomb | Male | yellow | Human | No Hair | 203.0 | Marvel Comics | - | good | 441.0 |
| 1 | 1 | Abe Sapien | Male | blue | Icthyo Sapien | No Hair | 191.0 | Dark Horse Comics | blue | good | 65.0 |
| 2 | 2 | Abin Sur | Male | blue | Ungaran | No Hair | 185.0 | DC Comics | red | good | 90.0 |
| 3 | 3 | Abomination | Male | green | Human / Radiation | No Hair | 203.0 | Marvel Comics | - | bad | 441.0 |
| 4 | 4 | Abraxas | Male | blue | Cosmic Entity | Black | -99.0 | Marvel Comics | - | bad | -99.0 |


## Selecting Columns

<img src="./img/select.png">

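The first verb, `select`,

* keeps only the chosen *columns*
* corresponds to the `SELECT` clause at the core of `SQL` statements

## How to select

* Pipe (`>>`) into `select`
* Reference columns with `X.column_name`, `X['column name']`, or a plain string such as `'Eye color'`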

In [9]:
(heroes
 >> select(X.name, 
           X['Gender'],
           'Eye color'
          )
 >> head
)

| | name | Gender | Eye color |
|---|---|---|---|
| 0 | A-Bomb | Male | yellow |
| 1 | Abe Sapien | Male | blue |
| 2 | Abin Sur | Male | blue |
| 3 | Abomination | Male | green |
| 4 | Abraxas | Male | blue |
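
For comparison, the same column selection can be written in plain `pandas` without `dfply`. This is a minimal sketch, assuming a small hand-typed sample (`heroes_sample`, a hypothetical name) copied from the rows shown above rather than the full CSV.

```python
import pandas as pd

# Hypothetical sample: a few rows typed in from the output shown above.
heroes_sample = pd.DataFrame({
    "name": ["A-Bomb", "Abe Sapien", "Abin Sur"],
    "Gender": ["Male", "Male", "Male"],
    "Eye color": ["yellow", "blue", "blue"],
    "Weight": [441.0, 65.0, 90.0],
})

# Plain-pandas counterpart of `select`: keep only the listed columns.
selected = heroes_sample.loc[:, ["name", "Gender", "Eye color"]]
print(selected)
```

The list of column names plays the role of the arguments to `select`; columns that are not listed (here `Weight`) are dropped from the result.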


## Filtering Rows

<img src="./img/filter.png">

The next verb, `filter` 

* filters the *rows*
* is related to the `SQL` `WHERE` clause

## How to filter

* Pipe (`>>`) into `filter_by`, the `dfply` name for `filter`
* First argument is a boolean expression
* Reference columns with `X.column_name` or `X['column name']`

In [10]:
(heroes 
 >> filter_by(X.Gender == 'Male') 
 >> head
)

| | Unnamed: 0 | name | Gender | Eye color | Race | Hair color | Height | Publisher | Skin color | Alignment | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A-Bomb | Male | yellow | Human | No Hair | 203.0 | Marvel Comics | - | good | 441.0 |
| 1 | 1 | Abe Sapien | Male | blue | Icthyo Sapien | No Hair | 191.0 | Dark Horse Comics | blue | good | 65.0 |
| 2 | 2 | Abin Sur | Male | blue | Ungaran | No Hair | 185.0 | DC Comics | red | good | 90.0 |
| 3 | 3 | Abomination | Male | green | Human / Radiation | No Hair | 203.0 | Marvel Comics | - | bad | 441.0 |
| 4 | 4 | Abraxas | Male | blue | Cosmic Entity | Black | -99.0 | Marvel Comics | - | bad | -99.0 |
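
Since every row in the output above is `Male`, the effect of a filter is easier to see on a column that varies. The sketch below uses plain `pandas` boolean indexing on `Alignment`, with a hypothetical hand-typed sample (`heroes_sample`) taken from the same five rows.

```python
import pandas as pd

# Hypothetical sample: the five rows shown above, reduced to three columns.
heroes_sample = pd.DataFrame({
    "name": ["A-Bomb", "Abe Sapien", "Abin Sur", "Abomination", "Abraxas"],
    "Publisher": ["Marvel Comics", "Dark Horse Comics", "DC Comics",
                  "Marvel Comics", "Marvel Comics"],
    "Alignment": ["good", "good", "good", "bad", "bad"],
})

# The boolean expression yields one True/False value per row ...
is_good = heroes_sample["Alignment"] == "good"

# ... and indexing with it keeps only the rows marked True.
good_heroes = heroes_sample[is_good]
print(good_heroes)
```

The boolean Series `is_good` is the plain-`pandas` analogue of the expression passed to `filter_by`.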


## Chaining Data Verbs

* Processing a data frame amounts to chaining data verbs
* Chains are written with pipes (`>>` here) or dot-chains
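
To build some intuition for how a `>>` pipe can work in Python, here is a toy sketch based on the `__rrshift__` hook. It is not `dfply`'s actual source; `Verb`, `keep_if`, and `pick` are hypothetical names, and the "data frame" is just a list of dictionaries.

```python
class Verb:
    """Wrap a function so that `data >> verb` calls the function on `data`."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python tries the right operand's __rrshift__ when the left
        # operand (here a list) does not implement `>>` itself.
        return self.func(data)


def keep_if(predicate):
    # Toy row filter: keep the rows for which the predicate is true.
    return Verb(lambda rows: [row for row in rows if predicate(row)])


def pick(*keys):
    # Toy column selector: keep only the named keys in each row.
    return Verb(lambda rows: [{k: row[k] for k in keys} for row in rows])


rows = [
    {"name": "A-Bomb", "Gender": "Male", "Weight": 441.0},
    {"name": "Abe Sapien", "Gender": "Male", "Weight": 65.0},
    {"name": "Abraxas", "Gender": "Male", "Weight": -99.0},
]

# `>>` is left-associative, so each verb receives the previous result.
result = rows >> keep_if(lambda r: r["Weight"] > 0) >> pick("name", "Weight")
print(result)
```

Because each verb takes a table and returns a table, verbs can be chained in any order and any intermediate step can be inspected by cutting the chain short.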

## Example 1 - `select` + `filter`

In [11]:
(heroes 
 >> filter_by(X.Gender == 'Male') 
 >> select(X.name, X.Gender, X.Weight) 
 >> head
)

| | name | Gender | Weight |
|---|---|---|---|
| 0 | A-Bomb | Male | 441.0 |
| 1 | Abe Sapien | Male | 65.0 |
| 2 | Abin Sur | Male | 90.0 |
| 3 | Abomination | Male | 441.0 |
| 4 | Abraxas | Male | -99.0 |


## Example 2 - `filter` + `filter`

Note that chaining `filter`s is an `and` operation: a row is kept only if it passes every filter in the chain.

####  `pandas` + `dfply`

In [13]:
(heroes >>
   select(X.name, X.Gender, X.Weight) >>
   filter_by(X.Gender == 'Male') >>
   filter_by(X.Weight > 0) >>
   head)

| | name | Gender | Weight |
|---|---|---|---|
| 0 | A-Bomb | Male | 441.0 |
| 1 | Abe Sapien | Male | 65.0 |
| 2 | Abin Sur | Male | 90.0 |
| 3 | Abomination | Male | 441.0 |
| 5 | Absorbing Man | Male | 122.0 |
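
The claim that chained filters act as an `and` can be checked directly. The following sketch uses plain `pandas` on a hypothetical hand-typed sample (`heroes_sample`) and compares two successive filters with a single filter that combines both conditions with `&`.

```python
import pandas as pd

# Hypothetical sample: rows typed in from the outputs shown above.
heroes_sample = pd.DataFrame({
    "name": ["A-Bomb", "Abe Sapien", "Abraxas", "Absorbing Man"],
    "Gender": ["Male", "Male", "Male", "Male"],
    "Weight": [441.0, 65.0, -99.0, 122.0],
})

# Two filters applied one after the other ...
two_steps = heroes_sample[heroes_sample["Gender"] == "Male"]
two_steps = two_steps[two_steps["Weight"] > 0]

# ... versus one filter whose conditions are joined with `&` (logical and).
one_step = heroes_sample[
    (heroes_sample["Gender"] == "Male") & (heroes_sample["Weight"] > 0)
]

# Both routes keep the same rows.
print(two_steps.equals(one_step))
```

Note the parentheses around each condition in the combined version: `&` binds more tightly than `==` and `>`, so they are required.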


## <font color="red"> Exercise 2.2.1: Blue-eyed Heroes </font>

Create a query that

1. Selects the `name`, `Gender`, and `Eye color` columns
2. Filters to the rows where `Eye color` is `'blue'`

In [14]:
# Your code here

## Constructing New Columns

The third verb, `mutate` 

* Creates new columns
* Changes existing columns

## How to mutate

* Pipe (`>>`) into `mutate`
* Pass keyword arguments of the form `new_col = expression`
* Reference columns with `X.column_name` or `X['column name']`

## Example 3 - Converting Weight to kilograms

Currently, the weight column is in pounds.  Let's convert to kilograms.

In [15]:
(heroes 
 >> select(X.name, 
           X.Gender, 
           X.Weight) 
 >> mutate(Weight_kg = X.Weight/2.2046) 
 >> head
)

| | name | Gender | Weight | Weight_kg |
|---|---|---|---|---|
| 0 | A-Bomb | Male | 441.0 | 200.036288 |
| 1 | Abe Sapien | Male | 65.0 | 29.483807 |
| 2 | Abin Sur | Male | 90.0 | 40.823732 |
| 3 | Abomination | Male | 441.0 | 200.036288 |
| 4 | Abraxas | Male | -99.0 | -44.906105 |
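
In plain `pandas`, the closest counterpart to `mutate` is `DataFrame.assign`. The sketch below repeats the conversion from this example on a hypothetical hand-typed sample (`heroes_sample`), using the same divisor, 2.2046.

```python
import pandas as pd

# Hypothetical sample: rows typed in from the output shown above.
heroes_sample = pd.DataFrame({
    "name": ["A-Bomb", "Abe Sapien", "Abin Sur"],
    "Gender": ["Male", "Male", "Male"],
    "Weight": [441.0, 65.0, 90.0],
})

# `assign` returns a new data frame with the extra column;
# the original `heroes_sample` is left unchanged.
converted = heroes_sample.assign(Weight_kg=heroes_sample["Weight"] / 2.2046)
print(converted)
```

Like `mutate`, `assign` takes the new column name as a keyword and does not modify the data frame it is called on.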


## Referencing a new column

Each framework provides a way to reference a new column.

* **Create:** Use `mutate(new_col = ...)`
* **Later reference:** Use `X.new_col` or `X['new_col']`

## Example 4 - Converting Weight to kilograms and filtering

Let's find all heroes with a weight under 100 kg.

In [16]:
(heroes 
 >> select(X.name, X.Gender, X.Weight) 
 >> mutate(Weight_kg = X.Weight/2.2046) 
 >> filter_by(X.Weight_kg < 100) 
 >> head
)

| | name | Gender | Weight | Weight_kg |
|---|---|---|---|---|
| 1 | Abe Sapien | Male | 65.0 | 29.483807 |
| 2 | Abin Sur | Male | 90.0 | 40.823732 |
| 4 | Abraxas | Male | -99.0 | -44.906105 |
| 5 | Absorbing Man | Male | 122.0 | 55.338837 |
| 6 | Adam Monroe | Male | -99.0 | -44.906105 |
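
In a plain `pandas` dot-chain, a column created earlier in the chain can be referenced by passing a function (often a `lambda`) that receives the intermediate data frame. This is a minimal sketch on a hypothetical hand-typed sample (`heroes_sample`), not a replacement for the `dfply` version.

```python
import pandas as pd

# Hypothetical sample: rows typed in from the outputs shown above.
heroes_sample = pd.DataFrame({
    "name": ["A-Bomb", "Abe Sapien", "Abin Sur", "Abraxas"],
    "Gender": ["Male", "Male", "Male", "Male"],
    "Weight": [441.0, 65.0, 90.0, -99.0],
})

light = (
    heroes_sample
    # Create the new column; the lambda receives the current data frame.
    .assign(Weight_kg=lambda df: df["Weight"] / 2.2046)
    # Reference the new column in a later step of the same chain.
    .loc[lambda df: df["Weight_kg"] < 100]
)
print(light)
```

As in the output above, rows with a `Weight` of -99.0 pass the filter. That value looks like a placeholder for a missing weight (an assumption, not something the data set documentation is quoted on here), so a `Weight > 0` filter may be worth adding first.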


## <font color="red"> Exercise 2.2.2: Tall Heroes </font>

Create a query that

1. Selects the `name`, `Gender`, and `Height` columns
2. Computes the height in inches as a new column, `height_in`.
    * Check [here](https://www.kaggle.com/claudiodavi/superhero-set) to determine the current units.
3. Filters on `height_in > 72`

In [None]:
# Your code here