# Select, Filter, and Mutate in `pyspark`

In this lecture, we look at three core operations for processing data frames. Each framework uses different names for these operations; we will use the names from the `R` library `dplyr`, namely `select`, `mutate`, and `filter`. The key takeaway is that, regardless of framework or scale, we can process data frames in the same way by applying the same sequence of data verbs.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, mean

spark1 = SparkSession.builder.appName('Ops').getOrCreate()
df_spark = spark1.read.csv('data/heroes_information.csv', inferSchema=True, header=True)

## Selecting Columns

The first verb, `select`

* filters the *columns*
* is at the core of `SQL` statements (the `SELECT` clause)

In [2]:
from more_pyspark import to_pandas
from pyspark.sql.functions import col

pyspark_result = (df_spark.
                    select(df_spark.name, # Column via dataframe.name
                           col('Gender'), # Column expression (lazy)
                           'Weight'). # String
                    take(5))
pyspark_result >> to_pandas

Unnamed: 0,name,Gender,Weight
0,A-Bomb,Male,441.0
1,Abe Sapien,Male,65.0
2,Abin Sur,Male,90.0
3,Abomination,Male,441.0
4,Abraxas,Male,-99.0


## Filtering Rows

The next verb, `filter` 

* filters the *rows*
* is related to the `SQL` `WHERE` clause
* `pyspark`: Use the `where` method

#### `where` in `pyspark` using `dataframe.col_name`

In [3]:
f_result = (df_spark
            .where(df_spark.Gender == 'Male')
            .take(5))
f_result >> to_pandas

Unnamed: 0,_c0,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,-,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
3,3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,-,bad,441.0
4,4,Abraxas,Male,blue,Cosmic Entity,Black,-99.0,Marvel Comics,-,bad,-99.0


#### `where` in `pyspark` using column expression

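`col('name')` is lazy: it refers to a column by name without being tied to a particular data frame, and it is only resolved when applied to one. It is analogous to `X.name` or `X['name']` in `dfply`.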

In [4]:
col('Gender') == 'Male'

Column<'(Gender = Male)'>

In [5]:
f_result = (df_spark
            .where(col('Gender') == 'Male')
            .take(5))
f_result >> to_pandas

Unnamed: 0,_c0,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,-,good,441.0
1,1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
3,3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,-,bad,441.0
4,4,Abraxas,Male,blue,Cosmic Entity,Black,-99.0,Marvel Comics,-,bad,-99.0


## Chaining Data Verbs

* Processing df $\rightarrow$ chaining data verbs
* Accomplished through dot-chains

## Example 1 - `select` + `filter`

In [6]:
sf_result = (df_spark.
            where(df_spark.Gender == 'Male').
            select(df_spark.name, 
                   df_spark.Gender, 
                   df_spark.Weight).
            take(5))
sf_result >> to_pandas

Unnamed: 0,name,Gender,Weight
0,A-Bomb,Male,441.0
1,Abe Sapien,Male,65.0
2,Abin Sur,Male,90.0
3,Abomination,Male,441.0
4,Abraxas,Male,-99.0


## Example 2 - `filter` + `filter`

Note that chaining `filter`s is an `and` operation.

In [7]:
ff_result = (df_spark.
               select(df_spark.name, 
                      df_spark.Gender, 
                      df_spark.Weight).
               where(df_spark.Gender == 'Male').
               where(df_spark.Weight > 0).
               take(5))
ff_result >> to_pandas

Unnamed: 0,name,Gender,Weight
0,A-Bomb,Male,441.0
1,Abe Sapien,Male,65.0
2,Abin Sur,Male,90.0
3,Abomination,Male,441.0
4,Absorbing Man,Male,122.0


## <font color="red"> Exercise 1: Blue-eyed Heroes </font>

Create a query that

1. Selects the `name`, `Gender`, and `Eye color` columns
2. Filters on `Eye color` == 'blue'

In [8]:
# Your code here
blue_eyed = (df_spark
             .select(col('name'), col('Gender'), col('Eye Color'))
             .where(col('Eye Color') == 'blue')
             .take(5)
            )
blue_eyed >> to_pandas

Unnamed: 0,name,Gender,Eye Color
0,Abe Sapien,Male,blue
1,Abin Sur,Male,blue
2,Abraxas,Male,blue
3,Absorbing Man,Male,blue
4,Adam Monroe,Male,blue


## Constructing New Columns

The third verb, `mutate` 

* Creates new columns
* Changes existing columns
* `pyspark`: Use the `withColumn` method

## Example 3 - Converting Weight to kilograms

Currently, the weight column is in pounds.  Let's convert to kilograms.

#### Using `df.col_name`

In [9]:
m_result = (df_spark.
              select(df_spark.name, 
                     df_spark.Gender, 
                     df_spark.Weight).
              withColumn('Weight_kg', df_spark.Weight/2.2046).
              take(5))
m_result >> to_pandas

Unnamed: 0,name,Gender,Weight,Weight_kg
0,A-Bomb,Male,441.0,200.036288
1,Abe Sapien,Male,65.0,29.483807
2,Abin Sur,Male,90.0,40.823732
3,Abomination,Male,441.0,200.036288
4,Abraxas,Male,-99.0,-44.906105


#### Using `col('name')`

In [10]:
m_result = (df_spark.
              select(df_spark.name, 
                     df_spark.Gender, 
                     df_spark.Weight).
              withColumn('Weight_kg', col('Weight')/2.2046).
              take(5))
m_result >> to_pandas

Unnamed: 0,name,Gender,Weight,Weight_kg
0,A-Bomb,Male,441.0,200.036288
1,Abe Sapien,Male,65.0,29.483807
2,Abin Sur,Male,90.0,40.823732
3,Abomination,Male,441.0,200.036288
4,Abraxas,Male,-99.0,-44.906105


## Referencing a new column

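Use the `col` function with the label passed to `withColumn`. The new column does not exist on the original `df_spark`, so it cannot be referenced as `df_spark.col_name` inside the same chain.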

## Example 4 - Converting Weight to kilograms and filter

Let's find all heroes with a weight under 100 kg.

In [10]:
from pyspark.sql.functions import col
new_col_result = (df_spark
                  .select(df_spark.name, df_spark.Gender, df_spark.Weight)
                  .withColumn('Weight_kg', df_spark.Weight/2.2046)
                  .where(col('Weight_kg') < 100 )
                  .take(5))
new_col_result >> to_pandas

Unnamed: 0,name,Gender,Weight,Weight_kg
0,Abe Sapien,Male,65.0,29.483807
1,Abin Sur,Male,90.0,40.823732
2,Abraxas,Male,-99.0,-44.906105
3,Absorbing Man,Male,122.0,55.338837
4,Adam Monroe,Male,-99.0,-44.906105


## <font color="red"> Exercise 2: Tall Heroes </font>

Create a query that

1. Selects the name, Gender, and Height columns
2. Computes the height in inches.
    * Check [here](https://www.kaggle.com/claudiodavi/superhero-set) to determine the current units.
3. Filters on height_in > 72

In [11]:
# Your code here
tall_heroes = (df_spark
               .select(df_spark.name, df_spark.Gender, df_spark.Height)
               .withColumn('Height_inches', df_spark.Height/2.54)
               .where(col('Height_inches') > 72)
               .take(5)
              )
tall_heroes >> to_pandas

Unnamed: 0,name,Gender,Height,Height_inches
0,A-Bomb,Male,203.0,79.92126
1,Abe Sapien,Male,191.0,75.19685
2,Abin Sur,Male,185.0,72.834646
3,Abomination,Male,203.0,79.92126
4,Absorbing Man,Male,193.0,75.984252
