# Spark API Mini Exercises

### Contents
- [Section I - Spark Dataframe Basics](#Section-I)
- [Section II - Column Manipulation](#Section-II)
- [Section III - Spark SQL](#Section-III)
- [Section IV - Type casting](#Section-IV)
- [Section V - Built-in Functions](#Section-V)
- [Section VI - Filter / Where](#Section-VI)
- [Section VII - When / Otherwise](#Section-VII)
- [Section VIII - Sorting](#Section-VIII)
- [Section IX - Aggregating](#Section-IX)
- [Appendix](#Appendix)

[top](#Contents)

Copy the code below to create a pandas dataframe with 20 rows and 3 columns:

```python
import pandas as pd
import numpy as np

np.random.seed(13)

pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)
```

In [1]:
import pyspark
import pyspark.sql.functions as F

spark = pyspark.sql.SparkSession.builder.getOrCreate()

---
<h1 border=1>Section I</h1>

## <mark>Spark Dataframe Basics</mark>

[top](#Contents)

### I.A 
Use the starter code above to create a pandas dataframe.

In [2]:
import pandas as pd
import numpy as np

np.random.seed(13)

pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

### I.B
Convert the pandas dataframe to a spark dataframe. From this point forward, do all of your work with the spark dataframe, not the pandas dataframe.

In [3]:
dfi = spark.createDataFrame(pandas_dataframe)

### I.C 
Show the first 3 rows of the dataframe.

In [4]:
dfi.show(3)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
+--------------------+-----+-----+
only showing top 3 rows



### I.D 
Show the first 7 rows of the dataframe.

In [5]:
dfi.show(7)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
+--------------------+-----+-----+
only showing top 7 rows



### I.E 
View a summary of the data using `.describe`.

In [6]:
dfi.describe().show()

+-------+-------------------+-----+
|summary|                  n|group|
+-------+-------------------+-----+
|  count|                 20|   20|
|   mean|0.36640264498852165| null|
| stddev| 0.8905322898155364| null|
|    min| -1.261605945319069|    x|
|    max| 2.1503829673811126|    z|
+-------+-------------------+-----+



### I.F 
Use `.select` to create a new dataframe with just the `n` and `abool` columns. View the first 5 rows of this dataframe.

In [7]:
dfi_nabool = dfi.select('n','abool')
dfi_nabool.show(5)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
|  -0.712390662050588|false|
|   0.753766378659703|false|
|-0.04450307833805...|false|
| 0.45181233874578974|false|
|  1.3451017084510097|false|
+--------------------+-----+
only showing top 5 rows



### I.G 
Use `.select` to create a new dataframe with just the `group` and `abool` columns. View the first 5 rows of this dataframe.

In [8]:
dfi_groupabool = dfi.select('group','abool')
dfi_groupabool.show(5)

+-----+-----+
|group|abool|
+-----+-----+
|    z|false|
|    x|false|
|    z|false|
|    y|false|
|    z|false|
+-----+-----+
only showing top 5 rows



### I.H
Use `.select` to create a new dataframe with the `group` column and the `abool` column renamed to `a_boolean_value`. Show the first 3 rows of this dataframe.

In [9]:
dfi_groupaboolean = dfi.select('group',dfi.abool.alias('a_boolean_value'))
dfi_groupaboolean.show(3)

+-----+---------------+
|group|a_boolean_value|
+-----+---------------+
|    z|          false|
|    x|          false|
|    z|          false|
+-----+---------------+
only showing top 3 rows



### I.J 
Use `.select` to create a new dataframe with the `group` column and the `n` column renamed to `a_numeric_value`. Show the first 6 rows of this dataframe.

In [10]:
dfi_groupnum = dfi.select('group',dfi.n.alias('a_numeric_value'))
dfi_groupnum.show(6)

+-----+--------------------+
|group|     a_numeric_value|
+-----+--------------------+
|    z|  -0.712390662050588|
|    x|   0.753766378659703|
|    z|-0.04450307833805...|
|    y| 0.45181233874578974|
|    z|  1.3451017084510097|
|    y|  0.5323378882945463|
+-----+--------------------+
only showing top 6 rows



# Section II
## Column Manipulation

[top](#Contents)

### II.A. 
Use the starter code above to re-create a spark dataframe. Store the spark dataframe in a variable named `df`.

In [11]:
dfii = spark.createDataFrame(pandas_dataframe)

### II.B
Use `.select` to add 4 to the `n` column. Show the results.

In [12]:
n_plus_4 = (dfii.n+4).alias('n_plus_4')
dfii.select('*', n_plus_4).show(5)

+--------------------+-----+-----+------------------+
|                   n|group|abool|          n_plus_4|
+--------------------+-----+-----+------------------+
|  -0.712390662050588|    z|false|3.2876093379494122|
|   0.753766378659703|    x|false| 4.753766378659703|
|-0.04450307833805...|    z|false|3.9554969216619464|
| 0.45181233874578974|    y|false|  4.45181233874579|
|  1.3451017084510097|    z|false|5.3451017084510095|
+--------------------+-----+-----+------------------+
only showing top 5 rows



### II.C
Subtract 5 from the `n` column and view the results.

In [13]:
n_less_5 = (dfii.n - 5).alias('n_less_5')
dfii.select('*', n_less_5).show(5)

+--------------------+-----+-----+-------------------+
|                   n|group|abool|           n_less_5|
+--------------------+-----+-----+-------------------+
|  -0.712390662050588|    z|false| -5.712390662050588|
|   0.753766378659703|    x|false| -4.246233621340297|
|-0.04450307833805...|    z|false| -5.044503078338053|
| 0.45181233874578974|    y|false|  -4.54818766125421|
|  1.3451017084510097|    z|false|-3.6548982915489905|
+--------------------+-----+-----+-------------------+
only showing top 5 rows



### II.D 
Multiply the `n` column by 2. View the results along with the original numbers.

In [14]:
n_times_2 = (dfii.n * 2).alias('n_times_2')
dfii.select('*', n_times_2).show(5)

+--------------------+-----+-----+--------------------+
|                   n|group|abool|           n_times_2|
+--------------------+-----+-----+--------------------+
|  -0.712390662050588|    z|false|  -1.424781324101176|
|   0.753766378659703|    x|false|   1.507532757319406|
|-0.04450307833805...|    z|false|-0.08900615667610691|
| 0.45181233874578974|    y|false|  0.9036246774915795|
|  1.3451017084510097|    z|false|  2.6902034169020195|
+--------------------+-----+-----+--------------------+
only showing top 5 rows



### II.E 
Add a new column named `n2` that is the `n` value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original `n` value as well as `n2`.

In [15]:
n2 = (dfii.n * -1).alias('n2')
dfii = dfii.select('*', n2)
dfii.show(4)

+--------------------+-----+-----+--------------------+
|                   n|group|abool|                  n2|
+--------------------+-----+-----+--------------------+
|  -0.712390662050588|    z|false|   0.712390662050588|
|   0.753766378659703|    x|false|  -0.753766378659703|
|-0.04450307833805...|    z|false|0.044503078338053455|
| 0.45181233874578974|    y|false|-0.45181233874578974|
+--------------------+-----+-----+--------------------+
only showing top 4 rows



### II.F
Add a new column named `n3` that is the `n` value squared. Show the first 5 rows of your dataframe. You should see `n`, `n2`, and `n3`.

In [16]:
n3 = (dfii.n ** 2).alias('n3')
dfii = dfii.select('*', n3)
dfii.show(5)

+--------------------+-----+-----+--------------------+--------------------+
|                   n|group|abool|                  n2|                  n3|
+--------------------+-----+-----+--------------------+--------------------+
|  -0.712390662050588|    z|false|   0.712390662050588|   0.507500455376875|
|   0.753766378659703|    x|false|  -0.753766378659703|  0.5681637535977627|
|-0.04450307833805...|    z|false|0.044503078338053455|0.001980523981562...|
| 0.45181233874578974|    y|false|-0.45181233874578974| 0.20413438944294027|
|  1.3451017084510097|    z|false| -1.3451017084510097|  1.8092986060778251|
+--------------------+-----+-----+--------------------+--------------------+
only showing top 5 rows



### II.G 
What happens when you run the code below?
```python
df.group + df.abool
```

In [17]:
dfii.group + dfii.abool

# OUTPUT: Column<b'(group + abool)'>

Column<b'(group + abool)'>

### II.H
What happens when you run the code below? What is the difference between this and the previous code sample?
```python
df.select(df.group + df.abool)
```

In [18]:
# dfii.select(dfii.group + dfii.abool)

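# OUTPUT: ERROR (AnalysisException: data type mismatch). Unlike II.G, `.select` makes Spark resolve the expression against the dataframe, which is when the incompatible types are detected.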

#### Output II.H
[See error](#Example-II.H)
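
The difference between II.G and II.H comes down to when the expression is checked. The sketch below is a toy model in plain Python, not PySpark: `ToyColumn` and `toy_select` are hypothetical names, and the rule "any boolean operand fails" is a deliberate simplification of Spark's real type analysis. It only illustrates that building an expression is cheap and unchecked, while using it in a query triggers the check.

```python
class ToyColumn:
    """Hypothetical stand-in for a Spark Column: it only records an expression."""

    def __init__(self, expr, dtype=None, parts=()):
        self.expr = expr
        self.dtype = dtype
        self.parts = parts

    def __add__(self, other):
        # Building the expression does not inspect any types or data.
        return ToyColumn(f"({self.expr} + {other.expr})", parts=(self, other))

    def __repr__(self):
        return f"Column<'{self.expr}'>"


def toy_select(column):
    """Mimic the analysis step that runs when an expression is used in a query."""
    dtypes = {part.dtype for part in column.parts}
    if "boolean" in dtypes:  # simplified rule, assumed for illustration only
        raise TypeError(f"cannot resolve '{column.expr}': data type mismatch")
    return column.expr


group = ToyColumn("group", "string")
abool = ToyColumn("abool", "boolean")

combined = group + abool   # succeeds: nothing has been checked yet
print(combined)            # Column<'(group + abool)'>

try:
    toy_select(combined)   # fails: the types are examined only here
except TypeError as error:
    print("analysis failed:", error)
```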

### II.I
Try adding various other columns together. What are the results of combining the different data types?

In [19]:
# dfii.select(dfii.n + dfii.abool).show(5)

# OUTPUT: Error

#### Output II.I
[See error](#Example-II.I)

In [20]:
dfii.select(dfii.n + dfii.n3).show(5)

# OUTPUT: Mathemagics

+--------------------+
|            (n + n3)|
+--------------------+
|-0.20489020667371294|
|  1.3219301322574657|
|-0.04252255435649...|
|    0.65594672818873|
|   3.154400314528835|
+--------------------+
only showing top 5 rows



# Section III
## Spark SQL

[top](#Contents)

In [21]:
# from pyspark.sql.functions import col, expr

### III.A 
Use the starter code above to re-create a spark dataframe.

In [22]:
dfiii = spark.createDataFrame(pandas_dataframe)

### III.B
Turn your dataframe into a table that can be queried with spark SQL. Name the table `my_df`. Answer the rest of the questions in this section with a spark sql query (`spark.sql`) against `my_df`. After each step, view the first 7 records from the dataframe.

In [23]:
my_df = dfiii
# display(my_df.show())
my_df.createOrReplaceTempView("my_df")

spark.sql('''
SELECT * FROM my_df LIMIT 7
''').show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
+--------------------+-----+-----+



### III.C
Write a query that shows all of the columns from your dataframe.

In [24]:
spark.sql('''
SELECT * FROM my_df
''').show(5)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
+--------------------+-----+-----+
only showing top 5 rows



### III.D 
Write a query that shows just the `n` and `abool` columns from the dataframe.

In [25]:
spark.sql('''
SELECT n, abool FROM my_df
''').show(5)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
|  -0.712390662050588|false|
|   0.753766378659703|false|
|-0.04450307833805...|false|
| 0.45181233874578974|false|
|  1.3451017084510097|false|
+--------------------+-----+
only showing top 5 rows



### III.E 
Write a query that shows just the `n` and `group` columns. Rename the group column to `g`.


In [26]:
spark.sql('''
SELECT n, group as g FROM my_df
''').show(5)

+--------------------+---+
|                   n|  g|
+--------------------+---+
|  -0.712390662050588|  z|
|   0.753766378659703|  x|
|-0.04450307833805...|  z|
| 0.45181233874578974|  y|
|  1.3451017084510097|  z|
+--------------------+---+
only showing top 5 rows



### III.F 
Write a query that selects `n`, and creates two new columns: `n2`, the original n values halved, and `n3`: the original n values minus 1.

In [27]:
spark.sql('''
SELECT n, (n/2) as n2, (n-1) as n3 FROM my_df
''').show(5)

+--------------------+--------------------+--------------------+
|                   n|                  n2|                  n3|
+--------------------+--------------------+--------------------+
|  -0.712390662050588|  -0.356195331025294|  -1.712390662050588|
|   0.753766378659703|  0.3768831893298515|-0.24623362134029703|
|-0.04450307833805...|-0.02225153916902...| -1.0445030783380536|
| 0.45181233874578974| 0.22590616937289487| -0.5481876612542103|
|  1.3451017084510097|  0.6725508542255049| 0.34510170845100974|
+--------------------+--------------------+--------------------+
only showing top 5 rows
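
To experiment with the query itself without a Spark session, one option is Python's built-in `sqlite3` module. This is only a stand-in: SQLite's dialect is not Spark SQL (for example, `group` must be quoted here because it is a reserved word in SQLite), and the table below holds just three rows whose complete values appear in the output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "group" is a reserved word in SQLite, so it is quoted in the table definition.
conn.execute('CREATE TABLE my_df (n REAL, "group" TEXT, abool INTEGER)')
conn.executemany(
    "INSERT INTO my_df VALUES (?, ?, ?)",
    [
        (-0.712390662050588, "z", False),
        (0.753766378659703, "x", False),
        (0.45181233874578974, "y", False),
    ],
)

# Same shape as the Spark SQL query: keep n, add a halved and a decremented column.
rows = conn.execute(
    "SELECT n, (n / 2) AS n2, (n - 1) AS n3 FROM my_df ORDER BY rowid"
).fetchall()
for row in rows:
    print(row)
conn.close()
```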



### III.G 
What happens if you make a SQL syntax error in your query?

In [28]:
# spark.sql('''
# SELECT l, n, (n/2) as n2, (n-1) as n3 FROM my_df
# ''').show(5)

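# OUTPUT: ERROR. Selecting the nonexistent column `l` raises an AnalysisException. A true syntax error, such as a misspelled keyword, is rejected earlier by the SQL parser.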

#### Output III.G
[see error](#Example-III.G)

<hr>
<h1>Section IV</h1>
<h2><mark>Type casting</mark></h2>

[top](#Contents)

### IV.A
Use the starter code above to re-create a spark dataframe.

In [29]:
dfiv = spark.createDataFrame(pandas_dataframe)

### IV.B
Use `.printSchema` to view the datatypes in your dataframe.

In [30]:
dfiv.printSchema()

root
 |-- n: double (nullable = true)
 |-- group: string (nullable = true)
 |-- abool: boolean (nullable = true)



### IV.C
Use `.dtypes` to view the datatypes in your dataframe.

In [31]:
dfiv.dtypes

[('n', 'double'), ('group', 'string'), ('abool', 'boolean')]

### IV.D
What is the difference between the two code samples below?

```python
df.abool.cast('int')
```

```python
df.select(df.abool.cast('int')).show()
```

In [32]:
dfiv.abool.cast('int')

# references a spark column object

Column<b'CAST(abool AS INT)'>

In [33]:
dfiv.select(dfiv.abool.cast('int')).show(5)

# Displays a spark dataframe object

+-----+
|abool|
+-----+
|    0|
|    0|
|    0|
|    0|
|    0|
+-----+
only showing top 5 rows



### IV.E
Use `.select` and `.cast` to convert the `abool` column to an integer type. View the results.

In [34]:
dfiv.select(dfiv.n, dfiv.group, dfiv.abool.cast('int')).show(5)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|    0|
|   0.753766378659703|    x|    0|
|-0.04450307833805...|    z|    0|
| 0.45181233874578974|    y|    0|
|  1.3451017084510097|    z|    0|
+--------------------+-----+-----+
only showing top 5 rows



### IV.F
Convert the `group` column to an integer data type and view the results. What happens?

In [35]:
dfiv.select(dfiv.group.cast('int')).show(5)

# Letters such as 'x' cannot be parsed as integers, so the cast yields null for every row shown

+-----+
|group|
+-----+
| null|
| null|
| null|
| null|
| null|
+-----+
only showing top 5 rows



### IV.H
Convert the `n` column to an integer data type and view the results. What happens?

In [36]:
dfiv.select(dfiv.n.cast('int')).show(5)

# Everything rounds towards zero

+---+
|  n|
+---+
|  0|
|  0|
|  0|
|  0|
|  1|
+---+
only showing top 5 rows
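
The zeros above come from truncation, not from rounding to the nearest integer. As a plain-Python sketch (no Spark involved; Python's built-in `int()` also truncates toward zero), the first, second, and fifth `n` values from the dataframe behave like this under three different conversions:

```python
import math

# Three n values copied from the dataframe shown earlier.
values = [-0.712390662050588, 0.753766378659703, 1.3451017084510097]

truncated = [int(v) for v in values]        # drops the fractional part
floored = [math.floor(v) for v in values]   # always rounds down
nearest = [round(v) for v in values]        # rounds to the nearest integer

print(truncated)  # [0, 0, 1]
print(floored)    # [-1, 0, 1]
print(nearest)    # [-1, 1, 1]
```

Only the truncated list matches the `0, 0, 1` pattern in the cast output.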



### IV.I
Convert the `abool` column to a string data type and view the results. What happens?

In [37]:
dfiv.select(dfiv.abool.cast('string')).show(5)

# Looks the same ...

+-----+
|abool|
+-----+
|false|
|false|
|false|
|false|
|false|
+-----+
only showing top 5 rows



In [38]:
dfiv.select(dfiv.abool, dfiv.abool.cast('string').alias('peekabool')).printSchema()

# ... but isn't

root
 |-- abool: boolean (nullable = true)
 |-- peekabool: string (nullable = true)



# Section V
## Built-in Functions

[top](#Contents)

### V.A
Use the starter code above to re-create a spark dataframe.

In [39]:
dfv = spark.createDataFrame(pandas_dataframe)

### V.B
Import the necessary functions from `pyspark.sql.functions`

In [40]:
import pyspark.sql.functions as F

### V.C
Find the highest `n` value.

In [41]:
dfv.select(F.max(dfv.n)).show()

+------------------+
|            max(n)|
+------------------+
|2.1503829673811126|
+------------------+



### V.D
Find the lowest `n` value.

In [42]:
dfv.select(F.min(dfv.n)).show()

+------------------+
|            min(n)|
+------------------+
|-1.261605945319069|
+------------------+



### V.E
Find the average `n` value.

In [43]:
dfv.select(F.avg(dfv.n)).show()

+-------------------+
|             avg(n)|
+-------------------+
|0.36640264498852165|
+-------------------+



### V.F
Use `concat` to change the `group` column to say, e.g. "Group: x" or "Group: y"

In [44]:
dfv.select(F.concat(F.lit('Group: '), dfv.group).alias('group')).show(5)

+--------+
|   group|
+--------+
|Group: z|
|Group: x|
|Group: z|
|Group: y|
|Group: z|
+--------+
only showing top 5 rows



### V.G
Use `concat` to combine the `n` and `group` columns to produce results that look like this: "x: -1.432" or "z: 2.352"

In [45]:
dfv.select(F.concat(dfv.group, F.lit(': '), F.round(dfv.n,3)).alias('groupn')).show(5)

+---------+
|   groupn|
+---------+
|z: -0.712|
| x: 0.754|
|z: -0.045|
| y: 0.452|
| z: 1.345|
+---------+
only showing top 5 rows



# Section VI
## Filter / Where

[top](#Contents)

### VI.A
Use the starter code above to re-create a spark dataframe.


In [46]:
dfvi = spark.createDataFrame(pandas_dataframe)

### VI.B
Use `.filter` or `.where` to select just the rows where the group is `y` and view the results.

In [47]:
dfvi.filter(dfvi.group=='y').show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| 0.45181233874578974|    y|false|
|  0.5323378882945463|    y|false|
| -1.0453771305385342|    y| true|
|  -1.261605945319069|    y|false|
|  0.5628467852810314|    y| true|
|-0.24332625188556253|    y| true|
|  0.9137407048596775|    y|false|
|  2.1503829673811126|    y| true|
+--------------------+-----+-----+



### VI.C
Select just the rows where the `abool` column is false and view the results.

In [48]:
dfvi.filter(dfvi.abool==False).show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
|  0.8612113741693206|    x|false|
| -0.7889890249515489|    x|false|
|  -1.261605945319069|    y|false|
|  0.9137407048596775|    y|false|
| 0.31735092273633597|    x|false|
| 0.12730328020698067|    z|false|
|  0.6062886568962988|    x|false|
+--------------------+-----+-----+



### VI.D
Find the rows where the `group` column is *not* `y`.

In [49]:
dfvi.filter(dfvi.group!='y').show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
|  1.3451017084510097|    z|false|
|  1.3501878997225267|    z|false|
|  0.8612113741693206|    x|false|
|  1.4786857374358966|    z| true|
| -0.7889890249515489|    x|false|
| 0.31735092273633597|    x|false|
| 0.12730328020698067|    z|false|
|  0.6062886568962988|    x|false|
|-0.02677164998644...|    x| true|
+--------------------+-----+-----+



### VI.E
Find the rows where `n` is positive.

In [50]:
dfvi.filter(dfvi.n>0).show()

+-------------------+-----+-----+
|                  n|group|abool|
+-------------------+-----+-----+
|  0.753766378659703|    x|false|
|0.45181233874578974|    y|false|
| 1.3451017084510097|    z|false|
| 0.5323378882945463|    y|false|
| 1.3501878997225267|    z|false|
| 0.8612113741693206|    x|false|
| 1.4786857374358966|    z| true|
| 0.5628467852810314|    y| true|
| 0.9137407048596775|    y|false|
|0.31735092273633597|    x|false|
|0.12730328020698067|    z|false|
| 2.1503829673811126|    y| true|
| 0.6062886568962988|    x|false|
+-------------------+-----+-----+



### VI.F
Find the rows where `abool` is true and the `group` column is `z`.

In [51]:
dfvi.filter(dfvi.group=='z').where(dfvi.abool==True).show()

+------------------+-----+-----+
|                 n|group|abool|
+------------------+-----+-----+
|1.4786857374358966|    z| true|
+------------------+-----+-----+



### VI.G
Find the rows where `abool` is true or the `group` column is `z`.

In [52]:
dfvi.where((dfvi.abool==True) | (dfvi.group=='z')).show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|-0.04450307833805...|    z|false|
|  1.3451017084510097|    z|false|
|  1.3501878997225267|    z|false|
|  1.4786857374358966|    z| true|
| -1.0453771305385342|    y| true|
|  0.5628467852810314|    y| true|
|-0.24332625188556253|    y| true|
| 0.12730328020698067|    z|false|
|  2.1503829673811126|    y| true|
|-0.02677164998644...|    x| true|
+--------------------+-----+-----+
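
The parentheses around each comparison in the filter above are required because Python's `|` and `&` operators bind more tightly than `==`. A small sketch using the standard-library `ast` module shows how Python parses the expression with and without them. The names `a` and `b` are placeholders, not Spark objects.

```python
import ast

# Without parentheses, `|` is applied first, so Python sees a chained
# comparison: a == (1 | b) == 2.
unparenthesized = ast.parse("a == 1 | b == 2", mode="eval").body

# With parentheses, the two comparisons are combined by `|` as intended.
parenthesized = ast.parse("(a == 1) | (b == 2)", mode="eval").body

print(type(unparenthesized).__name__)  # Compare
print(type(parenthesized).__name__)    # BinOp
```

The same parsing rule applies to column expressions, so omitting the parentheses changes what the condition means.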



### VI.H
Find the rows where `abool` is false and `n` is less than 1.

In [53]:
dfvi.filter(dfvi.n<1).where(dfvi.abool==False).show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  0.5323378882945463|    y|false|
|  0.8612113741693206|    x|false|
| -0.7889890249515489|    x|false|
|  -1.261605945319069|    y|false|
|  0.9137407048596775|    y|false|
| 0.31735092273633597|    x|false|
| 0.12730328020698067|    z|false|
|  0.6062886568962988|    x|false|
+--------------------+-----+-----+



### VI.I
Find the rows where `abool` is false or `n` is less than 1.

In [54]:
dfvi.where((dfvi.abool==False) | (dfvi.n<1)).show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
|  0.8612113741693206|    x|false|
| -1.0453771305385342|    y| true|
| -0.7889890249515489|    x|false|
|  -1.261605945319069|    y|false|
|  0.5628467852810314|    y| true|
|-0.24332625188556253|    y| true|
|  0.9137407048596775|    y|false|
| 0.31735092273633597|    x|false|
| 0.12730328020698067|    z|false|
|  0.6062886568962988|    x|false|
|-0.02677164998644...|    x| true|
+--------------------+-----+-----+



# Section VII
## When / Otherwise

[top](#Contents)

### VII.A
Use the starter code above to re-create a spark dataframe.

In [55]:
dfvii = spark.createDataFrame(pandas_dataframe)

### VII.B
Use `when` and `.otherwise` to create a column that contains the text "It is true" when `abool` is true and "It is false" when `abool` is false.

In [56]:
dfvii.select('*', F.when(dfvii.abool, 'It is true').otherwise('It is false').alias('itiswha')).show(15)

+--------------------+-----+-----+-----------+
|                   n|group|abool|    itiswha|
+--------------------+-----+-----+-----------+
|  -0.712390662050588|    z|false|It is false|
|   0.753766378659703|    x|false|It is false|
|-0.04450307833805...|    z|false|It is false|
| 0.45181233874578974|    y|false|It is false|
|  1.3451017084510097|    z|false|It is false|
|  0.5323378882945463|    y|false|It is false|
|  1.3501878997225267|    z|false|It is false|
|  0.8612113741693206|    x|false|It is false|
|  1.4786857374358966|    z| true| It is true|
| -1.0453771305385342|    y| true| It is true|
| -0.7889890249515489|    x|false|It is false|
|  -1.261605945319069|    y|false|It is false|
|  0.5628467852810314|    y| true| It is true|
|-0.24332625188556253|    y| true| It is true|
|  0.9137407048596775|    y|false|It is false|
+--------------------+-----+-----+-----------+
only showing top 15 rows
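
Row by row, a `when` / `.otherwise` chain acts like an if / else: the first matching condition supplies the value, and `.otherwise` supplies the fallback. The plain-Python sketch below mimics that behavior with a hypothetical helper named `case_when`, which is not part of PySpark. `None` stands in for Spark's null, which is what unmatched rows receive when `.otherwise` is omitted.

```python
def case_when(value, branches, default=None):
    """Return the result of the first branch whose predicate matches.

    Hypothetical helper that mimics the row-by-row behavior of
    F.when(...).when(...).otherwise(...); it is not part of PySpark.
    """
    for predicate, result in branches:
        if predicate(value):
            return result
    return default  # plays the role of .otherwise(); None stands in for null


abool_values = [False, True]
labels = [
    case_when(b, [(lambda v: v, "It is true")], "It is false")
    for b in abool_values
]
print(labels)  # ['It is false', 'It is true']

# Without a default, an unmatched value falls through to None.
print(case_when(False, [(lambda v: v, "It is true")]))  # None
```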



### VII.C
Create a column that contains 0 if `n` is less than 0, otherwise the original `n` value.

In [57]:
dfvii.select('*', F.when(dfvii.n<0, 0).otherwise(dfvii.n).alias('positiven')).show(15)

+--------------------+-----+-----+-------------------+
|                   n|group|abool|          positiven|
+--------------------+-----+-----+-------------------+
|  -0.712390662050588|    z|false|                0.0|
|   0.753766378659703|    x|false|  0.753766378659703|
|-0.04450307833805...|    z|false|                0.0|
| 0.45181233874578974|    y|false|0.45181233874578974|
|  1.3451017084510097|    z|false| 1.3451017084510097|
|  0.5323378882945463|    y|false| 0.5323378882945463|
|  1.3501878997225267|    z|false| 1.3501878997225267|
|  0.8612113741693206|    x|false| 0.8612113741693206|
|  1.4786857374358966|    z| true| 1.4786857374358966|
| -1.0453771305385342|    y| true|                0.0|
| -0.7889890249515489|    x|false|                0.0|
|  -1.261605945319069|    y|false|                0.0|
|  0.5628467852810314|    y| true| 0.5628467852810314|
|-0.24332625188556253|    y| true|                0.0|
|  0.9137407048596775|    y|false| 0.9137407048596775|
+--------------------+-----+-----+-------------------+
only showing top 15 rows

# Section VIII
## Sorting

[top](#Contents)

### VIII.A
Use the starter code above to re-create a spark dataframe.

In [58]:
dfviii = spark.createDataFrame(pandas_dataframe)

### VIII.B
Sort by the `n` value.

In [59]:
dfviii.sort(dfviii.n).show(10)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -1.261605945319069|    y|false|
| -1.0453771305385342|    y| true|
| -0.7889890249515489|    x|false|
|  -0.712390662050588|    z|false|
|-0.24332625188556253|    y| true|
|-0.04450307833805...|    z|false|
|-0.02677164998644...|    x| true|
| 0.12730328020698067|    z|false|
| 0.31735092273633597|    x|false|
| 0.45181233874578974|    y|false|
+--------------------+-----+-----+
only showing top 10 rows



### VIII.C
Sort by the `group` value, both ascending and descending.

In [60]:
dfviii.sort(dfviii.group).show(10)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|   0.753766378659703|    x|false|
| -0.7889890249515489|    x|false|
|-0.02677164998644...|    x| true|
|  0.6062886568962988|    x|false|
| 0.31735092273633597|    x|false|
|  0.8612113741693206|    x|false|
|  -1.261605945319069|    y|false|
|  0.9137407048596775|    y|false|
|  0.5628467852810314|    y| true|
| 0.45181233874578974|    y|false|
+--------------------+-----+-----+
only showing top 10 rows



In [61]:
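dfviii.sort(dfviii.n.desc()).show(10)

# Note: this sorts `n` in descending order. To sort `group` in descending order, as the exercise asks, use dfviii.sort(dfviii.group.desc()).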

+------------------+-----+-----+
|                 n|group|abool|
+------------------+-----+-----+
|2.1503829673811126|    y| true|
|1.4786857374358966|    z| true|
|1.3501878997225267|    z|false|
|1.3451017084510097|    z|false|
|0.9137407048596775|    y|false|
|0.8612113741693206|    x|false|
| 0.753766378659703|    x|false|
|0.6062886568962988|    x|false|
|0.5628467852810314|    y| true|
|0.5323378882945463|    y|false|
+------------------+-----+-----+
only showing top 10 rows



### VIII.D
Sort by the group value first, then, within each group, sort by `n` value.

In [62]:
dfviii.sort(dfviii.group, dfviii.n).show(10)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| -0.7889890249515489|    x|false|
|-0.02677164998644...|    x| true|
| 0.31735092273633597|    x|false|
|  0.6062886568962988|    x|false|
|   0.753766378659703|    x|false|
|  0.8612113741693206|    x|false|
|  -1.261605945319069|    y|false|
| -1.0453771305385342|    y| true|
|-0.24332625188556253|    y| true|
| 0.45181233874578974|    y|false|
+--------------------+-----+-----+
only showing top 10 rows



### VIII.E
Sort by `abool`, `group`, and `n`. Does it matter in what order you specify the columns when sorting?

In [63]:
dfviii.sort(dfviii.abool, dfviii.group, dfviii.n).show(10)

+-------------------+-----+-----+
|                  n|group|abool|
+-------------------+-----+-----+
|-0.7889890249515489|    x|false|
|0.31735092273633597|    x|false|
| 0.6062886568962988|    x|false|
|  0.753766378659703|    x|false|
| 0.8612113741693206|    x|false|
| -1.261605945319069|    y|false|
|0.45181233874578974|    y|false|
| 0.5323378882945463|    y|false|
| 0.9137407048596775|    y|false|
| -0.712390662050588|    z|false|
+-------------------+-----+-----+
only showing top 10 rows



In [64]:
dfviii.sort(dfviii.group, dfviii.abool, dfviii.n).show(10)

# YES! Order matters!

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| -0.7889890249515489|    x|false|
| 0.31735092273633597|    x|false|
|  0.6062886568962988|    x|false|
|   0.753766378659703|    x|false|
|  0.8612113741693206|    x|false|
|-0.02677164998644...|    x| true|
|  -1.261605945319069|    y|false|
| 0.45181233874578974|    y|false|
|  0.5323378882945463|    y|false|
|  0.9137407048596775|    y|false|
+--------------------+-----+-----+
only showing top 10 rows
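
The reason column order matters is that each later column only breaks ties left by the earlier ones. The same rule can be seen in plain Python by sorting tuples with different key orders. This sketch uses five rows copied from the dataframe and the standard-library `itemgetter`; it is an illustration of the ordering rule, not Spark code.

```python
from operator import itemgetter

# (n, group, abool) rows copied from the dataframe shown earlier.
rows = [
    (-0.712390662050588, "z", False),
    (0.753766378659703, "x", False),
    (0.45181233874578974, "y", False),
    (1.4786857374358966, "z", True),
    (-1.0453771305385342, "y", True),
]

# Sort by abool, then group, then n.
by_abool_first = sorted(rows, key=itemgetter(2, 1, 0))
# Sort by group, then abool, then n.
by_group_first = sorted(rows, key=itemgetter(1, 2, 0))

print([(g, b) for _, g, b in by_abool_first])
# [('x', False), ('y', False), ('z', False), ('y', True), ('z', True)]
print([(g, b) for _, g, b in by_group_first])
# [('x', False), ('y', False), ('y', True), ('z', False), ('z', True)]
```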



In [65]:
dfviii.select('group').distinct().show()

+-----+
|group|
+-----+
|    x|
|    z|
|    y|
+-----+



# Section IX
## Aggregating

[top](#Contents)

In [66]:
dfix = spark.createDataFrame(pandas_dataframe)

### IX.A 
What is the average `n` value for each group in the `group` column?

In [67]:
dfix.groupby('group').agg(F.avg('n')).show()

+-----+-------------------+
|group|             avg(n)|
+-----+-------------------+
|    x|0.28714277625394485|
|    z|  0.590730814237962|
|    y| 0.2576014196023739|
+-----+-------------------+



### IX.B
What is the maximum `n` value for each group in the `group` column?

In [68]:
dfix.groupby('group').agg(F.max('n')).show()

+-----+------------------+
|group|            max(n)|
+-----+------------------+
|    x|0.8612113741693206|
|    z|1.4786857374358966|
|    y|2.1503829673811126|
+-----+------------------+



### IX.C
What is the minimum `n` value by `abool`?

In [69]:
dfix.groupby('abool').agg(F.min('n')).sort('abool').show()

+-----+-------------------+
|abool|             min(n)|
+-----+-------------------+
|false| -1.261605945319069|
| true|-1.0453771305385342|
+-----+-------------------+



### IX.D
What is the average `n` value for each unique combination of the `group` and `abool` columns?

In [70]:
dfix.groupby('group', 'abool').agg(F.avg('n')).sort('group',dfix.abool.desc()).show()

+-----+-----+--------------------+
|group|abool|              avg(n)|
+-----+-----+--------------------+
|    x| true|-0.02677164998644...|
|    x|false|   0.349925661502022|
|    y| true| 0.35613159255951177|
|    y|false| 0.15907124664523611|
|    z| true|  1.4786857374358966|
|    z|false| 0.41313982959837514|
+-----+-----+--------------------+
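
As a sanity check on what a grouped average computes, the `y` rows can be reproduced by hand in plain Python. The sketch below takes the eight `group == 'y'` rows from the filter output in Section VI, buckets them by `abool`, and averages each bucket. It uses only the standard library and is not Spark code.

```python
from collections import defaultdict
from statistics import mean

# (n, abool) pairs for the eight rows where group is "y".
y_rows = [
    (0.45181233874578974, False),
    (0.5323378882945463, False),
    (-1.0453771305385342, True),
    (-1.261605945319069, False),
    (0.5628467852810314, True),
    (-0.24332625188556253, True),
    (0.9137407048596775, False),
    (2.1503829673811126, True),
]

# Collect the n values that share the same abool value.
buckets = defaultdict(list)
for n, abool in y_rows:
    buckets[abool].append(n)

# One average per bucket, like one output row per group.
averages = {abool: mean(ns) for abool, ns in buckets.items()}
print(averages)
```

The two averages should agree with the `y` rows of the table above to within floating-point rounding.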



## Appendix

[top](#Contents)

#### Example II.H
[Return](#II.H)


<code>---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
/usr/local/anaconda3/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

/usr/local/anaconda3/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling o123.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '(CAST(`group` AS DOUBLE) + `abool`)' due to data type mismatch: differing types in '(CAST(`group` AS DOUBLE) + `abool`)' (double and boolean).;;
'Project [(cast(group#179 as double) + abool#180) AS (group + abool)#284]
+- Project [n#178, group#179, abool#180, n2#238, POWER(n#178, cast(2 as double)) AS n3#256]
   +- Project [n#178, group#179, abool#180, (n#178 * cast(-1 as double)) AS n2#238]
      +- LogicalRDD [n#178, group#179, abool#180], false

	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:116)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:108)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:281)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:281)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:280)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:278)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:278)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:278)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:104)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$2.apply(QueryPlan.scala:121)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:121)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:93)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:108)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:86)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3412)
	at org.apache.spark.sql.Dataset.select(Dataset.scala:1340)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)


During handling of the above exception, another exception occurred:

AnalysisException                         Traceback (most recent call last)
<ipython-input-21-435d2c8b5834> in <module>
      2 #  df.select(df.group + df.abool)
      3 
----> 4 dfii.select(dfii.group + dfii.abool)
      5 
      6 # OUTPUT: ERROR

/usr/local/anaconda3/lib/python3.7/site-packages/pyspark/sql/dataframe.py in select(self, *cols)
   1319         [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
   1320         """
-> 1321         jdf = self._jdf.select(self._jcols(*cols))
   1322         return DataFrame(jdf, self.sql_ctx)
   1323 

/usr/local/anaconda3/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/usr/local/anaconda3/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: "cannot resolve '(CAST(`group` AS DOUBLE) + `abool`)' due to data type mismatch: differing types in '(CAST(`group` AS DOUBLE) + `abool`)' (double and boolean).;;\n'Project [(cast(group#179 as double) + abool#180) AS (group + abool)#284]\n+- Project [n#178, group#179, abool#180, n2#238, POWER(n#178, cast(2 as double)) AS n3#256]\n   +- Project [n#178, group#179, abool#180, (n#178 * cast(-1 as double)) AS n2#238]\n      +- LogicalRDD [n#178, group#179, abool#180], false\n"</code>

[Return](#II.H)

#### Example II.I
[Return](#II.I)

<code>---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
/usr/local/anaconda3/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

/usr/local/anaconda3/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling o123.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '(`n` + `abool`)' due to data type mismatch: differing types in '(`n` + `abool`)' (double and boolean).;;
'Project [(n#178 + abool#180) AS (n + abool)#285]
+- Project [n#178, group#179, abool#180, n2#238, POWER(n#178, cast(2 as double)) AS n3#256]
   +- Project [n#178, group#179, abool#180, (n#178 * cast(-1 as double)) AS n2#238]
      +- LogicalRDD [n#178, group#179, abool#180], false

	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:116)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:108)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:281)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:281)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:280)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:278)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:278)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:278)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:104)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$2.apply(QueryPlan.scala:121)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:121)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:93)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:108)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:86)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3412)
	at org.apache.spark.sql.Dataset.select(Dataset.scala:1340)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)


During handling of the above exception, another exception occurred:

AnalysisException                         Traceback (most recent call last)
<ipython-input-23-fad6529b8e19> in <module>
      1 # I Try adding various other columns together. What are the results of combining the different data types?
      2 
----> 3 dfii.select(dfii.n + dfii.abool).show(5)
      4 
      5 # OUTPUT: Error

/usr/local/anaconda3/lib/python3.7/site-packages/pyspark/sql/dataframe.py in select(self, *cols)
   1319         [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
   1320         """
-> 1321         jdf = self._jdf.select(self._jcols(*cols))
   1322         return DataFrame(jdf, self.sql_ctx)
   1323 

/usr/local/anaconda3/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/usr/local/anaconda3/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: "cannot resolve '(`n` + `abool`)' due to data type mismatch: differing types in '(`n` + `abool`)' (double and boolean).;;\n'Project [(n#178 + abool#180) AS (n + abool)#285]\n+- Project [n#178, group#179, abool#180, n2#238, POWER(n#178, cast(2 as double)) AS n3#256]\n   +- Project [n#178, group#179, abool#180, (n#178 * cast(-1 as double)) AS n2#238]\n      +- LogicalRDD [n#178, group#179, abool#180], false\n"</code>

[Return](#II.I)

#### Example III.G
[Return](#III.G)

<code>---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
/usr/local/anaconda3/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

/usr/local/anaconda3/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling o20.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve '`l`' given input columns: [my_df.n, my_df.group, my_df.abool]; line 2 pos 7;
'Project ['l, n#284, (n#284 / cast(2 as double)) AS n2#397, (n#284 - cast(1 as double)) AS n3#398]
+- SubqueryAlias `my_df`
   +- LogicalRDD [n#284, group#285, abool#286], false

	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:111)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:108)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:281)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:281)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:280)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:104)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$2.apply(QueryPlan.scala:121)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:296)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:121)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:93)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:108)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:86)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)


During handling of the above exception, another exception occurred:

AnalysisException                         Traceback (most recent call last)
<ipython-input-40-0a9fc38cab0c> in <module>
      3 spark.sql('''
      4 SELECT l, n, (n/2) as n2, (n-1) as n3 FROM my_df
----> 5 ''').show(5)

/usr/local/anaconda3/lib/python3.7/site-packages/pyspark/sql/session.py in sql(self, sqlQuery)
    765         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
    766         """
--> 767         return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
    768 
    769     @since(2.0)

/usr/local/anaconda3/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/usr/local/anaconda3/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: "cannot resolve '`l`' given input columns: [my_df.n, my_df.group, my_df.abool]; line 2 pos 7;\n'Project ['l, n#284, (n#284 / cast(2 as double)) AS n2#397, (n#284 - cast(1 as double)) AS n3#398]\n+- SubqueryAlias `my_df`\n   +- LogicalRDD [n#284, group#285, abool#286], false\n"</code>

[Return](#III.G)