# Spark API Mini Exercises

In [2]:
import pandas as pd
import numpy as np

np.random.seed(13)

#### 1. Spark Dataframe Basics

i. Use the starter code below to create a pandas dataframe (just run the cell):

In [3]:
pandas_dataframe = pd.DataFrame({
    "n": np.random.randn(20),
    "group": np.random.choice(list("xyz"), 20),
    "abool": np.random.choice([True, False], 20),
})

ii. Convert the pandas dataframe to a spark dataframe. From this point forward, do all of your work with the spark dataframe, not the pandas dataframe.

In [5]:
import pyspark

spark = pyspark.sql.SparkSession.builder.getOrCreate()

spark_df = spark.createDataFrame(pandas_dataframe)

iii. Show the first 3 rows of the dataframe.

In [6]:
spark_df.show(3)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
+--------------------+-----+-----+
only showing top 3 rows



iv. Show the first 7 rows of the dataframe.

In [7]:
spark_df.show(7)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
+--------------------+-----+-----+
only showing top 7 rows



v. View a summary of the data using `.describe()`.
> Note that `.describe` returns another dataframe, so we still have to do `.show()` at the end.

In [10]:
spark_df.describe().show()

+-------+------------------+-----+
|summary|                 n|group|
+-------+------------------+-----+
|  count|                20|   20|
|   mean|0.3664026449885217| null|
| stddev|0.8905322898155363| null|
|    min|-1.261605945319069|    x|
|    max|2.1503829673811126|    z|
+-------+------------------+-----+



vi. Use `.select()` to create a new dataframe with just the `n` and `abool` columns. View the first 5 rows of this dataframe.

In [13]:
spark_df.select(spark_df.n, spark_df.abool).show(5)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
|  -0.712390662050588|false|
|   0.753766378659703|false|
|-0.04450307833805...|false|
| 0.45181233874578974|false|
|  1.3451017084510097|false|
+--------------------+-----+
only showing top 5 rows



vii. Use `.select()` to create a new dataframe with just the `group` and `abool` columns. View the first 5 rows of this dataframe.

In [14]:
spark_df.select(spark_df.group, spark_df.abool).show(5)

+-----+-----+
|group|abool|
+-----+-----+
|    z|false|
|    x|false|
|    z|false|
|    y|false|
|    z|false|
+-----+-----+
only showing top 5 rows



viii. Use `.select()` to create a new dataframe with the `group` column and the `abool` column renamed to `a_boolean_value`. Show the first 3 rows of this dataframe.

In [15]:
new_spark_df = spark_df.select(spark_df.group, spark_df.abool.alias('a_boolean_value'))
new_spark_df.show(3)

+-----+---------------+
|group|a_boolean_value|
+-----+---------------+
|    z|          false|
|    x|          false|
|    z|          false|
+-----+---------------+
only showing top 3 rows



ix. Use `.select()` to create a new dataframe with the `group` column and the `n` column renamed to `a_numeric_value`. Show the first 6 rows of this dataframe.

In [18]:
new_new_spark_df = spark_df.select(spark_df.group, spark_df.n.alias('a_numeric_value'))
new_new_spark_df.show(6)

+-----+--------------------+
|group|     a_numeric_value|
+-----+--------------------+
|    z|  -0.712390662050588|
|    x|   0.753766378659703|
|    z|-0.04450307833805...|
|    y| 0.45181233874578974|
|    z|  1.3451017084510097|
|    y|  0.5323378882945463|
+-----+--------------------+
only showing top 6 rows



#### 2. Column Manipulation

i. Use the starter code above to re-create a spark dataframe. Store the spark dataframe in a variable named `df`.

In [19]:
df = spark_df

ii. Use `select()` to add 1 to the `n` column. Show the results.

In [21]:
df.select(spark_df.n + 1).show()

+--------------------+
|             (n + 1)|
+--------------------+
|   0.287609337949412|
|  1.7537663786597029|
|  0.9554969216619466|
|  1.4518123387457897|
|  2.3451017084510095|
|  1.5323378882945464|
|  2.3501878997225267|
|  1.8612113741693206|
|   2.478685737435897|
|-0.04537713053853...|
| 0.21101097504845112|
| -0.2616059453190691|
|  1.5628467852810313|
|  0.7566737481144374|
|  1.9137407048596775|
|   1.317350922736336|
|  1.1273032802069807|
|  3.1503829673811126|
|  1.6062886568962988|
|  0.9732283500135592|
+--------------------+



iii. Subtract 5 from the `n` column and view the results.

In [22]:
df.select(spark_df.n - 5).show()

+-------------------+
|            (n - 5)|
+-------------------+
| -5.712390662050588|
| -4.246233621340297|
| -5.044503078338053|
|  -4.54818766125421|
|-3.6548982915489905|
| -4.467662111705454|
|-3.6498121002774733|
|  -4.13878862583068|
| -3.521314262564103|
| -6.045377130538534|
| -5.788989024951549|
| -6.261605945319069|
| -4.437153214718968|
| -5.243326251885563|
| -4.086259295140323|
| -4.682649077263664|
| -4.872696719793019|
|-2.8496170326188874|
| -4.393711343103702|
| -5.026771649986441|
+-------------------+



iv. Multiply the `n` column by 2. View the results along with the original numbers.

In [23]:
df.select(spark_df.n, spark_df.n * 2).show()

+--------------------+--------------------+
|                   n|             (n * 2)|
+--------------------+--------------------+
|  -0.712390662050588|  -1.424781324101176|
|   0.753766378659703|   1.507532757319406|
|-0.04450307833805...|-0.08900615667610691|
| 0.45181233874578974|  0.9036246774915795|
|  1.3451017084510097|  2.6902034169020195|
|  0.5323378882945463|  1.0646757765890926|
|  1.3501878997225267|  2.7003757994450535|
|  0.8612113741693206|  1.7224227483386412|
|  1.4786857374358966|   2.957371474871793|
| -1.0453771305385342| -2.0907542610770684|
| -0.7889890249515489| -1.5779780499030978|
|  -1.261605945319069|  -2.523211890638138|
|  0.5628467852810314|  1.1256935705620628|
|-0.24332625188556253|-0.48665250377112507|
|  0.9137407048596775|   1.827481409719355|
| 0.31735092273633597|  0.6347018454726719|
| 0.12730328020698067| 0.25460656041396135|
|  2.1503829673811126|   4.300765934762225|
|  0.6062886568962988|  1.2125773137925977|
|-0.02677164998644...|-0.0535432

v. Add a new column named `n2` that is the `n` value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original `n` value as well as `n2`.

In [35]:
# Import only what is used; a wildcard import would shadow builtins such as min and max.
from pyspark.sql.functions import expr

df = df.withColumn('n2', expr('n * -1'))
df.show(4)

+--------------------+-----+-----+--------------------+
|                   n|group|abool|                  n2|
+--------------------+-----+-----+--------------------+
|  -0.712390662050588|    z|false|   0.712390662050588|
|   0.753766378659703|    x|false|  -0.753766378659703|
|-0.04450307833805...|    z|false|0.044503078338053455|
| 0.45181233874578974|    y|false|-0.45181233874578974|
+--------------------+-----+-----+--------------------+
only showing top 4 rows



vi. Add a new column named `n3` that is the `n` value squared. Show the first 4 rows of your dataframe. You should see all of `n`, `n2`, and `n3`.

In [42]:
df = df.withColumn('n3', expr('n * n'))
df.show(4)

+--------------------+-----+-----+--------------------+--------------------+
|                   n|group|abool|                  n2|                  n3|
+--------------------+-----+-----+--------------------+--------------------+
|  -0.712390662050588|    z|false|   0.712390662050588|   0.507500455376875|
|   0.753766378659703|    x|false|  -0.753766378659703|  0.5681637535977627|
|-0.04450307833805...|    z|false|0.044503078338053455|0.001980523981562...|
| 0.45181233874578974|    y|false|-0.45181233874578974| 0.20413438944294027|
+--------------------+-----+-----+--------------------+--------------------+
only showing top 4 rows



vii. What happens when you run the code below?

In [43]:
df.group + df.abool

Column<'(group + abool)'>

**A**: A Column object is produced that represents the transformation of adding together the `group` and `abool` columns.

viii. What happens when you run the code below? What is the difference between this and the previous code sample?

In [44]:
df.select(df.group + df.abool)

AnalysisException: cannot resolve '(CAST(group AS DOUBLE) + abool)' due to data type mismatch: differing types in '(CAST(group AS DOUBLE) + abool)' (double and boolean).;
'Project [unresolvedalias((cast(group#1 as double) + abool#2), Some(org.apache.spark.sql.Column$$Lambda$3234/0x00000008011f2840@4175a572))]
+- Project [n#0, group#1, abool#2, n2#361, (n#0 * n#0) AS n3#387]
   +- Project [n#0, group#1, abool#2, (n#0 * cast(-1 as double)) AS n2#361]
      +- LogicalRDD [n#0, group#1, abool#2], false


**A**: An `AnalysisException` is raised because the types are incompatible. Unlike the previous code sample, the expression is used inside a `.select()`, so spark resolves it against the dataframe's schema right away. No values are produced (we haven't invoked an action yet), but spark can already tell that the types don't match.

ix. Try adding various other columns together. What are the results of combining the different data types?

#### 3. Type Casting

i. Use the starter code above to re-create a spark dataframe named `df`.

ii. Use `.printSchema()` to view the datatypes in your dataframe.

iii. Use `.dtypes` to view the datatypes in your dataframe.

iv. What is the difference between the two code samples below?

In [None]:
df.abool.cast('int')

In [None]:
df.select(df.abool.cast('int')).show()

**A:** The first creates a Column object; the second uses that same column in a `.select()` in order to view the results of the cast.

v. Use `.select()` and `.cast()` to convert the abool column to an integer type. View the results.

vi. Convert the `group` column to an integer data type and view the results. What happens?

vii. Convert the `n` column to an integer data type and view the results. What happens?

viii. Convert the `abool` column to a string data type and view the results. What happens?

#### 4. Built-in Functions

i. Use the starter code above to re-create a spark dataframe named `df`.

ii. Import the necessary functions from `pyspark.sql.functions`

In [None]:
from pyspark.sql.functions import min, max, mean, lit, concat

iii. Find the highest `n` value.

iv. Find the lowest `n` value.

v. Find the average `n` value.

vi. Use `concat()` to change the group column to say "Group: x" or "Group: y"

vii. Use `concat()` to combine the `n` and `group` columns to produce results that look like this: "x: -1.432" or "z: 2.352"

#### 5. When / Otherwise

i. Use the starter code above to re-create a spark dataframe named `df`.

ii. Use `when()` and `.otherwise()` to create a column that contains the text "It is true" when `abool` is true and "It is false" when `abool` is false.

iii. Create a column that contains 0 if n is less than 0, otherwise, the original `n` value.

#### 6. Filter / Where

i. Use the starter code above to re-create a spark dataframe named `df`.

ii. Use `.filter()` or `.where()` to select just the rows where the group is y and view the results.

iii. Select just the rows where the `abool` column is false and view the results.

iv. Find the rows where the `group` column is not y.

v. Find the rows where `n` is positive.

vi. Find the rows where `abool` is true and the `group` column is z.

vii. Find the rows where `abool` is true or the `group` column is z.

viii. Find the rows where `abool` is false and `n` is less than 1.

ix. Find the rows where `abool` is false or `n` is less than 1.

#### 7. Sorting

i. Use the starter code above to re-create a spark dataframe named `df`.

ii. Sort by the `n` value.

iii. Sort by the `group` value, both ascending and descending.

In [None]:
from pyspark.sql.functions import asc, desc

iv. Sort by the `group` value first, then, within each group, sort by `n` value.

v. Sort by `abool`, `group`, and `n`. Does it matter in what order you specify the columns when sorting?

**A:** It does matter as it determines in what order they will be sorted. When the values for the first specified column are the same, the next specified column will determine sort order.

#### 8. Spark SQL

i. Use the starter code above to re-create a spark dataframe named `df`.

ii. Turn your dataframe into a table that can be queried with spark SQL. Name the table `my_df`. Answer the rest of the questions in this section with a spark sql query (`spark.sql`) against `my_df`. After each step, view the first 7 records from the dataframe.


iii. Write a query that shows all of the columns from your dataframe.

iv. Write a query that shows just the `n` and `abool` columns from the dataframe.

v. Write a query that shows just the `n` and `group` columns. Rename the `group` column to `g`.

vi. Write a query that selects `n`, and creates two new columns: `n2`, the original `n` values halved, and `n3`: the original `n` values minus 1.

vii. What happens if you make a SQL syntax error in your query?

#### 9. Aggregating

i. Use the starter code above to re-create a spark dataframe named `df`.

ii. What is the average `n` value for each group in the `group` column?

iii. What is the maximum `n` value for each group in the `group` column?

iv. What is the minimum `n` value by `abool`?

v. What is the average `n` value for each unique combination of the `group` and `abool` column?