# <u>Spark API Mini Exercises</u>

### 1. Spark Dataframe Basics


In [2]:
import pyspark

spark = pyspark.sql.SparkSession.builder.getOrCreate()


Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/18 10:01:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


A.) Use the starter code below to create a pandas dataframe.

In [1]:
import pandas as pd
import numpy as np

np.random.seed(13)

pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

B.) Convert the pandas dataframe to a spark dataframe. From this point forward, do all of your work with the spark dataframe, not the pandas dataframe.

In [4]:
df = spark.createDataFrame(pandas_dataframe)
df

DataFrame[n: double, group: string, abool: boolean]

C.) Show the first 3 rows of the dataframe.

In [8]:
df.show(3)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
+--------------------+-----+-----+
only showing top 3 rows



D.) Show the first 7 rows of the dataframe.

In [9]:
df.show(7)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
+--------------------+-----+-----+
only showing top 7 rows



E.) What is the difference between .show and .head?

In [10]:
df.head(3)

[Row(n=-0.712390662050588, group='z', abool=False),
 Row(n=0.753766378659703, group='x', abool=False),
 Row(n=-0.044503078338053455, group='z', abool=False)]

F.) View a summary of the data using .describe.

In [14]:
df.describe().show()

+-------+------------------+-----+
|summary|                 n|group|
+-------+------------------+-----+
|  count|                20|   20|
|   mean|0.3664026449885217| null|
| stddev|0.8905322898155363| null|
|    min|-1.261605945319069|    x|
|    max|2.1503829673811126|    z|
+-------+------------------+-----+



G.) Use .select to create a new dataframe with just the n and abool columns. View the first 5 rows of this dataframe.

In [15]:
df.select('n', 'abool').show(5)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
|  -0.712390662050588|false|
|   0.753766378659703|false|
|-0.04450307833805...|false|
| 0.45181233874578974|false|
|  1.3451017084510097|false|
+--------------------+-----+
only showing top 5 rows



H.) Use .select to create a new dataframe with just the group and abool columns. View the first 5 rows of this dataframe.

In [16]:
df.select('group', 'abool').show(5)

+-----+-----+
|group|abool|
+-----+-----+
|    z|false|
|    x|false|
|    z|false|
|    y|false|
|    z|false|
+-----+-----+
only showing top 5 rows



I.) Use .select to create a new dataframe with the group column and the abool column renamed to a_boolean_value. Show the first 3 rows of this dataframe.

In [19]:
df.select('group', df.abool.alias('a_boolean_value')).show(3)

+-----+---------------+
|group|a_boolean_value|
+-----+---------------+
|    z|          false|
|    x|          false|
|    z|          false|
+-----+---------------+
only showing top 3 rows



J.) Use .select to create a new dataframe with the group column and the n column renamed to a_numeric_value. Show the first 6 rows of this dataframe.

In [20]:
df.select('group', df.n.alias('a_numeric_value')).show(6)

+-----+--------------------+
|group|     a_numeric_value|
+-----+--------------------+
|    z|  -0.712390662050588|
|    x|   0.753766378659703|
|    z|-0.04450307833805...|
|    y| 0.45181233874578974|
|    z|  1.3451017084510097|
|    y|  0.5323378882945463|
+-----+--------------------+
only showing top 6 rows



### 2. Column Manipulation


A.) Use the starter code above to re-create a spark dataframe. Store the spark dataframe in a variable named df.

In [41]:
np.random.seed(123)

pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

df = spark.createDataFrame(pandas_dataframe)
df.show(3)

+-------------------+-----+-----+
|                  n|group|abool|
+-------------------+-----+-----+
|-1.0856306033005612|    y|false|
| 0.9973454465835858|    x| true|
|0.28297849805199204|    x| true|
+-------------------+-----+-----+
only showing top 3 rows



B.) Use .select to add 4 to the n column. Show the results.

In [42]:
df.select(df.n+4).show(3)

+------------------+
|           (n + 4)|
+------------------+
| 2.914369396699439|
| 4.997345446583585|
|4.2829784980519925|
+------------------+
only showing top 3 rows



C.) Subtract 5 from the n column and view the results.

In [43]:
df.select(df.n-5).show(5)

+-------------------+
|            (n - 5)|
+-------------------+
| -6.085630603300562|
| -4.002654553416415|
|-4.7170215019480075|
| -6.506294713918092|
|-5.5786002519685365|
+-------------------+
only showing top 5 rows



D.) Multiply the n column by 2. View the results along with the original numbers.

In [44]:
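df.select('n', df.n*2).show(5)

+-------------------+-------------------+
|                  n|            (n * 2)|
+-------------------+-------------------+
|-1.0856306033005612|-2.1712612066011223|
| 0.9973454465835858| 1.9946908931671716|
|0.28297849805199204| 0.5659569961039841|
| -1.506294713918092| -3.012589427836184|
|-0.5786002519685364|-1.1572005039370727|
+-------------------+-------------------+
only showing top 5 rows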



E.) Add a new column named n2 that is the n value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original n value as well as n2.

In [45]:
df.select('n',(df.n*-1).alias('n2')).show(4)

+-------------------+--------------------+
|                  n|                  n2|
+-------------------+--------------------+
|-1.0856306033005612|  1.0856306033005612|
| 0.9973454465835858| -0.9973454465835858|
|0.28297849805199204|-0.28297849805199204|
| -1.506294713918092|   1.506294713918092|
+-------------------+--------------------+
only showing top 4 rows



F.) Add a new column named n3 that is the n value squared. Show the first 5 rows of your dataframe. You should see n, n2, and n3.

In [46]:
# selectExpr takes SQL expression strings, so all three columns fit in one call
df.selectExpr('n', 'n * -1 as n2', 'n * n as n3').show(5)

+-------------------+--------------------+-------------------+
|                  n|                  n2|                 n3|
+-------------------+--------------------+-------------------+
|-1.0856306033005612|  1.0856306033005612| 1.1785938068227404|
| 0.9973454465835858| -0.9973454465835858| 0.9946979398210122|
|0.28297849805199204|-0.28297849805199204|0.08007683035976126|
| -1.506294713918092|   1.506294713918092| 2.2689237651775866|
|-0.5786002519685364|  0.5786002519685364| 0.3347782515780538|
+-------------------+--------------------+-------------------+
only showing top 5 rows



G.) What happens when you run the code below?

        df.group + df.abool

In [47]:
df.group + df.abool

Column<'(group + abool)'>

H.) What happens when you run the code below? What is the difference between this and the previous code sample?

        df.select(df.group + df.abool)

In [48]:
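# df.select(df.group + df.abool)
# Raises an AnalysisException (data type mismatch): for the + operator Spark
# casts the string column group to double, and a double cannot be added to a boolean.
# The previous sample only builds a Column expression; .select makes Spark
# check that expression against the dataframe's schema, which is where it fails.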

I.) Try adding various other columns together. What are the results of combining the different data types?

### 3. Type Casting

A.) Use the starter code above to re-create a spark dataframe.

In [49]:
np.random.seed(31)

pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

df = spark.createDataFrame(pandas_dataframe)
df.show(3)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|-0.41475721425159034|    x|false|
| -0.3333686686674932|    z|false|
| 0.08109198556483053|    y| true|
+--------------------+-----+-----+
only showing top 3 rows



B.) Use .printSchema to view the datatypes in your dataframe.

In [50]:
df.printSchema()

root
 |-- n: double (nullable = true)
 |-- group: string (nullable = true)
 |-- abool: boolean (nullable = true)



C.) Use .dtypes to view the datatypes in your dataframe.

In [54]:
df.dtypes

[('n', 'double'), ('group', 'string'), ('abool', 'boolean')]

D.) What is the difference between the two code samples below?

        df.abool.cast('int')
        df.select(df.abool.cast('int')).show()

In [62]:
df.abool.cast('int')

# Only builds an unevaluated Column expression; no data is converted yet

Column<'CAST(abool AS INT)'>

In [63]:
df.select(df.abool.cast('int')).show(5)

# Evaluates the cast and displays the resulting 0s and 1s as a column

+-----+
|abool|
+-----+
|    0|
|    0|
|    1|
|    0|
|    0|
+-----+
only showing top 5 rows



E.) Use .select and .cast to convert the abool column to an integer type. View the results.

In [71]:

df.select('abool',df.abool.cast('int')).show(5)

+-----+-----+
|abool|abool|
+-----+-----+
|false|    0|
|false|    0|
| true|    1|
|false|    0|
|false|    0|
+-----+-----+
only showing top 5 rows



F.) Convert the group column to an integer data type and view the results. What happens?

In [77]:
df.select(df.group.cast('int')).show(5)
# All nulls: the strings 'x', 'y', and 'z' cannot be parsed as integers

+-----+
|group|
+-----+
| null|
| null|
| null|
| null|
| null|
+-----+
only showing top 5 rows



G.) Convert the n column to an integer data type and view the results. What happens?

In [78]:
df.select(df.n.cast('int')).show(5)
# The fractional part is truncated toward zero, so these small values all become 0

+---+
|  n|
+---+
|  0|
|  0|
|  0|
|  0|
|  0|
+---+
only showing top 5 rows



H.) Convert the abool column to a string data type and view the results. What happens?

In [80]:
df.select(df.abool.cast('string')).show(5)

+-----+
|abool|
+-----+
|false|
|false|
| true|
|false|
|false|
+-----+
only showing top 5 rows



### 4. Built-in Functions

A.) Use the starter code above to re-create a spark dataframe.


B.) Import the necessary functions from pyspark.sql.functions.

C.) Find the highest n value.

D.) Find the lowest n value.

E.) Find the average n value.

F.) Use concat to change the group column to say, e.g. "Group: x" or "Group: y"

G.) Use concat to combine the n and group columns to produce results that look like this: "x: -1.432" or "z: 2.352"

### 5. When / Otherwise

A.) Use the starter code above to re-create a spark dataframe.

B.) Use when and .otherwise to create a column that contains the text "It is true" when abool is true and "It is false" when abool is false.

C.) Create a column that contains 0 if n is less than 0, otherwise, the original n value.

### 6. Filter/Where

A.) Use the starter code above to re-create a spark dataframe.

B.) Use .filter or .where to select just the rows where the group is y and view the results.

C.) Select just the rows where the abool column is false and view the results.

D.) Find the rows where the group column is not y.

E.) Find the rows where n is positive.

F.) Find the rows where abool is true and the group column is z.

G.) Find the rows where abool is true or the group column is z.

H.) Find the rows where abool is false and n is less than 1.

I.) Find the rows where abool is false or n is less than 1.

### 7. Sorting

A.) Use the starter code above to re-create a spark dataframe.

B.) Sort by the n value.

C.) Sort by the group value, both ascending and descending.

D.) Sort by the group value first, then, within each group, sort by n value.

E.) Sort by abool, group, and n. Does it matter in what order you specify the columns when sorting?

### 8. Aggregating

A.) What is the average n value for each group in the group column?

B.) What is the maximum n value for each group in the group column?

C.) What is the minimum n value by abool?

D.) What is the average n value for each unique combination of the group and abool column?

### 9. Spark SQL

A.) Use the starter code above to re-create a spark dataframe.


B.) Turn your dataframe into a table that can be queried with Spark SQL. Name the table my_df. Answer the rest of the questions in this section with a Spark SQL query (spark.sql) against my_df. After each step, view the first 7 records from the dataframe.

C.) What happens if you make a SQL syntax error in your query?

D.) Write a query that shows all of the columns from your dataframe.


E.) Write a query that shows just the n and abool columns from the dataframe.
