In [1]:
import pyspark

spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [2]:
import pandas as pd
import numpy as np

<hr style="border:2px solid black"> </hr>

### 1. Spark Dataframe Basics

#### a. Use the starter code above to create a pandas dataframe.

In [3]:
np.random.seed(13)

pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

#### b. Convert the pandas dataframe to a spark dataframe. From this point forward, do all of your work with the spark dataframe, not the pandas dataframe.

In [4]:
df = spark.createDataFrame(pandas_dataframe)

#### c. Show the first 3 rows of the dataframe.

In [5]:
df.show(3)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
+--------------------+-----+-----+
only showing top 3 rows



#### d. Show the first 7 rows of the dataframe.

In [6]:
df.show(7)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
+--------------------+-----+-----+
only showing top 7 rows



#### e. View a summary of the data using .describe.

In [7]:
df.describe().show()

+-------+------------------+-----+
|summary|                 n|group|
+-------+------------------+-----+
|  count|                20|   20|
|   mean|0.3664026449885217| null|
| stddev|0.8905322898155363| null|
|    min|-1.261605945319069|    x|
|    max|2.1503829673811126|    z|
+-------+------------------+-----+



#### f. Use .select to create a new dataframe with just the n and abool columns. View the first 5 rows of this dataframe.

In [8]:
#.show() returns None, so keep the new dataframe separate from the display call
df2 = df.select('n', 'abool')
df2.show(5)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
|  -0.712390662050588|false|
|   0.753766378659703|false|
|-0.04450307833805...|false|
| 0.45181233874578974|false|
|  1.3451017084510097|false|
+--------------------+-----+
only showing top 5 rows



#### g. Use .select to create a new dataframe with just the group and abool columns. View the first 5 rows of this dataframe.

In [9]:
#.show() returns None, so keep the new dataframe separate from the display call
df3 = df.select('group', 'abool')
df3.show(5)

+-----+-----+
|group|abool|
+-----+-----+
|    z|false|
|    x|false|
|    z|false|
|    y|false|
|    z|false|
+-----+-----+
only showing top 5 rows



#### h. Use .select to create a new dataframe with the group column and the abool column renamed to a_boolean_value. Show the first 3 rows of this dataframe.

In [10]:
#keep group and rename abool using .alias
df.select(df.group, df.abool.alias('a_boolean_value')).show(3)

+-----+---------------+
|group|a_boolean_value|
+-----+---------------+
|    z|          false|
|    x|          false|
|    z|          false|
+-----+---------------+
only showing top 3 rows



#### i. Use .select to create a new dataframe with the group column and the n column renamed to a_numeric_value. Show the first 6 rows of this dataframe.

In [12]:
#keep group and rename n using .alias
df.select(df.group, df.n.alias('a_numeric_value')).show(6)

+-----+--------------------+
|group|     a_numeric_value|
+-----+--------------------+
|    z|  -0.712390662050588|
|    x|   0.753766378659703|
|    z|-0.04450307833805...|
|    y| 0.45181233874578974|
|    z|  1.3451017084510097|
|    y|  0.5323378882945463|
+-----+--------------------+
only showing top 6 rows



<hr style="border:2px solid black"> </hr>

### 2. Column Manipulation

#### a. Use the starter code above to re-create a spark dataframe. Store the spark dataframe in a variable named df.

In [13]:
df = spark.createDataFrame(pandas_dataframe)

#### b. Use .select to add 4 to the n column. Show the results.

In [14]:
df.select(df.n, df.n + 4).show(5)

+--------------------+------------------+
|                   n|           (n + 4)|
+--------------------+------------------+
|  -0.712390662050588|3.2876093379494122|
|   0.753766378659703| 4.753766378659703|
|-0.04450307833805...|3.9554969216619464|
| 0.45181233874578974|  4.45181233874579|
|  1.3451017084510097|5.3451017084510095|
+--------------------+------------------+
only showing top 5 rows



#### c. Subtract 5 from the n column and view the results.

In [15]:
df.select(df.n, df.n - 5).show(5)

+--------------------+-------------------+
|                   n|            (n - 5)|
+--------------------+-------------------+
|  -0.712390662050588| -5.712390662050588|
|   0.753766378659703| -4.246233621340297|
|-0.04450307833805...| -5.044503078338053|
| 0.45181233874578974|  -4.54818766125421|
|  1.3451017084510097|-3.6548982915489905|
+--------------------+-------------------+
only showing top 5 rows



#### d. Multiply the n column by 2. View the results along with the original numbers.

In [16]:
df.select(df.n, df.n * 2).show(5)

+--------------------+--------------------+
|                   n|             (n * 2)|
+--------------------+--------------------+
|  -0.712390662050588|  -1.424781324101176|
|   0.753766378659703|   1.507532757319406|
|-0.04450307833805...|-0.08900615667610691|
| 0.45181233874578974|  0.9036246774915795|
|  1.3451017084510097|  2.6902034169020195|
+--------------------+--------------------+
only showing top 5 rows



#### e. Add a new column named n2 that is the n value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original n value as well as n2.

In [17]:
#assign to variable
col = df.n * (-1) 
col

Column<'(n * -1)'>

In [18]:
#rename new column using .alias
df.select('*', col.alias('n2')).show(4)

+--------------------+-----+-----+--------------------+
|                   n|group|abool|                  n2|
+--------------------+-----+-----+--------------------+
|  -0.712390662050588|    z|false|   0.712390662050588|
|   0.753766378659703|    x|false|  -0.753766378659703|
|-0.04450307833805...|    z|false|0.044503078338053455|
| 0.45181233874578974|    y|false|-0.45181233874578974|
+--------------------+-----+-----+--------------------+
only showing top 4 rows



In [19]:
#new df with n2 column
df_with_flip = df.select('*', col.alias('n2'))

In [20]:
#look at only n, n2 columns
df_with_flip.select('n','n2').show(4)

+--------------------+--------------------+
|                   n|                  n2|
+--------------------+--------------------+
|  -0.712390662050588|   0.712390662050588|
|   0.753766378659703|  -0.753766378659703|
|-0.04450307833805...|0.044503078338053455|
| 0.45181233874578974|-0.45181233874578974|
+--------------------+--------------------+
only showing top 4 rows



#### f. Add a new column named n3 that is the n value squared. Show the first 5 rows of your dataframe. You should see n, n2, and n3.

In [31]:
col2 = df.n*df.n 
col2

Column<'(n * n)'>

In [34]:
#new df with n3 column (col2 is n squared; col is the negated n used for n2)
df_new = df_with_flip.select('*', col2.alias('n3'))

In [35]:
#look at only n, n2, n3 columns
df_new.select('n','n2','n3').show(5)



#### g. What happens when you run the code below?
- df.group + df.abool

In [25]:
df.group + df.abool

Column<'(group + abool)'>

#### h. What happens when you run the code below? What is the difference between this and the previous code sample?
- df.select(df.group + df.abool)

In [36]:
df.select(df.group + df.abool)

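Running this raises an AnalysisException caused by a data type mismatch (the exact message varies by Spark version).

The previous cell only builds a Column expression, so nothing is evaluated or checked. Passing the same expression to .select makes Spark resolve it against the dataframe's schema, and adding a string column to a boolean column fails that analysis.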

#### i. Try adding various other columns together. What are the results of combining the different data types?

<hr style="border:2px solid black"> </hr>

### 3. Type Casting

- a. Use the starter code above to re-create a spark dataframe.

- b. Use .printSchema to view the datatypes in your dataframe.

- c. Use .dtypes to view the datatypes in your dataframe.

- d. What is the difference between the two code samples below?

>>> df.abool.cast('int')

>>> df.select(df.abool.cast('int')).show()

- e. Use .select and .cast to convert the abool column to an integer type. View the results.

- f. Convert the group column to an integer data type and view the results. What happens?

- g. Convert the n column to an integer data type and view the results. What happens?

- h. Convert the abool column to a string data type and view the results. What happens?

<hr style="border:2px solid black"> </hr>

### 4. Built-in Functions

- a. Use the starter code above to re-create a spark dataframe.
- b. Import the necessary functions from pyspark.sql.functions
- c. Find the highest n value.
- d. Find the lowest n value.
- e. Find the average n value.
- f. Use concat to change the group column to say, e.g. "Group: x" or "Group: y"
- g. Use concat to combine the n and group columns to produce results that look like this: "x: -1.432" or "z: 2.352"

<hr style="border:2px solid black"> </hr>

### 5. When / Otherwise

- a. Use the starter code above to re-create a spark dataframe.
- b. Use when and .otherwise to create a column that contains the text "It is true" when abool is true and "It is false" when abool is false.
- c. Create a column that contains 0 if n is less than 0, otherwise, the original n value.