### This notebook covers basic operations on Spark DataFrames

In [1]:
%load_ext nb_black


#### Start a SparkSession where we can try out different basic dataframe operations that can be applied to both batch and streaming processes

In [5]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Spark Basic Operations').getOrCreate()


In [6]:
spark


### Creating a dataframe from scratch

In [7]:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, LongType


In [9]:
schema = StructType(
    [
        StructField(name='city', dataType=StringType(), nullable=True),
        StructField(name='country', dataType=StringType(), nullable=True),
        StructField(name='counts', dataType=LongType(), nullable=False)
    ]
)


In [10]:
rows = [
    Row('Los Angeles', 'United Stares', 3),
    Row('New York', 'United Stated', 1),
    Row('London', 'United Kingdom', 1)
]


In [11]:
parallelizeRows = spark.sparkContext.parallelize(rows)


In [13]:
df = spark.createDataFrame(parallelizeRows, schema)


In [14]:
df.show()

+-----------+--------------+------+
|       city|       country|counts|
+-----------+--------------+------+
|Los Angeles| United Stares|     3|
|   New York| United Stated|     1|
|     London|United Kingdom|     1|
+-----------+--------------+------+




### Create a dataframe from different types of files

#### CSV
df = spark.read.csv('file_path.csv', inferSchema=True, header=True)

#### JSON
df_json = spark.read.json('file_path.json')

### Creating a lazily evaluated "view" that we can use in Spark SQL

In [15]:
df.createOrReplaceTempView("df_table")


#### Print the schema
The schema defines the column names and types of a dataframe. It is worth exploring for reference later on.

In [16]:
df.printSchema()

root
 |-- city: string (nullable = true)
 |-- country: string (nullable = true)
 |-- counts: long (nullable = false)




### Manipulating columns
Columns in Spark are similar to columns in Pandas. We can select, transform and remove columns using expressions. A column cannot be manipulated outside the context of a dataframe, so we need to use Spark transformations on a dataframe to modify a column.

In [17]:
import pyspark.sql.functions as F


Now that we have the dataframe, we can use <i>select</i> to pick columns by name or as Column objects, and <i>selectExpr</i> to pick them with SQL expressions written as strings.

In [18]:
df.select('country').show(1)

+-------------+
|      country|
+-------------+
|United Stares|
+-------------+
only showing top 1 row




In [19]:
df.select(F.col('country')).show(1)

+-------------+
|      country|
+-------------+
|United Stares|
+-------------+
only showing top 1 row




In [20]:
df.select('country', 'city').show(1)

+-------------+-----------+
|      country|       city|
+-------------+-----------+
|United Stares|Los Angeles|
+-------------+-----------+
only showing top 1 row




In [21]:
# Change column name in an expression

df.select(F.expr('country as destination')).show(2)

+-------------+
|  destination|
+-------------+
|United Stares|
|United Stated|
+-------------+
only showing top 2 rows




In [22]:
# Change column name in an expression and then change it back (many manipulations!)
df.select(F.expr('country as destination').alias('country')).show(5)


+--------------+
|       country|
+--------------+
| United Stares|
| United Stated|
|United Kingdom|
+--------------+




In [23]:
# We can make more complex expressions with selectExpr

new_df = df.selectExpr('country as new_country', 'country')


In [24]:
new_df.show()

+--------------+--------------+
|   new_country|       country|
+--------------+--------------+
| United Stares| United Stares|
| United Stated| United Stated|
|United Kingdom|United Kingdom|
+--------------+--------------+




In [25]:
new_df2 = df.selectExpr('avg(counts)', 'count(distinct(country))')


In [26]:
new_df2.show()

+------------------+-----------------------+
|       avg(counts)|count(DISTINCT country)|
+------------------+-----------------------+
|1.6666666666666667|                      3|
+------------------+-----------------------+




In [32]:
# Passing explicit values with literals
df.select(F.expr("*"), F.lit(100).alias("One")).show()

+-----------+--------------+------+---+
|       city|       country|counts|One|
+-----------+--------------+------+---+
|Los Angeles| United Stares|     3|100|
|   New York| United Stated|     1|100|
|     London|United Kingdom|     1|100|
+-----------+--------------+------+---+




In [39]:
# Adding a column
 
df = df.withColumn("One", F.lit(2))


In [40]:
df.show()

+-----------+--------------+------+---+
|       city|       country|counts|One|
+-----------+--------------+------+---+
|Los Angeles| United Stares|     3|  2|
|   New York| United Stated|     1|  2|
|     London|United Kingdom|     1|  2|
+-----------+--------------+------+---+




In [41]:
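# Renaming a column: Spark resolves column names case-insensitively by default,
# so this replaces "One" with an identical column named "one".
# withColumnRenamed("One", "one") expresses the same rename more explicitly.

df = df.withColumn("one", F.expr("One"))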


In [42]:
df.show()

+-----------+--------------+------+---+
|       city|       country|counts|one|
+-----------+--------------+------+---+
|Los Angeles| United Stares|     3|  2|
|   New York| United Stated|     1|  2|
|     London|United Kingdom|     1|  2|
+-----------+--------------+------+---+


