Create a DataFrame:

In [0]:
df = spark.read.format("json").load("/FileStore/tables/2015_summary.json")
# "/FileStore/tables/2015_summary.json"

Print the Schema:

In [0]:
spark.read.format("json").load("/FileStore/tables/2015_summary.json").schema


Create and enforce a specific schema on a DataFrame:

In [0]:
from pyspark.sql.types import StructField, StructType, StringType, LongType

myManualSchema = StructType([
  StructField("DEST_COUNTRY_NAME", StringType(), True),
  StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
  StructField("count", LongType(), False, metadata={"hello":"world"})
])
df = spark.read.format("json").schema(myManualSchema)\
  .load("/FileStore/tables/2015_summary.json")


Construct a column. Columns are not resolved until we compare the column names with those we are maintaining in the catalog. Column and table resolution happens in the analyzer phase.

In [0]:
from pyspark.sql.functions import col, column
col("someColumnName")
column("someColumnName")


**Columns as Expressions**
*expr("someCol - 5")* is the same transformation as *col("someCol") - 5*, or even *expr("someCol") - 5*. That's because Spark compiles these to a logical tree specifying the order of operations.

In [0]:
from pyspark.sql.functions import expr
expr("(((someCol + 5) * 200) - 6) < otherCol")


The above is a directed acyclic graph. This graph is represented equivalently by this code:

In [0]:
(((col("someCol")+5)*200)-6)<col("otherCol")

The previous expression is actually valid SQL code, as well, just like you might put in a SELECT statement. This SQL expression and the previous DataFrame code compile to the same underlying logical tree prior to execution. You can write your expressions as DataFrame code or as SQL expressions and get the exact same performance characteristics.
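To make the idea of a logical tree concrete, here is a toy model in plain Python. It is not Spark's implementation: the classes `Expr`, `ColumnRef`, `Literal`, and `BinaryOp` and the helper `render` are hypothetical names invented for this sketch. It only shows how operator overloading can turn the expression above into a tree whose nesting encodes the order of operations.

```python
from dataclasses import dataclass


class Expr:
    """Base class: each operator returns a new tree node instead of a value."""

    def __add__(self, other):
        return BinaryOp("+", self, to_expr(other))

    def __sub__(self, other):
        return BinaryOp("-", self, to_expr(other))

    def __mul__(self, other):
        return BinaryOp("*", self, to_expr(other))

    def __lt__(self, other):
        return BinaryOp("<", self, to_expr(other))


@dataclass(frozen=True)
class ColumnRef(Expr):
    name: str


@dataclass(frozen=True)
class Literal(Expr):
    value: int


@dataclass(frozen=True)
class BinaryOp(Expr):
    op: str
    left: Expr
    right: Expr


def to_expr(value):
    """Wrap plain Python values as literal nodes; leave tree nodes untouched."""
    return value if isinstance(value, Expr) else Literal(value)


def render(node):
    """Print the tree with explicit parentheses around every operation."""
    if isinstance(node, ColumnRef):
        return node.name
    if isinstance(node, Literal):
        return str(node.value)
    return f"({render(node.left)} {node.op} {render(node.right)})"


tree = (((ColumnRef("someCol") + 5) * 200) - 6) < ColumnRef("otherCol")
print(render(tree))
```

The root of the tree is the `<` comparison, with the arithmetic nested on its left side and the second column on its right, which mirrors the structure described above.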

If you want to programmatically access columns, you can use the columns property to see all columns on a DataFrame:

In [0]:
spark.read.format("json").load("/FileStore/tables/2015_summary.json").columns

Create rows by manually instantiating a Row object with the values that belong in each column:

In [0]:
from pyspark.sql import Row
myRow = Row("Hello", None, 1, False)


Access data in rows by specifying the position:

In [0]:
print(myRow[0])
print(myRow[2])

Create a new temporary view from a DataFrame in the Spark session. If a temporary view with the same name already exists, it is replaced:

In [0]:
df = spark.read.format("json").load("/FileStore/tables/2015_summary.json")
df.createOrReplaceTempView("dfTable")


Create DataFrames on the fly by taking a set of rows and converting them to a DataFrame

In [0]:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, LongType
myManualSchema = StructType([
  StructField("some", StringType(), True),
  StructField("col", StringType(), True),
  StructField("names", LongType(), False)
])
myRow = Row("Hello", None, 1)
myDf = spark.createDataFrame([myRow], myManualSchema)
myDf.show()


Use the select method and pass in, as strings, the names of the columns you would like to work with:

In [0]:
df.select("DEST_COUNTRY_NAME").show(2)


Select multiple columns by using the same style of query:

In [0]:
df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2)


Refer to columns in a number of different ways and use them interchangeably:

In [0]:
from pyspark.sql.functions import expr, col, column
df.select(
    expr("DEST_COUNTRY_NAME"),
    col("DEST_COUNTRY_NAME"),
    column("DEST_COUNTRY_NAME"))\
  .show(2)


*expr* is the most flexible reference: it can refer to a plain column or a string manipulation of a column. Change the column name with the AS keyword, and then change it back with the alias method on the column:

In [0]:
df.select(expr("DEST_COUNTRY_NAME AS destination")).show(2)


In [0]:
df.select(expr("DEST_COUNTRY_NAME as destination").alias("DEST_COUNTRY_NAME"))\
  .show(2)


**selectExpr** shorthand:

In [0]:
df.selectExpr("DEST_COUNTRY_NAME as newColumnName", "DEST_COUNTRY_NAME").show(2)


Add a new column, withinCountry, to our DataFrame that specifies whether the destination and origin are the same:

In [0]:
df.selectExpr(
  "*", # all original columns
  "(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry")\
  .show(2)


Specify aggregations over the entire DataFrame:

In [0]:
df.selectExpr("avg(count)", "count(distinct(DEST_COUNTRY_NAME))").show(2)


Literals - pass explicit values into Spark that are just a value (rather than a new column), such as a constant or something needed for later comparisons:

In [0]:
from pyspark.sql.functions import lit
df.select(expr("*"), lit(1).alias("One")).show(2)


Another way to add a column:

In [0]:
df.withColumn("numberOne", lit(1)).show(2)


Set a Boolean flag for when the origin country is the same as the destination country:

In [0]:
df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME"))\
  .show(2)


withColumnRenamed - rename the column with the name of the string in the first argument to the string in the second argument

In [0]:
df.withColumnRenamed("DEST_COUNTRY_NAME", "dest").columns


In [0]:
dfWithLongColName = df.withColumn(
    "This Long Column-Name",
    expr("ORIGIN_COUNTRY_NAME"))


Using escape characters:

In [0]:
dfWithLongColName.selectExpr(
    "`This Long Column-Name`",
    "`This Long Column-Name` as `new col`")\
  .show(2)


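Columns with reserved characters can be referenced without escaping only in an explicit string-to-column reference such as col, where the string is interpreted as a literal instead of an expression. Inside an expression string the backticks are still required:

In [0]:
dfWithLongColName.select(col("This Long Column-Name")).columns  # literal name, no escaping
dfWithLongColName.select(expr("`This Long Column-Name`")).columns  # expression, needs backticks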


To filter rows, create an expression that evaluates to true or false. To specify multiple AND filters, chain them sequentially:

In [0]:
df.where(col("count") < 2).where(col("ORIGIN_COUNTRY_NAME") != "Croatia")\
  .show(2)


Count the distinct combinations of origin and destination:

In [0]:
df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count()


A very common use case is to extract the unique or distinct values:

In [0]:
df.select("ORIGIN_COUNTRY_NAME").distinct().count()


Random Sampling:

In [0]:
seed = 5
withReplacement = False
fraction = 0.5
df.sample(withReplacement, fraction, seed).count()


Random splits - set the weights by which we will split the DataFrame, along with a seed:

In [0]:
dataFrames = df.randomSplit([0.25, 0.75], seed)
dataFrames[0].count() > dataFrames[1].count() # False
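The weights passed to randomSplit do not have to sum to 1; per the PySpark documentation they are normalized if they do not. The plain-Python sketch below shows that normalization only. The helper name `normalize_weights` is hypothetical and this is not Spark's code.

```python
def normalize_weights(weights):
    """Scale non-negative weights so that they sum to 1.0."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


# [1, 3] describes the same split proportions as [0.25, 0.75].
print(normalize_weights([1, 3]))
print(normalize_weights([0.25, 0.75]))
```

The weights are expected proportions, not exact ones: each row is assigned at random, so the resulting counts only approximate the requested fractions.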


To append to a DataFrame, you must union the original DataFrame with the new DataFrame, and you must be sure that they have the same schema and number of columns; otherwise, the union will fail.

In [0]:
from pyspark.sql import Row
schema = df.schema
newRows = [
  Row("New Country", "Other Country", 5L),
  Row("New Country 2", "Other Country 3", 1L)
]
parallelizedRows = spark.sparkContext.parallelize(newRows)
newDF = spark.createDataFrame(parallelizedRows, schema)


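The cell above uses Python 2 long-integer literals (5L, 1L), which are a syntax error in Python 3. Python 3 integers support unlimited size, in contrast to Python 2, which has a separate type for long integers, so remove the L: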

In [0]:
from pyspark.sql import Row
schema = df.schema
newRows = [
  Row("New Country", "Other Country", 5),
  Row("New Country 2", "Other Country 3", 1)
]
parallelizedRows = spark.sparkContext.parallelize(newRows)
newDF = spark.createDataFrame(parallelizedRows, schema)
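Although Python 3 integers are unbounded, Spark's LongType is an 8-byte signed integer, so a value still has to fit in 64 bits to be stored in the count column. A minimal plain-Python check, with the hypothetical helper name `fits_in_long`, could look like this:

```python
# Bounds of an 8-byte signed integer, the range of Spark's LongType.
LONG_MIN = -(2 ** 63)
LONG_MAX = 2 ** 63 - 1


def fits_in_long(value: int) -> bool:
    """Return True if a Python int can be represented as a 64-bit signed long."""
    return LONG_MIN <= value <= LONG_MAX


print(type(5).__name__)        # a plain Python 3 integer, no L suffix needed
print(fits_in_long(5))
print(fits_in_long(2 ** 70))   # valid Python int, but too large for LongType
```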


In [0]:
df.union(newDF)\
  .where("count = 1")\
  .where(col("ORIGIN_COUNTRY_NAME") != "United States")\
  .show()


Sort with either the largest or smallest values at the top of a DataFrame. Two equivalent operations do this, sort and orderBy, and they work the exact same way. They accept both column expressions and strings as well as multiple columns. The default is to sort in ascending order.

In [0]:
df.sort("count").show(5)
df.orderBy("count", "DEST_COUNTRY_NAME").show(5)
df.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(5)


To more explicitly specify sort direction, you need to use the asc and desc functions if operating on a column:

In [0]:
from pyspark.sql.functions import desc, asc
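# Use desc() here: expr("count desc") is parsed as the column count aliased to "desc", so it sorts ascending
df.orderBy(desc("count")).show(2)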
df.orderBy(col("count").desc(), col("DEST_COUNTRY_NAME").asc()).show(2)


Sort within partitions for optimization purposes:

In [0]:
spark.read.format("json").load("/FileStore/tables/2015_summary.json")\
  .sortWithinPartitions("count")


In [0]:
df.limit(5).show()


In [0]:
df.orderBy(col("count").desc()).limit(6).show()


Only repartition when the future number of partitions is greater than your current number of partitions or when you are looking to partition by a set of columns:

In [0]:
df.rdd.getNumPartitions() # 1


In [0]:
df.repartition(5)


If you filter by a certain column often, it can be worth repartitioning based on that column:

In [0]:
df.repartition(col("DEST_COUNTRY_NAME"))


In [0]:
df.repartition(5, col("DEST_COUNTRY_NAME"))


Coalesce will not incur a full shuffle and will try to combine partitions. The operation below shuffles your data into five partitions based on the destination country name, and then coalesces them (without a full shuffle):

In [0]:
df.repartition(5, col("DEST_COUNTRY_NAME")).coalesce(2)


Spark maintains the state of the cluster in the driver. There are times when you’ll want to collect some of your data to the driver in order to manipulate it on your local machine. *Collect* gets all data from the entire DataFrame, *take* selects the first N rows, and *show* prints out a number of rows.

In [0]:
collectDF = df.limit(10)
collectDF.take(5) # take works with an Integer count
collectDF.show() # this prints it out nicely
collectDF.show(5, False)
collectDF.collect()


The method toLocalIterator collects partitions to the driver as an iterator, which allows you to iterate over the entire dataset partition-by-partition in a serial manner.

In [0]:
collectDF.toLocalIterator()