<a href="https://colab.research.google.com/github/jhashuva/pyspark_tutorial/blob/main/Basic_Structured_Operations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
% pip install pyspark

Collecting pyspark
  Downloading pyspark-3.1.2.tar.gz (212.4 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880768 sha256=e8b3b95ac0de596c80cebe5f2714fb7c453af2ab56d49df3bd5d5ecb3f31e27f
  Stored in directory: /root/.cache/pip/wheels/a5/0a/c1/9561f6fecb759579a7d863dcd846daaa95f598744e71b02c77
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.2


In [3]:
from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession

In [4]:
conf = SparkConf().setMaster("local").setAppName("Basic Structured operations")
sc = SparkContext(conf= conf)
spark = SparkSession(sc)

In [5]:
df = spark.read.format("json").load("/content/drive/MyDrive/data for spark/2015-summary.json")

In [6]:
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



In [7]:
spark.read.format("json").load("/content/drive/MyDrive/data for spark/2015-summary.json").schema

StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,LongType,true)))

- The example that follows shows how to create and enforce a specific schema on a DataFrame.



In [8]:
from pyspark.sql.types import StructField, StructType, LongType, StringType

In [9]:
myManualSchema = StructType([
StructField("DEST_COUNTRY_NAME", StringType(), True),
StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
StructField("count", LongType(), False, metadata={"hello":"world"})
])

In [10]:
df = spark.read.format("json").schema(myManualSchema)\
.load("/content/drive/MyDrive/data for spark/2015-summary.json")

# Columns and Expressions
To Spark, columns are logical constructions that simply represent a value computed on a per-record basis by means of an expression.

## Columns
There are many ways to construct and refer to columns, but the two simplest are the ```column``` and ```col``` functions.

In [11]:
from pyspark.sql.functions import col, column
col("someColumnName")
column("someColumnName")

Column<'someColumnName'>

### Columns as expressions

In [12]:

from pyspark.sql.functions import expr
expr("(((somecol+5)*200)-6)< otherCol")

Column<'((((somecol + 5) * 200) - 6) < otherCol)'>

### Accessing a DataFrame's Columns

In [13]:
spark.read.format("json").load("/content/drive/MyDrive/data for spark/2015-summary.json").columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']

In [14]:
df.first()

Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15)

# Records and Rows

You can create rows by manually instantiating a Row object with the values that belong in each column. 
- It is important to note that only DataFrames have schemas. Rows themselves do not have schemas.

In [15]:
from pyspark.sql import Row
myRow = Row("Hello",None,1,False)

In [16]:
myRow[0]

'Hello'

In [17]:
myRow[1]

In [18]:
myRow[2]

1

# DataFrame Transformations 
- add rows or columns
- remove rows or columns
- transform a row into a column (or vice versa)
- change the order of rows based on the values in columns

## Creating DataFrames


In [19]:
df = spark.read.format("json").load("/content/drive/MyDrive/data for spark/2015-summary.json")
df.createOrReplaceTempView("dfTable")

We can create DataFrames on the fly by taking a set of rows and converting them to a DataFrame.

In [20]:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, LongType
myManualSchema = StructType([
                             StructField("some",StringType(),True),
                             StructField("col", StringType(), True),
                             StructField("names",LongType(), False)
])
myRow = Row("Hello", None, 1)
myDf = spark.createDataFrame([myRow],myManualSchema)

In [21]:
myDf.show()

+-----+----+-----+
| some| col|names|
+-----+----+-----+
|Hello|null|    1|
+-----+----+-----+



## select and selectExpr

- The ```select``` method is used when we are working with columns or expressions.

- The ```selectExpr``` method is used when we work with expressions in strings.

In [22]:
# in SQL, run through spark.sql because the %%sql cell magic is not available in this notebook
spark.sql("SELECT DEST_COUNTRY_NAME FROM dfTable LIMIT 2").show()


In [23]:
df.select("DEST_COUNTRY_NAME").show(2)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows



In [24]:
df.select("DEST_COUNTRY_NAME","ORIGIN_COUNTRY_NAME").show(2)

+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|    United States|            Romania|
|    United States|            Croatia|
+-----------------+-------------------+
only showing top 2 rows



One common error is attempting to mix Column objects and strings. In Scala, such a call fails to compile; in PySpark, the following call is accepted, as its output shows.

In [25]:
df.select(col("DEST_COUNTRY_NAME"),"DEST_COUNTRY_NAME")

DataFrame[DEST_COUNTRY_NAME: string, DEST_COUNTRY_NAME: string]

```expr``` is the most flexible reference that we can use. It can refer to a plain column or to a string manipulation of a column. To illustrate, let's change the column name and then change it back, using the AS keyword and then the ```alias``` method on the column.

In [26]:
df.select(expr("DEST_COUNTRY_NAME AS destination")).show(2)

+-------------+
|  destination|
+-------------+
|United States|
|United States|
+-------------+
only showing top 2 rows



This changes the column name to "destination". You can further manipulate the result of your expression as another expression. 

In [27]:
df.select(expr("DEST_COUNTRY_NAME as destination").alias("DEST_COUNTRY_NAME")).show(2)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows



In [28]:
df.columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']

In [29]:
df.selectExpr("DEST_COUNTRY_NAME as newColumnName", "DEST_COUNTRY_NAME").show(2)

+-------------+-----------------+
|newColumnName|DEST_COUNTRY_NAME|
+-------------+-----------------+
|United States|    United States|
|United States|    United States|
+-------------+-----------------+
only showing top 2 rows



In [30]:
df.selectExpr(
    "*",  # all original columns
    "(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry").show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



With ```selectExpr```, we can also specify aggregations over the entire DataFrame by taking advantage of the functions that we have.

In [31]:
df.selectExpr("avg(count)","count(distinct(DEST_COUNTRY_NAME))").show(2)

+-----------+---------------------------------+
| avg(count)|count(DISTINCT DEST_COUNTRY_NAME)|
+-----------+---------------------------------+
|1770.765625|                              132|
+-----------+---------------------------------+



## Converting to Spark Types (Literals)
- Sometimes we need to pass explicit values into Spark that are just values rather than new columns: a constant, or something to compare against later. One way to do this is through ```literals```.
- A literal is basically a translation from a given programming language's literal value to one that Spark understands.

- Literals are expressions, and we can use them in the same way we use other expressions.

In [32]:
from pyspark.sql.functions import lit
df.select(expr("*"), lit(1).alias("One")).show(2)

+-----------------+-------------------+-----+---+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|One|
+-----------------+-------------------+-----+---+
|    United States|            Romania|   15|  1|
|    United States|            Croatia|    1|  1|
+-----------------+-------------------+-----+---+
only showing top 2 rows



## Adding Columns
Using the ```withColumn``` method, we can add a new column to the DataFrame.

In [33]:
df.withColumn("numberOne",lit(1)).show(2)

+-----------------+-------------------+-----+---------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|numberOne|
+-----------------+-------------------+-----+---------+
|    United States|            Romania|   15|        1|
|    United States|            Croatia|    1|        1|
+-----------------+-------------------+-----+---------+
only showing top 2 rows



In [34]:
df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME==DEST_COUNTRY_NAME")).show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



- The ```withColumn``` method takes two arguments: the column name and the expression that will create the value for the given row in the DataFrame.

In [35]:
df.withColumn("Destination", expr("DEST_COUNTRY_NAME")).columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count', 'Destination']

## Renaming Columns
We can rename columns with the ```withColumnRenamed``` method.

In [36]:
df.withColumnRenamed("DEST_COUNTRY_NAME", "dest").columns

['dest', 'ORIGIN_COUNTRY_NAME', 'count']

## Reserved Characters and Keywords


In [37]:
dfWithLongColName = df.withColumn(
    "This Long Column-Name",
    expr("ORIGIN_COUNTRY_NAME")
)

In [38]:
# backticks, not quotes, escape a name that contains spaces or hyphens
dfWithLongColName.selectExpr(
    "`This Long Column-Name`",
    "`This Long Column-Name` as `new col`").show(2)

In [39]:
dfWithLongColName.createOrReplaceTempView("dfTableLong")

In [40]:
# in SQL, run through spark.sql; backticks (not quotes) escape reserved characters
spark.sql("SELECT `This Long Column-Name`, `This Long Column-Name` AS `new col` FROM dfTableLong LIMIT 2").show()

## Case Sensitivity

- By default Spark is case insensitive; however, we can make Spark case sensitive by setting the ```spark.sql.caseSensitive``` configuration.

In [41]:
# in SQL, run through spark.sql
spark.sql("SET spark.sql.caseSensitive=true")

## Removing Columns

- There is a dedicated method for this called ```drop```:

In [42]:
df.drop("ORIGIN_COUNTRY_NAME").columns

['DEST_COUNTRY_NAME', 'count']

- We can drop multiple columns by passing multiple columns as arguments:

In [43]:
dfWithLongColName.drop("ORIGIN_COUNTRY_NAME","DEST_COUNTRY_NAME")

DataFrame[count: bigint, This Long Column-Name: string]

### Changing a Column's Type (cast)
- We can convert a column from one type to another by casting it with the ```cast``` method.

In [44]:
df.withColumn("count2",col("count").cast("long"))

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint, count2: bigint]

In [45]:
# in SQL, run through spark.sql
spark.sql("SELECT *, cast(count as long) AS count2 FROM dfTable")

### Filtering Rows
- To filter rows, we create an expression that evaluates to true or false.
- We then filter out the rows for which the expression evaluates to false.

- There are two methods to perform this operation: ```where``` and ```filter```. Both perform the same operation and accept the same argument types when used with DataFrames.

NOTE: When using the Dataset API from either Scala or Java, ```filter``` also accepts an arbitrary function that Spark will apply to each record in the Dataset.

In [46]:
df.filter(col("count")<2).show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



In [47]:
df.where("count<2").show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



In [48]:
# in SQL, run through spark.sql
spark.sql("SELECT * FROM dfTable WHERE count < 2 LIMIT 2").show()

Spark automatically performs all filtering operations at the same time regardless of the filter ordering. 
- If we want to specify multiple AND filters, chain them sequentially and let Spark handle the rest.

In [49]:
# in python
df.where(col("count")<2).where(col("ORIGIN_COUNTRY_NAME")!= "Croatia").show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|          Singapore|    1|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



In [50]:
# in SQL, run through spark.sql
spark.sql("SELECT * FROM dfTable WHERE count < 2 AND ORIGIN_COUNTRY_NAME != 'Croatia' LIMIT 2").show()

### Getting Unique Rows
We can extract the unique rows of a DataFrame with the ```distinct``` method, which deduplicates any rows in that DataFrame.

In [51]:
# in python
df.select("ORIGIN_COUNTRY_NAME","DEST_COUNTRY_NAME").distinct().count()

256

In [52]:
# in SQL, run through spark.sql
spark.sql("SELECT COUNT(DISTINCT(ORIGIN_COUNTRY_NAME, DEST_COUNTRY_NAME)) FROM dfTable").show()

In [53]:
# in python
df.select("ORIGIN_COUNTRY_NAME").distinct().count()

125

In [54]:
# in SQL, run through spark.sql
spark.sql("SELECT COUNT(DISTINCT ORIGIN_COUNTRY_NAME) FROM dfTable").show()

### Random Samples
To sample random records, we can use the ```sample``` method on a DataFrame, specifying the fraction of rows to extract and whether to sample with or without replacement.

In [55]:
# in python
seed = 5
withReplacement = False
fraction = 0.5
df.sample(withReplacement, fraction, seed).count()

138

### Concatenating and Appending Rows(Union)
- To append to a DataFrame, we must ```union``` the original DataFrame with the new DataFrame, which concatenates the two.
- To ```union``` two DataFrames, we must be sure that they have the same schema and number of columns; otherwise the union will fail.

**WARNING**: Unions are currently performed based on location, not on the schema. This means that columns will not automatically line up the way we might expect.

In [56]:
# in python
from pyspark.sql import Row
schema = df.schema
newRows = [
           Row("New Country", "Other Country", 5),
           Row("New Country 2", "Other Country 3", 1)
]
parallelizedRows = spark.sparkContext.parallelize(newRows)
newDF = spark.createDataFrame(parallelizedRows, schema)

In [57]:
df.union(newDF).where("count=1").where(col("ORIGIN_COUNTRY_NAME")!="United States").show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
|    United States|          Gibraltar|    1|
|    United States|             Cyprus|    1|
|    United States|            Estonia|    1|
|    United States|          Lithuania|    1|
|    United States|           Bulgaria|    1|
|    United States|            Georgia|    1|
|    United States|            Bahrain|    1|
|    United States|   Papua New Guinea|    1|
|    United States|         Montenegro|    1|
|    United States|            Namibia|    1|
|    New Country 2|    Other Country 3|    1|
+-----------------+-------------------+-----+



### Sorting Rows
When we sort the values in a DataFrame, we usually want the largest or smallest values at the top.

- There are two equivalent operations to do this ```sort``` and ```orderBy``` that work the exact same way.

- They accept both column expressions and strings as well as multiple columns. 

- The default is to sort in ascending order.

In [58]:
# in python
df.sort("count").show(5)

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|               Malta|      United States|    1|
|Saint Vincent and...|      United States|    1|
|       United States|            Croatia|    1|
|       United States|          Gibraltar|    1|
|       United States|          Singapore|    1|
+--------------------+-------------------+-----+
only showing top 5 rows



- To specify the sort direction more explicitly, use the ```asc``` and ```desc``` functions when operating on a column. These allow you to specify the order in which a given column should be sorted.

In [59]:
from pyspark.sql.functions import desc, asc
# expr("count desc") is parsed as the column count aliased to "desc", which sorts ascending
df.orderBy(desc("count")).show(2)

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|370002|
|    United States|             Canada|  8483|
+-----------------+-------------------+------+
only showing top 2 rows



In [60]:
df.orderBy(col("count").desc(), col("DEST_COUNTRY_NAME").asc()).show(2)

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|370002|
|    United States|             Canada|  8483|
+-----------------+-------------------+------+
only showing top 2 rows



In [61]:
# in SQL, run through spark.sql
spark.sql("SELECT * FROM dfTable ORDER BY count DESC, DEST_COUNTRY_NAME ASC LIMIT 2").show()

- An advanced tip is to use ```asc_nulls_first```, ```desc_nulls_first```, ```asc_nulls_last```, or ```desc_nulls_last``` to specify where we would like null values to appear in an ordered DataFrame.

- For optimization purposes, it's sometimes advisable to sort within each partition before another set of transformations. We can use the ```sortWithinPartitions``` method to do this.

In [62]:
# in python
df.sortWithinPartitions("count")

### Limit
We might want to restrict what we extract from a DataFrame; for example, we might want just the top ten rows of some DataFrame. We can do this by using the ```limit``` method:


In [63]:
df.limit(5).show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+



In [64]:
# in SQL, run through spark.sql
spark.sql("SELECT * FROM dfTable LIMIT 6").show()

In [65]:
# col("count").desc() sorts descending; expr("count desc") would alias count to "desc" and sort ascending
df.orderBy(col("count").desc()).limit(6).show()



In [66]:
# in SQL, run through spark.sql
spark.sql("SELECT * FROM dfTable ORDER BY count DESC LIMIT 6").show()

### Repartitioning and Coalesce
- Another important optimization opportunity is to partition the data according to some frequently filtered columns, which controls the physical layout of data across the cluster, including the partitioning scheme and the number of partitions.

- ```repartition``` incurs a full shuffle of the data, regardless of whether one is necessary. This means we should typically repartition only when the future number of partitions is greater than the current number of partitions or when we want to partition by a set of columns.

In [67]:
#in python
df.rdd.getNumPartitions()

1

In [68]:
df.repartition(5)

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

If we know that we are going to be filtering by a certain column often, it can be worth repartitioning based on that column.

In [69]:
#python
df.repartition(col("DEST_COUNTRY_NAME"))

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

- We can optionally specify the number of partitions we would like.

In [70]:
# in python
df.repartition(5, col("DEST_COUNTRY_NAME"))

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

- ```coalesce``` does not incur a full shuffle and tries to combine partitions. The following operation shuffles our data into five partitions based on the destination country name and then coalesces them (without a full shuffle).

In [73]:
#in python
df.repartition(5, col("DEST_COUNTRY_NAME")).coalesce(2)

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

### Collecting Rows to the Driver
- Spark maintains the state of the cluster in the driver.
- There will be times when we want to collect some of our data to the driver in order to manipulate it on our local machine.
- ```collect``` gets all data from the entire DataFrame, ```take``` selects the first *N* rows, and ```show``` prints out a number of rows nicely.

In [76]:
# in python
collectDF = df.limit(10)
collectDF.take(5) #take works with an Integer count
collectDF.show() #this prints it out nicely
collectDF.show(5, False)
collectDF.collect()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
+-----------------+-------------------+-----+
only showing top 5 rows

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344),
 Row(DEST_COUNTRY_NAME='Egypt', ORIGIN_COUNTRY_NAME='United States', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='India', count=62),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Grenada', count=62),
 Row(DEST_COUNTRY_NAME='Costa Rica', ORIGIN_COUNTRY_NAME='United States', count=588),
 Row(DEST_COUNTRY_NAME='Senegal', ORIGIN_COUNTRY_NAME='United States', count=40),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]

There's an additional way of collecting rows to the driver in order to iterate over the entire dataset.
- The method ```toLocalIterator``` collects partitions to the driver as an iterator.

- This method allows us to iterate over the entire dataset partition-by-partition in a serial manner.

In [77]:
collectDF.toLocalIterator()

<generator object _local_iterator_from_socket.<locals>.PyLocalIterable.__iter__ at 0x7ff13ee6ba50>

**WARNING:** Any collection of data to the driver can be a very expensive operation. If we have a large dataset and call ```collect```, we can crash the driver. If we use ```toLocalIterator``` and have very large partitions, we can easily crash the driver node and lose the state of our application. This is also expensive because we process the data one-by-one instead of running the computation in parallel.