# PySpark Recipes: Chapter 4: Joins


We will work with a sample dataset unrelated to our census data for this exercise, so that you can appreciate the different kinds of joins that Spark makes possible.

We will start with the regular joins (Inner, Left, Right and Outer/Full) and then move on to Anti, Semi and Cross joins as we progress.

Summary of supported join types:
- Inner Join ('inner') 
- Left Join ('left', 'leftouter', 'left_outer')
- Right Join ('right', 'rightouter', 'right_outer')
- Outer/Full Joins ('outer', 'full', 'fullouter', 'full_outer')
- Left Semi Join ('leftsemi', 'left_semi')
- Left Anti Join ('leftanti', 'left_anti')
- Cross Join ('cross')


In [3]:
import pyspark
import time
from pyspark.sql import SparkSession


# Create a SparkSession object - you can change any of the configuration options you like. Remember that
# getOrCreate() returns the existing SparkSession, if there is one, instead of creating a new one.
# So in case your previous notebook is still running - no issues.
sparkSession = SparkSession \
                .builder \
                .master("local") \
                .appName("Pyspark Recipes - Joins") \
                .getOrCreate()


# Names of the participants and their final marks. Ignore Fractal_ID :)
# Scores are purely random, so if your score is not what you expected, blame it on chance.
dfScores = sparkSession.read.format('csv') \
            .options(header = True, inferSchema = True, sep = ",", enforceSchema = True,
                ignoreLeadingWhiteSpace = True, ignoreTrailingWhiteSpace = True) \
            .load('../datasets/charityml/dataScores.csv')

# Locations of the participants - some participants' locations have been deleted
# and some dummy IDs and locations have been added.
dfLocation = sparkSession.read.format('csv') \
            .options(header = True, inferSchema = True, sep = ",", enforceSchema = True,
                ignoreLeadingWhiteSpace = True, ignoreTrailingWhiteSpace = True) \
            .load('../datasets/charityml/dataLocation.csv')



In [4]:
# Observe that dfScores has a total of 30 rows,
# and Piyusha Biswas has two scores (nothing intentional, the name simply came first in the email list)
dfScores.show(n=30) 
dfScores.count()

+----------+-------------------+------+
|Fractal_ID|               Name|Scores|
+----------+-------------------+------+
|         1|     Piyusha Biswas|    99|
|         1|     Piyusha Biswas|    89|
|         2|Siddhartha Nuthakki|    92|
|         3|     Phani Kompella|    86|
|         4|   Gaurav Acharekar|    93|
|         5|       Shadab Azeem|    96|
|         6|       Rachit Sapra|    86|
|         7|      Tulika Mittal|    90|
|         8|          Narhari B|    81|
|         9|       Akash Saxena|    94|
|        10|        Heba Nomani|    96|
|        11|      Manish Shukla|    96|
|        12|   Aishwary Mandloi|    84|
|        13|      Praveen Nagel|    91|
|        14|         Viral Jani|    90|
|        15|      Charls Joseph|    81|
|        16|     Affan Mohammed|    88|
|        17|       Ashish Aswal|   100|
|        18|         Santosh Tv|    88|
|        19|        Maaz Ansari|    98|
|        20|   Sangamesh Kalagi|    82|
|        21|       Madhusudan B|    94|


30

In [5]:
# Observe that dfLocation has 26 rows;
# data for Fractal_ID 2, 16, 17, 21, 22, 23 and 26 is missing.
# Our target, Biswasji, has two entries,
# and the records with Fractal_ID 30, 31 and 32 do not exist in dfScores.
dfLocation.show(n=26) 
dfLocation.count()

+----------+--------------+--------------+
|Fractal_ID|       Country|          City|
+----------+--------------+--------------+
|         1| United States|    Louisville|
|         1|         India|       Gurgaon|
|         3|         India|       Gurgaon|
|         4|         India|        Mumbai|
|         5|         India|       Gurgaon|
|         6|         India|       Gurgaon|
|         7|United Kingdom|        London|
|         8|         India|     Bengaluru|
|         9|         India|       Gurgaon|
|        10|         India|        Mumbai|
|        11|         India|       Gurgaon|
|        12|         India|     Bengaluru|
|        13|         India|       Gurgaon|
|        14|         India|        Mumbai|
|        15| United States|       Atlanta|
|        18|         India|     Bengaluru|
|        19|         India|        Mumbai|
|        20|         India|     Bengaluru|
|        24|         India|        Mumbai|
|        25| United States|      New York|
|        27

26

## Inner joins

Sometimes loosely called a "natural join" when the join column has the same name on both sides, an inner join returns only the rows for which the join condition is met in both dataframes. This is also the default join that Spark uses if a join condition ("on") is provided but no join type, i.e. <span style="color:blue">how='join type'</span>, is specified.

If you observe the output, you can see that the Fractal IDs missing from dfLocation (2, 16, 17, 21, 22, 23, 26) are not there, and neither are the IDs that exist only in dfLocation (30, 31, 32). Another interesting observation is that Piyusha Biswas is repeated 4 times: each of the two score rows matches each of the two location rows. If dfLocation had just one entry for Fractal_ID 1, we would have seen Piyusha Biswas repeated twice, once per score, both times with the same country and city.
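
To see why two score rows and two location rows produce four output rows, here is a minimal plain-Python sketch of the matching logic. It is only an illustration of the semantics, not how Spark executes a join. The helper `inner_join` and the trimmed-down rows (only an ID plus one value each) are assumptions made for the example; the values themselves are copied from the tables above.

```python
# Plain-Python sketch of inner-join matching (no Spark needed).
# Rows are (Fractal_ID, value) tuples copied from the tables above.
scores = [(1, 99), (1, 89), (3, 86)]
locations = [(1, "Louisville"), (1, "Gurgaon"), (3, "Gurgaon"), (31, "Mumbai")]


def inner_join(left, right):
    """Pair every left row with every right row that has the same key."""
    return [
        (l_key, l_val, r_val)
        for l_key, l_val in left
        for r_key, r_val in right
        if l_key == r_key
    ]


joined = inner_join(scores, locations)
for row in joined:
    print(row)

# Fractal_ID 1 contributes 2 x 2 = 4 rows, Fractal_ID 3 contributes 1 row,
# and Fractal_ID 31 (no score) contributes nothing.
print(len(joined))  # 5
```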


In [55]:
#dfInnerJoin = dfScores.join(dfLocation, on=['Fractal_ID'], how='inner')

dfInnerJoin = dfScores.join(dfLocation, on=['Fractal_ID'])

dfInnerJoin.show(n=25)
dfInnerJoin.count()

+----------+------------------+------+--------------+--------------+
|Fractal_ID|              Name|Scores|       Country|          City|
+----------+------------------+------+--------------+--------------+
|         1|    Piyusha Biswas|    99|         India|       Gurgaon|
|         1|    Piyusha Biswas|    99| United States|    Louisville|
|         1|    Piyusha Biswas|    89|         India|       Gurgaon|
|         1|    Piyusha Biswas|    89| United States|    Louisville|
|         3|    Phani Kompella|    86|         India|       Gurgaon|
|         4|  Gaurav Acharekar|    93|         India|        Mumbai|
|         5|      Shadab Azeem|    96|         India|       Gurgaon|
|         6|      Rachit Sapra|    86|         India|       Gurgaon|
|         7|     Tulika Mittal|    90|United Kingdom|        London|
|         8|         Narhari B|    81|         India|     Bengaluru|
|         9|      Akash Saxena|    94|         India|       Gurgaon|
|        10|       Heba Nomani|   

25

Just out of curiosity, let us take a look at the schema of the returned dataframe.

In [79]:
dfInnerJoin.printSchema()

root
 |-- Fractal_ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Scores: integer (nullable = true)
 |-- Country: string (nullable = true)
 |-- City: string (nullable = true)




While we stick to the above syntax style, the following is equally acceptable. You can omit the "on" and "how" keywords and pass the arguments positionally, as long as you maintain the order: the other dataframe, the join column(s), then the join type.


In [44]:
dfInnerJoin = dfScores.join(dfLocation, "Fractal_ID", 'inner')
dfInnerJoin.show(n=25)
dfInnerJoin.count()

+----------+------------------+------+--------------+--------------+
|Fractal_ID|              Name|Scores|       Country|          City|
+----------+------------------+------+--------------+--------------+
|         1|    Piyusha Biswas|    99|         India|       Gurgaon|
|         1|    Piyusha Biswas|    99| United States|    Louisville|
|         1|    Piyusha Biswas|    89|         India|       Gurgaon|
|         1|    Piyusha Biswas|    89| United States|    Louisville|
|         3|    Phani Kompella|    86|         India|       Gurgaon|
|         4|  Gaurav Acharekar|    93|         India|        Mumbai|
|         5|      Shadab Azeem|    96|         India|       Gurgaon|
|         6|      Rachit Sapra|    86|         India|       Gurgaon|
|         7|     Tulika Mittal|    90|United Kingdom|        London|
|         8|         Narhari B|    81|         India|     Bengaluru|
|         9|      Akash Saxena|    94|         India|       Gurgaon|
|        10|       Heba Nomani|   

25


What if the columns to join on have different names in the two dataframes? You can specify the column from each dataframe explicitly in a boolean expression.

While the rows are the same, we can see one difference here: the column "Fractal_ID" is repeated. In the previous section, Spark knew that "Fractal_ID" was the same column on both sides and kept only one copy. When you explicitly use "==", however, it treats them as two different columns while evaluating the boolean expression, and returns both as part of the resulting dataframe.


In [43]:
dfInnerJoin = dfScores.join(dfLocation, dfScores.Fractal_ID == dfLocation.Fractal_ID, 'inner')
dfInnerJoin.show(n=25)
dfInnerJoin.count()

+----------+------------------+------+----------+--------------+--------------+
|Fractal_ID|              Name|Scores|Fractal_ID|       Country|          City|
+----------+------------------+------+----------+--------------+--------------+
|         1|    Piyusha Biswas|    99|         1|         India|       Gurgaon|
|         1|    Piyusha Biswas|    99|         1| United States|    Louisville|
|         1|    Piyusha Biswas|    89|         1|         India|       Gurgaon|
|         1|    Piyusha Biswas|    89|         1| United States|    Louisville|
|         3|    Phani Kompella|    86|         3|         India|       Gurgaon|
|         4|  Gaurav Acharekar|    93|         4|         India|        Mumbai|
|         5|      Shadab Azeem|    96|         5|         India|       Gurgaon|
|         6|      Rachit Sapra|    86|         6|         India|       Gurgaon|
|         7|     Tulika Mittal|    90|         7|United Kingdom|        London|
|         8|         Narhari B|    81|  

25


<span style="color:green;"><i>Extra Gyaan:</i></span>
The "on" condition is evaluated as a boolean. When we pass the name "Fractal_ID" as a parameter, Spark looks for a "Fractal_ID" column in both dataframes and performs an equality ("==") check on them; rows are joined on the basis of that boolean value. The code below produces an error: a single column such as dfScores.Fractal_ID is not a boolean expression, and there is no second column to compare it with.


In [None]:
#----- This would produce an error -----
dfInnerJoin = dfScores.join(dfLocation, dfScores.Fractal_ID, 'inner')


We will stick to the syntax from the first example, dfScores.join(dfLocation, on=[...], how=...), for the remaining demos. However, feel free to use the syntax you are comfortable with.

## Left Join

A left join returns all the rows from the left dataframe (in this case dfScores), and the matched rows from the right dataframe (dfLocation).

If you observe the output, you can see that the Fractal IDs missing from dfLocation (2, 16, 17, 21, 22, 23, 26) are now present, with their Country and City filled with null.

Another interesting observation is that Piyusha Biswas is repeated 4 times, again! If dfLocation had just one entry for Fractal_ID 1, we would have seen Piyusha Biswas repeated twice, once for each score.
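
The null padding can be pictured with a small plain-Python sketch. This is an assumption-laden illustration rather than Spark's implementation: the helper `left_join` is hypothetical, `None` stands in for Spark's null, and only three participants from the tables above are used.

```python
# Plain-Python sketch of left-join semantics: keep every left row,
# and pad with None when the right side has no match.
scores = [
    (1, "Piyusha Biswas", 99),
    (2, "Siddhartha Nuthakki", 92),
    (3, "Phani Kompella", 86),
]
# Right side grouped by Fractal_ID; ID 2 is deliberately absent.
locations_by_id = {
    1: [("United States", "Louisville"), ("India", "Gurgaon")],
    3: [("India", "Gurgaon")],
}


def left_join(left_rows, right_by_key):
    """Return left rows combined with matching right rows, or None padding."""
    result = []
    for key, name, score in left_rows:
        # An unmatched key falls back to a single (None, None) "location".
        for country, city in right_by_key.get(key, [(None, None)]):
            result.append((key, name, score, country, city))
    return result


left_joined = left_join(scores, locations_by_id)
for row in left_joined:
    print(row)
```

Fractal_ID 2 survives with `None` in both location columns, while Fractal_ID 1 is expanded to one row per matching location.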


In [5]:
dfLeftJoin = dfScores.join(dfLocation, on=['Fractal_ID'], how='left')
dfLeftJoin.show()
dfLeftJoin.count()

+----------+-------------------+------+--------------+----------+
|Fractal_ID|               Name|Scores|       Country|      City|
+----------+-------------------+------+--------------+----------+
|         1|     Piyusha Biswas|    99|         India|   Gurgaon|
|         1|     Piyusha Biswas|    99| United States|Louisville|
|         1|     Piyusha Biswas|    89|         India|   Gurgaon|
|         1|     Piyusha Biswas|    89| United States|Louisville|
|         2|Siddhartha Nuthakki|    92|          null|      null|
|         3|     Phani Kompella|    86|         India|   Gurgaon|
|         4|   Gaurav Acharekar|    93|         India|    Mumbai|
|         5|       Shadab Azeem|    96|         India|   Gurgaon|
|         6|       Rachit Sapra|    86|         India|   Gurgaon|
|         7|      Tulika Mittal|    90|United Kingdom|    London|
|         8|          Narhari B|    81|         India| Bengaluru|
|         9|       Akash Saxena|    94|         India|   Gurgaon|
|        1

32

## Right Join

A right join is the opposite of a left join: it returns all the rows from the right dataframe (dfLocation) and the matched rows from the left dataframe (dfScores).

Observe that Fractal IDs 30, 31 and 32, which exist only in dfLocation, are returned here with Name and Scores set to null. Biswasji is again repeated four times.

In [9]:
dfRightJoin = dfScores.join(dfLocation, on=['Fractal_ID'], how='right')
dfRightJoin.show(n=28)
dfRightJoin.count()

+----------+------------------+------+--------------+--------------+
|Fractal_ID|              Name|Scores|       Country|          City|
+----------+------------------+------+--------------+--------------+
|         1|    Piyusha Biswas|    89| United States|    Louisville|
|         1|    Piyusha Biswas|    99| United States|    Louisville|
|         1|    Piyusha Biswas|    89|         India|       Gurgaon|
|         1|    Piyusha Biswas|    99|         India|       Gurgaon|
|         3|    Phani Kompella|    86|         India|       Gurgaon|
|         4|  Gaurav Acharekar|    93|         India|        Mumbai|
|         5|      Shadab Azeem|    96|         India|       Gurgaon|
|         6|      Rachit Sapra|    86|         India|       Gurgaon|
|         7|     Tulika Mittal|    90|United Kingdom|        London|
|         8|         Narhari B|    81|         India|     Bengaluru|
|         9|      Akash Saxena|    94|         India|       Gurgaon|
|        10|       Heba Nomani|   

28

## Outer Join

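An outer join combines the results of the left and right joins, i.e., all the records are returned from both tables, with null filling whichever side has no match.

Observe the order in which the data is returned: unlike the previous outputs, the rows are not sorted by the join column. Spark does not guarantee any row order after a join; add an explicit orderBy if you need one.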
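
The row counts reported for the inner, left, right and outer joins (25, 32, 28 and 35) can be cross-checked with a little arithmetic in plain Python. The sketch assumes that dfScores holds IDs 1 to 29 with ID 1 appearing twice, which is consistent with the counts and outputs shown in this notebook but is not confirmed by the truncated tables.

```python
from collections import Counter

# Assumption: dfScores holds IDs 1..29, with ID 1 appearing twice (30 rows).
missing_from_location = {2, 16, 17, 21, 22, 23, 26}
score_ids = [1] + list(range(1, 30))
# dfLocation: ID 1 twice, seven IDs missing, three extra dummy IDs (26 rows).
location_ids = (
    [1]
    + [i for i in range(1, 30) if i not in missing_from_location]
    + [30, 31, 32]
)

left_counts = Counter(score_ids)
right_counts = Counter(location_ids)

# A key present on both sides yields (left copies x right copies) rows.
matched = sum(
    left_counts[k] * right_counts[k] for k in left_counts if k in right_counts
)
left_only = sum(n for k, n in left_counts.items() if k not in right_counts)
right_only = sum(n for k, n in right_counts.items() if k not in left_counts)

print("inner:", matched)                           # 25
print("left: ", matched + left_only)               # 32
print("right:", matched + right_only)              # 28
print("outer:", matched + left_only + right_only)  # 35
```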

In [11]:
dfOuterJoin = dfScores.join(dfLocation, on=['Fractal_ID'], how='outer')
dfOuterJoin.show(n=35)
dfOuterJoin.count()

+----------+-------------------+------+--------------+--------------+
|Fractal_ID|               Name|Scores|       Country|          City|
+----------+-------------------+------+--------------+--------------+
|        31|               null|  null|         India|        Mumbai|
|        28|      Mithu Goswami|    99|         India|        Mumbai|
|        26|       Versha Singh|    93|          null|          null|
|        27|  Mayank Srivastava|    87| United States|North Carolina|
|        12|   Aishwary Mandloi|    84|         India|     Bengaluru|
|        22|        Muniswamy S|    82|          null|          null|
|         1|     Piyusha Biswas|    99| United States|    Louisville|
|         1|     Piyusha Biswas|    99|         India|       Gurgaon|
|         1|     Piyusha Biswas|    89| United States|    Louisville|
|         1|     Piyusha Biswas|    89|         India|       Gurgaon|
|        13|      Praveen Nagel|    91|         India|       Gurgaon|
|         6|       R

35

## Left Anti Join

Returns those rows from the left dataframe which do not have any match in the right dataframe. You can visualize it as leftovers: (LeftDataFrame - RightDataFrame).

You can use how='leftanti' or how='left_anti'; the two are equivalent. There is no 'right anti' join.

Try switching the places of dfScores and dfLocation. 

In [12]:
dfLeftAntiJoin = dfScores.join(dfLocation, on=['Fractal_ID'], how='leftanti')
dfLeftAntiJoin.show()

+----------+-------------------+------+
|Fractal_ID|               Name|Scores|
+----------+-------------------+------+
|         2|Siddhartha Nuthakki|    92|
|        16|     Affan Mohammed|    88|
|        17|       Ashish Aswal|   100|
|        21|       Madhusudan B|    94|
|        22|        Muniswamy S|    82|
|        23|     Prashant Yadav|    99|
|        26|       Versha Singh|    93|
+----------+-------------------+------+



## Left Semi Join

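A left semi join is like an inner join, except that only the columns of the left dataframe are returned; nothing is returned from the right dataframe. Each left row also appears at most once, no matter how many matches it has on the right, which is why Piyusha Biswas shows up here twice (once per score) and not four times.

You can use how='leftsemi' or how='left_semi'; the two are equivalent. There is no 'right semi' join.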

Try switching the places of dfScores and dfLocation. 
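
One way to think about semi and anti joins is as a plain membership filter on the left rows. The following sketch, in plain Python rather than Spark, uses a handful of rows from the tables above; the variable names are made up for the illustration.

```python
# Plain-Python sketch: semi and anti joins behave like membership filters.
scores = [(1, 99), (1, 89), (2, 92), (3, 86)]   # (Fractal_ID, Scores)
location_ids = [1, 1, 3, 31]                    # ID 1 appears twice on the right

right_keys = set(location_ids)

# Left semi: keep left rows whose key exists on the right (no duplication).
semi = [row for row in scores if row[0] in right_keys]
# Left anti: keep left rows whose key does NOT exist on the right.
anti = [row for row in scores if row[0] not in right_keys]

print(semi)  # [(1, 99), (1, 89), (3, 86)]
print(anti)  # [(2, 92)]
```

Because each left row is only tested for membership, the two location rows for Fractal_ID 1 do not multiply the output, and the semi and anti results together always add up to the original left rows.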


In [14]:
dfLeftSemiJoin = dfScores.join(dfLocation, on=['Fractal_ID'], how='leftsemi')
dfLeftSemiJoin.show(n=23)
dfLeftSemiJoin.count()

+----------+------------------+------+
|Fractal_ID|              Name|Scores|
+----------+------------------+------+
|         1|    Piyusha Biswas|    99|
|         1|    Piyusha Biswas|    89|
|         3|    Phani Kompella|    86|
|         4|  Gaurav Acharekar|    93|
|         5|      Shadab Azeem|    96|
|         6|      Rachit Sapra|    86|
|         7|     Tulika Mittal|    90|
|         8|         Narhari B|    81|
|         9|      Akash Saxena|    94|
|        10|       Heba Nomani|    96|
|        11|     Manish Shukla|    96|
|        12|  Aishwary Mandloi|    84|
|        13|     Praveen Nagel|    91|
|        14|        Viral Jani|    90|
|        15|     Charls Joseph|    81|
|        18|        Santosh Tv|    88|
|        19|       Maaz Ansari|    98|
|        20|  Sangamesh Kalagi|    82|
|        24|         Surya Das|    90|
|        25|Venkata Chippagiri|    96|
|        27| Mayank Srivastava|    87|
|        28|     Mithu Goswami|    99|
|        29|    Sandeep C

23

## Cross join 

A cross join returns the cartesian product of the two dataframes. So if we have "m" rows in the left dataframe and "n" rows in the right dataframe, the cross join returns "m x n" rows.

Since we have 30 rows in dfScores and 26 rows in dfLocation, we receive (30 x 26) = 780 rows.

Avoid doing this on big dataframes. A typical use case might be in machine learning, where you need the cartesian product of two sets to produce every combination.
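
The row count is easy to verify outside Spark. The sketch below uses `itertools.product` from the Python standard library on placeholder rows (the row labels are invented for the example) to show how quickly a cartesian product grows.

```python
from itertools import product

# Placeholder rows standing in for the 30 score rows and 26 location rows.
left_rows = [f"score_row_{i}" for i in range(30)]
right_rows = [f"location_row_{j}" for j in range(26)]

# Every left row is paired with every right row.
pairs = list(product(left_rows, right_rows))

print(len(pairs))  # 780
print(pairs[0])    # ('score_row_0', 'location_row_0')
```

At this size the result is harmless, but two dataframes of a million rows each would already produce a trillion pairs.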

In [34]:
dfCrossJoin = dfScores.crossJoin(dfLocation)

#----- Same functionality, but restricts the columns returned in the new dataframe
#dfCrossJoin = dfScores.crossJoin(dfLocation).select("Name", "Scores", "City")


#-----  Produces the same number of rows - except here you trim the right side of the cartesian product,
#-----  so only the "Country" and "City" columns are returned from dfLocation.
#dfCrossJoin = dfScores.crossJoin(dfLocation.select("Country", "City"))

dfCrossJoin.show()
dfCrossJoin.count()

+----------+--------------+------+--------------+----------+
|Fractal_ID|          Name|Scores|       Country|      City|
+----------+--------------+------+--------------+----------+
|         1|Piyusha Biswas|    99| United States|Louisville|
|         1|Piyusha Biswas|    99|         India|   Gurgaon|
|         1|Piyusha Biswas|    99|         India|   Gurgaon|
|         1|Piyusha Biswas|    99|         India|    Mumbai|
|         1|Piyusha Biswas|    99|         India|   Gurgaon|
|         1|Piyusha Biswas|    99|         India|   Gurgaon|
|         1|Piyusha Biswas|    99|United Kingdom|    London|
|         1|Piyusha Biswas|    99|         India| Bengaluru|
|         1|Piyusha Biswas|    99|         India|   Gurgaon|
|         1|Piyusha Biswas|    99|         India|    Mumbai|
|         1|Piyusha Biswas|    99|         India|   Gurgaon|
|         1|Piyusha Biswas|    99|         India| Bengaluru|
|         1|Piyusha Biswas|    99|         India|   Gurgaon|
|         1|Piyusha Bisw

780


In the commented-out select above, we did not specify which dataframe each column belongs to. What if we had selected "Fractal_ID", which exists in both dataframes? The reference would be ambiguous, and we would get an error, as demonstrated by executing the code below:


In [None]:
dfCrossJoin = dfScores.crossJoin(dfLocation).select("Fractal_ID", "Name", "Scores", "City")

#----- To avoid such an error, qualify the column with its dataframe
#dfCrossJoin = dfScores.crossJoin(dfLocation).select(dfScores.Fractal_ID, "Name", "Scores", "City")

<span style="color:green;"><i>Extra Gyaan: If you call join without providing a join condition, Spark would have to perform a cartesian join. In Spark 2.x, an "AnalysisException" is thrown when the user forgets to give a join condition. To bypass the "AnalysisException", you can set the configuration variable "spark.sql.crossJoin.enabled=true", however ... NEVER DO THAT. This is Spark's way of stopping you from accidentally triggering a cartesian join by not specifying a join condition. Note that from Spark 3.0 onwards this setting defaults to true, so the same call runs as a cross join without raising an error.</i></span>

Try executing the lines below and read the output; on Spark 2.x with default settings it is an error.

In [None]:
dfErrorJoin = dfScores.join(dfLocation)
dfErrorJoin.show()

## Is that all? What if I have to join on more than one column?

Yes, there will be situations where you need to join on more than one column. However, remember that in our syntax the order of the arguments matters. One cannot write, for example:
df1.join(df2, (column_set1), (column_set2), join_type)

All the conditions go into the single "on" argument, as a list in square brackets. To demonstrate that, let us load some sample data.

In [6]:
dfAttempts = sparkSession.read.format('csv') \
            .options(header = True, inferSchema = True, sep = ",", enforceSchema = True,
                ignoreLeadingWhiteSpace = True, ignoreTrailingWhiteSpace = True) \
            .load('../datasets/charityml/dataAttempts.csv')

# Scores for every attempt made by the participants - one row per (Fractal_ID, Attempt).
dfAttemptScores = sparkSession.read.format('csv') \
            .options(header = True, inferSchema = True, sep = ",", enforceSchema = True,
                ignoreLeadingWhiteSpace = True, ignoreTrailingWhiteSpace = True) \
            .load('../datasets/charityml/dataAttemptScores.csv')

In [7]:
dfAttempts.show(n=29)

+----------+-------------------+-------+
|Fractal_ID|               Name|Attempt|
+----------+-------------------+-------+
|         1|     Piyusha Biswas|      1|
|         2|Siddhartha Nuthakki|      2|
|         3|     Phani Kompella|      2|
|         4|   Gaurav Acharekar|      2|
|         5|       Shadab Azeem|      2|
|         6|       Rachit Sapra|      2|
|         7|      Tulika Mittal|      1|
|         8|          Narhari B|      1|
|         9|       Akash Saxena|      2|
|        10|        Heba Nomani|      1|
|        11|      Manish Shukla|      2|
|        12|   Aishwary Mandloi|      1|
|        13|      Praveen Nagel|      3|
|        14|         Viral Jani|      3|
|        15|      Charls Joseph|      2|
|        16|     Affan Mohammed|      1|
|        17|       Ashish Aswal|      3|
|        18|         Santosh Tv|      2|
|        19|        Maaz Ansari|      1|
|        20|   Sangamesh Kalagi|      3|
|        21|       Madhusudan B|      1|
|        22|    

In [81]:
dfAttemptScores.show(n=48)

+----------+-------+-----+
|Fractal_ID|Attempt|Score|
+----------+-------+-----+
|         1|      1|   99|
|         2|      1|   64|
|         2|      2|   92|
|         3|      1|   67|
|         3|      2|   86|
|         4|      1|   72|
|         4|      2|   93|
|         5|      1|   71|
|         5|      2|   96|
|         6|      1|   68|
|         6|      2|   86|
|         7|      1|   90|
|         8|      1|   81|
|         9|      1|   72|
|         9|      2|   94|
|        10|      1|   96|
|        11|      1|   61|
|        11|      2|   96|
|        12|      1|   84|
|        13|      1|   72|
|        13|      2|   72|
|        13|      3|   91|
|        14|      1|   72|
|        14|      2|   71|
|        14|      3|   90|
|        15|      1|   74|
|        15|      2|   81|
|        16|      1|   88|
|        17|      1|   61|
|        17|      2|   68|
|        17|      3|  100|
|        18|      1|   64|
|        18|      2|   88|
|        19|      1|   98|
|

The output we want is a dataframe with the score from each participant's final attempt:

In [85]:
dfFinalScores = dfAttempts.join(dfAttemptScores, [dfAttempts.Fractal_ID == dfAttemptScores.Fractal_ID,
                                           dfAttempts.Attempt == dfAttemptScores.Attempt], 
                                           how='inner').select(dfAttempts.Fractal_ID, 
                                            "Name", dfAttempts.Attempt, "Score")

#---- To make the above code more readable - we can rewrite our code as 
#condition = [dfAttempts.Fractal_ID == dfAttemptScores.Fractal_ID,
#              dfAttempts.Attempt == dfAttemptScores.Attempt]
#dfFinalScores = dfAttempts.join(dfAttemptScores, condition, "inner") \
#                .select(dfAttempts.Fractal_ID, "Name", dfAttempts.Attempt, "Score")

dfFinalScores.show(n=29)
dfFinalScores.count()

+----------+-------------------+-------+-----+
|Fractal_ID|               Name|Attempt|Score|
+----------+-------------------+-------+-----+
|         1|     Piyusha Biswas|      1|   99|
|         2|Siddhartha Nuthakki|      2|   92|
|         3|     Phani Kompella|      2|   86|
|         4|   Gaurav Acharekar|      2|   93|
|         5|       Shadab Azeem|      2|   96|
|         6|       Rachit Sapra|      2|   86|
|         7|      Tulika Mittal|      1|   90|
|         8|          Narhari B|      1|   81|
|         9|       Akash Saxena|      2|   94|
|        10|        Heba Nomani|      1|   96|
|        11|      Manish Shukla|      2|   96|
|        12|   Aishwary Mandloi|      1|   84|
|        13|      Praveen Nagel|      3|   91|
|        14|         Viral Jani|      3|   90|
|        15|      Charls Joseph|      2|   81|
|        16|     Affan Mohammed|      1|   88|
|        17|       Ashish Aswal|      3|  100|
|        18|         Santosh Tv|      2|   88|
|        19| 

29
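
The two-column condition can be mimicked in plain Python by using a tuple as the lookup key. This is only a sketch of the idea, with a few rows copied from dfAttempts and dfAttemptScores above; the variable names are illustrative and it says nothing about how Spark actually executes the join.

```python
# Plain-Python sketch of an inner join on two columns (Fractal_ID, Attempt).
attempts = [
    (1, "Piyusha Biswas", 1),
    (2, "Siddhartha Nuthakki", 2),
    (13, "Praveen Nagel", 3),
]
attempt_scores = [
    (1, 1, 99),
    (2, 1, 64), (2, 2, 92),
    (13, 1, 72), (13, 2, 72), (13, 3, 91),
]

# Index the right side by the composite key.
score_by_key = {(fid, attempt): score for fid, attempt, score in attempt_scores}

# Both parts of the key must match for a row to be kept.
final_scores = [
    (fid, name, attempt, score_by_key[(fid, attempt)])
    for fid, name, attempt in attempts
    if (fid, attempt) in score_by_key
]

for row in final_scores:
    print(row)
```

Matching on Fractal_ID alone would return every attempt for each participant; adding Attempt to the key narrows the result to the one row per participant seen in the output above.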

## What if I want to join more than two dataframes?

Use method chaining.

In [71]:
dfThreeDf = dfAttempts.join(dfAttemptScores, ["Fractal_ID","Attempt"]) \
                .join(dfLocation, "Fractal_ID", "left") \
                .select(dfAttempts.Fractal_ID, "Name", dfAttempts.Attempt, "Score", "Country", "City")


dfThreeDf.show()
dfThreeDf.count()

+----------+-------------------+-------+-----+--------------+----------+
|Fractal_ID|               Name|Attempt|Score|       Country|      City|
+----------+-------------------+-------+-----+--------------+----------+
|         1|     Piyusha Biswas|      1|   99|         India|   Gurgaon|
|         1|     Piyusha Biswas|      1|   99| United States|Louisville|
|         2|Siddhartha Nuthakki|      2|   92|          null|      null|
|         3|     Phani Kompella|      2|   86|         India|   Gurgaon|
|         4|   Gaurav Acharekar|      2|   93|         India|    Mumbai|
|         5|       Shadab Azeem|      2|   96|         India|   Gurgaon|
|         6|       Rachit Sapra|      2|   86|         India|   Gurgaon|
|         7|      Tulika Mittal|      1|   90|United Kingdom|    London|
|         8|          Narhari B|      1|   81|         India| Bengaluru|
|         9|       Akash Saxena|      2|   94|         India|   Gurgaon|
|        10|        Heba Nomani|      1|   96|     

30

In the previous result, Piyusha Biswas is repeated twice, because there are two entries for Piyusha Biswas in dfLocation.

There are more complex combinations with other Spark functions; we will demonstrate them in the chapter "Bringing it all together", once you are settled with the basics. We will also discuss shuffle operations and how to optimize them in the chapter "Optimizing Spark".

To give you an idea:
- What if you want only the first match for Piyusha Biswas and want to ignore the subsequent records (or only the last one, ideally sorted by, say, Date, so that you get the latest record)? Hint: first get the output of the join, and then apply an operation that keeps the record with the max Date, the first match, or similar.
- What if you want the first-attempt scores of all the participants? Hint: join on Fractal_ID only, and then take the min Attempt or filter on Attempt == 1.
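
The "join first, then pick one record per key" idea behind both hints can be sketched in plain Python. The rows are copied from dfAttemptScores above; the dictionary names are invented for the illustration, and in Spark you would express the same step with an aggregation or a filter rather than a loop.

```python
# Plain-Python sketch: after a join on Fractal_ID only, keep one row per ID.
attempt_scores = [
    (2, 1, 64), (2, 2, 92),
    (13, 1, 72), (13, 2, 72), (13, 3, 91),
]  # (Fractal_ID, Attempt, Score)

# Latest attempt per participant: keep the row with the highest Attempt.
latest = {}
for fid, attempt, score in attempt_scores:
    if fid not in latest or attempt > latest[fid][0]:
        latest[fid] = (attempt, score)

# First attempt per participant: simply filter on Attempt == 1.
first_attempt = {fid: score for fid, attempt, score in attempt_scores if attempt == 1}

print(latest)         # {2: (2, 92), 13: (3, 91)}
print(first_attempt)  # {2: 64, 13: 72}
```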

Let us look at an example where we get all the scores. Once we have this dataset, we can find the average score, the first-attempt score, or any other measure.


In [77]:
dfAllScores = dfAttempts.join(dfAttemptScores, on=['Fractal_ID']) \
              .select(dfAttempts.Fractal_ID, dfAttempts.Name,
                      dfAttemptScores.Attempt, dfAttemptScores.Score)
dfAllScores.show()
dfAllScores.count()

+----------+-------------------+-------+-----+
|Fractal_ID|               Name|Attempt|Score|
+----------+-------------------+-------+-----+
|         1|     Piyusha Biswas|      1|   99|
|         2|Siddhartha Nuthakki|      2|   92|
|         2|Siddhartha Nuthakki|      1|   64|
|         3|     Phani Kompella|      2|   86|
|         3|     Phani Kompella|      1|   67|
|         4|   Gaurav Acharekar|      2|   93|
|         4|   Gaurav Acharekar|      1|   72|
|         5|       Shadab Azeem|      2|   96|
|         5|       Shadab Azeem|      1|   71|
|         6|       Rachit Sapra|      2|   86|
|         6|       Rachit Sapra|      1|   68|
|         7|      Tulika Mittal|      1|   90|
|         8|          Narhari B|      1|   81|
|         9|       Akash Saxena|      2|   94|
|         9|       Akash Saxena|      1|   72|
|        10|        Heba Nomani|      1|   96|
|        11|      Manish Shukla|      2|   96|
|        11|      Manish Shukla|      1|   61|
|        12| 

48

Method chaining is good; however, it is recommended that in the beginning you break your problem into multiple steps, to ensure that the data being returned is as per your expectations. Once you have verified the final output, you can refactor the code and clean it up.

A word on method chaining: it might not give you any performance gain. Spark evaluates lazily: it does not perform any processing for "transformation" steps until it encounters an "action" step such as ".show()". At that point Spark looks at all the previous steps and optimizes the execution for you. We will discuss method chaining further in the chapter "Optimizing Spark".
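
If lazy evaluation is new to you, Python generators offer a rough analogy. The sketch below is only an analogy under that assumption: Spark additionally builds and optimizes a query plan, which generators do not do. The function `read_rows` and the `log` list are invented for the illustration.

```python
# Analogy only: Python generators also defer work until a result is requested.
log = []


def read_rows():
    """Pretend data source that records when each row is actually read."""
    for fid in (1, 2, 3):
        log.append(f"read {fid}")
        yield fid


# "Transformations": building the pipeline runs nothing yet.
filtered = (fid for fid in read_rows() if fid != 2)
doubled = (fid * 2 for fid in filtered)
steps_before_action = list(log)  # snapshot of the log: still empty

# "Action": consuming the pipeline triggers all the deferred work.
result = list(doubled)

print(steps_before_action)  # []
print(result)               # [2, 6]
print(log)                  # ['read 1', 'read 2', 'read 3']
```

Whether the pipeline is written as one chained expression or as separate named steps, nothing runs until the final `list(...)` call, which mirrors why chaining by itself does not change performance.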

For now, let us move to the next notebook - Aggregations and GroupBys. 