# PySpark YouTube Course - Lesson 03

This notebook contains all the materials and notes from the PySpark Course: 'PySpark - Aula 03 - Union / Joins / When - Otherwise / Collect' by DataDev Academy, available on YouTube. You can find the video [here](https://www.youtube.com/watch?v=EoI3XwxCkfI).

## Summary

- [PySpark YouTube Course - Lesson 03](#pyspark-youtube-course---lesson-03)
  - [Summary](#summary)
  - [Importing PySpark](#importing-pyspark)
  - [Starting PySpark Session](#starting-pyspark-session)
  - [Reading a File](#reading-a-file)
  - [Displaying the DataFrame](#displaying-the-dataframe)
  - [PySpark 01 Changes](#pyspark-01-changes)
  - [PySpark 02 Changes](#pyspark-02-changes)
  - [Distinct](#distinct)
  - [Collect](#collect)
  - [When / Otherwise](#when--otherwise)
  - [Union / Concat](#union--concat)
  - [Joins](#joins)
    - [Simple Join](#simple-join)
    - [Inner Join](#inner-join)
    - [Left Join](#left-join)
    - [Right Join](#right-join)
    - [Full Join](#full-join)
    - [Semi Join](#semi-join)
    - [Anti Join](#anti-join)

## Importing PySpark

In [39]:
# Importing PySpark modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window  # Importing window functions

## Starting PySpark Session

In [40]:
# Creating a Spark session
# - master: defines the cluster manager, 'local' means running on a local machine
# - appName: sets the name of your application
spark = SparkSession.builder \
    .master("local") \
    .appName("pyspark-03") \
    .getOrCreate()

## Reading a File

In [41]:
# Reading a CSV file into a DataFrame
# - path: specifies the location of the file
# - header: indicates if the file contains a header row
# - inferSchema: automatically infers the data types of the columns
df = spark.read.csv("./data/wc2018-players.csv", header=True, inferSchema=True)

## Displaying the DataFrame

In [42]:
# - show(5): shows the first 5 rows of the DataFrame
df.show(5)

+---------+---+----+------------------+----------+----------+--------------------+------+------+
|     Team|  #|Pos.| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|
+---------+---+----+------------------+----------+----------+--------------------+------+------+
|Argentina|  3|  DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|
|Argentina| 22|  MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|
|Argentina| 15|  MF|    LANZINI Manuel|15.02.1993|   LANZINI|West Ham United F...|   167|    66|
|Argentina| 18|  DF|    SALVIO Eduardo|13.07.1990|    SALVIO|    SL Benfica (POR)|   167|    69|
|Argentina| 10|  FW|      MESSI Lionel|24.06.1987|     MESSI|  FC Barcelona (ESP)|   170|    72|
+---------+---+----+------------------+----------+----------+--------------------+------+------+
only showing top 5 rows



## PySpark 01 Changes

In [43]:
# Renaming columns in the DataFrame
# - withColumnRenamed("oldName", "newName"): renames a column from oldName to newName
df = df.withColumnRenamed("Pos.", "Position") \
       .withColumnRenamed("#", "Number")

In [44]:
# Extracting and converting components from the "Birth Date" column
df = df.withColumn("Day", split(col("Birth Date"), "\\.").getItem(0).cast(IntegerType())) \
       .withColumn("Month", split(col("Birth Date"), "\\.").getItem(1).cast(IntegerType())) \
       .withColumn("Year", split(col("Birth Date"), "\\.").getItem(2).cast(IntegerType()))
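
The second argument of `split` is a regular expression, which is why the dot is escaped as `\\.`. As a rough plain-Python sketch of what the three expressions compute for a single value (`re.split` stands in for Spark's `split`, and `parse_birth_date` is a hypothetical helper that is not part of the notebook):

```python
import re

def parse_birth_date(text):
    """Split a 'DD.MM.YYYY' string on literal dots and cast each part to int."""
    # An unescaped "." would be a regex wildcard; r"\." matches a literal dot.
    day, month, year = re.split(r"\.", text)
    return int(day), int(month), int(year)

print(parse_birth_date("31.08.1992"))  # (31, 8, 1992)
```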

## PySpark 02 Changes

In [45]:
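# Keeping a reference to the current DataFrame under a second name
# - df2 = df: binds the name df2 to the same DataFrame object; no data is copied.
#   DataFrames are immutable, so df2 still includes "Birth Date" after df is reassigned below
df2 = df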

# Dropping the "Birth Date" column from the DataFrame
# - drop(col("Birth Date")): removes the "Birth Date" column from the DataFrame
df = df.drop(col("Birth Date"))
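
An assignment such as `df2 = df` only binds a second name to the same object. `df2` still has the "Birth Date" column afterwards because `drop` returns a new DataFrame instead of modifying the existing one. A plain-Python sketch of the same name-binding behavior, with a tuple of column names standing in for the immutable DataFrame (an illustration only, no Spark involved):

```python
# A tuple is immutable, like a DataFrame: "changing" it means building a new one.
columns = ("Team", "Birth Date", "Club")

columns_backup = columns           # binds a second name; nothing is copied
assert columns_backup is columns   # both names refer to the same object

# Rebinding `columns` to a new tuple leaves the object behind `columns_backup` untouched.
columns = tuple(c for c in columns if c != "Birth Date")

print(columns)         # ('Team', 'Club')
print(columns_backup)  # ('Team', 'Birth Date', 'Club')
```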

## Distinct

In [46]:
# Select the "Team" column and display distinct values
# - select(col("Team")): selects the "Team" column from the DataFrame
# - distinct(): returns a new DataFrame containing only distinct (unique) rows from the selected column
# - show(50): displays the first 50 distinct values from the "Team" column
df.select(col("Team")).distinct().show(50)

+--------------+
|          Team|
+--------------+
|        Russia|
|       Senegal|
|        Sweden|
|       IR Iran|
|       Germany|
|        France|
|     Argentina|
|       Belgium|
|          Peru|
|       Croatia|
|       Nigeria|
|Korea Republic|
|         Spain|
|       Denmark|
|       Morocco|
|        Panama|
|       Iceland|
|       Uruguay|
|        Mexico|
|       Tunisia|
|  Saudi Arabia|
|   Switzerland|
|        Brazil|
|         Japan|
|       England|
|        Poland|
|      Portugal|
|     Australia|
|    Costa Rica|
|         Egypt|
|        Serbia|
|      Colombia|
+--------------+



## Collect

In [47]:
# Select the "Team" column and collect distinct values into a list
# - select(col("Team")): selects the "Team" column from the DataFrame
# - distinct(): returns a new DataFrame containing only distinct (unique) rows from the selected column
# - collect(): returns all rows to the driver as a Python list of Row objects (use it only on small results)
# Note: the variable name `list` shadows Python's built-in list() for the rest of the session
list = df.select(col("Team")).distinct().collect()

# Display the list of distinct teams
list

[Row(Team='Russia'),
 Row(Team='Senegal'),
 Row(Team='Sweden'),
 Row(Team='IR Iran'),
 Row(Team='Germany'),
 Row(Team='France'),
 Row(Team='Argentina'),
 Row(Team='Belgium'),
 Row(Team='Peru'),
 Row(Team='Croatia'),
 Row(Team='Nigeria'),
 Row(Team='Korea Republic'),
 Row(Team='Spain'),
 Row(Team='Denmark'),
 Row(Team='Morocco'),
 Row(Team='Panama'),
 Row(Team='Iceland'),
 Row(Team='Uruguay'),
 Row(Team='Mexico'),
 Row(Team='Tunisia'),
 Row(Team='Saudi Arabia'),
 Row(Team='Switzerland'),
 Row(Team='Brazil'),
 Row(Team='Japan'),
 Row(Team='England'),
 Row(Team='Poland'),
 Row(Team='Portugal'),
 Row(Team='Australia'),
 Row(Team='Costa Rica'),
 Row(Team='Egypt'),
 Row(Team='Serbia'),
 Row(Team='Colombia')]

In [48]:
# Access the first element in the list of distinct teams
# - list[0]: retrieves the first element of the list, which is a Row object containing the first distinct team
list[0]

Row(Team='Russia')

In [49]:
# Access the first value within the first Row object
# - list[0][0]: retrieves the first value from the first Row object in the list. This value is the distinct team name.
list[0][0]

'Russia'

In [50]:
# Check the type of the value retrieved
# - type(list[0][0]): returns the data type of the value from the first Row object
type(list[0][0])

str

In [51]:
# Create an empty list to store team names
country = []

# Iterate over the list of distinct teams and append each team name to the 'country' list
# - for i in list: iterates through each Row object in the list
# - country.append(i[0]): appends the first value (team name) from each Row object to the 'country' list
for i in list:
    country.append(i[0])

# Display the list of team names
country

['Russia',
 'Senegal',
 'Sweden',
 'IR Iran',
 'Germany',
 'France',
 'Argentina',
 'Belgium',
 'Peru',
 'Croatia',
 'Nigeria',
 'Korea Republic',
 'Spain',
 'Denmark',
 'Morocco',
 'Panama',
 'Iceland',
 'Uruguay',
 'Mexico',
 'Tunisia',
 'Saudi Arabia',
 'Switzerland',
 'Brazil',
 'Japan',
 'England',
 'Poland',
 'Portugal',
 'Australia',
 'Costa Rica',
 'Egypt',
 'Serbia',
 'Colombia']
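
The loop above can also be written as a list comprehension. The sketch below uses a `namedtuple` as a stand-in for Spark's `Row`, which likewise supports access by position and by attribute name. `rows` is a hypothetical three-element sample, not the real `collect()` result:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row: a tuple whose fields can also be read by name.
Row = namedtuple("Row", ["Team"])
rows = [Row("Russia"), Row("Senegal"), Row("Sweden")]

# One expression replaces the explicit loop with append().
countries = [row[0] for row in rows]
# Reading the field by name is equivalent and more self-describing.
countries_by_name = [row.Team for row in rows]

print(countries)  # ['Russia', 'Senegal', 'Sweden']
```

Keep in mind that `collect()` brings every row to the driver, so it is best reserved for small results such as this list of teams.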

## When / Otherwise

In [52]:
# Add a new column "New Column" with conditional values
# - when(col("Team") == "Argentina", "ARG"): sets the value to "ARG" if the "Team" column is "Argentina"
# - otherwise("NOT ARG"): sets the value to "NOT ARG" for all other cases
# - show(50): displays the first 50 rows of the DataFrame with the new column
df.withColumn("New Column", when(col("Team") == "Argentina", "ARG").otherwise("NOT ARG")).show(50)

+---------+------+--------+------------------+----------+--------------------+------+------+---+-----+----+----------+
|     Team|Number|Position| FIFA Popular Name|Shirt Name|                Club|Height|Weight|Day|Month|Year|New Column|
+---------+------+--------+------------------+----------+--------------------+------+------+---+-----+----+----------+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|TAGLIAFICO|      AFC Ajax (NED)|   169|    65| 31|    8|1992|       ARG|
|Argentina|    22|      MF|    PAVON Cristian|     PAVÓN|CA Boca Juniors (...|   169|    65| 21|    1|1996|       ARG|
|Argentina|    15|      MF|    LANZINI Manuel|   LANZINI|West Ham United F...|   167|    66| 15|    2|1993|       ARG|
|Argentina|    18|      DF|    SALVIO Eduardo|    SALVIO|    SL Benfica (POR)|   167|    69| 13|    7|1990|       ARG|
|Argentina|    10|      FW|      MESSI Lionel|     MESSI|  FC Barcelona (ESP)|   170|    72| 24|    6|1987|       ARG|
|Argentina|     4|      DF|  ANSALDI Cristian|  

In [53]:
# Define lists of team names for different continents
EU = ["Sweden", "Germany", "France", "Belgium", "Croatia", "Spain", "Denmark", "Iceland", "Switzerland", "England", "Poland", "Portugal", "Serbia"]
AS = ["Russia", "IR Iran", "Korea Republic", "Saudi Arabia", "Japan"]
AF = ["Senegal", "Morocco", "Tunisia", "Egypt", "Nigeria"]
OC = ["Australia"]
NA = ["Panama", "Mexico", "Costa Rica"]
SA = ["Argentina", "Peru", "Uruguay", "Brazil", "Colombia"]

In [54]:
# Add a new column "Continent" with continent names based on the team
# - when(col("Team").isin(EU), "EU"): sets the continent to "EU" if the team is in the EU list
# - when(col("Team").isin(AS), "AS"): sets the continent to "AS" if the team is in the AS list
# - when(col("Team").isin(AF), "AF"): sets the continent to "AF" if the team is in the AF list
# - when(col("Team").isin(OC), "OC"): sets the continent to "OC" if the team is in the OC list
# - when(col("Team").isin(NA), "NA"): sets the continent to "NA" if the team is in the NA list
# - when(col("Team").isin(SA), "SA"): sets the continent to "SA" if the team is in the SA list
# - otherwise("CHECK"): sets the continent to "CHECK" if the team does not match any of the lists
# - show(50): displays the first 50 rows of the DataFrame with the new "Continent" column
df = df.withColumn("Continent", when(col("Team").isin(EU), "EU") \
                               .when(col("Team").isin(AS), "AS") \
                               .when(col("Team").isin(AF), "AF") \
                               .when(col("Team").isin(OC), "OC") \
                               .when(col("Team").isin(NA), "NA") \
                               .when(col("Team").isin(SA), "SA") \
                               .otherwise("CHECK"))
df.show(50)

+---------+------+--------+------------------+----------+--------------------+------+------+---+-----+----+---------+
|     Team|Number|Position| FIFA Popular Name|Shirt Name|                Club|Height|Weight|Day|Month|Year|Continent|
+---------+------+--------+------------------+----------+--------------------+------+------+---+-----+----+---------+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|TAGLIAFICO|      AFC Ajax (NED)|   169|    65| 31|    8|1992|       SA|
|Argentina|    22|      MF|    PAVON Cristian|     PAVÓN|CA Boca Juniors (...|   169|    65| 21|    1|1996|       SA|
|Argentina|    15|      MF|    LANZINI Manuel|   LANZINI|West Ham United F...|   167|    66| 15|    2|1993|       SA|
|Argentina|    18|      DF|    SALVIO Eduardo|    SALVIO|    SL Benfica (POR)|   167|    69| 13|    7|1990|       SA|
|Argentina|    10|      FW|      MESSI Lionel|     MESSI|  FC Barcelona (ESP)|   170|    72| 24|    6|1987|       SA|
|Argentina|     4|      DF|  ANSALDI Cristian|   ANSALDI

In [55]:
# Filter the DataFrame to include only rows where the "Continent" column is "CHECK"
# - where(col("Continent") == "CHECK"): selects rows where the continent is marked as "CHECK"
# - show(): displays the filtered rows
df.where(col("Continent") == "CHECK").show()

+----+------+--------+-----------------+----------+----+------+------+---+-----+----+---------+
|Team|Number|Position|FIFA Popular Name|Shirt Name|Club|Height|Weight|Day|Month|Year|Continent|
+----+------+--------+-----------------+----------+----+------+------+---+-----+----+---------+
+----+------+--------+-----------------+----------+----+------+------+---+-----+----+---------+
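
A chain of `when` calls is evaluated in order and the first matching condition wins, while `otherwise` supplies the fallback. A plain-Python sketch of that logic, using shortened hypothetical lists and a hypothetical `continent_of` helper:

```python
# Shortened, hypothetical versions of the continent lists.
EU_SAMPLE = ["Sweden", "Germany"]
SA_SAMPLE = ["Argentina", "Brazil"]

def continent_of(team):
    """Mimic when(...).when(...).otherwise("CHECK"): the first match wins."""
    rules = [(EU_SAMPLE, "EU"), (SA_SAMPLE, "SA")]
    for teams, code in rules:
        if team in teams:      # corresponds to col("Team").isin(teams)
            return code
    return "CHECK"             # corresponds to otherwise("CHECK")

print(continent_of("Brazil"))    # SA
print(continent_of("Unlisted"))  # CHECK
```

An empty result for the `CHECK` filter therefore means that every team was covered by one of the lists.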



## Union / Concat

In [56]:
# Filter the DataFrame to include only rows where the "Continent" column is "SA" (South America)
df_sa = df.where(col("Continent") == "SA")

# Select distinct teams from the filtered South American DataFrame
df_sa.select(col("Team")).distinct().show()

+---------+
|     Team|
+---------+
|Argentina|
|     Peru|
|  Uruguay|
|   Brazil|
| Colombia|
+---------+



In [57]:
# Filter the DataFrame to include only rows where the "Continent" column is "NA" (North America)
df_na = df.where(col("Continent") == "NA")

# Select distinct teams from the filtered North American DataFrame
df_na.select(col("Team")).distinct().show()

+----------+
|      Team|
+----------+
|    Panama|
|    Mexico|
|Costa Rica|
+----------+



In [58]:
# Combine the South American and North American DataFrames into a single DataFrame
# - union(df_na): appends the rows of df_na to the rows of df_sa (duplicate rows are kept)
#   Note: union() matches columns by position, not by name, so both DataFrames need the same
#   number of columns in the same order with compatible types; unionByName() matches by name instead
df_america = df_sa.union(df_na)

# Select distinct teams from the combined DataFrame of South American and North American teams
df_america.select(col("Team")).distinct().show()

+----------+
|      Team|
+----------+
| Argentina|
|      Peru|
|   Uruguay|
|    Brazil|
|  Colombia|
|    Panama|
|    Mexico|
|Costa Rica|
+----------+
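
`union` lines columns up by position rather than by name. Here that is harmless because `df_sa` and `df_na` come from the same DataFrame, but it matters when the column order differs. A plain-Python sketch with tuples as rows and two hypothetical helpers, `union_by_position` and `union_by_name` (Spark's counterpart of the latter is `unionByName`):

```python
def union_by_position(rows_a, rows_b):
    """Concatenate rows as-is: the i-th value lands in the i-th column."""
    return rows_a + rows_b

def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Reorder the second table's values to the first table's column order."""
    order = [cols_b.index(name) for name in cols_a]
    return rows_a + [tuple(row[i] for i in order) for row in rows_b]

cols_a, rows_a = ["Team", "Number"], [("Argentina", 10)]
cols_b, rows_b = ["Number", "Team"], [(9, "Mexico")]  # same columns, different order

print(union_by_position(rows_a, rows_b))              # [('Argentina', 10), (9, 'Mexico')]
print(union_by_name(cols_a, rows_a, cols_b, rows_b))  # [('Argentina', 10), ('Mexico', 9)]
```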



In [59]:
# Display the first 100 rows of the combined DataFrame
# - show(100): displays the first 100 rows of df_america
df_america.show(100)

+---------+------+--------+--------------------+-------------+--------------------+------+------+---+-----+----+---------+
|     Team|Number|Position|   FIFA Popular Name|   Shirt Name|                Club|Height|Weight|Day|Month|Year|Continent|
+---------+------+--------+--------------------+-------------+--------------------+------+------+---+-----+----+---------+
|Argentina|     3|      DF|  TAGLIAFICO Nicolas|   TAGLIAFICO|      AFC Ajax (NED)|   169|    65| 31|    8|1992|       SA|
|Argentina|    22|      MF|      PAVON Cristian|        PAVÓN|CA Boca Juniors (...|   169|    65| 21|    1|1996|       SA|
|Argentina|    15|      MF|      LANZINI Manuel|      LANZINI|West Ham United F...|   167|    66| 15|    2|1993|       SA|
|Argentina|    18|      DF|      SALVIO Eduardo|       SALVIO|    SL Benfica (POR)|   167|    69| 13|    7|1990|       SA|
|Argentina|    10|      FW|        MESSI Lionel|        MESSI|  FC Barcelona (ESP)|   170|    72| 24|    6|1987|       SA|
|Argentina|     

## Joins

In [60]:
# Display the first 5 rows of the DataFrame
# - show(5): displays the first 5 rows of df to provide a quick overview of the data
df.show(5)

+---------+------+--------+------------------+----------+--------------------+------+------+---+-----+----+---------+
|     Team|Number|Position| FIFA Popular Name|Shirt Name|                Club|Height|Weight|Day|Month|Year|Continent|
+---------+------+--------+------------------+----------+--------------------+------+------+---+-----+----+---------+
|Argentina|     3|      DF|TAGLIAFICO Nicolas|TAGLIAFICO|      AFC Ajax (NED)|   169|    65| 31|    8|1992|       SA|
|Argentina|    22|      MF|    PAVON Cristian|     PAVÓN|CA Boca Juniors (...|   169|    65| 21|    1|1996|       SA|
|Argentina|    15|      MF|    LANZINI Manuel|   LANZINI|West Ham United F...|   167|    66| 15|    2|1993|       SA|
|Argentina|    18|      DF|    SALVIO Eduardo|    SALVIO|    SL Benfica (POR)|   167|    69| 13|    7|1990|       SA|
|Argentina|    10|      FW|      MESSI Lionel|     MESSI|  FC Barcelona (ESP)|   170|    72| 24|    6|1987|       SA|
+---------+------+--------+------------------+----------

In [61]:
# Filter the DataFrame to include only rows where the "Team" column is "Argentina"
# - where(col("Team") == "Argentina"): selects rows where the team is "Argentina"
arg = df.where(col("Team") == "Argentina")

# Filter the DataFrame to include only rows where the "Team" column is "Brazil"
# - where(col("Team") == "Brazil"): selects rows where the team is "Brazil"
bra = df.where(col("Team") == "Brazil")

In [62]:
# Drop the columns that are not needed for the join examples from the Argentina DataFrame
# - drop(...): returns a new DataFrame without the listed columns
arg = arg.drop(col("Club"), col("FIFA Popular Name"), col("Day"), col("Month"), col("Year"), col("Continent"), col("Weight"))

# Drop the same columns from the Brazil DataFrame
bra = bra.drop(col("Club"), col("FIFA Popular Name"), col("Day"), col("Month"), col("Year"), col("Continent"), col("Weight"))


In [63]:
# Display the first 5 rows of the filtered Argentina DataFrame with dropped columns
# - show(5): displays the first 5 rows of arg to provide a view of the data after column removal
arg.show(5)

+---------+------+--------+----------+------+
|     Team|Number|Position|Shirt Name|Height|
+---------+------+--------+----------+------+
|Argentina|     3|      DF|TAGLIAFICO|   169|
|Argentina|    22|      MF|     PAVÓN|   169|
|Argentina|    15|      MF|   LANZINI|   167|
|Argentina|    18|      DF|    SALVIO|   167|
|Argentina|    10|      FW|     MESSI|   170|
+---------+------+--------+----------+------+
only showing top 5 rows



In [64]:
# Display the first 5 rows of the filtered Brazil DataFrame with dropped columns
# - show(5): displays the first 5 rows of bra to provide a view of the data after column removal
bra.show(5)

+------+------+--------+-----------+------+
|  Team|Number|Position| Shirt Name|Height|
+------+------+--------+-----------+------+
|Brazil|    18|      MF|       FRED|   169|
|Brazil|    21|      FW|     TAISON|   172|
|Brazil|    17|      MF|FERNANDINHO|   179|
|Brazil|    22|      DF|     FAGNER|   168|
|Brazil|    10|      FW|  NEYMAR JR|   175|
+------+------+--------+-----------+------+
only showing top 5 rows



In [65]:
# Count the number of rows in the DataFrame filtered for "Argentina"
# - count(): returns the total number of rows in the DataFrame filtered for "Argentina"
arg.count()

23

In [66]:
# Count the number of rows in the DataFrame filtered for "Brazil"
# - count(): returns the total number of rows in the DataFrame filtered for "Brazil"
bra.count()


23

### Simple Join

A join in which every row finds a match. When no join type is given, `join` performs an inner join.

In [67]:
# Perform an inner join between the "arg" DataFrame (Argentina) and the "bra" DataFrame (Brazil)
# - join(bra, arg.Number == bra.Number): joins the two DataFrames on the "Number" column, matching rows where "Number" values are equal
# - new_df: stores the resulting DataFrame after the join
# - show(23): displays the first 23 rows of the joined DataFrame to give an overview of the results
new_df = arg.join(bra, arg.Number == bra.Number)
new_df.show(23)

+---------+------+--------+----------+------+------+------+--------+-----------+------+
|     Team|Number|Position|Shirt Name|Height|  Team|Number|Position| Shirt Name|Height|
+---------+------+--------+----------+------+------+------+--------+-----------+------+
|Argentina|     3|      DF|TAGLIAFICO|   169|Brazil|     3|      DF|    MIRANDA|   186|
|Argentina|    22|      MF|     PAVÓN|   169|Brazil|    22|      DF|     FAGNER|   168|
|Argentina|    15|      MF|   LANZINI|   167|Brazil|    15|      MF|   PAULINHO|   181|
|Argentina|    18|      DF|    SALVIO|   167|Brazil|    18|      MF|       FRED|   169|
|Argentina|    10|      FW|     MESSI|   170|Brazil|    10|      FW|  NEYMAR JR|   175|
|Argentina|     4|      DF|   ANSALDI|   181|Brazil|     4|      DF|    GEROMEL|   190|
|Argentina|     5|      MF|    BIGLIA|   175|Brazil|     5|      MF|   CASEMIRO|   185|
|Argentina|     7|      MF|    BANEGA|   175|Brazil|     7|      FW|   D. COSTA|   182|
|Argentina|    14|      DF|MASCH

In [68]:
# Increment the "Number" column by 1 in the "arg" DataFrame (Argentina) so that not every row matches anymore
# - withColumn("Number", col("Number") + 1): replaces the existing "Number" column with each value incremented by 1
# - show(23): displays the first 23 rows of the updated "arg" DataFrame to show the changes
arg = arg.withColumn("Number", col("Number") + 1)
arg.show(23)

+---------+------+--------+----------+------+
|     Team|Number|Position|Shirt Name|Height|
+---------+------+--------+----------+------+
|Argentina|     4|      DF|TAGLIAFICO|   169|
|Argentina|    23|      MF|     PAVÓN|   169|
|Argentina|    16|      MF|   LANZINI|   167|
|Argentina|    19|      DF|    SALVIO|   167|
|Argentina|    11|      FW|     MESSI|   170|
|Argentina|     5|      DF|   ANSALDI|   181|
|Argentina|     6|      MF|    BIGLIA|   175|
|Argentina|     8|      MF|    BANEGA|   175|
|Argentina|    15|      DF|MASCHERANO|   174|
|Argentina|    22|      FW|    DYBALA|   177|
|Argentina|    20|      FW|    AGÜERO|   172|
|Argentina|    10|      FW|   HIGUAÍN|   184|
|Argentina|    12|      MF|  DI MARÍA|   178|
|Argentina|    21|      MF|  LO CELSO|   177|
|Argentina|    14|      MF|      MEZA|   180|
|Argentina|     9|      DF|     ACUÑA|   172|
|Argentina|    24|      GK| CABALLERO|   186|
|Argentina|     3|      DF|   MERCADO|   181|
|Argentina|    18|      DF|  OTAME

### Inner Join

Returns only the rows that have a match in both DataFrames.

In [69]:
# Perform an inner join between the "arg" DataFrame (Argentina) and the "bra" DataFrame (Brazil) with an explicit join type
# - join(bra, arg["Number"] == bra["Number"], "inner"): performs an inner join between the two DataFrames on the "Number" column,
#   explicitly specifying the join type as "inner". This joins rows where "Number" values are equal in both DataFrames.
# - new_df: stores the resulting DataFrame after the join
# - show(23): displays the first 23 rows of the joined DataFrame to provide an overview of the results
new_df = arg.join(bra, arg["Number"] == bra["Number"], "inner")
new_df.show(23)

+---------+------+--------+----------+------+------+------+--------+-----------+------+
|     Team|Number|Position|Shirt Name|Height|  Team|Number|Position| Shirt Name|Height|
+---------+------+--------+----------+------+------+------+--------+-----------+------+
|Argentina|     4|      DF|TAGLIAFICO|   169|Brazil|     4|      DF|    GEROMEL|   190|
|Argentina|    23|      MF|     PAVÓN|   169|Brazil|    23|      GK|    EDERSON|   188|
|Argentina|    16|      MF|   LANZINI|   167|Brazil|    16|      GK|     CASSIO|   195|
|Argentina|    19|      DF|    SALVIO|   167|Brazil|    19|      MF|    WILLIAN|   175|
|Argentina|    11|      FW|     MESSI|   170|Brazil|    11|      MF|P. COUTINHO|   172|
|Argentina|     5|      DF|   ANSALDI|   181|Brazil|     5|      MF|   CASEMIRO|   185|
|Argentina|     6|      MF|    BIGLIA|   175|Brazil|     6|      DF|FILIPE LUIS|   182|
|Argentina|     8|      MF|    BANEGA|   175|Brazil|     8|      MF| R. AUGUSTO|   186|
|Argentina|    15|      DF|MASCH

In [70]:
# Count the number of rows in the resulting DataFrame from the join
# - count(): returns the total number of rows in the DataFrame `new_df`
new_df.count()

22
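
The count drops from 23 to 22 because of the increment. Assuming each squad uses the shirt numbers 1 to 23 exactly once (an assumption that is consistent with the outputs in this notebook but never stated in it), the inner-join count equals the size of the intersection of the two key sets:

```python
# Assumption: both squads use shirt numbers 1..23 exactly once.
bra_numbers = set(range(1, 24))
arg_numbers = {n + 1 for n in range(1, 24)}  # after adding 1: 2..24

# With unique keys, an inner join yields one row per key present on both sides.
matched = arg_numbers & bra_numbers

print(len(matched))                       # 22
print(sorted(arg_numbers - bra_numbers))  # [24]
print(sorted(bra_numbers - arg_numbers))  # [1]
```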

### Left Join

Keeps every row from the LEFT DataFrame.

*Note:* Columns from the right DataFrame are null where no match exists.

In [71]:
# Perform a left join between the "arg" DataFrame (Argentina) and the "bra" DataFrame (Brazil) with an explicit join type
# - join(bra, arg["Number"] == bra["Number"], "left"): performs a left join between the two DataFrames on the "Number" column,
#   explicitly specifying the join type as "left". This means that all rows from the left DataFrame ("arg") will be included in the result,
#   and matching rows from the right DataFrame ("bra") will be included where available. Rows from the left DataFrame without a match
#   in the right DataFrame will have null values for columns from the right DataFrame.
# - new_df: stores the resulting DataFrame after the join
# - show(23): displays the first 23 rows of the joined DataFrame to provide an overview of the results
new_df = arg.join(bra, arg["Number"] == bra["Number"], "left")
new_df.show(23)

+---------+------+--------+----------+------+------+------+--------+-----------+------+
|     Team|Number|Position|Shirt Name|Height|  Team|Number|Position| Shirt Name|Height|
+---------+------+--------+----------+------+------+------+--------+-----------+------+
|Argentina|     4|      DF|TAGLIAFICO|   169|Brazil|     4|      DF|    GEROMEL|   190|
|Argentina|    23|      MF|     PAVÓN|   169|Brazil|    23|      GK|    EDERSON|   188|
|Argentina|    16|      MF|   LANZINI|   167|Brazil|    16|      GK|     CASSIO|   195|
|Argentina|    19|      DF|    SALVIO|   167|Brazil|    19|      MF|    WILLIAN|   175|
|Argentina|    11|      FW|     MESSI|   170|Brazil|    11|      MF|P. COUTINHO|   172|
|Argentina|     5|      DF|   ANSALDI|   181|Brazil|     5|      MF|   CASEMIRO|   185|
|Argentina|     6|      MF|    BIGLIA|   175|Brazil|     6|      DF|FILIPE LUIS|   182|
|Argentina|     8|      MF|    BANEGA|   175|Brazil|     8|      MF| R. AUGUSTO|   186|
|Argentina|    15|      DF|MASCH

In [72]:
# Count the number of rows in the resulting DataFrame from the join
# - count(): returns the total number of rows in the DataFrame `new_df`
new_df.count()

23
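
A plain-Python sketch of left-join semantics on a small sample taken from the tables above. Dictionaries keyed by shirt number stand in for the two DataFrames, and `left_join` is a hypothetical helper that assumes unique keys on the right side:

```python
def left_join(left, right):
    """Keep every left entry; pair it with the right value, or None if absent."""
    return [(key, value, right.get(key)) for key, value in left.items()]

arg_sample = {11: "MESSI", 24: "CABALLERO"}       # numbers after the +1 increment
bra_sample = {11: "P. COUTINHO", 1: "A. BECKER"}

for row in left_join(arg_sample, bra_sample):
    print(row)
# (11, 'MESSI', 'P. COUTINHO')
# (24, 'CABALLERO', None)
```

Swapping the arguments, `left_join(bra_sample, arg_sample)`, mirrors a right join: `A. BECKER` is kept with `None` on the Argentina side.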

### Right Join

Keeps every row from the RIGHT DataFrame.

*Note:* Columns from the left DataFrame are null where no match exists.

In [73]:
# Perform a right join between the "arg" DataFrame (Argentina) and the "bra" DataFrame (Brazil) with an explicit join type
# - join(bra, arg["Number"] == bra["Number"], "right"): performs a right join between the two DataFrames on the "Number" column,
#   explicitly specifying the join type as "right". This means that all rows from the right DataFrame ("bra") will be included in the result,
#   and matching rows from the left DataFrame ("arg") will be included where available. Rows from the right DataFrame without a match
#   in the left DataFrame will have null values for columns from the left DataFrame.
# - new_df: stores the resulting DataFrame after the join
# - show(23): displays the first 23 rows of the joined DataFrame to provide an overview of the results
new_df = arg.join(bra, arg["Number"] == bra["Number"], "right")
new_df.show(23)

+---------+------+--------+----------+------+------+------+--------+-----------+------+
|     Team|Number|Position|Shirt Name|Height|  Team|Number|Position| Shirt Name|Height|
+---------+------+--------+----------+------+------+------+--------+-----------+------+
|Argentina|    18|      DF|  OTAMENDI|   181|Brazil|    18|      MF|       FRED|   169|
|Argentina|    21|      MF|  LO CELSO|   177|Brazil|    21|      FW|     TAISON|   172|
|Argentina|    17|      DF|      ROJO|   189|Brazil|    17|      MF|FERNANDINHO|   179|
|Argentina|    22|      FW|    DYBALA|   177|Brazil|    22|      DF|     FAGNER|   168|
|Argentina|    10|      FW|   HIGUAÍN|   184|Brazil|    10|      FW|  NEYMAR JR|   175|
|Argentina|    11|      FW|     MESSI|   170|Brazil|    11|      MF|P. COUTINHO|   172|
|Argentina|     7|      DF|     FAZIO|   199|Brazil|     7|      FW|   D. COSTA|   182|
|Argentina|     6|      MF|    BIGLIA|   175|Brazil|     6|      DF|FILIPE LUIS|   182|
|Argentina|     9|      DF|     

In [74]:
# Count the number of rows in the resulting DataFrame from the join
# - count(): returns the total number of rows in the DataFrame `new_df`
new_df.count()

23

### Full Join

Keeps every row from both DataFrames.

*Note:* Where a row has no match, the columns from the other DataFrame are null.

In [75]:
# Perform a full outer join between the "arg" DataFrame (Argentina) and the "bra" DataFrame (Brazil) with an explicit join type
# - join(bra, arg["Number"] == bra["Number"], "full"): performs a full outer join between the two DataFrames on the "Number" column,
#   explicitly specifying the join type as "full". This means that all rows from both DataFrames will be included in the result,
#   with matching rows from both DataFrames combined where "Number" values match. Rows from either DataFrame that do not have a match
#   in the other DataFrame will have null values for the columns from the DataFrame without a match.
# - new_df: stores the resulting DataFrame after the join
# - show(24): displays the first 24 rows of the joined DataFrame to provide an overview of the results
new_df = arg.join(bra, arg["Number"] == bra["Number"], "full")
new_df.show(24)

+---------+------+--------+----------+------+------+------+--------+-----------+------+
|     Team|Number|Position|Shirt Name|Height|  Team|Number|Position| Shirt Name|Height|
+---------+------+--------+----------+------+------+------+--------+-----------+------+
|     NULL|  NULL|    NULL|      NULL|  NULL|Brazil|     1|      GK|  A. BECKER|   193|
|Argentina|     2|      GK|    GUZMÁN|   192|Brazil|     2|      DF|   T. SILVA|   183|
|Argentina|     3|      DF|   MERCADO|   181|Brazil|     3|      DF|    MIRANDA|   186|
|Argentina|     4|      DF|TAGLIAFICO|   169|Brazil|     4|      DF|    GEROMEL|   190|
|Argentina|     5|      DF|   ANSALDI|   181|Brazil|     5|      MF|   CASEMIRO|   185|
|Argentina|     6|      MF|    BIGLIA|   175|Brazil|     6|      DF|FILIPE LUIS|   182|
|Argentina|     7|      DF|     FAZIO|   199|Brazil|     7|      FW|   D. COSTA|   182|
|Argentina|     8|      MF|    BANEGA|   175|Brazil|     8|      MF| R. AUGUSTO|   186|
|Argentina|     9|      DF|     

In [76]:
# Count the number of rows in the resulting DataFrame from the join
# - count(): returns the total number of rows in the DataFrame `new_df`
new_df.count()

24
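
For unique keys, a full join yields one row per key found on either side. A plain-Python sketch, assuming Brazil wears 1 to 23 and Argentina, after the increment, 2 to 24 (not stated in the notebook). `full_join` and the placeholder player labels are hypothetical:

```python
def full_join(left, right):
    """One row per key from either side; the missing side is filled with None."""
    keys = sorted(left.keys() | right.keys())
    return [(left.get(k), k, right.get(k)) for k in keys]

# Assumption: Brazil wears 1..23 and Argentina, after the +1 shift, 2..24.
arg_model = {n: f"ARG-{n}" for n in range(2, 25)}
bra_model = {n: f"BRA-{n}" for n in range(1, 24)}

rows = full_join(arg_model, bra_model)
print(len(rows))  # 24
print(rows[0])    # (None, 1, 'BRA-1')
print(rows[-1])   # ('ARG-24', 24, None)
```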

### Semi Join

Similar to an inner join, but only the columns of the left DataFrame are returned, and each left row appears at most once.

In [77]:
# Perform a semi join between the "arg" DataFrame (Argentina) and the "bra" DataFrame (Brazil) with an explicit join type
# - join(bra, arg["Number"] == bra["Number"], "semi"): performs a semi join between the two DataFrames on the "Number" column,
#   explicitly specifying the join type as "semi". This join returns only the rows from the left DataFrame ("arg") where there is
#   at least one matching row in the right DataFrame ("bra") based on the "Number" column. It does not include columns from the right DataFrame.
# - new_df: stores the resulting DataFrame after the join
# - show(23): displays the first 23 rows of the joined DataFrame to provide an overview of the results
new_df = arg.join(bra, arg["Number"] == bra["Number"], "semi")
new_df.show(23)

+---------+------+--------+----------+------+
|     Team|Number|Position|Shirt Name|Height|
+---------+------+--------+----------+------+
|Argentina|     4|      DF|TAGLIAFICO|   169|
|Argentina|    23|      MF|     PAVÓN|   169|
|Argentina|    16|      MF|   LANZINI|   167|
|Argentina|    19|      DF|    SALVIO|   167|
|Argentina|    11|      FW|     MESSI|   170|
|Argentina|     5|      DF|   ANSALDI|   181|
|Argentina|     6|      MF|    BIGLIA|   175|
|Argentina|     8|      MF|    BANEGA|   175|
|Argentina|    15|      DF|MASCHERANO|   174|
|Argentina|    22|      FW|    DYBALA|   177|
|Argentina|    20|      FW|    AGÜERO|   172|
|Argentina|    10|      FW|   HIGUAÍN|   184|
|Argentina|    12|      MF|  DI MARÍA|   178|
|Argentina|    21|      MF|  LO CELSO|   177|
|Argentina|    14|      MF|      MEZA|   180|
|Argentina|     9|      DF|     ACUÑA|   172|
|Argentina|     3|      DF|   MERCADO|   181|
|Argentina|    18|      DF|  OTAMENDI|   181|
|Argentina|    17|      DF|      R

In [78]:
# Count the number of rows in the resulting DataFrame from the join
# - count(): returns the total number of rows in the DataFrame `new_df`
new_df.count()

22
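
The difference from an inner join shows up when the right side contains duplicate keys: an inner join repeats the left row once per match, while a semi join returns each left row at most once. A plain-Python sketch with hypothetical sample data and helper names:

```python
def inner_join(left, right):
    """One output row per matching (left, right) pair."""
    return [(l, r) for l in left for r in right if l[0] == r[0]]

def semi_join(left, right):
    """Left rows that have at least one match; right columns are dropped."""
    right_keys = {r[0] for r in right}
    return [l for l in left if l[0] in right_keys]

left = [(11, "MESSI"), (24, "CABALLERO")]
right = [(11, "P. COUTINHO"), (11, "duplicate key")]  # key 11 appears twice

print(len(inner_join(left, right)))  # 2
print(semi_join(left, right))        # [(11, 'MESSI')]
```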

### Anti Join

Returns only the rows from the left DataFrame that have NO match in the right DataFrame.

In [79]:
# Perform an anti join between the "arg" DataFrame (Argentina) and the "bra" DataFrame (Brazil) with an explicit join type
# - join(bra, arg["Number"] == bra["Number"], "anti"): performs an anti join between the two DataFrames on the "Number" column,
#   explicitly specifying the join type as "anti". This join returns only the rows from the left DataFrame ("arg") where there is
#   no matching row in the right DataFrame ("bra") based on the "Number" column. It excludes rows from the left DataFrame that have
#   any matching rows in the right DataFrame.
# - new_df: stores the resulting DataFrame after the join
# - show(23): displays the first 23 rows of the joined DataFrame to provide an overview of the results
new_df = arg.join(bra, arg["Number"] == bra["Number"], "anti")
new_df.show(23)

+---------+------+--------+----------+------+
|     Team|Number|Position|Shirt Name|Height|
+---------+------+--------+----------+------+
|Argentina|    24|      GK| CABALLERO|   186|
+---------+------+--------+----------+------+



In [80]:
# Count the number of rows in the resulting DataFrame from the join
# - count(): returns the total number of rows in the DataFrame `new_df`
new_df.count()

1
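
Semi and anti joins split the left DataFrame into two disjoint parts, which is why the two counts add up to 22 + 1 = 23. A plain-Python sketch, with a hypothetical `anti_join` helper and the assumption that Argentina wears 2 to 24 after the increment and Brazil wears 1 to 23:

```python
def anti_join(left_keys, right_keys):
    """Left keys with no match on the right."""
    return [k for k in left_keys if k not in right_keys]

arg_numbers = list(range(2, 25))   # assumed: 1..23 shifted by +1
bra_numbers = set(range(1, 24))    # assumed: 1..23

unmatched = anti_join(arg_numbers, bra_numbers)
matched = [k for k in arg_numbers if k in bra_numbers]  # semi join

print(unmatched)                      # [24]
print(len(matched) + len(unmatched))  # 23
```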