## Data Exploration and Transformation
### Objective:
- Perform more advanced operations like grouping, aggregating, and sorting data from the CSV file.
- Introduce basic transformations using withColumn() and alias().

NOTE: Each time you create a new notebook, start a Spark session and attach the data source to the notebook.

In [None]:
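from pyspark.sql import SparkSession

# Reuse the notebook's active Spark session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Path to the IPL match summary CSV
file_path = 'Files/ipldata/ipl_summary_raw.csv'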

# Read the CSV file into a DataFrame
df = spark.read.csv(file_path, header=True, inferSchema=True)

## Count
List every city and count how many matches were played in each.

In [2]:
# Group by 'info_city' and count the number of matches held in each city
df.groupBy("info_city").count().show()

+------------+-----+
|   info_city|count|
+------------+-----+
|   Bangalore|   65|
|       Kochi|    5|
|     Chennai|   85|
|     Lucknow|   14|
| Navi Mumbai|    9|
|        null|   51|
|   Centurion|   12|
|      Ranchi|    7|
|      Mumbai|  173|
|   Ahmedabad|   36|
|      Durban|   15|
|     Kolkata|   93|
|   Cape Town|    7|
|  Dharamsala|   13|
|     Sharjah|   10|
|        Pune|   51|
|Johannesburg|    8|
|   Kimberley|    3|
|       Delhi|   90|
|      Raipur|    6|
+------------+-----+
only showing top 20 rows
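The `null` row in the output above shows that `groupBy()` keeps a missing city as a group of its own rather than dropping it. The following is a plain-Python sketch of that counting behaviour, not Spark code; the sample list is made up for illustration, and `None` stands in for null.

```python
from collections import Counter

# Hypothetical sample of the info_city column; None stands in for a null city
cities = ["Mumbai", "Chennai", None, "Mumbai", None, "Delhi", "Mumbai"]

# Counter groups equal values together, including None, much like groupBy().count()
matches_per_city = Counter(cities)

# Sort by count, largest first, to mimic ordering the grouped result
ranked = sorted(matches_per_city.items(), key=lambda item: item[1], reverse=True)

for city, count in ranked:
    print(city, count)
```

In PySpark, one way to leave those rows out is to filter before grouping, for example with `df.filter(df.info_city.isNotNull())`.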



## Group by
#### Group by season and winning team, and count the matches each team won per season

In [3]:
# Group by season and winning team, then count the wins for each team in each season
df.groupBy("info_season", "info_outcome_winner").count().show()

+-----------+--------------------+-----+
|info_season| info_outcome_winner|count|
+-----------+--------------------+-----+
|       2018|      Mumbai Indians|    6|
|    2009/10|                null|    1|
|       2018| Chennai Super Kings|   11|
|       2012|     Kings XI Punjab|    8|
|       2023|      Gujarat Titans|   11|
|       2009|    Delhi Daredevils|   10|
|       2022|Kolkata Knight Ri...|    6|
|       2024|      Mumbai Indians|    4|
|       2012|       Pune Warriors|    4|
|    2007/08|    Delhi Daredevils|    7|
|       2018|     Kings XI Punjab|    6|
|       2015|      Mumbai Indians|   10|
|       2024|      Delhi Capitals|    7|
|       2018| Sunrisers Hyderabad|   10|
|       2017|       Gujarat Lions|    4|
|       2024|      Gujarat Titans|    5|
|    2009/10|    Delhi Daredevils|    7|
|    2020/21| Chennai Super Kings|    6|
|       2023|      Mumbai Indians|    9|
|       2015| Chennai Super Kings|   10|
+-----------+--------------------+-----+
only showing top 20 rows

In [13]:
# Group by season, winning team, and toss decision, then count the matches in each combination
df.groupBy("info_season", "info_outcome_winner", "info_toss_decision").count().show()

+-----------+--------------------+------------------+-----+
|info_season| info_outcome_winner|info_toss_decision|count|
+-----------+--------------------+------------------+-----+
|       2022|Kolkata Knight Ri...|               bat|    1|
|    2007/08|    Delhi Daredevils|             field|    4|
|       2012|      Mumbai Indians|               bat|    5|
|       2019|                null|             field|    2|
|       2015|Kolkata Knight Ri...|               bat|    2|
|       2024|    Rajasthan Royals|               bat|    2|
|    2020/21|Kolkata Knight Ri...|               bat|    3|
|       2015|Kolkata Knight Ri...|             field|    5|
|       2011|     Kings XI Punjab|               bat|    1|
|       2013|Royal Challengers...|             field|    6|
|       2023|        Punjab Kings|               bat|    1|
|       2009|Royal Challengers...|               bat|    7|
|    2009/10|      Mumbai Indians|               bat|    5|
|       2018|    Rajasthan Royals|      

## Filter and Select

In [6]:
# Filter DataFrame for matches that took place in Chennai
chennai_winners = df.filter(df.info_city == "Chennai").select("info_city", "info_outcome_winner")
# Show the list of winners
chennai_winners.show()

+---------+--------------------+
|info_city| info_outcome_winner|
+---------+--------------------+
|  Chennai| Chennai Super Kings|
|  Chennai| Chennai Super Kings|
|  Chennai| Chennai Super Kings|
|  Chennai| Chennai Super Kings|
|  Chennai| Chennai Super Kings|
|  Chennai| Chennai Super Kings|
|  Chennai|      Mumbai Indians|
|  Chennai| Chennai Super Kings|
|  Chennai|      Mumbai Indians|
|  Chennai|Royal Challengers...|
|  Chennai|Kolkata Knight Ri...|
|  Chennai|      Mumbai Indians|
|  Chennai|Royal Challengers...|
|  Chennai|      Mumbai Indians|
|  Chennai|Royal Challengers...|
|  Chennai|      Delhi Capitals|
|  Chennai| Sunrisers Hyderabad|
|  Chennai|        Punjab Kings|
|  Chennai|                null|
|  Chennai| Chennai Super Kings|
+---------+--------------------+
only showing top 20 rows



In [12]:
# Filter DataFrame for matches that took place in Mumbai in the 2017 season
mumbai_2017_winners = df.filter((df.info_city == "Mumbai") & (df.info_season == 2017)).select("info_city", "info_season", "info_outcome_winner")

# Show the list of winners for Mumbai in 2017
mumbai_2017_winners.show()

+---------+-----------+--------------------+
|info_city|info_season| info_outcome_winner|
+---------+-----------+--------------------+
|   Mumbai|       2017|      Mumbai Indians|
|   Mumbai|       2017|      Mumbai Indians|
|   Mumbai|       2017|      Mumbai Indians|
|   Mumbai|       2017|      Mumbai Indians|
|   Mumbai|       2017|Rising Pune Super...|
|   Mumbai|       2017|      Mumbai Indians|
|   Mumbai|       2017|     Kings XI Punjab|
|   Mumbai|       2017|Rising Pune Super...|
+---------+-----------+--------------------+
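The parentheses around each condition in the combined filter above matter, because Python's `&` operator binds more tightly than `==`. The sketch below shows the parsing difference with plain integers standing in for column values; the variable names are made up for illustration.

```python
# Plain integers stand in for column values
city_code, season = 1, 2

# Each comparison is evaluated first, then the results are combined
with_parens = (city_code == 1) & (season == 2)

# Without parentheses Python parses this as: city_code == (1 & season) == 2
without_parens = city_code == 1 & season == 2

print(with_parens, without_parens)
```

With Spark columns the unparenthesised form typically fails with an error instead of returning a wrong answer, but wrapping every condition in parentheses avoids the problem either way.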



## Renaming an Existing Column
Renaming columns in a DataFrame is straightforward with the withColumnRenamed function. This is particularly useful for making column names more meaningful or easier to work with.

In [26]:
# Rename the column 'info_outcome_winner' to 'winning_team'
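# (assigned to a new variable so later cells can still use the original column name)
renamed_df = df.withColumnRenamed("info_outcome_winner", "winning_team")

# Show the DataFrame to see the updated column name
renamed_df.select("winning_team").show()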



+--------------------+
|        winning_team|
+--------------------+
| Sunrisers Hyderabad|
|Rising Pune Super...|
|Kolkata Knight Ri...|
|     Kings XI Punjab|
|Royal Challengers...|
| Sunrisers Hyderabad|
|      Mumbai Indians|
|     Kings XI Punjab|
|    Delhi Daredevils|
|      Mumbai Indians|
|Kolkata Knight Ri...|
|      Mumbai Indians|
|       Gujarat Lions|
|Kolkata Knight Ri...|
|    Delhi Daredevils|
|      Mumbai Indians|
|Rising Pune Super...|
|Kolkata Knight Ri...|
| Sunrisers Hyderabad|
|Royal Challengers...|
+--------------------+
only showing top 20 rows



## Adding a Column with withColumn() and when()

In [16]:
from pyspark.sql.functions import when, col

# Add a 'win_by_type' column: "runs" if the margin is in runs,
# "wickets" if it is in wickets, and "unknown" if neither is recorded
win_type_df = df.withColumn("win_by_type",
                   when(col("info_outcome_by_runs").isNotNull(), "runs")
                   .when(col("info_outcome_by_wickets").isNotNull(), "wickets")
                   .otherwise("unknown"))

# Keep only the columns needed to check the new column
# (a new variable is used so that df keeps all of its columns for later cells)
win_type_df = win_type_df.select("info_outcome_by_runs", "info_outcome_by_wickets", "win_by_type")

# Show the final DataFrame
win_type_df.show()

+--------------------+-----------------------+-----------+
|info_outcome_by_runs|info_outcome_by_wickets|win_by_type|
+--------------------+-----------------------+-----------+
|                35.0|                   null|       runs|
|                null|                    7.0|    wickets|
|                null|                   10.0|    wickets|
|                null|                    6.0|    wickets|
|                15.0|                   null|       runs|
|                null|                    9.0|    wickets|
|                null|                    4.0|    wickets|
|                null|                    8.0|    wickets|
|                97.0|                   null|       runs|
|                null|                    4.0|    wickets|
|                null|                    8.0|    wickets|
|                null|                    4.0|    wickets|
|                null|                    7.0|    wickets|
|                17.0|                   null|       run
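A chain of `when()` calls is evaluated in order, and the first condition that holds decides the value; `otherwise()` supplies the fallback. The function below is a plain-Python model of the chain used above, not Spark code. The sample rows are hypothetical, and `None` stands in for null.

```python
def win_by_type(by_runs, by_wickets):
    """Plain-Python model of the when().when().otherwise() chain."""
    if by_runs is not None:        # first when(): margin recorded in runs
        return "runs"
    if by_wickets is not None:     # second when(): margin recorded in wickets
        return "wickets"
    return "unknown"               # otherwise(): neither margin is present

# Hypothetical (by_runs, by_wickets) pairs; None stands in for null
sample_rows = [(35.0, None), (None, 7.0), (None, None)]
labels = [win_by_type(runs, wickets) for runs, wickets in sample_rows]
print(labels)
```

Because the checks run in order, a row that had both margins filled in would be labelled "runs".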

## isin() and contains()

#### Find matches where any of the listed players was Player of the Match, using the isin() function
#### Find matches where "V Kohli" is the Player of the Match, using contains()

In [31]:
# Players to look for
players = ["GJ Maxwell", "B Kumar"]

# Keep the matches where the Player of the Match is one of the listed players
player_of_match_rows = df.filter(df.info_player_of_match_1.isin(players)).select("info_season", "info_player_of_match_1")

# Show the result
player_of_match_rows.show()

+-----------+----------------------+
|info_season|info_player_of_match_1|
+-----------+----------------------+
|       2017|            GJ Maxwell|
|       2017|               B Kumar|
|       2021|            GJ Maxwell|
|       2021|            GJ Maxwell|
|       2021|            GJ Maxwell|
|       2023|            GJ Maxwell|
|       2024|               B Kumar|
|       2014|            GJ Maxwell|
|       2014|            GJ Maxwell|
|       2014|            GJ Maxwell|
|       2014|               B Kumar|
|       2014|            GJ Maxwell|
|       2014|               B Kumar|
|       2016|               B Kumar|
|       2016|               B Kumar|
+-----------+----------------------+



In [48]:
# Filter DataFrame where the 'info_player_of_match_1' column contains 'V Kohli'
kohli_matches = df.filter(df.info_player_of_match_1.contains("V Kohli"))

# Show the results
kohli_matches.select("info_season","info_player_of_match_1").show()

+-----------+----------------------+
|info_season|info_player_of_match_1|
+-----------+----------------------+
|       2019|               V Kohli|
|    2020/21|               V Kohli|
|       2022|               V Kohli|
|       2023|               V Kohli|
|       2023|               V Kohli|
|       2024|               V Kohli|
|       2024|               V Kohli|
|       2011|               V Kohli|
|       2011|               V Kohli|
|       2013|               V Kohli|
|       2013|               V Kohli|
|       2013|               V Kohli|
|       2015|               V Kohli|
|       2016|               V Kohli|
|       2016|               V Kohli|
|       2016|               V Kohli|
|       2016|               V Kohli|
|       2016|               V Kohli|
+-----------+----------------------+
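The two filters above differ in how strictly they match: `isin()` tests for exact membership in a list, while `contains()` tests for a substring anywhere in the value. A plain-Python model of the difference, with a made-up sample and `None` standing in for null, might look like this:

```python
# Hypothetical values of info_player_of_match_1; None stands in for null
players_of_match = ["GJ Maxwell", "B Kumar", "V Kohli", None]

def is_in(value, allowed):
    """Model of Column.isin(): exact membership in a list of values."""
    return value is not None and value in allowed

def contains(value, fragment):
    """Model of Column.contains(): substring match anywhere in the value."""
    return value is not None and fragment in value

exact = [p for p in players_of_match if is_in(p, ["GJ Maxwell", "B Kumar"])]
partial = [p for p in players_of_match if contains(p, "Kohli")]
print(exact, partial)
```

In this model, null values satisfy neither test, which matches how rows with a null Player of the Match drop out of both filtered results.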



## Using SQL LIKE for Pattern Matching
SQL LIKE allows for pattern matching, useful when you want to filter based on patterns, not exact matches.
#### Find Teams That Contain "Sunrisers" in Their Name

In [51]:
# Filter DataFrame where the team name in 'info_teams_1' contains "Sunrisers"
sunrisers_matches = df.filter(df.info_teams_1.like("%Sunrisers%"))

# Show the results
sunrisers_matches.show(truncate=False)

+-------------------+----------+------------+-----------------------+---------------------+-----------+-------------------+-------------------------------+--------------------------------+---------------------------+------------------------+------------------------+--------------------+----------------------+----------+----------------------+-----------+--------------+-------------------+---------------------------+------------------+---------------------------+--------------------------------------------+-----------------------+-----------------------+-------------------+----------------+------------+-------------------+
|info_balls_per_over|info_city |info_dates_1|info_event_match_number|info_event_name      |info_gender|info_match_type    |info_officials_match_referees_1|info_officials_reserve_umpires_1|info_officials_tv_umpires_1|info_officials_umpires_1|info_officials_umpires_2|info_outcome_by_runs|info_outcome_winner   |info_overs|info_player_of_match_1|info_season|info_team_type
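In a LIKE pattern, `%` matches any run of characters (including none) and `_` matches exactly one character; everything else must match literally and the comparison is case-sensitive. The helper below is a simplified plain-Python model of those rules, not the Spark implementation: `sql_like` is a hypothetical name, and escape characters are deliberately ignored.

```python
import re

def sql_like(value, pattern):
    """Return True if value matches a SQL LIKE pattern (% and _ wildcards only)."""
    if value is None:
        return False  # a null never satisfies the filter
    regex_parts = []
    for ch in pattern:
        if ch == "%":
            regex_parts.append(".*")          # any run of characters, possibly empty
        elif ch == "_":
            regex_parts.append(".")           # exactly one character
        else:
            regex_parts.append(re.escape(ch)) # literal character
    return re.fullmatch("".join(regex_parts), value, flags=re.DOTALL) is not None

print(sql_like("Sunrisers Hyderabad", "%Sunrisers%"))
print(sql_like("Sunrisers Hyderabad", "Sunrisers"))
```

The second call shows why the wildcards are needed: without `%` on either side, the pattern has to match the whole value.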

## Using SQL Expressions for Advanced Filtering
You can use SQL expressions within the filter() or where() methods for more complex conditions.
#### Filter Matches Played in Chennai Where "MS Dhoni" Was Player of the Match

In [54]:
# Filter using a SQL expression
chennai_dhoni_matches = df.filter("info_city = 'Chennai' AND info_player_of_match_1 = 'MS Dhoni'").select("info_teams_2", "info_season")

# Show the results
chennai_dhoni_matches.show(truncate=False)


+-------------------+-----------+
|info_teams_2       |info_season|
+-------------------+-----------+
|Rajasthan Royals   |2019       |
|Delhi Capitals     |2019       |
|Delhi Daredevils   |2011       |
|Sunrisers Hyderabad|2013       |
|Delhi Daredevils   |2013       |
+-------------------+-----------+



## Using where() for Filtering
The where() function is equivalent to filter() in PySpark. It’s just another way to apply conditions to the DataFrame.
#### Filter Matches Played in 2024 in Chennai

In [58]:
# Using where() for filtering
chennai_matches = df.where((df.info_city == "Chennai") & (df.info_season == 2024)).select("info_teams_1","info_teams_2")

# Show the results
chennai_matches.show(truncate=False)


+---------------------------+---------------------+
|info_teams_1               |info_teams_2         |
+---------------------------+---------------------+
|Royal Challengers Bengaluru|Chennai Super Kings  |
|Chennai Super Kings        |Gujarat Titans       |
|Kolkata Knight Riders      |Chennai Super Kings  |
|Chennai Super Kings        |Lucknow Super Giants |
|Chennai Super Kings        |Sunrisers Hyderabad  |
|Chennai Super Kings        |Punjab Kings         |
|Rajasthan Royals           |Chennai Super Kings  |
|Sunrisers Hyderabad        |Rajasthan Royals     |
|Sunrisers Hyderabad        |Kolkata Knight Riders|
+---------------------------+---------------------+



## Sorting Data using sort() or orderBy()
Sorting can be done using sort() or orderBy(). You can sort in ascending or descending order.
#### Sort Matches by Winning Margin in Runs, in Descending Order

In [5]:
# Sort DataFrame by 'info_outcome_by_runs' in descending order
sorted_matches = df.sort(df.info_outcome_by_runs.desc()).select("info_teams_1","info_teams_2","info_outcome_by_runs","info_outcome_winner")

# Show the results
sorted_matches.show(truncate=False)


+---------------------------+---------------------------+--------------------+---------------------------+
|info_teams_1               |info_teams_2               |info_outcome_by_runs|info_outcome_winner        |
+---------------------------+---------------------------+--------------------+---------------------------+
|Delhi Daredevils           |Mumbai Indians             |146.0               |Mumbai Indians             |
|Royal Challengers Bangalore|Gujarat Lions              |144.0               |Royal Challengers Bangalore|
|Royal Challengers Bangalore|Kolkata Knight Riders      |140.0               |Kolkata Knight Riders      |
|Royal Challengers Bangalore|Kings XI Punjab            |138.0               |Royal Challengers Bangalore|
|Royal Challengers Bangalore|Pune Warriors              |130.0               |Royal Challengers Bangalore|
|Sunrisers Hyderabad        |Royal Challengers Bangalore|118.0               |Sunrisers Hyderabad        |
|Royal Challengers Bangalore|Rajastha
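Matches won by wickets have no value in info_outcome_by_runs, so it helps to know where nulls land: with a descending sort, Spark places nulls last by default. The plain-Python sketch below mirrors that ordering; the margins are made-up sample values and `None` stands in for null.

```python
# Hypothetical winning margins in runs; None means the match was not won by runs
margins = [35.0, None, 146.0, 15.0, None, 97.0]

# Sort largest first; the key puts None after every real number,
# mirroring the nulls-last default of a descending sort in Spark
descending = sorted(
    margins,
    key=lambda m: (m is None, -m if m is not None else 0.0),
)
print(descending)
```

If the nulls should come first instead, PySpark columns also offer `desc_nulls_first()`.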

## Using SQL Queries on DataFrames
You can also run SQL queries directly on a DataFrame by creating a temporary view.
#### SQL Query to Find Matches Where "B Kumar" Was Player of the Match

In [None]:
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("ipl_matches")

# Use a SQL query to find matches where "B Kumar" was Player of the Match
b_kumar_matches = spark.sql("SELECT * FROM ipl_matches WHERE info_player_of_match_1 LIKE '%B Kumar%'").select("info_teams_1","info_teams_2","info_outcome_by_runs","info_outcome_winner")

# Show the results
b_kumar_matches.show(truncate=False)


## Inner Join: 
#### Returns only the rows that have matching values in both DataFrames.

In [62]:
from pyspark.sql import Row

# Sample data for the city dimension table
city_dimension_data = [
    Row(info_city="Hyderabad", city_population=10000000, city_state="Telangana", city_country="India"),
    Row(info_city="Bengaluru", city_population=12000000, city_state="Karnataka", city_country="India"),
    Row(info_city="Chennai", city_population=7000000, city_state="Tamil Nadu", city_country="India"),
]

# Create City Dimension DataFrame
city_dimension_df = spark.createDataFrame(city_dimension_data)
city_dimension_df.show()

+---------+---------------+----------+------------+
|info_city|city_population|city_state|city_country|
+---------+---------------+----------+------------+
|Hyderabad|       10000000| Telangana|       India|
|Bengaluru|       12000000| Karnataka|       India|
|  Chennai|        7000000|Tamil Nadu|       India|
+---------+---------------+----------+------------+



In [None]:
# Path to the match data file
file_path = 'Files/ipldata/ipl_summary_raw.csv'

# Read the CSV file into a DataFrame
match_winners_df = spark.read.csv(file_path, header=True, inferSchema=True)

In [61]:
# If match_winners_df has a city "Chennai" and city_dimension_df also has "Chennai," the result will include "Chennai."

joined_df = match_winners_df.join(city_dimension_df, on="info_city", how="inner").select("info_city","info_outcome_winner","city_state")
joined_df.show()

#Result: Only rows with matching info_city in both DataFrames will be included.

+---------+--------------------+----------+
|info_city| info_outcome_winner|city_state|
+---------+--------------------+----------+
|Hyderabad|    Delhi Daredevils| Telangana|
|Hyderabad| Sunrisers Hyderabad| Telangana|
|Hyderabad| Sunrisers Hyderabad| Telangana|
|Hyderabad|Rising Pune Super...| Telangana|
|Hyderabad| Sunrisers Hyderabad| Telangana|
|Hyderabad| Sunrisers Hyderabad| Telangana|
|Hyderabad|Kolkata Knight Ri...| Telangana|
|Hyderabad|      Mumbai Indians| Telangana|
|Hyderabad|Royal Challengers...| Telangana|
|Hyderabad| Sunrisers Hyderabad| Telangana|
|Hyderabad| Sunrisers Hyderabad| Telangana|
|Hyderabad| Sunrisers Hyderabad| Telangana|
|Hyderabad|Kolkata Knight Ri...| Telangana|
|Hyderabad|     Kings XI Punjab| Telangana|
|Hyderabad|      Mumbai Indians| Telangana|
|Hyderabad| Sunrisers Hyderabad| Telangana|
|Hyderabad| Sunrisers Hyderabad| Telangana|
|Hyderabad| Chennai Super Kings| Telangana|
|Hyderabad| Sunrisers Hyderabad| Telangana|
|Hyderabad| Sunrisers Hyderabad|

## Left Join:

The join() function performs a left join on the info_city column between match_winners_df and city_dimension_df.
The select() function then keeps only the info_city, info_outcome_winner, and city_state columns from the result.

In [63]:
# "Mumbai" appears in match_winners_df but not in city_dimension_df, so its rows are kept with null values for the city details.

# Perform a left join on the info_city column
left_joined_df = match_winners_df.join(city_dimension_df, on="info_city", how="left").select("info_city","info_outcome_winner","city_state")

# Show the result of the left join
left_joined_df.show(truncate=False)

# Result: All cities from match_winners_df will appear, with corresponding data from city_dimension_df if available.

+---------+---------------------------+----------+
|info_city|info_outcome_winner        |city_state|
+---------+---------------------------+----------+
|Bangalore|Mumbai Indians             |null      |
|Bangalore|Rising Pune Supergiant     |null      |
|Mumbai   |Mumbai Indians             |null      |
|Mumbai   |Mumbai Indians             |null      |
|Mumbai   |Mumbai Indians             |null      |
|Kolkata  |Kolkata Knight Riders      |null      |
|Kolkata  |Kolkata Knight Riders      |null      |
|Pune     |Rising Pune Supergiant     |null      |
|Pune     |Delhi Daredevils           |null      |
|Delhi    |Delhi Daredevils           |null      |
|Delhi    |Kolkata Knight Riders      |null      |
|Bengaluru|Royal Challengers Bangalore|Karnataka |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Rajkot   |Kolkata Knight Rider

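## Replacing Values with when() Before a Left Join
The left join above returns null city details for "Bangalore" because the city dimension table spells the city "Bengaluru". Replace info_city = "Bangalore" with info_city = "Bengaluru" before joining.

Use the when() function to apply the condition for the replacement, and otherwise() to keep every other value unchanged.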

In [None]:
left_joined_df = match_winners_df.join(city_dimension_df, on="info_city", how="left")


In [64]:
from pyspark.sql.functions import when

# Replace "Bangalore" with "Bengaluru" in the info_city column
df_replaced = match_winners_df.withColumn("info_city", when(match_winners_df.info_city == "Bangalore", "Bengaluru").otherwise(match_winners_df.info_city))

# Perform a left join on the info_city column and select the desired columns
left_joined_df = df_replaced.join(city_dimension_df, on="info_city", how="left").select("info_city", "info_outcome_winner", "city_state")

# Show the result of the left join
left_joined_df.show(truncate=False)


+---------+---------------------------+----------+
|info_city|info_outcome_winner        |city_state|
+---------+---------------------------+----------+
|Mumbai   |Mumbai Indians             |null      |
|Mumbai   |Mumbai Indians             |null      |
|Mumbai   |Mumbai Indians             |null      |
|Kolkata  |Kolkata Knight Riders      |null      |
|Kolkata  |Kolkata Knight Riders      |null      |
|Pune     |Rising Pune Supergiant     |null      |
|Pune     |Delhi Daredevils           |null      |
|Delhi    |Delhi Daredevils           |null      |
|Delhi    |Kolkata Knight Riders      |null      |
|Bengaluru|Royal Challengers Bangalore|Karnataka |
|Bengaluru|Mumbai Indians             |Karnataka |
|Bengaluru|Rising Pune Supergiant     |Karnataka |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Rajkot   |Kolkata Knight Rider

## Right Join

In [49]:
# If city_dimension_df has a city that never appears in match_winners_df, that city is still included, with null values for the match details.

# Perform a right join on the info_city column
right_joined_df = match_winners_df.join(city_dimension_df, on="info_city", how="right").select("info_city","info_outcome_winner","city_state")

# Show the result of the right join
right_joined_df.show(truncate=False)

# Result: All cities from city_dimension_df will appear, with corresponding data from match_winners_df if available.

+---------+---------------------------+----------+
|info_city|info_outcome_winner        |city_state|
+---------+---------------------------+----------+
|Hyderabad|Delhi Daredevils           |Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Rising Pune Supergiants    |Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Kolkata Knight Riders      |Telangana |
|Hyderabad|Mumbai Indians             |Telangana |
|Hyderabad|Royal Challengers Bangalore|Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Kolkata Knight Riders      |Telangana |
|Hyderabad|Kings XI Punjab            |Telangana |
|Hyderabad|Mumbai Indians             |Telangana |
|Hyderabad|Sunrisers Hyderabad        |Telangana |
|Hyderabad|Sunrisers Hyderabad 

## Outer Join

In [65]:
# Combines all cities from both DataFrames, including unmatched ones with nulls in the missing columns.
# Perform an outer join on the info_city column
outer_joined_df = match_winners_df.join(city_dimension_df, on="info_city", how="outer").select("info_city","info_outcome_winner","city_state")

# Show the result of the outer join
outer_joined_df.show(truncate=False)
# Result: All cities from both DataFrames are included, with nulls where data is missing.

+---------+---------------------------+----------+
|info_city|info_outcome_winner        |city_state|
+---------+---------------------------+----------+
|null     |null                       |null      |
|null     |Sunrisers Hyderabad        |null      |
|null     |Rajasthan Royals           |null      |
|null     |Kings XI Punjab            |null      |
|null     |Delhi Capitals             |null      |
|null     |Sunrisers Hyderabad        |null      |
|null     |Kolkata Knight Riders      |null      |
|null     |Rajasthan Royals           |null      |
|null     |Delhi Capitals             |null      |
|null     |Kings XI Punjab            |null      |
|null     |Chennai Super Kings        |null      |
|null     |Delhi Capitals             |null      |
|null     |Sunrisers Hyderabad        |null      |
|null     |null                       |null      |
|null     |Sunrisers Hyderabad        |null      |
|null     |Delhi Capitals             |null      |
|null     |Kings XI Punjab     

## Summary:
- Inner Join: Only rows whose key appears in both DataFrames.
- Left Join: All rows from the left DataFrame and matches from the right.
- Right Join: All rows from the right DataFrame and matches from the left.
- Outer Join: All rows from both DataFrames with nulls where there are no matches.
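
The four join types summarised above can be compared side by side on a tiny example. The following is a plain-Python model rather than Spark code: the two small collections and the `join_rows` helper are hypothetical, each city is assumed to appear at most once on the right side, and `None` stands in for null.

```python
# Hypothetical miniature versions of the two DataFrames, keyed by info_city
matches = [("Chennai", "Chennai Super Kings"), ("Mumbai", "Mumbai Indians")]
city_states = {"Chennai": "Tamil Nadu", "Hyderabad": "Telangana"}

def join_rows(how):
    """Model match_winners_df.join(city_dimension_df, on="info_city", how=how)."""
    rows = []
    matched_cities = set()
    for city, winner in matches:
        if city in city_states:
            rows.append((city, winner, city_states[city]))
            matched_cities.add(city)
        elif how in ("left", "outer"):
            rows.append((city, winner, None))      # keep unmatched left row
    if how in ("right", "outer"):
        for city, state in city_states.items():
            if city not in matched_cities:
                rows.append((city, None, state))   # keep unmatched right row
    return rows

for how in ("inner", "left", "right", "outer"):
    print(how, join_rows(how))
```

Only "Chennai" exists on both sides, so it is the single row of the inner join; "Mumbai" survives only in the left and outer joins, and "Hyderabad" only in the right and outer joins. Spark does not guarantee the row order of a join result, so the order printed here is a property of the sketch only.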