# Seasonal Segmentation of Women's Super League (WSL) Data Using Spark

## Overview

This notebook reads the Women's Super League (WSL) league table from the silver data lake, filters it by season, and saves each season's table to the gold data lake.

Each season is written to its own Delta table, so a single season can be read directly for historical comparisons, trend analysis, and performance evaluations.

## Steps

### 1. Retrieve Configuration Details

Run the 'config' notebook and read the silver and gold mount points from the dictionary it returns. These paths are used below to load the source table and to save the per-season tables.

In [0]:
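# Retrieve configuration details, including mount points for silver and gold data lakes
import ast

# Run the config notebook (180-second timeout); it returns its result as a string
config = dbutils.notebook.run('../storage/01-config', 180)
# Parse the returned dict literal without executing arbitrary code (safer than eval)
config_dict = ast.literal_eval(config)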

silverMountPoint = config_dict["silverMountPoint"]
goldMountPoint = config_dict["goldMountPoint"]
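
The lookups above raise a bare `KeyError` if the config notebook omits a mount point. A minimal sketch of a stricter parser follows, assuming the config notebook returns either a JSON object or a Python dict literal as a string. `parse_config` and `REQUIRED_KEYS` are hypothetical names, not part of the original notebook, and the sample mount paths are placeholders.

```python
import ast
import json

# Keys this notebook reads from the config (taken from the cell above).
REQUIRED_KEYS = ("silverMountPoint", "goldMountPoint")


def parse_config(raw: str) -> dict:
    """Parse the string returned by the config notebook into a dict.

    Assumption: the string is either JSON or a Python dict literal.
    Neither parser executes code, unlike eval().
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = ast.literal_eval(raw)
    if not isinstance(parsed, dict):
        raise TypeError(f"expected a dict, got {type(parsed).__name__}")
    missing = [key for key in REQUIRED_KEYS if key not in parsed]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    return parsed


# Placeholder value standing in for the dbutils.notebook.run(...) output.
sample = "{'silverMountPoint': '/mnt/silver', 'goldMountPoint': '/mnt/gold'}"
config_dict = parse_config(sample)
print(config_dict["goldMountPoint"])
```

Failing early here names the missing key in the error, which is easier to trace than a failure in a later read or write cell.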

### 2. Load and Display the League Table

Load the previously saved WSL league table from the silver data lake and display its contents along with the schema for verification.

In [0]:
# Only col is used in this notebook
from pyspark.sql.functions import col

In [0]:
dfWSLTable = spark.read.format('delta').load(silverMountPoint + '/WSL/' + 'league_table')
display(dfWSLTable.limit(20))

dfWSLTable.printSchema()

Position,Season,Team,Played,Won,Drawn,Lost,For,Against,GD,Points
1,2019,Manchester City,16,13,1,2,39,9,30,40
2,2019,Chelsea,15,12,3,0,47,11,36,39
3,2019,Arsenal,15,12,0,3,40,13,27,36
4,2019,Manchester Utd,14,7,2,5,24,12,12,23
5,2019,Reading,14,6,3,5,21,24,-3,21
6,2019,Tottenham,15,6,2,7,15,24,-9,20
7,2019,Everton,14,6,1,7,21,21,0,19
8,2019,West Ham,14,5,1,8,19,34,-15,16
9,2019,Brighton,16,3,4,9,11,30,-19,13
10,2019,Bristol City,14,2,3,9,9,38,-29,9


root
 |-- Position: integer (nullable = true)
 |-- Season: integer (nullable = true)
 |-- Team: string (nullable = true)
 |-- Played: long (nullable = true)
 |-- Won: long (nullable = true)
 |-- Drawn: long (nullable = true)
 |-- Lost: long (nullable = true)
 |-- For: long (nullable = true)
 |-- Against: long (nullable = true)
 |-- GD: long (nullable = true)
 |-- Points: long (nullable = true)
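
Beyond the schema, the displayed rows can be checked against the usual league-table identities: Played = Won + Drawn + Lost, GD = For - Against, and Points = 3 x Won + Drawn. The sketch below applies those checks to three rows copied from the output above, using plain Python tuples rather than Spark. `Row` and `check_row` are hypothetical helpers, and the points rule assumes no points deductions were applied.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    position: int
    season: int
    team: str
    played: int
    won: int
    drawn: int
    lost: int
    goals_for: int
    goals_against: int
    gd: int
    points: int


def check_row(row: Row) -> List[str]:
    """Return the identities the row violates (empty list if consistent)."""
    problems = []
    if row.played != row.won + row.drawn + row.lost:
        problems.append("Played != Won + Drawn + Lost")
    if row.gd != row.goals_for - row.goals_against:
        problems.append("GD != For - Against")
    # Assumes 3 points per win, 1 per draw, and no points deductions.
    if row.points != 3 * row.won + row.drawn:
        problems.append("Points != 3 * Won + Drawn")
    return problems


# Three rows copied from the displayed 2019 output.
rows = [
    Row(1, 2019, "Manchester City", 16, 13, 1, 2, 39, 9, 30, 40),
    Row(2, 2019, "Chelsea", 15, 12, 3, 0, 47, 11, 36, 39),
    Row(3, 2019, "Arsenal", 15, 12, 0, 3, 40, 13, 27, 36),
]

for row in rows:
    print(row.team, check_row(row) or "ok")
```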



### 3. Filter and Save Seasonal Data

#### 2023 WSL Season

Filter the league table for the 2023 season, remove the redundant 'Season' column, and save the result to the gold data lake.

In [0]:
# Keep only the 2023 season rows and drop the now-redundant Season column.

dfWSL2023Table = dfWSLTable.where(col('Season') == 2023).drop(col('Season'))

display(dfWSL2023Table)

Position,Team,Played,Won,Drawn,Lost,For,Against,GD,Points
1,Chelsea,22,18,1,3,71,18,53,55
2,Manchester City,22,18,1,3,61,15,46,55
3,Arsenal,22,16,2,4,53,20,33,50
4,Liverpool,22,12,5,5,36,28,8,41
5,Manchester Utd,22,10,5,7,42,32,10,35
6,Tottenham,22,8,7,7,31,36,-5,31
7,Aston Villa,22,7,3,12,27,43,-16,24
8,Everton,22,6,5,11,24,37,-13,23
9,Brighton,22,5,4,13,26,48,-22,19
10,Leicester City,22,4,6,12,26,45,-19,18


In [0]:
dfWSL2023Table.write.format('delta').mode('overwrite').save(goldMountPoint + '/WSL/' + '2023_league_table')

#### 2022 WSL Season

Filter the league table for the 2022 season, remove the redundant 'Season' column, and save the result to the gold data lake.

In [0]:
# Keep only the 2022 season rows and drop the now-redundant Season column.

dfWSL2022Table = dfWSLTable.where(col('Season') == 2022).drop(col('Season'))

display(dfWSL2022Table)

Position,Team,Played,Won,Drawn,Lost,For,Against,GD,Points
1,Chelsea,22,19,1,2,66,15,51,58
2,Manchester Utd,22,18,2,2,56,12,44,56
3,Arsenal,22,15,2,5,49,16,33,47
4,Manchester City,22,15,2,5,50,25,25,47
5,Aston Villa,22,11,4,7,47,37,10,37
6,Everton,22,9,3,10,29,36,-7,30
7,Liverpool,22,6,5,11,24,39,-15,23
8,West Ham,22,6,3,13,23,44,-21,21
9,Tottenham,22,5,3,14,31,47,-16,18
10,Leicester City,22,5,1,16,15,48,-33,16


In [0]:
dfWSL2022Table.write.format('delta').mode('overwrite').save(goldMountPoint + '/WSL/' + '2022_league_table')

#### 2021 WSL Season

Filter the league table for the 2021 season, remove the redundant 'Season' column, and save the result to the gold data lake.

In [0]:
# Keep only the 2021 season rows and drop the now-redundant Season column.

dfWSL2021Table = dfWSLTable.where(col('Season') == 2021).drop(col('Season'))

display(dfWSL2021Table)

Position,Team,Played,Won,Drawn,Lost,For,Against,GD,Points
1,Chelsea,22,18,2,2,62,11,51,56
2,Arsenal,22,17,4,1,65,10,55,55
3,Manchester City,22,15,2,5,60,22,38,47
4,Manchester Utd,22,12,6,4,45,22,23,42
5,Tottenham,22,9,5,8,24,23,1,32
6,West Ham,22,7,6,9,23,33,-10,27
7,Brighton,22,8,2,12,24,38,-14,26
8,Reading,22,7,4,11,21,40,-19,25
9,Aston Villa,22,6,3,13,13,40,-27,21
10,Everton,22,5,5,12,18,41,-23,20


In [0]:
dfWSL2021Table.write.format('delta').mode('overwrite').save(goldMountPoint + '/WSL/' + '2021_league_table')

#### 2020 WSL Season

Filter the league table for the 2020 season, remove the redundant 'Season' column, and save the result to the gold data lake.

In [0]:
# Keep only the 2020 season rows and drop the now-redundant Season column.

dfWSL2020Table = dfWSLTable.where(col('Season') == 2020).drop(col('Season'))

display(dfWSL2020Table)

Position,Team,Played,Won,Drawn,Lost,For,Against,GD,Points
1,Chelsea,22,18,3,1,69,10,59,57
2,Manchester City,22,17,4,1,65,13,52,55
3,Arsenal,22,15,3,4,63,15,48,48
4,Manchester Utd,22,15,2,5,44,20,24,47
5,Everton,22,9,5,8,39,30,9,32
6,Brighton,22,8,3,11,21,41,-20,27
7,Reading,22,5,9,8,25,41,-16,24
8,Tottenham,22,5,5,12,18,41,-23,20
9,West Ham,22,3,6,13,21,39,-18,15
10,Birmingham City,22,3,6,13,15,44,-29,15


In [0]:
dfWSL2020Table.write.format('delta').mode('overwrite').save(goldMountPoint + '/WSL/' + '2020_league_table')

#### 2019 WSL Season

Filter the league table for the 2019 season, remove the redundant 'Season' column, and save the result to the gold data lake.

In [0]:
# Keep only the 2019 season rows and drop the now-redundant Season column.

dfWSL2019Table = dfWSLTable.where(col('Season') == 2019).drop(col('Season'))

display(dfWSL2019Table)

Position,Team,Played,Won,Drawn,Lost,For,Against,GD,Points
1,Manchester City,16,13,1,2,39,9,30,40
2,Chelsea,15,12,3,0,47,11,36,39
3,Arsenal,15,12,0,3,40,13,27,36
4,Manchester Utd,14,7,2,5,24,12,12,23
5,Reading,14,6,3,5,21,24,-3,21
6,Tottenham,15,6,2,7,15,24,-9,20
7,Everton,14,6,1,7,21,21,0,19
8,West Ham,14,5,1,8,19,34,-15,16
9,Brighton,16,3,4,9,11,30,-19,13
10,Bristol City,14,2,3,9,9,38,-29,9


In [0]:
dfWSL2019Table.write.format('delta').mode('overwrite').save(goldMountPoint + '/WSL/' + '2019_league_table')
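
The five season blocks above differ only in the year, so they could be collapsed into one loop. A minimal sketch follows, assuming the same `dfWSLTable` and `goldMountPoint` as above. `gold_table_path` and `write_season_tables` are hypothetical helpers, not part of the original notebook, and `/mnt/gold` is a placeholder path.

```python
from typing import Iterable


def gold_table_path(gold_mount_point: str, season: int) -> str:
    """Build the gold-lake path used above for one season's table."""
    return f"{gold_mount_point}/WSL/{season}_league_table"


def write_season_tables(df, gold_mount_point: str, seasons: Iterable[int]) -> None:
    """Filter df to each season, drop 'Season', and overwrite its Delta table.

    df is expected to be a Spark DataFrame with a 'Season' column.
    """
    for season in seasons:
        season_df = df.where(df["Season"] == season).drop("Season")
        (
            season_df.write.format("delta")
            .mode("overwrite")
            .save(gold_table_path(gold_mount_point, season))
        )


# In the notebook this would be called as:
# write_season_tables(dfWSLTable, goldMountPoint, range(2019, 2024))
print(gold_table_path("/mnt/gold", 2023))
```

The loop writes to the same paths as the individual cells, so downstream readers of the gold tables would not need to change. It does skip the per-season `display` calls, which can be added back inside the loop if the visual check is still wanted.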