# Loading and Inspecting Women's Football Matches Data Using Spark

## Overview

This notebook loads a structured dataset of women's football matches into a Spark DataFrame and inspects it. The dataset is a CSV file, read with a predefined schema that fixes the column names and data types. The data is then transformed and saved in Delta format to the bronze data lake.

## Leagues and Seasons

- Women Soccer League (2019 - 2023)
- Women Serie A (2019 - 2023)
- Frauen Bundesliga (2021 - 2023)
- Liga F (2022 - 2023)
- FFF Feminine (2019 - 2023)

## Prerequisites

Before starting, ensure you have:

- Apache Spark installed and correctly set up.
- Access to the dataset file (women_football_matches_main.csv) at the root of the base data lake mount point.
- A basic understanding of PySpark, its DataFrame API, and Databricks notebooks.
- The 'config' notebook properly configured and accessible in your Databricks workspace.

## Steps

### 1. Execute Configuration Notebook

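Run the 'config' notebook (../storage/01-config) with a 180-second timeout to retrieve the mount points for the base and bronze data lakes. The notebook returns its result as a string, which is parsed into a dictionary.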

In [0]:
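# Retrieve configuration details, including mount points for base and bronze data lakes.
# ast.literal_eval parses the returned string without executing arbitrary code
# (this assumes the config notebook returns a plain dict literal).

import ast

config = dbutils.notebook.run('../storage/01-config', 180)
config_dict = ast.literal_eval(config)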

baseMountPoint = config_dict["baseMountPoint"]
bronzeMountPoint = config_dict["bronzeMountPoint"]
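
A missing key in the returned configuration would only surface later as a KeyError in the middle of the notebook. The sketch below shows one way to parse and validate the string up front. It assumes the config notebook returns a plain dict literal; parse_config, REQUIRED_KEYS, and the sample value '/mnt/base' are hypothetical and used for illustration only.

```python
import ast

# Keys this notebook reads from the configuration.
REQUIRED_KEYS = ("baseMountPoint", "bronzeMountPoint")


def parse_config(raw):
    """Parse the string returned by the config notebook and check its keys."""
    parsed = ast.literal_eval(raw)  # accepts literals only, never runs code
    if not isinstance(parsed, dict):
        raise TypeError("expected a dict literal, got " + type(parsed).__name__)
    missing = [key for key in REQUIRED_KEYS if key not in parsed]
    if missing:
        raise KeyError("config is missing keys: " + ", ".join(missing))
    return parsed


# Made-up return value; the real one comes from dbutils.notebook.run(...).
sample = "{'baseMountPoint': '/mnt/base', 'bronzeMountPoint': '/mnt/bronze'}"
config_dict = parse_config(sample)
print(config_dict["bronzeMountPoint"])
```

Failing early here keeps the error next to its cause, rather than at the first read or write that uses a mount point.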

### 2. Define Schema

Then, define a schema for the football matches dataset, specifying column names, types, and nullability.

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType
from pyspark.sql.functions import col, substring, lit, concat, when

In [0]:
# Define a schema for the football matches dataset, specifying column names, types, and whether null values are allowed.

fixtureSchema = StructType([
    StructField('MatchId', IntegerType(), True),
    StructField('League', StringType(), True),
    StructField('Season', IntegerType(), True),
    StructField('MatchDay', StringType(), True),
    StructField('HT', StringType(), True),
    StructField('AT', StringType(), True),
    StructField('FTHG', IntegerType(), True),
    StructField('FTAG', IntegerType(), True),
    StructField('FTR', StringType(), True),
    StructField('ATTD', IntegerType(), True)
])
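
Spark's DataFrameReader.schema also accepts a DDL-formatted string, which can be easier to keep in a config file than a StructType. The sketch below builds such a string for the same ten columns. FIXTURE_COLUMNS and fixture_ddl are hypothetical names, and the column names are backtick-quoted as a precaution because short names such as AT can clash with SQL keywords.

```python
# Column names and Spark SQL types mirroring fixtureSchema above.
FIXTURE_COLUMNS = [
    ("MatchId", "INT"),
    ("League", "STRING"),
    ("Season", "INT"),
    ("MatchDay", "STRING"),
    ("HT", "STRING"),
    ("AT", "STRING"),
    ("FTHG", "INT"),
    ("FTAG", "INT"),
    ("FTR", "STRING"),
    ("ATTD", "INT"),
]

# Join the pairs into one DDL string, quoting each column name.
fixture_ddl = ", ".join(
    "`{}` {}".format(name, dtype) for name, dtype in FIXTURE_COLUMNS
)
print(fixture_ddl)

# In the notebook this string could be passed in place of the StructType:
#   spark.read.format('csv').schema(fixture_ddl).load(...)
```

Columns declared this way are nullable, which matches the True flags in the StructType definition.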

### 3. Load Dataset

Use the baseMountPoint to load the CSV file into a Spark DataFrame with the defined schema.

In [0]:
# Load and display the women's football matches data from CSV, adhering to the predefined schema.

dfMatches = spark.read.format('csv').schema(fixtureSchema).load(baseMountPoint + '/women_football_matches_main.csv')
display(dfMatches.limit(50))

dfMatches.printSchema()

MatchId,League,Season,MatchDay,HT,AT,FTHG,FTAG,FTR,ATTD
101658,D1 FFF,2019,190824 1330,Bordeaux,Fleury,4,1,H,
102633,D1 FFF,2019,190824 1330,Guingamp,Metz,2,0,H,
105384,D1 FFF,2019,190824 1330,Paris FC,Dijon,3,1,H,
107191,D1 FFF,2019,190824 1330,Montpellier,Stade de Reims,2,0,H,
107631,D1 FFF,2019,190824 1430,Lyon,Ol. Marseille,6,0,H,
108229,D1 FFF,2019,190825 1345,Paris S-G,Soyaux,7,0,H,
109491,D1 FFF,2019,190907 1330,Fleury,Montpellier,1,3,A,
109961,D1 FFF,2019,190907 1330,Ol. Marseille,Guingamp,2,1,H,
113735,D1 FFF,2019,190907 1330,Soyaux,Paris FC,0,0,D,
116885,D1 FFF,2019,190907 1530,Stade de Reims,Lyon,3,8,A,


root
 |-- MatchId: integer (nullable = true)
 |-- League: string (nullable = true)
 |-- Season: integer (nullable = true)
 |-- MatchDay: string (nullable = true)
 |-- HT: string (nullable = true)
 |-- AT: string (nullable = true)
 |-- FTHG: integer (nullable = true)
 |-- FTAG: integer (nullable = true)
 |-- FTR: string (nullable = true)
 |-- ATTD: integer (nullable = true)
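
The load call above does not set the CSV header option, so Spark treats the first line of the file as data. If the file did start with a header row, that row would appear with nulls in the integer columns. The sketch below is a plain-Python check on the first line of a file, not part of the original notebook. first_line_looks_like_header and INTEGER_COLUMNS are hypothetical names, and the column positions are taken from the schema above.

```python
import csv

# Zero-based positions of the integer columns in fixtureSchema:
# MatchId, Season, FTHG, FTAG, ATTD.
INTEGER_COLUMNS = (0, 2, 6, 7, 9)


def _is_int(value):
    """Return True if the text parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def first_line_looks_like_header(line):
    """Guess whether a CSV line holds column names rather than match data."""
    fields = next(csv.reader([line]))
    for index in INTEGER_COLUMNS:
        if index < len(fields):
            value = fields[index]
            # Empty values are allowed (ATTD is blank in the sample rows).
            if value != "" and not _is_int(value):
                return True
    return False


header = "MatchId,League,Season,MatchDay,HT,AT,FTHG,FTAG,FTR,ATTD"
row = "101658,D1 FFF,2019,190824 1330,Bordeaux,Fleury,4,1,H,"
print(first_line_looks_like_header(header), first_line_looks_like_header(row))
```

If the check flags a header, setting the reader's header option to true would be the usual remedy.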



### 4. Data Transformation

Apply a series of data transformations to prepare the dataset for further analysis.

In [0]:
# Split 'MatchDay' (format 'yyMMdd HHmm', e.g. '190824 1330') into separate
# 'MatchDate' (dd-MM-yy) and 'MatchTime' (HH:mm) columns.
# substring() positions are 1-based; position 7 is the space separator.

dfMatches = dfMatches.withColumn('MatchDate', concat(
        substring(col('MatchDay'), 5, 2), lit('-'),   # day
        substring(col('MatchDay'), 3, 2), lit('-'),   # month
        substring(col('MatchDay'), 1, 2)              # two-digit year
        )) \
    .withColumn('MatchTime', concat(
        substring(col('MatchDay'), 8, 2), lit(':'),   # hour
        substring(col('MatchDay'), 10, 2)             # minute
        ))

display(dfMatches.limit(20))

MatchId,League,Season,MatchDay,HT,AT,FTHG,FTAG,FTR,ATTD,MatchDate,MatchTime
101658,D1 FFF,2019,190824 1330,Bordeaux,Fleury,4,1,H,,24-08-19,13:30
102633,D1 FFF,2019,190824 1330,Guingamp,Metz,2,0,H,,24-08-19,13:30
105384,D1 FFF,2019,190824 1330,Paris FC,Dijon,3,1,H,,24-08-19,13:30
107191,D1 FFF,2019,190824 1330,Montpellier,Stade de Reims,2,0,H,,24-08-19,13:30
107631,D1 FFF,2019,190824 1430,Lyon,Ol. Marseille,6,0,H,,24-08-19,14:30
108229,D1 FFF,2019,190825 1345,Paris S-G,Soyaux,7,0,H,,25-08-19,13:45
109491,D1 FFF,2019,190907 1330,Fleury,Montpellier,1,3,A,,07-09-19,13:30
109961,D1 FFF,2019,190907 1330,Ol. Marseille,Guingamp,2,1,H,,07-09-19,13:30
113735,D1 FFF,2019,190907 1330,Soyaux,Paris FC,0,0,D,,07-09-19,13:30
116885,D1 FFF,2019,190907 1530,Stade de Reims,Lyon,3,8,A,,07-09-19,15:30
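
The substring offsets are easy to get wrong because Spark counts positions from 1 while Python slices count from 0. The sketch below reproduces the split in plain Python in two ways, by offsets and by parsing the value as a date, so the two can be compared on sample values. Both function names are hypothetical and neither is part of the notebook.

```python
from datetime import datetime


def split_by_offsets(match_day):
    """Mirror the Spark substring logic with 0-based Python slices."""
    match_date = "{}-{}-{}".format(
        match_day[4:6],  # day, Spark position 5
        match_day[2:4],  # month, Spark position 3
        match_day[0:2],  # two-digit year, Spark position 1
    )
    match_time = "{}:{}".format(match_day[7:9], match_day[9:11])
    return match_date, match_time


def split_by_parsing(match_day):
    """Parse the value as a real timestamp, then format both parts."""
    parsed = datetime.strptime(match_day, "%y%m%d %H%M")
    return parsed.strftime("%d-%m-%y"), parsed.strftime("%H:%M")


for sample in ("190824 1330", "190825 1345", "190907 1530"):
    assert split_by_offsets(sample) == split_by_parsing(sample)
    print(sample, "->", split_by_offsets(sample))
```

Unlike slicing, the parsing variant raises an error on malformed input such as a month of 13, which can help when validating new data.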


In [0]:
# Rename columns for clarity and consistency.

dfMatches = dfMatches.withColumnsRenamed({
    'HT': 'HomeTeam',
    'AT': 'AwayTeam',
    'FTHG': 'HomeTeamGoals',
    'FTAG': 'AwayTeamGoals',
    'FTR': 'MatchResult',
    'ATTD': 'Attendance'
})

display(dfMatches.limit(20))

MatchId,League,Season,MatchDay,HomeTeam,AwayTeam,HomeTeamGoals,AwayTeamGoals,MatchResult,Attendance,MatchDate,MatchTime
101658,D1 FFF,2019,190824 1330,Bordeaux,Fleury,4,1,H,,24-08-19,13:30
102633,D1 FFF,2019,190824 1330,Guingamp,Metz,2,0,H,,24-08-19,13:30
105384,D1 FFF,2019,190824 1330,Paris FC,Dijon,3,1,H,,24-08-19,13:30
107191,D1 FFF,2019,190824 1330,Montpellier,Stade de Reims,2,0,H,,24-08-19,13:30
107631,D1 FFF,2019,190824 1430,Lyon,Ol. Marseille,6,0,H,,24-08-19,14:30
108229,D1 FFF,2019,190825 1345,Paris S-G,Soyaux,7,0,H,,25-08-19,13:45
109491,D1 FFF,2019,190907 1330,Fleury,Montpellier,1,3,A,,07-09-19,13:30
109961,D1 FFF,2019,190907 1330,Ol. Marseille,Guingamp,2,1,H,,07-09-19,13:30
113735,D1 FFF,2019,190907 1330,Soyaux,Paris FC,0,0,D,,07-09-19,13:30
116885,D1 FFF,2019,190907 1530,Stade de Reims,Lyon,3,8,A,,07-09-19,15:30
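
withColumnsRenamed was added in PySpark 3.4, so the cell above fails on older runtimes. A minimal fallback sketch is shown below. It chains withColumnRenamed, which has been available for much longer, over the same mapping. rename_columns is a hypothetical helper, not part of the notebook.

```python
from functools import reduce

# The same mapping used in the cell above.
COLUMN_RENAMES = {
    "HT": "HomeTeam",
    "AT": "AwayTeam",
    "FTHG": "HomeTeamGoals",
    "FTAG": "AwayTeamGoals",
    "FTR": "MatchResult",
    "ATTD": "Attendance",
}


def rename_columns(df, mapping):
    """Apply withColumnRenamed once per (old, new) pair and return the result."""
    return reduce(
        lambda current, pair: current.withColumnRenamed(pair[0], pair[1]),
        mapping.items(),
        df,
    )


# In the notebook this would be called as:
#   dfMatches = rename_columns(dfMatches, COLUMN_RENAMES)
```

withColumnRenamed is a no-op when the old name is absent, so the helper does not fail on a DataFrame that has already been renamed.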


In [0]:
# Add binary outcome indicators for home wins, away wins, and draws based on the 'MatchResult' column.

dfMatches = dfMatches.withColumn('HomeWin', when(col('MatchResult') == 'H', 1).otherwise(0)) \
    .withColumn('AwayWin', when(col('MatchResult') == 'A', 1).otherwise(0)) \
    .withColumn('Draw', when(col('MatchResult') == 'D', 1).otherwise(0))

display(dfMatches.limit(20))

MatchId,League,Season,MatchDay,HomeTeam,AwayTeam,HomeTeamGoals,AwayTeamGoals,MatchResult,Attendance,MatchDate,MatchTime,HomeWin,AwayWin,Draw
101658,D1 FFF,2019,190824 1330,Bordeaux,Fleury,4,1,H,,24-08-19,13:30,1,0,0
102633,D1 FFF,2019,190824 1330,Guingamp,Metz,2,0,H,,24-08-19,13:30,1,0,0
105384,D1 FFF,2019,190824 1330,Paris FC,Dijon,3,1,H,,24-08-19,13:30,1,0,0
107191,D1 FFF,2019,190824 1330,Montpellier,Stade de Reims,2,0,H,,24-08-19,13:30,1,0,0
107631,D1 FFF,2019,190824 1430,Lyon,Ol. Marseille,6,0,H,,24-08-19,14:30,1,0,0
108229,D1 FFF,2019,190825 1345,Paris S-G,Soyaux,7,0,H,,25-08-19,13:45,1,0,0
109491,D1 FFF,2019,190907 1330,Fleury,Montpellier,1,3,A,,07-09-19,13:30,0,1,0
109961,D1 FFF,2019,190907 1330,Ol. Marseille,Guingamp,2,1,H,,07-09-19,13:30,1,0,0
113735,D1 FFF,2019,190907 1330,Soyaux,Paris FC,0,0,D,,07-09-19,13:30,0,0,1
116885,D1 FFF,2019,190907 1530,Stade de Reims,Lyon,3,8,A,,07-09-19,15:30,0,1,0
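
Because each indicator falls back to otherwise(0), a row whose MatchResult is null or holds an unexpected code gets 0 in all three columns instead of raising an error. The plain-Python sketch below mirrors that logic to make the edge case visible. outcome_flags is a hypothetical function written for illustration only.

```python
def outcome_flags(match_result):
    """Mirror the three when(...).otherwise(0) expressions for one row."""
    return {
        "HomeWin": 1 if match_result == "H" else 0,
        "AwayWin": 1 if match_result == "A" else 0,
        "Draw": 1 if match_result == "D" else 0,
    }


for result in ("H", "A", "D", None, "X"):
    flags = outcome_flags(result)
    # Valid codes set exactly one flag; anything else sets none.
    print(result, flags, "sum =", sum(flags.values()))
```

A quick data-quality check in the notebook would therefore be to confirm that HomeWin + AwayWin + Draw equals 1 on every row.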


In [0]:
# Remove 'MatchDay', 'MatchTime', and 'Attendance' columns to streamline the DataFrame.

dfMatches = dfMatches.drop('MatchDay', 'MatchTime', 'Attendance')

display(dfMatches.limit(20))

MatchId,League,Season,HomeTeam,AwayTeam,HomeTeamGoals,AwayTeamGoals,MatchResult,MatchDate,HomeWin,AwayWin,Draw
101658,D1 FFF,2019,Bordeaux,Fleury,4,1,H,24-08-19,1,0,0
102633,D1 FFF,2019,Guingamp,Metz,2,0,H,24-08-19,1,0,0
105384,D1 FFF,2019,Paris FC,Dijon,3,1,H,24-08-19,1,0,0
107191,D1 FFF,2019,Montpellier,Stade de Reims,2,0,H,24-08-19,1,0,0
107631,D1 FFF,2019,Lyon,Ol. Marseille,6,0,H,24-08-19,1,0,0
108229,D1 FFF,2019,Paris S-G,Soyaux,7,0,H,25-08-19,1,0,0
109491,D1 FFF,2019,Fleury,Montpellier,1,3,A,07-09-19,0,1,0
109961,D1 FFF,2019,Ol. Marseille,Guingamp,2,1,H,07-09-19,1,0,0
113735,D1 FFF,2019,Soyaux,Paris FC,0,0,D,07-09-19,0,0,1
116885,D1 FFF,2019,Stade de Reims,Lyon,3,8,A,07-09-19,0,1,0


### 5. Persisting Transformed Football Match Data to _Delta_ Format

Save the transformed and cleaned football match dataset in Delta format to the bronze data lake.

In [0]:
# Write the transformed file to the bronze data lake.

dfMatches.write.format('delta').mode('overwrite').save(bronzeMountPoint + '/women_football_matches')

### 6. Listing Contents of the Bronze Data Lake

List the files and directories in the bronze data lake to verify the successful storage of the transformed football match dataset.

In [0]:
%fs ls /mnt/bronze/

path,name,size,modificationTime
dbfs:/mnt/bronze/women_football_matches/,women_football_matches/,0,1718930366000
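
The modificationTime column in the listing can be used to confirm that the directory was written by this run. The sketch below converts the value to a readable UTC timestamp, assuming it is expressed in milliseconds since the Unix epoch. to_utc is a hypothetical helper.

```python
from datetime import datetime, timezone


def to_utc(modification_time_ms):
    """Convert an epoch value in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(modification_time_ms / 1000, tz=timezone.utc)


# Value taken from the listing above.
written_at = to_utc(1718930366000)
print(written_at.isoformat())  # 2024-06-21T00:39:26+00:00
```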
