## Data Preparation

The batch pipeline that runs later executes a number of queries on this data. Before that, the data is lightly preprocessed here: it is reshaped into a more relational format and the data quality issues noted along the way are fixed. The cleaned tables are then written back to a Google Cloud Storage bucket, ready for the batch pipeline.

#### Batch job to get data

In [2]:
from pyspark.sql import SparkSession
from pyspark import SparkConf

# Setting Spark configuration
sparkConf = SparkConf()
sparkConf.setMaster("spark://spark-master:7077")
sparkConf.setAppName("Assignment2-BatchDataPrep")
sparkConf.set("spark.driver.memory", "2g")
sparkConf.set("spark.executor.cores", "1")
sparkConf.set("spark.driver.cores", "1")
# Create the spark session, which is the entry point to Spark SQL engine
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()

# Set up the Hadoop file system configuration for the gs:// scheme
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

# Google Cloud Storage file paths; adapt these to your own bucket
athletes_gsc_file_path = 'gs://dejads_input_assignment2_team1/athletes.csv' 
coaches_gsc_file_path = 'gs://dejads_input_assignment2_team1/coaches.csv' 
tech_offic_gsc_file_path = 'gs://dejads_input_assignment2_team1/technical_officials.csv' 

# Create data frames, print their schemas and show the first row
athletes_df = spark.read.format("csv").option("header", "true") \
       .load(athletes_gsc_file_path)
athletes_df.printSchema()
athletes_df.show(1)

coaches_df = spark.read.format("csv").option("header", "true") \
       .load(coaches_gsc_file_path)
coaches_df.printSchema()
coaches_df.show(1)

tech_offic_df = spark.read.format("csv").option("header", "true") \
       .load(tech_offic_gsc_file_path)
tech_offic_df.printSchema()
tech_offic_df.show(1)

root
 |-- name: string (nullable = true)
 |-- short_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- birth_date: string (nullable = true)
 |-- birth_place: string (nullable = true)
 |-- birth_country: string (nullable = true)
 |-- country: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- discipline: string (nullable = true)
 |-- discipline_code: string (nullable = true)
 |-- height_m/ft: string (nullable = true)
 |-- url: string (nullable = true)

+---------------+----------+------+----------+-----------+-------------+-------+------------+------------+---------------+-----------+--------------------+
|           name|short_name|gender|birth_date|birth_place|birth_country|country|country_code|  discipline|discipline_code|height_m/ft|                 url|
+---------------+----------+------+----------+-----------+-------------+-------+------------+------------+---------------+-----------+--------------------+
|AALERUD Katrine| AALERUD K|Fe

#### Fixing height column

The first step is to check that the data is laid out in a way that supports analysis. The athletes dataset has a single height column, height_m/ft, that holds the height in both metres and feet. Since this is the same information twice, only the height in metres is extracted and stored in a new numeric column, and the original column is dropped.
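To make the extraction step concrete, here is a minimal plain-Python sketch of the same idea, independent of Spark. The sample value `"1.83/6'0''"` is an assumed format (metres before the slash), and `height_in_metres` is a hypothetical helper, not part of the notebook.

```python
from typing import Optional


def height_in_metres(raw: Optional[str]) -> Optional[float]:
    """Return the part before '/' as a float, or None if it cannot be parsed."""
    if raw is None:
        return None
    metres_part = raw.split("/")[0].strip()
    try:
        return float(metres_part)
    except ValueError:
        # Unparseable text becomes None, similar to a failed cast to double in Spark
        return None


# Assumed sample values, for illustration only
print(height_in_metres("1.83/6'0''"))
print(height_in_metres(None))
print(height_in_metres("n/a"))
```

Returning None for unparseable input is a deliberate choice here, so that bad values end up as missing data rather than raising an error part-way through a batch.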

In [3]:
from pyspark.sql.functions import split, col
from pyspark.sql.types import DoubleType

# Split the string on '/' to get the height in metres, cast it to double,
# store it in a new column and drop the original height string column
athletes_df = (
    athletes_df
    .withColumn('height_m', split(athletes_df['height_m/ft'], '/').getItem(0).cast(DoubleType()))
    .drop(col("height_m/ft"))
)

# Show new schema and example of new column in dataframe
athletes_df.printSchema()
athletes_df.show(5)

root
 |-- name: string (nullable = true)
 |-- short_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- birth_date: string (nullable = true)
 |-- birth_place: string (nullable = true)
 |-- birth_country: string (nullable = true)
 |-- country: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- discipline: string (nullable = true)
 |-- discipline_code: string (nullable = true)
 |-- url: string (nullable = true)
 |-- height_m: double (nullable = true)

+-----------------+----------+------+----------+-----------+-------------+-------+------------+-------------------+---------------+--------------------+--------+
|             name|short_name|gender|birth_date|birth_place|birth_country|country|country_code|         discipline|discipline_code|                 url|height_m|
+-----------------+----------+------+----------+-----------+-------------+-------+------------+-------------------+---------------+--------------------+--------+
|  AALERUD Katr

In [4]:
from pyspark.sql.functions import col, when

# Checking for implausible heights in metres, i.e. taller than the tallest person
# in the world (2.52 m) or shorter than the shortest person in the world (0.54 m)
# Replacing any that are found with null
athletes_df = athletes_df.withColumn(
    "height_m",
    when(
        col("height_m") < 0.54,
        None
    ).when(
        col("height_m") > 2.52,
        None
    ).otherwise(col("height_m")))
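The range check above can be sketched in plain Python as follows. The bounds 0.54 and 2.52 are taken from the cell above; the constant names and the `plausible_height` helper are hypothetical and only meant to show that values exactly on a bound are kept while missing values stay missing.

```python
from typing import Optional

# Bounds in metres, copied from the notebook cell above
MIN_HEIGHT_M = 0.54
MAX_HEIGHT_M = 2.52


def plausible_height(height_m: Optional[float]) -> Optional[float]:
    """Return the height unchanged if it is within bounds, otherwise None."""
    if height_m is None:
        return None
    if height_m < MIN_HEIGHT_M or height_m > MAX_HEIGHT_M:
        return None
    return height_m


# Made-up sample values: one valid, two out of range, one missing
print([plausible_height(h) for h in (1.80, 0.10, 3.00, None)])
```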

#### Fixing country column

The athletes table contains both country and country_code, whereas the coaches table has only country_code and the technical officials table has only country. The country names are therefore extracted into a separate country table, and country_code is used as the reference in all other tables for consistency.
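The normalisation described here can be illustrated with a small plain-Python sketch. All records below are made up for illustration, and names such as `country_table` and `code_by_name` are hypothetical; the actual notebook does the same thing with Spark data frames and joins.

```python
# Made-up records mirroring the shape of the three source tables
athletes = [
    {"name": "Athlete A", "country_code": "NOR", "country": "Norway"},
    {"name": "Athlete B", "country_code": "JPN", "country": "Japan"},
    {"name": "Athlete C", "country_code": "NOR", "country": "Norway"},
]
officials = [{"name": "Official X", "country": "Japan"}]

# 1. Extract the distinct (code, name) pairs into a separate country table
country_table = sorted({(a["country_code"], a["country"]) for a in athletes})

# 2. Build a lookup from country name to country code
code_by_name = {name: code for code, name in country_table}

# 3. Drop the redundant name column from athletes
athletes_norm = [{k: v for k, v in a.items() if k != "country"} for a in athletes]

# 4. Replace the country name in officials with its code (None if unmatched)
officials_norm = [
    {"name": o["name"], "country_code": code_by_name.get(o["country"])}
    for o in officials
]

print(country_table)
print(officials_norm)
```

Using `dict.get` in step 4 plays the role of a left join: an official whose country has no match keeps their row and simply gets a missing code.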

In [5]:
# Extracting country and country code from athletes table
country_df = athletes_df.select('country_code', 'country')
# Removing duplicate rows
country_df = country_df.distinct()
country_df.orderBy("country").show(5)

+------------+--------------+
|country_code|       country|
+------------+--------------+
|         AFG|   Afghanistan|
|         ALB|       Albania|
|         ALG|       Algeria|
|         ASA|American Samoa|
|         AND|       Andorra|
+------------+--------------+
only showing top 5 rows



In [6]:
# Showing any country_codes or countries that appear twice in the dataframe
dup_ccode_df = country_df.groupBy("country_code").count().where("count > 1").drop("count")
dup_ccode_df.show()
dup_cty_df = country_df.groupBy("country").count().where("count > 1").drop("count")
dup_cty_df.show()
if dup_ccode_df.count() > 0 or dup_cty_df.count() > 0:
    print('There are duplicates in the country_code and/ or country column.')
else:
    print('There are no duplicates in the country_code and country column.')

+------------+
|country_code|
+------------+
+------------+

+-------+
|country|
+-------+
+-------+

There are no duplicates in the country_code and country column.


In [7]:
from pyspark.sql.functions import length

# Checking if any countries have a country_code in their country column
country_df.where(length(col("country")) <= 3).show()

+------------+-------+
|country_code|country|
+------------+-------+
|         EOR|    EOR|
|         LBN|    LBN|
|         ROC|    ROC|
+------------+-------+



In [8]:
# Correcting country names that were identified by manual inspection of the data
country_df = country_df.withColumn(
    "country",
    when(
        col("country_code") == 'LBN',
        'Lebanon'
    ).when(
        col("country_code") == 'ROC',
        'Russian Federation'
    ).when(
        col("country_code") == 'EOR',
        'Refugee Olympic Team'
    ).when(
        col("country_code") == 'SAM',
        'Samoa'
    ).otherwise(col("country")))

In [9]:
# Removing country column in athletes table now that it is present in the country dataframe
athletes_df = athletes_df.drop('country')

# Renaming birth countries so that they match the names used in the country dataframe (noted from observation)
athletes_df = athletes_df.withColumn(
    "birth_country",
    when(
        col("birth_country") == 'Dominica',
        'Dominique'
    ).when(
        col("birth_country") == 'USSR',
        'Russian Federation'
    ).when(
        col("birth_country") == 'Democratic Republic of Timor-Leste',
        'Timor-Leste'
    ).otherwise(col("birth_country")))

In [10]:
# Replacing birth_country column in athletes table with a birth country code
# Joining birth_country with country from country dataframe (using left join to not remove data)
joinExpression = athletes_df["birth_country"] == country_df['country']
athletes_df = athletes_df.join(country_df.withColumnRenamed('country_code', 'birth_country_code'), joinExpression, "left_outer").drop('country')

# Checking if there are any rows which had a birth country but now have no birth country code
# If no such rows are found then the birth_country column can be dropped
if athletes_df.filter(athletes_df.birth_country_code.isNull() & athletes_df.birth_country.isNotNull()).count() > 0:
    print('There is at least one row which had a birth country value that has no matching birth country code.')
else:
    athletes_df = athletes_df.drop('birth_country')
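The check above only reports that at least one value failed to match. When that happens it can help to list the offending values so they can be added to the manual corrections. A minimal plain-Python sketch of that idea follows; `unmatched_names` and the sample rows are hypothetical and not part of the notebook.

```python
def unmatched_names(rows, code_by_name, name_key):
    """Return the sorted distinct non-null names that have no code in the lookup."""
    return sorted({
        row[name_key]
        for row in rows
        if row[name_key] is not None and row[name_key] not in code_by_name
    })


# Made-up lookup and rows, for illustration only
code_by_name = {"Norway": "NOR", "Japan": "JPN"}
rows = [
    {"name": "Athlete A", "birth_country": "Norway"},
    {"name": "Athlete B", "birth_country": None},
    {"name": "Athlete C", "birth_country": "Atlantis"},
]

print(unmatched_names(rows, code_by_name, "birth_country"))
```

Rows with a missing birth country are ignored on purpose, matching the notebook's condition that only rows which had a value but received no code count as a problem.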

In [11]:
# Checking if the coaches table has a country code not already present in the country table; if any are found they would need to be added
country_codes = country_df.select("country_code").rdd.flatMap(lambda x: x).collect()
if((coaches_df.where(~col('country_code').isin(country_codes))).count() > 0):
    print('There is at least one country code in the coaches table that is not present in the country table.')

In [12]:
# Updating these incorrect countries in technical officials table- noted from observation
tech_offic_df = tech_offic_df.withColumn(
    "country",
    when(
        col("country") == 'ROC',
        'Russian Federation'
    ).otherwise(col("country")))

# Replacing country column in technical officials table with country code
# Joining country with country from country dataframe (using left join to not remove data)
tech_offic_df = tech_offic_df.join(country_df, "country", "left_outer")
tech_offic_df.show(5)

# Checking if there are any rows which had a country but now have no country code
# If no such rows are found then the country column can be dropped
if((tech_offic_df.filter(tech_offic_df.country_code.isNull() & (~tech_offic_df.country.isNull()))).count() > 0):
    print('There is at least one row which had a country value that has no matching country code.')
else: 
    tech_offic_df = tech_offic_df.drop('country')

+----------+--------------------+-------------+------+----------+----------------+--------+--------------------+------------+
|   country|                name|   short_name|gender|birth_date|      discipline|function|                 url|country_code|
+----------+--------------------+-------------+------+----------+----------------+--------+--------------------+------------+
|Uzbekistan|        ABAEVA Elena|     ABAEVA E|Female|1966-04-21|       Wrestling|   Judge|../../../en/resul...|         UZB|
|   Morocco|        ABBAR Bachir|      ABBAR B|  Male|1965-05-03|          Boxing|   Judge|../../../en/resul...|         MAR|
|   Morocco| ABDELLATIF Makfouni| ABDELLATIF M|  Male|1972-11-23|          Boxing|   Judge|../../../en/resul...|         MAR|
|     Japan|            ABE Miya|        ABE M|Female|1992-10-27|Beach Volleyball| Referee|../../../en/resul...|         JPN|
|    Uganda|ACIGA FULA Antoni...|ACIGA FULA AS|  Male|1957-11-28|          Boxing|   Judge|../../../en/resul...|      

#### Fixing discipline column

In [13]:
# Extracting discipline and discipline code from athletes table
disc_df = athletes_df.select('discipline_code', 'discipline')
# Removing duplicate rows
disc_df = disc_df.distinct()
disc_df.show(5)

+---------------+--------------------+
|discipline_code|          discipline|
+---------------+--------------------+
|            BOX|                null|
|            WRE|           Wrestling|
|            FBL|            Football|
|            GTR|Trampoline Gymnas...|
|            GLF|                Golf|
+---------------+--------------------+
only showing top 5 rows



In [14]:
# Most of the discipline codes have one row with the discipline name and one row with null
# Here we remove the null rows, but only for discipline codes that have multiple rows
dup_disc_df = disc_df.groupBy("discipline_code").count().where("count > 1").drop("count")
# Collect the duplicated codes from dup_disc_df (not from disc_df, which would list every code)
dup_disc_codes = dup_disc_df.select("discipline_code").rdd.flatMap(lambda x: x).collect()
disc_df = disc_df.filter(((~col('discipline').isNull()) & col('discipline_code').isin(dup_disc_codes)) | (~col('discipline_code').isin(dup_disc_codes)))
# Show dataframe with null rows removed
disc_df.orderBy("discipline").show(5)

# Checking if there are still any discipline codes appearing twice in the dataset
if(disc_df.groupBy("discipline_code").count().where("count > 1").drop("count").count() > 0):
    print('There is at least one discipline code which has two or more associated disciplines.')
# Checking if there are still any disciplines appearing twice in the dataset
if(disc_df.groupBy("discipline").count().where("count > 1").drop("count").count() > 0):
    print('There is at least one discipline which has two or more associated discipline codes.')

+---------------+-------------------+
|discipline_code|         discipline|
+---------------+-------------------+
|            BK3|     3x3 Basketball|
|            ARC|            Archery|
|            GAR|Artistic Gymnastics|
|            SWA|  Artistic Swimming|
|            ATH|          Athletics|
+---------------+-------------------+
only showing top 5 rows
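The intent stated in the comments of the cell above is to drop a null discipline row only when the same code also has a named row. That can be illustrated in plain Python. This is a sketch on made-up pairs; `drop_redundant_nulls` is a hypothetical helper and the code "XYZ" is invented to show the case of a code with no name at all.

```python
from collections import defaultdict


def drop_redundant_nulls(pairs):
    """Keep named rows per code; keep a null row only if the code has no name."""
    names_by_code = defaultdict(set)
    for code, name in pairs:
        names_by_code[code].add(name)

    cleaned = []
    for code, names in names_by_code.items():
        named = sorted(n for n in names if n is not None)
        if named:
            # The code has at least one real name, so its null row is redundant
            cleaned.extend((code, n) for n in named)
        else:
            # The code only ever appears with null, so keep it to avoid losing the code
            cleaned.append((code, None))
    return cleaned


pairs = [("BOX", None), ("BOX", "Boxing"), ("WRE", "Wrestling"), ("XYZ", None)]
print(drop_redundant_nulls(pairs))
```

Keeping the null row for a code with no name means the code is still present in the lookup table, which makes such gaps visible rather than silently removing them.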



In [15]:
# Removing discipline column from athletes table
athletes_df = athletes_df.drop('discipline')

# Joining discipline code to technical officials table
# Dropping the discipline column only if every non-null discipline has a matching discipline code
tech_offic_df = tech_offic_df.join(disc_df, "discipline", "left_outer")
if((tech_offic_df.filter(tech_offic_df.discipline_code.isNull() & (~tech_offic_df.discipline.isNull()))).count() > 0):
    print('There is at least one row in the technical officials table which had a discipline value that has no matching discipline code.')
else: 
    tech_offic_df = tech_offic_df.drop('discipline')
    
# Joining discipline code to coaches table
# Dropping the discipline column only if every non-null discipline has a matching discipline code
coaches_df = coaches_df.join(disc_df, "discipline", "left_outer")
if((coaches_df.filter(coaches_df.discipline_code.isNull() & (~coaches_df.discipline.isNull()))).count() > 0):
    print('There is at least one row in the coaches table which had a discipline value that has no matching discipline code.')
else: 
    coaches_df = coaches_df.drop('discipline')

In [None]:
# Saving all five tables back to the Google Cloud Storage bucket for use by the next batch job
athletes_df.write.format("csv").option("header", "true").mode("overwrite").save("gs://dejads_output_assignment2_team1/athletes_clean.csv") 
coaches_df.write.format("csv").option("header", "true").mode("overwrite").save("gs://dejads_output_assignment2_team1/coaches_clean.csv") 
tech_offic_df.write.format("csv").option("header", "true").mode("overwrite").save("gs://dejads_output_assignment2_team1/tech_offic_clean.csv") 
country_df.write.format("csv").option("header", "true").mode("overwrite").save("gs://dejads_output_assignment2_team1/country.csv") 
disc_df.write.format("csv").option("header", "true").mode("overwrite").save("gs://dejads_output_assignment2_team1/disciplines.csv") 

Further improvements could be made to this dataset, for example creating a separate table for athlete names with the short name as the key. This would follow the same approach as above, so it has not been implemented here.

In [17]:
# Stop the Spark session and its underlying Spark context
spark.stop()