# Data Quality Checks #

This notebook loads the Chess.com and Lichess staging tables from .parquet files and performs these final minor transformations on them:

1. Removing the extra `rated` column from the Chess.com staging table before we union the tables.
2. Creating a `game_id` column by hashing the `game_end_time`, `white_username` and `black_username` columns.
3. Creating the following `*_id` columns using Spark SQL's [SHA1 hash function](https://spark.apache.org/docs/2.3.0/api/sql/index.html#sha1):

* `game_id` hashed from `game_end_time` + `white_username` + `black_username`
* `white_id` hashed from `white_username`
* `black_id` hashed from `black_username`
* `opening_id` hashed from `opening`
* `time_class_id` hashed from `time_class`
* `platform_id` hashed from `platform`

4. We use PySpark's [dropDuplicates](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.dropDuplicates.html) function to remove rows that have the same (SHA1 hashed) `game_id` column.

Currently this removes `50250` rows from the final Chess.com + Lichess combined `games` table dataframe.

5. We use PySpark's `.cast()` function to set the correct data types for the date and rating columns.

Finally, we write the cleaned fact and dim tables to a new set of `.parquet` files so these tables can be loaded by our final `analytics.ipynb` notebook.

In [123]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark import SQLContext
import pandas as pd
import yaml, os

In [124]:
with open(r'config/dl-chessdotcom.yaml') as file:
    config = yaml.load(file,Loader=yaml.SafeLoader)

os.environ['AWS_ACCESS_KEY_ID']=config['aws_access_key_id']
os.environ['AWS_SECRET_ACCESS_KEY']=config['aws_secret_key_id']
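
A key missing from the YAML file otherwise surfaces only as a `KeyError` partway through the notebook. Below is a minimal sketch of an up-front check, assuming the three config keys that the active cells of this notebook read. `missing_config_keys` is a hypothetical helper, not part of the original notebook, and the example values are placeholders.

```python
# Config keys read by the active cells of this notebook.
REQUIRED_KEYS = (
    "aws_access_key_id",
    "aws_secret_key_id",
    "output_data_path_s3",
)


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in required if not config.get(key)]


# Stand-in dict with placeholder values; in the notebook, pass the loaded `config`.
example_config = {"aws_access_key_id": "example-key", "aws_secret_key_id": ""}
print(missing_config_keys(example_config))
```

An empty list means every key is present and non-empty.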

In [125]:
spark = SparkSession \
    .builder \
    .appName("Lichess and Chess.com Staging -> Fact tables") \
    .getOrCreate()

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key",config['aws_access_key_id'])
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key",config['aws_secret_key_id'])

In [126]:
# Set which output path we need - either local or s3.
# output_data = config['output_data_path_local']
output_data = config['output_data_path_s3']
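
The read cells in this notebook append `staging/...` directly to `output_data`, while the write cells append `/dim/...` and `/fact/...`, so the resulting paths depend on whether `output_data` ends in a slash. Here is a sketch of a hypothetical helper, `s3a_path`, that joins the pieces with exactly one slash between them. The bucket name is made up and the helper is not part of the original notebook.

```python
def s3a_path(base, *parts):
    """Join a bucket/prefix and path parts into one s3a:// URI with single slashes."""
    segments = [base.strip("/")] + [part.strip("/") for part in parts]
    return "s3a://" + "/".join(segment for segment in segments if segment)


# "my-bucket/chess/" is a made-up stand-in for config['output_data_path_s3'].
print(s3a_path("my-bucket/chess/", "staging/chessdotcom/games/"))
print(s3a_path("my-bucket/chess", "/dim/openings"))
```

Both calls give the same shape of path whether or not the base has a trailing slash.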

## Load Chess.com Staging Table

In [127]:
df_chessdotcom = spark.read.parquet("s3a://" + output_data + "staging/chessdotcom/games/")  

### Remove extra `rated` column

The Chess.com table has an extra `rated` column that the Lichess table lacks. We remove it so that the two tables have matching columns and can be unioned correctly.

In [128]:
df_chessdotcom = df_chessdotcom.drop("rated") #Remove extra column in Chess.com staging table.

#df_chessdotcom.printSchema()

In [None]:
df_chessdotcom.limit(10).toPandas()

## Load Lichess Staging Table

In [None]:
df_lichess = spark.read.parquet("s3a://" + output_data + "staging/lichess/games/")

In [None]:
#df_lichess.printSchema()

## Combine Chess.com + Lichess `games` Table Dataframes

In [None]:
df_merged = df_chessdotcom.union(df_lichess)

In [None]:
#df_merged.count()

### Create `game_id` column

In [None]:
df_merged_with_game_id = df_merged.withColumn("game_id", F.sha1(F.concat(F.col("game_end_time"),F.col("white_username"),F.col("black_username"))))
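
`F.concat` joins the three values without a separator, and in Spark it returns null when any input is null, so `game_id` is null for such rows as well. The plain-Python sketch below uses `hashlib` to illustrate the separator issue: two different field combinations can concatenate to the same string and therefore share a hash. Joining with a delimiter (in Spark, for example with `F.concat_ws`) is one possible alternative, not what this notebook does. The timestamp and usernames are made up, and `sha1_id` is a hypothetical helper.

```python
import hashlib


def sha1_id(*fields, sep=""):
    """Hex SHA-1 of the fields joined by `sep`, as a 40-character string."""
    return hashlib.sha1(sep.join(fields).encode("utf-8")).hexdigest()


# Made-up values: different username pairs that concatenate identically.
game_a = ("2021-01-01 10:00:00", "anna", "bob")
game_b = ("2021-01-01 10:00:00", "ann", "abob")

print(sha1_id(*game_a) == sha1_id(*game_b))                    # True: the ids collide
print(sha1_id(*game_a, sep="|") == sha1_id(*game_b, sep="|"))  # False
```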

### Create `white_id` and `black_id` columns

In [None]:
df_merged_with_white_id = df_merged_with_game_id.withColumn("white_id", F.sha1(F.col("white_username")))

In [None]:
df_merged_with_black_id = df_merged_with_white_id.withColumn("black_id", F.sha1(F.col("black_username")))

### Create `opening_id` column

In [None]:
df_merged_with_opening_id = df_merged_with_black_id.withColumn("opening_id", F.sha1(F.col("opening")))

### Create `time_class_id` column

In [None]:
df_merged_with_tc_id = df_merged_with_opening_id.withColumn("time_class_id", F.sha1(F.col("time_class")))

### Create `platform_id` column

In [None]:
df_merged_with_platform_id = df_merged_with_tc_id.withColumn("platform_id", F.sha1(F.col("platform")))

### Create `year` column

In [None]:
df_merged_with_year = df_merged_with_platform_id.withColumn("year", F.year(F.col('game_end_date')))

In [None]:
df_merged_with_year.limit(10).toPandas()

In [None]:
df_merged_with_year.count()

## Check Merged `games` Table for Dupes of `game_id` Column and Drop Them

In [None]:
df_dupes_dropped = df_merged_with_year.dropDuplicates(['game_id'])

In [None]:
df_dupes_dropped.count()
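
`dropDuplicates(['game_id'])` keeps one row per `game_id`, and Spark does not specify which one. The number of removed rows is the difference between the counts before and after. Below is a plain-Python sketch of the same bookkeeping on made-up rows. `drop_duplicate_ids` is a hypothetical stand-in that, unlike Spark, always keeps the first occurrence.

```python
def drop_duplicate_ids(rows, key="game_id"):
    """Keep the first row seen for each key value, preserving input order."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept


# Made-up rows; the first and third share a game_id.
rows = [
    {"game_id": "a1", "platform": "lichess"},
    {"game_id": "b2", "platform": "chess.com"},
    {"game_id": "a1", "platform": "lichess"},
]
deduped = drop_duplicate_ids(rows)
print(len(rows) - len(deduped))  # rows removed: 1
```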

## Set Correct Data Types for Date and Rating Columns

In [None]:
df_dupes_dropped = df_dupes_dropped.withColumn("year",df_dupes_dropped.year.cast('int'))
df_dupes_dropped = df_dupes_dropped.withColumn("game_end_time",df_dupes_dropped.game_end_time.cast('timestamp'))
df_dupes_dropped = df_dupes_dropped.withColumn("game_end_date",df_dupes_dropped.game_end_date.cast('timestamp'))
df_dupes_dropped = df_dupes_dropped.withColumn("white_rating",df_dupes_dropped.white_rating.cast('int'))
df_dupes_dropped = df_dupes_dropped.withColumn("black_rating",df_dupes_dropped.black_rating.cast('int'))
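
When Spark's ANSI mode is disabled, casting a string that is not a valid number to `int` yields null instead of raising an error, so malformed ratings can disappear silently. One way to make this visible is to inspect the raw values before casting. The plain-Python sketch below is a rough illustration on made-up rating strings, not a reproduction of Spark's exact parsing rules. `non_integer_values` is a hypothetical helper.

```python
def non_integer_values(values):
    """Return the values that Python cannot parse as a base-10 integer."""
    bad = []
    for value in values:
        try:
            int(value)
        except (TypeError, ValueError):
            bad.append(value)
    return bad


# Made-up rating strings, including a placeholder and a missing value.
ratings = ["1500", "2100", "?", None, " 1834 "]
print(non_integer_values(ratings))  # ['?', None]
```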


### Create and Write Cleaned `openings` Dim Table

In [None]:
openings_dim_table = df_dupes_dropped.select(F.col("opening_id").alias("id"),"opening").distinct()


In [None]:
#openings_dim_table.limit(10).toPandas()

In [None]:
openings_dim_table.write.mode('append').parquet("s3a://" + output_data + "/dim/openings")

### Create and Write Cleaned `players` Dim Table

In [None]:
white_players_table = df_dupes_dropped.select("white_id","white_username").distinct()

In [None]:
black_players_table = df_dupes_dropped.select("black_id","black_username").distinct()

In [None]:
white_black_players_combined = white_players_table.union(black_players_table)

In [None]:
players_dim_table = white_black_players_combined.select(F.col("white_id").alias("id"),F.col("white_username").alias("username")).dropDuplicates(['id'])
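
`union` matches columns by position rather than by name, and the result keeps the column names of the left DataFrame. That is why the combined table is addressed as `white_id` and `white_username` even for rows that came from the black side. The plain-Python sketch below mirrors the same steps on made-up `(id, username)` pairs. `build_players` is a hypothetical helper.

```python
def build_players(white_pairs, black_pairs):
    """Stack (id, username) pairs by position and keep one username per id."""
    players = {}
    for player_id, username in list(white_pairs) + list(black_pairs):
        players.setdefault(player_id, username)
    return players


# Made-up ids and usernames; "p1" plays both colours.
white = [("p1", "anna"), ("p2", "bob")]
black = [("p3", "carla"), ("p1", "anna")]
print(build_players(white, black))
```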

In [None]:
#players_dim_table.limit(10).toPandas()

In [None]:
players_dim_table.write.mode('append').parquet("s3a://" + output_data + "/dim/players")

### Create and Write Cleaned `time_class` Dim Table

In [None]:
time_class_table = df_dupes_dropped.select(F.col("time_class_id").alias("id"),"time_class").distinct()


In [122]:
#time_class_table.limit(10).toPandas()

| | time_class_id | time_class |
|---|---|---|
| 0 | 7bff1b790fcdfa5016c20a07a145631da3fe3cfa | ultraBullet |
| 1 | f915e10481634c0a44492699b2b8a3657c334106 | correspondence |
| 2 | dc94ac81aa982e5382b814d4d883bfdb2ed62ddf | rapid |
| 3 | 2fe14b9b993ea6eb953f8129c7a1edede9792b77 | daily |
| 4 | 4a19573b7e72b7249d2271839c762ffe1f0452f3 | classical |
| 5 | 472da2b94e9fa87badd16a55e1eaec4f53ffc52a | bullet |
| 6 | 6d199aca996a9a8bff542e2bc10e3f0edc62cd07 | blitz |


In [None]:
time_class_table.write.mode('append').parquet("s3a://" + output_data + "/dim/time_class")

### Create and Write Cleaned `platform` Dim Table

In [None]:
platform_table = df_dupes_dropped.select(F.col("platform_id").alias("id"),"platform").distinct()


In [None]:
platform_table.limit(10).toPandas()

In [None]:
platform_table.write.mode('append').parquet("s3a://" + output_data + "/dim/platform")

## Set and Write Cleaned `games` Fact Table

**Warning: depending on your system / Spark cluster resources, this write can take a while!**

In [None]:
df_final_games_table = df_dupes_dropped.select("game_id","year","game_end_time","game_end_date", "time_class_id", "white_id", "white_rating","black_id","black_rating","winner","termination","opening_id","moves","platform_id")

In [None]:
df_final_games_table.printSchema()

In [None]:
#df_final_games_table.limit(10).toPandas()

In [None]:
df_final_games_table.write.mode('append').partitionBy("year","time_class_id").parquet("s3a://" + output_data + "/fact/game")
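
`partitionBy("year","time_class_id")` writes Hive-style `column=value` directories, so each Parquet file lands under a path shaped like the one built below. Because the write uses `mode('append')`, re-running this cell adds the same rows again instead of replacing them. The sketch only formats a directory name. The base path and year are made up, and `partition_dir` is a hypothetical helper.

```python
def partition_dir(base, **partitions):
    """Format a Hive-style partition directory such as base/year=2021/col=value."""
    parts = [f"{column}={value}" for column, value in partitions.items()]
    return "/".join([base.rstrip("/")] + parts)


print(partition_dir(
    "fact/game",
    year=2021,
    # The `blitz` id from the `time_class` output shown earlier in this notebook.
    time_class_id="6d199aca996a9a8bff542e2bc10e3f0edc62cd07",
))
```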