## Coach Instructions

This Python notebook can be used before the Hack to prepare data files for the participants.  The intent is for the coach to get the latest Open Powerlifting data from their website (linked in the challenge files).  To support the challenge structure of implementing an initial load followed by incremental loads, the full data set must be split up.

Starting with the full data set, this notebook will extract the most recent meet activity (by date) and save it to separate files, one per date.  These files will simulate new, incremental data for the Hack team's incremental loads.  All remaining historical data will be saved as the initial data load file.

The coach can decide how many daily incremental files to create by changing the value of the <b>numDaysToSeparate</b> variable.
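
The split described above can be sketched in plain Python, independent of Spark.  This is an illustration only, not the notebook's implementation: the `split_by_recent_dates` function, its `num_days` parameter (standing in for <b>numDaysToSeparate</b>), and the sample rows are hypothetical.  It assumes dates are ISO-formatted strings (YYYY-MM-DD), so that string ordering matches chronological ordering.

```python
def split_by_recent_dates(rows, num_days):
    """Split rows into (initial, daily) using the num_days most recent distinct dates.

    rows: list of dicts, each with a "Date" key holding an ISO date string.
    Returns (initial_rows, daily_rows_by_date).
    """
    # Distinct dates, newest first; ISO date strings sort chronologically.
    recent_dates = sorted({r["Date"] for r in rows}, reverse=True)[:num_days]
    recent = set(recent_dates)

    # One group of rows per recent date (these would become the daily files).
    daily = {d: [r for r in rows if r["Date"] == d] for d in recent_dates}
    # Everything else is the initial load.
    initial = [r for r in rows if r["Date"] not in recent]
    return initial, daily


# Hypothetical sample data, for illustration only.
sample = [
    {"Date": "2021-01-01", "MeetName": "Meet A"},
    {"Date": "2021-01-02", "MeetName": "Meet B"},
    {"Date": "2021-01-02", "MeetName": "Meet C"},
    {"Date": "2021-01-03", "MeetName": "Meet D"},
]
initial, daily = split_by_recent_dates(sample, num_days=2)
print(sorted(daily))  # dates that would become daily files
print(len(initial))   # number of rows left for the initial load
```

Note that the split is by distinct date, not by row count: a date with several meets still produces a single daily file.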

Coach Data Preparation Steps:

1. Create a Synapse Workspace.  Create a Spark pool in the workspace to run this notebook.
2. Save the latest Open Powerlifting data file to the root of the workspace storage account.  Put the path to that file in the <b>pathToOpenPowerliftingCSV</b> variable in the first code cell.
3. Set the number of daily meet result files to create in the <b>numDaysToSeparate</b> variable in the second code cell.
4. Run the notebook.  Daily meet result files (incremental loads) will be created in a subfolder called <b>dailyReports</b> in the workspace storage account.  All remaining historical data will be saved in a single file in a subfolder called <b>initialData</b>.

At the coach's discretion, this work can either be done by the coach before the Hack or by the Hack team as part of Challenge 1.
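
For step 2, the path follows the standard `abfss://<container>@<account>.dfs.core.windows.net/<path>` form used by Azure Data Lake Storage Gen2.  A minimal sketch of assembling it is shown here; the `build_abfss_path` helper and the container and account names are placeholders invented for illustration, not values from the challenge files.

```python
def build_abfss_path(container, storage_account, file_name):
    """Build an abfss:// URI for a file at the root of an ADLS Gen2 container."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{file_name}"


# Placeholder values for illustration; substitute your own container and account.
path = build_abfss_path(
    "mycontainer", "mystorageaccount", "openpowerlifting-2021-01-05.csv"
)
print(path)
```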

### File Format
The format of the Open Powerlifting data file may change over time.  This notebook depends only on the Date and MeetName fields in that file.  Provided those fields are present, with those exact names, the notebook should still produce the desired results regardless of the current format.
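
Before running the notebook against a newly downloaded file, it may be worth confirming that the two required fields are still present.  The following is a minimal sketch using only the Python standard library; the `missing_columns` helper and the sample header lines are hypothetical and are not part of the notebook.

```python
import csv
import io

# The only fields the notebook relies on.
REQUIRED_COLUMNS = {"Date", "MeetName"}


def missing_columns(csv_text):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return REQUIRED_COLUMNS - set(header)


# Hypothetical header lines, for illustration only.
print(missing_columns("Name,Sex,Date,MeetName\n"))  # nothing missing
print(missing_columns("Name,Sex,MeetDate\n"))       # both required fields missing
```

In the notebook itself, the same comparison could be made against `df.columns` once the file has been read.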




In [1]:
## Replace this variable with the full path to the OPL data file in the root container of the Synapse Workspace storage account
pathToOpenPowerliftingCSV = "abfss://<rootContainerName>@<storageAccountName>.dfs.core.windows.net/openpowerlifting-2021-01-05.csv"

df = spark.read.csv(pathToOpenPowerliftingCSV, header=True)
##display(df.limit(10))
df.createOrReplaceTempView("LiftingResults")

In [4]:
numDaysToSeparate = 5

dfMeetsByDate = spark.sql("SELECT Date, COUNT(DISTINCT MeetName) AS Meets, COUNT(*) AS Participants FROM LiftingResults GROUP BY Date ORDER BY Date DESC")
listMostRecentDates = dfMeetsByDate.take(numDaysToSeparate)

##dfMeetsByDate.show()
display(listMostRecentDates)

In [3]:
from notebookutils import mssparkutils
from pyspark.sql.functions import col

##Prep target folder path for daily reports
if list(filter(lambda x : x.name == "dailyReports", mssparkutils.fs.ls("/"))):
    mssparkutils.fs.rm("/dailyReports", True)
mssparkutils.fs.mkdirs("/dailyReports")

##Iterate through the N most recent dates and create a CSV file per date, in the target folder path
for row in listMostRecentDates:
    activityDate = row[0]
    outputFilename = "daily-results-" + activityDate + ".csv"
    outputFullPath = "/" + outputFilename

    df.where(df.Date == activityDate).coalesce(1).write.mode("overwrite").option("header", "true").option("emptyValue", "").csv(outputFullPath)

    files = mssparkutils.fs.ls(outputFullPath)
    partFilename = list(filter(lambda x : x.name.endswith("csv"), files))
    for filename in partFilename:
        mssparkutils.fs.mv(filename.path, "/dailyReports/" + outputFilename)

    mssparkutils.fs.rm(outputFullPath, True)
##Done with daily reports

##Prep target folder path for initial data.  This is simply the Open Powerlifting dataset with the N most recent dates removed
if list(filter(lambda x : x.name == "initialData", mssparkutils.fs.ls("/"))):
    mssparkutils.fs.rm("/initialData", True)
mssparkutils.fs.mkdirs("/initialData")

outputFilename = "openpowerlifting-initial-data.csv"
outputFullPath = "/" + outputFilename

##Filter the dataframe to exclude the N most recent dates and write it out to a single CSV in the /initialData folder
##The last entry is the oldest of the separated dates; take() may return fewer than numDaysToSeparate rows
initialDataBeforeThisDate = listMostRecentDates[-1][0]

dfInitialData = df.where(df.Date < initialDataBeforeThisDate).orderBy(col("Date").desc())
dfInitialData.coalesce(1).write.mode("overwrite").option("header", "true").option("emptyValue", "").csv(outputFullPath)

files = mssparkutils.fs.ls(outputFullPath)
partFilename = list(filter(lambda x : x.name.endswith("csv"), files))
for filename in partFilename:
    mssparkutils.fs.mv(filename.path, "/initialData/" + outputFilename)

mssparkutils.fs.rm(outputFullPath, True)
##Done with initial data