# Extracting duration in seconds from `contentDetails_duration`

In this notebook we will be working with the `songs` table.  
This table has a `contentDetails_duration` column that states the duration of each song. The problem is that its format is not directly usable for analysis or modeling.
The goal of this notebook is to convert it into seconds.

## Loading data

In [0]:
### BEGIN STRIP ###
import os
ACCESS_KEY_ID = os.environ["AWS_ACCESS_KEY_ID"]  # student account access key, read from the environment rather than hard-coded
SECRET_ACCESS_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]  # student account secret key
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", ACCESS_KEY_ID)
hadoop_conf.set("fs.s3a.secret.key", SECRET_ACCESS_KEY)
hadoop_conf.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem") 
S3_RESOURCE = 's3'
SCHEME = 's3'
BUCKET_NAME = 'full-stack-bigdata-datasets'
PREFIX = "Big_Data/YOUTUBE"
INPUT_FILENAME = 'items_selected.parquet'
### END STRIP ###

In [0]:
# Load the file into a PySpark DataFrame
#       Perform the usual checks
### BEGIN STRIP ###
# Parquet files carry their own schema, so the CSV-style `header` / `inferSchema` options are not needed
songs = spark.read.parquet(f"{SCHEME}://{BUCKET_NAME}/{PREFIX}/{INPUT_FILENAME}")
songs.printSchema()
print("Shape:", (songs.count(), len(songs.columns)))
songs.limit(5).toPandas()
### END STRIP ###

| Unnamed: 0 | contentDetails_duration | id | snippet_channelId | snippet_channelTitle | snippet_publishedAt | snippet_title | statistics_commentCount | statistics_dislikeCount | statistics_viewCount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | PT3M33S | t1l8Z6gLPzo | UCUERSOitwgUq_37kGslN96w | VOLO | 2013-07-22T12:09:11Z | VOLO. "L'air d'un con" | 38 | 26 | 223172 |
| 1 | PT7M46S | we5gzZq5Avg | UCson549gpvRhPnJ3Whs5onA | LongWayToDream | 2012-03-17T08:34:30Z | Julian Jeweil - Air Conditionné | 2 | 3 | 13409 |
| 2 | PT3M7S | 49esza4eiK4 | UCcHYZ8Ez4gG_2bHEuBL8IfQ | Downtown Records | 2007-09-08T02:02:07Z | Justice - D.A.N.C.E | 3168 | 780 | 10106655 |
| 3 | PT3M43S | BoO6LfR7ca0 | UCQ0wLCF7u23gZKJkHFs1Tpg | Music Is Our Drug | 2014-01-24T12:52:38Z | Gramatik - Torture (feat. Eric Krasno) | 6 | 0 | 29153 |
| 4 | PT5M | DaH4W1rY9us | UCJsTMPZxYD-Q3kEmL4Qijpg | Harvey Pearson | 2012-12-02T12:41:13Z | Ben Howard - Oats In The Water | 5303 | 1784 | 16488714 |


## First analysis

In [0]:
# We will be using this column a lot
DURATION_COL = 'contentDetails_duration'

In [0]:
# show the first 10 values of the `contentDetails_duration` column
### BEGIN STRIP ###
songs.select(DURATION_COL).show(10)
### END STRIP ###

In [0]:
# Convert the duration column to a unix timestamp
#       then show the first 20 rows (select both original duration and converted duration aliased to `totalSeconds`)
# NOTE: Be careful, you need to escape some characters
### BEGIN STRIP ###
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY") # restore the pre-3.0 datetime parser (Spark 3.0 switched to a stricter one)

from pyspark.sql import functions as F
time_format = "'PT'mm'M'ss'S'"
songs.select(DURATION_COL, F.unix_timestamp(DURATION_COL, time_format).alias('totalSeconds')).show()
### END STRIP ###

Scroll down and look. Can you see anything weird?  
We have null values in `totalSeconds`, which indicates that the conversion failed for those rows.

That's because those values do not follow the pattern we supplied: `PT5M`, for instance, has no seconds component.
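
To see why a single fixed pattern cannot cover every value, here is a small illustration in plain Python. It uses `datetime.strptime` with the format `PT%MM%SS` as a stand-in for the Spark pattern above; this is only an analogy, not Spark's actual parser, and `PT1H2M3S` is a made-up value used to show an hours component.

```python
from datetime import datetime

def matches_fixed_pattern(duration, fmt="PT%MM%SS"):
    """Return True if `duration` fits the fixed minutes-and-seconds pattern."""
    try:
        datetime.strptime(duration, fmt)
        return True
    except ValueError:
        # strptime raises ValueError when the string does not fit the format
        return False

# The first three values come from the table above; the last one is hypothetical
examples = ["PT3M33S", "PT3M7S", "PT5M", "PT1H2M3S"]
results = {d: matches_fixed_pattern(d) for d in examples}
print(results)
```

Only the values that contain exactly a minutes part followed by a seconds part fit the pattern; everything else is rejected, which corresponds to the nulls seen in Spark.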

We'll try to evaluate how many different formats we have. It would be difficult to do this precisely, but a ballpark estimate will do.  
Our strategy will be to compute the length of each duration string and count how many rows share each length.  
Then, for each we will select 3 samples to get a better sense of the kind of formats we're dealing with.

In [0]:
songs.withColumn('duration_length', F.length(DURATION_COL)) \
           .groupBy('duration_length') \
           .agg(F.count('*').alias('count'),
                F.slice(F.collect_list(DURATION_COL), start=1, length=3).alias('values')) \
           .orderBy('duration_length') \
           .collect()

That's a lot! And we're just sampling, there could be more...  
Now is probably a good time to go and look at some documentation. It turns out this duration format follows the ISO 8601 standard.  
Take a look at the [Wikipedia page](http://en.wikipedia.org/wiki/ISO_8601#Durations).

We will first try doing this with Python, and will then solve it with PySpark.  
Doing this with Python's standard library alone wouldn't be easy, unless you know about regexes.  
For now, we will make it simpler by using an external library: [isodate](https://github.com/gweis/isodate/).
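
For the curious, a regex-based parser could look roughly like the sketch below. It assumes durations only contain days, hours, minutes and seconds (no years, months or weeks), and the name `duration_to_seconds` is ours, not part of any library.

```python
import re

# Assumption: only the D, H, M and S designators appear in the data
DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?"
    r"(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)

def duration_to_seconds(duration):
    """Convert an ISO 8601 duration such as 'PT3M33S' to seconds."""
    match = DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {duration!r}")
    # Components missing from the string come back as None and count as 0
    parts = {name: float(value) if value else 0.0
             for name, value in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])

print(duration_to_seconds("PT3M33S"))  # 3 minutes and 33 seconds
```

The pattern is deliberately permissive (every component is optional), so it is worth testing it on real samples before trusting it.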

We will start by selecting a sample of different formats and make sure our Python implementation works on these before moving to PySpark with a UDF.  
And at the end of the notebook, as a bonus, you can try to do it using pure PySpark functions.

We will first build a sample of the different formats we can encounter: for each duration length, collect up to 10 values (some lengths will have fewer), then store the distinct values in a Python list called `samples`.  
If you made a function for the previous assignment, you can probably reuse it here.

In [0]:
samples = songs.withColumn('duration_length', F.length(DURATION_COL)) \
           .groupBy('duration_length') \
           .agg(F.slice(F.collect_list(DURATION_COL), start=1, length=10).alias('values')) \
           .select(F.explode('values')) \
           .distinct() \
           .rdd.map(lambda r: r[0]).collect()

samples[-5:] 

## Parsing with `isodate`
We will use the Python library [isodate](https://github.com/gweis/isodate/) to help us parse these durations in ISO8601 format.  
Once we succeed doing it with regular Python, we can embed this into a PySpark UDF.

In [0]:
# install isodate and import it
### BEGIN STRIP ###
!pip install isodate
import isodate
### END STRIP ###

Now we will create a function that uses `isodate` to parse an ISO8601 duration and convert it to seconds.

In [0]:
# Create a function that parses an ISO8601 duration string: `total_seconds_from_ISO8601_duration`
### BEGIN STRIP ###
def total_seconds_from_ISO8601_duration(duration_ISO8601: str) -> float:
  return isodate.parse_duration(duration_ISO8601).total_seconds()
### END STRIP ###

In [0]:
# convert your sample to seconds using your newly created function: `samples_seconds`
# NOTE: use a list comprehension
### BEGIN STRIP ###
samples_seconds = [total_seconds_from_ISO8601_duration(d) for d in samples]
samples_seconds[:5]
### END STRIP ###

In [0]:
# make sure we have no null values (e.g. count the null values and make sure it sums to 0)
### BEGIN STRIP ###
sum(e is None for e in samples_seconds)
### END STRIP ###

## Using PySpark
That seems to be working. We'll try to use this with PySpark now.

In [0]:
# Convert your function to a UDF: `total_seconds_from_ISO8601_duration_udf`
# NOTE: Beware of the return type
### BEGIN STRIP ###
from pyspark.sql.types import FloatType

total_seconds_from_ISO8601_duration_udf = F.udf(total_seconds_from_ISO8601_duration, FloatType())
### END STRIP ###

In [0]:
# Using your previously defined UDF, add a new column `totalDurationSeconds`: `songs_output`
### BEGIN STRIP ###
songs_output = songs.withColumn('totalDurationSeconds', total_seconds_from_ISO8601_duration_udf(DURATION_COL))
songs_output.limit(5).toPandas()
### END STRIP ###

| Unnamed: 0 | contentDetails_duration | id | snippet_channelId | snippet_channelTitle | snippet_publishedAt | snippet_title | statistics_commentCount | statistics_dislikeCount | statistics_viewCount | totalDurationSeconds |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PT3M33S | t1l8Z6gLPzo | UCUERSOitwgUq_37kGslN96w | VOLO | 2013-07-22T12:09:11Z | VOLO. "L'air d'un con" | 38 | 26 | 223172 | 213.0 |
| 1 | PT7M46S | we5gzZq5Avg | UCson549gpvRhPnJ3Whs5onA | LongWayToDream | 2012-03-17T08:34:30Z | Julian Jeweil - Air Conditionné | 2 | 3 | 13409 | 466.0 |
| 2 | PT3M7S | 49esza4eiK4 | UCcHYZ8Ez4gG_2bHEuBL8IfQ | Downtown Records | 2007-09-08T02:02:07Z | Justice - D.A.N.C.E | 3168 | 780 | 10106655 | 187.0 |
| 3 | PT3M43S | BoO6LfR7ca0 | UCQ0wLCF7u23gZKJkHFs1Tpg | Music Is Our Drug | 2014-01-24T12:52:38Z | Gramatik - Torture (feat. Eric Krasno) | 6 | 0 | 29153 | 223.0 |
| 4 | PT5M | DaH4W1rY9us | UCJsTMPZxYD-Q3kEmL4Qijpg | Harvey Pearson | 2012-12-02T12:41:13Z | Ben Howard - Oats In The Water | 5303 | 1784 | 16488714 | 300.0 |


We'll make sure we don't have any null values.

In [0]:
# Count the number of null values in the `totalDurationSeconds` column of `songs_output`
### BEGIN STRIP ###
songs_output \
  .select(F.sum(F.col('totalDurationSeconds').isNull().astype('int'))) \
  .rdd.map(lambda r: r[0]).first()
### END STRIP ###

**If you got 0 null values, good job, you made it!**

This is great progress, but we used a `UDF`, which is not great performance-wise.  
It would be better to implement this using PySpark functions.
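
One possible way to do this, sketched under the assumption that durations only contain days, hours, minutes and seconds, is to extract each component with `regexp_extract` and add them up. The names `COMPONENTS` and `total_seconds_expr` are illustrative, and the import sits inside the function only so that the patterns can be inspected without Spark installed.

```python
# (regex with one capture group, weight in seconds) for each component.
# Assumption: years, months and weeks never appear in the data.
COMPONENTS = [
    (r"(\d+)D", 86400),               # days, written before the 'T'
    (r"T.*?(\d+)H", 3600),            # hours, after the 'T'
    (r"T.*?(\d+)M", 60),              # minutes, after the 'T'
    (r"T.*?(\d+(?:\.\d+)?)S", 1),     # seconds, possibly fractional
]

def total_seconds_expr(col_name):
    """Build a Column that computes the total seconds of an ISO 8601 duration."""
    from pyspark.sql import functions as F  # imported lazily, see note above

    terms = []
    for pattern, weight in COMPONENTS:
        raw = F.regexp_extract(F.col(col_name), pattern, 1)
        # regexp_extract yields an empty string when the component is absent
        value = F.when(raw == "", F.lit(0.0)).otherwise(raw.cast("double"))
        terms.append(value * weight)

    # Add the per-component Columns together
    total = terms[0]
    for term in terms[1:]:
        total = total + term
    return total
```

With the notebook's variables this would be used as `songs.withColumn('totalDurationSeconds', total_seconds_expr(DURATION_COL))`. It is worth comparing the result against the UDF column before relying on it.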

You can now use this new variable to perform more analysis. **Good luck!**