# PR0506. Netflix user behavior analysis
## Schema definition and Spark session startup

In [10]:
from pyspark.sql.types import StringType, IntegerType, DoubleType, StructType, StructField, TimestampNTZType, DateType
schema = StructType([
    StructField("", IntegerType(), False),
    StructField("datetime", TimestampNTZType(), False),
    StructField("duration", DoubleType(), False),
    StructField("title", StringType(), False),
    StructField("genres", StringType(), False),
    StructField("release_date", DateType(), False), 
    StructField("movie_id", StringType(), False),
    StructField("user_id", StringType(), False),
])

In [11]:
from pyspark.sql import SparkSession

spark = (
    SparkSession
        .builder
        .appName("Netflix")
        .master("spark://spark-master:7077")
        .getOrCreate()
)

df = (
    spark
        .read
        .format("csv")
        .schema(schema)
        .option("header", "true")
        .load("/workspace/pr0506/vodclickstream_uk_movies_03.csv")
)


df.show(5)

+-----+-------------------+--------+--------------------+--------------------+------------+----------+----------+
|     |           datetime|duration|               title|              genres|release_date|  movie_id|   user_id|
+-----+-------------------+--------+--------------------+--------------------+------------+----------+----------+
|58773|2017-01-01 01:15:09|     0.0|Angus, Thongs and...|Comedy, Drama, Ro...|  2008-07-25|26bd5987e8|1dea19f6fe|
|58774|2017-01-01 13:56:02|     0.0|The Curse of Slee...|Fantasy, Horror, ...|  2016-06-02|f26ed2675e|544dcbc510|
|58775|2017-01-01 15:17:47| 10530.0|   London Has Fallen|    Action, Thriller|  2016-03-04|f77e500e7a|7cbcc791bf|
|58776|2017-01-01 16:04:13|    49.0|            Vendetta|       Action, Drama|  2015-06-12|c74aec7673|ebf43c36b6|
|58777|2017-01-01 19:16:37|     0.0|The SpongeBob Squ...|Animation, Action...|  2004-11-19|a80d6fc2aa|a57c992287|
+-----+-------------------+--------+--------------------+--------------------+------------+----------+----------+

## Exercises
### 1.- Web telemetry audit (data validation)

In [13]:
from pyspark.sql import Window, functions as F

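# Each user's events in chronological order
user_window = (
    Window
        .partitionBy("user_id")
        .orderBy("datetime")
)

# Event time as Unix seconds, so the difference below is in seconds
event_ts = F.unix_timestamp("datetime")

# Seconds until the same user's next event (null for the user's last event)
df = (
    df
        .withColumn("calculated_time_to_next", F.lead(event_ts, 1).over(user_window) - event_ts)
)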

df.show(15)



+------+-------------------+--------+--------------------+--------------------+------------+----------+----------+-----------------------+
|      |           datetime|duration|               title|              genres|release_date|  movie_id|   user_id|calculated_time_to_next|
+------+-------------------+--------+--------------------+--------------------+------------+----------+----------+-----------------------+
|139643|2017-05-19 20:21:43|     0.0|                XOXO|        Drama, Music|  2016-08-26|7369676dec|0006ea6b5c|                  91971|
|140442|2017-05-20 21:54:34|     0.0|            Hot Fuzz|Action, Comedy, M...|  2007-04-20|6467fee6b6|0006ea6b5c|                 506607|
|144717|2017-05-26 18:38:01|     0.0|         War Machine|  Comedy, Drama, War|  2017-05-26|0f3b137f4e|0006ea6b5c|                  17625|
|144301|2017-05-26 23:31:46|     0.0|          Apocalypto|Action, Adventure...|  2006-12-08|40dd7bf1f9|0006ea6b5c|                  83635|
|145323|2017-05-27 22:45:41

                                                                                

There are many discrepancies: when a user closes the window, that time does not count toward `duration`, whereas our calculation (the gap until the same user's next event) does include it. In other rows, however, the two values are much closer.
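
To make the comparison concrete, here is a minimal plain-Python sketch of what the window computes, run on made-up events rather than on the real dataset. The helper name `time_to_next`, the sample rows, and the 60-second tolerance are all assumptions chosen for illustration; they are not part of the exercise.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: (user_id, datetime, duration in seconds). Not taken from the dataset.
events = [
    ("u1", datetime(2017, 1, 1, 20, 0, 0), 1800.0),  # watched 30 min, next click 30 min later
    ("u1", datetime(2017, 1, 1, 20, 30, 0), 0.0),    # window closed; next click one day later
    ("u1", datetime(2017, 1, 2, 20, 30, 0), 600.0),  # last event for u1: no next click
    ("u2", datetime(2017, 1, 1, 9, 0, 0), 45.0),     # only event for u2
]


def time_to_next(rows):
    """Mimic lead(datetime) - datetime per user, in seconds (None for a user's last event)."""
    by_user = defaultdict(list)
    for user_id, ts, duration in rows:
        by_user[user_id].append((ts, duration))

    result = []
    for user_id, user_rows in by_user.items():
        user_rows.sort(key=lambda r: r[0])  # chronological order within the user
        for i, (ts, duration) in enumerate(user_rows):
            if i + 1 < len(user_rows):
                gap = (user_rows[i + 1][0] - ts).total_seconds()
            else:
                gap = None
            result.append((user_id, ts, duration, gap))
    return result


TOLERANCE = 60  # seconds; an arbitrary threshold chosen for this sketch

for user_id, ts, duration, gap in time_to_next(events):
    if gap is None:
        status = "no next event"
    elif abs(gap - duration) <= TOLERANCE:
        status = "match"
    else:
        status = "discrepancy"
    print(user_id, ts, duration, gap, status)
```

In this toy data the second event shows the pattern described above: the recorded duration is 0 while the gap to the next click is a full day, so the two values cannot agree.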
### 3.- The "binge-watching" ranking