The data comes from the MovieLens datasets: https://grouplens.org/datasets/movielens/ (this notebook reads movies.csv from the ml-20m release).

In [1]:
import org.apache.spark.sql.functions._
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.evaluation.ClusteringEvaluator

In [2]:
val genres_df = spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/home/jovyan/data/ml-20m/movies.csv")

genres_df = [movieId: int, title: string ... 1 more field]


[movieId: int, title: string ... 1 more field]

In [3]:
genres_df.show(3)

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
+-------+--------------------+--------------------+
only showing top 3 rows



In [4]:
def genre_to_lowercase(genres: String): Array[String] = genres.toLowerCase().split("\\|")

// A method name followed by "_" converts the method into a function value
// (eta-expansion) instead of calling it; udf(...) needs a function value
val genre_words_to_lowercaseUdf = udf(genre_to_lowercase _)

val genres_df1 = genres_df.withColumn("genre_words_lc", genre_words_to_lowercaseUdf('genres))

genre_words_to_lowercaseUdf = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(StringType)))
genres_df1 = [movieId: int, title: string ... 2 more fields]


genre_to_lowercase: (genres: String)Array[String]


[movieId: int, title: string ... 2 more fields]
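
The trailing underscore used in the cell above can be tried without Spark. The following is a minimal sketch in plain Scala; `splitGenres` is a hypothetical stand-in for `genre_to_lowercase`, and the variable names are illustrative only.

```scala
// Hypothetical stand-in for the notebook's genre_to_lowercase method.
def splitGenres(genres: String): Array[String] =
  genres.toLowerCase.split("\\|")

// Calling the method runs it immediately and yields an Array[String].
val called: Array[String] = splitGenres("Comedy|Romance")

// The trailing underscore turns the method into a function value of type
// String => Array[String], which can be passed around (for example to udf).
val asFunction: String => Array[String] = splitGenres _

println(called.mkString(", "))                     // comedy, romance
println(asFunction("Action|Crime").mkString(", ")) // action, crime
```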

In [5]:
genres_df1.show(3)

+-------+--------------------+--------------------+--------------------+
|movieId|               title|              genres|      genre_words_lc|
+-------+--------------------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|[adventure, anima...|
|      2|      Jumanji (1995)|Adventure|Childre...|[adventure, child...|
|      3|Grumpier Old Men ...|      Comedy|Romance|   [comedy, romance]|
+-------+--------------------+--------------------+--------------------+
only showing top 3 rows



We borrow the bag-of-words idea from NLP: treat each genre as a word and represent every movie as a vector with one 0/1 entry per distinct genre, set to 1 when the movie has that genre.
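
As a rough illustration of that encoding, here is a plain-Scala sketch that does not use Spark. The four-word `vocabulary` and the `multiHot` helper are assumptions made for this example; the notebook derives its real vocabulary from the data.

```scala
// Hypothetical mini-vocabulary; the notebook builds the real one from the data.
val vocabulary = List("adventure", "animation", "comedy", "romance")

// Turn a pipe-separated genre string into a 0/1 vector over the vocabulary.
def multiHot(genres: String, vocab: List[String]): List[Int] = {
  val present = genres.toLowerCase.split("\\|").toSet
  vocab.map(g => if (present.contains(g)) 1 else 0)
}

val toyStory = multiHot("Adventure|Animation|Children|Comedy|Fantasy", vocabulary)
val grumpier = multiHot("Comedy|Romance", vocabulary)

println(toyStory) // List(1, 1, 1, 0)
println(grumpier) // List(0, 0, 1, 1)
```

Genres missing from the vocabulary (here "children" and "fantasy") are simply ignored, so the vector length always equals the vocabulary size.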

In [6]:
val all_words = genres_df1.select(explode(genres_df1("genre_words_lc")))
// Compare the column value, not the Row itself: `_ != ""` is always true for a Row
val distinct_words = all_words.filter(col("col") =!= "").distinct
val total_distinct_words = distinct_words.count.toInt

all_words = [col: string]
distinct_words = [col: string]
total_distinct_words = 20


20

In [7]:
distinct_words.take(5)

0
crime
imax
fantasy
documentary
action


In [8]:
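val category = "adventure"
// instr returns the 1-based position of the first match (0 when absent),
// so any non-zero value means the genre is present
genres_df1.withColumn("genres_" + category, instr(lower(col("genres")), category))
    .select("genres", "genres_adventure")
    .show(10)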

+--------------------+----------------+
|              genres|genres_adventure|
+--------------------+----------------+
|Adventure|Animati...|               1|
|Adventure|Childre...|               1|
|      Comedy|Romance|               0|
|Comedy|Drama|Romance|               0|
|              Comedy|               0|
|Action|Crime|Thri...|               0|
|      Comedy|Romance|               0|
|  Adventure|Children|               1|
|              Action|               0|
|Action|Adventure|...|               8|
+--------------------+----------------+
only showing top 10 rows



category = adventure


adventure

In [9]:
val category = "adventure"
genres_df1.withColumn("genres_" + category, when(instr(lower(col("genres")), category) === 0, 0).otherwise(1)).show(10)

+-------+--------------------+--------------------+--------------------+----------------+
|movieId|               title|              genres|      genre_words_lc|genres_adventure|
+-------+--------------------+--------------------+--------------------+----------------+
|      1|    Toy Story (1995)|Adventure|Animati...|[adventure, anima...|               1|
|      2|      Jumanji (1995)|Adventure|Childre...|[adventure, child...|               1|
|      3|Grumpier Old Men ...|      Comedy|Romance|   [comedy, romance]|               0|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|[comedy, drama, r...|               0|
|      5|Father of the Bri...|              Comedy|            [comedy]|               0|
|      6|         Heat (1995)|Action|Crime|Thri...|[action, crime, t...|               0|
|      7|      Sabrina (1995)|      Comedy|Romance|   [comedy, romance]|               0|
|      8| Tom and Huck (1995)|  Adventure|Children|[adventure, child...|               1|
|      9| 

category = adventure


adventure
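
One caveat about the `instr` test used above: it is a substring match, so a category name that happens to occur inside a longer label would also count as a hit. The sketch below shows the difference in plain Scala; "Warfare" is a made-up label used only for illustration and is not a MovieLens genre.

```scala
// Hypothetical genre string; "Warfare" is not a real MovieLens genre.
val genres = "Action|Warfare"
val category = "war"

// Substring test, the plain-Scala analogue of instr(...) != 0.
val substringHit = genres.toLowerCase.contains(category)

// Token test: split on "|" first, then compare whole labels.
val tokenHit = genres.toLowerCase.split("\\|").contains(category)

println(substringHit) // true
println(tokenHit)     // false
```

In Spark the token-based variant could be expressed with `array_contains` on the `genre_words_lc` column. For the genre labels in this dataset both tests should agree, so this matters mainly if the label set changes.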

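What is foldLeft? It walks a collection from left to right while carrying an accumulator: it starts from an initial value and combines the accumulator with each element in turn. You can picture a strip of paper being folded from the left, one segment at a time. In the cell below the initial value is 0.0 and the combining function is addition, so the result is the sum of the list.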

In [10]:
val prices = List(1.5, 2.0, 2.5)
val sum = prices.foldLeft(0.0)(_ + _)

prices = List(1.5, 2.0, 2.5)
sum = 6.0


6.0
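
To make the intermediate steps of the fold above visible, the following sketch compares `foldLeft` with `scanLeft` and with an explicit loop. It reuses the same `prices` list; the names `steps` and `acc` are introduced here for illustration.

```scala
val prices = List(1.5, 2.0, 2.5)

// foldLeft keeps only the final accumulator.
val sum = prices.foldLeft(0.0)((acc, price) => acc + price)

// scanLeft keeps every intermediate accumulator, starting with the initial value.
val steps = prices.scanLeft(0.0)((acc, price) => acc + price)

// The same fold written as an explicit loop over the list.
var acc = 0.0
for (price <- prices) acc = acc + price

println(sum)   // 6.0
println(steps) // List(0.0, 1.5, 3.5, 6.0)
println(acc)   // 6.0
```

The accumulator does not have to be a number. In the next cells it is a DataFrame, and each step returns a new DataFrame with one more column.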

In [11]:
val categories = List("adventure", "animation")
// Start from genres_df1 and add one 0/1 column per category
val oneHotDf = categories
    .foldLeft(genres_df1)(
        (df, category) =>
            df.withColumn("genres_" + category, when(instr(lower(col("genres")), category) === 0, 0).otherwise(1))
    )

categories = List(adventure, animation)
oneHotDf = [movieId: int, title: string ... 4 more fields]


[movieId: int, title: string ... 4 more fields]

In [12]:
oneHotDf.show(3)

+-------+--------------------+--------------------+--------------------+----------------+----------------+
|movieId|               title|              genres|      genre_words_lc|genres_adventure|genres_animation|
+-------+--------------------+--------------------+--------------------+----------------+----------------+
|      1|    Toy Story (1995)|Adventure|Animati...|[adventure, anima...|               1|               1|
|      2|      Jumanji (1995)|Adventure|Childre...|[adventure, child...|               1|               0|
|      3|Grumpier Old Men ...|      Comedy|Romance|   [comedy, romance]|               0|               0|
+-------+--------------------+--------------------+--------------------+----------------+----------------+
only showing top 3 rows



In [13]:
val distinct_words_list = distinct_words.select("col")
    .map(r => r.getString(0))
    .filter(r => r != "(no genres listed)")
    .collect.toList

distinct_words_list = List(crime, imax, fantasy, documentary, action, animation, mystery, horror, film-noir, musical, adventure, drama, western, children, war, romance, thriller, sci-fi, comedy)


List(crime, imax, fantasy, documentary, action, animation, mystery, horror, film-noir, musical, adventure, drama, western, children, war, romance, thriller, sci-fi, comedy)

In [14]:
// Same fold as before, now over every distinct genre
val oneHotDf = distinct_words_list
    .foldLeft(genres_df1)(
        (df, category) =>
            df.withColumn("genres_" + category,
                          when(instr(lower(col("genres")), category) === 0, 0)
                          .otherwise(1))
    )

oneHotDf = [movieId: int, title: string ... 21 more fields]


[movieId: int, title: string ... 21 more fields]

In [15]:
oneHotDf.show(3)

+-------+--------------------+--------------------+--------------------+------------+-----------+--------------+------------------+-------------+----------------+--------------+-------------+----------------+--------------+----------------+------------+--------------+---------------+----------+--------------+---------------+-------------+-------------+
|movieId|               title|              genres|      genre_words_lc|genres_crime|genres_imax|genres_fantasy|genres_documentary|genres_action|genres_animation|genres_mystery|genres_horror|genres_film-noir|genres_musical|genres_adventure|genres_drama|genres_western|genres_children|genres_war|genres_romance|genres_thriller|genres_sci-fi|genres_comedy|
+-------+--------------------+--------------------+--------------------+------------+-----------+--------------+------------------+-------------+----------------+--------------+-------------+----------------+--------------+----------------+------------+--------------+---------------+------

In [16]:
oneHotDf.printSchema

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)
 |-- genre_words_lc: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- genres_crime: integer (nullable = false)
 |-- genres_imax: integer (nullable = false)
 |-- genres_fantasy: integer (nullable = false)
 |-- genres_documentary: integer (nullable = false)
 |-- genres_action: integer (nullable = false)
 |-- genres_animation: integer (nullable = false)
 |-- genres_mystery: integer (nullable = false)
 |-- genres_horror: integer (nullable = false)
 |-- genres_film-noir: integer (nullable = false)
 |-- genres_musical: integer (nullable = false)
 |-- genres_adventure: integer (nullable = false)
 |-- genres_drama: integer (nullable = false)
 |-- genres_western: integer (nullable = false)
 |-- genres_children: integer (nullable = false)
 |-- genres_war: integer (nullable = false)
 |-- genres_romance: integer (nullable = false)
 |-- genres_thriller:

In [17]:
// Keep only the generated 0/1 columns, whose names start with "genres_"
val featureCols = oneHotDf.columns.filter(_.startsWith("genres_"))
val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val dataset = assembler.transform(oneHotDf)

featureCols = Array(genres_crime, genres_imax, genres_fantasy, genres_documentary, genres_action, genres_animation, genres_mystery, genres_horror, genres_film-noir, genres_musical, genres_adventure, genres_drama, genres_western, genres_children, genres_war, genres_romance, genres_thriller, genres_sci-fi, genres_comedy)
assembler = vecAssembler_dd61feccc44b
dataset = [movieId: int, title: string ... 22 more fields]


[movieId: int, title: string ... 22 more fields]

In [18]:
// Train a k-means model with k = 5 clusters and a fixed seed for reproducibility
val kmeans = new KMeans().setK(5).setSeed(1L)
val model = kmeans.fit(dataset)

kmeans = kmeans_5b7345207bd1
model = kmeans_5b7345207bd1


kmeans_5b7345207bd1

In [19]:
// Make predictions
val predictions = model.transform(dataset)

predictions = [movieId: int, title: string ... 23 more fields]


[movieId: int, title: string ... 23 more fields]

In [20]:
// Evaluate clustering by computing Silhouette score
val evaluator = new ClusteringEvaluator()

evaluator = cluEval_26d9fa291dd6


cluEval_26d9fa291dd6

In [21]:
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")

Silhouette with squared euclidean distance = 0.28909498554884994


silhouette = 0.28909498554884994


0.28909498554884994
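
For intuition about the number above: the silhouette of a point compares a, its mean distance to the other members of its own cluster, with b, its mean distance to the members of the nearest other cluster, as (b - a) / max(a, b). Scores lie between -1 and 1, and values near 1 indicate compact, well-separated clusters, so a score of about 0.29 suggests the genre clusters overlap considerably. The sketch below applies this textbook definition with squared Euclidean distance to a toy dataset in plain Scala. It is not Spark's internal implementation; the helper names and the toy points are assumptions for illustration, and it expects at least two non-empty clusters.

```scala
// Squared Euclidean distance between two points of equal dimension.
def sqDist(p: Vector[Double], q: Vector[Double]): Double =
  p.zip(q).map { case (x, y) => (x - y) * (x - y) }.sum

// Mean squared distance from p to every point in `points` (must be non-empty).
def meanDist(p: Vector[Double], points: Seq[Vector[Double]]): Double =
  points.map(sqDist(p, _)).sum / points.size

// Mean silhouette over all points; `clusters` lists the members of each cluster.
// Assumes at least two clusters, all of them non-empty.
def meanSilhouette(clusters: Seq[Seq[Vector[Double]]]): Double = {
  val scores = for {
    (cluster, ci) <- clusters.zipWithIndex
    (p, pi) <- cluster.zipWithIndex
  } yield {
    val others = cluster.zipWithIndex.collect { case (q, qi) if qi != pi => q }
    if (others.isEmpty) 0.0 // convention: a singleton cluster scores 0
    else {
      val a = meanDist(p, others)
      val b = clusters.zipWithIndex
        .collect { case (c, cj) if cj != ci => meanDist(p, c) }
        .min
      val denom = math.max(a, b)
      if (denom == 0.0) 0.0 else (b - a) / denom
    }
  }
  scores.sum / scores.size
}

// Two tight, far-apart toy clusters: the score should be close to 1.
val toyClusters = Seq(
  Seq(Vector(0.0, 0.0), Vector(0.0, 1.0)),
  Seq(Vector(5.0, 0.0), Vector(5.0, 1.0))
)
val score = meanSilhouette(toyClusters)
println(score) // roughly 0.96
```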

In [22]:
// Print the center of each cluster (one 19-dimensional vector per cluster)
println("Cluster Centers: ")
model.clusterCenters.foreach(println)

Cluster Centers: 
[0.07150964812712826,0.0370790768066591,0.18123344684071133,0.014377601210745366,0.6428301172909573,0.12372304199772986,0.03216042376087779,0.12372304199772986,0.0034052213393870605,0.019674612183125238,0.561861520998865,0.09648127128263338,0.048808172531214535,0.11123723041997731,0.05675368898978434,0.06469920544835414,0.09875141884222476,0.40938327657964435,0.08702232311766932]
[0.08134842130252433,0.00387750257181293,0.02302761731423597,0.17994777241433882,0.04645089815620796,0.007359341615889847,0.02864603940808736,0.015035214053968505,0.01281949829864683,0.029674764580200994,0.02864603940808736,0.8290733560180422,0.010920313365513966,0.01582654110944053,0.0656801456041782,0.13056896415288438,0.0,0.013848223470760465,0.13215161826382843]
[0.03538976566236251,0.0014347202295552368,0.0583452893352463,0.004304160688665711,0.038737446197991396,0.0196078431372549,0.01912960306073649,0.013868962219033956,0.001912960306073649,0.10616929698708752,0.04017216642754663,0.356

In [23]:
dataset.select("title", "features").show(3)

+--------------------+--------------------+
|               title|            features|
+--------------------+--------------------+
|    Toy Story (1995)|(19,[2,5,10,13,18...|
|      Jumanji (1995)|(19,[2,10,13],[1....|
|Grumpier Old Men ...|(19,[15,18],[1.0,...|
+--------------------+--------------------+
only showing top 3 rows



In [24]:
dataset.printSchema

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)
 |-- genre_words_lc: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- genres_crime: integer (nullable = false)
 |-- genres_imax: integer (nullable = false)
 |-- genres_fantasy: integer (nullable = false)
 |-- genres_documentary: integer (nullable = false)
 |-- genres_action: integer (nullable = false)
 |-- genres_animation: integer (nullable = false)
 |-- genres_mystery: integer (nullable = false)
 |-- genres_horror: integer (nullable = false)
 |-- genres_film-noir: integer (nullable = false)
 |-- genres_musical: integer (nullable = false)
 |-- genres_adventure: integer (nullable = false)
 |-- genres_drama: integer (nullable = false)
 |-- genres_western: integer (nullable = false)
 |-- genres_children: integer (nullable = false)
 |-- genres_war: integer (nullable = false)
 |-- genres_romance: integer (nullable = false)
 |-- genres_thriller:

In [25]:
dataset.show(3)

+-------+--------------------+--------------------+--------------------+------------+-----------+--------------+------------------+-------------+----------------+--------------+-------------+----------------+--------------+----------------+------------+--------------+---------------+----------+--------------+---------------+-------------+-------------+--------------------+
|movieId|               title|              genres|      genre_words_lc|genres_crime|genres_imax|genres_fantasy|genres_documentary|genres_action|genres_animation|genres_mystery|genres_horror|genres_film-noir|genres_musical|genres_adventure|genres_drama|genres_western|genres_children|genres_war|genres_romance|genres_thriller|genres_sci-fi|genres_comedy|            features|
+-------+--------------------+--------------------+--------------------+------------+-----------+--------------+------------------+-------------+----------------+--------------+-------------+----------------+--------------+----------------+--------

In [26]:
predictions.show(3)

+-------+--------------------+--------------------+--------------------+------------+-----------+--------------+------------------+-------------+----------------+--------------+-------------+----------------+--------------+----------------+------------+--------------+---------------+----------+--------------+---------------+-------------+-------------+--------------------+----------+
|movieId|               title|              genres|      genre_words_lc|genres_crime|genres_imax|genres_fantasy|genres_documentary|genres_action|genres_animation|genres_mystery|genres_horror|genres_film-noir|genres_musical|genres_adventure|genres_drama|genres_western|genres_children|genres_war|genres_romance|genres_thriller|genres_sci-fi|genres_comedy|            features|prediction|
+-------+--------------------+--------------------+--------------------+------------+-----------+--------------+------------------+-------------+----------------+--------------+-------------+----------------+--------------+---