In [None]:
Oracle AI Data Platform v1.0

Copyright © 2025, Oracle and/or its affiliates.

Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/

# AI Data Platform - Use Case for Data Science

# Simple Spark ML example on AI Data Platform
 **Training and evaluating a model with Spark ML on an AI Data Platform cluster**
 
 This notebook demonstrates training a model using the built-in Spark ML libraries. It covers:
 
 1. **Create source dataframe**
 2. **Create ML model**
 3. **Train model**
 4. **Evaluate model**
 5. **Prediction**

 **Prerequisites**

Before you begin, ensure you have:
 - The necessary IAM policies for accessing AI Data Platform. Learn more about permissions.
 - A configured AI Data Platform environment with a compute cluster. Install the requirements file into the cluster libraries; it includes:
   - numpy

<a class="anchor" id="0.1"></a>

# **Table of Contents**


1.	[Movie Recommendation with Pyspark](#1)


In [1]:
import random
import os

from pyspark.sql import SparkSession 
from pyspark.ml  import Pipeline     
from pyspark.sql import SQLContext  
from pyspark.sql.functions import mean, col, split, regexp_extract, when, lit
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import QuantileDiscretizer

# **1. Movie Recommendation with Pyspark** <a class="anchor" id="1"></a>

[Go back to table of contents](#0.1)


### Create the source DataFrame and show it

In [1]:
# Sample movie rating data (userId, title, rating)
data = [
    (0, "Inception", 5.0),
    (0, "Titanic", 4.5),
    (0, "The Matrix", 4.0),
    (1, "Inception", 4.0),
    (1, "Titanic", 3.0),
    (1, "Interstellar", 5.0),
    (2, "The Matrix", 5.0),
    (2, "Interstellar", 4.0),
    (2, "Titanic", 2.0),
]


# Create DataFrame
columns = ["userId", "title", "rating"]
df = spark.createDataFrame(data, columns)
df.show()

## Our task: given a user, predict and return a list of recommended movies for that user to watch.

### We use **printSchema()** for a quick overview of each column's data type

In [1]:
df.printSchema()

### As we can see, the title column is stored as a string. To work with the PySpark MLlib library, we need to convert the strings to numeric values

In [1]:
from pyspark.ml.feature import StringIndexer, IndexToString
stringIndexer = StringIndexer(inputCol='title', outputCol='title_new')
# Applying stringindexer object on dataframe movie title column
model = stringIndexer.fit(df)
#creating new dataframe with transformed values
indexed = model.transform(df)
#validate the numerical title values
indexed.show(5)
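
To make the mapping concrete, here is a minimal pure-Python sketch of the default `StringIndexer` ordering (`frequencyDesc`), in which the most frequent label receives index 0.0. The alphabetical tie-break is an assumption that matches Spark 3.0 and later; other versions may order ties differently. The helper name `index_labels` is hypothetical and is not part of PySpark.

```python
from collections import Counter

# Titles from the sample ratings, one entry per rating row.
titles = [
    "Inception", "Titanic", "The Matrix",
    "Inception", "Titanic", "Interstellar",
    "The Matrix", "Interstellar", "Titanic",
]

def index_labels(values):
    """Map each label to a float index, most frequent label first.

    Ties are broken alphabetically (assumed; matches Spark 3.0+).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

label_to_index = index_labels(titles)
print(label_to_index)
```

"Titanic" appears three times in the sample data, so it is the label that ends up with index 0.0; the remaining three titles are tied at two ratings each.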

### We use the alternating least squares (ALS) algorithm from the PySpark ML library for recommendation. To read more, visit https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html

In [1]:
from pyspark.ml.recommendation import ALS

# Split the data into training and test datasets
train, test = indexed.randomSplit([0.75, 0.25])

# Configure the recommender model
rec = ALS(
    maxIter=10,
    regParam=0.01,
    userCol='userId',
    itemCol='title_new',
    ratingCol='rating',
    nonnegative=True,
    coldStartStrategy="drop",  # drop rows whose prediction would be NaN
)

# Fit the model on the training set
rec_model = rec.fit(train)

# Make predictions on the test set
predicted_ratings = rec_model.transform(test)
predicted_ratings.show(5)
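
As a rough illustration of what the `prediction` column contains: ALS learns one latent factor vector per user and per item, and the predicted rating is their dot product. The sketch below uses made-up rank-2 factors; the numbers and the helper `predict_rating` are hypothetical. A fitted model exposes its actual factors through `rec_model.userFactors` and `rec_model.itemFactors`, keyed by the numeric ids rather than by title.

```python
# Hypothetical rank-2 latent factors, chosen only for illustration.
# In the real model the item key would be the numeric title_new index.
user_factors = {2: [1.0, 0.5]}
item_factors = {"Inception": [2.0, 1.0], "Titanic": [0.5, 3.0]}

def predict_rating(user_id, title):
    """Predicted rating = dot product of the user and item factor vectors."""
    u = user_factors[user_id]
    v = item_factors[title]
    return sum(a * b for a, b in zip(u, v))

print(predict_rating(2, "Inception"))  # 1.0*2.0 + 0.5*1.0 = 2.5
print(predict_rating(2, "Titanic"))    # 1.0*0.5 + 0.5*3.0 = 2.0
```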

### Evaluate the training

In [1]:
# Import RegressionEvaluator to measure RMSE
from pyspark.ml.evaluation import RegressionEvaluator
# Create an evaluator that compares the 'prediction' column with the 'rating' label
evaluator = RegressionEvaluator(metricName='rmse', predictionCol='prediction', labelCol='rating')
# Apply the evaluator to the predictions DataFrame to calculate RMSE
rmse = evaluator.evaluate(predicted_ratings)
# Print the RMSE (lower is better)
print(rmse)
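
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The following sketch computes it by hand on hypothetical values; the function name `rmse_by_hand` and the sample numbers are illustrative only and are not output from the model.

```python
import math

def rmse_by_hand(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical actual and predicted ratings (not produced by the ALS model).
actual = [5.0, 3.0, 4.0]
predicted = [4.0, 3.0, 2.0]

# Squared errors are 1, 0 and 4, so the result is sqrt(5/3).
print(rmse_by_hand(actual, predicted))
```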

### After training, it is time to recommend the top movies a user might like

In [1]:
# First, create a DataFrame of all distinct movies
unique_movies = indexed.select('title_new').distinct()

# Function to recommend the top 'n' movies to a given user
def top_movies(user_id, n):
    """
    Display the top 'n' movies that the user has not rated yet but might like.
    """
    # Alias the distinct-movies DataFrame as 'a'
    a = unique_movies.alias('a')

    # Movies already rated by the active user, aliased as 'b'
    watched_movies = indexed.filter(indexed['userId'] == user_id).select('title_new')
    b = watched_movies.alias('b')

    # Left join all movies against the user's watched movies
    total_movies = a.join(b, col("a.title_new") == col("b.title_new"), how='left')

    # Keep only the movies the active user has not rated or watched yet
    remaining_movies = total_movies.where(col("b.title_new").isNull()).select(col("a.title_new")).distinct()

    # Add the active user's id as a column
    remaining_movies = remaining_movies.withColumn("userId", lit(int(user_id)))

    # Score the remaining movies with the ALS model and keep only the top 'n'
    recommendations = rec_model.transform(remaining_movies).orderBy('prediction', ascending=False).limit(n)

    # Map the numeric title index back to the movie title
    movie_title = IndexToString(inputCol="title_new", outputCol="title", labels=model.labels)
    final_recommendations = movie_title.transform(recommendations)

    # Show the recommendations for the active user without truncating titles
    final_recommendations.show(n, False)

In [1]:
# Test: recommend the top movie for the user with id=2
top_movies(2,1)
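
The candidate-selection logic in `top_movies` (all movies, minus the ones the user already rated, ranked by predicted score) can be sketched in plain Python on the same sample ratings. The helpers `unseen_titles`, `top_n`, and `dummy_score` are hypothetical; `dummy_score` is only a stand-in for the ALS model's predicted rating.

```python
# (userId, title, rating) rows from the sample data.
ratings = [
    (0, "Inception", 5.0), (0, "Titanic", 4.5), (0, "The Matrix", 4.0),
    (1, "Inception", 4.0), (1, "Titanic", 3.0), (1, "Interstellar", 5.0),
    (2, "The Matrix", 5.0), (2, "Interstellar", 4.0), (2, "Titanic", 2.0),
]

def unseen_titles(user_id, rows):
    """All titles in the data minus the titles this user already rated."""
    all_titles = {title for _, title, _ in rows}
    watched = {title for uid, title, _ in rows if uid == user_id}
    return all_titles - watched

def top_n(user_id, rows, score, n):
    """Rank the unseen titles by score, highest first, and keep n of them."""
    candidates = unseen_titles(user_id, rows)
    return sorted(candidates, key=lambda t: score(user_id, t), reverse=True)[:n]

def dummy_score(user_id, title):
    # Hypothetical placeholder for the model's predicted rating.
    return float(len(title))

print(unseen_titles(2, ratings))          # {'Inception'}
print(top_n(2, ratings, dummy_score, 1))  # ['Inception']
```

With this small dataset each user has exactly one unrated movie, so asking for more than one recommendation still yields a single row.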

I hope you find this notebook beneficial and enjoyable.