# Exploring the MovieLens ratings dataset with Spark RDDs

This project explores the MovieLens 100K dataset using PySpark RDDs (Resilient Distributed Datasets). RDDs are Spark's fundamental data structure: immutable collections that are partitioned across the machines of a cluster and processed in parallel. Operations such as filtering, transforming, and aggregating run on each partition independently. RDDs are also fault-tolerant: Spark records the lineage of transformations that produced each RDD, so a lost partition can be recomputed automatically.

Understanding RDDs is crucial for mastering Spark, especially when fine-grained control over data processing is required.
While the DataFrame and Spark SQL APIs are higher-level and optimized for structured data, RDDs provide a low-level API for transforming and manipulating distributed data, which makes them a good fit for tasks that need flexibility and control.

##### RDD vs. MapReduce
RDDs provide a higher-level abstraction than MapReduce. They offer a wider range of operations beyond the map and reduce paradigm, including transformations such as flatMap, groupByKey, and join. Because intermediate results can be kept in memory, RDDs support iterative and interactive processing more efficiently, which suits machine learning, graph processing, and near-real-time analytics. MapReduce, on the other hand, is primarily used for batch processing, where data is processed in large batches rather than interactively or in real time.

In [1]:
from pyspark import SparkConf, SparkContext

In [2]:
# Create a SparkConf object to configure Spark. We set the application name to identify our Spark job
# We also set the master URL to local[*], indicating that we want to run Spark in local mode using all available CPU cores
conf = SparkConf().setAppName("RDDExploration").setMaster("local[*]")
# Create a SparkContext object with configuration settings to interact with Spark and to distribute the operations across a cluster
sc = SparkContext(conf=conf)

In [3]:
# Load the MovieLens ratings file into an RDD; each line of the file becomes one element of the RDD
# Spark splits the RDD into partitions that are processed in parallel (across cluster nodes, or across local cores in local[*] mode)
data_rdd = sc.textFile("ml-100k/ml-100k/u.data")
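Each element of `data_rdd` is a raw tab-separated string in the `u.data` layout (user ID, movie ID, rating, timestamp). As a minimal sketch of how one such line could be turned into typed fields, the plain-Python helper below parses a single record. The names `Rating` and `parse_rating` are hypothetical, and the sample line is only an illustrative record in that layout.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One u.data record (hypothetical helper type)."""
    user_id: int
    movie_id: int
    rating: float
    timestamp: int


def parse_rating(line: str) -> Rating:
    """Split one tab-separated record into typed fields."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("\t")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


# Illustrative record: user ID, movie ID, rating, timestamp
sample = parse_rating("196\t242\t3\t881250949")
print(sample.movie_id, sample.rating)
```

A function like this could be passed to `data_rdd.map(...)` so that later steps refer to fields by name instead of by position.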

#### Basic RDD operations

In [4]:
# count() is an action that returns the number of elements in the RDD
num_ratings = data_rdd.count()
print(f"Number of ratings: {num_ratings}")

Number of ratings: 100000


In [5]:
# Parse each line of the RDD and extract the movie ID and rating
# u.data is tab-separated (user ID, movie ID, rating, timestamp), so fields 1 and 2 hold the movie ID and the rating
# The map transformation applies a function to each element of the RDD and returns a new RDD
# First split each line, then build key-value tuples of an integer movie ID and a float rating
ratings_rdd = data_rdd.map(lambda line: line.split("\t")[1:3])
ratings_rdd = ratings_rdd.map(lambda x: (int(x[0]), float(x[1])))

# Calculate the total rating and count per movie
# The initial value for each movie is a tuple (0, 0), where the first element represents the total rating and the second element represents the count of ratings
# Inside the first lambda, we update the accumulator by adding the rating to the total rating and incrementing the count by 1 for each rating
# In the second lambda, we merge the results from different partitions of the RDD. acc1 and acc2 represent accumulators from different partitions
movie_rating_counts = ratings_rdd.aggregateByKey(
    (0, 0),
    lambda acc, rating: (acc[0] + rating, acc[1] + 1),
    lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]),
)
# Calculate the average rating per movie. We use the mapValues to apply a function to the values of each key-value pair in the RDD
average_rating_per_movie = movie_rating_counts.mapValues(lambda x: x[0] / x[1])

print("Average rating per movie:")
for movie_id, avg_rating in average_rating_per_movie.take(5):
    print(f"Movie ID: {movie_id}, Average rating: {avg_rating:.2f}")

Average rating per movie:
Movie ID: 242, Average rating: 3.99
Movie ID: 302, Average rating: 4.16
Movie ID: 346, Average rating: 3.64
Movie ID: 474, Average rating: 4.25
Movie ID: 86, Average rating: 3.94
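To make the two-function contract of `aggregateByKey` concrete, here is a plain-Python imitation that runs without Spark. It is only a sketch of the semantics: `aggregate_by_key` is a hypothetical helper, the two inner lists stand in for partitions, and the ratings are made-up values rather than rows from the dataset.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Imitate aggregateByKey over a list of partitions (illustrative only)."""
    partials = []
    for partition in partitions:
        acc = {}
        for key, value in partition:
            # seq_op folds one value into the per-partition accumulator
            acc[key] = seq_op(acc.get(key, zero), value)
        partials.append(acc)

    merged = {}
    for acc in partials:
        for key, partial in acc.items():
            # comb_op merges accumulators that came from different partitions
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


# Two pretend partitions of (movie_id, rating) pairs
partitions = [
    [(242, 3.0), (302, 4.0), (242, 5.0)],
    [(242, 4.0), (302, 5.0)],
]
totals = aggregate_by_key(
    partitions,
    (0, 0),
    lambda acc, rating: (acc[0] + rating, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
averages = {movie_id: total / count for movie_id, (total, count) in totals.items()}
print(averages)
```

The first function only ever sees one rating at a time, while the second only ever sees two partial `(total, count)` tuples, which is why the accumulator type can differ from the value type.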


In [6]:
# Calculate the total number of ratings per movie
total_ratings_per_movie = movie_rating_counts.mapValues(lambda x: x[1])
# Each key is a distinct movie, so count() here gives the number of movies, not the number of ratings
num_movies = total_ratings_per_movie.count()

print("Total ratings per movie:")
for movie_id, total_ratings in total_ratings_per_movie.take(5):
    print(f"Movie ID: {movie_id}, Number of ratings: {total_ratings}")
print(f"Number of movies: {num_movies}")

# Keep only movies with at least 100 ratings
popular_movies = movie_rating_counts.filter(lambda x: x[1][1] >= 100)
num_popular_movies = popular_movies.count()

print("\nPopular movies:")
for movie_id, rating_info in popular_movies.take(5):
    print(f"Movie ID: {movie_id}, Number of ratings: {rating_info[1]}")
print(f"Number of popular movies: {num_popular_movies}")

Total ratings per movie:
Movie ID: 242, Number of ratings: 117
Movie ID: 302, Number of ratings: 297
Movie ID: 346, Number of ratings: 126
Movie ID: 474, Number of ratings: 194
Movie ID: 86, Number of ratings: 150
Number of movies: 1682

Popular movies:
Movie ID: 242, Number of ratings: 117
Movie ID: 302, Number of ratings: 297
Movie ID: 346, Number of ratings: 126
Movie ID: 474, Number of ratings: 194
Movie ID: 86, Number of ratings: 150
Number of popular movies: 338


In [7]:
# Find the most and least rated movies
most_rated_movies = movie_rating_counts.map(lambda x: (x[0], x[1][1])).sortBy(lambda x: x[1], ascending=False)
least_rated_movies = movie_rating_counts.map(lambda x: (x[0], x[1][1])).sortBy(lambda x: x[1])

print("Most rated movies:")
for movie_id, num_ratings in most_rated_movies.take(5):
    print(f"Movie ID: {movie_id}, Number of ratings: {num_ratings}")

print("\nLeast rated movies:")
for movie_id, num_ratings in least_rated_movies.take(5):
    print(f"Movie ID: {movie_id}, Number of ratings: {num_ratings}")

Most rated movies:
Movie ID: 50, Number of ratings: 583
Movie ID: 258, Number of ratings: 509
Movie ID: 100, Number of ratings: 508
Movie ID: 181, Number of ratings: 507
Movie ID: 294, Number of ratings: 485

Least rated movies:
Movie ID: 1348, Number of ratings: 1
Movie ID: 1320, Number of ratings: 1
Movie ID: 1492, Number of ratings: 1
Movie ID: 1364, Number of ratings: 1
Movie ID: 830, Number of ratings: 1
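`sortBy` orders the whole RDD even though only five elements are read afterwards. When only the extremes are needed, a bounded selection is enough. The sketch below shows that idea in plain Python with `heapq`, using a handful of `(movie_id, count)` pairs taken from the outputs above; the list itself is an illustrative subset, not the full result.

```python
import heapq

# (movie_id, number_of_ratings) pairs copied from the printed outputs
counts = [
    (50, 583), (242, 117), (258, 509), (302, 297),
    (100, 508), (181, 507), (294, 485),
]

# Select the extremes without sorting the full list
top3 = heapq.nlargest(3, counts, key=lambda pair: pair[1])
bottom2 = heapq.nsmallest(2, counts, key=lambda pair: pair[1])

print(top3)
print(bottom2)
```

On an RDD, the analogous calls are `top(5, key=lambda x: x[1])` and `takeOrdered(5, key=lambda x: x[1])`, which return a small list to the driver instead of producing a fully sorted RDD.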


In [8]:
# Find users who rated the most movies
# ratings_rdd is keyed by movie ID, so the user ID (first column of u.data) is parsed from data_rdd here
# We assign a count of 1 to each rating and then use reduceByKey to sum the counts for each user ID
user_rating_counts = data_rdd.map(lambda line: (int(line.split("\t")[0]), 1)).reduceByKey(lambda x, y: x + y)
most_active_users = user_rating_counts.sortBy(lambda x: x[1], ascending=False)

print("Users who rated the most movies:")
for user_id, num_ratings in most_active_users.take(5):
    print(f"User ID: {user_id}, Number of ratings: {num_ratings}")


In [9]:
# Find the distribution of ratings and sort by key
# countByKey is an action that counts the occurrences of each key (here the rating value) and returns a dict to the driver
rating_distribution = ratings_rdd.map(lambda x: (x[1], 1)).countByKey()
sorted_rating_distribution = sorted(rating_distribution.items())

print("Rating distribution:")
for rating, count in sorted_rating_distribution:
    print(f"Rating: {rating}, Count: {count}")

Rating distribution:
Rating: 1.0, Count: 6110
Rating: 2.0, Count: 11370
Rating: 3.0, Count: 27145
Rating: 4.0, Count: 34174
Rating: 5.0, Count: 21201
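The distribution above is enough to recover the total number of ratings and the overall mean rating. As a small worked example, the snippet below recomputes both from the printed counts; the dictionary is copied from the output rather than read from Spark.

```python
# Rating value -> count, copied from the printed distribution
rating_distribution = {1.0: 6110, 2.0: 11370, 3.0: 27145, 4.0: 34174, 5.0: 21201}

total_count = sum(rating_distribution.values())
weighted_sum = sum(rating * count for rating, count in rating_distribution.items())
mean_rating = weighted_sum / total_count

print(f"{total_count} ratings, mean {mean_rating:.3f}")
```

The total matches the 100000 ratings counted earlier, which is a quick consistency check on the parsing step.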


In [10]:
# Group ratings by movie with groupByKey. ratings_rdd is already keyed by movie ID, so no extra map is needed
# The value for each key is an iterable over all ratings for that movie
ratings_by_movie = ratings_rdd.groupByKey()

print("Ratings by movie:")
for movie_id, ratings_list in ratings_by_movie.take(2):
    print(f"Movie ID: {movie_id}, Ratings: {list(ratings_list)}")

# Alternative to aggregateByKey: map each rating to (rating, 1), then sum both components per movie ID with reduceByKey
movie_rating_sum_count = ratings_rdd.map(lambda x: (x[0], (x[1], 1))).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
# Calculate average rating for each movie
average_rating_per_movie = movie_rating_sum_count.mapValues(lambda x: x[0] / x[1])

print("\nAverage rating per movie:")
for movie_id, avg_rating in average_rating_per_movie.take(5):
    print(f"Movie ID: {movie_id}, Average rating: {avg_rating:.2f}")

Ratings by movie:
Movie ID: 242, Ratings: [3.0, 3.0, 5.0, 3.0, 5.0, 4.0, 5.0, 4.0, 4.0, 4.0, 2.0, 5.0, 5.0, 2.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 4.0, 5.0, 4.0, 5.0, 5.0, 5.0, 2.0, 4.0, 3.0, 5.0, 1.0, 5.0, 5.0, 3.0, 5.0, 1.0, 4.0, 5.0, 3.0, 4.0, 5.0, 4.0, 1.0, 4.0, 5.0, 5.0, 3.0, 4.0, 4.0, 4.0, 5.0, 5.0, 4.0, 4.0, 5.0, 4.0, 4.0, 4.0, 5.0, 5.0, 3.0, 4.0, 4.0, 4.0, 4.0, 5.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 3.0, 5.0, 3.0, 5.0, 3.0, 4.0, 4.0, 4.0, 3.0, 5.0, 5.0, 4.0, 5.0, 2.0, 4.0, 5.0, 4.0, 3.0, 4.0, 4.0, 4.0, 5.0, 5.0, 3.0, 4.0, 2.0, 1.0, 5.0, 4.0, 5.0, 3.0, 4.0, 4.0, 3.0, 4.0, 4.0, 4.0, 4.0, 5.0, 4.0, 3.0, 4.0, 3.0]
Movie ID: 302, Ratings: [3.0, 4.0, 4.0, 4.0, 3.0, 5.0, 3.0, 4.0, 5.0, 4.0, 2.0, 4.0, 4.0, 4.0, 4.0, 3.0, 4.0, 3.0, 5.0, 5.0, 3.0, 5.0, 5.0, 1.0, 4.0, 4.0, 4.0, 5.0, 5.0, 4.0, 4.0, 4.0, 5.0, 4.0, 5.0, 5.0, 4.0, 4.0, 5.0, 5.0, 5.0, 4.0, 3.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 3.0, 5.0, 4.0, 3.0, 3.0, 2.0, 4.0, 4.0, 5.0, 4.0, 5.0, 4.0, 5.0, 2.0, 4.0, 4.0, 5.0, 5.0, 5.

In [11]:
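# Import movie titles and create key-value pairs with the movie ID as key and a (title, genre field) tuple as value
# u.item is pipe-separated; the columns from index 5 onward are binary genre flags, and x[5] is the first of them (the "unknown" genre)
# The "Genres" value printed below is therefore that single 0/1 flag, not a list of genre names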
movie_titles_genres_rdd = sc.textFile("ml-100k/ml-100k/u.item").map(lambda line: line.split("|")).map(lambda x: (int(x[0]), (x[1], x[5])))

print("Values of movie_titles_genres_rdd:")
for movie_id, title_genre_tuple in movie_titles_genres_rdd.take(5):
    print(f"Movie ID: {movie_id}, Title: {title_genre_tuple[0]}, Genres: {title_genre_tuple[1]}")

# Join movie titles and genres with ratings based on movie ID
movies_with_ratings = average_rating_per_movie.join(movie_titles_genres_rdd)

print("\nMovies with averaged rating:")
for movie_id, (avg_rating, title_genre_tuple) in movies_with_ratings.take(5):
    print(f"Movie ID: {movie_id}, Average rating: {avg_rating:.2f}, Title: {title_genre_tuple[0]}, Genres: {title_genre_tuple[1]}")

Values of movie_titles_genres_rdd:
Movie ID: 1, Title: Toy Story (1995), Genres: 0
Movie ID: 2, Title: GoldenEye (1995), Genres: 0
Movie ID: 3, Title: Four Rooms (1995), Genres: 0
Movie ID: 4, Title: Get Shorty (1995), Genres: 0
Movie ID: 5, Title: Copycat (1995), Genres: 0

Movies with averaged rating:
Movie ID: 40, Average rating: 2.89, Title: To Wong Foo, Thanks for Everything! Julie Newmar (1995), Genres: 0
Movie ID: 1184, Average rating: 2.50, Title: Endless Summer 2, The (1994), Genres: 0
Movie ID: 392, Average rating: 3.54, Title: Man Without a Face, The (1993), Genres: 0
Movie ID: 144, Average rating: 3.87, Title: Die Hard (1988), Genres: 0
Movie ID: 768, Average rating: 3.08, Title: Casper (1995), Genres: 0
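In `u.item`, the fields after the IMDb URL are 19 binary genre flags, so a single column carries only one flag. A minimal sketch of decoding all the flags into genre names is shown below in plain Python. `GENRE_NAMES` follows the order assumed from the dataset's `u.genre` file, `parse_item` is a hypothetical helper, and the sample line is built by hand with a placeholder URL rather than copied from the file.

```python
# Genre order assumed to match u.genre (index 0 is "unknown")
GENRE_NAMES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]


def parse_item(line):
    """Return (movie_id, (title, [genre names])) for one u.item record."""
    fields = line.rstrip("\n").split("|")
    flags = fields[5:5 + len(GENRE_NAMES)]
    genres = [name for name, flag in zip(GENRE_NAMES, flags) if flag == "1"]
    return int(fields[0]), (fields[1], genres)


# Hand-built sample record: ID, title, release date, video date, URL, then 19 flags
flags = ["0"] * len(GENRE_NAMES)
for index in (3, 4, 5):  # Animation, Children's, Comedy
    flags[index] = "1"
sample_line = "|".join(
    ["1", "Toy Story (1995)", "01-Jan-1995", "", "http://example.com/placeholder"] + flags
)

movie_id, (title, genres) = parse_item(sample_line)
print(movie_id, title, genres)
```

Joining the decoded names with commas would produce a string that the later `split(",")` calls can consume, though the recorded outputs in this notebook come from the single-flag version.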


In [12]:
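# Count the distinct values of the genre field
# Because the field holds a single 0/1 flag (see the u.item parsing above), this yields 2 rather than the number of genre names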
distinct_genres_count = movie_titles_genres_rdd.flatMap(lambda x: x[1][1].split(",")).distinct().count()

print(f"Number of genres: {distinct_genres_count}")

# Calculate the average rating per genre value
# map transforms each element into exactly one element, while flatMap can transform each element into zero or more elements
# We use flatMap because a movie can belong to multiple genres, so one (genre, average rating) pair is emitted per genre of the movie
# After creating the key-value pairs, we use groupByKey to group the per-movie averages by genre
# The result is the mean of per-movie averages, so every movie counts equally regardless of how many ratings it has
ratings_by_genre = movies_with_ratings.flatMap(lambda x: [(genre, x[1][0]) for genre in x[1][1][1].split(",")]).groupByKey()
average_rating_per_genre = ratings_by_genre.mapValues(lambda x: sum(x) / len(x))

print("\nAverage rating per genre:")
for genre, avg_rating in average_rating_per_genre.take(5):
    print(f"Genre: {genre}, Average rating: {avg_rating:.2f}")

Number of genres: 2

Average rating per genre:
Genre: 0, Average rating: 3.08
Genre: 1, Average rating: 2.22
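The difference between `map` and `flatMap` described in the comments can be seen on a tiny in-memory example. The sketch below uses plain Python lists in place of RDDs, with `itertools.chain.from_iterable` standing in for the flattening step. The movies, genres, and ratings are made-up values for illustration.

```python
from collections import defaultdict
from itertools import chain

# (title, average rating, comma-separated genres): illustrative values only
movies = [("Movie A", 4.0, "Comedy,Drama"), ("Movie B", 3.0, "Drama")]


def explode(movie):
    """Emit one (genre, rating) pair per genre of the movie."""
    _, avg_rating, genres = movie
    return [(genre, avg_rating) for genre in genres.split(",")]


mapped = [explode(movie) for movie in movies]   # map: one list per movie
flat = list(chain.from_iterable(mapped))        # flatMap: pairs from all movies

# Group by genre and average, like groupByKey followed by mapValues
grouped = defaultdict(list)
for genre, rating in flat:
    grouped[genre].append(rating)
averages = {genre: sum(values) / len(values) for genre, values in grouped.items()}

print(len(mapped), len(flat), averages)
```

With `map` the result still has one element per movie, while the flattened version has one element per (movie, genre) combination, which is the shape the per-genre grouping needs.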


In [13]:
# Stop SparkContext
sc.stop()