## Explicit vs Implicit Feedback in Recommendation Systems

### **Explicit Feedback**
- **Definition**: Users provide direct ratings or scores for items (e.g., 1–5 stars).
- **Example**: Movie ratings in the MovieLens dataset.
- **ALS Setting**: `implicitPrefs=False`


### **Implicit Feedback**
- **Definition**: Preferences inferred from user behavior rather than direct ratings.
- **Examples**:
  - Clicks, views, purchases, watch time.
- **Characteristics**:
  - Data is **abundant** but **noisy** (not all interactions mean positive preference).
  - Requires interpreting signals (e.g., a click might not mean the user liked the item).
- **ALS Setting**: `implicitPrefs=True`
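
To make the "interpreting signals" point concrete, here is a minimal sketch of the preference/confidence transformation commonly used for implicit-feedback ALS: every observed interaction becomes a binary preference, and the raw signal strength only controls how much weight that observation gets. The function name `to_preference_confidence`, the `alpha` value, and the view counts are illustrative assumptions, not part of the exercise code.

```python
def to_preference_confidence(strength, alpha=1.0):
    """Map a raw interaction strength to (preference, confidence).

    preference: 1.0 if any interaction was observed, else 0.0
    confidence: 1 + alpha * strength, so stronger signals count more
    `alpha` is an illustrative value chosen for this sketch.
    """
    if strength < 0:
        raise ValueError("this sketch assumes non-negative interaction strengths")
    preference = 1.0 if strength > 0 else 0.0
    confidence = 1.0 + alpha * strength
    return preference, confidence


# Hypothetical interaction counts (e.g. number of views) for one user
views = {"movie_a": 0, "movie_b": 1, "movie_c": 12}
weighted = {item: to_preference_confidence(n, alpha=2.0) for item, n in views.items()}
for item, (p, c) in weighted.items():
    print(f"{item}: preference={p}, confidence={c}")
```

Note that an unobserved item still gets a small non-zero confidence, so "no interaction" is treated as weak evidence of no preference rather than as missing data.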


### **Key Differences**
| Aspect           | Explicit Feedback           | Implicit Feedback                      |
|------------------|-----------------------------|----------------------------------------|
| Source           | Direct ratings              | Behavioral signals                     |
| Signal accuracy  | High (stated by the user)   | Lower (inferred from behavior)         |
| Data volume      | Sparse (few users rate)     | Abundant (every interaction is logged) |
| ALS parameter    | `implicitPrefs=False`       | `implicitPrefs=True`                   |



**Key Takeaway**
- When ratings are available, explicit feedback models are usually the better fit, because they are trained directly on the stated preferences.
- Implicit feedback models are useful when ratings are missing but interaction data exists.
- Fitting both models on the same data shows how the explicit and implicit assumptions change the resulting recommendations.



**In this exercise, we will build two recommendation systems with PySpark, using the whole dataset rather than only the records with a rating greater than 4.**

In [0]:

import pandas as pd
# Import Spark ALS and evaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# Load the ratings file with pandas (tab-separated, no header row)
file_path = '/Volumes/workspace/default/m8100k/u.data'
user_movie_data = pd.read_csv(file_path, sep='\t', names=['userId', 'movieId', 'rating', 'timestamp'])

# Create Spark DataFrame (`spark` is the SparkSession provided by the notebook runtime)
ratings_df = spark.createDataFrame(user_movie_data)
ratings_df.show(5)


In [0]:
# Split the data and build the evaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# Split data
train_df, test_df = ratings_df.randomSplit([0.8, 0.2], seed=42)

# Evaluator
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

In [0]:
# Model 1: ALS Explicit
als_explicit = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
                   implicitPrefs=False, rank=10, maxIter=10, regParam=0.1,
                   coldStartStrategy="drop")
model_explicit = als_explicit.fit(train_df)
predictions_explicit = model_explicit.transform(test_df)
rmse_explicit = evaluator.evaluate(predictions_explicit)
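
ALS cannot produce a prediction for a user or movie that does not appear in the training split; Spark returns `NaN` for those rows, and `coldStartStrategy="drop"` removes them before evaluation. The plain-Python sketch below imitates that behaviour on a few made-up `(rating, prediction)` pairs. The helper name `rmse_drop_nan` and the numbers are assumptions for illustration only.

```python
import math


def rmse_drop_nan(pairs):
    """RMSE over (label, prediction) pairs, skipping NaN predictions."""
    kept = [(y, p) for y, p in pairs if not math.isnan(p)]
    if not kept:
        raise ValueError("no valid predictions to evaluate")
    return math.sqrt(sum((y - p) ** 2 for y, p in kept) / len(kept))


# Hypothetical (rating, prediction) pairs; NaN stands for a cold-start row
pairs = [(4.0, 3.0), (2.0, 3.0), (5.0, float("nan"))]
rmse = rmse_drop_nan(pairs)
print(rmse)
```

Without dropping the `NaN` row, the squared-error sum would itself be `NaN`, which is why the strategy matters when evaluating on a random split.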

In [0]:
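# Model 2: ALS Implicit
# With implicitPrefs=True, ALS treats the rating column as interaction strength
# and predicts a preference score rather than a 1-5 rating.
als_implicit = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
                   implicitPrefs=True, rank=10, maxIter=10, regParam=0.1,
                   coldStartStrategy="drop")
model_implicit = als_implicit.fit(train_df)
predictions_implicit = model_implicit.transform(test_df)
# Note: this RMSE compares preference scores with raw ratings, so it is not
# on the same scale as rmse_explicit.
rmse_implicit = evaluator.evaluate(predictions_implicit)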

In [0]:
print(f"RMSE Explicit: {rmse_explicit}")
print(f"RMSE Implicit: {rmse_implicit}")
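
Because the implicit model outputs preference scores rather than ratings, the two RMSE values are not directly comparable. Ranking metrics are a common alternative for implicit models. The following is a minimal plain-Python sketch of precision@k; the function name `precision_at_k` and the sample IDs are hypothetical and are not taken from the dataset.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the first k recommended items that appear in `relevant`."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical data: a ranked recommendation list and a user's held-out liked items
recommended = [10, 42, 7, 99, 3]
relevant = {42, 3, 500}
p_at_5 = precision_at_k(recommended, relevant, k=5)
print(p_at_5)
```

In practice the ranked list could come from the fitted model's `recommendForAllUsers` output and the relevant set from the test split, but that wiring is left out of this sketch.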

**Task 2: Find the top 5 movies similar to movie_id 10 and the top 5 users similar to user_id 100 using the two models.**

**Present your results in a clear visual format.**

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

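def cosine_sim(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0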
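
The same top-k lookup is repeated for items and users with both models, so a single vectorised helper can be convenient. The sketch below normalises all factor vectors once and ranks them against the target with one matrix-vector product. The name `top_k_similar` and the tiny 2-D factors are assumptions for illustration; real ALS factors here would have `rank=10` dimensions.

```python
import numpy as np


def top_k_similar(ids, features, target_id, k=5):
    """Return [(id, cosine similarity)] for the k rows most similar to target_id."""
    ids = np.asarray(ids)
    mat = np.asarray(features, dtype=float)
    norms = np.linalg.norm(mat, axis=1)
    norms[norms == 0] = 1.0            # avoid dividing by zero for all-zero rows
    unit = mat / norms[:, None]        # L2-normalise every row
    matches = np.flatnonzero(ids == target_id)
    if matches.size == 0:
        raise KeyError(f"id {target_id} not found")
    target_idx = int(matches[0])
    sims = unit @ unit[target_idx]     # cosine similarity to the target row
    # Rank by descending similarity, skipping the target itself
    order = [int(i) for i in np.argsort(-sims) if int(i) != target_idx][:k]
    return [(int(ids[i]), float(sims[i])) for i in order]


# Hypothetical 2-D factors for four items
ids = [1, 2, 3, 4]
features = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = top_k_similar(ids, features, target_id=1, k=2)
print(result)
```

With a factor table converted via `toPandas()`, a call along the lines of `top_k_similar(df['id'], df['features'].tolist(), 10)` should work, assuming every `features` entry has the same length.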





In [0]:
# Get item factors from Explicit ALS model
item_exp = model_explicit.itemFactors.toPandas()

target_movie_id = 10

# Get target vector
target_vec_exp = item_exp[item_exp['id'] == target_movie_id]['features'].values[0]

# Compute similarity
item_exp['similarity'] = item_exp['features'].apply(lambda v: cosine_sim(target_vec_exp, v))

# Remove itself and get top 5
top5_movies_explicit = (
    item_exp[item_exp['id'] != target_movie_id]
    .sort_values('similarity', ascending=False)
    .head(5)
)

print("Top 5 similar movies (Explicit ALS)")
display(top5_movies_explicit)


In [0]:
plt.figure(figsize=(8,4))
plt.bar(top5_movies_explicit['id'].astype(str),
        top5_movies_explicit['similarity'])
plt.title("Top 5 Movies Similar to Movie ID 10 (Explicit ALS)")
plt.xlabel("Movie ID")
plt.ylabel("Cosine Similarity")
plt.show()


In [0]:
# Get item factors from Implicit ALS
item_imp = model_implicit.itemFactors.toPandas()

target_vec_imp = item_imp[item_imp['id'] == target_movie_id]['features'].values[0]

item_imp['similarity'] = item_imp['features'].apply(lambda v: cosine_sim(target_vec_imp, v))

top5_movies_implicit = (
    item_imp[item_imp['id'] != target_movie_id]
    .sort_values('similarity', ascending=False)
    .head(5)
)

print("Top 5 similar movies (Implicit ALS)")
display(top5_movies_implicit)


In [0]:
plt.figure(figsize=(8,4))
plt.bar(top5_movies_implicit['id'].astype(str),
        top5_movies_implicit['similarity'])
plt.title("Top 5 Movies Similar to Movie ID 10 (Implicit ALS)")
plt.xlabel("Movie ID")
plt.ylabel("Cosine Similarity")
plt.show()


In [0]:
# Get user factors from Explicit ALS model
user_exp = model_explicit.userFactors.toPandas()

target_user_id = 100
target_user_vec_exp = user_exp[user_exp['id'] == target_user_id]['features'].values[0]

user_exp['similarity'] = user_exp['features'].apply(lambda v: cosine_sim(target_user_vec_exp, v))

top5_users_explicit = (
    user_exp[user_exp['id'] != target_user_id]
    .sort_values('similarity', ascending=False)
    .head(5)
)

print("Top 5 similar users (Explicit ALS)")
display(top5_users_explicit)


In [0]:
plt.figure(figsize=(7,4))
plt.bar(top5_users_explicit['id'].astype(str),
        top5_users_explicit['similarity'])
plt.title("Top 5 Users Similar to User ID 100 (Explicit ALS)")
plt.xlabel("User ID")
plt.ylabel("Cosine Similarity")
plt.show()


In [0]:
# Get user factors from Implicit ALS model
user_imp = model_implicit.userFactors.toPandas()

target_user_vec_imp = user_imp[user_imp['id'] == target_user_id]['features'].values[0]

user_imp['similarity'] = user_imp['features'].apply(lambda v: cosine_sim(target_user_vec_imp, v))

top5_users_implicit = (
    user_imp[user_imp['id'] != target_user_id]
    .sort_values('similarity', ascending=False)
    .head(5)
)

print("Top 5 similar users (Implicit ALS)")
display(top5_users_implicit)


In [0]:
plt.figure(figsize=(7,4))
plt.bar(top5_users_implicit['id'].astype(str),
        top5_users_implicit['similarity'])
plt.title("Top 5 Users Similar to User ID 100 (Implicit ALS)")
plt.xlabel("User ID")
plt.ylabel("Cosine Similarity")
plt.show()
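
One simple way to quantify how much the explicit and implicit assumptions change the neighbours is the Jaccard overlap between the two top-5 lists. The sketch below is a plain-Python illustration; `jaccard_overlap` is a hypothetical helper and the ID lists are made up, not results from the models above.

```python
def jaccard_overlap(ids_a, ids_b):
    """Jaccard similarity of two ID collections: |A & B| / |A | B|."""
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical top-5 IDs from two models
explicit_top5 = [12, 45, 78, 101, 230]
implicit_top5 = [45, 78, 300, 412, 12]
overlap = jaccard_overlap(explicit_top5, implicit_top5)
print(overlap)
```

A value near 1 would mean both models agree on the neighbours; a value near 0 would mean the two feedback assumptions lead to largely different neighbourhoods. With the tables above, the inputs could be `top5_movies_explicit['id']` and `top5_movies_implicit['id']`.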
