## Step 3: Baseline Model - Popularity-Based Recommendation System

### Objective
In this notebook, we will implement a **Popularity-Based Recommendation System** as a baseline model. This simple approach recommends items based on their overall popularity, regardless of user preferences. It helps us understand the dataset and serves as a benchmark for more complex models.

---

### Why a Popularity-Based Model?
- **Simplicity**: Recommends items purely based on their popularity.
- **Interpretability**: The ranking is easy to explain and shows which items dominate the dataset.
- **Foundation**: Provides a reference point to compare against more advanced models.

---

### How the Model Works
1. **Aggregate Popularity**:
   - Compute the total number of interactions or the average rating for each item.
   - Higher counts indicate more popular items. A high average is only meaningful when the item has enough ratings to support it.
2. **Recommendation**:
   - Recommend the top-N most popular items to all users.

---

### Steps to Implement

#### 1. Compute Item Popularity
- Aggregate interactions (e.g., ratings) per item.
- Sort items by their popularity (e.g., descending average rating or interaction count).
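
The aggregation above can be sketched on a tiny made-up ratings table. The `min_ratings` threshold is an assumption added here, not part of the steps above: it keeps items with only one or two ratings from topping an average-rating ranking. All values and variable names are for illustration only.

```python
import pandas as pd

# Toy ratings table (made-up values, for illustration only)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4],
    "movieId": [10, 20, 10, 30, 10, 20, 40],
    "rating":  [4.0, 3.0, 5.0, 5.0, 3.0, 4.0, 5.0],
})

# One row per item: how many ratings it received and their mean
item_stats = (
    toy_ratings.groupby("movieId")["rating"]
    .agg(rating_count="count", average_rating="mean")
    .reset_index()
)

# Popularity by interaction count (ties broken by average rating)
by_count = item_stats.sort_values(
    ["rating_count", "average_rating"], ascending=False
)

# Popularity by average rating, restricted to items with enough ratings.
# min_ratings is a hypothetical threshold chosen for this toy example.
min_ratings = 2
by_average = (
    item_stats[item_stats["rating_count"] >= min_ratings]
    .sort_values("average_rating", ascending=False)
)

print(by_count)
print(by_average)
```

Without the threshold, movies 30 and 40 (one 5.0 rating each) would lead the average-rating ranking even though almost nobody has rated them.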

#### 2. Generate Recommendations
- For each user, recommend the top-N most popular items.
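
A minimal sketch of this step, assuming a popularity ranking has already been computed. Skipping items the user has already interacted with is an optional refinement assumed here, not something the step requires; the function and variable names are hypothetical.

```python
def recommend_top_n(ranked_items, seen_items=(), n=10):
    """Return the first n items of ranked_items that are not in seen_items."""
    seen = set(seen_items)
    recommendations = []
    for item in ranked_items:
        if len(recommendations) >= n:
            break
        if item not in seen:
            recommendations.append(item)
    return recommendations


# Hypothetical popularity ranking, most popular first
ranked_items = [10, 20, 30, 40, 50]

# Without a history, every user receives the same list
print(recommend_top_n(ranked_items, n=3))

# With a history, already-seen items are skipped
print(recommend_top_n(ranked_items, seen_items=[10, 30], n=3))
```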

#### 3. Evaluate (Baseline)
- Evaluate the baseline using metrics like precision, recall, or hit rate on a test set.
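
As a rough illustration of these metrics, the sketch below computes precision@k and recall@k for a single shared recommendation list and averages them over users. The helper names and the toy data are assumptions for this example; precision is divided by `k`, which is one common convention.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user."""
    relevant = set(relevant)
    n_hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = n_hits / k if k > 0 else 0.0
    recall = n_hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_precision_recall_at_k(recommended, ground_truth, k):
    """Average the per-user scores over every user in ground_truth."""
    if not ground_truth:
        return 0.0, 0.0
    scores = [precision_recall_at_k(recommended, items, k)
              for items in ground_truth.values()]
    mean_precision = sum(p for p, _ in scores) / len(scores)
    mean_recall = sum(r for _, r in scores) / len(scores)
    return mean_precision, mean_recall


# Toy example: one shared recommendation list, three users' held-out items
recommended = [10, 20, 30, 40]
ground_truth = {1: [10, 50], 2: [60], 3: [20, 30, 70]}

mean_p, mean_r = mean_precision_recall_at_k(recommended, ground_truth, k=2)
print(f"Precision@2: {mean_p:.3f}, Recall@2: {mean_r:.3f}")
```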

---

### Code Implementation (conceptual)

```python
import pandas as pd

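# Assumes `df` is a ratings DataFrame with 'userId', 'movieId', and 'rating' columns.
# Step 1: Aggregate popularity (here: average rating per item)
item_popularity = df.groupby('movieId')['rating'].mean().reset_index()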
item_popularity.rename(columns={'rating': 'average_rating'}, inplace=True)
item_popularity = item_popularity.sort_values('average_rating', ascending=False)

# View the top 5 most popular items
print("Top 5 Most Popular Items:")
print(item_popularity.head())

# Step 2: Recommend top-N items
top_n = 10
top_items = item_popularity['movieId'].head(top_n).tolist()

# Generate recommendations for a user
def recommend_popular_items(user_id, top_items):
    return top_items  # All users get the same recommendations

# Example usage
user_id = 1
recommendations = recommend_popular_items(user_id, top_items)
print(f"Recommended items for user {user_id}: {recommendations}")

# Step 3: Evaluation (Optional for Baseline)
# Split into train and test sets, and compute the hit rate of top-N items.
```

---

### Advantages of Popularity-Based Models
- **Quick to Implement**: No complex computations or user preferences required.
- **Effective for New Users**: Provides reasonable recommendations without requiring user-specific data.

### Limitations
- Does not consider user preferences, so there is no personalization.
- Recommends the same items to every user, which is a poor fit for diverse user bases.

---

This model is a great starting point and allows us to benchmark against more sophisticated approaches like collaborative filtering. Let’s proceed with this implementation!


```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Step 0: Load the dataset
# Load the MovieLens ratings file (adjust the path to your downloaded dataset)
data_path = 'data/ml-latest-small/ratings.csv'
df = pd.read_csv(data_path)

# Step 1: Aggregate popularity (average rating per item, computed on the full dataset)
item_popularity = df.groupby('movieId')['rating'].mean().reset_index()
item_popularity.rename(columns={'rating': 'average_rating'}, inplace=True)
item_popularity = item_popularity.sort_values('average_rating', ascending=False)

# View the 5 highest-rated items
print("Top 5 Most Popular Items:")
print(item_popularity.head())

# Step 2: Recommend top-N items
top_n = 10
top_items = item_popularity['movieId'].head(top_n).tolist()

# Generate recommendations for a user
def recommend_popular_items(user_id, top_items):
    return top_items  # All users get the same recommendations (user_id is only a placeholder)

print()

# Example usage
user_id = 1
recommendations = recommend_popular_items(user_id, top_items)
print(f"Recommended items for user {user_id}: {recommendations}")
print()

# Step 3: Per-user train-test split
# Each user's ratings are split separately, so every user appears in both sets.
def stratified_split(df, test_size=0.2):
    train_data = []
    test_data = []

    for _, group in df.groupby('userId'):
        user_train, user_test = train_test_split(group, test_size=test_size, random_state=42)
        train_data.append(user_train)
        test_data.append(user_test)

    return pd.concat(train_data), pd.concat(test_data)

# Perform the per-user split
df_train, df_test = stratified_split(df, test_size=0.2)

# Unique items in the top-N recommendations
print("Top-N Recommended Items:", set(top_items))

# Get ground truth for evaluation: the held-out items of each user
def get_ground_truth(test_df):
    return test_df.groupby('userId')['movieId'].apply(list).to_dict()

ground_truth = get_ground_truth(df_test)

# Evaluate hit rate
def hit_rate(top_items, ground_truth):
    hits = 0  # Count of users for whom the system recommended a relevant item
    total_users = len(ground_truth)  # Total number of users in the test set

    # Iterate over all users and their ground truth items
    for user, true_items in ground_truth.items():
        # Count a hit if any of the top-N items is in the user's ground truth
        hits += len(set(top_items).intersection(set(true_items))) > 0

    # Compute hit rate as the fraction of users with hits
    return hits / total_users

# Compute hit rate
hit_rate_value = hit_rate(top_items, ground_truth)
print()
print(f"Hit Rate: {hit_rate_value}")
```

```
Top 5 Most Popular Items:
      movieId  average_rating
7638    88448             5.0
8089   100556             5.0
9065   143031             5.0
9076   143511             5.0
9078   143559             5.0

Recommended items for user 1: [88448, 100556, 143031, 143511, 143559, 6201, 102217, 102084, 6192, 145994]

Top-N Recommended Items: {88448, 102084, 143559, 143031, 102217, 145994, 100556, 6192, 143511, 6201}

Hit Rate: 0.003278688524590164
```


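The hit rate is very low: only about 0.3% of test users have any of the ten recommended movies among their held-out ratings. The non-personalized approach contributes to this, but the ranking criterion matters as well. All five top items have a perfect 5.0 average, which typically means they were rated by only a handful of users, so few test users could have interacted with them. Ranking by interaction count, or requiring a minimum number of ratings, is likely to give a stronger baseline. Note also that popularity was computed on the full dataset rather than on `df_train`, so the ranking has seen the test ratings.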
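
To see how the ranking criterion can affect hit rate, the sketch below compares an average-rating ranking with a count-based ranking on a small made-up dataset, computing popularity from the training portion only. The data, the helper names (`top_n_by`, `toy_hit_rate`), and the choice of two recommended items are assumptions for illustration; the resulting numbers say nothing about MovieLens.

```python
import pandas as pd

# Made-up training and test interactions (illustration only)
train = pd.DataFrame({
    "userId":  [1, 2, 3, 4, 1, 2, 3, 4],
    "movieId": [10, 10, 10, 20, 20, 30, 40, 50],
    "rating":  [4.0, 4.0, 3.0, 4.0, 3.0, 5.0, 4.5, 4.0],
})
test = pd.DataFrame({
    "userId":  [1, 2, 3, 4],
    "movieId": [30, 20, 20, 10],
})


def top_n_by(train_df, how, n):
    """Rank items by the 'count' or 'mean' of their training ratings."""
    scores = train_df.groupby("movieId")["rating"].agg(how)
    return scores.sort_values(ascending=False).head(n).index.tolist()


def toy_hit_rate(top_items, test_df):
    """Fraction of test users with at least one held-out item in top_items."""
    held_out = test_df.groupby("userId")["movieId"].apply(list).to_dict()
    top = set(top_items)
    hits = sum(1 for items in held_out.values() if top & set(items))
    return hits / len(held_out)


top_by_mean = top_n_by(train, "mean", n=2)
top_by_count = top_n_by(train, "count", n=2)

print("By average rating:", top_by_mean, toy_hit_rate(top_by_mean, test))
print("By rating count:  ", top_by_count, toy_hit_rate(top_by_count, test))
```

In this toy data the items with the highest averages each have a single rating, so the count-based list overlaps with more users' held-out items.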