# Recommender Systems Development for Marketing (Assignment): Content-Based Recommender System for Yelp and Netflix

### Import Libraries & Set Config

In [73]:
import pandas as pd
import numpy as np
import json
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.neighbors import NearestNeighbors


### File Paths and Dataset Selection

In [74]:
row_limit = 100000
netflix_path = '/Users/mandidisha/Downloads/SDMmandi/training_data.csv'
yelp_path = '/Users/mandidisha/Downloads/SDMmandi/Yelp/yelp_academic_dataset_review.json'
dataset_choice = input("Choose the dataset to import (netflix or yelp): ").strip().lower()
recommend_func = None


In [75]:
restaurants_df = pd.read_csv("/Users/mandidisha/Downloads/SDMmandi/Yelp/restaurants.csv")


## Functions for Measuring Diversity & Novelty

### 1. `calculate_diversity(recommendations, feature_vectors)`

**What It Does Internally**  
- **Input**:  
  1. A list of `recommendations` (item IDs).  
  2. A corresponding matrix `feature_vectors` where each row represents an item’s features (e.g., TF-IDF, embeddings, etc.).  
- **Steps**:  
  1. **Compute Cosine Similarity**:  
     `cosine_similarity(feature_vectors)` returns an \(n \times n\) matrix, where each cell \((i, j)\) is the similarity between items \(i\) and \(j\).  
  2. **Extract Upper Triangle**:  
     We use `np.triu_indices(len(sims), k=1)` to collect all the unique pairwise similarities, ignoring the diagonal and duplicate pairs.  
  3. **Translate Similarity to Diversity**:  
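     The final score is \(1 - \text{mean(similarities)}\). If items are highly similar, the mean is large and the diversity is small. If items are quite different, the mean is small and the diversity is large (approaching 1.0 for non-negative features such as TF-IDF; with standardized features, cosine similarity can be negative, so the score can exceed 1.0).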
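
As a rough illustration of the three steps, the sketch below computes the score for three hand-made feature vectors, using plain NumPy in place of `cosine_similarity`. The vectors and the helper name `diversity_sketch` are assumptions made for this example and are not part of the notebook.

```python
import numpy as np

def diversity_sketch(feature_vectors):
    """1 - mean pairwise cosine similarity (hypothetical helper for illustration)."""
    X = np.asarray(feature_vectors, dtype=float)
    if len(X) <= 1:
        return 0.0
    # Row-normalise so that dot products equal cosine similarities.
    # Assumes no all-zero rows, which would divide by zero here.
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Keep each unordered pair once, skipping the diagonal.
    upper = sims[np.triu_indices(len(sims), k=1)]
    return 1.0 - float(np.mean(upper))

# Two identical items and one orthogonal item (made-up vectors).
toy_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_diversity = diversity_sketch(toy_vectors)
print(round(toy_diversity, 4))
```

Of the three pairs, one has similarity 1 and two have similarity 0, so the mean similarity is 1/3 and the diversity is 2/3.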

### 2. `calculate_novelty(recommendations, popularity_dict)`

**What It Does Internally**  
- **Input**:  
  1. A list of recommended `item_id`s.  
  2. A `popularity_dict` mapping each item to its frequency in the dataset.  
- **Steps**:  
  1. **Sum Over All Items**:  
     `total_items = sum(popularity_dict.values())` gives the total interaction count in your dataset.  
  2. **Compute Probability**:  
     For each recommended item, we do  
     \[
       \text{prob} = \frac{\text{freq}(item)}{\text{total\_items}}
     \]  
     If the item doesn’t exist in `popularity_dict`, its frequency defaults to 1, i.e. it is treated as a rare item.  
  3. **Self-Information**:  
     Novelty is defined as \(-\log_{2}(\text{prob})\). This is also known as the **information content**: items that appear less often have a lower probability and thus a higher novelty score.  
  4. **Average Across Recommendations**:  
     We aggregate these item-level novelty scores by taking the mean.
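
To make the arithmetic concrete, here is a small worked sketch with made-up interaction counts. The item names, the `toy_popularity` dictionary, and the helper `self_information` are hypothetical and only serve the example.

```python
import math

# Hypothetical interaction counts; the item names are made up.
toy_popularity = {"blockbuster": 6, "cult_classic": 2}
total = sum(toy_popularity.values())  # 8 interactions overall

def self_information(item_id):
    # Unseen items fall back to a count of 1, matching the default described above.
    freq = toy_popularity.get(item_id, 1)
    return -math.log2(freq / total)

toy_recs = ["blockbuster", "cult_classic", "unseen_item"]
scores = [self_information(item) for item in toy_recs]
toy_novelty = sum(scores) / len(scores)
print(scores, toy_novelty)
```

The popular item (probability 0.75) contributes about 0.415 bits, while the items with probabilities 0.25 and 0.125 contribute 2 and 3 bits, so rarer items pull the average up.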

### Why It Matters
- **Diversity** shows whether you’re recommending a range of distinct items or just slight variations of the same thing.  
- **Novelty** tells you if your recommendations are relatively unknown (potential for discovery) or the most common/popular choices (familiar but less exciting).

Putting it all together, these metrics provide insight into how “fresh” and “varied” your recommendation lists are, complementing more traditional accuracy-oriented measures.


In [76]:
from sklearn.metrics.pairwise import cosine_similarity

def calculate_diversity(recommendations, feature_vectors):
    if len(recommendations) <= 1:
        return 0.0
    sims = cosine_similarity(feature_vectors)
    upper_triangle = sims[np.triu_indices(len(sims), k=1)]
    return 1 - np.mean(upper_triangle)

def calculate_novelty(recommendations, popularity_dict):
    novelty_scores = []
    total_items = sum(popularity_dict.values())
    if total_items == 0:
        return 0.0
    for item_id in recommendations:
        freq = popularity_dict.get(item_id, 1)
        prob = freq / total_items
        prob = max(prob, 1e-10) 
        novelty = -np.log2(prob)
        novelty_scores.append(novelty)
    return np.mean(novelty_scores) if novelty_scores else 0.0


## Content-Based Recommendation System: Netflix vs Yelp

This section implements a content-based recommendation system for two datasets using different strategies tailored to their structures:

---

### Netflix Movie Recommendations

**Goal**: Recommend movies similar to a given title based on metadata and user engagement statistics.

#### **Pipeline**:
1. **Load & Clean Data**:
   - Read a limited number of rows (`row_limit`) from the Netflix dataset.
   - Standardize column names and group by `movietitle` and `yearofrelease`.

2. **Feature Aggregation**:
   - Compute the average values of key content features:
     - `user_average_rating`
     - `scaled_movie_age`
     - `average_rating_per_movie`
     - `number_of_ratings_per_movie`
   - Compute average movie ratings (`avg_rating`) and merge.

3. **Feature Normalization**:
   - Normalize content features using `MinMaxScaler`.

4. **Feature Engineering**:
   - Create a string-based representation of each movie combining its metadata and numerical features.

5. **TF-IDF & Similarity Matrix**:
   - Use `TfidfVectorizer` to encode movie feature strings.
   - Compute cosine similarity across all movies using `linear_kernel`.

6. **Recommendation Function**:
   - For a given `movie_title`, return the top-N most similar movies based on cosine similarity.
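
The lookup in step 6 can be sketched on a tiny, made-up similarity matrix. This version drops the query item by index instead of assuming it always ranks first; the titles, the matrix, and the helper `top_n_similar` are illustrative assumptions rather than the notebook's own code.

```python
import numpy as np

# Hypothetical titles and a made-up symmetric similarity matrix.
toy_titles = ["A", "B", "C", "D"]
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.4],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
])

def top_n_similar(query_idx, sim_matrix, top_n=2):
    # Sort indices by descending similarity (stable sort keeps ties in index order).
    order = np.argsort(-sim_matrix[query_idx], kind="stable")
    # Drop the query explicitly instead of assuming it is ranked first.
    order = order[order != query_idx]
    return order[:top_n].tolist()

neighbours = top_n_similar(0, toy_sim)
print([toy_titles[i] for i in neighbours])
```

Filtering by index is a small safeguard: when many similarities are tied (for example, all close to zero), the query item is not guaranteed to sort to position 0.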

---

### Yelp Business Recommendations

**Goal**: Recommend similar businesses based on average user feedback features.

#### **Pipeline**:
1. **Load & Parse JSON Data**:
   - Read `row_limit` entries from Yelp’s line-delimited JSON format.

2. **Preprocessing**:
   - Keep only entries with valid `business_id`.
   - Focus on numeric columns: `stars`, `useful`, `funny`, `cool`.
   - Fill missing values with 0.

3. **Feature Aggregation**:
   - Aggregate ratings and interaction metrics per business.

4. **Feature Scaling & Model Fitting**:
   - Scale features using `StandardScaler`.
   - Fit a `NearestNeighbors` model using the **cosine** distance metric (brute-force search).

5. **Recommendation Function**:
   - For a given `index`, retrieve the top-N most similar businesses.
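
Steps 4 and 5 can be approximated in plain NumPy to show what the scaling and the cosine-distance lookup do. This is a stand-in for `StandardScaler` and `NearestNeighbors`, not the notebook's implementation, and the four business profiles are made-up numbers.

```python
import numpy as np

# Made-up per-business averages for (stars, useful, funny, cool).
toy_profiles = np.array([
    [4.5, 1.0, 0.2, 0.8],
    [4.4, 1.1, 0.3, 0.7],
    [2.0, 3.0, 2.5, 0.1],
    [1.8, 2.8, 2.9, 0.2],
])

# Standardise each column to zero mean and unit variance
# (population standard deviation, which is what StandardScaler uses).
scaled = (toy_profiles - toy_profiles.mean(axis=0)) / toy_profiles.std(axis=0)

def cosine_neighbours(index, top_n=1):
    unit = scaled / np.linalg.norm(scaled, axis=1, keepdims=True)
    distances = 1.0 - unit @ unit[index]  # cosine distance = 1 - cosine similarity
    order = np.argsort(distances, kind="stable")
    order = order[order != index]         # skip the query itself
    return order[:top_n].tolist()

print(cosine_neighbours(0), cosine_neighbours(2))
```

After standardisation the two highly rated businesses point in one direction and the two low-rated ones in the opposite direction, so each is paired with its look-alike.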

---

### Dynamic Execution Logic

A `dataset_choice` toggle allows switching between:
- `recommend_netflix(movie_title)`
- `recommend_yelp(index)`

The active recommendation function is assigned to `recommend_func` based on the dataset selected.
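
One possible way to express this toggle is a dictionary dispatch that fails loudly on unknown input. The stub functions and the names `RECOMMENDERS` and `pick_recommender` below are hypothetical; they only stand in for the real recommenders defined in the next cell.

```python
def recommend_netflix_stub(movie_title, top_n=5):
    # Placeholder standing in for the real Netflix recommender.
    return f"netflix:{movie_title}:{top_n}"

def recommend_yelp_stub(index, top_n=5):
    # Placeholder standing in for the real Yelp recommender.
    return f"yelp:{index}:{top_n}"

RECOMMENDERS = {"netflix": recommend_netflix_stub, "yelp": recommend_yelp_stub}

def pick_recommender(choice):
    # Normalise the input the same way the notebook does, then reject bad values.
    key = choice.strip().lower()
    if key not in RECOMMENDERS:
        raise ValueError(f"Unknown dataset: {choice!r}; expected 'netflix' or 'yelp'")
    return RECOMMENDERS[key]

recommend = pick_recommender("  Yelp ")
print(recommend(0, top_n=3))
```

Raising an error instead of leaving `recommend_func` as `None` surfaces a typo at selection time rather than later, when the function is first called.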


In [77]:

if dataset_choice == "netflix":
    # 1) Load the Netflix CSV
    df = pd.read_csv(netflix_path, nrows=row_limit)
    df.columns = df.columns.str.strip().str.lower()
    
    # 2) Group columns and define content features
    group_cols = ['movietitle', 'yearofrelease']
    content_cols = ['user_average_rating', 'scaled_movie_age',
                    'average_rating_per_movie', 'number_of_ratings_per_movie']
    
    # 3) Aggregate by (movietitle, yearofrelease) for these features
    df_grouped = df.groupby(group_cols)[content_cols].mean().reset_index()
    
    # 4) Also compute average rating across all interactions
    avg_ratings = df.groupby(group_cols)['rating'].mean().reset_index(name='avg_rating')
    df_grouped = pd.merge(df_grouped, avg_ratings, on=group_cols, how='left')
    
    # 5) Scale the numeric content columns
    scaler = MinMaxScaler()
    df_grouped[content_cols] = scaler.fit_transform(df_grouped[content_cols])
    
    # 6) Create a combined 'moviefeatures' string for TF-IDF
    df_grouped['moviefeatures'] = (
        df_grouped['movietitle'].astype(str) + ' ' +
        df_grouped['yearofrelease'].astype(str) + ' ' +
        df_grouped['user_average_rating'].astype(str) + ' ' +
        df_grouped['scaled_movie_age'].astype(str) + ' ' +
        df_grouped['average_rating_per_movie'].astype(str) + ' ' +
        df_grouped['number_of_ratings_per_movie'].astype(str)
    )
    
    # 7) Compute TF-IDF matrix
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(df_grouped['moviefeatures'])
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    
    # 8) Define the recommend function for Netflix
    def recommend_netflix(movie_title, top_n=5):
        matches = df_grouped[df_grouped['movietitle'] == movie_title]
        if matches.empty:
            print(f" Movie not found: {movie_title}")
            return pd.DataFrame()
        idx = matches.index[0]
        sim_scores = list(enumerate(cosine_sim[idx]))
        # Sort by similarity, exclude the movie itself [1:]
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]
        indices = [i[0] for i in sim_scores]
        return df_grouped.iloc[indices][['movietitle', 'yearofrelease', 'avg_rating']]
    
    # 9) Assign the recommend function
    recommend_func = recommend_netflix

elif dataset_choice == "yelp":
    # 1) Read a subset of the Yelp JSON lines
    lines = []
    with open(yelp_path, 'r') as f:
        for i, line in enumerate(f):
            if i >= row_limit:
                break
            lines.append(json.loads(line))
    df = pd.DataFrame(lines)
    df.columns = df.columns.str.strip().str.lower()
    
    # 2) Filter out rows with missing business_id
    df = df[df['business_id'].notna()]

    # 3) Prepare numeric columns
    numeric_cols = ['stars', 'useful', 'funny', 'cool']
    df[numeric_cols] = df[numeric_cols].fillna(0)
    
    # 4) Aggregate these columns by business_id
    business_profiles = df.groupby('business_id')[numeric_cols].mean().reset_index()
    
    # 5) Merge with restaurants_df to get the 'name' column
    #    (Assuming you already have 'restaurants_df' loaded with at least 'business_id' and 'name')
    business_profiles = business_profiles.merge(
        restaurants_df[['business_id', 'name']], 
        on='business_id',
        how='left'
    )
    
    # 6) Scale features
    features_scaled = StandardScaler().fit_transform(business_profiles[numeric_cols])
    
    # 7) Fit a NearestNeighbors model
    nn_model = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=6)
    nn_model.fit(features_scaled)
    
    # 8) Define the recommend function for Yelp
    def recommend_yelp(index, top_n=5):
        if index >= len(business_profiles):
            print("Invalid business index")
            return pd.DataFrame()
        
        # Get top_n+1 neighbors to skip the item itself
        distances, indices = nn_model.kneighbors([features_scaled[index]], n_neighbors=top_n+1)
        
        # Slice out the recommended neighbors, skipping the first one
        recommended = business_profiles.iloc[indices[0][1:]].copy()
        
        # Return columns that include the name, plus numeric info
        return recommended[['business_id', 'name', 'stars', 'useful', 'funny', 'cool']]
    
    # 9) Assign the recommend function
    recommend_func = recommend_yelp

else:
    # Optional fallback if user input is invalid
    print("Invalid choice. Please enter 'netflix' or 'yelp'. recommend_func is not set.")


## Evaluation of Top-N Recommendations

This function evaluates the quality of **Top-N content-based recommendations** using classification-style metrics, together with the diversity and novelty scores defined earlier. It supports both the **Netflix** and **Yelp** datasets, adjusting logic based on `dataset_choice`.

---

### Function: `evaluate_recommendations(df_source, recommend_func, top_n=5, rating_threshold=4.0, sample_size=100)`

#### **Inputs**:
- `df_source`: DataFrame containing the source items (e.g., movies or businesses)
- `recommend_func`: Function that returns recommendations for a given item or index
- `top_n`: Number of recommendations to generate per item (default: 5)
- `rating_threshold`: Minimum rating considered "positive" (default: 4.0)
- `sample_size`: Number of samples to evaluate (default: 100)

---

### **Evaluation Logic**

For each item in the sample:
1. **Netflix**:
   - Retrieve the `avg_rating` of the selected movie as the **true label** (positive if ≥ threshold).
   - Generate `top_n` recommendations using `recommend_func`.
   - Compare each recommendation’s average rating to the threshold.

2. **Yelp**:
   - Use `stars` of the selected business as the true label.
   - Generate recommendations using nearest neighbors.
   - Compare predicted businesses' `stars` to the threshold.

---

### **Metrics Computed**

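- **Precision**: Fraction of recommended items rated at or above the threshold, counted only when the source item is itself positive (otherwise \(TP = 0\))  
  \( \text{Precision} = \frac{TP}{N} \), where \(N\) is the number of recommendations returned
- **Recall**: A binary indicator of whether at least one relevant item was recommended  
  \( \text{Recall} = 1 \) if at least one TP and true label is positive, else 0
- **Accuracy**: Proportion of recommendations whose thresholded label matches the source item's label
- **F1 Score**: Harmonic mean of the averaged precision and the averaged recall
- **Diversity** and **Novelty**: Averages of `calculate_diversity` and `calculate_novelty` over the evaluated samples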
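
The per-item logic in `evaluate_recommendations` can be sketched on made-up ratings as follows. The helper name `per_item_metrics` and the sample ratings are assumptions for illustration; the real function reads the labels from DataFrames.

```python
def per_item_metrics(source_rating, rec_ratings, threshold=4.0):
    """Mirror the per-item precision/recall/accuracy logic (illustrative helper)."""
    n = len(rec_ratings)
    if n == 0:
        return 0.0, 0, 0.0
    true_label = source_rating >= threshold
    pred_labels = [r >= threshold for r in rec_ratings]
    # Hits only count when the source item itself is positive.
    tp = sum(pred_labels) if true_label else 0
    precision = tp / n
    recall = 1 if tp > 0 and true_label else 0
    accuracy = sum(p == true_label for p in pred_labels) / n
    return precision, recall, accuracy

# Made-up ratings: a positive source item with 3 of 5 recommendations at or above 4.0.
positive_case = per_item_metrics(4.5, [4.2, 3.1, 4.8, 2.0, 4.0])
# A negative source item always gets precision 0 and recall 0.
negative_case = per_item_metrics(3.0, [4.2, 3.1, 4.8, 2.0, 4.0])
print(positive_case, negative_case)
```

Note that for a negative source item the accuracy rewards recommendations that fall below the threshold, which is why accuracy and precision can diverge.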

---

### **Output**

Prints the average of each metric across all valid samples.


In [78]:
def evaluate_recommendations(df_source, recommend_func, top_n=5, rating_threshold=4.0, sample_size=100):
    precisions, recalls, accuracies = [], [], []
    diversity_scores, novelty_scores = [], []

    # --- Popularity setup
    if dataset_choice == "netflix":
        popularity_dict = df['movietitle'].value_counts().to_dict()
    elif dataset_choice == "yelp":
        popularity_dict = df['business_id'].value_counts().to_dict()

    for i in range(min(sample_size, len(df_source))):
        try:
            # --- Get recommendation and labels
            if dataset_choice == "netflix":
                true_label = df_source.iloc[i]['avg_rating'] >= rating_threshold
                movie_title = df_source.iloc[i]['movietitle']
                recs = recommend_func(movie_title, top_n=top_n)
                if recs.empty or 'avg_rating' not in recs.columns:
                    continue
                pred_labels = recs['avg_rating'] >= rating_threshold

                # --- Diversity
                rec_indices = recs.index
                diversity = calculate_diversity(recs, tfidf_matrix[rec_indices])
                # --- Novelty
                novelty = calculate_novelty(recs['movietitle'], popularity_dict)

            elif dataset_choice == "yelp":
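                # NOTE: `df` holds raw reviews while `recommend_func(i)` indexes
                # `business_profiles`, so this label comes from review i, not business i.
                true_label = df.iloc[i]['stars'] >= rating_threshold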
                recs = recommend_func(i, top_n=top_n)
                if recs.empty or 'stars' not in recs.columns:
                    continue
                pred_labels = recs['stars'] >= rating_threshold

                # --- Diversity
                rec_indices = recs.index
                rec_feature_vectors = features_scaled[rec_indices]
                diversity = calculate_diversity(recs, rec_feature_vectors)
                # --- Novelty
                novelty = calculate_novelty(recs['business_id'], popularity_dict)

            # --- Evaluation metrics
            tp = pred_labels.sum() if true_label else 0
            precision = tp / len(pred_labels) if len(pred_labels) > 0 else 0
            recall = 1 if tp > 0 and true_label else 0
            accuracy = sum(pred_labels == true_label) / len(pred_labels)

            # --- Append scores
            precisions.append(precision)
            recalls.append(recall)
            accuracies.append(accuracy)
            diversity_scores.append(diversity)
            novelty_scores.append(novelty)

        except Exception as e:
            print(f"⚠️ Skipped index {i} due to error: {e}")
            continue

    if not precisions:
        print("⚠️ No valid evaluations. Check dataset or prediction output.")
        return

    # --- Aggregate
    avg_precision = np.mean(precisions)
    avg_recall = np.mean(recalls)
    avg_accuracy = np.mean(accuracies)
    avg_diversity = np.mean(diversity_scores)
    avg_novelty = np.mean(novelty_scores)
    f1 = 2 * (avg_precision * avg_recall) / (avg_precision + avg_recall + 1e-6)

    # --- Print Results
    print(f"\n🔍 Evaluation Metrics (Top {top_n} Recommendations):")
    print(f"Precision: {avg_precision:.4f}")
    print(f"Recall:    {avg_recall:.4f}")
    print(f"Accuracy:  {avg_accuracy:.4f}")
    print(f"F1 Score:  {f1:.4f}")
    print(f"Diversity: {avg_diversity:.4f}")
    print(f"Novelty:   {avg_novelty:.4f}")


## Running Sample Recommendations and Evaluation

This section runs a **live recommendation** and evaluates system performance using classification metrics based on a rating threshold (3.0 in the calls below).

---

### Logic:

#### **If `dataset_choice == "netflix"`**
- Picks the first movie title in the grouped dataset.
- Prints Top-5 recommended movies using the `recommend_func`.
- Evaluates Top-10 recommendation quality over the first 100 grouped movies using:
  - `avg_rating` ≥ 3.0 as a positive label
  - Precision, Recall, Accuracy, and F1 Score, plus Diversity and Novelty

#### **If `dataset_choice == "yelp"`**
- Selects the business at index `0` and prints Top-5 similar businesses.
- Runs evaluation over the first 100 rows (Top-10 recommendations each) using:
  - `stars` ≥ 3.0 as the positive label
  - Same evaluation metrics as in Netflix

---




### Note: most business names are NaN in `restaurants.csv`, so we keep `business_id`

In [None]:
if dataset_choice == "netflix":
    sample_title = df_grouped.iloc[0]['movietitle']
    print(f"\n🎬 Recommendations for '{sample_title}':")
    print(recommend_func(sample_title))
    evaluate_recommendations(df_grouped, recommend_func, top_n=10, rating_threshold=3.0, sample_size=100)

elif dataset_choice == "yelp":
    print("\n🏪 Recommendations for business at index 0:")
    print(recommend_func(0))
    evaluate_recommendations(df, recommend_func, top_n=10, rating_threshold=3.0, sample_size=100)



🎬 Recommendations for '7 Seconds':
              movietitle  yearofrelease  avg_rating
1                  8 Man           1992    2.129032
2                Boycott           2001    3.598470
3  By Dawn's Early Light           2000    3.324675
4              Character           1997    3.641153
5           Chump Change           2000    2.246305

🔍 Evaluation Metrics (Top 10 Recommendations):
Precision: 0.4467
Recall:    0.7667
Accuracy:  0.5267
F1 Score:  0.5645
Diversity: 0.9932
Novelty:   7.5733


## Comparison: Netflix vs Yelp Content-Based Recommendation Performance

This section compares the performance of content-based recommenders across two datasets, Netflix (movies) and Yelp (businesses), based on Top-5 example recommendations and Top-10 evaluation metrics (the evaluation call uses `top_n=10`).

---

### Netflix Recommendations
**Example Input:** '7 Seconds'  
**Top-5 Recommendations:**
- 8 Man (1992)
- Boycott (2001)
- By Dawn's Early Light (2000)
- Character (1997)
- Chump Change (2000)

**Evaluation Metrics (Top 10):**

| Metric     | Value   |
|------------|---------|
| Precision  | 0.4467  |
| Recall     | 0.7667  |
| Accuracy   | 0.5267  |
| F1 Score   | 0.5645  |
| Diversity  | 0.9932  |
| Novelty    | 7.5733  |
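
As a quick sanity check on the table above, the F1 value can be recomputed from the averaged precision and recall with the same formula used in `evaluate_recommendations`. The inputs are the rounded figures from the table, so this is only an approximate reconstruction.

```python
# Values copied from the Netflix table above (already rounded to 4 decimals).
avg_precision = 0.4467
avg_recall = 0.7667

# Same formula as in evaluate_recommendations, including the small epsilon.
f1 = 2 * (avg_precision * avg_recall) / (avg_precision + avg_recall + 1e-6)
print(f"{f1:.4f}")
```

The result agrees with the reported 0.5645, confirming that F1 here is derived from the averaged metrics rather than averaged per item.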

---

### Yelp Recommendations
**Example Input:** Business at index 0  
**Top-5 Recommendations:**
- 8hgo446H2HoYlZocEi1SJw – 4.25 stars  
- OkSjPjMwXQ77h7kzhhyhDg – 4.60 stars  
- K45LUT-_MRhHJOzy5nBBjQ – 5.00 stars  
- 6p07zfmJWvytr0paqpyvbg – 4.35 stars  
- J6MGQigHItdSlG-3XZ1myA – 4.13 stars

**Evaluation Metrics (Top 10):**

| Metric     | Value   |
|------------|---------|
| Precision  | 0.7400  |
| Recall     | 0.7900  |
| Accuracy   | 0.7500  |
| F1 Score   | 0.7605  |
| Diversity  | 0.0091  |
| Novelty    | 14.5776 |

---

### Summary of Comparison

| Metric     | Netflix     | Yelp        | Observation |
|------------|-------------|-------------|-------------|
| Precision  | 0.4467      | **0.7400**  | Yelp performs better at recommending relevant items |
| Recall     | 0.7667      | 0.7900      | Both systems recall relevant items well |
| Accuracy   | 0.5267      | **0.7500**  | Yelp provides more consistent classification |
| F1 Score   | 0.5645      | **0.7605**  | Yelp has better overall classification balance |
| Diversity  | **0.9932**  | 0.0091      | Netflix offers highly diverse recommendations |
| Novelty    | 7.5733      | **14.5776** | Yelp suggests much rarer items on average |

---

### Interpretation

- Netflix offers high diversity and moderate novelty but lower precision and overall classification accuracy.
- Yelp shows strong accuracy and recall but very low diversity, likely due to homogeneity in business features.
- The contrast illustrates a tradeoff between recommendation **relevance** and **diversity**, depending on the dataset and features used.

---


In [80]:
# Number of NaN names:
num_nan_names = business_profiles['name'].isna().sum()
total_rows = len(business_profiles)
print(f"{num_nan_names} out of {total_rows} have NaN for name.")


7397 out of 9973 have NaN for name.
