# Simple diversity score

### The dataset
This assignment uses the [Wikipedia Movie Plots](https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots) dataset. The dataset contains descriptions of 34,886 movies from around the world. Column descriptions are listed below:

- Release Year - Year in which the movie was released
- Title - Movie title
- Origin/Ethnicity - Origin of movie (e.g. American, Bollywood, Tamil, etc.)
- Director - Director(s)
- Cast - Main actors and actresses
- Genre - Movie Genre(s)
- Wiki Page - URL of the Wikipedia page from which the plot description was scraped
- Plot - Long form description of movie plot (WARNING: May contain spoilers!!!)

### Assignment 01

* Data Preparation and Preprocessing:
  - Load the dataset and explore its structure.
  - Clean and preprocess the data as needed. Focus on the 'Plot' and 'Genre' columns for this assignment.
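One possible cleaning pass is sketched below. It is only an illustration: the helper name `clean_movies`, the treatment of the placeholder genre value `"unknown"`, and de-duplication on `Title` plus `Release Year` are assumptions, and the tiny in-memory frame stands in for the real CSV (which would be loaded with `pd.read_csv`).

```python
import pandas as pd

def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with a usable plot and normalise the 'Plot' and 'Genre' text."""
    out = df.copy()
    # Missing genres become empty strings; the placeholder "unknown" is
    # treated the same way (an assumption about how genres are encoded).
    out["Genre"] = out["Genre"].fillna("").str.strip().str.lower().replace("unknown", "")
    out["Plot"] = out["Plot"].fillna("").str.strip()
    # Drop movies without plot text, then duplicate (Title, Release Year) pairs.
    out = out[out["Plot"] != ""]
    out = out.drop_duplicates(subset=["Title", "Release Year"])
    return out.reset_index(drop=True)

# Tiny stand-in frame, used only to make the sketch runnable.
sample = pd.DataFrame({
    "Release Year": [1999, 1999, 2001, 2005],
    "Title": ["A", "A", "B", "C"],
    "Genre": ["Drama ", "Drama ", "unknown", None],
    "Plot": ["A detective hunts a thief.", "A detective hunts a thief.",
             "  ", "Two friends travel."],
})
movies = clean_movies(sample)
print(movies[["Title", "Genre", "Plot"]])
```

Whether rows with an empty genre should be kept depends on how much weight the genre is meant to carry later; here they are kept because the plot alone still describes the movie.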

* Feature Engineering:
  - Combine relevant features to create a unified text representation for each movie. Suggested features to combine include 'Plot' and 'Genre'.
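A minimal sketch of the combination step follows. The helper `combine_text` and its `genre_weight` parameter are hypothetical: repeating the genre a few times is one simple way to keep a one-word genre from being drowned out by a long plot, not something the assignment prescribes.

```python
import pandas as pd

def combine_text(df: pd.DataFrame, genre_weight: int = 2) -> pd.Series:
    """Build one text per movie: the genre repeated genre_weight times, then the plot."""
    genre = df["Genre"].fillna("").astype(str)
    plot = df["Plot"].fillna("").astype(str)
    # Repeating the genre token raises its term frequency in the combined text.
    repeated_genre = (genre + " ").str.repeat(genre_weight)
    return (repeated_genre + plot).str.strip()

# Stand-in rows, used only to make the sketch runnable.
movies = pd.DataFrame({
    "Genre": ["drama", ""],
    "Plot": ["A detective hunts a thief.", "Two friends travel."],
})
combined = combine_text(movies)
print(combined.tolist())
```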

* Vectorization:
  - Use TF-IDF (Term Frequency-Inverse Document Frequency) to vectorize the combined text features of each movie.
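The vectorization step could look roughly like this with scikit-learn's `TfidfVectorizer`. The three short texts and the settings `stop_words="english"` and `max_features=20000` are illustrative assumptions, not values required by the assignment.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in combined texts (genre followed by plot).
texts = [
    "drama a detective hunts a jewel thief in paris",
    "drama a detective chases a thief across london",
    "comedy two friends open a bakery in rome",
]

# Drop common English words and cap the vocabulary size.
vectorizer = TfidfVectorizer(stop_words="english", max_features=20000)
tfidf = vectorizer.fit_transform(texts)  # sparse matrix, one row per movie

# With the default norm="l2", every non-empty row has unit length.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
print(tfidf.shape, row_norms)
```

Because the rows are already L2-normalised, the dot product of two rows is their cosine similarity, which is convenient for the next step.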

* Similarity Calculation:
  - Calculate the cosine similarity between movies based on their vectorized features.
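As a sketch, the pairwise similarities can be computed with scikit-learn's `cosine_similarity`; the texts below are stand-ins. Note that a dense matrix over all 34,886 movies would hold roughly 1.2 billion entries (close to 10 GB as `float64`), so on the full dataset it may be more practical to work on a sample or to compute one row at a time.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in combined texts (genre followed by plot).
texts = [
    "drama a detective hunts a jewel thief in paris",
    "drama a detective chases a thief across london",
    "comedy two friends open a bakery in rome",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

# Full matrix: sim[i, j] is the cosine similarity between movie i and movie j.
sim = cosine_similarity(tfidf)

# One row only: similarities of movie 0 to every movie, without the full matrix.
row0 = cosine_similarity(tfidf[0], tfidf).ravel()

print(np.round(sim, 3))
```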

* Diversity Scoring:

  - For each movie, calculate a diversity score as $\text{Diversity Score} = 1 - \text{average cosine similarity}$ with all other movies.
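The formula above can be sketched in two ways. Both function names are hypothetical. The first works on a precomputed similarity matrix; the second assumes dense, unit-length row vectors and avoids building the matrix, using the fact that the sum of a row's similarities equals its dot product with the sum of all rows.

```python
import numpy as np

def diversity_scores(sim):
    """1 - mean cosine similarity to all *other* items, from a full matrix."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    if n < 2:
        raise ValueError("need at least two items")
    # Remove each item's similarity to itself before averaging.
    off_diagonal_sum = sim.sum(axis=1) - np.diag(sim)
    return 1.0 - off_diagonal_sum / (n - 1)

def diversity_scores_from_vectors(X):
    """Same score from unit-length row vectors, without the n x n matrix."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    row_sums = X @ X.sum(axis=0)       # sum_j x_i . x_j, self included
    self_sim = (X * X).sum(axis=1)     # x_i . x_i
    return 1.0 - (row_sums - self_sim) / (n - 1)

sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.3],
                [0.1, 0.3, 1.0]])
scores = diversity_scores(sim)
print(scores)  # approximately [0.55, 0.45, 0.8]
```

A movie with a high score is, on average, unlike the rest of the collection; the score says nothing about how diverse a particular recommendation list is.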

* Recommendation Generation:
  - Implement a function to generate movie recommendations for a given movie. The recommendations should be based on both similarity (movies should be similar to the given movie) and diversity (recommendations should be diverse).
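The assignment leaves open how similarity and diversity should be combined. One simple reading, sketched below, ranks candidates by a weighted blend of their similarity to the query and their diversity score. The function name `recommend`, the linear blend, and the default `diversity_factor=0.3` are all assumptions.

```python
import numpy as np

def recommend(sim, query_idx, top_n=5, diversity_factor=0.3):
    """Rank items by (1 - f) * similarity to the query + f * diversity score."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    # Diversity score: 1 - mean similarity to all other items.
    diversity = 1.0 - (sim.sum(axis=1) - np.diag(sim)) / (n - 1)
    score = (1.0 - diversity_factor) * sim[query_idx] + diversity_factor * diversity
    score[query_idx] = -np.inf  # never recommend the query itself
    ranked = np.argsort(-score, kind="stable")
    return ranked[: min(top_n, n - 1)].tolist()

sim = np.array([[1.0, 0.9, 0.8, 0.1],
                [0.9, 1.0, 0.85, 0.1],
                [0.8, 0.85, 1.0, 0.3],
                [0.1, 0.1, 0.3, 1.0]])

print(recommend(sim, 0, top_n=2, diversity_factor=0.0))  # [1, 2]
print(recommend(sim, 0, top_n=2, diversity_factor=0.6))  # [1, 3]
```

This blend favours movies that are atypical of the whole collection; it does not guarantee that the recommended movies differ from one another. A greedy re-ranking such as maximal marginal relevance would be an alternative if list-level diversity is the goal.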

* Analysis:
  - Analyze the differences between the recommendations generated with and without the diversity adjustment.
  - Discuss how the diversity_factor influences the recommendation outcomes.
  - Reflect on the potential benefits and drawbacks of introducing diversity into recommendation systems.

In [None]:
# code goes here

### Assignment 02

* Similarity Score Calculation:
  - Implement a function to calculate cosine similarity scores for items in your dataset. Use scikit-learn's cosine similarity function or write your own.
  - Store the cosine similarity scores in a matrix.
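For the "write your own" option, a NumPy sketch for dense feature vectors might look like this. The function name and the `eps` guard for all-zero rows are assumptions; sparse TF-IDF matrices would need scikit-learn's function or a sparse-aware variant.

```python
import numpy as np

def cosine_similarity_matrix(X, eps=1e-12):
    """Pairwise cosine similarity between the rows of a dense 2-D array."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Guard against division by zero: all-zero rows get similarity 0 everywhere.
    X_unit = X / np.maximum(norms, eps)
    return X_unit @ X_unit.T

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
sim = cosine_similarity_matrix(X)
print(np.round(sim, 3))
```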

* Diversity Metric Implementation:
  - Implement the `diversity_metric` function as provided in the code below. This function adjusts the similarity scores based on the `diversity_factor` to introduce diversity into the recommendations.
  - Experiment with different values of diversity_factor (e.g., 0.2, 0.4, 0.6) and observe how it affects the diversity of recommendations.
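The experiment could be organised along these lines. The sketch uses a seeded variant of the adjustment (`diversity_metric_seeded`, a hypothetical name) so that runs are repeatable, a synthetic similarity matrix, and "overlap with the unadjusted top 10" as one possible way to quantify the effect.

```python
import numpy as np

def diversity_metric_seeded(similarities, diversity_factor=0.4, seed=0):
    """Same adjustment as diversity_metric, but with a seeded generator."""
    rng = np.random.default_rng(seed)
    return similarities - diversity_factor * rng.random(similarities.shape)

def top_k(scores, query_idx, k):
    """Indices of the k highest-scoring items, excluding the query itself."""
    scores = np.array(scores, dtype=float)  # copy, so the input is untouched
    scores[query_idx] = -np.inf
    return np.argsort(-scores, kind="stable")[:k]

# Synthetic similarity matrix, used only to make the sketch runnable.
rng = np.random.default_rng(42)
vectors = rng.random((50, 8))
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = unit @ unit.T

baseline = set(top_k(sim[0], 0, 10).tolist())
overlaps = {}
for factor in (0.0, 0.2, 0.4, 0.6):
    adjusted = diversity_metric_seeded(sim, factor)
    picked = set(top_k(adjusted[0], 0, 10).tolist())
    overlaps[factor] = len(baseline & picked) / 10
    print(f"diversity_factor={factor:.1f}  overlap with baseline top-10: {overlaps[factor]:.1f}")
```

Two properties of the adjustment are worth keeping in mind when reading the results: the noise is drawn independently per entry, so the adjusted matrix is no longer symmetric, and adjusted scores can fall below zero.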

* Recommendation Generation:
  - Select a few items (e.g., movies or books) as the basis for generating recommendations.
  - Apply the diversity_metric to the cosine similarity matrix to adjust the similarity scores.
  - For each selected item, generate two sets of recommendations: one using the adjusted similarity scores and another using the original cosine similarity scores.
  - Compare the two sets of recommendations to evaluate the impact of the diversity adjustment.
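A small sketch of the side-by-side comparison follows. The titles and similarity values are placeholders, `top_titles` is a hypothetical helper, and the optional `rng` argument on `diversity_metric` is an addition made here for reproducibility; it is not part of the function given in the assignment.

```python
import numpy as np

def diversity_metric(similarities, diversity_factor=0.4, rng=None):
    # rng is an added, optional argument so the random adjustment can be repeated.
    rng = np.random.default_rng() if rng is None else rng
    return similarities - diversity_factor * rng.random(similarities.shape)

def top_titles(score_matrix, titles, query_idx, k=2):
    """Titles of the k best-scoring items for one query, excluding the query."""
    scores = np.array(score_matrix[query_idx], dtype=float)  # copy of the row
    scores[query_idx] = -np.inf
    best = np.argsort(-scores, kind="stable")[:k]
    return [titles[i] for i in best]

# Placeholder titles and similarities, not taken from the dataset.
titles = ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"]
sim = np.array([[1.0, 0.9, 0.7, 0.2, 0.1],
                [0.9, 1.0, 0.6, 0.3, 0.2],
                [0.7, 0.6, 1.0, 0.4, 0.3],
                [0.2, 0.3, 0.4, 1.0, 0.8],
                [0.1, 0.2, 0.3, 0.8, 1.0]])
adjusted = diversity_metric(sim, diversity_factor=0.4, rng=np.random.default_rng(0))

for query_idx in (0, 3):
    original_recs = top_titles(sim, titles, query_idx)
    diverse_recs = top_titles(adjusted, titles, query_idx)
    print(f"{titles[query_idx]}: original={original_recs}  adjusted={diverse_recs}")
```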

* Analysis:
  - Analyze the differences between the recommendations generated with and without the diversity adjustment.
  - Discuss how the diversity_factor influences the recommendation outcomes.
  - Reflect on the potential benefits and drawbacks of introducing diversity into recommendation systems.
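To put a number on "how diverse is a recommendation list", one commonly used measure is intra-list diversity: one minus the mean pairwise similarity among the recommended items. The sketch below assumes a precomputed similarity matrix and lists given as item indices; the function name and example values are illustrative.

```python
import numpy as np

def intra_list_diversity(sim, items):
    """1 - mean pairwise similarity among the recommended items."""
    items = list(items)
    if len(items) < 2:
        return 0.0  # a single item has no pairs to compare
    sub = np.asarray(sim, dtype=float)[np.ix_(items, items)]
    upper = np.triu_indices(len(items), k=1)  # each unordered pair once
    return float(1.0 - sub[upper].mean())

# Placeholder similarities, not taken from the dataset.
sim = np.array([[1.0, 0.9, 0.7, 0.2, 0.1],
                [0.9, 1.0, 0.6, 0.3, 0.2],
                [0.7, 0.6, 1.0, 0.4, 0.3],
                [0.2, 0.3, 0.4, 1.0, 0.8],
                [0.1, 0.2, 0.3, 0.8, 1.0]])

print(intra_list_diversity(sim, [1, 2]))  # close neighbours: about 0.4
print(intra_list_diversity(sim, [1, 4]))  # dissimilar pair: about 0.8
```

Reporting this value next to the mean similarity to the query makes the trade-off visible: a larger `diversity_factor` would be expected to raise the first and lower the second.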

In [None]:
import numpy as np

def diversity_metric(similarities, diversity_factor=0.4):
    """
    Adjust similarity scores based on a diversity factor.
    A higher diversity_factor promotes more diverse recommendations.
    """
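    # Subtract uniform random noise, scaled by diversity_factor, from every score.
    # Larger factors perturb the ranking more; results differ between runs
    # unless the NumPy random seed is fixed beforehand.
    adjusted_similarities = similarities - diversity_factor * np.random.rand(*similarities.shape)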
    return adjusted_similarities