<a href="https://colab.research.google.com/github/hellosmallkat/Movie-Recommendation-System/blob/main/Copy_of_Movie_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center">
    NSDC Data Science Projects
</h1>
  
<h2 align="center">
    Project: Movie Recommendation System
</h2>

<h3 align="center">
    Name: hellosmallkat
</h3>


### **Please read before you begin your project**

**Instructions: Google Colab Notebooks:**

Google Colab is a free cloud service. It is a hosted Jupyter notebook service that requires no setup to use, while providing free access to computing resources. We will be using Google Colab for this project.

Certain parts of this project will be completed individually, while other parts are encouraged to be completed with the rest of your team. In order to work within the Google Colab Notebook, **please start by clicking on "File" and then "Save a copy in Drive."** This will save a copy of the notebook in your personal Google Drive. Each member of your team should work on their personal copy.

Please rename the file to "Movie Recommendation Analysis - Your Full Name." Once this project is completed, you will be prompted to share your file with the National Student Data Corps (NSDC) Project Leaders.

You can now start working on the project. :)

We'll be using Google Colab for this assignment. This is a Python Notebook environment built by Google that's free for everyone and comes with a nice UI out of the box. For a comprehensive guide, see Colab's official guide [here](https://colab.research.google.com/github/prites18/NoteNote/blob/master/Welcome_To_Colaboratory.ipynb).

Colab QuickStart:
- Notebooks are made up of cells, which can be either text cells or code cells. Click the +code or +text button at the top to create a new cell.
- Text cells use a format called [Markdown](https://www.markdownguide.org/getting-started/). A cheat sheet is available [here](https://www.markdownguide.org/cheat-sheet/).
- Python code is run in code cells. You can click the play button at the top left of a code block (sometimes hidden in the square brackets) to run the code in that cell. You can also press Shift+Enter to run the currently selected cell. There is no concurrency, since cells run one at a time, but you can queue up multiple cells.
- Each cell runs its code individually, but memory is shared across a notebook Runtime. You can think of a Runtime as a code session where everything you create and execute is temporarily stored. This means variables and functions are available between cells as long as the defining cell was executed first (the physical ordering of cells does not matter). It also means that if you delete or rename something and re-execute the cell, the old data might still exist in the background. If things aren't making sense, you can always click Runtime -> restart runtime to start over.
- Runtimes persist for a short period of time, so you are safe if you lose connection or refresh the page, but Google will shut down a runtime after enough time has passed. Everything that was printed out will remain on the page even if the runtime is disconnected.
- Google's Runtimes come preinstalled with the Python standard library (math, random, time, etc.) as well as common data analysis libraries (numpy, pandas, scikit-learn, matplotlib). Simply run `import numpy as np` in a code cell to make a library available.

# **Singular Value Decomposition**

---

# Crash Course on Singular Value Decomposition (SVD)

## Introduction

Singular Value Decomposition, or SVD, is a mathematical technique used in many fields such as signal processing, statistics, and machine learning, particularly in the context of recommendation systems. It's a method for decomposing a matrix into three other matrices that reveal its underlying structure.

## Basic Concepts

### Matrices
- **Matrix**: A rectangular array of numbers.
- **Dimension of a Matrix**: Given in the form of rows × columns.

### Decomposition
- **Decomposition**: Breaking down a complex matrix into simpler, understandable parts.

## What is SVD?

```
SVD breaks down any given matrix A into three separate matrices named U, Σ, and V*,
i.e., A = UΣV*
```
Where the components are:
```
- A: Original matrix.
- U: Orthogonal matrix whose columns are the left singular vectors.
- Σ: Diagonal matrix of singular values (non-negative).
- V*: Conjugate transpose of V, an orthogonal matrix whose columns are the right singular vectors.
```
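
As a minimal sketch of the factorization above, the following uses NumPy's `np.linalg.svd` on a small made-up matrix (the values of `A` are illustrative and not part of this project) and checks that multiplying the three factors back together recovers `A`.

```python
import numpy as np

# A small illustrative matrix (values are made up for this sketch).
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# full_matrices=False returns the "economy" SVD: U is 2x2, s has 2 entries, Vt is 2x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# NumPy returns the singular values as a 1-D array; build the diagonal matrix Σ from it.
Sigma = np.diag(s)

# Multiply the three factors back together.
A_rebuilt = U @ Sigma @ Vt

print("Singular values:", s)
print("Reconstruction matches A:", np.allclose(A, A_rebuilt))
```

Note that NumPy returns `Vt` (that is, V* itself), so no extra transpose is needed when rebuilding `A`.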




## Where do we use SVDs?

### Applications in Recommendation Systems

In recommendation systems, SVD is used to predict unknown preferences by decomposing a large matrix of user-item interactions into factors representing latent features. It helps in capturing the underlying patterns in the data.

### Process

1. **Matrix Creation**: Start with a matrix where rows represent users, columns represent items, and entries represent user ratings.
2. **Apply SVD**: Decompose this matrix using SVD.
3. **Latent Features**: The decomposition reveals latent features that explain observed ratings.
4. **Prediction**: Use the decomposed matrices to predict missing ratings.
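
The four steps above can be sketched on a toy example. The assumptions are worth stating: the 4×4 ratings matrix is invented, missing ratings are filled with each user's mean rating (one simple choice among many), and the number of latent features `k = 2` is arbitrary. This is an illustration of the idea, not the exact procedure used by the library later in this project.

```python
import numpy as np

# Step 1: toy user-item matrix (rows = users, columns = movies); np.nan marks a missing rating.
ratings = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 1.0, np.nan],
    [1.0, np.nan, 5.0, 4.0],
    [np.nan, 1.0, 4.0, 5.0],
])

# Classical SVD needs a complete matrix, so fill gaps with each user's mean rating (an assumption).
user_means = np.nanmean(ratings, axis=1, keepdims=True)
filled = np.where(np.isnan(ratings), user_means, ratings)

# Step 2: decompose the mean-centred matrix.
U, s, Vt = np.linalg.svd(filled - user_means, full_matrices=False)

# Step 3: keep only the k strongest latent features.
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + user_means

# Step 4: read predictions for the originally missing entries off the low-rank approximation.
missing = np.isnan(ratings)
for user, movie in zip(*np.where(missing)):
    print(f"user {user}, movie {movie}: predicted {approx[user, movie]:.2f}")
```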

### Advantages of an SVD
- Effective at uncovering latent features in the data.
- Reduces dimensionality, making computations more manageable.

### Limitations of an SVD
- Assumes linear relationships in data.
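- Sensitive to missing data and outliers: classical SVD requires a complete matrix, so missing ratings must either be imputed or handled by a variant that fits only the observed entries.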

#### Through this project, we will learn how to build a movie recommendation system using an SVD


#### Dataset being used: **MovieLens 100k dataset**

- This specific dataset, often referred to as "ml-100k," contains 100,000 ratings from 943 users on 1,682 movies. The data was collected through the MovieLens website during the seven-month period from September 19th, 1997 to April 22nd, 1998.

- **Data Structure**: The dataset includes user ratings that range from 1 to 5. Additionally, it provides demographic information about the users (age, gender, occupation, etc.) and details about the movies (titles, genres).

- **Usage**: It's a standard dataset used for implementing and testing recommender systems. Its size is manageable, making it a popular choice for educational purposes and for initial experimentation with recommendation algorithms.

- **Significance**: The diversity in the dataset, both in terms of users and movie genres, provides a rich ground for analyzing different recommendation strategies, testing algorithms like SVD, and understanding user preferences and behavioral patterns.

This dataset is an excellent starting point for anyone looking to delve into the world of recommender systems and practice with real-world data.


Now, we will write some code to understand and explore the dataset.

In [None]:
# Run this only once; you can comment out this cell afterwards.
# The Surprise library is published on PyPI as scikit-surprise.
!pip install scikit-surprise

In [None]:
#Importing necessary modules for this project
import pandas as pd
from surprise import Dataset
from surprise.model_selection import train_test_split

In [None]:
# Install pandas and scikit-surprise (the PyPI package that provides the `surprise` module).
# Run this only once; you can comment out this cell afterwards.
!pip install pandas scikit-surprise

### How do the predictions work?

1. **Model Training**:
   - The SVD algorithm is first trained on a portion of the dataset, which includes user ratings for various movies.
   - During training, the model learns to associate certain patterns and characteristics of users and movies with specific rating behaviors.

2. **Latent Features Extraction**:
   - SVD decomposes the rating matrix into matrices representing latent features of users and movies.
   - These latent features capture underlying aspects that affect rating behavior but are not explicitly available in the data (like user preferences or movie characteristics).

3. **Making Predictions**:
   - Once the model is trained, it can predict ratings for user-movie pairs where the actual rating is unknown.
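   - At its core, the prediction is the dot product of the latent features of the user and the movie; Surprise's `SVD` additionally adds the global mean rating and per-user and per-movie bias terms. The result represents the estimated preference of the user for that particular movie based on the learned patterns.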

4. **Example of a Prediction**:
   - Suppose we want to predict how user `U` would rate movie `M`.
   - The model uses the latent features it has learned for user `U` and movie `M` to compute a predicted rating.
   - This prediction is a numerical value, typically on the same scale as the original ratings (e.g., 1 to 5).

5. **Application**:
   - These predictions are used to recommend movies to users.
   - For example, the system can recommend movies that have the highest predicted ratings for a particular user.

6. **Handling New Users or Movies (Cold Start Problem)**:
   - One challenge is predicting ratings for new users or movies that have little to no rating history. This is known as the cold start problem.
   - Solutions might involve using content-based approaches or hybrid models that don't rely solely on historical rating data.
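
To make the prediction step concrete, the sketch below reproduces the arithmetic that Surprise's `SVD` uses with its default `biased=True` setting: global mean, plus a user bias, plus an item bias, plus the dot product of the two latent-feature vectors, clipped to the rating scale. Every number here (`global_mean`, `user_bias`, `item_bias`, `p_u`, `q_i`) is hypothetical rather than learned from MovieLens.

```python
import numpy as np

# Hypothetical values standing in for what a trained model would have learned.
global_mean = 3.5                   # average of all training ratings
user_bias = 0.3                     # this user tends to rate slightly above average
item_bias = -0.2                    # this movie tends to be rated slightly below average
p_u = np.array([0.5, -0.1, 0.2])    # latent features of user U
q_i = np.array([0.4, 0.3, -0.5])    # latent features of movie M

# Baseline terms plus the user-movie interaction (dot product).
raw = global_mean + user_bias + item_bias + q_i @ p_u

# Keep the estimate inside the 1-to-5 rating scale.
predicted = float(np.clip(raw, 1.0, 5.0))
print(f"Predicted rating: {predicted:.2f}")
```

For a user or movie that never appeared in the training data, the corresponding bias and latent features are unknown, which is exactly the cold start problem described in item 6.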

In [None]:
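# Load the built-in MovieLens 100k dataset (Surprise offers to download it on first use)
data = Dataset.load_builtin('ml-100k')
# Each raw rating is a (user, item, rating, timestamp) tuple
df = pd.DataFrame(data.raw_ratings, columns=["user", "item", "rating", "timestamp"])

print(df.head())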

We see the following columns:

* **User ID**: A unique identifier for the user who provided the rating.

* **Item ID (Movie ID)**: A unique identifier for the movie that was rated.

* **Rating:** The rating given to the movie by the user. In the MovieLens 100k dataset, ratings are whole numbers from 1 to 5.

* **Timestamp:** The time at which the rating was provided, in Unix time format, which counts seconds since the Unix epoch (January 1, 1970, UTC).



In [None]:
#TODO - Describe the statistics of this dataset.
# Hint: Use the describe() function


Now, we will do some data preprocessing.

This will include:
*   Checking for missing values
*   Converting timestamps to a readable format
*   Splitting the data into testing and training subsets



In [None]:
print(df.isnull().sum())

We see that there are no missing values.

In [None]:
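# Convert timestamp to a readable format.
# Surprise keeps the raw timestamps as strings, so cast to int before treating them as Unix seconds.
df['timestamp'] = pd.to_datetime(df['timestamp'].astype(int), unit='s')
print(df.head())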

In [None]:
# Split the data into a training set and a test set
trainset, testset = train_test_split(data, test_size=0.20)

# Display the number of users and items in the training set
print(f"Number of users: {trainset.n_users}")
print(f"Number of items: {trainset.n_items}")

# Display the first few elements of the test set
print(testset[:5])

### Hyperparameter Tuning in SVD
Hyperparameter tuning is a critical step in optimizing the performance of an SVD model. The goal is to find the best combination of parameters that results in the most accurate predictions or lowest error rates.

#### Hyperparameters we will be tuning in this project

1. **`n_factors`**:
   - Represents the number of latent factors (or features) to extract from the dataset.
   - The values `[50, 100, 150]` are chosen to test the model's performance with a varying number of factors. A higher number of factors can capture more complex patterns but may lead to overfitting and increased computation time.

2. **`n_epochs`**:
   - Refers to the number of iterations over the entire dataset during training.
   - The values `[20, 30]` provide a range to evaluate whether more iterations improve model performance or lead to overtraining.

3. **`lr_all`** (Learning Rate):
   - Determines the step size at each iteration while moving toward a minimum of the loss function.
   - The values `[0.005, 0.010]` are chosen to test how fast the model learns. A smaller learning rate may lead to more precise convergence but requires more epochs.

4. **`reg_all`** (Regularization Term):
   - Helps prevent overfitting by penalizing larger model parameters.
   - The values `[0.02, 0.1]` offer a range to assess the impact of regularization on model performance. Higher regularization can reduce overfitting but may lead to underfitting.
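
To make the roles of `lr_all` and `reg_all` concrete, here is a sketch of a single stochastic-gradient update for one observed rating, following the update rule described in the Surprise documentation for `SVD`. The starting values are made up and only two latent factors are used; a real epoch repeats this update for every rating in the training set.

```python
import numpy as np

lr_all, reg_all = 0.005, 0.02     # learning rate and regularization strength

# Hypothetical starting values for one user-movie pair.
global_mean = 3.5
b_u, b_i = 0.0, 0.0               # user and item biases
p_u = np.array([0.1, -0.1])       # user latent factors (n_factors = 2 here)
q_i = np.array([0.2, 0.1])        # item latent factors
r_ui = 5.0                        # the observed rating

def predict(b_u, b_i, p_u, q_i):
    """Baseline terms plus the dot product of the latent factors."""
    return global_mean + b_u + b_i + q_i @ p_u

err_before = r_ui - predict(b_u, b_i, p_u, q_i)

# Each parameter moves in proportion to the error (scaled by lr_all),
# while the reg_all term pulls it back toward zero.
b_u_new = b_u + lr_all * (err_before - reg_all * b_u)
b_i_new = b_i + lr_all * (err_before - reg_all * b_i)
p_u_new = p_u + lr_all * (err_before * q_i - reg_all * p_u)
q_i_new = q_i + lr_all * (err_before * p_u - reg_all * q_i)

err_after = r_ui - predict(b_u_new, b_i_new, p_u_new, q_i_new)
print(f"error before: {err_before:.4f}, after one step: {err_after:.4f}")
```

A larger `lr_all` makes each step bigger, and a larger `reg_all` strengthens the pull toward zero; `n_epochs` controls how many passes of such updates are made.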

In [None]:
# Define a grid of SVD hyperparameters explained above for tuning
param_grid = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30],
    'lr_all': [0.005, 0.010],
    'reg_all': [0.02, 0.1]
}
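
As a rough guide to how expensive this search is, the sketch below counts the combinations in the grid using `itertools.product`. The `cv = 3` value is taken from the grid-search call used later in this notebook; the variable names `combinations` and `n_combinations` are introduced here for illustration only.

```python
from itertools import product

param_grid = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30],
    'lr_all': [0.005, 0.010],
    'reg_all': [0.02, 0.1]
}

# Every combination takes one value from each list.
combinations = list(product(*param_grid.values()))
n_combinations = len(combinations)

# With k-fold cross-validation, each combination is trained once per fold.
cv = 3
print(f"{n_combinations} combinations x {cv} folds = {n_combinations * cv} model fits")
```

Adding more values to any list multiplies the total, so the grid is best kept small on a free Colab runtime.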


Now, we will train the model with the following parameters:

1. **`SVD`**:
   - This is the recommendation algorithm being tuned. SVD is a popular algorithm used in recommendation systems, particularly for matrix factorization.

2. **`param_grid`**:
   - This is a dictionary where keys are hyperparameter names, and values are lists of parameter settings to try as values. It defines the grid of parameters that will be tested.
   - Example: If `param_grid` is `{'n_factors': [50, 100], 'lr_all': [0.005, 0.01]}`, GridSearchCV will evaluate the SVD algorithm for all combinations of `n_factors` and `lr_all` from these lists.

3. **`measures=['RMSE', 'MAE']`**:
   - These are the performance metrics used to evaluate the algorithm.
   - `RMSE` stands for Root Mean Square Error, and `MAE` stands for Mean Absolute Error. Both are common metrics for evaluating the accuracy of prediction algorithms, with lower values indicating better performance.

4. **`cv=3`**:
   - This specifies the number of folds for cross-validation.
   - In this context, `cv=3` means that a 3-fold cross-validation will be used. The dataset will be split into three parts: in each iteration, two parts will be used for training, and one part will be used for testing. This process repeats three times, each time with a different part used for testing.
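
The two error measures can be computed by hand on a few made-up (true, predicted) pairs; the numbers below are illustrative only. RMSE squares the errors before averaging, so it penalizes large misses more heavily than MAE does.

```python
import math

# Hypothetical true and predicted ratings for four user-movie pairs.
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 4.0]

errors = [p - t for t, p in zip(true_ratings, predicted_ratings)]

# MAE: average size of the errors.
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
```

Here the single two-star miss dominates the RMSE, which is why RMSE is never smaller than MAE on the same predictions.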

In [None]:
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
from surprise import SVD, Dataset, Reader, accuracy

# Perform grid search with cross-validation to find the best hyperparameters for our model
gs = GridSearchCV(SVD, param_grid, measures=['RMSE', 'MAE'], cv=3)
gs.fit(data)

In [None]:
# Best score and parameters
print(f"Best RMSE: {gs.best_score['rmse']}")
print(f"Best parameters: {gs.best_params['rmse']}")

In [None]:
# TODO - Use the best model. Use the best_estimator attribute of gs (a dict keyed by measure name)
algo = ___________['rmse']

In [None]:
# TODO - Train and test split. Make sure test_size is 0.25
trainset, testset = ________
# TODO - Fit the trainset to train the model
algo.fit(_____)
# TODO - Make predictions on the testset
predictions = algo.test(_____)

# TODO - Calculate and print RMSE on the predictions made
accuracy.rmse(______)

In [None]:
#Predict rating for a user and item
user_id = '196'  # replace with a specific user ID
item_id = '302'  # replace with a specific item (movie) ID
predicted_rating = algo.predict(user_id, item_id)
print(f"Predicted rating for user {user_id} and item {item_id}: {predicted_rating.est}")

In [None]:
# To inspect the predictions in detail, let's print the first 10 predictions made by the model
for idx, prediction in enumerate(predictions[:10]):
    print(f'Prediction {idx}: User {prediction.uid} and item {prediction.iid} has true rating {prediction.r_ui}, and the predicted rating is {prediction.est}')


#### Rounding Numbers
Rounding values is a technique used to simplify numbers, but its appropriateness depends on the context:

**When to Round**
1. **Simplification**: For estimations.
2. **Reporting**: When exact figures aren't necessary (e.g., in everyday language).
3. **Data Analysis**: To focus on significant trends by ignoring minor variations.
4. **Financial Transactions**: Rounding to the smallest currency unit.
5. **Display Purposes**: For clarity in graphs or tables.

**When NOT to Round**
1. **Intermediate Calculations**: Early rounding can lead to significant final errors.
2. **Legal/Regulatory Documents**: Require exact figures.
3. **Scientific/Engineering Work**: Precision is crucial.
4. **Critical Calculations**: In health, safety, or finance, precision is essential.

To summarize,
- Rounding depends on the purpose and context of the calculation.
- It is useful for simplification and clarity but should be avoided when precision is critical.
- We must be aware of potential cumulative errors in sequential calculations.
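
One detail worth knowing before rounding predictions: Python's built-in `round()` rounds exact halves to the nearest even integer (sometimes called banker's rounding), which can surprise readers who expect halves to always round up. The sample values below are made up, and the clamping step is an optional extra shown only as a sketch.

```python
# Hypothetical predicted ratings.
samples = [3.2, 3.5, 3.7, 4.5]

# round() sends 3.5 up to 4 but 4.5 down to 4, because ties go to the even neighbour.
rounded = [round(x) for x in samples]
print(rounded)

# Optionally clamp the result so it always stays on the 1-to-5 rating scale.
clamped = [min(5, max(1, r)) for r in rounded]
print(clamped)
```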


Let us round the predicted values so that they fall within the rating categories [1.0, 2.0, 3.0, 4.0, 5.0].

In [None]:
# TODO - Round the prediction.est value being printed. Use Python's built-in rounding function to achieve this

for idx, prediction in enumerate(predictions[:10]):
    print(f'Prediction {idx}: User {prediction.uid} and item {prediction.iid} has true rating {prediction.r_ui}, and the predicted rating is {______(prediction.est)}')


<h3 align = 'center' >
Thank you for completing the project!
</h3>

Please submit all materials to the NSDC HQ team at nsdc@nebigdatahub.org in order to receive a virtual certificate of completion. Do reach out to us if you have any questions or concerns. We are here to help you learn and grow.
