giuven95/MovieRecommenderSystem

MovieRecommenderSystem

An automatic movie recommendation system project

Overview

This project was started for the University of Messina course "Advanced Techniques for Data Analysis", part of the Master's Degree in Engineering and Computer Science.

TODO

Log

  • 2023/01/21 11:50 - Created the project
  • 2023/01/21 11:55 - Created section "Main approaches for creating recommender systems"
  • 2023/01/21 12:15 - Created sections "Technologies" and "Datasets"
  • 2023/01/21 12:20 - Created section "References"
  • 2023/01/21 12:50 - Created section "Useful links"
  • 2023/01/21 13:20 - Created section "Problem formulation"
  • 2023/01/21 13:50 - Added more sections
  • 2023/01/21 14:40 - Added more sections
  • 2023/01/21 20:30 - Added React, Flask, MongoDB, Docker Compose inside "app" folder
  • 2023/01/26 20:20 - Created job system (mockup)
  • 2023/01/29 15:00 - Finishing Iteration 0
  • 2023/01/29 17:20 - Problem formulation
  • 2023/01/31 18:45 - Uploaded first notebook
  • 2023/01/31 10:00 - Switching away from Google Colab because of runtime limitations

Problem formulation

Recommendation systems are a subclass of information filtering systems.

A recommender system uses:

  • a dataset of items, like videos, songs, books, movies, items of clothing, e-commerce products etc.

  • a dataset of user interactions (e.g. reads, likes, buys) with some or all of those items

in order to predict which items are most pertinent to a particular user.

More formally, it is defined by:

  • $C$: the set of all users

  • $S$: the set of all possible items that can be recommended

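  • $U$: a utility function $U : C \times S \to R$, where $R$ is a totally ordered set (for example, ratings) that measures how useful an item is to a user [3]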
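As a minimal sketch of this formulation, assuming the utility function returns a number where higher means more pertinent, recommending to a user amounts to ranking the items by utility. The users, items and utility values below are made up for illustration, and `recommend` is a hypothetical helper, not part of the project.

```python
# Hypothetical toy data: C (users), S (items) and a utility table U[(user, item)].
C = {"alice", "bob"}
S = {"Alien", "Amelie", "Heat"}
U = {
    ("alice", "Alien"): 4.5, ("alice", "Amelie"): 2.0, ("alice", "Heat"): 3.5,
    ("bob", "Alien"): 1.0, ("bob", "Amelie"): 5.0, ("bob", "Heat"): 3.0,
}

def recommend(user, k=1):
    """Return the k items with the highest utility for the given user."""
    ranked = sorted(S, key=lambda item: U[(user, item)], reverse=True)
    return ranked[:k]

print(recommend("alice"))     # ['Alien']
print(recommend("bob", k=2))  # ['Amelie', 'Heat']
```

In practice the utility is only known for the user-item pairs that have been observed, so a recommender has to estimate the missing values before ranking.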

Peculiarities of the recommendation problem

Recommendation differs enough from classification and regression to pose its own challenges.

  • Novelty should sometimes be taken into account

    TODO

  • Diversity should be taken into account

    Users tend to be more satisfied when lists of recommended items are diverse rather than monotonous. [4]

  • Serendipity should be taken into account

    Users want to be surprised by item recommendations; they do not want them to be boring or predictable. [5]

    As an example, recommending a movie directed by a user's favorite director is generally not considered a serendipitous recommendation. The user would likely have discovered that movie on their own. [6]

  • Often, recommender systems cannot be fully offline

    TODO

  • In production environments, recommender systems should be mindful of presentation bias

    TODO

  • In production environments, recommender systems should be robust

    TODO

  • The future is not always like the past

    TODO
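The diversity point in the list above can be made measurable. One simple option, shown here as a sketch and not as the metric used by this project, is the average pairwise dissimilarity of the items in a recommendation list. The genre tags and the choice of Jaccard similarity are assumptions made for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def intra_list_diversity(items, genres):
    """Average pairwise (1 - Jaccard) dissimilarity over a recommendation list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - jaccard(genres[a], genres[b]) for a, b in pairs) / len(pairs)

# Hypothetical genre tags.
genres = {
    "Alien": {"horror", "sci-fi"},
    "Aliens": {"action", "horror", "sci-fi"},
    "Amelie": {"comedy", "romance"},
}

low = intra_list_diversity(["Alien", "Aliens"], genres)   # similar movies, about 0.33
high = intra_list_diversity(["Alien", "Amelie"], genres)  # no shared genres, 1.0
print(low, high)
```

A list with a higher score covers more varied content; a recommender can trade a little predicted relevance for a higher score when re-ranking.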

Main approaches for creating recommender systems

Collaborative filtering

Collaborative filtering systems use a database of preferences for items by users in order to predict additional items a new user might like. They do not rely on the features of users and items for predictions. [1]

Techniques

Memory-based
  • Neighborhood-based

Neighborhood-based techniques can be described as automating the concept of word of mouth. Generally they are very simple to implement; in the least sophisticated case, only one hyperparameter is present (the number of neighbors chosen). They also have good explainability and require no training phase. [6]

  • Top-N
Model-based
  • Bayesian belief nets

  • Clustering

  • Singular value decomposition (SVD)

  • Principal component analysis (PCA)

  • Sparse factor analysis

  • Neural networks

Content-based filtering

Content-based recommender systems analyze item metadata, like movie names, movie descriptions and tags, and find regularities in the content. They rely on the features of users and items for predictions. [1]
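A minimal sketch of the idea, assuming each movie is described only by a set of tags: unseen movies are scored by how much their tags overlap with the movies a user already liked. The tag data and the function names are hypothetical, and Jaccard similarity is just one possible similarity measure.

```python
def jaccard(a, b):
    """Jaccard similarity between two tag sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def content_based_scores(liked, tags):
    """Score each unseen item by its best tag overlap with any liked item."""
    scores = {}
    for item, item_tags in tags.items():
        if item in liked:
            continue
        scores[item] = max(
            (jaccard(item_tags, tags[other]) for other in liked), default=0.0
        )
    return scores

# Hypothetical tag metadata.
tags = {
    "Alien": {"space", "horror", "monster"},
    "The Thing": {"horror", "monster", "antarctica"},
    "Amelie": {"paris", "romance"},
}

scores = content_based_scores({"Alien"}, tags)
best = max(scores, key=scores.get)
print(best)  # The Thing
```

No ratings from other users are needed here, which is why content-based methods can still score items that nobody has rated yet.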

Techniques

TODO

Hybrid approaches

Hybrid approaches combine content-based filtering techniques with those based on collaborative filtering. [2]
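One simple way to combine the two families is a weighted blend of their scores. The sketch below is an illustration under the assumption that both scorers output values on the same 0 to 1 scale; it is not the method of [2], and `weighted_hybrid` and `alpha` are hypothetical names.

```python
def weighted_hybrid(cf_scores, cb_scores, alpha=0.5):
    """Blend scores as alpha * collaborative + (1 - alpha) * content-based.

    Items missing from one source contribute 0.0 for that source.
    """
    items = set(cf_scores) | set(cb_scores)
    return {
        item: alpha * cf_scores.get(item, 0.0) + (1 - alpha) * cb_scores.get(item, 0.0)
        for item in items
    }

# Hypothetical scores from a collaborative and a content-based scorer.
cf = {"Heat": 0.8, "Amelie": 0.2}
cb = {"Heat": 0.4, "The Thing": 1.0}

blended = weighted_hybrid(cf, cb, alpha=0.5)
ranking = sorted(blended, key=blended.get, reverse=True)
print(ranking)  # ['Heat', 'The Thing', 'Amelie']
```

Setting `alpha` closer to 0 favors content-based scores, which can help for items with few ratings.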

Techniques

TODO

Performance metrics

TODO

Project scope and approach

TODO

Iterations

Iteration 0

Iteration 0 consists of a mock frontend written in React and a mock REST API written in Python using the Flask framework.

It is meant as a rough first draft to serve as the template for the web app project.

It does not contain an actual Machine Learning model, nor does it contain any data scraper.

Job requests are received from a form in the frontend which accepts a Letterboxd profile name to be scraped.

<div className="Home AppPage">
  <h2>Find movies you will like!</h2>
  <div className="AppContainer">
    <form className="AppForm" onSubmit={handleSubmit(onSubmit)}>
      <div className="AppFormGroup">
        <label htmlFor="name">
          Insert your Letterboxd profile name:
        </label>
        <input
          id="name"
          type="text"
          {...register("name")}
        />
      </div>
      {canSubmit ? <div className="AppFormGroup">
        <button className="AppButton" type="submit">
          <PlayArrowIcon />
          Submit
        </button>
      </div> : ""}
    </form>
  </div>
  {jobStatus === null ? "" : <>
    <h3>Job status</h3>
    <div className="AppContainer">
      <JobStatusBar />
    </div>
  </>}
  {(jobResponse === null || jobResponse.length === 0) ? "" : <>
    <h3>Check out your results</h3>
    <div className="AppContainer">
      <ResponseSection />
    </div>
  </>}
</div>

A unique job ID is created in the corresponding Flask endpoint and sent back to the frontend.

Jobs are stored in a MongoDB collection.

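import uuid

from flask import jsonify, request

# app, API_PREFIX, my_mongo_connect, Job, JOB_STATUSES, q and process_job
# are defined elsewhere in the backend.

@app.route(API_PREFIX + "/request", methods=["POST"])
def post_request():
    my_mongo_connect()
    # silent=True returns None instead of raising on a missing or invalid JSON body
    data = request.get_json(silent=True) or {}
    if "name" not in data:
        return jsonify({"error": "Missing field name in request body"}), 400
    profile_name = data["name"]
    # Store the job with its initial status, then queue it for a worker
    job_id = str(uuid.uuid4())
    job_doc = Job(job_id=job_id, profile_name=profile_name, status=JOB_STATUSES[0], response=[])
    job_doc.save()
    q.enqueue(process_job, {"id": job_id, "profile_name": profile_name})
    return jsonify({"id": job_id})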

Jobs are passed to workers through a Redis queue for asynchronous execution. A worker script, also written in Python, simulates processing the job request by stepping through the job statuses one at a time.

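import time

def process_job(job):
    my_mongo_connect()
    for i, job_status in enumerate(JOB_STATUSES):
        # Reload the job on every step so that a stop request is noticed
        job_doc = Job.objects(job_id=job["id"]).first()
        if job_doc is None or job_doc.stopped:
            return
        if i == 0:
            # The first status was already set when the job was created
            continue
        job_doc.status = job_status
        if i == len(JOB_STATUSES) - 1:
            # Final status: attach the mock suggestions
            job_doc.response = [
                Suggestion(name=movie["name"], score=movie["score"])
                for movie in DEFAULT_SUGGESTIONS
            ]
        job_doc.save()
        time.sleep(MOCK_WAIT)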
The web page polls a job status endpoint every few seconds, notifying the user of any updates. When the job status is set to "DONE", a job response endpoint is queried and the mock suggestions are downloaded in order to display them.

  useEffect(() => {
    let intervalId = null;

    if (jobStatus === "DONE" && jobResponse === null) {
      intervalId = setInterval(() => {
        fetch(BACKEND_API_URL + `/response/${jobId}`)
          .then((res) => res.json())
          .then((data) => {console.log(data); setJobResponse(data.response)})
          .catch((error) => console.error(error));
      }, POLLING_DELAY_MS);
    }

    return () => clearInterval(intervalId);
  }, [jobId, jobStatus, jobResponse]);

  useEffect(() => {
    let intervalId = null;

    if (jobId) {
      intervalId = setInterval(() => {
        fetch(BACKEND_API_URL + `/status/${jobId}`)
          .then((res) => res.json())
          .then((data) => setJobStatus(data.status))
          .catch((error) => console.error(error));
      }, POLLING_DELAY_MS);
    }

    return () => clearInterval(intervalId);
  }, [jobId]);
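The same polling flow can be sketched outside the browser, for example to exercise the API from a script. This is a minimal sketch: `poll_until_done` is a hypothetical helper, the status source is injected as a function so the example runs without a server, and every status name other than "DONE" is made up.

```python
import time

def poll_until_done(get_status, delay_s=2.0, max_attempts=30, sleep=time.sleep):
    """Call get_status() repeatedly until it returns "DONE".

    Returns the distinct statuses seen in order, or raises TimeoutError.
    """
    seen = []
    for _ in range(max_attempts):
        status = get_status()
        if not seen or seen[-1] != status:
            seen.append(status)  # record only status changes
        if status == "DONE":
            return seen
        sleep(delay_s)
    raise TimeoutError("job did not finish in time")

# Stand-in for an HTTP call to the job status endpoint.
fake_statuses = iter(["QUEUED", "QUEUED", "RUNNING", "DONE"])
history = poll_until_done(lambda: next(fake_statuses), delay_s=0, sleep=lambda s: None)
print(history)  # ['QUEUED', 'RUNNING', 'DONE']
```

Unlike a fixed interval that keeps firing, this loop stops as soon as the final status is seen and gives up after `max_attempts` tries.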

Iteration 1

In Iteration 1, two important elements are added to the backend of the previously created demo:

  • a Machine Learning model;
  • a ratings scraper for the Letterboxd profile name input.

Exploratory phase

This iteration begins with an exploratory phase in which different techniques are compared.

Jupyter Notebooks are used in order to quickly get a feel for the data.

Here is a high-level template of the preliminary steps:

# Import necessary libraries

# Download the dataset from Kaggle

# Load the dataset into a Pandas dataframe

# Visualize, clean and preprocess the data
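The loading and cleaning steps can be sketched as follows. The dataset and its columns are not specified here, so the inline CSV and the column names `user_id`, `movie_id` and `rating` are assumptions. The sketch also uses the standard library `csv` module rather than Pandas so that it runs without extra dependencies.

```python
import csv
import io

# Hypothetical ratings file; the real dataset and its columns are not specified.
RAW = """user_id,movie_id,rating
u1,m1,4.0
u1,m2,
u2,m1,5.0
u2,m1,5.0
u3,m3,not_a_number
"""

def load_ratings(text):
    """Parse CSV text into {(user_id, movie_id): rating}, skipping bad rows."""
    ratings = {}
    for row in csv.DictReader(io.StringIO(text)):
        try:
            value = float(row["rating"])
        except (TypeError, ValueError):
            continue  # missing or malformed rating
        # Duplicate (user, movie) rows collapse into a single entry
        ratings[(row["user_id"], row["movie_id"])] = value
    return ratings

ratings = load_ratings(RAW)
print(len(ratings))  # 2
```

Rows with an empty or non-numeric rating are dropped and the duplicated row is kept once, leaving two clean ratings.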

Before starting, it is necessary to download the Kaggle API Token following this guide:

https://www.kaggle.com/general/74235

The first technique tested is neighborhood-based collaborative filtering.

Here is a high-level template of the required steps:

# Create function to calculate similarity between users

# Create a neighborhood-based recommender

# Test the recommender
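These steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's implementation: the ratings are made up, similarity is cosine similarity computed over the movies both users rated, and the only hyperparameter is the number of neighbors `k`.

```python
from math import sqrt

# Hypothetical ratings: user -> {movie: rating}.
ratings = {
    "ann": {"Alien": 5, "Heat": 4, "Amelie": 1},
    "ben": {"Alien": 4, "Heat": 5, "Blade Runner": 5},
    "cat": {"Amelie": 5, "Alien": 1, "Notting Hill": 4},
}

def cosine_similarity(a, b):
    """Cosine similarity over the movies both users rated (0.0 if none)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(a[m] ** 2 for m in common))
    norm_b = sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

def recommend(user, k=1, n=3):
    """Rank unseen movies by the similarity-weighted average rating of the k nearest users."""
    sims = {
        other: cosine_similarity(ratings[user], ratings[other])
        for other in ratings if other != user
    }
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
    totals, weights = {}, {}
    for other in neighbors:
        for movie, rating in ratings[other].items():
            if movie in ratings[user]:
                continue  # only recommend movies the user has not rated
            totals[movie] = totals.get(movie, 0.0) + sims[other] * rating
            weights[movie] = weights.get(movie, 0.0) + sims[other]
    scores = {m: totals[m] / weights[m] for m in totals if weights[m] > 0}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Test the recommender
print(recommend("ann"))  # ['Blade Runner']
```

Here "ben" is the nearest neighbor of "ann" because they agree on "Alien" and "Heat", so his unseen movie is recommended; the neighbor's name and similarity also serve as a direct explanation of the result.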

TODO

Technologies

The following technologies were used for the project:

Frontend

  • React

  • CSS

  • Zustand

TODO

Backend

  • Flask

  • Redis

  • rq, a Python package for Redis queues

  • MongoDB

TODO

Machine learning

TODO

Containerization, deployment and infrastructure

  • Docker

  • Docker Compose

TODO

Datasets

The following datasets were used for the project:

TODO

References

[1]

Xiaoyuan Su and Taghi M. Khoshgoftaar, "A Survey of Collaborative Filtering Techniques", https://downloads.hindawi.com/archive/2009/421425.pdf

[2]

Ana Belén Barragáns-Martínez, Enrique Costa-Montenegro, Juan C. Burguillo, Marta Rey-López, Fernando A. Mikic-Fonte, Ana Peleteiro, "A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition", Information Sciences, Volume 180, Issue 22, 2010, Pages 4290-4311, ISSN 0020-0255, https://doi.org/10.1016/j.ins.2010.07.024

[3]

Nitin Mishra et al 2021 J. Phys.: Conf. Ser. 1717 012002

[4]

Ziegler, C.N., McNee, S.M., Konstan, J.A. and Lausen, G. (2005). "Improving recommendation lists through topic diversification". Proceedings of the 14th international conference on World Wide Web. pp. 22–32.

[5]

Castells, Pablo; Hurley, Neil J.; Vargas, Saúl (2015). "Novelty and Diversity in Recommender Systems". In Ricci, Francesco; Rokach, Lior; Shapira, Bracha (eds.). Recommender Systems Handbook (2 ed.). Springer US. pp. 881–918. doi:10.1007/978-1-4899-7637-6_26. ISBN 978-1-4899-7637-6.

[6]

Christian Desrosiers and George Karypis, "A Comprehensive Survey of Neighborhood-based Recommendation Methods", https://www.inf.unibz.it/~ricci/ISR/papers/handbook-neighbor.pdf

Useful links