IMDb Movie Genre Classification

This project uses the following dataset (freely available on Kaggle) to predict the primary genre of a movie, relying only on its natural-language IMDb description.

IMDb Genre Classification Dataset

The project directory structure is described below.

Data

The data directory houses the original Kaggle data at its root and the preprocessed data under the processed subdirectory. The preprocessed data is generated from the raw data by preprocessing.py (see Scripts below).

Notebooks

The notebooks directory contains the Jupyter notebooks we used for EDA, preliminary preprocessing, model training, and other experiments.

A brief description of each file follows:

  • eda.ipynb : This notebook contains most of the EDA we did, including genre distributions, trends over time, word clouds, etc.
  • tfidf.ipynb : This notebook was used to train our TF-IDF based models and validate their performance (a minimal sketch of this approach appears after this list).
  • basic_neural_models.ipynb : This notebook was used to train and validate our neural network models, such as a simple feed-forward network and RNNs.
  • small_bert.ipynb : This notebook was used for training and validating the performance of our Small BERT model.
  • training.ipynb : This notebook was used to train all of our models on the entire training set, test them on the test set, and save the models for inference.
  • class_imbalance_experiments.ipynb : This notebook was used for our experiments on handling class imbalance via oversampling and SMOTE.
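
For context, the snippet below is a minimal, hypothetical sketch of the TF-IDF approach explored in tfidf.ipynb. The file path and column names (data/processed/train.csv, description, genre) are illustrative assumptions; the actual notebook may use different models and hyperparameters.

```python
# Minimal sketch of a TF-IDF baseline (not the exact notebook code).
# The CSV path and the "description"/"genre" column names are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("data/processed/train.csv")  # hypothetical path
X_train, X_val, y_train, y_val = train_test_split(
    df["description"], df["genre"], test_size=0.2, random_state=42
)

# Vectorize descriptions into TF-IDF features, then fit a linear classifier.
model = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(classification_report(y_val, model.predict(X_val)))
```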

The directory also contains some deprecated notebooks, which are preserved for legacy reasons, but are not part of the submission.

Scripts

The scripts directory is a Python package containing our data preprocessing scripts and the server and client modules that can be used to run a small inference demo.

A brief description of each file follows:

  • language_identification.py : This module contains functions that can identify the language a piece of text is written in.
  • text_processing.py : This module contains functions for cleaning and simplifying text, such as stemming and stopword removal (an illustrative sketch follows this list).
  • preprocessing.py : This module uses language_identification.py and text_processing.py to create the preprocessed data used to train and test all our models. Running this script creates all the data found under data/processed/.
  • server.py : This module spins up a minimal server (using Flask-RESTful) at 127.0.0.1:5000 that exposes a GET endpoint for inference with the trained Small BERT model (an example request also appears after this list).
  • tkinter_client.py : This module brings up a (very) simple GUI where one can input movie descriptions and get the probability distribution over genres. server.py must be running for this to work.
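
The following sketch illustrates the kind of cleaning text_processing.py performs (stopword removal and stemming). It is a simplified stand-in, not the module's actual code, and the function name clean_text is hypothetical.

```python
# Illustrative sketch of stopword removal and stemming with NLTK.
# The actual text_processing.py module may use different tools and steps.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

_STOPWORDS = set(stopwords.words("english"))
_STEMMER = PorterStemmer()

def clean_text(text: str) -> str:
    """Lowercase, keep alphabetic tokens, drop stopwords, and stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(_STEMMER.stem(t) for t in tokens if t not in _STOPWORDS)

print(clean_text("A listless writer rediscovers his passion for the movies."))
```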
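
Once server.py is running, the endpoint can also be queried directly instead of through the Tkinter client. The sketch below assumes a route named /predict and a description query parameter; the actual route and parameter names are defined in server.py.

```python
# Hypothetical example of querying the running inference server.
# The "/predict" route and "description" parameter are assumptions;
# see server.py for the actual endpoint definition.
import requests

resp = requests.get(
    "http://127.0.0.1:5000/predict",
    params={"description": "A detective hunts a serial killer through a rainy city."},
)
resp.raise_for_status()
print(resp.json())  # expected: a probability distribution over genres
```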

Models

The models directory contains saved instances of the trained models.

Metrics

The metrics directory contains classification reports of the models we trained and tested.
