# Movie Title Classifier

The aim of this project is to use the [TextCategorizer](https://spacy.io/api/textcategorizer) from [spaCy](https://spacy.io/) to build a classifier that predicts a movie's genres from its title.

## General imports

Let's import the modules we need.

In [8]:
import spacy
from spacy.util import minibatch, compounding
import inspect
import thinc
import random
import csv

## Dataset

The dataset was built using [The Movie Database API](https://developers.themoviedb.org/3/getting-started/introduction), extracting information about all the movies released during the years `2005-2018` and dumping it into a `csv` file.
The file looks like this:

```csv
Jarhead,"18,10752"
The Devil's Rejects,"18,27,80"
"Yours, Mine & Ours","35,10751,10749"
Just Like Heaven,"35,14,10749"
Walk the Line,"18,10402,10749"
Must Love Dogs,"10749,35"
Fun with Dick and Jane,35
Hoodwinked!,"16,35,10751"
The Ring Two,"18,27,53"
```

The first field of each row is the title of the movie.
The second field is a string of comma-separated integers (or a single integer when the movie has only one genre).
Each integer is the id of a movie genre.
To convert an integer id to a genre name we need to call the API:

```
https://api.themoviedb.org/3/genre/movie/list?api_key=<your-api-key>&language=en-US
```

You need to replace `<your-api-key>` with your own API key.
To get an API key you need to sign up for the API.
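
As a minimal sketch of that call, assuming only the Python standard library: the helper names `build_genre_url` and `fetch_genres` are made up for this example, and `YOUR_API_KEY` is a placeholder rather than a real key. The request itself only succeeds with a valid key and network access, so the example just prints the URL it would use.

```python
import json
import urllib.parse
import urllib.request

GENRE_LIST_ENDPOINT = "https://api.themoviedb.org/3/genre/movie/list"


def build_genre_url(api_key, language="en-US"):
    """Build the genre-list URL shown above (hypothetical helper)."""
    query = urllib.parse.urlencode({"api_key": api_key, "language": language})
    return "{}?{}".format(GENRE_LIST_ENDPOINT, query)


def fetch_genres(api_key):
    """Request the genre list and return the parsed JSON (needs a valid key)."""
    with urllib.request.urlopen(build_genre_url(api_key)) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder key: replace it with your own before calling fetch_genres().
print(build_genre_url("YOUR_API_KEY"))
```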

Don't worry if you don't have such a key.
I've already made the call for you, and what you get back is JSON like this:

```json
{
  "genres": [
    {
      "id": 28,
      "name": "Action"
    },
    {
      "id": 12,
      "name": "Adventure"
    },
    {
      "id": 16,
      "name": "Animation"
    },
    {
      "id": 35,
      "name": "Comedy"
    },
    {
      "id": 80,
      "name": "Crime"
    },
    {
      "id": 99,
      "name": "Documentary"
    },
    {
      "id": 18,
      "name": "Drama"
    },
    {
      "id": 10751,
      "name": "Family"
    },
    {
      "id": 14,
      "name": "Fantasy"
    },
    {
      "id": 36,
      "name": "History"
    },
    {
      "id": 27,
      "name": "Horror"
    },
    {
      "id": 10402,
      "name": "Music"
    },
    {
      "id": 9648,
      "name": "Mystery"
    },
    {
      "id": 10749,
      "name": "Romance"
    },
    {
      "id": 878,
      "name": "Science Fiction"
    },
    {
      "id": 10770,
      "name": "TV Movie"
    },
    {
      "id": 53,
      "name": "Thriller"
    },
    {
      "id": 10752,
      "name": "War"
    },
    {
      "id": 37,
      "name": "Western"
    }
  ]
}
```

As you can see, each genre id is associated with an intuitive name such as `Mystery`, `Western` or `Action`.
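
To make the file format concrete, here is a small sketch that parses a few of the sample rows and turns their ids into names. It is only an illustration: `decode_rows` and `SAMPLE_ID_TO_GENRE` are names invented for this example, and the mapping holds just the genres these rows need. Note that the `csv` module handles the quoted title that contains a comma.

```python
import csv
import io

# A few rows copied from the sample above.
SAMPLE_ROWS = '''Jarhead,"18,10752"
"Yours, Mine & Ours","35,10751,10749"
Fun with Dick and Jane,35
'''

# Only the genres needed for these rows; the full mapping comes from the JSON.
SAMPLE_ID_TO_GENRE = {
    18: "Drama",
    35: "Comedy",
    10749: "Romance",
    10751: "Family",
    10752: "War",
}


def decode_rows(text, id_to_name):
    """Return (title, [genre names]) pairs for CSV text in the format above."""
    decoded = []
    for title, ids in csv.reader(io.StringIO(text)):
        # an empty id field yields an empty list of names
        names = [id_to_name[int(i)] for i in ids.split(",") if i]
        decoded.append((title, names))
    return decoded


for title, names in decode_rows(SAMPLE_ROWS, SAMPLE_ID_TO_GENRE):
    print(title, "->", names)
```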

Now we're ready to load and process the dataset.

## Loading the dataset

We first create two dictionaries: one to convert from genre id to genre name and one to convert back.
They will be useful later.

In [3]:
import json

genres_json = json.loads("""
{
  "genres": [
    {
      "id": 28,
      "name": "Action"
    },
    {
      "id": 12,
      "name": "Adventure"
    },
    {
      "id": 16,
      "name": "Animation"
    },
    {
      "id": 35,
      "name": "Comedy"
    },
    {
      "id": 80,
      "name": "Crime"
    },
    {
      "id": 99,
      "name": "Documentary"
    },
    {
      "id": 18,
      "name": "Drama"
    },
    {
      "id": 10751,
      "name": "Family"
    },
    {
      "id": 14,
      "name": "Fantasy"
    },
    {
      "id": 36,
      "name": "History"
    },
    {
      "id": 27,
      "name": "Horror"
    },
    {
      "id": 10402,
      "name": "Music"
    },
    {
      "id": 9648,
      "name": "Mystery"
    },
    {
      "id": 10749,
      "name": "Romance"
    },
    {
      "id": 878,
      "name": "Science Fiction"
    },
    {
      "id": 10770,
      "name": "TV Movie"
    },
    {
      "id": 53,
      "name": "Thriller"
    },
    {
      "id": 10752,
      "name": "War"
    },
    {
      "id": 37,
      "name": "Western"
    }
  ]
}""")

id_to_genre = {elem['id']:elem['name'] for elem in genres_json['genres']}
genre_to_id = {elem['name']:elem['id'] for elem in genres_json['genres']}

Now we can load the dataset and convert the ids to their genre names.

In [20]:
# path to the CSV file
DATA_PATH = "genres.csv"

def load_data(split=0.8):
    """Load titles and genre labels, then split them into train and dev sets."""
    samples, labels = [], []
    no_genre_count = 0
    # newline='' is what the csv module recommends for files passed to csv.reader
    with open(DATA_PATH, 'r', newline='') as data_file:
        csv_reader = csv.reader(data_file, delimiter=',')
        for row in csv_reader:
            # skip movies that have no genre at all
            if row[1] == '':
                no_genre_count += 1
                continue
            samples.append(row[0])
            id_labels = {int(i) for i in row[1].split(',')}
            # one boolean per known genre: True if the movie has that genre
            labels.append({genre: id_ in id_labels for genre, id_ in genre_to_id.items()})
    print("{} movies loaded ({} movies didn't have any genre associated and were discarded!)".format(len(samples), no_genre_count))

    # split into train and dev sets (in file order, without shuffling)
    split_index = int(len(samples) * split)
    train_samples, train_labels = samples[:split_index], labels[:split_index]
    dev_samples, dev_labels = samples[split_index:], labels[split_index:]
    return (train_samples, train_labels), (dev_samples, dev_labels)

(train_samples, train_labels), (dev_samples, dev_labels) = load_data()

train_data = list(zip(train_samples, [{"cats": cats} for cats in train_labels]))
print(train_data[0])

8360 movies loaded (40 movies didn't have any genre associated and were discarded!)
('Batman Begins', {'cats': {'Action': True, 'Adventure': False, 'Animation': False, 'Comedy': False, 'Crime': True, 'Documentary': False, 'Drama': True, 'Family': False, 'Fantasy': False, 'History': False, 'Horror': False, 'Music': False, 'Mystery': False, 'Romance': False, 'Science Fiction': False, 'TV Movie': False, 'Thriller': False, 'War': False, 'Western': False}})
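
Because each title can carry several genres, it can help to check how often each label is positive before training. The sketch below is an optional sanity check and not part of the original notebook: `count_positive_labels` is a hypothetical helper, and the two examples use a reduced label set in the same `(title, {"cats": ...})` shape as `train_data`.

```python
from collections import Counter


def count_positive_labels(data):
    """Count how many examples have each genre set to True."""
    counts = Counter()
    for _title, annotations in data:
        counts.update(g for g, flag in annotations["cats"].items() if flag)
    return counts


# Reduced label set for illustration; pass train_data in the notebook instead.
examples = [
    ("Batman Begins", {"cats": {"Action": True, "Comedy": False, "Crime": True,
                                "Drama": True, "Romance": False}}),
    ("Must Love Dogs", {"cats": {"Action": False, "Comedy": True, "Crime": False,
                                 "Drama": False, "Romance": True}}),
]

for genre, count in count_positive_labels(examples).most_common():
    print(genre, count)
```

In the notebook, `count_positive_labels(train_data)` would give the per-genre counts for the training split.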
