# Movie Title Classifier

The aim of this project is to use the [TextCategorizer](https://spacy.io/api/textcategorizer) from [spaCy](https://spacy.io/) to build a classifier that predicts the genres of a movie from its title.

## General imports

Let's import some modules:

In [8]:
import spacy
from spacy.util import minibatch, compounding
import inspect
import thinc
import random
import csv

## Dataset

The dataset was built using [The Movie Database API](https://developers.themoviedb.org/3/getting-started/introduction), extracting information about all the movies released during the years `2005-2018` and dumping it into a `csv` file.
The file looks like this:

```csv
Jarhead,"18,10752"
The Devil's Rejects,"18,27,80"
"Yours, Mine & Ours","35,10751,10749"
Just Like Heaven,"35,14,10749"
Walk the Line,"18,10402,10749"
Must Love Dogs,"10749,35"
Fun with Dick and Jane,35
Hoodwinked!,"16,35,10751"
The Ring Two,"18,27,53"
```

The first element of each row is the title of the movie.
After that comes a string field containing comma-separated integers.
Each integer is the id of a movie genre.
To convert an integer id to a genre name we need to call the API:

```
https://api.themoviedb.org/3/genre/movie/list?api_key=<your-api-key>&language=en-US
```

You need to replace `<your-api-key>` with your own API key.
To get an API key you need to sign up for the API.
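
As a minimal sketch of how that request URL could be assembled in Python, the snippet below uses only the standard library. The helper name `build_genre_list_url` and the placeholder key are hypothetical, and the HTTP request itself is deliberately left out.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL shown above.
GENRE_LIST_ENDPOINT = "https://api.themoviedb.org/3/genre/movie/list"


def build_genre_list_url(api_key, language="en-US"):
    """Return the genre-list URL for the given API key (hypothetical helper)."""
    query = urlencode({"api_key": api_key, "language": language})
    return "{}?{}".format(GENRE_LIST_ENDPOINT, query)


# "my-secret-key" is a placeholder, not a real key.
url = build_genre_list_url("my-secret-key")
print(url)
```

Using `urlencode` rather than string concatenation keeps the query string valid even if a parameter contains characters that need escaping.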

Don't worry if you don't have such a key.
I've already made the API call for you; what you get back is a JSON document like this:

```json
{
  "genres": [
    {
      "id": 28,
      "name": "Action"
    },
    {
      "id": 12,
      "name": "Adventure"
    },
    {
      "id": 16,
      "name": "Animation"
    },
    {
      "id": 35,
      "name": "Comedy"
    },
    {
      "id": 80,
      "name": "Crime"
    },
    {
      "id": 99,
      "name": "Documentary"
    },
    {
      "id": 18,
      "name": "Drama"
    },
    {
      "id": 10751,
      "name": "Family"
    },
    {
      "id": 14,
      "name": "Fantasy"
    },
    {
      "id": 36,
      "name": "History"
    },
    {
      "id": 27,
      "name": "Horror"
    },
    {
      "id": 10402,
      "name": "Music"
    },
    {
      "id": 9648,
      "name": "Mystery"
    },
    {
      "id": 10749,
      "name": "Romance"
    },
    {
      "id": 878,
      "name": "Science Fiction"
    },
    {
      "id": 10770,
      "name": "TV Movie"
    },
    {
      "id": 53,
      "name": "Thriller"
    },
    {
      "id": 10752,
      "name": "War"
    },
    {
      "id": 37,
      "name": "Western"
    }
  ]
}
```

As you can see, each genre id is associated with an intuitive name such as `Mystery`, `Western` or `Action`.
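
To make the row format concrete, here is a small sketch that parses two of the sample rows and translates their ids into names. The `sample_rows` string and the partial `id_to_name` mapping are copied from the excerpts above for illustration only; they are neither the full dataset nor the full genre list.

```python
import csv
import io

# Two rows copied from the CSV excerpt above.
sample_rows = 'Jarhead,"18,10752"\n"Yours, Mine & Ours","35,10751,10749"\n'

# Partial mapping copied from the JSON response above.
id_to_name = {18: "Drama", 35: "Comedy", 10749: "Romance",
              10751: "Family", 10752: "War"}

parsed = []
for title, genre_field in csv.reader(io.StringIO(sample_rows)):
    # The second field is itself a comma-separated list of genre ids.
    genre_ids = [int(i) for i in genre_field.split(",")]
    parsed.append((title, [id_to_name[i] for i in genre_ids]))

for title, names in parsed:
    print(title, "->", names)
```

Note that the `csv` module takes care of the quoting, so a title that contains a comma (such as `Yours, Mine & Ours`) still comes back as a single field.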

Now we're ready to load and process the dataset.

## Loading the dataset

We first create two dictionaries to convert from genre id to genre name and back.
They will be useful later.

In [3]:
import json

genres_json = json.loads("""
{
  "genres": [
    {
      "id": 28,
      "name": "Action"
    },
    {
      "id": 12,
      "name": "Adventure"
    },
    {
      "id": 16,
      "name": "Animation"
    },
    {
      "id": 35,
      "name": "Comedy"
    },
    {
      "id": 80,
      "name": "Crime"
    },
    {
      "id": 99,
      "name": "Documentary"
    },
    {
      "id": 18,
      "name": "Drama"
    },
    {
      "id": 10751,
      "name": "Family"
    },
    {
      "id": 14,
      "name": "Fantasy"
    },
    {
      "id": 36,
      "name": "History"
    },
    {
      "id": 27,
      "name": "Horror"
    },
    {
      "id": 10402,
      "name": "Music"
    },
    {
      "id": 9648,
      "name": "Mystery"
    },
    {
      "id": 10749,
      "name": "Romance"
    },
    {
      "id": 878,
      "name": "Science Fiction"
    },
    {
      "id": 10770,
      "name": "TV Movie"
    },
    {
      "id": 53,
      "name": "Thriller"
    },
    {
      "id": 10752,
      "name": "War"
    },
    {
      "id": 37,
      "name": "Western"
    }
  ]
}""")

id_to_genre = {elem['id']:elem['name'] for elem in genres_json['genres']}
genre_to_id = {elem['name']:elem['id'] for elem in genres_json['genres']}

Now we can load the dataset and convert the ids to their genre names.

In [23]:
# path to the CSV file
DATA_PATH = "genres.csv"

def load_data(split=0.8):
    # load the data, skipping movies that have no genre
    samples, labels = [], []
    no_genre_count = 0
    with open(DATA_PATH, 'r', newline='') as data_file:
        csv_reader = csv.reader(data_file, delimiter=',')
        for row in csv_reader:
            if row[1] == '':
                no_genre_count += 1
                continue
            samples.append(row[0])
            id_labels = {int(i) for i in row[1].split(',')}
            # one boolean per known genre: True if the movie has that genre
            labels.append({genre: id_ in id_labels for genre, id_ in genre_to_id.items()})
    print("{} movies loaded ({} movies didn't have any genre associated and were discarded!)".format(len(samples), no_genre_count))

    # split into train and dev sets
    split_index = int(len(samples) * split)
    train_samples, train_labels = samples[:split_index], labels[:split_index]
    dev_samples, dev_labels = samples[split_index:], labels[split_index:]
    return (train_samples, train_labels), (dev_samples, dev_labels)

(train_samples, train_labels), (dev_samples, dev_labels) = load_data()

train_data = list(zip(train_samples, [{"cats": cats} for cats in train_labels]))
random.shuffle(train_data)
print(train_data[0])

8360 movies loaded (40 movies didn't have any genre associated and were discarded!)
('The Sweeney', {'cats': {'Action': True, 'Adventure': False, 'Animation': False, 'Comedy': False, 'Crime': True, 'Documentary': False, 'Drama': False, 'Family': False, 'Fantasy': False, 'History': False, 'Horror': False, 'Music': False, 'Mystery': False, 'Romance': False, 'Science Fiction': False, 'TV Movie': False, 'Thriller': False, 'War': False, 'Western': False}})
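
Note that `load_data` splits the rows in file order and only the training part is shuffled afterwards. If the CSV happens to be ordered in some way (for example by release year), the dev set may not be representative of the training set. A possible alternative, sketched below, shuffles titles and labels together before splitting. The helper name `shuffled_split`, the fixed seed and the toy data are hypothetical choices, not part of the original code.

```python
import random


def shuffled_split(samples, labels, split=0.8, seed=0):
    """Shuffle (sample, label) pairs together, then split into train/dev."""
    pairs = list(zip(samples, labels))
    random.Random(seed).shuffle(pairs)  # local RNG, so the split is reproducible
    split_index = int(len(pairs) * split)
    return pairs[:split_index], pairs[split_index:]


# Toy data standing in for the real titles and label dicts.
toy_samples = ["title {}".format(i) for i in range(10)]
toy_labels = [{"Drama": i % 2 == 0} for i in range(10)]

train_pairs, dev_pairs = shuffled_split(toy_samples, toy_labels)
print(len(train_pairs), len(dev_pairs))
```

Shuffling the zipped pairs, rather than the two lists separately, keeps every title attached to its own labels.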


### Quick look at the data

In [32]:
genre_count = {genre: 0 for genre in train_data[0][1]['cats'].keys()}

# count how many training movies carry each genre
for _, cats in train_data:
    labels = cats['cats']
    for genre, has_genre in labels.items():
        if has_genre:
            genre_count[genre] += 1

print(genre_count)


## Implement and run the classifier

In this section we use the [TextCategorizer](https://spacy.io/api/textcategorizer) model to classify our movie titles into their genres.
This code is adapted from [spaCy's text classifier example](https://spacy.io/usage/examples#textcat).
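
The `evaluate` function below scores predictions by thresholding each label's score at 0.5 and pooling the counts over all labels and documents (micro-averaging). The following toy example, which does not need spaCy, walks through the same arithmetic on hand-written scores. The function name `micro_prf` and the data are made up for illustration, and it guards against division by zero explicitly instead of using the small `1e-8` offsets.

```python
def micro_prf(gold_cats, predicted_scores, threshold=0.5):
    """Pool TP/FP/FN over every (document, label) pair and return P, R, F."""
    tp = fp = fn = 0
    for gold, scores in zip(gold_cats, predicted_scores):
        for label, score in scores.items():
            predicted = score >= threshold
            if predicted and gold[label]:
                tp += 1
            elif predicted and not gold[label]:
                fp += 1
            elif not predicted and gold[label]:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


# Two toy documents with two labels each.
gold = [{"Drama": True, "War": True}, {"Drama": False, "War": False}]
scores = [{"Drama": 0.9, "War": 0.2}, {"Drama": 0.7, "War": 0.1}]

precision, recall, f_score = micro_prf(gold, scores)
print(precision, recall, f_score)
```

Here there is one true positive (`Drama` in the first document), one false negative (`War` in the first document) and one false positive (`Drama` in the second), so precision, recall and F-score all come out at 0.5.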

In [21]:
def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 0.0  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 0.0  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:  # skip labels without a gold annotation
                continue

            # FIXME, needs work, as it is it's only calculating the P-R-F for the last element

            if label == "NEGATIVE":
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.0
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.0
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if (precision + recall) == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}


def generate_spacy_model(n_iter=20):
    # create blank model
    nlp = spacy.blank('en')
    
    # add/get text-categorisation pipe
    if "textcat" not in nlp.pipe_names:
        textcat = nlp.create_pipe(
            name="textcat",
            config={
                "exclusive_classes": False,
                "architecture": "simple_cnn",
            }
        )
        nlp.add_pipe(textcat, last=True)
    # otherwise, get it, so we can add labels to it
    else:
        textcat = nlp.get_pipe("textcat")
        
    # add labels
    for genre in genre_to_id.keys():
        textcat.add_label(genre)
    
    
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
    with nlp.disable_pipes(*other_pipes):  # only train textcat
        optimizer = nlp.begin_training()
#         if init_tok2vec is not None:
#             with init_tok2vec.open("rb") as file_:
#                 textcat.model.tok2vec.from_bytes(file_.read())

    print("Training the model...")
    print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
    batch_sizes = compounding(4.0, 32.0, 1.001)
    
    for i in range(n_iter):
        losses = {}
        # batch up the examples using spaCy's minibatch
        random.shuffle(train_data)
        batches = minibatch(train_data, size=batch_sizes)
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
        with textcat.model.use_params(optimizer.averages):
            # evaluate on the dev data split off in load_data()
            scores = evaluate(nlp.tokenizer, textcat, dev_samples, dev_labels)
        print(
            "{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format(  # print a simple table
                losses["textcat"],
                scores["textcat_p"],
                scores["textcat_r"],
                scores["textcat_f"],
            )
        )
    
generate_spacy_model()

Training the model...
LOSS 	  P  	  R  	  F  
74.505	0.556	0.153	0.241
5.032	0.576	0.190	0.286
1.298	0.560	0.219	0.315
0.492	0.546	0.243	0.336
0.467	0.543	0.265	0.356
0.275	0.534	0.279	0.366
0.260	0.524	0.292	0.375
0.245	0.519	0.303	0.382
0.232	0.507	0.311	0.385
0.223	0.500	0.316	0.387
0.211	0.492	0.320	0.388
0.201	0.492	0.328	0.393


KeyboardInterrupt: 
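
The training loop draws its batch sizes from `compounding(4.0, 32.0, 1.001)`. As far as I understand spaCy's helper, it yields a series that starts at the first value and is multiplied by the third on every step, capped at the second. The pure-Python generator below is an illustrative re-implementation under that assumption (and for `start < stop` only); it is not spaCy's own code, and the name `compounding_sketch` is made up.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound


# Look at the first 3000 batch sizes the schedule would produce.
sizes = list(islice(compounding_sketch(4.0, 32.0, 1.001), 3000))
print(sizes[0], round(sizes[1], 3), sizes[-1])
```

With a factor of `1.001` the batch size grows very slowly, so the first updates use small batches and the later ones settle at the cap of 32.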