# Movie Genre Prediction by Poster
The goal of this notebook is to try to predict a movie's genre by looking at the poster it has.
To achieve this goal I will:
1. Use [The Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset) to get 45k movie titles.
2. Use the [OMDb API](http://www.omdbapi.com/) to get the posters for the movies and store them locally.
3. Use TensorFlow and Keras to create a ConvNet that will classify the images.

## Project Structure
1. Data loading, acquisition and cleaning
2. Model creation.
3. Model evaluation.

In [1]:
# Library Loading
import json
from urllib.parse import urlencode
import urllib.request
import requests
import pandas as pd
import os
import numpy as np

# Loading the OMDb API key
with open('apikeys/apikey.json') as key_file:
    OMDB_KEY = json.load(key_file)['key']

## 1. Data loading, acquisition and cleaning

### 1.1 Loading 'The Movies' dataset.

In [2]:
the_movies = pd.read_csv('./datasets/The_Movies/movies_metadata.csv')



### 1.2 Defining the API functions

In [3]:
def get_movie(imdbid):
    '''Gets the movie by IMDb id and returns its genres.
       Input args:
       - imdbid: the IMDb id of the movie
       Returns:
       - If the API responds successfully, the 'Genre' field of the response; otherwise 'NA'
    '''
    url = 'http://www.omdbapi.com/?apikey=' + OMDB_KEY + '&i=' + str(imdbid)
    r = requests.get(url)
    if r.status_code == 200:
        movie_json = r.json()
        return movie_json['Genre']
    else:
        return 'NA'

def get_poster(imdb_id, path, genre, dataset_type, image_height, item_id):
    '''Gets the movie poster as a jpg and saves it to the path.
       Input args:
       - imdb_id: the imdb id of the movie,
       - path: where the dataset will be stored,
       - genre: the genre of the movie,
       - dataset_type: whether we are creating the training or testing set,
       - image_height: the height of the poster (API query parameter),
       - item_id: identifier to attach to the filename
       Returns:
       - None
    '''
    # construct the request url
    url = 'http://img.omdbapi.com/?apikey=' + OMDB_KEY + '&i=' + str(imdb_id) + '&h=' + str(image_height)
    # create the dir where we store the posters based on the genre and the dataset type
    path = os.path.join(os.getcwd(), path, dataset_type, genre + '/')
    # check if the folder is already_created
    if not os.path.exists(path):
        os.makedirs(path)
    r = requests.get(url)
    filename = os.path.join(os.getcwd(), path, genre + '.' + str(item_id) + '.jpg')
    if r.status_code == 200:
        with open(filename, 'wb') as w:
            w.write(r.content)
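
The two functions above assemble the query string by concatenation. As a minimal sketch, the same URL could be built with `urlencode`, which is already imported at the top of the notebook. The helper name `build_omdb_url` and the key `DEMO_KEY` are hypothetical; the parameter names `apikey`, `i` and `h` are the ones used in the cells above.

```python
from urllib.parse import urlencode

def build_omdb_url(base_url, api_key, imdb_id, height=None):
    """Hypothetical helper: build an OMDb request URL from its query parameters."""
    params = {'apikey': api_key, 'i': str(imdb_id)}
    if height is not None:
        # 'h' is only sent for poster requests, as in get_poster above
        params['h'] = str(height)
    # urlencode escapes any characters that are not safe in a query string
    return base_url + '?' + urlencode(params)

# 'DEMO_KEY' and the id below are placeholders, not real credentials
print(build_omdb_url('http://img.omdbapi.com/', 'DEMO_KEY', 'tt0114709', height=600))
```

Passing `timeout=...` to `requests.get` would also keep a long download run from hanging on a single stalled request.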

### 1.3 Downloading the movie posters.

As a movie can have multiple genres, I will make an assumption here: the genre listed first is the genre of the movie. Let's create a function that gets only the main genre of the movie from the dataset. If the dataset lists none, it fetches one through the OMDb API instead.

In [4]:
import ast

def get_main_genre(dataset_genre, imdb_id):
    '''
        Gets the main genre of the movie. If there is none listed, pull one through the OMDb API.
        Input args:
        - dataset_genre: the dataset value of the genres for the movie,
        - imdb_id: the imdb_id of the movie
        Returns:
        - main_genre: the main genre of the movie
    '''
    dataset_genre = ast.literal_eval(dataset_genre)
    if len(dataset_genre) == 0:
        try: 
            main_genre = get_movie(imdb_id).replace(' ', '').split(',')[0]
        except KeyError:
            main_genre = 'NA'
    else:
        main_genre = dataset_genre[0]['name']
        if main_genre == 'N/A':
            main_genre = 'NA'
    return main_genre
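
To make the parsing step concrete, here is a small sketch of how a `genres` cell is turned into a single label. The sample string is an illustrative value in the format the dataset uses, and `first_genre_name` is a hypothetical helper that covers only the dataset branch of `get_main_genre` (no API fallback).

```python
import ast

# Illustrative value in the format of the dataset's `genres` column
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

def first_genre_name(raw, default='NA'):
    """Hypothetical helper: return the first listed genre name, or `default` if none."""
    genres = ast.literal_eval(raw)  # safely parse the string into a list of dicts
    return genres[0]['name'] if genres else default

print(first_genre_name(raw_genres))  # Animation
print(first_genre_name('[]'))        # NA
```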

In [157]:
# DANGER: Slow Code!
the_movies['main_genre'] = the_movies.apply(lambda x: get_main_genre(x['genres'], x['imdb_id']), axis=1)

In [158]:
the_movies['main_genre'].unique().shape 

(31,)

In [189]:
# quick fix for N/As
the_movies.loc[the_movies['main_genre'] == 'N/A', 'main_genre'] = 'NA'

All in all we got 31 classes. Let's create the train/test split now, since each poster is saved into a train or test folder as it is downloaded.

In [191]:
from sklearn.model_selection import train_test_split

movies_train, movies_test = train_test_split(the_movies, test_size=0.3, random_state=0)

Finally, downloading the posters.

In [None]:
# downloading the posters for the train set
for idx, row in movies_train.iterrows():
    # get_poster(imdb_id, path, genre, dataset_type, image_height, item_id)
    get_poster(row['imdb_id'], 'datasets/The_Movies/posters', row['main_genre'], 'train', 600, idx)
# downloading the posters for the test set
for idx, row in movies_test.iterrows():
    get_poster(row['imdb_id'], 'datasets/The_Movies/posters', row['main_genre'], 'test', 600, idx)

The code above has been moved to a script so it can run overnight and download all of the movies' posters.
Since that has been done, let's move on to what genres we have actually downloaded.

In [3]:
os.listdir('/media/fury/data/Scripts/the_movies_data_scraper/datasets/The_Movies/posters/train')

['Fantasy',
 'Drama',
 'Biography',
 'Animation',
 'Science Fiction',
 'Romance',
 'Crime',
 'Adventure',
 'History',
 'War',
 'Foreign',
 'Documentary',
 'Thriller',
 'Music',
 'Action',
 'Western',
 'Short',
 'Family',
 'Horror',
 'Comedy',
 'Mystery',
 'Musical',
 'TV Movie']

The download produced a few directories that can be edited or removed (the listing above appears to have been captured after this clean-up, so it no longer shows them). For example:
- 'Sci-Fi' and 'Science Fiction' are the same thing, so both dirs can be merged into a common category, 'Sci-Fi'.
- 'Aniplex', 'Odyssey Media' and 'Carousel Productions' are actually names of companies. I will check how many samples they contribute and delete them if the number is not significant (significant being roughly 100 or more).
- 'NA' gets the same treatment: check the number of samples and delete it.

#### Results
- Sci-Fi: there are only 5 posters in 'Sci-Fi', so I merged it into 'Science Fiction' instead.
- 'Odyssey Media', 'Carousel Productions' and 'NA' contained no information whatsoever.
- Also removed 'Adult', as it contained no posters.

#### Additional Checks performed
- Whether both dirs contain the same genres and the same number of genres.
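
The clean-up itself is not shown in the notebook. A minimal sketch of how the merge and the consistency check could be done with the standard library follows; `merge_genre_dirs` and `same_genres` are hypothetical helpers, and the demo runs on a throwaway directory tree rather than the real poster folders.

```python
import os
import shutil
import tempfile

def merge_genre_dirs(root, src_genre, dst_genre):
    """Move every file from root/src_genre into root/dst_genre, then drop src_genre."""
    src_dir = os.path.join(root, src_genre)
    dst_dir = os.path.join(root, dst_genre)
    if not os.path.isdir(src_dir):
        return 0
    os.makedirs(dst_dir, exist_ok=True)
    moved = 0
    for name in os.listdir(src_dir):
        shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
        moved += 1
    os.rmdir(src_dir)  # succeeds only because the folder is now empty
    return moved

def same_genres(train_dir, test_dir):
    """True when both splits contain exactly the same genre folders."""
    return set(os.listdir(train_dir)) == set(os.listdir(test_dir))

# Demo on a temporary tree with one fake poster per split
demo_root = tempfile.mkdtemp()
for split in ('train', 'test'):
    for genre in ('Sci-Fi', 'Science Fiction'):
        os.makedirs(os.path.join(demo_root, split, genre))
    open(os.path.join(demo_root, split, 'Sci-Fi', 'Sci-Fi.1.jpg'), 'wb').close()

train_dir = os.path.join(demo_root, 'train')
test_dir = os.path.join(demo_root, 'test')
moved_train = merge_genre_dirs(train_dir, 'Sci-Fi', 'Science Fiction')
consistent_before = same_genres(train_dir, test_dir)  # only train merged so far
moved_test = merge_genre_dirs(test_dir, 'Sci-Fi', 'Science Fiction')
consistent_after = same_genres(train_dir, test_dir)
shutil.rmtree(demo_root)
print(moved_train, consistent_before, moved_test, consistent_after)
```

Moved files keep their original filename prefix; that does not matter for `flow_from_directory`, which takes the label from the folder name.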

## 2. Model creation.

### 2.1 Importing the necessary libraries.

In [3]:
# Importing the libraries
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense

Using TensorFlow backend.


### 2.2  Loading the data.
With the help of Keras that should be quite straightforward. The ImageDataGenerator can be used to load the data and get the labels simultaneously in a neat 2D one-hot encoded format. In addition it allows us to generate additional train/validation/test data.

In [4]:
from keras.preprocessing.image import ImageDataGenerator

train_location = '/media/fury/data/Scripts/the_movies_data_scraper/datasets/The_Movies/posters/train'
test_location = '/media/fury/data/Scripts/the_movies_data_scraper/datasets/The_Movies/posters/test'

train_datagen = ImageDataGenerator(rescale = 1./255)
test_datagen = ImageDataGenerator(rescale = 1./255)

# The generators
train_generator = train_datagen.flow_from_directory(
    train_location,
    target_size = (300, 300),
    color_mode = 'rgb',
    batch_size = 32,
    class_mode = 'categorical',
    seed = 42
)

test_generator = test_datagen.flow_from_directory(
    test_location,
    target_size = (300, 300),
    color_mode = 'rgb',
    batch_size = 32,
    class_mode = 'categorical',
    seed = 42
)

Found 23273 images belonging to 23 classes.
Found 12494 images belonging to 23 classes.


### 2.3 Model Building

In [5]:
# Le model
classifier = Sequential()
classifier.add(Conv2D(32, (3,3), input_shape = (300, 300, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2,2)))
classifier.add(Flatten())
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dense(units = 23, activation = 'sigmoid'))

In [6]:
classifier.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 298, 298, 32)      896       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 149, 149, 32)      0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 710432)            0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               90935424  
_________________________________________________________________
dense_2 (Dense)              (None, 23)                2967      
Total params: 90,939,287
Trainable params: 90,939,287
Non-trainable params: 0
_________________________________________________________________
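
The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer definitions, assuming the Keras defaults used above ('valid' padding, stride 1, one bias per filter or unit); the helper names are hypothetical.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # one weight per kernel cell and input channel, plus one bias, per filter
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(inputs, units):
    # one weight per input plus one bias, per unit
    return (inputs + 1) * units

conv = conv2d_params(3, 3, 3, 32)   # 896
conv_out = 300 - 3 + 1              # 298: a 3x3 kernel without padding trims 2 pixels
pooled = conv_out // 2              # 149 after 2x2 max pooling
flat = pooled * pooled * 32         # 710432 values after Flatten
dense_1 = dense_params(flat, 128)   # 90935424
dense_2 = dense_params(128, 23)     # 2967
total = conv + dense_1 + dense_2
print(total)                        # 90939287
```

Almost all of the weights sit in the first dense layer, because it is fed the full flattened feature map.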


[Why is ADAM the optimizer?](https://arxiv.org/pdf/1609.04747.pdf)

In [7]:
# Le compile
classifier.compile(optimizer = 'adam', 
                   loss = 'categorical_crossentropy', 
                   metrics = ['accuracy'])
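
A note on the output layer: `categorical_crossentropy` treats each prediction as a distribution over the classes, while the `sigmoid` activation used above scores every class independently. For a single-label problem like this one, `softmax` is the more common pairing. The pure-Python sketch below only illustrates the difference between the two activations on made-up logits; whether switching would change the accuracy seen here is not something this notebook tests.

```python
import math

def sigmoid(logits):
    # each score is squashed on its own; the results need not sum to 1
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax(logits):
    # subtract the max for numerical stability, then normalise to a distribution
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three classes
sigmoid_sum = sum(sigmoid(logits))
softmax_sum = sum(softmax(logits))
print(sigmoid_sum)  # greater than 1
print(softmax_sum)  # 1, up to floating-point rounding
```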

In [None]:
classifier.fit_generator(train_generator,
                         steps_per_epoch = 8000,
                         epochs = 10,
                         validation_data = test_generator,
                         validation_steps = 2000)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10

Very low accuracy, only a little above random picking... The model is underfitting, so its bias needs to be reduced. But first, let's create a validation set.
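
"Random picking" can mean two different baselines: guessing uniformly among the classes, or always predicting the most frequent class. With imbalanced genres the second one is higher, so it is the fairer comparison. A minimal sketch, using toy counts that are invented for illustration and are not the real poster counts:

```python
def accuracy_baselines(class_counts):
    """Return (uniform_guess_accuracy, majority_class_accuracy) for a label distribution."""
    total = sum(class_counts.values())
    uniform = 1.0 / len(class_counts)
    majority = max(class_counts.values()) / total
    return uniform, majority

# Toy counts, invented for illustration only
toy_counts = {'Drama': 50, 'Comedy': 30, 'Horror': 20}
uniform, majority = accuracy_baselines(toy_counts)
print(uniform, majority)

# Uniform guessing over the 23 genres used in this notebook
print(1 / 23)
```

The real counts per genre can be read from the poster folders in the same way `get_total_posters` below walks them.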

#### Offtopic: Creating the validation set.

In [17]:
def get_total_posters(dataset, location):
    num_items = 0
    for item in os.listdir(location):
        num_posters = len(os.listdir(os.path.join(location, item)))
        num_items += num_posters

    print("%s: %i" % (dataset, num_items))

In [55]:
get_total_posters('train', train_location)
get_total_posters('test', test_location)

train: 23273
test: 12494


In [47]:
from shutil import move

In [51]:
def move_data_to_validation(perc, location, target_location):
    '''
    Move a certain percentage of the target dataset to a new location.
    The percentage is drawn from each genre.
    '''
    for genre in os.listdir(location):
        dir_contents = os.listdir(os.path.join(location, genre))
        dir_len = len(dir_contents)
        num_items_to_take = int(np.ceil(perc * dir_len / 100))
        # Randomize the draw with integers drawn from a discrete uniform distribution
        items_idx_array = np.random.choice(dir_len, num_items_to_take, replace = False)
        
        for idx in items_idx_array:
            src = os.path.join(location, genre, dir_contents[idx])
            dst_dir = os.path.join(target_location, genre)
            if not os.path.exists(dst_dir): # make sure that the path exists
                os.makedirs(dst_dir)
            move(src, os.path.join(dst_dir, dir_contents[idx]))
            
    print('Done!') # just because
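
The draw above is not seeded, so a rerun would move a different set of files. As a sketch of a reproducible variant of the per-genre selection (selection only, no files are moved), assuming a hypothetical helper `pick_validation_files` and the standard library's `random` module instead of NumPy:

```python
import math
import random

def pick_validation_files(filenames, perc, seed=0):
    """Pick ceil(perc %) of the filenames, reproducibly for a given seed."""
    rng = random.Random(seed)
    num_to_take = math.ceil(perc * len(filenames) / 100)
    # sort first so the result does not depend on directory listing order
    return sorted(rng.sample(sorted(filenames), num_to_take))

# Made-up filenames in the naming scheme used by get_poster
files = ['Drama.%d.jpg' % i for i in range(7)]
picked = pick_validation_files(files, 20)
print(len(picked))  # 2, because ceil(1.4) rounds up
```

Rounding up means every genre contributes at least one poster to the validation set, which matches the behaviour of `np.ceil` in the function above.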

In [52]:
# let's move the data
validation_location = '/media/fury/data/Scripts/the_movies_data_scraper/datasets/The_Movies/posters/validation'
move_data_to_validation(20, train_location, validation_location)

Done!


In [15]:
validation_location = '/media/fury/data/Scripts/the_movies_data_scraper/datasets/The_Movies/posters/validation'

In [18]:
# Getting the number of items in each dir
get_total_posters('train', train_location)
get_total_posters('test', test_location)
get_total_posters('validation', validation_location)

train: 23273
test: 12494
validation: 5832


In [19]:
# generating the validation dataset
validation_datagen = ImageDataGenerator(rescale = 1./255)

# The generators
validation_generator = validation_datagen.flow_from_directory(
    validation_location,
    target_size = (300, 300),
    color_mode = 'rgb',
    batch_size = 32,
    class_mode = 'categorical',
    seed = 42
)

Found 5832 images belonging to 23 classes.


### 2.3 contd. Model Building.
Since we saw that the model is underfitting, there are a few steps we could try out:  
* CNN architecture: maybe make the model a little deeper...
* Create more data. In my opinion there are two approaches that could be examined:  
    A. Blindly creating more samples: just create more images, regardless of their class.   
    B. Try to balance out the classes by only cropping the images to generate additional ones.  
* Train for longer, although the benefit is unclear, as no significant improvement seems to be made between epochs.
* Tune the hyperparameters, maybe with grid search?

Let's go over them.

#### 2.3.1 Going deeper...

In [11]:
# Le model 2
deep_clf = Sequential()
deep_clf.add(Conv2D(64, (3,3), input_shape = (300, 300, 3), activation = 'relu'))
deep_clf.add(MaxPooling2D(pool_size = (2,2)))
deep_clf.add(Flatten())
deep_clf.add(Dense(units = 368, activation = 'relu'))
deep_clf.add(Dense(units = 92, activation = 'relu'))
deep_clf.add(Dense(units = 23, activation = 'sigmoid'))

In [12]:
deep_clf.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_3 (Conv2D)            (None, 298, 298, 64)      1792      
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 149, 149, 64)      0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 1420864)           0         
_________________________________________________________________
dense_6 (Dense)              (None, 368)               522878320 
_________________________________________________________________
dense_7 (Dense)              (None, 92)                33948     
_________________________________________________________________
dense_8 (Dense)              (None, 23)                2139      
Total params: 522,916,199
Trainable params: 522,916,199
Non-trainable params: 0
______________________________________________________________
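
The `dense_6` row dominates this summary, and its size translates directly into memory. The sketch below is a rough estimate only: it assumes 32-bit floats and counts just the weights of that one layer plus the two moving-average tensors that Adam keeps per weight, ignoring gradients, activations and everything else on the GPU.

```python
flat = 149 * 149 * 64                 # 1420864 values after Flatten
dense_6_params = (flat + 1) * 368     # 522878320, as in the summary
bytes_per_float32 = 4

weights_gib = dense_6_params * bytes_per_float32 / 2**30
# Adam stores a first- and second-moment estimate for every weight
with_adam_gib = 3 * weights_gib

print(dense_6_params)
print(round(weights_gib, 2), round(with_adam_gib, 2))
```

That is close to 2 GiB for the weights of a single layer, and roughly three times as much once the optimizer state is included, which is worth keeping in mind before training on a single GPU.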

In [13]:
deep_clf.compile(optimizer = 'adam', 
                 loss = 'categorical_crossentropy', 
                 metrics = ['accuracy'])

In [20]:
deep_clf.fit_generator(train_generator,
                       steps_per_epoch = 8000,
                       epochs = 10,
                       validation_data = validation_generator,
                       validation_steps = 2000)

Epoch 1/10


ResourceExhaustedError: OOM when allocating tensor with shape[1420864,368] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node dense_6/random_uniform/RandomUniform (defined at /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4139)  = RandomUniform[T=DT_INT32, dtype=DT_FLOAT, seed=87654321, seed2=3603411, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dense_3/random_uniform/shape)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.



