# Predicting Movie Success using Posters
---
### Model:
    Deep Convolutional Neural Network

### Developers:

    Bardia Borhani  
    Kevin Ulrich

### Project Goals:
1. Create a working deep convolutional neural network
2. Train the network on posters from [The Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset)
3. See what performance the model's predictions achieve on new posters
4. Bonus: Try out other models. Ensemble models.

# Setup Environment

We import NumPy and pandas for our data processing needs, specify the input location, and check which files are in the input folder.

In [61]:
print('goodbye world')

goodbye world


In [62]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

input_loc = '../input/movie_dataset'

print(os.listdir(input_loc))

['links_small.csv', 'the-movies-dataset.zip', 'ratings.csv', 'links.csv', 'keywords.csv', 'credits.csv', 'ratings_small.csv', 'movies_metadata.csv']


# Import Data

We then read in the data from the `movies_metadata.csv` file. Instead of previewing the first few rows (that call is commented out), we print the `poster_path` of two rows to show that some movies have no poster.

In [63]:
metadata = pd.read_csv(input_loc + '/movies_metadata.csv', low_memory=False)

# preview the data
# print(metadata.head())

# there are some movies that don't have posters... we'll deal with these later
print(str(metadata.iloc[[38802]]['poster_path']))
print(str(metadata.iloc[[44660]]['poster_path']))

38802    /sLvzxFBH5vGFei8oUNzDVwHKGdl.jpg
Name: poster_path, dtype: object
44660    NaN
Name: poster_path, dtype: object
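
Spot-checking two rows shows that missing posters exist, but not how many there are. As a minimal sketch, assuming missing posters always appear as `NaN` in `poster_path`, the count can be taken with `isna()`. The `toy` DataFrame and its values are made up for illustration; in the notebook the same calls would run on `metadata`.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for `metadata`; only the poster_path column matters here
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'poster_path': ['/a.jpg', np.nan, '/c.jpg'],
})

# boolean mask: True where the poster path is missing
missing = toy['poster_path'].isna()

# how many rows have no poster, and at which row positions
print('rows without a poster:', int(missing.sum()))
print('their positions:', np.flatnonzero(missing.to_numpy()).tolist())
```

The positions returned here are row positions, so they line up with the `iloc` lookups used above.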


Continuing to check our data, let's list all the column names. Knowing them helps us select columns (with `loc` or `iloc`) when we want to go through the data row by row.

In [64]:
# what are the column names?
first_movie = metadata.iloc[[0]]
for key in first_movie.columns:
    print(key)

adult
belongs_to_collection
budget
genres
homepage
id
imdb_id
original_language
original_title
overview
popularity
poster_path
production_companies
production_countries
release_date
revenue
runtime
spoken_languages
status
tagline
title
video
vote_average
vote_count


# Scrape Posters

The script below scrapes all the posters from the TMDB website. Note that we download the images at a fixed width of 780 pixels (`w780`), which we hope means all the posters end up the same size. Other sizes we could use are `w500` or `original`, just by replacing `w780` wherever it appears in the script.

Also note that no error checking occurs: whatever the server returns is written to disk, regardless of the HTTP status, and that includes rows with no `poster_path`. The script ran without a hitch for us.

This script takes about 5 hours to download the ~45,000 posters at 780 width. I've heard that using sessions with the requests library can help, but I didn't look into it. If the dataset were larger, that might be a good idea.

In [65]:
# get all movie posters

# imports
import requests
from tqdm import tqdm_notebook

# create a directory for the poster images to reside
poster_dir = 'poster_imgs_backup/'
if not os.path.exists(poster_dir):
    os.makedirs(poster_dir)

# for each row of data
for index, row in tqdm_notebook(list(metadata.iterrows())):
    
    # build the file path
    posterpath = str(row['poster_path'])
    filename = str(index) + '.jpg' # ignore the given file extension, it might be non-existent
    filepath = poster_dir + filename
    
    # if we haven't downloaded the file yet
    if not os.path.exists(filepath):
        
        # download the movie poster from tmdb using the poster path
        # and save it locally
        url = 'http://image.tmdb.org/t/p/w780' + posterpath
        r = requests.get(url, allow_redirects=True)
        with open(filepath, 'wb') as f:
            f.write(r.content)
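
The session idea mentioned above could look roughly like the sketch below. It is not the script that was actually run. The names `poster_url`, `download_poster` and `download_all` are hypothetical helpers, the 10-second timeout is an arbitrary choice, and whether a session really shortens the 5-hour run is untested here. The sketch reuses one `requests.Session` for all downloads, skips rows whose `poster_path` is missing, and only writes a file when the server answers with status 200.

```python
import os


def poster_url(poster_path, size='w780'):
    """Build the image URL; `size` can be 'w500', 'w780' or 'original'."""
    return 'http://image.tmdb.org/t/p/' + size + poster_path


def download_poster(session, url, filepath, timeout=10):
    """Fetch one poster and save it only if the request succeeded.

    `session` is any object with a requests-style .get(); returns True on success.
    """
    response = session.get(url, allow_redirects=True, timeout=timeout)
    if response.status_code != 200:
        return False
    with open(filepath, 'wb') as f:
        f.write(response.content)
    return True


def download_all(poster_paths, poster_dir):
    """Download every non-missing poster through a single shared session.

    `poster_paths` maps a row index to a poster path (a pandas Series or a dict).
    Returns the indices that were skipped or failed.
    """
    import requests  # imported here so the helpers above work without it

    os.makedirs(poster_dir, exist_ok=True)
    failed = []
    with requests.Session() as session:
        for index, poster_path in poster_paths.items():
            if not isinstance(poster_path, str):  # NaN marks a missing poster
                failed.append(index)
                continue
            filepath = os.path.join(poster_dir, str(index) + '.jpg')
            if os.path.exists(filepath):  # already downloaded
                continue
            if not download_poster(session, poster_url(poster_path), filepath):
                failed.append(index)
    return failed
```

With the notebook's variables this would be called as `failed = download_all(metadata['poster_path'], poster_dir)`, and the returned list records which rows ended up without a poster.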




# Load in the Data

Now that we've scraped the posters, it's time to load them into memory so we can give them to our neural network to train it.

The function we use to preview our images is from [here](https://gist.github.com/soply/f3eec2e79c165e39c9d540e916142ae1)

In [66]:
# # function to plot the first n images
# def plotFirstN(image_collection, n):
#     fig, ax = plt.subplots(ncols=n, sharex=True, sharey=True,
#                        figsize=(n * 8, n * 4))

#     for i in range(0, n):
#         ax[i].imshow(image_collection[i], cmap=plt.cm.gray)

#     for a in ax:
#         a.axis('off')

#     plt.tight_layout()
#     plt.show()

In [67]:
# read in filenames
files = []
for (dirpath, dirnames, filenames) in os.walk(poster_dir):
    files.extend(filenames)
    break

# prepend the filepath to the filenames
files = [os.path.join(poster_dir, name) for name in files]

print('number of files on disk: ', len(files))

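# figure out all the bad metadata (movies without posters in the dataset)
# note: this only flags positions in `files` whose name lacks '.jpg'; the
# scraper above saves every download as '<index>.jpg', so a failed download
# is not caught by this check
bad_metadata = []

for index, filename in enumerate(tqdm_notebook(list(files))):
    if '.jpg' not in filename:
        bad_metadata.append(index)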
    
print('with bad data: ', len(metadata))

# get rid of all the bad metadata
good_metadata = []
bad_positions = set(bad_metadata)  # set lookup instead of scanning a list
for index, data in enumerate(tqdm_notebook(list(metadata.iterrows()))):
    if index not in bad_positions:
        good_metadata.append(data)

print('without bad data: ', len(good_metadata))

number of files on disk:  45863
with bad data:  45466
without bad data:  45466
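
Because every download is saved with a `.jpg` name, checking the file name cannot tell a real poster from a saved error response. One alternative, sketched below, is to look at the first bytes of each file, since JPEG files start with the bytes `FF D8 FF`. This assumes every real poster is served as a JPEG; a poster delivered in another format would be flagged as bad. The helper names `looks_like_jpeg` and `split_by_validity` are made up for this example.

```python
import os

JPEG_MAGIC = b'\xff\xd8\xff'  # JPEG files begin with these three bytes


def looks_like_jpeg(filepath):
    """Return True if the file can be read and begins with the JPEG signature."""
    try:
        with open(filepath, 'rb') as f:
            return f.read(3) == JPEG_MAGIC
    except OSError:
        # missing or unreadable files count as bad
        return False


def split_by_validity(filepaths):
    """Partition paths into (good, bad) lists according to looks_like_jpeg."""
    good, bad = [], []
    for path in filepaths:
        (good if looks_like_jpeg(path) else bad).append(path)
    return good, bad
```

In the notebook this could be applied as `good_files, bad_files = split_by_validity(files)`. Since each file is named after its row index, the index of a bad file can be recovered from its name.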


In [68]:
from plot_images import show_images
import skimage
import skimage.io as io
import skimage.transform
from skimage import data
import sys

# read in the images
images = io.imread_collection(list(files), conserve_memory=True)

print(len(images))

# show_images(images).show()

45863
