# Data Cleaning for Milestone 4

As we discovered during our work in Milestone 4, our dataframes need some extra pre-processing before feeding them into our neural nets. We also found that it was helpful to collapse down our 19 genres into 3 more general categories.

This notebook separates out some of the data processing we used in Milestone 4 so that it is easier to read and to see exactly what we did.

In [28]:
import pandas as pd
import os
import urllib2  # used to download the poster images

# used for image manipulation (PIL is provided by the Pillow package)
# sudo pip install Pillow
import PIL.Image as Image

## Building the DataFrames

In [14]:
train = pd.read_csv("train_full.csv")

# make sure the poster path is an ascii string
train.poster_path = train.poster_path.astype(str)

# drop this unused column
train.drop("Unnamed: 0", axis=1, inplace=True)

# no need for this genre (only a handful of movies have this)
train.drop("10769", axis=1, inplace=True)

print "Train shape:", train.shape
train.head(1)

Train shape: (7220, 31)


Unnamed: 0,10402,10749,10751,10752,10770,12,14,16,18,27,...,lead actors,movie_id,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,0,0,0,0,0,0,0,0,1,0,...,"[u'Amy Adams', u'Jeremy Renner', u'Forest Whit...",329865,Taking place after alien crafts land around th...,25.66195,/hLudzvGfpi6JlwUnsNhXwKKg4j.jpg,2016-11-10,Arrival,False,6.9,3510


In [18]:
# this is our final dataset with 19 labels
train.to_csv("final_dataset_19_labels.csv", index = False)

As we discovered while building our nets in Milestone 4, having 19 labels turns out to be quite a challenge for our neural network, which doesn't seem to have enough data to beat a trivial classifier. For that reason, we collapse our 19 columns into 3 general genre categories based on whether the genres are lighthearted, exciting (which we call heartbeat), or anything else (which we call other).

Here is our relevant commentary from Milestone 4:

**Label insights from milestone 4:** Even using huge images and many layers doesn't seem to get us past baseline accuracy.

It seems like the problem might be this: we have 17 labels, and our entire label matrix is incredibly sparse. It's 84% 0's, meaning that our prediction problem amounts to looking for needles in a haystack. To decrease the sparsity of this dataset, we might want to collapse our 17 genres into a handful of categories.

For instance, we could collapse "War" and "Horror" into "Scary". This will help us decrease the number of labels in our label matrix, which might finally get our neural net above the baseline of 84%.
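
To make the sparsity argument concrete, here is a minimal sketch on a made-up label matrix (the values are invented for illustration and are not our data). It assumes the baseline is the element-wise accuracy of a classifier that always predicts 0, in which case the baseline is simply the share of zeros in the matrix. The helper names `zero_fraction`, `all_zeros_accuracy`, and `collapse` are hypothetical.

```python
# Toy multi-label matrix: one row per movie, one column per genre.
# These values are invented for illustration only.
labels = [
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
]


def zero_fraction(matrix):
    """Share of entries in the label matrix that are 0."""
    cells = [v for row in matrix for v in row]
    return cells.count(0) / float(len(cells))


def all_zeros_accuracy(matrix):
    """Element-wise accuracy of a classifier that predicts 0 everywhere."""
    total = sum(len(row) for row in matrix)
    correct = sum(1 for row in matrix for v in row if v == 0)
    return correct / float(total)


def collapse(matrix, groups):
    """Merge columns: a merged label is 1 if any column in its group is 1."""
    return [[int(any(row[i] for i in group)) for group in groups]
            for row in matrix]


sparsity = zero_fraction(labels)
baseline = all_zeros_accuracy(labels)

# merge the five genre columns into two broader categories
collapsed = collapse(labels, [[0, 1, 2], [3, 4]])
collapsed_sparsity = zero_fraction(collapsed)

print("before: %.3f zeros, after collapsing: %.3f zeros"
      % (sparsity, collapsed_sparsity))
```

On this toy matrix the share of zeros drops once columns are merged, so the trivial all-zeros baseline becomes easier to beat.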

Here's how we'll do the split: We'll create a new category, "Heartbeat", which will represent movies that make your heart beat quickly (romance, adventure, horror, crime, thriller etc.)

We'll also do "Lighthearted" for lighthearted movies (music, comedy, family, animation, etc.)

And lastly, we'll have a category for "other" movies, whose genres are neither categorically lighthearted nor inherently exciting.

* 'TV Movie' => Other
* 'Music' => Lighthearted
* 'Adventure' => Heartbeat
* 'Fantasy' => Heartbeat
* 'Animation' => Lighthearted
* 'Drama' => Heartbeat
* 'Action' => Heartbeat
* 'History' => Other
* 'Comedy' => Lighthearted
* 'War' => Heartbeat
* 'Horror' => Heartbeat
* 'Western' => Other
* 'Romance' => Heartbeat
* 'Family' => Lighthearted
* 'Crime' => Heartbeat
* 'Thriller' => Heartbeat

In the cell below, I create a new dataframe that replaces the 19 genre columns with these three label columns; it is saved to a different CSV further down.

In [4]:
curr_df = train.copy()
curr_df.drop(curr_df.columns[:19], axis=1, inplace=True)

# start every movie at 0 for each collapsed label
curr_df["other"] = 0
curr_df["heartbeat"] = 0
curr_df["lighthearted"] = 0

# read the genre flags from `train`: they were just dropped from curr_df,
# which still shares the same index
for index, row in train.iterrows():
    # History, Western, TV Movie
    if row["36"] == 1 or row["37"] == 1 or row["10770"] == 1:
        curr_df.at[index, "other"] = 1

    # Adventure, Fantasy, Drama, Horror, Action, Thriller, Crime, Romance, War
    if row["12"] == 1 or row["14"] == 1 or row["18"] == 1 or row["27"] == 1 or row["28"] == 1 or row["53"] == 1 or row["80"] == 1 or row["10749"] == 1 or row["10752"] == 1:
        curr_df.at[index, "heartbeat"] = 1

    # Animation, Comedy, Music, Family
    if row["16"] == 1 or row["35"] == 1 or row["10402"] == 1 or row["10751"] == 1:
        curr_df.at[index, "lighthearted"] = 1
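
As an alternative to the row-by-row loop, the same collapse can be sketched with column-wise pandas operations. This is a minimal sketch under stated assumptions, not the code we ran: `CATEGORY_COLUMNS`, `collapse_genres`, and the tiny `toy` frame are hypothetical names and invented data, and the column groups simply mirror the genre-id columns checked in the loop above.

```python
import pandas as pd

# Hypothetical grouping of genre-id columns, mirroring the loop above.
CATEGORY_COLUMNS = {
    "other": ["36", "37", "10770"],
    "heartbeat": ["12", "14", "18", "27", "28", "53", "80", "10749", "10752"],
    "lighthearted": ["16", "35", "10402", "10751"],
}


def collapse_genres(df, category_columns=CATEGORY_COLUMNS):
    """Return a DataFrame with one 0/1 column per collapsed category."""
    collapsed = pd.DataFrame(index=df.index)
    for category, genre_cols in category_columns.items():
        # 1 if any genre column in this group is flagged for the movie
        collapsed[category] = (df[genre_cols] == 1).any(axis=1).astype(int)
    return collapsed


# Tiny invented frame: every genre column starts at 0, then a few flags are set.
all_cols = [c for cols in CATEGORY_COLUMNS.values() for c in cols]
toy = pd.DataFrame(0, index=[0, 1, 2], columns=all_cols)
toy.loc[0, "18"] = 1      # one heartbeat genre
toy.loc[1, "35"] = 1      # one lighthearted genre ...
toy.loc[1, "10749"] = 1   # ... plus one heartbeat genre
# row 2 has no flagged genre, so it stays 0 in every category

print(collapse_genres(toy))
```

As in the loop, a movie can receive a 1 in more than one category, and a movie whose genres fall outside the three groups gets 0 in all of them.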

In [16]:
print "Train shape:", curr_df.shape
curr_df.head(1)

Train shape: (7220, 15)


Unnamed: 0,adult,director,lead actors,movie_id,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count,other,heartbeat,lighthearted
0,False,Denis Villeneuve,"[u'Amy Adams', u'Jeremy Renner', u'Forest Whit...",329865,Taking place after alien crafts land around th...,25.66195,/hLudzvGfpi6JlwUnsNhXwKKg4j.jpg,2016-11-10,Arrival,False,6.9,3510,0,1,0


In [19]:
# this is our final dataset with 3 labels
curr_df.to_csv("final_dataset_3_labels.csv", index = False)

## Downloading the Images

In [20]:
## Helper that downloads web images
## Takes in the poster path and the id of the movie
## Saves the poster as a jpg named after the unique id of the movie
## in the all_train_imgs folder.
def download_web_image(poster_path, movie_id):
    # given that we're going to resize our images to be 32x32
    # or something else really small, let's download really small images
    # to start
    # (poster_path already starts with "/", so no trailing slash here)
    base_url = "https://image.tmdb.org/t/p/w92"

    request = urllib2.Request(base_url + poster_path)
    img = urllib2.urlopen(request).read()
    image_name = "all_train_imgs/" + str(movie_id) + ".jpg"

    # image data is binary, so write the file in binary mode
    with open(image_name, 'wb') as f:
        f.write(img)

In [21]:
### download all of the images
if 1:
    print "If you actually want to download posters, you'll need to turn the `1` above into a `0`. This code doesn't run by default in the notebook so that you don't accidentally download hundreds of images."
else:
    for index, row in curr_df.iterrows():
        if index % 200 == 0:
            print index
            
        movie_id = row["movie_id"]
        poster_path = row["poster_path"]
        
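        try:
            download_web_image(poster_path, movie_id)
        except Exception as e:
            # skip posters that fail to download, but report which ones
            print "Could not download poster for movie", movie_id, ":", e
            continue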

If you actually want to download posters, you'll need to turn the `1` above into a `0`. This code doesn't run by default in the notebook so that you don't accidentally download hundreds of images.


In [32]:
def manipulate_imgs(INPUT_DIR, OUTPUT_DIR, N_ROWS, N_COLS, COLOR="L"):
    """
    Resizes and recolors images and saves them to OUTPUT_DIR.
    COLOR is a PIL mode string, e.g. "L" (grayscale) or "RGB".
    """
    if not os.path.isdir(OUTPUT_DIR):
        os.mkdir(OUTPUT_DIR)

    for img_name in os.listdir(INPUT_DIR):
        # avoid hidden files on mac
        if not img_name.startswith('.'):

            # convert to the requested mode (just a copy if it already matches)
            im = Image.open(os.path.join(INPUT_DIR, img_name)).convert(COLOR)

            # resize img to specified rows and cols; PIL expects (width, height)
            out = im.resize((N_COLS, N_ROWS))

            # save image to the output directory
            out.save(os.path.join(OUTPUT_DIR, img_name))

In [35]:
# turn all images into 32x32 grayscale
manipulate_imgs("final_imgs_folder/all_train_imgs/", "final_imgs_folder/gray_32_32/", 32, 32)

# turn all images into 64x64 grayscale
manipulate_imgs("final_imgs_folder/all_train_imgs/", "final_imgs_folder/gray_64_64/", 64, 64)

# turn all images into 32x32 RGB
manipulate_imgs("final_imgs_folder/all_train_imgs/", "final_imgs_folder/rgb_32_32/", 32, 32, COLOR="RGB")

# turn all images into 64x64 RGB
manipulate_imgs("final_imgs_folder/all_train_imgs/", "final_imgs_folder/rgb_64_64/", 64, 64, COLOR="RGB")

In [36]:
# turn all images into 256x256 grayscale
manipulate_imgs("final_imgs_folder/all_train_imgs/", "final_imgs_folder/gray_256_256/", 256, 256)
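
One detail worth checking when resizing: Pillow's `Image.resize` takes its target size as `(width, height)`, and `convert("L")` produces a single-channel grayscale image. The sketch below uses a blank in-memory image as a stand-in for a poster (the 92x138 size and the variable names are made up for illustration), so it runs without any files on disk.

```python
from PIL import Image

# Blank stand-in for a downloaded poster: 92 px wide, 138 px tall (invented size).
poster = Image.new("RGB", (92, 138))

n_rows, n_cols = 64, 32  # target height and width

# convert to grayscale, then resize; the size tuple is (width, height)
gray = poster.convert("L").resize((n_cols, n_rows))

print(gray.mode)
print(gray.size)
```

Because all the sizes used in this notebook are square, the argument order does not change our outputs, but it would matter for non-square targets.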