In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import pandas as pd
import os

# Genre

In this notebook I reduce each movie's list of genres to a single genre that best represents the movie.

In [5]:
prefix = '/data/kylevigil/'

## Opening

I start by reading in the genres table provided by the MovieLens data set, then group its rows by movie ID so that each movie has a single list of all of its genres.

In [7]:
genres = pd.read_csv(prefix + 'movie_genres.dat', delimiter = '\t')
genres.genre = genres.genre.astype('category')
genres.genre.value_counts()

Drama          5076
Comedy         3566
Thriller       1664
Romance        1644
Action         1445
Crime          1086
Adventure      1003
Horror          978
Sci-Fi          740
Fantasy         535
Children        519
Mystery         497
War             494
Documentary     430
Musical         421
Animation       279
Western         261
Film-Noir       145
IMAX             25
Short             1
dtype: int64

In [8]:
genres = genres.groupby('movieID').agg(lambda x: x.tolist())

In [12]:
genres.head()

movieID,genre
1,"[Adventure, Animation, Children, Comedy, Fantasy]"
2,"[Adventure, Children, Fantasy]"
3,"[Comedy, Romance]"
4,"[Comedy, Drama, Romance]"
5,[Comedy]
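The grouping step above can be illustrated on a tiny table. This is only a sketch: the `toy` frame and its values are made up for illustration and are not taken from the MovieLens file, and `agg(list)` is used here as an equivalent of the `tolist()` lambda.

```python
import pandas as pd

# Toy stand-in for movie_genres.dat (hypothetical values, one row per movie/genre pair)
toy = pd.DataFrame({
    'movieID': [1, 1, 2, 3, 3, 3],
    'genre': ['Comedy', 'Romance', 'Horror', 'Action', 'Drama', 'Horror'],
})

# Collapse to one row per movie, collecting its genres into a Python list
toy_grouped = toy.groupby('movieID')['genre'].agg(list).to_frame()
print(toy_grouped)
```

Each movie ends up with one row whose `genre` cell is a list, which is the shape the processing step relies on.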


## Processing

Now I process each genre list and assign the movie its 'real genre'.

In [20]:
genres['realGenre'] = 'None'

Because there is so much overlap between genres, I have chosen to model only the movies tagged Drama, Comedy, Romance, Action, or Horror. The order in which I assign the realGenre value matters: each later assignment overwrites the earlier ones, so Horror takes priority over Action, Action over Romance, Romance over Comedy, and Comedy over Drama. For example, a movie tagged both Drama and Horror is classified as Horror. I feel this order best preserves the true genre of the movie. Movies with none of these five genres keep the value 'None'.

In [21]:
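# Later assignments overwrite earlier ones, so Horror has the highest priority.
# .loc avoids chained assignment, which may fail to write back to genres.
genres.loc[['Drama' in i for i in genres.genre], 'realGenre'] = 'Drama'
genres.loc[['Comedy' in i for i in genres.genre], 'realGenre'] = 'Comedy'
genres.loc[['Romance' in i for i in genres.genre], 'realGenre'] = 'Romance'
genres.loc[['Action' in i for i in genres.genre], 'realGenre'] = 'Action'
genres.loc[['Horror' in i for i in genres.genre], 'realGenre'] = 'Horror'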

In [22]:
genres.realGenre.value_counts()

Drama      2853
Comedy     2324
Romance    1525
Action     1340
None       1177
Horror      978
Name: realGenre, dtype: int64
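The priority rule can also be written as a small standalone function, which makes the ordering easy to check on individual genre lists. This is a sketch rather than the code used in this notebook; `PRIORITY` and `pick_real_genre` are hypothetical names introduced for illustration.

```python
# Lowest to highest priority, mirroring the order of the assignments above
PRIORITY = ['Drama', 'Comedy', 'Romance', 'Action', 'Horror']

def pick_real_genre(tags, priority=PRIORITY):
    """Return the highest-priority genre present in tags, or 'None' if none match."""
    real = 'None'
    for g in priority:
        if g in tags:
            real = g  # a later (higher-priority) match overwrites an earlier one
    return real

print(pick_real_genre(['Drama', 'Horror']))             # Horror
print(pick_real_genre(['Comedy', 'Drama', 'Romance']))  # Romance
print(pick_real_genre(['Documentary']))                 # None
```

Applied to the grouped table, `genres.genre.apply(pick_real_genre)` should give the same labels as the five assignments, assuming every cell in `genre` is a list of strings.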

## Saving

I save the realGenre column, indexed by movie ID, to a file called genres.pkl.

In [23]:
toSave = genres[['realGenre']]

In [24]:
toSave.to_pickle(prefix + 'genres.pkl')
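To confirm that a pickled frame comes back unchanged, a round trip can be sketched as follows. The frame and the temporary directory are stand-ins chosen for illustration; the real file lives under `prefix`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame shaped like toSave (movieID index, realGenre column)
toy = pd.DataFrame({'realGenre': ['Comedy', 'Horror']},
                   index=pd.Index([1, 2], name='movieID'))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'genres.pkl')
    toy.to_pickle(path)               # write the frame to disk
    restored = pd.read_pickle(path)   # load it back while the file still exists

print(restored.equals(toy))
```

A later notebook can load the saved labels the same way with `pd.read_pickle(prefix + 'genres.pkl')`.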