In [14]:
import pandas as pd
import numpy as np
import re
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import hamming_loss
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold

# Milestone 3: Traditional statistical and machine learning methods, due Wednesday, April 19, 2017

Think about how you would address the genre prediction problem with traditional statistical or machine learning methods. This includes everything you learned about modeling in this course before the deep learning part. Implement your ideas and compare different classifiers. Report your results and discuss what challenges you faced and how you overcame them. What works and what does not? If there are parts that do not work as expected, make sure to discuss briefly what you think is the cause and how you would address this if you would have more time and resources. 

You do not necessarily need to use the movie posters for this step, but even without a background in computer vision, there are very simple features you can extract from the posters to help guide a traditional machine learning model. Think about the PCA lecture for example, or how to use clustering to extract color information. In addition to considering the movie posters it would be worthwhile to have a look at the metadata that IMDb provides. 

You could use Spark and the [ML library](https://spark.apache.org/docs/latest/ml-features.html#word2vec) to build your model features from the data. This may be especially beneficial if you use additional data, e.g., in text form.

You also need to think about how you are going to evaluate your classifier. Which metrics or scores will you report to show how good the performance is?

The notebook to submit this week should at least include:

- Detailed description and implementation of two different models
- Description of your performance metrics
- Careful performance evaluations for both models
- Visualizations of the metrics for performance evaluation
- Discussion of the differences between the models, their strengths, weaknesses, etc. 
- Discussion of the performances you achieved, and how you might be able to improve them in the future

#### Preliminary Peer Assessment

It is important to provide positive feedback to people who truly worked hard for the good of the team and to also make suggestions to those you perceived not to be working as effectively on team tasks. We ask you to provide an honest assessment of the contributions of the members of your team, including yourself. The feedback you provide should reflect your judgment of each team member’s:

- Preparation – were they prepared during team meetings?
- Contribution – did they contribute productively to the team discussion and work?
- Respect for others’ ideas – did they encourage others to contribute their ideas?
- Flexibility – were they flexible when disagreements occurred?

Your teammate’s assessment of your contributions and the accuracy of your self-assessment will be considered as part of your overall project score.

Preliminary Peer Assessment: [https://goo.gl/forms/WOYC7pwRCSU0yV3l1](https://goo.gl/forms/WOYC7pwRCSU0yV3l1)

## Questions to answer: 

- **What are we predicting exactly?**

So, we are trying to predict movie genres. However, we have that each movie has multiple genres. This leads to the question of how we can predict multiple classifiers for the same object. This more general question is called a multilabel clasification problem. We will explore some of our specifications for this problem below. 

One of the best and most standard solution to do multilable classification is called "one vs. rest" classifiers. These classifiers create n models for each of the n labels. One of the advantages of this model is its interpretability and, for our cases, its ease. We can easily create a pipeline that then does these predictions for us. For an implementation of one vs. all, look at scikit learn: http://scikit-learn.org/dev/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier

We will likely be using this in our early attempts at classification. 

- **What does it means to be succesful? What is our metric for success?**

*adapted from http://people.oregonstate.edu/~sorowerm/pdf/Qual-Multilabel-Shahed-CompleteVersion.pdf*

Here are a few options for our measure of accuracy:

#### Exact Match Ratio
The exact match ratio only considers a correct answer for our multilabel data if it is exactly correct (e.g. if there are three classes, we only classify this as correct if we correctly identify all three classes.) 

#### Accuracy 
Accuracy is a simple way of "goodness of prediction." It is defined as follows 

$$ \frac{1}{n} \sum_i^{n}  \frac{|Y_i\cap Z_i|}{|Y_i \cup Z_i|}$$

Where $$Y_i\cap Z_i $$ refers to the total number of correctly predicted labels over the total number of labels for that instance. So, if for example we predicted [romance, action]  and the true labels were [romance, comedy, horror], this would receive an accuracy of 1/4 because there was one correct prediction and 4 unique labels. 


#### Hamming Loss 
The final and most common form of error for multilable predictions is hamming loss. Hamming loss takes into account both the prediction error (an incorrect error is predicted) and the missing error (a relevant lable is NOT predicted.) this is defined as follows below 

$$ \text{HammingLoss, HL} = \frac{1}{kn} \sum_{i}^{n} \sum_l^k [l \in  Z_i \wedge l \notin Y_i)  + I(l \notin Z_i \wedge  l \in Y_i)]$$

*For this project, we will use the hamming loss, which is defined above.* There is a convenient function in `sklearn` to calculate hamming loss: `sklearn.metrics.hamming_loss`

- What is our first modeling approach? Why? 

- What is our second modeling approach? Why? 

In [2]:
'''
An example of hamming loss. We have true labels:

[0, 1]
[1, 1]

And predicted labels:

[0, 0]
[0, 0]

Hamming loss is .75
'''
hamming_loss(np.array([[0, 1], [1, 1]]), np.zeros((2, 2)))

0.75

### Data Collection & Cleaning

## Decision for dropping
Here we choose to drop the missing data instead of imputing because it is non numerical and avereraging or finding means does not make sense in this scencario

In [46]:
train = pd.read_csv("../data/train.csv")

# drop a rogue column
train.drop("Unnamed: 0", axis = 1, inplace = True)
train = train.dropna(axis=0).copy()
print "Dataframe shape:", train.shape
train.head(1)

Dataframe shape: (537, 29)


Unnamed: 0,10402,10749,10751,10752,12,14,16,18,27,28,...,lead actors,movie_id,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,0,0,1,0,0,0,1,0,0,0,...,"[u'Alec Baldwin', u'Miles Bakshi', u'Jimmy Kim...",295693,A story about how a new baby's arrival impacts...,305.881041,/unPB1iyEeTBcKiLg8W083rlViFH.jpg,2017-03-23,The Boss Baby,False,5.7,510


In [47]:
# check for null values
train.isnull().any()

10402           False
10749           False
10751           False
10752           False
12              False
14              False
16              False
18              False
27              False
28              False
35              False
36              False
37              False
53              False
80              False
878             False
9648            False
adult           False
director        False
lead actors     False
movie_id        False
overview        False
popularity      False
poster_path     False
release_date    False
title           False
video           False
vote_average    False
vote_count      False
dtype: bool

# Model 1: Random Forest

Some thoughts:
    * Random forests don't accept strings, so we'll need to vectorize all of the string variables or exclude them entirely. 

In [48]:
train.columns

Index([u'10402', u'10749', u'10751', u'10752', u'12', u'14', u'16', u'18',
       u'27', u'28', u'35', u'36', u'37', u'53', u'80', u'878', u'9648',
       u'adult', u'director', u'lead actors', u'movie_id', u'overview',
       u'popularity', u'poster_path', u'release_date', u'title', u'video',
       u'vote_average', u'vote_count'],
      dtype='object')

In [49]:
string_cols = ["director", "lead actors", "overview", "title"]

string_matrix = train[string_cols]

In [50]:
# Set up helper cleaner function
def cleaner(cell):
    line = cell.replace('[u', '').replace(']', '').replace(',', '').replace("u'", '').replace("'", '')
    line = re.sub("(^|\W)\d+($|\W)", " ", line)
    return line
string_matrix['lead actors'] = string_matrix['lead actors'].apply(cleaner)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [10]:
# trim trailing and leading spaces
string_matrix = string_matrix.apply(lambda col: col.str.strip())

In [52]:
# returns output in scipi format; we want to get back to panadas
vect = CountVectorizer(ngram_range=(1, 3))
vect_df = sp.hstack(string_matrix.apply(lambda col: vect.fit_transform(col)))

In [58]:
# def _coo_to_sparse_series(A, dense_index=False):
#     """ Convert a scipy.sparse.coo_matrix to a SparseSeries.
#     Use the defaults given in the SparseSeries constructor. """
#     s = pd.Series(A.data, pd.MultiIndex.from_arrays((A.row, A.col)))
#     s = s.sort_index()
#     s = s.to_sparse()  # TODO: specify kind?
#     # ...
#     return s
#_coo_to_sparse_series(vect_df)

In [15]:
labels = train.columns[:17]
features = train.columns[17:]
# X = train[features]
X = train[["popularity", "vote_average", "vote_count"]]

In [16]:
genre_ids_df = pd.read_csv("../data/genre_ids.csv")
genre_ids_df.drop("Unnamed: 0", axis = 1, inplace = True)

In [17]:
for label in labels:
    print genre_ids_df[genre_ids_df["id"] == int(label)]["genre"].item()

Music
Romance
Family
War
Adventure
Fantasy
Animation
Drama
Horror
Action
Comedy
History
Western
Thriller
Crime
Science Fiction
Mystery


Currently, our label matrix has 17 rows, meaning that each row has 17 different labels associated with it. This is a big problem because there are 2^17 different possible combinations for each row, and, unless we have a ton of data, we likely won't see more than 1 or 2 instances of a given row from the label matrix. This will make it difficult for our classifier to learn patterns. 

We should probably combine similar genres to make this prediction task more teneble. 

How should we do this combination?

### Evaluating the Random Forest using KFold CV

In [14]:
h_losses = []

for train_ind, test_ind in KFold(n_splits = 5).split(X):
    X_train, X_test = X.iloc[train_ind], X.iloc[test_ind]
    y_train, y_test = X.iloc[train_ind], X.ilco[test_ind]

    forest = RandomForestClassifier(n_estimators=100, random_state=109)

    # instantiate the classifier (n_jobs = -1 tells it)
    # to fit using all CPUs
    multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)

    # fit the multi-target random forest
    fitted_forest = multi_target_forest.fit(X_train, y_train)

    # predict the label matrix
    preds = fitted_forest.predict(X_test)
    h_losses.append(hamming_loss(y_test, preds))

print np.average(h_losses)

JoblibValueError: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/runpy.py in _run_module_as_main(mod_name='ipykernel.__main__', alter_argv=1)
    169     pkg_name = mod_name.rpartition('.')[0]
    170     main_globals = sys.modules["__main__"].__dict__
    171     if alter_argv:
    172         sys.argv[0] = fname
    173     return _run_code(code, main_globals, None,
--> 174                      "__main__", fname, loader, pkg_name)
        fname = '/Users/lukeheine/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py'
        loader = <pkgutil.ImpLoader instance>
        pkg_name = 'ipykernel'
    175 
    176 def run_module(mod_name, init_globals=None,
    177                run_name=None, alter_sys=False):
    178     """Execute a module's code without importing it

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/runpy.py in _run_code(code=<code object <module> at 0x1007d9330, file "/Use...2.7/site-packages/ipykernel/__main__.py", line 1>, run_globals={'__builtins__': <module '__builtin__' (built-in)>, '__doc__': None, '__file__': '/Users/lukeheine/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py', '__loader__': <pkgutil.ImpLoader instance>, '__name__': '__main__', '__package__': 'ipykernel', 'app': <module 'ipykernel.kernelapp' from '/Users/lukeh...python2.7/site-packages/ipykernel/kernelapp.pyc'>}, init_globals=None, mod_name='__main__', mod_fname='/Users/lukeheine/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py', mod_loader=<pkgutil.ImpLoader instance>, pkg_name='ipykernel')
     67         run_globals.update(init_globals)
     68     run_globals.update(__name__ = mod_name,
     69                        __file__ = mod_fname,
     70                        __loader__ = mod_loader,
     71                        __package__ = pkg_name)
---> 72     exec code in run_globals
        code = <code object <module> at 0x1007d9330, file "/Use...2.7/site-packages/ipykernel/__main__.py", line 1>
        run_globals = {'__builtins__': <module '__builtin__' (built-in)>, '__doc__': None, '__file__': '/Users/lukeheine/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py', '__loader__': <pkgutil.ImpLoader instance>, '__name__': '__main__', '__package__': 'ipykernel', 'app': <module 'ipykernel.kernelapp' from '/Users/lukeh...python2.7/site-packages/ipykernel/kernelapp.pyc'>}
     73     return run_globals
     74 
     75 def _run_module_code(code, init_globals=None,
     76                     mod_name=None, mod_fname=None,

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py in <module>()
      1 
      2 
----> 3 
      4 if __name__ == '__main__':
      5     from ipykernel import kernelapp as app
      6     app.launch_new_instance()
      7 
      8 
      9 
     10 

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/traitlets/config/application.py in launch_instance(cls=<class 'ipykernel.kernelapp.IPKernelApp'>, argv=None, **kwargs={})
    591         
    592         If a global instance already exists, this reinitializes and starts it
    593         """
    594         app = cls.instance(**kwargs)
    595         app.initialize(argv)
--> 596         app.start()
        app.start = <bound method IPKernelApp.start of <ipykernel.kernelapp.IPKernelApp object>>
    597 
    598 #-----------------------------------------------------------------------------
    599 # utility functions, for convenience
    600 #-----------------------------------------------------------------------------

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/ipykernel/kernelapp.py in start(self=<ipykernel.kernelapp.IPKernelApp object>)
    437         
    438         if self.poller is not None:
    439             self.poller.start()
    440         self.kernel.start()
    441         try:
--> 442             ioloop.IOLoop.instance().start()
    443         except KeyboardInterrupt:
    444             pass
    445 
    446 launch_new_instance = IPKernelApp.launch_instance

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/zmq/eventloop/ioloop.py in start(self=<zmq.eventloop.ioloop.ZMQIOLoop object>)
    157             PollIOLoop.configure(ZMQIOLoop)
    158         return PollIOLoop.current(*args, **kwargs)
    159     
    160     def start(self):
    161         try:
--> 162             super(ZMQIOLoop, self).start()
        self.start = <bound method ZMQIOLoop.start of <zmq.eventloop.ioloop.ZMQIOLoop object>>
    163         except ZMQError as e:
    164             if e.errno == ETERM:
    165                 # quietly return on ETERM
    166                 pass

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/tornado/ioloop.py in start(self=<zmq.eventloop.ioloop.ZMQIOLoop object>)
    878                 self._events.update(event_pairs)
    879                 while self._events:
    880                     fd, events = self._events.popitem()
    881                     try:
    882                         fd_obj, handler_func = self._handlers[fd]
--> 883                         handler_func(fd_obj, events)
        handler_func = <function null_wrapper>
        fd_obj = <zmq.sugar.socket.Socket object>
        events = 1
    884                     except (OSError, IOError) as e:
    885                         if errno_from_exception(e) == errno.EPIPE:
    886                             # Happens when the client closes the connection
    887                             pass

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/tornado/stack_context.py in null_wrapper(*args=(<zmq.sugar.socket.Socket object>, 1), **kwargs={})
    270         # Fast path when there are no active contexts.
    271         def null_wrapper(*args, **kwargs):
    272             try:
    273                 current_state = _state.contexts
    274                 _state.contexts = cap_contexts[0]
--> 275                 return fn(*args, **kwargs)
        args = (<zmq.sugar.socket.Socket object>, 1)
        kwargs = {}
    276             finally:
    277                 _state.contexts = current_state
    278         null_wrapper._wrapped = True
    279         return null_wrapper

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py in _handle_events(self=<zmq.eventloop.zmqstream.ZMQStream object>, fd=<zmq.sugar.socket.Socket object>, events=1)
    435             # dispatch events:
    436             if events & IOLoop.ERROR:
    437                 gen_log.error("got POLLERR event on ZMQStream, which doesn't make sense")
    438                 return
    439             if events & IOLoop.READ:
--> 440                 self._handle_recv()
        self._handle_recv = <bound method ZMQStream._handle_recv of <zmq.eventloop.zmqstream.ZMQStream object>>
    441                 if not self.socket:
    442                     return
    443             if events & IOLoop.WRITE:
    444                 self._handle_send()

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py in _handle_recv(self=<zmq.eventloop.zmqstream.ZMQStream object>)
    467                 gen_log.error("RECV Error: %s"%zmq.strerror(e.errno))
    468         else:
    469             if self._recv_callback:
    470                 callback = self._recv_callback
    471                 # self._recv_callback = None
--> 472                 self._run_callback(callback, msg)
        self._run_callback = <bound method ZMQStream._run_callback of <zmq.eventloop.zmqstream.ZMQStream object>>
        callback = <function null_wrapper>
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    473                 
    474         # self.update_state()
    475         
    476 

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py in _run_callback(self=<zmq.eventloop.zmqstream.ZMQStream object>, callback=<function null_wrapper>, *args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    409         close our socket."""
    410         try:
    411             # Use a NullContext to ensure that all StackContexts are run
    412             # inside our blanket exception handler rather than outside.
    413             with stack_context.NullContext():
--> 414                 callback(*args, **kwargs)
        callback = <function null_wrapper>
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    415         except:
    416             gen_log.error("Uncaught exception, closing connection.",
    417                           exc_info=True)
    418             # Close the socket on an uncaught exception from a user callback

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/tornado/stack_context.py in null_wrapper(*args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    270         # Fast path when there are no active contexts.
    271         def null_wrapper(*args, **kwargs):
    272             try:
    273                 current_state = _state.contexts
    274                 _state.contexts = cap_contexts[0]
--> 275                 return fn(*args, **kwargs)
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    276             finally:
    277                 _state.contexts = current_state
    278         null_wrapper._wrapped = True
    279         return null_wrapper

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/ipykernel/kernelbase.py in dispatcher(msg=[<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>])
    271         if self.control_stream:
    272             self.control_stream.on_recv(self.dispatch_control, copy=False)
    273 
    274         def make_dispatcher(stream):
    275             def dispatcher(msg):
--> 276                 return self.dispatch_shell(stream, msg)
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    277             return dispatcher
    278 
    279         for s in self.shell_streams:
    280             s.on_recv(make_dispatcher(s), copy=False)

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/ipykernel/kernelbase.py in dispatch_shell(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, msg={'buffers': [], 'content': {u'allow_stdin': True, u'code': u'h_losses = []\n\nfor train_ind, test_ind in KF...ss(y_test, preds))\n\nprint np.average(h_losses)', u'silent': False, u'stop_on_error': True, u'store_history': True, u'user_expressions': {}}, 'header': {'date': '2017-04-17T18:44:03.266246', u'msg_id': u'5CE6A889B5A140E68EC535D5185F09B6', u'msg_type': u'execute_request', u'session': u'871CD4C125F048C78506D6B7133B8D61', u'username': u'username', u'version': u'5.0'}, 'metadata': {}, 'msg_id': u'5CE6A889B5A140E68EC535D5185F09B6', 'msg_type': u'execute_request', 'parent_header': {}})
    223             self.log.error("UNKNOWN MESSAGE TYPE: %r", msg_type)
    224         else:
    225             self.log.debug("%s: %s", msg_type, msg)
    226             self.pre_handler_hook()
    227             try:
--> 228                 handler(stream, idents, msg)
        handler = <bound method IPythonKernel.execute_request of <ipykernel.ipkernel.IPythonKernel object>>
        stream = <zmq.eventloop.zmqstream.ZMQStream object>
        idents = ['871CD4C125F048C78506D6B7133B8D61']
        msg = {'buffers': [], 'content': {u'allow_stdin': True, u'code': u'h_losses = []\n\nfor train_ind, test_ind in KF...ss(y_test, preds))\n\nprint np.average(h_losses)', u'silent': False, u'stop_on_error': True, u'store_history': True, u'user_expressions': {}}, 'header': {'date': '2017-04-17T18:44:03.266246', u'msg_id': u'5CE6A889B5A140E68EC535D5185F09B6', u'msg_type': u'execute_request', u'session': u'871CD4C125F048C78506D6B7133B8D61', u'username': u'username', u'version': u'5.0'}, 'metadata': {}, 'msg_id': u'5CE6A889B5A140E68EC535D5185F09B6', 'msg_type': u'execute_request', 'parent_header': {}}
    229             except Exception:
    230                 self.log.error("Exception in message handler:", exc_info=True)
    231             finally:
    232                 self.post_handler_hook()

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/ipykernel/kernelbase.py in execute_request(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, ident=['871CD4C125F048C78506D6B7133B8D61'], parent={'buffers': [], 'content': {u'allow_stdin': True, u'code': u'h_losses = []\n\nfor train_ind, test_ind in KF...ss(y_test, preds))\n\nprint np.average(h_losses)', u'silent': False, u'stop_on_error': True, u'store_history': True, u'user_expressions': {}}, 'header': {'date': '2017-04-17T18:44:03.266246', u'msg_id': u'5CE6A889B5A140E68EC535D5185F09B6', u'msg_type': u'execute_request', u'session': u'871CD4C125F048C78506D6B7133B8D61', u'username': u'username', u'version': u'5.0'}, 'metadata': {}, 'msg_id': u'5CE6A889B5A140E68EC535D5185F09B6', 'msg_type': u'execute_request', 'parent_header': {}})
    386         if not silent:
    387             self.execution_count += 1
    388             self._publish_execute_input(code, parent, self.execution_count)
    389 
    390         reply_content = self.do_execute(code, silent, store_history,
--> 391                                         user_expressions, allow_stdin)
        user_expressions = {}
        allow_stdin = True
    392 
    393         # Flush output before sending the reply.
    394         sys.stdout.flush()
    395         sys.stderr.flush()

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/ipykernel/ipkernel.py in do_execute(self=<ipykernel.ipkernel.IPythonKernel object>, code=u'h_losses = []\n\nfor train_ind, test_ind in KF...ss(y_test, preds))\n\nprint np.average(h_losses)', silent=False, store_history=True, user_expressions={}, allow_stdin=True)
    194 
    195         reply_content = {}
    196         # FIXME: the shell calls the exception handler itself.
    197         shell._reply_content = None
    198         try:
--> 199             shell.run_cell(code, store_history=store_history, silent=silent)
        shell.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = u'h_losses = []\n\nfor train_ind, test_ind in KF...ss(y_test, preds))\n\nprint np.average(h_losses)'
        store_history = True
        silent = False
    200         except:
    201             status = u'error'
    202             # FIXME: this code right now isn't being used yet by default,
    203             # because the run_cell() call above directly fires off exception

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, raw_cell=u'h_losses = []\n\nfor train_ind, test_ind in KF...ss(y_test, preds))\n\nprint np.average(h_losses)', store_history=True, silent=False, shell_futures=True)
   2718                 self.displayhook.exec_result = result
   2719 
   2720                 # Execute the user code
   2721                 interactivity = "none" if silent else self.ast_node_interactivity
   2722                 self.run_ast_nodes(code_ast.body, cell_name,
-> 2723                    interactivity=interactivity, compiler=compiler, result=result)
        interactivity = 'last_expr'
        compiler = <IPython.core.compilerop.CachingCompiler instance>
   2724 
   2725                 # Reset this so later displayed values do not modify the
   2726                 # ExecutionResult
   2727                 self.displayhook.exec_result = None

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py in run_ast_nodes(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, nodelist=[<_ast.Assign object>, <_ast.For object>, <_ast.Print object>], cell_name='<ipython-input-14-24a9d15aca99>', interactivity='none', compiler=<IPython.core.compilerop.CachingCompiler instance>, result=<IPython.core.interactiveshell.ExecutionResult object>)
   2820 
   2821         try:
   2822             for i, node in enumerate(to_run_exec):
   2823                 mod = ast.Module([node])
   2824                 code = compiler(mod, cell_name, "exec")
-> 2825                 if self.run_code(code, result):
        self.run_code = <bound method ZMQInteractiveShell.run_code of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = <code object <module> at 0x10ec92730, file "<ipython-input-14-24a9d15aca99>", line 3>
        result = <IPython.core.interactiveshell.ExecutionResult object>
   2826                     return True
   2827 
   2828             for i, node in enumerate(to_run_interactive):
   2829                 mod = ast.Interactive([node])

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py in run_code(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, code_obj=<code object <module> at 0x10ec92730, file "<ipython-input-14-24a9d15aca99>", line 3>, result=<IPython.core.interactiveshell.ExecutionResult object>)
   2880         outflag = 1  # happens in more places, so it's easier as default
   2881         try:
   2882             try:
   2883                 self.hooks.pre_run_code_hook()
   2884                 #rprint('Running code', repr(code_obj)) # dbg
-> 2885                 exec(code_obj, self.user_global_ns, self.user_ns)
        code_obj = <code object <module> at 0x10ec92730, file "<ipython-input-14-24a9d15aca99>", line 3>
        self.user_global_ns = {'In': ['', u'import pandas as pd\nimport numpy as np\nimpor...zer\n\nfrom sklearn.model_selection import KFold', u"'''\nAn example of hamming loss. We have true ...ss(np.array([[0, 1], [1, 1]]), np.zeros((2, 2)))", u'train = pd.read_csv("../data/train.csv")\n\n# ...t "Dataframe shape:", train.shape\ntrain.head(1)', u'# check for null values\ntrain.isnull().sum()', u'train.columns', u'string_cols = ["director", "lead actors", "ove...w", "title"]\nstring_matrix = train[string_cols]', u'# Set up helper cleaner function\ndef cleaner(... = string_matrix[\'lead actors\'].apply(cleaner)', u'labels = train.columns[:17]\nfeatures = train....in[["popularity", "vote_average", "vote_count"]]', u'string_matrix', u'labels = train.columns[:17]\nfeatures = train....in[["popularity", "vote_average", "vote_count"]]', u'genre_ids_df = pd.read_csv("../data/genre_ids...._df.drop("Unnamed: 0", axis = 1, inplace = True)', u'for label in labels:\n    print genre_ids_df[genre_ids_df["id"] == int(label)]["genre"].item()', u'h_losses = []\n\nfor train_ind, test_ind in KF...ss(y_test, preds))\n\nprint np.average(h_losses)', u'h_losses = []\n\nfor train_ind, test_ind in KF...ss(y_test, preds))\n\nprint np.average(h_losses)'], 'KFold': <class 'sklearn.model_selection._split.KFold'>, 'MultiOutputClassifier': <class 'sklearn.multioutput.MultiOutputClassifier'>, 'OneVsRestClassifier': <class 'sklearn.multiclass.OneVsRestClassifier'>, 'Out': {2: 0.75, 3:    10402  10749  10751  10752  12  14  16  18  2...          5.7        510  

[1 rows x 29 columns], 4: 10402           0
10749           0
10751       ...
vote_average    0
vote_count      0
dtype: int64, 5: Index([u'10402', u'10749', u'10751', u'10752', u...e_average', u'vote_count'],
      dtype='object'), 9:                     director  \
0               ...         Going in Style  

[540 rows x 4 columns]}, 'RandomForestClassifier': <class 'sklearn.ensemble.forest.RandomForestClassifier'>, 'TfidfVectorizer': <class 'sklearn.feature_extraction.text.TfidfVectorizer'>, 'X':      popularity  vote_average  vote_count
0    3...          5.9          25

[540 rows x 3 columns], 'X_test':      popularity  vote_average  vote_count
0    3...          6.5         290

[108 rows x 3 columns], 'X_train':      popularity  vote_average  vote_count
108   ...          5.9          25

[432 rows x 3 columns], ...}
        self.user_ns = {'In': ['', u'import pandas as pd\nimport numpy as np\nimpor...zer\n\nfrom sklearn.model_selection import KFold', u"'''\nAn example of hamming loss. We have true ...ss(np.array([[0, 1], [1, 1]]), np.zeros((2, 2)))", u'train = pd.read_csv("../data/train.csv")\n\n# ...t "Dataframe shape:", train.shape\ntrain.head(1)', u'# check for null values\ntrain.isnull().sum()', u'train.columns', u'string_cols = ["director", "lead actors", "ove...w", "title"]\nstring_matrix = train[string_cols]', u'# Set up helper cleaner function\ndef cleaner(... = string_matrix[\'lead actors\'].apply(cleaner)', u'labels = train.columns[:17]\nfeatures = train....in[["popularity", "vote_average", "vote_count"]]', u'string_matrix', u'labels = train.columns[:17]\nfeatures = train....in[["popularity", "vote_average", "vote_count"]]', u'genre_ids_df = pd.read_csv("../data/genre_ids...._df.drop("Unnamed: 0", axis = 1, inplace = True)', u'for label in labels:\n    print genre_ids_df[genre_ids_df["id"] == int(label)]["genre"].item()', u'h_losses = []\n\nfor train_ind, test_ind in KF...ss(y_test, preds))\n\nprint np.average(h_losses)', u'h_losses = []\n\nfor train_ind, test_ind in KF...ss(y_test, preds))\n\nprint np.average(h_losses)'], 'KFold': <class 'sklearn.model_selection._split.KFold'>, 'MultiOutputClassifier': <class 'sklearn.multioutput.MultiOutputClassifier'>, 'OneVsRestClassifier': <class 'sklearn.multiclass.OneVsRestClassifier'>, 'Out': {2: 0.75, 3:    10402  10749  10751  10752  12  14  16  18  2...          5.7        510  

[1 rows x 29 columns], 4: 10402           0
10749           0
10751       ...
vote_average    0
vote_count      0
dtype: int64, 5: Index([u'10402', u'10749', u'10751', u'10752', u...e_average', u'vote_count'],
      dtype='object'), 9:                     director  \
0               ...         Going in Style  

[540 rows x 4 columns]}, 'RandomForestClassifier': <class 'sklearn.ensemble.forest.RandomForestClassifier'>, 'TfidfVectorizer': <class 'sklearn.feature_extraction.text.TfidfVectorizer'>, 'X':      popularity  vote_average  vote_count
0    3...          5.9          25

[540 rows x 3 columns], 'X_test':      popularity  vote_average  vote_count
0    3...          6.5         290

[108 rows x 3 columns], 'X_train':      popularity  vote_average  vote_count
108   ...          5.9          25

[432 rows x 3 columns], ...}
   2886             finally:
   2887                 # Reset our crash handler in place
   2888                 sys.excepthook = old_excepthook
   2889         except SystemExit as e:

...........................................................................
/Users/lukeheine/Desktop/milestone_1/cs109b_final/milestone_3/<ipython-input-14-24a9d15aca99> in <module>()
      9     # instantiate the classifier (n_jobs = -1 tells it)
     10     # to fit using all CPUs
     11     multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
     12 
     13     # fit the multi-target random forest
---> 14     fitted_forest = multi_target_forest.fit(X_train, y_train)
     15 
     16     # predict the label matrix
     17     preds = fitted_forest.predict(X_test)
     18     h_losses.append(hamming_loss(y_test, preds))

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/sklearn/multioutput.py in fit(self=MultiOutputClassifier(estimator=RandomForestClas...rbose=0, warm_start=False),
           n_jobs=-1), X=array([[  9.06732600e+00,   8.40000000e+00,   4....093400e+00,   5.90000000e+00,   2.50000000e+01]]), y=array([[  9.06732600e+00,   8.40000000e+00,   4....093400e+00,   5.90000000e+00,   2.50000000e+01]]), sample_weight=None)
     82                 not has_fit_parameter(self.estimator, 'sample_weight')):
     83             raise ValueError("Underlying regressor does not support"
     84                              " sample weights.")
     85 
     86         self.estimators_ = Parallel(n_jobs=self.n_jobs)(delayed(_fit_estimator)(
---> 87             self.estimator, X, y[:, i], sample_weight) for i in range(y.shape[1]))
        self.estimator = RandomForestClassifier(bootstrap=True, class_wei...ate=109,
            verbose=0, warm_start=False)
        X = array([[  9.06732600e+00,   8.40000000e+00,   4....093400e+00,   5.90000000e+00,   2.50000000e+01]])
        y = array([[  9.06732600e+00,   8.40000000e+00,   4....093400e+00,   5.90000000e+00,   2.50000000e+01]])
        sample_weight = None
        y.shape = (432, 3)
     88         return self
     89 
     90     def predict(self, X):
     91         """Predict multi-output variable using a model

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=-1), iterable=<generator object <genexpr>>)
    763             if pre_dispatch == "all" or n_jobs == 1:
    764                 # The iterable was consumed all at once by the above for loop.
    765                 # No need to wait for async callbacks to trigger to
    766                 # consumption.
    767                 self._iterating = False
--> 768             self.retrieve()
        self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=-1)>
    769             # Make sure that we get a last message telling us we are done
    770             elapsed_time = time.time() - self._start_time
    771             self._print('Done %3i out of %3i | elapsed: %s finished',
    772                         (len(self._output), len(self._output),

---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
ValueError                                         Mon Apr 17 18:44:03 2017
PID: 45912              Python 2.7.12: /Users/lukeheine/anaconda/bin/python
...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function _fit_estimator>
        args = (RandomForestClassifier(bootstrap=True, class_wei...ate=109,
            verbose=0, warm_start=False), array([[  9.06732600e+00,   8.40000000e+00,   4....093400e+00,   5.90000000e+00,   2.50000000e+01]]), array([ 9.067326,  9.017647,  9.007006,  9.00585...319,  4.648093,  4.648079,  4.643147,  4.640934]), None)
        kwargs = {}
        self.items = [(<function _fit_estimator>, (RandomForestClassifier(bootstrap=True, class_wei...ate=109,
            verbose=0, warm_start=False), array([[  9.06732600e+00,   8.40000000e+00,   4....093400e+00,   5.90000000e+00,   2.50000000e+01]]), array([ 9.067326,  9.017647,  9.007006,  9.00585...319,  4.648093,  4.648079,  4.643147,  4.640934]), None), {})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/sklearn/multioutput.py in _fit_estimator(estimator=RandomForestClassifier(bootstrap=True, class_wei...ate=109,
            verbose=0, warm_start=False), X=array([[  9.06732600e+00,   8.40000000e+00,   4....093400e+00,   5.90000000e+00,   2.50000000e+01]]), y=array([ 9.067326,  9.017647,  9.007006,  9.00585...319,  4.648093,  4.648079,  4.643147,  4.640934]), sample_weight=None)
     31 def _fit_estimator(estimator, X, y, sample_weight=None):
     32     estimator = clone(estimator)
     33     if sample_weight is not None:
     34         estimator.fit(X, y, sample_weight=sample_weight)
     35     else:
---> 36         estimator.fit(X, y)
        estimator.fit = <bound method RandomForestClassifier.fit of Rand...te=109,
            verbose=0, warm_start=False)>
        X = array([[  9.06732600e+00,   8.40000000e+00,   4....093400e+00,   5.90000000e+00,   2.50000000e+01]])
        y = array([ 9.067326,  9.017647,  9.007006,  9.00585...319,  4.648093,  4.648079,  4.643147,  4.640934])
     37     return estimator
     38 
     39 
     40 class MultiOutputEstimator(six.with_metaclass(ABCMeta, BaseEstimator)):

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/sklearn/ensemble/forest.py in fit(self=RandomForestClassifier(bootstrap=True, class_wei...ate=109,
            verbose=0, warm_start=False), X=array([[  9.06732559e+00,   8.39999962e+00,   4.....90000010e+00,   2.50000000e+01]], dtype=float32), y=array([[ 9.067326],
       [ 9.017647],
       [...648079],
       [ 4.643147],
       [ 4.640934]]), sample_weight=None)
    266             # [:, np.newaxis] that does not.
    267             y = np.reshape(y, (-1, 1))
    268 
    269         self.n_outputs_ = y.shape[1]
    270 
--> 271         y, expanded_class_weight = self._validate_y_class_weight(y)
        y = array([[ 9.067326],
       [ 9.017647],
       [...648079],
       [ 4.643147],
       [ 4.640934]])
        expanded_class_weight = undefined
        self._validate_y_class_weight = <bound method RandomForestClassifier._validate_y...te=109,
            verbose=0, warm_start=False)>
    272 
    273         if getattr(y, "dtype", None) != DOUBLE or not y.flags.contiguous:
    274             y = np.ascontiguousarray(y, dtype=DOUBLE)
    275 

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/sklearn/ensemble/forest.py in _validate_y_class_weight(self=RandomForestClassifier(bootstrap=True, class_wei...ate=109,
            verbose=0, warm_start=False), y=array([[ 9.067326],
       [ 9.017647],
       [...648079],
       [ 4.643147],
       [ 4.640934]]))
    452             self.oob_decision_function_ = oob_decision_function
    453 
    454         self.oob_score_ = oob_score / self.n_outputs_
    455 
    456     def _validate_y_class_weight(self, y):
--> 457         check_classification_targets(y)
        y = array([[ 9.067326],
       [ 9.017647],
       [...648079],
       [ 4.643147],
       [ 4.640934]])
    458 
    459         y = np.copy(y)
    460         expanded_class_weight = None
    461 

...........................................................................
/Users/lukeheine/anaconda/lib/python2.7/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y=array([[ 9.067326],
       [ 9.017647],
       [...648079],
       [ 4.643147],
       [ 4.640934]]))
    167     y : array-like
    168     """
    169     y_type = type_of_target(y)
    170     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    171             'multilabel-indicator', 'multilabel-sequences']:
--> 172         raise ValueError("Unknown label type: %r" % y_type)
        y_type = 'continuous'
    173 
    174 
    175 
    176 def type_of_target(y):

ValueError: Unknown label type: 'continuous'
___________________________________________________________________________