In [77]:
import pandas as pd
import numpy as np
import re
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import hamming_loss
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold
from sklearn.cross_validation import KFold, train_test_split
from sklearn.linear_model import Lasso, Ridge, SGDRegressor
import scipy.sparse

# Milestone 3: Traditional statistical and machine learning methods, due Wednesday, April 19, 2017

Think about how you would address the genre prediction problem with traditional statistical or machine learning methods. This includes everything you learned about modeling in this course before the deep learning part. Implement your ideas and compare different classifiers. Report your results and discuss what challenges you faced and how you overcame them. What works and what does not? If there are parts that do not work as expected, make sure to discuss briefly what you think is the cause and how you would address this if you would have more time and resources. 

You do not necessarily need to use the movie posters for this step, but even without a background in computer vision, there are very simple features you can extract from the posters to help guide a traditional machine learning model. Think about the PCA lecture for example, or how to use clustering to extract color information. In addition to considering the movie posters it would be worthwhile to have a look at the metadata that IMDb provides. 

You could use Spark and the [ML library](https://spark.apache.org/docs/latest/ml-features.html#word2vec) to build your model features from the data. This may be especially beneficial if you use additional data, e.g., in text form.

You also need to think about how you are going to evaluate your classifier. Which metrics or scores will you report to show how good the performance is?

The notebook to submit this week should at least include:

- Detailed description and implementation of two different models
- Description of your performance metrics
- Careful performance evaluations for both models
- Visualizations of the metrics for performance evaluation
- Discussion of the differences between the models, their strengths, weaknesses, etc. 
- Discussion of the performances you achieved, and how you might be able to improve them in the future

#### Preliminary Peer Assessment

It is important to provide positive feedback to people who truly worked hard for the good of the team and to also make suggestions to those you perceived not to be working as effectively on team tasks. We ask you to provide an honest assessment of the contributions of the members of your team, including yourself. The feedback you provide should reflect your judgment of each team member’s:

- Preparation – were they prepared during team meetings?
- Contribution – did they contribute productively to the team discussion and work?
- Respect for others’ ideas – did they encourage others to contribute their ideas?
- Flexibility – were they flexible when disagreements occurred?

Your teammate’s assessment of your contributions and the accuracy of your self-assessment will be considered as part of your overall project score.

Preliminary Peer Assessment: [https://goo.gl/forms/WOYC7pwRCSU0yV3l1](https://goo.gl/forms/WOYC7pwRCSU0yV3l1)

## Questions to answer: 

- **What are we predicting exactly?**

So, we are trying to predict movie genres. However, we have that each movie has multiple genres. This leads to the question of how we can predict multiple classifiers for the same object. This more general question is called a multilabel clasification problem. We will explore some of our specifications for this problem below. 

One of the best and most standard solution to do multilable classification is called "one vs. rest" classifiers. These classifiers create n models for each of the n labels. One of the advantages of this model is its interpretability and, for our cases, its ease. We can easily create a pipeline that then does these predictions for us. For an implementation of one vs. all, look at scikit learn: http://scikit-learn.org/dev/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier

We will likely be using this in our early attempts at classification. 

- **What does it means to be succesful? What is our metric for success?**

*adapted from http://people.oregonstate.edu/~sorowerm/pdf/Qual-Multilabel-Shahed-CompleteVersion.pdf*

Here are a few options for our measure of accuracy:

#### Exact Match Ratio
The exact match ratio only considers a correct answer for our multilabel data if it is exactly correct (e.g. if there are three classes, we only classify this as correct if we correctly identify all three classes.) 

#### Accuracy 
Accuracy is a simple way of "goodness of prediction." It is defined as follows 

$$ \frac{1}{n} \sum_i^{n}  \frac{|Y_i\cap Z_i|}{|Y_i \cup Z_i|}$$

Where $$Y_i\cap Z_i $$ refers to the total number of correctly predicted labels over the total number of labels for that instance. So, if for example we predicted [romance, action]  and the true labels were [romance, comedy, horror], this would receive an accuracy of 1/4 because there was one correct prediction and 4 unique labels. 


#### Hamming Loss 
The final and most common form of error for multilable predictions is hamming loss. Hamming loss takes into account both the prediction error (an incorrect error is predicted) and the missing error (a relevant lable is NOT predicted.) this is defined as follows below 

$$ \text{HammingLoss, HL} = \frac{1}{kn} \sum_{i}^{n} \sum_l^k [l \in  Z_i \wedge l \notin Y_i)  + I(l \notin Z_i \wedge  l \in Y_i)]$$

*For this project, we will use the hamming loss, which is defined above.* There is a convenient function in `sklearn` to calculate hamming loss: `sklearn.metrics.hamming_loss`

- What is our first modeling approach? Why? 

- What is our second modeling approach? Why? 

In [3]:
'''
An example of hamming loss. We have true labels:

[0, 1]
[1, 1]

And predicted labels:

[0, 0]
[0, 0]

Hamming loss is .75
'''
hamming_loss(np.array([[0, 1], [1, 1]]), np.zeros((2, 2)))

0.75

### Data Collection & Cleaning

## Decision for dropping
Here we choose to drop the missing data instead of imputing because it is non numerical and avereraging or finding means does not make sense in this scencario

In [5]:
train = pd.read_csv("../data/train.csv")

# drop a rogue column
train.drop("Unnamed: 0", axis = 1, inplace = True)
train = train.dropna(axis=0).copy()
print "Dataframe shape:", train.shape
train.head(1)

Dataframe shape: (537, 29)


Unnamed: 0,10402,10749,10751,10752,12,14,16,18,27,28,...,lead actors,movie_id,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,0,0,1,0,0,0,1,0,0,0,...,"[u'Alec Baldwin', u'Miles Bakshi', u'Jimmy Kim...",295693,A story about how a new baby's arrival impacts...,305.881041,/unPB1iyEeTBcKiLg8W083rlViFH.jpg,2017-03-23,The Boss Baby,False,5.7,510


In [6]:
# check for null values
train.isnull().any()

10402           False
10749           False
10751           False
10752           False
12              False
14              False
16              False
18              False
27              False
28              False
35              False
36              False
37              False
53              False
80              False
878             False
9648            False
adult           False
director        False
lead actors     False
movie_id        False
overview        False
popularity      False
poster_path     False
release_date    False
title           False
video           False
vote_average    False
vote_count      False
dtype: bool

In [8]:
train.shape

(537, 29)

# Model 1: Random Forest

Some thoughts:
    * Random forests don't accept strings, so we'll need to vectorize all of the string variables or exclude them entirely. 

In [9]:
train.columns

Index([u'10402', u'10749', u'10751', u'10752', u'12', u'14', u'16', u'18',
       u'27', u'28', u'35', u'36', u'37', u'53', u'80', u'878', u'9648',
       u'adult', u'director', u'lead actors', u'movie_id', u'overview',
       u'popularity', u'poster_path', u'release_date', u'title', u'video',
       u'vote_average', u'vote_count'],
      dtype='object')

In [10]:
string_cols = ["director", "lead actors", "overview", "title"]

string_matrix = train[string_cols]

In [11]:
# Set up helper cleaner function
def cleaner(cell):
    line = cell.replace('[u', '').replace(']', '').replace(',', '').replace("u'", '').replace("'", '')
    line = re.sub("(^|\W)\d+($|\W)", " ", line)
    return line
string_matrix['lead actors'] = string_matrix['lead actors'].apply(cleaner)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [12]:
# trim trailing and leading spaces
string_matrix = string_matrix.apply(lambda col: col.str.strip())

In [13]:
# returns output in scipi format; we want to get back to panadas
vect = CountVectorizer(ngram_range=(1, 3))
vect_df = sp.hstack(string_matrix.apply(lambda col: vect.fit_transform(col)))

In [14]:
# def _coo_to_sparse_series(A, dense_index=False):
#     """ Convert a scipy.sparse.coo_matrix to a SparseSeries.
#     Use the defaults given in the SparseSeries constructor. """
#     s = pd.Series(A.data, pd.MultiIndex.from_arrays((A.row, A.col)))
#     s = s.sort_index()
#     s = s.to_sparse()  # TODO: specify kind?
#     # ...
#     return s
#_coo_to_sparse_series(vect_df)

In [15]:
labels = train.columns[:17]
features = train.columns[17:]
# X = train[features]
X = train[["popularity", "vote_average", "vote_count"]]

In [16]:
genre_ids_df = pd.read_csv("../data/genre_ids.csv")
genre_ids_df.drop("Unnamed: 0", axis = 1, inplace = True)

In [17]:
for label in labels:
    print genre_ids_df[genre_ids_df["id"] == int(label)]["genre"].item()

Music
Romance
Family
War
Adventure
Fantasy
Animation
Drama
Horror
Action
Comedy
History
Western
Thriller
Crime
Science Fiction
Mystery


Currently, our label matrix has 17 rows, meaning that each row has 17 different labels associated with it. This is a big problem because there are 2^17 different possible combinations for each row, and, unless we have a ton of data, we likely won't see more than 1 or 2 instances of a given row from the label matrix. This will make it difficult for our classifier to learn patterns. 

We should probably combine similar genres to make this prediction task more teneble. 

How should we do this combination?

### Evaluating the Random Forest using KFold CV

In [18]:
h_losses = []

for train_ind, test_ind in KFold(n_splits = 5).split(X):
    X_train, X_test = X.iloc[train_ind], X.iloc[test_ind]
    y_train, y_test = X.iloc[train_ind], X.ilco[test_ind]

    forest = RandomForestClassifier(n_estimators=100, random_state=109)

    # instantiate the classifier (n_jobs = -1 tells it)
    # to fit using all CPUs
    multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)

    # fit the multi-target random forest
    fitted_forest = multi_target_forest.fit(X_train, y_train)

    # predict the label matrix
    preds = fitted_forest.predict(X_test)
    h_losses.append(hamming_loss(y_test, preds))

print np.average(h_losses)

AttributeError: 'DataFrame' object has no attribute 'ilco'

# Strategy 2: Using the overview text to make predictions 
In this analytical strategy, we attempt to vectorize our overviews of our data and make predictions purely with the overview and no other metadata. 

In [131]:
### Get rid of any non-values 
#train =  train.dropna()

train = train.dropna(axis=0).copy()

In [133]:
vectorizer = TfidfVectorizer(stop_words="english", min_df=4, decode_error="ignore", ngram_range=(1, 1))
corpus = train["overview"].values

In [134]:
overview_vector = vectorizer.fit_transform(corpus)
overview_vector = overview_vector.toarray()

In [135]:
### So we have a lot more predictors than we have movies to predict on, likely we need to change this 
# or do some form of regularization 
overview_vector.shape

(537, 831)

In [136]:
# let's work with Pandas dataframes to make things easier
df_feature = pd.DataFrame(overview_vector)

In [137]:
#df_feature["10402"] = train["10402"]

#df_feature["10402"] =
df_feature.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,821,822,823,824,825,826,827,828,829,830
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.319646,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.194699
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.180253,0.0,0.0,0.0,0.0,0.0,0.226689
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [138]:
genres = train.columns[0:17]
## Combine the genres with the our word feature space 
df_feature = df_feature.join(train[genres])

In [139]:
df_feature.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,27,28,35,36,37,53,80,878,9648
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [140]:
def get_genre_x_y(data, genre):
    '''
    A helper to separate the response variable (domain1_score) from
    the predictors.
    '''
    y = data[genre]
    X = data.drop(genres, axis=1)
    return X, y

In [148]:
## drop null values 
df_feature =df_feature.dropna(axis=0).copy()

In [149]:
X,y  = get_genre_x_y(df_feature, genres[1])

In [185]:
print "are there any nulls?", y.isnull().values.any()
## Checking to see if there are any null and is so where they are 
[i for i, x in enumerate(y) if pd.isnull(x)]

are there any nulls? False


[]

In [175]:
lasso = Lasso(alpha=.0003)

lasso.fit(X, y)

## Round these to the closest possible values
y_hat = lasso.predict(X).round()

## This is my accuracy, but in reality I think that we are not predicting any of them correctly 
np.mean(y_hat == y)

0.93632958801498123

In [None]:
folds = KFold(X.shape[0], n_folds=2, shuffle=True)

for train_indices, test_indices in folds:
        X_train, X_test = X[train_indices], X[test_indices]
        y_train, y_test = y[train_indices], y[test_indices]
        
        print train_indices
        break 
        

In [187]:
        
for train_indices, test_indices in folds:
    X_train, X_test = X[train_indices], X[test_indices]
    y_train, y_test = y[train_indices], y[test_indices]

#    print "are there any nans",np.isnan(X_train).any()
    print np.isnan(y_train).any()
    print [i for i, x in enumerate(y) if pd.isnull(y_train)]
    break 
    lasso = Lasso(alpha=alpha)

    # fit on the training data 
    lasso.fit(X_train, y_train)

True


ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [152]:
lasso_alpha_list = [.0001, .001, .01 , 1]
lasso_r2_value = []
lasso_spearman_value = []

for alpha in lasso_alpha_list: 
    lasso_r2_temp = []
    lasso_spearman_temp = []
    
    print "Testing alpha = {0}".format(alpha)
    
    # gives us the different folds of our data to test against
    folds = KFold(X.shape[0], n_folds=2, shuffle=True)

    for train_indices, test_indices in folds:
        X_train, X_test = X[train_indices], X[test_indices]
        y_train, y_test = y[train_indices], y[test_indices]

        lasso = Lasso(alpha=alpha)
        
        # fit on the training data 
        lasso.fit(X_train, y_train)
        
        # calculate r^2
        #lasso_r2_temp.append(lasso.score(X_test, y_test))
        #spearman_r = scipy.stats.spearmanr(lasso.predict(X_test), y_test)
        #print "Spearman_r:", spearman_r
        #lasso_spearman_temp.append(spearman_r)
    
    lasso_r2_value.append(np.mean(lasso_r2_temp))
    lasso_spearman_value.append(np.mean(lasso_spearman_temp))

Testing alpha = 0.0001


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In [43]:
X = feature_df[["essay_set", "essay_length"]]
y = feature_df["domain1_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y)

ridge = Ridge()
ridge.fit(X_train, y_train)
print "R^2:", ridge.score(X_test, y_test)
print "Spearman r: ", scipy.stats.spearmanr(ridge.predict(X_test), y_test)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,822,823,824,825,826,827,828,829,830,10402
0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.0,0.0,0.319646,0.000000,0.000000,0.000000,0.0
1,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.194699,1.0
2,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.180253,0.0,0.0,0.000000,0.000000,0.000000,0.226689,0.0
3,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,1.0
4,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
5,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.341708,0.0,0.0,0.000000,0.473569,0.000000,0.000000,0.0
6,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.152226,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
7,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
8,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
9,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.210884,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
