# Youtube Conversation Prediction
## CS/INFO 4300 Language and Information

In [96]:
from __future__ import print_function
from __future__ import division
import numpy as np
import json

# 1. Load the data from the JSON file.

In [97]:
with open('data.json') as json_file:   
    video_data = json.load(json_file)
video_num_comments, video_captions = np.array([ (video_data[video_id]["numberComments"], video_data[video_id]["captions"]) 
                                  for video_id in videos_data.keys() ]).T

In [98]:
combined_video_captions = []

caption_text = ""
for caption_data_list in video_captions:
    for caption_data in caption_data_list:
        # Consolidate caption text for each video into one string
        caption_text = caption_text + caption_data["text"] + " "
    combined_video_captions.append(caption_text)
    caption_text = ""
    
video_captions = combined_video_captions

#2. Make a 50-50 train-test split.

Use `sklearn.cross_validation.train_test_split`. Set `random_state=0`. Make sure the train and test sizes are equal (plus/minus one)

In [99]:
from sklearn.cross_validation import train_test_split

In [100]:
print(len(video_num_comments))
print(len(video_captions))

5
5


In [87]:
num_comments_train, num_comments_test, video_captions_train, video_captions_test  = train_test_split(video_num_comments, video_captions, 
                                                                       test_size=.5, random_state=0)

In [101]:
print(len(num_comments_test))
print(len(num_comments_train))
print(num_comments_test)
print(len(video_captions_test))
print(len(video_captions_train))

3
2
[9791 133980 17250]
3
2


### 3. Build the document-term matrices

Use `sklearn.feature_extraction.TfidfVectorizer`. Use unigrams only, disable idf, use `l1` normalization. 

Resulting matrices are `X_train` and `X_test`.

**Note:** Remember to just `fit` on the training data. If a word occurs only in the test documents, our model should **not** be aware that the word exists, as we are trying to evaluate the performance on completely unseen data.

In [102]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [103]:
tfv = TfidfVectorizer(ngram_range=(1,1), use_idf=False, norm='l1')
tfv.fit(video_captions_train)
X_train = tfv.transform(video_captions_train)
X_test  = tfv.transform(video_captions_test)

In [105]:
print(X_train.shape)
print(X_test.shape)

(2, 531)
(3, 531)


# 4. Predict using a random guess baseline

Use a random classifier from `sklearn.dummy.DummyClassifier`.  Set `strategy="stratified"`? Set `random_state=0`, to get the same result every time, since randomness is involved.

In [106]:
from sklearn.dummy import DummyClassifier

In [122]:
dummy = DummyClassifier(strategy="stratified", random_state=0)
dummy.fit(X_train, num_comments_train)
num_comments_pred_stratified = dummy.predict(X_test)

# 5. Evaluate the randomized predictions

We will use a regression evaluation statistic called mean absolute error initially

In [123]:
from sklearn.metrics import mean_absolute_error

In [124]:
my_mae_score = mean_absolute_error(num_comments_test, Y_pred_stratified)

In [125]:
print(my_mae_score)

42995.6666667


# 6. Train and evaluate a classifier.

We will use `sklearn.svm.SVR()" as our initial classifier (Support Vector Regression)

In [126]:
from sklearn.svm import SVR

In [127]:
svm_regression_classifier = SVR()

In [128]:
svm_regression_classifier.fit(X_test, num_comments_test)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.0,
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [133]:
num_comments_pred_svr = svm_regression_classifier.predict(X_test)
my_mae_score_svr = mean_absolute_error(num_comments_test, num_comments_pred_svr)

In [132]:
print(my_mae_score_svr)

41396.333308


# 7. Use grid search and cross-validation to tune the classifier

The score above is pretty disappointing, but kind of expected, given how little work we did-- we are basically just using the default configuration.  A `LinearSVC` has a bunch of configuration options that should be tweaked:

* `C` is the *regularization parameter*. Lower values of C constraint the model more, while higher values allows the model to fit the training data better. (Remember that fitting the training data too well can lead to overfitting.)

* `class_weight` can force the classifier to emphasize positive instances more or less than negative ones. This is useful if we know for a fact that the classes aren't equally probable. Read the documentation and see what the `'auto'` setting does.

However, choosing these values should also be done without looking at the test data, because they are part of the model. Use `sklearn.grid_search.GridSearchCV` to systematically try out different values for these two parameters, and choose the configuration that does best.

`GridSearchCV` uses k-fold cross-validation to ensure fair evaluation and avoid overfitting. This consists of splitting the training data into *k* parts, then training the classifier *k* times, each time leaving out a different part, that is used for scoring. The average score over the *k* folds is a better estimate of how well the classifier would generalize.

Because we are facing a multi-label problem, the default scoring strategy (accuracy) doesn't make sense. We have to define our own `sample_f1_scorer` strategy:

In [250]:
def sample_f1_scorer(estimator, X, y):
    """sample-f1 scorer metric
    
    This function is just glue code for the scikit-learn scorer API.
    See http://scikit-learn.org/stable/modules/model_evaluation.html#implementing-your-own-scoring-object
    
    Parameters:
    -----------
    
    estimator:
        the model that should be evaluated (e.g., the scikit-learn classifier)
    X: array-like, shape (n_samples, n_features)
        the test data
    y: array-like, shape (n_samples, n_labels)
        the ground truth target for X.
    
    Returns:
    --------
    
    sample_f1_score, float
        the sample F1 score as used in Q06 and Q07
    """
    y_pred = estimator.predict(X)
    return f1_score(y, y_pred, average='samples')

Now, run grid search over a range of regularization parameters, as below.  This takes under 1 minute on a 2014 MacBook Pro Retina. If you're not sure your code works, test it on a small number of documents first to avoid wasting time.

What is the best configuration, and the best score (averaged over the 3 folds)? (there are attributes of the `GridSearchCV` object that answer this).

DISCUSSION ITEM.
What can you say about the impact of `C` and `class_weight` on the score? (look at `grid.grid_scores_` to answer this).

In [251]:
from sklearn.grid_search import GridSearchCV

In [252]:
param_grid = dict(
    estimator__C=[1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3],  # you can also build this using np.logspace
    estimator__class_weight=['auto', None])

In [253]:
grid = GridSearchCV(ovr,
                    param_grid,
                    cv=3,
                    scoring=sample_f1_scorer,
                    verbose=True)

In [254]:
grid.fit(X_train, Y_train)

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  42 out of  42 | elapsed:   28.6s finished


Fitting 3 folds for each of 14 candidates, totalling 42 fits


GridSearchCV(cv=3,
       estimator=OneVsRestClassifier(estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.0001, verbose=0),
          n_jobs=1),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'estimator__class_weight': ['auto', None], 'estimator__C': [0.001, 0.01, 0.1, 1, 10.0, 100.0, 1000.0]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring=<function sample_f1_scorer at 0x7feeaf225500>, verbose=True)

# 8. Evaluate the chosen classifier on the test set. Inspect performance on individual categories.

Use `grid.best_estimator_` to access the `ovr` object chosen as best by the grid search. Use `sample_f1_scorer` and report the **sample F1** score as in Q06 and Q07. This time, you should see a rewarding increase.

DISCUSSION ITEM.
Compare this score with the cross-validated average score over the 3 folds for the best model (Q08).  Does cross-validation give a reasonable estimate of the actual generalization performance a model can get on unseen test data? Compare with what we saw in class, when we were looking at the performance of a classifier on the data it was trained on, versus on the test data.

In [259]:
grid_f1_score = sample_f1_scorer(grid.best_estimator_, X_test, Y_test)

In [260]:
print(grid_f1_score)

0.563988708163


** TODO discuss **


Then, to aggregate scores over individual categories, use `sklearn.metrics.classification_report`. Keep in mind that in the classification report, precision, recall and F1 have different meaning than the sample-based scores we used in the previous questions: they are averages over a given label, as opposed to a given document.

DISCUSSION ITEM. How do you interpret this table?

In [264]:
from sklearn.metrics import classification_report

In [268]:
Y_pred_test_grid = grid.predict(X_test)
grid_report = classification_report(Y_test, Y_pred_test_grid)

In [269]:
print(grid_report)

             precision    recall  f1-score   support

          0       0.59      0.73      0.65        78
          1       0.49      0.57      0.53        54
          2       0.61      0.66      0.64        89
          3       0.49      0.73      0.59        52
          4       0.69      0.65      0.67       156
          5       0.39      0.41      0.40        39
          6       0.64      0.49      0.56        55
          7       0.29      0.39      0.33        46
          8       0.36      0.46      0.41        68
          9       0.68      0.67      0.68        61
         10       0.66      0.72      0.69       120

avg / total       0.57      0.62      0.59       818

