# Milestone 3: Traditional statistical and machine learning methods

By: Alexandra Ding, Cynthia He, Jingyi Yu

## Data Description

Before modeling, we constructed a sample data set of about 6000 movies and 124 predictors. The predictors include movie features from TMDB and IMDB and counts of popular words in movie titles and movie overviews.

The predictors are:
 - Release year (TMDB, factor of 6 levels)
 - Release month (TMDB, factor of 12 levels)
 - Vote count (TMDB)
 - Popularity (TMDB)
 - Vote average (TMDB)
 - Runtime (IMDB)
 - Aspect ratio (IMDB, factor of 22 levels)
 - Count of keywords in movie title (TMDB; the 17 words with more than 30 total occurrences were selected)
 - Count of keywords in movie overview (TMDB; the 66 words with more than 200 total occurrences were selected)

Response variable:
 - Movie genre: binarized (One vs Rest)

In [1]:
import pandas as pd
import pickle

In [7]:
### Load Dataset
# X: Unprocessed features
# X_std: standardized by Preprocessor
# y: MultiLabel Binarized targets
[X_data, X_data_std, y_data] = pickle.load(open('continuous_features_targets.p', 'rb'))

print 'X_data shape:', X_data.shape
print 'X_data_std shape:', X_data_std.shape
print 'y shape', y_data.shape

X_data shape: (5996, 124)
X_data_std shape: (5996, 124)
y shape (5996, 20)


## Detailed Model Description

Because a single movie can be associated with several genres, this is a **multi-label classification** problem, and binary classifiers have to be adapted to the multi-label setting. One way to adapt classifiers such as Logistic Regression, Decision Trees, Random Forests and Support Vector Machines is to fit a separate model for each label (**One vs Rest Classifier**) and then assign to each data point every label whose model predicts a positive. Other approaches include one-vs-one classification (where each pair of labels is compared) and RAKEL, where a random subset of labels is trained against the remaining labels (SOURCE).

Because One vs Rest classification is computationally efficient (requiring only n_classes classifiers) and highly interpretable, we used it to transform our multilabel classification problem into a set of binary classifications. Python's scikit-learn (sklearn) machine learning package can generalize many classification models to the One vs Rest case (OneVsRestClassifier in sklearn.multiclass).
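As a rough illustration of the One vs Rest idea described above, the sketch below fits one binary classifier per label. The synthetic data from `make_multilabel_classification`, the number of labels, and all variable names are assumptions made for this example; it does not use our movie data.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the movie data: 200 samples, 10 features, 5 labels
X_toy, y_toy = make_multilabel_classification(n_samples=200, n_features=10,
                                              n_classes=5, random_state=0)

# One vs Rest wraps a binary classifier and fits one copy per label column
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_toy, y_toy)

# Each sample receives every label whose binary model predicts a positive
y_toy_pred = ovr.predict(X_toy)
print(len(ovr.estimators_))   # one fitted estimator per label
print(y_toy_pred.shape)       # same shape as the binarized target matrix
```

Because each label gets its own estimator, the cost grows linearly with the number of labels, which is the efficiency argument made above.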


## Models fitted on the data

We selected several models to predict the binarized response variable using the movie features. The models were selected according to their computational efficiency and performance. The relevant parameters were tuned using 3-fold cross validation. 

 - Weighted Logistic regression: Tuned **Regularization Parameter C** using cross validation
 - Decision Tree: Tuned **max_depth** using cross validation
 - Random Forest: Tuned **max_depth** using cross validation
 - AdaBoost: Tuned **n_estimators** using cross validation (Gradient Boosting is planned as a next step)

**NOTE: We ran the models on separate computers, rather than in a single notebook. The results were saved in a pickle file for each and are analyzed below**

We also tried but did not finish running: 
- SVM (Linear, Polynomial, SVM): these took a long time (more than several hours) to run on our computers or on AWS, so we excluded them from our report. 

In [None]:
### Cross Validation and Model Selection metrics
import pickle
import numpy as np  # used by np.logspace and np.mean in the model cells below

from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict

from sklearn.metrics import f1_score
from sklearn.metrics import hamming_loss 
from sklearn.metrics import make_scorer

# Preprocessing
import sklearn.preprocessing as Preprocessing
from sklearn.preprocessing import StandardScaler as Standardize
from sklearn.preprocessing import MultiLabelBinarizer

### Classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

from sklearn import tree
from sklearn import ensemble
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier

from sklearn.linear_model import LogisticRegression as Log_Reg

In [None]:
### Load Dataset
# X: Unprocessed features
# X_std: standardized by Preprocessor
# y: MultiLabel Binarized targets
[X_data, X_data_std, y_data] = pickle.load(open('continuous_features_targets.p', 'rb'))

######      CREATE SCORING METRIC        ######
# http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer
# http://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html

# Hamming Loss: want to MINIMIZE LOSS 
hamming_scorer = make_scorer(hamming_loss, greater_is_better = False)

### Logistic Regression

The first model we tried is regularized **Logistic Regression** with **balanced class weights**. 

Logistic regression is the most commonly used statistical model for binary classification. It is a generalized linear model that uses a logit link to relate a linear combination of the predictors to the class probability (i.e. $\log\left(\frac{p}{1-p}\right) = X'\beta$). To compensate for the imbalanced nature of the data, we used balanced class weights when fitting the logistic regression.

Logistic regression is a parametric model with two important underlying assumptions: 1) the observations are independent; 2) the relationship between the predictors and the logit of the class probability (i.e. $\log\left(\frac{p}{1-p}\right)$) is linear. Although the data sample may satisfy the first assumption, the second is very likely to be violated. This is probably an important reason why the logistic regression model does not perform well, so we moved on to try several non-parametric models: decision tree, random forest and AdaBoost. 
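To make the balanced class weights more concrete, here is a small worked example. scikit-learn documents the `'balanced'` option as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The toy target below, built from the Western count (55 of 5996 movies), is an assumption for illustration and is not read from our pickle file.

```python
import numpy as np

# Assumed toy target: 55 positives (e.g. Western) among 5996 movies
n_samples, n_positive = 5996, 55
y_toy = np.zeros(n_samples, dtype=int)
y_toy[:n_positive] = 1

# 'balanced' heuristic: n_samples / (n_classes * count of each class)
class_counts = np.bincount(y_toy)
class_weights = n_samples / (2.0 * class_counts)

print(class_weights)  # weight for class 0, weight for class 1
```

The rare positive class receives a weight roughly a hundred times larger than the negative class, which pushes the model toward predicting the label more often.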


In [None]:
### Logistic Regression
LogReg_Model = OneVsRestClassifier(Log_Reg(penalty = 'l2', class_weight = 'balanced'))

LogReg_grid = GridSearchCV(LogReg_Model,
                           param_grid = {'estimator__C': np.logspace(-5, 15, 20)},
                           scoring = hamming_scorer,
                           n_jobs = 5)
LogReg_grid.fit(X_data_std, y_data)
LogReg_grid.cv_results_['mean_test_score']

# Fit best model on data, predict
y_pred_LogReg = cross_val_predict(LogReg_grid.best_estimator_, X_data_std, y_data)

# Dump CV results AND predictions from best model
pickle.dump([LogReg_grid.cv_results_, y_pred_LogReg], open('LogReg_grid_results.p', 'wb'))

### Decision Tree

Next, we tried a single Decision Tree. Its advantages include interpretability (the splitting criterion for each branch is easy to read), invariance to the scale of the predictors, the ability to handle both categorical and quantitative variables (beneficial for our dataset, which contains variables such as runtime and season), and the flexibility to approximate a wide variety of data distributions. We started with a Decision Tree to gain intuition into whether our multilabel problem is easily solved by tree-based methods. 

Here, we tuned the parameter **max_depth**, which controls how far the tree expands and, indirectly, the number of samples in each node. Restricting the parameter max_depth can reduce overfitting. 
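The effect of max_depth can be sketched on synthetic data: a binary tree of depth d has at most 2^d leaves, so a small max_depth caps how finely the tree can partition the samples. The data from `make_classification` and the depths tried are assumptions for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for a single genre label
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

leaves = {}
for depth in [1, 2, 4, 8]:
    clf = DecisionTreeClassifier(criterion='gini', max_depth=depth, random_state=0)
    clf.fit(X_toy, y_toy)
    # The fitted tree never exceeds the requested depth or 2**depth leaves
    leaves[depth] = clf.get_n_leaves()
    print(depth, clf.get_depth(), leaves[depth])
```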

In [None]:
### Single Decision Tree
DecisionTree_Model = OneVsRestClassifier(tree.DecisionTreeClassifier(criterion='gini'))
DT_grid = GridSearchCV(DecisionTree_Model,
                       param_grid = {'estimator__max_depth': range(1,10)},
                       scoring = hamming_scorer)
DT_grid.fit(X_data_std, y_data)
DT_grid.cv_results_['mean_test_score']
print 'Best score single DT:', DT_grid.best_score_

y_pred_Decision_Tree = cross_val_predict(DT_grid.best_estimator_, X_data_std, y_data)
np.mean(y_pred_Decision_Tree == y_data)

pickle.dump([DT_grid.cv_results_, y_pred_Decision_Tree], open('DecisionTree_grid_results.p', 'wb'))


### Random Forest

A Random Forest consists of many decision trees, and each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features (SOURCE: sklearn documentation http://scikit-learn.org/stable/modules/ensemble.html#forest). 
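The two sources of randomness described above can be sketched with NumPy alone. This is not scikit-learn's internal implementation; the square-root rule for the size of the feature subset is a common choice for classification and is an assumption here.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 5996, 124   # shape of our feature matrix

# Bootstrap sample: draw n_samples row indices with replacement
boot_idx = rng.randint(0, n_samples, size=n_samples)

# Random feature subset considered at one split (assumed sqrt rule)
n_candidates = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=n_candidates, replace=False)

# Rows that were never drawn are "out of bag" for this tree
n_unique = len(np.unique(boot_idx))
print(n_unique / float(n_samples))   # fraction of distinct rows in the bootstrap
print(candidate_features)
```

Each tree therefore sees a different subset of the movies and, at each split, a different subset of the predictors, which is what decorrelates the trees in the ensemble.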

In the random forest, the parameters we could tune include the number of trees, the maximum number of features, the maximum depth, and the minimum number of samples at a leaf node. As we saw in lecture, the test classification error decreases with the number of trees and eventually levels off. Note that Random Forests **cannot overfit the data with respect to the number of trees**, because adding trees does not increase the flexibility of the model. The maximum number of features controls the size of the random subset considered at each split, which tunes the chance that relevant features will be selected and thus influences misclassification error. Together, the max depth and min samples at leaf node influence how "full-grown" a tree is, meaning how many splits are made in the algorithm. 

Given these parameters, we decided to tune **max_depth**.

In [None]:
### Random Forest
RandomForest_Model = OneVsRestClassifier(RandomForest())
# max_depth must be an integer: try depths 10, 20, ..., 70
rf_grid = GridSearchCV(RandomForest_Model,
                       param_grid = {'estimator__max_depth': range(10, 80, 10)},
                       scoring = hamming_scorer)

rf_grid.fit(X_data_std, y_data)
rf_grid.cv_results_['mean_test_score']
print 'Best score in RF tuning max_depth:', rf_grid.best_score_
y_pred_RF = cross_val_predict(rf_grid.best_estimator_, X_data_std, y_data)

pickle.dump([rf_grid.cv_results_, y_pred_RF], open('RandomForest_tune_maxdepth_hamming_grid_results.p', 'wb'))


### AdaBoost

In AdaBoost, we fit "weak learners" (such as small decision trees/stumps) to the data, and then use an iterative algorithm to "adapt" to the misclassified samples in order to boost the classification accuracy. Initially, all samples are equally weighted, and with each iteration, the weights are adjusted based on whether the last classification misclassified the sample. (SOURCE: http://scikit-learn.org/stable/modules/ensemble.html#forest)
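One round of the reweighting described above can be worked through by hand. The sketch follows the textbook discrete AdaBoost update, which is not necessarily identical to what scikit-learn's AdaBoostClassifier does internally. The five toy labels and the weak learner's predictions are made up.

```python
import numpy as np

# Toy round: 5 samples, the weak learner misclassifies only sample 2
y_true = np.array([1, -1, 1, 1, -1])
y_weak = np.array([1, -1, -1, 1, -1])
weights = np.ones(5) / 5.0              # initially all samples weighted equally

miss = (y_weak != y_true)
err = np.sum(weights[miss]) / np.sum(weights)   # weighted error = 0.2
alpha = np.log((1.0 - err) / err)               # learner weight = log(4)

# Up-weight the misclassified samples, then renormalize to sum to 1
weights = weights * np.exp(alpha * miss)
weights = weights / np.sum(weights)
print(weights)
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.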

In [None]:
### Adaboost

clf = OneVsRestClassifier(AdaBoostClassifier())
ada_grid = GridSearchCV(clf,
                       param_grid = {'estimator__n_estimators': np.logspace(1,3,6).astype(int) },
                                     scoring = hamming_scorer)
ada_grid.fit(X_data_std, y_data)
ada_grid.cv_results_['mean_test_score']
print 'Best score in AdaBoost tuning n_estimators:', ada_grid.best_score_

#scores = cross_val_score(ada_grid.best_estimator_, X_data_std, y_data, scoring = hamming_scorer)
y_pred_ada = cross_val_predict(ada_grid.best_estimator_, X_data_std, y_data)
pickle.dump([ada_grid.cv_results_, y_pred_ada], open('Adaboost_grid_results.p', 'wb'))

## Assessing Model Performance for Multilabel Classification

### Describe performance metrics

To assess our One vs Rest classifiers in the multilabel setting, we used six performance metrics: 0-1 loss based accuracy, Hamming loss, precision, recall, F1-score, and Jaccard Similarity. In addition to reporting these metrics for the overall dataset under different models, we evaluated precision, recall and F1-score for each label, to detect whether any labels were being excluded from prediction due to imbalance. 


#### 1. Accuracy

Accuracy is calculated on a zero-one loss basis: each binary entry of the prediction matrix that matches the true value counts as one, and each mismatch as zero. The mean of these indicators is the percentage of binary entries in the prediction matrix that match the true values. Computed this way, over individual label entries, accuracy equals one minus the Hamming loss.

#### 2. Hamming Loss

Simply put, the Hamming loss is the fraction of labels that are incorrectly predicted, optionally weighted by sample weights. While the zero-one loss considers the entire set of labels for a given sample incorrect if it doesn't entirely match the true set of labels, the Hamming loss is more forgiving in that it penalizes the individual labels. One benefit of using this loss function (cited in this review) is that different costs can be specified for different types of errors (ex: false positives and false negatives). 

$$Accuracy(H,D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \left( \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \right)^{a}$$

Where: $D$ = dataset, $Y_i$ = dataset labels for example $X_i$, $H$ = multilabel classifier, $Z_i$ = set of labels predicted by $H$ for example $X_i$. The parameter $a$ is the "forgiveness rate" that calibrates the severity of different errors. This can be adjusted by the "sample_weight" argument in sklearn. 

**We used Hamming Loss as the scoring parameter used to tune our model parameters via Cross-Validation.** In this case, we used the default (uniform) forgiveness rate, but future calibration of our models may require that we weight the samples to offset the label imbalance in our dataset. 

Because **a larger Hamming Loss denotes a worse prediction accuracy,** we had to negate this value (`greater_is_better = False` in `make_scorer`) to use it in the GridSearchCV function, which selects the parameters with the highest score. 
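The difference between the Hamming loss and the stricter zero-one (subset) loss, and the sign flip needed for a scorer that is maximized, can be seen on a tiny made-up example. The label matrices below are assumptions for illustration, not our data.

```python
import numpy as np

# Toy binarized labels: 2 movies x 3 genres (values are made up)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

# Hamming loss: fraction of individual label entries that are wrong
hamming = np.mean(y_true != y_pred)

# Zero-one (subset) loss: a movie counts as wrong unless all labels match
zero_one = np.mean(np.any(y_true != y_pred, axis=1))

# A scorer that is maximized must use the negated loss
hamming_score = -hamming

print(hamming)    # 1 wrong entry out of 6
print(zero_one)   # 1 of the 2 movies has an imperfect label set
```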

#### 3. Precision

$$Precision=\frac{tp}{tp+fp}$$
where tp = True Positives and fp = False Positives

True positives are the correctly identified positive instances, and false positives are the negative instances which were incorrectly identified as positives. The sum of true positives and false positives is the total number of positive instances identified in the prediction. Thus, the Precision is the proportion of correctly identified positive instances over the total number of predicted positive instances. 

A low Precision can occur if many movies are being labeled with a given genre but few of them actually belong to that genre. 

#### 4. Recall

$$Recall=\frac{tp}{tp+fn}$$ 
where tp = True Positives and fn = False Negatives

Recall is calculated similarly to Precision but uses a different denominator. False negatives are the positive instances that were incorrectly identified as negative. True positives and false negatives sum to the total number of truly positive instances in the data. Therefore, the Recall is the proportion of correctly identified positive instances over the total number of truly positive instances. 

A low Recall would suggest that few of the movies that truly belong to a genre are labeled with that genre in the predictions.

#### 5. F-1 Score

F-1 score is the harmonic mean of Precision and Recall. It is a measurement that considers both Precision and Recall. 

$$F_{1}=2\cdot {\frac {1}{{\tfrac {1}{\mathrm {recall} }}+{\tfrac {1}{\mathrm {precision} }}}}=2\cdot {\frac {\mathrm {precision} \cdot \mathrm {recall} }{\mathrm {precision} +\mathrm {recall} }}$$
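The three formulas above can be checked with a few lines of plain Python. The helper name `precision_recall_f1` and the counts are hypothetical. Returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts; 0.0 when undefined."""
    precision = tp / float(tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / float(tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts for one genre: 8 correct positives, 2 false alarms, 8 misses
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, f1)
```

With these counts precision is 0.8 and recall is 0.5. The harmonic mean is pulled toward the lower of the two, giving an F-1 of about 0.62.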

#### 6. Jaccard Similarities

The Jaccard Similarity between the predicted labels (y_pred) and ground truth labels (y_data) is defined as the size of the intersection divided by the size of the union of the two label sets for a given data point $X_i$. Therefore, JS penalizes both the inclusion of incorrect labels in the prediction and the exclusion of true labels from it. 

$$ J(A,B) = {{|A \cap B|}\over{|A \cup B|}} = {{|A \cap B|}\over{|A| + |B| - |A \cap B|}}$$
where A = dataset labels, B = multilabel classification output

Like the Hamming Loss, the weights of different samples can be changed based on label imbalance, though we used default values when assessing our classifiers below. 
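For a single movie, the Jaccard Similarity defined above reduces to simple set arithmetic. The helper `jaccard_similarity` and the genre sets are hypothetical. Treating two empty label sets as a perfect match is an assumption of this sketch.

```python
def jaccard_similarity(true_labels, predicted_labels):
    """Jaccard similarity of two label sets; defined as 1.0 if both are empty."""
    a, b = set(true_labels), set(predicted_labels)
    union = a | b
    if not union:
        return 1.0   # assumption: two empty label sets count as a perfect match
    return len(a & b) / float(len(union))

# Made-up example: one shared genre out of three distinct genres
score = jaccard_similarity(['Drama', 'Comedy'], ['Drama', 'Thriller'])
print(score)
```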

### Performance Evaluation and Visualizations

In [2]:
import numpy as np
import pandas as pd
import pickle
import os

from sklearn.metrics import f1_score
from sklearn.metrics import hamming_loss 
from sklearn.metrics import make_scorer
from sklearn.metrics import jaccard_similarity_score as jaccard_score
from sklearn.metrics import classification_report

os.chdir('/Users/AlexandraDing/Documents/cs109b-best-group')

# WD Where the model results are kept
data_wd = '/Users/AlexandraDing/Documents/cs109b-best-group/Model_Results/'



In [9]:
### Load Dataset
# X: Unprocessed features
# X_std: standardized by Preprocessor
# y: MultiLabel Binarized targets
[X_data, X_data_std, y_data] = pickle.load(open('continuous_features_targets.p', 'rb'))

print 'X_data shape:', X_data.shape
print 'X_data_std shape:', X_data_std.shape
print 'y shape', y_data.shape

# Unpickle Model Results
[LogReg_cv, y_pred_LogReg] = pickle.load(open(data_wd+'LogReg_grid_results.p', 'rb'))
print y_pred_LogReg.shape

[DT_grid_cv, y_pred_Decision_Tree] = pickle.load(open(data_wd + 'DecisionTree_grid_results.p', 'rb'))
print y_pred_Decision_Tree.shape

[rf_grid_cv, y_pred_RF] = pickle.load(open(data_wd + 'RandomForest_tune_maxdepth_hamming_grid_results.p', 'rb'))
print y_pred_RF.shape

[ada_grid_cv, y_pred_ada] = pickle.load(open(data_wd + 'Adaboost_grid_results.p', 'rb'))
print y_pred_ada.shape

# Create list with all model prediction
prediction_list = [y_pred_LogReg, y_pred_Decision_Tree, y_pred_RF, y_pred_ada]

X_data shape: (5996, 124)
X_data_std shape: (5996, 124)
y shape (5996, 20)
(5996, 20)
(5996, 20)
(5996, 20)
(5996, 20)


In [10]:
# Read column names to get genre labels for tables below
movie_data = pd.read_csv("add_imdb_utf8_fixruntime_cleaned.csv")
# print movie_data.head(n=3)
genre_numbers = movie_data.columns[14:33]
genre_dict = pickle.load(open('/Users/AlexandraDing/Documents/cs109b-best-group/Milestone1/genre_dict_by_id.p', 'rb'))
genre_labels = [genre_dict[int(genre_numbers[i])] for i in range(len(genre_numbers))]
print pd.Series(data= np.sum(movie_data[genre_numbers], axis = 0).values, index=genre_labels)
genre_labels.insert(17, "Foreign")
print genre_labels

Adventure           367
Fantasy             268
Animation           339
Drama              2179
Horror              856
Action              774
Comedy             1496
History             125
Western              55
Thriller           1157
Crime               396
Documentary         909
Science Fiction     422
Mystery             269
Music               286
Romance             650
Family              460
War                  73
TV Movie            227
dtype: int64
[u'Adventure', u'Fantasy', u'Animation', u'Drama', u'Horror', u'Action', u'Comedy', u'History', u'Western', u'Thriller', u'Crime', u'Documentary', u'Science Fiction', u'Mystery', u'Music', u'Romance', u'Family', 'Foreign', u'War', u'TV Movie']


In [11]:
### SUMMARIZE MODEL ACCURACY: 
    # for MULTILABEL DATA, prints label-wise accuracy, hamming loss, f1 score, jaccard similarity, classification report
    # INPUTS:
        # y_prediction: predicted y
        # y_data : ground truth y
        # names: list of genre names, in the column order of y_data
    # OUTPUTS:
        # prints accuracy metrics
        # returns None

def summarize_model_accuracy(y_prediction, y_data, names):
    # Get basic accuracy: what proportion of labels are correct
    print 'Accuracy:', np.mean(y_prediction == y_data)
    
    # Get Hamming Loss
    print 'Hamming Loss:', hamming_loss(y_data, y_prediction)
    
    # Get f1
    print 'F1 Score:', f1_score(y_data, y_prediction, average = 'weighted')
    
    # get Jaccard Similarity
    print 'Jaccard Similarity:', jaccard_score(y_data, y_prediction)
    
    # Classification report:report recall, precision, f1 ON EACH CLASS (can be used for multilabel case)
    print classification_report(y_data, y_prediction, target_names = names)

### Logistic Regression Model Performance


In [12]:
# Summarize LogReg Performance
LogRegSummary = summarize_model_accuracy(y_pred_LogReg, y_data, genre_labels)

Accuracy: 0.595488659106
Hamming Loss: 0.404511340894
F1 Score: 0.363745775457
Jaccard Similarity: 0.151680275982
                 precision    recall  f1-score   support

      Adventure       0.10      0.61      0.18       367
        Fantasy       0.07      0.66      0.13       268
      Animation       0.10      0.80      0.18       339
          Drama       0.52      0.67      0.58      2179
         Horror       0.22      0.74      0.35       856
         Action       0.20      0.74      0.32       774
         Comedy       0.36      0.68      0.47      1496
        History       0.04      0.80      0.07       125
        Western       0.02      0.73      0.03        55
       Thriller       0.29      0.76      0.42      1157
          Crime       0.09      0.71      0.17       396
    Documentary       0.30      0.91      0.45       909
Science Fiction       0.11      0.77      0.20       422
        Mystery       0.06      0.67      0.11       269
          Music       0.08    

### Decision Tree Model Performance

In [13]:
Decision_Tree = summarize_model_accuracy(y_pred_Decision_Tree, y_data, genre_labels)

Accuracy: 0.908522348232
Hamming Loss: 0.0914776517678
F1 Score: 0.168331087028
Jaccard Similarity: 0.13245377871
                 precision    recall  f1-score   support

      Adventure       0.59      0.04      0.08       367
        Fantasy       0.14      0.01      0.01       268
      Animation       0.45      0.05      0.09       339
          Drama       0.56      0.27      0.37      2179
         Horror       0.00      0.00      0.00       856
         Action       0.49      0.03      0.06       774
         Comedy       0.86      0.12      0.22      1496
        History       0.08      0.01      0.01       125
        Western       0.00      0.00      0.00        55
       Thriller       0.25      0.00      0.00      1157
          Crime       0.00      0.00      0.00       396
    Documentary       0.82      0.30      0.44       909
Science Fiction       0.30      0.03      0.06       422
        Mystery       0.00      0.00      0.00       269
          Music       0.65    

### Random Forest Model Performance

In [8]:
RF_Model_Summary = summarize_model_accuracy(y_pred_RF, y_data, genre_labels)

Accuracy: 0.909372915277
Hamming Loss: 0.0906270847231
F1 Score: 0.215170777762
Jaccard Similarity: 0.169916055148
                 precision    recall  f1-score   support

      Adventure       0.52      0.06      0.11       367
        Fantasy       0.27      0.01      0.02       268
      Animation       0.52      0.04      0.07       339
          Drama       0.60      0.34      0.43      2179
         Horror       0.62      0.09      0.16       856
         Action       0.44      0.05      0.09       774
         Comedy       0.64      0.17      0.27      1496
        History       0.00      0.00      0.00       125
        Western       0.00      0.00      0.00        55
       Thriller       0.40      0.10      0.16      1157
          Crime       0.14      0.00      0.00       396
    Documentary       0.76      0.35      0.48       909
Science Fiction       0.47      0.02      0.04       422
        Mystery       0.00      0.00      0.00       269
          Music       0.63   



### Ada Boost Model Performance

In [14]:
Ada_Boost = summarize_model_accuracy(y_pred_ada, y_data, genre_labels)

Accuracy: 0.909856571047
Hamming Loss: 0.0901434289526
F1 Score: 0.265702827649
Jaccard Similarity: 0.21204890562
                 precision    recall  f1-score   support

      Adventure       0.49      0.10      0.17       367
        Fantasy       0.00      0.00      0.00       268
      Animation       0.43      0.09      0.15       339
          Drama       0.58      0.39      0.47      2179
         Horror       0.62      0.19      0.29       856
         Action       0.47      0.09      0.15       774
         Comedy       0.70      0.18      0.29      1496
        History       0.26      0.05      0.08       125
        Western       0.00      0.00      0.00        55
       Thriller       0.46      0.15      0.23      1157
          Crime       0.00      0.00      0.00       396
    Documentary       0.71      0.45      0.55       909
Science Fiction       0.43      0.02      0.04       422
        Mystery       0.00      0.00      0.00       269
          Music       0.62    

## Model Comparison and Results

 - Logistic regression
The strength of logistic regression is that the model is computationally fast and its results are easy to interpret, as long as the assumptions are met. The fit can also be assessed by plotting the residuals and Cook's distance, and the model can be regularized to prevent overfitting. However, logistic regression usually performs well only on data that are close to linearly separable. It is less common to apply kernels to the model, and correctly selecting higher-degree terms of the predictors is hard.

 - Decision tree
The advantage of decision trees, as discussed above, is that the models can be explained intuitively: it is easy to see how the decision is made at every step. However, the model performance is not stable.

 - Random Forest and Ada Boost:
Random forest and Ada Boost are ensembles of decision trees, and they perform much more stably because they combine multiple trees. However, they lose the interpretability of a single tree. 
 
According to the model accuracy measurements above, Ada boost is currently our best performing model. It has the highest 0-1 loss based accuracy (0.910) and Jaccard Similarity (0.212) and the lowest Hamming loss (0.090) of the four models. Its weighted F-1 score (0.27) is the highest among the tree-based models, although the class-weighted logistic regression scores higher on F-1 (0.36) because of its much higher recall, which comes at the cost of very low precision and a Hamming loss of 0.40.

As mentioned in the model comparison, logistic regression is parametric and therefore not as flexible as decision trees, while Random forest and Ada boost are ensembles of decision trees. It is therefore reasonable that Ada boost outperforms the other models.

## Challenges and Next Steps

We encountered various challenges in the model implementation process:

 - **High computational intensity:**

    SVM is very computationally intensive. We tried to run it on an AWS x2.large instance, but fitting the SVM models still required several hours. We are working on setting up parallel computing, and would include SVM models if we had more time. 
    
 
 - **High dimensionality:**
 
    Text analysis on the movie titles and overviews provided us with over 10 thousand words. We tried to reduce the dimensionality through PCA, but over 3000 principal components are required to explain 90% of the variability. Therefore, we decided to set reasonable thresholds on the total occurrence of the words, and to use only the words whose occurrence exceeds the threshold. In this way, we are able to reduce the number of predictors and keep the interpretability, although some useful information may be lost. 
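The thresholding step described above can be sketched as follows. The helper `select_frequent_words`, the toy titles, and the tiny threshold are all hypothetical. Our actual thresholds were 30 total occurrences for titles and 200 for overviews, and our real tokenization may differ from the simple whitespace split assumed here.

```python
from collections import Counter

def select_frequent_words(texts, min_total):
    """Return the sorted words whose total count across texts exceeds min_total."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return sorted(w for w, c in counts.items() if c > min_total)

# Made-up titles and a tiny threshold, for illustration only
toy_titles = ['The Last Night', 'Night of the Dead', 'The Dead Zone', 'Last Call']
vocabulary = select_frequent_words(toy_titles, min_total=1)
print(vocabulary)

# One count column per selected word, one row per title
count_matrix = [[t.lower().split().count(w) for w in vocabulary] for t in toy_titles]
print(count_matrix)
```

Each retained word becomes one interpretable predictor column, unlike a principal component, which mixes many words together.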
 
At the moment, we are running SVM with different kernels, and we are looking forward to seeing the results. As next steps within traditional statistical and machine learning methods, we plan to try Gradient Boosting and fine-tune the model parameters. At the same time, we would like to explore deep learning methods and utilize image information from movie posters. 