In [1]:
import sklearn 
import pandas as pd
import xml.etree.ElementTree as ET


__First of all, the authors did not mention which sklearn version they used!__
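
Since the paper does not state the library version, it may help to at least record the versions used for this reproduction. A minimal sketch using only the standard library; the helper name `installed_version` and the package list are our own choices, not something from the paper:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or a marker if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


# Packages this notebook relies on (our own selection).
for package in ("scikit-learn", "pandas", "numpy"):
    print(f"{package}: {installed_version(package)}")
```

Keeping this output next to the results makes it easier to tell later whether differences come from changed library defaults.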


As described in the paper, the first step is to select base classifiers. 
The selected base classifiers are trained with default parameter settings using 10-fold cross-validation.
As input, the training set and its ground-truth labels are used, one modality at a time.
For the audio MFCC features, we set NaN values to 0, and calculate the average of each MFCC coefficient over all frames.

# Load input data


# Description:

## Available Data
There are various csv files and data files available, and the layout is very messy. 
There is one file called "CoE_dataset_offical_release.zip". 
We extract this archive and, for now, use the data included in it. 

## Meta Data
The original paper gives no information about what is included in the metadata. 
Looking at the paper describing the data set ("Right Inflight? A Dataset for Exploring the Automatic Prediction of Movies Suitable for a Watching Situation"), we found that the metadata consists of language, year published, genre, country, runtime and age rating. Since the authors of our paper did not say otherwise, we assume that they used the same metadata. 
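
To illustrate how these six fields are read, here is a small sketch that parses a toy XML string with `xml.etree.ElementTree`. The root tag `<root>` and the exact attribute layout are assumptions for illustration; only the attribute names match the ones used in this notebook, and the values are taken from the first row of the metadata output:

```python
import xml.etree.ElementTree as ET

# Toy document; the real files may have a different root element (assumption).
toy_xml = """<root>
  <movie language="English" year="2014"
         genre="Action, Adventure, Fantasy"
         country="USA, UK, Canada, China"
         runtime="102 min" rated="PG-13"/>
</root>"""

FIELDS = ("language", "year", "genre", "country", "runtime", "rated")

movie = ET.fromstring(toy_xml).find("movie")
# Element.get returns the attribute value as a string (or None if missing).
record = {field: movie.get(field) for field in FIELDS}
print(record)
```

Every value comes back as a string, which is why all metadata columns show up with dtype `object` after loading.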

## Visual Data: 
The visual data is provided as one csv file per movie, containing two rows. According to the dataset paper, the following visual features were calculated: Histogram of Oriented Gradients (HOG) gray, Color Moments, local binary patterns (LBP) and Gray Level Run Length Matrix. The paper does not say how the csv file represents them. Also, as mentioned, the csv file has just two rows, which does not add up to the four visual features. __We treat all values as separate columns!__
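
To make the "separate columns" decision concrete, the sketch below flattens a toy two-row descriptor into one feature vector with NumPy. The numbers are made up; column-major order is used because it mirrors the column order that `unstack()` produces later in this notebook (value 0 of row 0, value 0 of row 1, value 1 of row 0, and so on):

```python
import numpy as np

# Toy descriptor: two rows with three values each (the real files have 826 per row).
two_rows = np.array([[0.1, 0.2, 0.3],
                     [0.4, 0.5, 0.6]])

# Column-major ("F") order interleaves the two rows value by value.
features = two_rows.ravel(order="F")
print(features.tolist())  # [0.1, 0.4, 0.2, 0.5, 0.3, 0.6]
```

With 826 values per row this gives 1652 features per movie, which is far more than the 95 movies in the development set.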

## Audio Data: 
Audio features are also provided as one csv file per movie. Each file should contain 12 MFCC coefficients for multiple frames, although the loaded data for Seventh_Son shows 13 rows (index 0 to 12).

## Textual Data
The textual data is a single file containing the tf-idf matrix. The first line holds the words, which serve as row names. 
The columns are the associated movies. __There is no indication of which movie each column belongs to, so we have to assume an order!__

__For now we assume the order is the same as in the df_labled_movies dataframe!!!__
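
Because the column order is an assumption, it seems worth guarding at least against a mismatch in the number of movies. A minimal sketch on a toy matrix; the words, values and the two movie names are placeholders:

```python
import pandas as pd

# Toy tf-idf matrix: rows are words, columns are movies without names.
tfidf = pd.DataFrame([[0.0, 0.2],
                      [0.5, 0.0],
                      [0.0, 0.1]],
                     index=["baby", "big", "york"])

# Assumed order, taken from the label file (here just two placeholder names).
assumed_order = ["Seventh_Son", "Welcome_to_Me"]

# This only checks the count; it cannot prove that the order itself is right.
if tfidf.shape[1] != len(assumed_order):
    raise ValueError("number of tf-idf columns does not match number of movies")

tfidf.columns = assumed_order
per_movie = tfidf.T  # one row per movie, one column per word
print(per_movie)
```

If the assumed order is wrong, the features are silently paired with the wrong labels, so the textual results should be read with that caveat in mind.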



In [2]:

df_labled_movies = pd.read_csv("./data/CoE_dataset/Dev_set/dev_set_groundtruth_and_trailers.csv", sep=';')
del df_labled_movies['trailer']
df_labled_movies = df_labled_movies[['movie','filename', 'goodforairplane']]
display(df_labled_movies.head(10))


### Load Meta Data ###

def load_meta_data( filenames ): 
    
    raw_data = []
    
    for file in filenames: 
        file_path = f'./data/CoE_dataset/Dev_set/XML/{file}.xml'
        with open(file_path) as f: 
            tree = ET.parse(f)
            movie = tree.find('movie')
            
            lang = movie.get('language')
            year = movie.get('year')
            genre = movie.get('genre')
            country = movie.get('country')
            runtime = movie.get('runtime')
            age_rating = movie.get('rated')
             
            raw_data.append( (file,lang,year,genre,country,runtime,age_rating) )
    
    return pd.DataFrame(raw_data, columns=['filename','language','year','genre','country','runtime','rated'])


df_meta_data = load_meta_data( df_labled_movies['filename']  )
display(df_meta_data.head(10))
display(df_meta_data.dtypes)

### Load Visual Data ###

def load_visual_data( filenames ):
    data_list = []
    
    for file in filenames: 
        file_path = f'./data/CoE_dataset/Dev_set/vis_descriptors/{file}.csv'
        df_data = pd.read_csv(file_path,index_col=None, header=None)
        data_list.append(df_data)
        
    return pd.concat(data_list, axis = 0, keys = filenames,names=('filename','vis_data'),  sort=False)

df_visual_data = load_visual_data( df_labled_movies['filename']  )
display(df_visual_data.head(10))

### Load Audio Data ###

def load_audio_data( filenames ):
    data_list = []
    
    for file in filenames: 
        file_path = f'./data/CoE_dataset/Dev_set/audio_descriptors/{file}.csv'
        df_data = pd.read_csv(file_path,index_col=None, header=None)
        data_list.append(df_data)
        
    return pd.concat(data_list, axis = 0, keys = filenames,names=('filename','freq_coeff'),  sort=False)

df_audio_data = load_audio_data( df_labled_movies['filename']  )
display(df_audio_data.head(20))


### Load textual Data ###

def load_text_data(filenames):
    

    data_list = []
    file_path = f'./data/CoE_dataset/Dev_set/text_descriptors/tdf_idf_dev.csv'
    # pandas does not handle the first line being row names well (at least I did not find a better way),
    # so we read the header line and the data separately
    header_index = pd.read_csv(file_path, index_col=0,nrows=1 ).reset_index().columns
    df_data = pd.read_csv(file_path, header=None, index_col=False,skiprows=1)
    df_data.set_index(header_index, inplace=True)
    df_data.columns = filenames
    return df_data.T  # rows should be indexed by movie name

df_text_data = load_text_data(df_labled_movies['filename'] )
display(df_text_data.head(20))
display(df_text_data.shape)
display(df_text_data.describe())



Unnamed: 0,movie,filename,goodforairplane
0,Seventh Son,Seventh_Son,1
1,Welcome to Me,Welcome_to_Me,0
2,The Judge,The_Judge,0
3,Transformers Age of Extinction,Transformers__Age_of_Extinction,0
4,The Normal Heart,The_Normal_Heart,1
5,The Phantom Tollbooth,The_Phantom_Tollbooth,1
6,Andaz Apna Apna,Andaz_Apna_Apna,1
7,Hotel Transylvania,Hotel_Transylvania,1
8,The Matrix,The_Matrix,1
9,Into the Wild,Into_the_Wild,1


Unnamed: 0,filename,language,year,genre,country,runtime,rated
0,Seventh_Son,English,2014,"Action, Adventure, Fantasy","USA, UK, Canada, China",102 min,PG-13
1,Welcome_to_Me,English,2014,"Comedy, Drama",USA,105 min,R
2,The_Judge,English,2014,Drama,USA,141 min,R
3,Transformers__Age_of_Extinction,English,2014,"Action, Adventure, Sci-Fi","USA, China",165 min,PG-13
4,The_Normal_Heart,English,2014,Drama,USA,132 min,TV-MA
5,The_Phantom_Tollbooth,English,1970,"Family, Adventure, Animation",USA,90 min,G
6,Andaz_Apna_Apna,Hindi,1994,"Comedy, Family, Romance",India,160 min,PG
7,Hotel_Transylvania,English,2012,"Animation, Comedy, Family",USA,91 min,PG
8,The_Matrix,English,1999,"Action, Sci-Fi","USA, Australia",136 min,R
9,Into_the_Wild,"English, Danish",2007,"Adventure, Biography, Drama",USA,148 min,R


filename    object
language    object
year        object
genre       object
country     object
runtime     object
rated       object
dtype: object

filename,vis_data,0,1,2,3,4,5,6,7,8,9,...,816,817,818,819,820,821,822,823,824,825
Seventh_Son,0,0.047044,0.11619,0.13633,0.066194,0.072554,0.17267,0.21519,0.070574,0.071423,0.14938,...,731.69,502.01,1.897,2.2788,2.1412,2.9504,91672.0,22207.0,26201.0,14542.0
Seventh_Son,1,0.056526,0.12516,0.14628,0.082497,0.079331,0.17538,0.21839,0.093521,0.074837,0.15025,...,689.95,474.97,2.2676,2.5887,2.4022,3.2167,81373.0,21045.0,24225.0,13529.0
Welcome_to_Me,0,0.30717,0.33422,0.33112,0.33124,0.31114,0.33644,0.33616,0.34479,0.16983,0.27379,...,394.34,167.91,20.337,21.276,18.527,21.189,81665.0,13672.0,32531.0,13753.0
Welcome_to_Me,1,0.30466,0.33193,0.33124,0.33138,0.30788,0.3327,0.33357,0.34305,0.1733,0.28076,...,397.26,168.23,20.426,21.3,18.608,21.182,83171.0,13714.0,32774.0,13780.0
The_Judge,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,230400.0,119950.0,1e-06,0.002466,4e-06,0.002466,729320.0,119950.0,230400.0,119950.0
The_Judge,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,230400.0,119950.0,1e-06,0.002466,4e-06,0.002466,729320.0,119950.0,230400.0,119950.0
Transformers__Age_of_Extinction,0,0.19996,0.26934,0.27986,0.23725,0.30844,0.33242,0.32998,0.325,0.30735,0.33431,...,1112.6,668.67,15.79,14.923,15.017,14.779,208630.0,23968.0,47979.0,24059.0
Transformers__Age_of_Extinction,1,0.18913,0.25738,0.27465,0.23664,0.30332,0.32989,0.32888,0.32246,0.30543,0.33551,...,1120.6,669.56,15.086,14.7,14.859,14.723,211630.0,24019.0,48339.0,24090.0
The_Normal_Heart,0,0.0,0.0,0.0,0.0,0.038749,0.083701,0.10544,0.1215,0.038749,0.083701,...,34463.0,20376.0,1.3683,7.3447,8.0146,7.3798,145760.0,20730.0,35320.0,20831.0
The_Normal_Heart,1,0.0,0.0,0.0,0.0,0.20135,0.2979,0.39682,0.55336,0.20135,0.2979,...,41786.0,19786.0,13.071,11.296,11.202,11.306,79962.0,20617.0,45216.0,20738.0


filename,freq_coeff,0,1,2,3,4,5,6,7,8,9,...,23078,23079,23080,23081,23082,23083,23084,23085,23086,23087
Seventh_Son,0,,,,,,,,-51.235,-25.775,-17.41,...,,,,,,,,,,
Seventh_Son,1,,,,,,,,4.7601,10.414,15.935,...,,,,,,,,,,
Seventh_Son,2,,,,,,,,-8.6519,-6.1667,-7.3772,...,,,,,,,,,,
Seventh_Son,3,,,,,,,,-8.1397,-8.0911,-14.568,...,,,,,,,,,,
Seventh_Son,4,,,,,,,,-1.7245,2.1968,1.145,...,,,,,,,,,,
Seventh_Son,5,,,,,,,,0.93079,9.8801,15.528,...,,,,,,,,,,
Seventh_Son,6,,,,,,,,-2.2074,4.473,6.5692,...,,,,,,,,,,
Seventh_Son,7,,,,,,,,-2.6355,-1.6751,-6.13,...,,,,,,,,,,
Seventh_Son,8,,,,,,,,-0.3302,1.2823,-0.95568,...,,,,,,,,,,
Seventh_Son,9,,,,,,,,0.25014,7.6977,11.972,...,,,,,,,,,,


filename,24000,baby,baseball,big,doc,escort,frozen,heroes,high,huck,...,years.1,york,yorks,young,young.1,younger,youngja,zebra,zellweger,zoologists
Seventh_Son,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Welcome_to_Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The_Judge,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Transformers__Age_of_Extinction,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The_Normal_Heart,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.051657,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The_Phantom_Tollbooth,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Andaz_Apna_Apna,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hotel_Transylvania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.041679,0.0,0.0,0.0,0.0
The_Matrix,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.15957,0.15957,0.0,0.0,0.0,0.0,0.0
Into_the_Wild,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


(95, 3283)

Unnamed: 0,24000,baby,baseball,big,doc,escort,frozen,heroes,high,huck,...,years.1,york,yorks,young,young.1,younger,youngja,zebra,zellweger,zoologists
count,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,...,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0
mean,0.0,0.002288,0.001108,0.000353,0.0,0.0,0.0,0.000753,0.001395,0.0,...,0.003679,0.002638,0.0,0.006366,0.006366,0.001488,0.0,0.000531,0.0,0.0
std,0.0,0.019976,0.010803,0.003444,0.0,0.0,0.0,0.007335,0.008181,0.0,...,0.013408,0.013584,0.0,0.021099,0.021099,0.008699,0.0,0.005178,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.0,0.19349,0.10529,0.033572,0.0,0.0,0.0,0.071495,0.062724,0.0,...,0.073645,0.092014,0.0,0.15957,0.15957,0.067395,0.0,0.050467,0.0,0.0


# Preprocess Data

## Description 

Besides a short description for the audio data, there is no information on how to handle the other data. For example, the runtime is currently handled as a string (object) rather than as a number.
Since sklearn mostly expects numerical inputs, we need to encode the data. 

For categorical classes you would normally use one-hot encoding, but since the encoding is not specified, we first try the simplest approach, which is label encoding.
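
The difference between the two encodings can be sketched on three rows taken from the metadata output above. This is only an illustration with pandas; `astype("category").cat.codes` is used here as a stand-in for sklearn's `LabelEncoder`, and both assign codes in sorted category order:

```python
import pandas as pd

toy = pd.DataFrame({
    "runtime": ["102 min", "105 min", "141 min"],
    "rated": ["PG-13", "R", "R"],
})

# Runtime is a number in disguise: keep the minutes as an integer.
toy["runtime"] = toy["runtime"].str.split(" ").str[0].astype(int)

# Label encoding: one integer code per category, which implies an ordering.
label_encoded = toy.assign(rated=toy["rated"].astype("category").cat.codes)

# One-hot encoding: one indicator column per category, no implied ordering.
one_hot = pd.get_dummies(toy, columns=["rated"])

print(label_encoded)
print(one_hot)
```

Label encoding keeps the feature count small but invents an order between categories, while one-hot encoding avoids that at the cost of many sparse columns for fields such as genre or country.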


### Audio Data: 
As mentioned in the paper, NaN values of the audio data are set to 0 and the average of each MFCC coefficient is calculated over all frames.
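
The two steps can be sketched on toy data as follows. The values and the 3x4 shape are made up, and the final pivot to one row per movie is our own reading of "average of each MFCC coefficient", not something the paper spells out:

```python
import numpy as np
import pandas as pd

# Toy stand-in for two movies: rows are MFCC coefficients, columns are frames.
toy_frames = {
    "Movie_A": pd.DataFrame([[1.0, 3.0, np.nan, np.nan],
                             [2.0, 2.0, 2.0, np.nan],
                             [0.0, 4.0, 4.0, 4.0]]),
    "Movie_B": pd.DataFrame([[np.nan, 2.0, 2.0, 4.0],
                             [1.0, 1.0, 1.0, 1.0],
                             [np.nan, np.nan, 8.0, 0.0]]),
}
df_audio = pd.concat(toy_frames, names=("filename", "freq_coeff"))

# Step 1: NaN -> 0. Step 2: mean over all frames (axis=1) per coefficient.
coeff_means = df_audio.fillna(0.0).mean(axis=1)

# Assumption: arrange the means as one row per movie, one column per coefficient.
audio_features = coeff_means.unstack("freq_coeff")
print(audio_features)
```

Note that the zero-filled frames count towards the mean, so movies with many missing frames are pulled towards 0; whether the paper intended this is not clear from its description.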





In [3]:

def pre_process_audio_data():
    df_data = df_audio_data.fillna(0.0)
    return df_data.mean(axis=1)
    
def pre_process_visual_data():
    # turn the two rows belonging to each movie into columns
    df_data = df_visual_data.unstack()
    return df_data
    
    
df_audio_data_processed = pre_process_audio_data()
display(df_audio_data_processed.head(20))

df_visual_data_processed = pre_process_visual_data()
display(df_visual_data_processed.head(20))

filename       freq_coeff
Seventh_Son    0             33.737346
               1             -2.259660
               2              0.822080
               3             -0.298483
               4              0.680520
               5             -0.679905
               6              0.085080
               7             -0.249879
               8             -0.025137
               9             -0.134721
               10            -0.116094
               11            -0.098648
               12             0.066234
Welcome_to_Me  0             39.561047
               1             -4.593651
               2             -0.709224
               3             -1.020713
               4              0.160524
               5              0.001964
               6             -1.487054
dtype: float64

filename,"(0, 0)","(0, 1)","(1, 0)","(1, 1)","(2, 0)","(2, 1)","(3, 0)","(3, 1)","(4, 0)","(4, 1)",...,"(821, 0)","(821, 1)","(822, 0)","(822, 1)","(823, 0)","(823, 1)","(824, 0)","(824, 1)","(825, 0)","(825, 1)"
Seventh_Son,0.047044,0.056526,0.11619,0.12516,0.13633,0.14628,0.066194,0.082497,0.072554,0.079331,...,2.9504,3.2167,91672.0,81373.0,22207.0,21045.0,26201.0,24225.0,14542.0,13529.0
Welcome_to_Me,0.30717,0.30466,0.33422,0.33193,0.33112,0.33124,0.33124,0.33138,0.31114,0.30788,...,21.189,21.182,81665.0,83171.0,13672.0,13714.0,32531.0,32774.0,13753.0,13780.0
The_Judge,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.002466,0.002466,729320.0,729320.0,119950.0,119950.0,230400.0,230400.0,119950.0,119950.0
Transformers__Age_of_Extinction,0.19996,0.18913,0.26934,0.25738,0.27986,0.27465,0.23725,0.23664,0.30844,0.30332,...,14.779,14.723,208630.0,211630.0,23968.0,24019.0,47979.0,48339.0,24059.0,24090.0
The_Normal_Heart,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038749,0.20135,...,7.3798,11.306,145760.0,79962.0,20730.0,20617.0,35320.0,45216.0,20831.0,20738.0
The_Phantom_Tollbooth,0.016911,0.068953,0.014269,0.099843,0.016807,0.12155,0.031862,0.17175,0.029332,0.22677,...,0.004375,0.87925,230400.0,17662.0,38355.0,1955.1,73984.0,6105.2,38355.0,2000.1
Andaz_Apna_Apna,0.0,0.0,0.29416,0.29452,0.29007,0.2904,0.011351,0.011381,0.10093,0.10176,...,11.133,11.156,66358.0,66445.0,27000.0,27016.0,60473.0,60427.0,20441.0,20459.0
Hotel_Transylvania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00247,0.00247,725900.0,725900.0,119790.0,119790.0,230400.0,230400.0,119790.0,119790.0
The_Matrix,0.0,0.11486,0.0,0.20847,0.0,0.22465,0.0,0.24525,0.0,0.2825,...,0.003916,1.8843,230400.0,22739.0,55609.0,3850.1,129600.0,6361.4,55609.0,3976.7
Into_the_Wild,0.0,0.24998,0.0,0.32256,0.0,0.31273,0.0,0.24374,0.0,0.25535,...,0.004386,15.352,230400.0,26102.0,37959.0,3523.4,72900.0,6113.7,37959.0,3568.1


# Define Models

## Description 
These are the models described in the paper. It is not always clear exactly which models were used (see comments).

In [4]:
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid  # not sure whether NearestCentroid is the paper's nearest mean classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC  # not clear which SVM variant was used; there is also NuSVC
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB  # sklearn offers several naive Bayes classifiers; the paper does not state which one was used


model_list = [KNeighborsClassifier(),
                    DecisionTreeClassifier(),
                    LogisticRegression(),
                    SVC(),
                    BaggingClassifier(),
                    AdaBoostClassifier(),
                    GradientBoostingClassifier(),
                    RandomForestClassifier(),
                    GaussianNB() 
                   ]

    




# Define Performance measures:

As mentioned in the paper, the performance measures are precision, recall and F1-score. More precisely, they are the weighted averages of precision, recall and F1-score, as stated in the dataset paper. 
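
To show what "weighted" means here, the sketch below computes the support-weighted averages by hand in plain Python. It follows our understanding of sklearn's `*_weighted` scorers (per-class score weighted by the number of true samples of that class, undefined precision counted as 0); the function name and the toy labels are made up:

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 over all classes in y_true."""
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n_true in Counter(y_true).items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        p_c = tp / n_pred if n_pred else 0.0  # undefined precision counted as 0
        r_c = tp / n_true
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = n_true / total  # share of true samples with this label
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# A classifier that always predicts the majority class.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1]
precision, recall, f1 = weighted_scores(y_true, y_pred)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.36 0.6 0.45
```

A constant predictor therefore gets a weighted recall equal to the majority share and a clearly lower weighted precision. Rows in the result tables with that shape may point to a classifier that predicts only one class, although we have not verified this.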

In [5]:
from sklearn.model_selection import cross_validate

def calculate_metrics(clf,X,y ):
    metric =  cross_validate(clf, X, y, scoring=('precision_weighted','recall_weighted','f1_weighted'), return_train_score=False, cv=10)  
    return pd.Series({'precision':metric['test_precision_weighted'].mean(),'recall':metric['test_recall_weighted'].mean(),'F1':metric['test_f1_weighted'].mean() })

# Select Models

As defined in the paper, they use 10-fold CV to train the classifiers and keep every classifier whose metrics are all above 0.5 for later stacking.
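
One detail the paper leaves open is where preprocessing happens relative to the folds. If a scaler is fitted on the full data before cross-validation, statistics from the test folds leak into training. The sketch below shows one way to avoid this by putting the scaler into a pipeline; the synthetic data, the class balance, the shuffled `StratifiedKFold` and the seed are all assumptions for illustration, not the paper's setup:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 95 rows, like the development set; values are random.
rng = np.random.default_rng(0)
X = rng.normal(size=(95, 5))
y = np.array([1] * 52 + [0] * 43)  # made-up class balance

# The scaler is re-fitted on the training part of every fold.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

metric_names = ("precision_weighted", "recall_weighted", "f1_weighted")
scores = cross_validate(pipeline, X, y, cv=cv, scoring=metric_names)
summary = {name: scores[f"test_{name}"].mean() for name in metric_names}

# Selection rule from the text: keep the model only if all metrics exceed 0.5.
keep = all(value > 0.5 for value in summary.values())
print(summary, keep)
```

With only 95 samples, each test fold holds 9 or 10 movies, so the fold means are noisy and a classifier can land on either side of the 0.5 threshold depending on the split.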


In [6]:
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

class MultiColumnLabelEncoder:
    
    def __init__(self, columns = None):
        self.columns = columns # list of column to encode

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        
        output = X.copy()
        
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.items():  # items() replaces the deprecated iteritems()
                output[colname] = LabelEncoder().fit_transform(col)
        
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
    
def getModelName( object ): 

    if hasattr(object, '__module__') and hasattr(object, '__name__'):
        return  object.__name__
    elif hasattr(object, '__module__') and hasattr(object, '__class__'):
        return  object.__class__.__name__
    else:
        raise TypeError("Could not get name of object!")
    
def evaluate_models( X, y ):
    metrics = pd.DataFrame()

    for model in model_list:

        m = calculate_metrics(model,X,y )
        metrics[getModelName(model)] = m

    return metrics.T


df_final_results = pd.DataFrame()

import warnings
warnings.filterwarnings('ignore')

In [7]:
    
df_train = pd.merge(df_labled_movies,df_meta_data, on='filename')
df_train.drop(['movie', 'filename'],axis=1, inplace=True)
display(df_train.head(2))
df_X = df_train.drop('goodforairplane',axis=1)
df_y = df_train['goodforairplane']



display("----  Lable encoded ----")
label_encoder = MultiColumnLabelEncoder(['language','year','genre','country','runtime','rated'])    
X_labelencoded = label_encoder.fit_transform(df_X)
metrics = evaluate_models(X_labelencoded, df_y)
display(metrics)

# convert runtime and year to actual numbers
df_X['runtime'] = df_X['runtime'].apply(lambda x: int(x.split(' ')[0]) )
df_X['year'] =  df_X['year'].apply(pd.to_numeric)

display("---- Lable encoded with float for year and runtime ----")
# optimizing the encoding; note that 'year' is still passed to the label encoder here
label_encoder = MultiColumnLabelEncoder(['language','year','genre','country','rated'])    
X_labelencoded = label_encoder.fit_transform(df_X)
metrics = evaluate_models(X_labelencoded, df_y)
display(metrics)

display("---- Lable encoded without year ----")
label_encoder = MultiColumnLabelEncoder(['language','genre','country','rated'])    
X_labelencoded = label_encoder.fit_transform(df_X)
metrics = evaluate_models(X_labelencoded, df_y)
display(metrics)

# keep this variant (label encoded without year) for the final table
metrics['Modality'] = 'metadata'
df_final_results = pd.concat([df_final_results, metrics])  # DataFrame.append was removed in pandas 2.0


display("---- OneHot Encoding ----")
##optimizing encoding further
X_onehotencoded = pd.get_dummies(df_X)
metrics = evaluate_models(X_onehotencoded, df_y)
display(metrics)


Unnamed: 0,goodforairplane,language,year,genre,country,runtime,rated
0,1,English,2014,"Action, Adventure, Fantasy","USA, UK, Canada, China",102 min,PG-13
1,0,English,2014,"Comedy, Drama",USA,105 min,R


'----  Lable encoded ----'

Unnamed: 0,precision,recall,F1
KNeighborsClassifier,0.5194,0.566162,0.52335
DecisionTreeClassifier,0.523778,0.519798,0.497971
LogisticRegression,0.587637,0.585556,0.574014
SVC,0.297467,0.536869,0.382622
BaggingClassifier,0.547209,0.515758,0.480662
AdaBoostClassifier,0.516002,0.501717,0.490721
GradientBoostingClassifier,0.512493,0.511818,0.49845
RandomForestClassifier,0.539624,0.517778,0.503076
GaussianNB,0.47488,0.506667,0.481515


'---- Lable encoded with float for year and runtime ----'

Unnamed: 0,precision,recall,F1
KNeighborsClassifier,0.657535,0.618586,0.600698
DecisionTreeClassifier,0.382621,0.399596,0.380096
LogisticRegression,0.512315,0.52404,0.493776
SVC,0.300554,0.54798,0.388116
BaggingClassifier,0.542088,0.517778,0.507261
AdaBoostClassifier,0.466827,0.474444,0.462241
GradientBoostingClassifier,0.430951,0.43202,0.415558
RandomForestClassifier,0.504539,0.503737,0.484702
GaussianNB,0.467194,0.499798,0.47404


'---- Lable encoded without year ----'

Unnamed: 0,precision,recall,F1
KNeighborsClassifier,0.632037,0.619697,0.589233
DecisionTreeClassifier,0.432352,0.44202,0.419286
LogisticRegression,0.545546,0.549293,0.529509
SVC,0.300554,0.54798,0.388116
BaggingClassifier,0.394108,0.40798,0.383867
AdaBoostClassifier,0.466827,0.474444,0.462241
GradientBoostingClassifier,0.424838,0.43,0.412427
RandomForestClassifier,0.546888,0.533939,0.525844
GaussianNB,0.517396,0.539798,0.499798


'---- OneHot Encoding ----'

Unnamed: 0,precision,recall,F1
KNeighborsClassifier,0.574384,0.554242,0.544317
DecisionTreeClassifier,0.488396,0.478485,0.459617
LogisticRegression,0.335283,0.402727,0.359249
SVC,0.389456,0.453333,0.413346
BaggingClassifier,0.521665,0.503535,0.493322
AdaBoostClassifier,0.375063,0.382727,0.371664
GradientBoostingClassifier,0.340226,0.400707,0.36006
RandomForestClassifier,0.408165,0.43202,0.413088
GaussianNB,0.367316,0.400707,0.347878


In [8]:
from sklearn.preprocessing import Normalizer

################## Use textual data  ###################
display('################## Use textual data  ###################')

df_movies = df_labled_movies.drop(['movie'],axis=1)
df_train = pd.merge(df_movies,df_text_data, on='filename')
df_train.drop(['filename'],axis=1, inplace=True)
display(df_train.head(2))
df_X = df_train.drop('goodforairplane',axis=1)
df_y = df_train['goodforairplane']


display("---- RAW Data ----")
metrics = evaluate_models(df_X, df_y)
display(metrics)

# save the raw-data results for the final table
metrics['Modality'] = 'textual'
df_final_results = pd.concat([df_final_results, metrics])  # DataFrame.append was removed in pandas 2.0

display("---- Normalize Data ----")
df_normalized_X = Normalizer().fit_transform(df_X)
metrics = evaluate_models(df_normalized_X, df_y)
display(metrics)




'################## Use textual data  ###################'

Unnamed: 0,goodforairplane,24000,baby,baseball,big,doc,escort,frozen,heroes,high,...,years.1,york,yorks,young,young.1,younger,youngja,zebra,zellweger,zoologists
0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


'---- RAW Data ----'

Unnamed: 0,precision,recall,F1
KNeighborsClassifier,0.341994,0.465152,0.366055
DecisionTreeClassifier,0.442484,0.482424,0.4433
LogisticRegression,0.300554,0.54798,0.388116
SVC,0.300554,0.54798,0.388116
BaggingClassifier,0.508149,0.565152,0.517855
AdaBoostClassifier,0.507167,0.521717,0.498766
GradientBoostingClassifier,0.64989,0.653939,0.597817
RandomForestClassifier,0.478306,0.543131,0.490471
GaussianNB,0.537073,0.558182,0.538881


'---- Normalize Data ----'

Unnamed: 0,precision,recall,F1
KNeighborsClassifier,0.506534,0.53303,0.512809
DecisionTreeClassifier,0.486296,0.495354,0.459348
LogisticRegression,0.300554,0.54798,0.388116
SVC,0.300554,0.54798,0.388116
BaggingClassifier,0.492177,0.519596,0.477972
AdaBoostClassifier,0.477032,0.510404,0.478507
GradientBoostingClassifier,0.600671,0.620606,0.56857
RandomForestClassifier,0.409067,0.458182,0.415903
GaussianNB,0.537073,0.558182,0.538881


In [9]:
from sklearn.preprocessing import StandardScaler,RobustScaler

################## Use visual data  ###################
display('################## Use visual data  ###################')

df_movies = df_labled_movies.drop(['movie'],axis=1)
df_train = pd.merge(df_movies,df_visual_data_processed, on='filename')
df_train.drop(['filename'],axis=1, inplace=True)
display(df_train.head(5))
df_X = df_train.drop('goodforairplane',axis=1)
df_y = df_train['goodforairplane']


display("---- RAW Data ----")
metrics = evaluate_models(df_X, df_y)
display(metrics)

display("---- Scaled Data ----")
df_scaled_X = StandardScaler().fit_transform(df_X)
metrics = evaluate_models(df_scaled_X, df_y)
display(metrics)

# save the scaled-data results for the final table
metrics['Modality'] = 'visual'
df_final_results = pd.concat([df_final_results, metrics])  # DataFrame.append was removed in pandas 2.0

display("---- RobustScaler Data ----")
df_scaled_X = RobustScaler().fit_transform(df_X)
metrics = evaluate_models(df_scaled_X, df_y)
display(metrics)


'################## Use visual data  ###################'

Unnamed: 0,goodforairplane,"(0, 0)","(0, 1)","(1, 0)","(1, 1)","(2, 0)","(2, 1)","(3, 0)","(3, 1)","(4, 0)",...,"(821, 0)","(821, 1)","(822, 0)","(822, 1)","(823, 0)","(823, 1)","(824, 0)","(824, 1)","(825, 0)","(825, 1)"
0,1,0.047044,0.056526,0.11619,0.12516,0.13633,0.14628,0.066194,0.082497,0.072554,...,2.9504,3.2167,91672.0,81373.0,22207.0,21045.0,26201.0,24225.0,14542.0,13529.0
1,0,0.30717,0.30466,0.33422,0.33193,0.33112,0.33124,0.33124,0.33138,0.31114,...,21.189,21.182,81665.0,83171.0,13672.0,13714.0,32531.0,32774.0,13753.0,13780.0
2,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.002466,0.002466,729320.0,729320.0,119950.0,119950.0,230400.0,230400.0,119950.0,119950.0
3,0,0.19996,0.18913,0.26934,0.25738,0.27986,0.27465,0.23725,0.23664,0.30844,...,14.779,14.723,208630.0,211630.0,23968.0,24019.0,47979.0,48339.0,24059.0,24090.0
4,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038749,...,7.3798,11.306,145760.0,79962.0,20730.0,20617.0,35320.0,45216.0,20831.0,20738.0


'---- RAW Data ----'

Unnamed: 0,precision,recall,F1
KNeighborsClassifier,0.546787,0.542929,0.529648
DecisionTreeClassifier,0.461554,0.47798,0.463189
LogisticRegression,0.571573,0.585051,0.560193
SVC,0.398547,0.55798,0.4225
BaggingClassifier,0.553917,0.543737,0.535883
AdaBoostClassifier,0.505459,0.509293,0.496069
GradientBoostingClassifier,0.521322,0.521515,0.512374
RandomForestClassifier,0.467618,0.479091,0.465573
GaussianNB,0.503584,0.518586,0.484453


'---- Scaled Data ----'

Unnamed: 0,precision,recall,F1
KNeighborsClassifier,0.570299,0.56,0.538146
DecisionTreeClassifier,0.576772,0.572828,0.558449
LogisticRegression,0.580084,0.549798,0.538425
SVC,0.45307,0.538889,0.448757
BaggingClassifier,0.618066,0.597273,0.59028
AdaBoostClassifier,0.507985,0.504242,0.489265
GradientBoostingClassifier,0.469628,0.490202,0.474537
RandomForestClassifier,0.561159,0.542626,0.536601
GaussianNB,0.607893,0.587273,0.57359


'---- RobustScaler Data ----'

Unnamed: 0,precision,recall,F1
KNeighborsClassifier,0.558343,0.56,0.544837
DecisionTreeClassifier,0.523867,0.522424,0.508063
LogisticRegression,0.513166,0.509394,0.501544
SVC,0.425366,0.467172,0.420111
BaggingClassifier,0.562226,0.550808,0.546398
AdaBoostClassifier,0.497985,0.493131,0.478154
GradientBoostingClassifier,0.471742,0.492222,0.477334
RandomForestClassifier,0.476928,0.471818,0.454637
GaussianNB,0.532297,0.525354,0.522063


In [10]:
from sklearn.preprocessing import StandardScaler,RobustScaler

################## Use audio data  ###################
display('################## Use audio data  ###################')

df_movies = df_labled_movies.drop(['movie'],axis=1)
df_train = pd.merge(df_movies,pd.DataFrame(df_audio_data_processed), on='filename')
df_train.drop(['filename'],axis=1, inplace=True)
display(df_train.head(5))
df_X = df_train.drop('goodforairplane',axis=1)
df_y = df_train['goodforairplane']


display("---- RAW Data ----")
metrics = evaluate_models(df_X, df_y)
display(metrics)

# save the raw-data results for the final table
metrics['Modality'] = 'audio'
df_final_results = pd.concat([df_final_results, metrics])  # DataFrame.append was removed in pandas 2.0

display("---- Scaled Data ----")
df_scaled_X = StandardScaler().fit_transform(df_X)
metrics = evaluate_models(df_scaled_X, df_y)
display(metrics)

display("---- RobustScaler Data ----")
df_scaled_X = RobustScaler().fit_transform(df_X)
metrics = evaluate_models(df_scaled_X, df_y)
display(metrics)

'################## Use audio data  ###################'

Unnamed: 0,goodforairplane,0
0,1,33.737346
1,1,-2.25966
2,1,0.82208
3,1,-0.298483
4,1,0.68052


'---- RAW Data ----'

Unnamed: 0,precision,recall,F1
KNeighborsClassifier,0.506285,0.510045,0.506474
DecisionTreeClassifier,0.543862,0.542441,0.54188
LogisticRegression,0.485729,0.54166,0.401713
SVC,0.54432,0.545673,0.416338
BaggingClassifier,0.535994,0.533504,0.533682
AdaBoostClassifier,0.524882,0.539982,0.486366
GradientBoostingClassifier,0.500367,0.520509,0.480814
RandomForestClassifier,0.535649,0.531957,0.532033
GaussianNB,0.523032,0.546532,0.431976


'---- Scaled Data ----'

Unnamed: 0,precision,recall,F1
KNeighborsClassifier,0.506285,0.510045,0.506474
DecisionTreeClassifier,0.543862,0.542441,0.54188
LogisticRegression,0.485729,0.54166,0.401713
SVC,0.486108,0.544912,0.403628
BaggingClassifier,0.546418,0.5433,0.543256
AdaBoostClassifier,0.524882,0.539982,0.486366
GradientBoostingClassifier,0.500367,0.520509,0.480814
RandomForestClassifier,0.535934,0.531917,0.531996
GaussianNB,0.523032,0.546532,0.431976


'---- RobustScaler Data ----'

Unnamed: 0,precision,recall,F1
KNeighborsClassifier,0.506285,0.510045,0.506474
DecisionTreeClassifier,0.543862,0.542441,0.54188
LogisticRegression,0.485729,0.54166,0.401713
SVC,0.505053,0.529445,0.420454
BaggingClassifier,0.536228,0.53273,0.532796
AdaBoostClassifier,0.524882,0.539982,0.486366
GradientBoostingClassifier,0.500367,0.520509,0.480814
RandomForestClassifier,0.525148,0.5222,0.522393
GaussianNB,0.523032,0.546532,0.431976


## Final base classifier filter

In [11]:
df_r = df_final_results
df_r = df_r[ (df_r['precision'] > 0.5) & (df_r['recall'] > 0.5) & (df_r['F1'] > 0.5) ]
display(df_r)


Unnamed: 0,precision,recall,F1,Modality
KNeighborsClassifier,0.632037,0.619697,0.589233,metadata
LogisticRegression,0.545546,0.549293,0.529509,metadata
RandomForestClassifier,0.546888,0.533939,0.525844,metadata
BaggingClassifier,0.508149,0.565152,0.517855,textual
GradientBoostingClassifier,0.64989,0.653939,0.597817,textual
GaussianNB,0.537073,0.558182,0.538881,textual
KNeighborsClassifier,0.570299,0.56,0.538146,visual
DecisionTreeClassifier,0.576772,0.572828,0.558449,visual
LogisticRegression,0.580084,0.549798,0.538425,visual
BaggingClassifier,0.618066,0.597273,0.59028,visual


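As we can see, the results table looks quite different from the one in the paper. The paper does not give enough information for us to be sure that we are reproducing the steps correctly. 

For the audio data there is not much more we can do: after preprocessing we end up with a single column of data, which is how we read the paper's description, yet the metrics are still worse than in the paper. Note that the training frame printed in the audio cell has one row per movie and coefficient rather than one row per movie, so the averaged coefficients may need to be arranged as columns instead. 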

__Is something already going wrong when we load the data? Is it the wrong data?__
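
One cheap check for this question is to compare the number of training rows with the number of labelled movies after each merge. The sketch below uses made-up numbers and a hypothetical helper `has_one_row_per_movie`; it contrasts a long audio layout (one row per movie and coefficient) with a wide one (one row per movie):

```python
import pandas as pd

labels = pd.DataFrame({"filename": ["Movie_A", "Movie_B"],
                       "goodforairplane": [1, 0]})

# Long layout: one averaged value per (movie, coefficient); numbers are made up.
long_audio = pd.Series(
    [30.0, -2.0, 0.5, 40.0, -4.0, -0.5],
    index=pd.MultiIndex.from_product(
        [["Movie_A", "Movie_B"], [0, 1, 2]], names=("filename", "freq_coeff")
    ),
)

# Merging the long layout repeats every label once per coefficient.
merged_long = labels.merge(long_audio.rename("value").reset_index(), on="filename")
# Merging the wide layout keeps one row per movie.
merged_wide = labels.merge(long_audio.unstack("freq_coeff").reset_index(), on="filename")


def has_one_row_per_movie(df_train, df_labels):
    """True if the training frame has exactly one row per labelled movie."""
    return len(df_train) == len(df_labels)


print(has_one_row_per_movie(merged_long, labels))  # False
print(has_one_row_per_movie(merged_wide, labels))  # True
```

Running such a check after every merge would show quickly whether a modality is evaluated per movie or per some finer unit.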

