<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preparation" data-toc-modified-id="Preparation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Load-and-prep-data" data-toc-modified-id="Load-and-prep-data-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Load and prep data</a></span></li><li><span><a href="#Las-Vegas-Wrapper-for-feature-selection" data-toc-modified-id="Las-Vegas-Wrapper-for-feature-selection-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Las Vegas Wrapper for feature selection</a></span></li></ul></li><li><span><a href="#Learning-algorithms" data-toc-modified-id="Learning-algorithms-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Learning algorithms</a></span></li><li><span><a href="#Reproducing-table-2" data-toc-modified-id="Reproducing-table-2-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Reproducing table 2</a></span></li></ul></div>

# Preparation

## Imports

In [1]:
import csv
import pandas as pd
import numpy as np
import glob
from tqdm import tqdm
import time
import random

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import cross_validate

## Load and prep data

In [2]:
def load_audio(trainpath="./data/CoE_dataset_icpr/Dev_Set/audio_descriptors/*",
               testpath="./data/CoE_dataset_icpr/Test_Set/audio_descriptors/*",
               trainrefpath="./data/CoE_dataset_icpr/Dev_Set/CoeDevelopmentTrainingdata.csv",
               testrefpath="./data/CoE_dataset_icpr/Dev_Set/CoeDevelopmentTestdata.csv"):
    """ loads all audio data and includes the target variable
    
    kwargs:
        trainpath: path to the folder containing all audio descriptor csv files of the train set
        testpath: path to the folder containing all audio descriptor csv files of the test set
        trainrefpath: path to the csv containing filename, movie name and the target variable for the train set
        testrefpath: path to the csv containing filename, movie name and the target variable for the test set
    """
    
    train = pd.DataFrame()
    test = pd.DataFrame()
    trainfiles = []
    testfiles = []

    # load training data: one row per file, holding the row-wise means of its descriptor csv
    for csvpath in tqdm(glob.glob(trainpath), desc='Loading audio train data'):
        trainfiles.append(csvpath.split('/')[-1].split('.csv')[0])
        tmp = pd.read_csv(csvpath, header=None).mean(axis=1)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        train = pd.concat([train, pd.DataFrame(tmp.values.flatten()).transpose()])

    # load test data in the same way
    for csvpath in tqdm(glob.glob(testpath), desc='Loading audio test data'):
        testfiles.append(csvpath.split('/')[-1].split('.csv')[0])
        tmp = pd.read_csv(csvpath, header=None).mean(axis=1)
        test = pd.concat([test, pd.DataFrame(tmp.values.flatten()).transpose()])

    # add filename and target variable
    train['fname'] = trainfiles
    test['fname'] = testfiles
    train = train.merge(pd.read_csv(trainrefpath)[['file_name', 'goodforairplanes']], left_on='fname', right_on='file_name').drop(columns=['file_name'])
    test = test.merge(pd.read_csv(testrefpath)[['file_name', 'goodforairplanes']], left_on='fname', right_on='file_name').drop(columns=['file_name'])

    # set file name as index
    train.set_index(['fname'], inplace=True)
    test.set_index(['fname'], inplace=True)

    # replace NAs in audio
    train = train.fillna(0)
    test = test.fillna(0)

    return train, test

In [3]:
def load_text(trainpath = "./data/CoE_dataset_icpr/Dev_Set/text_descriptors/tdf_idf_dev.csv",
             testpath = "./data/CoE_dataset_icpr/Test_Set/text_descriptors/tdf_idf_test.csv",
             trainrefpath = "./data/CoE_dataset_icpr/Dev_Set/CoeDevelopmentTrainingdata.csv",
             testrefpath = "./data/CoE_dataset_icpr/Dev_Set/CoeDevelopmentTestdata.csv"):
    """ loads text dataset and returns train and test DFs"""

    train = pd.read_csv(trainpath).T.dropna()
    test = pd.read_csv(testpath).T.dropna()
    trainref = pd.read_csv(trainrefpath).sort_values("file_name").reset_index(drop=True)
    testref = pd.read_csv(testrefpath).sort_values("file_name").reset_index(drop=True)
    
    train = train.join(trainref).drop(columns=['movie_name'])
    test = test.join(testref).drop(columns=['movie_name'])
    
    return train, test

In [4]:
audio_train, audio_test = load_audio()
text_train, text_test = load_text()

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
audio = pd.concat([audio_train, audio_test])
text = pd.concat([text_train, text_test])

Loading audio train data: 100%|██████████| 95/95 [00:51<00:00,  1.84it/s]
Loading audio test data: 100%|██████████| 223/223 [01:58<00:00,  1.88it/s]


In [5]:
#text_train, text_test = load_text()
text_train # TODO: find the join key

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3275,3276,3277,3278,3279,3280,3281,3282,file_name,goodforairplanes
24000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
baby,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
baseball,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
big,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
doc,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
absolute,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
academic,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
accept,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
access,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
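
The empty `file_name` and `goodforairplanes` columns in the output above are what an index-aligned join produces when the two frames share no index labels: `DataFrame.join` matches rows on the index unless a key is given. The following is a minimal sketch of that behaviour on made-up data; the frame contents and the choice of `file_name` as the key are assumptions for illustration, since the correct join key for the text descriptors is still an open TODO.

```python
import pandas as pd

# Hypothetical miniature: features indexed by strings, reference table
# indexed by integers 0..n-1 (as after reset_index(drop=True)).
features = pd.DataFrame({"tfidf_0": [0.1, 0.0]}, index=["movie_a", "movie_b"])
ref = pd.DataFrame({"file_name": ["movie_a", "movie_b"],
                    "goodforairplanes": [1, 0]})

# join() aligns on the index; no label is shared, so the joined columns are NaN.
misaligned = features.join(ref)

# Moving the (assumed) key into the index of the reference table restores the labels.
aligned = features.join(ref.set_index("file_name"))
```

Whether the real descriptor file can be keyed this way depends on what its row labels actually contain, which the sketch does not settle.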


## Las Vegas Wrapper for feature selection

In [6]:
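def LVW(tX, ty, vX, vy, K, original_features, classifier):
    """ Las Vegas Wrapper: draws random feature subsets and keeps the best one

    kwargs:
        tX, ty: features and target variable of the train set
        vX, vy: features and target variable of the validation set
        K: number of consecutive non-improving draws before stopping
        original_features: sequence of all candidate feature names
        classifier: learning algorithm used to score a feature subset
    """

    acc = 0
    k = 0
    C = len(original_features)
    S = list(original_features)  # fall back to all features if no draw improves

    while k < K:
        #print('k: ', k)
        ran_choice = range(1,len(original_features)-1)
        S1 = random.sample(original_features, random.choice(ran_choice))
        C1 = len(S1)

        x_train = tX[tX.columns.intersection(S1)]
        x_test = vX[vX.columns.intersection(S1)]

        # NOTE: the helper `Classifier` is not defined in this notebook;
        # it is expected to return a dict with an 'F1' entry
        acc1 = Classifier(x_train, ty, x_test, vy, 10, classifier)['F1']

        # keep the subset if it scores better, or equally well with fewer features
        if (acc1 > acc) or (acc1 == acc and C1 < C):
            k = 0
            acc = acc1
            C = C1
            S = S1

        else:
            k += 1

    return S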
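
Because `LVW` relies on a `Classifier` helper that is not part of this notebook, the wrapper cannot be run as it stands. The following is a self-contained sketch of the same idea under stated assumptions: the names `lvw_sketch`, `max_stale` and `rng` are hypothetical, the subset score is the mean F1 from `cross_val_score` rather than the notebook's helper, and the search starts from the score of the full feature set instead of from zero.

```python
import random

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def lvw_sketch(X, y, estimator, max_stale, rng):
    """Random subset search: stop after max_stale draws without improvement."""
    n_features = X.shape[1]
    best_subset = list(range(n_features))  # assumption: start from all features
    best_score = cross_val_score(estimator, X, y, cv=3, scoring="f1").mean()
    stale = 0
    while stale < max_stale:
        # draw a random subset size, then a random subset of column indices
        size = rng.randint(1, n_features)
        subset = sorted(rng.sample(range(n_features), size))
        score = cross_val_score(estimator, X[:, subset], y, cv=3, scoring="f1").mean()
        better = score > best_score
        same_but_smaller = score == best_score and len(subset) < len(best_subset)
        if better or same_but_smaller:
            best_subset, best_score, stale = subset, score, 0
        else:
            stale += 1
    return best_subset, best_score


# Synthetic data stands in for the audio descriptors.
X, y = make_classification(n_samples=120, n_features=8, n_informative=3, random_state=0)
subset, score = lvw_sketch(X, y, DecisionTreeClassifier(random_state=0),
                           max_stale=5, rng=random.Random(0))
```

Scoring each subset by cross-validation on one set avoids the separate validation split that `LVW` takes as `vX, vy`; that is a design choice of the sketch, not the notebook's setup.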

# Learning algorithms

In [7]:
def seperate_tvar(data, target_var='goodforairplanes'):
    """ separate target variable from the rest of the dataset
    kwargs:
        data: dataframe containing dependent and independent variables
        target_var: column name of the target variable as a string
    """
    # fail early with a clear message instead of an UnboundLocalError below
    if target_var not in data.columns:
        raise KeyError(f"target variable '{target_var}' not found in data")

    data_y = data[target_var]
    data_x = data.drop(columns=target_var)

    return data_y, data_x

In [8]:
# dict mapping classifier name to classifier object (fit times are collected in the loop below)
classifier = {"KNN": KNeighborsClassifier(),
              #"Nearest mean": #TODO
              "Decision tree": DecisionTreeClassifier(),
              "SVM": SVC(),
              "Bagging": BaggingClassifier(),
              "Random forest": RandomForestClassifier(),
              "AdaBoost": AdaBoostClassifier(),
              "Naive bayes": GaussianNB(),
              "Logistic Regression": LogisticRegression(),
              "Gradient Boosting Tree": GradientBoostingClassifier()
              }

table2_train, table2_full = pd.DataFrame(), pd.DataFrame()
audio_y, audio_x = seperate_tvar(audio)

audio_train_y, audio_train_x = seperate_tvar(audio_train)
audio_test_y, audio_test_x = seperate_tvar(audio_test)

for c in classifier:
#     # apply LVW feature selection on train_x and test_x
#     features = LVW(classifier[c], train_x, train_y, test_x, test_y, 500, range(0,train_x.shape[1]-1))
#     train_x = train_x[train_x.columns.intersection(features)]
#     test_x = test_x[test_x.columns.intersection(features)]

    model = classifier[c]
    yhat_full = cross_validate(model, audio_x, audio_y, scoring=['precision', 'recall', 'f1'], cv=10)
    yhat_train = cross_validate(model, audio_train_x, audio_train_y, scoring=['precision', 'recall', 'f1'], cv=10)
    # DataFrame.append was removed in pandas 2.0, so rows are added with pd.concat
    table2_full = pd.concat([table2_full, pd.DataFrame([[c, 'Reproduced',
                                       np.mean(yhat_full['test_precision']),
                                       np.mean(yhat_full['test_recall']),
                                       np.mean(yhat_full['test_f1']),
                                       np.mean(yhat_full['fit_time']),
                                       'Audio']])])
    table2_train = pd.concat([table2_train, pd.DataFrame([[c, 'Reproduced',
                                       np.mean(yhat_train['test_precision']),
                                       np.mean(yhat_train['test_recall']),
                                       np.mean(yhat_train['test_f1']),
                                       np.mean(yhat_train['fit_time']),
                                       'Audio']])])

    
# add results from the paper for comparison (no training time is reported there)
paper = pd.DataFrame([['Logistic Regression', 'Paper', 0.507, 0.597, 0.546, np.nan, 'Audio'],
                      ['Gradient Boosting Tree', 'Paper', 0.560, 0.617, 0.587, np.nan, 'Audio']])

table2_train = pd.concat([table2_train, paper])
table2_full = pd.concat([table2_full, paper])


# name columns and add multilevel idx
table2_train.rename(columns={0: 'Algorithm', 1: 'Source', 2: 'Precision', 3: 'Recall',
                             4: 'F1', 5: 'Training time (s)', 6: 'Modality'}, inplace=True)
# table2_train.set_index(['Modality', 'Source', 'Algorithm'], inplace=True)

table2_full.rename(columns={0: 'Algorithm', 1: 'Source', 2: 'Precision', 3: 'Recall',
                            4: 'F1', 5: 'Training time (s)', 6: 'Modality'}, inplace=True)
# table2_full.set_index(['Modality', 'Source', 'Algorithm'], inplace=True)
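
Each row of the result tables is built from the dictionary that `cross_validate` returns. The following is a minimal sketch of that dictionary on synthetic data (the data and the name `summary` are assumptions for illustration): with a list of scorers, the result holds one array per scorer under the key `test_<scorer>`, plus `fit_time` and `score_time`, each with one entry per fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data stands in for the audio descriptors.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

scores = cross_validate(LogisticRegression(), X, y,
                        scoring=["precision", "recall", "f1"], cv=10)

# Average every per-fold array, as the loop above does for each classifier.
summary = {name: float(np.mean(values)) for name, values in scores.items()}
```

Note that the averaged F1 is the mean of the per-fold F1 scores, so it generally differs from the F1 computed from the averaged precision and recall.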



# Reproducing table 2 - Audio

Ordinarily we would end up with the following table, which runs cross-validation on all of the data, since CV already holds out each fold in turn as the test set:  
> reproducible with `table2_full`

Algorithm |Source |Precision |Recall |F1 |Training time (s) |Modality
--|--|--|--|--|--|--
KNN |Reproduced |0.570931 |0.616071 |0.584850 |0.001231 |Audio
Decision tree |Reproduced |0.454762 |0.375000 |0.406107 |0.001806 |Audio
SVM |Reproduced |0.512449 |0.598214 |0.542862 |0.001888 |Audio
Bagging |Reproduced |0.471667 |0.380357 |0.413395 |0.013214 |Audio
Random forest |Reproduced |0.511032 |0.428571 |0.454090 |0.010469 |Audio
AdaBoost |Reproduced |0.501468 |0.466071 |0.477330 |0.051304 |Audio
Naive bayes |Reproduced |0.571944 |0.369643 |0.430480 |0.001216 |Audio
Logistic Regression |Reproduced |0.458135 |0.467857 |0.451560 |0.001487 |Audio
Gradient Boosting Tree |Reproduced |0.560397 |0.508929 |0.530108 |0.046474 |Audio
Logistic Regression |Paper |0.507000 |0.597000 |0.546000 |NaN |Audio
Gradient Boosting Tree |Paper |0.560000 |0.617000 |0.587000 |NaN |Audio

---

Since only the algorithms for which $F1 > 0.5$ holds are kept, we can filter:  
> reproducible with `table2_full[table2_full['F1']>0.5]`

Algorithm |Source |Precision |Recall |F1 |Training time (s) |Modality
--|--|--|--|--|--|--
KNN |Reproduced |0.570931 |0.616071 |0.584850 |0.001231 |Audio
SVM |Reproduced |0.512449 |0.598214 |0.542862 |0.001888 |Audio
Gradient Boosting Tree |Reproduced |0.560397 |0.508929 |0.530108 |0.046474 |Audio
Logistic Regression |Paper |0.507000 |0.597000 |0.546000 |NaN |Audio
Gradient Boosting Tree |Paper |0.560000 |0.617000 |0.587000 |NaN |Audio
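
The filter is a plain boolean mask on the `F1` column. As a minimal sketch, the following applies it to three rows copied from the unfiltered table above; the frame name `results` and the reduced column set are assumptions for illustration.

```python
import pandas as pd

# Three rows taken from the unfiltered table (F1 values as reported there).
results = pd.DataFrame(
    [["KNN", "Reproduced", 0.584850],
     ["Decision tree", "Reproduced", 0.406107],
     ["SVM", "Reproduced", 0.542862]],
    columns=["Algorithm", "Source", "F1"],
)

# Keep only the rows whose F1 exceeds the 0.5 threshold.
kept = results[results["F1"] > 0.5]
```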

---

Strangely, the authors seem to have run cross-validation on the training set only. That is at odds with the idea of CV, which already holds out data in every fold, but with that selection we obtain results that are closer to the ones in the paper:  
> reproducible with `table2_train`

Algorithm |Source |Precision |Recall |F1 |Training time (s) |Modality
--|--|--|--|--|--|--
KNN |Reproduced |0.508333 |0.520000 |0.499697 |0.001239 |Audio
Decision tree |Reproduced |0.526667 |0.413333 |0.434430 |0.001488 |Audio
SVM |Reproduced |0.467619 |0.600000 |0.522145 |0.001653 |Audio
Bagging |Reproduced |0.515476 |0.450000 |0.467828 |0.011674 |Audio
Random forest |Reproduced |0.546667 |0.430000 |0.468304 |0.011116 |Audio
AdaBoost |Reproduced |0.561667 |0.566667 |0.558889 |0.046116 |Audio
Naive bayes |Reproduced |0.516667 |0.383333 |0.429365 |0.001191 |Audio
Logistic Regression |Reproduced |0.511429 |0.563333 |0.529297 |0.001419 |Audio
Gradient Boosting Tree |Reproduced |0.561667 |0.570000 |0.559798 |0.041004 |Audio
Logistic Regression |Paper |0.507000 |0.597000 |0.546000 |NaN |Audio
Gradient Boosting Tree |Paper |0.560000 |0.617000 |0.587000 |NaN |Audio

---

### Final reproduction of table 2 - audio part

With the same filter as above, we obtain the following:  
> reproducible with `table2_train[table2_train['F1']>0.5]`

Algorithm |Source |Precision |Recall |F1 |Training time (s) |Modality
--|--|--|--|--|--|--
SVM |Reproduced |0.467619 |0.600000 |0.522145 |0.001653 |Audio
AdaBoost |Reproduced |0.561667 |0.566667 |0.558889 |0.046116 |Audio
Logistic Regression |Reproduced |0.511429 |0.563333 |0.529297 |0.001419 |Audio
Gradient Boosting Tree |Reproduced |0.561667 |0.570000 |0.559798 |0.041004 |Audio
Logistic Regression |Paper |0.507000 |0.597000 |0.546000 |NaN |Audio
Gradient Boosting Tree |Paper |0.560000 |0.617000 |0.587000 |NaN |Audio