## **Predicting Video Memorability Score**

This notebook contains multiple approaches to predicint memorability scores. The features provded by MediaEval were transformed in a number of ways which are outlined in the Pre-Processing section. In order to determine the optimum feature and model combination, each feature was trained on each model. After Pre-Processing, every section contains a Machine Learning algorithm and each subsection contains models trained on the different features. The ML algorithms implemented are: 



*   Linear Regression
*   Ridge Regression
*   Decision Tree
*   Support Vector Machine
*   Random Forest
*   Voting Regressor
*   Bagging Regressor
*   Stacking Regressor

I have commented the code, however to avoid being repetitive and overfilling the notebook I do not repeat any comments which explain similar code.





# Spearmans Correlation Coefficient Function

The performance of each model will be based on the Spearman correlation score of the predictions and the ground truth. This function was provided by Eoin Brophy and can be found here:

https://drive.google.com/drive/folders/1puG9lLjao1y4ZngKHJFpxi4Yl-9cHvV7

In [0]:
def Get_score(Y_true, Y_pred):
    '''Calculate the Spearmann"s correlation coefficient'''
    Y_pred = np.squeeze(Y_pred)
    Y_true = np.squeeze(Y_true)
    if Y_pred.shape != Y_true.shape:
        print('Input shapes don\'t match!')
    else:
        if len(Y_pred.shape) == 1:
            Res = pd.DataFrame({'Y_true':Y_true,'Y_pred':Y_pred})
            score_mat = Res[['Y_true','Y_pred']].corr(method='spearman',min_periods=1)
            print('The Spearman\'s correlation coefficient is: %.3f' % score_mat.iloc[1][0])
        else:
            for ii in range(Y_pred.shape[1]):
                Get_score(Y_pred[:,ii],Y_true[:,ii])

# Data Acquisition

The initial data acquition step can be found in the file 'LoadDataSaveAsNumpy.ipynb'. In that file the data was loaded and saved as NPY files. In the section 'Load Numpy Files' below I load these files.

The features used in this study are:

*   Captions
*   HMP
*   C3D
*   Labels





## Mapping Drive and Load Packages

I mounted my drive as I was developing this notebook on Google Colabs. 

I imported all relevant packages for later use.

In [3]:
#Mount Drive
from google.colab import drive
import os
drive.mount('/content/drive/')
os.chdir('/content/drive/My Drive/CA684_Assignment/')

#Import Packages
from tensorflow.python.keras import Sequential
from tensorflow.python.keras import layers
from tensorflow.python.keras import regularizers
from tensorflow.python.keras.preprocessing.text import Tokenizer
import pandas as pd
import numpy as np
from numpy import load
from numpy import asarray
from numpy import save

from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

from string import punctuation
from collections import Counter
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

#Random Seed Set To One
from numpy.random import seed
seed(1)


Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive/


## Load Numpy Files

As mentioned previously, the inital data acquisition has been carried out in the notebook 'LoadDataSaveAsNumpy.ipynb'. Here the features and labels are loaded from the NPY files created earlier. Doing this allowed the data to be loaded much quicker than it had been loaded from the original dataset files.

In [0]:
#The column names of the labels and captions as these were lost converting the files to npy.
captions_column_names = ['video','caption']
labels_column_names = ['video','short-term_memorability','nb_short-term_annotations','long-term_memorability','nb_long-term_annotations']

#Loading the features from numpy files.
loaded_caps = load('/content/drive/My Drive/Assignment/Dataset/captions.npy', allow_pickle=True)
loaded_labels = load('/content/drive/My Drive/Assignment/Dataset/labels.npy', allow_pickle=True)
loaded_cd3 = load('/content/drive/My Drive/Assignment/Dataset/c3d.npy', allow_pickle=True)
loaded_hmp = load('/content/drive/My Drive/Assignment/Dataset/hmp.npy', allow_pickle=True)

#Convert Files To Dataframes so they are easier to work with later on
df_cap = pd.DataFrame(data=loaded_caps, columns=captions_column_names)
df_labels = pd.DataFrame(data=loaded_labels, columns=labels_column_names)
df_c3d =  pd.DataFrame(data=loaded_cd3) 
df_hmp =  pd.DataFrame(data=loaded_hmp) 

#Correcting the datatypes as these were lost after converting to numpy.
df_cap = df_cap.infer_objects()
df_labels = df_labels.infer_objects()
df_c3d = df_c3d.infer_objects()
df_hmp = df_hmp.infer_objects()


# Pre-processing

The dataset provided consisted of numerical and text features. The text features were the captions and the numberical features consisted of both HMP and C3d. 

The Machine Learning algorithms used in this study are from Sklearn which work optimally with scaled data. Otherwise, the results from these algorithms could be affected. Therefore, each feature was scaled accoridingly and respective to the type of data.

It is also recommended to convert text to numbers when working with machine learning algorithms. Therefore, the text feature Captions was transformed. This was carried out in three different ways so the results could be compared later on in the study. Captions was transformed with Sequences, One Hot Encoding and TF-IDF. As well as this, the standard text cleaning was applied which involved removing stop words and punctuation.

## Captions Pre-Processing

In other studies of this dataset, captions have been found to perform well for predictions. Therefore, I wanted to get as much use out of the captions as possible. Therefore, I implemented three transformation approaches. Each approach is tested with each machine learning algorithm.

### Cleaning

Currently the captions text looks like the results of the first box below. It would be more useful if I split this text into separate words.
A simple way to do this is to use a Counter object. Using Counter to create the vocabulary I will use for predictions. I  am updating our dataframe df_cap with the cleaned text because I do not need to keep the previous text.
Each word is split and saved in the vocabulary.


In [4]:
#Looking at the first few rows of our captions we can see they require a lot of cleaning
df_cap['caption'].head()

0                   blonde-woman-is-massaged-tilt-down
1    roulette-table-spinning-with-ball-in-closeup-shot
2                                        khr-gangsters
3                 medical-helicopter-hovers-at-airport
4                 couple-relaxing-on-picnic-crane-shot
Name: caption, dtype: object

In [5]:
#Setup our Counter object which can assist with cleaning the captions
vocab = Counter()

#Loop through each caption and clean
for i, capitalLetter in enumerate(df_cap['caption']):
    # Removing dashes in between words and convert words to lowercase.
    text = ''.join([c if c not in punctuation else ' ' for c in capitalLetter]).lower()
    #At each row of iteration i save the updated text
    df_cap.loc[i,'caption'] = text
    vocab.update(text.split())
    
df_cap['caption'].head()

0                   blonde woman is massaged tilt down
1    roulette table spinning with ball in closeup shot
2                                        khr gangsters
3                 medical helicopter hovers at airport
4                 couple relaxing on picnic crane shot
Name: caption, dtype: object

### One-hot Encoding

For machine learning algorithms, it is advised to work with numbers over text. This cell consists of the first approach to transform the data using a technique called ‘One Hot Encoding’.  This simply involves converting the categorical variables into a more machine learning friendly format. Each word will get its own column and so this is going to increase the feature size dramatically, however the benefits of numerical data over categorical are more important here. The main goal of this study is to find an optimum model. 

First, I used the Tokenizer object and fit it to the captions to encode the data. This allowed me to easily apply one hot encoding.

The next step is feature scaling. A test was carried out later down the line with Linear regression and R squared to determine whether this data is linear or non-linear. If the data was linear, I would need to standardise it. It was found to be non-linear, so this requires normalisation, so scaling the data.  I normalised the data to L2 form so all the rows have a unit norm.


In [7]:
#Setting up the tokenizer
num_words = len(vocab)
tokenizer = Tokenizer(num_words=num_words)
captions_list = list(df_cap.caption.values)

#Fitting our Tokenizer on the updated captions
tokenizer.fit_on_texts(captions_list) 

#One-Hot Encoding
one_hot_res = tokenizer.texts_to_matrix(list(df_cap.caption.values),mode='binary')

#Normalising One-Hot Encoding to l2
oh_normalized = normalize(one_hot_res, norm='l2')

#Apply PCA for dimensionality reduction while retaining 95% variance.
pca = PCA(n_components = 0.95)
ohn = pca.fit_transform(oh_normalized)

print( "Shape Of one_hot_res before Normalise and PCA : " , one_hot_res.shape)
print( "Shape Of one_hot_res after Normalise and PCA : " , ohn.shape)

Shape Of one_hot_res before Normalise and PCA :  (6000, 5191)
Shape Of one_hot_res after Normalise and PCA :  (6000, 1778)


### Sequences

Using the Tokenizer, I then converted the Captions text into a sequence of integers. Using 'texts_to_sequences'. I then normalise the sequenced data so it is within a range of 0 and 1 and padded them to a length of 50 with zeros. This will ensure the data is in an appropriate representation for the machine learning algorithms used later on.

We do not need to reduce dimensionality here because X_seq has only 50 features.


In [9]:
#Sequence Encoding
sequences = tokenizer.texts_to_sequences(list(df_cap.caption.values))

#Defining the length I want the sequences all to be
max_len = 50

#Creating empty numpy array full of zeros to the size of 50
X_seq = np.zeros((len(sequences),max_len))
#Loop through for the entire contents of sequences
for i in range(len(sequences)):
    #n is a placeholder which stores the length of seqeunce at the current iteraiton of i
    n = len(sequences[i])
    #If empty do nothing
    if n==0:
        print(i)
    else:
        X_seq[i,-n:] = sequences[i];
vocab_size = len(tokenizer.word_index) + 1

#Normalise the Sequences
seq_normalized = normalize(X_seq, norm='l2')

print( "Shape Of sequences after Normalise and PCA : " , seq_normalized.shape)

Shape Of sequences after Normalise and PCA :  (6000, 50)


### TF-IDF

To implement some feature engineering, I implemented a pipeline. The pipeline builds a dictionary of features with CountVectorizer and then transforms the features with TF-IDF. CountVectorizer takes care of the standard text pre-processing like filtering of stop words. They were then normalised and the dimensionality reduced. As you can see the number of features reduces by almost 3000 after applying PCA, all while retaining 95% variance.

In [10]:
#Create our TF-IDF pipeline
tfidf_pipe = Pipeline( [ 
                       ('count_vec', CountVectorizer()),
                       ('tfidf', TfidfTransformer()),
                      ])
caps = list(df_cap.caption.values)
tfidf = tfidf_pipe.fit_transform(caps)

#Normalising TFIDF data
tfidf_normalized = normalize(tfidf, norm='l2')

#Convert sparse matrix to dense array so can ues PCA
X_normalizedde = tfidf_normalized.todense()

#Retaining 95 % variance.
pca = PCA(n_components = 0.95)
tfidfn = pca.fit_transform(X_normalizedde)

#Print differences
print( "Shape Of tfidf Before Normalise and PCA : " , tfidf.shape)
print( "Shape Of tfidf After Normalise and PCA  : " , tfidfn.shape)


Shape Of tfidf Before Normalise and PCA :  (6000, 5174)
Shape Of tfidf After Normalise and PCA  :  (6000, 2265)


## C3D PCA and Normalise

There are many benefits to dimensionality reduction. C3D was found to have 101 features. It would be possible that my training instances would not be spread out uniformly and therefore training any models on C3D could return biased results. Principal Component Analysis (PCA) is a widely used dimensionality reduction algorithm. I used the Scikit-Learns PCA class, setting a variance requirement of 95%. Using PCA I was able to reduce the dimensions down to 44, so that’s a 44% size reduction from its original size. This is a great improvement from the previous 101 features.

In [10]:
#Normalise the data
c3d_normalized = normalize(df_c3d, norm='l2')

#Apply PCA
pca = PCA(n_components = 0.95)
c3dp = pca.fit_transform(c3d_normalized)

#Print differences
print( "Shape Of C3D Before PCA : " , df_c3d.shape)
print( "Shape Of C3D After PCA : " , c3dp.shape)

Shape Of C3D Before PCA :  (6000, 101)
Shape Of C3D After PCA :  (6000, 50)


## HMP Normalise and PCA

Similar to C3D, HMP requires normalisation before being applied to any model. 

In [11]:
#Normalise the data
hmp_normalized = normalize(df_hmp, norm='l2')

#Apply PCA
pca = PCA(n_components = 0.95)
hmpp = pca.fit_transform(hmp_normalized)

#Print differences
print( "Shape Of C3D Before PCA : " , df_hmp.shape)
print( "Shape Of C3D After PCA : " , hmpp.shape)

Shape Of C3D Before PCA :  (6000, 6075)
Shape Of C3D After PCA :  (6000, 24)


## Combining Features


### Combine Captions, HMP and C3d

Previous years work combined all features and then used them to train the model. To build on this I will combine some features, but I will also apply PCA dimensionality reduction which was not implemented. I will maintain 95% variance. I hope this will allow me to determine the best features while keeping dimensionality reduced.

In [12]:
#Combining HMP, C3D and One hot encoded data into one numpy array
cdd = np.concatenate((hmpp,c3dp,ohn), axis=1)

#Apply PCA to the combination 
pca = PCA(n_components=0.95)
ccdp = pca.fit_transform(cdd)

#Print differences
print("Before: ", cdd.shape)
print("After :", ccdp.shape)


Before:  (6000, 1852)
After : (6000, 887)


### Combine all text features

During my research, I saw several studies which combined multiple pre-computed features together, for example like I have done in the previous cell. To the best of my knowledge it seemed that there was a lack of experiments in terms combining caption transformations. So here I am combining TF-IDF, One Hot Encoding and the Sequences. They all derive from the same feature so it is possible they will not return better results than each individually, however  I thought this was worth testing. 

In [13]:
#Combining all text features
allcaptions = np.concatenate((ohn,seq_normalized,tfidfn), axis=1)

#Applying PCA to text features
pca = PCA(n_components=0.95)
allcaptionsp = pca.fit_transform(allcaptions)

#Print differences
print("Before: ", cdd.shape)
print("After :", ccdp.shape)


Before:  (6000, 1852)
After : (6000, 887)


# Linear Regression

To test whether the data is linear or non-linear we used linear regression. Calculating R Squared of the predictions and the ground truth will allow us to identify this. As you can see the R squared result is negative which is a really bad score. This means that the data does not suit a linear model and therefore could be considered non-linear. This is really important to know going forward, for example, knowing the data is non-linear allows me to apply a kernelised SVR. As you can see from the results below every feature is non-linear expect one. It seems that TF-IDF could be considered linear and therefore we must take this into account when developing models later on down the line.

## Using Sequences

In [112]:
from sklearn.metrics import r2_score

#The following code details the Short Term Score model
#Setting my target vector
Y_s = df_labels['short-term_memorability'].values
#Setting my feature matrix which is the normalised sequences.
X = seq_normalized;
#Splitting my data into training and validation data. Doing an 80% 20% split.
X_train_lr_st, X_test_lr_st, Y_train_lr_st, Y_test_lr_st = train_test_split(X, Y_s, test_size=0.2, random_state=42)

#Creating my Linear Regression model and setting it to run on all available threads so it will run faster.
lr_c_s = LinearRegression(n_jobs=-1)
#Training model on the feature matrix and target vector
lr_c_s.fit(X_train_lr_st, Y_train_lr_st)
#Using my trained model to predict the target vector based on some feature matrix and storing the results
y_pred_rf_st = lr_c_s.predict(X_test_lr_st)
#Calculating the Spearman coefficient score for my models predictions and the validation target vector
Get_score(Y_test_lr_st, y_pred_rf_st)

#The following code details the Long Term Score model
#Setting up my feature matrix and target vector.
Y_l = df_labels['long-term_memorability'].values
X = seq_normalized;
#Creating my training and validation data
X_train_lr_lt, X_test_lr_lt, Y_train_lr_lt, Y_test_lr_lt = train_test_split(X, Y_l, test_size=0.2, random_state=42)

#Implementing model
lr_c_l = LinearRegression(n_jobs=-1)
lr_c_l.fit(X_train_lr_lt, Y_train_lr_lt)
y_pred_rf_lt = lr_c_l.predict(X_test_lr_lt)
#Calculating Sprearman score
Get_score(Y_test_lr_lt, y_pred_rf_lt)

# Test to whether the data is linear or non-linear
print(r2_score(lr_c_s.predict(X_test_lr_st), Y_test_lr_st))
print(r2_score(lr_c_l.predict(X_test_lr_lt), Y_test_lr_lt))

The Spearman's correlation coefficient is: 0.058
The Spearman's correlation coefficient is: 0.011
-0.0008340283350369848
-0.0008340412614020742


## Using Captions(One-Hot Encoded)

Linear Regression model trained on the One Hot Encoded data which was normalised and dimensionality reduced. This model performs the best out of all Linear Regression models for both short- and long-term scores, with 0.319 for short-term and 0.122 for long-term.

In [16]:
#Short Term Score
#Setting up features and target, splitting it into training and validation
Y_s = df_labels['short-term_memorability'].values
#The feature matrix here is the One Hot Encoded data
X = ohn;
X_train_lr_st, X_test_lr_st, Y_train_lr_st, Y_test_lr_st = train_test_split(X, Y_s, test_size=0.2, random_state=42)

#Building a new model which is to run tasks on all available cores
lr_c_os = LinearRegression(n_jobs=-1)
lr_c_os.fit(X_train_lr_st, Y_train_lr_st)
y_pred_rf_st = lr_c_os.predict(X_test_lr_st)
Get_score(Y_test_lr_st, y_pred_rf_st)

#Long Term Score
Y_l = df_labels['long-term_memorability'].values
X = ohn;
X_train_lr_lt, X_test_lr_lt, Y_train_lr_lt, Y_test_lr_lt = train_test_split(X, Y_l, test_size=0.2, random_state=42)

lr_c_ol = LinearRegression(n_jobs=-1)
lr_c_ol.fit(X_train_lr_lt, Y_train_lr_lt)
y_pred_rf_lt = lr_c_ol.predict(X_test_lr_lt)
Get_score( Y_test_lr_lt, y_pred_rf_lt)


# Test to whether the data is linear or non-linear
print(r2_score(lr_c_os.predict(X_test_lr_st), Y_test_lr_st))
print(r2_score(lr_c_ol.predict(X_test_lr_lt), Y_test_lr_lt))

The Spearman's correlation coefficient is: 0.319
The Spearman's correlation coefficient is: 0.122
-5.3512749786932545e-14
-3.774758283725532e-15


## Using TF-IDF

This Linear Regression model is training on the normalised TF-IDF data. This model does not perform as well as the one hot encoded data but performs better than the sequences. 

Previous to PCA being applied the spearman score was 0.225 for short term and 0.085 for long term. After PCA was applied these scores actually increased to 0.255 and  0.085.

It is important to take note of the R squared results here. As we can see they results came back as positive integers compared to all other Linear Regression Models trained on the other features. It is possible that the TF-IDF data is linear. Whereas the rest of the features can be considered non-linear due to their negative scores. We would expect TF-IDF to perform better on Linear models.

In [66]:
from sklearn.metrics import r2_score

#Short Term Score
#Setting up features and target, splitting it into training and validation
Y_s = df_labels['short-term_memorability'].values
#The feature matrix is the normalised TF-IDF data
X = tfidfn
X_train_lr_st, X_test_lr_st, Y_train_lr_st, Y_test_lr_st = train_test_split(X, Y_s, test_size=0.2, random_state=42)

lr_tfs = LinearRegression(n_jobs=-1)
lr_tfs.fit(X_train_lr_st, Y_train_lr_st)
y_pred_lrs = lr_tfs.predict(X_test_lr_st)
Get_score(Y_test_lr_st, y_pred_lrs)

#Long Term Score
Y_l = df_labels[ 'long-term_memorability'].values
#The feature matrix is the normalised TF-IDF data
X = tfidfn
X_train_lr_lt, X_test_lr_lt, Y_train_lr_lt, Y_test_lr_lt = train_test_split(X, Y_l, test_size=0.2, random_state=42)

lr_tfl = LinearRegression(n_jobs=-1)
lr_tfl.fit(X_train_lr_lt, Y_train_lr_lt)
y_pred_lrl = lr_tfl.predict(X_test_lr_lt)
Get_score(Y_test_lr_lt, y_pred_lrl)
 
# Test to whether the data is linear or non-linear
print(r2_score(y_pred_lrs, Y_test_lr_st))
print(r2_score(y_pred_lrl, Y_test_lr_lt))

The Spearman's correlation coefficient is: 0.257
The Spearman's correlation coefficient is: 0.099
5.6621374255882984e-14
8.881784197001252e-16


## Using C3D

Building a linear regression model for short- and long-term seperately. This model is trained on the C3D data which PCA was applied.

In [18]:
#Short-Term Memorability
#Preparing data
Y_s = df_labels['short-term_memorability'].values
X = c3dp
X_train_lr_st, X_test_lr_st, Y_train_lr_st, Y_test_lr_st = train_test_split(X, Y_s, test_size=0.2, random_state=42)

#Implementing model
lr_c = LinearRegression(n_jobs=-1)
lr_c.fit(X_train_lr_st, Y_train_lr_st)
y_pred_lrs = lr_c.predict(X_test_lr_st)
Get_score(Y_test_lr_st, y_pred_lrs)

#Long-Term Memorability
#Preparing data
Y_l = df_labels[ 'long-term_memorability'].values
X = c3dp
X_train_lr_lt, X_test_lr_lt, Y_train_lr_lt, Y_test_lr_lt = train_test_split(X, Y_l, test_size=0.2, random_state=42)

#Implementing model
lr_l = LinearRegression(n_jobs=-1)
lr_l.fit(X_train_lr_lt, Y_train_lr_lt)
y_pred_lrl = lr_l.predict(X_test_lr_lt)
Get_score(Y_test_lr_lt, y_pred_lrl)


# Test to whether the data is linear or non-linear
print(r2_score(y_pred_lrs, Y_test_lr_st))
print(r2_score(y_pred_lrl, Y_test_lr_lt))

The Spearman's correlation coefficient is: 0.288
The Spearman's correlation coefficient is: 0.116
-7.410909888717088
-34.46574790420082


## Using HMP

Linear regression model is trained on the HMP dataset. PCA was applied to this data. The HMP data does not perform well as a feature matrix making this model the second worst out of all Linear Regression models.

In [19]:
#Short-Term Memorability
Y_s = df_labels['short-term_memorability'].values
X = hmpp
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X, Y_s, test_size=0.2, random_state=42)

lr_c = LinearRegression(n_jobs=-1)
lr_c.fit(X_train_s, Y_train_s)
y_pred_s = lr_c.predict(X_test_s)
Get_score(Y_test_s, y_pred_s)

#Long-Term Memorability
Y_l = df_labels[ 'long-term_memorability'].values
X = hmpp
X_train_lh, X_test_lh, Y_train_lh, Y_test_lh = train_test_split(X, Y_l, test_size=0.2, random_state=42)

lr_l = LinearRegression(n_jobs=-1)
lr_l.fit(X_train_lh, Y_train_lh)
y_pred_l = lr_l.predict(X_test_lh)
Get_score(Y_test_lh, y_pred_l)



# Test to whether the data is linear or non-linear
print(r2_score(y_pred_s, Y_test_s))
print(r2_score(y_pred_l, Y_test_lh))

The Spearman's correlation coefficient is: 0.250
The Spearman's correlation coefficient is: 0.114
-10.923953658632914
-43.37442228855988


##  Using Combination Of Captions, C3D and HMP

This model outperforms all other linear regression models in both short and long term.

In [20]:
#Short-Term
#Preparing data
Y_s = df_labels['short-term_memorability'].values
X = ccdp
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X, Y_s, test_size=0.2, random_state=42)

#Implementing Model
lr_cddps = LinearRegression(n_jobs=-1)
lr_cddps.fit(X_train_s, Y_train_s)
y_pred_s = lr_cddps.predict(X_test_s)
Get_score(Y_test_s, y_pred_s)

#Long-Term
#Preparing data
Y_s = df_labels['long-term_memorability'].values
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y_s, test_size=0.2, random_state=42)

#Implementing Model
lr_cddpl = LinearRegression(n_jobs=-1)
lr_cddpl.fit(X_train_l, Y_train_l)
Y_pred_l = lr_cddpl.predict(X_test_l)
Get_score(Y_test_l, Y_pred_l)

The Spearman's correlation coefficient is: 0.415
The Spearman's correlation coefficient is: 0.172


## Using Combination of Caption Transformations

This model trained on all the text transformations does not perform as well as the comination model above. However, it performs better than each of the individual text models. 

In [98]:
#Short-Term
#Preparing data
Y_s = df_labels['short-term_memorability'].values
X = allcaptionsp
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X, Y_s, test_size=0.2, random_state=42)

#Implementing Model
lr_alltexts = LinearRegression(n_jobs=-1)
lr_alltexts.fit(X_train_s, Y_train_s)
y_pred_s = lr_alltexts.predict(X_test_s)
Get_score(Y_test_s, y_pred_s)

#Long-Term
#Preparing data
Y_l = df_labels['long-term_memorability'].values
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y_l, test_size=0.2, random_state=42)

#Implementing Model
lr_alltextsl = LinearRegression(n_jobs=-1)
lr_alltextsl.fit(X_train_l, Y_train_l)
Y_pred_l = lr_alltextsl.predict(X_test_l)
Get_score(Y_test_l, Y_pred_l)

The Spearman's correlation coefficient is: 0.362
The Spearman's correlation coefficient is: 0.150


# Ridge Regression

Moving on from Linear regression I implemented Ridge regression which is a regularised version of Linear Regression. This approach will allow me to control the weights of the data. As this is a regularized model it is important to ensure the data has been normalised.

## Using Sequences

In [21]:
#Short-Term Prediction
#Preparing the data
Y = df_labels['short-term_memorability'].values
X = seq_normalized;
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X, Y, test_size=0.2, random_state=42)

#Implementing Ridge model wiht aplha 2
r_c_s = Ridge(alpha = 2.0)
r_c_s.fit(X_train_s, Y_train_s)
Y_pred_s = r_c_s.predict(X_test_s)
Get_score(Y_test_s, Y_pred_s)

#Long-Term Prediction
#Preparing the data
Y = df_labels['long-term_memorability'].values
X = seq_normalized;
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y, test_size=0.2, random_state=42)

#Implementing Ridge model wiht aplha 2
r_c_l = Ridge(alpha = 2.0)
r_c_l.fit(X_train_l, Y_train_l)
Y_pred_l = r_c_l.predict(X_test_l)
Get_score(Y_test_l, Y_pred_l)

The Spearman's correlation coefficient is: 0.058
The Spearman's correlation coefficient is: 0.015


## Using Captions (One-Hot Encoded)

This Ridge regression model is trained on normalised One Hot Encoded data. It scored very high with 0.445 and 0.166. The model performed poorer on the original data with scores of: 0.405 and 0.169 

In [80]:
from sklearn.linear_model import Ridge
#Short-Term
Y = df_labels['short-term_memorability'].values
X = ohn;
X_train_lr_st, X_test_lr_st, Y_train_lr_st, Y_test_lr_st = train_test_split(X, Y, test_size=0.2, random_state=42)

r_c_s = Ridge(alpha = 2.0)
r_c_s.fit(X_train_lr_st, Y_train_lr_st)
y_pred_rf_st = r_c_s.predict(X_test_lr_st)
Get_score(Y_test_lr_st, y_pred_rf_st)


#Long-Term
Y = df_labels['long-term_memorability'].values
X = ohn;
X_train_lr_lt, X_test_lr_lt, Y_train_lr_lt, Y_test_lr_lt = train_test_split(X, Y, test_size=0.2, random_state=42)

r_c_l = Ridge(alpha = 2.0)
r_c_l.fit(X_train_lr_lt, Y_train_lr_lt)
y_pred_r_lt = r_c_l.predict(X_test_lr_lt)
Get_score(Y_test_lr_lt, y_pred_r_lt)

The Spearman's correlation coefficient is: 0.445
The Spearman's correlation coefficient is: 0.169


## Using TF-IDF

We can see that using TF-IDF for the long term predictions performs much better than using the one hot encoded data or the sequences. However there is very little difference for the short term score with 0.001 of a difference. Normalizing the data did not make much of a different in terms of performance. Before normalising the scores were 0.445 and 0.192.

In [11]:
from sklearn.linear_model import Ridge

#Short term
Y_s = df_labels['short-term_memorability'].values
X = tfidfn
X_train_r_st, X_test_r_st, Y_train_r_st, Y_test_r_st = train_test_split(X, Y_s, test_size=0.2, random_state=42)

#Increasing the alpha 
r_tfs = Ridge(alpha = 3.0)
r_tfs.fit(X_train_r_st, Y_train_r_st)
y_pred_r_st = r_tfs.predict(X_test_r_st)
Get_score(Y_test_r_st, y_pred_r_st)

#Long term
Y_l = df_labels['long-term_memorability'].values
X = tfidfn;
X_train_r_lt, X_test_r_lt, Y_train_r_lt, Y_test_r_lt = train_test_split(X, Y_l, test_size=0.2, random_state=42)

#Increasing the alpha 
r_tfl = Ridge(alpha = 3.0)
r_tfl.fit(X_train_r_lt, Y_train_r_lt)
y_pred_r_lt = r_tfl.predict(X_test_r_lt)
Get_score(Y_test_r_lt, y_pred_r_lt)


The Spearman's correlation coefficient is: 0.446
The Spearman's correlation coefficient is: 0.191


## Using C3D

In [24]:
from sklearn.linear_model import Ridge

#Short-Term
#Preparing the data
Y_s = df_labels['short-term_memorability'].values
X = c3dp
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X, Y_s, test_size=0.2, random_state=42)

#Implementing the model
r_c = Ridge(alpha = 3.0)
r_c.fit(X_train_s, Y_train_s)
y_pred_s = r_c.predict(X_test_s)
Get_score(Y_test_s, y_pred_s)

#Long-Term
#Preparing the data
Y_l = df_labels['long-term_memorability'].values
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y_l, test_size=0.2, random_state=42)

#Implementing the model
r_l = Ridge(alpha = 3.0)
r_l.fit(X_train_l, Y_train_l)
y_pred_l = r_l.predict(X_test_l)
Get_score(Y_test_l, y_pred_l)


The Spearman's correlation coefficient is: 0.287
The Spearman's correlation coefficient is: 0.118


## Using HMP

In [25]:
from sklearn.linear_model import Ridge

#Short-Term 
#Preparing the data
Y_s = df_labels['short-term_memorability'].values
X = hmpp
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X, Y_s, test_size=0.2, random_state=42)

#Implementing the model
r_h = Ridge(alpha = 3.0)
r_h.fit(X_train_s, Y_train_s)
y_pred_s = r_h.predict(X_test_s)
Get_score(Y_test_s, y_pred_s)

#Long-Term
#Preparing the data
Y_l = df_labels['long-term_memorability'].values
X = hmpp;
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y_l, test_size=0.2, random_state=42)

#Implementing the model
r_hl = Ridge(alpha = 3.0)
r_hl.fit(X_train_l, Y_train_l)
y_pred_ll = r_hl.predict(X_test_l)
Get_score(Y_test_l, y_pred_ll)


The Spearman's correlation coefficient is: 0.253
The Spearman's correlation coefficient is: 0.114


##  Using Combination Of Captions, C3D and HMP

This model is trained the combination of captions, C3D and HMP. This model performs the best for the long term memorability score. It outperforms all other models trained on this data.

In [57]:
#Short-Term 
#Preparing the data
Y_s = df_labels['short-term_memorability'].values
X = ccdp
X_train_lr_st, X_test_lr_st, Y_train_lr_st, Y_test_lr_st = train_test_split(X, Y_s, test_size=0.2, random_state=42)

#Implementing the model
r_ccd_s = Ridge(alpha = 2.0)
r_ccd_s.fit(X_train_lr_st, Y_train_lr_st)
y_pred_rf_st = r_ccd_s.predict(X_test_lr_st)
Get_score(Y_test_lr_st, y_pred_rf_st)


#Long-Term
#Preparing the data
Y_l = df_labels['long-term_memorability'].values
X = ccdp
X_train_lr_lt, X_test_lr_lt, Y_train_lr_lt, Y_test_lr_lt = train_test_split(X, Y_l, test_size=0.2, random_state=42)

#Implementing the model
r_ccd_l = Ridge(alpha = 2.0)
r_ccd_l.fit(X_train_lr_lt, Y_train_lr_lt)
y_pred_rf_st = r_ccd_l.predict(X_test_lr_lt)
Get_score(Y_test_lr_lt, y_pred_rf_st)

The Spearman's correlation coefficient is: 0.443
The Spearman's correlation coefficient is: 0.176


## Using Combination Of Caption Transformations

This combination does not perform as well here compared to the individual transformations.

In [99]:
#Short-Term 
#Preparing the data
Y_s = df_labels['short-term_memorability'].values
X = allcaptionsp
X_train_lr_st, X_test_lr_st, Y_train_lr_st, Y_test_lr_st = train_test_split(X, Y_s, test_size=0.2, random_state=42)

#Implementing the model
r_alltexts = Ridge(alpha = 2.0)
r_alltexts.fit(X_train_lr_st, Y_train_lr_st)
y_pred_alls = r_alltexts.predict(X_test_lr_st)
Get_score(Y_test_lr_st, y_pred_alls)


#Long-Term
#Preparing the data
Y_l = df_labels['long-term_memorability'].values
X_train_lr_lt, X_test_lr_lt, Y_train_lr_lt, Y_test_lr_lt = train_test_split(X, Y_l, test_size=0.2, random_state=42)

#Implementing the model
r_alltextsl = Ridge(alpha = 2.0)
r_alltextsl.fit(X_train_lr_lt, Y_train_lr_lt)
y_pred_alll = r_alltextsl.predict(X_test_lr_lt)
Get_score(Y_test_lr_lt, y_pred_alll)

The Spearman's correlation coefficient is: 0.436
The Spearman's correlation coefficient is: 0.168


# Decision Tree Regression 

Multiple decision tree regression models were implemented. Unfortunately, as you will see from the results, these models performed very poorly. In fact, they performed the worst out of all models. One solution to a model underperforming is to move onto a more complicated model. So, later on in the notebook you will see that I implemented Random Forest which yields much better results.

##  Using Sequences

For the decision tree model, the end nodes are our target vectors, the memorability scores. 


In [14]:
#Short-Term 
Y_s = df_labels['short-term_memorability'].values 
X = seq_normalized 
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_s, test_size=0.2, random_state=42)

#Creating Decision tree model
dt_s = DecisionTreeRegressor(random_state = 0,  max_depth=17)  
dt_s.fit(X_train, Y_train) 
Y_pred_s = dt_s.predict(X_test) 
Get_score(Y_test, Y_pred_s) 

#Long-Term
Y_l = df_labels['long-term_memorability'].values 
X = seq_normalized 
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y_l, test_size=0.2, random_state=42)


#Creating Decision tree model
dt_l = DecisionTreeRegressor(random_state = 0,  max_depth=17)  
dt_l.fit(X_train_l, Y_train_l) 
Y_pred_l = dt_l.predict(X_test_l) 
Get_score(Y_test_l, Y_pred_l) 


The Spearman's correlation coefficient is: 0.117
The Spearman's correlation coefficient is: 0.066


## Using Captions(One-hot Encoded)

If deicison trees are not restricted in terms of tree depth when they are growing, they will more than likely overfit the data. It is very important to test different tree depths out, choosing one that will result in a tree not too complex. 

I wanted to test the performance of the model on normalised and non normalised data. Applying this model to the one hot encoded data that was normalised but did not have PCA applied to it resulted in really good Spearman scores. Short term got 0.295 and long term got 0.044.

When the model is applied to the normalised data the scores are 0.167
and 0.058 respectivly.

Here letting the tree grow unrestricted actually resulted in a lower Sprearman score. This is probably due to the fact that the tree would fit too closely to the trianing data and then when it was predicting based on data it had not seen before it could not predict as well.

#### Using Original Data( Not normalised or pca applied)

In [28]:
#Short-Term 
#Preparing the data
Y_s = df_labels['short-term_memorability'].values 
X = one_hot_res 
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_s, test_size=0.2, random_state=42)
  
#Implementing restricted decision tree.
dt_oh = DecisionTreeRegressor(random_state = 0, max_depth=17)  
dt_oh.fit(X_train, Y_train) 
Y_pred_s = dt_oh.predict(X_test) 
Get_score(Y_test, Y_pred_s) 

#Long-Term
#Preparing the data
Y_l = df_labels['long-term_memorability'].values 
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y_l, test_size=0.2, random_state=42)
  
#Implementing restricted decision tree.
dt_oh = DecisionTreeRegressor(random_state = 0,max_depth=17)  
dt_oh.fit(X_train_l, Y_train_l) 
Y_pred_l = dt_oh.predict(X_test_l) 
Get_score(Y_test_l, Y_pred_l) 

The Spearman's correlation coefficient is: 0.293
The Spearman's correlation coefficient is: 0.070


### Using Normalised Data

In [29]:
#Short-Term 
#Preparing the data
Y_s = df_labels['short-term_memorability'].values 
X = ohn
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_s, test_size=0.2, random_state=42)

#Implementing restricted decision tree.
dt_oh = DecisionTreeRegressor(random_state = 0, max_depth=17)  
dt_oh.fit(X_train, Y_train) 
Y_pred_s = dt_oh.predict(X_test) 
Get_score(Y_test, Y_pred_s) 

#Long-Term
#Preparing the data
Y_l = df_labels['long-term_memorability'].values 
X = ohn
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y_l, test_size=0.2, random_state=42)

#Implementing restricted decision tree.
dt_oh = DecisionTreeRegressor(random_state = 0,max_depth=17)  
dt_oh.fit(X_train_l, Y_train_l) 
Y_pred_l = dt_oh.predict(X_test_l) 
Get_score(Y_test_l, Y_pred_l) 

The Spearman's correlation coefficient is: 0.167
The Spearman's correlation coefficient is: 0.058


## Using TF-IDF

The Spearman's correlation coefficient is: 0.231
The Spearman's correlation coefficient is: 0.048

In [81]:
#Short Term
Y = df_labels['short-term_memorability'].values 
X = tfidfn
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X, Y, test_size=0.2, random_state=42)

#Implementing the model
dt_t = DecisionTreeRegressor(random_state = 0, max_depth=17)  
dt_t.fit(X_train_s, Y_train_s) 
Y_pred_s = dt_t.predict(X_test_s) 
Get_score(Y_test_s, Y_pred_s) 

#Long Term
Y = df_labels['long-term_memorability'].values 
X = tfidfn
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y, test_size=0.2, random_state=42)

#Implementing the model
dt_t = DecisionTreeRegressor(random_state = 0, max_depth=17)  
dt_t.fit(X_train_l, Y_train_l) 
Y_pred_l = dt_t.predict(X_test_l) 
Get_score(Y_test_l, Y_pred_l) 

The Spearman's correlation coefficient is: 0.174
The Spearman's correlation coefficient is: 0.011


## Using C3D 

C3D has consistently performed poorly so the low scores for short and long term are expected.

In [31]:
Y = df_labels[['short-term_memorability','long-term_memorability']].values 
X = c3dp 
X_train_dt_st, X_test_dt_st, Y_train_dt_st, Y_test_dt_st = train_test_split(X, Y, test_size=0.2, random_state=42)
  
dt_cd = DecisionTreeRegressor(random_state = 42, max_depth=20)  
dt_cd.fit(X_train_dt_st, Y_train_dt_st) 
Y_pred_dt_oh = dt_cd.predict(X_test_dt_st) 
Get_score(Y_test_dt_st, Y_pred_dt_oh) 


The Spearman's correlation coefficient is: 0.075
The Spearman's correlation coefficient is: 0.033


## Using HMP

In [32]:
Y = df_labels[['short-term_memorability','long-term_memorability']].values 
X = hmpp 
X_train_dt_st, X_test_dt_st, Y_train_dt_st, Y_test_dt_st = train_test_split(X, Y, test_size=0.2, random_state=42)
  
dt_cd = DecisionTreeRegressor(random_state = 42, max_depth=20)  
dt_cd.fit(X_train_dt_st, Y_train_dt_st) 
Y_pred_dt_oh = dt_cd.predict(X_test_dt_st) 
Get_score(Y_test_dt_st, Y_pred_dt_oh) 


The Spearman's correlation coefficient is: 0.085
The Spearman's correlation coefficient is: 0.019


##  Using Combination Of Captions, C3D and HMP

This decision tree is trained on the combination of Captions, C3D and HMP. Not restricting the tree depth results in a lower Spearman score, due to decision trees overfitting issue.
4 was found to be the optimum tree depth. Higher depths resulted in a lower Spearman score.

In [33]:
#Short and long term 
#Preparing the data
Y = df_labels[['short-term_memorability','long-term_memorability']].values 
X = ccdp 
X_train_sl, X_test_sl, Y_train_sl, Y_test_sl = train_test_split(X, Y, test_size=0.2, random_state=42)

#Implementing the model
dt_cda = DecisionTreeRegressor(random_state = 42, max_depth=4)  
dt_cda.fit(X_train_sl, Y_train_sl) 
Y_pred_dt_a = dt_cda.predict(X_test_sl) 
Get_score(Y_test_sl, Y_pred_dt_a) 


The Spearman's correlation coefficient is: 0.270
The Spearman's correlation coefficient is: 0.107


## Using Combination of Caption Tranformations

This model doesnt perform overly well for short-term scores. It is the second highest scorer of decision trees in terms of long-term scores.

In [108]:
#Short and long term 
#Preparing the data
Y = df_labels[['short-term_memorability','long-term_memorability']].values 
X = allcaptionsp 
X_train_sl, X_test_sl, Y_train_sl, Y_test_sl = train_test_split(X, Y, test_size=0.2, random_state=42)

#Implementing the model
df_alltext = DecisionTreeRegressor(random_state = 42, max_depth=3)  
df_alltext.fit(X_train_sl, Y_train_sl) 
Y_pred_alls = df_alltext.predict(X_test_sl) 
Get_score(Y_test_sl, Y_pred_alls) 


The Spearman's correlation coefficient is: 0.210
The Spearman's correlation coefficient is: 0.090


# Support Vector Machine

To improve the best performing SVR model I wanted to change the sampling method. I wanted to test whether bootstrapping over randomly sampling would improve my SVR model. Below you will see the subheadings for the model with bootstrapping and without bootstrapping. 

## Using Sequences

In [13]:
from sklearn.svm import SVR

#Short-Term memorability
Y_st = df_labels['short-term_memorability'].values
X = seq_normalized;
X_train_svr_st, X_test_svr_st, Y_train_svr_st, Y_test_svr_st = train_test_split(X , Y_st, test_size=0.2, random_state=42)

svr_st_c = SVR(C=1.0, cache_size=200, coef0=0.0, degree=3 , epsilon=0.1, gamma='scale', kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svr_st_c.fit(X_train_svr_st, Y_train_svr_st)
y_pred_svr_st = svr_st_c.predict(X_test_svr_st)
Get_score(Y_test_svr_st,y_pred_svr_st )


#Long-Term Memorability
Y_lt = df_labels['long-term_memorability'].values
X_train_svr_lt, X_test_svr_lt, Y_train_svr_lt, Y_test_svr_lt = train_test_split(X , Y_lt, test_size=0.2, random_state=109)

svr_lt_c = SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale', kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svr_lt_c.fit(X_train_svr_lt, Y_train_svr_lt)
y_pred_svr_lt = svr_lt_c.predict(X_test_svr_lt)
Get_score(y_pred_svr_lt, Y_test_svr_lt )

The Spearman's correlation coefficient is: 0.070
The Spearman's correlation coefficient is: 0.062


## Using Captions (One Hot Encoding )

Bootstraping was tested on the Support Vector Machine for the one hot encoded data. It was found that the model did performed  better with bootstraping.

### Without Bootstraping

This model was tested on both normalised and the raw one hot encoded data. The model performs better on the normalised data with a short-term score of 0.416 and long-term of 0.180. The scores on the original data was 0.376 and 0.161 respectively.


In [81]:
from sklearn.svm import SVR

#short term memorability
Y_st = df_labels['short-term_memorability'].values
#X_oh = one_hot_res;
X_oh = ohn
X_train_svr_st, X_test_svr_st, Y_train_svr_st, Y_test_svr_st = train_test_split(X_oh, Y_st, test_size=0.2, random_state=42)

svr_st = SVR(kernel="rbf", degree=2, C=100, epsilon=0.1)
svr_st.fit(X_train_svr_st, Y_train_svr_st)
y_pred_svr_st = svr_st.predict(X_test_svr_st)
Get_score(Y_test_svr_st, y_pred_svr_st)


#long term
Y_lt = df_labels['long-term_memorability'].values
X_train_svr_lt, X_test_svr_lt, Y_train_svr_lt, Y_test_svr_lt = train_test_split(X_oh, Y_lt, test_size=0.2, random_state=42)

svr_lt = SVR(kernel="rbf", degree=2, C=100, epsilon=0.1)
svr_lt.fit(X_train_svr_lt, Y_train_svr_lt)
y_pred_svr_lt = svr_lt.predict(X_test_svr_lt)
Get_score(Y_test_svr_lt, y_pred_svr_lt)



The Spearman's correlation coefficient is: 0.416
The Spearman's correlation coefficient is: 0.180


### With bootstraping

SVM performs better with bootstrapping enabled.

In [22]:
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR

#short term memorability
Y_st = df_labels['short-term_memorability'].values
#X_oh = one_hot_res;
X_oh = ohn
X_train_svr_st, X_test_svr_st, Y_train_svr_st, Y_test_svr_st = train_test_split(X_oh, Y_st, test_size=0.2, random_state=42)

svr_st = SVR(kernel="rbf", degree=2, C=100, epsilon=0.1)
n_estimators = 2
svm_bag = BaggingRegressor(svr_st, n_estimators=n_estimators, bootstrap=True)
svm_bag.fit(X_train_svr_st, Y_train_svr_st)
y_pred_svr_st = svm_bag.predict(X_test_svr_st)
Get_score(Y_test_svr_st, y_pred_svr_st)

#long term
Y_lt = df_labels['long-term_memorability'].values
X_train_svr_lt, X_test_svr_lt, Y_train_svr_lt, Y_test_svr_lt = train_test_split(X_oh, Y_lt, test_size=0.2, random_state=42)

svr_lt = SVR(kernel="rbf", degree=2, C=100, epsilon=0.1)
n_estimators = 2
svm_bag_l = BaggingRegressor(svr_lt, n_estimators=n_estimators, bootstrap=True)
svm_bag_l.fit(X_train_svr_lt, Y_train_svr_lt)
y_pred_svr_lt = svm_bag_l.predict(X_test_svr_lt)
Get_score(Y_test_svr_lt, y_pred_svr_st)


The Spearman's correlation coefficient is: 0.413
The Spearman's correlation coefficient is: 0.170


## Using TF-IDF

SVR with Tf-IDF is one of the best performing models so far. In an attempt to improve the model even further I used Grid Search to find the optimum hyperparameter values. The short-term score was 0.436 and the long-term score was 0.160 before tuning. After the tuned parameters were used the scores were

In [12]:
from sklearn.svm import SVR

#Short-Term Memorability
#Preparing Data
Y_st = df_labels['short-term_memorability'].values
X = tfidfn;
X_train_svr_st, X_test_svr_st, Y_train_svr_st, Y_test_svr_st = train_test_split(X, Y_st, test_size=0.2, random_state=42)

#Implementing Model
svr_st_t = SVR(kernel="poly", degree=3, C=100, epsilon=0.1)
svr_st_t.fit(X_train_svr_st, Y_train_svr_st)
y_pred_svr_st = svr_st_t.predict(X_test_svr_st)
Get_score(Y_test_svr_st, y_pred_svr_st)

#Long-Term Memorability
#Preparing Data
Y_lt = df_labels['long-term_memorability'].values
X_train_svr_lt, X_test_svr_lt, Y_train_svr_lt, Y_test_svr_lt = train_test_split(X, Y_lt, test_size=0.2, random_state=109)

#Implementing Model
svr_lt_t = SVR(kernel="poly", degree=3, C=100, epsilon=0.1)
svr_lt_t.fit(X_train_svr_lt, Y_train_svr_lt)
y_pred_svr_lt = svr_lt_t.predict(X_test_svr_lt)
Get_score(y_pred_svr_lt, Y_test_svr_lt )

The Spearman's correlation coefficient is: 0.436
The Spearman's correlation coefficient is: 0.160


### Optimise with Grid Search

I am using grid search to find the optimum combinations of the hyperparameters available for SVR. Because I am mostly concerned with Spearman score in this study I have created my own scorer. This will be used by Grid Search CV to evaluate the predictions on the test set. 

In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

#Defining my Spearman score function. This one is slightly different than the one seen previously.
def cv_Spearman(Y_true, Y_pred):
    '''Calculate the Spearmann"s correlation coefficient'''
    Y_pred = np.squeeze(Y_pred)
    Y_true = np.squeeze(Y_true)
    if Y_pred.shape != Y_true.shape:
        print('Input shapes don\'t match!')
    else:
        if len(Y_pred.shape) == 1:
            Res = pd.DataFrame({'Y_true':Y_true,'Y_pred':Y_pred})
            score_mat = Res[['Y_true','Y_pred']].corr(method='spearman',min_periods=1)
            #Return the score only, no text because valid scorer must only return single value.
            return score_mat.iloc[1][0]
        else:
            for ii in range(Y_pred.shape[1]):
                cv_Spearman(Y_pred[:,ii],Y_true[:,ii])

#Making my Spearman function an actual scorer than can be used.
my_scorer = make_scorer(cv_Spearman, greater_is_better=True)

#Find best hyperparameters for short term SVR model which was already trained above
parameterss = {'kernel':('linear', 'rbf', 'poly'),'degree':[1, 2, 3] ,'C':[1, 10, 100], 'epsilon':[1.0, 0.1, 0.5]}
gridmodels = svr_st_t
gridsearch_svr = GridSearchCV(gridmodels, parameterss, scoring=my_scorer, n_jobs=-1)
gridsearch_svr.fit(X_train_svr_st, Y_train_svr_st)


#Find best hyperparameters for long-term SVR model which was already trained above
parametersl = {'kernel':('linear', 'rbf', 'poly'),'degree':[1, 2, 3] ,'C':[1, 10, 100], 'epsilon':[1.0, 0.1, 0.5]}
gridmodell = svr_lt_t
gridsearch_svrl = GridSearchCV(gridmodell, parametersl, scoring=my_scorer, n_jobs=-1)
gridsearch_svrl.fit(X_train_svr_st, Y_train_svr_st)

print("Short-Term Best Params: ", gridsearch_svr.best_params_)
print("Long-Term Best Params: ", gridsearch_svrl.best_params_)

Short-Term Best Params:  {'C': 1, 'degree': 2, 'epsilon': 0.1, 'kernel': 'poly'}
Long-Term Best Params:  {'C': 1, 'degree': 2, 'epsilon': 0.1, 'kernel': 'poly'}


### Retrain model using best hyperparameters

Now that I have the best hyperparameters values for my models I will change these and retrain the model. Because I defined my own scorer to be Spearman score the scores should improve somewhat. 

As you can see from the results of the previous cell, the optimum combination of hyperparameters were:

{'C': 1, 'degree': 2, 'epsilon': 0.1, 'kernel': 'poly'}

Unfortunetly this does not actually improve the results.

In [29]:
from sklearn.svm import SVR

#Short-Term Memorability
#Preparing Data
Y_st = df_labels['short-term_memorability'].values
X = tfidfn;
X_train_svr_st, X_test_svr_st, Y_train_svr_st, Y_test_svr_st = train_test_split(X, Y_st, test_size=0.2, random_state=42)

#Implementing model
svr_s_optim = SVR(kernel="poly", degree=2, C=1, epsilon=0.1)
svr_s_optim.fit(X_train_svr_st, Y_train_svr_st)
y_pred_svr_st = svr_s_optim.predict(X_test_svr_st)
Get_score(Y_test_svr_st, y_pred_svr_st)


#Long-Term Memorability
#Preparing Data
Y_lt = df_labels['long-term_memorability'].values
X_train_svr_lt, X_test_svr_lt, Y_train_svr_lt, Y_test_svr_lt = train_test_split(X, Y_lt, test_size=0.2, random_state=109)

#Implementing model
svr_l_optim = SVR(kernel="poly", degree=2, C=1, epsilon=0.1)
svr_l_optim.fit(X_train_svr_lt, Y_train_svr_lt)
y_pred_svr_lt = svr_l_optim.predict(X_test_svr_lt)
Get_score(y_pred_svr_lt, Y_test_svr_lt )

The Spearman's correlation coefficient is: 0.430
The Spearman's correlation coefficient is: 0.151


## Using C3D

Applying the pca reduced C3D features to my SVR model improved the Spearman score, however it is still not at a high level.


In [38]:
from sklearn.svm import SVR

#Short-Term Memorability
#Preparing Data
Y_st = df_labels['short-term_memorability'].values
X_c = c3dp;
X_train_svr_st, X_test_svr_st, Y_train_svr_st, Y_test_svr_st = train_test_split(X_c, Y_st, test_size=0.2, random_state=42)

#Implementing model
#svr_st_c3d = SVR(kernel="rbf", degree=3, C=200, epsilon=0.1)
svr_st_c = SVR(C=1.0, cache_size=200, coef0=0.0, degree=3 , epsilon=0.1, gamma='scale', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svr_st_c.fit(X_train_svr_st, Y_train_svr_st)
y_pred_svr_st = svr_st_c.predict(X_test_svr_st)
Get_score(Y_test_svr_st, y_pred_svr_st )


#Long-Term Memorability
#Preparing Data
Y_lt = df_labels['long-term_memorability'].values
X_train_svr_lt, X_test_svr_lt, Y_train_svr_lt, Y_test_svr_lt = train_test_split(X_c, Y_lt, test_size=0.2, random_state=109)

#Implementing model
#svr_lt_c3d = SVR(kernel="rbf", degree=3, C=200, epsilon=0.1)
svr_lt_c = SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svr_lt_c.fit(X_train_svr_lt, Y_train_svr_lt)
y_pred_svr_lt = svr_lt_c.predict(X_test_svr_lt)
Get_score(Y_test_svr_lt, y_pred_svr_lt )


The Spearman's correlation coefficient is: 0.247
The Spearman's correlation coefficient is: 0.052


## Using HMP

In [39]:
from sklearn.svm import SVR


#Short-Term Memorability
#Preparing Data
Y_st = df_labels['short-term_memorability'].values
X = hmpp;
X_train_svr_st, X_test_svr_st, Y_train_svr_st, Y_test_svr_st = train_test_split(X , Y_st, test_size=0.2, random_state=42)

#Implementing model
#svr_st_c3d = SVR(kernel="rbf", degree=3, C=200, epsilon=0.1)
svr_st_c = SVR(C=1.0, cache_size=200, coef0=0.0, degree=3 , epsilon=0.1, gamma='scale', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svr_st_c.fit(X_train_svr_st, Y_train_svr_st)
y_pred_svr_st = svr_st_c.predict(X_test_svr_st)
Get_score(Y_test_svr_st, y_pred_svr_st)



#Long-Term Memorability
#Preparing Data
Y_lt = df_labels['long-term_memorability'].values
X_train_svr_lt, X_test_svr_lt, Y_train_svr_lt, Y_test_svr_lt = train_test_split(X , Y_lt, test_size=0.2, random_state=109)

#Implementing model
#svr_lt_c3d = SVR(kernel="rbf", degree=3, C=200, epsilon=0.1)
svr_lt_c = SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svr_lt_c.fit(X_train_svr_lt, Y_train_svr_lt)
y_pred_svr_lt = svr_lt_c.predict(X_test_svr_lt)
Get_score(Y_test_svr_lt, y_pred_svr_lt )


The Spearman's correlation coefficient is: 0.242
The Spearman's correlation coefficient is: 0.052


##  Using Combination Of Captions, C3D and HMP

In [58]:
from sklearn.svm import SVR

#Short-Term Memorability
#Preparing Data
Y_st = df_labels['short-term_memorability'].values
X = ccdp;
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X , Y_st, test_size=0.2, random_state=42)

#Implementing model
svr_as = SVR(C=1.0, cache_size=200, coef0=0.0, degree=3 , epsilon=0.1, gamma='scale', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svr_as.fit(X_train_s, Y_train_s)
y_pred_s = svr_as.predict(X_test_s)
Get_score(Y_test_s, y_pred_s)


#Long-Term Memorability
#Preparing Data
Y_lt = df_labels['long-term_memorability'].values
X_train_svr_lt, X_test_svr_lt, Y_train_svr_lt, Y_test_svr_lt = train_test_split(X , Y_lt, test_size=0.2, random_state=109)

#Implementing model
svr_al = SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svr_al.fit(X_train_svr_lt, Y_train_svr_lt)
y_pred_svr_lt = svr_al.predict(X_test_svr_lt)
Get_score(Y_test_svr_lt, y_pred_svr_lt )


The Spearman's correlation coefficient is: 0.396
The Spearman's correlation coefficient is: 0.151


## Using Combination of Caption Tranformations

This model has mediocre performance. So far the combination of caption transformations has not been a top feature for prediction.

In [109]:
from sklearn.svm import SVR

#Short-Term Memorability
#Preparing Data
Y_st = df_labels['short-term_memorability'].values
X = allcaptionsp;
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X , Y_st, test_size=0.2, random_state=42)

#Implementing model
svr_alltexts = SVR(C=1.0, cache_size=200, coef0=0.0, degree=3 , epsilon=0.1, gamma='scale', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svr_alltexts.fit(X_train_s, Y_train_s)
y_pred_alls = svr_alltexts.predict(X_test_s)
Get_score(Y_test_s, y_pred_alls)


#Long-Term Memorability
#Preparing Data
Y_lt = df_labels['long-term_memorability'].values
X_train_svr_lt, X_test_svr_lt, Y_train_svr_lt, Y_test_svr_lt = train_test_split(X , Y_lt, test_size=0.2, random_state=109)

#Implementing model
svr_alltextsl = SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svr_alltextsl.fit(X_train_svr_lt, Y_train_svr_lt)
y_pred_alll= svr_alltextsl.predict(X_test_svr_lt)
Get_score(Y_test_svr_lt, y_pred_alll )


The Spearman's correlation coefficient is: 0.404
The Spearman's correlation coefficient is: 0.168


# Random Forest

As discussed previously, the individual decision trees did not perform very well. Therefore, I decided to implement Random Forest models. They are simply an ensemble of decision trees. Ensemble approaches can often be found to perform much better than the individual algorithms. Thankfully, this was the case here and the random forest models  yield much more promising spearman scores.

## Using Sequence of Words

In [108]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

#Short-Term Memorability
#Preparing Data
Y_st = df_labels['short-term_memorability'].values
X = seq_normalized;
X_train_rf_st, X_test_rf_st, Y_train_rf_st, Y_test_rf_st = train_test_split(X, Y_st, test_size=0.2, random_state=42)

#Implementing model
rf_st_s = RandomForestRegressor(n_estimators= 15, min_samples_split= 10, min_samples_leaf= 4, max_features= 'sqrt', max_depth= 40, bootstrap= False)
rf_st_s.fit(X_train_rf_st, Y_train_rf_st)
y_pred_rf_st = rf_st_s.predict(X_test_rf_st)
Get_score(y_pred_rf_st, Y_test_rf_st)

#Long-Term Memorability
#Preparing Data
Y_lt = df_labels['long-term_memorability'].values
X_train_rf_lt, X_test_rf_lt, Y_train_rf_lt, Y_test_rf_lt = train_test_split(X, Y_lt, test_size=0.2, random_state=42)

#Implementing model
rf_lt_s = RandomForestRegressor(n_estimators= 15, min_samples_split= 10, min_samples_leaf= 4, max_features= 'sqrt', max_depth= 40, bootstrap= True)
rf_lt_s.fit(X_train_rf_lt, Y_train_rf_lt)
y_pred_rf_st = rf_lt_s.predict(X_test_rf_lt)
Get_score(y_pred_rf_st, Y_test_rf_lt)


The Spearman's correlation coefficient is: 0.232
The Spearman's correlation coefficient is: 0.069


## Using Captions (One Hot Encoding)

### Implementation

Random forest model for one hot encoding was tested with three feature matrices. The original one hot encoded data, the normalised data and then the noramlised data which was reduced with PCA. Random forest performed best on the original data and worse on the PCA data.

I have used Random seach to zone in on a range of values for the RF hyperparameters. Using these new parameters increased performance by 0.01 which is not great.

Enabling bootsraping instead of randomly sampling the data improves the Spearman score, with 0.31 before and 0.44 after. 


#### Using Normalised Data

I found the model performed better with bootstrapping disabled. The spearman score declined when enabled. This model is one of the top performers we have seen so far 


In [95]:
##short term memorability
Y_st = df_labels['short-term_memorability'].values
X_oh = oh_normalized
X_train_rf_st, X_test_rf_st, Y_train_rf_st, Y_test_rf_st = train_test_split(X_oh, Y_st, test_size=0.2, random_state=42)

# Random Forest
rf_st_c = RandomForestRegressor(n_estimators= 100, min_samples_split= 10, min_samples_leaf= 4, max_features= 'sqrt', max_depth= 90, bootstrap= False)
rf_st_c.fit(X_train_rf_st, Y_train_rf_st)
y_pred_rf_st = rf_st_c.predict(X_test_rf_st)
Get_score(Y_test_rf_st,y_pred_rf_st)


#Long-Term
Y_lt = df_labels['long-term_memorability'].values
X_train_rf_lt, X_test_rf_lt, Y_train_rf_lt, Y_test_rf_lt = train_test_split(X_oh, Y_lt, test_size=0.2, random_state=42)

# Random Forest
rf_lt_c = RandomForestRegressor(n_estimators= 100, min_samples_split= 10, min_samples_leaf= 4, max_features= 'sqrt', max_depth= 80, bootstrap= False)
rf_lt_c.fit(X_train_rf_lt, Y_train_rf_lt)
y_pred_rf_lt = rf_lt_c.predict(X_test_rf_lt)
Get_score(Y_test_rf_lt, y_pred_rf_lt)


The Spearman's correlation coefficient is: 0.445
The Spearman's correlation coefficient is: 0.135


#### Using Normalised and PCA reduced

Here I wanted to test if the random forest model would perform better with less input features. The normalised PCA reduced data has (6000, 1778) features whereas the normalised data has (6000, 5191) features. Even though the smaller data maintained a 95% variance after dimensionality reduction we see a big decrease in the spreaman score. The model trained on the normalised data performed much better with a short term-score of X and a long-term score of Y.

In [65]:
##short term memorability
Y_st = df_labels['short-term_memorability'].values
X_oh = ohn

#impf = rf_st_c.feature_importances_
X_train_rf_st, X_test_rf_st, Y_train_rf_st, Y_test_rf_st = train_test_split(X_oh, Y_st, test_size=0.2, random_state=42)

# Random Forest
#dont change cos gives better results.
rf_st_cn = RandomForestRegressor(n_estimators= 25, min_samples_split= 5, min_samples_leaf= 4, max_features= 'sqrt', max_depth=15, bootstrap= True)
rf_st_cn.fit(X_train_rf_st, Y_train_rf_st)
y_pred_rf_st = rf_st_cn.predict(X_test_rf_st)
Get_score(Y_test_rf_st,y_pred_rf_st)


# #long term
Y_lt = df_labels['long-term_memorability'].values
X_train_rf_lt, X_test_rf_lt, Y_train_rf_lt, Y_test_rf_lt = train_test_split(X_oh, Y_lt, test_size=0.2, random_state=42)

# Random Forest
rf_st_cn = RandomForestRegressor(n_estimators= 25, min_samples_split= 5, min_samples_leaf= 4, max_features= 'sqrt', max_depth=15, bootstrap= True)
rf_st_cn.fit(X_train_rf_lt, Y_train_rf_lt)
y_pred_rf_lt = rf_st_cn.predict(X_test_rf_lt)
Get_score(Y_test_rf_lt, y_pred_rf_lt)


The Spearman's correlation coefficient is: 0.306
The Spearman's correlation coefficient is: 0.112


#### Use Random Grid Search to Speed Up Random Forest

Because the random forest model with the One-Hot encoded data has performed as one of the best models so far we can use Grid Search to find a combination of  hyperparameters which might improve the model further. We find the range for values short and long term predictions. This range will narrow down the possible values for the RF parameters. We expect to see an improvement when using the range. We test a number of different hyperparameters, the number of trees, maximum features, tree depth and so on. 



In [86]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

#Making my Spearman function an actual scorer than can be used.
my_scorer = make_scorer(cv_Spearman, greater_is_better=True)

##short term memorability
Y_st = df_labels['short-term_memorability'].values
X_oh = oh_normalized
X_train_rf_st, X_test_rf_st, Y_train_rf_st, Y_test_rf_st = train_test_split(X_oh, Y_st, test_size=0.2, random_state=42)


#Find best hyperparameters for short term RF model which was already trained above
parameterss_randf = {'n_estimators':[5, 10, 15, 20 ,25],'max_depth':[2, 5, 10, 20, 30]}
model_to_improve_short = rf_st_c
gridsearch_rfs = GridSearchCV(model_to_improve_short, parameterss_randf, scoring=my_scorer, n_jobs=-1)
gridsearch_rfs.fit(X_train_rf_st, Y_train_rf_st)

#Find best hyperparameters for long-term RF model which was already trained above
model_to_improve_long = rf_lt_c
gridsearch_rfsl = GridSearchCV(model_to_improve_long, parameterss_randf, scoring=my_scorer, n_jobs=-1)
gridsearch_rfsl.fit(X_train_svr_st, Y_train_svr_st)

print("Short-Term Best Params: ", gridsearch_rfs.best_params_)
print("Long-Term Best Params: ", gridsearch_rfsl.best_params_)

Short-Term Best Params:  {'max_depth': 30, 'n_estimators': 20}
Long-Term Best Params:  {'max_depth': 10, 'n_estimators': 25}


In [89]:
##short term memorability
Y_st = df_labels['short-term_memorability'].values
X_oh = oh_normalized
X_train_rf_st, X_test_rf_st, Y_train_rf_st, Y_test_rf_st = train_test_split(X_oh, Y_st, test_size=0.2, random_state=42)

# Random Forest
rf_st_c_tuned = RandomForestRegressor(n_estimators= 20, max_depth= 30)
rf_st_c_tuned.fit(X_train_rf_st, Y_train_rf_st)
y_pred_rf_st_tuned = rf_st_c_tuned.predict(X_test_rf_st)
Get_score(Y_test_rf_st,y_pred_rf_st_tuned)


#Long-Term
Y_lt = df_labels['long-term_memorability'].values
X_train_rf_lt, X_test_rf_lt, Y_train_rf_lt, Y_test_rf_lt = train_test_split(X_oh, Y_lt, test_size=0.2, random_state=42)

# Random Forest
rf_lt_c_tuned = RandomForestRegressor(n_estimators= 25, max_depth= 10)
rf_lt_c_tuned.fit(X_train_rf_lt, Y_train_rf_lt)
y_pred_rf_lt_tuned = rf_lt_c_tuned.predict(X_test_rf_lt)
Get_score(Y_test_rf_lt, y_pred_rf_lt_tuned)


The Spearman's correlation coefficient is: 0.360
The Spearman's correlation coefficient is: 0.108


## Using TF-IDF

In [69]:
from sklearn.ensemble import RandomForestRegressor

#Short-Term Memorability
#Prepare data
Y_st = df_labels['short-term_memorability'].values
X = tfidfn;
X_train_rf_st, X_test_rf_st, Y_train_rf_st, Y_test_rf_st = train_test_split(X, Y_st, test_size=0.2, random_state=42)

# Implement model
rf_t_s = RandomForestRegressor(n_estimators= 100, min_samples_split= 10, min_samples_leaf= 4, max_features= 'sqrt', max_depth= 90, bootstrap= False)
#rf_t_s = RandomForestRegressor(n_estimators=100, max_depth=100, n_jobs=-1)
rf_t_s.fit(X_train_rf_st, Y_train_rf_st)
y_pred_rf_st = rf_t_s.predict(X_test_rf_st)
Get_score(Y_test_rf_st, y_pred_rf_st)


#Long-Term Memorability
#Prepare data
Y_lt = df_labels['long-term_memorability'].values
X_train_rf_lt, X_test_rf_lt, Y_train_rf_lt, Y_test_rf_lt = train_test_split(X, Y_lt, test_size=0.2, random_state=42)

# Implement model
rf_t_l = RandomForestRegressor(n_estimators= 100, min_samples_split= 10, min_samples_leaf= 4, max_features= 'sqrt', max_depth= 90, bootstrap= False)
rf_t_l.fit(X_train_rf_lt, Y_train_rf_lt)
y_pred_rf_l = rf_t_l.predict(X_test_rf_lt)
Get_score(Y_test_rf_lt, y_pred_rf_l)

The Spearman's correlation coefficient is: 0.336
The Spearman's correlation coefficient is: 0.153


## Using C3D

In [128]:
from sklearn.ensemble import RandomForestRegressor

#Short-Term Memorability
#Prepare data
Y_st = df_labels['short-term_memorability'].values
X_c3d = c3dp;
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X_c3d, Y_st, test_size=0.2, random_state=42)

# Implement model
rf_cds = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf_cds.fit(X_train_s, Y_train_s)
y_pred_rf_cds = rf_cds.predict(X_test_s)
Get_score(Y_test_s, y_pred_rf_cds)

#Long-Term Memorability
#Prepare data
Y_lt = df_labels['long-term_memorability'].values
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X_c3d, Y_lt, test_size=0.2, random_state=42)

# Implement model
rfcdl = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rfcdl.fit(X_train_l, Y_train_l)
y_pred_l= rfcdl.predict(X_test_l)
Get_score(Y_test_l, y_pred_l)

The Spearman's correlation coefficient is: 0.235
The Spearman's correlation coefficient is: 0.084


## Using HMP

In [129]:
from sklearn.ensemble import RandomForestRegressor

#Short-Term Memorability
#Prepare data
Y_st = df_labels['short-term_memorability'].values
X = hmpp;
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X, Y_st, test_size=0.2, random_state=42)

# Implement Random Forest
random_forest = RandomForestRegressor(n_estimators=100, n_jobs=-1)
random_forest.fit(X_train_s, Y_train_s)
y_pred_s = random_forest.predict(X_test_s)
Get_score(Y_test_s, y_pred_s)


#Long-Term Memorability
#Prepare data
Y_lt = df_labels['long-term_memorability'].values
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y_lt, test_size=0.2, random_state=42)

# Implement Random Forest
random_forest = RandomForestRegressor(n_estimators=100, n_jobs=-1)
random_forest.fit(X_train_l, Y_train_l)
y_pred_l = random_forest.predict(X_test_l)
Get_score(Y_test_l, y_pred_l)

The Spearman's correlation coefficient is: 0.271
The Spearman's correlation coefficient is: 0.075


## Using Combination Of Captions, C3D and HMP

In [130]:
from sklearn.ensemble import RandomForestRegressor

#Short-Term Memorability
#Prepare data
Y_st = df_labels['short-term_memorability'].values
X = ccdp;
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X, Y_st, test_size=0.2, random_state=42)

# Implement RF Model
rfas = RandomForestRegressor( n_jobs=-1, max_depth = 50)
rfas.fit(X_train_s, Y_train_s)
y_pred_s = rfas.predict(X_test_s)
Get_score(Y_test_s, y_pred_s)


#Long-Term Memorability
#Prepare data
Y_lt = df_labels['long-term_memorability'].values
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y_lt, test_size=0.2, random_state=42)

# Implement RF Model
rfal = RandomForestRegressor(n_jobs=-1, max_depth = 50)
rfal.fit(X_train_l, Y_train_l)
y_pred_l = rfal.predict(X_test_l)
Get_score(Y_test_l, y_pred_l)

The Spearman's correlation coefficient is: 0.387
The Spearman's correlation coefficient is: 0.189


## Using Combinatino of Caption Transformation

In [113]:
from sklearn.ensemble import RandomForestRegressor

#Short-Term Memorability
#Prepare data
Y_st = df_labels['short-term_memorability'].values
X = allcaptionsp;
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X, Y_st, test_size=0.2, random_state=42)

# Implement RF Model
rf_textalls = RandomForestRegressor( n_jobs=-1, max_depth = 50)
rf_textalls.fit(X_train_s, Y_train_s)
y_pred_sall = rf_textalls.predict(X_test_s)



#Long-Term Memorability
#Prepare data
Y_lt = df_labels['long-term_memorability'].values
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y_lt, test_size=0.2, random_state=42)

# Implement RF Model
rf_textall = RandomForestRegressor(n_jobs=-1, max_depth = 50)
rf_textall.fit(X_train_l, Y_train_l)
y_pred_lall = rf_textall.predict(X_test_l)

Get_score(Y_test_s, y_pred_sall)
Get_score(Y_test_l, y_pred_lall)

The Spearman's correlation coefficient is: 0.357
The Spearman's correlation coefficient is: 0.169


#Ensemble Methods


##Voting Regressor

For this ensemble method we will combine all the best performing models. The feature matrix will be the feature that has performed the best to this point. That is the one hot encoded captions. It should be the case that the voting regressor performs better than the rest of the models.

This model will not be tested on all features. It will only be tested on the best performing ones. Those were TF-IDF, One Hot Encoding and the combination of captions, C3D and HMP.

Decision trees are not tested here as random forest is an ensemble of decision trees. 

### Using TF-IDF

I tested all combinations of models to see which resulted in the optimum score. The results of each combination is below. The best combination for both long- and short-term was SVR and Ridge.

The voting regressor performs better than any of the estimators so far. This was expected as ensemble methods usually perform better than their base estimators. The score is not improved by much however so I want to emply another ensemble method.


**Model Combinations Short-Term Test Results** 

Linear, Ridge, SVR, Random Forest(RF) = 0.344

Linear, Ridge, SVR =0.328

Linear, Ridge =  0.316

Linear, SVR = 0.279

Linear, RF = 0.297

**Ridge, SVR =  0.455**

SVR, RF = 0.443


**Model Combinations Long-Term Test Results** 

Linear, Ridge, SVR, Random Forest(RF) = 0.143

Ridge, SVR, RF = 0.194

Linear, Ridge, SVR = 0.136

Linear, Ridge =  0.122

Linear, SVR = 0.120

Linear, RF = 0.116

**Ridge, SVR =  0.201**

SVR, RF = 0.184

Ridge, RF =  0.196

 



In [13]:
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor
from sklearn.ensemble import RandomForestRegressor

#Preparing the Short-Term test data
Y_st = df_labels['short-term_memorability'].values
X = tfidfn
X_train_oh, X_test_oh, Y_train_oh, Y_test_oh = train_test_split(X, Y_st, test_size=0.2, random_state=42)

#Defining the all  the Short Term Prediction Models so we can choose which ones to implement here
#linears = lr_tfs #Linear Regression
ridges = r_tfs # Ridge Regression
svrs = svr_st_t # Support Vector Machine
#rfs = rf_t_s # Random Forest

#Defining our models that the voting regressor will actually use
estimators = [('ridges', ridges), ('svrs', svrs)]

#Implementing voting regressor with our estimators defined above. Set to run in parallel. 
voting_s = VotingRegressor(estimators=estimators, n_jobs=-1)
#Voting being trained on TFIDF normalised data
voting_s.fit(X_train_oh, Y_train_oh)
y_pred_v = voting_s.predict(X_test_oh)
Get_score(Y_test_oh, y_pred_v)




# Preparing the Long-Term Data
Y_lt = df_labels['long-term_memorability'].values
X = tfidfn
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y_lt, test_size=0.2, random_state=42)

#Defining all the Long-Term Prediction Models we have created so far.
#linearltf = lr_tfl #Linear Regression
ridgeltf = r_tfl  # Ridge Regression
svrltf = svr_lt_t  #Support Vector Regression
#randfortf = rf_t_l #Random Forest
          
#Defining our estimators that the voting regressor will use to get votes.
estimators = [('ridgeltf', ridgeltf) , ('svrltf', svrltf)]

#Creating voting regressor
voting_long_tf = VotingRegressor(estimators=estimators, n_jobs=-1)
voting_long_tf.fit(X_train_l, Y_train_l)
y_pred_vtf = voting_long_tf.predict(X_test_l)
Get_score(Y_test_l, y_pred_vtf)


The Spearman's correlation coefficient is: 0.455
The Spearman's correlation coefficient is: 0.201


### Using Captions(one-hot encoded) 

The voting regressor trained on the one hot encoded data does not perform as well as the model above trained on TF-IDF data. It has a lower short term score with 0.002 difference which is very small. 

In [82]:
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor
from sklearn.ensemble import RandomForestRegressor

#Short-Term Score
Y_st = df_labels['short-term_memorability'].values
X_oh = ohn
X_train_oh, X_test_oh, Y_train_oh, Y_test_oh = train_test_split(X_oh, Y_st, test_size=0.2, random_state=42)

#Implement Voting Regressor
votingtso = VotingRegressor(estimators=[ ('r_c_s', r_c_s), ('svr_st', svr_st)])
votingtso.fit(X_train_oh, Y_train_oh)
y_pred_vso = votingtso.predict(X_test_oh)
Get_score(Y_test_oh, y_pred_vso)

#Long-Term
Y_lt = df_labels['long-term_memorability'].values
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X_oh, Y_lt, test_size=0.2, random_state=42)

#The estimators here are the best performing ones.
votingtlo = VotingRegressor( estimators=[('r_c_l', r_c_l), ('svr_lt', svr_lt)])
votingtlo.fit(X_train_oh, Y_train_oh)
y_pred_vloh = votingtlo.predict(X_test_l)
Get_score(Y_test_l, y_pred_vloh)

The Spearman's correlation coefficient is: 0.445
The Spearman's correlation coefficient is: 0.192


### Using Combination of Captions, C3d and HMP

This model performs worst out of the voting models so far.

In [59]:
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor
from sklearn.ensemble import RandomForestRegressor

#short term test split
Y_st = df_labels['short-term_memorability'].values
X = ccdp
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X, Y_st, test_size=0.2, random_state=42)

#Short Term Prediction Models
ridgesa = r_ccd_s
svrsa = svr_as

estimators = [ ('ridgesa', ridgesa), ('svrsa', svrsa) ]
    
#short term
voting_sa = VotingRegressor(estimators=estimators)
voting_sa.fit(X_train_s, Y_train_s)
y_pred_va = voting_sa.predict(X_test_s)
Get_score(Y_test_s, y_pred_va)

#Long Term Score
Y_l = df_labels['long-term_memorability'].values
X_train_l, X_test_l, Y_train_l, Y_test_l  = train_test_split(X, Y_l, test_size=0.2, random_state=42)

#Long Term Prediction Models
ridgesal = r_ccd_l
svrsal = svr_al

estimators = [ ('ridgesal', ridgesal), ('svrsal', svrsal) ]

#short term
voting_lac = VotingRegressor(estimators=estimators)
voting_lac.fit(X_train_l, Y_train_l)
y_pred_vl = voting_lac.predict(X_test_l)
Get_score(Y_test_l, y_pred_vl)




The Spearman's correlation coefficient is: 0.441
The Spearman's correlation coefficient is: 0.171


## Stacking Ensemble Method

This is the best model for short-term predictions. 

Using the predictions for the same algorithms used in the ensemble method as input for my blender. We used a hold-out set to train the blender. The goal here is to improve the learning process. The incorrect predictions are valued in this model and are used to improve performance. The model learns from mistakes. Each tier will learn from the mistakes of the tier before it.

For the stacking ensemble this will only be tested with the TF-IDF data as this has consistently performed the best.

The final estimator for this model will be the Ridge regression model as this performed very well so far.

The stacking regrossor provides the ability to determine the cross-validation splitting. Here I have set this to use 5-fold cross validation. This is a K-fold split.




In [14]:
from sklearn.ensemble import StackingRegressor

#Short Term Prediction Models
#The final estimator is going to be Ridge model because it performed best out of all others so far.
ridges = r_tfs # Ridge Regression
svrs = svr_st_t # Support Vector Machine
final_estimator = ridges 

#short term test split
Y_st = df_labels['short-term_memorability'].values
X = tfidfn;
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X, Y_st, test_size=0.2, random_state=42)

#Defining Models To Use
estimators = [ ('svrs', svrs), ('voting_s', voting_s)]

#With hyperparameters setting the estimators and final estimator for regressor to use
stackeds = StackingRegressor(estimators=estimators, final_estimator=final_estimator, cv=5)
stackeds.fit(X_train_s, Y_train_s)
y_pred_stacked = stackeds.predict(X_test_s)
Get_score( Y_test_s, y_pred_stacked)


#Long Term Prediction Models
#The final estimator is going to be Ridge model because it performed best out of all others so far.
final_estimator_long = r_tfl 
svrsal = svr_lt_t 

#Long term test split
Y_lt = df_labels['long-term_memorability'].values
X = tfidfn;
X_train_l, X_test_l, Y_train_l, Y_test_l  = train_test_split(X, Y_lt, test_size=0.2, random_state=42)

#Defining Models To Use
estimators_long = [ ('svrsal', svrsal), ('voting_long_tf', voting_long_tf)]

#With hyperparameters setting the estimators and final estimator for regressor to use
stackedl = StackingRegressor(estimators=estimators_long, final_estimator=final_estimator_long)
stackedl.fit(X_train_l, Y_train_l)
y_pred_stackedl = stackedl.predict(X_test_l)
Get_score( Y_test_l, y_pred_stackedl)


The Spearman's correlation coefficient is: 0.457
The Spearman's correlation coefficient is: 0.199


## Bagging Ensemble Method


Final approach is to implement bagging ensemble. The short-term score is high but not as high as the stacking regressor model. However, the long-term score is the highest score we have seen so far.

In [15]:
from sklearn.ensemble import BaggingRegressor

#Short-term memorability score
#Preparing data
Y_st = df_labels['short-term_memorability'].values
X = tfidfn;
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(X, Y_st, test_size=0.2, random_state=42)

#Defining base estimator for bagging regressor. Using best estimator we have so far.
best_est_short = stackeds

#Implementing model
bag_reg = BaggingRegressor(base_estimator = best_est_short, n_estimators=10, random_state=0)
bag_reg.fit(X_train_s, Y_train_s)
preds = bag_reg.predict(X_test_s)
Get_score(Y_test_s, preds)


#Long-Term Memorability
#Preparing data
Y_lt = df_labels['long-term_memorability'].values
X = tfidfn;
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(X, Y_lt, test_size=0.2, random_state=42)

#Define base estimator for bagging regressor to use. Using best estimator we have so far.
base_est_long =  stackedl

#Implementing model
bag_reg_long = BaggingRegressor(base_estimator = base_est_long, n_estimators=10, random_state=0)
bag_reg_long.fit(X_train_l, Y_train_l)
preds_long = bag_reg_long.predict(X_test_l)
Get_score(Y_test_l, preds_long)

The Spearman's correlation coefficient is: 0.454
The Spearman's correlation coefficient is: 0.210


## Cross Validation on best model

I apply 2-fold cross validation to my model. This is a really important step because it allows me to get a really good idea as to the performance of my best model. The data is split into 2 groups and each group it used to evaluate the model. I have set the scoring of the cross validation to the Spearman Coefficient score. This test will allow me to see the mean Spearman score after a 2 fold validation test. 

The results below are not as high as the original score, however it proves that this model is better performing than nearly all other models implemented.


In [17]:
#12:24
f:rom sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def cv_Spearman(Y_true, Y_pred):
    '''Calculate the Spearmann"s correlation coefficient'''
    Y_pred = np.squeeze(Y_pred)
    Y_true = np.squeeze(Y_true)
    if Y_pred.shape != Y_true.shape:
        print('Input shapes don\'t match!')
    else:
        if len(Y_pred.shape) == 1:
            Res = pd.DataFrame({'Y_true':Y_true,'Y_pred':Y_pred})
            score_mat = Res[['Y_true','Y_pred']].corr(method='spearman',min_periods=1)
            return score_mat.iloc[1][0]
        else:
            for ii in range(Y_pred.shape[1]):
                cv_Spearman(Y_pred[:,ii],Y_true[:,ii])


#short term memorability
X = tfidfn
Y_st = df_labels['short-term_memorability'].values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_st, test_size=0.2, random_state=42)

#This is my best performing model for short term with a score of 0.457
model = stackeds 

my_scorer = make_scorer(cv_Spearman, greater_is_better=True)
scores = cross_val_score(model, X, Y_st, scoring=my_scorer, cv=2)
tree_sp_scores = np.mean(scores)


#long term memorability
X = tfidfn
Y_lt = df_labels['long-term_memorability'].values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_lt, test_size=0.2, random_state=42)

#This is my best performing model for long term with a score of 0.201
model_l = bag_reg_long 

scores_long = cross_val_score(model_l, X, Y_lt, scoring=my_scorer, cv=2)
tree_sp_scores_long = np.mean(scores_long)


print("Short term: ", tree_sp_scores)
print("Long term: ", tree_sp_scores_long)

Short term:  0.40352935567133413
Long term:  0.16308589010330357


# Train Final Best Models On Full Dev-set Data And Make Predictions

The optimum model for short and long term perdictions will be used here to predict the memorability scores for the test data. The best model for short-term scores was the Stacking Regressor with 0.457. The best model for long-term was the Bagging regressor with 0.210.

We will now train these models on the full 6000 instances. Then we will use the test set of captions to predict their memorability.  We were not provided with the ground truth for this data and therefore cannot calcualte a Spearman score. I am simply saving the results in a template.

## Import Test-Set

In [0]:
#Functions to load captions
def read_caps(fname):
    """Load the captions into a dataframe"""
    vn = []
    cap = []
    df = pd.DataFrame();
    with open(fname) as f:
        for line in f:
            pairs = line.split()
            vn.append(pairs[0])
            cap.append(pairs[1])
        df['video']=vn
        df['caption']=cap
    return df


# Load in the test set captions which we will try predict the memorability score for
cap_path_tmp_test = './Test-set/Captions_test/test-set-1_video-captions.txt'
df_cap_tmp_test = read_caps(cap_path_tmp_test)

# Load in a template to store the predictions
test_ground_truth = pd.read_csv('./Test-set/Ground-truth_test/ground_truth_template.csv')


## Prepare Test Captions

As I found TF-IDF to work the best with the best performing models I will apply TF-IDF to the test set.

In [21]:
#Setup our Counter object which can assist with cleaning the captions
vocab_test = Counter()

#Loop through each caption and clean
for i, capitalLetter in enumerate(df_cap_tmp_test['caption']):
    # Removing dashes in between words and convert words to lowercase.
    text = ''.join([c if c not in punctuation else ' ' for c in capitalLetter]).lower()
    #At each row of iteration i save the updated text
    df_cap_tmp_test.loc[i,'caption'] = text
    vocab_test.update(text.split())


caps_test = list(df_cap_tmp_test.caption.values)

#Applying the pipeline I made earlier which applies CountVectorizer and TFIDF
tfidf_test = tfidf_pipe.transform(caps_test)

#Normalising TFIDF data
tfidf_normalized_test = normalize(tfidf_test, norm='l2')

#Convert sparse matrix to dense array so can ues PCA
X_normalizedde_test = tfidf_normalized_test.todense()

#Retaining 95 % variance.
pca_test = PCA(n_components = 0.95)
tfidfn_test = pca.transform(X_normalizedde_test)

#We need to make sure there are the same number of features for training and testing
print( "Shape Of TEST TF-IDF After Normalise and PCA  : " , tfidfn_test.shape)
print(" Shape of DEV TF-IDF training data used earlier : ", tfidfn.shape)

Shape Of TEST TF-IDF After Normalise and PCA  :  (2000, 2265)
 Shape of DEV TF-IDF training data used earlier :  (6000, 2265)


# Final Predictions Using Test-Set 

In [0]:
#Short term
#We need to train the best model on the whole 6000 instances, no validation check now.
Y_st = df_labels['short-term_memorability'].values
X = tfidfn

#We need to predict the test data but we dont have the ground truth so not training split.
X_test = tfidfn_test

# Final Short-term prediction mode
#Stacking Model from earlier
final_short = stackeds 
#train final model on all 6000 instances
final_short.fit(X, Y_st) 
#Save predictions so we can convert to csv
y_pred_short = final_short.predict(X_test)
print(X_test.shape, X.shape)


#Final Long-term predictions
Y_lt = df_labels['long-term_memorability'].values
X = tfidfn
X_test = tfidfn_test

# Best long-term memorability model was the Bagging model 
final_long = bag_reg_long
final_long.fit(X, Y_lt)
y_pred_long = final_long.predict(X_test)


(2000, 2265) (6000, 2265)


##  Save predications as CSV

In [0]:
test_ground_truth['short-term_memorability'] = y_pred_short
test_ground_truth['long-term_memorability'] = y_pred_long
test_ground_truth.to_csv('/content/drive/My Drive/Assignment/Dataset/Maureen_Maguire_19213997_predictions.csv',  index=False)

test_ground_truth.tail()


# Resources

Spearman: https://drive.google.com/drive/folders/1puG9lLjao1y4ZngKHJFpxi4Yl-9cHvV7

Sequences and One-hot Encoding:
https://drive.google.com/drive/folders/1puG9lLjao1y4ZngKHJFpxi4Yl-9cHvV7

SVM Bootstraping:
https://www.researchgate.net/figure/A-general-architecture-of-bootstrapping-using-a-single-SVM-model_fig2_275238559
https://stats.stackexchange.com/questions/183230/bootstrapping-confidence-interval-from-a-regression-prediction

Stacking Ensemble Method:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html

Learning Curve:
https://github.com/mainkoon81/DCU_project-01-Competition-MediaEval/blob/master/MinKun_ML_Script.ipynb