# Vlogger Big-Five Competition
## Competition I - Round II
### University of Amsterdam
### Behavioral Data Science
### Group 9
#### Romy Leferink, Bence Marosi, Sophia Lorenz
#### September 18, 2023
---




## The YouTube Personality Dataset

In this notebook, we analyze data on YouTube vloggers using the "YouTube personality dataset", which contains data on 404 YouTube vloggers. The vloggers talk explicitly about a variety of topics, including personal issues, politics, movies, and books, and the language they use is diverse.
The dataset consists of behavioral features, speech transcriptions, and personality impression scores of the vloggers.

---

## 0 Directories

#### 0.0 Load the libraries required to conduct the data analysis

In [None]:
library(tidyverse) 
library(tidytext)

#### 0.1 Load all files under the "../input/" directory

In [None]:
list.files(path = "../input")

master_dir <- file.path(list.files('../input', full.names=TRUE), 'youtube-personality')
directory_content <- list.files(master_dir, full.names = TRUE)
print(directory_content)

---

## 1 Importing Data

We'll import

- Audio-visual information
- Gender
- Personality scores
- Transcripts


There are three `.csv` files in the directory structure (see entries \[3\], \[4\], and \[5\] in the output above).
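
Selecting files by their position in `directory_content` depends on the alphabetical order of the listing. As a hedged alternative, the sketch below looks files up by a name pattern instead. The file names and the `find_file()` helper are hypothetical and only serve as an illustration; they are not the actual names in the dataset.

```r
# Hypothetical file names, created in a temporary directory for illustration only
demo_dir <- file.path(tempdir(), "demo_input")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  demo_dir,
  c("audiovisual_demo.csv", "gender_demo.csv", "personality_demo.csv")
)))

demo_content <- list.files(demo_dir, full.names = TRUE)

# Hypothetical helper: return the single path whose file name matches
# `pattern`, and stop with an error if there is no match or more than one
find_file <- function(paths, pattern) {
  hit <- paths[grepl(pattern, basename(paths))]
  stopifnot(length(hit) == 1)
  hit
}

gender_path <- find_file(demo_content, "^gender")
basename(gender_path)
```

A lookup like this fails loudly when a file is missing or renamed, whereas a positional index would silently pick up a different file.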

#### 1.1 Importing audiovisual information

The path to the audiovisual `.csv` file is stored in a variable, which is then used to read in the audiovisual data.

*Note:* Although the files have a `.csv` extension, their values are separated by spaces rather than commas. We therefore read them with `read.table()`, `read.delim()`, or `read_delim()` instead of `read_csv()`.

In [None]:
audiovisual_file <- directory_content[3]
audiovisual_df <- read.table(audiovisual_file, header = TRUE)

audiovisual_df %>% 
    names()

audiovisual_df %>% 
    head()

#### 1.2 Importing gender information
The path to the gender `.csv` file is stored in a variable, which is then used to read in the gender data. This file has no header row, so the column names are added manually.


In [None]:
gender_file <- directory_content[4]
gender_df <- read.delim(gender_file, header = FALSE, sep = " ", skip = 2)

# Adding column names
names(gender_df) <- c('vlogId', 'gender')

gender_df %>% 
    head()

#### 1.3 Importing personality information

In [None]:
personality_file <- directory_content[5]
personality_df <- read_delim(personality_file, delim=" ", show_col_types=FALSE)

personality_df %>% 
    head()

#### 1.4 Importing transcripts
There is a "transcript" subfolder (see line \[2\] in the output above) in which the actual video transcripts are stored in `.txt` files. A path was created to the transcripts directory. A selection of the transcripts is listed below:

In [None]:
path_to_transcripts = directory_content[2] 
transcript_files = list.files(path_to_transcripts, full.names = TRUE) 

transcript_files %>% 
    head()

---

## 2 Preparing Data for the Analysis

#### 2.1 Extracting vlogger ID from vlogger names
To protect the privacy of the vloggers, their names were removed before the analysis, so each vlogger is identified only by an ID. We extract this ID from the transcript file names because it is needed to join the different dataframes. To do so, we strip the directory path and the file extension ".txt".

In [None]:
vlogId = basename(transcript_files)
vlogId = str_replace(vlogId, pattern = "\\.txt$", replacement = "")

vlogId %>% 
    head()
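
In a regular expression, an unescaped `.` matches any character, so the dot of the file extension has to be escaped. The sketch below contrasts the two patterns on made-up file names. It uses base R's `sub()` in place of `str_replace()` to stay free of dependencies, and also shows `tools::file_path_sans_ext()` as an alternative.

```r
# Made-up file names for illustration; the second one has no ".txt" extension
files <- c("VLOG1.txt", "VLOG2_atxt", "VLOG3.txt")

# Unescaped dot: also strips "atxt" from the second name
unescaped <- sub(".txt$", "", files)

# Escaped dot: strips only a literal ".txt" at the end of the name
escaped <- sub("\\.txt$", "", files)

# Base R helper that removes the file extension
sans_ext <- tools::file_path_sans_ext(files)

unescaped  # "VLOG1" "VLOG2_" "VLOG3"
escaped    # "VLOG1" "VLOG2_atxt" "VLOG3"
sans_ext   # "VLOG1" "VLOG2_atxt" "VLOG3"
```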

#### 2.2 Transcript features
To include features extracted from the transcript texts, the text from files was stored in a dataframe.

In [None]:
transcripts_df <- tibble(
    
    # vlogId connects each transcript to a vlogger
    vlogId = vlogId,
    
    # Read the transcript text from each file and store it as a single string
    TEXT = map_chr(transcript_files, ~ paste(readLines(.x), collapse = "\n")), 
    
    # `filename` contains the path to the transcript file
    filename = transcript_files
)

transcripts_df %>% 
    head(2)

#### 2.3 Merging `gender_df` and `personality_df` dataframes
The dataframes `gender_df` and `personality_df` were merged by `vlogId` so that gender and personality scores are available in a single dataframe (`vlogger_df`).

In [None]:
vlogger_df <- left_join(gender_df, personality_df, by='vlogId')

vlogger_df %>% 
    head() # VLOG8 has missing personality scores: those should be predicted!

___

## 3 Feature Extraction from Transcripts

#### 3.1 Transcripts tokenized by words

By breaking the text down into individual words (tokens), a sentiment can later be assigned to each token.

In [None]:
transcripts_tokenized <- 
    transcripts_df %>%
    unnest_tokens(token, TEXT, token = 'words')

#### 3.2 Deleting stopwords from the transcript words

By deleting stopwords from the tokenized transcripts (`transcripts_tokenized`), we prevent very common words from interfering with the analysis. Words like "the", "a", etc. occur frequently in language but do not necessarily carry sentiment. Excluding them yields cleaner data and may lead to more accurate predictions.

In [None]:
stopwords <- get_stopwords()

transcripts_tokenized <-
    transcripts_tokenized %>%
    anti_join(stopwords, by = c(token = "word"))
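
The stopword removal above is a filtering join. The sketch below illustrates it on a toy token table; the tokens and the two-word stopword list are made up for illustration.

```r
library(dplyr)

# Toy tokens and a made-up stopword list
tokens <- tibble(
  vlogId = c("VLOG1", "VLOG1", "VLOG1", "VLOG2", "VLOG2"),
  token  = c("the", "movie", "great", "a", "book")
)
stops <- tibble(word = c("the", "a"))

# anti_join() keeps only the rows of `tokens` that have no match in `stops`
kept <- anti_join(tokens, stops, by = c(token = "word"))
kept$token  # "movie" "great" "book"
```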


#### 3.3 Retrieving lexica
The following helper function retrieves lexica from the `textdata` package. The lexica of interest contain emotion data and are commonly used for sentiment analysis.

In [None]:
get_lexicon = function(lexicon_name = names(textdata:::download_functions)) {
    lexicon_name = match.arg(lexicon_name)
    textdata:::download_functions[[lexicon_name]]('.')
    rds_filename = paste0(lexicon_name,'.rds')
    textdata:::process_functions[[lexicon_name]]('.',rds_filename)
    readr::read_rds(rds_filename)
}

#### 3.4 "nrc" lexicon

For the following sentiment analysis, the "nrc" lexicon was used. This lexicon is based on Plutchik's wheel of emotions. Psychologist Robert Plutchik proposed this model, which consists of 8 base emotions:

- joy
- trust
- fear
- surprise
- sadness
- anticipation
- anger
- disgust

In Plutchik's model, each base emotion has a polar opposite. In addition to the eight emotions, the "nrc" lexicon labels words with a positive or a negative valence.

The "nrc" lexicon assigns a sentiment to each word it covers, that is, a positive or negative valence and/or a base emotion typically conveyed by that word. A word can carry more than one label. Note that the lexicon does not cover every word, so not every token in the `transcripts_tokenized` dataframe gets a sentiment assigned. Tokens without a sentiment were dropped because they do not add anything to the analysis.

In [None]:
# loading "nrc" lexicon
nrc <- get_lexicon('nrc')

# left joining the "nrc" lexicon and tokenized transcripts
transcripts_token_labeled <-
    left_join(transcripts_tokenized, nrc, by = c(token = 'word'), relationship='many-to-many')

# dropping rows with no sentiment
transcript_features_df <-
    transcripts_token_labeled %>%
    filter(!is.na(sentiment))

transcript_features_df %>% 
    head()

#### 3.5 Merging sentiment-labeled transcripts with `vlogger_df` dataframe
The dataframes `vlogger_df` and `transcript_features_df` were merged into `vlogger_transcript_df`. Furthermore, we deleted the `filename` column because it is not needed for the rest of the analysis.

In [None]:
# merging `vlogger_df` with `transcript_features_df` into `vlogger_transcript_df`
vlogger_transcript_df <-
    vlogger_df %>%
    left_join(transcript_features_df, by = "vlogId")

# deleting "filename" column
vlogger_transcript_df <-
    vlogger_transcript_df %>%
    select(-filename)

vlogger_transcript_df %>% 
    head()

#### 3.6 Counting sentiment scores

For each vlogger, we counted how many tokens were labeled with each sentiment. These counts serve as the sentiment scores.

In [None]:
vlogger_sentiment_scores <-
    vlogger_transcript_df  %>%
    count(`vlogId`, sentiment)

#### 3.7 Wide dataframe for sentiment 

The long format of the `vlogger_sentiment_scores` dataframe is transformed into a wide dataframe such that each sentiment has a separate column.

In [None]:
vlogger_sentiment_wide <- 
    vlogger_sentiment_scores %>%
    pivot_wider(id_cols = 'vlogId', names_from=sentiment, values_from=n, values_fill = 0)

vlogger_sentiment_wide %>% 
    head()
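
The counting and reshaping steps above can be sketched on toy data. The labeled tokens below are made up; note how `values_fill = 0` turns a sentiment that never occurs for a vlogger into a count of zero instead of `NA`.

```r
library(dplyr)
library(tidyr)

# Toy sentiment-labeled tokens (made-up data)
labeled <- tibble(
  vlogId    = c("VLOG1", "VLOG1", "VLOG1", "VLOG2"),
  sentiment = c("joy", "joy", "fear", "joy")
)

# Long format: one row per vlogger and sentiment, with the count in `n`
counts <- labeled %>%
  count(vlogId, sentiment)

# Wide format: one row per vlogger, one column per sentiment
wide <- counts %>%
  pivot_wider(id_cols = vlogId, names_from = sentiment,
              values_from = n, values_fill = 0)

wide
```

In this toy example, VLOG2 has no "fear" tokens, so its `fear` column is filled with 0.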

#### 3.8 Merging sentiment data with vlogger data

The sentiment data of the `vlogger_sentiment_wide` dataframe was merged with the vlogger data of `vlogger_transcript_df`. The columns `token` and `sentiment` were dropped as they won't be needed for the rest of the analysis. This ensures a cleaner picture of the data.

In [None]:
vlogger_features =
    inner_join(vlogger_transcript_df, vlogger_sentiment_wide, by = "vlogId") %>%
    select(-c(token, sentiment)) %>%
    distinct()

vlogger_features %>% 
    head()

#### 3.9 Splitting into training data and test data

The `vlogger_features` dataframe is split into training data and test data. This allows us to train the models on the training data, which contain labels, and then use them to make predictions for the unseen test data.

The training data contain, for each vlogger, the sentiment counts of the transcript (from `vlogger_sentiment_wide`; predictor variables) as well as the Big Five personality scores (from `personality_df`; outcome variables).
In contrast, the test data contain only the sentiment counts (predictor variables) but not the personality scores. The test data are those `vlogId`s for which the Big Five personality scores in `personality_df` are missing (`NA`).


In [None]:
# training data
vlogger_features_training <-
    vlogger_features %>%
    drop_na(c(Extr, Agr, Cons, Emot, Open))

vlogger_features_training %>% 
    head()

# test data
vlogger_features_test <-
    vlogger_features %>%
    anti_join(vlogger_features_training, by = "vlogId")

vlogger_features_test %>% 
    head()

# number of vloggers for each dataset (combined, training, test)
nrow(vlogger_features)
nrow(vlogger_features_training)
nrow(vlogger_features_test)
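
The splitting logic can be checked on a toy example. In the sketch below (made-up values, only two trait columns), rows with observed scores go to the training set, the remaining rows go to the test set, and the two sets together cover the full data.

```r
library(dplyr)
library(tidyr)

# Toy feature table: VLOG2 has no personality scores (made-up values)
toy <- tibble(
  vlogId = c("VLOG1", "VLOG2", "VLOG3"),
  Extr   = c(4.5, NA, 5.1),
  Agr    = c(4.0, NA, 4.8),
  joy    = c(3, 7, 2)
)

# Training data: rows where the outcome columns are observed
toy_train <- toy %>%
  drop_na(c(Extr, Agr))

# Test data: every vlogger that is not in the training data
toy_test <- toy %>%
  anti_join(toy_train, by = "vlogId")

# Sanity check: the split is exhaustive and has no overlap
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))
toy_test$vlogId  # "VLOG2"
```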

#### 3.10 Merging audiovisual data with vlogger data

The audiovisual data in `audiovisual_df` are merged with the vlogger data in `vlogger_df` into `vlogger_audiovisual`. This way, our predictions of the vloggers' personalities rest not only on what they say in their vlogs (transcript data) but also on audiovisual features. We expect that including these additional features can improve the predictions.

In [None]:
vlogger_audiovisual =
    inner_join(vlogger_df, audiovisual_df, by = "vlogId") %>%
    distinct()

vlogger_audiovisual %>% 
    head()

vlogger_audiovisual %>% 
    names()

#### 3.11 Splitting into training data and test data

The `vlogger_audiovisual` dataframe is split into training data and test data for the same reason as previously explained.

The training data contain, for each vlogger, the audiovisual data (from `audiovisual_df`; predictor variables), the Big Five personality scores (from `personality_df`; outcome variables), and the gender data (from `gender_df`).
In contrast, the test data contain each vlogger's audiovisual data (predictor variables) but not the personality scores. The test data are those `vlogId`s for which the Big Five personality scores in `personality_df` are missing (`NA`).

In [None]:
# training data
vlogger_audiovisual_training <-
    vlogger_audiovisual %>%
    drop_na(c(Extr, Agr, Cons, Emot, Open))

vlogger_audiovisual_training %>% 
    head()

# test data
vlogger_audiovisual_test <-
    vlogger_audiovisual %>%
    anti_join(vlogger_audiovisual_training, by = "vlogId")

vlogger_audiovisual_test %>% 
    head()


# number of vloggers for each dataset (combined, training, test)
nrow(vlogger_audiovisual)
nrow(vlogger_audiovisual_training)
nrow(vlogger_audiovisual_test)

#### 3.12 Merging audiovisual data with transcript features data

The `audiovisual_df` dataframe is merged with the `vlogger_features` dataframe, to get a dataframe with all text- and audiovisual features.

In [None]:
features_audiovisual_df =
    inner_join(audiovisual_df, vlogger_features, by = "vlogId") %>%
    distinct()

features_audiovisual_df %>% 
    head()

features_audiovisual_df %>% 
    names()

#### 3.13 Splitting into training data and test data

Like the previous dataframes, the `features_audiovisual_df` dataframe is then split into test and training data. 

The training data contain, for each vlogger, the audiovisual data (from `audiovisual_df`; predictor variables), the transcript features (from `vlogger_features`; predictor variables), the Big Five personality scores (from `personality_df`; outcome variables), and the gender data (from `gender_df`).
In contrast, the test data contain each vlogger's audiovisual data and transcript features (predictor variables) but not the personality scores. The test data are those `vlogId`s for which the Big Five personality scores in `personality_df` are missing (`NA`).

In [None]:
# training data
features_audiovisual_df_train <-
    features_audiovisual_df %>%
    drop_na(c(Extr, Agr, Cons, Emot, Open))

features_audiovisual_df_train %>% 
    head()

# test data
features_audiovisual_df_test <-
    features_audiovisual_df %>%
    anti_join(features_audiovisual_df_train, by = "vlogId")

features_audiovisual_df_test %>% 
    head()


# number of vloggers for each dataset (combined, training, test)
nrow(features_audiovisual_df)
nrow(features_audiovisual_df_train)
nrow(features_audiovisual_df_test)

___

## 4 Predictive Linear Models

Models are trained on the training data in order to predict the Big Five personality traits of each vlogger in the dataset. We applied multiple regression with the five personality traits as outcome variables and the eight emotions and two valences as predictors; Models 10 and 11 additionally draw on audiovisual features. Several multiple linear regression models with differing structures were fitted.

#### 4.1 Linear model 1
First, we specified a full linear model containing all 10 predictors to investigate which predictors have significant effects on which outcomes (Model 1).  

In [None]:
fit_mlm_01 <- lm(cbind(Extr, Agr, Cons, Emot, Open) ~ anger + anticipation + 
                 disgust + fear + joy + negative + positive + sadness + 
                 surprise + trust, data = features_audiovisual_df_train)

summary(fit_mlm_01)
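
When the left-hand side of the formula is a `cbind()` of several outcomes, `lm()` fits one regression per outcome on the same predictors. The sketch below illustrates this on simulated data (all variable values are made up): the coefficients of the multivariate fit match those of a separate single-outcome fit, and `predict()` returns one column per outcome.

```r
set.seed(1)

# Simulated data for illustration only
n_obs <- 50
sim <- data.frame(joy = rpois(n_obs, 5), fear = rpois(n_obs, 3))
sim$Extr <- 4 + 0.10 * sim$joy + rnorm(n_obs, sd = 0.5)
sim$Agr  <- 5 - 0.05 * sim$fear + rnorm(n_obs, sd = 0.5)

# Multivariate fit: two outcomes, same predictors
fit_multi <- lm(cbind(Extr, Agr) ~ joy + fear, data = sim)

# Separate fit for one of the outcomes
fit_extr <- lm(Extr ~ joy + fear, data = sim)

# The coefficient matrix has one column per outcome
dim(coef(fit_multi))

# The "Extr" column equals the coefficients of the separate fit
same_coefs <- all.equal(unname(coef(fit_multi)[, "Extr"]),
                        unname(coef(fit_extr)))
same_coefs

# Predictions come back as a matrix with one column per outcome
pred_demo <- predict(fit_multi, newdata = sim[1:3, ])
colnames(pred_demo)
```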

#### 4.2 Linear model 2
Second, we specified a linear model with all 10 predictors and an interaction between `surprise` and `positive` (Model 2). We assumed an interaction effect between `surprise` and `positive` because surprise can be either positive or negative. We expected `surprise` to have a ‘positive’ influence on the outcome only when it co-occurs with positive valence (e.g., more positively valenced surprise words could indicate higher extraversion).

In [None]:
fit_mlm_02 <- lm(cbind(Extr, Agr, Cons, Emot, Open) ~ anger + anticipation + 
                 disgust + fear + joy + negative + positive + sadness + trust + 
                 surprise*positive, data = features_audiovisual_df_train)

summary(fit_mlm_02)
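
In an R formula, `a*b` is shorthand for both main effects plus their interaction, and a term that is listed more than once is kept only once. The sketch below inspects the expanded terms of two illustrative formulas with `terms()`; no data are needed for this check.

```r
# A main-effects formula with one interaction written using `*`
f_one <- Extr ~ joy + positive + surprise * positive
labels_one <- attr(terms(f_one), "term.labels")
labels_one          # three main effects plus one interaction term

# The same interaction written twice does not add extra terms
f_dup <- Extr ~ disgust * negative + fear * negative + disgust * negative
labels_dup <- attr(terms(f_dup), "term.labels")
labels_dup          # three main effects plus two interaction terms
```

Because `positive` already appears as a main effect, adding `surprise*positive` only contributes the interaction term to the model.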

#### 4.3 Linear model 3

Third, we specified a linear model that contains interaction effects between each emotion and its corresponding valence according to Plutchik's wheel of emotions (Model 3).

In [None]:
fit_mlm_03 <- lm(cbind(Extr, Agr, Cons, Emot, Open) ~ anger*negative + 
                 anticipation*positive + disgust*negative + fear*negative + 
                 joy*positive + sadness*negative + trust*positive +
                 surprise*positive, data = features_audiovisual_df_train)

summary(fit_mlm_03)

#### 4.4 Linear model 4

Fourth, we specified a linear model (Model 4) which only contains the predictors that were shown to be significant in the full model (Model 1). 

In [None]:
fit_mlm_04 <- lm(cbind(Extr, Agr, Cons, Emot, Open) ~ anger + disgust 
                 + fear + positive + surprise, data = features_audiovisual_df_train)

summary(fit_mlm_04)

#### 4.5 Linear model 5 

Maharani and Effendy (2022) investigated Big Five personality prediction using machine learning methods. They tried to predict the personality traits of social media users from their tweets using the NRC emotion lexicon. They found that Conscientiousness, Extraversion and Agreeableness were significantly correlated with anticipation, joy, surprise and trust. Additionally, they found significant correlations of Openness and Neuroticism with sadness, disgust, anger and fear. Based on these results, we formed the following models (Model 5 and Model 6).

The fifth model is a linear model which contains `Conscientiousness`, `Extraversion` and `Agreeableness` as outcome variables and `anticipation`, `joy`, `surprise`, and `trust` as predictors (Model 5).

In [None]:
fit_mlm_05 <- lm(cbind(Extr, Agr, Cons)  ~ anticipation + joy + surprise + trust, 
                 data = features_audiovisual_df_train)

summary(fit_mlm_05)

#### 4.6 Linear model 6

Also based on the study by Maharani and Effendy (2022), we formed a linear model containing `Openness` and `Neuroticism` (represented in our data by the emotional stability score `Emot`) as outcome variables and `sadness`, `disgust`, `anger` and `fear` as predictors (Model 6).

In [None]:
fit_mlm_06 <- lm(cbind(Emot, Open) ~ anger + disgust + fear + sadness,
                 data = features_audiovisual_df_train)

summary(fit_mlm_06)

#### 4.7 Linear model 7

Similarly, Farnadi et al. (2014) studied the relationship between emotions expressed in Facebook status updates and users' personality according to the Big Five model. These researchers used the NRC emotion lexicon as well. The results showed that Openness and Extraversion were significantly correlated with all emotions and valences from the NRC lexicon. These regressions were already included in the full model (Model 1, `fit_mlm_01`). For the other personality traits, Farnadi et al. (2014) found that Conscientiousness was correlated with all emotions/valences except negativity, anger and sadness; that Agreeableness was correlated with all emotions/valences except sadness; and that Neuroticism was correlated with all emotions/valences except positivity, anticipation and trust. Based on these results, the following three models (Model 7, Model 8, Model 9) were formed.

The seventh model is a linear model for the outcome variable `Conscientiousness` with `anticipation`, `disgust`, `fear`, `joy`, `positive`, `surprise`, and `trust` as predictors (Model 7). 

In [None]:
fit_mlm_07 <- lm(Cons ~ anticipation + disgust + fear + joy + positive + 
                 surprise + trust, data = features_audiovisual_df_train)

summary(fit_mlm_07)

#### 4.8 Linear model 8

Also based on the study by Farnadi et al. (2014), we specified a model with outcome variable `Agreeableness` and `anger`, `anticipation`, `disgust`, `fear`, `joy`, `negative`, `positive`, `surprise`, and `trust` as predictors (Model 8).

In [None]:
fit_mlm_08 <- lm(Agr ~ anger + anticipation + disgust + fear + joy + negative + 
                 positive + surprise + trust, data = features_audiovisual_df_train)

summary(fit_mlm_08)

#### 4.9 Linear model 9

Also based on the study by Farnadi et al. (2014), we formed a model for `Neuroticism`, which is represented in our data by the emotional stability score `Emot`, with `anger`, `disgust`, `fear`, `joy`, `negative`, `sadness`, and `surprise` as predictors (Model 9).

In [None]:
fit_mlm_09 <- lm(Emot ~ anger + disgust + fear + joy + negative +  
                 sadness + surprise, data = features_audiovisual_df_train)

summary(fit_mlm_09)

#### 4.10 Linear model 10

Biel and Gatica-Perez (2013) investigated the relationship between several audio and visual cues and Big Five personality traits. They found that energy (audio), speaking time (audio) and the length of looking segments (visual) are significantly correlated with the Big Five personality traits. Based on these results, we formed the following model (Model 10).  

The tenth model is a linear model containing all five personality traits as outcomes and `mean.energy`, `time.speaking` and `avg.len.seg` as predictors.

In [None]:
fit_mlm_10 <- lm(cbind(Extr, Agr, Cons, Emot, Open) ~ mean.energy + time.speaking 
                 + avg.len.seg, data = features_audiovisual_df_train)

summary(fit_mlm_10)


___

#### 4.11 Linear model 11

Lastly, we made a model (Model 11) that combines the text features from Model 3 with the audiovisual features from Model 10. We chose Model 3 because it had the lowest RMSE of the text-based models containing all outcomes (see below).

In [None]:
fit_mlm_11 <- lm(cbind(Extr, Agr, Cons, Emot, Open) ~ anger*negative + 
                 anticipation*positive + disgust*negative + fear*negative + 
                 joy*positive + sadness*negative + trust*positive +
                 surprise*positive + 
                 mean.energy + time.speaking + avg.len.seg, data = features_audiovisual_df_train)

summary(fit_mlm_11)


## 5 Predictions 

#### 5.1 Predictions on training data set

The predictions for the training data based on each of our proposed linear models can be found below. Those predictions are needed in order to evaluate each model (i.e., calculate the RMSE; see below).

In [None]:
pred_mlm_01_train <-
    predict(fit_mlm_01)

pred_mlm_01_train %>% 
    head()


pred_mlm_02_train <-
    predict(fit_mlm_02)

pred_mlm_02_train %>% 
    head()


pred_mlm_03_train <-
    predict(fit_mlm_03)

pred_mlm_03_train %>% 
    head()


pred_mlm_04_train <-
    predict(fit_mlm_04)

pred_mlm_04_train %>% 
    head()


pred_mlm_05_train <-
    predict(fit_mlm_05)

pred_mlm_05_train %>% 
    head()


pred_mlm_06_train <-
    predict(fit_mlm_06)

pred_mlm_06_train %>% 
    head()


pred_mlm_07_train <-
    predict(fit_mlm_07)

pred_mlm_07_train %>% 
    as.matrix() %>% 
    head()


pred_mlm_08_train <-
    predict(fit_mlm_08)

pred_mlm_08_train %>% 
    as.matrix() %>% 
    head()


pred_mlm_09_train <-
    predict(fit_mlm_09)

pred_mlm_09_train %>% 
    as.matrix() %>% 
    head()


pred_mlm_10_train <-
    predict(fit_mlm_10)

pred_mlm_10_train %>% 
    as.matrix() %>% 
    head()


pred_mlm_11_train <-
    predict(fit_mlm_11)

pred_mlm_11_train %>% 
    as.matrix() %>% 
    head()

#### 5.2 Predictions on test data set

Now the predictions for test data are computed based on each of the eleven linear models. That is what we are actually interested in: we want to predict the (yet unknown) personality of vloggers based on our models. 

In [None]:
pred_mlm_01_test <-
    predict(fit_mlm_01, newdata = features_audiovisual_df_test)

pred_mlm_01_test %>% 
    head()


pred_mlm_02_test <-
    predict(fit_mlm_02, newdata = features_audiovisual_df_test)

pred_mlm_02_test %>% 
    head()


pred_mlm_03_test <-
    predict(fit_mlm_03, newdata = features_audiovisual_df_test)

pred_mlm_03_test %>% 
    head()


pred_mlm_04_test <-
    predict(fit_mlm_04, newdata = features_audiovisual_df_test)

pred_mlm_04_test %>% 
    head()


pred_mlm_05_test <-
    predict(fit_mlm_05, newdata = features_audiovisual_df_test)

pred_mlm_05_test %>% 
    head()


pred_mlm_06_test <-
    predict(fit_mlm_06, newdata = features_audiovisual_df_test)

pred_mlm_06_test %>% 
    head()


pred_mlm_07_test <-
    predict(fit_mlm_07, newdata = features_audiovisual_df_test)

pred_mlm_07_test %>% 
    as.matrix() %>% 
    head()


pred_mlm_08_test <-
    predict(fit_mlm_08, newdata = features_audiovisual_df_test)

pred_mlm_08_test %>% 
    as.matrix() %>% 
    head()


pred_mlm_09_test <-
    predict(fit_mlm_09, newdata = features_audiovisual_df_test)

pred_mlm_09_test %>% 
    as.matrix() %>% 
    head()


pred_mlm_10_test <-
    predict(fit_mlm_10, newdata = features_audiovisual_df_test)

pred_mlm_10_test %>% 
    as.matrix() %>% 
    head()


pred_mlm_11_test <-
    predict(fit_mlm_11, newdata = features_audiovisual_df_test)

pred_mlm_11_test %>% 
    as.matrix() %>% 
    head()

---


## 6 Evaluation of Predictions: RMSE Calculations

As part of the competition, the predictions on the training data are evaluated by computing the Root Mean Square Error (RMSE). The RMSE is a measure of model fit: it summarizes the typical size of the difference between a model's predicted values and the observed values in the dataset.

**Root Mean Square Error**:

- $\displaystyle{RMSE =\sqrt{{1 \over 5n} \sum_{k \in \{cEXT, \ldots, cOPN\}} \sum_{i=1}^n (y_{ik} - \hat y_{ik})^2}}$
- Here
    - $n$ is the number of vloggers and $k$ runs over the five personality traits
    - $y_{ik}$ is the observed value for vlogger $i$ on trait $k$
    - $\hat y_{ik}$ is our prediction for vlogger $i$ on trait $k$

The lower the RMSE value, the better the given model is at predicting the values of the dataset.

#### 6.1 RMSE function
A helper function `rmse()` was created based on the RMSE formula above, so that the code does not have to be repeated for each model. The function takes two arguments:
- the observed values
- the values predicted by a model

With those two inputs the RMSE can be calculated.

In [None]:
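# n is the number of vloggers in the training data the models were fitted on
n <- nrow(features_audiovisual_df_train)

# RMSE as defined in the formula above. Note that the denominator is always
# 5 * n, also when fewer than five outcome columns are passed in.
rmse <- function(observed, predicted) {
  rmse_score = sqrt(1/(5*n) * sum((observed-predicted)^2))
  return(rmse_score)
}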
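
The helper above uses the fixed denominator $5n$ from the formula, which is only the number of compared values when all five traits are passed in. The sketch below uses made-up numbers to show the effect for a single trait. `rmse_mean()` and `rmse_fixed()` are hypothetical names introduced here for illustration; `rmse_mean()` normalizes by the number of values actually compared.

```r
# Toy observed scores for 4 vloggers on 5 traits (made-up numbers)
obs <- matrix(c(4, 5, 3, 4,  5, 5, 4, 4,  3, 4, 4, 5,  4, 4, 5, 5,  5, 3, 4, 4),
              ncol = 5)
pred <- obs + 0.5  # every prediction is off by exactly 0.5

# Hypothetical variant: divide by the number of compared values
rmse_mean <- function(observed, predicted) {
  observed  <- as.matrix(observed)
  predicted <- as.matrix(predicted)
  sqrt(mean((observed - predicted)^2))
}

# Fixed-denominator version, mirroring the formula with 5 * n
n_toy <- nrow(obs)
rmse_fixed <- function(observed, predicted) {
  sqrt(sum((observed - predicted)^2) / (5 * n_toy))
}

all_mean     <- rmse_mean(obs, pred)             # 0.5
single_mean  <- rmse_mean(obs[, 1], pred[, 1])   # 0.5
single_fixed <- rmse_fixed(obs[, 1], pred[, 1])  # 0.5 / sqrt(5), about 0.224
c(all_mean, single_mean, single_fixed)
```

With the fixed denominator, an RMSE computed on one trait is smaller by a factor of $\sqrt{5}$ than the per-value RMSE, even though every prediction is equally far off.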

In order to calculate the RMSE we need to retrieve the observed values on personality traits.

In [None]:
observed_train <-
    features_audiovisual_df_train %>%
    select(Extr, Agr, Cons, Emot, Open)

Then, we can calculate the RMSE with the help of the `rmse()` function by plugging in the observed and the predicted values for each of the 11 models (also per outcome if applicable).

In [None]:
rmse_m1_extr <- rmse(unname(unlist(observed_train[, 1])), unname(unlist(pred_mlm_01_train[, 1])))
rmse_m1_agr <- rmse(unname(unlist(observed_train[, 2])), unname(unlist(pred_mlm_01_train[, 2])))
rmse_m1_cons <- rmse(unname(unlist(observed_train[, 3])), unname(unlist(pred_mlm_01_train[, 3])))
rmse_m1_emot <- rmse(unname(unlist(observed_train[, 4])), unname(unlist(pred_mlm_01_train[, 4])))
rmse_m1_open <- rmse(unname(unlist(observed_train[, 5])), unname(unlist(pred_mlm_01_train[, 5])))

rmse_m2_extr <- rmse(unname(unlist(observed_train[, 1])), unname(unlist(pred_mlm_02_train[, 1])))
rmse_m2_agr <- rmse(unname(unlist(observed_train[, 2])), unname(unlist(pred_mlm_02_train[, 2])))
rmse_m2_cons <- rmse(unname(unlist(observed_train[, 3])), unname(unlist(pred_mlm_02_train[, 3])))
rmse_m2_emot <- rmse(unname(unlist(observed_train[, 4])), unname(unlist(pred_mlm_02_train[, 4])))
rmse_m2_open <- rmse(unname(unlist(observed_train[, 5])), unname(unlist(pred_mlm_02_train[, 5])))

rmse_m3_extr <- rmse(unname(unlist(observed_train[, 1])), unname(unlist(pred_mlm_03_train[, 1])))
rmse_m3_agr <- rmse(unname(unlist(observed_train[, 2])), unname(unlist(pred_mlm_03_train[, 2])))
rmse_m3_cons <- rmse(unname(unlist(observed_train[, 3])), unname(unlist(pred_mlm_03_train[, 3])))
rmse_m3_emot <- rmse(unname(unlist(observed_train[, 4])), unname(unlist(pred_mlm_03_train[, 4])))
rmse_m3_open <- rmse(unname(unlist(observed_train[, 5])), unname(unlist(pred_mlm_03_train[, 5])))

rmse_m4_extr <- rmse(unname(unlist(observed_train[, 1])), unname(unlist(pred_mlm_04_train[, 1])))
rmse_m4_agr <- rmse(unname(unlist(observed_train[, 2])), unname(unlist(pred_mlm_04_train[, 2])))
rmse_m4_cons <- rmse(unname(unlist(observed_train[, 3])), unname(unlist(pred_mlm_04_train[, 3])))
rmse_m4_emot <- rmse(unname(unlist(observed_train[, 4])), unname(unlist(pred_mlm_04_train[, 4])))
rmse_m4_open <- rmse(unname(unlist(observed_train[, 5])), unname(unlist(pred_mlm_04_train[, 5])))

rmse_m5_extr <- rmse(unname(unlist(observed_train[, 1])), unname(unlist(pred_mlm_05_train[, 1])))
rmse_m5_agr <- rmse(unname(unlist(observed_train[, 2])), unname(unlist(pred_mlm_05_train[, 2])))
rmse_m5_cons <- rmse(unname(unlist(observed_train[, 3])), unname(unlist(pred_mlm_05_train[, 3])))

rmse_m6_emot <- rmse(unname(unlist(observed_train[, 4])), unname(unlist(pred_mlm_06_train[, 1])))
rmse_m6_open <- rmse(unname(unlist(observed_train[, 5])), unname(unlist(pred_mlm_06_train[, 2])))

rmse_m10_extr <- rmse(unname(unlist(observed_train[, 1])), unname(unlist(pred_mlm_10_train[, 1])))
rmse_m10_agr <- rmse(unname(unlist(observed_train[, 2])), unname(unlist(pred_mlm_10_train[, 2])))
rmse_m10_cons <- rmse(unname(unlist(observed_train[, 3])), unname(unlist(pred_mlm_10_train[, 3])))
rmse_m10_emot <- rmse(unname(unlist(observed_train[, 4])), unname(unlist(pred_mlm_10_train[, 4])))
rmse_m10_open <- rmse(unname(unlist(observed_train[, 5])), unname(unlist(pred_mlm_10_train[, 5])))

rmse_m11_extr <- rmse(unname(unlist(observed_train[, 1])), unname(unlist(pred_mlm_11_train[, 1])))
rmse_m11_agr <- rmse(unname(unlist(observed_train[, 2])), unname(unlist(pred_mlm_11_train[, 2])))
rmse_m11_cons <- rmse(unname(unlist(observed_train[, 3])), unname(unlist(pred_mlm_11_train[, 3])))
rmse_m11_emot <- rmse(unname(unlist(observed_train[, 4])), unname(unlist(pred_mlm_11_train[, 4])))
rmse_m11_open <- rmse(unname(unlist(observed_train[, 5])), unname(unlist(pred_mlm_11_train[, 5])))

rmse_m1 <- rmse(unname(unlist(observed_train)), unname(unlist(pred_mlm_01_train)))
rmse_m2 <- rmse(unname(unlist(observed_train)), unname(unlist(pred_mlm_02_train)))
rmse_m3 <- rmse(unname(unlist(observed_train)), unname(unlist(pred_mlm_03_train)))
rmse_m4 <- rmse(unname(unlist(observed_train)), unname(unlist(pred_mlm_04_train)))
rmse_m5 <- rmse(unname(unlist(observed_train[,1:3])), unname(unlist(pred_mlm_05_train)))
rmse_m6 <- rmse(unname(unlist(observed_train[,4:5])), unname(unlist(pred_mlm_06_train)))
rmse_m7_cons <- rmse(observed_train[,3], pred_mlm_07_train)
rmse_m8_agr <- rmse(observed_train[,2], pred_mlm_08_train)
rmse_m9_emot <- rmse(observed_train[,4], pred_mlm_09_train)
rmse_m10 <- rmse(unname(unlist(observed_train)), unname(unlist(pred_mlm_10_train)))
rmse_m11 <- rmse(unname(unlist(observed_train)), unname(unlist(pred_mlm_11_train)))
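
The per-trait calculations above all follow the same pattern, so they could also be written as a loop over the trait names. The sketch below shows this on simulated values. `rmse_toy()` is a hypothetical helper that divides by the number of compared values, so its results are not identical to those of the `rmse()` helper used in this notebook.

```r
set.seed(2)
traits <- c("Extr", "Agr", "Cons", "Emot", "Open")

# Simulated observed scores and predictions for 6 vloggers (made-up values)
obs_toy <- as.data.frame(
  matrix(rnorm(30, mean = 4.5), ncol = 5, dimnames = list(NULL, traits))
)
pred_toy <- as.matrix(obs_toy) + matrix(rnorm(30, sd = 0.3), ncol = 5)
colnames(pred_toy) <- traits

# Hypothetical helper: root of the mean squared difference
rmse_toy <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# One RMSE per trait, returned as a named vector
per_trait <- sapply(traits, function(k) rmse_toy(obs_toy[[k]], pred_toy[, k]))

# Collect the results in a dataframe, similar in shape to `rmse_combined`
rmse_toy_df <- data.frame(Model = paste0("toy_", tolower(traits)),
                          RMSE  = unname(per_trait))
rmse_toy_df
```

Selecting columns by name instead of by position also guards against pairing the wrong observed and predicted columns, for example when a model predicts only a subset of the traits.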

To have a better overview of all RMSE values, all RMSE values were put into one dataframe.

In [None]:
rmse_combined <- rbind(rmse_m1, rmse_m2, rmse_m3, rmse_m4, rmse_m5, rmse_m6, rmse_m10, rmse_m11,
                       rmse_m1_extr, rmse_m1_agr, rmse_m1_cons, rmse_m1_emot, rmse_m1_open,
                       rmse_m2_extr, rmse_m2_agr, rmse_m2_cons, rmse_m2_emot, rmse_m2_open,
                       rmse_m3_extr, rmse_m3_agr, rmse_m3_cons, rmse_m3_emot, rmse_m3_open,
                       rmse_m4_extr, rmse_m4_agr, rmse_m4_cons, rmse_m4_emot, rmse_m4_open,
                       rmse_m5_extr, rmse_m5_agr, rmse_m5_cons,
                       rmse_m6_emot, rmse_m6_open, 
                       rmse_m7_cons, 
                       rmse_m8_agr, 
                       rmse_m9_emot,
                       rmse_m10_extr, rmse_m10_agr, rmse_m10_cons, rmse_m10_emot, rmse_m10_open,
                       rmse_m11_extr, rmse_m11_agr, rmse_m11_cons, rmse_m11_emot, rmse_m11_open) %>%
                       as.data.frame() %>% rownames_to_column("Model") %>% 
                       rename("RMSE" = "V1")
rmse_combined

---

## 7 Model Selection

We selected the best model for each outcome (5) and the best model for all outcomes (1) based on the RMSE metric. The lower the RMSE value, the better the model fits the data. 

In order to do that we made tables of the RMSE values per outcome and for the full models, in which we arranged the RMSE values to make it clear which model performs best. 

In [None]:
# Model per outcome 
rmse_combined %>% filter(grepl("open$", Model)) %>% arrange(RMSE) 
rmse_combined %>% filter(grepl("cons$", Model)) %>% arrange(RMSE) 
rmse_combined %>% filter(grepl("extr$", Model)) %>% arrange(RMSE) 
rmse_combined %>% filter(grepl("agr$", Model)) %>% arrange(RMSE)  
rmse_combined %>% filter(grepl("emot$", Model)) %>% arrange(RMSE)

# Models containing all outcomes 
rmse_combined %>% filter(grepl("(1|2|3|4|10|11)$", Model)) %>% arrange(RMSE) 

As can be seen in the table, the six models that contain all outcome variables (i.e., all personality traits) have RMSE values above 0.5, which we took to indicate poor fit. Model 11, which included both audiovisual and text features as predictors, has the best fit among them.

In contrast, the RMSE values of the models that contained only a selection of the outcome variables are below 0.5, which we took to indicate acceptable fit. For each of the Big Five personality traits, Model 11 outperforms the other models, although the differences are slight. 
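
The per-trait comparison can also be done in one step instead of reading five sorted tables. The sketch below is one possible way to do this, assuming row names of the form `rmse_m<number>_<trait>` as in `rmse_combined`. The data frame `rmse_toy` and its RMSE values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for rmse_combined; the RMSE values are invented
rmse_toy <- data.frame(
  Model = c("rmse_m1_extr", "rmse_m11_extr", "rmse_m1_agr", "rmse_m11_agr"),
  RMSE  = c(0.45, 0.41, 0.40, 0.38)
)

best_per_trait <- rmse_toy %>%
  mutate(Trait = sub(".*_", "", Model)) %>%  # keep the text after the last underscore
  group_by(Trait) %>%
  slice_min(RMSE, n = 1) %>%                 # lowest RMSE within each trait
  ungroup()

print(best_per_trait)
```

With these toy values, the row for Model 11 is kept for both traits.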

---

## 8 Writing Predictions to Files

#### 8.1 Wide format
To display the predictions of our best model, Model 11, we created the following wide data frame.  

In [None]:
testset_pred_11 = features_audiovisual_df_test %>% 
    mutate(
        Extr = pred_mlm_11_test[,'Extr'], 
        Agr  = pred_mlm_11_test[,'Agr' ],
        Cons = pred_mlm_11_test[,'Cons'],
        Emot = pred_mlm_11_test[,'Emot'],
        Open = pred_mlm_11_test[,'Open']
    ) %>%
    select(vlogId, Extr:Open)

testset_pred_11 %>% 
    head()

#### 8.2 Long dataframe with axis and prediction values

We transformed the wide format to a long format which contains `vlogId` (i.e. the ID of each vlogger), `axis` (i.e. one of the Big Five dimensions) and `prediction` (i.e. the predicted score on that dimension).

In [None]:
testset_pred_long_11 = testset_pred_11 %>% 
    pivot_longer(cols = Extr:Open, names_to = "axis", values_to = "prediction") %>%
    arrange(`vlogId`,axis)

testset_pred_long_11 %>% 
    head()

#### 8.3 Final data frame

Finally, we created a data frame with two columns: `Id`, which combines the vlogger ID (`vlogId`) with the Big Five dimension, and `Expected`, which contains the predicted value on that dimension.

In [None]:
testset_pred_long_final_11 <- 
    testset_pred_long_11 %>%
    unite(Id, `vlogId`, axis) %>%
    rename(Expected = prediction)

nrow(testset_pred_long_final_11) # n=400 is correct: 80 vloggers times 5 traits

testset_pred_long_final_11 %>% head()
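
To make the reshaping in sections 8.2 and 8.3 easier to follow, here is a minimal sketch of the same wide-to-long-to-`Id` steps on a toy data frame. The two vlogger IDs and all prediction values are made up. Only the column names follow our actual data.

```r
library(dplyr)
library(tidyr)

# Toy wide predictions for two hypothetical vloggers (values are invented)
toy_wide <- data.frame(
  vlogId = c("VLOG1", "VLOG2"),
  Extr = c(4.5, 3.9), Agr = c(4.8, 4.2), Cons = c(4.1, 4.4),
  Emot = c(4.7, 4.0), Open = c(4.6, 4.3)
)

toy_final <- toy_wide %>%
  pivot_longer(cols = Extr:Open, names_to = "axis", values_to = "prediction") %>%
  arrange(vlogId, axis) %>%
  unite(Id, vlogId, axis) %>%      # default separator is "_"
  rename(Expected = prediction)

print(toy_final)
```

Two vloggers times five traits give ten rows, with IDs such as `VLOG1_Agr`.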

#### 8.4 Writing output to .csv file

The resulting dataframe is written to a .csv file.


In [None]:
# Write to csv
write_csv(testset_pred_long_final_11, file = "predictions_model_11.csv")

# Checking directory
dir()
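
Before uploading, it can be worth checking that the written file has the expected shape. The sketch below is one way to do that on a toy data frame. It uses base R `write.csv()` and `read.csv()` with a temporary file instead of `write_csv()` so that it runs without extra packages. The specific checks reflect our assumption about what a valid submission looks like, not an official specification.

```r
# Toy submission with made-up values
submission <- data.frame(
  Id       = c("VLOG1_Agr", "VLOG1_Extr"),
  Expected = c(4.8, 4.5)
)

# Write to a temporary file and read it back
path <- tempfile(fileext = ".csv")
write.csv(submission, path, row.names = FALSE)
check <- read.csv(path, stringsAsFactors = FALSE)

# Basic structural checks on the file as it would be uploaded
checks <- c(
  columns_ok = identical(names(check), c("Id", "Expected")),
  ids_unique = !any(duplicated(check$Id)),
  no_missing = !anyNA(check$Expected)
)
print(checks)
```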

---

## 9 Visualizations

To visualize our results, we used a heatmap of the RMSE values per model and personality trait. To do so, we first created a data frame with one RMSE value for each combination of model and trait.

In [None]:
# dataframe containing the RMSE for each model (that contains all personality axes) on each personality trait
# The RMSE vector below is ordered model by model (all five traits of Model 1, then Model 2, ...),
# so the trait labels cycle fastest (times = 6) and each model label is repeated five times (each = 5)
rmse_data <- data.frame(
  Variable = rep(c("Extroversion", "Agreeableness", "Conscientiousness", "Emotional stability", "Openness"), times = 6),
  Model = rep(c("Model1", "Model2", "Model3", "Model4", "Model10", "Model11"), each = 5),
  RMSE = c(
    rmse_m1_extr, rmse_m1_agr, rmse_m1_cons, rmse_m1_emot, rmse_m1_open,
    rmse_m2_extr, rmse_m2_agr, rmse_m2_cons, rmse_m2_emot, rmse_m2_open,
    rmse_m3_extr, rmse_m3_agr, rmse_m3_cons, rmse_m3_emot, rmse_m3_open,
    rmse_m4_extr, rmse_m4_agr, rmse_m4_cons, rmse_m4_emot, rmse_m4_open,
    rmse_m10_extr, rmse_m10_agr, rmse_m10_cons, rmse_m10_emot, rmse_m10_open,
    rmse_m11_extr, rmse_m11_agr, rmse_m11_cons, rmse_m11_emot, rmse_m11_open))

model_order <- c("Model1", "Model2", "Model3", "Model4", "Model10", "Model11")

rmse_data$Model <- factor(rmse_data$Model, levels = model_order)

# heatmap
heatmap_plot <- ggplot(rmse_data, aes(x = Model, y = Variable, fill = RMSE)) +
  geom_tile() +
  labs(title = 'RMSE of Different "Full" Model Variables on the training data', fill = "RMSE") +
  scale_fill_gradient(low = "red", high = "white") +
  geom_text(aes(label = round(RMSE, 2))) +
  theme_minimal()
heatmap_plot

# saving the plot
ggsave("heatmap.jpg", plot = heatmap_plot)
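
When a long data frame like this is assembled by hand, the label columns have to follow the same order as the value vector. The following minimal sketch, with two hypothetical models, two traits, and made-up RMSE values, shows how `rep(..., each = )` and `rep(..., times = )` line up with a vector that lists all traits of one model before moving on to the next model.

```r
traits <- c("Extr", "Agr")
models <- c("Model1", "Model2")

# Values listed model by model: both traits of Model1 first, then Model2 (invented numbers)
rmse_values <- c(m1_extr = 0.41, m1_agr = 0.38, m2_extr = 0.43, m2_agr = 0.39)

toy <- data.frame(
  Model    = rep(models, each = length(traits)),   # Model1, Model1, Model2, Model2
  Variable = rep(traits, times = length(models)),  # Extr, Agr, Extr, Agr
  RMSE     = unname(rmse_values)
)
print(toy)
```

A quick spot check, such as looking up the row for `Model2` and `Extr` and comparing it with `m2_extr`, confirms that labels and values match.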

Furthermore, we also visualized the RMSE values for the models containing all personality axes using a barplot. Again, we first created a dataframe with RMSE values for each of these models before visualizing it.

In [None]:
# dataframe containing the RMSE for each of these models 
rmse_data_2 <- data.frame(
  Models = factor(c("Model 1", "Model 2", "Model 3", "Model 4", "Model 10", "Model 11"),
                  levels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 10", "Model 11")),
  y = c(rmse_m1, rmse_m2, rmse_m3, rmse_m4, rmse_m10, rmse_m11))

# barplot
bar_plot <- ggplot(rmse_data_2, aes(x = Models, y = y, fill = Models)) +
  geom_bar(stat = "identity") +
  labs(
    title = 'RMSE of Different "Full" Models on the training data',
    x = "Models",
    y = "RMSE"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0.7, 0.85)) +
  geom_text(aes(label = round(y, 3)), vjust = -0.5, size = 3)

bar_plot

# saving the barplot
ggsave("barplot.jpg", plot = bar_plot)

---

## 10 Collaborators

* Sophia Lorenz - Conceptual development, pre-processing, features, predictions, testing, model comparisons, visualizations, text & code style, final check-ups 

* Romy Leferink - Conceptual development, pre-processing, literature research, features, predictions, testing, model comparisons, visualizations, coding/editing, text & code style, final check-ups

* Bence András Marosi - Conceptual development, pre-processing, features, predictions, testing, model comparisons, visualizations, coding/editing, final check-ups

---

## 11 References

Casper van Tongeren, Enrico Erler, Raoul. (2022). Syllable counting model. Kaggle. https://kaggle.com/competitions/syllable-counting-model

dan_vdmeer, Dave Leitritz, Raoul. (2023). BDA 2023 Profiling Personality. Kaggle. https://kaggle.com/competitions/bda-2023-profiling-personality

Farnadi, G., Sitaraman, G., Rohani, M., Kosinski, M., Stillwell, D., Moens, M., Davalos, S., & De Cock, M. (2014). How are you doing? : emotions and personality in Facebook. EMPIRE 2014, 1181, 45–56.

Maharani, W., & Effendy, V. (2022). Big five personality prediction based in Indonesian tweets using machine learning methods. International Journal of Electrical and Computer Engineering, 12(2), 1973–1981. https://doi.org/10.11591/ijece.v12i2.pp1973-1981

Biel, J.-I., & Gatica-Perez, D. (2013). The YouTube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs. IEEE Transactions on Multimedia, 15(1), 41–55.

Biel, J.-I., Tsiminaki, V., Dines, J., & Gatica-Perez, D. (2013). Hi YouTube!: Personality impressions and verbal content in social video. Proceedings of the 15th ACM on International Conference on Multimodal Interaction, 119–126.

Related literature 

J.-I. Biel and D. Gatica-Perez, “The YouTube Lens: Crowdsourced Personality Impressions
and Audiovisual Analysis of Vlogs” in IEEE Transactions on Multimedia, Vol. 15, No. 1,
pp. 41-55, Jan. 2013.

J.-I. Biel and D. Gatica-Perez, “VlogSense: Conversational Behavior and Social Attention
in YouTube” in ACM Transactions on Multimedia Computing, Communications, and Applications,
Special Issue on Social Media, Oct. 2011.

J.-I. Biel, V. Tsiminaki, J. Dines and D. Gatica-Perez, “Hi YouTube! What verbal content
reveals in social video” in Proceedings International Conference on Multimodal Interaction
(ICMI), Sydney, Dec. 2013.

J.-I. Biel, L. Teijeiro-Mosquera and D. Gatica-Perez, “FaceTube: Predicting Personality from
Facial Expressions of Emotion in Online Conversational Video” in Proceedings International
Conference on Multimodal Interaction (ICMI), Santa Monica, Oct. 2012.

J.-I. Biel and D. Gatica-Perez, “The Good, the Bad, and the Angry: Analyzing Crowdsourced
Impressions of Vloggers” in Proceedings of AAAI International Conference on Weblogs and
Social Media (ICWSM), Dublin, June 2012.

J.-I. Biel, O. Aran, and D. Gatica-Perez, “You Are Known by How You Vlog: Personality
Impressions and Nonverbal Behavior in YouTube” in Proc. AAAI Int. Conf. on Weblogs
and Social Media (ICWSM), Barcelona, July 2011.

J.-I. Biel and D. Gatica-Perez, “Vlogcast Yourself: Nonverbal Behavior and Attention in
Social