In [None]:
library(tidyverse) 
library(tidytext)

# Input data files are available in the read-only "../input/" directory
# For example, running this cell (by clicking ▶️, run or pressing Shift+Enter) will list 
# all files under the "../input/" directory

list.files(path = "../input")

There are three .csv files in the directory structure:

In [None]:
directory_content = list.files("../input/bda-2022-personality-profiling/youtube-personality", full.names = TRUE)
print(directory_content)

In addition there's a "transcript" folder (see number \[2\] in the output above) in which the actual video transcripts are stored in `.txt` files. 

Store these file paths in variables for easy reference later on:

In [None]:
# Path to the transcripts directory with transcript .txt files
path_to_transcripts = directory_content[2] 

# .csv filenames (see output above)
AudioVisual_file    = directory_content[3]
Gender_file         = directory_content[4]
Personality_file    = directory_content[5]

# 1. Import the data

We'll import

- Transcripts
- Personality scores
- Gender
- Audiovisuals

## 1.1 Importing transcripts

The transcript text files are stored in the subfolder 'transcripts'. They can be listed with the following commands:

In [None]:
transcript_files = list.files(path_to_transcripts, full.names = TRUE) 

print(head(transcript_files))

The transcript file names encode the vlogger ID that you will need for joining information from the different data frames. A clean way to extract the vlogger IDs from the names is to use the function `basename()` and remove the file extension ".txt".

In [None]:
vlogId = basename(transcript_files)
vlogId = str_replace(vlogId, pattern = "\\.txt$", replacement = "")  # escaped dot: match a literal "."
head(vlogId)
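
As a small illustration of the ID extraction, the following sketch uses base R on made-up paths (the paths and the string `"VLOGtxt"` are assumptions chosen for the example). It also shows why the dot in the pattern should be escaped: an unescaped `.` matches any character.

```r
# Hypothetical file paths, for illustration only
paths <- c("../input/transcripts/VLOG1.txt", "../input/transcripts/VLOG22.txt")

# basename() drops the directory part; "\\." matches a literal dot
ids <- sub("\\.txt$", "", basename(paths))
print(ids)

# An unescaped "." matches any character, so more than the extension can be removed
unescaped <- sub(".txt$", "", "VLOGtxt")    # also strips the "G"
escaped   <- sub("\\.txt$", "", "VLOGtxt")  # no ".txt" extension, string unchanged
print(c(unescaped, escaped))
```

If the extension is not known in advance, `tools::file_path_sans_ext()` is another option.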

To include features extracted from the transcript texts you will have to read the text from files and store them in a data frame. For this, you will need the full file paths as stored in `transcript_files`.

Here are some tips to do that programmatically

- use either a `for` loop, the `sapply()` function, or the `map_chr()` from the `tidyverse`
- don't forget to also store `vlogId` extracted with the code above 

We will use the `map_chr()` function here:

In [None]:
transcripts_df = tibble(
    
    # vlogId connects each transcript to a vlogger
    vlogId=vlogId,
    
    # Read the transcript text from each file and store it as a single string
    Text = map_chr(transcript_files, ~ paste(readLines(.x), collapse = "\n")), 
    
    # `filename` keeps track of the specific video transcript
    filename = transcript_files
)

In [None]:
transcripts_df %>% 
    head(2)
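
The tips above also mention `sapply()`. As a sketch of that route, the example below uses the stricter `vapply()` on two temporary files that stand in for transcripts (the file names and contents are made up for the example).

```r
# Create two small temporary files that stand in for transcripts
tmp_dir <- tempfile("transcripts_")
dir.create(tmp_dir)
writeLines(c("hello there", "second line"), file.path(tmp_dir, "VLOG1.txt"))
writeLines("just one line", file.path(tmp_dir, "VLOG2.txt"))

files <- list.files(tmp_dir, full.names = TRUE)

# vapply() checks that every file yields exactly one character string
texts <- vapply(
    files,
    function(f) paste(readLines(f), collapse = "\n"),
    FUN.VALUE = character(1),
    USE.NAMES = FALSE
)

toy_df <- data.frame(
    vlogId = sub("\\.txt$", "", basename(files)),
    Text = texts,
    stringsAsFactors = FALSE
)
print(toy_df)
```

Compared with `sapply()`, `vapply()` fails loudly if a file does not produce a single string, which makes silent shape changes less likely.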

## 1.2 Import personality scores

The other data files can be read in with `read_delim` (not `read_csv` because the files are not actually comma separated). For instance, the following should work:

In [None]:
# Import the Personality scores
pers_df = read_delim(Personality_file, delim=" ")

In [None]:
head(pers_df)

## 1.3 Import gender

Gender info is stored in a separate `.csv` which is also delimited with a space. This file doesn't have column names, so we have to add them ourselves:

In [None]:
gender_df = read.delim(Gender_file, head=FALSE, sep=" ", skip = 2)


# Add column names
names(gender_df) = c('vlogId', 'gender')


head(gender_df)

## 1.4 Merging the `gender` and `pers` dataframes

Obviously, we want all the information in a single tidy data frame. While the built-in R function `merge()` can do that, the tidyverse (through `dplyr`) has a number of more versatile and consistent functions called `left_join`, `right_join`, `inner_join`, `full_join`, and `anti_join`. We'll use `left_join` here to merge the gender and personality data frames:

In [None]:
vlogger_df = left_join(gender_df, pers_df, by='vlogId')

head(vlogger_df) # VLOG8 has missing personality scores: those should be predicted

Note that some rows, like row 5, have `NA`s for the personality scores. This is because the row corresponds to the vlogger with vlogId `VLOG8`, who is part of the test set. You still have to split `vlogger_df` into the training and test set, as shown below.

We leave the `transcripts_df` data frame separate for now, because you will first have to extract features from the transcripts. Once you have those features in a tidy data frame, including a `vlogId` column, you can refer to this `left_join` example to merge your features with `vlogger_df` into one single tidy data frame.
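
To make the join behaviour concrete, here is a minimal sketch on toy data (the IDs and scores are made up). It shows that `left_join` keeps unmatched rows and fills them with `NA`, and that `anti_join` lists exactly those unmatched rows.

```r
library(dplyr)

# Toy data: VLOG3 has no personality scores, like a test-set vlogger
toy_gender <- data.frame(vlogId = c("VLOG1", "VLOG2", "VLOG3"),
                         gender = c("Female", "Male", "Female"))
toy_pers   <- data.frame(vlogId = c("VLOG1", "VLOG2"),
                         Extr   = c(4.9, 5.3))

# left_join() keeps every row of the left table and fills gaps with NA
toy_vlogger <- left_join(toy_gender, toy_pers, by = "vlogId")
print(toy_vlogger)

# anti_join() returns the left-table rows without a match on the right
toy_missing <- anti_join(toy_gender, toy_pers, by = "vlogId")
print(toy_missing)
```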

# 2. Feature extraction from transcript texts

Here you will develop the code that extracts features from the transcript texts using `tidytext`. Look at [Introducing Text Analytics](https://www.kaggle.com/datasniffer/introducing-text-analytics-personality-from-text) to see how you should do this.

## Foreword:
We give a rationale for each feature below, but we deliberately collected a broad set of plausible features rather than a small hand-picked one. For the final feature selection we relied on a statistical procedure (stepwise regression) rather than on our intuition, because we consider that more trustworthy and useful. We therefore gathered many candidate features and left it to the statistical selection to decide which ones to keep.

First, we initialized the required dataframes, tokenizing the text and preparing the final dataframe.

In [None]:
# Here goes YOUR CODE to compute the dataframe `transcript_features_df`

# Tokenization of vlog text 
token_df <- transcripts_df %>%
    unnest_tokens(word, Text, token = "words") 


# Initialize 'transcript_features_df` dataframe as tibble to use tidy framework 
transcript_features_df <- vlogger_df %>%
    select(vlogId) %>%
    tibble()


head(token_df)
head(transcript_features_df)
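
For readers new to `tidytext`, the following sketch shows what word tokenization does to a tiny made-up data frame: `unnest_tokens()` produces one row per word, lowercases the text, and strips punctuation.

```r
library(dplyr)
library(tidytext)

# Made-up transcripts, for illustration only
toy_transcripts <- data.frame(
    vlogId = c("VLOG1", "VLOG2"),
    Text   = c("Hello, world! Hello again.", "Short vlog."),
    stringsAsFactors = FALSE
)

# One row per word; the vlogId column is carried along
toy_tokens <- toy_transcripts %>%
    unnest_tokens(word, Text, token = "words")

print(toy_tokens)
```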

## Feature: Sentiment (NRC, AFINN, BING)

The first feature we created is a sentiment analysis of the text using three distinct sentiment lexicons: AFINN, bing, and nrc. From nrc we also extracted the individual emotions to get a more specific picture of the emotions conveyed. The resulting features cover words with positive and negative sentiment as well as the emotions anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The emotions and sentiment in a vlog may express the attitudes and opinions of the vlogger, and so reflect their way of thinking and their perspective. Since these characteristics usually differ across personality traits, we expect these features to benefit our predictions.

In [None]:
# FEATURE: "Sentiment" 
# Sub-Feature: Bing
transcript_features_df <- token_df %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    count(vlogId, sentiment) %>%
    pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
    mutate(sentiment_bing = positive - negative, .keep = "unused") %>%  # the .keep argument enables us to remove unnecessary columns.
    right_join(transcript_features_df, by = "vlogId")
# As with most features, we tried to directly join the created feature to the main dataframe "transcript_features_df" by right-joining them.


# Sub-Feature: AFINN
# AFINN Code from "Introducing Text Analytics: Personality from Text" to receive keyword list AFINN
download.file("http://www2.imm.dtu.dk/pubdb/edoc/imm6010.zip","afinn.zip")
unzip("afinn.zip")
afinn = read.delim("AFINN/AFINN-111.txt", sep="\t", col.names = c("word","score"), stringsAsFactors = FALSE)

transcript_features_df <- token_df %>%
    inner_join(afinn, by = "word") %>%
    group_by(vlogId) %>%
    summarise(sentiment_afinn = sum(score)) %>%
    right_join(transcript_features_df, by = "vlogId")


# Sub-Feature: NRC
# NRC Code from "Introducing Text Analytics: Personality from Text" to receive keyword list NRC
load_nrc = function() {
    if (!file.exists('nrc.txt'))
        download.file("https://www.dropbox.com/s/yo5o476zk8j5ujg/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt?dl=1","nrc.txt")
    nrc = read.table("nrc.txt", col.names=c('word','sentiment','applies'), stringsAsFactors = FALSE)
    nrc %>% 
        filter(applies == 1) %>% 
        select(-applies)
}

transcript_features_df <- token_df %>%
    inner_join(load_nrc(), by = "word") %>%
    count(vlogId, sentiment) %>%
    pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
    mutate(sentiment_nrc = positive - negative, .keep = "unused") %>%
    right_join(transcript_features_df, by = "vlogId")


transcript_features_df %>%
    select(anger:sentiment_bing) %>%
    head()
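
The count, pivot, and subtract pattern used for the bing and nrc scores can be traced on a toy example. The lexicon and tokens below are made up; the column name `net_sentiment` is our own choice for the sketch.

```r
library(dplyr)
library(tidyr)

# Hypothetical mini lexicon and tokens, for illustration only
toy_lexicon <- data.frame(word      = c("good", "great", "bad"),
                          sentiment = c("positive", "positive", "negative"))
toy_tokens  <- data.frame(vlogId = c("VLOG1", "VLOG1", "VLOG1", "VLOG2", "VLOG2"),
                          word   = c("good", "great", "bad", "bad", "table"))

toy_sentiment <- toy_tokens %>%
    inner_join(toy_lexicon, by = "word") %>%   # drops words not in the lexicon
    count(vlogId, sentiment) %>%               # matches per vlogger and class
    pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
    mutate(net_sentiment = positive - negative)

print(toy_sentiment)
```

Note that a net count like this grows with transcript length; dividing by the number of words per vlogger would be one possible way to normalize it.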

## Feature: Average Word and Average Sentence Length.

We also calculated the average word length and the average sentence length in each vlogger's transcript. The idea behind these features is that individuals high in conscientiousness (self-disciplined and self-controlled with respect to their goals and duties) tend to use longer, more difficult words and more elaborate sentences to express themselves. Longer sentences reflect not only the difficulty of a sentence but also the use of language in general. We would expect individuals high in conscientiousness to have better rhythmic features in their speech, with more descriptive and explanatory sentences. 

Average word length was measured as the average number of characters per word. Average sentence length was measured as the average number of words per sentence.

In [None]:
# FEATURE: "Average Word Length" 
transcript_features_df <- token_df %>%
    group_by(vlogId) %>%
    summarise(avg_word_len = nchar(word) %>%
                                mean()) %>%
    right_join(transcript_features_df, by = "vlogId")


# FEATURE: "Average Sentence Length" 
transcript_features_df <- transcripts_df %>%
    unnest_tokens(sentences, Text, token = "sentences") %>%
    mutate(sentence_num = row_number()) %>%
    unnest_tokens(words, sentences, token = "words") %>% 
    group_by(vlogId) %>%
    count(sentence_num) %>%
    summarise(avg_sen_len = mean(n)) %>%
    right_join(transcript_features_df, by = "vlogId")
# In order to count the words per sentence, we first unnested the text into sentences, gave them numbers, and then unnested into words.

transcript_features_df %>%
    select(avg_word_len:avg_sen_len) %>%
    head()
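
The two definitions can be checked by hand on a short made-up text. This sketch uses base R with a deliberately crude sentence splitter (an assumption for the example; `unnest_tokens()` uses a more careful tokenizer).

```r
toy_text <- "I love vlogging. It is fun! Do you agree?"

# Crude sentence split at ., ! or ? followed by optional whitespace
sentences <- unlist(strsplit(toy_text, "[.!?]\\s*"))

# Split each sentence into lowercase words
words_per_sentence <- lapply(sentences, function(s) {
    unlist(strsplit(tolower(s), "\\s+"))
})

avg_sen_len  <- mean(lengths(words_per_sentence))        # words per sentence
avg_word_len <- mean(nchar(unlist(words_per_sentence)))  # characters per word

print(c(avg_sen_len = avg_sen_len, avg_word_len = avg_word_len))
```

Here every sentence has three words, and the nine words have 30 characters in total.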

## Feature: Stopword / Adjective / Linking word / Pronouns / Unique word proportions

We then calculated the proportion of each vlogger's transcript that falls into various word lists. Because the absolute number of words is generally not a good indicator (it grows with the length of the transcript), we used proportions to characterize word use in each vlogger's speech.

Stopwords are function words that occur frequently in natural speech and carry little meaning on their own. Their share can indicate how descriptive a person is and how long it takes them to get to the main point. It may reflect conscientiousness, as it shows how organized a person is with their words, or openness to experience and extraversion, as it reflects confidence in their speech.

Adjectives and linking words also indicate how descriptive a person is. Adjectives in particular show expressiveness, that is, how frequently a person uses explanatory words. The adjective list was taken from a public GitHub gist and the linking-word list from the Cambridge Dictionary.

The use of pronouns, on the other hand, indicates the subjectivity of the speech: how frequently a person refers to themselves or to others in their vlog. We distinguished first-person singular, first-person plural, second-person, and third-person pronouns to see which pronouns are correlated with which personality trait. Pronoun usage also reflects a person's perspective. For instance, a more agreeable person might use more third-person pronouns to express the ideas and opinions of others.

Lastly, the proportion of unique words (computed after removing stopwords) measures the diversity of a person's language. We expect it to be related to conscientiousness, as it suggests that a person is conscious of their words and has control over their speech.

In [None]:
# FEATURE: "Different Word Proportions" 
words_df <- token_df %>%
    group_by(vlogId) %>%
    summarise(w_num = n())

# Most of the features below were created in a similar fashion: 
# acquire a word list, then semi-join it with the token dataframe to keep only tokens of that word type.
# Afterwards, most were joined with words_df to calculate proportions from the absolute number of words per text.

# Sub-Feature: "Stopword Proportions" 
stopwords_df <- token_df %>%
    semi_join(get_stopwords(), by = "word") %>%
    group_by(vlogId) %>%
    summarise(s_num = n())

transcript_features_df <- words_df %>%
    left_join(stopwords_df, by = "vlogId") %>%
    mutate(stop_prop = s_num/w_num, .keep = "unused") %>%
    right_join(transcript_features_df, by = "vlogId")


# Sub-Feature: "Adjectives Proportions" [List was acquired from this github repo: https://gist.github.com/hugsy/8910dc78d208e40de42deb29e62df913]
adjective_list <- read.delim(url("https://gist.github.com/hugsy/8910dc78d208e40de42deb29e62df913/raw/eec99c5597a73f6a9240cab26965a8609fa0f6ea/english-adjectives.txt"),
                             header = FALSE)
names(adjective_list) <- "word"

adjectives_df <- token_df %>%
    semi_join(adjective_list, by = "word") %>%
    group_by(vlogId) %>%
    summarise(a_num = n())

transcript_features_df <- words_df %>%
    left_join(adjectives_df, by = "vlogId") %>%
    mutate(adj_prop = a_num/w_num, .keep = "unused") %>%
    right_join(transcript_features_df, by = "vlogId")


# Sub-Feature: "Linking Word Proportion" [List was acquired from https://dictionary.cambridge.org/grammar/british-grammar/conjunctions-and-linking-words]
linking_words <- tibble(word = c("accordingly",  "consequently", "for", "forthwith", "hence", "then", "therefore", "thereupon", "thus",
                                 "absolutely", "chiefly", "clearly", "definitely", "especially", "even", "importantly", "indeed", "obviously", 
                                 "particularly", "surprisingly", "truly", "additionally", "also", "and", "besides", "finally", "first", 
                                 "further", "furthermore", "last", "moreover", "not only", "second", "secondly", "thirdly", "third",
                                 "because", "so", "example", "instance", "namely", "alternatively", "contrarily", "contrary", "conversely", 
                                 "however", "nevertheless", "contrast", "instead", "nonetheless", "rather", "nor", "unlike", "while", 
                                 "whereas", "yet","whilst", "alike", "both", "compare", "equally", "likewise", "altogether", "generally", 
                                 "conclusion", "shortly", "summary", "overall", "but", "or", "as"))

linking_words_df <- token_df %>%
    semi_join(linking_words, by = "word") %>%
    group_by(vlogId) %>%
    summarize(l_num = n())

transcript_features_df <- words_df %>%
    left_join(linking_words_df, by = "vlogId") %>%
    mutate(link_prop = l_num/w_num, .keep = "unused") %>%
    right_join(transcript_features_df, by = "vlogId")


# Sub-Feature: "Unique Word Proportion"
transcript_features_df <- token_df %>%
    anti_join(get_stopwords(), by = "word") %>%
    group_by(vlogId) %>%
    summarise(unique_prop = length(unique(word)) / length(word)) %>%
    right_join(transcript_features_df, by = "vlogId")


# Sub-Feature: "Pronouns Proportion"
fir_pronouns <- tibble(word = c("i'm", "i'll", "i","my","me","myself","mine"))
other_fir_pronouns <- tibble(word = c("we","us","ours","our","ourselves"))
sec_pronouns <- tibble(word = c("you","yourself","yourselves","your","yours"))
thi_pronouns <- tibble(word = c("she","he","her","him","it","they","them","himself","herself","itself","themselves","his","her","hers","its","their"))

pronouns_df1 <- token_df %>%
    semi_join(fir_pronouns, by = "word") %>%
    group_by(vlogId) %>%
    summarize(p1_num = n())

pronouns_df2 <- token_df %>%
    semi_join(other_fir_pronouns, by = "word") %>%
    group_by(vlogId) %>%
    summarize(p2_num = n())

pronouns_df3 <- token_df %>%
    semi_join(sec_pronouns, by = "word") %>%
    group_by(vlogId) %>%
    summarize(p3_num = n())

pronouns_df4 <- token_df %>%
    semi_join(thi_pronouns, by = "word") %>%
    group_by(vlogId) %>%
    summarize(p4_num = n())

transcript_features_df <- words_df %>%
    left_join(pronouns_df1, by = "vlogId") %>%
    mutate(pro1_prop = p1_num/w_num, .keep = "unused") %>%
    right_join(transcript_features_df, by = "vlogId")

transcript_features_df <- words_df %>%
    left_join(pronouns_df2,by = "vlogId") %>%
    mutate(pro2_prop = p2_num/w_num, .keep = "unused") %>%
    right_join(transcript_features_df, by = "vlogId")

transcript_features_df <- words_df %>%
    left_join(pronouns_df3, by = "vlogId") %>%
    mutate(pro3_prop = p3_num/w_num, .keep = "unused") %>%
    right_join(transcript_features_df, by = "vlogId")

transcript_features_df <- words_df %>%
    left_join(pronouns_df4, by = "vlogId") %>%
    mutate(pro4_prop = p4_num/w_num, .keep = "unused") %>%
    right_join(transcript_features_df, by = "vlogId")


transcript_features_df %>%
    select(stop_prop:pro4_prop) %>%
    head()
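
The proportion features above all follow the same steps, so they could be wrapped in a small helper. The function `word_proportion()` below is hypothetical (it is not part of the original notebook) and is shown on made-up tokens. Unlike the semi-join approach, it returns 0 rather than `NA` for vloggers without any match.

```r
library(dplyr)

# Hypothetical helper: share of tokens per vlogger that appear in `word_list`
word_proportion <- function(tokens, word_list) {
    tokens %>%
        group_by(vlogId) %>%
        summarise(prop = mean(word %in% word_list), .groups = "drop")
}

# Made-up tokens, for illustration only
toy_tokens <- data.frame(
    vlogId = c("VLOG1", "VLOG1", "VLOG1", "VLOG1", "VLOG2", "VLOG2"),
    word   = c("i", "like", "my", "dog", "they", "left")
)

first_person <- word_proportion(toy_tokens, c("i", "my", "me"))
print(first_person)
```

In this toy data, two of the four `VLOG1` tokens are first-person pronouns and none of the `VLOG2` tokens are.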

## Feature: Audiovisual features

According to Biel and Gatica-Perez (2012), audiovisual features of YouTube vlogs are useful for identifying and predicting the personality of vloggers. We decided to use features from audio cues, such as pitch, voicing rate, energy (loudness of the voice), voice segmentation, and time speaking, and from video cues, such as looking turns and the vlogger's gaze. We selected these features based on Tables V and VI of Biel and Gatica-Perez (2012), which report Spearman's correlation coefficients between audio and video cues and the personality traits.

Speaking activity cues generally had significant correlations with personality. They indicate the fluency and talkativeness of an individual, which possibly relate to extraversion and conscientiousness. Prosodic cues such as intonation and rhythm also showed high correlations with many of the personality traits, as they indicate the intensity and utilization of the language. For instance, the energy (loudness) of the voice was found to have significant correlations with conscientiousness, extraversion, and agreeableness. This suggests that an individual with a lot of energy in their voice can control their voice while also showing excitement.

From the visual features, we used some of the looking activity and pose features, as they indicate how much vloggers look at the camera and show confidence in their facial expression. We also used some visual activity features that reflect excitement and are related to some of the personality traits. Some measures were excluded because of non-significant relations to the Big Five.

In [None]:
# FEATURE: "Audiovisual" 
transcript_features_df <- read.delim(AudioVisual_file, head = TRUE, sep = " ") %>%
    tibble() %>%
    select(-c(mean.d.energy, mean.spec.entropy, sd.spec.entropy)) %>%
    right_join(transcript_features_df, by = "vlogId")


transcript_features_df %>%
    select(mean.pitch:hogv.cogC) %>%
    head()

Some features contained NAs (for example, when a transcript matched none of the words in a word list), which we replaced with 0 at the respective positions.

In [None]:
# Checking for missing data, where it is, and replacing it with 0
which(is.na(transcript_features_df), arr.ind = TRUE) # checking which rows are affected to get a better picture
transcript_features_df[is.na(transcript_features_df)] <- 0

head(transcript_features_df) 
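
As a slightly more defensive variant, the replacement can be restricted to numeric columns so that identifier columns are never touched. This is a sketch on made-up values, not the notebook's own code.

```r
# Made-up feature table: NA means no word from the list was matched
toy_features <- data.frame(
    vlogId    = c("VLOG1", "VLOG2"),
    adj_prop  = c(0.12, NA),
    pro2_prop = c(NA, 0.03),
    stringsAsFactors = FALSE
)

# How many values are missing per column?
print(colSums(is.na(toy_features)))

# Replace NAs with 0 in numeric columns only
toy_filled <- toy_features
num_cols <- names(toy_filled)[vapply(toy_filled, is.numeric, logical(1))]
for (col in num_cols) {
    toy_filled[[col]][is.na(toy_filled[[col]])] <- 0
}

print(toy_filled)
```

Replacing with 0 is natural for counts and proportions; for other kinds of features (for example audiovisual measurements) a missing value does not necessarily mean zero.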

Once you have computed features from the transcript texts and stored them in a data frame, merge it with the `vlogger_df` dataframe:

## Feature: Gender
The gender feature is included in our analysis because many studies have shown gender differences in personality traits. For instance, Schmitt et al. (2017) found that perceived gender roles, socialization, and sociostructural power affect individuals' behavior as well as their personalities. Cultures with larger discrepancies between gender roles and socialization in particular seemed to have larger gender differences in personality. They also found evidence for differences in conscientiousness and neuroticism. We therefore think it is essential to include gender as a feature to improve the model predictions. Gender was already included in `vlogger_df`.

In [None]:
# YOUR CODE to merge `vlogger_df` with `transcript_features_df`
vlogger_df <- vlogger_df %>%
    left_join(transcript_features_df, by = "vlogId")
head(vlogger_df)

# 3. Predictive model

Next you fit your predictive model(s), for instance a linear regression model that uses only `gender` as a feature.
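
A gender-only baseline of that kind might look like the sketch below. The data are simulated stand-ins for `vlogger_df`, only two traits are included to keep it short, and the name `fit_mlm` is borrowed from section 4.2; the exact model is our assumption.

```r
set.seed(1)

# Simulated stand-in for `vlogger_df` (values are made up)
toy_vlogger <- data.frame(
    gender = rep(c("Female", "Male"), each = 10),
    Extr   = rnorm(20, mean = 4.5),
    Agr    = rnorm(20, mean = 4.7)
)

# cbind() on the left-hand side fits one regression per trait in a single call
fit_mlm <- lm(cbind(Extr, Agr) ~ gender, data = toy_vlogger)

# One column of coefficients per trait
print(coef(fit_mlm))
```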

Initially, we inspected scatterplots between predictors and response variables to visualize the data, get an idea of the variables, and identify possible transformation targets.

The code had to be commented out because the figure margins of the plot exceeded the notebook's limit.

In [None]:
# pairs(vlogger_df[,-c(1:2)], pch = 1, lower.panel = NULL)

### 3.1 Check for multicollinearity

Next, we checked for multicollinearity to identify features that are strongly correlated with each other. To assess this, we computed the variance inflation factor (VIF). 
Our first idea was to exclude all variables with a VIF higher than 10. However, we later realised that multicollinearity is less problematic than we had assumed: as our goal is prediction, we want to retain as much information from the features as possible. Furthermore, we plan to run a stepwise selection to reduce variance and complexity while keeping predictive power. 

In the end we therefore decided not to remove the variables with a high VIF, also because almost none of the features exceeded a VIF of 10.

In [None]:
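# Custom VIF computation: regress each feature on all other features
# and compute VIF = 1 / (1 - R^2).
# Columns 1:7 are skipped (presumably vlogId, gender and the five personality scores).
feature_names <- colnames(vlogger_df)[-c(1:7)]

for (feature in feature_names) {

    # Build "feature ~ all other features" as a formula
    lmp <- paste(feature, "~", paste(setdiff(feature_names, feature), collapse = "+"))
    fit <- lm(formula = as.formula(lmp), data = vlogger_df)
    VIF <- 1 / (1 - summary(fit)$r.squared)

    print(paste(feature, VIF, sep = ": "))
}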
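
The same computation can be packaged as a function and tried on a built-in data set. The sketch below uses four columns of `mtcars` purely for illustration; `vif_manual()` is a hypothetical helper name.

```r
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on all other predictors
vif_manual <- function(data) {
    vapply(names(data), function(v) {
        others <- setdiff(names(data), v)
        fit <- lm(reformulate(others, response = v), data = data)
        1 / (1 - summary(fit)$r.squared)
    }, numeric(1))
}

predictors <- mtcars[, c("disp", "hp", "wt", "qsec")]
vifs <- vif_manual(predictors)
print(round(vifs, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others; the value grows as the predictor becomes more predictable from the rest.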

### 3.2 Fit predictive model
We decided to fit a separate model for each personality trait instead of a single joint multivariate model. This simplifies the fitting process, enables a stepwise regression approach, and makes the model summaries clearer.
Additionally, we decided NOT to use personality traits as predictors for each other, as real-life scenarios most likely won't involve this kind of data and we want to predict the traits solely from the text and audiovisual data at hand.

In [None]:
# YOUR CODE to fit your predictive model

# Model Fitting:
fit_lm_extr <- lm(Extr ~ . -vlogId -Agr -Cons -Emot -Open, data = vlogger_df)
fit_lm_agre <- lm(Agr ~ . -vlogId -Extr -Cons -Emot -Open, data = vlogger_df)
fit_lm_cons <- lm(Cons ~ . -vlogId -Agr -Extr -Emot -Open, data = vlogger_df)
fit_lm_emot <- lm(Emot ~ . -vlogId -Agr -Cons -Extr -Open, data = vlogger_df)
fit_lm_open <- lm(Open ~ . -vlogId -Agr -Cons -Emot -Extr, data = vlogger_df)


# Summary of the five model fits
summary(fit_lm_extr)
summary(fit_lm_agre)
summary(fit_lm_cons)
summary(fit_lm_emot)
summary(fit_lm_open)
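
The five nearly identical `lm()` calls could also be generated in a loop. The sketch below does this with `reformulate()` on simulated data; the column names mirror the notebook, but the values and the two predictors are made up.

```r
set.seed(42)
traits <- c("Extr", "Agr", "Cons", "Emot", "Open")

# Simulated stand-in for the merged data frame (values are made up)
toy_df <- data.frame(
    matrix(rnorm(60 * 5), ncol = 5, dimnames = list(NULL, traits)),
    avg_word_len = rnorm(60, mean = 4),
    stop_prop    = runif(60)
)

# Everything that is not a trait is treated as a predictor
predictors <- setdiff(names(toy_df), traits)

# One model per trait, stored in a named list
fits <- lapply(setNames(traits, traits), function(trait) {
    lm(reformulate(predictors, response = trait), data = toy_df)
})

print(formula(fits$Extr))
```

With the real data, an identifier column such as `vlogId` would also have to be removed from `predictors`.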

### 3.3 Non-linearity, outliers & leverage points

After fitting the models, we checked the Residuals vs. Fitted plots to identify possible non-linearity in the data. None of the plots showed a clear non-linear association. 

Furthermore, we used the standardized Residuals vs. Leverage plots to identify potential outliers and high-leverage points (or both!). Although some weak outliers and leverage points can be seen in the plots, we decided not to exclude these observations. We don't have many observations and want to retain as much of our data as possible. We felt comfortable retaining them, as they don't appear to be too extreme and did not change the model fit substantially.

In [None]:
plot(fit_lm_extr, which = c(1,5))
plot(fit_lm_agre, which = c(1,5))
plot(fit_lm_cons, which = c(1,5))
plot(fit_lm_emot, which = c(1,5))
plot(fit_lm_open, which = c(1,5))

### 3.4 Stepwise Regression
We have many predictors, which can lead to high variance (and low bias), i.e. overfitting.

To reduce the complexity of our models, we decided to use a mixed stepwise regression approach with two criteria to identify a reasonable model fit. The mixed approach enables us to use the mechanisms of both forward and backward selection.
* Both AIC and BIC punish models with too many parameters, helping us achieve a simpler model fit. 
* However, BIC's penalty per parameter is log(n) instead of 2. With n = 323 training observations, log(323) ≈ 5.78, so we expect BIC to penalize additional parameters more heavily and to yield a sparser model. This penalty is important, as the AIC fit might be too liberal, ultimately including too many parameters.
* Therefore, we are going for a more parsimonious model, trying to counter potential overfitting and reduce variance (trading with some bias), while keeping predictive power.
* We will apply two stepwise regressions to each model, using AIC and BIC as model selection criteria, respectively. Our goal is to find a model that reduces both measures to an acceptable extent.
* Additionally, the stepwise approach makes sense as the five personality traits might require different combinations of features to be predicted more accurately. 
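
The difference between the two criteria in `step()` comes down to the `k` argument. The sketch below runs both on the built-in `mtcars` data (chosen only for illustration) and derives the sample size from the fitted model with `nobs()` instead of hard-coding it.

```r
full_fit <- lm(mpg ~ ., data = mtcars)

# Number of observations actually used in the fit
n <- nobs(full_fit)

# k = 2 (the default) corresponds to AIC, k = log(n) to BIC
step_aic <- step(full_fit, direction = "both", trace = 0)
step_bic <- step(full_fit, direction = "both", trace = 0, k = log(n))

print(formula(step_aic))
print(formula(step_bic))
```

Using `nobs()` can be safer than a fixed number when the data frame also contains rows with missing responses, because `lm()` drops those rows by default.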

In [None]:
# Stepwise Regression Fitting - AIC (k = 2)
step_lm_extr_A <- step(fit_lm_extr, direction = "both", trace = 0)
step_lm_agre_A <- step(fit_lm_agre, direction = "both", trace = 0)
step_lm_cons_A <- step(fit_lm_cons, direction = "both", trace = 0)
step_lm_emot_A <- step(fit_lm_emot, direction = "both", trace = 0)
step_lm_open_A <- step(fit_lm_open, direction = "both", trace = 0)

# Stepwise Regression Fitting - BIC (k = log(323), as the number of training observations is 323)
step_lm_extr_B <- step(fit_lm_extr, direction = "both", trace = 0, k = log(323))
step_lm_agre_B <- step(fit_lm_agre, direction = "both", trace = 0, k = log(323))
step_lm_cons_B <- step(fit_lm_cons, direction = "both", trace = 0, k = log(323))
step_lm_emot_B <- step(fit_lm_emot, direction = "both", trace = 0, k = log(323))
step_lm_open_B <- step(fit_lm_open, direction = "both", trace = 0, k = log(323))


# Summary of AIC vs. BIC model selection for all personality dimensions
# Extraversion
summary(step_lm_extr_A)
summary(step_lm_extr_B)

# Agreeableness
summary(step_lm_agre_A)
summary(step_lm_agre_B)

# Conscientiousness
summary(step_lm_cons_A)
summary(step_lm_cons_B)

# Neuroticism
summary(step_lm_emot_A)
summary(step_lm_emot_B)

# Openness
summary(step_lm_open_A)
summary(step_lm_open_B)

After fitting the stepwise models, we first inspected the adjusted R^2 to get an intuition of the explained variance. The adjusted R^2 is a corrected version of R^2 that penalizes the inclusion of less relevant predictors, so, unlike R^2, it can decrease when a predictor is added.
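
For reference, the adjustment for a model with an intercept, n observations, and p predictors is 1 - (1 - R^2)(n - 1)/(n - p - 1). The following sketch recomputes it by hand on the built-in `mtcars` data (chosen only for illustration) and compares it with the value reported by `summary()`.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

n <- nobs(fit)               # number of observations
p <- length(coef(fit)) - 1   # number of predictors, excluding the intercept

# Adjusted R^2 computed from the plain R^2
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

print(c(manual = adj_manual, reported = s$adj.r.squared))
```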

As the plots below show, the AIC-selected models have a higher adjusted R^2 than both the initial full models and the BIC-selected models. The model output above is in line with this, as the AIC-selected models retained more predictors than the BIC-selected models. The difference is most noticeable for the dimensions Openness and Neuroticism. 

We do not want to select a model solely based on the adjusted R^2, especially when the AIC-selected models contain considerably more predictors than their BIC-selected counterparts. Furthermore, the differences in adjusted R^2 are small enough to be tolerable. 

In [None]:
r2_df <- tibble(Extraversion = c(summary(fit_lm_extr)$adj.r.squared, summary(step_lm_extr_B)$adj.r.squared, summary(step_lm_extr_A)$adj.r.squared),
                Agreeableness = c(summary(fit_lm_agre)$adj.r.squared, summary(step_lm_agre_B)$adj.r.squared, summary(step_lm_agre_A)$adj.r.squared),
                Conscientiousness = c(summary(fit_lm_cons)$adj.r.squared, summary(step_lm_cons_B)$adj.r.squared, summary(step_lm_cons_A)$adj.r.squared),
                Neuroticism = c(summary(fit_lm_emot)$adj.r.squared, summary(step_lm_emot_B)$adj.r.squared, summary(step_lm_emot_A)$adj.r.squared),
                Openness = c(summary(fit_lm_open)$adj.r.squared, summary(step_lm_open_B)$adj.r.squared, summary(step_lm_open_A)$adj.r.squared),
                Model = c("Full Model", "BIC Model", "AIC Model"))
r2_df$Model <- r2_df$Model %>%
    factor(levels = c("Full Model", "BIC Model", "AIC Model"))

head(r2_df)

for (i in 1:5){
    
    p <- ggplot(r2_df, aes_string(fill = "Model", y = colnames(r2_df)[i], x = "Model")) + 
        geom_bar(position = "dodge", stat = "identity") + 
        labs(title = paste("Comparison of adjusted R^2", colnames(r2_df)[i], sep ="\n"),
             x ="Model", 
             y = "adjusted R^2") +
        scale_y_continuous(limits = c(0, 0.5), breaks = seq(0, 0.5, by = 0.1)) +
        scale_fill_brewer(palette = "Set1", direction = -1) +
        theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
              panel.background = element_blank(), axis.line = element_line(colour = "black"),
              text = element_text(size = 17), plot.title = element_text(hjust = 0.5))
    print(p)
}

To identify the appropriate reduced model, we visualized the AIC and BIC values of the respective stepwise models to find the one that reduces both sufficiently. Each graph compares the models selected by the AIC and by the BIC criterion on these two values. We intentionally did not visualize the AIC and BIC of the original full models, as their substantially higher values would distort the graphs. In all cases, the stepwise regressions reduced both AIC and BIC. 

The below plots of AIC- vs. BIC-selected models show a trend:
* The BIC-selected models have only a slightly higher AIC compared to the models selected by AIC. 
* In contrast, their BIC is considerably lower.
* This trend holds across all five personality dimensions.

As all BIC-selected models have a comparatively low AIC and BIC value, we decided to use those models for our predictions.

Additionally, the AIC-selected models still included many parameters, risking overfitting by including potentially redundant features.
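
To make the values returned by `AIC()` and `BIC()` less of a black box, this sketch recomputes them from the log-likelihood on the built-in `mtcars` data (chosen only for illustration). Both are -2 times the log-likelihood plus a penalty per estimated parameter: 2 for AIC and log(n) for BIC.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(fit)
k  <- attr(ll, "df")   # estimated parameters, including the residual variance
n  <- nobs(fit)

aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(n) * k

print(c(AIC = aic_manual, BIC = bic_manual))
```

The values printed by `step()` come from `extractAIC()`, which differs from these by an additive constant for linear models; comparisons between models fitted to the same data are unaffected.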

In [None]:
stepAB_df <- tibble(Extraversion = c(AIC(step_lm_extr_A), BIC(step_lm_extr_A), AIC(step_lm_extr_B), BIC(step_lm_extr_B)),
                    Agreeableness = c(AIC(step_lm_agre_A), BIC(step_lm_agre_A), AIC(step_lm_agre_B), BIC(step_lm_agre_B)),
                    Conscientiousness = c(AIC(step_lm_cons_A), BIC(step_lm_cons_A), AIC(step_lm_cons_B), BIC(step_lm_cons_B)),
                    Neuroticism = c(AIC(step_lm_emot_A), BIC(step_lm_emot_A), AIC(step_lm_emot_B), BIC(step_lm_emot_B)),
                    Openness = c(AIC(step_lm_open_A), BIC(step_lm_open_A), AIC(step_lm_open_B), BIC(step_lm_open_B)),
                    Criterion = c("AIC", "BIC", "AIC", "BIC"),
                    Model = c("AIC Model", "AIC Model", "BIC Model", "BIC Model"))
head(stepAB_df)

# Draw one grouped bar chart per personality dimension (the first five columns)
for (trait in colnames(stepAB_df)[1:5]) {

    # `.data[[trait]]` replaces the deprecated aes_string() for column names held in a string
    p <- ggplot(stepAB_df, aes(fill = Criterion, y = .data[[trait]], x = Model)) + 
        geom_col(position = "dodge") + 
        labs(title = paste("Comparison of Stepwise Models", trait, sep = "\n"),
             x = "Model", 
             y = "Value") +
        # Note: bars with values above the upper limit of 900 would be dropped
        scale_y_continuous(limits = c(0, 900), breaks = seq(0, 900, by = 100)) +
        scale_fill_brewer(palette = "Set1") +
        theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
              panel.background = element_blank(), axis.line = element_line(colour = "black"),
              text = element_text(size = 17), plot.title = element_text(hjust = 0.5))
    print(p)
}

# 4. Making predictions on the test set

For the competition we have to make **predictions** for the data in the **test set**.

- The predictions will be evaluated by computing the **Root Mean Square Error**:
    - $\displaystyle{RMSE =\sqrt{{1 \over 5n} \sum_{k \in \{cEXT, \ldots, cOPN\}} \sum_{i=1}^n (y_{ik} - \hat y_{ik})^2}}$
    - Here 
        - $y_{ik}$ is the observed value for vlogger $i$ on personality axis $k$
        - $\hat y_{ik}$ is your prediction for vlogger $i$ on personality axis $k$
        - $n$ is the number of vloggers in the test set
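
As a minimal sketch of this formula: averaging the squared errors over all $5n$ cells of a vlogger-by-axis matrix is the same as the double sum divided by $5n$. The function name `rmse_all_axes` and the toy numbers below are made up for illustration and are not part of the competition code.

```r
# RMSE over all personality axes at once.
# observed, predicted: n x 5 matrices (or data frames) with one row per vlogger
rmse_all_axes <- function(observed, predicted) {
  observed  <- as.matrix(observed)
  predicted <- as.matrix(predicted)
  stopifnot(all(dim(observed) == dim(predicted)))
  # mean() over all n * 5 cells equals (1 / 5n) * the double sum in the formula
  sqrt(mean((observed - predicted)^2))
}

# Toy data: 2 vloggers x 5 axes, every prediction off by exactly 0.5
axes      <- c("cEXT", "cAGR", "cCON", "cEMO", "cOPN")
observed  <- matrix(4.0, nrow = 2, ncol = 5, dimnames = list(NULL, axes))
predicted <- observed + 0.5
toy_rmse  <- rmse_all_axes(observed, predicted)
print(toy_rmse)
```

Since the test set scores are hidden, a function like this is only usable on held-out training data.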
        
        
You will have to take the following steps:

1. Extract the test set from the `vlogger_df`
2. Compute predictions for the test set using your model
3. Write those predictions to file in the right format

The following gives code for these steps in order.

## 4.1 The test set

The test set consists of those `vlogId`s that are missing from the personality scores data frame `pers`. They are the rows in `vlogger_df` for which the personality scores are missing:

In [None]:
testset_vloggers = vlogger_df %>% 
    filter(is.na(Extr))

head(testset_vloggers)

## 4.2 Predictions

Continuing the example `fit_mlm` model above: for almost all models we encounter, predictions are made with the `predict()` function.

- A `predict()` method exists for most model-fitting functions we encounter, such as `lm`, `glm`, etc.
    - the first argument should be a model object (`fit_mlm` in the example)
    - the second argument (`newdata`) should be a data frame with the test set
    - optionally, a `type` argument specifies the type of prediction:
      - for an `lm` object, the default `type = "response"` is the one we need
      - for a `glm` object, `type = "link"` (linear predictor) or `type = "response"` ('response' &rarr; probabilities)
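
The difference between the two `glm` prediction types can be sketched on the built-in `mtcars` data. The models and the data frame `new_cars` below are toy assumptions, unrelated to the vlogger models.

```r
# Toy models on mtcars (not the notebook's models)
new_cars <- data.frame(wt = c(2.5, 3.5))

# Linear model: predictions are on the scale of the outcome (mpg)
fit_lm  <- lm(mpg ~ wt, data = mtcars)
pred_lm <- predict(fit_lm, newdata = new_cars)

# Logistic regression: choose between the linear predictor and probabilities
fit_glm   <- glm(am ~ wt, data = mtcars, family = binomial)
pred_link <- predict(fit_glm, newdata = new_cars, type = "link")      # log-odds
pred_prob <- predict(fit_glm, newdata = new_cars, type = "response")  # probabilities

# The inverse logit of the linear predictor gives the probabilities
print(all.equal(plogis(pred_link), pred_prob))
```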


As mentioned above, we have five different models, so we also have to predict five separate times. Our analysis led us to use the stepwise regression models selected by the BIC criterion. Compared to the AIC-selected models, the BIC-selected models showed low AIC and BIC values, considerably reduced complexity, and an acceptable adjusted $R^2$. Each personality dimension has its own prediction model with its own set of features.

In [None]:
# One prediction vector per personality dimension, using the BIC-selected models
pred_extr <- predict(step_lm_extr_B, newdata = testset_vloggers)
pred_agre <- predict(step_lm_agre_B, newdata = testset_vloggers)
pred_cons <- predict(step_lm_cons_B, newdata = testset_vloggers)
pred_emot <- predict(step_lm_emot_B, newdata = testset_vloggers)
pred_open <- predict(step_lm_open_B, newdata = testset_vloggers)

# Always check the output
head(pred_extr)
head(pred_agre)
head(pred_cons)
head(pred_emot)
head(pred_open)
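
Beyond looking at the first few values, it can help to check that there is one prediction per test row and that none is missing: `predict()` on an `lm` object returns `NA` for rows where a predictor is `NA`. The sketch below shows this on `mtcars`; `toy_fit`, `toy_new`, and `toy_pred` are illustrative names, not objects from this notebook.

```r
# Toy model on mtcars (not one of the personality models)
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

# The third row has a missing predictor value
toy_new  <- data.frame(wt = c(2.5, 3.0, NA), hp = c(110, 150, 200))
toy_pred <- predict(toy_fit, newdata = toy_new)

# One prediction per row of newdata ...
print(length(toy_pred) == nrow(toy_new))
# ... but the row with the missing predictor yields NA
print(sum(is.na(toy_pred)))
```

The same two checks could be applied to `pred_extr` and the other prediction vectors before building the submission.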

In [None]:
# compute output data frame
testset_pred = testset_vloggers %>% 
    mutate(
        Extr = pred_extr, 
        Agr  = pred_agre,
        Cons = pred_cons,
        Emot = pred_emot,
        Open = pred_open
    ) %>%
    select(vlogId, Extr:Open)

head(testset_pred)

## 4.3 Writing predictions to file

You need to upload your predictions in a .csv file. However, the predictions are spread over multiple columns (`Extr`, `Agr`, `Cons`, `Emot`, `Open`), while Kaggle expects **long format**!

What does long format look like?

- Every prediction on a single line.
- Columns `vlogId` and `pers_axis` map each prediction to its *vlogger ID* and *personality axis*.

To achieve this, first `gather` the column values into a single value column (named `Expected` below), adding a `pers_axis` column to indicate the original column name:

In [None]:
testset_pred_long  <- 
  testset_pred %>% 
  gather(pers_axis, Expected, -vlogId) %>%
  arrange(vlogId, pers_axis)

head(testset_pred_long)
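
To make the reshaping concrete, here is a small sketch on made-up predictions for two vloggers and two axes. It also shows `pivot_longer()`, the newer `tidyr` function that supersedes `gather()`, producing the same rows. The object names `toy_pred`, `long_gather`, and `long_pivot` and all values are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up wide-format predictions
toy_pred <- tibble(
  vlogId = c("VLOG1", "VLOG2"),
  Extr   = c(4.5, 3.9),
  Agr    = c(4.8, 4.1)
)

# Same approach as in the cell above
long_gather <- toy_pred %>%
  gather(pers_axis, Expected, -vlogId) %>%
  arrange(vlogId, pers_axis)

# Equivalent reshaping with pivot_longer()
long_pivot <- toy_pred %>%
  pivot_longer(-vlogId, names_to = "pers_axis", values_to = "Expected") %>%
  arrange(vlogId, pers_axis)

print(long_pivot)
```

Each wide row becomes one long row per personality axis, so the long table has (number of vloggers) x (number of axes) rows.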

According to the competition's [Evaluation instructions](https://www.kaggle.com/c/bda2019big5/overview/evaluation), Kaggle expects a file with two columns: `Id` and a value column (named `Expected` in the code below).
  
The [Evaluation instructions](https://www.kaggle.com/c/bda2019big5/overview/evaluation) specify that we need to encode the `Agr` prediction for `VLOG8` as `VLOG8_Agr` in the `Id` column. To achieve this, use the `unite()` function of `tidyr`.

`unite()` takes:

- a data frame as its first argument (implicitly passed by the piping operator `%>%`)
- the name of the new column as its second argument (`Id` below)
- all further arguments (`vlogId` and `pers_axis` below) are the columns to combine; their values are concatenated with an underscore in between

Then write the resulting data frame to a .csv file.
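
A minimal sketch of what `unite()` does, on two made-up rows (the data frame `toy_long` and its values are invented for illustration):

```r
library(tidyr)

# Two made-up long-format predictions for one vlogger
toy_long <- data.frame(
  vlogId    = c("VLOG8", "VLOG8"),
  pers_axis = c("Agr", "Extr"),
  Expected  = c(4.1, 3.7)
)

# Combine vlogId and pers_axis into Id; the default separator is "_"
# and the two input columns are dropped by default
toy_final <- unite(toy_long, Id, vlogId, pers_axis)
print(toy_final)
```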

In [None]:
# Obtain the right format for Kaggle
testset_pred_final <- 
  testset_pred_long %>%
  unite(Id, vlogId, pers_axis) 

# Check if we succeeded
head(testset_pred_final)

# Write to csv
testset_pred_final %>%
  write_csv(file = "predictions.csv")

# Check if the file was written successfully.
list.files()
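
Before submitting, it may be worth reading the file back and checking its shape. The sketch below does this on a made-up two-row table written to a temporary file. For the real submission, the same checks could be run on "predictions.csv", where one would also expect five rows per test vlogger. The names `toy_final`, `tmp`, and `check` are illustrative.

```r
library(readr)

# Made-up submission table written to a temporary file
toy_final <- data.frame(
  Id       = c("VLOG8_Agr", "VLOG8_Extr"),
  Expected = c(4.1, 3.7)
)
tmp <- tempfile(fileext = ".csv")
write_csv(toy_final, file = tmp)

# Read the file back and run a few sanity checks
check <- read_csv(tmp, show_col_types = FALSE)
stopifnot(
  identical(names(check), c("Id", "Expected")),  # expected column names
  !anyNA(check$Expected),                        # no missing predictions
  !any(duplicated(check$Id))                     # every Id occurs once
)
print(nrow(check))
```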

Once you have clicked the <span style="background-color:#000000;color:white;padding:3px;border-radius:10px;padding-left:6px;padding-right:6px;">⟳ Save Version&nbsp;&nbsp;|&nbsp;&nbsp;0</span> button at the top left and selected the "Save & Run All (Commit)" option, go to the Viewer. There you will find your "predictions.csv" under Output. You'll also see a button there that allows you to submit your predictions with one click.

## Sources

* Biel, J. I., & Gatica-Perez, D. (2012). The youtube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs. IEEE Transactions on Multimedia, 15(1), 41-55.
* Schmitt, D. P., Long, A. E., McPhearson, A., O'Brien, K., Remmert, B., & Shah, S. H. (2017). Personality and gender differences in global perspective. International Journal of Psychology, 52, 45-56.