In [1]:
# This R environment comes with all of CRAN and many other helpful packages preinstalled.
# You can see which packages are installed by checking out the kaggle/rstats docker image: 
# https://github.com/kaggle/docker-rstats

library(tidyverse) # metapackage with lots of helpful functions
library(tidytext)
library(dplyr)
library(stringr)
library(quanteda)

# Input data files are available in the read-only "../input/" directory
# For example, running this cell (by clicking ▶️, run or pressing Shift+Enter) will list 
# all files under the "../input/" directory

list.files(path = "../input")

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved 
# as output when you create a version using "⟳ Save & Run All". From the resulting output
# section in the Viewer you can submit an output file as your entry to the competition.
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of 
# the current session

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.4     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Package version: 3.0.0
Unicode version: 13.0
ICU version: 66.1

Parallel computing: 4 of 4 threads used.

See https://quanteda.io for tutorials and examples.



There are three .csv files in the directory structure:

In [2]:
directory_content <- list.files("../input/bda2021big5/youtube-personality", full.names = TRUE)
print(directory_content)

[1] "../input/bda2021big5/youtube-personality/README.txt"                                                 
[2] "../input/bda2021big5/youtube-personality/transcripts"                                                
[3] "../input/bda2021big5/youtube-personality/YouTube-Personality-audiovisual_features.csv"               
[4] "../input/bda2021big5/youtube-personality/YouTube-Personality-gender.csv"                             
[5] "../input/bda2021big5/youtube-personality/YouTube-Personality-Personality_impression_scores_train.csv"


In addition, there's a "transcripts" folder (see number \[2\] in the output above) in which the actual video transcripts are stored in `.txt` files.

Store these file paths in variables for easy reference later on:

In [3]:
# Path to the transcripts directory with transcript .txt files
path_to_transcripts <- directory_content[2] 

# .csv filenames (see output above)
AudioVisual_file    <- directory_content[3]
Gender_file         <- directory_content[4]
Personality_file    <- directory_content[5]

# 2. Data Import and Merging

We'll import

- Transcripts
- Personality scores
- Gender

## 2.1 Importing transcripts

The transcript text files are stored in the subfolder 'transcripts'. They can be listed with the following commands:

In [4]:
transcript_files <- list.files(path_to_transcripts, full.names = TRUE) 

print(head(transcript_files))

[1] "../input/bda2021big5/youtube-personality/transcripts/VLOG1.txt"  
[2] "../input/bda2021big5/youtube-personality/transcripts/VLOG10.txt" 
[3] "../input/bda2021big5/youtube-personality/transcripts/VLOG100.txt"
[4] "../input/bda2021big5/youtube-personality/transcripts/VLOG102.txt"
[5] "../input/bda2021big5/youtube-personality/transcripts/VLOG103.txt"
[6] "../input/bda2021big5/youtube-personality/transcripts/VLOG104.txt"


The transcript file names encode the vlogger ID that you will need for joining information from the different data frames. A clean way to extract the vlogger IDs from the file names is to apply the function `basename()` and then remove the file extension ".txt".

In [5]:
vlogId = basename(transcript_files)
vlogId = str_replace(vlogId, pattern = "\\.txt$", replacement = "")
head(vlogId)
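As a quick check of the ID extraction, the sketch below applies the same idea to a few made-up paths (the paths are hypothetical, not files from the data set). It uses base R's `sub()` with an escaped dot, so only a literal `.txt` suffix is removed.

```r
# Hypothetical file paths, for illustration only
paths <- c("transcripts/VLOG1.txt",
           "transcripts/VLOG10.txt",
           "transcripts/VLOGtxt.txt")

# basename() drops the directory part;
# "\\.txt$" matches a literal ".txt" at the end of the string
ids <- sub("\\.txt$", "", basename(paths))
print(ids)
```

Without the escape, `.` would match any character, which is harmless for these file names but easy to get bitten by elsewhere.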

To include features extracted from the transcript texts you will have to read the text from files and store them in a data frame. For this, you will need the full file paths as stored in `transcript_files`.

Here are some tips to do that programmatically:

- use either a `for` loop, the `sapply()` function, or the `map_chr()` function from the `tidyverse`
- don't forget to also store `vlogId` extracted with the code above 

We will use the `map_chr()` function here:

In [6]:
transcripts_df = tibble(
    
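    # vlogId connects each transcript to a vlogger
    vlogId=vlogId,
    
    # Read the text of each transcript file and store it as a single string.
    # NOTE: "\\n" joins lines with a literal backslash followed by "n", not a newline character
    Text = map_chr(transcript_files, ~ paste(readLines(.x), collapse = "\\n")), 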
    
    # `filename` keeps track of the specific video transcript
    filename = transcript_files
)

“incomplete final line found on '../input/bda2021big5/youtube-personality/transcripts/VLOG11.txt'”
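The same read-and-collect step can be sketched without `purrr`, as the tips above suggest. The example below is a minimal sketch on two temporary files with invented content; `vapply()` plays the role of `map_chr()`, and `warn = FALSE` is an optional choice that silences the "incomplete final line" warning.

```r
# Create two small temporary files to stand in for transcripts (invented content)
tmp_dir <- file.path(tempdir(), "toy_transcripts")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines(c("Hello there.", "Second line."), file.path(tmp_dir, "VLOG1.txt"))
writeLines("Only one line.", file.path(tmp_dir, "VLOG2.txt"))

toy_files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# vapply() is the base R counterpart of purrr::map_chr(): one string per file
toy_text <- vapply(
  toy_files,
  function(f) paste(readLines(f, warn = FALSE), collapse = " "),
  character(1),
  USE.NAMES = FALSE
)

# Keep the ID next to the text so the result can be joined later
toy_df <- data.frame(
  vlogId = sub("\\.txt$", "", basename(toy_files)),
  Text = toy_text,
  stringsAsFactors = FALSE
)
print(toy_df)
```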


In [7]:
transcripts_df %>% 
    head(2)

vlogId,Text,filename
<chr>,<chr>,<chr>
VLOG1,"You know what I see - - no, more like hear a lot these days, is people calling other people gay as an insult. Now what makes people come up with calling others gay? Now here's an example. Hey, hey, you wanna trade Pokemon or Ziegfield cards? Or, or, or we can play, we can play superheroes. Oh, can I be Optimus Prime? Dude, you are so gay. Dude, the cool kids do crack. Oh, my mommy says, say no to drugs. Okay, how the hell does playing Pokemon cards or -- or --- or dancing or holding hands with another guy make me homosexual? I don't get these people. \nThis is how it is in my school. Okay, here's an example. All right, um, when they see two guys are gay, they're together, they're like no, ew, no. No, no that -- that doesn't go together - - you know, two guys, no. two sticks, no. It just doesn't work like . But when they see two girls, they're like, get it on. And I don't get these people. I've never seen someone say like, oh, you're so homosexual or you're so lesbian or you're such a child molester. It is always the word gay, cause apparently gay is now an insult, even though the word means like happy and lively and that kinda giddy feeling you have inside, like -- -- but no you have to turn that happy word into a mean word. Apparently, we can do that now, turning good things into bad things. It's like how Spiderman felt good, but then that -- that -- that grease that gets all over him and then and then evil Dr. Octopus. That's so gay, you like Spiderman. Lar, I'm going to the movies with the guys to watch Mama Mia. \nYou never know if other people are offended by what you say. I'm not saying you're a bad person if you do it. I used to do it all the time. I'm more focused on why we say it. In the end, we're all the same. You know, there's nothing wrong with it. I was just wondering where it all came from, you know. All right, thanks a lot for watching. Oh, yeah and the club channel is up and running. 
So, make sure to check that out because there's gonna be a lot of cool stuff on there. We'll do up to like four challenges at a time. We'll do contests, dares, questions. In the end, there's gonna be a lot of viewer interactions, so it's gonna be really fun. We may even put other people on the video too. So check it.",../input/bda2021big5/youtube-personality/transcripts/VLOG1.txt
VLOG10,"Hey everybody, it's Monday, July twenty seventh, two thousand and nine and that means it's time for another edition of XXXX. \nGovernor Palin's back in the news this week as she transitions from Alaska governor to Alaskan citizen. Pending all power to Lieutenant Governor Parnell, she had a few choice words for the Media. \nIt is, as throughout all Alaska, that big wild, good wife teaming along the road that is north to the future. That's what we get to see every day. Now what the rest of America gets to see along with us, is in this last frontier there is hope and opportunity and there is country pride. And it is our man and women in uniform securing it. And we are facing tough challenges in America with some seeming to just be hell bent maybe on tearing down our nation, perpetuating some pessimism and suggesting American apologetics. \nWhat? \nAnd we can resist enslavement to big central government that pressures hope and opportunity. Be wary of accepting government largesse. It doesn't come free and often accepting it takes away everything that is free. Melting into Washington's powerful, caretaking arms will just suck incentive to work hard and charge our own course right out of us. \nUh, wait. Is that -- no way, what? She made a good point there at the end. But sometimes I have to wonder if I'm listening to Sarah Palin or Nicholas Fain . \nIn other news over the weekend I heard the story of Troy Anthony Davis. Do you know who he is? You should. \nMister Davis was sentenced to death for the murder of an off duty Savannah, Georgia police officer named Mark McPhail back in nineteen ninety one. Davis was convicted solely on the testimony of nine eye witnesses. Since that time three eye witnesses have recanted, admitting that they were coerced and two other eye witnesses admitted that they never even saw the murder take place. 
\nDespite a wealth of information that has been presented to the court and some new evidence yet to be presented, Mister Davis remains on death row after eighteen years. For more information about Troy Anthony Davis and how you can help in his case, check out IAMTROY dot com. \nThat's all I've got for this week for everybody. I hope you enjoyed the show because although I started recording on Monday, July twenty seventh, it's Tuesday, July twenty eighth and that time has come. Gotta get out of here for work, new work schedule, so until next time, keep checking out XXXX dot com for daily blog posts, updates and etcetera and meet me back here Monday for new video or XXXX. Thanks for watching.",../input/bda2021big5/youtube-personality/transcripts/VLOG10.txt


## 2.2 Importing AudioVisual features

In [8]:
audiovisual_df <- read_delim(AudioVisual_file, delim = " ")


── Column specification ────────────────────────────────────────────────────────
cols(
  .default = col_double(),
  vlogId = col_character()
)
ℹ Use `spec()` for the full column specifications.




## 2.3 Import personality scores

The other data files can be read in with `read_delim` (not `read_csv` because the files are not actually comma separated). For instance, the following should work:

In [9]:
# Import the Personality scores
pers_df = read_delim(Personality_file, delim = " ")


── Column specification ────────────────────────────────────────────────────────
cols(
  vlogId = col_character(),
  Extr = col_double(),
  Agr = col_double(),
  Cons = col_double(),
  Emot = col_double(),
  Open = col_double()
)




In [10]:
head(pers_df)

vlogId,Extr,Agr,Cons,Emot,Open
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
VLOG1,4.9,3.7,3.6,3.2,5.5
VLOG3,5.0,5.0,4.6,5.3,4.4
VLOG5,5.9,5.3,5.3,5.8,5.5
VLOG6,5.4,4.8,4.4,4.8,5.7
VLOG7,4.7,5.1,4.4,5.1,4.7
VLOG9,5.6,5.0,4.0,4.2,4.9


## 2.4 Import gender

Gender info is stored in a separate `.csv` which is also delimited with a space. This file doesn't have column names, so we have to add them ourselves:

In [11]:
gender_df <- read.delim(Gender_file, head = FALSE, sep = " ", skip = 2)

# Add column names
names(gender_df) = c('vlogId', 'gender')


head(gender_df)

Unnamed: 0_level_0,vlogId,gender
Unnamed: 0_level_1,<chr>,<chr>
1,VLOG3,Female
2,VLOG5,Male
3,VLOG6,Male
4,VLOG7,Male
5,VLOG8,Female
6,VLOG9,Female


### 2.4.1 Merging the `gender` and `pers` dataframes

We want all the information in a single tidy data frame. While the built-in R function `merge()` can do that, the `tidyverse` (through `dplyr`) has a number of more versatile and consistent functions called `left_join`, `right_join`, `inner_join`, `full_join`, and `anti_join`. We'll use `left_join` here to merge the gender and personality data frames:

In [12]:
vlogger_df = left_join(gender_df, pers_df)
head(vlogger_df) # VLOG8 has missing personality scores: those should be predicted

Joining, by = "vlogId"



Unnamed: 0_level_0,vlogId,gender,Extr,Agr,Cons,Emot,Open
Unnamed: 0_level_1,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,VLOG3,Female,5.0,5.0,4.6,5.3,4.4
2,VLOG5,Male,5.9,5.3,5.3,5.8,5.5
3,VLOG6,Male,5.4,4.8,4.4,4.8,5.7
4,VLOG7,Male,4.7,5.1,4.4,5.1,4.7
5,VLOG8,Female,,,,,
6,VLOG9,Female,5.6,5.0,4.0,4.2,4.9
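To make the difference between the join variants concrete, here is a small sketch on toy data (the values are invented). It assumes only `dplyr` and shows which rows each join keeps when one vlogger has no personality scores.

```r
library(dplyr)

# Toy data frames (invented values)
gender_toy <- tibble(vlogId = c("VLOG3", "VLOG5", "VLOG8"),
                     gender = c("Female", "Male", "Female"))
pers_toy   <- tibble(vlogId = c("VLOG3", "VLOG5"),
                     Extr   = c(5.0, 5.9))

# left_join keeps every row of the left table; unmatched rows get NA
left_toy  <- left_join(gender_toy, pers_toy, by = "vlogId")

# inner_join keeps only rows that have a match in both tables
inner_toy <- inner_join(gender_toy, pers_toy, by = "vlogId")

# anti_join keeps rows of the left table WITHOUT a match in the right table
anti_toy  <- anti_join(gender_toy, pers_toy, by = "vlogId")

print(left_toy)
print(inner_toy)
print(anti_toy)
```

Passing `by = "vlogId"` explicitly also suppresses the `Joining, by = ...` message and guards against accidental joins on other shared column names.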


Note that some rows, like row 5, have `NA`s for the personality scores. This is because the vlogger in this row, with vlogId `VLOG8`, is part of the test set. You still have to split `vlogger_df` into the training and test set, as shown below.

We leave the `transcripts_df` data frame separate for now, because you first have to extract features from the transcripts. Once you have those features in a tidy data frame, including a `vlogId` column, you can refer to this `left_join` example to merge your features with `vlogger_df` into one single tidy data frame.

# 3. Tokenization of transcripts

In [13]:
# Sentences
tokenized_sentences <- transcripts_df %>%
    unnest_tokens(sentence, Text, token = "sentences")

# Words with Stop Words
tokenized_words <- transcripts_df %>%
    unnest_tokens(token, Text, token = "words")

# Words without Stop Words
stopwords <- get_stopwords()

tokenized_no_stop_words <- transcripts_df %>%
    unnest_tokens(token, Text, token = "words") %>%
    anti_join(stopwords, by = c(token = "word"))
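# A minimal sketch of what the three tokenizations above produce, run on a single
# made-up sentence pair (the toy text and the `toy_*` names are illustrative only).

```r
library(dplyr)
library(tidytext)

toy_df <- tibble(vlogId = "VLOG_TOY",
                 Text   = "Hi there. How are you?")

# One row per sentence (text is lowercased by default)
toy_sentences <- toy_df %>%
    unnest_tokens(sentence, Text, token = "sentences")

# One row per word, punctuation stripped
toy_words <- toy_df %>%
    unnest_tokens(token, Text, token = "words")

# The same words with stop words removed via anti_join
toy_no_stop <- toy_words %>%
    anti_join(get_stopwords(), by = c(token = "word"))

print(toy_sentences)
print(toy_words)
print(toy_no_stop)
```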

# 4. Feature extraction from transcript texts

## 4.1 Our features

In this section we extract our features. For the first round we extracted 9 features: the number of times "um", "uhm", or "uh" is used, NRC, Bing, intonation, the number of syllables, the number of questions, the number of swear words, the number of pauses, and the number of self-reference words. After the first round we decided to add five more features from other groups: the average number of characters in a sentence, AFINN, wMEI, the relative frequency of the words "I" and "we", and the frequency of negation.

### Feature 1
  #### Number of times the word "um" is used

Our first feature is the number of times the words "um", "uhm", and "uh" are used. These are so-called filler words, used to avoid silences. Even though earlier research could not show a direct relationship between the use of filler words and personality (Laserna et al., 2014), they could still indicate difficulty finding words, which might be related to, for example, introversion.

In [14]:
nr_of_um <- tokenized_words %>%
    group_by(vlogId) %>%
    filter(token == "um" | token == "uhm" | token == "uh") %>%
    count()

# add vloggers with 0 counts
um_feature <- left_join(tibble(vlogId), nr_of_um, copy = TRUE) %>% 
    replace(is.na(.),0) %>%
    rename(um_count = n)

Joining, by = "vlogId"
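The count-then-fill pattern used for this feature (and for several later ones) can be sketched on toy data as follows. The token table is invented, and `replace_na()` from `tidyr` is used here as a column-specific alternative to `replace(is.na(.), 0)`.

```r
library(dplyr)
library(tidyr)

# Every vlogger that should appear in the result, including those with no fillers
all_ids <- tibble(vlogId = c("VLOG1", "VLOG2", "VLOG3"))

# Invented tokens
toy_tokens <- tibble(
  vlogId = c("VLOG1", "VLOG1", "VLOG1", "VLOG2", "VLOG3"),
  token  = c("um", "hello", "uh", "hello", "uhm")
)

# Count filler words per vlogger; vloggers without fillers drop out here
filler_counts <- toy_tokens %>%
  filter(token %in% c("um", "uhm", "uh")) %>%
  count(vlogId, name = "um_count")

# Join back onto the full ID list and turn the resulting NA into 0
um_toy <- all_ids %>%
  left_join(filler_counts, by = "vlogId") %>%
  replace_na(list(um_count = 0L))

print(um_toy)
```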



### Feature 2
  #### NRC

Our second feature is the NRC. The NRC is a large lexicon that associates words with eight basic emotions as well as with positive and negative sentiment. How often each emotion occurs in a text might tell us something about the personality of the speaker. In earlier research, sentiment analysis using NRC was shown to be an effective personality predictor (Christian et al., 2021).

In [15]:
# Extracting NRC
load_nrc <- function() {
    if (!file.exists('nrc.txt'))
        download.file("https://www.dropbox.com/s/yo5o476zk8j5ujg/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt?dl=1","nrc.txt")
    nrc <- read.table("nrc.txt", col.names = c('word','sentiment','applies'), stringsAsFactors = FALSE)
    nrc %>% filter(applies == 1) %>% 
        select(-applies)
}

nrc <- load_nrc()

nrc_feature <- tokenized_no_stop_words %>%
    inner_join(nrc, by = c(token = "word")) %>%
    count(vlogId, sort = TRUE, sentiment) %>%
    group_by(vlogId) %>%
    spread(sentiment, n, fill = 0) %>%
    rename(positive_nrc = positive, negative_nrc = negative)
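To show what the join, count, and spread steps do, here is a minimal sketch with a three-word toy lexicon (the lexicon entries and tokens are invented, not taken from the NRC file). A word can carry several labels, so one token may contribute to more than one column.

```r
library(dplyr)
library(tidyr)

# Invented mini-lexicon: "happy" has two labels
toy_lexicon <- tibble(
  word      = c("happy", "happy", "angry", "sad"),
  sentiment = c("joy", "positive", "anger", "negative")
)

toy_tokens <- tibble(
  vlogId = c("VLOG1", "VLOG1", "VLOG2"),
  token  = c("happy", "angry", "sad")
)

toy_nrc <- toy_tokens %>%
  inner_join(toy_lexicon, by = c(token = "word")) %>%  # keep only lexicon words
  count(vlogId, sentiment) %>%                         # long format: one row per ID and label
  spread(sentiment, n, fill = 0)                       # wide format: one column per label

print(toy_nrc)
```

Vloggers whose transcripts contain no lexicon words at all would be missing from this result, which is why the final merge starts from the full list of IDs.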

### Feature 3
#### Bing

Our third feature is Bing. Bing doesn't allow us to assign emotions to a word, but it does let us classify words as positive or negative. The number of positive or negative words in a text might also tell us something about the personality of the speaker. Also, since the distribution of positive and negative words is different in Bing than in NRC, it seems sensible to use both.

In [16]:
# Extracting Bing 
bing <- get_sentiments("bing")
bing_feature <- tokenized_no_stop_words %>%
    inner_join(bing, by = c(token = "word")) %>%
    count(vlogId, sort = TRUE, sentiment) %>%
    group_by(vlogId) %>%
    spread(sentiment, n, fill = 0) %>%
    rename(positive_bing = positive, negative_bing = negative)


### Feature 4 
#### Intonation combined of 3 features 
##### (pitch, energy, and average voiced segments or the syllable length)



Our fourth feature is intonation. Earlier research showed that intonation can be a very effective predictor of personality when assessed by humans (Mohammadi et al., 2012). This indicates that data about intonation might also be effective in predicting personality. Following a study by Aydin et al. (2016), we divided intonation into three variables: pitch, energy, and average voiced segment length (the syllable length).

In [17]:
intonation_feature <- audiovisual_df %>%
    select(vlogId, mean.pitch, mean.energy, avg.voiced.seg) %>%
    group_by(vlogId)

### Feature 5 
#### Speech Feature
##### Speaking time, speaking turns, and voicing rate

Our fifth feature is a set of speech-activity variables taken from the audiovisual data: speaking time, number of speaking turns, voicing rate, and average segment length.

In [18]:
speech_feature <- audiovisual_df %>%
    select(vlogId, time.speaking, num.turns, voice.rate, avg.len.seg) %>%
    group_by(vlogId)

### Feature 6 
#### Counting the Syllables per VlogId

Our sixth feature is the number of syllables that each vlogger uses. Earlier research has shown that this is an effective predictor for mainly agreeableness (Metha et al., 2020). Even though this study was on written text, this finding might also hold for speech. 

In [19]:
download.file("https://bda2019syllables.netlify.com/en_syllable_3grams.csv.zip", "en_syllable_3grams.csv.zip") ## downloading the data base for syllables
unzip("en_syllable_3grams.csv.zip")
syldf <- read.csv("en_syllable_3grams.csv", check.names = FALSE, 
                 stringsAsFactors = FALSE)

nearZ <- syldf %>% caret::nearZeroVar() # removing syllables with near Zero Var
syldf2 <- syldf[, -nearZ]

new_syldf <- cor(syldf2[,-(1:2)]) # finding correlations among syllables above 0.9 and removing them
high_r <- new_syldf %>%
    caret::findCorrelation(cutoff = 0.9) + 2
syldf3 <- syldf2[, -high_r]

names(syldf3) = gsub("^$", " ", names(syldf3))

fitlm <- lm(nsyl ~ . - word, syldf3)  # creating a linear model to predict the number of syllables with the remaining features
round(coef(fitlm), 5)

nsyl_est <- function(word, betas) {  # creating a function that counts the number of syllables with a given word and beta coef.
   features = stringr::str_count(word, names(betas)[-1])
   nsyl = betas[1] + features %*% betas[-1]
   return(drop(nsyl))
}


hat_beta = coef(fitlm) 

tokens <- tokenized_no_stop_words %>%  # singling out tokens(words) from the tokenized transcript and grouping by VlogId
    select(token, vlogId) %>%
    group_by(vlogId)


count_syl <- numeric()  # counting the number of syllables for each word in the transcript
for(i in 1:nrow(tokens)){
    count_syl[i] <- nsyl_est(tokens[i,1], hat_beta)
}


syl_feature <- tokens %>%  # creating an output with rounded count of syllables per each VlogId
    data.frame(round(count_syl)) %>%
    rename(syl_count = round.count_syl.) %>%
    group_by(vlogId) %>%
    summarise(sum = sum(syl_count))
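# How `nsyl_est()` turns pattern counts into a syllable estimate can be seen in a
# minimal sketch with invented weights. `toy_betas` is NOT fitted from the syllable
# data; it only mimics the shape of `coef(fitlm)` (intercept first, then one named
# weight per character pattern).

```r
library(stringr)

# Same logic as the function above: intercept + sum(pattern count * weight)
nsyl_est <- function(word, betas) {
  features <- str_count(word, names(betas)[-1])  # one count per pattern
  nsyl <- betas[1] + features %*% betas[-1]
  drop(nsyl)
}

# Invented weights, for illustration only
toy_betas <- c("(Intercept)" = 0.5, "a" = 0.9, "e" = 0.8, "io" = 0.2)

# "banana" contains "a" three times and none of the other patterns:
# 0.5 + 3 * 0.9 = 3.2, which rounds to 3 syllables
est <- nsyl_est("banana", toy_betas)
print(est)
print(round(est))
```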





### Feature 7 
####  Number of questions

Our seventh feature is the number of questions in a vlog. Asking questions might, for example, be related to insecurity and thus introversion, or to curiosity and thus openness to experience.

In [20]:
nr_questions <- tokenized_sentences %>%  
    group_by(vlogId) %>%
    mutate(end_sentence = str_sub(sentence, start = -1)) %>%
    filter(end_sentence == "?") %>%
    count()

# add vloggers with 0 counts
question_feature <- left_join(tibble(vlogId), nr_questions, copy = TRUE) %>% 
    replace(is.na(.),0) %>%
    rename(quest_count = n)

Joining, by = "vlogId"
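An alternative way to count questions, sketched below on invented sentences, is to sum a logical test inside `summarise()`. Under this approach vloggers without any question get a count of 0 directly, so no join against the ID list is needed. `endsWith()` is base R.

```r
library(dplyr)

# Invented sentences
toy_sent <- tibble(
  vlogId   = c("VLOG1", "VLOG1", "VLOG2"),
  sentence = c("how are you?", "i am fine.", "no questions here.")
)

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of questions
quest_toy <- toy_sent %>%
  group_by(vlogId) %>%
  summarise(quest_count = sum(endsWith(sentence, "?")))

print(quest_toy)
```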



### Feature 8 
#### Swear Words

Our eighth feature is the number of swear words used. The study of Metha et al. (2020) also showed this to be an effective predictor of agreeableness. It also makes intuitive sense that the number of swear words would predict agreeableness (the more swear words, the lower the agreeableness) and possibly other Big Five traits.

In [21]:

swear_words_url <- "http://www.bannedwordlist.com/lists/swearWords.txt"
download.file(swear_words_url, destfile = "swear_words.txt") 
swear_words <- tibble(read.table("swear_words.txt", 
                                stringsAsFactors = FALSE, 
                                sep = ","))

swear_count_tbl <- tokenized_words %>%   
  inner_join(swear_words, by = c(token = 'V1'))%>% # inner join matches pairs of observations when keys are equal
  count(vlogId) %>%
  group_by(vlogId) %>%
  rename(swear_count = n) 

# add vloggers with 0 counts
swear_feature <- left_join(tibble(vlogId), swear_count_tbl, copy = TRUE) %>% 
    replace(is.na(.),0)

Joining, by = "vlogId"



### Feature 9 
#### Number of pauses

Our ninth feature is the number of pauses for each vlogger. Pauses or silences can be used in speech in many ways (Kostiuk, 2012), so they might predict personality in different ways. However, a large number of pauses likely indicates difficulty finding words, which might be associated with introversion.

In [22]:
# counts each sentence that contains at least one pause marker ("-") once
nr_pauses <- tokenized_sentences %>%
    group_by(vlogId) %>%
    mutate(sentence_with_pause = grepl("-", sentence)) %>%
    count(sentence_with_pause) %>%
    filter(sentence_with_pause == TRUE)

# add vloggers with 0 counts
pause_feature <- left_join(tibble(vlogId), nr_pauses, copy = TRUE) %>% 
    replace(is.na(.),0) %>%
    rename(pause_count = n) %>%
    select(vlogId, pause_count)

Joining, by = "vlogId"



### Feature 10 
#### Self-Reference/"I" Words 

Our tenth feature is the use of self-reference words. Self-reference words are associated with multiple personality traits, so their use might be a good predictor for the Big Five. Earlier research showed, for example, an association between the use of the word "I" and neuroticism (Scully & Terry, 2011).

In [23]:
transcript_with_stop <- transcripts_df %>%
    unnest_tokens(token, Text, token = "words")


selfreference <- transcript_with_stop %>% 
  filter(token == "i"  | token ==  "me" | token == "myself" | token == "i'm" | token == "mine" |
           token == "my") %>%
  group_by(vlogId) %>%
  tally(name = 'selfwords') %>%
  rename(self_count = selfwords) 

# add vloggers with 0 counts
selfreference_feature <- left_join(tibble(vlogId), 
                                   selfreference, 
                                   copy = TRUE) %>%
    replace(is.na(.),0) 

# which(selfreference_feature$self_count == 0)

Joining, by = "vlogId"



### Feature 11 
#### "We" Words Count

In [24]:
we_reference <- transcript_with_stop %>% 
  filter(token == "we"  | token ==  "we're" | token == "us" | token == "our" |
           token == "ours" | token == "ourselves") %>%
  group_by(vlogId) %>%
  tally(name = 'we_words') %>%
  rename(we_count = we_words) 


# add vloggers with 0 counts
we_feature <- left_join(tibble(vlogId), we_reference, copy = TRUE) %>%
    replace(is.na(.),0) 

# which(we_feature$we_count == 0)

Joining, by = "vlogId"



## 4.2 Stolen features

We did not come up with these features ourselves, but after seeing them in the work of other groups we decided to use them, either because we thought they would be good predictors or because they expanded on predictors we had already thought of.

### Feature 12 
#### Average Number of characters in a sentence

In [25]:
char_len_feature <- tokenized_sentences %>%
    mutate(nr_char = nchar(sentence)) %>%
    group_by(vlogId) %>%
    summarize(avg_char_len = mean(nr_char))

# Code from Group 10 but some changes

### Feature 13 
####  AFINN 

In [26]:
download.file("http://www2.imm.dtu.dk/pubdb/edoc/imm6010.zip","afinn.zip")
unzip("afinn.zip")
# AFINN-111.txt has no header row, so header = FALSE keeps the first word in the lexicon
afinn = read.delim("AFINN/AFINN-111.txt", sep="\t", header = FALSE,
                   col.names = c("word","score"), stringsAsFactors = FALSE)

afinn_feature <- tokenized_words %>%
    inner_join(afinn, by = c(token = 'word')) %>%
    group_by(vlogId) %>%
    summarise(afinn_mean = mean(score)) 

# Code from Group 10 but some changes

### Feature 14
####  wMEI 

In [27]:
wMEi_feature <- audiovisual_df %>% 
                        select(vlogId, hogv.entropy, hogv.median, 
                                 hogv.cogR, hogv.cogC)

# Code from Group 10 but some changes

### Feature 15 
####  "I" and "We" relative frequency

In [28]:
# count total number of words
count_words <- tokenized_words %>%
  count(vlogId) %>%
  rename(total_words = n)

count_i <- tokenized_words %>%
  filter(token == "i" |
         token == "i'm" |
         token == "me" |
         token == "my" |
         token == "mine" |
         token == "myself") %>%
  count(vlogId) %>% 
  rename(total_i = n)

count_we <- tokenized_words %>%
  filter(token == "we" |
         token == "we're" |
         token == "us" |
         token == "our" |
         token == "ours" |
         token == "ourselves") %>%
  count(vlogId) %>% 
  rename(total_we = n)

i_we_feature <- count_words %>% 
  full_join(count_i,  by = "vlogId") %>%
  full_join(count_we, by = "vlogId") %>%
  replace_na(list(total_i = 0, total_we = 0)) %>%
  mutate(freq_i = total_i / total_words,
         freq_we = total_we / total_words) %>% 
  select(vlogId, freq_i, freq_we)

# Code from Group 3
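# A relative frequency is the share of tokens that belong to a word list, so it can
# also be computed as the mean of a logical test. The sketch below uses invented
# tokens; it is an alternative formulation, not the code used above.

```r
library(dplyr)

i_words  <- c("i", "i'm", "me", "my", "mine", "myself")
we_words <- c("we", "we're", "us", "our", "ours", "ourselves")

# Invented tokens
toy_tokens <- tibble(
  vlogId = c("VLOG1", "VLOG1", "VLOG1", "VLOG1", "VLOG2", "VLOG2"),
  token  = c("i", "like", "my", "dog", "we", "go")
)

# mean() of TRUE/FALSE gives the proportion of matching tokens per vlogger
freq_toy <- toy_tokens %>%
  group_by(vlogId) %>%
  summarise(freq_i  = mean(token %in% i_words),
            freq_we = mean(token %in% we_words))

print(freq_toy)
```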

### Feature 16 
#### Negation frequency

In [29]:
negation_url <- "https://www.grammarly.com/blog/negatives/"
negation_words <- readLines(negation_url)[63:89] %>%
  str_replace("<li>", "") %>% 
  str_replace("</li>", "") %>%
  tibble(word = .) %>%
  filter(!grepl(" ", word),
         !grepl("<", word),
         word != "") %>% 
  mutate(word = str_to_lower(word))


negation_feature <- tokenized_words %>%
  filter(token %in% negation_words$word) %>%
  count(vlogId) %>%
  full_join(count_words, by = "vlogId") %>%
  replace_na(list(n = 0)) %>%
  mutate(negation_freq = n / total_words) %>%
  select(vlogId, negation_freq)

# Code from Group 3

“incomplete final line found on 'https://www.grammarly.com/blog/negatives/'”


### Feature 17 
#### Mean Word Count Per Sentence

In [30]:
tokenized_sentences$n_words_sent <-
  ntoken(x = tokenized_sentences$sentence,
         remove_punct = TRUE)

# Calculate the average amount of words per sentence per vlog
mean_word_per_sentence_feature <- 
    tokenized_sentences %>%
    group_by(vlogId) %>%
    summarise(mean_n_words = mean(n_words_sent)) 

# Code from Group 11

# 5. Computing the features data frame

In [31]:
# Combine all features into the data frame `transcript_features_df`

transcript_features_df <- tibble(vlogId) %>%
    left_join(bing_feature) %>%
    left_join(intonation_feature) %>%
    left_join(syl_feature) %>%
    left_join(nrc_feature) %>%
    left_join(um_feature) %>%
    left_join(question_feature) %>%
    left_join(pause_feature) %>%
    left_join(selfreference_feature) %>%
    left_join(afinn_feature) %>%
    left_join(char_len_feature) %>%
    left_join(wMEi_feature) %>%
    left_join(i_we_feature) %>%
    left_join(negation_feature) %>% 
    left_join(swear_feature) %>%
    left_join(we_feature) %>%
    left_join(mean_word_per_sentence_feature) %>%
    left_join(speech_feature)

head(transcript_features_df)
any(is.na(transcript_features_df))

Joining, by = "vlogId"




vlogId,negative_bing,positive_bing,mean.pitch,mean.energy,avg.voiced.seg,sum,anger,anticipation,disgust,⋯,freq_i,freq_we,negation_freq,swear_count,we_count,mean_n_words,time.speaking,num.turns,voice.rate,avg.len.seg
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
VLOG1,9,21,178.15,0.061449,0.18441,329,8,10,8,⋯,0.02758621,0.013793103,0.029885057,1,6,11.44737,0.60796,0.44839,0.051389,1.3559
VLOG10,11,13,285.22,0.018539,0.421,479,10,19,7,⋯,0.01336303,0.017817372,0.004454343,1,8,21.38095,0.76182,0.30559,0.02934,2.493
VLOG100,4,12,141.96,0.0039344,0.19332,229,0,9,0,⋯,0.09897611,0.0,0.0,0,0,24.41667,0.56062,0.51625,0.051806,1.0859
VLOG102,10,42,226.56,0.0049353,0.30051,995,4,50,7,⋯,0.06983655,0.012630015,0.005943536,0,17,10.51562,0.59995,0.39872,0.038547,1.5047
VLOG103,17,28,208.49,0.0094889,0.26597,508,0,19,4,⋯,0.12483745,0.00130039,0.010403121,0,1,11.30882,0.44812,0.31675,0.041071,1.4147
VLOG104,2,19,213.54,0.00012061,0.203,708,2,15,0,⋯,0.01522843,0.002538071,0.005076142,0,2,29.18519,0.72873,0.40011,0.052709,1.8213


# 6. Checking for Correlation and Near Zero Variances among predictors

In [32]:
near_zero <- caret::nearZeroVar(transcript_features_df)

# There are no variables found that have near zero variation

glimpse(near_zero) 

# Check if there are highly correlated features 

library(caret)
library(dplyr)
total_correlation_matrix <- cor(transcript_features_df[,-1]) 
# findCorrelation() indexes the matrix without vlogId, so add 1 to get
# column positions in transcript_features_df
high_total_r <- total_correlation_matrix %>%
    findCorrelation(cutoff = 0.9) + 1 

glimpse(high_total_r)

# Drop the highly correlated features
final_features <- transcript_features_df[-high_total_r]


 int(0) 


Loading required package: lattice


Attaching package: ‘caret’


The following object is masked from ‘package:purrr’:

    lift


The following object is masked from ‘package:httr’:

    progress




 num [1:4] 7 14 13 33
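The `+ 1` index shift in the cell above is easy to get wrong, so here is a small sketch on simulated data (all values are randomly generated; the column names are made up). It assumes `caret` is installed and shows how an index returned by `findCorrelation()` maps back to the full data frame when an ID column was left out of the correlation matrix.

```r
library(caret)

set.seed(42)
toy <- data.frame(
  id = paste0("V", 1:50),   # identifier column, playing the role of vlogId
  a  = rnorm(50),
  b  = rnorm(50)
)
toy$c <- 2 * toy$a + rnorm(50, sd = 0.01)   # c is almost a rescaled copy of a

# Correlations are computed without the id column, so matrix column 1 is `a`
cm <- cor(toy[, -1])

# Position within the matrix of the column flagged for removal (either a or c)
flagged_in_matrix <- findCorrelation(cm, cutoff = 0.9)

# Shift by 1 to get the position in the full data frame, which still has `id` first
flagged_in_df <- flagged_in_matrix + 1

toy_reduced <- toy[, -flagged_in_df]
print(names(toy_reduced))
```

If nothing were flagged, negative indexing with an empty vector would select no columns at all, so a guard such as `if (length(flagged_in_df) > 0)` may be worth adding.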


Once you have computed features from the transcript texts and stored them in a data frame, merge that data frame with the `vlogger_df` data frame:

In [33]:
# Merge `vlogger_df` with the selected transcript features

our_df <- vlogger_df %>% 
    left_join(final_features)
head(our_df)

Joining, by = "vlogId"



Unnamed: 0_level_0,vlogId,gender,Extr,Agr,Cons,Emot,Open,negative_bing,positive_bing,mean.pitch,⋯,hogv.cogC,freq_i,freq_we,negation_freq,swear_count,we_count,time.speaking,num.turns,voice.rate,avg.len.seg
Unnamed: 0_level_1,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
1,VLOG3,Female,5.0,5.0,4.6,5.3,4.4,2,18,239.32,⋯,164,0.016,0.0,0.002666667,0,0,0.51374,0.50013,0.057632,1.0272
2,VLOG5,Male,5.9,5.3,5.3,5.8,5.5,1,15,173.5,⋯,156,0.07088608,0.002531646,0.002531646,0,1,0.70205,0.31675,0.037614,2.2164
3,VLOG6,Male,5.4,4.8,4.4,4.8,5.7,11,17,201.28,⋯,179,0.0755627,0.006430868,0.004823151,0,4,0.75993,0.29976,0.048036,2.5351
4,VLOG7,Male,4.7,5.1,4.4,5.1,4.7,17,32,275.68,⋯,156,0.06055901,0.001552795,0.0,1,1,0.60069,0.34916,0.024801,1.7204
5,VLOG8,Female,,,,,,8,19,255.58,⋯,178,0.07395498,0.0,0.006430868,0,0,0.46439,0.55015,0.056864,0.84412
6,VLOG9,Female,5.6,5.0,4.0,4.2,4.9,30,33,230.75,⋯,156,0.06071019,0.016036655,0.004581901,0,14,0.67458,0.41678,0.054172,1.6186


# 7. Model Selection

Next you fit your predictive model(s). The simplest option would be a linear regression model that uses only `gender` as a feature; the models below use the full set of computed features.

## 7.1 Inflexible Models

### 7.1.1 Overall Predictive Model

In [34]:
colnames(our_df)

fit_our_ml <- lm(cbind(Extr, Agr, Cons, Emot, Open) ~ gender + anger + anticipation + 
                         disgust + fear + joy + sadness + surprise + trust + positive_bing + 
                         negative_bing + um_count + quest_count + pause_count + mean.pitch + 
                         mean.energy + avg.voiced.seg + self_count + we_count + afinn_mean + avg_char_len +
                         hogv.entropy + hogv.median + hogv.cogR + hogv.cogC + freq_i + freq_we + 
                         negation_freq + swear_count + time.speaking + num.turns + voice.rate + avg.len.seg, 
                         data = our_df)
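
The `cbind(...)` on the left-hand side makes `lm()` fit one regression per outcome with a shared set of predictors. As a small sketch on simulated data (the variables `x`, `z`, `y1`, `y2` are hypothetical), the coefficients and R² of each response are the same as those from fitting that response on its own:

```r
# Simulated data with two outcomes and two predictors (all hypothetical)
set.seed(42)
toy <- data.frame(x = rnorm(30), z = rnorm(30))
toy$y1 <- 1 + 2 * toy$x + rnorm(30)
toy$y2 <- -1 + 0.5 * toy$z + rnorm(30)

# One multivariate fit versus a separate fit for the first outcome
fit_multi <- lm(cbind(y1, y2) ~ x + z, data = toy)
fit_y1 <- lm(y1 ~ x + z, data = toy)

# coef() returns a matrix with one column per outcome
coef(fit_multi)

# The y1 column matches the single-outcome fit
all.equal(coef(fit_multi)[, "y1"], coef(fit_y1))

# summary() returns a list with one lm summary per outcome
all.equal(summary(fit_multi)[[1]]$r.squared, summary(fit_y1)$r.squared)
```

The multivariate form is therefore mainly a convenience: a single `predict()` call later returns a matrix with one column per trait.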

### 7.1.2 Model Selection

In [35]:
# Extraversion 
startmod_Extr <- lm(Extr ~ 1, data = our_df[,-c(1,4:7)])
fullmod_Extr <- lm(Extr ~., data = our_df[,-c(1,4:7)])

# stepwise regression
both_Extr <- MASS::stepAIC(fullmod_Extr, direction = "both", trace = FALSE)

# forward regression
forward_Extr <- MASS::stepAIC(startmod_Extr, 
                              scope = list(upper = fullmod_Extr), 
                              direction = "forward", trace = FALSE)

# Fit Extraversion with chosen predictors 

summary(both_Extr) #stepwise
summary(forward_Extr) #forward

# Anova table of the stepwise regression for Extraversion

both_Extr$anova
forward_Extr$anova


Call:
lm(formula = Extr ~ mean.pitch + mean.energy + anger + joy + 
    sadness + um_count + quest_count + hogv.entropy + hogv.median + 
    hogv.cogR + time.speaking, data = our_df[, -c(1, 4:7)])

Residuals:
     Min       1Q   Median       3Q      Max 
-2.48719 -0.57282  0.04516  0.62044  1.76524 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.072176   0.519779   2.063 0.039966 *  
mean.pitch     0.002011   0.000842   2.389 0.017509 *  
mean.energy    4.112762   1.958541   2.100 0.036542 *  
anger          0.016530   0.011668   1.417 0.157574    
joy            0.021842   0.006399   3.413 0.000727 ***
sadness       -0.028525   0.013224  -2.157 0.031769 *  
um_count      -0.022102   0.005417  -4.081 5.72e-05 ***
quest_count    0.037454   0.011395   3.287 0.001129 ** 
hogv.entropy   0.172778   0.049537   3.488 0.000557 ***
hogv.median    0.966837   0.544539   1.776 0.076791 .  
hogv.cogR      0.004496   0.002668   1.685 0.092927 .  
time.speaki


Call:
lm(formula = Extr ~ hogv.entropy + time.speaking + quest_count + 
    um_count + joy + mean.pitch + mean.energy + sadness + hogv.median + 
    hogv.cogR + anger, data = our_df[, -c(1, 4:7)])

Residuals:
     Min       1Q   Median       3Q      Max 
-2.48719 -0.57282  0.04516  0.62044  1.76524 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.072176   0.519779   2.063 0.039966 *  
hogv.entropy   0.172778   0.049537   3.488 0.000557 ***
time.speaking  1.721591   0.324093   5.312 2.07e-07 ***
quest_count    0.037454   0.011395   3.287 0.001129 ** 
um_count      -0.022102   0.005417  -4.081 5.72e-05 ***
joy            0.021842   0.006399   3.413 0.000727 ***
mean.pitch     0.002011   0.000842   2.389 0.017509 *  
mean.energy    4.112762   1.958541   2.100 0.036542 *  
sadness       -0.028525   0.013224  -2.157 0.031769 *  
hogv.median    0.966837   0.544539   1.776 0.076791 .  
hogv.cogR      0.004496   0.002668   1.685 0.092927 .  
anger      

Step,Df,Deviance,Resid. Df,Resid. Dev,AIC
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,,289,203.384,-81.4057
- afinn_mean,1.0,0.004902658,290,203.3889,-83.39792
- negative_bing,1.0,0.019555942,291,203.4085,-85.36686
- disgust,1.0,0.011042691,292,203.4195,-87.34933
- surprise,1.0,0.051585999,293,203.4711,-89.26743
- avg_char_len,1.0,0.063378468,294,203.5345,-91.16683
- self_count,1.0,0.076391574,295,203.6109,-93.04563
- hogv.cogC,1.0,0.091221422,296,203.7021,-94.90095
- we_count,1.0,0.097205872,297,203.7993,-96.74685
- freq_we,1.0,0.120726103,298,203.92,-98.55557


Step,Df,Deviance,Resid. Df,Resid. Dev,AIC
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,,322,301.355,-20.40434
+ hogv.entropy,1.0,36.290049,321,265.065,-59.84979
+ time.speaking,1.0,20.384558,320,244.6804,-83.69691
+ quest_count,1.0,8.5307,319,236.1497,-93.15919
+ um_count,1.0,7.994991,318,228.1547,-102.28395
+ joy,1.0,6.298801,317,221.8559,-109.3266
+ mean.pitch,1.0,3.496855,316,218.3591,-112.45822
+ mean.energy,1.0,2.676813,315,215.6823,-114.44228
+ sadness,1.0,1.910538,314,213.7717,-115.31619
+ hogv.median,1.0,1.654652,313,212.1171,-115.82603


In [36]:
# Agreeableness
startmod_Agr <- lm(Agr ~ 1, data = our_df[, -c(1, 3, 5:7)])
fullmod_Agr <- lm(Agr ~., data = our_df[, -c(1, 3, 5:7)])

# stepwise regression

both_Agr <- MASS::stepAIC(fullmod_Agr, direction = "both", trace = FALSE)

# forward regression
forward_Agr <- MASS::stepAIC(startmod_Agr, 
                              scope = list(upper = fullmod_Agr), 
                              direction = "forward", trace = FALSE)

# Agreeableness with chosen predictors 

summary(both_Agr)
summary(forward_Agr)

# Anova table of the stepwise regression for Agreeableness

both_Agr$anova
forward_Agr$anova


Call:
lm(formula = Agr ~ gender + anger + surprise + pause_count + 
    afinn_mean + hogv.cogC + freq_i + freq_we + negation_freq + 
    swear_count, data = our_df[, -c(1, 3, 5:7)])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.93654 -0.42563  0.01693  0.43992  1.94292 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.417671   0.364857  14.849  < 2e-16 ***
genderMale     -0.290830   0.079958  -3.637 0.000322 ***
anger          -0.018417   0.008136  -2.264 0.024291 *  
surprise        0.017954   0.009441   1.902 0.058134 .  
pause_count    -0.011646   0.006854  -1.699 0.090290 .  
afinn_mean      0.330756   0.065652   5.038 7.98e-07 ***
hogv.cogC      -0.005412   0.002139  -2.530 0.011888 *  
freq_i          3.876191   1.445605   2.681 0.007722 ** 
freq_we         9.273221   4.395360   2.110 0.035674 *  
negation_freq -22.527113   6.335256  -3.556 0.000435 ***
swear_count    -0.084223   0.018505  -4.551 7.64e-06 ***
---
Signif. co


Call:
lm(formula = Agr ~ afinn_mean + swear_count + gender + negation_freq + 
    anger + hogv.cogC + freq_i + we_count + pause_count + surprise, 
    data = our_df[, -c(1, 3, 5:7)])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.97642 -0.42551  0.01756  0.43375  1.94784 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.445005   0.364305  14.946  < 2e-16 ***
afinn_mean      0.326363   0.065546   4.979 1.06e-06 ***
swear_count    -0.083802   0.018515  -4.526 8.55e-06 ***
genderMale     -0.303664   0.079662  -3.812 0.000166 ***
negation_freq -22.500303   6.334017  -3.552 0.000441 ***
anger          -0.020591   0.008176  -2.518 0.012286 *  
hogv.cogC      -0.005186   0.002126  -2.439 0.015279 *  
freq_i          3.698544   1.419992   2.605 0.009638 ** 
we_count        0.015007   0.007006   2.142 0.032960 *  
pause_count    -0.013329   0.006926  -1.924 0.055216 .  
surprise        0.014704   0.009547   1.540 0.124541    
---
Signif. c

Step,Df,Deviance,Resid. Df,Resid. Dev,AIC
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,,289,143.6927,-193.622
- mean.energy,1.0,0.002036485,290,143.6948,-195.6174
- self_count,1.0,0.008516332,291,143.7033,-197.5983
- avg.len.seg,1.0,0.015908026,292,143.7192,-199.5625
- hogv.entropy,1.0,0.020365662,293,143.7395,-201.5167
- mean.pitch,1.0,0.027440571,294,143.767,-203.4551
- anticipation,1.0,0.052831579,295,143.8198,-205.3364
- num.turns,1.0,0.064790612,296,143.8846,-207.1909
- hogv.median,1.0,0.075206802,297,143.9598,-209.0222
- negative_bing,1.0,0.107610632,298,144.0674,-210.7808


Step,Df,Deviance,Resid. Df,Resid. Dev,AIC
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,,322,255.6699,-73.50611
+ afinn_mean,1.0,62.13108,321,193.5388,-161.43222
+ swear_count,1.0,16.680489,320,176.8584,-188.54392
+ gender,1.0,7.872012,319,168.9863,-201.25051
+ negation_freq,1.0,7.157743,318,161.8286,-213.23
+ anger,1.0,4.993635,317,156.835,-221.35401
+ hogv.cogC,1.0,2.182579,316,154.6524,-223.88058
+ freq_i,1.0,1.829322,315,152.8231,-225.72399
+ we_count,1.0,2.156566,314,150.6665,-228.31448
+ pause_count,1.0,1.258397,313,149.4081,-229.02357


In [37]:
# Openness

startmod_Open <- lm(Open ~ 1, data = our_df[, -c(1, 3:6)])
fullmod_Open <- lm(Open ~., data = our_df[, -c(1, 3:6)])

# stepwise regression

both_Open <- MASS::stepAIC(fullmod_Open, direction = "both", trace = FALSE)

# forward regression
forward_Open <- MASS::stepAIC(startmod_Open, 
                              scope = list(upper = fullmod_Open), 
                              direction = "forward", trace = FALSE)

# Openness with chosen predictors

summary(both_Open)
summary(forward_Open)

# Anova table of the stepwise regression for Openness

both_Open$anova
forward_Open$anova


Call:
lm(formula = Open ~ fear + joy + surprise + um_count + quest_count + 
    avg_char_len + hogv.median + swear_count + time.speaking + 
    num.turns + avg.len.seg, data = our_df[, -c(1, 3:6)])

Residuals:
     Min       1Q   Median       3Q      Max 
-2.17343 -0.42987 -0.01925  0.41280  1.76838 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.241721   0.286671  14.796  < 2e-16 ***
fear          -0.009920   0.005593  -1.773 0.077133 .  
joy            0.023397   0.006319   3.703 0.000252 ***
surprise      -0.024888   0.011550  -2.155 0.031947 *  
um_count      -0.011719   0.004293  -2.730 0.006699 ** 
quest_count    0.015777   0.009322   1.693 0.091543 .  
avg_char_len   0.001929   0.001189   1.623 0.105705    
hogv.median    1.627548   0.311169   5.230 3.11e-07 ***
swear_count   -0.035397   0.015793  -2.241 0.025709 *  
time.speaking  0.604498   0.297658   2.031 0.043121 *  
num.turns     -0.989011   0.499033  -1.982 0.048376 *  
avg.len.se


Call:
lm(formula = Open ~ hogv.median + time.speaking + swear_count + 
    joy + um_count + surprise + gender, data = our_df[, -c(1, 
    3:6)])

Residuals:
     Min       1Q   Median       3Q      Max 
-2.32560 -0.47330  0.01604  0.43228  1.69211 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    3.879725   0.191586  20.251  < 2e-16 ***
hogv.median    1.653727   0.304616   5.429 1.13e-07 ***
time.speaking  0.767919   0.261119   2.941 0.003515 ** 
swear_count   -0.040386   0.015049  -2.684 0.007668 ** 
joy            0.022585   0.006230   3.625 0.000337 ***
um_count      -0.012261   0.004303  -2.849 0.004670 ** 
surprise      -0.025865   0.011287  -2.292 0.022591 *  
genderMale     0.112358   0.074692   1.504 0.133509    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6629 on 315 degrees of freedom
  (80 observations deleted due to missingness)
Multiple R-squared:  0.1451,	Adjusted R-squared:  0.1261 


Step,Df,Deviance,Resid. Df,Resid. Dev,AIC
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,,289,128.9431,-228.6048
- positive_bing,1.0,0.003997607,290,128.9471,-230.5948
- trust,1.0,0.005364078,291,128.9524,-232.5814
- anticipation,1.0,0.012387735,292,128.9648,-234.5504
- mean.energy,1.0,0.018660575,293,128.9835,-236.5036
- mean.pitch,1.0,0.075382212,294,129.0589,-238.3149
- afinn_mean,1.0,0.069277781,295,129.1282,-240.1416
- negative_bing,1.0,0.054473545,296,129.1826,-242.0053
- pause_count,1.0,0.098358216,297,129.281,-243.7595
- anger,1.0,0.093059218,298,129.374,-245.5271


Step,Df,Deviance,Resid. Df,Resid. Dev,AIC
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,,322,161.9356,-221.0165
+ hogv.median,1.0,9.0644808,321,152.8711,-237.6224
+ time.speaking,1.0,4.2550521,320,148.6161,-244.7404
+ swear_count,1.0,2.3311462,319,146.2849,-247.847
+ joy,1.0,1.1936432,318,145.0913,-248.4934
+ um_count,1.0,3.2557574,317,141.8355,-253.8239
+ surprise,1.0,2.4025985,316,139.4329,-257.3422
+ gender,1.0,0.9945063,315,138.4384,-257.6542


In [38]:
# Conscientiousness
startmod_Cons <- lm(Cons ~ 1, data = our_df[, -c(1, 3:4, 6:7)])
fullmod_Cons <- lm(Cons ~., data = our_df[, -c(1, 3:4, 6:7)])

# stepwise regression
both_Cons <- MASS::stepAIC(fullmod_Cons, direction = "both", trace = FALSE)

# forward regression
forward_Cons <- MASS::stepAIC(startmod_Cons, 
                              scope = list(upper = fullmod_Cons), 
                              direction = "forward", trace = FALSE)

# Conscientiousness with chosen predictors

summary(both_Cons)
summary(forward_Cons)

# Anova table of the stepwise regression for Conscientiousness

both_Cons$anova
forward_Cons$anova



Call:
lm(formula = Cons ~ negative_bing + anger + fear + trust + hogv.entropy + 
    hogv.cogC + freq_i + we_count + time.speaking + voice.rate, 
    data = our_df[, -c(1, 3:4, 6:7)])

Residuals:
     Min       1Q   Median       3Q      Max 
-2.63551 -0.38010  0.03559  0.47499  1.85216 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.997231   0.481534  10.378  < 2e-16 ***
negative_bing -0.022507   0.006453  -3.488 0.000557 ***
anger         -0.034329   0.014607  -2.350 0.019382 *  
fear           0.037280   0.011089   3.362 0.000870 ***
trust          0.010421   0.004461   2.336 0.020114 *  
hogv.entropy  -0.059969   0.029623  -2.024 0.043783 *  
hogv.cogC     -0.005664   0.002183  -2.594 0.009931 ** 
freq_i        -3.702701   1.440146  -2.571 0.010602 *  
we_count       0.011575   0.007142   1.621 0.106097    
time.speaking  1.072409   0.278181   3.855 0.000141 ***
voice.rate     9.158916   3.975784   2.304 0.021898 *  
---
Signif. codes:  0 ‘*


Call:
lm(formula = Cons ~ time.speaking + swear_count + freq_i + hogv.entropy + 
    hogv.cogC + voice.rate + negation_freq + trust + negative_bing + 
    fear + anger + we_count, data = our_df[, -c(1, 3:4, 6:7)])

Residuals:
     Min       1Q   Median       3Q      Max 
-2.59114 -0.35645  0.04693  0.47812  1.91262 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.029550   0.481630  10.443  < 2e-16 ***
time.speaking  1.076276   0.277969   3.872 0.000132 ***
swear_count   -0.021738   0.020588  -1.056 0.291858    
freq_i        -3.615080   1.447038  -2.498 0.012998 *  
hogv.entropy  -0.057916   0.029630  -1.955 0.051527 .  
hogv.cogC     -0.005350   0.002191  -2.442 0.015166 *  
voice.rate     8.235101   4.026357   2.045 0.041669 *  
negation_freq -7.984489   6.368666  -1.254 0.210890    
trust          0.009483   0.004511   2.102 0.036340 *  
negative_bing -0.018156   0.007488  -2.425 0.015894 *  
fear           0.033472   0.011591   2.888 0.00415

Step,Df,Deviance,Resid. Df,Resid. Dev,AIC
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,,289,150.5187,-178.6315
- mean.energy,1.0,0.0004493153,290,150.5191,-180.6306
- afinn_mean,1.0,0.0007220984,291,150.5199,-182.629
- hogv.cogR,1.0,0.009265326,292,150.5291,-184.6091
- surprise,1.0,0.0116984204,293,150.5408,-186.584
- joy,1.0,0.0167754975,294,150.5576,-188.548
- positive_bing,1.0,0.0671184854,295,150.6247,-190.4041
- avg_char_len,1.0,0.0763883292,296,150.7011,-192.2403
- swear_count,1.0,0.1260732369,297,150.8272,-193.9702
- hogv.median,1.0,0.128272171,298,150.9554,-195.6956


Step,Df,Deviance,Resid. Df,Resid. Dev,AIC
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,,322,200.7218,-151.6616
+ time.speaking,1.0,11.551307,321,189.1705,-168.8062
+ swear_count,1.0,8.215457,320,180.955,-181.1474
+ freq_i,1.0,8.142485,319,172.8125,-194.0187
+ hogv.entropy,1.0,3.097606,318,169.7149,-197.8609
+ hogv.cogC,1.0,2.283467,317,167.4315,-200.2363
+ voice.rate,1.0,1.765038,316,165.6664,-201.6593
+ negation_freq,1.0,1.315305,315,164.3511,-202.234
+ trust,1.0,1.267623,314,163.0835,-202.735
+ negative_bing,1.0,2.108219,313,160.9753,-204.9377


In [39]:
# Neuroticism

startmod_Emot <- lm(Emot ~ 1, data = our_df[, -c(1, 3:5, 7)])
fullmod_Emot <- lm(Emot ~., data = our_df[, -c(1, 3:5, 7)])

# stepwise regression

both_Emot <- MASS::stepAIC(fullmod_Emot, direction = "both", trace = FALSE)

# forward regression

forward_Emot <- MASS::stepAIC(startmod_Emot, 
                              scope = list(upper = fullmod_Emot), 
                              direction = "forward", trace = FALSE)
# Neuroticism with chosen predictors 

summary(both_Emot)
summary(forward_Emot)

# Anova table of the stepwise regression for Neuroticism

both_Emot$anova
forward_Emot$anova


Call:
lm(formula = Emot ~ trust + self_count + afinn_mean + hogv.cogR + 
    hogv.cogC + negation_freq + swear_count + time.speaking, 
    data = our_df[, -c(1, 3:5, 7)])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0881 -0.4337  0.0476  0.4486  1.7803 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.436856   0.494888   8.965  < 2e-16 ***
trust           0.006374   0.003829   1.665 0.096968 .  
self_count     -0.001971   0.001399  -1.409 0.159829    
afinn_mean      0.279832   0.061458   4.553 7.57e-06 ***
hogv.cogR       0.004121   0.002217   1.859 0.063960 .  
hogv.cogC      -0.003247   0.002090  -1.553 0.121314    
negation_freq -19.060824   6.265934  -3.042 0.002549 ** 
swear_count    -0.062341   0.017892  -3.484 0.000564 ***
time.speaking   0.433126   0.266561   1.625 0.105195    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6857 on 314 degrees of freedom
  (80 observations dele


Call:
lm(formula = Emot ~ afinn_mean + swear_count + negation_freq + 
    time.speaking + hogv.cogR + hogv.cogC, data = our_df[, -c(1, 
    3:5, 7)])

Residuals:
     Min       1Q   Median       3Q      Max 
-2.09562 -0.44424  0.05158  0.46141  1.75509 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.405764   0.495387   8.894  < 2e-16 ***
afinn_mean      0.279371   0.061387   4.551 7.63e-06 ***
swear_count    -0.061712   0.017524  -3.521 0.000492 ***
negation_freq -19.378528   6.274169  -3.089 0.002189 ** 
time.speaking   0.498387   0.263727   1.890 0.059702 .  
hogv.cogR       0.003859   0.002211   1.746 0.081857 .  
hogv.cogC      -0.003021   0.002086  -1.449 0.148463    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.687 on 316 degrees of freedom
  (80 observations deleted due to missingness)
Multiple R-squared:  0.2219,	Adjusted R-squared:  0.2072 
F-statistic: 15.02 on 6 and 316 DF,  p-value

Step,Df,Deviance,Resid. Df,Resid. Dev,AIC
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,,289,142.1085,-197.2028
- disgust,1.0,0.001079673,290,142.1096,-199.2004
- anticipation,1.0,0.001928538,291,142.1115,-201.196
- mean.energy,1.0,0.027272307,292,142.1388,-203.134
- we_count,1.0,0.025271927,293,142.1641,-205.0766
- pause_count,1.0,0.028109974,294,142.1922,-207.0127
- positive_bing,1.0,0.031942025,295,142.2241,-208.9402
- joy,1.0,0.033290583,296,142.2574,-210.8646
- voice.rate,1.0,0.051059992,297,142.3085,-212.7487
- num.turns,1.0,0.079855924,298,142.3883,-214.5675


Step,Df,Deviance,Resid. Df,Resid. Dev,AIC
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,,322,191.6633,-166.5776
+ afinn_mean,1.0,28.9586532,321,162.7047,-217.4861
+ swear_count,1.0,4.9344419,320,157.7702,-225.4335
+ negation_freq,1.0,4.6349882,319,153.1353,-233.0648
+ time.speaking,1.0,1.5581275,318,151.5771,-234.3682
+ hogv.cogR,1.0,1.4583227,317,150.1188,-235.4908
+ hogv.cogC,1.0,0.9902029,316,149.1286,-235.6284
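
The five cells above repeat the same steps for each trait. As a sketch of how that repetition could be avoided, the loop below runs forward selection once per outcome on simulated data. The data frame `toy` and its columns are hypothetical, and base R's `step()` is used in place of `MASS::stepAIC()` to keep the example dependency-free; both select by AIC by default.

```r
# Simulated data: three candidate predictors, two outcomes (all hypothetical)
set.seed(7)
n <- 120
toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
toy$Extr <- 4 + 0.8 * toy$a + rnorm(n)
toy$Agr  <- 5 - 0.6 * toy$b + rnorm(n)

traits <- c("Extr", "Agr")
predictors <- c("a", "b", "c")

forward_fits <- list()
for (trait in traits) {
  # Keep one outcome plus the predictors, and drop incomplete rows so that
  # every candidate model is fitted to the same observations
  dat <- na.omit(toy[, c(trait, predictors)])

  # Intercept-only starting model for this outcome
  null_fit <- lm(as.formula(paste(trait, "~ 1")), data = dat)

  # Forward selection by AIC, up to the model with all predictors
  forward_fits[[trait]] <- step(
    null_fit,
    scope = list(lower = ~ 1, upper = reformulate(predictors)),
    direction = "forward",
    trace = 0
  )
}

# Selected predictors per trait
lapply(forward_fits, function(fit) names(coef(fit)))
```

Selecting columns by name, as done here, is also less fragile than negative column positions such as `-c(1, 4:7)`, which silently change meaning if the column order changes.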


### 7.1.3 Combined Model based on Model Selection


## 7.2 Flexible Models

**Non-linear Transformations of the Predictors**

In [40]:
# Extraversion
forward_Extr2 <- lm(formula = Extr ~ hogv.entropy + I(hogv.entropy^2) + 
                    quest_count + um_count + I(um_count^2) + joy + 
                    I(joy^2) + mean.pitch + sadness + mean.energy + 
                    I(time.speaking^2) + time.speaking + hogv.median + 
                    hogv.cogR + anger, data = our_df[, -c(1, 4:7)])
summary(forward_Extr2)


Call:
lm(formula = Extr ~ hogv.entropy + I(hogv.entropy^2) + quest_count + 
    um_count + I(um_count^2) + joy + I(joy^2) + mean.pitch + 
    sadness + mean.energy + I(time.speaking^2) + time.speaking + 
    hogv.median + hogv.cogR + anger, data = our_df[, -c(1, 4:7)])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6239 -0.5647  0.0386  0.6122  1.5951 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.6039747  0.8341707   1.923 0.055425 .  
hogv.entropy        0.2929181  0.2203427   1.329 0.184711    
I(hogv.entropy^2)  -0.0137374  0.0237859  -0.578 0.563995    
quest_count         0.0423567  0.0115762   3.659 0.000298 ***
um_count           -0.0336287  0.0118947  -2.827 0.005004 ** 
I(um_count^2)       0.0002871  0.0002664   1.078 0.282100    
joy                 0.0425890  0.0143992   2.958 0.003340 ** 
I(joy^2)           -0.0005227  0.0003376  -1.548 0.122626    
mean.pitch          0.0017004  0.0008550   1.989 0.047603 *  


In [41]:
# Agreeableness 
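# Recode gender as numeric (1/2). Note: for a two-level variable the squared term
# is a linear function of the variable itself, so lm() reports NA for I(gender^2).
our_df2 <- our_df %>% mutate(gender = as.numeric(as.factor(gender)))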

forward_Agr2 <- lm(formula = Agr ~ afinn_mean + I(afinn_mean^2) + 
                   swear_count + I(swear_count^2) + gender + I(gender^2) +
                   negation_freq + I(negation_freq^2) + anger + hogv.cogC + 
                   freq_i + surprise + pause_count + we_count, 
                   data = our_df2[, -c(1, 3, 5:7)])

summary(forward_Agr2)


Call:
lm(formula = Agr ~ afinn_mean + I(afinn_mean^2) + swear_count + 
    I(swear_count^2) + gender + I(gender^2) + negation_freq + 
    I(negation_freq^2) + anger + hogv.cogC + freq_i + surprise + 
    pause_count + we_count, data = our_df2[, -c(1, 3, 5:7)])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.95668 -0.40451  0.02375  0.41547  1.98414 

Coefficients: (1 not defined because of singularities)
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.588e+00  3.979e-01  14.043  < 2e-16 ***
afinn_mean          5.008e-01  1.125e-01   4.452 1.19e-05 ***
I(afinn_mean^2)    -9.173e-02  5.369e-02  -1.709 0.088523 .  
swear_count        -1.186e-01  3.223e-02  -3.679 0.000276 ***
I(swear_count^2)    3.262e-03  1.746e-03   1.868 0.062711 .  
gender             -3.110e-01  7.935e-02  -3.919 0.000109 ***
I(gender^2)                NA         NA      NA       NA    
negation_freq       3.189e+00  1.609e+01   0.198 0.842979    
I(negation_freq^2) -

In [42]:
# Openness
forward_Open2 <- lm(formula = Open ~ hogv.median + I(hogv.median^2) + time.speaking +
                    swear_count + joy + I(joy^2) + um_count + surprise + gender, 
                    data = our_df[, -c(1, 3:6)])


summary(forward_Open2)


Call:
lm(formula = Open ~ hogv.median + I(hogv.median^2) + time.speaking + 
    swear_count + joy + I(joy^2) + um_count + surprise + gender, 
    data = our_df[, -c(1, 3:6)])

Residuals:
     Min       1Q   Median       3Q      Max 
-2.30210 -0.46394  0.01719  0.42965  1.65406 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3.8213314  0.2028482  18.838  < 2e-16 ***
hogv.median       1.6733137  0.8417279   1.988  0.04769 *  
I(hogv.median^2) -0.0552986  1.8799495  -0.029  0.97655    
time.speaking     0.7653497  0.2621148   2.920  0.00376 ** 
swear_count      -0.0422754  0.0151852  -2.784  0.00570 ** 
joy               0.0340408  0.0125402   2.715  0.00701 ** 
I(joy^2)         -0.0002857  0.0002714  -1.053  0.29332    
um_count         -0.0121273  0.0043122  -2.812  0.00523 ** 
surprise         -0.0273042  0.0114213  -2.391  0.01741 *  
genderMale        0.1081464  0.0750770   1.440  0.15073    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘

In [43]:
# Conscientiousness
forward_Cons2 <- lm(formula = Cons ~ time.speaking + I(time.speaking^2) + swear_count + 
                    freq_i + hogv.entropy + hogv.cogC + voice.rate + negation_freq + 
                    trust + negative_bing + fear + anger,
                    data = our_df[, -c(1, 3:4, 6:7)])


summary(forward_Cons2)


Call:
lm(formula = Cons ~ time.speaking + I(time.speaking^2) + swear_count + 
    freq_i + hogv.entropy + hogv.cogC + voice.rate + negation_freq + 
    trust + negative_bing + fear + anger, data = our_df[, -c(1, 
    3:4, 6:7)])

Residuals:
     Min       1Q   Median       3Q      Max 
-2.26708 -0.38063  0.03373  0.47128  1.93919 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.360723   0.583319   9.190  < 2e-16 ***
time.speaking      -0.225239   1.368560  -0.165  0.86938    
I(time.speaking^2)  1.093022   1.127728   0.969  0.33319    
swear_count        -0.026500   0.020469  -1.295  0.19640    
freq_i             -3.929021   1.422614  -2.762  0.00609 ** 
hogv.entropy       -0.058905   0.029675  -1.985  0.04802 *  
hogv.cogC          -0.004883   0.002190  -2.230  0.02648 *  
voice.rate          8.141529   4.033696   2.018  0.04441 *  
negation_freq      -8.430422   6.388626  -1.320  0.18794    
trust               0.010616   0.004470  

In [44]:
# Neuroticism
forward_Emot2 <- lm(formula = Emot ~ afinn_mean + I(afinn_mean^2) + 
                    swear_count + I(swear_count^2) +  I(negation_freq^2) + 
                    negation_freq + time.speaking + 
                    hogv.cogR + hogv.cogC, 
                    data = our_df[, -c(1, 3:5, 7)])

summary(forward_Emot2)


Call:
lm(formula = Emot ~ afinn_mean + I(afinn_mean^2) + swear_count + 
    I(swear_count^2) + I(negation_freq^2) + negation_freq + time.speaking + 
    hogv.cogR + hogv.cogC, data = our_df[, -c(1, 3:5, 7)])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0308 -0.4490  0.0343  0.4767  1.7035 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         4.302e+00  4.969e-01   8.658 2.59e-16 ***
afinn_mean          3.628e-01  1.041e-01   3.486 0.000559 ***
I(afinn_mean^2)    -3.846e-02  5.093e-02  -0.755 0.450779    
swear_count        -7.139e-02  2.975e-02  -2.399 0.017003 *  
I(swear_count^2)    1.075e-03  1.680e-03   0.640 0.522770    
I(negation_freq^2) -1.085e+03  5.595e+02  -1.940 0.053315 .  
negation_freq       8.184e+00  1.578e+01   0.519 0.604446    
time.speaking       4.359e-01  2.660e-01   1.639 0.102308    
hogv.cogR           3.476e-03  2.219e-03   1.566 0.118304    
hogv.cogC          -2.670e-03  2.100e-03  -1.271 0.204594    
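
Whether a squared term is worth keeping can be checked with a nested-model F-test instead of reading individual coefficient rows. The following is a minimal sketch on simulated data (the variables `x` and `y` are hypothetical, and the true relationship is linear by construction):

```r
# Simulated data with a truly linear relationship (hypothetical)
set.seed(3)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- 2 + 0.5 * toy$x + rnorm(200)

# Linear model and the same model with an added quadratic term
fit_linear <- lm(y ~ x, data = toy)
fit_quad   <- lm(y ~ x + I(x^2), data = toy)

# F-test comparing the two nested models
cmp <- anova(fit_linear, fit_quad)
p_value <- cmp[["Pr(>F)"]][2]
p_value

# In-sample R-squared never decreases when a term is added
c(linear = summary(fit_linear)$r.squared,
  quadratic = summary(fit_quad)$r.squared)
```

With a single added term, this F-test gives the same p-value as the t-test for that term in `summary(fit_quad)`; with several added terms at once, `anova()` tests them jointly.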


# 8. Performance Tables

In [45]:
best_total_ml <- tibble(Extr = summary(fit_our_ml)[[1]]$r.squared,
                          Agr = summary(fit_our_ml)[[2]]$r.squared,
                          Cons = summary(fit_our_ml)[[3]]$r.squared,
                          Emot = summary(fit_our_ml)[[4]]$r.squared,
                          Open = summary(fit_our_ml)[[5]]$r.squared)


best_both_ml <- tibble(Extr = summary(both_Extr)$r.squared,
                         Agr = summary(both_Agr)$r.squared, 
                         Cons = summary(both_Cons)$r.squared,
                         Emot = summary(both_Emot)$r.squared,
                         Open = summary(both_Open)$r.squared)

best_forward_ml <- tibble(Extr = summary(forward_Extr)$r.squared,
                         Agr = summary(forward_Agr)$r.squared, 
                         Cons = summary(forward_Cons)$r.squared,
                         Emot = summary(forward_Emot)$r.squared,
                         Open = summary(forward_Open)$r.squared)

best_transformed_ml <- tibble(Extr = summary(forward_Extr2)$r.squared,
                              Agr = summary(forward_Agr2)$r.squared, 
                              Cons = summary(forward_Cons2)$r.squared,
                              Emot = summary(forward_Emot2)$r.squared,
                              Open = summary(forward_Open2)$r.squared)


best_total_ml
best_both_ml
best_forward_ml
best_transformed_ml

Extr,Agr,Cons,Emot,Open
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.3251016,0.4379757,0.2501129,0.2585514,0.2037386


Extr,Agr,Cons,Emot,Open
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.3061446,0.4197786,0.217624,0.2297143,0.1674995


Extr,Agr,Cons,Emot,Open
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.3061446,0.4200304,0.2238837,0.2219242,0.145102


Extr,Agr,Cons,Emot,Open
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.320113,0.4326969,0.2208707,0.2328184,0.1481289


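###### The full model shows the highest R² for every trait, so we decided to use it for our final predictions.

Note that these R² values are computed on the training data. For nested models, training R² cannot decrease when predictors are added, so this comparison favors the largest model by construction.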
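
A comparison that is less tied to the training data is out-of-sample error. The sketch below estimates RMSE by k-fold cross-validation in base R on simulated data; the helper `cv_rmse()` and the data frame `toy` are hypothetical and not part of the notebook.

```r
# Simulated data: 20 candidate predictors, only V1 is related to y (hypothetical)
set.seed(11)
n <- 100
toy <- as.data.frame(matrix(rnorm(n * 20), nrow = n))  # columns V1..V20
toy$y <- 1 + 0.7 * toy$V1 + rnorm(n)

# Hypothetical helper: k-fold cross-validated RMSE for a linear model
cv_rmse <- function(formula, data, k = 5) {
  response <- all.vars(formula)[1]
  # Randomly assign each row to one of k folds
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  sq_err <- numeric(0)
  for (i in seq_len(k)) {
    # Fit on all folds except i, predict the held-out fold i
    fit <- lm(formula, data = data[fold != i, ])
    pred <- predict(fit, newdata = data[fold == i, ])
    sq_err <- c(sq_err, (data[[response]][fold == i] - pred)^2)
  }
  sqrt(mean(sq_err))
}

# In-sample R-squared: the larger model can only look better
r2_small <- summary(lm(y ~ V1, data = toy))$r.squared
r2_full  <- summary(lm(y ~ ., data = toy))$r.squared

# Cross-validated RMSE: the larger model is not guaranteed to win
cv_small <- cv_rmse(y ~ V1, toy)
cv_full  <- cv_rmse(y ~ ., toy)

c(r2_small = r2_small, r2_full = r2_full)
c(cv_small = cv_small, cv_full = cv_full)
```

Because the competition metric is RMSE on unseen vloggers, a cross-validated RMSE per candidate model is closer to what the leaderboard measures than training R².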

# Final model
### Total Model including all predictors and outcome variables according to stepwise regression

In [46]:
fit_our_ml <- lm(cbind(Extr, Agr, Cons, Emot, Open) ~ gender + anger + anticipation + 
                         disgust + fear + joy + sadness + surprise + trust + positive_bing + 
                         negative_bing + um_count + quest_count + pause_count + mean.pitch + 
                         mean.energy + avg.voiced.seg + self_count + we_count + afinn_mean + avg_char_len +
                         hogv.entropy + hogv.median + hogv.cogR + hogv.cogC + freq_i + freq_we + 
                         negation_freq + swear_count + time.speaking + num.turns + voice.rate + avg.len.seg, 
                         data = our_df)

summary(fit_our_ml)

Response Extr :

Call:
lm(formula = Extr ~ gender + anger + anticipation + disgust + 
    fear + joy + sadness + surprise + trust + positive_bing + 
    negative_bing + um_count + quest_count + pause_count + mean.pitch + 
    mean.energy + avg.voiced.seg + self_count + we_count + afinn_mean + 
    avg_char_len + hogv.entropy + hogv.median + hogv.cogR + hogv.cogC + 
    freq_i + freq_we + negation_freq + swear_count + time.speaking + 
    num.turns + voice.rate + avg.len.seg, data = our_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.72945 -0.55213  0.04367  0.63382  1.66407 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.1285651  0.8587297   1.314 0.189812    
genderMale      0.1453904  0.1149739   1.265 0.207051    
anger           0.0108287  0.0183308   0.591 0.555157    
anticipation   -0.0060187  0.0115439  -0.521 0.602502    
disgust         0.0038910  0.0214846   0.181 0.856410    
fear            0.0075479  0.0158966  

# 9. Making predictions on the test set

For the competition we have to make **predictions** for the data in the **test set**.

- The predictions will be evaluated by computing the **Root Mean Squared Error**:
    - $\displaystyle{RMSE =\sqrt{{1 \over 5n} \sum_{k \in \{cEXT, \ldots, cOPN\}} \sum_{i=1}^n (y_{ik} - \hat y_{ik})^2}}$
    - Here 
        - $y_{ik}$ is the observed value for vlogger $i$ 
        - $\hat y_{ik}$ is your prediction for vlogger $i$
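
The formula above averages the squared errors over all $n$ vloggers and all five traits before taking the square root. A minimal sketch in R follows; the function name `rmse_big5()` and the toy matrices are hypothetical.

```r
# Hypothetical helper: RMSE over all vloggers and all traits at once.
# `observed` and `predicted` are n x 5 matrices (or data frames) in the same order.
rmse_big5 <- function(observed, predicted) {
  observed <- as.matrix(observed)
  predicted <- as.matrix(predicted)
  stopifnot(identical(dim(observed), dim(predicted)))
  # mean() over all n * 5 cells equals 1 / (5n) times the double sum
  sqrt(mean((observed - predicted)^2))
}

# Toy check: every prediction is off by exactly 0.5
obs <- matrix(c(5.0, 5.9,
                5.0, 5.3,
                4.6, 5.3,
                5.3, 5.8,
                4.4, 5.5),
              nrow = 2,
              dimnames = list(NULL, c("Extr", "Agr", "Cons", "Emot", "Open")))
pred <- obs + 0.5
rmse_big5(obs, pred)
```

Applied to held-out training rows, a function like this gives an estimate on the same scale as the competition score.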
        
        
You will have to take the following steps:

1. Extract the test set from the `vlogger_df`
2. Compute predictions for the test set using your model
3. Write those predictions to file in the right format

The following gives code for these steps in order.

## 9.1 The test set

The test set consists of those `vlogId`s that are missing from the personality scores data frame `pers`. They are the rows in `vlogger_df` for which the personality scores are missing:

In [47]:
# Filter `our_df` rather than `vlogger_df`, so that the test rows
# also carry the computed features the model needs
testset_vloggers = our_df %>% 
    filter(is.na(Extr))

head(testset_vloggers)

Unnamed: 0_level_0,vlogId,gender,Extr,Agr,Cons,Emot,Open,negative_bing,positive_bing,mean.pitch,⋯,hogv.cogC,freq_i,freq_we,negation_freq,swear_count,we_count,time.speaking,num.turns,voice.rate,avg.len.seg
Unnamed: 0_level_1,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
1,VLOG8,Female,,,,,,8,19,255.58,⋯,178,0.07395498,0.0,0.006430868,0,0,0.46439,0.55015,0.056864,0.84412
2,VLOG15,Male,,,,,,13,25,157.64,⋯,165,0.02557856,0.00365408,0.008526188,1,3,0.71592,0.33342,0.031669,2.1472
3,VLOG18,Male,,,,,,1,8,194.72,⋯,157,0.02803738,0.015576324,0.003115265,0,5,0.69587,0.4086,0.046223,1.703
4,VLOG22,Female,,,,,,0,3,285.79,⋯,161,0.08,0.0,0.04,0,0,0.42792,0.28364,0.035558,1.5087
5,VLOG28,Male,,,,,,7,6,140.13,⋯,158,0.10116732,0.003891051,0.011673152,0,1,0.52014,0.33342,0.036923,1.56
6,VLOG29,Female,,,,,,9,23,256.48,⋯,190,0.06726457,0.00896861,0.011210762,5,4,0.6359,0.28341,0.045721,2.2438
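
The two descriptions of the test set given above (IDs absent from the scores table, and rows with missing scores after joining) select the same vloggers. A small sketch on made-up data; `toy_pers`, `toy_features`, and `toy_df` are hypothetical stand-ins, not the competition data frames:

```r
library(dplyr)

# Made-up stand-ins for the personality scores and the feature data frame
toy_pers     <- data.frame(vlogId = c("VLOG1", "VLOG2"), Extr = c(4.5, 5.1))
toy_features <- data.frame(vlogId = c("VLOG1", "VLOG2", "VLOG3"),
                           mean.pitch = c(200, 150, 250))
toy_df <- left_join(toy_features, toy_pers, by = "vlogId")

# Route 1: rows whose score is missing after the join
test_by_na <- toy_df %>% filter(is.na(Extr))

# Route 2: vloggers that do not occur in the scores table at all
test_by_join <- toy_features %>% anti_join(toy_pers, by = "vlogId")

test_by_na$vlogId     # "VLOG3"
test_by_join$vlogId   # "VLOG3"
```

The `is.na()` route only works if no vlogger in the training set has a genuinely missing score; the `anti_join()` route does not depend on that.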


## 9.2 Predictions

Continuing with the model fitted above (`fit_our_ml` in the code below), we compute predictions with the `predict()` function, which works for almost all models we will encounter.

- A `predict()` method exists for most model-fitting functions that we encounter, such as `lm` and `glm`
    - the first argument should be a model object (`fit_our_ml` in the example)
    - the second argument (`newdata`) should be a data frame with the test set
    - optionally, a `type` argument specifies the type of prediction:
      - for an `lm` object the default `type = "response"` is what we need
      - for a `glm` object `type = "link"` (linear predictor, the default) or `type = "response"` ('response' &rarr; e.g. probabilities for a logistic regression)

For example:

In [48]:
# Predict the personality scores for the vloggers in the test set
pred_mlm = predict(fit_our_ml, newdata = testset_vloggers)

# Always check the output
head(pred_mlm)

Unnamed: 0,Extr,Agr,Cons,Emot,Open
1,4.748218,4.829221,4.297031,4.793365,4.607907
2,3.48576,4.342099,5.115251,4.746141,3.977132
3,5.335477,4.734669,4.901746,5.157851,5.106987
4,4.221069,4.980525,4.063148,4.669564,4.721717
5,3.201738,4.165894,4.493809,4.194404,4.105185
6,4.927087,4.458777,4.262755,4.649074,4.391437
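
The same `predict()` pattern can be tried on a built-in data set. The sketch below uses `mtcars` rather than the competition data, and the models and the `new_cars` data frame are made up purely to show the shape of the output and the effect of `type`:

```r
# New observations to predict for (made-up values)
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 200))

# lm with two outcomes at once: predictions come back as a matrix,
# one column per outcome
fit_lm  <- lm(cbind(mpg, qsec) ~ wt + hp, data = mtcars)
pred_lm <- predict(fit_lm, newdata = new_cars)
pred_lm

# glm (logistic regression): linear predictor versus probabilities
fit_glm   <- glm(am ~ wt, data = mtcars, family = binomial)
pred_link <- predict(fit_glm, newdata = new_cars, type = "link")      # log-odds
pred_prob <- predict(fit_glm, newdata = new_cars, type = "response")  # probabilities
cbind(pred_link, pred_prob)
```

For a binomial `glm`, applying `plogis()` to the `type = "link"` predictions gives the `type = "response"` predictions.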


In [49]:
# compute output data frame
testset_pred = testset_vloggers %>% 
    mutate(
        Extr = pred_mlm[,'Extr'], 
        Agr  = pred_mlm[,'Agr' ],
        Cons = pred_mlm[,'Cons'],
        Emot = pred_mlm[,'Emot'],
        Open = pred_mlm[,'Open']
    ) %>%
    select(vlogId, Extr:Open)

head(testset_pred)

Unnamed: 0_level_0,vlogId,Extr,Agr,Cons,Emot,Open
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,VLOG8,4.748218,4.829221,4.297031,4.793365,4.607907
2,VLOG15,3.48576,4.342099,5.115251,4.746141,3.977132
3,VLOG18,5.335477,4.734669,4.901746,5.157851,5.106987
4,VLOG22,4.221069,4.980525,4.063148,4.669564,4.721717
5,VLOG28,3.201738,4.165894,4.493809,4.194404,4.105185
6,VLOG29,4.927087,4.458777,4.262755,4.649074,4.391437


## 9.3 Writing predictions to file

You need to upload your predictions in a .csv file. However, the predictions are spread over multiple columns (`Extr`, `Agr`, `Cons`, `Emot`, `Open`), while Kaggle expects **long format**!

What does long format look like?

- Every prediction is on a line of its own.
- The columns `vlogId` and `pers_axis` map each prediction to a *vlogger ID* and a *personality axis*.

To achieve this, first `gather` the column values into a single `Expected` column, adding a `pers_axis` column to indicate the original column name:

In [50]:
testset_pred_long  <- 
  testset_pred %>% 
  gather(pers_axis, Expected, -vlogId) %>%
  arrange(vlogId, pers_axis)

head(testset_pred_long)

Unnamed: 0_level_0,vlogId,pers_axis,Expected
Unnamed: 0_level_1,<chr>,<chr>,<dbl>
1,VLOG100,Agr,5.217891
2,VLOG100,Cons,4.635489
3,VLOG100,Emot,5.22179
4,VLOG100,Extr,4.089424
5,VLOG100,Open,4.908177
6,VLOG113,Agr,5.309466
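
`gather()` still works, but the tidyr documentation marks it as superseded by `pivot_longer()`. A sketch of the equivalent reshaping on a made-up two-vlogger data frame (`toy_pred` and its values are invented for illustration):

```r
library(tidyr)
library(dplyr)

# Toy wide-format predictions (made-up values)
toy_pred <- data.frame(
  vlogId = c("VLOG8", "VLOG15"),
  Extr = c(4.7, 3.5), Agr = c(4.8, 4.3), Cons = c(4.3, 5.1),
  Emot = c(4.8, 4.7), Open = c(4.6, 4.0)
)

# Same result as gather(pers_axis, Expected, -vlogId)
toy_long <- toy_pred %>%
  pivot_longer(cols = -vlogId, names_to = "pers_axis", values_to = "Expected") %>%
  arrange(vlogId, pers_axis)

toy_long   # 10 rows: one per vlogger and personality axis
```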


According to the competition's [Evaluation instructions](https://www.kaggle.com/c/bda2019big5/overview/evaluation), Kaggle expects a file with two columns: `Id` and a value column (named `Expected` in the code below).
  
The [Evaluation instructions](https://www.kaggle.com/c/bda2019big5/overview/evaluation) specify that we need to encode the `Agr` prediction for `VLOG8` as `VLOG8_Agr` in the `Id` column. To achieve this, use the `unite()` function from `tidyr`.

`unite()` takes:

- a data frame as its first argument (implicitly passed by the piping operator `%>%`)
- the name of the new column as its second argument (`Id` below)
- the columns to combine as all further arguments (`vlogId` and `pers_axis` below); their values are concatenated with an underscore in between, and the original columns are dropped

Then write the resulting data frame to a .csv file.

In [51]:
# Obtain the right format for Kaggle
testset_pred_final <- 
  testset_pred_long %>%
  unite(Id, vlogId, pers_axis) 

# Check if we succeeded
head(testset_pred_final)

# Write to csv (`file` replaces the `path` argument, which is deprecated as of readr 1.4.0)
testset_pred_final %>%
  write_csv(file = "predictions.csv")

# Check if the file was written successfully.
list.files()

Unnamed: 0_level_0,Id,Expected
Unnamed: 0_level_1,<chr>,<dbl>
1,VLOG100_Agr,5.217891
2,VLOG100_Cons,4.635489
3,VLOG100_Emot,5.22179
4,VLOG100_Extr,4.089424
5,VLOG100_Open,4.908177
6,VLOG113_Agr,5.309466

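
Before submitting, it can help to read the file back and check its shape. The sketch below assumes the two-column layout produced above (`Id`, `Expected`) and IDs of the form `<vlogId>_<axis>`. So that it runs on its own, it writes a made-up toy file to a temporary location; point `csv_path` at `predictions.csv` to check the real file.

```r
# Toy submission (made-up values), written to a temporary file
toy_submission <- data.frame(
  Id       = c("VLOG8_Agr", "VLOG8_Cons", "VLOG8_Emot", "VLOG8_Extr", "VLOG8_Open"),
  Expected = c(4.8, 4.3, 4.8, 4.7, 4.6)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_submission, csv_path, row.names = FALSE)

# Read it back and run a few basic checks
submission <- read.csv(csv_path, stringsAsFactors = FALSE)
vloggers   <- sub("_[^_]*$", "", submission$Id)   # text before the last underscore
axes       <- sub(".*_", "", submission$Id)       # text after the last underscore

stopifnot(
  identical(names(submission), c("Id", "Expected")),
  !anyNA(submission$Expected),        # no missing predictions
  !any(duplicated(submission$Id)),    # every Id occurs exactly once
  all(table(vloggers) == 5)           # five personality axes per vlogger
)
table(axes)
```

`stopifnot()` is silent when every check passes and stops with an error naming the first check that fails.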


# References

Christian, H., Suhartono, D., Chowanda, A., & Zamli, K. Z. (2021). Text based personality prediction from multiple social media data sources using pre-trained language model and model averaging. Journal of Big Data, 8(1). https://doi.org/10.1186/s40537-021-00459-1
Laserna, C. M., Seih, Y. T., & Pennebaker, J. W. (2014). Um . . . Who Like Says You Know. Journal of Language and Social Psychology, 33(3), 328–338. https://doi.org/10.1177/0261927x14526993
Lee, C. H., Kim, K., Seo, Y. S., & Chung, C. K. (2007). The Relations Between Personality and Language Use. The Journal of General Psychology, 134(4), 405–413. https://doi.org/10.3200/genp.134.4.405-414
Mehta, Y., Fatehi, S., Kazameini, A., Stachl, C., Cambria, E., & Eetemadi, S. (2020). Bottom-Up and Top-Down: Predicting Personality with Psycholinguistic and Language Model Features. 2020 IEEE International Conference on Data Mining (ICDM). https://doi.org/10.1109/icdm50108.2020.00146
Scully, I. D., & Terry, C. P. (2011). Self-Referential Memory for the Big-Five Personality Traits. Psi Chi Journal of Psychological Research, 16(3), 123–128. https://doi.org/10.24839/1089-4136.jn16.3.123



# Division of labour

All
* Tokenization of transcripts

Jelena Kalinić
* Feature Nrc
* Feature Bing
* Feature intonation
* Feature swear words
* Feature syllables per `vlogId`
* Feature Self-reference words
* Model selection pt.2

Jesse Boot
* Feature average number of words in a sentence
* Description of features

Jessica Bormann
* Feature number of questions
* Feature number of um's
* Feature number of pauses
* Feature swear words
* Added features of other groups
* Model selection pt.1

Once you have clicked the <span style="background-color:#000000;color:white;padding:3px;border-radius:10px;padding-left:6px;padding-right:6px;">⟳ Save Version&nbsp;&nbsp;|&nbsp;&nbsp;0</span> button at the top left and selected the "Save & Run All (Commit)" option, go to the Viewer. There you will find your "predictions.csv" under Output. You will also see a button there that lets you submit your predictions with one click.