# Task 1

In [1]:
import pandas as pd
from statistics import mean, stdev
import contractions
import nltk

nltk.download("stopwords")
nltk.download("punkt")

script = pd.read_csv("data/lotr_scripts.csv", encoding='utf-8') # load the data
concat = {} # stores the concatenated scripts for each character

script_lengths = {} # individual script lengths grouped by character
stopwords = nltk.corpus.stopwords.words("english")

raw_scripts = {}

for index, row in script.iterrows(): # go through the entries one by one for clarity (of course, could be implemented more compactly with nice pandas oneliners)
    if not row["char"] in concat:
        concat[row["char"]] = []
        script_lengths[row["char"]] = []
        raw_scripts[row["char"]] = ""
    
    raw_scripts[row["char"]] += str(row["dialog"]).lower() + " "
    dialogue = contractions.fix(str(row["dialog"])) # can't -> can not etc.
    words = nltk.word_tokenize(dialogue) # tokenize
    words = [word.lower() for word in words if word.isalpha()] # to lowercase if the "word" consists of alphabet letters
    
    script_lengths[row["char"]].append(len(words)) # length of the individual script
    concat[row["char"]].extend(words) # add to the concatenated list for this character

concat = dict(sorted(concat.items(), key=lambda i: -len(i[1]))) # sort by number of tokens in descending order
details = {} # dictionary containing the derived characteristics for every character

for character in concat:
    if not character in details:
        details[character] = {}

    details[character]["token_num"] = len(concat[character]) # total number of tokens
    details[character]["vocab_size"] = len(set(concat[character])) # vocabulary size = number of unique tokens
    details[character]["avg_script_length"] = mean(script_lengths[character]) # mean of script lengths
    details[character]["stopword_proportion"] = 0
    
    if len(script_lengths[character]) == 1:
        details[character]["sd_script_length"] = float('inf') # if only one script available for this character, SD is "infinite"
    else:
        details[character]["sd_script_length"] = stdev(script_lengths[character])
    
    for token in concat[character]:
        if token in stopwords:
            details[character]["stopword_proportion"] += 1 # if contained in stopword list, increment number of stopwords per char
    
    details[character]["stopword_proportion"] /= len(concat[character]) # divide by the total number of tokens

pd.DataFrame.from_dict(details, orient="index") # display table

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/syomasa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/syomasa/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,token_num,vocab_size,avg_script_length,stopword_proportion,sd_script_length
GANDALF,3138,880,15.382353,0.528999,17.865421
SAM,2164,559,10.018519,0.577634,11.186355
FRODO,1826,494,8.115556,0.591457,8.211761
ARAGORN,1394,531,7.535135,0.497131,7.933837
GOLLUM,1311,362,9.857143,0.504195,9.257383
...,...,...,...,...,...
OLD MAN,1,1,1.000000,0.000000,inf
FRODO VOICE,1,1,1.000000,0.000000,inf
MRS BRACEGIRDLE,1,1,1.000000,0.000000,inf
PROUDFOOT HOBBIT,1,1,1.000000,0.000000,inf


# Task 2

This code implements the personality detection presented in https://github.com/desaichirayu/Personality-Attribution-using-Natural-Language-Processing. Of the various models provided there, the multilayer perceptron (MLP) implementation was utilized since it seemed to provide the best overall performance among all traits. However, it required some modifications since it only provided a binary representation of the trait vector which would be poorly suited for PCA and would not allow determining the dominating traits.

Alas, at least with the LOTR dataset, the performance of this tool is rather questionable: the results vary considerably between repeated runs and while this could be partly attributed to the randomized train-test-split, the same phenomenon seems to occur even when using the whole essays dataset as the training set (perhaps the training does not converge properly with this dataset?)

**UPDATE:** Some improvement regarding this after fixing the TF-IDF bug. In the first implementation, only one concatenated corpora was processed during one iteration, resulting in the IDF term being constant/zero(?!) Now, the concatenated corpora for all characters are inputted at the same time. Note that the same kind of preprocessing pipeline as for the training set is used (i.e. only lowercase transform). However, see report for possible limitations on using the TF-IDF model.

However, according to literature, even the best personality trait detectors seem to top out at about 60 % in terms of accuracy (and this is, of course, even measured with similar data to the training data!) so not much can be expected. Perhaps better results could be obtained by utilizing a model specifically trained with dialogue but could not find many (there is MELD but I guess we are not allowed to use it; furthermore, it does not even use the Big Five traits...). Some alternatives for the second model:
- TwitPersonality (https://github.com/D2KLab/twitpersonality): Could work better with dialogue (analogy to short tweets instead of the stream-of-consciousness essays used by some other models); however, a bit difficult to setup and utilizes the myPersonality dataset which is no longer officially available (of course, could still be obtained)
- using another model from https://github.com/desaichirayu/Personality-Attribution-using-Natural-Language-Processing, such as Naive-Bayes
- https://github.com/amirmohammadkz/personality_detection: a bit difficult to setup and uses the essays dataset like https://github.com/desaichirayu/Personality-Attribution-using-Natural-Language-Processing; furthermore, seems to be linked somehow to the SenticNet implementation which I'm not sure if we are allowed to use or not

All in all, would probably make sense to select a model that utilizes word embeddings (from a comparison viewpoint)

In [11]:
# adapted and modified from
# https://github.com/desaichirayu/Personality-Attribution-using-Natural-Language-Processing/blob/master/code%20and%20data/mlp_simple.py
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

df = pd.read_csv(r"data/essays.csv", names=["author_id", "essay", "Extraversion", "Neuroticism",
                                      "Agreeableness", "Conscientiousness", "Openness"], encoding="cp1252") # load essay dataset
x = df['essay'][1:]
x = x.str.lower() # to lowercase

classifiers = []
choices = {
        0: ('Extraversion', ('tanh', 'adaptive', 'lbfgs')),
        1: ('Neuroticism', ('tanh', 'adaptive', 'lbfgs')),
        2: ('Agreeableness', ('tanh', 'adaptive', 'lbfgs')),
        3: ('Conscientiousness', ('relu', 'invscaling', 'lbfgs')),
        4: ('Openness', ('relu', 'invscaling', 'lbfgs'))
    } # specs for every trait classifier (activation function etc.)

for trait in range(0, 5):
    y = df[choices[trait][0]][1:] # select essays associated with this trait
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=11) # random split into train & test sets
    
    # TF-IDF vectorizer
    vectorizer = TfidfVectorizer()
    xx_train = vectorizer.fit_transform(x_train)
    xx_test = vectorizer.transform(x_test)
    
    # specify and train a classifier for every trait
    classifiers.append(MLPClassifier(activation=choices[trait][1][0], alpha=0.0001, hidden_layer_sizes=(60),
                               learning_rate=choices[trait][1][1], max_iter=20, solver=choices[trait][1][2]))
    classifiers[trait].fit(xx_train, y_train)
    
    predictions = classifiers[trait].predict(xx_test) # predict for test set
    score = accuracy_score(y_test, predictions)
    print("Accuracy for {}: {}".format(choices[trait][0], score))
    


print("Training and testing DONE!")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Accuracy for Extraversion: 0.5607779578606159


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Accuracy for Neuroticism: 0.5818476499189628


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Accuracy for Agreeableness: 0.5445705024311183


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Accuracy for Conscientiousness: 0.5283630470016207
Accuracy for Openness: 0.6142625607779578
Training and testing DONE!


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


In [13]:
xx_test = vectorizer.transform(raw_scripts.values()) # transform the "raw" but lowercased, concat. scripts for each character
traitProbs = {} # a dict containing the probabilites of every trait for each character

for character in concat:
    traitProbs[character] = [0] * 5 # initialize trait list

for trait in range(0, 5):
    prob = classifiers[trait].predict_proba(xx_test)[:,1] # predict probability for every trait
    
    for i, character in enumerate(concat):
        traitProbs[character][trait] = prob[i] 

jutut = pd.DataFrame.from_dict(traitProbs, orient="index") # display table containing the probabilites of every trait for each character
jutut.columns = [choices[i][0] for i in range(5)]


display(jutut)

Unnamed: 0,Extraversion,Neuroticism,Agreeableness,Conscientiousness,Openness
GANDALF,0.014638,0.954037,0.000473,0.003313,0.215194
SAM,0.906167,0.789434,0.159345,0.999635,0.827121
FRODO,0.991890,0.721836,0.095620,0.008255,0.474021
ARAGORN,0.962266,0.544277,0.509467,0.944991,0.956488
GOLLUM,0.021897,0.794533,0.032910,0.999983,0.989741
...,...,...,...,...,...
OLD MAN,0.991890,0.721836,0.095620,0.008255,0.474021
FRODO VOICE,0.998642,0.998033,0.999709,0.998159,0.920064
MRS BRACEGIRDLE,0.916730,0.812455,0.101024,0.018099,0.730887
PROUDFOOT HOBBIT,0.849733,0.096252,0.712639,0.987220,0.924599


In [15]:
from sklearn.metrics.pairwise import cosine_similarity
import plotly.express as px
import numpy as np

num_char = 5 # number of characters to be included in the scatter plot
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(jutut[0:num_char].values) # peform PCA

df = pd.DataFrame(principalComponents, index=list(concat.keys())[0:num_char], columns=['PC1','PC2'])
fig = px.scatter(df, x='PC1', y='PC2', text=df.index) # plot according to the principal components
fig.update_traces(textposition="top center")
fig.update_layout(height=800)
fig.show()

# Task 3

In [7]:
similarities = {} # contains the pairwise cosine similarities for characters

for character1 in list(concat.keys())[0:num_char]:
    if not character1 in similarities:
        similarities[character1] = {}

    for character2 in list(concat.keys())[0:num_char]: # I know, CosSim(A,B)=CosSim(B,A)...
        similarities[character1][character2] = cosine_similarity(np.array(traitProbs[character1]).reshape(1, -1), np.array(traitProbs[character2]).reshape(1, -1))[0][0]

pd.DataFrame.from_dict(similarities, orient="index")

Unnamed: 0,GANDALF,SAM,FRODO,ARAGORN,GOLLUM
GANDALF,1.0,0.540661,0.870873,0.390399,0.547448
SAM,0.540661,1.0,0.704444,0.913348,0.987756
FRODO,0.870873,0.704444,1.0,0.671,0.688016
ARAGORN,0.390399,0.913348,0.671,1.0,0.846934
GOLLUM,0.547448,0.987756,0.688016,0.846934,1.0


# Task 4

repeat for alternative personality trait detection method

In [141]:
# this follows the personality detection method created by
# author jkwieser at https://github.com/jkwieser/personality-detection-text
import pickle
from sklearn.feature_extraction.text import CountVectorizer
import plotly.express as px
import pandas as pd
import re
import numpy as np

# Loading gloVe pretrained models
cEXT = pickle.load( open( "data/models/cEXT.p", "rb"))
cNEU = pickle.load( open( "data/models/cNEU.p", "rb"))
cAGR = pickle.load( open( "data/models/cAGR.p", "rb"))
cCON = pickle.load( open( "data/models/cCON.p", "rb"))
cOPN = pickle.load( open( "data/models/cOPN.p", "rb"))
vectorizer_31 = pickle.load( open( "data/models/vectorizer_31.p", "rb"))
vectorizer_30 = pickle.load( open( "data/models/vectorizer_30.p", "rb"))

# Using the pretrained models to generate the big 5 predictions
def predict_personality(text):
    scentences = re.split("(?<=[.!?]) +", text)
    text_vector_31 = vectorizer_31.transform(scentences)
    text_vector_30 = vectorizer_30.transform(scentences)
    EXT = cEXT.predict_proba(text_vector_31)[0][1]
    NEU = cNEU.predict_proba(text_vector_30)[0][1]
    AGR = cAGR.predict_proba(text_vector_31)[0][1]
    CON = cCON.predict_proba(text_vector_31)[0][1]
    OPN = cOPN.predict_proba(text_vector_31)[0][1]
    return EXT, NEU, AGR, CON, OPN

# Change this to analyse another character
character = 'Gollum'

character = character.upper()

# creating a dataframe just out of the char and dialog sections
movie_df = pd.DataFrame(script,
           columns=['char', 'dialog'])

# getting all the dialogue of one character
dialogue = (movie_df.loc[movie_df.char == character])

all_text = ""
for dialog in dialogue.dialog:
    all_text = all_text + dialog

all_text = all_text.replace("\xa0", "")

if(len(all_text) > 2):   
    predictions = predict_personality(all_text)
    print("predicted personality:", predictions)
    df = pd.DataFrame(dict(r=predictions, theta=['Extraversion','Neuroticism','Agreeableness', 'Conscientiousness', 'Openness']))
    fig = px.line_polar(df, r='r', theta='theta', line_close=True)
    fig.show()
else:
    print("Didn't find dialogue of the character check character input!")

predicted personality: (0.5400782768802677, 0.52, 0.5007047819659296, 0.44071084082553935, 0.6815849265850413)



Trying to unpickle estimator LogisticRegression from version 0.22.1 when using version 1.1.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


Trying to unpickle estimator DecisionTreeClassifier from version 0.22.1 when using version 1.1.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


Trying to unpickle estimator RandomForestClassifier from version 0.22.1 when using version 1.1.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


Trying to unpickle estimator CountVectorizer from version 0.22.1 when using version 1.1.2. This might lead to