# 1. Preparation

**1.0 Import Lexicons** <br>
Initially we intended to use the LIWC lexicon dictionaries (see [here](https://pypi.org/project/liwc/); install using `!pip install -U liwc`), but they require a substantial fee. We therefore turned to a free alternative called Empath (the usage guide can be found [here](https://github.com/Ejhfast/empath-client)). If it does not perform adequately, we may explore [SEANCE](https://www.linguisticanalysistools.org/seance.html) as a secondary option. <br>

**1.1 Explore Empath to identify existing categories that can be adopted directly.** <br>
We then examined Empath. In [Yarkoni (2011)](https://www.sciencedirect.com/science/article/pii/S0092656610000541), Table 1 reports the correlations between LIWC categories and the Big Five personality dimensions. Using this as a benchmark, we filtered the Empath categories, identified the relevant labels, and applied them to our dataset. <br>

**1.2 Use spaCy to add the categories we need that are missing from Empath.** <br>
For categories missing from Empath, such as 1st, 2nd, and 3rd person pronouns, we used spaCy for parsing. Certain categories nevertheless remain inaccessible, which we acknowledge as a limitation of this study. <br>

**The results show that:**

- Empath includes:

`affection`, `positive_emotion`, `negative_emotion`, `anger`, `sadness`, `hearing`, `communication`, `friends`, `family`, `swearing_terms`

- Require spaCy:

`pronouns (PRON)`, `articles (DET)`, `prepositions (ADP)`, `numbers (NUM)`,
`1st person sg/pl`, `2nd person`, `3rd person pronouns`,
`past/present/future tense verbs`.

The remaining categories are excluded from this study.
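
The filtering step in 1.1 can be illustrated with a minimal sketch. It assumes a handful of toy category names and search terms; in the notebook the real names come from `lexicon.cats`. The helper `match_categories` is hypothetical and only shows the case-insensitive substring test in both directions.

```python
# Toy stand-ins (assumptions): the real category names come from `lexicon.cats`.
empath_names = ["positive_emotion", "negative_emotion", "swearing_terms", "family", "sports"]
liwc_terms = ["positive", "swear", "family", "sport", "negation"]


def match_categories(search_terms, category_names):
    """Return (category names containing a search term, search terms found in no name)."""
    terms = [t.lower() for t in search_terms]
    names = [n.lower() for n in category_names]
    # A category matches when any search term is a substring of its name.
    matched = [n for n in names if any(t in n for t in terms)]
    # A search term is unmatched when it is a substring of no category name.
    unmatched = [t for t in terms if not any(t in n for n in names)]
    return matched, unmatched


matched, unmatched = match_categories(liwc_terms, empath_names)
print("Matched categories:", matched)
print("Not matched terms:", unmatched)
```

Substring matching is deliberately loose, so short search terms can match unrelated category names; the matched list is best treated as a set of candidates to review by hand.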


In [32]:
### 1.0 Import Lexicon pkg ###
##############################

# !pip install empath spacy pandas numpy scipy statsmodels
import pandas as pd
import spacy
from empath import Empath
lexicon = Empath()

### 1.1 EXPLORING EMPATH ###
### WHICH EXISTING CATEGORIES IN EMPATH CAN BE ADOPTED DIRECTLY ###
# Print all category (class) names
print(list(lexicon.cats.keys()))
print()

# Define the list of class (category) names we're looking for
categories_to_check = ["total pronouns", "pron", "first person sing.", "first person", "second person", "third person",
 "negation", "assent", "articles", "prep", "prepositions", "number",
 "affect", "positive", "optimism", "negative", "anxiety", "anger", "sadness", 
 "cognitive", "causation", "insight", "discrepancy", "inhibition", "tentative", "certainty", 
 "sensory", "seeing", "hearing", "feeling", "social", "communication", "references",
 "friend", "family", "human", "time", "tense", "space", "up", "down", 
 "inclusive", "exclusive", "motion", "occupation", "school", "job", "work", "achieve", 
 "leisure", "home", "sport", "tv", "movie", "music", "sound", "money", "finance",
 "metaphysics", "religion", "death", "physical", "body", "sexuality", "sex", "eat", "drink", "sleep", "groom", "swear"]

# Lowercase the search terms for case-insensitive comparison
categories_to_check_lower = [cat.lower() for cat in categories_to_check]

# Empath categories whose name contains at least one search term (substring match, case insensitive)
matched_categories = [cat for cat in lexicon.cats if any(search_term in cat.lower() for search_term in categories_to_check_lower)]
# Search terms that do not occur in any Empath category name
not_matched_categories = [cat for cat in categories_to_check if not any(cat.lower() in lexicon_cat.lower() for lexicon_cat in lexicon.cats)]

# Output matched and not matched categories
print("Matched categories:", matched_categories)
print()
print("Not matched categories:", not_matched_categories)


['help', 'office', 'dance', 'money', 'wedding', 'domestic_work', 'sleep', 'medical_emergency', 'cold', 'hate', 'cheerfulness', 'aggression', 'occupation', 'envy', 'anticipation', 'family', 'vacation', 'crime', 'attractive', 'masculine', 'prison', 'health', 'pride', 'dispute', 'nervousness', 'government', 'weakness', 'horror', 'swearing_terms', 'leisure', 'suffering', 'royalty', 'wealthy', 'tourism', 'furniture', 'school', 'magic', 'beach', 'journalism', 'morning', 'banking', 'social_media', 'exercise', 'night', 'kill', 'blue_collar_job', 'art', 'ridicule', 'play', 'computer', 'college', 'optimism', 'stealing', 'real_estate', 'home', 'divine', 'sexual', 'fear', 'irritability', 'superhero', 'business', 'driving', 'pet', 'childish', 'cooking', 'exasperation', 'religion', 'hipster', 'internet', 'surprise', 'reading', 'worship', 'leader', 'independence', 'movement', 'body', 'noise', 'eating', 'medieval', 'zest', 'confusion', 'water', 'sports', 'death', 'healing', 'legend', 'heroic', 'celebr

In [33]:
# do an example analysis for light testing

# result = lexicon.analyze("he kiss the other person", normalize=True)
# filtered_result = {category: value for category, value in result.items() if value > 0}
# print(filtered_result)

# 2. Data Processing

**Recall the hypotheses for word level**

We aim to investigate whether correlations between LIWC categories and Big Five personality traits observed in Yarkoni (2011) align with our dataset’s trends. 

**Hypotheses Summary**

|   EMPATHYe   |Label Name| Neuroticism  | Extroversion |   Openness   |Agreeableness |Conscientiousness|
|--------------|--------------|--------------|--------------|--------------|--------------|--------------|
| pronouns     |*pronouns|      +       |      +       |       --     |       ++     |       -      |
| 1st person sing.|*first_person_sg|   ++      |      +       |       -      |       +      |       0      |
| 1st person plural|*first_person_pl|   -      |     ++       |       --     |       ++     |       +      |
| 1st person   |*first_person|++|+|--|++|+|
| 2nd person   |*second_person|--|++|--|+|0|
| 3rd person   |*third_person|+|+|-|+|-|
| negations    |*negations|++|-|--|-|--|
| articles     |*articles|--|-|++|+|++|
| prepositions |*prepositions|-|-|++|+|+|
| numbers      |*numbers|-|--|--|++|+|
| affect       |affection|+|+|--|+|-|
| positive emotions|positive_emotion|-|++|--|++|+|
| optimism    |optimism|--|+|0|++|++|
| negative emotions|negative_emotion|++|+|0|--|--|
| anger        |anger|++|+|+|--|--|
| sadness      |sadness|++|+|-|+|--|
| hearing      |hearing|+|++|--|+|--|
| communication|communication|0|++|-|+|-|
| friends      |friends|--|++|-|++|+|
| family       |family|-|+|--|++|+|
| past tense vb.|*past_tense|+|-|--|+|0|
| present tense vb.|*present_tense|+|-|--|0|-|
| future tense vb.|*future_tense|-|-|-|-|-|
| occupation   |occupation|+|--|+|-|+|
| school       |school|+|-|+|-|-|
| job/work     |work|+|--|+|-|+|
| achievement  |achievement|+|--|-|+|--|
| leisure      |leisure|-|++|--|++|+|
| home         |home|0|+|--|++|+|
| sports       |sports|-|+|--|+|0|
| music        |music|-|++|+|+|--|
| money        |money|+|-|-|--|-|
| religion     |religion|-|++|+|+|-|
| death        |death|+|+|++|--|--|
| body states  |body|+|++|-|++|-|
| sexuality    |sexuality|+|++|0|++|-|
| eating       |eating|-|+|--|+|-|
| sleep        |sleep|++|-|--|++|-|
| swearing words|swearing_terms|++|+|+|--|--|

(_Label Name_ refers to the category's new name in our _EMPATHYe_ lexicon.)
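
To compare these hypotheses with observed correlations later on, the symbols can be turned into numbers. The sketch below is one possible encoding, not part of the original analysis: the five-point scale in `SYMBOL_TO_SCORE` and the helpers `encode_row` and `expected_sign` are assumptions, and only two rows of the table are copied in for illustration.

```python
# Assumed numeric scale for the table symbols (not defined in the source).
SYMBOL_TO_SCORE = {"++": 2, "+": 1, "0": 0, "-": -1, "--": -2}
TRAITS = ["Neuroticism", "Extroversion", "Openness", "Agreeableness", "Conscientiousness"]


def encode_row(symbols):
    """Map one table row (five symbols, in TRAITS order) to numeric scores."""
    return {trait: SYMBOL_TO_SCORE[symbol] for trait, symbol in zip(TRAITS, symbols)}


# Two rows copied from the hypotheses table above.
hypotheses = {
    "anger": encode_row(["++", "+", "+", "--", "--"]),
    "friends": encode_row(["--", "++", "-", "++", "+"]),
}


def expected_sign(label, trait):
    """Return +1, 0 or -1: the hypothesised direction of the correlation."""
    score = hypotheses[label][trait]
    return (score > 0) - (score < 0)


print(expected_sign("anger", "Neuroticism"))
print(expected_sign("friends", "Openness"))
```

An observed coefficient can then be checked against `expected_sign` to see whether its direction agrees with the hypothesis.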

**Re-classification Required**:<br>
The following labels from _Empath_ will be reclassified and renamed in our lexicon:
- Work: domestic_work, blue_collar_job, white_collar_job, work
- Music: music, sound, musical 
- Sexuality: sexual 

**Dataset**<br>
Our dataset uses the complete collection of the eight *Harry Potter* films. Initially we used only the first film, but this yielded too little data to reach significance, so we ultimately included all eight films. <br>

**Developing our own lexicon** <br>
After the pre-processing stage, each character’s dialogue forms an individual dataset labeled with tokens, frequencies, and part-of-speech tags. We develop a new lexicon whose labels contain the necessary lexicon data: some labels are taken directly from Empath, while others require additional parsing with spaCy.<br>

We proceed as follows:

**2.1 Handling Empath** <br>
We apply Empath to identify and annotate the lexical categories in each character's lines, focusing on existing categories relevant to our study, such as emotional and social words, as outlined in our lexicon guide.
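
How a word is matched against the lexicon affects the counts. The notebook's `label_empathy` function tests whether a lexicon word occurs anywhere in the lowercased line, so it also fires inside longer words. The sketch below contrasts that with whole-token matching on a toy lexicon; the lexicon, the regex tokenizer (a stand-in for spaCy) and both helper functions are assumptions for illustration only.

```python
import re
from collections import Counter

# Toy lexicon (assumption); the notebook uses the filtered Empath categories.
toy_lexicon = {
    "family": {"mother", "father", "son"},
    "death": {"die", "dead", "kill"},
}


def tokenize(text):
    """Very rough tokenizer standing in for spaCy: lowercase alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_whole_tokens(text, lexicon):
    """Count a label only when a token equals a lexicon word."""
    counts = Counter()
    for token in tokenize(text):
        for label, words in lexicon.items():
            if token in words:
                counts[label] += 1
    return counts


def count_substrings(text, lexicon):
    """Count a label whenever a lexicon word occurs anywhere in the text."""
    lowered = text.lower()
    counts = Counter()
    for label, words in lexicon.items():
        for word in words:
            if word in lowered:
                counts[label] += 1
    return counts


line = "My father studied the song in person."
print(dict(count_whole_tokens(line, toy_lexicon)))
print(dict(count_substrings(line, toy_lexicon)))
```

In this example the substring variant counts `son` (inside "song" and "person") and `die` (inside "studied"), which the whole-token variant does not. This difference is worth keeping in mind when reading the percentages reported later.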

**2.2 Applying spaCy** <br>
Using spaCy, we tag and extract lexical categories unavailable in Empath, such as specific pronoun types and verb tenses.
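
The tense rule used in this notebook can be sketched without loading a spaCy model. The example assumes hand-written `(text, fine-grained tag, coarse POS)` triples in place of real spaCy tokens; the tag groupings mirror those in `label_tenses`, while `TAG_TO_TENSE` and `bucket_tenses` are hypothetical names.

```python
# Penn Treebank tags grouped as in the notebook's label_tenses function.
TAG_TO_TENSE = {
    "VBD": "past_tense", "VBN": "past_tense",
    "VBZ": "present_tense", "VBP": "present_tense", "VBG": "present_tense",
}


def bucket_tenses(tagged_tokens):
    """Sort verbs into tense buckets.

    tagged_tokens: list of (text, fine_tag, coarse_pos) triples, a stand-in
    for spaCy's token.text, token.tag_ and token.pos_.
    """
    buckets = {"past_tense": [], "present_tense": [], "future_tense": []}
    for i, (text, tag, pos) in enumerate(tagged_tokens):
        if tag in TAG_TO_TENSE:
            buckets[TAG_TO_TENSE[tag]].append(text)
        elif tag == "MD":
            # A modal followed directly by a verb marks that verb as future.
            if i + 1 < len(tagged_tokens) and tagged_tokens[i + 1][2] == "VERB":
                buckets["future_tense"].append(tagged_tokens[i + 1][0])
    return buckets


# Hand-tagged example sentence (assumed tags, not real model output).
tagged = [
    ("He", "PRP", "PRON"), ("will", "MD", "AUX"), ("return", "VB", "VERB"),
    ("because", "IN", "SCONJ"), ("she", "PRP", "PRON"), ("asked", "VBD", "VERB"),
]
print(bucket_tenses(tagged))
```

Because the rule keys on the `MD` tag alone, any modal (for example "can" or "must") followed by a verb is counted as future tense, not only "will" and "shall".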

In [34]:
### 2.1 START WITH EMPATH ###
#############################

# labels to keep
labels_to_keep = [
    'money', 'domestic_work', 'sleep', 'occupation', 'family', 'swearing_terms', 'leisure', 'school',
    'blue_collar_job', 'optimism', 'home', 'sexual', 'religion', 'body', 'eating', 'sports',
    'death', 'communication', 'hearing', 'music', 'sound', 'work', 'sadness', 'emotional', 'affection',
    'anger', 'white_collar_job', 'negative_emotion', 'friends', 'achievement', 'positive_emotion', 'musical'
]

# merging rules:
merge_rules = {
    'work': ['domestic_work', 'blue_collar_job', 'white_collar_job', 'work'],
    'music': ['music', 'sound', 'musical'],
    'sexuality': ['sexual']
}

temp_lexicon = {} # temporarily store the lexicon

# filter and merge based on the rules above
for label in labels_to_keep:
    # check whether this label needs to be merged
    merged = False
    for new_label, old_labels in merge_rules.items():
        if label in old_labels:
            # the label is part of a merge rule: add its words to the new label
            if new_label not in temp_lexicon:
                temp_lexicon[new_label] = set()
            temp_lexicon[new_label].update(lexicon.cats[label])
            merged = True
            break
    # the label is not covered by any merge rule: copy it over unchanged
    if not merged:
        temp_lexicon[label] = lexicon.cats[label]

# Clear original lexicon contents
lexicon.cats.clear()

# Reassign the updated content to lexicon
lexicon.cats.update(temp_lexicon)

# Print the current label names in the lexicon
print("Current labels in lexicon:", lexicon.cats.keys())

Current labels in lexicon: dict_keys(['money', 'work', 'sleep', 'occupation', 'family', 'swearing_terms', 'leisure', 'school', 'optimism', 'home', 'sexuality', 'religion', 'body', 'eating', 'sports', 'death', 'communication', 'hearing', 'music', 'sadness', 'emotional', 'affection', 'anger', 'negative_emotion', 'friends', 'achievement', 'positive_emotion'])


In [35]:
### 2.2 USE SPACY TO PROCEED MORE ###
#####################################

# [1] PERSONAL PRONOUNS
nlp = spacy.load("en_core_web_lg") # load the English model
# define the pronoun labels
pronouns = {
    "first_person_sg": ["I", "me", "my", "mine"],
    "first_person_pl": ["we", "us", "our", "ours"],
    "first_person": ["I", "me", "my", "mine", "we", "us", "our", "ours"],
    "second_person": ["you", "your", "yours"],
    "third_person": ["he", "him", "his", "she", "her", "hers", "they", "them", "their", "theirs"],
}

# add the pronoun labels to the lexicon
for label, words in pronouns.items():
    lexicon.cats[label] = words

# check the lexicon after the additions
print(list(lexicon.cats.keys()))

['money', 'work', 'sleep', 'occupation', 'family', 'swearing_terms', 'leisure', 'school', 'optimism', 'home', 'sexuality', 'religion', 'body', 'eating', 'sports', 'death', 'communication', 'hearing', 'music', 'sadness', 'emotional', 'affection', 'anger', 'negative_emotion', 'friends', 'achievement', 'positive_emotion', 'first_person_sg', 'first_person_pl', 'first_person', 'second_person', 'third_person']


In [37]:
### 2.2 USE SPACY TO PROCEED MORE ###
#####################################

# [2] tense verbs
def label_tenses(file_path):
    # read the CSV file without a header row
    df = pd.read_csv(file_path, header=None)
    
    # store the results
    labeled_verbs = {
        'past_tense': [],
        'present_tense': [],
        'future_tense': []
    }

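    # iterate over every row of text
    for index in range(len(df)):
        # process the row with spaCy
        doc = nlp(df.iloc[index, 0])  # assumes the text is in the first column

        # look for verbs
        for token in doc:
            # check the verb's tense
            if token.tag_ in ['VBD', 'VBN']:  # past tense and past participle
                labeled_verbs['past_tense'].append(token.text)
            elif token.tag_ in ['VBZ', 'VBP', 'VBG']:  # present-tense forms and gerunds/present participles
                labeled_verbs['present_tense'].append(token.text)
            elif token.tag_ == 'MD':  # modal verbs are treated as future markers
                # make sure a token follows the modal, to avoid an index error
                if token.i < len(doc) - 1 and doc[token.i + 1].pos_ == "VERB":
                    # treat the verb after the modal as future tense
                    labeled_verbs['future_tense'].append(doc[token.i + 1].text)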

    return labeled_verbs

# [3] numbers

def label_numbers(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_numbers = []

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # text in the first column of the current row

        for token in doc:
            if token.like_num:  # the token looks like a number
                labeled_numbers.append(token.text)

    return labeled_numbers


# [4] prepositions

def label_prepositions(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_prepositions = []

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # text in the first column of the current row

        for token in doc:
            if token.pos_ == "ADP":  # prepositions carry the POS tag ADP
                labeled_prepositions.append(token.text)

    return labeled_prepositions


# [5] articles

def label_articles(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_articles = []

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # text in the first column of the current row

        for token in doc:
            if token.pos_ == "DET":  # articles carry the POS tag DET (as do other determiners)
                labeled_articles.append(token.text)

    return labeled_articles


# [6] negations

def label_negations(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_negations = []

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # text in the first column of the current row

        for token in doc:
            if token.dep_ == "neg":  # negation words carry the dependency label neg
                labeled_negations.append(token.text)

    return labeled_negations

# # [7] pronouns

# def label_pronouns(file_path):
#     df = pd.read_csv(file_path, header=None)
#     labeled_pronouns = {"first_person": [], "second_person": [], "third_person": []}

#     for index in range(len(df)):
#         doc = nlp(df.iloc[index, 0])  # text in the first column of the current row

#         for token in doc:
#             if token.text in pronouns["first_person"]:
#                 labeled_pronouns["first_person"].append(token.text)
#             elif token.text in pronouns["second_person"]:
#                 labeled_pronouns["second_person"].append(token.text)
#             elif token.text in pronouns["third_person"]:
#                 labeled_pronouns["third_person"].append(token.text)

#     return labeled_pronouns

# [8] empathy

def label_empathy(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_empathy = {label: [] for label in lexicon.cats.keys()}

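    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # text of the current row
        doc_text = doc.text.lower()  # lowercase the text to match the lexicon entries

        for label, words in lexicon.cats.items():
            for word in words:
                # substring test: a word is recorded at most once per row,
                # and also when it only occurs inside a longer word
                if word in doc_text:
                    labeled_empathy[label].append(word)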

    return labeled_empathy

# 3. Data Processing

Using the new lexicon, we begin analysing the correlation between the characters' personality traits and their linguistic patterns. We measure the frequency and percentage of each lexicon label per character and compare the patterns across personality dimensions.

- Percentage = Frequency / Number_of_Tokens × 100
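
As a small worked example of this formula, the sketch below wraps it in a hypothetical helper, `percentage`, with the same zero-token guard the analysis functions use. The input numbers are made up for illustration.

```python
def percentage(frequency, number_of_tokens):
    """Share of tokens carrying a label, expressed in percent."""
    if number_of_tokens <= 0:
        # Guard against empty transcripts, as the analysis functions do.
        return 0.0
    return frequency / number_of_tokens * 100


# Made-up counts: 1 labelled token out of 4 tokens.
print(f"{percentage(1, 4):.2f}%")
```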

The **personality scores** from our reference study (Stening and Stening 2018) are as follows:

|   Character   | Neuroticism  | Extroversion |   Openness   |Agreeableness |Conscientiousness|
|--------------|--------------|--------------|--------------|--------------|--------------|
|Ron Weasley|3.22|4.9|4.02|3.76|3.01|
|Hermione Granger|4.22|4.65|5.12|4.07|6.22|
|Albus Dumbledore|5.52|4.36|5.52|5.07|5.73|
|Lord Voldemort|3|4.36|4.27|1.95|4.88|
|Draco Malfoy|3.15|4.23|3.86|2.15|4.16|
|Harry Potter|3.85|3.92|5.13|4.11|4.36|
|Severus Snape|4.43|2.65|4.08|2.6|5.49|
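
The characters appear here in a different order from the one used when the per-character counts are produced (Dumbledore, Harry, Hermione, Malfoy, Ron, Snape, Voldemort). Before correlating, the scores need to be aligned by name. The sketch below shows one way to do this; the dictionary layout and the helper `trait_vector` are assumptions, while the numbers are copied from the table above.

```python
TRAITS = ["Neuroticism", "Extroversion", "Openness", "Agreeableness", "Conscientiousness"]

# Scores copied from the table above, in TRAITS order.
raw_scores = {
    "Ron Weasley": [3.22, 4.9, 4.02, 3.76, 3.01],
    "Hermione Granger": [4.22, 4.65, 5.12, 4.07, 6.22],
    "Albus Dumbledore": [5.52, 4.36, 5.52, 5.07, 5.73],
    "Lord Voldemort": [3, 4.36, 4.27, 1.95, 4.88],
    "Draco Malfoy": [3.15, 4.23, 3.86, 2.15, 4.16],
    "Harry Potter": [3.85, 3.92, 5.13, 4.11, 4.36],
    "Severus Snape": [4.43, 2.65, 4.08, 2.6, 5.49],
}
personality = {name: dict(zip(TRAITS, values)) for name, values in raw_scores.items()}

# Order in which the notebook analyses the characters.
RESULT_ORDER = [
    "Albus Dumbledore", "Harry Potter", "Hermione Granger", "Draco Malfoy",
    "Ron Weasley", "Severus Snape", "Lord Voldemort",
]


def trait_vector(trait, order=RESULT_ORDER):
    """Return one trait's scores, aligned with the given character order."""
    return [personality[name][trait] for name in order]


print(trait_vector("Neuroticism"))
```

Looking scores up by name rather than by position avoids silently pairing one character's language percentages with another character's personality score.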

In [38]:
from collections import Counter

def analysis_tense(file_path):
    # label the verbs by tense
    tenses_result = label_tenses(file_path)

    # total token count
    total_words = 0

    # add up the token count of every row
    df = pd.read_csv(file_path, header=None)
    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # text in the first column of the current row
        total_words += len(doc)  # number of tokens in the current row

    # number of verbs in each tense category
    past_count = len(tenses_result['past_tense'])
    present_count = len(tenses_result['present_tense'])
    future_count = len(tenses_result['future_tense'])

    # compute the percentages
    past_percentage = (past_count / total_words) * 100 if total_words > 0 else 0
    present_percentage = (present_count / total_words) * 100 if total_words > 0 else 0
    future_percentage = (future_count / total_words) * 100 if total_words > 0 else 0

    # print the results
    print(f"Past Tense Verbs: {past_count} ({past_percentage:.2f}%)")
    print(f"Present Tense Verbs: {present_count} ({present_percentage:.2f}%)")
    print(f"Future Tense Verbs: {future_count} ({future_percentage:.2f}%)")

def analysis_numbers(file_path):
    labeled_numbers = label_numbers(file_path)
    
    # total number of numeric tokens
    total_numbers = len(labeled_numbers)
    
    # total token count (numeric and non-numeric tokens)
    df = pd.read_csv(file_path, header=None)
    total_words = sum(len(nlp(row[0])) for row in df.values)
    
    # share of numeric tokens
    number_percentage = (total_numbers / total_words) * 100 if total_words > 0 else 0

    # print the results
    print(f"Total number words: {total_numbers} ({number_percentage:.2f}%)")
    # print(f"Total words containing numbers: {total_numbers}")
    # print(f"Percentage of numbers in all tokens: {percentage:.2f}%")


# Analysis for prepositions
def analysis_prepositions(file_path):
    labeled_prepositions = label_prepositions(file_path)
    
    # total number of prepositions
    total_prepositions = len(labeled_prepositions)
    
    # total token count
    df = pd.read_csv(file_path, header=None)
    total_words = sum(len(nlp(row[0])) for row in df.values)
    
    # share of prepositions
    preposition_percentage = (total_prepositions / total_words) * 100 if total_words > 0 else 0

    # print the results
    print(f"Total prepositions: {total_prepositions} ({preposition_percentage:.2f}%)")


# Analysis for articles
def analysis_articles(file_path):
    labeled_articles = label_articles(file_path)
    
    # total number of articles
    total_articles = len(labeled_articles)
    
    # total token count
    df = pd.read_csv(file_path, header=None)
    total_words = sum(len(nlp(row[0])) for row in df.values)
    
    # share of articles
    article_percentage = (total_articles / total_words) * 100 if total_words > 0 else 0

    # print the results
    print(f"Total articles: {total_articles} ({article_percentage:.2f}%)")


# Analysis for negations
def analysis_negations(file_path):
    labeled_negations = label_negations(file_path)
    
    # total number of negation words
    total_negations = len(labeled_negations)
    
    # total token count
    df = pd.read_csv(file_path, header=None)
    total_words = sum(len(nlp(row[0])) for row in df.values)
    
    # share of negation words
    negation_percentage = (total_negations / total_words) * 100 if total_words > 0 else 0

    # print the results
    print(f"Total negations: {total_negations} ({negation_percentage:.2f}%)")

# def analysis_pronouns(file_path):
#     labeled_pronouns = label_pronouns(file_path)

#     # number of pronouns of each type
#     first_person_count = len(labeled_pronouns["first_person"])
#     second_person_count = len(labeled_pronouns["second_person"])
#     third_person_count = len(labeled_pronouns["third_person"])

#     # total token count
#     df = pd.read_csv(file_path, header=None)
#     total_words = sum(len(nlp(row[0])) for row in df.values)
    
#     # percentage of each pronoun type
#     first_person_percentage = (first_person_count / total_words) * 100 if total_words > 0 else 0
#     second_person_percentage = (second_person_count / total_words) * 100 if total_words > 0 else 0
#     third_person_percentage = (third_person_count / total_words) * 100 if total_words > 0 else 0

#     # print the results
#     print(f"First Person Pronouns: {first_person_count} ({first_person_percentage:.2f}%)")
#     print(f"Second Person Pronouns: {second_person_count} ({second_person_percentage:.2f}%)")
#     print(f"Third Person Pronouns: {third_person_count} ({third_person_percentage:.2f}%)")

def analysis_empathy(file_path):
    labeled_empathy = label_empathy(file_path)

    # number of recorded words per category
    category_counts = {label: len(words) for label, words in labeled_empathy.items()}
    total_empathy_words = sum(category_counts.values())

    # total token count
    df = pd.read_csv(file_path, header=None)
    total_words = sum(len(nlp(row[0])) for row in df.values)
    
    # print the frequency and percentage of each category
    total_percentage = (total_empathy_words / total_words) * 100 if total_words > 0 else 0
    print(f"Total empathy-related words: {total_empathy_words} ({total_percentage:.2f}% of total words)")
    for label, count in category_counts.items():
        percentage = (count / total_words) * 100 if total_words > 0 else 0
        print(f"{label}, {count} ({percentage:.2f}%)")


In [39]:
# implement analysis and output results:

def analysis_all(file_path):
    
    print("Tense Analysis:")
    analysis_tense(file_path)
    
    print("\nNumber Analysis:")
    analysis_numbers(file_path)
    
    print("\nPreposition Analysis:")
    analysis_prepositions(file_path)
    
    print("\nArticle Analysis:")
    analysis_articles(file_path)
    
    print("\nNegation Analysis:")
    analysis_negations(file_path)
    
    print("\nEmpathy Analysis:")
    analysis_empathy(file_path)


In [40]:
print("Counts per character:")

print("\nAlbus Dumbledore")
analysis_all("Tokens/Dumbledore.csv")

print("\nHarry Potter")
analysis_all("Tokens/Harry.csv")

print("\nHermione Granger")
analysis_all("Tokens/Hermione.csv")

print("\nDraco Malfoy")
analysis_all("Tokens/Malfoy.csv")

print("\nRon Weasley")
analysis_all("Tokens/Ron.csv")

print("\nSeverus Snape")
analysis_all("Tokens/Snape.csv")

print("\nLord Voldemort")
analysis_all("Tokens/Voldemort.csv")

Counts per character:

Albus Dumbledore
Tense Analysis:
Past Tense Verbs: 386 (4.04%)
Present Tense Verbs: 708 (7.42%)
Future Tense Verbs: 109 (1.14%)

Number Analysis:
Total number words: 95 (1.00%)

Preposition Analysis:
Total prepositions: 619 (6.49%)

Article Analysis:
Total articles: 575 (6.03%)

Negation Analysis:
Total negations: 123 (1.29%)

Empathy Analysis:
Total empathy-related words: 5383 (56.41% of total words)
money, 54 (0.57%)
work, 1047 (10.97%)
sleep, 48 (0.50%)
occupation, 8 (0.08%)
family, 92 (0.96%)
swearing_terms, 31 (0.32%)
leisure, 28 (0.29%)
school, 76 (0.80%)
optimism, 46 (0.48%)
home, 62 (0.65%)
sexuality, 10 (0.10%)
religion, 48 (0.50%)
body, 82 (0.86%)
eating, 121 (1.27%)
sports, 111 (1.16%)
death, 100 (1.05%)
communication, 140 (1.47%)
hearing, 118 (1.24%)
music, 149 (1.56%)
sadness, 51 (0.53%)
emotional, 47 (0.49%)
affection, 19 (0.20%)
anger, 30 (0.31%)
negative_emotion, 151 (1.58%)
friends, 113 (1.18%)
achievement, 109 (1.14%)
positive_emotion, 99 (1.04%

# Results on Language Use per Character

| Label                          | Albus Dumbledore   | Harry Potter       | Hermione Granger    | Draco Malfoy       | Ron Weasley      | Severus Snape     | Lord Voldemort    |
|--------------------------------|--------------------|--------------------|---------------------|--------------------|------------------|-------------------|-------------------|
| **Tense Analysis**             |                    |                    |                     |                    |                  |                   |                   |
| Past Tense Verbs               | 386 (4.04%)       | 891 (4.57%)       | 456 (4.37%)        | 75 (4.32%)        | 385 (3.97%)     | 105 (3.49%)      | 78 (3.60%)       |
| Present Tense Verbs            | 708 (7.42%)       | 1703 (8.74%)      | 955 (9.16%)        | 155 (8.92%)       | 885 (9.14%)     | 221 (7.35%)      | 152 (7.02%)      |
| Future Tense Verbs             | 109 (1.14%)       | 159 (0.82%)       | 84 (0.81%)         | 16 (0.92%)        | 50 (0.52%)      | 34 (1.13%)       | 31 (1.43%)       |
| **Number Analysis**            |                    |                    |                     |                    |                  |                   |                   |
| Total Number Words             | 95 (1.00%)        | 133 (0.68%)       | 61 (0.59%)         | 12 (0.69%)        | 76 (0.78%)      | 22 (0.73%)       | 14 (0.65%)       |
| **Preposition Analysis**       |                    |                    |                     |                    |                  |                   |                   |
| Total Prepositions             | 619 (6.49%)       | 900 (4.62%)       | 519 (4.98%)        | 88 (5.06%)        | 468 (4.83%)     | 219 (7.29%)      | 112 (5.17%)      |
| **Article Analysis**           |                    |                    |                     |                    |                  |                   |                   |
| Total Articles                 | 575 (6.03%)       | 772 (3.96%)       | 443 (4.25%)        | 78 (4.49%)        | 402 (4.15%)     | 175 (5.82%)      | 84 (3.88%)       |
| **Negation Analysis**          |                    |                    |                     |                    |                  |                   |                   |
| Total Negations                | 123 (1.29%)       | 377 (1.93%)       | 207 (1.99%)        | 39 (2.24%)        | 185 (1.91%)     | 38 (1.26%)       | 36 (1.66%)       |
| **Pronoun Analysis**           |                    |                    |                     |                    |                  |                   |                   |
| First Person Singular          | 259 (2.71%)      | 581 (2.98%)       | 241 (2.31%)        | 62 (3.57%)        | 249 (2.57%)      | 88 (2.93%)       | 84 (3.88%)       |
| First Person Plural            | 387 (4.06%)      | 617 (3.16%)       | 401 (3.85%)        | 64 (3.68%)        | 328 (3.39%)      | 137 (4.56%)      | 111 (5.12%)      |
| First Person                   | 646 (6.77%)      | 1198 (6.14%)      | 642 (6.16%)        | 126 (7.25%)       | 577 (5.96%)      | 225 (7.49%)      | 195 (9.00%)      |
| Second Person                  | 369 (3.87%)      | 640 (3.28%)       | 346 (3.32%)        | 91 (5.24%)        | 317 (3.27%)      | 165 (5.49%)      | 119 (5.49%)      |
| Third Person                   | 732 (7.67%)      | 1748 (8.97%)      | 791 (7.59%)        | 154 (8.86%)       | 858 (8.86%)      | 214 (7.12%)      | 173 (7.99%)      |
| **Empathy Analysis**           |                  |                   |                    |                   |                  |                  |                  |
| Total Empathy-related Words    | 5391 (56.49%)    | 10603 (54.39%)    | 5464 (52.42%)      | 1079 (62.08%)     | 5179 (53.47%)    | 1662 (55.31%)    | 1371 (63.30%)    |
| Money                          | 54 (0.57%)       | 72 (0.37%)        | 47 (0.45%)         | 3 (0.17%)         | 28 (0.29%)       | 12 (0.40%)       | 11 (0.51%)       |
| Work                           | 1047 (10.97%)    | 3105 (15.93%)     | 1469 (14.09%)      | 241 (13.87%)      | 1498 (15.47%)    | 336 (11.18%)     | 272 (12.56%)     |
| Sleep                          | 48 (0.50%)       | 80 (0.41%)        | 44 (0.42%)         | 4 (0.23%)         | 47 (0.49%)       | 16 (0.53%)       | 6 (0.28%)        |
| Occupation                     | 8 (0.08%)        | 7 (0.04%)         | 11 (0.11%)         | 0 (0.00%)         | 7 (0.07%)        | 4 (0.13%)        | 2 (0.09%)        |
| Family                         | 92 (0.96%)       | 123 (0.63%)       | 67 (0.64%)         | 25 (1.44%)        | 72 (0.74%)       | 22 (0.73%)       | 24 (1.11%)       |
| Swearing Terms                 | 31 (0.32%)       | 58 (0.30%)        | 35 (0.34%)         | 9 (0.52%)         | 62 (0.64%)       | 11 (0.37%)       | 3 (0.14%)        |
| Leisure                        | 28 (0.29%)       | 24 (0.12%)        | 23 (0.22%)         | 5 (0.29%)         | 17 (0.18%)       | 8 (0.27%)        | 3 (0.14%)        |
| School                         | 76 (0.80%)       | 44 (0.23%)        | 49 (0.47%)         | 11 (0.63%)        | 17 (0.18%)       | 21 (0.70%)       | 4 (0.18%)        |
| Optimism                       | 46 (0.48%)       | 49 (0.25%)        | 49 (0.47%)         | 9 (0.52%)         | 31 (0.32%)       | 16 (0.53%)       | 12 (0.55%)       |
| Home                           | 62 (0.65%)       | 147 (0.75%)       | 58 (0.56%)         | 22 (1.27%)        | 76 (0.78%)       | 23 (0.77%)       | 26 (1.20%)       |
| Sexuality                      | 10 (0.10%)       | 9 (0.05%)         | 14 (0.13%)         | 0 (0.00%)         | 6 (0.06%)        | 2 (0.07%)        | 4 (0.18%)        |
| Superhero                      | 8 (0.08%)        | 10 (0.05%)        | 11 (0.11%)         | 1 (0.06%)         | 4 (0.04%)        | 2 (0.07%)        | 1 (0.05%)        |
| Religion                       | 48 (0.50%)       | 79 (0.41%)        | 21 (0.20%)         | 6 (0.35%)         | 10 (0.10%)       | 12 (0.40%)       | 5 (0.23%)        |
| Body                           | 82 (0.86%)       | 120 (0.62%)       | 75 (0.72%)         | 24 (1.38%)        | 66 (0.68%)       | 31 (1.03%)       | 25 (1.15%)       |
| Eating                         | 121 (1.27%)      | 117 (0.60%)       | 73 (0.70%)         | 9 (0.52%)         | 74 (0.76%)       | 20 (0.67%)       | 25 (1.15%)       |
| Sports                         | 111 (1.16%)      | 125 (0.64%)       | 49 (0.47%)         | 11 (0.63%)        | 45 (0.46%)       | 16 (0.53%)       | 10 (0.46%)       |
| Death                          | 100 (1.05%)      | 150 (0.77%)       | 52 (0.50%)         | 13 (0.75%)        | 50 (0.52%)       | 25 (0.83%)       | 38 (1.75%)       |
| Communication                  | 140 (1.47%)      | 234 (1.20%)       | 116 (1.11%)        | 20 (1.15%)        | 80 (0.83%)       | 41 (1.36%)       | 20 (0.92%)       |
| Hearing                        | 118 (1.24%)      | 229 (1.17%)       | 127 (1.22%)        | 37 (2.13%)        | 101 (1.04%)      | 25 (0.83%)       | 22 (1.02%)       |
| Music                          | 149 (1.56%)      | 222 (1.14%)       | 121 (1.16%)        | 27 (1.55%)        | 106 (1.09%)      | 33 (1.10%)       | 25 (1.15%)       |
| Sadness                        | 51 (0.53%)       | 21 (0.11%)        | 22 (0.21%)         | 3 (0.17%)         | 17 (0.18%)       | 18 (0.60%)       | 13 (0.60%)       |
| Emotional                      | 47 (0.49%)       | 78 (0.40%)        | 50 (0.48%)         | 7 (0.40%)         | 44 (0.45%)       | 8 (0.27%)        | 8 (0.37%)        |
| Affection                      | 19 (0.20%)       | 13 (0.07%)        | 15 (0.14%)         | 2 (0.12%)         | 8 (0.08%)        | 3 (0.10%)        | 4 (0.18%)        |
| Anger                          | 30 (0.31%)       | 10 (0.05%)        | 23 (0.22%)         | 2 (0.12%)         | 10 (0.10%)       | 7 (0.23%)        | 7 (0.32%)        |
| Negative Emotion               | 151 (1.58%)      | 364 (1.87%)       | 196 (1.88%)        | 36 (2.07%)        | 185 (1.91%)      | 37 (1.23%)       | 52 (2.40%)       |
| Friends                        | 113 (1.18%)      | 158 (0.81%)       | 102 (0.98%)        | 27 (1.55%)        | 98 (1.01%)       | 42 (1.40%)       | 31 (1.43%)       |
| Achievement                    | 109 (1.14%)      | 51 (0.26%)        | 32 (0.31%)         | 6 (0.35%)         | 30 (0.31%)       | 15 (0.50%)       | 12 (0.55%)       |
| Positive Emotion               | 99 (1.04%)       | 120 (0.62%)       | 92 (0.88%)         | 22 (1.27%)        | 61 (0.63%)       | 27 (0.90%)       | 24 (1.11%)       |

# 4. Data Analysis

Now we conduct the statistical tests to examine our hypotheses. <br>

**Normal Distribution** <br>
Prior to determining the appropriate correlation test (Spearman or Pearson), we assess the normality of our data distribution. <br>

**Correlation Test** <br>
The Shapiro-Wilk results indicate that `total prepositions`, `articles`, `sports`, `hearing`, `music`, `achievement`, and `sadness` deviate from a normal distribution. We therefore apply the _Spearman_ test to these variables and the _Pearson_ test to the remaining categories. <br>

In [41]:
# import packages for statistical analysis
import pandas as pd
from scipy import stats
from scipy.stats import shapiro
from scipy.stats import pearsonr, spearmanr

In [43]:
### DISTRIBUTION TEST ON LEXICON DATA ###

# Defining the data for each category based on the percentages provided for each character
data = {
    "Past Tense Verbs": [4.04, 4.57, 4.37, 4.32, 3.97, 3.49, 3.60],
    "Present Tense Verbs": [7.42, 8.74, 9.16, 8.92, 9.14, 7.35, 7.02],
    "Future Tense Verbs": [1.14, 0.82, 0.81, 0.92, 0.52, 1.13, 1.43],
    "Total Number Words": [1.00, 0.68, 0.59, 0.69, 0.78, 0.73, 0.65],
    "Total Prepositions": [6.49, 4.62, 4.98, 5.06, 4.83, 7.29, 5.17],
    "Total Articles": [6.03, 3.96, 4.25, 4.49, 4.15, 5.82, 3.88],
    "Total Negations": [1.29, 1.93, 1.99, 2.24, 1.91, 1.26, 1.66],
    "First Person Singular": [2.71, 2.98, 2.31, 3.57, 2.57, 2.93, 3.88],
    "First Person Plural": [4.06, 3.16, 3.85, 3.68, 3.39, 4.56, 5.12],
    "First Person": [6.78, 6.14, 5.78, 7.25, 5.96, 6.89, 9.00],
    "Second Person": [3.11, 2.49, 2.57, 3.74, 2.47, 3.96, 3.83],
    "Third Person": [1.54, 1.84, 1.62, 1.21, 1.97, 1.36, 1.71],
    "Total Empathy-related Words": [56.49, 54.39, 52.42, 62.08, 53.47, 55.31, 63.30],
    "Money": [0.57, 0.37, 0.45, 0.17, 0.29, 0.40, 0.51],
    "Work": [10.97, 15.93, 14.09, 13.87, 15.47, 11.18, 12.56],
    "Sleep": [0.50, 0.41, 0.42, 0.23, 0.49, 0.53, 0.28],
    "Occupation": [0.08, 0.04, 0.11, 0.00, 0.07, 0.13, 0.09],
    "Family": [0.96, 0.63, 0.64, 1.44, 0.74, 0.73, 1.11],
    "Swearing Terms": [0.32, 0.30, 0.34, 0.52, 0.64, 0.37, 0.14],
    "Leisure": [0.29, 0.12, 0.22, 0.29, 0.18, 0.27, 0.14],
    "School": [0.80, 0.23, 0.47, 0.63, 0.18, 0.70, 0.18],
    "Optimism": [0.48, 0.25, 0.47, 0.52, 0.32, 0.53, 0.55],
    "Home": [0.65, 0.75, 0.56, 1.27, 0.78, 0.77, 1.20],
    "Sexuality": [0.10, 0.05, 0.13, 0.00, 0.06, 0.07, 0.18],
    "Superhero": [0.08, 0.05, 0.11, 0.06, 0.04, 0.07, 0.05],
    "Religion": [0.50, 0.41, 0.20, 0.35, 0.10, 0.40, 0.23],
    "Body": [0.86, 0.62, 0.72, 1.38, 0.68, 1.03, 1.15],
    "Eating": [1.27, 0.60, 0.70, 0.52, 0.76, 0.67, 1.15],
    "Sports": [1.16, 0.64, 0.47, 0.63, 0.46, 0.53, 0.46],
    "Death": [1.05, 0.77, 0.50, 0.75, 0.52, 0.83, 1.75],
    "Communication": [1.47, 1.20, 1.11, 1.15, 0.83, 1.36, 0.92],
    "Hearing": [1.24, 1.17, 1.22, 2.13, 1.04, 0.83, 1.02],
    "Music": [1.56, 1.14, 1.16, 1.55, 1.09, 1.10, 1.15],
    "Sadness": [0.53, 0.11, 0.21, 0.17, 0.18, 0.60, 0.60],
    "Emotional": [0.49, 0.40, 0.48, 0.40, 0.45, 0.27, 0.37],
    "Affection": [0.20, 0.07, 0.14, 0.12, 0.08, 0.10, 0.18],
    "Anger": [0.31, 0.05, 0.22, 0.12, 0.10, 0.23, 0.32],
    "Negative Emotion": [1.58, 1.87, 1.88, 2.07, 1.91, 1.23, 2.40],
    "Friends": [1.18, 0.81, 0.98, 1.55, 1.01, 1.40, 1.43],
    "Achievement": [1.14, 0.26, 0.31, 0.35, 0.31, 0.50, 0.55],
    "Positive Emotion": [1.04, 0.62, 0.88, 1.27, 0.63, 0.90, 1.11],
}


# Perform Shapiro-Wilk test for each dataset
for label, values in data.items():
    stat, p_value = shapiro(values)
    print(f"{label}: p-value = {p_value:.5f}")
    if p_value > 0.05:
        print("  The data is normally distributed (p > 0.05)\n")
    else:
        print("  The data is NOT normally distributed (p <= 0.05)\n")


Past Tense Verbs: p-value = 0.66796
  The data is normally distributed (p > 0.05)

Present Tense Verbs: p-value = 0.06280
  The data is normally distributed (p > 0.05)

Future Tense Verbs: p-value = 0.92010
  The data is normally distributed (p > 0.05)

Total Number Words: p-value = 0.16491
  The data is normally distributed (p > 0.05)

Total Prepositions: p-value = 0.04550
  The data is NOT normally distributed (p <= 0.05)

Total Articles: p-value = 0.03818
  The data is NOT normally distributed (p <= 0.05)

Total Negations: p-value = 0.38827
  The data is normally distributed (p > 0.05)

First Person Singular: p-value = 0.65063
  The data is normally distributed (p > 0.05)

First Person Plural: p-value = 0.80727
  The data is normally distributed (p > 0.05)

First Person: p-value = 0.17471
  The data is normally distributed (p > 0.05)

Second Person: p-value = 0.08490
  The data is normally distributed (p > 0.05)

Third Person: p-value = 0.97860
  The data is normally distributed (p > 0.05)

In [44]:
### DISTRIBUTION TEST ON PERSONALITY DATA ###

# Scores for each Big Five personality trait (one value per character)
neuroticism = [3.22, 4.22, 5.52, 3.0, 3.15, 3.85, 4.43]
extroversion = [4.9, 4.65, 4.36, 4.36, 4.23, 3.92, 2.65]
openness = [4.02, 5.12, 5.52, 4.27, 3.86, 5.13, 4.08]
agreeableness = [3.76, 4.07, 5.07, 1.95, 2.15, 4.11, 2.6]
conscientiousness = [3.01, 6.22, 5.73, 4.88, 4.16, 4.36, 5.49]

# Store each dimension in a dictionary so the tests can be run in a batch
big_five_data = {
    "Neuroticism": neuroticism,
    "Extroversion": extroversion,
    "Openness": openness,
    "Agreeableness": agreeableness,
    "Conscientiousness": conscientiousness
}

# Run the Shapiro-Wilk test for each personality trait and print the p-value and the verdict
for trait, scores in big_five_data.items():
    stat, p_value = shapiro(scores)
    print(f"{trait}: p-value = {p_value:.3f}")
    if p_value > 0.05:
        print("  The data is normally distributed (p > 0.05)\n")
    else:
        print("  The data is NOT normally distributed (p <= 0.05)\n")


Neuroticism: p-value = 0.394
  The data is normally distributed (p > 0.05)

Extroversion: p-value = 0.098
  The data is normally distributed (p > 0.05)

Openness: p-value = 0.176
  The data is normally distributed (p > 0.05)

Agreeableness: p-value = 0.485
  The data is normally distributed (p > 0.05)

Conscientiousness: p-value = 0.896
  The data is normally distributed (p > 0.05)



In [45]:
### CORRELATION TEST ###

# Language feature data
data = {
    "Past Tense Verbs": [4.04, 4.57, 4.37, 4.32, 3.97, 3.49, 3.60],
    "Present Tense Verbs": [7.42, 8.74, 9.16, 8.92, 9.14, 7.35, 7.02],
    "Future Tense Verbs": [1.14, 0.82, 0.81, 0.92, 0.52, 1.13, 1.43],
    "Total Number Words": [1.00, 0.68, 0.59, 0.69, 0.78, 0.73, 0.65],
    "Total Prepositions": [6.49, 4.62, 4.98, 5.06, 4.83, 7.29, 5.17], # NOT NORMAL
    "Total Articles": [6.03, 3.96, 4.25, 4.49, 4.15, 5.82, 3.88],     # NOT NORMAL
    "Total Negations": [1.29, 1.93, 1.99, 2.24, 1.91, 1.26, 1.66],
    "First Person Singular": [2.71, 2.98, 2.31, 3.57, 2.57, 2.93, 3.88],
    "First Person Plural": [4.06, 3.16, 3.85, 3.68, 3.39, 4.56, 5.12],
    "First Person": [6.78, 6.14, 5.78, 7.25, 5.96, 6.89, 9.00],
    "Second Person": [3.11, 2.49, 2.57, 3.74, 2.47, 3.96, 3.83],
    "Third Person": [1.54, 1.84, 1.62, 1.21, 1.97, 1.36, 1.71],
    "Total Empathy-related Words": [56.49, 54.39, 52.42, 62.08, 53.47, 55.31, 63.30],
    "Money": [0.57, 0.37, 0.45, 0.17, 0.29, 0.40, 0.51],
    "Work": [10.97, 15.93, 14.09, 13.87, 15.47, 11.18, 12.56],
    "Sleep": [0.50, 0.41, 0.42, 0.23, 0.49, 0.53, 0.28],
    "Occupation": [0.08, 0.04, 0.11, 0.00, 0.07, 0.13, 0.09],
    "Family": [0.96, 0.63, 0.64, 1.44, 0.74, 0.73, 1.11],
    "Swearing Terms": [0.32, 0.30, 0.34, 0.52, 0.64, 0.37, 0.14],
    "Leisure": [0.29, 0.12, 0.22, 0.29, 0.18, 0.27, 0.14],
    "School": [0.80, 0.23, 0.47, 0.63, 0.18, 0.70, 0.18],
    "Optimism": [0.48, 0.25, 0.47, 0.52, 0.32, 0.53, 0.55],
    "Home": [0.65, 0.75, 0.56, 1.27, 0.78, 0.77, 1.20],
    "Sexuality": [0.10, 0.05, 0.13, 0.00, 0.06, 0.07, 0.18],
    "Superhero": [0.08, 0.05, 0.11, 0.06, 0.04, 0.07, 0.05],
    "Religion": [0.50, 0.41, 0.20, 0.35, 0.10, 0.40, 0.23],
    "Body": [0.86, 0.62, 0.72, 1.38, 0.68, 1.03, 1.15],
    "Eating": [1.27, 0.60, 0.70, 0.52, 0.76, 0.67, 1.15],
    "Sports": [1.16, 0.64, 0.47, 0.63, 0.46, 0.53, 0.46],          # NOT NORMAL
    "Death": [1.05, 0.77, 0.50, 0.75, 0.52, 0.83, 1.75],
    "Communication": [1.47, 1.20, 1.11, 1.15, 0.83, 1.36, 0.92],
    "Hearing": [1.24, 1.17, 1.22, 2.13, 1.04, 0.83, 1.02],         # NOT NORMAL
    "Music": [1.56, 1.14, 1.16, 1.55, 1.09, 1.10, 1.15],           # NOT NORMAL
    "Sadness": [0.53, 0.11, 0.21, 0.17, 0.18, 0.60, 0.60],         # NOT NORMAL
    "Emotional": [0.49, 0.40, 0.48, 0.40, 0.45, 0.27, 0.37],
    "Affection": [0.20, 0.07, 0.14, 0.12, 0.08, 0.10, 0.18],
    "Anger": [0.31, 0.05, 0.22, 0.12, 0.10, 0.23, 0.32],
    "Negative Emotion": [1.58, 1.87, 1.88, 2.07, 1.91, 1.23, 2.40],
    "Friends": [1.18, 0.81, 0.98, 1.55, 1.01, 1.40, 1.43],
    "Achievement": [1.14, 0.26, 0.31, 0.35, 0.31, 0.50, 0.55],     # NOT NORMAL
    "Positive Emotion": [1.04, 0.62, 0.88, 1.27, 0.63, 0.90, 1.11],
}


# Big Five personality scores
big_five_data = {
    "Neuroticism": [3.22, 4.22, 5.52, 3.0, 3.15, 3.85, 4.43],
    "Extroversion": [4.9, 4.65, 4.36, 4.36, 4.23, 3.92, 2.65],
    "Openness": [4.02, 5.12, 5.52, 4.27, 3.86, 5.13, 4.08],
    "Agreeableness": [3.76, 4.07, 5.07, 1.95, 2.15, 4.11, 2.6],
    "Conscientiousness": [3.01, 6.22, 5.73, 4.88, 4.16, 4.36, 5.49],
}

# Correlation results storage
correlation_results = {}

# Perform correlation tests
for trait, scores in big_five_data.items():
    correlation_results[trait] = {}
    for label, values in data.items():
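        # Use Spearman for non-normal distributions, otherwise use Pearson
        # (labels must match the keys of `data` exactly, including capitalization)
        if label in ["Total Prepositions", "Total Articles", "Sports", "Hearing",
                     "Music", "Sadness", "Achievement"]: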
            corr, p_value = spearmanr(scores, values)
            test_type = "Spearman"
        else:
            corr, p_value = pearsonr(scores, values)
            test_type = "Pearson"
        
        # Store results in dictionary
        correlation_results[trait][label] = (test_type, corr, p_value)

# Print correlation results
for trait, results in correlation_results.items():
    print(f"\n{trait} Correlations:")
    for label, (test_type, corr, p_value) in results.items():
        print(f"  {label}: {test_type} correlation = {corr:.3f}, p-value = {p_value:.5f}")



Neuroticism Correlations:
  Past Tense Verbs: Pearson correlation = 0.138, p-value = 0.76863
  Present Tense Verbs: Pearson correlation = 0.063, p-value = 0.89387
  Future Tense Verbs: Pearson correlation = 0.123, p-value = 0.79203
  Total Number Words: Pearson correlation = -0.658, p-value = 0.10805
  Total Prepositions: Pearson correlation = -0.216, p-value = 0.64186
  Total Articles: Pearson correlation = -0.365, p-value = 0.42100
  Total Negations: Pearson correlation = 0.088, p-value = 0.85142
  First Person Singular: Pearson correlation = -0.216, p-value = 0.64155
  First Person Plural: Pearson correlation = 0.208, p-value = 0.65493
  First Person: Pearson correlation = -0.079, p-value = 0.86669
  Second Person: Pearson correlation = -0.228, p-value = 0.62222
  Third Person: Pearson correlation = 0.206, p-value = 0.65838
  Total Empathy-related Words: Pearson correlation = -0.279, p-value = 0.54507
  Money: Pearson correlation = 0.418, p-value = 0.35046
  Work: Pearson correlati

# Statistical Results on Correlation Between Language Use and Personality

After running the statistical tests, most correlations did not reach significance (p > 0.05). This may be attributed to the limited sample size of seven characters, compared with the larger datasets in the reference studies. Instead of relying on these statistical results alone, we manually analyze the relationships between characters and their language use in the Discussion section below.
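
To make the effect of the sample size concrete, the sketch below computes how large a Pearson coefficient must be to reach significance with seven data points. It assumes a two-tailed test at the 0.05 level; the variable names are illustrative and are not taken from the notebook.

```python
from math import sqrt

from scipy.stats import t

n = 7          # number of characters in this study
alpha = 0.05   # assumed two-tailed significance level
df = n - 2     # degrees of freedom for a Pearson correlation

# Critical t value, converted to the equivalent |r| threshold
t_crit = t.ppf(1 - alpha / 2, df)
r_crit = t_crit / sqrt(t_crit**2 + df)

print(f"Critical |r| for n = {n}: {r_crit:.3f}")
```

Under these assumptions, a coefficient needs an absolute value of roughly 0.75 before it reaches p < 0.05, so only very strong correlations can be detected with seven characters.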

**Significant Correlations** <br>

- `Number` words are negatively correlated with conscientiousness, consistent with previous findings and our hypothesis.
- Unexpectedly, `articles` and `achievement` words show a negative association with conscientiousness, contradicting our hypothesis.
- Contrary to expectations, `first person singular and plural pronouns` do not correlate positively with extraversion; the plural and combined first-person categories correlate negatively (r = -0.782 and r = -0.800).
- The `family` and `home` lexicons align with prior results, showing a negative association, supporting our hypothesis.


|                           | Neuroticism                | Extroversion                | Openness                   | Agreeableness              | Conscientiousness          |
|---------------------------|----------------------------|-----------------------------|----------------------------|----------------------------|----------------------------|
|**Tense Verbs**|||||
| Past Tense Verbs          | 0.138                      | 0.666                       | 0.298                      | 0.196                      | 0.403                      |
| Present Tense Verbs       | 0.063                      | 0.503                       | 0.228                      | -0.029                     | 0.334                      |
| Future Tense Verbs        | 0.123                      | -0.584                      | -0.110                     | 0.036                      | -0.070                     |
|**Numbers**|||||
| Total Number Words        | -0.658                     | 0.453                       | -0.554                     | -0.119                     | **-0.892**                     |
|**Prepositions**|||||
| Total Prepositions         | -0.216                     | 0.030                       | 0.063                      | 0.273                      | -0.630                     |
|**Articles**|||||
| Total Articles            | -0.365                     | 0.362                       | -0.023                     | 0.267                      | **-0.780**                     |
|**Negations**|||||
| Total Negations           | 0.088                      | 0.098                       | 0.061                      | -0.322                     | 0.576                      |
|**Pronouns**|||||
| First Person Singular     | -0.216                     | -0.653                      | -0.380                     | -0.595                     | 0.222                      |
| First Person Plural       | 0.208                      | **-0.782**                  | -0.127                     | -0.014                     | -0.122                     |
| First Person              | -0.079                     | **-0.800**                  | -0.431                     | -0.449                     | 0.027                      |
| Second Person             | -0.228                     | -0.550                      | -0.174                     | -0.298                     | -0.185                     |
| Third Person              | 0.206                      | -0.072                      | -0.142                     | 0.010                      | 0.218                      |
|**Empathy**|||||
| Total Empathy-related Words| -0.279                    | -0.585                      | -0.504                     | -0.631                     | 0.027                      |
| Money                     | 0.418                      | -0.167                      | 0.057                      | 0.537                      | -0.193                     |
| Work                      | 0.120                      | 0.189                       | 0.121                      | -0.161                     | 0.600                      |
| Sleep                     | -0.029                     | 0.410                       | 0.193                      | 0.499                      | -0.480                     |
| Occupation                | 0.505                      | -0.302                      | 0.350                      | 0.586                      | -0.149                     |
| Family                    | -0.494                     | -0.244                      | -0.567                     | **-0.702**                 | -0.187                     |
| Swearing Terms            | -0.571                     | 0.451                       | -0.254                     | -0.429                     | -0.339                     |
| Leisure                   | -0.411                     | 0.394                       | -0.067                     | 0.012                      | -0.672                     |
| School                    | -0.277                     | 0.453                       | 0.110                      | 0.263                      | -0.600                     |
| Optimism                  | 0.051                      | -0.470                      | -0.088                     | -0.065                     | -0.259                     |
| Home                      | -0.342                     | -0.574                      | -0.473                     | **-0.777**                     | 0.168                      |
| Sexuality                 | 0.627                      | -0.606                      | 0.015                      | 0.305                      | 0.122                      |
| Religion                  | -0.266                     | 0.422                       | 0.130                      | 0.289                      | -0.281                     |
| Body                      | -0.330                     | -0.432                      | -0.306                     | -0.526                     | -0.111                     |
| Eating                    | -0.047                     | -0.260                      | -0.522                     | -0.024                     | -0.491                     |
| Sports                    | -0.426                     | 0.578                       | -0.291                     | 0.118                      | -0.639                     |
| Death                     | 0.037                      | -0.725                      | -0.396                     | -0.248                     | 0.010                      |
| Communication             | -0.130                     | 0.533                       | 0.301                      | 0.515                      | -0.395                     |
| Hearing                   | -0.355                     | 0.302                       | -0.182                     | -0.421                     | 0.040                      |
| Music                     | -0.519                     | 0.427                       | -0.391                     | -0.254                     | -0.488                     |
| Sadness                   | -0.013                     | -0.511                      | -0.194                     | 0.078                      | -0.436                     |
| Emotional                 | 0.051                      | 0.491                       | -0.223                     | 0.068                      | -0.170                     |
| Affection                 | 0.091                      | -0.229                      | -0.330                     | 0.050                      | -0.377                     |
| Anger                     | 0.211                      | -0.433                      | -0.169                     | 0.188                      | -0.381                     |
| Negative Emotion          | 0.160                      | -0.506                      | -0.366                     | -0.507                     | 0.485                      |
| Friends                   | -0.357                     | -0.489                      | -0.350                     | -0.493                     | -0.278                     |
| Achievement               | -0.312                     | 0.148                       | -0.416                     | 0.077                      | **-0.755**                     |
| Positive Emotion          | -0.183                     | -0.282                      | -0.283                     | -0.317                     | -0.205                     |

# Discussion

In examining the language use patterns by character, our goal was to validate the hypothesized relationships between linguistic markers and the Big Five personality traits reported in Yarkoni (2010). Despite some observed trends, most correlations were not statistically significant, likely due to the limited sample size of only seven characters.

For the manual inspection, we look back at the **Results on Language Use per Character**. <br>

#### Neuroticism
Albus Dumbledore scored the highest in Neuroticism, which suggested that he would use more first-person singular pronouns, negations, and language expressing negative emotions. However, the data revealed inconsistencies. Lord Voldemort led in first-person singular pronouns, contrary to expectations. Similarly, negations, which should correlate positively with Neuroticism, differed little between Dumbledore (1.29%) and Severus Snape (1.26%), whose negation use was lower than anticipated.

#### Extraversion
Ron Weasley, with the highest Extraversion score (4.9), was expected to use more social and affective language, including plural first-person pronouns, second-person pronouns, and positive-emotion words. Yet these associations did not hold. Notably, Snape, the lowest in Extraversion (2.65), used second-person pronouns most frequently (5.49%), while Ron's use was lowest at 3.27%, a reversal of the hypothesized pattern. In addition, lexical categories like “family” and “communication,” thought to align with extraversion, did not show consistent results.

#### Openness
Dumbledore’s high score in Openness (5.52) suggested a tendency toward language expressing abstract or intellectual engagement, such as increased present-tense verbs, prepositions, and positive emotion words. Yet neither Dumbledore nor Ron, among the lowest scorers in Openness (4.02), showed language patterns that strongly supported this trend. Prepositions and positive emotion words did not align meaningfully with Openness, and only a minimal correlation was observed for present-tense verbs (r = 0.228).

#### Agreeableness
Harry Potter, with a high Agreeableness score (4.11), was hypothesized to use more positive and social language (e.g., family, friends, and past-tense verbs). However, correlations for these categories remained inconsistent. For instance, negative emotion words, expected to be lower among highly agreeable characters, showed mixed results, with Voldemort (1.95 Agreeableness) showing unexpectedly low negative emotion use compared to others.

#### Conscientiousness
Hermione Granger, with the highest Conscientiousness score (6.22), was expected to avoid negations and negative emotions while using more language related to achievement and work. Although a negative association was found between conscientiousness and numbers (r = -0.892), supporting prior research, achievement words were negatively rather than positively associated with conscientiousness (r = -0.755), and negations were inconsistently distributed across characters.

### Conclusion
Based on the current dataset, we were unable to confirm the hypothesized correlations between the characters’ personality dimensions and their linguistic patterns. This may largely be due to the small sample size, with only seven characters analyzed, which limits statistical power and makes individual differences more influential.
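
As a rough illustration of how little seven observations constrain a correlation, the sketch below computes an approximate 95% confidence interval for the strongest coefficient in the table (conscientiousness and number words, r = -0.892) using the Fisher z-transformation. The approximation is coarse for samples this small, and the variable names are assumptions introduced for this example.

```python
from math import atanh, sqrt, tanh

from scipy.stats import norm

r = -0.892   # conscientiousness vs. number words, from the correlation table
n = 7        # number of characters

z = atanh(r)               # Fisher z-transform of r
se = 1 / sqrt(n - 3)       # approximate standard error of z
z_crit = norm.ppf(0.975)   # two-sided 95% normal quantile

# Transform the interval bounds back to the correlation scale
lower, upper = tanh(z - z_crit * se), tanh(z + z_crit * se)
print(f"r = {r:.3f}, approximate 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Under this approximation the interval runs from about -0.98 to about -0.42, so even the strongest observed correlation is compatible with a much weaker underlying relationship.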


# References
- Basto, C. (2021). Extending the abstraction of personality types based on MBTI with machine learning and natural language processing. _arXiv preprint arXiv:2105.11798._
- Ginting, S. A. (2018). Syntactic complexity on extraverted and introverted Indonesian language learners' written products. _International Journal of Education and Literacy Studies, 6_(4), 101-106.
- John, O. P., Donahue, E. M., \& Kentle, R. L. (1991). Big five inventory. _Journal of personality and social psychology_.
- Pennebaker, J. W., Francis, M. E., \& Booth, R. J. (2001). _Linguistic Inquiry and Word Count (LIWC): LIWC 2001._ Mahwah, NJ: Erlbaum.
- Tausczik, Y. R., \& Pennebaker, J. W. (2010). The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. _Journal of Language and Social Psychology, 29_(1), 24-54.
- Yarkoni, T. (2010). Personality in 100,000 words: A large-scale analysis of personality and word use among bloggers. _Journal of research in personality, 44_(3), 363-373.

_© Yifei Chen, Chi Kuan Lai. 2024_