# The Idea in General

We have a preconceived notion of what certain families or communities of characters are like, in terms of their temperament and the sentiment they project. We now want to test whether that notion holds up against a more robust and "objective" analysis of each community's sentiment. We will build a network in which two characters are linked when they appear in the same scene, and give each node a color attribute that shows whether the character is more negative or more positive, based on their dialogue throughout the series.  
The network should then show clusters of communities, and the coloring within each cluster indicates that community's overall sentiment. A non-visualized attribute will be the birthplace of each character/node; maybe there is simply a place that breeds negativity?  
In addition, we will present the 5 biggest families and their sentiments, so that we can hold them up against our prior beliefs.  
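
One way the color attribute could be derived from a sentiment score is sketched below. The function name `sentiment_to_hex` and the red/grey/green palette are illustrative assumptions, not something fixed by the project; the only thing taken from the idea above is that each character ends up with a score between -1 (negative) and 1 (positive).

```python
def sentiment_to_hex(score: float) -> str:
    """Map a sentiment score in [-1, 1] to a hex color (red -> grey -> green)."""
    score = max(-1.0, min(1.0, score))  # clamp scores that fall outside the range
    neutral = (200, 200, 200)           # light grey for a neutral character
    # Hypothetical palette: green for positive, red for negative.
    target = (0, 160, 0) if score >= 0 else (200, 0, 0)
    weight = abs(score)                 # 0 = fully neutral, 1 = fully saturated
    rgb = [round(n + (t - n) * weight) for n, t in zip(neutral, target)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


print(sentiment_to_hex(0.0))   # neutral grey
print(sentiment_to_hex(1.0))   # fully positive
print(sentiment_to_hex(-1.0))  # fully negative
```

A linear blend like this keeps near-neutral characters visually muted, so only clearly positive or negative characters stand out in the network.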

We also want to analyse which characters change their sentiment the most throughout the series.

- Other text analysis? Most used words by each family/community?

### <span style='color: #fc03d7'> Steps to take </span>


1. <span style='color: #f2d052'> Import NLP libraries </span>
2. <span style='color: #f2d052'> Analyze sentiment on each piece of dialogue and record it into a dataframe with the speaker </span>
3. <span style='color: #f2d052'> GroupBy speaker and average the sentiment scores of each speaker into a single number (in a new dataframe probably) </span>
4. <span style='color: #f2d052'> Make it a little dataset for Kristi's network </span>
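
Steps 3 and 4 can be sketched in plain Python to make the logic explicit. The `scored_lines` data and the function name `mean_sentiment_by_speaker` are made up for illustration; the notebook itself does the same aggregation with a pandas `groupby`.

```python
import json
from collections import defaultdict
from statistics import fmean

# Hypothetical (speaker, sentiment score) pairs, one per piece of dialogue.
scored_lines = [
    ("jon snow", 0.5),
    ("jon snow", -0.1),
    ("cersei lannister", -0.8),
]


def mean_sentiment_by_speaker(rows):
    """Average all scores of each speaker into a single number."""
    scores = defaultdict(list)
    for speaker, score in rows:
        scores[speaker].append(score)
    return {speaker: fmean(values) for speaker, values in scores.items()}


# A small name -> score mapping that can be attached to the network nodes.
node_attributes = mean_sentiment_by_speaker(scored_lines)
print(json.dumps(node_attributes, indent=2))
```

Serializing the result as JSON is only one option; a two-column CSV would work equally well as the small dataset for the network.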

---

In [2]:
import json
import pandas as pd
import numpy as np
import netwulf as nw
import matplotlib.pyplot as plt
import networkx as nx
import random
import re
from collections import Counter

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
nltk.download('stopwords')

pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nicol\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nicol\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [6]:
# Perform sentiment analysis on dialogue
script = pd.read_csv('./data/Game_of_Thrones_Script_corrected_manual.csv')
script = script[['Name', 'Sentence']]
script

FileNotFoundError: [Errno 2] No such file or directory: './data/Game_of_Thrones_Script_corrected_manually.csv'

In [8]:
# preprocess the sentences for use in sentiment analysis on each sentence

puncs = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~' # ' is removed

# build the stopword set once; calling stopwords.words('english') for every word is very slow
stop_words = set(stopwords.words('english'))

def preprocess(sentence):
    sentence = sentence.lower()
    sentence = re.sub(f"[{re.escape(puncs)}]+", '', sentence) # remove punctuation, except apostrophe
    sentence = re.sub(r'\d+', '', sentence) # remove numbers
    # remove stopwords from sentence
    sentence = ' '.join([word for word in sentence.split() if word not in stop_words])
    sentence = sentence.strip()
    return sentence
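
To see what the preprocessing does to a single line, here is a self-contained sketch of the same steps. It swaps NLTK's English stopword list for a tiny made-up set (`DEMO_STOPWORDS`) so that it runs without any downloads; the function name `preprocess_demo` and the example sentence are likewise only illustrative.

```python
import re

PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'  # apostrophe left out on purpose
DEMO_STOPWORDS = {"is", "my", "the", "a"}  # hypothetical stand-in for NLTK's list


def preprocess_demo(sentence, stop_words=DEMO_STOPWORDS):
    sentence = sentence.lower()
    # Drop punctuation (except apostrophes), then digits.
    sentence = re.sub(f"[{re.escape(PUNCTUATION)}]+", "", sentence)
    sentence = re.sub(r"\d+", "", sentence)
    # split() also collapses the extra whitespace left behind by the removals.
    return " ".join(word for word in sentence.split() if word not in stop_words)


print(preprocess_demo("Winter is coming, my lord! 3 times."))  # winter coming lord times
print(preprocess_demo("Don't go"))                             # don't go
```

Keeping the apostrophe means contractions such as "don't" survive as one token instead of being split into fragments.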


In [9]:
"""
#! DONT TOUCH THIS CELL
# 1000 most common words in the sentences
sent_filtered = script.Sentence.str.cat(sep=' ')

sent_filtered = preprocess(sent_filtered)

sent_filtered = sent_filtered.split(' ')
print(sent_filtered[:10])

Common1000Words = Counter(sent_filtered).most_common(1000)
Common1000Words = [word for word, count in Common1000Words]
# Common1000Words = [word.capitalize() for word in Common1000Words]

print(f"20 most common words {Common1000Words[:20]}")
"""

KeyboardInterrupt: 

In [29]:
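# remove stopwords
stop_words = set(stopwords.words('english'))
# sent_filtered must be a list of words here; if it is still one long string,
# the comprehension iterates over single characters (as in the output below)
sent_filtered = [word for word in sent_filtered if word not in stop_words]
print(f"20 most common words after removing stopwords {Counter(sent_filtered).most_common(20)}")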

20 most common words after removing stopwords [(' ', 288033), ('e', 139029), ('n', 73010), ('r', 69819), ('h', 65183), ('l', 46757), ('u', 38845), ('.', 34467), ('w', 26659), ('g', 23334), ('f', 21063), ('c', 18296), ('b', 14566), ("'", 14435), ('k', 13019), ('I', 12783), ('p', 12289), (',', 11614), ('v', 11589), ('?', 6861)]
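
The output above lists single characters rather than words, which is what `Counter` produces when it is handed a plain string instead of a list of tokens. The toy text below is made up, but it shows the difference in isolation:

```python
from collections import Counter

text = "the north remembers the north"

char_counts = Counter(text)          # iterating over a str yields characters
word_counts = Counter(text.split())  # iterating over a list yields words

print(char_counts.most_common(1))  # [('e', 5)]
print(word_counts["the"], word_counts["north"], word_counts["remembers"])  # 2 2 1
```

Splitting the concatenated dialogue into words before counting should therefore give the intended most-common-words list.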


### <span style='color: pink'> Huggingface model use </span>
We will borrow a pretrained sentiment model from Hugging Face that performs well but is optimized to run on fewer resources.

In [10]:
# torch and transformers also need to be installed first
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
# but GPT-4 has already given a good answer, so I'll just write that down xD
# can you install torch and transformers from the console? And afterwards update requirements.txt with "pip freeze > requirements.txt" :)



In [11]:
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = DistilBertTokenizer.from_pretrained(checkpoint) # load the tokenizer from the same checkpoint as the model
model = DistilBertForSequenceClassification.from_pretrained(checkpoint)
model.eval() # inference only

# SST is the Stanford Sentiment Treebank; this light version of BERT has been fine-tuned on that dataset.

def analyze_continuous_sentiment(text):
    """
    Takes raw text as input and returns a sentiment score between -1 and 1.
    """
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad(): # no gradients are needed for inference
        outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.softmax(logits, dim=1)

    # Calculate the sentiment score by subtracting the negative probability (index 0) from the positive probability (index 1)
    sentiment_score = probabilities[0, 1] - probabilities[0, 0]

    # Convert the score to float and return it
    return sentiment_score.item()
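
The arithmetic behind the score can be checked without loading the model. The sketch below uses made-up logits and assumes, as the function above does, that index 0 is the negative class and index 1 the positive class. For two classes, the difference of the softmax probabilities equals `tanh` of half the logit gap, which is why the score always lies strictly between -1 and 1.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def continuous_sentiment(neg_logit, pos_logit):
    """P(positive) - P(negative) for a two-class classifier."""
    p_neg, p_pos = softmax([neg_logit, pos_logit])
    return p_pos - p_neg


# Hypothetical logits, not taken from the real model.
score = continuous_sentiment(-1.0, 2.0)
print(score)
print(math.tanh((2.0 - (-1.0)) / 2))  # same value, up to floating-point rounding
```

Because of the `tanh` shape, confident predictions saturate close to -1 or 1, so averaged scores per character are mostly driven by how often the model is confident in either direction.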



In [12]:
# Analyze sentiment of each sentence in a new column named "sentiment_score"
script['sentiment_score'] = script.Sentence.iloc[0:100].apply(analyze_continuous_sentiment)

In [14]:
# for each character, calculate the average sentiment score for all their sentences
person_sentiment_scores = script.iloc[0:100].groupby('Name').sentiment_score.mean().reset_index()
person_sentiment_scores

|   | Name | sentiment_score |
|---|---|---|
| 0 | bran stark | -0.476492 |
| 1 | cassel | -0.299343 |
| 2 | catelyn stark | -0.10107 |
| 3 | cersei lannister | 0.497387 |
| 4 | eddard stark | 0.061471 |
| 5 | gared | 0.499138 |
| 6 | jaime lannister | -0.561357 |
| 7 | jon snow | 0.336135 |
| 8 | luwin | -0.980625 |
| 9 | maester luwin | -0.061113 |


## Corrections to the script dataset
Line 25: changed the name from jonrobb to jon snow.

Rename the character "nan" to something that does not get read as a "NaN" value.

Nicolaj will go through the character list, and we will purge double names and the like.
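
Once the character list has been reviewed, merging name variants could look roughly like the sketch below. The alias table is hypothetical: whether "luwin" and "maester luwin", or "sam" and "sam tarly", really are the same character is exactly what the manual review has to confirm.

```python
# Hypothetical alias table: variant spelling -> canonical name.
NAME_ALIASES = {
    "luwin": "maester luwin",
    "sam": "sam tarly",
}


def canonical_name(raw_name, aliases=NAME_ALIASES):
    """Lower-case, collapse whitespace, then map known variants to one name."""
    cleaned = " ".join(raw_name.lower().split())
    return aliases.get(cleaned, cleaned)


speakers = ["Luwin", "maester luwin", "  Sam ", "jon snow"]
print(sorted({canonical_name(s) for s in speakers}))
# ['jon snow', 'maester luwin', 'sam tarly']
```

Keeping the mapping in one explicit dictionary makes every merge decision visible and easy to undo.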

In [60]:
single_names = [name for name in script.Name.dropna().unique() if len(name.split(' ')) == 1]

double_names = [name for name in script.Name.dropna().unique() if len(name.split(' ')) == 2]

triple_names = [name for name in script.Name.dropna().unique() if len(name.split(' ')) == 3]

quadruple_names = [name for name in script.Name.dropna().unique() if len(name.split(' ')) == 4]

# [print(name) for name in script.Name.unique() if type(name) == float]

In [72]:
duplicate_names = pd.DataFrame(columns=['single_name', 'count', 'in_double_name'])
row = 0
for name in single_names:
    for d_name in double_names:
        # note: `in` is a substring test on the first word, so 'sam' would also match a first name such as 'samwell'
        if name in d_name.split(' ')[0]:
            count = len(script.Name[script.Name == name])
            duplicate_names.loc[row] = [name, count, d_name]
            row += 1

In [114]:
# make a list of nouns that are not names and remove them from the dialogue later on
# [word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(' '.join([name.capitalize() for name in script.Name.dropna().unique()]))) if pos != 'NN']
# nltk.pos_tag(nltk.word_tokenize(' '.join([name for name in script.Name.dropna().unique()])))

In [80]:
duplicate_names.sort_values(by='count', ascending=False).iloc[0:25]

|   | single_name | count | in_double_name |
|---|---|---|---|
| 18 | sam | 399 | sam tarly |
| 47 | daario | 166 | daario naharis |
| 87 | sandor | 129 | sandor clegane |
| 46 | beric | 92 | beric dondarrion |
| 3 | guard | 77 | guard captain |
| 37 | roose | 77 | roose bolton |
| 25 | loras | 75 | loras tyrell |
| 44 | barristan | 67 | barristan selmy |
| 29 | lancel | 67 | lancel lannister |
| 48 | walder | 55 | walder frey |
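
A stricter variant of this lookup compares the single name to the first word of each two-word name with exact equality, which avoids partial hits. The sketch below is one possible way to do that; the function name and the toy names (including "samwell tarly") are assumptions for illustration only.

```python
def first_name_matches(single_names, double_names):
    """Pair each single name with the two-word names that start with exactly that word."""
    by_first_word = {}
    for full_name in double_names:
        by_first_word.setdefault(full_name.split(" ")[0], []).append(full_name)
    return [
        (name, full_name)
        for name in single_names
        for full_name in by_first_word.get(name, [])
    ]


singles = ["sam", "guard", "ro"]
doubles = ["sam tarly", "samwell tarly", "guard captain", "roose bolton"]
print(first_name_matches(singles, doubles))
# [('sam', 'sam tarly'), ('guard', 'guard captain')]
```

With a substring test, "sam" would also be paired with "samwell tarly" and "ro" with "roose bolton". The dictionary lookup also replaces the nested loop with a single pass over each list.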


In [117]:
# see if any in the double names could possibly be written wrong or be in the triple names?
"""
duplicate_names = pd.DataFrame(columns=['double_name', 'count', 'in_double_name'])
row = 0
for name in single_names:
    for d_name in double_names:
        if name in d_name. split(' ')[0]:
            count = len(script.Name[script.Name == name])
            duplicate_names.loc[row] = [name, count, d_name]
            row += 1
"""



In [115]:
# load dataset with all characters and extract json charactername to a dataframe
with open('data/characters.json') as f:
    data = json.load(f)

df = pd.DataFrame(data['characters'])
print("Characters", df.shape[0])

df = df[df['characterLink'].notnull()]
print("Characters with link", df.shape[0])
characters = df.characterName

l = []
for name in np.unique(characters):
    for n in name.split(' '):
        l.append(n)
unique_charnames = np.unique(l)

Characters 389
Characters with link 368


In [None]:
# characters that are found in the script
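
A possible starting point for this step is a plain set comparison between the speaker names in the script and the names from the character list. The sketch assumes the script names are lower-case while the character list is capitalized, and all names and the function name `split_by_presence` are made up for illustration.

```python
def split_by_presence(script_names, character_names):
    """Return (found, missing): script names that do / do not appear in the character list."""
    known = {name.lower() for name in character_names}  # compare case-insensitively
    unique_script_names = set(script_names)
    found = sorted(n for n in unique_script_names if n in known)
    missing = sorted(n for n in unique_script_names if n not in known)
    return found, missing


# Hypothetical inputs standing in for script.Name and df.characterName.
script_names = ["jon snow", "gared", "cersei lannister", "jon snow"]
character_names = ["Jon Snow", "Cersei Lannister", "Tyrion Lannister"]

found, missing = split_by_presence(script_names, character_names)
print(found)    # ['cersei lannister', 'jon snow']
print(missing)  # ['gared']
```

The `missing` list is a convenient checklist for the manual name corrections, since it contains both minor characters and misspelled names.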
