![https://i.postimg.cc/vZ8k9N0K/Screen-Shot-2020-08-03-at-21-54-55.png](https://i.postimg.cc/vZ8k9N0K/Screen-Shot-2020-08-03-at-21-54-55.png)

# Game of Thrones Sentences Analysis

This notebook is a great way to get some insights about the worldwide famous HBO tv serie '[Game of Thrones](https://en.wikipedia.org/wiki/Game_of_Thrones)', based on the novel '[A Song of Ice and Fire](https://en.wikipedia.org/wiki/A_Song_of_Ice_and_Fire)' writen by George R. R. Martin.

Also, you'll be able to see how to explore data using Pyspark, using some techniques to clean and enhance the dataset.

The dataset was download from Kaggle, where you can find other amazing datasets published by the members.

I hope you enjoy this study! :)

## Dataset setup

### Let's begin starting the Spark Context (I'm using PySpark and Jupyter over a Docker container)

In [1]:
from pyspark.sql import *
from pyspark.sql.functions import *
import os
import time


spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

### Now we start creating a dataframe by reading the raw dataset

In [2]:
sentences_df = spark.read.csv("../datasets/raw_data/got/Game_of_Thrones_Script.csv", header = True)
sentences_df.createOrReplaceTempView("sentences")

sentences_df.show(5)

+------------+--------+---------+----------------+------------+--------------------+
|Release Date|  Season|  Episode|   Episode Title|        Name|            Sentence|
+------------+--------+---------+----------------+------------+--------------------+
|  2011-04-17|Season 1|Episode 1|Winter is Coming|waymar royce|What do you expec...|
|  2011-04-17|Season 1|Episode 1|Winter is Coming|        will|I've never seen w...|
|  2011-04-17|Season 1|Episode 1|Winter is Coming|waymar royce|How close did you...|
|  2011-04-17|Season 1|Episode 1|Winter is Coming|        will|Close as any man ...|
|  2011-04-17|Season 1|Episode 1|Winter is Coming|       gared|We should head ba...|
+------------+--------+---------+----------------+------------+--------------------+
only showing top 5 rows



### Character and Houses
### We're about to enhance the dataset with treatments over the characters' name column. Fixing wrong names and attaching last names

Check this example of the issue we may fix. The character 'Alliser Thorne' appears in the dataset with 5 different names we can identify by the resemblance

In [3]:
spark.sql("select distinct Name from sentences where Name like 'alli%'").show()

+--------------+
|          Name|
+--------------+
|       alliser|
|      allister|
|alliser thorne|
| alliser thorn|
|alliser throne|
+--------------+



For this task, we'd need a new dataset to join the wrong names with the right ones. But the problem is: it wasn't provided with the sentences dataset, so I needed to do it by myself :x please enjoy

### Sample of the dataset to enhance characters' names and houses

In [4]:
characters_df = spark.read.csv("../datasets/raw_data/got/Characters_Dataset.csv", header = True)
characters_df.createOrReplaceTempView("characters")
characters_df.show(10,False)

+--------------+---------------+---------+
|CharacterName |VerifiedName   |House    |
+--------------+---------------+---------+
|A Voice       |A Voice        |null     |
|Addam Marbrand|Addam Marbrand |Marbrand |
|Aemon         |Aemon Targaryen|Targaryen|
|Aeron         |Aeron Greyjoy  |Greyjoy  |
|Aerson        |Aeron Greyjoy  |Greyjoy  |
|Ahsa          |Asha Greyjoy   |Greyjoy  |
|All           |All            |null     |
|All Three     |All Three      |null     |
|All Together  |All Together   |null     |
|Alliser       |Alliser Thorne |Thorne   |
+--------------+---------------+---------+
only showing top 10 rows



### The first transformation applied to the dataset is:
- Change the 'Season' column to be more comfortable to work with
- Change the 'Episode' column to be more comfortable to work with
- Create a new column: 'SeasonEpisode'
- Applying camelcase over 'CharacterName' column
- Create a new column: 'SentenceSize'
- Merge with the Characters_Dataset to enhance names, changing 'CharacterName' column
- Bring from Characters_Dataset 'House' column

In [5]:
query = """
        select ReleaseDate,
                Season,
                Episode,
                concat(Season,Episode) as SeasonEpisode,
                EpisodeTitle,
                VerifiedName as CharacterName,
                House,
                Sentence,
                SentenceSize
        from (
            select `Release Date` as ReleaseDate,
                    replace(Season,"Season ","S0") as Season,
                    if(char_length(substring(Episode,instr(Episode," ")+1,3)) = 1,
                        concat("E0",substring(Episode,instr(Episode," ")+1,3)),
                        concat("E",substring(Episode,instr(Episode," ")+1,3)))
                        as Episode,
                    `Episode Title` as EpisodeTitle,
                    initcap(Name) as CharacterName,
                    Sentence,
                    char_length(Sentence) as SentenceSize
            from sentences            
                    ) sentences
        left join characters
        on sentences.CharacterName = characters.CharacterName
                    
        """
sentences_df_refined = spark.sql(query)
sentences_df_refined.createOrReplaceTempView("sentences_refined")
sentences_df_refined.show(20)

+-----------+------+-------+-------------+----------------+-------------+-----+--------------------+------------+
|ReleaseDate|Season|Episode|SeasonEpisode|    EpisodeTitle|CharacterName|House|            Sentence|SentenceSize|
+-----------+------+-------+-------------+----------------+-------------+-----+--------------------+------------+
| 2011-04-17|   S01|    E01|       S01E01|Winter is Coming| Waymar Royce| null|What do you expec...|         137|
| 2011-04-17|   S01|    E01|       S01E01|Winter is Coming|         Will| null|I've never seen w...|         103|
| 2011-04-17|   S01|    E01|       S01E01|Winter is Coming| Waymar Royce| null|How close did you...|          22|
| 2011-04-17|   S01|    E01|       S01E01|Winter is Coming|         Will| null|Close as any man ...|          23|
| 2011-04-17|   S01|    E01|       S01E01|Winter is Coming|        Gared| null|We should head ba...|          32|
| 2011-04-17|   S01|    E01|       S01E01|Winter is Coming|   Yohn Royce| null|Do the de

### ---------------------------------------------------- OPTIONAL!!! [OPEN] ---------------------------------------------------- 
### Maybe you wanna save this refined dataset as .parquet and use it instead of using .csv
FYI: The original dataset in .csv is almost 3 times heavier than the .parquet (.parquet files are compressed - in our case the compression applied is named as 'snappy'). Also, you can save using the partitionBy commonly done in professional projects

In [6]:
start_time = time.time()
sentences_df_refined.write.mode("overwrite").partitionBy('Season','Episode').parquet("../datasets/refined_data/got/refined_dataset")
print("--- It took %s seconds to persist your dataset as parquet ---" % (time.time() - start_time))

--- It took 8.20419430732727 seconds to persist your dataset as parquet ---


### The result will be like this

In [7]:
path = "../datasets/refined_data/got/refined_dataset/"

files = sorted(os.listdir(path))
for f in files:
    if os.path.isdir(path+f):
        print(f"Directory: {f}")
        episodes = sorted(os.listdir(path+f))
        for episode in episodes:
            if os.path.isdir(path+f+"/"+episode):
                print(f"\tDirectory: {episode}")
                files = sorted(os.listdir(path+f+"/"+episode))
                for file in files:
                    print(f"\t\tItem: {file}")

Directory: Season=S01
	Directory: Episode=E01
		Item: .part-00000-93a9c0ae-3e56-49fe-a74b-12ef01cbf050.c000.snappy.parquet.crc
		Item: part-00000-93a9c0ae-3e56-49fe-a74b-12ef01cbf050.c000.snappy.parquet
	Directory: Episode=E02
		Item: .part-00000-93a9c0ae-3e56-49fe-a74b-12ef01cbf050.c000.snappy.parquet.crc
		Item: part-00000-93a9c0ae-3e56-49fe-a74b-12ef01cbf050.c000.snappy.parquet
	Directory: Episode=E03
		Item: .part-00000-93a9c0ae-3e56-49fe-a74b-12ef01cbf050.c000.snappy.parquet.crc
		Item: part-00000-93a9c0ae-3e56-49fe-a74b-12ef01cbf050.c000.snappy.parquet
	Directory: Episode=E04
		Item: .part-00000-93a9c0ae-3e56-49fe-a74b-12ef01cbf050.c000.snappy.parquet.crc
		Item: part-00000-93a9c0ae-3e56-49fe-a74b-12ef01cbf050.c000.snappy.parquet
	Directory: Episode=E05
		Item: .part-00000-93a9c0ae-3e56-49fe-a74b-12ef01cbf050.c000.snappy.parquet.crc
		Item: part-00000-93a9c0ae-3e56-49fe-a74b-12ef01cbf050.c000.snappy.parquet
	Directory: Episode=E06
		Item: .part-00000-93a9c0ae-3e56-49fe-a74b-12ef0

In [8]:
sentences_df_refined = spark.read.parquet("../datasets/refined_data/got/refined_dataset")
sentences_df_refined.createOrReplaceTempView("sentences_refined")
sentences_df_refined.show(10)

+-----------+-------------+------------+---------------+---------+--------------------+------------+------+-------+
|ReleaseDate|SeasonEpisode|EpisodeTitle|  CharacterName|    House|            Sentence|SentenceSize|Season|Episode|
+-----------+-------------+------------+---------------+---------+--------------------+------------+------+-------+
| 2017-08-13|       S07E05|   Eastwatch|Jaime Lannister|Lannister|You could have ki...|          25|   S07|    E05|
| 2017-08-13|       S07E05|   Eastwatch|          Bronn|     null|What the fuck wer...|          40|   S07|    E05|
| 2017-08-13|       S07E05|   Eastwatch|Jaime Lannister|Lannister|Ending the war by...|          30|   S07|    E05|
| 2017-08-13|       S07E05|   Eastwatch|          Bronn|     null|You saw the drago...|          39|   S07|    E05|
| 2017-08-13|       S07E05|   Eastwatch|          Bronn|     null|looks incredulous...|          29|   S07|    E05|
| 2017-08-13|       S07E05|   Eastwatch|          Bronn|     null|      

### ---------------------------------------------------- OPTIONAL!!! [CLOSED] ---------------------------------------------------- 

## Data Analysis

## Word Count
### The first thing you may ask when you see a dataset of sentences is Word Count. Basically we split the sentences word by word, then we count :)
We assemble a couple functions and parameters to help us to handle the words found in the sentences

In [9]:
def generate_replace_for_empty(word,character_list):
    if len(character_list) == 1:
        target = f"replace({word},'{character_list[0]}','')"
        return target
    else:
        for character in character_list:
            if character == character_list[0]:
                target = f"replace({word},'{character}','')"
            else:
                target = f"replace({target},'{character}','')"
        return target
    

def generate_if_then(word,if_list):
    for if_case in if_list:
        if if_case == if_list[0]:
            target = f"if(Word = '{if_case[0]}','{if_case[1]}',{word})"
        else:
            target = f"if(Word = '{if_case[0]}','{if_case[1]}',{target})"
    return target

    
character_list = ['.','[',']','?','/',',','(',')','!','…',"|","”","‘","\"",'-','*','—',"#",'{','}','“','&',';','–','1','2','3','4','5','6','7','8','9','0',' ']
if_list = [['re','are'],['t','not'],['d','would'],['ll','will'],['won','will']]

The first step is to select only the column 'Sentence' from the dataset and start splitting the column based in some characters

In [10]:
# Splitting by Spaces
sentences_df_word_split = sentences_df_refined.select(split(sentences_df_refined.Sentence, '\s+').alias('split'))

sentences_df_word_single = sentences_df_word_split.select(explode(sentences_df_word_split.split).alias('Word'))

# Splitting by Apostrophes
sentences_df_word_split = sentences_df_word_single.select(split(sentences_df_word_single.Word, '\'').alias('split'))

sentences_df_word_single = sentences_df_word_split.select(explode(sentences_df_word_split.split).alias('Word'))

# Splitting by …

sentences_df_word_split = sentences_df_word_single.select(split(sentences_df_word_single.Word, '…').alias('split'))

sentences_df_word_single = sentences_df_word_split.select(explode(sentences_df_word_split.split).alias('Word'))

# Splitting by ;

sentences_df_word_split = sentences_df_word_single.select(split(sentences_df_word_single.Word, ';').alias('split'))

sentences_df_word_single = sentences_df_word_split.select(explode(sentences_df_word_split.split).alias('Word'))

# Different of empty

sentences_df_words = sentences_df_word_single.where(sentences_df_word_single.Word != '')
sentences_df_words.createOrReplaceTempView("sentences_words")

sentences_df_words.show(5)

+------+
|  Word|
+------+
|   You|
| could|
|  have|
|killed|
|   me.|
+------+
only showing top 5 rows



Perhaps you noticed there are some special characters we need to get rid of, like commas, slashes, and others. We can also enhance the dataset inserting a column with the length of the word. On top of this, we can lowercase all words in order to achieve correct results when we group them.

In [11]:
cleanse_query = f"""
        select Word,char_length(Word) as WordSize
        from (
            select rtrim(lower({generate_replace_for_empty('Word',character_list)})) as Word
            from sentences_words
            )
        where char_length(Word) > 0
        """

sentences_df_words_cleansed = spark.sql(cleanse_query)
sentences_df_words_cleansed.createOrReplaceTempView("sentences_cleansed")

sentences_df_words_cleansed.show(5)

+------+--------+
|  Word|WordSize|
+------+--------+
|   you|       3|
| could|       5|
|  have|       4|
|killed|       6|
|    me|       2|
+------+--------+
only showing top 5 rows



After cleaning the words dataset, there are some words we need to fix. Such as 'll' to 'will', 'd' to 'would', etc., generated by the word-split (for example: "they're" = 'they','re')

In [12]:
short_fixing_query = f"""
        select {generate_if_then('Word',if_list)} as Word,
                WordSize,
                count(Word) as WordCount
        from sentences_cleansed
        group by Word,WordSize
        """
sentences_df_words_short_fixed = spark.sql(short_fixing_query)
sentences_df_words_short_fixed.repartition(1).write.mode("overwrite").parquet("../datasets/refined_data/got/word_count")

sentences_df_words_short_fixed.show(10)

+---------+--------+---------+
|     Word|WordSize|WordCount|
+---------+--------+---------+
|     down|       4|      319|
|     next|       4|       70|
|     wits|       4|        4|
|    drink|       5|      117|
|  recruit|       7|        3|
|   assume|       6|       20|
|implicate|       9|        1|
| infantry|       8|        3|
|   kissed|       6|       10|
|   locate|       6|        3|
+---------+--------+---------+
only showing top 10 rows



### Now we're ready to get some insights!

In [13]:
sentences_df_words_short_fixed_parquet = spark.read.parquet("../datasets/refined_data/got/word_count")
sentences_df_words_short_fixed_parquet.cache()
sentences_df_words_short_fixed_parquet.createOrReplaceTempView("word_count")

### What are the top 10 words spoken in the tv series?

In [14]:
spark.sql("select Word,WordCount from word_count order by WordCount desc limit 10;").show()

+----+---------+
|Word|WordCount|
+----+---------+
| you|    12452|
| the|    12228|
|   i|    10054|
|  to|     7972|
|   a|     6093|
| and|     5272|
|   s|     4656|
|  of|     4556|
|  it|     3991|
| not|     3788|
+----+---------+



### What are the top 10 longest words?

In [15]:
spark.sql("select Word,WordSize from word_count order by WordSize desc limit 10;").show(10,False)
print("as you can, see there are some issues with some words because of the absence of spaces :( anyway, one more thing to be aware in a future enhancement")

+---------------------+--------+
|Word                 |WordSize|
+---------------------+--------+
|granddaughterwwwertha|21      |
|somethingsomething   |18      |
|seventysevencourse   |18      |
|eastwatchbythesea    |17      |
|kingbeyondthewall    |17      |
|greatgrandfather     |16      |
|responsibilities     |16      |
|misunderstanding     |16      |
|accomplishments      |15      |
|tralalalaleeday      |15      |
+---------------------+--------+

as you can, see there are some issues with some words because of the absence of spaces :( anyway, one more thing to be aware in a future enhancement


### How many times were said 'Jon' and 'Snow'?

In [16]:
spark.sql("select Word,WordCount from word_count where Word = 'jon' or Word = 'snow' order by 1").show()

+----+---------+
|Word|WordCount|
+----+---------+
| jon|      284|
|snow|      192|
+----+---------+



## Sentences

### Have you asked yourself how many times you heard the famous "winter is coming" while watched the tv show? Well, probably you thought you heard more than you actually did :O

![https://i.postimg.cc/CKbsgMJ9/Screen-Shot-2020-08-02-at-23-49-44.png](https://i.postimg.cc/CKbsgMJ9/Screen-Shot-2020-08-02-at-23-49-44.png)

In [17]:
sentence_query = """
    select SeasonEpisode,
            CharacterName,
            Sentence from sentences_refined
    where lower(Sentence) like '%winter is coming%'
    order by SeasonEpisode
    """
winter_is_coming = spark.sql(sentence_query)
rows = winter_is_coming.collect()

print("Number of times the sentence was found: "+str(len(rows))+"\n--------------------------------------------")

for row in rows:
    print("at "+row[0]+", "+row[1]+" says: "+row[2]+"\n")

Number of times the sentence was found: 12
--------------------------------------------
at S01E01, Eddard "Ned" Stark says: He won't be a boy forever. And winter is coming.

at S01E01, Benjen Stark says: Maybe. Direwolves south of the wall. Talk of the Walkers. My brother might be the next Hand to the king. Winter is coming.

at S01E01, Eddard "Ned" Stark says: Winter is coming.

at S01E03, Arya Stark says: Winter is coming.

at S01E08, Robb Stark says: Tell Lord Tywin, Winter is coming for him.

at S01E10, Yoren says: Come on, you sorry sons of whores! lt's a thousand leagues from here to the Wall, and winter is coming!

at S02E03, Catelyn Stark says: Because it won't last. Because they are the knights of summer and winter is coming.

at S03E03, Ramsay Bolton says: You're a long way from home and winter is coming.

at S04E10, Mance Rayder says: I showed you everything I had. The whole army, a hundred thousand strong, and what did you do? You fired on us with everything you had. It was

### This one is DEEP: "The night is dark and full of terrors". And surprisingly it's said more than the previous one

![https://i.postimg.cc/Y9qRzKfV/Screen-Shot-2020-08-02-at-23-55-36.png](https://i.postimg.cc/Y9qRzKfV/Screen-Shot-2020-08-02-at-23-55-36.png)

In [18]:
sentence_query = """
    select SeasonEpisode,
            CharacterName,
            Sentence from sentences_refined
    where lower(Sentence) like '%the night is dark and full of terrors%'
    order by SeasonEpisode
    """
night_is_dark = spark.sql(sentence_query)
rows = night_is_dark.collect()

print("Number of times the sentence was found: "+str(len(rows))+"\n--------------------------------------------")

for row in rows:
    print("at "+row[0]+", "+row[1]+" says: "+row[2]+"\n")

Number of times the sentence was found: 18
--------------------------------------------
at S02E01, Melisandre says: Take them and cast your light upon us. For the night is dark and full of terrors.

at S02E01, Group says: For the night is dark and full of terrors.

at S02E01, Melisandre says: For the night is dark and full of terrors.

at S02E01, Melisandre says: The night is dark and full of terrors, old man, but the fire burns them all away. Your Grace.

at S02E04, Melisandre says: Look to your sins, Lord Renly. The night is dark and full of terrors.

at S02E04, Davos Seaworth says: Someone once told me the night is dark and full of terrors.

at S03E05, Thoros of Myr says: Show us the truth. Strike this man down if he is guilty. Give strength to his sword if he is true. Lord of Light, give us wisdom. For the night is dark and full of terrors.

at S03E05, Men says: For the night is dark and full of terrors.

at S03E05, Thoros of Myr says: For the night is dark and full of terrors. Lor

### You may be wondering how'd be the ranking of the more talkative houses.
### Let's take a look considering the number of sentences

In [19]:
sentence_house_q = """
        select House,
                count(Sentence) as NumberOfSentences
        from sentences_refined
        where House is not null
        group by House
        order by count(Sentence) desc
        """
spark.sql(sentence_house_q).show(5)

+---------+-----------------+
|    House|NumberOfSentences|
+---------+-----------------+
|Lannister|             4621|
|    Stark|             4236|
|Targaryen|             1178|
|Baratheon|              789|
|  Greyjoy|              744|
+---------+-----------------+
only showing top 5 rows



### and now considering the length of the sentences

In [20]:
sentence_house_q = """
        select House,
                sum(SentenceSize) as TotalSentencesSize
        from sentences_refined
        where House is not null
        group by House
        order by sum(SentenceSize) desc
        """
spark.sql(sentence_house_q).show(5)

+---------+------------------+
|    House|TotalSentencesSize|
+---------+------------------+
|Lannister|            331012|
|    Stark|            216493|
|Targaryen|             74298|
|Baratheon|             50376|
|  Greyjoy|             46451|
+---------+------------------+
only showing top 5 rows



### I had the feeling the Lannisters would win this contest (maybe because the Starks are almost dead lol)
![https://64.media.tumblr.com/072498d6f25b8b9c05577620876208c6/tumblr_nuymvbSm7X1ueoasuo1_500.gif](https://64.media.tumblr.com/072498d6f25b8b9c05577620876208c6/tumblr_nuymvbSm7X1ueoasuo1_500.gif)

### What are the episodes with more dialogues and fewer dialogues?

In [21]:
dialogues_query = """
    select SeasonEpisode,
    EpisodeTitle,
    count(Sentence) as CountSentences,
    sum(SentenceSize) as SumSentencesSize
    from sentences_refined
    group by SeasonEpisode,EpisodeTitle
    order by count(Sentence) desc
    limit 1;
    """

spark.sql(dialogues_query).show(1,False)

+-------------+------------+--------------+----------------+
|SeasonEpisode|EpisodeTitle|CountSentences|SumSentencesSize|
+-------------+------------+--------------+----------------+
|S07E05       |Eastwatch   |505           |33866           |
+-------------+------------+--------------+----------------+



In [22]:
dialogues_query = """
    select SeasonEpisode,
    EpisodeTitle,
    count(Sentence) as CountSentences,
    sum(SentenceSize) as SumSentencesSize
    from sentences_refined
    group by SeasonEpisode,EpisodeTitle
    order by count(Sentence) asc
    limit 1;
    """

spark.sql(dialogues_query).show(1,False)

+-------------+----------------------+--------------+----------------+
|SeasonEpisode|EpisodeTitle          |CountSentences|SumSentencesSize|
+-------------+----------------------+--------------+----------------+
|S08E04       |The Last of the Starks|51            |2692            |
+-------------+----------------------+--------------+----------------+



### The number of sentences and size of sentences per Season

In [23]:
dialogues_query = """
    select Season,
    count(Sentence) as CountSentences,
    sum(SentenceSize) as SumSentencesSize
    from sentences_refined
    group by Season
    order by Season asc
    """

spark.sql(dialogues_query).show()

+------+--------------+----------------+
|Season|CountSentences|SumSentencesSize|
+------+--------------+----------------+
|   S01|          3179|          191173|
|   S02|          3914|          239487|
|   S03|          3573|          215795|
|   S04|          3446|          212372|
|   S05|          3035|          194068|
|   S06|          2856|          185802|
|   S07|          2442|          168716|
|   S08|          1466|           68946|
+------+--------------+----------------+



## Sentiment analysis
### Here's something interesting to take a look, gathering insights to apply in real professional projects

![https://i.postimg.cc/Mp7WdNwm/sentiment-analysis.png](https://i.postimg.cc/Mp7WdNwm/sentiment-analysis.png)

For our purpose here we'll use two different Python libs which can help us calculating the sentiment of a given text. You can deep dive into machine and deep learning if you prefer, but for our study, these two should do the trick ;)

Let's start by importing the libs, initializing some objects and creating some helpful functions (which we'll use as UDF (user defined function) in SparkSQL)

In [24]:
from textblob import TextBlob
import nltk
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from pyspark.sql.types import *


analyzer = SentimentIntensityAnalyzer()


def get_textblob_score(sentence):
    analysis = TextBlob(sentence)
    return analysis.sentiment.polarity


def get_vader_score(sentence):
    return analyzer.polarity_scores(sentence)['compound']


spark.udf.register("sentimentTextBlob", get_textblob_score, DoubleType())
spark.udf.register("sentimentVader", get_vader_score, DoubleType())


def generate_sentiment_face(score_column):
    return f"""if({score_column}=0,":|",
                    if({score_column}>0 and {score_column}<0.4,":)",
                        if({score_column}>=0.4,":D",
                            if({score_column}<0 and {score_column}>=-0.4,":(",":'("))))
    """


def generate_sentiment_result(score_column):
    return f"""if({score_column}=0,"Neutral",
                    if({score_column}>0 and {score_column}<0.4,"Positive",
                        if({score_column}>=0.4,"Very Positive",
                            if({score_column}<0 and {score_column}>=-0.4,"Negative","Very Negative"))))
    """

### Here's a sample of the results we can get

In [25]:
sample_query = f"""
    select SeasonEpisode,
                CharacterName,
                Sentence,
                SentimentScore1,
                {generate_sentiment_face('SentimentScore1')} as Sentiment1,
                {generate_sentiment_result('SentimentScore1')} as SentimentResult1,
                SentimentScore2,
                {generate_sentiment_face('SentimentScore2')} as Sentiment2,
                {generate_sentiment_result('SentimentScore2')} as SentimentResult2
    from (
        select SeasonEpisode,
                CharacterName,
                Sentence,
                sentimentTextBlob(Sentence) as SentimentScore1,
                sentimentVader(Sentence) as SentimentScore2
        from sentences_refined
        limit 20
        )
"""
sample_sentences_df = spark.sql(sample_query)
sample_sentences_df.select("SeasonEpisode","CharacterName","Sentence","SentimentResult2","SentimentResult2").show(20)

+-------------+------------------+--------------------+----------------+----------------+
|SeasonEpisode|     CharacterName|            Sentence|SentimentResult2|SentimentResult2|
+-------------+------------------+--------------------+----------------+----------------+
|       S07E05|   Jaime Lannister|You could have ki...|   Very Negative|   Very Negative|
|       S07E05|             Bronn|What the fuck wer...|   Very Negative|   Very Negative|
|       S07E05|   Jaime Lannister|Ending the war by...|   Very Negative|   Very Negative|
|       S07E05|             Bronn|You saw the drago...|         Neutral|         Neutral|
|       S07E05|             Bronn|looks incredulous...|         Neutral|         Neutral|
|       S07E05|             Bronn|                And?|         Neutral|         Neutral|
|       S07E05|   Jaime Lannister|sits up and says ...|         Neutral|         Neutral|
|       S07E05|             Bronn|Listen to me, cun...|        Negative|        Negative|
|       S0

### Personally I enjoyed the results I got from VaderSentiment :) so I rather continue this journey trusting in it to be my main tool to do the sentiment analysis

In [26]:
sentiment_query = f"""
    select *,
            {generate_sentiment_result('SentimentScore')} as SentimentResult,
            {generate_sentiment_face('SentimentScore')} as Sentiment
    from (
        select *,
                sentimentVader(Sentence) as SentimentScore
        from sentences_refined
        )
"""
sentiment_df = spark.sql(sentiment_query)

start_time = time.time()
sentiment_df.write.mode("overwrite").partitionBy('Season','Episode').parquet("../datasets/refined_data/got/sentiment_analysis")
print("--- It took %s seconds to persist your dataset as parquet ---" % (time.time() - start_time))

--- It took 35.136321783065796 seconds to persist your dataset as parquet ---


In [27]:
sentiment_df = spark.read.parquet("../datasets/refined_data/got/sentiment_analysis")
sentiment_df.createOrReplaceTempView("sentiment_analysis")
sentiment_df.show(5)

+-----------+-------------+------------+---------------+---------+--------------------+------------+--------------+---------------+---------+------+-------+
|ReleaseDate|SeasonEpisode|EpisodeTitle|  CharacterName|    House|            Sentence|SentenceSize|SentimentScore|SentimentResult|Sentiment|Season|Episode|
+-----------+-------------+------------+---------------+---------+--------------------+------------+--------------+---------------+---------+------+-------+
| 2017-08-13|       S07E05|   Eastwatch|Jaime Lannister|Lannister|You could have ki...|          25|       -0.6705|  Very Negative|      :'(|   S07|    E05|
| 2017-08-13|       S07E05|   Eastwatch|          Bronn|     null|What the fuck wer...|          40|       -0.5423|  Very Negative|      :'(|   S07|    E05|
| 2017-08-13|       S07E05|   Eastwatch|Jaime Lannister|Lannister|Ending the war by...|          30|       -0.8519|  Very Negative|      :'(|   S07|    E05|
| 2017-08-13|       S07E05|   Eastwatch|          Bronn|  

### Now we're reading from a new refined dataset and for our first exploration, let's see the distribution of sentiments around the series

In [28]:
sentiment_split_query = """
    select Season,
            sum(VeryNegative) as VeryNegative,
            sum(Negative) as Negative,
            sum(Neutral) as Neutral,
            sum(Positive) as Positive,
            sum(VeryPositive) as VeryPositive
    from (
        select Season,
                if(SentimentResult='Very Negative',1,0) as VeryNegative,
                if(SentimentResult='Negative',1,0) as Negative,
                if(SentimentResult='Neutral',1,0) as Neutral,
                if(SentimentResult='Positive',1,0) as Positive,
                if(SentimentResult='Very Positive',1,0) as VeryPositive
        from sentiment_analysis
        )
    group by Season
    order by Season
"""

sentiment_distribution = spark.sql(sentiment_split_query)
sentiment_distribution.cache()
sentiment_distribution.show(10)
sentiment_distribution.groupBy().agg(\
                                    sum("VeryNegative").alias("VeryNegative"),\
                                    sum("Negative").alias("Negative"),\
                                    sum("Neutral").alias("Neutral"),\
                                    sum("Positive").alias("Positive"),\
                                    sum("VeryPositive").alias("VeryPositive")).show(1)

+------+------------+--------+-------+--------+------------+
|Season|VeryNegative|Negative|Neutral|Positive|VeryPositive|
+------+------------+--------+-------+--------+------------+
|   S01|         439|     322|   1371|     365|         682|
|   S02|         603|     416|   1689|     434|         772|
|   S03|         536|     357|   1596|     411|         673|
|   S04|         542|     329|   1505|     386|         684|
|   S05|         447|     308|   1278|     356|         646|
|   S06|         453|     320|   1232|     331|         520|
|   S07|         395|     267|   1115|     252|         413|
|   S08|         181|     137|    741|     159|         248|
+------+------------+--------+-------+--------+------------+

+------------+--------+-------+--------+------------+
|VeryNegative|Negative|Neutral|Positive|VeryPositive|
+------------+--------+-------+--------+------------+
|        3596|    2456|  10527|    2694|        4638|
+------------+--------+-------+--------+-----------

### The Seasons with the highest and lowest count of each of the sentiments

In [29]:
sentiment_distribution.createOrReplaceTempView("sentiment_counts")

sentiment_max_min = """
    select max(VeryNegative),
            min(VeryNegative),
            max(Negative),
            min(Negative),
            max(Neutral),
            min(Neutral),
            max(Positive),
            min(Positive),
            max(VeryPositive),
            min(VeryPositive)
    from sentiment_counts
"""

result = spark.sql(sentiment_max_min).collect()[0]
print(f"{result}\n\n")

list_sentiments = ['VeryNegative','Negative','Neutral','Positive','VeryPositive']
counter = 0

for sentiment in list_sentiments:
    query = f"select Season,{sentiment} as Max{sentiment} from sentiment_counts where {sentiment} = {result[0+counter]}"
    spark.sql(query).show()
    counter+=1
    
    query = f"select Season,{sentiment} as Min{sentiment} from sentiment_counts where {sentiment} = {result[0+counter]}"
    spark.sql(query).show()
    counter+=1

Row(max(VeryNegative)=603, min(VeryNegative)=181, max(Negative)=416, min(Negative)=137, max(Neutral)=1689, min(Neutral)=741, max(Positive)=434, min(Positive)=159, max(VeryPositive)=772, min(VeryPositive)=248)


+------+---------------+
|Season|MaxVeryNegative|
+------+---------------+
|   S02|            603|
+------+---------------+

+------+---------------+
|Season|MinVeryNegative|
+------+---------------+
|   S08|            181|
+------+---------------+

+------+-----------+
|Season|MaxNegative|
+------+-----------+
|   S02|        416|
+------+-----------+

+------+-----------+
|Season|MinNegative|
+------+-----------+
|   S08|        137|
+------+-----------+

+------+----------+
|Season|MaxNeutral|
+------+----------+
|   S02|      1689|
+------+----------+

+------+----------+
|Season|MinNeutral|
+------+----------+
|   S08|       741|
+------+----------+

+------+-----------+
|Season|MaxPositive|
+------+-----------+
|   S02|        434|
+------+-----------+

+------+---------

In [30]:
sentiment_distribution.unpersist()

DataFrame[Season: string, VeryNegative: bigint, Negative: bigint, Neutral: bigint, Positive: bigint, VeryPositive: bigint]

### Now changing our perspective to the main characters, how'd be their distribution of sentiments

In [31]:
sentiment_split_query = """
    select CharacterName,
            sum(VeryNegative) as VeryNegative,
            sum(Negative) as Negative,
            sum(Neutral) as Neutral,
            sum(Positive) as Positive,
            sum(VeryPositive) as VeryPositive
    from (
        select CharacterName,House,
                if(SentimentResult='Very Negative',1,0) as VeryNegative,
                if(SentimentResult='Negative',1,0) as Negative,
                if(SentimentResult='Neutral',1,0) as Neutral,
                if(SentimentResult='Positive',1,0) as Positive,
                if(SentimentResult='Very Positive',1,0) as VeryPositive
        from sentiment_analysis
        where CharacterName = 'Jon Snow' or CharacterName = 'Arya Stark' or
                CharacterName = 'Sansa Stark' or CharacterName = 'Brandon "Bran" Stark' or
                CharacterName = 'Cersei Lannister' or CharacterName = 'Jaime Lannister' or
                CharacterName = 'Tyrion Lannister' or CharacterName = 'Samwell Tarly' or
                CharacterName = 'Daenerys Targaryen'
        )
    group by CharacterName,House
    order by House,CharacterName asc
"""

sentiment_distribution = spark.sql(sentiment_split_query)
sentiment_distribution.show(10)

+--------------------+------------+--------+-------+--------+------------+
|       CharacterName|VeryNegative|Negative|Neutral|Positive|VeryPositive|
+--------------------+------------+--------+-------+--------+------------+
|    Cersei Lannister|         184|     107|    382|     107|         225|
|     Jaime Lannister|         160|      89|    414|     124|         158|
|    Tyrion Lannister|         302|     201|    609|     215|         433|
|          Arya Stark|         106|      73|    448|      71|          85|
|Brandon "Bran" Stark|          52|      33|    213|      48|          54|
|            Jon Snow|         153|     117|    571|     113|         180|
|         Sansa Stark|         118|      72|    345|      91|         158|
|  Daenerys Targaryen|         134|     116|    465|     128|         206|
|       Samwell Tarly|          67|      51|    254|      78|         109|
+--------------------+------------+--------+-------+--------+------------+

