<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Config" data-toc-modified-id="Config-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Config</a></span></li><li><span><a href="#Helper-Functions" data-toc-modified-id="Helper-Functions-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Helper Functions</a></span></li></ul></li><li><span><a href="#Data-Acquisition" data-toc-modified-id="Data-Acquisition-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Acquisition</a></span><ul class="toc-item"><li><span><a href="#Books" data-toc-modified-id="Books-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Books</a></span></li><li><span><a href="#Characters" data-toc-modified-id="Characters-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Characters</a></span></li></ul></li><li><span><a href="#Viz" data-toc-modified-id="Viz-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Viz</a></span><ul class="toc-item"><li><span><a href="#Basics" data-toc-modified-id="Basics-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Basics</a></span><ul class="toc-item"><li><span><a href="#Small-shapes-with-closed-contour-->-object,-idea,-entity,-node" data-toc-modified-id="Small-shapes-with-closed-contour-->-object,-idea,-entity,-node-4.1.1"><span class="toc-item-num">4.1.1&nbsp;&nbsp;</span>Small shapes with closed contour -&gt; object, idea, entity, node</a></span></li><li><span><a href="#Spatial-ordered-graphical-objects-->-related-information-or-a-sequence" 
data-toc-modified-id="Spatial-ordered-graphical-objects-->-related-information-or-a-sequence-4.1.2"><span class="toc-item-num">4.1.2&nbsp;&nbsp;</span>Spatial ordered graphical objects -&gt; related information or a sequence</a></span></li><li><span><a href="#Objects-in-proximity-or-with-same-colour/texture-->-similar-concepts,-related-information" data-toc-modified-id="Objects-in-proximity-or-with-same-colour/texture-->-similar-concepts,-related-information-4.1.3"><span class="toc-item-num">4.1.3&nbsp;&nbsp;</span>Objects in proximity or with same colour/texture -&gt; similar concepts, related information</a></span></li><li><span><a href="#Size-or-height-of-object:-Magnitude,-quantity,-importance,-2D-location" data-toc-modified-id="Size-or-height-of-object:-Magnitude,-quantity,-importance,-2D-location-4.1.4"><span class="toc-item-num">4.1.4&nbsp;&nbsp;</span>Size or height of object: Magnitude, quantity, importance, 2D location</a></span></li></ul></li><li><span><a href="#Complex" data-toc-modified-id="Complex-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Complex</a></span><ul class="toc-item"><li><span><a href="#Shapes-connected-by-contour-->-related-entities" data-toc-modified-id="Shapes-connected-by-contour-->-related-entities-4.2.1"><span class="toc-item-num">4.2.1&nbsp;&nbsp;</span>Shapes connected by contour -&gt; related entities</a></span></li><li><span><a href="#Thickness-of-connecting-contour-->-strength-of-relationship" data-toc-modified-id="Thickness-of-connecting-contour-->-strength-of-relationship-4.2.2"><span class="toc-item-num">4.2.2&nbsp;&nbsp;</span>Thickness of connecting contour -&gt; strength of relationship</a></span></li><li><span><a href="#Shapes-enclosed-by-a-contour/texture/colour-->-contained-entities,-related-entities" data-toc-modified-id="Shapes-enclosed-by-a-contour/texture/colour-->-contained-entities,-related-entities-4.2.3"><span class="toc-item-num">4.2.3&nbsp;&nbsp;</span>Shapes enclosed by a contour/texture/colour 
-&gt; contained entities, related entities</a></span></li><li><span><a href="#Nested/partitioned-regions-->-hierarchical-concepts" data-toc-modified-id="Nested/partitioned-regions-->-hierarchical-concepts-4.2.4"><span class="toc-item-num">4.2.4&nbsp;&nbsp;</span>Nested/partitioned regions -&gt; hierarchical concepts</a></span></li><li><span><a href="#Colour-and-texture-of-connecting-contour-->-type-of-relationship" data-toc-modified-id="Colour-and-texture-of-connecting-contour-->-type-of-relationship-4.2.5"><span class="toc-item-num">4.2.5&nbsp;&nbsp;</span>Colour and texture of connecting contour -&gt; type of relationship</a></span></li></ul></li></ul></li></ul></div>

# Harry Potter Viz

## Introduction

A notebook showing various visualisation heuristics using <a href='https://www.kaggle.com/gulsahdemiryurek/harry-potter-dataset'>this</a> Harry Potter dataset.

We'll be looking at the following semantic meanings of patterns and colours:

Basic patterns

+ Small shapes with closed contour -> object, idea, entity, node
+ Spatial ordered graphical objects -> related information or a sequence
+ Objects in proximity or with same colour/texture -> similar concepts, related information
+ Size or height of object: Magnitude, quantity, importance, 2D location  

More complex patterns

+ Shapes connected by contour -> related entities 
+ Thickness of connecting contour -> strength of relationship
+ Colour and texture of connecting contour -> type of relationship
+ Shapes enclosed by a contour/texture/colour -> contained entities, related entities
+ Nested/partitioned regions -> hierarchical concepts

<img src='https://static1.cbrimages.com/wordpress/wp-content/uploads/2017/12/herry-potter-memes.jpg'>

## Setup

### Imports

In [None]:
import pandas as pd
import plotly.express as px
import numpy as np
import plotly.graph_objects as go

from pathlib import Path
import json

from unicodedata import normalize

### Config

In [None]:
data_dir = Path('./data')
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100
plotly_template='plotly_dark'

### Helper Functions

In [None]:
def convert_to_float(x):
    """Return x as a float, or NaN if it cannot be converted."""
    try:
        r = float(x)
    except (TypeError, ValueError):
        r = np.nan
    return r

## Data Acquisition

### Books

In [None]:
df_philosopher= pd.read_csv(data_dir/'Harry Potter 1.csv', sep=';')
df_chamber= pd.read_csv(data_dir/'Harry Potter 2.csv', sep=';')

def clean_book(df):
    df.columns = [c.lower().strip().replace(' ','_') for c in df.columns]
    df['character'] = df['character'].str.strip().str.lower()
    df['sentence'] = df['sentence'].str.lower()
    return df

df_philosopher = clean_book(df_philosopher)
df_chamber = clean_book(df_chamber)

In [None]:
df_philosopher.sample(4)

In [None]:
df_chamber.sample(4)

### Characters

In [None]:
df_characters = pd.read_csv(data_dir/'Characters.csv', sep=';', encoding='latin-1')
df_characters.columns = [c.lower().strip().replace(' ','_') for c in df_characters.columns]
df_characters['blood_status'] = df_characters['blood_status'].apply(lambda ai: normalize('NFKD',str(ai)))

blood_status_map = {
'Part-HumanÂ (Half-giant)':'Part-Human', 
'Half-bloodÂ orÂ pure-blood':'Half-blood or Pure-blood',
'Pure-bloodÂ orÂ half-blood':'Half-blood or Pure-blood',
'Pure-bloodÂ orÂ Half-blood':'Half-blood or Pure-blood',
'Pure-blood or half-blood':'Half-blood or Pure-blood',
'Half-blood[':'Half-blood',
'Muggle-bornÂ orÂ half-blood[':'Muggle-born',
}

df_characters['blood_status'] = df_characters['blood_status'].apply(lambda ai: blood_status_map.get(ai, ai))

df_characters.sample(3)
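
The `Â` characters in the keys above are typical of UTF-8 text that has been decoded as Latin-1. Here is a minimal sketch of how that can happen, assuming (the dataset does not confirm this) that the original file stores non-breaking spaces as UTF-8. The sample string is hypothetical.

```python
from unicodedata import normalize

# Hypothetical sample value containing two non-breaking spaces (U+00A0).
original = 'Half-blood\u00a0or\u00a0pure-blood'

# Decoding the UTF-8 bytes as Latin-1 turns each U+00A0 into the pair 'Â' + U+00A0.
misread = original.encode('utf-8').decode('latin-1')

# NFKD replaces the non-breaking space with a plain space and
# splits 'Â' into 'A' plus a combining circumflex (U+0302).
cleaned = normalize('NFKD', misread)

print(repr(misread))
print(repr(cleaned))
```

Because normalisation rewrites both characters, any lookup key has to match the normalised form exactly, which is worth checking with `repr()` when a mapping silently fails to apply.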

In [None]:
df_characters_short = pd.read_csv(data_dir/'shortversioncharacters.csv')
df_characters_short.columns = [c.lower() for c in df_characters_short.columns]
df_characters_short['dateofbirth'] = df_characters_short['dateofbirth'].astype('datetime64[ns]')
# Parse the wand dict once, then pull out its fields
wand = df_characters_short['wand'].apply(lambda ai: json.loads(ai.replace("'", '"')))
df_characters_short['wand_wood'] = wand.apply(lambda w: w['wood'])
df_characters_short['wand_core'] = wand.apply(lambda w: w['core'])
df_characters_short['wand_length'] = wand.apply(lambda w: convert_to_float(w['length']))
df_characters_short.sample(3)
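
The `wand` column is parsed by swapping single quotes for double quotes before calling `json.loads`, which would break on any value that itself contains an apostrophe. A hedged alternative sketch using `ast.literal_eval` follows, assuming each cell holds a Python-style dict literal. The sample strings and the `parse_wand` helper are hypothetical, not part of the dataset or the notebook.

```python
import ast
import math


def parse_wand(raw):
    """Parse a Python-style dict literal and return (wood, core, length)."""
    wand = ast.literal_eval(raw)
    try:
        length = float(wand['length'])
    except (TypeError, ValueError):
        length = math.nan  # empty or non-numeric lengths become NaN
    return wand['wood'], wand['core'], length


# Hypothetical sample values, for illustration only.
samples = [
    "{'wood': 'holly', 'core': 'phoenix feather', 'length': 11}",
    "{'wood': '', 'core': '', 'length': ''}",
]
parsed = [parse_wand(s) for s in samples]
print(parsed)
```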

## Viz

### Basics

#### Small shapes with closed contour -> object, idea, entity, node
#### Spatial ordered graphical objects -> related information or a sequence

In [None]:
px.scatter(df_characters_short,
           x='dateofbirth',
           y='name',
           color='name',
           title='Birthday per character',
           template=plotly_template)\
.update_yaxes(title='')\
.update_xaxes(title='')\
.update_layout(showlegend=False)\
.add_layout_image(
    dict(
        source="",
        xref="paper", yref="paper",
        x=1, y=1,
        sizex=0.1, sizey=0.25,
        xanchor="right", yanchor="bottom"
    )
)

#### Objects in proximity or with same colour/texture -> similar concepts, related information

In [None]:
px.bar(df_characters[~df_characters['blood_status'].isna()],
           x='house',
           color='blood_status',
           title='Blood Status per House',
           template=plotly_template)\
.update_yaxes(title='')\
.update_xaxes(title='')

That would have been fine, but the repeating blue is a problem, so we change the colour palette and add some texture.

In [None]:
px.bar(df_characters[~df_characters['blood_status'].isna()]
       .groupby(['blood_status', 'house']).count().reset_index(),
       x='house',
       y='name',
       color_discrete_sequence=px.colors.qualitative.Light24,
       pattern_shape='blood_status',
       color='blood_status',
       title='Blood Status per House',
       template=plotly_template)\
    .update_yaxes(title='')\
    .update_xaxes(title='')

For interest's sake, the people in Slytherin listed as half-blood rather than pure-blood are:

In [None]:
df_characters[
    (df_characters['blood_status'] == 'Half-blood') &
    (df_characters['house'] == 'Slytherin')
][['name', 'blood_status']]

To drive the point home for colours: if we were to plot the four Hogwarts houses, I would put in the time to get their hex codes.

In [None]:
houses_color_map = {'Gryffindor': '#9e0109',
                    'Ravenclaw': '#0e4177',
                    'Slytherin': '#008300',
                    'Hufflepuff': '#b4a335'}

px.bar(df_characters[df_characters['house'].isin(houses_color_map.keys())],
       x='gender',
       color='house',
       barmode='group',
       title='Genders per House',
       color_discrete_map=houses_color_map,
       template=plotly_template)\
    .update_yaxes(title='')\
    .update_xaxes(title='')

#### Size or height of object: Magnitude, quantity, importance, 2D location

In [None]:
# Filter once so the constant marker size can match the number of rows
df_wands = df_characters_short[
    (df_characters_short['house'].isin(houses_color_map.keys())) &
    (df_characters_short['wand_core'] != '')
]

px.scatter(df_wands,
    y='wand_length',
    x='wand_core',
    hover_data=['name'],
    color='house',
    title='Wand length by Wand core',
    symbol_sequence=['star'],
    size=[1]*len(df_wands),  # same size for every marker
    color_discrete_map=houses_color_map,
    template=plotly_template)\
    .update_yaxes(title='')\
    .update_xaxes(title='')

We could also size the symbol by wand length, to emphasise the magnitude further.

In [None]:
px.scatter(df_characters_short[
    (df_characters_short['house'].isin(houses_color_map.keys())) &
    (df_characters_short['wand_core'] != '')&
    (~df_characters_short['wand_length'].isna())
],
    y='wand_length',
    x='wand_wood',
    hover_data=['name'],
    color='house',
    title='Wand length by Wand wood',
    symbol_sequence=['star'],
    size='wand_length',
    color_discrete_map=houses_color_map,
    template=plotly_template)\
    .update_yaxes(title='')\
    .update_xaxes(title='')

In [None]:
px.scatter_3d(df_characters_short[
    (df_characters_short['house'].isin(houses_color_map.keys())) &
    (df_characters_short['wand_core'] != '') &
    (~df_characters_short['wand_length'].isna())
],
    z='wand_length',
    x='wand_core',
    y='wand_wood',
    color='house',
    symbol_sequence=['circle'],
    size='wand_length',
    hover_data=['name'],
    title='Wand length by Wand wood by Wand core',
    color_discrete_map=houses_color_map,
    template=plotly_template)\
    .update_layout(scene=dict(
        xaxis=dict(title=''),
        yaxis=dict(title=''),
        zaxis=dict(title='')
    )
)

Is this 3D justified? 

### Complex

#### Shapes connected by contour -> related entities
#### Thickness of connecting contour -> strength of relationship

In [None]:
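import re

# Count how often each character's name appears in every sentence
# (re.escape keeps names such as 'mr. weasley' from being read as regex)
for character in df_philosopher['character'].unique():
    df_philosopher[character] = df_philosopher['sentence'].str.count(re.escape(character))

# Sum the counts per speaker; numeric_only skips the text columns
df_mentions = df_philosopher.groupby('character')\
.sum(numeric_only=True).reset_index()\
.melt(id_vars='character', var_name='mentions', value_name='count')

characters = list(df_mentions['character'].unique())

# Sankey node indices: speakers take the first len(characters) slots,
# mentioned characters the next len(characters) slots
df_mentions['speaker'] = df_mentions['character'].apply(lambda ai: characters.index(ai))
df_mentions['mentioner'] = df_mentions['mentions'].apply(lambda ai: characters.index(ai)+len(characters))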
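
A Sankey diagram addresses its nodes by position in the label list. When the labels are the character list given twice (speakers on the left, mentioned characters on the right), a target index is the character's position plus the number of characters. A small sketch with made-up names and counts:

```python
# Hypothetical characters and mention counts, for illustration only.
characters = ['harry', 'hagrid', 'ron']
mentions = [('harry', 'hagrid', 3), ('ron', 'harry', 5)]  # (speaker, mentioned, count)

labels = characters + characters  # left-hand nodes, then right-hand nodes
offset = len(characters)          # index where the right-hand nodes start

source = [characters.index(s) for s, _, _ in mentions]
target = [characters.index(m) + offset for _, m, _ in mentions]
value = [c for _, _, c in mentions]

print(source, target, value)
```

Every target index must stay below `len(labels)`; an offset based on anything other than the number of characters would point links at the wrong node or past the end of the list.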

In [None]:
fig = go.Figure(data=[go.Sankey(
    node=dict(
        label=characters+characters,
    ),
    link=dict(
        source=df_mentions['speaker'].values,
        target=df_mentions['mentioner'].values,
        value=df_mentions['count'].values
    ))])

fig.update_layout(title_text="Who speaks about whom",
                  height=1000,
                  )
fig.show()

Would this work as a static image? 

#### Shapes enclosed by a contour/texture/colour -> contained entities, related entities
#### Nested/partitioned regions -> hierarchical concepts
#### Colour and texture of connecting contour -> type of relationship

In [None]:
df_chamber['nr_words_in_speech'] = df_chamber['sentence']\
    .apply(lambda ai: len(ai.split(' ')))
df_chamber['avg_word_length'] = df_chamber['sentence']\
    .apply(lambda ai: np.mean([len(v) for v in ai.split(' ')]))

df_name_map = {'harry': 'Harry Potter',
               'ron': 'Ron Weasley',
               'ginny': 'Ginny Weasley',
               'mr. weasley': 'Arthur Weasley',
               'lucius malfoy': 'Lucius Malfoy',
               'draco': 'Draco Malfoy',
               'hagrid': 'Rubeus Hagrid',
               'hermione': 'Hermione Granger',
               'snape': 'Severus Snape',
               'tom riddle': 'Lord Voldemort'}

df_chamber['name'] = df_chamber['character'].map(df_name_map)
df_chamber = df_chamber[~df_chamber['name'].isna()]
df_chamber_details = df_chamber.merge(df_characters_short,
                                      on='name',
                                      how='left')
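
The two per-sentence metrics computed above can be checked by hand on a single line. A small sketch with a hypothetical sample sentence follows. Because the text is split on spaces only, punctuation attached to a word counts towards its length.

```python
# Hypothetical sample sentence, already lower-cased like the cleaned data.
sentence = "you're a wizard, harry"

words = sentence.split(' ')  # ["you're", 'a', 'wizard,', 'harry']
nr_words_in_speech = len(words)
avg_word_length = sum(len(w) for w in words) / nr_words_in_speech

print(nr_words_in_speech, avg_word_length)
```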

In [None]:
# numeric_only=True averages only the numeric columns and skips the text ones
df_chamber_details_aggr = df_chamber_details\
    .groupby(['house', 'ancestry', 'hogwartsstaff'])\
    .mean(numeric_only=True).reset_index()
df_chamber_details_aggr

In [None]:
px.sunburst(df_chamber_details_aggr,
            path=['house', 'ancestry'],
            values='avg_word_length',
            title='Average word length by house by ancestry',
            width=500,
            color_discrete_sequence=[houses_color_map['Gryffindor'],
                                     houses_color_map['Slytherin']],
            template=plotly_template)

In [None]:
px.sunburst(df_chamber_details_aggr,
            path=['house', 'hogwartsstaff'],
            values='avg_word_length',
            title='Average word length by house by Hogwarts staff',
            width=500,
            color_discrete_sequence=[houses_color_map['Gryffindor'],
                                     houses_color_map['Slytherin']],
            template=plotly_template)

In [None]:
px.sunburst(df_chamber_details_aggr,
            path=['house', 'ancestry'],
            values='wand_length',
            branchvalues='remainder',
            title='Wand length by house by ancestry',
            width=500,
            color_discrete_sequence=[houses_color_map['Gryffindor'],
                                     houses_color_map['Slytherin']],
            template=plotly_template)