<font style='font-size:1.5em'>🗓️ Week 11 – Applications II: <span style='color:#eaeaea'>text mining  \&</span> network analysis</font>

<font style='font-size:1.2em'>DS105L – Data for Data Science</font>

**AUTHOR:**  Dr. [Jon Cardoso-Silva](https://www.lse.ac.uk/DSI/People/Jonathan-Cardoso-Silva)

**DEPARTMENT:** [LSE Data Science Institute](https://twitter.com/lsedatascience)

**DATE:** 31 March 2023


---


## Imports 
Section with library imports.

In [None]:
import pandas as pd
import networkx as nx
import altair as alt

DATA_DIR = '../data'

### Custom functions
General-purpose functions used throughout the notebook.

In [None]:
def read_data(data, data_dir=DATA_DIR):
    """Reads a JSON file and returns a dataframe.

    Args:
        data (str): Name of the JSON file.
        data_dir (str): Path to the directory where the JSON file is located.

    Returns:
        df (pandas.DataFrame): Dataframe containing the data from the JSON file.

    Raises:
        NotImplementedError: If the JSON file contains more than one list.
                             Know how to handle this? Please open a PR!
    """

    # I found it easier to read the JSON file as a series and then convert it to a dataframe
    df = pd.read_json(f'{data_dir}/{data}.json', typ='series')

    # If the JSON file only contains one single list, parse it as a dataframe
    if len(df) == 1:
        df = pd.DataFrame.from_dict(df.iloc[0])
    else:
        error = f'JSON file {data} contains more than one list. Please check the file.'
        raise NotImplementedError(error)

    return df

def convert_time_to_minutes(x):
    """
    Converts a string of the form '0:01:45' to minutes.
    Thanks, CoPilot 🤖! You saved me a lot of time here.
    """

    # Split the string by ':'
    x = x.split(':')

    # Convert the list of strings to integers
    x = [int(i) for i in x]

    # Convert the list of integers to seconds
    x = x[0] * 60*60 + x[1]  * 60 + x[2]

    return x / 60
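
A quick sanity check of `convert_time_to_minutes` on made-up timestamps, next to a possible vectorised alternative based on `pd.to_timedelta`. This is only a sketch: the notebook itself does not use the alternative, and the helper is restated so that the block runs on its own.

```python
import pandas as pd

def convert_time_to_minutes(x):
    """Same logic as the helper above: 'H:MM:SS' -> minutes."""
    hours, minutes, seconds = (int(part) for part in x.split(':'))
    return (hours * 60 * 60 + minutes * 60 + seconds) / 60

# Made-up timestamps, for illustration only
times = pd.Series(['0:01:45', '0:10:00', '1:02:30'])

# Element by element, as done later in the notebook
manual = times.apply(convert_time_to_minutes)

# Alternative: let pandas parse the strings as durations
vectorised = pd.to_timedelta(times).dt.total_seconds() / 60

print(manual.tolist())
print(vectorised.tolist())
```

Both approaches should agree; the `apply` version is easier to read, while the `pd.to_timedelta` version avoids a Python-level loop.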

# 💽 1: The Data

I am using data collected by [jeffreylancaster/game-of-thrones](https://github.com/jeffreylancaster/game-of-thrones) to create a network of characters from the Game of Thrones series. 

I cloned the repository and copied its `data` folder into the `data` folder of my project. Because the license of the data is permissive, I could reuse it here, giving credit to the original author (always do that!).
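
Judging from how `read_data` is written, each JSON file is expected to wrap a single list of records under one top-level key. The sketch below illustrates that assumption with a hypothetical toy file (`toy.json`, with invented records) written to a temporary folder; it is not one of the real data files.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy file: one top-level key wrapping a list of records
toy = {'characters': [{'characterName': 'Arya Stark', 'houseName': 'Stark'},
                      {'characterName': 'Jon Snow', 'houseName': 'Stark'}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'toy.json'
    path.write_text(json.dumps(toy))

    # Read as a Series: one entry per top-level key
    series = pd.read_json(path, typ='series')

# The single entry holds the list of records, which becomes the data frame
toy_df = pd.DataFrame.from_dict(series.iloc[0])

print(len(series))
print(toy_df.shape)
```

If a file had several top-level keys, `len(series)` would be greater than one, which is the case `read_data` refuses to handle.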

Anyways, let's look at the data I am using.

## 1.1 Character data

In [None]:
df_characters = read_data('characters')
df_characters[['characterName', 'houseName', 'royal']].head(20)

## 1.2 Character groups

Not that this one sounds super useful.

In [None]:
df_characters_groups = read_data('characters-groups')
df_characters_groups

## 1.3 Episode data ⭐ 

This is the most interesting dataset.

In [None]:
df_episodes = read_data('episodes')
df_episodes.shape

In [None]:
# Check what it looks like
df_episodes.head(2)

**Let's look closer at the first episode:**

In [None]:
df_episodes.iloc[0]

## 1.4 Exploding scenes 💣

What's in the 'scenes' column?

In [None]:
df_episodes.iloc[0]['scenes']

In [None]:
len(df_episodes.iloc[0]['scenes'])

That is neat! A JSON list of scenes, each with details about the characters who co-appear in that scene.

💡 _Later, we will use this data to create a **network** of characters that appear in the same scene._

But first, let's expand the scenes using the good ol' `explode()` function from `pandas`.

In [None]:
df_episodes.head(1).explode('scenes').head()

In case you forgot how `explode()` works, notice that a single row in the original data frame will be expanded into multiple rows, one for each element in the `scenes` list.  
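
Here is a minimal sketch of that behaviour on a made-up data frame (the episode numbers and scene labels are invented for illustration):

```python
import pandas as pd

# Made-up episodes, each with a list of scenes
toy = pd.DataFrame({
    'episodeNum': [1, 2],
    'scenes': [['scene A', 'scene B', 'scene C'], ['scene D']],
})

exploded = toy.explode('scenes')

print(toy.shape)                # one row per episode
print(exploded.shape)           # one row per scene
print(exploded.index.tolist())  # original index labels are repeated
```

Note that the exploded rows keep the index label of the row they came from. If you prefer a fresh index, `explode()` accepts `ignore_index=True`.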

You can confirm that with the following:

In [None]:
print(f"Exploding just the first row of df_episodes "
      f"produces a df with shape: {df_episodes.head(1).explode('scenes').shape}")

## 1.5 Building our final dataset

In [None]:
df_got = df_episodes.explode('scenes')

# What's the new shape?
df_got.shape

🔐 **Note to attentive reader:** do you fully understand what is going on in the cell below? Try running each element of the concat list separately and see if you can understand what is going on. If you don't and you are curious to learn more, just send me a message on Slack and I'll be more than happy to help you out.

In [None]:
def get_characters_list(x):
    """Extracts the characters from the scenes.

    Args:
        x (dict): Dictionary containing the characters.

    Returns:
        characters (list): List of characters.
    """

    # 🐰 Easter egg: there is more potentially useful information in the dictionary
    # Can you find it?
    characters = sorted([character['name'] for character in x['characters']])
    characters = pd.Series({'characters': characters})
    return characters

df_got = pd.concat([df_got.drop(columns='scenes'),
                    df_got['scenes'].apply(pd.Series).drop(columns='characters'),
                    df_got['scenes'].apply(get_characters_list)
                    ], axis=1)
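
If the cell above feels dense, the sketch below repeats the same three-piece pattern on a tiny made-up frame. The scene dictionaries are invented and contain far fewer keys than the real data, so treat it as an illustration of the idea rather than a replica of the dataset.

```python
import pandas as pd

# Made-up frame: one row per scene, with the scene stored as a dict
toy = pd.DataFrame({
    'episodeNum': [1, 1],
    'scenes': [
        {'sceneStart': '0:00:40', 'characters': [{'name': 'B'}, {'name': 'A'}]},
        {'sceneStart': '0:03:10', 'characters': [{'name': 'C'}]},
    ],
})

# Piece 1: everything except the nested column
piece_1 = toy.drop(columns='scenes')

# Piece 2: dict keys become columns; 'characters' is handled separately
piece_2 = toy['scenes'].apply(pd.Series).drop(columns='characters')

# Piece 3: keep only the sorted character names of each scene
piece_3 = toy['scenes'].apply(
    lambda scene: pd.Series(
        {'characters': sorted(c['name'] for c in scene['characters'])}
    )
)

# Side-by-side concatenation relies on the three pieces sharing an index
flat = pd.concat([piece_1, piece_2, piece_3], axis=1)
print(flat)
```

Inspecting `piece_1`, `piece_2` and `piece_3` one at a time is the quickest way to see what each element of the concat list contributes.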

In [None]:
print(f"In the end, we are left with {df_got.shape[0]} scenes and {df_got.shape[1]} columns about them.")

In [None]:
df_got.head(2)

# 📊 2: Some interactive visualizations

Tired of static visualizations? Let's make some interactive ones using a library called Altair.

[Altair](https://altair-viz.github.io/gallery/index.html) is a truly fantastic library for creating interactive visualizations in Python. It is built on top of [Vega-Lite](https://vega.github.io/vega-lite/), a JavaScript library for creating interactive visualizations.

Another plus side of Altair is that, similar to what I've been saying about plotnine, it forces you to think about the **data** and not the **plot** itself. 

💡 **Because of this, if you choose to do all your visualisation in Altair, you will also be adhering to the final project dataviz library requirements.**

## 2.1 How many scenes are there per episode?

Let's keep up the tradition of creating a `plot_df` dataframe with the data we want to plot. This way we minimise the risk of ruining our original data.

In [None]:
selected_cols = ['seasonNum', 'episodeNum', 'sceneStart', 
                 'sceneEnd', 'characters', 'location', 'subLocation']

plot_df = df_got[selected_cols].copy()
plot_df
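
A small sketch of why the `.copy()` is there, using a made-up data frame: changes made to the copy leave the original untouched. Depending on your pandas version, skipping `.copy()` and then assigning to a column may also trigger a `SettingWithCopyWarning`.

```python
import pandas as pd

# Made-up data, for illustration only
original = pd.DataFrame({'sceneStart': ['0:00:40', '0:03:10'],
                         'location': ['X', 'Y']})

# Select the columns we need and make an explicit, independent copy
toy_plot_df = original[['sceneStart']].copy()

# Overwrite the column in the copy with invented numeric values
toy_plot_df['sceneStart'] = [0.5, 3.0]

print(original['sceneStart'].tolist())     # still the original strings
print(toy_plot_df['sceneStart'].tolist())  # the modified values
```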

**Minor pre-processing:**

- `sceneStart` and `sceneEnd` are strings of the form `H:MM:SS`, so we need to convert them to numbers (minutes)
- it will be useful to have a `timeSpan` column, which is `sceneEnd` minus `sceneStart`, i.e. the duration of the scene in minutes
- I'll join the character names into a single string, so that we can use it as a tooltip in the plot later
- I'll add a `numCharacters` column, which is the number of characters in a scene

In [None]:
plot_df['sceneStart']    = plot_df['sceneStart'].apply(convert_time_to_minutes)
plot_df['sceneEnd']      = plot_df['sceneEnd'].apply(convert_time_to_minutes)
plot_df['timeSpan']      = plot_df['sceneEnd'] - plot_df['sceneStart']
plot_df['numCharacters'] = plot_df['characters'].apply(len) 
plot_df['characters']    = plot_df['characters'].apply(lambda x: ', '.join(x))

In [None]:
plot_df.head()

In the lecture, I will build the plot below step by step, but here is the final result:

In [None]:
# Define the selection dropdown
season_select = alt.binding_select(options=list(range(1, plot_df['seasonNum'].max()+1)))
season_filter = alt.selection_point(fields=['seasonNum'], bind=season_select, name='Season', value=1)

# Create the Altair chart
chart = alt.Chart(plot_df).properties(
    title='Scene visualizer',
    width=1000,
    height=400
).add_params(
    season_filter
)

# Add the rectangle marks
rects = chart.mark_rect(color='lightgray', opacity=0.8, stroke='black', strokeWidth=0.5).encode(
    y=alt.Y('episodeNum:N', axis=alt.Axis(title='Episode Number')),
    x=alt.X('sceneStart:Q', axis=alt.Axis(title='Time Span (min)', values=list(range(0, 100, 10)))),
    x2=alt.X2('sceneEnd'),  # x2 shares the scale and type of x, so no type suffix is needed
    color=alt.Color('numCharacters:Q', 
                    legend=alt.Legend(title='Num Characters'),
                    scale=alt.Scale(scheme='viridis')),
    tooltip=[alt.Tooltip('location', title='Scene Location'),
             alt.Tooltip('subLocation', title='Scene Sublocation'),
             alt.Tooltip('numCharacters', title='Number of Characters'),
             alt.Tooltip('characters', title='Names')]
).transform_filter(
    season_filter
)

chart = rects # + circles

chart = chart.configure_title(fontSize=40)
chart = chart.configure_axis(labelFontSize=20, titleFontSize=30, grid=False)
chart = chart.configure_legend(labelFontSize=20, titleFontSize=20)

chart

# 🕸️ 3: Network