## Summary notes

This **#TidyTuesday** project was posted back on 28th May, 2018.
Here's the motivating tweet from [@thomas_mock](https://twitter.com/thomas_mock):

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">The <a href="https://twitter.com/R4DScommunity?ref_src=twsrc%5Etfw">@R4DScommunity</a> welcomes you to week 9 of <a href="https://twitter.com/hashtag/tidytuesday?src=hash&amp;ref_src=twsrc%5Etfw">#tidytuesday</a>! Let&#39;s explore the world of comic book characters! This rich dataset comes from the <a href="https://twitter.com/FiveThirtyEight?ref_src=twsrc%5Etfw">@fivethirtyeight</a> package! <br>Data: <a href="https://t.co/sElb4fcv3u">https://t.co/sElb4fcv3u</a> <br>Article: <a href="https://t.co/4QD3cpbKE9">https://t.co/4QD3cpbKE9</a> <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/tidyverse?src=hash&amp;ref_src=twsrc%5Etfw">#tidyverse</a> <a href="https://twitter.com/hashtag/dataviz?src=hash&amp;ref_src=twsrc%5Etfw">#dataviz</a> <a href="https://twitter.com/hashtag/r4ds?src=hash&amp;ref_src=twsrc%5Etfw">#r4ds</a> <a href="https://twitter.com/hashtag/comics?src=hash&amp;ref_src=twsrc%5Etfw">#comics</a> <a href="https://t.co/VRki5rkJkS">pic.twitter.com/VRki5rkJkS</a></p>&mdash; Tom Mock (@thomas_mock) <a href="https://twitter.com/thomas_mock/status/1001451167547908096?ref_src=twsrc%5Etfw">May 29, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

We defined a `dataclass`, *Constants*, to hold our `str` literals.



## Dependencies

In [1]:
from dataclasses import dataclass
import numpy as np
import pandas as pd
import altair as alt

## Classes

In [2]:
@dataclass(frozen=True)
class Constants:
    # Annotated so that `remote` is registered as a dataclass field
    remote: str = (
        'https://raw.githubusercontent.com/rfordatascience/tidytuesday/'
        'master/data/2018/2018-05-29/week9_comic_characters.csv'
    )
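
As a minimal sketch of why a frozen `dataclass` suits constants, the example below uses a hypothetical class, `ExampleConstants`, with a placeholder URL (neither is part of the project). It assumes the attribute carries a type annotation, since only annotated attributes become dataclass fields.

```python
from dataclasses import FrozenInstanceError, dataclass, fields


@dataclass(frozen=True)
class ExampleConstants:
    # Hypothetical constant with a placeholder URL, for illustration only
    remote: str = 'https://example.com/data.csv'


example = ExampleConstants()

# Only annotated attributes are registered as dataclass fields
field_names = [f.name for f in fields(ExampleConstants)]

# frozen=True blocks reassignment on an instance
try:
    example.remote = 'https://example.com/other.csv'
    reassigned = True
except FrozenInstanceError:
    reassigned = False

print(field_names, reassigned)
```

Freezing the class means an accidental reassignment fails immediately instead of silently changing where the data is read from.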

## Functions

In [3]:
def get_continuity(x: str) -> str:
    if 'New Earth' in x:
        return 'DC, New Earth continuity'
    elif 'Earth-616' in x:
        return 'Marvel, Earth-616 continuity'
    else:
        return 'NA'

In [4]:
def is_female(x: str) -> bool:
    # Missing values never equal the label, so they map to False
    return x == 'Female Characters'
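
A small sketch of how the two helpers behave, run on a toy frame. The character names and the `toy` frame are invented for illustration and are not rows from the dataset; the helpers are restated so the block runs on its own.

```python
import pandas as pd


def get_continuity(x: str) -> str:
    if 'New Earth' in x:
        return 'DC, New Earth continuity'
    elif 'Earth-616' in x:
        return 'Marvel, Earth-616 continuity'
    return 'NA'


def is_female(x) -> bool:
    return x == 'Female Characters'


# Invented rows: one per continuity, plus one that matches neither
toy = pd.DataFrame({
    'name': ['Hero A (New Earth)', 'Hero B (Earth-616)', 'Hero C'],
    'sex': ['Female Characters', 'Male Characters', None],
})

# Series.map applies each helper element by element
toy['continuity'] = toy['name'].map(get_continuity)
toy['is_female'] = toy['sex'].map(is_female)
print(toy)
```

Note that a missing `sex` is counted as not female, so it still contributes to the denominator of any proportion computed later.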

In [5]:
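def prop_females_per_year_by_publisher(df: pd.DataFrame) -> pd.Series:
    # Use the `df` argument rather than a global, and restart the running
    # totals for each publisher so one group does not leak into the next
    grouped = df.groupby(['publisher', 'year'])['is_female']
    return (
        grouped.sum().groupby(level='publisher').cumsum()
        / grouped.size().groupby(level='publisher').cumsum()
    )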
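
A cumulative proportion per group needs its running totals to restart for each group; a plain `.cumsum()` on a multi-indexed result keeps accumulating across groups. The sketch below illustrates this on invented data, using a hypothetical helper named `cumulative_share`.

```python
import pandas as pd

# Invented data: two publishers, two years each
toy = pd.DataFrame({
    'publisher': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2000, 2000, 2001, 2000, 2001, 2001],
    'is_female': [True, False, False, True, True, False],
})


def cumulative_share(df: pd.DataFrame) -> pd.Series:
    """Cumulative share of female characters, restarted per publisher."""
    grouped = df.groupby(['publisher', 'year'])['is_female']
    females = grouped.sum().groupby(level='publisher').cumsum()
    totals = grouped.size().groupby(level='publisher').cumsum()
    return females / totals


share = cumulative_share(toy)

# For comparison: running totals that carry over from A into B
grouped = toy.groupby(['publisher', 'year'])['is_female']
naive = grouped.sum().cumsum() / grouped.size().cumsum()

print(share)
print(naive)
```

For publisher B in 2000 the per-group version gives 1.0 (one female character out of one), while the naive version gives 0.5 because it still includes publisher A's totals.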

## Main

### Initialise the constants

In [6]:
constants = Constants()

### Load the data

In [7]:
#| code-summary: 'Read the remote CSV into a DataFrame'
comics = pd.read_csv(constants.remote, index_col=0)
comics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23272 entries, 1 to 23272
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   publisher         23272 non-null  object 
 1   page_id           23272 non-null  int64  
 2   name              23272 non-null  object 
 3   urlslug           23272 non-null  object 
 4   id                17489 non-null  object 
 5   align             17086 non-null  object 
 6   eye               9877 non-null   object 
 7   hair              16734 non-null  object 
 8   sex               22293 non-null  object 
 9   gsm               154 non-null    object 
 10  alive             23266 non-null  object 
 11  appearances       21821 non-null  float64
 12  first_appearance  23203 non-null  object 
 13  month             22457 non-null  object 
 14  year              22388 non-null  float64
 15  date              22386 non-null  object 
dtypes: float64(2), int64(1), object(13)

### Process the data

We add two columns to support the analysis: *is_female* and *continuity*.

In [12]:
#| code-summary: 'Take a copy of comics and add columns for analysis'
v_comics = comics.copy()  # copy, so that `comics` itself is left unchanged
# identify continuity
v_comics['continuity'] = v_comics['name'].map(get_continuity)
# identify if female character
v_comics['is_female'] = v_comics['sex'].map(is_female)
v_comics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23272 entries, 1 to 23272
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   publisher         23272 non-null  object 
 1   page_id           23272 non-null  int64  
 2   name              23272 non-null  object 
 3   urlslug           23272 non-null  object 
 4   id                17489 non-null  object 
 5   align             17086 non-null  object 
 6   eye               9877 non-null   object 
 7   hair              16734 non-null  object 
 8   sex               22293 non-null  object 
 9   gsm               154 non-null    object 
 10  alive             23266 non-null  object 
 11  appearances       21821 non-null  float64
 12  first_appearance  23203 non-null  object 
 13  month             22457 non-null  object 
 14  year              22388 non-null  float64
 15  date              22386 non-null  object 
 16  continuity        23272 non-null  object

### Visualise the data

In [15]:
#| code-summary: 'New Comic Book Characters Introduced Per Year By Continuity'
gsource = v_comics[['continuity', 'year']].dropna()
gsource['year'] = pd.to_datetime(gsource['year'], format='%Y')
gsource = (
    gsource
    .get(['continuity', 'year'])
    .query("continuity != 'NA' and year < '2011-01-02'")
    .groupby(['continuity', 'year'])
    .size()
    .rename('new_chars')
    .to_frame().reset_index()
)

alt.Chart(gsource).mark_bar().encode(
    x='year',
    y='new_chars',
    color=alt.Color('continuity', legend=None)
).properties(
    width=300
).facet(
    facet='continuity',
    title='New Comic Book Characters Introduced Per Year By Continuity'
)

In [22]:
#| code-summary: 'Gender Ratio In Comic Books'
gsource = v_comics[['continuity', 'year', 'is_female']].dropna()
gsource['year'] = pd.to_datetime(gsource['year'], format='%Y')
gsource = gsource.query(
    "continuity != 'NA' and ('1939-01-01' <= year <= '2011-01-02')"
)
grouped = gsource.groupby(['continuity', 'year'])['is_female']
gsource = (
    # running totals restart for each continuity
    grouped.sum().groupby(level='continuity').cumsum()
    / grouped.size().groupby(level='continuity').cumsum()
)
gsource = (
    gsource
    .mul(100).round(1)         # percentify
    .rename('pct_females')
    .to_frame().reset_index()  # altair needs a DF
)

alt.Chart(gsource).mark_line().encode(
    x='year',
    y='pct_females',
    color=alt.Color('continuity')
).properties(
    width=600
)

In [25]:
#| code-summary: 'Percentage of new female characters per year'
gsource = v_comics[['continuity', 'year', 'is_female']].dropna()
gsource['year'] = pd.to_datetime(gsource['year'], format='%Y')
gsource = gsource.query(
    "continuity != 'NA' and ('1980-01-01' <= year <= '2011-01-01')"
)
gsource = (
    gsource.groupby(['continuity', 'year'])['is_female'].sum()
    / gsource.groupby(['continuity', 'year']).size()
)
gsource = (
    gsource
    .mul(100).round(1)
    .rename('pct_females')
    .to_frame().reset_index()
)

alt.Chart(gsource).mark_line().encode(
    x='year',
    y='pct_females',
    color=alt.Color('continuity')
).properties(
    width=600
)

In [None]:
#| code-summary: 'Character alignment by sex and publisher'
gsource = (
    v_comics
    .get(['publisher', 'sex', 'id', 'align']).dropna()
    .query("sex in ['Female Characters', 'Male Characters']")
    .query("id not in ['No Dual Identity', 'Identity Unknown']")
    .query("align != 'Reformed Criminals'")
)
gsource = (
    gsource.groupby(['publisher', 'sex', 'align']).size()
    / gsource.groupby(['publisher', 'sex']).size()
)
gsource = (
    gsource
    .mul(100).round(1)
    .rename('pct')
    .to_frame().reset_index()
)

bars = alt.Chart(gsource).mark_bar().encode(
    x=alt.X('pct', stack='zero'),
    y=alt.Y('sex'),
    color=alt.Color('align'),
    facet=alt.Facet('publisher', columns=1)
)

bars
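
The within-group percentages can also be computed without dividing two differently indexed Series. The sketch below, on invented data, divides each count by its group total via `transform('sum')`; it is one possible alternative, not the approach used in the cell above.

```python
import pandas as pd

# Invented data: a single publisher, two sexes, two alignments
toy = pd.DataFrame({
    'publisher': ['A', 'A', 'A', 'A'],
    'sex': ['Female Characters', 'Female Characters',
            'Female Characters', 'Male Characters'],
    'align': ['Good Characters', 'Good Characters',
              'Bad Characters', 'Bad Characters'],
})

# Count characters per (publisher, sex, align)
counts = toy.groupby(['publisher', 'sex', 'align']).size()

# Total per (publisher, sex), broadcast back onto the same index as counts
totals = counts.groupby(level=['publisher', 'sex']).transform('sum')

pct = counts.div(totals).mul(100).round(1).rename('pct')
print(pct.to_frame().reset_index())
```

Because `totals` shares its index with `counts`, the division is element-wise and each publisher and sex pair sums to roughly 100.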