# Bubble Chart

Source: [Data Viz Catalogue](https://datavizcatalogue.com/methods/bubble_chart.html)

- Multi-variable graph (a cross between a [Scatterplot](https://datavizcatalogue.com/methods/scatterplot.html) and a [Proportional Area Chart](https://datavizcatalogue.com/methods/area_chart.html))
- **Up to 4 variables**:
    - X axis (quantitative variable)
    - Y axis (quantitative variable)
    - circle size (quantitative variable)
    - circle color (qualitative variable)
- Bubble charts are used to **compare and show the relationships** between categorised circles through their position and proportions.
- **Limitations:**
    - data size capacity: too many bubbles can make the chart hard to read
        - This can be partly remedied with interactivity
        - Semi-transparent circles also help when bubbles overlap
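- **Important:** the size of each circle must be drawn so that its area, not its radius or diameter, is proportional to the value. Area grows with the square of the radius, so scaling the radius directly exaggerates differences.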

$Circle\_Area = \pi \times {Radius}^{2}$

$Circle\_Diameter = 2 \sqrt{\frac{Area}{\pi}}$
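
The two formulas above can be checked with a minimal sketch in plain Python. The helper names `diameter_from_area` and `area_from_diameter` are illustrative and not part of the notebook; the value 6 is the summed `sibsp` of the first group in the aggregated table further down.

```python
import math

def diameter_from_area(area):
    """Diameter of a circle with the given area: 2 * sqrt(area / pi)."""
    return 2 * math.sqrt(area / math.pi)

def area_from_diameter(diameter):
    """Inverse relation: pi * radius ** 2."""
    return math.pi * (diameter / 2) ** 2

# A value of 6 and a value four times larger.
small = diameter_from_area(6)
large = diameter_from_area(24)

print(round(small, 6))  # diameter for a value of 6
print(large / small)    # a 4x larger value needs only a 2x larger diameter
```

Because the area carries the value, quadrupling the value only doubles the diameter. Mapping the value straight to the diameter would instead make the larger bubble look sixteen times bigger.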

In [1]:
import math
import pandas as pd
from pandas.api.types import CategoricalDtype

from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.transform import transform
from bokeh.models import ColumnDataSource
from bokeh.models.mappers import CategoricalColorMapper
output_notebook()

# Dataset

The dataset comes from Kaggle: [Titanic - Suited for binary logistic regression](https://www.kaggle.com/heptapod/titanic)

In [2]:
import kaggle

In [3]:
kaggle.api.authenticate()
kaggle.api.dataset_download_files('heptapod/titanic', path='.', unzip=True)

## Data reading and preprocessing

In [4]:
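df = pd.read_csv('train_and_test2.csv', sep=',')

# The survival column is named '2urvived' in the CSV
df.rename(columns={'2urvived': 'survived'}, inplace=True)

# Decode the numeric codes into readable categories
df['Sex'] = df['Sex'].replace({0: 'Male', 1: 'Female'}).astype('category')
df['Pclass'] = df['Pclass'].replace({1: '1st', 2: '2nd', 3: '3th'}).astype('category')
# Bin the rounded ages into 5 equal-width intervals
df['Age'] = pd.cut(df['Age'].round(), 5)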
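
The interval labels that `pd.cut(..., 5)` produces, such as `(-0.08, 16.0]`, can be reproduced by hand. The sketch below is plain Python rather than pandas' own implementation, and it assumes the rounded ages span 0 to 80, which is consistent with the labels in the sample output. The helpers `equal_width_edges` and `bin_label` are hypothetical names used only for illustration.

```python
def equal_width_edges(lo, hi, bins):
    """Return bins + 1 edges over [lo, hi] with the lowest edge nudged down."""
    lo, hi = float(lo), float(hi)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    edges[-1] = hi
    # Lower the first edge by 0.1% of the range so that `lo` itself
    # falls inside the first right-closed interval.
    edges[0] = lo - 0.001 * (hi - lo)
    return edges

def bin_label(value, edges):
    """Label of the right-closed interval (left, right] containing value."""
    for left, right in zip(edges, edges[1:]):
        if left < value <= right:
            return f"({round(left, 2)}, {round(right, 2)}]"
    return None  # value lies outside all intervals

edges = equal_width_edges(0, 80, 5)  # assumed age range
for age in (0, 16, 17, 50):
    print(age, bin_label(age, edges))
```

Each interval is closed on the right, so an age of exactly 16 belongs to the first bin and 17 to the second.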

In [5]:
df[['Age', 'Pclass', 'sibsp', 'survived']].sample(10)

index,Age,Pclass,sibsp,survived
830,"(-0.08, 16.0]",3th,1,1
410,"(16.0, 32.0]",3th,0,0
547,"(16.0, 32.0]",2nd,0,1
119,"(-0.08, 16.0]",3th,4,0
134,"(16.0, 32.0]",2nd,0,0
348,"(-0.08, 16.0]",3th,1,1
261,"(-0.08, 16.0]",3th,4,1
217,"(32.0, 48.0]",2nd,1,0
856,"(32.0, 48.0]",1st,1,1
55,"(16.0, 32.0]",1st,0,1


In [6]:
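# Per (Age, Pclass) group: sum and count of siblings/spouses and of survivors
aux = df[['Age', 'Pclass', 'sibsp', 'survived']].groupby(by=['Age', 'Pclass']).agg(['sum', 'count']).copy()
aux.reset_index(inplace=True)
aux['Age'] = aux['Age'].astype(str)  # interval labels as strings for the categorical x axis
# Share of survivors in each group, in percent
aux['percent'] = aux['survived']['sum'] / aux['survived']['count'] * 100
# Despite the name, this is the bubble diameter: the circle's area equals the summed sibsp
aux['sibsp_area'] = 2 * (aux['sibsp']['sum'] / math.pi).pow(1./2)
aux = aux[aux['percent'] > 0]  # keep only groups with at least one survivor (NaN rows are dropped too)
aux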

index,Age,Pclass,sibsp (sum),sibsp (count),survived (sum),survived (count),percent,sibsp_area
0,"(-0.08, 16.0]",1st,6,11,8,11,72.727273,2.763953
1,"(-0.08, 16.0]",2nd,21,30,19,30,63.333333,5.170883
2,"(-0.08, 16.0]",3th,177,93,28,93,30.107527,15.012108
3,"(16.0, 32.0]",1st,51,127,57,127,44.88189,8.058239
4,"(16.0, 32.0]",2nd,61,157,41,157,26.11465,8.812923
5,"(16.0, 32.0]",3th,192,507,83,507,16.370809,15.63528
6,"(32.0, 48.0]",1st,43,106,47,106,44.339623,7.399277
7,"(32.0, 48.0]",2nd,20,64,21,64,32.8125,5.046265
8,"(32.0, 48.0]",3th,33,95,7,95,7.368421,6.482045
9,"(48.0, 64.0]",1st,38,71,23,71,32.394366,6.955796


# Data visualization

In [7]:
source = ColumnDataSource(data=aux)

p = figure(
    title='Survival chance on the Titanic',
    # `width`/`height` replace `plot_width`/`plot_height`, which Bokeh 3 removed
    width=600,
    height=400,
    x_range=aux['Age'].unique().tolist(),
)

color_mapper = CategoricalColorMapper(palette=["green", "blue", "red"], factors=["1st", "2nd", "3th"])

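# Column names carry a trailing underscore because ColumnDataSource
# flattens the two-level column index of `aux`.
p.circle(
    x='Age_',
    y='percent_',
    size='sibsp_area_',  # bubble diameter in screen units, derived from the area
    fill_color=transform('Pclass_', color_mapper),
    line_color=None,
    fill_alpha=0.6,  # transparency keeps overlapping bubbles readable
    legend_group='Pclass_',
    source=source,
)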

p.yaxis.axis_label = 'Survival percentage (%)'
p.xaxis.axis_label = 'Age ranges'

show(p)
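
The trailing underscores in `'Age_'`, `'percent_'` and `'sibsp_area_'` come from the two-level column index of `aux`. The sketch below imitates the naming rule in plain Python, on the assumption that the levels are joined with an underscore; that assumption matches the names used in the plotting call, but the code is an illustration and not Bokeh's own. `flatten_column` is a hypothetical helper.

```python
def flatten_column(col):
    """Join the levels of a multi-level column name with underscores."""
    if isinstance(col, tuple):
        return "_".join(col)
    return col

# Column tuples as they appear in `aux` after the aggregation;
# single-level columns have an empty second level.
columns = [
    ("Age", ""), ("Pclass", ""),
    ("sibsp", "sum"), ("sibsp", "count"),
    ("survived", "sum"), ("survived", "count"),
    ("percent", ""), ("sibsp_area", ""),
]

flat = [flatten_column(c) for c in columns]
print(flat)
```

An empty second level still contributes the separator, which is why `Age` becomes `Age_` while `('sibsp', 'sum')` becomes `sibsp_sum`.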