# Bokeh tutorial 2: Column data source
This is my walkthrough of Bokeh [tutorial 2](http://nbviewer.jupyter.org/github/bokeh/bokeh-notebooks/blob/master/tutorial/02%20-%20column%20data%20source.ipynb). `ColumnDataSource` is Bokeh's data source class, and in this tutorial I learn how to use it.

First, I import the required modules, including `bokeh.models.ColumnDataSource`.

In [1]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from random import random, sample
from bokeh.models import ColumnDataSource
output_notebook()

### Sourcing dictionary
First, a `ColumnDataSource` object can be created from a Python dictionary (`dict`), where each key names a column and each value holds that column's data.

In [2]:
# create a ColumnDataSource object from a Python dictionary
source = ColumnDataSource( 
            data = {
                'x' : list(range(1,6)),
                'y' : [random()*10 for x in range(5)],
                'radii' : [random()*0.8 for x in range(5)]
            })
print('x :', source.data['x'])
print('y :', [round(x,2) for x in source.data['y']])
print('radii :', [round(x,2) for x in source.data['radii']])

x : [1, 2, 3, 4, 5]
y : [0.17, 2.2, 4.6, 0.24, 3.73]
radii : [0.71, 0.51, 0.75, 0.74, 0.39]
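
When a dictionary is built by hand like this, it helps to confirm that every column has the same number of entries before handing it to `ColumnDataSource`, since each position across the columns describes one circle. The helper below is a minimal plain-Python sketch of such a check. `check_column_lengths` is a hypothetical name used for illustration and is not part of bokeh.

```python
def check_column_lengths(data):
    """Return the shared column length, or raise if the columns differ."""
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('columns differ in length: {}'.format(lengths))
    # an empty dictionary has no columns, so report a length of 0
    return next(iter(lengths.values()), 0)


# same layout as the dictionary above, with the printed sample values
data = {
    'x': [1, 2, 3, 4, 5],
    'y': [0.17, 2.2, 4.6, 0.24, 3.73],
    'radii': [0.71, 0.51, 0.75, 0.74, 0.39],
}
print(check_column_lengths(data))
```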


The `ColumnDataSource` object is then used as `source` when creating the `circle` renderer.

In [3]:
# create a plot using the ColumnDataSource as the source
p = figure(width=640, height=480, title='Sample Plot with ColumnDataSource')
p.circle('x', 'y', radius='radii', alpha=0.5, source=source)
show(p)

### Sourcing Iris Pandas data frame
A `ColumnDataSource` object can also be created from a Pandas data frame. Below, the `iris` dataset is imported.

In [4]:
# iris dataset is included as a sample dataset in bokeh
from bokeh.sampledata.iris import flowers as df
source = ColumnDataSource(df)
print(source.data.keys())

dict_keys(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species', 'index'])
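
The extra `index` key in the output comes from the data frame's index, which is carried over as a column alongside the named columns. The snippet below is a rough pandas-only sketch of that column layout, using a made-up three-row frame. It illustrates the resulting shape and is not how bokeh performs the conversion internally.

```python
import pandas as pd

# tiny stand-in for the iris frame (values chosen for illustration only)
toy = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0],
    'species': ['setosa', 'versicolor', 'virginica'],
})

# one list per named column, plus the index as its own column
columns = {name: toy[name].tolist() for name in toy.columns}
columns['index'] = toy.index.tolist()

print(list(columns.keys()))
print(columns['index'])
```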


The data is successfully imported and the columns are checked with `source.data.keys()`. The data is now plotted.

In [5]:
# add a colour mapper 
from bokeh.models import CategoricalColorMapper
from bokeh.palettes import Set2

df['sepal_size'] = df.sepal_length*df.sepal_width
fill_color_map = CategoricalColorMapper(
    factors=list(df.species.unique()),
    palette=Set2[3])

source = ColumnDataSource(df)
p = figure(width=640, height=480, title='Iris data plot')
p.circle('petal_length', 'petal_width',
         size='sepal_size', line_color='white',
         fill_color={'field': 'species', 'transform': fill_color_map},
         alpha=0.8, legend='species',
         source=source)
p.legend.location = 'top_left'
p.legend.border_line_color = 'white'

p.xaxis.axis_label = 'Petal length'
p.xaxis.axis_label_text_font_style = 'normal'
p.yaxis.axis_label = 'Petal width'
p.yaxis.axis_label_text_font_style = 'normal'

show(p)

I added a few extra parameters to make the graph more interesting:
+ **Fill colors by category**: `CategoricalColorMapper` to identify flower species
+ **Legend**: `legend` option within the `circle` renderer call for the fill color categories
+ **Size**: `size` option takes *sepal_size* provided as part of the `ColumnDataSource` object
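
Conceptually, the color mapper pairs each factor with the palette entry at the same position. The plain-Python sketch below illustrates that pairing only. It does not call bokeh, the hex codes are assumed to be the three `Set2[3]` colors, and `color_for` with its gray fallback is a hypothetical helper.

```python
# assumed to match bokeh.palettes.Set2[3]
palette = ['#66c2a5', '#fc8d62', '#8da0cb']
factors = ['setosa', 'versicolor', 'virginica']

# pair each species with the palette entry at the same position
lookup = dict(zip(factors, palette))


def color_for(species, default='gray'):
    """Return the fill color for a species, or a fallback for unknown values."""
    return lookup.get(species, default)


print(color_for('versicolor'))
```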

### Sourcing autompg data frame
I created a couple more plots using the `autompg` dataset, which is also included as a sample dataset.

In [6]:
from bokeh.sampledata.autompg import autompg as adf
# mapping origin values with the region names
adf['origin_str'] = ['North America' if x == 1
                     else 'Europe' if x == 2
                     else 'Asia'
                     for x in adf.origin]
# creating two additional columns to represent relative weight and hp 
# compared to the maximum value found in the dataset
adf['rel_weight'] = adf.weight * 25. / adf.weight.max()
adf['rel_hp'] = adf.hp * 25. / adf.hp.max()
adf.head()

| | mpg | cyl | displ | hp | weight | accel | yr | origin | name | origin_str | rel_weight | rel_hp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu | North America | 17.042802 | 14.130435 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 | North America | 17.962062 | 17.934783 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite | North America | 16.712062 | 16.304348 |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst | North America | 16.697471 | 16.304348 |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino | North America | 16.775292 | 15.217391 |
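
The two relative columns rescale each value so that the largest one in the dataset maps to 25, which keeps the marker sizes in a usable range. A minimal plain-Python sketch of the same arithmetic follows. `scale_to_max` is a hypothetical helper, and the sample weights are the first two rows of the table plus an assumed heaviest car of 5140, a value inferred from the `rel_weight` figures shown rather than stated in the text.

```python
def scale_to_max(values, target=25.0):
    """Rescale values so that the largest one equals target."""
    largest = max(values)  # computed once, outside the loop
    return [float(v) * target / largest for v in values]


# first two weights from the table, plus an assumed dataset maximum of 5140
weights = [3504, 3693, 5140]
rel_weight = scale_to_max(weights)
print([round(v, 6) for v in rel_weight])
```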


First, I plotted *mpg* against *weight* on the x-axis, with marker size representing relative horsepower. The fill colors represent the region of origin.

In [7]:
# plot mpg against hp, weight, and origin
p = figure(width=640, height=480, title='MPG vs. weight and hp by originating region')

fill_color_map = CategoricalColorMapper(
    factors=list(adf.origin_str.unique()),
    palette=Set2[3])

source = ColumnDataSource(adf)
p.circle('weight', 'mpg', size='rel_hp', line_color='white',
         fill_color={'field': 'origin_str', 'transform': fill_color_map},
         alpha=0.7, legend='origin_str',
         source=source)
p.xaxis.axis_label = 'Weight'
p.xaxis.axis_label_text_font_style = 'normal'
p.yaxis.axis_label = 'MPG'
p.yaxis.axis_label_text_font_style = 'normal'
p.legend.border_line_color = 'white'

show(p)

In the last plot, I included a `HoverTool` as well. I can display data values for each data point on the plot. The ci

In [8]:
# plot mpg against year, weight, and origin
from bokeh.models import HoverTool
hover = HoverTool(tooltips=[
    ("Name", "@name"),
    # @mpg reads the value from the data source; $y would show the cursor's y position
    ("MPG", "@mpg"),
    ("HP", "@hp"),
])
p = figure(width=640, height=480,
           title='MPG vs. year and weight by originating region',
           tools=[hover, 'pan', 'wheel_zoom', 'reset'])

# average the numeric columns per origin and year; text columns are skipped
adf_orig = adf.groupby(['origin', 'yr'])
adf_orig = adf_orig.mean(numeric_only=True)

source = ColumnDataSource(adf)

p.circle('yr', 'mpg', line_color='white', size='rel_weight', alpha=0.5,
         fill_color={'field': 'origin_str', 'transform': fill_color_map},
         legend='origin_str',
         source=source)


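# origin codes follow the mapping above: 1 = North America, 2 = Europe, 3 = Asia
adf_na = adf_orig.loc[1].reset_index(); adf_na['name'] = 'North America'
adf_eu = adf_orig.loc[2].reset_index(); adf_eu['name'] = 'Europe'
adf_as = adf_orig.loc[3].reset_index(); adf_as['name'] = 'Asia'
# reuse the mapper's factor-to-colour pairing so each line matches its circles
line_colors = dict(zip(fill_color_map.factors, fill_color_map.palette))
p.line('yr', 'mpg', line_color=line_colors['North America'], line_width=2, source=adf_na)
p.line('yr', 'mpg', line_color=line_colors['Europe'], line_width=2, source=adf_eu)
p.line('yr', 'mpg', line_color=line_colors['Asia'], line_width=2, source=adf_as)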

p.xaxis.axis_label = 'Year'
p.xaxis.axis_label_text_font_style = 'normal'
p.yaxis.axis_label = 'MPG'
p.yaxis.axis_label_text_font_style = 'normal'
p.xaxis.ticker = list(range(min(adf.yr), max(adf.yr)+1))

p.legend.location = 'top_left'
p.legend.border_line_color = 'white'

show(p)
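
The three trend lines rely on averaging within each origin and year and then pulling out one origin at a time. The snippet below sketches that step on a made-up frame (`cars` and its values are invented for illustration). `numeric_only=True` is passed so that text columns such as `name` are left out of the average, which recent pandas versions expect to be stated explicitly.

```python
import pandas as pd

# toy stand-in for the autompg frame (values are made up for illustration)
cars = pd.DataFrame({
    'origin': [1, 1, 1, 2, 2, 3],
    'yr': [70, 70, 71, 70, 70, 71],
    'mpg': [18.0, 16.0, 20.0, 26.0, 24.0, 31.0],
    'name': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# average the numeric columns within each (origin, yr) group
yearly = cars.groupby(['origin', 'yr']).mean(numeric_only=True)

# selecting one origin leaves a frame indexed by yr;
# reset_index turns yr back into an ordinary column for plotting
origin_1 = yearly.loc[1].reset_index()
print(origin_1)
```

Each per-origin frame then has one row per year, which is the shape a line glyph needs.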