**Coursebook: Data Visualization in Python**
- Part 3 of Python Fundamental Course
- Course Length: 6 hours
- Last Updated: July 2019

___

- Developed by [Algoritma](https://algorit.ma)'s product division and instructors team

# Background

The coursebook is part of the **Python Fundamental Course** prepared by [Algoritma](https://algorit.ma). The coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference etc.

## Training Objectives

The primary objective of this coursebook is to provide a full hands-on experience in using visual exploratory techniques, helping participants gain proficiency in data visualization tools in Python. The library we'll be using is Bokeh, an interactive visualization library.

The objectives of this training are divided into 2 main areas of focus:

- **Introduction to Bokeh**  
  - Grammar of graphics  
  - Axis title and text formatting  
  - Making use of `ColumnDataSource`  
- **Plotting Essentials**  
  - Basic plotting  
  - Choosing appropriate plot  
  - Using widget tools  

At the end of this course, you'll work on the **Learn by Building** module as your graded assignment. You'll be given a dataset to apply what you've learned: create a visualization with the appropriate annotations and aesthetics to generate a powerful insight and communicate a story.

## Introduction to Bokeh

### Why Learn Bokeh?

Visualizing a dataset with a programming language can have a steeper learning curve than most tools. Some of you might wonder: why use Python to visualize when you can use drag-and-drop tools such as Tableau or Power BI? There are 2 main advantages of visualizing with a programming language:  
- Reproducibility  
- Customization  

Since we'll be creating our own visualization from scratch, we have almost limitless capability to create any kind of visualization we can imagine. There are also countless variations of each plot to try, and iterating through plot types can surface insights we never thought of. This process is called **visual exploratory data analysis**. Once we're satisfied with the result, the plot can always be reproduced with the same script, retelling the same data perspective.

In this hands-on exercise, we'll be going through a visual exploratory data analysis process drawing out all available insights within our dataset.

### Grammar of Graphics

Bokeh adopts a versatile system where users can alter graphic components layer by layer with a high level of modularity. On top of that, it renders to HTML and works extremely well with Jupyter Notebook.

A bokeh plot can have the following components:  
- A `figure` object  
- One or multiple glyph methods  

To understand the grammar of graphics, we'll use one of the sample datasets provided by the Bokeh library:

In [11]:
from bokeh.sampledata.autompg import autompg

autompg.head()

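| | mpg | cyl | displ | hp | weight | accel | yr | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |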


To produce a Bokeh plot in a notebook, we need to call the `output_notebook()` function:

In [12]:
from bokeh.io import output_notebook

output_notebook()

Now we'll create our `figure` object, and specify necessary parameters:

In [13]:
from bokeh.plotting import figure

p = figure(title = "Miles per Gallon Vehicle Test",
           plot_height = 200)
type(p)

bokeh.plotting.figure.Figure

To display our newly made `figure` object, we'll use `show()` function:

In [14]:
from bokeh.io import show

show(p)





Notice how it generates an empty HTML widget. Next, we can add glyphs to our `figure` object. There are multiple types of glyphs, to name a few:

- circle  
- hex  
- line  
- image  
- bar  
- segments  

Now let's try using our `circle()` glyph on our `p` figure:

In [15]:
p.circle(x = autompg['mpg'], y= autompg['hp'])
show(p)

![ ](assets/1.png)

Notice how we created a scatter plot to visualize the relationship between miles per gallon and horsepower, which we passed in as the parameters of the `circle` function. We can also add annotations, such as axis labels, to the plot through the `figure` function:

In [16]:
p = figure(title = "Miles per Gallon Vehicle Testing",
           x_axis_label = "Miles per Gallon",
           y_axis_label = "Horsepower",
           plot_height=200)

p.circle(x = autompg['mpg'], y= autompg['hp'])
show(p)

![ ](assets/2.png)

Most of the time, however, we'll use `ColumnDataSource` to refer to our data source. It works well with a pandas DataFrame or a dictionary. Let's see an example that uses our pandas DataFrame as the data source:

In [17]:
from bokeh.models import ColumnDataSource

p = figure(title = "Miles per Gallon Vehicle Testing",
           x_axis_label = "Miles per Gallon",
           y_axis_label = "Horsepower",
           plot_height=200)

p.circle(x ='mpg', y='hp', source=ColumnDataSource(autompg))
show(p)

![ ](assets/3.png)
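The example above passes a DataFrame to `ColumnDataSource`. As a minimal sketch of the dictionary form, the block below builds a source from a small hand-written dictionary (the values are toy data, not the full `autompg` dataset) and draws two glyphs from it on one figure. It uses `scatter()` rather than `circle()` purely as an illustrative choice, and leaves `show(p)` out so it can run outside a notebook.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Toy data (hypothetical): each key becomes a column, and all lists
# must have the same length.
data = {
    "mpg": [15.0, 16.0, 17.0, 18.0, 18.0],
    "hp": [165, 150, 140, 130, 150],
}
source = ColumnDataSource(data=data)

p = figure(title="ColumnDataSource built from a dictionary",
           x_axis_label="Miles per Gallon",
           y_axis_label="Horsepower")

# Two glyphs on the same figure, both reading columns from the same source.
p.line(x="mpg", y="hp", source=source)
p.scatter(x="mpg", y="hp", source=source, size=8)

# In a notebook, calling show(p) would render the figure.
print(sorted(source.column_names))
```

Because both glyphs share one source, they always draw from the same columns.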

**Dive Deeper:**

Notice how the widget provides feature buttons on the right side? Take your time to explore each of the buttons!

You can easily edit which tools you'd like to use in your plot! Take a look at the [official documentation](https://bokeh.pydata.org/en/latest/docs/user_guide/tools.html) and modify the above code to have the following tools:
- Pan  
- Lasso Select  
- Reset  
- Wheel Zoom  
- Save  

*Hint: use the `tools` parameter of the `figure()` function and pass a list of the tool names you'd like to use!*

In [36]:
# Your code here


# create a new plot with the toolbar below, keeping only the requested tools
p = figure(width=400, height=400,
           title=None, toolbar_location="below",
           tools="pan,lasso_select,reset,wheel_zoom,save")

p.circle(x ='mpg', y='hp', source=ColumnDataSource(autompg))
show(p)

## Plotting Essentials

In this second part of the coursebook, we'll explore other Bokeh capabilities and learn how to incorporate them into the exploratory data analysis process. There are numerous types of plots we'll be working on. Let's first read our dataset using pandas:

In [19]:
import pandas as pd

retail = pd.read_csv("online_retail.csv")

print(retail.shape)

(240007, 8)


The data contains 8 columns and about 240,000 rows. Let's inspect the columns we are dealing with:

In [20]:
print(retail.dtypes)

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object


This dataset consists of transaction data from an online retail shop, provided by the UCI Machine Learning [repository](https://archive.ics.uci.edu/ml/datasets/online+retail). Before plotting the data, we need to understand each variable thoroughly. The following is the data glossary provided for this dataset:

- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.  
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.  
- Description: Product (item) name. Nominal.  
- Quantity: The quantity of each product (item) per transaction. Numeric.  
- InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.  
- UnitPrice: Unit price. Numeric, product price per unit in sterling.  
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.  
- Country: Country name. Nominal, the name of the country where each customer resides.  



Let's apply what we learned in the previous course:

In [21]:
retail['InvoiceDate'] = pd.to_datetime(retail['InvoiceDate'])
retail['InvoiceDate'].describe()

count                  240007
unique                  11240
top       2010-12-06 16:57:00
freq                      675
first     2010-12-01 08:26:00
last      2011-06-26 10:59:00
Name: InvoiceDate, dtype: object

By converting the column into datetime type, we can now see that the invoices were created between December 2010 and late June 2011.

**Dive Deeper:**

Before proceeding to data visualization, let's do one more knowledge check. Can you answer the following questions by inspecting our data?

1. Can you confirm whether or not each row corresponds to a unique invoice number?  
2. Can someone order multiple items with 1 invoice?  

If you were able to answer the previous questions without any trouble, good job! Remember, always be skeptical when you receive new data. Understand its structure thoroughly to avoid misinterpreting it.
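One possible way to check both questions is sketched below on a tiny hand-made frame. The `toy` data is hypothetical and only mimics the `InvoiceNo` and `StockCode` columns; the same two calls could be applied to `retail`.

```python
import pandas as pd

# Hypothetical miniature of the retail data: one row per item on an invoice.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "C536379"],
    "StockCode": ["85123A", "71053", "22633", "D"],
})

# True only if no invoice number repeats across rows.
print(toy["InvoiceNo"].is_unique)

# Number of rows (items) recorded under each invoice number.
rows_per_invoice = toy.groupby("InvoiceNo").size()

# Invoices that span more than one row, i.e. more than one item.
print(rows_per_invoice[rows_per_invoice > 1])
```

If `is_unique` is `False`, at least one invoice spans several rows, so a row is an item on an invoice rather than an invoice.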

Notice how `InvoiceNo` includes cancelled invoices. When you explore retail store sales, do you want them to be part of your analysis? In this exercise, we shall remove the rows whose `InvoiceNo` contains a 'C':

In [22]:
retail = retail[~retail.InvoiceNo.str.contains('c', case=False)]
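The filter above drops any invoice number that contains the letter anywhere in the code. The glossary only says cancellations *start* with that letter, so a stricter variant is sketched below on hypothetical toy values. Whether the two filters differ on the real data is not something this coursebook states.

```python
import pandas as pd

# Hypothetical invoice numbers, including upper- and lower-case cancellations.
toy = pd.DataFrame({"InvoiceNo": ["536365", "C536379", "c536380", "536381"]})

# Match only a leading 'C', ignoring case.
is_cancelled = toy["InvoiceNo"].str.upper().str.startswith("C")

# Keep the rows that are not cancellations.
kept = toy[~is_cancelled]
print(kept["InvoiceNo"].tolist())
```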

Moving on to the next problem, we'll address our earlier **Dive Deeper** questions. Our data clearly shows the same `InvoiceNo` recorded in different rows. This means that our data stores the sales of each item per invoice.

This introduces a new perspective, in which we have to rethink whether or not analyzing our data as is will be an appropriate approach. For example, say we are interested in the number of sales generated by our business. To do that, it is easier to have a separate data frame that holds one unique invoice per row:

In [23]:
retail['TotalPrice'] = retail.Quantity * retail.UnitPrice

invoice_grouped = retail.groupby('InvoiceNo').agg({
    'StockCode': lambda x: x.unique().size,
    'TotalPrice': 'sum',
    'CustomerID': 'first',
    'Country': 'first',
    'InvoiceDate': 'first'
}).rename(columns={
    'StockCode': 'ItemBought'
})

invoice_grouped.head()

| InvoiceNo | ItemBought | TotalPrice | CustomerID | Country | InvoiceDate |
|---|---|---|---|---|---|
| 536365 | 7 | 139.12 | 17850.0 | United Kingdom | 2010-12-01 08:26:00 |
| 536366 | 2 | 22.2 | 17850.0 | United Kingdom | 2010-12-01 08:28:00 |
| 536367 | 12 | 278.73 | 13047.0 | United Kingdom | 2010-12-01 08:34:00 |
| 536368 | 4 | 70.05 | 13047.0 | United Kingdom | 2010-12-01 08:34:00 |
| 536369 | 1 | 17.85 | 13047.0 | United Kingdom | 2010-12-01 08:35:00 |
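The same per-invoice roll-up can also be written with pandas named aggregation, where `'nunique'` takes the place of the lambda. This is a sketch on hypothetical toy rows and assumes a pandas version that supports the `agg(new_name=(column, function))` form.

```python
import pandas as pd

# Hypothetical item-level rows: one row per product line on an invoice.
items = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366"],
    "StockCode": ["85123A", "71053", "85123A", "22633"],
    "Quantity": [6, 6, 2, 6],
    "UnitPrice": [2.55, 3.39, 2.55, 1.85],
})
items["TotalPrice"] = items["Quantity"] * items["UnitPrice"]

# Named aggregation: output column = (input column, aggregation).
per_invoice = items.groupby("InvoiceNo").agg(
    ItemBought=("StockCode", "nunique"),   # distinct products on the invoice
    TotalPrice=("TotalPrice", "sum"),      # invoice total
)
print(per_invoice)
```

Naming the output columns in the `agg` call removes the need for a separate `rename` step.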


### Data Distribution using Scatter Plot

Say, for example, I want to know my customers' purchasing behaviour: what is the correlation between the total invoice amount and the number of items bought? How does one affect the other, and what is the common behaviour seen in our existing customers?

This question leads me to create a plot that illustrates how the amount paid is distributed over the quantity bought. To do that, we will use a *scatter plot*. The scatter plot is probably one of the most used types of plot for understanding the relationship between numeric variables.

Let's go ahead to create our plot applying what we've learned about Bokeh:

In [24]:
from bokeh.models import ColumnDataSource

p = figure(title='Customer Purchasing Behaviour',
          x_axis_label='Number of Item Bought',
          y_axis_label='Total Paid',
          plot_height=500,
          plot_width=800)
p.circle('ItemBought', 'TotalPrice', source = ColumnDataSource(invoice_grouped))
show(p)

![ ](assets/4.png)

The key, as with data visualization in general, is to make our plot effective. An effective plot complements how human visual perception works. The plot above is ineffective because it could communicate more with less "visual clutter" at the bottom left.

To deal with that, we need to create a smaller sample or funnel our analysis to a certain subset. This time we'll focus on transactions from certain countries:

In [25]:
invoice_grouped.Country.value_counts().head(10)

United Kingdom    9663
Germany            197
France             178
EIRE               115
Belgium             50
Netherlands         43
Spain               38
Australia           28
Portugal            27
Switzerland         23
Name: Country, dtype: int64

See how most of our transactions came from United Kingdom. We can safely assume that our main market is United Kingdom, since that is where the retailer is based. We can proceed with one of 2 approaches:  
1. We can do a random sampling of the main market (United Kingdom)  
2. We define the secondary market and plot accordingly  

In [26]:
second_market = ['Germany', 'France', 'EIRE', 'Belgium', 'Netherlands']
invoice_second = invoice_grouped[invoice_grouped.Country.isin(second_market)]
invoice_second.shape

(583, 5)
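The cell above follows the second approach. The first approach, random sampling of the main market, is not shown in this coursebook. A minimal sketch with `DataFrame.sample` is given below: the `invoices` frame is a hypothetical stand-in for `invoice_grouped`, and the sample size and seed are arbitrary choices.

```python
import pandas as pd

# Hypothetical stand-in for invoice_grouped.
invoices = pd.DataFrame({
    "Country": ["United Kingdom"] * 8 + ["Germany", "France"],
    "ItemBought": [7, 2, 12, 4, 1, 9, 3, 5, 6, 8],
    "TotalPrice": [139.12, 22.2, 278.73, 70.05, 17.85,
                   95.0, 40.1, 60.0, 80.5, 120.0],
})

# Restrict to the main market first, then draw the sample.
uk = invoices[invoices["Country"] == "United Kingdom"]

# random_state fixes the seed so the same rows are drawn on every run,
# which keeps the resulting plot reproducible.
uk_sample = uk.sample(n=4, random_state=42)
print(uk_sample.shape)
```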

Let's talk about adding a visual dimension to our plot using colors. Bokeh comes with a good module for color palettes. For the complete reference, head to the [official documentation](https://bokeh.pydata.org/en/latest/docs/reference/palettes.html) or print out the `__palettes__` attribute of the `bokeh.palettes` module. First, let's prepare a dictionary to map each country to a color:

In [27]:
from bokeh.palettes import Paired

col = Paired[5]
col_dict = {}
for i in range(0,5):
    col_dict[second_market[i]] = col[i]
    
col_dict

{'Germany': '#a6cee3',
 'France': '#1f78b4',
 'EIRE': '#b2df8a',
 'Belgium': '#33a02c',
 'Netherlands': '#fb9a99'}

In [28]:
# Work on an explicit copy so that adding a column does not trigger
# pandas' SettingWithCopyWarning
invoice_second = invoice_second.copy()
invoice_second['Color'] = invoice_second['Country'].map(col_dict)

p = figure(title='Customer Purchasing Behaviour',
          x_axis_label='Number of Item Bought',
          y_axis_label='Total Paid',
          plot_height=500,
          plot_width=800)
p.circle(x='ItemBought', 
         y='TotalPrice', 
         color='Color',
         legend='Country',
         source = ColumnDataSource(invoice_second))
show(p)


![ ](assets/5.png)

Notice how we get a more insightful plot by limiting the sample space of our data to avoid visual clutter and adding a color dimension to highlight a more specific set of information. The majority of our customers can be seen in the bottom left, producing a shape similar to that of our entire set of transactions, but we can easily identify several invoices from Netherlands with relatively large totals.

Now let's try to improve the plot by adding a hover tool. A hover tool is highly recommended if you want to identify specific data points in the plot. To do that, we'll need to add `hover` to our tools list:

In [67]:
tooltips = [
    ('Customer', '@CustomerID'),
    ('Total Paid', '@TotalPrice{1,000}'),
    ('Total Item', '@ItemBought')
]

tools = [
    'pan', 'wheel_zoom', 'reset', 'hover', 'save'
]

p = figure(title='Customer Purchasing Behaviour',
          x_axis_label='Number of Item Bought',
          y_axis_label='Total Paid',
          plot_height=500,
          plot_width=800,
          tools=tools,
          tooltips=tooltips)
p.circle(x='ItemBought', 
         y='TotalPrice', 
         color='Color',
         legend='Country',
         source = ColumnDataSource(invoice_second))
show(p)



![ ](assets/6.png)

Notice how the large invoices from Netherlands were all created for the same customer, numbered `14646`? This means we can identify a customer with a higher purchasing level than other customers.

**Dive Deeper:**

Can you create a visualization from a sample of our customers from United Kingdom? Take the last 3 months of transactions and answer the following:

- Does it follow the same pattern as the plots from our secondary market?  
- Are there any apparent valuable customers in the sample data?  

In [63]:
## Your code below
invoice_third = invoice_grouped.copy()
invoice_third['MonthTime'] = invoice_third.InvoiceDate.dt.to_period('M')
invoice_third['Country'] = invoice_third['Country'].astype('category')
invoice_third.sort_values(by='MonthTime', ascending=False, inplace=True)

# Keep United Kingdom invoices from the last 3 months of data (April to June 2011)
invoice_third = invoice_third[(invoice_third['MonthTime'] >= '2011-04') & (invoice_third['Country'] == 'United Kingdom')]
invoice_third.shape

(4387, 6)

### Trend Analysis using Line Plot

The second type of plot we'll talk about is the line plot. This plot is heavily used for analyzing trends or data movement over time, commonly known as time series. Notice how we have an `InvoiceDate` column. We shall use this column as our time anchor. To do that, we'll need to floor the time using the trick we learned in the previous chapter.

Since we have identified customer `14646` from Netherlands, who seems to make repeated purchases, let's see what their purchasing trend looks like:

In [31]:
invoice_grouped['MonthTime'] = invoice_grouped.InvoiceDate.dt.to_period('M')

invoice_14646 = invoice_grouped[invoice_grouped.CustomerID == 14646]
invoice_14646.head()

| InvoiceNo | ItemBought | TotalPrice | CustomerID | Country | InvoiceDate | MonthTime |
|---|---|---|---|---|---|---|
| 539491 | 16 | 70.96 | 14646.0 | Netherlands | 2010-12-20 10:09:00 | 2010-12 |
| 539731 | 54 | 8520.92 | 14646.0 | Netherlands | 2010-12-21 15:05:00 | 2010-12 |
| 541206 | 79 | 10389.06 | 14646.0 | Netherlands | 2011-01-14 12:24:00 | 2011-01 |
| 541570 | 33 | 7722.04 | 14646.0 | Netherlands | 2011-01-19 12:34:00 | 2011-01 |
| 541608 | 2 | 305.28 | 14646.0 | Netherlands | 2011-01-20 09:54:00 | 2011-01 |


Next, let's create the aggregated data frame to get the total invoice amount generated by this customer per month:

In [32]:
monthly_14646 = invoice_14646[['TotalPrice','MonthTime']].groupby('MonthTime').sum()
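The flooring and aggregation steps can be seen in isolation on a few hypothetical invoices. The sketch below also converts the resulting `PeriodIndex` to month-start timestamps. That last step is an optional assumption on our part, in case a plain datetime index is preferred when plotting against a `datetime` axis.

```python
import pandas as pd

# Hypothetical invoices for a single customer.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2010-12-20 10:09", "2010-12-21 15:05", "2011-01-14 12:24",
    ]),
    "TotalPrice": [70.0, 8500.0, 10000.0],
})

# Floor each timestamp to its month, then total the invoices per month.
toy["MonthTime"] = toy["InvoiceDate"].dt.to_period("M")
monthly = toy.groupby("MonthTime")["TotalPrice"].sum()

# Optional: turn the PeriodIndex into month-start timestamps.
monthly_ts = monthly.copy()
monthly_ts.index = monthly.index.to_timestamp()
print(monthly_ts)
```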

To plot this we will add a `line` glyph to our plot:

In [33]:
p = figure(x_axis_type = "datetime",
           x_axis_label = "Month",
           y_axis_label = "Total Revenue",
           title = "Customer 14646 Monthly Revenue",
           tools = tools,
           tooltips = [('Value', '@TotalPrice{1,000}')],
           plot_height=300,
           plot_width=800)

p.line(x='MonthTime',
       y='TotalPrice',
       source=ColumnDataSource(monthly_14646),
       color='#42b549',
       line_width=2)

show(p)

![ ](assets/7.png)

A figure, as stated earlier in the chapter, can contain multiple glyphs. In the scatter plot we created previously, notice a highly valuable customer from EIRE, numbered `14156`, who also tends to deviate from the normal behaviour of the rest of the customers. You can create a separate glyph for each customer's data, or use the `MultiLine` glyph (through the `multi_line()` method), which processes a list of lists. To do that, we'll need to use a dictionary for our `ColumnDataSource`:

In [103]:
customers = [14646, 14156]

val = []
time_axis = []
for cust in customers:
    # Monthly total revenue for this customer
    value = invoice_grouped[invoice_grouped.CustomerID == cust][['TotalPrice','MonthTime']].groupby('MonthTime').sum()
    val.append(value.TotalPrice.tolist()) 
    # Month labels are taken from the whole dataset; this assumes the customer
    # has at least one invoice in every month, so both lists have equal length
    time = invoice_grouped.MonthTime.unique().strftime('%m/%Y')
    time_axis.append(time.reshape(-1,1)) 

# Every list holds one entry per customer
line_dict = {
    'customers': customers,
    'value' : val,
    'month': time_axis,
    'color': Paired[5][0:2]
}

`ColumnDataSource` works with both a DataFrame and a dictionary, and to plot a multi-line glyph we'll need to map each of the values into a dictionary. Instead of `x` and `y` parameters, the `multi_line` glyph needs `xs` and `ys` parameters, each containing a list of lists. In this case, let's take a look at our `line_dict` dictionary, including its `value` entry:

In [105]:
line_dict['value']
line_dict['color']
line_dict

{'customers': [14646, 14156],
 'value': [[8591.879999999997,
   26476.68,
   22797.460000000003,
   21462.399999999998,
   2976.5599999999995,
   28408.140000000007,
   11260.53],
  [322.2,
   16774.719999999998,
   8655.329999999998,
   9744.62,
   2998.18,
   5412.180000000001,
   6935.05]],
 'month': [array([['12/2010'],
         ['01/2011'],
         ['02/2011'],
         ['03/2011'],
         ['04/2011'],
         ['05/2011'],
         ['06/2011']], dtype=object),
  array([['12/2010'],
         ['01/2011'],
         ['02/2011'],
         ['03/2011'],
         ['04/2011'],
         ['05/2011'],
         ['06/2011']], dtype=object)],
 'color': ('#a6cee3', '#1f78b4')}
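The two inner lists line up here because both customers have invoices in all seven months. When a customer has no invoice in some month, the value list would come out shorter than the month list. One way to guard against that is sketched below on hypothetical toy data: reindex each customer's totals against a shared month range and fill the gaps with zero.

```python
import pandas as pd

# Hypothetical monthly invoices; customer 2 has no invoice in 2011-01.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "MonthTime": pd.PeriodIndex(
        ["2010-12", "2011-01", "2011-02", "2010-12", "2011-02"], freq="M"),
    "TotalPrice": [100.0, 200.0, 300.0, 50.0, 75.0],
})

# Shared month axis used for every customer.
months = pd.period_range("2010-12", "2011-02", freq="M")
labels = months.strftime("%m/%Y").tolist()

xs, ys = [], []
for cust in [1, 2]:
    totals = (toy[toy["CustomerID"] == cust]
              .groupby("MonthTime")["TotalPrice"].sum())
    # reindex adds the missing months, filled with 0, so xs and ys stay aligned.
    totals = totals.reindex(months, fill_value=0.0)
    xs.append(labels)
    ys.append(totals.tolist())

print(ys)
```

Each entry in `xs` and `ys` now has the same length, which is the shape `multi_line` expects for its `xs` and `ys` columns.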

The value is a list of lists containing 2 series, each corresponding to a different customer. Once we have confirmed that, let's take a look at how to use this glyph:

In [106]:
p = figure(x_axis_label = "Month",
           y_axis_label = "Total Revenue",
           title = "Customer Monthly Revenue",
           plot_height=300,
           plot_width=800,
           x_range=invoice_grouped.MonthTime.unique().strftime('%m/%Y'))


p.multi_line(xs='month',
             ys='value',
             color='color',
             legend='customers',
             line_width=2,
             source=ColumnDataSource(line_dict))

show(p)



![ ](assets/8.png)

Lastly, we can modify some layout options of our figure by directly altering its attributes:

In [113]:
from bokeh.models.formatters import NumeralTickFormatter

p.xgrid.grid_line_color = None
p.ygrid.grid_line_alpha = 0.5

p.xaxis.minor_tick_out = 0
p.xaxis.major_tick_in = 0
p.xaxis.major_tick_out = 2

p.yaxis.major_tick_in = 0
p.yaxis.formatter = NumeralTickFormatter(format="0,0")

show(p)

![ ](assets/9.png)

A very simple line plot that gives us better insight into the monthly transaction dynamics of customers 14646 and 14156. Looking at this plot, we can see a slump in transactions in April 2011. Identifying this behaviour makes it easy to guide our next analysis toward more specific matters. For example: which commodities are commonly bought by these specific customers? Is the lack of revenue in April a seasonal behaviour or the result of another event?

In some cases, however, when we aren't trying to compare the sales side by side, we can create multiple line plots and stack them in a consecutive arrangement. Let's start by extracting more valuable customers straight from our data frame:

In [114]:
customer_size = invoice_grouped[invoice_grouped.Country.isin(second_market)].groupby('CustomerID').size()

top_5_customers = customer_size.sort_values(ascending=False).head(5)

Next, since we are going to create the same plot multiple times, I'm going to write a function to automate the plot generation and display the results afterwards:

In [115]:
from bokeh.layouts import column

def RevenueLinePlot(customers, df):
    plots = []
    for cust in customers:
        source = ColumnDataSource(df[df.CustomerID == cust].groupby('MonthTime').agg({
            'TotalPrice': 'sum'
        }))
        p = figure(title=f'Customer {cust} Revenue Trend Analysis',
                   x_axis_label='Month',
                   y_axis_label='Total Revenue',
                   x_axis_type='datetime',
                   plot_width=800,
                   plot_height=200,
                   tools=['pan', 'wheel_zoom', 'reset', 'hover', 'save'],
                   tooltips=[('Revenue', '@TotalPrice{1,000}')])
        p.line(x='MonthTime',
               y='TotalPrice',
               source=source,
               color='#42b549',
               line_width=2)
        p.circle(x='MonthTime',
                 y='TotalPrice',
                 source=source,
                 color='firebrick',
                 fill_color='white',
                 size=10)
        p.xgrid.grid_line_color = None
        p.yaxis.formatter = NumeralTickFormatter(format="0,0")
        
        plots.append(p)
    return(plots)

multi_plots = RevenueLinePlot(top_5_customers.index.astype('int'), invoice_grouped)
show(column(multi_plots))

![ ](assets/10.png)

![ ](assets/11.png)

![ ](assets/12.png)

![ ](assets/13.png)

![ ](assets/14.png)

**Dive Deeper:**

Applying what you have learned, can you generate line plots that show the number of invoices created in each of our top 5 secondary market countries stored in `second_market`?

In [122]:
invoice_grouped.dtypes

ItemBought              int64
TotalPrice            float64
CustomerID            float64
Country                object
InvoiceDate    datetime64[ns]
MonthTime           period[M]
dtype: object

In [128]:
## Your code below
# country_size = invoice_grouped[invoice_grouped.Country.isin(second_market)].groupby('Country').size()
# top_5_customers = customer_size.sort_values(ascending=False).head(5)

def RevenueCountryLinePlot(countries, df):
    plots = []
    for coun in countries:
        source = ColumnDataSource(df[df.Country == coun].groupby('MonthTime').agg({
            'TotalPrice': 'count'
        }))
        p = figure(title=f'Country {coun} Monthly Invoice Count',
                   x_axis_label='Month',
                   y_axis_label='Number of Invoice',
                   x_axis_type='datetime',
                   plot_width=800,
                   plot_height=200,
                   tools=['pan', 'wheel_zoom', 'reset', 'hover', 'save'],
                   tooltips=[('Num Invoice', '@TotalPrice{1,000}')])
        p.line(x='MonthTime',
               y='TotalPrice',
               source=source,
               color='#42b549',
               line_width=2)
        p.circle(x='MonthTime',
                 y='TotalPrice',
                 source=source,
                 color='firebrick',
                 fill_color='white',
                 size=10)
        p.xgrid.grid_line_color = None
        p.yaxis.formatter = NumeralTickFormatter(format="0,0")
        
        plots.append(p)
    return(plots)

multi_plots_country = RevenueCountryLinePlot(second_market, invoice_grouped)
show(column(multi_plots_country))

### Category Comparison using Bar Plot

The last plot type we are going to create in this course is the bar plot. A bar plot is fundamentally a plot for comparing different groups of data. There are fancier visualization techniques for data comparison (the dot plot is one of my favorites), but each has its own pros and cons. By understanding the bar plot, you'll learn the perspective of comparing groups of data.

Now let's get back to our data. In the previous section we discussed how `United Kingdom` can be considered our primary market segment. But we haven't yet gotten a sense of how large each country's share of revenue is. To extract that information, let's create grouped data of the revenue gained from each country:

In [129]:
top_5_country = invoice_grouped.groupby('Country').agg({
    'TotalPrice' : 'sum'    
}).sort_values('TotalPrice', ascending=False).head(5).index.values

top_country_monthly = invoice_grouped[invoice_grouped.Country.isin(top_5_country)].groupby('Country').agg({
    'TotalPrice': 'sum'
})

Next, as we have already learned in the previous sections, we can add a glyph to our plot. To create a vertical bar, we use `vbar()` glyph and add `top` parameter:

In [130]:
p = figure(title='Top 5 Country for Revenue Source',
           plot_height=300,
           plot_width=800,
           x_range=top_5_country,
           x_axis_label="Country",
           y_axis_label="Total Revenue",
           tools=tools,
           tooltips=[('Value', "@TotalPrice{0,0}")])

p.vbar(x='Country',
       top='TotalPrice',
       source=ColumnDataSource(top_country_monthly),
       width=0.8,
       color='#42b549')

p.yaxis.formatter = NumeralTickFormatter(format='0,0')

show(p)

![ ](assets/15.png)

Based on the plot above, we can see that the revenue generated in United Kingdom is more than 30 times higher than that of the secondary markets. It is safe to separate the two (main and secondary markets) into two different sets of analysis, since their revenue figures are not comparable.

However, do note that we can't assume the other markets are insignificant. Let's look at this from a different perspective. Instead of looking at total revenue, let's see how the average total price per invoice compares across countries:

In [131]:
average_invoice = invoice_grouped[invoice_grouped.Country.isin(top_5_country)].groupby('Country').agg({
    'TotalPrice': 'mean'
})

p = figure(title='Average Transaction Amount per Invoice',
           plot_height=300,
           plot_width=800,
           x_range=top_5_country,
           x_axis_label="Country",
           y_axis_label="Total Amount",
           tools=tools,
           tooltips=[('Value', "@TotalPrice{0,0}")])

p.vbar(x='Country',
       top='TotalPrice',
       source=ColumnDataSource(average_invoice),
       width=0.8,
       color='#42b549')

p.yaxis.formatter = NumeralTickFormatter(format='0,0')

show(p)

![ ](assets/16.png)

It clearly shows that `Netherlands` is instead the highest spender per transaction on average compared to United Kingdom. Even compared to the other secondary markets, the primary market has a slightly smaller average transaction amount. This shows that each market segment has its own characteristics: while United Kingdom as the primary market is the largest revenue source, customers based in United Kingdom commonly spend around `410` per transaction, while Netherlands customers on average spend up to `2,926` per transaction. This also makes sense since the retail store is based in United Kingdom and should have higher local exposure than international exposure. Additionally, considering how much higher international shipping costs can be, it is more cost-effective for customers based outside United Kingdom to order in large batches.
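The contrast between total and average revenue can be reproduced on a handful of made-up invoices. The numbers below are hypothetical and chosen only so that one country has many small orders and the other a single large one.

```python
import pandas as pd

# Hypothetical invoices: ten small UK orders versus one large Dutch order.
toy = pd.DataFrame({
    "Country": ["United Kingdom"] * 10 + ["Netherlands"],
    "TotalPrice": [400.0, 420.0] * 5 + [2900.0],
})

# Total, average, and number of invoices per country in one table.
summary = toy.groupby("Country")["TotalPrice"].agg(["sum", "mean", "count"])
print(summary)
```

The country with the larger `sum` has the smaller `mean`, which is why the two bar plots rank the markets differently.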

Another useful technique for bar plots is the grouped bar chart. Say, for example, we'd like to examine the secondary market's revenue throughout the months. To do that, we'll add one more grouping level, `MonthTime`, to our previous data frame:

In [132]:
second_monthly_revenue = invoice_grouped[invoice_grouped.Country.isin(second_market)].groupby(['Country','MonthTime']).agg({
    'TotalPrice': 'sum'
})

Now let's create another dictionary to use as the data source. For a grouped bar plot, the `x` column must hold one `(country, month)` tuple per bar, matching the factors used for `x_range`, and the column passed to `top` must hold the corresponding values in the same order:

In [139]:
country_id = []
values = []
for country in second_market:
    for month in second_monthly_revenue.index.levels[1].astype('str'):
        country_id.append((country, month))
        
    values.extend(second_monthly_revenue.xs(country, level='Country').TotalPrice.tolist())
    
    
bar_dict = dict(x=country_id, value=tuple(values))
len(bar_dict['value'])

35
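The length check of 35 (5 countries × 7 months) confirms that every country has a value for every month in this dataset. If some country were missing a month, the factor list and the value list would fall out of step. A hedged sketch of one way to keep them aligned is shown below on hypothetical toy totals: build the full set of `(country, month)` pairs with `MultiIndex.from_product` and reindex against it.

```python
import pandas as pd

# Hypothetical (Country, Month) totals; France has no row for 2011-02.
toy = pd.DataFrame({
    "Country": ["Germany", "Germany", "France"],
    "Month": ["2011-01", "2011-02", "2011-01"],
    "TotalPrice": [100.0, 150.0, 80.0],
}).set_index(["Country", "Month"])

countries = ["Germany", "France"]
months = ["2011-01", "2011-02"]

# Every (country, month) pair, in the order the bars should appear.
full_index = pd.MultiIndex.from_product([countries, months],
                                        names=["Country", "Month"])

# Missing pairs are added with a value of 0 so both lists stay the same length.
aligned = toy["TotalPrice"].reindex(full_index, fill_value=0.0)

bar_data = dict(x=list(aligned.index), value=aligned.tolist())
print(bar_data)
```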

To create a grouped layout on our barplot, we'll use `FactorRange` to define our `x_range` parameter in our `figure`:

In [136]:
from bokeh.models.ranges import FactorRange

p = figure(x_range=FactorRange(*country_id), 
           plot_width=850, 
           plot_height=300, 
           title="Secondary Market Monthly Sales by Country")

r = p.vbar(x='x',
           top='value',
           width=0.8,
           line_color=None,
           fill_color='#42b549',
           source=ColumnDataSource(bar_dict))

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xaxis.major_label_orientation = 1
p.xaxis.major_tick_in = 0
p.xgrid.grid_line_color = None
p.yaxis.formatter = NumeralTickFormatter(format='0,0')

show(p)

![ ](assets/17.png)

Since we have multiple groups of data, it's always a good idea to add color to help the audience focus on the specific information they are interested in. To do that, let's add one more dictionary key named `colors` that maps each country to a specified color. We are going to use the `col_dict` we created in an earlier section of this course:

In [140]:
# One color per bar, looked up from the country part of each (country, month) pair
colors = [col_dict[country] for country, _ in country_id]

bar_dict = dict(x=country_id, value=tuple(values), colors=colors)

p = figure(x_range=FactorRange(*country_id), 
           plot_width=850, 
           plot_height=300, 
           title="Secondary Market Monthly Sales by Country")

r = p.vbar(x='x',
           top='value',
           width=0.8,
           line_color=None,
           fill_color='colors',
           source=ColumnDataSource(bar_dict))

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xaxis.major_label_orientation = 1
p.xaxis.major_tick_in = 0
p.xgrid.grid_line_color = None
p.x_range.group_padding = 3
p.yaxis.formatter = NumeralTickFormatter(format='0,0')

show(p)

![ ](assets/18.png)

**Dive Deeper:**

Now it's your turn to enhance the above plot. We created the grouped sales for each month from December 2010 to June 2011, but the time frame of the analysis is not quite established yet. Say we are interested in our sales performance within the secondary market since the beginning of 2011. Recheck the time range of our initial dataset: can you confirm whether or not all the months have been taken into account properly? (Hint: only complete months of data should be analyzed.)

Can you recreate the above plot using only the months of interest? Also notice that I haven't added a hover tool; try adding it in your code below!

In [None]:
## Your code below
