This coursebook is part of Tokopedia Python for Data Analytics course prepared by team at [Algoritma](https://algorit.ma/). Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference etc. This coursebook is intended for a restricted audience only, i.e. the individuals and organization having received this coursebook directly from Algoritma. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

## Training Objectives

The primary objectives in this coursebook is to provide a full hands-on experience in using visual exploratory techniques to help participants gain full proficiency in data visualization tools in Python. The library we'll be using is called Bokeh, an interactive visualization library.

The objectives of this training is divided into 2 main focus:

- **Introduction to Bokeh**
- Grammar of graphics
- Plotting parameters
- Adding annotations
- **Plotting Essentials**
- Data Sources and Transformations
- Basic plotting
- Choosing appropriate plot
- **Enhancing Plot**
- 

At the end of this course, we'll be working with **Learn by Building** module as their graded assignment. You'll be given a dataset to apply what you've learned. Create a visualization with the appropriate annotations and aestethics to generate a powerful insight and communicate a story.

## Introduction to Bokeh

### Why Learn Bokeh?

Visualizing dataset using any programming language can have a higher learning curve than most tools. Some of you might wonder: why use Python to visualize when you can use a drag-and-drop tools such as Tableau or Power BI. There are 2 main points of why we should learn this:
- Reproducibility
- Customization
- Data Streaming

Since we'll be creating our own visualization from scratch. It gives us limitless capability to create any kind of visualization that we can imagine. There are also limitless variance of the plot that we can try, iterating through the plot types can give us an unexplored insight we never think of, this process is called: **visual exploratory data analysis**. Once we're satisfied with the result, the plot can always be reproduced with the same script telling the same stories from different data variance.

In this hands-on exercise, we'll be going through a visual exploratory data analysis process drawing out all available insights within our dataset.

### Grammar of Graphics

Bokeh adopt a versatile system where user can alter graphic components layer by layer with a high level of modularity. On top of that, it's also able to be rendered within HTML, giving a flexibility of the plots to be presented in a standard browser.

To understand the grammar of graphics, we'll use one of the sample data provided by Bokeh library:

In [1]:
from bokeh.sampledata.autompg import autompg

autompg.dtypes

mpg       float64
cyl         int64
displ     float64
hp          int64
weight      int64
accel     float64
yr          int64
origin      int64
name       object
dtype: object

In order to produce a bokeh plot in a notebook, we need to use `output_notebook()` function in order to make the java script compatible with Jupyter's notebook.

In [2]:
from bokeh.io import output_notebook

output_notebook()

Now working with the grammar:

In [3]:
from bokeh.plotting import figure
from bokeh.io import show

p = figure(plot_height = 200)
type(p)

bokeh.plotting.figure.Figure

In [4]:
show(p)



Now we can use figure methods to add graphic layers on top of the plot:

In [5]:
p.circle(x = autompg['mpg'], y= autompg['hp'])
show(p)

Notice how we created a scatter plot to visualize the relationship between Miles per Gallon, and Horsepower passed in as the parameters of `circle` function. We can also add several annotations of the plot from the `figure` function:

In [6]:
p = figure(title = "Vehicle Testing",
           x_axis_label = "Miles per Gallon",
           y_axis_label = "Horsepower",
           plot_height=200)
p.circle(x = autompg['mpg'], y= autompg['hp'])
show(p)

**Dive Deeper:**

Notice how the widget provides feature buttons on the right side? Let's take our time to explore each of the buttons and utilize Bokeh's main strength: interactivity.

## Plotting Essential

Plots help us visually inspect our dataset in a more entirety perspectives. There are numerous types of plots we'll be looking on, let's first read our dataset using pandas:

In [120]:
import pandas as pd

retail = pd.read_csv("online_retail.csv")

print(retail.shape)

(240007, 8)


The data contain 8 columns and up to hundred thousands of rows. Let's inspect what are the columns we are deaing with:

In [14]:
print(retail.dtypes)

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object


This dataset consist of transaction data of an online retail shops provided from UCI Machine Learning [repository](https://archive.ics.uci.edu/ml/datasets/online+retail). Before proceeding into plotting the data we'll need to understand thoroughly each variable. The following is the data glossary provided for this dataset:

- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation. 
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. 
- Description: Product (item) name. Nominal. 
- Quantity: The quantities of each product (item) per transaction. Numeric.	
- InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated. 
- UnitPrice: Unit price. Numeric, Product price per unit in sterling. 
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. 
- Country: Country name. Nominal, the name of the country where each customer resides.



Let's implement things we have learned in previous course:

In [121]:
retail['InvoiceDate'] = pd.to_datetime(retail['InvoiceDate'])
retail['InvoiceDate'].describe()

count                  240007
unique                  11240
top       2010-12-06 16:57:00
freq                      675
first     2010-12-01 08:26:00
last      2011-06-26 10:59:00
Name: InvoiceDate, dtype: object

By converting our data into datetime type, we now understand how the invoice were created between December of 2010 to mid year of 2011. To make a richer visualization, let's extract the month of our invoice date into a new column:

In [122]:
retail['TransMonth'] = retail['InvoiceDate'].dt.month
retail['TransMonth'].value_counts()

12    42481
5     37030
3     36748
1     35147
6     30978
4     29916
2     27707
Name: TransMonth, dtype: int64

One more date feature extraction:

In [123]:
retail['MonthlyDate'] = retail['InvoiceDate'].dt.to_period('M')
retail['MonthlyDate'].value_counts()

2010-12    42481
2011-05    37030
2011-03    36748
2011-01    35147
2011-06    30978
2011-04    29916
2011-02    27707
Freq: M, Name: MonthlyDate, dtype: int64

**Dive Deeper:**

Before proceeding into data visualization let's do one more knowledge check. Can you answer the following questions by inspecting our data:

1. Can you confirm wether or not each row indicates unique invoice number?
2. Can someone order multiple items with 1 invoice?

### Data Distribution using Scatter Plot

One important note that you should always consider, not all data can be processed as is. Let's highlight one of the columns description: `InvoiceNo`.

It is said that the `InvoiceNo` is uniquely assigned for each transaction, but if the code starts with the letter 'c', it indicates cancellation. Now before going further into visualization techniques we must fully understand the behaviour of our data and the purpose of our plot.

Say, for example I wanted to known my customer's purchasing behaviour. So this is what I would like to know: what is the correlation between total invoice and numer of item bought? How does one affect the other and how is the common behaviour seen in our existing customers?

This hypothesis then leads me to create a plot that can illustrate how does the data distribution of amount paid over quantity bought is, to do that we will use *scatter plot*. Scatter plot might be one of the most used types of plot in understanding the correlation between numeric data.

Now going back to our `InvoiceNo` column that holds a cancelled transaction. To create a proper analysis, I'm going to remove all cancellation invoice from my data:

In [124]:
retail_clean = retail[retail.InvoiceNo.str.contains("^[^C].*")]
retail_clean.shape

(235657, 10)

In [35]:
retail_clean.Revenue = retail_clean.UnitPrice * retail_clean.Quantity

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [130]:
retail_clean.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
TransMonth              int64
MonthlyDate         period[M]
dtype: object

In [37]:
invoice_grouped = retail_clean.groupby('InvoiceNo').agg({
    'StockCode': lambda x: x.unique().size,
    'Revenue': 'sum'
})

invoice_grouped.head()

Unnamed: 0_level_0,StockCode,Revenue
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1
536365,7,139.12
536366,2,22.2
536367,12,278.73
536368,4,70.05
536369,1,17.85


In [38]:
p = figure(title = 'Price x Quantity',
          x_axis_label = "Number of Item",
          y_axis_label = "Total Invoice",
          tools = ["hover", "wheel_zoom", "pan", "save", "reset"],
          tooltips = 'Invoice: @InvoiceNo')
p.circle('StockCode', 'Revenue', source = invoice_grouped)
show(p)

The key, as it is with data visualization in general, is to have our plot be effective. A plot that is effective complements how human visual perception works. The plot above is ineffective because it could be communicating more with less "visual clutter" at the bottom left.

To deal with that, we need to create a smaller sample or funelled our analysis to a certain set, this time we'll try to focus on a certain Country transaction:

In [43]:
retail_clean.Country.value_counts()

United Kingdom          216510
Germany                   3958
France                    3579
EIRE                      2920
Netherlands               1139
Spain                     1131
Belgium                    914
Switzerland                697
Australia                  630
Portugal                   610
Norway                     368
Channel Islands            363
Cyprus                     350
Finland                    307
Italy                      299
Japan                      230
Hong Kong                  199
Sweden                     195
Denmark                    184
Poland                     180
Austria                    124
Singapore                  113
Iceland                    102
Greece                      85
Unspecified                 83
Canada                      68
Malta                       45
Lebanon                     45
Israel                      36
Lithuania                   35
Brazil                      32
European Community          31
United A

See how most of our transaction came from United Kingdom, we can safely assumed that our main market is in United Kingdom, since it's where the retailer are based off. So we can do it with 2 approach:  
1. We can do a random sampling of the main market (United Kingdom)
2. We define the secondary market and plot accordingly

In [64]:
second_market = ['Germany', 'France', 'EIRE', 'Netherlands', 'Spain']
retail_second = retail_clean[retail_clean.Country.isin(second_market)]
retail_second.shape

(12727, 9)

In [145]:
from bokeh.palettes import Greens5

r = retail_second.groupby('InvoiceNo').agg({
    'StockCode': lambda x: x.unique().size,
    'Revenue': 'sum',
    'Country': 'first',
    'CustomerID': 'first'
})

r.head()
Greens5
Dark2[5]

['#1b9e77', '#d95f02', '#7570b3', '#e7298a', '#66a61e']

In [146]:
from bokeh.transform import factor_cmap
from bokeh.palettes import Dark2

p = figure(title = 'Price x Quantity',
          x_axis_label = "Number of Item",
          y_axis_label = "Total Invoice",
          tools = ["hover", "box_zoom", "wheel_zoom", "pan", "save", "reset"],
          tooltips = [('Invoice', '@InvoiceNo'), ('Country:', '@Country'), ('Customer', '@CustomerID')])
p.circle('StockCode', 'Revenue', source = r,
         size = 6, fill_alpha=0.4,
         legend = 'Country',
        color = factor_cmap('Country', Dark2[5], second_market))
show(p)

Notice how we could have a more insightful plot by limiting the sample space of our data. Majority of our customers, can be seen in the bottom left and produced a similar shape as our entire transactions, but we can easily identify that there are several countries such as Netherland that tends to do a transaction with relatively large total invoice. Since I also created a hover tools for the customer ID, you would know that the relatively high invoices was generated from the same customer, and we can identify this customer as our valueable customers since there are multiple repurchase happening.

## TODO: Might move

In bokeh, you can map a color palette into designated colors by the following functions:
- linear_cmap()
- log_cmap()
- factor_cmap()

**Dive Deeper:**

Can you create a visualization from a sample of our customers originated form United Kingdom? Take the last 3 months of transaction and identify this:

- Does it follow a same pattern as the plots from our secondary market?
- Is there any apparent valuable customers in the sample data?

### Trend Analysis using Line Plot

The second type of plots we'll talk about is line plot. This plot is heavily used for analyzing trend or data movement. The movement is commonly tied to a time line. Since we have identify customer `14646` from Netherland that seems to have a repeated repurchase, let's see how does his behaviour movement looks like:

In [157]:
retail_14646 = retail_clean[retail_clean.CustomerID == 14646]
retail_14646 = retail_14646[['InvoiceNo', 'MonthlyDate']].groupby('MonthlyDate').count()
retail_14646

Unnamed: 0_level_0,InvoiceNo
MonthlyDate,Unnamed: 1_level_1
2010-12,70
2011-01,220
2011-02,177
2011-03,133
2011-04,24
2011-05,173
2011-06,137


To plot this we will add a `line` glyph to our plot:

In [181]:
p = figure(x_axis_type = "datetime",
           x_axis_label = "Month",
           y_axis_label = "Number of Transaction",
           title = "Customer 14646 Monthly Number of Transaction",
           tools = ["hover", "box_zoom", "wheel_zoom", "pan", "save", "reset"],
           tooltips = [('Value', '@InvoiceNo')],
           plot_height = 200)

p.xgrid.grid_line_color = None
p.xgrid.grid_line_alpha = 0.5
p.line(x = 'MonthlyDate', y = 'InvoiceNo',
       source = retail_14646,
       color = "#42b549")

show(p)

A very simple line plot that gives us better insight on customer 14646's monthly transaction dynamic. By looking at this plot, we can see that there are a plump of transaction in April 2011. By identifying this behaviour it is easy to guide our next analysis into the commodities that is commonly bought by the customer. We can also try to understand wether or not the lack of transactions is a seasonality behaviour or other cases.

Since we tried to visualize monthly transactions of a specific customers, let's try to create a multiple plot from our top valueable customers and put it in a nice layout:

In [206]:
retail_5

Unnamed: 0_level_0,Unnamed: 1_level_0,InvoiceNo
CustomerID,MonthlyDate,Unnamed: 2_level_1
12748.0,2010-12,668
12748.0,2011-01,25
12748.0,2011-02,101
12748.0,2011-03,189
12748.0,2011-04,102
12748.0,2011-05,268
12748.0,2011-06,226
14606.0,2010-12,228
14606.0,2011-01,228
14606.0,2011-02,274


In [205]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,b,a,b
A,B,1.0,8.0,10.0
A,C,2.0,7.0,
A,D,,,9.0


In [204]:
valuable_5 = retail_clean[['CustomerID', 'InvoiceNo']].groupby('CustomerID').count().sort_values(['InvoiceNo'], ascending = False).head(5).index

retail_5 = retail_clean[retail_clean.CustomerID.isin(valuable_5)][['CustomerID', 'InvoiceNo', 'MonthlyDate']].groupby(['CustomerID', 'MonthlyDate']).count()

source = ColumnDataSource(retail_5)

In [207]:
p1 = figure(x_axis_type = "datetime",
           x_axis_label = "Month",
           y_axis_label = "Number of Transaction",
           title = "Customer 14646 Monthly Number of Transaction",
           tools = ["hover", "box_zoom", "wheel_zoom", "pan", "save", "reset"],
           tooltips = [('Value', '@InvoiceNo')],
           plot_height = 200)
p1.xgrid.grid_line_color = None
p1.xgrid.grid_line_alpha = 0.5
p1.line(x = 'MonthlyDate', y = 'InvoiceNo',
       source = source,
       color = "#42b549")
show(p1)

ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name. This could either be due to a misspelling or typo, or due to an expected column being missing. : key "x" value "MonthlyDate" (closest match: "CustomerID_MonthlyDate") [renderer: GlyphRenderer(id='14016', ...)]


In [None]:
p1 = figure(x_axis_type = "datetime",
           x_axis_label = "Month",
           y_axis_label = "Number of Transaction",
           title = "Customer 14646 Monthly Number of Transaction",
           tools = ["hover", "box_zoom", "wheel_zoom", "pan", "save", "reset"],
           tooltips = [('Value', '@InvoiceNo')],
           plot_height = 200)
p1.xgrid.grid_line_color = None
p1.xgrid.grid_line_alpha = 0.5
p1.line(x = 'MonthlyDate', y = 'InvoiceNo',
       source = retail5,
       color = "#42b549")

### Category Comparison using Bar Plot

In our next module, we'll introduce other visualization tools that is heavily used by statistician: boxplot and histogram. Both are very useful in understanding data distribution. In this module, however, we'll focus more on plotting building blocks and will dealt in that area in a later session.

## Enhancing Bokeh Plot

### Using Widgets

### Data Streaming