# Data Visualization (2017/18)

## Solutions for Assignment 3 - Visualizing multivariate data 

Presented by Group 52: 
- Lucia Eve Berger
- Name2

Date: DD.MM.YYYY

## Setup

In [1]:
import pandas as pd
import numpy as np

# import bokeh 
from bokeh.plotting import figure, show, Figure
from bokeh.models import ColumnDataSource, Label
from bokeh.models.glyphs import Text
from bokeh.palettes import Spectral3, d3
from bokeh.layouts import row, column, gridplot

# tell bokeh to show the figures in the notebook
from bokeh.io import output_notebook
output_notebook()

Load data stored in bokeh:

In [2]:
from bokeh.sampledata.autompg import autompg
from bokeh.sampledata.iris import flowers
flowers.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


## Helpful functions

Group a dataframe according to a variable (species) and compute some statistics for a second variable (petal_width).

In [3]:
flowers.groupby(['species']).petal_width.describe()

species          
setosa      count    50.000000
            mean      0.246000
            std       0.105386
            min       0.100000
            25%       0.200000
            50%       0.200000
            75%       0.300000
            max       0.600000
versicolor  count    50.000000
            mean      1.326000
            std       0.197753
            min       1.000000
            25%       1.200000
            50%       1.300000
            75%       1.500000
            max       1.800000
virginica   count    50.000000
            mean      2.026000
            std       0.274650
            min       1.400000
            25%       1.800000
            50%       2.000000
            75%       2.300000
            max       2.500000
Name: petal_width, dtype: float64
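
When only a few statistics are needed, `agg` can be used in place of `describe`. The following is a minimal sketch on a small made-up frame; the values are illustrative and not taken from the iris data, only the column names mirror the ones used above.

```python
import pandas as pd

# Small made-up sample; the values are illustrative only
df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica'],
    'petal_width': [0.2, 0.4, 1.3, 1.5, 2.0, 2.4],
})

# One row per species, one column per requested statistic
stats = df.groupby('species')['petal_width'].agg(['mean', 'min', 'max'])
print(stats)
```

The result is a DataFrame indexed by species, so a single value can be read with `stats.loc['setosa', 'mean']`.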

Find the unique values of a categorical variable and count their occurrences.

In [4]:
flowers.species.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [5]:
flowers.species.value_counts()

versicolor    50
virginica     50
setosa        50
Name: species, dtype: int64

Use numpy to compute a histogram for quantitative data. See the documentation of `np.histogram` for further information on how to work with the output.

In [6]:
np.histogram(flowers.petal_width)

(array([41,  8,  1,  7,  8, 33,  6, 23,  9, 14]),
 array([0.1 , 0.34, 0.58, 0.82, 1.06, 1.3 , 1.54, 1.78, 2.02, 2.26, 2.5 ]))
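
The output is a pair of arrays: the counts per bin and the bin edges. The sketch below, on a made-up sample, shows how the two relate and how counts can be turned into probabilities. Note that `density=True` normalizes so that the area under the histogram is 1, which is not the same as bar heights summing to 1.

```python
import numpy as np

# Made-up sample; the values are illustrative only
values = np.array([0.1, 0.2, 0.2, 0.3, 1.0, 1.3, 1.5, 2.0, 2.3, 2.5])

counts, edges = np.histogram(values, bins=4)

# There is always one more edge than there are bins
assert len(edges) == len(counts) + 1

# Bin centers are convenient for labelling or for plotting one point per bin
centers = (edges[:-1] + edges[1:]) / 2

# Dividing by the total turns counts into probabilities that sum to 1
probabilities = counts / counts.sum()
```

For plotting with `quad`, `edges[:-1]` gives the left side of each bar and `edges[1:]` the right side.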

## Exercise 1 a): Customize a scatterplot chart

The following code skeleton renders a scatterplot. Customize the chart to your liking; think, for example, about how it behaves with many data points.

This is meant to be a very quick exercise to demonstrate the concept for the following two charts.

Requirements:
- **Parameters**: The function accepts (at least) the following parameters:
    - **source**: a pandas DataFrame or bokeh ColumnDataSource that holds the data
    - **x**: variable (name as string) to be represented on the x-axis
    - **y**: variable (name as string) to be represented on the y-axis
- **Calling the scatterplot**: The function is a class method of Figure and can be called as follows
```python
p = figure()
p.scatter( data, x, y )
```
This is already set up in the code skeleton below.
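
Attaching a plain function to a class makes it available as a method on every instance. The sketch below illustrates this mechanism without bokeh; `Canvas` is a hypothetical stand-in for `Figure` and only records the calls it receives.

```python
class Canvas:
    """Hypothetical stand-in for bokeh's Figure, used only for illustration."""
    def __init__(self):
        self.calls = []


def scatter(self, source, x, y, **kwargs):
    # 'self' is the instance the method is called on
    self.calls.append((x, y, len(source[x])))
    return self


# Attach the function to the class; every instance now has a .scatter method
Canvas.scatter = scatter

c = Canvas()
c.scatter({'a': [1, 2, 3], 'b': [4, 5, 6]}, 'a', 'b')
print(c.calls)  # [('a', 'b', 3)]
```

Because the function is looked up on the class, instances created before the assignment gain the method as well.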

**<font color="deeppink">Update code</font>**

In [7]:
def scatter( self, source, x, y, **kwargs ):
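    # access the figure using the self variable;
    # the source must provide a 'colors' column with one colour per point
    self.circle( source=source, x=x, y=y, color='colors', size=7, fill_alpha=0.4)
    
    # species labels at fixed screen positions
    # (ToDo: place each label at the average position of its species)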
    label = Label(x=50, y=50, x_units='screen', y_units='screen',
                 text='Setosa', render_mode='css')
    label2 = Label(x=200, y=200, x_units='screen', y_units='screen',
                 text='Versicolor', render_mode='css' )
    label3 = Label(x=300, y=300, x_units='screen', y_units='screen',
                 text='Virginica', render_mode='css' )
        
    self.add_layout(label)
    self.add_layout(label2)
    self.add_layout(label3)
    self.xaxis.axis_label = "petal width"
    self.yaxis.axis_label = "petal length"
    self.title.align = "center"

# add the function as class method to Figure    
Figure.scatter = scatter

**<font color="deeppink">Check</font>** that your code is working:

In [8]:
p = figure( plot_width=500, plot_height=500, title = "Petal Length and Width by Species")
colormap = {'setosa': 'red', 'versicolor': 'green', 'virginica': 'blue'}
colors = [colormap[x] for x in flowers['species']]
flowers['colors'] = colors

p.scatter(source=flowers, x='petal_width', y='petal_length')
show(p)

**<font color="deeppink">Test cases</font>**: Give three scenarios that need testing (bullet points, no implementation required). Think of scenarios where your code may fail.
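- Test case 1: NaN values: points with a missing x or y value may cause the plot to fail.
- Test case 2: Outliers: extreme values need to be checked against the rest of the data.
- Test case 3: Unknown categories: the colour map only covers the three defined species; if the data contained additional species, the lookup `colormap[x]` would raise a `KeyError` and no colour would be assigned.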
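
Test cases 1 and 3 can be guarded against before the data reaches the plot. The following is a minimal sketch, assuming a fallback colour (`FALLBACK`, a name introduced here) is acceptable for unseen categories and that rows with missing coordinates may simply be dropped; the sample rows are made up.

```python
import pandas as pd

colormap = {'setosa': 'red', 'versicolor': 'green', 'virginica': 'blue'}
FALLBACK = 'gray'  # assumption: any neutral colour for categories not in the map

# Made-up rows containing an unknown species and a missing value
df = pd.DataFrame({
    'species': ['setosa', 'virginica', 'unknown', 'versicolor'],
    'petal_width': [0.2, 2.1, 1.0, float('nan')],
    'petal_length': [1.4, 5.5, 3.0, 4.5],
})

# Drop rows with missing coordinates before plotting
clean = df.dropna(subset=['petal_width', 'petal_length']).copy()

# dict.get returns the fallback instead of raising a KeyError
clean['colors'] = [colormap.get(s, FALLBACK) for s in clean['species']]
print(clean)
```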

## Exercise 1 b): Implement a boxplot chart

Requirements:
- **Parameters**: The function accepts (at least) the following parameters:
    - **source**: a pandas DataFrame that holds the data
    - **x**: variable (name as string) to be represented on the x-axis
    - **y**: variable (name as string) to be represented on the y-axis
- **Orientation**: Provide boxplots with horizontal and vertical orientation (call them hboxplot and vboxplot).
- **Calling the boxplot**: The function is a class method of Figure and can be called as follows
```python
p = figure()
p.vboxplot( data, x, y )
```
This is already set up in the code skeleton below.

Hints:
- A Bokeh sample implementation can be found here: [Boxplot](https://bokeh.pydata.org/en/latest/docs/gallery/boxplot.html)
- Adapt this implementation to work on the target variable only. See code below to get started.
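
The statistics behind a boxplot can be computed independently of any drawing code. The sketch below uses a hypothetical helper, `box_stats`, on made-up data. It follows the same convention as the implementation in this notebook: the whisker ends are the 1.5·IQR fences, clamped to the observed minimum and maximum.

```python
import numpy as np
import pandas as pd


def box_stats(df, group, value):
    """Return (stats, outliers) for `value` within each level of `group`."""
    grouped = df.groupby(group)[value]
    q1, q2, q3 = (grouped.quantile(q) for q in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    fence_hi = q3 + 1.5 * iqr
    fence_lo = q1 - 1.5 * iqr

    # Rows outside the fences of their own group are outliers
    row_hi = df[group].map(fence_hi)
    row_lo = df[group].map(fence_lo)
    outliers = df[(df[value] > row_hi) | (df[value] < row_lo)]

    # Whiskers never extend beyond the observed range
    stats = pd.DataFrame({
        'q1': q1, 'q2': q2, 'q3': q3,
        'upper': np.minimum(fence_hi, grouped.max()),
        'lower': np.maximum(fence_lo, grouped.min()),
    })
    return stats, outliers


# Made-up data: group 'a' contains one extreme value
df = pd.DataFrame({
    'group': ['a'] * 5 + ['b'] * 5,
    'score': [1, 2, 3, 4, 100, 10, 11, 12, 13, 14],
})
stats, outliers = box_stats(df, 'group', 'score')
print(stats)
print(outliers)
```

The rows of `stats` follow the sorted group keys, so the category labels passed to the glyphs should be in the same order.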

**<font color="deeppink">Implement</font>**

In [55]:
def vboxplot( self, source, x, y, **kwargs ):
    if not isinstance(source, pd.DataFrame ):
        raise TypeError("source has to be a pandas DataFrame.")
    # categories in sorted order, matching the order of the groupby results below
    v_axis_type = np.sort(source.get(x).unique())
    df = pd.DataFrame(dict(score=source.get(y), group=source.get(x)))
    groups = df.groupby('group')
    q1 = groups.quantile(q=0.25)
    q2 = groups.quantile(q=0.5)
    q3 = groups.quantile(q=0.75)
    iqr = q3 - q1
    upper = q3 + 1.5*iqr
    lower = q1 - 1.5*iqr
    # find the outliers for each category
    def outliers(group):
        group_name = group.name
        return group[(group.score > upper.loc[group_name]['score']) | (group.score < lower.loc[group_name]['score'])]['score']
    
    
    out = groups.apply(outliers).dropna()

    # prepare outlier data for plotting, we need coordinates for every outlier.
    if not out.empty:
        outx = []
        outy = []
        for keys in out.index:
            outx.append(keys[0])
            outy.append(out.loc[keys[0]].loc[keys[1]])
    # if no outliers, shrink lengths of stems to be no longer than the minimums or maximums
    qmin = groups.quantile(q=0.00)
    qmax = groups.quantile(q=1.00)
    upper.score = [min([x,y]) for (x,y) in zip(list(qmax.loc[:,'score']),upper.score)]
    lower.score = [max([x,y]) for (x,y) in zip(list(qmin.loc[:,'score']),lower.score)]

    # stems
    self.segment(v_axis_type, upper.score, v_axis_type, q3.score, line_color="black")
    self.segment(v_axis_type, lower.score, v_axis_type, q1.score, line_color="black")

    # boxes
    self.vbar(v_axis_type, 0.7, q2.score, q3.score, fill_color="#00FFFF", line_color="black")
    self.vbar(v_axis_type, 0.7, q1.score, q2.score, fill_color="#000080", line_color="black")

    # whiskers (almost-0 height rects simpler than segments)
    self.rect(v_axis_type, lower.score, 0.2, 0.01, line_color="black")
    self.rect(v_axis_type, upper.score, 0.2, 0.01, line_color="black")

    # outliers
    if not out.empty:
        self.circle(outx, outy, size=6, color="#F38630", fill_alpha=0.6)

    self.xgrid.grid_line_color = None
    self.ygrid.grid_line_color = "white"
    self.grid.grid_line_width = 2
    self.xaxis.major_label_text_font_size="10pt"    

Figure.vboxplot = vboxplot

In [60]:
def hboxplot( self, source, x, y, **kwargs ):
    if not isinstance(source, pd.DataFrame ):
        raise TypeError("source has to be a pandas DataFrame.")
    # categories in sorted order, matching the order of the groupby results below
    v_axis_type = np.sort(source.get(y).unique())
    df = pd.DataFrame(dict(score=source.get(x), group=source.get(y)))
    groups = df.groupby('group')
    q1 = groups.quantile(q=0.25)
    q2 = groups.quantile(q=0.5)
    q3 = groups.quantile(q=0.75)
    iqr = q3 - q1
    upper = q3 + 1.5*iqr
    lower = q1 - 1.5*iqr
    # find the outliers for each category
    def outliers(group):
        group_name = group.name
        return group[(group.score > upper.loc[group_name]['score']) | (group.score < lower.loc[group_name]['score'])]['score']
    
    
    out = groups.apply(outliers).dropna()

    # prepare outlier data for plotting, we need coordinates for every outlier.
    if not out.empty:
        outx = []
        outy = []
        for keys in out.index:
            outx.append(keys[0])
            outy.append(out.loc[keys[0]].loc[keys[1]])
    
    # if no outliers, shrink lengths of stems to be no longer than the minimums or maximums
    qmin = groups.quantile(q=0.00)
    qmax = groups.quantile(q=1.00)
    upper.score = [min([x,y]) for (x,y) in zip(list(qmax.loc[:,'score']),upper.score)]
    lower.score = [max([x,y]) for (x,y) in zip(list(qmin.loc[:,'score']),lower.score)]

    # stems (horizontal layout: the score runs along x, the category along y)
    self.segment(upper.score, v_axis_type, q3.score, v_axis_type, line_color="black")
    self.segment(lower.score, v_axis_type, q1.score, v_axis_type, line_color="black")

    # boxes
    self.hbar(y=v_axis_type, height=0.7, left=q2.score, right=q3.score, fill_color="#00FFFF", line_color="black")
    self.hbar(y=v_axis_type, height=0.7, left=q1.score, right=q2.score, fill_color="#000080", line_color="black")

    # whiskers (almost-0 width rects simpler than segments)
    self.rect(lower.score, v_axis_type, 0.01, 0.2, line_color="black")
    self.rect(upper.score, v_axis_type, 0.01, 0.2, line_color="black")

    # outliers (outx holds the category, outy the score)
    if not out.empty:
        self.circle(outy, outx, size=6, color="#F38630", fill_alpha=0.6)

    self.xgrid.grid_line_color = None
    self.ygrid.grid_line_color = "white"
    self.grid.grid_line_width = 2
    self.xaxis.major_label_text_font_size="10pt"

Figure.hboxplot = hboxplot

**<font color="deeppink">Check</font>** your boxplot

In [62]:
p1 = figure( plot_width=500, plot_height=500, y_range=['setosa', 'versicolor', 'virginica'] )
p1.hboxplot( flowers, 'petal_width', 'species' )
p1.xaxis.axis_label = 'petal_width'
p1.yaxis.axis_label = 'species'

p2 = figure( plot_width=500, plot_height=500, x_range=['setosa', 'versicolor', 'virginica'] )
p2.vboxplot( flowers, 'species', 'petal_width' )
p2.yaxis.axis_label = 'petal_width'
p2.xaxis.axis_label = 'species'

show(column(p2,p1))


['setosa' 'versicolor' 'virginica']


**<font color="deeppink">Test cases</font>**: Give three scenarios that need testing (bullet points, no implementation required).
- Test case 1:
- Test case 2:
- Test case 3:

## Exercise 1 c): Implement a histogram chart

Requirements:
- **Parameters**: The function accepts (at least) the following parameters:
    - **source**: a pandas DataFrame that holds the data
    - **x**: variable (name as string) to be represented on the x-axis
    - **nbins**: number of bins (optional argument). If not provided set a meaningful default.
- **Data type**: Provide histograms for categorical and quantitative data.
- **Scaling**: The y-axis shall give probabilities in the range [0, 1]. Scale the axis to show the full range, e.g., (-0.05, 1.05).
- **Calling the histogram**: The function is a class method of Figure and can be called as follows
```python
p = figure()
p.histogram( data, x )
```

Hints:
- Assume that all categorical data has type string. Respective columns in the data can be converted using:
```
df.var = df.var.astype('str')
```
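
One way to meet the data-type and scaling requirements is to compute the bar heights separately from the drawing. The sketch below uses a hypothetical helper, `histogram_data`, that decides by dtype and returns probabilities in both cases; the default of 10 bins is an assumption, not a prescribed value.

```python
import numpy as np
import pandas as pd


def histogram_data(series, nbins=10):
    """Return (labels_or_edges, probabilities) for a pandas Series."""
    series = series.dropna()
    if pd.api.types.is_numeric_dtype(series):
        # Quantitative: bin the values, then normalize counts to sum to 1
        counts, edges = np.histogram(series, bins=nbins)
        return edges, counts / counts.sum()
    # Categorical: relative frequency per category, in sorted label order
    freq = series.astype(str).value_counts(normalize=True).sort_index()
    return list(freq.index), freq.values


labels, cat_probs = histogram_data(pd.Series(['a', 'b', 'a', 'a']))
edges, num_probs = histogram_data(pd.Series([1.0, 2.0, 3.0, 4.0]), nbins=2)
print(labels, cat_probs)
print(edges, num_probs)
```

For the categorical case, the returned labels can serve as the `x_range` of the figure and as the `x` values of `vbar`.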

**<font color="deeppink">Implement</font>**

In [161]:
from bokeh.models import Range1d

def histogram( self, source, x, nbins=0, *args, **kwargs ):  
    if not isinstance(source, pd.DataFrame ):
        raise TypeError("source has to be a pandas.DataFrame. Received %s" % type(source))
    
    data = source[x]
    if pd.api.types.is_numeric_dtype(data):
        # quantitative data: bin the values and normalize the counts,
        # so that the bar heights are probabilities that sum to 1
        bins = nbins if nbins > 0 else 50
        counts, edges = np.histogram(data.dropna(), bins=bins)
        self.quad(top=counts/counts.sum(), bottom=0, left=edges[:-1], right=edges[1:],
                  fill_color='cornflowerblue', line_color='white')
    else:
        # categorical data: relative frequency of each category
        freq = data.value_counts(normalize=True).sort_index()
        self.vbar(x=list(freq.index), top=list(freq.values), width=0.9, bottom=0,
                  fill_color='cornflowerblue')
    
    # show the full probability range
    self.y_range = Range1d(-0.05, 1.05)
        
Figure.histogram = histogram

**<font color="deeppink">Check</font>** your histogram

In [162]:
var1 = 'sepal_length'
var2 = 'species'
var3 = 'name'

p1 = figure( plot_width=600, plot_height=600 )
p1.histogram( flowers, var1 )
p1.yaxis.axis_label = 'probability'
p1.xaxis.axis_label = var1


labels = np.sort(flowers[var2].unique())
p2 = figure( plot_width=600, plot_height=600, x_range=labels )
p2.histogram( flowers, var2)
p2.xaxis.axis_label = var2

labels = np.sort(autompg[var3].unique())
p3 = figure( plot_width=1000, plot_height=600, x_range=labels )
p3.histogram( autompg, var3 )
p3.xaxis.axis_label = var3
p3.xaxis.major_label_orientation = "vertical"

show( column( p1, p2, p3 ) )

**<font color="deeppink">Test cases</font>**: Give three scenarios that need testing (bullet points, no implementation required).

## Exercise 2: Working with SPLOMs

The source code for the generalized scatterplot matrix (SPLOM) is stored in file splom.py. 

Usage:
```
p = splom( df, cols=['var1', 'var2', 'var3'], splom_width=1000 )
show(p)
```

Accepted parameters:
- **df** (req): pandas DataFrame
- **splom_width** (opt): total width/height of the plot.
- **cols** (opt): Array of column names to be used in the plot.
- **x_padding** (opt): additional space for the x-axis labels.
- **y_padding** (opt): additional space for the y-axis labels.

Hint:
- The SPLOM supports some interaction. Select points in the scatterplots and look at the results in the other scatterplots.

In [163]:
%run SPLOM.py

ERROR:root:File `'SPLOM.py'` not found.


### Exercise 2a): Baseball data

In [164]:
baseball = pd.read_csv( '../Ex2_explAna/baseball_data.csv')

FileNotFoundError: File b'../Ex2_explAna/baseball_data.csv' does not exist

In [165]:
show( splom( df=baseball, cols=['handedness', 'height', 'weight', 'avg', 'HR'],
             splom_width=1000, x_padding=40, y_padding=80 ) )

NameError: name 'splom' is not defined

### Exercise 2b): Passengers on the Titanic

Remarks:
- For some passengers the age information is missing. The `fillna` command replaces those entries with -1. Feel free to change this treatment of missing values.
- All data is given as quantitative values. To make categorical data easier to distinguish, we turn the respective columns into strings.
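
The sentinel value -1 keeps every passenger in the plot but adds an artificial cluster to the age axis. The sketch below compares it with two common alternatives on made-up rows; only the column names follow the Titanic data, and which option is preferable depends on the question being asked.

```python
import numpy as np
import pandas as pd

# Made-up rows; the values are illustrative only
df = pd.DataFrame({
    'pclass': [1, 3, 2, 3],
    'survived': [1, 0, 1, 0],
    'age': [29.0, np.nan, 40.0, np.nan],
})

# Numeric codes become strings so that they are treated as categories
df['pclass'] = df['pclass'].astype(str)
df['survived'] = df['survived'].astype(str)

# Option 1 (as in this notebook): sentinel value, keeps every row
sentinel = df.fillna(-1)

# Option 2: drop passengers without age, leaves the age axis undistorted
dropped = df.dropna(subset=['age'])

# Option 3: impute with the median of the known ages
imputed = df.assign(age=df['age'].fillna(df['age'].median()))

print(sentinel, dropped, imputed, sep='\n\n')
```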

In [166]:
titanic = pd.read_csv( '../Ex2_explAna/titanic3.csv')

titanic.pclass = titanic.pclass.astype('str')
titanic.survived = titanic.survived.astype('str')

titanic = titanic.fillna(-1)

p = splom( df=titanic, cols=['pclass', 'survived', 'sex', 'age', 'fare'], splom_width=1000, 
           x_padding=40, y_padding=80 )
show( p )

FileNotFoundError: File b'../Ex2_explAna/titanic3.csv' does not exist

## Exercise 3: 

### Option 1: Auto MPG

In [167]:
from bokeh.sampledata.autompg import autompg

autompg.cyl = autompg.cyl.astype('str')
autompg.origin = autompg.origin.astype('str')

show( splom( df=autompg, cols=['mpg', 'cyl', 'displ', 'hp', 'weight', 'accel', 'yr', 'origin'],
             splom_width=1000, x_padding=40, y_padding=80 ) )

NameError: name 'splom' is not defined

### Option 2: Iris flowers

In [168]:
from bokeh.sampledata.iris import flowers
p = splom( df=flowers, splom_width=1000, x_padding=40, y_padding=80 )
show( p )

NameError: name 'splom' is not defined