# Data Visualization (2017/18)

## Solutions for Assignment 3 - Visualizing multivariate data 

Presented by Group 52: 
- Lucia Eve Berger
- Syed Muhammad Kumail Raza

Date: 03.12.2018

## Setup

In [1]:
import pandas as pd
import numpy as np

# import bokeh 
from bokeh.plotting import figure, show, Figure
from bokeh.models import ColumnDataSource, Label
from bokeh.models.glyphs import Text
from bokeh.palettes import Spectral3, d3
from bokeh.layouts import row, column, gridplot

# tell bokeh to show the figures in the notebook
from bokeh.io import output_notebook
output_notebook()

Load data stored in bokeh:

In [2]:
from bokeh.sampledata.autompg import autompg
from bokeh.sampledata.iris import flowers
flowers.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


## Helpful functions

Group a dataframe according to a variable (species) and compute some statistics for a second variable (petal_width).

In [3]:
flowers.groupby(['species']).petal_width.describe()

| species | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| setosa | 50.0 | 0.246 | 0.105386 | 0.1 | 0.2 | 0.2 | 0.3 | 0.6 |
| versicolor | 50.0 | 1.326 | 0.197753 | 1.0 | 1.2 | 1.3 | 1.5 | 1.8 |
| virginica | 50.0 | 2.026 | 0.27465 | 1.4 | 1.8 | 2.0 | 2.3 | 2.5 |


Find the unique values of a categorical variable and count them.

In [4]:
flowers.species.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [5]:
flowers.species.value_counts()

virginica     50
versicolor    50
setosa        50
Name: species, dtype: int64
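
For a categorical histogram the counts are usually turned into relative frequencies. A small sketch on an invented series (the labels below are not the iris data); `normalize=True` is a standard `value_counts` option, and `sort_index()` gives a stable label order.

```python
import pandas as pd

# invented labels for illustration only
species = pd.Series(["setosa", "virginica", "setosa", "versicolor"])

# relative frequency of each label, ordered alphabetically by label
freq = species.value_counts(normalize=True).sort_index()
print(freq)
```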

Use numpy to compute a histogram for quantitative data. See the numpy documentation for further information and for how to work with the output.

In [6]:
np.histogram(flowers.petal_width)

(array([41,  8,  1,  7,  8, 33,  6, 23,  9, 14]),
 array([0.1 , 0.34, 0.58, 0.82, 1.06, 1.3 , 1.54, 1.78, 2.02, 2.26, 2.5 ]))
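
One way to work with this output, sketched on a made-up array rather than the iris column: `np.histogram` returns the bin counts and the bin edges, so there is always one more edge than there are counts.

```python
import numpy as np

# hypothetical sample values, not taken from the iris data
values = np.array([0.1, 0.2, 0.2, 0.4, 1.0, 1.4, 1.5, 2.0, 2.3, 2.5])

counts, edges = np.histogram(values, bins=4)

# bin i spans edges[i] .. edges[i + 1]
left, right = edges[:-1], edges[1:]
centers = (left + right) / 2

# relative frequencies per bin; they sum to 1
probabilities = counts / counts.sum()

for lo, hi, p in zip(left, right, probabilities):
    print(f"{lo:.2f} to {hi:.2f}: {p:.2f}")
```

The `left` and `right` arrays are what a quad glyph needs as its horizontal extents, with `probabilities` as the top.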

## Exercise 1 a): Customize a scatterplot chart

The following code skeleton renders a scatterplot. Customize the chart to your liking. Think, for example, of how the chart should cope with many data points.

This is meant to be a very quick exercise to demonstrate the concept for the following two charts.

Requirements:
- **Parameters**: The function accepts (at least) the following parameters:
    - **source**: a pandas DataFrame object or bokeh ColumnDataSource that holds the data
    - **x**: variable (name as string) to be represented on the x-axis
    - **y**: variable (name as string) to be represented on the y-axis
- **Calling the scatterplot**: The function is a class method of Figure and can be called as follows
```python
p = figure()
p.scatter( data, x, y )
```
This is already set up in the code skeleton below.
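
This calling convention works because a plain function assigned to a class attribute becomes a method of every instance. A minimal sketch with a stand-in class; `Canvas` and `describe_points` are hypothetical names and not part of Bokeh.

```python
class Canvas:
    """Stand-in for a plotting class such as bokeh's Figure."""

    def __init__(self):
        self.calls = []


def describe_points(self, source, x, y, **kwargs):
    # `self` is the instance the method is called on
    self.calls.append((x, y, len(source[x])))
    return self


# attach the function as a method of the class
Canvas.describe_points = describe_points

c = Canvas()
c.describe_points({"a": [1, 2, 3], "b": [4, 5, 6]}, "a", "b")
print(c.calls)
```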

**<font color="deeppink">Update code</font>**

In [7]:
def scatter( self, source, x, y, **kwargs ):
    # access the figure using the self variable
    
    palette = ["#053061", "#2166ac", "#4393c3", "#92c5de", "#000080",
           "#00FFFF", "#fddbc7", "#f4a582", "#d6604d", "#b2182b",
           "#67001f"]
    
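    # map each y value linearly onto one of the 11 palette entries (index 0..10)
    values = source.get(y)
    low = min(values)
    high = max(values)
    span = (high - low) or 1  # avoid a division by zero when all values are equal
    colors_inds = [int(10 * (v - low) / span) for v in values]

    source['colors'] = [palette[i] for i in colors_inds]    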
    self.circle( source=source, x=x, y=y, size=3, fill_alpha=1, color='colors')
    
    
    label = Label(x=50, y=50, x_units='screen', y_units='screen',
                  render_mode='css')
    label2 = Label(x=200, y=200, x_units='screen', y_units='screen',
                  render_mode='css' )
    label3 = Label(x=300, y=300, x_units='screen', y_units='screen',
                  render_mode='css' )
        
    self.add_layout(label)
    self.add_layout(label2)
    self.add_layout(label3)
    self.xaxis.axis_label = x
    self.yaxis.axis_label = y
    self.title.align = "center"

# add the function as class method to Figure    
Figure.scatter = scatter

**<font color="deeppink">Check</font>** that your code is working:

In [8]:
p = figure( plot_width=500, plot_height=500, title = "Petal Length and Width by Species")
p.scatter(source=flowers, x='petal_width', y='petal_length')
show(p)

**<font color="deeppink">Test cases</font>**: Give three scenarios that need testing (bullet points, no implementation required). Think of scenarios where your code may fail.
- Test case 1: NaN values. The color index computation may fail when the data contain NaN.
- Test case 2: Outliers. Outlying values are not highlighted and are hard to detect from the sample.
- Test case 3: Color map. The color map only works for the defined palette size; if the data supply were augmented, some points might not be assigned a color from the map.
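
The first and third scenarios could be probed with a small helper that maps values to palette indices independently of any plotting. This is a sketch, not the code used above: `palette_indices` is a hypothetical helper, and sending NaN or constant data to index 0 is an arbitrary choice made for illustration.

```python
import math


def palette_indices(values, n_colors):
    """Map each value linearly onto an index in 0 .. n_colors - 1."""
    finite = [v for v in values if not math.isnan(v)]
    if not finite:
        return [0] * len(values)
    low, high = min(finite), max(finite)
    span = high - low
    indices = []
    for v in values:
        if math.isnan(v) or span == 0:
            # assumption: NaN and constant data fall back to the first color
            indices.append(0)
        else:
            indices.append(int((n_colors - 1) * (v - low) / span))
    return indices


print(palette_indices([1.0, 2.0, 3.0], 11))
print(palette_indices([1.0, float("nan"), 3.0], 11))
print(palette_indices([5.0, 5.0], 11))
```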

## Exercise 1 b): Implement a boxplot chart

Requirements:
- **Parameters**: The function accepts (at least) the following parameters:
    - **source**: a pandas DataFrame object that holds the data
    - **x**: variable (name as string) to be represented on the x-axis
    - **y**: variable (name as string) to be represented on the y-axis
- **Orientation**: Provide boxplots with horizontal and vertical orientation (call them hboxplot and vboxplot).
- **Calling the boxplot**: The function is a class method of Figure and can be called as follows
```python
p = figure()
p.vboxplot( data, x, y )
```
This is already set up in the code skeleton below.

Hints:
- A Bokeh sample implementation can be found here: [Boxplot](https://bokeh.pydata.org/en/latest/docs/gallery/boxplot.html)
- Adapt this implementation to work on the target variable only. See code below to get started.

**<font color="deeppink">Implement</font>**

In [9]:
## checks the outliers

def vboxplot( self, source, x, y, **kwargs ):
    if not isinstance(source, pd.DataFrame ):
        raise TypeError("source has to be a pandas DataFrame.")
    # sort the categories so they line up with the (sorted) groupby results below
    v_axis_type = np.sort(source.get(x).unique())
    df = pd.DataFrame(dict(score=source.get(y), group=source.get(x)))
    groups = df.groupby('group')
    q1 = groups.quantile(q=0.25)
    q2 = groups.quantile(q=0.5)
    q3 = groups.quantile(q=0.75)
    iqr = q3 - q1
    upper = q3 + 1.5*iqr
    lower = q1 - 1.5*iqr
    # find the outliers for each category
    def outliers(group):
        group_name = group.name
        return group[(group.score > upper.loc[group_name]['score']) | (group.score < lower.loc[group_name]['score'])]['score']
    
    
    out = groups.apply(outliers).dropna()

    # prepare outlier data for plotting, we need coordinates for every outlier.
    if not out.empty:
        outx = []
        outy = []
        for keys in out.index:
            outx.append(keys[0])
            outy.append(out.loc[keys[0]].loc[keys[1]])
    # if no outliers, shrink lengths of stems to be no longer than the minimums or maximums
    qmin = groups.quantile(q=0.00)
    qmax = groups.quantile(q=1.00)
    upper.score = [min([x,y]) for (x,y) in zip(list(qmax.loc[:,'score']),upper.score)]
    lower.score = [max([x,y]) for (x,y) in zip(list(qmin.loc[:,'score']),lower.score)]

    # stems
    self.segment(v_axis_type, upper.score, v_axis_type, q3.score, line_color="black")
    self.segment(v_axis_type, lower.score, v_axis_type, q1.score, line_color="black")

    # boxes
    self.vbar(v_axis_type, 0.7, q2.score, q3.score, fill_color="#00FFFF", line_color="black")
    self.vbar(v_axis_type, 0.7, q1.score, q2.score, fill_color="#000080", line_color="black")

    # whiskers (almost-0 height rects simpler than segments)
    self.rect(v_axis_type, lower.score, 0.2, 0.01, line_color="black")
    self.rect(v_axis_type, upper.score, 0.2, 0.01, line_color="black")

    # outliers
    if not out.empty:
        self.circle(outx, outy, size=6, color="#F38630", fill_alpha=0.6)

    self.xgrid.grid_line_color = "grey"
    self.ygrid.grid_line_color = "grey"
    self.grid.grid_line_width = 0.1
    self.xaxis.major_label_text_font_size="10pt"   

Figure.vboxplot = vboxplot
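
The whisker and outlier rule used here (1.5 times the interquartile range beyond the quartiles, clipped to the observed minimum and maximum) can be checked on a single column without plotting. A minimal sketch; `box_stats` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def box_stats(values):
    """Return quartiles, clipped whisker ends, and outliers for one group."""
    s = pd.Series(values, dtype="float64")
    q1, q2, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = s[(s < lower_fence) | (s > upper_fence)]
    return {
        "q1": q1, "q2": q2, "q3": q3,
        # whiskers never extend past the observed data
        "lower": max(lower_fence, s.min()),
        "upper": min(upper_fence, s.max()),
        "outliers": outliers.tolist(),
    }


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

The helper does not handle an empty input, which is one of the scenarios worth testing for the plotting functions as well.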

In [10]:
def hboxplot( self, source, x, y, **kwargs ):
    if not isinstance(source, pd.DataFrame ):
        raise TypeError("source has to be a pandas DataFrame.")
    # sort the categories so they line up with the (sorted) groupby results below
    v_axis_type = np.sort(source.get(y).unique())
    df = pd.DataFrame(dict(score=source.get(x), group=source.get(y)))
    groups = df.groupby('group')
    q1 = groups.quantile(q=0.25)
    q2 = groups.quantile(q=0.5)
    q3 = groups.quantile(q=0.75)
    iqr = q3 - q1
    upper = q3 + 1.5*iqr
    lower = q1 - 1.5*iqr
    # find the outliers for each category
    def outliers(group):
        group_name = group.name
        return group[(group.score > upper.loc[group_name]['score']) | (group.score < lower.loc[group_name]['score'])]['score']
    
    
    out = groups.apply(outliers).dropna()

    # prepare outlier data for plotting, we need coordinates for every outlier.
    if not out.empty:
        outx = []
        outy = []
        for keys in out.index:
            outx.append(keys[0])
            outy.append(out.loc[keys[0]].loc[keys[1]])
    
    # if no outliers, shrink lengths of stems to be no longer than the minimums or maximums
    qmin = groups.quantile(q=0.00)
    qmax = groups.quantile(q=1.00)
    upper.score = [min([x,y]) for (x,y) in zip(list(qmax.loc[:,'score']),upper.score)]
    lower.score = [max([x,y]) for (x,y) in zip(list(qmin.loc[:,'score']),lower.score)]    
    # boxes
    self.hbar(v_axis_type, 0.7, q2.score, q3.score, fill_color="#00FFFF", line_color="black")
    self.hbar(v_axis_type, 0.7, q1.score, q2.score, fill_color="#000080", line_color="black")
    
    # whiskers (almost-0 height rects simpler than segments)
    self.rect(lower.score,v_axis_type, 0.01, 0.2, line_color="black")
    self.rect(upper.score,v_axis_type, 0.01, 0.2, line_color="black")

    # stems: horizontal segments, so the score is the x coordinate and the category the y coordinate
    self.segment(upper.score, v_axis_type, q3.score, v_axis_type, line_color="black")
    self.segment(lower.score, v_axis_type, q1.score, v_axis_type, line_color="black")

    
    # outliers
    if not out.empty:
        self.circle(outy, outx, size=6, color="#F38630", fill_alpha=0.6)

    self.xgrid.grid_line_color = "grey"
    self.ygrid.grid_line_color = "grey"
    self.grid.grid_line_width = 0.1
    self.xaxis.major_label_text_font_size="10pt"

Figure.hboxplot = hboxplot

**<font color="deeppink">Check</font>** your boxplot

In [11]:
p1 = figure( plot_width=400, plot_height=400, y_range=['setosa', 'versicolor', 'virginica'] )
p1.hboxplot( flowers, 'petal_width', 'species' )
p1.xaxis.axis_label = 'petal_width'
p1.yaxis.axis_label = 'species'

p2 = figure( plot_width=400, plot_height=400, x_range=['setosa', 'versicolor', 'virginica'] )
p2.vboxplot( flowers, 'species', 'petal_width' )
p2.yaxis.axis_label = 'petal_width'
p2.xaxis.axis_label = 'species'

show(row(p1,p2))


**<font color="deeppink">Test cases</font>**: Give three scenarios that need testing (bullet points, no implementation required).
- Test case 1: Empty dataset
- Test case 2: No categorical x_range (or y_range for the horizontal version) set on the figure
- Test case 3: Different quantiles. The quartiles are hard coded at 0.25, 0.5 and 0.75, so the function would not work with other ranges.

## Exercise 1 c): Implement a histogram chart

Requirements:
- **Parameters**: The function accepts (at least) the following parameters:
    - **source**: a pandas DataFrame object that holds the data
    - **x**: variable (name as string) to be represented on the x-axis
    - **nbins**: number of bins (optional argument). If not provided set a meaningful default.
- **Data type**: Provide histograms for categorical and quantitative data.
- **Scaling**: The y-axis shall give probabilities (0,1). Scale the axis to show the full range, e.g., (-0.05,1.05).
- **Calling the histogram**: The function is a class method of Figure and can be called as follows
```python
p = figure()
p.histogram( data, x )
```

Hints:
- Assume that all categorical data has type string. Respective columns in the data can be converted using:
```
df.var = df.var.astype('str')
```
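
A short sketch of this conversion and of one way to tell categorical from quantitative columns afterwards. The frame `df` and its values are invented for illustration; `is_numeric_dtype` is the pandas helper used for the check.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# hypothetical frame: `cyl` is stored as numbers but is really a category
df = pd.DataFrame({"cyl": [4, 6, 8, 4], "mpg": [30.5, 21.0, 15.2, 28.1]})

# bracket access avoids clashes with DataFrame attributes such as `var`
df["cyl"] = df["cyl"].astype("str")

kinds = {
    col: "quantitative" if is_numeric_dtype(df[col]) else "categorical"
    for col in df.columns
}
print(kinds)
```

Checking the column dtype rather than a single value treats integer and float columns alike.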

**<font color="deeppink">Implement</font>**

In [12]:
from bokeh.models import Range1d, FactorRange

def histogram( self, source, x, nbins=0, *args, **kwargs ):  
    if not isinstance(source, pd.DataFrame ):
        raise TypeError("source has to be a pandas.DataFrame. Received {}".format(type(source)))
    
    # in case of none
    if nbins == 0:
        nbins = 30
    
    data = source[x]
    if isinstance(data[1], float):
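        # bin counts divided by the total give per-bin probabilities in [0, 1]
        # (density=True would give a probability density, which can exceed 1)
        counts, edges = np.histogram(data, bins=nbins)
        hist = counts / counts.sum()
        self.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:])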
        label = Label( x=50, y=50, x_units='screen', y_units='screen',
                     render_mode='css')
        self.add_layout(label)
    else:
        
        categorical_data = data.value_counts().to_frame('count')/len(data)
        # the category names are taken from the index of the value counts
        self.vbar(source=categorical_data, x='index', width=1, bottom=0, top='count',fill_color='cornflowerblue')
        
Figure.histogram = histogram

**<font color="deeppink">Check</font>** your histogram

In [13]:
var1 = 'sepal_length'
var2 = 'species'
var3 = 'name'

p1 = figure( plot_width=600, plot_height=600 )
p1.histogram( flowers, var1, nbins=35)
p1.yaxis.axis_label = 'probability'
p1.xaxis.axis_label = var1


labels = np.sort(flowers[var2].unique())
p2 = figure( plot_width=600, plot_height=600, x_range=labels )
p2.histogram( flowers, var2)
p2.xaxis.axis_label = var2

labels = np.sort(autompg[var3].unique())
p3 = figure( plot_width=2500, plot_height=600, x_range=labels )
p3.histogram( autompg, var3 )
p3.xaxis.axis_label = var3
p3.xaxis.major_label_orientation = "vertical"

show( column( p1, p2, p3 ) )

**<font color="deeppink">Test cases</font>**: Give three scenarios that need testing (bullet points, no implementation required).
- Different data types: If the data column holds integers rather than floats, the float check does not match and the quantitative branch is skipped (different data types may cause errors).
- Huge/complex datasets: When a dataset has too many categories to show, it needs to be truncated or grouped (e.g. by make, such as amc or audi). This is an issue with the third plot.
- More than two variables: Plotting several histograms together for comparison would fail, as the function is not built for more complex datasets.
- Number of bins: We used 35 bins, with a default of 30. For some data this could be a poor choice that hides patterns or misses them, depending on the case.
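
For the bin-count scenario, NumPy can derive the number of bins from the data instead of relying on a fixed default. A sketch, assuming a data-driven rule is acceptable for the exercise; the sample below is randomly generated and not taken from the datasets above.

```python
import numpy as np

rng = np.random.RandomState(0)
sample = rng.normal(loc=5.8, scale=0.8, size=150)  # synthetic values

# bins="auto" lets numpy pick the bin count from the data
# (it combines the Sturges and Freedman-Diaconis estimators)
counts, edges = np.histogram(sample, bins="auto")
probabilities = counts / counts.sum()

print(len(counts), "bins")
print(probabilities.round(3))
```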



## Exercise 2: Working with SPLOMs

The source code for the generalized scatterplot matrix (SPLOM) is stored in file splom.py. 

Usage:
```
p = splom( df, cols=['var1', 'var2', 'var3'], splom_width=1000 )
show(p)
```

Accepted parameters:
- **source** (req): pandas DataFrame
- **splom_width** (opt): total width/height of the plot.
- **cols** (opt): Array of column names to be used in the plot.
- **x_padding** (opt): additional space for the x-axis labels.
- **y_padding** (opt): additional space for the y-axis labels.

Hint:
- The SPLOM supports some interaction. Select points in the scatterplots and look at the results in the other scatterplots.

In [14]:
%run SPLOM.py

### Exercise 2a): Baseball data

In [15]:
baseball = pd.read_csv( 'baseball_data.csv')

In [16]:
show( splom( df=baseball, cols=['handedness', 'height', 'weight', 'avg', 'HR'],
             splom_width=1000, x_padding=40, y_padding=80 ) )

categorical attributes ['handedness']


* The handedness histogram is strongly unbalanced: there are far more right-handed players than left-handed or both-handed ones.
* The height histogram is close to a normal distribution with a slight right skew. Since height is a natural property, a normal distribution is expected. It also suggests that the ideal height for baseball is around 72-74 units, as the height-HR scatterplot shows most home runs at this height.
* The weight histogram is approximately normal. Around 50% of the players weigh 180-200 units, which is also evident from the weight-handedness boxplot. The weight range with the most home runs is about the same (180-200), as shown in the weight-HR scatterplot.
* The avg histogram shows two distinct peaks. This typically means that the avg data can be divided into two groups: players with an avg of 0 to 0.05 and players with an avg of 0.15 to 0.3. This can also be seen in the HR-avg scatterplot, where the data are concentrated in the latter range. The avg-handedness boxplot also has many outliers, mainly near an avg of 0.
* The HR histogram is right-skewed, which again shows that most players have 0-50 home runs. This is also evident from the HR-handedness boxplot, which shows that 50% of the HR values for left-, right- and both-handed players lie in about the same range (0-50).

### Exercise 2b): Passengers on the Titanic

Remarks:
- Age information is missing for some passengers; the `fillna` command replaces those entries with -1. Feel free to change this treatment of missing values.
- All data is given as quantitative values. To make distinction of categorical data easier, we turn them into strings.
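
One possible change to the missing-value treatment, sketched on a made-up frame (the rows below are not from `titanic3.csv`): keep missing ages as NaN and drop those rows only for age-specific plots, so they do not appear as a spurious group at -1.

```python
import numpy as np
import pandas as pd

# hypothetical rows for illustration only
df = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "age": [29.0, np.nan, 40.0, np.nan],
    "fare": [211.3, 7.9, 26.0, 8.1],
})

# option A (as in the notebook): sentinel value, shows up as a group at -1
with_sentinel = df.fillna(-1)

# option B: keep NaN and drop those rows only when age is needed
age_known = df.dropna(subset=["age"])

print(with_sentinel["age"].tolist())
print(len(age_known), "rows with known age")
print("mean age:", age_known["age"].mean())
```

With the sentinel, summary statistics of age are pulled towards -1, which is worth keeping in mind when reading the age plots.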

In [17]:
titanic = pd.read_csv( 'titanic3.csv')

titanic = titanic.fillna(-1)
titanic.pclass = titanic.pclass.astype('str')
titanic.survived = titanic.survived.astype('str')



p = splom( df=titanic, cols=['pclass', 'survived', 'sex', 'age', 'fare'], splom_width=1000, 
           x_padding=40, y_padding=80 )
show( p )

categorical attributes ['pclass', 'survived', 'sex']


* Passengers in 1st class paid the highest fares, with one outlier at 500 USD. They also had the most survivors, around 200. The average 1st class passenger was between 22 and 45 years old. Passengers in 2nd class paid average fares of around 50 USD and had fewer survivors; they were mostly young adults and adults, with some exceptions. The 3rd class had the lowest fares, the largest number of passengers and the lowest survival rate. Their average age was 20, with many teenagers, infants and young adults.
* Male and female passengers were distributed almost equally with respect to age; however, there were many more male than female passengers (keeping in mind that some age information is missing). This is visible in the 
* The 3rd class had a significantly higher number of male passengers and more than 500 casualties (male and female combined). These are visible in the histograms of passenger class against the other variables. This is understandable, as passengers from the higher classes (especially women) were given priority for the lifeboats. The histogram of sex against survived supports this: the survival rate of women is much higher than that of men.
* The differences between classes are visible in the box plots of fare and age and in the histograms of sex and survived.

## Exercise 3: 

### Option 1: Auto MPG

In [18]:
from bokeh.sampledata.autompg import autompg

autompg.cyl = autompg.cyl.astype('str')
autompg.origin = autompg.origin.astype('str')

show( splom( df=autompg, cols=['mpg', 'cyl', 'displ', 'hp', 'weight', 'accel', 'yr', 'origin'],
             splom_width=1000, x_padding=40, y_padding=80 ) )

categorical attributes ['cyl', 'origin']


## Data Description
*  The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

1. mpg: continuous 
2. cylinders: multi-valued discrete 
3. displacement: continuous 
4. horsepower: continuous 
5. weight: continuous 
6. acceleration: continuous 
7. model year: multi-valued discrete 
8. origin: multi-valued discrete 
9. car name: string (unique for each instance)


## Analysis Protocol & Summary
### 1. Effect on MPG of Engine Displacement, Horsepower, Weight, Acceleration, Year of manufacture.

Looking at the scatter plots of the different variables against mpg, we make the following observations:
* Engine displacement has a negative (inverse) relationship with MPG, as one would expect, because a larger engine displacement corresponds to greater fuel consumption and hence lower MPG.
* Horsepower also shows a non-linear decreasing curve, i.e. an inverse relationship. As the horsepower increases the MPG goes down, since vehicles with higher horsepower are fuel hungry.
* Weight has a significant effect on MPG, as the power-to-weight ratio decides how much fuel the car will eventually use. This relationship is also negative: the heavier the vehicle, the more powerful the engine it needs to move, hence lower MPG.
* Acceleration doesn't show a clear relationship with MPG, but the scatter plot contains a lot of outliers. A low acceleration value (i.e. a short 0-60 mph time) can mean one of two things. Either the vehicle is light with a powerful engine, hence lower MPG; sports cars and supercars fall in this category. Or the vehicle is heavy but has a powerful engine, hence lower MPG; hypercars fall into this category. The relationship of acceleration with weight and hp would tell us more about the MPG.
* Year of manufacture doesn't show a clear relationship with MPG, but the general trend is increasing, i.e. the MPG has improved over the years.
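
The direction of such relationships can be summarized numerically with a rank correlation, which also copes with non-linear but monotonic trends like the horsepower curve. The sketch below uses a small invented frame (not the autompg data) and computes Spearman's coefficient as the Pearson correlation of the ranks.

```python
import pandas as pd

# invented values that only mimic the qualitative trends described above
cars = pd.DataFrame({
    "mpg":    [36.0, 31.0, 26.0, 20.0, 16.0, 13.0],
    "hp":     [60, 70, 90, 110, 150, 200],
    "weight": [1800, 2100, 2500, 3000, 3800, 4400],
    "yr":     [80, 78, 79, 74, 72, 70],
})

# Spearman correlation = Pearson correlation of the ranked values
spearman = cars.rank().corr()
print(spearman["mpg"].round(2))
```

Values near -1 indicate a decreasing relationship and values near +1 an increasing one; applying the same two lines to the real columns would be the actual check.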

### 2. Other relationships between variables:
* The box plot of MPG against cylinders shows some interesting aspects. Most of the 8 cylinder vehicles (powerful V8 models) have very low MPG. Some outliers do exist, though, which reach at most 25 MPG, which is impressive.
* There are some 4 cylinder cars which give more MPG than the average 3 cylinder car. One of them gives more than 45 MPG! Not surprisingly, most 8 cylinder cars have higher displacement, higher horsepower and a lower 0-60 time. Unexpectedly, poor acceleration is reported for 5 and 4 cylinder cars.
* Looking at the box plots, the trend over the years has shifted from cars with a higher number of cylinders towards more fuel efficient cars with fewer cylinders.
* The box plots of origin against weight, horsepower and displacement tell us something about the trend of vehicles in the 3 regions, namely US, Europe and Japan. The plot for the US is skewed towards large, heavy, powerful vehicles. The Europeans have the more 'European' attitude towards cars, i.e. normal acceleration times with average sized, moderately powerful vehicles with good fuel economy. The Japanese, on the other hand, have cars with a sole focus on fuel efficiency and small size, which inevitably means higher 0-60 times, lower power and smaller engine displacements.
* The histograms tell us the distribution of cars over the different variables. Most cars seem to have 4 cylinders, average displacement and acceleration times (0-60 mph) in the range of 14-16 secs.

### 3. Summary:
* The data show the distribution and the properties of vehicles in 3 different regions of the world. The automotive industry can be divided into the 3 attitudes (if you will) of these regions, namely American, European and Japanese. The Americans like large, powerful, heavy cars and don't seem to care much about fuel economy. The Europeans prefer average sized, moderately powerful, fuel efficient cars with good acceleration times. The Japanese car industry has cars with higher fuel economy and small size, but less power and poor acceleration times.