# Exploratory Data Analysis
------

## Problem Statement

We want to use the given images to create a time-lapse video showing the vegetation in all four seasons. However, the dataset includes some bad images, e.g. images taken while the shutters were closed, so we have to clean the dataset first. This leads us to an **image classification task**. We want to examine and compare multiple **classification methods** and choose the one that best fits our needs. Let's first take a closer look at the *dataset attributes* and at which *features* can be used for image classification.

> Note: We will refer to images showing the garden as **open** and images showing closed shutters as **closed**.

## What Kind of Data do We Have

As we saw in the [project-context notebook](./1.0-project-context.ipynb), the dataset consists of images, see the examples below. The images are saved as JPEG files. Each file name follows the pattern **pic_&lt;timestamp-in-seconds&gt;.jpg**, e.g. **pic_1492067569.jpg**.

<table>
<tr>
    <td> <figure>
            <img src="../data/raw/example/open/pic_1492067569.jpg" width=350/>
            <figcaption>Example 1</figcaption>
        </figure>
    </td>
    <td>
        <figure>
        <img src="../data/raw/example/open/pic_1495255809.jpg" width=350/>
        <figcaption>Example 2</figcaption>
        </figure>
    </td>
    <td>
        <figure>
        <img src="../data/raw/example/closed/pic_1494511609.jpg" width=350/>
        <figcaption>Example 3</figcaption>
        </figure>
    </td>
</tr>
</table>
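
The naming pattern described above can be turned back into a capture time. The following is a minimal sketch, assuming the number in the file name counts seconds since the Unix epoch; the names `FILENAME_PATTERN` and `capture_time` are made up for this illustration.

```python
import re
from datetime import datetime, timezone

# Naming scheme from the text: pic_<timestamp-in-seconds>.jpg
FILENAME_PATTERN = re.compile(r"^pic_(\d{10,})\.jpg$")

def capture_time(filename):
    """Return the capture time encoded in a file name as a UTC datetime.

    Returns None for names that do not follow the pattern.
    Assumption: the timestamp counts seconds since the Unix epoch.
    """
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)

print(capture_time("pic_1492067569.jpg"))  # 2017-04-13 07:12:49+00:00
```

The result is given in UTC; the camera's local time zone is not stated in this notebook, so local times may be shifted by a few hours.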

## What Information or Possible Features do We Have

**Information**
- Directory structure
- Time range of images captured
- Time difference in seconds between image captures
- Date and time when images were taken

**Possible Features**
- EXIF data
- Image file size
- Image data, i.e. the pixel values in RGB

In the following sections we give a quick overview of the available information and check whether the possible features are suitable for the classification task.

## Directory Structure

Pictures were saved in a separate directory for each day, so the directory structure looks something like this.

```
$ tree .
.
├── ...
├── 2017-05-07_auto
├── 2017-05-08_auto
├── 2017-05-09_auto
├── 2017-05-10_auto
├── ...
├── 2017-05-27_auto
├── 2017-05-28_auto
├── 2017-05-29_auto
├── 2017-05-30_auto
├── 2017-05-31_auto
├── ...
├── 2017-08-06_auto
├── 2017-12-24_auto
├── 2017-12-31_auto
│   ├── pic_1514702547.jpg
│   ├── pic_1514702617.jpg
│   ├── pic_1514702727.jpg
│   ├── pic_1514702797.jpg
│   ├── pic_1514702879.jpg
│   ├── ...
└── ...
```

This directory structure makes it a bit easier to handle the huge number of files. File managers create thumbnails to show previews, which is quite computationally intensive for directories containing more than 1000 images and makes browsing through the images slow. With one directory per day it works quite well.
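
To illustrate the layout, the sketch below maps timestamps to per-day directory names. It is not the script that created the directories; the helper names are hypothetical, and it assumes the day is determined in UTC, whereas the camera may well have used local time.

```python
from collections import defaultdict
from datetime import datetime, timezone

def day_directory(timestamp):
    """Map a Unix timestamp to a directory name such as 2017-12-31_auto.

    Assumption: the day is taken in UTC.
    """
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
    return "{}_auto".format(day.isoformat())

def group_by_day(timestamps):
    """Group image file names by the directory they would be stored in."""
    groups = defaultdict(list)
    for ts in sorted(timestamps):
        groups[day_directory(ts)].append("pic_{}.jpg".format(ts))
    return dict(groups)

layout = group_by_day([1514702617, 1514702547, 1492067569])
for directory, files in layout.items():
    print(directory, files)
```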

## Time Range of Images Captured

As mentioned before, images were captured over roughly one year. The first image was taken on **2017-03-07** and the last on **2018-04-04**, so we have images showing all four seasons, namely **spring**, **summer**, **autumn** and **winter**, as well as different weather conditions, see the following examples.

> Note: Yes, unfortunately the camera was moved a little bit, as we can see in the two rightmost pictures.

<table>
<tr>
    <td>
        <img src="../data/raw/example/time-range/spring-midday-pic_1492083489.jpg" width=350/>
        <figcaption>Spring</figcaption>
    </td>
    <td>
        <img src="../data/raw/example/time-range/summer-midday-pic_1495800549.jpg" width=350/>
        <figcaption>Summer</figcaption>
    </td>
    <td>
        <img src="../data/raw/example/time-range/autumn-midday-pic_1509968123.jpg" width=350/>
        <figcaption>Autumn</figcaption>
    </td>
    <td>
        <img src="../data/raw/example/time-range/winter-midday-pic_1518610019.jpg" width=350/>
        <figcaption>Winter</figcaption>
    </td>
</tr>
</table>

## Time Difference Between Image Captures

Just to get an overview of what we have, we want to determine the mean time between captured images. First, let's list the first *n* files of an image directory.

In [1]:
import os

IMG_DIR = "../data/raw/raw"
FILE_COUNT = 10
DIR_COUNT = 1

# print first <file_count> directory entries of <dir_count> directories
dir_counter = 0
for dirpath, dirnames, filenames in os.walk(IMG_DIR, topdown=True):
    for idx, file in enumerate(filenames):
        if idx < FILE_COUNT:
            print("{}/{}".format(dirpath,file))
    # only analyze first <dir_count> directories of os.walk
    dir_counter += 1
    if dir_counter > DIR_COUNT:
        break

../data/raw/raw/2017-04-13_auto/pic_1492061169.jpg
../data/raw/raw/2017-04-13_auto/pic_1492056769.jpg
../data/raw/raw/2017-04-13_auto/pic_1492061348.jpg
../data/raw/raw/2017-04-13_auto/pic_1492056849.jpg
../data/raw/raw/2017-04-13_auto/pic_1492061269.jpg
../data/raw/raw/2017-04-13_auto/pic_1492056949.jpg
../data/raw/raw/2017-04-13_auto/pic_1492061409.jpg
../data/raw/raw/2017-04-13_auto/pic_1492057029.jpg
../data/raw/raw/2017-04-13_auto/pic_1492061519.jpg
../data/raw/raw/2017-04-13_auto/pic_1492057099.jpg


Next, we extract the timestamps from the filenames and calculate statistics for the time between two consecutive captures, such as **mean**, **std** and **quantiles**.

In [7]:
import os
import sys
import re
import numpy as np

IMG_DIR = "../data/raw/raw"
FILE_COUNT = sys.maxsize
SUB_DIR_COUNT = 1

# define a regex to get the timestamp
PATTERN_TIMESTAMP = re.compile(r"^pic_(\d{10,})\.jpg$")

def extract_timestamp_from_filenames(
    path, sub_dir_count = 1,
    file_count = sys.maxsize,
    pattern_timestamp = PATTERN_TIMESTAMP,
    pattern_group = 1
):
    # extract timestamps
    dir_counter = 0
    timestamps = np.empty(0)
    for dirpath, dirnames, filenames in os.walk(path, topdown=True):
        # only analyze first <sub_dir_count> directories of os.walk
        if dir_counter > sub_dir_count:
            break
        dir_counter += 1

        # extract timestamp
        for idx, file in enumerate(filenames):
            if idx >= file_count:
                break
            match = re.match(pattern_timestamp, file)
            if match is None:
                # skip files that do not follow the naming pattern
                continue
            timestamp = match.group(pattern_group)
            timestamps = np.append(timestamps, timestamp)
    return timestamps

def calculate_timestamp_diffs(timestamps):
    # convert to integers first, so that sorting is numeric rather than lexicographic
    sorted_timestamps = np.sort(np.asarray(timestamps).astype(int))

    return np.diff(sorted_timestamps)
            
def timestamp_diff_statistics(timestamp_diffs):
    mean_timestamp = np.mean(timestamp_diffs)
    std_timestamp = np.std(timestamp_diffs)
    quantiles_timestamp = np.quantile(timestamp_diffs, q=[.25, .5, .75, 1])
    return (mean_timestamp, std_timestamp, quantiles_timestamp)


## use functions
timestamps = extract_timestamp_from_filenames(IMG_DIR, sub_dir_count=SUB_DIR_COUNT, file_count=FILE_COUNT)
#timestamps = extract_timestamp_from_filenames("../data/raw/raw/2017-05-20_auto", sub_dir_count=1)
timestamp_diffs = calculate_timestamp_diffs(timestamps)

ts_mean, ts_std, ts_quantiles = timestamp_diff_statistics(timestamp_diffs)
print("Mean time between image captures in {}: {:.2f}s (std {:.2f}s, quartiles in s {})".format(
        IMG_DIR,
        ts_mean, ts_std, ts_quantiles)
)


Mean time between image captures in ../data/raw/raw: 89.93s (std 17.31s, quartiles in s [ 80.   85.5 100.  120. ])
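
Note that the file listing printed earlier is not in chronological order, which is why the timestamps have to be sorted before taking differences. As a small worked example, the sketch below uses only the ten timestamps from that listing and the standard library. The large gap in the result presumably arises because these ten files are just the first entries returned by `os.walk`, not ten consecutive captures.

```python
from statistics import mean, median

# The ten timestamps from the file listing, in the order os.walk returned them
timestamps = [
    1492061169, 1492056769, 1492061348, 1492056849, 1492061269,
    1492056949, 1492061409, 1492057029, 1492061519, 1492057099,
]

# Sort first, then take the difference between neighbouring timestamps
ordered = sorted(timestamps)
diffs = [later - earlier for earlier, later in zip(ordered, ordered[1:])]

print(diffs)                  # [80, 100, 80, 70, 4070, 100, 79, 61, 110]
print(median(diffs))          # 80
print(round(mean(diffs), 2))  # 527.78
```

The single gap of 4070 s pulls the mean far away from the typical interval, while the median stays at 80 s. This is one reason to look at quantiles and histograms in addition to the mean.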


In [14]:
import plotly.express as px
import dash_core_components as dcc
import dash_html_components as html
import pandas as pd
import os

from jupyter_dash import JupyterDash
from dash.dependencies import Input, Output

IMG_ROOT_DIR = "../data/raw/raw"

# get all subdirectories, which can be selected from dropdown
day_dates_images_captured = sorted(os.listdir(IMG_ROOT_DIR))

# Build App
app = JupyterDash(__name__)
app.layout = html.Div([
    html.H4("Time Differences Between Image Captures"),
    dcc.Graph(id='graph'),
    html.Label([
        "Capture Date",
        dcc.Dropdown(
            id = 'capture-date-dropdown',
            clearable = False,
            value = '2017-05-13_auto',
            options = [
                {'label': c, 'value': c}
                for c in day_dates_images_captured
            ])
    ]),
])

# Define callback to update graph
@app.callback(
    Output('graph', 'figure'),
    [Input("capture-date-dropdown", "value")]
)
def update_figure(capture_date):
    return get_figure(capture_date)

def get_figure(capture_date):
    # Load Data, use one subdirectory of IMG_ROOT_DIR
    path = os.path.sep.join( (IMG_ROOT_DIR, capture_date) )
    timestamps = extract_timestamp_from_filenames(path, sub_dir_count=1)
    timestamp_diffs = calculate_timestamp_diffs(timestamps)

    # create the bins
    df = pd.DataFrame(data=timestamp_diffs, columns=["diffs"])
    return px.histogram(df, x="diffs", nbins=10, labels={'diffs':'time between captures in seconds', 'y':'count'})

# Run app and display result inline in the notebook
## Workaround: set load_dotenv=False to prevent current working directory to be changed, which will break
## relative paths on multiple Jupyter-Notebook cell runs, e.g. IMG_DIR
## for more information, see https://github.com/pallets/flask/blob/master/src/flask/app.py #function run(...)
app.run_server(mode='inline', load_dotenv=False)
##app.run_server(mode='inline')

## export for offline use
#fig = get_figure('2017-05-15_auto')
#fig.show(renderer='svg')
#fig.write_image("exports/example-time-between-captures-2017-05-15.svg")

### Todo

- Explain why we examined time differences between image captures, e.g. for the later time-lapse video creation it matters which maximum and minimum time resolution is possible
- Describe the results of the timestamp diffs analysis
- Describe what can be seen in the histograms
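
As a first step towards the open points above, the following sketch shows how the capture interval translates into time-lapse playback. The mean interval of 89.93 s is taken from the output above; the frame rate of 25 frames per second and both function names are assumptions made for this illustration only.

```python
MEAN_INTERVAL_S = 89.93  # mean time between captures, from the statistics above
FPS = 25                 # assumed playback frame rate, not part of the dataset

def speedup(interval_s, fps):
    """Factor by which real time is compressed when every image becomes one frame."""
    return interval_s * fps

def video_seconds(real_seconds, interval_s, fps):
    """Length of the video that covers the given span of real time."""
    frames = real_seconds / interval_s
    return frames / fps

print("speed-up factor: {:.2f}".format(speedup(MEAN_INTERVAL_S, FPS)))
print("one hour of real time: {:.2f} s of video".format(
    video_seconds(3600, MEAN_INTERVAL_S, FPS)))
```

Under these assumptions one hour of real time shrinks to roughly 1.6 seconds of video, so the capture interval sets the finest time resolution the video can show.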

## EXIF Data

When taking pictures with digital cameras, it is most likely that some metadata is saved within the image files, see [Wikipedia#EXIF](https://de.wikipedia.org/wiki/Exchangeable_Image_File_Format).

The following table contains some examples of the EXIF data found in the images.

<table>
    <tr>
        <th>Attribute</th>
        <th>Example Value</th>
        <th>EXIF Tag</th>
    </tr>
    <tr>
        <td>Image resolution (WxH)</td>
        <td>2592px x 1944px</td>
        <td>0x100, 0x101</td>
    </tr>
    <tr>
        <td>Exposure time</td>
        <td>1/51 s</td>
        <td>0x829a</td>
    </tr>
    <tr>
        <td>f-number</td>
        <td>f/2.9</td>
        <td>0x829d</td>
    </tr>
    <tr>
        <td>Shutter speed value</td>
        <td>5.69 EV</td>
        <td>0x9201</td>
    </tr>
    <tr>
        <td>Aperture value</td>
        <td>3.07 EV</td>
        <td>0x9202</td>
    </tr>
    <tr>
        <td>Brightness value</td>
        <td>2.67 EV</td>
        <td>0x9203</td>
    </tr>
    <tr>
        <td>Focal length</td>
        <td>3.6 mm</td>
        <td>0x920a</td>
    </tr>
    <tr>
        <td>ISO speed rating</td>
        <td>100</td>
        <td>0x8827</td>
    </tr>
</table>


### Hypothesis

Image EXIF tag values differ between *open* and *closed* images.

### Analysis

To extract EXIF data from our image files, we need the hexadecimal EXIF tag numbers of the desired attributes. For our examples, the required tags are listed in the table above.

The following code extracts EXIF data and provides two tables showing some extracted examples, one table for each class, *open* and *closed*.

In [16]:
import os
import pandas as pd
import numpy as np
from PIL import Image, ExifTags

IMG_DIR = "../data/interim"

def get_exif_data(filepath):
    img = Image.open(filepath)
    img_exif = img.getexif()

    if img_exif is None:
        return
    else:
        exif_data = {
            'exposure_time': img_exif.get(0x829a),
            'f_number': img_exif.get(0x829d),
            'shutter_speed': img_exif.get(0x9201),
            'arperture': img_exif.get(0x9202),
            'brightness': img_exif.get(0x9203),
            'focal_length': img_exif.get(0x920a),
            'iso_speed': img_exif.get(0x8827)
        }
        return exif_data
    
def get_exif_data_frame(path):
    # collect one dict per image; a plain list avoids copying an array on every append
    exifs = []
    for dirpath, dirnames, filenames in os.walk(path):
        for file in filenames:
            exifs.append(get_exif_data(os.path.sep.join((dirpath, file))))
    return pd.DataFrame(exifs)

open_exif_data = get_exif_data_frame(os.path.sep.join((IMG_DIR, "open")))
print("EXIF Data for 'open' Images")
display(open_exif_data)
closed_exif_data = get_exif_data_frame(os.path.sep.join((IMG_DIR, "closed")))
print("EXIF Data for 'closed' Images")
display(closed_exif_data)

EXIF Data for 'open' Images


Unnamed: 0,exposure_time,f_number,shutter_speed,arperture,brightness,focal_length,iso_speed
0,0.033787,2.8984,4.887388,3.0705,2.55,3.5976,160
1,0.12499,2.8984,3.000115,3.0705,0.1,3.5976,500
2,0.03889,2.8984,4.684457,3.0705,2.63,3.5976,160
3,0.12499,2.8984,3.000115,3.0705,0.14,3.5976,500
4,0.035802,2.8984,4.803816,3.0705,2.56,3.5976,160
...,...,...,...,...,...,...,...
1808,0.039215,2.8984,4.672451,3.0705,2.63,3.5976,160
1809,0.040872,2.8984,4.612743,3.0705,2.65,3.5976,160
1810,0.04292,2.8984,4.542206,3.0705,2.63,3.5976,160
1811,0.046853,2.8984,4.415714,3.0705,2.61,3.5976,160


EXIF Data for 'closed' Images


Unnamed: 0,exposure_time,f_number,shutter_speed,arperture,brightness,focal_length,iso_speed
0,0.12499,2.8984,3.000115,3.0705,0.76,3.5976,640
1,0.12499,2.8984,3.000115,3.0705,0.3,3.5976,640
2,0.12499,2.8984,3.000115,3.0705,0.8,3.5976,640
3,0.12499,2.8984,3.000115,3.0705,0.39,3.5976,640
4,0.12499,2.8984,3.000115,3.0705,0.77,3.5976,640
...,...,...,...,...,...,...,...
747,0.12499,2.8984,3.000115,3.0705,1.56,3.5976,640
748,0.12499,2.8984,3.000115,3.0705,1.55,3.5976,640
749,0.12499,2.8984,3.000115,3.0705,1.54,3.5976,640
750,0.12499,2.8984,3.000115,3.0705,1.52,3.5976,640
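
A side observation on the tables above: EXIF stores the shutter speed value and the aperture value in APEX units, which are defined as logarithms of the exposure time and the f-number. The sketch below checks this against a few rows copied from the tables; the function names are chosen for this illustration.

```python
import math

def shutter_speed_value(exposure_time_s):
    """APEX shutter speed value: Tv = -log2(exposure time in seconds)."""
    return -math.log2(exposure_time_s)

def aperture_value(f_number):
    """APEX aperture value: Av = 2 * log2(f-number)."""
    return 2 * math.log2(f_number)

# (exposure_time, shutter_speed) pairs copied from the tables above
for exposure_time, shutter_speed in [(0.12499, 3.000115), (0.033787, 4.887388)]:
    print("{:.6f} vs. {:.6f}".format(shutter_speed_value(exposure_time), shutter_speed))

# f_number and aperture columns from the tables above
print("{:.4f} vs. {:.4f}".format(aperture_value(2.8984), 3.0705))
```

If this relation holds for the whole dataset, **shutter_speed** carries the same information as **exposure_time**, which may be worth keeping in mind when selecting features.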


In [20]:
import plotly.express as px
import plotly.graph_objects as go
import dash_core_components as dcc
import dash_html_components as html
import pandas as pd

from jupyter_dash import JupyterDash
from dash.dependencies import Input, Output

IMG_ROOT_DIR = "../data/interim"

exif_tags = ['exposure_time',
             'f_number',
             'shutter_speed',
             'arperture',
             'brightness',
             'focal_length',
             'iso_speed']

# Build App
app = JupyterDash(__name__)
app.layout = html.Div([
    html.H4("EXIF Data of Images"),
    dcc.Graph(id='graph'),
    html.Label([
        "EXIF Tag",
        dcc.Dropdown(
            id = 'exif-tag-dropdown',
            clearable = False,
            value = 'exposure_time',
            options = [
                {'label': c, 'value': c}
                for c in exif_tags
            ])
    ]),
])

# Load Data
open_exif_data = get_exif_data_frame(os.path.sep.join((IMG_DIR, 'open')))
closed_exif_data = get_exif_data_frame(os.path.sep.join((IMG_DIR, 'closed')))

# Define callback to update graph
@app.callback(
    Output('graph', 'figure'),
    [Input("exif-tag-dropdown", "value")]
)
def update_figure(exif_tag):
    return get_figure(exif_tag)

def get_figure(exif_tag):
    # create histogram figure
    fig = go.Figure()
    # must specify dtype, otherwise dash-plotly will complain, about non-serializable JSON on figure object
    fig.add_trace(go.Histogram(
        x = open_exif_data.get(exif_tag).to_numpy(dtype="float"),
        name = "open"))
    fig.add_trace(go.Histogram(
        x = closed_exif_data.get(exif_tag).to_numpy(dtype="float"),
        name = "closed"))

    fig.update_layout(
        barmode='overlay', # Overlay both histograms
        title_text='Image EXIF Tag: {}'.format(exif_tag), # title of plot
        xaxis_title_text='{}'.format(exif_tag), # xaxis label
        yaxis_title_text='Count', # yaxis label
    )

    # Reduce opacity to see both histograms
    fig.update_traces(opacity=0.75)
    return fig

# Run app and display result inline in the notebook
## Workaround: set load_dotenv=False to prevent current working directory to be changed, which will break
## relative paths on multiple Jupyter-Notebook cell runs, e.g. IMG_DIR
## for more information, see https://github.com/pallets/flask/blob/master/src/flask/app.py #function run(...)
app.run_server(mode='inline', load_dotenv=False)
#app.run_server(mode='inline')

## export for offline use
#fig = get_figure('iso_speed')
#fig.show(renderer='svg')
#fig.write_image("exports/example-exif-data-iso-speed.svg")

### Results

The data looks promising. It seems that some of the EXIF data can actually be used as features. For the EXIF tags **exposure_time**, **shutter_speed**, **brightness** and **iso_speed**, both classes, *open* and *closed*, are well represented and may be distinguished using these features. The remaining EXIF tags, e.g. **f-number**, would not be helpful features, because their values stay the same regardless of the class.

**EXIF Tags Selected as Features for Image Classifier**
- exposure_time
- shutter_speed
- brightness
- iso_speed
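
The reasoning behind this selection can be sketched in code: a tag with the same value in every image cannot separate the classes, while a tag whose value ranges do not overlap between the classes separates them trivially. The example below uses only three rows per class, copied from the tables above, so its result says nothing about the full dataset. The helper names are hypothetical.

```python
# A few rows copied from the two output tables above (not the full dataset)
OPEN_ROWS = [
    {"exposure_time": 0.033787, "f_number": 2.8984, "brightness": 2.55, "iso_speed": 160},
    {"exposure_time": 0.12499, "f_number": 2.8984, "brightness": 0.10, "iso_speed": 500},
    {"exposure_time": 0.03889, "f_number": 2.8984, "brightness": 2.63, "iso_speed": 160},
]
CLOSED_ROWS = [
    {"exposure_time": 0.12499, "f_number": 2.8984, "brightness": 0.76, "iso_speed": 640},
    {"exposure_time": 0.12499, "f_number": 2.8984, "brightness": 0.30, "iso_speed": 640},
    {"exposure_time": 0.12499, "f_number": 2.8984, "brightness": 0.80, "iso_speed": 640},
]

def value_range(rows, feature):
    values = [row[feature] for row in rows]
    return min(values), max(values)

def constant_features(open_rows, closed_rows):
    """Features that take a single value across both classes."""
    return [
        feature for feature in open_rows[0]
        if len({row[feature] for row in open_rows + closed_rows}) == 1
    ]

def separated_features(open_rows, closed_rows):
    """Features whose value ranges do not overlap between the classes."""
    result = []
    for feature in open_rows[0]:
        open_min, open_max = value_range(open_rows, feature)
        closed_min, closed_max = value_range(closed_rows, feature)
        if open_max < closed_min or closed_max < open_min:
            result.append(feature)
    return result

print(constant_features(OPEN_ROWS, CLOSED_ROWS))   # ['f_number']
print(separated_features(OPEN_ROWS, CLOSED_ROWS))  # ['iso_speed']
```

On this tiny sample the ranges of **exposure_time** and **brightness** overlap, which hints that these tags may only separate the classes in combination with others.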

## Image File Size

### Hypothesis

Image file sizes differ between *open* and *closed* images.

### Analysis

Because the image classifier is not ready yet, we manually selected some *open* and *closed* images from the dataset and saved them under `../data/interim/{open,closed}`.

There are **1813 open** and **752 closed** images.

> Todo: Describe how the images have been selected, e.g. manually, which *days* were included and at which time of the year, because this may affect training through bias, e.g. image brightness varies over the year and so on.

In [21]:
import os
import numpy as np

IMG_DIR = "../data/interim"

# define and vectorize function to get absolute file paths and file sizes
def path_join(path, file):
    return os.path.sep.join((path, file))

VEC_PATH_JOIN = np.vectorize(path_join)

def get_size(path):
    return os.path.getsize(path)

VEC_GET_SIZE = np.vectorize(get_size)

def get_file_size_statistics(path):
    file_sizes = np.empty(0)
    for dirpath, dirnames, filenames in os.walk(path):
        # np.vectorize cannot infer an output type from empty input,
        # so skip directories that contain no files
        if not filenames:
            continue
        # use the vectorized helpers for convenience and to avoid a nested for loop
        files = VEC_PATH_JOIN(dirpath, filenames)
        file_sizes = np.append(file_sizes, VEC_GET_SIZE(files))
    return file_sizes, np.mean(file_sizes), np.std(file_sizes)
    
## main
_, open_fs_mean, open_fs_std = get_file_size_statistics(os.path.sep.join((IMG_DIR, "open")))
print("Open file size mean: {:.2f} KByte, std: {:.2f} KByte".format(open_fs_mean / 1024, open_fs_std / 1024))

_, closed_fs_mean, closed_fs_std = get_file_size_statistics(os.path.sep.join((IMG_DIR, "closed")))
print("Closed file size mean: {:.2f} KByte, std: {:.2f} KByte".format(closed_fs_mean / 1024, closed_fs_std / 1024))

Open file size mean: 2735.71 KByte, std: 116.69 KByte
Closed file size mean: 2617.09 KByte, std: 179.97 KByte


In [28]:
import plotly.express as px
import plotly.graph_objects as go
import dash_core_components as dcc
import dash_html_components as html

from jupyter_dash import JupyterDash
from dash.dependencies import Input, Output

# Build App
app = JupyterDash(__name__)

# load data
IMG_DIR = "../data/interim"
open_file_sizes, open_fs_mean, _ = get_file_size_statistics(os.path.sep.join((IMG_DIR, "open")))
closed_file_sizes, closed_fs_mean, _ = get_file_size_statistics(os.path.sep.join((IMG_DIR, "closed")))

# create histogram figure
fig = go.Figure()
fig.add_trace(go.Histogram(x = open_file_sizes, name = "open"))
fig.add_trace(go.Histogram(x = closed_file_sizes, name = "closed"))

# add mean values
## todo: annotate lines or remove them
fig.add_shape(
        go.layout.Shape(type='line', xref='x', yref='paper',
                        x0=open_fs_mean, y0=0, x1=open_fs_mean, y1=1, line={'dash': 'dash'})
)
fig.add_shape(
        go.layout.Shape(type='line', xref='x', yref='paper',
                        x0=closed_fs_mean, y0=0, x1=closed_fs_mean, y1=1, line={'dash': 'dash'})
)

fig.update_layout(
    barmode='overlay', # Overlay both histograms
    title_text='Image File Sizes', # title of plot
    xaxis_title_text='File Size in Byte', # xaxis label
    yaxis_title_text='Count', # yaxis label
)
    
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)

app.layout = html.Div([
    html.H4("Compare File Sizes of Open and Closed Images"),
    dcc.Graph(
        id='graph',
        figure=fig),
])

# Run app and display result inline in the notebook
## Workaround: set load_dotenv=False to prevent current working directory to be changed, which will break
## relative paths on multiple Jupyter-Notebook cell runs, e.g. IMG_DIR
## for more information, see https://github.com/pallets/flask/blob/master/src/flask/app.py #function run(...)
app.run_server(mode='inline', load_dotenv=False)
#app.run_server(mode='inline')

## export for offline use
#fig.show(renderer='svg')
#fig.write_image("exports/example-image-file-sizes.svg")

### Results

The two classes seem to have different file size distributions; however, they overlap to a large extent. The *closed* images have a relatively long left tail, meaning that they include many files that are considerably smaller than the *open* images. This might be because *closed* images compress better: large parts of them show only shades of grey, so their color variance is lower.

#### Kolmogorov-Smirnov Test

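We test whether the file sizes of both classes follow the same distribution. Let $F_X$ and $F_Y$ denote the cumulative distribution functions of the file sizes of *open* and *closed* images, respectively.

$H_0:F_X(x) = F_Y(x)$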

$H_1:F_X(x) \neq F_Y(x)$

In [8]:
from scipy import stats

ks_statistic, p_value = stats.ks_2samp(open_file_sizes, closed_file_sizes)
print("KS statistic: {:.3f}, p-value: {:.3f}".format(ks_statistic, p_value))

KS statistic: 0.319, p-value: 0.000


The p-value rounds to 0.000, so we can reject $H_0$: according to the Kolmogorov-Smirnov test, the two file size distributions differ.

However, given the large overlap in file sizes, this single feature is probably not sufficient to sort out *closed* images precisely. In combination with other features, the file size might still be useful.
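
For intuition about the test statistic used above: the two-sample KS statistic is the largest vertical distance between the two empirical cumulative distribution functions. The following sketch computes it with the standard library on made-up numbers. It illustrates the statistic only and does not reproduce the p-value calculation of `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right

def ks_statistic(xs, ys):
    """Largest distance between the empirical CDFs of two samples."""
    xs, ys = sorted(xs), sorted(ys)
    largest = 0.0
    # Both empirical CDFs only change at sample points, so checking those suffices
    for value in xs + ys:
        cdf_x = bisect_right(xs, value) / len(xs)
        cdf_y = bisect_right(ys, value) / len(ys)
        largest = max(largest, abs(cdf_x - cdf_y))
    return largest

print(ks_statistic([1, 2, 3], [1, 2, 3]))        # 0.0, identical samples
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5, partly overlapping samples
print(ks_statistic([1, 2, 3], [4, 5, 6]))        # 1.0, fully separated samples
```

Seen this way, the reported value of 0.319 means that the two empirical distributions are at most about 0.32 apart, which fits the large overlap visible in the histogram.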

## Image Data - Pixel Values in RGB

So what can we do with the raw image data? The images consist of pixels, and each pixel has three values in the range $[0,255]$, namely RGB, which stands for the colors *red*, *green* and *blue*. The simplest feature we can extract from images is the color distribution, using a color histogram.

What does the *color histogram* tell us? It describes how the color intensities are distributed, independent of the spatial location of the colors, of rotation or of other image transformations that only rearrange pixels. However, due to the simplicity of this method, we cannot detect high-level image features like *textures* or *objects*.
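
The independence from pixel positions can be demonstrated with a few lines of plain Python. The sketch below uses a hypothetical image of four pixels, given as a flat list of RGB tuples, rather than one of the real images.

```python
import random

def channel_histograms(pixels):
    """Count how often each intensity 0..255 occurs in each color channel.

    pixels is a flat list of (r, g, b) tuples.
    """
    hist = {"R": [0] * 256, "G": [0] * 256, "B": [0] * 256}
    for r, g, b in pixels:
        hist["R"][r] += 1
        hist["G"][g] += 1
        hist["B"][b] += 1
    return hist

# Hypothetical 2x2 image: two grey pixels, one green-ish and one blue-ish pixel
pixels = [(128, 128, 128), (128, 128, 128), (30, 200, 40), (20, 30, 220)]

# Rearranging the pixels changes the picture, but not the histograms
shuffled = pixels[:]
random.Random(0).shuffle(shuffled)

print(channel_histograms(pixels) == channel_histograms(shuffled))  # True
```

For the two grey pixels all three channels receive a count at the same intensity, which is the pattern the second hypothesis below expects for images of closed shutters.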

### Hypothesis

1. The color distributions are different for *open* and *closed* images.
1. In *closed* images, the RGB channels should be roughly equally distributed, because most pixels show the grey-ish closed shutters. The color distributions of *open* images should therefore vary much more.

### Analysis

In [27]:
import os
import numpy as np
import plotly.graph_objects as go
import dash_core_components as dcc
import dash_html_components as html

from PIL import Image
from jupyter_dash import JupyterDash
from itertools import chain

IMG_DIR = "../data/raw/example/color-histogram"

def show_histogram_plotly(img_path, filename):
    colors = {'R': 'FF0000', 'G': '00FF00', 'B': '0000FF'}
    image = Image.open(img_path)
    fig = go.Figure()
    for color in colors.keys():
        pixels = image.getchannel(color)
        ## this produces a 'divide by zero' warning which can be ignored
        ## because some color intensities have zero values
        # 257 bin edges give one bin per intensity 0..255
        counts, bin_edges = np.histogram(pixels, bins=range(0, 257, 1))
        fig.add_trace(go.Bar(x=bin_edges[:-1], y=np.log10(counts), name=color, marker=dict(color="#{}".format(colors.get(color)))))
    
    fig.update_layout(
        barmode='overlay', # Overlay both histograms
        bargap=0,
        #title_text='{}'.format(filename), # title of plot
        xaxis_title_text='Color intensity', # xaxis label
        yaxis_title_text='log_10(Count)', # yaxis label
    )
    fig.update_traces(
        opacity=0.75,
        marker_line_width=0
    )
    return fig

# return list of table data elements with original image file and color histogram
# e.g. for a two column table
def get_table_data(dirpath, file):
    return [
        html.Td([
            html.Figure([
                html.Img(src=app.get_asset_url(file), width = 300),
                html.Figcaption("File: {}".format(file))
            ])
        ]),
        html.Td([
            dcc.Graph(id='graph-{}'.format(file), figure=show_histogram_plotly(os.path.sep.join((dirpath, file)), file)) 
        ])
    ]
        
def get_table(image_directory):
    return html.Table([
        html.Thead([
            html.Tr([
                html.Th("Image"),
                html.Th("Color Histogram")
            ])
        ]),
        # create a table row <tr> for each image file
        # for each row create two table data elements <td>,<td> 
        html.Tbody(
            list(
                chain.from_iterable( # this flattens the [[list of lists]] to [list]
                    [ 
                        [ html.Tr( get_table_data(dirpath, file) ) for file in sorted(filenames) if file.endswith('.jpg') ]
                            for dirpath, _, filenames in os.walk(image_directory)
                    ]
                )
            )
        )        
    ], style={"width": 900})

# Build App
app = JupyterDash(__name__)

app.layout = html.Div([
    html.H4("Color Histogram Comparison"),
    get_table(IMG_DIR),
])


# Run app and display result inline in the notebook
## Workaround: set load_dotenv=False to prevent current working directory to be changed, which will break
## relative paths on multiple Jupyter-Notebook cell runs, e.g. IMG_DIR
## for more information, see https://github.com/pallets/flask/blob/master/src/flask/app.py #function run(...)
app.run_server(mode='inline', load_dotenv=False)
#app.run_server(mode='inline')

## export for offline use
#for dirpath, dirnames, filenames in os.walk(IMG_DIR):
#    for file in filenames:
#        fig = show_histogram_plotly(os.path.sep.join((dirpath, file)), file)
#        fig.show(renderer='svg')
#        fig.write_image("exports/example-color-histogram-{}.svg".format(file))


divide by zero encountered in log10



### Results

There are four example images: two show the *open* class and two the *closed* class. For each class I tried to choose a normal example and a more extreme one. For example, *open-1* is normal because there is nothing special about it, whereas *open-2* was captured early in the morning under difficult lighting conditions. In the *closed* class, *closed-1* is likewise a normal representative and *closed-2* is more extreme, because the shutters were completely closed and some light in the background seems to cause reflections in the image.

**Compare Examples 1 (normal)**

When we compare the color histograms of *open-1* and *closed-1*, it seems that under normal capture conditions the *open* class has more variation in its color distributions. The color intensities follow a mostly uniform distribution, meaning that all intensities are roughly equally likely. The share of *red* and *green* is higher in the *open* class, though. This is to be expected, because the picture shows more warm and green-ish colors than the image of the closed shutters.

The color distribution of the *closed* image contains more dark color intensities, which shows in the *right-skewed* distribution. It also contains more cold and blue-ish colors, as expected given the lack of colorful elements in the image.

This observation is consistent with our first hypothesis that the color distributions of the *closed* and *open* classes are different. At least for the first pair of examples, it should be possible to use the color distributions as features for classification.

**Compare Examples 2 (extreme)**

However, the color distributions vary significantly, not only between the classes *open* and *closed*, but also within the classes themselves. Compare the distributions of *open-1* and *open-2*: they are totally different. The mostly uniform color distribution over the entire spectrum is not present in *open-2*. This is because the image was taken early in the morning when it was still pretty dark, which shows in the fact that almost all color intensities are below 170. Furthermore, the red-green color shift is gone and blue is more prominent, as in the *closed-1* example.

Comparing the examples *closed-1* and *closed-2* shows a major difference in the distributions, too. The second example is darker than the first one, with color intensities mostly below 160. Furthermore, the blue shift we saw in *closed-1* has turned into a more red and green-ish shift because of the reflections. Seeing color shifts in both *closed-1* and *closed-2* implies that the second hypothesis does not hold; it assumed that the color channels of *closed* images would be equally distributed because of the grey-ish pixels.

**Color Distribution a Valid Feature?**

With so few examples seen, color distributions do not seem to be good features for the classification task, because they are not stable enough, even within a class. However, we have not compared them mathematically yet. For now this is only a weak intuition, and it should be examined further with more examples and more sophisticated methods than gut feeling. This may be done in another *notebook*.
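
As a pointer for that follow-up, one simple way to compare two histograms numerically is the histogram intersection of the normalized counts. The sketch below is only one possible measure, chosen here for illustration; it is not a method used in this notebook, and the toy histograms are made up.

```python
def normalize(hist):
    """Scale counts so that they sum to 1."""
    total = sum(hist)
    return [count / total for count in hist]

def histogram_intersection(hist_a, hist_b):
    """Overlap of two histograms: 1.0 for identical shapes, 0.0 for disjoint ones."""
    p, q = normalize(hist_a), normalize(hist_b)
    return sum(min(a, b) for a, b in zip(p, q))

# Toy histograms with four bins each
print(histogram_intersection([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
print(histogram_intersection([2, 2, 0, 0], [0, 2, 2, 0]))  # 0.5
print(histogram_intersection([1, 1, 0, 0], [0, 0, 1, 1]))  # 0.0
```

Applied per color channel to many *open* and *closed* images, such a score could show whether histograms within a class are more similar to each other than to those of the other class.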